Assignment 8-A

File Processing

Build a comprehensive Log Analysis & Data Processing System that reads, processes, and transforms data across multiple file formats. Master file operations, path handling, CSV processing, and JSON serialization.

4-6 hours
Intermediate
150 Points
Submit Assignment
What You'll Practice
  • Read and write text files
  • Use context managers properly
  • Navigate paths with pathlib
  • Process CSV data files
  • Parse and create JSON
Contents
01

Assignment Overview

In this assignment, you will build a Log Analysis & Data Processing System that demonstrates professional-level file handling skills. This project requires you to read log files, process CSV data, transform information between formats, and organize output using proper path handling.

File Formats: You'll work with .txt (log files), .csv (data files), and .json (configuration and output).
Skills Applied: This assignment tests your understanding of File Operations (8.1), Path Handling (8.2), and CSV & JSON (8.3) from Module 8.
File Operations (8.1)

Open, read, write files with context managers and file modes

Path Handling (8.2)

pathlib module, directory navigation, cross-platform paths

CSV & JSON (8.3)

Parse CSV files, serialize/deserialize JSON data

02

The Scenario

DataFlow Analytics Platform

You have been hired as a Data Engineer at DataFlow Analytics, a company that processes server logs and sales data for clients. Your manager has handed you the following brief:

"We need a tool that can ingest raw log files and CSV data, analyze them, and produce clean JSON reports. The system must handle different file encodings, use pathlib for cross-platform compatibility, and properly manage file resources. All output should be organized in a structured directory hierarchy."

Your Task

Create a Python file processing system that reads server logs, processes sales data from CSV, generates summary reports in JSON format, and organizes all files using proper path handling.

Sample Input Data
server_logs.txt (Log File)
2026-01-22 08:15:32 INFO  Server started on port 8080
2026-01-22 08:16:45 DEBUG Database connection established
2026-01-22 08:17:12 INFO  User login: user_123
2026-01-22 08:18:03 WARNING High memory usage: 85%
2026-01-22 08:19:55 ERROR  Failed to process request: timeout
2026-01-22 08:20:11 INFO  User logout: user_123
2026-01-22 08:21:30 CRITICAL Database connection lost
sales_data.csv (CSV File)
order_id,product_name,category,quantity,unit_price,date
1001,Laptop Pro,Electronics,2,1299.99,2026-01-15
1002,Wireless Mouse,Electronics,5,29.99,2026-01-15
1003,Office Chair,Furniture,1,449.00,2026-01-16
1004,USB-C Hub,Electronics,3,79.99,2026-01-16
1005,Standing Desk,Furniture,1,699.00,2026-01-17
config.json (Configuration)
{
    "input_directory": "data/input",
    "output_directory": "data/output",
    "log_levels": ["INFO", "WARNING", "ERROR", "CRITICAL"],
    "date_format": "%Y-%m-%d",
    "encoding": "utf-8"
}
03

Requirements

Your project must implement ALL of the following requirements. Each requirement is mandatory and will be tested individually.

1
Project Structure with pathlib (8.2)

Create a well-organized project structure using pathlib:

  • Use Path objects for all file operations
  • Create directories programmatically if they don't exist
  • Use relative paths for portability
  • Never use hardcoded path separators (\ or /)
from pathlib import Path

class FileProcessor:
    """File processor with proper path handling."""
    
    def __init__(self, base_dir: str = "."):
        self.base_path = Path(base_dir).resolve()
        self.input_dir = self.base_path / "data" / "input"
        self.output_dir = self.base_path / "data" / "output"
        self.logs_dir = self.output_dir / "logs"
        self.reports_dir = self.output_dir / "reports"
        
        # Create directory structure
        self._setup_directories()
    
    def _setup_directories(self):
        """Create required directories if they don't exist."""
        directories = [
            self.input_dir,
            self.output_dir,
            self.logs_dir,
            self.reports_dir
        ]
        for directory in directories:
            directory.mkdir(parents=True, exist_ok=True)
            print(f"✓ Directory ready: {directory}")
    
    def list_files(self, directory: Path, pattern: str = "*") -> list:
        """List files matching pattern in directory."""
        return list(directory.glob(pattern))
    
    def get_file_info(self, filepath: Path) -> dict:
        """Get file metadata using pathlib."""
        return {
            "name": filepath.name,
            "stem": filepath.stem,
            "suffix": filepath.suffix,
            "size_bytes": filepath.stat().st_size,
            "is_file": filepath.is_file(),
            "parent": str(filepath.parent)
        }
2
Configuration Management with JSON (8.3)

Load and save configuration using JSON:

  • Read configuration from config.json
  • Use json.load() with context manager
  • Handle missing config with defaults
  • Save updated configuration back to file
import json
from pathlib import Path
from typing import Dict, Any

class ConfigManager:
    """Manage application configuration via JSON."""
    
    DEFAULT_CONFIG = {
        "input_directory": "data/input",
        "output_directory": "data/output",
        "log_levels": ["INFO", "WARNING", "ERROR", "CRITICAL"],
        "date_format": "%Y-%m-%d",
        "encoding": "utf-8"
    }
    
    def __init__(self, config_path: Path):
        self.config_path = config_path
        self.config = self._load_config()
    
    def _load_config(self) -> Dict[str, Any]:
        """Load configuration from JSON file."""
        if not self.config_path.exists():
            print(f"Config not found, creating default: {self.config_path}")
            self._save_config(self.DEFAULT_CONFIG)
            return self.DEFAULT_CONFIG.copy()
        
        with open(self.config_path, 'r', encoding='utf-8') as f:
            config = json.load(f)
            print(f"✓ Loaded config from {self.config_path}")
            return config
    
    def _save_config(self, config: Dict[str, Any]):
        """Save configuration to JSON file."""
        with open(self.config_path, 'w', encoding='utf-8') as f:
            json.dump(config, f, indent=4)
    
    def get(self, key: str, default: Any = None) -> Any:
        """Get configuration value."""
        return self.config.get(key, default)
    
    def update(self, key: str, value: Any):
        """Update configuration and save."""
        self.config[key] = value
        self._save_config(self.config)
3
Log File Parser (8.1)

Read and parse server log files:

  • Use context manager (with statement) for all file operations
  • Read files line by line for memory efficiency
  • Parse log format: DATE TIME LEVEL MESSAGE
  • Handle different encodings gracefully
from datetime import datetime
from pathlib import Path
from typing import List, Dict
import re

class LogParser:
    """Parse server log files."""
    
    LOG_PATTERN = re.compile(
        r'(\d{4}-\d{2}-\d{2})\s+(\d{2}:\d{2}:\d{2})\s+'
        r'(DEBUG|INFO|WARNING|ERROR|CRITICAL)\s+(.+)'
    )
    
    def __init__(self, encoding: str = 'utf-8'):
        self.encoding = encoding
        self.entries = []
    
    def parse_file(self, filepath: Path) -> List[Dict]:
        """Parse a log file and return list of log entries."""
        entries = []
        
        with open(filepath, 'r', encoding=self.encoding) as f:
            for line_num, line in enumerate(f, 1):
                line = line.strip()
                if not line:
                    continue
                
                entry = self._parse_line(line, line_num)
                if entry:
                    entries.append(entry)
        
        self.entries = entries
        print(f"✓ Parsed {len(entries)} log entries from {filepath.name}")
        return entries
    
    def _parse_line(self, line: str, line_num: int) -> Dict:
        """Parse a single log line."""
        match = self.LOG_PATTERN.match(line)
        if not match:
            return None
        
        date_str, time_str, level, message = match.groups()
        
        return {
            "line_number": line_num,
            "date": date_str,
            "time": time_str,
            "datetime": f"{date_str} {time_str}",
            "level": level,
            "message": message.strip()
        }
    
    def filter_by_level(self, levels: List[str]) -> List[Dict]:
        """Filter entries by log level."""
        return [e for e in self.entries if e["level"] in levels]
    
    def get_statistics(self) -> Dict:
        """Get log statistics."""
        stats = {"total": len(self.entries)}
        for level in ["DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"]:
            stats[level.lower()] = sum(
                1 for e in self.entries if e["level"] == level
            )
        return stats
4
CSV Data Processor (8.3)

Read and process sales data from CSV:

  • Use csv.DictReader for named column access
  • Handle different delimiters and quote characters
  • Convert data types (strings to numbers, dates)
  • Calculate aggregations (totals, averages)
import csv
from pathlib import Path
from typing import List, Dict
from datetime import datetime
from collections import defaultdict

class SalesDataProcessor:
    """Process sales data from CSV files."""
    
    def __init__(self, encoding: str = 'utf-8'):
        self.encoding = encoding
        self.data = []
    
    def load_csv(self, filepath: Path) -> List[Dict]:
        """Load and parse CSV file."""
        with open(filepath, 'r', encoding=self.encoding, newline='') as f:
            reader = csv.DictReader(f)
            self.data = []
            
            for row in reader:
                processed_row = self._process_row(row)
                self.data.append(processed_row)
        
        print(f"✓ Loaded {len(self.data)} records from {filepath.name}")
        return self.data
    
    def _process_row(self, row: Dict) -> Dict:
        """Process and convert data types for a row."""
        quantity = int(row["quantity"])
        unit_price = float(row["unit_price"])
        return {
            "order_id": int(row["order_id"]),
            "product_name": row["product_name"],
            "category": row["category"],
            "quantity": quantity,
            "unit_price": unit_price,
            "date": row["date"],
            "total_price": quantity * unit_price
        }
    
    def get_summary_by_category(self) -> Dict:
        """Aggregate sales by category."""
        summary = defaultdict(lambda: {
            "total_revenue": 0,
            "total_quantity": 0,
            "order_count": 0
        })
        
        for item in self.data:
            cat = item["category"]
            summary[cat]["total_revenue"] += item["total_price"]
            summary[cat]["total_quantity"] += item["quantity"]
            summary[cat]["order_count"] += 1
        
        return dict(summary)
    
    def get_daily_sales(self) -> Dict:
        """Aggregate sales by date."""
        daily = defaultdict(float)
        for item in self.data:
            daily[item["date"]] += item["total_price"]
        return dict(daily)
5
CSV Writer (8.3)

Write processed data to new CSV files:

  • Use csv.DictWriter for writing
  • Write header row with column names
  • Handle special characters properly
  • Create summary reports in CSV format
import csv
from pathlib import Path
from typing import Dict, List, Optional

class CSVWriter:
    """Write data to CSV files."""
    
    def __init__(self, encoding: str = 'utf-8'):
        self.encoding = encoding
    
    def write_csv(self, filepath: Path, data: List[Dict], 
                  fieldnames: Optional[List[str]] = None):
        """Write list of dictionaries to CSV file."""
        if not data:
            print(f"⚠ No data to write to {filepath}")
            return
        
        # Use keys from first item if fieldnames not provided
        if fieldnames is None:
            fieldnames = list(data[0].keys())
        
        with open(filepath, 'w', encoding=self.encoding, 
                  newline='') as f:
            writer = csv.DictWriter(f, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(data)
        
        print(f"✓ Wrote {len(data)} records to {filepath.name}")
    
    def write_summary_report(self, filepath: Path, 
                             summary: Dict[str, Dict]):
        """Write category summary to CSV."""
        rows = []
        for category, stats in summary.items():
            rows.append({
                "category": category,
                "total_revenue": f"{stats['total_revenue']:.2f}",
                "total_quantity": stats["total_quantity"],
                "order_count": stats["order_count"],
                "avg_order_value": f"{stats['total_revenue']/stats['order_count']:.2f}"
            })
        
        self.write_csv(filepath, rows)
6
JSON Report Generator (8.3)

Generate comprehensive JSON reports:

  • Combine data from multiple sources
  • Use proper JSON formatting with indentation
  • Handle datetime serialization
  • Create nested report structures
import json
from pathlib import Path
from datetime import datetime
from typing import Dict, Any

class ReportGenerator:
    """Generate JSON reports from processed data."""
    
    def __init__(self, output_dir: Path):
        self.output_dir = output_dir
    
    def generate_full_report(self, log_stats: Dict, 
                             sales_summary: Dict,
                             daily_sales: Dict) -> Dict:
        """Generate comprehensive analysis report."""
        report = {
            "report_metadata": {
                "generated_at": datetime.now().isoformat(),
                "report_type": "daily_analysis",
                "version": "1.0"
            },
            "log_analysis": {
                "summary": log_stats,
                "health_status": self._determine_health(log_stats)
            },
            "sales_analysis": {
                "by_category": sales_summary,
                "by_date": daily_sales,
                "total_revenue": sum(daily_sales.values())
            }
        }
        return report
    
    def _determine_health(self, log_stats: Dict) -> str:
        """Determine system health based on log stats."""
        if log_stats.get("critical", 0) > 0:
            return "CRITICAL"
        elif log_stats.get("error", 0) > 5:
            return "UNHEALTHY"
        elif log_stats.get("warning", 0) > 10:
            return "WARNING"
        return "HEALTHY"
    
    def save_report(self, report: Dict, filename: str):
        """Save report to JSON file."""
        filepath = self.output_dir / filename
        
        with open(filepath, 'w', encoding='utf-8') as f:
            json.dump(report, f, indent=4, default=str)
        
        print(f"✓ Report saved to {filepath}")
        return filepath
7
Text File Writer (8.1)

Write human-readable text reports:

  • Use different file modes (w, a)
  • Format output with proper alignment
  • Write section headers and separators
  • Handle line endings properly
from pathlib import Path
from datetime import datetime
from typing import Dict

class TextReportWriter:
    """Write formatted text reports."""
    
    def __init__(self, output_dir: Path):
        self.output_dir = output_dir
    
    def write_summary_report(self, filepath: Path, 
                             log_stats: Dict, 
                             sales_summary: Dict):
        """Write a formatted text summary report."""
        with open(filepath, 'w', encoding='utf-8') as f:
            # Header
            f.write("=" * 60 + "\n")
            f.write("       DATAFLOW ANALYTICS - DAILY SUMMARY REPORT\n")
            f.write("=" * 60 + "\n")
            f.write(f"Generated: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}\n")
            f.write("\n")
            
            # Log Analysis Section
            f.write("-" * 60 + "\n")
            f.write("LOG ANALYSIS\n")
            f.write("-" * 60 + "\n")
            f.write(f"{'Total Entries:':<20} {log_stats['total']:>10}\n")
            f.write(f"{'Info:':<20} {log_stats.get('info', 0):>10}\n")
            f.write(f"{'Warnings:':<20} {log_stats.get('warning', 0):>10}\n")
            f.write(f"{'Errors:':<20} {log_stats.get('error', 0):>10}\n")
            f.write(f"{'Critical:':<20} {log_stats.get('critical', 0):>10}\n")
            f.write("\n")
            
            # Sales Analysis Section
            f.write("-" * 60 + "\n")
            f.write("SALES BY CATEGORY\n")
            f.write("-" * 60 + "\n")
            f.write(f"{'Category':<20} {'Revenue':>15} {'Orders':>10}\n")
            f.write("-" * 45 + "\n")
            
            total_revenue = 0
            for category, stats in sales_summary.items():
                revenue = stats['total_revenue']
                orders = stats['order_count']
                f.write(f"{category:<20} ${revenue:>14,.2f} {orders:>10}\n")
                total_revenue += revenue
            
            f.write("-" * 45 + "\n")
            f.write(f"{'TOTAL':<20} ${total_revenue:>14,.2f}\n")
            f.write("\n")
            f.write("=" * 60 + "\n")
        
        print(f"✓ Text report written to {filepath}")
8
File Search and Discovery (8.2)

Search for files using glob patterns:

  • Use glob() and rglob() for recursive search
  • Filter files by extension
  • Get file statistics (size, modified time)
  • Handle missing files gracefully
from pathlib import Path
from typing import List, Dict
from datetime import datetime

class FileDiscovery:
    """Discover and catalog files in directories."""
    
    def __init__(self, base_path: Path):
        self.base_path = base_path
    
    def find_files(self, pattern: str, recursive: bool = False) -> List[Path]:
        """Find files matching pattern."""
        if recursive:
            return list(self.base_path.rglob(pattern))
        return list(self.base_path.glob(pattern))
    
    def find_by_extension(self, extension: str) -> List[Path]:
        """Find all files with given extension."""
        if not extension.startswith('.'):
            extension = f'.{extension}'
        return list(self.base_path.rglob(f'*{extension}'))
    
    def get_directory_summary(self) -> Dict:
        """Get summary of directory contents."""
        files = list(self.base_path.rglob('*'))
        
        summary = {
            "total_files": 0,
            "total_dirs": 0,
            "total_size_bytes": 0,
            "by_extension": {}
        }
        
        for path in files:
            if path.is_file():
                summary["total_files"] += 1
                summary["total_size_bytes"] += path.stat().st_size
                ext = path.suffix or "no_extension"
                summary["by_extension"][ext] = \
                    summary["by_extension"].get(ext, 0) + 1
            elif path.is_dir():
                summary["total_dirs"] += 1
        
        return summary
    
    def get_recent_files(self, hours: int = 24) -> List[Dict]:
        """Get files modified within specified hours."""
        cutoff = datetime.now().timestamp() - (hours * 3600)
        recent = []
        
        for path in self.base_path.rglob('*'):
            if path.is_file() and path.stat().st_mtime > cutoff:
                recent.append({
                    "path": str(path),
                    "name": path.name,
                    "modified": datetime.fromtimestamp(
                        path.stat().st_mtime
                    ).isoformat()
                })
        
        return recent
9
Encoding Handler (8.1)

Handle different file encodings:

  • Detect file encoding when possible
  • Handle UTF-8, Latin-1, and other encodings
  • Gracefully handle encoding errors
  • Convert between encodings
from pathlib import Path
from typing import Optional

class EncodingHandler:
    """Handle file encoding detection and conversion."""
    
    COMMON_ENCODINGS = ['utf-8', 'latin-1', 'cp1252', 'ascii']
    
    def read_with_fallback(self, filepath: Path, 
                           encodings: Optional[list] = None) -> tuple:
        """
        Try to read file with multiple encodings.
        Returns (content, encoding_used).
        """
        encodings = encodings or self.COMMON_ENCODINGS
        
        for encoding in encodings:
            try:
                with open(filepath, 'r', encoding=encoding) as f:
                    content = f.read()
                return content, encoding
            except UnicodeDecodeError:
                continue
        
        # Last resort: read with error handling
        with open(filepath, 'r', encoding='utf-8', 
                  errors='replace') as f:
            content = f.read()
        return content, 'utf-8 (with replacements)'
    
    def convert_encoding(self, input_path: Path, 
                         output_path: Path,
                         target_encoding: str = 'utf-8'):
        """Convert file to different encoding."""
        content, original_encoding = self.read_with_fallback(input_path)
        
        with open(output_path, 'w', encoding=target_encoding) as f:
            f.write(content)
        
        print(f"✓ Converted {input_path.name}: "
              f"{original_encoding} → {target_encoding}")
10
Main Application (Integration)

Create a main.py that ties everything together:

  • Initialize all components
  • Process sample data files
  • Generate all report types
  • Display summary to console
from pathlib import Path

from file_processor import FileProcessor
from config_manager import ConfigManager
from log_parser import LogParser
from sales_processor import SalesDataProcessor
from csv_writer import CSVWriter
from report_generator import ReportGenerator
from text_writer import TextReportWriter

def main():
    """Main entry point for file processing system."""
    print("=" * 60)
    print("   DATAFLOW ANALYTICS - FILE PROCESSING SYSTEM")
    print("=" * 60 + "\n")
    
    # Initialize processor with base directory
    base_dir = Path(__file__).parent
    processor = FileProcessor(base_dir)
    
    # Load configuration
    config_path = base_dir / "config.json"
    config = ConfigManager(config_path)
    
    # Parse log files
    print("\n📋 Processing Log Files...")
    log_parser = LogParser(encoding=config.get("encoding"))
    log_files = processor.list_files(processor.input_dir, "*.txt")
    
    all_log_entries = []
    for log_file in log_files:
        entries = log_parser.parse_file(log_file)
        all_log_entries.extend(entries)
    
    log_stats = log_parser.get_statistics()
    
    # Process CSV sales data
    print("\n📊 Processing Sales Data...")
    sales_processor = SalesDataProcessor(encoding=config.get("encoding"))
    csv_files = processor.list_files(processor.input_dir, "*.csv")
    
    for csv_file in csv_files:
        sales_processor.load_csv(csv_file)
    
    sales_summary = sales_processor.get_summary_by_category()
    daily_sales = sales_processor.get_daily_sales()
    
    # Generate reports
    print("\n📝 Generating Reports...")
    report_gen = ReportGenerator(processor.reports_dir)
    full_report = report_gen.generate_full_report(
        log_stats, sales_summary, daily_sales
    )
    report_gen.save_report(full_report, "analysis_report.json")
    
    # Write text report
    text_writer = TextReportWriter(processor.reports_dir)
    text_writer.write_summary_report(
        processor.reports_dir / "summary_report.txt",
        log_stats, sales_summary
    )
    
    # Write CSV summary
    csv_writer = CSVWriter()
    csv_writer.write_summary_report(
        processor.reports_dir / "category_summary.csv",
        sales_summary
    )
    
    print("\n" + "=" * 60)
    print("✅ Processing Complete!")
    print(f"📁 Reports saved to: {processor.reports_dir}")
    print("=" * 60)

if __name__ == "__main__":
    main()
04

Submission

Create a public GitHub repository with the exact name shown below:

Required Repository Name
python-file-processor
github.com/<your-username>/python-file-processor
Required Files
python-file-processor/
├── file_processor.py         # FileProcessor class with path handling
├── config_manager.py         # ConfigManager for JSON config
├── log_parser.py             # LogParser for text files
├── sales_processor.py        # SalesDataProcessor for CSV
├── csv_writer.py             # CSVWriter class
├── report_generator.py       # JSON report generator
├── text_writer.py            # Text report writer
├── file_discovery.py         # File search utilities
├── encoding_handler.py       # Encoding utilities
├── main.py                   # Main application
├── config.json               # Configuration file
├── data/
│   ├── input/
│   │   ├── server_logs.txt   # Sample log file
│   │   └── sales_data.csv    # Sample CSV data
│   └── output/
│       ├── logs/
│       └── reports/
│           ├── analysis_report.json
│           ├── summary_report.txt
│           └── category_summary.csv
├── output.txt                # Console output from main.py
└── README.md                 # Documentation
README.md Must Include:
  • Your full name and submission date
  • Project structure diagram
  • Explanation of path handling strategy
  • Sample input/output examples
  • Instructions to run the application
Do Include
  • All 10 requirements implemented
  • Context managers for all file operations
  • pathlib for all path handling
  • Sample input data files
  • Generated output files
  • Proper error handling
Do Not Include
  • Hardcoded path separators (\ or /)
  • Files opened without context managers
  • os.path instead of pathlib
  • Unhandled file exceptions
  • Missing encoding parameters
  • Empty output directories
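The two checklists above boil down to a pattern contrast. A minimal sketch of the "do" column: pathlib joining, a context manager, and an explicit encoding (the file name and written text are invented for illustration):

```python
import tempfile
from pathlib import Path

# Do: build paths with the / operator; pathlib picks the right
# separator for the operating system
report_path = Path("data") / "output" / "reports" / "summary.txt"
print(report_path.as_posix())  # data/output/reports/summary.txt

# Don't: hardcode separators like "data\\output\\reports\\summary.txt"

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / report_path.name
    # Do: context manager plus explicit encoding; the file is closed
    # even if the write raises an exception
    with open(target, "w", encoding="utf-8") as f:
        f.write("summary\n")
    written = target.read_text(encoding="utf-8")
```

Opening the file without `with` (e.g. `f = open(...)` followed by a manual `f.close()`) leaks the handle whenever an exception fires between the two calls, which is exactly what the "don't" column forbids.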
Important: Make sure all generated output files are included in your repository to show the system works correctly!
Submit Your Assignment

Enter your GitHub username - we'll verify your repository automatically

05

Grading Rubric

Your assignment will be graded on the following criteria:

Criteria                  Points  Description
Path Handling (8.2)           30  pathlib usage, directory creation, cross-platform paths
File Operations (8.1)         30  Context managers, read/write, encoding handling
CSV Processing (8.3)          25  DictReader/DictWriter, data conversion, aggregations
JSON Handling (8.3)           25  Config management, report generation, proper formatting
Integration & Output          25  Working main.py, sample data, generated reports
Code Quality                  15  Docstrings, type hints, README documentation
Total                        150

Ready to Submit?

Make sure all files are present and reports are generated.

06

What You Will Practice

File Operations (8.1)

Context managers, file modes, reading/writing, encoding handling

Path Handling (8.2)

pathlib module, glob patterns, directory navigation, cross-platform paths

CSV Processing (8.3)

DictReader/DictWriter, data conversion, aggregations, report generation

JSON Handling (8.3)

Configuration files, serialization, report formatting, nested structures