Assignment Overview
In this assignment, you will build a Log Analysis & Data Processing System that demonstrates professional-level file handling skills. This project requires you to read log files, process CSV data, transform information between formats, and organize output using proper path handling.
File Operations (8.1)
Open, read, write files with context managers and file modes
Path Handling (8.2)
pathlib module, directory navigation, cross-platform paths
CSV & JSON (8.3)
Parse CSV files, serialize/deserialize JSON data
The Scenario
DataFlow Analytics Platform
You have been hired as a Data Engineer at DataFlow Analytics, a company that processes server logs and sales data for clients. Your manager has given you the following brief:
"We need a tool that can ingest raw log files and CSV data, analyze them, and produce clean JSON reports. The system must handle different file encodings, use pathlib for cross-platform compatibility, and properly manage file resources. All output should be organized in a structured directory hierarchy."
Your Task
Create a Python file processing system that reads server logs, processes sales data from CSV, generates summary reports in JSON format, and organizes all files using proper path handling.
Sample Input Data
server_logs.txt (Log File)
```
2026-01-22 08:15:32 INFO Server started on port 8080
2026-01-22 08:16:45 DEBUG Database connection established
2026-01-22 08:17:12 INFO User login: user_123
2026-01-22 08:18:03 WARNING High memory usage: 85%
2026-01-22 08:19:55 ERROR Failed to process request: timeout
2026-01-22 08:20:11 INFO User logout: user_123
2026-01-22 08:21:30 CRITICAL Database connection lost
```
sales_data.csv (CSV File)
```csv
order_id,product_name,category,quantity,unit_price,date
1001,Laptop Pro,Electronics,2,1299.99,2026-01-15
1002,Wireless Mouse,Electronics,5,29.99,2026-01-15
1003,Office Chair,Furniture,1,449.00,2026-01-16
1004,USB-C Hub,Electronics,3,79.99,2026-01-16
1005,Standing Desk,Furniture,1,699.00,2026-01-17
```
config.json (Configuration)
```json
{
    "input_directory": "data/input",
    "output_directory": "data/output",
    "log_levels": ["INFO", "WARNING", "ERROR", "CRITICAL"],
    "date_format": "%Y-%m-%d",
    "encoding": "utf-8"
}
```
Requirements
Your project must implement ALL of the following requirements. Each requirement is mandatory and will be tested individually.
Project Structure with pathlib (8.2)
Create a well-organized project structure using pathlib:
- Use `Path` objects for all file operations
- Create directories programmatically if they don't exist
- Use relative paths for portability
- Never use hardcoded path separators (`\` or `/`)
```python
from pathlib import Path

class FileProcessor:
    """File processor with proper path handling."""

    def __init__(self, base_dir: str = "."):
        self.base_path = Path(base_dir).resolve()
        self.input_dir = self.base_path / "data" / "input"
        self.output_dir = self.base_path / "data" / "output"
        self.logs_dir = self.output_dir / "logs"
        self.reports_dir = self.output_dir / "reports"

        # Create directory structure
        self._setup_directories()

    def _setup_directories(self):
        """Create required directories if they don't exist."""
        directories = [
            self.input_dir,
            self.output_dir,
            self.logs_dir,
            self.reports_dir
        ]
        for directory in directories:
            directory.mkdir(parents=True, exist_ok=True)
            print(f"✓ Directory ready: {directory}")

    def list_files(self, directory: Path, pattern: str = "*") -> list:
        """List files matching pattern in directory."""
        return list(directory.glob(pattern))

    def get_file_info(self, filepath: Path) -> dict:
        """Get file metadata using pathlib."""
        return {
            "name": filepath.name,
            "stem": filepath.stem,
            "suffix": filepath.suffix,
            "size_bytes": filepath.stat().st_size,
            "is_file": filepath.is_file(),
            "parent": str(filepath.parent)
        }
```
Configuration Management with JSON (8.3)
Load and save configuration using JSON:
- Read configuration from `config.json`
- Use `json.load()` with a context manager
- Handle missing config with defaults
- Save updated configuration back to file
```python
import json
from pathlib import Path
from typing import Dict, Any

class ConfigManager:
    """Manage application configuration via JSON."""

    DEFAULT_CONFIG = {
        "input_directory": "data/input",
        "output_directory": "data/output",
        "log_levels": ["INFO", "WARNING", "ERROR", "CRITICAL"],
        "date_format": "%Y-%m-%d",
        "encoding": "utf-8"
    }

    def __init__(self, config_path: Path):
        self.config_path = config_path
        self.config = self._load_config()

    def _load_config(self) -> Dict[str, Any]:
        """Load configuration from JSON file."""
        if not self.config_path.exists():
            print(f"Config not found, creating default: {self.config_path}")
            self._save_config(self.DEFAULT_CONFIG)
            return self.DEFAULT_CONFIG.copy()
        with open(self.config_path, 'r', encoding='utf-8') as f:
            config = json.load(f)
        print(f"✓ Loaded config from {self.config_path}")
        return config

    def _save_config(self, config: Dict[str, Any]):
        """Save configuration to JSON file."""
        with open(self.config_path, 'w', encoding='utf-8') as f:
            json.dump(config, f, indent=4)

    def get(self, key: str, default: Any = None) -> Any:
        """Get configuration value."""
        return self.config.get(key, default)

    def update(self, key: str, value: Any):
        """Update configuration and save."""
        self.config[key] = value
        self._save_config(self.config)
```
Log File Parser (8.1)
Read and parse server log files:
- Use a context manager (`with` statement) for all file operations
- Read files line by line for memory efficiency
- Parse the log format: `DATE TIME LEVEL MESSAGE`
- Handle different encodings gracefully
```python
from pathlib import Path
from typing import List, Dict, Optional
import re

class LogParser:
    """Parse server log files."""

    LOG_PATTERN = re.compile(
        r'(\d{4}-\d{2}-\d{2})\s+(\d{2}:\d{2}:\d{2})\s+'
        r'(DEBUG|INFO|WARNING|ERROR|CRITICAL)\s+(.+)'
    )

    def __init__(self, encoding: str = 'utf-8'):
        self.encoding = encoding
        self.entries = []

    def parse_file(self, filepath: Path) -> List[Dict]:
        """Parse a log file and return list of log entries."""
        entries = []
        with open(filepath, 'r', encoding=self.encoding) as f:
            for line_num, line in enumerate(f, 1):
                line = line.strip()
                if not line:
                    continue
                entry = self._parse_line(line, line_num)
                if entry:
                    entries.append(entry)
        self.entries = entries
        print(f"✓ Parsed {len(entries)} log entries from {filepath.name}")
        return entries

    def _parse_line(self, line: str, line_num: int) -> Optional[Dict]:
        """Parse a single log line; return None for lines that don't match."""
        match = self.LOG_PATTERN.match(line)
        if not match:
            return None
        date_str, time_str, level, message = match.groups()
        return {
            "line_number": line_num,
            "date": date_str,
            "time": time_str,
            "datetime": f"{date_str} {time_str}",
            "level": level,
            "message": message.strip()
        }

    def filter_by_level(self, levels: List[str]) -> List[Dict]:
        """Filter entries by log level."""
        return [e for e in self.entries if e["level"] in levels]

    def get_statistics(self) -> Dict:
        """Get log statistics."""
        stats = {"total": len(self.entries)}
        for level in ["DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"]:
            stats[level.lower()] = sum(
                1 for e in self.entries if e["level"] == level
            )
        return stats
```
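The regex can be sanity-checked standalone against one of the sample log lines above, before wiring it into the parser:

```python
import re

# Same pattern the LogParser uses
LOG_PATTERN = re.compile(
    r'(\d{4}-\d{2}-\d{2})\s+(\d{2}:\d{2}:\d{2})\s+'
    r'(DEBUG|INFO|WARNING|ERROR|CRITICAL)\s+(.+)'
)

line = "2026-01-22 08:19:55 ERROR Failed to process request: timeout"
match = LOG_PATTERN.match(line)
date_str, time_str, level, message = match.groups()
print(level, message)  # ERROR Failed to process request: timeout
```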
CSV Data Processor (8.3)
Read and process sales data from CSV:
- Use `csv.DictReader` for named column access
- Handle different delimiters and quote characters
- Convert data types (strings to numbers, dates)
- Calculate aggregations (totals, averages)
```python
import csv
from pathlib import Path
from typing import List, Dict
from collections import defaultdict

class SalesDataProcessor:
    """Process sales data from CSV files."""

    def __init__(self, encoding: str = 'utf-8'):
        self.encoding = encoding
        self.data = []

    def load_csv(self, filepath: Path) -> List[Dict]:
        """Load and parse CSV file."""
        with open(filepath, 'r', encoding=self.encoding, newline='') as f:
            reader = csv.DictReader(f)
            self.data = []
            for row in reader:
                processed_row = self._process_row(row)
                self.data.append(processed_row)
        print(f"✓ Loaded {len(self.data)} records from {filepath.name}")
        return self.data

    def _process_row(self, row: Dict) -> Dict:
        """Process and convert data types for a row."""
        quantity = int(row["quantity"])
        unit_price = float(row["unit_price"])
        return {
            "order_id": int(row["order_id"]),
            "product_name": row["product_name"],
            "category": row["category"],
            "quantity": quantity,
            "unit_price": unit_price,
            "date": row["date"],
            "total_price": quantity * unit_price
        }

    def get_summary_by_category(self) -> Dict:
        """Aggregate sales by category."""
        summary = defaultdict(lambda: {
            "total_revenue": 0,
            "total_quantity": 0,
            "order_count": 0
        })
        for item in self.data:
            cat = item["category"]
            summary[cat]["total_revenue"] += item["total_price"]
            summary[cat]["total_quantity"] += item["quantity"]
            summary[cat]["order_count"] += 1
        return dict(summary)

    def get_daily_sales(self) -> Dict:
        """Aggregate sales by date."""
        daily = defaultdict(float)
        for item in self.data:
            daily[item["date"]] += item["total_price"]
        return dict(daily)
```
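The "different delimiters and quote characters" bullet isn't exercised by the sample data, which uses the default comma dialect. A minimal sketch of what it could look like, using made-up semicolon-delimited data for illustration:

```python
import csv
import io

# Hypothetical semicolon-delimited export with a quoted field
# that contains the delimiter itself
raw = 'order_id;product_name\n1001;"Laptop; Pro Edition"\n'

reader = csv.DictReader(io.StringIO(raw), delimiter=';', quotechar='"')
rows = list(reader)
print(rows[0]["product_name"])  # the quoted field keeps its embedded semicolon
```

In `load_csv`, the same keyword arguments could be passed through to `csv.DictReader` when a client file uses a non-default dialect.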
CSV Writer (8.3)
Write processed data to new CSV files:
- Use `csv.DictWriter` for writing
- Write a header row with column names
- Handle special characters properly
- Create summary reports in CSV format
```python
import csv
from pathlib import Path
from typing import List, Dict, Optional

class CSVWriter:
    """Write data to CSV files."""

    def __init__(self, encoding: str = 'utf-8'):
        self.encoding = encoding

    def write_csv(self, filepath: Path, data: List[Dict],
                  fieldnames: Optional[List[str]] = None):
        """Write list of dictionaries to CSV file."""
        if not data:
            print(f"⚠ No data to write to {filepath}")
            return
        # Use keys from first item if fieldnames not provided
        if fieldnames is None:
            fieldnames = list(data[0].keys())
        with open(filepath, 'w', encoding=self.encoding, newline='') as f:
            writer = csv.DictWriter(f, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(data)
        print(f"✓ Wrote {len(data)} records to {filepath.name}")

    def write_summary_report(self, filepath: Path, summary: Dict[str, Dict]):
        """Write category summary to CSV."""
        rows = []
        for category, stats in summary.items():
            rows.append({
                "category": category,
                "total_revenue": f"{stats['total_revenue']:.2f}",
                "total_quantity": stats["total_quantity"],
                "order_count": stats["order_count"],
                "avg_order_value": f"{stats['total_revenue'] / stats['order_count']:.2f}"
            })
        self.write_csv(filepath, rows)
```
JSON Report Generator (8.3)
Generate comprehensive JSON reports:
- Combine data from multiple sources
- Use proper JSON formatting with indentation
- Handle datetime serialization
- Create nested report structures
```python
import json
from pathlib import Path
from datetime import datetime
from typing import Dict

class ReportGenerator:
    """Generate JSON reports from processed data."""

    def __init__(self, output_dir: Path):
        self.output_dir = output_dir

    def generate_full_report(self, log_stats: Dict,
                             sales_summary: Dict,
                             daily_sales: Dict) -> Dict:
        """Generate comprehensive analysis report."""
        report = {
            "report_metadata": {
                "generated_at": datetime.now().isoformat(),
                "report_type": "daily_analysis",
                "version": "1.0"
            },
            "log_analysis": {
                "summary": log_stats,
                "health_status": self._determine_health(log_stats)
            },
            "sales_analysis": {
                "by_category": sales_summary,
                "by_date": daily_sales,
                "total_revenue": sum(daily_sales.values())
            }
        }
        return report

    def _determine_health(self, log_stats: Dict) -> str:
        """Determine system health based on log stats."""
        if log_stats.get("critical", 0) > 0:
            return "CRITICAL"
        elif log_stats.get("error", 0) > 5:
            return "UNHEALTHY"
        elif log_stats.get("warning", 0) > 10:
            return "WARNING"
        return "HEALTHY"

    def save_report(self, report: Dict, filename: str) -> Path:
        """Save report to JSON file."""
        filepath = self.output_dir / filename
        with open(filepath, 'w', encoding='utf-8') as f:
            json.dump(report, f, indent=4, default=str)
        print(f"✓ Report saved to {filepath}")
        return filepath
```
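`save_report` uses `default=str`, which quietly stringifies anything `json` can't serialize. If you want datetime handling to be explicit while still failing loudly on unexpected types, one hedged alternative (not part of the required API) is a dedicated default hook:

```python
import json
from datetime import datetime

def json_default(obj):
    """Serialize datetimes as ISO-8601 strings; reject anything else."""
    if isinstance(obj, datetime):
        return obj.isoformat()
    raise TypeError(f"Not JSON serializable: {type(obj).__name__}")

report = {"generated_at": datetime(2026, 1, 22, 8, 15, 32), "total_entries": 7}
text = json.dumps(report, indent=4, default=json_default)
print(text)
```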
Text File Writer (8.1)
Write human-readable text reports:
- Use different file modes (`w`, `a`)
- Format output with proper alignment
- Write section headers and separators
- Handle line endings properly
```python
from pathlib import Path
from datetime import datetime
from typing import Dict

class TextReportWriter:
    """Write formatted text reports."""

    def __init__(self, output_dir: Path):
        self.output_dir = output_dir

    def write_summary_report(self, filepath: Path,
                             log_stats: Dict,
                             sales_summary: Dict):
        """Write a formatted text summary report."""
        with open(filepath, 'w', encoding='utf-8') as f:
            # Header
            f.write("=" * 60 + "\n")
            f.write(" DATAFLOW ANALYTICS - DAILY SUMMARY REPORT\n")
            f.write("=" * 60 + "\n")
            f.write(f"Generated: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}\n")
            f.write("\n")

            # Log Analysis Section
            f.write("-" * 60 + "\n")
            f.write("LOG ANALYSIS\n")
            f.write("-" * 60 + "\n")
            f.write(f"{'Total Entries:':<20} {log_stats['total']:>10}\n")
            f.write(f"{'Info:':<20} {log_stats.get('info', 0):>10}\n")
            f.write(f"{'Warnings:':<20} {log_stats.get('warning', 0):>10}\n")
            f.write(f"{'Errors:':<20} {log_stats.get('error', 0):>10}\n")
            f.write(f"{'Critical:':<20} {log_stats.get('critical', 0):>10}\n")
            f.write("\n")

            # Sales Analysis Section
            f.write("-" * 60 + "\n")
            f.write("SALES BY CATEGORY\n")
            f.write("-" * 60 + "\n")
            f.write(f"{'Category':<20} {'Revenue':>15} {'Orders':>10}\n")
            f.write("-" * 45 + "\n")
            total_revenue = 0
            for category, stats in sales_summary.items():
                revenue = stats['total_revenue']
                orders = stats['order_count']
                f.write(f"{category:<20} ${revenue:>14,.2f} {orders:>10}\n")
                total_revenue += revenue
            f.write("-" * 45 + "\n")
            f.write(f"{'TOTAL':<20} ${total_revenue:>14,.2f}\n")
            f.write("\n")
            f.write("=" * 60 + "\n")
        print(f"✓ Text report written to {filepath}")
```
File Search and Discovery (8.2)
Search for files using glob patterns:
- Use `glob()` and `rglob()` for recursive search
- Filter files by extension
- Get file statistics (size, modified time)
- Handle missing files gracefully
```python
from pathlib import Path
from typing import List, Dict
from datetime import datetime

class FileDiscovery:
    """Discover and catalog files in directories."""

    def __init__(self, base_path: Path):
        self.base_path = base_path

    def find_files(self, pattern: str, recursive: bool = False) -> List[Path]:
        """Find files matching pattern."""
        if recursive:
            return list(self.base_path.rglob(pattern))
        return list(self.base_path.glob(pattern))

    def find_by_extension(self, extension: str) -> List[Path]:
        """Find all files with given extension."""
        if not extension.startswith('.'):
            extension = f'.{extension}'
        return list(self.base_path.rglob(f'*{extension}'))

    def get_directory_summary(self) -> Dict:
        """Get summary of directory contents."""
        files = list(self.base_path.rglob('*'))
        summary = {
            "total_files": 0,
            "total_dirs": 0,
            "total_size_bytes": 0,
            "by_extension": {}
        }
        for path in files:
            if path.is_file():
                summary["total_files"] += 1
                summary["total_size_bytes"] += path.stat().st_size
                ext = path.suffix or "no_extension"
                summary["by_extension"][ext] = \
                    summary["by_extension"].get(ext, 0) + 1
            elif path.is_dir():
                summary["total_dirs"] += 1
        return summary

    def get_recent_files(self, hours: int = 24) -> List[Dict]:
        """Get files modified within specified hours."""
        cutoff = datetime.now().timestamp() - (hours * 3600)
        recent = []
        for path in self.base_path.rglob('*'):
            if path.is_file() and path.stat().st_mtime > cutoff:
                recent.append({
                    "path": str(path),
                    "name": path.name,
                    "modified": datetime.fromtimestamp(
                        path.stat().st_mtime
                    ).isoformat()
                })
        return recent
```
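The "handle missing files gracefully" bullet has no example above. One simple pattern is to check `is_file()` before reading and return a sentinel instead of raising; `safe_read_text` here is an illustrative helper name, not something the assignment requires:

```python
from pathlib import Path
from typing import Optional

def safe_read_text(filepath: Path, encoding: str = "utf-8") -> Optional[str]:
    """Return file contents, or None if the file is missing or unreadable."""
    if not filepath.is_file():
        print(f"⚠ File not found: {filepath}")
        return None
    try:
        return filepath.read_text(encoding=encoding)
    except OSError as exc:
        print(f"⚠ Could not read {filepath}: {exc}")
        return None

content = safe_read_text(Path("no_such_file.txt"))
print(content)  # None, with a warning printed instead of a traceback
```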
Encoding Handler (8.1)
Handle different file encodings:
- Detect file encoding when possible
- Handle UTF-8, Latin-1, and other encodings
- Gracefully handle encoding errors
- Convert between encodings
```python
from pathlib import Path
from typing import Optional, Tuple

class EncodingHandler:
    """Handle file encoding detection and conversion."""

    # Note: latin-1 can decode ANY byte sequence, so encodings listed
    # after it will never be tried; it effectively acts as a catch-all.
    COMMON_ENCODINGS = ['utf-8', 'latin-1', 'cp1252', 'ascii']

    def read_with_fallback(self, filepath: Path,
                           encodings: Optional[list] = None) -> Tuple[str, str]:
        """
        Try to read file with multiple encodings.
        Returns (content, encoding_used).
        """
        encodings = encodings or self.COMMON_ENCODINGS
        for encoding in encodings:
            try:
                with open(filepath, 'r', encoding=encoding) as f:
                    content = f.read()
                return content, encoding
            except UnicodeDecodeError:
                continue
        # Last resort: read with error handling
        with open(filepath, 'r', encoding='utf-8', errors='replace') as f:
            content = f.read()
        return content, 'utf-8 (with replacements)'

    def convert_encoding(self, input_path: Path,
                         output_path: Path,
                         target_encoding: str = 'utf-8'):
        """Convert file to different encoding."""
        content, original_encoding = self.read_with_fallback(input_path)
        with open(output_path, 'w', encoding=target_encoding) as f:
            f.write(content)
        print(f"✓ Converted {input_path.name}: "
              f"{original_encoding} → {target_encoding}")
```
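To see why the fallback order matters, here is a tiny standalone demonstration with a byte that is invalid UTF-8 but valid Latin-1 (the byte string is made up for illustration):

```python
# 0xE9 is "é" in Latin-1, but an incomplete multi-byte sequence in UTF-8
data = b"caf\xe9"

try:
    data.decode("utf-8")
    decoded_by = "utf-8"
except UnicodeDecodeError:
    # fall back, the same way read_with_fallback does
    decoded_by = "latin-1"

text = data.decode(decoded_by)
print(text, decoded_by)  # café latin-1
```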
Main Application (Integration)
Create a main.py that ties everything together:
- Initialize all components
- Process sample data files
- Generate all report types
- Display summary to console
```python
from pathlib import Path

# These modules correspond to the required files listed under "Submission"
from file_processor import FileProcessor
from config_manager import ConfigManager
from log_parser import LogParser
from sales_processor import SalesDataProcessor
from csv_writer import CSVWriter
from report_generator import ReportGenerator
from text_writer import TextReportWriter


def main():
    """Main entry point for file processing system."""
    print("=" * 60)
    print(" DATAFLOW ANALYTICS - FILE PROCESSING SYSTEM")
    print("=" * 60 + "\n")

    # Initialize processor with base directory
    base_dir = Path(__file__).parent
    processor = FileProcessor(base_dir)

    # Load configuration
    config_path = base_dir / "config.json"
    config = ConfigManager(config_path)

    # Parse log files
    print("\n📋 Processing Log Files...")
    log_parser = LogParser(encoding=config.get("encoding"))
    log_files = processor.list_files(processor.input_dir, "*.txt")

    all_log_entries = []
    for log_file in log_files:
        entries = log_parser.parse_file(log_file)
        all_log_entries.extend(entries)
    # Note: get_statistics() covers the most recently parsed file;
    # with the single sample log this equals all_log_entries.
    log_stats = log_parser.get_statistics()

    # Process CSV sales data
    print("\n📊 Processing Sales Data...")
    sales_processor = SalesDataProcessor(encoding=config.get("encoding"))
    csv_files = processor.list_files(processor.input_dir, "*.csv")
    for csv_file in csv_files:
        sales_processor.load_csv(csv_file)
    sales_summary = sales_processor.get_summary_by_category()
    daily_sales = sales_processor.get_daily_sales()

    # Generate reports
    print("\n📝 Generating Reports...")
    report_gen = ReportGenerator(processor.reports_dir)
    full_report = report_gen.generate_full_report(
        log_stats, sales_summary, daily_sales
    )
    report_gen.save_report(full_report, "analysis_report.json")

    # Write text report
    text_writer = TextReportWriter(processor.reports_dir)
    text_writer.write_summary_report(
        processor.reports_dir / "summary_report.txt",
        log_stats, sales_summary
    )

    # Write CSV summary
    csv_writer = CSVWriter()
    csv_writer.write_summary_report(
        processor.reports_dir / "category_summary.csv",
        sales_summary
    )

    print("\n" + "=" * 60)
    print("✅ Processing Complete!")
    print(f"📁 Reports saved to: {processor.reports_dir}")
    print("=" * 60)


if __name__ == "__main__":
    main()
```
Submission
Create a public GitHub repository with the exact name shown below:
Required Repository Name
python-file-processor
Required Files
```
python-file-processor/
├── file_processor.py      # FileProcessor class with path handling
├── config_manager.py      # ConfigManager for JSON config
├── log_parser.py          # LogParser for text files
├── sales_processor.py     # SalesDataProcessor for CSV
├── csv_writer.py          # CSVWriter class
├── report_generator.py    # JSON report generator
├── text_writer.py         # Text report writer
├── file_discovery.py      # File search utilities
├── encoding_handler.py    # Encoding utilities
├── main.py                # Main application
├── config.json            # Configuration file
├── data/
│   ├── input/
│   │   ├── server_logs.txt    # Sample log file
│   │   └── sales_data.csv     # Sample CSV data
│   └── output/
│       ├── logs/
│       └── reports/
│           ├── analysis_report.json
│           ├── summary_report.txt
│           └── category_summary.csv
├── output.txt             # Console output from main.py
└── README.md              # Documentation
```
README.md Must Include:
- Your full name and submission date
- Project structure diagram
- Explanation of path handling strategy
- Sample input/output examples
- Instructions to run the application
Do Include
- All 10 requirements implemented
- Context managers for all file operations
- pathlib for all path handling
- Sample input data files
- Generated output files
- Proper error handling
Do Not Include
- Hardcoded path separators (`\` or `/`)
- Files opened without context managers
- `os.path` instead of pathlib
- Unhandled file exceptions
- Missing encoding parameters
- Empty output directories
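Since `os.path` is off-limits, here is a quick translation sketch from common `os.path` idioms to their pathlib equivalents (the file names are illustrative):

```python
from pathlib import Path

# instead of os.path.join("data", "input", "server_logs.txt"):
log_file = Path("data") / "input" / "server_logs.txt"

print(log_file.name)        # "server_logs.txt" (replaces os.path.basename)
print(log_file.stem)        # "server_logs" (name without extension)
print(log_file.suffix)      # ".txt" (replaces os.path.splitext)
print(log_file.parent)      # the containing directory (replaces os.path.dirname)
print(log_file.as_posix())  # "data/input/server_logs.txt" on every platform
```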
Enter your GitHub username - we'll verify your repository automatically
Grading Rubric
Your assignment will be graded on the following criteria:
| Criteria | Points | Description |
|---|---|---|
| Path Handling (8.2) | 30 | pathlib usage, directory creation, cross-platform paths |
| File Operations (8.1) | 30 | Context managers, read/write, encoding handling |
| CSV Processing (8.3) | 25 | DictReader/DictWriter, data conversion, aggregations |
| JSON Handling (8.3) | 25 | Config management, report generation, proper formatting |
| Integration & Output | 25 | Working main.py, sample data, generated reports |
| Code Quality | 15 | Docstrings, type hints, README documentation |
| Total | 150 | |
What You Will Practice
File Operations (8.1)
Context managers, file modes, reading/writing, encoding handling
Path Handling (8.2)
pathlib module, glob patterns, directory navigation, cross-platform paths
CSV Processing (8.3)
DictReader/DictWriter, data conversion, aggregations, report generation
JSON Handling (8.3)
Configuration files, serialization, report formatting, nested structures