Assignment Overview
In this assignment, you will build a Log Analysis & Data Processing System that demonstrates professional-level file handling skills. This project requires you to read log files, process CSV data, transform information between formats, and organize output using proper path handling.
File Operations (8.1)
Open, read, write files with context managers and file modes
Path Handling (8.2)
pathlib module, directory navigation, cross-platform paths
CSV & JSON (8.3)
Parse CSV files, serialize/deserialize JSON data
The Scenario
DataFlow Analytics Platform
You have been hired as a Data Engineer at DataFlow Analytics, a company that processes server logs and sales data for clients. Your manager has given you the following brief:
"We need a tool that can ingest raw log files and CSV data, analyze them, and produce clean JSON reports. The system must handle different file encodings, use pathlib for cross-platform compatibility, and properly manage file resources. All output should be organized in a structured directory hierarchy."
Your Task
Create a Python file processing system that reads server logs, processes sales data from CSV, generates summary reports in JSON format, and organizes all files using proper path handling.
Sample Input Data
server_logs.txt (Log File)
```
2026-01-22 08:15:32 INFO Server started on port 8080
2026-01-22 08:16:45 DEBUG Database connection established
2026-01-22 08:17:12 INFO User login: user_123
2026-01-22 08:18:03 WARNING High memory usage: 85%
2026-01-22 08:19:55 ERROR Failed to process request: timeout
2026-01-22 08:20:11 INFO User logout: user_123
2026-01-22 08:21:30 CRITICAL Database connection lost
```
sales_data.csv (CSV File)
```csv
order_id,product_name,category,quantity,unit_price,date
1001,Laptop Pro,Electronics,2,1299.99,2026-01-15
1002,Wireless Mouse,Electronics,5,29.99,2026-01-15
1003,Office Chair,Furniture,1,449.00,2026-01-16
1004,USB-C Hub,Electronics,3,79.99,2026-01-16
1005,Standing Desk,Furniture,1,699.00,2026-01-17
```
config.json (Configuration)
```json
{
    "input_directory": "data/input",
    "output_directory": "data/output",
    "log_levels": ["INFO", "WARNING", "ERROR", "CRITICAL"],
    "date_format": "%Y-%m-%d",
    "encoding": "utf-8"
}
```
Requirements
Your project must implement ALL of the following requirements. Each requirement is mandatory and will be tested individually.
Project Structure with pathlib (8.2)
Create a well-organized project structure using pathlib:
- Use `Path` objects for all file operations
- Create directories programmatically if they don't exist
- Use relative paths for portability
- Never use hardcoded path separators (`\` or `/`)
```python
from pathlib import Path

class FileProcessor:
    """File processor with proper path handling."""

    def __init__(self, base_dir: str = "."):
        self.base_path = Path(base_dir).resolve()
        self.input_dir = self.base_path / "data" / "input"
        self.output_dir = self.base_path / "data" / "output"
        self.logs_dir = self.output_dir / "logs"
        self.reports_dir = self.output_dir / "reports"

        # Create directory structure
        self._setup_directories()

    def _setup_directories(self):
        """Create required directories if they don't exist."""
        directories = [
            self.input_dir,
            self.output_dir,
            self.logs_dir,
            self.reports_dir
        ]
        for directory in directories:
            directory.mkdir(parents=True, exist_ok=True)
            print(f"✓ Directory ready: {directory}")

    def list_files(self, directory: Path, pattern: str = "*") -> list:
        """List files matching pattern in directory."""
        return list(directory.glob(pattern))

    def get_file_info(self, filepath: Path) -> dict:
        """Get file metadata using pathlib."""
        return {
            "name": filepath.name,
            "stem": filepath.stem,
            "suffix": filepath.suffix,
            "size_bytes": filepath.stat().st_size,
            "is_file": filepath.is_file(),
            "parent": str(filepath.parent)
        }
```
Configuration Management with JSON (8.3)
Load and save configuration using JSON:
- Read configuration from `config.json`
- Use `json.load()` with a context manager
- Handle missing config with defaults
- Save updated configuration back to file
```python
import json
from pathlib import Path
from typing import Dict, Any

class ConfigManager:
    """Manage application configuration via JSON."""

    DEFAULT_CONFIG = {
        "input_directory": "data/input",
        "output_directory": "data/output",
        "log_levels": ["INFO", "WARNING", "ERROR", "CRITICAL"],
        "date_format": "%Y-%m-%d",
        "encoding": "utf-8"
    }

    def __init__(self, config_path: Path):
        self.config_path = config_path
        self.config = self._load_config()

    def _load_config(self) -> Dict[str, Any]:
        """Load configuration from JSON file."""
        if not self.config_path.exists():
            print(f"Config not found, creating default: {self.config_path}")
            self._save_config(self.DEFAULT_CONFIG)
            return self.DEFAULT_CONFIG.copy()
        with open(self.config_path, 'r', encoding='utf-8') as f:
            config = json.load(f)
        print(f"✓ Loaded config from {self.config_path}")
        return config

    def _save_config(self, config: Dict[str, Any]):
        """Save configuration to JSON file."""
        with open(self.config_path, 'w', encoding='utf-8') as f:
            json.dump(config, f, indent=4)

    def get(self, key: str, default: Any = None) -> Any:
        """Get configuration value."""
        return self.config.get(key, default)

    def update(self, key: str, value: Any):
        """Update configuration and save."""
        self.config[key] = value
        self._save_config(self.config)
```
Log File Parser (8.1)
Read and parse server log files:
- Use a context manager (`with` statement) for all file operations
- Read files line by line for memory efficiency
- Parse the log format: `DATE TIME LEVEL MESSAGE`
- Handle different encodings gracefully
```python
from pathlib import Path
from typing import List, Dict, Optional
import re

class LogParser:
    """Parse server log files."""

    LOG_PATTERN = re.compile(
        r'(\d{4}-\d{2}-\d{2})\s+(\d{2}:\d{2}:\d{2})\s+'
        r'(DEBUG|INFO|WARNING|ERROR|CRITICAL)\s+(.+)'
    )

    def __init__(self, encoding: str = 'utf-8'):
        self.encoding = encoding
        self.entries = []

    def parse_file(self, filepath: Path) -> List[Dict]:
        """Parse a log file and return list of log entries."""
        entries = []
        with open(filepath, 'r', encoding=self.encoding) as f:
            for line_num, line in enumerate(f, 1):
                line = line.strip()
                if not line:
                    continue
                entry = self._parse_line(line, line_num)
                if entry:
                    entries.append(entry)
        self.entries = entries
        print(f"✓ Parsed {len(entries)} log entries from {filepath.name}")
        return entries

    def _parse_line(self, line: str, line_num: int) -> Optional[Dict]:
        """Parse a single log line; return None for lines that don't match."""
        match = self.LOG_PATTERN.match(line)
        if not match:
            return None
        date_str, time_str, level, message = match.groups()
        return {
            "line_number": line_num,
            "date": date_str,
            "time": time_str,
            "datetime": f"{date_str} {time_str}",
            "level": level,
            "message": message.strip()
        }

    def filter_by_level(self, levels: List[str]) -> List[Dict]:
        """Filter entries by log level."""
        return [e for e in self.entries if e["level"] in levels]

    def get_statistics(self) -> Dict:
        """Get log statistics."""
        stats = {"total": len(self.entries)}
        for level in ["DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"]:
            stats[level.lower()] = sum(
                1 for e in self.entries if e["level"] == level
            )
        return stats
```
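The regex can be sanity-checked standalone against one of the sample log lines above, before wiring it into the parser:

```python
import re

# Same pattern the LogParser uses
LOG_PATTERN = re.compile(
    r'(\d{4}-\d{2}-\d{2})\s+(\d{2}:\d{2}:\d{2})\s+'
    r'(DEBUG|INFO|WARNING|ERROR|CRITICAL)\s+(.+)'
)

line = "2026-01-22 08:19:55 ERROR Failed to process request: timeout"
match = LOG_PATTERN.match(line)
date_str, time_str, level, message = match.groups()
print(level, message)  # ERROR Failed to process request: timeout
```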
CSV Data Processor (8.3)
Read and process sales data from CSV:
- Use `csv.DictReader` for named column access
- Handle different delimiters and quote characters
- Convert data types (strings to numbers, dates)
- Calculate aggregations (totals, averages)
```python
import csv
from pathlib import Path
from typing import List, Dict
from collections import defaultdict

class SalesDataProcessor:
    """Process sales data from CSV files."""

    def __init__(self, encoding: str = 'utf-8'):
        self.encoding = encoding
        self.data = []

    def load_csv(self, filepath: Path) -> List[Dict]:
        """Load and parse CSV file."""
        with open(filepath, 'r', encoding=self.encoding, newline='') as f:
            reader = csv.DictReader(f)
            self.data = []
            for row in reader:
                processed_row = self._process_row(row)
                self.data.append(processed_row)
        print(f"✓ Loaded {len(self.data)} records from {filepath.name}")
        return self.data

    def _process_row(self, row: Dict) -> Dict:
        """Process and convert data types for a row."""
        quantity = int(row["quantity"])
        unit_price = float(row["unit_price"])
        return {
            "order_id": int(row["order_id"]),
            "product_name": row["product_name"],
            "category": row["category"],
            "quantity": quantity,
            "unit_price": unit_price,
            "date": row["date"],
            "total_price": quantity * unit_price
        }

    def get_summary_by_category(self) -> Dict:
        """Aggregate sales by category."""
        summary = defaultdict(lambda: {
            "total_revenue": 0,
            "total_quantity": 0,
            "order_count": 0
        })
        for item in self.data:
            cat = item["category"]
            summary[cat]["total_revenue"] += item["total_price"]
            summary[cat]["total_quantity"] += item["quantity"]
            summary[cat]["order_count"] += 1
        return dict(summary)

    def get_daily_sales(self) -> Dict:
        """Aggregate sales by date."""
        daily = defaultdict(float)
        for item in self.data:
            daily[item["date"]] += item["total_price"]
        return dict(daily)
```
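The "different delimiters and quote characters" bullet isn't exercised by the sample data, which uses the default comma dialect. A minimal sketch of what it could look like, using made-up semicolon-delimited data for illustration:

```python
import csv
import io

# Hypothetical semicolon-delimited export with a quoted field
# that contains the delimiter itself
raw = 'order_id;product_name\n1001;"Laptop; Pro Edition"\n'

reader = csv.DictReader(io.StringIO(raw), delimiter=';', quotechar='"')
rows = list(reader)
print(rows[0]["product_name"])  # the quoted field keeps its embedded semicolon
```

In `load_csv`, the same keyword arguments could be passed through to `csv.DictReader` when a client file uses a non-default dialect.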
CSV Writer (8.3)
Write processed data to new CSV files:
- Use `csv.DictWriter` for writing
- Write a header row with column names
- Handle special characters properly
- Create summary reports in CSV format
```python
import csv
from pathlib import Path
from typing import List, Dict, Optional

class CSVWriter:
    """Write data to CSV files."""

    def __init__(self, encoding: str = 'utf-8'):
        self.encoding = encoding

    def write_csv(self, filepath: Path, data: List[Dict],
                  fieldnames: Optional[List[str]] = None):
        """Write list of dictionaries to CSV file."""
        if not data:
            print(f"⚠ No data to write to {filepath}")
            return
        # Use keys from first item if fieldnames not provided
        if fieldnames is None:
            fieldnames = list(data[0].keys())
        with open(filepath, 'w', encoding=self.encoding, newline='') as f:
            writer = csv.DictWriter(f, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(data)
        print(f"✓ Wrote {len(data)} records to {filepath.name}")

    def write_summary_report(self, filepath: Path, summary: Dict[str, Dict]):
        """Write category summary to CSV."""
        rows = []
        for category, stats in summary.items():
            rows.append({
                "category": category,
                "total_revenue": f"{stats['total_revenue']:.2f}",
                "total_quantity": stats["total_quantity"],
                "order_count": stats["order_count"],
                "avg_order_value": f"{stats['total_revenue'] / stats['order_count']:.2f}"
            })
        self.write_csv(filepath, rows)
```
JSON Report Generator (8.3)
Generate comprehensive JSON reports:
- Combine data from multiple sources
- Use proper JSON formatting with indentation
- Handle datetime serialization
- Create nested report structures
```python
import json
from pathlib import Path
from datetime import datetime
from typing import Dict

class ReportGenerator:
    """Generate JSON reports from processed data."""

    def __init__(self, output_dir: Path):
        self.output_dir = output_dir

    def generate_full_report(self, log_stats: Dict,
                             sales_summary: Dict,
                             daily_sales: Dict) -> Dict:
        """Generate comprehensive analysis report."""
        report = {
            "report_metadata": {
                "generated_at": datetime.now().isoformat(),
                "report_type": "daily_analysis",
                "version": "1.0"
            },
            "log_analysis": {
                "summary": log_stats,
                "health_status": self._determine_health(log_stats)
            },
            "sales_analysis": {
                "by_category": sales_summary,
                "by_date": daily_sales,
                "total_revenue": sum(daily_sales.values())
            }
        }
        return report

    def _determine_health(self, log_stats: Dict) -> str:
        """Determine system health based on log stats."""
        if log_stats.get("critical", 0) > 0:
            return "CRITICAL"
        elif log_stats.get("error", 0) > 5:
            return "UNHEALTHY"
        elif log_stats.get("warning", 0) > 10:
            return "WARNING"
        return "HEALTHY"

    def save_report(self, report: Dict, filename: str) -> Path:
        """Save report to JSON file."""
        filepath = self.output_dir / filename
        with open(filepath, 'w', encoding='utf-8') as f:
            json.dump(report, f, indent=4, default=str)
        print(f"✓ Report saved to {filepath}")
        return filepath
```
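`save_report` uses `default=str`, which quietly stringifies anything `json` can't serialize. If you want datetime handling to be explicit while still failing loudly on unexpected types, one hedged alternative (not part of the required API) is a dedicated default hook:

```python
import json
from datetime import datetime

def json_default(obj):
    """Serialize datetimes as ISO-8601 strings; reject anything else."""
    if isinstance(obj, datetime):
        return obj.isoformat()
    raise TypeError(f"Not JSON serializable: {type(obj).__name__}")

report = {"generated_at": datetime(2026, 1, 22, 8, 15, 32), "total_entries": 7}
text = json.dumps(report, indent=4, default=json_default)
print(text)
```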
Text File Writer (8.1)
Write human-readable text reports:
- Use different file modes (`w`, `a`)
- Format output with proper alignment
- Write section headers and separators
- Handle line endings properly
```python
from pathlib import Path
from datetime import datetime
from typing import Dict

class TextReportWriter:
    """Write formatted text reports."""

    def __init__(self, output_dir: Path):
        self.output_dir = output_dir

    def write_summary_report(self, filepath: Path,
                             log_stats: Dict,
                             sales_summary: Dict):
        """Write a formatted text summary report."""
        with open(filepath, 'w', encoding='utf-8') as f:
            # Header
            f.write("=" * 60 + "\n")
            f.write(" DATAFLOW ANALYTICS - DAILY SUMMARY REPORT\n")
            f.write("=" * 60 + "\n")
            f.write(f"Generated: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}\n")
            f.write("\n")

            # Log Analysis Section
            f.write("-" * 60 + "\n")
            f.write("LOG ANALYSIS\n")
            f.write("-" * 60 + "\n")
            f.write(f"{'Total Entries:':<20} {log_stats['total']:>10}\n")
            f.write(f"{'Info:':<20} {log_stats.get('info', 0):>10}\n")
            f.write(f"{'Warnings:':<20} {log_stats.get('warning', 0):>10}\n")
            f.write(f"{'Errors:':<20} {log_stats.get('error', 0):>10}\n")
            f.write(f"{'Critical:':<20} {log_stats.get('critical', 0):>10}\n")
            f.write("\n")

            # Sales Analysis Section
            f.write("-" * 60 + "\n")
            f.write("SALES BY CATEGORY\n")
            f.write("-" * 60 + "\n")
            f.write(f"{'Category':<20} {'Revenue':>15} {'Orders':>10}\n")
            f.write("-" * 45 + "\n")
            total_revenue = 0
            for category, stats in sales_summary.items():
                revenue = stats['total_revenue']
                orders = stats['order_count']
                f.write(f"{category:<20} ${revenue:>14,.2f} {orders:>10}\n")
                total_revenue += revenue
            f.write("-" * 45 + "\n")
            f.write(f"{'TOTAL':<20} ${total_revenue:>14,.2f}\n")
            f.write("\n")
            f.write("=" * 60 + "\n")
        print(f"✓ Text report written to {filepath}")
```
File Search and Discovery (8.2)
Search for files using glob patterns:
- Use `glob()` and `rglob()` for recursive search
- Filter files by extension
- Get file statistics (size, modified time)
- Handle missing files gracefully
```python
from pathlib import Path
from typing import List, Dict
from datetime import datetime

class FileDiscovery:
    """Discover and catalog files in directories."""

    def __init__(self, base_path: Path):
        self.base_path = base_path

    def find_files(self, pattern: str, recursive: bool = False) -> List[Path]:
        """Find files matching pattern."""
        if recursive:
            return list(self.base_path.rglob(pattern))
        return list(self.base_path.glob(pattern))

    def find_by_extension(self, extension: str) -> List[Path]:
        """Find all files with given extension."""
        if not extension.startswith('.'):
            extension = f'.{extension}'
        return list(self.base_path.rglob(f'*{extension}'))

    def get_directory_summary(self) -> Dict:
        """Get summary of directory contents."""
        files = list(self.base_path.rglob('*'))
        summary = {
            "total_files": 0,
            "total_dirs": 0,
            "total_size_bytes": 0,
            "by_extension": {}
        }
        for path in files:
            if path.is_file():
                summary["total_files"] += 1
                summary["total_size_bytes"] += path.stat().st_size
                ext = path.suffix or "no_extension"
                summary["by_extension"][ext] = \
                    summary["by_extension"].get(ext, 0) + 1
            elif path.is_dir():
                summary["total_dirs"] += 1
        return summary

    def get_recent_files(self, hours: int = 24) -> List[Dict]:
        """Get files modified within specified hours."""
        cutoff = datetime.now().timestamp() - (hours * 3600)
        recent = []
        for path in self.base_path.rglob('*'):
            if path.is_file() and path.stat().st_mtime > cutoff:
                recent.append({
                    "path": str(path),
                    "name": path.name,
                    "modified": datetime.fromtimestamp(
                        path.stat().st_mtime
                    ).isoformat()
                })
        return recent
```
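The "handle missing files gracefully" bullet has no example above. One simple pattern is to check `is_file()` before reading and return a sentinel instead of raising; `safe_read_text` here is an illustrative helper name, not something the assignment requires:

```python
from pathlib import Path
from typing import Optional

def safe_read_text(filepath: Path, encoding: str = "utf-8") -> Optional[str]:
    """Return file contents, or None if the file is missing or unreadable."""
    if not filepath.is_file():
        print(f"⚠ File not found: {filepath}")
        return None
    try:
        return filepath.read_text(encoding=encoding)
    except OSError as exc:
        print(f"⚠ Could not read {filepath}: {exc}")
        return None

content = safe_read_text(Path("no_such_file.txt"))
print(content)  # None, with a warning printed instead of a traceback
```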
Encoding Handler (8.1)
Handle different file encodings:
- Detect file encoding when possible
- Handle UTF-8, Latin-1, and other encodings
- Gracefully handle encoding errors
- Convert between encodings
```python
from pathlib import Path
from typing import Optional, Tuple

class EncodingHandler:
    """Handle file encoding detection and conversion."""

    # Note: latin-1 can decode ANY byte sequence, so encodings listed
    # after it will never be tried; it effectively acts as a catch-all.
    COMMON_ENCODINGS = ['utf-8', 'latin-1', 'cp1252', 'ascii']

    def read_with_fallback(self, filepath: Path,
                           encodings: Optional[list] = None) -> Tuple[str, str]:
        """
        Try to read file with multiple encodings.
        Returns (content, encoding_used).
        """
        encodings = encodings or self.COMMON_ENCODINGS
        for encoding in encodings:
            try:
                with open(filepath, 'r', encoding=encoding) as f:
                    content = f.read()
                return content, encoding
            except UnicodeDecodeError:
                continue
        # Last resort: read with error handling
        with open(filepath, 'r', encoding='utf-8', errors='replace') as f:
            content = f.read()
        return content, 'utf-8 (with replacements)'

    def convert_encoding(self, input_path: Path,
                         output_path: Path,
                         target_encoding: str = 'utf-8'):
        """Convert file to different encoding."""
        content, original_encoding = self.read_with_fallback(input_path)
        with open(output_path, 'w', encoding=target_encoding) as f:
            f.write(content)
        print(f"✓ Converted {input_path.name}: "
              f"{original_encoding} → {target_encoding}")
```
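To see why the fallback order matters, here is a tiny standalone demonstration with a byte that is invalid UTF-8 but valid Latin-1 (the byte string is made up for illustration):

```python
# 0xE9 is "é" in Latin-1, but an incomplete multi-byte sequence in UTF-8
data = b"caf\xe9"

try:
    data.decode("utf-8")
    decoded_by = "utf-8"
except UnicodeDecodeError:
    # fall back, the same way read_with_fallback does
    decoded_by = "latin-1"

text = data.decode(decoded_by)
print(text, decoded_by)  # café latin-1
```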
Main Application (Integration)
Create a main.py that ties everything together:
- Initialize all components
- Process sample data files
- Generate all report types
- Display summary to console
```python
from pathlib import Path

# These modules correspond to the required files listed under "Submission"
from file_processor import FileProcessor
from config_manager import ConfigManager
from log_parser import LogParser
from sales_processor import SalesDataProcessor
from csv_writer import CSVWriter
from report_generator import ReportGenerator
from text_writer import TextReportWriter


def main():
    """Main entry point for file processing system."""
    print("=" * 60)
    print(" DATAFLOW ANALYTICS - FILE PROCESSING SYSTEM")
    print("=" * 60 + "\n")

    # Initialize processor with base directory
    base_dir = Path(__file__).parent
    processor = FileProcessor(base_dir)

    # Load configuration
    config_path = base_dir / "config.json"
    config = ConfigManager(config_path)

    # Parse log files
    print("\n📋 Processing Log Files...")
    log_parser = LogParser(encoding=config.get("encoding"))
    log_files = processor.list_files(processor.input_dir, "*.txt")

    all_log_entries = []
    for log_file in log_files:
        entries = log_parser.parse_file(log_file)
        all_log_entries.extend(entries)
    # Note: get_statistics() covers the most recently parsed file;
    # with the single sample log this equals all_log_entries.
    log_stats = log_parser.get_statistics()

    # Process CSV sales data
    print("\n📊 Processing Sales Data...")
    sales_processor = SalesDataProcessor(encoding=config.get("encoding"))
    csv_files = processor.list_files(processor.input_dir, "*.csv")
    for csv_file in csv_files:
        sales_processor.load_csv(csv_file)
    sales_summary = sales_processor.get_summary_by_category()
    daily_sales = sales_processor.get_daily_sales()

    # Generate reports
    print("\n📝 Generating Reports...")
    report_gen = ReportGenerator(processor.reports_dir)
    full_report = report_gen.generate_full_report(
        log_stats, sales_summary, daily_sales
    )
    report_gen.save_report(full_report, "analysis_report.json")

    # Write text report
    text_writer = TextReportWriter(processor.reports_dir)
    text_writer.write_summary_report(
        processor.reports_dir / "summary_report.txt",
        log_stats, sales_summary
    )

    # Write CSV summary
    csv_writer = CSVWriter()
    csv_writer.write_summary_report(
        processor.reports_dir / "category_summary.csv",
        sales_summary
    )

    print("\n" + "=" * 60)
    print("✅ Processing Complete!")
    print(f"📁 Reports saved to: {processor.reports_dir}")
    print("=" * 60)


if __name__ == "__main__":
    main()
```
Submission
Create a public GitHub repository with the exact name shown below:
Required Repository Name
python-file-processor
Required Files
```
python-file-processor/
├── file_processor.py      # FileProcessor class with path handling
├── config_manager.py      # ConfigManager for JSON config
├── log_parser.py          # LogParser for text files
├── sales_processor.py     # SalesDataProcessor for CSV
├── csv_writer.py          # CSVWriter class
├── report_generator.py    # JSON report generator
├── text_writer.py         # Text report writer
├── file_discovery.py      # File search utilities
├── encoding_handler.py    # Encoding utilities
├── main.py                # Main application
├── config.json            # Configuration file
├── data/
│   ├── input/
│   │   ├── server_logs.txt    # Sample log file
│   │   └── sales_data.csv     # Sample CSV data
│   └── output/
│       ├── logs/
│       └── reports/
│           ├── analysis_report.json
│           ├── summary_report.txt
│           └── category_summary.csv
├── output.txt             # Console output from main.py
└── README.md              # Documentation
```
README.md Must Include:
- Your full name and submission date
- Project structure diagram
- Explanation of path handling strategy
- Sample input/output examples
- Instructions to run the application
Do Include
- All 10 requirements implemented
- Context managers for all file operations
- pathlib for all path handling
- Sample input data files
- Generated output files
- Proper error handling
Do Not Include
- Hardcoded path separators (`\` or `/`)
- Files opened without context managers
- `os.path` instead of pathlib
- Unhandled file exceptions
- Missing encoding parameters
- Empty output directories
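Since `os.path` is off-limits, here is a quick translation sketch from common `os.path` idioms to their pathlib equivalents (the file names are illustrative):

```python
from pathlib import Path

# instead of os.path.join("data", "input", "server_logs.txt"):
log_file = Path("data") / "input" / "server_logs.txt"

print(log_file.name)        # "server_logs.txt" (replaces os.path.basename)
print(log_file.stem)        # "server_logs" (name without extension)
print(log_file.suffix)      # ".txt" (replaces os.path.splitext)
print(log_file.parent)      # the containing directory (replaces os.path.dirname)
print(log_file.as_posix())  # "data/input/server_logs.txt" on every platform
```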
Enter your GitHub username - we'll verify your repository automatically
Grading Rubric
Your assignment will be graded on the following criteria:
| Criteria | Points | Description |
|---|---|---|
| Path Handling (8.2) | 30 | pathlib usage, directory creation, cross-platform paths |
| File Operations (8.1) | 30 | Context managers, read/write, encoding handling |
| CSV Processing (8.3) | 25 | DictReader/DictWriter, data conversion, aggregations |
| JSON Handling (8.3) | 25 | Config management, report generation, proper formatting |
| Integration & Output | 25 | Working main.py, sample data, generated reports |
| Code Quality | 15 | Docstrings, type hints, README documentation |
| Total | 150 | |
What You Will Practice
File Operations (8.1)
Context managers, file modes, reading/writing, encoding handling
Path Handling (8.2)
pathlib module, glob patterns, directory navigation, cross-platform paths
CSV Processing (8.3)
DictReader/DictWriter, data conversion, aggregations, report generation
JSON Handling (8.3)
Configuration files, serialization, report formatting, nested structures