Assignment 4-A

Pandas Data Wrangling Challenge

Apply your Pandas skills to clean messy datasets, handle missing values, merge multiple data sources, and create comprehensive analytical reports for DataCorp Analytics Firm.

4-5 hours · Advanced · 100 Points
What You'll Practice
  • Data cleaning & preprocessing
  • Handling missing values
  • Merging & joining datasets
  • GroupBy aggregations
  • Pivot tables & analysis

Assignment Overview

In this comprehensive assignment, you will work as a Junior Data Analyst at DataCorp Analytics Firm. The company has received messy, inconsistent datasets from a client and needs you to clean, transform, merge, and analyze the data to produce actionable business insights.

Objectives
  • Master DataFrame indexing and selection
  • Load data from multiple file formats
  • Clean and preprocess messy datasets
  • Merge and join related datasets
  • Perform advanced GroupBy aggregations
  • Create pivot tables for reporting
Skills Tested
  • loc, iloc, and boolean indexing
  • read_csv, read_json, read_excel
  • fillna, dropna, drop_duplicates
  • String operations and regex
  • merge, join, concat operations
  • groupby, pivot_table, apply
Deliverables
  • pandas_wrangling.ipynb (notebook)
  • 3 output CSV files
  • analysis_report.txt
  • README.md

The Scenario

📧 Email from Sarah Chen, Lead Data Scientist

"Welcome back to the team! We have a critical project from MegaCorp Industries. They've sent us employee data, expense reports, department information, and performance reviews—but the data is a complete mess. Different formats, missing values, inconsistent naming, duplicate records... you name it.

Your task is to wrangle this data into shape and produce three key reports:

  1. Cleaned Employee Master File - A single, clean dataset with all employee information
  2. Department Performance Summary - Aggregated metrics by department including headcount, avg salary, performance scores
  3. Expense Analysis Report - Monthly expense trends by department and category with anomaly flags

The client needs this by end of week. I know the data is messy, but I trust your Pandas skills. Good luck!"

— Sarah Chen, Lead Data Scientist

The Datasets

Download all four datasets below. Each file has intentional data quality issues that you must identify and fix.

datacorp_employees.csv

Employee master data including personal info, department, salary, hire dates, and performance scores.

50 records · missing values present · duplicates exist · inconsistent formats
Known Issues to Fix:
  • Phone numbers in multiple formats (with/without country code, dashes, spaces)
  • Some salaries stored as strings with currency symbols
  • Missing performance scores (need imputation strategy)
  • Duplicate employee IDs with conflicting information
  • Inconsistent city name capitalization
datacorp_expenses.csv

Expense transaction records with dates, amounts, categories, vendors, and approval status.

100 records · date format issues · negative amounts · category inconsistencies
Known Issues to Fix:
  • Multiple date formats (MM/DD/YYYY, YYYY-MM-DD, DD-Mon-YYYY)
  • Some amounts are negative (should be positive for expenses)
  • Category names have inconsistent casing and spelling variations
  • Some vendor names have extra whitespace or special characters
  • Missing department_id for some transactions
datacorp_departments.json

Department hierarchy with budgets, locations, team information, and active projects.

Nested JSON · complex structure · requires flattening
Known Issues to Fix:
  • Deeply nested structure requires json_normalize or custom flattening
  • Budget values stored as strings in some cases
  • Location data needs to be extracted from nested objects
  • Project arrays need to be handled appropriately
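The flattening requirement above can be handled with `pd.json_normalize`. The sketch below uses a hypothetical record shaped like one department entry; the field names are assumptions, not taken from the actual file:

```python
import pandas as pd

# Hypothetical record shaped like one department entry (field names are
# assumed for illustration, not read from datacorp_departments.json).
record = [{
    "department_id": "D1",
    "budget": "500000",
    "location": {"city": "Austin", "country": "USA"},
    "projects": ["Apollo", "Zephyr"],
}]

df = pd.json_normalize(record)                  # nested dicts become dotted columns
df["budget"] = pd.to_numeric(df["budget"])      # string budgets -> numeric
df["project_count"] = df["projects"].str.len()  # list column stays a list; count it
```

After normalization the nested location fields appear as `location.city` and `location.country`, which can then be renamed to plain columns.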
datacorp_reviews.csv

Performance review records with ratings across multiple dimensions and reviewer comments.

45 records · rating scale issues · outliers present · text data cleanup
Known Issues to Fix:
  • Ratings should be on a 1-5 scale, but some records have 0 or values greater than 5
  • Some review comments contain HTML entities that need decoding
  • Multiple reviews per employee that need aggregation
  • Reviewer names have inconsistent formatting

Requirements

Part 1 Data Loading & Initial Exploration (15 points)

In your pandas_wrangling.ipynb notebook:
  1. Load all four datasets using appropriate Pandas read functions.
    • Use pd.read_csv() for CSV files with correct encoding detection
    • Use pd.read_json() for the JSON file - handle nested structure appropriately
    • Apply proper dtype specifications where known
  2. Create a data profiling function that returns a dictionary containing:
    • Shape (rows, columns)
    • Column names and their data types
    • Missing value count per column
    • Duplicate row count
    • Memory usage in MB
  3. Implement data validation checks that identify:
    • Columns with more than 20% missing values
    • Potential date columns stored as strings
    • Numeric columns with unexpected negative values
    • Columns with high cardinality (> 50% unique values)

Part 2 Data Cleaning (25 points)

Create cleaning functions in your notebook:
  1. clean_employee_data(df) function that:
    • Standardizes phone numbers to format: +1-XXX-XXX-XXXX
    • Converts salary strings to numeric (remove $, commas, handle "K" suffix)
    • Fills missing performance scores using department median
    • Removes exact duplicate rows, keeping first occurrence
    • For duplicate employee_ids with different data, keep the most recent hire_date record
    • Standardizes city names to title case
    • Validates email format using regex (flag invalid emails in new column)
  2. clean_expense_data(df) function that:
    • Parses all date formats into a single datetime format (YYYY-MM-DD)
    • Converts negative amounts to positive (expenses should be positive)
    • Standardizes category names (lowercase, replace variations with canonical names)
    • Strips whitespace and special characters from vendor names
    • Fills missing department_id using employee_id lookup from employees table
    • Creates a "month_year" column for aggregation
    • Flags potential anomalies: expenses > 3 standard deviations from category mean
  3. clean_review_data(df) function that:
    • Clips rating values to valid range (1-5)
    • Decodes HTML entities in comment text (e.g., &amp; → &)
    • Standardizes reviewer names (First Last format)
    • Calculates average rating per employee across all dimensions
    • Handles reviews with all-null ratings (drop or flag)
  4. flatten_department_json(json_data) function that:
    • Flattens nested JSON structure into a flat DataFrame
    • Extracts location city and country into separate columns
    • Converts budget strings to numeric values
    • Creates a "project_count" column from the projects array
    • Handles missing nested fields gracefully
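Three of the cleaning steps above (salary parsing, phone standardization, and median imputation by department) can be sketched like this. Column names such as `department_id` and `performance_score` are assumptions based on the assignment text:

```python
import re

import pandas as pd

def parse_salary(value):
    """'$85,000' or '85K' -> 85000.0; plain numbers pass through, NaN stays missing."""
    if pd.isna(value):
        return pd.NA
    s = str(value).replace("$", "").replace(",", "").strip()
    if s.upper().endswith("K"):
        return float(s[:-1]) * 1000
    return float(s)

def standardize_phone(value):
    """Reduce any digit mix to +1-XXX-XXX-XXXX; return NA if not a 10-digit number."""
    digits = re.sub(r"\D", "", str(value))
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]  # drop a leading country code
    if len(digits) != 10:
        return pd.NA
    return f"+1-{digits[:3]}-{digits[3:6]}-{digits[6:]}"

def fill_scores_by_department(df):
    """Impute missing performance scores with the department median (group-level transform)."""
    med = df.groupby("department_id")["performance_score"].transform("median")
    return df.assign(performance_score=df["performance_score"].fillna(med))
```

Each helper is easy to unit-test in isolation before wiring it into `clean_employee_data`, which is the recommended order of work.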

Part 3 Data Transformation (25 points)

Create transformation functions in your notebook:
  1. create_employee_master(employees_df, departments_df, reviews_df) function that:
    • Left joins employees with departments on department_id
    • Aggregates review data per employee (avg ratings, review count, latest review date)
    • Joins aggregated review data to employees
    • Calculates tenure in years from hire_date
    • Creates salary band categories: 'Entry' (<50K), 'Mid' (50-80K), 'Senior' (80-120K), 'Executive' (>120K)
    • Adds performance category based on score: 'Low' (<60), 'Average' (60-75), 'High' (75-90), 'Exceptional' (>90)
    • Output should have NO duplicate employee_ids
  2. create_department_summary(employee_master_df, expenses_df) function that:
    • Groups by department to calculate: headcount, total payroll, avg salary, avg performance score
    • Calculates salary range (max - min) per department
    • Counts employees in each performance category per department
    • Calculates average tenure per department
    • Joins with department budget data to calculate budget utilization (payroll/budget)
    • Aggregates total expenses per department from expenses table
    • Sorts by total payroll descending
  3. create_expense_analysis(expenses_df, departments_df) function that:
    • Creates a pivot table: rows=month_year, columns=category, values=sum of amount
    • Adds row totals and column totals
    • Calculates month-over-month percent change for each category
    • Identifies top 3 vendors by total spend per category
    • Flags months where total expenses exceed department monthly budget (budget/12)
    • Creates a rolling 3-month average expense per category
    • Joins department name for readability
  4. Advanced transformations:
    • Use pd.merge() with appropriate join types and validate parameter
    • Use pd.concat() where combining DataFrames vertically
    • Implement at least one custom aggregation using agg() with named aggregations
    • Use transform() for at least one group-level calculation
    • Apply apply() with a custom function on at least one column
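The validated merge and named-aggregation requirements above can be combined in one pipeline. The frames below are toy stand-ins with assumed column names, sketching the shape of `create_department_summary`:

```python
import pandas as pd

# Toy frames standing in for the cleaned tables (column names assumed).
employees = pd.DataFrame({
    "employee_id": [1, 2, 3, 4],
    "department_id": ["D1", "D1", "D2", "D2"],
    "salary": [60000, 90000, 55000, 120000],
    "performance_score": [72, 88, 65, 95],
})
departments = pd.DataFrame({
    "department_id": ["D1", "D2"],
    "department_name": ["Sales", "Engineering"],
})

# validate="many_to_one" guards against accidental row fan-out on the join.
master = employees.merge(departments, on="department_id",
                         how="left", validate="many_to_one")

# Named aggregations keep the output columns self-documenting.
summary = (
    master.groupby("department_name")
    .agg(
        headcount=("employee_id", "count"),
        total_payroll=("salary", "sum"),
        avg_salary=("salary", "mean"),
        avg_performance=("performance_score", "mean"),
        salary_range=("salary", lambda s: s.max() - s.min()),
    )
    .sort_values("total_payroll", ascending=False)
    .reset_index()
)
```

Note the `reset_index()` at the end: without it, `department_name` stays in the index and is lost when writing a plain CSV with `index=False`.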

Part 4 Validation & Testing (15 points)

In your notebook, include validation cells that:
  1. Verify data loading:
    • Print shape of each loaded DataFrame
    • Verify expected columns exist
    • Show sample rows from each dataset
  2. Verify cleaning results:
    • Assert phone number format matches pattern
    • Assert no negative salaries exist
    • Assert no duplicate employee_ids remain
    • Assert dates are valid datetime objects
    • Assert ratings are within 1-5 range
  3. Verify transformation results:
    • Print employee master shape and columns
    • Verify department summary totals
    • Display expense pivot table structure
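The cleaning assertions above can be bundled into one reusable check. This is a sketch with assumed column names (`phone`, `salary`, `employee_id`, `hire_date`); adapt it to the real schema:

```python
import pandas as pd

def validate_cleaned_employees(df: pd.DataFrame) -> None:
    """Raise AssertionError if any cleaning guarantee is violated.

    Column names (phone, salary, employee_id, hire_date) are assumptions.
    """
    phone_ok = df["phone"].dropna().str.match(r"^\+1-\d{3}-\d{3}-\d{4}$")
    assert phone_ok.all(), "phone number in unexpected format"
    assert (df["salary"].dropna() >= 0).all(), "negative salary found"
    assert not df["employee_id"].duplicated().any(), "duplicate employee_id remains"
    assert pd.api.types.is_datetime64_any_dtype(df["hire_date"]), "hire_date not datetime"
```

Running this in a notebook cell after each cleaning function gives an immediate, loud failure instead of a silently wrong CSV.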

Part 5 Output Files & Analysis Report (20 points)

Required Output Files:
  1. cleaned_employees.csv - The final cleaned and enriched employee master file
  2. department_summary.csv - Aggregated department metrics and KPIs
  3. expense_analysis.csv - Pivot table with expense trends and analysis
analysis_report.txt must include:
  1. Data Quality Summary:
    • Initial record counts per dataset
    • Missing value summary (before and after cleaning)
    • Duplicates found and removed
    • Data type conversions performed
  2. Key Findings (answer these questions):
    • Which department has the highest average salary? Lowest?
    • What is the correlation between tenure and performance score?
    • Which expense category has grown the most over the time period?
    • Are there any departments consistently over budget?
    • What percentage of employees are in each performance category?
  3. Anomalies & Recommendations:
    • List all expense anomalies flagged (transaction IDs and amounts)
    • Identify any departments with unusual patterns
    • Provide 3-5 data-driven recommendations for the client
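When writing the deliverables, the main pitfall is the index column. A minimal sketch (toy frame and temporary paths; real code writes the wrangled frames to the repository root):

```python
import pathlib
import tempfile

import pandas as pd

# Toy stand-in for the final employee master (real code uses the wrangled data).
employee_master = pd.DataFrame({"employee_id": [1, 2], "salary": [60000, 90000]})

out_dir = pathlib.Path(tempfile.mkdtemp())
employee_master.to_csv(out_dir / "cleaned_employees.csv", index=False)

# Round-trip check: the file reloads with identical shape and columns.
reloaded = pd.read_csv(out_dir / "cleaned_employees.csv")
assert reloaded.shape == employee_master.shape
```

For the expense pivot, keep the index when writing (the default) so the `month_year` rows survive in `expense_analysis.csv`; `index=False` would silently drop them.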

Grading Rubric

  • Data Loading & Profiling (15 pts): All datasets load correctly, profiling function complete, validation checks comprehensive
  • Data Cleaning (25 pts): All cleaning functions work correctly, edge cases handled, data quality significantly improved
  • Data Transformation (25 pts): All merges correct, aggregations accurate, advanced Pandas operations used effectively
  • Validation & Testing (15 pts): Verification cells demonstrate all functions work, assertions pass
  • Output Files & Report (20 pts): All CSV files correct, report complete with insights, recommendations actionable
  • Total: 100 pts
Deductions
  • -5 points: Missing required file
  • -5 points: Code doesn't run without errors
  • -10 points: Hardcoded values instead of dynamic calculations
  • -5 points: No docstrings or comments
  • -5 points: Inconsistent code style (not following PEP 8)

Submission

Create a public GitHub repository with the exact name shown below, add all required files, and submit through the submission portal.

github.com/<your-username>/datacorp-pandas-wrangling
Required Files Checklist:
  • pandas_wrangling.ipynb
  • cleaned_employees.csv
  • department_summary.csv
  • expense_analysis.csv
  • analysis_report.txt
  • README.md

All files are required. Submission will fail if any file is missing.

Pro Tips

Debugging Strategy
  • Use df.info() and df.describe() frequently
  • Check df.isna().sum() after each cleaning step
  • Verify row counts before and after merges
  • Use assert statements to validate assumptions
Code Quality
  • Add docstrings to all functions
  • Use method chaining where appropriate
  • Avoid modifying DataFrames in place
  • Use descriptive variable names
Time Management
  • Start with data loading and exploration (~30 min)
  • Focus on cleaning functions next (~90 min)
  • Build transformations incrementally (~90 min)
  • Write tests as you go, not at the end
Common Pitfalls
  • Not handling merge duplicates properly
  • Forgetting to reset index after groupby
  • Mixing up left/right/inner/outer joins
  • Not copying DataFrames before modifying

Pre-Submission Checklist
