Assignment 4-A

Pandas Data Wrangling Challenge

Apply your Pandas skills to clean messy datasets, handle missing values, merge multiple data sources, and create comprehensive analytical reports for DataCorp Analytics Firm.

4-5 hours · Advanced · 100 Points
What You'll Practice
  • Data cleaning & preprocessing
  • Handling missing values
  • Merging & joining datasets
  • GroupBy aggregations
  • Pivot tables & analysis

Assignment Overview

In this comprehensive assignment, you will work as a Junior Data Analyst at DataCorp Analytics Firm. The company has received messy, inconsistent datasets from a client and needs you to clean, transform, merge, and analyze the data to produce actionable business insights.

Objectives
  • Master DataFrame indexing and selection
  • Load data from multiple file formats
  • Clean and preprocess messy datasets
  • Merge and join related datasets
  • Perform advanced GroupBy aggregations
  • Create pivot tables for reporting
Skills Tested
  • loc, iloc, and boolean indexing
  • read_csv, read_json, read_excel
  • fillna, dropna, drop_duplicates
  • String operations and regex
  • merge, join, concat operations
  • groupby, pivot_table, apply
Deliverables
  • pandas_wrangling.ipynb (notebook)
  • 3 output CSV files
  • analysis_report.txt
  • README.md

The Scenario

📧 Email from Sarah Chen, Lead Data Scientist

"Welcome back to the team! We have a critical project from MegaCorp Industries. They've sent us employee data, expense reports, department information, and performance reviews—but the data is a complete mess. Different formats, missing values, inconsistent naming, duplicate records... you name it.

Your task is to wrangle this data into shape and produce three key reports:

  1. Cleaned Employee Master File - A single, clean dataset with all employee information
  2. Department Performance Summary - Aggregated metrics by department including headcount, avg salary, performance scores
  3. Expense Analysis Report - Monthly expense trends by department and category with anomaly flags

The client needs this by end of week. I know the data is messy, but I trust your Pandas skills. Good luck!"

— Sarah Chen, Lead Data Scientist

The Datasets

Download all four datasets below. Each file has intentional data quality issues that you must identify and fix.

datacorp_employees.csv

Employee master data including personal info, department, salary, hire dates, and performance scores.

50 records · missing values present · duplicates exist · inconsistent formats
Known Issues to Fix:
  • Phone numbers in multiple formats (with/without country code, dashes, spaces)
  • Some salaries stored as strings with currency symbols
  • Missing performance scores (need imputation strategy)
  • Duplicate employee IDs with conflicting information
  • Inconsistent city name capitalization
datacorp_expenses.csv

Expense transaction records with dates, amounts, categories, vendors, and approval status.

100 records · date format issues · negative amounts · category inconsistencies
Known Issues to Fix:
  • Multiple date formats (MM/DD/YYYY, YYYY-MM-DD, DD-Mon-YYYY)
  • Some amounts are negative (should be positive for expenses)
  • Category names have inconsistent casing and spelling variations
  • Some vendor names have extra whitespace or special characters
  • Missing department_id for some transactions
datacorp_departments.json

Department hierarchy with budgets, locations, team information, and active projects.

Nested JSON · complex structure · requires flattening
Known Issues to Fix:
  • Deeply nested structure requires json_normalize or custom flattening
  • Budget values stored as strings in some cases
  • Location data needs to be extracted from nested objects
  • Project arrays need to be handled appropriately
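The flattening requirement above can be handled with `pd.json_normalize`. The sketch below uses a hypothetical record shaped like one department entry; the field names are assumptions, not taken from the actual file:

```python
import pandas as pd

# Hypothetical record shaped like one department entry (field names are
# assumed for illustration, not read from datacorp_departments.json).
record = [{
    "department_id": "D1",
    "budget": "500000",
    "location": {"city": "Austin", "country": "USA"},
    "projects": ["Apollo", "Zephyr"],
}]

df = pd.json_normalize(record)                  # nested dicts become dotted columns
df["budget"] = pd.to_numeric(df["budget"])      # string budgets -> numeric
df["project_count"] = df["projects"].str.len()  # list column stays a list; count it
```

After normalization the nested location fields appear as `location.city` and `location.country`, which can then be renamed to plain columns.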
datacorp_reviews.csv

Performance review records with ratings across multiple dimensions and reviewer comments.

45 records · rating scale issues · outliers present · text data cleanup
Known Issues to Fix:
  • Ratings should be on a 1-5 scale, but some records have 0 or values greater than 5
  • Some review comments contain HTML entities that need decoding
  • Multiple reviews per employee that need aggregation
  • Reviewer names have inconsistent formatting

Requirements

Part 1 Data Loading & Initial Exploration (15 points)

In your pandas_wrangling.ipynb notebook:
  1. Load all four datasets using appropriate Pandas read functions.
    • Use pd.read_csv() for CSV files with correct encoding detection
    • Use pd.read_json() for the JSON file - handle nested structure appropriately
    • Apply proper dtype specifications where known
  2. Create a data profiling function that returns a dictionary containing:
    • Shape (rows, columns)
    • Column names and their data types
    • Missing value count per column
    • Duplicate row count
    • Memory usage in MB
  3. Implement data validation checks that identify:
    • Columns with more than 20% missing values
    • Potential date columns stored as strings
    • Numeric columns with unexpected negative values
    • Columns with high cardinality (> 50% unique values)

Part 2 Data Cleaning (25 points)

Create cleaning functions in your notebook:
  1. clean_employee_data(df) function that:
    • Standardizes phone numbers to format: +1-XXX-XXX-XXXX
    • Converts salary strings to numeric (remove $, commas, handle "K" suffix)
    • Fills missing performance scores using department median
    • Removes exact duplicate rows, keeping first occurrence
    • For duplicate employee_ids with different data, keep the most recent hire_date record
    • Standardizes city names to title case
    • Validates email format using regex (flag invalid emails in new column)
  2. clean_expense_data(df) function that:
    • Parses all date formats into a single datetime format (YYYY-MM-DD)
    • Converts negative amounts to positive (expenses should be positive)
    • Standardizes category names (lowercase, replace variations with canonical names)
    • Strips whitespace and special characters from vendor names
    • Fills missing department_id using employee_id lookup from employees table
    • Creates a "month_year" column for aggregation
    • Flags potential anomalies: expenses > 3 standard deviations from category mean
  3. clean_review_data(df) function that:
    • Clips rating values to valid range (1-5)
    • Decodes HTML entities in comment text (e.g., &amp; → &)
    • Standardizes reviewer names (First Last format)
    • Calculates average rating per employee across all dimensions
    • Handles reviews with all-null ratings (drop or flag)
  4. flatten_department_json(json_data) function that:
    • Flattens nested JSON structure into a flat DataFrame
    • Extracts location city and country into separate columns
    • Converts budget strings to numeric values
    • Creates a "project_count" column from the projects array
    • Handles missing nested fields gracefully
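Three of the cleaning steps above (salary parsing, phone standardization, and median imputation by department) can be sketched like this. Column names such as `department_id` and `performance_score` are assumptions based on the assignment text:

```python
import re

import pandas as pd

def parse_salary(value):
    """'$85,000' or '85K' -> 85000.0; plain numbers pass through, NaN stays missing."""
    if pd.isna(value):
        return pd.NA
    s = str(value).replace("$", "").replace(",", "").strip()
    if s.upper().endswith("K"):
        return float(s[:-1]) * 1000
    return float(s)

def standardize_phone(value):
    """Reduce any digit mix to +1-XXX-XXX-XXXX; return NA if not a 10-digit number."""
    digits = re.sub(r"\D", "", str(value))
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]  # drop a leading country code
    if len(digits) != 10:
        return pd.NA
    return f"+1-{digits[:3]}-{digits[3:6]}-{digits[6:]}"

def fill_scores_by_department(df):
    """Impute missing performance scores with the department median (group-level transform)."""
    med = df.groupby("department_id")["performance_score"].transform("median")
    return df.assign(performance_score=df["performance_score"].fillna(med))
```

Each helper is easy to unit-test in isolation before wiring it into `clean_employee_data`, which is the recommended order of work.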

Part 3 Data Transformation (25 points)

Create transformation functions in your notebook:
  1. create_employee_master(employees_df, departments_df, reviews_df) function that:
    • Left joins employees with departments on department_id
    • Aggregates review data per employee (avg ratings, review count, latest review date)
    • Joins aggregated review data to employees
    • Calculates tenure in years from hire_date
    • Creates salary band categories: 'Entry' (<50K), 'Mid' (50-80K), 'Senior' (80-120K), 'Executive' (>120K)
    • Adds performance category based on score: 'Low' (<60), 'Average' (60-75), 'High' (75-90), 'Exceptional' (>90)
    • Output should have NO duplicate employee_ids
  2. create_department_summary(employee_master_df, expenses_df) function that:
    • Groups by department to calculate: headcount, total payroll, avg salary, avg performance score
    • Calculates salary range (max - min) per department
    • Counts employees in each performance category per department
    • Calculates average tenure per department
    • Joins with department budget data to calculate budget utilization (payroll/budget)
    • Aggregates total expenses per department from expenses table
    • Sorts by total payroll descending
  3. create_expense_analysis(expenses_df, departments_df) function that:
    • Creates a pivot table: rows=month_year, columns=category, values=sum of amount
    • Adds row totals and column totals
    • Calculates month-over-month percent change for each category
    • Identifies top 3 vendors by total spend per category
    • Flags months where total expenses exceed department monthly budget (budget/12)
    • Creates a rolling 3-month average expense per category
    • Joins department name for readability
  4. Advanced transformations:
    • Use pd.merge() with appropriate join types and validate parameter
    • Use pd.concat() where combining DataFrames vertically
    • Implement at least one custom aggregation using agg() with named aggregations
    • Use transform() for at least one group-level calculation
    • Apply apply() with a custom function on at least one column
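The validated merge and named-aggregation requirements above can be combined in one pipeline. The frames below are toy stand-ins with assumed column names, sketching the shape of `create_department_summary`:

```python
import pandas as pd

# Toy frames standing in for the cleaned tables (column names assumed).
employees = pd.DataFrame({
    "employee_id": [1, 2, 3, 4],
    "department_id": ["D1", "D1", "D2", "D2"],
    "salary": [60000, 90000, 55000, 120000],
    "performance_score": [72, 88, 65, 95],
})
departments = pd.DataFrame({
    "department_id": ["D1", "D2"],
    "department_name": ["Sales", "Engineering"],
})

# validate="many_to_one" guards against accidental row fan-out on the join.
master = employees.merge(departments, on="department_id",
                         how="left", validate="many_to_one")

# Named aggregations keep the output columns self-documenting.
summary = (
    master.groupby("department_name")
    .agg(
        headcount=("employee_id", "count"),
        total_payroll=("salary", "sum"),
        avg_salary=("salary", "mean"),
        avg_performance=("performance_score", "mean"),
        salary_range=("salary", lambda s: s.max() - s.min()),
    )
    .sort_values("total_payroll", ascending=False)
    .reset_index()
)
```

Note the `reset_index()` at the end: without it, `department_name` stays in the index and is lost when writing a plain CSV with `index=False`.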

Part 4 Validation & Testing (15 points)

In your notebook, include validation cells that:
  1. Verify data loading:
    • Print shape of each loaded DataFrame
    • Verify expected columns exist
    • Show sample rows from each dataset
  2. Verify cleaning results:
    • Assert phone number format matches pattern
    • Assert no negative salaries exist
    • Assert no duplicate employee_ids remain
    • Assert dates are valid datetime objects
    • Assert ratings are within 1-5 range
  3. Verify transformation results:
    • Print employee master shape and columns
    • Verify department summary totals
    • Display expense pivot table structure
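The cleaning assertions above can be bundled into one reusable check. This is a sketch with assumed column names (`phone`, `salary`, `employee_id`, `hire_date`); adapt it to the real schema:

```python
import pandas as pd

def validate_cleaned_employees(df: pd.DataFrame) -> None:
    """Raise AssertionError if any cleaning guarantee is violated.

    Column names (phone, salary, employee_id, hire_date) are assumptions.
    """
    phone_ok = df["phone"].dropna().str.match(r"^\+1-\d{3}-\d{3}-\d{4}$")
    assert phone_ok.all(), "phone number in unexpected format"
    assert (df["salary"].dropna() >= 0).all(), "negative salary found"
    assert not df["employee_id"].duplicated().any(), "duplicate employee_id remains"
    assert pd.api.types.is_datetime64_any_dtype(df["hire_date"]), "hire_date not datetime"
```

Running this in a notebook cell after each cleaning function gives an immediate, loud failure instead of a silently wrong CSV.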

Part 5 Output Files & Analysis Report (20 points)

Required Output Files:
  1. cleaned_employees.csv - The final cleaned and enriched employee master file
  2. department_summary.csv - Aggregated department metrics and KPIs
  3. expense_analysis.csv - Pivot table with expense trends and analysis
analysis_report.txt must include:
  1. Data Quality Summary:
    • Initial record counts per dataset
    • Missing value summary (before and after cleaning)
    • Duplicates found and removed
    • Data type conversions performed
  2. Key Findings (answer these questions):
    • Which department has the highest average salary? Lowest?
    • What is the correlation between tenure and performance score?
    • Which expense category has grown the most over the time period?
    • Are there any departments consistently over budget?
    • What percentage of employees are in each performance category?
  3. Anomalies & Recommendations:
    • List all expense anomalies flagged (transaction IDs and amounts)
    • Identify any departments with unusual patterns
    • Provide 3-5 data-driven recommendations for the client
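When writing the deliverables, the main pitfall is the index column. A minimal sketch (toy frame and temporary paths; real code writes the wrangled frames to the repository root):

```python
import pathlib
import tempfile

import pandas as pd

# Toy stand-in for the final employee master (real code uses the wrangled data).
employee_master = pd.DataFrame({"employee_id": [1, 2], "salary": [60000, 90000]})

out_dir = pathlib.Path(tempfile.mkdtemp())
employee_master.to_csv(out_dir / "cleaned_employees.csv", index=False)

# Round-trip check: the file reloads with identical shape and columns.
reloaded = pd.read_csv(out_dir / "cleaned_employees.csv")
assert reloaded.shape == employee_master.shape
```

For the expense pivot, keep the index when writing (the default) so the `month_year` rows survive in `expense_analysis.csv`; `index=False` would silently drop them.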

Grading Rubric

  • Data Loading & Profiling (15 pts): All datasets load correctly, profiling function complete, validation checks comprehensive
  • Data Cleaning (25 pts): All cleaning functions work correctly, edge cases handled, data quality significantly improved
  • Data Transformation (25 pts): All merges correct, aggregations accurate, advanced Pandas operations used effectively
  • Validation & Testing (15 pts): Verification cells demonstrate all functions work, assertions pass
  • Output Files & Report (20 pts): All CSV files correct, report complete with insights, recommendations actionable
  • Total: 100 pts
Deductions
  • -5 points: Missing required file
  • -5 points: Code doesn't run without errors
  • -10 points: Hardcoded values instead of dynamic calculations
  • -5 points: No docstrings or comments
  • -5 points: Inconsistent code style (not following PEP 8)

Submission

Create a public GitHub repository with the exact name shown below, add all required files, and submit through the submission portal.

github.com/<your-username>/datacorp-pandas-wrangling
Required Files Checklist:
  • pandas_wrangling.ipynb
  • cleaned_employees.csv
  • department_summary.csv
  • expense_analysis.csv
  • analysis_report.txt
  • README.md

All files are required. Submission will fail if any file is missing.

Pro Tips

Debugging Strategy
  • Use df.info() and df.describe() frequently
  • Check df.isna().sum() after each cleaning step
  • Verify row counts before and after merges
  • Use assert statements to validate assumptions
Code Quality
  • Add docstrings to all functions
  • Use method chaining where appropriate
  • Avoid modifying DataFrames in place
  • Use descriptive variable names
Time Management
  • Start with data loading and exploration (~30 min)
  • Focus on cleaning functions next (~90 min)
  • Build transformations incrementally (~90 min)
  • Write tests as you go, not at the end
Common Pitfalls
  • Not handling merge duplicates properly
  • Forgetting to reset index after groupby
  • Mixing up left/right/inner/outer joins
  • Not copying DataFrames before modifying

Pre-Submission Checklist
