What is Exploratory Data Analysis?
Exploratory Data Analysis (EDA) is an approach to analyzing datasets to summarize their main characteristics, often using visual methods. Pioneered by statistician John Tukey in the 1970s, EDA emphasizes looking at data before making assumptions. The goal is to discover patterns, spot anomalies, test hypotheses, and check assumptions through summary statistics and graphical representations.
Exploratory Data Analysis (EDA)
A systematic approach to investigating datasets using statistical summaries and visualizations to understand data structure, detect patterns, identify outliers, and form hypotheses before formal modeling
Why EDA Matters
Skipping EDA is one of the most common mistakes in data science projects. Without understanding your data, you risk building models on flawed assumptions, missing important patterns, or wasting time on irrelevant features. EDA helps you make informed decisions about data cleaning, feature engineering, and model selection.
With EDA
- Catch data quality issues early
- Understand variable distributions
- Identify relationships between features
- Make informed modeling decisions
Without EDA
- Build models on corrupted data
- Miss important patterns
- Waste time on irrelevant features
- Get misleading results
Practice: EDA Fundamentals
Scenario: Your team receives new datasets daily and needs a consistent way to assess data quality. Create a function that quickly identifies issues.
# Test with this problematic data
import pandas as pd

df = pd.DataFrame({
    'id': [1, 2, 2, 4, 5],
    'name': ['Alice', 'Bob', 'Bob', 'Diana', None],
    'age': [25, None, 35, 45, 55],
    'salary': [50000, 60000, 60000, None, 80000]
})
Task: Write data_quality_report(df) that returns: total rows, duplicate count, missing values per column, and data types.
Show Solution
def data_quality_report(df):
    """Generate a comprehensive data quality report."""
    report = {
        'total_rows': len(df),
        'total_columns': len(df.columns),
        'duplicate_rows': df.duplicated().sum(),
        'memory_mb': df.memory_usage(deep=True).sum() / 1024**2
    }
    # Missing values summary
    missing = df.isna().sum()
    missing_pct = (missing / len(df) * 100).round(1)
    report['missing_summary'] = pd.DataFrame({
        'missing_count': missing,
        'missing_pct': missing_pct,
        'dtype': df.dtypes
    })
    print(f"\n{'='*50}")
    print("DATA QUALITY REPORT")
    print(f"{'='*50}")
    print(f"Rows: {report['total_rows']}")
    print(f"Columns: {report['total_columns']}")
    print(f"Duplicates: {report['duplicate_rows']}")
    print(f"Memory: {report['memory_mb']:.3f} MB")
    print("\nMissing Values:")
    print(report['missing_summary'][report['missing_summary']['missing_count'] > 0])
    return report

# Test
report = data_quality_report(df)
Task: Write a function check_data_quality(df) that returns a dictionary with missing value counts and duplicate row count.
# Test data with issues
df = pd.DataFrame({
    'id': [1, 2, 2, 4],
    'value': [10, None, 30, None],
    'category': ['A', 'B', 'B', 'A']
})
Show Solution
def check_data_quality(df):
    """Check data quality and return summary dict."""
    quality_report = {
        'total_rows': len(df),
        'duplicate_rows': df.duplicated().sum(),
        'missing_by_column': df.isna().sum().to_dict(),
        'total_missing': df.isna().sum().sum(),
        'missing_percent': (df.isna().sum().sum() / df.size * 100).round(2)
    }
    return quality_report
# Test
report = check_data_quality(df)
print(report)
# {'total_rows': 4, 'duplicate_rows': 0,
# 'missing_by_column': {'id': 0, 'value': 2, 'category': 0},
# 'total_missing': 2, 'missing_percent': 16.67}
Task: Create a class EDASummary that takes a DataFrame and provides methods for: shape info, missing values, numeric stats, and categorical stats.
Show Solution
class EDASummary:
    """A class for comprehensive EDA summary statistics."""

    def __init__(self, df):
        self.df = df
        self.numeric_cols = df.select_dtypes(include=['number']).columns.tolist()
        self.categorical_cols = df.select_dtypes(include=['object', 'category']).columns.tolist()

    def shape_info(self):
        """Return basic shape information."""
        return {
            'rows': self.df.shape[0],
            'columns': self.df.shape[1],
            'memory_mb': self.df.memory_usage(deep=True).sum() / 1024**2
        }

    def missing_summary(self):
        """Return missing value summary."""
        missing = self.df.isna().sum()
        return pd.DataFrame({
            'missing_count': missing,
            'missing_pct': (missing / len(self.df) * 100).round(2)
        }).query('missing_count > 0').sort_values('missing_count', ascending=False)

    def numeric_summary(self):
        """Return summary stats for numeric columns."""
        if not self.numeric_cols:
            return "No numeric columns"
        return self.df[self.numeric_cols].describe()

    def categorical_summary(self):
        """Return summary for categorical columns."""
        summary = {}
        for col in self.categorical_cols:
            summary[col] = {
                'unique': self.df[col].nunique(),
                'top_value': self.df[col].mode()[0] if len(self.df[col].mode()) > 0 else None,
                'top_freq': self.df[col].value_counts().iloc[0] if len(self.df) > 0 else 0
            }
        return pd.DataFrame(summary).T

# Usage
eda = EDASummary(df)
print(eda.shape_info())
print(eda.missing_summary())
print(eda.numeric_summary())
print(eda.categorical_summary())
The EDA Workflow
A structured EDA workflow ensures you do not miss important aspects of your data. While the specific steps may vary by project, most EDA processes follow a similar pattern. Think of it as a conversation with your data, where each question leads to new insights and further questions.
Interactive: EDA Workflow Explorer
This five-step EDA workflow explorer walks through each step: what it involves, example code, and pro tips.
Understand the Data
Estimated time: 5-10 minutes. Get familiar with your dataset's structure, size, and basic characteristics. This is your first conversation with the data - learn what columns exist, their data types, and preview sample values.
Goals:
- Know the dataset dimensions
- Understand column meanings
- Identify data types
- Preview actual values
Pro Tips:
- df.sample() for random rows
- df.columns.tolist() for a list of column names
# First steps with any new dataset
import pandas as pd
df = pd.read_csv('data.csv')
# 1. Check dimensions
print(f"Dataset: {df.shape[0]:,} rows, {df.shape[1]} cols")
# 2. View column info
print(df.info())
# 3. Preview data
print(df.head())
Key Questions to Answer
During EDA, keep asking questions about your data. Good questions lead to insights that improve your analysis. Here are essential questions to answer for any dataset:
| Category | Questions to Ask |
|---|---|
| Structure | How many rows and columns? What are the data types? |
| Quality | Are there missing values? Duplicates? Outliers? |
| Distribution | What does each variable look like? Normal? Skewed? |
| Relationships | How do variables relate to each other? To the target? |
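These questions translate almost directly into a handful of pandas calls. A minimal sketch (the toy DataFrame and its column names are illustrative only):

```python
import pandas as pd

df = pd.DataFrame({
    'price': [10.0, 12.5, 11.0, 250.0],   # skewed by one large value
    'units': [1, 2, 2, None],
    'region': ['N', 'S', 'S', 'N'],
})

# Structure: how many rows/columns, and what types?
print(df.shape)
print(df.dtypes)

# Quality: missing values and duplicates?
print(df.isna().sum())
print(df.duplicated().sum())

# Distribution: summary stats, plus skewness for numeric columns
print(df.describe())
print(df['price'].skew())   # large positive value -> right skew

# Relationships: pairwise correlation of numeric columns
print(df.corr(numeric_only=True))
```

Each printout answers one row of the table above; a real analysis would then drill into whichever answer looks surprising.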
Practice: EDA Workflow
Scenario: You received an e-commerce dataset and need to perform a complete initial analysis before building any models.
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'product': ['Laptop', 'Phone', 'Tablet', 'Monitor', 'Keyboard'] * 20,
    'price': list(np.random.randint(50, 1200, 100)),
    'quantity': list(np.random.randint(1, 50, 100)),
    'category': ['Electronics', 'Electronics', 'Electronics', 'Accessories', 'Accessories'] * 20,
    'revenue': list(np.random.randint(500, 50000, 100))
})
df.loc[5:10, 'price'] = None # Add some missing values
Task: Write a comprehensive EDA function that outputs: shape, types, missing values, numeric summaries, categorical value counts, and potential outliers (using IQR).
Show Solution
def complete_eda(df):
    """Perform comprehensive EDA on any DataFrame."""
    print("="*60)
    print("1. DATASET OVERVIEW")
    print("="*60)
    print(f"Shape: {df.shape[0]} rows, {df.shape[1]} columns")
    print(f"Memory: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")

    print("\n" + "="*60)
    print("2. DATA TYPES")
    print("="*60)
    print(df.dtypes)

    print("\n" + "="*60)
    print("3. MISSING VALUES")
    print("="*60)
    missing = df.isna().sum()
    if missing.sum() > 0:
        print(missing[missing > 0])
    else:
        print("No missing values!")

    print("\n" + "="*60)
    print("4. NUMERIC COLUMNS")
    print("="*60)
    print(df.describe())

    print("\n" + "="*60)
    print("5. CATEGORICAL COLUMNS")
    print("="*60)
    for col in df.select_dtypes(include=['object', 'category']).columns:
        print(f"\n{col}:")
        print(df[col].value_counts().head())

    print("\n" + "="*60)
    print("6. POTENTIAL OUTLIERS (IQR Method)")
    print("="*60)
    for col in df.select_dtypes(include=['number']).columns:
        Q1, Q3 = df[col].quantile([0.25, 0.75])
        IQR = Q3 - Q1
        outliers = df[(df[col] < Q1 - 1.5*IQR) | (df[col] > Q3 + 1.5*IQR)]
        if len(outliers) > 0:
            print(f"{col}: {len(outliers)} potential outliers")
complete_eda(df)
Task: Write a function analyze_categorical(df, col) that prints value counts, unique count, and most common value for a categorical column.
# Test data
df = pd.DataFrame({
    'region': ['North', 'South', 'North', 'East', 'North', 'South', 'West', 'North'],
    'status': ['Active', 'Active', 'Inactive', 'Active', 'Active', 'Inactive', 'Active', 'Active']
})
Show Solution
def analyze_categorical(df, col):
    """Analyze a categorical column in a DataFrame."""
    print(f"Analysis of '{col}'")
    print("=" * 40)
    print(f"Unique values: {df[col].nunique()}")
    print(f"Most common: {df[col].mode()[0]}")
    print("\nValue counts:")
    print(df[col].value_counts())
    print("\nValue percentages:")
    print((df[col].value_counts(normalize=True) * 100).round(1))
# Test
analyze_categorical(df, 'region')
# Analysis of 'region'
# ========================================
# Unique values: 4
# Most common: North
# Value counts:
# North 4
# South 2
# East 1
# West 1
Task: Write Python code to perform a complete initial EDA on any DataFrame df.
Show Solution
def quick_eda(df):
    """Perform comprehensive initial EDA on a DataFrame."""
    print("=" * 60)
    print("DATASET OVERVIEW")
    print("=" * 60)
    print(f"Shape: {df.shape[0]} rows, {df.shape[1]} columns")
    print(f"Memory: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")

    print("\n" + "=" * 60)
    print("DATA TYPES")
    print("=" * 60)
    print(df.dtypes.value_counts())

    print("\n" + "=" * 60)
    print("MISSING VALUES")
    print("=" * 60)
    missing = df.isnull().sum()
    missing_pct = (missing / len(df) * 100).round(2)
    missing_df = pd.DataFrame({'Missing': missing, 'Percent': missing_pct})
    print(missing_df[missing_df['Missing'] > 0])

    print("\n" + "=" * 60)
    print("DUPLICATES")
    print("=" * 60)
    print(f"Duplicate rows: {df.duplicated().sum()}")

    print("\n" + "=" * 60)
    print("NUMERICAL SUMMARY")
    print("=" * 60)
    print(df.describe().round(2))

    print("\n" + "=" * 60)
    print("CATEGORICAL SUMMARY")
    print("=" * 60)
    for col in df.select_dtypes(include='object').columns:
        print(f"\n{col}: {df[col].nunique()} unique values")
        print(df[col].value_counts().head(5))
# Usage: quick_eda(df)
First Look at Data
When you receive a new dataset, your first task is to get a high-level understanding of what you are working with. Pandas provides several methods that give you immediate insights into your data's structure and content. These initial checks take seconds but save hours of confusion later.
Essential First Commands
Start every analysis with these fundamental commands. They reveal the basic shape, types, and sample values in your dataset. Think of this as a quick health check for your data.
import pandas as pd
# Load your dataset
df = pd.read_csv('sales_data.csv')
# Basic structure
print(df.shape) # (1000, 12) - 1000 rows, 12 columns
print(df.columns.tolist()) # List all column names
print(df.dtypes) # Data type of each column
# Sample data
print(df.head()) # First 5 rows
print(df.tail()) # Last 5 rows
print(df.sample(5)) # Random 5 rows
Summary Statistics with describe()
The describe() method provides statistical summaries for numerical columns. It shows count, mean, standard deviation, min, max, and quartiles. This single command reveals distribution characteristics and potential outliers at a glance.
# Numerical summary
print(df.describe())
# Include all columns (including categorical)
print(df.describe(include='all'))
# Specific percentiles
print(df.describe(percentiles=[.1, .25, .5, .75, .9]))
Understanding Data Types
Data types affect how you can analyze and visualize variables. Pandas infers types when loading data, but it does not always get them right. Dates might load as strings, categories as objects, and IDs as numbers.
# Check memory usage and types
print(df.info())
# Count by data type
print(df.dtypes.value_counts())
# Convert types as needed
df['date'] = pd.to_datetime(df['date'])
df['category'] = df['category'].astype('category')
df['customer_id'] = df['customer_id'].astype(str)
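Conversions on messy real-world columns can raise errors mid-pipeline. One hedged option is errors='coerce', which turns unparseable entries into NaT/NaN so you can count and inspect them instead of crashing (the toy values below are illustrative):

```python
import pandas as pd

raw = pd.DataFrame({
    'date': ['2024-01-05', 'not a date', '2024-02-29'],
    'amount': ['10.5', 'N/A', '7'],
})

# Coerce instead of raising: bad values become NaT / NaN
raw['date'] = pd.to_datetime(raw['date'], errors='coerce')
raw['amount'] = pd.to_numeric(raw['amount'], errors='coerce')

# Count how many values failed to parse, then inspect those rows
print(raw['date'].isna().sum())    # 1 unparseable date
print(raw['amount'].isna().sum())  # 1 unparseable amount
print(raw[raw['date'].isna() | raw['amount'].isna()])
```

Counting the coerced values first tells you whether the column has a few stray typos or a systematic format problem.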
Practice: First Look at Data
Scenario: HR gave you this describe() output for 'salary' column:
count 10000.000000
mean 65000.000000
std 25000.000000
min 5000.000000
25% 45000.000000
50% 60000.000000
75% 80000.000000
max 500000.000000
Task: Identify data quality issues, explain what they suggest, and write code to investigate outliers and visualize the distribution.
Show Solution
Issues Identified:
- Extreme outlier: Max ($500K) vs 75th percentile ($80K) - 6x gap
- Right skew: Mean ($65K) > Median ($60K)
- Suspicious minimum: $5K seems too low for full-time
# Investigate outliers using IQR
Q1, Q3 = df['salary'].quantile([0.25, 0.75])
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
print(f"IQR: ${IQR:,.0f}")
print(f"Lower bound: ${lower_bound:,.0f}")
print(f"Upper bound: ${upper_bound:,.0f}")
# Find outliers
outliers = df[(df['salary'] < lower_bound) | (df['salary'] > upper_bound)]
print(f"\nOutliers: {len(outliers)} ({len(outliers)/len(df)*100:.1f}%)")
# Visualize
import matplotlib.pyplot as plt
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
df['salary'].hist(bins=50, ax=axes[0])
axes[0].set_title('Salary Distribution')
df.boxplot(column='salary', ax=axes[1])
axes[1].set_title('Salary Boxplot')
plt.show()
Given this describe() output for a 'salary' column:
count 10000.000000
mean 65000.000000
std 25000.000000
min 5000.000000
25% 45000.000000
50% 60000.000000
75% 80000.000000
max 500000.000000
Task: Identify any potential data quality issues and explain what additional investigation you would perform.
Show Solution
Issues Identified:
- Potential outliers: Max (500,000) is far from 75th percentile (80,000) - a gap of 420,000
- Right skew: Mean (65,000) > Median/50% (60,000) indicates positive skew
- Suspicious minimum: $5,000 salary seems very low - could be part-time, error, or different unit
# Investigate outliers
print(df[df['salary'] > 200000]) # High earners
print(df[df['salary'] < 20000]) # Unusually low
# Check for IQR-based outliers
Q1 = df['salary'].quantile(0.25)
Q3 = df['salary'].quantile(0.75)
IQR = Q3 - Q1
outliers = df[(df['salary'] < Q1 - 1.5*IQR) | (df['salary'] > Q3 + 1.5*IQR)]
print(f"Potential outliers: {len(outliers)}")
# Visualize distribution
import matplotlib.pyplot as plt
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
df['salary'].hist(bins=50, ax=axes[0])
axes[0].set_title('Salary Distribution')
df.boxplot(column='salary', ax=axes[1])
axes[1].set_title('Salary Boxplot')
plt.show()
Your DataFrame has these columns with incorrect data types:
- order_date: object (should be datetime)
- zip_code: int64 (should be string to preserve leading zeros)
- is_active: int64 (should be boolean)
- product_category: object (should be category for memory efficiency)
Task: Write code to convert all columns to their appropriate types.
Show Solution
# Convert date string to datetime
df['order_date'] = pd.to_datetime(df['order_date'])
# Convert zip_code to string, preserving leading zeros
df['zip_code'] = df['zip_code'].astype(str).str.zfill(5)
# Convert to boolean
df['is_active'] = df['is_active'].astype(bool)
# Convert to category for memory efficiency
df['product_category'] = df['product_category'].astype('category')
# Verify changes
print(df.dtypes)
print(f"\nMemory usage: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
# For multiple conversions at once
df = df.astype({
    'order_date': 'datetime64[ns]',
    'is_active': 'bool',
    'product_category': 'category'
})
Task: Write a function called first_look() that takes a DataFrame and prints a comprehensive summary including:
- Shape and memory usage
- Column names and types
- Missing value summary
- Numeric column statistics
- Categorical column value counts (top 5)
Show Solution
def first_look(df):
    """
    Perform comprehensive first look analysis on a DataFrame.
    """
    print("=" * 60)
    print("DATASET OVERVIEW")
    print("=" * 60)

    # Basic shape and memory
    print(f"\nShape: {df.shape[0]:,} rows × {df.shape[1]} columns")
    memory_mb = df.memory_usage(deep=True).sum() / 1024**2
    print(f"Memory Usage: {memory_mb:.2f} MB")

    # Column types summary
    print("\nColumn Types:")
    for dtype, count in df.dtypes.value_counts().items():
        print(f"  {dtype}: {count} columns")

    # Missing values
    print("\n" + "=" * 60)
    print("MISSING VALUES")
    print("=" * 60)
    missing = df.isna().sum()
    missing_pct = (missing / len(df) * 100).round(2)
    missing_df = pd.DataFrame({
        'Missing': missing,
        'Percent': missing_pct
    }).query('Missing > 0').sort_values('Missing', ascending=False)
    if len(missing_df) > 0:
        print(missing_df)
    else:
        print("✅ No missing values found!")

    # Numeric columns
    numeric_cols = df.select_dtypes(include=['number']).columns
    if len(numeric_cols) > 0:
        print("\n" + "=" * 60)
        print("NUMERIC COLUMN SUMMARY")
        print("=" * 60)
        print(df[numeric_cols].describe().T)

    # Categorical columns (top values)
    cat_cols = df.select_dtypes(include=['object', 'category']).columns
    if len(cat_cols) > 0:
        print("\n" + "=" * 60)
        print("CATEGORICAL COLUMNS (Top 5 values)")
        print("=" * 60)
        for col in cat_cols:
            print(f"\n{col} ({df[col].nunique()} unique):")
            print(df[col].value_counts().head())

    print("\n" + "=" * 60)
# Usage
first_look(df)
Automated Data Profiling
While manual exploration is valuable, automated profiling tools can generate comprehensive reports in seconds. These tools analyze every column, detect data types, calculate statistics, identify correlations, and flag potential issues. They are especially useful for large datasets with many columns.
Using ydata-profiling (formerly pandas-profiling)
The ydata-profiling library generates an interactive HTML report with detailed analysis of each variable. It includes distributions, missing values, correlations, and alerts for potential data quality issues.
# Install: pip install ydata-profiling
from ydata_profiling import ProfileReport
# Generate a comprehensive report
profile = ProfileReport(df, title="Sales Data Analysis")
# Save as HTML
profile.to_file("sales_report.html")
# For large datasets, use minimal mode
profile = ProfileReport(df, minimal=True)
Quick Profiling with Pandas
For quick profiling without additional libraries, you can create your own summary function. This approach is lightweight and customizable to your specific needs.
def quick_profile(df):
    """Generate a quick data profile summary."""
    print("=" * 50)
    print(f"Dataset Shape: {df.shape[0]} rows, {df.shape[1]} columns")
    print("=" * 50)
    for col in df.columns:
        print(f"\n{col}")
        print(f"  Type: {df[col].dtype}")
        print(f"  Missing: {df[col].isna().sum()} ({df[col].isna().mean()*100:.1f}%)")
        print(f"  Unique: {df[col].nunique()}")
        if df[col].dtype in ['int64', 'float64']:
            print(f"  Range: [{df[col].min()}, {df[col].max()}]")
quick_profile(df)
Practice: Automated Profiling
Scenario: You have a 50,000-row sales DataFrame and need to generate a quick profiling report for stakeholders.
Task: Write code to generate a ydata-profiling report with settings optimized for large datasets.
Show Solution
from ydata_profiling import ProfileReport

# For large datasets, use minimal mode
profile = ProfileReport(
    sales_df,
    title="Sales Data Profile Report",
    minimal=True,  # Faster for large datasets
    explorative=True
)

# Save to HTML file
profile.to_file("sales_profile_report.html")

# Or display in Jupyter Notebook
profile.to_notebook_iframe()

# For very large datasets (>100K rows), sample first
if len(sales_df) > 100000:
    sample_df = sales_df.sample(n=10000, random_state=42)
    profile = ProfileReport(sample_df, title="Sales Sample Profile")
    profile.to_file("sales_sample_report.html")
You have two DataFrames: train_df (training data) and test_df (test data). You want to compare their distributions to check for data drift.
Task: Write code to generate a comparison report between the two datasets.
Show Solution
from ydata_profiling import ProfileReport
# Generate profiles for both datasets
train_profile = ProfileReport(train_df, title="Training Data")
test_profile = ProfileReport(test_df, title="Test Data")
# Generate comparison report
comparison = train_profile.compare(test_profile)
comparison.to_file("train_test_comparison.html")
# Alternative: Custom comparison function
def compare_datasets(df1, df2, name1="Dataset 1", name2="Dataset 2"):
    """Compare two DataFrames for consistency."""
    print("Shape Comparison:")
    print(f"  {name1}: {df1.shape}")
    print(f"  {name2}: {df2.shape}")

    print("\nColumn Comparison:")
    cols1 = set(df1.columns)
    cols2 = set(df2.columns)
    print(f"  Only in {name1}: {cols1 - cols2}")
    print(f"  Only in {name2}: {cols2 - cols1}")

    print("\nNumeric Column Statistics:")
    common_numeric = df1.select_dtypes(include=['number']).columns.intersection(
        df2.select_dtypes(include=['number']).columns
    )
    for col in common_numeric:
        print(f"\n  {col}:")
        print(f"    {name1} - Mean: {df1[col].mean():.2f}, Std: {df1[col].std():.2f}")
        print(f"    {name2} - Mean: {df2[col].mean():.2f}, Std: {df2[col].std():.2f}")

compare_datasets(train_df, test_df, "Train", "Test")
Task: Create an enhanced version of quick_profile() that:
- Detects potential outliers using IQR method
- Identifies high-cardinality categorical columns
- Flags columns with high missing percentage (>10%)
- Calculates correlation with target variable (if provided)
Show Solution
def enhanced_profile(df, target_col=None):
    """
    Generate an enhanced data profile with quality checks.

    Parameters:
    -----------
    df : pd.DataFrame
        The DataFrame to profile
    target_col : str, optional
        Target column for correlation analysis
    """
    print("=" * 70)
    print("ENHANCED DATA PROFILE")
    print("=" * 70)

    # Basic info
    print(f"\nShape: {df.shape[0]:,} rows × {df.shape[1]} columns")

    issues = []
    # Analyze each column
    for col in df.columns:
        col_issues = []

        # Missing values check
        missing_pct = df[col].isna().mean() * 100
        if missing_pct > 10:
            col_issues.append(f"⚠️ High missing: {missing_pct:.1f}%")

        # Numeric column analysis
        if df[col].dtype in ['int64', 'float64']:
            Q1 = df[col].quantile(0.25)
            Q3 = df[col].quantile(0.75)
            IQR = Q3 - Q1
            lower = Q1 - 1.5 * IQR
            upper = Q3 + 1.5 * IQR
            outlier_count = ((df[col] < lower) | (df[col] > upper)).sum()
            if outlier_count > 0:
                outlier_pct = outlier_count / len(df) * 100
                col_issues.append(f"⚠️ Outliers: {outlier_count} ({outlier_pct:.1f}%)")

        # Categorical column analysis
        if df[col].dtype == 'object':
            cardinality = df[col].nunique()
            if cardinality > 50:
                col_issues.append(f"⚠️ High cardinality: {cardinality} unique")
            if cardinality == len(df):
                col_issues.append("⚠️ Possible ID column (all unique)")

        # Store issues
        if col_issues:
            issues.append((col, col_issues))

    # Print issues summary
    if issues:
        print("\n" + "=" * 70)
        print("DATA QUALITY ISSUES")
        print("=" * 70)
        for col, col_issues in issues:
            print(f"\n{col}:")
            for issue in col_issues:
                print(f"  {issue}")
    else:
        print("\n✅ No major data quality issues detected!")

    # Correlation with target
    if target_col and target_col in df.columns:
        print("\n" + "=" * 70)
        print(f"CORRELATION WITH TARGET: {target_col}")
        print("=" * 70)
        numeric_cols = df.select_dtypes(include=['number']).columns
        correlations = df[numeric_cols].corrwith(df[target_col]).sort_values(ascending=False)
        print(correlations)

    print("\n" + "=" * 70)
# Usage
enhanced_profile(df, target_col='sales')
Data Quality Assessment
Data quality issues can silently corrupt your analysis and models. A thorough quality assessment identifies problems like missing values, duplicates, inconsistent formatting, and impossible values. Catching these issues early saves significant time and prevents misleading results.
Checking for Missing Values
Missing data is one of the most common data quality issues. Understanding the pattern and extent of missing values helps you decide how to handle them. Random missing values are easier to handle than systematic patterns that might indicate data collection problems.
# Count missing values
print(df.isna().sum())
# Percentage missing per column
print((df.isna().sum() / len(df) * 100).round(2))
# Visualize missing patterns
import seaborn as sns
import matplotlib.pyplot as plt
plt.figure(figsize=(10, 6))
sns.heatmap(df.isna(), cbar=True, yticklabels=False)
plt.title('Missing Value Patterns')
plt.show()
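One way to probe whether missingness is random or systematic is to compare other columns across rows where a value is missing versus present. A sketch with an illustrative toy frame:

```python
import pandas as pd

df = pd.DataFrame({
    'income': [None, None, 50000, 60000, 55000, None],
    'age':    [22, 24, 45, 50, 48, 23],
})

# If rows missing 'income' differ sharply on 'age', the
# missingness is likely systematic, not random
grouped = df.groupby(df['income'].isna())['age'].mean()
print(grouped)
# Missing rows average ~23, present rows ~47.7: a strong hint
# that younger respondents skip the income question
```

A large gap between the two group means suggests the values are not missing at random, which rules out naive imputation strategies.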
Detecting Duplicates
Duplicate records can inflate statistics and bias models. Check for exact duplicates and near-duplicates (rows that differ only in one or two columns). The appropriate action depends on whether duplicates represent data errors or legitimate repeated observations.
# Check for duplicate rows
print(f"Duplicate rows: {df.duplicated().sum()}")
# View duplicate rows
print(df[df.duplicated(keep=False)])
# Check duplicates based on specific columns
print(f"Duplicate IDs: {df.duplicated(subset=['customer_id']).sum()}")
# Remove duplicates (keep first occurrence)
df_clean = df.drop_duplicates()
Validating Data Ranges
Domain knowledge helps identify impossible or suspicious values. Ages cannot be negative, percentages should be 0 to 100, and dates should be within expected ranges. Create validation rules based on your understanding of what the data should look like.
# Check for impossible values
print(df[df['age'] < 0]) # Negative ages
print(df[df['price'] < 0]) # Negative prices
print(df[df['percentage'] > 100]) # Over 100%
# Validate date ranges
df['date'] = pd.to_datetime(df['date'])
print(df[(df['date'] < '2020-01-01') | (df['date'] > '2024-12-31')])
# Check categorical consistency
print(df['status'].value_counts()) # Look for typos
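Typos often hide as near-identical labels ('Active', 'active ', 'ACTIVE'). A hedged way to surface them is to normalize case and whitespace, then check which raw labels collapse onto the same normalized form (toy data below):

```python
import pandas as pd

status = pd.Series(['Active', 'active ', 'ACTIVE', 'Inactive', 'inactive'])

# Group raw labels by their normalized form; groups with more than
# one raw spelling point at inconsistent data entry
normalized = status.str.strip().str.lower()
collisions = status.groupby(normalized).unique()
print(collisions[collisions.apply(len) > 1])
```

Each printed group lists every raw spelling of one logical category, which makes the cleanup mapping obvious.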
Practice: Data Quality Assessment
Scenario: You received a messy customer dataset with various missing value patterns. You need to understand and visualize the missingness before deciding on imputation strategies.
Task: Write code to: (1) show missing value counts/percentages sorted by severity, (2) identify columns with >10% missing that need attention, and (3) create a horizontal bar chart visualization.
Show Solution
import pandas as pd
import matplotlib.pyplot as plt
def analyze_missing_values(df, threshold=10):
    """Comprehensive missing value analysis."""
    # Calculate statistics
    missing_count = df.isna().sum()
    missing_pct = (missing_count / len(df) * 100).round(2)

    # Create summary DataFrame
    summary = pd.DataFrame({
        'Missing Count': missing_count,
        'Missing %': missing_pct,
        'Data Type': df.dtypes
    })

    # Filter and sort
    summary = summary[summary['Missing Count'] > 0]
    summary = summary.sort_values('Missing Count', ascending=False)

    print("MISSING VALUE REPORT")
    print("="*50)
    print(summary)

    # Flag high-missing columns
    high_missing = summary[summary['Missing %'] > threshold]
    if len(high_missing) > 0:
        print(f"\n⚠️ ATTENTION NEEDED (>{threshold}% missing):")
        print(list(high_missing.index))

    # Visualization
    if len(summary) > 0:
        plt.figure(figsize=(10, max(4, len(summary)*0.4)))
        colors = ['red' if x > threshold else 'steelblue'
                  for x in summary['Missing %']]
        plt.barh(summary.index, summary['Missing %'], color=colors)
        plt.xlabel('Missing Percentage')
        plt.title('Missing Values by Column')
        plt.axvline(x=threshold, color='red', linestyle='--',
                    label=f'{threshold}% threshold')
        plt.legend()
        plt.gca().invert_yaxis()
        plt.tight_layout()
        plt.show()

    return summary
# Usage
analyze_missing_values(df, threshold=10)
Your order DataFrame might have duplicate orders. You need to:
- Find exact duplicate rows
- Find duplicate order_ids (same order entered multiple times)
- Identify orders with same customer_id and date (potential duplicates)
- Remove duplicates keeping the most recent entry
Show Solution
# 1. Exact duplicate rows
exact_dupes = df[df.duplicated(keep=False)]
print(f"Exact duplicates: {len(exact_dupes)}")
# 2. Duplicate order_ids
dupe_orders = df[df.duplicated(subset=['order_id'], keep=False)]
print(f"Duplicate order_ids: {len(dupe_orders)}")
print(dupe_orders.sort_values('order_id'))
# 3. Same customer and date (potential duplicates)
potential_dupes = df[df.duplicated(subset=['customer_id', 'order_date'], keep=False)]
print(f"Same customer/date: {len(potential_dupes)}")
print(potential_dupes.sort_values(['customer_id', 'order_date']))
# 4. Remove duplicates, keeping most recent
# First, ensure order_date is datetime
df['order_date'] = pd.to_datetime(df['order_date'])
# Sort by order_id and date (newest first), then keep first
df_deduped = df.sort_values(['order_id', 'order_date'], ascending=[True, False])
df_deduped = df_deduped.drop_duplicates(subset=['order_id'], keep='first')
print(f"Original rows: {len(df)}")
print(f"After deduplication: {len(df_deduped)}")
print(f"Removed: {len(df) - len(df_deduped)}")
Your employee DataFrame should follow these business rules:
- Age must be between 18 and 100
- Salary must be positive
- Start_date must be before today
- Department must be one of: ['Sales', 'Engineering', 'HR', 'Marketing', 'Finance']
Task: Write code to find all rows that violate any of these rules.
Show Solution
import pandas as pd
from datetime import datetime
# Define valid departments
valid_departments = ['Sales', 'Engineering', 'HR', 'Marketing', 'Finance']
# Convert date column
df['start_date'] = pd.to_datetime(df['start_date'])
today = datetime.now()
# Create validation masks
invalid_age = (df['age'] < 18) | (df['age'] > 100)
invalid_salary = df['salary'] <= 0
invalid_date = df['start_date'] > today
invalid_dept = ~df['department'].isin(valid_departments)
# Find all violations
violations = pd.DataFrame({
    'Invalid Age': invalid_age,
    'Invalid Salary': invalid_salary,
    'Future Start Date': invalid_date,
    'Invalid Department': invalid_dept
})

# Any violation
any_violation = violations.any(axis=1)
print(f"Rows with violations: {any_violation.sum()}")

# Show detailed violations
violations_df = df[any_violation].copy()
violations_df['Violations'] = violations.apply(
    lambda row: ', '.join(violations.columns[row]), axis=1
)
print(violations_df[['employee_id', 'age', 'salary', 'start_date', 'department', 'Violations']])

# Summary by violation type
print("\nViolation Summary:")
for col in violations.columns:
    count = violations[col].sum()
    if count > 0:
        print(f"  {col}: {count} rows")
Task: Create a comprehensive data_quality_report() function that generates a complete data quality assessment including:
- Completeness score (percentage of non-null values)
- Uniqueness score (for potential ID columns)
- Validity score (based on custom rules)
- Consistency checks (format consistency within columns)
- Overall data quality score
Show Solution
def data_quality_report(df, id_columns=None, rules=None):
    """
    Generate a comprehensive data quality report.

    Parameters:
    -----------
    df : pd.DataFrame
    id_columns : list, optional - Columns that should be unique
    rules : dict, optional - Validation rules {column: function}

    Returns:
    --------
    dict : Quality scores and issues
    """
    report = {
        'scores': {},
        'issues': [],
        'summary': {}
    }

    # 1. COMPLETENESS
    completeness_scores = (1 - df.isna().mean()) * 100
    overall_completeness = completeness_scores.mean()
    report['scores']['completeness'] = overall_completeness
    for col in df.columns:
        if completeness_scores[col] < 95:
            report['issues'].append({
                'type': 'completeness',
                'column': col,
                'score': completeness_scores[col],
                'message': f'{col}: {100-completeness_scores[col]:.1f}% missing'
            })

    # 2. UNIQUENESS (for ID columns)
    if id_columns:
        uniqueness_scores = []
        for col in id_columns:
            unique_ratio = df[col].nunique() / len(df) * 100
            uniqueness_scores.append(unique_ratio)
            if unique_ratio < 100:
                dupes = len(df) - df[col].nunique()
                report['issues'].append({
                    'type': 'uniqueness',
                    'column': col,
                    'score': unique_ratio,
                    'message': f'{col}: {dupes} duplicate values'
                })
        report['scores']['uniqueness'] = sum(uniqueness_scores) / len(uniqueness_scores) if uniqueness_scores else 100
    else:
        report['scores']['uniqueness'] = 100

    # 3. VALIDITY (custom rules)
    if rules:
        validity_scores = []
        for col, rule_func in rules.items():
            if col in df.columns:
                valid = df[col].apply(rule_func)
                validity_score = valid.mean() * 100
                validity_scores.append(validity_score)
                if validity_score < 100:
                    invalid_count = (~valid).sum()
                    report['issues'].append({
                        'type': 'validity',
                        'column': col,
                        'score': validity_score,
                        'message': f'{col}: {invalid_count} invalid values'
                    })
        report['scores']['validity'] = sum(validity_scores) / len(validity_scores) if validity_scores else 100
    else:
        report['scores']['validity'] = 100

    # 4. CONSISTENCY (format checks for string columns)
    for col in df.select_dtypes(include=['object']).columns:
        # Check for mixed case
        if df[col].dropna().str.lower().nunique() < df[col].nunique():
            report['issues'].append({
                'type': 'consistency',
                'column': col,
                'message': f'{col}: Mixed case values detected'
            })
        # Check for leading/trailing spaces
        has_spaces = df[col].dropna().str.strip() != df[col].dropna()
        if has_spaces.any():
            report['issues'].append({
                'type': 'consistency',
                'column': col,
                'message': f'{col}: {has_spaces.sum()} values with extra spaces'
            })
    report['scores']['consistency'] = 100 - min(len([i for i in report['issues'] if i['type'] == 'consistency']) * 5, 50)

    # 5. OVERALL SCORE
    weights = {'completeness': 0.3, 'uniqueness': 0.2, 'validity': 0.3, 'consistency': 0.2}
    overall = sum(report['scores'][k] * weights[k] for k in weights)
    report['scores']['overall'] = overall

    # Print report
    print("=" * 60)
    print("DATA QUALITY REPORT")
    print("=" * 60)
    print(f"\nOverall Quality Score: {overall:.1f}%")
    print(f"\n  Completeness: {report['scores']['completeness']:.1f}%")
    print(f"  Uniqueness: {report['scores']['uniqueness']:.1f}%")
    print(f"  Validity: {report['scores']['validity']:.1f}%")
    print(f"  Consistency: {report['scores']['consistency']:.1f}%")
    if report['issues']:
        print(f"\n⚠️ Issues Found: {len(report['issues'])}")
        for issue in report['issues']:
            print(f"  - {issue['message']}")
    return report

# Usage example
rules = {
    'age': lambda x: pd.isna(x) or (18 <= x <= 100),
    'email': lambda x: pd.isna(x) or '@' in str(x),
    'salary': lambda x: pd.isna(x) or x > 0
}
report = data_quality_report(
    df,
    id_columns=['employee_id'],
    rules=rules
)
Key Takeaways
EDA First
Always explore your data before building models to avoid flawed assumptions
Systematic Approach
Follow a structured workflow: understand, clean, explore, visualize, document
Start Simple
Use shape, dtypes, describe(), and head() for immediate insights
Automate When Possible
Use profiling tools like ydata-profiling for comprehensive automated reports
Check Quality Early
Identify missing values, duplicates, and invalid data before analysis
Document Findings
Record insights and decisions for reproducibility and communication
Knowledge Check
Test your understanding of EDA concepts:
What is the primary purpose of Exploratory Data Analysis (EDA)?
Which pandas method provides statistical summaries including mean, std, min, and max?
In describe() output, what does a large difference between mean and median (50%) typically indicate?
Which method would you use to check for duplicate rows in a DataFrame?
What is ydata-profiling (formerly pandas-profiling) used for?
Which of these is NOT a typical data quality issue to check during EDA?