Numerical Variable Distributions
Univariate analysis begins with understanding how individual numerical variables are distributed. By examining the shape, center, and spread of each variable, you can identify patterns, detect anomalies, and make informed decisions about data preprocessing before modeling.
What is Univariate Analysis?
Univariate analysis examines one variable at a time. For numerical variables, we explore the distribution shape (symmetric, skewed), central tendency (mean, median), and spread (range, standard deviation).
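Each of these aspects is a single pandas call. A minimal sketch on synthetic data (illustrative only):

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed sample (illustrative only)
rng = np.random.default_rng(0)
values = pd.Series(rng.exponential(scale=2.0, size=1000))

print(f"Mean:   {values.mean():.2f}")   # center, pulled up by the right tail
print(f"Median: {values.median():.2f}") # robust center
print(f"Std:    {values.std():.2f}")    # spread
print(f"Skew:   {values.skew():.2f}")   # shape: positive means right skew
```

Note that for right-skewed data like this, the mean sits above the median.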
Histograms - The Foundation
Histograms divide numerical data into bins and show the frequency of values in each bin. They reveal the overall shape of your data distribution.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Load sample data
df = pd.DataFrame({
'age': np.random.normal(35, 10, 1000),
'income': np.random.exponential(50000, 1000),
'score': np.random.uniform(0, 100, 1000)
})
# Basic histogram with Matplotlib
plt.figure(figsize=(10, 4))
plt.hist(df['age'], bins=30, edgecolor='black', alpha=0.7)
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.title('Age Distribution')
plt.show()
Seaborn histplot - Enhanced Histograms
Seaborn's histplot() provides more options including kernel density estimation (KDE) overlay.
# Histogram with KDE overlay
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
# Normal distribution (age)
sns.histplot(df['age'], kde=True, ax=axes[0], color='steelblue')
axes[0].set_title('Normal: Age')
# Right-skewed distribution (income)
sns.histplot(df['income'], kde=True, ax=axes[1], color='coral')
axes[1].set_title('Right-Skewed: Income')
# Uniform distribution (score)
sns.histplot(df['score'], kde=True, ax=axes[2], color='seagreen')
axes[2].set_title('Uniform: Score')
plt.tight_layout()
plt.show()
Distribution Statistics with describe()
Combine visual analysis with numerical summaries to fully understand your distributions.
# Comprehensive statistics
print(df['income'].describe())
# Additional distribution metrics (skew and kurtosis are built into pandas)
print(f"\nSkewness: {df['income'].skew():.3f}") # Positive = right skew
print(f"Kurtosis: {df['income'].kurtosis():.3f}") # High = heavy tails
# Percentiles for deeper understanding
percentiles = [1, 5, 10, 25, 50, 75, 90, 95, 99]
print(f"\nPercentiles:")
for p in percentiles:
    print(f"  {p}th: {np.percentile(df['income'], p):,.0f}")
Box Plots for Quick Summary
Box plots show the five-number summary (minimum, Q1, median, Q3, maximum). The whiskers extend to the most extreme points within 1.5*IQR of the quartiles, and anything beyond them is drawn as an individual outlier point.
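The five numbers themselves can be pulled directly with quantile(), a quick sketch on a small sample:

```python
import pandas as pd

data = pd.Series([10, 12, 14, 15, 16, 18, 20, 22, 100])

# The five numbers a box plot is built from
five_num = data.quantile([0.0, 0.25, 0.5, 0.75, 1.0])
print(five_num)
```

In a box plot of this sample, the 100 would appear as an outlier point beyond the upper whisker.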
# Box plot comparison
fig, ax = plt.subplots(figsize=(10, 5))
df.boxplot(column=['age', 'score'], ax=ax)
plt.title('Box Plot Comparison')
plt.ylabel('Value')
plt.show()
# Horizontal box plot with Seaborn
plt.figure(figsize=(10, 3))
sns.boxplot(x=df['income'], color='coral')
plt.title('Income Distribution (Box Plot)')
plt.xlabel('Income ($)')
plt.show()
Tip: Use sns.violinplot() to see both the box plot summary and the full distribution shape combined.
Practice: Numerical Distributions
Task: Given income data with right skew: income = np.random.exponential(50000, 1000), create a histogram with KDE overlay using seaborn and calculate the mean, median, and skewness.
Show Solution
import seaborn as sns
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
income = np.random.exponential(50000, 1000)
# Create histogram with KDE
sns.histplot(income, kde=True, color='coral')
plt.xlabel('Income ($)')
plt.title('Income Distribution')
# Calculate statistics
print(f"Mean: ${np.mean(income):,.0f}")
print(f"Median: ${np.median(income):,.0f}")
print(f"Skewness: {pd.Series(income).skew():.2f}")
plt.show()
Task: Create a 1x3 subplot showing: (1) normal distribution, (2) right-skewed exponential, (3) uniform distribution. Each should have KDE overlay and display skewness in the title.
Show Solution
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
np.random.seed(42)
normal = np.random.normal(50, 10, 1000)
exponential = np.random.exponential(50, 1000)
uniform = np.random.uniform(0, 100, 1000)
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for ax, data, name in zip(axes, [normal, exponential, uniform],
                          ['Normal', 'Exponential', 'Uniform']):
    sns.histplot(data, kde=True, ax=ax)
    skew = pd.Series(data).skew()
    ax.set_title(f'{name} (skew: {skew:.2f})')
plt.tight_layout()
plt.show()
Categorical Variable Frequencies
Categorical variables contain discrete values representing groups or categories. Understanding the distribution of categories helps identify class imbalance, dominant groups, and rare categories that may need special handling during modeling.
Value Counts - The Essential Method
The value_counts() method is your primary tool for exploring categorical variables. It returns frequencies sorted from most to least common.
import pandas as pd
# Sample categorical data
df = pd.DataFrame({
'department': ['Sales', 'IT', 'HR', 'Sales', 'IT', 'Sales',
'Marketing', 'IT', 'Sales', 'HR'] * 100,
'level': ['Junior', 'Senior', 'Junior', 'Mid', 'Senior',
'Mid', 'Junior', 'Mid', 'Senior', 'Junior'] * 100,
'status': ['Active', 'Active', 'Active', 'Inactive', 'Active',
'Active', 'Active', 'Active', 'Active', 'Active'] * 100
})
# Basic value counts
print(df['department'].value_counts())
# Sales 400
# IT 300
# HR 200
# Marketing 100
# As percentages (proportions)
print(df['department'].value_counts(normalize=True))
# Sales 0.40
# IT 0.30
# HR 0.20
# Marketing 0.10
Bar Plots for Categories
Bar plots visualize category frequencies. Use horizontal bars for categories with long names.
import matplotlib.pyplot as plt
import seaborn as sns
# Vertical bar plot
plt.figure(figsize=(8, 5))
df['department'].value_counts().plot(kind='bar', color='steelblue', edgecolor='black')
plt.title('Employees by Department')
plt.xlabel('Department')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.show()
# Seaborn countplot (automatic counting)
plt.figure(figsize=(8, 5))
sns.countplot(data=df, x='department', order=df['department'].value_counts().index)
plt.title('Department Distribution')
plt.show()
Identifying Class Imbalance
Class imbalance occurs when some categories are much more frequent than others. This is critical for classification problems.
# Check for imbalance
status_counts = df['status'].value_counts()
print(status_counts)
# Active 900
# Inactive 100
# Calculate imbalance ratio
majority = status_counts.max()
minority = status_counts.min()
imbalance_ratio = majority / minority
print(f"Imbalance ratio: {imbalance_ratio:.1f}:1") # 9:1
# Visualize imbalance
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
# Absolute counts
status_counts.plot(kind='bar', ax=axes[0], color=['green', 'red'])
axes[0].set_title('Absolute Counts')
axes[0].set_ylabel('Count')
# Percentage
status_counts.plot(kind='pie', ax=axes[1], autopct='%1.1f%%', colors=['green', 'red'])
axes[1].set_title('Percentage Distribution')
axes[1].set_ylabel('')
plt.tight_layout()
plt.show()
Handling Rare Categories
Categories with very few observations can cause issues. Consider grouping them into an "Other" category.
# Identify rare categories (below a frequency threshold; 15% here so this
# small example flags Marketing at 10%)
threshold = 0.15
value_pcts = df['department'].value_counts(normalize=True)
rare_cats = value_pcts[value_pcts < threshold].index.tolist()
print(f"Rare categories: {rare_cats}")
# Group rare categories into 'Other'
df['department_grouped'] = df['department'].apply(
    lambda x: 'Other' if x in rare_cats else x
)
# Verify grouping
print(df['department_grouped'].value_counts())
Tip: Before grouping, use df['col'].nunique() to check how many unique categories a column contains.
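A trivial sketch of the cardinality check:

```python
import pandas as pd

s = pd.Series(['a', 'b', 'a', 'c', 'a'])
print(s.nunique())  # 3 unique categories
```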
Practice: Categorical Analysis
Task: Given product categories: categories = ['Electronics', 'Clothing', 'Electronics', 'Food', 'Clothing', 'Electronics', 'Food', 'Clothing', 'Electronics', 'Other'], create a vertical bar chart with counts displayed on each bar.
Show Solution
import pandas as pd
import matplotlib.pyplot as plt
categories = ['Electronics', 'Clothing', 'Electronics', 'Food',
'Clothing', 'Electronics', 'Food', 'Clothing',
'Electronics', 'Other']
counts = pd.Series(categories).value_counts()
plt.figure(figsize=(8, 5))
bars = plt.bar(counts.index, counts.values, color='steelblue')
# Add count labels on bars
for bar, count in zip(bars, counts.values):
    plt.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.1,
             str(count), ha='center', fontsize=12)
plt.xlabel('Category')
plt.ylabel('Count')
plt.title('Product Category Distribution')
plt.show()
Task: Given city data with 50 cities, keep the top 5 most frequent cities and group the rest as "Other". Create a horizontal bar chart showing percentage of each group.
Show Solution
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Simulate city data (50 cities, some more frequent)
np.random.seed(42)
cities = np.random.choice(
['NYC', 'LA', 'Chicago', 'Houston', 'Phoenix'] +
[f'City_{i}' for i in range(45)],
size=1000, p=[0.15, 0.12, 0.1, 0.08, 0.07] + [0.48/45]*45
)
df = pd.DataFrame({'city': cities})
# Keep top 5, group rest as Other
top_5 = df['city'].value_counts().nlargest(5).index
df['city_grouped'] = df['city'].apply(
lambda x: x if x in top_5 else 'Other'
)
# Calculate percentages
pct = df['city_grouped'].value_counts(normalize=True) * 100
# Plot horizontal bar chart
plt.figure(figsize=(10, 5))
bars = plt.barh(pct.index, pct.values, color='teal')
for bar, p in zip(bars, pct.values):
    plt.text(bar.get_width() + 0.5, bar.get_y() + bar.get_height()/2,
             f'{p:.1f}%', va='center')
plt.xlabel('Percentage')
plt.title('City Distribution (Top 5 + Other)')
plt.show()
Outlier Detection Methods
Outliers are data points that significantly differ from other observations. They can be legitimate extreme values, data entry errors, or measurement issues. Detecting and properly handling outliers is crucial for accurate analysis and model performance.
When to Keep Outliers
- Legitimate extreme values (e.g., CEO salary)
- Important for understanding data range
- Using robust statistical methods
- Tree-based models (handle outliers well)
When to Remove/Transform
- Data entry errors or measurement issues
- Using mean-sensitive methods
- Linear regression (sensitive to outliers)
- Distance-based algorithms (KNN, K-Means)
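Between the two extremes, a common compromise is clipping (winsorizing): cap extreme values at percentile bounds instead of dropping rows. A minimal sketch with pandas' clip():

```python
import pandas as pd

salaries = pd.Series([35000, 42000, 48000, 52000, 55000, 61000, 320000])

# Cap values at the 5th and 95th percentiles instead of dropping rows
lower, upper = salaries.quantile([0.05, 0.95])
clipped = salaries.clip(lower=lower, upper=upper)

print(f"Max before: {salaries.max():,.0f}")
print(f"Max after:  {clipped.max():,.0f}")
```

This keeps every observation while limiting the influence of the extremes on mean-sensitive methods.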
Method 1: IQR (Interquartile Range)
The IQR method defines outliers as values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR. This is the method used in box plots.
import numpy as np
import pandas as pd
# Sample data with outliers
data = pd.Series([10, 12, 14, 15, 16, 18, 20, 22, 100, 5, 150])
# Calculate IQR bounds
Q1 = data.quantile(0.25)
Q3 = data.quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
print(f"Q1: {Q1}, Q3: {Q3}, IQR: {IQR}")
print(f"Lower bound: {lower_bound}")
print(f"Upper bound: {upper_bound}")
# Identify outliers
outliers = data[(data < lower_bound) | (data > upper_bound)]
print(f"\nOutliers: {outliers.values}") # [100, 150]
# Filter outliers
clean_data = data[(data >= lower_bound) & (data <= upper_bound)]
print(f"Clean data: {clean_data.values}")
Method 2: Z-Score
Z-score measures how many standard deviations a value is from the mean. Values with |Z| > 3 are typically considered outliers.
from scipy import stats
# Calculate Z-scores
z_scores = stats.zscore(data)
print("Z-scores:")
for val, z in zip(data, z_scores):
    print(f"  {val}: z = {z:.2f}")
# Identify outliers (|z| > 3)
threshold = 3
outlier_mask = np.abs(z_scores) > threshold
outliers = data[outlier_mask]
print(f"\nOutliers (|z| > 3): {outliers.values}")
# Note: nothing is flagged on this small sample, because the extreme values
# inflate the standard deviation itself (masking). Z-scores work best on
# larger samples; compare the robust method below.
# Alternative: Using pandas
data_mean = data.mean()
data_std = data.std()
z_scores_manual = (data - data_mean) / data_std
Method 3: Modified Z-Score (Robust)
Uses median and MAD (Median Absolute Deviation) instead of mean and std, making it more robust to outliers.
# Modified Z-score using MAD
median = data.median()
mad = np.median(np.abs(data - median))
modified_z = 0.6745 * (data - median) / mad
print("Modified Z-scores:")
for val, mz in zip(data, modified_z):
    print(f"  {val}: modified_z = {mz:.2f}")
# Outliers where |modified_z| > 3.5
outliers = data[np.abs(modified_z) > 3.5]
print(f"\nOutliers: {outliers.values}")
Visualizing Outliers
Box plots automatically show outliers. Combine with other visualizations for comprehensive outlier analysis.
import matplotlib.pyplot as plt
import seaborn as sns
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
# Box plot - shows outliers as points
axes[0].boxplot(data)
axes[0].set_title('Box Plot (outliers as dots)')
axes[0].set_ylabel('Value')
# Histogram with outlier markers
axes[1].hist(data, bins=15, edgecolor='black')
axes[1].axvline(lower_bound, color='red', linestyle='--', label='IQR bounds')
axes[1].axvline(upper_bound, color='red', linestyle='--')
axes[1].set_title('Histogram with IQR Bounds')
axes[1].legend()
# Scatter plot to see outliers in context
axes[2].scatter(range(len(data)), data, c=['red' if x in outliers.values else 'blue' for x in data])
axes[2].set_title('Values (red = outliers)')
axes[2].set_xlabel('Index')
plt.tight_layout()
plt.show()
Practice: Outlier Detection
Task: Given salaries: salaries = [35000, 42000, 45000, 48000, 52000, 55000, 61000, 250000, 320000], identify outliers using IQR method and print the exact values that are outliers.
Show Solution
import pandas as pd
salaries = pd.Series([35000, 42000, 45000, 48000, 52000,
55000, 61000, 250000, 320000])
Q1 = salaries.quantile(0.25)
Q3 = salaries.quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
print(f"Q1: ${Q1:,.0f}, Q3: ${Q3:,.0f}, IQR: ${IQR:,.0f}")
print(f"Lower bound: ${lower_bound:,.0f}")
print(f"Upper bound: ${upper_bound:,.0f}")
outliers = salaries[(salaries < lower_bound) | (salaries > upper_bound)]
print(f"\nOutliers found: {outliers.tolist()}")
# Output: [250000, 320000]
Task: Generate temperature data with temps = np.append(np.random.normal(72, 5, 100), [32, 110, 115]). Detect outliers using both IQR and Z-score (threshold=3) methods. Print how many outliers each method finds and which values they flag.
Show Solution
import numpy as np
import pandas as pd
from scipy import stats
np.random.seed(42)
temps = pd.Series(np.append(np.random.normal(72, 5, 100), [32, 110, 115]))
# IQR Method
Q1, Q3 = temps.quantile([0.25, 0.75])
IQR = Q3 - Q1
iqr_lower = Q1 - 1.5 * IQR
iqr_upper = Q3 + 1.5 * IQR
iqr_outliers = temps[(temps < iqr_lower) | (temps > iqr_upper)]
# Z-Score Method
z_scores = np.abs(stats.zscore(temps))
zscore_outliers = temps[z_scores > 3]
print("IQR Method:")
print(f" Bounds: [{iqr_lower:.1f}, {iqr_upper:.1f}]")
print(f" Outliers ({len(iqr_outliers)}): {sorted(iqr_outliers.values)}")
print("\nZ-Score Method (threshold=3):")
print(f" Outliers ({len(zscore_outliers)}): {sorted(zscore_outliers.values)}")
print(f"\nBoth methods agree on: {set(iqr_outliers) & set(zscore_outliers)}")
Task: Using prices = [10, 12, 14, 15, 16, 18, 19, 20, 21, 85, 90], create a horizontal box plot that clearly shows the outliers. Add a title "Price Distribution with Outliers".
Show Solution
import matplotlib.pyplot as plt
prices = [10, 12, 14, 15, 16, 18, 19, 20, 21, 85, 90]
plt.figure(figsize=(10, 3))
plt.boxplot(prices, vert=False, patch_artist=True,
boxprops=dict(facecolor='lightblue'),
flierprops=dict(marker='o', markerfacecolor='red', markersize=10))
plt.xlabel('Price ($)')
plt.title('Price Distribution with Outliers')
plt.show()
# The red dots on the right are the outliers (85, 90)
Data Transformation Techniques
Transformations help normalize skewed distributions, stabilize variance, and make data more suitable for analysis. Different transformations work best for different types of skewness.
Measuring Skewness
import pandas as pd
import numpy as np
# Create sample skewed data
right_skewed = pd.Series(np.random.exponential(scale=2, size=1000))
left_skewed = pd.Series(-np.random.exponential(scale=2, size=1000) + 10)
normal = pd.Series(np.random.normal(loc=50, scale=10, size=1000))
print(f"Right skewed: {right_skewed.skew():.2f}") # ~2.0
print(f"Left skewed: {left_skewed.skew():.2f}") # ~-2.0
print(f"Normal: {normal.skew():.2f}") # ~0.0
# Interpretation
# |skew| < 0.5: approximately symmetric
# 0.5 < |skew| < 1: moderately skewed
# |skew| > 1: highly skewed
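The interpretation rules above can be wrapped in a small helper (describe_skew is a hypothetical name, not a pandas API):

```python
import numpy as np
import pandas as pd

def describe_skew(series: pd.Series) -> str:
    """Classify skewness with the common rule-of-thumb thresholds."""
    s = series.skew()
    if abs(s) < 0.5:
        label = "approximately symmetric"
    elif abs(s) < 1:
        label = "moderately skewed"
    else:
        label = "highly skewed"
    direction = "right" if s > 0 else "left"
    return f"skew={s:.2f}: {label} ({direction} tail)"

rng = np.random.default_rng(42)
print(describe_skew(pd.Series(rng.exponential(2, 1000))))   # highly skewed
print(describe_skew(pd.Series(rng.normal(50, 10, 1000))))   # approximately symmetric
```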
Log Transformation
Best for right-skewed (positively skewed) data. Reduces the impact of large values and compresses the right tail.
# Log transformation for right-skewed data
# (example DataFrame; any right-skewed non-negative column works)
df = pd.DataFrame({'salary': np.random.exponential(50000, 1000)})
df['salary_log'] = np.log1p(df['salary'])  # log1p handles zeros
# Compare before and after
print(f"Original skewness: {df['salary'].skew():.2f}")
print(f"Log-transformed skewness: {df['salary_log'].skew():.2f}")
# Note: Use np.log1p() for data with zeros
# np.log(x) fails for x <= 0
# np.log1p(x) = log(1 + x), safe for x >= 0
# To reverse: np.expm1(x) = exp(x) - 1
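A quick round-trip check confirms that log1p and expm1 are exact inverses, so transformed values (e.g., model predictions) can safely be mapped back:

```python
import numpy as np

x = np.array([0.0, 1.0, 10.0, 50000.0])

transformed = np.log1p(x)          # log(1 + x), defined at x = 0
recovered = np.expm1(transformed)  # exp(y) - 1 undoes it

print(np.allclose(recovered, x))  # True
```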
Square Root Transformation
Milder than log transformation, good for count data and moderately skewed distributions.
# Square root transformation (example: right-skewed count data)
df['count'] = np.random.poisson(3, 1000)
df['count_sqrt'] = np.sqrt(df['count'])
# Compare skewness
print(f"Original: {df['count'].skew():.2f}")
print(f"Sqrt transformed: {df['count_sqrt'].skew():.2f}")
# Good for: count data, moderately right-skewed data
# Cannot handle negative values
Box-Cox Transformation
Automatically finds the optimal transformation parameter (lambda) to maximize normality.
from scipy import stats
# Box-Cox requires positive values
data = df['salary'][df['salary'] > 0]
# Find optimal lambda
transformed_data, lambda_val = stats.boxcox(data)
print(f"Optimal lambda: {lambda_val:.2f}")
print(f"Original skewness: {data.skew():.2f}")
print(f"Transformed skewness: {pd.Series(transformed_data).skew():.2f}")
# Lambda interpretation:
# lambda = 1: No transformation
# lambda = 0: Log transformation
# lambda = 0.5: Square root transformation
# lambda = -1: Reciprocal transformation
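To map Box-Cox-transformed values (e.g., model predictions) back to the original scale, scipy provides scipy.special.inv_boxcox. A sketch of the round trip:

```python
import numpy as np
from scipy import stats
from scipy.special import inv_boxcox

rng = np.random.default_rng(0)
data = rng.exponential(scale=2.0, size=500) + 1  # strictly positive, right-skewed

# Forward transform returns the transformed values and the fitted lambda
transformed, lam = stats.boxcox(data)

# Inverse transform maps values back to the original scale
recovered = inv_boxcox(transformed, lam)
print(np.allclose(recovered, data))  # True
```

Keep the fitted lambda around: it is required to invert the transformation.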
Comparing Transformations
import matplotlib.pyplot as plt
# Create comparison of transformations
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
# Original data
axes[0, 0].hist(df['salary'], bins=30, edgecolor='black')
axes[0, 0].set_title(f"Original (skew: {df['salary'].skew():.2f})")
# Log transform
log_data = np.log1p(df['salary'])
axes[0, 1].hist(log_data, bins=30, edgecolor='black', color='green')
axes[0, 1].set_title(f"Log Transform (skew: {log_data.skew():.2f})")
# Square root
sqrt_data = np.sqrt(df['salary'])
axes[1, 0].hist(sqrt_data, bins=30, edgecolor='black', color='orange')
axes[1, 0].set_title(f"Square Root (skew: {sqrt_data.skew():.2f})")
# Box-Cox
boxcox_data, _ = stats.boxcox(df['salary'][df['salary'] > 0])
axes[1, 1].hist(boxcox_data, bins=30, edgecolor='black', color='purple')
axes[1, 1].set_title(f"Box-Cox (skew: {pd.Series(boxcox_data).skew():.2f})")
plt.tight_layout()
plt.show()
| Transformation | Best For | Handles Zeros? | Handles Negatives? |
|---|---|---|---|
| np.log1p() | Right-skewed data | ✅ Yes | ❌ No |
| np.sqrt() | Moderate skew, counts | ✅ Yes | ❌ No |
| stats.boxcox() | Any right-skew (auto) | ❌ No | ❌ No |
| np.power(x, 2) | Left-skewed data | ✅ Yes | ✅ Yes |
Practice: Data Transformations
Task: Generate right-skewed income data with income = np.random.exponential(50000, 1000). Apply log1p transformation and create a 1x2 subplot showing histograms before and after, with skewness values in titles.
Show Solution
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
np.random.seed(42)
income = pd.Series(np.random.exponential(50000, 1000))
# Apply log transformation
income_log = np.log1p(income)
# Create comparison plot
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
sns.histplot(income, kde=True, ax=axes[0], color='coral')
axes[0].set_title(f'Original Income (skew: {income.skew():.2f})')
axes[0].set_xlabel('Income ($)')
sns.histplot(income_log, kde=True, ax=axes[1], color='seagreen')
axes[1].set_title(f'Log-Transformed (skew: {income_log.skew():.2f})')
axes[1].set_xlabel('Log(Income + 1)')
plt.tight_layout()
plt.show()
Task: Generate house prices with prices = np.random.exponential(300000, 500) + 50000. Apply Box-Cox transformation, print the optimal lambda value, and compare skewness of original, log, sqrt, and Box-Cox transformations.
Show Solution
import numpy as np
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt
np.random.seed(42)
prices = pd.Series(np.random.exponential(300000, 500) + 50000)
# Apply transformations
log_prices = np.log(prices)
sqrt_prices = np.sqrt(prices)
boxcox_prices, optimal_lambda = stats.boxcox(prices)
boxcox_prices = pd.Series(boxcox_prices)
# Compare skewness
print("Transformation Comparison:")
print(f" Original: skew = {prices.skew():.3f}")
print(f" Log: skew = {log_prices.skew():.3f}")
print(f" Sqrt: skew = {sqrt_prices.skew():.3f}")
print(f" Box-Cox: skew = {boxcox_prices.skew():.3f}")
print(f"\nOptimal Box-Cox lambda: {optimal_lambda:.3f}")
# Note: lambda ≈ 0 means log is best, lambda ≈ 0.5 means sqrt
if abs(optimal_lambda) < 0.1:
    print("Lambda near 0 → log transform is optimal")
elif abs(optimal_lambda - 0.5) < 0.1:
    print("Lambda near 0.5 → sqrt transform is optimal")
Key Takeaways
Distribution Shape
Histograms and KDE plots reveal skewness, modality, and spread of numerical variables
Category Balance
Value counts and bar plots show class imbalance and dominant categories
IQR Method
Values outside Q1 - 1.5*IQR to Q3 + 1.5*IQR are potential outliers
Z-Score Method
Values with absolute Z-score greater than 3 are typically considered outliers
Log Transform
Reduces right skew and compresses large values for better model performance
Box-Cox Transform
Automatically finds optimal power transformation for normality
Knowledge Check
Test your understanding of univariate analysis techniques.
Which plot is best for visualizing the distribution of a single numerical variable?
What does the IQR method use as outlier boundaries?
Which transformation is best for reducing right skewness?
What Pandas method gives value counts for categorical variables?
A Z-score of -2.5 indicates the value is:
Which visualization shows outliers as individual points beyond the whiskers?