Numerical Variable Distributions
Univariate analysis begins with understanding how individual numerical variables are distributed. By examining the shape, center, and spread of each variable, you can identify patterns, detect anomalies, and make informed decisions about data preprocessing before modeling.
What is Univariate Analysis?
Univariate analysis examines one variable at a time. For numerical variables, we explore the distribution shape (symmetric, skewed), central tendency (mean, median), and spread (range, standard deviation).
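Each of these aspects is a single pandas call. A minimal sketch on synthetic data (illustrative only):

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed sample (illustrative only)
rng = np.random.default_rng(0)
values = pd.Series(rng.exponential(scale=2.0, size=1000))

print(f"Mean:   {values.mean():.2f}")   # center, pulled up by the right tail
print(f"Median: {values.median():.2f}") # robust center
print(f"Std:    {values.std():.2f}")    # spread
print(f"Skew:   {values.skew():.2f}")   # shape: positive means right skew
```

Note that for right-skewed data like this, the mean sits above the median.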
Histograms - The Foundation
Histograms divide numerical data into bins and show the frequency of values in each bin. They reveal the overall shape of your data distribution.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Load sample data
df = pd.DataFrame({
'age': np.random.normal(35, 10, 1000),
'income': np.random.exponential(50000, 1000),
'score': np.random.uniform(0, 100, 1000)
})
# Basic histogram with Matplotlib
plt.figure(figsize=(10, 4))
plt.hist(df['age'], bins=30, edgecolor='black', alpha=0.7)
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.title('Age Distribution')
plt.show()
Seaborn histplot - Enhanced Histograms
Seaborn's histplot() provides more options including kernel density estimation (KDE) overlay.
# Histogram with KDE overlay
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
# Normal distribution (age)
sns.histplot(df['age'], kde=True, ax=axes[0], color='steelblue')
axes[0].set_title('Normal: Age')
# Right-skewed distribution (income)
sns.histplot(df['income'], kde=True, ax=axes[1], color='coral')
axes[1].set_title('Right-Skewed: Income')
# Uniform distribution (score)
sns.histplot(df['score'], kde=True, ax=axes[2], color='seagreen')
axes[2].set_title('Uniform: Score')
plt.tight_layout()
plt.show()
Distribution Statistics with describe()
Combine visual analysis with numerical summaries to fully understand your distributions.
# Comprehensive statistics
print(df['income'].describe())
# Additional distribution metrics (skew and kurtosis are built into pandas)
print(f"\nSkewness: {df['income'].skew():.3f}") # Positive = right skew
print(f"Kurtosis: {df['income'].kurtosis():.3f}") # High = heavy tails
# Percentiles for deeper understanding
percentiles = [1, 5, 10, 25, 50, 75, 90, 95, 99]
print(f"\nPercentiles:")
for p in percentiles:
    print(f"  {p}th: {np.percentile(df['income'], p):,.0f}")
Box Plots for Quick Summary
Box plots show the five-number summary (minimum, Q1, median, Q3, maximum). The whiskers extend to the most extreme points within 1.5*IQR of the quartiles, and anything beyond them is drawn as an individual outlier point.
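The five numbers themselves can be pulled directly with quantile(), a quick sketch on a small sample:

```python
import pandas as pd

data = pd.Series([10, 12, 14, 15, 16, 18, 20, 22, 100])

# The five numbers a box plot is built from
five_num = data.quantile([0.0, 0.25, 0.5, 0.75, 1.0])
print(five_num)
```

In a box plot of this sample, the 100 would appear as an outlier point beyond the upper whisker.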
# Box plot comparison
fig, ax = plt.subplots(figsize=(10, 5))
df.boxplot(column=['age', 'score'], ax=ax)
plt.title('Box Plot Comparison')
plt.ylabel('Value')
plt.show()
# Horizontal box plot with Seaborn
plt.figure(figsize=(10, 3))
sns.boxplot(x=df['income'], color='coral')
plt.title('Income Distribution (Box Plot)')
plt.xlabel('Income ($)')
plt.show()
Tip: Use sns.violinplot() to see both the box plot summary and the full distribution shape combined.
Practice: Numerical Distributions
Task: Given income data with right skew: income = np.random.exponential(50000, 1000), create a histogram with KDE overlay using seaborn and calculate the mean, median, and skewness.
Show Solution
import seaborn as sns
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
income = np.random.exponential(50000, 1000)
# Create histogram with KDE
sns.histplot(income, kde=True, color='coral')
plt.xlabel('Income ($)')
plt.title('Income Distribution')
# Calculate statistics
print(f"Mean: ${np.mean(income):,.0f}")
print(f"Median: ${np.median(income):,.0f}")
print(f"Skewness: {pd.Series(income).skew():.2f}")
plt.show()
Task: Create a 1x3 subplot showing: (1) normal distribution, (2) right-skewed exponential, (3) uniform distribution. Each should have KDE overlay and display skewness in the title.
Show Solution
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
np.random.seed(42)
normal = np.random.normal(50, 10, 1000)
exponential = np.random.exponential(50, 1000)
uniform = np.random.uniform(0, 100, 1000)
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for ax, data, name in zip(axes, [normal, exponential, uniform],
                          ['Normal', 'Exponential', 'Uniform']):
    sns.histplot(data, kde=True, ax=ax)
    skew = pd.Series(data).skew()
    ax.set_title(f'{name} (skew: {skew:.2f})')
plt.tight_layout()
plt.show()
Categorical Variable Frequencies
Categorical variables contain discrete values representing groups or categories. Understanding the distribution of categories helps identify class imbalance, dominant groups, and rare categories that may need special handling during modeling.
Value Counts - The Essential Method
The value_counts() method is your primary tool for exploring categorical variables. It returns frequencies sorted from most to least common.
import pandas as pd
# Sample categorical data
df = pd.DataFrame({
'department': ['Sales', 'IT', 'HR', 'Sales', 'IT', 'Sales',
'Marketing', 'IT', 'Sales', 'HR'] * 100,
'level': ['Junior', 'Senior', 'Junior', 'Mid', 'Senior',
'Mid', 'Junior', 'Mid', 'Senior', 'Junior'] * 100,
'status': ['Active', 'Active', 'Active', 'Inactive', 'Active',
'Active', 'Active', 'Active', 'Active', 'Active'] * 100
})
# Basic value counts
print(df['department'].value_counts())
# Sales 400
# IT 300
# HR 200
# Marketing 100
# As percentages (proportions)
print(df['department'].value_counts(normalize=True))
# Sales 0.40
# IT 0.30
# HR 0.20
# Marketing 0.10
Bar Plots for Categories
Bar plots visualize category frequencies. Use horizontal bars for categories with long names.
import matplotlib.pyplot as plt
import seaborn as sns
# Vertical bar plot
plt.figure(figsize=(8, 5))
df['department'].value_counts().plot(kind='bar', color='steelblue', edgecolor='black')
plt.title('Employees by Department')
plt.xlabel('Department')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.show()
# Seaborn countplot (automatic counting)
plt.figure(figsize=(8, 5))
sns.countplot(data=df, x='department', order=df['department'].value_counts().index)
plt.title('Department Distribution')
plt.show()
Identifying Class Imbalance
Class imbalance occurs when some categories are much more frequent than others. This is critical for classification problems.
# Check for imbalance
status_counts = df['status'].value_counts()
print(status_counts)
# Active 900
# Inactive 100
# Calculate imbalance ratio
majority = status_counts.max()
minority = status_counts.min()
imbalance_ratio = majority / minority
print(f"Imbalance ratio: {imbalance_ratio:.1f}:1") # 9:1
# Visualize imbalance
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
# Absolute counts
status_counts.plot(kind='bar', ax=axes[0], color=['green', 'red'])
axes[0].set_title('Absolute Counts')
axes[0].set_ylabel('Count')
# Percentage
status_counts.plot(kind='pie', ax=axes[1], autopct='%1.1f%%', colors=['green', 'red'])
axes[1].set_title('Percentage Distribution')
axes[1].set_ylabel('')
plt.tight_layout()
plt.show()
Handling Rare Categories
Categories with very few observations can cause issues. Consider grouping them into an "Other" category.
# Identify rare categories (below a frequency threshold; 15% here so this
# small example flags Marketing at 10%)
threshold = 0.15
value_pcts = df['department'].value_counts(normalize=True)
rare_cats = value_pcts[value_pcts < threshold].index.tolist()
print(f"Rare categories: {rare_cats}")
# Group rare categories into 'Other'
df['department_grouped'] = df['department'].apply(
    lambda x: 'Other' if x in rare_cats else x
)
# Verify grouping
print(df['department_grouped'].value_counts())
Tip: Before grouping, use df['col'].nunique() to check how many unique categories a column contains.
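A trivial sketch of the cardinality check:

```python
import pandas as pd

s = pd.Series(['a', 'b', 'a', 'c', 'a'])
print(s.nunique())  # 3 unique categories
```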
Practice: Categorical Analysis
Task: Given product categories: categories = ['Electronics', 'Clothing', 'Electronics', 'Food', 'Clothing', 'Electronics', 'Food', 'Clothing', 'Electronics', 'Other'], create a vertical bar chart with counts displayed on each bar.
Show Solution
import pandas as pd
import matplotlib.pyplot as plt
categories = ['Electronics', 'Clothing', 'Electronics', 'Food',
'Clothing', 'Electronics', 'Food', 'Clothing',
'Electronics', 'Other']
counts = pd.Series(categories).value_counts()
plt.figure(figsize=(8, 5))
bars = plt.bar(counts.index, counts.values, color='steelblue')
# Add count labels on bars
for bar, count in zip(bars, counts.values):
    plt.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.1,
             str(count), ha='center', fontsize=12)
plt.xlabel('Category')
plt.ylabel('Count')
plt.title('Product Category Distribution')
plt.show()
Task: Given city data with 50 cities, keep the top 5 most frequent cities and group the rest as "Other". Create a horizontal bar chart showing percentage of each group.
Show Solution
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Simulate city data (50 cities, some more frequent)
np.random.seed(42)
cities = np.random.choice(
['NYC', 'LA', 'Chicago', 'Houston', 'Phoenix'] +
[f'City_{i}' for i in range(45)],
size=1000, p=[0.15, 0.12, 0.1, 0.08, 0.07] + [0.48/45]*45
)
df = pd.DataFrame({'city': cities})
# Keep top 5, group rest as Other
top_5 = df['city'].value_counts().nlargest(5).index
df['city_grouped'] = df['city'].apply(
lambda x: x if x in top_5 else 'Other'
)
# Calculate percentages
pct = df['city_grouped'].value_counts(normalize=True) * 100
# Plot horizontal bar chart
plt.figure(figsize=(10, 5))
bars = plt.barh(pct.index, pct.values, color='teal')
for bar, p in zip(bars, pct.values):
    plt.text(bar.get_width() + 0.5, bar.get_y() + bar.get_height()/2,
             f'{p:.1f}%', va='center')
plt.xlabel('Percentage')
plt.title('City Distribution (Top 5 + Other)')
plt.show()
Outlier Detection Methods
Outliers are data points that significantly differ from other observations. They can be legitimate extreme values, data entry errors, or measurement issues. Detecting and properly handling outliers is crucial for accurate analysis and model performance.
When to Keep Outliers
- Legitimate extreme values (e.g., CEO salary)
- Important for understanding data range
- Using robust statistical methods
- Tree-based models (handle outliers well)
When to Remove/Transform
- Data entry errors or measurement issues
- Using mean-sensitive methods
- Linear regression (sensitive to outliers)
- Distance-based algorithms (KNN, K-Means)
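Between the two extremes, a common compromise is clipping (winsorizing): cap extreme values at percentile bounds instead of dropping rows. A minimal sketch with pandas' clip():

```python
import pandas as pd

salaries = pd.Series([35000, 42000, 48000, 52000, 55000, 61000, 320000])

# Cap values at the 5th and 95th percentiles instead of dropping rows
lower, upper = salaries.quantile([0.05, 0.95])
clipped = salaries.clip(lower=lower, upper=upper)

print(f"Max before: {salaries.max():,.0f}")
print(f"Max after:  {clipped.max():,.0f}")
```

This keeps every observation while limiting the influence of the extremes on mean-sensitive methods.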
Method 1: IQR (Interquartile Range)
The IQR method defines outliers as values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR. This is the method used in box plots.
import numpy as np
import pandas as pd
# Sample data with outliers
data = pd.Series([10, 12, 14, 15, 16, 18, 20, 22, 100, 5, 150])
# Calculate IQR bounds
Q1 = data.quantile(0.25)
Q3 = data.quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
print(f"Q1: {Q1}, Q3: {Q3}, IQR: {IQR}")
print(f"Lower bound: {lower_bound}")
print(f"Upper bound: {upper_bound}")
# Identify outliers
outliers = data[(data < lower_bound) | (data > upper_bound)]
print(f"\nOutliers: {outliers.values}") # [100, 150]
# Filter outliers
clean_data = data[(data >= lower_bound) & (data <= upper_bound)]
print(f"Clean data: {clean_data.values}")
Method 2: Z-Score
Z-score measures how many standard deviations a value is from the mean. Values with |Z| > 3 are typically considered outliers.
from scipy import stats
# Calculate Z-scores
z_scores = stats.zscore(data)
print("Z-scores:")
for val, z in zip(data, z_scores):
    print(f"  {val}: z = {z:.2f}")
# Identify outliers (|z| > 3)
threshold = 3
outlier_mask = np.abs(z_scores) > threshold
outliers = data[outlier_mask]
print(f"\nOutliers (|z| > 3): {outliers.values}")
# Note: nothing is flagged on this small sample, because the extreme values
# inflate the standard deviation itself (masking). Z-scores work best on
# larger samples; compare the robust method below.
# Alternative: Using pandas
data_mean = data.mean()
data_std = data.std()
z_scores_manual = (data - data_mean) / data_std
Method 3: Modified Z-Score (Robust)
Uses median and MAD (Median Absolute Deviation) instead of mean and std, making it more robust to outliers.
# Modified Z-score using MAD
median = data.median()
mad = np.median(np.abs(data - median))
modified_z = 0.6745 * (data - median) / mad
print("Modified Z-scores:")
for val, mz in zip(data, modified_z):
    print(f"  {val}: modified_z = {mz:.2f}")
# Outliers where |modified_z| > 3.5
outliers = data[np.abs(modified_z) > 3.5]
print(f"\nOutliers: {outliers.values}")
Visualizing Outliers
Box plots automatically show outliers. Combine with other visualizations for comprehensive outlier analysis.
import matplotlib.pyplot as plt
import seaborn as sns
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
# Box plot - shows outliers as points
axes[0].boxplot(data)
axes[0].set_title('Box Plot (outliers as dots)')
axes[0].set_ylabel('Value')
# Histogram with outlier markers
axes[1].hist(data, bins=15, edgecolor='black')
axes[1].axvline(lower_bound, color='red', linestyle='--', label='IQR bounds')
axes[1].axvline(upper_bound, color='red', linestyle='--')
axes[1].set_title('Histogram with IQR Bounds')
axes[1].legend()
# Scatter plot to see outliers in context
axes[2].scatter(range(len(data)), data, c=['red' if x in outliers.values else 'blue' for x in data])
axes[2].set_title('Values (red = outliers)')
axes[2].set_xlabel('Index')
plt.tight_layout()
plt.show()
Practice: Outlier Detection
Task: Given salaries: salaries = [35000, 42000, 45000, 48000, 52000, 55000, 61000, 250000, 320000], identify outliers using IQR method and print the exact values that are outliers.
Show Solution
import pandas as pd
salaries = pd.Series([35000, 42000, 45000, 48000, 52000,
55000, 61000, 250000, 320000])
Q1 = salaries.quantile(0.25)
Q3 = salaries.quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
print(f"Q1: ${Q1:,.0f}, Q3: ${Q3:,.0f}, IQR: ${IQR:,.0f}")
print(f"Lower bound: ${lower_bound:,.0f}")
print(f"Upper bound: ${upper_bound:,.0f}")
outliers = salaries[(salaries < lower_bound) | (salaries > upper_bound)]
print(f"\nOutliers found: {outliers.tolist()}")
# Output: [250000, 320000]
Task: Generate temperature data with temps = np.append(np.random.normal(72, 5, 100), [32, 110, 115]). Detect outliers using both IQR and Z-score (threshold=3) methods. Print how many outliers each method finds and which values they flag.
Show Solution
import numpy as np
import pandas as pd
from scipy import stats
np.random.seed(42)
temps = pd.Series(np.append(np.random.normal(72, 5, 100), [32, 110, 115]))
# IQR Method
Q1, Q3 = temps.quantile([0.25, 0.75])
IQR = Q3 - Q1
iqr_lower = Q1 - 1.5 * IQR
iqr_upper = Q3 + 1.5 * IQR
iqr_outliers = temps[(temps < iqr_lower) | (temps > iqr_upper)]
# Z-Score Method
z_scores = np.abs(stats.zscore(temps))
zscore_outliers = temps[z_scores > 3]
print("IQR Method:")
print(f" Bounds: [{iqr_lower:.1f}, {iqr_upper:.1f}]")
print(f" Outliers ({len(iqr_outliers)}): {sorted(iqr_outliers.values)}")
print("\nZ-Score Method (threshold=3):")
print(f" Outliers ({len(zscore_outliers)}): {sorted(zscore_outliers.values)}")
print(f"\nBoth methods agree on: {set(iqr_outliers) & set(zscore_outliers)}")
Task: Using prices = [10, 12, 14, 15, 16, 18, 19, 20, 21, 85, 90], create a horizontal box plot that clearly shows the outliers. Add a title "Price Distribution with Outliers".
Show Solution
import matplotlib.pyplot as plt
prices = [10, 12, 14, 15, 16, 18, 19, 20, 21, 85, 90]
plt.figure(figsize=(10, 3))
plt.boxplot(prices, vert=False, patch_artist=True,
boxprops=dict(facecolor='lightblue'),
flierprops=dict(marker='o', markerfacecolor='red', markersize=10))
plt.xlabel('Price ($)')
plt.title('Price Distribution with Outliers')
plt.show()
# The red dots on the right are the outliers (85, 90)
Data Transformation Techniques
Transformations help normalize skewed distributions, stabilize variance, and make data more suitable for analysis. Different transformations work best for different types of skewness.
Measuring Skewness
import pandas as pd
import numpy as np
# Create sample skewed data
right_skewed = pd.Series(np.random.exponential(scale=2, size=1000))
left_skewed = pd.Series(-np.random.exponential(scale=2, size=1000) + 10)
normal = pd.Series(np.random.normal(loc=50, scale=10, size=1000))
print(f"Right skewed: {right_skewed.skew():.2f}") # ~2.0
print(f"Left skewed: {left_skewed.skew():.2f}") # ~-2.0
print(f"Normal: {normal.skew():.2f}") # ~0.0
# Interpretation
# |skew| < 0.5: approximately symmetric
# 0.5 < |skew| < 1: moderately skewed
# |skew| > 1: highly skewed
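The interpretation rules above can be wrapped in a small helper (describe_skew is a hypothetical name, not a pandas API):

```python
import numpy as np
import pandas as pd

def describe_skew(series: pd.Series) -> str:
    """Classify skewness with the common rule-of-thumb thresholds."""
    s = series.skew()
    if abs(s) < 0.5:
        label = "approximately symmetric"
    elif abs(s) < 1:
        label = "moderately skewed"
    else:
        label = "highly skewed"
    direction = "right" if s > 0 else "left"
    return f"skew={s:.2f}: {label} ({direction} tail)"

rng = np.random.default_rng(42)
print(describe_skew(pd.Series(rng.exponential(2, 1000))))   # highly skewed
print(describe_skew(pd.Series(rng.normal(50, 10, 1000))))   # approximately symmetric
```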
Log Transformation
Best for right-skewed (positively skewed) data. Reduces the impact of large values and compresses the right tail.
# Log transformation for right-skewed data
# (example DataFrame; any right-skewed non-negative column works)
df = pd.DataFrame({'salary': np.random.exponential(50000, 1000)})
df['salary_log'] = np.log1p(df['salary'])  # log1p handles zeros
# Compare before and after
print(f"Original skewness: {df['salary'].skew():.2f}")
print(f"Log-transformed skewness: {df['salary_log'].skew():.2f}")
# Note: Use np.log1p() for data with zeros
# np.log(x) fails for x <= 0
# np.log1p(x) = log(1 + x), safe for x >= 0
# To reverse: np.expm1(x) = exp(x) - 1
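A quick round-trip check confirms that log1p and expm1 are exact inverses, so transformed values (e.g., model predictions) can safely be mapped back:

```python
import numpy as np

x = np.array([0.0, 1.0, 10.0, 50000.0])

transformed = np.log1p(x)          # log(1 + x), defined at x = 0
recovered = np.expm1(transformed)  # exp(y) - 1 undoes it

print(np.allclose(recovered, x))  # True
```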
Square Root Transformation
Milder than log transformation, good for count data and moderately skewed distributions.
# Square root transformation (example: right-skewed count data)
df['count'] = np.random.poisson(3, 1000)
df['count_sqrt'] = np.sqrt(df['count'])
# Compare skewness
print(f"Original: {df['count'].skew():.2f}")
print(f"Sqrt transformed: {df['count_sqrt'].skew():.2f}")
# Good for: count data, moderately right-skewed data
# Cannot handle negative values
Box-Cox Transformation
Automatically finds the optimal transformation parameter (lambda) to maximize normality.
from scipy import stats
# Box-Cox requires positive values
data = df['salary'][df['salary'] > 0]
# Find optimal lambda
transformed_data, lambda_val = stats.boxcox(data)
print(f"Optimal lambda: {lambda_val:.2f}")
print(f"Original skewness: {data.skew():.2f}")
print(f"Transformed skewness: {pd.Series(transformed_data).skew():.2f}")
# Lambda interpretation:
# lambda = 1: No transformation
# lambda = 0: Log transformation
# lambda = 0.5: Square root transformation
# lambda = -1: Reciprocal transformation
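To map Box-Cox-transformed values (e.g., model predictions) back to the original scale, scipy provides scipy.special.inv_boxcox. A sketch of the round trip:

```python
import numpy as np
from scipy import stats
from scipy.special import inv_boxcox

rng = np.random.default_rng(0)
data = rng.exponential(scale=2.0, size=500) + 1  # strictly positive, right-skewed

# Forward transform returns the transformed values and the fitted lambda
transformed, lam = stats.boxcox(data)

# Inverse transform maps values back to the original scale
recovered = inv_boxcox(transformed, lam)
print(np.allclose(recovered, data))  # True
```

Keep the fitted lambda around: it is required to invert the transformation.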
Comparing Transformations
import matplotlib.pyplot as plt
# Create comparison of transformations
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
# Original data
axes[0, 0].hist(df['salary'], bins=30, edgecolor='black')
axes[0, 0].set_title(f"Original (skew: {df['salary'].skew():.2f})")
# Log transform
log_data = np.log1p(df['salary'])
axes[0, 1].hist(log_data, bins=30, edgecolor='black', color='green')
axes[0, 1].set_title(f"Log Transform (skew: {log_data.skew():.2f})")
# Square root
sqrt_data = np.sqrt(df['salary'])
axes[1, 0].hist(sqrt_data, bins=30, edgecolor='black', color='orange')
axes[1, 0].set_title(f"Square Root (skew: {sqrt_data.skew():.2f})")
# Box-Cox
boxcox_data, _ = stats.boxcox(df['salary'][df['salary'] > 0])
axes[1, 1].hist(boxcox_data, bins=30, edgecolor='black', color='purple')
axes[1, 1].set_title(f"Box-Cox (skew: {pd.Series(boxcox_data).skew():.2f})")
plt.tight_layout()
plt.show()
| Transformation | Best For | Handles Zeros? | Handles Negatives? |
|---|---|---|---|
| np.log1p() | Right-skewed data | ✅ Yes | ❌ No |
| np.sqrt() | Moderate skew, counts | ✅ Yes | ❌ No |
| stats.boxcox() | Any right-skew (auto) | ❌ No | ❌ No |
| np.power(x, 2) | Left-skewed data | ✅ Yes | ✅ Yes |
Practice: Data Transformations
Task: Generate right-skewed income data with income = np.random.exponential(50000, 1000). Apply log1p transformation and create a 1x2 subplot showing histograms before and after, with skewness values in titles.
Show Solution
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
np.random.seed(42)
income = pd.Series(np.random.exponential(50000, 1000))
# Apply log transformation
income_log = np.log1p(income)
# Create comparison plot
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
sns.histplot(income, kde=True, ax=axes[0], color='coral')
axes[0].set_title(f'Original Income (skew: {income.skew():.2f})')
axes[0].set_xlabel('Income ($)')
sns.histplot(income_log, kde=True, ax=axes[1], color='seagreen')
axes[1].set_title(f'Log-Transformed (skew: {income_log.skew():.2f})')
axes[1].set_xlabel('Log(Income + 1)')
plt.tight_layout()
plt.show()
Task: Generate house prices with prices = np.random.exponential(300000, 500) + 50000. Apply Box-Cox transformation, print the optimal lambda value, and compare skewness of original, log, sqrt, and Box-Cox transformations.
Show Solution
import numpy as np
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt
np.random.seed(42)
prices = pd.Series(np.random.exponential(300000, 500) + 50000)
# Apply transformations
log_prices = np.log(prices)
sqrt_prices = np.sqrt(prices)
boxcox_prices, optimal_lambda = stats.boxcox(prices)
boxcox_prices = pd.Series(boxcox_prices)
# Compare skewness
print("Transformation Comparison:")
print(f" Original: skew = {prices.skew():.3f}")
print(f" Log: skew = {log_prices.skew():.3f}")
print(f" Sqrt: skew = {sqrt_prices.skew():.3f}")
print(f" Box-Cox: skew = {boxcox_prices.skew():.3f}")
print(f"\nOptimal Box-Cox lambda: {optimal_lambda:.3f}")
# Note: lambda ≈ 0 means log is best, lambda ≈ 0.5 means sqrt
if abs(optimal_lambda) < 0.1:
    print("Lambda near 0 → log transform is optimal")
elif abs(optimal_lambda - 0.5) < 0.1:
    print("Lambda near 0.5 → sqrt transform is optimal")
Key Takeaways
Distribution Shape
Histograms and KDE plots reveal skewness, modality, and spread of numerical variables
Category Balance
Value counts and bar plots show class imbalance and dominant categories
IQR Method
Values outside Q1 - 1.5*IQR to Q3 + 1.5*IQR are potential outliers
Z-Score Method
Values with absolute Z-score greater than 3 are typically considered outliers
Log Transform
Reduces right skew and compresses large values for better model performance
Box-Cox Transform
Automatically finds optimal power transformation for normality
Knowledge Check
Test your understanding of univariate analysis techniques.
Which plot is best for visualizing the distribution of a single numerical variable?
What does the IQR method use as outlier boundaries?
Which transformation is best for reducing right skewness?
What Pandas method gives value counts for categorical variables?
A Z-score of -2.5 indicates the value is:
Which visualization shows outliers as individual points beyond the whiskers?