What is Descriptive Statistics?
Descriptive statistics is the branch of statistics that deals with summarizing and describing data. Unlike inferential statistics, which makes predictions about populations, descriptive statistics focuses on describing what the data actually shows, without drawing conclusions beyond it.
Descriptive vs Inferential Statistics
Descriptive Statistics summarizes and organizes data so it can be easily understood. It describes the sample you have.
Inferential Statistics uses sample data to make predictions or inferences about a larger population. It goes beyond the data at hand.
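To make the contrast concrete, here is a minimal sketch (the sample values are illustrative): the first two lines only describe the sample at hand, while the confidence interval makes a claim about the larger population the sample was drawn from.

```python
import numpy as np
from scipy import stats

sample = [72, 85, 88, 91, 76, 82, 78, 95, 89, 73]  # illustrative exam scores

# Descriptive: summarizes only the sample we have
print(f"Sample mean: {np.mean(sample):.1f}")
print(f"Sample std:  {np.std(sample, ddof=1):.1f}")

# Inferential: a 95% confidence interval for the POPULATION mean
ci = stats.t.interval(0.95, df=len(sample) - 1,
                      loc=np.mean(sample), scale=stats.sem(sample))
print(f"95% CI for population mean: ({ci[0]:.1f}, {ci[1]:.1f})")
```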
The Three Pillars of Descriptive Statistics
Central Tendency
Measures that describe the center or typical value of a dataset: mean, median, and mode.
Variability
Measures that describe the spread or dispersion: range, variance, standard deviation, IQR.
Distribution Shape
Measures that describe the shape: skewness (asymmetry) and kurtosis (tailedness).
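As a quick preview, each pillar maps to a one-line computation in Python (the dataset here is illustrative; these same functions are used throughout this section):

```python
import numpy as np
from scipy.stats import skew

data = [72, 85, 88, 91, 76, 82, 78, 95, 89, 73]

print(f"Central tendency (mean):  {np.mean(data):.2f}")
print(f"Variability (sample std): {np.std(data, ddof=1):.2f}")
print(f"Shape (skewness):         {skew(data):.3f}")
```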
Getting Started with Data
Let's set up a sample dataset to explore these concepts:
import numpy as np
import pandas as pd
from scipy import stats
# Sample dataset: exam scores
scores = [72, 85, 88, 91, 76, 82, 78, 95, 89, 73,
          80, 84, 77, 93, 86, 79, 81, 90, 75, 87]
# Create a pandas Series for easier manipulation
exam_scores = pd.Series(scores, name="Exam Scores")
print("Sample Data:")
print(exam_scores.values)
print(f"\nNumber of observations: {len(exam_scores)}")
Measures of Central Tendency
Central tendency measures describe the center of a dataset - the typical or representative value. The three main measures are mean, median, and mode, each useful in different situations.
Mean vs Median with Outliers
With no outlier, the mean and median are similar. Introduce a single extreme value and the mean shifts dramatically while the median stays stable - the median is resistant to extreme values.
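This effect is easy to reproduce in code. Below, a single extreme value (the numbers are illustrative) is appended to a tight dataset; the mean jumps while the median barely moves:

```python
import numpy as np

data = [70, 72, 74, 75, 76, 78, 80]
with_outlier = data + [300]  # one extreme value

print(f"Without outlier: mean={np.mean(data):.1f}, median={np.median(data):.1f}")
print(f"With outlier:    mean={np.mean(with_outlier):.1f}, median={np.median(with_outlier):.1f}")
# Mean jumps from 75.0 to ~103.1; median moves only from 75.0 to 75.5
```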
Mean (Average)
The mean is the sum of all values divided by the number of values. It's the most common measure but is sensitive to outliers.
Mean = Σx / n
Sum of all values divided by count
# Calculating the mean
scores = [72, 85, 88, 91, 76, 82, 78, 95, 89, 73,
          80, 84, 77, 93, 86, 79, 81, 90, 75, 87]
# Using NumPy
mean_np = np.mean(scores)
print(f"Mean (NumPy): {mean_np}")
# Using Pandas
exam_scores = pd.Series(scores)
mean_pd = exam_scores.mean()
print(f"Mean (Pandas): {mean_pd}")
# Manual calculation
mean_manual = sum(scores) / len(scores)
print(f"Mean (Manual): {mean_manual}")
# Output: Mean = 83.05
Median (Middle Value)
The median is the middle value when data is sorted. If there's an even number of values, it's the average of the two middle values. The median is resistant to outliers.
# Calculating the median
median_np = np.median(scores)
print(f"Median (NumPy): {median_np}")
median_pd = exam_scores.median()
print(f"Median (Pandas): {median_pd}")
# Manual calculation
sorted_scores = sorted(scores)
n = len(sorted_scores)
if n % 2 == 0:
    median_manual = (sorted_scores[n//2 - 1] + sorted_scores[n//2]) / 2
else:
    median_manual = sorted_scores[n//2]
print(f"Median (Manual): {median_manual}")
# Output: Median = 83.0
Mode (Most Frequent Value)
The mode is the value that appears most frequently. A dataset can have no mode, one mode (unimodal), or multiple modes (bimodal/multimodal).
# Calculating the mode
from scipy import stats
# Dataset with a clear mode
grades = [85, 90, 85, 78, 92, 85, 88, 90, 85]
mode_scipy = stats.mode(grades, keepdims=True)
print(f"Mode: {mode_scipy.mode[0]}")
print(f"Count: {mode_scipy.count[0]}")
# Using Pandas
grades_series = pd.Series(grades)
mode_pd = grades_series.mode()
print(f"Mode (Pandas): {mode_pd.values}")
# Output: Mode = 85 (appears 4 times)
When to Use Each Measure
Mean
- ✅ Symmetric distributions
- ✅ Interval/ratio data
- ❌ Avoid with outliers
- ❌ Avoid with skewed data
Median
- ✅ Skewed distributions
- ✅ Data with outliers
- ✅ Ordinal data
- ✅ Income, house prices
Mode
- ✅ Categorical data
- ✅ Finding most common
- ✅ Bimodal distributions
- ❌ May not exist
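A quick sanity check of these guidelines on right-skewed data (the wait times below are illustrative): the mean is pulled well above the median, so the median better represents a "typical" value.

```python
import numpy as np

# Right-skewed data: a few long waits pull the mean upward
wait_times = [2, 3, 3, 4, 4, 5, 6, 8, 25, 40]  # minutes

print(f"Mean:   {np.mean(wait_times):.1f}")   # 10.0 - inflated by the long waits
print(f"Median: {np.median(wait_times):.1f}") # 4.5 - the typical experience
```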
Practice: Central Tendency
Task: Given employee salaries: salaries = [45000, 48000, 52000, 55000, 51000, 49000, 53000, 47000, 850000] (the last one is the CEO), calculate mean, median, and mode. Explain which measure best represents the "typical" employee salary and why.
Show Solution
import numpy as np
from scipy import stats
salaries = [45000, 48000, 52000, 55000, 51000, 49000, 53000, 47000, 850000]
mean_sal = np.mean(salaries)
median_sal = np.median(salaries)
mode_sal = stats.mode(salaries, keepdims=True).mode[0]
print(f"Mean: ${mean_sal:,.0f}") # $138,889 - inflated by CEO
print(f"Median: ${median_sal:,.0f}") # $51,000 - typical employee
print(f"Mode: ${mode_sal:,.0f}") # $45,000 - no value repeats, so the mode is not meaningful here
# ANSWER: Median ($51,000) best represents typical salary
# The CEO's $850,000 is an outlier that skews the mean
Task: A clothing store recorded t-shirt sizes sold: sizes = ['M', 'L', 'S', 'M', 'XL', 'M', 'L', 'M', 'S', 'M', 'L', 'M', 'XXL', 'L', 'M']. Find which size is most popular (mode) and calculate what percentage of sales it represents.
Show Solution
import pandas as pd
sizes = ['M', 'L', 'S', 'M', 'XL', 'M', 'L', 'M', 'S', 'M', 'L', 'M', 'XXL', 'L', 'M']
sizes_series = pd.Series(sizes)
# Find mode
mode = sizes_series.mode()[0]
mode_count = (sizes_series == mode).sum()
percentage = (mode_count / len(sizes_series)) * 100
print(f"Most popular size: {mode}")
print(f"Count: {mode_count} out of {len(sizes_series)}")
print(f"Percentage: {percentage:.1f}%")
# Output: M is the mode, sold 7 times (46.7%)
Measures of Variability
While central tendency tells us about the typical value, variability measures tell us how spread out the data is. Two datasets can have the same mean but very different spreads, making variability essential for understanding data.
Range
The range is the simplest measure of spread: the difference between the maximum and minimum values. It's easy to calculate but highly sensitive to outliers.
scores = [72, 85, 88, 91, 76, 82, 78, 95, 89, 73,
          80, 84, 77, 93, 86, 79, 81, 90, 75, 87]
# Calculate range
data_range = np.max(scores) - np.min(scores)
print(f"Range: {data_range}")
# Or using built-in functions
data_range = max(scores) - min(scores)
print(f"Range: {data_range}") # Output: 23
Variance
Variance measures the average squared deviation from the mean. It quantifies how far data points are from the center on average.
Variance = Σ(x - μ)² / n (Population)
Variance = Σ(x - x̄)² / (n-1) (Sample)
n-1 for sample (Bessel's correction) to reduce bias
# Population variance (divide by n)
var_pop = np.var(scores)
print(f"Population Variance: {var_pop:.2f}")
# Sample variance (divide by n-1) - use this for samples!
var_sample = np.var(scores, ddof=1) # ddof=1 for sample
print(f"Sample Variance: {var_sample:.2f}")
# Pandas uses sample variance by default
var_pd = pd.Series(scores).var()
print(f"Pandas Variance: {var_pd:.2f}")
Standard Deviation
Standard deviation is the square root of variance. It's in the same units as the data, making it more interpretable than variance.
# Population standard deviation
std_pop = np.std(scores)
print(f"Population Std Dev: {std_pop:.2f}")
# Sample standard deviation (most common)
std_sample = np.std(scores, ddof=1)
print(f"Sample Std Dev: {std_sample:.2f}")
# Pandas (sample std by default)
std_pd = pd.Series(scores).std()
print(f"Pandas Std Dev: {std_pd:.2f}")
# Interpretation
mean = np.mean(scores)
print(f"\nMean ± 1 Std: {mean:.1f} ± {std_sample:.1f}")
print(f"Range: [{mean - std_sample:.1f}, {mean + std_sample:.1f}]")
Interquartile Range (IQR)
IQR is the range of the middle 50% of data (Q3 - Q1). It's robust to outliers and commonly used in box plots.
# Calculate quartiles
q1 = np.percentile(scores, 25)
q3 = np.percentile(scores, 75)
iqr = q3 - q1
print(f"Q1 (25th percentile): {q1}")
print(f"Q3 (75th percentile): {q3}")
print(f"IQR: {iqr}")
# Using SciPy
from scipy.stats import iqr
iqr_scipy = iqr(scores)
print(f"IQR (SciPy): {iqr_scipy}")
# Outlier detection using IQR
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
print(f"\nOutlier bounds: [{lower_bound:.1f}, {upper_bound:.1f}]")
Coefficient of Variation (CV)
CV expresses standard deviation as a percentage of the mean, allowing comparison of variability across different scales.
# Coefficient of Variation
mean = np.mean(scores)
std = np.std(scores, ddof=1)
cv = (std / mean) * 100
print(f"Mean: {mean:.2f}")
print(f"Std Dev: {std:.2f}")
print(f"CV: {cv:.2f}%")
# Useful for comparing variability
# e.g., heights (CV ~5%) vs income (CV ~50-100%)
Practice: Variability
Task: Machine A produces parts with weights: [100.2, 99.8, 100.1, 99.9, 100.0] grams. Machine B produces parts with weights: [50.5, 49.2, 51.3, 48.8, 50.2] grams. Using Coefficient of Variation (CV), determine which machine is more consistent (lower relative variability).
Show Solution
import numpy as np
machine_a = [100.2, 99.8, 100.1, 99.9, 100.0]
machine_b = [50.5, 49.2, 51.3, 48.8, 50.2]
# Calculate CV = (std / mean) * 100
cv_a = (np.std(machine_a, ddof=1) / np.mean(machine_a)) * 100
cv_b = (np.std(machine_b, ddof=1) / np.mean(machine_b)) * 100
print(f"Machine A - Mean: {np.mean(machine_a):.2f}g, Std: {np.std(machine_a, ddof=1):.3f}g")
print(f"Machine A - CV: {cv_a:.2f}%")
print(f"\nMachine B - Mean: {np.mean(machine_b):.2f}g, Std: {np.std(machine_b, ddof=1):.3f}g")
print(f"Machine B - CV: {cv_b:.2f}%")
print(f"\nMachine A is more consistent (CV: {cv_a:.2f}% vs {cv_b:.2f}%)")
# Machine A has ~0.16% CV, Machine B has ~2.01% CV
Task: House prices in a neighborhood (in thousands): prices = [250, 275, 290, 310, 285, 295, 280, 265, 890, 305]. Use the IQR method to identify any outliers and explain if the $890K house should be flagged.
Show Solution
import numpy as np
prices = [250, 275, 290, 310, 285, 295, 280, 265, 890, 305]
q1 = np.percentile(prices, 25)
q3 = np.percentile(prices, 75)
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
print(f"Q1: ${q1:.0f}K, Q3: ${q3:.0f}K, IQR: ${iqr:.0f}K")
print(f"Lower bound: ${lower_bound:.0f}K")
print(f"Upper bound: ${upper_bound:.0f}K")
outliers = [p for p in prices if p < lower_bound or p > upper_bound]
print(f"\nOutliers: {outliers}")
# The $890K house is an outlier (above the upper bound of ~$342K)
Distribution Shape
Beyond center and spread, understanding the shape of data distribution is crucial. Skewness and kurtosis describe asymmetry and peakedness, helping identify the nature of your data and appropriate analytical methods.
Skewness
Skewness measures the asymmetry of a distribution. It tells you which direction the tail extends.
Negative Skew
Left tail longer
Mean < Median < Mode
Example: Easy test scores
Symmetric
Balanced tails
Mean ≈ Median ≈ Mode
Example: Heights, IQ
Positive Skew
Right tail longer
Mode < Median < Mean
Example: Income, house prices
from scipy.stats import skew
# Our exam scores
scores = [72, 85, 88, 91, 76, 82, 78, 95, 89, 73,
          80, 84, 77, 93, 86, 79, 81, 90, 75, 87]
# Calculate skewness
skewness = skew(scores)
print(f"Skewness: {skewness:.3f}")
# Interpretation
if skewness > 0.5:
print("Moderately to highly positively skewed")
elif skewness < -0.5:
print("Moderately to highly negatively skewed")
else:
print("Approximately symmetric")
# Example of positively skewed data (income-like)
income = [30000, 35000, 40000, 45000, 50000, 55000,
          60000, 80000, 100000, 150000, 500000]
print(f"\nIncome skewness: {skew(income):.3f}") # Positive
Kurtosis
Kurtosis measures the "tailedness" of a distribution - how much data is in the tails versus the center compared to a normal distribution.
Platykurtic
Kurtosis < 0
Flat, light tails
Fewer outliers
Example: Uniform distribution
Mesokurtic
Kurtosis ≈ 0
Normal shape
Moderate tails
Example: Normal distribution
Leptokurtic
Kurtosis > 0
Peaked, heavy tails
More outliers
Example: Financial returns
from scipy.stats import kurtosis
# Calculate excess kurtosis (Fisher's definition)
# Normal distribution has excess kurtosis = 0
kurt = kurtosis(scores)
print(f"Excess Kurtosis: {kurt:.3f}")
# Interpretation
if kurt > 1:
print("Leptokurtic - heavy tails, more outliers likely")
elif kurt < -1:
print("Platykurtic - light tails, fewer outliers")
else:
print("Approximately mesokurtic (normal-like)")
# Compare different distributions
normal_data = np.random.normal(0, 1, 1000)
uniform_data = np.random.uniform(-2, 2, 1000)
laplace_data = np.random.laplace(0, 1, 1000)
print(f"\nNormal kurtosis: {kurtosis(normal_data):.3f}")
print(f"Uniform kurtosis: {kurtosis(uniform_data):.3f}")
print(f"Laplace kurtosis: {kurtosis(laplace_data):.3f}")
Percentiles and Quartiles
Percentiles divide data into 100 equal parts. Quartiles are special percentiles (25th, 50th, 75th) that divide data into four equal parts.
# Calculate various percentiles
scores_series = pd.Series(scores)
# Quartiles
print("Quartiles:")
print(f"Q1 (25th): {scores_series.quantile(0.25)}")
print(f"Q2 (50th): {scores_series.quantile(0.50)}") # Median
print(f"Q3 (75th): {scores_series.quantile(0.75)}")
# Any percentile
print(f"\n10th percentile: {np.percentile(scores, 10)}")
print(f"90th percentile: {np.percentile(scores, 90)}")
# Five-number summary
print(f"\nFive-number summary:")
print(f"Min: {scores_series.min()}")
print(f"Q1: {scores_series.quantile(0.25)}")
print(f"Median: {scores_series.median()}")
print(f"Q3: {scores_series.quantile(0.75)}")
print(f"Max: {scores_series.max()}")
Practice: Distribution Shape
Task: Given household incomes (in thousands): income = [35, 42, 48, 52, 55, 58, 62, 68, 75, 95, 120, 180, 250], calculate skewness and kurtosis. Based on the results, would you recommend a log transformation before using this data in a linear model?
Show Solution
import numpy as np
import pandas as pd
from scipy.stats import skew, kurtosis
income = [35, 42, 48, 52, 55, 58, 62, 68, 75, 95, 120, 180, 250]
skewness = skew(income)
kurt = kurtosis(income)
mean_val = np.mean(income)
median_val = np.median(income)
print(f"Mean: ${mean_val:.0f}K, Median: ${median_val:.0f}K")
print(f"Skewness: {skewness:.2f}")
print(f"Kurtosis: {kurt:.2f}")
# Apply log transform and compare
log_income = np.log(income)
print(f"\nAfter log transform:")
print(f"Skewness: {skew(log_income):.2f}")
# Skewness > 1 indicates strong right skew
# Log transformation IS recommended for linear models
Task: Exam scores: scores = [45, 52, 58, 62, 65, 68, 70, 72, 75, 78, 80, 82, 85, 88, 92, 95, 98]. Calculate the five-number summary, determine if the distribution is symmetric or skewed, and identify if any scores would be considered outliers using the 1.5×IQR rule.
Show Solution
import numpy as np
import pandas as pd
scores = [45, 52, 58, 62, 65, 68, 70, 72, 75, 78, 80, 82, 85, 88, 92, 95, 98]
# Five-number summary
five_num = {
    'Min': np.min(scores),
    'Q1': np.percentile(scores, 25),
    'Median': np.median(scores),
    'Q3': np.percentile(scores, 75),
    'Max': np.max(scores)
}
print("Five-Number Summary:")
for k, v in five_num.items():
    print(f"  {k}: {v}")
# Check symmetry
iqr = five_num['Q3'] - five_num['Q1']
q1_to_median = five_num['Median'] - five_num['Q1']
median_to_q3 = five_num['Q3'] - five_num['Median']
print(f"\nIQR: {iqr}")
print(f"Q1 to Median: {q1_to_median}")
print(f"Median to Q3: {median_to_q3}")
# Outlier bounds
lower_bound = five_num['Q1'] - 1.5 * iqr
upper_bound = five_num['Q3'] + 1.5 * iqr
outliers = [s for s in scores if s < lower_bound or s > upper_bound]
print(f"\nOutlier bounds: [{lower_bound:.1f}, {upper_bound:.1f}]")
print(f"Outliers: {outliers if outliers else 'None'}")
# Distribution is roughly symmetric, no outliers
Pandas Statistical Methods
Pandas provides a comprehensive suite of statistical methods built directly into DataFrames and Series. The describe() method gives you a complete statistical summary in one line, while individual methods offer precise control.
The describe() Method
The describe() method provides a complete statistical summary including count, mean, std, min, quartiles, and max.
import pandas as pd
import numpy as np
# Create sample DataFrame
df = pd.DataFrame({
    'Age': [25, 32, 45, 28, 36, 42, 29, 51, 33, 38],
    'Salary': [45000, 52000, 78000, 48000, 62000, 71000, 51000, 85000, 55000, 65000],
    'Experience': [2, 5, 18, 3, 10, 15, 4, 22, 8, 12]
})
# Basic describe
print(df.describe())
# Output includes:
# - count: non-null values
# - mean: average
# - std: standard deviation
# - min: minimum
# - 25%: first quartile (Q1)
# - 50%: median (Q2)
# - 75%: third quartile (Q3)
# - max: maximum
Customizing describe()
You can customize describe() with percentiles and include categorical data.
# Custom percentiles
print(df.describe(percentiles=[.1, .25, .5, .75, .9]))
# Include categorical columns
df['Department'] = ['Sales', 'IT', 'HR', 'IT', 'Sales',
                    'IT', 'HR', 'Sales', 'IT', 'HR']
print(df.describe(include='all'))
# Only categorical
print(df.describe(include=['object']))
# Only numeric
print(df.describe(include=[np.number]))
Individual Statistical Methods
Pandas provides individual methods for every statistical measure, applicable to both Series and DataFrames.
# Central tendency
print(f"Mean:\n{df[['Age', 'Salary']].mean()}")
print(f"\nMedian:\n{df[['Age', 'Salary']].median()}")
print(f"\nMode:\n{df['Department'].mode()}")
# Variability
print(f"\nVariance:\n{df[['Age', 'Salary']].var()}")
print(f"\nStd Dev:\n{df[['Age', 'Salary']].std()}")
# Range-based
print(f"\nMin:\n{df[['Age', 'Salary']].min()}")
print(f"\nMax:\n{df[['Age', 'Salary']].max()}")
# Quantiles
print(f"\nQ1:\n{df[['Age', 'Salary']].quantile(0.25)}")
print(f"\nQ3:\n{df[['Age', 'Salary']].quantile(0.75)}")
Grouped Statistics with groupby()
Combine groupby() with statistical methods for powerful segmented analysis.
# Statistics by group
print("Mean by Department:")
print(df.groupby('Department')[['Age', 'Salary', 'Experience']].mean())
# Multiple aggregations
print("\nMultiple stats by Department:")
print(df.groupby('Department')['Salary'].agg(['mean', 'median', 'std', 'min', 'max']))
# Named aggregations
summary = df.groupby('Department').agg(
    avg_salary=('Salary', 'mean'),
    max_salary=('Salary', 'max'),
    avg_age=('Age', 'mean'),
    headcount=('Age', 'count')
)
print("\nNamed aggregations:")
print(summary)
Correlation and Covariance
Understand relationships between variables with correlation and covariance matrices.
# Correlation matrix
print("Correlation Matrix:")
print(df[['Age', 'Salary', 'Experience']].corr())
# Covariance matrix
print("\nCovariance Matrix:")
print(df[['Age', 'Salary', 'Experience']].cov())
# Correlation between two columns
print(f"\nAge-Salary correlation: {df['Age'].corr(df['Salary']):.3f}")
print(f"Experience-Salary correlation: {df['Experience'].corr(df['Salary']):.3f}")
Additional Useful Methods
# Cumulative statistics
print(f"Cumulative Sum:\n{df['Salary'].cumsum()}")
print(f"\nCumulative Max:\n{df['Salary'].cummax()}")
# Rolling statistics (window-based)
print(f"\n3-period Rolling Mean:\n{df['Salary'].rolling(3).mean()}")
# Ranking
print(f"\nSalary Ranks:\n{df['Salary'].rank()}")
# Value counts (for categorical)
print(f"\nDepartment Counts:\n{df['Department'].value_counts()}")
# Count of unique values
print(f"\nNumber of unique departments: {df['Department'].nunique()}")
Practice: Pandas Statistics
Task: Given sales data with regions, calculate mean, median, std, min, and max sales for each region. Identify which region has the highest average sales and which has the most consistent performance (lowest CV).
Show Solution
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'Region': ['East', 'West', 'East', 'North', 'West', 'North', 'East', 'West', 'North', 'East'],
    'Sales': [12500, 15800, 11200, 9800, 16200, 10500, 13100, 14900, 11000, 12800]
})
# Calculate multiple statistics per region
stats = df.groupby('Region')['Sales'].agg(['mean', 'median', 'std', 'min', 'max'])
print("Regional Statistics:")
print(stats)
# Calculate CV for consistency comparison
cv_by_region = df.groupby('Region')['Sales'].agg(lambda x: (x.std() / x.mean()) * 100)
print(f"\nCV by Region:")
print(cv_by_region)
print(f"\nHighest avg sales: {stats['mean'].idxmax()} (${stats['mean'].max():,.0f})")
print(f"Most consistent: {cv_by_region.idxmin()} (CV: {cv_by_region.min():.1f}%)")
Task: Given employee data with Experience (years), Salary, Performance Score (1-10), and Training Hours, create a correlation matrix and identify: (1) the strongest positive correlation, (2) any negative correlations, and (3) which factor is most correlated with Salary.
Show Solution
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'Experience': [2, 5, 8, 3, 12, 7, 4, 9, 6, 10],
    'Salary': [45000, 58000, 72000, 48000, 95000, 65000, 52000, 78000, 62000, 85000],
    'Performance': [7, 8, 7, 6, 9, 8, 7, 8, 9, 8],
    'Training_Hours': [40, 25, 15, 35, 10, 20, 30, 12, 22, 8]
})
# Full correlation matrix
corr_matrix = df.corr()
print("Correlation Matrix:")
print(corr_matrix.round(3))
# Find strongest positive correlation (excluding diagonal)
mask = np.triu(np.ones_like(corr_matrix, dtype=bool), k=1)
corr_pairs = corr_matrix.where(mask).stack()
print(f"\nStrongest positive correlation: {corr_pairs.idxmax()}")
print(f" Value: {corr_pairs.max():.3f}")
# Negative correlations
neg_corrs = corr_pairs[corr_pairs < 0]
print(f"\nNegative correlations:")
for idx, val in neg_corrs.items():
    print(f"  {idx}: {val:.3f}")
# Factor most correlated with Salary
salary_corr = corr_matrix['Salary'].drop('Salary').abs().sort_values(ascending=False)
print(f"\nFactors correlated with Salary:")
print(salary_corr)
Key Takeaways
Central Tendency Trio
Mean for symmetric data, Median for skewed data or outliers, Mode for categorical data. Always report the appropriate measure for your data type.
Variability Matters
Standard deviation measures typical distance from mean. IQR is robust to outliers. Use CV to compare variability across different scales.
Shape Indicators
Skewness reveals asymmetry (positive = right tail, negative = left tail). Kurtosis indicates tail weight and outlier propensity.
Pandas describe()
Use df.describe() for instant statistical summaries. Customize with percentiles and include parameters for comprehensive analysis.
Grouped Analysis
Combine groupby() with agg() for powerful segmented statistics. Use named aggregations for readable, multi-metric summaries.
Sample vs Population
Use ddof=1 for sample statistics (n-1 denominator). Pandas uses sample statistics by default; NumPy uses population unless specified.
Knowledge Check
Test your understanding of descriptive statistics with this quick quiz.
Which measure of central tendency is most resistant to outliers?
For a positively skewed distribution, which relationship is true?
What does IQR stand for and what does it measure?
Which NumPy parameter gives sample standard deviation instead of population?
Which Pandas method provides a summary of descriptive statistics?
A distribution with heavy tails and more outliers than normal is called: