Sampling Distributions
Inferential statistics lets us draw conclusions about a population from a sample. The foundation of this is understanding how sample statistics vary from sample to sample. That variation is captured by the sampling distribution, which underpins all hypothesis testing and confidence intervals.
Population vs Sample
Population: The entire group we want to study (often too large to measure completely).
Sample: A subset of the population that we actually measure.
The Sampling Distribution of the Mean
If we take many samples from a population and calculate the mean of each, these sample means form a distribution called the sampling distribution of the mean. This distribution has predictable properties that enable statistical inference.
import numpy as np
from scipy import stats
# Population: All adult heights (simulated)
np.random.seed(42)
population = np.random.normal(170, 10, 100000) # μ=170cm, σ=10cm
print(f"Population mean: {population.mean():.2f}")
print(f"Population std: {population.std():.2f}")
# Take many samples and calculate their means
sample_size = 30
n_samples = 1000
sample_means = []
for _ in range(n_samples):
    sample = np.random.choice(population, size=sample_size, replace=False)
    sample_means.append(sample.mean())
sample_means = np.array(sample_means)
print(f"\nMean of sample means: {sample_means.mean():.2f}")
print(f"Std of sample means: {sample_means.std():.2f}")
print(f"Theoretical SE: {population.std() / np.sqrt(sample_size):.2f}")
Central Limit Theorem (CLT)
The Central Limit Theorem is one of the most important results in statistics. It states that, for any population with finite variance, the sampling distribution of the mean approaches a normal distribution as sample size increases (n ≥ 30 is a common rule of thumb).
Shape
Sampling distribution becomes approximately normal (n ≥ 30).
Center
Mean of sample means equals population mean (μx̄ = μ).
Spread
Standard error: SE = σ / √n (decreases with larger n).
Independence
Samples must be drawn randomly and be independent of each other.
# Demonstrating CLT with a skewed population
from scipy import stats
# Highly skewed population (exponential distribution)
skewed_pop = np.random.exponential(scale=2, size=100000)
print(f"Population skewness: {stats.skew(skewed_pop):.2f}")
# Sample means from skewed population
sample_means_skewed = [np.random.choice(skewed_pop, 50).mean()
                       for _ in range(1000)]
# Check normality of sample means
stat, p_value = stats.shapiro(sample_means_skewed)
print(f"Shapiro-Wilk p-value: {p_value:.4f}")
# The sample means are far closer to normal than the skewed population,
# though with 1000 means Shapiro-Wilk may still flag mild residual skew
# Standard Error calculation
def standard_error(sigma, n):
    """Calculate standard error of the mean."""
    return sigma / np.sqrt(n)
# Effect of sample size on SE
pop_std = 10
for n in [10, 30, 100, 1000]:
    se = standard_error(pop_std, n)
    print(f"n={n:4d}: SE = {se:.2f}")
Practice: Sampling
Scenario: A company knows from past surveys that customer satisfaction scores have σ = 18 points (on a 100-point scale). They want to estimate the mean satisfaction score.
Task: Calculate the standard error for samples of n = 50, n = 200, and n = 800. By what percentage does SE improve when quadrupling sample size?
Show Solution
import numpy as np
sigma = 18
sample_sizes = [50, 200, 800]
for n in sample_sizes:
    se = sigma / np.sqrt(n)
    print(f"n = {n}: SE = {se:.2f}")
# Comparing n=50 to n=200 (4x larger)
se_50 = sigma / np.sqrt(50)
se_200 = sigma / np.sqrt(200)
improvement = (se_50 - se_200) / se_50 * 100
print(f"\nQuadrupling sample size reduces SE by {improvement:.0f}%")
# Answer: 50% (SE halves when n quadruples)
Scenario: A factory produces bolts with known specifications: μ = 10.0mm length, σ = 0.3mm. Quality control takes random samples of 25 bolts each hour.
Task: What is the probability that a sample mean will be less than 9.9mm? Should QC be concerned if they observe x̄ = 9.88mm?
Show Solution
from scipy import stats
import numpy as np
mu, sigma, n = 10.0, 0.3, 25
se = sigma / np.sqrt(n) # 0.06
# P(x̄ < 9.9)
z = (9.9 - mu) / se
p_below_9_9 = stats.norm.cdf(z)
print(f"P(x̄ < 9.9) = {p_below_9_9:.4f}")
print(f"About {p_below_9_9*100:.1f}% of samples will be below 9.9mm")
# Should we be concerned about x̄ = 9.88?
z_observed = (9.88 - mu) / se
p_observed = stats.norm.cdf(z_observed)
print(f"\nP(x̄ < 9.88) = {p_observed:.4f}")
print(f"Only {p_observed*100:.1f}% chance by random variation")
print("YES - this suggests a potential process issue!")
Confidence Intervals
A confidence interval provides a range of plausible values for a population parameter based on sample data. Rather than giving a single point estimate, confidence intervals quantify the uncertainty in our estimate and are fundamental to statistical inference.
Confidence Interval
A confidence interval is a range of values that likely contains the true population parameter. A 95% CI means that if we repeated the sampling process many times, 95% of the resulting intervals would contain the true parameter.
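This repeated-sampling interpretation can be checked directly by simulation. A minimal sketch (the population parameters, seed, and trial count here are arbitrary choices for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
mu, sigma, n = 100, 15, 40           # arbitrary "true" population parameters
z = stats.norm.ppf(0.975)            # 1.96 for a 95% CI

covered = 0
trials = 10000
for _ in range(trials):
    sample = rng.normal(mu, sigma, n)
    se = sigma / np.sqrt(n)          # sigma treated as known, so a z-interval
    x_bar = sample.mean()
    if x_bar - z * se <= mu <= x_bar + z * se:
        covered += 1

print(f"Coverage: {covered / trials:.3f}")  # close to 0.95
```

Roughly 95% of the 10,000 intervals capture the true mean, which is exactly what the "95% confident" statement promises.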
Confidence Interval for the Mean (σ known)
When the population standard deviation is known, we use the z-distribution to construct confidence intervals for the population mean.
import numpy as np
from scipy import stats
# Sample data
sample = np.array([23, 25, 28, 22, 26, 24, 27, 25, 29, 24])
n = len(sample)
sample_mean = sample.mean()
# Assume population std is known
sigma = 2.5
# 95% Confidence Interval using z-distribution
confidence = 0.95
alpha = 1 - confidence
z_critical = stats.norm.ppf(1 - alpha/2) # 1.96 for 95%
# Standard error
se = sigma / np.sqrt(n)
# Calculate CI
margin_of_error = z_critical * se
ci_lower = sample_mean - margin_of_error
ci_upper = sample_mean + margin_of_error
print(f"Sample mean: {sample_mean:.2f}")
print(f"Standard error: {se:.2f}")
print(f"Z-critical (95%): {z_critical:.2f}")
print(f"Margin of error: {margin_of_error:.2f}")
print(f"95% CI: ({ci_lower:.2f}, {ci_upper:.2f})")
Confidence Interval for the Mean (σ unknown)
When the population standard deviation is unknown (more common in practice), we use the sample standard deviation and the t-distribution, which accounts for the additional uncertainty.
# More realistic: population std unknown, use t-distribution
sample = np.array([23, 25, 28, 22, 26, 24, 27, 25, 29, 24])
n = len(sample)
sample_mean = sample.mean()
sample_std = sample.std(ddof=1) # Sample std with Bessel's correction
# t-critical value (wider than z for small samples)
confidence = 0.95
alpha = 1 - confidence
df = n - 1 # degrees of freedom
t_critical = stats.t.ppf(1 - alpha/2, df)
# Standard error using sample std
se = sample_std / np.sqrt(n)
# Calculate CI
margin_of_error = t_critical * se
ci_lower = sample_mean - margin_of_error
ci_upper = sample_mean + margin_of_error
print(f"Sample mean: {sample_mean:.2f}")
print(f"Sample std: {sample_std:.2f}")
print(f"T-critical (95%, df={df}): {t_critical:.3f}")
print(f"95% CI: ({ci_lower:.2f}, {ci_upper:.2f})")
# Using scipy's built-in function
ci = stats.t.interval(confidence, df, loc=sample_mean, scale=se)
print(f"95% CI (scipy): ({ci[0]:.2f}, {ci[1]:.2f})")
Common Confidence Levels
| Confidence Level | Z-Critical | Alpha (α) | Use Case |
|---|---|---|---|
| 90% | 1.645 | 0.10 | Exploratory analysis |
| 95% | 1.96 | 0.05 | Standard (most common) |
| 99% | 2.576 | 0.01 | High-stakes decisions |
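The z-critical values in the table come straight from the inverse normal CDF:

```python
from scipy import stats

# Two-sided critical value: 1 - alpha/2 quantile of the standard normal
for confidence in [0.90, 0.95, 0.99]:
    alpha = 1 - confidence
    z = stats.norm.ppf(1 - alpha / 2)
    print(f"{confidence:.0%}: z = {z:.3f}")  # 1.645, 1.960, 2.576
```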
Confidence Interval for Proportions
# CI for a proportion (e.g., survey results)
# 420 out of 500 customers satisfied
n = 500
successes = 420
p_hat = successes / n # Sample proportion
# Standard error for proportion
se_prop = np.sqrt(p_hat * (1 - p_hat) / n)
# 95% CI
z_critical = 1.96
margin = z_critical * se_prop
ci_lower = p_hat - margin
ci_upper = p_hat + margin
print(f"Sample proportion: {p_hat:.3f}")
print(f"Standard error: {se_prop:.4f}")
print(f"95% CI: ({ci_lower:.3f}, {ci_upper:.3f})")
print(f"Interpretation: We are 95% confident the true")
print(f"satisfaction rate is between {ci_lower*100:.1f}% and {ci_upper*100:.1f}%")
Practice: Confidence Intervals
Scenario: A political pollster wants to estimate voter support with a margin of error of ±3 percentage points at 95% confidence. They use the conservative value σ = 50 percentage points (the maximum standard deviation for a proportion, which occurs at p = 0.5).
Task: How many voters must be surveyed? What if they want 99% confidence with the same margin?
Show Solution
import numpy as np
from scipy import stats
margin = 3 # percentage points
sigma = 50 # maximum std for a proportion (p = 0.5), in percentage points
# For 95% confidence
z_95 = 1.96
n_95 = (z_95 * sigma / margin) ** 2
print(f"95% CI: Need n = {np.ceil(n_95):.0f} voters")
# For 99% confidence
z_99 = stats.norm.ppf(0.995) # 2.576
n_99 = (z_99 * sigma / margin) ** 2
print(f"99% CI: Need n = {np.ceil(n_99):.0f} voters")
increase = (n_99 - n_95) / n_95 * 100
print(f"\n99% requires {increase:.0f}% more respondents than 95%")
Scenario: An e-commerce company sampled 64 deliveries and found mean = 3.2 days, sample std = 0.8 days. Management wants to advertise guaranteed delivery times.
Task: Calculate the 95% CI for mean delivery time. What delivery guarantee can they safely advertise (upper bound)?
Show Solution
from scipy import stats
import numpy as np
mean, s, n = 3.2, 0.8, 64
se = s / np.sqrt(n)
# Using t-distribution (unknown pop. variance)
t_crit = stats.t.ppf(0.975, df=n-1)
ci_lower = mean - t_crit * se
ci_upper = mean + t_crit * se
print(f"95% CI: ({ci_lower:.2f}, {ci_upper:.2f}) days")
print(f"\nRecommendation: Guarantee delivery within {np.ceil(ci_upper):.0f} days")
print(f"(Upper bound rounded up for safety)")
Hypothesis Testing
Hypothesis testing is a formal procedure for making decisions about population parameters based on sample data. It provides a framework for determining whether observed effects are statistically significant or could have occurred by chance alone.
Null and Alternative Hypotheses
Null Hypothesis (H₀): The default assumption, typically stating no effect or no difference. We assume this is true until evidence suggests otherwise.
Alternative Hypothesis (H₁ or Hₐ): What we want to prove, claiming there IS an effect or difference.
The Hypothesis Testing Process
1. State Hypotheses
Define H₀ (null) and H₁ (alternative) based on your research question.
2. Set Alpha (α)
Choose significance level (typically 0.05). This is your threshold for "surprising" results.
3. Calculate Test Statistic
Compute z-score, t-score, or other statistic from sample data.
4. Make Decision
Compare p-value to α, or test statistic to critical value.
One-Sample Z-Test
Use when testing a claim about a population mean when the population standard deviation is known.
import numpy as np
from scipy import stats
# Example: A company claims average delivery time is 30 minutes
# We sample 50 deliveries and find mean = 32.5 minutes
# Population std is known to be 5 minutes
# Step 1: State hypotheses
# H0: μ = 30 (delivery time equals claim)
# H1: μ ≠ 30 (delivery time differs from claim) - two-tailed
mu_0 = 30 # Claimed mean
sample_mean = 32.5
sigma = 5 # Known population std
n = 50
alpha = 0.05
# Step 2: Calculate test statistic
se = sigma / np.sqrt(n)
z_stat = (sample_mean - mu_0) / se
# Step 3: Calculate p-value (two-tailed)
p_value = 2 * (1 - stats.norm.cdf(abs(z_stat)))
# Step 4: Make decision
print(f"Z-statistic: {z_stat:.3f}")
print(f"P-value: {p_value:.4f}")
print(f"Alpha: {alpha}")
if p_value < alpha:
    print("Decision: Reject H0")
    print("Conclusion: Delivery time differs significantly from 30 min")
else:
    print("Decision: Fail to reject H0")
    print("Conclusion: No significant evidence against the claim")
Types of Errors
Type I Error (α)
Rejecting H₀ when it is actually true (false positive).
Example: Concluding a drug works when it does not.
Probability = α (significance level)
Type II Error (β)
Failing to reject H₀ when it is actually false (false negative).
Example: Missing that a drug actually works.
Power = 1 - β
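The Type I error rate can be verified by simulation: when H₀ is actually true, a test at α = 0.05 rejects about 5% of the time. A sketch with arbitrary population parameters:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
mu_0, sigma, n = 30, 5, 50   # arbitrary values; H0 is true by construction
alpha = 0.05

rejections = 0
trials = 10000
for _ in range(trials):
    sample = rng.normal(mu_0, sigma, n)   # data generated under H0
    z = (sample.mean() - mu_0) / (sigma / np.sqrt(n))
    p = 2 * (1 - stats.norm.cdf(abs(z)))  # two-tailed p-value
    if p < alpha:
        rejections += 1

print(f"False positive rate: {rejections / trials:.3f}")  # close to alpha
```

The long-run false positive rate matches α, which is why α is called the Type I error probability.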
One-Tailed vs Two-Tailed Tests
# One-tailed test: Testing if mean is GREATER than value
# H0: μ ≤ 30
# H1: μ > 30 (right-tailed)
# Right-tailed p-value
p_value_right = 1 - stats.norm.cdf(z_stat)
print(f"Right-tailed p-value: {p_value_right:.4f}")
# One-tailed test: Testing if mean is LESS than value
# H0: μ ≥ 30
# H1: μ < 30 (left-tailed)
# Left-tailed p-value
p_value_left = stats.norm.cdf(z_stat)
print(f"Left-tailed p-value: {p_value_left:.4f}")
# Two-tailed test: Testing if mean is DIFFERENT from value
# H0: μ = 30
# H1: μ ≠ 30
p_value_two = 2 * min(p_value_left, p_value_right)
print(f"Two-tailed p-value: {p_value_two:.4f}")
Practice: Hypothesis Testing
Scenario: Before redesign, average time on page was 2.5 minutes. After redesign, a sample of 100 visitors shows mean = 2.8 minutes with s = 1.2 minutes.
Task: Set up the appropriate hypotheses (what type of test?) and determine if the redesign significantly increased engagement at α = 0.05.
Show Solution
from scipy import stats
import numpy as np
# H0: μ ≤ 2.5 (no increase)
# H1: μ > 2.5 (time increased) - RIGHT-TAILED
mu_0, x_bar, s, n = 2.5, 2.8, 1.2, 100
se = s / np.sqrt(n)
t_stat = (x_bar - mu_0) / se
p_value = 1 - stats.t.cdf(t_stat, df=n-1) # Right-tailed
print(f"t-statistic: {t_stat:.3f}")
print(f"p-value: {p_value:.4f}")
print(f"\nDecision: {'Reject H0' if p_value < 0.05 else 'Fail to reject H0'}")
print("The redesign significantly increased time on page!")
Scenario: A supplier claims their components have a defect rate of at most 2%. You test 500 components and find 15 defects.
Task: Perform a complete hypothesis test: state hypotheses, calculate test statistic, find p-value, and make a business recommendation at α = 0.05.
Show Solution
from scipy import stats
import numpy as np
# H0: p ≤ 0.02 (defect rate at most 2%)
# H1: p > 0.02 (defect rate exceeds claim) - RIGHT-TAILED
n, x = 500, 15
p_hat = x / n # 0.03 or 3%
p_0 = 0.02
# Z-test for proportions
se = np.sqrt(p_0 * (1 - p_0) / n)
z_stat = (p_hat - p_0) / se
p_value = 1 - stats.norm.cdf(z_stat)
print(f"Observed defect rate: {p_hat:.1%}")
print(f"Claimed rate: {p_0:.1%}")
print(f"\nZ-statistic: {z_stat:.3f}")
print(f"P-value: {p_value:.4f}")
print(f"\nDecision: {'Reject H0' if p_value < 0.05 else 'Fail to reject H0'}")
if p_value < 0.05:
    print("\nRecommendation: The supplier's claim appears false.")
    print("Consider renegotiating terms or finding an alternative supplier.")
Scenario: Two groups tested: Control (n=10000, mean=50.0, std=10) and Treatment (n=10000, mean=50.3, std=10). The t-test shows p = 0.02.
Task: Is this result practically significant? Calculate Cohen's d and explain why statistical significance doesn't always mean practical importance.
Show Solution
import numpy as np
mean1, mean2 = 50.0, 50.3
std_pooled = 10 # Both groups have same std
n1 = n2 = 10000
# Cohen's d (effect size)
cohens_d = (mean2 - mean1) / std_pooled
print(f"Mean difference: {mean2 - mean1:.2f}")
print(f"Cohen's d: {cohens_d:.3f}")
print("\nEffect size interpretation:")
print(" d < 0.2: negligible")
print(" d = 0.2: small")
print(" d = 0.5: medium")
print(" d = 0.8: large")
print(f"\nOur d = {cohens_d:.2f} is NEGLIGIBLE")
print("\nConclusion: Despite p = 0.02 being statistically significant,")
print("the effect size is trivial (0.03 standard deviations).")
print("The large sample size detected a real but meaningless difference.")
print("NOT practically significant - don't implement the treatment!")
T-Tests and ANOVA
T-tests are used when comparing means with small samples or unknown population variance. ANOVA (Analysis of Variance) extends this to compare means across three or more groups. These are among the most commonly used statistical tests in data science and research.
T-Distribution
The t-distribution is similar to the normal distribution but with heavier tails. It accounts for the extra uncertainty when estimating population variance from a sample. As sample size increases, it approaches the normal distribution.
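The heavier tails are visible in the critical values, which shrink toward the normal's 1.96 as degrees of freedom increase:

```python
from scipy import stats

# 97.5th percentile: the two-sided 95% critical value
z_crit = stats.norm.ppf(0.975)
print(f"z (normal):   {z_crit:.3f}")
for df in [5, 10, 30, 100, 1000]:
    t_crit = stats.t.ppf(0.975, df)
    print(f"t (df={df:4d}): {t_crit:.3f}")
```

With df = 5 the critical value is about 2.57, noticeably wider than 1.96; by df = 1000 the two are practically identical.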
One-Sample T-Test
Tests whether a sample mean differs significantly from a known or hypothesized value when population variance is unknown.
import numpy as np
from scipy import stats
# Example: Testing if mean test score differs from 75
scores = np.array([78, 72, 85, 81, 76, 79, 83, 74, 77, 80])
mu_0 = 75 # Hypothesized mean
# One-sample t-test
t_stat, p_value = stats.ttest_1samp(scores, mu_0)
print(f"Sample mean: {scores.mean():.2f}")
print(f"Sample std: {scores.std(ddof=1):.2f}")
print(f"T-statistic: {t_stat:.3f}")
print(f"P-value: {p_value:.4f}")
alpha = 0.05
if p_value < alpha:
    print(f"Reject H0: Mean significantly differs from {mu_0}")
else:
    print(f"Fail to reject H0: No significant difference from {mu_0}")
Independent Two-Sample T-Test
Compares the means of two independent groups to determine if they are significantly different.
# Compare test scores between two teaching methods
method_a = np.array([85, 78, 92, 88, 76, 95, 89, 84])
method_b = np.array([72, 68, 81, 75, 79, 70, 74, 77])
# Independent two-sample t-test
# H0: μA = μB (no difference between methods)
# H1: μA ≠ μB (methods produce different results)
t_stat, p_value = stats.ttest_ind(method_a, method_b)
print(f"Method A mean: {method_a.mean():.2f}")
print(f"Method B mean: {method_b.mean():.2f}")
print(f"Difference: {method_a.mean() - method_b.mean():.2f}")
print(f"T-statistic: {t_stat:.3f}")
print(f"P-value: {p_value:.4f}")
# With Welch's correction (unequal variances)
t_stat_welch, p_value_welch = stats.ttest_ind(method_a, method_b, equal_var=False)
print(f"Welch's t-test p-value: {p_value_welch:.4f}")
Paired T-Test
Used when comparing two related measurements on the same subjects (before/after, matched pairs).
# Before and after training scores for same individuals
before = np.array([65, 72, 68, 70, 75, 69, 71, 67])
after = np.array([70, 78, 72, 75, 82, 74, 77, 73])
# Paired t-test
# H0: μd = 0 (no difference after training)
# H1: μd ≠ 0 (training has effect)
t_stat, p_value = stats.ttest_rel(before, after)
differences = after - before
print(f"Mean difference: {differences.mean():.2f}")
print(f"T-statistic: {t_stat:.3f}")
print(f"P-value: {p_value:.4f}")
if p_value < 0.05:
    print("Training significantly improved scores")
One-Way ANOVA
Compares means across three or more groups to test if at least one group differs significantly.
# Compare sales across three store regions
north = np.array([23, 25, 28, 24, 26, 27, 25])
south = np.array([31, 29, 33, 30, 32, 28, 34])
west = np.array([26, 28, 25, 27, 29, 26, 28])
# One-way ANOVA
# H0: μN = μS = μW (all regions have equal mean sales)
# H1: At least one region differs
f_stat, p_value = stats.f_oneway(north, south, west)
print(f"North mean: {north.mean():.2f}")
print(f"South mean: {south.mean():.2f}")
print(f"West mean: {west.mean():.2f}")
print(f"F-statistic: {f_stat:.3f}")
print(f"P-value: {p_value:.4f}")
if p_value < 0.05:
    print("Significant difference exists between regions")
    print("Use post-hoc tests to identify which groups differ")
Post-Hoc Analysis with Tukey's HSD
# When ANOVA is significant, identify which groups differ
from scipy.stats import tukey_hsd
# Tukey's HSD (Honestly Significant Difference) test
result = tukey_hsd(north, south, west)
print(result)
# Or using statsmodels for more detailed output
from statsmodels.stats.multicomp import pairwise_tukeyhsd
# Combine data
all_sales = np.concatenate([north, south, west])
groups = ['North']*len(north) + ['South']*len(south) + ['West']*len(west)
tukey = pairwise_tukeyhsd(all_sales, groups, alpha=0.05)
print(tukey.summary())
Practice: T-Tests and ANOVA
Scenario: 15 patients had their blood pressure measured before and after taking a new medication.
Data:
Before: [145, 150, 148, 142, 155, 149, 151, 147, 143, 152, 150, 148, 146, 154, 149]
After: [138, 145, 142, 140, 148, 144, 146, 141, 139, 145, 143, 142, 140, 147, 144]
Task: Perform the appropriate test and determine if the drug significantly reduces blood pressure at α = 0.05.
Show Solution
from scipy import stats
import numpy as np
before = [145, 150, 148, 142, 155, 149, 151, 147, 143, 152, 150, 148, 146, 154, 149]
after = [138, 145, 142, 140, 148, 144, 146, 141, 139, 145, 143, 142, 140, 147, 144]
# Paired t-test (same patients measured twice)
t_stat, p_value = stats.ttest_rel(before, after)
# One-tailed: we expect after < before
p_one_tailed = p_value / 2 # Since t > 0 means before > after
print(f"Mean Before: {np.mean(before):.1f}")
print(f"Mean After: {np.mean(after):.1f}")
print(f"Mean Reduction: {np.mean(before) - np.mean(after):.1f} mmHg")
print(f"\nt-statistic: {t_stat:.3f}")
print(f"p-value (one-tailed): {p_one_tailed:.6f}")
print(f"\nConclusion: {'Drug significantly reduces BP' if p_one_tailed < 0.05 else 'No significant effect'}")
Scenario: A retail chain wants to compare customer satisfaction scores across three stores.
Data:
Store A: [85, 88, 82, 90, 87, 84, 89]
Store B: [78, 82, 75, 80, 77, 83, 79]
Store C: [88, 91, 85, 93, 89, 87, 92]
Task: Perform ANOVA and post-hoc analysis to determine which stores differ significantly.
Show Solution
from scipy import stats
import numpy as np
from itertools import combinations
store_a = [85, 88, 82, 90, 87, 84, 89]
store_b = [78, 82, 75, 80, 77, 83, 79]
store_c = [88, 91, 85, 93, 89, 87, 92]
# One-way ANOVA
f_stat, p_value = stats.f_oneway(store_a, store_b, store_c)
print(f"Store A mean: {np.mean(store_a):.1f}")
print(f"Store B mean: {np.mean(store_b):.1f}")
print(f"Store C mean: {np.mean(store_c):.1f}")
print(f"\nANOVA F-statistic: {f_stat:.2f}")
print(f"p-value: {p_value:.6f}")
print(f"\n{'At least one store differs significantly' if p_value < 0.05 else 'No significant differences'}")
# Post-hoc: Pairwise t-tests with Bonferroni correction
if p_value < 0.05:
    print("\nPost-hoc Analysis (Bonferroni corrected):")
    stores = {'A': store_a, 'B': store_b, 'C': store_c}
    alpha_corrected = 0.05 / 3  # 3 pairwise comparisons
    for (name1, data1), (name2, data2) in combinations(stores.items(), 2):
        _, p = stats.ttest_ind(data1, data2)
        sig = "*" if p < alpha_corrected else ""
        print(f"  {name1} vs {name2}: p = {p:.4f} {sig}")
Chi-Square Tests
Chi-square tests are used for categorical data to test relationships between variables or whether observed frequencies match expected frequencies. Unlike t-tests and ANOVA which compare means, chi-square tests work with counts and proportions.
Chi-Square Statistic
The chi-square statistic measures how much observed frequencies deviate from expected frequencies. Larger values indicate greater deviation from the null hypothesis.
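The formula is χ² = Σ (O − E)² / E, summed over all cells. A quick check that computing it by hand matches scipy, using illustrative counts from a fair-die test:

```python
import numpy as np
from scipy import stats

observed = np.array([8, 12, 9, 11, 14, 6])
expected = np.array([10, 10, 10, 10, 10, 10])

# Chi-square by hand: squared deviations, each scaled by its expected count
chi2_manual = ((observed - expected) ** 2 / expected).sum()
chi2_scipy, _ = stats.chisquare(observed, expected)

print(f"Manual: {chi2_manual:.3f}")  # 4.200
print(f"scipy:  {chi2_scipy:.3f}")   # 4.200
```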
Chi-Square Goodness of Fit Test
Tests whether observed frequencies match expected frequencies (testing a distribution).
import numpy as np
from scipy import stats
# Example: Testing if a die is fair
# Rolled 60 times, expected 10 for each face
observed = np.array([8, 12, 9, 11, 14, 6]) # Actual counts
expected = np.array([10, 10, 10, 10, 10, 10]) # Expected if fair
# Chi-square goodness of fit
# H0: Die is fair (frequencies match expected)
# H1: Die is not fair
chi2_stat, p_value = stats.chisquare(observed, expected)
print(f"Observed: {observed}")
print(f"Expected: {expected}")
print(f"Chi-square statistic: {chi2_stat:.3f}")
print(f"P-value: {p_value:.4f}")
print(f"Degrees of freedom: {len(observed) - 1}")
if p_value < 0.05:
    print("Reject H0: Die is not fair")
else:
    print("Fail to reject H0: No evidence die is unfair")
Chi-Square Test of Independence
Tests whether two categorical variables are independent or associated.
# Example: Is customer satisfaction related to product type?
# Contingency table: rows = satisfaction, columns = product
# Observed frequencies
# Product A Product B Product C
# Satisfied 50 40 30
# Unsatisfied 20 35 25
observed = np.array([
[50, 40, 30], # Satisfied
[20, 35, 25] # Unsatisfied
])
# Chi-square test of independence
# H0: Satisfaction and product are independent
# H1: There is an association between satisfaction and product
chi2_stat, p_value, dof, expected = stats.chi2_contingency(observed)
print("Observed frequencies:")
print(observed)
print("\nExpected frequencies (if independent):")
print(expected.round(2))
print(f"\nChi-square statistic: {chi2_stat:.3f}")
print(f"Degrees of freedom: {dof}")
print(f"P-value: {p_value:.4f}")
if p_value < 0.05:
print("Variables are associated (not independent)")
else:
print("No significant association found")
Effect Size: Cramér's V
Measures the strength of association between categorical variables (0 to 1).
# Cramér's V - effect size for chi-square
def cramers_v(contingency_table):
    """Calculate Cramér's V for association strength."""
    chi2 = stats.chi2_contingency(contingency_table)[0]
    n = contingency_table.sum()
    min_dim = min(contingency_table.shape) - 1
    return np.sqrt(chi2 / (n * min_dim))
v = cramers_v(observed)
print(f"Cramér's V: {v:.3f}")
# Interpretation guide
if v < 0.1:
    strength = "negligible"
elif v < 0.2:
    strength = "weak"
elif v < 0.4:
    strength = "moderate"
elif v < 0.6:
    strength = "relatively strong"
else:
    strength = "strong"
Assumptions for Chi-Square Tests
Requirements
- Random sampling
- Independent observations
- Expected frequency ≥ 5 in each cell
- Categorical (not continuous) data
If Assumptions Violated
- Small expected counts: Use Fisher's exact test
- Combine categories if possible
- Use simulation-based methods
- Report limitation in findings
# Fisher's Exact Test for small samples (2x2 tables)
# When expected frequencies are too small for chi-square
# Example: Treatment success in small study
# Success Failure
# Treatment 8 2
# Control 3 7
table = np.array([[8, 2], [3, 7]])
# Fisher's exact test
odds_ratio, p_value = stats.fisher_exact(table)
print(f"Odds ratio: {odds_ratio:.2f}")
print(f"P-value: {p_value:.4f}")
print("Treatment is significantly more effective" if p_value < 0.05 else "No significant difference")
Practice: Chi-Square Tests
Scenario: A casino's quality control rolled a die 600 times with these results: 1→95, 2→108, 3→92, 4→102, 5→97, 6→106.
Task: Perform a goodness-of-fit test to determine if the die is fair at α = 0.05. What would you recommend?
Show Solution
from scipy import stats
import numpy as np
observed = np.array([95, 108, 92, 102, 97, 106])
expected = np.array([100, 100, 100, 100, 100, 100]) # Fair die
chi2, p_value = stats.chisquare(observed, expected)
print(f"Observed: {observed}")
print(f"Expected: {expected}")
print(f"\nχ² statistic: {chi2:.2f}")
print(f"p-value: {p_value:.4f}")
print(f"Degrees of freedom: {len(observed) - 1}")
print(f"\nDecision: {'Reject H0 (die is NOT fair)' if p_value < 0.05 else 'Fail to reject H0 (die appears fair)'}")
print("\nRecommendation: The die can remain in play.")
Scenario: A marketing team surveyed customers about product preference across age groups:
| Age Group | Product A | Product B | Product C |
|---|---|---|---|
| 18-34 | 45 | 30 | 25 |
| 35-54 | 35 | 40 | 45 |
| 55+ | 20 | 50 | 30 |
Task: Test if product preference is independent of age group. Calculate the effect size (Cramér's V) and interpret the relationship strength.
Show Solution
from scipy import stats
import numpy as np
# Contingency table
data = np.array([
[45, 30, 25], # 18-34
[35, 40, 45], # 35-54
[20, 50, 30] # 55+
])
chi2, p_value, dof, expected = stats.chi2_contingency(data)
# Cramér's V (effect size)
n = data.sum()
min_dim = min(data.shape) - 1
cramers_v = np.sqrt(chi2 / (n * min_dim))
print("Contingency Table:")
print(data)
print(f"\nExpected frequencies:")
print(np.round(expected, 1))
print(f"\nχ² = {chi2:.2f}")
print(f"p-value = {p_value:.4f}")
print(f"Degrees of freedom = {dof}")
print(f"\nCramér's V = {cramers_v:.3f}")
print("Effect size interpretation:")
print(" V < 0.1: negligible")
print(" V ≈ 0.1: small")
print(" V ≈ 0.3: medium")
print(" V ≈ 0.5: large")
print(f"\nConclusion: {'Preference depends on age group' if p_value < 0.05 else 'No significant relationship'}")
Key Takeaways
Sample to Population
Inferential statistics uses sample data to make conclusions about populations, accounting for sampling variability through standard error and confidence intervals
Central Limit Theorem
The sampling distribution of the mean becomes approximately normal regardless of the population distribution when sample size is large enough (n ≥ 30)
Confidence Intervals
CIs provide a range of plausible values for parameters, with 95% CI meaning 95% of similarly constructed intervals would contain the true value
Hypothesis Testing
Compare p-value to significance level (α). If p < α, reject the null hypothesis. Statistical significance does not always mean practical significance
Choose the Right Test
Use t-tests for comparing 2 groups, ANOVA for 3+ groups, and chi-square for categorical data. Check assumptions before applying any test
Avoid Common Errors
Type I error is false positive (rejecting true H₀), Type II is false negative (missing true effect). Balance these based on the consequences of each error