Sampling Distributions
Inferential statistics lets us draw conclusions about a population from a sample. The foundation of this is understanding how sample statistics vary from sample to sample. That variation is captured by the sampling distribution, which underpins all hypothesis testing and confidence intervals.
Population vs Sample
Population: The entire group we want to study (often too large to measure completely).
Sample: A subset of the population that we actually measure.
The Sampling Distribution of the Mean
If we take many samples from a population and calculate the mean of each, these sample means form a distribution called the sampling distribution of the mean. This distribution has predictable properties that enable statistical inference.
import numpy as np
from scipy import stats
# Population: All adult heights (simulated)
np.random.seed(42)
population = np.random.normal(170, 10, 100000) # μ=170cm, σ=10cm
print(f"Population mean: {population.mean():.2f}")
print(f"Population std: {population.std():.2f}")
# Take many samples and calculate their means
sample_size = 30
n_samples = 1000
sample_means = []
for _ in range(n_samples):
    sample = np.random.choice(population, size=sample_size, replace=False)
    sample_means.append(sample.mean())
sample_means = np.array(sample_means)
print(f"\nMean of sample means: {sample_means.mean():.2f}")
print(f"Std of sample means: {sample_means.std():.2f}")
print(f"Theoretical SE: {population.std() / np.sqrt(sample_size):.2f}")
Central Limit Theorem (CLT)
The Central Limit Theorem is one of the most important results in statistics. It states that, for any population with finite variance, the sampling distribution of the mean approaches a normal distribution as sample size increases (n ≥ 30 is a common rule of thumb).
Shape
Sampling distribution becomes approximately normal (n ≥ 30).
Center
Mean of sample means equals population mean (μx̄ = μ).
Spread
Standard error: SE = σ / √n (decreases with larger n).
Independence
Samples must be drawn randomly and be independent of each other.
# Demonstrating CLT with a skewed population
from scipy import stats
# Highly skewed population (exponential distribution)
skewed_pop = np.random.exponential(scale=2, size=100000)
print(f"Population skewness: {stats.skew(skewed_pop):.2f}")
# Sample means from skewed population
sample_means_skewed = [np.random.choice(skewed_pop, 50).mean()
                       for _ in range(1000)]
# Check normality of sample means
stat, p_value = stats.shapiro(sample_means_skewed)
print(f"Shapiro-Wilk p-value: {p_value:.4f}")
# The sample means are far closer to normal than the skewed population,
# though with 1000 means Shapiro-Wilk may still flag mild residual skew
# Standard Error calculation
def standard_error(sigma, n):
    """Calculate standard error of the mean."""
    return sigma / np.sqrt(n)
# Effect of sample size on SE
pop_std = 10
for n in [10, 30, 100, 1000]:
    se = standard_error(pop_std, n)
    print(f"n={n:4d}: SE = {se:.2f}")
Practice: Sampling
Scenario: A company knows from past surveys that customer satisfaction scores have σ = 18 points (on a 100-point scale). They want to estimate the mean satisfaction score.
Task: Calculate the standard error for samples of n = 50, n = 200, and n = 800. By what percentage does SE improve when quadrupling sample size?
Show Solution
import numpy as np
sigma = 18
sample_sizes = [50, 200, 800]
for n in sample_sizes:
    se = sigma / np.sqrt(n)
    print(f"n = {n}: SE = {se:.2f}")
# Comparing n=50 to n=200 (4x larger)
se_50 = sigma / np.sqrt(50)
se_200 = sigma / np.sqrt(200)
improvement = (se_50 - se_200) / se_50 * 100
print(f"\nQuadrupling sample size reduces SE by {improvement:.0f}%")
# Answer: 50% (SE halves when n quadruples)
Scenario: A factory produces bolts with known specifications: μ = 10.0mm length, σ = 0.3mm. Quality control takes random samples of 25 bolts each hour.
Task: What is the probability that a sample mean will be less than 9.9mm? Should QC be concerned if they observe x̄ = 9.88mm?
Show Solution
from scipy import stats
import numpy as np
mu, sigma, n = 10.0, 0.3, 25
se = sigma / np.sqrt(n) # 0.06
# P(x̄ < 9.9)
z = (9.9 - mu) / se
p_below_9_9 = stats.norm.cdf(z)
print(f"P(x̄ < 9.9) = {p_below_9_9:.4f}")
print(f"About {p_below_9_9*100:.1f}% of samples will be below 9.9mm")
# Should we be concerned about x̄ = 9.88?
z_observed = (9.88 - mu) / se
p_observed = stats.norm.cdf(z_observed)
print(f"\nP(x̄ < 9.88) = {p_observed:.4f}")
print(f"Only {p_observed*100:.1f}% chance by random variation")
print("YES - this suggests a potential process issue!")
Confidence Intervals
A confidence interval provides a range of plausible values for a population parameter based on sample data. Rather than giving a single point estimate, confidence intervals quantify the uncertainty in our estimate and are fundamental to statistical inference.
Confidence Interval
A confidence interval is a range of values that likely contains the true population parameter. A 95% CI means that if we repeated the sampling process many times, 95% of the resulting intervals would contain the true parameter.
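This repeated-sampling interpretation can be checked directly by simulation. A minimal sketch (the population parameters, seed, and trial count here are arbitrary choices for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
mu, sigma, n = 100, 15, 40           # arbitrary "true" population parameters
z = stats.norm.ppf(0.975)            # 1.96 for a 95% CI

covered = 0
trials = 10000
for _ in range(trials):
    sample = rng.normal(mu, sigma, n)
    se = sigma / np.sqrt(n)          # sigma treated as known, so a z-interval
    x_bar = sample.mean()
    if x_bar - z * se <= mu <= x_bar + z * se:
        covered += 1

print(f"Coverage: {covered / trials:.3f}")  # close to 0.95
```

Roughly 95% of the 10,000 intervals capture the true mean, which is exactly what the "95% confident" statement promises.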
Confidence Interval for the Mean (σ known)
When the population standard deviation is known, we use the z-distribution to construct confidence intervals for the population mean.
import numpy as np
from scipy import stats
# Sample data
sample = np.array([23, 25, 28, 22, 26, 24, 27, 25, 29, 24])
n = len(sample)
sample_mean = sample.mean()
# Assume population std is known
sigma = 2.5
# 95% Confidence Interval using z-distribution
confidence = 0.95
alpha = 1 - confidence
z_critical = stats.norm.ppf(1 - alpha/2) # 1.96 for 95%
# Standard error
se = sigma / np.sqrt(n)
# Calculate CI
margin_of_error = z_critical * se
ci_lower = sample_mean - margin_of_error
ci_upper = sample_mean + margin_of_error
print(f"Sample mean: {sample_mean:.2f}")
print(f"Standard error: {se:.2f}")
print(f"Z-critical (95%): {z_critical:.2f}")
print(f"Margin of error: {margin_of_error:.2f}")
print(f"95% CI: ({ci_lower:.2f}, {ci_upper:.2f})")
Confidence Interval for the Mean (σ unknown)
When the population standard deviation is unknown (more common in practice), we use the sample standard deviation and the t-distribution, which accounts for the additional uncertainty.
# More realistic: population std unknown, use t-distribution
sample = np.array([23, 25, 28, 22, 26, 24, 27, 25, 29, 24])
n = len(sample)
sample_mean = sample.mean()
sample_std = sample.std(ddof=1) # Sample std with Bessel's correction
# t-critical value (wider than z for small samples)
confidence = 0.95
alpha = 1 - confidence
df = n - 1 # degrees of freedom
t_critical = stats.t.ppf(1 - alpha/2, df)
# Standard error using sample std
se = sample_std / np.sqrt(n)
# Calculate CI
margin_of_error = t_critical * se
ci_lower = sample_mean - margin_of_error
ci_upper = sample_mean + margin_of_error
print(f"Sample mean: {sample_mean:.2f}")
print(f"Sample std: {sample_std:.2f}")
print(f"T-critical (95%, df={df}): {t_critical:.3f}")
print(f"95% CI: ({ci_lower:.2f}, {ci_upper:.2f})")
# Using scipy's built-in function
ci = stats.t.interval(confidence, df, loc=sample_mean, scale=se)
print(f"95% CI (scipy): ({ci[0]:.2f}, {ci[1]:.2f})")
Common Confidence Levels
| Confidence Level | Z-Critical | Alpha (α) | Use Case |
|---|---|---|---|
| 90% | 1.645 | 0.10 | Exploratory analysis |
| 95% | 1.96 | 0.05 | Standard (most common) |
| 99% | 2.576 | 0.01 | High-stakes decisions |
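The z-critical values in the table come straight from the inverse normal CDF:

```python
from scipy import stats

# Two-sided critical value: 1 - alpha/2 quantile of the standard normal
for confidence in [0.90, 0.95, 0.99]:
    alpha = 1 - confidence
    z = stats.norm.ppf(1 - alpha / 2)
    print(f"{confidence:.0%}: z = {z:.3f}")  # 1.645, 1.960, 2.576
```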
Confidence Interval for Proportions
# CI for a proportion (e.g., survey results)
# 420 out of 500 customers satisfied
n = 500
successes = 420
p_hat = successes / n # Sample proportion
# Standard error for proportion
se_prop = np.sqrt(p_hat * (1 - p_hat) / n)
# 95% CI
z_critical = 1.96
margin = z_critical * se_prop
ci_lower = p_hat - margin
ci_upper = p_hat + margin
print(f"Sample proportion: {p_hat:.3f}")
print(f"Standard error: {se_prop:.4f}")
print(f"95% CI: ({ci_lower:.3f}, {ci_upper:.3f})")
print(f"Interpretation: We are 95% confident the true")
print(f"satisfaction rate is between {ci_lower*100:.1f}% and {ci_upper*100:.1f}%")
Practice: Confidence Intervals
Scenario: A political pollster wants to estimate voter support with a margin of error of ±3 percentage points at 95% confidence. They use the conservative value σ = 50 percentage points (the maximum standard deviation for a proportion, which occurs at p = 0.5).
Task: How many voters must be surveyed? What if they want 99% confidence with the same margin?
Show Solution
import numpy as np
from scipy import stats
margin = 3 # percentage points
sigma = 50 # maximum std for a proportion (p = 0.5), in percentage points
# For 95% confidence
z_95 = 1.96
n_95 = (z_95 * sigma / margin) ** 2
print(f"95% CI: Need n = {np.ceil(n_95):.0f} voters")
# For 99% confidence
z_99 = stats.norm.ppf(0.995) # 2.576
n_99 = (z_99 * sigma / margin) ** 2
print(f"99% CI: Need n = {np.ceil(n_99):.0f} voters")
increase = (n_99 - n_95) / n_95 * 100
print(f"\n99% requires {increase:.0f}% more respondents than 95%")
Scenario: An e-commerce company sampled 64 deliveries and found mean = 3.2 days, sample std = 0.8 days. Management wants to advertise guaranteed delivery times.
Task: Calculate the 95% CI for mean delivery time. What delivery guarantee can they safely advertise (upper bound)?
Show Solution
from scipy import stats
import numpy as np
mean, s, n = 3.2, 0.8, 64
se = s / np.sqrt(n)
# Using t-distribution (unknown pop. variance)
t_crit = stats.t.ppf(0.975, df=n-1)
ci_lower = mean - t_crit * se
ci_upper = mean + t_crit * se
print(f"95% CI: ({ci_lower:.2f}, {ci_upper:.2f}) days")
print(f"\nRecommendation: Guarantee delivery within {np.ceil(ci_upper):.0f} days")
print(f"(Upper bound rounded up for safety)")
Hypothesis Testing
Hypothesis testing is a formal procedure for making decisions about population parameters based on sample data. It provides a framework for determining whether observed effects are statistically significant or could have occurred by chance alone.
Null and Alternative Hypotheses
Null Hypothesis (H₀): The default assumption, typically stating no effect or no difference. We assume this is true until evidence suggests otherwise.
Alternative Hypothesis (H₁ or Hₐ): What we want to prove, claiming there IS an effect or difference.
The Hypothesis Testing Process
1. State Hypotheses
Define H₀ (null) and H₁ (alternative) based on your research question.
2. Set Alpha (α)
Choose significance level (typically 0.05). This is your threshold for "surprising" results.
3. Calculate Test Statistic
Compute z-score, t-score, or other statistic from sample data.
4. Make Decision
Compare p-value to α, or test statistic to critical value.
One-Sample Z-Test
Use when testing a claim about a population mean when the population standard deviation is known.
import numpy as np
from scipy import stats
# Example: A company claims average delivery time is 30 minutes
# We sample 50 deliveries and find mean = 32.5 minutes
# Population std is known to be 5 minutes
# Step 1: State hypotheses
# H0: μ = 30 (delivery time equals claim)
# H1: μ ≠ 30 (delivery time differs from claim) - two-tailed
mu_0 = 30 # Claimed mean
sample_mean = 32.5
sigma = 5 # Known population std
n = 50
alpha = 0.05
# Step 2: Calculate test statistic
se = sigma / np.sqrt(n)
z_stat = (sample_mean - mu_0) / se
# Step 3: Calculate p-value (two-tailed)
p_value = 2 * (1 - stats.norm.cdf(abs(z_stat)))
# Step 4: Make decision
print(f"Z-statistic: {z_stat:.3f}")
print(f"P-value: {p_value:.4f}")
print(f"Alpha: {alpha}")
if p_value < alpha:
    print("Decision: Reject H0")
    print("Conclusion: Delivery time differs significantly from 30 min")
else:
    print("Decision: Fail to reject H0")
    print("Conclusion: No significant evidence against the claim")
Types of Errors
Type I Error (α)
Rejecting H₀ when it is actually true (false positive).
Example: Concluding a drug works when it does not.
Probability = α (significance level)
Type II Error (β)
Failing to reject H₀ when it is actually false (false negative).
Example: Missing that a drug actually works.
Power = 1 - β
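The Type I error rate can be verified by simulation: when H₀ is actually true, a test at α = 0.05 rejects about 5% of the time. A sketch with arbitrary population parameters:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
mu_0, sigma, n = 30, 5, 50   # arbitrary values; H0 is true by construction
alpha = 0.05

rejections = 0
trials = 10000
for _ in range(trials):
    sample = rng.normal(mu_0, sigma, n)   # data generated under H0
    z = (sample.mean() - mu_0) / (sigma / np.sqrt(n))
    p = 2 * (1 - stats.norm.cdf(abs(z)))  # two-tailed p-value
    if p < alpha:
        rejections += 1

print(f"False positive rate: {rejections / trials:.3f}")  # close to alpha
```

The long-run false positive rate matches α, which is why α is called the Type I error probability.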
One-Tailed vs Two-Tailed Tests
# One-tailed test: Testing if mean is GREATER than value
# H0: μ ≤ 30
# H1: μ > 30 (right-tailed)
# Right-tailed p-value
p_value_right = 1 - stats.norm.cdf(z_stat)
print(f"Right-tailed p-value: {p_value_right:.4f}")
# One-tailed test: Testing if mean is LESS than value
# H0: μ ≥ 30
# H1: μ < 30 (left-tailed)
# Left-tailed p-value
p_value_left = stats.norm.cdf(z_stat)
print(f"Left-tailed p-value: {p_value_left:.4f}")
# Two-tailed test: Testing if mean is DIFFERENT from value
# H0: μ = 30
# H1: μ ≠ 30
p_value_two = 2 * min(p_value_left, p_value_right)
print(f"Two-tailed p-value: {p_value_two:.4f}")
Practice: Hypothesis Testing
Scenario: Before redesign, average time on page was 2.5 minutes. After redesign, a sample of 100 visitors shows mean = 2.8 minutes with s = 1.2 minutes.
Task: Set up the appropriate hypotheses (what type of test?) and determine if the redesign significantly increased engagement at α = 0.05.
Show Solution
from scipy import stats
import numpy as np
# H0: μ ≤ 2.5 (no increase)
# H1: μ > 2.5 (time increased) - RIGHT-TAILED
mu_0, x_bar, s, n = 2.5, 2.8, 1.2, 100
se = s / np.sqrt(n)
t_stat = (x_bar - mu_0) / se
p_value = 1 - stats.t.cdf(t_stat, df=n-1) # Right-tailed
print(f"t-statistic: {t_stat:.3f}")
print(f"p-value: {p_value:.4f}")
print(f"\nDecision: {'Reject H0' if p_value < 0.05 else 'Fail to reject H0'}")
print("The redesign significantly increased time on page!")
Scenario: A supplier claims their components have a defect rate of at most 2%. You test 500 components and find 15 defects.
Task: Perform a complete hypothesis test: state hypotheses, calculate test statistic, find p-value, and make a business recommendation at α = 0.05.
Show Solution
from scipy import stats
import numpy as np
# H0: p ≤ 0.02 (defect rate at most 2%)
# H1: p > 0.02 (defect rate exceeds claim) - RIGHT-TAILED
n, x = 500, 15
p_hat = x / n # 0.03 or 3%
p_0 = 0.02
# Z-test for proportions
se = np.sqrt(p_0 * (1 - p_0) / n)
z_stat = (p_hat - p_0) / se
p_value = 1 - stats.norm.cdf(z_stat)
print(f"Observed defect rate: {p_hat:.1%}")
print(f"Claimed rate: {p_0:.1%}")
print(f"\nZ-statistic: {z_stat:.3f}")
print(f"P-value: {p_value:.4f}")
print(f"\nDecision: {'Reject H0' if p_value < 0.05 else 'Fail to reject H0'}")
if p_value < 0.05:
    print("\nRecommendation: The supplier's claim appears false.")
    print("Consider renegotiating terms or finding an alternative supplier.")
Scenario: Two groups tested: Control (n=10000, mean=50.0, std=10) and Treatment (n=10000, mean=50.3, std=10). The t-test shows p = 0.02.
Task: Is this result practically significant? Calculate Cohen's d and explain why statistical significance doesn't always mean practical importance.
Show Solution
import numpy as np
mean1, mean2 = 50.0, 50.3
std_pooled = 10 # Both groups have same std
n1 = n2 = 10000
# Cohen's d (effect size)
cohens_d = (mean2 - mean1) / std_pooled
print(f"Mean difference: {mean2 - mean1:.2f}")
print(f"Cohen's d: {cohens_d:.3f}")
print("\nEffect size interpretation:")
print(" d < 0.2: negligible")
print(" d = 0.2: small")
print(" d = 0.5: medium")
print(" d = 0.8: large")
print(f"\nOur d = {cohens_d:.2f} is NEGLIGIBLE")
print("\nConclusion: Despite p = 0.02 being statistically significant,")
print("the effect size is trivial (0.03 standard deviations).")
print("The large sample size detected a real but meaningless difference.")
print("NOT practically significant - don't implement the treatment!")
T-Tests and ANOVA
T-tests are used when comparing means with small samples or unknown population variance. ANOVA (Analysis of Variance) extends this to compare means across three or more groups. These are among the most commonly used statistical tests in data science and research.
T-Distribution
The t-distribution is similar to the normal distribution but with heavier tails. It accounts for the extra uncertainty when estimating population variance from a sample. As sample size increases, it approaches the normal distribution.
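The heavier tails are visible in the critical values, which shrink toward the normal's 1.96 as degrees of freedom increase:

```python
from scipy import stats

# 97.5th percentile: the two-sided 95% critical value
z_crit = stats.norm.ppf(0.975)
print(f"z (normal):   {z_crit:.3f}")
for df in [5, 10, 30, 100, 1000]:
    t_crit = stats.t.ppf(0.975, df)
    print(f"t (df={df:4d}): {t_crit:.3f}")
```

With df = 5 the critical value is about 2.57, noticeably wider than 1.96; by df = 1000 the two are practically identical.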
One-Sample T-Test
Tests whether a sample mean differs significantly from a known or hypothesized value when population variance is unknown.
import numpy as np
from scipy import stats
# Example: Testing if mean test score differs from 75
scores = np.array([78, 72, 85, 81, 76, 79, 83, 74, 77, 80])
mu_0 = 75 # Hypothesized mean
# One-sample t-test
t_stat, p_value = stats.ttest_1samp(scores, mu_0)
print(f"Sample mean: {scores.mean():.2f}")
print(f"Sample std: {scores.std(ddof=1):.2f}")
print(f"T-statistic: {t_stat:.3f}")
print(f"P-value: {p_value:.4f}")
alpha = 0.05
if p_value < alpha:
    print(f"Reject H0: Mean significantly differs from {mu_0}")
else:
    print(f"Fail to reject H0: No significant difference from {mu_0}")
Independent Two-Sample T-Test
Compares the means of two independent groups to determine if they are significantly different.
# Compare test scores between two teaching methods
method_a = np.array([85, 78, 92, 88, 76, 95, 89, 84])
method_b = np.array([72, 68, 81, 75, 79, 70, 74, 77])
# Independent two-sample t-test
# H0: μA = μB (no difference between methods)
# H1: μA ≠ μB (methods produce different results)
t_stat, p_value = stats.ttest_ind(method_a, method_b)
print(f"Method A mean: {method_a.mean():.2f}")
print(f"Method B mean: {method_b.mean():.2f}")
print(f"Difference: {method_a.mean() - method_b.mean():.2f}")
print(f"T-statistic: {t_stat:.3f}")
print(f"P-value: {p_value:.4f}")
# With Welch's correction (unequal variances)
t_stat_welch, p_value_welch = stats.ttest_ind(method_a, method_b, equal_var=False)
print(f"Welch's t-test p-value: {p_value_welch:.4f}")
Paired T-Test
Used when comparing two related measurements on the same subjects (before/after, matched pairs).
# Before and after training scores for same individuals
before = np.array([65, 72, 68, 70, 75, 69, 71, 67])
after = np.array([70, 78, 72, 75, 82, 74, 77, 73])
# Paired t-test
# H0: μd = 0 (no difference after training)
# H1: μd ≠ 0 (training has effect)
t_stat, p_value = stats.ttest_rel(before, after)
differences = after - before
print(f"Mean difference: {differences.mean():.2f}")
print(f"T-statistic: {t_stat:.3f}")
print(f"P-value: {p_value:.4f}")
if p_value < 0.05:
    print("Training significantly improved scores")
One-Way ANOVA
Compares means across three or more groups to test if at least one group differs significantly.
# Compare sales across three store regions
north = np.array([23, 25, 28, 24, 26, 27, 25])
south = np.array([31, 29, 33, 30, 32, 28, 34])
west = np.array([26, 28, 25, 27, 29, 26, 28])
# One-way ANOVA
# H0: μN = μS = μW (all regions have equal mean sales)
# H1: At least one region differs
f_stat, p_value = stats.f_oneway(north, south, west)
print(f"North mean: {north.mean():.2f}")
print(f"South mean: {south.mean():.2f}")
print(f"West mean: {west.mean():.2f}")
print(f"F-statistic: {f_stat:.3f}")
print(f"P-value: {p_value:.4f}")
if p_value < 0.05:
    print("Significant difference exists between regions")
    print("Use post-hoc tests to identify which groups differ")
Post-Hoc Analysis with Tukey's HSD
# When ANOVA is significant, identify which groups differ
from scipy.stats import tukey_hsd
# Tukey's HSD (Honestly Significant Difference) test
result = tukey_hsd(north, south, west)
print(result)
# Or using statsmodels for more detailed output
from statsmodels.stats.multicomp import pairwise_tukeyhsd
# Combine data
all_sales = np.concatenate([north, south, west])
groups = ['North']*len(north) + ['South']*len(south) + ['West']*len(west)
tukey = pairwise_tukeyhsd(all_sales, groups, alpha=0.05)
print(tukey.summary())
Practice: T-Tests and ANOVA
Scenario: 15 patients had their blood pressure measured before and after taking a new medication.
Data:
Before: [145, 150, 148, 142, 155, 149, 151, 147, 143, 152, 150, 148, 146, 154, 149]
After: [138, 145, 142, 140, 148, 144, 146, 141, 139, 145, 143, 142, 140, 147, 144]
Task: Perform the appropriate test and determine if the drug significantly reduces blood pressure at α = 0.05.
Show Solution
from scipy import stats
import numpy as np
before = [145, 150, 148, 142, 155, 149, 151, 147, 143, 152, 150, 148, 146, 154, 149]
after = [138, 145, 142, 140, 148, 144, 146, 141, 139, 145, 143, 142, 140, 147, 144]
# Paired t-test (same patients measured twice)
t_stat, p_value = stats.ttest_rel(before, after)
# One-tailed: we expect after < before
p_one_tailed = p_value / 2 # Since t > 0 means before > after
print(f"Mean Before: {np.mean(before):.1f}")
print(f"Mean After: {np.mean(after):.1f}")
print(f"Mean Reduction: {np.mean(before) - np.mean(after):.1f} mmHg")
print(f"\nt-statistic: {t_stat:.3f}")
print(f"p-value (one-tailed): {p_one_tailed:.6f}")
print(f"\nConclusion: {'Drug significantly reduces BP' if p_one_tailed < 0.05 else 'No significant effect'}")
Scenario: A retail chain wants to compare customer satisfaction scores across three stores.
Data:
Store A: [85, 88, 82, 90, 87, 84, 89]
Store B: [78, 82, 75, 80, 77, 83, 79]
Store C: [88, 91, 85, 93, 89, 87, 92]
Task: Perform ANOVA and post-hoc analysis to determine which stores differ significantly.
Show Solution
from scipy import stats
import numpy as np
from itertools import combinations
store_a = [85, 88, 82, 90, 87, 84, 89]
store_b = [78, 82, 75, 80, 77, 83, 79]
store_c = [88, 91, 85, 93, 89, 87, 92]
# One-way ANOVA
f_stat, p_value = stats.f_oneway(store_a, store_b, store_c)
print(f"Store A mean: {np.mean(store_a):.1f}")
print(f"Store B mean: {np.mean(store_b):.1f}")
print(f"Store C mean: {np.mean(store_c):.1f}")
print(f"\nANOVA F-statistic: {f_stat:.2f}")
print(f"p-value: {p_value:.6f}")
print(f"\n{'At least one store differs significantly' if p_value < 0.05 else 'No significant differences'}")
# Post-hoc: Pairwise t-tests with Bonferroni correction
if p_value < 0.05:
    print("\nPost-hoc Analysis (Bonferroni corrected):")
    stores = {'A': store_a, 'B': store_b, 'C': store_c}
    alpha_corrected = 0.05 / 3  # 3 pairwise comparisons
    for (name1, data1), (name2, data2) in combinations(stores.items(), 2):
        _, p = stats.ttest_ind(data1, data2)
        sig = "*" if p < alpha_corrected else ""
        print(f"  {name1} vs {name2}: p = {p:.4f} {sig}")
Chi-Square Tests
Chi-square tests are used for categorical data to test relationships between variables or whether observed frequencies match expected frequencies. Unlike t-tests and ANOVA which compare means, chi-square tests work with counts and proportions.
Chi-Square Statistic
The chi-square statistic measures how much observed frequencies deviate from expected frequencies. Larger values indicate greater deviation from the null hypothesis.
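The formula is χ² = Σ (O − E)² / E, summed over all cells. A quick check that computing it by hand matches scipy, using illustrative counts from a fair-die test:

```python
import numpy as np
from scipy import stats

observed = np.array([8, 12, 9, 11, 14, 6])
expected = np.array([10, 10, 10, 10, 10, 10])

# Chi-square by hand: squared deviations, each scaled by its expected count
chi2_manual = ((observed - expected) ** 2 / expected).sum()
chi2_scipy, _ = stats.chisquare(observed, expected)

print(f"Manual: {chi2_manual:.3f}")  # 4.200
print(f"scipy:  {chi2_scipy:.3f}")   # 4.200
```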
Chi-Square Goodness of Fit Test
Tests whether observed frequencies match expected frequencies (testing a distribution).
import numpy as np
from scipy import stats
# Example: Testing if a die is fair
# Rolled 60 times, expected 10 for each face
observed = np.array([8, 12, 9, 11, 14, 6]) # Actual counts
expected = np.array([10, 10, 10, 10, 10, 10]) # Expected if fair
# Chi-square goodness of fit
# H0: Die is fair (frequencies match expected)
# H1: Die is not fair
chi2_stat, p_value = stats.chisquare(observed, expected)
print(f"Observed: {observed}")
print(f"Expected: {expected}")
print(f"Chi-square statistic: {chi2_stat:.3f}")
print(f"P-value: {p_value:.4f}")
print(f"Degrees of freedom: {len(observed) - 1}")
if p_value < 0.05:
    print("Reject H0: Die is not fair")
else:
    print("Fail to reject H0: No evidence die is unfair")
Chi-Square Test of Independence
Tests whether two categorical variables are independent or associated.
# Example: Is customer satisfaction related to product type?
# Contingency table: rows = satisfaction, columns = product
# Observed frequencies
# Product A Product B Product C
# Satisfied 50 40 30
# Unsatisfied 20 35 25
observed = np.array([
[50, 40, 30], # Satisfied
[20, 35, 25] # Unsatisfied
])
# Chi-square test of independence
# H0: Satisfaction and product are independent
# H1: There is an association between satisfaction and product
chi2_stat, p_value, dof, expected = stats.chi2_contingency(observed)
print("Observed frequencies:")
print(observed)
print("\nExpected frequencies (if independent):")
print(expected.round(2))
print(f"\nChi-square statistic: {chi2_stat:.3f}")
print(f"Degrees of freedom: {dof}")
print(f"P-value: {p_value:.4f}")
if p_value < 0.05:
print("Variables are associated (not independent)")
else:
print("No significant association found")
Effect Size: Cramér's V
Measures the strength of association between categorical variables (0 to 1).
# Cramér's V - effect size for chi-square
def cramers_v(contingency_table):
    """Calculate Cramér's V for association strength."""
    chi2 = stats.chi2_contingency(contingency_table)[0]
    n = contingency_table.sum()
    min_dim = min(contingency_table.shape) - 1
    return np.sqrt(chi2 / (n * min_dim))
v = cramers_v(observed)
print(f"Cramér's V: {v:.3f}")
# Interpretation guide
if v < 0.1:
    strength = "negligible"
elif v < 0.2:
    strength = "weak"
elif v < 0.4:
    strength = "moderate"
elif v < 0.6:
    strength = "relatively strong"
else:
    strength = "strong"
Assumptions for Chi-Square Tests
Requirements
- Random sampling
- Independent observations
- Expected frequency ≥ 5 in each cell
- Categorical (not continuous) data
If Assumptions Violated
- Small expected counts: Use Fisher's exact test
- Combine categories if possible
- Use simulation-based methods
- Report limitation in findings
# Fisher's Exact Test for small samples (2x2 tables)
# When expected frequencies are too small for chi-square
# Example: Treatment success in small study
# Success Failure
# Treatment 8 2
# Control 3 7
table = np.array([[8, 2], [3, 7]])
# Fisher's exact test
odds_ratio, p_value = stats.fisher_exact(table)
print(f"Odds ratio: {odds_ratio:.2f}")
print(f"P-value: {p_value:.4f}")
print("Treatment is significantly more effective" if p_value < 0.05 else "No significant difference")
Practice: Chi-Square Tests
Scenario: A casino's quality control rolled a die 600 times with these results: 1→95, 2→108, 3→92, 4→102, 5→97, 6→106.
Task: Perform a goodness-of-fit test to determine if the die is fair at α = 0.05. What would you recommend?
Show Solution
from scipy import stats
import numpy as np
observed = np.array([95, 108, 92, 102, 97, 106])
expected = np.array([100, 100, 100, 100, 100, 100]) # Fair die
chi2, p_value = stats.chisquare(observed, expected)
print(f"Observed: {observed}")
print(f"Expected: {expected}")
print(f"\nχ² statistic: {chi2:.2f}")
print(f"p-value: {p_value:.4f}")
print(f"Degrees of freedom: {len(observed) - 1}")
print(f"\nDecision: {'Reject H0 (die is NOT fair)' if p_value < 0.05 else 'Fail to reject H0 (die appears fair)'}")
print("\nRecommendation: The die can remain in play.")
Scenario: A marketing team surveyed customers about product preference across age groups:
| Age Group | Product A | Product B | Product C |
|---|---|---|---|
| 18-34 | 45 | 30 | 25 |
| 35-54 | 35 | 40 | 45 |
| 55+ | 20 | 50 | 30 |
Task: Test if product preference is independent of age group. Calculate the effect size (Cramér's V) and interpret the relationship strength.
Show Solution
from scipy import stats
import numpy as np
# Contingency table
data = np.array([
[45, 30, 25], # 18-34
[35, 40, 45], # 35-54
[20, 50, 30] # 55+
])
chi2, p_value, dof, expected = stats.chi2_contingency(data)
# Cramér's V (effect size)
n = data.sum()
min_dim = min(data.shape) - 1
cramers_v = np.sqrt(chi2 / (n * min_dim))
print("Contingency Table:")
print(data)
print(f"\nExpected frequencies:")
print(np.round(expected, 1))
print(f"\nχ² = {chi2:.2f}")
print(f"p-value = {p_value:.4f}")
print(f"Degrees of freedom = {dof}")
print(f"\nCramér's V = {cramers_v:.3f}")
print("Effect size interpretation:")
print(" V < 0.1: negligible")
print(" V ≈ 0.1: small")
print(" V ≈ 0.3: medium")
print(" V ≈ 0.5: large")
print(f"\nConclusion: {'Preference depends on age group' if p_value < 0.05 else 'No significant relationship'}")
Key Takeaways
Sample to Population
Inferential statistics uses sample data to make conclusions about populations, accounting for sampling variability through standard error and confidence intervals
Central Limit Theorem
The sampling distribution of the mean becomes approximately normal regardless of the population distribution when sample size is large enough (n ≥ 30)
Confidence Intervals
CIs provide a range of plausible values for parameters, with 95% CI meaning 95% of similarly constructed intervals would contain the true value
Hypothesis Testing
Compare p-value to significance level (α). If p < α, reject the null hypothesis. Statistical significance does not always mean practical significance
Choose the Right Test
Use t-tests for comparing 2 groups, ANOVA for 3+ groups, and chi-square for categorical data. Check assumptions before applying any test
Avoid Common Errors
Type I error is false positive (rejecting true H₀), Type II is false negative (missing true effect). Balance these based on the consequences of each error