Assignment 4: Statistics Challenge | Data Analytics Course

Assignment Overview

In this comprehensive statistics assignment, you will work as a Business Analyst at DataMart E-Commerce. The company has collected vast amounts of sales, customer, and marketing data but lacks statistical insights to make informed business decisions. Your task is to apply statistical methods to uncover patterns, test hypotheses, and provide actionable recommendations.

Objectives

Calculate and interpret descriptive statistics
Work with probability distributions
Conduct hypothesis tests (t-tests, chi-square)
Perform A/B testing analysis
Create statistical reports with insights
Visualize statistical findings

Skills Tested

Measures of central tendency & dispersion
Normal distribution & z-scores
Confidence intervals
One-sample & two-sample t-tests
Chi-square tests for independence
P-value interpretation

Deliverables

statistics_analysis.ipynb (notebook)
descriptive_stats_report.csv
hypothesis_test_results.txt
statistical_insights.pdf

The Scenario

📧 Email from Jennifer Martinez, VP of Analytics

"Welcome to the DataMart Analytics Team! I need your statistical expertise for several critical business questions we're facing this quarter.

We've been collecting data from our E-Commerce platform, but we need someone who can apply rigorous statistical methods to answer key questions:

Are our average order values significantly different across customer segments?
Is there a relationship between customer demographics and product preferences?
Did our recent website redesign actually improve conversion rates?
What's the probability of hitting our Q4 revenue targets based on current trends?

I've attached four datasets covering sales transactions, customer profiles, marketing campaigns, and A/B test results. Your assignment is to conduct comprehensive statistical analysis and provide data-driven recommendations.

The executive team is making decisions next week, so accuracy and clear interpretation are critical. Show your work, explain your assumptions, and don't just give numbers--tell us what they mean for the business!"

-- Jennifer Martinez, VP of Analytics

The Datasets

Download all four datasets below. Each dataset contains real-world E-Commerce data that requires statistical analysis.

datamart_sales.csv

Sales transaction data including order values, dates, customer segments, product categories, and payment methods.

500 records | Complete data | Multiple segments | 6 months coverage

Key Variables:

order_id - Unique transaction identifier
order_value - Total purchase amount in USD (numeric)
order_date - Transaction timestamp
customer_segment - Premium, Standard, or Basic (categorical)
product_category - Electronics, Fashion, Home, Books (categorical)
payment_method - Credit Card, PayPal, Debit Card (categorical)

datamart_customers.csv

Customer demographic and behavioral data, including age, location, lifetime value, and satisfaction scores.

350 records | Demographics included | Satisfaction scores | Lifetime value data

Key Variables:

customer_id - Unique customer identifier
age - Customer age in years (numeric)
gender - Male, Female, Other (categorical)
city_tier - Tier 1, Tier 2, Tier 3 cities (categorical)
lifetime_value - Total customer spend in USD (numeric)
satisfaction_score - Rating from 1-10 (ordinal)

datamart_campaigns.csv

Marketing campaign performance metrics including impressions, clicks, conversions, and ROI data.

100 campaigns | Multiple channels | Conversion tracking | ROI calculated

Key Variables:

campaign_id - Unique campaign identifier
channel - Email, Social Media, Search Ads, Display (categorical)
impressions - Number of ad views (numeric)
clicks - Number of ad clicks (numeric)
conversions - Number of purchases (numeric)
spend - Campaign cost in USD (numeric)

datamart_ab_test.csv

A/B test data from website redesign experiment comparing old vs new homepage designs on conversion rates.

1000 records | Control vs Treatment | Randomized assignment | Conversion tracked

Key Variables:

user_id - Unique visitor identifier
group - Control (old design) or Treatment (new design)
page_views - Number of pages visited (numeric)
time_on_site - Session duration in seconds (numeric)
converted - 1 if purchased, 0 if not (binary)
device_type - Desktop, Mobile, Tablet (categorical)

Requirements

Exercise 1: Descriptive Statistics Analysis (25 points)

Sales Transaction Analysis

Using the datamart_sales.csv dataset, conduct a comprehensive descriptive statistical analysis of order values across different customer segments and product categories.

Overall Sales Statistics (5 points)
- Calculate the mean, median, mode, standard deviation, and variance of order_value
- Find the range (min and max values)
- Calculate the coefficient of variation (CV = std dev / mean)
- Identify and explain any outliers using the IQR method (values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR)
- Create a box plot showing the distribution of order values
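
The overall statistics and the IQR fences can be sketched as follows. This is a minimal illustration, not a reference solution: the ten order values are made up, and in the assignment the Series would instead come from the order_value column of datamart_sales.csv.

```python
import pandas as pd

# Hypothetical order values in USD (assumption for illustration only).
order_value = pd.Series([20.0, 35.0, 40.0, 45.0, 50.0, 55.0, 60.0, 65.0, 70.0, 400.0])

summary = {
    "mean": order_value.mean(),
    "median": order_value.median(),
    "mode": order_value.mode().tolist(),  # may hold several values when there are ties
    "std": order_value.std(),             # sample standard deviation (ddof=1)
    "variance": order_value.var(),        # sample variance (ddof=1)
    "min": order_value.min(),
    "max": order_value.max(),
}
summary["cv"] = summary["std"] / summary["mean"]

# IQR method: anything outside the two fences is flagged as an outlier.
q1, q3 = order_value.quantile([0.25, 0.75])
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = order_value[(order_value < lower_fence) | (order_value > upper_fence)]

print(summary)
print(f"Fences: [{lower_fence}, {upper_fence}]")
print(outliers.tolist())
```

With this toy data the single large order (400) falls above the upper fence. Note how it pulls the mean well above the median, which is the kind of observation the "explain any outliers" step asks for.
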
Segmented Analysis (10 points)
- Group data by customer_segment (Premium, Standard, Basic)
- Calculate mean and median order value for each segment
- Calculate standard deviation for each segment
- Create a comparative table showing all statistics side-by-side
- Visualize using grouped bar charts or violin plots
- Interpret: Which segment has the highest average spend? Which has the most variability?
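
One possible way to build the side-by-side comparison table is a single groupby aggregation. The rows below are invented for illustration; only the column names customer_segment and order_value come from the dataset description.

```python
import pandas as pd

# Hypothetical rows (assumption); load datamart_sales.csv in the real analysis.
sales = pd.DataFrame({
    "customer_segment": ["Premium"] * 3 + ["Standard"] * 3 + ["Basic"] * 3,
    "order_value": [120.0, 150.0, 180.0, 80.0, 90.0, 100.0, 30.0, 40.0, 50.0],
})

# One row per segment, one column per statistic.
segment_stats = (
    sales.groupby("customer_segment")["order_value"]
    .agg(["count", "mean", "median", "std"])
    .sort_values("mean", ascending=False)
)

highest_mean_segment = segment_stats["mean"].idxmax()
most_variable_segment = segment_stats["std"].idxmax()

print(segment_stats)
print("Highest average spend:", highest_mean_segment)
print("Most variability:", most_variable_segment)
```

When segment means differ a lot, comparing raw standard deviations can be misleading, so it may be worth adding the coefficient of variation per segment as well.
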
Product Category Analysis (5 points)
- Calculate the frequency distribution of orders by product_category
- Find the percentage contribution of each category to total revenue
- Identify the category with the highest average order value
- Create a pie chart showing revenue distribution by category
Time Series Trends (5 points)
- Group orders by month (using order_date)
- Calculate monthly average order value and total revenue
- Calculate the month-over-month growth rate
- Create a line chart showing the trend over 6 months
- Interpret: Is there an upward or downward trend? Any seasonal patterns?
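
A minimal sketch of the monthly grouping and month-over-month growth calculation, using made-up dates and values (the real order_date column may need pd.to_datetime applied after loading):

```python
import pandas as pd

# Hypothetical transactions (assumption for illustration only).
sales = pd.DataFrame({
    "order_date": pd.to_datetime([
        "2024-01-05", "2024-01-20", "2024-02-03",
        "2024-02-18", "2024-03-10", "2024-03-25",
    ]),
    "order_value": [100.0, 100.0, 120.0, 120.0, 150.0, 150.0],
})

# Group by calendar month, then compute average order value and total revenue.
monthly = (
    sales.groupby(sales["order_date"].dt.to_period("M"))["order_value"]
    .agg(avg_order_value="mean", total_revenue="sum")
)

# Month-over-month growth in percent; the first month has no predecessor (NaN).
monthly["mom_growth_pct"] = monthly["total_revenue"].pct_change() * 100

print(monthly)
```

Here revenue grows from 200 to 240 to 300, i.e. roughly 20% and then 25% month over month.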

Expected Output: A comprehensive descriptive statistics report saved as descriptive_stats_report.csv containing all calculated metrics organized by segments and categories.

Exercise 2: Probability Distributions & Analysis (20 points)

Campaign Performance Probability

Using the datamart_campaigns.csv dataset, apply probability theory and distribution analysis to understand campaign performance and predict future outcomes.

Click-Through Rate (CTR) Distribution (8 points)
- Calculate CTR for each campaign: CTR = (clicks / impressions) × 100
- Calculate mean CTR and standard deviation across all campaigns
- Create a histogram of CTR values with 20 bins
- Test if CTR follows a normal distribution using visual inspection (Q-Q plot)
- Calculate the probability that a random campaign has CTR > 5% (use empirical distribution)
- Interpret: What percentage of campaigns are performing above average?
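
The CTR formula and the empirical probability can be sketched as below. The five campaigns are invented; an empirical probability is simply the share of observed campaigns that meet the condition.

```python
import pandas as pd

# Hypothetical campaigns (assumption); the real data is in datamart_campaigns.csv.
campaigns = pd.DataFrame({
    "impressions": [10000, 20000, 5000, 8000, 12000],
    "clicks": [300, 1200, 100, 480, 360],
})

# CTR in percent, as defined in the assignment.
campaigns["ctr_pct"] = campaigns["clicks"] / campaigns["impressions"] * 100

# Empirical probability: fraction of campaigns satisfying the condition.
p_ctr_above_5 = (campaigns["ctr_pct"] > 5).mean()
share_above_mean = (campaigns["ctr_pct"] > campaigns["ctr_pct"].mean()).mean()

print(campaigns)
print("P(CTR > 5%):", p_ctr_above_5)
print("Share above the mean CTR:", share_above_mean)
```

In this toy data two of the five campaigns exceed 5%, so the empirical probability is 0.4.
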
Conversion Rate Analysis (8 points)
- Calculate conversion rate: Conversion Rate = (conversions / clicks) × 100
- Assuming conversion rates follow a normal distribution, calculate a z-score for each campaign
- Identify campaigns with z-score > 2 (exceptional performers) and z-score < -2 (underperformers)
- If the average conversion rate is 3% with a standard deviation of 0.8 percentage points, what is the probability that a campaign converts above 4%?
- Use the 68-95-99.7 rule to identify the range where 95% of conversion rates fall
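
As a worked sketch of the two normal-distribution questions above, using the stated mean of 3% and standard deviation of 0.8 (the variable names are my own):

```python
from scipy.stats import norm

mean_cr, sd_cr = 3.0, 0.8  # values given in the exercise, in percent

# z-score of a 4% conversion rate.
z = (4.0 - mean_cr) / sd_cr

# Upper-tail probability P(X > 4); sf(x) is the same as 1 - cdf(x).
p_above_4 = norm.sf(4.0, loc=mean_cr, scale=sd_cr)

# 68-95-99.7 rule: about 95% of values lie within 2 standard deviations.
rule_low, rule_high = mean_cr - 2 * sd_cr, mean_cr + 2 * sd_cr

# The exact central 95% interval uses roughly 1.96 standard deviations.
exact_low, exact_high = norm.interval(0.95, loc=mean_cr, scale=sd_cr)

print(f"z = {z:.2f}, P(X > 4%) = {p_above_4:.4f}")
print(f"Rule of thumb: {rule_low:.2f}% to {rule_high:.2f}%")
print(f"Exact 95%:     {exact_low:.2f}% to {exact_high:.2f}%")
```

The z-score is 1.25, which gives an upper-tail probability of about 0.106, and the rule of thumb puts 95% of conversion rates between 1.4% and 4.6%.
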
ROI Probability Forecasting (4 points)
- Calculate ROI for each campaign: ROI = ((conversions × avg_order_value - spend) / spend) × 100
- Assume average order value is $75
- Describe the empirical distribution of ROI values (for example, with a histogram and summary statistics)
- What's the probability that a future campaign will have positive ROI?
- Business Recommendation: Based on probability analysis, which marketing channel should receive increased budget?
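
A minimal sketch of the ROI calculation and the empirical probability of a positive ROI, overall and per channel. The six campaigns are hypothetical; the $75 average order value is the figure given in the exercise.

```python
import pandas as pd

AVG_ORDER_VALUE = 75.0  # USD, as stated in the exercise

# Hypothetical campaigns (assumption for illustration only).
campaigns = pd.DataFrame({
    "channel": ["Email", "Email", "Search Ads", "Search Ads", "Display", "Display"],
    "conversions": [40, 20, 30, 10, 5, 8],
    "spend": [1000.0, 1000.0, 1500.0, 1000.0, 500.0, 400.0],
})

revenue = campaigns["conversions"] * AVG_ORDER_VALUE
campaigns["roi_pct"] = (revenue - campaigns["spend"]) / campaigns["spend"] * 100

# Empirical probability that a campaign has positive ROI.
p_positive_roi = (campaigns["roi_pct"] > 0).mean()

# Per-channel view to support the budget recommendation.
by_channel = campaigns.groupby("channel")["roi_pct"].agg(
    mean_roi="mean",
    share_positive=lambda s: (s > 0).mean(),
)

print(campaigns)
print("P(ROI > 0):", round(p_positive_roi, 3))
print(by_channel)
```

With only a handful of campaigns per channel these shares are noisy, so a recommendation should mention the sample size behind each estimate.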

Python Hint: Use scipy.stats.norm.cdf() to calculate probabilities from a normal distribution; for an upper-tail probability P(X > x), use 1 - cdf(x). For z-scores: z = (x - mean) / std_dev

Expected Output: Include probability calculations, distribution plots (histograms, Q-Q plots), and a written interpretation in your notebook explaining what these probabilities mean for business decisions.

Exercise 3: Hypothesis Testing (25 points)

Customer Segment & Demographics Analysis

Conduct rigorous hypothesis tests to answer critical business questions using both the datamart_sales.csv and datamart_customers.csv datasets.

Two-Sample T-Test: Premium vs Standard Customers (10 points)
- Research Question: Do Premium customers have significantly higher average order values than Standard customers?
- Null Hypothesis (H₀): μ_premium = μ_standard (no difference in means)
- Alternative Hypothesis (H₁): μ_premium > μ_standard (Premium customers spend more)
- Set significance level: α = 0.05
- Perform an independent two-sample t-test (use scipy.stats.ttest_ind(); pass alternative="greater" so the test matches the one-sided H₁)
- Report the t-statistic, p-value, and degrees of freedom
- Calculate the 95% confidence interval for the difference in means
- Decision: Reject or fail to reject H₀ based on p-value
- Interpretation: Explain in plain language what this result means for customer segmentation strategy
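
The steps above might look like the following sketch. Two things here are my assumptions rather than requirements of the assignment: the samples are simulated, and I use Welch's variant (equal_var=False), which does not assume equal variances; use the default pooled test if your variance check supports it.

```python
import numpy as np
from scipy import stats

# Simulated order values (assumption); replace with the two segment subsets.
rng = np.random.default_rng(42)
premium = rng.normal(loc=150, scale=40, size=60)
standard = rng.normal(loc=120, scale=35, size=80)

# One-sided Welch t-test: H1 is mean(premium) > mean(standard).
t_stat, p_value = stats.ttest_ind(
    premium, standard, equal_var=False, alternative="greater"
)

# Welch-Satterthwaite degrees of freedom, computed by hand.
v1 = premium.var(ddof=1) / premium.size
v2 = standard.var(ddof=1) / standard.size
df = (v1 + v2) ** 2 / (v1 ** 2 / (premium.size - 1) + v2 ** 2 / (standard.size - 1))

# Two-sided 95% confidence interval for the difference in means.
diff = premium.mean() - standard.mean()
se = np.sqrt(v1 + v2)
t_crit = stats.t.ppf(0.975, df)
ci_low, ci_high = diff - t_crit * se, diff + t_crit * se

# Cohen's d using the pooled standard deviation.
pooled_sd = np.sqrt(
    ((premium.size - 1) * premium.var(ddof=1)
     + (standard.size - 1) * standard.var(ddof=1))
    / (premium.size + standard.size - 2)
)
cohens_d = diff / pooled_sd

print(f"t = {t_stat:.3f}, df = {df:.1f}, p = {p_value:.5f}")
print(f"Difference = {diff:.2f}, 95% CI [{ci_low:.2f}, {ci_high:.2f}], d = {cohens_d:.2f}")
```

The confidence interval here is two-sided even though the test is one-sided; state which convention you use when reporting both.
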
Chi-Square Test: Gender & Product Preference (10 points)
- Research Question: Is there a relationship between customer gender and product category preferences?
- Null Hypothesis (H₀): Gender and product category are independent (no association)
- Alternative Hypothesis (H₁): Gender and product category are dependent (association exists)
- Merge datamart_sales.csv with datamart_customers.csv on customer_id
- Create a contingency table (crosstab) of gender vs product_category
- Perform chi-square test of independence (use scipy.stats.chi2_contingency())
- Report chi-square statistic, p-value, degrees of freedom, and expected frequencies
- Check if all expected frequencies are ≥ 5 (assumption validation)
- Decision: Reject or fail to reject H₀ at α = 0.05
- Interpretation: Should marketing campaigns be gender-specific for certain product categories?
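
A sketch of the contingency table, the chi-square test, and Cramér's V. The counts are invented and cover only two gender levels for brevity; in the assignment the merged frame would come from joining the two CSV files on customer_id as described above.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical cell counts (assumption), expanded into one row per order.
counts = {
    ("Female", "Books"): 20, ("Female", "Electronics"): 15,
    ("Female", "Fashion"): 40, ("Female", "Home"): 25,
    ("Male", "Books"): 20, ("Male", "Electronics"): 45,
    ("Male", "Fashion"): 15, ("Male", "Home"): 20,
}
rows = [(g, c) for (g, c), n in counts.items() for _ in range(n)]
merged = pd.DataFrame(rows, columns=["gender", "product_category"])

# Contingency table of observed counts.
table = pd.crosstab(merged["gender"], merged["product_category"])

chi2, p_value, dof, expected = chi2_contingency(table)

# Assumption check: every expected count should be at least 5.
all_expected_ok = bool((expected >= 5).all())

# Cramér's V effect size: sqrt(chi2 / (n * (min(rows, cols) - 1))).
n = table.to_numpy().sum()
cramers_v = np.sqrt(chi2 / (n * (min(table.shape) - 1)))

print(table)
print(f"chi2 = {chi2:.3f}, dof = {dof}, p = {p_value:.6f}")
print(f"Expected counts >= 5: {all_expected_ok}, Cramér's V = {cramers_v:.3f}")
```

For this toy table the degrees of freedom are (2 - 1) × (4 - 1) = 3 and the statistic is about 26.9.
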
One-Sample T-Test: Satisfaction Score (5 points)
- Research Question: Is the average customer satisfaction score significantly different from the industry benchmark of 7.5?
- Null Hypothesis (H₀): μ = 7.5
- Alternative Hypothesis (H₁): μ ≠ 7.5 (two-tailed test)
- Perform one-sample t-test using datamart_customers.csv satisfaction scores
- Report t-statistic, p-value, and 95% confidence interval
- Decision: Is DataMart performing significantly better or worse than industry average?
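
A minimal sketch of the one-sample test against the 7.5 benchmark, with made-up scores standing in for the satisfaction_score column:

```python
import numpy as np
from scipy import stats

# Hypothetical satisfaction scores (assumption for illustration only).
scores = np.array([8, 7, 9, 6, 8, 7, 10, 8, 7, 9, 8, 6, 9, 8, 7], dtype=float)
benchmark = 7.5  # industry benchmark stated in the exercise

# Two-tailed one-sample t-test (two-sided is the default).
t_stat, p_value = stats.ttest_1samp(scores, popmean=benchmark)

# 95% confidence interval for the mean, based on the t distribution.
n = scores.size
mean = scores.mean()
se = scores.std(ddof=1) / np.sqrt(n)
ci_low, ci_high = stats.t.interval(0.95, n - 1, loc=mean, scale=se)

# Cohen's d for a one-sample test: distance from the benchmark in SD units.
cohens_d = (mean - benchmark) / scores.std(ddof=1)

print(f"mean = {mean:.2f}, t = {t_stat:.3f}, p = {p_value:.4f}")
print(f"95% CI [{ci_low:.2f}, {ci_high:.2f}], d = {cohens_d:.2f}")
```

If the benchmark lies inside the confidence interval, the two-tailed test at α = 0.05 will not reject H₀; the two views always agree.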

Critical Requirements:

Always state null and alternative hypotheses clearly before testing
Check assumptions (normality, equal variances) before running tests
Report exact p-values (not just "< 0.05")
Include effect size calculations (Cohen's d for t-tests, Cramér's V for chi-square)
Provide business interpretation, not just statistical conclusions

Expected Output: A structured report saved as hypothesis_test_results.txt containing all test results with hypotheses, test statistics, p-values, decisions, and interpretations for each test.

Exercise 4: A/B Testing for Website Redesign (30 points)

Conversion Rate Optimization Experiment

DataMart recently ran an A/B test to evaluate whether a new homepage design improves conversion rates compared to the old design. Using datamart_ab_test.csv, conduct a comprehensive A/B test analysis.

Data Exploration & Sample Size Validation (5 points)
- Calculate the sample size for Control group and Treatment group
- Verify that assignment to groups was approximately 50/50 (randomization check)
- Check for any significant differences in baseline metrics (page_views, time_on_site) between groups
- Calculate the minimum detectable effect (MDE) implied by the sample size, stating the significance level and power you assume (for example, α = 0.05 and 80% power)
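
One common normal-approximation formula for the MDE of a two-group proportion test with equal group sizes is MDE ≈ (z_α + z_power) × √(2p(1-p)/n). The sketch below implements that approximation; the function name, the baseline rate of 10%, and the 500 users per group are my assumptions, and statsmodels.stats.power offers more exact alternatives.

```python
from math import sqrt
from statistics import NormalDist


def minimum_detectable_effect(baseline_rate, n_per_group, alpha=0.05, power=0.80):
    """Approximate absolute MDE for a two-sided two-proportion test.

    Hypothetical helper: uses the baseline rate for both groups' variance,
    which is a simplification that works best for small effects.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    se = sqrt(2 * baseline_rate * (1 - baseline_rate) / n_per_group)
    return (z_alpha + z_power) * se


mde = minimum_detectable_effect(baseline_rate=0.10, n_per_group=500)
print(f"MDE is about {mde * 100:.1f} percentage points")
```

With these inputs the experiment can only reliably detect a lift of roughly 5 percentage points, which is worth keeping in mind when interpreting a non-significant result.
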
Conversion Rate Comparison (10 points)
- Calculate conversion rate for Control group: CR_control = (sum of converted) / (total users in control)
- Calculate conversion rate for Treatment group: CR_treatment = (sum of converted) / (total users in treatment)
- Calculate the absolute lift: Lift = CR_treatment - CR_control
- Calculate the relative lift: Relative Lift = (CR_treatment - CR_control) / CR_control × 100%
- Create a bar chart comparing conversion rates with 95% confidence intervals
Statistical Significance Test (10 points)
- Null Hypothesis (H₀): CR_treatment = CR_control (new design has no effect)
- Alternative Hypothesis (H₁): CR_treatment > CR_control (new design improves conversion)
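- Perform a two-proportion z-test using statsmodels.stats.proportion.proportions_ztest() (pass alternative="larger" with the Treatment counts listed first so the test matches the one-sided H₁)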
- Report z-statistic, p-value, and 95% confidence interval for the difference
- Calculate statistical power of the test (use statsmodels.stats.power)
- Decision: At α = 0.05, is the new design significantly better?
Segmented Analysis (3 points)
- Break down conversion rates by device_type (Desktop, Mobile, Tablet)
- Test if the treatment effect varies by device (interaction effect)
- Insight: Does the new design work better on certain devices?
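
A descriptive breakdown by device can be sketched with a pivot table. The counts are invented, and this only shows the per-device lift; a formal interaction test (for example, a logistic regression with a group × device term) would be a separate step.

```python
import pandas as pd

# Hypothetical cells: (device_type, group, users, conversions). Assumption only.
cells = [
    ("Desktop", "Control", 100, 10),
    ("Desktop", "Treatment", 100, 15),
    ("Mobile", "Control", 100, 8),
    ("Mobile", "Treatment", 100, 8),
]

# Expand the counts into one row per user with a binary converted flag.
rows = []
for device, group, users, conversions in cells:
    rows += [(device, group, 1)] * conversions
    rows += [(device, group, 0)] * (users - conversions)
ab = pd.DataFrame(rows, columns=["device_type", "group", "converted"])

# The mean of a 0/1 column is the conversion rate.
rates = ab.pivot_table(
    index="device_type", columns="group", values="converted", aggfunc="mean"
)
rates["abs_lift"] = rates["Treatment"] - rates["Control"]

print(rates)
```

In this toy data the lift appears on Desktop only, which is the pattern the "Insight" question is probing for.
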
Business Recommendation (2 points)
- Based on statistical evidence, should DataMart deploy the new design to 100% of users?
- If yes, estimate the expected increase in conversions and revenue (use historical data)
- If no, explain what additional testing or changes might be needed
- Consider practical significance vs statistical significance

Key Formulas:

Conversion Rate: CR = (conversions / total_users) × 100
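Standard Error: SE = √(p(1-p)/n), where p = conversions / total_users is the conversion rate as a proportion (not a percentage)
95% CI: p ± 1.96 × SE (multiply both bounds by 100 to report them in percent)
Z-test statistic: z = (p1 - p2) / √(p(1-p)(1/n1 + 1/n2)), where p is the pooled proportion (total conversions / total users across both groups)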
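
The formulas above can be checked by hand with a short sketch. The counts are hypothetical (50 of 500 Control users and 75 of 500 Treatment users converting), and only the standard library is used so each term stays visible.

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical counts (assumption for illustration only).
conv_control, n_control = 50, 500
conv_treatment, n_treatment = 75, 500

p_control = conv_control / n_control        # proportions, not percentages
p_treatment = conv_treatment / n_treatment

# Pooled proportion and z statistic, as in the formula above.
p_pooled = (conv_control + conv_treatment) / (n_control + n_treatment)
se_pooled = sqrt(p_pooled * (1 - p_pooled) * (1 / n_control + 1 / n_treatment))
z = (p_treatment - p_control) / se_pooled

# One-sided p-value for H1: CR_treatment > CR_control.
p_value = 1 - NormalDist().cdf(z)

# 95% CI for the difference, using the unpooled standard error.
se_diff = sqrt(
    p_control * (1 - p_control) / n_control
    + p_treatment * (1 - p_treatment) / n_treatment
)
diff = p_treatment - p_control
ci_low, ci_high = diff - 1.96 * se_diff, diff + 1.96 * se_diff

print(f"z = {z:.3f}, one-sided p = {p_value:.4f}")
print(f"Absolute lift = {diff:.3f}, 95% CI [{ci_low:.3f}, {ci_high:.3f}]")
```

For these counts z is about 2.39 and the one-sided p-value is about 0.008. The test statistic uses the pooled standard error because H₀ assumes equal rates, while the confidence interval uses the unpooled one.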

Expected Output: Include detailed A/B test analysis in your notebook with visualizations showing conversion rates by group and device. Save a summary recommendation in statistical_insights.pdf.

Submission Instructions

Submit all files in a single ZIP file named exactly as shown below:

Required ZIP Name

YourName_Statistics_Assignment.zip

Example: JohnDoe_Statistics_Assignment.zip

Required Files

Statistics_Assignment/
├── statistics_analysis.ipynb     # Jupyter Notebook with ALL exercises
├── descriptive_stats_report.csv  # Descriptive statistics summary (Exercise 1)
├── hypothesis_test_results.txt   # Hypothesis test results (Exercise 3)
├── statistical_insights.pdf      # Professional report (2-3 pages)
└── README.md                     # REQUIRED - see contents below

README.md Must Include:

Your full name and submission date
Brief description of your solution approach
Tools and libraries used
Any assumptions made during analysis

Do Include

All exercises completed with outputs visible
Clear section headers for each exercise
Markdown cells explaining your reasoning
Well-commented Python code
Visualizations (box plots, charts)
Business insights and recommendations

Do Not Include

Any .pyc or __pycache__ files
Virtual environment folders
Code that doesn't run without errors
Unexecuted notebook cells
Plagiarized code or analysis

Important: Before submitting, run all cells in your notebook to make sure it executes without errors and generates all output files correctly!

Submit Your Assignment

Upload your ZIP file to submit your assignment

Grading Rubric (100 Points Total)

Category	Criteria	Points
Exercise 1: Descriptive Statistics	All required statistics calculated correctly (10 pts); Proper visualizations (box plots, charts) (8 pts); Clear interpretation of results (7 pts)	25
Exercise 2: Probability Distributions	Correct probability calculations (10 pts); Appropriate use of normal distribution (6 pts); Business insights from probability analysis (4 pts)	20
Exercise 3: Hypothesis Testing	Correct null/alternative hypotheses stated (5 pts); Proper test selection and execution (12 pts); Accurate p-value interpretation (5 pts); Business implications explained (3 pts)	25
Exercise 4: A/B Testing	Correct conversion rate calculations (10 pts); Proper statistical test and interpretation (12 pts); Segmented analysis by device (5 pts); Actionable business recommendation (3 pts)	30
Total Points		100

Bonus Points Opportunities (+10 max)

+3 points: Additional statistical test (e.g., ANOVA for comparing all 3 customer segments simultaneously)
+3 points: Advanced visualizations (interactive plots using Plotly, comprehensive dashboard)
+2 points: Power analysis for hypothesis tests with recommendations on sample size
+2 points: Bayesian interpretation alongside frequentist approach (prior/posterior distributions)
