Assignment 4-A

Statistics Challenge

Apply statistical methods to analyze E-Commerce sales data, customer behavior patterns, and marketing campaign performance. Use descriptive statistics, probability theory, and hypothesis testing to drive data-informed business decisions.

Estimated time: 3-4 hours
Difficulty: Intermediate
Total: 100 Points
What You'll Practice
  • Descriptive statistics (mean, median, std dev)
  • Probability distributions
  • Hypothesis testing (t-tests, chi-square)
  • A/B testing for business decisions
  • Statistical interpretation & reporting

Assignment Overview

In this comprehensive statistics assignment, you will work as a Business Analyst at DataMart E-Commerce. The company has collected vast amounts of sales, customer, and marketing data but lacks statistical insights to make informed business decisions. Your task is to apply statistical methods to uncover patterns, test hypotheses, and provide actionable recommendations.

Objectives
  • Calculate and interpret descriptive statistics
  • Work with probability distributions
  • Conduct hypothesis tests (t-tests, chi-square)
  • Perform A/B testing analysis
  • Create statistical reports with insights
  • Visualize statistical findings
Skills Tested
  • Measures of central tendency & dispersion
  • Normal distribution & z-scores
  • Confidence intervals
  • One-sample & two-sample t-tests
  • Chi-square tests for independence
  • P-value interpretation
Deliverables
  • statistics_analysis.ipynb (notebook)
  • descriptive_stats_report.csv
  • hypothesis_test_results.txt
  • statistical_insights.pdf

The Scenario

📧 Email from Jennifer Martinez, VP of Analytics

"Welcome to the DataMart Analytics Team! I need your statistical expertise for several critical business questions we're facing this quarter.

We've been collecting data from our E-Commerce platform, but we need someone who can apply rigorous statistical methods to answer key questions:

  • Are our average order values significantly different across customer segments?
  • Is there a relationship between customer demographics and product preferences?
  • Did our recent website redesign actually improve conversion rates?
  • What's the probability of hitting our Q4 revenue targets based on current trends?

I've attached four datasets covering sales transactions, customer profiles, marketing campaigns, and A/B test results. Your assignment is to conduct comprehensive statistical analysis and provide data-driven recommendations.

The executive team is making decisions next week, so accuracy and clear interpretation are critical. Show your work, explain your assumptions, and don't just give numbers--tell us what they mean for the business!"

-- Jennifer Martinez, VP of Analytics

The Datasets

Download all four datasets below. Each dataset contains real-world E-Commerce data that requires statistical analysis.

datamart_sales.csv

Sales transaction data including order values, dates, customer segments, product categories, and payment methods.

500 records · Complete data · Multiple segments · 6 months coverage
Key Variables:
  • order_id - Unique transaction identifier
  • order_value - Total purchase amount in USD (numeric)
  • order_date - Transaction timestamp
  • customer_segment - Premium, Standard, or Basic (categorical)
  • product_category - Electronics, Fashion, Home, Books (categorical)
  • payment_method - Credit Card, PayPal, Debit Card (categorical)
datamart_customers.csv

Customer demographic and behavioral data including age, location, lifetime value, satisfaction scores.

350 records · Demographics included · Satisfaction scores · Lifetime value data
Key Variables:
  • customer_id - Unique customer identifier
  • age - Customer age in years (numeric)
  • gender - Male, Female, Other (categorical)
  • city_tier - Tier 1, Tier 2, Tier 3 cities (categorical)
  • lifetime_value - Total customer spend in USD (numeric)
  • satisfaction_score - Rating from 1-10 (ordinal)
datamart_campaigns.csv

Marketing campaign performance metrics including impressions, clicks, conversions, and ROI data.

100 campaigns · Multiple channels · Conversion tracking · ROI calculated
Key Variables:
  • campaign_id - Unique campaign identifier
  • channel - Email, Social Media, Search Ads, Display (categorical)
  • impressions - Number of ad views (numeric)
  • clicks - Number of ad clicks (numeric)
  • conversions - Number of purchases (numeric)
  • spend - Campaign cost in USD (numeric)
datamart_ab_test.csv

A/B test data from website redesign experiment comparing old vs new homepage designs on conversion rates.

1000 records · Control vs Treatment · Randomized assignment · Conversion tracked
Key Variables:
  • user_id - Unique visitor identifier
  • group - Control (old design) or Treatment (new design)
  • page_views - Number of pages visited (numeric)
  • time_on_site - Session duration in seconds (numeric)
  • converted - 1 if purchased, 0 if not (binary)
  • device_type - Desktop, Mobile, Tablet (categorical)

Requirements

Exercise 1 Descriptive Statistics Analysis (25 points)

Sales Transaction Analysis

Using the datamart_sales.csv dataset, conduct a comprehensive descriptive statistical analysis of order values across different customer segments and product categories.

  1. Overall Sales Statistics (5 points)
    • Calculate mean, median, mode, standard deviation, variance for order_value
    • Find the range (min and max values)
    • Calculate the coefficient of variation (CV = std dev / mean)
    • Identify and explain any outliers using the IQR method: flag values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR, where IQR = Q3 - Q1
    • Create a box plot showing the distribution of order values
  2. Segmented Analysis (10 points)
    • Group data by customer_segment (Premium, Standard, Basic)
    • Calculate mean and median order value for each segment
    • Calculate standard deviation for each segment
    • Create a comparative table showing all statistics side-by-side
    • Visualize using grouped bar charts or violin plots
    • Interpret: Which segment has the highest average spend? Which has the most variability?
  3. Product Category Analysis (5 points)
    • Calculate the frequency distribution of orders by product_category
    • Find the percentage contribution of each category to total revenue
    • Identify the category with the highest average order value
    • Create a pie chart showing revenue distribution by category
  4. Time Series Trends (5 points)
    • Group orders by month (using order_date)
    • Calculate monthly average order value and total revenue
    • Calculate the month-over-month growth rate
    • Create a line chart showing the trend over 6 months
    • Interpret: Is there an upward or downward trend? Any seasonal patterns?
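The overall statistics and the IQR outlier rule from step 1 can be sketched as follows. This is a minimal illustration on made-up order values, not the real data; the helper name describe_order_values and the toy numbers are assumptions introduced here.

```python
import pandas as pd


def describe_order_values(values: pd.Series) -> dict:
    """Summary statistics plus IQR outlier fences for a numeric series."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return {
        "mean": values.mean(),
        "median": values.median(),
        "mode": values.mode().tolist(),  # a series can have several modes
        "std": values.std(ddof=1),       # sample standard deviation
        "variance": values.var(ddof=1),
        "min": values.min(),
        "max": values.max(),
        "cv": values.std(ddof=1) / values.mean(),
        "lower_fence": lower,
        "upper_fence": upper,
        "outliers": values[(values < lower) | (values > upper)].tolist(),
    }


# Made-up order values in USD; 500.0 is deliberately extreme.
toy = pd.Series([20.0, 25.0, 25.0, 30.0, 35.0, 40.0, 45.0, 500.0])
stats = describe_order_values(toy)
print(stats["median"], stats["lower_fence"], stats["upper_fence"], stats["outliers"])
```

On the real file, the same helper would presumably be called on the order_value column after loading datamart_sales.csv with pandas.read_csv().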
Expected Output: A comprehensive descriptive statistics report saved as descriptive_stats_report.csv containing all calculated metrics organized by segments and categories.
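One possible way to build the side-by-side segment table and the month-over-month growth rate is sketched below. The rows are made up and only reuse the column names listed for datamart_sales.csv; the layout of the final report is left open by the assignment, so the CSV structure here is an assumption.

```python
import pandas as pd

# Made-up rows using the column names listed for datamart_sales.csv.
sales = pd.DataFrame({
    "order_value": [100.0, 200.0, 50.0, 70.0, 20.0, 40.0],
    "order_date": pd.to_datetime([
        "2024-01-05", "2024-02-10", "2024-01-20",
        "2024-02-15", "2024-01-25", "2024-03-01",
    ]),
    "customer_segment": ["Premium", "Premium", "Standard",
                         "Standard", "Basic", "Basic"],
})

# One row per segment with the statistics side by side.
segment_stats = (
    sales.groupby("customer_segment")["order_value"]
    .agg(["count", "mean", "median", "std"])
)

# Monthly totals, monthly averages, and month-over-month growth in percent.
monthly = (
    sales.groupby(sales["order_date"].dt.to_period("M"))["order_value"]
    .agg(total_revenue="sum", avg_order_value="mean")
)
monthly["mom_growth_pct"] = monthly["total_revenue"].pct_change() * 100

# to_csv() without a path returns the CSV text; passing
# "descriptive_stats_report.csv" instead would write the file.
csv_text = segment_stats.to_csv()
print(csv_text)
```

The first month has no predecessor, so its growth rate is NaN by construction.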

Exercise 2 Probability Distributions & Analysis (20 points)

Campaign Performance Probability

Using the datamart_campaigns.csv dataset, apply probability theory and distribution analysis to understand campaign performance and predict future outcomes.

  1. Click-Through Rate (CTR) Distribution (8 points)
    • Calculate CTR for each campaign: CTR = (clicks / impressions) × 100
    • Calculate mean CTR and standard deviation across all campaigns
    • Create a histogram of CTR values with 20 bins
    • Test if CTR follows a normal distribution using visual inspection (Q-Q plot)
    • Calculate the probability that a random campaign has CTR > 5% (use empirical distribution)
    • Interpret: What percentage of campaigns are performing above average?
  2. Conversion Rate Analysis (8 points)
    • Calculate conversion rate: Conversion Rate = (conversions / clicks) × 100
    • Assuming conversion rate follows a normal distribution, calculate the z-score for each campaign
    • Identify campaigns with z-score > 2 (exceptional performers) and z-score < -2 (underperformers)
    • If average conversion rate is 3% with std dev of 0.8%, what's the probability a campaign converts above 4%?
    • Use the 68-95-99.7 rule to identify the range where 95% of conversion rates fall
  3. ROI Probability Forecasting (4 points)
    • Calculate ROI for each campaign: ROI = ((conversions × avg_order_value - spend) / spend) × 100
    • Assume average order value is $75
    • Describe the empirical distribution of ROI values (for example with a histogram, mean, and standard deviation)
    • What's the probability that a future campaign will have positive ROI?
    • Business Recommendation: Based on probability analysis, which marketing channel should receive increased budget?
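The rate formulas, z-scores, and empirical probabilities in the steps above can be sketched on a few made-up campaigns. The numbers are invented for illustration; only the column names and the $75 average order value come from the assignment.

```python
import pandas as pd

AVG_ORDER_VALUE = 75.0  # USD, as stated in the ROI step

# Made-up campaigns using the column names listed for datamart_campaigns.csv.
campaigns = pd.DataFrame({
    "channel": ["Email", "Email", "Search Ads", "Display"],
    "impressions": [10_000, 20_000, 5_000, 40_000],
    "clicks": [600, 800, 400, 400],
    "conversions": [30, 24, 20, 4],
    "spend": [1_000.0, 2_000.0, 500.0, 1_500.0],
})

campaigns["ctr_pct"] = campaigns["clicks"] / campaigns["impressions"] * 100
campaigns["conv_rate_pct"] = campaigns["conversions"] / campaigns["clicks"] * 100
campaigns["roi_pct"] = (
    (campaigns["conversions"] * AVG_ORDER_VALUE - campaigns["spend"])
    / campaigns["spend"] * 100
)

# z-score of each campaign's conversion rate relative to all campaigns.
conv = campaigns["conv_rate_pct"]
campaigns["conv_z"] = (conv - conv.mean()) / conv.std(ddof=1)

# An empirical probability is the share of campaigns meeting a condition.
p_ctr_above_5 = (campaigns["ctr_pct"] > 5).mean()
p_positive_roi = (campaigns["roi_pct"] > 0).mean()
print(campaigns[["channel", "ctr_pct", "conv_rate_pct", "roi_pct", "conv_z"]])
```

Grouping roi_pct by channel would be one way to support the budget recommendation, though with only 100 campaigns the per-channel estimates may be noisy.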
Python Hint: scipy.stats.norm.cdf() returns P(X ≤ x) for a normal distribution, so an upper-tail probability is 1 - cdf. For z-scores: z = (x - mean) / std_dev
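As a small illustration of the hint, the sketch below computes an upper-tail probability and a 95% range for a normal distribution. The mean, standard deviation, and threshold are hypothetical values chosen only for the example; they are not the figures from the exercise.

```python
from scipy.stats import norm

# Hypothetical parameters, chosen only for illustration.
mean, std_dev, threshold = 10.0, 2.0, 13.0

z = (threshold - mean) / std_dev  # 1.5
p_above = 1 - norm.cdf(threshold, loc=mean, scale=std_dev)
# norm.sf() gives the same upper-tail probability directly.
p_above_sf = norm.sf(threshold, loc=mean, scale=std_dev)

# 68-95-99.7 rule: about 95% of values lie within 2 std devs of the mean.
rule_of_thumb_95 = (mean - 2 * std_dev, mean + 2 * std_dev)
# The exact central 95% interval uses roughly 1.96 std devs instead of 2.
exact_95 = norm.interval(0.95, loc=mean, scale=std_dev)

print(f"z = {z:.2f}, P(X > {threshold}) = {p_above:.4f}")
```

The rule-of-thumb interval is slightly wider than the exact one, which is worth stating when reporting either.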
Expected Output: Include probability calculations, distribution plots (histograms, Q-Q plots), and a written interpretation in your notebook explaining what these probabilities mean for business decisions.

Exercise 3 Hypothesis Testing (25 points)

Customer Segment & Demographics Analysis

Conduct rigorous hypothesis tests to answer critical business questions using both the datamart_sales.csv and datamart_customers.csv datasets.

  1. Two-Sample T-Test: Premium vs Standard Customers (10 points)
    • Research Question: Do Premium customers have significantly higher average order values than Standard customers?
    • Null Hypothesis (H₀): μ_premium = μ_standard (no difference in means)
    • Alternative Hypothesis (H₁): μ_premium > μ_standard (Premium customers spend more)
    • Set significance level: α = 0.05
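    • Perform an independent two-sample t-test (use scipy.stats.ttest_ind() with alternative='greater' to match the one-sided H₁; set equal_var=False if the group variances differ)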
    • Report the t-statistic, p-value, and degrees of freedom
    • Calculate the 95% confidence interval for the difference in means
    • Decision: Reject or fail to reject H₀ based on p-value
    • Interpretation: Explain in plain language what this result means for customer segmentation strategy
  2. Chi-Square Test: Gender & Product Preference (10 points)
    • Research Question: Is there a relationship between customer gender and product category preferences?
    • Null Hypothesis (H₀): Gender and product category are independent (no association)
    • Alternative Hypothesis (H₁): Gender and product category are dependent (association exists)
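    • Merge datamart_sales.csv with datamart_customers.csv on customer_id (confirm that the sales file contains a customer_id column, since it is not listed among its key variables)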
    • Create a contingency table (crosstab) of gender vs product_category
    • Perform chi-square test of independence (use scipy.stats.chi2_contingency())
    • Report chi-square statistic, p-value, degrees of freedom, and expected frequencies
    • Check if all expected frequencies are ≥ 5 (assumption validation)
    • Decision: Reject or fail to reject H₀ at α = 0.05
    • Interpretation: Should marketing campaigns be gender-specific for certain product categories?
  3. One-Sample T-Test: Satisfaction Score (5 points)
    • Research Question: Is the average customer satisfaction score significantly different from the industry benchmark of 7.5?
    • Null Hypothesis (H₀): μ = 7.5
    • Alternative Hypothesis (H₁): μ ≠ 7.5 (two-tailed test)
    • Perform one-sample t-test using datamart_customers.csv satisfaction scores
    • Report t-statistic, p-value, and 95% confidence interval
    • Decision: Is DataMart performing significantly better or worse than industry average?
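A minimal sketch of the one-sided two-sample test from step 1 follows, using made-up order values. It uses Welch's variant (equal_var=False), which is a choice made here rather than something the assignment prescribes, and it assumes a SciPy version that supports the alternative argument (1.6 or later). The degrees of freedom, confidence interval, and Cohen's d are computed by hand.

```python
import numpy as np
from scipy import stats

# Made-up order values; the real analysis would filter datamart_sales.csv
# by customer_segment instead.
premium = np.array([120.0, 135.0, 150.0, 110.0, 160.0, 145.0, 130.0, 155.0])
standard = np.array([90.0, 105.0, 95.0, 100.0, 85.0, 110.0, 80.0, 115.0])

# Welch's t-test (equal_var=False) does not assume equal variances.
# alternative="greater" matches H1: mean(premium) > mean(standard).
t_stat, p_value = stats.ttest_ind(
    premium, standard, equal_var=False, alternative="greater"
)

# Welch-Satterthwaite degrees of freedom and a two-sided 95% CI for the
# difference in means.
v1 = premium.var(ddof=1) / premium.size
v2 = standard.var(ddof=1) / standard.size
se = np.sqrt(v1 + v2)
df = (v1 + v2) ** 2 / (v1 ** 2 / (premium.size - 1) + v2 ** 2 / (standard.size - 1))
diff = premium.mean() - standard.mean()
margin = stats.t.ppf(0.975, df) * se
ci_95 = (diff - margin, diff + margin)

# Cohen's d with a pooled standard deviation (group sizes are equal here).
pooled_sd = np.sqrt((premium.var(ddof=1) + standard.var(ddof=1)) / 2)
cohens_d = diff / pooled_sd

print(f"t = {t_stat:.3f}, df = {df:.1f}, p = {p_value:.6f}, d = {cohens_d:.2f}")
```

With unequal group sizes, the pooled standard deviation in Cohen's d would need to weight each variance by its degrees of freedom.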
Critical Requirements:
  • Always state null and alternative hypotheses clearly before testing
  • Check assumptions (normality, equal variances) before running tests
  • Report exact p-values (not just "< 0.05")
  • Include effect size calculations (Cohen's d for t-tests, Cramér's V for chi-square)
  • Provide business interpretation, not just statistical conclusions
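The chi-square test, its expected-frequency check, and Cramér's V can be sketched as below. The merged table is made up and deliberately skewed so the association is obvious; turning off the continuity correction is a choice made for this 2x2 illustration, not a requirement of the assignment.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Made-up merged rows: one row per order with the customer's gender attached.
merged = pd.DataFrame({
    "gender": ["Male"] * 40 + ["Female"] * 40,
    "product_category": (
        ["Electronics"] * 30 + ["Fashion"] * 10      # Male orders
        + ["Electronics"] * 10 + ["Fashion"] * 30    # Female orders
    ),
})

table = pd.crosstab(merged["gender"], merged["product_category"])

# correction=False turns off the Yates continuity correction, which SciPy
# otherwise applies to 2x2 tables.
chi2, p_value, dof, expected = chi2_contingency(table, correction=False)

# Assumption check: every expected cell count should be at least 5.
expected_ok = bool((expected >= 5).all())

# Cramér's V: sqrt(chi2 / (n * (min(rows, cols) - 1))).
n = table.to_numpy().sum()
cramers_v = np.sqrt(chi2 / (n * (min(table.shape) - 1)))

print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p_value:.6f}, V = {cramers_v:.2f}")
```

The real table has three genders and four categories, so dof would be (3 - 1) × (4 - 1) = 6 and small cells such as "Other" may fail the expected-frequency check.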
Expected Output: A structured report saved as hypothesis_test_results.txt containing all test results with hypotheses, test statistics, p-values, decisions, and interpretations for each test.

Exercise 4 A/B Testing for Website Redesign (30 points)

Conversion Rate Optimization Experiment

DataMart recently ran an A/B test to evaluate whether a new homepage design improves conversion rates compared to the old design. Using datamart_ab_test.csv, conduct a comprehensive A/B test analysis.

  1. Data Exploration & Sample Size Validation (5 points)
    • Calculate the sample size for Control group and Treatment group
    • Verify that assignment to groups was approximately 50/50 (randomization check)
    • Check for any significant differences in baseline metrics (page_views, time_on_site) between groups
    • Calculate the minimum detectable effect (MDE) based on sample size
  2. Conversion Rate Comparison (10 points)
    • Calculate conversion rate for Control group: CR_control = (sum of converted) / (total users in control)
    • Calculate conversion rate for Treatment group: CR_treatment = (sum of converted) / (total users in treatment)
    • Calculate the absolute lift: Lift = CR_treatment - CR_control
    • Calculate the relative lift: Relative Lift = (CR_treatment - CR_control) / CR_control × 100%
    • Create a bar chart comparing conversion rates with 95% confidence intervals
  3. Statistical Significance Test (10 points)
    • Null Hypothesis (H₀): CR_treatment = CR_control (new design has no effect)
    • Alternative Hypothesis (H₁): CR_treatment > CR_control (new design improves conversion)
    • Perform a two-proportion z-test using statsmodels.stats.proportion.proportions_ztest(), passing the Treatment counts first and alternative='larger' to match the one-sided H₁
    • Report z-statistic, p-value, and 95% confidence interval for the difference
    • Calculate statistical power of the test (use statsmodels.stats.power)
    • Decision: At α = 0.05, is the new design significantly better?
  4. Segmented Analysis (3 points)
    • Break down conversion rates by device_type (Desktop, Mobile, Tablet)
    • Test if the treatment effect varies by device (interaction effect)
    • Insight: Does the new design work better on certain devices?
  5. Business Recommendation (2 points)
    • Based on statistical evidence, should DataMart deploy the new design to 100% of users?
    • If yes, estimate the expected increase in conversions and revenue (use historical data)
    • If no, explain what additional testing or changes might be needed
    • Consider practical significance vs statistical significance
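The randomization check, conversion rates, lift, and device breakdown from the steps above can be sketched on made-up visitors. The data frame below only borrows the column names listed for datamart_ab_test.csv; using a chi-square goodness-of-fit test for the 50/50 check is one reasonable option, not the only one.

```python
import pandas as pd
from scipy.stats import chisquare

# Made-up visitors using the column names listed for datamart_ab_test.csv.
ab = pd.DataFrame({
    "group": ["Control"] * 200 + ["Treatment"] * 200,
    "converted": [1] * 20 + [0] * 180 + [1] * 30 + [0] * 170,
    "device_type": ["Desktop", "Mobile"] * 200,
})

# Randomization check: goodness-of-fit test against an even split.
group_sizes = ab["group"].value_counts()
_, p_split = chisquare(group_sizes)  # expected counts default to equal

# Conversion rate per group is the mean of the 0/1 converted flag.
rates = ab.groupby("group")["converted"].mean()
absolute_lift = rates["Treatment"] - rates["Control"]
relative_lift_pct = absolute_lift / rates["Control"] * 100

# Conversion rate by device and group, one column per group.
by_device = ab.pivot_table(
    index="device_type", columns="group", values="converted", aggfunc="mean"
)
print(rates, by_device, sep="\n")
```

A large p_split means the observed group sizes are consistent with a 50/50 assignment; it does not by itself prove the randomization was sound.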
Key Formulas:
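  • Conversion Rate: CR = (conversions / total_users) × 100, that is, CR = 100 × p where p = conversions / total_users is the conversion proportion
  • Standard Error: SE = √(p(1-p)/n) where p is the conversion proportion and n is the number of users in the group
  • 95% CI: p ± 1.96 × SE (multiply both bounds by 100 to express the interval as a percentage)
  • Z-test statistic: z = (p1 - p2) / √(p(1-p)(1/n1 + 1/n2)) where p1 and p2 are the Treatment and Control proportions and p is the pooled proportion, (total conversions in both groups) / (n1 + n2)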
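The formulas above can be applied directly with the standard library, as in this sketch. The function name two_proportion_ztest and the counts are made up for illustration; statsmodels' proportions_ztest() remains the route named in the exercise, and this hand calculation is mainly a way to sanity-check its output.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_ztest(conv_t: int, n_t: int, conv_c: int, n_c: int) -> dict:
    """One-sided z-test of H1: p_treatment > p_control, with per-group 95% CIs."""
    p_t, p_c = conv_t / n_t, conv_c / n_c
    pooled = (conv_t + conv_c) / (n_t + n_c)
    se_pooled = sqrt(pooled * (1 - pooled) * (1 / n_t + 1 / n_c))
    z = (p_t - p_c) / se_pooled
    p_value = 1 - NormalDist().cdf(z)  # upper tail for a "greater" alternative

    def ci(p: float, n: int) -> tuple:
        # Normal-approximation interval on the proportion scale.
        se = sqrt(p * (1 - p) / n)
        return (p - 1.96 * se, p + 1.96 * se)

    return {"z": z, "p_value": p_value,
            "ci_treatment": ci(p_t, n_t), "ci_control": ci(p_c, n_c)}


# Made-up counts: 30 of 200 treatment users and 20 of 200 control users convert.
result = two_proportion_ztest(conv_t=30, n_t=200, conv_c=20, n_c=200)
print(f"z = {result['z']:.3f}, one-sided p = {result['p_value']:.4f}")
```

In this toy case a 50% relative lift is still not significant at α = 0.05, which illustrates why sample size and power matter when reading an A/B result.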
Expected Output: Include detailed A/B test analysis in your notebook with visualizations showing conversion rates by group and device. Save a summary recommendation in statistical_insights.pdf.

Submission Instructions

Submit all files in a single ZIP file named exactly as shown below:

Required ZIP Name
YourName_Statistics_Assignment.zip
Example: JohnDoe_Statistics_Assignment.zip
Required Files
Statistics_Assignment/
├── statistics_analysis.ipynb     # Jupyter Notebook with ALL exercises
├── descriptive_stats_report.csv  # Descriptive statistics summary (Exercise 1)
├── hypothesis_test_results.txt   # Hypothesis test results (Exercise 3)
├── statistical_insights.pdf      # Professional report (2-3 pages)
└── README.md                     # REQUIRED - see contents below
README.md Must Include:
  • Your full name and submission date
  • Brief description of your solution approach
  • Tools and libraries used
  • Any assumptions made during analysis
Do Include
  • All exercises completed with outputs visible
  • Clear section headers for each exercise
  • Markdown cells explaining your reasoning
  • Well-commented Python code
  • Visualizations (box plots, charts)
  • Business insights and recommendations
Do Not Include
  • Any .pyc or __pycache__ files
  • Virtual environment folders
  • Code that raises errors when run
  • Unexecuted notebook cells
  • Plagiarized code or analysis
Important: Before submitting, run all cells in your notebook to make sure it executes without errors and generates all output files correctly!

Grading Rubric (100 Points Total)

Exercise 1: Descriptive Statistics (25 points)
  • All required statistics calculated correctly (10 pts)
  • Proper visualizations (box plots, charts) (8 pts)
  • Clear interpretation of results (7 pts)
Exercise 2: Probability Distributions (20 points)
  • Correct probability calculations (10 pts)
  • Appropriate use of normal distribution (6 pts)
  • Business insights from probability analysis (4 pts)
Exercise 3: Hypothesis Testing (25 points)
  • Correct null/alternative hypotheses stated (5 pts)
  • Proper test selection and execution (12 pts)
  • Accurate p-value interpretation (5 pts)
  • Business implications explained (3 pts)
Exercise 4: A/B Testing (30 points)
  • Correct conversion rate calculations (10 pts)
  • Proper statistical test and interpretation (12 pts)
  • Segmented analysis by device (5 pts)
  • Actionable business recommendation (3 pts)
Total Points: 100
Bonus Points Opportunities (+10 max)
  • +3 points: Additional statistical test (e.g., ANOVA for comparing all 3 customer segments simultaneously)
  • +3 points: Advanced visualizations (interactive plots using Plotly, comprehensive dashboard)
  • +2 points: Power analysis for hypothesis tests with recommendations on sample size
  • +2 points: Bayesian interpretation alongside frequentist approach (prior/posterior distributions)

Pre-Submission Checklist

Complete this checklist before submitting your assignment to ensure you haven't missed anything important.

  • Code Quality
  • Statistical Rigor
  • Visualizations
  • Deliverables