Project Overview
This project introduces you to unsupervised machine learning through customer segmentation. You will work with the well-known Mall Customers dataset from Kaggle, which contains 200 customers and 5 columns: CustomerID, Gender, Age, Annual Income (k$), and Spending Score (1-100). Your goal is to identify distinct customer groups that a marketing team can target with personalized campaigns.
- Explore: Analyze income, spending patterns, and demographics
- Preprocess: Scale features for optimal clustering
- Cluster: Apply K-Means with the optimal number of clusters
- Profile: Create actionable customer segment profiles
Learning Objectives
Unsupervised Learning
- Understand clustering vs classification
- Implement K-Means and understand how the algorithm works from scratch
- Use the elbow method to find optimal K
- Evaluate clustering with silhouette score
- Handle feature scaling for distance-based algorithms
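To make the "from scratch" objective concrete, here is a minimal sketch of the standard K-Means loop (Lloyd's algorithm) in NumPy. The function name `kmeans_from_scratch`, its parameters, and the toy data are illustrative assumptions, not part of the assignment. For the graded work, the brief specifies `sklearn.cluster.KMeans`.

```python
import numpy as np

def kmeans_from_scratch(X, k, n_iter=100, seed=42):
    """Minimal Lloyd's algorithm (hypothetical helper): returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Initialise centroids by picking k distinct data points at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid (Euclidean distance).
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points.
        # An empty cluster keeps its previous centroid.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Toy data (made up): two well-separated groups of three points each.
X_demo = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
centroids, labels = kmeans_from_scratch(X_demo, k=2)
```

The two steps inside the loop (assign, then update) are the whole algorithm. Library implementations mainly add smarter initialisation and multiple restarts.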
Business Skills
- Translate clusters into business personas
- Create actionable marketing recommendations
- Visualize customer segments effectively
- Present findings to non-technical stakeholders
- Document analysis methodology professionally
Business Scenario
Prestige Mall Analytics
You have been hired as a Data Scientist at Prestige Mall, a premium shopping destination. The marketing team wants to understand their customers better to create targeted promotional campaigns. They have collected basic customer data and need you to identify distinct customer segments.
"We have data on 200 customers including their age, annual income, and spending scores. We need you to segment these customers into meaningful groups so we can tailor our marketing strategies. Tell us who our high-value customers are, who we're losing money on, and how we can better target each group."
Questions to Answer
- How many distinct customer segments exist?
- What defines each customer segment?
- Which segments are most valuable to the business?
- Are there any unexpected customer groups?
- Which segments should receive premium offers?
- How can we convert low spenders to high spenders?
- What campaigns would work for each segment?
- Are there underserved customer groups?
The Dataset
You will work with the Mall Customers dataset, a popular dataset for learning customer segmentation and clustering techniques. Download it from Kaggle or use the local copy.
Original Data Source
This project uses the Mall Customer Segmentation Dataset from Kaggle, one of the most popular datasets for learning unsupervised machine learning and clustering. The dataset contains basic customer information collected from a mall's membership cards.
Dataset Schema
| Column | Type | Range | Description |
|---|---|---|---|
| CustomerID | Integer | 1-200 | Unique customer identifier |
| Gender | String | Male/Female | Customer gender (112 Female, 88 Male) |
| Age | Integer | 18-70 | Customer age in years |
| Annual Income (k$) | Integer | 15-137 | Annual income in thousands of dollars |
| Spending Score (1-100) | Integer | 1-99 | Mall-assigned spending score based on behavior |
Sample Data Preview
| CustomerID | Gender | Age | Annual Income (k$) | Spending Score (1-100) |
|---|---|---|---|---|
| 1 | Male | 19 | 15 | 39 |
| 2 | Male | 21 | 15 | 81 |
| 3 | Female | 20 | 16 | 6 |
| 124 | Male | 39 | 69 | 91 |
| 200 | Female | 30 | 137 | 83 |
Project Requirements
Create a well-organized Jupyter notebook that covers all the following components with clear documentation and visualizations.
Exploratory Data Analysis
- Load and inspect the Mall Customers dataset
- Display dataset shape, dtypes, and descriptive statistics
- Check for missing values and data quality
- Analyze distribution of Age, Income, and Spending Score
- Create histograms for numerical features
- Analyze gender distribution (bar chart)
- Generate pairplot and correlation heatmap
- Visualize income vs spending score scatter plot
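As a starting point for the inspection steps above, the following sketch shows the usual pandas calls. So that it runs without the CSV, the five rows from the sample preview table stand in for the full file. The commented-out `read_csv` path is an assumption based on the required project structure.

```python
import pandas as pd

# In the notebook you would load the full file instead, for example:
# df = pd.read_csv("data/mall_customers.csv")  # path assumed from the project structure
# Stand-in: the five rows shown in the sample data preview.
df = pd.DataFrame({
    "CustomerID": [1, 2, 3, 124, 200],
    "Gender": ["Male", "Male", "Female", "Male", "Female"],
    "Age": [19, 21, 20, 39, 30],
    "Annual Income (k$)": [15, 15, 16, 69, 137],
    "Spending Score (1-100)": [39, 81, 6, 91, 83],
})

# Shape, dtypes, and descriptive statistics
print(df.shape)
print(df.dtypes)
print(df.describe())

# Missing values per column
missing_per_column = df.isna().sum()

# Gender counts (input for the bar chart)
gender_counts = df["Gender"].value_counts()

# Pairwise correlations of the numeric features (input for the heatmap)
numeric_cols = ["Age", "Annual Income (k$)", "Spending Score (1-100)"]
correlations = df[numeric_cols].corr()
```

`gender_counts` and `correlations` can be passed directly to a bar chart and a heatmap, respectively.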
Data Preprocessing
- Select relevant features for clustering (Income & Spending Score)
- Apply StandardScaler for feature standardization (zero mean, unit variance)
- Explain why scaling is important for K-Means
- Optional: Include Age as a third feature
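K-Means assigns points by Euclidean distance, so a feature with a larger numeric spread would otherwise dominate the result. A minimal sketch of the scaling step follows. The small array is illustrative and reuses the income and spending values from the sample preview.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative values only: columns are [Annual Income (k$), Spending Score (1-100)]
X = np.array([[15, 39], [15, 81], [16, 6], [69, 91], [137, 83]], dtype=float)

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# After scaling, each column has mean ~0 and standard deviation ~1,
# so both features contribute comparably to the distance computation.
col_means = X_scaled.mean(axis=0)
col_stds = X_scaled.std(axis=0)
```

Keep the fitted `scaler` object: `scaler.inverse_transform` converts centroids back to original units for plotting and profiling.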
Finding Optimal K
- Implement the Elbow Method using WCSS (Within-Cluster Sum of Squares)
- Test K values from 1 to 10
- Plot the elbow curve and identify the "elbow point"
- Calculate and plot Silhouette Scores for K=2 to 10
- Justify your choice of optimal K with both methods
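The two methods above can be sketched in a single loop. Since the CSV is not bundled with this text, the example assumes synthetic data from `make_blobs` as a stand-in for the scaled income and spending features. With the real dataset, replace `X_scaled` with your own scaled array.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in (assumption): 200 points in 2 dimensions drawn from 5 blobs.
X_raw, _ = make_blobs(n_samples=200, centers=5, cluster_std=1.0, random_state=42)
X_scaled = StandardScaler().fit_transform(X_raw)

wcss = {}        # K -> within-cluster sum of squares (inertia)
silhouette = {}  # K -> mean silhouette score (defined only for K >= 2)

for k in range(1, 11):
    model = KMeans(n_clusters=k, n_init=10, max_iter=300, random_state=42)
    labels = model.fit_predict(X_scaled)
    wcss[k] = model.inertia_
    if k >= 2:
        silhouette[k] = silhouette_score(X_scaled, labels)

# One candidate for K: the value with the highest silhouette score.
best_k_by_silhouette = max(silhouette, key=silhouette.get)
```

Plot `wcss` against K for the elbow curve and `silhouette` against K for the silhouette plot. The silhouette score is skipped for K=1 because it needs at least two clusters.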
K-Means Clustering
- Train K-Means model with optimal K (usually 5)
- Assign cluster labels to each customer
- Visualize clusters with scatter plot (Income vs Spending Score)
- Mark cluster centroids on the visualization
- Create 3D visualization if using 3 features
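A hedged sketch of the fit-and-plot steps follows, again on synthetic `make_blobs` data standing in for the real features (an assumption). The axis labels are those you would use with the actual dataset, and the `savefig` path is taken from the required project structure and left commented out.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line inside Jupyter
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for [Annual Income, Spending Score] (assumption).
X_raw, _ = make_blobs(n_samples=200, centers=5, random_state=42)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_raw)

# Fit on scaled features and assign a cluster label to every point.
kmeans = KMeans(n_clusters=5, n_init=10, random_state=42)
labels = kmeans.fit_predict(X_scaled)

# Centroids live in scaled space; invert the scaling to plot them in original units.
centroids = scaler.inverse_transform(kmeans.cluster_centers_)

fig, ax = plt.subplots(figsize=(8, 6))
ax.scatter(X_raw[:, 0], X_raw[:, 1], c=labels, cmap="viridis", s=30)
ax.scatter(centroids[:, 0], centroids[:, 1], c="red", marker="X", s=200, label="Centroids")
ax.set_xlabel("Annual Income (k$)")
ax.set_ylabel("Spending Score (1-100)")
ax.set_title("Customer segments (K=5)")
ax.legend()
# fig.savefig("visualizations/cluster_scatter.png", dpi=150)
```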
Cluster Analysis & Profiling
- Analyze each cluster's characteristics (mean, median, counts)
- Create descriptive names for each segment
- Generate cluster summary table with key statistics
- Visualize cluster distributions using box plots
- Analyze gender and age distribution within clusters
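Once each customer has a cluster label, the profiling steps reduce to a few pandas group-by operations. The rows and the `Cluster` column below are made up for illustration. In the notebook the labels would come from your fitted model.

```python
import pandas as pd

# Hypothetical rows with cluster labels already assigned.
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", "Male", "Female", "Male"],
    "Age": [19, 21, 45, 50, 30, 32],
    "Annual Income (k$)": [15, 16, 80, 85, 78, 82],
    "Spending Score (1-100)": [80, 78, 15, 20, 90, 88],
    "Cluster": [0, 0, 1, 1, 2, 2],
})

features = ["Age", "Annual Income (k$)", "Spending Score (1-100)"]

# Mean and median of every feature per cluster
summary = df.groupby("Cluster")[features].agg(["mean", "median"])

# Number of customers per cluster
sizes = df["Cluster"].value_counts().sort_index()

# Gender breakdown within each cluster
gender_mix = pd.crosstab(df["Cluster"], df["Gender"])

print(summary)
```

`summary` can serve as the basis of the cluster summary table, and `sizes` feeds the counts-per-cluster bar chart.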
Business Recommendations
- Provide marketing recommendations for each segment
- Identify high-value vs at-risk customers
- Suggest targeted campaigns for each cluster
- Summarize key business insights
Clustering Specifications
Use these clustering techniques and evaluation metrics to ensure your analysis is thorough and professional.
- Algorithm: sklearn.cluster.KMeans
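- n_init: 10 (set explicitly, since recent scikit-learn versions default to 'auto'; multiple initializations reduce the risk of a poor local minimum)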
- max_iter: 300 (default)
- random_state: 42 (for reproducibility)
- Features: Annual Income, Spending Score
- Optional: Include Age for 3D clustering
- WCSS: Within-Cluster Sum of Squares (Elbow)
- Silhouette Score: sklearn.metrics.silhouette_score
- Inertia: kmeans.inertia_ (same as WCSS)
- Cluster Centers: kmeans.cluster_centers_
- Expected K: 5 clusters (based on elbow)
- Silhouette Range: scores run from -1 to 1; roughly 0.4-0.6 indicates reasonably well-separated clusters
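To illustrate the note that `kmeans.inertia_` is the same quantity as WCSS, the sketch below fits a model with the parameters listed above and recomputes WCSS by hand. The `make_blobs` data is a synthetic stand-in (an assumption), not the mall dataset.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in data (assumption).
X, _ = make_blobs(n_samples=200, centers=5, random_state=42)

kmeans = KMeans(n_clusters=5, n_init=10, max_iter=300, random_state=42).fit(X)

# WCSS by hand: squared distance from each point to its assigned centroid, summed.
assigned_centers = kmeans.cluster_centers_[kmeans.labels_]
wcss_manual = float(((X - assigned_centers) ** 2).sum())

print(wcss_manual, kmeans.inertia_)
```

The two printed numbers should agree up to floating-point error, which is why the elbow plot can use `inertia_` directly.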
Expected Customer Segments (5 Clusters)
- High Income, High Spending: "VIP Customers". Priority for premium offers and loyalty programs
- High Income, Low Spending: "Careful Spenders". Target with exclusive deals
- Low Income, High Spending: "Enthusiastic Shoppers". Offer payment plans
- Low Income, Low Spending: "Budget Conscious". Discount and clearance campaigns
- Average Income, Average Spending: "Mainstream Shoppers". General promotions
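One possible way to attach these persona names to fitted clusters is a simple rule on the centroid coordinates in standardized units. The helper `name_segment`, the `band` tolerance of 0.5, and the example centroids are all hypothetical. Treat this as a sketch and check the names against your actual cluster profiles.

```python
def name_segment(income_z, spending_z, band=0.5):
    """Map a centroid (standardized income, spending) to a persona name.

    `band` is a hypothetical tolerance: a centroid within +/- band of the
    mean on both axes is treated as "average".
    """
    if abs(income_z) < band and abs(spending_z) < band:
        return "Mainstream Shoppers"
    if income_z >= 0 and spending_z >= 0:
        return "VIP Customers"
    if income_z >= 0:
        return "Careful Spenders"
    if spending_z >= 0:
        return "Enthusiastic Shoppers"
    return "Budget Conscious"

# Hypothetical centroids in standardized (income, spending) units.
example_centroids = {
    0: (1.0, 1.2),
    1: (1.1, -1.3),
    2: (-1.3, 1.1),
    3: (-1.3, -1.1),
    4: (-0.2, 0.0),
}
segment_names = {cid: name_segment(*c) for cid, c in example_centroids.items()}
```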
Required Visualizations
Create at least 10 visualizations in your notebook. Each visualization should have clear titles, labels, and annotations.
EDA Visualizations
- Distribution histogram for Age
- Distribution histogram for Annual Income
- Distribution histogram for Spending Score
- Gender distribution bar chart
- Pairplot colored by Gender
- Correlation heatmap
Clustering Visualizations
- Elbow curve (WCSS vs K)
- Silhouette score plot (Score vs K)
- Cluster scatter plot (Income vs Spending) with centroids
- Cluster distribution bar chart (counts per cluster)
- Box plots of features by cluster
- 3D scatter plot (if using 3 features)
Submission Requirements
Create a public GitHub repository with the exact name shown below:
Required Repository Name
customer-segmentation-ml
Required Project Structure
customer-segmentation-ml/
├── data/
│ └── mall_customers.csv # Dataset
├── notebooks/
│ └── customer_segmentation.ipynb # Main analysis notebook
├── visualizations/
│ ├── elbow_curve.png # Elbow method plot
│ ├── silhouette_scores.png # Silhouette analysis
│ ├── cluster_scatter.png # Main clustering visualization
│ └── cluster_profiles.png # Cluster summary
├── requirements.txt # Python dependencies
└── README.md # Project documentation
README.md Required Sections
- Project Title and Description
- Your name and submission date
- Dataset description (source, features)
- Technologies used (Python, sklearn, matplotlib)
- Key findings (optimal K, segment profiles)
- Visualizations (embedded cluster plots)
- Business recommendations
- How to run the notebook
Enter your GitHub username - we will verify your repository automatically
Grading Rubric
Your project will be graded on the following criteria. Total: 300 points.
| Criteria | Points | Description |
|---|---|---|
| Exploratory Data Analysis | 50 | Thorough exploration with descriptive statistics and visualizations |
| Data Preprocessing | 30 | Feature selection and proper scaling with explanation |
| Finding Optimal K | 50 | Elbow method, silhouette analysis, and justified K selection |
| K-Means Clustering | 40 | Correct implementation with cluster visualization |
| Cluster Analysis | 50 | Detailed segment profiling with meaningful names |
| Visualizations | 40 | At least 10 clear, labeled visualizations |
| Documentation | 40 | README, code comments, business recommendations |
| Total | 300 | |
Grading Levels
- Excellent: Exceeds all requirements
- Good: Meets all requirements
- Satisfactory: Meets minimum requirements
- Needs Work: Missing key requirements
Ready to Submit?
Make sure you have completed all requirements and reviewed the grading rubric above.