Project 3: Customer Segmentation | Machine Learning Course

Project Overview

This project introduces you to unsupervised machine learning through customer segmentation. You will work with the Mall Customers dataset from Kaggle, which contains 200 customers and 5 columns: CustomerID, Gender, Age, Annual Income (k$), and Spending Score (1-100). Your goal is to identify distinct customer groups that a marketing team can target with personalized campaigns.

Skills Applied: This project tests your proficiency in Python (pandas, numpy, matplotlib, seaborn), scikit-learn (K-Means, preprocessing, metrics), and unsupervised learning concepts.

Explore

Analyze income, spending patterns, and demographics

Preprocess

Scale features for optimal clustering

Cluster

Apply K-Means with optimal number of clusters

Profile

Create actionable customer segment profiles

Learning Objectives

Unsupervised Learning

Understand clustering vs classification
Understand the concepts behind K-Means well enough to implement it from scratch
Use the elbow method to find optimal K
Evaluate clustering with silhouette score
Handle feature scaling for distance-based algorithms

Business Skills

Translate clusters into business personas
Create actionable marketing recommendations
Visualize customer segments effectively
Present findings to non-technical stakeholders
Document analysis methodology professionally

Business Scenario

Prestige Mall Analytics

You have been hired as a Data Scientist at Prestige Mall, a premium shopping destination. The marketing team wants to understand their customers better to create targeted promotional campaigns. They have collected basic customer data and need you to identify distinct customer segments.

"We have data on 200 customers including their age, annual income, and spending scores. We need you to segment these customers into meaningful groups so we can tailor our marketing strategies. Tell us who our high-value customers are, who we're losing money on, and how we can better target each group."

Sarah Chen, Marketing Director, Prestige Mall

Questions to Answer

Segmentation

How many distinct customer segments exist?
What defines each customer segment?
Which segments are most valuable to the business?
Are there any unexpected customer groups?

Marketing Strategy

Which segments should receive premium offers?
How can we convert low spenders to high spenders?
What campaigns would work for each segment?
Are there underserved customer groups?

Pro Tip: Think like a business analyst! Give each segment a memorable name (e.g., "Careful Spenders", "Big Spenders", "Budget Conscious") that marketing teams can easily understand and act upon.

The Dataset

You will work with the Mall Customers dataset, a popular dataset for learning customer segmentation and clustering techniques. Download from Kaggle or use the local copy:

Dataset Download

Download the Mall Customers dataset from Kaggle or use our local copy for convenience.

Original Data Source

This project uses the Mall Customer Segmentation Dataset from Kaggle - one of the most popular datasets for learning unsupervised machine learning and clustering. The dataset contains basic customer information collected from a mall's membership cards.

Dataset Schema

Column	Type	Range	Description
`CustomerID`	Integer	1-200	Unique customer identifier
`Gender`	String	Male/Female	Customer gender (112 Female, 88 Male)
`Age`	Integer	18-70	Customer age in years
`Annual Income (k$)`	Integer	15-137	Annual income in thousands of dollars
`Spending Score (1-100)`	Integer	1-99	Mall-assigned spending score based on behavior

Dataset Stats: 200 customers, 5 columns, roughly balanced gender distribution (56% Female, 44% Male), no missing values

Key Insight: Focus on Annual Income and Spending Score for clustering

Sample Data Preview

CustomerID	Gender	Age	Annual Income (k$)	Spending Score (1-100)
1	Male	19	15	39
2	Male	21	15	81
3	Female	20	16	6
124	Male	39	69	91
200	Female	30	137	83

Project Requirements

Create a well-organized Jupyter notebook that covers all the following components with clear documentation and visualizations.

Exploratory Data Analysis

Load and inspect the Mall Customers dataset
Display dataset shape, dtypes, and descriptive statistics
Check for missing values and data quality
Analyze distribution of Age, Income, and Spending Score
Create histograms for numerical features
Analyze gender distribution (bar chart)
Generate pairplot and correlation heatmap
Visualize income vs spending score scatter plot
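
The inspection steps above can be sketched as follows. This is a minimal example, not a complete EDA: it rebuilds only the five rows shown in the sample preview so that it runs without the CSV, and the variable names (`numeric_cols`, `missing`, `gender_counts`, `corr`) are illustrative choices. In the notebook you would load the full file from `data/mall_customers.csv` instead.

```python
import pandas as pd

# Five rows copied from the sample preview. In the notebook, load the full
# 200-row file instead: df = pd.read_csv("data/mall_customers.csv")
df = pd.DataFrame(
    {
        "CustomerID": [1, 2, 3, 124, 200],
        "Gender": ["Male", "Male", "Female", "Male", "Female"],
        "Age": [19, 21, 20, 39, 30],
        "Annual Income (k$)": [15, 15, 16, 69, 137],
        "Spending Score (1-100)": [39, 81, 6, 91, 83],
    }
)

numeric_cols = ["Age", "Annual Income (k$)", "Spending Score (1-100)"]

print(df.shape)                     # (rows, columns)
print(df.dtypes)                    # column types
print(df[numeric_cols].describe())  # descriptive statistics

missing = df.isna().sum()                    # missing values per column
gender_counts = df["Gender"].value_counts()  # input for the gender bar chart
corr = df[numeric_cols].corr()               # input for the correlation heatmap
```

From here, `df[numeric_cols].hist()`, `seaborn.pairplot(df, hue="Gender")`, and `seaborn.heatmap(corr, annot=True)` are one reasonable way to produce the required plots.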

Data Preprocessing

Select relevant features for clustering (Income & Spending Score)
Apply StandardScaler to standardize features (zero mean, unit variance)
Explain why scaling is important for K-Means
Optional: Include Age as a third feature
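
K-Means assigns points to clusters by Euclidean distance, so a feature with a larger numeric range contributes more to that distance than one with a smaller range. The sketch below standardizes the income and spending values from the five preview rows (a stand-in for the full dataset) and checks the result; it illustrates the step and is not a prescribed implementation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Annual income (k$) and spending score for the five preview rows
X = np.array([[15, 39], [15, 81], [16, 6], [69, 91], [137, 83]], dtype=float)

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # each column now has mean 0 and std 1

print(X_scaled.mean(axis=0))
print(X_scaled.std(axis=0))
```

Keep the fitted `scaler` around: `scaler.inverse_transform` converts cluster centroids back to the original units for plotting and profiling.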

Finding Optimal K

Implement the Elbow Method using WCSS (Within-Cluster Sum of Squares)
Test K values from 1 to 10
Plot the elbow curve and identify the "elbow point"
Calculate and plot Silhouette Scores for K=2 to 10
Justify your choice of optimal K with both methods
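
The two selection methods above can be sketched together. To stay self-contained, this example runs on synthetic blobs generated around five made-up centers rather than on the real dataset, so the printed numbers say nothing about the Mall Customers data; substitute your own scaled feature matrix for `X_scaled`.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the scaled (income, spending) matrix.
# The five centers are made up for illustration, NOT taken from the dataset.
centers = [(-1.5, -1.5), (-1.5, 1.5), (0.0, 0.0), (1.5, -1.5), (1.5, 1.5)]
X_scaled, _ = make_blobs(
    n_samples=200, centers=centers, cluster_std=0.3, random_state=42
)

# Elbow method: WCSS (inertia) for K = 1..10
wcss = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, max_iter=300, random_state=42)
    km.fit(X_scaled)
    wcss.append(km.inertia_)

# Silhouette score needs at least two clusters, so K starts at 2
sil_scores = {}
for k in range(2, 11):
    km = KMeans(n_clusters=k, n_init=10, max_iter=300, random_state=42)
    labels = km.fit_predict(X_scaled)
    sil_scores[k] = silhouette_score(X_scaled, labels)

best_k = max(sil_scores, key=sil_scores.get)
print("WCSS:", np.round(wcss, 1))
print("Best K by silhouette:", best_k)
```

Plotting `wcss` against `range(1, 11)` gives the elbow curve, and plotting `sil_scores` gives the silhouette plot. The elbow is read by eye, so the silhouette peak is a useful cross-check rather than a replacement.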

K-Means Clustering

Train K-Means model with optimal K (usually 5)
Assign cluster labels to each customer
Visualize clusters with scatter plot (Income vs Spending Score)
Mark cluster centroids on the visualization
Create 3D visualization if using 3 features
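
A minimal sketch of the fit-and-plot step follows. It assumes K=5 and uses synthetic customers drawn around five made-up (income, spending) centers in place of the real data; the marker style, colormap, and figure size are arbitrary choices.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside Jupyter
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

features = ["Annual Income (k$)", "Spending Score (1-100)"]

# Synthetic customers around five made-up centers, standing in for the dataset
rng = np.random.default_rng(42)
centers = np.array([[25, 20], [25, 80], [55, 50], [88, 17], [88, 82]])
points = np.vstack([rng.normal(c, 5, size=(40, 2)) for c in centers])
df = pd.DataFrame(points, columns=features)

scaler = StandardScaler()
X_scaled = scaler.fit_transform(df[features].to_numpy())

kmeans = KMeans(n_clusters=5, n_init=10, max_iter=300, random_state=42)
df["Cluster"] = kmeans.fit_predict(X_scaled)  # one label per customer

# Convert centroids back to original units so they line up with the axes
centroids = scaler.inverse_transform(kmeans.cluster_centers_)

fig, ax = plt.subplots(figsize=(8, 6))
ax.scatter(df[features[0]], df[features[1]], c=df["Cluster"], cmap="tab10", s=30)
ax.scatter(centroids[:, 0], centroids[:, 1], c="black", marker="X", s=200,
           label="Centroids")
ax.set_xlabel(features[0])
ax.set_ylabel(features[1])
ax.set_title("Customer segments (K=5)")
ax.legend()
# fig.savefig("visualizations/cluster_scatter.png", dpi=150)
```

Cluster label numbers are arbitrary and can change between runs with a different `random_state`, so name segments from their centroid positions rather than from the label values.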

Cluster Analysis & Profiling

Analyze each cluster's characteristics (mean, median, counts)
Create descriptive names for each segment
Generate cluster summary table with key statistics
Visualize cluster distributions using box plots
Analyze gender and age distribution within clusters
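
One way to build the cluster summary table is a grouped aggregation. The six customers and their cluster labels below are hypothetical, chosen only to make the example runnable; in the notebook, `df` would be the full dataset with the `Cluster` column from your fitted model.

```python
import pandas as pd

# Hypothetical labeled customers (not real rows from the dataset)
df = pd.DataFrame(
    {
        "CustomerID": [1, 2, 3, 4, 5, 6],
        "Gender": ["Male", "Female", "Female", "Male", "Female", "Male"],
        "Age": [19, 21, 45, 50, 30, 32],
        "Annual Income (k$)": [15, 16, 60, 62, 120, 130],
        "Spending Score (1-100)": [80, 78, 50, 48, 15, 17],
        "Cluster": [0, 0, 1, 1, 2, 2],
    }
)

# One row per cluster: size plus central tendencies of each feature
profile = df.groupby("Cluster").agg(
    customers=("CustomerID", "count"),
    mean_age=("Age", "mean"),
    mean_income=("Annual Income (k$)", "mean"),
    median_income=("Annual Income (k$)", "median"),
    mean_spending=("Spending Score (1-100)", "mean"),
)

# Gender counts within each cluster
gender_by_cluster = pd.crosstab(df["Cluster"], df["Gender"])

print(profile)
print(gender_by_cluster)
```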

Business Recommendations

Provide marketing recommendations for each segment
Identify high-value vs at-risk customers
Suggest targeted campaigns for each cluster
Summarize key business insights

Clustering Specifications

Use these clustering techniques and evaluation metrics to ensure your analysis is thorough and professional.

K-Means Algorithm

Algorithm: sklearn.cluster.KMeans
n_init: 10 (set explicitly, since the default is "auto" in scikit-learn 1.4 and later; multiple initializations help avoid poor local minima)
max_iter: 300 (default)
random_state: 42 (for reproducibility)
Features: Annual Income, Spending Score
Optional: Include Age for 3D clustering

Evaluation Metrics

WCSS: Within-Cluster Sum of Squares (Elbow)
Silhouette Score: sklearn.metrics.silhouette_score
Inertia: kmeans.inertia_ (same as WCSS)
Cluster Centers: kmeans.cluster_centers_
Expected K: 5 clusters (based on elbow)
Silhouette Range: scores fall between -1 and 1; 0.4-0.6 indicates reasonably well-separated clusters
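
To make the "same as WCSS" note concrete, the sketch below recomputes WCSS by hand as the sum of squared distances from each point to the centroid of its assigned cluster, then compares it with `kmeans.inertia_`. The data is arbitrary synthetic noise and K=3 is chosen only for the demonstration.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))  # arbitrary synthetic points

kmeans = KMeans(n_clusters=3, n_init=10, max_iter=300, random_state=42).fit(X)

# Look up the centroid assigned to each point, then sum squared distances
assigned_centers = kmeans.cluster_centers_[kmeans.labels_]
wcss_manual = float(np.sum((X - assigned_centers) ** 2))

print(wcss_manual, kmeans.inertia_)
```

The two printed values should agree up to floating-point rounding.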

Expected Customer Segments (5 Clusters)

High Income, High Spending

"VIP Customers" - Priority for premium offers and loyalty programs

High Income, Low Spending

"Careful Spenders" - Target with exclusive deals

Low Income, High Spending

"Enthusiastic Shoppers" - Offer payment plans

Low Income, Low Spending

"Budget Conscious" - Discount and clearance campaigns

Average Income, Average Spending

"Mainstream Shoppers" - General promotions

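Because K-Means label numbers are arbitrary, segment names are best assigned from centroid positions. The helper below is a hypothetical heuristic, not part of scikit-learn or of the project requirements: `name_segment` and its `band` threshold of 0.5 standard deviations are assumptions for illustration, and it expects centroid coordinates in standardized units.

```python
def name_segment(income_z: float, spending_z: float, band: float = 0.5) -> str:
    """Map a standardized centroid to one of the five expected persona names.

    `band` is a hypothetical cutoff: centroids within `band` standard
    deviations of the mean on both axes count as "average".
    """
    if abs(income_z) <= band and abs(spending_z) <= band:
        return "Mainstream Shoppers"
    if income_z > 0:
        return "VIP Customers" if spending_z > 0 else "Careful Spenders"
    return "Enthusiastic Shoppers" if spending_z > 0 else "Budget Conscious"


# Example centroids in standardized units (made-up values)
for centroid in [(1.2, 1.1), (1.0, -1.2), (-1.3, 1.1), (-1.3, -1.1), (0.1, -0.1)]:
    print(centroid, "->", name_segment(*centroid))
```

Treat the output as a starting point and confirm each name against the cluster's actual profile before presenting it.
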
Required Visualizations

Create at least 10 visualizations in your notebook. Each visualization should have clear titles, labels, and annotations.

EDA Visualizations

Distribution histogram for Age
Distribution histogram for Annual Income
Distribution histogram for Spending Score
Gender distribution bar chart
Pairplot colored by Gender
Correlation heatmap

Clustering Visualizations

Elbow curve (WCSS vs K)
Silhouette score plot (Score vs K)
Cluster scatter plot (Income vs Spending) with centroids
Cluster distribution bar chart (counts per cluster)
Box plots of features by cluster
3D scatter plot (if using 3 features)

Design Tip: Use a consistent color palette for your clusters across all visualizations. Consider using colorblind-friendly colors.
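
One simple way to keep colors consistent is to define a single mapping from segment name to color and reuse it in every plot. The sketch below uses hex codes from the Okabe-Ito colorblind-friendly palette; which color goes with which segment is an arbitrary choice, and `colors_for` is a hypothetical helper name.

```python
# Okabe-Ito colorblind-friendly colors; the assignment to segments is arbitrary
SEGMENT_COLORS = {
    "VIP Customers": "#0072B2",
    "Careful Spenders": "#E69F00",
    "Enthusiastic Shoppers": "#009E73",
    "Budget Conscious": "#D55E00",
    "Mainstream Shoppers": "#CC79A7",
}


def colors_for(segments):
    """Return one color per segment name, in order, for a plot's color argument."""
    return [SEGMENT_COLORS[name] for name in segments]


print(colors_for(["VIP Customers", "Budget Conscious", "VIP Customers"]))
```

Passing `colors_for(df["Segment"])` as the `c` argument of a matplotlib scatter call, assuming a `Segment` column holding these names, keeps every chart aligned with the same legend.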

Submission Requirements

Create a public GitHub repository with the exact name shown below:

Required Repository Name

customer-segmentation-ml

github.com/<your-username>/customer-segmentation-ml

Required Project Structure

customer-segmentation-ml/
├── data/
│   └── mall_customers.csv          # Dataset
├── notebooks/
│   └── customer_segmentation.ipynb # Main analysis notebook
├── visualizations/
│   ├── elbow_curve.png             # Elbow method plot
│   ├── silhouette_scores.png       # Silhouette analysis
│   ├── cluster_scatter.png         # Main clustering visualization
│   └── cluster_profiles.png        # Cluster summary
├── requirements.txt                # Python dependencies
└── README.md                       # Project documentation

README.md Required Sections

Project Title and Description
Your name and submission date
Dataset description (source, features)
Technologies used (Python, sklearn, matplotlib)
Key findings (optimal K, segment profiles)
Visualizations (embedded cluster plots)
Business recommendations
How to run the notebook

Submit Your Project

Enter your GitHub username - we will verify your repository automatically

Grading Rubric

Your project will be graded on the following criteria. Total: 300 points.

Criteria	Points	Description
Exploratory Data Analysis	50	Thorough exploration with descriptive statistics and visualizations
Data Preprocessing	30	Feature selection and proper scaling with explanation
Finding Optimal K	50	Elbow method, silhouette analysis, and justified K selection
K-Means Clustering	40	Correct implementation with cluster visualization
Cluster Analysis	50	Detailed segment profiling with meaningful names
Visualizations	40	At least 10 clear, labeled visualizations
Documentation	40	README, code comments, business recommendations
Total	300

Grading Levels

Excellent

270-300

Exceeds all requirements

Good

225-269

Meets all requirements

Satisfactory

180-224

Meets minimum requirements

Needs Work

< 180

Missing key requirements

Ready to Submit?

Make sure you have completed all requirements and reviewed the grading rubric above.