Intermediate Project 3

Customer Segmentation

Build an unsupervised machine learning pipeline using K-Means clustering to segment mall customers based on their annual income and spending behavior. Learn to identify customer groups for targeted marketing strategies using the famous Kaggle Mall Customers dataset.

6-8 hours
Intermediate
300 Points
What You Will Build
  • Exploratory Data Analysis
  • K-Means clustering model
  • Elbow method for optimal K
  • Silhouette score analysis
  • Customer segment profiles
01

Project Overview

This project introduces you to unsupervised machine learning through customer segmentation. You will work with the Mall Customers dataset from Kaggle, which contains 200 customers and 5 columns: CustomerID, Gender, Age, Annual Income (k$), and Spending Score (1-100). Your goal is to identify distinct customer groups that a marketing team can target with personalized campaigns.

Skills Applied: This project tests your proficiency in Python (pandas, numpy, matplotlib, seaborn), scikit-learn (K-Means, preprocessing, metrics), and unsupervised learning concepts.
  • Explore: Analyze income, spending patterns, and demographics
  • Preprocess: Scale features for optimal clustering
  • Cluster: Apply K-Means with optimal number of clusters
  • Profile: Create actionable customer segment profiles

Learning Objectives

Unsupervised Learning
  • Understand clustering vs classification
  • Understand how the K-Means algorithm works from first principles
  • Use the elbow method to find optimal K
  • Evaluate clustering with silhouette score
  • Handle feature scaling for distance-based algorithms
Business Skills
  • Translate clusters into business personas
  • Create actionable marketing recommendations
  • Visualize customer segments effectively
  • Present findings to non-technical stakeholders
  • Document analysis methodology professionally
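To make the K-Means objective above more concrete, here is a minimal NumPy sketch of Lloyd's algorithm, the iterative assign-then-update procedure behind K-Means. The function name `kmeans_from_scratch`, its parameters, and the toy data are illustrative assumptions, not part of the project files. Unlike scikit-learn, which uses k-means++ initialisation by default, this sketch takes the starting centroids as an argument so the result is deterministic.

```python
import numpy as np


def kmeans_from_scratch(X, init_centroids, n_iter=100):
    """Plain Lloyd's algorithm. Returns (labels, centroids)."""
    centroids = np.asarray(init_centroids, dtype=float)
    k = len(centroids)
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # Assignment step: squared Euclidean distance from every point
        # to every centroid, then pick the nearest centroid.
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its points.
        # An empty cluster keeps its previous centroid.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # centroids stopped moving: converged
        centroids = new_centroids
    return labels, centroids


# Toy data (hypothetical, NOT the mall dataset): two well-separated groups.
toy = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                [8.0, 8.0], [8.2, 7.9], [7.9, 8.1]])

# Start from one point in each group (an arbitrary choice for illustration).
toy_labels, toy_centroids = kmeans_from_scratch(toy, init_centroids=toy[[0, 3]])
print(toy_labels)
print(toy_centroids)
```

For the project itself, `sklearn.cluster.KMeans` remains the expected tool; the sketch is only meant to show what happens inside each iteration.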
02

Business Scenario

Prestige Mall Analytics

You have been hired as a Data Scientist at Prestige Mall, a premium shopping destination. The marketing team wants to understand their customers better to create targeted promotional campaigns. They have collected basic customer data and need you to identify distinct customer segments.

"We have data on 200 customers including their age, annual income, and spending scores. We need you to segment these customers into meaningful groups so we can tailor our marketing strategies. Tell us who our high-value customers are, who we're losing money on, and how we can better target each group."

Sarah Chen, Marketing Director, Prestige Mall

Questions to Answer

Segmentation
  • How many distinct customer segments exist?
  • What defines each customer segment?
  • Which segments are most valuable to the business?
  • Are there any unexpected customer groups?
Marketing Strategy
  • Which segments should receive premium offers?
  • How can we convert low spenders to high spenders?
  • What campaigns would work for each segment?
  • Are there underserved customer groups?
Pro Tip: Think like a business analyst! Give each segment a memorable name (e.g., "Careful Spenders", "Big Spenders", "Budget Conscious") that marketing teams can easily understand and act upon.
03

The Dataset

You will work with the Mall Customers dataset, a popular dataset for learning customer segmentation and clustering techniques. Download it from Kaggle or use the local copy.

Original Data Source

This project uses the Mall Customer Segmentation Dataset from Kaggle - one of the most popular datasets for learning unsupervised machine learning and clustering. The dataset contains basic customer information collected from a mall's membership cards.

Dataset Info: 200 samples × 5 columns | 4 features + 1 ID | Numerical: Age (18-70), Annual Income ($15k-$137k), Spending Score (1-99 observed, on a 1-100 scale) | Categorical: Gender (Male/Female) | No missing values | Well suited to clustering beginners
Dataset Schema

Column | Type | Range | Description
CustomerID | Integer | 1-200 | Unique customer identifier
Gender | String | Male/Female | Customer gender (112 Female, 88 Male)
Age | Integer | 18-70 | Customer age in years
Annual Income (k$) | Integer | 15-137 | Annual income in thousands of dollars
Spending Score (1-100) | Integer | 1-99 | Mall-assigned spending score based on behavior
Dataset Stats: 200 customers, 5 columns, roughly balanced gender distribution (112 Female, 88 Male), no missing values
Key Insight: Focus on Annual Income and Spending Score for clustering
Sample Data Preview
CustomerID | Gender | Age | Annual Income (k$) | Spending Score
1 | Male | 19 | 15 | 39
2 | Male | 21 | 15 | 81
3 | Female | 20 | 16 | 6
124 | Male | 39 | 69 | 91
200 | Female | 30 | 137 | 83
04

Project Requirements

Create a well-organized Jupyter notebook that covers all the following components with clear documentation and visualizations.

1. Exploratory Data Analysis
  • Load and inspect the Mall Customers dataset
  • Display dataset shape, dtypes, and descriptive statistics
  • Check for missing values and data quality
  • Analyze distribution of Age, Income, and Spending Score
  • Create histograms for numerical features
  • Analyze gender distribution (bar chart)
  • Generate pairplot and correlation heatmap
  • Visualize income vs spending score scatter plot
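The inspection steps above could start from something like the following pandas sketch. To keep it runnable without the CSV, it builds a small DataFrame from the five rows in the sample data preview; in the notebook you would load the full file instead. The path in the comment is an assumption based on the required project structure.

```python
import pandas as pd

# Stand-in data: the five rows from the sample preview, so the snippet runs
# without the CSV. In the notebook, load the full dataset instead, e.g.
#   df = pd.read_csv("data/mall_customers.csv")  # path assumed from the repo layout
df = pd.DataFrame({
    "CustomerID": [1, 2, 3, 124, 200],
    "Gender": ["Male", "Male", "Female", "Male", "Female"],
    "Age": [19, 21, 20, 39, 30],
    "Annual Income (k$)": [15, 15, 16, 69, 137],
    "Spending Score (1-100)": [39, 81, 6, 91, 83],
})

# Shape, dtypes, and missing values
print(df.shape)
print(df.dtypes)
missing_per_column = df.isna().sum()
print(missing_per_column)

# Descriptive statistics for the numerical features
numeric_cols = ["Age", "Annual Income (k$)", "Spending Score (1-100)"]
summary = df[numeric_cols].describe()
print(summary)

# Gender distribution (input for the bar chart)
gender_counts = df["Gender"].value_counts()
print(gender_counts)

# Pairwise correlations (input for the heatmap)
correlations = df[numeric_cols].corr()
print(correlations.round(2))
```

From here, `df[numeric_cols].hist()`, `seaborn.pairplot(df, hue="Gender")`, and `seaborn.heatmap(correlations, annot=True)` are one possible route to the required charts.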
2. Data Preprocessing
  • Select relevant features for clustering (Income & Spending Score)
  • Apply StandardScaler to standardize features (zero mean, unit variance)
  • Explain why scaling is important for K-Means
  • Optional: Include Age as a third feature
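K-Means assigns points by Euclidean distance, so a feature with a larger numeric spread contributes more to every distance calculation. A minimal sketch of the scaling step follows, using the income and spending values from the five sample preview rows as stand-in data (the full dataset would be used in the notebook).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Stand-in data: [Annual Income (k$), Spending Score] for the five
# sample preview rows. Use the full 200-row dataset in the notebook.
X = np.array([
    [15, 39],
    [15, 81],
    [16, 6],
    [69, 91],
    [137, 83],
], dtype=float)

# Spread of each feature before scaling
print(X.std(axis=0).round(2))

# Standardize: subtract the column mean, divide by the column standard deviation
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# After scaling, each column has mean ~0 and standard deviation ~1,
# so both features carry equal weight in the distance computation.
print(X_scaled.mean(axis=0).round(6))
print(X_scaled.std(axis=0).round(6))
```

Keeping the fitted `scaler` object around is useful later: `scaler.inverse_transform` maps cluster centers back to the original units for plotting and reporting.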
3. Finding Optimal K
  • Implement the Elbow Method using WCSS (Within-Cluster Sum of Squares)
  • Test K values from 1 to 10
  • Plot the elbow curve and identify the "elbow point"
  • Calculate and plot Silhouette Scores for K=2 to 10
  • Justify your choice of optimal K with both methods
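One way to collect both curves in a single loop is sketched below. It runs on synthetic stand-in data from `make_blobs`, with hypothetical centers chosen only to mimic five income/spending groups; it is not the mall dataset, and the numbers it prints say nothing about the real data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in (NOT the mall data): 200 points around 5 hypothetical centers.
centers = np.array([[25, 20], [25, 80], [55, 50], [88, 17], [87, 82]], dtype=float)
X_raw, _ = make_blobs(n_samples=200, centers=centers, cluster_std=8.0, random_state=42)
X_scaled = StandardScaler().fit_transform(X_raw)

wcss = {}        # K -> within-cluster sum of squares (inertia)
silhouette = {}  # K -> mean silhouette score (undefined for K=1)

for k in range(1, 11):
    model = KMeans(n_clusters=k, n_init=10, max_iter=300, random_state=42)
    labels = model.fit_predict(X_scaled)
    wcss[k] = model.inertia_
    if k >= 2:  # the silhouette score needs at least two clusters
        silhouette[k] = silhouette_score(X_scaled, labels)

for k in wcss:
    sil = f"{silhouette[k]:.3f}" if k in silhouette else "n/a"
    print(f"K={k:2d}  WCSS={wcss[k]:8.2f}  silhouette={sil}")

# K with the highest silhouette score, to compare against the visual elbow
best_k = max(silhouette, key=silhouette.get)
print("Best K by silhouette:", best_k)
```

Plotting `wcss` and `silhouette` against K (for example with `plt.plot(list(wcss), list(wcss.values()), marker="o")`) gives the two required charts. The elbow is read by eye, so it is worth stating in the notebook where the two methods agree or disagree.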
4. K-Means Clustering
  • Train K-Means model with optimal K (usually 5)
  • Assign cluster labels to each customer
  • Visualize clusters with scatter plot (Income vs Spending Score)
  • Mark cluster centroids on the visualization
  • Create 3D visualization if using 3 features
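The fit-and-plot step might look like the sketch below. It again uses synthetic stand-in data with hypothetical centers (not the mall dataset). Note that the model is fitted on scaled features, so the centroids are mapped back to original units before plotting.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for scripts; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in (NOT the mall data): 200 points around 5 hypothetical centers.
centers = np.array([[25, 20], [25, 80], [55, 50], [88, 17], [87, 82]], dtype=float)
X_raw, _ = make_blobs(n_samples=200, centers=centers, cluster_std=8.0, random_state=42)

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_raw)

# Fit K-Means and assign a cluster label to every point
kmeans = KMeans(n_clusters=5, n_init=10, max_iter=300, random_state=42)
labels = kmeans.fit_predict(X_scaled)

# Centroids live in scaled space; convert them back to original units
centroids = scaler.inverse_transform(kmeans.cluster_centers_)

fig, ax = plt.subplots(figsize=(7, 5))
ax.scatter(X_raw[:, 0], X_raw[:, 1], c=labels, cmap="viridis", s=30)
ax.scatter(centroids[:, 0], centroids[:, 1], c="red", marker="X", s=200,
           label="Centroids")
ax.set_xlabel("Annual Income (k$)")
ax.set_ylabel("Spending Score (1-100)")
ax.set_title("Customer segments (K=5)")
ax.legend()
```

In the notebook, `fig.savefig("visualizations/cluster_scatter.png", dpi=150)` would write the plot to the file named in the required project structure, assuming the `visualizations/` directory already exists.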
5. Cluster Analysis & Profiling
  • Analyze each cluster's characteristics (mean, median, counts)
  • Create descriptive names for each segment
  • Generate cluster summary table with key statistics
  • Visualize cluster distributions using box plots
  • Analyze gender and age distribution within clusters
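A cluster summary table can be built with a single pandas `groupby`. The sketch below uses synthetic stand-in data with hypothetical centers (not the mall dataset), and the output column names such as `income_mean` are illustrative choices.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in (NOT the mall data): 200 points around 5 hypothetical centers.
centers = np.array([[25, 20], [25, 80], [55, 50], [88, 17], [87, 82]], dtype=float)
X_raw, _ = make_blobs(n_samples=200, centers=centers, cluster_std=8.0, random_state=42)

features = ["Annual Income (k$)", "Spending Score (1-100)"]
df = pd.DataFrame(X_raw, columns=features)

# Cluster on scaled features, then attach the labels to the original frame
X_scaled = StandardScaler().fit_transform(df[features])
df["Cluster"] = KMeans(n_clusters=5, n_init=10, random_state=42).fit_predict(X_scaled)

# One row per cluster: size plus mean and median of each feature
profile = df.groupby("Cluster").agg(
    customers=("Annual Income (k$)", "size"),
    income_mean=("Annual Income (k$)", "mean"),
    income_median=("Annual Income (k$)", "median"),
    spending_mean=("Spending Score (1-100)", "mean"),
    spending_median=("Spending Score (1-100)", "median"),
).round(1)

print(profile)
```

With the real dataset, `pd.crosstab(df["Cluster"], df["Gender"])` and `df.groupby("Cluster")["Age"].describe()` are one way to cover the gender and age breakdowns.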
6. Business Recommendations
  • Provide marketing recommendations for each segment
  • Identify high-value vs at-risk customers
  • Suggest targeted campaigns for each cluster
  • Summarize key business insights
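As a starting point for turning centroids into recommendations, the sketch below maps a standardized centroid to one of the five expected segments and campaigns listed under Clustering Specifications. The 0.5 threshold, the function names, and the example centroid values are all hypothetical; the right cut-offs depend on what your own clusters look like.

```python
# Segment names and campaigns taken from the "Expected Customer Segments" list.
# Keys are (income level, spending level).
SEGMENTS = {
    ("high", "high"): ("VIP Customers", "Premium offers and loyalty programs"),
    ("high", "low"): ("Careful Spenders", "Exclusive deals"),
    ("low", "high"): ("Enthusiastic Shoppers", "Payment plans"),
    ("low", "low"): ("Budget Conscious", "Discount and clearance campaigns"),
    ("average", "average"): ("Mainstream Shoppers", "General promotions"),
}


def level(z, threshold=0.5):
    """Bucket a standardized value; the threshold is an arbitrary assumption."""
    if z > threshold:
        return "high"
    if z < -threshold:
        return "low"
    return "average"


def describe_segment(income_z, spending_z, threshold=0.5):
    """Return (name, campaign) for a centroid given in standardized units."""
    key = (level(income_z, threshold), level(spending_z, threshold))
    # Combinations outside the five expected segments are flagged, not guessed.
    return SEGMENTS.get(key, ("Needs manual review", "No default campaign"))


# Hypothetical standardized centroids, for illustration only.
example_centroids = [(1.0, 1.2), (1.1, -1.3), (-1.3, 1.1), (-1.3, -1.1), (-0.2, 0.0)]
names = [describe_segment(i, s)[0] for i, s in example_centroids]
for (income_z, spending_z), name in zip(example_centroids, names):
    print(f"income z={income_z:+.1f}, spending z={spending_z:+.1f} -> {name}")
```

A rule like this only drafts the labels. The final names and campaigns should still be checked against the cluster profiles and justified in the notebook.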
05

Clustering Specifications

Use these clustering techniques and evaluation metrics to ensure your analysis is thorough and professional.

K-Means Algorithm
  • Algorithm: sklearn.cluster.KMeans
  • n_init: 10 (set it explicitly, since the default became 'auto' in scikit-learn 1.4; multiple restarts reduce the risk of a poor local minimum)
  • max_iter: 300 (default)
  • random_state: 42 (for reproducibility)
  • Features: Annual Income, Spending Score
  • Optional: Include Age for 3D clustering
Evaluation Metrics
  • WCSS: Within-Cluster Sum of Squares (Elbow)
  • Silhouette Score: sklearn.metrics.silhouette_score
  • Inertia: kmeans.inertia_ (same as WCSS)
  • Cluster Centers: kmeans.cluster_centers_
  • Expected K: 5 clusters (based on elbow)
  • Silhouette Range: scores fall between -1 and 1; 0.4-0.6 indicates reasonably well-separated clusters
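The claim that inertia and WCSS are the same quantity can be checked directly. The sketch below recomputes WCSS by hand from `cluster_centers_` and `labels_` and compares it with `inertia_`, using synthetic stand-in data with hypothetical centers (not the mall dataset).

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in (NOT the mall data): 200 points around 5 hypothetical centers.
centers = np.array([[25, 20], [25, 80], [55, 50], [88, 17], [87, 82]], dtype=float)
X, _ = make_blobs(n_samples=200, centers=centers, cluster_std=8.0, random_state=42)

kmeans = KMeans(n_clusters=5, n_init=10, max_iter=300, random_state=42).fit(X)

# WCSS by hand: squared distance from each point to its own cluster center
assigned_centers = kmeans.cluster_centers_[kmeans.labels_]
wcss_manual = ((X - assigned_centers) ** 2).sum()

print("inertia_   :", round(kmeans.inertia_, 4))
print("manual WCSS:", round(wcss_manual, 4))

# Mean silhouette score over all points (between -1 and 1)
score = silhouette_score(X, kmeans.labels_)
print("silhouette :", round(score, 3))
```

The two WCSS values should agree up to floating-point error, which is why the elbow curve can be built from `inertia_` alone.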
Expected Customer Segments (5 Clusters)
  • High Income, High Spending: "VIP Customers" - Priority for premium offers and loyalty programs
  • High Income, Low Spending: "Careful Spenders" - Target with exclusive deals
  • Low Income, High Spending: "Enthusiastic Shoppers" - Offer payment plans
  • Low Income, Low Spending: "Budget Conscious" - Discount and clearance campaigns
  • Average Income, Average Spending: "Mainstream Shoppers" - General promotions

06

Required Visualizations

Create at least 10 visualizations in your notebook. Each visualization should have clear titles, labels, and annotations.

EDA Visualizations
  • Distribution histogram for Age
  • Distribution histogram for Annual Income
  • Distribution histogram for Spending Score
  • Gender distribution bar chart
  • Pairplot colored by Gender
  • Correlation heatmap
Clustering Visualizations
  • Elbow curve (WCSS vs K)
  • Silhouette score plot (Score vs K)
  • Cluster scatter plot (Income vs Spending) with centroids
  • Cluster distribution bar chart (counts per cluster)
  • Box plots of features by cluster
  • 3D scatter plot (if using 3 features)
Design Tip: Use a consistent color palette for your clusters across all visualizations. Consider using colorblind-friendly colors.
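One simple way to keep colors consistent is to fix a cluster-to-color mapping once and reuse it in every plot. The sketch below uses five colors from the Okabe-Ito colorblind-friendly palette; the names `CLUSTER_COLORS` and `colors_for` are illustrative, and which color goes with which cluster is an arbitrary choice.

```python
# Five colors from the Okabe-Ito colorblind-friendly palette.
# The cluster-to-color assignment is arbitrary; what matters is reusing it.
CLUSTER_COLORS = {
    0: "#E69F00",  # orange
    1: "#56B4E9",  # sky blue
    2: "#009E73",  # bluish green
    3: "#0072B2",  # blue
    4: "#D55E00",  # vermillion
}


def colors_for(labels):
    """Translate a sequence of cluster labels into a list of hex colors."""
    return [CLUSTER_COLORS[int(label)] for label in labels]


print(colors_for([0, 4, 0, 2]))
```

The list returned by `colors_for(labels)` can be passed as the `c` argument of matplotlib's `scatter`, and the dictionary itself can be passed as `palette=CLUSTER_COLORS` to seaborn functions that take a `hue` column of cluster labels.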
07

Submission Requirements

Create a public GitHub repository with the exact name shown below:

Required Repository Name
customer-segmentation-ml
github.com/<your-username>/customer-segmentation-ml
Required Project Structure
customer-segmentation-ml/
├── data/
│   └── mall_customers.csv          # Dataset
├── notebooks/
│   └── customer_segmentation.ipynb # Main analysis notebook
├── visualizations/
│   ├── elbow_curve.png             # Elbow method plot
│   ├── silhouette_scores.png       # Silhouette analysis
│   ├── cluster_scatter.png         # Main clustering visualization
│   └── cluster_profiles.png        # Cluster summary
├── requirements.txt                # Python dependencies
└── README.md                       # Project documentation
README.md Required Sections
  • Project Title and Description
  • Your name and submission date
  • Dataset description (source, features)
  • Technologies used (Python, sklearn, matplotlib)
  • Key findings (optimal K, segment profiles)
  • Visualizations (embedded cluster plots)
  • Business recommendations
  • How to run the notebook
Submit Your Project

Enter your GitHub username - we will verify your repository automatically

08

Grading Rubric

Your project will be graded on the following criteria. Total: 300 points.

Criteria | Points | Description
Exploratory Data Analysis | 50 | Thorough exploration with descriptive statistics and visualizations
Data Preprocessing | 30 | Feature selection and proper scaling with explanation
Finding Optimal K | 50 | Elbow method, silhouette analysis, and justified K selection
K-Means Clustering | 40 | Correct implementation with cluster visualization
Cluster Analysis | 50 | Detailed segment profiling with meaningful names
Visualizations | 40 | At least 10 clear, labeled visualizations
Documentation | 40 | README, code comments, business recommendations
Total | 300 |
Grading Levels
  • Excellent (270-300): Exceeds all requirements
  • Good (225-269): Meets all requirements
  • Satisfactory (180-224): Meets minimum requirements
  • Needs Work (< 180): Missing key requirements

Ready to Submit?

Make sure you have completed all requirements and reviewed the grading rubric above.
