Project 2: Customer Segmentation | Data Science Course

02

Business Scenario

ShopSmart Retail

You have been hired as a Customer Analytics Specialist at ShopSmart, a growing retail company operating in the competitive e-commerce space. The company has experienced steady growth over the past year, but the marketing team has been using generic, one-size-fits-all email campaigns that yield low engagement rates (averaging just 8% open rate and 1.5% click-through rate).

The CMO recognizes that different customers have vastly different needs and value propositions. A customer who purchased once six months ago shouldn't receive the same messaging as a loyal customer who shops weekly. However, without data-driven customer segments, the team has no framework for personalization.

"We have customer transaction data but no way to identify our best customers versus those at risk of churning. Can you segment our customer base and help us understand each group's characteristics so we can tailor our marketing efforts? We need actionable segments we can actually use in campaigns."

Priya Menon, Head of Marketing

The Business Challenge

ShopSmart faces several common retail challenges that customer segmentation can address:

Customer Churn

35% of customers make only one purchase and never return. The company has no early warning system to identify at-risk customers before they churn.

Inefficient Marketing

Marketing budget is spread evenly across all customers, wasting resources on disengaged users while under-investing in high-value segments.

Unknown Patterns

Management has no visibility into how many loyal customers exist, what percentage are at risk, or which segments drive the most revenue.

Business Objectives

Segmentation

Identify distinct customer segments based on behavior
Determine the optimal number of clusters
Profile each segment with clear characteristics

Strategy

Identify high-value customers for loyalty programs
Find at-risk customers for retention campaigns
Discover growth opportunities in each segment

Questions to Answer

Who are our most valuable customers?
Which customers are at risk of churning?
What percentage of customers are occasional buyers?

Deliverables

RFM score for each customer
Cluster assignment with labels
Marketing recommendations per segment

Pro Tip: Think like a marketer! Give your segments memorable names (e.g., "Champions", "At Risk", "New Customers") that stakeholders can easily understand and act upon.

03

The Dataset

You will work with a customer transaction dataset containing purchase history from a retail business over 12 months. This realistic dataset includes multiple transactions per customer, allowing you to calculate meaningful RFM metrics and identify distinct customer segments.

Dataset Overview:

Total Transactions: 150
Unique Customers: 94
Months of Data: 12
Data Columns: 14

Download customer_transactions.csv

What Makes This Dataset Ideal for Segmentation

Repeat Customers

Many customers have multiple transactions, allowing you to measure frequency and identify loyal vs. one-time buyers.

Time Spread

Transactions span a full year, providing sufficient time range to calculate meaningful recency values and identify churn patterns.

Value Diversity

Wide range of transaction amounts from small purchases to large orders, enabling clear monetary segmentation.

Dataset Schema

Column	Type	Description
`transaction_id`	String	Unique transaction identifier
`customer_id`	String	Unique customer identifier
`customer_name`	String	Customer full name
`email`	String	Customer email address
`transaction_date`	Date	Date of transaction (YYYY-MM-DD)
`product_id`	String	Product identifier
`product_name`	String	Name of product purchased
`category`	String	Product category (Electronics, Furniture, Office Supplies)
`quantity`	Integer	Number of units purchased
`unit_price`	Float	Price per unit ($)
`total_amount`	Float	Total transaction value ($)
`payment_method`	String	Payment type (Credit Card, Debit Card, PayPal)
`city`	String	Customer's city
`region`	String	Geographic region (North, South, East, West)
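
As a starting point, the schema above could be loaded and sanity-checked roughly as follows. This is a minimal sketch, not a required implementation: the helper name `load_transactions` is hypothetical, and the two sample rows are made up for illustration so the snippet runs without the real file.

```python
import io
import pandas as pd

# Column names taken from the schema table above.
EXPECTED_COLUMNS = [
    "transaction_id", "customer_id", "customer_name", "email",
    "transaction_date", "product_id", "product_name", "category",
    "quantity", "unit_price", "total_amount", "payment_method",
    "city", "region",
]

def load_transactions(source):
    """Read the transactions CSV and parse transaction_date as a datetime."""
    df = pd.read_csv(source, parse_dates=["transaction_date"])
    missing = set(EXPECTED_COLUMNS) - set(df.columns)
    if missing:
        raise ValueError(f"Missing columns: {sorted(missing)}")
    return df

# Made-up sample rows for illustration only (not taken from the real dataset).
sample_csv = io.StringIO(
    ",".join(EXPECTED_COLUMNS) + "\n"
    "T001,C001,Ann Lee,ann@example.com,2024-01-05,P01,Desk,Furniture,"
    "1,200.0,200.0,Credit Card,Austin,South\n"
    "T002,C002,Bo Chen,bo@example.com,2024-02-10,P02,Pen,Office Supplies,"
    "10,1.5,15.0,PayPal,Boston,East\n"
)

transactions = load_transactions(sample_csv)
print(transactions.dtypes["transaction_date"])
print(transactions["customer_id"].nunique(), "unique customers")
```

In the actual notebook you would pass a relative path to data/customer_transactions.csv instead of the in-memory sample.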

Key Columns for RFM Analysis

Column	Used For	Why It Matters
`customer_id`	Grouping transactions by customer	Enables aggregation of all purchases per customer
`transaction_date`	Recency calculation	Find most recent purchase date for each customer
`transaction_id`	Frequency calculation	Count number of transactions per customer
`total_amount`	Monetary calculation	Sum total spending per customer

Bonus Opportunity: While not required for basic segmentation, columns like category, region, and payment_method can be used for additional profiling insights (e.g., "Champions prefer Electronics and use Credit Cards").

04

RFM Analysis

RFM (Recency, Frequency, Monetary) is a proven customer segmentation technique that scores customers based on their purchase behavior. You must calculate these three metrics for each customer.

Recency (R)

Definition: How recently did the customer make a purchase?

Calculation: Days since last purchase

Lower recency = Better (more recent customer)

Frequency (F)

Definition: How often does the customer purchase?

Calculation: Total number of transactions

Higher frequency = Better (loyal customer)

Monetary (M)

Definition: How much does the customer spend?

Calculation: Total amount spent

Higher monetary = Better (high-value customer)

Understanding RFM Calculations

Recency Details

Analysis Date: Set to one day after the last transaction in your dataset. This keeps every recency value at 1 day or more and makes the results reproducible regardless of when the notebook is run.

For Each Customer: Calculate the number of days between the analysis date and their most recent purchase.

Interpretation: A customer with recency of 5 days is much more engaged than one with 200 days.

Frequency Details

Count Transactions: Simply count the total number of transactions each customer has made.

Customer Loyalty: Customers with 10+ transactions show strong loyalty and engagement.

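One-time Buyers: Customers with frequency = 1 have already been acquired but not yet retained; they represent the main opportunity to convert first-time buyers into repeat customers.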

Monetary Details

Sum All Purchases: Total the dollar amount spent across all transactions for each customer.

High-Value Customers: Top 20% of customers often generate 80% of revenue (Pareto Principle).

Revenue Impact: Focus retention efforts on high monetary value customers.
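
The three calculations described above can be sketched with a single pandas `groupby`. This is one possible approach under stated assumptions: the five transactions below are made up for illustration, and only the column names come from the dataset schema.

```python
import pandas as pd

# Made-up transactions for illustration; column names follow the schema.
transactions = pd.DataFrame({
    "transaction_id": ["T1", "T2", "T3", "T4", "T5"],
    "customer_id": ["C1", "C1", "C2", "C3", "C3"],
    "transaction_date": pd.to_datetime(
        ["2024-01-10", "2024-06-20", "2024-03-01", "2024-06-25", "2024-06-30"]
    ),
    "total_amount": [120.0, 80.0, 40.0, 300.0, 150.0],
})

# Analysis date: one day after the last transaction in the dataset.
analysis_date = transactions["transaction_date"].max() + pd.Timedelta(days=1)

rfm = (
    transactions.groupby("customer_id")
    .agg(
        last_purchase=("transaction_date", "max"),   # most recent purchase
        frequency=("transaction_id", "count"),       # number of transactions
        monetary=("total_amount", "sum"),            # total spend
    )
    .reset_index()
)

# Recency: whole days between the analysis date and the last purchase.
rfm["recency"] = (analysis_date - rfm["last_purchase"]).dt.days
rfm = rfm[["customer_id", "recency", "frequency", "monetary"]]
print(rfm)
```

With this sample, customer C1 ends up with recency 11, frequency 2, and monetary 200.0.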

RFM Scoring System (Optional Enhancement)

Beyond raw RFM values, you can create a scoring system (1-5) where each customer receives a score for each metric. This makes segments easier to communicate to business stakeholders.

Score	Recency	Frequency	Monetary
5 (Best)	0-20 days ago	8+ transactions	$1,000+ spent
4	21-50 days ago	5-7 transactions	$500-$999 spent
3	51-100 days ago	3-4 transactions	$200-$499 spent
2	101-180 days ago	2 transactions	$50-$199 spent
1 (Worst)	181+ days ago	1 transaction	Under $50 spent

RFM Segment Codes: Some analysts combine scores into a three-digit code (e.g., "555" = Champion, "111" = Lost Customer). An RFM score of 555 represents the best possible customer, while 111 indicates a customer at high risk of permanent churn.
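
One way to apply the scoring table is `pd.cut` with explicit bin edges. The sketch below assumes the thresholds from the table, treats exactly 180 days as score 2, and uses made-up RFM values; the column names `r_score`, `f_score`, `m_score`, and `rfm_code` are illustrative choices, not required names.

```python
import numpy as np
import pandas as pd

# Made-up RFM values for illustration.
rfm = pd.DataFrame({
    "customer_id": ["C1", "C2", "C3"],
    "recency": [5, 75, 240],
    "frequency": [9, 3, 1],
    "monetary": [1500.0, 250.0, 30.0],
})

# Recency: lower is better, so the labels run from 5 down to 1.
rfm["r_score"] = pd.cut(
    rfm["recency"],
    bins=[-1, 20, 50, 100, 180, np.inf],
    labels=[5, 4, 3, 2, 1],
).astype(int)

# Frequency: 1, 2, 3-4, 5-7, 8+ transactions.
rfm["f_score"] = pd.cut(
    rfm["frequency"],
    bins=[0, 1, 2, 4, 7, np.inf],
    labels=[1, 2, 3, 4, 5],
).astype(int)

# Monetary: left-closed bins so that $50.00 scores 2 and $1,000.00 scores 5.
rfm["m_score"] = pd.cut(
    rfm["monetary"],
    bins=[0, 50, 200, 500, 1000, np.inf],
    labels=[1, 2, 3, 4, 5],
    right=False,
).astype(int)

# Three-digit segment code, e.g. "555" for the best possible customer.
rfm["rfm_code"] = (
    rfm["r_score"].astype(str)
    + rfm["f_score"].astype(str)
    + rfm["m_score"].astype(str)
)
print(rfm[["customer_id", "rfm_code"]])
```

Fixed thresholds keep the scores easy to explain to stakeholders; quantile-based scoring (for example with `pd.qcut`) is a common alternative when the data is heavily skewed.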

Required: Your notebook must show the RFM DataFrame with at least the customer_id, recency, frequency, and monetary columns calculated correctly.

05

K-Means Clustering

After calculating RFM values, apply K-Means clustering to group customers into segments. You must determine the optimal number of clusters using the Elbow method and/or Silhouette analysis.

1. Data Preprocessing & Feature Scaling

K-Means clustering is highly sensitive to the scale of features. Without proper scaling, the monetary value (ranging from $50 to $5,000) would dominate over frequency (1-15 transactions), leading to poor cluster quality.

Why Scaling Matters: Imagine three customers:

Customer A: Recency=10 days, Frequency=5, Monetary=$2,000
Customer B: Recency=15 days, Frequency=6, Monetary=$2,200
Customer C: Recency=12 days, Frequency=5, Monetary=$500

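Without scaling, the distance between customers is driven almost entirely by Monetary: A and B look close (a $200 gap) and C looks far away (a $1,500 gap), while the differences in Recency and Frequency barely register. With scaling, all three dimensions contribute on a comparable footing.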

StandardScaler Method: Transforms each feature to have mean=0 and standard deviation=1, ensuring all RFM metrics contribute equally to the clustering algorithm.
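
The effect can be checked on the three example customers above. This is a small sketch: the helper `dist` is hypothetical, and with only three rows the scaled values are illustrative rather than representative of the full dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# The three example customers: [recency, frequency, monetary].
X = np.array([
    [10, 5, 2000.0],  # Customer A
    [15, 6, 2200.0],  # Customer B
    [12, 5, 500.0],   # Customer C
])

def dist(matrix, i, j):
    """Euclidean distance between rows i and j."""
    return float(np.linalg.norm(matrix[i] - matrix[j]))

# Raw distances are dominated by the monetary column.
raw_ab, raw_ac = dist(X, 0, 1), dist(X, 0, 2)

# After scaling, each column has mean 0 and standard deviation 1.
X_scaled = StandardScaler().fit_transform(X)
scaled_ab, scaled_ac = dist(X_scaled, 0, 1), dist(X_scaled, 0, 2)

print(f"raw:    A-B={raw_ab:.1f}  A-C={raw_ac:.1f}")
print(f"scaled: A-B={scaled_ab:.2f}  A-C={scaled_ac:.2f}")
```

In the project itself, the scaler would be fitted on the recency, frequency, and monetary columns of the full RFM DataFrame.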

2. Elbow Method for Optimal Clusters

The Elbow Method helps you determine the optimal number of customer segments by plotting the within-cluster sum of squares (inertia) against different values of k (number of clusters).

What to Look For

The "elbow" is the point where adding more clusters yields diminishing returns: inertia drops sharply up to this point, then levels off. The k at this bend is a good candidate for the optimal number of clusters.

Interpretation Example

If inertia for k=2 through k=7 is 500→200→100→85→80→78, the elbow appears at k=4: inertia falls sharply up to that point (500→200→100) and only marginally afterwards (100→85→80→78). Going beyond k=4 adds complexity without much improvement.

Test Range: Evaluate k values from 2 to 10. Too few clusters (k=2) oversimplify customer diversity. Too many (k=10+) create segments too small to be actionable.
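
A minimal sketch of collecting inertia over the suggested range is shown below. It assumes a synthetic feature matrix from `make_blobs` as a stand-in for your scaled RFM data, so the numbers it prints say nothing about the real dataset; plotting the values is left to matplotlib.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the RFM matrix: 3 features, 4 underlying groups.
X_raw, _ = make_blobs(
    n_samples=200, n_features=3, centers=4, cluster_std=0.8, random_state=42
)
X_scaled = StandardScaler().fit_transform(X_raw)

k_values = list(range(2, 11))  # k = 2 .. 10
inertias = []
for k in k_values:
    model = KMeans(n_clusters=k, n_init=10, random_state=42)
    model.fit(X_scaled)
    # inertia_ is the within-cluster sum of squared distances.
    inertias.append(model.inertia_)

for k, inertia in zip(k_values, inertias):
    print(f"k={k}: inertia={inertia:.1f}")
```

Plotting `k_values` against `inertias` as a line chart gives the Elbow plot; look for the k after which the curve flattens.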

3. Silhouette Analysis for Validation

The Silhouette Score measures how well each customer fits within their assigned cluster compared to other clusters. Scores range from -1 to +1.

Score Range	Interpretation	Action
0.70 - 1.00	Strong, well-defined clusters	Excellent choice
0.50 - 0.70	Reasonable cluster structure	Acceptable
0.25 - 0.50	Weak cluster structure	Consider different k
Below 0.25	Poor clustering	Try different k

Balance Both Metrics: Choose the k value that shows an elbow in the inertia plot AND maintains a good Silhouette Score (above 0.50). Sometimes the mathematically optimal k may not align with business needs (e.g., k=7 might be too many segments to manage effectively).
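
Silhouette scores for the same range of k can be gathered in a similar loop. As before, this is a sketch on synthetic `make_blobs` data standing in for the scaled RFM matrix, and the highest-scoring k it reports applies only to that synthetic data.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the RFM matrix: 3 features, 4 underlying groups.
X_raw, _ = make_blobs(
    n_samples=200, n_features=3, centers=4, cluster_std=0.8, random_state=42
)
X_scaled = StandardScaler().fit_transform(X_raw)

scores = {}
for k in range(2, 11):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_scaled)
    # Mean silhouette coefficient over all points, between -1 and +1.
    scores[k] = silhouette_score(X_scaled, labels)

best_k = max(scores, key=scores.get)
for k, score in scores.items():
    print(f"k={k}: silhouette={score:.3f}")
print("Highest silhouette at k =", best_k)
```

Treat the highest-scoring k as a candidate rather than a verdict, and weigh it against the elbow plot and how many segments the marketing team can realistically manage.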

4. Apply Final Clustering & Assign Segments

After determining your optimal k value (typically 3-5 for customer segmentation), apply K-Means to assign each customer to a cluster. Each cluster number will later be mapped to a meaningful business label.

3-4 clusters (Most Common): Champions, Loyals, At-Risk, Lost
5-6 clusters (More Granular): Add segments like "New Customers" or "Hibernating"
7+ clusters (Advanced): Complex segmentation for large enterprises

Random State: Always set a random state (e.g., 42) to ensure reproducible results. This means running your notebook multiple times will produce identical clusters.
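
The final fit and the reproducibility claim can be sketched as follows. The RFM table here is randomly generated for illustration, k=4 is an arbitrary choice for the example, and `fit_clusters` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Randomly generated RFM table for illustration only.
rng = np.random.default_rng(0)
rfm = pd.DataFrame({
    "customer_id": [f"C{i:03d}" for i in range(60)],
    "recency": rng.integers(1, 300, size=60),
    "frequency": rng.integers(1, 12, size=60),
    "monetary": rng.uniform(20, 3000, size=60).round(2),
})

features = ["recency", "frequency", "monetary"]
X_scaled = StandardScaler().fit_transform(rfm[features])

def fit_clusters(X, k):
    """Fit K-Means with a fixed random_state and return the cluster labels."""
    return KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)

rfm["cluster"] = fit_clusters(X_scaled, k=4)

# Same data and same random_state give the same labels on a second run.
assert (rfm["cluster"] == fit_clusters(X_scaled, k=4)).all()
print(rfm["cluster"].value_counts().sort_index())
```

Note that the cluster numbers themselves are arbitrary; which number corresponds to "Champions" has to be read off each cluster's RFM averages.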

Required Visualizations

1. Elbow Plot

Inertia vs. k to determine optimal clusters

2. Silhouette Plot

Silhouette score vs. k for validation

3. 3D Scatter

RFM clusters in 3D space (Plotly)

06

Segment Profiling & Recommendations

After clustering, analyze each segment's characteristics and develop actionable marketing recommendations. This is where your business acumen shines—translate data patterns into strategic actions.

Segment Analysis Process

After clustering, you need to understand what makes each segment unique. Calculate summary statistics for each cluster to reveal the behavioral patterns that define each customer group.

Key Statistics to Calculate:

Mean RFM values - Average behavior of the segment
Customer count - Size of each segment
Total revenue - Revenue contribution per segment
Percentage of total customers - Relative size
Revenue percentage - Value contribution
Min/Max ranges - Segment boundaries
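
The statistics listed above can be computed in one grouped aggregation. The sketch below uses a made-up five-customer table with cluster assignments already attached; the output column names are illustrative.

```python
import pandas as pd

# Made-up clustered RFM table for illustration.
rfm = pd.DataFrame({
    "customer_id": ["C1", "C2", "C3", "C4", "C5"],
    "recency": [5, 12, 150, 200, 30],
    "frequency": [9, 8, 2, 1, 5],
    "monetary": [1500.0, 1200.0, 150.0, 50.0, 600.0],
    "cluster": [0, 0, 1, 1, 2],
})

profile = rfm.groupby("cluster").agg(
    customers=("customer_id", "count"),
    recency_mean=("recency", "mean"),
    frequency_mean=("frequency", "mean"),
    monetary_mean=("monetary", "mean"),
    monetary_min=("monetary", "min"),
    monetary_max=("monetary", "max"),
    revenue=("monetary", "sum"),
)

# Relative size and value contribution of each segment, in percent.
profile["customer_pct"] = (
    100 * profile["customers"] / profile["customers"].sum()
).round(1)
profile["revenue_pct"] = (
    100 * profile["revenue"] / profile["revenue"].sum()
).round(1)
print(profile)
```

Min/max columns for recency and frequency can be added to the same `agg` call in the same way as the monetary ones.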

Understanding 3D Visualization

A 3D scatter plot with Recency on the X-axis, Frequency on the Y-axis, and Monetary on the Z-axis provides an intuitive view of how your segments are distributed in RFM space. Each point represents a customer, colored by their assigned segment.

What Good Clustering Looks Like

Clear separation between segment colors
Tight grouping within each color cluster
Minimal overlap between segments
No outlier segments with just 1-2 customers

Warning Signs

Segments heavily overlapping in 3D space
One segment containing 80%+ of customers
Segments with fewer than 5% of total customers
No clear visual distinction between segments

Example Segment Profiles

Below are typical segments you might discover in retail customer data. Your actual segments may differ based on your chosen k value and the clustering results.

Segment	Recency	Frequency	Monetary	Label	Strategy
Cluster 0	Very Low (5-20 days)	High (8+ purchases)	High ($1,000+)	Champions	VIP loyalty rewards, early product access, referral incentives, satisfaction surveys
Cluster 1	Low (20-50 days)	Medium (4-7 purchases)	Medium ($400-$999)	Loyal Customers	Upsell premium products, cross-sell recommendations, exclusive member benefits
Cluster 2	High (100-180 days)	Low (2-3 purchases)	Low ($100-$399)	At Risk	Win-back campaigns (20% discount), re-engagement emails, personalized recommendations
Cluster 3	Very High (180+ days)	Very Low (1 purchase)	Low (<$100)	Lost	Final re-activation offer (30-40% off), survey for feedback, consider list removal

Your Segments Will Vary: The exact boundaries and characteristics depend on your data and chosen k value. A 3-cluster solution combines some of these groups, while a 5-cluster solution might split "Loyal Customers" into "Promising" and "Established Loyal" segments.

Expected Segment Distribution

In most retail datasets, you'll see an uneven distribution following these patterns:

Champions (5-15%): Small but high-value segment generating 40-60% of revenue
Loyal Customers (20-30%): Stable segment with good engagement and moderate spend
At Risk (25-35%): Critical segment needing immediate retention efforts
Lost/Inactive (30-40%): Largest segment with low engagement and minimal value

Required Recommendations

For each segment, provide comprehensive profiling that bridges data analysis and business action:

1. Segment Name

Use memorable, business-friendly labels that instantly communicate the customer's status. Examples: "Champions," "Loyal Customers," "Potential Loyalists," "At Risk," "Hibernating," "Lost," "New Customers," "Promising."

2. Characteristics

Summarize the RFM profile in plain English. Example: "These customers purchased within the last 30 days, buy frequently (5+ times), and spend moderately ($200-$500 on average)."

3. Size & Distribution

Report both count and percentage. Example: "47 customers (23% of total customer base)." This helps prioritize which segments deserve the most attention.

4. Value Contribution

Calculate total revenue from the segment and its percentage of overall revenue. Example: "Generated $12,500 (31% of total revenue)." This quantifies the business impact.

5. Marketing Strategy (Most Important)

Provide 2-3 specific, actionable recommendations tailored to each segment's behavior:

For Champions: "Implement VIP loyalty program with exclusive early access to new products. Incentivize referrals with reward points. Conduct satisfaction surveys to maintain engagement."
For At-Risk: "Launch win-back email campaign with 20% discount on next purchase. Conduct exit survey to understand disengagement. Offer personalized product recommendations based on past purchases."
For Lost: "Send re-engagement campaign highlighting new product lines. Offer significant discount (30-40%) or free shipping. Consider removing from regular marketing lists to reduce costs."

Think ROI: Match marketing intensity to segment value. High-value segments (Champions, Loyal Customers) justify premium retention costs. Low-value, low-engagement segments may not warrant expensive campaigns.

Deliverable: Create a summary table or visualization showing all segments with their labels, sizes, and recommended strategies.

07

Submission Requirements

Create a public GitHub repository with the exact name shown below:

Required Repository Name

customer-segmentation-project

github.com/<your-username>/customer-segmentation-project

Required Project Structure

customer-segmentation-project/
├── data/
│   └── customer_transactions.csv    # The dataset (download from above)
├── notebooks/
│   └── customer_segmentation.ipynb  # Your main analysis notebook
├── outputs/
│   └── customer_segments.csv        # Final segmented customer data
├── requirements.txt                 # Python dependencies
└── README.md                        # REQUIRED - see contents below

README.md Must Include:

Your full name and submission date
Project overview and business context
RFM methodology explanation
Number of clusters chosen and why
Segment profiles with labels and strategies
Screenshots of key visualizations (Elbow plot, 3D scatter)

Python Dependencies

Create a requirements.txt file in your project root listing all Python packages needed to run your notebook. This allows anyone to recreate your environment.

Required Libraries:

pandas - Data manipulation and RFM calculations
numpy - Numerical operations
scikit-learn - K-Means clustering and scaling
plotly - Interactive 3D visualizations
matplotlib - Static plots (Elbow method)
seaborn - Statistical visualizations
jupyter - Notebook environment

Version Format: Use package>=version to ensure minimum compatible versions (e.g., pandas>=2.0.0).

Output File: customer_segments.csv

Export a CSV file containing your segmentation results. This file should be saved in an outputs/ folder and include all customers with their assigned segments.

Required Column	Description	Example Value
`customer_id`	Unique customer identifier	CUST001
`recency`	Days since last purchase	15
`frequency`	Total number of transactions	8
`monetary`	Total amount spent	1250.50
`cluster`	Numeric cluster assignment	0
`segment_label`	Business-friendly segment name	Champions

Export Process: Use pandas .to_csv() method with index=False to prevent adding row numbers. Map cluster numbers (0, 1, 2, 3) to meaningful labels ("Champions", "Loyal Customers", etc.) before exporting.
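
The mapping and export step might look like the sketch below. The cluster-to-label mapping is hypothetical and must be decided from your own cluster profiles, the four rows are made up, and the snippet writes to an in-memory buffer so it runs anywhere.

```python
import io
import pandas as pd

# Made-up clustered RFM table for illustration.
rfm = pd.DataFrame({
    "customer_id": ["CUST001", "CUST002", "CUST003", "CUST004"],
    "recency": [15, 40, 150, 220],
    "frequency": [8, 5, 2, 1],
    "monetary": [1250.50, 640.00, 180.25, 45.00],
    "cluster": [0, 1, 2, 3],
})

# Hypothetical mapping: choose it after inspecting each cluster's mean RFM
# values, because K-Means cluster numbers carry no inherent meaning.
segment_names = {0: "Champions", 1: "Loyal Customers", 2: "At Risk", 3: "Lost"}
rfm["segment_label"] = rfm["cluster"].map(segment_names)

output_columns = [
    "customer_id", "recency", "frequency", "monetary",
    "cluster", "segment_label",
]

# index=False keeps pandas from writing row numbers as an extra column.
buffer = io.StringIO()
rfm[output_columns].to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

In the project, pass a relative path to outputs/customer_segments.csv as the first argument of `to_csv` instead of the buffer.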

Do Include

Complete RFM calculation with code
Elbow method and Silhouette analysis
At least 5 visualizations
Segment profiles with marketing strategies
Exported customer_segments.csv
README with methodology explanation

Do Not Include

Virtual environment folders (venv, .env)
Any .pyc or __pycache__ files
Unexecuted notebooks
Hardcoded file paths
Clusters without business interpretation

Submit Your Project

Enter your GitHub username - we will verify your repository automatically

08

Grading Rubric

Your project will be evaluated on both technical execution and business insight. A perfect score requires not just correct implementation, but also clear communication of findings and actionable recommendations.

Criteria	Points	Description
1. RFM Calculation	20	Correct computation of Recency (days since last purchase), Frequency (transaction count), and Monetary (total spending) values for each customer. Must use appropriate date handling and aggregation functions.
2. Data Preprocessing	10	Proper data cleaning (if needed), datetime conversion, and StandardScaler implementation. Features must be scaled before clustering. Handle any edge cases (e.g., customers with single transactions).
3. K-Means Implementation	20	Correct K-Means clustering with appropriate parameters (random_state for reproducibility, n_init=10). Clear justification for chosen k value based on analysis, not arbitrary selection.
4. Elbow & Silhouette Analysis	10	Both Elbow plot (inertia vs k) and Silhouette scores calculated for k=2 to k=10. Visual plots included with clear labeling. Written explanation of how optimal k was determined.
5. Visualizations	15	Minimum 5 professional visualizations: Elbow plot, Silhouette plot, 3D scatter (Plotly), cluster size distribution, and one additional insight chart. All charts must have titles, axis labels, and legends.
6. Segment Profiling	15	Each segment has: (1) Memorable business name, (2) Clear RFM characteristics, (3) Size and percentage, (4) Revenue contribution, (5) Summary table or profile card. Segments must be distinct and interpretable.
7. Marketing Recommendations	10	2-3 specific, actionable marketing strategies for each segment. Recommendations must be tailored to segment behavior (not generic). Examples: loyalty programs for champions, win-back campaigns for at-risk customers.
Total	100	Weighted Score

Grading Breakdown

Excellent (90-100): Outstanding work with insights
Good (75-89): Solid implementation
Passing (60-74): Meets basic requirements

Helpful Resources

Module 10.1: Clustering - Review K-Means algorithm fundamentals
Module 8.1: Feature Creation - Feature engineering techniques
Scikit-learn: K-Means - Official documentation
Plotly: 3D Scatter Plots - Interactive 3D visualization guide
