Project Overview
Customer segmentation is one of the most impactful applications of unsupervised learning in business. In this project, you will use RFM (Recency, Frequency, Monetary) analysis combined with K-Means clustering to identify distinct customer groups and develop targeted marketing strategies for each segment.
RFM Analysis
Calculate Recency, Frequency, and Monetary values
K-Means Clustering
Implement and optimize clustering algorithm
Segment Profiling
Analyze and name each customer segment
Recommendations
Develop marketing strategies per segment
Business Scenario
ShopSmart Retail
You have been hired as a Customer Analytics Specialist at ShopSmart, a growing retail company operating in the competitive e-commerce space. The company has experienced steady growth over the past year, but the marketing team has been using generic, one-size-fits-all email campaigns that yield low engagement rates (averaging just 8% open rate and 1.5% click-through rate).
The CMO recognizes that different customers have vastly different needs and value propositions. A customer who purchased once six months ago shouldn't receive the same messaging as a loyal customer who shops weekly. However, without data-driven customer segments, the team has no framework for personalization.
"We have customer transaction data but no way to identify our best customers versus those at risk of churning. Can you segment our customer base and help us understand each group's characteristics so we can tailor our marketing efforts? We need actionable segments we can actually use in campaigns."
The Business Challenge
ShopSmart faces several common retail challenges that customer segmentation can address:
Customer Churn
35% of customers make only one purchase and never return. The company has no early warning system to identify at-risk customers before they churn.
Inefficient Marketing
Marketing budget is spread evenly across all customers, wasting resources on disengaged users while under-investing in high-value segments.
Unknown Patterns
Management has no visibility into how many loyal customers exist, what percentage are at risk, or which segments drive the most revenue.
Business Objectives
- Identify distinct customer segments based on behavior
- Determine the optimal number of clusters
- Profile each segment with clear characteristics
- Identify high-value customers for loyalty programs
- Find at-risk customers for retention campaigns
- Discover growth opportunities in each segment
Key questions to answer:
- Who are our most valuable customers?
- Which customers are at risk of churning?
- What percentage of customers are occasional buyers?
Expected deliverables:
- RFM score for each customer
- Cluster assignment with labels
- Marketing recommendations per segment
The Dataset
You will work with a customer transaction dataset containing purchase history from a retail business over 12 months. This realistic dataset includes multiple transactions per customer, allowing you to calculate meaningful RFM metrics and identify distinct customer segments.
What Makes This Dataset Ideal for Segmentation
Repeat Customers
Many customers have multiple transactions, allowing you to measure frequency and identify loyal vs. one-time buyers.
Time Spread
Transactions span a full year, providing sufficient time range to calculate meaningful recency values and identify churn patterns.
Value Diversity
Wide range of transaction amounts from small purchases to large orders, enabling clear monetary segmentation.
Dataset Schema
| Column | Type | Description |
|---|---|---|
| transaction_id | String | Unique transaction identifier |
| customer_id | String | Unique customer identifier |
| customer_name | String | Customer full name |
| email | String | Customer email address |
| transaction_date | Date | Date of transaction (YYYY-MM-DD) |
| product_id | String | Product identifier |
| product_name | String | Name of product purchased |
| category | String | Product category (Electronics, Furniture, Office Supplies) |
| quantity | Integer | Number of units purchased |
| unit_price | Float | Price per unit ($) |
| total_amount | Float | Total transaction value ($) |
| payment_method | String | Payment type (Credit Card, Debit Card, PayPal) |
| city | String | Customer's city |
| region | String | Geographic region (North, South, East, West) |
Key Columns for RFM Analysis
| Column | Used For | Why It Matters |
|---|---|---|
| customer_id | Grouping transactions by customer | Enables aggregation of all purchases per customer |
| transaction_date | Recency calculation | Find most recent purchase date for each customer |
| transaction_id | Frequency calculation | Count number of transactions per customer |
| total_amount | Monetary calculation | Sum total spending per customer |
The category, region, and payment_method columns can be used for additional profiling insights (e.g., "Champions prefer Electronics and use Credit Cards").
RFM Analysis
RFM (Recency, Frequency, Monetary) is a proven customer segmentation technique that scores customers based on their purchase behavior. You must calculate these three metrics for each customer.
Recency (R)
Definition: How recently did the customer make a purchase?
Calculation: Days since last purchase
Lower recency = Better (more recent customer)
Frequency (F)
Definition: How often does the customer purchase?
Calculation: Total number of transactions
Higher frequency = Better (loyal customer)
Monetary (M)
Definition: How much does the customer spend?
Calculation: Total amount spent
Higher monetary = Better (high-value customer)
Understanding RFM Calculations
Recency Details
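Analysis Date: Set to one day after the last transaction in your dataset. Fixing the reference date this way keeps recency values reproducible across runs, and the most recent buyer gets a recency of 1 day.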
For Each Customer: Calculate the number of days between the analysis date and their most recent purchase.
Interpretation: A customer with recency of 5 days is much more engaged than one with 200 days.
Frequency Details
Count Transactions: Simply count the total number of transactions each customer has made.
Customer Loyalty: Customers with 10+ transactions show strong loyalty and engagement.
One-time Buyers: Customers with frequency = 1 have been acquired but not yet retained. They represent opportunities to turn a first purchase into a repeat one.
Monetary Details
Sum All Purchases: Total the dollar amount spent across all transactions for each customer.
High-Value Customers: Top 20% of customers often generate 80% of revenue (Pareto Principle).
Revenue Impact: Focus retention efforts on high monetary value customers.
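The three calculations above can be sketched with a single pandas groupby. This is a minimal illustration, not a required implementation: the five-row sample frame is made up, and only the column names come from the dataset schema.

```python
import pandas as pd

# Tiny made-up sample; the real data would be read from data/customer_transactions.csv
transactions = pd.DataFrame({
    "transaction_id": ["T1", "T2", "T3", "T4", "T5"],
    "customer_id": ["C1", "C1", "C2", "C3", "C3"],
    "transaction_date": ["2024-01-05", "2024-03-10", "2024-02-01", "2024-03-01", "2024-03-14"],
    "total_amount": [120.0, 80.0, 45.5, 300.0, 150.0],
})
transactions["transaction_date"] = pd.to_datetime(transactions["transaction_date"])

# Analysis date: one day after the last transaction in the dataset
analysis_date = transactions["transaction_date"].max() + pd.Timedelta(days=1)

rfm = (
    transactions.groupby("customer_id")
    .agg(
        # Recency: days between the analysis date and the customer's latest purchase
        recency=("transaction_date", lambda d: (analysis_date - d.max()).days),
        # Frequency: number of transactions
        frequency=("transaction_id", "count"),
        # Monetary: total amount spent
        monetary=("total_amount", "sum"),
    )
    .reset_index()
)
print(rfm)
```

The result has one row per customer, which is the shape the later scaling and clustering steps expect.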
RFM Scoring System (Optional Enhancement)
Beyond raw RFM values, you can create a scoring system (1-5) where each customer receives a score for each metric. This makes segments easier to communicate to business stakeholders.
| Score | Recency | Frequency | Monetary |
|---|---|---|---|
| 5 (Best) | 0-20 days ago | 8+ transactions | $1,000+ spent |
| 4 | 21-50 days ago | 5-7 transactions | $500-$999 spent |
| 3 | 51-100 days ago | 3-4 transactions | $200-$499 spent |
| 2 | 101-180 days ago | 2 transactions | $50-$199 spent |
| 1 (Worst) | Over 180 days ago | 1 transaction | Under $50 spent |
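One way to apply the thresholds in this table is `pd.cut` with explicit bin edges. The sketch below uses three made-up customers and assumes that a recency of exactly 180 days receives a score of 2; the column names `r_score`, `f_score`, `m_score`, and `rfm_score` are illustrative choices, not names the project requires.

```python
import numpy as np
import pandas as pd

# Made-up RFM values for three customers
rfm = pd.DataFrame({
    "customer_id": ["C1", "C2", "C3"],
    "recency": [5, 43, 200],
    "frequency": [9, 3, 1],
    "monetary": [1250.50, 320.00, 40.00],
})

# Recency: lower is better, so the labels run from 5 down to 1.
# Intervals are right-inclusive: (-1, 20], (20, 50], (50, 100], (100, 180], (180, inf)
rfm["r_score"] = pd.cut(
    rfm["recency"], bins=[-1, 20, 50, 100, 180, np.inf], labels=[5, 4, 3, 2, 1]
).astype(int)

# Frequency: 1 -> 1, 2 -> 2, 3-4 -> 3, 5-7 -> 4, 8+ -> 5
rfm["f_score"] = pd.cut(
    rfm["frequency"], bins=[0, 1, 2, 4, 7, np.inf], labels=[1, 2, 3, 4, 5]
).astype(int)

# Monetary: left-inclusive bins so that exactly $50, $200, $500, $1,000 move up a band
rfm["m_score"] = pd.cut(
    rfm["monetary"], bins=[0, 50, 200, 500, 1000, np.inf],
    labels=[1, 2, 3, 4, 5], right=False,
).astype(int)

# Combined three-digit code such as "555" for the best customers
rfm["rfm_score"] = (
    rfm["r_score"].astype(str) + rfm["f_score"].astype(str) + rfm["m_score"].astype(str)
)
print(rfm)
```

Fixed thresholds keep the scores easy to explain to stakeholders. Quantile-based bins are a common alternative when the value ranges in the data differ from those in the table.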
K-Means Clustering
After calculating RFM values, apply K-Means clustering to group customers into segments. You must determine the optimal number of clusters using the Elbow method and/or Silhouette analysis.
Data Preprocessing & Feature Scaling
K-Means clustering is highly sensitive to the scale of features. Without proper scaling, the monetary value (ranging from $50 to $5,000) would dominate over frequency (1-15 transactions), leading to poor cluster quality.
- Customer A: Recency=10 days, Frequency=5, Monetary=$2,000
- Customer B: Recency=15 days, Frequency=6, Monetary=$2,200
- Customer C: Recency=12 days, Frequency=5, Monetary=$500
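Without scaling, the distance between customers is driven almost entirely by Monetary: A and B look close (both around $2,000) and C looks far from both, even though all three have nearly identical Recency and Frequency. With scaling, all three dimensions contribute on a comparable footing.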
StandardScaler Method: Transforms each feature to have mean=0 and standard deviation=1, ensuring all RFM metrics contribute equally to the clustering algorithm.
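A small sketch of the effect, using the three example customers above. It first measures how much of the squared distance between A and C comes from Monetary on the raw values, then applies scikit-learn's `StandardScaler`. The variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Columns: recency (days), frequency (transactions), monetary ($)
X = np.array([
    [10, 5, 2000.0],  # Customer A
    [15, 6, 2200.0],  # Customer B
    [12, 5, 500.0],   # Customer C
])

# Share of the squared A-to-C distance contributed by each raw feature
squared_gap = (X[0] - X[2]) ** 2
raw_share = squared_gap / squared_gap.sum()
print("Raw share per feature:", raw_share.round(6))  # monetary dominates

# Standardize each column to mean 0 and standard deviation 1
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print("Column means:", X_scaled.mean(axis=0).round(6))
print("Column stds: ", X_scaled.std(axis=0).round(6))
```

In the project, the same `fit_transform` call would be applied to the recency, frequency, and monetary columns of the full RFM table before any clustering.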
Elbow Method for Optimal Clusters
The Elbow Method helps you determine the optimal number of customer segments by plotting the within-cluster sum of squares (inertia) against different values of k (number of clusters).
What to Look For
The "elbow" is the point where adding more clusters provides diminishing returns. The inertia drops sharply until this point, then levels off. This bend in the curve suggests the optimal k.
Interpretation Example
If inertia drops from 500→200→100→85→80→78 as k increases, the elbow appears at k=4: the step from 100 to 85 is the last meaningful drop, and every later step removes 5 or less. Going beyond k=4 adds complexity without much improvement.
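The arithmetic behind this example can be checked in a few lines. The sketch assumes the six inertia values correspond to k=1 through k=6, which is the reading under which the 100→85 step lands on k=4; the example itself does not state the starting k.

```python
# Inertia values from the example, ASSUMED to correspond to k = 1..6
ks = [1, 2, 3, 4, 5, 6]
inertia = [500, 200, 100, 85, 80, 78]

# Reduction in inertia obtained by moving from k-1 to k clusters
drops = {k: prev - cur for k, prev, cur in zip(ks[1:], inertia, inertia[1:])}

for k, drop in drops.items():
    print(f"k={k}: inertia falls by {drop}")
```

Under that assumption, each cluster added after k=4 reduces inertia by 5 or less. Where exactly the bend sits is a judgment call, which is why the project pairs the Elbow method with Silhouette analysis.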
Test Range: Evaluate k values from 2 to 10. Too few clusters (k=2) oversimplify customer diversity. Too many (k=10+) create segments too small to be actionable.
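A minimal sketch of the loop behind an Elbow plot, for k from 2 to 10. The data here is synthetic (generated with `make_blobs` as a stand-in for a scaled RFM matrix), so the inertia values say nothing about the ShopSmart dataset.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled RFM matrix (hypothetical data with 4 groups)
X, _ = make_blobs(n_samples=300, centers=4, n_features=3, random_state=42)
X_scaled = StandardScaler().fit_transform(X)

# Fit K-Means for each candidate k and record the within-cluster sum of squares
inertia_by_k = {}
for k in range(2, 11):
    model = KMeans(n_clusters=k, n_init=10, random_state=42)
    model.fit(X_scaled)
    inertia_by_k[k] = model.inertia_

for k, inertia in inertia_by_k.items():
    print(f"k={k}: inertia={inertia:.1f}")
```

Plotting the keys of `inertia_by_k` against its values (for example with matplotlib's `plt.plot`, with markers at each k) gives the Elbow plot.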
Silhouette Analysis for Validation
The Silhouette Score measures how well each customer fits within their assigned cluster compared to other clusters. Scores range from -1 to +1.
| Score Range | Interpretation | Action |
|---|---|---|
| 0.70 - 1.00 | Strong, well-defined clusters | Excellent choice |
| 0.50 - 0.70 | Reasonable cluster structure | Acceptable |
| 0.25 - 0.50 | Weak cluster structure | Consider different k |
| Below 0.25 | Poor clustering | Try different k |
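Silhouette scores for the same range of k can be computed with scikit-learn's `silhouette_score`. The sketch below again uses synthetic data as a stand-in for the scaled RFM matrix, so the scores and the selected k are illustrative only.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled RFM matrix (hypothetical data)
X, _ = make_blobs(n_samples=300, centers=4, n_features=3, cluster_std=0.8, random_state=42)
X_scaled = StandardScaler().fit_transform(X)

# Mean silhouette score over all points for each candidate k
silhouette_by_k = {}
for k in range(2, 11):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_scaled)
    silhouette_by_k[k] = silhouette_score(X_scaled, labels)

# Candidate k with the highest mean silhouette score
best_k = max(silhouette_by_k, key=silhouette_by_k.get)

for k, score in silhouette_by_k.items():
    print(f"k={k}: silhouette={score:.3f}")
print("Highest silhouette at k =", best_k)
```

The k with the highest score is a candidate, not a verdict. It is worth comparing against the Elbow plot and against whether the resulting segments are large enough to act on.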
Apply Final Clustering & Assign Segments
After determining your optimal k value (typically 3-5 for customer segmentation), apply K-Means to assign each customer to a cluster. Each cluster number will later be mapped to a meaningful business label.
Most Common
Champions, Loyals, At-Risk, Lost
More Granular
Add segments like "New Customers" or "Hibernating"
Advanced
Complex segmentation for large enterprises
Random State: Always set a random state (e.g., 42) to ensure reproducible results. This means running your notebook multiple times will produce identical clusters.
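The final fit can be sketched as follows. The eight customers are made up, and k=4 is used purely for illustration; in the project, k should come from the Elbow and Silhouette analysis.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Made-up RFM table for eight customers
rfm = pd.DataFrame({
    "customer_id": [f"C{i}" for i in range(1, 9)],
    "recency": [5, 8, 30, 45, 120, 150, 200, 260],
    "frequency": [10, 9, 6, 5, 3, 2, 1, 1],
    "monetary": [1500.0, 1300.0, 700.0, 600.0, 250.0, 180.0, 60.0, 40.0],
})

features = ["recency", "frequency", "monetary"]
X_scaled = StandardScaler().fit_transform(rfm[features])

# Fixed random_state so that repeated runs give the same assignment
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
rfm["cluster"] = kmeans.fit_predict(X_scaled)

# Re-running with the same settings reproduces the labels
repeat = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X_scaled)
print(rfm[["customer_id", "cluster"]])
print("Reproducible:", (repeat == rfm["cluster"].to_numpy()).all())
```

Cluster numbers are arbitrary: cluster 0 is not guaranteed to be the best segment. Inspect each cluster's mean RFM values before attaching a business label.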
Required Visualizations
- Elbow plot: inertia vs. k to determine optimal clusters
- Silhouette plot: silhouette score vs. k for validation
- 3D scatter plot: RFM clusters in 3D space (Plotly)
Segment Profiling & Recommendations
After clustering, analyze each segment's characteristics and develop actionable marketing recommendations. This is where your business acumen shines: translate data patterns into strategic actions.
Segment Analysis Process
After clustering, you need to understand what makes each segment unique. Calculate summary statistics for each cluster to reveal the behavioral patterns that define each customer group.
Key Statistics to Calculate:
- Mean RFM values - Average behavior of the segment
- Customer count - Size of each segment
- Total revenue - Revenue contribution per segment
- Percentage of total customers - Relative size
- Revenue percentage - Value contribution
- Min/Max ranges - Segment boundaries
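The statistics listed above can be produced with one groupby on the cluster column. This is a sketch on six made-up customers; the output column names are illustrative, and min/max is shown only for monetary to keep the example short.

```python
import pandas as pd

# Made-up RFM table with cluster assignments already attached
rfm = pd.DataFrame({
    "customer_id": ["C1", "C2", "C3", "C4", "C5", "C6"],
    "recency": [5, 10, 40, 150, 200, 240],
    "frequency": [10, 8, 5, 2, 1, 1],
    "monetary": [1500.0, 1100.0, 600.0, 200.0, 60.0, 40.0],
    "cluster": [0, 0, 1, 2, 3, 3],
})

profile = rfm.groupby("cluster").agg(
    recency_mean=("recency", "mean"),
    frequency_mean=("frequency", "mean"),
    monetary_mean=("monetary", "mean"),
    monetary_min=("monetary", "min"),
    monetary_max=("monetary", "max"),
    customers=("customer_id", "count"),
    revenue=("monetary", "sum"),
)

# Relative size and value contribution of each segment
profile["customer_pct"] = 100 * profile["customers"] / profile["customers"].sum()
profile["revenue_pct"] = 100 * profile["revenue"] / profile["revenue"].sum()

print(profile.round(1))
```

A table like this is usually enough to decide which cluster deserves which label, since the names follow from the mean recency, frequency, and monetary values.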
Understanding 3D Visualization
A 3D scatter plot with Recency on the X-axis, Frequency on the Y-axis, and Monetary on the Z-axis provides an intuitive view of how your segments are distributed in RFM space. Each point represents a customer, colored by their assigned segment.
What Good Clustering Looks Like
- Clear separation between segment colors
- Tight grouping within each color cluster
- Minimal overlap between segments
- No outlier segments with just 1-2 customers
Warning Signs
- Segments heavily overlapping in 3D space
- One segment containing 80%+ of customers
- Segments with fewer than 5% of total customers
- No clear visual distinction between segments
Example Segment Profiles
Below are typical segments you might discover in retail customer data. Your actual segments may differ based on your chosen k value and the clustering results.
| Segment | Recency | Frequency | Monetary | Label | Strategy |
|---|---|---|---|---|---|
| Cluster 0 | Very Low (5-20 days) | High (8+ purchases) | High ($1,000+) | Champions | VIP loyalty rewards, early product access, referral incentives, satisfaction surveys |
| Cluster 1 | Low (20-50 days) | Medium (4-7 purchases) | Medium ($400-$999) | Loyal Customers | Upsell premium products, cross-sell recommendations, exclusive member benefits |
| Cluster 2 | High (100-180 days) | Low (2-3 purchases) | Low ($100-$399) | At Risk | Win-back campaigns (20% discount), re-engagement emails, personalized recommendations |
| Cluster 3 | Very High (180+ days) | Very Low (1 purchase) | Low (<$100) | Lost | Final re-activation offer (30-40% off), survey for feedback, consider list removal |
Expected Segment Distribution
In most retail datasets, you'll see an uneven distribution following these patterns:
Champions
Small but high-value segment generating 40-60% of revenue
Loyal Customers
Stable segment with good engagement and moderate spend
At Risk
Critical segment needing immediate retention efforts
Lost/Inactive
Largest segment with low engagement and minimal value
Required Recommendations
For each segment, provide comprehensive profiling that bridges data analysis and business action:
1. Segment Name
Use memorable, business-friendly labels that instantly communicate the customer's status. Examples: "Champions," "Loyal Customers," "Potential Loyalists," "At Risk," "Hibernating," "Lost," "New Customers," "Promising."
2. Characteristics
Summarize the RFM profile in plain English. Example: "Made a purchase within the last 30 days, buy frequently (5+ times), and spend moderately ($200-$500 average)."
3. Size & Distribution
Report both count and percentage. Example: "47 customers (23% of total customer base)." This helps prioritize which segments deserve the most attention.
4. Value Contribution
Calculate total revenue from the segment and its percentage of overall revenue. Example: "Generated $12,500 (31% of total revenue)." This quantifies the business impact.
5. Marketing Strategy (Most Important)
Provide 2-3 specific, actionable recommendations tailored to each segment's behavior:
- For Champions: "Implement VIP loyalty program with exclusive early access to new products. Incentivize referrals with reward points. Conduct satisfaction surveys to maintain engagement."
- For At-Risk: "Launch win-back email campaign with 20% discount on next purchase. Conduct exit survey to understand disengagement. Offer personalized product recommendations based on past purchases."
- For Lost: "Send re-engagement campaign highlighting new product lines. Offer significant discount (30-40%) or free shipping. Consider removing from regular marketing lists to reduce costs."
Submission Requirements
Create a public GitHub repository with the exact name shown below:
Required Repository Name
customer-segmentation-project
Required Project Structure
customer-segmentation-project/
├── data/
│ └── customer_transactions.csv # The dataset (download from above)
├── notebooks/
│ └── customer_segmentation.ipynb # Your main analysis notebook
├── outputs/
│ └── customer_segments.csv # Final segmented customer data
├── requirements.txt # Python dependencies
└── README.md # REQUIRED - see contents below
README.md Must Include:
- Your full name and submission date
- Project overview and business context
- RFM methodology explanation
- Number of clusters chosen and why
- Segment profiles with labels and strategies
- Screenshots of key visualizations (Elbow plot, 3D scatter)
Python Dependencies
Create a requirements.txt file in your project root listing all Python packages needed to run your notebook. This allows anyone to recreate your environment.
Required Libraries:
- pandas - Data manipulation and RFM calculations
- numpy - Numerical operations
- scikit-learn - K-Means clustering and scaling
- plotly - Interactive 3D visualizations
- matplotlib - Static plots (Elbow method)
- seaborn - Statistical visualizations
- jupyter - Notebook environment
Version Format: Use package>=version to ensure minimum compatible versions (e.g., pandas>=2.0.0).
Output File: customer_segments.csv
Export a CSV file containing your segmentation results. This file should be saved in an outputs/ folder and include all customers with their assigned segments.
| Required Column | Description | Example Value |
|---|---|---|
| customer_id | Unique customer identifier | CUST001 |
| recency | Days since last purchase | 15 |
| frequency | Total number of transactions | 8 |
| monetary | Total amount spent | 1250.50 |
| cluster | Numeric cluster assignment | 0 |
| segment_label | Business-friendly segment name | Champions |
Use the pandas .to_csv() method with index=False so that the row index is not written as an extra column. Map cluster numbers (0, 1, 2, 3) to meaningful labels ("Champions", "Loyal Customers", etc.) before exporting.
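The labeling and export step might look like the sketch below. The cluster-to-label mapping is hypothetical and must be checked against each cluster's actual RFM profile. An in-memory buffer stands in for the real output path so the example runs anywhere.

```python
import io

import pandas as pd

# Made-up clustered RFM table
rfm = pd.DataFrame({
    "customer_id": ["CUST001", "CUST002", "CUST003", "CUST004"],
    "recency": [15, 40, 150, 220],
    "frequency": [8, 5, 2, 1],
    "monetary": [1250.50, 640.00, 180.00, 45.00],
    "cluster": [0, 1, 2, 3],
})

# HYPOTHETICAL mapping: confirm against the cluster means before using it
segment_names = {0: "Champions", 1: "Loyal Customers", 2: "At Risk", 3: "Lost"}
rfm["segment_label"] = rfm["cluster"].map(segment_names)

columns = ["customer_id", "recency", "frequency", "monetary", "cluster", "segment_label"]

# In the project this would be: rfm[columns].to_csv("outputs/customer_segments.csv", index=False)
buffer = io.StringIO()
rfm[columns].to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

Selecting the columns explicitly keeps the file in the required column order even if the working DataFrame carries extra columns such as scaled features.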
Do Include
- Complete RFM calculation with code
- Elbow method and Silhouette analysis
- At least 5 visualizations
- Segment profiles with marketing strategies
- Exported customer_segments.csv
- README with methodology explanation
Do Not Include
- Virtual environment folders (venv, .env)
- Any .pyc or __pycache__ files
- Unexecuted notebooks
- Hardcoded file paths
- Clusters without business interpretation
Grading Rubric
Your project will be evaluated on both technical execution and business insight. A perfect score requires not just correct implementation, but also clear communication of findings and actionable recommendations.
| Criteria | Points | Description |
|---|---|---|
| 1. RFM Calculation | 20 | Correct computation of Recency (days since last purchase), Frequency (transaction count), and Monetary (total spending) values for each customer. Must use appropriate date handling and aggregation functions. |
| 2. Data Preprocessing | 10 | Proper data cleaning (if needed), datetime conversion, and StandardScaler implementation. Features must be scaled before clustering. Handle any edge cases (e.g., customers with single transactions). |
| 3. K-Means Implementation | 20 | Correct K-Means clustering with appropriate parameters (random_state for reproducibility, n_init=10). Clear justification for chosen k value based on analysis, not arbitrary selection. |
| 4. Elbow & Silhouette Analysis | 10 | Both Elbow plot (inertia vs k) and Silhouette scores calculated for k=2 to k=10. Visual plots included with clear labeling. Written explanation of how optimal k was determined. |
| 5. Visualizations | 15 | Minimum 5 professional visualizations: Elbow plot, Silhouette plot, 3D scatter (Plotly), cluster size distribution, and one additional insight chart. All charts must have titles, axis labels, and legends. |
| 6. Segment Profiling | 15 | Each segment has: (1) Memorable business name, (2) Clear RFM characteristics, (3) Size and percentage, (4) Revenue contribution, (5) Summary table or profile card. Segments must be distinct and interpretable. |
| 7. Marketing Recommendations | 10 | 2-3 specific, actionable marketing strategies for each segment. Recommendations must be tailored to segment behavior (not generic). Examples: loyalty programs for champions, win-back campaigns for at-risk customers. |
| Total | 100 | Weighted Score |
Grading Breakdown
Excellent
90-100
Outstanding work with insights
Good
75-89
Solid implementation
Passing
60-74
Meets basic requirements