Assignment 10-A

Unsupervised Learning Discovery Challenge

Discover hidden patterns in unlabeled data: perform customer segmentation using K-Means clustering, reduce high-dimensional datasets with PCA, and visualize complex data structures to derive actionable business insights.

6-8 hours · Advanced · 250 points
What You'll Practice
  • K-Means clustering implementation
  • Optimal cluster selection (Elbow, Silhouette)
  • PCA for dimensionality reduction
  • Cluster visualization and profiling
  • Business insights from segments
01. Assignment Overview

In this assignment, you will build a complete unsupervised learning system using scikit-learn. This comprehensive project requires you to apply ALL concepts from Module 10, clustering algorithms and dimensionality reduction techniques, to discover patterns in unlabeled customer data.

Important: You must use scikit-learn for all machine learning tasks. You may use pandas for data manipulation, numpy for numerical operations, and matplotlib/seaborn for visualization.
Skills Applied: This assignment tests your understanding of Clustering (Topic 10.1) and Dimensionality Reduction (Topic 10.2) from Module 10.
Clustering (10.1)

K-Means algorithm, cluster validation, silhouette analysis, elbow method, and customer segmentation

Dimensionality Reduction (10.2)

PCA, variance explained, component analysis, and high-dimensional data visualization

02. The Scenario

RetailMax E-Commerce

You have been hired as a Data Scientist at RetailMax, a growing e-commerce company. The marketing team wants to understand their customer base better to create targeted campaigns. Your manager has assigned you a critical project:

"We have transaction data for thousands of customers but no predefined categories. We need you to discover natural customer segments based on their purchasing behavior. Additionally, our customer feature set is quite large, and we need to identify the most important dimensions for visualization and analysis. Use clustering and PCA to help us understand our customers better."

Your Tasks

Create a Jupyter notebook called unsupervised_analysis.ipynb that implements customer segmentation using clustering algorithms and reduces dimensionality using PCA for visualization and feature analysis.

Project 1: Customer Segmentation

Segment customers using K-Means clustering based on:

  • Purchase frequency and recency
  • Average transaction value
  • Product category preferences
  • Customer lifetime metrics
Project 2: Dimensionality Reduction

Apply PCA to understand the data structure:

  • Reduce features for visualization
  • Identify principal components
  • Analyze variance explained
  • Visualize clusters in 2D space
03. The Dataset

You will work with real customer behavior data from RetailMax's e-commerce platform. Download the dataset below to get started.

retailmax_customers.csv

Customer transaction and behavior data including purchase history, product preferences, engagement metrics, and demographics.

100 customers · 14 features · clean dataset
Download CSV
Tip: Remember to scale your features using StandardScaler before applying K-Means or PCA, as these algorithms are sensitive to feature magnitudes.
04. Requirements

Your unsupervised_analysis.ipynb must implement ALL of the following components. Each section is mandatory and will be graded individually.

Part 1: Data Preparation (40 points)

1. Data Loading and Exploration

Load the dataset and perform initial exploration:

  • Check data shape, types, and missing values
  • Generate descriptive statistics for all features
  • Visualize feature distributions with histograms
  • Create correlation heatmap
def explore_data(df):
    """
    Perform comprehensive data exploration.
    Returns: summary statistics and visualizations
    """
    # Your implementation
    pass
2. Feature Scaling

Prepare features for clustering and PCA:

  • Select relevant features (exclude customer_id)
  • Apply StandardScaler to normalize features
  • Store the scaler for inverse transformations
from sklearn.preprocessing import StandardScaler

def prepare_features(df, feature_cols):
    """
    Scale features for unsupervised learning.
    Returns: scaled_data, scaler, feature_names
    """
    # Your implementation
    pass
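As a rough sketch of the expected shape for this step (one possible implementation, not the only acceptable one), the scaling function might look like this:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

def prepare_features(df, feature_cols):
    """Scale the selected feature columns to zero mean and unit variance."""
    scaler = StandardScaler()
    # Fit on the full feature matrix; keep the fitted scaler for later use
    scaled_data = scaler.fit_transform(df[feature_cols])
    return scaled_data, scaler, list(feature_cols)
```

Keeping the fitted scaler matters: `scaler.inverse_transform` lets you map cluster centroids back to the original units when profiling segments.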

Part 2: K-Means Clustering (80 points)

3. Elbow Method

Determine optimal number of clusters using the elbow method:

  • Test K values from 2 to 10
  • Calculate inertia (within-cluster sum of squares) for each K
  • Plot the elbow curve
  • Identify the "elbow" point
from sklearn.cluster import KMeans

def find_optimal_k_elbow(X, k_range=range(2, 11)):
    """
    Apply elbow method to find optimal K.
    Returns: inertias, optimal_k, elbow_plot
    """
    # Your implementation
    pass
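A minimal sketch of the elbow computation (plotting omitted; the "elbow" itself is usually read off the curve by eye, so this version returns only the raw inertias):

```python
from sklearn.cluster import KMeans

def find_optimal_k_elbow(X, k_range=range(2, 11)):
    """Fit K-Means for each candidate K and collect inertia (within-cluster SSE)."""
    inertias = []
    for k in k_range:
        # n_init=10 restarts avoid poor local optima; random_state for reproducibility
        km = KMeans(n_clusters=k, n_init=10, random_state=42)
        km.fit(X)
        inertias.append(km.inertia_)
    return list(k_range), inertias
```

Plotting `inertias` against `k_range` with matplotlib then gives the elbow curve the assignment asks for.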
4. Silhouette Analysis

Validate cluster quality using silhouette scores:

  • Calculate silhouette score for each K
  • Plot silhouette scores vs number of clusters
  • Create silhouette plots for the optimal K
  • Analyze cluster cohesion and separation
from sklearn.metrics import silhouette_score, silhouette_samples

def silhouette_analysis(X, k_range=range(2, 11)):
    """
    Perform silhouette analysis for cluster validation.
    Returns: silhouette_scores, best_k, silhouette_plot
    """
    # Your implementation
    pass
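The per-K scoring loop can be sketched as follows (the per-sample silhouette plot built from `silhouette_samples` is left out here):

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def silhouette_analysis(X, k_range=range(2, 11)):
    """Compute the mean silhouette score for each candidate K."""
    scores = {}
    for k in k_range:
        labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
        # Mean silhouette over all samples: higher means tighter, better-separated clusters
        scores[k] = silhouette_score(X, labels)
    best_k = max(scores, key=scores.get)
    return scores, best_k
```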
5. K-Means Clustering

Perform K-Means clustering with optimal K:

  • Fit KMeans with the selected number of clusters
  • Assign cluster labels to each customer
  • Get cluster centroids
  • Add cluster labels to original dataframe
def perform_kmeans(X, n_clusters, random_state=42):
    """
    Perform K-Means clustering.
    Returns: kmeans_model, labels, centroids
    """
    # Your implementation
    pass
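The final fit is short; one possible shape:

```python
from sklearn.cluster import KMeans

def perform_kmeans(X, n_clusters, random_state=42):
    """Fit K-Means and return the model, hard labels, and centroid coordinates."""
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state)
    labels = kmeans.fit_predict(X)  # one integer label per row of X
    return kmeans, labels, kmeans.cluster_centers_
```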
6. Cluster Profiling

Create detailed profiles for each cluster:

  • Calculate mean values for each feature per cluster
  • Count customers in each cluster
  • Identify distinguishing characteristics
  • Create cluster summary table
def profile_clusters(df, cluster_labels, feature_cols):
    """
    Generate cluster profiles with statistics.
    Returns: cluster_profiles_df, cluster_summary
    """
    # Your implementation
    pass
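A pandas `groupby` does most of the work here; a minimal sketch:

```python
import pandas as pd

def profile_clusters(df, cluster_labels, feature_cols):
    """Per-cluster feature means plus cluster sizes."""
    profiled = df[feature_cols].copy()
    profiled["cluster"] = cluster_labels
    # Mean of every feature within each cluster -> one row per cluster
    profiles = profiled.groupby("cluster")[feature_cols].mean()
    sizes = profiled["cluster"].value_counts().sort_index()
    return profiles, sizes
```

Note that means are most interpretable in original units, so profile on the unscaled dataframe rather than the scaled matrix.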
7. Cluster Visualization

Visualize clusters using multiple approaches:

  • Scatter plot of top 2 features colored by cluster
  • Radar/spider chart for cluster comparison
  • Box plots of key features by cluster
  • Cluster size distribution bar chart
def visualize_clusters(df, cluster_labels, feature_cols):
    """
    Create cluster visualizations.
    Saves plots to 'visualizations/' folder
    """
    # Your implementation
    pass

Part 3: PCA Dimensionality Reduction (80 points)

8. Apply PCA

Reduce dimensionality using Principal Component Analysis:

  • Fit PCA on scaled data
  • Transform data to principal components
  • Store component loadings
from sklearn.decomposition import PCA

def apply_pca(X, n_components=None):
    """
    Apply PCA for dimensionality reduction.
    Returns: pca_model, transformed_data, loadings
    """
    # Your implementation
    pass
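One possible sketch of the PCA step; "loadings" here means the transposed `components_` matrix, so rows correspond to original features:

```python
from sklearn.decomposition import PCA

def apply_pca(X, n_components=None):
    """Fit PCA on (already scaled) data; return model, scores, and loadings."""
    pca = PCA(n_components=n_components)
    transformed = pca.fit_transform(X)          # samples projected onto the PCs
    loadings = pca.components_.T                # rows = features, columns = PCs
    return pca, transformed, loadings
```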
9. Variance Analysis

Analyze explained variance:

  • Calculate explained variance ratio for each component
  • Calculate cumulative explained variance
  • Plot scree plot (variance by component)
  • Determine number of components for 80% and 95% variance
def analyze_variance(pca_model):
    """
    Analyze PCA explained variance.
    Returns: variance_df, scree_plot, n_components_80, n_components_95
    """
    # Your implementation
    pass
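The variance bookkeeping can be sketched like this (scree plot omitted; `np.searchsorted` finds the first component at which the cumulative curve crosses each threshold):

```python
import numpy as np

def analyze_variance(pca_model):
    """Explained and cumulative variance per component, plus 80%/95% thresholds."""
    ratios = pca_model.explained_variance_ratio_
    cumulative = np.cumsum(ratios)
    # +1 converts a 0-based index into a component count
    n_80 = int(np.searchsorted(cumulative, 0.80) + 1)
    n_95 = int(np.searchsorted(cumulative, 0.95) + 1)
    return ratios, cumulative, n_80, n_95
```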
10. Component Interpretation

Interpret principal components:

  • Extract component loadings (feature weights)
  • Identify top contributing features for each PC
  • Create loading heatmap visualization
  • Name/describe each principal component
def interpret_components(pca_model, feature_names, n_components=5):
    """
    Interpret principal components.
    Returns: loadings_df, component_descriptions
    """
    # Your implementation
    pass
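A sketch of the loadings table and top-feature extraction (naming the components remains a manual, judgment-based step):

```python
import pandas as pd

def interpret_components(pca_model, feature_names, n_components=5):
    """Loadings table plus the strongest-loading feature for each component."""
    n = min(n_components, pca_model.components_.shape[0])
    loadings = pd.DataFrame(
        pca_model.components_[:n].T,            # rows = features, columns = PCs
        index=feature_names,
        columns=[f"PC{i + 1}" for i in range(n)],
    )
    # Largest absolute loading per component, regardless of sign
    top_features = {col: loadings[col].abs().idxmax() for col in loadings.columns}
    return loadings, top_features
```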
11. 2D Visualization with PCA

Visualize data and clusters in 2D PCA space:

  • Project data onto first 2 principal components
  • Create scatter plot colored by cluster labels
  • Add cluster centroids to the plot
  • Include explained variance in axis labels
def visualize_pca_clusters(pca_data, cluster_labels, pca_model):
    """
    Visualize clusters in PCA space.
    Saves plot to 'visualizations/pca_clusters.png'
    """
    # Your implementation
    pass
12. Biplot Visualization

Create a biplot showing samples and feature vectors:

  • Plot samples in PC1-PC2 space
  • Overlay feature loading vectors as arrows
  • Label feature arrows
  • Interpret feature relationships
def create_biplot(pca_data, pca_model, feature_names, cluster_labels=None):
    """
    Create PCA biplot with feature vectors.
    Saves plot to 'visualizations/pca_biplot.png'
    """
    # Your implementation
    pass
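The biplot is the least obvious of the plots, so here is one hedged sketch. The `out_path` parameter is an addition for illustration only; the assignment expects the file at visualizations/pca_biplot.png. The key idea is rescaling the loading vectors so the arrows are visible against the sample cloud:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

def create_biplot(pca_data, pca_model, feature_names, cluster_labels=None,
                  out_path="pca_biplot.png"):
    """Samples in PC1-PC2 space with feature loading vectors overlaid as arrows."""
    fig, ax = plt.subplots(figsize=(8, 6))
    colors = cluster_labels if cluster_labels is not None else "steelblue"
    ax.scatter(pca_data[:, 0], pca_data[:, 1], c=colors, alpha=0.6)
    # Stretch each loading vector to roughly the extent of the scores
    scale = np.abs(pca_data[:, :2]).max()
    for i, name in enumerate(feature_names):
        x = pca_model.components_[0, i] * scale
        y = pca_model.components_[1, i] * scale
        ax.arrow(0, 0, x, y, color="crimson", head_width=scale * 0.02)
        ax.annotate(name, (x * 1.1, y * 1.1), color="crimson")
    var = pca_model.explained_variance_ratio_ * 100
    ax.set_xlabel(f"PC1 ({var[0]:.1f}% variance)")
    ax.set_ylabel(f"PC2 ({var[1]:.1f}% variance)")
    fig.savefig(out_path, dpi=150)
    plt.close(fig)
    return out_path
```

Arrows pointing in similar directions indicate positively correlated features; near-opposite arrows indicate negative correlation.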

Part 4: Business Insights (50 points)

13. Segment Naming and Description

Create business-friendly segment names and descriptions:

  • Assign descriptive names to each cluster (e.g., "High-Value Loyalists")
  • Write 2-3 sentence descriptions for each segment
  • Identify key characteristics that define each segment
14. Marketing Recommendations

Provide actionable recommendations for each segment:

  • Suggest marketing strategies per segment
  • Recommend product focus for each group
  • Propose retention or growth tactics
  • Estimate potential value of targeted campaigns
15. Summary Report

Create a final summary with:

  • Executive summary of findings
  • Key metrics table (cluster sizes, avg values)
  • Top 3 actionable insights
  • Limitations and next steps
05. Submission Instructions

Submit your completed assignment via GitHub by following these steps:

1. Create Jupyter Notebook

Create a single notebook called unsupervised_analysis.ipynb containing all requirements:

  • Organize with clear markdown headers for each part
  • Each function must have docstrings explaining inputs and outputs
  • Include markdown cells with analysis and interpretations
  • Run all cells top to bottom before submission
2. Save Visualizations

Export all plots to the visualizations/ folder:

  • elbow_plot.png
  • silhouette_plot.png
  • cluster_scatter.png
  • cluster_profiles.png
  • pca_variance.png
  • pca_clusters.png
  • pca_biplot.png
3. Create README

Create README.md that includes:

  • Your name and assignment title
  • Summary of customer segments discovered
  • Key PCA findings
  • Instructions to run your notebook
4. Create requirements.txt
numpy==1.24.0
pandas==2.0.0
scikit-learn==1.3.0
matplotlib==3.7.0
seaborn==0.12.0
5. Repository Structure

Your GitHub repository should look like this:

retailmax-customer-segmentation/
├── README.md
├── requirements.txt
├── unsupervised_analysis.ipynb
└── visualizations/
    ├── elbow_plot.png
    ├── silhouette_plot.png
    ├── cluster_scatter.png
    ├── cluster_profiles.png
    ├── pca_variance.png
    ├── pca_clusters.png
    └── pca_biplot.png
6. Submit via Form

Once your repository is ready:

  • Make sure your repository is public
  • Click the "Submit Assignment" button below
  • Fill in the submission form with your GitHub username
Important: Make sure all cells in your notebook run without errors and all visualizations are saved before submitting!
06. Grading Rubric

Your assignment will be graded on the following criteria:

Criteria                  Points   Description
Data Preparation          40       Data exploration, feature selection, proper scaling
Cluster Validation        40       Elbow method, silhouette analysis, optimal K selection
K-Means Implementation    40       Correct clustering, profiling, visualization
PCA Analysis              40       Variance analysis, component interpretation, loadings
PCA Visualization         40       2D cluster plot, biplot, proper labeling
Business Insights         30       Segment naming, marketing recommendations, summary
Code Quality              20       Docstrings, comments, clean organization
Total                     250

Ready to Submit?

Make sure you have completed all requirements and reviewed the grading rubric above.

Submit Your Assignment
07. What You Will Practice

Clustering (10.1)

K-Means algorithm, cluster validation with elbow and silhouette methods, customer segmentation

Dimensionality Reduction (10.2)

PCA for feature reduction, variance analysis, component interpretation, and visualization

Data Visualization

Scatter plots, biplots, heatmaps, and radar charts for cluster and component analysis

Business Insights

Translating technical results into actionable marketing strategies and recommendations

08. Pro Tips

Clustering Tips
  • Always scale features before K-Means
  • Use multiple methods to validate optimal K
  • Set random_state for reproducibility
  • Consider business context for K selection
PCA Tips
  • Standardize data before PCA
  • Check cumulative variance explained
  • Interpret loadings for component meaning
  • Use biplots for feature relationships
Visualization Tips
  • Use consistent colors across plots
  • Label axes with explained variance
  • Add legends for cluster identification
  • Save high-resolution images (dpi=300)
Common Mistakes
  • Forgetting to scale features
  • Including ID columns in clustering
  • Choosing K based only on elbow
  • Not interpreting results for business
09. Pre-Submission Checklist

Clustering Requirements
PCA Requirements