Capstone Project 4

Sentiment Analysis

Build a complete NLP pipeline to classify product review sentiments. You will preprocess text data, extract features using TF-IDF vectorization, train classification models, and evaluate performance on realistic customer reviews.

10-15 hours
Intermediate
550 Points
What You Will Build
  • Text preprocessing pipeline
  • TF-IDF feature extraction
  • Sentiment classification model
  • Model evaluation metrics
  • Prediction interface
Contents
01

Project Overview

This capstone project focuses on Natural Language Processing (NLP) and text classification. You will work with a realistic product reviews dataset containing 120 customer reviews across 50 different products and multiple categories. Your goal is to build a sentiment analysis system that can automatically classify reviews as positive, negative, or neutral based on the text content.

Skills Applied: This project tests your proficiency in text preprocessing, tokenization, TF-IDF vectorization, and classification algorithms (Logistic Regression, Naive Bayes, or SVM).
Learning Objectives
Text Preprocessing Mastery
  • Clean and normalize unstructured text data (lowercase, punctuation removal)
  • Apply tokenization to break text into individual words
  • Remove stopwords that don't contribute to sentiment (the, is, and, etc.)
  • Apply lemmatization to reduce words to their base form (running → run)
Feature Engineering for NLP
  • Understand TF-IDF vectorization for converting text to numerical features
  • Configure optimal parameters (max_features, ngram_range, min_df)
  • Experiment with unigrams vs bigrams vs trigrams
  • Balance vocabulary size with model performance
Classification Techniques
  • Train Logistic Regression for multi-class text classification
  • Apply Naive Bayes algorithm optimized for text data
  • Compare model performance using precision, recall, and F1-score
  • Identify which words are most predictive of each sentiment class
Evaluation & Analysis
  • Interpret confusion matrices for multi-class problems
  • Understand when models confuse similar sentiments (neutral vs positive)
  • Extract feature importance to find most influential words
  • Build reusable prediction pipeline for new reviews
Real-World Application

Sentiment analysis powers customer feedback systems at Amazon, Yelp, TripAdvisor, and social media monitoring tools. Companies use it to track brand reputation, identify product issues early, and prioritize customer service responses. Your project demonstrates production-ready NLP skills.

Text Preprocessing

Clean, tokenize, and normalize text data

TF-IDF Vectorization

Convert text to numerical features

Classification Model

Train and tune ML classifiers

Evaluation

Measure accuracy, precision, and recall

02

Business Scenario

ReviewPulse Analytics

You have been hired as a Machine Learning Engineer at ReviewPulse Analytics, a company that helps e-commerce businesses understand customer feedback at scale. Currently, the company employs 15 human analysts who manually read and categorize 3,000-5,000 reviews daily. This process takes 4-6 hours per analyst and costs the company approximately $180,000 annually in labor. Additionally, manual classification suffers from inconsistency, with different analysts sometimes categorizing the same review differently.

The product team wants to automate sentiment classification to process reviews in real-time, reduce costs by 70%, and provide instant insights to e-commerce clients. Your AI-powered system would enable clients to track sentiment trends hourly instead of waiting days for manual reports.

"We receive thousands of product reviews daily and need an automated system to classify them as positive, negative, or neutral. This will help our clients quickly identify product issues and customer satisfaction trends. Can you build a reliable sentiment classification model that achieves at least 80% accuracy and processes reviews in under 100ms?"

Marcus Chen, Head of AI Products
The Business Challenge

ReviewPulse Analytics faces several critical challenges that NLP can address:

Scale & Speed

Manual review classification can't keep pace with incoming volume. During holiday seasons, review volume spikes 5x (15,000+ daily), creating a 3-4 day backlog. Clients need real-time sentiment dashboards, not delayed reports.

Consistency Issues

Human analysts show 15-20% disagreement on neutral reviews. One analyst might classify "It's okay" as neutral while another sees it as slightly negative. ML provides consistent classification regardless of workload or time.

Text Complexity

Reviews contain slang, sarcasm ("Yeah, just PERFECT... if you enjoy products that break in 2 days"), mixed sentiments ("Camera is great but battery life is terrible"), and domain-specific vocabulary that confuses simple keyword matching.

Business Questions to Answer

Text Analysis
  • What are the most common words in positive vs negative reviews?
  • How does review length correlate with sentiment?
  • Which categories have the most negative reviews?
Model Performance
  • Which classifier performs best on this dataset?
  • What is the optimal TF-IDF configuration?
  • How well does the model generalize to new reviews?
Feature Importance
  • Which words are most predictive of positive sentiment?
  • Which words are most predictive of negative sentiment?
  • How do n-grams improve classification?
Business Insights
  • What products generate the most negative feedback?
  • Are verified purchases more positive or negative?
  • What recommendations can improve product ratings?
Pro Tip: Focus on building a robust preprocessing pipeline first. Clean text data is crucial for good model performance!
03

The Dataset

You will work with a realistic product reviews dataset. Download the CSV file and place it in your project's data/ folder:

Dataset Schema
Column | Type | Description
review_id | Integer | Unique review identifier
product_id | String | Unique product identifier (PROD-XXX)
product_name | String | Product name
category | String | Product category (Electronics, Kitchen, Health, etc.)
review_text | String | Full review text written by the customer
rating | Integer | Star rating (1-5)
review_date | Date | Date of review (YYYY-MM-DD)
reviewer_name | String | Reviewer name
verified_purchase | Boolean | Whether the purchase was verified
helpful_votes | Integer | Number of helpful votes
total_votes | Integer | Total votes on the review
Dataset Stats: 120 reviews, 50 unique products, 9 categories, 8 months of data (Jan-Aug 2024)
Sentiment Labels

You will need to create sentiment labels from the rating column:

Negative

Rating 1-2

Neutral

Rating 3

Positive

Rating 4-5

04

Project Requirements

Your Jupyter Notebook must include all of the following components. Structure your notebook with clear markdown headers and explanations for each section.

1
Project Setup and Introduction

Title, your name, date, project overview, and business context. Import all required libraries.

Required Libraries
  • Data handling: pandas, numpy
  • Visualization: matplotlib, seaborn
  • NLP tools: nltk (stopwords, WordNetLemmatizer), re (regex)
  • Scikit-learn: TfidfVectorizer, train_test_split
  • Models: LogisticRegression, MultinomialNB
  • Evaluation: classification_report, confusion_matrix, accuracy_score

Note: Download NLTK stopwords and wordnet datasets before use.

2
Data Exploration and Label Creation

Comprehensive exploration of review data and sentiment patterns.

Distribution Analysis
  • Rating Distribution: Count plot showing 1-5 star ratings
  • Category Breakdown: Which product categories have most reviews?
  • Sentiment Labels: Create Negative (1-2), Neutral (3), Positive (4-5)
  • Class Balance: Check if dataset is balanced across sentiments
Text Characteristics
  • Review Length: Calculate word count for each review
  • Length by Sentiment: Do negative reviews tend to be longer?
  • Common Words: Most frequent words across all reviews
  • Verified Purchases: Sentiment distribution for verified vs unverified
Expected Patterns: Typically, positive reviews (4-5 stars) make up 60-70% of data, neutral ~15-20%, and negative ~15-20%. If severely imbalanced, consider techniques like SMOTE or class weights during training.
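The label creation step above is a simple mapping from the `rating` column. A minimal sketch (the toy `df` below stands in for the DataFrame you load from `product_reviews.csv`):

```python
import pandas as pd

def rating_to_sentiment(rating: int) -> str:
    """Map 1-5 star ratings to the three sentiment labels."""
    if rating <= 2:
        return "Negative"
    if rating == 3:
        return "Neutral"
    return "Positive"

# Toy stand-in for the real reviews DataFrame.
df = pd.DataFrame({"rating": [1, 2, 3, 4, 5]})
df["sentiment"] = df["rating"].apply(rating_to_sentiment)

# Check class balance before training.
print(df["sentiment"].value_counts())
```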
3
Text Preprocessing Pipeline

Build a multi-step preprocessing function to clean and normalize review text.

A
Lowercase Conversion

Convert all characters to lowercase so "Amazing" and "amazing" are treated as the same word. This reduces vocabulary size and improves matching.

B
Remove Special Characters & Punctuation

Use regex to remove punctuation, symbols, and optionally numbers. Keep only alphabetic characters. Example: "Great! Best product ever!!!" → "Great Best product ever"

C
Tokenization

Split text into individual words (tokens). "amazing product" → ["amazing", "product"]. This creates a list of words for further processing.

D
Stopword Removal

Remove common words like "the", "is", "and", "a" using NLTK's stopwords list. These don't contribute to sentiment. Keep negations like "not", "no", "never" as they flip sentiment.

E
Lemmatization

Reduce words to their base form using WordNetLemmatizer, e.g. "running" → "run". Note that the lemmatizer treats words as nouns by default; irregular and inflected forms such as "made" → "make" or "better" → "good" only reduce when you pass the part of speech (pos="v" for verbs, pos="a" for adjectives). Lemmatization groups related words together and reduces vocabulary size.

Important: Create a cleaned_text column with preprocessed text. Apply your preprocessing function to ALL reviews before splitting train/test data. This ensures consistency.
Before & After Example:
Original: "This product is AMAZING!!! Best purchase I've ever made. 10/10 would recommend!"
After preprocessing: "product amazing best purchase ever make would recommend"
4
TF-IDF Vectorization

Convert preprocessed text into numerical feature vectors using Term Frequency-Inverse Document Frequency.

What is TF-IDF?

TF-IDF assigns weights to words based on two factors:

  • Term Frequency (TF): How often a word appears in a document (review)
  • Inverse Document Frequency (IDF): How rare a word is across all documents

Result: Common words like "product" get low scores. Distinctive words like "amazing" or "terrible" get high scores because they're more predictive of sentiment.

Key Parameters to Tune
  • max_features: Vocabulary size (try 1000, 3000, 5000)
  • ngram_range: (1,1) = unigrams, (1,2) = uni+bigrams
  • min_df: Ignore rare words (e.g. min_df=2 drops words appearing in fewer than 2 documents)
  • max_df: Ignore very common words (appear in >95% docs)
Expected Results
  • Unigrams (1,1): Good baseline, 75-80% accuracy
  • Bigrams (1,2): Captures phrases like "not good", 78-83%
  • Higher features: More vocabulary = more features but may overfit
  • Optimal: Usually 3000-5000 features with (1,2) ngrams
Train-Test Split: Use 80-20 split with stratify=y to maintain sentiment distribution in both sets. Fit TF-IDF vectorizer ONLY on training data, then transform both train and test to prevent data leakage.
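The split-then-fit order in the note above matters: fitting the vectorizer on the full dataset would leak test vocabulary into training. A sketch with the specified parameters, using a small synthetic stand-in for the `cleaned_text` and `sentiment` columns:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Toy stand-in for the preprocessed reviews DataFrame.
df = pd.DataFrame({
    "cleaned_text": ["amazing product", "terrible waste", "works fine",
                     "love it", "broke fast", "okay product"] * 5,
    "sentiment": ["Positive", "Negative", "Neutral"] * 10,
})

# 80/20 split, stratified so both sets keep the sentiment distribution.
X_train, X_test, y_train, y_test = train_test_split(
    df["cleaned_text"], df["sentiment"],
    test_size=0.2, stratify=df["sentiment"], random_state=42,
)

vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1, 2),
                             min_df=2, max_df=0.95, sublinear_tf=True)
X_train_tfidf = vectorizer.fit_transform(X_train)  # fit on training data ONLY
X_test_tfidf = vectorizer.transform(X_test)        # reuse the fitted vocabulary
```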
5
Model Training and Comparison

Train at least two different text classifiers and systematically compare their performance.

Model | Best For | Key Parameters | Expected Accuracy
Multinomial Naive Bayes | Baseline for text, fast training | alpha: 0.1, 1.0, 10.0 | 75-82%
Logistic Regression | Strong performer, interpretable | C: 0.1, 1.0, 10.0; max_iter: 1000 | 80-88%
SVM (LinearSVC) | High accuracy, slower training | C: 0.1, 1.0, 10.0 | 82-90% (optional)
Why Multiple Models?
  • Naive Bayes: Assumes feature independence (word occurrences). Fast but less accurate.
  • Logistic Regression: Models word importance with coefficients. Usually best performer for text.
  • SVM: Finds optimal decision boundaries. Powerful but computationally expensive.
Use Cross-Validation: Apply 5-fold stratified CV to get reliable performance estimates. This prevents overfitting and gives confidence intervals for accuracy.
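The comparison loop with 5-fold stratified CV can be sketched as below. Wrapping the vectorizer and classifier in a pipeline refits TF-IDF inside each fold, which avoids leakage; the toy `texts`/`labels` stand in for your cleaned reviews:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy stand-ins for df["cleaned_text"] and df["sentiment"].
texts = ["amazing product love", "terrible waste money", "works fine okay"] * 10
labels = ["Positive", "Negative", "Neutral"] * 10

models = {
    "MultinomialNB": MultinomialNB(alpha=1.0),
    "LogisticRegression": LogisticRegression(C=1.0, max_iter=1000),
}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for name, model in models.items():
    # Pipeline refits TF-IDF per fold, so no test-fold vocabulary leaks in.
    pipe = make_pipeline(TfidfVectorizer(), model)
    scores = cross_val_score(pipe, texts, labels, cv=cv)
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```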
6
Model Evaluation and Interpretation

Comprehensive evaluation using multiple metrics to understand model strengths and weaknesses.

Classification Report
  • Precision: Of predicted positive, how many are actually positive? High precision = few false positives.
  • Recall: Of actual positives, how many did we find? High recall = few false negatives.
  • F1-Score: Harmonic mean of precision and recall. Best overall metric for imbalanced data.
  • Support: Number of actual occurrences of each class in test set.
Confusion Matrix Analysis
  • Diagonal values: Correct predictions (true positives for each class)
  • Off-diagonal: Misclassifications showing which sentiments get confused
  • Common pattern: Neutral reviews often confused with positive/negative
  • Visualization: Use seaborn heatmap with annotations for clarity
Feature Importance Analysis

Extract most predictive words from your best model:

  • For Logistic Regression: Get coefficients for each class. Positive coefficients indicate words strongly associated with that sentiment.
  • Top Positive Words: Expect "excellent", "amazing", "perfect", "love", "best"
  • Top Negative Words: Expect "terrible", "worst", "waste", "poor", "disappointed"
  • Visualization: Create bar charts showing top 10 words per sentiment class
Error Analysis: Examine misclassified reviews. Look for patterns - are sarcastic reviews misclassified? Reviews with mixed sentiments? This provides insights for model improvement.
7
Prediction Function

Create a reusable function that takes new review text and returns predicted sentiment with confidence score.

1
Apply Preprocessing

Pass raw text through your preprocessing function (same steps used during training)

2
Vectorize Text

Transform cleaned text using fitted TF-IDF vectorizer from training

3
Get Prediction

Use trained model to predict sentiment class and probability scores

Test Examples:
Input: "This product is amazing! Best purchase ever!"
Output: Positive (98% confidence)

Input: "Complete waste of money. Broke after one day."
Output: Negative (94% confidence)

Input: "It's okay, nothing special but works as expected."
Output: Neutral (72% confidence)
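A minimal sketch of the three-step prediction function. The `vectorizer` and `model` here are toy stand-ins trained inline; your notebook would reuse its fitted preprocessing function, TF-IDF vectorizer, and best classifier:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy stand-ins for the fitted artifacts from training.
texts = ["amazing best love perfect", "terrible waste broke poor",
         "okay average fine works"] * 5
labels = ["Positive", "Negative", "Neutral"] * 5
vectorizer = TfidfVectorizer().fit(texts)
model = LogisticRegression(max_iter=1000).fit(vectorizer.transform(texts), labels)

def predict_sentiment(raw_text: str):
    """Return (predicted label, confidence) for one raw review."""
    cleaned = raw_text.lower()             # step 1: stand-in for full preprocessing
    vec = vectorizer.transform([cleaned])  # step 2: fitted TF-IDF vectorizer
    probs = model.predict_proba(vec)[0]    # step 3: class probabilities
    best = probs.argmax()
    return model.classes_[best], float(probs[best])

label, conf = predict_sentiment("Amazing product, best purchase ever!")
print(label, round(conf, 2))
```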
8
Insights and Recommendations

Summarize your findings with 5-7 key insights. Provide actionable recommendations for the business based on your analysis.

05

NLP Specifications

Follow these specifications for text processing and model building. These will help ensure consistent and high-quality results.

Preprocessing Requirements
  • Lowercase: Convert all text to lowercase
  • Punctuation: Remove all punctuation marks
  • Stopwords: Remove English stopwords using NLTK
  • Lemmatization: Use WordNetLemmatizer
  • Min Length: Keep words with 2+ characters
TF-IDF Parameters
  • max_features: Try 1000, 3000, 5000
  • ngram_range: (1,1), (1,2), or (1,3)
  • min_df: 2 (ignore rare terms)
  • max_df: 0.95 (ignore very common terms)
  • sublinear_tf: True (apply sublinear scaling)
Model Requirements
  • Train/Test Split: 80/20 with stratification
  • Cross-Validation: 5-fold stratified CV
  • Baseline Model: Multinomial Naive Bayes
  • Primary Model: Logistic Regression
  • Hyperparameter Tuning: GridSearchCV (optional)
Evaluation Metrics
  • Accuracy: Overall correct predictions
  • Precision: Per-class positive predictive value
  • Recall: Per-class true positive rate
  • F1-Score: Harmonic mean of precision and recall
  • Confusion Matrix: Visualize all predictions
Code Tip: Create a TextPreprocessor class that encapsulates all preprocessing steps. This makes your pipeline reusable and easier to maintain.
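One possible shape for that class, a sketch covering lowercasing, punctuation removal, stopwords, and the 2-character minimum (lemmatization and NLTK stopwords would be plugged in as in the preprocessing section):

```python
import re

class TextPreprocessor:
    """Encapsulates cleaning steps so train, test, and new reviews
    all pass through identical preprocessing."""

    def __init__(self, stop_words=None, min_len=2):
        self.stop_words = set(stop_words or [])
        self.min_len = min_len

    def clean(self, text: str) -> str:
        text = re.sub(r"[^a-z\s]", " ", text.lower())
        tokens = [t for t in text.split()
                  if t not in self.stop_words and len(t) >= self.min_len]
        return " ".join(tokens)

    def transform(self, texts):
        return [self.clean(t) for t in texts]

pre = TextPreprocessor(stop_words={"the", "is", "a"})
print(pre.transform(["The product is GREAT!!!"]))  # ['product great']
```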
06

Required Visualizations

Create at least 8 of the following visualizations. All charts should be clear, well-labeled, and professionally styled.

1. Bar Chart
Sentiment Distribution

Count of reviews per sentiment class

2. Word Cloud
Positive Review Words

Most common words in positive reviews

3. Word Cloud
Negative Review Words

Most common words in negative reviews

4. Heatmap
Confusion Matrix

Visualize model predictions vs actual

5. Histogram
Review Length Distribution

Distribution of word counts by sentiment

6. Bar Chart
Top Predictive Words

Top 15 features by importance

7. Grouped Bar
Sentiment by Category

Sentiment distribution across product categories

8. Line Chart
Model Comparison

Accuracy comparison across classifiers

Bonus
ROC Curves

ROC-AUC for multi-class classification

Visualization Best Practices
Confusion Matrix

Create heatmap visualization to show classification performance:

  • Seaborn heatmap with annotations showing counts
  • Color scale: Blues or coolwarm to highlight patterns
  • Labels: Predicted on x-axis, Actual on y-axis
  • Interpretation: Diagonal shows correct predictions, off-diagonal shows confusion patterns
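The heatmap described above can be sketched as follows, with a hypothetical confusion matrix standing in for your model's output:

```python
import matplotlib
matplotlib.use("Agg")  # headless rendering for scripts/CI
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

# Hypothetical counts for the three sentiment classes.
classes = ["Negative", "Neutral", "Positive"]
cm = np.array([[4, 1, 0],
               [1, 2, 2],
               [0, 1, 13]])

fig, ax = plt.subplots(figsize=(5, 4))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues",
            xticklabels=classes, yticklabels=classes, ax=ax)
ax.set_xlabel("Predicted")   # predicted on x-axis
ax.set_ylabel("Actual")      # actual on y-axis
ax.set_title("Sentiment Confusion Matrix")
fig.tight_layout()
fig.savefig("confusion_matrix.png")
```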
Word Clouds

Visualize most common words per sentiment:

  • Separate clouds for positive, negative, neutral
  • Size indicates frequency: Bigger words appear more often
  • Color coding: Green for positive, red for negative, yellow for neutral
  • Insight value: Quickly identify distinguishing vocabulary per sentiment
Visualization Tip

Create a side-by-side comparison of top 20 positive vs negative words using horizontal bar charts. This clearly shows which vocabulary drives sentiment classification and helps validate that your model learns meaningful patterns.

07

Submission Requirements

Create a public GitHub repository with the exact name shown below:

Required Repository Name
sentiment-analysis-project
github.com/<your-username>/sentiment-analysis-project
Required Project Structure
Directory Layout
  • data/ folder containing product_reviews.csv
  • notebooks/ folder with sentiment_analysis.ipynb (your main notebook)
  • requirements.txt at root level listing all dependencies
  • README.md at root level with project documentation
README.md Must Include:
  • Your full name and submission date
  • Project overview and business context
  • Model performance summary (accuracy, F1-score)
  • Key findings (5-7 bullet points)
  • Technologies used (Python, scikit-learn, NLTK, etc.)
  • Instructions to run the notebook
  • Screenshots of at least 3 visualizations
Required Python Libraries

Create a requirements.txt file with these dependencies (minimum versions):

Library | Version | Purpose
pandas | 2.0.0+ | Data manipulation and analysis
numpy | 1.24.0+ | Numerical operations and arrays
scikit-learn | 1.3.0+ | TF-IDF vectorization, ML models, evaluation
matplotlib | 3.7.0+ | Static visualizations
seaborn | 0.12.0+ | Statistical visualizations (confusion matrix)
nltk | 3.8.0+ | Natural language processing tools
wordcloud | 1.9.0+ | Word cloud visualizations (optional)
jupyter | 1.0.0+ | Notebook environment
Do Include
  • Clear markdown sections with headers
  • All code cells executed with outputs
  • At least 8 visualizations
  • Model comparison table
  • Working prediction function
  • Business insights and recommendations
  • README with screenshots
Do Not Include
  • Virtual environment folders (venv, .env)
  • Any .pyc or __pycache__ files
  • NLTK data folders (users will download)
  • Unexecuted notebooks
  • Hardcoded absolute file paths
  • Pickle files of trained models
Important: Before submitting, run Kernel > Restart and Run All to ensure your notebook executes from top to bottom without errors!

08

Grading Rubric

Your project will be graded on the following criteria. Total: 550 points.

Criteria | Points | Description
Text Preprocessing | 100 | Complete preprocessing pipeline with cleaning, tokenization, stopword removal, and lemmatization
TF-IDF Vectorization | 75 | Proper implementation of TF-IDF with appropriate parameters
Model Training | 100 | At least 2 classifiers trained and compared with cross-validation
Visualizations | 100 | At least 8 clear, well-labeled visualizations including word clouds and confusion matrix
Model Evaluation | 75 | Complete evaluation with classification report, feature importance, and error analysis
Code Quality | 50 | Clean, well-organized, reusable code with comments
Documentation | 50 | Clear markdown, README with screenshots, requirements.txt
Total | 550 |

Ready to Submit?

Make sure you have completed all requirements and reviewed the grading rubric above.

09

Pre-Submission Checklist

Notebook Requirements
Repository Requirements