Project Overview
This capstone project focuses on Natural Language Processing (NLP) and text classification. You will work with a realistic product reviews dataset containing 120 customer reviews across 50 different products and multiple categories. Your goal is to build a sentiment analysis system that can automatically classify reviews as positive, negative, or neutral based on the text content.
Learning Objectives
Text Preprocessing Mastery
- Clean and normalize unstructured text data (lowercase, punctuation removal)
- Apply tokenization to break text into individual words
- Remove stopwords that don't contribute to sentiment (the, is, and, etc.)
- Apply lemmatization to reduce words to their base form (running → run)
Feature Engineering for NLP
- Understand TF-IDF vectorization for converting text to numerical features
- Configure optimal parameters (max_features, ngram_range, min_df)
- Experiment with unigrams vs bigrams vs trigrams
- Balance vocabulary size with model performance
Classification Techniques
- Train Logistic Regression for multi-class text classification
- Apply Naive Bayes algorithm optimized for text data
- Compare model performance using precision, recall, and F1-score
- Identify which words are most predictive of each sentiment class
Evaluation & Analysis
- Interpret confusion matrices for multi-class problems
- Understand when models confuse similar sentiments (neutral vs positive)
- Extract feature importance to find most influential words
- Build reusable prediction pipeline for new reviews
Real-World Application
Sentiment analysis powers customer feedback systems at Amazon, Yelp, TripAdvisor, and social media monitoring tools. Companies use it to track brand reputation, identify product issues early, and prioritize customer service responses. Your project demonstrates production-ready NLP skills.
- Text Preprocessing: Clean, tokenize, and normalize text data
- TF-IDF Vectorization: Convert text to numerical features
- Classification Model: Train and tune ML classifiers
- Evaluation: Measure accuracy, precision, and recall
Business Scenario
ReviewPulse Analytics
You have been hired as a Machine Learning Engineer at ReviewPulse Analytics, a company that helps e-commerce businesses understand customer feedback at scale. Currently, the company employs 15 human analysts who manually read and categorize 3,000-5,000 reviews daily. This process takes 4-6 hours per analyst and costs the company approximately $180,000 annually in labor. Additionally, manual classification suffers from inconsistency, with different analysts sometimes categorizing the same review differently.
The product team wants to automate sentiment classification to process reviews in real-time, reduce costs by 70%, and provide instant insights to e-commerce clients. Your AI-powered system would enable clients to track sentiment trends hourly instead of waiting days for manual reports.
"We receive thousands of product reviews daily and need an automated system to classify them as positive, negative, or neutral. This will help our clients quickly identify product issues and customer satisfaction trends. Can you build a reliable sentiment classification model that achieves at least 80% accuracy and processes reviews in under 100ms?"
The Business Challenge
ReviewPulse Analytics faces several critical challenges that NLP can address:
Scale & Speed
Manual review classification can't keep pace with incoming volume. During holiday seasons, review volume spikes 5x (15,000+ daily), creating a 3-4 day backlog. Clients need real-time sentiment dashboards, not delayed reports.
Consistency Issues
Human analysts show 15-20% disagreement on neutral reviews. One analyst might classify "It's okay" as neutral while another sees it as slightly negative. ML provides consistent classification regardless of workload or time.
Text Complexity
Reviews contain slang, sarcasm ("Yeah, just PERFECT... if you enjoy products that break in 2 days"), mixed sentiments ("Camera is great but battery life is terrible"), and domain-specific vocabulary that confuses simple keyword matching.
Business Questions to Answer
- What are the most common words in positive vs negative reviews?
- How does review length correlate with sentiment?
- Which categories have the most negative reviews?
- Which classifier performs best on this dataset?
- What is the optimal TF-IDF configuration?
- How well does the model generalize to new reviews?
- Which words are most predictive of positive sentiment?
- Which words are most predictive of negative sentiment?
- How do n-grams improve classification?
- What products generate the most negative feedback?
- Are verified purchases more positive or negative?
- What recommendations can improve product ratings?
The Dataset
You will work with a realistic product reviews dataset. Download the CSV file and place it in your project's data/ folder:
Dataset Schema
| Column | Type | Description |
|---|---|---|
| review_id | Integer | Unique review identifier |
| product_id | String | Unique product identifier (PROD-XXX) |
| product_name | String | Product name |
| category | String | Product category (Electronics, Kitchen, Health, etc.) |
| review_text | String | Full review text written by customer |
| rating | Integer | Star rating (1-5) |
| review_date | Date | Date of review (YYYY-MM-DD) |
| reviewer_name | String | Reviewer name |
| verified_purchase | Boolean | Whether purchase was verified |
| helpful_votes | Integer | Number of helpful votes |
| total_votes | Integer | Total votes on review |
Sentiment Labels
You will need to create sentiment labels from the rating column:
- Negative: Rating 1-2
- Neutral: Rating 3
- Positive: Rating 4-5
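The rating-to-sentiment mapping can be sketched in a few lines of pandas. The mini-frame below is hypothetical stand-in data; in the project you would load `data/product_reviews.csv` and apply the same function to its `rating` column.

```python
import pandas as pd

# Hypothetical mini-frame standing in for product_reviews.csv
df = pd.DataFrame({"rating": [1, 2, 3, 4, 5]})

def rating_to_sentiment(rating: int) -> str:
    """Map a 1-5 star rating to a sentiment label (1-2 Neg, 3 Neu, 4-5 Pos)."""
    if rating <= 2:
        return "Negative"
    if rating == 3:
        return "Neutral"
    return "Positive"

df["sentiment"] = df["rating"].apply(rating_to_sentiment)
print(df["sentiment"].tolist())
# → ['Negative', 'Negative', 'Neutral', 'Positive', 'Positive']
```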
Project Requirements
Your Jupyter Notebook must include all of the following components. Structure your notebook with clear markdown headers and explanations for each section.
Project Setup and Introduction
Title, your name, date, project overview, and business context. Import all required libraries.
Required Libraries
- Data handling: pandas, numpy
- Visualization: matplotlib, seaborn
- NLP tools: nltk (stopwords, WordNetLemmatizer), re (regex)
- Scikit-learn: TfidfVectorizer, train_test_split
- Models: LogisticRegression, MultinomialNB
- Evaluation: classification_report, confusion_matrix, accuracy_score
Note: Download NLTK stopwords and wordnet datasets before use.
Data Exploration and Label Creation
Comprehensive exploration of review data and sentiment patterns.
Distribution Analysis
- Rating Distribution: Count plot showing 1-5 star ratings
- Category Breakdown: Which product categories have most reviews?
- Sentiment Labels: Create Negative (1-2), Neutral (3), Positive (4-5)
- Class Balance: Check if dataset is balanced across sentiments
Text Characteristics
- Review Length: Calculate word count for each review
- Length by Sentiment: Do negative reviews tend to be longer?
- Common Words: Most frequent words across all reviews
- Verified Purchases: Sentiment distribution for verified vs unverified
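The review-length analysis above amounts to one derived column and a groupby. This sketch uses three hypothetical rows in place of the real CSV:

```python
import pandas as pd

# Hypothetical rows standing in for product_reviews.csv
df = pd.DataFrame({
    "review_text": ["Amazing product, love it",
                    "Broke after one day, terrible quality",
                    "It is okay I guess"],
    "sentiment": ["Positive", "Negative", "Neutral"],
})

# Word count per review, then average length by sentiment class
df["word_count"] = df["review_text"].str.split().str.len()
print(df.groupby("sentiment")["word_count"].mean())
```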
Text Preprocessing Pipeline
Build a multi-step preprocessing function to clean and normalize review text.
Lowercase Conversion
Convert all characters to lowercase so "Amazing" and "amazing" are treated as the same word. This reduces vocabulary size and improves matching.
Remove Special Characters & Punctuation
Use regex to remove punctuation, symbols, and optionally numbers. Keep only alphabetic characters. Example: "Great! Best product ever!!!" → "Great Best product ever"
Tokenization
Split text into individual words (tokens). "amazing product" → ["amazing", "product"]. This creates a list of words for further processing.
Stopword Removal
Remove common words like "the", "is", "and", "a" using NLTK's stopwords list. These don't contribute to sentiment. Keep negations like "not", "no", "never" as they flip sentiment.
Lemmatization
Reduce words to their base form using WordNetLemmatizer: "running" → "run" (with a verb POS tag), "better" → "good" (with an adjective POS tag). Note that the lemmatizer treats every word as a noun by default, so supply part-of-speech tags to get these reductions. Lemmatization groups inflected forms together and reduces vocabulary size.
Create a cleaned_text column containing the preprocessed text. Apply your preprocessing function to ALL reviews before splitting train/test data; this ensures consistency.
Original: "This product is AMAZING!!! Best purchase I've ever made. 10/10 would recommend!"
After preprocessing: "product amazing best purchase ever make would recommend"
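The five steps above can be sketched as one function. To keep this runnable without downloading NLTK corpora, the tiny stopword set and lemma table below are stand-ins for NLTK's `stopwords.words("english")` and `WordNetLemmatizer`:

```python
import re

# Stand-ins for NLTK's stopword list and WordNetLemmatizer (sketch only)
STOPWORDS = {"this", "is", "i", "ve", "the", "a", "and"}
NEGATIONS = {"not", "no", "never"}            # keep these: they flip sentiment
LEMMAS = {"made": "make", "running": "run"}   # toy lemma table

def preprocess(text: str) -> str:
    text = text.lower()                        # 1. lowercase
    text = re.sub(r"[^a-z\s]", " ", text)      # 2. strip punctuation/digits
    tokens = text.split()                      # 3. tokenize
    tokens = [t for t in tokens                # 4. stopwords (keep negations, 2+ chars)
              if (t not in STOPWORDS or t in NEGATIONS) and len(t) >= 2]
    tokens = [LEMMAS.get(t, t) for t in tokens]  # 5. lemmatize
    return " ".join(tokens)

print(preprocess("This product is AMAZING!!! Best purchase I've ever made. "
                 "10/10 would recommend!"))
# → product amazing best purchase ever make would recommend
```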
TF-IDF Vectorization
Convert preprocessed text into numerical feature vectors using Term Frequency-Inverse Document Frequency.
What is TF-IDF?
TF-IDF assigns weights to words based on two factors:
- Term Frequency (TF): How often a word appears in a document (review)
- Inverse Document Frequency (IDF): How rare a word is across all documents
Result: Common words like "product" get low scores. Distinctive words like "amazing" or "terrible" get high scores because they're more predictive of sentiment.
Key Parameters to Tune
- max_features: Vocabulary size (try 1000, 3000, 5000)
- ngram_range: (1,1) = unigrams, (1,2) = uni+bigrams
- min_df: Ignore rare words (appear in <2 documents)
- max_df: Ignore very common words (appear in >95% docs)
Expected Results
- Unigrams (1,1): Good baseline, 75-80% accuracy
- Bigrams (1,2): Captures phrases like "not good", 78-83%
- Higher features: More vocabulary = more features but may overfit
- Optimal: Usually 3000-5000 features with (1,2) ngrams
Use stratify=y in train_test_split to maintain the sentiment distribution in both sets. Fit the TF-IDF vectorizer ONLY on the training data, then transform both train and test sets to prevent data leakage.
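The stratified split and fit-on-train-only rule look like this in scikit-learn. The toy corpus is hypothetical; in the project you would pass your cleaned_text column and sentiment labels:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Toy corpus standing in for the cleaned_text column (hypothetical data)
texts = ["amazing product love", "terrible waste money", "okay works fine",
         "best purchase ever", "broke one day", "nothing special okay"] * 5
labels = ["Positive", "Negative", "Neutral"] * 10

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42)

# Fit ONLY on training text, then transform both splits (no leakage)
vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1, 2),
                             min_df=2, max_df=0.95, sublinear_tf=True)
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

print(X_train_tfidf.shape, X_test_tfidf.shape)  # same column count in both
```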
Model Training and Comparison
Train at least two different text classifiers and systematically compare their performance.
| Model | Best For | Key Parameters | Expected Accuracy |
|---|---|---|---|
| Multinomial Naive Bayes | Baseline for text, fast training | alpha: 0.1, 1.0, 10.0 | 75-82% |
| Logistic Regression | Strong performer, interpretable | C: 0.1, 1.0, 10.0; max_iter: 1000 | 80-88% |
| SVM (LinearSVC) | High accuracy, slower training | C: 0.1, 1.0, 10.0 | 82-90% (optional) |
Why Multiple Models?
- Naive Bayes: Assumes feature independence (word occurrences). Fast but less accurate.
- Logistic Regression: Models word importance with coefficients. Usually best performer for text.
- SVM: Finds optimal decision boundaries. Powerful but computationally expensive.
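A minimal comparison loop over the two required models might look like the sketch below, using a hypothetical toy corpus in place of your real TF-IDF splits:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Toy labeled corpus (hypothetical); use your cleaned reviews in the project
texts = ["amazing product love", "terrible waste money", "okay works fine",
         "best purchase ever", "broke one day", "nothing special okay"] * 5
labels = ["Positive", "Negative", "Neutral"] * 10

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42)
vec = TfidfVectorizer()
Xtr, Xte = vec.fit_transform(X_train), vec.transform(X_test)

# Fit each classifier on the same features and record test accuracy
results = {}
for name, model in [("MultinomialNB", MultinomialNB(alpha=1.0)),
                    ("LogisticRegression", LogisticRegression(C=1.0, max_iter=1000))]:
    model.fit(Xtr, y_train)
    results[name] = accuracy_score(y_test, model.predict(Xte))
print(results)
```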
Model Evaluation and Interpretation
Comprehensive evaluation using multiple metrics to understand model strengths and weaknesses.
Classification Report
- Precision: Of predicted positive, how many are actually positive? High precision = few false positives.
- Recall: Of actual positives, how many did we find? High recall = few false negatives.
- F1-Score: Harmonic mean of precision and recall. Best overall metric for imbalanced data.
- Support: Number of actual occurrences of each class in test set.
Confusion Matrix Analysis
- Diagonal values: Correct predictions (true positives for each class)
- Off-diagonal: Misclassifications showing which sentiments get confused
- Common pattern: Neutral reviews often confused with positive/negative
- Visualization: Use seaborn heatmap with annotations for clarity
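Both artifacts come straight from scikit-learn. The labels below are the hand-made toy predictions of a hypothetical model, chosen so the off-diagonal cells show the typical Neutral/Positive confusion:

```python
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical test labels and model predictions
y_true = ["Positive", "Positive", "Neutral", "Negative", "Neutral", "Negative"]
y_pred = ["Positive", "Neutral", "Neutral", "Negative", "Positive", "Negative"]
classes = ["Negative", "Neutral", "Positive"]

# Rows = actual class, columns = predicted class
cm = confusion_matrix(y_true, y_pred, labels=classes)
print(cm)
print(classification_report(y_true, y_pred, labels=classes))

# For the heatmap:
# sns.heatmap(cm, annot=True, fmt="d", cmap="Blues",
#             xticklabels=classes, yticklabels=classes)
```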
Feature Importance Analysis
Extract most predictive words from your best model:
- For Logistic Regression: Get coefficients for each class. Positive coefficients indicate words strongly associated with that sentiment.
- Top Positive Words: Expect "excellent", "amazing", "perfect", "love", "best"
- Top Negative Words: Expect "terrible", "worst", "waste", "poor", "disappointed"
- Visualization: Create bar charts showing top 10 words per sentiment class
Prediction Function
Create a reusable function that takes new review text and returns predicted sentiment with confidence score.
- Apply Preprocessing: Pass raw text through your preprocessing function (same steps used during training)
- Vectorize Text: Transform cleaned text using the fitted TF-IDF vectorizer from training
- Get Prediction: Use the trained model to predict sentiment class and probability scores
Input: "This product is amazing! Best purchase ever!"
Output: Positive (98% confidence)
Input: "Complete waste of money. Broke after one day."
Output: Negative (94% confidence)
Input: "It's okay, nothing special but works as expected."
Output: Neutral (72% confidence)
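A prediction function along these lines returns the class plus its probability as a confidence score. The toy training corpus is hypothetical; in the project the function would also call your preprocessing pipeline before vectorizing:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy training data; use your real fitted vectorizer and model
texts = ["amazing best love", "terrible waste broke", "okay fine average"] * 4
labels = ["Positive", "Negative", "Neutral"] * 4

vec = TfidfVectorizer()
clf = LogisticRegression(max_iter=1000).fit(vec.fit_transform(texts), labels)

def predict_sentiment(text: str):
    # In the full project, apply the same preprocessing used at training time first
    features = vec.transform([text])
    probs = clf.predict_proba(features)[0]
    best = probs.argmax()
    return clf.classes_[best], round(float(probs[best]), 2)

print(predict_sentiment("amazing product, best ever"))
```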
Insights and Recommendations
Summarize your findings with 5-7 key insights. Provide actionable recommendations for the business based on your analysis.
NLP Specifications
Follow these specifications for text processing and model building. These will help ensure consistent and high-quality results.
Preprocessing settings:
- Lowercase: Convert all text to lowercase
- Punctuation: Remove all punctuation marks
- Stopwords: Remove English stopwords using NLTK
- Lemmatization: Use WordNetLemmatizer
- Min Length: Keep words with 2+ characters

TF-IDF settings:
- max_features: Try 1000, 3000, 5000
- ngram_range: (1,1), (1,2), or (1,3)
- min_df: 2 (ignore rare terms)
- max_df: 0.95 (ignore very common terms)
- sublinear_tf: True (apply sublinear scaling)

Training setup:
- Train/Test Split: 80/20 with stratification
- Cross-Validation: 5-fold stratified CV
- Baseline Model: Multinomial Naive Bayes
- Primary Model: Logistic Regression
- Hyperparameter Tuning: GridSearchCV (optional)

Evaluation metrics:
- Accuracy: Overall correct predictions
- Precision: Per-class positive predictive value
- Recall: Per-class true positive rate
- F1-Score: Harmonic mean of precision and recall
- Confusion Matrix: Visualize all predictions
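The split and cross-validation settings above can be sketched with a scikit-learn Pipeline, which re-fits the vectorizer inside each fold so no test-fold vocabulary leaks into training. The toy corpus is hypothetical:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus; use your cleaned reviews in the project
texts = ["amazing product love", "terrible waste money", "okay works fine"] * 10
labels = ["Positive", "Negative", "Neutral"] * 10

# Pipeline refits TF-IDF per fold, preventing cross-validation leakage
pipe = make_pipeline(TfidfVectorizer(min_df=2, max_df=0.95, sublinear_tf=True),
                     MultinomialNB())
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipe, texts, labels, cv=cv, scoring="f1_macro")
print(scores.mean().round(3))
```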
Tip: Create a TextPreprocessor class that encapsulates all preprocessing steps. This makes your pipeline reusable and easier to maintain.
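A minimal sketch of such a class is below. The inline stopword set stands in for NLTK's list and lemmatization is omitted for brevity; the point is that the same object runs identical cleaning at training and prediction time:

```python
import re

class TextPreprocessor:
    """Bundles the cleaning steps so one pipeline runs at train and predict time."""

    NEGATIONS = {"not", "no", "never"}  # keep these: they flip sentiment

    def __init__(self, stopwords=None):
        # Inline default stands in for NLTK's stopword list in this sketch
        default = {"the", "is", "a", "an", "and", "it"}
        self.stopwords = (set(stopwords) if stopwords else default) - self.NEGATIONS

    def transform(self, text: str) -> str:
        text = re.sub(r"[^a-z\s]", " ", text.lower())  # lowercase + strip punctuation
        return " ".join(t for t in text.split()
                        if t not in self.stopwords and len(t) >= 2)

pre = TextPreprocessor()
print(pre.transform("It is NOT good, and the battery is bad!"))
# → not good battery bad
```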
Required Visualizations
Create at least 8 of the following visualizations. All charts should be clear, well-labeled, and professionally styled.
- Sentiment Distribution: Count of reviews per sentiment class
- Positive Review Words: Most common words in positive reviews
- Negative Review Words: Most common words in negative reviews
- Confusion Matrix: Visualize model predictions vs actual
- Review Length Distribution: Distribution of word counts by sentiment
- Top Predictive Words: Top 15 features by importance
- Sentiment by Category: Sentiment distribution across product categories
- Model Comparison: Accuracy comparison across classifiers
- ROC Curves: ROC-AUC for multi-class classification
Visualization Best Practices
Confusion Matrix
Create heatmap visualization to show classification performance:
- Seaborn heatmap with annotations showing counts
- Color scale: Blues or coolwarm to highlight patterns
- Labels: Predicted on x-axis, Actual on y-axis
- Interpretation: Diagonal shows correct predictions, off-diagonal shows confusion patterns
Word Clouds
Visualize most common words per sentiment:
- Separate clouds for positive, negative, neutral
- Size indicates frequency: Bigger words appear more often
- Color coding: Green for positive, red for negative, yellow for neutral
- Insight value: Quickly identify distinguishing vocabulary per sentiment
Visualization Tip
Create a side-by-side comparison of top 20 positive vs negative words using horizontal bar charts. This clearly shows which vocabulary drives sentiment classification and helps validate that your model learns meaningful patterns.
Submission Requirements
Create a public GitHub repository with the exact name shown below:
Required Repository Name
sentiment-analysis-project
Required Project Structure
Directory Layout
- data/ folder containing product_reviews.csv
- notebooks/ folder with sentiment_analysis.ipynb (your main notebook)
- requirements.txt at root level listing all dependencies
- README.md at root level with project documentation
README.md Must Include:
- Your full name and submission date
- Project overview and business context
- Model performance summary (accuracy, F1-score)
- Key findings (5-7 bullet points)
- Technologies used (Python, scikit-learn, NLTK, etc.)
- Instructions to run the notebook
- Screenshots of at least 3 visualizations
Required Python Libraries
Create a requirements.txt file with these dependencies (minimum versions):
| Library | Version | Purpose |
|---|---|---|
| pandas | 2.0.0+ | Data manipulation and analysis |
| numpy | 1.24.0+ | Numerical operations and arrays |
| scikit-learn | 1.3.0+ | TF-IDF vectorization, ML models, evaluation |
| matplotlib | 3.7.0+ | Static visualizations |
| seaborn | 0.12.0+ | Statistical visualizations (confusion matrix) |
| nltk | 3.8.0+ | Natural language processing tools |
| wordcloud | 1.9.0+ | Word cloud visualizations (optional) |
| jupyter | 1.0.0+ | Notebook environment |
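Assembled from the table above, the requirements.txt file would look like:

```text
pandas>=2.0.0
numpy>=1.24.0
scikit-learn>=1.3.0
matplotlib>=3.7.0
seaborn>=0.12.0
nltk>=3.8.0
wordcloud>=1.9.0
jupyter>=1.0.0
```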
Do Include
- Clear markdown sections with headers
- All code cells executed with outputs
- At least 8 visualizations
- Model comparison table
- Working prediction function
- Business insights and recommendations
- README with screenshots
Do Not Include
- Virtual environment folders (venv, .env)
- Any .pyc or __pycache__ files
- NLTK data folders (users will download)
- Unexecuted notebooks
- Hardcoded absolute file paths
- Pickle files of trained models
Enter your GitHub username - we will verify your repository automatically
Grading Rubric
Your project will be graded on the following criteria. Total: 550 points.
| Criteria | Points | Description |
|---|---|---|
| Text Preprocessing | 100 | Complete preprocessing pipeline with cleaning, tokenization, stopword removal, and lemmatization |
| TF-IDF Vectorization | 75 | Proper implementation of TF-IDF with appropriate parameters |
| Model Training | 100 | At least 2 classifiers trained and compared with cross-validation |
| Visualizations | 100 | At least 8 clear, well-labeled visualizations including word clouds and confusion matrix |
| Model Evaluation | 75 | Complete evaluation with classification report, feature importance, and error analysis |
| Code Quality | 50 | Clean, well-organized, reusable code with comments |
| Documentation | 50 | Clear markdown, README with screenshots, requirements.txt |
| Total | 550 | |
Ready to Submit?
Make sure you have completed all requirements and reviewed the grading rubric above.