Project Overview
This capstone project focuses on Natural Language Processing (NLP) and text classification. You will work with a realistic product reviews dataset containing 120 customer reviews across 50 different products and multiple categories. Your goal is to build a sentiment analysis system that can automatically classify reviews as positive, negative, or neutral based on the text content.
Learning Objectives
Text Preprocessing Mastery
- Clean and normalize unstructured text data (lowercase, punctuation removal)
- Apply tokenization to break text into individual words
- Remove stopwords that don't contribute to sentiment (the, is, and, etc.)
- Apply lemmatization to reduce words to their base form (running → run)
Feature Engineering for NLP
- Understand TF-IDF vectorization for converting text to numerical features
- Configure optimal parameters (max_features, ngram_range, min_df)
- Experiment with unigrams vs bigrams vs trigrams
- Balance vocabulary size with model performance
Classification Techniques
- Train Logistic Regression for multi-class text classification
- Apply Naive Bayes algorithm optimized for text data
- Compare model performance using precision, recall, and F1-score
- Identify which words are most predictive of each sentiment class
Evaluation & Analysis
- Interpret confusion matrices for multi-class problems
- Understand when models confuse similar sentiments (neutral vs positive)
- Extract feature importance to find most influential words
- Build reusable prediction pipeline for new reviews
Real-World Application
Sentiment analysis powers customer feedback systems at Amazon, Yelp, TripAdvisor, and social media monitoring tools. Companies use it to track brand reputation, identify product issues early, and prioritize customer service responses. Your project demonstrates production-ready NLP skills.
- Text Preprocessing: Clean, tokenize, and normalize text data
- TF-IDF Vectorization: Convert text to numerical features
- Classification Model: Train and tune ML classifiers
- Evaluation: Measure accuracy, precision, and recall
Business Scenario
ReviewPulse Analytics
You have been hired as a Machine Learning Engineer at ReviewPulse Analytics, a company that helps e-commerce businesses understand customer feedback at scale. Currently, the company employs 15 human analysts who manually read and categorize 3,000-5,000 reviews daily. This process takes 4-6 hours per analyst and costs the company approximately $180,000 annually in labor. Additionally, manual classification suffers from inconsistency, with different analysts sometimes categorizing the same review differently.
The product team wants to automate sentiment classification to process reviews in real-time, reduce costs by 70%, and provide instant insights to e-commerce clients. Your AI-powered system would enable clients to track sentiment trends hourly instead of waiting days for manual reports.
"We receive thousands of product reviews daily and need an automated system to classify them as positive, negative, or neutral. This will help our clients quickly identify product issues and customer satisfaction trends. Can you build a reliable sentiment classification model that achieves at least 80% accuracy and processes reviews in under 100ms?"
The Business Challenge
ReviewPulse Analytics faces several critical challenges that NLP can address:
Scale & Speed
Manual review classification can't keep pace with incoming volume. During holiday seasons, review volume spikes 5x (15,000+ daily), creating a 3-4 day backlog. Clients need real-time sentiment dashboards, not delayed reports.
Consistency Issues
Human analysts show 15-20% disagreement on neutral reviews. One analyst might classify "It's okay" as neutral while another sees it as slightly negative. ML provides consistent classification regardless of workload or time.
Text Complexity
Reviews contain slang, sarcasm ("Yeah, just PERFECT... if you enjoy products that break in 2 days"), mixed sentiments ("Camera is great but battery life is terrible"), and domain-specific vocabulary that confuses simple keyword matching.
Business Questions to Answer
- What are the most common words in positive vs negative reviews?
- How does review length correlate with sentiment?
- Which categories have the most negative reviews?
- Which classifier performs best on this dataset?
- What is the optimal TF-IDF configuration?
- How well does the model generalize to new reviews?
- Which words are most predictive of positive sentiment?
- Which words are most predictive of negative sentiment?
- How do n-grams improve classification?
- What products generate the most negative feedback?
- Are verified purchases more positive or negative?
- What recommendations can improve product ratings?
The Dataset
You will work with a realistic product reviews dataset. Download the CSV file and place it in your project's data/ folder:
Dataset Schema
| Column | Type | Description |
|---|---|---|
| review_id | Integer | Unique review identifier |
| product_id | String | Unique product identifier (PROD-XXX) |
| product_name | String | Product name |
| category | String | Product category (Electronics, Kitchen, Health, etc.) |
| review_text | String | Full review text written by customer |
| rating | Integer | Star rating (1-5) |
| review_date | Date | Date of review (YYYY-MM-DD) |
| reviewer_name | String | Reviewer name |
| verified_purchase | Boolean | Whether purchase was verified |
| helpful_votes | Integer | Number of helpful votes |
| total_votes | Integer | Total votes on review |
Sentiment Labels
You will need to create sentiment labels from the rating column:
- Negative: Rating 1-2
- Neutral: Rating 3
- Positive: Rating 4-5
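The rating-to-sentiment mapping can be sketched in a few lines of pandas. The mini-frame below is hypothetical stand-in data; in the project you would load `data/product_reviews.csv` and apply the same function to its `rating` column.

```python
import pandas as pd

# Hypothetical mini-frame standing in for product_reviews.csv
df = pd.DataFrame({"rating": [1, 2, 3, 4, 5]})

def rating_to_sentiment(rating: int) -> str:
    """Map a 1-5 star rating to a sentiment label (1-2 Neg, 3 Neu, 4-5 Pos)."""
    if rating <= 2:
        return "Negative"
    if rating == 3:
        return "Neutral"
    return "Positive"

df["sentiment"] = df["rating"].apply(rating_to_sentiment)
print(df["sentiment"].tolist())
# → ['Negative', 'Negative', 'Neutral', 'Positive', 'Positive']
```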
Project Requirements
Your Jupyter Notebook must include all of the following components. Structure your notebook with clear markdown headers and explanations for each section.
Project Setup and Introduction
Title, your name, date, project overview, and business context. Import all required libraries.
Required Libraries
- Data handling: pandas, numpy
- Visualization: matplotlib, seaborn
- NLP tools: nltk (stopwords, WordNetLemmatizer), re (regex)
- Scikit-learn: TfidfVectorizer, train_test_split
- Models: LogisticRegression, MultinomialNB
- Evaluation: classification_report, confusion_matrix, accuracy_score
Note: Download NLTK stopwords and wordnet datasets before use.
Data Exploration and Label Creation
Comprehensive exploration of review data and sentiment patterns.
Distribution Analysis
- Rating Distribution: Count plot showing 1-5 star ratings
- Category Breakdown: Which product categories have most reviews?
- Sentiment Labels: Create Negative (1-2), Neutral (3), Positive (4-5)
- Class Balance: Check if dataset is balanced across sentiments
Text Characteristics
- Review Length: Calculate word count for each review
- Length by Sentiment: Do negative reviews tend to be longer?
- Common Words: Most frequent words across all reviews
- Verified Purchases: Sentiment distribution for verified vs unverified
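The review-length analysis above amounts to one derived column and a groupby. This sketch uses three hypothetical rows in place of the real CSV:

```python
import pandas as pd

# Hypothetical rows standing in for product_reviews.csv
df = pd.DataFrame({
    "review_text": ["Amazing product, love it",
                    "Broke after one day, terrible quality",
                    "It is okay I guess"],
    "sentiment": ["Positive", "Negative", "Neutral"],
})

# Word count per review, then average length by sentiment class
df["word_count"] = df["review_text"].str.split().str.len()
print(df.groupby("sentiment")["word_count"].mean())
```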
Text Preprocessing Pipeline
Build a multi-step preprocessing function to clean and normalize review text.
Lowercase Conversion
Convert all characters to lowercase so "Amazing" and "amazing" are treated as the same word. This reduces vocabulary size and improves matching.
Remove Special Characters & Punctuation
Use regex to remove punctuation, symbols, and optionally numbers. Keep only alphabetic characters. Example: "Great! Best product ever!!!" → "Great Best product ever"
Tokenization
Split text into individual words (tokens). "amazing product" → ["amazing", "product"]. This creates a list of words for further processing.
Stopword Removal
Remove common words like "the", "is", "and", "a" using NLTK's stopwords list. These don't contribute to sentiment. Keep negations like "not", "no", "never" as they flip sentiment.
Lemmatization
Reduce words to their base form using WordNetLemmatizer: "running" → "run" (with a verb POS tag), "better" → "good" (with an adjective POS tag). Note that the lemmatizer treats every word as a noun by default, so supply part-of-speech tags to get these reductions. Lemmatization groups inflected forms together and reduces vocabulary size.
Create a cleaned_text column containing the preprocessed text. Apply your preprocessing function to ALL reviews before splitting train/test data; this ensures consistency.
Original: "This product is AMAZING!!! Best purchase I've ever made. 10/10 would recommend!"
After preprocessing: "product amazing best purchase ever make would recommend"
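The five steps above can be sketched as one function. To keep this runnable without downloading NLTK corpora, the tiny stopword set and lemma table below are stand-ins for NLTK's `stopwords.words("english")` and `WordNetLemmatizer`:

```python
import re

# Stand-ins for NLTK's stopword list and WordNetLemmatizer (sketch only)
STOPWORDS = {"this", "is", "i", "ve", "the", "a", "and"}
NEGATIONS = {"not", "no", "never"}            # keep these: they flip sentiment
LEMMAS = {"made": "make", "running": "run"}   # toy lemma table

def preprocess(text: str) -> str:
    text = text.lower()                        # 1. lowercase
    text = re.sub(r"[^a-z\s]", " ", text)      # 2. strip punctuation/digits
    tokens = text.split()                      # 3. tokenize
    tokens = [t for t in tokens                # 4. stopwords (keep negations, 2+ chars)
              if (t not in STOPWORDS or t in NEGATIONS) and len(t) >= 2]
    tokens = [LEMMAS.get(t, t) for t in tokens]  # 5. lemmatize
    return " ".join(tokens)

print(preprocess("This product is AMAZING!!! Best purchase I've ever made. "
                 "10/10 would recommend!"))
# → product amazing best purchase ever make would recommend
```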
TF-IDF Vectorization
Convert preprocessed text into numerical feature vectors using Term Frequency-Inverse Document Frequency.
What is TF-IDF?
TF-IDF assigns weights to words based on two factors:
- Term Frequency (TF): How often a word appears in a document (review)
- Inverse Document Frequency (IDF): How rare a word is across all documents
Result: Common words like "product" get low scores. Distinctive words like "amazing" or "terrible" get high scores because they're more predictive of sentiment.
Key Parameters to Tune
- max_features: Vocabulary size (try 1000, 3000, 5000)
- ngram_range: (1,1) = unigrams, (1,2) = uni+bigrams
- min_df: Ignore rare words (appear in <2 documents)
- max_df: Ignore very common words (appear in >95% docs)
Expected Results
- Unigrams (1,1): Good baseline, 75-80% accuracy
- Bigrams (1,2): Captures phrases like "not good", 78-83%
- Higher features: More vocabulary = more features but may overfit
- Optimal: Usually 3000-5000 features with (1,2) ngrams
Use stratify=y in train_test_split to maintain the sentiment distribution in both sets. Fit the TF-IDF vectorizer ONLY on the training data, then transform both train and test sets to prevent data leakage.
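The stratified split and fit-on-train-only rule look like this in scikit-learn. The toy corpus is hypothetical; in the project you would pass your cleaned_text column and sentiment labels:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Toy corpus standing in for the cleaned_text column (hypothetical data)
texts = ["amazing product love", "terrible waste money", "okay works fine",
         "best purchase ever", "broke one day", "nothing special okay"] * 5
labels = ["Positive", "Negative", "Neutral"] * 10

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42)

# Fit ONLY on training text, then transform both splits (no leakage)
vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1, 2),
                             min_df=2, max_df=0.95, sublinear_tf=True)
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

print(X_train_tfidf.shape, X_test_tfidf.shape)  # same column count in both
```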
Model Training and Comparison
Train at least two different text classifiers and systematically compare their performance.
| Model | Best For | Key Parameters | Expected Accuracy |
|---|---|---|---|
| Multinomial Naive Bayes | Baseline for text, fast training | alpha: 0.1, 1.0, 10.0 | 75-82% |
| Logistic Regression | Strong performer, interpretable | C: 0.1, 1.0, 10.0; max_iter: 1000 | 80-88% |
| SVM (LinearSVC) | High accuracy, slower training | C: 0.1, 1.0, 10.0 | 82-90% (optional) |
Why Multiple Models?
- Naive Bayes: Assumes feature independence (word occurrences). Fast but less accurate.
- Logistic Regression: Models word importance with coefficients. Usually best performer for text.
- SVM: Finds optimal decision boundaries. Powerful but computationally expensive.
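A minimal comparison loop over the two required models might look like the sketch below, using a hypothetical toy corpus in place of your real TF-IDF splits:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Toy labeled corpus (hypothetical); use your cleaned reviews in the project
texts = ["amazing product love", "terrible waste money", "okay works fine",
         "best purchase ever", "broke one day", "nothing special okay"] * 5
labels = ["Positive", "Negative", "Neutral"] * 10

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42)
vec = TfidfVectorizer()
Xtr, Xte = vec.fit_transform(X_train), vec.transform(X_test)

# Fit each classifier on the same features and record test accuracy
results = {}
for name, model in [("MultinomialNB", MultinomialNB(alpha=1.0)),
                    ("LogisticRegression", LogisticRegression(C=1.0, max_iter=1000))]:
    model.fit(Xtr, y_train)
    results[name] = accuracy_score(y_test, model.predict(Xte))
print(results)
```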
Model Evaluation and Interpretation
Comprehensive evaluation using multiple metrics to understand model strengths and weaknesses.
Classification Report
- Precision: Of predicted positive, how many are actually positive? High precision = few false positives.
- Recall: Of actual positives, how many did we find? High recall = few false negatives.
- F1-Score: Harmonic mean of precision and recall. Best overall metric for imbalanced data.
- Support: Number of actual occurrences of each class in test set.
Confusion Matrix Analysis
- Diagonal values: Correct predictions (true positives for each class)
- Off-diagonal: Misclassifications showing which sentiments get confused
- Common pattern: Neutral reviews often confused with positive/negative
- Visualization: Use seaborn heatmap with annotations for clarity
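Both artifacts come straight from scikit-learn. The labels below are the hand-made toy predictions of a hypothetical model, chosen so the off-diagonal cells show the typical Neutral/Positive confusion:

```python
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical test labels and model predictions
y_true = ["Positive", "Positive", "Neutral", "Negative", "Neutral", "Negative"]
y_pred = ["Positive", "Neutral", "Neutral", "Negative", "Positive", "Negative"]
classes = ["Negative", "Neutral", "Positive"]

# Rows = actual class, columns = predicted class
cm = confusion_matrix(y_true, y_pred, labels=classes)
print(cm)
print(classification_report(y_true, y_pred, labels=classes))

# For the heatmap:
# sns.heatmap(cm, annot=True, fmt="d", cmap="Blues",
#             xticklabels=classes, yticklabels=classes)
```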
Feature Importance Analysis
Extract most predictive words from your best model:
- For Logistic Regression: Get coefficients for each class. Positive coefficients indicate words strongly associated with that sentiment.
- Top Positive Words: Expect "excellent", "amazing", "perfect", "love", "best"
- Top Negative Words: Expect "terrible", "worst", "waste", "poor", "disappointed"
- Visualization: Create bar charts showing top 10 words per sentiment class
Prediction Function
Create a reusable function that takes new review text and returns predicted sentiment with confidence score.
- Apply Preprocessing: Pass raw text through your preprocessing function (same steps used during training)
- Vectorize Text: Transform cleaned text using the fitted TF-IDF vectorizer from training
- Get Prediction: Use the trained model to predict sentiment class and probability scores
Input: "This product is amazing! Best purchase ever!"
Output: Positive (98% confidence)
Input: "Complete waste of money. Broke after one day."
Output: Negative (94% confidence)
Input: "It's okay, nothing special but works as expected."
Output: Neutral (72% confidence)
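A prediction function along these lines returns the class plus its probability as a confidence score. The toy training corpus is hypothetical; in the project the function would also call your preprocessing pipeline before vectorizing:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy training data; use your real fitted vectorizer and model
texts = ["amazing best love", "terrible waste broke", "okay fine average"] * 4
labels = ["Positive", "Negative", "Neutral"] * 4

vec = TfidfVectorizer()
clf = LogisticRegression(max_iter=1000).fit(vec.fit_transform(texts), labels)

def predict_sentiment(text: str):
    # In the full project, apply the same preprocessing used at training time first
    features = vec.transform([text])
    probs = clf.predict_proba(features)[0]
    best = probs.argmax()
    return clf.classes_[best], round(float(probs[best]), 2)

print(predict_sentiment("amazing product, best ever"))
```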
Insights and Recommendations
Summarize your findings with 5-7 key insights. Provide actionable recommendations for the business based on your analysis.
NLP Specifications
Follow these specifications for text processing and model building. These will help ensure consistent and high-quality results.
Preprocessing settings:
- Lowercase: Convert all text to lowercase
- Punctuation: Remove all punctuation marks
- Stopwords: Remove English stopwords using NLTK
- Lemmatization: Use WordNetLemmatizer
- Min Length: Keep words with 2+ characters

TF-IDF settings:
- max_features: Try 1000, 3000, 5000
- ngram_range: (1,1), (1,2), or (1,3)
- min_df: 2 (ignore rare terms)
- max_df: 0.95 (ignore very common terms)
- sublinear_tf: True (apply sublinear scaling)

Training setup:
- Train/Test Split: 80/20 with stratification
- Cross-Validation: 5-fold stratified CV
- Baseline Model: Multinomial Naive Bayes
- Primary Model: Logistic Regression
- Hyperparameter Tuning: GridSearchCV (optional)

Evaluation metrics:
- Accuracy: Overall correct predictions
- Precision: Per-class positive predictive value
- Recall: Per-class true positive rate
- F1-Score: Harmonic mean of precision and recall
- Confusion Matrix: Visualize all predictions
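The split and cross-validation settings above can be sketched with a scikit-learn Pipeline, which re-fits the vectorizer inside each fold so no test-fold vocabulary leaks into training. The toy corpus is hypothetical:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus; use your cleaned reviews in the project
texts = ["amazing product love", "terrible waste money", "okay works fine"] * 10
labels = ["Positive", "Negative", "Neutral"] * 10

# Pipeline refits TF-IDF per fold, preventing cross-validation leakage
pipe = make_pipeline(TfidfVectorizer(min_df=2, max_df=0.95, sublinear_tf=True),
                     MultinomialNB())
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipe, texts, labels, cv=cv, scoring="f1_macro")
print(scores.mean().round(3))
```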
Tip: Create a TextPreprocessor class that encapsulates all preprocessing steps. This makes your pipeline reusable and easier to maintain.
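A minimal sketch of such a class is below. The inline stopword set stands in for NLTK's list and lemmatization is omitted for brevity; the point is that the same object runs identical cleaning at training and prediction time:

```python
import re

class TextPreprocessor:
    """Bundles the cleaning steps so one pipeline runs at train and predict time."""

    NEGATIONS = {"not", "no", "never"}  # keep these: they flip sentiment

    def __init__(self, stopwords=None):
        # Inline default stands in for NLTK's stopword list in this sketch
        default = {"the", "is", "a", "an", "and", "it"}
        self.stopwords = (set(stopwords) if stopwords else default) - self.NEGATIONS

    def transform(self, text: str) -> str:
        text = re.sub(r"[^a-z\s]", " ", text.lower())  # lowercase + strip punctuation
        return " ".join(t for t in text.split()
                        if t not in self.stopwords and len(t) >= 2)

pre = TextPreprocessor()
print(pre.transform("It is NOT good, and the battery is bad!"))
# → not good battery bad
```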
Required Visualizations
Create at least 8 of the following visualizations. All charts should be clear, well-labeled, and professionally styled.
- Sentiment Distribution: Count of reviews per sentiment class
- Positive Review Words: Most common words in positive reviews
- Negative Review Words: Most common words in negative reviews
- Confusion Matrix: Visualize model predictions vs actual
- Review Length Distribution: Distribution of word counts by sentiment
- Top Predictive Words: Top 15 features by importance
- Sentiment by Category: Sentiment distribution across product categories
- Model Comparison: Accuracy comparison across classifiers
- ROC Curves: ROC-AUC for multi-class classification
Visualization Best Practices
Confusion Matrix
Create heatmap visualization to show classification performance:
- Seaborn heatmap with annotations showing counts
- Color scale: Blues or coolwarm to highlight patterns
- Labels: Predicted on x-axis, Actual on y-axis
- Interpretation: Diagonal shows correct predictions, off-diagonal shows confusion patterns
Word Clouds
Visualize most common words per sentiment:
- Separate clouds for positive, negative, neutral
- Size indicates frequency: Bigger words appear more often
- Color coding: Green for positive, red for negative, yellow for neutral
- Insight value: Quickly identify distinguishing vocabulary per sentiment
Visualization Tip
Create a side-by-side comparison of top 20 positive vs negative words using horizontal bar charts. This clearly shows which vocabulary drives sentiment classification and helps validate that your model learns meaningful patterns.
Submission Requirements
Create a public GitHub repository with the exact name shown below:
Required Repository Name
sentiment-analysis-project
Required Project Structure
Directory Layout
- data/ folder containing product_reviews.csv
- notebooks/ folder with sentiment_analysis.ipynb (your main notebook)
- requirements.txt at root level listing all dependencies
- README.md at root level with project documentation
README.md Must Include:
- Your full name and submission date
- Project overview and business context
- Model performance summary (accuracy, F1-score)
- Key findings (5-7 bullet points)
- Technologies used (Python, scikit-learn, NLTK, etc.)
- Instructions to run the notebook
- Screenshots of at least 3 visualizations
Required Python Libraries
Create a requirements.txt file with these dependencies (minimum versions):
| Library | Version | Purpose |
|---|---|---|
| pandas | 2.0.0+ | Data manipulation and analysis |
| numpy | 1.24.0+ | Numerical operations and arrays |
| scikit-learn | 1.3.0+ | TF-IDF vectorization, ML models, evaluation |
| matplotlib | 3.7.0+ | Static visualizations |
| seaborn | 0.12.0+ | Statistical visualizations (confusion matrix) |
| nltk | 3.8.0+ | Natural language processing tools |
| wordcloud | 1.9.0+ | Word cloud visualizations (optional) |
| jupyter | 1.0.0+ | Notebook environment |
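Assembled from the table above, the requirements.txt file would look like:

```text
pandas>=2.0.0
numpy>=1.24.0
scikit-learn>=1.3.0
matplotlib>=3.7.0
seaborn>=0.12.0
nltk>=3.8.0
wordcloud>=1.9.0
jupyter>=1.0.0
```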
Do Include
- Clear markdown sections with headers
- All code cells executed with outputs
- At least 8 visualizations
- Model comparison table
- Working prediction function
- Business insights and recommendations
- README with screenshots
Do Not Include
- Virtual environment folders (venv, .env)
- Any .pyc or __pycache__ files
- NLTK data folders (users will download)
- Unexecuted notebooks
- Hardcoded absolute file paths
- Pickle files of trained models
Enter your GitHub username - we will verify your repository automatically
Grading Rubric
Your project will be graded on the following criteria. Total: 550 points.
| Criteria | Points | Description |
|---|---|---|
| Text Preprocessing | 100 | Complete preprocessing pipeline with cleaning, tokenization, stopword removal, and lemmatization |
| TF-IDF Vectorization | 75 | Proper implementation of TF-IDF with appropriate parameters |
| Model Training | 100 | At least 2 classifiers trained and compared with cross-validation |
| Visualizations | 100 | At least 8 clear, well-labeled visualizations including word clouds and confusion matrix |
| Model Evaluation | 75 | Complete evaluation with classification report, feature importance, and error analysis |
| Code Quality | 50 | Clean, well-organized, reusable code with comments |
| Documentation | 50 | Clear markdown, README with screenshots, requirements.txt |
| Total | 550 | |
Ready to Submit?
Make sure you have completed all requirements and reviewed the grading rubric above.