Capstone Project 3

Sentiment Analysis

Build production-ready NLP models for sentiment classification. You will implement text preprocessing pipelines, train LSTM networks from scratch, and fine-tune state-of-the-art BERT transformer models using Hugging Face to classify customer reviews as positive, negative, or neutral.

12-16 hours
Advanced
600 Points
What You Will Build
  • Text preprocessing pipeline
  • Word embeddings and tokenization
  • LSTM sentiment classifier
  • Fine-tuned BERT model
  • Model comparison analysis
Contents
01

Project Overview

Sentiment analysis is one of the most impactful NLP applications in industry, powering customer feedback systems, social media monitoring, and brand reputation management. In this project, you will work with the Sentiment140 dataset containing 1.6 million tweets labeled as positive or negative. You will build two models: a traditional LSTM neural network and a state-of-the-art BERT transformer, comparing their performance on text classification. Target accuracy: over 85% with LSTM and over 90% with BERT.

Skills Applied: This project tests your proficiency in text preprocessing (cleaning, tokenization), word embeddings (Word2Vec, GloVe), LSTM architecture design, transformer models (BERT), and the Hugging Face ecosystem for NLP.
Preprocessing

Clean text, remove noise, handle emojis and special characters

Tokenization

Convert text to sequences, build vocabulary, pad sequences

LSTM Network

Build recurrent architecture with embeddings and dropout

BERT Fine-tuning

Adapt pre-trained transformers for classification

Learning Objectives

Technical Skills
  • Implement comprehensive text cleaning pipelines
  • Use Keras Tokenizer and TensorFlow text processing
  • Build LSTM models with embedding layers
  • Fine-tune BERT using Hugging Face Transformers
  • Evaluate NLP models with precision, recall, F1-score
NLP Concepts
  • Understand word embeddings and vector representations
  • Learn attention mechanisms in transformers
  • Compare RNN vs transformer architectures
  • Handle class imbalance in text classification
  • Interpret model predictions with attention weights
02

Business Scenario

SocialPulse Analytics

You have been hired as a Senior NLP Engineer at SocialPulse Analytics, a social media intelligence company that helps brands understand customer sentiment in real-time. The company processes millions of social media posts daily to provide insights about brand perception, product feedback, and emerging trends. Your task is to build the core sentiment classification engine.

"Our clients need real-time sentiment analysis that is both fast and accurate. We need two models: a lightweight LSTM for high-throughput processing and a BERT model for high-stakes analysis where accuracy is critical. Can you build both and help us understand the trade-offs?"

Dr. Emily Zhang, Chief Data Scientist

Technical Challenges to Solve

Text Preprocessing
  • How to handle Twitter-specific tokens (@mentions, #hashtags)?
  • Should emojis be removed or converted to text?
  • How to deal with slang, abbreviations, and misspellings?
  • What is the optimal text length for truncation?
Model Architecture
  • How many LSTM layers are optimal?
  • Bidirectional LSTM vs unidirectional?
  • Which BERT variant to use (base, distilled)?
  • How to prevent overfitting on noisy social data?
Performance Trade-offs
  • LSTM: faster inference vs lower accuracy
  • BERT: higher accuracy vs slower inference
  • Memory and compute requirements comparison
  • When to use which model in production?
Evaluation
  • How to handle sarcasm and irony?
  • Which metrics matter most (accuracy, F1, AUC)?
  • How to analyze model errors?
  • Confidence calibration for predictions
Pro Tip: Real-world sentiment analysis is challenging because of sarcasm, context-dependent meaning, and domain-specific language. Document edge cases where your models struggle and propose solutions.
03

The Dataset

You will work with the Sentiment140 dataset, one of the most popular datasets for sentiment analysis. Download it from Kaggle, or choose one of the alternative datasets below if you prefer:

Primary Dataset: Sentiment140

The Sentiment140 dataset contains 1.6 million tweets automatically labeled using emoticons as noisy labels. Tweets with positive emoticons (such as ":)") are labeled positive (4), and tweets with negative emoticons (such as ":(") are labeled negative (0).

Dataset Info: 1.6M tweets | Binary classification (positive/negative) | 6 columns | ~240MB compressed | Collected April-June 2009
Alternative Datasets

You may also use these alternative datasets for different perspectives:

IMDB Reviews

50K movie reviews, binary sentiment

Kaggle
Amazon Reviews

Product reviews with star ratings

Kaggle
Yelp Reviews

Restaurant reviews, 5-star scale

Kaggle
Sentiment140 Schema
Column | Type    | Description
target | Integer | Sentiment label: 0 = negative, 4 = positive
ids    | Integer | Unique tweet ID
date   | String  | Tweet timestamp (e.g., "Mon Apr 06 22:19:45 PDT 2009")
flag   | String  | Query flag (NO_QUERY for all)
user   | String  | Twitter username
text   | String  | Tweet content (max 140 characters)
Sample Data
Sentiment | Text
Negative  | "@user I hate when that happens. My whole day is ruined now"
Positive  | "Just had the best coffee ever! Starting my day right"
Negative  | "Stuck in traffic again. This commute is killing me"
Positive  | "Can't wait for the weekend! Beach trip with friends"
Note: For this project, use a subset of 100,000-200,000 samples for faster training. The full 1.6M dataset can be used for final model training.
04

Project Requirements

Your project must include all of the following components. This is a comprehensive NLP project covering both classical deep learning (LSTM) and modern transformer approaches (BERT).

1
Data Loading and Exploration

Load and understand the dataset:

  • Load Sentiment140 data (or chosen alternative)
  • Explore class distribution (positive vs negative)
  • Analyze text length distribution
  • Sample and display example texts from each class
  • Create train/validation/test splits (80/10/10)
Deliverable: EDA notebook section with visualizations of class balance, text length histogram, and word frequency analysis.
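The loading and splitting steps above can be sketched as below. This is a minimal sketch, not the required implementation: the CSV path, subsample size, and helper names (load_sentiment140, make_splits) are illustrative assumptions.

```python
# Sketch: load Sentiment140, remap labels {0, 4} -> {0, 1}, subsample,
# and create stratified 80/10/10 splits. Path and sizes are placeholders.
import pandas as pd
from sklearn.model_selection import train_test_split

COLUMNS = ["target", "ids", "date", "flag", "user", "text"]

def load_sentiment140(path, n_samples=200_000, seed=42):
    """Load the raw CSV (Sentiment140 ships latin-1 encoded, headerless)."""
    df = pd.read_csv(path, encoding="latin-1", names=COLUMNS)
    df["label"] = (df["target"] == 4).astype(int)  # 0 = negative, 1 = positive
    return df.sample(n=min(n_samples, len(df)), random_state=seed)

def make_splits(df, seed=42):
    """Stratified 80/10/10 train/validation/test split on the label column."""
    train, rest = train_test_split(
        df, test_size=0.2, stratify=df["label"], random_state=seed)
    val, test = train_test_split(
        rest, test_size=0.5, stratify=rest["label"], random_state=seed)
    return train, val, test
```

Stratifying both splits keeps the positive/negative ratio identical across train, validation, and test, which matters once you subsample.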
2
Text Preprocessing Pipeline

Build a comprehensive text cleaning pipeline:

  • Convert to lowercase
  • Remove URLs, mentions (@user), and hashtags (or convert hashtags to words)
  • Handle emojis (remove or convert to text using emoji library)
  • Remove special characters and numbers
  • Remove stopwords (optional - experiment with/without)
  • Apply lemmatization or stemming
  • Remove extra whitespace
Deliverable: Reusable preprocessing function with before/after examples demonstrating each cleaning step.
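A few of the cleaning steps above can be sketched with the standard library alone; the function name and regexes are illustrative, and you would extend this with the emoji library and NLTK/spaCy lemmatization as the requirements describe.

```python
# Minimal cleaning sketch: lowercase, strip URLs and @mentions, convert
# #hashtags to words, drop non-letter characters, collapse whitespace.
import re

URL_RE     = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#(\w+)")

def clean_tweet(text):
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = HASHTAG_RE.sub(r"\1", text)      # "#happy" -> "happy"
    text = re.sub(r"[^a-z' ]", " ", text)   # keep letters and apostrophes
    return re.sub(r"\s+", " ", text).strip()

print(clean_tweet("@user LOVED it!! #BestDay http://t.co/xyz"))
# -> "loved it bestday"
```

Run each step separately in your notebook to produce the required before/after examples.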
3
Tokenization and Embeddings

Convert text to numerical representations:

  • Use Keras Tokenizer to build vocabulary
  • Convert texts to sequences of token IDs
  • Pad sequences to uniform length (choose appropriate max length)
  • Experiment with pre-trained embeddings (GloVe or Word2Vec)
  • Compare trainable vs frozen embeddings
Deliverable: Tokenization code, vocabulary size stats, embedding matrix preparation if using pre-trained embeddings.
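To make the tokenization step concrete, here is a pure-Python sketch of what Keras's Tokenizer and pad_sequences do under the hood: rank words by frequency, map them to IDs, and pad to a fixed length. The helper names and reserved IDs (0 = padding, 1 = out-of-vocabulary) are illustrative assumptions; in the project itself you would use the Keras utilities directly.

```python
# Illustrative re-implementation of vocabulary building + padding.
from collections import Counter

def build_vocab(texts, num_words=10_000):
    """Rank words by frequency; ID 0 is reserved for padding, 1 for OOV."""
    counts = Counter(w for t in texts for w in t.split())
    return {w: i + 2 for i, (w, _) in enumerate(counts.most_common(num_words))}

def texts_to_padded(texts, vocab, max_len=100):
    """Map each text to token IDs (1 = out-of-vocabulary), post-pad with 0."""
    seqs = [[vocab.get(w, 1) for w in t.split()][:max_len] for t in texts]
    return [s + [0] * (max_len - len(s)) for s in seqs]
```

Seeing the OOV and padding IDs explicitly also explains why the embedding matrix needs vocab_size + 2 rows when you align it with pre-trained GloVe or Word2Vec vectors.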
4
LSTM Model Development

Build and train an LSTM sentiment classifier:

  • Design architecture: Embedding > LSTM > Dense layers
  • Experiment with bidirectional LSTM
  • Add dropout for regularization
  • Compile with binary crossentropy and Adam optimizer
  • Use callbacks: EarlyStopping, ModelCheckpoint
  • Train and monitor validation metrics

Target Performance: Over 85% accuracy on test set

Deliverable: LSTM model code, training curves, hyperparameter experiments.
5
BERT Fine-tuning

Fine-tune a pre-trained BERT model:

  • Load pre-trained BERT (bert-base-uncased or DistilBERT)
  • Use Hugging Face Transformers and Datasets libraries
  • Tokenize text with BERT tokenizer
  • Add classification head on top of BERT
  • Fine-tune with appropriate learning rate (2e-5 to 5e-5)
  • Monitor training with Trainer or custom loop

Target Performance: Over 90% accuracy on test set

Deliverable: BERT fine-tuning code, training metrics, saved model.
6
Model Evaluation and Comparison

Comprehensive evaluation of both models:

  • Calculate accuracy, precision, recall, F1-score
  • Generate confusion matrices for both models
  • Create ROC curves and calculate AUC
  • Analyze misclassified examples
  • Compare inference speed (predictions per second)
  • Document trade-offs between LSTM and BERT
Deliverable: Comparison table, visualizations, error analysis, recommendations for production use.
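The metrics portion of the comparison can be sketched as a small helper that both models share; the function name is illustrative, and y_true/y_prob stand in for your real test labels and predicted probabilities.

```python
# Sketch: one metrics summary per model, computed from test-set
# probabilities so the same function serves LSTM and BERT outputs.
from sklearn.metrics import (accuracy_score, precision_recall_fscore_support,
                             roc_auc_score)

def summarize(name, y_true, y_prob, threshold=0.5):
    """Return a metrics dict from true labels and predicted probabilities."""
    y_pred = [int(p >= threshold) for p in y_prob]
    prec, rec, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="binary", zero_division=0)
    return {"model": name,
            "accuracy": accuracy_score(y_true, y_pred),
            "precision": prec, "recall": rec, "f1": f1,
            "auc": roc_auc_score(y_true, y_pred := y_prob)}
```

Collecting one such dict per model into a DataFrame gives the required comparison table; add a timed prediction loop for the inference-speed column.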
05

Text Preprocessing Pipeline

Text preprocessing is crucial for NLP models. Social media text requires special handling for mentions, hashtags, URLs, and informal language.

Preprocessing Decisions
Recommended for Sentiment
  • Keep emojis: Convert to text ("happy_face")
  • Keep negations: "not", "no", "never" are crucial
  • Keep intensifiers: "very", "really", "extremely"
  • Convert hashtags: #happy to "happy"
  • Lemmatize: Reduces vocabulary size
Avoid for Sentiment
  • Remove all stopwords: Loses "not happy" context
  • Aggressive stemming: Can distort meaning
  • Remove all emojis: Valuable sentiment signals
  • Remove repeated chars: "sooo good" shows emphasis
  • Over-cleaning: Loses important signals
06

LSTM Model Architecture

Long Short-Term Memory (LSTM) networks are effective for sequence data like text. They can capture long-range dependencies that simple RNNs miss.

Recommended Architecture
Layer          | Type               | Parameters                 | Output Shape
Input          | InputLayer         | max_length=100             | (100,)
Embedding      | Embedding          | vocab_size, embed_dim=128  | (100, 128)
SpatialDropout | SpatialDropout1D   | rate=0.2                   | (100, 128)
LSTM1          | Bidirectional LSTM | 64 units, return_sequences | (100, 128)
LSTM2          | Bidirectional LSTM | 32 units                   | (64,)
Dense1         | Dense              | 64 units, ReLU             | (64,)
Dropout        | Dropout            | rate=0.5                   | (64,)
Output         | Dense              | 1 unit, Sigmoid            | (1,)
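The recommended architecture translates to Keras as follows; vocab_size is a placeholder you set from your tokenizer, and the function name is illustrative.

```python
# Sketch: the recommended LSTM architecture, layer for layer, in Keras.
import tensorflow as tf
from tensorflow.keras import layers

def build_lstm(vocab_size=20_000, max_length=100, embed_dim=128):
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(max_length,)),
        layers.Embedding(vocab_size, embed_dim),
        layers.SpatialDropout1D(0.2),
        layers.Bidirectional(layers.LSTM(64, return_sequences=True)),
        layers.Bidirectional(layers.LSTM(32)),
        layers.Dense(64, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model
```

The EarlyStopping and ModelCheckpoint callbacks from the requirements plug into model.fit(..., callbacks=[...]).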
07

Fine-tuning BERT

BERT (Bidirectional Encoder Representations from Transformers) achieves state-of-the-art results on many NLP tasks. Fine-tuning involves adapting the pre-trained model to your specific task.

LSTM vs BERT Comparison
Aspect                | LSTM                               | BERT
Accuracy              | ~85-88%                            | ~90-93%
Training Time         | Minutes (CPU/GPU)                  | Hours (GPU required)
Inference Speed       | Fast (~1000 samples/sec)           | Slower (~100 samples/sec)
Model Size            | ~10-50 MB                          | ~400-800 MB
Pre-training Required | Optional (embeddings)              | Yes (uses pre-trained weights)
Best Use Case         | High-throughput, resource-limited  | High-accuracy, batch processing
08

Submission Requirements

Create a public GitHub repository with the exact name shown below:

Required Repository Name
sentiment-analysis-nlp
github.com/<your-username>/sentiment-analysis-nlp
Required Project Structure
sentiment-analysis-nlp/
├── notebooks/
│   ├── 01_data_exploration.ipynb     # EDA and preprocessing
│   ├── 02_lstm_model.ipynb           # LSTM development
│   └── 03_bert_finetuning.ipynb      # BERT fine-tuning
├── src/
│   ├── preprocessing.py              # Text cleaning functions
│   ├── lstm_model.py                 # LSTM architecture
│   └── bert_model.py                 # BERT fine-tuning code
├── models/
│   ├── lstm_sentiment.h5             # Saved LSTM model
│   └── bert_sentiment/               # Saved BERT model folder
├── reports/
│   ├── confusion_matrix_lstm.png     # LSTM confusion matrix
│   ├── confusion_matrix_bert.png     # BERT confusion matrix
│   ├── training_curves.png           # Training history plots
│   └── model_comparison.png          # LSTM vs BERT comparison
├── requirements.txt                  # Python dependencies
└── README.md                         # Project documentation
README.md Required Sections
1. Project Header
  • Project title and description
  • Your full name and submission date
  • Final accuracy for both models
2. Text Preprocessing
  • Cleaning steps implemented
  • Before/after examples
  • Vocabulary statistics
3. LSTM Model
  • Architecture diagram/summary
  • Training configuration
  • Final metrics achieved
4. BERT Fine-tuning
  • Model variant used
  • Training hyperparameters
  • Final metrics achieved
5. Model Comparison
  • Accuracy, F1, inference speed
  • Trade-offs discussion
  • Recommendation for production
6. How to Run
  • Installation instructions
  • How to train models
  • How to make predictions
Submit Your Project

Enter your GitHub username - we will verify your repository automatically

09

Grading Rubric

Your project will be graded on the following criteria. Total: 600 points.

Criteria              | Points | Description
Data Exploration      | 50     | EDA, class distribution, text analysis, proper splits
Text Preprocessing    | 75     | Comprehensive cleaning, tokenization, embedding preparation
LSTM Model            | 125    | Architecture design, training, achieving over 85% accuracy
BERT Fine-tuning      | 150    | Proper fine-tuning, achieving over 90% accuracy
Model Comparison      | 100    | Metrics comparison, trade-off analysis, recommendations
Documentation         | 75     | README quality, code comments, notebook organization
Bonus: Error Analysis | 25     | Deep analysis of misclassified examples, edge cases
Total                 | 600    |
Grading Levels
  • Excellent (540-600): BERT over 92%, exceptional analysis
  • Good (450-539): Both models meet targets, good docs
  • Satisfactory (360-449): Meets minimum requirements
  • Needs Work (< 360): Missing components or low accuracy

Ready to Submit?

Make sure both models are trained and your comparison analysis is complete.

10

Pre-Submission Checklist

Use this checklist to verify you have completed all requirements before submitting.

  • Preprocessing
  • LSTM Model
  • BERT Model
  • Evaluation
Final Check: Run all notebooks from scratch to ensure reproducibility. Verify all model files are saved and not too large for GitHub.