Project 3: Sentiment Analysis | AI Course

Project Overview

Sentiment analysis is one of the most widely deployed NLP applications in industry, powering customer feedback systems, social media monitoring, and brand reputation management. In this project, you will work with the Sentiment140 dataset, which contains 1.6 million tweets labeled as positive or negative. You will build two text classifiers, a recurrent LSTM network and a fine-tuned BERT transformer, and compare their performance. Target accuracy: over 85% with the LSTM and over 90% with BERT.

Skills Applied: This project tests your proficiency in text preprocessing (cleaning, tokenization), word embeddings (Word2Vec, GloVe), LSTM architecture design, transformer models (BERT), and the Hugging Face ecosystem for NLP.

Preprocessing

Clean text, remove noise, handle emojis and special characters

Tokenization

Convert text to sequences, build vocabulary, pad sequences

LSTM Network

Build recurrent architecture with embeddings and dropout

BERT Fine-tuning

Adapt pre-trained transformers for classification

Learning Objectives

Technical Skills

Implement comprehensive text cleaning pipelines
Use Keras Tokenizer and TensorFlow text processing
Build LSTM models with embedding layers
Fine-tune BERT using Hugging Face Transformers
Evaluate NLP models with precision, recall, F1-score

NLP Concepts

Understand word embeddings and vector representations
Learn attention mechanisms in transformers
Compare RNN vs transformer architectures
Handle class imbalance in text classification
Interpret model predictions with attention weights

Business Scenario

SocialPulse Analytics

You have been hired as a Senior NLP Engineer at SocialPulse Analytics, a social media intelligence company that helps brands understand customer sentiment in real time. The company processes millions of social media posts daily to provide insights into brand perception, product feedback, and emerging trends. Your task is to build the core sentiment classification engine.

"Our clients need real-time sentiment analysis that is both fast and accurate. We need two models: a lightweight LSTM for high-throughput processing and a BERT model for high-stakes analysis where accuracy is critical. Can you build both and help us understand the trade-offs?"

Dr. Emily Zhang, Chief Data Scientist

Technical Challenges to Solve

Text Preprocessing

How to handle Twitter-specific tokens (@mentions, #hashtags)?
Should emojis be removed or converted to text?
How to deal with slang, abbreviations, and misspellings?
What is the optimal text length for truncation?

Model Architecture

How many LSTM layers are optimal?
Bidirectional LSTM vs unidirectional?
Which BERT variant to use (base, distilled)?
How to prevent overfitting on noisy social data?

Performance Trade-offs

LSTM: faster inference vs lower accuracy
BERT: higher accuracy vs slower inference
Memory and compute requirements comparison
When to use which model in production?

Evaluation

How to handle sarcasm and irony?
Which metrics matter most (accuracy, F1, AUC)?
How to analyze model errors?
Confidence calibration for predictions

Pro Tip: Real-world sentiment analysis is challenging because of sarcasm, context-dependent meaning, and domain-specific language. Document edge cases where your models struggle and propose solutions.

The Dataset

You will work with the Sentiment140 dataset, one of the most popular datasets for sentiment analysis. Download it from Kaggle, or use one of the alternative datasets listed further down if you prefer a different domain.

Primary Dataset: Sentiment140

The Sentiment140 dataset contains 1.6 million tweets automatically labeled using emoticons as noisy labels. Tweets with positive emoticons (like :)) are labeled positive (4), and tweets with negative emoticons (like :() are labeled negative (0).

Dataset Info: 1.6M tweets | Binary classification (positive/negative) | 6 columns | ~240MB compressed | Collected April-June 2009

Alternative Datasets

You may also use these alternative datasets for different perspectives:

IMDB Reviews

50K movie reviews, binary sentiment

Kaggle

Amazon Reviews

Product reviews with star ratings

Kaggle

Yelp Reviews

Restaurant reviews, 5-star scale

Kaggle

Sentiment140 Schema

Column	Type	Description
`target`	Integer	Sentiment label: 0 = negative, 4 = positive
`ids`	Integer	Unique tweet ID
`date`	String	Tweet timestamp (e.g., "Mon Apr 06 22:19:45 PDT 2009")
`flag`	String	Query used to collect the tweet (NO_QUERY for every row in this dataset)
`user`	String	Twitter username
`text`	String	Tweet content (max 140 characters)
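
A minimal loading sketch for the schema above, using only the standard library. It assumes the CSV has no header row and stores the six columns in the order listed; both points are worth checking against your download. It also remaps the 0/4 labels to 0/1, which is what a sigmoid output or a two-class head expects. The two sample rows are made up for illustration.

```python
import csv
import io

COLUMNS = ["target", "ids", "date", "flag", "user", "text"]
LABEL_MAP = {"0": 0, "4": 1}  # raises KeyError on any unexpected label

def read_sentiment140(handle):
    """Yield (label, text) pairs with the label remapped from {0, 4} to {0, 1}."""
    for row in csv.reader(handle):
        record = dict(zip(COLUMNS, row))
        yield LABEL_MAP[record["target"]], record["text"]

# Two invented rows in the same layout, standing in for the real file.
sample = io.StringIO(
    '"0","1","Mon Apr 06 22:19:45 PDT 2009","NO_QUERY","user_a","Stuck in traffic again"\n'
    '"4","2","Mon Apr 06 22:19:49 PDT 2009","NO_QUERY","user_b","Best coffee ever!"\n'
)
rows = list(read_sentiment140(sample))
print(rows)
```

For the real file, pass an open file handle instead of the StringIO object. If you hit a decoding error, opening the file with encoding="latin-1" is a common workaround, but treat that as an assumption to verify.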

Sample Data

Sentiment	Text
Negative	"@user I hate when that happens. My whole day is ruined now"
Positive	"Just had the best coffee ever! Starting my day right"
Negative	"Stuck in traffic again. This commute is killing me"
Positive	"Can't wait for the weekend! Beach trip with friends"

Note: For this project, use a subset of 100,000-200,000 samples for faster training. The full 1.6M dataset can be used for final model training.

Project Requirements

Your project must include all of the following components. This is a comprehensive NLP project covering both classical deep learning (LSTM) and modern transformer approaches (BERT).

Data Loading and Exploration

Load and understand the dataset:

Load Sentiment140 data (or chosen alternative)
Explore class distribution (positive vs negative)
Analyze text length distribution
Sample and display example texts from each class
Create train/validation/test splits (80/10/10)

Deliverable: EDA notebook section with visualizations of class balance, text length histogram, and word frequency analysis.
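
One way to produce the 80/10/10 split while keeping the positive/negative ratio the same in every part is a stratified split. The sketch below is a dependency-free illustration; the function name and the fixed seed are choices made here, not project requirements.

```python
import random
from collections import defaultdict

def stratified_split(labels, fractions=(0.8, 0.1, 0.1), seed=42):
    """Return (train, val, test) index lists that preserve the class ratio."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)  # fixed seed makes the split reproducible
    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = int(len(indices) * fractions[0])
        n_val = int(len(indices) * fractions[1])
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]  # remainder goes to the test set
    return train, val, test

labels = [0] * 500 + [1] * 500
train_idx, val_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(val_idx), len(test_idx))  # 800 100 100
```

If scikit-learn is available, calling train_test_split twice with the stratify argument gives the same kind of split with less code.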

Text Preprocessing Pipeline

Build a comprehensive text cleaning pipeline:

Convert to lowercase
Remove URLs and mentions (@user); for hashtags, either remove them or strip the # and keep the word
Handle emojis (remove them or convert them to text using the emoji library)
Remove special characters and numbers
Remove stopwords (optional - experiment with/without)
Apply lemmatization or stemming
Remove extra whitespace

Deliverable: Reusable preprocessing function with before/after examples demonstrating each cleaning step.
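
The core cleaning steps above can be sketched with regular expressions alone. This is a starting point under a few assumptions: hashtags are converted to words rather than removed, apostrophes are kept so that contractions such as "can't" survive, and stopword removal and lemmatization are left out because they need an external library such as NLTK or spaCy. The function and pattern names are illustrative.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#(\w+)")
NON_LETTER_RE = re.compile(r"[^a-z'\s]")
WHITESPACE_RE = re.compile(r"\s+")

def clean_tweet(text):
    """Lowercase a tweet and strip URLs, mentions, digits, and punctuation."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = HASHTAG_RE.sub(r"\1", text)   # "#happy" becomes "happy"
    text = NON_LETTER_RE.sub(" ", text)  # drops digits, punctuation, and emojis
    return WHITESPACE_RE.sub(" ", text).strip()

raw = "@user I can't believe it!! #Happy 2day http://t.co/abc"
print(clean_tweet(raw))  # i can't believe it happy day
```

Note that the non-letter step also deletes emojis, so if you decide to convert emojis to text, do that before calling this function.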

Tokenization and Embeddings

Convert text to numerical representations:

Use Keras Tokenizer to build vocabulary
Convert texts to sequences of token IDs
Pad sequences to uniform length (choose appropriate max length)
Experiment with pre-trained embeddings (GloVe or Word2Vec)
Compare trainable vs frozen embeddings

Deliverable: Tokenization code, vocabulary size stats, embedding matrix preparation if using pre-trained embeddings.
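
To make the vocabulary, sequence, and padding steps concrete, here is a dependency-free sketch of the idea. It is a stand-in for illustration, not the Keras Tokenizer itself: it reserves id 0 for padding and id 1 for out-of-vocabulary words, and it pads and truncates at the end of the sequence (the equivalent of padding="post" and truncating="post" in Keras pad_sequences, whose default is "pre"). The function names and the 20,000-word cap are assumptions.

```python
from collections import Counter

PAD_ID, OOV_ID = 0, 1  # reserved ids: padding and out-of-vocabulary

def build_vocab(texts, max_words=20000):
    """Map the most frequent words to ids starting at 2."""
    counts = Counter(word for text in texts for word in text.split())
    most_common = counts.most_common(max_words - 2)
    return {word: i + 2 for i, (word, _) in enumerate(most_common)}

def texts_to_padded(texts, vocab, max_length=100):
    """Convert texts to id sequences of exactly max_length."""
    batch = []
    for text in texts:
        ids = [vocab.get(word, OOV_ID) for word in text.split()][:max_length]
        batch.append(ids + [PAD_ID] * (max_length - len(ids)))
    return batch

corpus = ["good day", "bad day", "good good coffee"]
vocab = build_vocab(corpus)
padded = texts_to_padded(["good tea"], vocab, max_length=4)
print(len(vocab), padded)  # 4 [[2, 1, 0, 0]]
```

Here "good" is the most frequent word and gets id 2, while "tea" was never seen and maps to the out-of-vocabulary id.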

LSTM Model Development

Build and train an LSTM sentiment classifier:

Design architecture: Embedding > LSTM > Dense layers
Experiment with bidirectional LSTM
Add dropout for regularization
Compile with binary crossentropy and Adam optimizer
Use callbacks: EarlyStopping, ModelCheckpoint
Train and monitor validation metrics

Target Performance: Over 85% accuracy on test set

Deliverable: LSTM model code, training curves, hyperparameter experiments.
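
A possible Keras implementation of these steps is sketched below. The layer sizes follow the Recommended Architecture table in this document; the patience value, the monitored metric, and the function names are assumptions you should tune. TensorFlow is imported inside the functions so the module can be loaded on machines where it is not installed.

```python
def build_lstm_model(vocab_size, max_length=100, embed_dim=128):
    """Embedding > two bidirectional LSTMs > Dense, compiled for binary labels."""
    # Imported lazily because TensorFlow is a heavy dependency.
    from tensorflow import keras
    from tensorflow.keras import layers

    model = keras.Sequential([
        keras.Input(shape=(max_length,)),
        layers.Embedding(vocab_size, embed_dim),
        layers.SpatialDropout1D(0.2),
        layers.Bidirectional(layers.LSTM(64, return_sequences=True)),
        layers.Bidirectional(layers.LSTM(32)),
        layers.Dense(64, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model

def make_callbacks(checkpoint_path="models/lstm_sentiment.h5"):
    """Stop when validation loss stalls and keep only the best checkpoint."""
    from tensorflow import keras

    return [
        keras.callbacks.EarlyStopping(monitor="val_loss", patience=2,
                                      restore_best_weights=True),
        keras.callbacks.ModelCheckpoint(checkpoint_path, monitor="val_loss",
                                        save_best_only=True),
    ]

# Typical use, with x_train, y_train, x_val, y_val as placeholders for your
# padded sequences and 0/1 labels:
# model = build_lstm_model(vocab_size=20000)
# model.fit(x_train, y_train, validation_data=(x_val, y_val),
#           epochs=10, batch_size=256, callbacks=make_callbacks())
```

Swapping the Bidirectional wrappers for plain LSTM layers gives you the unidirectional baseline for the comparison experiment.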

BERT Fine-tuning

Fine-tune a pre-trained BERT model:

Load pre-trained BERT (bert-base-uncased or DistilBERT)
Use Hugging Face Transformers and Datasets libraries
Tokenize text with BERT tokenizer
Add classification head on top of BERT
Fine-tune with appropriate learning rate (2e-5 to 5e-5)
Monitor training with Trainer or custom loop

Target Performance: Over 90% accuracy on test set

Deliverable: BERT fine-tuning code, training metrics, saved model.
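
The steps above map onto the Hugging Face Trainer roughly as follows. This is a sketch under stated assumptions: DistilBERT is chosen for speed, the maximum length of 64 tokens, batch size, epoch count, and weight decay are placeholder values, and labels must already be 0/1 rather than 0/4. The heavy libraries are imported inside the function so the module loads without them.

```python
MODEL_NAME = "distilbert-base-uncased"  # or "bert-base-uncased"
LEARNING_RATE = 2e-5                    # within the 2e-5 to 5e-5 range
MAX_LENGTH = 64                         # assumption: check your length histogram

def accuracy_from_logits(eval_pred):
    """compute_metrics hook: Trainer passes (logits, labels) as NumPy arrays."""
    logits, labels = eval_pred
    predictions = logits.argmax(axis=-1)
    return {"accuracy": float((predictions == labels).mean())}

def fine_tune(train_texts, train_labels, val_texts, val_labels,
              output_dir="models/bert_sentiment"):
    """Fine-tune the encoder plus a new two-class head, then save both."""
    # Imported lazily so this module loads without the GPU stack installed.
    from datasets import Dataset
    from transformers import (AutoModelForSequenceClassification,
                              AutoTokenizer, Trainer, TrainingArguments)

    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

    def tokenize(batch):
        return tokenizer(batch["text"], truncation=True,
                         padding="max_length", max_length=MAX_LENGTH)

    train_ds = Dataset.from_dict(
        {"text": train_texts, "label": train_labels}).map(tokenize, batched=True)
    val_ds = Dataset.from_dict(
        {"text": val_texts, "label": val_labels}).map(tokenize, batched=True)

    # Loads pre-trained weights and adds a randomly initialized classifier head.
    model = AutoModelForSequenceClassification.from_pretrained(
        MODEL_NAME, num_labels=2)

    args = TrainingArguments(
        output_dir=output_dir,
        learning_rate=LEARNING_RATE,
        per_device_train_batch_size=32,
        num_train_epochs=2,
        weight_decay=0.01,
    )
    trainer = Trainer(model=model, args=args, train_dataset=train_ds,
                      eval_dataset=val_ds, compute_metrics=accuracy_from_logits)
    trainer.train()
    trainer.save_model(output_dir)
    tokenizer.save_pretrained(output_dir)
    return trainer.evaluate()
```

Start with a few thousand examples to confirm the loop runs end to end before committing GPU time to the larger subset.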

Model Evaluation and Comparison

Comprehensive evaluation of both models:

Calculate accuracy, precision, recall, F1-score
Generate confusion matrices for both models
Create ROC curves and calculate AUC
Analyze misclassified examples
Compare inference speed (predictions per second)
Document trade-offs between LSTM and BERT

Deliverable: Comparison table, visualizations, error analysis, recommendations for production use.
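
As a reference for how the headline metrics relate to the confusion matrix, the sketch below computes them from scratch for binary labels, with 1 as the positive class. The function name and the toy predictions are illustrative; in your notebook, scikit-learn's metrics module provides the same numbers plus ROC and AUC.

```python
def binary_metrics(y_true, y_pred):
    """Confusion counts and derived metrics for 0/1 labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "confusion": [[tn, fp], [fn, tp]],  # rows: true class, cols: predicted
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]
report = binary_metrics(y_true, y_pred)
print(report["confusion"], report["f1"])  # [[3, 1], [1, 3]] 0.75
```

Running the same function on LSTM and BERT predictions for the identical test set gives directly comparable rows for your comparison table.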

Text Preprocessing Pipeline

Text preprocessing is crucial for NLP models. Social media text requires special handling for mentions, hashtags, URLs, and informal language.

Preprocessing Decisions

Recommended for Sentiment

Keep emojis: Convert to text ("happy_face")
Keep negations: "not", "no", "never" are crucial
Keep intensifiers: "very", "really", "extremely"
Convert hashtags: #happy to "happy"
Lemmatize: Reduces vocabulary size

Avoid for Sentiment

Remove all stopwords: Loses "not happy" context
Aggressive stemming: Can distort meaning
Remove all emojis: Valuable sentiment signals
Remove repeated chars: "sooo good" shows emphasis
Over-cleaning: Loses important signals
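
The recommendations above can be combined into a sentiment-aware filter: drop ordinary stopwords, but protect negations and intensifiers, and turn emoticons or emojis into word-like tokens before any punctuation stripping. The word lists and the emoticon map below are deliberately tiny, hypothetical examples; a real pipeline would start from a full stopword list (for example NLTK's) and a fuller emoji mapping such as the emoji library's demojize function.

```python
# Hypothetical miniature word lists, for illustration only.
STOPWORDS = {"a", "an", "the", "is", "am", "are", "i", "to", "of", "and",
             "not", "no", "very"}
PROTECTED = {"not", "no", "never", "very", "really", "extremely"}

EMOTICONS = {":)": "happy_face", ":(": "sad_face",
             "😊": "happy_face", "😢": "sad_face"}

def replace_emoticons(text):
    """Swap each known emoticon or emoji for a word-like token."""
    for symbol, name in EMOTICONS.items():
        text = text.replace(symbol, f" {name} ")
    return text

def filter_stopwords(tokens):
    """Remove stopwords unless they carry negation or intensity."""
    return [t for t in tokens if t not in STOPWORDS or t in PROTECTED]

text = "i am not happy :( the coffee is very bad"
tokens = filter_stopwords(replace_emoticons(text).split())
print(tokens)  # ['not', 'happy', 'sad_face', 'coffee', 'very', 'bad']
```

Without the protected set, "not happy" would collapse to "happy" and the example would lose the very signal the classifier needs.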

LSTM Model Architecture

Long Short-Term Memory (LSTM) networks are effective for sequence data such as text. Their gated memory cell mitigates the vanishing-gradient problem, so they can capture long-range dependencies that simple RNNs miss.

Recommended Architecture

Layer	Type	Parameters	Output Shape
Input	InputLayer	max_length=100	(100,)
Embedding	Embedding	vocab_size, embed_dim=128	(100, 128)
SpatialDropout	SpatialDropout1D	rate=0.2	(100, 128)
LSTM1	Bidirectional LSTM	64 units, return_sequences	(100, 128)
LSTM2	Bidirectional LSTM	32 units	(64,)
Dense1	Dense	64 units, ReLU	(64,)
Dropout	Dropout	rate=0.5	(64,)
Output	Dense	1 unit, Sigmoid	(1,)
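
To sanity-check the table, the parameter count of this architecture can be worked out by hand. The sketch below uses the standard formulas for embedding, LSTM, and dense layers and assumes a vocabulary of 20,000 words, which is an illustrative value, not a requirement. Each bidirectional layer doubles the LSTM count, and the second LSTM sees 128 inputs because the first one concatenates two 64-unit directions.

```python
def embedding_params(vocab_size, embed_dim):
    return vocab_size * embed_dim

def lstm_params(input_dim, units, bidirectional=True):
    # Four gates, each with input weights, recurrent weights, and a bias.
    one_direction = 4 * (units * (input_dim + units) + units)
    return 2 * one_direction if bidirectional else one_direction

def dense_params(input_dim, units):
    return input_dim * units + units

VOCAB_SIZE = 20000  # assumption for illustration

layer_params = {
    "Embedding": embedding_params(VOCAB_SIZE, 128),
    "LSTM1": lstm_params(128, 64),
    "LSTM2": lstm_params(128, 32),  # input is LSTM1's 2 x 64 output
    "Dense1": dense_params(64, 64),
    "Output": dense_params(64, 1),
}
total = sum(layer_params.values())
size_mb = total * 4 / 1e6  # 4 bytes per float32 weight
print(total, round(size_mb, 1))  # 2704257 10.8
```

Under these assumptions the embedding layer holds about 95% of the weights, so vocabulary size is the main lever on model size.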

Fine-tuning BERT

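BERT (Bidirectional Encoder Representations from Transformers) is a pre-trained transformer encoder that set state-of-the-art results on many NLP benchmarks when it was released. Fine-tuning adapts the pre-trained model to your task: you add a classification head and continue training on your labeled data with a small learning rate.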

LSTM vs BERT Comparison

Aspect	LSTM	BERT
Accuracy	~85-88%	~90-93%
Training Time	Minutes (CPU/GPU)	Hours (GPU required)
Inference Speed	Fast (~1000 samples/sec)	Slower (~100 samples/sec)
Model Size	~10-50 MB	~400-800 MB
Pre-training Required	Optional (embeddings)	Yes (uses pre-trained weights)
Best Use Case	High-throughput, resource-limited	High-accuracy, batch processing
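
The inference speed figures in the table are rough and depend heavily on hardware and batch size, so measure them on your own setup. A minimal timing harness is sketched below; predict_fn is a placeholder for whatever callable wraps your model (for example, a function that tokenizes a list of texts and calls the LSTM or BERT model), and dummy_predict exists only to make the example runnable.

```python
import time

def samples_per_second(predict_fn, batch, repeats=5):
    """Average throughput of predict_fn over several timed runs."""
    predict_fn(batch)  # warm-up run, excluded from the timing
    start = time.perf_counter()
    for _ in range(repeats):
        predict_fn(batch)
    elapsed = time.perf_counter() - start
    return repeats * len(batch) / max(elapsed, 1e-9)

def dummy_predict(texts):
    """Stand-in for a real model: returns a fake 0/1 label per text."""
    return [len(t) % 2 for t in texts]

rate = samples_per_second(dummy_predict, ["some tweet"] * 1000)
print(f"{rate:.0f} samples/sec")
```

Time both models on the same machine with the same batch, and report the batch size alongside the throughput so the comparison is reproducible.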

Submission Requirements

Create a public GitHub repository with the exact name shown below:

Required Repository Name

sentiment-analysis-nlp

github.com/<your-username>/sentiment-analysis-nlp

Required Project Structure

sentiment-analysis-nlp/
├── notebooks/
│   ├── 01_data_exploration.ipynb     # EDA and preprocessing
│   ├── 02_lstm_model.ipynb           # LSTM development
│   └── 03_bert_finetuning.ipynb      # BERT fine-tuning
├── src/
│   ├── preprocessing.py              # Text cleaning functions
│   ├── lstm_model.py                 # LSTM architecture
│   └── bert_model.py                 # BERT fine-tuning code
├── models/
│   ├── lstm_sentiment.h5             # Saved LSTM model
│   └── bert_sentiment/               # Saved BERT model folder
├── reports/
│   ├── confusion_matrix_lstm.png     # LSTM confusion matrix
│   ├── confusion_matrix_bert.png     # BERT confusion matrix
│   ├── training_curves.png           # Training history plots
│   └── model_comparison.png          # LSTM vs BERT comparison
├── requirements.txt                  # Python dependencies
└── README.md                         # Project documentation

README.md Required Sections

1. Project Header

Project title and description
Your full name and submission date
Final accuracy for both models

2. Text Preprocessing

Cleaning steps implemented
Before/after examples
Vocabulary statistics

3. LSTM Model

Architecture diagram/summary
Training configuration
Final metrics achieved

4. BERT Fine-tuning

Model variant used
Training hyperparameters
Final metrics achieved

5. Model Comparison

Accuracy, F1, inference speed
Trade-offs discussion
Recommendation for production

6. How to Run

Installation instructions
How to train models
How to make predictions

Grading Rubric

Your project will be graded on the following criteria. Total: 600 points.

Criteria	Points	Description
Data Exploration	50	EDA, class distribution, text analysis, proper splits
Text Preprocessing	75	Comprehensive cleaning, tokenization, embedding preparation
LSTM Model	125	Architecture design, training, achieving over 85% accuracy
BERT Fine-tuning	150	Proper fine-tuning, achieving over 90% accuracy
Model Comparison	100	Metrics comparison, trade-off analysis, recommendations
Documentation	75	README quality, code comments, notebook organization
Bonus: Error Analysis	25	Deep analysis of misclassified examples, edge cases
Total	600

Grading Levels

Excellent

540-600

BERT over 92%, exceptional analysis

Good

450-539

Both models meet targets, good docs

Satisfactory

360-449

Meets minimum requirements

Needs Work

< 360

Missing components or low accuracy

Ready to Submit?

Make sure both models are trained and your comparison analysis is complete.
