Introduction to NLP
Natural Language Processing (NLP) is a fascinating field at the intersection of artificial intelligence, computer science, and linguistics. It enables machines to read, understand, and derive meaning from human language, powering everything from virtual assistants to translation services.
Don't worry if you're new to programming or AI! This lesson is designed specifically for beginners. By the end, you'll understand how computers can "read" and "understand" human language - and you'll write your first NLP code!
- Basic Python knowledge is helpful but not required
- Every concept includes runnable code you can try
- See how NLP powers apps you use daily
- Test your understanding after each section
What is Natural Language Processing?
Imagine talking to your phone and it actually understands you. Or typing a question into Google and getting exactly the answer you need. Or having an app automatically translate a foreign menu into your language. That's NLP in action!
Every day, humans generate massive amounts of text data - emails, social media posts, reviews, articles, and conversations. NLP is the technology that helps computers make sense of this unstructured data. Unlike structured data in databases with clear rows and columns, text is messy, ambiguous, and filled with nuances that humans understand intuitively but machines struggle to grasp.
Natural Language Processing (NLP)
Natural Language Processing is a branch of artificial intelligence that helps computers understand, interpret, and manipulate human language. It combines computational linguistics (rule-based modeling of human language) with statistical, machine learning, and deep learning models.
Why it matters: NLP bridges the gap between human communication and computer understanding. Without NLP, computers would only understand structured commands, not natural human speech or writing.
Why Does NLP Matter?
Consider this: over 80% of business data is unstructured, primarily in the form of text. Emails, customer reviews, support tickets, social media posts, legal documents, and medical records all contain valuable insights locked in natural language. NLP is the key that unlocks this treasure trove of information.
Virtual Assistants
Siri, Alexa, and Google Assistant use NLP to understand your voice commands and respond naturally, making technology accessible to everyone.
Machine Translation
Google Translate and DeepL break language barriers, translating text between 100+ languages in real-time using advanced NLP models.
Search Engines
Google understands your queries, not just keywords. NLP helps interpret intent behind searches like "restaurants near me" or "how to fix a leaky faucet."
The NLP Pipeline
Processing natural language involves a series of steps, often called a pipeline. Each step transforms the raw text into something more useful for analysis. Understanding this pipeline is crucial because the quality of each step affects all downstream tasks. Think of it as an assembly line where each station adds value.
# A typical NLP pipeline in Python
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
# Sample text
text = "Natural Language Processing is amazing! It helps computers understand human language."
# Step 1: Lowercase
text_lower = text.lower()
print(f"Lowercased: {text_lower}")
# Step 2: Tokenization
tokens = word_tokenize(text_lower)
print(f"Tokens: {tokens}")
# Step 3: Remove punctuation and stop words
stop_words = set(stopwords.words('english'))
filtered_tokens = [t for t in tokens if t.isalnum() and t not in stop_words]
print(f"Filtered: {filtered_tokens}")
Output:
Lowercased: natural language processing is amazing! it helps computers understand human language.
Tokens: ['natural', 'language', 'processing', 'is', 'amazing', '!', 'it', 'helps', 'computers', 'understand', 'human', 'language', '.']
Filtered: ['natural', 'language', 'processing', 'amazing', 'helps', 'computers', 'understand', 'human', 'language']
Notice how the pipeline progressively cleans the text. We start with a natural sentence, convert it to lowercase for consistency, break it into individual tokens, and then remove noise like punctuation and common words that do not add meaning. The result is a clean list of meaningful words ready for analysis.
Installing NLP Libraries
Python offers several excellent libraries for NLP. NLTK (Natural Language Toolkit) is perfect for learning and experimentation, while SpaCy is optimized for production use. Let us set up your environment with both libraries so you can follow along with the examples in this lesson.
# Install NLTK (Natural Language Toolkit)
pip install nltk
# Install SpaCy (Industrial-strength NLP)
pip install spacy
# Download SpaCy's English model
python -m spacy download en_core_web_sm
# Download NLTK data (run in Python)
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
Challenges in NLP
Human language is incredibly complex. Words can have multiple meanings, sentences can be structured in countless ways, and context changes everything. These challenges make NLP one of the most difficult areas in AI, but also one of the most rewarding when solved.
| Challenge | Example | Why It Is Hard |
|---|---|---|
| Ambiguity | "I saw her duck" | Did she duck down, or did I see her pet duck? |
| Sarcasm | "Oh great, another meeting" | Words are positive but meaning is negative |
| Context | "It is cold" vs "The case went cold" | Same word, completely different meanings |
| Slang | "That movie was lit" | Informal language evolves constantly |
| Negation | "I do not dislike it" | Double negatives require logical reasoning |
Practice Questions: Introduction to NLP
Test your understanding with these coding exercises.
Task: Import NLTK and print its version to verify the installation is working correctly.
Show Solution
import nltk
print(f"NLTK Version: {nltk.__version__}")
# Output (your version may differ): NLTK Version: 3.8.1
Task: Load SpaCy's small English model and process the sentence "NLP is transforming how we interact with technology."
Show Solution
import spacy
# Load the small English model
nlp = spacy.load('en_core_web_sm')
# Process text
doc = nlp("NLP is transforming how we interact with technology.")
# Print each token
for token in doc:
    print(f"{token.text:15} | POS: {token.pos_:8} | Lemma: {token.lemma_}")
# Output:
# NLP | POS: PROPN | Lemma: NLP
# is | POS: AUX | Lemma: be
# transforming | POS: VERB | Lemma: transform
# ...
Task: Given the text below, tokenize it and count the frequency of each word (ignore case and punctuation).
text = "NLP helps machines understand language. Language understanding is key to AI. AI and NLP work together."
Show Solution
from collections import Counter
from nltk.tokenize import word_tokenize
text = "NLP helps machines understand language. Language understanding is key to AI. AI and NLP work together."
# Tokenize and lowercase
tokens = word_tokenize(text.lower())
# Filter only alphabetic tokens
words = [t for t in tokens if t.isalpha()]
# Count frequencies
freq = Counter(words)
print("Word Frequencies:")
for word, count in freq.most_common():
    print(f" {word}: {count}")
# Output:
# Word Frequencies:
# nlp: 2
# language: 2
# ai: 2
# helps: 1
# ...
Text Preprocessing
Before feeding text data to any NLP model, we must clean and normalize it. Text preprocessing transforms raw, messy text into a clean, consistent format that algorithms can effectively process. This step is crucial because real-world text contains noise like HTML tags, special characters, and inconsistent formatting.
Why Is Preprocessing Important?
Raw text data is messy. A single word might appear as "Hello", "HELLO", "hello!", or "hello..." - and to a computer, these are all different strings. Preprocessing ensures consistency so that the model can focus on meaning rather than superficial differences. Think of it as preparing ingredients before cooking - you would not throw unwashed, unpeeled vegetables into a pot.
Text Preprocessing
Text preprocessing is the process of cleaning and transforming raw text into a format suitable for analysis. It typically includes lowercasing, removing punctuation, handling special characters, and normalizing whitespace.
Why it matters: Models trained on preprocessed text perform better because they learn patterns from clean, consistent data rather than getting confused by formatting variations.
Step 1: Lowercasing
The simplest but often most impactful preprocessing step is converting all text to lowercase. This ensures that "Python", "PYTHON", and "python" are treated as the same word. Without lowercasing, your vocabulary would be unnecessarily large, and the model might think these are different concepts.
# Lowercasing - the simplest preprocessing step
text = "Python is great. PYTHON is powerful. python is popular!"
# Convert to lowercase
text_lower = text.lower()
print(f"Original: {text}")
print(f"Lowercased: {text_lower}")
# Why it matters - vocabulary comparison
words_original = set(text.split())
words_lower = set(text_lower.split())
print(f"\nOriginal vocabulary size: {len(words_original)}")
print(f"Lowercased vocabulary size: {len(words_lower)}")
Output:
Original: Python is great. PYTHON is powerful. python is popular!
Lowercased: python is great. python is powerful. python is popular!
Original vocabulary size: 7
Lowercased vocabulary size: 5
Notice that "Python", "PYTHON", and "python" collapse into a single vocabulary entry after lowercasing, shrinking the vocabulary from 7 to 5.
Step 2: Removing Punctuation
Punctuation marks like periods, commas, and exclamation points are important for human readability but often add noise for NLP models. Removing them simplifies the text and reduces vocabulary size. However, some tasks (like sentiment analysis) might benefit from keeping certain punctuation like exclamation marks.
import string
text = "Hello, World! How are you doing today? I'm great!!!"
# Method 1: Using string.punctuation
no_punct = text.translate(str.maketrans('', '', string.punctuation))
print(f"Original: {text}")
print(f"No punctuation: {no_punct}")
# Method 2: Using regex (more control)
import re
no_punct_regex = re.sub(r'[^\w\s]', '', text)
print(f"Using regex: {no_punct_regex}")
# See what punctuation we removed
print(f"\nPunctuation characters: {string.punctuation}")
Output:
Original: Hello, World! How are you doing today? I'm great!!!
No punctuation: Hello World How are you doing today Im great
Using regex: Hello World How are you doing today Im great
Punctuation characters: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
"I'm" becomes "Im". This happens because the apostrophe is classified as punctuation.
For production systems, consider handling contractions first (expanding "I'm" to "I am")
before removing punctuation. The optimal approach depends on your specific NLP task requirements.
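Expanding contractions first can be sketched in a few lines of plain Python. The mapping below is a tiny illustrative sample, not a complete contraction list:

```python
import re

# Tiny illustrative contraction map - real projects use a much longer list
CONTRACTIONS = {
    "i'm": "i am",
    "can't": "cannot",
    "won't": "will not",
    "it's": "it is",
    "don't": "do not",
}

def expand_contractions(text):
    """Replace known contractions before any punctuation is stripped."""
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, CONTRACTIONS)) + r")\b")
    return pattern.sub(lambda m: CONTRACTIONS[m.group(1)], text.lower())

print(expand_contractions("I'm sure it can't fail"))  # i am sure it cannot fail
```

Once contractions are expanded, removing the remaining punctuation no longer mangles words like "I'm" into "Im".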
Step 3: Handling Special Characters and Numbers
Real-world text often contains special characters, HTML tags, URLs, email addresses, and numbers. Depending on your task, you may want to remove these entirely, replace them with placeholders, or handle them specially. Regular expressions are your best friend for this kind of pattern matching.
import re
text = """Check out https://example.com for more info!
Contact us at support@email.com or call 123-456-7890.
Price: $99.99 (50% off!)"""
# Remove URLs
no_urls = re.sub(r'https?://\S+|www\.\S+', '[URL]', text)
print("URLs replaced:")
print(no_urls)
# Remove emails
no_emails = re.sub(r'\S+@\S+', '[EMAIL]', no_urls)
print("\nEmails replaced:")
print(no_emails)
# Remove numbers (but keep words)
no_numbers = re.sub(r'\d+', '[NUM]', no_emails)
print("\nNumbers replaced:")
print(no_numbers)
Output:
URLs replaced:
Check out [URL] for more info!
Contact us at support@email.com or call 123-456-7890.
Price: $99.99 (50% off!)
Emails replaced:
Check out [URL] for more info!
Contact us at [EMAIL] or call 123-456-7890.
Price: $99.99 (50% off!)
Numbers replaced:
Check out [URL] for more info!
Contact us at [EMAIL] or call [NUM]-[NUM]-[NUM].
Price: $[NUM].[NUM] ([NUM]% off!)
Step 4: Removing Stop Words
Stop words are common words like "the", "is", "at", and "which" that appear frequently but carry little semantic meaning. Removing them reduces noise and helps models focus on the important content words. Most NLP libraries come with predefined stop word lists that you can customize.
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
# Get English stop words
stop_words = set(stopwords.words('english'))
print(f"Number of stop words: {len(stop_words)}")
print(f"Sample stop words: {list(stop_words)[:10]}")
# Remove stop words from text
text = "The quick brown fox jumps over the lazy dog"
tokens = word_tokenize(text.lower())
filtered = [word for word in tokens if word not in stop_words]
print(f"\nOriginal: {text}")
print(f"Tokens: {tokens}")
print(f"After removing stop words: {filtered}")
Output:
Number of stop words: 179
Sample stop words: ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]
Original: The quick brown fox jumps over the lazy dog
Tokens: ['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
After removing stop words: ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']
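Because stop-word lists are plain Python sets, customizing them takes one line per change. This sketch uses a toy six-word stop list standing in for NLTK's (so it runs without downloads); the words added and removed are illustrative:

```python
# Toy stop list standing in for stopwords.words('english') - illustrative only
stop_words = {"the", "is", "at", "over", "a", "not"}

stop_words.add("fox")       # add a domain-specific word to filter out
stop_words.discard("not")   # keep negation - often crucial for sentiment tasks

tokens = ["the", "fox", "is", "not", "lazy"]
filtered = [t for t in tokens if t not in stop_words]
print(filtered)  # ['not', 'lazy']
```

Keeping negation words is a common customization: dropping "not" would make "not good" and "good" look identical to a sentiment model.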
Complete Preprocessing Pipeline
Let us combine all these steps into a reusable preprocessing function. This is a common pattern in NLP projects - you create a pipeline that can be applied consistently to all your text data.
import re
import string
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
def preprocess_text(text):
    """Complete text preprocessing pipeline."""
    # 1. Lowercase
    text = text.lower()
    # 2. Remove URLs
    text = re.sub(r'https?://\S+|www\.\S+', '', text)
    # 3. Remove HTML tags
    text = re.sub(r'<.*?>', '', text)
    # 4. Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))
    # 5. Remove extra whitespace
    text = ' '.join(text.split())
    # 6. Tokenize
    tokens = word_tokenize(text)
    # 7. Remove stop words
    stop_words = set(stopwords.words('english'))
    tokens = [t for t in tokens if t not in stop_words]
    return tokens
# Test the pipeline
sample = "Check out https://nlp.com! NLP is AMAZING - it helps computers understand us."
result = preprocess_text(sample)
print(f"Original: {sample}")
print(f"Processed: {result}")
Output:
Original: Check out https://nlp.com! NLP is AMAZING - it helps computers understand us.
Processed: ['check', 'nlp', 'amazing', 'helps', 'computers', 'understand', 'us']
Practice Questions: Text Preprocessing
Practice your preprocessing skills.
Task: Write code to remove all digits from "The year 2024 has 365 days".
Show Solution
import re
text = "The year 2024 has 365 days"
# Method 1: Using regex
no_digits = re.sub(r'\d+', '', text)
print(no_digits) # The year has days
# Method 2: Using string methods
no_digits = ''.join(c for c in text if not c.isdigit())
print(no_digits) # The year has days
Task: Extract all email addresses from the following text.
text = "Contact john@example.com or support@company.org for help."
Show Solution
import re
text = "Contact john@example.com or support@company.org for help."
# Email regex pattern
emails = re.findall(r'\S+@\S+', text)
print(f"Found emails: {emails}")
# Output: Found emails: ['john@example.com', 'support@company.org']
Task: Clean this noisy review by removing HTML, converting to lowercase, removing punctuation, and removing stop words.
review = "<p>This product is AMAZING!!! I bought it for $29.99... Best purchase EVER!</p>"
Show Solution
import re
import string
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
review = "This product is AMAZING!!! I bought it for $29.99... Best purchase EVER!
"
# Step 1: Remove HTML tags
clean = re.sub(r'<.*?>', '', review)
# Step 2: Lowercase
clean = clean.lower()
# Step 3: Remove punctuation
clean = clean.translate(str.maketrans('', '', string.punctuation))
# Step 4: Tokenize
tokens = word_tokenize(clean)
# Step 5: Remove stop words
stop_words = set(stopwords.words('english'))
final = [t for t in tokens if t not in stop_words and t.isalpha()]
print(f"Original: {review}")
print(f"Cleaned: {final}")
# Output: Cleaned: ['product', 'amazing', 'bought', 'best', 'purchase', 'ever']
Tokenization
Tokenization is the process of breaking text into smaller units called tokens. These tokens can be words, sentences, or even subwords. This fundamental step converts continuous text into discrete elements that machines can process, serving as the foundation for all downstream NLP tasks.
What is Tokenization?
Imagine reading a book without any spaces between words - it would be nearly impossible to understand. Tokenization is how we give computers that same ability to identify word boundaries. While it sounds simple, tokenization is surprisingly complex because languages have different rules and edge cases.
Tokenization
Tokenization is the process of splitting text into individual units called tokens. These tokens can be words, sentences, characters, or subword units depending on the tokenization strategy.
Why it matters: Tokenization is the first step in converting human-readable text into a format that machine learning models can process. The quality of tokenization directly affects model performance.
Word Tokenization
Word tokenization splits text into individual words. While you might think splitting on spaces is enough, real text has contractions ("don't" = "do" + "n't"), hyphenated words ("state-of-the-art"), and punctuation attached to words. Good tokenizers handle these edge cases intelligently.
from nltk.tokenize import word_tokenize, TreebankWordTokenizer
text = "I can't believe it's already 2024! State-of-the-art NLP is amazing."
# Simple split (naive approach)
simple_tokens = text.split()
print(f"Simple split: {simple_tokens}")
# NLTK word_tokenize (handles punctuation and contractions)
nltk_tokens = word_tokenize(text)
print(f"NLTK tokens: {nltk_tokens}")
# Notice how contractions are handled
print(f"\nContraction 'can't' becomes: {word_tokenize(\"can't\")}")
print(f"Contraction \"it's\" becomes: {word_tokenize(\"it's\")}")
Output:
Simple split: ["I", "can't", "believe", "it's", "already", "2024!", "State-of-the-art", "NLP", "is", "amazing."]
NLTK tokens: ['I', 'ca', "n't", 'believe', 'it', "'s", 'already', '2024', '!', 'State-of-the-art', 'NLP', 'is', 'amazing', '.']
Contraction 'can't' becomes: ['ca', "n't"]
Contraction "it's" becomes: ['it', "'s"]
Notice how NLTK intelligently separates punctuation and handles contractions. The word "can't" becomes ["ca", "n't"] because linguistically, "can't" is "can" + "not". This detailed tokenization helps models understand the underlying meaning better.
Sentence Tokenization
Sentence tokenization splits text into individual sentences. This is trickier than it sounds because periods appear in abbreviations (Dr., U.S.A.), decimal numbers (3.14), and URLs. Good sentence tokenizers use context to determine actual sentence boundaries.
from nltk.tokenize import sent_tokenize
text = """Dr. Smith earned $3.5 million in 2023. That's impressive!
The U.S.A. leads in AI research. Visit https://ai.stanford.edu for more info."""
# Sentence tokenization
sentences = sent_tokenize(text)
print("Sentences found:")
for i, sent in enumerate(sentences, 1):
    print(f" {i}. {sent.strip()}")
print(f"\nTotal sentences: {len(sentences)}")
Output:
Sentences found:
1. Dr. Smith earned $3.5 million in 2023.
2. That's impressive!
3. The U.S.A. leads in AI research.
4. Visit https://ai.stanford.edu for more info.
Total sentences: 4
Tokenization with SpaCy
SpaCy provides industrial-strength tokenization with additional linguistic information. When you process text with SpaCy, each token comes with its part-of-speech tag, lemma (base form), and other annotations. This rich information is incredibly useful for downstream tasks.
import spacy
# Load English model
nlp = spacy.load('en_core_web_sm')
text = "Apple is looking at buying U.K. startup for $1 billion"
doc = nlp(text)
# Print tokens with details
print("Token Analysis:")
print("-" * 60)
for token in doc:
    print(f"{token.text:12} | POS: {token.pos_:6} | Lemma: {token.lemma_:10} | Stop: {token.is_stop}")
# Get just the tokens as a list
tokens = [token.text for token in doc]
print(f"\nTokens: {tokens}")
Output:
Token Analysis:
------------------------------------------------------------
Apple | POS: PROPN | Lemma: Apple | Stop: False
is | POS: AUX | Lemma: be | Stop: True
looking | POS: VERB | Lemma: look | Stop: False
at | POS: ADP | Lemma: at | Stop: True
buying | POS: VERB | Lemma: buy | Stop: False
U.K. | POS: PROPN | Lemma: U.K. | Stop: False
startup | POS: NOUN | Lemma: startup | Stop: False
for | POS: ADP | Lemma: for | Stop: True
$ | POS: SYM | Lemma: $ | Stop: False
1 | POS: NUM | Lemma: 1 | Stop: False
billion | POS: NUM | Lemma: billion | Stop: False
Tokens: ['Apple', 'is', 'looking', 'at', 'buying', 'U.K.', 'startup', 'for', '$', '1', 'billion']
- Part-of-speech (POS) tags identify word types: PROPN (proper noun), VERB, ADP (preposition)
- Lemmas give base forms for normalization: "looking" → "look", "is" → "be"
- The is_stop flag marks common words (is, at, for) for easy filtering
Subword Tokenization
Modern NLP models like BERT and GPT use subword tokenization, which breaks words into smaller meaningful units. This handles unknown words elegantly - even if a word was not in the training data, its subwords probably were. For example, "unhappiness" might become ["un", "happiness"].
# Using Hugging Face tokenizers
from transformers import AutoTokenizer
# Load BERT tokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
text = "Tokenization is fundamental for NLP preprocessing."
# Tokenize
tokens = tokenizer.tokenize(text)
print(f"Text: {text}")
print(f"Tokens: {tokens}")
# Convert to IDs
token_ids = tokenizer.encode(text)
print(f"Token IDs: {token_ids}")
# Handling unknown words
rare_word = "supercalifragilistic"
rare_tokens = tokenizer.tokenize(rare_word)
print(f"\nRare word '{rare_word}' becomes: {rare_tokens}")
Output:
Text: Tokenization is fundamental for NLP preprocessing.
Tokens: ['token', '##ization', 'is', 'fundamental', 'for', 'nl', '##p', 'prep', '##ro', '##ces', '##sing', '.']
Token IDs: [101, 19204, 3989, 2003, 8050, 2005, 17953, 2361, 17531, 9541, 9623, 2075, 1012, 102]
Rare word 'supercalifragilistic' becomes: ['super', '##cal', '##if', '##rag', '##ili', '##stic']
The "##" prefix indicates that a token is a continuation of the previous token (not a new word). This subword approach allows models to handle any word, even ones never seen during training, by breaking them into familiar pieces.
Tokenization Methods Comparison
| Method | Best For | Pros | Cons |
|---|---|---|---|
| Word | Traditional NLP, BoW, TF-IDF | Simple, intuitive | Large vocabulary, OOV words |
| Sentence | Summarization, translation | Preserves structure | Ambiguous boundaries |
| Character | Spelling correction, some languages | Tiny vocabulary | Loses word meaning |
| Subword | Modern transformers (BERT, GPT) | Handles any word, compact | Requires pretrained tokenizer |
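Character tokenization, listed in the table but not shown earlier, is the simplest strategy of the four - in Python it is just `list(text)`:

```python
text = "NLP is fun"

# Every character, including spaces, becomes a token
char_tokens = list(text)
print(char_tokens)
# ['N', 'L', 'P', ' ', 'i', 's', ' ', 'f', 'u', 'n']
print(f"{len(char_tokens)} tokens, {len(set(char_tokens))} unique characters")
# 10 tokens, 9 unique characters
```

The tiny vocabulary (roughly 100 symbols covers English text) is the appeal; the cost, as the table notes, is that individual characters carry no word-level meaning.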
Practice Questions: Tokenization
Practice your tokenization skills.
Task: Tokenize and count the number of words in "Machine learning is a subset of artificial intelligence".
Show Solution
from nltk.tokenize import word_tokenize
text = "Machine learning is a subset of artificial intelligence"
tokens = word_tokenize(text)
print(f"Tokens: {tokens}")
print(f"Word count: {len(tokens)}")
# Output: Word count: 8
Task: Split the following paragraph into sentences and print each one with its word count.
paragraph = "NLP is fascinating. It powers virtual assistants. Machine translation is another application."
Show Solution
from nltk.tokenize import sent_tokenize, word_tokenize
paragraph = "NLP is fascinating. It powers virtual assistants. Machine translation is another application."
sentences = sent_tokenize(paragraph)
for i, sent in enumerate(sentences, 1):
    word_count = len(word_tokenize(sent))
    print(f"Sentence {i} ({word_count} words): {sent}")
# Output:
# Sentence 1 (4 words): NLP is fascinating.
# Sentence 2 (5 words): It powers virtual assistants.
# Sentence 3 (6 words): Machine translation is another application.
# (Note: word_tokenize counts the final period as a token.)
Task: Compare NLTK word tokenization with BERT subword tokenization for the sentence "Transformers revolutionized NLP".
Show Solution
from nltk.tokenize import word_tokenize
from transformers import AutoTokenizer
text = "Transformers revolutionized NLP"
# Word tokenization
word_tokens = word_tokenize(text)
print(f"Word tokens ({len(word_tokens)}): {word_tokens}")
# Subword tokenization (BERT)
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
subword_tokens = tokenizer.tokenize(text)
print(f"Subword tokens ({len(subword_tokens)}): {subword_tokens}")
# Output:
# Word tokens (3): ['Transformers', 'revolutionized', 'NLP']
# Subword tokens (5): ['transformers', 'revolution', '##ized', 'nl', '##p']
Text Representation
Machine learning models cannot directly process text - they need numbers. Text representation is the art of converting words and documents into numerical vectors that capture meaning. The quality of your text representation directly determines how well your model can understand and process language.
Why Do We Need Text Representation?
Computers understand numbers, not words. When you feed text to a machine learning algorithm, it must be converted into a numerical format. The challenge is doing this conversion in a way that preserves the meaning and relationships between words. Different representation methods capture different aspects of text.
Text Vectorization
Text vectorization is the process of converting text into numerical vectors. Each document becomes a point in high-dimensional space, where similar documents are closer together and dissimilar documents are farther apart.
The goal: Create numerical representations where semantically similar texts have similar vector representations, enabling mathematical operations on language.
Bag of Words (BoW)
The Bag of Words model is the simplest text representation. It counts how many times each word appears in a document, ignoring grammar and word order. Think of it as dumping all words into a "bag" and just counting them. Despite its simplicity, BoW works surprisingly well for many classification tasks.
from sklearn.feature_extraction.text import CountVectorizer
# Sample documents
documents = [
"Machine learning is amazing",
"Deep learning is a subset of machine learning",
"NLP uses machine learning techniques"
]
# Create CountVectorizer (Bag of Words)
vectorizer = CountVectorizer()
bow_matrix = vectorizer.fit_transform(documents)
# Get feature names (vocabulary)
vocab = vectorizer.get_feature_names_out()
print("Vocabulary:", list(vocab))
# Convert to array and display
import pandas as pd
df = pd.DataFrame(bow_matrix.toarray(), columns=vocab)
print("\nBag of Words Matrix:")
print(df)
Output:
Vocabulary: ['amazing', 'deep', 'is', 'learning', 'machine', 'nlp', 'of', 'subset', 'techniques', 'uses']
Bag of Words Matrix:
amazing deep is learning machine nlp of subset techniques uses
0 1 0 1 1 1 0 0 0 0 0
1 0 1 1 2 1 0 1 1 0 0
2 0 0 0 1 1 1 0 0 1 1
Each row represents a document, and each column represents a unique word. The values show word counts. Notice how "learning" appears twice in Document 2 (index 1), so it has a value of 2 in that row.
TF-IDF: Term Frequency - Inverse Document Frequency
TF-IDF improves on BoW by weighing words based on their importance. Words that appear frequently in one document but rarely in others get higher weights. Common words like "the" and "is" get lower weights because they appear everywhere and carry little meaning.
Term Frequency (TF)
Measures how often a word appears in a document relative to the total number of words. Words that appear more frequently in a specific document are considered more important to that document's meaning. A word appearing 5 times in a 100-word document has TF = 0.05.
TF = count(word) / total_words
Inverse Document Frequency (IDF)
Measures how rare or unique a word is across all documents in the corpus. Rare words that appear in only a few documents get higher IDF scores, while common words like "the" or "is" that appear everywhere get lower scores, reducing their influence.
IDF = log(N / df)
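The two formulas above fit in a few lines of plain Python. This sketch applies the raw textbook definitions to the three example documents; note that scikit-learn's `TfidfVectorizer` (used next) adds smoothing and L2 normalization, so its numbers will differ:

```python
import math

docs = [
    ["machine", "learning", "is", "amazing"],
    ["deep", "learning", "is", "a", "subset", "of", "machine", "learning"],
    ["nlp", "uses", "machine", "learning", "techniques"],
]

def tf(word, doc):
    return doc.count(word) / len(doc)         # TF = count(word) / total_words

def idf(word, corpus):
    df = sum(1 for d in corpus if word in d)  # number of documents containing the word
    return math.log(len(corpus) / df)         # IDF = log(N / df)

# "machine" appears in every document -> IDF = log(3/3) = 0 -> TF-IDF = 0
print(f"tf-idf('machine', doc 1): {tf('machine', docs[0]) * idf('machine', docs):.3f}")
# "amazing" appears only in document 1 -> it gets real weight
print(f"tf-idf('amazing', doc 1): {tf('amazing', docs[0]) * idf('amazing', docs):.3f}")
```

With the raw definition, a word that appears in every document is zeroed out entirely; sklearn's smoothed IDF instead keeps such words small but nonzero.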
from sklearn.feature_extraction.text import TfidfVectorizer
# Same documents as before
documents = [
"Machine learning is amazing",
"Deep learning is a subset of machine learning",
"NLP uses machine learning techniques"
]
# Create TF-IDF vectorizer
tfidf = TfidfVectorizer()
tfidf_matrix = tfidf.fit_transform(documents)
# Get feature names
vocab = tfidf.get_feature_names_out()
# Display as DataFrame
import pandas as pd
df = pd.DataFrame(tfidf_matrix.toarray().round(3), columns=vocab)
print("TF-IDF Matrix:")
print(df)
Output:
TF-IDF Matrix:
   amazing   deep     is  learning  machine    nlp     of  subset  techniques   uses
0    0.663  0.000  0.504     0.391    0.391  0.000  0.000   0.000       0.000  0.000
1    0.000  0.433  0.330     0.512    0.256  0.000  0.433   0.433       0.000  0.000
2    0.000  0.000  0.000     0.307    0.307  0.520  0.000   0.000       0.520  0.307
Compare this to BoW: "amazing" has the highest score in Document 1 (0.663) because it appears only there. Meanwhile, "machine" and "learning" have lower scores because they appear in all three documents - they are less distinctive. TF-IDF naturally identifies the most important words for each document.
Practical Example: Document Similarity
Once we have vectors, we can measure how similar documents are using cosine similarity. This is the foundation of search engines, recommendation systems, and document clustering.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
# Query and documents
query = "machine learning applications"
documents = [
"Machine learning is used in many applications",
"Deep learning is a type of machine learning",
"Cooking recipes for beginners",
"NLP is an application of machine learning"
]
# Vectorize query and documents together
tfidf = TfidfVectorizer()
all_texts = [query] + documents
tfidf_matrix = tfidf.fit_transform(all_texts)
# Calculate similarity between query and each document
query_vector = tfidf_matrix[0]
doc_vectors = tfidf_matrix[1:]
similarities = cosine_similarity(query_vector, doc_vectors)[0]
# Rank documents by similarity
print("Document Similarity to Query:")
print("-" * 50)
for i, (doc, score) in enumerate(zip(documents, similarities)):
    print(f"Score: {score:.3f} | Doc {i+1}: {doc[:40]}...")
Output:
Document Similarity to Query:
--------------------------------------------------
Score: 0.638 | Doc 1: Machine learning is used in many applic...
Score: 0.256 | Doc 2: Deep learning is a type of machine lear...
Score: 0.000 | Doc 3: Cooking recipes for beginners...
Score: 0.391 | Doc 4: NLP is an application of machine learni...
Document 1 is most similar because it shares "machine", "learning", and "applications" with the query. Document 3 (cooking recipes) has zero similarity - it shares no vocabulary with our query. This is exactly how search engines find relevant results!
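Under the hood, cosine similarity is just the dot product of two vectors divided by the product of their lengths. A plain-Python sketch on toy count vectors (the three-word vocabulary is made up for illustration):

```python
import math

def cosine_sim(a, b):
    """cos(a, b) = (a . b) / (|a| * |b|)"""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Count vectors over the toy vocabulary ['learning', 'machine', 'cooking']
query    = [1, 1, 0]   # "machine learning"
doc_ml   = [2, 1, 0]   # learning x2, machine x1
doc_food = [0, 0, 3]   # cooking terms only

print(f"query vs ML doc:   {cosine_sim(query, doc_ml):.3f}")    # 0.949
print(f"query vs food doc: {cosine_sim(query, doc_food):.3f}")  # 0.000
```

Because cosine similarity depends only on the angle between vectors, a long document and a short one about the same topic can still score as highly similar.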
Text Representation Methods Comparison
| Method | How It Works | Pros | Cons |
|---|---|---|---|
| Bag of Words | Counts word occurrences | Simple, fast, interpretable | No semantics, sparse vectors |
| TF-IDF | Weighs by importance | Better than BoW for search | Still no semantic understanding |
| Word2Vec | Learns word embeddings | Captures semantic similarity | Requires large training data |
| BERT Embeddings | Contextual embeddings | State-of-the-art quality | Computationally expensive |
Practice Questions: Text Representation
Build your text vectorization skills.
Task: Use CountVectorizer to create BoW vectors for "I love NLP" and "NLP is fun".
Show Solution
from sklearn.feature_extraction.text import CountVectorizer
sentences = ["I love NLP", "NLP is fun"]
vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(sentences)
print("Vocabulary:", vectorizer.get_feature_names_out())
print("BoW vectors:\n", bow.toarray())
# Output: vocabulary ['fun' 'is' 'love' 'nlp'] -> [[0 0 1 1], [1 1 0 1]]
# ("I" is dropped: CountVectorizer ignores single-character tokens by default)
Task: Given three documents, find the most distinctive word in each using TF-IDF scores.
docs = ["Python is great for data science",
"Java is popular for enterprise",
"JavaScript runs in browsers"]
Solution:
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
docs = ["Python is great for data science",
"Java is popular for enterprise",
"JavaScript runs in browsers"]
tfidf = TfidfVectorizer()
matrix = tfidf.fit_transform(docs)
vocab = tfidf.get_feature_names_out()
for i, doc in enumerate(docs):
    scores = matrix[i].toarray()[0]
    top_idx = np.argmax(scores)
    print(f"Doc {i+1}: Most distinctive word = '{vocab[top_idx]}' (score: {scores[top_idx]:.3f})")
Task: Create a function that takes a query and returns the most similar document from a corpus.
Solution:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
def search_documents(query, corpus):
    """Find the most similar document to the query."""
    tfidf = TfidfVectorizer()
    # Fit on the corpus, then transform both corpus and query
    corpus_vectors = tfidf.fit_transform(corpus)
    query_vector = tfidf.transform([query])
    # Calculate similarities
    similarities = cosine_similarity(query_vector, corpus_vectors)[0]
    # Find the best match
    best_idx = similarities.argmax()
    return corpus[best_idx], similarities[best_idx]
# Test it
corpus = ["Learn Python programming",
"Data science with Python",
"Web development basics"]
query = "Python data analysis"
result, score = search_documents(query, corpus)
print(f"Best match: '{result}' (similarity: {score:.3f})")
Common NLP Tasks
NLP encompasses a wide range of tasks, from classifying sentiment in reviews to extracting named entities from documents. Understanding these tasks helps you identify which techniques to apply to your specific use case, whether you're building a chatbot, analyzing customer feedback, or automating document processing.
The NLP Task Landscape
NLP tasks can be grouped into categories based on what they accomplish. Some tasks analyze text to extract information (like sentiment or entities), while others generate new text (like translation or summarization). Let's explore the most common tasks you'll encounter as a beginner.
Sentiment Analysis
Determines the emotional tone of text - positive, negative, or neutral. Widely used for analyzing product reviews, social media posts, and customer feedback to understand public opinion and brand perception.
Named Entity Recognition
Identifies and classifies named entities like people, organizations, locations, dates, and monetary values in text. Essential for extracting structured information from unstructured documents.
Text Classification
Categorizes documents into predefined classes based on their content. Powers spam detection, topic labeling, intent recognition in chatbots, and content moderation systems.
Sentiment Analysis
Sentiment analysis determines the emotional tone behind text. Is this review positive or negative? Is this tweet expressing happiness or frustration? Companies use sentiment analysis to monitor brand perception, analyze customer feedback, and track public opinion on social media.
from textblob import TextBlob
# Sample reviews to analyze
reviews = [
"This product is absolutely amazing! Best purchase ever.",
"Terrible quality. Complete waste of money.",
"It's okay, nothing special but it works.",
"I love this! Exceeded all my expectations!"
]
print("Sentiment Analysis Results:")
print("-" * 60)
for review in reviews:
    blob = TextBlob(review)
    polarity = blob.sentiment.polarity  # -1 (negative) to 1 (positive)
    if polarity > 0.1:
        sentiment = "Positive"
    elif polarity < -0.1:
        sentiment = "Negative"
    else:
        sentiment = "Neutral"
    print(f"{sentiment:8} (score: {polarity:+.2f}) | {review[:40]}...")
Output:
Sentiment Analysis Results:
------------------------------------------------------------
Positive (score: +0.62) | This product is absolutely amazing! Bes...
Negative (score: -0.65) | Terrible quality. Complete waste of mon...
Neutral (score: +0.00) | It's okay, nothing special but it works...
Positive (score: +0.50) | I love this! Exceeded all my expectatio...
Named Entity Recognition (NER)
Named Entity Recognition identifies and classifies named entities in text. It answers questions like: Who is mentioned? What companies? Which locations? This is crucial for information extraction, search engines, and building knowledge graphs.
import spacy
# Load English model
nlp = spacy.load('en_core_web_sm')
text = """Apple Inc. announced that Tim Cook will visit the new headquarters
in Cupertino, California next Monday. The company plans to invest $5 billion
in AI research by 2025."""
doc = nlp(text)
print("Named Entities Found:")
print("-" * 50)
for ent in doc.ents:
    print(f"{ent.text:20} | Type: {ent.label_:12} | Description: {spacy.explain(ent.label_)}")
# Visualize entities (in Jupyter)
# from spacy import displacy
# displacy.render(doc, style='ent')
Output:
Named Entities Found:
--------------------------------------------------
Apple Inc. | Type: ORG | Description: Companies, agencies, institutions
Tim Cook | Type: PERSON | Description: People, including fictional
Cupertino | Type: GPE | Description: Countries, cities, states
California | Type: GPE | Description: Countries, cities, states
next Monday | Type: DATE | Description: Absolute or relative dates or periods
$5 billion | Type: MONEY | Description: Monetary values, including unit
2025 | Type: DATE | Description: Absolute or relative dates or periods
spaCy's NER model automatically extracted 7 entities from our text:
Common Entity Types
| Entity Type | Description | Examples |
|---|---|---|
| PERSON | People, including fictional | Tim Cook, Albert Einstein, Harry Potter |
| ORG | Companies, agencies, institutions | Apple, NASA, United Nations |
| GPE | Countries, cities, states | India, New York, California |
| DATE | Dates and time periods | January 2024, next week, 1990s |
| MONEY | Monetary values | $100, 50 euros, 1 million dollars |
Text Classification
Text classification assigns predefined categories to documents. Think of your email inbox - spam detection is a classic text classification problem. Other applications include topic labeling, language detection, and intent classification for chatbots.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
# Training data: (text, category)
train_texts = [
"Get rich quick! Win $1000 now!",
"Meeting scheduled for tomorrow at 3pm",
"Claim your free prize today!!!",
"Please review the attached document",
"Congratulations! You've won a lottery",
"Project deadline extended to Friday"
]
train_labels = ["spam", "not_spam", "spam", "not_spam", "spam", "not_spam"]
# Create a simple classifier pipeline
classifier = Pipeline([
('tfidf', TfidfVectorizer()),
('clf', MultinomialNB())
])
# Train the model
classifier.fit(train_texts, train_labels)
# Test on new emails
test_emails = [
"Free money waiting for you!",
"Can we schedule a meeting?",
"You've been selected for a cash prize!"
]
print("Email Classification Results:")
print("-" * 50)
for email in test_emails:
    prediction = classifier.predict([email])[0]
    confidence = classifier.predict_proba([email]).max()
    print(f"[{prediction.upper():8}] ({confidence:.0%}) {email}")
Output:
Email Classification Results:
--------------------------------------------------
[SPAM ] (89%) Free money waiting for you!
[NOT_SPAM] (76%) Can we schedule a meeting?
[SPAM ] (92%) You've been selected for a cash prize!
The Classification Pipeline
- Pattern Learning: Identifies spam keywords like "free", "win", "prize", "money"
- TF-IDF Vectorization: Converts text to numerical feature vectors
- Naive Bayes Prediction: Calculates the probability of each category
Result: 6 training examples → 89-92% confidence on new emails!
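The Naive Bayes step can be illustrated with a hand-sized calculation. All probabilities below are invented for intuition only; a real model estimates them from training-word counts:

```python
# P(class | words) is proportional to P(class) * product of P(word | class)
# (the "naive" independence assumption). Probabilities below are made up.
p_class = {"spam": 0.5, "not_spam": 0.5}
p_word = {
    "spam":     {"free": 0.30, "money": 0.25, "meeting": 0.01},
    "not_spam": {"free": 0.02, "money": 0.05, "meeting": 0.20},
}

words = ["free", "money"]
scores = {}
for label in p_class:
    score = p_class[label]
    for w in words:
        score *= p_word[label][w]  # multiply per-word likelihoods
    scores[label] = score

# Normalize the scores into posterior probabilities
total = sum(scores.values())
for label, score in scores.items():
    print(f"P({label} | 'free money') = {score / total:.3f}")
```

Because "free" and "money" are far more likely under the spam class, the posterior probability of spam dominates, which is the same mechanism behind the 89-92% confidences above.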
Other Important NLP Tasks
Machine Translation
Automatically translates text from one language to another while preserving meaning and context. Modern neural machine translation uses deep learning to produce human-quality translations, powering services like Google Translate and enabling global communication.
Text Summarization
Condenses long documents into shorter versions while retaining key information. Can be extractive (selecting important sentences) or abstractive (generating new sentences). Essential for processing news articles, research papers, and legal documents efficiently.
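A toy version of extractive summarization can reuse TF-IDF: treat each sentence as a "document", score it by its total term weight, and keep the top scorer. This is a rough sketch for intuition, not a production summarizer:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

sentences = [
    "NLP helps computers understand human language.",
    "The weather was nice yesterday.",
    "Modern NLP systems power translation, chatbots, and search.",
]

# Score each sentence by the sum of its TF-IDF term weights
tfidf = TfidfVectorizer(stop_words="english")
matrix = tfidf.fit_transform(sentences)
scores = matrix.sum(axis=1).A1  # .A1 flattens the row sums to a 1-D array

best = scores.argmax()
print("Extractive summary:", sentences[best])
```

Sentences dense with distinctive content words score highest, while the off-topic weather sentence scores lowest; abstractive summarizers instead generate new sentences with neural models.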
Question Answering
Extracts precise answers from text given a natural language question. Powers virtual assistants, FAQ systems, and search engines. Can work with structured knowledge bases or unstructured text documents using reading comprehension techniques.
Chatbots and Dialogue
Builds conversational agents that understand context, maintain dialogue history, and respond naturally to user queries. Combines multiple NLP tasks including intent recognition, entity extraction, and response generation to create intelligent assistants.
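To show how these pieces fit together, here is a tiny intent-recognition sketch that reuses the TF-IDF + Naive Bayes pipeline from the spam example. The training phrases, intent labels, and canned responses are invented for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Tiny invented training set mapping user phrases to intents
phrases = [
    "what's the weather today", "is it going to rain",
    "set an alarm for 7am", "wake me up at six",
    "play some jazz music", "put on my workout playlist",
]
intents = ["weather", "weather", "alarm", "alarm", "music", "music"]

bot = Pipeline([("tfidf", TfidfVectorizer()), ("nb", MultinomialNB())])
bot.fit(phrases, intents)

# Map each recognized intent to a canned response
responses = {
    "weather": "Checking the forecast...",
    "alarm": "Alarm set!",
    "music": "Playing music now.",
}
intent = bot.predict(["will it rain tomorrow"])[0]
print(f"Intent: {intent} -> {responses[intent]}")
```

A real assistant would layer entity extraction (e.g. pulling "7am" out of an alarm request) and dialogue history on top of this intent-classification step.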
Practice Questions: NLP Tasks
Apply your NLP knowledge to real tasks.
Task: Use TextBlob to determine if "The battery life is disappointing but the camera is excellent" is positive or negative.
Solution:
from textblob import TextBlob
review = "The battery life is disappointing but the camera is excellent"
blob = TextBlob(review)
print(f"Polarity: {blob.sentiment.polarity:.2f}")
print(f"Subjectivity: {blob.sentiment.subjectivity:.2f}")
# Output: Mixed sentiment - slightly positive (camera excellence outweighs battery disappointment)
Task: Extract all person names from: "Elon Musk met with Sundar Pichai to discuss AI safety. Bill Gates joined via video call."
Solution:
import spacy
nlp = spacy.load('en_core_web_sm')
text = "Elon Musk met with Sundar Pichai to discuss AI safety. Bill Gates joined via video call."
doc = nlp(text)
people = [ent.text for ent in doc.ents if ent.label_ == 'PERSON']
print(f"People mentioned: {people}")
# Output: ['Elon Musk', 'Sundar Pichai', 'Bill Gates']
Task: Create a classifier that categorizes text into "sports", "technology", or "food" topics.
Solution:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
# Training data
texts = [
"The team scored a winning goal", "Basketball playoffs begin",
"New smartphone released", "AI breakthrough announced",
"Best pizza recipe", "Restaurant reviews"
]
labels = ["sports", "sports", "tech", "tech", "food", "food"]
# Train classifier
clf = Pipeline([('tfidf', TfidfVectorizer()), ('nb', MultinomialNB())])
clf.fit(texts, labels)
# Test
test = ["The football match was exciting", "New laptop features"]
predictions = clf.predict(test)
print(f"Predictions: {list(zip(test, predictions))}")
Key Takeaways
NLP Bridges Human-Machine Communication
Natural Language Processing enables computers to understand, interpret, and generate human language, powering applications from chatbots to translation services
Preprocessing is Essential
Text cleaning (lowercasing, removing punctuation, handling special characters) and normalization are crucial steps before any NLP analysis
Tokenization Breaks Text into Units
Word, sentence, and subword tokenization convert continuous text into discrete tokens that machines can process and analyze
Text Must Become Numbers
Bag of Words counts word occurrences, while TF-IDF weighs term importance. Both convert text to numerical vectors for machine learning
Stop Words and Stemming Reduce Noise
Removing common words (the, is, at) and reducing words to their root form helps focus on meaningful content and reduces vocabulary size
NLP Powers Many Applications
Sentiment analysis, named entity recognition, text classification, and machine translation are just a few of the many practical NLP applications
Knowledge Check
Test your understanding of Natural Language Processing fundamentals:
What is the primary goal of Natural Language Processing (NLP)?
Which preprocessing step converts "Running" and "RUNNING" to the same form?
What does tokenization do to the sentence "I love NLP"?
What is the main difference between Bag of Words and TF-IDF?
Which words are typically removed as "stop words" in NLP preprocessing?
What NLP task determines if a movie review is positive or negative?