Series vs DataFrame
Pandas has two core data structures: Series (1D) and DataFrame (2D). Think of a Series as a single column and a DataFrame as an entire spreadsheet.
Labeled Data Structures
Unlike NumPy arrays, Pandas structures have labels (index for rows, column names for columns). This makes data access intuitive and self-documenting.
Why it matters: Access data by meaningful names like df['price'] instead of cryptic indices like arr[:, 2].
Series (1D)
import pandas as pd
# Series = single column with index
prices = pd.Series([10.5, 20.3, 15.8],
                   index=['A', 'B', 'C'])
print(prices)
# A 10.5
# B 20.3
# C 15.8
print(prices['B']) # 20.3
DataFrame (2D)
import pandas as pd
# DataFrame = table with rows and columns
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Carol'],
    'age': [25, 30, 35],
    'city': ['NYC', 'LA', 'Chicago']
})
print(df)
# name age city
# 0 Alice 25 NYC
# 1 Bob 30 LA
# 2 Carol 35 Chicago
A DataFrame is essentially a collection of Series that share the same index. Each column can be accessed as a Series, and each row shares the same index across all columns. This structure enables powerful operations like filtering, grouping, and merging.
Think of a Series as a single labeled column and a DataFrame as multiple columns joined together. The index (row labels) aligns data when combining DataFrames or performing operations between columns.
Series Operations
Series support vectorized operations, making mathematical computations fast and readable. You can perform arithmetic directly on Series objects.
Arithmetic Operations
import pandas as pd
prices = pd.Series([100, 200, 300])
# Scalar operations
doubled = prices * 2
discounted = prices * 0.9
# Series operations
quantities = pd.Series([5, 10, 15])
totals = prices * quantities
print(totals)
# 0 500
# 1 2000
# 2 4500
Statistical Methods
import pandas as pd
data = pd.Series([10, 20, 30, 40, 50])
print(data.sum()) # 150
print(data.mean()) # 30.0
print(data.std()) # 15.81
print(data.min()) # 10
print(data.max()) # 50
print(data.median()) # 30.0
# Cumulative operations
print(data.cumsum())
# 0 10
# 1 30
# 2 60
# 3 100
# 4 150
Series operations are element-wise by default, meaning each operation applies independently to every element. When multiplying a Series by 2, each value doubles. When multiplying two Series together, corresponding elements multiply (position 0 with position 0, etc.).
Statistical methods like sum(), mean(), and std() summarize the entire Series into a single value. cumsum() is different: it computes a running total, returning a new Series where each element is the sum of all values up to and including that position.
Index Alignment
One of Pandas' most powerful features is automatic index alignment. When you perform operations between Series, Pandas aligns them by their index labels, not positions.
import pandas as pd
# Two series with different indices
s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([10, 20, 30], index=['b', 'c', 'd'])
# Addition aligns by index
result = s1 + s2
print(result)
# a NaN # 'a' only in s1
# b 12.0 # 2 + 10
# c 23.0 # 3 + 20
# d NaN # 'd' only in s2
When you perform arithmetic operations between two Series with different indices, Pandas automatically aligns them by their index labels before computing. Here, s1 has indices 'a', 'b', 'c' and s2 has 'b', 'c', 'd'. Only 'b' and 'c' exist in both, so those get computed (2+10=12, 3+20=23). Index 'a' only exists in s1 and 'd' only in s2, resulting in NaN for those positions.
This behavior is incredibly powerful for real-world data where datasets often have missing values or different coverage. Unlike NumPy where position-based operations can silently produce wrong results, Pandas ensures data integrity by matching on labels.
# Fill missing with default value
result = s1.add(s2, fill_value=0)
print(result)
# a 1.0
# b 12.0
# c 23.0
# d 30.0
If you want to treat missing values as a specific number instead of NaN, use the fill_value parameter. Here, fill_value=0 means: if an index exists in one Series but not the other, treat the missing value as 0. So for index 'a', it computes 1+0=1; for index 'd', it computes 0+30=30.
This is useful when you're combining data and want a sensible default. For example, if you're adding sales from two regions and one region has no sales for a product, you'd want to treat it as 0 rather than making the total NaN.
String Operations
The str accessor provides string methods for Series containing text data. These operations are vectorized for efficiency.
import pandas as pd
names = pd.Series(['alice smith', 'bob jones', 'carol davis'])
# String methods via .str accessor
print(names.str.upper())
# 0 ALICE SMITH
# 1 BOB JONES
# 2 CAROL DAVIS
print(names.str.title())
# 0 Alice Smith
# 1 Bob Jones
# 2 Carol Davis
The .str accessor unlocks all Python string methods for entire Series at once. Instead of writing a loop to process each string, you call the method once and it applies to every element. str.upper() converts all characters to uppercase, while str.title() capitalizes the first letter of each word.
This vectorized approach is much faster than Python loops, especially for large datasets. Behind the scenes, Pandas optimizes these operations using C-level code rather than Python's interpreter.
print(names.str.split())
# 0 [alice, smith]
# 1 [bob, jones]
# 2 [carol, davis]
# Check if contains substring
print(names.str.contains('smith'))
# 0 True
# 1 False
# 2 False
str.split() splits each string by whitespace (or any delimiter you specify), returning a Series of lists. This is useful for extracting first/last names or parsing structured text. str.contains() returns a boolean Series indicating whether each element contains the specified substring.
The contains() method is particularly powerful for filtering data. You can use it with regular expressions by setting regex=True, enabling complex pattern matching across your entire dataset in a single operation.
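As a small illustrative sketch (the sample pattern here is invented), str.contains with regex=True matches patterns rather than literal substrings:

```python
import pandas as pd

names = pd.Series(['alice smith', 'bob jones', 'carol davis'])

# regex=True: match names whose last word is 'smith' or 'jones'
mask = names.str.contains(r'(?:smith|jones)$', regex=True)
print(mask)
# 0     True
# 1     True
# 2    False
```

The `(?:...)` non-capturing group avoids the warning Pandas emits when a pattern contains capture groups.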
# Get string length
print(names.str.len())
# 0 11
# 1 9
# 2 11
str.len() returns the length of each string. This is useful for data validation (checking if entries meet minimum/maximum length requirements) or for analysis (finding outliers with unusually long or short text).
Other commonly used string methods include str.replace() for substitutions, str.strip() for removing whitespace, str.startswith()/str.endswith() for prefix/suffix checks, and str.extract() for regex group extraction.
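A quick sketch of these methods on invented sample data (the codes below are hypothetical):

```python
import pandas as pd

codes = pd.Series(['  AB-123 ', 'CD-456', ' EF-789'])

cleaned = codes.str.strip()               # remove surrounding whitespace
replaced = cleaned.str.replace('-', '_')  # substitute a literal character
starts = cleaned.str.startswith('AB')     # boolean prefix check
digits = cleaned.str.extract(r'(\d+)')    # regex group extraction -> DataFrame
print(replaced.tolist())   # ['AB_123', 'CD_456', 'EF_789']
print(starts.tolist())     # [True, False, False]
print(digits[0].tolist())  # ['123', '456', '789']
```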
Pandas Data Types
Understanding data types is crucial for memory efficiency and proper operations. Pandas uses NumPy types plus its own specialized types.
| Dtype | Description | Example Values | Memory |
|---|---|---|---|
| int64 | 64-bit integer | 1, -100, 999 | 8 bytes |
| float64 | 64-bit floating point | 3.14, -2.5, NaN | 8 bytes |
| object | Python objects (usually strings) | 'hello', 'world' | Variable |
| bool | Boolean values | True, False | 1 byte |
| datetime64 | Date and time | 2024-01-15 10:30:00 | 8 bytes |
| timedelta64 | Time differences | 5 days, 3 hours | 8 bytes |
| category | Categorical data (limited values) | 'low', 'medium', 'high' | Memory efficient |
| string | Pandas StringDtype | 'text', 'data' | Variable |
import pandas as pd
# Check and convert types
df = pd.DataFrame({
'age': [25, 30, 35],
'name': ['Alice', 'Bob', 'Carol'],
'active': [True, False, True]
})
print(df.dtypes)
# age int64
# name object
# active bool
df.dtypes shows the data type of each column. Pandas infers types when creating a DataFrame: integers become int64, strings become 'object' (a general-purpose type), and True/False become bool. The 'object' dtype for strings is flexible but can be memory-inefficient.
Knowing your data types is essential for two reasons: some operations only work on certain types (you can't calculate mean of strings), and choosing efficient types can dramatically reduce memory usage for large datasets.
# Convert types
df['age'] = df['age'].astype('int32') # Save memory
df['name'] = df['name'].astype('string') # Use StringDtype
df['category'] = pd.Categorical(['A', 'B', 'A']) # Categorical
astype() converts a column to a different type. Converting int64 to int32 halves memory usage when you don't need the extra precision. The 'string' dtype is Pandas' dedicated string type with better handling of missing values than 'object'.
pd.Categorical() is powerful for columns with limited unique values (like status, gender, department). It stores each unique value once and uses integer codes internally, saving huge amounts of memory. A column with 1 million rows but only 5 unique categories uses a tiny fraction of the memory.
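You can see the integer codes directly via the .cat accessor; here is a minimal sketch with made-up size data:

```python
import pandas as pd

sizes = pd.Series(['S', 'M', 'L', 'M', 'S']).astype('category')

# Unique values are stored once as categories...
print(sizes.cat.categories.tolist())  # ['L', 'M', 'S']
# ...and each row holds only a small integer code pointing at them
print(sizes.cat.codes.tolist())       # [2, 1, 0, 1, 2]
```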
Practice: Series Basics
Task: Create a Series of temperatures for days Monday through Friday with values 72, 75, 68, 70, 73.
Solution:
import pandas as pd
temps = pd.Series(
    [72, 75, 68, 70, 73],
    index=['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday']
)
print(temps)
print(f"Wednesday temp: {temps['Wednesday']}")
Task: Create a Series of 5 numbers and print sum, mean, and standard deviation.
Solution:
import pandas as pd
numbers = pd.Series([10, 20, 30, 40, 50])
print(f"Sum: {numbers.sum()}")
print(f"Mean: {numbers.mean()}")
print(f"Std: {numbers.std():.2f}")
Task: Create a Series of email addresses and extract the domain (part after @).
Solution:
import pandas as pd
emails = pd.Series([
    'alice@gmail.com',
    'bob@yahoo.com',
    'carol@outlook.com'
])
domains = emails.str.split('@').str[1]
print(domains)
Task: Create two Series with overlapping indices and add them, using fill_value=0 for missing values.
Solution:
import pandas as pd
s1 = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
s2 = pd.Series([100, 200, 300], index=['b', 'c', 'd'])
result = s1.add(s2, fill_value=0)
print(result)
Task: Create a DataFrame and convert columns to more memory-efficient types (int32, category).
Solution:
import pandas as pd
df = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'size': ['S', 'M', 'L', 'M', 'S'],
    'price': [10, 20, 30, 20, 10]
})
# Check memory before
print(f"Before: {df.memory_usage(deep=True).sum()} bytes")
# Convert types
df['id'] = df['id'].astype('int32')
df['size'] = df['size'].astype('category')
df['price'] = df['price'].astype('int32')
print(f"After: {df.memory_usage(deep=True).sum()} bytes")
Creating DataFrames
Create DataFrames from dictionaries, lists of dictionaries, NumPy arrays, or by reading files like CSV and Excel. Understanding different creation methods helps you work with various data sources.
DataFrame Creation Methods
DataFrames can be created from many sources: dictionaries (columns as keys), lists of dictionaries (rows as dicts), NumPy arrays, CSV files, Excel files, SQL databases, and JSON.
Best Practice: Choose the creation method that matches your data source. Use read_csv() for files, dict for programmatic creation.
From Dictionary of Lists
import pandas as pd
# Dictionary of lists (most common)
data = {
    'product': ['Laptop', 'Phone', 'Tablet'],
    'price': [999, 699, 399],
    'stock': [50, 150, 80]
}
df = pd.DataFrame(data)
print(df)
# product price stock
# 0 Laptop 999 50
# 1 Phone 699 150
# 2 Tablet 399 80
The most common way to create a DataFrame is from a dictionary where keys become column names and values (as lists) become the column data. This is intuitive because you're defining your data column by column, which is how we typically think about tabular data.
All lists must have the same length—if one column has 3 values and another has 4, Pandas will raise a ValueError. The automatic integer index (0, 1, 2) is created for you, representing row numbers.
From List of Dictionaries
# Each dict is a row
records = [
    {'name': 'Alice', 'score': 85, 'grade': 'B'},
    {'name': 'Bob', 'score': 92, 'grade': 'A'},
    {'name': 'Carol', 'score': 78, 'grade': 'C'}
]
df = pd.DataFrame(records)
print(df)
# name score grade
# 0 Alice 85 B
# 1 Bob 92 A
# 2 Carol 78 C
An alternative approach treats each dictionary as a row. This row-oriented format is natural when you receive data record by record, such as from API responses, database queries, or log entries. Each dictionary represents one entity with its attributes.
Pandas determines the columns from all unique keys across the dictionaries, in order of first appearance. Pass columns=['name', 'score', 'grade'] to control the order explicitly or to select only certain columns.
# Handles missing keys gracefully
records_incomplete = [
    {'name': 'Alice', 'score': 85},
    {'name': 'Bob', 'grade': 'A'},  # missing 'score'
]
df = pd.DataFrame(records_incomplete)
# Missing values become NaN
A key advantage of this format is handling incomplete records. If a dictionary is missing a key, Pandas fills that cell with NaN (Not a Number, representing missing data) rather than crashing. This is common when working with real-world data where some fields may be optional.
This flexibility is especially useful when loading JSON data from APIs or databases where not every record has every field. You can later use fillna() to handle these missing values.
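Continuing the incomplete-records example above, fillna() can supply per-column defaults for those NaN cells (the default values chosen here are arbitrary):

```python
import pandas as pd

records_incomplete = [
    {'name': 'Alice', 'score': 85},
    {'name': 'Bob', 'grade': 'A'},  # missing 'score'
]
df = pd.DataFrame(records_incomplete)

# Replace NaN with a sensible default per column
df_filled = df.fillna({'score': 0, 'grade': 'N/A'})
print(df_filled)
```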
From NumPy Arrays
import pandas as pd
import numpy as np
# From 2D NumPy array
arr = np.array([
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9]
])
df = pd.DataFrame(arr, columns=['A', 'B', 'C'])
print(df)
# A B C
# 0 1 2 3
# 1 4 5 6
# 2 7 8 9
When you already have numerical data in a NumPy array, converting to DataFrame adds labels and Pandas functionality. Rows of the array become DataFrame rows, columns become DataFrame columns. Since arrays don't have column names, you must provide them via the columns parameter.
This conversion is zero-copy when possible—Pandas uses the same underlying memory as the NumPy array, making it very efficient. The DataFrame inherits the array's dtype (int64, float64, etc.).
# With custom index
df = pd.DataFrame(
    arr,
    columns=['A', 'B', 'C'],
    index=['row1', 'row2', 'row3']
)
You can specify custom row labels with the index parameter instead of using the default 0, 1, 2. This is useful when rows represent named entities like dates, product IDs, or locations that you want to access by name.
After creating, you can access data both ways: df.loc['row1'] gets by label, df.iloc[0] gets by position. This dual access pattern is a core strength of Pandas over raw NumPy arrays.
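A quick sketch of that dual access pattern, using the array from above:

```python
import pandas as pd
import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
df = pd.DataFrame(arr, columns=['A', 'B', 'C'], index=['row1', 'row2', 'row3'])

# The same row, reached by label and by position
by_label = df.loc['row1']
by_position = df.iloc[0]
print(by_label.equals(by_position))  # True
```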
From CSV File
# Basic CSV read
df = pd.read_csv('data.csv')
The simplest way to load a CSV file is with just the filename. Pandas uses a comma delimiter by default (override with sep=';' or similar), infers column types, and treats the first row as headers. This works for most well-formatted CSV files.
pd.read_csv() is one of the most-used functions in data science. It handles file reading, parsing, type conversion, and DataFrame construction all in one call, making it incredibly convenient for quick data exploration.
# Common parameters
df = pd.read_csv('data.csv',
    index_col='id',           # Use column as index
    parse_dates=['date'],     # Parse date columns
    na_values=['N/A', '-'],   # Treat these strings as missing
    dtype={'code': str},      # Force column type
    usecols=['id', 'name'],   # Read only these columns
    nrows=1000,               # Read first 1000 rows
    skiprows=1,               # Skip rows at the top of the file
    encoding='utf-8'          # File encoding
)
# Write to CSV
df.to_csv('output.csv', index=False)
For real-world data, you'll often need to customize how Pandas reads the file. index_col sets a column as the DataFrame index; parse_dates converts string columns to datetime objects; na_values tells Pandas which strings represent missing values; dtype forces specific column types (crucial for codes like '001' that shouldn't become integers).
Performance options like usecols (read only needed columns), nrows (read first N rows), and skiprows (skip initial rows) are essential for large files. When saving, to_csv() with index=False prevents adding an extra index column to your output file.
From Excel Files
# Read Excel file
df = pd.read_excel('data.xlsx')
# Read specific sheet
df = pd.read_excel('data.xlsx', sheet_name='Sales')
# Read all sheets into dict of DataFrames
all_sheets = pd.read_excel('data.xlsx', sheet_name=None)
# all_sheets['Sheet1'], all_sheets['Sheet2'], etc.
read_excel() works similarly to read_csv() but handles Excel's .xlsx and .xls formats. By default, it reads the first sheet. Use sheet_name to specify a particular sheet by name or by index (0 for first sheet). Setting sheet_name=None reads ALL sheets and returns a dictionary where keys are sheet names.
Excel files often contain multiple related datasets on different sheets. The dictionary approach lets you load everything at once and then work with specific sheets as df = all_sheets['SheetName']. This requires the openpyxl package (pip install openpyxl).
# Write to Excel
df.to_excel('output.xlsx', index=False, sheet_name='Results')
# Write multiple sheets
with pd.ExcelWriter('multi_sheet.xlsx') as writer:
    df1.to_excel(writer, sheet_name='Data')
    df2.to_excel(writer, sheet_name='Summary')
to_excel() writes a DataFrame to an Excel file. Use index=False to avoid writing row numbers as the first column. The sheet_name parameter controls which sheet the data goes to.
For writing multiple DataFrames to different sheets in one file, use ExcelWriter as a context manager. This opens the file once, writes all sheets, and then properly closes and saves the file. Without ExcelWriter, each to_excel() call would overwrite the previous one.
From SQL Database
import pandas as pd
import sqlite3
# Connect to database
conn = sqlite3.connect('database.db')
# Read SQL query results
df = pd.read_sql('SELECT * FROM users', conn)
# Read entire table (read_sql_table needs an SQLAlchemy engine, not a sqlite3 connection)
from sqlalchemy import create_engine
engine = create_engine('sqlite:///database.db')
df = pd.read_sql_table('users', engine)
Pandas can directly query databases and return results as DataFrames. First, establish a connection using the appropriate library (sqlite3 for SQLite, psycopg2 for PostgreSQL, mysql-connector for MySQL). Then use read_sql() with any SQL query—the results become a DataFrame.
read_sql() is flexible: you can run complex queries with JOINs, WHERE clauses, aggregations, etc. read_sql_table() is a shortcut for loading an entire table without writing SQL, but it requires an SQLAlchemy connectable rather than a plain DBAPI connection. Both methods handle type conversion automatically.
# Write DataFrame to SQL
df.to_sql('users_backup', conn, if_exists='replace', index=False)
conn.close()
to_sql() writes a DataFrame to a database table. The if_exists parameter controls behavior when the table already exists: 'replace' drops and recreates it, 'append' adds rows to existing table, 'fail' raises an error. Use index=False to avoid creating an extra index column.
Always close database connections when done. For production code, use SQLAlchemy's create_engine() which provides connection pooling and works with more database types. The with statement can auto-close connections.
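As a self-contained sketch, an in-memory SQLite database (Pandas supports sqlite3 connections directly via the standard library) shows the full round trip; the table and data here are invented:

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(':memory:')  # throwaway in-memory database

df = pd.DataFrame({'id': [1, 2], 'name': ['Alice', 'Bob']})
df.to_sql('users', conn, index=False)              # write the table
loaded = pd.read_sql('SELECT * FROM users', conn)  # query it back
print(loaded)

conn.close()
```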
From JSON
# Read JSON file
df = pd.read_json('data.json')
# From a JSON string (recent Pandas versions expect a file-like object)
from io import StringIO
json_str = '[{"name": "Alice", "age": 25}, {"name": "Bob", "age": 30}]'
df = pd.read_json(StringIO(json_str))
read_json() parses JSON into DataFrames. It works with JSON arrays (each object becomes a row) or JSON objects (keys become rows or columns depending on structure). You can pass a file path, a URL, or a file-like object; passing a raw JSON string directly is deprecated in recent versions, so wrap strings in StringIO.
JSON is the standard format for web APIs, so this function is essential when working with data from web services. The function automatically infers the structure and converts it to tabular format.
# Nested JSON (normalize)
nested_json = [
    {'id': 1, 'info': {'name': 'Alice', 'city': 'NYC'}},
    {'id': 2, 'info': {'name': 'Bob', 'city': 'LA'}}
]
df = pd.json_normalize(nested_json)
# Columns: id, info.name, info.city
# Write to JSON
df.to_json('output.json', orient='records')
Real-world JSON is often nested—objects containing other objects. json_normalize() flattens these structures into a flat table. Nested keys become column names with dots: the 'name' inside 'info' becomes the column 'info.name'. This handles arbitrarily deep nesting.
When writing DataFrames to JSON, the orient parameter controls the output format. 'records' produces a list of row objects [{row1}, {row2}], which is the most common format for APIs. Other options include 'columns' (dict of column arrays) and 'index' (dict keyed by index).
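A short sketch of how orient changes the output, using a tiny invented DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'name': ['Alice', 'Bob'], 'age': [25, 30]})

print(df.to_json(orient='records'))
# [{"name":"Alice","age":25},{"name":"Bob","age":30}]
print(df.to_json(orient='columns'))
# {"name":{"0":"Alice","1":"Bob"},"age":{"0":25,"1":30}}
```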
File Format Reference
| Format | Read Function | Write Function | Best For |
|---|---|---|---|
| CSV | pd.read_csv() | df.to_csv() | Universal, human-readable |
| Excel | pd.read_excel() | df.to_excel() | Business reports, multiple sheets |
| JSON | pd.read_json() | df.to_json() | Web APIs, nested data |
| SQL | pd.read_sql() | df.to_sql() | Database integration |
| Parquet | pd.read_parquet() | df.to_parquet() | Big data, columnar storage |
| Pickle | pd.read_pickle() | df.to_pickle() | Python-only, preserves types |
| HTML | pd.read_html() | df.to_html() | Web scraping tables |
| Clipboard | pd.read_clipboard() | df.to_clipboard() | Quick copy-paste |
DataFrame Anatomy
Every DataFrame has an index (row labels), columns (column names), and values (the data). Understanding the structure is key to effective manipulation.
Exploring DataFrames
import pandas as pd
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Carol'],
    'age': [25, 30, 35],
    'salary': [50000, 60000, 70000]
})
print(df.shape) # (3, 3)
print(df.columns) # Index(['name', 'age', 'salary'])
print(df.dtypes) # Data types of each column
When you first load a dataset, these properties give you an instant overview. shape returns a tuple (rows, columns)—here (3, 3) means 3 rows and 3 columns. columns returns an Index object listing all column names. dtypes shows the data type of each column.
These attributes are your first stop for understanding any DataFrame. shape tells you the size of your data, columns shows what fields exist, and dtypes helps identify potential issues like numeric data stored as strings.
print(df.head(2)) # First 2 rows
print(df.info()) # Summary of DataFrame
head(n) displays the first n rows (default 5)—perfect for previewing data without overwhelming your screen. There's also tail(n) for the last n rows. For large datasets, these methods let you quickly verify the data looks correct.
info() provides a comprehensive summary: number of entries, column names, non-null counts per column, dtypes, and memory usage. This is often the first method to call on a new dataset—it reveals missing values, data types, and memory footprint all at once.
Quick Statistics
# Numeric column statistics
print(df.describe())
# age salary
# count 3.000000 3.00000
# mean 30.000000 60000.00000
# std 5.000000 10000.00000
# min 25.000000 50000.00000
# 25% 27.500000 55000.00000
# 50% 30.000000 60000.00000
# 75% 32.500000 65000.00000
# max 35.000000 70000.00000
describe() generates summary statistics for all numeric columns in one call. You get count (non-null values), mean, standard deviation (std), minimum, maximum, and quartiles (25%, 50%, 75%). The 50% percentile is the median.
This is your go-to method for understanding the distribution of numeric data. Quickly spot outliers (huge max vs small mean), check for missing values (count less than expected), and understand the spread of your data (std relative to mean).
# Single column stats
print(df['age'].mean()) # 30.0
print(df['salary'].max()) # 70000
For individual statistics, call the method directly on a column. Common methods include mean(), median(), std(), min(), max(), sum(), and count(). These return single values rather than DataFrames.
You can also use multiple statistics: df['salary'].agg(['mean', 'median', 'std']) returns all three at once. For custom statistics, pass a function: df['age'].agg(lambda x: x.max() - x.min()) calculates the range.
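For instance, a minimal sketch of agg() with a list of statistics (sample numbers invented):

```python
import pandas as pd

salary = pd.Series([50000, 60000, 70000])

# Several summary statistics in one call
stats = salary.agg(['mean', 'median', 'std'])
print(stats)
# mean      60000.0
# median    60000.0
# std       10000.0
```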
Value Counts and Unique Values
import pandas as pd
df = pd.DataFrame({
    'category': ['A', 'B', 'A', 'C', 'B', 'A'],
    'status': ['active', 'inactive', 'active', 'active', 'active', 'inactive']
})
# Count occurrences of each value
print(df['category'].value_counts())
# A 3
# B 2
# C 1
value_counts() is the perfect tool for categorical data analysis. It counts how many times each unique value appears, sorted from most to least common by default. Here, 'A' appears 3 times, 'B' twice, and 'C' once.
Add normalize=True to get percentages instead of counts: df['category'].value_counts(normalize=True) shows A is 50%, B is 33%, C is 17%. This is essential for understanding the distribution of categories in your data.
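A quick sketch using the same category values as above:

```python
import pandas as pd

category = pd.Series(['A', 'B', 'A', 'C', 'B', 'A'])

# Proportions instead of raw counts
shares = category.value_counts(normalize=True)
print(shares)
# A    0.500000
# B    0.333333
# C    0.166667
```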
# Get unique values
print(df['category'].unique()) # ['A' 'B' 'C']
print(df['category'].nunique()) # 3 (number of unique)
# Check for specific value
print('A' in df['category'].values) # True
unique() returns an array of all distinct values in the column—useful for seeing the vocabulary of a categorical column. nunique() returns just the count of unique values, which is helpful for understanding cardinality (how many distinct categories exist).
To check if a specific value exists, use 'value' in df['column'].values. Note the .values—this accesses the underlying NumPy array. Without it, you'd be checking if 'A' is in the column's index, which isn't what you want.
Memory Usage Analysis
import pandas as pd
df = pd.DataFrame({
    'id': range(10000),
    'name': ['user_' + str(i) for i in range(10000)],
    'score': [i * 0.1 for i in range(10000)]
})
# Overall memory usage
print(df.info(memory_usage='deep'))
When working with large datasets, memory becomes critical. info(memory_usage='deep') provides accurate memory usage including the actual size of string objects. Without 'deep', it only shows the memory for pointers, not the actual string data.
This deep analysis takes longer but gives you the true picture. A DataFrame that looks small (few columns, reasonable row count) might use gigabytes if it contains long strings or complex objects.
# Per-column memory
print(df.memory_usage(deep=True))
# Index 80000
# id 80000
# name 670000
# score 80000
# Total bytes
total_bytes = df.memory_usage(deep=True).sum()
print(f"Total: {total_bytes / 1024:.1f} KB")
memory_usage(deep=True) breaks down memory by column. Here, 'name' uses far more memory (670KB) than numeric columns (80KB each) because strings require more space than fixed-size integers or floats. The Index also consumes memory.
Understanding per-column memory helps optimize your DataFrame. You might convert string columns to categorical (if few unique values), use smaller integer types (int32 instead of int64), or drop columns you don't need. These optimizations can reduce memory by 50-90%.
Practice: DataFrame Creation
Task: Create a DataFrame with columns 'fruit' and 'quantity' containing 3 fruits.
Solution:
import pandas as pd
df = pd.DataFrame({
    'fruit': ['Apple', 'Banana', 'Orange'],
    'quantity': [10, 25, 15]
})
print(df)
Task: Create any DataFrame and print its shape and column names.
Solution:
import pandas as pd
df = pd.DataFrame({
    'a': [1, 2, 3],
    'b': [4, 5, 6]
})
print(f"Shape: {df.shape}")
print(f"Columns: {list(df.columns)}")
Task: Create a DataFrame from a list of dictionaries representing students with name and grade.
Solution:
import pandas as pd
students = [
    {'name': 'Alice', 'grade': 'A'},
    {'name': 'Bob', 'grade': 'B'},
    {'name': 'Carol', 'grade': 'A'}
]
df = pd.DataFrame(students)
print(df)
Task: Create a DataFrame, save it to CSV, then read it back and verify the data.
Solution:
import pandas as pd
# Create DataFrame
df = pd.DataFrame({
    'id': [1, 2, 3],
    'name': ['A', 'B', 'C'],
    'value': [10, 20, 30]
})
# Save to CSV
df.to_csv('test_data.csv', index=False)
# Read back
df_loaded = pd.read_csv('test_data.csv')
print(df_loaded)
print(f"Match: {df.equals(df_loaded)}")
Task: Create a DataFrame from a NumPy array with custom row and column names.
Solution:
import pandas as pd
import numpy as np
data = np.array([
    [85, 90, 88],
    [92, 88, 95],
    [78, 82, 80]
])
df = pd.DataFrame(
    data,
    columns=['Math', 'Science', 'English'],
    index=['Alice', 'Bob', 'Carol']
)
print(df)
print(f"\nAlice's Math score: {df.loc['Alice', 'Math']}")
Task: Create a DataFrame and print its shape, dtypes, memory usage, and summary statistics.
Solution:
import pandas as pd
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Carol', 'David'],
    'age': [25, 30, 35, 28],
    'salary': [50000, 60000, 70000, 55000]
})
print(f"Shape: {df.shape}")
print(f"\nData types:\n{df.dtypes}")
print(f"\nMemory usage:\n{df.memory_usage(deep=True)}")
print(f"\nStatistics:\n{df.describe()}")
Selection and Filtering
Select columns, rows, or subsets of data using labels, positions, or boolean conditions. Mastering selection is fundamental to data analysis with Pandas.
Selection Methods
Pandas provides multiple ways to select data: bracket notation for columns, loc for label-based selection, iloc for position-based selection, and boolean indexing for conditional filtering.
Rule of Thumb: Use loc when you know the labels, iloc when you know positions, and boolean masks for conditional selection.
Selecting Columns
import pandas as pd
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Carol'],
    'age': [25, 30, 35],
    'city': ['NYC', 'LA', 'Chicago'],
    'salary': [50000, 60000, 70000]
})
# Single column (returns Series)
names = df['name']
print(type(names)) # pandas.core.series.Series
# Multiple columns (returns DataFrame)
subset = df[['name', 'age']]
print(type(subset)) # pandas.core.frame.DataFrame
# Alternative: dot notation (only for simple column names)
ages = df.age # Same as df['age']
Selecting columns is the most common operation in Pandas. Using single brackets df['name'] returns a Series (a single column), while double brackets df[['name', 'age']] return a DataFrame (a subset table). This distinction matters because Series and DataFrames have different methods available.
Dot notation (df.age) is a convenient shortcut but has limitations: it doesn't work with column names containing spaces, special characters, or names that conflict with DataFrame methods. Always use bracket notation for reliability.
# Add new column
df['senior'] = df['age'] >= 30
# Computed column
df['annual_bonus'] = df['salary'] * 0.1
Creating new columns is as simple as assignment. df['senior'] = df['age'] >= 30 creates a boolean column based on a condition. df['annual_bonus'] = df['salary'] * 0.1 creates a computed column from existing data. The operation is vectorized—it applies to all rows automatically.
You can create columns from any expression: arithmetic operations, string manipulations, or the results of function calls. This is how you typically feature engineer in data science projects.
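A small sketch of that kind of feature engineering (the derived column names here are invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Carol'],
    'salary': [50000, 60000, 70000],
})

# New columns from arithmetic, string methods, and conditions
df['monthly'] = df['salary'] / 12
df['initial'] = df['name'].str[0]
df['high_earner'] = df['salary'] > 55000
print(df)
```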
Selecting Rows with loc (Label-Based)
# loc: select by label
print(df.loc[0]) # Row with index label 0
print(df.loc[0:1]) # Rows 0 and 1 (INCLUSIVE!)
print(df.loc[0, 'name']) # Row 0, column 'name'
print(df.loc[0:2, 'name':'city']) # Slice rows and columns
# Multiple specific rows
print(df.loc[[0, 2]]) # Rows 0 and 2
# Multiple specific columns
print(df.loc[:, ['name', 'salary']]) # All rows, specific columns
loc is the primary way to select data by label. The syntax df.loc[row_selector, column_selector] lets you specify exactly which rows and columns you want. Use a colon (:) to mean "all" — df.loc[:, 'name'] means all rows, just the name column.
Critical difference from Python slicing: loc ranges are INCLUSIVE on both ends. df.loc[0:2] includes rows 0, 1, AND 2. This catches many beginners off guard since Python normally excludes the end of a slice.
# Custom index example
df_custom = df.set_index('name')
print(df_custom.loc['Alice']) # Row with index 'Alice'
print(df_custom.loc['Alice':'Carol']) # Range of labels
loc truly shines with custom indices. After setting 'name' as the index, you can select rows by name directly: df_custom.loc['Alice'] retrieves Alice's row. Label slicing like 'Alice':'Carol' selects every row from 'Alice' through 'Carol' in index order (inclusive); for this to work reliably the index should be unique or sorted.
This label-based access is why Pandas is so powerful for data analysis—you can reference data by meaningful identifiers (like dates, names, or IDs) rather than remembering integer positions.
Selecting Rows with iloc (Position-Based)
# iloc: select by integer position
print(df.iloc[0]) # First row (position 0)
print(df.iloc[0:2]) # First 2 rows (EXCLUSIVE!)
print(df.iloc[0, 1]) # Row 0, column at position 1
print(df.iloc[0:2, 0:2]) # First 2 rows, first 2 columns
iloc selects data by integer position, exactly like Python list indexing. df.iloc[0] is the first row, df.iloc[0:2] is the first two rows (positions 0 and 1). Unlike loc, iloc slicing is EXCLUSIVE at the end—standard Python behavior.
Use iloc when you need positional access regardless of what the index labels are. It's especially useful for iteration, random sampling, or when working with data where you don't know or care about the index values.
# Negative indexing
print(df.iloc[-1]) # Last row
print(df.iloc[-3:]) # Last 3 rows
# Step slicing
print(df.iloc[::2]) # Every other row
# Specific positions
print(df.iloc[[0, 2], [1, 3]]) # Rows 0 and 2, columns 1 and 3
iloc supports all Python slicing features: negative indices count from the end (-1 is the last row), step values let you skip rows (::2 takes every other row), and you can select arbitrary combinations of row and column positions.
The syntax df.iloc[[0, 2], [1, 3]] selects a subset of non-contiguous rows and columns by their positions, returning the intersection—useful when you need specific cells scattered across the DataFrame.
loc vs iloc Comparison
| Feature | loc | iloc |
|---|---|---|
| Selection by | Labels (names) | Integer positions |
| Slice end | Inclusive | Exclusive |
| Accepts | Labels, boolean arrays | Integers, boolean arrays |
| Example | df.loc['row_a':'row_c'] | df.iloc[0:3] |
| Negative index | Not supported | Supported |
| Best for | Named indices, conditions | Position-based iteration |
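The inclusive/exclusive difference is easy to verify on a tiny DataFrame — a minimal sketch with illustrative data:

```python
import pandas as pd

df = pd.DataFrame({'x': [10, 20, 30, 40]})  # default integer index 0-3

# loc slices by label and INCLUDES the end: rows 0, 1, and 2
by_label = df.loc[0:2]
# iloc slices by position and EXCLUDES the end: rows 0 and 1
by_position = df.iloc[0:2]

print(len(by_label))     # 3
print(len(by_position))  # 2
```

Same-looking slice, different row counts — worth remembering whenever the index happens to be integers.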
Filtering with Boolean Conditions
import pandas as pd
df = pd.DataFrame({
'name': ['Alice', 'Bob', 'Carol', 'David'],
'age': [25, 30, 35, 28],
'city': ['NYC', 'LA', 'NYC', 'Chicago'],
'salary': [50000, 60000, 70000, 55000]
})
# Single condition
adults = df[df['age'] >= 30]
print(adults)
Boolean filtering is the most powerful way to select data in Pandas. The expression df['age'] >= 30 creates a boolean Series (True/False for each row). When you put this inside brackets df[...], it returns only the rows where the condition is True.
This pattern is fundamental to data analysis: create a condition, use it to filter. The condition can be any expression that produces True/False for each row—comparisons, string matches, null checks, etc.
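A minimal sketch of the pattern that makes the intermediate boolean mask visible (the data is illustrative):

```python
import pandas as pd

df = pd.DataFrame({'age': [25, 30, 35, 28]})

# Step 1: the condition produces a boolean Series, one True/False per row
mask = df['age'] >= 30
print(mask.tolist())  # [False, True, True, False]

# Step 2: indexing with that mask keeps only the True rows
filtered = df[mask]
print(filtered['age'].tolist())  # [30, 35]
```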
# Multiple conditions (use & for AND, | for OR)
# MUST wrap each condition in parentheses!
result = df[(df['age'] >= 25) & (df['city'] == 'NYC')]
print(result)
# OR condition
result = df[(df['city'] == 'NYC') | (df['city'] == 'LA')]
print(result)
# NOT condition
result = df[~(df['city'] == 'NYC')] # ~ is NOT
print(result)
Combining conditions requires special operators: & for AND, | for OR, and ~ for NOT. Critically, you MUST wrap each individual condition in parentheses. Without them, Python's operator precedence causes errors. (df['age'] >= 25) & (df['city'] == 'NYC') is correct; df['age'] >= 25 & df['city'] == 'NYC' will fail.
These operators work element-wise across the boolean Series. The result is a new boolean Series that you can use for filtering. This is different from Python's and/or keywords, which don't work with Pandas Series.
Query Method
# Using query method (cleaner syntax)
result = df.query('age >= 25 and city == "NYC"')
print(result)
# Query with variables (use @)
min_age = 25
target_city = 'NYC'
result = df.query('age >= @min_age and city == @target_city')
print(result)
The query() method provides a more readable alternative to boolean indexing. Instead of df[(df['age'] >= 25) & (df['city'] == 'NYC')], you write df.query('age >= 25 and city == "NYC"'). The string uses natural Python/SQL-like syntax with 'and' and 'or' keywords.
To use Python variables inside a query string, prefix them with @. This lets you build dynamic queries: @min_age refers to the variable min_age defined outside the string. This is cleaner than f-string interpolation and safer against injection.
# Complex queries
result = df.query('(age > 25 and salary >= 55000) or city == "LA"')
print(result)
# Query with column names containing spaces
df2 = df.rename(columns={'city': 'home city'})
result = df2.query('`home city` == "NYC"') # Use backticks
Complex queries are much easier to read with query(). You can use parentheses for grouping, and the precedence follows natural logic. For column names with spaces or special characters, wrap them in backticks within the query string.
query() is also slightly faster for large DataFrames because it can use numexpr under the hood. For simple conditions, both approaches perform similarly, but query() wins on readability for complex filtering logic.
isin() for Multiple Values
# Filter for multiple values
cities = ['NYC', 'LA']
result = df[df['city'].isin(cities)]
print(result)
# Exclude multiple values
result = df[~df['city'].isin(cities)]
print(result)
isin() checks whether each value is in a list of acceptable values. It's much cleaner than chaining multiple == comparisons with |. Here, df['city'].isin(['NYC', 'LA']) returns True for rows where city is either NYC or LA.
The ~ operator negates the boolean mask, giving you exclusion filtering. ~df['city'].isin(cities) returns rows where city is NOT in the list. This is how you filter out unwanted categories efficiently.
# Multiple conditions with isin
result = df[
df['city'].isin(['NYC', 'LA']) &
df['age'].isin([25, 30])
]
Combine isin() with other conditions using & (and) or | (or). This example filters for rows where city is NYC or LA AND age is 25 or 30. You can mix isin() with regular boolean conditions too.
isin() is optimized for membership testing and is much faster than equivalent OR chains, especially for long lists of values. Always use isin() when checking against more than 2-3 values.
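A short sketch confirming that isin() produces exactly the same rows as the equivalent OR chain (the sample cities are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'city': ['NYC', 'Boston', 'LA', 'Chicago', 'Seattle']})
targets = ['NYC', 'LA', 'Chicago']

# OR chain: verbose, and grows with every added value
or_chain = df[(df['city'] == 'NYC') | (df['city'] == 'LA') | (df['city'] == 'Chicago')]

# isin: one call, same result
via_isin = df[df['city'].isin(targets)]

print(via_isin.equals(or_chain))  # True
```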
String Filtering
# String contains
result = df[df['name'].str.contains('a', case=False)]
print(result)
# Starts with / Ends with
result = df[df['name'].str.startswith('A')]
result = df[df['name'].str.endswith('l')]
The str accessor unlocks string methods for filtering. str.contains() checks if a substring exists anywhere in the string. With case=False, it's case-insensitive—matching 'a', 'A', or any combination. By default, it's case-sensitive.
str.startswith() and str.endswith() check the beginning and end of strings respectively. These are faster than contains() when you know exactly where to look, and they're perfect for filtering by prefixes (like product codes) or suffixes (like file extensions).
# Regex patterns
result = df[df['name'].str.match(r'^[A-C]')] # Names starting with A, B, or C
# String length
result = df[df['name'].str.len() > 4]
str.match() applies regular expressions for powerful pattern matching. The pattern ^[A-C] matches strings starting with A, B, or C. You can use any regex pattern—this is incredibly flexible for complex text filtering.
str.len() returns the length of each string, which you can then use in comparisons. This filters for names longer than 4 characters. Combine these methods: df[df['name'].str.len().between(3, 10)] filters for names between 3 and 10 characters.
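The between() combination mentioned above, as a runnable sketch (sample names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'name': ['Al', 'Alice', 'Bob', 'Christopher']})

# Keep names whose length is between 3 and 10 characters (both ends inclusive)
result = df[df['name'].str.len().between(3, 10)]
print(result['name'].tolist())  # ['Alice', 'Bob']
```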
Sorting Data
# Sort by single column
df_sorted = df.sort_values('age')
print(df_sorted)
# Sort descending
df_sorted = df.sort_values('age', ascending=False)
# Sort by multiple columns
df_sorted = df.sort_values(['city', 'age'])
sort_values() orders rows based on column values. By default it sorts ascending (smallest to largest, A to Z). Set ascending=False for descending order. When sorting by multiple columns, it sorts by the first column first, then uses the second column to break ties.
Sorting is essential for data presentation and often required before certain operations like rolling calculations or when you need to see extremes (top earners, oldest items, etc.).
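When you only care about the extremes, nlargest() and nsmallest() get you there without sorting the whole frame — a minimal sketch with illustrative data:

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Carol', 'David'],
    'salary': [60000, 70000, 55000, 65000]
})

# Top 2 earners, highest first
top2 = df.nlargest(2, 'salary')
print(top2['name'].tolist())  # ['Bob', 'David']
```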
# Sort with mixed directions
df_sorted = df.sort_values(
['city', 'age'],
ascending=[True, False] # city asc, age desc
)
# Sort by index
df_sorted = df.sort_index()
# Ranking
df['salary_rank'] = df['salary'].rank(ascending=False)
print(df)
For multi-column sorts, pass a list to ascending to control each column's direction independently. This is useful when you want, say, alphabetical city names but highest salaries first within each city.
sort_index() sorts by the row index labels. rank() assigns rank positions rather than reordering—useful when you want to keep original order but add a ranking column. ascending=False gives rank 1 to the highest value.
Modifying Data
# Update values based on condition
df.loc[df['age'] > 30, 'status'] = 'senior'
df.loc[df['age'] <= 30, 'status'] = 'junior'
# Update multiple columns
df.loc[df['city'] == 'NYC', ['bonus', 'tax_rate']] = [5000, 0.25]
Use loc with a boolean condition to update specific rows. The syntax df.loc[condition, column] = value sets the value for rows where the condition is True. This is the safe, recommended way to modify DataFrame values conditionally.
You can update multiple columns at once by passing a list of column names and a list of values. This is useful when several columns need to change together based on the same condition.
# Using where (keeps original if condition is False)
df['adjusted_salary'] = df['salary'].where(df['age'] >= 30, df['salary'] * 0.9)
# Using mask (opposite of where)
df['adjusted_salary'] = df['salary'].mask(df['age'] < 30, df['salary'] * 0.9)
# Replace values
df['city'] = df['city'].replace({'NYC': 'New York', 'LA': 'Los Angeles'})
where() and mask() provide vectorized conditional logic. where(condition, other) keeps original values where the condition is True and substitutes 'other' where it is False. mask() is the opposite—it substitutes where the condition is True. Both return new Series, leaving the original column unmodified.
replace() substitutes values using a dictionary mapping old values to new values. It's perfect for standardizing data (e.g., converting abbreviations to full names) or fixing data entry inconsistencies.
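For multi-way conditional assignment (more than two outcomes), NumPy's np.select is a common companion to the methods above — a sketch with illustrative age bands:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'age': [22, 28, 35, 45]})

# Conditions are checked in order; the first match wins, else the default applies
conditions = [df['age'] < 25, df['age'] < 35]
labels = ['young', 'mid']
df['band'] = np.select(conditions, labels, default='senior')
print(df['band'].tolist())  # ['young', 'mid', 'senior', 'senior']
```

This avoids a chain of separate df.loc[condition, col] = value assignments when there are several mutually exclusive cases.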
Practice: Selection and Filtering
Task: From a DataFrame with columns 'product', 'price', 'qty', select just the 'price' column.
Show Solution
import pandas as pd
df = pd.DataFrame({
'product': ['A', 'B', 'C'],
'price': [10, 20, 30],
'qty': [100, 200, 150]
})
prices = df['price']
print(prices)
Task: Filter a DataFrame to get only rows where price > 15.
Show Solution
import pandas as pd
df = pd.DataFrame({
'product': ['A', 'B', 'C'],
'price': [10, 20, 30]
})
expensive = df[df['price'] > 15]
print(expensive)
Task: Filter to get products with price > 15 AND qty >= 150.
Show Solution
import pandas as pd
df = pd.DataFrame({
'product': ['A', 'B', 'C'],
'price': [10, 20, 30],
'qty': [100, 200, 150]
})
result = df[(df['price'] > 15) & (df['qty'] >= 150)]
print(result)
Task: Filter a DataFrame to include only rows where city is in ['NYC', 'LA', 'Chicago'].
Show Solution
import pandas as pd
df = pd.DataFrame({
'name': ['Alice', 'Bob', 'Carol', 'David', 'Eve'],
'city': ['NYC', 'Boston', 'LA', 'Chicago', 'Seattle']
})
target_cities = ['NYC', 'LA', 'Chicago']
result = df[df['city'].isin(target_cities)]
print(result)
Task: Sort a DataFrame by department ascending, then salary descending.
Show Solution
import pandas as pd
df = pd.DataFrame({
'name': ['Alice', 'Bob', 'Carol', 'David'],
'department': ['Sales', 'IT', 'Sales', 'IT'],
'salary': [60000, 70000, 55000, 65000]
})
result = df.sort_values(
['department', 'salary'],
ascending=[True, False]
)
print(result)
Task: Use the query() method to filter with a Python variable for the threshold.
Show Solution
import pandas as pd
df = pd.DataFrame({
'product': ['A', 'B', 'C', 'D'],
'price': [10, 25, 15, 30],
'category': ['Electronics', 'Clothing', 'Electronics', 'Clothing']
})
min_price = 15
target_category = 'Electronics'
result = df.query('price >= @min_price and category == @target_category')
print(result)
GroupBy Operations
GroupBy splits data into groups, applies a function to each group, and combines results. It is the foundation of aggregation in Pandas and essential for data summarization.
Split-Apply-Combine
GroupBy follows the split-apply-combine paradigm: split data into groups based on criteria, apply a function to each group independently, then combine results back into a DataFrame.
Common Use Cases: Calculate averages per category, sum sales by region, count items per group, find max/min within groups.
Basic GroupBy
import pandas as pd
df = pd.DataFrame({
'category': ['A', 'B', 'A', 'B', 'A'],
'value': [10, 20, 30, 40, 50]
})
# Group by category, calculate mean
grouped = df.groupby('category')['value'].mean()
print(grouped)
# category
# A 30.0
# B 30.0
The groupby() method splits your data into groups based on unique values in a column. Here, rows are grouped by 'category' (A or B), then mean() calculates the average 'value' for each group. Category A has values 10, 30, 50 (mean=30), and B has 20, 40 (mean=30).
The pattern is: df.groupby('column_to_group_by')['column_to_aggregate'].aggregation_function(). This is the most common groupby pattern and handles 90% of aggregation tasks you'll encounter.
# Group object is lazy - no computation until aggregation
groups = df.groupby('category')
print(type(groups)) # DataFrameGroupBy object
# Iterate over groups
for name, group in df.groupby('category'):
print(f"Group: {name}")
print(group)
print()
groupby() itself doesn't compute anything—it creates a lazy GroupBy object that waits for you to specify what to do with each group. This is efficient because you can chain multiple operations without recalculating groups.
You can iterate over groups using a for loop. Each iteration yields the group name (the value being grouped on) and a DataFrame containing just that group's rows. This is useful for custom processing or debugging.
Common Aggregation Functions
| Function | Description | Example |
|---|---|---|
| sum() | Sum of values | df.groupby('cat')['val'].sum() |
| mean() | Average of values | df.groupby('cat')['val'].mean() |
| median() | Median value | df.groupby('cat')['val'].median() |
| count() | Count of non-null values | df.groupby('cat')['val'].count() |
| size() | Group size (includes nulls) | df.groupby('cat').size() |
| min() | Minimum value | df.groupby('cat')['val'].min() |
| max() | Maximum value | df.groupby('cat')['val'].max() |
| std() | Standard deviation | df.groupby('cat')['val'].std() |
| var() | Variance | df.groupby('cat')['val'].var() |
| first() | First value in group | df.groupby('cat')['val'].first() |
| last() | Last value in group | df.groupby('cat')['val'].last() |
| nunique() | Count of unique values | df.groupby('cat')['val'].nunique() |
Multiple Aggregations with agg()
# Multiple functions at once
result = df.groupby('category')['value'].agg(['sum', 'mean', 'count'])
print(result)
# sum mean count
# category
# A 90 30.0 3
# B 60 30.0 2
The agg() method lets you apply multiple aggregation functions simultaneously. Pass a list of function names as strings, and you get a DataFrame with each function as a column. This is more efficient than calling each aggregation separately.
This is perfect for creating summary tables. In one operation, you can get the sum, mean, count, min, max—whatever statistics you need—for each group.
# Different functions per column
df = pd.DataFrame({
'category': ['A', 'B', 'A', 'B'],
'sales': [100, 200, 150, 250],
'quantity': [10, 20, 15, 25]
})
result = df.groupby('category').agg({
'sales': ['sum', 'mean'],
'quantity': ['sum', 'max']
})
print(result)
# Named aggregations (cleaner column names)
result = df.groupby('category').agg(
total_sales=('sales', 'sum'),
avg_sales=('sales', 'mean'),
total_qty=('quantity', 'sum'),
max_qty=('quantity', 'max')
)
print(result)
When you need different aggregations for different columns, pass a dictionary where keys are column names and values are lists of functions. This creates a multi-level column index which can be awkward to work with.
Named aggregations solve this elegantly: each argument specifies a result column name, and the value is a tuple of (column_to_aggregate, function). This produces clean, flat column names that are ready for reports or further analysis.
GroupBy with Multiple Columns
df = pd.DataFrame({
'region': ['East', 'West', 'East', 'West', 'East', 'West'],
'product': ['A', 'A', 'B', 'B', 'A', 'B'],
'sales': [100, 150, 200, 250, 120, 180]
})
# Group by multiple columns
result = df.groupby(['region', 'product'])['sales'].sum()
print(result)
# region product
# East A 220
# B 200
# West A 150
# B 430
Pass a list of column names to groupby() to group by multiple columns. This creates groups for every unique combination of values. Here we get four groups: East-A, East-B, West-A, and West-B, with sales summed for each combination.
The result has a hierarchical (multi-level) index. The first level is region, and nested within each region are the products. This structure represents the grouping hierarchy clearly.
# Reset index to get flat DataFrame
result = df.groupby(['region', 'product'])['sales'].sum().reset_index()
print(result)
# region product sales
# 0 East A 220
# 1 East B 200
# 2 West A 150
# 3 West B 430
reset_index() converts the hierarchical index back into regular columns, producing a flat DataFrame that's often easier to work with. Each row now explicitly shows the region, product, and aggregated sales.
This flat format is typically what you want for further analysis, visualization, or exporting to CSV/Excel. The hierarchical format is useful for display but can be cumbersome for programmatic access.
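The same hierarchical result can also be reshaped with unstack(), which pivots the inner index level into columns — a sketch with illustrative data:

```python
import pandas as pd

df = pd.DataFrame({
    'region': ['East', 'West', 'East', 'West'],
    'product': ['A', 'A', 'B', 'B'],
    'sales': [100, 150, 200, 250]
})

# One row per region, one column per product
grid = df.groupby(['region', 'product'])['sales'].sum().unstack()
print(grid.loc['East', 'A'])  # 100
print(grid.loc['West', 'B'])  # 250
```

This region-by-product grid is handy for display and heatmap-style visualization, while reset_index() remains the better choice for further row-wise processing.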
Transform and Apply
# Transform: return result with same shape as input
df = pd.DataFrame({
'category': ['A', 'A', 'B', 'B'],
'value': [10, 20, 100, 200]
})
# Add group mean as new column
df['group_mean'] = df.groupby('category')['value'].transform('mean')
print(df)
# category value group_mean
# 0 A 10 15.0
# 1 A 20 15.0
# 2 B 100 150.0
# 3 B 200 150.0
transform() applies an aggregation but broadcasts the result back to match the original DataFrame's shape. Here, the mean of category A (15) is placed in every row belonging to category A. Unlike regular aggregation which reduces data, transform keeps all original rows.
This is incredibly useful for adding group-level statistics as new columns. Common uses include calculating each row's deviation from group mean, percentile within group, or normalized values.
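The deviation-from-group-mean idea mentioned above, as a runnable sketch:

```python
import pandas as pd

df = pd.DataFrame({
    'category': ['A', 'A', 'B', 'B'],
    'value': [10, 20, 100, 200]
})

# Each row's deviation from its own group's mean
df['deviation'] = df['value'] - df.groupby('category')['value'].transform('mean')
print(df['deviation'].tolist())  # [-5.0, 5.0, -50.0, 50.0]
```

Because transform() returns a result aligned to the original index, the subtraction works row by row without any merging.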
# Normalize within groups
df['normalized'] = df.groupby('category')['value'].transform(
lambda x: (x - x.mean()) / x.std()
)
# Apply: flexible function application
def top_n(group, n=2):
return group.nlargest(n, 'value')
result = df.groupby('category').apply(top_n, n=1)
print(result)
transform() accepts lambda functions for custom calculations. The normalization formula (x - mean) / std standardizes values within each group, centering them around 0 with standard deviation of 1. This is common preprocessing for machine learning.
apply() is the most flexible groupby method—it can return any shape. Your function receives each group as a DataFrame and can do anything: filter rows, add columns, reshape data. However, it's slower than built-in aggregations and should be used when other methods can't solve your problem.
Filtering Groups
# Filter: keep groups that meet condition
df = pd.DataFrame({
'category': ['A', 'A', 'B', 'B', 'C'],
'value': [10, 20, 100, 200, 5]
})
# Keep only groups with sum > 50
result = df.groupby('category').filter(lambda x: x['value'].sum() > 50)
print(result)
# category value
# 2 B 100
# 3 B 200
filter() keeps or discards entire groups based on a condition. The lambda function receives each group as a DataFrame and must return True (keep all rows in this group) or False (discard all rows). Here, only category B has sum > 50, so all of B's rows are kept while A and C are removed.
Unlike boolean filtering on individual rows, filter() makes group-level decisions. It's perfect for questions like "keep only customers with total purchases over $1000" or "show only regions with at least 10 orders."
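The "total purchases over $1000" question phrased in code — a sketch with illustrative customer data:

```python
import pandas as pd

df = pd.DataFrame({
    'customer': ['ann', 'ann', 'ben', 'cat', 'cat'],
    'amount': [600, 500, 300, 900, 200]
})

# Keep every row belonging to a customer whose total exceeds 1000
big_spenders = df.groupby('customer').filter(lambda g: g['amount'].sum() > 1000)
print(sorted(big_spenders['customer'].unique()))  # ['ann', 'cat']
```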
# Keep groups with at least 2 members
result = df.groupby('category').filter(lambda x: len(x) >= 2)
print(result)
A common use case is filtering by group size. len(x) gives the number of rows in each group. Groups A and B have 2 rows each, while C has only 1. With len(x) >= 2, we keep A and B but drop C entirely.
This is useful for excluding rare categories from analysis, removing low-frequency items that might cause statistical issues, or focusing on groups with enough data to be meaningful.
Practice: GroupBy
Task: Group sales data by 'region' and calculate the total sales per region.
Show Solution
import pandas as pd
df = pd.DataFrame({
'region': ['North', 'South', 'North', 'South'],
'sales': [100, 200, 150, 250]
})
totals = df.groupby('region')['sales'].sum()
print(totals)
Task: Count how many products are in each category.
Show Solution
import pandas as pd
df = pd.DataFrame({
'product': ['A', 'B', 'C', 'D', 'E'],
'category': ['Electronics', 'Clothing', 'Electronics',
'Electronics', 'Clothing']
})
counts = df.groupby('category')['product'].count()
print(counts)
Task: Group by category and get sum, mean, and max of values.
Show Solution
import pandas as pd
df = pd.DataFrame({
'category': ['A', 'B', 'A', 'B', 'A'],
'value': [10, 20, 30, 40, 50]
})
result = df.groupby('category')['value'].agg(['sum', 'mean', 'max'])
print(result)
Task: Group sales data by both region and product_type and sum sales.
Show Solution
import pandas as pd
df = pd.DataFrame({
'region': ['North', 'North', 'South', 'South'],
'product_type': ['A', 'B', 'A', 'B'],
'sales': [100, 150, 200, 120]
})
result = df.groupby(['region', 'product_type'])['sales'].sum()
print(result)
# Reset index to get a regular DataFrame
result_df = result.reset_index()
print(result_df)
Task: Normalize scores within each class by subtracting the class mean.
Show Solution
import pandas as pd
df = pd.DataFrame({
'student': ['Alice', 'Bob', 'Carol', 'David'],
'class': ['A', 'A', 'B', 'B'],
'score': [85, 92, 78, 88]
})
df['normalized'] = df.groupby('class')['score'].transform(lambda x: x - x.mean())
print(df)
Task: Filter to keep only categories with more than 2 products.
Show Solution
import pandas as pd
df = pd.DataFrame({
'category': ['A', 'A', 'A', 'B', 'B', 'C'],
'product': ['P1', 'P2', 'P3', 'P4', 'P5', 'P6'],
'price': [10, 15, 20, 25, 30, 35]
})
# Use filter to keep groups with more than 2 products
result = df.groupby('category').filter(lambda x: len(x) > 2)
print(result)
Data Cleaning
Real-world data is messy. Data cleaning involves handling missing values, removing duplicates, fixing data types, and standardizing formats. These operations are essential before any analysis.
Missing Data in Pandas
Pandas uses NaN (Not a Number) and None to represent missing values. Operations on columns with missing data typically ignore NaN values, but you need to decide how to handle them for accurate analysis.
Strategy Options: Remove rows with missing data, fill with a default value, interpolate from neighbors, or forward/backward fill.
Detecting Missing Values
import pandas as pd
import numpy as np
df = pd.DataFrame({
'name': ['Alice', 'Bob', None, 'David'],
'age': [25, np.nan, 35, 28],
'city': ['NYC', 'LA', 'Chicago', np.nan]
})
We create a sample DataFrame with intentionally missing values to demonstrate detection techniques. None is Python's null value, while np.nan (Not a Number) is NumPy's representation of missing data. Pandas treats both as missing values internally.
In real datasets, missing values appear for many reasons: users skipping optional fields, data corruption during transfer, sensors failing to record readings, or records from different sources having different fields. Understanding your data's missingness is the first step in cleaning it.
# Check for missing values
print(df.isna()) # Boolean mask
print(df.isnull()) # Same as isna()
isna() returns a DataFrame of the same shape where each cell is True if the original value was missing, False otherwise. This boolean mask lets you see exactly where the gaps in your data are located. isnull() is simply an alias—they do exactly the same thing.
The output shows True at positions where data is missing: row 2's name, row 1's age, and row 3's city. This visual representation helps you spot patterns—are missing values concentrated in certain rows or columns?
# Count missing per column
print(df.isna().sum())
# name 1
# age 1
# city 1
Chaining .sum() after isna() counts the True values (missing) in each column. Since True equals 1 in Python, summing a boolean column gives the count of True values. Here, each column has exactly 1 missing value.
This is your first quantitative view of data quality. A column with 90% missing values might need to be dropped entirely, while a column with 1% missing can often be filled. This count guides your cleaning strategy.
# Total missing values
print(df.isna().sum().sum()) # 3
Double .sum() first sums each column (giving missing count per column), then sums those sums (giving total missing across the entire DataFrame). Here, 1+1+1 = 3 total missing values in the dataset.
This single number is useful for tracking data quality over time or comparing datasets. "Dataset A has 500 missing values, Dataset B has 50"—immediately tells you which needs more cleaning work.
# Percentage missing
print(df.isna().mean() * 100)
Using .mean() instead of .sum() gives the proportion of missing values (since True=1, False=0, mean gives the fraction that are True). Multiply by 100 to express as a percentage. This is often more meaningful than raw counts.
If a column has 25% missing data, that's concerning regardless of whether your dataset has 100 rows or 1 million rows. Percentages let you set universal thresholds like "drop columns with >50% missing" that work across different dataset sizes.
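The "drop columns with >50% missing" threshold mentioned above can be sketched like this (the data is illustrative):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'a': [1, 2, 3, 4],
    'b': [1, np.nan, np.nan, np.nan],  # 75% missing
    'c': [1, 2, np.nan, 4]             # 25% missing
})

# Keep only columns where at most 50% of values are missing
df_clean = df.loc[:, df.isna().mean() <= 0.5]
print(list(df_clean.columns))  # ['a', 'c']
```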
# Rows with any missing value
print(df[df.isna().any(axis=1)])
df.isna().any(axis=1) checks each row and returns True if ANY value in that row is missing. axis=1 means "check across columns for each row." This boolean Series can then filter the DataFrame to show only problematic rows.
This view helps you investigate missing data patterns. Are certain records systematically incomplete? Maybe data from a specific source or time period has quality issues. Seeing the actual rows with missing data often reveals the underlying cause.
Removing Missing Values
# Drop rows with ANY missing value
df_clean = df.dropna()
print(df_clean)
# Drop rows where ALL values are missing
df_clean = df.dropna(how='all')
# Drop rows with missing in specific columns
df_clean = df.dropna(subset=['name', 'age'])
dropna() removes rows containing missing values. By default, it drops rows with ANY missing value. With how='all', it only drops rows where every value is missing—useful for removing completely empty rows while preserving partial data.
The subset parameter lets you focus on critical columns. If 'name' and 'age' are essential but 'city' is optional, use subset=['name', 'age'] to keep rows where those two columns have data, regardless of other columns.
# Drop columns with missing values
df_clean = df.dropna(axis=1)
# Keep rows with at least 2 non-null values
df_clean = df.dropna(thresh=2)
axis=1 switches from dropping rows to dropping columns. This removes any column that has missing values—use carefully as it can eliminate useful columns with just one missing value.
thresh=n keeps rows with at least n non-null values. This is a more nuanced approach: "I can tolerate some missing data, but the row must have enough information to be useful." It's often better than the all-or-nothing approach of how='any'.
Filling Missing Values
# Fill with a constant
df_filled = df.fillna('Unknown')
# Fill specific columns differently
df['age'] = df['age'].fillna(df['age'].mean()) # Mean imputation
df['city'] = df['city'].fillna('Not Specified')
# Fill with different values per column
fill_values = {'name': 'Unknown', 'age': 0, 'city': 'N/A'}
df_filled = df.fillna(fill_values)
fillna() replaces missing values with a specified value. A single value fills all missing cells. For numeric columns, mean imputation (filling with the column's average) is a common statistical technique that preserves the overall distribution.
Different columns often need different fill strategies. Pass a dictionary to fillna() where keys are column names and values are the fill values. Text columns might get 'Unknown' while numeric columns get 0 or the mean.
# Forward fill (use previous value)
df_filled = df.ffill()
# Backward fill (use next value)
df_filled = df.bfill()
# Interpolate numeric values
df['age'] = df['age'].interpolate()
Forward fill (ffill) copies the last valid value forward into missing cells—useful for time series where the previous state continues until changed. Backward fill (bfill) does the opposite, copying the next valid value backward.
interpolate() estimates missing numeric values based on surrounding values using linear interpolation by default. For time series or ordered data, this often produces more realistic estimates than simple mean imputation.
Handling Duplicates
df = pd.DataFrame({
'id': [1, 2, 2, 3, 3, 3],
'name': ['Alice', 'Bob', 'Bob', 'Carol', 'Carol', 'Carol'],
'value': [10, 20, 20, 30, 31, 30]
})
# Check for duplicates
print(df.duplicated()) # Boolean mask
print(df.duplicated().sum()) # Count of duplicates
# Find duplicate rows
print(df[df.duplicated()])
# Check duplicates in specific columns
print(df.duplicated(subset=['id']))
duplicated() marks rows as True if they're duplicates of an earlier row. By default, it compares all columns. The first occurrence is marked False (not a duplicate), while subsequent identical rows are True. Summing gives the count of duplicate rows.
Use subset to check duplicates in specific columns only. df.duplicated(subset=['id']) finds rows where 'id' repeats, even if other columns differ. This is crucial when 'id' should be unique—any duplicates indicate data quality issues.
# Remove duplicates
df_unique = df.drop_duplicates()
# Keep first/last occurrence
df_unique = df.drop_duplicates(keep='first') # Default
df_unique = df.drop_duplicates(keep='last')
df_unique = df.drop_duplicates(keep=False) # Remove all duplicates
# Remove duplicates based on specific columns
df_unique = df.drop_duplicates(subset=['id', 'name'])
drop_duplicates() removes duplicate rows. keep='first' (default) retains the first occurrence, keep='last' retains the last. keep=False removes ALL copies of duplicated rows, leaving only rows that were never duplicated.
Use subset to de-duplicate based on certain columns. If you have multiple records per user but want one per user, use subset=['user_id'] to keep one row per unique user ID. Combine with keep to choose which record to retain.
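A common recipe building on this: keep each user's most recent record by sorting first, then de-duplicating with keep='last' (the columns here are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'user_id': [1, 1, 2, 2],
    'visit': ['2024-01-01', '2024-03-01', '2024-02-01', '2024-01-15'],
    'score': [10, 30, 20, 5]
})

# Sort chronologically, then keep the last (latest) row per user
latest = (df.sort_values('visit')
            .drop_duplicates(subset=['user_id'], keep='last')
            .sort_values('user_id'))
print(latest['score'].tolist())  # [30, 20]
```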
Data Type Conversion
df = pd.DataFrame({
'id': ['1', '2', '3'], # String instead of int
'price': ['10.5', '20.3', '15.8'], # String instead of float
'date': ['2024-01-15', '2024-02-20', '2024-03-25']
})
# Convert types
df['id'] = df['id'].astype(int)
df['price'] = df['price'].astype(float)
Data imported from CSV files often comes in as strings even when it should be numeric. astype() converts a column to a different data type. Converting '1', '2', '3' from strings to integers enables mathematical operations and proper sorting.
Check your data types early with df.dtypes. Incorrect types cause subtle bugs: '10' + '5' = '105' (string concatenation) vs 10 + 5 = 15 (addition). Many Pandas functions behave differently based on type.
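The string-vs-number pitfall is easy to demonstrate:

```python
import pandas as pd

s = pd.Series(['10', '5'])  # strings, not numbers

print(s.sum())              # '105' -- string concatenation
print(s.astype(int).sum())  # 15   -- real addition after conversion
```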
# Convert to datetime
df['date'] = pd.to_datetime(df['date'])
# Handle errors in conversion
df['id'] = pd.to_numeric(df['id'], errors='coerce') # Invalid -> NaN
df['date'] = pd.to_datetime(df['date'], errors='coerce')
# Convert to category for memory efficiency
df['status'] = df['status'].astype('category')
pd.to_datetime() parses date strings into datetime objects, enabling date arithmetic, filtering by date ranges, and extracting components like year and month. pd.to_numeric() handles numeric conversion with better error handling.
errors='coerce' converts invalid values to NaN instead of crashing. This is safer for messy real-world data. Category dtype saves memory when a column has few unique values repeated many times (like 'status' with only 'active', 'pending', 'closed').
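A quick sketch of the memory savings from the category dtype (exact byte counts vary by platform, so the check is relative):

```python
import pandas as pd

# A few unique labels repeated many times: the ideal case for category dtype
statuses = pd.Series(['active', 'pending', 'closed'] * 10_000)

as_object = statuses.memory_usage(deep=True)
as_category = statuses.astype('category').memory_usage(deep=True)

print(as_category < as_object)  # True
```

Internally, the category dtype stores each distinct label once plus a compact array of integer codes, instead of one full Python string object per row.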
String Cleaning
df = pd.DataFrame({
'name': [' Alice ', 'BOB', 'carol', 'DAVID '],
'email': ['alice@GMAIL.com', 'bob@yahoo.COM', 'CAROL@outlook.com', None]
})
# Strip whitespace
df['name'] = df['name'].str.strip()
# Standardize case
df['name'] = df['name'].str.title() # 'Alice', 'Bob', 'Carol'
df['email'] = df['email'].str.lower()
The .str accessor provides vectorized string methods that work on entire columns. str.strip() removes leading and trailing whitespace—essential because ' Alice ' and 'Alice' won't match in comparisons or grouping.
str.title() capitalizes the first letter of each word, str.lower() converts everything to lowercase, and str.upper() to uppercase. Standardizing case ensures 'BOB', 'Bob', and 'bob' are recognized as the same person.
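A quick sketch of why this matters for grouping and deduplication:

```python
import pandas as pd

names = pd.Series(['BOB', 'Bob', 'bob', ' Alice '])

# Without normalization, every variant counts as distinct
print(names.nunique())  # 4

# Strip whitespace and standardize case before comparing or grouping
cleaned = names.str.strip().str.title()
print(cleaned.nunique())  # 2
```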
# Replace patterns
df['name'] = df['name'].str.replace(r'\s+', ' ', regex=True) # Multiple spaces to one
# Extract parts
df['email_domain'] = df['email'].str.split('@').str[1]
# Remove unwanted characters (assumes a 'phone' column exists)
df['phone'] = df['phone'].str.replace(r'[^0-9]', '', regex=True)
str.replace() with regex=True enables pattern matching. The pattern \s+ matches one or more whitespace characters, collapsing "John   Doe" to "John Doe". Regular expressions give you powerful text cleaning capabilities.
str.split('@').str[1] splits emails at @ and takes the second part (domain). The [^0-9] regex matches anything that's NOT a digit, so replacing with '' removes all non-numeric characters—perfect for standardizing phone numbers like '(555) 123-4567' to '5551234567'.
Renaming Columns
# Rename specific columns
df = df.rename(columns={'old_name': 'new_name', 'price': 'unit_price'})
# Rename all columns with function
df.columns = df.columns.str.lower()
df.columns = df.columns.str.replace(' ', '_')
rename() with a dictionary maps old column names to new ones. Only specified columns are renamed; others stay unchanged. This is the safest method because it's explicit and fails if the old name doesn't exist.
df.columns returns an Index of column names that supports string methods. str.lower() converts all column names to lowercase, str.replace(' ', '_') replaces spaces with underscores. These operations affect all columns at once.
# Combined: lowercase and replace spaces
df = df.rename(columns=lambda x: x.lower().replace(' ', '_'))
# Set column names directly
df.columns = ['col1', 'col2', 'col3']
Pass a function to rename() to transform all column names. The lambda receives each column name and returns the new name. This one-liner converts "User Name" to "user_name"—a common convention for Python-friendly column names.
You can also set df.columns directly to a list of new names. The list must have exactly as many elements as there are columns. This is useful when you need completely different names, but be careful—the order must match the existing column order.
Practice: Data Cleaning
Task: Create a DataFrame with some missing values and count how many missing values are in each column.
Show Solution
import pandas as pd
import numpy as np
df = pd.DataFrame({
'A': [1, 2, np.nan, 4],
'B': [np.nan, 2, 3, np.nan],
'C': [1, 2, 3, 4]
})
missing_counts = df.isna().sum()
print(missing_counts)
Task: Create a DataFrame with duplicate rows and remove them.
Show Solution
import pandas as pd
df = pd.DataFrame({
'id': [1, 2, 2, 3],
'name': ['A', 'B', 'B', 'C']
})
df_clean = df.drop_duplicates()
print(f"Before: {len(df)} rows")
print(f"After: {len(df_clean)} rows")
Task: Fill missing values in a numeric column with the column mean.
Show Solution
import pandas as pd
import numpy as np
df = pd.DataFrame({
'score': [85, np.nan, 90, np.nan, 78]
})
df['score'] = df['score'].fillna(df['score'].mean())
print(df)
Task: Clean a column of names by stripping whitespace and converting to title case.
Show Solution
import pandas as pd
df = pd.DataFrame({
'name': [' ALICE ', 'bob', ' Carol ', 'DAVID']
})
df['name'] = df['name'].str.strip().str.title()
print(df)
Task: Clean a messy DataFrame: remove duplicates, fill missing values, and standardize column names.
Show Solution
import pandas as pd
import numpy as np
df = pd.DataFrame({
'User Name': ['Alice', 'Bob', 'Bob', np.nan],
'User Age': [25, np.nan, np.nan, 30],
'User City': ['NYC', 'LA', 'LA', 'Chicago']
})
# 1. Standardize column names
df.columns = df.columns.str.lower().str.replace(' ', '_')
# 2. Remove duplicates
df = df.drop_duplicates()
# 3. Fill missing values
df['user_name'] = df['user_name'].fillna('Unknown')
df['user_age'] = df['user_age'].fillna(df['user_age'].mean())
print(df)
Task: Identify outliers in a price column using IQR method and cap them.
Show Solution
import pandas as pd
import numpy as np
df = pd.DataFrame({
'product': ['A', 'B', 'C', 'D', 'E', 'F'],
'price': [10, 15, 12, 1000, 18, 11] # 1000 is an outlier
})
# Calculate IQR
Q1 = df['price'].quantile(0.25)
Q3 = df['price'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
# Cap outliers
df['price_capped'] = df['price'].clip(lower=lower_bound, upper=upper_bound)
print(f"Lower bound: {lower_bound:.2f}, Upper bound: {upper_bound:.2f}")
print(df)
Task: Parse messy date strings into proper datetime format and handle invalid dates.
Show Solution
import pandas as pd
df = pd.DataFrame({
'event': ['A', 'B', 'C', 'D', 'E'],
'date_str': ['2024-01-15', '01/20/2024', 'Jan 25, 2024', 'invalid', '2024-02-01']
})
# Parse dates with error handling
df['date'] = pd.to_datetime(df['date_str'], errors='coerce')
# Check for invalid dates
invalid_dates = df[df['date'].isna()]
print(f"Invalid dates found: {len(invalid_dates)}")
print(invalid_dates)
# Fill or remove invalid dates
df['date'] = df['date'].fillna(pd.Timestamp('2024-01-01'))
print(df)
Task: Clean inconsistent product names (varying cases, extra spaces).
Show Solution
import pandas as pd
df = pd.DataFrame({
'product': [' laptop ', 'LAPTOP', 'Laptop', ' phone', 'PHONE '],
'price': [999, 899, 799, 599, 649]
})
# Clean product names
df['product_clean'] = (df['product']
.str.strip() # Remove leading/trailing spaces
.str.lower() # Convert to lowercase
.str.replace(r'\s+', ' ', regex=True) # Remove multiple spaces
.str.title() # Title case
)
print(df)
# Group by cleaned names
grouped = df.groupby('product_clean')['price'].agg(['mean', 'count'])
print(grouped)
Merging and Joining
Combine DataFrames horizontally (merge/join) or vertically (concat). These operations are fundamental when working with relational data from multiple sources.
SQL-Style Joins in Pandas
Pandas merge() works like SQL JOIN operations. You can perform inner, outer, left, and right joins to combine DataFrames based on common columns or indices.
Join Types: Inner (only matching rows), outer (all rows), left (all left + matching right), right (all right + matching left).
Basic Merge (Inner Join)
import pandas as pd
# Two DataFrames with common column
df1 = pd.DataFrame({
'id': [1, 2, 3, 4],
'name': ['Alice', 'Bob', 'Carol', 'David']
})
df2 = pd.DataFrame({
'id': [2, 3, 4, 5],
'salary': [60000, 70000, 55000, 65000]
})
We start with two DataFrames that share a common column ('id'). df1 contains employee names with IDs 1-4, while df2 contains salaries for IDs 2-5. Some IDs exist in both (2, 3, 4), some only in df1 (1), and some only in df2 (5).
This scenario is extremely common: you have related data spread across multiple tables or files. Merging combines them into a single DataFrame where you can analyze the relationships between the data.
# Inner join (default) - only matching ids
result = pd.merge(df1, df2, on='id')
print(result)
# id name salary
# 0 2 Bob 60000
# 1 3 Carol 70000
# 2 4 David 55000
pd.merge() combines DataFrames based on a common column. The 'on' parameter specifies which column to match. By default, merge performs an inner join—only rows where the key (id) exists in BOTH DataFrames appear in the result.
Alice (id=1) isn't in the result because id=1 doesn't exist in df2. The id=5 salary record isn't included either because id=5 doesn't exist in df1. Inner join is strict: it requires data from both sides.
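Because inner joins drop rows silently, it can pay to assert your assumptions about the keys. A small sketch using merge's `validate` parameter (the frames are redefined so this runs standalone):

```python
import pandas as pd

df1 = pd.DataFrame({'id': [1, 2, 3, 4],
                    'name': ['Alice', 'Bob', 'Carol', 'David']})
df2 = pd.DataFrame({'id': [2, 3, 4, 5],
                    'salary': [60000, 70000, 55000, 65000]})

# validate='one_to_one' raises MergeError if either side has duplicate keys,
# catching accidental row explosion before it happens
result = pd.merge(df1, df2, on='id', validate='one_to_one')
print(len(result))  # 3 rows: only ids 2, 3, 4 match
```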
Join Types
# Left join - all rows from left, matching from right
result = pd.merge(df1, df2, on='id', how='left')
print(result)
# id name salary
# 0 1 Alice NaN
# 1 2 Bob 60000.0
# 2 3 Carol 70000.0
# 3 4 David 55000.0
Left join (how='left') keeps ALL rows from the left DataFrame (df1) regardless of whether they have a match. Alice (id=1) now appears with NaN for salary because there's no matching record in df2. All employees are preserved.
This is the most common join type in practice. You typically have a "primary" table you want to keep complete, while adding data from a "secondary" table where available. Unmatched rows get NaN values that you can later fill or filter.
# Right join - all rows from right, matching from left
result = pd.merge(df1, df2, on='id', how='right')
print(result)
# Outer join - all rows from both
result = pd.merge(df1, df2, on='id', how='outer')
print(result)
# id name salary
# 0 1 Alice NaN
# 1 2 Bob 60000.0
# 2 3 Carol 70000.0
# 3 4 David 55000.0
# 4 5 NaN 65000.0
Right join is the mirror image of left join—it keeps all rows from the right DataFrame. Outer join (how='outer') keeps ALL rows from BOTH DataFrames, filling NaN where there's no match on either side.
Outer join is useful for finding discrepancies: which employees have no salary record? Which salary records have no employee? The NaN values reveal the mismatches between your data sources.
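The `indicator` parameter makes those discrepancies explicit. A sketch (same frames redefined so it runs standalone):

```python
import pandas as pd

df1 = pd.DataFrame({'id': [1, 2, 3, 4],
                    'name': ['Alice', 'Bob', 'Carol', 'David']})
df2 = pd.DataFrame({'id': [2, 3, 4, 5],
                    'salary': [60000, 70000, 55000, 65000]})

# indicator=True adds a '_merge' column recording each row's origin:
# 'both', 'left_only' (no salary record), or 'right_only' (no employee)
both = pd.merge(df1, df2, on='id', how='outer', indicator=True)
print(both['_merge'].value_counts())
```

Filtering on `_merge == 'left_only'` then gives you exactly the unmatched rows, which is the standard recipe for an anti-join.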
Join Types Reference
| Join Type | Description | Result | Use Case |
|---|---|---|---|
| `inner` | Only matching rows | Intersection of keys | When you need complete data only |
| `left` | All left + matching right | All left rows, NaN for unmatched | Keep all primary records |
| `right` | All right + matching left | All right rows, NaN for unmatched | Keep all secondary records |
| `outer` | All rows from both | Union of keys | Find all data, including mismatches |
| `cross` | Every combination | Cartesian product | Generate all possible pairs |
Merging on Different Column Names
df1 = pd.DataFrame({
'employee_id': [1, 2, 3],
'name': ['Alice', 'Bob', 'Carol']
})
df2 = pd.DataFrame({
'emp_id': [1, 2, 3],
'department': ['Sales', 'IT', 'HR']
})
Real-world data rarely has perfectly matching column names. Here, df1 uses 'employee_id' while df2 uses 'emp_id' for the same concept. This happens when data comes from different systems, databases, or when conventions vary across teams.
Pandas handles this gracefully—you don't need to rename columns before merging. Just tell merge which columns correspond to each other.
# Columns have different names
result = pd.merge(df1, df2, left_on='employee_id', right_on='emp_id')
print(result)
# Drop the redundant column
result = result.drop(columns=['emp_id'])
Use left_on and right_on to specify different column names: left_on is the column from the first (left) DataFrame, right_on from the second (right). Pandas matches rows where employee_id equals emp_id.
After merging, both columns appear in the result (they're technically different columns even though they contain the same values). Use drop() to remove the redundant one for cleaner output.
Merging on Multiple Columns
df1 = pd.DataFrame({
'year': [2023, 2023, 2024, 2024],
'quarter': ['Q1', 'Q2', 'Q1', 'Q2'],
'revenue': [100, 150, 120, 180]
})
df2 = pd.DataFrame({
'year': [2023, 2023, 2024, 2024],
'quarter': ['Q1', 'Q2', 'Q1', 'Q2'],
'expenses': [80, 90, 85, 95]
})
Sometimes a single column isn't enough to uniquely identify a row. Here, neither 'year' nor 'quarter' alone uniquely identifies a record—we need BOTH: 2023-Q1 is different from 2023-Q2 and from 2024-Q1. This is called a composite key.
Financial data, time series, and hierarchical data commonly require composite keys. Sales by region and month, scores by student and subject, inventory by warehouse and product—all need multiple columns to match correctly.
# Merge on multiple columns
result = pd.merge(df1, df2, on=['year', 'quarter'])
print(result)
# year quarter revenue expenses
# 0 2023 Q1 100 80
# 1 2023 Q2 150 90
# 2 2024 Q1 120 85
# 3 2024 Q2 180 95
Pass a list of column names to the 'on' parameter. Pandas will only match rows where ALL specified columns match. 2023-Q1 revenue joins with 2023-Q1 expenses—not with 2023-Q2 or 2024-Q1.
Now you can analyze revenue vs expenses for each period. The resulting DataFrame has one row per unique year-quarter combination with data from both sources side by side.
Concatenating DataFrames
# Vertical concatenation (stacking)
df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})
result = pd.concat([df1, df2])
print(result)
# A B
# 0 1 3
# 1 2 4
# 0 5 7
# 1 6 8
concat() stacks DataFrames vertically (by default)—it places one below the other. This is different from merge, which combines columns side-by-side based on matching keys. Use concat when you have data split across multiple files with identical structure.
Notice the index values (0, 1, 0, 1)—each DataFrame kept its original index. This can cause issues if you try to access by index, as there are now duplicate index values.
# Reset index after concat
result = pd.concat([df1, df2], ignore_index=True)
print(result)
# A B
# 0 1 3
# 1 2 4
# 2 5 7
# 3 6 8
# Horizontal concatenation
result = pd.concat([df1, df2], axis=1)
print(result)
# A B A B
# 0 1 3 5 7
# 1 2 4 6 8
ignore_index=True creates a fresh sequential index (0, 1, 2, 3)—usually what you want when combining data from the same source split across files (like monthly reports). The original indices are discarded.
axis=1 concatenates horizontally, placing DataFrames side by side. This is useful when you have different columns for the same rows. Note: this creates duplicate column names (A, B, A, B)—you'll likely want to rename them afterward.
Joining on Index
df1 = pd.DataFrame(
{'A': [1, 2, 3]},
index=['a', 'b', 'c']
)
df2 = pd.DataFrame(
{'B': [4, 5, 6]},
index=['a', 'b', 'd']
)
Sometimes your key isn't a column—it's the index itself. Here, df1 and df2 use string indices ('a', 'b', 'c' and 'a', 'b', 'd') rather than a column to identify rows. This is common with time series (date index) or when you've already set a meaningful index.
Both DataFrames share indices 'a' and 'b', while 'c' is only in df1 and 'd' is only in df2. The join operation will need to handle these mismatches.
# Join on index
result = df1.join(df2, how='outer')
print(result)
# A B
# a 1.0 4.0
# b 2.0 5.0
# c 3.0 NaN
# d NaN 6.0
# Merge on index
result = pd.merge(df1, df2, left_index=True, right_index=True, how='outer')
The join() method is designed specifically for index-based joins and offers cleaner syntax. By default it does a left join; here we use how='outer' to keep all indices from both DataFrames, with NaN where data is missing.
You can also use merge() with left_index=True and right_index=True to join on indices. This is more verbose but offers more flexibility, like combining index-based and column-based joins (e.g., left_index=True, right_on='column').
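A sketch of that mixed form, joining one frame's index against another frame's column (the frames here are illustrative):

```python
import pandas as pd

# Left frame keyed by its index, right frame keyed by a column
rates = pd.DataFrame({'rate': [0.1, 0.2]}, index=['std', 'pro'])
orders = pd.DataFrame({'plan': ['std', 'pro', 'std'],
                       'amount': [100, 200, 300]})

# left_index=True pairs the index of `rates` with orders['plan']
result = pd.merge(rates, orders, left_index=True, right_on='plan')
print(result)
```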
Practice: Merging and Joining
Task: Merge two DataFrames on a common 'id' column using inner join.
Show Solution
import pandas as pd
orders = pd.DataFrame({
'id': [1, 2, 3],
'product': ['A', 'B', 'C']
})
prices = pd.DataFrame({
'id': [1, 2, 4],
'price': [10, 20, 40]
})
result = pd.merge(orders, prices, on='id')
print(result)
Task: Stack two DataFrames with the same columns on top of each other.
Show Solution
import pandas as pd
df1 = pd.DataFrame({'name': ['Alice', 'Bob'], 'age': [25, 30]})
df2 = pd.DataFrame({'name': ['Carol', 'David'], 'age': [35, 28]})
result = pd.concat([df1, df2], ignore_index=True)
print(result)
Task: Perform a left join and identify rows that did not find a match.
Show Solution
import pandas as pd
employees = pd.DataFrame({
'id': [1, 2, 3, 4],
'name': ['Alice', 'Bob', 'Carol', 'David']
})
departments = pd.DataFrame({
'id': [1, 2],
'dept': ['Sales', 'IT']
})
result = pd.merge(employees, departments, on='id', how='left')
unmatched = result[result['dept'].isna()]
print("Unmatched employees:")
print(unmatched)
Task: Merge two DataFrames using composite keys (year and month).
Show Solution
import pandas as pd
sales = pd.DataFrame({
'year': [2023, 2023, 2024],
'month': [1, 2, 1],
'revenue': [1000, 1200, 1100]
})
targets = pd.DataFrame({
'year': [2023, 2023, 2024],
'month': [1, 2, 1],
'target': [900, 1100, 1000]
})
result = pd.merge(sales, targets, on=['year', 'month'])
result['achieved'] = result['revenue'] >= result['target']
print(result)
Task: Stack three monthly sales DataFrames vertically with a new index.
Show Solution
import pandas as pd
jan = pd.DataFrame({'product': ['A', 'B'], 'sales': [100, 150]})
feb = pd.DataFrame({'product': ['A', 'B'], 'sales': [120, 160]})
mar = pd.DataFrame({'product': ['A', 'B'], 'sales': [110, 140]})
# Concatenate with ignore_index to create fresh index
combined = pd.concat([jan, feb, mar], ignore_index=True)
print(combined)
# Or use keys to identify source month
combined_keyed = pd.concat([jan, feb, mar], keys=['Jan', 'Feb', 'Mar'])
print(combined_keyed)
Task: Find customers who haven't placed any orders (anti-join).
Show Solution
import pandas as pd
customers = pd.DataFrame({
'customer_id': [1, 2, 3, 4, 5],
'name': ['Alice', 'Bob', 'Carol', 'David', 'Eve']
})
orders = pd.DataFrame({
'order_id': [101, 102, 103],
'customer_id': [1, 3, 1],
'amount': [100, 200, 150]
})
# Method 1: Left merge with indicator
merged = customers.merge(orders, on='customer_id', how='left', indicator=True)
no_orders = merged[merged['_merge'] == 'left_only'][['customer_id', 'name']]
print("Customers with no orders:")
print(no_orders)
# Method 2: Using isin()
no_orders_v2 = customers[~customers['customer_id'].isin(orders['customer_id'])]
print(no_orders_v2)
Advanced Operations
Master advanced Pandas operations including pivot tables, melting, window functions, and apply/map operations. These techniques unlock sophisticated data transformations.
Pivot Tables
import pandas as pd
df = pd.DataFrame({
'date': ['2024-01', '2024-01', '2024-02', '2024-02'],
'region': ['East', 'West', 'East', 'West'],
'sales': [100, 150, 120, 180]
})
# Create pivot table
pivot = pd.pivot_table(
df,
values='sales',
index='date',
columns='region',
aggfunc='sum'
)
print(pivot)
# region East West
# date
# 2024-01 100 150
# 2024-02 120 180
Pivot tables transform "long" data into "wide" format while aggregating. Here, each row in the original data represents one region's sales for one date. The pivot table reorganizes this so dates become rows, regions become columns, and sales values fill the cells.
The key parameters: 'values' is what you're aggregating (sales), 'index' becomes row labels (date), 'columns' becomes column headers (region), and 'aggfunc' is the aggregation function (sum, mean, count, etc.).
# Multiple aggregations
pivot = pd.pivot_table(
df,
values='sales',
index='date',
columns='region',
aggfunc=['sum', 'mean'],
margins=True # Add row/column totals
)
Pass a list of functions to aggfunc to get multiple summary statistics in one pivot table. This creates hierarchical columns where the first level is the function and the second level is the region.
margins=True adds "All" row and column showing totals. This is like the Grand Total in Excel pivot tables—essential for reports where you need to see both individual values and overall summaries.
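One more parameter worth knowing: when some index/column combinations have no data, the pivot table shows NaN. `fill_value` substitutes a default instead (a sketch with a deliberately missing combination):

```python
import pandas as pd

df = pd.DataFrame({
    'date': ['2024-01', '2024-01', '2024-02'],
    'region': ['East', 'West', 'East'],  # no West sales in 2024-02
    'sales': [100, 150, 120]
})

# fill_value=0 replaces the NaN produced by the missing 2024-02/West cell
pivot = pd.pivot_table(df, values='sales', index='date',
                       columns='region', aggfunc='sum', fill_value=0)
print(pivot)
```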
Melt (Unpivot)
# Wide format data
df_wide = pd.DataFrame({
'name': ['Alice', 'Bob'],
'math': [90, 85],
'science': [88, 92],
'english': [95, 78]
})
This "wide" format is human-readable—each student has one row with all their scores as separate columns. However, many visualization libraries and statistical functions expect "long" format where each observation (student-subject-score) is its own row.
Wide format is common in spreadsheets: columns represent different measurements of the same entity. But for analysis, long format is often more flexible—you can easily filter by subject, group by subject, or plot all subjects at once.
# Melt to long format
df_long = pd.melt(
df_wide,
id_vars=['name'],
value_vars=['math', 'science', 'english'],
var_name='subject',
value_name='score'
)
print(df_long)
# name subject score
# 0 Alice math 90
# 1 Bob math 85
# 2 Alice science 88
# 3 Bob science 92
# 4 Alice english 95
# 5 Bob english 78
melt() "unpivots" wide data into long format. id_vars specifies columns to keep as-is (name stays as a column). value_vars lists columns to melt (math, science, english become values in the new 'subject' column). Their values go into the 'score' column.
Now you have 6 rows instead of 2—one for each student-subject combination. This format lets you easily create box plots by subject, calculate average by subject, or apply the same analysis function to all scores regardless of subject.
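Melt is reversible: `pivot()` takes long data back to wide. A sketch of the round trip (long frame redefined so it runs standalone):

```python
import pandas as pd

df_long = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Alice', 'Bob'],
    'subject': ['math', 'math', 'science', 'science'],
    'score': [90, 85, 88, 92]
})

# pivot() is the inverse of melt(): unique 'subject' values become columns
df_wide = df_long.pivot(index='name', columns='subject', values='score')
print(df_wide)
```

Unlike pivot_table(), pivot() does no aggregation and raises an error if any index/column pair appears more than once.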
Apply and Map
import pandas as pd
df = pd.DataFrame({
'name': ['Alice', 'Bob', 'Carol'],
'score': [85, 92, 78]
})
# Apply function to column
df['grade'] = df['score'].apply(lambda x: 'A' if x >= 90 else 'B' if x >= 80 else 'C')
# Apply function to entire row
def classify_row(row):
if row['score'] >= 90:
return f"{row['name']}: Excellent"
return f"{row['name']}: Good"
df['status'] = df.apply(classify_row, axis=1)
apply() runs a function on each element of a Series or each row/column of a DataFrame. The lambda function converts scores to letter grades using conditional logic. apply() is incredibly flexible—any Python function can be used.
With axis=1, apply() passes each row as a Series to your function. This lets you use multiple columns to compute the result. The function receives the entire row and can access any column by name (row['score'], row['name']).
# Map values using dictionary
grade_map = {85: 'B', 92: 'A', 78: 'C'}
df['mapped_grade'] = df['score'].map(grade_map)
# Applymap for element-wise operations (renamed to map in newer versions)
df_numeric = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df_squared = df_numeric.map(lambda x: x ** 2) # Or applymap() in older versions
map() on a Series replaces values according to a dictionary or function. It's perfect for categorical encoding or lookup tables—faster than apply() for simple substitutions. If a value isn't in the dictionary, it becomes NaN.
DataFrame.map() (formerly applymap()) applies a function to every element in a DataFrame. Here it squares all values. This is useful for element-wise transformations that should apply uniformly across all cells, like rounding or formatting.
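For simple range-based bucketing like the grade example, a vectorized alternative to apply() is pd.cut, which is faster on large Series (a sketch; the bin edges are assumptions):

```python
import pandas as pd

scores = pd.Series([85, 92, 78])

# Each value falls into a bin; the right edge is inclusive by default,
# so 85 lands in (79, 89] -> 'B', 92 in (89, 100] -> 'A', 78 in (0, 79] -> 'C'
grades = pd.cut(scores, bins=[0, 79, 89, 100], labels=['C', 'B', 'A'])
print(list(grades))  # ['B', 'A', 'C']
```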
Window Functions
import pandas as pd
df = pd.DataFrame({
'date': pd.date_range('2024-01-01', periods=10),
'value': [10, 12, 8, 15, 14, 18, 20, 16, 22, 25]
})
# Rolling window (moving average)
df['rolling_mean'] = df['value'].rolling(window=3).mean()
df['rolling_sum'] = df['value'].rolling(window=3).sum()
Rolling windows calculate statistics over a fixed-size sliding window. With window=3, each value is the mean (or sum) of that row and the two previous rows. The first two rows have NaN because there aren't enough prior values to fill the window.
Moving averages smooth out noise in time series data, revealing underlying trends. A 3-day rolling mean shows the average of the last 3 days, which is less volatile than daily values. Longer windows (7, 30 days) create smoother trends.
# Expanding window (cumulative)
df['cumulative_mean'] = df['value'].expanding().mean()
df['cumulative_max'] = df['value'].expanding().max()
# Exponential weighted
df['ewm_mean'] = df['value'].ewm(span=3).mean()
print(df[['date', 'value', 'rolling_mean', 'cumulative_mean']])
Expanding windows include ALL previous values, growing as you move down. Cumulative mean at row 5 is the mean of rows 0-5. Cumulative max shows the highest value seen so far—useful for tracking "best performance to date" or setting benchmarks.
Exponential weighted moving average (ewm) gives more weight to recent values, making it more responsive to recent changes than simple rolling mean. The 'span' parameter controls how quickly the influence of older values decays.
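The leading NaN rows from rolling() can be avoided with `min_periods`, which computes a partial statistic as soon as enough values exist (a minimal sketch):

```python
import pandas as pd

s = pd.Series([10, 12, 8, 15, 14])

# Default: the first window-1 rows are NaN
default = s.rolling(window=3).mean()

# min_periods=1: compute the mean over however many values are available
partial = s.rolling(window=3, min_periods=1).mean()
print(partial.tolist())
```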
Shift and Diff
import pandas as pd
df = pd.DataFrame({
'date': pd.date_range('2024-01-01', periods=5),
'price': [100, 102, 101, 105, 108]
})
# Shift values
df['previous_price'] = df['price'].shift(1) # Previous row
df['next_price'] = df['price'].shift(-1) # Next row
shift() moves values up or down by a specified number of rows. shift(1) moves values down by one row, so 'previous_price' contains yesterday's price for each day. The first row becomes NaN because there's no previous value.
shift(-1) moves values up, giving you the "next" value. This is useful for calculating "look-ahead" metrics like "next day's price" or for creating lag features in time series machine learning models.
# Calculate difference
df['price_change'] = df['price'].diff() # Current - previous
# Percentage change
df['pct_change'] = df['price'].pct_change() * 100
print(df)
diff() calculates the difference between the current and previous value. It's shorthand for df['price'] - df['price'].shift(1). This shows absolute change—how much the price increased or decreased from yesterday.
pct_change() calculates the percentage change from the previous value: (current - previous) / previous. Multiply by 100 to express as a percentage. This is essential for financial analysis where you care about relative returns, not absolute price movements.
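When the data contains multiple entities (several stocks, several users), shift and diff must not leak values across group boundaries. Combining groupby() with shift() keeps lags within each group (a sketch with illustrative data):

```python
import pandas as pd

df = pd.DataFrame({
    'ticker': ['A', 'A', 'A', 'B', 'B', 'B'],
    'price': [100, 102, 101, 50, 55, 54]
})

# groupby + shift: B's first row gets NaN instead of borrowing A's last price
df['prev_price'] = df.groupby('ticker')['price'].shift(1)
df['change'] = df['price'] - df['prev_price']
print(df)
```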
Cross Tabulation
import pandas as pd
df = pd.DataFrame({
'gender': ['M', 'F', 'M', 'F', 'M', 'F'],
'department': ['Sales', 'Sales', 'IT', 'IT', 'HR', 'HR'],
'salary': [50000, 55000, 60000, 62000, 45000, 48000]
})
# Cross tabulation (count)
cross_tab = pd.crosstab(df['gender'], df['department'])
print(cross_tab)
# department HR IT Sales
# gender
# F 1 1 1
# M 1 1 1
crosstab() creates a contingency table showing the frequency of combinations between two categorical variables. Here it counts how many employees of each gender work in each department. Each cell shows the count for that gender-department combination.
This is invaluable for exploring relationships between categorical variables. Are certain departments male-dominated? Do purchase patterns vary by region and product type? Cross tabulation answers these questions at a glance.
# With aggregation
cross_tab = pd.crosstab(
df['gender'],
df['department'],
values=df['salary'],
aggfunc='mean'
)
print(cross_tab)
Add 'values' and 'aggfunc' parameters to aggregate a numeric column instead of counting. This shows mean salary by gender and department—perfect for identifying pay disparities across demographic and organizational dimensions.
You can use any aggregation function: 'sum' for total sales by region and product, 'max' for best score by student and subject, or 'mean' for average ratings by movie genre and decade. It's a quick way to summarize data by two dimensions.
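crosstab() can also report proportions instead of raw counts via `normalize` (a sketch reusing the gender/department data):

```python
import pandas as pd

df = pd.DataFrame({
    'gender': ['M', 'F', 'M', 'F', 'M', 'F'],
    'department': ['Sales', 'Sales', 'IT', 'IT', 'HR', 'HR']
})

# normalize='index' expresses each row as proportions that sum to 1;
# 'columns' normalizes per column, 'all' over the whole table
pct = pd.crosstab(df['gender'], df['department'], normalize='index')
print(pct)
```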
Working with Dates
import pandas as pd
df = pd.DataFrame({
'date': pd.to_datetime(['2024-01-15', '2024-03-20', '2024-06-10', '2024-12-25']),
'value': [100, 150, 200, 250]
})
# Extract date components
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day'] = df['date'].dt.day
df['weekday'] = df['date'].dt.day_name()
df['quarter'] = df['date'].dt.quarter
df['week'] = df['date'].dt.isocalendar().week
The dt accessor provides access to datetime components, similar to how str works for strings. You can extract year, month, day, weekday name, quarter, week number, and more. These extracted features are invaluable for time-based analysis and grouping.
Each extraction returns a Series you can store as a new column. This is common preprocessing for time series: you might analyze sales patterns by day of week, or group transactions by month. The original datetime column stays intact while you gain analytical dimensions.
# Date arithmetic
df['days_from_today'] = (pd.Timestamp.today() - df['date']).dt.days
# Filter by date range
mask = (df['date'] >= '2024-01-01') & (df['date'] <= '2024-06-30')
first_half = df[mask]
# Resample time series
df_monthly = df.set_index('date').resample('M')['value'].sum()
Date arithmetic works naturally—subtracting dates gives a Timedelta, and .dt.days extracts the number of days. Filtering by date range uses standard boolean indexing; Pandas automatically parses date strings like '2024-01-01' for comparison.
resample() is the time series equivalent of groupby(). Set the date as index, then resample at a frequency: 'M' for monthly, 'W' for weekly, 'D' for daily, 'Q' for quarterly. This groups data into time buckets and lets you aggregate (sum, mean, count) within each period.
Practice: Advanced Operations
Task: Create a pivot table showing sales by region and product.
Show Solution
import pandas as pd
df = pd.DataFrame({
'region': ['East', 'West', 'East', 'West'],
'product': ['A', 'A', 'B', 'B'],
'sales': [100, 150, 200, 250]
})
pivot = pd.pivot_table(df, values='sales', index='region', columns='product')
print(pivot)
Task: Calculate a 3-day rolling average for stock prices.
Show Solution
import pandas as pd
df = pd.DataFrame({
'date': pd.date_range('2024-01-01', periods=7),
'price': [100, 102, 98, 105, 103, 108, 110]
})
df['rolling_avg'] = df['price'].rolling(window=3).mean()
print(df)
Task: Use apply() to create a 'grade' column based on score ranges.
Show Solution
import pandas as pd
df = pd.DataFrame({
'name': ['Alice', 'Bob', 'Carol'],
'score': [95, 82, 68]
})
def assign_grade(score):
if score >= 90: return 'A'
elif score >= 80: return 'B'
elif score >= 70: return 'C'
else: return 'D'
df['grade'] = df['score'].apply(assign_grade)
print(df)
Task: Transform wide data to long format and calculate average score per subject.
Show Solution
import pandas as pd
df_wide = pd.DataFrame({
'student': ['Alice', 'Bob', 'Carol'],
'math': [90, 85, 92],
'science': [88, 90, 85],
'english': [95, 78, 88]
})
df_long = pd.melt(
df_wide,
id_vars=['student'],
var_name='subject',
value_name='score'
)
avg_by_subject = df_long.groupby('subject')['score'].mean()
print(avg_by_subject)
Task: Calculate a 3-day rolling average for daily sales data.
Show Solution
import pandas as pd
df = pd.DataFrame({
'date': pd.date_range('2024-01-01', periods=10),
'sales': [100, 120, 90, 150, 130, 110, 140, 160, 125, 135]
})
# Calculate 3-day rolling average
df['rolling_avg_3d'] = df['sales'].rolling(window=3).mean()
# Calculate rolling sum
df['rolling_sum_3d'] = df['sales'].rolling(window=3).sum()
print(df)
Best Practices for Pandas
Do This
- Use vectorized operations over loops
- Use loc/iloc for explicit selection
- Chain operations when possible
- Use appropriate dtypes for memory
- Check for duplicates and nulls early
- Use query() for readable filters
Avoid This
- Iterating with for loops over rows
- Chained indexing (df['a']['b'])
- Modifying sliced DataFrames
- Using object dtype for numerics
- Ignoring SettingWithCopyWarning
- Using apply() when vectorized works
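The first do/don't pair in practice (a minimal sketch; on real data the vectorized form is typically orders of magnitude faster):

```python
import pandas as pd

df = pd.DataFrame({'price': [100, 200, 300], 'qty': [2, 3, 4]})

# Avoid: a Python-level loop over rows
totals_loop = [row['price'] * row['qty'] for _, row in df.iterrows()]

# Prefer: one vectorized expression over whole columns
df['total'] = df['price'] * df['qty']

print(df['total'].tolist())  # [200, 600, 1200]
```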
Common Mistakes and Solutions
| Mistake | Problem | Solution |
|---|---|---|
| `df['a']['b'] = value` | Chained assignment may not work | `df.loc[row, 'a'] = value` |
| `for i, row in df.iterrows()` | Very slow for large DataFrames | Use vectorized operations or `apply()` |
| `df[df['a'] > 5 and df['b'] < 10]` | Python `and`/`or` does not work | `df[(df['a'] > 5) & (df['b'] < 10)]` |
| `df.loc[0:5]` expecting 5 rows | `loc` is inclusive, gets 6 rows | Use `df.iloc[0:5]` for 5 rows |
| `pd.concat(df1, df2)` | `concat` expects a list | `pd.concat([df1, df2])` |
| Modifying original after slice | `SettingWithCopyWarning` | Use `.copy()` when creating subset |
Quick Reference
A comprehensive reference of the most commonly used Pandas functions and methods. Keep this handy while coding.
DataFrame Creation
| Function | Description | Example |
|---|---|---|
| `pd.DataFrame(dict)` | Create from dictionary | `pd.DataFrame({'a': [1,2], 'b': [3,4]})` |
| `pd.read_csv()` | Read CSV file | `pd.read_csv('data.csv')` |
| `pd.read_excel()` | Read Excel file | `pd.read_excel('data.xlsx')` |
| `pd.read_json()` | Read JSON file/string | `pd.read_json('data.json')` |
| `pd.read_sql()` | Read SQL query | `pd.read_sql(query, conn)` |
Selection and Indexing
| Operation | Syntax | Returns |
|---|---|---|
| Single column | `df['col']` | Series |
| Multiple columns | `df[['col1', 'col2']]` | DataFrame |
| Rows by label | `df.loc[label]` | Series/DataFrame |
| Rows by position | `df.iloc[pos]` | Series/DataFrame |
| Specific cell | `df.at[row, col]` | Scalar |
| Boolean filter | `df[df['col'] > val]` | DataFrame |
| Query filter | `df.query('col > val')` | DataFrame |
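The last two rows are equivalent ways to filter. A minimal sketch with hypothetical data showing that a boolean mask and `query()` return the same result:

```python
import pandas as pd

df = pd.DataFrame({'name': ['Alice', 'Bob', 'Carol'],
                   'age': [25, 30, 35]})

# Boolean-mask filter vs the more readable query() string
mask = df[df['age'] >= 30]
q = df.query('age >= 30')

print(mask['name'].tolist())  # ['Bob', 'Carol']
print(q.equals(mask))         # True
```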
Data Manipulation
| Method | Description | Example |
|---|---|---|
| `sort_values()` | Sort by column(s) | `df.sort_values('col', ascending=False)` |
| `drop()` | Remove rows/columns | `df.drop(columns=['col'])` |
| `rename()` | Rename columns | `df.rename(columns={'old': 'new'})` |
| `dropna()` | Remove missing values | `df.dropna(subset=['col'])` |
| `fillna()` | Fill missing values | `df.fillna(0)` |
| `drop_duplicates()` | Remove duplicate rows | `df.drop_duplicates()` |
| `astype()` | Convert data type | `df['col'].astype(int)` |
| `replace()` | Replace values | `df['col'].replace({'a': 'b'})` |
Aggregation and Grouping
| Method | Description | Example |
|---|---|---|
| `groupby()` | Group by column(s) | `df.groupby('cat')['val'].sum()` |
| `agg()` | Multiple aggregations | `df.groupby('cat').agg(['sum', 'mean'])` |
| `transform()` | Group transform | `df.groupby('cat')['val'].transform('mean')` |
| `pivot_table()` | Create pivot table | `pd.pivot_table(df, values='v', index='r', columns='c')` |
| `crosstab()` | Cross tabulation | `pd.crosstab(df['a'], df['b'])` |
| `value_counts()` | Count unique values | `df['col'].value_counts()` |
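The difference between `agg` and `transform` is the shape of the result: `agg` collapses each group to one row, while `transform` broadcasts the group result back to the original row count. A small sketch with hypothetical data:

```python
import pandas as pd

df = pd.DataFrame({'cat': ['x', 'x', 'y'], 'val': [10, 20, 60]})

# agg/sum collapses each group to a single value
print(df.groupby('cat')['val'].sum().to_dict())  # {'x': 30, 'y': 60}

# transform keeps the original shape, so it can become a new column
df['group_mean'] = df.groupby('cat')['val'].transform('mean')
print(df['group_mean'].tolist())  # [15.0, 15.0, 60.0]
```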
Combining DataFrames
| Function | Description | Example |
|---|---|---|
| `pd.merge()` | SQL-style join | `pd.merge(df1, df2, on='key', how='left')` |
| `pd.concat()` | Stack DataFrames | `pd.concat([df1, df2], ignore_index=True)` |
| `df.join()` | Join on index | `df1.join(df2, how='outer')` |
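A minimal sketch of the first two functions on hypothetical frames. Note how a left join keeps every row of the left frame (unmatched keys become NaN), and how `concat` takes a list:

```python
import pandas as pd

left = pd.DataFrame({'key': [1, 2, 3], 'a': ['x', 'y', 'z']})
right = pd.DataFrame({'key': [2, 3, 4], 'b': ['p', 'q', 'r']})

# Left join: all rows of `left` survive; key 1 has no match, so b is NaN
merged = pd.merge(left, right, on='key', how='left')
print(merged['b'].tolist())  # [nan, 'p', 'q']

# concat stacks frames vertically; note the *list* argument
stacked = pd.concat([left, left], ignore_index=True)
print(len(stacked))  # 6
```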
Ready-to-Use Templates
import pandas as pd
# Load data
df = pd.read_csv('data.csv')
# Quick exploration
print(f"Shape: {df.shape}")
print(f"Columns: {list(df.columns)}")
print(f"\nData types:\n{df.dtypes}")
print(f"\nMissing values:\n{df.isna().sum()}")
print(f"\nBasic stats:\n{df.describe()}")
print(f"\nFirst 5 rows:\n{df.head()}")
import pandas as pd
def clean_dataframe(df):
    # Make a copy so the caller's DataFrame is untouched
    df = df.copy()
    # Standardize column names
    df.columns = df.columns.str.lower().str.replace(' ', '_')
    # Remove duplicates
    df = df.drop_duplicates()
    # Handle missing values: median for numerics, 'Unknown' for text
    for col in df.select_dtypes(include=['float64', 'int64']).columns:
        df[col] = df[col].fillna(df[col].median())
    for col in df.select_dtypes(include=['object']).columns:
        df[col] = df[col].fillna('Unknown')
    return df
df_clean = clean_dataframe(df)
import pandas as pd
# Comprehensive groupby analysis
summary = df.groupby('category').agg(
count=('value', 'count'),
total=('value', 'sum'),
average=('value', 'mean'),
median=('value', 'median'),
std_dev=('value', 'std'),
minimum=('value', 'min'),
maximum=('value', 'max')
).round(2)
# Sort by total descending
summary = summary.sort_values('total', ascending=False)
# Add percentage column
summary['pct_of_total'] = (summary['total'] / summary['total'].sum() * 100).round(1)
print(summary)
Real-World Examples
Let's apply everything we've learned to solve realistic data analysis problems that you'll encounter in actual projects.
Example 1: E-Commerce Sales Analysis
import pandas as pd
import numpy as np
# Create sample e-commerce data
np.random.seed(42)
dates = pd.date_range('2024-01-01', periods=365, freq='D')
products = ['Laptop', 'Phone', 'Tablet', 'Headphones', 'Keyboard']
regions = ['North', 'South', 'East', 'West']
sales_data = pd.DataFrame({
'date': np.random.choice(dates, 1000),
'product': np.random.choice(products, 1000),
'region': np.random.choice(regions, 1000),
'quantity': np.random.randint(1, 10, 1000),
'unit_price': np.random.choice([999, 599, 399, 99, 49], 1000),
'customer_id': np.random.randint(1000, 9999, 1000)
})
sales_data['revenue'] = sales_data['quantity'] * sales_data['unit_price']
# 1. Monthly Revenue Trend
sales_data['month'] = sales_data['date'].dt.to_period('M')
monthly_revenue = sales_data.groupby('month')['revenue'].sum().reset_index()
monthly_revenue['month'] = monthly_revenue['month'].astype(str)
print("Monthly Revenue Trend:")
print(monthly_revenue.head(6))
# 2. Top Products by Revenue
top_products = sales_data.groupby('product').agg({
'revenue': 'sum',
'quantity': 'sum',
'customer_id': 'nunique' # Unique customers
}).sort_values('revenue', ascending=False)
top_products.columns = ['Total Revenue', 'Units Sold', 'Unique Customers']
print("\nTop Products by Revenue:")
print(top_products)
# 3. Regional Performance
regional_stats = sales_data.groupby('region').agg({
'revenue': ['sum', 'mean'],
'quantity': 'sum',
'customer_id': 'nunique'
}).round(2)
regional_stats.columns = ['Total Revenue', 'Avg Order Value', 'Units Sold', 'Customers']
print("\nRegional Performance:")
print(regional_stats)
# 4. Best Selling Day of Week
sales_data['day_of_week'] = sales_data['date'].dt.day_name()
day_sales = sales_data.groupby('day_of_week')['revenue'].sum()
day_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
day_sales = day_sales.reindex(day_order)
print("\nSales by Day of Week:")
print(day_sales)
# 5. Customer Segmentation by Purchase Value
customer_totals = sales_data.groupby('customer_id')['revenue'].sum().reset_index()
customer_totals['segment'] = pd.cut(
customer_totals['revenue'],
bins=[0, 500, 2000, np.inf],
labels=['Low', 'Medium', 'High']
)
segment_counts = customer_totals['segment'].value_counts()
print("\nCustomer Segments:")
print(segment_counts)
Example 2: Student Grades Analysis
import pandas as pd
import numpy as np
# Create sample student data
np.random.seed(42)
students = pd.DataFrame({
'student_id': range(1, 51),
'name': [f'Student_{i}' for i in range(1, 51)],
'class': np.random.choice(['10A', '10B', '10C'], 50),
'math': np.random.randint(40, 100, 50),
'science': np.random.randint(45, 100, 50),
'english': np.random.randint(50, 100, 50),
'history': np.random.randint(35, 100, 50),
'attendance_pct': np.random.uniform(70, 100, 50).round(1)
})
# 1. Calculate Total and Average Score
students['total'] = students[['math', 'science', 'english', 'history']].sum(axis=1)
students['average'] = students[['math', 'science', 'english', 'history']].mean(axis=1).round(2)
# 2. Assign Grades
def assign_grade(avg):
    if avg >= 90: return 'A+'
    elif avg >= 80: return 'A'
    elif avg >= 70: return 'B'
    elif avg >= 60: return 'C'
    elif avg >= 50: return 'D'
    else: return 'F'
students['grade'] = students['average'].apply(assign_grade)
# 3. Class Performance Summary
class_summary = students.groupby('class').agg({
'average': ['mean', 'std', 'min', 'max'],
'attendance_pct': 'mean',
'student_id': 'count'
}).round(2)
class_summary.columns = ['Avg Score', 'Std Dev', 'Min', 'Max', 'Avg Attendance', 'Students']
print("Class Performance Summary:")
print(class_summary)
# 4. Grade Distribution
grade_dist = students['grade'].value_counts().sort_index()
print("\nGrade Distribution:")
print(grade_dist)
# 5. Students Needing Attention (Low Attendance OR Failing)
at_risk = students[
(students['attendance_pct'] < 80) | (students['average'] < 60)
]
print(f"\nStudents At Risk: {len(at_risk)}")
print(at_risk[['name', 'class', 'average', 'grade', 'attendance_pct']])
# 6. Subject-wise Analysis
subject_stats = students[['math', 'science', 'english', 'history']].describe()
print("\nSubject Statistics:")
print(subject_stats.loc[['mean', 'std', 'min', 'max']].round(2))
# 7. Correlation Analysis
correlation = students[['math', 'science', 'english', 'history', 'attendance_pct']].corr()
print("\nCorrelation Matrix:")
print(correlation.round(2))
Example 3: Web Server Log Analysis
import pandas as pd
import numpy as np
# Create sample web server log data
np.random.seed(42)
n_requests = 5000
logs = pd.DataFrame({
'timestamp': pd.date_range('2024-01-01', periods=n_requests, freq='5min'),  # '5T' is a deprecated alias
'ip_address': [f'192.168.1.{np.random.randint(1, 255)}' for _ in range(n_requests)],
'method': np.random.choice(['GET', 'POST', 'PUT', 'DELETE'], n_requests, p=[0.7, 0.2, 0.05, 0.05]),
'endpoint': np.random.choice(['/api/users', '/api/products', '/api/orders', '/login', '/home', '/search'], n_requests),
'status_code': np.random.choice([200, 201, 400, 401, 403, 404, 500], n_requests, p=[0.6, 0.1, 0.1, 0.05, 0.05, 0.05, 0.05]),
'response_time_ms': np.random.exponential(100, n_requests).round(2)
})
# 1. Request Volume by Hour
logs['hour'] = logs['timestamp'].dt.hour
hourly_requests = logs.groupby('hour').size()
print("Requests by Hour:")
print(hourly_requests)
# 2. Status Code Distribution
status_dist = logs['status_code'].value_counts().sort_index()
print("\nStatus Code Distribution:")
print(status_dist)
# 3. Error Rate Analysis
logs['is_error'] = logs['status_code'] >= 400
error_rate = logs.groupby('endpoint')['is_error'].mean().sort_values(ascending=False)
print("\nError Rate by Endpoint:")
print((error_rate * 100).round(2).astype(str) + '%')
# 4. Response Time Analysis
response_stats = logs.groupby('endpoint')['response_time_ms'].agg(['mean', 'median', 'max', 'std']).round(2)
response_stats = response_stats.sort_values('mean', ascending=False)
print("\nResponse Time Statistics (ms):")
print(response_stats)
# 5. Identify Slow Requests (>500ms)
slow_requests = logs[logs['response_time_ms'] > 500]
print(f"\nSlow Requests (>500ms): {len(slow_requests)} ({len(slow_requests)/len(logs)*100:.2f}%)")
# 6. Unique Visitors
unique_ips = logs['ip_address'].nunique()
print(f"\nUnique Visitors: {unique_ips}")
# 7. Most Active IPs (potential bots/attackers)
active_ips = logs['ip_address'].value_counts().head(10)
print("\nMost Active IPs:")
print(active_ips)
# 8. Traffic Pattern by Day
logs['day'] = logs['timestamp'].dt.day_name()
day_traffic = logs.groupby('day').agg({
'ip_address': 'count',
'response_time_ms': 'mean'
}).round(2)
day_traffic.columns = ['Requests', 'Avg Response Time']
print("\nDaily Traffic Pattern:")
print(day_traffic)
Example 4: Stock Portfolio Analysis
import pandas as pd
import numpy as np
# Create sample stock data
np.random.seed(42)
dates = pd.date_range('2024-01-01', periods=252, freq='B') # Business days
stocks = ['AAPL', 'GOOGL', 'MSFT', 'AMZN', 'TSLA']
# Generate random price movements (simplified)
prices = pd.DataFrame(index=dates)
base_prices = {'AAPL': 180, 'GOOGL': 140, 'MSFT': 375, 'AMZN': 150, 'TSLA': 240}
for stock in stocks:
    returns = np.random.normal(0.0005, 0.02, len(dates))  # Daily returns
    prices[stock] = base_prices[stock] * (1 + returns).cumprod()
prices = prices.round(2)
# Portfolio: Number of shares owned
portfolio = pd.Series({'AAPL': 50, 'GOOGL': 30, 'MSFT': 20, 'AMZN': 40, 'TSLA': 25})
# 1. Calculate Daily Returns
returns = prices.pct_change().dropna()
print("Daily Returns (last 5 days):")
print((returns.tail() * 100).round(2).astype(str) + '%')
# 2. Summary Statistics
return_stats = returns.describe().T
return_stats['Annual Return'] = (returns.mean() * 252 * 100).round(2)
return_stats['Annual Volatility'] = (returns.std() * np.sqrt(252) * 100).round(2)
print("\nReturn Statistics:")
print(return_stats[['mean', 'std', 'min', 'max', 'Annual Return', 'Annual Volatility']])
# 3. Correlation Matrix
correlation = returns.corr().round(2)
print("\nStock Correlation:")
print(correlation)
# 4. Portfolio Value Over Time
portfolio_value = prices.mul(portfolio)
portfolio_value['Total'] = portfolio_value.sum(axis=1)
print("\nPortfolio Value (last 5 days):")
print(portfolio_value[['Total']].tail())
# 5. Portfolio Allocation
latest_prices = prices.iloc[-1]
current_holdings = latest_prices * portfolio
# Assigning a Series into a single Series element misbehaves, so build a DataFrame
allocation = pd.DataFrame({
    'value': current_holdings.round(2),
    'allocation_pct': (current_holdings / current_holdings.sum() * 100).round(2)
})
print("\nCurrent Portfolio Allocation:")
print(allocation)
# 6. Best and Worst Days
portfolio_returns = portfolio_value['Total'].pct_change().dropna()
best_day = portfolio_returns.idxmax()
worst_day = portfolio_returns.idxmin()
print(f"\nBest Day: {best_day.date()} ({portfolio_returns[best_day]*100:.2f}%)")
print(f"Worst Day: {worst_day.date()} ({portfolio_returns[worst_day]*100:.2f}%)")
# 7. Rolling Statistics
rolling_mean = portfolio_value['Total'].rolling(window=20).mean()
rolling_std = portfolio_value['Total'].rolling(window=20).std()
print("\nRolling 20-day Statistics (last 5 days):")
print(pd.DataFrame({'Mean': rolling_mean, 'Std': rolling_std}).tail())
Data Analysis Best Practices
DO
- Always explore data first with head(), info(), describe()
- Check for missing values before analysis
- Validate data types and convert if needed
- Use meaningful column names
- Document your analysis steps
- Handle outliers appropriately
- Create intermediate DataFrames for complex operations
- Use .copy() when modifying data
DON'T
- Don't modify original data without a copy
- Don't ignore warnings about SettingWithCopyWarning
- Don't assume data is clean without checking
- Don't use loops when vectorized operations exist
- Don't load entire large files without chunking
- Don't forget to reset index after filtering
- Don't use inplace=True in chained operations
- Don't hardcode column indices
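The `.copy()` rule in both lists deserves a concrete sketch. With hypothetical data, taking an explicit copy lets you modify a subset without triggering `SettingWithCopyWarning` or touching the original:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

# Take .copy() when you intend to modify a filtered subset independently
subset = df[df['a'] > 1].copy()
subset['b'] = 0

print(subset['b'].tolist())  # [0, 0]
print(df['b'].tolist())      # [4, 5, 6] -- original untouched
```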
Key Takeaways
DataFrame = Table
DataFrames are 2D labeled data structures with rows and columns.
Series = Column
A Series is a single column. A DataFrame is a collection of Series.
loc vs iloc
loc uses labels, iloc uses positions. Know when to use each.
Boolean Filtering
Filter with df[df['col'] > value] or df.query() for cleaner syntax.
GroupBy is Powerful
Split-apply-combine pattern for aggregations and transformations.
Easy File I/O
read_csv() and to_csv() handle most data loading needs.