Module 11.2

Pandas

Pandas makes data analysis intuitive with DataFrames and Series. Load CSV files, filter rows, group data, and perform complex transformations with simple, readable code. It is the backbone of data science in Python.

55 min
Intermediate
Hands-on
What You'll Learn
  • Creating DataFrames
  • Data selection and indexing
  • Filtering and sorting
  • GroupBy operations
  • Reading and writing files
Contents
01

Series vs DataFrame

Pandas has two core data structures: Series (1D) and DataFrame (2D). Think of a Series as a single column and a DataFrame as an entire spreadsheet.

Key Concept

Labeled Data Structures

Unlike NumPy arrays, Pandas structures have labels (index for rows, column names for columns). This makes data access intuitive and self-documenting.

Why it matters: Access data by meaningful names like df['price'] instead of cryptic indices like arr[:, 2].

Series (1D)
import pandas as pd

# Series = single column with index
prices = pd.Series([10.5, 20.3, 15.8], 
                   index=['A', 'B', 'C'])
print(prices)
# A    10.5
# B    20.3
# C    15.8

print(prices['B'])  # 20.3
DataFrame (2D)
import pandas as pd

# DataFrame = table with rows and columns
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Carol'],
    'age': [25, 30, 35],
    'city': ['NYC', 'LA', 'Chicago']
})
print(df)
#     name  age     city
# 0  Alice   25      NYC
# 1    Bob   30       LA
# 2  Carol   35  Chicago

A DataFrame is essentially a collection of Series that share the same index. Each column can be accessed as a Series, and each row shares the same index across all columns. This structure enables powerful operations like filtering, grouping, and merging.

Think of a Series as a single labeled column and a DataFrame as multiple columns joined together. The index (row labels) aligns data when combining DataFrames or performing operations between columns.

Series Operations

Series support vectorized operations, making mathematical computations fast and readable. You can perform arithmetic directly on Series objects.

Arithmetic Operations
import pandas as pd

prices = pd.Series([100, 200, 300])

# Scalar operations
doubled = prices * 2
discounted = prices * 0.9

# Series operations
quantities = pd.Series([5, 10, 15])
totals = prices * quantities

print(totals)
# 0     500
# 1    2000
# 2    4500
Statistical Methods
import pandas as pd

data = pd.Series([10, 20, 30, 40, 50])

print(data.sum())    # 150
print(data.mean())   # 30.0
print(data.std())    # 15.81
print(data.min())    # 10
print(data.max())    # 50
print(data.median()) # 30.0

# Cumulative operations
print(data.cumsum())
# 0     10
# 1     30
# 2     60
# 3    100
# 4    150

Series operations are element-wise by default, meaning each operation applies independently to every element. When multiplying a Series by 2, each value doubles. When multiplying two Series together, corresponding elements multiply (position 0 with position 0, etc.).

Statistical methods like sum(), mean(), and std() summarize the entire Series into a single value. cumsum() is different: it computes a running total, returning a new Series where each element is the sum of all values up to and including that position.

Index Alignment

One of Pandas' most powerful features is automatic index alignment. When you perform operations between Series, Pandas aligns them by their index labels, not positions.

import pandas as pd

# Two series with different indices
s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([10, 20, 30], index=['b', 'c', 'd'])

# Addition aligns by index
result = s1 + s2
print(result)
# a     NaN  # 'a' only in s1
# b    12.0  # 2 + 10
# c    23.0  # 3 + 20
# d     NaN  # 'd' only in s2

When you perform arithmetic operations between two Series with different indices, Pandas automatically aligns them by their index labels before computing. Here, s1 has indices 'a', 'b', 'c' and s2 has 'b', 'c', 'd'. Only 'b' and 'c' exist in both, so those get computed (2+10=12, 3+20=23). Index 'a' only exists in s1 and 'd' only in s2, resulting in NaN for those positions.

This behavior is incredibly powerful for real-world data where datasets often have missing values or different coverage. Unlike NumPy where position-based operations can silently produce wrong results, Pandas ensures data integrity by matching on labels.

# Fill missing with default value
result = s1.add(s2, fill_value=0)
print(result)
# a     1.0
# b    12.0
# c    23.0
# d    30.0

If you want to treat missing values as a specific number instead of NaN, use the fill_value parameter. Here, fill_value=0 means: if an index exists in one Series but not the other, treat the missing value as 0. So for index 'a', it computes 1+0=1; for index 'd', it computes 0+30=30.

This is useful when you're combining data and want a sensible default. For example, if you're adding sales from two regions and one region has no sales for a product, you'd want to treat it as 0 rather than making the total NaN.
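The regional-sales scenario can be sketched directly (the product and region names here are illustrative, not from a real dataset):

```python
import pandas as pd

# Hypothetical unit sales per product in two regions
east = pd.Series({'widget': 120, 'gadget': 80})
west = pd.Series({'gadget': 40, 'gizmo': 65})

# Without fill_value, 'widget' and 'gizmo' would become NaN;
# with fill_value=0 a missing product counts as zero sales
total = east.add(west, fill_value=0)
print(total)
# gadget    120.0
# gizmo      65.0
# widget    120.0
```

Note the result is float64: alignment produces floats even when both inputs are integers.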

String Operations

The str accessor provides string methods for Series containing text data. These operations are vectorized for efficiency.

import pandas as pd

names = pd.Series(['alice smith', 'bob jones', 'carol davis'])

# String methods via .str accessor
print(names.str.upper())
# 0    ALICE SMITH
# 1      BOB JONES
# 2    CAROL DAVIS

print(names.str.title())
# 0    Alice Smith
# 1      Bob Jones
# 2    Carol Davis

The .str accessor unlocks all Python string methods for entire Series at once. Instead of writing a loop to process each string, you call the method once and it applies to every element. str.upper() converts all characters to uppercase, while str.title() capitalizes the first letter of each word.

This vectorized approach is much faster than Python loops, especially for large datasets. Behind the scenes, Pandas optimizes these operations using C-level code rather than Python's interpreter.

print(names.str.split())
# 0    [alice, smith]
# 1      [bob, jones]
# 2    [carol, davis]

# Check if contains substring
print(names.str.contains('smith'))
# 0     True
# 1    False
# 2    False

str.split() splits each string by whitespace (or any delimiter you specify), returning a Series of lists. This is useful for extracting first/last names or parsing structured text. str.contains() returns a boolean Series indicating whether each element contains the specified substring.

The contains() method is particularly powerful for filtering data. You can use it with regular expressions by setting regex=True, enabling complex pattern matching across your entire dataset in a single operation.
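A minimal sketch of regex matching with contains(), reusing the names Series from above (the pattern is illustrative):

```python
import pandas as pd

names = pd.Series(['alice smith', 'bob jones', 'carol davis'])

# Regex pattern: values ending in the letter 's'
mask = names.str.contains(r's$', regex=True)
print(mask)
# 0    False
# 1     True
# 2     True
```

The boolean result can be used directly for filtering, e.g. names[mask].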

# Get string length
print(names.str.len())
# 0    11
# 1     9
# 2    11

str.len() returns the length of each string. This is useful for data validation (checking if entries meet minimum/maximum length requirements) or for analysis (finding outliers with unusually long or short text).

Other commonly used string methods include str.replace() for substitutions, str.strip() for removing whitespace, str.startswith()/str.endswith() for prefix/suffix checks, and str.extract() for regex group extraction.
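A few of those methods sketched on a small hypothetical Series (the names and substitutions are illustrative):

```python
import pandas as pd

names = pd.Series(['  alice smith ', 'bob jones', 'carol davis'])

stripped = names.str.strip()                       # remove surrounding whitespace
swapped = stripped.str.replace('smith', 'jones')   # substring substitution
has_bob = stripped.str.startswith('bob')           # boolean prefix check
first = stripped.str.extract(r'^(\w+)')            # regex group -> DataFrame

print(stripped.tolist())   # ['alice smith', 'bob jones', 'carol davis']
print(has_bob.tolist())    # [False, True, False]
print(first[0].tolist())   # ['alice', 'bob', 'carol']
```

Note that str.extract() returns a DataFrame with one column per regex group.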

Pandas Data Types

Understanding data types is crucial for memory efficiency and proper operations. Pandas uses NumPy types plus its own specialized types.

Dtype        Description                         Example Values           Memory
int64        64-bit integer                      1, -100, 999             8 bytes
float64      64-bit floating point               3.14, -2.5, NaN          8 bytes
object       Python objects (usually strings)    'hello', 'world'         Variable
bool         Boolean values                      True, False              1 byte
datetime64   Date and time                       2024-01-15 10:30:00      8 bytes
timedelta64  Time differences                    5 days, 3 hours          8 bytes
category     Categorical data (limited values)   'low', 'medium', 'high'  Memory efficient
string       Pandas StringDtype                  'text', 'data'           Variable

import pandas as pd

# Check and convert types
df = pd.DataFrame({
    'age': [25, 30, 35],
    'name': ['Alice', 'Bob', 'Carol'],
    'active': [True, False, True]
})

print(df.dtypes)
# age       int64
# name     object
# active     bool

df.dtypes shows the data type of each column. Pandas infers types when creating a DataFrame: integers become int64, strings become 'object' (a general-purpose type), and True/False become bool. The 'object' dtype for strings is flexible but can be memory-inefficient.

Knowing your data types is essential for two reasons: some operations only work on certain types (you can't calculate mean of strings), and choosing efficient types can dramatically reduce memory usage for large datasets.

# Convert types
df['age'] = df['age'].astype('int32')      # Save memory
df['name'] = df['name'].astype('string')   # Use StringDtype
df['category'] = pd.Categorical(['A', 'B', 'A'])  # Categorical

astype() converts a column to a different type. Converting int64 to int32 halves memory usage when you don't need the extra precision. The 'string' dtype is Pandas' dedicated string type with better handling of missing values than 'object'.

pd.Categorical() is powerful for columns with limited unique values (like status, gender, department). It stores each unique value once and uses integer codes internally, saving huge amounts of memory. A column with 1 million rows but only 5 unique categories uses a tiny fraction of the memory.
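The saving is easy to verify. A minimal sketch (the row count is arbitrary):

```python
import pandas as pd

# Many rows, only three unique values
statuses = pd.Series(['low', 'medium', 'high'] * 100_000)

obj_bytes = statuses.memory_usage(deep=True)
cat_bytes = statuses.astype('category').memory_usage(deep=True)

# The object version stores 300,000 Python strings; the categorical
# version stores 3 strings plus 300,000 small integer codes
print(f"object:   {obj_bytes:,} bytes")
print(f"category: {cat_bytes:,} bytes")
```

On a typical machine the categorical version uses well under a tenth of the memory.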

Practice: Series Basics

Task: Create a Series of temperatures for days Monday through Friday with values 72, 75, 68, 70, 73.

Show Solution
import pandas as pd

temps = pd.Series(
    [72, 75, 68, 70, 73],
    index=['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday']
)
print(temps)
print(f"Wednesday temp: {temps['Wednesday']}")

Task: Create a Series of 5 numbers and print sum, mean, and standard deviation.

Show Solution
import pandas as pd

numbers = pd.Series([10, 20, 30, 40, 50])
print(f"Sum: {numbers.sum()}")
print(f"Mean: {numbers.mean()}")
print(f"Std: {numbers.std():.2f}")

Task: Create a Series of email addresses and extract the domain (part after @).

Show Solution
import pandas as pd

emails = pd.Series([
    'alice@gmail.com',
    'bob@yahoo.com',
    'carol@outlook.com'
])
domains = emails.str.split('@').str[1]
print(domains)

Task: Create two Series with overlapping indices and add them, using fill_value=0 for missing values.

Show Solution
import pandas as pd

s1 = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
s2 = pd.Series([100, 200, 300], index=['b', 'c', 'd'])
result = s1.add(s2, fill_value=0)
print(result)

Task: Create a DataFrame and convert columns to more memory-efficient types (int32, category).

Show Solution
import pandas as pd

df = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'size': ['S', 'M', 'L', 'M', 'S'],
    'price': [10, 20, 30, 20, 10]
})

# Check memory before
print(f"Before: {df.memory_usage(deep=True).sum()} bytes")

# Convert types
df['id'] = df['id'].astype('int32')
df['size'] = df['size'].astype('category')
df['price'] = df['price'].astype('int32')

print(f"After: {df.memory_usage(deep=True).sum()} bytes")
02

Creating DataFrames

Create DataFrames from dictionaries, lists of dictionaries, NumPy arrays, or by reading files like CSV and Excel. Understanding different creation methods helps you work with various data sources.

Key Concept

DataFrame Creation Methods

DataFrames can be created from many sources: dictionaries (columns as keys), lists of dictionaries (rows as dicts), NumPy arrays, CSV files, Excel files, SQL databases, and JSON.

Best Practice: Choose the creation method that matches your data source. Use read_csv() for files, dict for programmatic creation.

From Dictionary of Lists

import pandas as pd

# Dictionary of lists (most common)
data = {
    'product': ['Laptop', 'Phone', 'Tablet'],
    'price': [999, 699, 399],
    'stock': [50, 150, 80]
}
df = pd.DataFrame(data)
print(df)
#   product  price  stock
# 0  Laptop    999     50
# 1   Phone    699    150
# 2  Tablet    399     80

The most common way to create a DataFrame is from a dictionary where keys become column names and values (as lists) become the column data. This is intuitive because you're defining your data column by column, which is how we typically think about tabular data.

All lists must have the same length—if one column has 3 values and another has 4, Pandas will raise a ValueError. The automatic integer index (0, 1, 2) is created for you, representing row numbers.

From List of Dictionaries

# Each dict is a row
records = [
    {'name': 'Alice', 'score': 85, 'grade': 'B'},
    {'name': 'Bob', 'score': 92, 'grade': 'A'},
    {'name': 'Carol', 'score': 78, 'grade': 'C'}
]
df = pd.DataFrame(records)
print(df)
#     name  score grade
# 0  Alice     85     B
# 1    Bob     92     A
# 2  Carol     78     C

An alternative approach treats each dictionary as a row. This row-oriented format is natural when you receive data record by record, such as from API responses, database queries, or log entries. Each dictionary represents one entity with its attributes.

Pandas automatically determines columns from all unique keys across all dictionaries. Columns appear in the order keys are first encountered; pass columns=['name', 'score', 'grade'] to control the order explicitly.

# Handles missing keys gracefully
records_incomplete = [
    {'name': 'Alice', 'score': 85},
    {'name': 'Bob', 'grade': 'A'},  # missing 'score'
]
df = pd.DataFrame(records_incomplete)
# Missing values become NaN

A key advantage of this format is handling incomplete records. If a dictionary is missing a key, Pandas fills that cell with NaN (Not a Number, representing missing data) rather than crashing. This is common when working with real-world data where some fields may be optional.

This flexibility is especially useful when loading JSON data from APIs or databases where not every record has every field. You can later use fillna() to handle these missing values.
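Continuing the incomplete-records example, fillna() accepts a per-column mapping of defaults (the defaults chosen here are illustrative):

```python
import pandas as pd

records = [
    {'name': 'Alice', 'score': 85},
    {'name': 'Bob', 'grade': 'A'},  # missing 'score'
]
df = pd.DataFrame(records)

# Replace NaN with a sensible default per column
df = df.fillna({'score': 0, 'grade': 'N/A'})
print(df)
```

Note that 'score' becomes float64: the NaN introduced during construction forces the column to float before fillna() runs.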

From NumPy Arrays

import pandas as pd
import numpy as np

# From 2D NumPy array
arr = np.array([
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9]
])
df = pd.DataFrame(arr, columns=['A', 'B', 'C'])
print(df)
#    A  B  C
# 0  1  2  3
# 1  4  5  6
# 2  7  8  9

When you already have numerical data in a NumPy array, converting to DataFrame adds labels and Pandas functionality. Rows of the array become DataFrame rows, columns become DataFrame columns. Since arrays don't have column names, you must provide them via the columns parameter.

This conversion is zero-copy when possible—Pandas uses the same underlying memory as the NumPy array, making it very efficient. The DataFrame inherits the array's dtype (int64, float64, etc.).

# With custom index
df = pd.DataFrame(
    arr,
    columns=['A', 'B', 'C'],
    index=['row1', 'row2', 'row3']
)

You can specify custom row labels with the index parameter instead of using the default 0, 1, 2. This is useful when rows represent named entities like dates, product IDs, or locations that you want to access by name.

After creating, you can access data both ways: df.loc['row1'] gets by label, df.iloc[0] gets by position. This dual access pattern is a core strength of Pandas over raw NumPy arrays.

From CSV File

# Basic CSV read
df = pd.read_csv('data.csv')

The simplest way to load a CSV file is with just the filename. Pandas uses a comma delimiter by default (pass sep='\t' or another character for other delimiters), infers column types, and treats the first row as headers. This works for most well-formatted CSV files.

pd.read_csv() is one of the most-used functions in data science. It handles file reading, parsing, type conversion, and DataFrame construction all in one call, making it incredibly convenient for quick data exploration.

# Common parameters
df = pd.read_csv('data.csv',
    index_col='id',           # Use column as index
    parse_dates=['date'],     # Parse date columns
    na_values=['N/A', '-'],   # Treat as missing
    dtype={'code': str},      # Force column type
    usecols=['id', 'name'],   # Read only these columns
    nrows=1000,               # Read first 1000 rows
    skiprows=1,               # Skip rows at start of file
    encoding='utf-8'          # File encoding
)

# Write to CSV
df.to_csv('output.csv', index=False)

For real-world data, you'll often need to customize how Pandas reads the file. index_col sets a column as the DataFrame index; parse_dates converts string columns to datetime objects; na_values tells Pandas which strings represent missing values; dtype forces specific column types (crucial for codes like '001' that shouldn't become integers).

Performance options like usecols (read only needed columns), nrows (read first N rows), and skiprows (skip initial rows) are essential for large files. When saving, to_csv() with index=False prevents adding an extra index column to your output file.

From Excel Files

# Read Excel file
df = pd.read_excel('data.xlsx')

# Read specific sheet
df = pd.read_excel('data.xlsx', sheet_name='Sales')

# Read all sheets into dict of DataFrames
all_sheets = pd.read_excel('data.xlsx', sheet_name=None)
# all_sheets['Sheet1'], all_sheets['Sheet2'], etc.

read_excel() works similarly to read_csv() but handles Excel's .xlsx and .xls formats. By default, it reads the first sheet. Use sheet_name to specify a particular sheet by name or by index (0 for first sheet). Setting sheet_name=None reads ALL sheets and returns a dictionary where keys are sheet names.

Excel files often contain multiple related datasets on different sheets. The dictionary approach lets you load everything at once and then work with specific sheets as df = all_sheets['SheetName']. This requires the openpyxl package (pip install openpyxl).

# Write to Excel
df.to_excel('output.xlsx', index=False, sheet_name='Results')

# Write multiple sheets
with pd.ExcelWriter('multi_sheet.xlsx') as writer:
    df1.to_excel(writer, sheet_name='Data')
    df2.to_excel(writer, sheet_name='Summary')

to_excel() writes a DataFrame to an Excel file. Use index=False to avoid writing row numbers as the first column. The sheet_name parameter controls which sheet the data goes to.

For writing multiple DataFrames to different sheets in one file, use ExcelWriter as a context manager. This opens the file once, writes all sheets, and then properly closes and saves the file. Without ExcelWriter, each to_excel() call would overwrite the previous one.

From SQL Database

import pandas as pd
import sqlite3

# Connect to database
conn = sqlite3.connect('database.db')

# Read SQL query results
df = pd.read_sql('SELECT * FROM users', conn)

# Read entire table without SQL (requires an SQLAlchemy engine,
# not a raw sqlite3 connection)
# df = pd.read_sql_table('users', engine)

Pandas can directly query databases and return results as DataFrames. First, establish a connection using the appropriate library (sqlite3 for SQLite, psycopg2 for PostgreSQL, mysql-connector for MySQL). Then use read_sql() with any SQL query—the results become a DataFrame.

read_sql() is flexible: you can run complex queries with JOINs, WHERE clauses, aggregations, etc. read_sql_table() is a shortcut for loading an entire table without writing SQL, but it only accepts an SQLAlchemy connectable, not a raw DBAPI connection like sqlite3's. Both handle type conversion automatically.

# Write DataFrame to SQL
df.to_sql('users_backup', conn, if_exists='replace', index=False)

conn.close()

to_sql() writes a DataFrame to a database table. The if_exists parameter controls behavior when the table already exists: 'replace' drops and recreates it, 'append' adds rows to existing table, 'fail' raises an error. Use index=False to avoid creating an extra index column.

Always close database connections when done. For production code, use SQLAlchemy's create_engine(), which provides connection pooling and works with more database types. Note that using a sqlite3 connection as a context manager (with conn:) manages transactions but does not close the connection; wrap it in contextlib.closing() to close automatically.
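A round trip can be sketched with the standard library alone, using an in-memory SQLite database so the snippet is self-contained; contextlib.closing() guarantees the connection is closed. (With SQLAlchemy you would pass the engine from create_engine('sqlite:///database.db') instead of the raw connection.)

```python
import sqlite3
from contextlib import closing
import pandas as pd

df = pd.DataFrame({'name': ['Alice', 'Bob'], 'age': [25, 30]})

# closing() ensures conn.close() runs even if an error occurs
with closing(sqlite3.connect(':memory:')) as conn:
    df.to_sql('users', conn, index=False)
    result = pd.read_sql('SELECT * FROM users WHERE age > 26', conn)

print(result)
#   name  age
# 0  Bob   30
```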

From JSON

# Read JSON file
df = pd.read_json('data.json')

# From a JSON string (recent Pandas versions want a file-like wrapper)
from io import StringIO
json_str = '[{"name": "Alice", "age": 25}, {"name": "Bob", "age": 30}]'
df = pd.read_json(StringIO(json_str))

read_json() parses JSON into DataFrames. It works with JSON arrays (each object becomes a row) or JSON objects (keys become rows or columns depending on structure). You can pass a file path or a file-like object; passing a raw JSON string directly is deprecated in recent Pandas versions, so wrap strings in io.StringIO first.

JSON is the standard format for web APIs, so this function is essential when working with data from web services. The function automatically infers the structure and converts it to tabular format.

# Nested JSON (normalize)
nested_json = [
    {'id': 1, 'info': {'name': 'Alice', 'city': 'NYC'}},
    {'id': 2, 'info': {'name': 'Bob', 'city': 'LA'}}
]
df = pd.json_normalize(nested_json)
# Columns: id, info.name, info.city

# Write to JSON
df.to_json('output.json', orient='records')

Real-world JSON is often nested—objects containing other objects. json_normalize() flattens these structures into a flat table. Nested keys become column names with dots: the 'name' inside 'info' becomes the column 'info.name'. This handles arbitrarily deep nesting.

When writing DataFrames to JSON, the orient parameter controls the output format. 'records' produces a list of row objects [{row1}, {row2}], which is the most common format for APIs. Other options include 'columns' (dict of column arrays) and 'index' (dict keyed by index).
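A quick comparison of the two most common orient values on a tiny DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'name': ['Alice', 'Bob'], 'age': [25, 30]})

# 'records': list of row objects, the usual format for APIs
print(df.to_json(orient='records'))
# [{"name":"Alice","age":25},{"name":"Bob","age":30}]

# 'columns' (the default): dict of column -> {index: value}
print(df.to_json(orient='columns'))
# {"name":{"0":"Alice","1":"Bob"},"age":{"0":25,"1":30}}
```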

File Format Reference

Format     Read Function        Write Function      Best For
CSV        pd.read_csv()        df.to_csv()         Universal, human-readable
Excel      pd.read_excel()      df.to_excel()       Business reports, multiple sheets
JSON       pd.read_json()       df.to_json()        Web APIs, nested data
SQL        pd.read_sql()        df.to_sql()         Database integration
Parquet    pd.read_parquet()    df.to_parquet()     Big data, columnar storage
Pickle     pd.read_pickle()     df.to_pickle()      Python-only, preserves types
HTML       pd.read_html()       df.to_html()        Web scraping tables
Clipboard  pd.read_clipboard()  df.to_clipboard()   Quick copy-paste
03

DataFrame Anatomy

Every DataFrame has an index (row labels), columns (column names), and values (the data). Understanding the structure is key to effective manipulation.

Visual: DataFrame Structure (Index × Columns)

[Diagram: 2D tabular data, rows × columns. A 3×3 grid with columns
'name', 'age', 'city' and index 0, 1, 2 holding the Alice/Bob/Carol rows.
df.shape = (3, 3); df.columns = ['name', 'age', 'city']; df.index = [0, 1, 2]]

DataFrame Components
  • Index labels rows
  • Columns label fields
  • Values are the data

Exploring DataFrames

import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Carol'],
    'age': [25, 30, 35],
    'salary': [50000, 60000, 70000]
})

print(df.shape)      # (3, 3)
print(df.columns)    # Index(['name', 'age', 'salary'], dtype='object')
print(df.dtypes)     # Data types of each column

When you first load a dataset, these properties give you an instant overview. shape returns a tuple (rows, columns)—here (3, 3) means 3 rows and 3 columns. columns returns an Index object listing all column names. dtypes shows the data type of each column.

These attributes are your first stop for understanding any DataFrame. shape tells you the size of your data, columns shows what fields exist, and dtypes helps identify potential issues like numeric data stored as strings.

print(df.head(2))    # First 2 rows
print(df.info())     # Summary of DataFrame

head(n) displays the first n rows (default 5)—perfect for previewing data without overwhelming your screen. There's also tail(n) for the last n rows. For large datasets, these methods let you quickly verify the data looks correct.

info() provides a comprehensive summary: number of entries, column names, non-null counts per column, dtypes, and memory usage. This is often the first method to call on a new dataset—it reveals missing values, data types, and memory footprint all at once.

Quick Statistics

# Numeric column statistics
print(df.describe())
#              age       salary
# count   3.000000      3.00000
# mean   30.000000  60000.00000
# std     5.000000  10000.00000
# min    25.000000  50000.00000
# 25%    27.500000  55000.00000
# 50%    30.000000  60000.00000
# 75%    32.500000  65000.00000
# max    35.000000  70000.00000

describe() generates summary statistics for all numeric columns in one call. You get count (non-null values), mean, standard deviation (std), minimum, maximum, and quartiles (25%, 50%, 75%). The 50% percentile is the median.

This is your go-to method for understanding the distribution of numeric data. Quickly spot outliers (huge max vs small mean), check for missing values (count less than expected), and understand the spread of your data (std relative to mean).

# Single column stats
print(df['age'].mean())  # 30.0
print(df['salary'].max())  # 70000

For individual statistics, call the method directly on a column. Common methods include mean(), median(), std(), min(), max(), sum(), and count(). These return single values rather than DataFrames.

You can also request multiple statistics at once: df['salary'].agg(['mean', 'median', 'std']) returns all three together. For a custom statistic like the range, compute it directly: df['age'].max() - df['age'].min().
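A minimal sketch of agg() with a list of statistic names, using the same columns as above:

```python
import pandas as pd

df = pd.DataFrame({
    'age': [25, 30, 35],
    'salary': [50000, 60000, 70000]
})

# Several statistics in one call, returned as a labeled Series
stats = df['salary'].agg(['mean', 'median', 'std'])
print(stats)
# mean      60000.0
# median    60000.0
# std       10000.0
```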

Value Counts and Unique Values

import pandas as pd

df = pd.DataFrame({
    'category': ['A', 'B', 'A', 'C', 'B', 'A'],
    'status': ['active', 'inactive', 'active', 'active', 'active', 'inactive']
})

# Count occurrences of each value
print(df['category'].value_counts())
# A    3
# B    2
# C    1

value_counts() is the perfect tool for categorical data analysis. It counts how many times each unique value appears, sorted from most to least common by default. Here, 'A' appears 3 times, 'B' twice, and 'C' once.

Add normalize=True to get percentages instead of counts: df['category'].value_counts(normalize=True) shows A is 50%, B is 33%, C is 17%. This is essential for understanding the distribution of categories in your data.

# Get unique values
print(df['category'].unique())    # ['A' 'B' 'C']
print(df['category'].nunique())   # 3 (number of unique)

# Check for specific value
print('A' in df['category'].values)  # True

unique() returns an array of all distinct values in the column—useful for seeing the vocabulary of a categorical column. nunique() returns just the count of unique values, which is helpful for understanding cardinality (how many distinct categories exist).

To check if a specific value exists, use 'value' in df['column'].values. Note the .values—this accesses the underlying NumPy array. Without it, you'd be checking if 'A' is in the column's index, which isn't what you want.

Memory Usage Analysis

import pandas as pd

df = pd.DataFrame({
    'id': range(10000),
    'name': ['user_' + str(i) for i in range(10000)],
    'score': [i * 0.1 for i in range(10000)]
})

# Overall memory usage (info() prints its report and returns None,
# so no print() is needed)
df.info(memory_usage='deep')

When working with large datasets, memory becomes critical. info(memory_usage='deep') provides accurate memory usage including the actual size of string objects. Without 'deep', it only shows the memory for pointers, not the actual string data.

This deep analysis takes longer but gives you the true picture. A DataFrame that looks small (few columns, reasonable row count) might use gigabytes if it contains long strings or complex objects.

# Per-column memory
print(df.memory_usage(deep=True))
# Index        132   # RangeIndex stores only start/stop/step
# id         80000
# name      670000   # values approximate; vary by platform/version
# score      80000

# Total bytes
total_bytes = df.memory_usage(deep=True).sum()
print(f"Total: {total_bytes / 1024:.1f} KB")

memory_usage(deep=True) breaks down memory by column. Here, 'name' uses far more memory (about 670KB) than the numeric columns (80KB each) because Python strings cost more than fixed-size integers or floats. The default RangeIndex is nearly free, since it stores only its start, stop, and step.

Understanding per-column memory helps optimize your DataFrame. You might convert string columns to categorical (if few unique values), use smaller integer types (int32 instead of int64), or drop columns you don't need. These optimizations can reduce memory by 50-90%.

Practice: DataFrame Creation

Task: Create a DataFrame with columns 'fruit' and 'quantity' containing 3 fruits.

Show Solution
import pandas as pd

df = pd.DataFrame({
    'fruit': ['Apple', 'Banana', 'Orange'],
    'quantity': [10, 25, 15]
})
print(df)

Task: Create any DataFrame and print its shape and column names.

Show Solution
import pandas as pd

df = pd.DataFrame({
    'a': [1, 2, 3],
    'b': [4, 5, 6]
})
print(f"Shape: {df.shape}")
print(f"Columns: {list(df.columns)}")

Task: Create a DataFrame from a list of dictionaries representing students with name and grade.

Show Solution
import pandas as pd

students = [
    {'name': 'Alice', 'grade': 'A'},
    {'name': 'Bob', 'grade': 'B'},
    {'name': 'Carol', 'grade': 'A'}
]
df = pd.DataFrame(students)
print(df)

Task: Create a DataFrame, save it to CSV, then read it back and verify the data.

Show Solution
import pandas as pd

# Create DataFrame
df = pd.DataFrame({
    'id': [1, 2, 3],
    'name': ['A', 'B', 'C'],
    'value': [10, 20, 30]
})

# Save to CSV
df.to_csv('test_data.csv', index=False)

# Read back
df_loaded = pd.read_csv('test_data.csv')
print(df_loaded)
print(f"Match: {df.equals(df_loaded)}")

Task: Create a DataFrame from a NumPy array with custom row and column names.

Show Solution
import pandas as pd
import numpy as np

data = np.array([
    [85, 90, 88],
    [92, 88, 95],
    [78, 82, 80]
])

df = pd.DataFrame(
    data,
    columns=['Math', 'Science', 'English'],
    index=['Alice', 'Bob', 'Carol']
)
print(df)
print(f"\nAlice's Math score: {df.loc['Alice', 'Math']}")

Task: Create a DataFrame and print its shape, dtypes, memory usage, and summary statistics.

Show Solution
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Carol', 'David'],
    'age': [25, 30, 35, 28],
    'salary': [50000, 60000, 70000, 55000]
})

print(f"Shape: {df.shape}")
print(f"\nData types:\n{df.dtypes}")
print(f"\nMemory usage:\n{df.memory_usage(deep=True)}")
print(f"\nStatistics:\n{df.describe()}")
04

Selection and Filtering

Select columns, rows, or subsets of data using labels, positions, or boolean conditions. Mastering selection is fundamental to data analysis with Pandas.

Key Concept

Selection Methods

Pandas provides multiple ways to select data: bracket notation for columns, loc for label-based selection, iloc for position-based selection, and boolean indexing for conditional filtering.

Rule of Thumb: Use loc when you know the labels, iloc when you know positions, and boolean masks for conditional selection.

Selecting Columns

import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Carol'],
    'age': [25, 30, 35],
    'city': ['NYC', 'LA', 'Chicago'],
    'salary': [50000, 60000, 70000]
})

# Single column (returns Series)
names = df['name']
print(type(names))  # pandas.core.series.Series

# Multiple columns (returns DataFrame)
subset = df[['name', 'age']]
print(type(subset))  # pandas.core.frame.DataFrame

# Alternative: dot notation (only for simple column names)
ages = df.age  # Same as df['age']

Selecting columns is the most common operation in Pandas. Using single brackets df['name'] returns a Series (a single column), while double brackets df[['name', 'age']] return a DataFrame (a subset table). This distinction matters because Series and DataFrames have different methods available.

Dot notation (df.age) is a convenient shortcut but has limitations: it doesn't work with column names containing spaces, special characters, or names that conflict with DataFrame methods. Always use bracket notation for reliability.
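A quick illustration of where dot notation fails (the column names here are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({'unit price': [9.99, 19.99], 'count': [3, 5]})

print(df['unit price'])   # brackets handle spaces fine
# df.unit price           -> SyntaxError: spaces can't appear in attributes

print(df.count())         # calls the DataFrame.count() METHOD,
                          # not the 'count' column
print(df['count'])        # brackets always get the column
```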

# Add new column
df['senior'] = df['age'] >= 30

# Computed column
df['annual_bonus'] = df['salary'] * 0.1

Creating new columns is as simple as assignment. df['senior'] = df['age'] >= 30 creates a boolean column based on a condition. df['annual_bonus'] = df['salary'] * 0.1 creates a computed column from existing data. The operation is vectorized—it applies to all rows automatically.

You can create columns from any expression: arithmetic operations, string manipulations, or the results of function calls. This is how features are typically engineered in data science projects.

Selecting Rows with loc (Label-Based)

# loc: select by label
print(df.loc[0])              # Row with index label 0
print(df.loc[0:1])            # Rows 0 and 1 (INCLUSIVE!)
print(df.loc[0, 'name'])      # Row 0, column 'name'
print(df.loc[0:2, 'name':'city'])  # Slice rows and columns

# Multiple specific rows
print(df.loc[[0, 2]])         # Rows 0 and 2

# Multiple specific columns
print(df.loc[:, ['name', 'salary']])  # All rows, specific columns

loc is the primary way to select data by label. The syntax df.loc[row_selector, column_selector] lets you specify exactly which rows and columns you want. Use a colon (:) to mean "all" — df.loc[:, 'name'] means all rows, just the name column.

Critical difference from Python slicing: loc ranges are INCLUSIVE on both ends. df.loc[0:2] includes rows 0, 1, AND 2. This catches many beginners off guard since Python normally excludes the end of a slice.
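The inclusive-vs-exclusive difference is easy to verify; a small sketch (the DataFrame matches the one defined earlier, repeated here so the snippet runs standalone):

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Carol'],
    'age': [25, 30, 35],
    'city': ['NYC', 'LA', 'Chicago'],
    'salary': [50000, 60000, 70000]
})

# loc slicing includes BOTH endpoints
print(len(df.loc[0:2]))   # 3 rows: 0, 1, and 2

# iloc slicing follows standard Python semantics (end excluded)
print(len(df.iloc[0:2]))  # 2 rows: 0 and 1
```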

# Custom index example
df_custom = df.set_index('name')
print(df_custom.loc['Alice'])  # Row with index 'Alice'
print(df_custom.loc['Alice':'Carol'])  # Range of labels

loc truly shines with custom indices. After setting 'name' as the index, you can select rows by name directly: df_custom.loc['Alice'] retrieves Alice's row. Slicing works alphabetically when the index is strings: 'Alice':'Carol' includes all names in that alphabetical range.

This label-based access is why Pandas is so powerful for data analysis—you can reference data by meaningful identifiers (like dates, names, or IDs) rather than remembering integer positions.

Selecting Rows with iloc (Position-Based)

# iloc: select by integer position
print(df.iloc[0])             # First row (position 0)
print(df.iloc[0:2])           # First 2 rows (EXCLUSIVE!)
print(df.iloc[0, 1])          # Row 0, column at position 1
print(df.iloc[0:2, 0:2])      # First 2 rows, first 2 columns

iloc selects data by integer position, exactly like Python list indexing. df.iloc[0] is the first row, df.iloc[0:2] is the first two rows (positions 0 and 1). Unlike loc, iloc slicing is EXCLUSIVE at the end—standard Python behavior.

Use iloc when you need positional access regardless of what the index labels are. It's especially useful for iteration, random sampling, or when working with data where you don't know or care about the index values.

# Negative indexing
print(df.iloc[-1])            # Last row
print(df.iloc[-3:])           # Last 3 rows

# Step slicing
print(df.iloc[::2])           # Every other row

# Specific positions
print(df.iloc[[0, 2], [1, 3]])  # Rows 0 and 2, columns 1 and 3

iloc supports all Python slicing features: negative indices count from the end (-1 is the last row), step values let you skip rows (::2 takes every other row), and you can select arbitrary combinations of row and column positions.

The syntax df.iloc[[0, 2], [1, 3]] selects a subset of non-contiguous rows and columns by their positions, returning the intersection—useful when you need specific cells scattered across the DataFrame.

loc vs iloc Comparison

| Feature        | loc                       | iloc                      |
|----------------|---------------------------|---------------------------|
| Selection by   | Labels (names)            | Integer positions         |
| Slice end      | Inclusive                 | Exclusive                 |
| Accepts        | Labels, boolean arrays    | Integers, boolean arrays  |
| Example        | df.loc['row_a':'row_c']   | df.iloc[0:3]              |
| Negative index | Not supported             | Supported                 |
| Best for       | Named indices, conditions | Position-based iteration  |

Filtering with Boolean Conditions

import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Carol', 'David'],
    'age': [25, 30, 35, 28],
    'city': ['NYC', 'LA', 'NYC', 'Chicago'],
    'salary': [50000, 60000, 70000, 55000]
})

# Single condition
adults = df[df['age'] >= 30]
print(adults)

Boolean filtering is the most powerful way to select data in Pandas. The expression df['age'] >= 30 creates a boolean Series (True/False for each row). When you put this inside brackets df[...], it returns only the rows where the condition is True.

This pattern is fundamental to data analysis: create a condition, use it to filter. The condition can be any expression that produces True/False for each row—comparisons, string matches, null checks, etc.

# Multiple conditions (use & for AND, | for OR)
# MUST wrap each condition in parentheses!
result = df[(df['age'] >= 25) & (df['city'] == 'NYC')]
print(result)

# OR condition
result = df[(df['city'] == 'NYC') | (df['city'] == 'LA')]
print(result)

# NOT condition
result = df[~(df['city'] == 'NYC')]  # ~ is NOT
print(result)

Combining conditions requires special operators: & for AND, | for OR, and ~ for NOT. Critically, you MUST wrap each individual condition in parentheses. Without them, Python's operator precedence causes errors. (df['age'] >= 25) & (df['city'] == 'NYC') is correct; df['age'] >= 25 & df['city'] == 'NYC' will fail.

These operators work element-wise across the boolean Series. The result is a new boolean Series that you can use for filtering. This is different from Python's and/or keywords, which don't work with Pandas Series.
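A minimal check of this element-wise behavior (data matches the DataFrame defined above):

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Carol', 'David'],
    'age': [25, 30, 35, 28],
    'city': ['NYC', 'LA', 'NYC', 'Chicago'],
    'salary': [50000, 60000, 70000, 55000]
})

# Each condition is a boolean Series; & combines them element-wise
mask = (df['age'] >= 25) & (df['city'] == 'NYC')
print(mask.tolist())  # [True, False, True, False]

# Python's 'and' keyword would raise instead:
# df['age'] >= 25 and df['city'] == 'NYC'
# ValueError: The truth value of a Series is ambiguous
```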

Query Method

# Using query method (cleaner syntax)
result = df.query('age >= 25 and city == "NYC"')
print(result)

# Query with variables (use @)
min_age = 25
target_city = 'NYC'
result = df.query('age >= @min_age and city == @target_city')
print(result)

The query() method provides a more readable alternative to boolean indexing. Instead of df[(df['age'] >= 25) & (df['city'] == 'NYC')], you write df.query('age >= 25 and city == "NYC"'). The string uses natural Python/SQL-like syntax with 'and' and 'or' keywords.

To use Python variables inside a query string, prefix them with @. This lets you build dynamic queries: @min_age refers to the variable min_age defined outside the string. This is cleaner than f-string interpolation and safer against injection.

# Complex queries
result = df.query('(age > 25 and salary >= 55000) or city == "LA"')
print(result)

# Query with column names containing spaces
df2 = df.rename(columns={'city': 'home city'})
result = df2.query('`home city` == "NYC"')  # Use backticks

Complex queries are much easier to read with query(). You can use parentheses for grouping, and the precedence follows natural logic. For column names with spaces or special characters, wrap them in backticks within the query string.

query() is also slightly faster for large DataFrames because it can use numexpr under the hood. For simple conditions, both approaches perform similarly, but query() wins on readability for complex filtering logic.

isin() for Multiple Values

# Filter for multiple values
cities = ['NYC', 'LA']
result = df[df['city'].isin(cities)]
print(result)

# Exclude multiple values
result = df[~df['city'].isin(cities)]
print(result)

isin() checks whether each value is in a list of acceptable values. It's much cleaner than chaining multiple == comparisons with |. Here, df['city'].isin(['NYC', 'LA']) returns True for rows where city is either NYC or LA.

The ~ operator negates the boolean mask, giving you exclusion filtering. ~df['city'].isin(cities) returns rows where city is NOT in the list. This is how you filter out unwanted categories efficiently.

# Multiple conditions with isin
result = df[
    df['city'].isin(['NYC', 'LA']) & 
    df['age'].isin([25, 30])
]

Combine isin() with other conditions using & (and) or | (or). This example filters for rows where city is NYC or LA AND age is 25 or 30. You can mix isin() with regular boolean conditions too.

isin() is optimized for membership testing and is much faster than equivalent OR chains, especially for long lists of values. Always use isin() when checking against more than 2-3 values.
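The equivalence between isin() and an OR chain is easy to confirm:

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Carol', 'David'],
    'city': ['NYC', 'LA', 'NYC', 'Chicago']
})

# OR chain and isin produce identical boolean masks
chain = (df['city'] == 'NYC') | (df['city'] == 'LA')
member = df['city'].isin(['NYC', 'LA'])
print(chain.equals(member))  # True
```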

String Filtering

# String contains
result = df[df['name'].str.contains('a', case=False)]
print(result)

# Starts with / Ends with
result = df[df['name'].str.startswith('A')]
result = df[df['name'].str.endswith('l')]

The str accessor unlocks string methods for filtering. str.contains() checks if a substring exists anywhere in the string. With case=False, it's case-insensitive—matching 'a', 'A', or any combination. By default, it's case-sensitive.

str.startswith() and str.endswith() check the beginning and end of strings respectively. These are faster than contains() when you know exactly where to look, and they're perfect for filtering by prefixes (like product codes) or suffixes (like file extensions).

# Regex patterns
result = df[df['name'].str.match(r'^[A-C]')]  # Names starting with A, B, or C

# String length
result = df[df['name'].str.len() > 4]

str.match() applies regular expressions for powerful pattern matching. The pattern ^[A-C] matches strings starting with A, B, or C. You can use any regex pattern—this is incredibly flexible for complex text filtering.

str.len() returns the length of each string, which you can then use in comparisons. This filters for names longer than 4 characters. Combine these methods: df[df['name'].str.len().between(3, 10)] filters for names between 3 and 10 characters.
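String conditions combine with & and | like any other boolean mask; a small sketch (note that Alice drops out because contains() is case-sensitive by default):

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Carol', 'David']
})

# Names longer than 4 characters that contain a lowercase 'a'
# ('Alice' fails: its only 'A' is uppercase)
result = df[(df['name'].str.len() > 4) & df['name'].str.contains('a')]
print(result['name'].tolist())  # ['Carol', 'David']
```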

Sorting Data

# Sort by single column
df_sorted = df.sort_values('age')
print(df_sorted)

# Sort descending
df_sorted = df.sort_values('age', ascending=False)

# Sort by multiple columns
df_sorted = df.sort_values(['city', 'age'])

sort_values() orders rows based on column values. By default it sorts ascending (smallest to largest, A to Z). Set ascending=False for descending order. When sorting by multiple columns, it sorts by the first column first, then uses the second column to break ties.

Sorting is essential for data presentation and often required before certain operations like rolling calculations or when you need to see extremes (top earners, oldest items, etc.).

# Sort with mixed directions
df_sorted = df.sort_values(
    ['city', 'age'], 
    ascending=[True, False]  # city asc, age desc
)

# Sort by index
df_sorted = df.sort_index()

# Ranking
df['salary_rank'] = df['salary'].rank(ascending=False)
print(df)

For multi-column sorts, pass a list to ascending to control each column's direction independently. This is useful when you want, say, alphabetical city names but highest salaries first within each city.

sort_index() sorts by the row index labels. rank() assigns rank positions rather than reordering—useful when you want to keep original order but add a ranking column. ascending=False gives rank 1 to the highest value.

Modifying Data

# Update values based on condition
df.loc[df['age'] > 30, 'status'] = 'senior'
df.loc[df['age'] <= 30, 'status'] = 'junior'

# Update multiple columns
df.loc[df['city'] == 'NYC', ['bonus', 'tax_rate']] = [5000, 0.25]

Use loc with a boolean condition to update specific rows. The syntax df.loc[condition, column] = value sets the value for rows where the condition is True. This is the safe, recommended way to modify DataFrame values conditionally.

You can update multiple columns at once by passing a list of column names and a list of values. This is useful when several columns need to change together based on the same condition.

# Using where (keeps original if condition is False)
df['adjusted_salary'] = df['salary'].where(df['age'] >= 30, df['salary'] * 0.9)

# Using mask (opposite of where)
df['adjusted_salary'] = df['salary'].mask(df['age'] < 30, df['salary'] * 0.9)

# Replace values
df['city'] = df['city'].replace({'NYC': 'New York', 'LA': 'Los Angeles'})

where() and mask() provide vectorized conditional logic. where(condition, other) keeps original values where the condition is True and substitutes 'other' where it is False. mask() is the opposite—it substitutes where the condition is True. Both return a new Series rather than modifying the original column in place.

replace() substitutes values using a dictionary mapping old values to new values. It's perfect for standardizing data (e.g., converting abbreviations to full names) or fixing data entry inconsistencies.

Practice: Selection and Filtering

Task: From a DataFrame with columns 'product', 'price', 'qty', select just the 'price' column.

Show Solution
import pandas as pd

df = pd.DataFrame({
    'product': ['A', 'B', 'C'],
    'price': [10, 20, 30],
    'qty': [100, 200, 150]
})
prices = df['price']
print(prices)

Task: Filter a DataFrame to get only rows where price > 15.

Show Solution
import pandas as pd

df = pd.DataFrame({
    'product': ['A', 'B', 'C'],
    'price': [10, 20, 30]
})
expensive = df[df['price'] > 15]
print(expensive)

Task: Filter to get products with price > 15 AND qty >= 150.

Show Solution
import pandas as pd

df = pd.DataFrame({
    'product': ['A', 'B', 'C'],
    'price': [10, 20, 30],
    'qty': [100, 200, 150]
})
result = df[(df['price'] > 15) & (df['qty'] >= 150)]
print(result)

Task: Filter a DataFrame to include only rows where city is in ['NYC', 'LA', 'Chicago'].

Show Solution
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Carol', 'David', 'Eve'],
    'city': ['NYC', 'Boston', 'LA', 'Chicago', 'Seattle']
})
target_cities = ['NYC', 'LA', 'Chicago']
result = df[df['city'].isin(target_cities)]
print(result)

Task: Sort a DataFrame by department ascending, then salary descending.

Show Solution
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Carol', 'David'],
    'department': ['Sales', 'IT', 'Sales', 'IT'],
    'salary': [60000, 70000, 55000, 65000]
})
result = df.sort_values(
    ['department', 'salary'], 
    ascending=[True, False]
)
print(result)

Task: Use the query() method to filter with a Python variable for the threshold.

Show Solution
import pandas as pd

df = pd.DataFrame({
    'product': ['A', 'B', 'C', 'D'],
    'price': [10, 25, 15, 30],
    'category': ['Electronics', 'Clothing', 'Electronics', 'Clothing']
})

min_price = 15
target_category = 'Electronics'
result = df.query('price >= @min_price and category == @target_category')
print(result)
05

GroupBy Operations

GroupBy splits data into groups, applies a function to each group, and combines results. It is the foundation of aggregation in Pandas and essential for data summarization.

Key Concept

Split-Apply-Combine

GroupBy follows the split-apply-combine paradigm: split data into groups based on criteria, apply a function to each group independently, then combine results back into a DataFrame.

Common Use Cases: Calculate averages per category, sum sales by region, count items per group, find max/min within groups.

Basic GroupBy

import pandas as pd

df = pd.DataFrame({
    'category': ['A', 'B', 'A', 'B', 'A'],
    'value': [10, 20, 30, 40, 50]
})

# Group by category, calculate mean
grouped = df.groupby('category')['value'].mean()
print(grouped)
# category
# A    30.0
# B    30.0

The groupby() method splits your data into groups based on unique values in a column. Here, rows are grouped by 'category' (A or B), then mean() calculates the average 'value' for each group. Category A has values 10, 30, 50 (mean=30), and B has 20, 40 (mean=30).

The pattern is: df.groupby('column_to_group_by')['column_to_aggregate'].aggregation_function(). This is the most common groupby pattern and handles 90% of aggregation tasks you'll encounter.

# Group object is lazy - no computation until aggregation
groups = df.groupby('category')
print(type(groups))  # DataFrameGroupBy object

# Iterate over groups
for name, group in df.groupby('category'):
    print(f"Group: {name}")
    print(group)
    print()

groupby() itself doesn't compute anything—it creates a lazy GroupBy object that waits for you to specify what to do with each group. This is efficient because you can chain multiple operations without recalculating groups.

You can iterate over groups using a for loop. Each iteration yields the group name (the value being grouped on) and a DataFrame containing just that group's rows. This is useful for custom processing or debugging.
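Besides iterating, you can pull a single group out of the lazy GroupBy object with get_group() (a standard GroupBy method, though not shown above):

```python
import pandas as pd

df = pd.DataFrame({
    'category': ['A', 'B', 'A', 'B', 'A'],
    'value': [10, 20, 30, 40, 50]
})

groups = df.groupby('category')

# Retrieve only category A's rows as a DataFrame
group_a = groups.get_group('A')
print(group_a['value'].tolist())  # [10, 30, 50]
```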

Common Aggregation Functions

| Function  | Description              | Example                            |
|-----------|--------------------------|------------------------------------|
| sum()     | Sum of values            | df.groupby('cat')['val'].sum()     |
| mean()    | Average of values        | df.groupby('cat')['val'].mean()    |
| median()  | Median value             | df.groupby('cat')['val'].median()  |
| count()   | Count of non-null values | df.groupby('cat')['val'].count()   |
| size()    | Count including nulls    | df.groupby('cat').size()           |
| min()     | Minimum value            | df.groupby('cat')['val'].min()     |
| max()     | Maximum value            | df.groupby('cat')['val'].max()     |
| std()     | Standard deviation       | df.groupby('cat')['val'].std()     |
| var()     | Variance                 | df.groupby('cat')['val'].var()     |
| first()   | First value in group     | df.groupby('cat')['val'].first()   |
| last()    | Last value in group      | df.groupby('cat')['val'].last()    |
| nunique() | Count of unique values   | df.groupby('cat')['val'].nunique() |

Multiple Aggregations with agg()

# Multiple functions at once
result = df.groupby('category')['value'].agg(['sum', 'mean', 'count'])
print(result)
#           sum  mean  count
# category
# A          90  30.0      3
# B          60  30.0      2

The agg() method lets you apply multiple aggregation functions simultaneously. Pass a list of function names as strings, and you get a DataFrame with each function as a column. This is more efficient than calling each aggregation separately.

This is perfect for creating summary tables. In one operation, you can get the sum, mean, count, min, max—whatever statistics you need—for each group.

# Different functions per column
df = pd.DataFrame({
    'category': ['A', 'B', 'A', 'B'],
    'sales': [100, 200, 150, 250],
    'quantity': [10, 20, 15, 25]
})

result = df.groupby('category').agg({
    'sales': ['sum', 'mean'],
    'quantity': ['sum', 'max']
})
print(result)

# Named aggregations (cleaner column names)
result = df.groupby('category').agg(
    total_sales=('sales', 'sum'),
    avg_sales=('sales', 'mean'),
    total_qty=('quantity', 'sum'),
    max_qty=('quantity', 'max')
)
print(result)

When you need different aggregations for different columns, pass a dictionary where keys are column names and values are lists of functions. This creates a multi-level column index which can be awkward to work with.

Named aggregations solve this elegantly: each argument specifies a result column name, and the value is a tuple of (column_to_aggregate, function). This produces clean, flat column names that are ready for reports or further analysis.

GroupBy with Multiple Columns

df = pd.DataFrame({
    'region': ['East', 'West', 'East', 'West', 'East', 'West'],
    'product': ['A', 'A', 'B', 'B', 'A', 'B'],
    'sales': [100, 150, 200, 250, 120, 180]
})

# Group by multiple columns
result = df.groupby(['region', 'product'])['sales'].sum()
print(result)
# region  product
# East    A          220
#         B          200
# West    A          150
#         B          430

Pass a list of column names to groupby() to group by multiple columns. This creates groups for every unique combination of values. Here we get four groups: East-A, East-B, West-A, and West-B, with sales summed for each combination.

The result has a hierarchical (multi-level) index. The first level is region, and nested within each region are the products. This structure represents the grouping hierarchy clearly.

# Reset index to get flat DataFrame
result = df.groupby(['region', 'product'])['sales'].sum().reset_index()
print(result)
#   region product  sales
# 0   East       A    220
# 1   East       B    200
# 2   West       A    150
# 3   West       B    430

reset_index() converts the hierarchical index back into regular columns, producing a flat DataFrame that's often easier to work with. Each row now explicitly shows the region, product, and aggregated sales.

This flat format is typically what you want for further analysis, visualization, or exporting to CSV/Excel. The hierarchical format is useful for display but can be cumbersome for programmatic access.
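An alternative to calling reset_index() afterwards is the as_index=False option, which keeps the grouping columns as regular columns from the start:

```python
import pandas as pd

df = pd.DataFrame({
    'region': ['East', 'West', 'East', 'West', 'East', 'West'],
    'product': ['A', 'A', 'B', 'B', 'A', 'B'],
    'sales': [100, 150, 200, 250, 120, 180]
})

# as_index=False returns a flat DataFrame directly
result = df.groupby(['region', 'product'], as_index=False)['sales'].sum()
print(result)
#   region product  sales
# 0   East       A    220
# 1   East       B    200
# 2   West       A    150
# 3   West       B    430
```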

Transform and Apply

# Transform: return result with same shape as input
df = pd.DataFrame({
    'category': ['A', 'A', 'B', 'B'],
    'value': [10, 20, 100, 200]
})

# Add group mean as new column
df['group_mean'] = df.groupby('category')['value'].transform('mean')
print(df)
#   category  value  group_mean
# 0        A     10        15.0
# 1        A     20        15.0
# 2        B    100       150.0
# 3        B    200       150.0

transform() applies an aggregation but broadcasts the result back to match the original DataFrame's shape. Here, the mean of category A (15) is placed in every row belonging to category A. Unlike regular aggregation which reduces data, transform keeps all original rows.

This is incredibly useful for adding group-level statistics as new columns. Common uses include calculating each row's deviation from group mean, percentile within group, or normalized values.

# Normalize within groups
df['normalized'] = df.groupby('category')['value'].transform(
    lambda x: (x - x.mean()) / x.std()
)

# Apply: flexible function application
def top_n(group, n=2):
    return group.nlargest(n, 'value')

result = df.groupby('category').apply(top_n, n=1)
print(result)

transform() accepts lambda functions for custom calculations. The normalization formula (x - mean) / std standardizes values within each group, centering them around 0 with standard deviation of 1. This is common preprocessing for machine learning.

apply() is the most flexible groupby method—it can return any shape. Your function receives each group as a DataFrame and can do anything: filter rows, add columns, reshape data. However, it's slower than built-in aggregations and should be used when other methods can't solve your problem.

Filtering Groups

# Filter: keep groups that meet condition
df = pd.DataFrame({
    'category': ['A', 'A', 'B', 'B', 'C'],
    'value': [10, 20, 100, 200, 5]
})

# Keep only groups with sum > 50
result = df.groupby('category').filter(lambda x: x['value'].sum() > 50)
print(result)
#   category  value
# 2        B    100
# 3        B    200

filter() keeps or discards entire groups based on a condition. The lambda function receives each group as a DataFrame and must return True (keep all rows in this group) or False (discard all rows). Here, only category B has sum > 50, so all of B's rows are kept while A and C are removed.

Unlike boolean filtering on individual rows, filter() makes group-level decisions. It's perfect for questions like "keep only customers with total purchases over $1000" or "show only regions with at least 10 orders."
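The "total purchases over $1000" idea might look like this sketch (the customer data is hypothetical):

```python
import pandas as pd

orders = pd.DataFrame({
    'customer': ['ann', 'ann', 'ben', 'ben', 'cho'],
    'amount': [600, 500, 300, 200, 1200]
})

# Keep every order from customers whose total spend exceeds 1000
big_spenders = orders.groupby('customer').filter(
    lambda g: g['amount'].sum() > 1000
)
print(sorted(big_spenders['customer'].unique()))  # ['ann', 'cho']
```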

# Keep groups with at least 2 members
result = df.groupby('category').filter(lambda x: len(x) >= 2)
print(result)

A common use case is filtering by group size. len(x) gives the number of rows in each group. Groups A and B have 2 rows each, while C has only 1. With len(x) >= 2, we keep A and B but drop C entirely.

This is useful for excluding rare categories from analysis, removing low-frequency items that might cause statistical issues, or focusing on groups with enough data to be meaningful.

Practice: GroupBy

Task: Group sales data by 'region' and calculate the total sales per region.

Show Solution
import pandas as pd

df = pd.DataFrame({
    'region': ['North', 'South', 'North', 'South'],
    'sales': [100, 200, 150, 250]
})
totals = df.groupby('region')['sales'].sum()
print(totals)

Task: Count how many products are in each category.

Show Solution
import pandas as pd

df = pd.DataFrame({
    'product': ['A', 'B', 'C', 'D', 'E'],
    'category': ['Electronics', 'Clothing', 'Electronics', 
                 'Electronics', 'Clothing']
})
counts = df.groupby('category')['product'].count()
print(counts)

Task: Group by category and get sum, mean, and max of values.

Show Solution
import pandas as pd

df = pd.DataFrame({
    'category': ['A', 'B', 'A', 'B', 'A'],
    'value': [10, 20, 30, 40, 50]
})
result = df.groupby('category')['value'].agg(['sum', 'mean', 'max'])
print(result)

Task: Group sales data by both region and product_type and sum sales.

Show Solution
import pandas as pd

df = pd.DataFrame({
    'region': ['North', 'North', 'South', 'South'],
    'product_type': ['A', 'B', 'A', 'B'],
    'sales': [100, 150, 200, 120]
})
result = df.groupby(['region', 'product_type'])['sales'].sum()
print(result)

# Reset index to get a regular DataFrame
result_df = result.reset_index()
print(result_df)

Task: Normalize scores within each class by subtracting the class mean.

Show Solution
import pandas as pd

df = pd.DataFrame({
    'student': ['Alice', 'Bob', 'Carol', 'David'],
    'class': ['A', 'A', 'B', 'B'],
    'score': [85, 92, 78, 88]
})
df['normalized'] = df.groupby('class')['score'].transform(lambda x: x - x.mean())
print(df)

Task: Filter to keep only categories with more than 2 products.

Show Solution
import pandas as pd

df = pd.DataFrame({
    'category': ['A', 'A', 'A', 'B', 'B', 'C'],
    'product': ['P1', 'P2', 'P3', 'P4', 'P5', 'P6'],
    'price': [10, 15, 20, 25, 30, 35]
})

# Use filter to keep groups with more than 2 products
result = df.groupby('category').filter(lambda x: len(x) > 2)
print(result)
06

Data Cleaning

Real-world data is messy. Data cleaning involves handling missing values, removing duplicates, fixing data types, and standardizing formats. These operations are essential before any analysis.

Key Concept

Missing Data in Pandas

Pandas uses NaN (Not a Number) and None to represent missing values. Operations on columns with missing data typically ignore NaN values, but you need to decide how to handle them for accurate analysis.

Strategy Options: Remove rows with missing data, fill with a default value, interpolate from neighbors, or forward/backward fill.

Detecting Missing Values

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'name': ['Alice', 'Bob', None, 'David'],
    'age': [25, np.nan, 35, 28],
    'city': ['NYC', 'LA', 'Chicago', np.nan]
})

We create a sample DataFrame with intentionally missing values to demonstrate detection techniques. None is Python's null value, while np.nan (Not a Number) is NumPy's representation of missing data. Pandas treats both as missing values internally.

In real datasets, missing values appear for many reasons: users skipping optional fields, data corruption during transfer, sensors failing to record readings, or records from different sources having different fields. Understanding your data's missingness is the first step in cleaning it.

# Check for missing values
print(df.isna())        # Boolean mask
print(df.isnull())      # Same as isna()

isna() returns a DataFrame of the same shape where each cell is True if the original value was missing, False otherwise. This boolean mask lets you see exactly where the gaps in your data are located. isnull() is simply an alias—they do exactly the same thing.

The output shows True at positions where data is missing: row 2's name, row 1's age, and row 3's city. This visual representation helps you spot patterns—are missing values concentrated in certain rows or columns?

# Count missing per column
print(df.isna().sum())
# name    1
# age     1
# city    1

Chaining .sum() after isna() counts the True values (missing) in each column. Since True equals 1 in Python, summing a boolean column gives the count of True values. Here, each column has exactly 1 missing value.

This is your first quantitative view of data quality. A column with 90% missing values might need to be dropped entirely, while a column with 1% missing can often be filled. This count guides your cleaning strategy.

# Total missing values
print(df.isna().sum().sum())  # 3

Double .sum() first sums each column (giving missing count per column), then sums those sums (giving total missing across the entire DataFrame). Here, 1+1+1 = 3 total missing values in the dataset.

This single number is useful for tracking data quality over time or comparing datasets. "Dataset A has 500 missing values, Dataset B has 50"—immediately tells you which needs more cleaning work.

# Percentage missing
print(df.isna().mean() * 100)

Using .mean() instead of .sum() gives the proportion of missing values (since True=1, False=0, mean gives the fraction that are True). Multiply by 100 to express as a percentage. This is often more meaningful than raw counts.

If a column has 25% missing data, that's concerning regardless of whether your dataset has 100 rows or 1 million rows. Percentages let you set universal thresholds like "drop columns with >50% missing" that work across different dataset sizes.
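A threshold rule like "drop columns with more than 50% missing" can be written directly from this percentage (a sketch; the 50% cutoff is just an example value):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'a': [1, 2, 3, 4],
    'b': [np.nan, np.nan, np.nan, 4],  # 75% missing
    'c': [1, np.nan, 3, 4]             # 25% missing
})

# Keep only columns where at most half the values are missing
keep = df.columns[df.isna().mean() <= 0.5]
df_clean = df[keep]
print(list(df_clean.columns))  # ['a', 'c']
```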

# Rows with any missing value
print(df[df.isna().any(axis=1)])

df.isna().any(axis=1) checks each row and returns True if ANY value in that row is missing. axis=1 means "check across columns for each row." This boolean Series can then filter the DataFrame to show only problematic rows.

This view helps you investigate missing data patterns. Are certain records systematically incomplete? Maybe data from a specific source or time period has quality issues. Seeing the actual rows with missing data often reveals the underlying cause.

Removing Missing Values

# Drop rows with ANY missing value
df_clean = df.dropna()
print(df_clean)

# Drop rows where ALL values are missing
df_clean = df.dropna(how='all')

# Drop rows with missing in specific columns
df_clean = df.dropna(subset=['name', 'age'])

dropna() removes rows containing missing values. By default, it drops rows with ANY missing value. With how='all', it only drops rows where every value is missing—useful for removing completely empty rows while preserving partial data.

The subset parameter lets you focus on critical columns. If 'name' and 'age' are essential but 'city' is optional, use subset=['name', 'age'] to keep rows where those two columns have data, regardless of other columns.

# Drop columns with missing values
df_clean = df.dropna(axis=1)

# Keep rows with at least 2 non-null values
df_clean = df.dropna(thresh=2)

axis=1 switches from dropping rows to dropping columns. This removes any column that has missing values—use carefully as it can eliminate useful columns with just one missing value.

thresh=n keeps rows with at least n non-null values. This is a more nuanced approach: "I can tolerate some missing data, but the row must have enough information to be useful." It's often better than the all-or-nothing approach of how='any'.

Filling Missing Values

# Fill with a constant
df_filled = df.fillna('Unknown')

# Fill specific columns differently
df['age'] = df['age'].fillna(df['age'].mean())  # Mean imputation
df['city'] = df['city'].fillna('Not Specified')

# Fill with different values per column
fill_values = {'name': 'Unknown', 'age': 0, 'city': 'N/A'}
df_filled = df.fillna(fill_values)

fillna() replaces missing values with a specified value. A single value fills all missing cells. For numeric columns, mean imputation (filling with the column's average) is a common statistical technique that preserves the overall distribution.

Different columns often need different fill strategies. Pass a dictionary to fillna() where keys are column names and values are the fill values. Text columns might get 'Unknown' while numeric columns get 0 or the mean.

# Forward fill (use previous value)
df_filled = df.ffill()  # fillna(method='ffill') is deprecated

# Backward fill (use next value)
df_filled = df.bfill()

# Interpolate numeric values
df['age'] = df['age'].interpolate()

Forward fill (ffill) copies the last valid value forward into missing cells—useful for time series where the previous state continues until changed. Backward fill (bfill) does the opposite, copying the next valid value backward.

interpolate() estimates missing numeric values based on surrounding values using linear interpolation by default. For time series or ordered data, this often produces more realistic estimates than simple mean imputation.
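A concrete illustration of linear interpolation filling gaps along the straight line between neighboring values:

```python
import pandas as pd
import numpy as np

s = pd.Series([1.0, np.nan, 3.0, np.nan, np.nan, 9.0])

# Each gap is filled linearly between its nearest valid neighbors
print(s.interpolate().tolist())  # [1.0, 2.0, 3.0, 5.0, 7.0, 9.0]
```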

Handling Duplicates

df = pd.DataFrame({
    'id': [1, 2, 2, 3, 3, 3],
    'name': ['Alice', 'Bob', 'Bob', 'Carol', 'Carol', 'Carol'],
    'value': [10, 20, 20, 30, 31, 30]
})

# Check for duplicates
print(df.duplicated())           # Boolean mask
print(df.duplicated().sum())     # Count of duplicates

# Find duplicate rows
print(df[df.duplicated()])

# Check duplicates in specific columns
print(df.duplicated(subset=['id']))

duplicated() marks rows as True if they're duplicates of an earlier row. By default, it compares all columns. The first occurrence is marked False (not a duplicate), while subsequent identical rows are True. Summing gives the count of duplicate rows.

Use subset to check duplicates in specific columns only. df.duplicated(subset=['id']) finds rows where 'id' repeats, even if other columns differ. This is crucial when 'id' should be unique—any duplicates indicate data quality issues.

# Remove duplicates
df_unique = df.drop_duplicates()

# Keep first/last occurrence
df_unique = df.drop_duplicates(keep='first')  # Default
df_unique = df.drop_duplicates(keep='last')
df_unique = df.drop_duplicates(keep=False)    # Remove all duplicates

# Remove duplicates based on specific columns
df_unique = df.drop_duplicates(subset=['id', 'name'])

drop_duplicates() removes duplicate rows. keep='first' (default) retains the first occurrence, keep='last' retains the last. keep=False removes ALL copies of duplicated rows, leaving only rows that were never duplicated.

Use subset to de-duplicate based on certain columns. If you have multiple records per user but want one per user, use subset=['user_id'] to keep one row per unique user ID. Combine with keep to choose which record to retain.
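A sketch of the "one row per user" pattern just described, with hypothetical column names: sort by timestamp first so keep='last' retains the most recent record per user.

```python
import pandas as pd

events = pd.DataFrame({
    'user_id': [1, 2, 1, 2],
    'ts': pd.to_datetime(['2024-01-01', '2024-01-02',
                          '2024-01-05', '2024-01-03']),
    'plan': ['free', 'free', 'pro', 'pro']
})

# Sort so each user's newest record comes last, then keep that last row
latest = (events.sort_values('ts')
                .drop_duplicates(subset=['user_id'], keep='last'))
print(latest)
```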

Data Type Conversion

df = pd.DataFrame({
    'id': ['1', '2', '3'],          # String instead of int
    'price': ['10.5', '20.3', '15.8'],  # String instead of float
    'date': ['2024-01-15', '2024-02-20', '2024-03-25']
})

# Convert types
df['id'] = df['id'].astype(int)
df['price'] = df['price'].astype(float)

Data imported from CSV files often comes in as strings even when it should be numeric. astype() converts a column to a different data type. Converting '1', '2', '3' from strings to integers enables mathematical operations and proper sorting.

Check your data types early with df.dtypes. Incorrect types cause subtle bugs: '10' + '5' = '105' (string concatenation) vs 10 + 5 = 15 (addition). Many Pandas functions behave differently based on type.
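A minimal illustration of that pitfall: the same values behave completely differently as strings versus integers, and df.dtypes tells you which you have.

```python
import pandas as pd

df = pd.DataFrame({'qty': ['10', '5']})
print(df.dtypes)  # qty is 'object' (strings), not a numeric dtype

summed_as_str = df['qty'].sum()              # '105' -- string concatenation
summed_as_int = df['qty'].astype(int).sum()  # 15   -- real addition
print(summed_as_str, summed_as_int)
```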

# Convert to datetime
df['date'] = pd.to_datetime(df['date'])

# Handle errors in conversion
df['id'] = pd.to_numeric(df['id'], errors='coerce')  # Invalid -> NaN
df['date'] = pd.to_datetime(df['date'], errors='coerce')

# Convert to category for memory efficiency
df['status'] = df['status'].astype('category')

pd.to_datetime() parses date strings into datetime objects, enabling date arithmetic, filtering by date ranges, and extracting components like year and month. pd.to_numeric() handles numeric conversion with better error handling.

errors='coerce' converts invalid values to NaN instead of crashing. This is safer for messy real-world data. Category dtype saves memory when a column has few unique values repeated many times (like 'status' with only 'active', 'pending', 'closed').
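The memory saving from the category dtype is easy to demonstrate. A small sketch (the exact byte counts depend on your pandas version, but the category version is reliably much smaller):

```python
import pandas as pd

# A low-cardinality column: 3 unique values repeated 10,000 times each
status = pd.Series(['active', 'pending', 'closed'] * 10_000)

as_object = status.memory_usage(deep=True)
as_category = status.astype('category').memory_usage(deep=True)

print(f"object:   {as_object:,} bytes")
print(f"category: {as_category:,} bytes")  # substantially smaller
```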

String Cleaning

df = pd.DataFrame({
    'name': ['  Alice  ', 'BOB', 'carol', 'DAVID  '],
    'email': ['alice@GMAIL.com', 'bob@yahoo.COM', 'CAROL@outlook.com', None]
})

# Strip whitespace
df['name'] = df['name'].str.strip()

# Standardize case
df['name'] = df['name'].str.title()  # 'Alice', 'Bob', 'Carol'
df['email'] = df['email'].str.lower()

The .str accessor provides vectorized string methods that work on entire columns. str.strip() removes leading and trailing whitespace—essential because ' Alice ' and 'Alice' won't match in comparisons or grouping.

str.title() capitalizes the first letter of each word, str.lower() converts everything to lowercase, and str.upper() to uppercase. Standardizing case ensures 'BOB', 'Bob', and 'bob' are recognized as the same person.

# Replace patterns
df['name'] = df['name'].str.replace(r'\s+', ' ', regex=True)  # Multiple spaces to one

# Extract parts
df['email_domain'] = df['email'].str.split('@').str[1]

# Remove unwanted characters
df['phone'] = df['phone'].str.replace(r'[^0-9]', '', regex=True)

str.replace() with regex=True enables pattern matching. The pattern \s+ matches one or more whitespace characters, collapsing "John    Doe" to "John Doe". Regular expressions give you powerful text cleaning capabilities.

str.split('@').str[1] splits emails at @ and takes the second part (domain). The [^0-9] regex matches anything that's NOT a digit, so replacing with '' removes all non-numeric characters—perfect for standardizing phone numbers like '(555) 123-4567' to '5551234567'.

Renaming Columns

# Rename specific columns
df = df.rename(columns={'old_name': 'new_name', 'price': 'unit_price'})

# Rename all columns with function
df.columns = df.columns.str.lower()
df.columns = df.columns.str.replace(' ', '_')

rename() with a dictionary maps old column names to new ones. Only specified columns are renamed; others stay unchanged. This is the safest method because it's explicit and fails if the old name doesn't exist.

df.columns returns an Index of column names that supports string methods. str.lower() converts all column names to lowercase, str.replace(' ', '_') replaces spaces with underscores. These operations affect all columns at once.

# Combined: lowercase and replace spaces
df = df.rename(columns=lambda x: x.lower().replace(' ', '_'))

# Set column names directly
df.columns = ['col1', 'col2', 'col3']

Pass a function to rename() to transform all column names. The lambda receives each column name and returns the new name. This one-liner converts "User Name" to "user_name"—a common convention for Python-friendly column names.

You can also set df.columns directly to a list of new names. The list must have exactly as many elements as there are columns. This is useful when you need completely different names, but be careful—the order must match the existing column order.

Practice: Data Cleaning

Task: Create a DataFrame with some missing values and count how many missing values are in each column.

Show Solution
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'A': [1, 2, np.nan, 4],
    'B': [np.nan, 2, 3, np.nan],
    'C': [1, 2, 3, 4]
})
missing_counts = df.isna().sum()
print(missing_counts)

Task: Create a DataFrame with duplicate rows and remove them.

Show Solution
import pandas as pd

df = pd.DataFrame({
    'id': [1, 2, 2, 3],
    'name': ['A', 'B', 'B', 'C']
})
df_clean = df.drop_duplicates()
print(f"Before: {len(df)} rows")
print(f"After: {len(df_clean)} rows")

Task: Fill missing values in a numeric column with the column mean.

Show Solution
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'score': [85, np.nan, 90, np.nan, 78]
})
df['score'] = df['score'].fillna(df['score'].mean())
print(df)

Task: Clean a column of names by stripping whitespace and converting to title case.

Show Solution
import pandas as pd

df = pd.DataFrame({
    'name': ['  ALICE  ', 'bob', '  Carol  ', 'DAVID']
})
df['name'] = df['name'].str.strip().str.title()
print(df)

Task: Clean a messy DataFrame: remove duplicates, fill missing values, and standardize column names.

Show Solution
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'User Name': ['Alice', 'Bob', 'Bob', np.nan],
    'User Age': [25, np.nan, np.nan, 30],
    'User City': ['NYC', 'LA', 'LA', 'Chicago']
})

# 1. Standardize column names
df.columns = df.columns.str.lower().str.replace(' ', '_')

# 2. Remove duplicates
df = df.drop_duplicates()

# 3. Fill missing values
df['user_name'] = df['user_name'].fillna('Unknown')
df['user_age'] = df['user_age'].fillna(df['user_age'].mean())

print(df)

Task: Identify outliers in a price column using IQR method and cap them.

Show Solution
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'product': ['A', 'B', 'C', 'D', 'E', 'F'],
    'price': [10, 15, 12, 1000, 18, 11]  # 1000 is an outlier
})

# Calculate IQR
Q1 = df['price'].quantile(0.25)
Q3 = df['price'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Cap outliers
df['price_capped'] = df['price'].clip(lower=lower_bound, upper=upper_bound)
print(f"Lower bound: {lower_bound:.2f}, Upper bound: {upper_bound:.2f}")
print(df)

Task: Parse messy date strings into proper datetime format and handle invalid dates.

Show Solution
import pandas as pd

df = pd.DataFrame({
    'event': ['A', 'B', 'C', 'D', 'E'],
    'date_str': ['2024-01-15', '01/20/2024', 'Jan 25, 2024', 'invalid', '2024-02-01']
})

# Parse dates with error handling
df['date'] = pd.to_datetime(df['date_str'], errors='coerce')

# Check for invalid dates
invalid_dates = df[df['date'].isna()]
print(f"Invalid dates found: {len(invalid_dates)}")
print(invalid_dates)

# Fill or remove invalid dates
df['date'] = df['date'].fillna(pd.Timestamp('2024-01-01'))
print(df)

Task: Clean inconsistent product names (varying cases, extra spaces).

Show Solution
import pandas as pd

df = pd.DataFrame({
    'product': ['  laptop  ', 'LAPTOP', 'Laptop', '  phone', 'PHONE '],
    'price': [999, 899, 799, 599, 649]
})

# Clean product names
df['product_clean'] = (df['product']
    .str.strip()           # Remove leading/trailing spaces
    .str.lower()           # Convert to lowercase
    .str.replace(r'\s+', ' ', regex=True)  # Collapse repeated spaces to one
    .str.title()           # Title case
)

print(df)

# Group by cleaned names
grouped = df.groupby('product_clean')['price'].agg(['mean', 'count'])
print(grouped)
07

Merging and Joining

Combine DataFrames horizontally (merge/join) or vertically (concat/append). These operations are fundamental when working with relational data from multiple sources.

Key Concept

SQL-Style Joins in Pandas

Pandas merge() works like SQL JOIN operations. You can perform inner, outer, left, and right joins to combine DataFrames based on common columns or indices.

Join Types: Inner (only matching rows), outer (all rows), left (all left + matching right), right (all right + matching left).

Basic Merge (Inner Join)

import pandas as pd

# Two DataFrames with common column
df1 = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'name': ['Alice', 'Bob', 'Carol', 'David']
})

df2 = pd.DataFrame({
    'id': [2, 3, 4, 5],
    'salary': [60000, 70000, 55000, 65000]
})

We start with two DataFrames that share a common column ('id'). df1 contains employee names with IDs 1-4, while df2 contains salaries for IDs 2-5. Some IDs exist in both (2, 3, 4), some only in df1 (1), and some only in df2 (5).

This scenario is extremely common: you have related data spread across multiple tables or files. Merging combines them into a single DataFrame where you can analyze the relationships between the data.

# Inner join (default) - only matching ids
result = pd.merge(df1, df2, on='id')
print(result)
#    id   name  salary
# 0   2    Bob   60000
# 1   3  Carol   70000
# 2   4  David   55000

pd.merge() combines DataFrames based on a common column. The 'on' parameter specifies which column to match. By default, merge performs an inner join—only rows where the key (id) exists in BOTH DataFrames appear in the result.

Alice (id=1) isn't in the result because id=1 doesn't exist in df2. The mysterious id=5 salary isn't included either because id=5 doesn't exist in df1. Inner join is strict: it requires data from both sides.

Join Types

# Left join - all rows from left, matching from right
result = pd.merge(df1, df2, on='id', how='left')
print(result)
#    id   name   salary
# 0   1  Alice      NaN
# 1   2    Bob  60000.0
# 2   3  Carol  70000.0
# 3   4  David  55000.0

Left join (how='left') keeps ALL rows from the left DataFrame (df1) regardless of whether they have a match. Alice (id=1) now appears with NaN for salary because there's no matching record in df2. All employees are preserved.

This is the most common join type in practice. You typically have a "primary" table you want to keep complete, while adding data from a "secondary" table where available. Unmatched rows get NaN values that you can later fill or filter.
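A sketch of the fill-or-filter step mentioned above, reusing the df1/df2 layout from this section:

```python
import pandas as pd

df1 = pd.DataFrame({'id': [1, 2, 3, 4],
                    'name': ['Alice', 'Bob', 'Carol', 'David']})
df2 = pd.DataFrame({'id': [2, 3, 4, 5],
                    'salary': [60000, 70000, 55000, 65000]})

result = pd.merge(df1, df2, on='id', how='left')

# Option 1: fill unmatched salaries with a default value
filled = result.fillna({'salary': 0})

# Option 2: keep only the employees that found a match
matched_only = result.dropna(subset=['salary'])
print(filled)
```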

# Right join - all rows from right, matching from left
result = pd.merge(df1, df2, on='id', how='right')
print(result)

# Outer join - all rows from both
result = pd.merge(df1, df2, on='id', how='outer')
print(result)
#    id   name   salary
# 0   1  Alice      NaN
# 1   2    Bob  60000.0
# 2   3  Carol  70000.0
# 3   4  David  55000.0
# 4   5    NaN  65000.0

Right join is the mirror image of left join—it keeps all rows from the right DataFrame. Outer join (how='outer') keeps ALL rows from BOTH DataFrames, filling NaN where there's no match on either side.

Outer join is useful for finding discrepancies: which employees have no salary record? Which salary records have no employee? The NaN values reveal the mismatches between your data sources.
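That discrepancy-finding idea can be made explicit with merge's indicator parameter, which adds a _merge column labeling where each row came from:

```python
import pandas as pd

df1 = pd.DataFrame({'id': [1, 2, 3, 4],
                    'name': ['Alice', 'Bob', 'Carol', 'David']})
df2 = pd.DataFrame({'id': [2, 3, 4, 5],
                    'salary': [60000, 70000, 55000, 65000]})

result = pd.merge(df1, df2, on='id', how='outer', indicator=True)

# Rows present on only one side reveal the mismatches
only_left = result[result['_merge'] == 'left_only']    # employee, no salary
only_right = result[result['_merge'] == 'right_only']  # salary, no employee
print(only_left)
print(only_right)
```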

Join Types Reference

Join Type | Description               | Result                            | Use Case
inner     | Only matching rows        | Intersection of keys              | When you need complete data only
left      | All left + matching right | All left rows, NaN for unmatched  | Keep all primary records
right     | All right + matching left | All right rows, NaN for unmatched | Keep all secondary records
outer     | All rows from both        | Union of keys                     | Find all data, including mismatches
cross     | Every combination         | Cartesian product                 | Generate all possible pairs

Merging on Different Column Names

df1 = pd.DataFrame({
    'employee_id': [1, 2, 3],
    'name': ['Alice', 'Bob', 'Carol']
})

df2 = pd.DataFrame({
    'emp_id': [1, 2, 3],
    'department': ['Sales', 'IT', 'HR']
})

Real-world data rarely has perfectly matching column names. Here, df1 uses 'employee_id' while df2 uses 'emp_id' for the same concept. This happens when data comes from different systems, databases, or when conventions vary across teams.

Pandas handles this gracefully—you don't need to rename columns before merging. Just tell merge which columns correspond to each other.

# Columns have different names
result = pd.merge(df1, df2, left_on='employee_id', right_on='emp_id')
print(result)

# Drop the redundant column
result = result.drop(columns=['emp_id'])

Use left_on and right_on to specify different column names: left_on is the column from the first (left) DataFrame, right_on from the second (right). Pandas matches rows where employee_id equals emp_id.

After merging, both columns appear in the result (they're technically different columns even though they contain the same values). Use drop() to remove the redundant one for cleaner output.

Merging on Multiple Columns

df1 = pd.DataFrame({
    'year': [2023, 2023, 2024, 2024],
    'quarter': ['Q1', 'Q2', 'Q1', 'Q2'],
    'revenue': [100, 150, 120, 180]
})

df2 = pd.DataFrame({
    'year': [2023, 2023, 2024, 2024],
    'quarter': ['Q1', 'Q2', 'Q1', 'Q2'],
    'expenses': [80, 90, 85, 95]
})

Sometimes a single column isn't enough to uniquely identify a row. Here, neither 'year' nor 'quarter' alone uniquely identifies a record—we need BOTH: 2023-Q1 is different from 2023-Q2 and from 2024-Q1. This is called a composite key.

Financial data, time series, and hierarchical data commonly require composite keys. Sales by region and month, scores by student and subject, inventory by warehouse and product—all need multiple columns to match correctly.

# Merge on multiple columns
result = pd.merge(df1, df2, on=['year', 'quarter'])
print(result)
#    year quarter  revenue  expenses
# 0  2023      Q1      100        80
# 1  2023      Q2      150        90
# 2  2024      Q1      120        85
# 3  2024      Q2      180        95

Pass a list of column names to the 'on' parameter. Pandas will only match rows where ALL specified columns match. 2023-Q1 revenue joins with 2023-Q1 expenses—not with 2023-Q2 or 2024-Q1.

Now you can analyze revenue vs expenses for each period. The resulting DataFrame has one row per unique year-quarter combination with data from both sources side by side.
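A sketch of that analysis, rebuilding the merge from above and adding a profit column:

```python
import pandas as pd

df1 = pd.DataFrame({
    'year': [2023, 2023, 2024, 2024],
    'quarter': ['Q1', 'Q2', 'Q1', 'Q2'],
    'revenue': [100, 150, 120, 180]
})
df2 = pd.DataFrame({
    'year': [2023, 2023, 2024, 2024],
    'quarter': ['Q1', 'Q2', 'Q1', 'Q2'],
    'expenses': [80, 90, 85, 95]
})

merged = pd.merge(df1, df2, on=['year', 'quarter'])

# With both metrics side by side, profit per period is one line
merged['profit'] = merged['revenue'] - merged['expenses']
print(merged)
```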

Concatenating DataFrames

# Vertical concatenation (stacking)
df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})

result = pd.concat([df1, df2])
print(result)
#    A  B
# 0  1  3
# 1  2  4
# 0  5  7
# 1  6  8

concat() stacks DataFrames vertically (by default)—it places one below the other. This is different from merge, which combines columns side-by-side based on matching keys. Use concat when you have data split across multiple files with identical structure.

Notice the index values (0, 1, 0, 1)—each DataFrame kept its original index. This can cause issues if you try to access by index, as there are now duplicate index values.

# Reset index after concat
result = pd.concat([df1, df2], ignore_index=True)
print(result)
#    A  B
# 0  1  3
# 1  2  4
# 2  5  7
# 3  6  8

# Horizontal concatenation
result = pd.concat([df1, df2], axis=1)
print(result)
#    A  B  A  B
# 0  1  3  5  7
# 1  2  4  6  8

ignore_index=True creates a fresh sequential index (0, 1, 2, 3)—usually what you want when combining data from the same source split across files (like monthly reports). The original indices are discarded.

axis=1 concatenates horizontally, placing DataFrames side by side. This is useful when you have different columns for the same rows. Note: this creates duplicate column names (A, B, A, B)—you'll likely want to rename them afterward.

Joining on Index

df1 = pd.DataFrame(
    {'A': [1, 2, 3]},
    index=['a', 'b', 'c']
)

df2 = pd.DataFrame(
    {'B': [4, 5, 6]},
    index=['a', 'b', 'd']
)

Sometimes your key isn't a column—it's the index itself. Here, df1 and df2 use string indices ('a', 'b', 'c' and 'a', 'b', 'd') rather than a column to identify rows. This is common with time series (date index) or when you've already set a meaningful index.

Both DataFrames share indices 'a' and 'b', while 'c' is only in df1 and 'd' is only in df2. The join operation will need to handle these mismatches.

# Join on index
result = df1.join(df2, how='outer')
print(result)
#      A    B
# a  1.0  4.0
# b  2.0  5.0
# c  3.0  NaN
# d  NaN  6.0

# Merge on index
result = pd.merge(df1, df2, left_index=True, right_index=True, how='outer')

The join() method is designed specifically for index-based joins and offers cleaner syntax. By default it does a left join; here we use how='outer' to keep all indices from both DataFrames, with NaN where data is missing.

You can also use merge() with left_index=True and right_index=True to join on indices. This is more verbose but offers more flexibility, like combining index-based and column-based joins (e.g., left_index=True, right_on='column').

Practice: Merging and Joining

Task: Merge two DataFrames on a common 'id' column using inner join.

Show Solution
import pandas as pd

orders = pd.DataFrame({
    'id': [1, 2, 3],
    'product': ['A', 'B', 'C']
})
prices = pd.DataFrame({
    'id': [1, 2, 4],
    'price': [10, 20, 40]
})
result = pd.merge(orders, prices, on='id')
print(result)

Task: Stack two DataFrames with the same columns on top of each other.

Show Solution
import pandas as pd

df1 = pd.DataFrame({'name': ['Alice', 'Bob'], 'age': [25, 30]})
df2 = pd.DataFrame({'name': ['Carol', 'David'], 'age': [35, 28]})
result = pd.concat([df1, df2], ignore_index=True)
print(result)

Task: Perform a left join and identify rows that did not find a match.

Show Solution
import pandas as pd

employees = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'name': ['Alice', 'Bob', 'Carol', 'David']
})
departments = pd.DataFrame({
    'id': [1, 2],
    'dept': ['Sales', 'IT']
})
result = pd.merge(employees, departments, on='id', how='left')
unmatched = result[result['dept'].isna()]
print("Unmatched employees:")
print(unmatched)

Task: Merge two DataFrames using composite keys (year and month).

Show Solution
import pandas as pd

sales = pd.DataFrame({
    'year': [2023, 2023, 2024],
    'month': [1, 2, 1],
    'revenue': [1000, 1200, 1100]
})
targets = pd.DataFrame({
    'year': [2023, 2023, 2024],
    'month': [1, 2, 1],
    'target': [900, 1100, 1000]
})
result = pd.merge(sales, targets, on=['year', 'month'])
result['achieved'] = result['revenue'] >= result['target']
print(result)

Task: Stack three monthly sales DataFrames vertically with a new index.

Show Solution
import pandas as pd

jan = pd.DataFrame({'product': ['A', 'B'], 'sales': [100, 150]})
feb = pd.DataFrame({'product': ['A', 'B'], 'sales': [120, 160]})
mar = pd.DataFrame({'product': ['A', 'B'], 'sales': [110, 140]})

# Concatenate with ignore_index to create fresh index
combined = pd.concat([jan, feb, mar], ignore_index=True)
print(combined)

# Or use keys to identify source month
combined_keyed = pd.concat([jan, feb, mar], keys=['Jan', 'Feb', 'Mar'])
print(combined_keyed)

Task: Find customers who haven't placed any orders (anti-join).

Show Solution
import pandas as pd

customers = pd.DataFrame({
    'customer_id': [1, 2, 3, 4, 5],
    'name': ['Alice', 'Bob', 'Carol', 'David', 'Eve']
})
orders = pd.DataFrame({
    'order_id': [101, 102, 103],
    'customer_id': [1, 3, 1],
    'amount': [100, 200, 150]
})

# Method 1: Left merge with indicator
merged = customers.merge(orders, on='customer_id', how='left', indicator=True)
no_orders = merged[merged['_merge'] == 'left_only'][['customer_id', 'name']]
print("Customers with no orders:")
print(no_orders)

# Method 2: Using isin()
no_orders_v2 = customers[~customers['customer_id'].isin(orders['customer_id'])]
print(no_orders_v2)
08

Advanced Operations

Master advanced Pandas operations including pivot tables, melting, window functions, and apply/map operations. These techniques unlock sophisticated data transformations.

Pivot Tables

import pandas as pd

df = pd.DataFrame({
    'date': ['2024-01', '2024-01', '2024-02', '2024-02'],
    'region': ['East', 'West', 'East', 'West'],
    'sales': [100, 150, 120, 180]
})

# Create pivot table
pivot = pd.pivot_table(
    df,
    values='sales',
    index='date',
    columns='region',
    aggfunc='sum'
)
print(pivot)
# region     East  West
# date                  
# 2024-01     100   150
# 2024-02     120   180

Pivot tables transform "long" data into "wide" format while aggregating. Here, each row in the original data represents one region's sales for one date. The pivot table reorganizes this so dates become rows, regions become columns, and sales values fill the cells.

The key parameters: 'values' is what you're aggregating (sales), 'index' becomes row labels (date), 'columns' becomes column headers (region), and 'aggfunc' is the aggregation function (sum, mean, count, etc.).

# Multiple aggregations
pivot = pd.pivot_table(
    df,
    values='sales',
    index='date',
    columns='region',
    aggfunc=['sum', 'mean'],
    margins=True  # Add row/column totals
)

Pass a list of functions to aggfunc to get multiple summary statistics in one pivot table. This creates hierarchical columns where the first level is the function and the second level is the region.

margins=True adds "All" row and column showing totals. This is like the Grand Total in Excel pivot tables—essential for reports where you need to see both individual values and overall summaries.

Melt (Unpivot)

# Wide format data
df_wide = pd.DataFrame({
    'name': ['Alice', 'Bob'],
    'math': [90, 85],
    'science': [88, 92],
    'english': [95, 78]
})

This "wide" format is human-readable—each student has one row with all their scores as separate columns. However, many visualization libraries and statistical functions expect "long" format where each observation (student-subject-score) is its own row.

Wide format is common in spreadsheets: columns represent different measurements of the same entity. But for analysis, long format is often more flexible—you can easily filter by subject, group by subject, or plot all subjects at once.

# Melt to long format
df_long = pd.melt(
    df_wide,
    id_vars=['name'],
    value_vars=['math', 'science', 'english'],
    var_name='subject',
    value_name='score'
)
print(df_long)
#     name   subject  score
# 0  Alice      math     90
# 1    Bob      math     85
# 2  Alice   science     88
# 3    Bob   science     92
# 4  Alice   english     95
# 5    Bob   english     78

melt() "unpivots" wide data into long format. id_vars specifies columns to keep as-is (name stays as a column). value_vars lists columns to melt (math, science, english become values in the new 'subject' column). Their values go into the 'score' column.

Now you have 6 rows instead of 2—one for each student-subject combination. This format lets you easily create box plots by subject, calculate average by subject, or apply the same analysis function to all scores regardless of subject.
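melt has an inverse: pivot() reshapes long data back to wide. A sketch rebuilding df_wide and df_long from above:

```python
import pandas as pd

df_wide = pd.DataFrame({
    'name': ['Alice', 'Bob'],
    'math': [90, 85],
    'science': [88, 92],
    'english': [95, 78]
})

df_long = pd.melt(df_wide, id_vars=['name'],
                  var_name='subject', value_name='score')

# pivot() undoes melt(): subjects become columns again
back_to_wide = df_long.pivot(index='name', columns='subject', values='score')
print(back_to_wide)
```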

Apply and Map

import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Carol'],
    'score': [85, 92, 78]
})

# Apply function to column
df['grade'] = df['score'].apply(lambda x: 'A' if x >= 90 else 'B' if x >= 80 else 'C')

# Apply function to entire row
def classify_row(row):
    if row['score'] >= 90:
        return f"{row['name']}: Excellent"
    return f"{row['name']}: Good"

df['status'] = df.apply(classify_row, axis=1)

apply() runs a function on each element of a Series or each row/column of a DataFrame. The lambda function converts scores to letter grades using conditional logic. apply() is incredibly flexible—any Python function can be used.

With axis=1, apply() passes each row as a Series to your function. This lets you use multiple columns to compute the result. The function receives the entire row and can access any column by name (row['score'], row['name']).

# Map values using dictionary
grade_map = {85: 'B', 92: 'A', 78: 'C'}
df['mapped_grade'] = df['score'].map(grade_map)

# Applymap for element-wise operations (renamed to map in newer versions)
df_numeric = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df_squared = df_numeric.map(lambda x: x ** 2)  # Or applymap() in older versions

map() on a Series replaces values according to a dictionary or function. It's perfect for categorical encoding or lookup tables—faster than apply() for simple substitutions. If a value isn't in the dictionary, it becomes NaN.

DataFrame.map() (formerly applymap()) applies a function to every element in a DataFrame. Here it squares all values. This is useful for element-wise transformations that should apply uniformly across all cells, like rounding or formatting.

Window Functions

import pandas as pd

df = pd.DataFrame({
    'date': pd.date_range('2024-01-01', periods=10),
    'value': [10, 12, 8, 15, 14, 18, 20, 16, 22, 25]
})

# Rolling window (moving average)
df['rolling_mean'] = df['value'].rolling(window=3).mean()
df['rolling_sum'] = df['value'].rolling(window=3).sum()

Rolling windows calculate statistics over a fixed-size sliding window. With window=3, each value is the mean (or sum) of that row and the two previous rows. The first two rows have NaN because there aren't enough prior values to fill the window.

Moving averages smooth out noise in time series data, revealing underlying trends. A 3-day rolling mean shows the average of the last 3 days, which is less volatile than daily values. Longer windows (7, 30 days) create smoother trends.

# Expanding window (cumulative)
df['cumulative_mean'] = df['value'].expanding().mean()
df['cumulative_max'] = df['value'].expanding().max()

# Exponential weighted
df['ewm_mean'] = df['value'].ewm(span=3).mean()

print(df[['date', 'value', 'rolling_mean', 'cumulative_mean']])

Expanding windows include ALL previous values, growing as you move down. Cumulative mean at row 5 is the mean of rows 0-5. Cumulative max shows the highest value seen so far—useful for tracking "best performance to date" or setting benchmarks.

Exponential weighted moving average (ewm) gives more weight to recent values, making it more responsive to recent changes than simple rolling mean. The 'span' parameter controls how quickly the influence of older values decays.

Shift and Diff

import pandas as pd

df = pd.DataFrame({
    'date': pd.date_range('2024-01-01', periods=5),
    'price': [100, 102, 101, 105, 108]
})

# Shift values
df['previous_price'] = df['price'].shift(1)   # Previous row
df['next_price'] = df['price'].shift(-1)      # Next row

shift() moves values up or down by a specified number of rows. shift(1) moves values down by one row, so 'previous_price' contains yesterday's price for each day. The first row becomes NaN because there's no previous value.

shift(-1) moves values up, giving you the "next" value. This is useful for calculating "look-ahead" metrics like "next day's price" or for creating lag features in time series machine learning models.

# Calculate difference
df['price_change'] = df['price'].diff()        # Current - previous

# Percentage change
df['pct_change'] = df['price'].pct_change() * 100

print(df)

diff() calculates the difference between the current and previous value. It's shorthand for df['price'] - df['price'].shift(1). This shows absolute change—how much the price increased or decreased from yesterday.

pct_change() calculates the percentage change from the previous value: (current - previous) / previous. Multiply by 100 to express as a percentage. This is essential for financial analysis where you care about relative returns, not absolute price movements.

Cross Tabulation

import pandas as pd

df = pd.DataFrame({
    'gender': ['M', 'F', 'M', 'F', 'M', 'F'],
    'department': ['Sales', 'Sales', 'IT', 'IT', 'HR', 'HR'],
    'salary': [50000, 55000, 60000, 62000, 45000, 48000]
})

# Cross tabulation (count)
cross_tab = pd.crosstab(df['gender'], df['department'])
print(cross_tab)
# department  HR  IT  Sales
# gender                    
# F            1   1      1
# M            1   1      1

crosstab() creates a contingency table showing the frequency of combinations between two categorical variables. Here it counts how many employees of each gender work in each department. Each cell shows the count for that gender-department combination.

This is invaluable for exploring relationships between categorical variables. Are certain departments male-dominated? Do purchase patterns vary by region and product type? Cross tabulation answers these questions at a glance.
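crosstab can also show proportions instead of raw counts via its normalize parameter. A sketch on a slightly larger hypothetical sample:

```python
import pandas as pd

df = pd.DataFrame({
    'gender': ['M', 'F', 'M', 'F', 'M', 'M'],
    'department': ['Sales', 'Sales', 'IT', 'IT', 'IT', 'IT']
})

# normalize='index' makes each row sum to 1:
# the share of each department within a gender
shares = pd.crosstab(df['gender'], df['department'], normalize='index')
print(shares)
```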

# With aggregation
cross_tab = pd.crosstab(
    df['gender'], 
    df['department'],
    values=df['salary'],
    aggfunc='mean'
)
print(cross_tab)

Add 'values' and 'aggfunc' parameters to aggregate a numeric column instead of counting. This shows mean salary by gender and department—perfect for identifying pay disparities across demographic and organizational dimensions.

You can use any aggregation function: 'sum' for total sales by region and product, 'max' for best score by student and subject, or 'mean' for average ratings by movie genre and decade. It's a quick way to summarize data by two dimensions.

Working with Dates

import pandas as pd

df = pd.DataFrame({
    'date': pd.to_datetime(['2024-01-15', '2024-03-20', '2024-06-10', '2024-12-25']),
    'value': [100, 150, 200, 250]
})

# Extract date components
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day'] = df['date'].dt.day
df['weekday'] = df['date'].dt.day_name()
df['quarter'] = df['date'].dt.quarter
df['week'] = df['date'].dt.isocalendar().week

The dt accessor provides access to datetime components, similar to how str works for strings. You can extract year, month, day, weekday name, quarter, week number, and more. These extracted features are invaluable for time-based analysis and grouping.

Each extraction returns a Series you can store as a new column. This is common preprocessing for time series: you might analyze sales patterns by day of week, or group transactions by month. The original datetime column stays intact while you gain analytical dimensions.
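A sketch of the day-of-week grouping use case just mentioned, on hypothetical daily sales (2024-01-01 was a Monday, so two full weeks line up cleanly):

```python
import pandas as pd

df = pd.DataFrame({
    'date': pd.date_range('2024-01-01', periods=14),  # two full weeks
    'sales': [100, 120, 90, 150, 130, 80, 70,
              110, 125, 95, 145, 135, 85, 75]
})

df['weekday'] = df['date'].dt.day_name()

# Average sales per day of week, across both weeks
by_weekday = df.groupby('weekday')['sales'].mean()
print(by_weekday)
```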

# Date arithmetic
df['days_from_today'] = (pd.Timestamp.today() - df['date']).dt.days

# Filter by date range
mask = (df['date'] >= '2024-01-01') & (df['date'] <= '2024-06-30')
first_half = df[mask]

# Resample time series ('ME' = month-end; older pandas versions used 'M')
df_monthly = df.set_index('date').resample('ME')['value'].sum()

Date arithmetic works naturally—subtracting dates gives a Timedelta, and .dt.days extracts the number of days. Filtering by date range uses standard boolean indexing; Pandas automatically parses date strings like '2024-01-01' for comparison.

resample() is the time series equivalent of groupby(). Set the date as index, then resample at a frequency: 'ME' for month-end ('M' in older pandas), 'W' for weekly, 'D' for daily, 'QE' for quarter-end. This groups data into time buckets and lets you aggregate (sum, mean, count) within each period.

Practice: Advanced Operations

Task: Create a pivot table showing sales by region and product.

Show Solution
import pandas as pd

df = pd.DataFrame({
    'region': ['East', 'West', 'East', 'West'],
    'product': ['A', 'A', 'B', 'B'],
    'sales': [100, 150, 200, 250]
})
pivot = pd.pivot_table(df, values='sales', index='region', columns='product')
print(pivot)
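pivot_table averages by default. You can change the aggregation with aggfunc and fill empty region/product combinations with fill_value; a sketch with made-up data where one combination is missing:

```python
import pandas as pd

df = pd.DataFrame({
    'region': ['East', 'West', 'East', 'East'],
    'product': ['A', 'A', 'B', 'A'],
    'sales': [100, 150, 200, 50]
})

# Sum instead of the default mean; West/B has no rows, so fill_value puts 0 there
pivot = pd.pivot_table(df, values='sales', index='region',
                       columns='product', aggfunc='sum', fill_value=0)
print(pivot)
```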

Task: Calculate a 3-day rolling average for stock prices.

Show Solution
import pandas as pd

df = pd.DataFrame({
    'date': pd.date_range('2024-01-01', periods=7),
    'price': [100, 102, 98, 105, 103, 108, 110]
})
df['rolling_avg'] = df['price'].rolling(window=3).mean()
print(df)

Task: Use apply() to create a 'grade' column based on score ranges.

Show Solution
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Carol'],
    'score': [95, 82, 68]
})

def assign_grade(score):
    if score >= 90: return 'A'
    elif score >= 80: return 'B'
    elif score >= 70: return 'C'
    else: return 'D'

df['grade'] = df['score'].apply(assign_grade)
print(df)
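apply() calls a Python function once per element. For range-based binning like grading, pd.cut is a vectorized alternative; a sketch using the same scores (the upper bin edge is 101 so a score of 100 still lands in 'A'):

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Carol'],
    'score': [95, 82, 68]
})

# right=False makes bins left-inclusive: [0,70)->D, [70,80)->C, [80,90)->B, [90,101)->A
df['grade'] = pd.cut(df['score'], bins=[0, 70, 80, 90, 101],
                     labels=['D', 'C', 'B', 'A'], right=False)
print(df)
```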

Task: Transform wide data to long format and calculate average score per subject.

Show Solution
import pandas as pd

df_wide = pd.DataFrame({
    'student': ['Alice', 'Bob', 'Carol'],
    'math': [90, 85, 92],
    'science': [88, 90, 85],
    'english': [95, 78, 88]
})

df_long = pd.melt(
    df_wide, 
    id_vars=['student'],
    var_name='subject',
    value_name='score'
)
avg_by_subject = df_long.groupby('subject')['score'].mean()
print(avg_by_subject)
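melt has an inverse: pivot reshapes long data back to wide. A sketch with a small long-format table built inline so the example stands alone:

```python
import pandas as pd

df_long = pd.DataFrame({
    'student': ['Alice', 'Alice', 'Bob', 'Bob'],
    'subject': ['math', 'science', 'math', 'science'],
    'score': [90, 88, 85, 90]
})

# Long back to wide: one row per student, one column per subject
df_wide = df_long.pivot(index='student', columns='subject', values='score')
print(df_wide)
```

pivot requires unique (index, columns) pairs; if there are duplicates, use pivot_table with an aggfunc instead.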

Task: Calculate 3-day rolling average and rolling sum for daily sales data.

Show Solution
import pandas as pd

df = pd.DataFrame({
    'date': pd.date_range('2024-01-01', periods=10),
    'sales': [100, 120, 90, 150, 130, 110, 140, 160, 125, 135]
})

# Calculate 3-day rolling average
df['rolling_avg_3d'] = df['sales'].rolling(window=3).mean()

# Calculate rolling sum
df['rolling_sum_3d'] = df['sales'].rolling(window=3).sum()

print(df)
09

Interactive Demo

Explore common Pandas operations with this interactive reference. Select an operation to see its syntax and example output.

Pandas Operation Explorer
# Select an operation to see the example
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Carol'],
    'age': [25, 30, 35],
    'city': ['NYC', 'LA', 'Chicago']
})
print(df)

Best Practices for Pandas

Do This
  • Use vectorized operations over loops
  • Use loc/iloc for explicit selection
  • Chain operations when possible
  • Use appropriate dtypes for memory
  • Check for duplicates and nulls early
  • Use query() for readable filters
Avoid This
  • Iterating with for loops over rows
  • Chained indexing (df['a']['b'])
  • Modifying sliced DataFrames
  • Using object dtype for numerics
  • Ignoring SettingWithCopyWarning
  • Using apply() when vectorized works

Common Mistakes and Solutions

Mistake Problem Solution
df['a']['b'] = value Chained assignment may not work df.loc[:, 'a'] = value
for i, row in df.iterrows() Very slow for large DataFrames Use vectorized operations or apply()
df[df['a'] > 5 and df['b'] < 10] Python and/or does not work df[(df['a'] > 5) & (df['b'] < 10)]
df.loc[0:5] expecting 5 rows loc is inclusive, gets 6 rows Use df.iloc[0:5] for 5 rows
pd.concat(df1, df2) concat expects a list pd.concat([df1, df2])
Modifying original after slice SettingWithCopyWarning Use .copy() when creating subset
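Several of these mistakes share one pattern: combine conditions with & inside parentheses, and take an explicit .copy() before modifying a subset. A minimal sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 5, 8], 'b': [2, 12, 3]})

# Combine conditions with & and parentheses, not Python's `and`
subset = df[(df['a'] > 2) & (df['b'] < 10)].copy()

# Safe to modify: subset is an independent DataFrame, not a view of df
subset['c'] = subset['a'] + subset['b']
print(subset)
```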
10

Quick Reference

A comprehensive reference of the most commonly used Pandas functions and methods. Keep this handy while coding.

DataFrame Creation

Function Description Example
pd.DataFrame(dict) Create from dictionary pd.DataFrame({'a': [1,2], 'b': [3,4]})
pd.read_csv() Read CSV file pd.read_csv('data.csv')
pd.read_excel() Read Excel file pd.read_excel('data.xlsx')
pd.read_json() Read JSON file/string pd.read_json('data.json')
pd.read_sql() Read SQL query pd.read_sql(query, conn)

Selection and Indexing

Operation Syntax Returns
Single column df['col'] Series
Multiple columns df[['col1', 'col2']] DataFrame
Rows by label df.loc[label] Series/DataFrame
Rows by position df.iloc[pos] Series/DataFrame
Specific cell df.at[row, col] Scalar
Boolean filter df[df['col'] > val] DataFrame
Query filter df.query('col > val') DataFrame
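To make the loc/iloc distinction in this table concrete, a sketch using a non-default string index:

```python
import pandas as pd

df = pd.DataFrame({'price': [10, 20, 30]}, index=['x', 'y', 'z'])

print(df.loc['y'])              # select by label
print(df.iloc[1])               # select by position (same row here)
print(df.loc['x':'y'])          # label slicing is inclusive: 2 rows
print(df.iloc[0:1])             # position slicing is exclusive: 1 row
print(df.query('price > 15'))   # readable boolean filter
```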

Data Manipulation

Method Description Example
sort_values() Sort by column(s) df.sort_values('col', ascending=False)
drop() Remove rows/columns df.drop(columns=['col'])
rename() Rename columns df.rename(columns={'old': 'new'})
dropna() Remove missing values df.dropna(subset=['col'])
fillna() Fill missing values df.fillna(0)
drop_duplicates() Remove duplicate rows df.drop_duplicates()
astype() Convert data type df['col'].astype(int)
replace() Replace values df['col'].replace({'a': 'b'})

Aggregation and Grouping

Method Description Example
groupby() Group by column(s) df.groupby('cat')['val'].sum()
agg() Multiple aggregations df.groupby('cat').agg(['sum', 'mean'])
transform() Group transform df.groupby('cat')['val'].transform('mean')
pivot_table() Create pivot table pd.pivot_table(df, values='v', index='r', columns='c')
crosstab() Cross tabulation pd.crosstab(df['a'], df['b'])
value_counts() Count unique values df['col'].value_counts()
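transform deserves a concrete example: unlike agg, it returns a result aligned to the original rows, so a group statistic can become a new column. A sketch:

```python
import pandas as pd

df = pd.DataFrame({
    'cat': ['a', 'a', 'b', 'b'],
    'val': [10, 20, 30, 50]
})

# Per-row group mean: same length as df, repeated within each group
df['group_mean'] = df.groupby('cat')['val'].transform('mean')

# How far each row sits from its own group's mean
df['deviation'] = df['val'] - df['group_mean']
print(df)
```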

Combining DataFrames

Function Description Example
pd.merge() SQL-style join pd.merge(df1, df2, on='key', how='left')
pd.concat() Stack DataFrames pd.concat([df1, df2], ignore_index=True)
df.join() Join on index df1.join(df2, how='outer')
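A sketch contrasting merge (SQL-style join on a key column) with concat (stacking rows), using made-up order data:

```python
import pandas as pd

orders = pd.DataFrame({'order_id': [1, 2, 3], 'customer': ['A', 'B', 'A']})
customers = pd.DataFrame({'customer': ['A', 'B'], 'city': ['NYC', 'LA']})

# Left join: every order keeps its row, city is looked up by customer
merged = pd.merge(orders, customers, on='customer', how='left')
print(merged)

# concat stacks rows; ignore_index renumbers the result 0..n-1
stacked = pd.concat([orders, orders], ignore_index=True)
print(stacked)
```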

Ready-to-Use Templates

import pandas as pd

# Load data
df = pd.read_csv('data.csv')

# Quick exploration
print(f"Shape: {df.shape}")
print(f"Columns: {list(df.columns)}")
print(f"\nData types:\n{df.dtypes}")
print(f"\nMissing values:\n{df.isna().sum()}")
print(f"\nBasic stats:\n{df.describe()}")
print(f"\nFirst 5 rows:\n{df.head()}")

import pandas as pd

def clean_dataframe(df):
    # Make a copy
    df = df.copy()
    
    # Standardize column names
    df.columns = df.columns.str.lower().str.replace(' ', '_')
    
    # Remove duplicates
    df = df.drop_duplicates()
    
    # Handle missing values
    for col in df.select_dtypes(include=['float64', 'int64']).columns:
        df[col] = df[col].fillna(df[col].median())
    
    for col in df.select_dtypes(include=['object']).columns:
        df[col] = df[col].fillna('Unknown')
    
    return df

df_clean = clean_dataframe(df)

import pandas as pd

# Comprehensive groupby analysis
summary = df.groupby('category').agg(
    count=('value', 'count'),
    total=('value', 'sum'),
    average=('value', 'mean'),
    median=('value', 'median'),
    std_dev=('value', 'std'),
    minimum=('value', 'min'),
    maximum=('value', 'max')
).round(2)

# Sort by total descending
summary = summary.sort_values('total', ascending=False)

# Add percentage column
summary['pct_of_total'] = (summary['total'] / summary['total'].sum() * 100).round(1)

print(summary)
11

Real-World Examples

Let's apply everything we've learned to solve realistic data analysis problems that you'll encounter in actual projects.

Example 1: E-Commerce Sales Analysis

sales_analysis.py
import pandas as pd
import numpy as np

# Create sample e-commerce data
np.random.seed(42)
dates = pd.date_range('2024-01-01', periods=365, freq='D')
products = ['Laptop', 'Phone', 'Tablet', 'Headphones', 'Keyboard']
regions = ['North', 'South', 'East', 'West']

sales_data = pd.DataFrame({
    'date': np.random.choice(dates, 1000),
    'product': np.random.choice(products, 1000),
    'region': np.random.choice(regions, 1000),
    'quantity': np.random.randint(1, 10, 1000),
    'unit_price': np.random.choice([999, 599, 399, 99, 49], 1000),
    'customer_id': np.random.randint(1000, 9999, 1000)
})
sales_data['revenue'] = sales_data['quantity'] * sales_data['unit_price']

# 1. Monthly Revenue Trend
sales_data['month'] = sales_data['date'].dt.to_period('M')
monthly_revenue = sales_data.groupby('month')['revenue'].sum().reset_index()
monthly_revenue['month'] = monthly_revenue['month'].astype(str)
print("Monthly Revenue Trend:")
print(monthly_revenue.head(6))

# 2. Top Products by Revenue
top_products = sales_data.groupby('product').agg({
    'revenue': 'sum',
    'quantity': 'sum',
    'customer_id': 'nunique'  # Unique customers
}).sort_values('revenue', ascending=False)
top_products.columns = ['Total Revenue', 'Units Sold', 'Unique Customers']
print("\nTop Products by Revenue:")
print(top_products)

# 3. Regional Performance
regional_stats = sales_data.groupby('region').agg({
    'revenue': ['sum', 'mean'],
    'quantity': 'sum',
    'customer_id': 'nunique'
}).round(2)
regional_stats.columns = ['Total Revenue', 'Avg Order Value', 'Units Sold', 'Customers']
print("\nRegional Performance:")
print(regional_stats)

# 4. Best Selling Day of Week
sales_data['day_of_week'] = sales_data['date'].dt.day_name()
day_sales = sales_data.groupby('day_of_week')['revenue'].sum()
day_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
day_sales = day_sales.reindex(day_order)
print("\nSales by Day of Week:")
print(day_sales)

# 5. Customer Segmentation by Purchase Value
customer_totals = sales_data.groupby('customer_id')['revenue'].sum().reset_index()
customer_totals['segment'] = pd.cut(
    customer_totals['revenue'],
    bins=[0, 500, 2000, np.inf],
    labels=['Low', 'Medium', 'High']
)
segment_counts = customer_totals['segment'].value_counts()
print("\nCustomer Segments:")
print(segment_counts)

Example 2: Student Grades Analysis

student_grades.py
import pandas as pd
import numpy as np

# Create sample student data
np.random.seed(42)
students = pd.DataFrame({
    'student_id': range(1, 51),
    'name': [f'Student_{i}' for i in range(1, 51)],
    'class': np.random.choice(['10A', '10B', '10C'], 50),
    'math': np.random.randint(40, 100, 50),
    'science': np.random.randint(45, 100, 50),
    'english': np.random.randint(50, 100, 50),
    'history': np.random.randint(35, 100, 50),
    'attendance_pct': np.random.uniform(70, 100, 50).round(1)
})

# 1. Calculate Total and Average Score
students['total'] = students[['math', 'science', 'english', 'history']].sum(axis=1)
students['average'] = students[['math', 'science', 'english', 'history']].mean(axis=1).round(2)

# 2. Assign Grades
def assign_grade(avg):
    if avg >= 90: return 'A+'
    elif avg >= 80: return 'A'
    elif avg >= 70: return 'B'
    elif avg >= 60: return 'C'
    elif avg >= 50: return 'D'
    else: return 'F'

students['grade'] = students['average'].apply(assign_grade)

# 3. Class Performance Summary
class_summary = students.groupby('class').agg({
    'average': ['mean', 'std', 'min', 'max'],
    'attendance_pct': 'mean',
    'student_id': 'count'
}).round(2)
class_summary.columns = ['Avg Score', 'Std Dev', 'Min', 'Max', 'Avg Attendance', 'Students']
print("Class Performance Summary:")
print(class_summary)

# 4. Grade Distribution
grade_dist = students['grade'].value_counts().sort_index()
print("\nGrade Distribution:")
print(grade_dist)

# 5. Students Needing Attention (Low Attendance OR Failing)
at_risk = students[
    (students['attendance_pct'] < 80) | (students['average'] < 60)
]
print(f"\nStudents At Risk: {len(at_risk)}")
print(at_risk[['name', 'class', 'average', 'grade', 'attendance_pct']])

# 6. Subject-wise Analysis
subject_stats = students[['math', 'science', 'english', 'history']].describe()
print("\nSubject Statistics:")
print(subject_stats.loc[['mean', 'std', 'min', 'max']].round(2))

# 7. Correlation Analysis
correlation = students[['math', 'science', 'english', 'history', 'attendance_pct']].corr()
print("\nCorrelation Matrix:")
print(correlation.round(2))

Example 3: Web Server Log Analysis

log_analysis.py
import pandas as pd
import numpy as np

# Create sample web server log data
np.random.seed(42)
n_requests = 5000

logs = pd.DataFrame({
    'timestamp': pd.date_range('2024-01-01', periods=n_requests, freq='5min'),
    'ip_address': [f'192.168.1.{np.random.randint(1, 255)}' for _ in range(n_requests)],
    'method': np.random.choice(['GET', 'POST', 'PUT', 'DELETE'], n_requests, p=[0.7, 0.2, 0.05, 0.05]),
    'endpoint': np.random.choice(['/api/users', '/api/products', '/api/orders', '/login', '/home', '/search'], n_requests),
    'status_code': np.random.choice([200, 201, 400, 401, 403, 404, 500], n_requests, p=[0.6, 0.1, 0.1, 0.05, 0.05, 0.05, 0.05]),
    'response_time_ms': np.random.exponential(100, n_requests).round(2)
})

# 1. Request Volume by Hour
logs['hour'] = logs['timestamp'].dt.hour
hourly_requests = logs.groupby('hour').size()
print("Requests by Hour:")
print(hourly_requests)

# 2. Status Code Distribution
status_dist = logs['status_code'].value_counts().sort_index()
print("\nStatus Code Distribution:")
print(status_dist)

# 3. Error Rate Analysis
logs['is_error'] = logs['status_code'] >= 400
error_rate = logs.groupby('endpoint')['is_error'].mean().sort_values(ascending=False)
print("\nError Rate by Endpoint:")
print((error_rate * 100).round(2).astype(str) + '%')

# 4. Response Time Analysis
response_stats = logs.groupby('endpoint')['response_time_ms'].agg(['mean', 'median', 'max', 'std']).round(2)
response_stats = response_stats.sort_values('mean', ascending=False)
print("\nResponse Time Statistics (ms):")
print(response_stats)

# 5. Identify Slow Requests (>500ms)
slow_requests = logs[logs['response_time_ms'] > 500]
print(f"\nSlow Requests (>500ms): {len(slow_requests)} ({len(slow_requests)/len(logs)*100:.2f}%)")

# 6. Unique Visitors
unique_ips = logs['ip_address'].nunique()
print(f"\nUnique Visitors: {unique_ips}")

# 7. Most Active IPs (potential bots/attackers)
active_ips = logs['ip_address'].value_counts().head(10)
print("\nMost Active IPs:")
print(active_ips)

# 8. Traffic Pattern by Day
logs['day'] = logs['timestamp'].dt.day_name()
day_traffic = logs.groupby('day').agg({
    'ip_address': 'count',
    'response_time_ms': 'mean'
}).round(2)
day_traffic.columns = ['Requests', 'Avg Response Time']
print("\nDaily Traffic Pattern:")
print(day_traffic)

Example 4: Stock Portfolio Analysis

portfolio_analysis.py
import pandas as pd
import numpy as np

# Create sample stock data
np.random.seed(42)
dates = pd.date_range('2024-01-01', periods=252, freq='B')  # Business days
stocks = ['AAPL', 'GOOGL', 'MSFT', 'AMZN', 'TSLA']

# Generate random price movements (simplified)
prices = pd.DataFrame(index=dates)
base_prices = {'AAPL': 180, 'GOOGL': 140, 'MSFT': 375, 'AMZN': 150, 'TSLA': 240}

for stock in stocks:
    returns = np.random.normal(0.0005, 0.02, len(dates))  # Daily returns
    prices[stock] = base_prices[stock] * (1 + returns).cumprod()

prices = prices.round(2)

# Portfolio: Number of shares owned
portfolio = pd.Series({'AAPL': 50, 'GOOGL': 30, 'MSFT': 20, 'AMZN': 40, 'TSLA': 25})

# 1. Calculate Daily Returns
returns = prices.pct_change().dropna()
print("Daily Returns (last 5 days):")
print((returns.tail() * 100).round(2).astype(str) + '%')

# 2. Summary Statistics
return_stats = returns.describe().T
return_stats['Annual Return'] = (returns.mean() * 252 * 100).round(2)
return_stats['Annual Volatility'] = (returns.std() * np.sqrt(252) * 100).round(2)
print("\nReturn Statistics:")
print(return_stats[['mean', 'std', 'min', 'max', 'Annual Return', 'Annual Volatility']])

# 3. Correlation Matrix
correlation = returns.corr().round(2)
print("\nStock Correlation:")
print(correlation)

# 4. Portfolio Value Over Time
portfolio_value = prices.mul(portfolio)
portfolio_value['Total'] = portfolio_value.sum(axis=1)
print("\nPortfolio Value (last 5 days):")
print(portfolio_value[['Total']].tail())

# 5. Portfolio Allocation
latest_prices = prices.iloc[-1]
holdings_value = latest_prices * portfolio
allocation = pd.DataFrame({
    'value': holdings_value.round(2),
    'allocation_pct': (holdings_value / holdings_value.sum() * 100).round(2)
})
print("\nCurrent Portfolio Allocation:")
print(allocation)

# 6. Best and Worst Days
portfolio_returns = portfolio_value['Total'].pct_change().dropna()
best_day = portfolio_returns.idxmax()
worst_day = portfolio_returns.idxmin()
print(f"\nBest Day: {best_day.date()} ({portfolio_returns[best_day]*100:.2f}%)")
print(f"Worst Day: {worst_day.date()} ({portfolio_returns[worst_day]*100:.2f}%)")

# 7. Rolling Statistics
rolling_mean = portfolio_value['Total'].rolling(window=20).mean()
rolling_std = portfolio_value['Total'].rolling(window=20).std()
print("\nRolling 20-day Statistics (last 5 days):")
print(pd.DataFrame({'Mean': rolling_mean, 'Std': rolling_std}).tail())

Data Analysis Best Practices

DO
  • Always explore data first with head(), info(), describe()
  • Check for missing values before analysis
  • Validate data types and convert if needed
  • Use meaningful column names
  • Document your analysis steps
  • Handle outliers appropriately
  • Create intermediate DataFrames for complex operations
  • Use .copy() when modifying data
DON'T
  • Don't modify original data without a copy
  • Don't ignore warnings about SettingWithCopyWarning
  • Don't assume data is clean without checking
  • Don't use loops when vectorized operations exist
  • Don't load entire large files without chunking
  • Don't forget to reset index after filtering
  • Don't use inplace=True in chained operations
  • Don't hardcode column indices
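The chunking advice above can be sketched like this; a tiny CSV is written first so the example is self-contained, but the same loop works unchanged on files too large to fit in memory:

```python
import pandas as pd

# Create a small CSV to stand in for a large file (illustrative only)
pd.DataFrame({'value': range(1, 11)}).to_csv('demo.csv', index=False)

# Stream the file in fixed-size chunks instead of loading it all at once
total = 0
for chunk in pd.read_csv('demo.csv', chunksize=3):
    total += chunk['value'].sum()
print(total)  # 55
```

Each chunk is an ordinary DataFrame, so any per-chunk aggregation or filtering works inside the loop.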

Key Takeaways

DataFrame = Table

DataFrames are 2D labeled data structures with rows and columns.

Series = Column

A Series is a single column. A DataFrame is a collection of Series.

loc vs iloc

loc uses labels, iloc uses positions. Know when to use each.

Boolean Filtering

Filter with df[df['col'] > value] or df.query() for cleaner syntax.

GroupBy is Powerful

Split-apply-combine pattern for aggregations and transformations.

Easy File I/O

read_csv() and to_csv() handle most data loading needs.

Knowledge Check

Quick Quiz

Test what you've learned about Pandas DataFrames and data manipulation

1 What is the difference between Series and DataFrame?
2 How do you select rows where age > 25?
3 What is the difference between loc and iloc?
4 What does df.groupby('category')['value'].sum() do?
5 How do you read a CSV file into a DataFrame?
6 What does df.describe() return?
Answer all questions to check your score