Introduction to Pandas
Pandas is Python's most powerful library for data manipulation and analysis. Built on top of NumPy, it provides two fundamental data structures, Series and DataFrame, that make working with structured data intuitive and efficient.
What is Pandas?
Pandas (short for "Panel Data") was created in 2008 by Wes McKinney while working at AQR Capital Management. What started as a tool for financial data analysis has become the de facto standard for data manipulation in Python. Today, Pandas is used by data scientists, analysts, and engineers across industries, from finance and healthcare to e-commerce and social media, to clean, transform, and analyze datasets ranging from a few rows to millions of records.
Pandas Data Structures
Pandas provides two core data structures: Series (1-dimensional labeled array) and DataFrame (2-dimensional labeled table). Think of a Series as a single column in Excel, and a DataFrame as the entire spreadsheet. Both structures have labels (called an index) that allow you to access data by name rather than just position.
Why it matters: Labeled data structures make your code self-documenting and less error-prone.
Instead of remembering "column 3 is salary," you simply write df['Salary'].
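A minimal sketch of that labeled access (the names and values here are invented for illustration):

```python
import pandas as pd

# A Series is one labeled column of values
salaries = pd.Series([50000, 62000], index=['alice', 'bob'])

# A DataFrame is several Series sharing one index
df = pd.DataFrame({'Salary': [50000, 62000], 'Age': [30, 25]},
                  index=['alice', 'bob'])

# Access by name, not just by position
print(salaries['alice'])          # 50000
print(df.loc['bob', 'Salary'])    # 62000
```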
Why Use Pandas?
Intuitive Syntax
Pandas code reads like plain English. Operations like "filter rows where salary > 50000"
become df[df['Salary'] > 50000]-simple and readable.
Flexible I/O
Read and write data from CSV, Excel, JSON, SQL databases, HTML tables, and more, all
with one-line commands like pd.read_csv().
Data Cleaning
Built-in methods for handling missing values, removing duplicates, converting data types, and more make data cleaning far less painful.
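A small taste of the first and third points, using an invented dataset:

```python
import pandas as pd

# Tiny made-up dataset with one duplicated row
employees = pd.DataFrame({
    'Name': ['Ann', 'Ben', 'Ben', 'Cal'],
    'Salary': [52000, 48000, 48000, 61000],
})

# Intuitive syntax: "filter rows where salary > 50000"
high = employees[employees['Salary'] > 50000]
print(high['Name'].tolist())              # ['Ann', 'Cal']

# Built-in cleaning: drop the duplicated Ben row in one call
print(len(employees.drop_duplicates()))   # 3
```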
Installing Pandas
Before we dive into Pandas features, let's make sure you have it installed. Pandas is typically included in data science distributions like Anaconda, but if you need to install it separately, you can use pip or conda. The installation is straightforward and only takes a few seconds.
# Using pip (works with any Python installation)
pip install pandas
# Using conda (recommended for Anaconda users)
conda install pandas
# Install a specific version
pip install pandas==2.1.0
Output:
Successfully installed pandas-2.1.0 numpy-1.24.0 python-dateutil-2.8.2 pytz-2023.3
Notice that Pandas automatically installs NumPy (its foundation), python-dateutil (for date handling), and pytz (for timezone support). These dependencies are part of what makes Pandas so powerful: it leverages battle-tested libraries under the hood.
Importing Pandas
The data science community has established a universal convention: always import pandas as pd.
This short alias keeps your code concise while remaining instantly recognizable to any data scientist reading it.
You'll see this pattern in documentation, tutorials, and production code around the world.
import pandas as pd
import numpy as np # Often used together
# Check your Pandas version
print(f"Pandas version: {pd.__version__}")
# Show detailed versions of Pandas and its dependencies
pd.show_versions()
Output:
Pandas version: 2.1.0
INSTALLED VERSIONS
------------------
commit : 2691c6c90b
python : 3.11.5
pandas : 2.1.0
numpy : 1.24.0
pytz : 2023.3
dateutil : 2.8.2
pd.show_versions() is incredibly useful when debugging
issues or asking for help online. It shows exactly which versions of Pandas and its dependencies you're
using, making it easier to reproduce and solve problems.
Pandas Series
A Series is Pandas' one-dimensional labeled array capable of holding any data type. Think of it as a supercharged Python list with an index that gives each element a label, making data access more intuitive and powerful.
Creating a Series from a List
The simplest way to create a Series is from a Python list. When you don't specify an index, Pandas automatically creates a numeric index starting from 0, just like a list. However, unlike lists, you get access to powerful vectorized operations and statistical methods right out of the box. Let's create a Series representing daily temperatures for a week.
import pandas as pd
import numpy as np
# Create a Series from a list
temperatures = pd.Series([72, 75, 68, 70, 73, 71, 69])
print(temperatures)
print(f"\nData type: {temperatures.dtype}")
print(f"Shape: {temperatures.shape}")
Output:
0 72
1 75
2 68
3 70
4 73
5 71
6 69
dtype: int64
Data type: int64
Shape: (7,)
Notice the output format: each line shows the index (0-6) on the left and the value on the right.
The dtype: int64 tells us all values are 64-bit integers. The shape (7,)
indicates this is a 1-dimensional structure with 7 elements. This automatic index is useful, but
Pandas really shines when we use custom labels.
Series with Custom Index
Custom indices transform your Series from a simple array into a powerful data structure where each value has meaningful context. Instead of referring to "the value at position 2," you can refer to "Monday's temperature." This makes your code self-documenting and dramatically reduces the chance of errors. Here's how to create a Series with day names as labels.
# Series with custom index (day names)
temperatures = pd.Series(
[72, 75, 68, 70, 73, 71, 69],
index=['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
)
print(temperatures)
# Access by label (much more readable!)
print(f"\nWednesday's temperature: {temperatures['Wed']}°F")
print(f"\nWeekend temperatures:\n{temperatures[['Sat', 'Sun']]}")
Output:
Mon 72
Tue 75
Wed 68
Thu 70
Fri 73
Sat 71
Sun 69
dtype: int64
Wednesday's temperature: 68°F
Weekend temperatures:
Sat 71
Sun 69
dtype: int64
See how much more readable that is? temperatures['Wed'] clearly refers to Wednesday's temperature,
compared to temperatures[2] which requires you to remember which day is at position 2. You can also
select multiple values by passing a list of labels, and Pandas returns a new Series with just those elements.
Creating Series from a Dictionary
Python dictionaries naturally map to Pandas Series: dictionary keys become the index, and values become the data. This is perfect for key-value data like city populations, product prices, or student grades. The dictionary's inherent structure, where each key uniquely identifies a value, aligns perfectly with the Series' labeled data model. Let's create a Series of major city populations.
# Dictionary keys automatically become the index
city_populations = pd.Series({
'Tokyo': 37_400_000,
'Delhi': 31_900_000,
'Shanghai': 27_100_000,
'São Paulo': 22_200_000,
'Mumbai': 20_700_000
})
print(city_populations)
print(f"\nTokyo population: {city_populations['Tokyo']:,}")
# Get the 3 largest cities
print(f"\nTop 3 cities:\n{city_populations.nlargest(3)}")
Output:
Tokyo 37400000
Delhi 31900000
Shanghai 27100000
São Paulo 22200000
Mumbai 20700000
dtype: int64
Tokyo population: 37,400,000
Top 3 cities:
Tokyo 37400000
Delhi 31900000
Shanghai 27100000
dtype: int64
The nlargest() method is one of many convenient Series methods. It returns the n largest values
while maintaining the index labels, which is much cleaner than manually sorting and slicing. This is a common pattern
in Pandas: powerful operations become single method calls.
Series Index
The index is what makes a Series more than just a NumPy array. It's an immutable array that labels each element, enabling powerful alignment and selection operations. Indices can be integers, strings, dates, or any hashable type.
Why it matters: When you perform operations between Series, Pandas automatically aligns data by index labels, not position. This prevents common bugs when dealing with misaligned data.
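A quick sketch of that alignment behavior (the fruit counts are invented):

```python
import pandas as pd

# Two Series with only partially overlapping indices
q1_sales = pd.Series({'apples': 10, 'pears': 5})
q2_sales = pd.Series({'pears': 7, 'plums': 3})

# Addition aligns by label, not by position
total = q1_sales + q2_sales
print(total)
# apples     NaN   (only in q1_sales)
# pears     12.0   (5 + 7, matched by label)
# plums      NaN   (only in q2_sales)
```

Labels with no counterpart produce NaN rather than silently pairing up mismatched positions, which surfaces misaligned data instead of hiding it.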
Series Operations & Methods
Series support all of NumPy's vectorized operations plus dozens of specialized methods for data analysis. You can perform arithmetic, statistical calculations, and comparisons across entire Series without writing loops. This vectorization is what makes Pandas fast: operations are implemented in optimized C code and can process millions of elements in milliseconds. Let's explore common operations with a retail dataset.
# Product prices
prices = pd.Series({
'Laptop': 1200,
'Mouse': 25,
'Keyboard': 75,
'Monitor': 350,
'Headset': 80
})
# Arithmetic operations (vectorized)
discounted_prices = prices * 0.85 # 15% discount
tax_included = prices * 1.10 # Add 10% tax
print("Original prices:")
print(prices)
print("\nAfter 15% discount:")
print(discounted_prices.round(2))
# Statistical methods
print(f"\nAverage price: ${prices.mean():.2f}")
print(f"Most expensive: ${prices.max()} ({prices.idxmax()})")
print(f"Cheapest: ${prices.min()} ({prices.idxmin()})")
print(f"Price range: ${prices.max() - prices.min()}")
Output:
Original prices:
Laptop 1200
Mouse 25
Keyboard 75
Monitor 350
Headset 80
dtype: int64
After 15% discount:
Laptop 1020.0
Mouse 21.25
Keyboard 63.75
Monitor 297.5
Headset 68.0
dtype: float64
Average price: $346.00
Most expensive: $1200 (Laptop)
Cheapest: $25 (Mouse)
Price range: $1175
Notice how prices * 0.85 applied the discount to every item at once-no loop needed. The
idxmax() and idxmin() methods don't just return the maximum and minimum values;
they return the index labels (product names) where those values occur. This is incredibly useful for
identifying which item is the best or worst on a given metric.
Boolean Indexing
One of Pandas' most powerful features is boolean indexing-using logical conditions to filter data. When you compare a Series to a value, you get back a Series of True/False values. You can then use this "boolean mask" to select only the elements where the condition is True. This pattern appears everywhere in data analysis: "show me all products over $100," "find customers who spent more than $500," and so on.
# Create boolean mask (condition)
expensive_mask = prices > 100
print("Which products cost more than $100?")
print(expensive_mask)
# Use mask to filter data
expensive_products = prices[expensive_mask]
print("\nExpensive products:")
print(expensive_products)
# You can also do this in one line
affordable = prices[prices <= 100]
print("\nAffordable products:")
print(affordable)
# Multiple conditions (use & for AND, | for OR)
mid_range = prices[(prices >= 50) & (prices <= 500)]
print("\nMid-range products ($50-$500):")
print(mid_range)
Output:
Which products cost more than $100?
Laptop True
Mouse False
Keyboard False
Monitor True
Headset False
dtype: bool
Expensive products:
Laptop 1200
Monitor 350
dtype: int64
Affordable products:
Mouse 25
Keyboard 75
Headset 80
dtype: int64
Mid-range products ($50-$500):
Keyboard 75
Monitor 350
Headset 80
dtype: int64
The boolean mask shows True for items that meet the condition and False for those that don't. When you use
this mask to index the Series, only the True elements are selected. For multiple conditions, use &
(AND) and | (OR), never Python's and/or keywords, and always wrap each condition in parentheses:
(condition1) & (condition2), not condition1 & condition2. Python's operator precedence requires it.
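To see why the keyword forms fail, here's a small sketch (product names invented):

```python
import pandas as pd

prices = pd.Series({'Mouse': 25, 'Monitor': 350})

# Python's `and` asks for a single True/False, which a Series can't give
try:
    prices[(prices > 50) and (prices < 500)]
except ValueError as err:
    print("Error:", err)   # "The truth value of a Series is ambiguous..."

# The element-wise & operator works as intended
print(prices[(prices > 50) & (prices < 500)])   # Monitor    350
```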
Practice Questions: Pandas Series
Test your Series manipulation skills.
Task: Create a Series with values [85, 92, 78, 95] and index labels ['Math', 'Science', 'English', 'History'].
Expected output:
Math 85
Science 92
English 78
History 95
dtype: int64
Show Solution
import pandas as pd
grades = pd.Series([85, 92, 78, 95],
index=['Math', 'Science', 'English', 'History'])
print(grades)
Given:
temperatures = pd.Series([72, 68, 75, 82, 79, 65, 88])
Task: Select only temperatures greater than 75.
Expected output:
3    82
4    79
6    88
dtype: int64
Show Solution
import pandas as pd
temperatures = pd.Series([72, 68, 75, 82, 79, 65, 88])
hot_days = temperatures[temperatures > 75]
print(hot_days)
Given:
stock_prices = pd.Series([150, 155, 148, 162, 159],
index=['Mon', 'Tue', 'Wed', 'Thu', 'Fri'])
Task: Find the day with the highest stock price using idxmax().
Expected output: 'Thu'
Show Solution
import pandas as pd
stock_prices = pd.Series([150, 155, 148, 162, 159],
index=['Mon', 'Tue', 'Wed', 'Thu', 'Fri'])
highest_day = stock_prices.idxmax()
print(highest_day) # Thu
Given:
scores = pd.Series([88, 92, 75, 95, 87, 90])
Task: Calculate and print the mean, median, and standard deviation of the scores.
Hint: Use .mean(), .median(), and .std() methods
Show Solution
import pandas as pd
scores = pd.Series([88, 92, 75, 95, 87, 90])
print(f"Mean: {scores.mean()}")
print(f"Median: {scores.median()}")
print(f"Std Dev: {scores.std():.2f}")
Creating DataFrames
A DataFrame is Pandas' two-dimensional data structure: think of it as a spreadsheet or SQL table in Python. Each column is a Series, and all columns share the same index. This is where Pandas truly shines for real-world data analysis, as most datasets naturally fit this tabular format.
DataFrame Structure
The sample table below shows the parts of a DataFrame: the numeric index on the left, column names across the top, and the data cells in between.
| | Name | Age | Major | GPA |
|---|---|---|---|---|
| 0 | Alice | 20 | CS | 3.8 |
| 1 | Bob | 21 | Math | 3.5 |
| 2 | Charlie | 19 | CS | 3.9 |
DataFrame from Dictionary
The most common way to create a DataFrame is from a dictionary where keys become column names and values become the column data. Each key should map to a list, array, or Series of equal length-this ensures all columns have the same number of rows. This pattern is perfect when you're building datasets programmatically or transforming data from JSON APIs. Let's create a student grades dataset.
import pandas as pd
# Create DataFrame from dictionary
students = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
'Age': [20, 21, 19, 22, 20],
'Major': ['CS', 'Math', 'CS', 'Physics', 'Math'],
'GPA': [3.8, 3.5, 3.9, 3.7, 3.6]
})
print(students)
print(f"\nShape: {students.shape}")
print(f"Columns: {students.columns.tolist()}")
print(f"Data types:\n{students.dtypes}")
Output:
Name Age Major GPA
0 Alice 20 CS 3.8
1 Bob 21 Math 3.5
2 Charlie 19 CS 3.9
3 Diana 22 Physics 3.7
4 Eve 20 Math 3.6
Shape: (5, 4)
Columns: ['Name', 'Age', 'Major', 'GPA']
Data types:
Name object
Age int64
Major object
GPA float64
dtype: object
The DataFrame automatically assigns a numeric index (0-4) on the left. The shape shows we have
5 rows and 4 columns. Notice the data types: strings become 'object', integers are 'int64', and decimals are
'float64'. Pandas infers these types automatically, but you can override them if needed.
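For instance, here is a quick sketch of overriding inferred types with astype() (the columns are invented, and GPA arrives as strings, as often happens with messy CSVs):

```python
import pandas as pd

# GPA values arrive as strings, so Pandas infers 'object'
df = pd.DataFrame({'Age': [20, 21], 'GPA': ['3.8', '3.5']})
print(df.dtypes)   # Age: int64, GPA: object

# Override the inferred types with astype()
df = df.astype({'GPA': 'float64', 'Age': 'int32'})
print(df.dtypes)   # Age: int32, GPA: float64
```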
DataFrame from List of Dictionaries
Another intuitive format is a list of dictionaries, where each dictionary represents one row. This mirrors JSON data structures perfectly-if you're fetching data from a REST API that returns an array of objects, this is the natural conversion. Each dictionary's keys become columns, and if a key is missing from some dictionaries, Pandas fills those cells with NaN (Not a Number).
# Each dictionary is one row
products = pd.DataFrame([
{'Product': 'Laptop', 'Price': 1200, 'Stock': 15, 'Category': 'Electronics'},
{'Product': 'Mouse', 'Price': 25, 'Stock': 150, 'Category': 'Electronics'},
{'Product': 'Desk', 'Price': 350, 'Stock': 8, 'Category': 'Furniture'},
{'Product': 'Chair', 'Price': 200, 'Stock': 12, 'Category': 'Furniture'},
{'Product': 'Notebook', 'Price': 3, 'Stock': 500, 'Category': 'Stationery'}
])
print(products)
print(f"\nTotal inventory value: ${(products['Price'] * products['Stock']).sum():,}")
Output:
Product Price Stock Category
0 Laptop 1200 15 Electronics
1 Mouse 25 150 Electronics
2 Desk 350 8 Furniture
3 Chair 200 12 Furniture
4 Notebook 3 500 Stationery
Total inventory value: $28,450
Notice how we calculated total inventory value by multiplying two columns element-wise, then summing the result. This is vectorized arithmetic-the same multiplication happens to every row simultaneously. The ability to treat columns as unified data structures, not individual elements, is what makes Pandas so powerful for data analysis.
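The NaN-filling behavior for missing keys, mentioned above, is easy to demonstrate:

```python
import pandas as pd

rows = [
    {'Product': 'Laptop', 'Price': 1200, 'Stock': 15},
    {'Product': 'Desk', 'Price': 350},   # 'Stock' key omitted
]
df = pd.DataFrame(rows)
print(df)

# Desk's Stock is NaN, and the whole column becomes float64,
# because NaN cannot be stored in an integer column
print(df['Stock'].isna().sum())   # 1
```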
Dictionary of Lists
Best for column-oriented data. Each key is a column name, value is the column data. Example:
{'col1': [1,2,3], 'col2': [4,5,6]}
List of Dictionaries
Best for row-oriented data. Each dict is one row. Example:
[{'A': 1, 'B': 2}, {'A': 3, 'B': 4}]. Perfect for JSON data.
From Files
Most common in practice: pd.read_csv('file.csv'), pd.read_excel(),
pd.read_json(). Handles messy real-world data.
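The file route above can be tried end to end without any existing data file; here's a minimal round-trip sketch (the temp-file path and column names are arbitrary):

```python
import os
import tempfile
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})

# Write to a temporary CSV, then read it straight back
path = os.path.join(tempfile.mkdtemp(), 'demo.csv')
df.to_csv(path, index=False)    # index=False: don't write the row labels
restored = pd.read_csv(path)

print(restored.equals(df))      # True
```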
DataFrame from NumPy Array
If you're working with numerical computations and already have NumPy arrays, converting them to DataFrames adds labels and methods for data analysis. This is common in scientific computing and machine learning where data starts as matrices but needs to be analyzed as structured tables. You'll need to provide column names manually since arrays don't have that metadata.
import numpy as np
# 2D array: 4 rows, 3 columns (random values, so your output will differ)
data = np.random.randint(50, 100, size=(4, 3))
# Convert to DataFrame with column names
grades = pd.DataFrame(
data,
columns=['Math', 'Science', 'English'],
index=['Alice', 'Bob', 'Charlie', 'Diana']
)
print(grades)
print(f"\nClass averages:\n{grades.mean()}")
print(f"\nStudent averages:\n{grades.mean(axis=1)}")
Output:
Math Science English
Alice 87 65 92
Bob 73 89 68
Charlie 91 76 85
Diana 69 94 71
Class averages:
Math 80.0
Science 81.0
English 79.0
dtype: float64
Student averages:
Alice 81.333333
Bob 76.666667
Charlie 84.000000
Diana 78.000000
dtype: float64
Notice we provided both column names and row labels (index). The mean() method calculates
averages by column (down) by default, which gives us the average for each subject across all students.
Using axis=1 calculates across columns (sideways), giving us each student's average across subjects.
Viewing DataFrame Information
Before analyzing any dataset, you need to understand its structure. Pandas provides several methods to
quickly inspect DataFrames: head() and tail() show the first or last rows,
info() displays structure and memory usage, and describe() generates statistical
summaries. These are your first steps with any new dataset-get the shape, check data types, look for missing
values, and understand the value ranges.
# Load a larger dataset example
import pandas as pd
import numpy as np
# Create sample sales data (random values, so your output will differ)
sales = pd.DataFrame({
'Date': pd.date_range('2024-01-01', periods=100),
'Product': np.random.choice(['Laptop', 'Mouse', 'Keyboard'], 100),
'Quantity': np.random.randint(1, 20, 100),
'Price': np.random.uniform(10, 1000, 100).round(2),
'Region': np.random.choice(['North', 'South', 'East', 'West'], 100)
})
# Quick inspection
print("First 5 rows:")
print(sales.head())
print("\nLast 3 rows:")
print(sales.tail(3))
print("\n" + "="*50)
print("DATASET INFO:")
print("="*50)
sales.info()
print("\n" + "="*50)
print("STATISTICAL SUMMARY:")
print("="*50)
print(sales.describe())
Output:
First 5 rows:
Date Product Quantity Price Region
0 2024-01-01 Laptop 15 876.32 North
1 2024-01-02 Mouse 7 234.56 East
2 2024-01-03 Keyboard 12 123.45 South
3 2024-01-04 Laptop 3 945.67 West
4 2024-01-05 Mouse 18 89.12 North
Last 3 rows:
Date Product Quantity Price Region
97 2024-04-07 Laptop 9 567.89 South
98 2024-04-08 Keyboard 14 234.56 East
99 2024-04-09 Mouse 5 876.54 West
==================================================
DATASET INFO:
==================================================
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Date 100 non-null datetime64[ns]
1 Product 100 non-null object
2 Quantity 100 non-null int64
3 Price 100 non-null float64
4 Region 100 non-null object
dtypes: datetime64[ns](1), float64(1), int64(1), object(2)
memory usage: 4.0+ KB
==================================================
STATISTICAL SUMMARY:
==================================================
Quantity Price
count 100.000000 100.000000
mean 10.450000 504.562100
std 5.468293 285.679123
min 1.000000 10.230000
25% 6.000000 267.897500
50% 10.500000 501.345000
75% 15.000000 741.227500
max 19.000000 999.870000
The info() output shows we have 100 rows with no missing values (all columns show "100 non-null").
We can see the data types: Date is datetime, Product and Region are objects (strings), Quantity is integer,
and Price is float. The describe() method provides statistics only for numeric columns (mean,
standard deviation, quartiles, and so on). This is incredibly useful for spotting outliers or understanding data distributions.
Tip: Always run df.info() and
df.head() when you first load a dataset. These two commands give you a complete overview:
structure, size, data types, missing values, and sample data, which is everything you need before starting analysis.
Practice Questions: Creating DataFrames
Practice creating and inspecting DataFrames.
Task: Create a DataFrame with columns 'Name': ['Alice', 'Bob', 'Charlie'] and 'Age': [25, 30, 35].
Expected output:
Name Age
0 Alice 25
1 Bob 30
2 Charlie 35
Show Solution
import pandas as pd
people = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35]
})
print(people)
Given:
df = pd.DataFrame({
'Product': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H'],
'Sales': [100, 200, 150, 300, 250, 180, 220, 190]
})
Task: Display the first 3 rows and the last 2 rows.
Show Solution
import pandas as pd
df = pd.DataFrame({
'Product': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H'],
'Sales': [100, 200, 150, 300, 250, 180, 220, 190]
})
print("First 3 rows:")
print(df.head(3))
print("\nLast 2 rows:")
print(df.tail(2))
Given:
df = pd.DataFrame({
'Employee': ['John', 'Sarah', 'Mike', 'Lisa'],
'Department': ['IT', 'HR', 'IT', 'Sales'],
'Salary': [75000, 65000, 72000, 68000],
'Years': [5, 3, 7, 4]
})
Task: Print the shape, column names, and statistical summary of numeric columns.
Show Solution
import pandas as pd
df = pd.DataFrame({
'Employee': ['John', 'Sarah', 'Mike', 'Lisa'],
'Department': ['IT', 'HR', 'IT', 'Sales'],
'Salary': [75000, 65000, 72000, 68000],
'Years': [5, 3, 7, 4]
})
print("Shape:", df.shape)
print("\nColumns:", df.columns.tolist())
print("\nStatistics:")
print(df.describe())
Task: Create a DataFrame from this list of dictionaries representing products:
[
{'Product': 'Laptop', 'Price': 999},
{'Product': 'Mouse', 'Price': 29},
{'Product': 'Monitor', 'Price': 349}
]
Show Solution
import pandas as pd
products = [
{'Product': 'Laptop', 'Price': 999},
{'Product': 'Mouse', 'Price': 29},
{'Product': 'Monitor', 'Price': 349}
]
df = pd.DataFrame(products)
print(df)
Given:
students = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Carol', 'Dave'],
'Math': [85, 92, 78, 88],
'Science': [90, 88, 95, 82],
'English': [78, 85, 88, 90],
'Age': [20, 21, 19, 22]
})
Task: Select only the 'Name', 'Math', and 'Science' columns.
Show Solution
import pandas as pd
students = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Carol', 'Dave'],
'Math': [85, 92, 78, 88],
'Science': [90, 88, 95, 82],
'English': [78, 85, 88, 90],
'Age': [20, 21, 19, 22]
})
selected = students[['Name', 'Math', 'Science']]
print(selected)
Selecting Data
Data selection is the foundation of data analysis. Pandas offers multiple ways to extract specific rows,
columns, or cells from DataFrames. Mastering loc[] (label-based) and iloc[]
(position-based) selection is essential for effective data manipulation.
Selecting Columns
The simplest selection is extracting columns by name. Use bracket notation with a string for a single column (returns a Series) or a list of strings for multiple columns (returns a DataFrame). This is your starting point for any analysis: isolate the variables you need before performing calculations or aggregations. Column selection is also the foundation for creating new columns derived from existing ones.
import pandas as pd
import numpy as np
# Create sample employee dataset
employees = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
'Department': ['Engineering', 'Sales', 'Engineering', 'HR', 'Sales'],
'Salary': [95000, 65000, 87000, 72000, 68000],
'Years': [5, 3, 7, 4, 2],
'Bonus': [10000, 5000, 12000, 6000, 4000]
})
print("Full dataset:")
print(employees)
# Single column (returns Series)
print("\n--- Single Column (Series) ---")
names = employees['Name']
print(names)
print(f"Type: {type(names)}")
# Multiple columns (returns DataFrame)
print("\n--- Multiple Columns (DataFrame) ---")
compensation = employees[['Name', 'Salary', 'Bonus']]
print(compensation)
print(f"Type: {type(compensation)}")
Output:
Full dataset:
Name Department Salary Years Bonus
0 Alice Engineering 95000 5 10000
1 Bob Sales 65000 3 5000
2 Charlie Engineering 87000 7 12000
3 Diana HR 72000 4 6000
4 Eve Sales 68000 2 4000
--- Single Column (Series) ---
0 Alice
1 Bob
2 Charlie
3 Diana
4 Eve
Name: Name, dtype: object
Type: <class 'pandas.core.series.Series'>
--- Multiple Columns (DataFrame) ---
Name Salary Bonus
0 Alice 95000 10000
1 Bob 65000 5000
2 Charlie 87000 12000
3 Diana 72000 6000
4 Eve 68000 4000
Type: <class 'pandas.core.frame.DataFrame'>
Notice the critical difference: single bracket with a string returns a Series, but double bracket with a list returns a DataFrame. This distinction matters because many operations only work on DataFrames. When in doubt, use double brackets to keep your data as a DataFrame-you can always convert to a Series later if needed.
loc vs iloc
loc[] uses labels: df.loc['row_label', 'column_name']
iloc[] uses integer positions: df.iloc[0, 2] (row 0, column 2)
When to use each: Use loc when you know names (more readable, less brittle).
Use iloc when working with positions (loops, mathematical operations on structure).
loc is generally preferred because it's self-documenting.
Label-Based Selection: loc[]
The loc[] indexer selects data by labels-both row labels (index) and column names. The syntax
is df.loc[rows, columns] where both can be single labels, lists, or slices. This is the most
readable selection method because you're explicitly naming what you want. When you see
df.loc['Alice', 'Salary'], it's immediately clear you're getting Alice's salary.
# Set Name as index for clearer examples
emp = employees.set_index('Name')
print("Data with Name as index:")
print(emp)
# Single cell: specific row and column
print(f"\nAlice's salary: ${emp.loc['Alice', 'Salary']:,}")
# Single row: all columns
print("\nBob's full record:")
print(emp.loc['Bob'])
# Multiple rows, single column
print("\nSalaries for Alice and Charlie:")
print(emp.loc[['Alice', 'Charlie'], 'Salary'])
# Multiple rows and columns (returns DataFrame)
print("\nEngineering salaries and years:")
eng_data = emp.loc[['Alice', 'Charlie'], ['Salary', 'Years']]
print(eng_data)
# Slice notation (inclusive on both ends!)
print("\nRows from Bob to Diana:")
print(emp.loc['Bob':'Diana'])
Output:
Data with Name as index:
Department Salary Years Bonus
Name
Alice Engineering 95000 5 10000
Bob Sales 65000 3 5000
Charlie Engineering 87000 7 12000
Diana HR 72000 4 6000
Eve Sales 68000 2 4000
Alice's salary: $95,000
Bob's full record:
Department Sales
Salary 65000
Years 3
Bonus 5000
Name: Bob, dtype: object
Salaries for Alice and Charlie:
Name
Alice 95000
Charlie 87000
Name: Salary, dtype: int64
Engineering salaries and years:
Salary Years
Name
Alice 95000 5
Charlie 87000 7
Rows from Bob to Diana:
Department Salary Years Bonus
Name
Bob Sales 65000 3 5000
Charlie Engineering 87000 7 12000
Diana HR 72000 4 6000
A crucial detail: loc slicing is inclusive on both ends. When you write
loc['Bob':'Diana'], you get Bob, Charlie, AND Diana. This differs from Python's normal slicing
where the end is exclusive. This inclusive behavior makes sense for labels-if you ask for "Bob through Diana,"
you probably want Diana included.
Position-Based Selection: iloc[]
The iloc[] indexer selects by integer position, just like regular Python list indexing. Use this
when you need to select data by position rather than label-for example, "get the first 10 rows" or "extract
every other column." The syntax is df.iloc[row_positions, column_positions] where positions are
zero-based integers. Unlike loc, slicing with iloc is exclusive on the end.
# Reset to numeric index for clearer position-based examples
emp = employees.copy()
print("Original dataset:")
print(emp)
# Single cell by position
print(f"\nFirst row, third column: {emp.iloc[0, 2]}")
# First 3 rows
print("\nFirst 3 employees:")
print(emp.iloc[:3])
# Last 2 rows, last 2 columns
print("\nLast 2 rows, last 2 columns:")
print(emp.iloc[-2:, -2:])
# Specific rows and columns by position
print("\nRows 1 and 3, columns 0, 2, 4:")
print(emp.iloc[[1, 3], [0, 2, 4]])
# Every other row
print("\nEvery other employee:")
print(emp.iloc[::2])
Output:
Original dataset:
Name Department Salary Years Bonus
0 Alice Engineering 95000 5 10000
1 Bob Sales 65000 3 5000
2 Charlie Engineering 87000 7 12000
3 Diana HR 72000 4 6000
4 Eve Sales 68000 2 4000
First row, third column: 95000
First 3 employees:
Name Department Salary Years Bonus
0 Alice Engineering 95000 5 10000
1 Bob Sales 65000 3 5000
2 Charlie Engineering 87000 7 12000
Last 2 rows, last 2 columns:
Years Bonus
3 4 6000
4 2 4000
Rows 1 and 3, columns 0, 2, 4:
Name Salary Bonus
1 Bob 65000 5000
3 Diana 72000 6000
Every other employee:
Name Department Salary Years Bonus
0 Alice Engineering 95000 5 10000
2 Charlie Engineering 87000 7 12000
4 Eve Sales 68000 2 4000
The iloc[::2] syntax uses Python's step parameter, [start:end:step]. Here,
[::2] means "start at the beginning, go to the end, take every 2nd element." This is useful for sampling
data or extracting patterns. Remember: iloc is pure position indexing, so it works even if your
DataFrame has non-numeric or non-sequential indices.
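That last point is worth a tiny demonstration with a string-labeled Series:

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=['x', 'y', 'z'])

# iloc is pure position: 0 means "first element", whatever its label
print(s.iloc[0])    # 10

# loc needs the label instead
print(s.loc['y'])   # 20
```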
Conditional Selection
Boolean indexing with DataFrames works just like with Series, but now you can filter rows based on column
conditions. This is incredibly powerful: "show me all employees in Engineering," "find products under $50,"
"get records from the last week." You create a boolean mask (True/False for each row), then use that mask
to filter the DataFrame. Multiple conditions can be combined with & (AND) and | (OR).
# Boolean mask based on column condition
high_earners = employees[employees['Salary'] > 75000]
print("Employees earning > $75,000:")
print(high_earners)
# Multiple conditions (AND)
senior_engineers = employees[
(employees['Department'] == 'Engineering') &
(employees['Years'] >= 5)
]
print("\nSenior engineers (5+ years):")
print(senior_engineers)
# Multiple conditions (OR)
sales_or_hr = employees[
(employees['Department'] == 'Sales') |
(employees['Department'] == 'HR')
]
print("\nSales or HR employees:")
print(sales_or_hr)
# Complex condition: total compensation
employees['Total_Comp'] = employees['Salary'] + employees['Bonus']
top_earners = employees[employees['Total_Comp'] > 90000]
print("\nTop earners (total comp > $90k):")
print(top_earners[['Name', 'Department', 'Total_Comp']])
Output:
Employees earning > $75,000:
Name Department Salary Years Bonus
0 Alice Engineering 95000 5 10000
2 Charlie Engineering 87000 7 12000
Senior engineers (5+ years):
Name Department Salary Years Bonus
0 Alice Engineering 95000 5 10000
2 Charlie Engineering 87000 7 12000
Sales or HR employees:
Name Department Salary Years Bonus
1 Bob Sales 65000 3 5000
3 Diana HR 72000 4 6000
4 Eve Sales 68000 2 4000
Top earners (total comp > $90k):
Name Department Total_Comp
0 Alice Engineering 105000
2 Charlie Engineering 99000
When combining conditions, always wrap each condition in parentheses: (condition1) & (condition2).
Python's operator precedence requires this. Also notice we created a new column Total_Comp by adding
two existing columns. This is a common pattern: create a derived column, then filter on it. This keeps your
conditions simple and readable.
Common mistake: Don't use Python's and or
or keywords for DataFrame conditions. Use & (bitwise AND) and | (bitwise OR)
instead. Also, always use parentheses around each condition due to operator precedence.
Practice Questions: Selecting Data
Master DataFrame selection techniques.
Given:
df = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35],
'City': ['NYC', 'LA', 'Chicago']
})
Task: Select only the 'Name' column and print it.
Show Solution
import pandas as pd
df = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35],
'City': ['NYC', 'LA', 'Chicago']
})
names = df['Name']
print(names)
Given:
employees = pd.DataFrame({
'Name': ['John', 'Sarah', 'Mike', 'Lisa', 'Tom'],
'Salary': [75000, 85000, 65000, 92000, 72000],
'Department': ['IT', 'Sales', 'IT', 'Sales', 'HR']
})
Task: Select all employees with Salary greater than 70000.
Show Solution
import pandas as pd
employees = pd.DataFrame({
'Name': ['John', 'Sarah', 'Mike', 'Lisa', 'Tom'],
'Salary': [75000, 85000, 65000, 92000, 72000],
'Department': ['IT', 'Sales', 'IT', 'Sales', 'HR']
})
high_earners = employees[employees['Salary'] > 70000]
print(high_earners)
Given:
sales = pd.DataFrame({
'Product': ['Laptop', 'Mouse', 'Keyboard', 'Monitor', 'Webcam'],
'Price': [1200, 25, 75, 350, 80],
'Stock': [5, 150, 80, 20, 45],
'Category': ['Electronics', 'Electronics', 'Electronics', 'Electronics', 'Electronics']
})
Task: Use loc to select products where Price > 50 AND Stock < 100, showing only 'Product' and 'Price' columns.
Show Solution
import pandas as pd
sales = pd.DataFrame({
'Product': ['Laptop', 'Mouse', 'Keyboard', 'Monitor', 'Webcam'],
'Price': [1200, 25, 75, 350, 80],
'Stock': [5, 150, 80, 20, 45],
'Category': ['Electronics', 'Electronics', 'Electronics', 'Electronics', 'Electronics']
})
result = sales.loc[(sales['Price'] > 50) & (sales['Stock'] < 100), ['Product', 'Price']]
print(result)
Given:
data = pd.DataFrame({
'A': [10, 20, 30, 40, 50],
'B': [15, 25, 35, 45, 55],
'C': [12, 22, 32, 42, 52]
})
Task: Select rows 1 through 3 (inclusive) and columns 0 and 2 using iloc.
Show Solution
import pandas as pd
data = pd.DataFrame({
'A': [10, 20, 30, 40, 50],
'B': [15, 25, 35, 45, 55],
'C': [12, 22, 32, 42, 52]
})
# iloc uses integer positions, 1:4 gets rows 1, 2, 3
result = data.iloc[1:4, [0, 2]]
print(result)
Given:
inventory = pd.DataFrame({
'Item': ['Widget', 'Gadget', 'Doohickey', 'Thingamajig', 'Whatsit'],
'Price': [19.99, 49.99, 9.99, 99.99, 29.99],
'Quantity': [100, 5, 200, 3, 50],
'Category': ['A', 'B', 'A', 'C', 'B']
})
Task: Select items where (Price < 30 OR Quantity < 10) AND Category is 'B'.
Show Solution
import pandas as pd
inventory = pd.DataFrame({
'Item': ['Widget', 'Gadget', 'Doohickey', 'Thingamajig', 'Whatsit'],
'Price': [19.99, 49.99, 9.99, 99.99, 29.99],
'Quantity': [100, 5, 200, 3, 50],
'Category': ['A', 'B', 'A', 'C', 'B']
})
# Use parentheses carefully for OR (|) and AND (&)
result = inventory[((inventory['Price'] < 30) | (inventory['Quantity'] < 10)) & (inventory['Category'] == 'B')]
print(result)
Key Takeaways
Series: Labeled 1D Arrays
Series are one-dimensional labeled arrays. Use custom indices for meaningful labels. Support vectorized operations and statistical methods. Create from lists, dicts, or NumPy arrays
DataFrames: 2D Tables
DataFrames are collections of Series sharing an index. Each column is a Series. Create from dicts, lists of dicts, NumPy arrays, or files (CSV, Excel, JSON)
loc vs iloc
loc[] uses labels (row/column names), iloc[] uses integer positions.
loc slicing is inclusive on both ends, iloc is exclusive on the end
Boolean Indexing
Filter data with conditions: df[df['col'] > value]. Combine conditions with
& (AND) or | (OR). Always use parentheses around each condition
Quick Inspection
Use head(), tail(), info(), and describe()
to quickly understand dataset structure, size, data types, and statistical summaries
Vectorized Operations
Apply operations to entire columns/Series at once without loops. Example: df['price'] * 1.1
adds 10% to all prices instantly. Fast and readable
Knowledge Check
Test your understanding of Pandas Series and DataFrames: