Introduction to Pandas
Pandas is Python's most powerful library for data manipulation and analysis. Built on top of NumPy, it provides two fundamental data structures, Series and DataFrame, that make working with structured data intuitive and efficient.
What is Pandas?
Pandas (short for "Panel Data") was created in 2008 by Wes McKinney while working at AQR Capital Management. What started as a tool for financial data analysis has become the de facto standard for data manipulation in Python. Today, Pandas is used by data scientists, analysts, and engineers across industries, from finance and healthcare to e-commerce and social media, to clean, transform, and analyze datasets ranging from a few rows to millions of records.
Pandas Data Structures
Pandas provides two core data structures: Series (1-dimensional labeled array) and DataFrame (2-dimensional labeled table). Think of a Series as a single column in Excel, and a DataFrame as the entire spreadsheet. Both structures have labels (called an index) that allow you to access data by name rather than just position.
Why it matters: Labeled data structures make your code self-documenting and less error-prone.
Instead of remembering "column 3 is salary," you simply write df['Salary'].
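A minimal sketch of that labeled access (the names and values here are invented for illustration):

```python
import pandas as pd

# A Series is one labeled column of values
salaries = pd.Series([50000, 62000], index=['alice', 'bob'])

# A DataFrame is several Series sharing one index
df = pd.DataFrame({'Salary': [50000, 62000], 'Age': [30, 25]},
                  index=['alice', 'bob'])

# Access by name, not just by position
print(salaries['alice'])          # 50000
print(df.loc['bob', 'Salary'])    # 62000
```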
Why Use Pandas?
Intuitive Syntax
Pandas code reads like plain English. Operations like "filter rows where salary > 50000"
become df[df['Salary'] > 50000]-simple and readable.
Flexible I/O
Read and write data from CSV, Excel, JSON, SQL databases, HTML tables, and more, all
with one-line commands like pd.read_csv().
Data Cleaning
Built-in methods for handling missing values, removing duplicates, converting data types, and more make data cleaning far less painful.
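A small taste of the first and third points, using an invented dataset:

```python
import pandas as pd

# Tiny made-up dataset with one duplicated row
employees = pd.DataFrame({
    'Name': ['Ann', 'Ben', 'Ben', 'Cal'],
    'Salary': [52000, 48000, 48000, 61000],
})

# Intuitive syntax: "filter rows where salary > 50000"
high = employees[employees['Salary'] > 50000]
print(high['Name'].tolist())              # ['Ann', 'Cal']

# Built-in cleaning: drop the duplicated Ben row in one call
print(len(employees.drop_duplicates()))   # 3
```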
Installing Pandas
Before we dive into Pandas features, let's make sure you have it installed. Pandas is typically included in data science distributions like Anaconda, but if you need to install it separately, you can use pip or conda. The installation is straightforward and only takes a few seconds.
# Using pip (works with any Python installation)
pip install pandas
# Using conda (recommended for Anaconda users)
conda install pandas
# Install a specific version
pip install pandas==2.1.0
Output:
Successfully installed pandas-2.1.0 numpy-1.24.0 python-dateutil-2.8.2 pytz-2023.3
Notice that Pandas automatically installs NumPy (its foundation), python-dateutil (for date handling), and pytz (for timezone support). These dependencies are part of what makes Pandas so powerful: it leverages battle-tested libraries under the hood.
Importing Pandas
The data science community has established a universal convention: always import pandas as pd.
This short alias keeps your code concise while remaining instantly recognizable to any data scientist reading it.
You'll see this pattern in documentation, tutorials, and production code around the world.
import pandas as pd
import numpy as np # Often used together
# Check your Pandas version
print(f"Pandas version: {pd.__version__}")
# Show detailed versions of Pandas and its dependencies
pd.show_versions()
Output:
Pandas version: 2.1.0
INSTALLED VERSIONS
------------------
commit : 2691c6c90b
python : 3.11.5
pandas : 2.1.0
numpy : 1.24.0
pytz : 2023.3
dateutil : 2.8.2
pd.show_versions() is incredibly useful when debugging
issues or asking for help online. It shows exactly which versions of Pandas and its dependencies you're
using, making it easier to reproduce and solve problems.
Pandas Series
A Series is Pandas' one-dimensional labeled array capable of holding any data type. Think of it as a supercharged Python list with an index that gives each element a label, making data access more intuitive and powerful.
Creating a Series from a List
The simplest way to create a Series is from a Python list. When you don't specify an index, Pandas automatically creates a numeric index starting from 0, just like a list. However, unlike lists, you get access to powerful vectorized operations and statistical methods right out of the box. Let's create a Series representing daily temperatures for a week.
import pandas as pd
import numpy as np
# Create a Series from a list
temperatures = pd.Series([72, 75, 68, 70, 73, 71, 69])
print(temperatures)
print(f"\nData type: {temperatures.dtype}")
print(f"Shape: {temperatures.shape}")
Output:
0 72
1 75
2 68
3 70
4 73
5 71
6 69
dtype: int64
Data type: int64
Shape: (7,)
Notice the output format: each line shows the index (0-6) on the left and the value on the right.
The dtype: int64 tells us all values are 64-bit integers. The shape (7,)
indicates this is a 1-dimensional structure with 7 elements. This automatic index is useful, but
Pandas really shines when we use custom labels.
Series with Custom Index
Custom indices transform your Series from a simple array into a powerful data structure where each value has meaningful context. Instead of referring to "the value at position 2," you can refer to "Monday's temperature." This makes your code self-documenting and dramatically reduces the chance of errors. Here's how to create a Series with day names as labels.
# Series with custom index (day names)
temperatures = pd.Series(
[72, 75, 68, 70, 73, 71, 69],
index=['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
)
print(temperatures)
# Access by label (much more readable!)
print(f"\nWednesday's temperature: {temperatures['Wed']}°F")
print(f"\nWeekend temperatures:\n{temperatures[['Sat', 'Sun']]}")
Output:
Mon 72
Tue 75
Wed 68
Thu 70
Fri 73
Sat 71
Sun 69
dtype: int64
Wednesday's temperature: 68°F
Weekend temperatures:
Sat 71
Sun 69
dtype: int64
See how much more readable that is? temperatures['Wed'] clearly refers to Wednesday's temperature,
compared to temperatures[2] which requires you to remember which day is at position 2. You can also
select multiple values by passing a list of labels, and Pandas returns a new Series with just those elements.
Creating Series from a Dictionary
Python dictionaries naturally map to Pandas Series: dictionary keys become the index, and values become the data. This is perfect for key-value data like city populations, product prices, or student grades. The dictionary's inherent structure, where each key uniquely identifies a value, aligns perfectly with the Series' labeled data model. Let's create a Series of major city populations.
# Dictionary keys automatically become the index
city_populations = pd.Series({
'Tokyo': 37_400_000,
'Delhi': 31_900_000,
'Shanghai': 27_100_000,
'São Paulo': 22_200_000,
'Mumbai': 20_700_000
})
print(city_populations)
print(f"\nTokyo population: {city_populations['Tokyo']:,}")
# Get the 3 largest cities
print(f"\nTop 3 cities:\n{city_populations.nlargest(3)}")
Output:
Tokyo 37400000
Delhi 31900000
Shanghai 27100000
São Paulo 22200000
Mumbai 20700000
dtype: int64
Tokyo population: 37,400,000
Top 3 cities:
Tokyo 37400000
Delhi 31900000
Shanghai 27100000
dtype: int64
The nlargest() method is one of many convenient Series methods. It returns the n largest values
while maintaining the index labels, which is much cleaner than manually sorting and slicing. This is a common pattern
in Pandas: powerful operations become single method calls.
Series Index
The index is what makes a Series more than just a NumPy array. It's an immutable array that labels each element, enabling powerful alignment and selection operations. Indices can be integers, strings, dates, or any hashable type.
Why it matters: When you perform operations between Series, Pandas automatically aligns data by index labels, not position. This prevents common bugs when dealing with misaligned data.
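A quick sketch of that alignment behavior (the fruit counts are invented):

```python
import pandas as pd

# Two Series with only partially overlapping indices
q1_sales = pd.Series({'apples': 10, 'pears': 5})
q2_sales = pd.Series({'pears': 7, 'plums': 3})

# Addition aligns by label, not by position
total = q1_sales + q2_sales
print(total)
# apples     NaN   (only in q1_sales)
# pears     12.0   (5 + 7, matched by label)
# plums      NaN   (only in q2_sales)
```

Labels with no counterpart produce NaN rather than silently pairing up mismatched positions, which surfaces misaligned data instead of hiding it.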
Series Operations & Methods
Series support all of NumPy's vectorized operations plus dozens of specialized methods for data analysis. You can perform arithmetic, statistical calculations, and comparisons across entire Series without writing loops. This vectorization is what makes Pandas fast: operations are implemented in optimized C code and can process millions of elements in milliseconds. Let's explore common operations with a retail dataset.
# Product prices
prices = pd.Series({
'Laptop': 1200,
'Mouse': 25,
'Keyboard': 75,
'Monitor': 350,
'Headset': 80
})
# Arithmetic operations (vectorized)
discounted_prices = prices * 0.85 # 15% discount
tax_included = prices * 1.10 # Add 10% tax
print("Original prices:")
print(prices)
print("\nAfter 15% discount:")
print(discounted_prices.round(2))
# Statistical methods
print(f"\nAverage price: ${prices.mean():.2f}")
print(f"Most expensive: ${prices.max()} ({prices.idxmax()})")
print(f"Cheapest: ${prices.min()} ({prices.idxmin()})")
print(f"Price range: ${prices.max() - prices.min()}")
Output:
Original prices:
Laptop 1200
Mouse 25
Keyboard 75
Monitor 350
Headset 80
dtype: int64
After 15% discount:
Laptop 1020.0
Mouse 21.25
Keyboard 63.75
Monitor 297.5
Headset 68.0
dtype: float64
Average price: $346.00
Most expensive: $1200 (Laptop)
Cheapest: $25 (Mouse)
Price range: $1175
Notice how prices * 0.85 applied the discount to every item at once-no loop needed. The
idxmax() and idxmin() methods don't just return the maximum and minimum values;
they return the index labels (product names) where those values occur. This is incredibly useful for
identifying which item is the best or worst on a given metric.
Boolean Indexing
One of Pandas' most powerful features is boolean indexing-using logical conditions to filter data. When you compare a Series to a value, you get back a Series of True/False values. You can then use this "boolean mask" to select only the elements where the condition is True. This pattern appears everywhere in data analysis: "show me all products over $100," "find customers who spent more than $500," and so on.
# Create boolean mask (condition)
expensive_mask = prices > 100
print("Which products cost more than $100?")
print(expensive_mask)
# Use mask to filter data
expensive_products = prices[expensive_mask]
print("\nExpensive products:")
print(expensive_products)
# You can also do this in one line
affordable = prices[prices <= 100]
print("\nAffordable products:")
print(affordable)
# Multiple conditions (use & for AND, | for OR)
mid_range = prices[(prices >= 50) & (prices <= 500)]
print("\nMid-range products ($50-$500):")
print(mid_range)
Output:
Which products cost more than $100?
Laptop True
Mouse False
Keyboard False
Monitor True
Headset False
dtype: bool
Expensive products:
Laptop 1200
Monitor 350
dtype: int64
Affordable products:
Mouse 25
Keyboard 75
Headset 80
dtype: int64
Mid-range products ($50-$500):
Keyboard 75
Monitor 350
Headset 80
dtype: int64
The boolean mask shows True for items that meet the condition and False for those that don't. When you use
this mask to index the Series, only the True elements are selected. For multiple conditions, use &
(AND) and | (OR), never Python's and/or keywords, and always wrap each condition in parentheses:
(condition1) & (condition2), not condition1 & condition2. Python's operator precedence requires it.
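To see why the keyword forms fail, here's a small sketch (product names invented):

```python
import pandas as pd

prices = pd.Series({'Mouse': 25, 'Monitor': 350})

# Python's `and` asks for a single True/False, which a Series can't give
try:
    prices[(prices > 50) and (prices < 500)]
except ValueError as err:
    print("Error:", err)   # "The truth value of a Series is ambiguous..."

# The element-wise & operator works as intended
print(prices[(prices > 50) & (prices < 500)])   # Monitor    350
```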
Practice Questions: Pandas Series
Test your Series manipulation skills.
Task: Create a Series with values [85, 92, 78, 95] and index labels ['Math', 'Science', 'English', 'History'].
Expected output:
Math 85
Science 92
English 78
History 95
dtype: int64
Show Solution
import pandas as pd
grades = pd.Series([85, 92, 78, 95],
index=['Math', 'Science', 'English', 'History'])
print(grades)
Given:
temperatures = pd.Series([72, 68, 75, 82, 79, 65, 88])
Task: Select only temperatures greater than 75.
Expected output:
3    82
4    79
6    88
dtype: int64
Show Solution
import pandas as pd
temperatures = pd.Series([72, 68, 75, 82, 79, 65, 88])
hot_days = temperatures[temperatures > 75]
print(hot_days)
Given:
stock_prices = pd.Series([150, 155, 148, 162, 159],
index=['Mon', 'Tue', 'Wed', 'Thu', 'Fri'])
Task: Find the day with the highest stock price using idxmax().
Expected output: 'Thu'
Show Solution
import pandas as pd
stock_prices = pd.Series([150, 155, 148, 162, 159],
index=['Mon', 'Tue', 'Wed', 'Thu', 'Fri'])
highest_day = stock_prices.idxmax()
print(highest_day) # Thu
Given:
scores = pd.Series([88, 92, 75, 95, 87, 90])
Task: Calculate and print the mean, median, and standard deviation of the scores.
Hint: Use .mean(), .median(), and .std() methods
Show Solution
import pandas as pd
scores = pd.Series([88, 92, 75, 95, 87, 90])
print(f"Mean: {scores.mean()}")
print(f"Median: {scores.median()}")
print(f"Std Dev: {scores.std():.2f}")
Creating DataFrames
A DataFrame is Pandas' two-dimensional data structure: think of it as a spreadsheet or SQL table in Python. Each column is a Series, and all columns share the same index. This is where Pandas truly shines for real-world data analysis, as most datasets naturally fit this tabular format.
DataFrame Structure
The sample table below shows the parts of a DataFrame: the numeric index on the left, column names across the top, and the data cells in between.
| | Name | Age | Major | GPA |
|---|---|---|---|---|
| 0 | Alice | 20 | CS | 3.8 |
| 1 | Bob | 21 | Math | 3.5 |
| 2 | Charlie | 19 | CS | 3.9 |
DataFrame from Dictionary
The most common way to create a DataFrame is from a dictionary where keys become column names and values become the column data. Each key should map to a list, array, or Series of equal length-this ensures all columns have the same number of rows. This pattern is perfect when you're building datasets programmatically or transforming data from JSON APIs. Let's create a student grades dataset.
import pandas as pd
# Create DataFrame from dictionary
students = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
'Age': [20, 21, 19, 22, 20],
'Major': ['CS', 'Math', 'CS', 'Physics', 'Math'],
'GPA': [3.8, 3.5, 3.9, 3.7, 3.6]
})
print(students)
print(f"\nShape: {students.shape}")
print(f"Columns: {students.columns.tolist()}")
print(f"Data types:\n{students.dtypes}")
Output:
Name Age Major GPA
0 Alice 20 CS 3.8
1 Bob 21 Math 3.5
2 Charlie 19 CS 3.9
3 Diana 22 Physics 3.7
4 Eve 20 Math 3.6
Shape: (5, 4)
Columns: ['Name', 'Age', 'Major', 'GPA']
Data types:
Name object
Age int64
Major object
GPA float64
dtype: object
The DataFrame automatically assigns a numeric index (0-4) on the left. The shape shows we have
5 rows and 4 columns. Notice the data types: strings become 'object', integers are 'int64', and decimals are
'float64'. Pandas infers these types automatically, but you can override them if needed.
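For instance, here is a quick sketch of overriding inferred types with astype() (the columns are invented, and GPA arrives as strings, as often happens with messy CSVs):

```python
import pandas as pd

# GPA values arrive as strings, so Pandas infers 'object'
df = pd.DataFrame({'Age': [20, 21], 'GPA': ['3.8', '3.5']})
print(df.dtypes)   # Age: int64, GPA: object

# Override the inferred types with astype()
df = df.astype({'GPA': 'float64', 'Age': 'int32'})
print(df.dtypes)   # Age: int32, GPA: float64
```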
DataFrame from List of Dictionaries
Another intuitive format is a list of dictionaries, where each dictionary represents one row. This mirrors JSON data structures perfectly-if you're fetching data from a REST API that returns an array of objects, this is the natural conversion. Each dictionary's keys become columns, and if a key is missing from some dictionaries, Pandas fills those cells with NaN (Not a Number).
# Each dictionary is one row
products = pd.DataFrame([
{'Product': 'Laptop', 'Price': 1200, 'Stock': 15, 'Category': 'Electronics'},
{'Product': 'Mouse', 'Price': 25, 'Stock': 150, 'Category': 'Electronics'},
{'Product': 'Desk', 'Price': 350, 'Stock': 8, 'Category': 'Furniture'},
{'Product': 'Chair', 'Price': 200, 'Stock': 12, 'Category': 'Furniture'},
{'Product': 'Notebook', 'Price': 3, 'Stock': 500, 'Category': 'Stationery'}
])
print(products)
print(f"\nTotal inventory value: ${(products['Price'] * products['Stock']).sum():,}")
Output:
Product Price Stock Category
0 Laptop 1200 15 Electronics
1 Mouse 25 150 Electronics
2 Desk 350 8 Furniture
3 Chair 200 12 Furniture
4 Notebook 3 500 Stationery
Total inventory value: $28,450
Notice how we calculated total inventory value by multiplying two columns element-wise, then summing the result. This is vectorized arithmetic-the same multiplication happens to every row simultaneously. The ability to treat columns as unified data structures, not individual elements, is what makes Pandas so powerful for data analysis.
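The NaN-filling behavior for missing keys, mentioned above, is easy to demonstrate:

```python
import pandas as pd

rows = [
    {'Product': 'Laptop', 'Price': 1200, 'Stock': 15},
    {'Product': 'Desk', 'Price': 350},   # 'Stock' key omitted
]
df = pd.DataFrame(rows)
print(df)

# Desk's Stock is NaN, and the whole column becomes float64,
# because NaN cannot be stored in an integer column
print(df['Stock'].isna().sum())   # 1
```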
Dictionary of Lists
Best for column-oriented data. Each key is a column name, value is the column data. Example:
{'col1': [1,2,3], 'col2': [4,5,6]}
List of Dictionaries
Best for row-oriented data. Each dict is one row. Example:
[{'A': 1, 'B': 2}, {'A': 3, 'B': 4}]. Perfect for JSON data.
From Files
Most common in practice: pd.read_csv('file.csv'), pd.read_excel(),
pd.read_json(). Handles messy real-world data.
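The file route above can be tried end to end without any existing data file; here's a minimal round-trip sketch (the temp-file path and column names are arbitrary):

```python
import os
import tempfile
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})

# Write to a temporary CSV, then read it straight back
path = os.path.join(tempfile.mkdtemp(), 'demo.csv')
df.to_csv(path, index=False)    # index=False: don't write the row labels
restored = pd.read_csv(path)

print(restored.equals(df))      # True
```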
DataFrame from NumPy Array
If you're working with numerical computations and already have NumPy arrays, converting them to DataFrames adds labels and methods for data analysis. This is common in scientific computing and machine learning where data starts as matrices but needs to be analyzed as structured tables. You'll need to provide column names manually since arrays don't have that metadata.
import numpy as np
# 2D array: 4 rows, 3 columns (random values, so your output will differ)
data = np.random.randint(50, 100, size=(4, 3))
# Convert to DataFrame with column names
grades = pd.DataFrame(
data,
columns=['Math', 'Science', 'English'],
index=['Alice', 'Bob', 'Charlie', 'Diana']
)
print(grades)
print(f"\nClass averages:\n{grades.mean()}")
print(f"\nStudent averages:\n{grades.mean(axis=1)}")
Output:
Math Science English
Alice 87 65 92
Bob 73 89 68
Charlie 91 76 85
Diana 69 94 71
Class averages:
Math 80.0
Science 81.0
English 79.0
dtype: float64
Student averages:
Alice 81.333333
Bob 76.666667
Charlie 84.000000
Diana 78.000000
dtype: float64
Notice we provided both column names and row labels (index). The mean() method calculates
averages by column (down) by default, which gives us the average for each subject across all students.
Using axis=1 calculates across columns (sideways), giving us each student's average across subjects.
Viewing DataFrame Information
Before analyzing any dataset, you need to understand its structure. Pandas provides several methods to
quickly inspect DataFrames: head() and tail() show the first or last rows,
info() displays structure and memory usage, and describe() generates statistical
summaries. These are your first steps with any new dataset-get the shape, check data types, look for missing
values, and understand the value ranges.
# Load a larger dataset example
import pandas as pd
import numpy as np
# Create sample sales data (random values, so your output will differ)
sales = pd.DataFrame({
'Date': pd.date_range('2024-01-01', periods=100),
'Product': np.random.choice(['Laptop', 'Mouse', 'Keyboard'], 100),
'Quantity': np.random.randint(1, 20, 100),
'Price': np.random.uniform(10, 1000, 100).round(2),
'Region': np.random.choice(['North', 'South', 'East', 'West'], 100)
})
# Quick inspection
print("First 5 rows:")
print(sales.head())
print("\nLast 3 rows:")
print(sales.tail(3))
print("\n" + "="*50)
print("DATASET INFO:")
print("="*50)
sales.info()
print("\n" + "="*50)
print("STATISTICAL SUMMARY:")
print("="*50)
print(sales.describe())
Output:
First 5 rows:
Date Product Quantity Price Region
0 2024-01-01 Laptop 15 876.32 North
1 2024-01-02 Mouse 7 234.56 East
2 2024-01-03 Keyboard 12 123.45 South
3 2024-01-04 Laptop 3 945.67 West
4 2024-01-05 Mouse 18 89.12 North
Last 3 rows:
Date Product Quantity Price Region
97 2024-04-07 Laptop 9 567.89 South
98 2024-04-08 Keyboard 14 234.56 East
99 2024-04-09 Mouse 5 876.54 West
==================================================
DATASET INFO:
==================================================
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Date 100 non-null datetime64[ns]
1 Product 100 non-null object
2 Quantity 100 non-null int64
3 Price 100 non-null float64
4 Region 100 non-null object
dtypes: datetime64[ns](1), float64(1), int64(1), object(2)
memory usage: 4.0+ KB
==================================================
STATISTICAL SUMMARY:
==================================================
Quantity Price
count 100.000000 100.000000
mean 10.450000 504.562100
std 5.468293 285.679123
min 1.000000 10.230000
25% 6.000000 267.897500
50% 10.500000 501.345000
75% 15.000000 741.227500
max 19.000000 999.870000
The info() output shows we have 100 rows with no missing values (all columns show "100 non-null").
We can see the data types: Date is datetime, Product and Region are objects (strings), Quantity is integer,
and Price is float. The describe() method provides statistics only for numeric columns (mean,
standard deviation, quartiles, and so on). This is incredibly useful for spotting outliers or understanding data distributions.
Tip: Always run df.info() and
df.head() when you first load a dataset. These two commands give you a complete overview:
structure, size, data types, missing values, and sample data, which is everything you need before starting analysis.
Practice Questions: Creating DataFrames
Practice creating and inspecting DataFrames.
Task: Create a DataFrame with columns 'Name': ['Alice', 'Bob', 'Charlie'] and 'Age': [25, 30, 35].
Expected output:
Name Age
0 Alice 25
1 Bob 30
2 Charlie 35
Show Solution
import pandas as pd
people = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35]
})
print(people)
Given:
df = pd.DataFrame({
'Product': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H'],
'Sales': [100, 200, 150, 300, 250, 180, 220, 190]
})
Task: Display the first 3 rows and the last 2 rows.
Show Solution
import pandas as pd
df = pd.DataFrame({
'Product': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H'],
'Sales': [100, 200, 150, 300, 250, 180, 220, 190]
})
print("First 3 rows:")
print(df.head(3))
print("\nLast 2 rows:")
print(df.tail(2))
Given:
df = pd.DataFrame({
'Employee': ['John', 'Sarah', 'Mike', 'Lisa'],
'Department': ['IT', 'HR', 'IT', 'Sales'],
'Salary': [75000, 65000, 72000, 68000],
'Years': [5, 3, 7, 4]
})
Task: Print the shape, column names, and statistical summary of numeric columns.
Show Solution
import pandas as pd
df = pd.DataFrame({
'Employee': ['John', 'Sarah', 'Mike', 'Lisa'],
'Department': ['IT', 'HR', 'IT', 'Sales'],
'Salary': [75000, 65000, 72000, 68000],
'Years': [5, 3, 7, 4]
})
print("Shape:", df.shape)
print("\nColumns:", df.columns.tolist())
print("\nStatistics:")
print(df.describe())
Task: Create a DataFrame from this list of dictionaries representing products:
[
{'Product': 'Laptop', 'Price': 999},
{'Product': 'Mouse', 'Price': 29},
{'Product': 'Monitor', 'Price': 349}
]
Show Solution
import pandas as pd
products = [
{'Product': 'Laptop', 'Price': 999},
{'Product': 'Mouse', 'Price': 29},
{'Product': 'Monitor', 'Price': 349}
]
df = pd.DataFrame(products)
print(df)
Given:
students = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Carol', 'Dave'],
'Math': [85, 92, 78, 88],
'Science': [90, 88, 95, 82],
'English': [78, 85, 88, 90],
'Age': [20, 21, 19, 22]
})
Task: Select only the 'Name', 'Math', and 'Science' columns.
Show Solution
import pandas as pd
students = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Carol', 'Dave'],
'Math': [85, 92, 78, 88],
'Science': [90, 88, 95, 82],
'English': [78, 85, 88, 90],
'Age': [20, 21, 19, 22]
})
selected = students[['Name', 'Math', 'Science']]
print(selected)
Selecting Data
Data selection is the foundation of data analysis. Pandas offers multiple ways to extract specific rows,
columns, or cells from DataFrames. Mastering loc[] (label-based) and iloc[]
(position-based) selection is essential for effective data manipulation.
Selecting Columns
The simplest selection is extracting columns by name. Use bracket notation with a string for a single column (returns a Series) or a list of strings for multiple columns (returns a DataFrame). This is your starting point for any analysis: isolate the variables you need before performing calculations or aggregations. Column selection is also the foundation for creating new columns derived from existing ones.
import pandas as pd
import numpy as np
# Create sample employee dataset
employees = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
'Department': ['Engineering', 'Sales', 'Engineering', 'HR', 'Sales'],
'Salary': [95000, 65000, 87000, 72000, 68000],
'Years': [5, 3, 7, 4, 2],
'Bonus': [10000, 5000, 12000, 6000, 4000]
})
print("Full dataset:")
print(employees)
# Single column (returns Series)
print("\n--- Single Column (Series) ---")
names = employees['Name']
print(names)
print(f"Type: {type(names)}")
# Multiple columns (returns DataFrame)
print("\n--- Multiple Columns (DataFrame) ---")
compensation = employees[['Name', 'Salary', 'Bonus']]
print(compensation)
print(f"Type: {type(compensation)}")
Output:
Full dataset:
Name Department Salary Years Bonus
0 Alice Engineering 95000 5 10000
1 Bob Sales 65000 3 5000
2 Charlie Engineering 87000 7 12000
3 Diana HR 72000 4 6000
4 Eve Sales 68000 2 4000
--- Single Column (Series) ---
0 Alice
1 Bob
2 Charlie
3 Diana
4 Eve
Name: Name, dtype: object
Type: <class 'pandas.core.series.Series'>
--- Multiple Columns (DataFrame) ---
Name Salary Bonus
0 Alice 95000 10000
1 Bob 65000 5000
2 Charlie 87000 12000
3 Diana 72000 6000
4 Eve 68000 4000
Type: <class 'pandas.core.frame.DataFrame'>
Notice the critical difference: single bracket with a string returns a Series, but double bracket with a list returns a DataFrame. This distinction matters because many operations only work on DataFrames. When in doubt, use double brackets to keep your data as a DataFrame-you can always convert to a Series later if needed.
loc vs iloc
loc[] uses labels: df.loc['row_label', 'column_name']
iloc[] uses integer positions: df.iloc[0, 2] (row 0, column 2)
When to use each: Use loc when you know names (more readable, less brittle).
Use iloc when working with positions (loops, mathematical operations on structure).
loc is generally preferred because it's self-documenting.
Label-Based Selection: loc[]
The loc[] indexer selects data by labels-both row labels (index) and column names. The syntax
is df.loc[rows, columns] where both can be single labels, lists, or slices. This is the most
readable selection method because you're explicitly naming what you want. When you see
df.loc['Alice', 'Salary'], it's immediately clear you're getting Alice's salary.
# Set Name as index for clearer examples
emp = employees.set_index('Name')
print("Data with Name as index:")
print(emp)
# Single cell: specific row and column
print(f"\nAlice's salary: ${emp.loc['Alice', 'Salary']:,}")
# Single row: all columns
print("\nBob's full record:")
print(emp.loc['Bob'])
# Multiple rows, single column
print("\nSalaries for Alice and Charlie:")
print(emp.loc[['Alice', 'Charlie'], 'Salary'])
# Multiple rows and columns (returns DataFrame)
print("\nEngineering salaries and years:")
eng_data = emp.loc[['Alice', 'Charlie'], ['Salary', 'Years']]
print(eng_data)
# Slice notation (inclusive on both ends!)
print("\nRows from Bob to Diana:")
print(emp.loc['Bob':'Diana'])
Output:
Data with Name as index:
Department Salary Years Bonus
Name
Alice Engineering 95000 5 10000
Bob Sales 65000 3 5000
Charlie Engineering 87000 7 12000
Diana HR 72000 4 6000
Eve Sales 68000 2 4000
Alice's salary: $95,000
Bob's full record:
Department Sales
Salary 65000
Years 3
Bonus 5000
Name: Bob, dtype: object
Salaries for Alice and Charlie:
Name
Alice 95000
Charlie 87000
Name: Salary, dtype: int64
Engineering salaries and years:
Salary Years
Name
Alice 95000 5
Charlie 87000 7
Rows from Bob to Diana:
Department Salary Years Bonus
Name
Bob Sales 65000 3 5000
Charlie Engineering 87000 7 12000
Diana HR 72000 4 6000
A crucial detail: loc slicing is inclusive on both ends. When you write
loc['Bob':'Diana'], you get Bob, Charlie, AND Diana. This differs from Python's normal slicing
where the end is exclusive. This inclusive behavior makes sense for labels-if you ask for "Bob through Diana,"
you probably want Diana included.
Position-Based Selection: iloc[]
The iloc[] indexer selects by integer position, just like regular Python list indexing. Use this
when you need to select data by position rather than label-for example, "get the first 10 rows" or "extract
every other column." The syntax is df.iloc[row_positions, column_positions] where positions are
zero-based integers. Unlike loc, slicing with iloc is exclusive on the end.
# Reset to numeric index for clearer position-based examples
emp = employees.copy()
print("Original dataset:")
print(emp)
# Single cell by position
print(f"\nFirst row, third column: {emp.iloc[0, 2]}")
# First 3 rows
print("\nFirst 3 employees:")
print(emp.iloc[:3])
# Last 2 rows, last 2 columns
print("\nLast 2 rows, last 2 columns:")
print(emp.iloc[-2:, -2:])
# Specific rows and columns by position
print("\nRows 1 and 3, columns 0, 2, 4:")
print(emp.iloc[[1, 3], [0, 2, 4]])
# Every other row
print("\nEvery other employee:")
print(emp.iloc[::2])
Output:
Original dataset:
Name Department Salary Years Bonus
0 Alice Engineering 95000 5 10000
1 Bob Sales 65000 3 5000
2 Charlie Engineering 87000 7 12000
3 Diana HR 72000 4 6000
4 Eve Sales 68000 2 4000
First row, third column: 95000
First 3 employees:
Name Department Salary Years Bonus
0 Alice Engineering 95000 5 10000
1 Bob Sales 65000 3 5000
2 Charlie Engineering 87000 7 12000
Last 2 rows, last 2 columns:
Years Bonus
3 4 6000
4 2 4000
Rows 1 and 3, columns 0, 2, 4:
Name Salary Bonus
1 Bob 65000 5000
3 Diana 72000 6000
Every other employee:
Name Department Salary Years Bonus
0 Alice Engineering 95000 5 10000
2 Charlie Engineering 87000 7 12000
4 Eve Sales 68000 2 4000
The iloc[::2] syntax uses Python's step parameter, [start:end:step]. Here,
[::2] means "start at the beginning, go to the end, take every 2nd element." This is useful for sampling
data or extracting patterns. Remember: iloc is pure position indexing, so it works even if your
DataFrame has non-numeric or non-sequential indices.
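That last point is worth a tiny demonstration with a string-labeled Series:

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=['x', 'y', 'z'])

# iloc is pure position: 0 means "first element", whatever its label
print(s.iloc[0])    # 10

# loc needs the label instead
print(s.loc['y'])   # 20
```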
Conditional Selection
Boolean indexing with DataFrames works just like with Series, but now you can filter rows based on column
conditions. This is incredibly powerful: "show me all employees in Engineering," "find products under $50,"
"get records from the last week." You create a boolean mask (True/False for each row), then use that mask
to filter the DataFrame. Multiple conditions can be combined with & (AND) and | (OR).
# Boolean mask based on column condition
high_earners = employees[employees['Salary'] > 75000]
print("Employees earning > $75,000:")
print(high_earners)
# Multiple conditions (AND)
senior_engineers = employees[
(employees['Department'] == 'Engineering') &
(employees['Years'] >= 5)
]
print("\nSenior engineers (5+ years):")
print(senior_engineers)
# Multiple conditions (OR)
sales_or_hr = employees[
(employees['Department'] == 'Sales') |
(employees['Department'] == 'HR')
]
print("\nSales or HR employees:")
print(sales_or_hr)
# Complex condition: total compensation
employees['Total_Comp'] = employees['Salary'] + employees['Bonus']
top_earners = employees[employees['Total_Comp'] > 90000]
print("\nTop earners (total comp > $90k):")
print(top_earners[['Name', 'Department', 'Total_Comp']])
Output:
Employees earning > $75,000:
Name Department Salary Years Bonus
0 Alice Engineering 95000 5 10000
2 Charlie Engineering 87000 7 12000
Senior engineers (5+ years):
Name Department Salary Years Bonus
0 Alice Engineering 95000 5 10000
2 Charlie Engineering 87000 7 12000
Sales or HR employees:
Name Department Salary Years Bonus
1 Bob Sales 65000 3 5000
3 Diana HR 72000 4 6000
4 Eve Sales 68000 2 4000
Top earners (total comp > $90k):
Name Department Total_Comp
0 Alice Engineering 105000
2 Charlie Engineering 99000
When combining conditions, always wrap each condition in parentheses: (condition1) & (condition2).
Python's operator precedence requires this. Also notice we created a new column Total_Comp by adding
two existing columns. This is a common pattern: create a derived column, then filter on it. This keeps your
conditions simple and readable.
Common mistake: Don't use Python's and or
or keywords for DataFrame conditions. Use & (bitwise AND) and | (bitwise OR)
instead. Also, always use parentheses around each condition due to operator precedence.
Practice Questions: Selecting Data
Master DataFrame selection techniques.
Given:
df = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35],
'City': ['NYC', 'LA', 'Chicago']
})
Task: Select only the 'Name' column and print it.
Show Solution
import pandas as pd
df = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35],
'City': ['NYC', 'LA', 'Chicago']
})
names = df['Name']
print(names)
Given:
employees = pd.DataFrame({
'Name': ['John', 'Sarah', 'Mike', 'Lisa', 'Tom'],
'Salary': [75000, 85000, 65000, 92000, 72000],
'Department': ['IT', 'Sales', 'IT', 'Sales', 'HR']
})
Task: Select all employees with Salary greater than 70000.
Show Solution
import pandas as pd
employees = pd.DataFrame({
'Name': ['John', 'Sarah', 'Mike', 'Lisa', 'Tom'],
'Salary': [75000, 85000, 65000, 92000, 72000],
'Department': ['IT', 'Sales', 'IT', 'Sales', 'HR']
})
high_earners = employees[employees['Salary'] > 70000]
print(high_earners)
Given:
sales = pd.DataFrame({
'Product': ['Laptop', 'Mouse', 'Keyboard', 'Monitor', 'Webcam'],
'Price': [1200, 25, 75, 350, 80],
'Stock': [5, 150, 80, 20, 45],
'Category': ['Electronics', 'Electronics', 'Electronics', 'Electronics', 'Electronics']
})
Task: Use loc to select products where Price > 50 AND Stock < 100, showing only 'Product' and 'Price' columns.
Show Solution
import pandas as pd
sales = pd.DataFrame({
'Product': ['Laptop', 'Mouse', 'Keyboard', 'Monitor', 'Webcam'],
'Price': [1200, 25, 75, 350, 80],
'Stock': [5, 150, 80, 20, 45],
'Category': ['Electronics', 'Electronics', 'Electronics', 'Electronics', 'Electronics']
})
result = sales.loc[(sales['Price'] > 50) & (sales['Stock'] < 100), ['Product', 'Price']]
print(result)
Given:
data = pd.DataFrame({
'A': [10, 20, 30, 40, 50],
'B': [15, 25, 35, 45, 55],
'C': [12, 22, 32, 42, 52]
})
Task: Select rows 1 through 3 (inclusive) and columns 0 and 2 using iloc.
Show Solution
import pandas as pd
data = pd.DataFrame({
'A': [10, 20, 30, 40, 50],
'B': [15, 25, 35, 45, 55],
'C': [12, 22, 32, 42, 52]
})
# iloc uses integer positions, 1:4 gets rows 1, 2, 3
result = data.iloc[1:4, [0, 2]]
print(result)
Given:
inventory = pd.DataFrame({
'Item': ['Widget', 'Gadget', 'Doohickey', 'Thingamajig', 'Whatsit'],
'Price': [19.99, 49.99, 9.99, 99.99, 29.99],
'Quantity': [100, 5, 200, 3, 50],
'Category': ['A', 'B', 'A', 'C', 'B']
})
Task: Select items where (Price < 30 OR Quantity < 10) AND Category is 'B'.
Show Solution
import pandas as pd
inventory = pd.DataFrame({
'Item': ['Widget', 'Gadget', 'Doohickey', 'Thingamajig', 'Whatsit'],
'Price': [19.99, 49.99, 9.99, 99.99, 29.99],
'Quantity': [100, 5, 200, 3, 50],
'Category': ['A', 'B', 'A', 'C', 'B']
})
# Use parentheses carefully for OR (|) and AND (&)
result = inventory[((inventory['Price'] < 30) | (inventory['Quantity'] < 10)) & (inventory['Category'] == 'B')]
print(result)
Key Takeaways
Series: Labeled 1D Arrays
Series are one-dimensional labeled arrays. Use custom indices for meaningful labels. Support vectorized operations and statistical methods. Create from lists, dicts, or NumPy arrays
DataFrames: 2D Tables
DataFrames are collections of Series sharing an index. Each column is a Series. Create from dicts, lists of dicts, NumPy arrays, or files (CSV, Excel, JSON)
loc vs iloc
loc[] uses labels (row/column names), iloc[] uses integer positions.
loc slicing is inclusive on both ends, iloc is exclusive on the end
Boolean Indexing
Filter data with conditions: df[df['col'] > value]. Combine conditions with
& (AND) or | (OR). Always use parentheses around each condition
Quick Inspection
Use head(), tail(), info(), and describe()
to quickly understand dataset structure, size, data types, and statistical summaries
Vectorized Operations
Apply operations to entire columns/Series at once without loops. Example: df['price'] * 1.1
adds 10% to all prices instantly. Fast and readable
Knowledge Check
Test your understanding of Pandas Series and DataFrames: