Module 7.1

Python Pandas

Learn the core Pandas APIs for data analytics: Series and DataFrame basics, IO, indexing, grouping, and common transformations to prepare data for visualization and modeling.

40 min read
Intermediate
Hands-on examples
What You'll Learn
  • Loading data with read_csv and read_excel
  • Series vs DataFrame and indexing
  • Aggregation with groupby and pivot tables
  • Data cleaning and transformation
  • Exporting data for visualization
Contents
01

Introduction to Pandas

Pandas is the most important Python library for data analysis. Whether you're cleaning messy datasets, calculating business metrics, or preparing data for visualization, Pandas is your go-to tool. In this lesson, you'll learn the fundamentals that data analysts use every single day.

Why Pandas Matters

Imagine you have a spreadsheet with 100,000 rows of sales data. You need to filter by region, calculate monthly totals, and find your top products. In Excel, this might crash your computer. In Pandas, it takes three lines of code and runs in seconds.

Pandas (Python Data Analysis Library) was created in 2008 by Wes McKinney while working at a hedge fund. Today, it's used by data analysts, scientists, and engineers at virtually every company that works with data—from startups to Fortune 500 companies.

Real-world example: A retail analyst needs to analyze 2 years of transaction data (millions of rows) to find seasonal trends. With Pandas, they can load the CSV, group by month, calculate statistics, and export results—all in under 30 seconds.

The Two Core Data Structures

Everything in Pandas revolves around two structures: Series (1D) and DataFrame (2D). Think of a Series as a single column in Excel, and a DataFrame as the entire spreadsheet.
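The one-dimensional vs. two-dimensional distinction can be seen directly in code. This short sketch reuses the region/revenue values from the examples below:

```python
import pandas as pd

# One column of data -> Series (1D)
s = pd.Series([12000, 9500, 15000], name='revenue')

# A whole table -> DataFrame (2D)
df = pd.DataFrame({'region': ['North', 'South', 'East'],
                   'revenue': [12000, 9500, 15000]})

print(s.ndim, df.ndim)   # 1 2
print(df.shape)          # (3, 2) -> 3 rows, 2 columns
```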

Core Concept

Series

A Series is a one-dimensional labeled array. It's like a single column in a spreadsheet, but with a labeled index that lets you access values by name, not just position.

Why it matters: When you select one column from a DataFrame (like df['revenue']), you get a Series. Understanding Series operations is essential for data manipulation.
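A minimal sketch of that relationship, using illustrative sample data: selecting one column really does return a Series, and it keeps the DataFrame's row index.

```python
import pandas as pd

df = pd.DataFrame({'region': ['North', 'South'],
                   'revenue': [12000, 9500]})

col = df['revenue']                 # one column out of the DataFrame
print(isinstance(col, pd.Series))   # True
print(col.index.equals(df.index))   # True - the Series keeps the row index
```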

Core Concept

DataFrame

A DataFrame is a two-dimensional table with labeled rows and columns. Each column is a Series, and all columns share the same row index. This is where you'll spend 90% of your time in Pandas.

Think of it as: An Excel spreadsheet or a SQL table—but with Python's power for filtering, transforming, and analyzing data programmatically.

Installing and Importing Pandas

Before using Pandas, you need to install it (if not already installed) and import it into your Python script. The community convention is to import Pandas as pd—you'll see this in every tutorial and codebase.

# Install pandas (run once in terminal)
pip install pandas
# Import pandas with the standard alias
import pandas as pd

# Check your pandas version
print(pd.__version__)  # e.g., 2.1.0

Pro tip: Always use import pandas as pd. This is the universal convention, and using any other alias will confuse other developers reading your code.

Your First DataFrame

Let's create a simple DataFrame from scratch to understand how it works. The most common way is from a Python dictionary where keys become column names.

import pandas as pd

# Create a DataFrame from a dictionary
df = pd.DataFrame({
    'region': ['North', 'South', 'East', 'West'],
    'revenue': [12000, 9500, 15000, 8000],
    'customers': [150, 120, 180, 95]
})

print(df)

Output:

   region  revenue  customers
0   North    12000        150
1   South     9500        120
2    East    15000        180
3    West     8000         95

Notice the numbers 0, 1, 2, 3 on the left? That's the index—a label for each row. Pandas automatically creates a numeric index if you don't specify one.
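If you want meaningful row labels instead of 0, 1, 2, 3, pass your own with the index parameter. A sketch using the same sample values:

```python
import pandas as pd

# Supply custom row labels via the index parameter
df = pd.DataFrame(
    {'revenue': [12000, 9500, 15000, 8000],
     'customers': [150, 120, 180, 95]},
    index=['North', 'South', 'East', 'West']
)

print(df.loc['East', 'revenue'])  # 15000 - access rows by name
```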

Practice Questions: Introduction

Test your understanding with these hands-on exercises.

Task: Create a DataFrame with 3 products (Laptop, Mouse, Keyboard) and their prices (45000, 500, 1500). Print the DataFrame.

Show Solution
import pandas as pd

df = pd.DataFrame({
    'product': ['Laptop', 'Mouse', 'Keyboard'],
    'price': [45000, 500, 1500]
})
print(df)

Task: Load any CSV file and print its shape (rows, columns) and column names.

Show Solution
import pandas as pd

df = pd.read_csv('sales_data.csv')
print(f"Shape: {df.shape}")  # (rows, cols)
print(f"Columns: {df.columns.tolist()}")

Given:

df = pd.DataFrame({
    'product': ['Laptop', 'Mouse', 'Keyboard'],
    'price': [45000, 500, 1500],
    'quantity': [5, 50, 30]
})

Task: Add a new column total that equals price * quantity.

Show Solution
df['total'] = df['price'] * df['quantity']
print(df)
02

Working with Series

A Series is the building block of everything in Pandas. Before you can work with full datasets, you need to master Series operations. When you select a single column from a DataFrame, you get a Series—so understanding them is essential.

Creating a Series

There are several ways to create a Series. The most common is from a Python list or dictionary. When you use a dictionary, the keys become the index labels.

import pandas as pd

# From a list (auto-generates numeric index)
revenue = pd.Series([12000, 9500, 15000, 8000])
print(revenue)
# Output:
# 0    12000
# 1     9500
# 2    15000
# 3     8000
# dtype: int64

# From a dictionary (keys become index labels)
revenue = pd.Series({
    'North': 12000,
    'South': 9500,
    'East': 15000,
    'West': 8000
})
print(revenue)
# Output:
# North    12000
# South     9500
# East     15000
# West      8000
# dtype: int64

Why use labeled indexes? With a labeled index, you can access data by meaningful names (revenue['North']) instead of numeric positions (revenue.iloc[0]). This makes your code more readable and less error-prone when rows get reordered.
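A quick sketch of the reordering point: after sorting, position 0 points at a different value, but the label still points at the same one.

```python
import pandas as pd

revenue = pd.Series({'North': 12000, 'South': 9500,
                     'East': 15000, 'West': 8000})

sorted_rev = revenue.sort_values()   # reorders rows: West is first now

print(sorted_rev.iloc[0])    # 8000  - position 0 changed meaning
print(sorted_rev['North'])   # 12000 - the label still finds the same value
```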

Accessing Series Values

You can access Series values in multiple ways: by label, by position, or with boolean conditions.

Method           Syntax          Description                     Example
By label         s['label']      Access using index label        revenue['North'] → 12000
By position      s.iloc[n]       Access using numeric position   revenue.iloc[0] → 12000
Multiple labels  s[['a','b']]    Select multiple values          revenue[['North','East']]
Boolean mask     s[condition]    Filter by condition             revenue[revenue > 10000]
# Access by label
print(revenue['North'])  # 12000

# Access by position
print(revenue.iloc[0])  # 12000

# Select multiple regions
print(revenue[['North', 'East']])
# North    12000
# East     15000

# Filter: only regions with revenue > 10000
high_revenue = revenue[revenue > 10000]
print(high_revenue)
# North    12000
# East     15000

Series Operations

One of Pandas' superpowers is vectorized operations—you can perform calculations on entire Series without writing loops. This is both faster and more readable.

# Mathematical operations (applied to all elements)
revenue_in_lakhs = revenue / 100000  # Convert to lakhs
print(revenue_in_lakhs)
# North    0.12
# South    0.095
# East     0.15
# West     0.08

# Add 10% bonus to all regions
bonus = revenue * 0.10
print(bonus)
# North    1200.0
# South     950.0
# East     1500.0
# West      800.0

Common Series Methods

These are the methods you'll use daily when analyzing data:

# Statistical summaries
print(revenue.sum())    # 44500 (total)
print(revenue.mean())   # 11125.0 (average)
print(revenue.max())    # 15000 (highest)
print(revenue.min())    # 8000 (lowest)
print(revenue.std())    # Standard deviation

# Useful information
print(revenue.count())  # 4 (number of values)
print(revenue.idxmax()) # 'East' (label of max value)
print(revenue.idxmin()) # 'West' (label of min value)

# Get all values as NumPy array
print(revenue.values)   # array([12000, 9500, 15000, 8000])
Beginner Tip

The map() Method

The map() method lets you transform values using a dictionary or function. This is incredibly useful for converting categories to numeric values for analysis.

Use case: Converting text ratings (Good, Average, Poor) to numeric scores (3, 2, 1) for calculating averages.

# Map categories to numbers
priority_scores = {'high': 3, 'medium': 2, 'low': 1}
priorities = pd.Series(['high', 'low', 'medium', 'high'])

numeric_scores = priorities.map(priority_scores)
print(numeric_scores)
# 0    3
# 1    1
# 2    2
# 3    3

# Now you can calculate the average priority!
print(numeric_scores.mean())  # 2.25

Practice Questions: Series

Apply what you learned about Series operations.

Task: Create a Series with months (Jan, Feb, Mar) as index and sales values (5000, 7500, 6200). Print the total sales.

Show Solution
sales = pd.Series({
    'Jan': 5000,
    'Feb': 7500,
    'Mar': 6200
})
print(f"Total sales: {sales.sum()}")  # 18700

Given:

scores = pd.Series({'Alice': 85, 'Bob': 92, 'Charlie': 78, 'Diana': 95})

Task: Find who has the highest score and what it is.

Show Solution
top_scorer = scores.idxmax()
top_score = scores.max()
print(f"{top_scorer} scored {top_score}")  # Diana scored 95

Given:

revenue = pd.Series({
    'North': 12000, 'South': 9500, 'East': 15000, 'West': 8000
})

Task: Find regions with revenue above average and calculate their percentage of total revenue.

Show Solution
avg = revenue.mean()
above_avg = revenue[revenue > avg]
pct = (above_avg.sum() / revenue.sum()) * 100
print(f"Above average regions: {list(above_avg.index)}")
print(f"Their share: {pct:.1f}%")  # 60.7%

Given:

ratings = pd.Series(['Excellent', 'Good', 'Poor', 'Good', 'Excellent'])

Task: Map these to scores (Excellent=5, Good=3, Poor=1) and find the average rating.

Show Solution
score_map = {'Excellent': 5, 'Good': 3, 'Poor': 1}
numeric = ratings.map(score_map)
print(f"Average rating: {numeric.mean()}")  # 3.4
03

DataFrame Operations

The DataFrame is where you'll spend 90% of your time in Pandas. It's a 2D table with rows and columns, similar to an Excel spreadsheet or SQL table. This section covers the essential operations every data analyst needs to master.

Exploring Your Data

When you first load a dataset, you need to understand what you're working with. These are the first commands every analyst runs:

import pandas as pd

# Load sample data
df = pd.read_csv('sales_data.csv')

# Quick look at the data
print(df.head())     # First 5 rows
print(df.tail(3))    # Last 3 rows
print(df.shape)      # (rows, columns)
print(df.columns)    # List of column names
print(df.dtypes)     # Data type of each column
Essential Method

df.info()

The info() method gives you a complete overview: column names, data types, non-null counts, and memory usage. It's the best single command to understand your dataset.

Pro tip: Run df.info() immediately after loading any dataset to spot missing values and incorrect data types.

# Complete dataset overview
df.info()
# Output:
# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 1000 entries, 0 to 999
# Data columns (total 5 columns):
#  #   Column    Non-Null Count  Dtype  
# ---  ------    --------------  -----  
#  0   date      1000 non-null   object 
#  1   region    1000 non-null   object 
#  2   revenue   985 non-null    float64  # <-- 15 missing values!
#  3   units     1000 non-null   int64  
#  4   product   1000 non-null   object 
# memory usage: 39.1 KB

Selecting Columns

There are two ways to select columns: bracket notation (always works) and dot notation (convenient but limited).

# Bracket notation (recommended - always works)
revenue = df['revenue']            # Returns a Series
subset = df[['region', 'revenue']] # Returns a DataFrame with 2 columns

# Dot notation (convenient but limited)
revenue = df.revenue               # Works only if the column name is a valid Python identifier
# df.revenue target    # SyntaxError - dot notation can't handle spaces
# df['revenue target'] # Bracket notation works for any column name
Important: Always use bracket notation (df['column']) for column names with spaces, special characters, or names that match Python keywords like class or type.

Filtering Rows

Filtering is how you select specific rows based on conditions. This is one of the most common operations in data analysis.

# Single condition
high_revenue = df[df['revenue'] > 10000]

# Multiple conditions (use & for AND, | for OR)
# IMPORTANT: Wrap each condition in parentheses!
north_high = df[(df['region'] == 'North') & (df['revenue'] > 10000)]
low_or_south = df[(df['revenue'] < 5000) | (df['region'] == 'South')]

# Using isin() for multiple values
target_regions = df[df['region'].isin(['North', 'East', 'West'])]

# String contains (for text searching)
laptop_sales = df[df['product'].str.contains('Laptop', case=False)]
Common mistake: Forgetting parentheses around conditions. df[df['a'] > 5 & df['b'] < 10] causes an error. Always use df[(df['a'] > 5) & (df['b'] < 10)].

Creating New Columns

Adding calculated columns is a core skill. You can perform arithmetic, apply functions, or use conditional logic.

# Simple arithmetic
df['profit_margin'] = (df['revenue'] - df['cost']) / df['revenue'] * 100

# Using apply() for custom functions
def categorize_revenue(amount):
    if amount > 15000:
        return 'High'
    elif amount > 8000:
        return 'Medium'
    else:
        return 'Low'

df['revenue_category'] = df['revenue'].apply(categorize_revenue)

# Using np.where() for simple if-else (faster than apply)
import numpy as np
df['is_profitable'] = np.where(df['revenue'] > df['cost'], 'Yes', 'No')

loc and iloc: Precise Selection

For selecting specific rows and columns together, use loc (by label) or iloc (by position).

Method  Selects By  Example                                             Description
loc     Label       df.loc[0:5, 'revenue']                              Rows 0-5 (inclusive), 'revenue' column
iloc    Position    df.iloc[0:5, 2]                                     First 5 rows, 3rd column
loc     Label       df.loc[df['region']=='North', ['revenue','units']]  Filter + select columns
# Select specific rows and columns by label
df.loc[0:10, ['region', 'revenue', 'units']]

# Select by position (first 10 rows, columns 1-3)
df.iloc[0:10, 1:4]

# Update values using loc
df.loc[df['region'] == 'North', 'revenue'] = df.loc[df['region'] == 'North', 'revenue'] * 1.1

Handling Missing Data

Real-world data is messy. Missing values (shown as NaN) are common. Here's how to find and handle them:

# Check for missing values
print(df.isnull().sum())  # Count NaN per column

# Drop rows with any missing values
df_clean = df.dropna()

# Drop rows with missing values in specific columns
df_clean = df.dropna(subset=['revenue', 'region'])

# Fill missing values
df['revenue'] = df['revenue'].fillna(0)  # Fill with zero
df['revenue'] = df['revenue'].fillna(df['revenue'].mean())  # Fill with average
df['category'] = df['category'].fillna('Unknown')  # Fill with placeholder

Practice Questions: DataFrame

Practice essential DataFrame operations.

Task: Filter a sales DataFrame to show only rows where revenue is greater than 10000.

Show Solution
high_revenue = df[df['revenue'] > 10000]
print(high_revenue)

Task: Create a new DataFrame with only the 'product', 'quantity', and 'price' columns.

Show Solution
subset = df[['product', 'quantity', 'price']]
print(subset.head())

Given: A DataFrame with 'revenue' and 'cost' columns.

Task: Create a 'profit' column (revenue - cost) and a 'profit_margin' column (profit/revenue * 100).

Show Solution
df['profit'] = df['revenue'] - df['cost']
df['profit_margin'] = (df['profit'] / df['revenue']) * 100
print(df[['revenue', 'cost', 'profit', 'profit_margin']].head())

Task: Find all rows where region is 'North' AND revenue is above 10000.

Show Solution
result = df[(df['region'] == 'North') & (df['revenue'] > 10000)]
print(result)

Task: Fill missing revenue values with the median, then create a 'revenue_tier' column that categorizes as 'High' (>15000), 'Medium' (>8000), or 'Low'.

Show Solution
# Fill missing values
df['revenue'] = df['revenue'].fillna(df['revenue'].median())

# Create revenue tiers
def categorize(val):
    if val > 15000:
        return 'High'
    elif val > 8000:
        return 'Medium'
    else:
        return 'Low'

df['revenue_tier'] = df['revenue'].apply(categorize)
print(df['revenue_tier'].value_counts())
04

Reading & Writing Data

Real data doesn't live in Python code—it's in CSV files, Excel spreadsheets, databases, and APIs. Pandas makes it easy to load data from these sources and save your results back. This section covers the most common I/O operations you'll use daily.

Reading CSV Files

CSV (Comma-Separated Values) is the most common format for data exchange. The read_csv() function handles most scenarios with the right parameters.

import pandas as pd

# Basic CSV read
df = pd.read_csv('sales_data.csv')

# With common options
df = pd.read_csv('sales_data.csv',
    parse_dates=['date'],        # Convert date columns to datetime
    usecols=['date', 'revenue', 'region'],  # Load only specific columns
    nrows=1000,                  # Read only first 1000 rows (for testing)
    na_values=['N/A', 'missing'] # Treat these as NaN
)
Parameter    Purpose                               Example
parse_dates  Convert columns to datetime           parse_dates=['date', 'created_at']
usecols      Load only specific columns (faster)   usecols=['id', 'name', 'revenue']
nrows        Read only the first N rows            nrows=100
skiprows     Skip rows at the start of the file    skiprows=2
encoding     Handle special characters             encoding='utf-8' or 'latin-1'
Performance tip: For large files (millions of rows), use usecols to load only the columns you need. Skipping unneeded columns can dramatically reduce memory usage and speed up loading.
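A small sketch of the effect, using an in-memory CSV (the column names and sizes are invented for illustration):

```python
import io
import pandas as pd

# Build a small CSV in memory so the example is self-contained
csv_text = pd.DataFrame({
    'id': range(1000),
    'name': ['item'] * 1000,
    'revenue': [100.0] * 1000,
    'notes': ['a fairly long free-text field'] * 1000,
}).to_csv(index=False)

full = pd.read_csv(io.StringIO(csv_text))
slim = pd.read_csv(io.StringIO(csv_text), usecols=['id', 'revenue'])

print(slim.shape)  # (1000, 2) - only the requested columns
# Fewer columns loaded means less memory held
print(full.memory_usage(deep=True).sum() > slim.memory_usage(deep=True).sum())  # True
```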

Reading Excel Files

Pandas can read Excel files (.xlsx, .xls) directly. You'll need the openpyxl library installed.

# Install openpyxl first: pip install openpyxl

# Read Excel file
df = pd.read_excel('report.xlsx')

# Read specific sheet
df = pd.read_excel('report.xlsx', sheet_name='Q1 Sales')

# Read all sheets (returns a dictionary)
all_sheets = pd.read_excel('report.xlsx', sheet_name=None)
print(all_sheets.keys())  # dict_keys(['Q1 Sales', 'Q2 Sales', 'Q3 Sales'])

Writing Data

After cleaning and transforming your data, you'll want to save the results. Use to_csv() for CSV and to_excel() for Excel.

# Save to CSV (most common)
df.to_csv('cleaned_data.csv', index=False)  # index=False prevents extra column

# Save to Excel
df.to_excel('report.xlsx', sheet_name='Results', index=False)

# Save multiple sheets to Excel
with pd.ExcelWriter('report.xlsx') as writer:
    df_north.to_excel(writer, sheet_name='North', index=False)
    df_south.to_excel(writer, sheet_name='South', index=False)
    df_summary.to_excel(writer, sheet_name='Summary', index=False)
Pro tip: Always use index=False when saving to CSV unless you specifically need the index. Otherwise, you'll get an extra unnamed column when you reload the file.
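Here is a round-trip sketch of that gotcha, kept in memory with io.StringIO so no files are written:

```python
import io
import pandas as pd

df = pd.DataFrame({'region': ['North', 'South'], 'revenue': [12000, 9500]})

# Round-trip WITH the index (the default): an extra unnamed column appears
with_index = io.StringIO(df.to_csv())            # index=True by default
print(pd.read_csv(with_index).columns.tolist())  # ['Unnamed: 0', 'region', 'revenue']

# Round-trip WITHOUT the index: the file stays clean
no_index = io.StringIO(df.to_csv(index=False))
print(pd.read_csv(no_index).columns.tolist())    # ['region', 'revenue']
```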

Practice Questions: Reading & Writing Data

Practice loading and saving data efficiently.

Task: Load only the 'date', 'region', and 'revenue' columns from 'sales_data.csv' with dates parsed as datetime.

Show Solution
df = pd.read_csv('sales_data.csv',
    usecols=['date', 'region', 'revenue'],
    parse_dates=['date']
)
print(df.dtypes)

Task: Drop rows with missing revenue, then save to 'sales_cleaned.csv' without the index column.

Show Solution
df_clean = df.dropna(subset=['revenue'])
df_clean.to_csv('sales_cleaned.csv', index=False)
print(f"Saved {len(df_clean)} rows")

Task: Read the 'Q1_Data' sheet from 'quarterly_report.xlsx' and print the shape.

Show Solution
df = pd.read_excel('quarterly_report.xlsx', sheet_name='Q1_Data')
print(f"Shape: {df.shape}")
print(df.head())

Task: Split data by region into separate DataFrames and save each as a different sheet in 'regional_report.xlsx'.

Show Solution
with pd.ExcelWriter('regional_report.xlsx') as writer:
    for region in df['region'].unique():
        region_df = df[df['region'] == region]
        region_df.to_excel(writer, sheet_name=region, index=False)
print("Report saved with separate sheets per region")
05

Groupby & Aggregation

Groupby is where Pandas shines brightest. Need to calculate total sales by region? Average revenue by month? Top products by category? Groupby handles all of these. It's the Pandas equivalent of SQL's GROUP BY clause, but more powerful and flexible.

Understanding Groupby

The groupby operation follows a "split-apply-combine" pattern:

  1. Split: Divide data into groups based on one or more columns
  2. Apply: Calculate something for each group (sum, mean, count, etc.)
  3. Combine: Merge results back into a single DataFrame
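The three steps above can be sketched by hand, then collapsed into one groupby call (the sample data is illustrative):

```python
import pandas as pd

df = pd.DataFrame({'region': ['North', 'South', 'North'],
                   'revenue': [45000, 500, 800]})

# The pattern done manually:
manual = {}
for region in df['region'].unique():          # 1. split into groups
    group = df[df['region'] == region]
    manual[region] = group['revenue'].sum()   # 2. apply an aggregation
combined = pd.Series(manual).sort_index()     # 3. combine the results

# The same result with one groupby call
auto = df.groupby('region')['revenue'].sum()
print(manual == dict(auto))  # True
```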
Core Concept

Split-Apply-Combine

When you call df.groupby('region'), Pandas creates separate groups for each unique region. Then you apply an aggregation (like .sum()) to each group independently. Finally, results are combined into a new DataFrame.

Real example: "What's the total revenue for each region?" → df.groupby('region')['revenue'].sum()

Basic Groupby Operations

import pandas as pd

# Sample data
df = pd.DataFrame({
    'region': ['North', 'South', 'North', 'East', 'South', 'East'],
    'product': ['Laptop', 'Mouse', 'Mouse', 'Laptop', 'Keyboard', 'Mouse'],
    'revenue': [45000, 500, 800, 52000, 1500, 600]
})

# Basic groupby - sum revenue by region
by_region = df.groupby('region')['revenue'].sum()
print(by_region)
# region
# East     52600
# North    45800
# South     2000

# Multiple aggregations
summary = df.groupby('region')['revenue'].agg(['sum', 'mean', 'count'])
print(summary)
#        sum     mean  count
# East   52600  26300.0    2
# North  45800  22900.0    2
# South   2000   1000.0    2

Named Aggregations (Modern Syntax)

The modern agg() syntax lets you give custom names to your aggregated columns, making your output cleaner and more readable.

# Named aggregations (cleaner output)
result = df.groupby('region').agg(
    total_revenue=('revenue', 'sum'),
    avg_revenue=('revenue', 'mean'),
    num_transactions=('revenue', 'count'),
    max_sale=('revenue', 'max')
).reset_index()

print(result)
#   region  total_revenue  avg_revenue  num_transactions  max_sale
# 0   East          52600      26300.0                 2     52000
# 1  North          45800      22900.0                 2     45000
# 2  South           2000       1000.0                 2      1500
Why reset_index()? By default, groupby makes the grouped column(s) the index. Adding .reset_index() converts it back to a regular column, which is usually what you want for further analysis or export.

Grouping by Multiple Columns

You can group by multiple columns to get more granular breakdowns—like revenue by region AND product.

# Group by multiple columns
breakdown = df.groupby(['region', 'product']).agg(
    total_revenue=('revenue', 'sum'),
    count=('revenue', 'count')
).reset_index()

print(breakdown)
#   region   product  total_revenue  count
# 0   East    Laptop          52000      1
# 1   East     Mouse            600      1
# 2  North    Laptop          45000      1
# 3  North     Mouse            800      1
# 4  South  Keyboard           1500      1
# 5  South     Mouse            500      1

Common Aggregation Functions

Function      Description                    Example
sum           Total of all values            ('revenue', 'sum')
mean          Average value                  ('revenue', 'mean')
count         Number of non-null values      ('revenue', 'count')
min / max     Minimum / maximum value        ('revenue', 'min')
std           Standard deviation             ('revenue', 'std')
first / last  First / last value in group    ('date', 'first')
nunique       Number of unique values        ('product', 'nunique')
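A hedged sketch combining a few of these in one named aggregation (the sample data is invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    'region':  ['North', 'North', 'South'],
    'product': ['Laptop', 'Mouse', 'Mouse'],
    'date':    pd.to_datetime(['2024-01-05', '2024-02-10', '2024-01-20']),
    'revenue': [45000, 800, 500],
})

summary = df.groupby('region').agg(
    first_sale=('date', 'first'),            # first row's date in each group
    unique_products=('product', 'nunique'),  # distinct products per region
    total_revenue=('revenue', 'sum'),
).reset_index()

print(summary)
```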

Transform: Keep Original Rows

Sometimes you want to add group-level calculations back to each row (like "this row's revenue vs. regional average"). The transform() method does exactly this.

# Add regional average to each row
df['regional_avg'] = df.groupby('region')['revenue'].transform('mean')
df['vs_avg'] = df['revenue'] - df['regional_avg']

print(df[['region', 'revenue', 'regional_avg', 'vs_avg']])
#   region  revenue  regional_avg    vs_avg
# 0  North    45000       22900.0   22100.0
# 1  South      500        1000.0    -500.0
# 2  North      800       22900.0  -22100.0
# 3   East    52000       26300.0   25700.0
# 4  South     1500        1000.0     500.0
# 5   East      600       26300.0  -25700.0

Practice Questions: Groupby & Aggregation

Master the art of aggregating and summarizing data.

Task: Calculate total revenue for each product category.

Show Solution
by_product = df.groupby('product')['revenue'].sum()
print(by_product.sort_values(ascending=False))

Task: Count how many transactions occurred in each region.

Show Solution
transaction_count = df.groupby('region').size()
# OR
transaction_count = df['region'].value_counts()
print(transaction_count)

Task: For each region, calculate: total revenue, average revenue, max single sale, and number of unique products.

Show Solution
summary = df.groupby('region').agg(
    total_revenue=('revenue', 'sum'),
    avg_revenue=('revenue', 'mean'),
    max_sale=('revenue', 'max'),
    unique_products=('product', 'nunique')
).reset_index()
print(summary)

Task: Calculate each region's percentage share of total revenue.

Show Solution
region_totals = df.groupby('region')['revenue'].sum().reset_index()
region_totals['pct_share'] = (region_totals['revenue'] / region_totals['revenue'].sum()) * 100
print(region_totals.sort_values('pct_share', ascending=False))

Task: Add a column showing whether each transaction's revenue is above or below its regional average.

Show Solution
df['regional_avg'] = df.groupby('region')['revenue'].transform('mean')
df['performance'] = df.apply(
    lambda x: 'Above Average' if x['revenue'] > x['regional_avg'] else 'Below Average', 
    axis=1
)
print(df[['region', 'revenue', 'regional_avg', 'performance']])

Key Takeaways

Series & DataFrame

Pandas core data structures with labeled axes

Easy I/O

Read and write CSV or Excel with one function call

Groupby

Aggregate and summarize data efficiently

Data Cleaning

Fill missing values and transform columns for analysis

Transformations

Apply functions and compute new columns

Export

Save cleaned data for visualization and modeling

Knowledge Check

Test your understanding of Pandas fundamentals. Answer all questions to check your score.

1 Which function reads a CSV file into a DataFrame?
2 What does df.head() display by default?
3 Which method is used to group rows and compute aggregations?
4 How do you save a DataFrame to a CSV file?
5 What does df.dropna(subset=['revenue']) do?
6 What is the correct way to filter rows where revenue is greater than 10000?
7 Which parameter in read_csv() converts date columns to datetime?
8 What's the difference between loc and iloc?