Module 4.3

Data Cleaning

Learn essential data cleaning techniques to prepare messy, real-world data for analysis. Master handling missing values, removing duplicates, converting data types, and processing text with string operations.

45 min
Intermediate
Hands-on
What You'll Learn
  • Handle missing values with fillna() and dropna()
  • Detect and remove duplicate records
  • Convert between data types safely
  • Process text with string methods
  • Use regex for advanced pattern matching
Contents
01

Why Data Cleaning Matters

Data cleaning is often the most time-consuming yet critical step in the data science process. Real-world data is messy, with missing values, duplicates, inconsistent formats, and errors that can severely impact your analysis if left unaddressed. Studies show that data scientists spend 60-80% of their time cleaning and preparing data before any actual analysis can begin. In this topic, you'll learn the essential techniques to transform raw, messy data into clean, analysis-ready datasets using Pandas' powerful data cleaning tools.

The Reality of Messy Data

When you work with real-world datasets, you'll rarely encounter perfectly clean data. Sensors fail and record null values, users skip form fields, data entry errors introduce typos, systems export data in inconsistent formats, and databases accumulate duplicate records over time. Each of these issues can lead to incorrect conclusions if not properly addressed. For example, calculating the average salary when the dataset contains missing values will give you a skewed result unless you decide how to handle those gaps. Understanding the nature and impact of data quality issues is the first step toward effective data cleaning.
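The skewed-average point is easy to demonstrate. Below is a minimal sketch with made-up salary numbers: pandas skips NaN when computing the mean (silently shrinking the denominator), while naively filling missing entries with 0 drags the average down, so the "right" answer depends on a decision you have to make consciously.

```python
import numpy as np
import pandas as pd

# Hypothetical salaries with one missing entry
salaries = pd.Series([50000, 60000, np.nan, 70000])

# pandas skips NaN by default: the mean of the 3 known values
print(salaries.mean())            # 60000.0

# Treating missing as 0 changes the answer dramatically
print(salaries.fillna(0).mean())  # 45000.0
```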

Common Data Quality Issues

Let's explore the four main categories of data quality problems you'll encounter. Missing values occur when data is absent due to collection failures, user behavior, or system errors. Duplicates happen when the same record appears multiple times, often from merging datasets or data entry mistakes. Type inconsistencies arise when numbers are stored as strings, dates have wrong formats, or categories use mixed capitalization. String problems include extra whitespace, special characters, inconsistent formatting, and data that needs pattern extraction. Each of these issues requires different cleaning strategies, which we'll cover in detail throughout this topic.
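All four categories can appear in a single small DataFrame. Here is an invented example showing a missing value, a fully duplicated row, numbers stored as strings, and stray whitespace in text, along with the quick checks that reveal each one:

```python
import numpy as np
import pandas as pd

# Toy DataFrame exhibiting all four issue categories (values are invented)
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Alice', '  carol '],  # duplicate row + stray whitespace
    'age': ['25', '30', '25', 'unknown'],           # numbers stored as strings
    'score': [88.0, np.nan, 88.0, 91.5],            # missing value
})

print(df['age'].dtype)            # object (strings), not a numeric dtype
print(df.isnull().sum().sum())    # 1 missing value
print(df.duplicated().sum())      # 1 fully duplicated row
```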

Key Concept

Data Cleaning

Data cleaning is the process of detecting and correcting (or removing) corrupt, inaccurate, incomplete, or irrelevant data from a dataset. It involves identifying data quality issues, deciding on appropriate handling strategies, and applying transformations to ensure data integrity and consistency.

Why it matters: Clean data leads to accurate insights, reliable models, and trustworthy business decisions. The garbage-in, garbage-out principle means that no amount of sophisticated analysis can compensate for poor-quality data.

The Data Cleaning Workflow

Effective data cleaning follows a systematic workflow that helps you address quality issues methodically. First, you inspect the data to understand its structure, identify issues, and assess their severity. Next, you develop a cleaning strategy by deciding which issues to fix, which records to keep or remove, and how to handle edge cases. Then you apply transformations using Pandas methods to clean the data according to your strategy. Finally, you validate the results by verifying that cleaning worked as intended and that data quality has improved. This workflow ensures that your cleaning process is reproducible, well-documented, and produces reliable results.

Inspect

Examine data structure, identify missing values, duplicates, and inconsistencies using info(), describe(), and value_counts()

Strategize

Decide how to handle each issue based on domain knowledge, analysis goals, and data characteristics

Transform

Apply Pandas methods like fillna(), drop_duplicates(), astype(), and str operations to clean data

Validate

Verify cleaning results, check data quality metrics, and ensure no unintended changes were introduced

Tools for Data Cleaning

Pandas provides a comprehensive set of methods specifically designed for data cleaning tasks. For missing values, you'll use isnull(), notnull(), fillna(), and dropna() to detect and handle gaps in your data. For duplicates, duplicated() and drop_duplicates() help you find and remove redundant records. Type conversions are handled by astype(), to_numeric(), to_datetime(), and pd.Categorical. String operations use the .str accessor which provides dozens of methods for text processing, plus full regex support. Throughout this topic, we'll explore each of these tools in depth with practical examples that show you exactly when and how to use them.
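To tie the workflow and the tools together, here is a compressed sketch of all four steps on an invented two-column DataFrame. It is only an orientation, not a template; real cleaning decisions depend on the data:

```python
import pandas as pd

df = pd.DataFrame({
    'city': ['NY', 'LA', 'NY', None],
    'temp': ['72', '75', '72', '68'],  # numbers stored as strings
})

# 1. Inspect: structure, missing counts, duplicates
df.info()
print(df.isnull().sum())
print(df.duplicated().sum())       # 1 duplicate row

# 2-3. Strategize and transform: drop dupes, fix types, fill gaps
clean = df.drop_duplicates().copy()
clean['temp'] = clean['temp'].astype(int)
clean['city'] = clean['city'].fillna('Unknown')

# 4. Validate: confirm no missing values or duplicates remain
print(clean.isnull().sum().sum())  # 0
print(clean.duplicated().sum())    # 0
```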

Pro Tip: Always make a copy of your original data before cleaning (df_clean = df.copy()). This lets you compare before and after, undo mistakes, and document the exact transformations applied. It's also good practice to save cleaning steps in a reproducible script or notebook.

Interactive: Data Cleaning Playground


Original "Dirty" Data

Name     Age   Salary
Alice    25    50,000
Bob      NaN   60,000
NaN      30    55,000
Alice    25    50,000   (duplicate row)
Charlie  35    NaN

NaN = missing value

02

Handling Missing Values

Missing values are one of the most common data quality issues you'll encounter. They appear as NaN (Not a Number), None, or empty strings, and can occur for many reasons such as data not being collected, errors during data transfer, or users leaving form fields blank. How you handle missing values can significantly impact your analysis results, so it's crucial to understand detection methods, different strategies for handling them, and when to apply each approach. Pandas provides powerful tools to identify, analyze, and address missing data in various ways.

Detecting Missing Values

Before you can handle missing values, you need to know where they are and how prevalent they are in your dataset. Pandas provides several methods to detect missing data. The isnull() method returns a boolean DataFrame showing True where values are missing and False otherwise. Its inverse, notnull(), returns True for non-missing values. You can combine these with sum() to count missing values per column, or use info() to get a quick overview of non-null counts across all columns. Understanding the pattern and extent of missing data helps you choose the right handling strategy.

import pandas as pd
import numpy as np

# Create sample data with missing values
data = {
    'name': ['Alice', 'Bob', None, 'David', 'Eve'],
    'age': [25, 30, np.nan, 35, 28],
    'salary': [50000, np.nan, 60000, 70000, np.nan],
    'department': ['Sales', 'IT', 'Sales', None, 'IT']
}
df = pd.DataFrame(data)

print("DataFrame with missing values:")
print(df)
#       name   age   salary department
# 0    Alice  25.0  50000.0      Sales
# 1      Bob  30.0      NaN         IT
# 2     None   NaN  60000.0      Sales
# 3    David  35.0  70000.0       None
# 4      Eve  28.0      NaN         IT

# Check for missing values
print("\nMissing values (boolean):")
print(df.isnull())
#     name    age  salary  department
# 0  False  False   False       False
# 1  False  False    True       False
# 2   True   True   False       False
# 3  False  False   False        True
# 4  False  False    True       False

# Count missing values per column
print("\nMissing count per column:")
print(df.isnull().sum())
# name          1
# age           1
# salary        2
# department    1

# Percentage of missing values
print("\nMissing percentage:")
print((df.isnull().sum() / len(df) * 100).round(2))
# name          20.0
# age           20.0
# salary        40.0
# department    20.0

Removing Missing Values with dropna()

The simplest approach to handling missing values is to remove them entirely using dropna(). This method provides several options to control which rows or columns are dropped. By default, dropna() removes any row containing at least one missing value. You can use axis=1 to drop columns instead of rows, or use the thresh parameter to specify a minimum number of non-null values required to keep a row. The subset parameter lets you check for missing values only in specific columns. The inplace parameter determines whether to modify the original DataFrame or return a new one. Be cautious when dropping data, as you might lose valuable information or introduce bias into your analysis.

# Drop rows with any missing values
df_dropped_any = df.dropna()
print("Drop rows with ANY missing values:")
print(df_dropped_any)
#     name   age   salary department
# 0  Alice  25.0  50000.0      Sales

# Drop rows only if ALL values are missing
df_dropped_all = df.dropna(how='all')
print("\nDrop rows only if ALL missing:")
print(df_dropped_all)  # All rows kept (no row is entirely null)

# Drop rows with missing values in specific columns
df_dropped_subset = df.dropna(subset=['age', 'salary'])
print("\nDrop rows with missing age OR salary:")
print(df_dropped_subset)
#     name   age   salary department
# 0  Alice  25.0  50000.0      Sales
# 3  David  35.0  70000.0       None

# Keep rows with at least 3 non-null values
df_dropped_thresh = df.dropna(thresh=3)
print("\nKeep rows with at least 3 non-null values:")
print(df_dropped_thresh)
#     name   age   salary department
# 0  Alice  25.0  50000.0      Sales
# 1    Bob  30.0      NaN         IT
# 3  David  35.0  70000.0       None
# 4    Eve  28.0      NaN         IT
# (row 2 is dropped: it has only 2 non-null values)

# Drop columns with any missing values
df_dropped_cols = df.dropna(axis=1)
print("\nDrop columns with any missing:")
print(df_dropped_cols)
# Empty DataFrame (all columns have at least one null)

Filling Missing Values with fillna()

Instead of removing missing data, you can fill it with appropriate values using fillna(). This method is more flexible than dropping and helps preserve your data. You can fill with a scalar value (like 0 or 'Unknown'), use statistical measures (mean, median, mode), forward-fill (propagate the last valid value) with ffill(), or backward-fill with bfill(). You can also fill different columns with different values by passing a dictionary. The choice of filling strategy depends on your data type and the nature of the missing values. For numerical data, the mean or median is a common choice. For categorical data, the mode or a placeholder like 'Unknown' works well.

# Fill with a scalar value
df_filled_zero = df.fillna(0)
print("Fill all missing with 0:")
print(df_filled_zero)
#     name   age   salary department
# 0  Alice  25.0  50000.0      Sales
# 1    Bob  30.0      0.0         IT
# 2      0   0.0  60000.0      Sales
# 3  David  35.0  70000.0          0
# 4    Eve  28.0      0.0         IT

# Fill different columns with different values
df_filled_dict = df.fillna({
    'name': 'Unknown',
    'age': df['age'].mean(),
    'salary': df['salary'].median(),
    'department': 'Unassigned'
})
print("\nFill with column-specific values:")
print(df_filled_dict)
#        name   age   salary  department
# 0     Alice  25.0  50000.0       Sales
# 1       Bob  30.0  60000.0          IT
# 2   Unknown  29.5  60000.0       Sales
# 3     David  35.0  70000.0  Unassigned
# 4       Eve  28.0  60000.0          IT

# Forward fill (propagate last valid value)
# Note: fillna(method='ffill') is deprecated; use ffill() instead
df_ffill = df.ffill()
print("\nForward fill:")
print(df_ffill)
#     name   age   salary department
# 0  Alice  25.0  50000.0      Sales
# 1    Bob  30.0  50000.0         IT
# 2    Bob  30.0  60000.0      Sales
# 3  David  35.0  70000.0      Sales
# 4    Eve  28.0  70000.0         IT

# Backward fill (fillna(method='bfill') is deprecated; use bfill())
df_bfill = df.bfill()
print("\nBackward fill:")
print(df_bfill)
#     name   age   salary department
# 0  Alice  25.0  50000.0      Sales
# 1    Bob  30.0  60000.0         IT
# 2  David  35.0  60000.0      Sales
# 3  David  35.0  70000.0         IT
# 4    Eve  28.0      NaN         IT

Advanced Missing Value Strategies

Beyond simple filling and dropping, there are more sophisticated strategies for handling missing data. You can use interpolate() to fill missing values using interpolation methods, which works well for time series or sequential data. For categorical data, you might fill with the mode (most frequent value) to maintain the distribution. You can also create indicator columns to flag which values were originally missing, allowing you to track imputation in your analysis. In some cases, missing values themselves carry information (for example, a missing survey response might indicate discomfort with the question), so you might want to treat NaN as its own category. The key is to choose a strategy that makes sense for your specific data and analysis goals.

# Interpolate missing values (useful for time series)
series = pd.Series([1, np.nan, np.nan, 4, 5])
interpolated = series.interpolate()
print("Original series:", series.values)  # [ 1. nan nan  4.  5.]
print("Interpolated:", interpolated.values)  # [1.  2.  3.  4.  5.]

# Fill with mode (most frequent value)
df_mode = df.copy()
df_mode['department'] = df_mode['department'].fillna(
    df_mode['department'].mode()[0]
)
print("\nFill department with mode:")
print(df_mode['department'])
# 0        Sales
# 1           IT
# 2        Sales
# 3           IT  (IT and Sales tie; mode() sorts them, so [0] is 'IT')
# 4           IT

# Create indicator columns for missing values
df_with_indicators = df.copy()
df_with_indicators['age_missing'] = df['age'].isnull()
df_with_indicators['salary_missing'] = df['salary'].isnull()
print("\nDataFrame with missing indicators:")
print(df_with_indicators[['age', 'age_missing', 'salary', 'salary_missing']])
#     age  age_missing   salary  salary_missing
# 0  25.0        False  50000.0           False
# 1  30.0        False      NaN            True
# 2   NaN         True  60000.0           False
# 3  35.0        False  70000.0           False
# 4  28.0        False      NaN            True

Important: The choice between dropping and filling missing values depends on several factors: the percentage of missing data (a common rule of thumb is to consider dropping a column if more than 50% is missing), whether data is missing at random or systematically, the importance of the affected variables, and your analysis requirements. Document your decision and justify it based on the data context.
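
One way to operationalize the 50% rule of thumb is a small helper that reports per-column missing fractions and flags columns above a threshold. This is a sketch with an invented helper name and toy data; the cutoff is a heuristic, not a rule:

```python
import numpy as np
import pandas as pd

def missing_report(df, drop_threshold=0.5):
    """Return per-column missing fraction and columns exceeding the threshold."""
    frac = df.isnull().mean()  # mean of booleans = fraction missing
    return frac, frac[frac > drop_threshold].index.tolist()

df = pd.DataFrame({
    'a': [1, 2, np.nan, 4],              # 25% missing
    'b': [np.nan, np.nan, np.nan, 1.0],  # 75% missing
})

frac, to_drop = missing_report(df)
print(frac.to_dict())   # {'a': 0.25, 'b': 0.75}
print(to_drop)          # ['b']

df_reduced = df.drop(columns=to_drop)
```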

Practice Questions

Task: Create a DataFrame with student data and find the total number of missing values in the entire DataFrame.

Given:

data = {
    'student': ['John', 'Mary', 'Peter', 'Jane'],
    'math': [85, np.nan, 92, 88],
    'english': [78, 85, np.nan, 90],
    'science': [np.nan, 88, 86, np.nan]
}
df = pd.DataFrame(data)

Expected output:

4
Show Solution
import pandas as pd
import numpy as np

data = {
    'student': ['John', 'Mary', 'Peter', 'Jane'],
    'math': [85, np.nan, 92, 88],
    'english': [78, 85, np.nan, 90],
    'science': [np.nan, 88, 86, np.nan]
}
df = pd.DataFrame(data)

# Count total missing values
total_missing = df.isnull().sum().sum()
print(total_missing)  # 4

Task: Fill missing math scores with the mean, missing english scores with the median, and missing science scores with 0.

Given:

data = {
    'student': ['John', 'Mary', 'Peter', 'Jane'],
    'math': [85, np.nan, 92, 88],
    'english': [78, 85, np.nan, 90],
    'science': [np.nan, 88, 86, np.nan]
}
df = pd.DataFrame(data)

Expected output:

  student       math  english  science
0    John  85.000000     78.0      0.0
1    Mary  88.333333     85.0     88.0
2   Peter  92.000000     85.0     86.0
3    Jane  88.000000     90.0      0.0
Show Solution
import pandas as pd
import numpy as np

data = {
    'student': ['John', 'Mary', 'Peter', 'Jane'],
    'math': [85, np.nan, 92, 88],
    'english': [78, 85, np.nan, 90],
    'science': [np.nan, 88, 86, np.nan]
}
df = pd.DataFrame(data)

# Fill with different strategies per column
df_filled = df.fillna({
    'math': df['math'].mean(),
    'english': df['english'].median(),
    'science': 0
})

print(df_filled)

Task: Remove only rows where BOTH math AND science scores are missing, but keep rows where only one subject is missing.

Given:

data = {
    'student': ['John', 'Mary', 'Peter', 'Jane', 'Tom'],
    'math': [85, np.nan, 92, np.nan, 78],
    'science': [90, 88, np.nan, np.nan, 85]
}
df = pd.DataFrame(data)

Expected output:

  student   math  science
0    John   85.0     90.0
1    Mary    NaN     88.0
2   Peter   92.0      NaN
4     Tom   78.0     85.0
Show Solution
import pandas as pd
import numpy as np

data = {
    'student': ['John', 'Mary', 'Peter', 'Jane', 'Tom'],
    'math': [85, np.nan, 92, np.nan, 78],
    'science': [90, 88, np.nan, np.nan, 85]
}
df = pd.DataFrame(data)

# Keep rows where at least one of math or science is not null
df_cleaned = df.dropna(subset=['math', 'science'], how='all')
print(df_cleaned)
03

Detecting & Removing Duplicates

Duplicate records can skew your analysis by inflating counts, distorting statistics, and creating misleading patterns in your data. Duplicates arise from various sources such as data entry errors, system glitches during data transfer, merging datasets without proper validation, or users submitting forms multiple times. Pandas provides powerful methods to identify and remove duplicate rows, giving you control over which duplicates to keep and which columns to consider when determining uniqueness. Understanding how to detect and handle duplicates is essential for maintaining data integrity and producing accurate analysis results.

Identifying Duplicates with duplicated()

The duplicated() method returns a boolean Series indicating which rows are duplicates of earlier rows. By default, it marks all duplicate occurrences except the first one as True, making it easy to see which rows would be removed. You can control this behavior with the keep parameter, which accepts 'first' (default), 'last', or False. Using keep='last' marks all duplicates except the last occurrence, while keep=False marks all duplicates including the first occurrence. You can also specify which columns to consider for duplication using the subset parameter, which is useful when you only care about uniqueness in specific fields rather than entire rows.

import pandas as pd

# Create sample data with duplicates
data = {
    'customer_id': [101, 102, 103, 101, 104, 102],
    'name': ['Alice', 'Bob', 'Charlie', 'Alice', 'David', 'Bob'],
    'email': ['alice@email.com', 'bob@email.com', 'charlie@email.com', 
              'alice@email.com', 'david@email.com', 'bob@email.com'],
    'purchase_amount': [150, 200, 175, 150, 225, 200]
}
df = pd.DataFrame(data)

print("DataFrame with duplicates:")
print(df)
#    customer_id     name            email  purchase_amount
# 0          101    Alice   alice@email.com              150
# 1          102      Bob     bob@email.com              200
# 2          103  Charlie charlie@email.com              175
# 3          101    Alice   alice@email.com              150
# 4          104    David   david@email.com              225
# 5          102      Bob     bob@email.com              200

# Check for duplicates (marks duplicates as True)
print("\nDuplicate rows (keep first):")
print(df.duplicated())
# 0    False
# 1    False
# 2    False
# 3     True  (duplicate of row 0)
# 4    False
# 5     True  (duplicate of row 1)

# Show which rows are duplicates
print("\nDuplicate rows:")
print(df[df.duplicated()])
#    customer_id   name            email  purchase_amount
# 3          101  Alice   alice@email.com              150
# 5          102    Bob     bob@email.com              200

# Mark all occurrences of duplicates (including first)
print("\nAll duplicate occurrences:")
print(df.duplicated(keep=False))
# 0     True  (row 0 and 3 are duplicates)
# 1     True  (row 1 and 5 are duplicates)
# 2    False
# 3     True
# 4    False
# 5     True

Removing Duplicates with drop_duplicates()

Once you've identified duplicates, you can remove them using drop_duplicates(). This method returns a new DataFrame with duplicate rows removed, or modifies the original DataFrame if you set inplace=True. Like duplicated(), it accepts keep='first', keep='last', or keep=False parameters to control which duplicates are retained. The most common use case is to keep the first occurrence and remove subsequent duplicates, but sometimes you might want to keep the most recent record (using keep='last') or remove all duplicate occurrences entirely. The method also supports the subset parameter to consider only specific columns when determining uniqueness, which is extremely useful when some fields may naturally repeat while others should be unique.

# Remove duplicate rows (keep first occurrence)
df_no_dupes = df.drop_duplicates()
print("After removing duplicates:")
print(df_no_dupes)
#    customer_id     name            email  purchase_amount
# 0          101    Alice   alice@email.com              150
# 1          102      Bob     bob@email.com              200
# 2          103  Charlie charlie@email.com              175
# 4          104    David   david@email.com              225

# Keep last occurrence instead
df_keep_last = df.drop_duplicates(keep='last')
print("\nKeep last occurrence:")
print(df_keep_last)
#    customer_id     name            email  purchase_amount
# 2          103  Charlie charlie@email.com              175
# 3          101    Alice   alice@email.com              150
# 4          104    David   david@email.com              225
# 5          102      Bob     bob@email.com              200

# Remove all occurrences of duplicates
df_remove_all = df.drop_duplicates(keep=False)
print("\nRemove all duplicate occurrences:")
print(df_remove_all)
#    customer_id     name            email  purchase_amount
# 2          103  Charlie charlie@email.com              175
# 4          104    David   david@email.com              225

# Count duplicates before removal
print(f"\nOriginal rows: {len(df)}")  # 6
print(f"After removing duplicates: {len(df_no_dupes)}")  # 4
print(f"Duplicates removed: {len(df) - len(df_no_dupes)}")  # 2

Checking Duplicates in Specific Columns

Often, you don't need entire rows to be identical to consider them duplicates. Instead, you might want to ensure uniqueness based on specific key columns like customer_id, email, or order_number. The subset parameter lets you specify which columns to consider when detecting duplicates. This is particularly useful when you have transaction data where some fields naturally repeat (like product names or categories) but you want to ensure uniqueness on identifying fields. You can pass a single column name as a string or multiple columns as a list. This approach is essential for identifying duplicate entries based on business logic rather than exact row matching.

# Create data where some fields repeat naturally
data = {
    'order_id': [1001, 1002, 1003, 1001, 1004],
    'customer': ['Alice', 'Bob', 'Alice', 'Alice', 'Charlie'],
    'product': ['Laptop', 'Mouse', 'Laptop', 'Laptop', 'Keyboard'],
    'quantity': [1, 2, 1, 1, 3]
}
df = pd.DataFrame(data)

print("Orders data:")
print(df)
#    order_id customer  product  quantity
# 0      1001    Alice   Laptop         1
# 1      1002      Bob    Mouse         2
# 2      1003    Alice   Laptop         1
# 3      1001    Alice   Laptop         1  (duplicate order_id)
# 4      1004  Charlie Keyboard         3

# Check duplicates based on order_id only
print("\nDuplicates based on order_id:")
print(df.duplicated(subset=['order_id']))
# 0    False
# 1    False
# 2    False
# 3     True  (order_id 1001 already exists)
# 4    False

# Remove duplicates based on order_id
df_unique_orders = df.drop_duplicates(subset=['order_id'])
print("\nUnique orders:")
print(df_unique_orders)
#    order_id customer  product  quantity
# 0      1001    Alice   Laptop         1
# 1      1002      Bob    Mouse         2
# 2      1003    Alice   Laptop         1
# 4      1004  Charlie Keyboard         3

# Check duplicates based on multiple columns
print("\nDuplicates based on customer AND product:")
print(df.duplicated(subset=['customer', 'product']))
# 0    False
# 1    False
# 2    False
# 3     True  (Alice + Laptop combination already exists)
# 4    False

Advanced Duplicate Handling Strategies

Sometimes simply removing duplicates isn't the best approach. You might want to aggregate duplicate records, keep the row with the maximum or minimum value in a certain column, or flag duplicates for manual review. For aggregation, you can group by key columns and use aggregation functions like sum(), mean(), or max() to combine duplicate records meaningfully. To keep the best record among duplicates, you can sort by a priority column before removing duplicates with keep='first'. For auditing purposes, you might want to count occurrences or add a flag column before removing duplicates. These advanced strategies help you handle duplicates in ways that preserve valuable information rather than simply discarding it.

# Create sales data with duplicate customers
data = {
    'customer': ['Alice', 'Bob', 'Alice', 'Charlie', 'Bob'],
    'date': ['2024-01-15', '2024-01-16', '2024-01-20', 
             '2024-01-17', '2024-01-18'],
    'amount': [100, 150, 200, 175, 125]
}
df = pd.DataFrame(data)

print("Sales data:")
print(df)
#   customer        date  amount
# 0    Alice  2024-01-15     100
# 1      Bob  2024-01-16     150
# 2    Alice  2024-01-20     200
# 3  Charlie  2024-01-17     175
# 4      Bob  2024-01-18     125

# Aggregate duplicates (sum amounts per customer)
df_aggregated = df.groupby('customer', as_index=False)['amount'].sum()
print("\nAggregated by customer:")
print(df_aggregated)
#   customer  amount
# 0    Alice     300  (100 + 200)
# 1      Bob     275  (150 + 125)
# 2  Charlie     175

# Keep row with highest amount for each customer
df_sorted = df.sort_values('amount', ascending=False)
df_max_amount = df_sorted.drop_duplicates(subset=['customer'], keep='first')
print("\nKeep highest purchase per customer:")
print(df_max_amount.sort_index())
#   customer        date  amount
# 1      Bob  2024-01-16     150
# 2    Alice  2024-01-20     200
# 3  Charlie  2024-01-17     175

# Count occurrences before removing duplicates
df_with_count = df.copy()
df_with_count['occurrence_count'] = df_with_count.groupby('customer')['customer'].transform('count')
print("\nWith occurrence counts:")
print(df_with_count)
#   customer        date  amount  occurrence_count
# 0    Alice  2024-01-15     100                 2
# 1      Bob  2024-01-16     150                 2
# 2    Alice  2024-01-20     200                 2
# 3  Charlie  2024-01-17     175                 1
# 4      Bob  2024-01-18     125                 2

Best Practice: Before removing duplicates, always investigate why they exist. Duplicates might indicate data quality issues upstream, system bugs, or legitimate repeated events. Use value_counts() and groupby() to understand patterns in your duplicates, and consider whether aggregation might be more appropriate than deletion.
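
That investigation step can be as simple as counting how often each key repeats and inspecting the repeated rows before deleting anything. A sketch on invented order data:

```python
import pandas as pd

df = pd.DataFrame({
    'order_id': [1001, 1002, 1001, 1003, 1001],
    'amount':   [50, 75, 50, 60, 55],
})

# How often does each order_id appear?
counts = df['order_id'].value_counts()
repeated = counts[counts > 1]
print(repeated)   # order_id 1001 appears 3 times

# Look at the actual repeated rows before deciding: note the third 1001
# has a different amount, so aggregation or review may beat deletion
print(df[df['order_id'].isin(repeated.index)])
```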

Practice Questions

Task: Create a DataFrame with product data and count how many duplicate rows exist.

Given:

data = {
    'product': ['Apple', 'Banana', 'Apple', 'Orange', 'Banana'],
    'price': [1.20, 0.50, 1.20, 0.80, 0.50]
}
df = pd.DataFrame(data)

Expected output:

2
Show Solution
import pandas as pd

data = {
    'product': ['Apple', 'Banana', 'Apple', 'Orange', 'Banana'],
    'price': [1.20, 0.50, 1.20, 0.80, 0.50]
}
df = pd.DataFrame(data)

# Count duplicates (excluding first occurrence)
duplicate_count = df.duplicated().sum()
print(duplicate_count)  # 2

Task: Remove duplicate users based on email address only, keeping the first occurrence.

Given:

data = {
    'user_id': [1, 2, 3, 4, 5],
    'name': ['Alice', 'Bob', 'Alice Smith', 'Charlie', 'Robert'],
    'email': ['alice@test.com', 'bob@test.com', 'alice@test.com', 
              'charlie@test.com', 'bob@test.com']
}
df = pd.DataFrame(data)

Expected output:

   user_id     name            email
0        1    Alice   alice@test.com
1        2      Bob     bob@test.com
3        4  Charlie charlie@test.com
Show Solution
import pandas as pd

data = {
    'user_id': [1, 2, 3, 4, 5],
    'name': ['Alice', 'Bob', 'Alice Smith', 'Charlie', 'Robert'],
    'email': ['alice@test.com', 'bob@test.com', 'alice@test.com', 
              'charlie@test.com', 'bob@test.com']
}
df = pd.DataFrame(data)

# Remove duplicates based on email
df_unique = df.drop_duplicates(subset=['email'], keep='first')
print(df_unique)

Task: For duplicate customer transactions, keep only the most recent transaction based on the date.

Given:

data = {
    'customer': ['Alice', 'Bob', 'Alice', 'Charlie', 'Bob', 'Alice'],
    'date': ['2024-01-10', '2024-01-15', '2024-01-20', 
             '2024-01-12', '2024-01-25', '2024-01-18'],
    'amount': [100, 150, 200, 175, 125, 180]
}
df = pd.DataFrame(data)

Expected output:

  customer        date  amount
2    Alice  2024-01-20     200
4      Bob  2024-01-25     125
3  Charlie  2024-01-12     175
Show Solution
import pandas as pd

data = {
    'customer': ['Alice', 'Bob', 'Alice', 'Charlie', 'Bob', 'Alice'],
    'date': ['2024-01-10', '2024-01-15', '2024-01-20', 
             '2024-01-12', '2024-01-25', '2024-01-18'],
    'amount': [100, 150, 200, 175, 125, 180]
}
df = pd.DataFrame(data)

# Convert date to datetime for proper sorting
df['date'] = pd.to_datetime(df['date'])

# Sort by date descending, then keep first (most recent) per customer
df_sorted = df.sort_values('date', ascending=False)
df_latest = df_sorted.drop_duplicates(subset=['customer'], keep='first')

# Sort by original index or customer name for display
df_latest = df_latest.sort_values('customer')
print(df_latest)
04

Data Type Conversions

Data often comes with incorrect or suboptimal data types that can prevent proper analysis or cause unexpected errors. Numbers might be stored as strings, dates as text, or categories as generic objects. Converting data to the correct types is essential for efficient computation, accurate calculations, and enabling type-specific operations. Pandas provides several methods for type conversion including astype() for general conversions, to_numeric() for handling numeric conversions safely, to_datetime() for parsing dates, and categorical types for memory-efficient storage of repeated values. Understanding when and how to use each conversion method helps you prepare data correctly for analysis.
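
As a preview of the strict-versus-forgiving distinction, the sketch below uses a made-up column: astype() raises on the first bad value, while to_numeric(errors='coerce') turns invalid entries into NaN so the rest of the column still converts.

```python
import pandas as pd

ages = pd.Series(['25', '30', 'unknown', '35'])

# Strict conversion fails on the non-numeric value
try:
    ages.astype(int)
except ValueError as e:
    print('astype failed:', e)

# Forgiving conversion: invalid entries become NaN, the rest convert
clean_ages = pd.to_numeric(ages, errors='coerce')
print(clean_ages.tolist())   # [25.0, 30.0, nan, 35.0]
```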

Checking Current Data Types

Before converting types, you need to understand what types your data currently has. The dtypes attribute shows the data type of each column in your DataFrame. Common types include int64 for integers, float64 for decimal numbers, object for strings or mixed types, bool for boolean values, and datetime64 for dates and times. The info() method provides a comprehensive overview showing the data type, non-null count, and memory usage for each column. Understanding your current data types helps you identify which columns need conversion and what target types are appropriate. Remember that object dtype is Pandas' default for text and mixed-type data, but it's often worth converting to more specific types for better performance and functionality.

import pandas as pd
import numpy as np

# Create sample data with mixed types
data = {
    'customer_id': ['001', '002', '003', '004'],  # Numbers as strings
    'age': ['25', '30', 'unknown', '35'],  # Mixed: numbers and text
    'salary': ['50000', '60000', '55000', '70000'],  # Numbers as strings
    'join_date': ['2020-01-15', '2019-06-20', '2021-03-10', '2020-11-05'],
    'is_active': ['True', 'False', 'True', 'True'],  # Booleans as strings
    'department': ['Sales', 'IT', 'Sales', 'HR']
}
df = pd.DataFrame(data)

print("DataFrame with string types:")
print(df)
#   customer_id      age salary   join_date is_active department
# 0         001       25  50000  2020-01-15      True      Sales
# 1         002       30  60000  2019-06-20     False         IT
# 2         003  unknown  55000  2021-03-10      True      Sales
# 3         004       35  70000  2020-11-05      True         HR

# Check data types
print("\nCurrent data types:")
print(df.dtypes)
# customer_id    object
# age            object
# salary         object
# join_date      object
# is_active      object
# department     object

# Get detailed info
print("\nDataFrame info:")
df.info()
# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 4 entries, 0 to 3
# Data columns (total 6 columns):
#  #   Column       Non-Null Count  Dtype 
# ---  ------       --------------  ----- 
#  0   customer_id  4 non-null      object
#  1   age          4 non-null      object
#  2   salary       4 non-null      object
#  3   join_date    4 non-null      object
#  4   is_active    4 non-null      object
#  5   department   4 non-null      object

Basic Type Conversion with astype()

The astype() method is the most straightforward way to convert column types. You specify the target type as a string (like 'int', 'float', 'str', 'bool') or a NumPy dtype. This method is fast and efficient but will raise an error if the conversion fails (for example, trying to convert 'abc' to an integer). You can convert a single column by calling astype() on that column, or convert multiple columns at once by passing a dictionary to the DataFrame's astype() method. Use astype() when you're confident that all values in the column can be converted to the target type. For data that might have invalid values, consider using the more forgiving to_numeric() or to_datetime() methods instead.

# Convert customer_id from string to integer
df_converted = df.copy()
df_converted['customer_id'] = df_converted['customer_id'].astype(int)
print("customer_id converted to int:")
print(df_converted['customer_id'])
# 0    1
# 1    2
# 2    3
# 3    4

# Convert salary from string to integer
df_converted['salary'] = df_converted['salary'].astype(int)
print("\nsalary converted to int:")
print(df_converted['salary'])
# 0    50000
# 1    60000
# 2    55000
# 3    70000

# Convert is_active from string to boolean
# Note: string 'False' converts to True! Need special handling
df_converted['is_active'] = df_converted['is_active'].map({'True': True, 'False': False})
print("\nis_active converted to bool:")
print(df_converted['is_active'])
# 0     True
# 1    False
# 2     True
# 3     True

# Try converting age (will fail due to 'unknown')
try:
    df['age'].astype(int)
except ValueError as e:
    print(f"\nError converting age: {e}")
    # Error: invalid literal for int() with base 10: 'unknown'

# Check new types
print("\nNew data types:")
print(df_converted.dtypes)
# customer_id     int64
# age            object  (not converted yet)
# salary          int64
# join_date      object  (not converted yet)
# is_active        bool
# department     object
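The multi-column dictionary form of astype() mentioned above can be sketched like this, using a small hypothetical frame:

```python
import pandas as pd

# Hypothetical toy frame with numbers stored as strings
df_multi = pd.DataFrame({
    'customer_id': ['001', '002', '003'],
    'salary': ['50000', '60000', '55000']
})

# Pass a dict mapping column names to target types
# to convert several columns in one astype() call
df_multi = df_multi.astype({'customer_id': int, 'salary': int})
print(df_multi.dtypes)
# customer_id    int64
# salary         int64
```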

Safe Numeric Conversion with to_numeric()

When converting strings to numbers, you often encounter invalid values that can't be converted. The to_numeric() function provides robust handling of such cases through its errors parameter. Setting errors='coerce' converts invalid values to NaN instead of raising an error, allowing the conversion to proceed. Using errors='ignore' returns the original data unchanged if any value fails to convert, but note that this option is deprecated as of pandas 2.2 and is best avoided in new code. You can also specify the downcast parameter to automatically convert to the smallest numeric type that fits the data, saving memory. The to_numeric() function is ideal for cleaning messy data where you expect some values to be non-numeric and want to handle them gracefully rather than stopping the entire process.

# Safely convert age to numeric (coerce invalid to NaN)
df_converted['age'] = pd.to_numeric(df['age'], errors='coerce')
print("age converted with coercion:")
print(df_converted['age'])
# 0    25.0
# 1    30.0
# 2     NaN  (unknown became NaN)
# 3    35.0

# Example with downcast to save memory
large_numbers = pd.Series(['100', '200', '300', '400'])
converted_default = pd.to_numeric(large_numbers)
converted_small = pd.to_numeric(large_numbers, downcast='integer')

print("\nWithout downcast:", converted_default.dtype)  # int64
print("With downcast:", converted_small.dtype)  # int16 (smallest type that fits 100-400)

# Handle mixed data with different strategies
mixed_data = pd.Series(['100', '200', 'invalid', '300'])

# Strategy 1: Coerce to NaN
result_coerce = pd.to_numeric(mixed_data, errors='coerce')
print("\nCoerce strategy:")
print(result_coerce)
# 0    100.0
# 1    200.0
# 2      NaN
# 3    300.0

# Strategy 2: Ignore (returns the input unchanged if conversion fails)
# Note: errors='ignore' is deprecated since pandas 2.2; prefer errors='coerce'
result_ignore = pd.to_numeric(mixed_data, errors='ignore')
print("\nIgnore strategy:")
print(result_ignore)
# 0        100
# 1        200
# 2    invalid  (kept as string)
# 3        300

Date and Time Conversion with to_datetime()

Converting strings to datetime objects is crucial for time-based analysis. The to_datetime() function is remarkably flexible, automatically recognizing many common date formats without explicit specification. You can provide a format string using strftime codes for faster parsing and to handle unusual formats. Like to_numeric(), it supports the errors parameter to handle invalid dates gracefully. The dayfirst parameter helps with ambiguous dates like '01-02-2020' (is it January 2nd or February 1st?). Once converted to datetime, you gain access to powerful time-based operations like date arithmetic, time-based indexing, and temporal aggregations. You can also extract components like year, month, day using the dt accessor.

# Convert join_date to datetime
df_converted['join_date'] = pd.to_datetime(df['join_date'])
print("join_date converted to datetime:")
print(df_converted['join_date'])
# 0   2020-01-15
# 1   2019-06-20
# 2   2021-03-10
# 3   2020-11-05
print("Type:", df_converted['join_date'].dtype)  # datetime64[ns]

# Extract date components
df_converted['year'] = df_converted['join_date'].dt.year
df_converted['month'] = df_converted['join_date'].dt.month
df_converted['day_of_week'] = df_converted['join_date'].dt.day_name()
print("\nExtracted date components:")
print(df_converted[['join_date', 'year', 'month', 'day_of_week']])
#    join_date  year  month day_of_week
# 0 2020-01-15  2020      1   Wednesday
# 1 2019-06-20  2019      6    Thursday
# 2 2021-03-10  2021      3   Wednesday
# 3 2020-11-05  2020     11    Thursday

# Handle different date formats
dates_various = pd.Series([
    '2020-01-15',
    '01/15/2020',
    '15-Jan-2020',
    'January 15, 2020'
])
# pandas 2.0+ raises on mixed formats unless you pass format='mixed'
converted_dates = pd.to_datetime(dates_various, format='mixed')
print("\nVarious formats converted:")
print(converted_dates)
# 0   2020-01-15
# 1   2020-01-15
# 2   2020-01-15
# 3   2020-01-15

# Specify an explicit format for faster parsing
# (with a format string, dayfirst is unnecessary)
dates_custom = pd.Series(['15/01/2020', '20/06/2019'])
parsed = pd.to_datetime(dates_custom, format='%d/%m/%Y')
print("\nCustom format parsed:")
print(parsed)
# 0   2020-01-15
# 1   2019-06-20
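The dayfirst and errors parameters described above can be illustrated with a small sketch (the date strings here are hypothetical):

```python
import pandas as pd

# Ambiguous dates: '01-02-2020' could be Jan 2 or Feb 1
ambiguous = pd.Series(['01-02-2020', '05-03-2020'])

us_style = pd.to_datetime(ambiguous)                 # month first by default
eu_style = pd.to_datetime(ambiguous, dayfirst=True)  # day first

print(us_style.dt.month.tolist())  # [1, 5]
print(eu_style.dt.month.tolist())  # [2, 3]

# Invalid dates become NaT (not-a-time) with errors='coerce'
messy = pd.Series(['2020-01-15', 'not a date'])
coerced = pd.to_datetime(messy, errors='coerce')
print(coerced.isna().tolist())  # [False, True]
```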

Categorical Data Type for Memory Efficiency

When a column has repeated values (like department names, product categories, or status codes), converting it to categorical type can dramatically reduce memory usage and improve performance. Categorical dtype stores the unique values once and uses integer codes internally to represent each occurrence. This is especially beneficial for large datasets with low-cardinality columns. You can convert to categorical using astype('category') or pd.Categorical(). Categorical columns also enable category-specific operations like reordering categories, adding or removing categories, and ensuring only valid values are allowed. The memory savings can be substantial, for example, a column with 1 million rows but only 5 unique values might use 90% less memory as categorical.

# Check memory usage before conversion
print("Memory usage before:")
print(df['department'].memory_usage(deep=True))  # ~280 bytes

# Convert to categorical
df_converted['department'] = df_converted['department'].astype('category')
print("\nMemory usage after categorical conversion:")
print(df_converted['department'].memory_usage(deep=True))  # ~190 bytes

# View categorical information
print("\nCategorical info:")
print(df_converted['department'].cat.categories)  # ['HR' 'IT' 'Sales']
print(df_converted['department'].cat.codes)
# 0    2  (Sales)
# 1    1  (IT)
# 2    2  (Sales)
# 3    0  (HR)

# Create ordered categorical (useful for rankings)
df_converted['education'] = pd.Categorical(
    ['Bachelor', 'Master', 'PhD', 'Bachelor'],
    categories=['High School', 'Bachelor', 'Master', 'PhD'],
    ordered=True
)
print("\nOrdered categorical:")
print(df_converted['education'])
# 0    Bachelor
# 1      Master
# 2         PhD
# 3    Bachelor

# Comparisons work with ordered categories
print("\nIs education >= Bachelor?")
print(df_converted['education'] >= 'Bachelor')
# 0     True
# 1     True
# 2     True
# 3     True

# Final data types overview
print("\nFinal data types:")
print(df_converted.dtypes)
# customer_id             int64
# age                   float64
# salary                  int64
# join_date      datetime64[ns]
# is_active                bool
# department           category
# year                    int32
# month                   int32
# day_of_week            object
# education            category
Memory Tip: For large datasets, check memory usage with df.memory_usage(deep=True) before and after type conversions. Converting object columns to appropriate types (numeric, datetime, or category) can reduce memory usage by 50-90%, enabling you to work with larger datasets and speeding up operations
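A rough way to verify the memory-savings claim above on a low-cardinality column (exact numbers vary by platform and pandas version):

```python
import pandas as pd
import numpy as np

# 1 million rows drawn from only 5 unique values
rng = np.random.default_rng(0)
depts = pd.Series(rng.choice(['Sales', 'IT', 'HR', 'Ops', 'Legal'], size=1_000_000))

obj_mem = depts.memory_usage(deep=True)
cat_mem = depts.astype('category').memory_usage(deep=True)

print(f"object:   {obj_mem / 1e6:.1f} MB")
print(f"category: {cat_mem / 1e6:.1f} MB")
print(f"saved:    {1 - cat_mem / obj_mem:.0%}")
```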

Practice Questions

Task: Convert the price column from strings to floats.

Given:

data = {
    'product': ['Apple', 'Banana', 'Cherry'],
    'price': ['1.50', '0.75', '2.25']
}
df = pd.DataFrame(data)

Expected output:

  product  price
0   Apple   1.50
1  Banana   0.75
2  Cherry   2.25
Type: float64
Show Solution
import pandas as pd

data = {
    'product': ['Apple', 'Banana', 'Cherry'],
    'price': ['1.50', '0.75', '2.25']
}
df = pd.DataFrame(data)

# Convert to float
df['price'] = df['price'].astype(float)
print(df)
print("Type:", df['price'].dtype)

Task: Convert the score column to numeric, handling invalid values by converting them to NaN.

Given:

data = {
    'student': ['Alice', 'Bob', 'Charlie', 'David'],
    'score': ['85', '92', 'absent', '78']
}
df = pd.DataFrame(data)

Expected output:

  student  score
0   Alice   85.0
1     Bob   92.0
2 Charlie    NaN
3   David   78.0
Show Solution
import pandas as pd

data = {
    'student': ['Alice', 'Bob', 'Charlie', 'David'],
    'score': ['85', '92', 'absent', '78']
}
df = pd.DataFrame(data)

# Convert with coercion
df['score'] = pd.to_numeric(df['score'], errors='coerce')
print(df)

Task: Convert order_id to int, order_date to datetime, amount to float, and status to category.

Given:

data = {
    'order_id': ['1001', '1002', '1003'],
    'order_date': ['2024-01-15', '2024-01-16', '2024-01-17'],
    'amount': ['99.99', '149.50', '75.25'],
    'status': ['Pending', 'Shipped', 'Pending']
}
df = pd.DataFrame(data)

Expected output: All columns with appropriate types

Show Solution
import pandas as pd

data = {
    'order_id': ['1001', '1002', '1003'],
    'order_date': ['2024-01-15', '2024-01-16', '2024-01-17'],
    'amount': ['99.99', '149.50', '75.25'],
    'status': ['Pending', 'Shipped', 'Pending']
}
df = pd.DataFrame(data)

# Convert multiple columns
df['order_id'] = df['order_id'].astype(int)
df['order_date'] = pd.to_datetime(df['order_date'])
df['amount'] = df['amount'].astype(float)
df['status'] = df['status'].astype('category')

print(df)
print("\nData types:")
print(df.dtypes)
05

String Operations & Regex

Text data is notoriously messy and requires specialized cleaning techniques. String columns often contain extra whitespace, inconsistent capitalization, special characters, and embedded patterns that need extraction. Pandas provides the .str accessor which gives you access to dozens of string methods similar to Python's built-in string operations, but optimized for vectorized operations on entire columns. Additionally, regex (regular expressions) support enables powerful pattern matching and extraction for complex text processing tasks. Mastering string operations is essential for cleaning names, emails, phone numbers, addresses, and any other text data in your datasets.

Basic String Methods

The .str accessor provides methods for common string operations like changing case, removing whitespace, and checking string content. Use str.lower(), str.upper(), and str.title() to standardize capitalization across your data. The str.strip(), str.lstrip(), and str.rstrip() methods remove leading and trailing whitespace, which is crucial for cleaning user input data. You can check string properties with str.startswith(), str.endswith(), and str.contains(). The str.replace() method substitutes text patterns, while str.len() returns string lengths. These operations work element-wise on entire Series, making them much faster than looping through individual values. Remember that str methods return new Series and don't modify the original data unless you assign the result back.

import pandas as pd

# Create sample data with messy strings
data = {
    'name': ['  Alice Smith  ', 'BOB JONES', 'charlie brown', ' David Lee'],
    'email': ['ALICE@EMAIL.COM', 'bob@email.com  ', 'Charlie@Email.COM', 'david@email.com'],
    'phone': ['(555) 123-4567', '555-234-5678', '555.345.6789', '5554567890']
}
df = pd.DataFrame(data)

print("Original messy data:")
print(df)
#               name           email          phone
# 0   Alice Smith    ALICE@EMAIL.COM  (555) 123-4567
# 1        BOB JONES    bob@email.com    555-234-5678
# 2   charlie brown  Charlie@Email.COM   555.345.6789
# 3       David Lee    david@email.com     5554567890

# Clean whitespace
df['name'] = df['name'].str.strip()
df['email'] = df['email'].str.strip()
print("\nAfter removing whitespace:")
print(df['name'])
# 0      Alice Smith
# 1        BOB JONES
# 2    charlie brown
# 3        David Lee

# Standardize capitalization
df['name'] = df['name'].str.title()  # Title case
df['email'] = df['email'].str.lower()  # Lowercase
print("\nStandardized case:")
print(df[['name', 'email']])
#            name            email
# 0   Alice Smith  alice@email.com
# 1     Bob Jones    bob@email.com
# 2  Charlie Brown  charlie@email.com
# 3     David Lee    david@email.com

# Check string properties
print("\nNames starting with 'A':")
print(df['name'].str.startswith('A'))
# 0     True
# 1    False
# 2    False
# 3    False

print("\nEmails containing 'alice':")
print(df['email'].str.contains('alice'))
# 0     True
# 1    False
# 2    False
# 3    False

# Get string lengths
print("\nPhone number lengths:")
print(df['phone'].str.len())
# 0    14  (has parentheses and hyphens)
# 1    12  (has hyphens)
# 2    12  (has dots)
# 3    10  (digits only)

String Replacement and Cleaning

Real-world text data often contains unwanted characters, inconsistent formatting, or values that need standardization. The str.replace() method lets you substitute patterns with new values, supporting both literal strings and regex patterns. You can remove characters by replacing them with empty strings. For multiple replacements, chain multiple replace() calls or use a regex pattern with alternation. The str.removeprefix() and str.removesuffix() methods (Python 3.9+) provide clean ways to remove known prefixes or suffixes. When dealing with special characters or punctuation, you might need to escape them in regex patterns or use string methods strategically. These techniques are essential for cleaning phone numbers, product codes, currency values, and other formatted text.

# Clean phone numbers by removing non-digit characters
df['phone_clean'] = df['phone'].str.replace(r'[^0-9]', '', regex=True)
print("Cleaned phone numbers:")
print(df[['phone', 'phone_clean']])
#            phone phone_clean
# 0  (555) 123-4567  5551234567
# 1    555-234-5678  5552345678
# 2    555.345.6789  5553456789
# 3      5554567890  5554567890

# Replace multiple characters
text_data = pd.Series(['$100.50', '$2,450.75', '$999.99'])
cleaned = text_data.str.replace('$', '', regex=False)
cleaned = cleaned.str.replace(',', '', regex=False)
print("\nCleaned currency:")
print(cleaned.astype(float))
# 0     100.50
# 1    2450.75
# 2     999.99

# Remove a known prefix with replace()
products = pd.Series(['PRD-001', 'PRD-002', 'PRD-003'])
product_ids = products.str.replace('PRD-', '', regex=False)
print("\nProduct IDs without prefix:")
print(product_ids)
# 0    001
# 1    002
# 2    003

# Replace with mapping (useful for standardization)
status_data = pd.Series(['active', 'Active', 'ACTIVE', 'inactive', 'Inactive'])
status_mapping = {
    'active': 'Active',
    'Active': 'Active',
    'ACTIVE': 'Active',
    'inactive': 'Inactive',
    'Inactive': 'Inactive'
}
standardized = status_data.replace(status_mapping)
print("\nStandardized status:")
print(standardized)
# 0      Active
# 1      Active
# 2      Active
# 3    Inactive
# 4    Inactive
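The removeprefix() and removesuffix() methods mentioned earlier (available on the .str accessor since pandas 1.4) can be sketched like this; unlike replace(), they only strip the text when it actually sits at the start or end of the string:

```python
import pandas as pd

# removeprefix() strips only a leading match; other values pass through
products = pd.Series(['PRD-001', 'PRD-002', 'X-003'])
print(products.str.removeprefix('PRD-'))
# 0      001
# 1      002
# 2    X-003  (unchanged: no 'PRD-' prefix)

# removesuffix() strips only a trailing match
files = pd.Series(['report.csv', 'data.csv', 'notes.txt'])
print(files.str.removesuffix('.csv'))
# 0       report
# 1         data
# 2    notes.txt
```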

Splitting and Extracting Strings

Often you need to split text into multiple columns or extract specific parts of strings. The str.split() method divides strings based on a delimiter, returning a Series of lists by default. Use expand=True to create new columns from split results. The str.extract() method uses regex groups to extract specific patterns into new columns, which is perfect for parsing structured text like dates, IDs, or codes. You can use str.slice() or bracket notation to extract substrings by position. For more complex extractions, str.extractall() captures all matches of a pattern, not just the first. These techniques are invaluable for breaking apart full names into first and last names, extracting domain names from emails, or parsing product codes into components.

# Split full names into first and last names
names = pd.Series(['Alice Smith', 'Bob Jones', 'Charlie Brown'])
split_names = names.str.split(' ', expand=True)
split_names.columns = ['first_name', 'last_name']
print("Split names:")
print(split_names)
#   first_name last_name
# 0      Alice     Smith
# 1        Bob     Jones
# 2    Charlie     Brown

# Extract domain from email
emails = pd.Series(['alice@gmail.com', 'bob@company.co.uk', 'charlie@test.org'])
domains = emails.str.extract(r'@(.+)', expand=False)
print("\nExtracted domains:")
print(domains)
# 0        gmail.com
# 1    company.co.uk
# 2         test.org

# Extract multiple groups with regex
product_codes = pd.Series(['ABC-2023-001', 'XYZ-2024-042', 'ABC-2023-015'])
extracted = product_codes.str.extract(r'([A-Z]+)-(\d{4})-(\d+)')
extracted.columns = ['category', 'year', 'number']
print("\nExtracted product code parts:")
print(extracted)
#   category  year number
# 0      ABC  2023    001
# 1      XYZ  2024    042
# 2      ABC  2023    015

# Slice strings by position
ids = pd.Series(['USER001', 'USER002', 'USER003'])
numeric_part = ids.str[4:]  # From position 4 to end
print("\nNumeric part of IDs:")
print(numeric_part)
# 0    001
# 1    002
# 2    003

# Extract all matches (not just first)
text = pd.Series(['Prices: $10, $20, $30', 'Cost: $5'])
all_prices = text.str.extractall(r'\$(\d+)')
print("\nAll extracted prices:")
print(all_prices)
#        0
#   match  
# 0 0      10
#   1      20
#   2      30
# 1 0       5

Regular Expressions (Regex) for Pattern Matching

Regular expressions provide a powerful language for describing text patterns. Common regex patterns include \d for digits, \w for word characters, \s for whitespace, . for any character, and quantifiers like + (one or more), * (zero or more), and {n,m} (between n and m). Character sets like [0-9] or [A-Za-z] match specific ranges. The ^ and $ anchors match the start and end of strings. Parentheses create capture groups for extraction. You can use regex with str.contains() for filtering, str.replace() for substitution, and str.extract() for parsing. Regex is essential for validating email formats, extracting phone numbers, parsing dates, and finding patterns in unstructured text. While regex syntax can be complex, mastering even basic patterns dramatically expands your text processing capabilities.

# Validate email format with regex
emails = pd.Series([
    'alice@email.com',
    'invalid.email',
    'bob@test.co.uk',
    '@missing.com',
    'charlie@domain.org'
])
valid_emails = emails.str.match(r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$')
print("Valid email format:")
print(pd.DataFrame({'email': emails, 'valid': valid_emails}))
#              email  valid
# 0   alice@email.com   True
# 1    invalid.email  False
# 2    bob@test.co.uk   True
# 3     @missing.com  False
# 4  charlie@domain.org   True

# Extract dates in various formats
text_with_dates = pd.Series([
    'Meeting on 2024-01-15',
    'Due: 01/20/2024',
    'Event 2024.03.10',
    'Deadline 15-Jan-2024'
])
# Match YYYY-MM-DD or YYYY.MM.DD
dates = text_with_dates.str.extract(r'(\d{4}[-./]\d{2}[-./]\d{2})')
print("\nExtracted dates:")
print(dates)
#             0
# 0  2024-01-15
# 1         NaN
# 2  2024.03.10
# 3         NaN

# Find phone numbers in text
text_with_phones = pd.Series([
    'Call me at 555-123-4567',
    'Office: (555) 234-5678',
    'Mobile 555.345.6789'
])
phones = text_with_phones.str.extract(r'(\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4})')
print("\nExtracted phone numbers:")
print(phones)
#                 0
# 0    555-123-4567
# 1  (555) 234-5678
# 2    555.345.6789

# Replace using regex groups (format phone numbers)
messy_phones = pd.Series(['5551234567', '555-234-5678', '(555)3456789'])
formatted = messy_phones.str.replace(
    r'^\(?(\d{3})\)?[-.\s]?(\d{3})[-.\s]?(\d{4})$',
    r'(\1) \2-\3',
    regex=True
)
print("\nFormatted phone numbers:")
print(formatted)
# 0    (555) 123-4567
# 1    (555) 234-5678
# 2    (555) 345-6789

# Filter rows using regex patterns
products = pd.DataFrame({
    'product': ['Laptop-PRO', 'Mouse-Basic', 'Laptop-Standard', 'Keyboard-PRO']
})
pro_products = products[products['product'].str.contains(r'-PRO$', regex=True)]
print("\nPRO products:")
print(pro_products)
#        product
# 0   Laptop-PRO
# 3  Keyboard-PRO

Advanced String Techniques

Beyond basic operations, Pandas offers advanced string methods for complex scenarios. Use str.cat() to concatenate strings across Series or join elements within Series. The str.pad() and str.zfill() methods add padding to create fixed-width strings, useful for IDs or codes. You can normalize text with str.normalize() for Unicode handling. The str.translate() method provides efficient character-level replacements using translation tables. For performance with large datasets, consider pre-compiling regex patterns or using string methods instead of regex when possible. You can also combine string operations with other Pandas functionality like groupby to aggregate text, or use string methods in query conditions for powerful filtering. These advanced techniques help you handle edge cases and optimize text processing in production data pipelines.

# Concatenate strings from multiple columns
df_concat = pd.DataFrame({
    'first': ['Alice', 'Bob'],
    'last': ['Smith', 'Jones']
})
df_concat['full_name'] = df_concat['first'].str.cat(df_concat['last'], sep=' ')
print("Concatenated names:")
print(df_concat)
#   first   last  full_name
# 0 Alice  Smith  Alice Smith
# 1   Bob  Jones    Bob Jones

# Pad strings to fixed width
order_ids = pd.Series([1, 42, 123])
padded = order_ids.astype(str).str.zfill(5)  # Zero-fill to 5 characters
print("\nPadded order IDs:")
print(padded)
# 0    00001
# 1    00042
# 2    00123

# Clean and standardize text with multiple operations
messy_data = pd.Series(['  Product-A  ', ' PRODUCT-B', 'product-c '])
cleaned = (messy_data
    .str.strip()           # Remove whitespace
    .str.lower()           # Lowercase
    .str.replace('-', ' ', regex=False)  # Replace hyphens with spaces
    .str.title()           # Title case
)
print("\nCleaned product names:")
print(cleaned)
# 0    Product A
# 1    Product B
# 2    Product C

# Combine string operations with groupby: group on a key derived from text
sales = pd.DataFrame({
    'product': ['Laptop-PRO', 'Mouse-Basic', 'Laptop-Standard', 'Keyboard-PRO'],
    'quantity': [1, 2, 1, 3]
})
base_name = sales['product'].str.split('-').str[0]  # 'Laptop', 'Mouse', ...
product_summary = sales.groupby(base_name)['quantity'].sum()
print("\nSales by base product name:")
print(product_summary)
# product
# Keyboard    3
# Laptop      2
# Mouse       2

# Use string methods in boolean indexing
data = pd.DataFrame({
    'code': ['A001', 'B002', 'A003', 'C004']
})
a_codes = data[data['code'].str.startswith('A')]
print("\nCodes starting with 'A':")
print(a_codes)
#    code
# 0  A001
# 2  A003
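A short sketch of the str.pad() and str.translate() methods from the paragraph above (the example values are hypothetical):

```python
import pandas as pd

# Left-pad codes to a fixed width of 6 characters
codes = pd.Series(['A1', 'B22', 'C333'])
print(codes.str.pad(width=6, side='left', fillchar='0'))
# 0    0000A1
# 1    000B22
# 2    00C333

# str.translate() does fast character-level substitution
# via a translation table built with str.maketrans()
table = str.maketrans({'é': 'e', 'ü': 'u'})
names = pd.Series(['José', 'Müller'])
print(names.str.translate(table))
# 0      Jose
# 1    Muller
```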
Key Concept

Regular Expressions (Regex)

Regular expressions are sequences of characters that define search patterns for text matching and manipulation. They provide a concise and flexible means for identifying strings of text, such as particular characters, words, or patterns. In Pandas, regex is integrated into string methods through the regex=True parameter.

Why it matters: Regex enables powerful pattern matching that would require hundreds of lines of code using basic string methods. It's essential for validating data formats (emails, phone numbers), extracting structured information from unstructured text, and performing complex text transformations

Regex Resources: Learning regex takes practice but pays huge dividends. Use regex101.com to test patterns interactively, refer to regex cheat sheets for common patterns, and start with simple patterns before building complex ones. Remember that str.contains() and str.match() differ: match() anchors the pattern at the start of the string (use str.fullmatch() to require the entire string to match), while contains() finds the pattern anywhere

Practice Questions

Task: Remove leading/trailing whitespace and convert all names to title case.

Given:

names = pd.Series(['  alice smith  ', 'BOB JONES', '  charlie BROWN'])
df = pd.DataFrame({'name': names})

Expected output:

            name
0    Alice Smith
1      Bob Jones
2  Charlie Brown
Show Solution
import pandas as pd

names = pd.Series(['  alice smith  ', 'BOB JONES', '  charlie BROWN'])
df = pd.DataFrame({'name': names})

# Clean and standardize
df['name'] = df['name'].str.strip().str.title()
print(df)

Task: Extract the domain name (everything after @) from each email address.

Given:

emails = pd.Series(['alice@gmail.com', 'bob@company.co.uk', 'charlie@test.org'])
df = pd.DataFrame({'email': emails})

Expected output:

            email         domain
0   alice@gmail.com      gmail.com
1  bob@company.co.uk  company.co.uk
2  charlie@test.org       test.org
Show Solution
import pandas as pd

emails = pd.Series(['alice@gmail.com', 'bob@company.co.uk', 'charlie@test.org'])
df = pd.DataFrame({'email': emails})

# Extract domain using regex
df['domain'] = df['email'].str.extract(r'@(.+)', expand=False)
print(df)

Task: Take messy phone numbers and format them consistently as (XXX) XXX-XXXX.

Given:

phones = pd.Series(['5551234567', '555-234-5678', '(555) 345-6789', '555.456.7890'])
df = pd.DataFrame({'phone': phones})

Expected output:

            phone    formatted_phone
0      5551234567  (555) 123-4567
1    555-234-5678  (555) 234-5678
2  (555) 345-6789  (555) 345-6789
3    555.456.7890  (555) 456-7890
Show Solution
import pandas as pd

phones = pd.Series(['5551234567', '555-234-5678', '(555) 345-6789', '555.456.7890'])
df = pd.DataFrame({'phone': phones})

# Remove all non-digits first, then reformat
df['formatted_phone'] = df['phone'].str.replace(r'[^0-9]', '', regex=True)
df['formatted_phone'] = df['formatted_phone'].str.replace(
    r'^(\d{3})(\d{3})(\d{4})$',
    r'(\1) \2-\3',
    regex=True
)
print(df)

Key Takeaways

Data Cleaning is Critical

Real-world data is rarely clean. Handling missing values, duplicates, type inconsistencies, and text problems is essential before analysis. Clean data leads to accurate insights and reliable models

Missing Value Strategies

Use dropna() to remove incomplete records or fillna() to impute values. Choose strategies based on data context: mean/median for numeric, mode for categorical, or forward/backward fill for time series
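A minimal sketch of these strategies on toy data (assuming pandas 2.x, where ffill() replaces the deprecated fillna(method='ffill')):

```python
import pandas as pd
import numpy as np

s = pd.Series([10.0, np.nan, 30.0, np.nan])

print(s.fillna(s.mean()))  # impute with the mean of the observed values
print(s.ffill())           # carry the last valid value forward (time series)
print(s.dropna())          # or drop the incomplete records entirely
```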

Duplicate Detection

Duplicates skew analysis results. Use duplicated() to identify them and drop_duplicates() to remove them. The subset parameter lets you check uniqueness on specific key columns rather than entire rows
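A quick sketch of duplicate detection on a hypothetical orders table:

```python
import pandas as pd

# One exact duplicate row (the second copy of order 2)
df = pd.DataFrame({
    'order_id': [1, 2, 2, 3],
    'amount': [100, 250, 250, 75]
})

print(df.duplicated())       # True only for the repeated row
print(df.drop_duplicates())  # keeps the first occurrence of each row

# Check uniqueness on a key column, keeping the last occurrence
print(df.drop_duplicates(subset='order_id', keep='last'))
```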

Type Conversions Matter

Correct data types enable proper operations and save memory. Use astype() for simple conversions, to_numeric() with errors='coerce' for messy numeric data, to_datetime() for dates, and categorical for repeated values

String Operations Power

The .str accessor provides vectorized string methods for efficient text processing. Use strip() to remove whitespace, lower()/upper() for case standardization, replace() for substitution, and split() for parsing

Regex for Pattern Matching

Regular expressions enable powerful pattern matching for validation, extraction, and transformation. Use str.contains() for filtering, str.extract() for parsing patterns, and str.replace() with regex=True for complex substitutions

Knowledge Check

1 What does fillna(method='ffill') do?
2 Which parameter in drop_duplicates() keeps the last occurrence of duplicates?
3 What is the advantage of converting a column to categorical dtype?
4 Which method safely converts strings to numbers, replacing invalid values with NaN?
5 What does the str.extract() method do?
6 In regex, what does the pattern \d{3} match?
0/6 answered