Module 2.4

Excel Data Cleaning

Transform messy, inconsistent data into clean, analysis-ready datasets. Master the essential techniques for finding errors, removing duplicates, standardizing text, and ensuring data quality in Excel.

45 min read
Beginner to Intermediate
Hands-on Examples
What You'll Learn
  • Finding and fixing data errors with wildcards
  • Text cleaning functions (TRIM, CLEAN, PROPER)
  • Removing duplicate records efficiently
  • Handling missing values strategically
  • Text to Columns and Flash Fill techniques
  • Power Query for automated data cleaning
Contents
01

Introduction to Data Cleaning

Data cleaning is the process of detecting and correcting errors, inconsistencies, and inaccuracies in datasets. Studies show that analysts spend 60-80% of their time cleaning data before analysis. Mastering these techniques will dramatically improve your efficiency and ensure accurate insights.

Data Cleaning

The process of identifying and correcting (or removing) corrupt, inaccurate, incomplete, irrelevant, or improperly formatted data from a dataset to improve data quality and reliability for analysis.

Common Data Quality Issues

Real-world data rarely arrives in perfect condition. Understanding common issues helps you spot and fix them quickly:

Issue Type Example Impact
Missing Values Empty cells in customer phone numbers Incomplete analysis, cannot contact customers
Duplicate Records Same customer entered twice Inflated counts, skewed averages
Inconsistent Formatting "New York", "NY", "new york" Difficult to group and analyze
Extra Spaces " John Smith " (leading/trailing spaces) Matching and sorting problems
Wrong Data Type Numbers stored as text: "1000" Calculations fail, sorting incorrect
Spelling Errors "Californa" instead of "California" Categories split incorrectly
Industry Insight: According to IBM, poor data quality costs the US economy around $3.1 trillion annually. Clean data isn't just about accuracy - it directly impacts business decisions and revenue.

Data Cleaning Workflow

Follow this systematic approach to clean your data efficiently:

1
Inspect the Data

Review the dataset to identify obvious issues. Check for missing values, strange characters, inconsistent formats.

2
Remove Duplicates

Identify and eliminate duplicate records that could skew your analysis results.

3
Fix Structural Issues

Standardize formats, correct typos, remove extra spaces, and ensure consistent naming conventions.

4
Validate Data

Verify data types, check ranges, and ensure values make sense in their context.

Excel's Data Cleaning Toolkit

Excel provides a comprehensive set of tools for data cleaning. Here's your essential toolkit:

Find & Replace
Search and fix patterns across entire datasets
Text Functions
TRIM, PROPER, UPPER, LOWER, CLEAN
Remove Duplicates
Eliminate redundant records automatically
Filters
Isolate and examine specific data subsets
Text to Columns
Split data into separate fields
Data Validation
Prevent errors before they occur
Best Practice: Always create a backup copy of your original data before cleaning. Use "Save As" to create a working copy, or duplicate the worksheet within the same workbook.

Practice Questions: Introduction to Data Cleaning

Test your understanding with these hands-on exercises.

Scenario: You receive a customer dataset with the following records. Identify all data quality issues:

Name           Email                    Phone          City
John Smith     john@email.com           555-1234       New York
 Jane Doe      JANE@EMAIL.COM           5551234        new york
John Smith     john@email.com           555-1234       New York
Bob Johnson    bob@email              (555) 123-4567   NY
Sarah  Lee     sarah@email.com          555.1234       New york

Solution:

Issues identified:

  • Extra spaces: " Jane Doe" has leading space, "Sarah Lee" has double space
  • Inconsistent formatting: Emails in different cases (lowercase vs UPPERCASE)
  • Duplicate record: John Smith appears twice with identical data
  • Incomplete data: Bob's email is missing domain
  • Format variations: Phone numbers in different formats
  • Inconsistent city names: "New York", "new york", "NY" all represent same city

In the next sections, we'll learn how to fix each of these issues systematically using Excel's tools.

Task: You have a sales dataset with 10,000 rows containing: duplicate orders, missing customer names, inconsistent product codes, and extra spaces in all text fields. Rank these issues by cleaning priority and explain your reasoning.

Show Solution

Recommended priority order:

  1. Remove duplicate orders (HIGH): Duplicates directly affect revenue calculations and inventory counts. Fix first to avoid grossly incorrect totals.
  2. Standardize product codes (HIGH): Inconsistent codes prevent proper grouping and analysis. Essential for accurate reporting.
  3. Remove extra spaces (MEDIUM): Spaces cause matching issues but don't affect numerical calculations. Clean before doing text-based analysis.
  4. Fill missing customer names (LOW): Missing names are problematic but don't prevent order analysis. Can use customer ID for most analyses.

General rule: Prioritize issues that affect calculations and aggregations first, then handle formatting issues, finally address missing data that has workarounds.

Task: Create a data cleaning checklist for a new employee database that includes: names, email addresses, phone numbers, hire dates, and departments. List specific checks to perform for each field.

Show Solution
Field Cleaning Checks
Names ✓ Remove extra spaces (TRIM)
✓ Standardize capitalization (PROPER)
✓ Check for special characters
✓ Verify no numbers in names
Email ✓ Standardize to lowercase
✓ Verify @ symbol present
✓ Check for domain (.com, .org, etc.)
✓ Remove duplicates
Phone ✓ Standardize format (xxx-xxx-xxxx)
✓ Remove non-numeric characters
✓ Verify 10 digits
✓ Check for missing values
Hire Date ✓ Verify date format consistency
✓ Check dates not in future
✓ Ensure dates after company founding
✓ Convert text dates to date format
Department ✓ Standardize department names
✓ Fix spelling variations
✓ Create dropdown validation list
✓ Check against approved dept list
02

Finding and Fixing Errors

Excel provides powerful tools to identify and correct data quality issues. From Find & Replace to conditional formatting, you can quickly spot problems and fix them systematically across large datasets.

Find & Replace Basics

The Find & Replace tool (Ctrl+H) is your first line of defense against data inconsistencies. It can fix thousands of errors in seconds.

Keyboard Shortcuts
Ctrl + F Open Find dialog
Ctrl + H Open Find & Replace
Alt + A Replace All
Alt + R Replace (single)
Common Options
  • Match case: Distinguish between uppercase and lowercase
  • Match entire cell: Find exact matches only
  • Within: Search Sheet or Workbook
  • Search: By Rows or By Columns
  • Look in: Formulas, Values, or Comments

Using Wildcards

Wildcards enable pattern-based searching, making Find & Replace exponentially more powerful:

Wildcard Meaning Example Matches
* Any number of characters Find: data* "data", "database", "data science"
? Single character Find: sm?th "smith", "smyth"
~ Literal wildcard character Find: ~* Actual asterisk symbol "*"

Practical Wildcard Examples:

Example 1: Standardizing Phone Numbers
Problem: Phone numbers in mixed formats
(555) 123-4567
555-123-4567
555.123.4567

Find & Replace Steps:
1. Find: (*)*(*)*    Replace with: (leave empty)
   - Removes all parentheses, hyphens, periods
   
2. Result: 5551234567
   
3. Find: (???)???-????    (use formula after cleaning)
   - =TEXT(A2,"(000) 000-0000")
   - Creates: (555) 123-4567
Example 2: Fixing Inconsistent State Names
Problem: States in different formats
California, CA, Calif, Calif.

Solution - Replace each variation:
Find: California    Replace: CA
Find: Calif         Replace: CA
Find: Calif.        Replace: CA

Tip: Check "Match entire cell" to avoid 
     replacing partial matches
Example 3: Removing Extra Characters
Problem: Product codes with unwanted prefixes
OLD-12345, OLD-67890, OLD-11111

Solution:
Find: OLD-    Replace: (leave empty)

Result: 12345, 67890, 11111

Or use wildcard for all prefixes:
Find: *-      Replace: (leave empty)
Pro Tip: Before using Replace All, click "Find All" to preview all matches. This prevents accidental replacements and helps you refine your search pattern.

Conditional Formatting for Error Detection

Conditional formatting visually highlights potential problems, making them easy to spot in large datasets:

Example 4: Highlighting Duplicates
Steps:
1. Select your data range
2. Home tab → Conditional Formatting
3. Highlight Cells Rules → Duplicate Values
4. Choose formatting (e.g., Red Fill)

Result: All duplicate entries highlighted in red

Advanced: Highlight duplicates EXCEPT first occurrence
Formula: =COUNTIF($A$2:$A2,$A2)>1
Apply to: $A$2:$A$100
Example 5: Finding Data Entry Errors
Scenario: Age column should only contain values 18-65

Steps:
1. Select age column (B2:B100)
2. Conditional Formatting → New Rule
3. Use formula: =OR(B2<18, B2>65)
4. Set format: Red background

Result: Invalid ages highlighted automatically

Common validation formulas:
- Empty cells: =ISBLANK(A2)
- Non-numeric: =NOT(ISNUMBER(A2))
- Too long: =LEN(A2)>50
- Contains numbers in text: =SUMPRODUCT(--ISNUMBER(--MID(A2,ROW($1:$50),1)))>0
Warning: Replace All cannot be undone after saving. Always test on a small sample first, and keep a backup copy of your original data.

Practice Questions: Finding and Fixing Errors

Test your understanding with these hands-on exercises.

Dataset: You have email addresses with inconsistent domains that need standardization.

john@gmail.com
jane@GMAIL.COM
bob@gmail
sarah@gmail.co
mike@yahoo.com

Task: Standardize all Gmail addresses to lowercase "gmail.com" format.

Show Solution
Steps:
1. Find: GMAIL.COM    Replace: gmail.com
   (Fixes uppercase)

2. Find: @gmail       Replace: @gmail.com
   Options: Match entire cell = OFF
   (Completes incomplete domains)

3. Find: @gmail.co    Replace: @gmail.com
   (Fixes common typo)

Result: All Gmail addresses standardized to @gmail.com

Challenge: Use wildcards to clean these product SKUs. Remove all prefix variations but keep the numeric code.

PROD-12345
SKU-67890
ITEM-11111
CODE-99999

Desired result: Just the numbers (12345, 67890, 11111, 99999)
Show Solution
Method 1: Single Wildcard Replace
Find: *-
Replace: (leave empty)

This removes everything before and including the hyphen.

Method 2: Formula Approach (More Flexible)
=RIGHT(A2,5)
or
=VALUE(RIGHT(A2,5))  -- Converts to number

Method 3: Text to Columns
1. Select column
2. Data → Text to Columns
3. Choose Delimited → Next
4. Select Other, enter: -
5. Finish
6. Delete first column (prefixes)

Each method works - choose based on whether you need to keep original data or if the pattern changes.

Scenario: Create a conditional formatting rule that highlights email addresses that might be invalid. An email should have:

  • Exactly one @ symbol
  • At least one character before @
  • At least one dot after @
  • At least two characters after the last dot

Solution:

Complex Formula Approach:
=OR(
  LEN(A2)-LEN(SUBSTITUTE(A2,"@",""))<>1,
  FIND("@",A2)<=1,
  ISNUMBER(SEARCH("*@*.*",A2))=FALSE,
  LEN(A2)-FIND(".",A2,FIND("@",A2))<2
)

Breakdown:
1. LEN(A2)-LEN(SUBSTITUTE(A2,"@",""))<>1
   - Checks if there's exactly one @ symbol
   
2. FIND("@",A2)<=1
   - Checks @ is not at start
   
3. ISNUMBER(SEARCH("*@*.*",A2))=FALSE
   - Checks for dot after @ symbol
   
4. LEN(A2)-FIND(".",A2,FIND("@",A2))<2
   - Checks at least 2 chars after last dot

Apply to range: A2:A100
Format: Red fill with dark red text

Simpler Alternative (Less Precise):
=NOT(AND(
  LEN(A2)>5,
  ISNUMBER(SEARCH("*@*.*",A2))
))

Tests basic structure only but easier to understand.
03

Text Cleaning Functions

Excel offers specialized functions for cleaning text data. These functions remove unwanted spaces, fix capitalization, eliminate non-printable characters, and standardize text formats for consistent analysis.

Essential Text Functions

Master these five functions to handle 90% of text cleaning tasks:

Function Purpose Syntax Example Input → Output
TRIM Removes extra spaces =TRIM(text) " John Smith " → "John Smith"
CLEAN Removes non-printable characters =CLEAN(text) "Data\n\rScience" → "DataScience"
PROPER Capitalizes first letter of each word =PROPER(text) "john SMITH" → "John Smith"
UPPER Converts to uppercase =UPPER(text) "New York" → "NEW YORK"
LOWER Converts to lowercase =LOWER(text) "New York" → "new york"

TRIM Function Deep Dive

TRIM is your most-used text cleaning function. It removes leading spaces, trailing spaces, and reduces multiple spaces between words to single spaces.

Example 1: Basic TRIM Usage
Cell A2: "  John   Smith  "

Formula: =TRIM(A2)
Result: "John Smith"

Before TRIM:
LEN(A2) = 17 characters  -- Includes extra spaces

After TRIM:
LEN(TRIM(A2)) = 10 characters  -- Clean text

Use Case: Cleaning imported data with spacing issues
Example 2: TRIM with Other Functions
Problem: Match names for VLOOKUP, but some have spaces

Data in A2: " John Smith "
Lookup table has: "John Smith"

Solution:
=VLOOKUP(TRIM(A2), $D$2:$E$100, 2, FALSE)

Result: Successful match after trimming spaces

Tip: Always TRIM before comparisons and lookups

CLEAN Function

CLEAN removes the first 32 non-printable characters in the ASCII set. Essential when importing data from databases or websites.

Example 3: Removing Line Breaks and Special Characters
Problem: Data imported with line breaks (CHAR(10))

Cell A2 contains:
"Product: Laptop
Price: $999
Status: Available"

Formula: =CLEAN(A2)
Result: "Product: LaptopPrice: $999Status: Available"

Combined with TRIM:
=TRIM(CLEAN(A2))

Best Practice: Use both together for imported data:
=TRIM(CLEAN(A2))
Common Import Issues: Data from web scraping, CSV files, or database exports often contains CHAR(10) line feeds, CHAR(13) carriage returns, and CHAR(160) non-breaking spaces. CLEAN handles the first two; use SUBSTITUTE for CHAR(160).

Case Conversion Functions

Standardize text capitalization for consistency and accurate grouping.

Example 4: PROPER for Name Standardization
Mixed case names need standardization:

A2: "JOHN SMITH"       → =PROPER(A2) → "John Smith"
A3: "jane DOE"         → =PROPER(A3) → "Jane Doe"  
A4: "bob JOHNSON jr."  → =PROPER(A4) → "Bob Johnson Jr."

Common Use Cases:
- Customer names in CRM systems
- Product names from multiple sources
- Address standardization

Caution: PROPER capitalizes after apostrophes:
"o'brien" → "O'Brien" ✓ Correct
"mcdonald" → "Mcdonald" ✗ Should be "McDonald"
Example 5: UPPER and LOWER for Codes and Emails
Use UPPER for product codes (consistency):
A2: "sku-12345a"  → =UPPER(A2) → "SKU-12345A"

Use LOWER for email addresses (standard format):
A2: "John@EMAIL.com"  → =LOWER(A2) → "john@email.com"

Database matching example:
=VLOOKUP(UPPER(A2), ProductTable, 2, FALSE)
-- Ensures match regardless of input case

SUBSTITUTE Function

Replace specific text within a string. More precise than Find & Replace for formulas.

Example 6: Removing Specific Characters
Syntax: =SUBSTITUTE(text, old_text, new_text, [instance])

Remove all hyphens from phone numbers:
A2: "555-123-4567"
=SUBSTITUTE(A2,"-","")
Result: "5551234567"

Replace multiple characters (nest functions):
A2: "(555) 123-4567"
=SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(A2,"(",""),")",""),"-","")
Result: "555 1234567"

Remove specific instance (3rd hyphen only):
A2: "2024-01-15"
=SUBSTITUTE(A2,"-","",2)  -- Replaces 2nd hyphen
Result: "2024-0115"

Combining Functions for Power Cleaning

Stack multiple functions for comprehensive cleaning in a single formula:

Example 7: Ultimate Cleaning Formula
Clean imported text completely:

=TRIM(CLEAN(PROPER(A2)))

What it does:
1. PROPER(A2) - Standardizes capitalization
2. CLEAN(...) - Removes non-printable characters  
3. TRIM(...) - Removes extra spaces

Example:
A2: "  jOHN   SMITH\n  "
Result: "John Smith"

Real-world application:
Import 1000 customer names → Apply once → Copy values → Delete helper column
Example 8: Custom Email Cleaning Formula
Standardize email addresses:

=LOWER(TRIM(SUBSTITUTE(SUBSTITUTE(A2," ",""),"@","@")))

Steps:
1. Remove all spaces: SUBSTITUTE(A2," ","")
2. Trim (just in case): TRIM(...)
3. Convert to lowercase: LOWER(...)

Before: " John.Smith @EMAIL.COM "
After: "john.smith@email.com"

Add validation:
=IF(ISNUMBER(SEARCH("@",A2)),
   LOWER(TRIM(A2)),
   "Invalid Email"
)
Pro Workflow: Create a helper column with cleaning formulas → Apply to entire dataset → Copy cleaned values → Paste as Values over original → Delete helper column. This preserves your original data structure.

Practice Questions: Text Cleaning Functions

Task: Clean this customer name list to proper case with no extra spaces.

A2: "  JOHN SMITH  "
A3: "jane   DOE"
A4: "BOB johnson  "
A5: "  sarah   LEE"
Show Solution

Solution:

Formula in B2:
=TRIM(PROPER(A2))

Copy down to B5

Results:
B2: "John Smith"
B3: "Jane Doe"
B4: "Bob Johnson"
B5: "Sarah Lee"

Next steps:
1. Copy B2:B5
2. Paste Values over A2:A5
3. Delete column B

Challenge: Convert all phone numbers to format: (XXX) XXX-XXXX

A2: "555-123-4567"
A3: "5551234567"
A4: "(555) 123-4567"
A5: "555.123.4567"
Show Solution

Solution:

Step 1: Strip all formatting (Column B)
=SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(A2,
  "-",""),"(",""),")",""),".","")

Result: All become "5551234567"

Step 2: Reformat (Column C)
="("&LEFT(B2,3)&") "&MID(B2,4,3)&"-"&RIGHT(B2,4)

Or use TEXT function if numeric:
=TEXT(VALUE(B2),"(000) 000-0000")

Final Results (C2:C5):
"(555) 123-4567"
"(555) 123-4567"  
"(555) 123-4567"
"(555) 123-4567"

Alternative: Single mega-formula
=TEXT(VALUE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(
  SUBSTITUTE(A2,"-",""),"(",""),")",""),".","")),
  "(000) 000-0000")

Scenario: Data imported from a web form contains multiple issues. Create a single formula to clean it completely.

Issues in data:
- Leading/trailing spaces
- Non-printable characters (line breaks)
- Inconsistent capitalization  
- Multiple spaces between words
- Non-breaking spaces (CHAR(160))

Sample A2: "  JoHN   SMITH\n  "
            (with CHAR(160) instead of normal space)
Show Solution

Solution:

Comprehensive cleaning formula:

=TRIM(CLEAN(PROPER(SUBSTITUTE(A2,CHAR(160)," "))))

Breaking it down from inside out:

1. SUBSTITUTE(A2,CHAR(160)," ")
   - Replaces non-breaking spaces with normal spaces
   
2. PROPER(...)
   - Standardizes to Title Case
   
3. CLEAN(...)
   - Removes line breaks and non-printable chars
   
4. TRIM(...)
   - Removes extra spaces

Test cases:
Input: "  jOHN   SMITH\n  " 
Output: "John Smith"

Input: "SARAH\nLEE  "
Output: "Sarah Lee"

Input: "  bob   johnson  "
Output: "Bob Johnson"

Production tip: 
Create named range for formula:
Name: CleanText
Formula: =LAMBDA(text,TRIM(CLEAN(PROPER(
         SUBSTITUTE(text,CHAR(160)," ")))))
         
Usage: =CleanText(A2)
04

Removing Duplicate Records

Duplicate data can skew analysis results and waste storage space. Learn how to identify, highlight, and remove duplicate records while preserving data integrity and unique information.

Identifying Duplicates

Before removing duplicates, you need to understand what makes a record a duplicate in your context.

Duplicate Record

A row in a dataset where one or more columns contain the same values as another row. Can be exact duplicates (all columns match) or partial duplicates (specific columns match).

Duplicate Type Definition Example
Exact Duplicate All column values identical Two rows with same customer ID, name, email, phone
Partial Duplicate Key columns match, others differ Same email but different names (possible data entry error)
Case-Sensitive Duplicate Same text, different case "john@email.com" vs "John@email.com"

Highlighting Duplicates

Use conditional formatting to visualize duplicates before removal:

Method 1: Built-in Duplicate Highlighting
Steps:
1. Select your data range (e.g., A2:A100)
2. Home tab → Conditional Formatting
3. Highlight Cells Rules → Duplicate Values
4. Choose formatting (Red Fill recommended)
5. Click OK

Result: All duplicate values highlighted in red

Best for: Quick visual check of single column

Limitation: Highlights all occurrences including first
Method 2: Highlight Duplicates (Keep First)
Custom formula to highlight only repeat occurrences:

Select range: A2:A100
Conditional Formatting → New Rule → Use formula

Formula: =COUNTIF($A$2:$A2,$A2)>1

Format: Red background

How it works:
- For A2: Counts "A2" in range A2:A2 (count = 1) → Not highlighted
- For A3: If duplicate of A2, counts "A3" in A2:A3 (count = 2) → Highlighted
- For A4: Counts "A4" in A2:A4 → Highlights if duplicate

Result: First occurrence remains unmarked, duplicates highlighted
Pro Tip: Add a helper column with COUNTIF to see duplicate counts: =COUNTIF($A$2:$A$100,A2). This shows "2" for duplicates, "3" for triplicates, etc., helping you understand data quality issues.

Remove Duplicates Tool

Excel's Remove Duplicates feature is powerful but irreversible. Always backup first!

Example 1: Basic Duplicate Removal
Dataset: Customer list with duplicate entries

Name           Email               Phone
John Smith     john@email.com      555-1234
Jane Doe       jane@email.com      555-5678
John Smith     john@email.com      555-1234  ← Duplicate
Bob Johnson    bob@email.com       555-9012

Steps:
1. Select entire data range including headers
2. Data tab → Data Tools → Remove Duplicates
3. Check "My data has headers"
4. Select columns to check: All columns
5. Click OK

Result: 
- Keeps first John Smith row
- Deletes second John Smith row
- Message: "1 duplicate values found and removed; 3 unique values remain"

Warning: This action cannot be undone!
Example 2: Remove Duplicates by Specific Columns
Scenario: Remove duplicates based on email only

Customer_ID    Name           Email               Phone
1001          John Smith     john@email.com      555-1234
1002          J. Smith       john@email.com      555-5678  ← Same email
1003          Jane Doe       jane@email.com      555-9012

Problem: Same person entered twice with different names

Steps:
1. Select all data (A1:D3)
2. Data → Remove Duplicates
3. Uncheck all columns EXCEPT "Email"
4. Click OK

Result: Keeps row 1001, removes row 1002
Rationale: First occurrence preserved, duplicates removed

Use Case: Finding customers with multiple accounts
Critical Warning: Remove Duplicates permanently deletes rows. There is no undo after you save the file. Always work on a copy or create a backup worksheet first!

Advanced Duplicate Detection

Use formulas for more control over duplicate identification:

Example 3: COUNTIF for Duplicate Counting
Add helper column to count occurrences:

A2: john@email.com
B2: =COUNTIF($A$2:$A$100,$A2)

Results:
- B2 = 1: Unique value
- B2 = 2: Appears twice (1 duplicate)
- B2 = 3: Appears three times (2 duplicates)

Filter B column > 1 to see all duplicates

Advanced: Flag first occurrence differently
C2: =IF(COUNTIF($A$2:$A2,$A2)>1,"Duplicate","First/Unique")

Results:
Row 2: "First/Unique" (first john@email.com)
Row 5: "Duplicate" (second john@email.com)
Example 4: Multi-Column Duplicate Detection
Find duplicates based on multiple columns:

A2: Name      B2: Email            C2: Duplicate?
John Smith   john@email.com

Formula in C2:
=COUNTIFS($A$2:$A$100,$A2,$B$2:$B$100,$B2)>1

Returns TRUE if both name AND email match another row

Alternative: Create unique key
D2: =A2&"-"&B2
Result: "John Smith-john@email.com"

Then use COUNTIF on column D:
E2: =COUNTIF($D$2:$D$100,D2)>1

Duplicate Extraction Technique

Sometimes you want to keep duplicates for review rather than delete them:

Example 5: Extract Duplicates to Separate Sheet
Method: Use Advanced Filter

Step 1: Add helper column (D2)
=COUNTIF($A$2:$A$100,$A2)>1

Step 2: Apply filter
1. Select data range (A1:D100)
2. Data → Filter
3. Click filter arrow in column D
4. Check only TRUE values
5. Shows only duplicate rows

Step 3: Copy to new sheet
1. Select visible rows (Ctrl+Click to skip hidden)
2. Copy (Ctrl+C)
3. Create new sheet "Duplicates"
4. Paste (Ctrl+V)

Result: Original data intact, duplicates isolated for review
Data Validation Approach: Prevent duplicates from being entered in the first place using Data Validation with custom formula: =COUNTIF($A$2:$A$100,A2)<=1. This rejects duplicate entries during data entry.

Best Practices for Duplicate Management

Do This
  • Always create a backup before removing duplicates
  • Use conditional formatting to preview before deletion
  • Document your criteria for identifying duplicates
  • Review a sample of flagged duplicates manually
  • Keep a log of how many duplicates were removed
Avoid This
  • Removing duplicates without understanding the cause
  • Deleting duplicates based on wrong columns
  • Assuming all duplicates are errors (some may be valid)
  • Forgetting to check for case-sensitive duplicates
  • Removing duplicates from unsorted data without review

Practice Questions: Removing Duplicates

Dataset: Identify what type of duplicates exist in this customer list.

ID    Name           Email               City
101   John Smith     john@email.com      NYC
102   Jane Doe       jane@email.com      LA
103   John Smith     john@email.com      NYC      ← ?
104   john smith     john@email.com      NYC      ← ?
105   Jane Doe       jane@gmail.com      LA       ← ?
Show Solution

Solution:

Duplicate Analysis:

  • Row 103: Exact duplicate of row 101 (all columns match). Should be removed.
  • Row 104: Case-sensitive duplicate of row 101. Same person with lowercase name. Should standardize with PROPER() then remove.
  • Row 105: Partial duplicate of row 102. Same name, same city, but different email. Might be same person with two emails - requires investigation, not automatic removal.

Recommended Actions:

  1. Standardize names with PROPER() function
  2. Standardize emails with LOWER() function
  3. Remove exact duplicates (row 103)
  4. Flag row 105 for manual review

Task: Create a formula that identifies duplicate email addresses but ignores the first occurrence. Mark subsequent duplicates as "Duplicate", first occurrence as "Original", and unique values as "Unique".

Show Solution

Solution:

Formula in B2 (assuming emails in A2:A100):

=IF(COUNTIF($A$2:$A$100,$A2)=1,
    "Unique",
    IF(COUNTIF($A$2:$A2,$A2)=1,
       "Original",
       "Duplicate"
    )
)

Logic Breakdown:
1. COUNTIF($A$2:$A$100,$A2)=1
   - If email appears only once in entire range → "Unique"
   
2. COUNTIF($A$2:$A2,$A2)=1
   - If email appears once in range from start to current row
   - This is the first occurrence → "Original"
   
3. Otherwise → "Duplicate"

Example Results:
A2: john@email.com     B2: Original
A3: jane@email.com     B3: Unique
A4: john@email.com     B4: Duplicate
A5: bob@email.com      B5: Unique
A6: john@email.com     B6: Duplicate

Now filter column B for "Duplicate" to review/remove

Scenario: You have a sales dataset where duplicates should be identified based on BOTH Customer ID AND Order Date (same customer, same day = duplicate order). Create a system to flag and count these complex duplicates.

A: Order_ID   B: Customer_ID   C: Order_Date   D: Amount
1001          C123            2024-01-15      $500
1002          C456            2024-01-15      $300
1003          C123            2024-01-15      $500  ← Duplicate?
1004          C123            2024-01-16      $200  ← Not duplicate
Show Solution

Solution:

Method 1: Helper Column with Concatenation

Column E (Unique Key):
E2: =B2&"-"&TEXT(C2,"yyyy-mm-dd")
Result: "C123-2024-01-15"

Column F (Duplicate Flag):
F2: =COUNTIF($E$2:$E$100,E2)>1

Column G (Occurrence Number):
G2: =COUNTIF($E$2:$E2,E2)

Results:
Row 2: E2="C123-2024-01-15"  F2=TRUE  G2=1 (Original)
Row 3: E3="C456-2024-01-15"  F3=FALSE G3=1 (Unique)
Row 4: E4="C123-2024-01-15"  F4=TRUE  G4=2 (Duplicate!)
Row 5: E5="C123-2024-01-16"  F5=FALSE G5=1 (Unique)

Method 2: Single Formula (No Helper Column)

Column E (Duplicate Status):
=IF(COUNTIFS($B$2:$B$100,$B2,$C$2:$C$100,$C2)=1,
    "Unique",
    IF(COUNTIFS($B$2:$B2,$B2,$C$2:$C2,$C2)=1,
       "First",
       "Duplicate-"&COUNTIFS($B$2:$B2,$B2,$C$2:$C2,$C2)
    )
)

Results:
Row 2: "First"
Row 3: "Unique"
Row 4: "Duplicate-2"
Row 5: "Unique"

To Remove Duplicates:
1. Sort by Customer_ID, then Order_Date
2. Use Remove Duplicates tool
3. Select only Customer_ID and Order_Date columns
4. First occurrence kept, others deleted

Alternative: Keep highest amount when duplicates exist
1. Sort by Customer_ID, Order_Date, Amount (descending)
2. Remove duplicates based on Customer_ID + Order_Date
3. Highest amount entry preserved
05

Handling Missing Values

Missing data is one of the most common data quality issues. How you handle missing values can significantly impact your analysis. Learn to identify gaps, decide when to fill or remove them, and implement best practices for maintaining data integrity.

Identifying Missing Data

Excel represents missing data in different ways, each requiring different handling:

Type Appearance Excel Formula Common Cause
Empty Cells Blank cell (nothing visible) =ISBLANK(A2) Data entry skipped, import errors
Zero Values 0 =A2=0 Placeholder, actual zero, or missing?
Text Placeholders "N/A", "NULL", "Missing" =A2="N/A" Database exports, manual entry
Formula Errors #N/A, #DIV/0!, #VALUE! =ISERROR(A2) Lookup failures, calculation issues
Spaces Only Cell appears empty but has spaces =AND(A2<>"",TRIM(A2)="") Copy-paste errors, web scraping
Pro Tip: Use Go To Special (Ctrl+G → Special → Blanks) to quickly select all empty cells in a range for easy identification and handling.

Decision Framework: Fill or Remove?

Before taking action, understand what the missing data means:

When to Fill Missing Values
  • Missing value is predictable (carry forward last known value)
  • Can calculate from other columns (e.g., Total = Quantity × Price)
  • Domain knowledge provides default (e.g., missing country = "USA" for US company)
  • Small percentage missing (<5%) with safe assumptions
  • Critical field needed for all records (Customer ID, Date)
When to Remove Records
  • Missing critical information (e.g., missing sales amount in sales data)
  • Large percentage missing (>30%) makes filling unreliable
  • Random missing pattern (MCAR - Missing Completely at Random)
  • Cannot fill without making dangerous assumptions
  • Duplicate or test records with missing required fields

Techniques for Filling Missing Values

Method 1: Fill with Constant Value

Replace missing values with a specific value (0, "Unknown", average, etc.):

Manual Fill with Go To Special
Task: Fill blank phone numbers with "Not Provided"

Steps:
1. Select phone number column
2. Press F5 → Special → Blanks → OK
3. Type: Not Provided
4. Press Ctrl+Enter (fills all selected blanks at once)

Result: All blank cells in selection now contain "Not Provided"
Formula Approach
Formula in B2:
=IF(ISBLANK(A2),"Not Provided",A2)

Logic: If cell is blank, return "Not Provided", otherwise keep original value
Copy down for entire column

Method 2: Fill Down / Fill Up

Use surrounding values to fill gaps:

Quick Fill Down
Keyboard Shortcut: Ctrl+D

Steps:
1. Select cell with value + empty cells below
2. Press Ctrl+D

Example:
Before:        After Ctrl+D:
Sales          Sales
[blank]        Sales
[blank]        Sales
[blank]        Sales
Formula Approach - Carry Forward Last Known Value
Formula in A2:
=IF(ISBLANK(A2),B1,A2)

Then copy down

Use case: Filling hierarchical data where category repeats
Fill Series for Dates
Steps:
1. Select first date + blank cells
2. Home → Fill → Series
3. Choose: Date Unit: Day, Step value: 1
4. OK

Example:
Before:        After:
1/1/2024      1/1/2024
[blank]       1/2/2024
[blank]       1/3/2024
[blank]       1/4/2024

Method 3: Calculate from Other Columns

Derive missing values using relationships in your data:

Example 1: Calculate Missing Total Price
Scenario: Total column has blanks, but Quantity and Unit Price exist

Formula in D2:
=IF(ISBLANK(D2),B2*C2,D2)

Logic:
- If Total is blank → Calculate: Quantity × Unit Price
- If Total exists → Keep existing value

Example:
Quantity  Price  Total (Before)  Total (After)
10        5.00   50.00          50.00
5         10.00  [blank]        50.00 (calculated)
20        3.00   60.00          60.00
Example 2: Calculate Missing Age from Birth Date
Formula in C2:
=IF(ISBLANK(C2),YEAR(TODAY())-YEAR(B2),C2)

Before:
Birth Date    Age
1990-05-15    [blank]  → Calculates: 35 (as of 2026)
1985-03-20    40       → Keeps: 40
Example 3: Fill Missing Category Based on Product ID Pattern
Scenario: Product IDs have prefixes (EL=Electronics, CL=Clothing)

Formula in D2:
=IF(ISBLANK(D2),
   IF(LEFT(A2,2)="EL","Electronics",
   IF(LEFT(A2,2)="CL","Clothing","Other")),
   D2)

Example:
Product ID    Category (Before)   Category (After)
EL-001       [blank]             Electronics
CL-200       [blank]             Clothing
EL-055       Electronics         Electronics
FU-100       [blank]             Other

Method 4: Use VLOOKUP/XLOOKUP for Reference Data

Fill missing values by looking up information from reference tables:

Example 1: Fill Missing Customer Names
Scenario: Customer Name missing, but Customer ID exists

Formula in B2:
=IF(ISBLANK(B2),
   XLOOKUP(A2,CustomerTable[ID],CustomerTable[Name],"Not Found"),
   B2)

Reference Table (CustomerTable):
ID     Name
C001   John Smith
C002   Jane Doe
C003   Bob Johnson

Result:
ID     Name (Before)    Name (After)
C001   [blank]         John Smith
C002   Jane Doe        Jane Doe
C003   [blank]         Bob Johnson
C999   [blank]         Not Found
Example 2: Fill Missing Prices from Product Catalog
Formula in C2:
=IFERROR(
   VLOOKUP(A2,PriceTable,2,FALSE),
   "Price Not Found"
)

VLOOKUP Breakdown:
- A2: Product ID to look up
- PriceTable: Reference range with Product ID and Price
- 2: Return value from column 2 (Price column)
- FALSE: Exact match required

IFERROR wrapper: Returns "Price Not Found" if product doesn't exist
Warning: Always document your filling strategy. Create a data dictionary explaining which values were filled and why. Never silently fill missing data without tracking what was changed.

Removing Records with Missing Values

Method 1: Filter and Delete

Manual Filter and Delete Workflow
Steps:
1. Select your data range (Ctrl+A)
2. Data tab → Filter
3. Click filter dropdown on column with missing data
4. Uncheck (Blanks)
5. Select all visible rows (rows with data)
6. Right-click row numbers → Delete Rows
7. Remove filter to see remaining data

Result: All rows with blank values in that column are removed

Method 2: Formula Flagging

Create Helper Column to Flag Incomplete Records
Formula in E2 (helper column):
=IF(OR(ISBLANK(A2),ISBLANK(B2),ISBLANK(C2)),
   "INCOMPLETE",
   "COMPLETE")

Logic:
- Checks if ANY of columns A, B, or C is blank
- If blank found → "INCOMPLETE"
- If all filled → "COMPLETE"

Example:
A         B        C        E (Status)
John      Sales    50000    COMPLETE
Jane      [blank]  60000    INCOMPLETE
Bob       IT       [blank]  INCOMPLETE
Sarah     HR       55000    COMPLETE

Next Step: Filter column E for "INCOMPLETE" to review before deleting

Method 3: Remove if Any Critical Field is Missing

Flag Records with Any Blank in Critical Range
Formula in F2 (decision column):
=IF(COUNTBLANK(A2:E2)>0,"DELETE","KEEP")

COUNTBLANK Explanation:
- Counts number of blank cells in range A2:E2
- If count > 0 (any blanks exist) → "DELETE"
- If count = 0 (all filled) → "KEEP"

Critical fields example: Customer ID, Order Date, Product, Quantity, Price
If any are blank, entire record is flagged for deletion

Example:
ID    Date        Product   Qty   Price   Decision
101   2024-01-15  Widget    10    5.00    KEEP
102   2024-01-16  [blank]   5     10.00   DELETE
103   [blank]     Gadget    20    3.00    DELETE
104   2024-01-17  Tool      15    8.00    KEEP

Workflow:
1. Add formula in helper column
2. Sort by Decision column
3. Delete all "DELETE" rows at once
4. Remove helper column

Practice Questions: Handling Missing Values

Dataset: Employee records with various missing values.

ID    Name           Department    Salary    Email
101   John Smith     Sales         50000     john@company.com
102   Jane Doe                     60000     jane@company.com
103   Bob Johnson    Marketing               bob@company.com
104                  IT            55000     
105   Sarah Lee      Sales         52000     sarah@company.com

Task: For each missing value, determine whether to fill or remove. Explain your reasoning.

Show Solution

Analysis:

  • Row 102 - Missing Department: FILL - Can look up from HR database using Employee ID, or ask employee
  • Row 103 - Missing Salary: FILL with caution - Could use department average, but sensitive data requires verification
  • Row 104 - Missing Name: REMOVE or INVESTIGATE - Critical identifier missing, likely test record or data corruption
  • Row 104 - Missing Email: FILL - Can construct from naming convention (firstname@company.com) once name is found

Recommended Actions:

  1. Row 104: Investigate first (check ID 104 in source system). If truly missing name, delete record
  2. Row 102: Fill department from HR system or mark as "Unassigned" temporarily
  3. Row 103: Flag for manager review before filling salary
  4. Row 104: If name found, construct email using pattern

Scenario: Sales data where Total is missing for some records. Fill using Quantity × Price. If either Quantity or Price is missing, mark as "INCOMPLETE".

A: Order_ID   B: Quantity   C: Price   D: Total
1001         10           5.00       50.00
1002         5            10.00      
1003                      7.50       
1004         20           3.00       60.00
Show Solution
Formula in D2:
=IF(OR(ISBLANK(B2),ISBLANK(C2)),
   "INCOMPLETE",
   IF(ISBLANK(D2),B2*C2,D2)
)

Logic Breakdown:
1. First check if Quantity OR Price is blank
   → If either is missing, cannot calculate → "INCOMPLETE"
   
2. If both Quantity and Price exist:
   → Check if Total is blank
   → If blank, calculate: Quantity × Price
   → If not blank, keep existing Total

Results:
D2: 50.00 (already filled, keep it)
D3: 50.00 (calculated: 5 × 10.00)
D4: "INCOMPLETE" (missing Quantity)
D5: 60.00 (already filled, keep it)

Alternative - Strict version that only fills if missing:
=IF(ISBLANK(D2),
   IF(OR(ISBLANK(B2),ISBLANK(C2)),"INCOMPLETE",B2*C2),
   D2)

Challenge: You have 10,000 customer orders. 15% are missing State, 8% are missing Zip Code, 3% are missing both. Create a multi-step strategy to handle this.

Show Solution

Multi-Step Strategy:

Step 1: Assess Impact

// Count missing patterns
Missing State only: 12% (1,200 records)
Missing Zip only: 5% (500 records)
Missing both: 3% (300 records)
Complete: 80% (8,000 records)

Step 2: Fill Zip from State (if State exists)

// For records with State but no Zip, use most common Zip for that State
=IF(AND(NOT(ISBLANK(B2)),ISBLANK(C2)),
   XLOOKUP(B2,StateLookup[State],StateLookup[MostCommonZip],"00000"),
   C2)

This fills ~5% (500 records)

Step 3: Fill State from Zip (if Zip exists)

// For records with Zip but no State
=IF(AND(ISBLANK(B2),NOT(ISBLANK(C2))),
   XLOOKUP(C2,ZipLookup[Zip],ZipLookup[State],"Unknown"),
   B2)

This fills most of the 12% (State-only missing)

Step 4: Handle Both Missing

// Option A: Use shipping address from previous orders (if repeat customer)
=IF(AND(ISBLANK(B2),ISBLANK(C2)),
   XLOOKUP(A2,HistoricalOrders[CustomerID],HistoricalOrders[State],"CHECK"),
   B2)

// Option B: If no history and both missing, flag for manual review
=IF(AND(ISBLANK(B2),ISBLANK(C2)),"MANUAL REVIEW","OK")

Step 5: Document Changes

// Add audit column showing what was done
=IF(AND(OriginalState="",FinalState<>""),"State Filled from Zip",
  IF(AND(OriginalZip="",FinalZip<>""),"Zip Filled from State",
  IF(AND(OriginalState="",OriginalZip=""),"Manual Review Needed",
  "No Changes")))

Final Result:

  • ~93% records complete after automated filling
  • ~3-5% flagged for manual review (missing both)
  • ~2% may remain unresolvable (delete or mark as "Unknown")
  • Full audit trail of all changes

Key Principle: Never silently fill data. Always mark filled values for review and maintain original data in separate columns until validated.

06

Text to Columns and Flash Fill

Transform unstructured text into organized data with Excel's Text to Columns and Flash Fill features. Split full names into first and last names, separate addresses into components, or extract specific patterns from messy text data with ease.

Text to Columns

Text to Columns splits data from one column into multiple columns based on delimiters (commas, spaces, tabs) or fixed widths.

Method 1: Delimited Text

Use when data is separated by specific characters:

1
Select Data and Open Tool

Select the column with text to split → Data tab → Text to Columns

2
Choose Delimited

Select "Delimited" → Next

3
Select Delimiters

Check delimiter(s): Tab, Semicolon, Comma, Space, or Other

Preview shows how data will split

4
Format Columns and Finish

Set data format for each column (General, Text, Date) → Finish

Example: Split Full Names
Original Data (Column A):
John Smith
Jane Mary Doe
Bob Johnson

Steps:
1. Select A2:A4
2. Data → Text to Columns
3. Delimited → Space
4. Finish

Result:
Column A     Column B     Column C
John         Smith        
Jane         Mary         Doe
Bob          Johnson

Method 2: Fixed Width

Use when data has consistent character positions:

Example: Split Fixed-Width Data
Original Data (positions fixed):
ID0001JohnSmith    Marketing
ID0002JaneJohnson  Sales    
ID0003BobLee       IT       

Positions:
0-5: ID
6-10: First Name
11-20: Last Name
21-30: Department

Steps:
1. Data → Text to Columns
2. Fixed Width
3. Click to create break lines at positions 6, 11, 21
4. Finish

Result: 4 clean columns
Warning: Text to Columns overwrites adjacent columns. Ensure empty columns exist to the right, or insert blank columns first.

Flash Fill (Excel 2013+)

Flash Fill automatically detects patterns in your examples and fills remaining cells. It's like teaching Excel by showing examples rather than writing formulas.

How Flash Fill Works

1
Provide Examples

In an adjacent column, manually type 1-2 examples of the desired output

2
Trigger Flash Fill

Press Ctrl+E or Data → Flash Fill

Excel detects the pattern and fills remaining cells

3
Review and Accept

Check suggestions. If correct, accept. If not, add more examples and retry

Common Flash Fill Use Cases

Extract First Names
A: Full Name        B: First Name
John Smith         John (type this)
Jane Doe           [Ctrl+E]
Bob Johnson        [Auto-filled: Bob]
Create Emails
A: Name            B: Email
John Smith        john.smith@company.com
Jane Doe          [Ctrl+E]
Bob Johnson       [Auto: bob.johnson@company.com]
Format Dates
A: Date            B: Formatted
01/15/2024        Jan 15, 2024
02/20/2024        [Ctrl+E]
03/10/2024        [Auto: Mar 10, 2024]
Extract Area Codes
A: Phone             B: Area Code
(555) 123-4567      555
(212) 987-6543      [Ctrl+E]
(310) 555-1234      [Auto: 310]

Advanced Flash Fill Techniques

Multi-Column Patterns

Flash Fill can combine data from multiple columns:

A: First   B: Last     C: Dept        D: Badge ID
John      Smith      Sales          JS-SAL-001 (type example)
Jane      Doe        Marketing      [Ctrl+E detects: JD-MAR-002]
Bob       Lee        IT             [Auto: BL-IT-003]

Pattern: FirstInitial + LastInitial + "-" + DEPT + "-" + Number
Pro Tip: Flash Fill works best with 2-3 examples. If Excel doesn't detect the pattern, add one more example or ensure your examples are consistent.

Text to Columns vs Flash Fill

Feature Text to Columns Flash Fill
Best For Consistent delimiters or fixed widths Complex patterns, transformations
Speed Very fast, one-time setup Requires examples, but flexible
Flexibility Limited to splitting Can split, combine, format, extract
Consistency 100% reliable with clear delimiters 90% accurate, may need verification
Formula Required No No (but not dynamic)
Updates Automatically No (one-time conversion) No (one-time conversion)

Practice Questions: Text to Columns & Flash Fill

Task: Split this comma-separated data into separate columns.

A1: Name,Age,City,State
A2: John Smith,28,New York,NY
A3: Jane Doe,35,Los Angeles,CA
A4: Bob Johnson,42,Chicago,IL
Show Solution

Steps:

  1. Select A1:A4
  2. Data tab → Text to Columns
  3. Choose "Delimited" → Next
  4. Check "Comma" → Next
  5. Leave data format as "General" → Finish

Result:

A           B      C              D
Name        Age    City           State
John Smith  28     New York       NY
Jane Doe    35     Los Angeles    CA
Bob Johnson 42     Chicago        IL

If your data has headers in first row, consider skipping A1 and starting from A2, then manually type headers.

Challenge: Use Flash Fill to generate company email addresses in format: firstname.lastname@company.com (all lowercase).

A: Full Name         B: Email Address
John Smith          
Jane DOE            
Bob JOHNSON         
Sarah Lee           
Show Solution

Steps:

  1. In B2, manually type: john.smith@company.com
  2. In B3, manually type: jane.doe@company.com
  3. Select B4
  4. Press Ctrl+E (Flash Fill)
  5. Excel fills remaining cells with pattern

Result:

A: Full Name         B: Email Address
John Smith          john.smith@company.com
Jane DOE            jane.doe@company.com
Bob JOHNSON         bob.johnson@company.com
Sarah Lee           sarah.lee@company.com

Why It Works:

  • Flash Fill detects: split name at space
  • Convert to lowercase
  • Join with period
  • Add domain @company.com

If Flash Fill doesn't work, add a third example or use formula: =LOWER(SUBSTITUTE(A2," ","."))&"@company.com"

Scenario: Parse full addresses into Street, City, State, Zip components. Data has inconsistent formatting.

Full Address:
123 Main St, New York, NY 10001
456 Oak Avenue, Los Angeles, CA, 90001
789 Pine Road, Chicago IL 60601

Challenge: Handle comma variations and extract all 4 components reliably.

Show Solution

Multi-Step Approach:

Step 1: Standardize First (Add Missing Commas)

// Find state abbreviations without comma before
Find: ([A-Z]{2}) (\d{5})
Replace: , $1 $2
Use Find & Replace with wildcards

Result: All addresses have consistent comma placement

Step 2: Text to Columns (Comma Delimited)

1. Select address column
2. Text to Columns → Delimited → Comma
3. Click Next → Finish

Result:
A: 123 Main St       B: New York     C: NY 10001
A: 456 Oak Avenue    B: Los Angeles  C: CA          D: 90001

Step 3: Clean Up State and Zip

// Extract State (column D)
=LEFT(TRIM(C2),2)

// Extract Zip (column E)
=RIGHT(TRIM(C2),5)

Or use Flash Fill:
- Manually type "NY" in D2, "CA" in D3
- Ctrl+E to fill remaining states
- Repeat for Zip codes

Final Clean Result:

Street          City           State    Zip
123 Main St     New York       NY       10001
456 Oak Avenue  Los Angeles    CA       90001
789 Pine Road   Chicago        IL       60601

Alternative: Power Query (Best for Large Datasets)

1. Data → Get Data → From Table/Range
2. Split Column by Delimiter (comma)
3. Split last column by space to separate State and Zip
4. Trim all columns
5. Close & Load

Best Practice: For complex, repetitive parsing tasks, Power Query is more reliable than Text to Columns because it's repeatable and handles variations better.

07

Power Query Basics for Data Cleaning

Power Query is Excel's most powerful data transformation tool. Unlike formulas that need to be reapplied, Power Query creates repeatable, automated workflows that can refresh with new data. Perfect for recurring data cleaning tasks and complex transformations.

Why Use Power Query?

Repeatable

Create once, apply to any new data with a single refresh. Saves hours on recurring tasks.

Non-Destructive

Original data stays untouched. All transformations happen in Power Query Editor.

Auditable

Every step is recorded. See exactly what transformations were applied and when.

Getting Started with Power Query

1
Load Data into Power Query

Method A: From Excel Table

1. Select your data range
2. Insert → Table (or Ctrl+T)
3. Data tab → Get Data → From Table/Range

Method B: From File

Data → Get Data → From File → From Excel/CSV/Text
2
Power Query Editor Opens

You'll see three main areas:

  • Preview Area: Shows your data with transformations
  • Applied Steps: Right panel - records every transformation
  • Ribbon: Transformation tools organized by category
3
Apply Transformations

Each action adds a step to Applied Steps pane. You can:

  • Click any step to see data at that point
  • Delete steps (X icon)
  • Reorder steps (drag and drop)
  • Edit step settings (gear icon)
4
Load Back to Excel

Home tab → Close & Load → Choose destination:

  • Table: Loads as Excel table (most common)
  • PivotTable: Creates PivotTable from cleaned data
  • Connection Only: Saves query without loading (for reuse)

Essential Power Query Transformations

1. Remove Columns

// Right-click column header
→ Remove Columns
or
→ Remove Other Columns (keeps selected, removes rest)

Use when: Simplifying dataset, removing unnecessary fields

2. Remove Duplicates

// Select columns that define uniqueness
Home → Remove Rows → Remove Duplicates

Use when: Cleaning customer lists, order records
Advantage over Excel: Can specify which columns to check

3. Trim and Clean Text

// Right-click text column
→ Transform → Trim (removes leading/trailing spaces)
→ Transform → Clean (removes non-printable characters)
→ Format → UPPERCASE / lowercase / Capitalize Each Word

Chain these: Trim → Clean → Proper Case in sequence

4. Replace Values

// Right-click column → Replace Values
From: NY
To: New York

Better than Find & Replace: 
- Only affects selected column
- Case-sensitive option available
- Recorded as repeatable step

5. Split Columns

// Right-click column header → Split Column
Options:
- By Delimiter (comma, space, custom)
- By Number of Characters
- By Position (left, right, range)

Example: Split "LastName, FirstName"
Split by Delimiter → Comma → Each occurrence

6. Fill Down / Fill Up

// Right-click column → Fill → Down
Fills empty cells with value from above

Common use: Hierarchical data where category repeats
Category    Product
Electronics [blank]    → Electronics  Laptop
[blank]     [blank]    → Electronics  Phone
Furniture   [blank]    → Furniture    Desk

7. Change Data Type

// Click column header icon → Choose type
Text, Whole Number, Decimal Number, Date, Currency

Critical for: Ensuring calculations work, proper sorting
Automatic detection: Power Query guesses type (check Applied Steps)

8. Filter Rows

// Click filter dropdown in column header
Options:
- Text Filters: Contains, Begins With, Ends With
- Number Filters: Greater Than, Between, Top N
- Date Filters: Year, Month, Date Range

Removes unwanted records before loading to Excel

Power Query Cleaning Workflow Example

Real-World Scenario: Monthly Sales Report Automation

Problem: Every month, you receive a messy CSV file with 10,000+ sales records that needs the same cleaning.

Power Query Solution:

Step 1: Load CSV
Data → From Text/CSV → Select file

Step 2: Promote First Row to Headers
Home → Use First Row as Headers

Step 3: Remove Unnecessary Columns
Select ID, Date, Product, Quantity, Price columns
→ Remove Other Columns

Step 4: Clean Product Names
Transform → Trim
Transform → Clean  
Format → Capitalize Each Word

Step 5: Fix Date Format
Change Type → Date

Step 6: Remove Duplicates (based on ID)
Select ID column → Remove Duplicates

Step 7: Add Calculated Column (Total)
Add Column → Custom Column
Name: Total
Formula: [Quantity] * [Price]

Step 8: Filter Out Test Records
Filter Product column → Does Not Contain "TEST"

Step 9: Sort by Date
Home → Sort → Sort Descending

Step 10: Load to Excel
Close & Load → Table → New Worksheet

Next Month:

1. Save new CSV file with same name in same location
2. Right-click query in Queries & Connections pane
3. Click Refresh

All 10 steps automatically reapply to new data!
Time Savings: First time: 20 minutes to set up. Every subsequent month: 5 seconds to refresh. After 6 months, you've saved 2+ hours of repetitive work!

M Language (Behind the Scenes)

Power Query writes code called M Language for each step. You don't need to write it, but viewing helps understand transformations:

// View M Code: Right-click step → View Native Query
// Or: Home → Advanced Editor

Example M Code:
let
    Source = Excel.CurrentWorkbook(){[Name="SalesData"]}[Content],
    RemovedDuplicates = Table.Distinct(Source, {"ID"}),
    TrimmedText = Table.TransformColumns(RemovedDuplicates,{{"Product", Text.Trim}}),
    ChangedType = Table.TransformColumnTypes(TrimmedText,{{"Date", type date}})
in
    ChangedType
Advanced Tip: You can copy M code from one query to another, enabling you to replicate complex transformations across multiple datasets.

Practice Questions: Power Query Basics

Task: Create a Power Query to clean customer data: remove duplicates, trim spaces, capitalize names, remove test records.

Sample Data (CustomerList table):
ID    Name            Email               City
101   john smith      john@email.com     new york
102   JANE DOE        jane@email.com     LOS ANGELES
101   john smith      john@email.com     new york
999   TEST CUSTOMER   test@email.com     TEST CITY
Show Solution

Power Query Steps:

  1. Load Data: Data → From Table/Range → Select CustomerList
  2. Remove Duplicates: Home → Remove Rows → Remove Duplicates
  3. Trim All Text Columns:
    • Select Name, Email, City columns (hold Ctrl)
    • Transform → Format → Trim
  4. Capitalize Names:
    • Select Name column
    • Transform → Format → Capitalize Each Word
  5. Capitalize City:
    • Select City column
    • Transform → Format → Capitalize Each Word
  6. Remove Test Records:
    • Click filter on Name column
    • Text Filters → Does Not Contain → "TEST"
  7. Load: Home → Close & Load

Final Result:

ID    Name         Email              City
101   John Smith   john@email.com     New York
102   Jane Doe     jane@email.com     Los Angeles

Applied Steps (should show):

  • Source
  • Removed Duplicates
  • Trimmed Text
  • Capitalized Each Word (Name)
  • Capitalized Each Word (City)
  • Filtered Rows

Challenge: Clean product data by splitting SKU codes and extracting components.

Data:
SKU                  Quantity
ELEC-LAP-001        10
ELEC-PHONE-002      25
FURN-DESK-003       5
FURN-CHAIR-004      15

Task:
- Split SKU into: Category, Type, ID
- Capitalize Category and Type properly
- Add calculated column: Value = Quantity × 100
Show Solution

Power Query Steps:

  1. Split SKU by Delimiter:
    • Right-click SKU column
    • Split Column → By Delimiter → Hyphen (-)
    • Choose: At each occurrence of delimiter
  2. Rename Columns:
    • Double-click headers to rename:
    • SKU.1 → Category
    • SKU.2 → Type
    • SKU.3 → ID
  3. Capitalize Category and Type:
    • Select Category and Type columns (Ctrl+Click)
    • Transform → Format → Capitalize Each Word
  4. Add Calculated Column:
    • Add Column → Custom Column
    • Name: Value
    • Formula: [Quantity] * 100
  5. Reorder Columns (optional):
    • Drag columns to desired order
  6. Close & Load

Final Result:

Category    Type    ID    Quantity    Value
Elec        Lap     001   10          1000
Elec        Phone   002   25          2500
Furn        Desk    003   5           500
Furn        Chair   004   15          1500

Scenario: Build a Power Query that processes monthly sales files automatically. Requirements:

  • Import CSV from C:\Reports\MonthlySales.csv
  • Remove columns except: Date, Product, Quantity, UnitPrice
  • Fix date format issues
  • Calculate Total = Quantity × UnitPrice
  • Filter out records where Quantity = 0
  • Group by Product and show total sales
  • Sort by total sales descending
Show Solution

Complete Power Query Pipeline:

Step 1: Import CSV
Data → Get Data → From File → From Text/CSV
Navigate to C:\Reports\MonthlySales.csv
Load → Transform Data (opens Power Query Editor)

Step 2: Remove Unwanted Columns
Select: Date, Product, Quantity, UnitPrice
Right-click → Remove Other Columns

Step 3: Fix Date Column
Click Date column header icon → Choose "Date" type
If errors appear, check original format and use:
Transform → Date → Parse (choose format)

Step 4: Add Total Column
Add Column → Custom Column
Name: Total
Formula: [Quantity] * [UnitPrice]

Step 5: Filter Out Zero Quantity
Click filter on Quantity column
Number Filters → Greater Than → 0

Step 6: Group by Product
Home → Group By
Group by: Product
New column name: TotalSales
Operation: Sum
Column: Total

Step 7: Sort Descending
Click TotalSales column header dropdown
Sort Descending

Step 8: Rename Query (optional)
Right-click query name in Applied Steps
Rename: MonthlySalesReport

Step 9: Close & Load
Home → Close & Load → Table → New worksheet

Applied Steps Should Show:

  1. Source
  2. Promoted Headers
  3. Removed Other Columns
  4. Changed Type (Date)
  5. Added Custom (Total)
  6. Filtered Rows (Quantity > 0)
  7. Grouped Rows
  8. Sorted Rows

Next Month Refresh:

1. Replace MonthlySales.csv with new month's data (same filename, same location)
2. Right-click query in Queries & Connections pane
3. Click Refresh

All steps automatically reapply!

Advanced: Parameterize File Path

To make file path flexible:
1. Home → Manage Parameters → New Parameter
   Name: FilePath
   Type: Text
   Current Value: C:\Reports\MonthlySales.csv
   
2. Edit Source step in Advanced Editor:
   Replace hardcoded path with: FilePath parameter
   
Now you can change file location without editing query!

Result: A fully automated pipeline that transforms 30 minutes of manual work into 5 seconds of refresh time!

Key Takeaways

Clean Data Foundation

Data cleaning is essential before analysis - invest time upfront to ensure accurate insights

Find & Replace Mastery

Use wildcards and advanced options to find and fix patterns across entire datasets efficiently

Text Function Toolkit

TRIM, PROPER, UPPER, LOWER, and CLEAN functions standardize text and remove unwanted characters

Duplicate Detection

Remove Duplicates and conditional formatting help identify and eliminate redundant records

Data Validation

Set up validation rules to prevent errors from entering your dataset in the first place

Quality Assurance

Regular data quality checks ensure consistency and reliability throughout your analysis workflow

Knowledge Check

Quick Quiz

Test your understanding of Excel data cleaning techniques

1 What does the TRIM() function do in Excel?
2 In Find & Replace, what does the asterisk (*) wildcard represent?
3 What is the best practice BEFORE removing duplicate records from a dataset?
4 Which function would you use to remove non-printable characters from imported data?
5 When using Remove Duplicates, what happens to the duplicate rows?
6 What is the primary benefit of using Data Validation in Excel?
Answer all questions to check your score