Module 3.4

Advanced SQL Techniques

Elevate your SQL skills to expert level with subqueries, Common Table Expressions (CTEs), powerful window functions, and sophisticated data manipulation techniques. Learn the advanced patterns used by professional data analysts and engineers.

60 min read
Advanced
Complex Examples
What You'll Learn
  • Subqueries: scalar, row, and table subqueries
  • Common Table Expressions (CTEs) for readable queries
  • Window functions: ROW_NUMBER, RANK, LAG, LEAD
  • Date and string manipulation functions
  • Set operations: UNION and UNION ALL
Contents
1

Subqueries: Queries Within Queries

Subqueries are powerful SQL constructs that allow you to nest one query inside another, enabling complex filtering, calculations, and data comparisons. They're the foundation for sophisticated data analysis and dynamic reporting.

Understanding Subqueries

A subquery (also called an inner query or nested query) is a SELECT statement embedded within another SQL statement. Subqueries can appear in WHERE clauses, FROM clauses, SELECT lists, and even HAVING clauses. They execute first, and their results are used by the outer query.

Subqueries enable you to answer questions like: "Show customers who spent more than the average," "Find products that have never been ordered," or "List employees who earn more than anyone in the sales department."

Key Concept

Subquery Types

SQL supports three main types of subqueries: Scalar subqueries return a single value (one row, one column), Row subqueries return a single row with multiple columns, and Table subqueries return multiple rows and columns. Each type serves different purposes in query construction.

Why it matters: Choosing the right subquery type is crucial for query correctness. Using a multi-row subquery with = (expects single value) causes errors. Understanding these types helps you write efficient, error-free SQL for complex analytical tasks.

Scalar Subqueries (Single Value)

Scalar subqueries return exactly one value—a single row with a single column. They can be used anywhere a single value is expected: in WHERE clauses, SELECT lists, or with comparison operators.

-- Find products priced above average
SELECT 
  product_id,
  product_name,
  price,
  (SELECT AVG(price) FROM products) AS avg_price
FROM products
WHERE price > (SELECT AVG(price) FROM products)
ORDER BY price DESC;

-- Find customers who spent more than average
SELECT 
  customer_id,
  name,
  total_spent
FROM (
  SELECT 
    c.customer_id,
    c.name,
    SUM(o.order_total) AS total_spent
  FROM customers c
  JOIN orders o ON c.customer_id = o.customer_id
  GROUP BY c.customer_id, c.name
) AS customer_totals
WHERE total_spent > (
  SELECT AVG(customer_total)
  FROM (
    SELECT SUM(order_total) AS customer_total
    FROM orders
    GROUP BY customer_id
  ) AS averages
);

-- Compare employee salary to department average
SELECT 
  employee_id,
  name,
  department,
  salary,
  (SELECT AVG(salary) 
   FROM employees e2 
   WHERE e2.department = e1.department) AS dept_avg_salary,
  salary - (SELECT AVG(salary) 
            FROM employees e2 
            WHERE e2.department = e1.department) AS diff_from_avg
FROM employees e1
ORDER BY department, salary DESC;
Important: Scalar subqueries MUST return exactly one value. If they return multiple rows or no rows, you'll get an error. Always ensure your subquery logic guarantees a single result, or use appropriate aggregate functions (AVG, MAX, MIN, COUNT, SUM).

Table Subqueries (Multiple Rows)

Table subqueries return multiple rows and can be used with operators like IN, NOT IN, EXISTS, NOT EXISTS, ANY, or ALL. They're perfect for filtering based on membership in a set of values.

-- Find products that have never been ordered
SELECT product_id, product_name, price
FROM products
WHERE product_id NOT IN (
  SELECT DISTINCT product_id
  FROM order_items
);

-- Find customers who ordered in the last 30 days
SELECT customer_id, name, email
FROM customers
WHERE customer_id IN (
  SELECT DISTINCT customer_id
  FROM orders
  WHERE order_date >= DATE_SUB(CURDATE(), INTERVAL 30 DAY)
);

-- Find employees earning more than ANY sales rep
SELECT name, department, salary
FROM employees
WHERE salary > ANY (
  SELECT salary
  FROM employees
  WHERE department = 'Sales'
)
AND department != 'Sales';

-- Find products more expensive than ALL products in 'Books' category
SELECT product_name, category, price
FROM products
WHERE price > ALL (
  SELECT price
  FROM products
  WHERE category = 'Books'
);

Correlated Subqueries

Correlated subqueries reference columns from the outer query, executing once for each row in the outer query. They're powerful but can be performance-intensive on large datasets:

-- Find products with above-average price in their category
SELECT 
  p1.product_id,
  p1.product_name,
  p1.category,
  p1.price,
  (SELECT AVG(p2.price) 
   FROM products p2 
   WHERE p2.category = p1.category) AS category_avg
FROM products p1
WHERE p1.price > (
  SELECT AVG(p2.price)
  FROM products p2
  WHERE p2.category = p1.category
);

-- Find customers with orders above their own average
SELECT DISTINCT
  c.customer_id,
  c.name,
  o.order_id,
  o.order_total
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
WHERE o.order_total > (
  SELECT AVG(order_total)
  FROM orders o2
  WHERE o2.customer_id = c.customer_id
);

-- EXISTS: Check if customer has any orders
SELECT customer_id, name, email
FROM customers c
WHERE EXISTS (
  SELECT 1
  FROM orders o
  WHERE o.customer_id = c.customer_id
  AND o.order_date >= '2025-01-01'
);
Pro Tip: Use EXISTS instead of IN when checking for record existence—it's often faster because EXISTS stops searching as soon as it finds one match, while IN must retrieve all matching values. For non-existence checks, NOT EXISTS is similarly more efficient than NOT IN.

Subqueries in FROM Clause (Derived Tables)

Using subqueries in the FROM clause creates temporary result sets (derived tables) that you can query like regular tables. This pattern is useful for multi-step calculations:

-- Calculate percentage of total sales by category
SELECT 
  category,
  category_revenue,
  total_revenue,
  ROUND(category_revenue * 100.0 / total_revenue, 2) AS percent_of_total
FROM (
  SELECT 
    p.category,
    SUM(oi.price * oi.quantity) AS category_revenue
  FROM order_items oi
  JOIN products p ON oi.product_id = p.product_id
  GROUP BY p.category
) AS category_sales
CROSS JOIN (
  SELECT SUM(price * quantity) AS total_revenue
  FROM order_items
) AS total_sales
ORDER BY percent_of_total DESC;

-- Find top 5 customers per region
SELECT *
FROM (
  SELECT 
    c.region,
    c.customer_id,
    c.name,
    SUM(o.order_total) AS total_spent,
    ROW_NUMBER() OVER (PARTITION BY c.region ORDER BY SUM(o.order_total) DESC) AS rank_in_region
  FROM customers c
  JOIN orders o ON c.customer_id = o.customer_id
  GROUP BY c.region, c.customer_id, c.name
) AS ranked_customers
WHERE rank_in_region <= 5
ORDER BY region, rank_in_region;
Remember: Every derived table (subquery in FROM) must have an alias. This is required SQL syntax. Use meaningful aliases like "customer_totals" or "monthly_sales" rather than generic "t1", "t2" for better readability.

Practice Questions: Subqueries

Test your subquery skills with these challenges.

Task: Write a query to show products where price exceeds the average price in their category.

Show Solution
SELECT 
  p1.product_id,
  p1.product_name,
  p1.category,
  p1.price,
  (SELECT AVG(p2.price) 
   FROM products p2 
   WHERE p2.category = p1.category) AS category_avg
FROM products p1
WHERE p1.price > (
  SELECT AVG(p2.price)
  FROM products p2
  WHERE p2.category = p1.category
)
ORDER BY p1.category, p1.price DESC;

Task: List customers who have NOT placed any orders in the last 90 days. Use NOT EXISTS or NOT IN.

Show Solution
-- Solution 1: Using NOT EXISTS (more efficient)
SELECT customer_id, name, email
FROM customers c
WHERE NOT EXISTS (
  SELECT 1
  FROM orders o
  WHERE o.customer_id = c.customer_id
  AND o.order_date >= DATE_SUB(CURDATE(), INTERVAL 90 DAY)
);

-- Solution 2: Using NOT IN
SELECT customer_id, name, email
FROM customers
WHERE customer_id NOT IN (
  SELECT DISTINCT customer_id
  FROM orders
  WHERE order_date >= DATE_SUB(CURDATE(), INTERVAL 90 DAY)
);

Task: Show monthly revenue for 2025 with comparison to previous month. Calculate absolute and percentage growth.

Show Solution
SELECT 
  year,
  month,
  monthly_revenue,
  prev_month_revenue,
  monthly_revenue - prev_month_revenue AS revenue_growth,
  ROUND(
    (monthly_revenue - prev_month_revenue) * 100.0 / prev_month_revenue,
    2
  ) AS growth_percent
FROM (
  SELECT 
    YEAR(order_date) AS year,
    MONTH(order_date) AS month,
    SUM(order_total) AS monthly_revenue,
    (SELECT SUM(order_total)
     FROM orders o2
     WHERE YEAR(o2.order_date) = YEAR(o1.order_date)
     AND MONTH(o2.order_date) = MONTH(o1.order_date) - 1
    ) AS prev_month_revenue
  FROM orders o1
  WHERE YEAR(order_date) = 2025
  GROUP BY YEAR(order_date), MONTH(order_date)
) AS monthly_comparison
ORDER BY year, month;
2

Common Table Expressions (CTEs)

Common Table Expressions (CTEs) provide a cleaner, more readable alternative to subqueries. Using the WITH clause, CTEs let you define temporary named result sets that exist only for the duration of a query, making complex SQL much easier to write, understand, and maintain.

Understanding CTEs

A CTE is like a temporary view that exists only within the execution scope of a single query. Think of it as giving a name to a subquery result, which you can then reference multiple times in your main query. CTEs are especially valuable for breaking down complex queries into logical, readable steps.

-- Basic CTE syntax
WITH cte_name AS (
  SELECT column1, column2, ...
  FROM table
  WHERE conditions
)
SELECT *
FROM cte_name
WHERE additional_conditions;

-- Real example: Calculate customer lifetime value
WITH customer_totals AS (
  SELECT 
    customer_id,
    COUNT(*) AS order_count,
    SUM(order_total) AS lifetime_value,
    AVG(order_total) AS avg_order_value,
    MIN(order_date) AS first_order_date,
    MAX(order_date) AS last_order_date
  FROM orders
  GROUP BY customer_id
)
SELECT 
  c.customer_id,
  c.name,
  c.email,
  ct.order_count,
  ct.lifetime_value,
  ct.avg_order_value,
  DATEDIFF(ct.last_order_date, ct.first_order_date) AS customer_lifespan_days
FROM customers c
JOIN customer_totals ct ON c.customer_id = ct.customer_id
WHERE ct.order_count >= 5
ORDER BY ct.lifetime_value DESC;
Key Concept

CTE Benefits

CTEs make complex queries readable by breaking them into named, logical steps. Unlike subqueries that must be repeated for each use, a CTE is defined once and can be referenced multiple times. They support recursion for hierarchical data and improve query maintainability significantly.

Why it matters: Professional SQL code prioritizes readability and maintainability. CTEs transform tangled nested subqueries into clear, step-by-step logic that colleagues can understand and modify. This is crucial for collaborative data teams and long-term code maintenance.

Multiple CTEs in One Query

You can define multiple CTEs in a single query, each building on previous ones or working independently. This creates a pipeline of data transformations:

-- Multi-step customer segmentation analysis
WITH 
-- Step 1: Calculate customer metrics
customer_metrics AS (
  SELECT 
    customer_id,
    COUNT(*) AS order_count,
    SUM(order_total) AS total_spent,
    AVG(order_total) AS avg_order_value,
    MAX(order_date) AS last_order_date
  FROM orders
  GROUP BY customer_id
),
-- Step 2: Assign customer segments
customer_segments AS (
  SELECT 
    customer_id,
    order_count,
    total_spent,
    avg_order_value,
    last_order_date,
    CASE 
      WHEN total_spent >= 5000 THEN 'VIP'
      WHEN total_spent >= 1000 THEN 'Gold'
      WHEN total_spent >= 500 THEN 'Silver'
      ELSE 'Bronze'
    END AS segment,
    CASE
      WHEN last_order_date >= DATE_SUB(CURDATE(), INTERVAL 30 DAY) THEN 'Active'
      WHEN last_order_date >= DATE_SUB(CURDATE(), INTERVAL 90 DAY) THEN 'At Risk'
      ELSE 'Inactive'
    END AS activity_status
  FROM customer_metrics
),
-- Step 3: Calculate segment statistics
segment_stats AS (
  SELECT 
    segment,
    activity_status,
    COUNT(*) AS customer_count,
    AVG(total_spent) AS avg_lifetime_value,
    AVG(avg_order_value) AS avg_order_size
  FROM customer_segments
  GROUP BY segment, activity_status
)
-- Final: Show comprehensive segment analysis
SELECT 
  segment,
  activity_status,
  customer_count,
  ROUND(avg_lifetime_value, 2) AS avg_lifetime_value,
  ROUND(avg_order_size, 2) AS avg_order_size,
  ROUND(customer_count * 100.0 / SUM(customer_count) OVER (), 2) AS percent_of_total
FROM segment_stats
ORDER BY 
  FIELD(segment, 'VIP', 'Gold', 'Silver', 'Bronze'),
  FIELD(activity_status, 'Active', 'At Risk', 'Inactive');

CTEs vs Subqueries: When to Use Each

While CTEs and subqueries are functionally similar, each has optimal use cases:

Scenario Use CTEs When... Use Subqueries When...
Readability Query has multiple steps or complex logic that benefits from named intermediate results Query is simple with a single nested operation
Reusability Need to reference the same result set multiple times in one query Result is only needed once in one specific place
Recursion Working with hierarchical data (org charts, categories, etc.) Recursion is not needed—CTEs are the only SQL recursion option
Debugging Need to test/verify intermediate results step-by-step Logic is straightforward and doesn't need stepwise verification
Remember: CTEs are not stored views or temp tables. They exist only for the query duration and are not materialized unless the database optimizer decides to do so. For persistent reusable logic, create actual views instead.

Recursive CTEs for Hierarchical Data

Recursive CTEs can reference themselves, making them perfect for traversing hierarchical structures like organizational charts, product categories, or threaded comments:

-- Recursive CTE: Employee hierarchy (org chart)
WITH RECURSIVE employee_hierarchy AS (
  -- Base case: Top-level managers (no supervisor)
  SELECT 
    employee_id,
    name,
    manager_id,
    title,
    1 AS level,
    CAST(name AS CHAR(1000)) AS path
  FROM employees
  WHERE manager_id IS NULL
  
  UNION ALL
  
  -- Recursive case: Employees reporting to previous level
  SELECT 
    e.employee_id,
    e.name,
    e.manager_id,
    e.title,
    eh.level + 1,
    CONCAT(eh.path, ' > ', e.name)
  FROM employees e
  INNER JOIN employee_hierarchy eh ON e.manager_id = eh.employee_id
)
SELECT 
  employee_id,
  CONCAT(REPEAT('  ', level - 1), name) AS name_indented,
  title,
  level,
  path
FROM employee_hierarchy
ORDER BY path;

-- Recursive CTE: Category tree with parent-child relationships
WITH RECURSIVE category_tree AS (
  -- Base: Root categories
  SELECT 
    category_id,
    category_name,
    parent_category_id,
    1 AS depth,
    category_name AS full_path
  FROM categories
  WHERE parent_category_id IS NULL
  
  UNION ALL
  
  -- Recursive: Child categories
  SELECT 
    c.category_id,
    c.category_name,
    c.parent_category_id,
    ct.depth + 1,
    CONCAT(ct.full_path, ' > ', c.category_name)
  FROM categories c
  INNER JOIN category_tree ct ON c.parent_category_id = ct.category_id
)
SELECT * FROM category_tree
ORDER BY full_path;
Pro Tip: When using CTEs, name them descriptively: use "monthly_sales" instead of "cte1", "customer_segments" instead of "temp". Good names make your query self-documenting and save time for future readers (including yourself).

Practical Example: Sales Performance Dashboard

Let's build a comprehensive sales report using multiple CTEs to calculate various metrics:

WITH
-- CTE 1: Monthly sales by product
monthly_product_sales AS (
  SELECT 
    p.product_id,
    p.product_name,
    p.category,
    YEAR(o.order_date) AS year,
    MONTH(o.order_date) AS month,
    SUM(oi.quantity) AS units_sold,
    SUM(oi.price * oi.quantity) AS revenue
  FROM products p
  JOIN order_items oi ON p.product_id = oi.product_id
  JOIN orders o ON oi.order_id = o.order_id
  WHERE o.order_date >= '2025-01-01'
  GROUP BY p.product_id, p.product_name, p.category, YEAR(o.order_date), MONTH(o.order_date)
),
-- CTE 2: Calculate product rankings
product_rankings AS (
  SELECT 
    product_id,
    product_name,
    category,
    year,
    month,
    revenue,
    RANK() OVER (PARTITION BY category, year, month ORDER BY revenue DESC) AS rank_in_category
  FROM monthly_product_sales
),
-- CTE 3: Identify top performers
top_products AS (
  SELECT *
  FROM product_rankings
  WHERE rank_in_category <= 3
)
-- Final query: Show top 3 products per category per month
SELECT 
  year,
  month,
  category,
  rank_in_category,
  product_name,
  revenue,
  ROUND(revenue * 100.0 / SUM(revenue) OVER (PARTITION BY category, year, month), 2) AS percent_of_category
FROM top_products
ORDER BY year, month, category, rank_in_category;
Important: Recursive CTEs need a termination condition to avoid infinite loops. Always ensure your recursive case eventually reaches a state where no new rows are added (like reaching leaf nodes in a tree or hitting a maximum depth).

Practice Questions: CTEs

Master CTEs with these practical exercises.

Task: Rewrite this subquery-based query using a CTE:

SELECT * FROM customers WHERE customer_id IN (SELECT customer_id FROM orders WHERE order_date >= '2025-01-01');
Show Solution
WITH recent_customers AS (
  SELECT DISTINCT customer_id
  FROM orders
  WHERE order_date >= '2025-01-01'
)
SELECT c.*
FROM customers c
INNER JOIN recent_customers rc ON c.customer_id = rc.customer_id;

Task: Create a CTE-based query to calculate RFM (Recency, Frequency, Monetary) scores for customers.

Show Solution
WITH customer_rfm AS (
  SELECT 
    customer_id,
    DATEDIFF(CURDATE(), MAX(order_date)) AS recency_days,
    COUNT(*) AS frequency,
    SUM(order_total) AS monetary_value
  FROM orders
  GROUP BY customer_id
),
rfm_scores AS (
  SELECT 
    customer_id,
    recency_days,
    frequency,
    monetary_value,
    CASE 
      WHEN recency_days <= 30 THEN 5
      WHEN recency_days <= 60 THEN 4
      WHEN recency_days <= 90 THEN 3
      WHEN recency_days <= 180 THEN 2
      ELSE 1
    END AS recency_score,
    CASE 
      WHEN frequency >= 20 THEN 5
      WHEN frequency >= 10 THEN 4
      WHEN frequency >= 5 THEN 3
      WHEN frequency >= 2 THEN 2
      ELSE 1
    END AS frequency_score,
    CASE 
      WHEN monetary_value >= 5000 THEN 5
      WHEN monetary_value >= 2000 THEN 4
      WHEN monetary_value >= 1000 THEN 3
      WHEN monetary_value >= 500 THEN 2
      ELSE 1
    END AS monetary_score
  FROM customer_rfm
)
SELECT 
  c.customer_id,
  c.name,
  rs.recency_days,
  rs.frequency,
  rs.monetary_value,
  rs.recency_score,
  rs.frequency_score,
  rs.monetary_score,
  (rs.recency_score + rs.frequency_score + rs.monetary_score) AS total_rfm_score
FROM customers c
JOIN rfm_scores rs ON c.customer_id = rs.customer_id
ORDER BY total_rfm_score DESC;

Task: Write a recursive CTE to show complete category hierarchy with product counts at each level.

Show Solution
WITH RECURSIVE category_hierarchy AS (
  -- Base: Root categories
  SELECT 
    category_id,
    category_name,
    parent_category_id,
    1 AS level,
    CAST(category_name AS CHAR(500)) AS full_path
  FROM categories
  WHERE parent_category_id IS NULL
  
  UNION ALL
  
  -- Recursive: Child categories
  SELECT 
    c.category_id,
    c.category_name,
    c.parent_category_id,
    ch.level + 1,
    CONCAT(ch.full_path, ' > ', c.category_name)
  FROM categories c
  INNER JOIN category_hierarchy ch ON c.parent_category_id = ch.category_id
)
SELECT 
  ch.category_id,
  CONCAT(REPEAT('  ', ch.level - 1), ch.category_name) AS category_display,
  ch.level,
  ch.full_path,
  COUNT(p.product_id) AS product_count,
  ROUND(AVG(p.price), 2) AS avg_price
FROM category_hierarchy ch
LEFT JOIN products p ON ch.category_id = p.category_id
GROUP BY ch.category_id, ch.category_name, ch.level, ch.full_path
ORDER BY ch.full_path;
03

Window Functions

Window functions are game-changers for analytics. Unlike aggregate functions that collapse rows into groups, window functions perform calculations across sets of rows while preserving individual row identity. This enables rankings, running totals, moving averages, and comparisons between current and previous rows without complex self-joins.

Window Function Basics

A window function performs a calculation across a "window" of rows that are related to the current row. The syntax includes the function name followed by OVER(), which defines the window specification. You can partition data into groups (PARTITION BY) and order rows within partitions (ORDER BY).

Core Concept

Window Function Syntax

Window functions follow the pattern: function_name() OVER (PARTITION BY column1 ORDER BY column2). The PARTITION BY clause divides rows into groups (similar to GROUP BY), and ORDER BY determines the sequence within each partition. The key difference from GROUP BY is that window functions don't collapse rows.

Key Insight: Window functions let you compare each row to aggregate values without losing row-level detail. This is perfect for calculating percentages of totals, ranking within categories, or finding differences from group averages.

-- Basic window function structure
SELECT 
  column1,
  column2,
  aggregate_function() OVER (
    PARTITION BY partition_column
    ORDER BY order_column
    ROWS BETWEEN start AND end  -- Optional frame clause
  ) AS window_result
FROM table_name;

ROW_NUMBER() - Sequential Numbering

ROW_NUMBER() assigns a unique sequential integer to each row within a partition, starting from 1. It's the most straightforward ranking function and perfect when you need a simple ordinal position or want to select the top N rows per group.

-- Number orders for each customer chronologically
SELECT 
  customer_id,
  order_id,
  order_date,
  order_total,
  ROW_NUMBER() OVER (
    PARTITION BY customer_id 
    ORDER BY order_date
  ) AS order_number
FROM orders;
-- Result shows 1st, 2nd, 3rd order for each customer

-- Get first order for each customer
SELECT * FROM (
  SELECT 
    customer_id,
    order_id,
    order_date,
    order_total,
    ROW_NUMBER() OVER (
      PARTITION BY customer_id 
      ORDER BY order_date
    ) AS rn
  FROM orders
) AS ranked
WHERE rn = 1;

-- Top 3 products by sales in each category
SELECT * FROM (
  SELECT 
    category_id,
    product_id,
    product_name,
    total_sales,
    ROW_NUMBER() OVER (
      PARTITION BY category_id 
      ORDER BY total_sales DESC
    ) AS sales_rank
  FROM product_sales
) AS ranked
WHERE sales_rank <= 3;

RANK() and DENSE_RANK() - Handling Ties

RANK() and DENSE_RANK() both handle tied values, but differently. RANK() leaves gaps in numbering after ties, while DENSE_RANK() doesn't skip numbers. Choose RANK() when gaps matter (like sports rankings) and DENSE_RANK() when you want consecutive numbers despite ties.

-- Compare RANK vs DENSE_RANK vs ROW_NUMBER
SELECT 
  employee_id,
  name,
  department,
  salary,
  ROW_NUMBER() OVER (ORDER BY salary DESC) AS row_num,
  RANK() OVER (ORDER BY salary DESC) AS rank,
  DENSE_RANK() OVER (ORDER BY salary DESC) AS dense_rank
FROM employees;
-- If two people tie at #3, RANK shows 3,3,5 while DENSE_RANK shows 3,3,4

-- Rank products by profit margin within each category
SELECT 
  category_name,
  product_name,
  price,
  cost,
  (price - cost) / price * 100 AS profit_margin_pct,
  RANK() OVER (
    PARTITION BY category_name 
    ORDER BY (price - cost) / price DESC
  ) AS profitability_rank
FROM products p
JOIN categories c ON p.category_id = c.category_id
WHERE price > 0;

-- Find departments where employee ranks in top 3 by salary
SELECT 
  department,
  employee_id,
  name,
  salary,
  salary_rank
FROM (
  SELECT 
    department,
    employee_id,
    name,
    salary,
    DENSE_RANK() OVER (
      PARTITION BY department 
      ORDER BY salary DESC
    ) AS salary_rank
  FROM employees
) AS ranked
WHERE salary_rank <= 3
ORDER BY department, salary_rank;
Choosing the Right Ranking Function: Use ROW_NUMBER() when you need unique row numbers regardless of ties. Use RANK() when gaps after ties are meaningful (like "1st, 2nd, 2nd, 4th"). Use DENSE_RANK() when you want consecutive rankings without gaps (like "1st, 2nd, 2nd, 3rd").

LAG() and LEAD() - Accessing Adjacent Rows

LAG() accesses data from a previous row, while LEAD() accesses data from a following row within the same partition. These functions are invaluable for time-series analysis, calculating differences between consecutive records, and comparing current values to historical data.

-- Calculate month-over-month revenue growth
SELECT 
  EXTRACT(YEAR FROM order_date) AS year,
  EXTRACT(MONTH FROM order_date) AS month,
  SUM(order_total) AS monthly_revenue,
  LAG(SUM(order_total)) OVER (ORDER BY 
    EXTRACT(YEAR FROM order_date), 
    EXTRACT(MONTH FROM order_date)
  ) AS prev_month_revenue,
  SUM(order_total) - LAG(SUM(order_total)) OVER (ORDER BY 
    EXTRACT(YEAR FROM order_date), 
    EXTRACT(MONTH FROM order_date)
  ) AS revenue_change,
  ROUND(
    (SUM(order_total) - LAG(SUM(order_total)) OVER (ORDER BY 
      EXTRACT(YEAR FROM order_date), 
      EXTRACT(MONTH FROM order_date)
    )) / LAG(SUM(order_total)) OVER (ORDER BY 
      EXTRACT(YEAR FROM order_date), 
      EXTRACT(MONTH FROM order_date)
    ) * 100, 2
  ) AS growth_pct
FROM orders
GROUP BY EXTRACT(YEAR FROM order_date), EXTRACT(MONTH FROM order_date)
ORDER BY year, month;

-- Track customer purchase patterns
SELECT 
  customer_id,
  order_id,
  order_date,
  order_total,
  LAG(order_date) OVER (
    PARTITION BY customer_id 
    ORDER BY order_date
  ) AS previous_order_date,
  order_date - LAG(order_date) OVER (
    PARTITION BY customer_id 
    ORDER BY order_date
  ) AS days_since_last_order,
  LEAD(order_date) OVER (
    PARTITION BY customer_id 
    ORDER BY order_date
  ) AS next_order_date
FROM orders
ORDER BY customer_id, order_date;

-- Compare current price to previous and next price changes
SELECT 
  product_id,
  effective_date,
  price,
  LAG(price, 1, 0) OVER (
    PARTITION BY product_id 
    ORDER BY effective_date
  ) AS previous_price,
  LEAD(price, 1, price) OVER (
    PARTITION BY product_id 
    ORDER BY effective_date
  ) AS next_price,
  price - LAG(price, 1, 0) OVER (
    PARTITION BY product_id 
    ORDER BY effective_date
  ) AS price_increase
FROM price_history;
LAG/LEAD Parameters: Both functions accept three parameters: the column to access, the offset (number of rows back/forward, default 1), and a default value to return when the offset goes beyond the partition boundary.

Aggregate Window Functions - Running Totals & Moving Averages

Standard aggregate functions (SUM, AVG, COUNT, MIN, MAX) become window functions when paired with OVER(). The ROWS or RANGE clause defines the window frame - which rows to include in the calculation relative to the current row.

-- Running total of sales
SELECT 
  order_date,
  order_id,
  order_total,
  SUM(order_total) OVER (
    ORDER BY order_date
    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
  ) AS running_total,
  AVG(order_total) OVER (
    ORDER BY order_date
    ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
  ) AS seven_day_avg
FROM orders
ORDER BY order_date;

-- Calculate percentage of total sales by product
SELECT 
  product_id,
  product_name,
  category,
  sales_amount,
  SUM(sales_amount) OVER () AS total_sales,
  ROUND(sales_amount * 100.0 / SUM(sales_amount) OVER (), 2) AS pct_of_total,
  SUM(sales_amount) OVER (PARTITION BY category) AS category_total,
  ROUND(sales_amount * 100.0 / SUM(sales_amount) OVER (PARTITION BY category), 2) AS pct_of_category
FROM product_sales
ORDER BY category, sales_amount DESC;

-- Moving average for trend analysis
SELECT 
  sale_date,
  daily_revenue,
  AVG(daily_revenue) OVER (
    ORDER BY sale_date
    ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
  ) AS seven_day_moving_avg,
  AVG(daily_revenue) OVER (
    ORDER BY sale_date
    ROWS BETWEEN 29 PRECEDING AND CURRENT ROW
  ) AS thirty_day_moving_avg
FROM daily_sales
ORDER BY sale_date;
Frame Clauses
Frame Type Meaning
ROWS BETWEEN Physical row positions
RANGE BETWEEN Logical value ranges
UNBOUNDED PRECEDING Start of partition
CURRENT ROW Current row position
UNBOUNDED FOLLOWING End of partition
N PRECEDING N rows before current
N FOLLOWING N rows after current
Common Window Functions
Function Purpose
ROW_NUMBER() Unique sequential numbers
RANK() Ranking with gaps
DENSE_RANK() Ranking without gaps
LAG() Access previous row
LEAD() Access next row
NTILE(n) Divide into n groups
FIRST_VALUE() First value in window
LAST_VALUE() Last value in window

Practice: Window Functions

Task: Write a query to rank employees by salary (highest to lowest) within each department using DENSE_RANK().

Show Solution
SELECT 
  department,
  employee_id,
  employee_name,
  salary,
  DENSE_RANK() OVER (
    PARTITION BY department 
    ORDER BY salary DESC
  ) AS salary_rank
FROM employees
ORDER BY department, salary_rank;

Task: For the daily_sales table, calculate the running total and 7-day moving average of revenue.

Show Solution
SELECT 
  sale_date,
  daily_revenue,
  SUM(daily_revenue) OVER (
    ORDER BY sale_date
    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
  ) AS running_total,
  AVG(daily_revenue) OVER (
    ORDER BY sale_date
    ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
  ) AS seven_day_moving_avg,
  ROUND(
    daily_revenue * 100.0 / SUM(daily_revenue) OVER (), 
    2
  ) AS pct_of_total_sales
FROM daily_sales
ORDER BY sale_date;

Task: Calculate monthly revenue and year-over-year growth percentage using LAG().

Show Solution
WITH monthly_revenue AS (
  SELECT 
    EXTRACT(YEAR FROM order_date) AS year,
    EXTRACT(MONTH FROM order_date) AS month,
    SUM(order_total) AS revenue
  FROM orders
  GROUP BY EXTRACT(YEAR FROM order_date), EXTRACT(MONTH FROM order_date)
)
SELECT 
  year,
  month,
  revenue AS current_revenue,
  LAG(revenue, 12) OVER (
    PARTITION BY month 
    ORDER BY year
  ) AS prev_year_revenue,
  ROUND(
    (revenue - LAG(revenue, 12) OVER (
      PARTITION BY month 
      ORDER BY year
    )) * 100.0 / LAG(revenue, 12) OVER (
      PARTITION BY month 
      ORDER BY year
    ), 
    2
  ) AS yoy_growth_pct
FROM monthly_revenue
ORDER BY year, month;
04

Date and String Manipulation

Clean, transform, and analyze text and temporal data with SQL's powerful date and string functions. Master techniques to extract date components, perform date arithmetic, format timestamps, manipulate text, and clean messy string data for robust analytics.

Date Functions - Extracting Components

Working with dates requires extracting specific components like year, month, day, or quarter. The EXTRACT() function is SQL standard and works across most databases. You can pull out year, month, day, hour, minute, second, day of week, and more.

-- Extract date parts
SELECT 
  order_date,
  EXTRACT(YEAR FROM order_date) AS year,
  EXTRACT(MONTH FROM order_date) AS month,
  EXTRACT(DAY FROM order_date) AS day,
  EXTRACT(QUARTER FROM order_date) AS quarter,
  EXTRACT(DOW FROM order_date) AS day_of_week,  -- 0=Sunday
  EXTRACT(WEEK FROM order_date) AS week_number
FROM orders;

-- Aggregate sales by year and quarter
SELECT 
  EXTRACT(YEAR FROM order_date) AS year,
  EXTRACT(QUARTER FROM order_date) AS quarter,
  COUNT(*) AS order_count,
  SUM(order_total) AS total_revenue,
  AVG(order_total) AS avg_order_value
FROM orders
GROUP BY EXTRACT(YEAR FROM order_date), EXTRACT(QUARTER FROM order_date)
ORDER BY year, quarter;

-- Find orders placed on weekends
SELECT 
  order_id,
  order_date,
  CASE EXTRACT(DOW FROM order_date)
    WHEN 0 THEN 'Sunday'
    WHEN 6 THEN 'Saturday'
  END AS weekend_day,
  order_total
FROM orders
WHERE EXTRACT(DOW FROM order_date) IN (0, 6)
ORDER BY order_date;

Date Arithmetic and Intervals

SQL allows you to add or subtract time intervals from dates. This is essential for calculating deadlines, finding records within date ranges, or generating time-based reports. Use INTERVAL for standard date math operations.

-- Add/subtract time intervals
SELECT 
  order_date,
  order_date + INTERVAL '30 days' AS due_date,
  order_date - INTERVAL '7 days' AS week_ago,
  order_date + INTERVAL '3 months' AS three_months_later,
  CURRENT_DATE - order_date AS days_since_order
FROM orders;

-- Find customers who haven't ordered in 90 days
SELECT 
  c.customer_id,
  c.name,
  MAX(o.order_date) AS last_order_date,
  CURRENT_DATE - MAX(o.order_date) AS days_since_last_order
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.customer_id, c.name
HAVING CURRENT_DATE - MAX(o.order_date) > 90
   OR MAX(o.order_date) IS NULL
ORDER BY days_since_last_order DESC;

-- Calculate age from birthdate
SELECT 
  employee_id,
  name,
  birth_date,
  EXTRACT(YEAR FROM AGE(CURRENT_DATE, birth_date)) AS age,
  AGE(CURRENT_DATE, birth_date) AS exact_age
FROM employees;
Date Functions

Common Date Operations

DATE_TRUNC truncates a timestamp to a specified precision (year, month, day, hour). AGE calculates the interval between two dates. NOW() returns the current timestamp, while CURRENT_DATE returns just the date without time.

Pro Tip: DATE_TRUNC is perfect for grouping by time periods: DATE_TRUNC('month', timestamp) groups all timestamps in the same month together, ideal for time-series analysis and reporting.

Date Formatting and Conversion

Format dates for display or convert strings to dates using TO_CHAR() and TO_DATE(). TO_CHAR() converts dates to formatted strings, while TO_DATE() parses string dates into proper date types using format patterns.

-- Format dates for display
SELECT 
  order_date,
  TO_CHAR(order_date, 'YYYY-MM-DD') AS iso_format,
  TO_CHAR(order_date, 'Month DD, YYYY') AS long_format,
  TO_CHAR(order_date, 'MM/DD/YY') AS short_format,
  TO_CHAR(order_date, 'Day, Mon DD YYYY') AS full_format,
  TO_CHAR(order_date, 'HH24:MI:SS') AS time_only
FROM orders;

-- Convert strings to dates
SELECT 
  TO_DATE('2026-01-15', 'YYYY-MM-DD') AS date1,
  TO_DATE('01/15/2026', 'MM/DD/YYYY') AS date2,
  TO_DATE('15-Jan-2026', 'DD-Mon-YYYY') AS date3;

-- Parse date strings in queries
SELECT *
FROM orders
WHERE order_date >= TO_DATE('2025-01-01', 'YYYY-MM-DD')
  AND order_date < TO_DATE('2026-01-01', 'YYYY-MM-DD');

String Functions - Concatenation and Transformation

String manipulation is crucial for data cleaning and formatting. CONCAT() joins strings, UPPER() and LOWER() change case, TRIM() removes whitespace, and SUBSTRING() extracts portions of text.

-- Concatenate strings
SELECT 
  first_name,
  last_name,
  CONCAT(first_name, ' ', last_name) AS full_name,
  CONCAT(last_name, ', ', first_name) AS formal_name,
  CONCAT(email, ' (', phone, ')') AS contact_info
FROM customers;

-- Change case
SELECT 
  product_name,
  UPPER(product_name) AS uppercase,
  LOWER(product_name) AS lowercase,
  INITCAP(product_name) AS title_case
FROM products;

-- Clean whitespace
SELECT 
  customer_name,
  TRIM(customer_name) AS trimmed,
  TRIM(LEADING ' ' FROM customer_name) AS left_trim,
  TRIM(TRAILING ' ' FROM customer_name) AS right_trim,
  LENGTH(customer_name) AS original_length,
  LENGTH(TRIM(customer_name)) AS trimmed_length
FROM customers;

-- Extract substrings
SELECT 
  email,
  SUBSTRING(email FROM 1 FOR POSITION('@' IN email) - 1) AS username,
  SUBSTRING(email FROM POSITION('@' IN email) + 1) AS domain,
  LEFT(phone, 3) AS area_code,
  RIGHT(phone, 4) AS last_four_digits
FROM customers;

String Searching and Replacement

Search within strings using POSITION() or LIKE patterns, and replace text with REPLACE(). These functions are essential for data cleaning, validation, and transformation tasks.

-- Search for substrings
SELECT 
  product_name,
  description
FROM products
WHERE POSITION('premium' IN LOWER(description)) > 0
   OR description LIKE '%premium%';

-- Replace text
SELECT 
  product_name,
  REPLACE(product_name, 'Jr.', 'Junior') AS formatted_name,
  REPLACE(phone, '-', '') AS phone_no_dashes,
  REPLACE(REPLACE(email, '.', ''), '@', ' at ') AS obfuscated_email
FROM products;

-- Clean and standardize data
SELECT 
  customer_id,
  UPPER(TRIM(REPLACE(name, '  ', ' '))) AS cleaned_name,
  REPLACE(REPLACE(REPLACE(phone, '-', ''), '(', ''), ')', '') AS cleaned_phone,
  LOWER(TRIM(email)) AS cleaned_email
FROM customers;

-- Pattern matching with SIMILAR TO (regex-like)
SELECT 
  email
FROM customers
WHERE email SIMILAR TO '%[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,}%';
String Length and Character Functions: Use LENGTH() to count characters, CHAR_LENGTH() for character count (same as LENGTH in most databases), and OCTET_LENGTH() for byte count. For multi-byte character sets (like UTF-8), these can differ.

Advanced String Operations

Split strings, pad text, reverse strings, and perform other advanced manipulations. SPLIT_PART() divides strings by delimiters, LPAD/RPAD add padding, and string aggregation functions combine multiple rows into one string.

-- Split delimited strings
SELECT 
  full_address,
  SPLIT_PART(full_address, ',', 1) AS street,
  SPLIT_PART(full_address, ',', 2) AS city,
  SPLIT_PART(full_address, ',', 3) AS state
FROM addresses;

-- Pad strings to fixed width
SELECT 
  order_id,
  LPAD(order_id::TEXT, 10, '0') AS padded_order_id,
  RPAD(product_code, 15, '-') AS padded_code
FROM orders;

-- Aggregate strings from multiple rows
SELECT 
  category_id,
  category_name,
  STRING_AGG(product_name, ', ' ORDER BY product_name) AS products_list,
  COUNT(*) AS product_count
FROM products
GROUP BY category_id, category_name;

-- Reverse strings (useful for certain parsing tasks)
SELECT 
  email,
  REVERSE(email) AS reversed,
  SUBSTRING(REVERSE(email) FROM 1 FOR POSITION('.' IN REVERSE(email)) - 1) AS tld_reversed
FROM customers;
Key Date Functions
EXTRACT(part FROM date)Get date component
DATE_TRUNC('unit', date)Truncate to precision
AGE(date1, date2)Calculate interval
NOW(), CURRENT_DATECurrent timestamp/date
INTERVAL '1 day'Time duration literal
TO_CHAR(date, format)Format date as string
TO_DATE(string, format)Parse string to date
Essential String Functions
CONCAT(str1, str2, ...)Join strings
UPPER(), LOWER()Change case
TRIM(), LTRIM(), RTRIM()Remove whitespace
SUBSTRING(str FROM pos)Extract portion
REPLACE(str, from, to)Substitute text
LENGTH(), CHAR_LENGTH()String length
POSITION(sub IN str)Find substring

Practice: Date & String Manipulation

Task: Write a query to show order year, month name, and total revenue grouped by year-month.

Show Solution
SELECT 
  EXTRACT(YEAR FROM order_date) AS year,
  TO_CHAR(order_date, 'Month') AS month_name,
  EXTRACT(MONTH FROM order_date) AS month_num,
  COUNT(*) AS order_count,
  SUM(order_total) AS total_revenue,
  ROUND(AVG(order_total), 2) AS avg_order_value
FROM orders
GROUP BY 
  EXTRACT(YEAR FROM order_date), 
  TO_CHAR(order_date, 'Month'),
  EXTRACT(MONTH FROM order_date)
ORDER BY year, month_num;

Task: Clean customer data: trim whitespace, title case names, lowercase emails, extract domain from email.

Show Solution
SELECT 
  customer_id,
  INITCAP(TRIM(REGEXP_REPLACE(name, '\s+', ' '))) AS cleaned_name,
  LOWER(TRIM(email)) AS cleaned_email,
  SUBSTRING(
    LOWER(TRIM(email)) 
    FROM POSITION('@' IN LOWER(TRIM(email))) + 1
  ) AS email_domain,
  CASE 
    WHEN SUBSTRING(LOWER(TRIM(email)) FROM POSITION('@' IN LOWER(TRIM(email))) + 1) 
         LIKE '%.com' THEN 'Commercial'
    WHEN SUBSTRING(LOWER(TRIM(email)) FROM POSITION('@' IN LOWER(TRIM(email))) + 1) 
         LIKE '%.org' THEN 'Organization'
    WHEN SUBSTRING(LOWER(TRIM(email)) FROM POSITION('@' IN LOWER(TRIM(email))) + 1) 
         LIKE '%.edu' THEN 'Education'
    ELSE 'Other'
  END AS email_type
FROM customers;

Task: Calculate days as customer, cohort month, and revenue by cohort using date functions.

Show Solution
WITH customer_cohorts AS (
  SELECT 
    c.customer_id,
    c.name,
    MIN(o.order_date) AS first_order_date,
    MAX(o.order_date) AS last_order_date,
    DATE_TRUNC('month', MIN(o.order_date)) AS cohort_month,
    TO_CHAR(MIN(o.order_date), 'YYYY-MM') AS cohort_label,
    COUNT(DISTINCT o.order_id) AS total_orders,
    SUM(o.order_total) AS lifetime_value,
    CURRENT_DATE - MIN(o.order_date) AS days_as_customer,
    MAX(o.order_date) - MIN(o.order_date) AS days_active
  FROM customers c
  LEFT JOIN orders o ON c.customer_id = o.customer_id
  GROUP BY c.customer_id, c.name
)
SELECT 
  cohort_label,
  COUNT(*) AS customers_in_cohort,
  ROUND(AVG(lifetime_value), 2) AS avg_ltv,
  ROUND(AVG(days_as_customer), 0) AS avg_customer_age_days,
  ROUND(AVG(total_orders), 1) AS avg_orders_per_customer,
  SUM(lifetime_value) AS cohort_total_revenue
FROM customer_cohorts
WHERE first_order_date IS NOT NULL
GROUP BY cohort_label, cohort_month
ORDER BY cohort_month DESC;
05

UNION and UNION ALL

Set operations combine results from multiple SELECT statements into a single result set. UNION merges queries while removing duplicates, UNION ALL preserves all rows including duplicates. These operations are essential for consolidating data from multiple sources or tables with similar structures.

Understanding UNION

UNION combines the results of two or more SELECT statements and returns only distinct rows. Both queries must have the same number of columns with compatible data types in corresponding positions. The column names from the first SELECT are used in the result.

Key Difference

UNION vs UNION ALL

UNION performs a distinct operation, removing duplicate rows from the combined result set. UNION ALL includes all rows from all queries, even if they're duplicates. UNION ALL is faster because it skips the deduplication step.

Performance Tip: Use UNION ALL when you know there are no duplicates or when duplicates are acceptable. It's significantly faster than UNION, especially with large datasets, because it avoids the sorting and comparison required for deduplication.

-- Basic UNION syntax (removes duplicates)
SELECT customer_id, name, 'Active' AS status
FROM active_customers
UNION
SELECT customer_id, name, 'Inactive' AS status
FROM inactive_customers
ORDER BY name;

-- UNION ALL (keeps all rows including duplicates)
SELECT product_id, product_name, 'Online' AS channel
FROM online_sales
UNION ALL
SELECT product_id, product_name, 'Store' AS channel
FROM store_sales
ORDER BY product_name;

Practical UNION Examples

UNION is perfect for consolidating similar data from different tables, creating unified reports from segmented data, or combining historical and current records. The key requirement is that all SELECT statements must return the same number of columns with compatible types.

-- Combine data from different time periods
SELECT 
  order_id,
  customer_id,
  order_date,
  order_total,
  'Current' AS data_source
FROM orders_2026
UNION ALL
SELECT 
  order_id,
  customer_id,
  order_date,
  order_total,
  'Archive' AS data_source
FROM orders_archive
ORDER BY order_date DESC;

-- Create unified contact list from multiple tables
SELECT 
  employee_id AS contact_id,
  first_name,
  last_name,
  email,
  phone,
  'Employee' AS contact_type
FROM employees
UNION
SELECT 
  customer_id AS contact_id,
  first_name,
  last_name,
  email,
  phone,
  'Customer' AS contact_type
FROM customers
UNION
SELECT 
  supplier_id AS contact_id,
  contact_first_name,
  contact_last_name,
  email,
  phone,
  'Supplier' AS contact_type
FROM suppliers
ORDER BY last_name, first_name;

-- Combine different product categories
SELECT 
  product_id,
  product_name,
  price,
  'Electronics' AS category
FROM electronics
WHERE price > 100
UNION ALL
SELECT 
  product_id,
  product_name,
  price,
  'Appliances' AS category
FROM appliances
WHERE price > 100
ORDER BY price DESC;

UNION with Aggregations

You can use UNION to combine aggregated results from different tables or time periods, creating comprehensive summary reports. This is useful for comparing metrics across different segments or building dashboard queries.

-- Compare sales by channel
SELECT 
  'Online' AS channel,
  COUNT(*) AS order_count,
  SUM(order_total) AS total_revenue,
  AVG(order_total) AS avg_order_value
FROM online_orders
UNION ALL
SELECT 
  'Store' AS channel,
  COUNT(*) AS order_count,
  SUM(order_total) AS total_revenue,
  AVG(order_total) AS avg_order_value
FROM store_orders
UNION ALL
SELECT 
  'Phone' AS channel,
  COUNT(*) AS order_count,
  SUM(order_total) AS total_revenue,
  AVG(order_total) AS avg_order_value
FROM phone_orders;

-- Monthly summary with year-to-date totals
WITH monthly_sales AS (
  SELECT 
    EXTRACT(YEAR FROM order_date) AS year,
    EXTRACT(MONTH FROM order_date) AS month,
    'Monthly' AS period_type,
    SUM(order_total) AS revenue
  FROM orders
  GROUP BY EXTRACT(YEAR FROM order_date), EXTRACT(MONTH FROM order_date)
),
ytd_sales AS (
  SELECT 
    EXTRACT(YEAR FROM order_date) AS year,
    NULL AS month,
    'YTD Total' AS period_type,
    SUM(order_total) AS revenue
  FROM orders
  GROUP BY EXTRACT(YEAR FROM order_date)
)
SELECT * FROM monthly_sales
UNION ALL
SELECT * FROM ytd_sales
ORDER BY year, month NULLS LAST;

-- Product performance across regions
SELECT 
  'North' AS region,
  category,
  COUNT(*) AS units_sold,
  SUM(sale_amount) AS revenue
FROM north_sales
GROUP BY category
UNION ALL
SELECT 
  'South' AS region,
  category,
  COUNT(*) AS units_sold,
  SUM(sale_amount) AS revenue
FROM south_sales
GROUP BY category
UNION ALL
SELECT 
  'East' AS region,
  category,
  COUNT(*) AS units_sold,
  SUM(sale_amount) AS revenue
FROM east_sales
GROUP BY category
UNION ALL
SELECT 
  'West' AS region,
  category,
  COUNT(*) AS units_sold,
  SUM(sale_amount) AS revenue
FROM west_sales
GROUP BY category
ORDER BY region, revenue DESC;
UNION Requirements: All SELECT statements in a UNION must have (1) the same number of columns, (2) columns in the same order, and (3) compatible data types in corresponding positions. Column names come from the first SELECT.

UNION with CTEs for Complex Reports

Combining CTEs with UNION creates powerful reporting queries. Build separate CTEs for different data segments, then union them together for comprehensive analysis. This pattern is common in business intelligence and dashboard queries.

-- Comprehensive sales report with multiple dimensions
WITH product_sales AS (
  SELECT 
    'Product' AS dimension,
    product_name AS segment,
    SUM(quantity * price) AS revenue,
    COUNT(DISTINCT order_id) AS transaction_count
  FROM order_details od
  JOIN products p ON od.product_id = p.product_id
  GROUP BY product_name
),
category_sales AS (
  SELECT 
    'Category' AS dimension,
    c.category_name AS segment,
    SUM(od.quantity * od.price) AS revenue,
    COUNT(DISTINCT od.order_id) AS transaction_count
  FROM order_details od
  JOIN products p ON od.product_id = p.product_id
  JOIN categories c ON p.category_id = c.category_id
  GROUP BY c.category_name
),
region_sales AS (
  SELECT 
    'Region' AS dimension,
    cu.region AS segment,
    SUM(od.quantity * od.price) AS revenue,
    COUNT(DISTINCT o.order_id) AS transaction_count
  FROM orders o
  JOIN customers cu ON o.customer_id = cu.customer_id
  JOIN order_details od ON o.order_id = od.order_id
  GROUP BY cu.region
)
SELECT * FROM product_sales
UNION ALL
SELECT * FROM category_sales
UNION ALL
SELECT * FROM region_sales
ORDER BY dimension, revenue DESC;
When to Use UNION vs JOIN: Use UNION when combining rows (stacking tables vertically) with the same structure. Use JOIN when combining columns (adding columns horizontally) from related tables. UNION is for consolidation, JOIN is for enrichment.

Practice: UNION Operations

Task: Write a query to combine orders from current_orders and archived_orders tables, adding a source indicator.

Show Solution
SELECT 
  order_id,
  customer_id,
  order_date,
  order_total,
  'Current' AS source
FROM current_orders
UNION ALL
SELECT 
  order_id,
  customer_id,
  order_date,
  order_total,
  'Archived' AS source
FROM archived_orders
ORDER BY order_date DESC;

Task: Combine employees, customers, and vendors into a single contact list with type indicator.

Show Solution
SELECT 
  employee_id AS contact_id,
  CONCAT(first_name, ' ', last_name) AS full_name,
  email,
  phone,
  department AS affiliation,
  'Employee' AS contact_type
FROM employees
UNION
SELECT 
  customer_id AS contact_id,
  CONCAT(first_name, ' ', last_name) AS full_name,
  email,
  phone,
  company AS affiliation,
  'Customer' AS contact_type
FROM customers
UNION
SELECT 
  vendor_id AS contact_id,
  contact_name AS full_name,
  contact_email AS email,
  contact_phone AS phone,
  company_name AS affiliation,
  'Vendor' AS contact_type
FROM vendors
ORDER BY contact_type, full_name;

Task: Create a hierarchical sales report showing totals by product, category, and grand total using UNION.

Show Solution
WITH product_level AS (
  SELECT 
    1 AS level_order,
    'Product' AS level_type,
    c.category_name AS group_name,
    p.product_name AS detail,
    COUNT(*) AS order_count,
    SUM(od.quantity) AS units_sold,
    SUM(od.quantity * od.unit_price) AS revenue
  FROM order_details od
  JOIN products p ON od.product_id = p.product_id
  JOIN categories c ON p.category_id = c.category_id
  GROUP BY c.category_name, p.product_name
),
category_level AS (
  SELECT 
    2 AS level_order,
    'Category Total' AS level_type,
    c.category_name AS group_name,
    '' AS detail,
    COUNT(*) AS order_count,
    SUM(od.quantity) AS units_sold,
    SUM(od.quantity * od.unit_price) AS revenue
  FROM order_details od
  JOIN products p ON od.product_id = p.product_id
  JOIN categories c ON p.category_id = c.category_id
  GROUP BY c.category_name
),
grand_total AS (
  SELECT 
    3 AS level_order,
    'Grand Total' AS level_type,
    'ALL CATEGORIES' AS group_name,
    '' AS detail,
    COUNT(*) AS order_count,
    SUM(od.quantity) AS units_sold,
    SUM(od.quantity * od.unit_price) AS revenue
  FROM order_details od
)
SELECT 
  level_type,
  group_name,
  detail,
  order_count,
  units_sold,
  ROUND(revenue, 2) AS revenue,
  ROUND(revenue * 100.0 / (SELECT revenue FROM grand_total), 2) AS pct_of_total
FROM product_level
UNION ALL
SELECT 
  level_type,
  group_name,
  detail,
  order_count,
  units_sold,
  ROUND(revenue, 2) AS revenue,
  ROUND(revenue * 100.0 / (SELECT revenue FROM grand_total), 2) AS pct_of_total
FROM category_level
UNION ALL
SELECT 
  level_type,
  group_name,
  detail,
  order_count,
  units_sold,
  ROUND(revenue, 2) AS revenue,
  100.00 AS pct_of_total
FROM grand_total
ORDER BY level_order, group_name, revenue DESC;

Key Takeaways

Subqueries

Nested queries enable complex filtering, calculations, and data retrieval in SELECT, FROM, and WHERE clauses for dynamic analysis

CTEs

Common Table Expressions improve query readability and enable recursive queries for hierarchical data analysis

Window Functions

Perform calculations across row sets without collapsing results, perfect for rankings, running totals, and row comparisons

Date Manipulation

Extract, format, and calculate with dates using functions like EXTRACT, DATE_TRUNC, DATE_ADD, and AGE for time analysis

String Operations

Clean and transform text data with CONCAT, SUBSTRING, REPLACE, UPPER, LOWER, and TRIM for data quality

Set Operations

UNION combines distinct results from multiple queries, while UNION ALL preserves all rows including duplicates efficiently

Knowledge Check

Test your understanding of advanced SQL techniques:

Question 1 of 6

What is the main difference between a scalar subquery and a table subquery?

Question 2 of 6

What keyword do you use to define a Common Table Expression (CTE)?

Question 3 of 6

Which window function would you use to assign a sequential number to each row within a partition, resetting for each partition?

Question 4 of 6

What is the key difference between LAG() and LEAD() window functions?

Question 5 of 6

Which SQL function would you use to extract just the year from a date column?

Question 6 of 6

What happens when you use UNION instead of UNION ALL to combine two result sets?

Answer all questions to check your score