Module 4.1

Descriptive Statistics

Learn to summarize and describe datasets using measures of central tendency, variability, distribution shapes, and visual tools like box plots. Master the foundation of data analysis!

40 min read
Beginner
Hands-on Examples
What You'll Learn
  • Measures of central tendency (mean, median, mode)
  • Measures of spread and variability
  • Distribution shapes (normal vs. skewed)
  • Outlier detection methods
  • Five-number summary and box plots
Contents
01

Measures of Central Tendency

Measures of central tendency help us identify the "typical" or "center" value in a dataset. These statistics provide a single number that represents where most data points cluster. The three most common measures are mean, median, and mode, and each serves different purposes depending on your data's characteristics. Understanding when to use each measure is crucial for accurate data analysis and avoiding misleading conclusions.

Central Tendency

A single value that attempts to describe a dataset by identifying the central position within that set of data. It summarizes an entire distribution with one representative value.

The Mean (Average)

The mean is the arithmetic average, calculated by summing all values and dividing by the count. It's the most commonly used measure but is sensitive to extreme values (outliers).

How to Calculate the Mean

Imagine you're analyzing monthly sales for a small business. You have five months of data, and you want to know the typical monthly sales figure. The mean gives you this by treating all values equally - it adds up all the sales and divides by how many months you have.

In Excel, the =AVERAGE() function does this calculation automatically. You simply select the range of cells containing your data, and Excel computes the sum divided by the count. However, there's an important caveat: the mean is sensitive to outliers. If one value is much higher or lower than the others, it can pull the mean in that direction, giving a misleading picture of what's "typical."

Calculating Mean in Excel
// Sample data: Sales figures (in dollars)
// Let's say these are monthly sales for a small store
A1: 45000
A2: 52000
A3: 48000
A4: 51000
A5: 150000  // Outlier - maybe a special promotion month

// Calculate mean using AVERAGE function
=AVERAGE(A1:A5)  // Result: 69200
// This adds: 45000+52000+48000+51000+150000 = 346000
// Then divides by 5 months: 346000 ÷ 5 = 69200

// Calculate mean WITHOUT the outlier
=AVERAGE(A1:A4)  // Result: 49000
// This adds: 45000+52000+48000+51000 = 196000
// Then divides by 4 months: 196000 ÷ 4 = 49000

// Notice the difference!
// With outlier: $69,200 (inflated by the $150K month)
// Without outlier: $49,000 (more representative of typical sales)

Understanding the Results: In this example, the mean of $69,200 is misleading because it's pulled up by the outlier ($150,000). The typical monthly sales are actually around $45,000-$52,000. The mean without the outlier ($49,000) better represents normal business performance. This demonstrates why you should always examine your data for outliers before blindly trusting the mean.

Real-World Application: If you reported "average monthly sales are $69,200" to your manager, they might expect every month to be around that figure. But in reality, four out of five months were much lower. This is why understanding your data's distribution is crucial before choosing which average to report.

When to Use Mean: Use the mean when your data is symmetrically distributed without significant outliers. It's ideal for continuous numerical data like heights, temperatures, or test scores in a normal distribution.

The Median (Middle Value)

The median is the middle value when data is sorted in order. For even-numbered datasets, it's the average of the two middle values. The median is resistant to outliers, making it more reliable for skewed distributions.

How to Find the Median

Think of the median as the "50th percentile" - exactly half the values are below it, and half are above it. To find it manually, you would: (1) Sort all your values from smallest to largest, (2) Find the middle position. If you have an odd number of values (like 5), the median is the 3rd value. If you have an even number (like 6), the median is the average of the 3rd and 4th values.

The beauty of the median is its resistance to outliers. An extreme value at either end doesn't affect it much because the median only cares about the middle position, not the actual values at the extremes. This makes it perfect for data with outliers, like income (where billionaires skew the mean) or house prices (where mansions distort averages).

In Excel, the =MEDIAN() function handles all the sorting and calculation for you. It's particularly useful when reporting "typical" values for data that might contain a few extreme cases.

Finding Median in Excel
// Same sales data: 45000, 52000, 48000, 51000, 150000
// Excel automatically sorts it: 45000, 48000, 51000, 52000, 150000
//                      Position:   1st    2nd    3rd    4th     5th

// Calculate median using MEDIAN function
=MEDIAN(A1:A5)  // Result: 51000
// With 5 values (odd number), the median is the 3rd position
// That's 51000 - the middle value

// Let's understand WHY this is better:
// Mean was: $69,200 (pulled up by outlier)
// Median is: $51,000 (not affected by outlier)
// $51,000 is much closer to the typical sales ($45K-$52K)

// What if we had an EVEN number of values?
// Data: 45000, 48000, 51000, 52000
=MEDIAN(A1:A4)  // Result: 49500
// With 4 values (even number), median = average of 2 middle values
// That's (48000 + 51000) ÷ 2 = 49500

Interpreting the Results: The median of $51,000 tells us that half the months had sales below $51K and half had sales above $51K. This is a much more accurate representation of "typical" performance than the mean ($69,200), which was distorted by the one exceptional month. When you see the median, you know it represents the middle of your data distribution.

Key Insight: The big gap between mean ($69,200) and median ($51,000) is itself informative - it tells you that your data is right-skewed (has high outliers). When mean > median by a lot, you have some unusually high values pulling the mean up. When median > mean, you have some unusually low values pulling the mean down.

Scenario Mean Median Best Choice
Test Scores: 78, 82, 85, 88, 90 84.6 85 Mean (no outliers)
Home Prices: $180K, $195K, $210K, $1.2M $446K $202K Median (outlier present)
Employee Ages: 25, 28, 30, 32, 35 30 30 Either (symmetric)

The Mode (Most Frequent)

The mode is the value that appears most frequently in a dataset. A dataset can have no mode (all values unique), one mode (unimodal), or multiple modes (bimodal, multimodal). It's the only measure that can be used with categorical data.

Understanding the Mode

The mode answers the question: "What value shows up most often?" Unlike mean and median which work only with numbers, the mode works with any type of data - numbers, categories, or text. This makes it incredibly versatile. For example, you can find the most popular shirt size, the most common customer complaint, or the most frequent purchase amount.

In Excel, =MODE.SNGL() finds the single most frequent value in a numeric dataset. If multiple values tie for most frequent, it returns the first one it encounters. For categorical data (like shirt sizes or colors), you'll need to use other methods like pivot tables or COUNTIF functions to manually count frequencies.

When datasets have multiple modes (bimodal or multimodal), this often reveals interesting patterns. For instance, if customer ages have two modes at 25 and 55, you might have two distinct customer segments - young professionals and pre-retirees - each with different needs.

Finding Mode in Excel
// Customer purchase frequencies (number of items bought)
B1: 2, B2: 3, B3: 2, B4: 5, B5: 2, B6: 4, B7: 3

// Find the most common purchase quantity
=MODE.SNGL(B1:B7)  // Result: 2
// Value "2" appears 3 times (most frequent)
// Value "3" appears 2 times
// Values "4" and "5" each appear 1 time
// So the mode is 2 - most customers buy 2 items

// Example with categorical data (can't use MODE.SNGL)
// Shirt sizes ordered: S, M, M, L, M, XL, L, L
// Count manually: M appears 3 times, L appears 3 times
// This dataset is BIMODAL - it has TWO modes: M and L
// This tells us we need to stock both sizes heavily!

Interpreting Mode Results: When the mode is 2 items, it means more customers buy exactly 2 items than any other quantity. This is actionable business intelligence - you might create "2-item bundles" or "buy 2 get a discount" promotions to align with natural customer behavior.

Bimodal Example: The shirt size example showing both M and L as modes (bimodal distribution) suggests two primary customer body types. If you were managing inventory, you'd want to ensure you stock plenty of both M and L sizes, rather than assuming all sizes sell equally.

Categorical Data: Mode is the ONLY measure of central tendency that works with categorical data (like colors, brands, or survey responses). You can't calculate the mean of "red, blue, blue, green" but you can identify that "blue" is the mode.

Comparing the Three Measures

Mean
  • Uses all data points
  • Mathematical precision
  • Affected by outliers
  • Can be misleading with skew
Median
  • Resistant to outliers
  • Good for skewed data
  • Ignores actual values
  • Less mathematically useful
Mode
  • Works with categorical data
  • Easy to understand
  • Can have multiple modes
  • May not exist

Practice Exercises: Central Tendency

Task: You have daily sales figures for a week: $1200, $1500, $1200, $1800, $1200, $2200, $1600. Calculate the mean, median, and mode. Which measure best represents typical daily sales?

Steps:

  1. Sort the data
  2. Calculate mean (sum ÷ count)
  3. Find median (middle value)
  4. Identify mode (most frequent)
Show Solution
// Sorted data: 1200, 1200, 1200, 1500, 1600, 1800, 2200

// Mean calculation
Sum = 1200 + 1500 + 1200 + 1800 + 1200 + 2200 + 1600 = 10700
Mean = 10700 ÷ 7 = $1528.57

// Median (middle value, position 4)
Median = $1500

// Mode (appears 3 times)
Mode = $1200

// Best measure: Median ($1500) best represents typical sales
// because the mode is pulled down by the frequent low value,
// and the mean is slightly inflated by the $2200 outlier.

Task: A small company has 10 employees with salaries: $35K, $38K, $40K, $42K, $45K, $45K, $48K, $50K, $52K, $180K (CEO). Calculate all three measures and explain which is most appropriate for describing "typical" employee compensation.

Show Solution
// Excel formulas
Salaries in A1:A10

// Mean
=AVERAGE(A1:A10)  // Result: $57,500

// Median
=MEDIAN(A1:A10)  // Result: $45,000

// Mode
=MODE.SNGL(A1:A10)  // Result: $45,000

Analysis:
- Mean ($57,500): Inflated by CEO's $180K salary
- Median ($45,000): Represents the middle employee accurately
- Mode ($45,000): Shows the most common salary

Best choice: MEDIAN or MODE ($45,000)
The mean is misleading because the CEO's salary creates
a right-skewed distribution. Most employees earn around $45K.

Task: A store tracks customer visit frequency per month: 2, 2, 3, 3, 8, 8, 9, 9, 9, 10. The data appears bimodal (two peaks). Calculate all measures and explain what this tells you about customer behavior.

Show Solution
// Data: 2, 2, 3, 3, 8, 8, 9, 9, 9, 10

// Mean
=AVERAGE(A1:A10)  // Result: 6.3 visits

// Median
=MEDIAN(A1:A10)  // Result: 8 visits

// Mode (bimodal)
Modes: 2 (appears 2x), 3 (appears 2x), 8 (appears 2x), 9 (appears 3x)
Primary mode: 9 visits

Interpretation:
The bimodal pattern suggests TWO distinct customer groups:
1. Infrequent shoppers (2-3 visits/month)
2. Frequent/loyal shoppers (8-10 visits/month)

The mean (6.3) doesn't represent either group well!
This is a case where NO single measure of central tendency
adequately describes the data. You need to acknowledge
the two distinct segments in your analysis.

Business insight: Consider separate marketing strategies
for each customer segment.
02

Measures of Spread and Variability

While measures of central tendency tell us where data clusters, measures of spread (or dispersion) reveal how scattered or consistent the data points are. Two datasets can have the same mean but vastly different spreads, leading to different insights and decisions. Understanding variability is essential for assessing data quality, comparing groups, and making predictions. The main measures include range, interquartile range (IQR), variance, and standard deviation.

Variability (Dispersion)

The extent to which data points differ from each other and from the central value. High variability means data is spread out; low variability means data is clustered tightly together.

Range (Simplest Measure)

The range is the difference between the maximum and minimum values. While simple to calculate, it's heavily influenced by outliers and ignores the distribution of values in between.

Understanding Range

Think of range as the "span" of your data - how far apart are the most extreme values? If you're measuring daily temperatures and they range from 65°F to 85°F, your range is 20 degrees. This gives you a quick sense of total variation, but it has a major weakness: it completely ignores what happens in the middle.

The range is calculated simply as: Range = Maximum - Minimum. In Excel, you use the =MAX() function to find the highest value, =MIN() for the lowest, and subtract them. However, because it only looks at the two extreme values, a single outlier can make your range much larger than what's truly representative of your data's variability.

Calculating Range in Excel
// Product ratings (1-5 scale, plus one typo/outlier)
// Data: 3, 4, 4, 5, 5, 5, 4, 3, 5, 9

// Find the maximum value
Maximum = =MAX(A1:A10)  // Result: 9
// This finds 9 (the outlier - someone typed 9 instead of 5)

// Find the minimum value  
Minimum = =MIN(A1:A10)  // Result: 3
// This finds 3 (the lowest rating)

// Calculate range
Range = =MAX(A1:A10) - MIN(A1:A10)  // Result: 6
// 9 - 3 = 6

// THE PROBLEM:
// The range is 6, but look at the actual data!
// If we remove the outlier (9), our ratings are: 3, 3, 4, 4, 4, 5, 5, 5, 5
// Without outlier: Max=5, Min=3, Range=2
// 
// The range jumped from 2 to 6 because of ONE bad data point!
// Most ratings are actually clustered between 3-5 (range of only 2)

Why This Matters: Imagine you're a product manager looking at customer ratings. A range of 6 suggests wildly inconsistent opinions (some love it, some hate it). But actually, most customers gave 4-5 stars - they're quite satisfied! The range of 6 is misleading because of one data entry error or extreme outlier.

When to Use Range: Range is useful for a quick, rough sense of spread when you know your data is clean (no outliers). It's often used in quality control to show the full extent of variation. But for serious analysis, you'll want more robust measures like IQR or standard deviation.

Range Limitation: The range only uses two data points (min and max), making it extremely sensitive to outliers. A single extreme value can make a tightly clustered dataset appear highly variable.

Interquartile Range (IQR)

The IQR is the range of the middle 50% of the data, calculated as Q3 - Q1. It's resistant to outliers because it focuses on the central portion of the distribution, making it more reliable than the range for describing typical variability.

How IQR Works

Unlike the range which looks at extreme values, the IQR (Interquartile Range) focuses on where most of your data lives. It's calculated as: IQR = Q3 - Q1, where Q1 is the 25th percentile and Q3 is the 75th percentile. This means the IQR captures the spread of the middle 50% of your data, ignoring the top 25% and bottom 25%.

Why is this useful? Because outliers typically fall in those outer 25% regions. By focusing on the middle 50%, the IQR gives you a much more stable measure of variability. If you have a few extremely high or low values, they won't distort your IQR the way they distort the range.

In Excel, use =QUARTILE.INC() with 1 for Q1 and 3 for Q3. Then subtract Q1 from Q3 to get your IQR. A smaller IQR means your middle data is tightly packed; a larger IQR means more spread even in your central values.

Calculating IQR in Excel
// Test scores from a class of 10 students
// Data: 62, 68, 72, 75, 78, 82, 85, 88, 92, 95

// Step 1: Find the First Quartile (Q1) - 25th percentile
Q1 = =QUARTILE.INC(A1:A10, 1)  // Result: 71
// This means 25% of students scored below 71
// In other words, 71 is the cutoff for the bottom quarter

// Step 2: Find the Third Quartile (Q3) - 75th percentile
Q3 = =QUARTILE.INC(A1:A10, 3)  // Result: 88
// This means 75% of students scored below 88
// Or: 88 is the cutoff for the top quarter

// Step 3: Calculate the Interquartile Range
IQR = Q3 - Q1 = 88 - 71 = 17

// INTERPRETATION:
// The middle 50% of students scored within a 17-point range
// Scores from 71 to 88 contain half the class
// This is "moderate spread" - not too clustered, not too scattered
//
// Compare to full range: 95 - 62 = 33 points
// IQR (17) is much smaller than range (33)
// This tells us the extremes are pulling the range wider
// But the bulk of students are reasonably consistent

Breaking Down the Results: An IQR of 17 points tells us that the middle 50% of students performed within a 17-point band (71 to 88). This is useful because it ignores the struggling student at 62 and the top performer at 95. Those outliers don't affect the IQR, so we get a clearer picture of the typical performance range.

Practical Use: If you're a teacher, an IQR of 17 suggests moderate consistency in student understanding. If the IQR were 5, students are very similar in ability. If it were 30, you have a very diverse classroom requiring differentiated instruction. The IQR helps you understand the spread of your "typical" students, not just the extremes.

Quartile Symbol Percentile Meaning
First Quartile Q1 25th 25% of data falls below this value
Second Quartile Q2 (Median) 50th 50% of data falls below this value
Third Quartile Q3 75th 75% of data falls below this value

Variance and Standard Deviation

Variance measures the average squared deviation from the mean, while standard deviation (the square root of variance) provides a more interpretable measure in the original units. Standard deviation is the most widely used measure of spread in statistics.

Variance and Standard Deviation in Excel
// Weekly sales (in thousands): 45, 48, 52, 49, 51

// Population variance (all data)
=VAR.P(A1:A5)  // Result: 5.84

// Sample variance (from larger population)
=VAR.S(A1:A5)  // Result: 7.3

// Population standard deviation
=STDEV.P(A1:A5)  // Result: 2.42 thousand ($2,420)

// Sample standard deviation
=STDEV.S(A1:A5)  // Result: 2.70 thousand ($2,700)

// Interpretation: Sales typically vary by about ±$2,500 from mean
Population vs. Sample: Use VAR.P and STDEV.P when you have ALL data points (entire population). Use VAR.S and STDEV.S when you have a sample representing a larger population. Sample formulas divide by (n-1) instead of n, providing an unbiased estimate.

Coefficient of Variation (CV)

The coefficient of variation expresses standard deviation as a percentage of the mean, allowing you to compare variability across datasets with different units or scales.

Coefficient of Variation
// Dataset A - Heights (cm): Mean = 170, SD = 8
CV_A = (8 / 170) × 100 = 4.7%

// Dataset B - Weights (kg): Mean = 70, SD = 10
CV_B = (10 / 70) × 100 = 14.3%

// Interpretation: Weights show more relative variability
// than heights, even though absolute SD of weights is larger
Low Variability Indicates
  • Consistent, predictable data
  • Quality control is working
  • Homogeneous population
  • Reliable forecasting possible
High Variability Indicates
  • Inconsistent, unpredictable data
  • Process needs improvement
  • Diverse population
  • Uncertainty in predictions

Practice Exercises: Spread & Variability

Task: Customer ages at a store: 22, 25, 28, 30, 32, 35, 38, 42, 45, 65. Calculate the range and IQR. Why is the IQR more useful here?

Show Solution
// Ages: 22, 25, 28, 30, 32, 35, 38, 42, 45, 65

// Range
Maximum = 65
Minimum = 22
Range = 65 - 22 = 43 years

// IQR
=QUARTILE.INC(A1:A10, 1)  // Q1 = 28.5
=QUARTILE.INC(A1:A10, 3)  // Q3 = 42.75
IQR = 42.75 - 28.5 = 14.25 years

Why IQR is better:
- Range (43) is inflated by the 65-year-old outlier
- IQR (14.25) shows that most customers fall within
  a 14-year age span, giving a more accurate picture
  of the typical customer base
- The outlier (65) doesn't affect the IQR calculation

Task: Two products have daily sales: Product A: 100, 102, 98, 101, 99. Product B: 85, 110, 95, 105, 105. Both have mean = 100. Calculate standard deviation for each and interpret what this tells you about sales consistency.

Show Solution
// Product A: 100, 102, 98, 101, 99
=AVERAGE(A1:A5)  // Mean = 100
=STDEV.S(A1:A5)  // SD = 1.58

// Product B: 85, 110, 95, 105, 105
=AVERAGE(B1:B5)  // Mean = 100
=STDEV.S(B1:B5)  // SD = 9.35

Interpretation:
Product A: SD = 1.58 units
- Highly consistent sales
- Predictable demand
- Easy to manage inventory
- Sales rarely vary by more than ±2 units

Product B: SD = 9.35 units
- Highly variable sales (6x more variable than A)
- Unpredictable demand
- Requires buffer inventory
- Sales can swing ±10 units or more

Business Impact:
Product A is a stable, mature product
Product B needs demand forecasting improvements

Task: Team A response times: Mean = 5 minutes, SD = 1 minute. Team B response times: Mean = 45 minutes, SD = 8 minutes. Which team is more consistent relative to their average response time? Use coefficient of variation.

Show Solution
// Team A
Mean_A = 5 minutes
SD_A = 1 minute
CV_A = (1 / 5) × 100 = 20%

// Team B
Mean_B = 45 minutes
SD_B = 8 minutes
CV_B = (8 / 45) × 100 = 17.8%

Analysis:
Team B is MORE consistent (17.8% < 20%)

Why CV matters:
- Absolute SD: Team B varies more (8 min vs 1 min)
- Relative SD: Team A varies more (20% vs 17.8%)
- A 1-minute deviation from 5 minutes is proportionally
  larger than an 8-minute deviation from 45 minutes
  
Business Insight:
Team B's process is more standardized relative to their
average workload, even though absolute variation is higher.
Team A needs process improvements for short tasks.
03

Distribution Shapes

The shape of a distribution reveals patterns in how data values are arranged and where they cluster. Understanding distribution shapes helps you choose appropriate statistical methods, identify data quality issues, and interpret results correctly. The three main distribution types are normal (symmetric), left-skewed (negatively skewed), and right-skewed (positively skewed). Each shape tells a different story about your data and affects which descriptive statistics are most meaningful.

Distribution Shape

The visual pattern formed when data values are plotted, showing where values concentrate and how they spread. Shape indicates symmetry, skewness, and the presence of extreme values.

Normal Distribution (Symmetric)

A normal distribution (bell curve) is perfectly symmetric around the mean, with data evenly spread on both sides. Most values cluster near the center, with fewer values at the extremes. In a normal distribution, mean = median = mode, and about 68% of data falls within one standard deviation of the mean.

Identifying Normal Distribution
// Test scores: 65, 68, 70, 72, 75, 75, 78, 80, 82, 85

Mean = 75
Median = 75
Mode = 75

// Relationship: Mean ≈ Median ≈ Mode (symmetric)

// Standard deviation = 6.4
// 68% should fall within 75 ± 6.4 (68.6 to 81.4)
// Count values in this range: 7 out of 10 = 70% ✓

// Visual check: Create histogram
// Should show bell shape with peak at center
Characteristics of Normal Distribution
  • Bell-shaped, symmetric curve
  • Mean = Median = Mode
  • 68-95-99.7 rule applies
  • Tails extend infinitely
  • Predictable probabilities
  • Common in natural phenomena

Right-Skewed Distribution (Positive Skew)

A right-skewed distribution has a long tail extending to the right, with most values concentrated on the left. The mean is pulled toward the tail by high outliers, making it greater than the median. This pattern is common in income data, real estate prices, and product sales.

Right-Skewed Example: House Prices
// House prices ($1000s): 180, 195, 210, 225, 240, 255, 350, 480, 850

Mean = =AVERAGE(A1:A9)    // Result: $331,667
Median = =MEDIAN(A1:A9)   // Result: $240,000
Mode = [No mode]

// Relationship: Mean > Median (right skew indicator)
// Mean ($332K) pulled up by expensive homes ($850K)
// Median ($240K) better represents typical home price

// Skewness check
Mean - Median = 91.667 (positive = right skew)
Why Right-Skewed Matters: In right-skewed data, the mean is misleading because it's inflated by high outliers. Always report the median for right-skewed distributions to give an accurate picture of the typical value. Examples: salaries, website traffic, social media followers.

Left-Skewed Distribution (Negative Skew)

A left-skewed distribution has a long tail extending to the left, with most values concentrated on the right. The mean is pulled toward the tail by low outliers, making it less than the median. This pattern appears in test scores (with a ceiling effect), age at retirement, and product lifespan data.

Left-Skewed Example: Exam Scores
// Exam scores: 45, 62, 78, 82, 85, 88, 90, 92, 95, 98

Mean = =AVERAGE(A1:A10)   // Result: 81.5
Median = =MEDIAN(A1:A10)  // Result: 86.5
Mode = [No clear mode]

// Relationship: Mean < Median (left skew indicator)
// Mean (81.5) pulled down by low scores (45, 62)
// Median (86.5) better represents typical performance

// Most students scored 80+ (strong performance)
// but a few struggling students pulled the average down
Distribution Type Shape Mean vs. Median Best Measure Common Examples
Normal (Symmetric) Bell curve Mean = Median Mean Height, IQ, measurement errors
Right-Skewed Tail to right Mean > Median Median Income, home prices, wealth
Left-Skewed Tail to left Mean < Median Median Exam scores, age at death, lifespan
Quick Skewness Test: Calculate Mean - Median. If positive, distribution is right-skewed. If negative, left-skewed. If near zero, approximately symmetric. This simple rule helps you quickly assess distribution shape without creating histograms.

Measuring Skewness Statistically

While visual inspection and mean-median comparison are useful, you can calculate a precise skewness coefficient. Values near 0 indicate symmetry, positive values indicate right skew, and negative values indicate left skew.

Skewness in Excel
// Data in A1:A20

// Skewness coefficient
=SKEW(A1:A20)

// Interpretation:
// -1 to -0.5: Moderately left-skewed
// -0.5 to 0.5: Approximately symmetric
// 0.5 to 1: Moderately right-skewed
// >1 or <-1: Highly skewed

// Example results:
SKEW([2,4,5,6,8]) = 0.34 (slightly right-skewed)
SKEW([98,95,92,88,62,45]) = -0.82 (moderately left-skewed)

Practice Exercises: Distribution Shapes

Task: A dataset has Mean = $52,000 and Median = $48,000. What type of distribution is this, and what does it tell you about the data?

Show Solution
// Given:
Mean = $52,000
Median = $48,000

// Analysis:
Mean > Median → RIGHT-SKEWED distribution

// What this means:
1. Long tail extends to the right (high values)
2. Some very high values pull the mean up
3. Most data points are below the mean
4. Median ($48K) is more representative of "typical" value
5. There are likely high-earning outliers

// Real-world context:
If this is salary data, most employees earn around $48K,
but some executives or specialists earn much more,
creating the right skew.

// Recommendation:
Report median ($48K) as the typical salary,
not mean ($52K) which gives inflated impression

Task: Store A customer ages: Mean=35, Median=35, SD=8. Store B customer ages: Mean=42, Median=35, SD=12. Analyze the distribution shape of each store and recommend which statistics to report.

Show Solution
// Store A Analysis
Mean = 35, Median = 35
Mean ≈ Median → SYMMETRIC/NORMAL distribution
SD = 8 (moderate variability)

Store A Characteristics:
- Balanced age distribution
- No skewness
- Typical customer age is clearly 35
- Ages fairly consistent (SD=8)

// Store B Analysis
Mean = 42, Median = 35
Mean > Median → RIGHT-SKEWED distribution
SD = 12 (high variability)

Store B Characteristics:
- Most customers are younger (around 35)
- Some older customers pull mean up to 42
- Higher variability in ages
- Long tail of older customers

// Recommendations

Store A Report:
- Use MEAN (35 years) - accurate representation
- Include SD (±8 years)
- "Average customer age is 35, with most between 27-43"

Store B Report:
- Use MEDIAN (35 years) - better representation
- Note the skewness
- "Typical customer age is 35, though we also serve
  an older demographic that raises the average to 42"

Business Insight:
Store B has more diverse age appeal, especially
attracting older customers that Store A doesn't reach

Task: Product reviews (1-5 stars): Mean=4.2, Median=5, Mode=5, Skewness=-1.3. Analyze this distribution and explain what it reveals about customer satisfaction. Should the company highlight the mean or median in marketing?

Show Solution
// Distribution Analysis
Mean = 4.2
Median = 5
Mode = 5
Skewness = -1.3 (highly left-skewed)

// Interpretation:

1. Shape: Highly left-skewed
   - Mean < Median confirms left skew
   - Skewness of -1.3 is strong negative skew

2. What's happening:
   - MOST customers give 5 stars (mode=5)
   - Median is 5 (at least 50% give 5 stars)
   - Some negative reviews (1-2 stars) pull mean down to 4.2
   
3. Customer Sentiment:
   - Majority are highly satisfied (5 stars)
   - Small but vocal unhappy minority
   - Polarized opinions (love it or hate it)

// Marketing Recommendation:

Highlight MEDIAN (5 stars):
✓ "Median rating: 5 stars - Most customers LOVE it!"
✓ More accurate representation
✓ Shows typical customer experience

Avoid emphasizing mean (4.2):
✗ Underrepresents satisfaction
✗ Gives impression of "good but not great"
✗ Doesn't show that majority give top rating

// Business Actions:

1. Investigate the negative reviews:
   - Are they about specific issues?
   - Can these be addressed?
   - Are expectations misaligned?

2. Marketing message:
   "Over 50% of customers rate us 5 stars!"
   
3. Product page display:
   Show distribution graph showing the
   concentration at 5 stars, not just the 4.2 average

// Statistical Insight:
This is a classic case where mean is misleading.
The 5-star consensus is the real story,
not the 4.2 average dragged down by outliers.
04

Outlier Detection Methods

Outliers are data points that differ significantly from other observations. They can represent errors, rare events, or important anomalies that need investigation. Detecting outliers is crucial because they can distort statistical analyses, affect model performance, and hide important patterns. However, not all outliers should be removed - some represent valuable insights. The two most common detection methods are the IQR method and Z-score method.

Outlier

A data point that lies an abnormal distance from other values in a dataset. Outliers can be caused by measurement errors, data entry mistakes, or genuine extreme variations in the measured phenomenon.

IQR Method (Tukey's Fences)

The IQR method defines outliers based on quartiles. Any value below Q1 - 1.5×IQR or above Q3 + 1.5×IQR is considered an outlier. This method is robust because it doesn't assume a normal distribution and isn't influenced by the outliers it's trying to detect.

IQR Method in Excel
// Response times (minutes): 2, 3, 3, 4, 4, 5, 5, 6, 6, 15

// Step 1: Calculate quartiles
Q1 = =QUARTILE.INC(A1:A10, 1)  // Result: 3
Q3 = =QUARTILE.INC(A1:A10, 3)  // Result: 6

// Step 2: Calculate IQR
IQR = Q3 - Q1 = 6 - 3 = 3

// Step 3: Calculate fences
Lower_Fence = Q1 - 1.5 × IQR = 3 - 4.5 = -1.5
Upper_Fence = Q3 + 1.5 × IQR = 6 + 4.5 = 10.5

// Step 4: Identify outliers
Any value < -1.5 or > 10.5 is an outlier
Outlier detected: 15 minutes (exceeds upper fence)
Why 1.5×IQR? The multiplier 1.5 is a convention established by statistician John Tukey. It balances sensitivity (catching true outliers) with specificity (avoiding false alarms). For more conservative detection, use 3×IQR for "extreme outliers".

Z-Score Method (Standard Score)

The Z-score method measures how many standard deviations a data point is from the mean. A Z-score beyond ±3 (or sometimes ±2) is typically considered an outlier. This method assumes a normal distribution and works well when data is approximately bell-shaped.

Z-Score Method in Excel
// Sales data in column A (A1:A10)
Sales: 48, 52, 51, 50, 49, 53, 48, 51, 120, 50

// Step 1: Calculate mean and standard deviation
Mean = =AVERAGE(A1:A10)     // Result: 57.2
SD = =STDEV.S(A1:A10)       // Result: 22.15

// Step 2: Calculate Z-scores (in column B)
B1: =(A1-$B$11)/$B$12       // (48-57.2)/22.15 = -0.42
B2: =(A2-$B$11)/$B$12       // (52-57.2)/22.15 = -0.23
...
B9: =(A9-$B$11)/$B$12       // (120-57.2)/22.15 = 2.84

// Step 3: Identify outliers (|Z| > 3)
Value 120 has Z = 2.84
Borderline outlier (would be definite outlier if using |Z| > 2)
Z-Score Range Interpretation Percentage in Normal Distribution Action
-1 to +1 Typical value 68% Keep
-2 to +2 Somewhat unusual 95% Keep
-3 to +3 Rare but possible 99.7% Investigate
Beyond ±3 Extremely rare 0.3% Likely outlier

Comparing IQR vs. Z-Score Methods

IQR Method

Best for:

  • Skewed distributions
  • Unknown distribution shape
  • When outliers affect mean/SD
  • Small datasets

Advantages:

  • • Robust to extreme values
  • • No distribution assumptions
  • • Easy to calculate and explain
Z-Score Method

Best for:

  • Normal distributions
  • Large datasets (n > 30)
  • Comparing across datasets
  • Symmetric data

Advantages:

  • • Provides degree of "outlierness"
  • • Statistical foundation
  • • Standardizes across scales
Don't Auto-Delete Outliers! Always investigate outliers before removing them. They might indicate: (1) Data entry errors that should be corrected, (2) Measurement problems requiring re-measurement, (3) Legitimate extreme events with valuable insights, or (4) Important subgroups that deserve separate analysis.

Practice Exercises: Outlier Detection

Task: Product prices: $12, $15, $18, $20, $22, $25, $28, $95. Use the IQR method to identify if $95 is an outlier.

Show Solution
// Data: 12, 15, 18, 20, 22, 25, 28, 95

// Step 1: Calculate Q1 and Q3
Q1 = =QUARTILE.INC(A1:A8,1)  // Result: 16.5
Q3 = =QUARTILE.INC(A1:A8,3)  // Result: 26.5

// Step 2: Calculate IQR
IQR = Q3 - Q1 = 26.5 - 16.5 = 10

// Step 3: Calculate fences
Lower = Q1 - 1.5 × IQR = 16.5 - 15 = 1.5
Upper = Q3 + 1.5 × IQR = 26.5 + 15 = 41.5

// Step 4: Check each value
Values below 1.5 or above 41.5 are outliers
$95 > 41.5 → YES, it's an outlier!

Interpretation:
$95 is far above the upper fence (41.5)
This could be:
- A premium/luxury version of the product
- A pricing error
- A bundle/package deal
Requires investigation before analysis

Task: Customer purchase amounts: $45, $52, $48, $51, $50, $49, $150, $47, $53, $46. Calculate Z-scores and identify outliers using the threshold |Z| > 2.

Show Solution
// Data in A1:A10

// Step 1: Calculate mean and SD
Mean = =AVERAGE(A1:A10)  // Result: $59.10
SD = =STDEV.S(A1:A10)    // Result: $31.85

// Step 2: Calculate Z-scores
For $45:  Z = (45 - 59.10) / 31.85 = -0.44
For $52:  Z = (52 - 59.10) / 31.85 = -0.22
For $48:  Z = (48 - 59.10) / 31.85 = -0.35
For $51:  Z = (51 - 59.10) / 31.85 = -0.25
For $50:  Z = (50 - 59.10) / 31.85 = -0.29
For $49:  Z = (49 - 59.10) / 31.85 = -0.32
For $150: Z = (150 - 59.10) / 31.85 = 2.85 ← Outlier!
For $47:  Z = (47 - 59.10) / 31.85 = -0.38
For $53:  Z = (53 - 59.10) / 31.85 = -0.19
For $46:  Z = (46 - 59.10) / 31.85 = -0.41

// Step 3: Identify outliers (|Z| > 2)
$150 has Z = 2.85 (> 2) → Outlier detected!

Analysis:
- Most purchases cluster around $45-$53
- $150 purchase is 2.85 standard deviations above mean
- This represents unusual buying behavior
- Could be: bulk order, gift purchase, or special occasion

Action:
Investigate this customer segment for upselling opportunities

Task: Website load times (seconds): 0.8, 1.2, 1.1, 1.3, 0.9, 1.0, 1.2, 8.5, 1.1, 1.0. Apply both IQR and Z-score methods to identify outliers. Which method is more appropriate here and why?

Show Solution
// === IQR METHOD ===
Data: 0.8, 1.2, 1.1, 1.3, 0.9, 1.0, 1.2, 8.5, 1.1, 1.0
Sorted: 0.8, 0.9, 1.0, 1.0, 1.1, 1.1, 1.2, 1.2, 1.3, 8.5

Q1 = 1.0
Q3 = 1.2
IQR = 1.2 - 1.0 = 0.2

Lower = 1.0 - 1.5(0.2) = 0.7
Upper = 1.2 + 1.5(0.2) = 1.5

Outliers: 8.5 > 1.5 → YES, outlier detected

// === Z-SCORE METHOD ===
Mean = (0.8+1.2+1.1+1.3+0.9+1.0+1.2+8.5+1.1+1.0)/10 = 1.81
SD = 2.31

Z-score for 8.5:
Z = (8.5 - 1.81) / 2.31 = 2.90

Using |Z| > 2: 2.90 > 2 → YES, outlier detected

// === COMPARISON ===

Both methods identify 8.5 as an outlier ✓

However, NOTICE THE PROBLEM with Z-score:
- The outlier (8.5) inflates the mean (1.81)
- The outlier inflates the SD (2.31)
- The Z-score (2.90) might understate how extreme it is
- Without outlier: Mean ≈ 1.08, SD ≈ 0.15
- True Z-score would be: (8.5-1.08)/0.15 = 49.5!

// === RECOMMENDATION ===

Use IQR METHOD for this data because:
1. Distribution is right-skewed (mean > median)
2. Outlier affects mean and SD calculations
3. IQR is not influenced by the 8.5 value
4. IQR gives clearer boundary (1.5 seconds)

Real-world interpretation:
- Normal load times: 0.8-1.3 seconds (excellent)
- 8.5 seconds indicates a problem:
  * Server issue?
  * Network timeout?
  * Heavy traffic spike?
  
Action: Monitor for page performance issues
DO NOT remove from analysis - this is actionable data!
05

Five-Number Summary and Box Plots

The five-number summary is a concise way to describe a dataset's distribution using five key values: minimum, Q1, median, Q3, and maximum. When visualized as a box plot (box-and-whisker plot), these numbers reveal the shape, spread, center, and outliers at a glance. Box plots are particularly powerful for comparing multiple distributions side-by-side and quickly spotting anomalies in data.

Five-Number Summary

A set of descriptive statistics consisting of the minimum value, first quartile (Q1), median (Q2), third quartile (Q3), and maximum value. Together, these five numbers summarize the distribution's location, spread, and shape.

The Five Numbers Explained

Statistic Symbol Position Meaning
Minimum Min 0th percentile Smallest value in dataset (excluding outliers in some box plots)
First Quartile Q1 25th percentile 25% of data falls below this point; lower edge of box
Median Q2 / Med 50th percentile Middle value; line inside the box
Third Quartile Q3 75th percentile 75% of data falls below this point; upper edge of box
Maximum Max 100th percentile Largest value in dataset (excluding outliers in some box plots)
Calculating Five-Number Summary in Excel
// Employee commute times (minutes): 15, 18, 20, 22, 25, 28, 30, 32, 35, 60

// Five-number summary
Minimum = =MIN(A1:A10)              // Result: 15
Q1 = =QUARTILE.INC(A1:A10, 1)       // Result: 19
Median = =MEDIAN(A1:A10)            // Result: 26.5
Q3 = =QUARTILE.INC(A1:A10, 3)       // Result: 31
Maximum = =MAX(A1:A10)              // Result: 60

// Summary: [15, 19, 26.5, 31, 60]

// Additional useful stats
IQR = Q3 - Q1 = 31 - 19 = 12
Range = Max - Min = 60 - 15 = 45

Understanding Box Plots

A box plot (or box-and-whisker plot) is a visual representation of the five-number summary. The "box" shows the middle 50% of data (IQR), the line inside shows the median, and the "whiskers" extend to the minimum and maximum (or to 1.5×IQR, with outliers shown as separate points).

Box Plot Components:
Box: Represents IQR (Q1 to Q3), containing middle 50% of data
Line in box: Shows median (Q2) position
Whiskers: Extend to min/max or to 1.5×IQR from box edges
Dots beyond whiskers: Individual outliers
Box position: Shows where data is centered
Box size: Shows data spread (larger box = more variability)

Reading Box Plots

Symmetric Distribution
  • Median line near center of box
  • Whiskers roughly equal length
  • Q1 to median ≈ median to Q3
  • Example: heights, test scores
Right-Skewed Distribution
  • Median line closer to Q1 (bottom)
  • Upper whisker longer than lower
  • Possible outliers above box
  • Example: income, house prices
Left-Skewed Distribution
  • Median line closer to Q3 (top)
  • Lower whisker longer than upper
  • Possible outliers below box
  • Example: age at death, exam scores
Detecting Outliers
  • Points beyond whiskers are outliers
  • Whiskers typically extend to 1.5×IQR
  • Multiple outliers suggest problems
  • Investigate causes before removing

Comparing Multiple Groups with Box Plots

Box plots truly shine when comparing distributions across groups. Side-by-side box plots instantly reveal differences in central tendency, spread, and outliers between categories.

Creating Box Plot in Excel
// 1. Prepare data in columns (one per group)
//    Column A: Store A sales
//    Column B: Store B sales
//    Column C: Store C sales

// 2. Select all data including headers

// 3. Insert > Charts > Box and Whisker
//    (Excel 2016+) or Insert > Stat Chart > Box Plot

// 4. Interpret the chart:
//    - Compare median positions (central tendency)
//    - Compare box heights (variability)
//    - Look for outliers (dots beyond whiskers)
//    - Assess symmetry (whisker lengths)

// Excel will automatically:
//    - Calculate five-number summary
//    - Draw boxes and whiskers
//    - Mark outliers as individual points
//    - Use IQR method (1.5×IQR fences)

Practice Exercises: Box Plots & Five-Number Summary

Task: Calculate the five-number summary for this dataset: 12, 15, 18, 21, 24, 27, 30, 33, 36, 45. Describe the distribution based on these values.

Show Solution
// Data: 12, 15, 18, 21, 24, 27, 30, 33, 36, 45

Minimum = 12
Q1 = =QUARTILE.INC(A1:A10,1) = 18
Median = =MEDIAN(A1:A10) = 25.5
Q3 = =QUARTILE.INC(A1:A10,3) = 33
Maximum = 45

Five-Number Summary: [12, 18, 25.5, 33, 45]

// Additional calculations
IQR = Q3 - Q1 = 33 - 18 = 15
Range = Max - Min = 45 - 12 = 33

// Distribution analysis:
1. Center: Median = 25.5
2. Spread: IQR = 15 (moderate variability)
3. Shape: Slightly right-skewed
   - Median (25.5) is closer to Q1 (18) than Q3 (33)
   - Distance Q1 to Median: 25.5 - 18 = 7.5
   - Distance Median to Q3: 33 - 25.5 = 7.5
   - Actually symmetric in middle 50%!
   - But Max (45) creates slight right tail

4. Outliers check:
   Lower fence: 18 - 1.5(15) = -4.5
   Upper fence: 33 + 1.5(15) = 55.5
   No outliers (all values between -4.5 and 55.5)

Conclusion: Fairly symmetric distribution with moderate
spread, no outliers, typical value around 25-26

Task: A box plot shows: Min=20, Q1=45, Median=50, Q3=75, Max=200, with one outlier at 200. Describe what this tells you about the distribution and explain why the median is better than the mean for this data.

Show Solution
// Given: [20, 45, 50, 75, 200]

// Distribution Analysis:

1. Shape: RIGHT-SKEWED
   Evidence:
   - Median (50) closer to Q1 (45) than Q3 (75)
   - Q1 to Median: 50-45 = 5
   - Median to Q3: 75-50 = 25 (5x larger!)
   - Long upper whisker to outlier (200)

2. Center:
   - Median = 50 (typical value)
   - Mean would be inflated by 200 outlier

3. Spread:
   - IQR = 75 - 45 = 30 (middle 50%)
   - Range = 200 - 20 = 180 (misleading due to outlier)

4. Outlier:
   - Upper fence = 75 + 1.5(30) = 120
   - 200 > 120 → Confirmed outlier
   - Value is 150% above the fence!

// Why Median > Mean:

Median = 50 ✓
Mean calculation:
If we estimate from box plot, the outlier (200)
will pull mean significantly higher than median.

Example: If this represented 9 values:
20, 40, 45, 48, 50, 60, 70, 75, 200
Mean = 608/9 = 67.6
Median = 50

The mean (67.6) is 35% higher than median!

Recommendation:
Report: "Median value is 50, with most data
between 45-75. One extreme outlier at 200
requires investigation."

DO NOT report: "Average is 67.6"
This misrepresents typical experience

Task: Three sales teams have these five-number summaries:
Team A: [40, 55, 60, 65, 80]
Team B: [30, 45, 70, 95, 150]
Team C: [50, 58, 62, 66, 75]
Analyze and compare their performance, variability, and consistency.

Show Solution
// === TEAM A: [40, 55, 60, 65, 80] ===
Median = 60
IQR = 65 - 55 = 10
Range = 80 - 40 = 40
Shape: Symmetric (median centered in box)

// === TEAM B: [30, 45, 70, 95, 150] ===
Median = 70
IQR = 95 - 45 = 50
Range = 150 - 30 = 120
Shape: Right-skewed (median closer to Q1)
Check outlier: 95 + 1.5(50) = 170 (150 < 170, not outlier)

// === TEAM C: [50, 58, 62, 66, 75] ===
Median = 62
IQR = 66 - 58 = 8
Range = 75 - 50 = 25
Shape: Symmetric

// === COMPARATIVE ANALYSIS ===

1. Central Tendency (Median):
   Team B: 70 (highest) ✓
   Team C: 62
   Team A: 60

2. Consistency (IQR - lower is better):
   Team C: 8 (most consistent) ✓✓✓
   Team A: 10
   Team B: 50 (highly inconsistent)

3. Overall Spread (Range):
   Team C: 25 (narrow)
   Team A: 40
   Team B: 120 (very wide)

4. Symmetry:
   Team A: Symmetric ✓
   Team C: Symmetric ✓
   Team B: Right-skewed (unpredictable)

// === BUSINESS INTERPRETATION ===

Team C - BEST OVERALL:
+ Consistent performance (IQR=8)
+ Predictable results (symmetric)
+ Good median (62)
+ Reliable for forecasting
→ Well-trained, standardized process

Team B - HIGHEST PERFORMERS:
+ Highest median (70)
+ But extremely variable (IQR=50)
+ Unpredictable performance
+ Some top performers, some strugglers
→ Needs training standardization
→ May have star performers distorting results

Team A - MIDDLE GROUND:
+ Balanced performance
+ Moderate consistency
+ No major concerns
→ Solid, reliable team

// === RECOMMENDATIONS ===

1. Study Team C's practices and replicate
2. Investigate Team B's variability:
   - Are some reps undertrained?
   - Do star performers use different methods?
   - Split analysis by rep to find outliers
3. Consider Team A as baseline standard
4. Set performance targets:
   Minimum: 55 (Team A's Q1)
   Target: 62 (Team C's median)
   Stretch: 70 (Team B's median)

Key Takeaways

Central Tendency

Mean, median, and mode represent the center of your data distribution in different ways

Spread Measures

Range, variance, and standard deviation quantify how dispersed your data points are

Distribution Shapes

Normal, left-skewed, and right-skewed distributions tell different stories about your data

Outlier Detection

IQR method and Z-scores help identify unusual values that might need investigation

Box Plots

Five-number summary (min, Q1, median, Q3, max) visualizes distribution and outliers effectively

Choose Wisely

Select appropriate measures based on data type, distribution shape, and presence of outliers

Knowledge Check

Test your understanding of descriptive statistics:

Question 1 of 6

Which measure of central tendency is most affected by extreme outliers?

Question 2 of 6

A dataset has the following values: 5, 7, 7, 9, 12. What is the standard deviation approximately?

Question 3 of 6

In a right-skewed distribution, which relationship is typically true?

Question 4 of 6

Using the IQR method, a data point is considered an outlier if it is:

Question 5 of 6

The five-number summary consists of which five values?

Question 6 of 6

If a dataset has high variance, what does this indicate?

Answer all 6 questions to check your score