When analyzing data, knowing only the mean is often not enough. Two datasets can have the same average but be very different in how their values are distributed. This is why measures of spread are so important—they help us understand how spread out or concentrated the data is.
In this article, we’ll look at three key measures of spread:
Quantiles
Standard Deviation
Variance
We’ll focus on understanding what they represent, without going too deep into mathematical details.
Quantiles
Quantiles divide an ordered dataset into equal parts. They are useful for understanding how values are distributed across the range of the data.
The most commonly used quantiles are:
25th percentile (Q1)
50th percentile (Q2), also known as the median
75th percentile (Q3)
Example
Dataset:
[1, 2, 3, 4, 5]
25% quantile = 2
50% quantile (median) = 3
75% quantile = 4
This means (approximately, since five values cannot be split into exact quarters):
25% of the values are less than or equal to 2
50% of the values are less than or equal to 3
75% of the values are less than or equal to 4
In Pandas, quantiles are calculated using the .quantile() method.
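As a minimal sketch of the example above, the snippet below computes the three quartiles with .quantile(). The variable names are our own, and it relies on the default linear interpolation Pandas applies when a quantile falls between two values.

```python
import pandas as pd

values = pd.Series([1, 2, 3, 4, 5])

# Request the three quartiles in one call; the result is a Series
# indexed by the requested quantile levels.
quartiles = values.quantile([0.25, 0.5, 0.75])
print(quartiles)

# A single level returns a scalar; 0.5 is the median.
median = values.quantile(0.5)
print(median)
```

Passing a list returns one value per level, which is handy when you want Q1, Q2 and Q3 together.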
Standard Deviation
Standard deviation measures how much the individual values in a dataset differ from the mean. It gives a sense of how spread out the data is.
One of its key advantages is that it is expressed in the same units as the original data, which makes it easier to interpret.
Low standard deviation → values are close to the mean
High standard deviation → values are more spread out
In Pandas, standard deviation is calculated using .std().
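To make the low versus high contrast concrete, here is a small sketch with two made-up datasets (the numbers and names are illustrative, not taken from real data). Both have the same mean, but their standard deviations differ by a factor of ten.

```python
import pandas as pd

# Two illustrative datasets with the same mean (10).
tight = pd.Series([9, 10, 11])   # values close to the mean
wide = pd.Series([0, 10, 20])    # values far from the mean

print(tight.mean(), wide.mean())  # both means are 10.0
print(tight.std(), wide.std())    # 1.0 versus 10.0
```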
Step-by-Step Example
Dataset:
[1, 2, 3, 4, 5]
Mean = 3
Subtract the mean from each value:
[-2, -1, 0, 1, 2]
Square each result:
[4, 1, 0, 1, 4]
Sum = 10
Divide by n − 1 (sample standard deviation):
10 / 4 = 2.5
Take the square root:
√2.5 ≈ 1.581
By default, Pandas calculates the sample standard deviation, since in real-world scenarios we usually work with samples rather than entire populations.
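The steps above can be reproduced by hand in a few lines and then checked against .std(). This is only a sketch for the five-value example; the variable names are our own.

```python
import math

import pandas as pd

data = [1, 2, 3, 4, 5]

# Step 1: the mean.
mean = sum(data) / len(data)

# Step 2: subtract the mean from each value.
deviations = [x - mean for x in data]

# Step 3: square each deviation and add them up.
squared = [d ** 2 for d in deviations]
total = sum(squared)

# Step 4: divide by n - 1 (sample formula), then take the square root.
sample_variance = total / (len(data) - 1)
sample_std = math.sqrt(sample_variance)

# Pandas should agree with the manual calculation.
pandas_std = pd.Series(data).std()
print(sample_std, pandas_std)
```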
Variance
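Variance also measures how far values are from the mean: it is the average of the squared deviations from the mean. Standard deviation is the square root of the variance, so squaring the standard deviation gives the variance back.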
Formula:
Variance = (Standard Deviation)²
In Pandas, variance is calculated using .var().
Example
Standard deviation = 1.581
Variance:
1.581² ≈ 2.5
Because variance is expressed in squared units, it is less intuitive than standard deviation, but it is fundamental in statistics and data analysis.
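As a quick check of this relationship, the sketch below computes the variance of the same five values with .var() and compares it with the squared standard deviation. Variable names are our own.

```python
import pandas as pd

values = pd.Series([1, 2, 3, 4, 5])

variance = values.var()  # sample variance (divides by n - 1)
std = values.std()       # sample standard deviation

print(variance)
print(std ** 2)  # matches the variance up to floating-point rounding
```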
Population vs Sample Standard Deviation
The two formulas look very similar, but they are used in different situations.
Population Standard Deviation (σ)
This formula is used when you have data for an entire population:
σ = √( Σ(X − μ)² / N )
where:
X = each value in the dataset
μ (mu) = population mean
N = total number of values in the population
Σ = sum over all values in the population
Because you are measuring the whole population, you divide by N.
Sample Standard Deviation (s)
This formula is used when your data is only a sample of a larger population:
s = √( Σ(xᵢ − x̄)² / (n − 1) )
where:
xᵢ = each individual observation
x̄ (x-bar) = sample mean
n = number of observations in the sample
The key difference is dividing by n − 1 instead of n.
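This adjustment (called Bessel’s correction) compensates for the fact that a sample tends to underestimate the population’s variability, because deviations are measured from the sample mean rather than the true population mean.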
Why This Matters
In real-world data analysis, we almost always work with samples, not entire populations. For this reason:
Pandas uses sample standard deviation by default
.std() divides by n − 1
.var() also follows the sample formula
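If your data really is the whole population, the default can be changed through the ddof ("delta degrees of freedom") parameter that both methods accept. The sketch below compares the two settings on the five-value example; variable names are our own.

```python
import pandas as pd

values = pd.Series([1, 2, 3, 4, 5])

# Default: ddof=1, i.e. divide by n - 1 (sample formulas).
sample_std = values.std()
sample_var = values.var()

# ddof=0: divide by n (population formulas).
population_std = values.std(ddof=0)
population_var = values.var(ddof=0)

print(sample_var, population_var)
print(sample_std, population_std)
```

With only five values the gap between the two results is noticeable; it shrinks as the number of observations grows.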
