MicromOne: Measures of Spread: Quantiles, Standard Deviation, and Variance

Measures of Spread: Quantiles, Standard Deviation, and Variance

 

When analyzing data, knowing only the mean is often not enough. Two datasets can have the same average yet differ greatly in how their values are distributed. This is why measures of spread matter: they show how dispersed or concentrated the data is.

In this article, we’ll look at three key measures of spread:

  • Quantiles

  • Standard Deviation

  • Variance

We’ll focus on understanding what they represent, without going too deep into mathematical details.

Quantiles

Quantiles divide an ordered dataset into equal parts. They are useful for understanding how values are distributed across the range of the data.

The most commonly used quantiles are the quartiles, which split the ordered data into four parts:

  • 25th percentile (Q1)

  • 50th percentile (Q2), also known as the median

  • 75th percentile (Q3)

Example

Dataset:

[1, 2, 3, 4, 5]
  • 25% quantile = 2

  • 50% quantile (median) = 3

  • 75% quantile = 4

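This means, approximately:

  • the lowest 25% of the data lies at or below 2

  • the lowest 50% of the data lies at or below 3

  • the lowest 75% of the data lies at or below 4

With only five values these percentages are rough; on larger datasets the interpretation becomes more exact.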

In Pandas, quantiles are calculated using the .quantile() method.
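
As a minimal sketch, the quartiles from the example above can be reproduced as follows. The variable names are illustrative, and the results assume the default behavior of .quantile(), which interpolates linearly between data points.

```python
import pandas as pd

# The example dataset from above, stored as a pandas Series
data = pd.Series([1, 2, 3, 4, 5])

# .quantile() takes a fraction between 0 and 1 (0.25 = 25th percentile)
q1 = data.quantile(0.25)
median = data.quantile(0.50)
q3 = data.quantile(0.75)

print(q1, median, q3)  # 2.0 3.0 4.0

# Passing a list returns all requested quantiles at once, indexed by fraction
quartiles = data.quantile([0.25, 0.50, 0.75])
print(quartiles)
```

On datasets where a quantile falls between two values, the interpolated result may not be one of the original data points.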

Standard Deviation

Standard deviation measures how much the individual values in a dataset differ from the mean. It gives a sense of how spread out the data is.

One of its key advantages is that it is expressed in the same units as the original data, which makes it easier to interpret.

  • Low standard deviation → values are close to the mean

  • High standard deviation → values are more spread out

In Pandas, standard deviation is calculated using .std().
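
To make the low versus high contrast concrete, here is a small sketch with two made-up datasets that share the same mean. The numbers and the names tight and wide are purely illustrative.

```python
import pandas as pd

# Two hypothetical datasets with the same mean (5) but different spread
tight = pd.Series([4, 5, 6])
wide = pd.Series([0, 5, 10])

print(tight.mean(), wide.mean())  # 5.0 5.0
print(tight.std(), wide.std())    # 1.0 5.0
```

The means are identical, so only the standard deviation reveals that the second dataset is far more spread out.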

Step-by-Step Example

Dataset:

[1, 2, 3, 4, 5]
  1. Mean = 3

  2. Subtract the mean from each value:

    [-2, -1, 0, 1, 2]
    
  3. Square each result:

    [4, 1, 0, 1, 4]
    
  4. Sum = 10

  5. Divide by n − 1 (this gives the sample variance):

    10 / 4 = 2.5
    
  6. Take the square root:

    √2.5 ≈ 1.581
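
The six steps above can be sketched in plain Python, without Pandas, to show that nothing more than basic arithmetic is involved. The variable names are illustrative.

```python
import math

data = [1, 2, 3, 4, 5]

# 1. Mean
mean = sum(data) / len(data)                 # 3.0

# 2. Subtract the mean from each value
deviations = [x - mean for x in data]        # [-2.0, -1.0, 0.0, 1.0, 2.0]

# 3. Square each result
squared = [d ** 2 for d in deviations]       # [4.0, 1.0, 0.0, 1.0, 4.0]

# 4. Sum the squared deviations
total = sum(squared)                         # 10.0

# 5. Divide by n - 1 (sample formula)
sample_variance = total / (len(data) - 1)    # 2.5

# 6. Take the square root
sample_std = math.sqrt(sample_variance)

print(round(sample_std, 3))  # 1.581
```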
    

By default, Pandas calculates the sample standard deviation, since in real-world scenarios we usually work with samples rather than entire populations.
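
A short sketch of that default: .std() accepts a ddof ("delta degrees of freedom") argument, and its default value of 1 corresponds to dividing by n − 1. The variable names here are illustrative.

```python
import pandas as pd

data = pd.Series([1, 2, 3, 4, 5])

# Default call: sample standard deviation (ddof=1, i.e. divide by n - 1)
sample_std = data.std()

# Writing the default out explicitly gives the same result
explicit_std = data.std(ddof=1)

print(round(sample_std, 3))  # 1.581
```

The result matches the value obtained by hand in the step-by-step example.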

Variance

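Variance also measures how far values are from the mean: it is the average of the squared deviations from the mean (dividing by n − 1 for a sample). Equivalently, it is the square of the standard deviation, and the standard deviation is the square root of the variance.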

Formula:

Variance = (Standard Deviation)²

In Pandas, variance is calculated using .var().

Example

  • Standard deviation = 1.581

  • Variance:

    1.581² ≈ 2.5
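
The same relationship can be checked with a small sketch. The rounding only hides tiny floating-point differences that arise from squaring a square root.

```python
import pandas as pd

data = pd.Series([1, 2, 3, 4, 5])

variance = data.var()  # sample variance (divides by n - 1)
std = data.std()       # sample standard deviation

print(variance)             # 2.5
print(round(std ** 2, 10))  # 2.5
```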
    

Because variance is expressed in squared units (for example, cm² when the data is measured in cm), it is harder to interpret than standard deviation. It is nonetheless fundamental in statistics and data analysis.

Population vs Sample Standard Deviation

The two formulas look very similar, but they are used in different situations.

Population Standard Deviation (σ)

This formula is used when you have data for an entire population:

σ = √( Σ (X − μ)² / N )

  • X = each value in the dataset

  • μ (mu) = population mean

  • N = total number of values in the population

  • Σ = summation over all values in the population

Because you are measuring the whole population, you divide by N.
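
In Pandas, the population version can be obtained by passing ddof=0 to .std(). The sketch below assumes, purely for illustration, that the five example values are the entire population.

```python
import pandas as pd

# Assumption for illustration: these five values ARE the whole population
population = pd.Series([1, 2, 3, 4, 5])

# ddof=0 makes pandas divide by N instead of N - 1
population_std = population.std(ddof=0)

print(round(population_std, 3))  # 1.414
```

Here the sum of squared deviations (10) is divided by 5 rather than 4, so the result is √2 instead of √2.5.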

Sample Standard Deviation (s)

This formula is used when your data is only a sample of a larger population:

s = √( Σ (xᵢ − x̄)² / (n − 1) )

  • xᵢ = each individual observation

  • x̄ (x-bar) = sample mean

  • n = number of observations in the sample

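The key difference is dividing by n − 1 instead of n.
This adjustment, called Bessel’s correction, compensates for the fact that deviations are measured from the sample mean rather than the true population mean. Without it, the estimated variance would be too small on average.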
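
The effect of the correction can be illustrated with a small sketch that uses only Python's standard library. It assumes a tiny made-up population and enumerates every possible sample of size 2 drawn with replacement; it is an illustration of the idea, not a proof.

```python
from itertools import product
from statistics import pvariance, variance

# A tiny hypothetical population whose true variance can be computed exactly
population = [1, 2, 3, 4, 5]
true_variance = pvariance(population)  # divides by N; equals 2 here

# Every possible sample of size 2, drawn with replacement (25 samples)
samples = list(product(population, repeat=2))

# Average each estimator over all possible samples
avg_divide_by_n = sum(pvariance(s) for s in samples) / len(samples)
avg_divide_by_n_minus_1 = sum(variance(s) for s in samples) / len(samples)

print(avg_divide_by_n)          # 1.0
print(avg_divide_by_n_minus_1)  # 2.0
```

Averaged over all samples, dividing by n underestimates the true variance (1.0 instead of 2), while dividing by n − 1 recovers it.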

Why This Matters

In real-world data analysis, we almost always work with samples, not entire populations. For this reason:

  • Pandas uses sample standard deviation by default

  • .std() divides by n − 1

  • .var() also follows the sample formula