Measures of Dispersion: Range and Mean Deviation
Describing the Dispersion (Variability of Data)
Beyond Central Tendency
Measures of central tendency (mean, median, mode) provide a single value that represents the "center" or typical value of a dataset. They tell us where the data is located. However, they do not give any information about how spread out or scattered the data values are around this center. Two datasets can have the exact same mean, median, and mode but exhibit vastly different levels of variability.
Consider the following two small datasets representing, for instance, the test scores (out of 100) of students in two different groups:
- Set A: {49, 50, 51}
- Set B: {10, 50, 90}
For Set A:
- Mean = $\frac{49+50+51}{3} = \frac{150}{3} = 50$
- Median = 50 (middle value after ordering)
- Mode: none, since each value occurs exactly once
For Set B:
- Mean = $\frac{10+50+90}{3} = \frac{150}{3} = 50$
- Median = 50 (middle value after ordering: 10, 50, 90)
- Mode: none, since each value occurs exactly once
Both datasets have the same mean (50) and the same median (50). However, visually inspecting the numbers, it is clear that the values in Set A are very close to the mean, while the values in Set B are much more spread out. A single measure of central tendency alone is insufficient to fully describe these datasets.
Therefore, describing the dispersion or variability of the data is just as crucial as describing its center for a complete understanding of the dataset's characteristics.
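As a quick check of the comparison above, the following minimal sketch computes the mean, median, and range of Set A and Set B with Python's standard `statistics` module. The helper name `summarize` is an illustrative assumption, not something taken from the text.

```python
from statistics import mean, median

set_a = [49, 50, 51]
set_b = [10, 50, 90]

def summarize(values):
    """Return (mean, median, range) for a list of numbers (hypothetical helper)."""
    return mean(values), median(values), max(values) - min(values)

summary_a = summarize(set_a)
summary_b = summarize(set_b)

print("Set A:", summary_a)
print("Set B:", summary_b)
```

Both sets share a mean and median of 50, but the last entry of each tuple (2 for Set A, 80 for Set B) shows how differently the values are spread.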
What is Dispersion?
Dispersion, also referred to as variability, scatter, or spread, is a statistical concept that quantifies the extent to which the data values in a distribution are spread out or clustered together. It measures how much the individual observations deviate or vary from the central value (an average like the mean or median).
- A dataset with low dispersion (or high homogeneity) means the data points are clustered closely around the central value. The values are relatively similar to each other.
- A dataset with high dispersion (or high heterogeneity) means the data points are spread out over a wider range of values. The values are quite different from each other and from the center.
Measures of dispersion complement measures of central tendency by providing context about the variability within the dataset. Averages alone can be misleading if the dispersion is not also considered.
Measures of Dispersion: Definition and Purpose
Definition
Measures of Dispersion are statistical indicators that quantify the amount of variation or spread within a set of data. They provide a numerical summary of how much the individual data points tend to differ from the average or from each other. A measure of dispersion summarizes the scatter of observations into a single value.
In simple terms, a measure of dispersion tells us how representative the average is or how homogeneous (similar) the data points are. A large value of dispersion indicates that the data is widely spread, while a small value indicates that the data is clustered closely around the average.
Purpose and Importance
Calculating measures of dispersion is essential for a comprehensive understanding of data and serves several important purposes:
- Judging the Reliability of Averages: Measures of dispersion help in assessing how well an average (like the mean) represents the entire dataset. If the dispersion is small, it means most data points are close to the average, and the average is a reliable representative value. If the dispersion is large, the average is less reliable as a single representative because the data is widely scattered, and many values are far from the average.
- Comparing Variability: They allow for a direct comparison of the spread or consistency of two or more different datasets, even if they have different averages or units (when using relative measures). For example, comparing the dispersion of scores in two different exams can tell us which exam had more consistent performance among students. Comparing the variability in returns of two investments helps assess their relative riskiness.
- Basis for Quality Control: In fields like manufacturing and quality control, minimizing variability is often a key objective. Measures of dispersion are used to monitor process consistency and identify when variability exceeds acceptable limits.
- Foundation for Further Statistical Analysis: Measures of dispersion, particularly variance and standard deviation, are fundamental components of many more advanced statistical techniques, including hypothesis testing, confidence interval estimation, correlation, regression, and analysis of variance (ANOVA).
- Understanding Distribution Shape: Along with measures of central tendency, measures of dispersion help in describing the shape of a distribution. For instance, a distribution with high dispersion will have a flatter shape than one with low dispersion, assuming they have similar central tendencies.
Common Measures of Dispersion
Various measures have been developed to quantify dispersion, each with its own strengths and applications. They can be broadly categorized into absolute and relative measures:
- Absolute Measures of Dispersion: These measures express the amount of variability in the same units as the original data. They indicate the actual spread of the data. They are suitable for comparing dispersion only when the datasets have the same units and similar average values. Common absolute measures include:
  - Range: The simplest measure, calculated as the difference between the maximum and minimum values in the dataset.
  - Interquartile Range (IQR): The range of the middle 50% of the data. It is calculated as the difference between the third quartile ($Q_3$) and the first quartile ($Q_1$).
  - Quartile Deviation (Semi-Interquartile Range): Half of the Interquartile Range, calculated as $\frac{Q_3 - Q_1}{2}$.
  - Mean Deviation (or Mean Absolute Deviation): The average of the absolute differences between each observation and a central value (usually the mean or median). It measures the average distance of data points from the center.
  - Variance ($\sigma^2$ or $s^2$): The average of the squared deviations of each observation from the mean. It is a widely used measure due to its mathematical properties.
  - Standard Deviation ($\sigma$ or $s$): The positive square root of the variance. It is the most common and important measure of dispersion. It represents a typical or standard distance of data points from the mean, expressed in the original units of the data.
- Relative Measures of Dispersion: These measures are calculated as a ratio or percentage, making them unitless. They are useful for comparing the degree of variability between datasets that have different units of measurement or significantly different average values. Common relative measures include:
  - Coefficient of Range: Calculated as $\frac{\text{Maximum} - \text{Minimum}}{\text{Maximum} + \text{Minimum}}$.
  - Coefficient of Quartile Deviation: Calculated as $\frac{Q_3 - Q_1}{Q_3 + Q_1}$.
  - Coefficient of Mean Deviation: Calculated as $\frac{\text{Mean Deviation}}{\text{Average used (Mean or Median)}}$.
  - Coefficient of Variation (CV): The most widely used relative measure. It is calculated as $\left(\frac{\text{Standard Deviation}}{\text{Mean}}\right) \times 100\%$. It expresses the standard deviation as a percentage of the mean.
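To make the distinction between absolute and relative measures concrete, here is a minimal sketch that computes several of them for one small dataset. It assumes population formulas (dividing by $n$) via `statistics.pvariance` and `statistics.pstdev`, takes the mean deviation about the mean, and assumes positive data with a non-zero mean so the ratios are meaningful. The function names and dictionary keys are illustrative only. Quartile-based measures are left out because quartile conventions vary between textbooks and libraries.

```python
from statistics import mean, pstdev, pvariance

def mean_deviation(values, center):
    """Average absolute distance of the values from `center`."""
    return sum(abs(x - center) for x in values) / len(values)

def dispersion_summary(values):
    """Hypothetical helper: a few absolute and relative measures in one dict."""
    lo, hi = min(values), max(values)
    avg = mean(values)
    sd = pstdev(values)            # population standard deviation (divide by n)
    md = mean_deviation(values, avg)
    return {
        # Absolute measures: same units as the data (variance: squared units)
        "range": hi - lo,
        "mean_deviation": md,
        "variance": pvariance(values),
        "std_dev": sd,
        # Relative measures: unitless ratios
        "coef_range": (hi - lo) / (hi + lo),
        "coef_mean_deviation": md / avg,
        "cv_percent": sd / avg * 100,
    }

summary = dispersion_summary([10, 50, 90])
for name, value in summary.items():
    print(f"{name}: {value:.4f}")
```

Because the relative measures are ratios, rescaling the data (for example, converting metres to centimetres) changes every absolute measure but leaves the coefficients unchanged.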
In the following sections, we will discuss the calculation and interpretation of some of these measures in more detail, starting with the Range and Mean Deviation.
Range: Definition and Calculation
Definition
The Range is the simplest and most easily understood measure of dispersion. It quantifies the total spread of the data by calculating the difference between the highest and lowest values in the dataset.
In simple terms, it tells us the extent of variation from the minimum to the maximum value.
Calculation
Let $L$ denote the minimum (smallest) value and $H$ denote the maximum (largest) value in a dataset.
Formula for Ungrouped Data:
Range $= H - L$
... (1)
For Grouped Data:
For data presented in a grouped frequency distribution, the exact minimum and maximum values are usually unknown. The range is estimated as the difference between the upper boundary of the highest class interval and the lower boundary of the lowest class interval.
Range (Grouped Data) $\approx$ Upper Boundary of Highest Class - Lower Boundary of Lowest Class
... (2)
Ensure you use class boundaries, especially if dealing with inclusive intervals. For example, if the lowest class is $10-19$ and the highest is $90-99$ (with class boundaries $9.5-19.5$ and $89.5-99.5$), the estimated range would be $99.5 - 9.5 = 90$.
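The boundary adjustment can be sketched in code as follows. This is only an illustration under stated assumptions: classes are given as inclusive `(lower, upper)` limits, and `gap` is the distance between one class's upper limit and the next class's lower limit (1 for classes such as $10-19$, $20-29$). The function name `grouped_range` and the parameter `gap` are hypothetical.

```python
def grouped_range(class_limits, gap=1):
    """Estimate the range of grouped data from inclusive class limits.

    Each limit is widened by gap/2 to obtain the class boundary, then the
    lower boundary of the lowest class is subtracted from the upper
    boundary of the highest class.
    """
    half_gap = gap / 2
    lowest_boundary = min(lower for lower, _ in class_limits) - half_gap
    highest_boundary = max(upper for _, upper in class_limits) + half_gap
    return highest_boundary - lowest_boundary

# Only the extreme classes matter for the estimate, so the intermediate
# classes are omitted here.
classes = [(10, 19), (90, 99)]
estimated_range = grouped_range(classes)
print(estimated_range)  # 90.0
```

For exclusive intervals such as $10-20$, $20-30$, the limits already coincide with the boundaries, so `gap=0` applies.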
Advantages of Range
- Simplicity: It is extremely easy to understand and calculate, requiring minimal computational effort.
- Quick Overview: Provides a rapid indication of the total spread or extent of variation in the dataset.
Disadvantages of Range
Despite its simplicity, the range has significant limitations as a measure of dispersion:
- Sensitivity to Outliers: It is based on only the two most extreme values in the dataset. A single unusually large or small value (an outlier) can dramatically increase the range, giving a misleading impression of the overall variability.
- Ignores Intermediate Values: It does not consider the distribution of the data between the minimum and maximum values. Datasets with very different distributions can have the same range.
- Doesn't Reflect Variation Around the Center: It doesn't tell us how much the data points deviate from the mean, median, or any other central value.
- Limited Usefulness for Grouped Data: For grouped data, it's only an estimate, as the actual minimum and maximum values are not known.
- Poor for Further Analysis: It is generally not suitable for more advanced mathematical or statistical calculations.
Due to these limitations, the range is often used only for preliminary data analysis or for comparing the spread of very small datasets where simplicity is prioritized over precision.
Example
Example 1. Find the range of the following set of scores: 15, 20, 25, 18, 22, 95.
Answer:
Given: Dataset: 15, 20, 25, 18, 22, 95.
To Find: The range.
Solution:
We need to identify the maximum (largest) and minimum (smallest) values in the dataset.
- Maximum value ($H$) = 95.
- Minimum value ($L$) = 15.
Using the formula for range:
Range $= H - L$
... (i)
Substitute the maximum and minimum values:
Range $= 95 - 15$
Range $= 80$
... (ii)
The range of the scores is 80.
Note on Outliers:
Consider the dataset without the value 95: {15, 20, 25, 18, 22}. The maximum value is 25 and the minimum is 15. The range is $25 - 15 = 10$. The inclusion of a single outlier (95) drastically increased the range from 10 to 80, illustrating the range's high sensitivity to extreme values.
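The same comparison can be reproduced in a few lines of Python. This is a minimal sketch using the scores from Example 1; the function name `value_range` is an illustrative choice.

```python
def value_range(values):
    """Range = maximum - minimum (formula (1) for ungrouped data)."""
    return max(values) - min(values)

scores = [15, 20, 25, 18, 22, 95]
without_outlier = [x for x in scores if x != 95]

range_with = value_range(scores)
range_without = value_range(without_outlier)

print(range_with)     # 80
print(range_without)  # 10
```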
Mean Deviation: Definition and Calculation (from Mean, Median)
Definition and Concept
The Mean Deviation (often abbreviated as MD, and also known as the Mean Absolute Deviation, MAD) is a measure of dispersion that quantifies the average amount by which individual observations in a dataset differ from a measure of central tendency. It is calculated as the arithmetic mean of the absolute values of the deviations of the observations from the chosen central value (typically the mean or the median).
The use of absolute values, denoted by $|...|$, is essential. If we simply summed the deviations $(x_i - \bar{x})$, the sum would always be zero for the mean, as positive and negative deviations cancel out. Taking absolute values ensures that the measure reflects the total distance of data points from the center, regardless of direction.
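To see why the signed deviations from the mean always cancel, a short derivation using only the definition $\bar{x} = \frac{\sum x_i}{n}$ (so that $\sum x_i = n\bar{x}$) is enough:

$\sum\limits_{i=1}^{n} (x_i - \bar{x}) = \sum\limits_{i=1}^{n} x_i - n\bar{x} = n\bar{x} - n\bar{x} = 0$

For instance, for Set A = {49, 50, 51} with mean 50, the deviations are $-1$, $0$, and $1$, which sum to zero, while the absolute deviations $1$, $0$, and $1$ sum to 2.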
The Mean Deviation provides a direct measure of the average distance of each data point from the center.
Calculation for Ungrouped Data
For a set of $n$ individual observations $x_1, x_2, \dots, x_n$:
1. Mean Deviation about the Mean ($\text{MD}_{\bar{x}}$):
This measures the average deviation of observations from the arithmetic mean.
- Calculate the arithmetic mean, $\bar{x} = \frac{\sum x_i}{n}$.
- For each observation $x_i$, calculate its deviation from the mean: $x_i - \bar{x}$.
- Take the absolute value of each deviation: $|x_i - \bar{x}|$.
- Sum all these absolute deviations: $\sum_{i=1}^{n} |x_i - \bar{x}|$.
- Divide the sum by the total number of observations, $n$.
Formula:
$\text{MD}_{\bar{x}} = \frac{\sum\limits_{i=1}^{n} |x_i - \bar{x}|}{n}$
... (1)
2. Mean Deviation about the Median ($\text{MD}_{\text{M}}$):
This measures the average deviation of observations from the median.
- Arrange the data in ascending or descending order and find the median, $M$.
- For each observation $x_i$, calculate its deviation from the median: $x_i - M$.
- Take the absolute value of each deviation: $|x_i - M|$.
- Sum these absolute deviations: $\sum_{i=1}^{n} |x_i - M|$.
- Divide the sum by the total number of observations, $n$.
Formula:
$\text{MD}_{\text{M}} = \frac{\sum\limits_{i=1}^{n} |x_i - M|}{n}$
... (2)
A significant property is that the sum of absolute deviations, $\sum |x_i - c|$, is minimized when $c$ is the median. Consequently, the Mean Deviation about the Median is always less than or equal to the Mean Deviation about the Mean ($\text{MD}_{\text{M}} \le \text{MD}_{\bar{x}}$).
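The two procedures above differ only in the choice of center, so a single function can cover both. The following is a minimal sketch, not a definitive implementation; the function name `mean_deviation` is an assumption, and the data are the scores from the range example earlier in this text.

```python
from statistics import mean, median

def mean_deviation(values, center):
    """Average absolute distance of the observations from `center`."""
    return sum(abs(x - center) for x in values) / len(values)

data = [15, 20, 25, 18, 22, 95]

md_mean = mean_deviation(data, mean(data))      # center = 32.5
md_median = mean_deviation(data, median(data))  # center = 21

print(round(md_mean, 4), round(md_median, 4))
```

Here the sum of absolute deviations is 125 about the mean and 89 about the median, so $\text{MD}_{\bar{x}} = 125/6 \approx 20.83$ and $\text{MD}_{\text{M}} = 89/6 \approx 14.83$, consistent with $\text{MD}_{\text{M}} \le \text{MD}_{\bar{x}}$.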
Calculation for Frequency Distributions (Ungrouped or Grouped)
When data is presented in a frequency distribution (either ungrouped with distinct values or grouped with class intervals), let $x_1, x_2, \dots, x_k$ be the distinct values or class marks, $f_1, f_2, \dots, f_k$ their corresponding frequencies, and $N = \sum\limits_{i=1}^{k} f_i$ the total frequency.
The calculation then weights each absolute deviation by its frequency.
1. Mean Deviation about the Mean ($\text{MD}_{\bar{x}}$):
- Calculate the mean, $\bar{x} = \frac{\sum f_i x_i}{N}$ (using the appropriate formula for ungrouped or grouped data).
- For each distinct value or class mark $x_i$, calculate its absolute deviation from the mean: $|x_i - \bar{x}|$.
- Multiply each absolute deviation by its corresponding frequency $f_i$: $f_i |x_i - \bar{x}|$.
- Sum these products: $\sum_{i=1}^{k} f_i |x_i - \bar{x}|$.
- Divide the sum by the total frequency $N$.
Formula:
$\text{MD}_{\bar{x}} = \frac{\sum\limits_{i=1}^{k} f_i |x_i - \bar{x}|}{N}$
... (3)
2. Mean Deviation about the Median ($\text{MD}_{\text{M}}$):
- Calculate the median, $M$ (using the appropriate method for ungrouped or grouped frequency data).
- For each distinct value or class mark $x_i$, calculate its absolute deviation from the median: $|x_i - M|$.
- Multiply each absolute deviation by its corresponding frequency $f_i$: $f_i |x_i - M|$.
- Sum these products: $\sum_{i=1}^{k} f_i |x_i - M|$.
- Divide by the total frequency $N$.
Formula:
$\text{MD}_{\text{M}} = \frac{\sum\limits_{i=1}^{k} f_i |x_i - M|}{N}$
... (4)
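Formulas (3) and (4) can be sketched in code as below. The values and frequencies are made-up illustration data, and the function names are assumptions. The median is found here by expanding the distribution into individual observations, which only suits an ungrouped (discrete) frequency distribution; for grouped data the median would be computed separately by the usual interpolation method and passed in as `center`, with class marks used as the values.

```python
from statistics import median

def weighted_mean(values, freqs):
    """Mean of a frequency distribution: sum(f_i * x_i) / N."""
    return sum(f * x for x, f in zip(values, freqs)) / sum(freqs)

def mean_deviation_freq(values, freqs, center):
    """Frequency-weighted mean deviation: sum(f_i * |x_i - center|) / N."""
    total = sum(freqs)
    return sum(f * abs(x - center) for x, f in zip(values, freqs)) / total

# Hypothetical discrete frequency distribution (illustration only)
values = [1, 2, 3, 4, 10]
freqs = [2, 3, 2, 2, 1]

x_bar = weighted_mean(values, freqs)

# Expand to individual observations to locate the median
expanded = [x for x, f in zip(values, freqs) for _ in range(f)]
m = median(expanded)

md_mean = mean_deviation_freq(values, freqs, x_bar)
md_median = mean_deviation_freq(values, freqs, m)

print(round(x_bar, 2), m, round(md_mean, 2), round(md_median, 2))
```

For this illustration the mean is 3.2 and the median is 2.5, giving a mean deviation of 1.68 about the mean and 1.6 about the median.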
Example
Example 1. Find the mean deviation about the mean for the data: 6, 7, 10, 12, 13, 4, 8, 12.
Answer:
Given: Dataset: 6, 7, 10, 12, 13, 4, 8, 12.
To Find: Mean deviation about the mean.
Solution:
This is ungrouped data (individual observations). The number of observations is $n=8$.
Step 1: Calculate the Mean ($\bar{x}$).
Sum of observations = $6+7+10+12+13+4+8+12 = 72$.
$\bar{x} = \frac{\sum x_i}{n} = \frac{72}{8} = 9$
... (i)
The mean is 9.
Step 2: Calculate Absolute Deviations $|x_i - \bar{x}| = |x_i - 9|$.
We calculate the absolute difference between each observation and the mean (9).
$x_i$ | $x_i - \bar{x}$ ($x_i - 9$) | Absolute Deviation $|x_i - \bar{x}|$ |
---|---|---|
6 | $6 - 9 = -3$ | $|-3| = 3$ |
7 | $7 - 9 = -2$ | $|-2| = 2$ |
10 | $10 - 9 = 1$ | $|1| = 1$ |
12 | $12 - 9 = 3$ | $|3| = 3$ |
13 | $13 - 9 = 4$ | $|4| = 4$ |
4 | $4 - 9 = -5$ | $|-5| = 5$ |
8 | $8 - 9 = -1$ | $|-1| = 1$ |
12 | $12 - 9 = 3$ | $|3| = 3$ |
Total | | $\sum |x_i - \bar{x}| = 3+2+1+3+4+5+1+3 = 22$ |
Step 3: Calculate Mean Deviation about Mean.
Using the formula $\text{MD}_{\bar{x}} = \frac{\sum |x_i - \bar{x}|}{n}$:
$\text{MD}_{\bar{x}} = \frac{22}{8}$
... (ii)
$\text{MD}_{\bar{x}} = 2.75$
... (iii)
The mean deviation about the mean is 2.75. This means, on average, each score deviates from the mean of 9 by 2.75 units.
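The three steps of this solution can be checked with a short script. This is a minimal sketch in plain Python; the variable names are illustrative.

```python
data = [6, 7, 10, 12, 13, 4, 8, 12]

# Step 1: mean
x_bar = sum(data) / len(data)

# Step 2: absolute deviations from the mean
abs_devs = [abs(x - x_bar) for x in data]

# Step 3: mean deviation about the mean
md = sum(abs_devs) / len(data)

print(x_bar)          # 9.0
print(sum(abs_devs))  # 22.0
print(md)             # 2.75
```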
Advantages and Disadvantages of Mean Deviation
Mean deviation offers some advantages and disadvantages compared to other measures of dispersion:
- Advantages:
  - Uses all observations: Unlike the range, mean deviation considers every data point in its calculation, giving a more complete picture of dispersion.
  - Easy to understand: The concept of average deviation from the center is intuitive.
  - Less affected by extreme values than the standard deviation: Every value still influences the result, but because deviations are taken in absolute value rather than squared, a single extreme outlier inflates the mean deviation (especially when calculated about the median) less than it inflates the variance or standard deviation.
- Disadvantages:
  - Ignores signs: The use of absolute values makes it mathematically less elegant and difficult to use in further algebraic manipulations and complex statistical theories compared to measures based on squared deviations (variance and standard deviation).
  - Calculation complexity: Calculating the median for grouped data can be more complex than calculating the mean, which might make the $\text{MD}_{\text{M}}$ calculation slightly more involved than $\text{MD}_{\bar{x}}$ (though this is minor).
Due to the mathematical difficulties associated with absolute values, the Mean Deviation is not as widely used in inferential statistics as the Variance and Standard Deviation, which are based on squared deviations.
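The point about sensitivity to extreme values can be illustrated numerically. The sketch below reuses the scores from the range example, with and without the outlier 95, and compares how much the mean deviation about the median and the standard deviation each grow. It assumes the population standard deviation (`statistics.pstdev`); the function and variable names are illustrative.

```python
from statistics import median, pstdev

def md_about_median(values):
    """Mean deviation about the median."""
    m = median(values)
    return sum(abs(x - m) for x in values) / len(values)

base = [15, 20, 25, 18, 22]
with_outlier = base + [95]

# Growth factor of each measure when the outlier is added
md_growth = md_about_median(with_outlier) / md_about_median(base)
sd_growth = pstdev(with_outlier) / pstdev(base)

print(round(md_growth, 2), round(sd_growth, 2))
```

Adding the outlier multiplies the mean deviation about the median by about 5.3 and the standard deviation by about 8.3, so squaring the deviations amplifies the effect of the extreme value.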