Measures of Central Tendency: Mode and Relationship
Mode: Definition and Calculation for Ungrouped Data
Definition and Nature
The Mode is defined as the value or category that occurs most frequently in a given dataset. It represents the observation with the highest frequency. Unlike the mean and median, which are typically used for numerical data, the mode can be determined for both numerical (quantitative) and non-numerical (categorical or qualitative) data.
The mode indicates the most popular or common outcome in a dataset. For example, if we survey people about their favourite colour, the colour chosen by the largest number of people is the mode.
Key aspects of the mode:
- It is the value with the highest frequency.
- It can be easily identified by inspection in simple datasets or frequency tables.
- It is not affected by extreme values (outliers), making it a suitable measure of central tendency for distributions with outliers or when the data is skewed.
- It may not exist, or there may be more than one mode.
Calculation for Ungrouped Data
Calculating the mode for ungrouped data (either a simple list of observations or an ungrouped frequency distribution) is straightforward:
- List the Observations (if raw data): If the data is just a list of values, examine each distinct value and count how many times it appears in the list.
- Identify Frequencies: For raw data, tally the frequency of each distinct value. If the data is already in an ungrouped frequency distribution table, the frequencies are given directly.
- Find the Highest Frequency: Identify the maximum frequency among all the distinct values or categories.
- Determine the Mode(s): The value(s) or category(ies) corresponding to the highest frequency are the mode(s).
Possible outcomes when finding the mode:
- Unimodal: If only one value or category has the highest frequency, the dataset has a single mode.
- Bimodal: If two values or categories share the same highest frequency, the dataset has two modes.
- Multimodal: If more than two values or categories share the same highest frequency, the dataset is said to be multimodal.
- No Mode: If all distinct values or categories in the dataset occur with the exact same frequency (i.e., no single value is more frequent than any other), the dataset has no mode.
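The counting procedure and the four possible outcomes above can be sketched in Python. This is a minimal illustration, not a standard routine: the function names `find_modes` and `describe` and the sample data are hypothetical, and the treatment of a dataset with only one distinct value (reported as having that value as its mode) is an assumption the text does not address.

```python
from collections import Counter

def find_modes(values):
    """Return the list of modes of an ungrouped dataset.

    Convention taken from this section: if all distinct values occur
    equally often, the dataset has no mode and an empty list is returned.
    Assumption: a dataset with a single distinct value has that value as mode.
    """
    counts = Counter(values)                  # frequency of each distinct value
    if not counts:
        return []                             # empty dataset: nothing to report
    highest = max(counts.values())            # the highest frequency
    modes = [v for v, c in counts.items() if c == highest]
    if len(counts) > 1 and len(modes) == len(counts):
        return []                             # every value ties: no mode
    return modes

def describe(modes):
    """Name the outcome according to how many modes were found."""
    names = {0: "no mode", 1: "unimodal", 2: "bimodal"}
    return names.get(len(modes), "multimodal")

# Hypothetical data: 3 and 7 both occur twice, the other values once.
sample = [3, 7, 7, 2, 3, 9]
modes = find_modes(sample)
print(modes, describe(modes))   # [3, 7] bimodal
```

Because `Counter` works on any hashable values, the same function handles categorical data such as colour names.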
Example
Example 1. Find the mode of the following datasets:
(a) 2, 5, 3, 5, 1, 5, 3, 4, 5, 2
(b) Favourite colours of students: Red, Blue, Green, Blue, Red, Red, Yellow
(c) 7, 8, 10, 10, 12, 13, 12, 15
(d) 1, 2, 3, 4, 5, 6, 7
Answer:
Solution:
(a) Dataset: 2, 5, 3, 5, 1, 5, 3, 4, 5, 2
Let's count the frequency of each distinct value:
- Value 1: occurs 1 time
- Value 2: occurs 2 times
- Value 3: occurs 2 times
- Value 4: occurs 1 time
- Value 5: occurs 4 times
The highest frequency is 4, which corresponds to the value 5.
The mode is 5. This is a unimodal distribution.
(b) Dataset: Red, Blue, Green, Blue, Red, Red, Yellow
Let's count the frequency of each colour:
- Colour Red: occurs 3 times
- Colour Blue: occurs 2 times
- Colour Green: occurs 1 time
- Colour Yellow: occurs 1 time
The highest frequency is 3, which corresponds to the colour Red.
The mode is Red. This is a unimodal distribution.
(c) Dataset: 7, 8, 10, 10, 12, 13, 12, 15
Let's count the frequency of each distinct value:
- Value 7: occurs 1 time
- Value 8: occurs 1 time
- Value 10: occurs 2 times
- Value 12: occurs 2 times
- Value 13: occurs 1 time
- Value 15: occurs 1 time
The highest frequency is 2, and it occurs for two values: 10 and 12.
The modes are 10 and 12. This is a bimodal distribution.
(d) Dataset: 1, 2, 3, 4, 5, 6, 7
Let's count the frequency of each distinct value:
- Value 1: occurs 1 time
- Value 2: occurs 1 time
- Value 3: occurs 1 time
- Value 4: occurs 1 time
- Value 5: occurs 1 time
- Value 6: occurs 1 time
- Value 7: occurs 1 time
All values occur with the same frequency (1).
There is no mode for this dataset.
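As a cross-check, the four answers can be compared with Python's standard library, assuming Python 3.8 or later, where `statistics.multimode` is available. The variable names are illustrative. Note that the library follows a different convention for part (d): it returns every value when all frequencies tie, which corresponds to "no mode" in the terminology used here.

```python
from statistics import multimode  # available in Python 3.8+

part_a = [2, 5, 3, 5, 1, 5, 3, 4, 5, 2]
part_b = ["Red", "Blue", "Green", "Blue", "Red", "Red", "Yellow"]
part_c = [7, 8, 10, 10, 12, 13, 12, 15]
part_d = [1, 2, 3, 4, 5, 6, 7]

print(multimode(part_a))  # [5]
print(multimode(part_b))  # ['Red']
print(multimode(part_c))  # [10, 12]
# All seven values tie at frequency 1, so the library lists them all;
# under this section's convention that situation is reported as "no mode".
print(multimode(part_d))  # [1, 2, 3, 4, 5, 6, 7]
```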
Mode of Grouped Data (Formula and Calculation)
Introduction and Modal Class
For data that is grouped into class intervals, we cannot determine the exact value of the mode because the individual observations are not known. Instead, we can identify the class interval that contains the mode. This class is called the modal class and is defined as the class interval with the highest frequency.
Once the modal class is identified, we estimate the value of the mode within that class using a formula. This formula is based on the assumption that the mode is located within the modal class and its exact position is influenced by the frequencies of the classes immediately preceding and succeeding the modal class.
Steps to Calculate the Mode of Grouped Data
To estimate the mode from a grouped frequency distribution table:
- Identify the Modal Class: Examine the frequency column of the grouped frequency distribution table. The class interval corresponding to the maximum frequency is the modal class.
- Determine Values for the Formula: From the identified modal class and the frequency table, extract the following values needed for the mode formula:
- $l$: The lower class boundary of the modal class. If the class intervals are exclusive (like $10-20, 20-30$), the lower limit is the boundary. If they are inclusive (like $10-19, 20-29$), convert the lower limit to a boundary by subtracting the adjustment factor (half the gap between classes).
- $f_m$: The frequency of the modal class.
- $f_1$: The frequency of the class immediately preceding the modal class.
- $f_2$: The frequency of the class immediately succeeding the modal class.
- $h$: The class width (size) of the modal class. (Assumed to be constant). $h = \text{Upper Boundary} - \text{Lower Boundary}$.
- Apply the Mode Formula: The estimated mode of grouped data is calculated using the formula:
Mode $= l + \left( \frac{f_m - f_1}{2f_m - f_1 - f_2} \right) \times h$
... (1)
Alternatively, the denominator $2f_m - f_1 - f_2$ can be written as $(f_m - f_1) + (f_m - f_2)$, highlighting the differences in frequency between the modal class and its neighbours:
Mode $= l + \left( \frac{f_m - f_1}{(f_m - f_1) + (f_m - f_2)} \right) \times h$
... (2)
Important Notes:
- The formula assumes the class intervals have equal widths. If widths are unequal, adjustments are needed (often by finding the modal class based on frequency density or by inspecting a histogram with adjusted heights).
- Ensure you use the correct class boundaries, especially for the lower limit ($l$) of the modal class.
- If the modal class is the very first class interval, then $f_1$ is taken as 0.
- If the modal class is the very last class interval, then $f_2$ is taken as 0.
- This formula provides an estimate of the mode. It might not be meaningful if the dataset is truly bimodal or multimodal, or if the maximum frequency occurs in a class at either extreme end of the distribution.
Visual Representation:
The mode formula can be intuitively understood from a histogram. If lines are drawn from the top-right corner of the modal class bar to the top-right corner of the preceding bar, and from the top-left corner of the modal class bar to the top-left corner of the succeeding bar, the intersection of these lines projected onto the x-axis gives an estimate of the mode.
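The construction can be checked algebraically. The sketch below assumes equal class widths, with the modal bar spanning $l$ to $l + h$ on the x-axis and bar heights equal to the frequencies; the labels "Line A" and "Line B" are introduced here only for illustration. Setting the two lines equal recovers formula (1).

```latex
% Line A joins the top-left corner of the modal bar to the top-left corner
% of the succeeding bar: (l, f_m) and (l + h, f_2).
% Line B joins the top-right corner of the preceding bar to the top-right
% corner of the modal bar: (l, f_1) and (l + h, f_m).
\begin{align*}
\text{Line A:}\quad y &= f_m - (f_m - f_2)\,\frac{x - l}{h} \\
\text{Line B:}\quad y &= f_1 + (f_m - f_1)\,\frac{x - l}{h} \\
\text{At the intersection:}\quad
f_m - f_1 &= \bigl[(f_m - f_1) + (f_m - f_2)\bigr]\,\frac{x - l}{h} \\
x &= l + \left(\frac{f_m - f_1}{2f_m - f_1 - f_2}\right) \times h
\end{align*}
```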
Example
Example 1. Find the mode for the following weight distribution data:
Weight (kg) | Frequency (f) |
---|---|
40 - 45 | 2 |
45 - 50 | 5 |
50 - 55 | 5 |
55 - 60 | 7 |
60 - 65 | 5 |
65 - 70 | 4 |
70 - 75 | 2 |
Total | 30 |
Answer:
Given: Grouped frequency distribution of student weights.
To Find: The mode weight.
Solution:
Step 1: Identify Modal Class.
We look for the class interval with the maximum frequency in the table.
Weight (kg) | Frequency (f) |
---|---|
40 - 45 | 2 |
45 - 50 | 5 |
50 - 55 | 5 |
55 - 60 | 7 (Maximum Frequency) |
60 - 65 | 5 |
65 - 70 | 4 |
70 - 75 | 2 |
Total | 30 |
The maximum frequency is 7, which occurs in the class interval 55 - 60. This is the modal class.
Step 2: Determine Values for Formula.
From the modal class (55 - 60) and the table, we extract the necessary values:
- $l$: Lower class boundary of the modal class. The class intervals are exclusive, so the lower boundary is the lower limit. $l = 55$.
- $f_m$: Frequency of the modal class. $f_m = 7$.
- $f_1$: Frequency of the class preceding the modal class (50 - 55). $f_1 = 5$.
- $f_2$: Frequency of the class succeeding the modal class (60 - 65). $f_2 = 5$.
- $h$: Class width of the modal class. $h = 60 - 55 = 5$.
Step 3: Apply the Mode Formula.
Mode $= l + \left( \frac{f_m - f_1}{2f_m - f_1 - f_2} \right) \times h$
... (i)
Substitute the values into the formula:
Mode $= 55 + \left( \frac{7 - 5}{2(7) - 5 - 5} \right) \times 5$
Mode $= 55 + \left( \frac{2}{14 - 10} \right) \times 5$
Mode $= 55 + \left( \frac{2}{4} \right) \times 5$
Mode $= 55 + \left( \frac{\cancel{2}^{1}}{\cancel{4}_{2}} \right) \times 5$
(Cancelling the fraction)
Mode $= 55 + \left(\frac{1}{2}\right) \times 5$
Mode $= 55 + 2.5$
Mode $= 57.5$
... (ii)
The estimated mode weight is 57.5 kg. This value lies within the modal class 55-60 kg, as expected.
Relationship Between Mean, Median and Mode (Empirical Formula)
Empirical Relationship
While the mean, median, and mode are distinct measures of central tendency, for distributions that are **unimodal** (having a single peak) and **moderately skewed** (not extremely asymmetrical), there exists an approximate empirical relationship between them. This relationship is based on observations from many real-world datasets that exhibit this type of distribution.
The approximate empirical relationship is given by:
Mean - Mode $\approx$ 3 $\times$ (Mean - Median)
... (1)
This formula suggests that the difference between the mean and the mode is roughly three times the difference between the mean and the median.
This relationship can be rearranged algebraically to express one measure in terms of the other two:
- Subtracting 3(Mean - Median) from Mean:
Mode $\approx$ Mean - 3(Mean - Median)
... (2)
- Expanding the equation $Mean - Mode \approx 3 Mean - 3 Median$:
$-Mode \approx 3 Mean - Mean - 3 Median$
$-Mode \approx 2 Mean - 3 Median$
Multiplying by -1:
Mode $\approx$ 3 Median - 2 Mean
... (3)
Formula (3) is the most commonly cited form of the empirical relationship.
Conditions for Applicability and Interpretation
It is crucial to understand when and how this empirical relationship applies:
- Applicability: This formula is an approximation and works best for distributions that are:
- Unimodal (having only one peak).
- Moderately skewed (the asymmetry is not extreme).
It is generally not accurate for distributions that are highly skewed, multimodal (having two or more peaks), or U-shaped.
- Symmetric Distribution: For a perfectly symmetric unimodal distribution (like the Normal Distribution), the mean, median, and mode coincide. In this case, Mean = Median = Mode. The empirical formula holds true: $Mean - Mean = 3 \times (Mean - Mean)$, which simplifies to $0 = 3 \times 0$, or $0=0$.
- Skewed Distributions and the Order of Measures: The empirical relationship reflects the typical relative positions of the mean, median, and mode in moderately skewed distributions:
- Positively Skewed (Skewed to the Right): The longer tail extends towards higher values on the right. In such distributions, the mean is typically greater than the median, which is greater than the mode (Mean > Median > Mode). The presence of higher values in the tail pulls the mean towards the right more than it affects the median (which is only influenced by position) or the mode (which is just the peak frequency).
- Negatively Skewed (Skewed to the Left): The longer tail extends towards lower values on the left. In such distributions, the mode is typically greater than the median, which is greater than the mean (Mode > Median > Mean). The presence of lower values in the tail pulls the mean towards the left.
The empirical formula captures this ordering for moderate skewness (e.g., if Mean > Median, then Mean - Median is positive, and Mode $\approx$ Mean - 3(positive value), so Mode < Mean). In moderately skewed unimodal distributions, the median typically lies between the mean and the mode, although this ordering is a tendency rather than a guarantee.
- Usefulness: If a distribution is known to be moderately skewed and unimodal, and the values of any two of the measures (mean, median, mode) are known, this empirical formula can be used to obtain a reasonable estimate of the third measure without performing the full calculation.
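The three rearrangements of the empirical relationship can be sketched as small Python helpers. The function names and the sample figures are hypothetical, and the results are only approximations that make sense for unimodal, moderately skewed distributions.

```python
def estimate_mode(mean, median):
    """Mode ~ 3 * Median - 2 * Mean."""
    return 3 * median - 2 * mean

def estimate_median(mean, mode):
    """Median ~ (Mode + 2 * Mean) / 3, the same relation solved for the median."""
    return (mode + 2 * mean) / 3

def estimate_mean(median, mode):
    """Mean ~ (3 * Median - Mode) / 2, the same relation solved for the mean."""
    return (3 * median - mode) / 2

# Hypothetical figures for a moderately, positively skewed distribution.
mean, median = 50, 48
mode = estimate_mode(mean, median)
print(mode)                           # 44
print(estimate_median(mean, mode))    # 48.0
print(estimate_mean(median, mode))    # 50.0
```

Feeding the estimated mode back into the other two helpers returns the original median and mean, which confirms that the three forms are algebraically the same relationship.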
Example
Example 1. In a moderately skewed distribution, the mean is 35.4 and the median is 34.3. Estimate the mode.
Answer:
Given: Mean ($\bar{x}$) = 35.4, Median (M) = 34.3.
The distribution is moderately skewed.
To Estimate: The mode (Z).
Solution:
We can use the empirical formula relating Mean, Median, and Mode for moderately skewed distributions. Let's use the form Mode $\approx$ 3 Median - 2 Mean.
Mode $\approx$ 3 $\times$ Median - 2 $\times$ Mean
... (i)
Substitute the given values into the formula:
Mode $\approx$ 3 $\times$ (34.3) - 2 $\times$ (35.4)
Perform the multiplications:
3 $\times$ 34.3 = 102.9
... (ii)
2 $\times$ 35.4 = 70.8
... (iii)
Substitute these products back into the formula (i):
Mode $\approx$ 102.9 - 70.8
Perform the subtraction:
Mode $\approx$ 32.1
... (iv)
The estimated mode is 32.1.
Interpretation:
Given that Mean (35.4) > Median (34.3), this is consistent with a positively skewed distribution. In a positively skewed distribution, the expected order of the measures is Mean > Median > Mode. Our estimated mode (32.1) follows this pattern (35.4 > 34.3 > 32.1), which supports the appropriateness of using the empirical formula in this case.
Comparing Mean, Median, and Mode
Comparison of the Three Measures
The mean, median, and mode are the three most common measures of central tendency, but they represent the "center" of the data in different ways. Each measure has its strengths and weaknesses, and the most appropriate measure to use depends on the nature of the data, the shape of its distribution, and the objective of the analysis.
Here is a comparison of their key features:
Feature | Mean ($\bar{x}$) | Median | Mode |
---|---|---|---|
Definition | The sum of all observations divided by the number of observations. It's the arithmetical average. | The middle value when the data is arranged in ascending or descending order. It's the positional average. | The value that occurs most frequently in the dataset. It's the value with the highest frequency. |
Calculation Basis | Takes into account the magnitude of every observation. | Based on the positional rank of the middle value(s) in the ordered data. Does not directly use the magnitude of all values. | Based on the frequency of occurrence of each value or category. |
Type of Data Applicable | Suitable only for **numerical** data (interval or ratio scales). Requires mathematical operations (addition, division). | Suitable for **numerical** (interval or ratio scales) and **ordinal** data (data that can be ranked). Requires ordering. | Suitable for **numerical** (interval or ratio scales), **ordinal**, and **categorical** (nominal) data. Does not require ordering or mathematical operations. |
Effect of Extreme Values (Outliers) | **Highly affected**. Extreme values pull the mean towards them. This can distort the representation of the typical value in skewed distributions. | **Not affected** (or minimally affected). Its value is only dependent on the position of the middle observation(s), making it robust to outliers. | **Not affected**. Outliers (rare extreme values) by definition do not occur with the highest frequency. |
Existence and Uniqueness | Always exists and is always unique for any numerical dataset. | Always exists and is always unique for any dataset that can be ordered. | May not exist (if all values have the same frequency), or may not be unique (bimodal or multimodal distributions). |
Mathematical Properties | Possesses desirable mathematical properties. Amenable to further algebraic treatment and is the foundation for many other statistical techniques (e.g., variance, standard deviation, correlation). | Less amenable to complex algebraic manipulation compared to the mean. | Least amenable to further algebraic treatment. Primarily a descriptive statistic. |
When to Prefer Use | Preferred for symmetrical distributions without significant outliers, when every observation's value should contribute to the central measure, or when further parametric statistical analysis is intended. | Preferred for skewed distributions or distributions with significant outliers, when the focus is on the typical value based on rank, or when dealing with open-ended classes in grouped data. It represents the true middle point of the ordered data. | Preferred for categorical data, identifying the most frequent observation or category, or when describing the peak(s) in a frequency distribution. Useful when a quick estimate of the center is needed. |
Stability | Generally stable (less fluctuation) across different samples from the same population, but this stability is compromised by outliers. | More stable than the mean in the presence of outliers or skewness. | Can be unstable, especially in small datasets, where adding or removing a few values can drastically change the mode. It can also be highly sensitive to the grouping of data in frequency distributions. |
Choosing the Right Measure
Selecting the most appropriate measure of central tendency is a crucial step in data analysis. Consider the following guidelines:
- For **categorical (nominal) data**, the **mode** is the only measure of central tendency that can be used. For **ordinal data**, which can be ranked, the median is also available.
- For **numerical data** that is **symmetrically distributed** and does not contain significant outliers, the **mean** is generally the preferred measure. It utilizes all the information in the dataset and has good mathematical properties for further analysis. The median and mode will usually be very close to the mean in such cases.
- For **numerical data** that is **skewed** or contains significant **outliers**, the **median** is typically the most appropriate measure. It provides a better representation of the "typical" value because it is not distorted by the extreme values. For example, when discussing income distribution (which is often positively skewed), the median income is usually more informative than the mean income.
- The **mode** is useful even for numerical data when you want to identify the most frequent value or category. It can also help describe the shape of the distribution, indicating peaks (modalities).
- For grouped data with **open-ended classes** (e.g., "Above 70" or "Below 10"), the mean cannot be calculated using standard methods because the class mark of the open-ended class is undefined. In such cases, the median and mode can still be calculated, provided the median class and the modal class are not themselves open-ended.
In practice, it is often beneficial to calculate and report more than one measure of central tendency to provide a more complete description of the data's center and shape. Examining all three measures together can reveal important characteristics of the distribution.
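The effect of an outlier on the three measures can be seen in a short Python sketch using the standard `statistics` module. The income figures are made up purely for illustration.

```python
from statistics import mean, median, mode

# Hypothetical incomes (in thousands); 400 is a deliberately extreme value.
incomes = [30, 32, 35, 35, 38, 40]
incomes_with_outlier = incomes + [400]

for label, data in [("without outlier", incomes),
                    ("with outlier", incomes_with_outlier)]:
    print(label,
          "| mean:", round(mean(data), 2),
          "| median:", median(data),
          "| mode:", mode(data))

# Without the outlier all three measures equal 35.
# With it, the mean rises to about 87.14 while the median and mode stay at 35.
```

A single extreme value more than doubles the mean but leaves the median and mode unchanged, which is why the median is usually preferred for skewed data such as incomes.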