Suppose you scored 82 points on the first assignment and 75 points on the second assignment of the same course. If both scores were out of 100, you'd probably say that you did better on the first assignment. But suppose the average score on the first assignment was 85, and the average score on the second assignment was 67. Would this change how you felt? You might also change your mind if you heard that half the students scored above 80 on the first assignment, while only one student (you!) scored above 73 on the second assignment.
In this topic, we'll discuss measures of center and spread which we can use to describe data sets, as well as to compare different data sets.
We often use a single number, as in the example above, to describe a data set. The three commonly used measures of center are mean, median, and mode.
The mean is the arithmetic average, commonly just referred
to as the "average," and is calculated by summing all the data values,
and then dividing this sum by the number of data values.
For example, if we have $n$ data values, $x_1, x_2, x_3, \ldots, x_n$, then
their mean is: $(x_1 + x_2 + x_3 + \cdots + x_n) / n$.
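As a minimal sketch of this formula in Python (the function name `mean` and the sample list are illustrative choices, not part of the original notes), the mean is the sum of the values divided by their count:

```python
def mean(values):
    """Arithmetic mean: sum of the values divided by how many there are."""
    if not values:
        raise ValueError("the mean of an empty data set is undefined")
    return sum(values) / len(values)


# The two assignment scores from the opening example.
scores = [82, 75]
print(mean(scores))  # 78.5
```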
The median is the middle observation in an ordered
list of data. To find the median:
(1) Sort the data values in order (the data does NOT need to be numerical).
(2) If there is an odd number of data values, i.e., n is odd,
then the median is the value in position (n + 1) / 2.
If there is an even number of data values, i.e., n is even,
then the median is the mean of the two middle values. That is, when n
is even, the median is the arithmetic average of the values in positions n/2
and n/2 + 1.
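The two steps above can be sketched in Python as follows. The function name `median` is an illustrative choice, and the sketch assumes numerical data so that the two middle values can be averaged when n is even. Note that Python lists are indexed from 0, so position (n + 1) / 2 in the text corresponds to index n // 2 in the code.

```python
def median(values):
    """Middle value of the sorted data (average of the two middle values if n is even)."""
    ordered = sorted(values)          # step (1): sort the data
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of an empty data set is undefined")
    mid = n // 2
    if n % 2 == 1:
        # n is odd: 1-based position (n + 1) / 2 is 0-based index n // 2
        return ordered[mid]
    # n is even: average the values in 1-based positions n/2 and n/2 + 1
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median([3, 1, 2]))     # 2
print(median([4, 1, 3, 2]))  # 2.5
```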
The mode is the most frequently occurring data value, i.e., the data value with the highest frequency. If all data values have the same frequency, the data set does not have a mode. If two data values occur equally often and more frequently than all the other data values, the distribution is said to be bimodal.
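One possible way to express this definition in Python is sketched below. The function name `modes` is hypothetical; it returns a list so that a bimodal data set yields two values and a data set with no mode yields an empty list. Treating a data set with only one distinct value as having that value as its mode is an assumption of this sketch, since the text does not cover that case.

```python
from collections import Counter


def modes(values):
    """Return the most frequent value(s), or [] if every value is equally frequent."""
    counts = Counter(values)
    if not counts:
        return []
    highest = max(counts.values())
    tied = [value for value, count in counts.items() if count == highest]
    # If every distinct value shares the same frequency, there is no mode.
    # (Assumption: a data set with a single distinct value keeps that value as its mode.)
    if len(tied) == len(counts) and len(counts) > 1:
        return []
    return sorted(tied)


print(modes([1, 2, 2, 3]))     # [2]
print(modes([1, 1, 2, 2, 3]))  # [1, 2]  (bimodal)
print(modes([1, 2, 3]))        # []      (no mode)
```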
Example:
Find the mean, median, and mode of the number of hazardous waste sites
in the following table:
| State | # of Hazardous Waste Sites |
| --- | --- |
| Colorado | 15 |
| Idaho | 6 |
| Illinois | 39 |
| Indiana | 28 |
| Iowa | 13 |
| Kansas | 10 |
| Minnesota | 24 |
| Missouri | 23 |
| Nebraska | 10 |
| North Dakota | 0 |
| Ohio | 29 |
| Oklahoma | 10 |
| South Dakota | 2 |
| Utah | 16 |
| Wisconsin | 39 |
source: Environmental Protection Agency, Superfund Sites
answer:
The mean is (15+6+39+28+13+10+24+23+10+0+29+10+2+16+39) / 15 = 264 / 15 = 17.6.
For the median, we first need to sort the data:
| State | # of Hazardous Waste Sites |
| --- | --- |
| North Dakota | 0 |
| South Dakota | 2 |
| Idaho | 6 |
| Kansas | 10 |
| Nebraska | 10 |
| Oklahoma | 10 |
| Iowa | 13 |
| Colorado | 15 |
| Utah | 16 |
| Missouri | 23 |
| Minnesota | 24 |
| Indiana | 28 |
| Ohio | 29 |
| Illinois | 39 |
| Wisconsin | 39 |
Since there are 15 data values, n = 15, the median will be in the middle, or the (15+1)/2 = 8th position. In this case the median number of hazardous waste sites is 15.
For the mode, we look for the most frequently occurring value. In this data set it is 10, which occurs 3 times.
Now, suppose that one of the two largest data values above (Wisconsin's) were not 39, but 239. This data value would be called an outlier, since it is outside the general pattern of the data. In this example, the outlier would not affect the median or the mode, but the mean would be dramatically different, becoming (15+6+39+28+13+10+24+23+10+0+29+10+2+16+239) / 15 = 464 / 15 ≈ 30.9.
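The effect of the outlier can be checked with Python's standard `statistics` module. This is only a quick sketch; the variable names `sites` and `with_outlier` are illustrative, and the data are the hazardous waste counts from the table above in their original order.

```python
import statistics

# Hazardous waste site counts, in the order of the original table.
sites = [15, 6, 39, 28, 13, 10, 24, 23, 10, 0, 29, 10, 2, 16, 39]

# Replace the last value (39) with the outlier 239.
with_outlier = sites[:-1] + [239]

for label, data in [("original", sites), ("with outlier", with_outlier)]:
    print(
        label,
        "mean:", round(statistics.mean(data), 1),
        "median:", statistics.median(data),
        "mode:", statistics.mode(data),
    )
```

Only the mean moves (from 17.6 to about 30.9); the median stays at 15 and the mode stays at 10.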
If a data set is symmetric, a histogram or stemplot representing the data looks the same to the left and right of the "center." In data that is symmetric (or close to being symmetric), the mean and median are close together. If the data is skewed, then the histogram or stemplot representing the data will have a long "tail." If the outlying values are larger than the median, we say the data is positively skewed, and the mean will be larger than the median. If the outlying data is less than the median, we say the data is negatively skewed, and the mean will be smaller than the median.
Consider the histogram of governors' salaries:

We can estimate the median and the mean by looking at the graph.
The median is the value (on the horizontal axis) that splits the area of the
graph: half the total area lies to its left and half of the total area lies
to its right. In this case, the median salary looks to be around $108,000.
The mean can be visualized by thinking about a seesaw from a children's
playground. The mean is the point at which the seesaw would be perfectly
balanced.
Since the data above is positively skewed, the mean is larger than the median.
In this case, the mean salary would be about $115,000.
The center of a data set might not provide enough information, so we could also consider the spread of the data. The three most common numerical measures of the spread of data are: the range, the interquartile range, and the standard deviation.
The range of a data set is simply the difference between the maximum (largest) data value and minimum (smallest) data value. In the hazardous waste example above, the range is 39 - 0 = 39. The range is susceptible to skewed data.
The interquartile range (IQR) is the length of the middle half of the data, and it is somewhat less susceptible to skewed data. To calculate it, we order the data values in increasing order, find the median, and use the median to divide the data into smaller and larger halves. The first quartile, Q1, is the median of the smaller half of the data values, and the third quartile, Q3, is the median of the larger half of the data. The IQR is then simply Q3 - Q1.
In the example above, Q1 = 10 and Q3 = 28, so the IQR is 28 - 10 = 18.
If we list the first and third quartiles, as well as the median, the minimum, and the maximum values for a data set, we have what is referred to as the five-number summary. For our example above it would be 0, 10, 15, 28, 39.
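The five-number summary for the example can be reproduced with a short Python sketch. The function name `five_number_summary` is hypothetical. The sketch assumes that when n is odd, the median itself is left out of both halves; the text does not state this rule explicitly, but it is the choice that reproduces Q1 = 10 and Q3 = 28 for the example data.

```python
import statistics


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) using the median-of-halves method."""
    ordered = sorted(values)
    n = len(ordered)
    if n < 2:
        raise ValueError("need at least two data values")
    half = n // 2
    lower = ordered[:half]
    # Assumption: when n is odd, the median is excluded from both halves.
    upper = ordered[half + 1:] if n % 2 == 1 else ordered[half:]
    return (
        ordered[0],
        statistics.median(lower),
        statistics.median(ordered),
        statistics.median(upper),
        ordered[-1],
    )


sites = [15, 6, 39, 28, 13, 10, 24, 23, 10, 0, 29, 10, 2, 16, 39]
minimum, q1, med, q3, maximum = five_number_summary(sites)
print("five-number summary:", minimum, q1, med, q3, maximum)
print("range:", maximum - minimum)
print("IQR:", q3 - q1)
```

Other quartile conventions exist (statistical software often interpolates between data values), so results from a library may differ slightly from this hand method.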
While useful, range is susceptible to outliers, since it only uses the maximum and minimum values of the dataset. The interquartile range uses more information about the dataset, since it depends on the middle 50% of the data, and it is resistant to outliers. But the variability of a data set depends on all the data. So next, we discuss a measure of spread that does depend on all the data, standard deviation.
Here are the steps to calculate the standard deviation:
1. Compute the mean.
2. For each data value, calculate its deviation from the mean. The deviation
from the mean is determined by subtracting the mean from the data value. Note
that if we added all these deviations from the mean for one dataset, the sum
would be 0 (or close, depending on round-off error).
3. Square each deviation from the mean.
4. Sum the squares of the deviations.
5. Divide the sum in #4 by n - 1. Note that the text says, "there are
important statistical reasons we divide by one less than the number
of data values."
6. Take the square root of the value in #5, which will give the standard
deviation.
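The six steps above translate almost line for line into Python. This is a minimal sketch; the function name `sample_std_dev` and the intermediate variable names are illustrative, and the sample data is dataset (3) from the comparison example in this topic.

```python
import math


def sample_std_dev(values):
    """Standard deviation computed with the n - 1 divisor, following the six steps."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two data values to divide by n - 1")
    mean = sum(values) / n                        # step 1: compute the mean
    deviations = [x - mean for x in values]       # step 2: data value minus mean
    squared = [d ** 2 for d in deviations]        # step 3: square each deviation
    total = sum(squared)                          # step 4: sum the squares
    variance = total / (n - 1)                    # step 5: divide by n - 1
    return math.sqrt(variance)                    # step 6: take the square root


data = [5, 5, 5, 25, 45, 45, 45]
print(sample_std_dev(data))  # 20.0
```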
Now, let's look at an example where standard deviation helps explain the data.
Consider the following three datasets:
(1) 5, 25, 25, 25, 25, 25, 45
(2) 5, 15, 20, 25, 30, 35, 45
(3) 5, 5, 5, 25, 45, 45, 45
The mean, median, and range are all the same for these datasets, but the variability of each dataset is quite different.
Let's calculate the standard deviation for each dataset:
Columns are grouped by dataset: (1) on the left, (2) in the middle, and (3) on the right.
| i | $x_i$ | $x_i - \text{mean}$ | $(x_i - \text{mean})^2$ | i | $x_i$ | $x_i - \text{mean}$ | $(x_i - \text{mean})^2$ | i | $x_i$ | $x_i - \text{mean}$ | $(x_i - \text{mean})^2$ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 5 | -20 | 400 | 1 | 5 | -20 | 400 | 1 | 5 | -20 | 400 |
| 2 | 25 | 0 | 0 | 2 | 15 | -10 | 100 | 2 | 5 | -20 | 400 |
| 3 | 25 | 0 | 0 | 3 | 20 | -5 | 25 | 3 | 5 | -20 | 400 |
| 4 | 25 | 0 | 0 | 4 | 25 | 0 | 0 | 4 | 25 | 0 | 0 |
| 5 | 25 | 0 | 0 | 5 | 30 | 5 | 25 | 5 | 45 | 20 | 400 |
| 6 | 25 | 0 | 0 | 6 | 35 | 10 | 100 | 6 | 45 | 20 | 400 |
| 7 | 45 | 20 | 400 | 7 | 45 | 20 | 400 | 7 | 45 | 20 | 400 |
| | | sum = | 800 | | | sum = | 1050 | | | sum = | 2400 |
| | | sum/(n-1) = | 133.33 | | | sum/(n-1) = | 175 | | | sum/(n-1) = | 400 |
| | | std. dev. = | 11.55 | | | std. dev. = | 13.23 | | | std. dev. = | 20 |
The standard deviations for the three datasets are 11.55, 13.23, and 20, respectively. Larger standard deviations indicate greater variability in the data, and smaller standard deviations indicate less variability.
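As a quick cross-check of the comparison above, Python's standard `statistics` module can compute the same quantities. In this sketch the dictionary name `datasets` is illustrative; `statistics.stdev` uses the n - 1 divisor, matching the steps described earlier.

```python
import statistics

datasets = {
    "(1)": [5, 25, 25, 25, 25, 25, 45],
    "(2)": [5, 15, 20, 25, 30, 35, 45],
    "(3)": [5, 5, 5, 25, 45, 45, 45],
}

for name, data in datasets.items():
    print(
        name,
        "mean:", statistics.mean(data),
        "median:", statistics.median(data),
        "range:", max(data) - min(data),
        "std. dev.:", round(statistics.stdev(data), 2),
    )
```

All three datasets report a mean of 25, a median of 25, and a range of 40, while the standard deviations differ (11.55, 13.23, and 20.0).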