Suppose you scored 82 points on the first assignment and 75 points on the second assignment of the same course. If both scores were out of 100, you'd probably say that you did better on the first assignment. But suppose the average score on the first assignment was 85, and the average score on the second assignment was 67. Would this change how you felt? You might also change your mind if you heard that half the students scored above 80 on the first assignment, while only one student (you!) scored above 73 on the second assignment.
In this topic, we'll discuss measures of center and spread which we can use to describe data sets, as well as to compare different data sets.
We often use a single number, as in the example above, to describe a data set. The three commonly used measures of center are mean, median, and mode.
The mean is the arithmetic average, commonly just referred to as the "average," and is calculated by summing all the data values and then dividing this sum by the number of data values. For example, if we have n data values, $x_1, x_2, x_3, \ldots, x_n$, then their mean is $(x_1 + x_2 + x_3 + \cdots + x_n) / n$.
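As a minimal sketch of this formula in Python, we could write the function below. The function name `mean` and the sample scores are illustrative assumptions, not data from this topic.

```python
def mean(values):
    """Arithmetic mean: the sum of the data values divided by how many there are."""
    if not values:
        raise ValueError("the mean of an empty data set is undefined")
    return sum(values) / len(values)


# Hypothetical scores, used only to illustrate the formula.
scores = [82, 75, 89]
print(mean(scores))  # (82 + 75 + 89) / 3
```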
The median is the middle observation in an ordered list of data. To find the median:
(1) Sort the data values in order.
(2) If there is an odd number of data values, i.e. n is odd, then the median is the value in position (n + 1) / 2. If there is an even number of data values, i.e. n is even, then the median is the mean of the two middle values. That is, when n is even, the median is the arithmetic average of the values in positions n/2 and n/2 + 1.
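The two steps above can be sketched in Python as follows. The function name `median` is our own choice for illustration. Note that the positions in the text count from 1, while Python lists are indexed from 0, so each position is shifted down by one.

```python
def median(values):
    """Median following the two steps above: sort, then pick the middle."""
    ordered = sorted(values)          # step (1): sort the data values
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of an empty data set is undefined")
    if n % 2 == 1:
        # n is odd: position (n + 1) / 2, which is index (n + 1) // 2 - 1
        return ordered[(n + 1) // 2 - 1]
    # n is even: average the values in positions n/2 and n/2 + 1,
    # which are indices n // 2 - 1 and n // 2
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2
```

For instance, `median([3, 1, 2])` picks the single middle value, while `median([4, 1, 3, 2])` averages the two middle values.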
The mode is the most frequently occurring data value, i.e. the data value with the highest frequency. If all data values have the same frequency, the data set does not have a mode. If there are two data values which occur more frequently than the other data values, the distribution is said to be bimodal.
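One possible way to express these rules in Python is sketched below. The function name `modes` is hypothetical, and it makes two assumptions of its own: it returns a list (empty when there is no mode, two entries when the values tied for the highest frequency make the data bimodal), and it treats a data set containing only one distinct value as having that value as its mode.

```python
from collections import Counter


def modes(values):
    """Return the most frequent value(s), or an empty list if there is no mode."""
    counts = Counter(values)
    if not counts:
        return []
    highest = max(counts.values())
    # If several distinct values all share the same frequency, there is no mode.
    if len(counts) > 1 and all(c == highest for c in counts.values()):
        return []
    return sorted(v for v, c in counts.items() if c == highest)
```

Here `modes([1, 2, 2, 3])` gives a single mode, `modes([1, 2, 3])` gives no mode, and `modes([1, 1, 2, 2, 3])` gives two modes.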
Example:
Find the mean, median, and mode of the number of hazardous waste sites
in the following table:
| State | # of Hazardous Waste Sites |
| Colorado | 15 |
| Idaho | 6 |
| Illinois | 39 |
| Indiana | 28 |
| Iowa | 13 |
| Kansas | 10 |
| Minnesota | 24 |
| Missouri | 23 |
| Nebraska | 10 |
| North Dakota | 0 |
| Ohio | 29 |
| Oklahoma | 10 |
| South Dakota | 2 |
| Utah | 16 |
| Wisconsin | 39 |
source: Environmental Protection Agency, Superfund Sites
Answer:
The mean is (15+6+39+28+13+10+24+23+10+0+29+10+2+16+39) / 15 = 264 / 15 = 17.6.
For the median, we first need to sort the data:
| State | # of Hazardous Waste Sites |
| North Dakota | 0 |
| South Dakota | 2 |
| Idaho | 6 |
| Kansas | 10 |
| Nebraska | 10 |
| Oklahoma | 10 |
| Iowa | 13 |
| Colorado | 15 |
| Utah | 16 |
| Missouri | 23 |
| Minnesota | 24 |
| Indiana | 28 |
| Ohio | 29 |
| Illinois | 39 |
| Wisconsin | 39 |
Since there are 15 data values, n = 15, the median will be in the middle, or the (15+1)/2 = 8th position. In this case the median number of hazardous waste sites is 15.
For the mode, we look for the most frequently occurring value. In this data set that value is 10, which occurs 3 times.
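If we want to check these hand calculations, one option is Python's standard `statistics` module. The sketch below uses the data from the table; the variable names are our own.

```python
import statistics

# Number of hazardous waste sites per state, in the order of the first table.
sites = [15, 6, 39, 28, 13, 10, 24, 23, 10, 0, 29, 10, 2, 16, 39]

mean_sites = statistics.mean(sites)      # sum of the values / number of values
median_sites = statistics.median(sites)  # middle value after sorting
mode_sites = statistics.mode(sites)      # most frequently occurring value

print(mean_sites, median_sites, mode_sites)
```

The printed values should agree with the mean, median, and mode found by hand.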
Now, suppose one of the two largest data values above (say, Wisconsin's) were not 39, but 239. This data value would be called an outlier, since it lies outside the general pattern of the data. In this example, the outlier would not affect the median or the mode, but the mean would be dramatically different, becoming (15+6+39+28+13+10+24+23+10+0+29+10+2+16+239) / 15 = 464 / 15 ≈ 30.9.
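To see this effect side by side, we can compare the three measures with and without the outlier. This is only a sketch using the standard `statistics` module; the list names are our own, and the last entry (Wisconsin) is the one replaced by 239.

```python
import statistics

original = [15, 6, 39, 28, 13, 10, 24, 23, 10, 0, 29, 10, 2, 16, 39]
# Replace the last value (39) with the hypothetical outlier 239.
with_outlier = original[:-1] + [239]

for label, data in (("original", original), ("with outlier", with_outlier)):
    print(
        label,
        "mean:", round(statistics.mean(data), 1),
        "median:", statistics.median(data),
        "mode:", statistics.mode(data),
    )
```

Only the mean changes between the two lines of output; the median and mode stay the same.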
If a data set is symmetric, a histogram or stemplot representing the data looks the same to the left and right of the "center." In data that is symmetric (or close to being symmetric), the mean and median are close together. If the data is skewed, then the histogram or stemplot representing the data will have a long "tail." If the outlying values are larger than the median, we say the data is positively skewed, and the mean will be larger than the median. If the outlying data is less than the median, we say the data is negatively skewed, and the mean will be smaller than the median.
Consider the histogram of governors' salaries:

We can estimate the median and the mean by looking at the graph.
The median is the value (on the horizontal axis) that splits the area of the
graph: half the total area lies to its left and half of the total area lies
to its right. In this case, the median salary looks to be around $108,000.
The mean can be visualized by thinking about a seesaw from a children's
playground. The mean is the point on the horizontal axis where the seesaw
would balance perfectly.
Since the data above is positively skewed, the mean is larger than the median.
In this case, the mean salary would be about $115,000.
The center of a data set might not provide enough information, so we could also consider the spread of the data. The three most common numerical measures of the spread of data are: the range, the interquartile range, and the standard deviation. We'll consider the first two here, and then standard deviation in Chapter 17.
The range of a data set is simply the difference between the maximum (largest) data value and the minimum (smallest) data value. In the original example above, the range is 39 - 0 = 39. The range is sensitive to outliers and skewed data, because it depends only on the two most extreme values.
The interquartile range (IQR) is the length of the middle half of the data, and it is somewhat less susceptible to skewed data. To calculate it, we put the data values in increasing order, find the median, and use the median to divide the data into a smaller half and a larger half (when n is odd, as in our example, the median itself belongs to neither half). The first quartile, Q1, is the median of the smaller half of the data values, and the third quartile, Q3, is the median of the larger half. The IQR is then simply Q3 - Q1.
In the example above, Q1 = 10 and Q3 = 28, so the IQR is 28 - 10 = 18.
If we list the first and third quartiles, as well as the median, the minimum, and the maximum values for a data set, we have what is referred to as the five-number summary. For our example above it would be 0, 10, 15, 28, 39.
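The range, the IQR, and the five-number summary can be sketched together in Python. The function names below are our own, and the code assumes the convention that matches the example: when n is odd, the median is left out of both halves. Statistical libraries may use other quartile conventions and can give slightly different values of Q1 and Q3 for some data sets.

```python
def median_of_sorted(ordered):
    """Median of a list that is already sorted."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    ordered = sorted(values)
    n = len(ordered)
    if n < 2:
        raise ValueError("need at least two data values")
    half = n // 2
    lower = ordered[:half]
    # When n is odd, skip the median itself so it belongs to neither half.
    upper = ordered[half + 1:] if n % 2 == 1 else ordered[half:]
    return (
        ordered[0],
        median_of_sorted(lower),
        median_of_sorted(ordered),
        median_of_sorted(upper),
        ordered[-1],
    )


sites = [15, 6, 39, 28, 13, 10, 24, 23, 10, 0, 29, 10, 2, 16, 39]
minimum, q1, med, q3, maximum = five_number_summary(sites)
print("five-number summary:", minimum, q1, med, q3, maximum)
print("range:", maximum - minimum)
print("IQR:", q3 - q1)
```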