In this topic, we will explore issues related to surveys and samples.
In an experiment, the investigator imposes some treatment or condition on the subjects and records their responses. In an observational study, the investigator observes the subjects without attempting to manipulate or impose any treatment, and then records their responses.
In both experiments and observational studies, researchers frequently search for a relationship between two or more variables. For example, Harvard researchers conducted a study and found that women getting fewer than 5 hours of sleep per night are more likely to develop diabetes. In this case the amount of sleep was the explanatory variable and the incidence of diabetes was the response variable.
For any study, the entire group of individuals about which we want information is called the population. The portion of the population that we actually use to collect information is a sample, which is a subset of the population.
We must also take care when designing an observational study or experiment to control for confounding variables, which are variables whose influence on the response variable cannot be separated from that of the explanatory variable. One way to control for confounding variables is by randomizing. There are many ways to randomize: drawing names out of a "hat," using Excel's random number generator, or using a random digit table.
We will demonstrate the use of the random number table in the next example.
Example 20.2
The following is a chronological list, across rows, of the U. S. Presidents and their ages at inauguration.
| Washington, 57 | J. Adams, 61 | Jefferson, 57 | Madison, 57 |
| Monroe, 58 | J. Q. Adams, 57 | Jackson, 61 | Van Buren, 54 |
| W. H. Harrison, 68 | Tyler, 51 | Polk, 49 | Taylor, 64 |
| Fillmore, 50 | Pierce, 48 | Buchanan, 65 | Lincoln, 52 |
| A. Johnson, 56 | Grant, 46 | Hayes, 54 | Garfield, 49 |
| Arthur, 50 | Cleveland, 47 | B. Harrison, 55 | Cleveland, 55 |
| McKinley, 54 | T. Roosevelt, 42 | Taft, 51 | Wilson, 56 |
| Harding, 55 | Coolidge, 51 | Hoover, 54 | F. D. Roosevelt, 51 |
| Truman, 60 | Eisenhower, 62 | Kennedy, 43 | L. B. Johnson, 55 |
| Nixon, 56 | Ford, 61 | Carter, 52 | Reagan, 69 |
| G. H. Bush, 64 | Clinton, 46 | G. W. Bush, 54 |
We want to choose a sample of 5 U. S. Presidents.
a. Number the Presidents, using two digits, from 01 to 43, going left-right across each row.
b. Using the random digit table below, group the digits in pairs. Ignore any pair outside the range 01 - 43, and ignore any pair that repeats one already chosen. The digits are printed in groups of 5, but the grouping and spacing are not important.
Random Number Table
12975 13258 13048 45144 72321 81940 00360 02428 96767 35964 23822
96012 94591 65194 50842 53372 72829 50232 97892 63408 77919 44575
24870 04178 88565 42628 17797 49376 61762 16953 88604 12724 62964
99612 93465 64658 27402 56319 81103 46759 14520 19807 46845 30862
Answer:
We generate a list: 12, 97, 51, 32, 58, 13, 04, 84, 51, 44, 72, 32, 18, 19, 40, ...
Choosing the first five distinct numbers between 01 and 43, we get: 12, 32, 13, 04, 18. We ignore the second 32.
c. Write the name and age of the 5 Presidents chosen.
Answer:
12 - Taylor, 64
32 - F. D. Roosevelt, 51
13 - Fillmore, 50
04 - Madison, 57
18 - Grant, 46
How does the mean of your sample compare to the mean of the entire population, which is approximately 54.8?
Answer: The mean of the randomly selected Presidents is (64+51+50+57+46)/5 = 53.6, which is a bit less than the mean for the population.
d. T. Roosevelt was the first President inaugurated after 1900. How many Presidents inaugurated after 1900 are in your sample?
Answer: one, F. D. Roosevelt.
e. Repeat a. - c., starting with the third row of the random digit table.
Answer:
random numbers - 24, 87, 00, 41, 78, 88, 56, 54, 26, 28, 17, 79, 74, 93, 76, 61, 76, 21, ...
5 numbers in range - 24, 41, 26, 28, 17
Presidents - Cleveland, 55; G. H. Bush, 64; T. Roosevelt, 42; Wilson, 56; A. Johnson, 56
mean age - (55+64+42+56+56)/5 = 54.6, which is very close to the population mean
f. How would this process change if there were a population of 500 and we needed to pick a random sample of 5?
Answer: Number the individuals with three digits, 001 through 500, and then group the random digits in threes.
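The random digit table procedure from this example can be sketched in Python. This is only an illustration under stated assumptions: the function name `sample_from_digit_table` and its parameters are hypothetical, labels are assumed to run from 1 up to the population size, and repeated labels are skipped as in part b. The `width` parameter covers part f: use 2 for a population of 43 and 3 for a population of 500.

```python
def sample_from_digit_table(rows, population_size, sample_size, width=2):
    """Pick distinct labels in 1..population_size by reading fixed-width digit groups.

    Hypothetical helper for illustration; spacing inside the rows is ignored.
    """
    # Keep only the digits, so the printed groups of 5 do not matter.
    digits = "".join(ch for row in rows for ch in row if ch.isdigit())
    chosen = []
    for i in range(0, len(digits) - width + 1, width):
        label = int(digits[i:i + width])
        # Skip labels outside the range and labels already chosen.
        if 1 <= label <= population_size and label not in chosen:
            chosen.append(label)
            if len(chosen) == sample_size:
                return chosen
    raise ValueError("digit table exhausted before the sample was complete")


# Rows 1 and 3 of the random digit table in the example.
ROW_1 = "12975 13258 13048 45144 72321 81940 00360 02428 96767 35964 23822"
ROW_3 = "24870 04178 88565 42628 17797 49376 61762 16953 88604 12724 62964"

print(sample_from_digit_table([ROW_1], 43, 5))  # [12, 32, 13, 4, 18]
print(sample_from_digit_table([ROW_3], 43, 5))  # [24, 41, 26, 28, 17]
```

The two printed lists match the labels found by hand in parts b and e (the label 04 prints as 4 because it is stored as an integer).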
In the previous example, 18 of 43 U. S. Presidents were inaugurated after 1900, which is over 40%. We could choose a sample with different groups, or strata, where the proportion for each stratum is approximately the same in the sample as in the population. This is called a stratified random sample. For example, we could use "inaugurated before 1900" and "inaugurated 1900 or later" as the strata. Since 25 of 43, approx. 60%, were inaugurated before 1900, and 18 of 43, approx. 40%, were inaugurated 1900 or later, a representative sample would have strata of 60% and 40%, respectively.
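As a rough sketch of how a stratified random sample with these two strata might be drawn, the Python below sizes each stratum in proportion to its share of the population and then takes a simple random sample within it. The function name `stratified_sample`, the rounding rule, and the seed are assumptions made for illustration; with other stratum sizes, rounding may not add up to the requested sample size and would need adjusting.

```python
import random


def stratified_sample(strata, sample_size, seed=None):
    """Sample within each stratum, sized by proportional allocation (illustrative)."""
    rng = random.Random(seed)
    total = sum(len(members) for members in strata.values())
    sample = {}
    for name, members in strata.items():
        # Proportional allocation: stratum share of the population times sample size.
        k = round(sample_size * len(members) / total)
        sample[name] = rng.sample(members, k)
    return sample


# Presidents are labeled 1..43 in chronological order, as in the example.
labels = list(range(1, 44))
strata = {
    "inaugurated before 1900": labels[:25],    # labels 01-25
    "inaugurated 1900 or later": labels[25:],  # labels 26-43
}

picked = stratified_sample(strata, 5, seed=1)
for name, members in picked.items():
    print(name, sorted(members))
```

For a sample of 5, the allocation works out to 3 Presidents from the first stratum and 2 from the second, which matches the 60% and 40% split.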
If a sampling method produces results that are systematically different from the true results about the population, then the method is biased.
Selection bias occurs when sampling excludes entirely, or includes disproportionately, some segment of the population.
Response bias occurs when questions on a survey, the behavior of the interviewer, or the situation surrounding the interview influences the response.
Non-response bias occurs when some subjects selected for the sample choose not to participate, and these non-responders are different from responders.
Survey respondents might be hesitant to respond to sensitive questions, or might not answer them truthfully. Warner's randomized response model is an interesting way to elicit candid responses.
Example 20.5
We will ask each student in the class to toss a penny and a nickel, and to keep track of the result of each toss. Then we will survey the class with two questions.
Q1 is, "Have you, or anyone you know, ever phoned in a vote on American Idol?"
Q2 is, "Was the outcome on the nickel a head?"
If the result of the penny toss was heads, then answer Q1. Otherwise, if the penny toss was a tail, answer Q2. Since no one knows which question the respondent is answering, there is no stigma associated with answering Yes.
Here is a blank table which can be used for general examples:
| | Answered Yes | Answered No | Total |
| Answered Q1 | | | |
| Answered Q2 | | | |
| Total | | | |
If we want to consider some data to fill the table, we can assume there are 100 respondents, and further that 50 answer Q1 and 50 answer Q2, since a fair penny lands heads about half the time. Of the 50 answering Q2, we can assume that 25 answer Yes and 25 answer No, since a fair nickel also lands heads about half the time.
| | Answered Yes | Answered No | Total |
| Answered Q1 | | | 50 |
| Answered Q2 | 25 | 25 | 50 |
| Total | | | 100 |
Then, if we tallied the results, and there were 44 Yes responses, we can fill the rest of the table.
| | Answered Yes | Answered No | Total |
| Answered Q1 | 19 | 31 | 50 |
| Answered Q2 | 25 | 25 | 50 |
| Total | 44 | 56 | 100 |
So the probability that a respondent answered Yes, given that they answered Q1, is:
Pr( Answered Yes | Answered Q1 ) = 19/50 = 0.38. This gives an estimate that 38% of these college students have, or know someone who has, phoned in a vote on American Idol.
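The arithmetic used to fill the table can be written as a short Python sketch. The function name `estimate_q1_yes_rate` and its parameters are hypothetical, and the calculation relies on the same assumptions as the example: both coins are fair, so about half of the respondents answer each question and about half of those answering Q2 say Yes.

```python
def estimate_q1_yes_rate(total_yes, n_respondents, p_q1=0.5, p_q2_yes=0.5):
    """Estimate Pr(Answered Yes | Answered Q1) from the total count of Yes responses.

    Illustrative only; p_q1 and p_q2_yes are the assumed coin probabilities.
    """
    expected_q1 = n_respondents * p_q1          # respondents expected to answer Q1
    expected_q2 = n_respondents - expected_q1   # respondents expected to answer Q2
    expected_q2_yes = expected_q2 * p_q2_yes    # Yes answers expected from Q2 alone
    q1_yes = total_yes - expected_q2_yes        # remaining Yes answers attributed to Q1
    return q1_yes / expected_q1


print(estimate_q1_yes_rate(44, 100))  # 0.38
```

With 44 Yes responses out of 100, this reproduces the 19/50 = 0.38 from the table. In a real class the actual split between Q1 and Q2 will vary around 50/50, so the result is an estimate rather than an exact proportion.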