KEY CONCEPTS
PRESENTING PROBLEMS
Presenting Problem 1
Barbara Dennison and her colleagues (1997) asked an intriguing question relating to nutrition in young children: How does fruit juice consumption affect growth parameters during early childhood? The American Academy of Pediatrics has warned that excessive use of fruit juice may cause gastrointestinal symptoms, including diarrhea, abdominal pain, and bloating caused by fructose malabsorption and the presence of the nonabsorbable sugar alcohol, sorbitol. Excessive fruit juice consumption has been reported as a contributing factor in failure to thrive.
These investigators designed a cross-sectional study including 116 two-year-old children and 107 five-year-old children selected from a primary care pediatric practice. The children's parents completed a 7-day dietary record that included the child's daily consumption of beverages—milk, fruit juice, soda pop, and other drinks. Height was measured to the nearest 0.1 cm and weight to the nearest 0.25 lb. Excess fruit juice consumption was defined as ≥ 12 fl oz/day. Both the body mass index (BMI) and the ponderal index were used as measures of obesity.
They found that the dietary energy intake of the children in their study population, 1245 kcal for the 2-year-olds and 1549 kcal for the 5-year-olds, was remarkably similar to that reported in the National Health and Nutrition Examination Survey (NHANES) taken from a nationally representative sample of white children. The prevalence of short stature and obesity was higher among children consuming excess fruit juice. Forty-two percent of children drinking ≥ 12 fl oz/day of fruit juice were short compared with 14% of children drinking < 12 fl oz/day. For obesity the percentages were 53% and 32%, respectively.
We use the observations on the group of 2-year-old children (see section titled, “Introduction to Questions About Means”), and find that the t distribution and t test are appropriate statistical approaches. The entire data set, including information on 5-year-olds as well, is available in the folder entitled “Dennison” on the CD-ROM.
Presenting Problem 2
Concerns related to the use of smallpox virus as a potential biological warfare agent have led to intense interest in evaluating the availability and safety of smallpox vaccine. Currently, the supply of smallpox vaccine is insufficient to vaccinate all United States residents.
The National Institute of Allergy and Infectious Diseases conducted a randomized, single-blind trial to determine the rate of success of inoculation with different dilutions of smallpox vaccine (Frey et al, 2002). A total of 680 healthy adult volunteers 18–32 years of age were randomly assigned to receive undiluted vaccine, a 1:5 dilution of vaccine, or a 1:10 dilution of vaccine. A primary end point of the study was the rate of success of vaccination defined by the presence of a primary vesicle at the inoculation site 7–9 days after inoculation. If no vesicle formed, revaccination with the same dilution of vaccine was administered. The investigators also wished to determine the range and frequency of adverse reactions to the vaccine. We use data from this study to illustrate statistical methods for a proportion.
Presenting Problem 3
Following cholecystectomy, symptoms of abdominal pain, flatulence, or dyspepsia occur frequently and are part of the “postcholecystectomy syndrome.” Postcholecystectomy diarrhea (PCD) is a well-known complication of the surgery, although the frequency of its occurrence varies considerably in clinical reports. Sauter and colleagues (2002) prospectively evaluated the frequency of PCD and changes in bowel habits in patients undergoing cholecystectomy. They also evaluated the role of bile acid malabsorption in PCD.
Fifty-one patients undergoing cholecystectomy were evaluated before, 1 month after, and 3 months after cholecystectomy. Patients were interviewed about the quality and frequency of their stools. In addition, to evaluate the role of bile acid malabsorption, serum concentrations of 7α-hydroxy-4-cholesten-3-one (7α-HCO) were measured before and after surgery.
After cholecystectomy, there was an increase in the number of patients reporting more than one bowel movement per day: 22% before surgery, 51% at 1 month, and 45% at 3 months. Those reporting loose stools also increased.
The section titled, “Confidence Intervals for the Mean Difference in Paired Designs” gives 7α-HCO levels at baseline, 1 month after surgery, and 3 months after surgery; and the data sets are in a folder on the CD-ROM called “Sauter.” We use the data from this study to illustrate before and after study designs with both binary and numerical variables.
Presenting Problem 4
Large-vessel atherothromboembolism is a major cause of ischemic stroke. Histologic studies of atherosclerotic plaques suggest that the lesions containing a large lipid-rich necrotic core or intraplaque hemorrhage place patients at greater risk of ischemic stroke. Yuan and colleagues (2001) used high-resolution magnetic resonance imaging (MRI) to study characteristics of diseased carotid arteries to determine which plaque features might pose higher risk for future ischemic complications.
They evaluated 18 consecutive patients scheduled for carotid endarterectomy with a carotid artery MRI examination and correlated these findings with histopathologic characteristics of the surgical carotid artery specimens. The histology slides were evaluated by a pathologist who was blinded to the imaging results. It is important to establish the level of agreement between the MRI findings and histology, and we will use the observations to illustrate a measure of agreement called Cohen's kappa (κ). See the data in the section titled, “Measuring Agreement Between Two People or Methods” and the file entitled “Yuan” on the CD-ROM.
PURPOSE OF THE CHAPTER
The methods in Chapter 3 are often called descriptive statistics because they help investigators describe and summarize data. Chapter 4 provided the basic probability concepts needed to evaluate data using statistical methods. Without probability theory, we could not make statements about populations without studying everyone in the population—clearly an undesirable and often impossible task. In this chapter we begin the study of inferential statistics; these are the statistical methods used to draw conclusions from a sample and make inferences to the entire population. In all the presenting problems in this and future chapters dealing with inferential methods, we assume the investigators selected a random sample of individuals to study from a larger population to which they wanted to generalize.
In this chapter, we focus specifically on research questions that involve one group of subjects who are measured on one or two occasions. The best statistical approach may depend on the way we pose the research question and the assumptions we are willing to make.
We spend a lot of time on confidence intervals and hypothesis testing in this chapter in order to introduce the logic behind these two approaches. We also discuss some of the traditional topics associated with hypothesis testing, such as the errors that can be made, and we explain what P values mean. In subsequent chapters we streamline the presentation of the procedures, but we believe it is worthwhile to emphasize the details in this chapter to help reinforce the concepts.
Surveys of statistical methods used in journals indicate that the t test is one of the most commonly used statistical methods. The percentages of articles that use the t test range from 10% to more than 60%. Williams and colleagues (1997) noted a number of problems in using the t test, including a lack of discussion of assumptions in more than 85% of the articles, and Welch and Gabbe (1996) found a number of errors in using the t test when a nonparametric procedure is called for. Thus, being able to evaluate the use of tests comparing means—whether they are used properly and how to interpret the results—is an important skill for medical practitioners.
We depart from some of the traditional texts and present formulas in terms of sample statistics rather than population parameters. We also use the formulas that best reflect the concepts rather than the ones that are easiest to calculate, for the very reason that calculations are not the important issue.
MEAN IN ONE GROUP WHEN THE OBSERVATIONS ARE NORMALLY DISTRIBUTED
Introduction to Questions About Means
Dennison and colleagues (1997) wanted to estimate the average consumption of various beverages in 2- and 5-year-old children and to determine whether nutritional intake in the children in their study differed from that reported in a national study of nutrition (NHANES III). Some of their findings are given in Table 5-1. Focusing specifically on the 2-year-olds, their research questions were: (1) How confident can we be that the observed mean fruit juice consumption is 5.97 oz/day? and, (2) Is the mean energy intake (1242 kcal) in their study of 2-year-olds significantly different from 1286 kcal, the value reported in NHANES III? Stated differently, do the measurements of energy intake in their study of 2-year-old children come from the same population as the measurements in NHANES III? We will use the t distribution to form confidence limits and perform statistical tests to answer these kinds of research questions.
Before discussing research questions involving means, let's think about what it takes to convince us that a mean in a study is significantly different from a norm or population mean. If we want to know whether the mean energy intake in 2-year-old children in our practice is different from the mean in a national nutrition study, what evidence is needed to conclude that energy intake is really different in our group and not just a random occurrence? If the mean energy intake is much larger or smaller than the mean in the national nutrition study, such as the situation in Figure 5-1A, we will probably conclude that the difference is real. What if the difference is relatively moderate, as is the situation in Figure 5-1B?
What other factors can help us? Figure 5-1B gives a clue: The sample values vary substantially, compared with Figure 5-1A, in which there is less variation. A smaller standard deviation may lead to a real difference, even though the difference is relatively small. For the variability to be small, subjects must be relatively similar (homogeneous) and the method of measurement must be relatively precise. In contrast, if the characteristic measured varies widely from one person to another or if the measuring device is relatively crude, the standard deviations will be greater, and we will need to observe a greater difference to be convinced that the difference is real and not just a random occurrence.
Another factor is the number of patients included in the sample. Most of us have greater intuitive confidence in findings that are based on a larger rather than a smaller sample, and we will demonstrate the sound statistical reasons for this confidence.
To summarize, three factors play a role in deciding whether an observed mean differs from a norm: (1) the difference between the observed mean and the norm, (2) the amount of variability among subjects, and (3) the number of subjects in the study. We will see later in this chapter that the first two factors are important when we want to estimate the needed sample size before beginning a study.
Introduction to the t Distribution
The t test is used a great deal in all areas of science. The t distribution is similar in shape to the z distribution introduced in the previous chapter, and one of its major uses is to answer research questions about means. Because we use the t distribution and the t test in several chapters, we need a basic understanding of t.
The t test is sometimes called “Student's t test” after the person who first studied the distribution of means from small samples; his findings were published in 1908. Student was really a mathematician named William Gosset who worked for the Guinness Brewery; he was forced to use the pseudonym Student because of company policy prohibiting employees from publishing their work. Gosset discovered that when observations come from a normal distribution, the means are normally distributed only if the true standard deviation in the population is known. When the true standard deviation is not known and researchers use the sample standard deviation in its place, the means are no longer normally distributed. Gosset named the distribution of means when the sample standard deviation is used the t distribution.
If you think about it, you will recognize that we almost always use samples instead of populations in medical research. As a result, we seldom know the true standard deviation and almost always use the sample standard deviation. Our conclusions are therefore more likely to be accurate if we use the t distribution rather than the normal distribution, although the difference between t and z becomes very small when n is greater than 30.
The formula (or critical ratio) for the t test has the observed mean (X̅) minus the hypothesized value of the population mean (μ) in the numerator, and the standard error of the mean in the denominator. The symbol μ stands for the true mean in the population; it is the Greek letter mu, pronounced “mew.” The formula for the t test is

t = (X̅ – μ) / (SD/√n)
Table 5-1. Data on average consumption in 2-year-old children.
We know the standard normal, or z, distribution is symmetric with a mean of 0 and a standard deviation of 1. The t distribution is also symmetric and has a mean of 0, but its standard deviation is larger than 1. The precise size of the standard deviation depends on a complex concept related to the sample size, called degrees of freedom (df), which is related to the number of times sample information is used. Because sample information is used once to estimate the standard deviation, the t distribution for one group has n – 1 df.
Figure 5-1. Comparison of distributions.
Because the t distribution has a larger standard deviation, it is wider and its tails are higher than those for the z distribution. As the sample size increases to 30 or more, the df also increase, and the t distribution becomes almost the same as the standard normal distribution, and either t or z can be used. Generally the t distribution is used in medicine, even when the sample size is 30 or greater, and we will follow that practice in this book. Computer programs, such as Visual Statistics (module on continuous distributions) or ConStats, that allow for the plotting of different distributions can be used to generate t distributions for different sample sizes in order to compare them, as we did in Figure 5-2.
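The convergence of t toward z is easy to verify numerically. The following is a minimal sketch in Python with SciPy (our assumption; the text itself uses Visual Statistics and ConStats) that prints the two-tailed 5% critical values for several degrees of freedom:

```python
# Minimal sketch: two-tailed 5% critical values of t shrink toward z = 1.96
# as the degrees of freedom increase. Assumes SciPy is installed.
from scipy import stats

for df in (1, 5, 25, 30, 93, 120):
    t_crit = stats.t.ppf(0.975, df)   # upper 2.5% cutoff of t with df degrees of freedom
    print(f"df = {df:>3}: t = {t_crit:.3f}")
print(f"normal (z):  {stats.norm.ppf(0.975):.3f}")   # 1.960
```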
When using the t distribution to answer research questions, we need to find the area under the curve, just as with the z distribution. The area can be found by using calculus to integrate a mathematical function, but fortunately we do not need to do so. Formerly, statisticians used tables (as we do when illustrating some points in this book), but today most of us use computer programs. Table A–3 in Appendix A gives the critical values for the t distribution corresponding to areas in the tail of the distribution equal to 0.10, 0.05, 0.02, 0.01, and 0.001 for two-tailed, or two-sided, tests (half that size for one-tailed tests or one-sided tests).
We assume that the observations are normally distributed in order to use the t distribution. When the observations are not normally distributed, a nonparametric statistical test, called the sign test, is used instead; see the section titled, “What to Do When Observations Are Not Normally Distributed.”
Figure 5-2. t Distribution with 1, 5, and 25 df and standard normal (z) distribution.
The t Distribution and Confidence Intervals About the Mean in One Group
Confidence intervals are used increasingly for research involving means, proportions, and other statistics in medicine, and we will encounter them in subsequent chapters. Thus, it is important to understand the basics. The general format for confidence intervals for one mean is

observed mean ± confidence coefficient × standard error of the mean
The confidence coefficient is a number related to the level of confidence we want; typical values are 90%, 95%, and 99%, with 95% being the most common. Refer to Table A–3 to find the confidence coefficients. For 95% confidence, we want the value that separates the central 95% of the distribution from the 5% in the two tails; with 10 df this value is 2.228. As the sample size becomes very large, the confidence coefficient for a 95% confidence interval is the same as the z distribution, 1.96, as shown in the bottom line of Table A–3.
Recall from Chapter 4 that the standard error of the mean (SE) is the standard deviation divided by the square root of the sample size and is used to estimate how much the mean can be expected to vary from one sample to another. Using X̅ as the observed (sample) mean, the formula for a 95% confidence interval for the true mean is

X̅ ± t (SD/√n)
where t stands for the confidence coefficient (critical value from the t distribution), which, as we saw earlier, depends on the df (which in turn depend on the sample size).
Using the data from Dennison and coworkers (1997) in Table 5-1, we discover that the mean is 5.97 oz/day and the standard deviation is 4.77. The df for the mean in a single group is n – 1, or 94 – 1 = 93 in our example. In Table A–3, the value corresponding to 95% confidence limits is about halfway between 2.00 for 60 df and 1.98 for 120 df, so we use 1.99. Using these numbers in the preceding formula, we get

5.97 ± 1.99 × (4.77/√94) = 5.97 ± 0.98
or approximately 4.99 to 6.95 oz/day. We interpret this confidence interval as follows: in other samples of 2-year-old children, Dennison and coworkers (or other researchers) would almost always observe mean juice consumption different from the one in this study. They would not know the true mean, of course. If they calculated a 95% confidence interval for each mean, however, 95% of these confidence intervals would contain the true mean. They can therefore have 95% confidence that the interval from 4.99 to 6.95 oz/day contains the actual mean juice consumption in 2-year-old children. Using 4.99 to 6.95 oz/day to express the confidence interval is better than 4.99–6.95 oz/day, which can become confusing if the interval has negative signs.
Medical researchers often use error graphs to illustrate means and confidence intervals. Box 5-1 shows an error graph of the mean fruit juice consumption among 2-year-old children, along with the 95% confidence limits. You can replicate this analysis using the “Dennison” file and the SPSS Explore procedure.
There is nothing sacred about 95% confidence intervals; they simply are the ones most often reported in the medical literature. If researchers want to be more confident that the interval contains the true mean, they can use a 99% confidence interval. Will this interval be wider or narrower than the interval corresponding to 95% confidence?
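As a computational aside (not part of the original text, which relies on SPSS), a few lines of Python with SciPy can reproduce the interval above from the summary statistics and answer the 99% question directly:

```python
# Sketch: 95% and 99% confidence intervals for mean fruit juice consumption,
# using the summary statistics quoted in the text (mean 5.97 oz/day,
# SD 4.77, n = 94). Assumes SciPy is installed.
from math import sqrt
from scipy import stats

mean, sd, n = 5.97, 4.77, 94
se = sd / sqrt(n)                                   # standard error of the mean

for level in (0.95, 0.99):
    t_coef = stats.t.ppf((1 + level) / 2, n - 1)    # confidence coefficient
    print(f"{level:.0%} CI: {mean - t_coef*se:.2f} to {mean + t_coef*se:.2f} oz/day")
```

The 99% interval comes out wider (roughly 4.68 to 7.26 oz/day): demanding more confidence requires a larger coefficient and therefore a wider interval.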
The t Distribution and Testing Hypotheses About the Mean in One Group
Some investigators test hypotheses instead of finding and reporting confidence intervals. The conclusions are the same, regardless of which method is used. More and more, statisticians recommend confidence intervals because they actually provide more information than hypothesis tests. Some researchers still prefer hypothesis tests, possibly because tests have been used traditionally. We will return to this point after we illustrate the procedure for testing a hypothesis concerning the mean in a single sample.
As with confidence limits, the purpose of a hypothesis test is to permit generalizations from a sample to the population from which the sample came. Both statistical hypothesis testing and estimation make certain assumptions about the population and then use probabilities to estimate the likelihood of the results obtained in the sample, given these assumptions.
To illustrate hypothesis testing, we use the energy intake data from Dennison and coworkers (1997) in Table 5-1. We use these observations to test whether the mean energy intake in 2-year-olds in this study is different from the mean energy intake in the NHANES III data shown in Table 5-2, which we take to be the norm. Another way to state the research question is: On average, do 2-year-old children in the sample studied by Dennison and coworkers have a different level of energy intake from 2-year-olds in the NHANES III study?
Statistical hypothesis testing seems to be the reverse of our nonstatistical thinking. We first assume that the mean energy intake is the same as in NHANES III (1286 kcal), and then we find the probability of observing mean energy intake equal to 1242 kcal in a sample of 94 children, given this assumption. If the probability is large, we conclude that the assumption is justified and the mean energy intake in the study is not statistically different from that reported by NHANES III. If the probability is small, however—such as 1 out of 20 (0.05) or 1 out of 100 (0.01)—we conclude that the assumption is not justified and that there really is a difference; that is, 2-year-old children in the Dennison and coworkers study have a mean energy intake different from those in NHANES III. Following a brief discussion of the assumptions we make when using the t distribution, we will use the Dennison and coworkers study to illustrate the steps in hypothesis testing.
Assumptions in Using the t Distribution
For the t distribution or the t test to be used, observations should be normally distributed. Many computer programs, such as NCSS and SPSS, overlay a plot of the normal distribution on a histogram of the data. Often it is possible to look at a histogram or a box-and-whisker plot and make a judgment call. Sometimes we know the distribution of the data from past research, and we can decide whether the assumption of normality is reasonable. This assumption can be tested empirically by plotting the observations on a normal probability graph, called a Lilliefors graph (Conover, 1999), or by using one of several statistical tests of normality. The NCSS computer program produces a normal probability plot as part of the Descriptive Statistics Report, which we illustrate in the section titled, “Mean Difference When Observations Are Not Normally Distributed” (see Box 5-2), and reports the results of several statistical tests. SPSS has a routine to test normality that is part of the Explore Plots option. It is always a good idea to plot data before beginning the analysis in case some strange values are present that need to be investigated.
Box 5-1. NINETY-FIVE PERCENT CONFIDENCE INTERVAL AND ERROR GRAPH FOR THE MEAN FRUIT JUICE CONSUMPTION IN 2-YEAR-OLD CHILDREN.
Source: Data, used with permission, from Dennison BA, Rockwell HL, Baker SL: Excess fruit juice consumption by preschool-aged children is associated with short stature and obesity. Pediatrics 1997;99:15–22. Table produced with SPSS; used with permission.
You may wonder why normality matters. What happens if the t distribution is used for observations that are not normally distributed? With 30 or more observations, the central limit theorem (Chapter 4) tells us that means are normally distributed, regardless of the distribution of the original observations. So, for research questions concerning the mean, the central limit theorem basically says that we do not need to worry about the underlying distribution with reasonable sample sizes. However, using the t distribution with observations that are not normally distributed and when the sample size is fewer than 30 can lead to confidence intervals that are too narrow. In this situation, we erroneously conclude that the true mean falls in a narrower range than is really the case. If the observations deviate from the normal distribution in only minor ways, the t distribution can be used anyway, because it is robust for nonnormal data. (Robustness means we can draw the proper conclusion even when all our assumptions are not met.)
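The coverage problem described above can be demonstrated by simulation. This sketch (our illustration, not from the text; it assumes NumPy and SciPy) draws many small samples from a markedly skewed population and counts how often the t-based 95% interval actually contains the true mean:

```python
# Simulation sketch: with skewed (exponential) data and a small n, t-based
# 95% confidence intervals cover the true mean less often than 95%.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
true_mean, n, reps = 1.0, 10, 10_000
t_coef = stats.t.ppf(0.975, n - 1)
covered = 0
for _ in range(reps):
    x = rng.exponential(scale=true_mean, size=n)    # skewed population
    se = x.std(ddof=1) / np.sqrt(n)
    covered += x.mean() - t_coef * se <= true_mean <= x.mean() + t_coef * se
print(f"observed coverage with n = {n}: {covered / reps:.3f}")  # typically below 0.95
```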
Table 5-2. Children's energy and macronutrient intake.
HYPOTHESIS TESTING
We now illustrate the steps in testing a hypothesis and discuss some related concepts using data from the study by Dennison and coworkers.
Steps in Hypothesis Testing
A statistical hypothesis is a statement of belief about population parameters. Like the term “probability,” the term “hypothesis” has a more precise meaning in statistics than in everyday use.
Step 1: State the research question in terms of statistical hypotheses. The null hypothesis, symbolized by H0, is a statement claiming that there is no difference between the assumed or hypothesized value and the population mean; null means “no difference.” The alternative hypothesis, which we symbolize by H1 (some textbooks use HA), is a statement that disagrees with the null hypothesis.
If the null hypothesis is rejected as a result of sample evidence, then the alternative hypothesis is concluded. If the evidence is insufficient to reject the null hypothesis, it is retained but not accepted per se. Scientists distinguish between not rejecting and accepting the null hypothesis; they argue that a better study may be designed in which the null hypothesis will be rejected. Traditionally, we therefore do not accept the null hypothesis from current evidence; we merely state that it cannot be rejected.
For the Dennison and coworkers study, the null and alternative hypotheses are as follows:
H0: The mean energy intake in 2-year-old children in the study, μ1, is not different from the norm (mean in NHANES III), μ0, written μ1 = μ0.
H1: The mean energy intake in 2-year-old children in the Dennison and coworkers study, μ1, is different from the norm (mean in NHANES III), μ0, written μ1 ≠ μ0.
(Recall that μ stands for the true mean in the population.)
These hypotheses are for a two-tailed (or nondirectional) test: The null hypothesis will be rejected if mean energy intake is sufficiently greater than 1286 kcal or if it is sufficiently less than 1286 kcal. A two-tailed test is appropriate when investigators do not have an a priori expectation for the value in the sample; they want to know if the sample mean differs from the population mean in either direction.
A one-tailed (or directional) test can be used when investigators have an expectation about the sample value, and they want to test only whether it is larger or smaller than the mean in the population. An example of a directional alternative hypothesis is H1: The mean energy intake in 2-year-old children in the Dennison and coworkers study, μ1, is larger than the norm (mean in NHANES III), μ0, sometimes written μ1 > μ0; the corresponding null hypothesis is that the mean energy intake is not larger than the norm, sometimes written as μ1 ≤ μ0.
Box 5-2. SIGN TEST OF CHANGE IN 7α-HYDROXY-4-CHOLESTEN-3-ONE (7α-HCO) BEFORE AND 1 MONTH AFTER CHOLECYSTECTOMY.
Source: Data, used with permission, from Sauter GH, Moussavian AC, Meyer G, Steitz HO, Parhofer KG, Jungst D: Bowel habits and bile acid malabsorption in the months after cholecystectomy. Am J Gastroenterol 2002;97(2):1732–35. Table produced with NCSS; used with permission.
A one-tailed test has the advantage over a two-tailed test of obtaining statistical significance with a smaller departure from the hypothesized value, because there is interest in only one direction. Whenever a one-tailed test is used, it should therefore make sense that the investigators really were interested in a departure in only one direction before the data were examined. The disadvantage of a one-tailed test is that once investigators commit themselves to this approach, they are obligated to test only in the hypothesized direction. If, for some unexpected reason, the sample mean departs from the population mean in the opposite direction, the investigators cannot rightly claim the departure as significant. Medical researchers often need to be able to test for possible unexpected adverse effects as well as the anticipated positive effects, so they most frequently choose a two-tailed hypothesis even though they have an expectation about the direction of the departure. A graphic representation of a one-tailed and a two-tailed test is given in Figure 5-3.
Figure 5-3. Defining areas of acceptance and rejection in hypothesis testing using α = 0.05. A: Two-tailed or nondirectional. B: One-tailed or directional lower tail. C: One-tailed or directional upper tail. (Data, used with permission, from Dennison BA, Rockwell HL, Baker SL: Excess fruit juice consumption by preschool-aged children is associated with short stature and obesity. Pediatrics 1997;99:15–22. Graphs produced using the Visualizing Continuous Distributions module in Visual Statistics, a program published by McGraw-Hill Companies; used with permission.)
Step 2: Decide on the appropriate test statistic. Some texts use the term “critical ratio” to refer to test statistics. Choosing the right test statistic is a major topic in statistics, and subsequent chapters focus on which test statistics are appropriate for answering specific kinds of research questions.
We decide on the appropriate statistic as follows. Each test statistic has a probability distribution. In this example, the appropriate test statistic is based on the t distribution because we want to make inferences about a mean and do not know the population standard deviation. The t test is the test statistic for testing one mean; it is the difference between the sample mean and the hypothesized mean divided by the standard error.
Step 3: Select the level of significance for the statistical test. The level of significance, when chosen before the statistical test is performed, is called the alpha value, denoted by α (Greek letter alpha); it gives the probability of incorrectly rejecting the null hypothesis when it is actually true (and concluding there is a difference when there is not). This probability should be small, because we do not want to reject the null hypothesis when it is true. Traditional values used for α are 0.05, 0.01, and 0.001. We will use α = 0.05.
Step 4: Determine the value the test statistic must attain to be declared significant. This significant value is also called the critical value of the test statistic. Determining the critical value is simple (we already found it when we calculated a 95% confidence interval), but detailing the reasoning behind the process is instructive. Each test statistic has a distribution; the distribution of the test statistic is divided into an area of (hypothesis) acceptance and an area of (hypothesis) rejection. The critical value is the dividing line between the areas.
An illustration should help clarify the idea. The test statistic in our example follows the t distribution; α is 0.05; and a two-tailed test was specified. Thus, the area of acceptance is the central 95% of the t distribution, and the areas of rejection are the 2.5% areas in each tail (see Figure 5-3). From Table A–3, the value of t (with n – 1 or 94 – 1 = 93 df) that defines the central 95% area is between -1.99 and 1.99, as we found for the 95% confidence interval. Thus, the portion of the curve below -1.99 contains the lower 2.5% of the area of the t distribution with 93 df, and the portion above +1.99 contains the upper 2.5% of the area. The null hypothesis (that the mean energy intake of the group studied by Dennison and coworkers is equal to 1286 kcal as reported in the NHANES III study) will therefore be rejected if the observed value of the test statistic is less than -1.99 or if it is greater than +1.99.
In practice, however, almost everyone uses computers to do their statistical analyses. As a result, researchers do not usually look up the critical value before doing a statistical test. Although researchers need to decide beforehand the alpha level they will use to conclude significance, in practice they wait and see the more exact P value calculated by the computer program. We discuss the P value in the following sections.
Step 5: Perform the calculation. To summarize, the mean energy intake among the 94 two-year-old children studied by Dennison and coworkers was 1242 kcal with standard deviation 256 and standard error 26.4. We compare this value with the assumed population value of 1286 kcal. Substituting these values in the test statistic yields

t = (1242 – 1286) / 26.4 = -1.67
Step 6: Draw and state the conclusion. Stating the conclusion in words is important because, in our experience, people learning statistics sometimes focus on the mechanics of hypothesis testing but have difficulty applying the concepts. In our example, the observed value for t is -1.67. (Typically, the value of test statistics is reported to two decimal places.) Referring to Figure 5-3, we can see that -1.67 falls within the acceptance area of the distribution. The decision is therefore not to reject the null hypothesis that the mean energy intake in the 2-year-old children studied by Dennison and coworkers is the same as that reported in the NHANES III study. Another way to state the conclusion is that we do not reject the hypothesis that the sample of energy intake values could come from a population with mean energy intake of 1286 kcal. This means that, on average, the energy intake values observed in 2-year-olds by Dennison are not statistically significantly different from those in NHANES III. The probability of observing a mean energy intake of 1242 kcal in a random sample of 94 two-year-olds, if the true mean is actually 1286 kcal, is greater than 0.05, the alpha value chosen for the test.
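For readers who want to check the six steps by computer without the CD-ROM, here is a sketch in Python with SciPy (our assumption; the text uses SPSS and NCSS) that carries out the test from the summary statistics:

```python
# Sketch of the one-group t test from the summary statistics in the text:
# mean 1242 kcal, standard error 26.4, n = 94, hypothesized norm 1286 kcal.
from scipy import stats

mean, norm, se, n = 1242, 1286, 26.4, 94
t_obs = (mean - norm) / se                 # observed statistic, about -1.67
t_crit = stats.t.ppf(0.975, n - 1)         # two-tailed critical value, about 1.99
print(f"t = {t_obs:.2f}, critical value = ±{t_crit:.2f}")
print("reject H0" if abs(t_obs) > t_crit else "do not reject H0")
```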
Use the CD-ROM to confirm our calculations. Then use the t test with the data on 5-year-old children, and compare the mean to 1573 kcal in the NHANES III study.
Equivalence of Confidence Intervals and Hypothesis Tests
Now, let us examine the correspondence between hypothesis tests and confidence intervals. The results from the hypothesis test lead us to conclude that the mean energy intake in the 2-year-old children studied by Dennison and coworkers is not different from the mean in the NHANES III study (1286 kcal), using an α value of 0.05. Although we did not illustrate the calculations, the 95% confidence interval for mean energy intake is 1189–1295 kcal, meaning that we are 95% confident that this interval contains the true mean energy intake among 2-year-old children. Note that 1286, the value from the NHANES III study, is contained within the interval; therefore, we can conclude that the mean in the Dennison study could well be 1286 kcal, even though the observed mean intake was 1242 kcal. When we compare the two approaches, we see that the df, 94 – 1 = 93, are the same, the critical value of t is the same, ±1.99, and the conclusions are the same. The only difference is that the confidence interval gives us an idea of the range within which the mean could occur by chance. In other words, confidence intervals give more information than hypothesis tests yet are no more costly in terms of work; thus we get more for our money, so to speak.
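The equivalence can be verified with the same summary statistics: the test fails to reject exactly when the norm falls inside the confidence interval. A brief sketch (same SciPy assumption as above):

```python
# Sketch: the 95% confidence interval from the summary statistics contains
# the norm (1286 kcal), reproducing the "do not reject" decision. Small
# differences from the 1189-1295 quoted in the text reflect rounding.
from scipy import stats

mean, norm, se, n = 1242, 1286, 26.4, 94
t_coef = stats.t.ppf(0.975, n - 1)
lo, hi = mean - t_coef * se, mean + t_coef * se
print(f"95% CI: {lo:.1f} to {hi:.1f} kcal")
print("norm inside interval -> do not reject H0:", lo <= norm <= hi)
```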
There is every indication that more and more results will be presented using confidence intervals. For example, the British Medical Journal has established the policy of having its authors use confidence intervals instead of hypothesis tests if confidence intervals are appropriate to their study (Gardner and Altman, 1986; 1989). To provide practice with both approaches to statistical inference, we will use both hypothesis tests and confidence intervals throughout the remaining chapters.
Table 5-3. Correct decisions and errors in hypothesis testing.
Replicate our findings using the CD-ROM. Also find the 99% confidence interval. Is it narrower or wider? Why?
Errors in Hypothesis Tests
Two errors can be made in testing a hypothesis. In step 3, we tacitly referred to one of these errors—rejecting the null hypothesis when it is true—as a consideration when selecting the significance level α for the test. This error results in our concluding a difference when none exists. Another error is also possible: not rejecting the null hypothesis when it is actually false, or not accepting the alternative hypothesis when it is true. This error results in our concluding that no difference exists when one really does. Table 5-3 summarizes these errors. The situation marked by I, called a type I error (see the upper right box), is rejecting the null hypothesis when it is really true; α is the probability of making a type I error. In the study of children's mean energy intake, a type I error would be concluding that the mean energy intake in the sample studied by Dennison and coworkers is different from the mean in NHANES III (rejecting the null hypothesis) when, in fact, it is not.
A type II error occurs in the situation marked by II (see lower left box in Table 5-3); this error is failing to reject the null hypothesis when it is false (or not concluding a difference exists when it does). The probability of a type II error is denoted by β (Greek letter beta). In the energy intake example, a type II error would be concluding that the mean energy intake in the Dennison study is not different from that in NHANES III (not rejecting the null hypothesis) when the mean level of energy intake was, in fact, actually different from NHANES III.
The situations marked by the asterisk (*) are correct decisions. The upper left box in Table 5-3 correctly rejects the null hypothesis when a difference exists; this situation is also called the power of the test, a concept we will discuss in the next section. Finally, the lower right box is the situation in which we correctly retain the null hypothesis when there is no difference.
Power
Power is important in hypothesis testing. Power is the probability of rejecting the null hypothesis when it is indeed false or, equivalently, of concluding that the alternative hypothesis is true when it really is true. Power is the ability of a study to detect a true difference. Obviously, high power is a valuable attribute for a study, because all investigators want to detect a significant result if it is present. Power is calculated as 1 – β, that is, 1 minus the probability of a type II error, and is intimately related to the sample size used in the study. The importance of addressing the issue of the power of a study cannot be overemphasized—it is essential in designing a valid study. We discuss power in more detail in the following sections, where we illustrate programs for estimating sample sizes, and in subsequent chapters.
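Although the details come later, a rough power calculation for the energy intake example can be sketched with SciPy's noncentral t distribution (our illustration, not from the text; the scenario of a true 44-kcal difference with SD 256 and n = 94 is taken from the example's summary statistics):

```python
# Illustrative power sketch: probability of rejecting H0 at alpha = 0.05
# (two-tailed) if the true mean really differs from the norm by 44 kcal.
from math import sqrt
from scipy import stats

n, sd, delta, alpha = 94, 256, 44, 0.05
se = sd / sqrt(n)
nc = delta / se                                   # noncentrality parameter
t_crit = stats.t.ppf(1 - alpha / 2, n - 1)
power = stats.nct.sf(t_crit, n - 1, nc) + stats.nct.cdf(-t_crit, n - 1, nc)
print(f"power = {power:.2f}")                     # roughly 0.4 in this scenario
```

A power near 0.4 means a study of this size would miss a true 44-kcal difference more often than it would detect one, which is consistent with the nonsignificant result.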
P Values
Another vital concept related to significance and to the α level is the P value, commonly reported in medical journals. The P value is related to a hypothesis test (although sometimes P values are stated along with confidence intervals); it is the probability of obtaining a result as extreme as (or more extreme than) the one observed, if the null hypothesis is true. Some people like to think of the P value as the probability that the observed result is due to chance alone. The P value is calculated after the statistical test has been performed; if the P value is less than α, the null hypothesis is rejected.
Referring to the test using the Dennison data, the P value cannot be obtained precisely from Table A–3 because the degrees of freedom are 93; we need to interpolate in these situations. For 93 df and a two-tailed α = 0.10, the critical value is approximately 1.665, very close to the absolute value of -1.67 that we found. We could report the P value as P ≈ 0.10 or P > 0.05. It is easier and more precise to use a computer program to do this calculation, however. Using the program for the one-group t test, we found the reported two-tailed significance to be 0.101, consistent with our conclusion that P is slightly greater than 0.10.
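The calculation the computer performs can be sketched in a line of SciPy (the summary statistics are rounded, so the result may differ slightly from the 0.101 quoted above):

```python
# Sketch of the two-tailed P value for t = -1.67 with 93 df; rounding of
# the t statistic accounts for any small discrepancy from the reported 0.101.
from scipy import stats

t_obs, df = -1.67, 93
p_two = 2 * stats.t.sf(abs(t_obs), df)
print(f"two-tailed P = {p_two:.3f}")
```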
Some authors report that the P value is less than some traditional value such as 0.05 or 0.01; however, more authors now report the precise P value produced by computer programs. The practice of reporting values less than some traditional value was established prior to the availability of computers, when statistical tables such as those in Appendix A were the only source of probabilities. Reporting the actual P value communicates the significance of the findings more precisely. We prefer this practice; using the arbitrary traditional values may lead an investigator (or reader of a journal article) to conclude that a result is significant when P = 0.05 but is not significant when P = 0.06, a dubious distinction.
Analogies to Hypothesis Testing
Analogies often help us better understand new or complex topics. Certain features of diagnostic testing, such as sensitivity and specificity, provide a straightforward analogy to hypothesis testing. A type I error, incorrectly concluding significance when the result is not significant, is similar to a false-positive test that incorrectly indicates the presence of a disease when it is absent. Similarly, a type II error, incorrectly concluding no significance when the result is significant, is analogous to a false-negative test that incorrectly indicates the absence of disease when it is present. The power of a statistical test, the ability to detect significance when a result is significant, corresponds to the sensitivity of a diagnostic test: the test's ability to detect a disease that is present. We may say we want the statistical test to be sensitive to detecting significance when it should be detected. We illustrate diagnostic testing concepts in detail in Chapter 12.
Another analogy is to the U.S. legal system. Assuming that the null hypothesis is true until proven false is like assuming that a person is innocent until proven guilty. Just as it is the responsibility of the prosecution to present evidence that the accused person is guilty, the investigator must provide evidence that the null hypothesis is false. In the legal system, in order to avoid a type I error of convicting an innocent person, the prosecution must provide evidence to convince jurors “beyond a reasonable doubt” that the accused is guilty before the null hypothesis of innocence can be rejected. In research, the evidence for a false null hypothesis must be so strong that the probability of incorrectly rejecting the null hypothesis is very small, typically, but not always, less than 0.05.
The U.S. legal system opts to err in the direction of setting a guilty person free rather than unfairly convicting an innocent person. In scientific inquiry, the tradition is to prefer the error of missing a significant difference (arguing, perhaps, that others will come along and design a better study) to the error of incorrectly concluding significance when a result is not significant. These two errors are, of course, related to each other. If a society decides to reduce the number of guilty people who go free, it must increase the chances that innocent people will be convicted. Similarly, an investigator who wishes to decrease the probability of missing a significant difference by decreasing β necessarily increases the probability α of falsely concluding a difference. The way the legal system can simultaneously reduce both types of errors is by requiring more evidence for a decision. Likewise, the way to reduce both type I and type II errors simultaneously in scientific research is by increasing the sample size n. When that is not possible—because the study is exploratory, the problem studied is rare, or the costs are too high—the investigator must carefully evaluate the values for α and β and make a judicious decision.
RESEARCH QUESTIONS ABOUT A PROPORTION IN ONE GROUP
When a study uses nominal or binary (yes/no) data, the results are generally reported as proportions or percentages (see Chapter 3). In medicine we sometimes observe a single group and want to compare the proportion of subjects having a certain characteristic with some well-accepted standard or norm. For example, Frey and colleagues (2002) in Presenting Problem 2 wanted to examine the efficacy of different dilutions of smallpox vaccine. Findings from this study are given in Table 5-4.
Table 5-4. Rate of success of initial and repeated vaccination with vaccinia virus.
The binomial distribution introduced in Chapter 4 can be used to determine confidence limits or to test hypotheses about the observed proportion. Recall that the binomial distribution is appropriate when a specific number of independent trials is conducted (symbolized by n), each with the same probability of success (symbolized by the Greek letter π), and this probability can be interpreted as the proportion of people with or without the characteristic we are interested in. Applied to the data in this study, each vaccination is considered a trial, and the probability of having a successful vaccination in the group that received the 1:10 dilution was 330/340 = 0.971.
The binomial distribution has some interesting features, and we can take advantage of these. Figure 5-4 shows the binomial distribution when the population proportion π is 0.2 and 0.4 for sample sizes of 5, 10, and 25. We see that the distribution becomes more bell-shaped as the sample size increases and as the proportion approaches 0.5. This result should not be surprising because a proportion is actually a special case of a mean in which successes equal 1 and failures equal 0, and the central limit theorem states that the sampling distribution of means for large samples resembles the normal distribution. These observations lead naturally to the idea of using the standard normal, or z, distribution as an approximation to the binomial distribution.
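This bell-shaped tendency can be checked numerically. A small sketch (assuming NumPy and SciPy, which the text does not use) compares the binomial probabilities for n = 25 and π = 0.4 with the normal curve that has the same mean nπ and standard deviation √(nπ(1 – π)):

```python
# Sketch: the binomial distribution with n = 25 and pi = 0.4 is already
# close to its normal approximation (mean n*pi, SD sqrt(n*pi*(1 - pi))).
import numpy as np
from scipy import stats

n, pi = 25, 0.4
mean, sd = n * pi, np.sqrt(n * pi * (1 - pi))
for k in range(6, 15):
    b = stats.binom.pmf(k, n, pi)               # exact binomial probability
    z = stats.norm.pdf(k, loc=mean, scale=sd)   # normal approximation
    print(f"k = {k:2d}: binomial {b:.4f}, normal {z:.4f}")
```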
Figure 5-4. Probability distributions for the binomial when π = 0.2 and 0.4.
Confidence Intervals for a Proportion of Subjects
In the study by Frey and colleagues (2002), the proportion of people receiving the 1:10 dilution with a successful vaccination was 0.971. Of course, 0.971 is only an estimate of the unknown true proportion in the entire population who could be given this vaccine dilution. How much do you think the proportion of patients with a successful outcome would vary from one sample to another? We can use the sampling distribution for proportions from large samples to help answer this question. Recall from Chapter 4 that, in order to use a sampling distribution, we need to know the mean and standard error. For a proportion, the mean is simply the proportion itself (symbolized as π in the population and lowercase p in the sample), and the standard error is the square root of π(1 – π) divided by n in the population or p(1 – p) divided by n in the sample; that is, the standard error is

√(p(1 – p)/n)
Then the 95% confidence limits for the true population proportion π are given by

p ± 1.96 √(p(1 – p)/n)
Where did 1.96 come from? From Table A–2 in Appendix A, we find that 1.96 is the value that separates the central 95% of the area under the standard normal, or z, distribution from the 2.5% in each tail. The only requirement is that the product of the proportion and the sample size (pn) be greater than 5 [and that (1 – p)n be greater than 5 as well].
Using the preceding formula, the 95% confidence interval for the true proportion of patients with an initial successful vaccination using the 1:10 dilution is

0.971 ± 1.96 √((0.971)(0.029)/340) = 0.971 ± 0.018
or 0.953–0.988. The investigators may therefore be 95% confident that the interval 0.953–0.988 (or 95.3–98.8%) contains the true proportion of subjects having a successful vaccination with the 1:10 vaccine dilution. The results of the test for one proportion in NCSS are given in Table 5-5. Note that our results correspond with the “Approximation (uncorrected)” calculation method.
Compare the confidence interval in Table 5-4 to our results and the NCSS output in Table 5-5. Frey and colleagues used the exact binomial method, and there is every reason to prefer it when it is available in a statistical program, such as NCSS.
Just for practice, find the values of the z distribution used for the 90% and 99% confidence intervals and calculate the resulting confidence intervals themselves. What happens to the width of the confidence interval as the confidence level decreases? As it increases?
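Here is a worked sketch of that practice question (Python with SciPy assumed), using the normal approximation with p = 330/340:

```python
# Sketch: 90%, 95%, and 99% confidence intervals for the vaccination
# proportion 330/340; results match the text's 0.953-0.988 up to rounding.
from math import sqrt
from scipy import stats

p, n = 330 / 340, 340
se = sqrt(p * (1 - p) / n)
for level in (0.90, 0.95, 0.99):
    z = stats.norm.ppf((1 + level) / 2)     # 1.645, 1.960, 2.576
    print(f"{level:.0%} CI: {p - z*se:.3f} to {p + z*se:.3f}")
```

The interval narrows as the confidence level decreases and widens as it increases.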
The z Distribution to Test a Hypothesis About a Proportion
Recall that we can draw conclusions from samples using two methods: finding confidence intervals or testing hypotheses. We have already stated our preference for confidence intervals, and Frey and colleagues used this approach, but we also illustrate a statistical test. We assume the investigators want to know whether the 97.1% success rate with a 1:10 dilution was greater than 95%. We use the six-step procedure to test the hypothesis that the success rate with this dilution exceeds 95%.
Step 1: State the research question in terms of statistical hypotheses. We assume the investigators wanted to know whether the observed proportion of 0.971 was significantly greater than 0.95. The z distribution can be used to test the hypothesis for this research question. The Greek letter π stands for the hypothesized population proportion because the null hypothesis refers to the population:
H0: The proportion of subjects with a successful vaccination is 0.95 or less, or π ≤ 0.95.
H1: The proportion of subjects with a successful vaccination is more than 0.95, or π > 0.95.
In this example, we are interested in concluding that the diluted vaccine success rate is greater than 95%; therefore, a one-tailed test to detect only a positive difference is appropriate. A two-tailed test would be appropriate to test whether the success rate is either > or < 95%.
Step 2: Decide on the appropriate test statistic. The sample size (340) times the proportion (0.971) is 330, and the sample size times 1 minus the proportion (0.029) is 10. Because both are greater than 5, we can use the z test. The z test, just like the t test, takes the common form of the observed value of the statistic minus its hypothesized value divided by its standard error.
Step 3: Select the level of significance for the statistical test. For this one-tailed test, we use α = 0.05.
Step 4: Determine the value the test statistic must attain to be declared significant. A one-tailed test with the alternative hypothesis in the positive direction places the entire rejection area in the upper part of the probability distribution. The value of z that divides the normal distribution into the lower 95% and upper 5% is 1.645 (Table A–2). The null hypothesis that the true population proportion is less than or equal to 0.95 will be rejected if the observed value of z is greater than 1.645 (see Figure 5-5).
Table 5-5. Success of initial vaccination using a 1:10 dilution, 10^7.0 pfu/mL.
Step 5: Perform the calculations. The null hypothesis says the proportion is 0.95 or less, so this is the value we use in the standard error, not the observed proportion of 0.971. Substituting these values in the z test gives

z = (0.971 – 0.95) / √((0.95)(0.05)/340) = 0.021/0.012 = 1.75
Step 6: Draw and state the conclusion. Because the observed value of z (1.75) is greater than the critical value, 1.645, the decision is to reject the null hypothesis and conclude that the alternative hypothesis, that the proportion of patients with a successful vaccination is greater than 0.95, is correct (P < 0.05). Table 5-5 also contains the results of the test of hypothesis; note that NCSS provides the decision for a two-tailed test and one-tailed tests in both directions.
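The six steps can again be confirmed by computer. A sketch follows (SciPy assumed; the exact proportion 330/340 is used, so the z value agrees with the 1.75 above up to rounding):

```python
# Sketch of the one-tailed z test for a proportion: the null value 0.95
# is used in the standard error, exactly as in step 5.
from math import sqrt
from scipy import stats

p, pi0, n = 330 / 340, 0.95, 340
z = (p - pi0) / sqrt(pi0 * (1 - pi0) / n)
p_one = stats.norm.sf(z)                      # one-tailed P value
print(f"z = {z:.2f}, one-tailed P = {p_one:.3f}")
print("reject H0" if z > 1.645 else "do not reject H0")
```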
Figure 5-5. Defining areas of acceptance and rejection in the standard normal distribution (z) using α = 0.05. (Graph produced using the Visualizing Continuous Distributions module in Visual Statistics, a program published by McGraw-Hill Companies; used with permission.)
Continuity Correction
The z distribution is continuous and the binomial distribution is discrete, and many statisticians recommend making a small correction to the test statistic to create a more accurate approximation. One continuity correction involves subtracting 1/(2n) from the absolute value of the numerator of the z statistic. (Recall that the absolute value of a number is positive, regardless of whether the number is positive or negative.) The test statistic is

z = (|p – π| – 1/(2n)) / √(π(1 – π)/n)
Newcombe (1998) compared seven confidence interval procedures and found that the Wilson score method was more accurate than all the others. This statistic, given by NCSS, is

(p + z²/(2n) ± z √(p(1 – p)/n + z²/(4n²))) / (1 + z²/n)
The continuity correction has minimal effect with large sample sizes. Should a continuity correction be used? To be honest, even statisticians do not agree on the answer to this. In the past, we suggested not using the continuity correction, but studies such as the one by Newcombe have convinced us that it is appropriate.
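For completeness, both quantities just described can be computed directly from the formulas given above; this sketch needs nothing beyond the Python standard library:

```python
# Sketch: continuity-corrected z statistic and Wilson score 95% interval
# for the vaccination data (p = 330/340, null proportion 0.95, n = 340).
from math import sqrt

p, pi0, n, z = 330 / 340, 0.95, 340, 1.96

# Continuity correction: subtract 1/(2n) from the absolute numerator.
z_cc = (abs(p - pi0) - 1 / (2 * n)) / sqrt(pi0 * (1 - pi0) / n)
print(f"corrected z = {z_cc:.2f}")

# Wilson score 95% confidence interval.
center = (p + z**2 / (2 * n)) / (1 + z**2 / n)
half = (z / (1 + z**2 / n)) * sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
print(f"Wilson 95% CI: {center - half:.3f} to {center + half:.3f}")
```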
MEANS WHEN THE SAME GROUP IS MEASURED TWICE
Earlier in this chapter, we found confidence intervals for means and proportions. We also illustrated research questions that investigators ask when they want to compare one group of subjects to a known or assumed population mean or proportion. In actuality, these latter situations do not occur very often in medicine (or, indeed, in most other fields). We discussed the methods because they are relatively simple statistics, and, more importantly, minor modifications of these tests can be used for research questions that do occur with great frequency.
In the next two sections, we concentrate on studies in which the same group of subjects is observed twice using paired designs or repeated-measures designs. Typically in these studies, subjects are measured to establish a baseline (sometimes called the before measurement); then, after some intervention or at a later time, the same subjects are measured again (called the after measurement). The research question asks whether the intervention makes a difference—whether there is a change. In this design, each subject serves as his or her own control. The observations in these studies are called paired observations because the before-and-after measurements made on the same people (or on matched pairs) are paired in the analysis. We sometimes call these dependent observations as well, because, if we know the first measurement, we have some idea of what the second measurement will be (as we will see later in this chapter).
Sometimes the second measurement is made shortly after the first. In other studies a considerable time passes before the second measurement. In the study by Sauter and colleagues (2002), 4 weeks elapsed between the baseline at the time of surgery and first follow-up measurements to give sufficient time for bowel habits to begin to change. A second follow-up was done 12 weeks after surgery.
Why Researchers Use Repeated-Measures Studies
Suppose a researcher wants to evaluate the effect of a new diet on weight loss. Furthermore, suppose the population consists of only six people who have used the diet for 2 months; their weights before and after the diet are given in Table 5-6. To estimate the amount of weight loss, the researcher selects a random sample of three patients (patients 2, 3, and 6) to determine their mean weight before the diet and finds a mean weight of (89 + 83 + 95)/3 = 89 kg. Two months later the researcher selects an independent random sample of three patients (patients 1, 4, and 5) to determine their mean weight after the diet and finds a mean weight of (95 + 93 + 103)/3 = 97 kg. (Of course, we contrived the makeup of these samples for a reason, but they really could occur by chance.) The researcher would conclude that the patients gained an average of 8 kg on the diet. What is the problem here?
The means for the two independent samples indicate that the patients gained weight, while, in fact, they each lost 5 kg on the diet. We know the conclusion based on these samples is incorrect because we can examine the entire population and determine the actual differences; however, in real life we can rarely observe the population. The problem is that the characteristic being studied (weight) is quite variable from one patient to another; in this small population of six patients, weight varied from 83 to 108 kg before the diet program began. Furthermore, the amount of change, 5 kg, is relatively small compared with the variability among patients and is obscured by this variability. The researcher needs a way to control for variability among patients.
Table 5-6. Illustration of observations in a paired design (before and after measurements).

Patient    Weight Before (kg)    Weight After (kg)
1                 100                   95
2                  89                   84
3                  83                   78
4                  98                   93
5                 108                  103
6                  95                   90
The solution, as you may have guessed, is to select a single random sample of patients and measure their weights both before and after the diet. Because the measurements are taken on the same patients, a better estimate of the true change is more likely. The goal of paired designs is to control for extraneous factors that might influence the result; then, any differences caused by the intervention will not be masked by the differences among the subjects themselves.
The paired design allows researchers to detect change more easily by controlling for extraneous variation among the observations. Many biologic measurements exhibit wide variation among individuals, and the use of the paired design is thus especially appropriate in the health field.
The statistical test that researchers use when the same subjects are measured on a numerical (interval) variable before and after an intervention is called the paired t test, because the observations on the same subject are paired with one another to find the difference. This test is also called the matched groups t test and the dependent groups t test.
The good news is that paired, or before-and-after, designs are easy to analyze. Instead of having to find the mean and standard deviation of both the before and the after measurements, we need find only the mean and standard deviation of the differences between the before-and-after measurements. Then, the t distribution we used for a single mean (described in the sections titled, “What to Do When Observations Are Not Normally Distributed” and “Hypothesis Testing”) can be used to analyze the differences themselves.
To illustrate, examine the mean weights of the six subjects in the weight-loss example. Before the diet, the mean weight was 95.5 kg; after the diet, the mean was 90.5 kg. The difference between the means, 95.5 – 90.5 = 5 kg, is exactly the same as the mean weight loss, 5 kg, for each subject. The standard deviation of the differences, however, is not equal to the difference between the standard deviations in the before-and-after measurements. The differences between each before-and-after measurement must be analyzed to obtain the standard deviation of the differences. Actually, the standard deviation of the differences is frequently smaller than the standard deviation in the before measurements and in the after measurements. This is because the two sets of measurements are generally correlated, meaning that the lower values in the before measurements are associated with the lower values in the after measurements and similarly for the higher values. In this illustration, the standard deviations of the weights both before and after the diet program are 8.74, whereas the standard deviation of the differences is 0. Why is this the case? Because we made the before-and-after measurements perfectly correlated. Of course, as with the t distribution used with one mean, we must assume that the differences are normally distributed.
Confidence Intervals for the Mean Difference in Paired Designs
One way to evaluate the effect of an intervention in a before-and-after study is to form a confidence interval (CI) for the mean difference. To illustrate, we use data from the Sauter and colleagues (2002) study investigating the frequency of postcholecystectomy diarrhea (PCD) and changes in bowel habits following cholecystectomy. Some descriptive information on the patients and serum values is given in Table 5-7.
The value of 7α-HCO is a numerical variable, so means and standard deviations are appropriate. We see that the mean 7α-HCO in the 51 patients was 25.33 ng/mL at baseline (SD 13.51). After 1 month, the mean increased to 46.55 (SD 29.58). We want to know whether this increase could happen by chance. To examine the mean difference in a paired study, we need the raw data so we can find the mean and standard deviation of the differences between the before-and-after scores. The before, after, and difference scores are given in Table 5-8.
For patients in this study, the mean of the 51 differences is 21.22 (indicating that, on average, 7α-HCO increased 1 month after cholecystectomy), and the standard deviation of the differences is 26.68. The calculations for the mean and the standard deviation of the differences use the same formulas as in Chapter 3, except that we replace the symbol X (used for an observation on a single subject) with the symbol d to stand for the difference in the measurements for a single subject. Then, the mean difference is the sum of the differences divided by the number of subjects, or d̄ = Σd/n. Using the differences d instead of X and the mean difference d̄ instead of X̅, the standard deviation is:

SD_d = √[Σ(d − d̄)² / (n − 1)]
We suggest you confirm these calculations using the Sauter.xls data set on the CD-ROM. You can compute the differences, the mean, and the standard deviation of the differences.
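If you prefer to verify the arithmetic by computer, here is a short Python sketch of the paired-difference computation, using the six-patient weight-loss illustration from Table 5-6 rather than the Sauter data. Because every patient lost exactly 5 kg, the standard deviation of the differences is zero even though the before and after weights themselves vary considerably.

```python
from statistics import mean, stdev

before = [100, 89, 83, 98, 108, 95]          # weights (kg) before the diet
after = [b - 5 for b in before]              # each patient lost exactly 5 kg

d = [b - a for b, a in zip(before, after)]   # paired differences
print(mean(d))                               # 5.0: the mean difference
print(stdev(d))                              # 0.0: perfectly correlated pairs
print(stdev(before), stdev(after))           # both about 8.74
```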
Table 5-7. Descriptive information on patients and serum concentrations before cholecystectomy and at 1 month and 3 months after cholecystectomy.
Table 5-8. Difference between baseline and 1 month measures of 7α-HCO.
Just as when we have one group of observations and find the standard error of the mean by dividing the standard deviation SD by the square root of the sample size, we find the standard error of the mean differences by dividing the standard deviation of the differences SD_d by the square root of n:

SE_d̄ = SD_d/√n
Finding a 95% confidence interval for the mean difference is just like finding a 95% confidence interval for the mean of one group, except that we use the mean difference and the standard error of the mean differences instead of the mean and the standard error of the mean. We also use the t distribution with n – 1 degrees of freedom to evaluate research hypotheses about mean differences, just as we did about the mean in one group. To illustrate, to calculate a 95% confidence interval for the mean difference in 7α-HCO, we use the value of t for n – 1 = 51 – 1 = 50 df, which is 2.01 from Appendix A-3. Using these values in the formula for a 95% confidence interval for the true population difference gives

21.22 ± 2.01 × (26.68/√51) = 21.22 ± 7.51, or 13.71 to 28.73
This confidence interval can be interpreted as follows. We are 95% sure that the true mean difference in 7α-HCO at 1 month versus baseline is between 13.71 and 28.73. Logically, because the entire interval is greater than zero, we can be 95% sure that the mean difference is greater than zero. In plain words, it appears that the 7α-HCO increased in the month following cholecystectomy.
Can you tell from the calculations for the 95% CI in this example how small a difference could be observed and still be statistically significant? Because we subtracted 7.51 from the observed mean difference to calculate the lower limit of the confidence interval, if the mean difference is less than 7.51, the 95% confidence interval will include zero (7.50 – 7.51 = – 0.01). In this situation we would conclude no change in 7α-HCO.
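The same interval can be computed from the summary statistics alone. The following Python sketch (an illustration, not the SPSS or NCSS output) reproduces the 95% confidence interval for the mean change in 7α-HCO; it assumes scipy is available for the t distribution, and it carries full precision, so the limits differ from the hand calculation in the second decimal place.

```python
from math import sqrt
from scipy import stats

n, d_bar, sd_d = 51, 21.22, 26.68            # summary statistics from the text
se = sd_d / sqrt(n)                          # standard error of the mean difference
t_crit = stats.t.ppf(0.975, df=n - 1)        # about 2.01 for 50 df
print(d_bar - t_crit * se, d_bar + t_crit * se)   # roughly 13.7 to 28.7
```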
The Paired t Test for the Mean Difference
We can use the t distribution for both confidence intervals and hypothesis tests about mean differences. Again, we use data from Presenting Problem 3, in which researchers examined changes in serum lipoproteins and bowel habits after cholecystectomy (Sauter et al, 2002).
Step 1: State the research question in terms of statistical hypotheses. The statistical hypothesis for a paired design is usually stated as follows, where the Greek letter delta (δ) stands for the difference in the population:
H0: The true difference in 7α-HCO is zero, or, in symbols, δ = 0.
H1: The true difference in the 7α-HCO is not zero, or, in symbols, δ ≠ 0.
We are interested in rejecting the null hypothesis of no difference in two situations: when 7α-HCO significantly increases, and when it significantly decreases; it is a two-sided test.
Step 2: Decide on the appropriate test statistic. When the purpose is to see if a difference exists between before and after measurements in a paired design, and the observations are measured on a numerical (either interval or ratio) scale, the test statistic is the t statistic, assuming the differences are normally distributed. We almost always want to know if a change occurs, or, in other words, if the difference is zero. If we wanted to test the hypothesis that the mean difference is equal to some value other than zero, we would need to subtract that value (instead of zero) from the mean difference in the numerator of the following formula:

t = (d̄ − 0) / SE_d̄

with n – 1 df, where d̄ stands for the mean difference and SE_d̄ = SD_d/√n for the standard error of the mean differences, as explained earlier.
Step 3: Select the level of significance for the statistical test. Let us use α = 0.01.
Step 4: Determine the value the test statistic must attain to be declared significant. The value of t that divides the distribution into the central 99%, leaving 0.5% of the area in each tail, is 2.682 by interpolation for n – 1 = 50 df. We therefore reject the null hypothesis of no change in 7α-HCO if the value of the t statistic is less than –2.682 or greater than +2.682.
Step 5: Perform the calculations. Substituting our numbers (mean difference of 21.22, hypothesized difference of 0, standard deviation of 26.68, and a sample size of 51), the observed value of the t statistic is

t = (21.22 − 0) / (26.68/√51) = 21.22/3.74 = 5.68
Step 6: Draw and state the conclusion. Because the observed value of the t statistic is 5.68, larger than the critical value 2.682, we reject the null hypothesis that mean 7α-HCO is the same before cholecystectomy and 1 month later (P < 0.01).
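For readers who want to reproduce the test, here is a brief Python sketch of the paired t test from the summary statistics quoted above; with the raw before-and-after columns from the Sauter.xls data set you could instead call scipy.stats.ttest_rel on the two columns directly.

```python
from math import sqrt
from scipy import stats

n, d_bar, sd_d = 51, 21.22, 26.68
t = (d_bar - 0) / (sd_d / sqrt(n))           # hypothesized mean difference of 0
p = 2 * stats.t.sf(abs(t), df=n - 1)         # two-sided P value
print(t, p)                                  # t is about 5.68; P < 0.001
```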
PROPORTIONS WHEN THE SAME GROUP IS MEASURED TWICE
Researchers might want to ask two types of questions when a measurement has been repeated on the same group of subjects. Sometimes they are interested in knowing how much the first and second measurements agree with each other; other times they want to know only whether a change has occurred following an intervention or the passage of time. We discuss the first situation in detail in this section and then cover the second situation briefly.
Measuring Agreement Between Two People or Methods
Frequently in the health field, a practitioner must interpret a procedure as indicating the presence or the absence of a disease or abnormality; that is, the observation is a yes-or-no outcome, a nominal measure. A common strategy to show that measurements are reliable is to repeat the measurements and see how much they agree with each other. When one person observes the same subject or specimen twice and the observations are compared, the degree of agreement is called intrarater reliability (intra- meaning within). When two or more people observe the same subject or specimen, their agreement is called interrater reliability (inter- meaning between). A common way to measure interrater reliability when the measurements are nominal is to use the kappa (κ) statistic. If the measurements are on a numerical scale, the correlation between the measurements is found. We discussed the correlation coefficient in Chapter 3 and will return to it in Chapter 8.
In other similar situations, two different procedures are used to measure the same characteristic. If one can be assumed to be the "gold standard," then sensitivity and specificity, discussed in Chapter 12, are appropriate. When neither is the gold standard, the kappa statistic is used to measure the agreement between the two procedures. In Presenting Problem 4, Yuan and colleagues (2001) interpreted MRIs of 90 carotid artery locations and compared their findings with histopathologic examination. The kappa statistic can be used to estimate the level of agreement between the MRI findings and histology. Information from Table 2 in Yuan has been reproduced in Table 5-9 and rearranged in Table 5-10 to make the analysis simpler. The MRI and histologic examinations agreed that 56 of the 90 specimens were positive and 22 were negative.
We can describe the degree of agreement between the two procedures as follows. The total observed agreement [(56 + 22)/90 = 87%] is an overestimate, because it ignores the fact that, with only two categories (positive and negative), they would agree by chance part of the time. We need to adjust the observed agreement and see how much they agree beyond the level of chance.
To find the percentage of specimens on which they would agree by chance, we use a straightforward application of one of the probability rules from Chapter 4. Because the histology slides were evaluated by a pathologist who was blinded to the MRI findings, the two measurements are independent, and we can use the multiplication rule for two independent events to see how likely it is that they agree merely by chance.
The statistic most often used to measure agreement between two observers on a binary variable is kappa (κ), defined as the agreement beyond chance divided by the amount of possible agreement beyond chance. Reviewing the data in Table 5-10, MRI indicated that 58, or 64%, were positive, and the histology classification indicated that 66, or 73%, were positive. Using the multiplication rule, by chance the two procedures would agree that 64% × 73%, or 47%, of the specimens were positive. By chance alone the procedures would also agree that 36% × 27%, or another 10%, were negative. The two procedures would therefore agree by chance on 47% + 10%, or 57%, of the images. In actuality, the procedures agreed on (56 + 22)/90, or 87%, of the 90 specimens, so the level of agreement beyond chance was 0.87 – 0.57, or 0.30, the numerator of κ.
The potential agreement beyond chance is 100% minus the chance agreement of 57%, or, using proportions, 1 – 0.57 = 0.43. Kappa in this example is therefore 0.30/0.43 = 0.70. The formula for kappa and our calculations are as follows:

κ = (observed agreement − chance agreement) / (1 − chance agreement) = (0.87 − 0.57) / (1 − 0.57) = 0.30/0.43 = 0.70
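Because the kappa calculation is only a few arithmetic steps, it is easy to verify by computer. The following Python sketch repeats the calculation from the counts in Table 5-10; note that it carries full precision throughout, so it returns approximately 0.69 rather than the 0.70 obtained above with the rounded intermediate values.

```python
n = 90
observed = (56 + 22) / n                         # overall observed agreement
p_mri, p_hist = 58 / n, 66 / n                   # marginal proportions positive
chance = p_mri * p_hist + (1 - p_mri) * (1 - p_hist)
kappa = (observed - chance) / (1 - chance)
print(round(kappa, 2))                           # about 0.69
```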
Sackett and associates (1991) point out that the level of agreement varies considerably depending on the clinical task, ranging from 57% agreement with a κ of 0.30 for two cardiologists examining the same electrocardiograms from different patients, to 97% agreement with a κ of 0.67 for two radiologists examining the same set of mammograms. Byrt (1996) proposed the following guidelines for interpreting κ:

0.93–1.00   Excellent agreement
0.81–0.92   Very good agreement
0.61–0.80   Good agreement
0.41–0.60   Fair agreement
0.21–0.40   Slight agreement
0.01–0.20   Poor agreement
≤ 0.00      No agreement
Table 5-9. Test performance of multispectral MRI for identifying regions of lipid-rich necrotic core and acute intraplaque hemorrhage.
Based on these guidelines, the agreement between MRI and histology was quite respectable. When κ is zero, agreement is only at the level expected by chance. When κ is negative, the observed level of agreement is less than we would expect by chance alone.
The 2 × 2 table in Table 5-10 is reproduced using SPSS in Table 5-11, along with the calculation of κ. Most of the time, we are interested in the kappa statistic as a descriptive measure and not whether it is statistically significant. The NCSS statistical program reports the value of t for the kappa statistic; we can use Table A–3 if we want to know the probability. Alternatively, we can use the probability calculator in NCSS (in Other in the Analysis pull-down window).
Proportions in Studies with Repeated Measurements and the McNemar Test
In studies in which the outcome is a binary (yes/no) variable, researchers may want to know whether the proportion of subjects with (or without) the characteristic of interest changes after an intervention or the passage of time. In these types of studies, we need a statistical test that is similar to the paired t test and appropriate with nominal data. The McNemar test can be used for comparing paired proportions.
Table 5-10. Observed agreement between MRI and histology findings.

                    Histology Positive    Histology Negative    Total
MRI Positive               56                     2               58
MRI Negative               10                    22               32
Total                      66                    24               90
Table 5-11. Comparing results of MRI and histology findings.
The researchers in Presenting Problem 3 (Sauter et al, 2002) wanted to know whether changes occurred in the bowel function of patients following cholecystectomy. They collected information on the number of patients who had one or fewer versus more than one stool per day. The results are displayed in a 2 × 2 table in Table 5-12. Before surgery, 11, or 21.6%, of the patients had more than one stool per day, but 1 month following surgery, the number increased to 26, or 51.0%.
The null hypothesis is that the proportions of patients with more than one stool per day are the same at the two different time periods. The alternative hypothesis is that the paired proportions are not equal. The McNemar test for paired proportions is very easy to calculate; it uses only the numbers in the cells where the before-and-after scores change, that is, the upper right and lower left cells. For the numerator we find the absolute value of the difference between the top right and the bottom left cells in the 2 × 2 table and square the number: |15 − 0|² = 15² = 225. For the denominator we take the sum of the top right and bottom left cells: 15 + 0 = 15. Dividing gives 225/15 = 15; in symbols the equation is

McNemar statistic = (|top right − bottom left|)² / (top right + bottom left) = (|15 − 0|)² / (15 + 0) = 15
If we want to use α = 0.05, we compare the value of the McNemar test to the critical value of 3.84 to decide if we can reject the null hypothesis that the paired proportions are equal. (We explain more about how we determined this value when we discuss chi-square in the next chapter.) Because 15 is larger than 3.84, we can reject the null hypothesis and conclude that there is a difference, increase in this situation, in the proportion of patients having more than one stool per day before and after cholecystectomy.
Table 5-12. Analysis using the McNemar statistic for the number of patients having more than one stool per day before and 1 month after cholecystectomy.

                          After: ≤ 1 stool/day    After: > 1 stool/day    Total
Before: ≤ 1 stool/day              25                      15               40
Before: > 1 stool/day               0                      11               11
Total                              25                      26               51
Results from the NCSS program for the McNemar test are also given in Table 5-12. As with the z statistic, it is possible to use a continuity correction with the McNemar test. The correction involves subtracting 1 from the absolute value in the numerator before squaring it.
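A Python sketch of the McNemar calculation, with and without the continuity correction, follows; the function is our illustration, written in terms of the two discordant cells only, and the 15 and 0 are the counts from Table 5-12.

```python
def mcnemar(b, c, correct=False):
    """McNemar statistic from the two discordant cells of a paired 2 x 2 table."""
    num = abs(b - c) - (1 if correct else 0)
    return num ** 2 / (b + c)

print(mcnemar(15, 0))                 # 15.0, well above the critical value 3.84
print(mcnemar(15, 0, correct=True))   # about 13.07 with the continuity correction
```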
WHAT TO DO WHEN OBSERVATIONS ARE NOT NORMALLY DISTRIBUTED
If observations are quite skewed, the t distribution should not be used, because the values that are supposed to separate the central 95% of the distribution from the upper and lower 2.5% tails do not really do so. In this situation, we can transform or rescale the observations, or we can use nonparametric methods.
Transforming or Rescaling Observations
Transforming observations expresses their values on another scale. To take a simple example, if weight is measured in kilograms, we can multiply by 2.2 to get weight in pounds. The main reason for knowing about transformations is that they sometimes make it possible to use statistical tests that otherwise would be inappropriate. You already know about several transformations. For example, the standard normal, or z, distribution introduced in Chapter 4 is obtained by subtracting the mean from each observation and then dividing by the standard deviation. The z transformation is a linear transformation; it rescales a distribution with a given mean and standard deviation to a distribution in which the mean is 0 and the standard deviation is 1. The basic bell shape of the distribution itself is not changed by this transformation.
Nonlinear transformations change the shape of the distribution. We also talked about rank ordering observations when we discussed ordinal scales in Chapter 3. This transformation ranks observations from lowest to highest (or vice versa). The rank transformation can be very useful in analyzing observations that are skewed, and many of the nonparametric methods we discuss in this book use ranks as their basis.
Other nonlinear transformations can be used to straighten out the relationship between two variables by changing the shape of the skewed distribution to one that more closely resembles the normal distribution. Consider the survival time of patients who are diagnosed with cancer of the prostate. A graph of possible values of survival time (in years) for a group of patients with prostate cancer metastatic to the bone is given in Figure 5-6A. The distribution has a substantial positive skew, so methods that assume a normal distribution would not be appropriate. Figure 5-6B illustrates the distribution if the logarithmᵇ of survival time is used instead, that is, Y = log(X), where Y is the transformed value (or exponent) related to a given value of X. This is the log to base 10.
Another log transformation uses the transcendental number e as the base and is called the natural log, abbreviated ln. Log transformations are frequently used with laboratory values that have a skewed distribution. Crook and colleagues (1997), a Presenting Problem in Chapter 9, studied prostate-specific antigen (PSA) and found it to have a skewed distribution. They used a log transformation to make PSA more normally distributed. Another transformation is the square root transformation,

Y = √X
Although this transformation is not used as frequently in medicine as the log transformation, it can be very useful when a log transformation overcorrects. In a study of women who were administered a paracervical block to diminish pain and cramping with cryosurgery (Harper, 1997, Presenting Problem in Chapter 6), one of the variables used to measure pain was very skewed. The authors used a square root transformation, which improved the results. We calculated the natural log and the square root of the pain score. A histogram of each is given in Figure 5-7. You can see that neither transformation is very close to a normal distribution. In this situation, the investigators might well choose a nonparametric procedure that does not make any assumptions about the shape of the distribution.
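The transformations discussed in this section are one-liners in most statistical languages. The following Python sketch applies the z, rank, log, and square root transformations to a small, arbitrary set of positively skewed values (hypothetical data, not from any of the studies above) and compares the skewness before and after.

```python
import numpy as np
from scipy import stats

x = np.array([1.2, 1.5, 2.0, 2.4, 3.1, 4.8, 9.7, 22.5])   # positively skewed values

z = (x - x.mean()) / x.std(ddof=1)    # linear: the shape is unchanged
ranks = stats.rankdata(x)             # the basis of many nonparametric methods
log_x = np.log10(x)                   # log to base 10
sqrt_x = np.sqrt(x)                   # milder correction than the log

print(stats.skew(x), stats.skew(log_x), stats.skew(sqrt_x))
```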
Figure 5-6. Example of logarithm transformation for survival of patients with cancer of the prostate metastatic to bone.
The Sign Test for Hypotheses About the Median in One Group
An alternative to transforming data is to use statistical procedures called nonparametric, or distribution-free, methods. Nonparametric methods are based on weaker assumptions than the z and t tests, and they do not require the observations to follow any particular distribution.
Figure 5-8 is a histogram of energy consumption in 2-year-old children (Dennison et al, 1997). Can we assume the observations are normally distributed? How would you describe the distribution? It is somewhat positively skewed, or skewed to the right. Should we have used the t test to compare the mean energy intake with that reported in the NHANES III study? Let us see the conclusion if we use a method that does not require assuming a normal distribution.
The sign test is a nonparametric test that can be used with a single group using the median rather than the mean. For example, we can ask: Did children in the study by Dennison and colleagues have the same median level of energy intake as the 1286 kcal reported in the NHANES III study? (Because we do not know the median in the NHANES data, we assume for this illustration that the mean and median values are the same.)
The logic behind the sign test is as follows: If the median energy intake in the population of 2-year-old children is 1286, the probability is 0.50 that any observation is less than 1286. (The probability is also 0.50 that any observation is greater than 1286.) We count the number of observations less than 1286 and can use the binomial distribution (Chapter 4) with π = 0.50. Table 5-13 contains the data on the energy level in 2-year-olds ranked from lowest to highest. Fifty-seven 2-year-olds have energy levels lower than 1286 and 37 have higher energy levels. The probability of observing X = 57 out of n = 94 values less than 1286 using the binomial distribution is

P(X = 57) = [94! / (57! 37!)] (0.5)⁵⁷ (0.5)³⁷
Rather than trying to calculate this probability, we use this example as an opportunity to use the z approximation to the binomial distribution to illustrate the sign test. We use the same level of α and use a two-tailed test so we can directly compare the results to the t test in the section titled, “Steps in Hypothesis Testing.”
Step 1: The null and alternative hypotheses are H0: The population median energy intake level in 2-year-old children is 1286 kcal, or MD = 1286.
Figure 5-7. Original observations and two transformations of the pain score. (Data, used with permission, from Harper D: Paracervical block diminishes cramping associated with cryosurgery. J Fam Pract 1997;44:75–79. Analysis produced with SPSS; used with permission.)
H1: The population median energy intake level in 2-year-old children is not 1286 kcal, or MD ≠ 1286.
Step 2: Assuming energy intake is not normally distributed, the appropriate test is the sign test; because the sample size is large, we can use the z distribution. In the sign test we deal with frequencies instead of proportions, so the z test is rewritten in terms of frequencies:

z = (|X − nπ| − ½) / √[nπ(1 − π)]

where X is the number of children with energy levels less than 1286 (57 in our example); we could equally use the number with energy levels greater than 1286—it does not matter. The total number of children n is 94, and the probability π is 0.5, to reflect the 50% chance that any observation is less than (or greater than) the median. Note that ½ is subtracted from the absolute value in the numerator; this is the continuity correction for frequencies.
Figure 5-8. A histogram with a normal curve of energy consumption among 2-year-old children. (Data, used with permission, from Dennison BA, Rockwell HL, Baker SL: Excess fruit juice consumption by preschool-aged children is associated with short stature and obesity. Pediatrics 1997;99:15–22. Analysis produced with SPSS; used with permission.)
Step 3: We use α = 0.05 so we can compare the results with those found with the t test.
Step 4: The critical value of the z distribution for α = 0.05 is ±1.96. So, if the z test statistic is less than –1.96 or greater than +1.96, we will reject the null hypothesis of no difference in median levels of energy intake.
Step 5: The calculations are

z = (|57 − 94(0.5)| − ½) / √[94(0.5)(0.5)] = (10 − 0.5)/4.85 = 9.5/4.85 = 1.96
Step 6: The value of the sign test is 1.96, right on the line with +1.96. It is traditional not to reject the null hypothesis unless the value of the test statistic exceeds the critical value. Two points are interesting. First, the value of the test statistic using the t test was –1.37; the sign test statistic is positive because we use the absolute value in the numerator. Second, we drew the same conclusion with both tests. If we had not used the continuity correction, however, the value of z would be 10/4.85, or 2.06, and we would have rejected the null hypothesis. Here the continuity correction makes a difference in the conclusion.
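The sign test calculation, and the exact binomial probability it approximates, can be verified with a few lines of Python. This sketch uses the counts from the example (57 of 94 children below the norm) and scipy for the exact binomial tail probability.

```python
from math import sqrt
from scipy import stats

n, x, pi = 94, 57, 0.5                # 57 of 94 children below the norm of 1286 kcal

z = (abs(x - n * pi) - 0.5) / sqrt(n * pi * (1 - pi))
print(z)                              # about 1.96, right at the critical value

p_exact = 2 * stats.binom.sf(x - 1, n, pi)   # exact two-sided sign test P value
print(p_exact)                               # close to 0.05
```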
Use the CD-ROM to rank the energy intake of the 5-year-old children and then compare the median with 1573 in the NHANES III study. Are the results the same as you obtained with the t test in the section titled, “Steps in Hypothesis Testing”? A useful discussion of nonparametric methods is given in the comprehensive text by Hollander and Wolfe (1998).
MEAN DIFFERENCES WHEN OBSERVATIONS ARE NOT NORMALLY DISTRIBUTED
Using the t test requires that we assume the differences are normally distributed, and this is especially important with small sample sizes (n < 30). If the distribution of the differences is skewed, several other methods are more appropriate. First, one of the transformations we discussed earlier can be used. More often, however, researchers in the health sciences use a nonparametric statistical test that does not require the normal distribution. For paired designs, we can use the sign test that we used with a single group, applying it to the differences. Alternatively, we can use a nonparametric procedure called the Wilcoxon signed rank test. (The analogous procedure for two independent groups, the Wilcoxon rank sum test, is equivalent to the Mann–Whitney U test.) In fact, there is no real disadvantage in using the Wilcoxon signed rank test in any situation, even when observations are normally distributed: the Wilcoxon test is almost as powerful (correctly rejecting the null hypothesis when it is false) as the t test. For paired comparisons, we recommend the Wilcoxon test over the sign test because it is more powerful. In the past, the Wilcoxon signed rank test required either exhaustive calculations or extensive statistical tables; today, nonparametric tests are easy to do with computerized routines.
Using the study by Sauter and colleagues (2002), we can compare the conclusion using the Wilcoxon signed rank test with that found with the paired t test. Figure 5-9 is a box plot of the change in 7α-HCO from before surgery to 1 month; there is some evidence of positive skewness, so using a nonparametric procedure is justified. The results of the Wilcoxon test using the SPSS nonparametric procedure are shown in Table 5-14. SPSS gives the number of patients with an increase in 7α-HCO after 1 month, the number with a decrease, and the number with no change (0 in this example). Note that the significance level is given as 0.000.
Does this mean there is zero probability that the result occurred by chance? Not really; computer programs simply report P values to only a small number of decimal places. It is customary to report P < 0.001 rather than 0.000. Use the CD-ROM to calculate the Wilcoxon test for the difference in 7α-HCO from baseline to 3 months.
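As an illustration of how such a test is run in practice, here is a minimal Python sketch using scipy's implementation of the Wilcoxon signed rank test. The two arrays are hypothetical stand-ins for the baseline and 1-month 7α-HCO columns, not the actual Sauter data.

```python
from scipy import stats

baseline = [12.1, 30.5, 18.2, 25.0, 40.3, 15.8, 22.4]   # hypothetical values
month1 = [35.6, 52.1, 20.9, 60.2, 58.7, 30.1, 44.5]     # hypothetical values

stat, p = stats.wilcoxon(baseline, month1)   # tests that the median change is 0
print(stat, p)
```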
Table 5-13. Rank ordering of 2-year-old children according to energy consumption.
FINDING THE APPROPRIATE SAMPLE SIZE FOR RESEARCH
Researchers must learn how large a sample is needed before beginning their research; otherwise, the study may lack the power to detect a difference that actually exists. Earlier in this chapter, we talked about type I (α) errors and type II (β) errors, and we defined power (1 – β) as the probability of finding significance when there really is a difference. Low power can occur because the difference is small or because the sample size is too small.
Readers of research reports also need to know what sample size was needed in a study. This is especially true when the results of a study are not statistically significant (a negative study), because the results might have been significant had the sample size been larger. Increasingly, we see sample size information in the methods section of published articles. Institutional review boards (IRBs) examine proposals before giving approval for research involving human and animal subjects and require sample size estimates before approving a study. Granting agencies require this information as well.
Figure 5-9. Box-and-whisker plot (boxplot) of change in 7α-HCO after cholecystectomy. (Data, used with permission, from Sauter GH, Moussavian AC, Meyer G, Steitz HO, Parhofer KG, Jungst D: Bowel habits and bile acid malabsorption in the months after cholecystectomy. Am J Gastroenterol 2002;97(2):1732–1735. Figure produced with SPSS; used with permission.)
A variety of formulas can determine what size sample is needed, and several computer programs can estimate sample sizes for a wide range of study designs and statistical methods. A somewhat advanced discussion of the logic of sample size estimation in clinical research was reported by Lerman (1996).
Many people prefer to use a computer program to calculate sample sizes. The manuals that come with these programs are very helpful. We present typical computer output from some of these programs in this section and in following chapters.
We also give formulas that protect against both type I and type II errors for two common situations: a study that involves one mean or one proportion, and a study that measures a group twice and compares the difference before and after an intervention.
Finding the Sample Size for Studies with One Mean
To estimate sample size for a research study involving a single mean, we must answer the following four questions:
Table 5-14. The Wilcoxon signed-ranks test on 7α-HCO.
1. What level of significance (α level or P value) related to the null hypothesis is wanted?
2. What is the desired level of power (equal to 1 – β)?
3. How large should the difference between the mean and the standard value or norm (μ₁ – μ₀) be for it to be clinically important?
4. What is an estimate of the standard deviation σ?
Specifications of α for a null hypothesis and β for an alternative hypothesis permit us to solve for the sample size. These specifications lead to the following two critical ratios, where zα is the two-tailed value (for a two-sided test) of z related to α, generally 0.05, and zβ is the lower one-tailed value of z related to β, generally 0.20. We use the lower one-sided value for β because we want power to be equal to (1 – β) or more.

zα = (X̅ − μ₀)/(σ/√n) and zβ = (X̅ − μ₁)/(σ/√n)

Solving both ratios for X̅, setting them equal, and rearranging gives the sample size:

n = [(zα − zβ)σ / (μ₁ − μ₀)]²
Suppose that prior to beginning their study Dennison and colleagues (1997) wanted to know whether mean juice consumption in 2-year-olds is different from 5 oz/day—either more or less. Researchers generally choose a type I error of 0.05, and power of 0.80. We assume the standard deviation is about 3 oz. From the given information, what is the sample size needed to detect a difference of 1 or more ounces?
The two-tailed z value for α of 0.05 is ±1.96 (from Table A–2); refer to Table 5-5. The lower one-tailed z value related to β is approximately –0.84 (the critical value that separates the lower 20% of the z distribution from the upper 80%). With a standard deviation of 3 and the 1-oz difference the investigators want to be able to detect (consumption of ≤ 4 oz or ≥ 6 oz), the sample size is

n = [(1.96 − (−0.84)) × 3 / 1]² = (2.80 × 3)² = 70.56
To conclude that mean juice consumption of ≤ 4 oz/day or ≥ 6 oz/day is a significant departure from an assumed 5 oz/day (with standard deviation of 3), investigators need a sample of 71. (The sample size is 71 instead of 70 because we always round up to the next whole number.) Because the investigators were interested in detecting a rather small difference of 1 oz, they need a moderately large sample. In Exercise 3 you are asked to calculate how large a sample would be needed if they wanted to detect a difference of 2 or more oz.
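The one-mean sample size formula is simple enough to wrap in a small function. This Python sketch (ours, not SamplePower or PASS) reproduces the calculation above; even with the exact z values rather than the rounded table values, the result still rounds up to 71.

```python
from math import ceil
from scipy import stats

def n_one_mean(sigma, delta, alpha=0.05, power=0.80):
    """Sample size for one mean: n = [(z_alpha - z_beta) * sigma / delta]^2."""
    z_a = stats.norm.ppf(1 - alpha / 2)   # 1.96 for a two-sided 0.05 test
    z_b = stats.norm.ppf(1 - power)       # about -0.84 for 80% power
    return ceil(((z_a - z_b) * sigma / delta) ** 2)

print(n_one_mean(sigma=3, delta=1))       # 71, as in the text
```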
The Sample Size for Studies with One Proportion
Just as in estimating the sample size for a mean, the researcher must answer the same four questions to estimate the sample size needed for a single proportion.
1. What is the desired level of significance (the α level) related to the null hypothesis, π₀?
2. What level of power (1 – β) is desired associated with the alternative hypothesis, π₁?
3. How large should the difference between the proportions (π₁ – π₀) be for it to be clinically important?
4. What is a good estimate of the standard deviation in the population? For a proportion, this is easy: the proportion itself, π, determines the estimated standard deviation, √[π(1 – π)].
The formula to determine the sample size is

n = {[zα√(π₀(1 − π₀)) − zβ√(π₁(1 − π₁))] / (π₁ − π₀)}²
where, using the same logic as with the sample size for a mean, zα is the two-tailed z value related to the null hypothesis and zβ is the lower one-tailed z value related to the alternative hypothesis.
To illustrate, we consider the study by Frey and colleagues (2002) of the success of smallpox vaccinations when used in a weaker than normal dilution. In the methods section of their article, the investigators provide an appropriate statement about power based on comparing two dilutions with undiluted vaccine. However, theirs was a complicated design, so we use a simpler illustration and assume they expect the 1:10 dilution to be 95% effective and want to show it is significantly greater than 90%. As with finding sample sizes for means, the two-tailed z value related to α = 0.05 is ±1.96, and the lower one-tailed z value related to β is approximately –0.84. Then, the estimated sample size, before squaring, is

√n = [1.96√(0.90 × 0.10) + 0.84√(0.95 × 0.05)] / (0.95 − 0.90) = (0.588 + 0.183)/0.05 = 15.42
from which, by squaring and rounding up, we have 238. Exercise 4 calculates the sample size needed if we use 97% instead of 95% as the immunization rate we want to detect.
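A parallel Python sketch for one proportion follows; it uses the rounded table values zα = 1.96 and zβ = –0.84, as in the hand calculation, and reproduces the 238. (Carrying the z values to full precision gives 239 instead, a reminder that these estimates are approximate.)

```python
from math import ceil, sqrt

def n_one_proportion(pi0, pi1, z_a=1.96, z_b=-0.84):
    """Sample size for one proportion, using the rounded table z values."""
    num = z_a * sqrt(pi0 * (1 - pi0)) - z_b * sqrt(pi1 * (1 - pi1))
    return ceil((num / (pi1 - pi0)) ** 2)

print(n_one_proportion(0.90, 0.95))       # 238, as in the text
```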
Sample Sizes for Before-and-After Studies
When studies involve the mean in a group measured twice, the research question focuses on whether there has been a change, or stated another way, whether the mean difference varies from zero. As we saw in the section titled, “Means when the Same Group Is Measured Twice,” we can use the same approach to finding confidence intervals and doing hypothesis tests for determining a change in the mean in one group measured twice as for a mean in one group. We can also use the same formulas to find the desired sample size. The only difference from the previous illustration is that we test the change (or difference between the means) against a population value of zero. If you have access to one of the power computer programs, you can compare the result from the procedure to calculate the sample size for one mean with the result from the procedure to calculate the sample size for a mean difference, generally referred to as the paired t test. If you assume the standard deviations are the same in both situations, you will get the same number. Unfortunately, the situation is not as simple when the focus is on proportions. In this situation, we need to use different formulas to determine sample sizes for before-and-after studies. Because paired studies involving proportions occur less often than paired studies involving means, we refer you to the power programs for calculating sample sizes for paired proportions.
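To make the point concrete, the following sketch reuses the one-mean formula for a paired design by supplying the standard deviation of the differences and testing the mean change against zero. The values 30 and 12 are hypothetical, chosen only to show the call.

```python
from math import ceil
from scipy import stats

def n_paired(sd_diff, delta, alpha=0.05, power=0.80):
    """Pairs needed to detect a mean change of delta, given the SD of differences."""
    z_a = stats.norm.ppf(1 - alpha / 2)
    z_b = stats.norm.ppf(1 - power)
    return ceil(((z_a - z_b) * sd_diff / delta) ** 2)

print(n_paired(sd_diff=30, delta=12))     # hypothetical inputs: about 50 pairs
```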
Figure 5-10. Computer output from the SamplePower program estimating a sample size for the mean juice consumption in 2-year-old children. (Data, used with permission, from Dennison BA, Rockwell HL, Baker SL: Excess fruit juice consumption by preschool-aged children is associated with short stature and obesity. Pediatrics 1997;99:15–22. Table produced with SamplePower 1.00, a registered trademark of SPSS, Inc.; used with permission.)
Computer Programs for Finding Sample Sizes
Using data from Dennison and coworkers (1997), we use the SamplePower program to calculate the sample size for a study involving one mean. Output from the program is given in Figure 5-10. (If you use this program, you can automatically get the sample size for 80% power by clicking on the binoculars icon in the tool bar.) SamplePower indicates we need n of 73, close to the value we calculated of 71. This program also generates a verbal statement (by pressing the icon that has lines on it and indicates it produces a report). Part of a power statement is also reproduced in Figure 5-10.
Figure 5-11. Computer output from the nQuery program estimating the sample size for a proportion. (Observations based on Frey SE, Couch RB, Tacket CO, Treanor JJ, Wolff M, Newman FK, et al: Clinical responses to undiluted and diluted smallpox vaccine. N Engl J Med 2002;346:1265–1274. Analysis produced with nQuery; used with permission.)
BOX 5-3. COMPUTER OUTPUT FROM THE PASS PROGRAM ESTIMATING A SAMPLE SIZE FOR THE NUMBER OF PATIENTS NEEDED IN THE STUDY OF CHOLECYSTECTOMY.
Source: Data, used with permission, from Sauter GH, Moussavian AC, Meyer G, Steitz HO, Parhofer KG, Jungst D: Bowel habits and bile acid malabsorption in the months after cholecystectomy. Am J Gastroenterol 2002;97(2):1732–1735. Analyzed with PASS; used with permission.
To find the sample size for a proportion, we use the nQuery program with data from Frey and coworkers (2002). Output from this procedure is given in Figure 5-11 and states that 250 patients will provide 81% power. nQuery also generates a statement, included in Figure 5-11, as well as a graph.
Finally, we illustrate the output from the PASS program for finding the sample size for a mean. The program for one mean can be used for a paired design, and we show this with data from Sauter and colleagues (2002). Output from this procedure is given in Box 5-3 and indicates that a sample size of 20 is needed to conclude that the observed difference in 7α-HCO is significant at P < 0.05. PASS also provides a graph of the relationship between power and sample size and generates a statement.
SUMMARY
This chapter illustrated several methods for estimating and testing hypotheses about means and proportions. We also discussed methods to use in paired or before-and-after designs in which the same subjects are measured twice. These studies are typically called repeated-measures designs.
We used observations on children whose juice consumption and overall energy intake were studied by Dennison and coworkers (1997). We formed a 95% confidence interval for the mean fruit juice consumed by 2-year-old children and found it to be 4.99–6.95 oz/day. We illustrated hypothesis testing for the mean in one group by asking whether the mean energy intake in 2-year-olds was different from the norm found in a national study, and we showed the equivalence of conclusions when using confidence intervals and hypothesis tests.
In the study published by Frey and colleagues (2002), the investigators found that initial vaccination was successful in 665 of 680 subjects (97.8%); in the group receiving the 1:10 dilution the proportion was 330 of 340 (97.1%). We used data from this study to illustrate statistical methods for a proportion. The authors concluded that vaccinia virus can be diluted to a titer as low as 1:10 and induce local viral replication and vesicle formation in more than 97% of persons; this suggests that the current stocks of smallpox vaccine in the United States could potentially protect nearly 10 times as many people as undiluted vaccine.
To illustrate the usefulness of paired or before-and-after studies, we used data from the study by Sauter and colleagues (2002) in which bile acid absorption and bowel habits were examined before and after cholecystectomy. We analyzed change in 7α-HCO at baseline and after 1 month and used the t statistic to form a 95% confidence interval for the change. Second, we performed a paired t test for the change in 7α-HCO and found that the difference was statistically significant. The investigators reported that after cholecystectomy there was an increase in patients reporting more than one bowel movement per day and those reporting loose stools. Despite significant increases in serum levels of 7α-HCO at 1 and 3 months after surgery, there was no relationship between changes in these levels and changes in bowel habits or occurrence of diarrhea. These results indicate that changes in bowel habits frequently occur after cholecystectomy but that bile acid malabsorption does not appear to be the predominant pathogenic factor in PCD.
Yuan and colleagues (2001) showed that MRI can identify lipid-rich necrotic cores and intraplaque hemorrhage in atherosclerotic plaques with high sensitivity and specificity. We used the data to illustrate agreement between two procedures with the κ statistic and found a good level of agreement. The investigators hope that this noninvasive technique will be a useful tool in lipid-lowering clinical trials and in determining prognosis in patients with carotid artery disease.
On occasion, investigators want to know whether the proportion of subjects changes after an intervention. In this situation, the McNemar test is used, as with changes in stool frequency status in the study of cholecystectomy by Sauter and colleagues (2002).
We explained alternative methods to use when observations are not normally distributed. Among these are several kinds of transformations, with the log (logarithmic) transformation being fairly common, as well as nonparametric tests, which make no assumptions about the distribution of the data. We illustrated the sign test for testing hypotheses about the median in one group and the Wilcoxon signed rank test for paired observations, which has power almost as great as that of the t test.
We concluded the chapter with a discussion of the important concept of power. We outlined the procedures for estimating the sample size for research questions involving one group and illustrated the use of three statistical programs that make the process much easier.
In the next chapter, we move on to research questions that involve two independent groups. The methods you learned in this chapter are not only important for their use with one group of subjects, but they also serve as the basis for the methods in the next chapter.
A summary of the statistical methods discussed in this chapter is given in Appendix C. These flowcharts can help both readers and researchers determine which statistical procedure is appropriate for comparing means.
EXERCISES
1. Using the study by Dennison and coworkers (1997), find the 99% confidence interval for the mean fruit juice consumption among 2-year-olds and compare the result with the 95% confidence interval we found (4.99–6.95).
a. Is it wider or narrower than the confidence interval corresponding to 95% confidence?
b. How could Dennison and coworkers increase the precision with which the mean level of juice consumption is measured?
c. Recalculate the 99% confidence interval assuming the number of children is 200. Is it wider or narrower than the confidence interval corresponding to 95% confidence?
2. Using the study by Dennison and coworkers, test whether the mean consumption of soda in 2-year-olds differs from zero. What is the P value? Find the 95% confidence interval for the mean and compare the results to the hypothesis test.
3. Using the Dennison and coworkers study, determine the sample size needed if the researchers wanted 80% power to detect a difference of ≥ 2 oz in fruit juice consumption among 2-year-olds (assuming the standard deviation is 3 oz). Compare the results with the sample size needed for a difference of 1 oz.
4. What sample size is needed if Frey and coworkers (2002) wanted to know if an observed 97% of patients with an initial success to vaccination is different from an assumed norm of 90%? How does this number compare with the number we found assuming a rate of 95%?
5. Our calculations indicated that a sample size of 71 is needed to detect a difference of ≥1 oz from an assumed mean of 5 oz in the Dennison and coworkers study, assuming a standard deviation of 3 oz. Dennison and coworkers had 94 children in their study and found a mean juice consumption of 5.97 oz. Because 94 is larger than 71, we expect that a 95% CI for the mean would not contain 5. The CI we found was 4.99–6.95, however, and because this CI contains 5, we cannot reject a null hypothesis that the true mean is 5. What is the most likely explanation for this seeming contradiction?
6. Using the data from the Sauter and colleagues study (2002), how large would the mean difference need to be to be considered significant at the 0.05 level if only ten patients were in the study? Hint: Use the formula for one mean and solve for the difference, using 30 as an estimate of the standard deviation.
7. Two physicians evaluated a sample of 50 mammograms and classified them as negative (needing no follow-up) versus positive (needing follow-up). Physician 1 determined that 30 mammograms were negative and 20 were positive, and physician 2 found 35 negative and 15 positive. They agreed that 25 were negative. What is the agreement beyond chance?
8. Use the data from the Sauter and colleagues study to determine if a change occurs in HDL after 3 months (HDL3DIFF).
a. First, examine the distribution of the changes in HDL. Is the distribution normal so we can use the paired t test, or is the Wilcoxon test more appropriate?
b. Second, use the paired t test to compare the before-and-after measures of HDL; then, use the t test for one sample to compare the difference to zero. Compare the answers from the two procedures.
9. Using the Canberra Interview for the Elderly (CIE), Henderson and colleagues (1997) collected data on depressive symptoms and cognitive performance for 545 people. The interview was given at baseline and again 3–4 years later. The CIE reports the depression measure on a scale from 1 to 17.
a. Use the data set in the folder entitled “Henderson” on the CD-ROM to examine the distribution of the depression scores at baseline and later. What statistical method is preferred for determining if a change occurs in depression scores?
b. We recoded the depression score as depressed versus not depressed. Use the McNemar statistic to see if the proportion of depressed people is different at the end of the study.
c. Do the conclusions agree? Discuss why or why not.
10. Dennison and coworkers also studied 5-year-old children. Use the data set in the CD-ROM folder marked “Dennison” to evaluate fruit juice consumption in 5-year-olds.
a. Are the observations normally distributed?
b. Perform the t test and sign test for one group. Do these two tests lead to the same conclusion? If not, which is the more appropriate?
c. Produce a box plot for 2-year-olds and for 5-year-olds and compare them visually. What do you think we will learn when we compare these two groups in the next chapter?
11. If you have access to the statistical program Visual Statistics, use the Discrete Distributions module to see how the distribution changes as the proportion and the sample size change. What happens as the proportion gets closer to 0? to 0.5? to 1? And what happens as the sample size increases? Decreases? Try some situations in which the proportion times the sample size is quite small (eg, 0.2 × 10). What happens to the shape of the distribution then?
12. Group Exercise. Congenital or developmental dysplasia of the hip (DDH) is a common pediatric affliction that may precede arthritic deformities of the hip in adult patients. Among patients undergoing total hip arthroplasty, the prevalence of DDH is 3–10%. Ömeroğlu and colleagues (2002) analyzed a previously devised radiographic classification system for the shape of the acetabular roof. The study design required that four orthopedic surgeons independently evaluate the radiographs of 33 patients who had previously been operated on to treat unilateral or bilateral DDH. They recorded their measurements independently on two separate occasions during a period of 1 month. You may find it helpful to obtain a copy of the article to help answer the following questions.
a. What was the study design? Was it appropriate for the research question?
b. How did the investigators analyze the agreement among the orthopedic surgeons? What is this type of agreement called?
c. How did the investigators analyze the agreement between the measurements made on two separate occasions by a given orthopedic surgeon?
d. Based on the guidelines presented in this chapter, how valuable is the classification system?
13. Following is a report that appeared in the April– June 1999 Chance News from the Chance Web site at
http://www.dartmouth.edu/~chance/chance_news/recent_news/chance_news_8.05.html#polls
Read the information and answer the discussion questions. “Election Had Too Many Polls and Not Enough Context.”
We do not often see a newspaper article criticizing the way it reports the news but this is such an article. Schachter writes about the way newspapers confuse the public with their tracking of the polls. He starts by commenting that the polls are “crude instruments which are only modestly accurate.” The truth is in the margin of error, which is “ritualistically repeated in the boilerplate paragraph that newspapers plunk about midway through poll stories (and the electronic media often ignore).”
He remarks that when the weather forecaster reports a 60% chance of rain tomorrow, few people believe the probability of rain is exactly 60%. But when a pollster says that 45% of the voters will vote for Joe Smith, people believe this and feel that the poll failed if Joe got only 42%. They also feel that something is wrong when the polls do not agree.
Schachter reviews how the polls did in the recent Ottawa election and finds that they did quite a good job taking the margin of error into account—"much better than the people reporting, actually."
In a more detailed analysis of the polls in this election, Schachter gives examples to show that, when newspapers try to explain each chance fluctuation in the polls, they often miss the real reason voters change their minds.
Schachter concludes by saying:
It's amusing to consider what might happen if during an election one media outlet reported all the poll results as a range. Instead of showing the Progressive Conservatives at 46%, for example, the result would be shown as 44–50%. That imprecision would silence many of the pollsters who like to pretend they understand public opinion down to a decimal point. And after the initial confusion, it might help the public to see polls for what they are: useful, but crude, bits of information.
--(Harvey Schachter, The Ottawa Citizen, 5 June, 1999)
DISCUSSION QUESTIONS
1. What do you think about the idea of giving polls as intervals rather than as specific percentages? Would this help also in weather predictions?
2. Do you agree that weather predictions of the temperature are understood better than poll estimates? For example, what confidence interval would you put on a weather predictor's 60% chance for rain?
Footnotes
ᵃWhere does the value of 26.4 come from? Recall from Chapter 4 that the standard error of the mean, SE, is the standard deviation of the mean, not the standard deviation of the original observations. We calculate the standard error of the mean by dividing the standard deviation by the square root of the sample size: SE = SD/√n.
ᵇRemember from your high school or college math courses that the log of a number is the power of 10 that gives the number. For example, the log of 100 is 2 because 10 raised to the second power (10²) is 100, and the log of 1000 is 3 because 10 must be raised to the third power (10³) to obtain 1000. We can also think about logs as the exponents of 10 that must be used to get the number.