Principles of Ambulatory Medicine, 7th Edition

Chapter 2

Practicing Evidence-Based Medicine

Darius A. Rastegar

Scott M. Wright contributed to an earlier version of this chapter.

General Approach

Clinicians are increasingly (and appropriately) asked to provide both scientifically sound and cost-effective medical care. These expectations have given rise to an emphasis on evidence-based medicine (EBM), which is defined as the conscientious, explicit, and judicious use of current best evidence in making decisions about the care of individual patients (1). EBM focuses on issues integral to day-to-day patient care: assessment of risks, prevention, screening, diagnosis, prognosis, treatment, and management of the increasing amount of medical information that confronts health care practitioners.

Evidence-based decision making is especially important in ambulatory practice because this is the setting where patients are most likely to present with undifferentiated problems. It is also the setting where most clinical decisions are made.

The following steps are considered to be indispensable to practicing EBM:

  • Step 1: Formulate specific questions that are relevant to a patient's care and identify the type of information that is needed (e.g., efficacy or harm of a treatment, accuracy of a diagnostic test).
  • Step 2: Identify and retrieve the relevant data.
  • Step 3: Critically appraise the relevant information.
  • Step 4: Apply the valid information to the patient whose presentation initiated the inquiry, taking into account the patient's values and wishes.

The importance of step 1, formulation of specific questions, can be understood by considering two similar questions that might be generated when a practitioner sees a patient with hepatitis C who asks whether antiviral therapy, which the patient has read about in the newspaper, should be initiated:

  • Question A: How effective is interferon and ribavirin for the treatment of hepatitis C?
  • Question B: For Mr. B, the 48-year-old man with genotype 2 hepatitis C whose transaminases and liver function tests have been normal during the last 9 months, what is the evidence regarding the efficacy and safety of interferon and ribavirin in preventing cirrhosis and other complications?

The second question is more specific and will better help to tailor the search effort (step 2) to the clinical outcomes that are most relevant to the practitioner and the patient.

The successful completion of step 2 requires efficient and effective searching skills. Most medical libraries offer brief hands-on tutorials to teach clinicians how to search databases such as MEDLINE and the Internet to find the current best evidence. The National Library of Medicine (see http://www.hopkinsbayview.org/PAMreferences) provides access to PubMed (MEDLINE) and multiple health and science databases. It also offers full-text versions of many articles, eliminating the need for additional steps to retrieve the desired manuscripts.

Step 3, critical appraisal, is likely to be most difficult and time-consuming for clinicians. The two components of this step are (a) deciding whether the results are valid and (b) deciding whether the results are relevant to the specific question being asked.

It is more efficient to resolve the second component first, which can usually be done fairly quickly. If the results are not relevant or clinically important, then one can avoid the time and effort of judging the validity and quality of the information. There are numerous books and articles published in the medical literature (e.g., the Users’ Guides to the Medical Literature series published by the Journal of the American Medical Association) that aim to teach clinicians the core skills of critical appraisal. Gaining confidence in one's ability to critically appraise manuscripts on a wide variety of topics (e.g., diagnosis, treatment, cost-effectiveness) and across a myriad of study designs may take time, practice, and even additional training. Such training can be found in workshops at regional and national meetings or through medical libraries.

Step 4 involves integrating the important and valid newly found information into the care of one's patient. This step can be the most satisfying component of practicing EBM. Educating patients that a particular diagnostic approach or treatment is supported by current medical research may instill a sense of confidence about the practitioner's knowledge and expertise in finding new data. However, even after completing all these steps, choosing the best course of action is not always straightforward, and the patient's values and wishes should determine the ultimate course of action. For example, a patient may wish to forgo a treatment that may prolong their life but will likely worsen their quality of life (e.g., chemotherapy for metastatic cancer) or a patient may be unwilling to trade a short-term risk for the possibility of a long-term benefit (e.g., carotid endarterectomy for asymptomatic carotid stenosis).

It would be impractical to assume or recommend that primary care practitioners embark on these fundamental steps of EBM every time a clinical question comes up. However, when critical queries arise that are likely to recur or are particularly important to an individual patient, this version of “self-directed continuing medical education” is likely to be helpful to both practitioners and patients. Some barriers to practicing EBM include skepticism by practitioners, information overload and feeling overwhelmed by the growth of medical knowledge, lack of time, and lack of appropriate resources, skills, or motivation to implement EBM (2). Furthermore, for some clinical questions, high-quality data is lacking.

All dedicated and committed clinicians, however, practice EBM to some degree. To counterbalance the barriers to practicing EBM, the following facilitating behaviors have been proposed: (a) reading and keeping up-to-date with the medical literature (see Keeping Up); (b) refining one's EBM skills (practice makes perfect); (c) collaborating with colleagues so that valuable clinical evidence is shared among practitioners; (d) writing down specific clinical questions (step 1) when they come up so that the process can continue when time permits; (e) setting up one's computer (e.g., bookmarking relevant websites) and one's office (e.g., acquiring access to high-quality information) so as to find information efficiently; and (f) making friends with the librarian at the nearest medical library.

The remainder of this chapter discusses the core principles of EBM that apply to issues most relevant to primary care practice: diagnosis, prognosis, treatment, risk or potential harm, and cost-effectiveness. Strategies for keeping up are also discussed. Chapter 14 discusses principles that apply to prevention and screening.

Diagnosis

How Clinicians Formulate a Diagnosis

Diagnostic assessment begins the moment one meets a patient. Behavioral scientists have described at least four ways in which clinicians formulate diagnoses: pattern recognition, algorithm, exhaustion, and hypothesis-deduction.

Pattern Recognition

Many diagnoses are made instantly because clinicians have learned to recognize patterns specific to certain diseases, such as the face of a patient with Down syndrome or the elbows of a patient with psoriasis. The certainty of these types of diagnoses is so great that further testing often is unnecessary.

Algorithm

Algorithms are growing more common as a result of the growth of clinical practice guidelines, which, when grounded scientifically, can be extremely helpful. The drawbacks of algorithms are that they must be constructed before the patient is seen, and they must account for every possibility in a workup.

Exhaustion

As Sackett pointed out (see Sackett et al., Clinical Epidemiology, at http://www.hopkinsbayview.org/PAMreferences), medical students should be taught how to do a complete history and physical examination, and then be taught never to do one again. On occasion, however, clinicians do resort to comprehensive histories and examinations, as much to buy time to think as to uncover hidden disease.

Hypothesis–Deduction

Clinicians usually diagnose by forming hypotheses and testing them, as is done in scientific experimentation. On hearing that a patient has chest pain, the practitioner builds a short list of hypotheses, invites further description, and then asks focused questions that help confirm or rule out the hypotheses. The questions in the interview and each maneuver in the examination are as much diagnostic tests as the electrocardiogram or the chest radiograph. Studies of clinicians’ behavior reveal that the short list of hypotheses usually does not exceed three or four diagnoses. Typically, new hypotheses are added as others are discarded, but the eventual goal is to narrow the list and reduce the uncertainty about which diagnosis is most likely. Studies of clinicians in ambulatory practice showed that hypotheses were generated, on average, 28 seconds into the interview, and that correct diagnoses of standard problems were made 6 minutes into 30-minute workups; the correct diagnoses were made in 75% of the encounters (3).

The hypothesis–deduction model reveals a truth common to all methods of diagnosis: Rarely can a clinician be absolutely certain of any diagnosis. Clinicians live with uncertainty, and the role of all diagnostic tests—the interview, the physical examination, the laboratory evaluation, trials of empiric treatments, allowing time to pass (expectant observation)—is to narrow the uncertainty enough to place a diagnostic label on a patient's problem. How narrow the uncertainty must be depends on the practitioner's and the patient's tolerance of uncertainty, the severity of the suspected disease, the “treatability” of the suspected disease, and the benefits and risks of possible treatments.

Steps in the Hypothesis–Deduction Process

Evidence shows that clinicians implicitly use common sense and their medical knowledge to reach a diagnosis with adequate certainty. Explicitly, the diagnostic process follows certain steps.

Step 1: Form a Hypothesis and Estimate Its Likelihood

The estimate of likelihood is called the pretest probability (or prior probability); it simply represents the estimate of prevalence of the disease in a group of people similar to the patient at hand. Each hypothesized diagnosis and the estimate of its likelihood comes initially from evidence collected during the interview and physical examination and from the practitioner's fund of knowledge from sources such as other patients, colleagues, textbooks, and journals. More recently, computer programs have been developed to aid clinicians in making this estimate; these programs have the potential to become a powerful tool in clinical decision making.

Step 2: Decide How Certain the Diagnosis Must Be

If the hypothesized disease is easily and safely treated, one might have to be less certain than if the disease has an ominous prognosis or demands complex, risk-laden treatment. For example, a 75% certainty that a patient has streptococcal pharyngitis might be sufficient to prescribe an antibiotic, whereas a much higher level of certainty is needed before diagnosing and treating a patient with suspected leukemia. If the pretest probability is above the threshold for a hypothesized disease (e.g., greater than 75% for streptococcal pharyngitis), further tests are unnecessary and treatment is prescribed. Conversely, if one is adequately certain that the patient does not have the hypothesized disease (e.g., 90% probability that the patient does not have streptococcal pharyngitis), no further tests are required and the patient can be reassured and educated. However, if the level of uncertainty remains between these two extremes, further testing (e.g., a throat culture) can help move the case toward one extreme or the other. Diagnostic testing usually is most helpful between the two extremes of certainty, whereas further testing generally has little impact on the posttest probability if the pretest probability is very high or very low.

Step 3: Choose a Diagnostic Test

Which test to choose depends on many factors, including its safety, its accuracy (e.g., how closely an observation or a test result reflects the true clinical state of a patient), how easily it can be done, its cost, and, not least, the patient's preferences and values regarding tests, especially those that carry risks. Accuracy includes both reliability and validity. Reliability of a test, also called reproducibility or precision, is the extent to which repeated measurements of a stable phenomenon give results close to one another. Validity is the degree to which a test measures what it is supposed to measure. A test can be reliable but not valid (i.e., it reliably measures the wrong phenomenon), or it can be valid but not reliable (i.e., it measures the phenomenon of interest, but with wide scatter).

When considering a test, one needs to reflect on each of these factors. Table 2.1 summarizes practical guidelines to assess and critically appraise reported studies of diagnostic tests. When selecting a test for a patient, the crucial questions to ask are, “Will the results of the test change my plan?” and “Will my patient be better off from having had the test?” (the utility of the test). If the answer to these questions is “No,” the test should not be performed.

TABLE 2.1 Guidelines for Assessing a Study of a Diagnostic Test

Was there an independent blind comparison with a gold standard?
Was the test evaluated in a sample of patients that included an appropriate spectrum of disease (mild to severe, treated and untreated) plus patients with commonly confused disorders?
Was the setting for the evaluation adequately described?
Were the reproducibility of the test result (precision) and its interpretation (observer variation) determined?
Was the term normal defined sensibly?
Were the methodologies for conducting the test described well enough for their exact replication?
Was the utility of the test determined (i.e., were the patients better off for having had the test)?

Adapted from Sackett DL, Haynes RB, Guyatt GH, et al. Clinical epidemiology: a basic science for clinical medicine. 2nd ed. Boston: Little, Brown, 1991.

Step 4: Be Aware of the Test's Performance Characteristics

Every diagnostic test has a sensitivity and specificity for each disease it tests for. Sensitivity and specificity have become common terms in medical discussion, but they are often misunderstood. The sensitivity of a test (the true positive rate) is equal to the number of study subjects with a given disease who have a positive test divided by all study subjects with the disease. The specificity of a test (the true negative rate) is the number of study subjects without the disease who have a negative test divided by all those without the disease. The 2 × 2 table in Fig. 2.1 reveals much about these and related terms.
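For concreteness, the two definitions can be written as a short calculation. This is only a sketch with made-up counts; the variable names are illustrative, not from the chapter:

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """True-positive rate: TP / (TP + FN), i.e., positive tests among the diseased."""
    return true_pos / (true_pos + false_neg)

def specificity(true_neg: int, false_pos: int) -> float:
    """True-negative rate: TN / (TN + FP), i.e., negative tests among the nondiseased."""
    return true_neg / (true_neg + false_pos)

# Hypothetical 2 x 2 table: 100 diseased and 200 nondiseased study subjects
tp, fn = 90, 10    # diseased column of the table
tn, fp = 160, 40   # nondiseased column of the table
print(sensitivity(tp, fn))   # 0.9
print(specificity(tn, fp))   # 0.8
```

Note that neither number depends on how common the disease is in the sample; that property is what makes sensitivity and specificity portable between study and practice settings.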

FIGURE 2.1. Test performance determined by research. The researcher identifies diseased and nondiseased patients using a gold standard and then determines the performance characteristics (sensitivity and specificity) of another test. Example: iron-deficiency anemia determination using bone marrow aspirate/biopsy as the gold standard and serum ferritin measurement as the screening test. (Data from Guyatt GH, Patterson C, Ali M, et al. Diagnosis of iron deficiency in the elderly. Am J Med 1990;88:205.)

Tests with high sensitivity have a low false-negative rate and are useful for “ruling out” a diagnosis (when they are negative). Conversely, tests with high specificity have a low false-positive rate and are useful for “ruling in” a diagnosis (when they are positive). One way of remembering this is with the mnemonics SnNOut (high sensitivity, negative result rules out) and SpPIn (high specificity, positive result rules in). However, it should be pointed out that these rules of thumb do not always hold up in actual practice; the ability of a sensitive test to rule out a diagnosis is reduced when the specificity is low (4).
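The caveat can be made concrete with the negative likelihood ratio (defined later in this chapter as the false-negative rate divided by specificity); the numbers below are hypothetical:

```python
def negative_lr(sens: float, spec: float) -> float:
    """Negative likelihood ratio: (1 - sensitivity) / specificity."""
    return (1 - sens) / spec

# Same 95% sensitivity, but very different rule-out power:
print(round(negative_lr(0.95, 0.90), 3))  # 0.056 -> a negative result strongly rules out
print(round(negative_lr(0.95, 0.10), 3))  # 0.5   -> a negative result barely shifts probability
```

With 10% specificity the negative LR is 0.5, so even a "SnNOut"-worthy sensitivity leaves a negative result with little rule-out value, which is the point of reference (4).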

“Diseased” and “not diseased” are labels that reflect a best test or a definition of a certain disease: the so-called gold standard. For pulmonary embolus, for example, the gold standard is the pulmonary angiogram. For angina, there is no sure test, so a case definition becomes the gold standard. Skepticism must be used in evaluating gold standards, for they often have their own limitations. For example, when gallbladder ultrasonography was tested for use in the diagnosis of cholelithiasis, it initially seemed to be a poor test in comparison with the gold standard (oral cholecystogram), not because of problems with the new test, but because, as was later shown, the gold standard was itself a poor test (5). Studies of diagnostic testing may have other problems, including verification bias (when those with a positive test result are more likely to have further evaluation), spectrum bias (when the population tested does not reflect those in whom the test will be used), and incorporation bias (when the results of the test under study are included among criteria to establish the reference standard).

TABLE 2.2 Tradeoff between Sensitivity and Specificity when Diagnosing Iron-Deficiency Anemia: Likelihood Ratios

Serum Ferritin Cutoff   Sensitivity   False-Negative Rate    Specificity   False-Positive Rate    Positive LR             Negative LR
Value (µg/L)            (%)           (1 - Sensitivity, %)   (%)           (1 - Specificity, %)   (Sens/False-Pos Rate)   (False-Neg Rate/Spec)

<15                     58.6          41.4                   98.9           1.1                   53.3                    0.42
<35                     80.2          19.8                   94.4           5.6                   14.3                    0.21
<65                     90.4           9.6                   84.7          15.3                    5.9                    0.11
<95                     94.1           5.9                   75.3          24.7                    3.8                    0.08

Adapted from Guyatt GH, Patterson C, Ali M, et al. Diagnosis of iron deficiency in the elderly. Am J Med 1990;88:205, and from Sackett DL, Haynes RB, Guyatt GH, et al. Clinical epidemiology: a basic science for clinical medicine. 2nd ed. Boston: Little, Brown, 1991.

Sensitivity and specificity are not static properties of a test. As the cutoff value for an abnormal result is made more extreme, the test's sensitivity decreases and its specificity increases. Table 2.2, where progressively lower ferritin levels are used to characterize elderly patients as having iron-deficiency anemia (IDA), illustrates this principle (6). This illustration matches the common-sense conclusion that as a patient's test result becomes more abnormal, one can be more certain that the patient has disease—although never fully certain. If one selects a very low ferritin level for the cutoff between normal and abnormal (Table 2.2), many iron-deficient people will remain undiagnosed (i.e., the sensitivity will be low), but almost all of those diagnosed will be truly iron deficient (i.e., the specificity will be high). Conversely, if one decides to label patients as having IDA based on a ferritin level well within the normal range (e.g., 75 µg/L), one will not miss much disease (higher sensitivity), but will falsely label as iron deficient numerous anemic patients who are not (lower specificity). When interpreting the results of a test, a clinician must consider the severity of the disease, the potential risks and benefits of treatment, and how information about those risks and benefits may change over time.

Another way of showing the relationship (and trade-off) between sensitivity and specificity is to plot a receiver operating curve (ROC); the true-positive rate (sensitivity) is plotted on the vertical axis and the false-positive rate (1 - specificity) on the horizontal axis. Figure 2.2 shows a plot of the values provided in Table 2.2. Receiver operating curves can be a useful tool to compare different diagnostic tests; in general, the closer the curve comes to the upper left-hand corner (100% sensitivity and specificity), the better the test performs.

Step 5: Determine a Posttest Probability of Disease

The perfect test (100% sensitivity and specificity) would yield a “yes” or “no” answer to the question “Does my patient have disease or not?” However, because no test is perfect, the more appropriate question is: “Given the result of this test, what is the posttest probability that my patient has (or does not have) disease?” Posttest probability takes into account both the performance characteristics (sensitivity and specificity) of the test and the pretest (prior) probability of disease in a group of patients similar to the patient in question.

FIGURE 2.2. Receiver operating curve (ROC) of serum ferritin for iron-deficiency anemia.

One method for determining posttest probability is through the use of predictive values. Predictive values can be calculated from the known sensitivity and specificity of a test and the estimated pretest probability of disease. Sensitivity and specificity are generally transferable from study to practice settings, provided the diseased and nondiseased populations in the study and in the practice settings are similar. Sensitivity and specificity usually are not influenced by the prevalence, or pretest probability, of disease. However, predictive values must be recalculated for each patient or population from the estimated pretest probability or prevalence of disease in that particular group. Positive predictive value is the probability of disease in a patient who has an abnormal test result. Negative predictive value is the probability of no disease in a patient for whom a test result is normal. Figure 2.1 illustrates the calculation of posttest probability, based on pretest probability, sensitivity, and specificity.
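As a sketch, the predictive values can be computed directly from sensitivity, specificity, and the estimated pretest probability. The numbers below reuse the ferritin <35 µg/L cutoff from Table 2.2 with a 31% pretest probability; the function name is illustrative:

```python
def predictive_values(prevalence: float, sens: float, spec: float):
    """Return (positive predictive value, negative predictive value)."""
    tp = prevalence * sens              # true positives per unit of population
    fn = prevalence * (1 - sens)        # false negatives
    tn = (1 - prevalence) * spec        # true negatives
    fp = (1 - prevalence) * (1 - spec)  # false positives
    return tp / (tp + fp), tn / (tn + fn)

ppv, npv = predictive_values(prevalence=0.31, sens=0.802, spec=0.944)
print(round(ppv, 2), round(npv, 2))  # 0.87 0.91
```

Rerunning the same call with a much lower prevalence (say, 0.02) shows the chapter's point numerically: the positive predictive value collapses even though sensitivity and specificity are unchanged.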

The lower the pretest probability of disease, the lower the positive predictive value of a test, the lower the posttest probability of disease, and the more likely it is that a positive test result is falsely positive. This influence of pretest probability on posttest probability makes intuitive sense. For example, when a seasoned clinician encounters an unexpected positive test result in a patient with a very low likelihood of disease, the clinician is suspicious of the finding and either repeats the test, suspecting laboratory error, or orders another, more specific test to confirm or refute the finding.

Published information is available that can be helpful in estimating pretest probability, and therefore the predictive value of test results, in patients with selected characteristics. Examples of how such information can be used to interpret test results and determine diagnostic strategies are illustrated elsewhere in this book for deep vein thrombosis (see Chapter 57) and renovascular hypertension (see Chapter 67).

Another method of calculating the posttest probability of disease is through the use of a likelihood ratio (LR), a single number that combines sensitivity and specificity. The positive LR (+LR) is the true-positive rate (sensitivity) divided by the false-positive rate (1 - specificity), and the negative LR (–LR) is the false-negative rate (1 - sensitivity) divided by the true-negative rate (specificity). An LR can range from 0 to infinity; an LR between 0 and 1 decreases the posttest probability, an LR greater than 1 increases it, and an LR of exactly 1 leaves the posttest probability unchanged (i.e., the test result is not useful).
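These two definitions can be checked against the ferritin table (a sketch; the inputs are the sensitivity and specificity from the <35 µg/L row of Table 2.2):

```python
def likelihood_ratios(sens: float, spec: float):
    """Return (+LR, -LR): sens / (1 - spec) and (1 - sens) / spec."""
    return sens / (1 - spec), (1 - sens) / spec

# Sensitivity 80.2% and specificity 94.4% (ferritin < 35 ug/L cutoff):
pos_lr, neg_lr = likelihood_ratios(sens=0.802, spec=0.944)
print(round(pos_lr, 1), round(neg_lr, 2))  # 14.3 0.21
```

The results reproduce the 14.3 and 0.21 shown in the table's last two columns.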

There are a few ways of using the LR to calculate posttest probabilities. The standard method is to convert the pretest probability into pretest odds, multiply the odds by the likelihood ratio to obtain the posttest odds, and then convert the posttest odds back to a probability:

FIGURE 2.3. Nomogram for interpreting test results using likelihood ratios. Example from text: An elderly male patient with anemia has a pretest probability of having IDA equal to 33%. His serum ferritin level is 33 µg/L, which is associated with a positive LR of 14.3. Extending a straight line through the pretest probability of 33% and the LR of 14.3 results in a posttest probability of 88%. (Adapted from Fagan TJ. Nomogram for Bayes’ theorem. N Engl J Med 1975;293:257.)

  1. Pretest Probability ÷ (1 - Pretest Probability) = Pretest Odds
  2. Pretest Odds × Likelihood Ratio = Posttest Odds
  3. Posttest Odds ÷ (1 + Posttest Odds) = Posttest Probability
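The three steps above can be sketched as a single function (using the chapter's ferritin example as input):

```python
def posttest_probability(pretest_prob: float, lr: float) -> float:
    pretest_odds = pretest_prob / (1 - pretest_prob)  # step 1: probability -> odds
    posttest_odds = pretest_odds * lr                 # step 2: multiply by the LR
    return posttest_odds / (1 + posttest_odds)        # step 3: odds -> probability

# Pretest probability 33%, positive LR 14.3 (serum ferritin < 35 ug/L):
print(round(posttest_probability(0.33, 14.3), 2))  # 0.88
```

The 88% result matches the nomogram example in Figure 2.3.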

Another method is to use a nomogram (Fig. 2.3) that allows conversion of pretest to posttest probabilities, given a known LR, without having to convert back and forth between probabilities and odds. This alternative is quick, is easy to use, and decreases the chances of calculation error.

However, converting probabilities to odds and back can be cumbersome, and most of us do not carry nomograms in our pockets. For this reason, it may be simpler to use a method of estimating posttest probabilities (7). This method is fairly accurate when the pretest probability is between 10% and 90% (i.e., neither very high nor very low). Table 2.3 summarizes the approximate change in probability associated with a range of LRs. One can simply remember that positive LRs of 2, 5, and 10 are associated with approximate posttest probability increases of 15%, 30%, and 45%, respectively. Conversely, LRs of 1/2 (0.5), 1/5 (0.2), and 1/10 (0.1) decrease the posttest probability by 15%, 30%, and 45%, respectively.

TABLE 2.3 Simplified Posttest Probability Estimates Based on Likelihood Ratio*

Likelihood Ratio   Approximate Change in Probability
1/10 (0.1)         -45%
1/5 (0.2)          -30%
1/2 (0.5)          -15%
1                    0%
2                  +15%
5                  +30%
10                 +45%

*These estimates are only applicable if the pretest probability is between 10% and 90%.
From McGee S. Simplifying likelihood ratios. J Gen Intern Med 2002;17:646.

For example, suppose the clinician is faced with a 67-year-old male patient who has increasing fatigue and is found to be anemic. Knowing that the baseline prevalence (pretest probability) of IDA among anemic elderly patients is 31% (6), one might consider this man's pretest probability of IDA to be approximately 33%, for an odds of 1:2, or 0.5. If the serum ferritin is 33 µg/L, we can see from Table 2.2 that when a cutoff of <35 µg/L is used, the positive LR is 14.3:

Pretest Odds × Likelihood Ratio = 0.5 × 14.3 ≈ 7

So the odds of the patient having IDA based on this test result are 7:1. Converting back to probability, the patient has a posttest probability of IDA of about 7 ÷ (1 + 7) = 7 ÷ 8 = 88%. Given this posttest probability, further diagnostic workup (e.g., colonoscopy) to identify the cause of the IDA is appropriate.

Using the nomogram and a straightedge, the posttest probability is approximately 85%. Finally, if we use the simplified estimation method outlined earlier, we know that the likelihood ratio is >10; consequently, we should add at least 45% to the pretest probability of 31%, yielding a posttest probability of >76%, which is probably close enough to the actual value to help us in our decision making.
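The agreement among the three approaches can be checked numerically. The sketch below puts the exact odds arithmetic next to the rule-of-thumb offsets from Table 2.3 (which assume a pretest probability between 10% and 90%), starting from the 31% baseline prevalence:

```python
def exact_posttest(pretest: float, lr: float) -> float:
    """Exact conversion: probability -> odds, multiply by LR, odds -> probability."""
    odds = pretest / (1 - pretest) * lr
    return odds / (1 + odds)

# Approximate additive changes in probability from Table 2.3, keyed by LR:
MCGEE_OFFSET = {0.1: -0.45, 0.2: -0.30, 0.5: -0.15, 1: 0.0, 2: 0.15, 5: 0.30, 10: 0.45}

pretest = 0.31  # prevalence of IDA among anemic elderly patients
print(round(exact_posttest(pretest, 14.3), 2))  # 0.87 (exact calculation)
print(round(pretest + MCGEE_OFFSET[10], 2))     # 0.76 (LR > 10: add at least 45%)
```

The simplified estimate lands within about 10 percentage points of the exact value here, which is usually close enough to support the same clinical decision.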

Prognosis

Often, the information that is most important to a patient who has a new diagnosis is the prognosis (“What is going to happen to me?”). In choosing therapy, one decides what one can do for the patient's disease. Yet, predicting what will happen to a particular patient usually is not possible, and clinicians must rely on probabilities. Sometimes, specific characteristics (“prognostic factors”) such as demographic factors, disease-specific factors, and comorbidities can help further delineate a patient's prognosis. Clinical prediction rules that take these factors into account can help practitioners arrive at more accurate estimates of prognosis.

Prognosis can be addressed in two ways: the natural history of a disease and the clinical course of a disease. Because few diseases today progress without medical intervention, less is being learned about natural history and more is being learned about clinical course. For example, the natural history of diabetes in the late 20th century is unknown because virtually no diagnosed patients go without some type of therapy, yet through many studies, more is known about the course of treated diabetes.

Most information about prognosis comes from prospective cohort studies in which patients with a disease are monitored over time. Cohort studies may include only untreated subjects (natural history of a disease), only treated subjects, or a combination of both treated and untreated subjects (clinical course of a disease). Cohort studies are simple in design, yet they are often costly in time and money. They are susceptible to biases, such as sampling bias, in which the group of patients being monitored is not representative of all patients with that condition. Table 2.4 summarizes suggested guidelines for assessing studies of prognosis.

Treatment

Once a diagnosis is made, treatment becomes the focus of care. Before embarking on a treatment plan, one must decide on the goals of treatment (to cure, delay complications, relieve acute distress, reassure, or comfort). Clearly, more than one goal may be chosen. For example, when diagnosing and treating type 2 diabetes, one may seek to cure (counsel weight loss and exercise), to delay or prevent complications (seek tight glucose control), and to relieve distress, reassure, and comfort (listen to the patient's fears, reassure the patient that diabetes is a treatable disease and that he or she will not be abandoned).

TABLE 2.4 Guidelines for Assessing a Study of Prognosis

Was a representative and well-defined sample of patients (at a similar point in disease course) assembled?
Are these patients similar to my own?
Was followup sufficiently long and complete?
Were objective and relevant outcome criteria developed and used?
Was the outcome assessment “blind”?
Was adjustment for important prognostic factors carried out?

Adapted from Laupacis A, Wells G, Richardson WS, et al. Users’ guides to the medical literature. V: how to use an article about prognosis. JAMA 1994;272:234.

Once the goals have been set, treatments are chosen. Unfortunately, many treatments have never been tested scientifically in ways that answer the questions that are of interest to clinicians and their patients (e.g., probability of benefit, size of benefit, onset time and duration of response, frequency of complications of treatment), and many aspects of treatment are difficult to measure through scientific experiments. Fortunately, drugs and procedures are increasingly being subjected to clinical trials, and measures of quality of life are being included in the evaluation of therapies.

The clinical trial is the current standard for assessment of drugs and therapeutic procedures. The strongest clinical trials are randomized, double-blinded controlled trials. The strength of a randomized controlled trial (RCT) is that the study groups are likely to be similar with respect to known determinants of outcome, as well as those determinants that are unknown. However, randomization is often difficult to accomplish in the real world, where patients are free to join or refuse to join a clinical trial and where money to support research is limited. Theoretically, in a trial that is double blinded (meaning that neither the patient nor the researcher knows who is receiving the experimental treatment), the researchers’ and patients’ assessment of outcome is not biased by prior knowledge of their assignment (e.g., to placebo or to active treatment). However, studies may not be truly blinded; for example, in a trial of β-blockers against placebo, patients and clinicians can measure pulse rates. Nonetheless, the clinical trial is the least-biased method currently available for researchers to test how well drugs and other interventions work in ideal situations (efficacy) and in the real world (effectiveness). Table 2.5 lists guidelines that clinicians can use when assessing the results of a clinical trial. As illustrated in the table, there are important questions to ask of a clinical trial that reports benefits to treated subjects. Were clinically relevant outcomes, such as measures of patient health (e.g., morbid events, functional status) reported, and not just surrogate end points (e.g., reduction of blood pressure)? Was all-cause mortality, not just mortality caused by the disease in question (e.g., colon cancer), reported? In addition to reporting the statistical significance of findings (the probability that the findings are not due to chance), did the study discuss or clarify the clinical significance of the findings (whether the benefits were clinically meaningful)? As the size of a study increases, there is an increased likelihood that clinically small or nonmeaningful benefits, which are nonetheless statistically significant, will be demonstrated.

TABLE 2.5 Guidelines for Assessing a Study of Treatment (Clinical Trials)

Was the assignment of patients to treatments really randomized?
Were all clinically relevant outcomes reported?
Were the study patients recognizably similar to my own?
Were both statistical and clinical significance considered?
Is the treatment feasible for patients in my practice?
Was the analysis performed on an intention-to-treat basis?
Were all patients who entered the study accounted for at its conclusion?

Adapted from Sackett DL, Haynes RB, Guyatt GH, et al. Clinical epidemiology: a basic science for clinical medicine. 2nd ed. Boston: Little, Brown, 1991.

Moreover, one must pay close attention to the followup of the subjects enrolled in trials; intention-to-treat analysis is a strategy for analyzing data in which all study participants are analyzed in the group to which they were assigned, regardless of whether they dropped out, were noncompliant, or crossed over to another treatment or nontreatment group. Such an analysis may weaken the ability of a study to demonstrate the effect of a treatment, but it prevents selection biases caused by differences in participants who drop out from a treatment compared with those who remain.
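The difference between an intention-to-treat analysis and a "per-protocol" analysis (which counts only patients who completed their assigned treatment) can be sketched in a few lines of code. All patient counts below are invented for illustration:

```python
# Hypothetical illustration of intention-to-treat (ITT) vs. per-protocol
# analysis; all patient counts are invented for demonstration.

def event_rate(events, n):
    """Proportion of participants with the outcome event."""
    return events / n

# Suppose 100 patients were randomized to treatment and 100 to placebo.
# In the treatment arm, 20 patients dropped out; 8 of the dropouts and
# 10 of the 80 completers had the adverse outcome. In the placebo arm,
# 20 of 100 patients had the outcome.
itt_rate = event_rate(10 + 8, 100)      # ITT: everyone as randomized -> 0.18
pp_rate = event_rate(10, 80)            # Per-protocol: completers only -> 0.125
placebo_rate = event_rate(20, 100)      # 0.20

print(f"ITT event rate:          {itt_rate:.3f}")
print(f"Per-protocol event rate: {pp_rate:.3f}")
print(f"Placebo event rate:      {placebo_rate:.3f}")
# The per-protocol rate looks more favorable, but it discards the
# dropouts, who may differ systematically from completers -- the
# selection bias that ITT analysis is designed to prevent.
```

In this invented example, the treatment appears much more effective under the per-protocol analysis (0.125 vs. 0.20) than under the more conservative ITT analysis (0.18 vs. 0.20).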

Researchers often report treatment outcomes in terms of the relative risk reduction (RRR), which is the difference in event rates between the control and experimental groups, expressed as a proportion of the event rate in the control group: RRR = (control event rate - experimental event rate) ÷ control event rate. The difference between the two event rates itself is the absolute risk reduction (ARR): ARR = control event rate - experimental event rate; thus RRR = ARR ÷ control event rate. The RRR is meaningful only in the context of absolute risk and can be misleading when applied to individual patients. If a patient is at very low risk for an adverse outcome, a treatment with even a high RRR will have a negligible effect on that patient's absolute risk; conversely, for a patient at high risk for an adverse event, even a small RRR can have a significant impact. One method of incorporating absolute risk into an assessment of an intervention's impact, besides stating the ARR, is to calculate the number needed to treat (NNT). This is the number of persons who must be treated for one person to benefit, and it is a more useful measure for a clinician than the RRR. The calculation for NNT is simply 100% ÷ ARR, with ARR expressed as a percentage, or 1 ÷ ARR, with ARR expressed as a fraction (e.g., 0.10 for an ARR of 10%).

These concepts can be illustrated using the results of two trials of beta-hydroxy-beta-methylglutaryl-coenzyme A (HMG-CoA) reductase inhibitors (“statins”) for the prevention of myocardial infarction. The Scandinavian Simvastatin Survival Study (4S) included subjects with high cholesterol levels and a history of coronary heart disease (8). In contrast, the Air Force/Texas Coronary Atherosclerosis Prevention Study (AFCAPS/TexCAPS) trial included a lower-risk group of individuals with average cholesterol levels and no known heart disease (9). Table 2.6 provides the rates of myocardial infarction (fatal and nonfatal) in each trial and shows how to calculate the RRR, ARR, and NNT. Although treatment with a statin in both trials yielded similar relative risk reductions (≈40%), the absolute risk reductions and numbers needed to treat are quite different. This illustrates the importance of understanding an individual's risk when trying to gauge the impact of a therapeutic intervention: a practitioner (on average) would need to treat 83 patients with average cholesterol levels and no history of heart disease with a statin for 5 years to prevent one myocardial infarction, whereas only 12 patients with high cholesterol levels and heart disease would need to be treated to prevent one event.

TABLE 2.6 Use of Data to Estimate Clinical Consequences of Treatment: Comparison of Two Trials

(Event rates are fatal or nonfatal MI after 5 years.)

Trial | Control (placebo) | Treatment (statin) | RRR = (Control - Treatment) ÷ Control | ARR = Control - Treatment | NNT to Benefit One Patient = 100% ÷ ARR
4S (1) | 22% | 14% | (22% - 14%) ÷ 22% = 36% | 22% - 14% = 8% | 100% ÷ 8% = 12 patients
AFCAPS/TexCAPS (2) | 2.9% | 1.7% | (2.9% - 1.7%) ÷ 2.9% = 41% | 2.9% - 1.7% = 1.2% | 100% ÷ 1.2% = 83 patients

ARR, Absolute risk reduction; NNT, number needed to treat; RRR, relative risk reduction.
Data from (1) Scandinavian Simvastatin Survival Study. Randomised trial of cholesterol lowering in 4444 patients with coronary heart disease. Lancet 1994;344:1383, and (2) Downs JR, Clearfield M, Weis S, et al. Primary prevention of acute coronary events with lovastatin in men and women with average cholesterol levels: results of AFCAPS/TexCAPS. Air Force/Texas Coronary Atherosclerosis Prevention Study. JAMA 1998;279:1615.

There are a few caveats about clinical trials. Although the RCT is the best study design for assessing the value of a treatment, one should be cautious about relying on the results of any single study, even one that was done well. Systematic reviews and meta-analyses, which combine the results of a number of studies, are discussed later in this chapter (see Keeping Up). Sometimes clinical trials have not been performed; in this situation, the clinician may need to rely on cohort, case-control, or cross-sectional studies. These types of studies are more commonly used to assess risk or harm and are discussed in the next section.

Risk or Potential Harm

Practitioners are frequently called on to make assessments and judgments regarding risk or potential harm resulting from either medical interventions or environmental exposures. Table 2.7 summarizes some of the guidelines for assessing evidence of harm. Ideally, these questions would be answered in an RCT; however, for obvious ethical reasons, RCTs are not undertaken with the intent of studying a harmful exposure. Sometimes, a potentially beneficial intervention is unexpectedly found to be harmful in a clinical trial, or there may be both benefits and harms associated with the intervention.

More commonly, harm is addressed through observational studies. One kind of observational study is a cohort study, in which exposed and unexposed patients are identified and monitored for a period of time, and outcomes in the two groups are compared. For example, a cohort of cigarette smokers and nonsmokers could be monitored and the incidence of lung cancer in both groups measured. In these studies, the two groups may differ with respect to important determinants of outcome other than the exposure being studied (confounding variables). Researchers often can statistically adjust for these factors, but there may be other contributing factors of which they are unaware.
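The basic measure of association in a cohort study is the relative risk, the ratio of the outcome's incidence in the exposed group to its incidence in the unexposed group. A minimal sketch, with invented counts:

```python
# Relative risk from a hypothetical cohort study; the counts below are
# invented for illustration only.

def relative_risk(exposed_cases, exposed_n, unexposed_cases, unexposed_n):
    """Ratio of outcome incidence in exposed vs. unexposed participants."""
    return (exposed_cases / exposed_n) / (unexposed_cases / unexposed_n)

# Hypothetical cohort: of 1,000 smokers, 30 develop lung cancer;
# of 1,000 nonsmokers, 2 develop lung cancer.
rr = relative_risk(30, 1000, 2, 1000)
print(f"Relative risk = {rr:.0f}")  # 15: smokers had 15 times the incidence
```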

TABLE 2.7 Guidelines for Assessing a Study of Harm

What type of study was reported: a prospective cohort study (with or without comparison group); a retrospective case-control study; a cross-sectional study; a case series; or a case report?
Were comparison groups clearly identified and similar with respect to potential determinants of outcome, other than the one of concern? If not, were differences in potential determinants controlled for in the analysis of data?
Were outcomes measured the same way in the groups compared (and was the assessment objective and blinded)?
Was followup sufficiently long and complete?
Was there a temporal relationship between exposure and harm?
Was there a dose–response gradient?
What was the magnitude of the risk, and how precise is this estimate?

Adapted from Levine M, Walter S, Lee H, et al. User's guides to the medical literature. IV: how to use an article about harm. JAMA 1994;271:1615.

Another method of assessing harm is through case-control studies. In these studies, patients with an outcome of interest (cases) are identified and compared with others who are similar in respects other than the outcome (controls). Exposure rates in the case and control groups are then compared to look at the association between the exposure and the outcome. For example, the smoking rate in a group of patients with lung cancer may be compared with a group of patients without lung cancer who are otherwise similar. These studies are subject to recall bias: patients with an illness may be more likely to recall or report an unusual exposure than those who are not ill. In addition, like cohort studies, they are limited by the possibility of differences in unidentified risk factors between the groups.
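Because a case-control study samples on the outcome rather than the exposure, incidence (and hence relative risk) cannot be computed directly; instead, the odds of exposure among cases are compared with the odds of exposure among controls. A sketch with invented counts:

```python
# Odds ratio from a hypothetical case-control study; all counts are
# invented for illustration only.

def odds_ratio(cases_exposed, cases_unexposed,
               controls_exposed, controls_unexposed):
    """Ratio of exposure odds among cases to exposure odds among controls."""
    return ((cases_exposed / cases_unexposed)
            / (controls_exposed / controls_unexposed))

# Hypothetical: 80 of 100 lung-cancer cases smoked (odds 80:20),
# vs. 30 of 100 otherwise similar controls (odds 30:70).
print(f"Odds ratio = {odds_ratio(80, 20, 30, 70):.1f}")  # 9.3
```

When the outcome is rare, the odds ratio approximates the relative risk that a cohort study would have measured.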

Cohort and case-control studies can also be used to assess potentially beneficial associations, as was done in studies that suggested a cardiovascular benefit of hormone replacement therapy. However, this benefit was not demonstrated when studied in an RCT (10), calling the purported benefit into question and highlighting the limitations of observational data.

Weaker designs for identifying risk or harm include cross-sectional studies, case series, and case reports. Cross-sectional studies can establish associations but not causal links. They are strengthened by statistical methods that control for confounding variables (potential determinants of harm other than the one of concern). Temporal relationships, however, are usually not established. In case reports or case series, adverse outcomes associated with a particular exposure are reported in a single patient or group of patients. These reports are useful for identifying potentially harmful exposures to be studied further, but they are weak evidence for a causal relationship by themselves. However, if the outcome is very harmful and otherwise rare, this kind of evidence may be sufficient to take action. This might occur, for example, when severe adverse reactions associated with a particular medication are reported, especially if safer alternatives exist. A recent example is troglitazone, which was taken off the market after case reports of severe hepatotoxicity associated with its use.

Cost-Effectiveness

In ambulatory practice, cost considerations arise frequently. Cost-effectiveness analyses evaluate health care outcomes in relation to cost. The primary goals are to determine the most efficient use of resources and to minimize the costs associated with the achievement of health goals and objectives. A common strategy for cost-effectiveness studies is to compare a novel approach or therapy with the current practice or standard of care. The time frame of the study should be long enough to allow for costs and long-term benefits to be realized. The perspective of the analysis takes into account who benefits from the intervention as well as who pays for it (society, the payer, or the patient). Cost-effectiveness analyses often rely on a number of assumptions, and small variations in one or more of these parameters can have a significant effect on the conclusions; a sensitivity analysis can help determine how sensitive the outcomes are to changes in the parameters.
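The role of a sensitivity analysis can be illustrated with a minimal one-way sketch: hold everything fixed except one uncertain parameter and watch how the result shifts. All costs and effect sizes below are hypothetical:

```python
# One-way sensitivity analysis for a simple cost-effectiveness measure;
# the drug cost and the range of absolute risk reductions (ARRs) are
# hypothetical values chosen only to illustrate the idea.

def cost_per_event_prevented(drug_cost, arr):
    """Cost to prevent one event = cost of treating NNT (= 1/ARR) patients."""
    nnt = 1 / arr
    return drug_cost * nnt

baseline_drug_cost = 1000.0  # hypothetical 5-year cost per patient treated

# Vary the assumed ARR across a plausible range and see how sensitive
# the conclusion is to that single assumption.
for arr in (0.04, 0.08, 0.12):
    cost = cost_per_event_prevented(baseline_drug_cost, arr)
    print(f"ARR {arr:.0%}: ${cost:,.0f} per event prevented")
```

Here a threefold variation in the assumed risk reduction changes the cost per event prevented by a factor of three, showing how heavily the conclusion leans on that one parameter.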

TABLE 2.8 Guidelines for Assessing a Study with an Economic Analysis of Clinical Practice

Did the analysis provide a full economic comparison of health care strategies?
Were the costs and outcomes appropriately measured and valued?
Were the estimates of costs and outcomes related to the baseline risk in the treatment population?
Was a sensitivity analysis performed that included a range of estimates for important assumptions? Are the findings consistent across reasonable ranges of assumptions, or do they change as the assumptions vary within reasonable ranges?
What were the incremental costs and outcomes of each strategy?
Are treatment benefits worth the harms and costs?

Adapted from Drummond MF, Richardson WS, O'Brien BJ, et al. User's guides to the medical literature. XIII: how to use an article on economic analysis of clinical practice. JAMA 1997;277:1552.

Whether decisions are being made for a population (e.g., frequency of screening colonoscopy, drugs to be added to a formulary) or for a particular patient (e.g., choice of antihypertensive medicine), the potential benefits should be weighed against the resources used and money spent. Table 2.8 summarizes some guidelines for assessing evidence in studies performing economic analyses.

In cost-effectiveness analyses, costs usually are measured in monetary units (e.g., dollars) and a single clinical outcome is considered (e.g., mortality). In cost-utility analyses, multiple clinical outcomes, including quality of life, are represented, resulting in the calculation of quality-adjusted life years (QALYs). In both types of analyses, alternative diagnostic or therapeutic approaches are studied with a primary emphasis placed on economic considerations.
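A common summary statistic in such analyses is the incremental cost-effectiveness ratio: the extra cost of the new strategy divided by the extra benefit it buys. A sketch with hypothetical figures:

```python
# Incremental cost-effectiveness ratio (ICER) in a cost-utility
# framework; all costs and QALY estimates are hypothetical.

def icer(cost_new, qaly_new, cost_old, qaly_old):
    """Extra cost per additional quality-adjusted life year (QALY)."""
    return (cost_new - cost_old) / (qaly_new - qaly_old)

# Hypothetical: a new therapy costs $12,000 and yields 4.5 QALYs;
# standard care costs $4,000 and yields 4.0 QALYs.
print(f"ICER = ${icer(12000, 4.5, 4000, 4.0):,.0f} per QALY gained")
```

Whether the resulting cost per QALY is acceptable depends on the perspective of the analysis and the threshold the decision maker is willing to pay.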

Keeping Up

One of the major challenges to clinicians is keeping one's personal fund of medical knowledge current. Studies suggest that older practitioners are often “out of date” and tend to provide lower-quality care (11). For primary care practitioners, who are expected to know about a wide array of clinical topics, keeping up-to-date can be particularly difficult. It has been suggested that each practitioner should develop a personal mission as to the extent of “up-to-datedness” he or she hopes to achieve and maintain. Two questions that may help to better define this territory are (a) “What information do I need to have in my head to be satisfied with my knowledge base for the performance of my job?” and (b) “What information would I be embarrassed not to know?” (12).

One author estimated that if clinicians tried to keep up with the medical literature by reading one article each day, they would be 55 centuries behind in their reading after 1 year (see Sackett et al., Clinical Epidemiology, at http://www.hopkinsbayview.org/PAMreferences). In a seminal study, experienced clinicians in ambulatory practice said they had about two clinical questions per week that went unanswered; however, when shadowed in day-to-day practice, they were found to actually have about two unanswered questions for every three patients seen (13). Moreover, although these clinicians said that their main sources of information were textbooks and journals, their behavior showed that they got most of their clinical information from colleagues and drug detailers (pharmaceutical sales representatives). Fortunately, in ambulatory medicine, some high-quality secondary or abstracting publications exist that produce abstracts and often provide expert commentary on clinical articles believed to be of particular importance (approximately 2% to 3% of articles screened from hundreds of journals) (14). Examples are the ACP Journal Club and Evidence-Based Medicine.

Scheduling time to find and retrieve relevant reading material is a critical step in keeping up-to-date. The actual reading of the retrieved material can occur either in the scheduled time or when a lull presents itself (e.g., a patient no-show). Proactive scanning, or browsing through a small number of peer-reviewed journals that regularly yield articles relevant to one's clinical practice, is an integral part of keeping up. Reactive learning (also called problem-focused learning) is stimulated by clinical encounters or questions from patients or medical learners and requires searching to find the appropriate materials (steps 1 and 2 of the core EBM skills described at the beginning of this chapter). Sackett described the “educational prescription” as a means of phrasing and keeping track of questions as they arise, with the intent of searching later for the best available evidence to answer them. A combination of proactive and reactive approaches is thought to represent the ideal balance for dealing with the evolution of medical knowledge. Several additional ideas have been suggested by authors who have pondered the challenge of keeping clinically up-to-date (Table 2.9) (15).

Although original research articles continue to be an excellent source for new information, other types of publications can also be helpful in the quest to stay current. One common source of medical information is the overview. The chapters of this book (and other textbooks) are one example of an overview; review articles in medical journals are another. These types of overviews are easy to access (especially if the textbook is at hand) and easy to use; they require little work or effort to obtain needed information. However, they are limited by the biases and limitations of the authors and typically do not explain how the information was gathered or how conclusions were reached.

TABLE 2.9 Elements of an Information Plan

Browse at least one general journal regularly.
Maintain surveillance on new information.
Establish reliable ways of looking up common facts.
Identify a set of ways to look up obscure facts.
Develop critical appraisal skills.
Set aside high-quality time regularly to deal with information needs.
Invest time to discover new sources of useful information.

From Fletcher RH, Fletcher SW. Keeping clinically up-to-date. J Gen Intern Med 1997;12:S5.

Systematic reviews and meta-analyses published in peer-reviewed journals, with detailed methods specifically describing the literature search and the inclusion/exclusion criteria for the original articles, can be invaluable. Critical appraisal methods for these two article types have been developed and can be applied to evaluate the quality of the work (16); Table 2.10 summarizes these methods. Limitations that need to be considered include the heterogeneity of studies (with regard to populations studied and outcomes assessed) and the fact that small studies with negative results are less likely to be published than those with positive results (publication bias). Authors often try to correct for these limitations, but meta-analyses have sometimes yielded results and conclusions that were discordant with subsequent large RCTs (17). Nevertheless, meta-analysis can be a powerful tool for synthesizing the available evidence in an unbiased fashion. In addition to the reviews published in medical journals, the Cochrane Collaboration (and the Cochrane Library; see http://www.hopkinsbayview.org/PAMreferences) represents an international endeavor to develop, maintain, and disseminate systematic reviews on clinical and health-related topics.
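The quantitative core of many meta-analyses is inverse-variance weighting: each study's estimate is weighted by the inverse of its variance, so that larger, more precise studies count for more. A toy fixed-effect sketch, with invented study estimates:

```python
# Toy inverse-variance fixed-effect pooling, the basic arithmetic behind
# many meta-analyses; the study estimates and variances are invented.

def pooled_estimate(estimates, variances):
    """Weight each study by the inverse of its variance and average."""
    weights = [1 / v for v in variances]
    total = sum(weights)
    pooled = sum(w * e for w, e in zip(weights, estimates)) / total
    pooled_variance = 1 / total   # precision of the pooled estimate
    return pooled, pooled_variance

# Three hypothetical studies reporting log risk ratios with variances.
est, var = pooled_estimate([-0.30, -0.10, -0.25], [0.04, 0.02, 0.08])
print(f"Pooled log RR = {est:.3f} (variance {var:.4f})")
```

Real meta-analyses add much more (random-effects models, heterogeneity statistics, publication-bias diagnostics), but the weighting principle is the same.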

TABLE 2.10 Guidelines for Assessing a Review Article

Are the results of the study valid?
Did the review address an explicitly described, focused clinical question?
Were appropriate criteria (inclusion and exclusion) used for selecting studies for review?
Were search strategies explicitly described, thorough, and appropriate? Is it unlikely that important, relevant studies were missed?
Was the validity of the included studies appraised and accounted for?
Were assessments of the studies reproducible?
Were results similar from study to study?
If data from different studies were combined quantitatively, were the methods explicit and reasonable?
What are the results?
What are the overall results of the review?
How precise are the results?
Are the results presented in a manner that permits comparison and synthesis of the key features and findings of the studies reviewed?
Will the results help me in caring for my patients?
Can the results be applied to my patient care?
Were all clinically important outcomes (benefits and harms) considered?
Are the benefits worth the harms and costs?

Adapted from Oxman AD, Cook DJ, Guyatt GH. Users' guides to the medical literature. VI: how to use an overview. Evidence-Based Medicine Working Group. JAMA 1994;272:1367.

Guidelines are systematically developed statements that offer recommendations to assist with decision making in specific situations. It has been found that clinicians often do not employ effective interventions (e.g., prescribing beta-blockers to patients after a myocardial infarction). Guidelines serve the dual purpose of offering easily accessible recommendations for practitioners and publicizing these recommendations to practitioners and the general public. Guidelines typically are developed by expert panels. They are best when they employ explicit criteria for gathering the evidence and making recommendations and acknowledge the level of evidence for each recommendation. Guidelines may be biased by the composition of the expert panel, and sometimes conflicting guidelines are disseminated by different organizations. For example, the American Urological Association recommends offering prostate-specific antigen (PSA) determinations to screen for prostate cancer, whereas the United States Preventive Services Task Force does not. Table 2.11 lists some suggestions for evaluating practice guidelines.

Each information source has strengths and weaknesses. Colleagues may be misinformed. Drug detailers have a product to sell, making them biased. Textbooks are often out of date by the time they are printed. Traditional continuing medical education courses provide variable degrees of evidence-based education and have been shown to have little effect on practice.

TABLE 2.11 Guidelines for Assessing a Practice Guideline

Was a recent, reproducible, and comprehensive review of the literature carried out?
Were the methods of the review explicit and strong? Specifically, were inclusion and exclusion criteria explicit and reasonable? Were methods of synthesizing the data explicit and reasonable? Was each recommendation assigned a level of evidence supporting it and a strength based on an explicit synthesis of all considerations?
How does this guideline compare to other guidelines? Does the group issuing this guideline have biases or conflicts of interest?
Have important studies been conducted subsequent to the guideline that would alter the recommendations?
Does the burden of the problem addressed warrant implementation of the guideline?
Would implementation of the guideline be cost-effective and feasible?

Adapted from Sackett DL, Straus SE, Richardson WS, et al. Evidence-based medicine: how to practice and teach EBM. 2nd ed. Edinburgh: Churchill Livingstone, 2000.

Because “keeping up” with the medical literature represents a colossal challenge, some authors have provided some direction for how to optimize the chance that one's time investment will result in a reasonable return (18,19). They suggest that the usefulness of medical information for a given provider is proportional to its relevance, validity, and accessibility. Relevance relates to the frequency with which the provider encounters the topic. Validity refers to the quality of the information and the likelihood that the information is true. Accessibility connotes the ease with which the information source can be retrieved. These authors recommend that practitioners seek out information sources that are relevant, valid, and easily accessible.

Finally, medical librarians can be extraordinarily helpful in keeping clinicians in touch with changes in the medical literature, and most are happy to meet with clinicians to make them aware of new resources. Befriending one's medical librarian is a critical component of a “keeping up” strategy and can pay huge dividends in the pursuit of evidence-based medical practice.

Specific References*

For annotated General References and resources related to this chapter, visit http://www.hopkinsbayview.org/PAMreferences.

  1. Sackett DL, Rosenberg WM, Gray JAM, et al. Evidence-based medicine: what it is and what it isn't. BMJ 1996;312:71.
  2. Wilkinson EK, Bosanquet A, Salisbury C, et al. Barriers and facilitators to the implementation of evidence-based medicine in general practice: a qualitative study. Eur J Gen Pract 1999;5:66.
  3. Barrows HS, Norman GR, Neufeld VR, et al. The clinical reasoning of randomly selected physicians in general medical practice. Clin Invest Med 1982;5:49.
  4. Pewsner D, Battaglia M, Minder C, et al. Ruling a diagnosis in or out with “SpPIn” and “SnNOut”: a note of caution. BMJ 2004;329:209.
  5. Shea JA, Berlin JA, Escarce JJ, et al. Revised estimates of diagnostic test sensitivity and specificity in suspected biliary tract disease. Arch Intern Med 1994;154:2573.
  6. Guyatt GH, Patterson C, Ali M, et al. Diagnosis of iron deficiency in the elderly. Am J Med 1990;88:205.
  7. McGee S. Simplifying likelihood ratios. J Gen Intern Med 2002;17:646.
  8. Scandinavian Simvastatin Survival Study Group. Randomised trial of cholesterol lowering in 4444 patients with coronary heart disease. Lancet 1994;344:1383.
  9. Downs JR, Clearfield M, Weis S, et al. Primary prevention of acute coronary events with lovastatin in men and women with average cholesterol levels. JAMA 1998;279:1615.
  10. Hulley S, Grady D, Bush T, et al. Randomized trial of estrogen plus progestin for secondary prevention of coronary heart disease in postmenopausal women. JAMA 1998;280:605.

  11. Choudhry NK, Fletcher RH, Soumerai SB. Systematic review: the relationship between clinical experience and quality of health care. Ann Intern Med 2005;142:260.
  12. Laine C. How can physicians keep up to date? Annu Rev Med 1999;50:99.
  13. Covell DG, Uman GC, Manning PR. Information needs in office practice: are they being met? Ann Intern Med 1985;103:596.
  14. Wyatt JC. Reading journals and monitoring the published work. J R Soc Med 2000;93:423.
  15. Fletcher RH, Fletcher SW. Evidence-based approach to the medical literature. J Gen Intern Med 1997;12:S5.
  16. Oxman AD, Cook DJ, Guyatt GH. Users' guides to the medical literature: VI. How to use an overview. Evidence-Based Medicine Working Group. JAMA 1994;272:1367.
  17. Borzak S, Ridker PM. Discordance between meta-analyses and large-scale randomized controlled trials: examples from the management of acute myocardial infarction. Ann Intern Med 1995;123:873.
  18. Smith R. What clinical information do clinicians need? BMJ 1996;313:1062.
  19. Shaughnessy AF, Slawson DC, Bennett JH. Becoming an information master: a guidebook to the medical information jungle. J Fam Pract 1994;39:489.

