The Washington Manual of Oncology, 3rd Ed.

Biostatistics as Applied to Oncology

Kathryn M. Trinkaus • Feng Gao • J. Philip Miller

I. INTRODUCTION. Statistics is the mathematical science of estimation in the presence of uncertainty. Its strengths are identifying patterns and teasing out relations in complex data, comparing information from multiple sources, quantifying similarities or differences, and estimating the degree of uncertainty or level of confidence with which to regard the results. Statistics includes an extensive toolbox of solutions for practical problems, with a well-established mathematical foundation. Statistics is more than just tricks, tests, and theorems, however. In a broader sense, it is an efficient, systematic, and reproducible means of investigating patterns and relations in complex data. It is a framework for organized thinking about the issues that generate these data.

  A. A few words about data. The hypotheses of a study state the scientific ideas being tested, the objectives state the tasks required to test those hypotheses, and the end points are the quantities that will be measured while carrying out the tests. A good end point is clearly related to the biological or behavioral process that it measures, ascertainable with minimal error, and readily reproducible.

If an end point cannot be directly observed, a related quantity may be substituted as a surrogate. Surrogates are ethically preferable if the true end point requires invasive procedures or otherwise puts the patient at additional risk. They may be more efficient if the true end point takes a long time to observe or is costly to obtain. To be valid, a surrogate must provide the same conclusion as a test of the true end point, so it must respond to disease and treatment in the same manner as the true end point. Association alone is not sufficient, nor is availability of a more precise measurement. A surrogate end point will produce useless or misleading results if it is precise at the expense of accurately capturing the quantity of interest.

A useful end point is consistently observable and easy to record accurately, as missing data put a study at risk of failure. Primary end points are used to achieve the primary objectives, so every missing value of a primary end point is the loss of a participant. Systematic data loss occurs when most end points are missing for some individuals, or when a single end point is missing for most individuals. Dropping variables or individuals with missing data may substantially influence, or bias, results by narrowing the scope of the study or by reducing its power to identify patterns and differences accurately. Numerous methods of substituting values for those that are missing are available. Among the most effective is multiple imputation, which uses a probability model to impute values based on known characteristics of the subject. Values are imputed several times, and the results are combined to provide an estimate of the missing value and of the precision of the imputation. Missing data may also be important indicators that a study is encountering logistic, administrative, or procedural difficulties. The best solution is to monitor data loss and address the underlying problems as promptly as possible. A study is only as successful as its data are accurate, precise, and consistently recorded. Good experimental design and data analysis strategies can help with the complex realities of biomedical and clinical research, but they are no substitute for good data.
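 As a concrete illustration of multiple imputation, the sketch below uses scikit-learn's IterativeImputer to fill in missing values of a simulated numeric data matrix several times and then pools the resulting estimates. The data, the number of imputations, and the statistic being pooled are assumptions chosen for illustration, not part of the chapter.

```python
# Minimal multiple-imputation sketch: impute a numeric data matrix several
# times with different random seeds and pool the resulting estimates.
# The data and the example statistic (a column mean) are illustrative.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X[rng.random(X.shape) < 0.1] = np.nan      # ~10% of values missing at random

estimates = []
for seed in range(5):                      # five imputed data sets
    imputer = IterativeImputer(sample_posterior=True, random_state=seed)
    X_imp = imputer.fit_transform(X)
    estimates.append(X_imp[:, 0].mean())   # statistic of interest per data set

# Pool across imputations: the mean of the estimates and their spread give
# the pooled estimate and a sense of the uncertainty added by imputation.
print(np.mean(estimates), np.std(estimates, ddof=1))
```

 A full analysis would combine within- and between-imputation variance (Rubin's rules) rather than the simple summary shown here.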

II. A SHORT INTRODUCTION TO PROBABILITY. Probabilities are used to describe discrete events, such as “response to therapy,” as well as the likelihood that a continuous measurement, such as serum creatinine or blood pressure, will take on a specific value. For brevity, both are referred to as events. The individuals to whom results of a clinical study will be generalized make up the target population. The probability that an event will occur can be defined as the frequency with which the event or value occurs in the target population; this definition is referred to as “frequentist.” A clinical study usually estimates frequencies in a sample from the target population. The sample size should be large enough to include all relevant features of the target population. The selection process is designed so that all members of the target population have the same (or a predefined) probability of being chosen for the sample; that is, the sample is randomly chosen. Randomness helps ensure that no characteristic of the target population is over- or underrepresented, so the selection process does not bias the conclusions of the study.

 The frequency of all possible states of an event (e.g., all possible levels of response) in the target population is the event’s probability distribution. Hypotheses are tested, inferences drawn, and conclusions reached by comparing frequencies observed in the sample with those expected from the probability distribution in the target population. Clearly defining the target population can be difficult but is necessary for sound frequentist statistical analysis.
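 A minimal sketch of the frequentist idea is shown below: a response rate and its 95% confidence interval are estimated from a random sample. The counts are invented, and the normal approximation is only one of several possible interval methods.

```python
# Sketch: frequentist estimate of a response rate from a random sample,
# with a normal-approximation 95% confidence interval. Numbers are invented.
import numpy as np
from scipy import stats

n, responders = 120, 42                    # hypothetical sample
p_hat = responders / n                     # observed frequency in the sample
se = np.sqrt(p_hat * (1 - p_hat) / n)      # standard error of the proportion
z = stats.norm.ppf(0.975)                  # ~1.96 for a two-sided 95% interval
print(p_hat, (p_hat - z * se, p_hat + z * se))
```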

 An alternative is to use existing knowledge, beliefs, or assumptions to define a prior (probability) distribution and the likelihood of each possible outcome. The prior distribution, likelihood function, and observed data are combined to generate a revised, posterior probability distribution for the measure of interest. This reasoning is based on a theorem about conditional probabilities first stated by Thomas Bayes; hence the term “Bayesian” statistics. Bayesian approaches are well suited to predictive modeling and iterative decision making, as in dose-finding studies or sequential toxicity monitoring, because the posterior distribution provides a new prior for the next stage of data collection. Even so-called “vague” or “noninformative” priors can have a substantial effect on conclusions and must be chosen with care. Defining a likelihood function can also be challenging. Frequentist and Bayesian approaches have a common mathematical foundation, and most standard analyses can be carried out in either framework.
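 The following sketch illustrates Bayesian updating in the simplest conjugate case, a Beta prior for a response rate combined with binomial data; the prior parameters and the counts are invented for illustration.

```python
# Sketch of Bayesian updating with a conjugate Beta prior for a response rate.
# Prior parameters and data are invented for illustration.
from scipy import stats

a_prior, b_prior = 2, 8                    # prior belief: response rate near 20%
responders, nonresponders = 12, 18         # observed data

# The posterior is Beta(a_prior + responders, b_prior + nonresponders)
posterior = stats.beta(a_prior + responders, b_prior + nonresponders)
print(posterior.mean())                    # posterior mean response rate
print(posterior.interval(0.95))            # 95% credible interval
# The posterior can serve as the prior for the next stage of accrual.
```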

 Generally speaking, two events are independent if the occurrence of one provides no information about the probability of occurrence of the second. In most cases, observations taken on separate and unrelated biological organisms are considered independent, whereas repeated observations taken from the same biological organism are dependent. Common sources of dependence are association in space (e.g., expression levels of two proteins from a single individual), time (e.g., measures at time of treatment and subsequent weekly intervals), function (e.g., blood pressure and heart rate of the same individual), or inheritance (e.g., genetic studies of family members). Dependence is a matter of degree and can be modeled.

 Repeated observations may be incorporated into experimental design and methods of analysis. Replication of a measure within an individual helps to better estimate within-subject differences, whereas taking measurements on additional individuals helps to better estimate differences between subjects. Repeating an experiment with the same samples already analyzed (technical replication) is primarily useful for quality control and adds little to any conclusion drawn about the study sample or target population.

 In general, the effective sample size is the number of independent observations, not the number of events or measurements. The more complex or variable a quantity is, the larger the number of independent measurements needed to adequately describe it.

III. MEASUREMENT WITH ERROR. Random error occurs by chance alone and is a part of most measurements in a clinical study. Over a large series of measurements, random error has an average value of zero, so it can be reduced by replication. Systematic error is a more serious problem, as it is due to some aspect of the biological phenomenon of interest, the sample being studied, or the measurement process. Systematic error has an average value greater or less than zero in the long run, so it shifts (biases) estimates away from the true values of the quantities being observed. It may be amplified by replication rather than reduced. A study can be designed to minimize identified sources of error and so improve its estimates of the quantities of interest. Biological variability also contributes to the overall variability of observations. Variability in quantities of interest to the study is considered “signal”; variability in quantities not of interest contributes “noise.” A good experimental or clinical study design maximizes the capture of signal and minimizes the capture of noise.

 Most clinical measurements are random variables (RVs), measurements that may take on a different value for each experimental subject, with each possible value occurring with a specified probability. Discrete RVs fall into unordered (nominal) or ordered (ordinal) categories. The probability distributions of discrete RVs are often known, such as the binomial or multinomial probabilities of falling into two or more categories, respectively. Counts, especially of events per unit time or space, may approximate a Poisson distribution. Continuous RVs are measured on a real number scale with or without upper or lower boundaries. The values that completely describe a continuous distribution are its parameters. The parameters of a distribution are usually related to its mean (the location of its center) or variance (its spread). Common parametric distributions are generalizations from observation of natural processes, not merely mathematical abstractions, and many are related to one another. For a large number of observations of a relatively rare event, the Poisson is a good approximation of the binomial. For a large number of observations, both the binomial and the Poisson approach the normal (Gaussian) distribution.
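 The approximations mentioned above can be checked numerically; the sketch below compares binomial, Poisson, and normal probabilities for invented values of n and p.

```python
# Numerical check of the approximations described above: with many trials and
# a rare event, binomial and Poisson probabilities are close; with large
# counts, the binomial is close to a normal distribution. Values are invented.
from scipy import stats

n, p = 1000, 0.003                          # many observations, rare event
binom = stats.binom(n, p)
pois = stats.poisson(n * p)                 # Poisson with the same mean
print([(k, binom.pmf(k), pois.pmf(k)) for k in range(5)])

# Normal approximation to the binomial when counts are large
n, p = 1000, 0.4
binom = stats.binom(n, p)
norm = stats.norm(n * p, (n * p * (1 - p)) ** 0.5)
print(binom.cdf(420), norm.cdf(420.5))      # continuity-corrected comparison
```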

 If the parameters of a distribution can be estimated, so can the probability that the random variable will take on any given value. If it is very improbable that two sets of observations could have been drawn from a distribution with a single set of parameters, then there is evidence that the groups differ in that respect. This reasoning is the basis of most frequentist hypothesis tests. It also explains the large role of estimating means and variances in statistical analysis.

 If the observations do not seem to fit any known distribution, some eccentricities of shape can be adjusted by analyzing the data on an alternative scale, transforming the data to a shape with known properties. Transformation alters the intervals between observations, not their order, so it does not alter conclusions drawn from rank-based hypothesis tests. A log transform, for example, makes a multiplicative relationship additive, a useful feature if additive models such as regression are to be used. Taking ratios is a way of adjusting each measurement with respect to a baseline or denominator. Mildly eccentric data, such as skewed data without a large number of duplicate values, can also be analyzed with robust, nonparametric, or semiparametric methods. These methods are less strongly influenced by a few unusual values (outliers) and make fewer, weaker assumptions about the distribution from which observed values are drawn.
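 As a small illustration of the log transform, the simulated data below have a multiplicative relation that becomes linear (additive) on the log scale; the coefficients and noise level are invented.

```python
# Sketch: a log transform turns a multiplicative relation into an additive one,
# so an ordinary linear fit can be applied. Data are simulated.
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(1, 10, size=200)
y = 3.0 * x ** 2 * rng.lognormal(sigma=0.1, size=200)   # multiplicative noise

# On the log scale the model is linear: log(y) = log(3) + 2*log(x) + error
slope, intercept = np.polyfit(np.log(x), np.log(y), 1)
print(slope, np.exp(intercept))            # close to 2 and 3, respectively
```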

 Observations with multiple peaks, abrupt descents and reascents (“singularities”), large numbers of a single value (e.g., a “floor” at zero or a “ceiling” at the detection limit of the measuring instrument), or a combination of discrete and continuous elements are not adequately characterized by a few parameters. With some loss of information, the values can be categorized and discrete methods used. Another alternative is resampling, a form of simulation using probability (“Monte Carlo”) methods to draw repeated samples from the observed data. The repeated samples define an empirical probability distribution for a quantity of interest, such as a mean, and can be used to find confidence intervals or estimate bias, figuratively using the data to pull themselves up by their bootstraps. The same strategy can be used to test hypotheses by randomly permuting sample subgroup labels and testing the hypothesis of interest in each of many permuted samples. The permutation test p-value is the proportion of permuted-sample results that are as extreme as or more extreme than the result obtained from the correctly labeled subgroups.
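 The sketch below illustrates both resampling ideas on simulated data: a bootstrap confidence interval for a mean and a permutation test comparing two groups. The group sizes, effect size, and number of resamples are arbitrary choices.

```python
# Sketch of the resampling ideas above: a bootstrap confidence interval for a
# mean and a permutation test comparing two groups. Data are simulated.
import numpy as np

rng = np.random.default_rng(2)
group_a = rng.normal(1.0, 1.0, size=30)
group_b = rng.normal(1.5, 1.0, size=30)

# Bootstrap: resample with replacement and collect the statistic of interest
boot_means = [rng.choice(group_a, size=group_a.size, replace=True).mean()
              for _ in range(2000)]
print(np.percentile(boot_means, [2.5, 97.5]))     # empirical 95% interval

# Permutation test: shuffle the group labels and recompute the mean difference
observed = group_b.mean() - group_a.mean()
pooled = np.concatenate([group_a, group_b])
perm_diffs = []
for _ in range(2000):
    rng.shuffle(pooled)
    perm_diffs.append(pooled[30:].mean() - pooled[:30].mean())
p_value = np.mean(np.abs(perm_diffs) >= abs(observed))
print(observed, p_value)
```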

 If the data are too complex to be dealt with as a whole, piecewise methods, such as locally weighted regression and splines, including multivariate adaptive regression splines (MARS), may be preferable. These methods analyze the data in segments, estimating the regression curve over a small region rather than trying to find a single curve appropriate for the observations as a whole.

IV. VIEWING THE DATA. Given the complexity of biological phenomena, a preliminary overview is essential before diving into analysis. Plots, charts, lists, and frequency tables provide a comprehensive, visual representation of pattern, distribution, and difference. They highlight unusual data points, help with error checking, summarize the shape of individual variables, and illustrate relations between sets of variables. Visual summaries are so important for understanding data that it is essential to have a software program with good graphics capability.

 For continuous, interval, or ordinal variables, dot plots and stem-and-leaf plots contain a symbol for each data point, stacking the occurrences of each value. The location of the most common values, symmetry or skewness, and the presence of unusual values (outliers) are readily seen. Histograms summarize counts or proportions in a solid column rather than representing each data value separately. Relative numbers or proportions in nominal or ordinal categories are easily identified. Histograms are most informative when the height of the column represents the amount being displayed. The magnitude of a single value, such as a mean, may not be well represented by its height above zero and is often better illustrated by a point with error bars. Bivariate scatter plots are useful for examining relations between two continuous variables, as well as for finding the center(s) of the distribution and the location of and distance to outlying values. Box plots represent the distribution of a continuous variable in each of one or more categories. The “box” represents the middle of the distribution, usually the 25th to 75th percentiles. Lines extending outward from the ends of the box and plot symbols represent the spread of the data. A lattice plot is a matrix of scatter plots, one for each pair of a set of variables. Lattice plots make it easy to assess a number of variables at a glance, for example, as a first look at variables to be used in a multivariable analysis. Robust smoothing methods draw a curve through the bulk of the data points, giving more weight to nearby observations than to distant ones.
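 A minimal plotting sketch using matplotlib is shown below; the variables (age, a marker level, and a two-level group) are invented, and any package with good graphics could be used instead.

```python
# Minimal matplotlib sketch of a few plot types described above, using
# simulated data; variable names are illustrative.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
age = rng.normal(60, 10, size=200)
marker = 0.05 * age + rng.normal(0, 1, size=200)
group = rng.integers(0, 2, size=200)

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
axes[0].hist(age, bins=20)                                 # shape of one variable
axes[0].set_title("Histogram")
axes[1].boxplot([marker[group == 0], marker[group == 1]])  # distribution by group
axes[1].set_title("Box plot")
axes[2].scatter(age, marker, s=10)                         # relation of two variables
axes[2].set_title("Scatter plot")
plt.tight_layout()
plt.show()
```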

 A useful plot highlights patterns by suppressing detail, so it is often more useful to compare several kinds of plots than to add information to a single plot. The human eye is extraordinarily good at finding apparent patterns even in random scatter, especially when few data points are available. Plots are a starting point, but they do not replace a more rigorous statistical examination.

V. MAKING INFERENCES ABOUT DATA. The goal of most clinical studies is to improve clinical decision making and patient outcomes, so the data that are gathered will be used for making inferences and testing hypotheses. Hypothesis testing, whether frequentist or Bayesian, is a well-defined, repeatable methodology for answering questions about observed data using probability models. Frequentist hypothesis tests require a null hypothesis, which describes the background against which research results will be interpreted, and an alternative hypothesis, stating the expected result or difference. The probabilities of all possible results are calculated assuming the null hypothesis is true. Results too extreme to be probable when the null hypothesis is true are identified; these make up the critical (rejection) region, and the remaining results are considered compatible with the null hypothesis.

 Samples are drawn from a well-defined target population, with randomization to reduce selection bias, and the measure of interest is observed. If the measure falls outside the critical region, there is no evidence that the null hypothesis is false. If the measure falls within the critical region, the result is not compatible with the null hypothesis, and the alternative is chosen instead. There are two correct decisions: to accept the alternative when the null hypothesis is false and to fail to reject a true null hypothesis. The corresponding errors are to reject the null hypothesis when it is true (a false-positive or type I error) and to fail to reject the null hypothesis when it is false (a false-negative or type II error).

 In practice, a p-value is usually calculated, expressing the probability of results that are as extreme as or more extreme than the observed results, assuming that the observations are drawn from the probability distribution specified by the null hypothesis. “More extreme than” refers to values far from the center of the distribution, or in the “tails” of the distribution. If the alternative hypothesis is concerned with any difference from the null hypothesis, then the p-value measures the probability of falling into either tail of the distribution, a two-tailed test. If the alternative is concerned only with values greater than, or only values less than, the null value, then the p-value measures the probability of falling into a single tail, a one-tailed test. Two-tailed tests are more demanding as the area in each tail is smaller, and are generally preferred unless there is a strong reason for a one-tailed test.
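 The sketch below runs the same two-sample comparison as a two-tailed and a one-tailed t-test on simulated data; with a symmetric test statistic and an observed difference in the hypothesized direction, the one-tailed p-value is half the two-tailed value.

```python
# Sketch: two-sample t-test run two-sided and one-sided with scipy.
# Data are simulated; group means and sizes are invented.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
control = rng.normal(10.0, 2.0, size=40)
treated = rng.normal(11.0, 2.0, size=40)

two_sided = stats.ttest_ind(treated, control, alternative="two-sided")
one_sided = stats.ttest_ind(treated, control, alternative="greater")
print(two_sided.pvalue, one_sided.pvalue)
```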

 Studies are designed to minimize the probability of a false-positive result (the significance level, or α) while maximizing the probability of a correct positive result (the study power). Conventional significance levels are 0.01 and 0.05, whereas power is usually no less than 0.8. Calculating study power is a routine part of designing any clinical trial that uses inferential procedures. Calculating power requires information about the expected values of end points under standard conditions (the null hypothesis) and under the study treatment (the alternative hypothesis), as well as the expected variability and precision of measurement. If some of this information is not available, the minimum detectable difference can be calculated for a given null hypothesis, sample size, and specified study power. In the absence of any preliminary evidence, no inferential procedures can be planned, and only an observational study is possible. If the preliminary information is weak or of uncertain relevance, study power can be reviewed at one or more interim analyses to ensure that the entire study does not rest on a shaky foundation. There is no useful information to be had from a post hoc power calculation carried out after the data have been gathered.
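 A minimal power calculation is sketched below using statsmodels; the standardized effect size, significance level, and target power are assumptions chosen only to illustrate the calculation.

```python
# Sketch of a sample-size calculation for a two-arm comparison of means.
# Effect size (in standard-deviation units), alpha, and power are assumptions.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
n_per_arm = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.8,
                                 alternative="two-sided")
print(round(n_per_arm))      # participants needed per arm (about 64)
```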

 Estimates of study power generally refer to single tests. If many tests are to be carried out, then some positive results may be observed purely by chance. A significance level of 0.05 implies that any result that is expected to occur less frequently than 5 in 100 times, or 1 in 20 times, is considered “unlikely.” If a large number of tests are carried out, the probability of at least one false-positive result can be large, making it necessary to adjust for the effect of multiple testing. Most such adjustments were developed for moderate numbers of tests in a single analysis or model. Their usefulness for combining results from several types of analyses on related data or from multiple studies (meta-analyses) is unclear. Genomic and proteomic studies, which may involve tens of thousands of tests, are also not well-served by traditional multiple testing correction methods. Multiple testing corrections are more useful in confirmatory studies, where the goal is to avoid a false-positive result, than in discovery studies, where overcorrection may prevent recognition of interesting results. Results from discovery studies usually require validation in fully independent studies.
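 The sketch below applies two common adjustments to an invented set of p-values: the Bonferroni correction, which controls the family-wise error rate, and the Benjamini-Hochberg procedure, which controls the false discovery rate and is often preferred in discovery settings.

```python
# Sketch of multiple-testing adjustment with statsmodels. P-values are invented.
from statsmodels.stats.multitest import multipletests

p_values = [0.001, 0.008, 0.039, 0.041, 0.20, 0.74]
reject_bonf, p_bonf, _, _ = multipletests(p_values, alpha=0.05,
                                          method="bonferroni")
reject_fdr, p_fdr, _, _ = multipletests(p_values, alpha=0.05,
                                        method="fdr_bh")
print(p_bonf)   # family-wise error rate control (strict)
print(p_fdr)    # false discovery rate control (less strict)
```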

 If the conditions of a hypothesis test are too difficult to meet or if a more information-rich result is needed, then a Bayesian procedure may be preferable. These make more efficient and comprehensive use of prior information. The caveat is that prior data must be substantially accurate if a reliable prior distribution and likelihood function are to be found.

VI. MODELING RELATIONS. Modeling provides a richer, more nuanced approach to data analysis than simple hypothesis testing. Traditional single or multiple linear models describe the relation between independent variables and a Gaussian-distributed dependent variable. The effect of each independent variable is adjusted for the effects of the others, so the model describes the joint effect of several covariates on the outcome. Models may be stratified, allowing different curves to be fit to subsets of the data and their distinctness tested. Generalized linear, nonlinear, and time-to-event models allow the same strategies to be applied to non-Gaussian or nonlinear dependent variables, as well as to events for which some patients’ times are unknown (censored). Mixed models extend the linear framework to random effects, independent variables whose observed levels represent a sample of the values to which conclusions will be generalized. Hierarchical models allow inclusion of multiple levels of dependence, such as multiple observations taken from each individual at several points in time or space. Robust methods can accommodate some forms of eccentrically distributed data by modeling a curve or surface in segments, estimating the value of the dependent variable from nearby observations. Splines take a similar piecewise approach in a more formal way, fitting a model to each segment.
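 As an illustration of covariate adjustment in a linear model, the sketch below fits an ordinary least squares regression with two simulated covariates using statsmodels; the variable names and coefficients are invented.

```python
# Sketch of a multiple linear regression in which the effect of each covariate
# is adjusted for the other. Variables are simulated and names are invented.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
age = rng.normal(60, 10, size=200)
dose = rng.uniform(0, 100, size=200)
outcome = 0.2 * age + 0.05 * dose + rng.normal(0, 2, size=200)

X = sm.add_constant(np.column_stack([age, dose]))   # intercept + covariates
model = sm.OLS(outcome, X).fit()
print(model.summary())          # adjusted coefficients, p-values, fit measures
```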

 The complexity of the relations being modeled is both a strength and a weakness of the modeling process. Confounding occurs when two or more independent variables are related to one another as well as to the outcome. Confounding can exaggerate or mask the effect of one or more covariates on the outcome, and it is best dealt with in the study design. A related problem is collinearity, in which two or more covariates provide redundant information about the outcome. One or more will appear less strongly related to the outcome than is actually the case. If it is important to estimate the effect of each covariate, several models can be created, each containing only one of the collinear covariates.

 The joint effect of several covariates may be quite different from their individual effects, or main effects. When the effect of a covariate differs depending on the presence or level of another covariate, an interaction is present. In this case, there is no way to interpret the effect of a single covariate in isolation; only the joint effects can be interpreted.
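 The sketch below fits an interaction using the statsmodels formula interface, in which "dose * mutated" expands to both main effects plus their product term; the data frame and its columns are invented for illustration.

```python
# Sketch of fitting an interaction with the statsmodels formula interface.
# The data are simulated so that response to dose differs by mutation status.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(6)
df = pd.DataFrame({
    "dose": rng.uniform(0, 10, size=300),
    "mutated": rng.integers(0, 2, size=300),
})
df["response"] = (1.0 * df["dose"] * df["mutated"]
                  + 0.2 * df["dose"] + rng.normal(0, 1, size=300))

fit = smf.ols("response ~ dose * mutated", data=df).fit()
print(fit.params)               # main effects and the dose:mutated term
```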

 If there are few observations to work with, then only large effects are likely to be identified, even if the sample is a good representation of the target population. Some real effects may not be measurable with a given sample, and a difficult choice of covariates is usually necessary. The p-values indicate whether a covariate contributes to a model, but they are not sensitive measures of how much information is provided. Measures of information, or tests such as likelihood ratio tests, are better indicators of how much information is gained or lost with each independent variable. If a model accounts for most of the variation in the input data, with few anomalies and little ignored information, then it is said to fit well. Any change in the model can alter its fit, so the fit must be reexamined after each change. Diagnostic tools, such as residuals, measure unexplained variability, while tests of goodness-of-fit estimate how closely the model fits the input data. Outliers occur where the model’s estimate of the outcome differs markedly from the observed value for a specific set of covariates. Influential points pull the fitted model toward themselves, so they are closely approximated, although at the expense of the fit to a substantial number of the remaining observations. Any of these may distort the model and render its conclusions inaccurate or misleading. To fit a sound model and interpret it correctly, the analyst must know the input data well, understand the model being used, test the fit with care, and examine the output results in detail.

VII. PREDICTION. Well-fitting descriptive models often fail to produce accurate predictions when given new data. A model may be overfitted to a specific data set, so it does not accommodate systematic differences between the data on which it was built and the new data about which predictions are made. Predictive modeling can be improved by splitting the data into parts, a training set on which the model is built and an independent validation set on which its performance is assessed. K-fold cross-validation divides the data into K equal-sized parts and creates K models, each leaving out one part to be used as a validation set. Training sets also may be randomly sampled, leaving the remainder of the observations as a validation set and repeating the process many times. Model performance is summarized over the results from each validation set. Performance criteria include overall deviation from predicted values, such as the root mean square error (RMSE), and data visualizations such as residual plots, which show where deviations from prediction are largest or most common. At each training and testing step, it is essential that the data used to assess performance come only from the validation set and are not used in training the model.
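 A minimal K-fold cross-validation sketch with scikit-learn is shown below; the model (ordinary linear regression), the simulated data, and the choice of five folds are illustrative assumptions.

```python
# Sketch of K-fold cross-validation: K models are fit, each trained on K-1
# folds and assessed on the held-out fold, and RMSE is summarized across the
# validation folds. Data are simulated.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

rng = np.random.default_rng(7)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(0, 1, size=200)

rmses = []
for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    pred = model.predict(X[val_idx])          # predictions on the held-out fold only
    rmses.append(np.sqrt(np.mean((y[val_idx] - pred) ** 2)))
print(np.mean(rmses), np.std(rmses, ddof=1))
```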

VIII. STATISTICAL LEARNING. Statistical learning is a rapidly expanding set of methods for analyzing and interpreting complex or very large data sets. Classification methods are used to predict a qualitative outcome or response, and regression methods to predict a quantitative outcome or response. If the outcome or response of interest is known at the outset, the analysis is a supervised learning process. In some cases, the goal is to illuminate structure in the data, treating all inputs alike, a process referred to as unsupervised learning. Bootstrap resampling and cross-validation are used to estimate precision and predictive accuracy, select the best among alternative methods, and avoid overfitting a specific model or set of classes. Classification analyses split the observations into subsets, clusters, or nodes, aiming to maximize homogeneity within and heterogeneity between nodes. Some common classification procedures include K-nearest neighbor analysis, logistic regression, discriminant analysis, and many forms of cluster analysis. More complex underlying structure can be identified using latent class analysis. Tree-based classification methods repeatedly split a data set, as in recursive partitioning, or create a forest of many bootstrap-based trees, as in random forest. Classification accuracy may be increased by bootstrap aggregating (“bagging”) the results of repeated classifications. In a bagged analysis, classification accuracy can be assessed by developing each classifier with a bootstrap sample of the observations and assessing its accuracy using the remaining, out-of-bag observations. The Gini index is a measure of homogeneity within and heterogeneity between classes. Tree-based methods, such as random forest, also find distances (“proximities”) between observations that can be used to visualize their structure using multidimensional scaling. Regression analyses attempt to predict the value of a continuous quantity by such methods as linear regression, partial least squares, principal components, and factor analysis. More complex, partly prespecified structures can be represented with structural equation modeling and latent profile analysis. Random forest also can be used to build regression trees.
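 The sketch below fits a random forest classifier with scikit-learn and reports out-of-bag accuracy and Gini-based variable importances; the simulated data stand in for clinical or molecular features.

```python
# Sketch of supervised classification with a random forest, using the
# out-of-bag observations to estimate classification accuracy. Data are
# simulated; in practice the features would be clinical or molecular measures.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           random_state=0)
forest = RandomForestClassifier(n_estimators=500, oob_score=True,
                                random_state=0)
forest.fit(X, y)
print(forest.oob_score_)                 # accuracy on out-of-bag observations
print(forest.feature_importances_[:5])   # Gini-based variable importance
```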

IX. CONCLUSION. A basic knowledge of statistics allows one to investigate data at first hand in a systematic and organized manner as well as to collaborate more effectively with a statistician when more extensive analyses are needed.

  A. How to learn more. JMP (www.jmp.com), SPSS (www.spss.com), and Stata (www.stata.com) provide immediate access to statistical analysis with good graphics and user-friendly, menu-driven interfaces. JMP includes extensive genomic data analysis methods, and all three have power and sample size calculators. For more complex or customized analyses, SAS (www.sas.com), R (www.r-project.org), MATLAB (www.mathworks.com), and Python (www.python.org/psf, www.learnpython.org) are powerful programming and scripting languages that require some study to be used effectively. R and Python are open source and can be used without charge. SAS and MATLAB require a license.

SUGGESTED READINGS

General Interest

Hacking I. The Taming of Chance. Cambridge, England: Cambridge University Press, 1990.

Huff D, Geis I. How to Lie with Statistics. New York, NY: W.W. Norton & Co, 1993.

McGrayne SB. The Theory That Would Not Die: How Bayes’ Rule Cracked the Enigma Code, Hunted Down Russian Submarines, and Emerged Triumphant from Two Centuries of Controversy. Yale University Press, 2012.

Salsburg D. The Lady Tasting Tea: How Statistics Revolutionized Science in the Twentieth Century. New York, NY: Henry Holt, 2002.

Silver N. The Signal and the Noise: Why So Many Predictions Fail – But Some Don’t. Penguin Press HC, 2012.

Stigler S. Statistics on the Table: A History of Statistical Concepts and Methods. Cambridge, MA: Harvard University Press, 2002.

Basics

James G, Witten D, Hastie T, et al. An Introduction to Statistical Learning. Springer, 2013.

Klein G, Dabney A. The Cartoon Introduction to Statistics. New York, NY: Hill and Wang, 2013.

Kuhn M. Applied Predictive Modeling. New York, NY: Springer, 2013.

Motulsky H. Intuitive Biostatistics: A Nonmathematical Guide to Statistical Thinking, 3rd ed. London, England: Oxford University Press, 2013.

Rosner B. Fundamentals of Biostatistics, 7th ed. Boston, MA: Cengage Learning, 2010.

Salkind N. Statistics for People Who (Think They) Hate Statistics, 5th ed. Thousand Oaks, CA: Sage Publications, 2013.

Software-Guided Learning

Cody R. SAS Statistics by Example. Cary, NC: SAS Institute, 2011.

Delwiche L, Slaughter S. The Little SAS Book: A Primer, 5th ed. Cary, NC: SAS Institute, 2012.

Field A. Discovering Statistics Using IBM SPSS Statistics. Thousand Oaks, CA: Sage Publications, 2013.

Hahn B, Valentine D. Essential MATLAB for Engineers and Scientists, 5th ed. New York, NY: Academic Press, 2013.

Kohler U, Kreuter F. Data Analysis Using Stata, 3rd ed. College Station, TX: Stata Press, 2012.

Kruschke JK. Doing Bayesian Data Analysis: A Tutorial with R and BUGS, 1st ed. New York, NY: Academic Press, 2010.

Lutz M. Learning Python, 5th ed. O’Reilly Media, 2013.

Maindonald J, Braun J. Data Analysis and Graphics Using R: An Example-Based Approach, 3rd ed. Cambridge, England: Cambridge University Press, 2010.

Sall J, Lehman A, Stephens M, et al. JMP Start Statistics: A Guide to Statistics and Data Analysis Using JMP, 5th ed. Cary, NC: SAS Institute, 2012.


