Glossary of Statistical Terms - Study Design and Statistics

AMA Manual of Style - Stacy L. Christiansen, Cheryl Iverson 2020

In the glossary that follows, terms defined elsewhere in the glossary are displayed in boldface and italic font. An arrow (→) indicates points to consider in addition to the definition. For detailed discussion of these terms, the referenced texts and the additional reading at the end of the chapter are useful sources.

Eponymous names for statistical procedures often differ from one text to another (eg, the Newman-Keuls and Student-Newman-Keuls test). The names provided in this glossary follow the Dictionary of Statistical Terms,93 published for the International Statistical Institute. Although some statistical texts use the possessive form for most eponyms, the possessive form for eponyms is not recommended in AMA style (see 15.0, Eponyms).

Most statistical tests are applicable only under specific circumstances, which are generally dictated by the scale properties of both the independent variable and dependent variable. Table 19.5-1 presents a guide to selection of commonly used statistical techniques. This table is not meant to be exhaustive but rather to indicate the appropriate applications of commonly used statistical techniques.

Table 19.5-1. Selection of Commonly Used Statistical Techniquesᵃ

| Type of experiment | Scale of measurement: intervalᵇ | Scale of measurement: ordinal | Scale of measurement: nominalᶜ |
| 2 Treatment groups | Unpaired t test | Mann-Whitney rank sum test | χ² analysis of contingency table; Fisher exact test if <5 in any cell |
| ≥3 Treatment groups | Analysis of variance | Kruskal-Wallis statistic | χ² analysis of contingency table; Fisher exact test if <5 in any cell |
| Before and after treatment in same individual | Paired t test | Wilcoxon signed rank test | McNemar test |
| Multiple treatments in same individual | Repeated-measures analysis of variance | Friedman statistic | Cochran Q |
| Association between 2 variables | Linear regression and Pearson product moment correlation | Spearman rank correlation | Contingency coefficients |

a Adapted with permission from Primer of Biostatistics.94 ©McGraw-Hill.

b Assumes normally distributed data. If data are not normally distributed, then rank the observations and use the methods for data measured on an ordinal scale.

c For a nominal dependent variable that is time dependent (such as mortality over time), use life-table analysis for nominal independent variables and Cox proportional hazards regression for continuous and/or nominal independent variables.

abscissa: horizontal or x-axis of a graph. The y-axis is referred to as the ordinate.

absolute difference: the difference between an event rate before an intervention and the event rate after the intervention. For example, if the influenza rate in a population is 20% before a vaccination program is implemented and it decreases to 15% after the program is initiated, the absolute difference is 5 percentage points. Absolute differences are subtracted numbers, whereas relative differences are fractions. Absolute differences are useful because they provide the absolute magnitude of an effect, whereas small absolute differences can appear to be very large when expressed as relative differences. The inverse of the absolute difference is the number needed to treat.

absolute risk: probability of an event occurring during a specified period. The absolute risk equals the relative risk times the average probability of the event during the same time, if the risk factor is absent.95 See absolute risk reduction.

absolute risk reduction (ARR): proportion in the control group experiencing an event minus the proportion in the intervention group experiencing an event. The inverse of the absolute risk reduction is the number needed to treat. See absolute risk.
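The arithmetic behind absolute difference, relative difference, and number needed to treat can be sketched as follows. This is a minimal illustration that reuses the influenza figures from the absolute difference entry (20% and 15%); the function name `risk_summary` is hypothetical.

```python
def risk_summary(control_rate, intervention_rate):
    """Return (absolute difference, relative difference, number needed to treat).

    Rates are proportions between 0 and 1. The number needed to treat is
    undefined when the two rates are equal, so that case raises an error.
    """
    absolute_difference = control_rate - intervention_rate
    if absolute_difference == 0:
        raise ValueError("number needed to treat is undefined when rates are equal")
    relative_difference = absolute_difference / control_rate
    number_needed_to_treat = 1 / absolute_difference
    return absolute_difference, relative_difference, number_needed_to_treat


# Influenza example: 20% before the vaccination program, 15% after.
arr, rrr, nnt = risk_summary(0.20, 0.15)
print(f"absolute difference: {arr:.2f}")
print(f"relative difference: {rrr:.2f}")
print(f"number needed to treat: {nnt:.0f}")
```

With these figures, the absolute difference is 5 percentage points, the relative difference is 25%, and the number needed to treat is 20, which shows how a modest absolute difference can look much larger when expressed in relative terms.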

accuracy: ability of a test to produce results that are close to the true measure of the phenomenon.95 Generally, assessing accuracy of a test requires that there be a criterion standard (or reference standard) with which to compare the test results. Accuracy encompasses a number of measures, including reliability, validity, and lack of bias. Accuracy should be distinguished from precision, which refers to the variability or how close together successive measurements of the same phenomenon are.96

actuarial life-table method: see life table, Cutler-Ederer method.

adjustment: statistical techniques used after the collection of data to adjust for the effect of known or potential confounding variables.95 A typical example is adjusting a result for the independent effect of age of the participants (age is an independent variable). In this case, adjustment has the effect of holding age constant. Some outcomes may be age dependent, but statistical adjustment yields a result that would occur if the age effect were not present. Typically, this is performed by including the variable to be adjusted as an independent variable in a regression procedure.

aggregate data: data accumulated from disparate sources.

agreement: statistical test performed to determine the equivalence of the results obtained by 2 tests when one test is compared with another (one of which is usually but not always a criterion standard [or reference standard]). Agreement also occurs in studies where evaluators provide judgments of, say, medical images (radiographs, slides). A measure of agreement, such as the κ statistic or concordance, should also be provided in such situations.

→ Agreement should not be confused with correlation. Correlation is used to test the degree to which changes in a variable are related to changes in another variable, whereas agreement tests whether 2 variables are equivalent. For example, an investigator compares results obtained by 2 methods of measuring hematocrit. Method A gives a result that is always exactly twice that of method B. The correlation between A and B is perfect because A is always twice B, but the agreement is poor; method A is not equivalent to method B. One appropriate way to assess agreement has been described by Bland and Altman97 (see Bland-Altman plot).
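The hematocrit example can be sketched numerically. The readings below are hypothetical and chosen only so that method A is always exactly twice method B; the helper `pearson_r` is written out by hand rather than taken from any library.

```python
from statistics import mean


def pearson_r(x, y):
    """Pearson product moment correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


# Hypothetical readings: method A always returns exactly twice method B.
method_b = [30.0, 35.0, 40.0, 45.0, 50.0]
method_a = [2 * value for value in method_b]

r = pearson_r(method_a, method_b)
mean_difference = mean(a - b for a, b in zip(method_a, method_b))

print(f"correlation: {r:.3f}")
print(f"mean difference (A - B): {mean_difference:.1f}")
```

The correlation is perfect, yet the average difference between the methods is as large as the average reading from method B, so the methods do not agree.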

Akaike information criterion (AIC): method for determining which of various models best fits an observed data set. Used for models fit by maximum likelihood. The likelihood function is penalized by the number of parameters fit to account for the fact that any model will fit better if it has more parameters.

AIC = 2k − 2 ln(L)

where L is the maximum value of the model’s likelihood function and k is the number of parameters estimated. When fitting several models, the model that fits best has the lowest AIC. The AIC is used in conjunction with the Bayes information criteria (BIC). These criteria are among the preferred ways to establish which model fits the data best when fit by maximum likelihood methods.

algorithm: systematic process performed in an ordered, typically branching sequence of steps; each step depends on the outcome of the previous step.98 An algorithm may be used clinically to guide treatment decisions for an individual patient on the basis of the patient’s clinical outcome or result.

α (alpha), α level: the threshold for statistical significance. It is the size of the likelihood acceptable to the investigators that an association observed between 2 variables is due to chance (the probability of a type I error); usually α = .05. If α = .05, P < .05 will be considered significant. When presenting α levels, do not include zero before the decimal point (but see Cronbach α).

analysis of covariance (ANCOVA): statistical test used to examine data that include both continuous and nominal independent variables and a continuous dependent variable. It is basically a hybrid of multiple regression (used for continuous independent variables) and analysis of variance (used for nominal independent variables).95 In general, when analyzing a change of some factor when compared with a baseline value, it is better to use ANCOVA than to report the outcome as a change score (the outcome value minus the baseline value) because regression to the mean occurs for baseline values that are too high or low. ANCOVA provides a better mechanism when baseline differences exist in the experimental subjects.99

analysis of variance (ANOVA): statistical method used to compare a continuous dependent variable across 1 or more nominal independent variables. When multiple comparisons between groups are necessary, ANOVA avoids problems inherent in using t tests for multiple comparisons of independent observations. The null hypothesis for ANOVA is that there are no differences in the means between any groups within a sample that is studied. This hypothesis is tested by calculating the variance between the means of the groups and dividing it by the variance within the groups. Because only 1 hypothesis is tested (ie, that there are no differences between any of the groups’ mean values), the problem of repeated testing is avoided. The greater the differences between the means of the groups, the larger the variance will be between the groups, resulting in a larger ratio. The null hypothesis in ANOVA is tested by means of the F test. In ANOVA, the null hypothesis only tests that there are no differences between any mean values of the groups tested. If a significant difference is found by F testing, contrasts or multiple comparison testing procedures are required to know which of the groups differ.

In 1-way ANOVA, there is a single nominal independent variable with 2 or more levels (eg, age categorized into strata of 20-39 years, 40-59 years, and ≥60 years). When there are only 2 mutually exclusive categories for the nominal independent variable (eg, male or female), the 1-way ANOVA is equivalent to the t test.

A 2-way ANOVA is used if there are 2 independent variables (eg, age strata and sex), a 3-way ANOVA if there are 3 independent variables, and so on. If more than 1 independent variable is analyzed, the process is called factorial ANOVA, which assesses the main effects of the independent variables as well as their interactions. For example, if the effect of age category and sex on systolic blood pressure (SBP) is tested, the independent effect of age categories on SBP and those for sex can be tested as main effects. An interaction exists when the effect of combined groups is larger than the sum of the effect of both groups. In a factorial 3-way ANOVA with independent variables A, B, and C, there is one 3-way interaction term (A × B × C), 3 different 2-way interaction terms (A × B, A × C, and B × C), and 3 main effect terms (A, B, and C). A separate F test must be computed for each different main effect and interaction term.

If repeated measures are made on an individual (such as measuring blood pressure over time), a statistical method should be used that does not depend on complete independence of samples (because serial measurements from the same individual are not statistically independent from one another) and capitalizes on the fact that the measures are from the same individual. This is accomplished by use of repeated-measures ANOVA, which has the added benefit of facilitating control for confounding factors (such as age or socioeconomic status). Randomized-block ANOVA is used if treatments are assigned by means of block randomization.95,100

→ An ANOVA can establish only whether a significant difference exists among groups, not which groups are significantly different from each other. If hypotheses are generated before the experiment is performed about which groups might differ, ANOVA can be performed together with contrasts to establish if significant differences exist between individual groups. If hypotheses were not established before the experiment, the statistical significance of group differences can be determined by a pairwise analysis of variables by the Newman-Keuls test or Tukey test or one of several other tests. These multiple comparison procedures avoid the potential of a type I error that might occur if the t test were applied at this stage.

Multiple comparison tests explore what groups are significantly different from the others after an ANOVA has found a significant F test result. They either test for differences between all possible pairs of subgroups (t, Duncan, Student-Newman-Keuls, Tukey-Kramer, Bonferroni, etc) or between a control and all the remaining groups (t, Dunnett, Bonferroni, etc).

→ The F ratio of ANOVA is the ratio of the variance observed between the group means to the variance within the groups and is a number between 0 and infinity. The F ratio is compared with tables of the F distribution, taking into account the α level and degrees of freedom (df) for the numerator and denominator, to determine the P value.

Example: The difference was found to be significant by 1-way ANOVA (F₂,₆₃ = 61.07; P < .001).100

The dfs are provided along with the F statistic. The first subscript (2) is the df for the numerator; the second subscript (63) is the df for the denominator. The P value can be obtained from an F statistic table that provides the P value that corresponds to a given F and df. In practice, however, the P value is generally calculated by a computerized algorithm. A statistically significant ANOVA test means only that there is a difference between some of the groups included in the analysis. It does not determine which groups are significantly different from each other; to find out, a multiple comparisons procedure must be performed.100 Other models, such as Latin square, may also be used.
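The F ratio and its 2 df values can be computed by hand for a small data set. The sketch below uses the standard 1-way ANOVA computation (between-group mean square divided by within-group mean square); the data and the function name `one_way_anova` are hypothetical, and the P value lookup is left to statistical software.

```python
from statistics import mean


def one_way_anova(groups):
    """Return (F ratio, numerator df, denominator df) for a list of groups."""
    k = len(groups)
    n_total = sum(len(group) for group in groups)
    grand_mean = mean(value for group in groups for value in group)

    # Variation of the group means around the grand mean.
    ss_between = sum(len(group) * (mean(group) - grand_mean) ** 2 for group in groups)
    # Variation of the observations around their own group mean.
    ss_within = sum((value - mean(group)) ** 2 for group in groups for value in group)

    df_between = k - 1
    df_within = n_total - k
    f_ratio = (ss_between / df_between) / (ss_within / df_within)
    return f_ratio, df_between, df_within


# Hypothetical measurements in 3 treatment groups.
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_ratio, df_num, df_den = one_way_anova(groups)
print(f"F({df_num},{df_den}) = {f_ratio:.2f}")
```

For these data the numerator df is 2 (3 groups minus 1) and the denominator df is 6 (9 observations minus 3 groups). A significant F ratio would still need to be followed by a multiple comparisons procedure to identify which groups differ.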

ANCOVA: see analysis of covariance.

ANOVA: see analysis of variance.

Ansari-Bradley dispersion test: rank test to determine whether 2 distributions known to be of identical shape (but not necessarily of normal distribution) have equal parameters of scale.101

area under the curve (AUC): technique used to measure the performance of a test plotted on a receiver operating characteristic (ROC) curve, to assess the integrated response of a hormone’s release (such as insulin during a glucose tolerance test), or to measure drug clearance in pharmacokinetic studies.98 See also receiver operating characteristic (ROC) curve.

When measuring hormone release or drug clearance, the AUC assesses the total release of a hormone or exposure of an individual to the drug as measured by levels in blood or urine over time. The AUC is also used to calculate the drug’s half-life.

→ The method used to determine the AUC should be specified (eg, the trapezoidal rule).
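As an illustration of the trapezoidal rule, the following sketch sums the areas of the trapezoids formed by consecutive measurements. The sampling times and concentrations are hypothetical, as is the function name `trapezoidal_auc`.

```python
def trapezoidal_auc(times, values):
    """Area under a curve sampled at the given times, by the trapezoidal rule."""
    if len(times) != len(values) or len(times) < 2:
        raise ValueError("need at least 2 paired time and value measurements")
    area = 0.0
    for i in range(1, len(times)):
        width = times[i] - times[i - 1]
        mean_height = (values[i] + values[i - 1]) / 2
        area += width * mean_height
    return area


# Hypothetical drug concentrations (mg/L) measured at 0, 1, 2, and 4 hours.
times_h = [0, 1, 2, 4]
concentrations = [0.0, 4.0, 3.0, 1.0]
auc = trapezoidal_auc(times_h, concentrations)
print(f"AUC = {auc} mg·h/L")
```

The rule handles unequal sampling intervals because each trapezoid uses its own width.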

artifact: Something measured in a study that is not really present but results from the way the study was conducted.95

Example: The hemoglobin level of 4.0 g/dL was incorrect because the blood was drawn from just above the patient’s intravenous catheter.

assessment: in the statistical sense, evaluating the outcome(s) of the study and control groups.

assignment: process of distributing individuals to study and control groups. See also randomization.

association: statistically significant relationship between 2 variables in which one does not necessarily cause the other. When 2 variables are measured simultaneously, association rather than causation generally is all that can be determined.

Example: After confounding factors were controlled for by means of multivariate regression, a significant association remained between age and disease prevalence. However, these methods do not allow one to conclude that age causes differences in disease prevalence.

as-treated analysis: see per-protocol analysis.

attributable risk: the difference between the rate of disease when a risk factor is present and the rate when the risk factor is absent. It is the amount of disease attributable to the risk factor.95 Attributable risk assumes a causal relationship (ie, the factor to be eliminated is a cause of the disease and not merely associated with the disease). See attributable risk percentage and attributable risk reduction.

attributable risk percentage: the percentage of risk associated with a given risk factor.95 For example, risk of stroke in an older person who smokes and has hypertension and no other risk factors can be divided among the risks attributable to smoking, hypertension, and age. Attributable risk percentage is often determined for a population and is the percentage of the disease related to the risk factor. See population attributable risk percentage.

attributable risk reduction: the number of events that can be prevented by eliminating a particular risk factor from the population. Attributable risk reduction is a function of 2 factors: the strength of the association between the risk factor and the disease (ie, how often the risk factor causes the disease) and the frequency of the risk factor in the population (ie, a common risk factor may have a lower attributable risk in an individual than a less common risk factor but could have a higher attributable risk reduction because of the risk factor’s high prevalence in the population). Attributable risk reduction is a useful concept for public health decisions. See also attributable risk.

average: the arithmetic mean; one measure of the central tendency of a collection of measurements. The word average is often used synonymously with mean but can also imply the median, mode, or some other statistic. Thus, the word average should generally be avoided in favor of a more precise term.

Bayes information criteria (BIC): see Akaike information criterion (AIC). Like the AIC, the BIC is a means for determining which of several models best fits data using maximum likelihood analysis. It penalizes the number of parameters more heavily than the AIC does to lessen the influence of overfitting.

BIC = ln(n) × k − 2 ln(L)

where n indicates the number of data points; k, the number of parameters; and L, the maximum likelihood function for the model. The model with the lowest AIC or BIC should be selected as the best-fitting model.
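A short sketch of how the 2 criteria compare candidate models is shown below. It uses the BIC formula above and the standard AIC formula, AIC = 2k − 2 ln(L). The log-likelihoods, parameter counts, and sample size are hypothetical values, not results from any real analysis.

```python
import math


def aic(log_likelihood, k):
    """Akaike information criterion: 2k - 2 ln(L)."""
    return 2 * k - 2 * log_likelihood


def bic(log_likelihood, k, n):
    """Bayes information criterion: ln(n) * k - 2 ln(L)."""
    return math.log(n) * k - 2 * log_likelihood


# Hypothetical fits of 2 models to the same 100 data points.
n_points = 100
models = {
    "simple (3 parameters)": {"log_likelihood": -120.0, "k": 3},
    "complex (6 parameters)": {"log_likelihood": -118.5, "k": 6},
}

for name, fit in models.items():
    model_aic = aic(fit["log_likelihood"], fit["k"])
    model_bic = bic(fit["log_likelihood"], fit["k"], n_points)
    print(f"{name}: AIC = {model_aic:.1f}, BIC = {model_bic:.1f}")
```

In this example the complex model has the higher likelihood, but both criteria are lower for the simple model because the gain in fit does not offset the penalty for 3 extra parameters. The BIC gap is wider because ln(100) is larger than 2.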

bayesian analysis: a statistical method that uses prior knowledge combined with new data to obtain a new probability. The general approach assumes that something is known about what is being assessed before any tests are performed. For example, the prevalence of HIV in intravenous drug abusers enrolled in treatment programs is 8%.102 When testing for HIV in such a patient, the prior probability of HIV infection is 8% because it is known that 8% of that population of patients has the disease. A positive or negative HIV test result will either increase or decrease the 8% likelihood of having HIV, resulting in a new probability called the posterior probability.
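The update from prior to posterior probability can be sketched with Bayes theorem. The 8% prior comes from the example above; the test sensitivity (99%) and specificity (98%) are assumed values chosen only for illustration, and the function name `posterior_probability` is hypothetical.

```python
def posterior_probability(prior, sensitivity, specificity, test_positive=True):
    """Probability of disease after a test result, by Bayes theorem."""
    if test_positive:
        true_positive = prior * sensitivity
        false_positive = (1 - prior) * (1 - specificity)
        return true_positive / (true_positive + false_positive)
    false_negative = prior * (1 - sensitivity)
    true_negative = (1 - prior) * specificity
    return false_negative / (false_negative + true_negative)


prior = 0.08          # prevalence from the example above
sensitivity = 0.99    # assumed for illustration
specificity = 0.98    # assumed for illustration

after_positive = posterior_probability(prior, sensitivity, specificity, True)
after_negative = posterior_probability(prior, sensitivity, specificity, False)
print(f"posterior after a positive result: {after_positive:.3f}")
print(f"posterior after a negative result: {after_negative:.4f}")
```

Under these assumptions a positive result raises the probability well above the 8% prior and a negative result lowers it well below 8%.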

β (beta), β error or type II error: probability of showing no significant difference when a true difference exists; a false acceptance of the null hypothesis.98 One minus β is the statistical power of a test to detect a true difference between groups; a value of .20 for β is equal to .80 or 80% power. The test has an 80% probability of detecting a difference between groups if one exists. A β of .10 or .20 is most frequently used in power calculations. Power is important when no difference is found between groups after an intervention. If the power is low, then there is the possibility that a true difference exists but that the tests performed could not detect the difference. When the power is high (usually resulting from a large sample size) and no difference between groups is found, there is greater confidence in concluding that no difference actually exists between the groups. The β error is synonymous with type II error.100

bias (or systematic error): a systematic situation or condition that causes a result to depart from the true value in a consistent direction. Bias refers to defects in study design (often selection bias) or measurement.95 One method to reduce measurement bias is to ensure that the investigator measuring outcomes for a participant is unaware of the group to which the participant belongs (ie, blinded assessment). The following list summarizes the types of bias that exist.7

Types of Biasᵃ

Berkson bias

A type of selection bias where both cases and controls are derived from a subpopulation, resulting in both groups being different from the main population. For example, if both the cases and controls come from a group of hospitalized patients, they may not represent the population at large because of some factor that resulted in the members of the groups being hospitalized. See Cutter et al103 for examples.

Channeling effect or channeling bias

The tendency of clinicians to prescribe treatment according to a patient’s prognosis. As a result of this behavior in observational studies, treated patients are more or less likely to be high-risk patients than untreated patients, leading to a biased estimate of treatment effect.

Data completeness bias

Using a computer decision support system (CDSS) to log episodes in the intervention group and using a manual system in the non-CDSS control group can create variation in the completeness of data.

Detection bias (or surveillance bias)

The tendency to look more carefully for an outcome in one of the comparison groups.

Differential verification bias (or verification bias or workup bias)

When test results influence the choice of the reference standard (eg, test-positive patients undergo an invasive test to establish the diagnosis, whereas test-negative patients undergo long-term follow-up without application of the invasive test), the assessment of test properties may be biased.

Expectation bias

In data collection, an interviewer has information that influences his or her expectation of finding the exposure or outcome. In clinical practice, a clinician’s assessment may be influenced by previous knowledge of the presence or absence of a disorder.

Incorporation bias

Occurs when investigators use a reference standard that incorporates a diagnostic test that is the subject of investigation. The result is a bias toward making the test appear more powerful in differentiating target-positive from target-negative patients than it actually is.

Interviewer bias

Greater probing by an interviewer of some participants than others, contingent on particular features of the participants.

Lead-time bias

Occurs when outcomes such as survival, as measured from the time of diagnosis, may be increased not because patients live longer but because screening lengthens the time that they know they have disease.

Length-time bias

Occurs when patients whose disease is discovered by screening may appear to do better or live longer than people whose disease presents clinically with symptoms because screening tends to detect disease that is destined to progress slowly and that therefore has a good prognosis.

Observer bias

Occurs when an observer’s observations differ systematically according to participant characteristics (eg, making systematically different observations in treatment and control groups).

Partial verification bias

Occurs when only a selected sample of patients who underwent the index test is verified by the reference standard, and that sample is dependent on the results of the test. For example, patients with suspected coronary artery disease whose exercise test results are abnormal may be more likely to undergo coronary angiography (the reference standard) than those whose exercise test results are normal.

Publication bias

Occurs when the publication of research depends on the direction of the study results and whether they are statistically significant.

Recall bias

Occurs when patients who experience an adverse outcome have a different likelihood of recalling an exposure than patients who do not experience the adverse outcome, independent of the true extent of exposure.

Referral bias

Occurs when characteristics of patients differ between one setting (such as primary care) and another setting that includes only referred patients (such as secondary or tertiary care).

Reporting bias (or selective outcome reporting bias)

The inclination of authors to differentially report research results according to the magnitude, direction, or statistical significance of the results.

Social desirability bias

Occurs when participants answer according to social norms or socially desirable behavior rather than what is actually the case (eg, underreporting alcohol consumption).

Spectrum bias

Ideally, diagnostic test properties are assessed in a population in which the spectrum of disease in the target-positive patients includes all those in whom clinicians might be uncertain about the diagnosis, and the target-negative patients include all those with conditions easily confused with the target condition. Spectrum bias may occur when the accuracy of a diagnostic test is assessed in a population that differs from this ideal. Examples of spectrum bias would include a situation in which a substantial proportion of the target-positive population has advanced disease and target-negative participants are healthy or asymptomatic. Such situations typically occur in diagnostic case-control studies (for instance, comparing those with advanced disease vs healthy individuals). Such studies are liable to yield an overly sanguine estimate of the usefulness of the test.

Surveillance bias

see detection bias

Verification bias

see differential verification bias

Workup bias

see differential verification bias

a From Users’ Guides to the Medical Literature.7

biased estimator: an estimator is a function that results in a value that represents some property of a distribution. For example, an estimator for the mean value is some function that takes individual numbers, sums them, and divides by how many numbers there were, resulting in the mean value. A biased estimator is one that systematically results in an estimate for some value that is incorrect.

bimodal distribution: the normal distribution is characterized by the bell-shaped curve. A bimodal distribution has 2 peaks with a valley between them. There are 2 modal values (the most common number in a distribution of numbers). The mean (the number representing the central tendency of a group of numbers) and median (the number that has half the numbers larger than it and half smaller than it) may be equivalent, but neither is a good representation of the data. A population composed entirely of schoolchildren and their grandparents might have a mean age of 35 years, although there would be 1 bell-shaped curve representing the ages of the grandchildren and another representing the grandparents’ ages.

binary variable: variable that has 2 mutually exclusive subgroups, such as male/female or pregnant/not pregnant; synonymous with dichotomous variable.104

binomial distribution: probability distribution that characterizes binomial data; used for modeling cumulative incidence and prevalence rates98 (eg, the probability of a person having a stroke in a given population during a given period; the outcome must be stroke or no stroke). In a binomial sample with a probability (p) of the event and number (n) of participants, the predicted mean is p × n and the predicted variance is n × p × (1 − p).
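The mean and variance of a binomial sample can be checked against the probability mass function directly. The sketch below uses the standard results (mean n × p, variance n × p × (1 − p)); the stroke probability and sample size are hypothetical.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k events among n participants."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def binomial_mean_variance(n, p):
    """Closed-form mean and variance of a binomial distribution."""
    return n * p, n * p * (1 - p)


# Hypothetical: 100 participants, each with a 10% probability of stroke.
n, p = 100, 0.10
mean_events, variance_events = binomial_mean_variance(n, p)

# Cross-check the closed forms by summing over the full distribution.
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]
mean_from_pmf = sum(k * prob for k, prob in enumerate(pmf))
variance_from_pmf = sum((k - mean_from_pmf) ** 2 * prob for k, prob in enumerate(pmf))

print(f"mean: {mean_events:.1f} (from pmf: {mean_from_pmf:.1f})")
print(f"variance: {variance_events:.1f} (from pmf: {variance_from_pmf:.1f})")
```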

biological plausibility: assumption that a causal relationship may be present because a known biological phenomenon exists that can explain the relationship. Statistically, this can be expressed as evidence that an independent variable may exert a biological effect on a dependent variable with which it is associated. For example, studies in animals were used to establish the biological plausibility of adverse effects of passive smoking.

bivariable analysis: used when the effect of a single independent variable on a single dependent variable is assessed.95 Common examples include the t test for 1 continuous variable and 1 binary variable and the χ² test for 2 binary variables. Bivariable analyses can be used for hypothesis testing in which only 1 independent variable is taken into account, to compare baseline characteristics of 2 groups or to develop a model for multivariable regression. See also univariable and multivariate analysis.

bivariate analysis: bivariate analysis implies that there is more than 1 dependent variable, a situation that may arise when ANOVA is performed. The suffix -ate in bivariate implies that something acts on the variables, which implies that bivariate refers to the dependent and not independent variables. See also bivariable analysis, multivariate analysis.

Bland-Altman plot (also known as the Tukey mean difference plot): a method to assess agreement (eg, between 2 tests) developed by Bland and Altman.97 The difference between measures for the same subject is plotted on the y-axis against the mean of the 2 observations on the x-axis. This yields a scatterplot that graphically shows if one of the tests is unreliable compared with the other (example from Hoffmann et al105).

→ Sample wording: Methods: A Bland-Altman assessment for agreement was used to compare computed tomography measurement of coronary calcium with angiography. A range of agreement was defined as a mean bias of ± 2 SDs. Results: The Bland-Altman analysis indicates that the 95% limits of agreement between computed tomography coronary calcium measurement and angiography ranged from −19% to +43%. Note: When the Bland-Altman plot shows that the variables are NOT similar, one could say, “The 2 methods do not consistently provide similar measures because there is a level of disagreement that includes clinically important discrepancies of up to XX” [fill in the threshold value for the measure that exceeds what is considered clinically important].
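The quantities plotted and reported in a Bland-Altman assessment can be sketched as follows. The paired measurements are hypothetical, as is the function name `bland_altman`. The default multiplier of 2 SDs follows the sample wording above; 1.96 is also commonly used for 95% limits.

```python
from statistics import mean, stdev


def bland_altman(method_1, method_2, multiplier=2.0):
    """Return plot coordinates, mean bias, and limits of agreement."""
    differences = [a - b for a, b in zip(method_1, method_2)]    # y-axis
    averages = [(a + b) / 2 for a, b in zip(method_1, method_2)]  # x-axis
    bias = mean(differences)
    sd = stdev(differences)
    limits = (bias - multiplier * sd, bias + multiplier * sd)
    return averages, differences, bias, limits


# Hypothetical paired measurements of the same 5 subjects by 2 methods.
method_1 = [10, 12, 14, 16, 18]
method_2 = [11, 11, 15, 15, 19]

averages, differences, bias, (lower, upper) = bland_altman(method_1, method_2)
print(f"mean bias: {bias:.2f}")
print(f"limits of agreement: {lower:.2f} to {upper:.2f}")
```

Whether the resulting limits indicate acceptable agreement is a clinical judgment: the limits are compared with the largest discrepancy that would be considered unimportant for the measure in question.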

blinded (masked) assessment: evaluation or categorization of an outcome in which the person assessing the outcome is unaware of the treatment assignment.

blinded (masked) assignment: assignment of individuals participating in a prospective study (usually random) to a study group and a control group without the investigator or the participants being aware of the group to which they are assigned. Studies may be single-blind, in which either the participant or the person administering the intervention does not know the treatment assignment, or double-blind, in which neither knows the treatment assignment. The term triple-blind is sometimes used to indicate that the persons who analyze or interpret the data are similarly unaware of treatment assignment. Authors should indicate who exactly was blinded. The term masked assignment is preferred by some investigators and journals, particularly those in ophthalmology. Assessors may be blinded to things other than the treatment group, such as the purpose of the study or patient characteristics.

→ Blinded assessment is important to prevent bias on the part of the investigator performing the assessment, who may be influenced by the study question and consciously or unconsciously expect a certain test result.

block randomization: type of randomization in which the unit of randomization is not the individual but a larger group, sometimes stratified on particular variables, such as age or severity of illness, to ensure even distribution of the variable between randomized groups.

Bonferroni adjustment: one of several statistical adjustments to the P value that may be applied when multiple comparisons are made. The α level (usually .05) is divided by the number of comparisons to determine the α level that will be considered statistically significant. Thus, if 10 comparisons are made, an α of .05 would become α = .005 for the study. Alternatively, the P value may be multiplied by the number of comparisons, while retaining the α of .05.104 For example, a P value of .02 obtained for 1 of 10 comparisons would be multiplied by 10 to get the final result of P = .20, a nonsignificant result.

→ The Bonferroni test is a conservative adjustment for large numbers of comparisons (ie, less likely than other methods to give a significant result) but is simple and used frequently.
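Both forms of the Bonferroni adjustment described above can be sketched in a few lines, using the figures from the definition (10 comparisons, α = .05, P = .02). The function names are hypothetical. Capping the adjusted P value at 1 is a common convention rather than something stated in the definition.

```python
def bonferroni_alpha(alpha, n_comparisons):
    """Significance threshold to apply to each individual comparison."""
    return alpha / n_comparisons


def bonferroni_p(p_value, n_comparisons):
    """Adjusted P value to compare against the original alpha (capped at 1)."""
    return min(1.0, p_value * n_comparisons)


adjusted_alpha = bonferroni_alpha(0.05, 10)
adjusted_p = bonferroni_p(0.02, 10)

print(f"adjusted alpha: {adjusted_alpha:.3f}")
print(f"adjusted P value: {adjusted_p:.2f}")
print("significant" if adjusted_p < 0.05 else "not significant")
```

The 2 forms lead to the same decision: comparing the unadjusted P value with the adjusted α, or the adjusted P value with the original α.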

bootstrap method (resampling method): statistical method for validating a new parameter in the same group from which the parameter was derived. This method is typically used to estimate CIs. Values from a data set are randomly sampled with replacement, creating a new data set that is the same size as the original one. The mean of the new sample is calculated. This is repeated many times, and the 95% CI is calculated from the resulting collection of means. This CI is then used as the CI for the mean of the original data set.98

→ Sometimes the bootstrap approach is used to validate a new statistical model. After a model is built, it is repeated many times using randomly sampled data from the original data from which the model was derived. The model may be considered reliable if it provides results with each bootstrap run similar to those obtained from the original model. In general, this is not an ideal approach for model validation because the data used to create the model may differ somewhat from data obtained from other cohorts. It is best to validate models on data other than those from which the model was constructed.

Brown-Mood procedure: test used with a regression model that does not assume a normal distribution or common variance of the errors.93 It is an extension of the median test.

C statistic (also referred to as the C index or concordance statistic): a measure of the area under a receiver operating characteristic curve.106,107 The C statistic is commonly used in logistic regression procedures. The C statistic is the probability that, given 2 individuals (one who experiences the outcome of interest and the other who does not or who experiences it later), the model will yield a higher risk for the first patient than for the second. It is a measure of concordance (hence, the name C statistic) between model-based risk estimates and observed events. C statistics measure the ability of a model to rank patients from high to low risk but do not assess the ability of a model to assign accurate probabilities of an event occurring (that is measured by the model’s calibration). C statistics generally range from 0.5 (random concordance) to 1 (perfect concordance).

The C statistic is calculated by evaluating each pair of data points against one another and determining whether the model’s predicted probability is higher for an individual who had an event relative to one who did not experience the event. If the model always results in a higher risk of an event for individuals who experience the event, the C statistic will equal 1.0. If the model completely fails, it will equal 0.5 because there is a 50-50 chance that the model will predict a higher risk for individuals with the event by chance alone.
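
The pairwise calculation described above can be sketched in Python for a binary outcome. This is an illustration under stated assumptions: the function name c_statistic is hypothetical, and counting tied predictions as half-concordant is a common convention assumed here, not a rule given in this glossary.

```python
def c_statistic(predicted_risk, had_event):
    """Share of (event, no-event) pairs in which the model assigns
    the higher predicted risk to the individual who had the event."""
    concordant = 0.0
    n_pairs = 0
    for risk_i, event_i in zip(predicted_risk, had_event):
        if not event_i:
            continue
        for risk_j, event_j in zip(predicted_risk, had_event):
            if event_j:
                continue
            n_pairs += 1
            if risk_i > risk_j:
                concordant += 1.0
            elif risk_i == risk_j:
                concordant += 0.5  # assumed convention for ties
    if n_pairs == 0:
        raise ValueError("need at least 1 event and 1 nonevent")
    return concordant / n_pairs


# Hypothetical predictions: the model ranks both events above both nonevents
print(c_statistic([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0]))  # prints 1.0
# A model that assigns everyone the same risk cannot rank patients
print(c_statistic([0.5, 0.5, 0.5, 0.5], [1, 1, 0, 0]))  # prints 0.5
```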

Note: Do not hyphenate C statistic.

case: in a study, an individual with the outcome or disease of interest.

case-control study: retrospective study in which individuals with the disease (cases) are compared with those who do not have the disease (controls). Cases and controls are identified without knowledge of exposure to the risk factors under study. Cases and controls are matched on certain important variables, such as age, sex, and year in which the individual was treated or identified. A case-control study on individuals already enrolled in a cohort study is referred to as a nested case-control study.98 This type of case-control study may be an especially strong study design if characteristics of the cohort have been carefully ascertained (see 19.3.2, Case-Control Studies).

→ Cases and controls should be selected from the same population to minimize confounding by factors other than those under study. Matching cases and controls on too many characteristics may obscure the association of interest, because if cases and controls are too similar, their exposures may be too similar to detect a difference (see overmatching). For example, if only males are studied, the effect of sex on an intervention cannot be determined. Similarly, if the participants are matched on age, the effect of age on outcome cannot be determined. Case-control studies also have risk for recall bias, where a patient’s memory of events before the disease may differ from the memory of a control patient without the disease who may have paid less attention to factors that contribute to the disease being evaluated. Patients with a disease may have had more intense data collection than control patients. Patients and controls may have been selected from special populations (eg, people who happen to be admitted to certain hospitals) that may not represent all the relevant populations at risk for a disease.

case-fatality rate: probability of death among people who have a disease. The rate is calculated as the number of deaths during a specific period divided by the number of persons with the disease at the beginning of that period.104

case series: retrospective descriptive study in which clinical experience with a number of patients is described (see 19.3.4, Case Series).

categorical data: categorical data have nonnumerical values (they are named or nominal). For example, sex and race are categorical data. They are often represented by numbers when used for statistical analysis. For example, sex is male or female and may be represented in a database as 0 or 1. Categorical data that only have 2 categories are known as dichotomous93 (eg, sex or race/ethnicity). The categories have no numerical importance. Categorical data are summarized by proportions, percentages, fractions, or simple counts. Categorical data are synonymous with nominal data. Converting continuous data into categories is less optimal than leaving the data continuous because the high and low categories might be influenced by outliers. There are also difficulties in interpreting the importance of results for values that are adjacent to the cutoff points that create the categories. For example, if a cutoff is created for cardiovascular risk categories at 5-year intervals, the resultant risk score may assign a very different risk for a 49-year-old patient than one who is 51 years old, yet their risk is probably very similar.

cause, causation: something that brings about an effect or result; to be distinguished from association, especially in observational studies. To establish something as a cause it must be known to precede the effect. The concept of causation includes the contributory cause, the direct cause, and the indirect cause. In general, causality should be assumed only from randomized clinical trials. Observational studies demonstrate associations and usually not causation. Wording in manuscripts should be consistent with these distinctions.

censored data: data in which the true value is replaced by some other value. Censoring has 2 different statistical connotations: (1) data in which extreme values are reassigned to some predefined, more moderate value and (2) data in which values have been assigned to individuals for whom the actual value is not known, such as in survival analyses for individuals who have not experienced the outcome (usually death) at the time the data collection was terminated.

The term left-censored data means that data were censored from the low end or left of the distribution. Typically, it means that the start time for some event is not known. For example, if studying the age at which women get breast cancer, a patient is enrolled who had breast cancer already and the age at onset is unknown, that observation would be left-censored. This also means that all that is known about the age at onset of the breast cancer is that it was less than the age of the woman when she was enrolled.

Right-censored data come from the high end or right of the distribution98 (eg, in survival analyses). For right-censored data, the end time is not known, and all that is known is that it is greater than some value. If studying the age at death from breast cancer in a woman who was still alive at the end of the study, that observation would be right-censored because all that is known about how old she will be when she dies is that it will be older than she was at the end of the study.

central limit theorem: the means of random samples from any distribution will themselves have a normal distribution. Increasing the number of values increases the probability that the calculated mean value is the same as the real one for the population. Stated another way: the probability that any calculated mean value is the same as the actual total population mean value decreases as the sample size becomes smaller.108 This is the basis for the importance of the normal distribution in statistical testing.93

central tendency: measures such as the mean, mode, and median that provide a single numerical value attempting to characterize a data set.108 The mean characterizes a gaussian distribution of numbers as the average value. When data are not distributed in a gaussian distribution, the mean should not be used. In that case, it is best to use the median value, which is the value with half the numbers being greater than it and half being smaller than it. The mode is the most common number in a distribution of numbers.98

χ2 test (chi-square test): a test of significance of categorical data based on the χ2 statistic. For each cell, the expected value is subtracted from the observed value, and the square of this difference is divided by the expected value. The sum of all these is called the χ2 statistic. The χ2 test assumes there is no association between the observed and expected numbers. When they are close to one another, the χ2 value is small, and when they are far apart, it is large.
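
The calculation of the χ2 statistic described above can be sketched as follows. The function name chi_square_statistic and the counts are hypothetical and serve only to illustrate the arithmetic.

```python
def chi_square_statistic(observed, expected):
    """Sum over all cells of (observed - expected)^2 / expected."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical counts: observed values far from the expected values
print(f"{chi_square_statistic([10, 20], [15, 15]):.2f}")  # prints 3.33
# Observed values equal to the expected values give a statistic of 0
print(f"{chi_square_statistic([15, 15], [15, 15]):.2f}")  # prints 0.00
```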

→ The χ2 test can be used to assess the goodness of fit for a model by comparing the values of some parameter estimated by a model with those actually observed in the original data. The χ2 test can also compare an observed variance with hypothetical variance in normally distributed samples.91 In the case of a continuous independent variable and a nominal dependent variable, the χ2 test for trend can be used to determine whether a linear relationship exists (eg, the relationship between systolic blood pressure and stroke).95

→ If there are multiple categories of data and there is a question of there being a trend in the data such that the categories increase or decrease in a consistent way, the χ2 test is modified to test for the presence of the trend. This is known as the Cochran-Armitage test for trend.

→ The P value for the χ2 test is determined from χ2 tables with the use of the specified α level and the df calculated from the number of cells in the χ2 table. The χ2 statistic should be reported to no more than 1 decimal place; if the Yates correction was used, that should be specified. See also contingency table.

Example: The exercise intervention group was least likely to have experienced a fall in the previous month (χ²₃ = 17.7, P = .02). [See 21.9.4 Editing, Proofreading, Tagging, and Display, Specific Uses of Fonts and Styles, Italics.]

Example: We compared the responses of attending physicians and advanced practice nurses with the use of the χ2 test of significance at the level of P < .05.

Note that the df for χ²₃ is specified using a subscript 3; it is derived from the numbers of rows and columns in the χ2 table, calculated as (number of rows − 1)(number of columns − 1) (for example, a 2 × 4 table has 3 df). The value 17.7 is the χ2 value. The P value is determined from the χ2 value and df.

Results of the χ2 test may be biased if there are too few observations (generally <5) per cell. In this case, the Fisher exact test is preferred.

choropleth map: map of a region or country that uses shading to display quantitative data98 (see 4.2.3, Maps).

chunk sample: subset of a population selected for convenience without regard to whether the sample is random or representative of the population.93 Synonymous with convenience sample, which is the more commonly used term.

cluster randomization: the assignment of groups (eg, schools, clinics) rather than individuals to intervention and control groups. This approach is often used when assignment by individuals is likely to result in contamination (eg, if adolescents within a school are assigned to receive or not receive a new sex education program, it is likely that they will share the information they learn with one another; instead, if the unit of assignment is schools, entire schools are assigned to receive or not receive the new sex education program). Cluster assignment is typically randomized, but it is possible (although not advisable) to assign clusters to the treatment or control by other methods.

Cochran-Armitage test for trend: when there are multiple categories of data and there is a question of the existence of a trend in the data such that the categories increase or decrease in a consistent way, the χ2 test is modified to test for the presence of the trend.

Cochran Q test: used when there are binary (yes/no) data and the question is whether the percentage of successes (yes answers or 1s) differs between 2 or more groups. The groups must be matched (ie, have the same number of participants). The analysis results in a Q statistic, which, with the df, determines the P value; if the result is significant, the hypothesis that the proportion of successes is the same between groups is rejected, and one concludes that the variation between the 2 or more groups cannot be explained by chance alone.93 Often used to determine whether different observers of the same phenomenon yield consistent observations. See also interobserver bias and I2 statistic.

coefficient of determination: square of the correlation coefficient, used in linear or multiple regression analysis. This statistic indicates the proportion of the variation of the dependent variable that can be predicted from the independent variable.95 If the analysis is bivariate, the correlation coefficient is indicated as r and the coefficient of determination is r2. If the correlation coefficient is derived from multivariate analysis, the correlation coefficient is indicated as R and the coefficient of determination is R2. See also correlation coefficient.

Example: The sum of the R2 values for age and body mass index was 0.23. Twenty-three percent of the variance could be explained by those 2 variables. The correlation coefficient between body mass index and systolic blood pressure was 0.8. There is a 64% chance that systolic blood pressure can be predicted from a body mass index value.

coefficient of variation: ratio of the standard deviation (SD) to the mean. The coefficient of variation is expressed as a percentage and is used to compare dispersions of different samples. It is typically used to determine how consistent a measure is when repeated many times. The smaller the coefficient of variation, the greater the precision of the measurement.100 The coefficient of variation is also used when the SD is dependent on the mean (eg, the increase in height with age is accompanied by an increasing SD of height in the population). The coefficient of variation has no units of measure.
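
As a brief illustration of the ratio defined above, the following Python sketch computes the coefficient of variation as a percentage. The function name coefficient_of_variation and the measurements are hypothetical, and the use of the sample SD (n − 1 in the denominator) is an assumption.

```python
import statistics


def coefficient_of_variation(values):
    """SD divided by the mean, expressed as a percentage."""
    mean = statistics.fmean(values)
    if mean == 0:
        raise ValueError("undefined when the mean is 0")
    return statistics.stdev(values) / mean * 100.0  # sample SD (assumed)


# Hypothetical repeated measurements of the same quantity
print(f"{coefficient_of_variation([8, 10, 12]):.1f}%")  # prints 20.0%
```

Because the SD and the mean share the same units, the units cancel in the ratio.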

cohort: a group of individuals who share a common exposure, experience, or characteristic, or a group of individuals followed up or traced over time in a cohort study.93

cohort effect: change in rates that can be explained by the common experience or characteristic of a group or cohort of individuals. A cohort effect implies that a current pattern of variables may not be generalizable to a different cohort.93

Example: The decline in socioeconomic status with age was a cohort effect explained by fewer years of education among the older individuals.

cohort study: study of a group of individuals, some of whom are exposed to a variable of interest (eg, a drug treatment or environmental exposure), in which participants are followed up over time to determine who develops the outcome of interest and whether the outcome is associated with the exposure. Cohort studies may be concurrent (prospective) or nonconcurrent (retrospective)95 (see 19.3.1, Cohort Studies).

→ Whenever possible, a participant’s outcome should be assessed by individuals who do not know whether the participant was exposed. See also blinded assessment.

concordant pair: pair in which both individuals have the same trait or outcome (as opposed to discordant pair). Used frequently in twin studies.98

conditional probability: probability that an event (E) will occur given the occurrence of F, called the conditional probability of E given F. The reciprocal is not necessarily true: the probability of E given F may not be equal to the probability of F given E.104
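
A small numerical sketch may help show why the probability of E given F can differ from the probability of F given E. The counts and the function name conditional_probability are hypothetical.

```python
def conditional_probability(n_both, n_condition):
    """Estimate P(E given F) as count(E and F) / count(F)."""
    if n_condition <= 0:
        raise ValueError("the conditioning event must have occurred")
    return n_both / n_condition


# Hypothetical counts: E = has the disease, F = positive test result
n_e_and_f = 30   # individuals with the disease and a positive result
n_f = 100        # all individuals with a positive result
n_e = 40         # all individuals with the disease

print(conditional_probability(n_e_and_f, n_f))  # P(E given F): prints 0.3
print(conditional_probability(n_e_and_f, n_e))  # P(F given E): prints 0.75
```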

confidence interval (CI): range within which one can be confident (usually 95% confident, to correspond to an α level of .05) that the population value the study is intended to estimate lies.95 The CI is an indication of the precision of an estimated population value.

→ Confidence intervals used to estimate a population value usually are symmetric or nearly symmetric around a value, but CIs used for relative risks and odds ratios may not be. Confidence intervals are preferable to P values because they convey information about precision as well as statistical significance of point estimates.

→ Confidence intervals are expressed with a hyphen separating the 2 values. To avoid confusion, the word to replaces hyphens if one of the values is a negative number. Units that are closed up with the numeral are repeated for each CI; those not closed up are repeated only with the last numeral (see 19.4, Significant Digits and Rounding Numbers, and 18.4, Use of Digit Spans and Hyphens).

Example: The odds ratio was 3.1 (95% CI, 2.2-4.8). The prevalence of disease in the population was 1.2% (95% CI, 0.8%-1.6%).

confidence limits (CLs): upper and lower boundaries of the confidence interval, expressed with a comma separating the 2 values.98

Example: The mean (95% confidence limits) was 30% (28%, 32%).

confounding: (1) a situation in which the apparent effect of an exposure on risk is caused by an association with factors other than those that are being studied and can influence the outcome; (2) a situation in which the effects of 2 or more causal factors as shown by a set of data cannot be separated to identify the unique effects of any of them; (3) a situation in which the measure of the effect of an exposure on risk is distorted because of the association of exposure with another factor(s) that influences the outcome under study.98 See also confounding variable.

confounding variable: variable that can cause or prevent the outcome of interest, is not an intermediate variable, and is associated with the factor under investigation. Unless it is possible to adjust for confounding variables, their effects cannot be distinguished from those of the factors being studied. Bias can occur when adjustment is made for any factor that is caused in part by the exposure and also is correlated with the outcome.98 Multivariate analysis is used to control the effects of confounding variables that have been measured but cannot account for unmeasured confounding variables.

contingency coefficient: the coefficient C (not to be confused with the C statistic), used to measure the strength of association between 2 characteristics in a contingency table.104

contingency table: table created when categorical variables are used to calculate expected frequencies in an analysis and to present data, especially for a χ2 test (2-dimensional data) or log-linear models (data with at least 3 dimensions). A typical 2 × 2 table might represent patients having a disease in the columns and those with a positive test result in the rows. Each of the 4 cells would have the number of patients who do or do not have the disease and do or do not have a positive test result. A 2 × 3 contingency table has 2 rows and 3 columns. The df are calculated as (number of rows − 1)(number of columns − 1). Thus, a 2 × 3 contingency table has 6 cells and 2 df.
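
The df rule above, together with the usual way of obtaining expected frequencies from the row and column totals, can be sketched in Python. The function names and the counts are hypothetical; the expected-frequency formula (row total × column total ÷ grand total) is the standard one for a test of independence and is an addition not spelled out in this entry.

```python
def expected_frequencies(table):
    """Expected count per cell: row total x column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def degrees_of_freedom(table):
    """(number of rows - 1)(number of columns - 1)."""
    return (len(table) - 1) * (len(table[0]) - 1)


# Hypothetical 2 x 3 contingency table of counts
table = [[10, 20, 30],
         [20, 20, 20]]
print(expected_frequencies(table))  # prints [[15.0, 20.0, 25.0], [15.0, 20.0, 25.0]]
print(degrees_of_freedom(table))    # prints 2
```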

continuous data: data that contain all possible real number values that may exist between 2 boundaries. Examples of measures using continuous data include blood pressure, height, and weight. This differs from categorical data, which have nonnumerical values (they are named or nominal). Sex and race are examples of categorical data.95 There are 2 kinds of continuous data: ratio data and interval data. Ratio-level data have a true zero, and thus numbers can meaningfully be divided by one another (eg, weight, systolic blood pressure, cholesterol level). For instance, 75 kg is half as heavy as 150 kg. Interval data may be measured with a similar precision but lack a true zero point. Thus, 32 °C is not half as warm as 64 °C, although temperature may be measured on a precise continuous scale. Continuous data include more information than categorical, nominal, or dichotomous data. Use of parametric statistics requires that continuous data have a normal distribution or that the data can be transformed to a normal distribution (eg, by computing logarithms of the data).

contrast: procedure used in ANOVA to determine if statistically significant differences exist between groups. Weights that add up to zero are assigned to the groups. If there are 3 groups, the one being tested for a difference from the others might be assigned a value of 1.0 and each of the 2 remaining groups assigned a value of −0.5.

contributory cause: independent variable (cause) that is considered to contribute to the occurrence of the dependent variable (effect), typically in a randomized clinical trial. That a cause is contributory should not be assumed unless all of the following have been established: (1) a relationship exists between the putative cause and effect, (2) the cause precedes the effect in time, and (3) altering the cause alters the probability of occurrence of the effect.95 Other factors that may contribute to establishing a contributory cause include the concept of biological plausibility, the existence of a dose-response relationship, and consistency of the relationship when evaluated in different settings.

control: in a case-control study, the designation for an individual without the disease or outcome of interest; in a cohort study, the individuals not exposed to the independent variable of interest; and in a randomized clinical trial, the group receiving a placebo, sham treatment, or standard treatment rather than the intervention under study.

control group: a group that does not receive the experimental intervention. In many studies, the control group receives either usual care or a placebo.

controlled clinical trial: study in which a group receiving an experimental treatment is compared with a control group receiving a placebo or an active treatment other than the experimental one (see 19.2.1, Parallel-Design Double-blind Trials).

convenience sample: sample of participants selected because they were available for the researchers to study, not because they are necessarily representative of a particular population.

→ Use of a convenience sample limits generalizability and can confound the analysis, depending on the source of the sample. For instance, if a sample of patients is selected from a group of patients who had undergone cardiac catheterization or echocardiography to compare cardiac auscultation, echocardiography, and cardiac catheterization, the patients are unlikely to resemble the population at large. This is because patients undergo the tests that served as the basis for which they were selected because they have already been found to have cardiac abnormalities. Consequently, the spectrum of cardiac auscultatory findings in the convenience sample will differ from that of the general population and, most likely, will show many more abnormalities than found in unselected patients.

correlation: a general term meaning that there is an association between 2 variables.95 The strength of the association is described by the correlation coefficient. There are many reasons why 2 variables may be correlated, but it is important to remember that correlation alone does not prove causation. See also agreement.

→ The Kendall τ rank correlation test is used when testing 2 ordinal variables, the Pearson product moment correlation is used when testing 2 normally distributed continuous variables, and the Spearman rank correlation is used when testing 2 nonnormally distributed continuous variables.100

→ Correlation is often depicted graphically by means of a scatterplot of the data (see Figure 4.2-4 in, Statistical Graphs). The more circular a scatterplot, the smaller the correlation; the more linear a scatterplot, the greater the correlation.

correlation coefficient: measure of the association between 2 variables. The coefficient falls between −1 and 1; the sign indicates the direction of the relationship and the number indicates the magnitude of the relationship. A plus sign indicates that the 2 variables increase or decrease together; a minus sign indicates that increases in one are associated with decreases in the other. A value of −1 or 1 indicates that the sample values fall in a straight line, whereas a value of 0 indicates no relationship. The correlation coefficient should be followed by a measure of the significance of the correlation, and the statistical test used to measure correlation should be specified.

Example: Body mass index increased with age (Pearson r = 0.61; P < .001); years of education decreased with age (Pearson r = −0.48; P = .01).

→ When 2 variables are compared, the correlation coefficient is expressed by r; when more than 2 variables are compared by multivariate analysis, the correlation coefficient is expressed by R. The symbol r2 or R2 is termed the coefficient of determination and indicates the amount of variation in the dependent variable that can be explained by knowledge of the independent variable.

For example, if the correlation coefficient is 0.5, then the coefficient of determination is 0.25. This means that 25% of the variance of one variable can be explained by the other. Another way to express the relationship is that if one variable is known, there is a 25% chance of accurately predicting the value of the other variable.
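
The relationship between r and the coefficient of determination can be illustrated with a short Python sketch that computes the Pearson product moment correlation directly from its definition. The function name pearson_r and the data are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson product moment correlation coefficient for paired values."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need 2 or more paired observations")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical paired measurements
age = [30, 40, 50, 60, 70]
bmi = [22.0, 25.5, 24.0, 28.0, 27.5]
r = pearson_r(age, bmi)
print(f"r = {r:.2f}, r2 = {r * r:.2f}")
```

The sign of r gives the direction of the relationship, and squaring r removes the sign, which is why r2 always falls between 0 and 1.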

cost-benefit analysis: economic analysis that compares the costs accruing to an individual for some treatment, process, or procedure and the ensuing medical consequences, with the benefits of reduced loss of earnings resulting from prevention of death or premature disability. The cost-benefit ratio is the ratio of marginal benefit (financial benefit of preventing 1 case) to marginal cost (cost of preventing 1 case)98 (see 19.3.7, Study Design and Statistics, Observational Studies, Economic Analyses).

cost-effectiveness analysis: comparison of strategies to determine which provides the most clinical value for the cost.100 A preferred intervention is the one that will cost the least for a given result or be the most effective for a given cost.79 Outcomes are expressed by the cost-effectiveness ratio, such as cost per year of life saved (see 19.3.7, Economic Analyses).

cost-utility analysis: form of economic evaluation in which the outcomes of alternative procedures are expressed in terms of a single utility-based measurement, most often the quality-adjusted life-year (QALY).98

covariates: variables that may influence an outcome of interest but are not the main variables being studied. Controlling for a covariate may improve the understanding of the relationship between the variable of interest and the outcome being studied. For example, if the relationship between the degree of hypertension and stroke occurrence is examined, other factors, such as age or cholesterol level, will influence stroke outcomes and are considered covariates. How hypertension influences stroke is better understood when age and cholesterol levels are controlled for by including them in statistical analyses.

Cox-Mantel test: a nonparametric method for comparing 2 survival curves. This method does not assume any one particular distribution of data,104 similar to the log-rank test.95

Cox proportional hazards regression model (Cox proportional hazards model): model used to assess rate data (number of items per unit time) as opposed to proportions, which are analyzed by logistic regression. In addition to an outcome such as alive or dead, the time it takes to experience that outcome (time to event) is incorporated in Cox proportional hazards regression, adding power to the analysis above that available from logistic regression. The hazard is the probability that if a person survives to a certain time, that person will not survive to make it into the next observed interval.

Cox proportional hazards regression assesses the influence some exposure has on the time it takes for an outcome to occur. Individuals with and without the exposure are compared with the assumption that the exposure’s influence will be proportional in both groups (ie, that changes beyond the baseline hazard will be multiples of one another). When the hazards are proportional, the groups with and without the exposure will be represented as parallel lines when the outcome is graphed as a function of time. An indicator that the proportional hazards assumption is not met is when Kaplan-Meier curves of the exposed and unexposed groups are not parallel and cross one another. One of the several procedures that test the proportionality of the data should be used when Cox proportional hazards regression is performed. Although Cox proportional hazards regression assumes that the hazards are proportional, corrections can be made to account for nonproportionality.

Accounting for proportionality is important because the mathematics for the analysis depend on a ratio that erases the baseline risk from the equation. As a consequence, results from Cox proportional hazards regression can only provide relative risks and not absolute risks because the absolute risk of an event relies on knowing the event’s baseline risk. Because it is a regression procedure, covariates can be added (as they are in logistic regression) to assess the influence of the exposure variable while assuming that the added covariates are held constant and do not influence the outcome.92,104

criterion standard (also known as reference standard): test considered to be the diagnostic standard for a particular disease or condition, used as a basis of comparison for other (usually noninvasive) tests. An outdated term is gold standard, which refers to a time when the monetary system was based on the value of gold. This is no longer the case and the term is obsolete.98 Ideally, the sensitivity and specificity of the criterion standard for the disease should be 100%. See also diagnostic discrimination.

Cronbach α: used to determine the consistency between scores when they are used together to generate an aggregate score. For example, if there is a 10-question quality-of-life instrument with each of the 10 questions having answers ranging from 1 to 5 and the overall score representing the quality of life is the sum of all 10 scores, then the 10 questions should be consistent with one another. Statistically, each of them should be highly correlated with one another.98 The Cronbach α ranges from 0 (little internal consistency between the questions) to 1 (the questions are highly consistent with each other).104
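
For illustration, the Cronbach α can be computed from item-level scores as sketched below. The formula used, α = [k/(k − 1)] × (1 − sum of item variances ÷ variance of total scores), is the commonly cited one and is not given in this glossary; the function name cronbach_alpha and the responses are hypothetical.

```python
import statistics


def cronbach_alpha(scores):
    """scores: one row per respondent, one column per question."""
    k = len(scores[0])  # number of questions
    if k < 2:
        raise ValueError("need 2 or more questions")
    # Variance of each question across respondents
    item_variances = [statistics.variance(item) for item in zip(*scores)]
    # Variance of the aggregate (summed) score across respondents
    total_variance = statistics.variance([sum(row) for row in scores])
    return (k / (k - 1)) * (1.0 - sum(item_variances) / total_variance)


# Hypothetical responses (1 to 5) from 5 respondents to 3 questions
responses = [
    [1, 2, 1],
    [2, 2, 3],
    [3, 3, 3],
    [4, 5, 4],
    [5, 4, 5],
]
print(f"{cronbach_alpha(responses):.2f}")
```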

cross-design synthesis: method for evaluating outcomes of medical interventions, developed by the US General Accounting Office, which pools results from databases of randomized clinical trials and other study designs. It is a form of meta-analysis (see 19.3.6, Meta-analyses).98

cross-level bias: a mechanism underlying the ecologic fallacy. Ecologic fallacy occurs when individual effects that exist are not found when data are examined at a group level. One cause for this is cross-level bias. Examples include confounders being distributed differentially between groups or the baseline prevalence of the disorder differing between groups when the risk for disease depends on the group prevalence. Cross-level bias occurs when an intervention that is known to be effective on an individual level is administered at different rates in different facilities to patients with the disease the intervention is intended to treat. This occurs when clinicians apply differing criteria for administering a treatment based on their own practice patterns.

crossover design: method of comparing 2 or more treatments or interventions. Individuals initially are randomized to one treatment or the other; after completing the first treatment, they are crossed over to 1 or more other randomization groups and undergo other courses of treatment being tested in the experiment. Advantages are that a smaller sample size is needed to detect a difference between treatments because a paired analysis is used to compare the treatments in each individual, but the disadvantage is that an adequate washout period is needed after the initial course of treatment to avoid carryover effect from the first to the second treatment. Order of treatments should be randomized to avoid potential bias104 (see 19.2.2, Crossover Trials).

cross-sectional study: study that identifies participants with and without the condition or disease under study and the characteristic or exposure of interest at the same point in time.95

→ Causality is difficult to establish in a cross-sectional study because the outcome of interest and associated factors are simultaneously assessed. Demonstrating causality requires an experimental and a control group, an intervention, and then an observation period that follows delivery of the intervention.

crude death rate: total deaths during a year divided by the midyear population. Deaths are usually expressed per 100 000 persons.104

cumulative incidence: number of people who experience onset of a disease or outcome of interest during a specified period; may also be expressed as a rate or ratio.98

Cutler-Ederer method (also known as the actuarial life-table method): form of life-table analysis that uses actuarial techniques. The method assumes that the times at which follow-up ended (because of death or the outcome of interest) are uniformly distributed during the period, as opposed to the Kaplan-Meier method, which assumes that termination of follow-up occurs at the end of the time block. Therefore, Cutler-Ederer estimates of risk tend to be slightly higher than Kaplan-Meier estimates.109 Often an intervention and control group are depicted on 1 graph, and the curves are compared by means of a log-rank test.

cut point: in testing, the arbitrary level at which “normal” values are separated from “abnormal” values, often selected at the point 2 SDs from the mean. See also receiver operating characteristic curve.98

DALY: see disability-adjusted life-years.

data: collection of items of information.98 (Datum, the singular form of this word, is rarely used.) See 7.5.2, False Singulars.

data dredging (also known as a “fishing expedition”): jargon meaning post hoc analysis, with no a priori hypothesis, of several variables collected in a study to identify variables that have a statistically significant association for purposes of publication. One form of this is called HARKing (hypothesizing after the results are known).

→ Although post hoc analyses occasionally can be useful to generate hypotheses, data dredging increases the likelihood of a type I error and should be avoided. If post hoc analyses are performed, they should be declared as such and the number of post hoc comparisons performed specified. They should always be considered exploratory analyses.

decision analysis: process of identifying all possible choices and outcomes for a particular set of decisions to be made regarding patient care. Decision analysis generally uses preexisting data to estimate the likelihood of occurrence of each outcome. The process is displayed as a decision tree, with each node depicting a branch point that represents a decision in treatment or intervention to be made (usually represented by a square at the branch point) or possible outcomes or chance events (usually represented by a circle at the branch point). The relative worth of each outcome may be expressed as a utility, such as the quality-adjusted life-year.98

degrees of freedom (df): see df.

dependent variable: outcome variable of interest in any study; the outcome that one intends to explain or estimate95 (eg, death, myocardial infarction, or reduction in blood pressure). Multivariable analysis controls for independent variables or covariates that might modify the occurrence of a single dependent variable (eg, age, sex, and other medical diseases or risk factors). When there is more than 1 dependent variable, regressions should be referred to as multivariate.

descriptive statistics: method used to summarize or describe data with the use of the mean, median, SD, SE, or range or to convey data in graphic form (eg, by using a histogram, shown in Figure 4.2-5 in 4.2.1, Statistical Graphs) for purposes of data presentation and analysis.104

df (degrees of freedom) (df is not expanded at first mention): the number of arithmetically independent comparisons that can be made among members of a sample. In a contingency table, df is calculated as (number of rows − 1)(number of columns − 1).

Another definition for df was put forth by Sir Ronald Fisher: it is the number of independent values needed to determine a system. For example, for the 4-number system defining the average, (1 + 2 + 3 + 4)/4 = 2.5, the df is the number of observations (4) minus the number of estimated parameters (1, the mean) = 3 df. This is because the system can be recreated from 3 values: if any 3 of the values 1, 2, 3, and 4 are selected, the fourth has to be the remaining number if the mean is to be 2.5.

For a t test or regression analysis, it is the number of observations minus the number of unknown parameters to be estimated. When regression results are reported as “adjusted,” it means that the sample size used in calculations is the number of observations (n) minus the number of parameters estimated. These adjusted values are the ones that should be reported as the regression result.

→ The df should be reported as a subscript after the related statistic, such as the t test, analysis of variance, and χ2 test (eg, χ²₃ = 17.7, P = .02; in this example, the subscript 3 is the df).
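The 2 calculations described in this entry can be sketched as follows. The function names are hypothetical; the first applies the contingency-table rule, and the second shows why the 4-number example has 3 df (once 3 values and the mean are fixed, the fourth value is determined).

```python
def contingency_df(n_rows, n_cols):
    # df for a contingency table: (rows - 1)(columns - 1)
    return (n_rows - 1) * (n_cols - 1)

def remaining_value(known_values, mean, n):
    # With the mean fixed, the last of n values is forced by the others
    return mean * n - sum(known_values)

print(contingency_df(2, 2))                # 1
print(remaining_value([1, 2, 4], 2.5, 4))  # 3.0
```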

diagnostic discrimination: statistical assessment of how the performance of a clinical diagnostic test compares with the criterion standard (or reference standard). To assess a test’s ability to distinguish an individual with a particular condition from one without the condition, the researcher must (1) determine the variability of the test, (2) define a population free of the disease or condition and determine the normal range of values for that population for the test (usually the central 95% of values, but in tests that are quantitative rather than qualitative, a receiver operating characteristic curve may be created to determine the optimal cut point for defining normal and abnormal), and (3) determine the criterion standard for a disease (by definition, the criterion standard should have 100% sensitivity and specificity for the disease) with which to compare the test. Diagnostic discrimination is reported with the performance measures sensitivity, specificity, positive predictive value, and negative predictive value; false-positive rate; and the likelihood ratio.95

→ Because the values used to report diagnostic discrimination are ratios, they can be expressed either as the ratio, using the decimal form, or as the percentage, by multiplying the ratio by 100.

Example: The test had a sensitivity of 0.80 and a specificity of 0.95; the false-positive rate was 0.05.

Or: The test had a sensitivity of 80% and a specificity of 95%; the false-positive rate was 5%.

→ When the diagnostic discrimination of a test is defined, the individuals tested should represent the full spectrum of the disease and reflect the population on which the test will be used. For example, if a test is proposed as a screening tool, it should be assessed in the general population.
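A minimal sketch of how the performance measures listed above follow from a 2 × 2 comparison of the test against the criterion standard. The counts are hypothetical and were chosen to reproduce the sensitivity of 0.80 and specificity of 0.95 in the example; the positive and negative likelihood ratio formulas are the usual ones and are an assumption here, because the glossary entry does not spell them out.

```python
def diagnostic_measures(tp, fp, fn, tn):
    """tp/fp/fn/tn: true-positive, false-positive, false-negative, true-negative counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "false_positive_rate": fp / (fp + tn),   # equals 1 - specificity
        "false_negative_rate": fn / (fn + tp),   # equals 1 - sensitivity
        "ppv": tp / (tp + fp),                   # positive predictive value
        "npv": tn / (tn + fn),                   # negative predictive value
        "lr_positive": sensitivity / (1 - specificity),
        "lr_negative": (1 - sensitivity) / specificity,
    }

# Hypothetical counts: 100 people with the disease, 100 without
m = diagnostic_measures(tp=80, fp=5, fn=20, tn=95)
for name, value in m.items():
    # Each ratio can be reported as a decimal or, multiplied by 100, as a percentage
    print(f"{name}: {value:.2f} ({value * 100:.0f}%)" if value <= 1 else f"{name}: {value:.1f}")
```

Unlike sensitivity and specificity, the predictive values in this sketch depend on the 50% prevalence built into the hypothetical counts.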

dichotomous variable: a variable with only 2 possible categories (eg, male/female, alive/dead); synonymous with binary variable.104

→ A variable may have a continuous distribution during data collection but is made dichotomous for purposes of analysis (eg, one group being younger than 65 years and the other being 65 years or older). This is done most often for nonnormally distributed data. Note that the use of a cut point generally converts a continuous variable to a dichotomous one (eg, normal vs abnormal). Data may be categorized by establishing several cut points and separating the data into quantiles. If there is only 1 cut point, the resultant data are dichotomous. Analyses of continuous data are much more powerful than analyses of categorized data and are always preferred (see categorical variables).

difference in differences: determines whether there is a statistically significant effect of some change that was made at a discrete time on some outcome. For example, to determine whether the implementation of a policy requiring Center of Excellence status for Medicare coverage of bariatric surgery was effective, the progressively decreasing rate of complications was analyzed in facilities before and after the policy was implemented. The control group included facilities that were not Centers of Excellence, and the experimental group included facilities that became Centers of Excellence after the policy was implemented. If the slope of the outcomes from the 2 groups deviates after the policy is implemented, then that policy is considered to be associated with the outcomes. The change in slope is determined by the statistical significance of an interaction term in a regression equation.110,111
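In the simplest case, with 1 measurement period before and 1 after the change, the interaction term reduces to the difference between the 2 groups' before-after changes. The sketch below illustrates only that reduced case, with hypothetical complication rates; it does not fit the regression or test statistical significance.

```python
from statistics import mean

def difference_in_differences(control_pre, control_post, treated_pre, treated_post):
    # Change over time in each group
    control_change = mean(control_post) - mean(control_pre)
    treated_change = mean(treated_post) - mean(treated_pre)
    # The extra change in the group exposed to the policy
    return treated_change - control_change

# Hypothetical complication rates (%) per facility
did = difference_in_differences(
    control_pre=[10, 12], control_post=[9, 11],
    treated_pre=[11, 13], treated_post=[7, 9],
)
print(did)  # -3: rates fell 3 points more in the treated facilities
```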

direct cause: contributory cause that is considered to be the most immediate cause of a disease. The direct cause is dependent on the current state of knowledge and may change as more immediate mechanisms are discovered.95

Example: Although several other causes were suggested when the disease was first described, HIV is the direct cause of AIDS.

disability-adjusted life-years (DALY): quantitative indicator of burden of disease that reflects the years lost because of premature mortality and years lived with disability, adjusted for severity.112

discordant pair: pair in which the individuals have different outcomes. In twin studies, only the discordant pairs are informative about the association between exposure and disease.98 The antonym is concordant pair.

discrete variable: variable that is counted as an integer; no fractions are possible.104 Examples are counts of pregnancies or surgical procedures, or responses to a Likert scale.

discriminant analysis: analytic technique used to classify participants according to their characteristics (eg, the independent variables, signs, symptoms, and diagnostic test results) to the appropriate outcome or dependent variable,104 also referred to as discriminatory analysis.92 This analysis tests the ability of the independent variable model to correctly classify an individual in terms of outcome. For example, if studying 3 possible outcomes after discharge for congestive heart failure (no readmission, readmission within 30 days, death within 30 days) and information is available about the patients (eg, age, sex, socioeconomic status), discriminant analysis would be used to determine which variables can best predict which patients have 1 of the 3 outcomes. In this example, age may prove to be the best predictor for death and socioeconomic status the best predictor for readmission.

dispersion: degree of scatter shown by observations. For normally distributed continuous data, dispersion is best represented by the SD. For nonnormally distributed data or categorical data, it is best shown by interquartile ranges (or some other quantile, such as terciles or quintiles) or by some other form of percentiles.93

distribution: the frequency and pattern of all possible values for some variable.95 Distributions may have a normal distribution (bell-shaped curve) or a nonnormal distribution (eg, binomial or Poisson distribution).

dose-response relationship: relationship in which changes in levels of exposure are associated with changes in the frequency of an outcome in a consistent direction. This supports the idea that the agent of exposure (most often a drug) is responsible for the effect seen.93 May be tested statistically by using a χ2 test for trend.

Duncan multiple range test: modified form of the Newman-Keuls test for multiple comparisons. It is a test for determining which pairs of groups have statistically significantly different mean values when there are more than 2 means to be compared. This test is not dependent on first having found that any difference exists between any mean values by the F test of ANOVA.104

Dunn test: multiple comparisons procedure based on the Bonferroni adjustment.104

Dunnett test: multiple comparisons procedure intended for comparing each of a number of treatments with a single control.104

Durbin-Watson test: test to determine whether the residuals from linear regression or multiple regression are independent or, alternatively, are serially (auto) correlated.104 If the Durbin-Watson value is less than 2, then the data are positively autocorrelated; greater than 2, the data are negatively autocorrelated. When the Durbin-Watson statistic equals 2, there is no autocorrelation. Because conventional regression procedures assume data independence, they should not be used in autocorrelated data. When autocorrelation is present, time series statistics (eg, autoregressive integrated moving average [ARIMA] models) should be used for the analysis. Data that are commonly autocorrelated include incidence rates for disease measured in successive periods.
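As a sketch of how the statistic behaves, the function below uses the usual definition (the sum of squared differences between successive residuals divided by the sum of squared residuals), which is not given in this entry. The residual series are hypothetical.

```python
def durbin_watson(residuals):
    # Numerator: squared changes between consecutive residuals
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator

# Residuals that stay on the same side for a while: positive autocorrelation
print(durbin_watson([1, 1, -1, -1]))  # 1.0 (less than 2)
# Residuals that flip sign every period: negative autocorrelation
print(durbin_watson([1, -1, 1, -1]))  # 3.0 (greater than 2)
```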

ecologic fallacy: error that occurs when the existence of a group association is used to imply, incorrectly, the existence of a relationship at the individual level.95 Alternatively, there may be a strong individual effect of an exposure that is not observed when assessed at a group level. For example, when 1930 US Census data were used to assess the relationship between being foreign born and literacy, higher rates within states of being foreign born were unexpectedly associated with higher literacy rates. When the analysis was performed on an individual and not a state level, the expected relationship between being foreign born and low literacy rates was found. This apparent paradox, or ecologic fallacy, occurred because low-literacy immigrants tended to immigrate to regions with high literacy rates.113

ecologic study: examination of groups or populations of patients but not the individuals themselves. Ecologic studies are useful for understanding disease prevalence or incidence and facilitate analysis of large numbers of people. These studies may uncover relationships between exposure factors and diseases. These studies are limited by the ecologic fallacy. Group-level exposure and the outcome may not relate well to individuals within that group. These studies are highly susceptible to bias.

effectiveness: extent to which an intervention is beneficial when implemented under the usual conditions of clinical care for a group of patients,95 as distinguished from efficacy (the degree of beneficial effect seen in a clinical trial) and efficiency (the intervention effect achieved relative to the effort expended in time, money, and resources).

effect of observation: bias that results when the process of observation alters the outcome of the study.95 See also Hawthorne effect.

effect size: observed or expected change in outcome as a result of an intervention. Statistical significance indicates only that a difference between groups exists; it does not indicate how important that difference is. Effect size provides a measure of the magnitude of the differences between groups and should always be considered in addition to the statistical significance. Effect size is calculated from the absolute difference between groups divided by a measure of the variation in the data, such as the SD. Effect size can assist in understanding the clinical significance of research findings.

→ Expected effect size is used during the process of estimating the sample size necessary to achieve a given power. Given a similar amount of variability among individuals, a large effect size will require a smaller sample size to detect a difference than will a smaller effect size.114
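A minimal sketch of the calculation described in this entry, assuming the pooled SD of the 2 groups is used as the measure of variation (one common choice, often called Cohen d; the entry itself says only "such as the SD"). The function name and data are hypothetical.

```python
from math import sqrt
from statistics import mean, variance

def standardized_effect_size(group_a, group_b):
    n_a, n_b = len(group_a), len(group_b)
    # Pooled variance weights each group's sample variance by its df
    pooled_variance = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_variance)

# Hypothetical outcome values in 2 groups
print(standardized_effect_size([2, 4, 6], [1, 3, 5]))  # 0.5
```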

efficacy: degree to which an intervention produces a beneficial result under the ideal conditions of an investigation,95 usually in a randomized clinical trial; it is usually greater than the intervention’s effectiveness.

efficiency: effects achieved in relation to the effort expended in money, time, and resources. Statistically, the precision with which a study design will estimate a parameter of interest.98

effort-to-yield measures: amount of resources needed to produce a unit change in outcome, such as number needed to treat100; used in cost-effectiveness and cost-benefit analyses (see 19.3.7, Economic Analyses).

equivalence trial: trial that estimates treatment effects that exclude any patient-important superiority of the interventions under evaluation. Equivalence trials require a priori definition of the smallest difference in outcomes between these interventions that patients would consider large enough to justify a preference for the superior intervention (given the intervention’s harms and burdens). The CI for the estimated treatment effect at the end of the trial should exclude that difference for the authors to claim equivalence (ie, the confidence limits should be closer to zero than the minimal patient-important difference). This level of precision often requires investigators to enroll large numbers of patients with large numbers of events. Equivalence trials are helpful when investigators want to see whether a cheaper, safer, simpler (or increasingly often, better method to generate income for the sponsor) intervention is neither better nor worse (in terms of efficacy) than a current intervention. Claims of equivalence are frequent when results are not significant, but one must be alert to whether the CIs exclude differences between the interventions that are as large as or larger than those patients would consider important. If they do not, the trial is indeterminate rather than yielding equivalence.11,22

error: difference between a measured or estimated value and the true value. Three types are seen in scientific research: a false or mistaken result obtained in a study; measurement error, a random form of error; and systematic error that skews results in a particular direction.98

estimate: value or values calculated from sample observations that are used to approximate the corresponding value for the population.95

event: end point or outcome of a study; usually the dependent variable. The event should be defined before the study is conducted and assessed by an individual masked to the intervention or exposure category of the study participant.

exclusion criteria: characteristics of potential study participants or other data that will exclude them from the study sample (such as being younger than 65 years, history of cardiovascular disease, expected to move within 6 months of the beginning of the study). Like inclusion criteria, exclusion criteria should be defined before any individuals are enrolled.

expectation: the summation or integration of all possible values of a random variable; synonymous with the mean value.

explanatory variable: variable that helps explain the phenomenon represented by the dependent variable. It is synonymous with independent variable but preferred by some because “independent” in this context does not refer to statistical independence.93

extrapolation: conclusions drawn about the meaning of a study for a target population that includes types of individuals or data not represented in the study sample.95 The value of a missing or unknown variable may be derived from other known variables by extrapolation. This is often performed by fitting a regression equation.

face validity: the extent to which a measurement instrument appears to measure what it is intended to measure.

factor analysis: procedure used to group related variables to reduce the number of variables needed to represent the data. This analysis reduces complex correlations between a large number of variables to a smaller number of independent theoretical factors. The researcher must then interpret the factors by looking at the pattern of “loadings” of the various variables on each factor.100 Loadings describe the relationship between the variables and the factors. Weak factors have little effect on the variable and have small numerical values for the loading values. In theory, there can be as many factors as there are variables, and thus the authors should explain how they decided on the number of factors in their solution. The decision about the number of factors is a compromise between the need to simplify the data and the need to explain as much of the variability as possible. There is no single criterion on which to make this decision, and thus authors may consider a number of indexes of goodness of fit. There are a number of algorithms for rotation of the factors, which may make them more straightforward to interpret. Factor analysis is commonly used for developing scoring systems for rating scales and questionnaires.

false negative: negative test result in an individual who has the disease or condition as determined by the criterion standard (or reference standard).95 See also diagnostic discrimination.

false-negative rate: proportion of test results found or expected to yield a false-negative result; equal to 1 minus sensitivity.95 See also diagnostic discrimination.

false positive: positive test result in an individual who does not have the disease or condition as determined by the criterion standard (or reference standard).95 See also diagnostic discrimination.

false-positive rate: proportion of tests found to or expected to yield a false-positive result; equal to 1 minus specificity.95 See also diagnostic discrimination.

F distribution: distribution of the ratio of 2 independent variance estimates drawn from normally distributed populations; synonymous with variance ratio distribution. It is used for ANOVA.98

Fisher exact test (also known as the Fisher-Yates test and the Fisher-Irwin test93): assesses the independence of 2 variables by means of a 2 × 2 contingency table; used when the frequency in at least 1 cell is small104 (<5).
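For illustration, a 2-sided Fisher exact P value can be computed directly from the hypergeometric probabilities of all 2 × 2 tables that share the observed row and column totals. This sketch sums the probabilities of tables no more likely than the observed one, which is one common convention for the 2-sided value; the function name and the counts are hypothetical.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """2 x 2 table with rows (a, b) and (c, d)."""
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def prob(x):
        # Hypergeometric probability that the top-left cell equals x
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Small relative tolerance guards against floating-point ties
    return sum(
        prob(x) for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-9)
    )

# Hypothetical table with small cell counts
print(round(fisher_exact_two_sided(3, 1, 1, 3), 4))  # 0.4857
```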

fixed-effect model: when used in the context of multilevel (eg, hierarchical) analysis, refers to a standard regression equation that characterizes one level of a system. For example, if modeling the effect of teaching on student performance, if a model is constructed only at the student level, it would be a fixed-effect model. If the model is constructed to include the student and school levels, then it would be a random-effects model. Random-effects models represent the synthesis of 2 or more regression equations into a single regression model and allow the slope or intercept terms to vary.

→ Fixed-effect models are also used to generate a summary estimate of the magnitude of effect in a meta-analysis that restricts inferences to the set of studies included in the meta-analysis and assumes that a single true value underlies all the primary study results. The assumption is that if all studies were infinitely large, they would yield identical estimates of effect; thus, observed estimates of effect differ from one another only because of random error. This model takes only within-study variation into account and not between-study variation.115 The antonym is random-effects model.

Because the model involves a single level, it should be referred to as the singular fixed-effect. Multilevel models should be referred to by the plural, random effects.
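In the meta-analysis sense, a fixed-effect summary estimate is typically an inverse-variance weighted average of the study estimates. The sketch below assumes each study supplies an effect estimate (eg, a log odds ratio) and its within-study variance; the function name and the values are hypothetical.

```python
from math import sqrt

def fixed_effect_pooled(estimates, variances):
    # Only within-study variance enters the weights (no between-study term)
    weights = [1 / v for v in variances]
    pooled = sum(w * y for w, y in zip(weights, estimates)) / sum(weights)
    standard_error = sqrt(1 / sum(weights))
    return pooled, standard_error

# Hypothetical effect estimates and variances from 3 studies
pooled, se = fixed_effect_pooled([0.20, 0.40, 0.30], [0.01, 0.01, 0.04])
print(f"pooled estimate {pooled:.3f}, 95% CI {pooled - 1.96 * se:.3f} to {pooled + 1.96 * se:.3f}")
```

The study with the largest variance contributes the least to the pooled value, which is why larger, more precise studies dominate a fixed-effect summary.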

forest plot: graphical representation of the results of a series of studies when their results are summarized into a single graph; usually used for showing the results of meta-analyses. Each horizontal line represents the point estimate and 95% CIs for a single study. The size of the symbol that represents the study results’ point estimate is proportional to the weight that study was given in the meta-analysis. These weights are usually the inverse of the study variance (see Figure 4.2-16 in 4.2.1, Statistical Graphs). A vertical line is placed where the intervention being assessed has no effect. For hazard ratios, risk ratios, and odds ratios, this is a dotted line at 1.0. For standardized mean differences, this point would be a solid line at zero. A vertical dashed line may be placed at the point estimate for the overall summary finding from a meta-analysis.

Friedman test: a nonparametric test for a design with 2 factors that uses the ranks rather than the values of the observations.93 Nonparametric analog to analysis of variance.

F test (score): alternative name for the variance ratio test (or F ratio),98 which results in the F score. The F test is used to test the significance of analysis of variance.104 When ANOVA is performed, if the F test is statistically significant, there is a difference in the means between 2 or more groups, but the test does not specify which groups are different. To determine which groups are different, a multiple comparison test, such as the Dunnett, Tukey, or other tests, is performed.

Example: There were differences by academic status in perceptions of the quality of both primary care training (F1,682 = 6.71; P = .01) and specialty training (F1,682 = 6.71; P = .01). (The numbers set as subscripts for the F test are the df for the numerator and denominator, respectively.) In the medical literature, the subscripted values are often not included when reporting the results of F tests.

funnel plot: a graphic technique for assessing the possibility of publication bias in a systematic review. The effect measure is typically plotted on the horizontal axis and a measure of the random error associated with each study on the vertical axis. In the absence of publication bias, because of sampling variability, the graph should have the shape of a funnel. If there is bias against the publication of null results or results showing an adverse effect of the intervention, one quadrant of the funnel plot will be partially or completely missing (see Figure 4.2-17, Funnel Plot).

gaussian distribution: see normal distribution.

genome-wide association study: a study that evaluates the association of genetic variation with outcomes or traits of interest by using 100 000 to 1 000 000 or more markers across the genome.

gold standard: see criterion standard.

goodness of fit: agreement between an observed set of values and a second set derived from a statistical model.93 It is considered good statistical practice to test the goodness of fit of any statistical model to assess how well it models the phenomena it is purported to represent. For regressions solved by a least-squares approach (used when the dependent variable is a continuous value, such as cost of care), the goodness of fit is represented by the coefficient of determination (R2). R2 is calculated by subtracting the residual sum of the squares (RSS) from the total sum of the squares (TSS) and dividing this quantity by the total sum of the squares. The closer this value is to 1, the better the model fit. The coefficient of determination should not be confused with the correlation coefficient (r), which measures the association between 2 variables. See correlation coefficient.
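As a sketch, the coefficient of determination can be computed as R2 = (TSS − RSS)/TSS, where TSS is the sum of squared deviations of the observed values from their mean and RSS is the sum of squared differences between observed and model-predicted values. The function name and the values are hypothetical.

```python
def r_squared(observed, predicted):
    mean_observed = sum(observed) / len(observed)
    # Total sum of squares: variation of the observations around their mean
    tss = sum((y - mean_observed) ** 2 for y in observed)
    # Residual sum of squares: variation left unexplained by the model
    rss = sum((y - f) ** 2 for y, f in zip(observed, predicted))
    return (tss - rss) / tss

# Hypothetical observed values and model predictions
print(r_squared([1, 2, 3, 4], [1.5, 1.5, 3.5, 3.5]))  # 0.8
```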

Models fit with maximum likelihood (ML) are more difficult to assess. A pseudo-R2 can be calculated, but it is not equivalent to the R2 calculated for ordinary linear regression models. How the addition of variables to the model improves its fit can be assessed by likelihood ratios (LRs), where the likelihood of the model with the added variable is divided by the likelihood for the model without the variable. Twice the natural logarithm of this ratio approximately follows a χ2 distribution, so the improvement in fit can be tested against a χ2 distribution with df equal to the number of variables added. Because addition of variables to models generally improves the fit, when model fits are compared, they should be penalized by the number of variables they contain. This is done by the Akaike information criterion (AIC) or Bayesian information criterion (BIC). When models with different numbers of variables are compared, the model with the lowest AIC or BIC is generally considered to have the optimal fit.
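The penalty for the number of variables can be seen in the standard definitions AIC = 2k − 2 ln L and BIC = k ln n − 2 ln L, where k is the number of estimated parameters, n is the number of observations, and ln L is the maximized log-likelihood. These definitions are not given in this glossary, and the log-likelihood values below are hypothetical.

```python
from math import log

def aic(log_likelihood, n_params):
    return 2 * n_params - 2 * log_likelihood

def bic(log_likelihood, n_params, n_obs):
    return n_params * log(n_obs) - 2 * log_likelihood

# Hypothetical models fit to the same 100 observations
small = {"log_likelihood": -100.0, "n_params": 3}
large = {"log_likelihood": -99.5, "n_params": 5}

print(aic(**small), aic(**large))  # 206.0 209.0
# The larger model fits slightly better but not enough to offset its
# 2 extra parameters, so the smaller model has the lower (preferred) AIC.
print(bic(**small, n_obs=100) < bic(**large, n_obs=100))  # True
```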

The Hosmer-Lemeshow test is often used to assess the maximum likelihood ratio model fit for frequency data. This test divides the observed and expected data into deciles. A χ2 test is performed with the null hypothesis that no difference exists between any of the 10 deciles. If the Hosmer-Lemeshow test has P > .05, then no difference between observed and expected data is concluded, suggesting a good model fit. The Hosmer-Lemeshow test is not optimal because there is a substantial loss of analytic power when the data are aggregated into deciles; the Hosmer-Lemeshow test may not accurately reflect the goodness of fit for models with large amounts of data.

group association: situation in which a characteristic and a disease both occur more frequently in one group of individuals than another. The association does not mean that all individuals with the characteristic necessarily have the disease.95

group matching (also known as frequency matching95): process of matching during assignment in a study to ensure that the groups have a nearly equal distribution of particular variables. The closeness of the match should be shown by providing the standardized difference (differences divided by pooled SDs).116

Hartley test: test for the equality of variances of a number of populations that are normally distributed based on the ratio between the largest and smallest sample variances.93
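The ratio on which the test is based can be sketched as below. The function name and data are hypothetical, and the sketch stops at the ratio: judging significance requires comparing it with a critical value from a table for the given number of groups and group size, which is not reproduced here.

```python
from statistics import variance

def hartley_ratio(groups):
    # Largest sample variance divided by the smallest
    sample_variances = [variance(group) for group in groups]
    return max(sample_variances) / min(sample_variances)

# Hypothetical measurements from 2 groups of equal size
print(hartley_ratio([[1, 2, 3], [2, 4, 6]]))  # 4.0
```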

Hawthorne effect: effect produced in a study because of the participants’ awareness that they are participating in a study. The term usually refers to an effect on the control group that changes the group in the direction of the outcome, resulting in a smaller effect size.104 A related concept is effect of observation. The Hawthorne effect is different from the placebo effect, which relates to participants’ expectations that an intervention will have specific effects. The Hawthorne effect is commonly seen in quality assurance investigations where measures of certain outcomes are collected and data collection processes are changed because those collecting the data are aware that these measures are now being evaluated.

hazard rate, hazard function: theoretical measure of the likelihood that if individuals enter a certain period they will experience an event before the end of the designated period.98 A number of hazard rates for specific intervals can be combined to create a hazard function.

hazard ratio: the ratio of the hazard rate in one group to the hazard rate in another. It is calculated from the Cox proportional hazards model. The interpretation of the hazard ratio is similar to that of the relative risk. In Cox proportional hazards regression, the ratio is calculated to avoid needing to know the baseline hazard to model how various factors affect the hazard of an event. Because the calculation depends on the ratio, the hazards should change in time at the same rate (have the same slope). This is known as the proportional hazards assumption. When a Cox proportional hazards regression is performed, a test should be performed to ensure that the proportional hazard assumption is met. If the curves cross, the hazards are not proportional. If they are not proportional, the model must be adjusted to account for the lack of proportionality.117,118

hierarchical model: see mixed-model analysis.

heterogeneity: condition in which the qualities of groups that are aggregated are not similar. The term is commonly used in meta-analysis to denote that the studies aggregated in a single analysis are not similar to one another. The degree of heterogeneity is measured with the Q test or the I2 statistic. The antonym is homogeneity.
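A minimal sketch of the 2 measures, using their usual definitions (not given in this entry): Q is the inverse-variance weighted sum of squared deviations of the study estimates from the pooled estimate, and I2 = (Q − df)/Q, expressed as a percentage and set to 0 when negative, with df equal to the number of studies minus 1. The function name and the values are hypothetical.

```python
def cochran_q_and_i2(estimates, variances):
    weights = [1 / v for v in variances]
    pooled = sum(w * y for w, y in zip(weights, estimates)) / sum(weights)
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, estimates))
    df = len(estimates) - 1
    # I2: percentage of variability beyond what chance alone would produce
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    return q, i2

# Hypothetical effect estimates and variances from 2 studies
print(cochran_q_and_i2([0.0, 1.0], [0.25, 0.25]))  # (2.0, 50.0)
```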

heteroscedasticity: systematic deviation of data rather than the data being randomly distributed. See homoscedasticity. Heteroscedasticity in data violates the assumptions for linear regression analysis.

histogram: graphical representation of data in which the frequency (quantity) within each class or category is represented by the area of a rectangle centered on the class interval. The heights of the rectangles are proportional to the observed frequencies (see Figure 4.2-5 in 4.2.1, Statistical Graphs).

Hoeffding independence test: bivariable test of nonnormally distributed continuous data to determine whether the elements of the 2 groups are independent of each other.98

Hollander parallelism test: determines whether 2 regression lines for 2 independent variables plotted against a dependent variable are parallel. The test does not require a normal distribution, but there must be an equal and even number of observations corresponding to each line. If the lines are parallel, then both independent variables predict the dependent variable equally well. The Hollander parallelism test is a special case of the signed rank test.93

homogeneity: the qualities of individuals aggregated into groups are similar to one another. In meta-analyses, homogeneity means that the individual studies that are combined in a single analysis are similar to one another. Homogeneity also refers to the equality of a quantity of interest (such as variance), specifically in a number of groups or populations.93 See antonym heterogeneity for a discussion of measures of homogeneity in meta-analysis.

homoscedasticity: statistical determination that the variance of the different variables under study is equal.98 Homoscedasticity can be assessed by plotting the error (difference between observed data and the regression line) against the independent variables and looking for systematic deviations. If the errors are randomly distributed, then the data are homoscedastic; if not, they are heteroscedastic, and the regression results are invalid. See also heterogeneity.

Hosmer-Lemeshow goodness-of-fit test: a series of statistical steps used to assess goodness of fit for logistic regression analyses. The observed and modeled observations are aggregated into equal-sized groups, and usually 10 groups are used. A χ2 test is performed, and if there is a significant difference found between the observed and modeled groups of data, the model is considered to not fit well. Thus, if P < .05, the model is considered to not fit well, and P values above this level suggest a good fit to the model. The Hosmer-Lemeshow test has limited power to detect differences between observed and fitted data because only 10 groups are compared rather than the much larger number of values involved in the regression itself. The Hosmer-Lemeshow test does not perform well as a goodness-of-fit test for large data sets.119

Hotelling T statistic: generalization of the t test for use with multivariate data; results in a T statistic. Significance can be tested with the variance ratio distribution.93

hypothesis: an educated guess about some phenomenon; a supposition that leads to a prediction that can be tested and found to be supported or refuted.98,120 The null hypothesis in statistical analysis states that no difference exists between groups. Statistical significance is assessed by comparing the groups and determining whether the probability of observing a difference at least as large as the one found, if the null hypothesis were true, is less than 5% (P < .05). Hypothesis testing includes (1) generating the study hypothesis and defining the null hypothesis, (2) determining the level below which results are considered statistically significant, or α level (usually α = .05), and (3) identifying and applying the appropriate statistical test to accept or reject the null hypothesis.

I2 statistic: the I2 statistic is a test of heterogeneity. I2 can be calculated from Cochran Q according to the formula I2 = 100% × (Q − df)/Q, where Q is the Cochran Q statistic and df is its degrees of freedom. Any negative values of I2 are considered equal to 0, so that the range of I2 values is 0% to 100%, indicating no heterogeneity to high heterogeneity.
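
The calculation can be sketched in a few lines of Python, assuming the usual form I2 = 100% × (Q − df)/Q. The function name and the example values (Q = 20 with 9 df, ie, 10 studies) are hypothetical and serve only as an illustration.

```python
def i_squared(q: float, df: int) -> float:
    """Return I2 as a percentage from Cochran Q and its degrees of freedom."""
    if q <= 0:
        return 0.0
    # Negative values are set to 0, so I2 ranges from 0% to 100%.
    return max(0.0, 100.0 * (q - df) / q)


print(i_squared(20.0, 9))  # 55.0
print(i_squared(5.0, 9))   # 0.0 (Q smaller than df is truncated to 0)
```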

imputation: a group of techniques for replacing missing data with values that are likely to have been observed if they were present. Among the simplest methods of imputation is last observation carried forward, in which missing values are replaced by the last observed value. This provides a conservative estimate in cases in which the condition is expected to improve on its own but may be overly optimistic in conditions that are known to worsen over time. In general, last observation carried forward should not be performed. Missing values may also be imputed based on the patterns of other variables. In multiple imputation, missing data are modeled for a data set based on other data to provide the best guess for what the values for the missing data should have been. The process of guessing what the missing data should be is repeated many times, resulting in a large number of data sets. From the multiple data sets, parameters are estimated, as are the CIs for those parameters. Multiple imputation is the preferred method for modeling missing data. Complete case analysis is performed by excluding cases with missing data and only analyzing cases that have all the data of interest. This can result in biased results and/or can be associated with a substantial loss in analytic power and should not be done.121,122

incidence: number of new cases of disease among persons at risk that occur over time,98 as contrasted with prevalence, which is the total number of persons with the disease at any given time. Incidence is usually expressed as a percentage of individuals affected during an interval (eg, year) or as a rate calculated as the number of individuals who develop the disease during a period divided by the number of person-years at risk.

Example: The incidence rate for the disease was 1.2 cases per 100 000 per year.

inclusion criteria: characteristics a study participant must possess to be included in the study population (such as age of 65 years or older at the time of study enrollment and willing and able to provide informed consent). Like exclusion criteria, inclusion criteria should be defined before any participants are enrolled in a study, and such criteria should be mentioned in the Methods section of research articles.

independence, assumption of: assumption that the occurrence of one event is in no way linked to another event. Many statistical tests depend on the assumption that each outcome is independent.98 This may not be a valid assumption if repeated tests are performed on the same individuals (eg, blood pressure is measured sequentially over time), if more than 1 outcome is measured for a given individual (eg, myocardial infarction and death or all hospital admissions), or if more than 1 intervention is made on the same individual (eg, blood pressure is measured during 3 different drug treatments). Tests for repeated measures may be used in those circumstances. If a Durbin-Watson test suggests that autocorrelation is present, then the data should be analyzed using autoregression techniques.

independent variable: variable postulated to influence the dependent variable within the defined area of relationships under study.98 The term does not refer to statistical independence, so some use the term explanatory variable instead.93

Example: Age, sex, systolic blood pressure, and cholesterol level were the independent variables entered into the multiple logistic regression that assessed how these characteristics influenced the dependent variable, mortality.

indirect cause: a factor X can cause an effect on Y directly, but if X acts on some other factor Z that in turn affects Y, X acts on Y as an indirect cause.95

Example: Overcrowding in the cities facilitated transmission of the tubercle bacillus and precipitated the tuberculosis epidemic. (Overcrowding is an indirect cause; the tubercle bacillus is the direct cause.)

inference: conclusions arrived at on the basis of evidence and reasoning.98

Example: Intake of a high-fat diet was significantly associated with cardiovascular mortality; therefore, we infer that eating a high-fat diet increases the risk of cardiovascular death.

instrument error: error introduced in a study when the testing instrument is not appropriate for the conditions of the study or is not accurate enough to measure the study outcome95 (may be attributable to deficiencies in such factors as calibration, accuracy, and precision).

instrumental variable analysis: a method used in observational studies that minimizes the risk of confounding. An instrument is some factor that is correlated with the type of treatment received but not with the outcome. For example, a patient may live near a hospital that only performs one type of intervention but not another. If the patient only goes to the nearest hospital and receives one or the other treatment, then the distance the person lives from the facility or their zip/postal code might be used as an instrumental variable. In this instance, an analysis of outcomes stratified by zip/postal code will yield results that mimic those for randomized trials because patients will only receive one or another intervention based on where they live and not on their clinical condition.

intention-to-treat (ITT) analysis, intent-to-treat (ITT) analysis: analysis of outcomes for individuals based on the treatment group to which they were randomized rather than on which treatment they actually received and whether they completed the study.13 When groups are assessed by the actual treatment received rather than the one they were intended to get, the analysis is referred to as a per-protocol analysis. The ITT analysis maintains an equal distribution of the study group’s baseline characteristics and generally avoids bias associated with changes in treatment plans that occur after the study was initiated. ITT is the preferred main analysis approach for randomized trials. ITT can increase the risk for bias in noninferiority trials, and although ITT should be the primary analytic approach for noninferiority trials, a per-protocol analysis should also be presented123 (see 19.2, Randomized Clinical Trials).

→ Although other analyses, such as evaluable patient analysis or per-protocol analyses, are often performed to evaluate outcomes based on treatment actually received, the ITT analysis should be presented regardless of other analyses because the intervention itself may influence whether treatment was changed and whether participants dropped out. ITT may bias the results of equivalence and noninferiority trials; for those trials, additional analyses should be presented (see 19.2.3, Equivalence Trials and Noninferiority Trials).

interaction: when 2 variables affect one another in a nonlinear way. It is common for 2 variables to have an additive effect. When the 2 variables are working in concert with one another and have an effect that is greater than what each one adds to an effect by itself, an interaction is present. An example might be that as a person ages he or she becomes more susceptible to hypoglycemia and that when insulin is given hypoglycemia may occur. The effect of insulin on hypoglycemia in older people may be greater than in younger people and greater than expected from studying the hypoglycemic effects of insulin or age alone. See interactive effect, interaction term.

interaction term: variable used in analysis of variance, analysis of covariance, and regression analysis in which 2 independent variables interact with each other (eg, when assessing the effect of energy expenditure on cardiac output, the increase in cardiac output per unit increase in energy expenditure might differ between men and women; the interaction term would enable the analysis to take this difference into account). The term is created by multiplying the 2 interacting variables into a single variable. For example, if age is one variable and insulin dose is another, the interacting term will be age by insulin dose (the age variable multiplied by the insulin dose variable). If the interaction variable is statistically significant in a regression analysis, an interaction between variables is assumed to be present.95

interactive effect: effect of 2 or more independent variables on a dependent variable in which the effect of one independent variable is influenced by the presence of another.93 The interactive effect may be additive (ie, equal to the sum of the 2 effects present separately), synergistic (ie, the 2 effects together have a greater effect than the sum of the effects present separately), or antagonistic (ie, the 2 effects together have a smaller effect than the sum of the effects present separately).

interim analysis: data analysis performed during a clinical trial to monitor treatment effects. Interim analyses, and the stopping rules to be applied if a particular treatment effect is reached, should be prespecified in the study protocol before patient enrollment. Each time an interim analysis is performed, the α level set for establishing statistical significance of the outcome should be adjusted just as one would do for a repeated-measures analysis. This is because each successive look at the data with a statistical significance test increases the likelihood of falsely concluding that a statistically significant result is present when it really is not (type I error). The process for adjusting the α level is called α expenditure.15

interobserver bias: likelihood that one observer is more likely to give a particular response than another observer because of factors unique to the observer or instrument. For example, one physician may be more likely than another to identify a particular set of signs and symptoms as indicative of religious preoccupation on the basis of his or her beliefs, or a physician may be less likely than another physician to diagnose alcoholism in a patient because of the physician’s expectations.104 The Cochran Q test is used to assess interobserver bias.104

interobserver reliability: test used to measure agreement among observers about a particular measure or outcome.

→ Although the proportion of times that 2 observers agree can be reported, this does not take into account the number of times they would have agreed by chance alone. For example, if 2 observers must decide whether a factor is present or absent, they should agree 50% of the time according to chance. The κ statistic assesses agreement while taking chance into account and is described by the equation [(observed agreement) − (agreement expected by chance)]/(1 − agreement expected by chance). The κ value may range from 0 (poor agreement) to 1 (perfect agreement) and may be classified by various descriptive terms, such as slight (0-0.20), fair (0.21-0.40), moderate (0.41-0.60), substantial (0.61-0.80), or near perfect (0.81-0.99).123

→ In cases in which disagreement may have especially grave consequences, such as one pathologist rating a slide “without disease” and another rating a slide “invasive carcinoma,” a weighted κ may be used to grade disagreement according to the severity of the consequences.123 See also Pearson product moment correlation.

interobserver variation: see interobserver reliability.

interquartile range (IQR): the distance between the 25th and 75th percentiles, which is used to describe the dispersion of values for continuous or ordinal data. Like other quantiles (eg, tertiles, quintiles), expressing the range of data is preferred when the data are nonnormally distributed. Data that are not normally distributed should not have their distribution expressed as the SD. The interquartile range describes the inner 50% of values; the interquintile range (20th to 80th percentile) describes the inner 60% of values; the interdecile range (10th to 90th percentile) describes the inner 80% of values.93

interrater reliability: reproducibility among raters or observers; synonymous with interobserver reliability.

interval estimate: see confidence interval.95

intraobserver reliability (or variation): reliability (or, conversely, variation) in measurements by the same person at different times.95 Similar to interobserver reliability, intraobserver reliability is the agreement between measurements by one individual beyond that expected by chance and can be measured by means of the κ statistic or the Pearson product moment correlation.

intrarater reliability: synonymous with intraobserver reliability.

jackknife test: technique for estimating the variance and bias of an estimator. An estimator is a process that results in a numerical value that characterizes a distribution. For example, there are processes that calculate the mean and variance of a sample, and the processes that do the calculation are called estimators. If the estimators are biased, they may result in incorrect estimates of the mean and variance. The jackknife method detects bias in estimators. The jackknife technique calculates a mean by calculating the mean of a sample leaving 1 value out, repeating this for each of the n values in the sample, and calculating the mean of those estimates. If the estimate obtained by the jackknife method differs from the mean calculated by some other method, there is bias in the method used to calculate the mean. A similar process can be used to detect bias in variance estimates. Bootstrapping performs a similar function, but instead of leaving one value out of a subset of the sample, sampling of subsets is repeated many times, and a new mean or variance is calculated.123
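
As a minimal sketch of the leave-one-out idea, the following Python code applies the jackknife to a variance estimator that divides by n (a biased estimator). The bias formula used, (n − 1) × (mean of the leave-one-out estimates − full-sample estimate), is the standard jackknife bias estimate and is an assumption here, since the glossary does not state it. The data and function names are hypothetical.

```python
def plug_in_variance(xs):
    """Variance computed with divisor n, which is a biased estimator."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)


def jackknife(xs, estimator):
    """Return (estimated bias, bias-corrected estimate) for `estimator`."""
    n = len(xs)
    full = estimator(xs)
    # Recompute the estimator with each observation left out in turn.
    leave_one_out = [estimator(xs[:i] + xs[i + 1:]) for i in range(n)]
    loo_mean = sum(leave_one_out) / n
    bias = (n - 1) * (loo_mean - full)
    return bias, full - bias


sample = [0.0, 0.0, 3.0]  # hypothetical data
bias, corrected = jackknife(sample, plug_in_variance)
print(bias, corrected)  # -1.0 3.0
```

For these data the divisor-n variance is 2, the jackknife flags a bias of −1, and the corrected value of 3 matches the usual sample variance computed with divisor n − 1.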

Kaplan-Meier method (also known as the product-limit method): nonparametric method of estimating survival functions and compiling life tables.118 Kaplan-Meier survival curves (see Figure 4.2-3 in, Survival Plots) show the fraction of patients surviving or experiencing an event in any given interval after some sort of intervention. Unlike the Cutler-Ederer method, the Kaplan-Meier method assumes that termination of follow-up occurs at the end of the time block. Therefore, Kaplan-Meier estimates of risk tend to be slightly lower than Cutler-Ederer estimates.95 The horizontal lines on the plots represent the known survival time for the patients being studied. Patients who are censored (ie, those who do not experience the event of interest) are represented as tick marks on a Kaplan-Meier plot. Each vertical line represents an individual patient experiencing an event. Often an intervention and control group are depicted on one graph, and the groups are compared by a log-rank test. Because the method is nonparametric, there is no attempt to fit the data to a theoretical curve. Thus, Kaplan-Meier plots have a jagged appearance, with discrete drops at the end of each interval in which an event occurs.
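
The product-limit calculation can be sketched as follows. This is an illustrative Python implementation with hypothetical data; it assumes the common convention that a patient censored at a given time is still counted as at risk for events occurring at that same time.

```python
def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each distinct event time.

    times  : follow-up time for each patient
    events : 1 if the event occurred at that time, 0 if the patient was censored
    """
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        n_events = sum(1 for ti, e in zip(times, events) if ti == t and e == 1)
        n_leaving = sum(1 for ti in times if ti == t)
        if n_events:
            # Multiply by the conditional probability of surviving past time t.
            survival *= 1 - n_events / at_risk
            curve.append((t, survival))
        # Patients with an event or censored at t leave the risk set.
        at_risk -= n_leaving
    return curve


times = [2, 3, 3, 5, 8]    # hypothetical follow-up times
events = [1, 1, 0, 1, 0]   # 1 = event, 0 = censored
curve = kaplan_meier(times, events)
for t, s in curve:
    print(t, round(s, 3))
```

With these data the estimated survival drops to about 0.8, 0.6, and 0.3 at times 2, 3, and 5; the censored observations produce no drop but reduce the number at risk.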

κ (kappa) statistic: statistic used to measure nonrandom agreement between observers or measurements. It is calculated by subtracting the probability of 2 observers randomly agreeing with one another from the observed frequency that they agree with each other and dividing this quantity by (1 minus the probability that they agree with one another randomly). When κ = 1, there is perfect agreement; if κ = 0, there is no agreement.98 See interobserver and intraobserver reliability.
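
The calculation described above, (observed agreement − agreement expected by chance)/(1 − agreement expected by chance), can be illustrated with a short Python sketch. The 2 × 2 table of ratings is hypothetical, and chance agreement is estimated from the row and column totals, which is the usual approach for an unweighted κ with 2 raters.

```python
def kappa(table):
    """Unweighted kappa for a square agreement table.

    table[i][j] = number of items rated category i by rater A
    and category j by rater B.
    """
    k = len(table)
    total = sum(sum(row) for row in table)
    observed = sum(table[i][i] for i in range(k)) / total
    row_totals = [sum(row) for row in table]
    col_totals = [sum(table[i][j] for i in range(k)) for j in range(k)]
    # Agreement expected by chance, from the marginal totals.
    expected = sum(row_totals[i] * col_totals[i] for i in range(k)) / total ** 2
    return (observed - expected) / (1 - expected)


ratings = [[40, 10],
           [5, 45]]  # hypothetical counts for 100 rated items
print(round(kappa(ratings), 2))  # 0.7
```

Here the raters agree on 85% of items, chance agreement is 50%, and κ is 0.35/0.50 = 0.70.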

Kendall τ (tau) rank correlation: method used to determine if an association exists between 2 ordinal (ie, categorical data that can be ordered from high to low) or ranked variables. Pairs of data between the 2 variables are compared, and if the values of one variable are consistently higher or lower than the other, a correlation between the 2 variables is assumed to exist. When τ = 1, the raters always agreed; if τ = −1, they never agreed; and if τ = 0, the different rankings between raters was completely random.123,124
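
The pairwise comparison can be sketched in Python as follows. This is the simplest form of the statistic (sometimes called τ-a), which makes no adjustment for tied values; that choice, the function name, and the example rankings are assumptions for illustration.

```python
def kendall_tau_a(x, y):
    """Kendall tau without tie correction: (concordant - discordant) / all pairs."""
    n = len(x)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            direction = (x[i] - x[j]) * (y[i] - y[j])
            if direction > 0:
                concordant += 1   # both variables order the pair the same way
            elif direction < 0:
                discordant += 1   # the variables order the pair in opposite ways
    return (concordant - discordant) / (n * (n - 1) / 2)


rater_1 = [1, 2, 3, 4]  # hypothetical rankings
rater_2 = [1, 3, 2, 4]
print(round(kendall_tau_a(rater_1, rater_2), 3))  # 0.667
```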

Kolmogorov-Smirnov test: test used to determine if the distribution of a data set matches a particular probability distribution. This test is performed by subtracting the observed continuous probability distribution from a theoretical one. Typically, this test is used to determine if sample data fit the normal probability distribution. If P ≤ .05, then the observed distribution is different than the theoretical distribution it is thought to conform with.104 The Kolmogorov-Smirnov test may be used to assess goodness of fit.98

Kruskal-Wallis test: the nonparametric equivalent of the 1-way ANOVA test; it does not require normally distributed data or the assumption of equal variance. Used to compare 3 or more groups, this test tests the null hypothesis that there are no differences in the distributions of any group. Like ANOVA, the Kruskal-Wallis test determines only whether differences exist among the groups, not which groups differ. If differences are found, then a multiple comparison test, such as a Tukey or multiple Mann-Whitney test using a Bonferroni correction, can be used to determine if differences exist between individual groups.104 The Kruskal-Wallis test generalizes the 2-sample Wilcoxon rank sum test to the multiple-sample case.93

The Kruskal-Wallis test is performed by ranking all the data from highest to lowest and then separating into the various prespecified groups. The mean rank for each group is compared by χ2 analysis using the number of groups minus 1 as the number of degrees of freedom.
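
A minimal Python sketch of the ranking step and the resulting test statistic (usually denoted H) follows. It uses the standard formula H = 12/[N(N + 1)] × Σ(Ri²/ni) − 3(N + 1), where Ri is the rank sum and ni the size of group i; the formula, the omission of a tie correction, and the example data are assumptions not taken from the glossary. Ranks are assigned in ascending order here, which gives the same statistic as ranking from highest to lowest.

```python
def kruskal_wallis_h(groups):
    """Return (H statistic, degrees of freedom); no tie correction is applied."""
    pooled = sorted(value for group in groups for value in group)
    # Assign each distinct value the average of the ranks it occupies.
    ranks = {}
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1
        ranks[pooled[i]] = (i + 1 + j) / 2
        i = j
    n = len(pooled)
    rank_term = sum(sum(ranks[v] for v in g) ** 2 / len(g) for g in groups)
    h = 12 / (n * (n + 1)) * rank_term - 3 * (n + 1)
    return h, len(groups) - 1


doses = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]  # hypothetical responses for 3 groups
h, df = kruskal_wallis_h(doses)
print(round(h, 2), df)  # 7.2 2
```

The statistic is then compared with a χ2 distribution with the stated degrees of freedom to obtain the P value.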

→ Sample wording: Methods: A Kruskal-Wallis test was used to test for differences among the 7 different drug doses for their effect on antibody response because normality was questionable and sample sizes within each group were small. Results: The Kruskal-Wallis test for comparison of different drug doses indicates that there was a statistically significant difference in the distribution of antibody responses between the groups (χ2 = 15.3, P = .02).124

kurtosis: the way in which a unimodal curve deviates from a normal distribution; may be more peaked (leptokurtic) or more flat (platykurtic) than a normal distribution.104 The measure of the normal distribution’s kurtosis is 3. The other assessment of the shape of a distribution is its skewness (the deviation from being symmetric or how much of the distribution exists in its tails).
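
For illustration, kurtosis can be computed as the fourth central moment divided by the square of the variance, which is the definition under which a normal distribution has a value of 3. The formula is an assumption here because the glossary does not state it, and the data are hypothetical.

```python
def kurtosis(xs):
    """Moment-based kurtosis: fourth central moment / (second central moment)^2."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m4 = sum((x - mean) ** 4 for x in xs) / n
    return m4 / m2 ** 2


flat_sample = [1, 2, 3, 4, 5]  # evenly spread, hypothetical values
print(round(kurtosis(flat_sample), 2))  # 1.7
```

A value below 3, as in this evenly spread sample, corresponds to a platykurtic shape; values above 3 correspond to a leptokurtic shape.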

Latin square: form of complete treatment crossover design used for crossover drug trials that eliminates the effect of treatment order. Each patient receives each drug, but each drug is followed by another drug only once in the array. For example, in the following 4 × 4 array (Table 19.5-2), letters A through D correspond to each of 4 drugs, each row corresponds to a patient, and each column corresponds to the order in which the drugs are given.16

Table 19.5-2. Latin Square

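[Table 19.5-2: columns are First drug, Second drug, Third drug, and Fourth drug; rows are Patient 1, Patient 2, Patient 3, and Patient 4. The letter entries (A through D) in the cells are missing from this copy.]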

See 19.2.2, Crossover Trials.
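
One way to build a square with the property described above (each drug immediately followed by each other drug exactly once) is a construction often called a Williams design, which works for an even number of treatments. The Python sketch below is an illustration under that assumption; the arrangement it prints is not necessarily the one shown in Table 19.5-2.

```python
def williams_square(n):
    """Return an n x n Latin square (n even) balanced for immediate carryover."""
    if n % 2:
        raise ValueError("this construction requires an even number of treatments")
    # First row alternates low and high treatment indices: 0, 1, n-1, 2, n-2, ...
    first = [0]
    low, high = 1, n - 1
    while len(first) < n:
        first.append(low)
        low += 1
        if len(first) < n:
            first.append(high)
            high -= 1
    # Each later row shifts every entry of the first row by 1 (mod n).
    return [[(x + shift) % n for x in first] for shift in range(n)]


square = williams_square(4)
for patient, row in enumerate(square, start=1):
    print(f"Patient {patient}:", " ".join("ABCD"[x] for x in row))
# Patient 1: A B D C
# Patient 2: B C A D
# Patient 3: C D B A
# Patient 4: D A C B
```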

lead-time bias: artifactual increase in survival time that results from earlier detection of a disease, usually cancer, during a time when the disease is asymptomatic. Lead-time bias produces longer survival from the time of diagnosis but not longer survival from the time of onset of the disease.95 Imagine that 2 patients develop small tumors at the same time, of which they will die irrespective of any treatment in 10 years. Then imagine that this particular tumor would become obvious at 7 years. One patient gets screened at 5 years, the tumor is detected, and treatment administered. This patient dies 5 years after screening. The other patient receives treatment at 7 years when the tumor is obvious and only lives for 3 years after treatment. It appears that screening resulted in 2 additional years of life, but this is an artifact because there was a 2-year lead in making a diagnosis. Making an earlier diagnosis did not result in a longer life, but it made cancer screening appear effective. See also length-time bias.

→ Lead-time bias may give the appearance of a survival benefit from screening when in fact the increased survival is only artifactual. Lead-time bias is used more generally to indicate a systematic error that arises when follow-up of groups does not begin at comparable stages in the natural course of the condition.

least significant difference test: one of the multiple comparison tests assessing differences in individual groups when the null hypothesis for an analysis of variance is rejected, meaning that the group means are not all equal. An extension of the t test.95

least-squares method: method of estimation, particularly in regression analysis, that minimizes the sum of the squared differences between the observed responses and the values predicted by a model.104 The regression line is created so that the sum of the squares of the residuals is as small as possible.
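
For a single independent variable, the line that minimizes the sum of squared residuals has a closed-form solution. The Python sketch below illustrates it with hypothetical data; the function names are chosen for illustration only.

```python
def least_squares_line(x, y):
    """Return (intercept, slope) of the line minimizing the squared residuals."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def sum_squared_residuals(x, y, intercept, slope):
    """Sum of squared vertical distances from the points to the line."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))


x = [1, 2, 3, 4]  # hypothetical data
y = [2, 4, 5, 8]
intercept, slope = least_squares_line(x, y)
print(round(intercept, 3), round(slope, 3))
print(round(sum_squared_residuals(x, y, intercept, slope), 3))
```

Any other line through these points, for example one with a slope of 2, yields a larger sum of squared residuals than the fitted line.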

left-censored data: see censored data.

length-time bias: bias that arises when a sampling scheme is based on patient visits because patients with more frequent clinic visits are more likely to be selected than those with less frequent visits. In a screening study of cancer, for example, screening patients with frequent visits is more likely to detect slow-growing tumors than would sampling patients who visit a physician only when symptoms arise.104 Another scenario is if one patient has a rapidly growing form of a tumor, the time between when the cancer becomes symptomatic and the patient dies is short. The rapid course may make it less likely that the patient gets screened. Another patient might have a slower-growing form of the same tumor, get screened, and live a long time until he or she dies. Screening appears effective because the patient who had a tumor with a more benign course lived longer than the one who had a rapidly growing tumor and who was less likely to undergo screening because of the rapidity of the tumor’s growth. See also lead-time bias.

life table: method of organizing data that allows examination of the experience of 1 or more groups of individuals over time with varying periods of follow-up. For each increment of the follow-up period, the number entering, the number leaving, and the number dying of disease or developing disease can be calculated. In contrast to Kaplan-Meier analysis, where changes in survival probability are calculated for each patient having events, in life-table analysis arbitrary intervals are selected and the survival probability for all the patients having events within that interval is calculated. An assumption of the life-table method is that a censored individual (eg, not completing follow-up, dying during the interval from a cause other than the event) is exposed for half the incremental follow-up period.104 Thus, the number of patients at risk for the event at the beginning of the interval is adjusted by half the number censored during that interval. (The Kaplan-Meier method and the Cutler-Ederer method are also forms of life-table analysis but make different assumptions about the length of exposure.)

→ The clinical life table describes the outcomes of a cohort of individuals classified according to their exposure or treatment history. The cohort life table is used for a cohort of individuals born at approximately the same time and followed up until death. The current life table is a summary of mortality of the population during a brief (1- to 3-year) period, classified by age, often used to estimate life expectancy for the population at a given age.98

likelihood ratio: probability of getting a certain test result if the patient has the condition relative to the probability of getting the result if the patient does not have the condition. For dichotomous variables, the likelihood ratio is calculated as sensitivity/(1 − specificity). The greater the likelihood ratio, the more likely that a positive test result will occur in a patient who has the disease. A ratio of 2 means a person with the disease is twice as likely to have a positive test result as a person without the disease.100 The likelihood ratio test is based on the ratio of 2 likelihood functions and is used to assess the model fit for regression models fit by maximum likelihood modeling, as is done in logistic regression.93 See also diagnostic discrimination.
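
As a brief illustration of the formula sensitivity/(1 − specificity), the following Python sketch computes the likelihood ratio for a positive result of a dichotomous test. The sensitivity and specificity values are hypothetical.

```python
def positive_likelihood_ratio(sensitivity, specificity):
    """Likelihood ratio for a positive result: sensitivity / (1 - specificity)."""
    if not 0 <= sensitivity <= 1 or not 0 <= specificity < 1:
        raise ValueError("sensitivity must be in [0, 1] and specificity in [0, 1)")
    return sensitivity / (1 - specificity)


# Hypothetical test with 90% sensitivity and 80% specificity.
lr = positive_likelihood_ratio(0.90, 0.80)
print(round(lr, 1))  # 4.5
```

In this example a positive result is 4.5 times as likely in a person with the disease as in a person without it.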

Likert scale: scale often used to assess opinion or attitude, ranked by attaching a number to each response, such as 1, strongly agree; 2, agree; 3, undecided or neutral; 4, disagree; and 5, strongly disagree. The score is a sum of the numerical responses to each question.93

Lilliefors test: test of normality (using the Kolmogorov-Smirnov test statistic) in which mean and variance are estimated from the data.93

linear regression: statistical method used to compare continuous dependent and independent variables. When the data are depicted on a graph as a regression line, the independent variable is plotted on the x-axis and the dependent variable on the y-axis. The residual is the vertical distance from the data point to the regression line100; analysis of residuals is a commonly used procedure for linear regression (see Figure 4.2-4 in, Scatterplots). This method is frequently performed using least-squares regression.92

→ The description of a linear regression model should include the equation of the fitted line with the estimated value for the slope and its 95% CI if possible, the r2, the fraction of variation in y explained by each of the x variables (correlation and partial correlation), and the variances of the fitted coefficients a and b (and their SDs).92,125

Example: The regression model identified a significant positive relationship between the dependent variable weight and height (slope = 0.25; 95% CI, 0.19-0.31; y = 12.6 + 0.25x; t451 = 8.3; P < .001; r2 = 0.67).100

(In this example, the slope is positive, indicating that as one variable increases the other increases; the t test with 451 df is significant; the regression line is described by the equation and includes the slope 0.25 and the constant 12.6. The coefficient of determination r2 demonstrates that 67% of the variance in weight is explained by height.)100

→ Four important assumptions are made when linear regression is conducted: the dependent variable is sampled randomly from the population; the spread or dispersion of the dependent variable is the same regardless of the value of the independent variable (this equality is referred to as homogeneity of variances or homoscedasticity); the relationship between the 2 variables is linear; and the independent variable is measured with complete precision.95

LOESS (locally estimated scatterplot smoothing) or LOWESS (locally weighted scatterplot smoothing): method for fitting a curve to data that does not depend on knowing what mathematical function (if any) describes the data (such as the data following a linear or exponential relationship).126 Each data point is fitted by a regression equation finding the best estimate for the mean and variance for that point based on fitting a curve to the point and some window around the data. Larger windows result in smaller variances for these estimates but result in greater data smoothing. In contrast, smaller windows yield fitted data points that more closely follow the trends within the data but have large variances associated with the point estimates. LOESS is used to find patterns and trends within data.126

Example: Mortality rates for Medicare beneficiaries were measured from 1999 to 2013 (Figure 19.5-1). The symbols represent the observed mortality rates for each year. The solid lines represent estimates for these rates derived from the LOESS method, and the shaded areas show the 95% CI for those estimates as determined by LOESS.127

Figure 19.5-1. LOESS Method to Fit a Curve to the Data


logistic regression: type of regression model used to analyze the relationship between a binary dependent variable (eg, alive or dead, complication or no complication) and 1 or more independent variables. Often used to determine the independent effect on the dependent variable of one of the explanatory variables while simultaneously controlling for several other factors that are included as independent variables in the regression equation. Results are usually expressed by odds ratios and 95% CIs.95 (The multiple logistic regression equation may also be provided, but because these involve exponents they are substantially more complicated than linear regression equations. Therefore, in journals, the equation is generally not published but can be made available on request from authors. Alternatively, it may be placed in supplementary tables.)

→ To be valid, a multiple regression model must have an adequate sample size for the number of variables examined. A rough rule of thumb is to have 10 to 20 events (eg, deaths, complications) for each explanatory variable examined.128

→ When using logistic regression, the Methods section of a manuscript should include wording such as the following: “To examine the effect of the type of analgesia used and chest tube size on complications of pleurodesis, multiple logistic regression was used.” The results should be stated as follows: “The odds ratio for using a 12F chest tube and having complications of pleurodesis was 1.9 (95% CI, 0.7-5.1) relative to placing a 24F chest tube.” It is always good practice to state what the absolute risks are: “In 55 patients with 12F chest tubes, there were 13 complications (24%), and in the 56 patients with 24F chest tubes, there were 8 complications (14%).”
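As a rough check on how the counts in this example relate to the reported odds ratio, the crude (unadjusted) odds ratio can be computed directly from the 2 × 2 table. The sketch below assumes the Woolf (log-based) method for the CI; the interval quoted in the example presumably comes from the multivariable model, so the crude interval need not match it exactly. The function name is illustrative.

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Crude odds ratio with a Woolf-type CI.

    a, b: events and non-events in the first group
    c, d: events and non-events in the comparison group
    """
    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)   # SE of ln(OR)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper

# 12F tubes: 13 complications among 55 patients (42 without)
# 24F tubes: 8 complications among 56 patients (48 without)
or_crude, ci_lower, ci_upper = odds_ratio_ci(13, 42, 8, 48)
print(f"crude OR {or_crude:.1f} (95% CI, {ci_lower:.1f}-{ci_upper:.1f})")
```

The crude odds ratio rounds to 1.9, which agrees with the point estimate in the example.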

log-linear model: models where the logarithm of the dependent variable is a linear combination of the independent variables. In general, the dependent variable is shown in terms of being equal to the exponent of the right side of the regression equation. These may be used for the analysis of categorical data.93

→ $\ln(Y) = \alpha + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \dots + \varepsilon$

and $Y = \exp(\alpha + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \dots + \varepsilon)$ are examples of log-linear models.

log-rank test: method to compare differences between survival curves for different treatments; same as the Mantel-Haenszel test.93 If P < .05, then it is concluded that discrepancies as large as those observed between 2 survival curves would occur less than 5% of the time by chance alone if the curves were truly the same.

main effect: estimate of the independent effect of an explanatory (independent) variable on a dependent variable in analysis of variance or analysis of covariance. In a factorial-designed experiment, several interventions can be tested simultaneously. For each intervention (eg, a drug dose), there can be a single level of the dose (comparing drug vs no drug) or several levels of the drug (different doses). With this design, several drugs can be tested in one large experiment to determine their influence on some outcome. The main effect is the effect of one of the factors (drugs) on the outcome inclusive of all the various doses and excluding the effect of the other factors (drugs). Interactions can be tested between factors (drugs) in the same experiment. The interaction is tested for by creation of a variable that multiplies the factors being compared. When an interaction exists (ie, the interaction variable is statistically significant), the factors have a combined effect that differs from (is greater or less than) what would be expected by the addition of each when they act alone.104

Mann-Whitney test: nonparametric equivalent of the t test, used to compare ordinal dependent variables with nominal independent variables or continuous independent variables converted to an ordinal scale.98 Similar to the Wilcoxon rank sum test.

MANOVA: multivariate analysis of variance. This involves examining the overall significance of all dependent variables considered simultaneously and thus has less risk of type I error than would a series of univariable analysis of variance procedures on several dependent variables. See also ANOVA.

Mantel-Haenszel test: another name for the log-rank test.

Markov process: process of modeling possible events or conditions over time that assumes that the probability that a given state or condition will be present depends only on the state or condition immediately preceding it and that no additional information about previous states or conditions would create a more accurate estimate.104 These models account for patients moving from one state or condition to another. For example, if there is a 2% stroke rate per year in a population, Markov models facilitate modeling how many patients with stroke will be present after a number of years. They do this by assuming that starting with 100 patients there will be 2 with a stroke after the first year and 98 left who are at risk for stroke. After the second year there will be 4 patients with stroke in the population and 96 eligible to have a stroke in the next year. Markov models estimate how many patients will be in each state (stroke or no stroke) for any given number of years (or cycles) that the investigator wants to model.
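The stroke example can be traced with a minimal cohort sketch. It assumes only 2 states (at risk and stroke), treats stroke as a state that patients do not leave, and ignores deaths; the function name and defaults are illustrative.

```python
def markov_cohort(n_start=100.0, annual_stroke_rate=0.02, years=5):
    """Track expected numbers in each state over yearly cycles."""
    at_risk, stroke = n_start, 0.0
    history = [(0, at_risk, stroke)]
    for year in range(1, years + 1):
        # Only the current state matters: a fixed share of those still at risk move
        new_strokes = at_risk * annual_stroke_rate
        at_risk -= new_strokes
        stroke += new_strokes
        history.append((year, at_risk, stroke))
    return history

for year, at_risk, stroke in markov_cohort(years=3):
    print(f"year {year}: at risk {at_risk:.2f}, stroke {stroke:.2f}")
```

Because the 2% applies each year to those still at risk, the second-year count is 3.96 patients with stroke rather than exactly 4; the definition rounds these values.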

masked assessment: synonymous with blinded assessment.

masked assignment: synonymous with blinded assignment.

matching: process of making study and control groups comparable with respect to factors other than the factors under study, generally as part of a case-control study. Matching can be done in several ways, including frequency matching (matching on frequency distributions of the matched variable[s]), category (matching in broad groups, such as young and old), individual (matching on individual rather than group characteristics), and pair matching (matching each study individual with a control individual).98 Attempts to approximate matching in observational studies are performed by propensity methods, where the propensity to be in one group or another is calculated (usually by multivariable regression) and used to match patients into one group or another. After matching, the baseline characteristics of the 2 groups before and after the match should be shown.129

McNemar test: form of the χ2 test for binary responses in comparisons of matched pairs.98 The ratio of discordant to concordant pairs is determined; the greater the number of discordant pairs with the better outcome being associated with the treatment intervention, the greater the effect of the intervention.104

mean: sum of values measured for a given variable divided by the number of values; a measure of central tendency appropriate for normally distributed data.130 The SD should always be displayed along with mean values (m). The SD shows how much dispersion there is around the estimate of the data’s central tendency. In general, if the SD is larger than the mean value, the data should be assumed to be not normally distributed. Means should not be used to represent the central tendency of data that are not normally distributed because the mean itself assumes that data are normally distributed. Kurtosis and skewness are measures of how much a data distribution deviates from a normal distribution. See also average.

→ If the data are not normally distributed, the median should be used as a measure of central tendency and should be displayed along with the interquartile range (25th and 75th percentiles) to provide a sense of how much scatter there is in the data.

measurement error: estimate of the variability of a measurement. Variability of a given parameter (eg, weight) is the sum of the true variability of what is measured (eg, day-to-day weight fluctuations) plus the variability of the instrument or observer measurement or variability caused by measurement error (error variability, eg, the scale used for weighing).

median: midpoint of a distribution chosen so that half the values for a given variable appear above and half occur below.95 For data that do not have a normal distribution, the median provides a better measure of central tendency than does the mean because it is less influenced by outliers.119 Medians should be displayed along with measures of dispersion, such as the interquartile range (25th and 75th percentiles).
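The influence of an outlier on the mean, but not on the median, can be seen in a small sketch using hypothetical values. Quartile conventions differ between software packages; this example relies on the default method of Python's statistics.quantiles.

```python
import statistics

# Hypothetical measurements with one extreme value
values = [1, 2, 2, 3, 3, 4, 4, 5, 60]

mean_value = statistics.mean(values)
median_value = statistics.median(values)
# Quartiles (25th, 50th, 75th percentiles) using the library's default method
q1, q2, q3 = statistics.quantiles(values, n=4)

print(f"mean {mean_value:.1f}")
print(f"median {median_value} (IQR, {q1}-{q3})")
```

The single extreme value pulls the mean well above most of the observations, whereas the median and interquartile range still describe where the bulk of the data lie.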

median test: nonparametric rank-order test for 2 groups.93

mediation analysis: a method used to assess a pathway in which 1 variable is associated with a second variable that in turn is associated with a third variable, such that the second variable mediates the association between the first and third variables (Figure 19.3-2).

mendelian randomization: a means for mimicking the results of randomized trials by using genetic variations that exist between individuals that influence health outcomes that are not subject to the confounding or reverse-causation bias that can distort observational findings. An example might be genetic variation that results in high or low high-density lipoprotein cholesterol (HDL-C) levels and assessing the effect of that genetic variation on cardiovascular outcomes to test the potential effect of HDL-C on those outcomes.

meta-analysis: a method of aggregating statistical results and deriving a single estimate for an effect based on a number of similar studies. To perform a meta-analysis, the studies should be similar, with little heterogeneity. When there is a great deal of heterogeneity, data aggregation as a meta-analysis should not be performed, and the various studies should be assessed as a systematic review. Meta-analyses should be viewed like any research study and have a detailed Methods section. Heterogeneity among studies should be reported as I² and the point estimate for comparing the studies reported along with 95% CIs. Showing the effects of individual studies is facilitated by including a forest plot (see Figure 4.2-16 in, Forest Plots, and 19.3.6, Meta-analyses).
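For orientation, the heterogeneity statistic is commonly derived from Cochran Q. The sketch below assumes inverse-variance (fixed-effect) weights and hypothetical study estimates on a common scale (eg, log relative risks); it illustrates the arithmetic and is not a substitute for dedicated meta-analysis software.

```python
def i_squared(estimates, std_errors):
    """Pooled estimate, Cochran Q, and I-squared (%) from study-level results."""
    weights = [1 / se ** 2 for se in std_errors]            # inverse-variance weights
    pooled = sum(w * e for w, e in zip(weights, estimates)) / sum(weights)
    q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, estimates))
    df = len(estimates) - 1
    # Share of total variation beyond what chance alone would produce
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    return pooled, q, i2

# Hypothetical log relative risks and their standard errors from 4 studies
pooled, q, i2 = i_squared([-0.35, -0.10, -0.42, 0.05], [0.12, 0.15, 0.20, 0.18])
print(f"pooled {pooled:.2f}, Q {q:.2f}, I2 {i2:.0f}%")
```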

missing data: incomplete information on individuals resulting from any of a number of causes, including loss to follow-up, refusal to participate, and inability to complete the study. Although the simplest approach would be to remove such participants from the analysis, this introduces bias in the analysis and should not be done. Furthermore, certain health conditions may be systematically associated with the risk of having missing data, and thus removal of these individuals could bias the analysis. It is generally better to model the missing data with multiple imputation, which is then included in the analysis.121,122

mixed-methods analysis (also known as multimethod research): study design using a variety of methods, both qualitative and quantitative, to answer a research question.

mixed-model analysis (also known as hierarchical analysis): statistical model having both fixed and random effects. Regression analyses assume that individual data are independent of one another. When the data are not independent (ie, correlated), mixed models may be used. This occurs when data are clustered (eg, patients in one hospital may have effects related to the hospital that differ from those of patients treated in another hospital) or repeated measures are performed (one measure taken after another from the same person will be correlated). Fixed and random effects are characteristics of data. Data are random if they can be considered to be drawn randomly from some distribution and are expected to vary from one subject to another. For example, if the effect of an intervention on patients who are in various hospitals is being studied, the variable representing hospitals may be modeled as a random effect because the hospitals were randomly selected from the universe of all hospitals. However, the intervention is the same for all patients and all hospitals, so it is considered a fixed effect. Variables that do not change between individuals are considered fixed effects.131

mode: in a series of values of a given variable, the number that occurs most frequently; used most often when a distribution has 2 peaks (bimodal distribution).130 This is also appropriate as a measure of central tendency for categorical data.

Monte Carlo simulation: a family of techniques for modeling complex systems for which it would otherwise be difficult to obtain sufficient data. In general, Monte Carlo simulations will randomly resample data from a larger data set to mimic the random selection that occurs in experiments. Instead of doing actual experiments, Monte Carlo simulations use a computer algorithm to generate a large number of observations that are randomly selected. The patterns of these numbers are then used to assess probabilities of events occurring and for any irregularities that might arise.
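One member of this family, bootstrap resampling, can be sketched in a few lines. The data are hypothetical, and the function name, number of resamples, and seed are illustrative choices.

```python
import random
import statistics

def bootstrap_mean_interval(data, n_resamples=2000, seed=0):
    """Percentile interval for the mean, from resampling the data with replacement."""
    rng = random.Random(seed)                     # fixed seed for reproducibility
    means = sorted(
        statistics.fmean(rng.choices(data, k=len(data)))
        for _ in range(n_resamples)
    )
    lower = means[int(0.025 * n_resamples)]       # approximately the 2.5th percentile
    upper = means[int(0.975 * n_resamples) - 1]   # approximately the 97.5th percentile
    return lower, upper

# Hypothetical lengths of stay, in days
stays = [2, 3, 3, 4, 5, 5, 6, 8, 9, 14]
low, high = bootstrap_mean_interval(stays)
print(f"mean {statistics.fmean(stays):.1f} days (95% interval, {low:.1f}-{high:.1f})")
```

Each resample stands in for a repetition of the experiment; the spread of the resampled means approximates the uncertainty of the observed mean.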

mortality rate: death rate described by the following equation: (number of deaths during period)/[(number of individuals observed) × (period of observation)]. For values such as the crude mortality rate, the denominator is the number of individuals observed at the midpoint of observation. See also crude death rate.104

→ Mortality rate is often expressed in terms of a standard ratio, such as deaths per 100 000 persons per year.
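Expressing a rate as deaths per 100 000 persons per year amounts to dividing deaths by person-years and rescaling. The counts below are hypothetical, and the function name is illustrative.

```python
def mortality_rate_per_100k(deaths, persons, years):
    """Deaths per 100 000 persons per year."""
    person_years = persons * years
    return deaths / person_years * 100_000

# Hypothetical: 150 deaths among 50 000 persons observed for 2 years
rate = mortality_rate_per_100k(150, 50_000, 2)
print(f"{rate:.0f} deaths per 100 000 persons per year")
```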

Moses rank-like dispersion test: rank test of the equality of scale of 2 identically shaped populations, applicable when the population medians are not known.93

multilevel model: model in which the data are aggregated into groups. For example, students may be clustered into a classroom, which is clustered into a school, which is clustered into a school district. Because data organized like this will not meet the assumption of independence, regression analyses examining characteristics of these groups should be modeled using mixed-model analysis.

multiple analyses problem: problem that occurs when several statistical tests are performed on one group of data because of the potential to introduce a type I error. Multiple analyses are problematic when the analyses were not specified as primary outcome measures. Multiple analyses can be appropriately adjusted for by means of a Bonferroni adjustment or any of several multiple comparison procedures.

multiple comparison procedures: If many tests for statistical significance are performed on a data set, there is a risk of falsely concluding that a difference exists when it does not. For example, when a single test is performed with a significance threshold of P < .05, there is a 5% chance of rejecting the null hypothesis (concluding that the groups are different) when no true difference exists. If a second test is performed on the same data using the same threshold, the chance that at least 1 of the 2 tests falsely rejects the null hypothesis is no longer 5% but approximately 5% + 5% = 10% (exactly 1 − 0.95² = 9.75% if the tests are independent). One way to correct for this problem is to divide the α threshold used for significance by the number of statistical tests performed. This is called the Bonferroni correction and, for our example of 2 tests, 0.05/2 would yield an α value of .025 for a significance threshold.132,133

The same problem exists when performing statistical significance tests for multiple groups as is done for ANOVA. The first test establishes only that significant differences exist between groups but does not specify which groups. Finding out which groups are different from one another post hoc becomes a multiple comparison problem.

→ Some tests for multiple comparisons result in more conservative estimates (less likely to be significant) than others. More conservative tests include the Tukey test and the Bonferroni adjustment; the Duncan multiple range test is less conservative. Other tests include the Scheffé test, the Newman-Keuls test, and the Gabriel test,93 as well as many others. There is ongoing debate among statisticians about when it is appropriate to use these tests.
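The arithmetic behind the inflated error rate and the Bonferroni correction can be sketched as follows, assuming the tests are independent (an assumption that real analyses on the same data set often do not meet). The function names are illustrative.

```python
def familywise_error(alpha, n_tests):
    """Chance of at least 1 false-positive result among independent tests."""
    return 1 - (1 - alpha) ** n_tests

def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold after Bonferroni correction."""
    return alpha / n_tests

for n in (1, 2, 5, 10):
    uncorrected = familywise_error(0.05, n)
    per_test = bonferroni_threshold(0.05, n)
    corrected = familywise_error(per_test, n)
    print(f"{n} tests: uncorrected {uncorrected:.3f}, "
          f"threshold {per_test:.4f}, corrected {corrected:.3f}")
```

With the corrected threshold, the overall chance of a false-positive result stays at or just below the nominal 5%, which is why the Bonferroni adjustment is described as conservative.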

multiple regression: general term for analysis procedures used to estimate values of the dependent variable for all measured independent variables that are found to be associated. The procedure used depends on whether the variables are continuous or nominal. When all variables are continuous variables, multiple linear regression is used and the mean of the dependent variable is expressed using the equation $Y = \alpha + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k$, where Y is the dependent variable and k is the total number of independent variables. When independent variables may be either nominal or continuous and the dependent variable is continuous, analysis of covariance is used. (Analysis of covariance often requires an interaction term to account for differences in the relationship between the independent and dependent variables.) When all variables are nominal and the dependent variable is time dependent, life-table methods are used. When the independent variables are either continuous or nominal and the dependent variable is nominal and time dependent (such as incidence of death), the Cox proportional hazards model may be used. Nominal dependent variables that are not time dependent are analyzed by means of logistic regression or discriminant analysis.92

multivariable analysis: the name means many variables. Any statistical test that deals with 1 dependent variable and at least 2 independent variables. It may include nominal or continuous variables, but ordinal data must be converted to a nominal scale for analysis. The multivariable approach has 3 advantages over bivariate analysis: (1) it allows for investigation of the relationship between the dependent and independent variables while controlling for the effects of other independent variables; (2) it allows several comparisons to be made statistically without increasing the likelihood of a type I error; and (3) it can be used to compare how well several independent variables individually can estimate values of the dependent variable.95 Examples include analysis of variance, multiple (logistic or linear) regression, analysis of covariance, Kruskal-Wallis test, Friedman test, life table, and Cox proportional hazards model.

multivariate analysis: similar to multivariable analysis except that there is more than 1 dependent variable. The term multivariate is frequently incorrectly used in the scientific literature when multivariable analysis is meant. Multivariate analysis is seen with repeated-measures experiments when an outcome variable is repeatedly measured in different periods. It is also seen in hierarchical and cluster statistical models when there are many individuals in a single cluster. The suffix ate added to “variable” indicates that something is done to the variables, implying that the variables in question are on the left side of the equation.

N: total number of units (eg, patients, households) in the sample under study.

Example: We assessed the admission diagnoses of all patients admitted from the emergency department during a 1-month period (N = 127).

n: number of units in a subgroup of the sample under study.

Example: Of the patients admitted from the emergency department (N = 127), the most frequent admission diagnosis was unstable angina (n = 38).

natural experiment (also known as “found” experiment): investigation in which a change in a risk factor or exposure occurs in one group of individuals but not in another. The distribution of individuals into a particular group is nonrandom, and as opposed to controlled clinical trials, the change is not brought about by the investigator.95 The natural experiment is often used to study effects that cannot be studied in a controlled trial, such as the incidence of medical illness immediately after an earthquake.

naturalistic sample: set of observations obtained from a sample of the population in such a way that the distribution of independent variables in the sample is representative of the distribution in the population.95

necessary cause: characteristic whose presence is required to bring about or cause the disease or outcome under study.134 A necessary cause may not be a sufficient cause.

negative predictive value: the probability that an individual does not have the disease (as determined by the criterion standard) if a test result is negative.95 This measure takes into account the prevalence of the condition or the disease. A more general term is posttest probability. See positive predictive value, diagnostic discrimination.

nested case-control study: case-control study in which cases and controls are drawn from a cohort study. The advantages of a nested case-control study over a case-control study are that the controls are selected from participants at risk at the time of occurrence of each case that arises in a cohort, thus avoiding the confounding effect of time in the analysis, and that cases and controls are by definition drawn from the same population95 (see 19.3.1, Cohort Studies, and 19.3.2, Case-Control Studies).

Newman-Keuls test: a type of multiple comparisons procedure, used to compare more than 2 groups. It first compares the 2 groups that have the highest and lowest means, then sequentially compares the next most extreme groups, and stops when a comparison is not significant.94

n-of-1 trial: randomized trial that uses a single patient and an outcome measure agreed on by the patient and physician. The n-of-1 trial may be used by clinicians to assess which of 2 or more possible treatment options is better for the individual patient.134

nocebo: negative effects on treatment efficacy and tolerability induced or driven by psychological factors. For example, in a study of migraine treatment, patients who were told that a study drug was a placebo perceived that it had less effect than did patients who believed the study drug was pharmacologically active.135

nominal variable (also called categorical variable): variables that can be named (eg, sex, gene alleles, race, eye color). There is no arithmetic relationship among the categories, and thus there is no intrinsic ranking or order between them. The nominal or discrete variable usually is assessed to determine its frequency within a population.95 The variable can have either a binomial or Poisson distribution.

nomogram: a visual means of representing a mathematical equation. For example, in the nomogram in Figure 19.5-2, hospital readmission within 30 days after discharge is predicted. Points are assigned for each variable by drawing a line upward from the corresponding variable to the points line. The sum of the points plotted on the total points line corresponds with the prediction of 30-day readmission.

Figure 19.5-2. Nomogram Predicting Postsurgery Survival


nonconcurrent cohort study: cohort study in which an individual’s group assignment is determined by information that exists at the time a study begins. The extreme of a nonconcurrent cohort study is one in which the outcome is determined retrospectively from existing records.95

noninferiority trial: A study that examines the effect of a treatment believed to reduce adverse effects, toxic effects, or burdens of treatment relative to an existing standard treatment. The question for such a treatment is the extent to which it maintains the primary benefits of the existing standard treatment. Unlike equivalence trials, which aim to establish that a novel treatment is neither better nor worse than standard treatment beyond a specified margin, a noninferiority trial endeavors to show that the novel treatment is “not much worse” than standard treatment. Noninferiority trials test whether the new treatment is worse than the standard treatment by more than some noninferiority margin. The noninferiority margin is an estimate of how much worse the outcomes can be yet remain acceptable because they are offset by the benefits of the new treatment (eg, less cost, fewer adverse effects, simplified dosing regimens).20,21,22 See also equivalence trials.
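The role of the margin can be illustrated with a minimal sketch. It assumes a binary outcome for which a higher proportion is better, a simple Wald CI for the difference in proportions, and hypothetical counts and margins; actual trials prespecify the margin and the analytic method.

```python
import math

def is_noninferior(success_new, n_new, success_std, n_std, margin, z=1.96):
    """Compare the lower CI bound of (new - standard) with the negative margin."""
    p_new = success_new / n_new
    p_std = success_std / n_std
    diff = p_new - p_std
    se = math.sqrt(p_new * (1 - p_new) / n_new + p_std * (1 - p_std) / n_std)
    lower = diff - z * se
    # Noninferior if even the lower bound is not worse than the margin allows
    return lower > -margin, diff, lower

# Hypothetical: 170/200 successes with new treatment vs 174/200 with standard
ok_wide, diff, lower = is_noninferior(170, 200, 174, 200, margin=0.10)
ok_narrow, _, _ = is_noninferior(170, 200, 174, 200, margin=0.05)
print(f"difference {diff:.3f}, lower bound {lower:.3f}")
print(f"margin 0.10: {ok_wide}; margin 0.05: {ok_narrow}")
```

The same data support noninferiority with the wider margin but not with the narrower one, which is why the choice of margin must be justified in advance.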

nonnormal distribution: data that do not have a normal (bell-shaped curve) distribution; examples include the binomial, Poisson, and exponential distributions.

→ Nonnormally distributed continuous data must be either transformed to a normal distribution to use parametric methods or, more commonly, analyzed by nonparametric methods.

nonparametric statistics: statistical procedures that do not assume that the data conform to any theoretical distribution. Nonparametric tests are most often used for ordinal or nominal data or for nonnormally distributed continuous data converted to an ordinal scale95 (eg, weight classified by tertile). Although these tests are useful for data that are not normally distributed, they are less powerful than parametric statistics for determining whether significant differences exist between groups.

nonrandomized trial: trial that prospectively assigns groups or populations to evaluate the efficacy or effectiveness of an intervention but in which the assignment to the intervention occurs through self-selection or administrator selection rather than through randomization. Control groups can be historical, concurrent, or both. This design is sometimes called a quasi-experimental design. Reports of these trials should follow the Transparent Reporting of Evaluations with Nonrandomized Designs (TREND) reporting guideline.

normal distribution (also known as gaussian distribution): continuous data distributed in a symmetrical, bell-shaped curve with the mean value corresponding to the highest point of the curve. This distribution of data is assumed in many statistical procedures.95,108

→ Descriptive statistics, such as mean and SD, can be used to accurately describe data only if the values are normally distributed or can be transformed into a normal distribution. When not normally distributed, the central tendency for data should be displayed as the median value and the distribution by the interquartile range.

normal range: range of values for a diagnostic test found among patients without a disease. Cut points for abnormal test results are arbitrary and are often defined as the central 95% of values, or the mean of values ±2 SDs.

null hypothesis: statement used in statistics asserting that no true difference exists between comparison groups.95,120 In general, statistical tests do not prove that the null hypothesis is true; rather, the results of statistical testing can reject the null hypothesis at the stated α likelihood of a type I error.

→ In the Methods section, it is usually stated that the null hypothesis was rejected if P < .05 (ie, α = .05). Under these circumstances, the null hypothesis stating that no difference exists between groups is rejected if a difference at least as large as that observed would occur by chance alone less than 5% of the time when no true difference exists.

→ There is an adage in statistics that absence of evidence is not evidence of absence. Failing to reject the null hypothesis of no difference between groups is not the same as proving that there is no difference between groups; it means only that the study was unable to demonstrate that a difference exists.

number needed to harm (NNH): computed similarly to number needed to treat, but refers to the number of patients who would need to be treated for a specific period for 1 of them to experience a bad outcome. The NNH is calculated by inverting the absolute risk of harm. The absolute risk of harm is the proportion of patients experiencing harm who did not receive the intervention subtracted from the proportion of patients experiencing harm who received the intervention.

→ The NNH should be reported along with CIs and the absolute risk information used in its calculation. It should be expressed in a way that the reader can easily interpret the NNH concept; for example, “During follow-up, there were 1746 acute myocardial infarctions (21.7% fatal), 1052 strokes (7.3% fatal), 3307 hospitalizations for heart failure (2.6% fatal), and 2562 deaths from all causes among cohort members. For the composite of acute myocardial infarction, stroke, heart failure, or death, the attributable risk was 1.68 (95% CI, 1.27-2.08) excess events per 100 person-years of rosiglitazone compared with pioglitazone treatment. The corresponding number needed to harm for this composite end point was 60 (95% CI, 48-79) persons treated for 1 year to generate 1 excess event.” The NNH of 60 was calculated by taking the inverse of the attributable (ie, absolute) risk of 0.0168.
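The calculation in this example can be reproduced by inverting the attributable risk and its confidence limits; note that the limits swap places when inverted. The function name is illustrative.

```python
def number_needed(absolute_risk_difference):
    """Reciprocal of an absolute risk difference (used for both NNH and NNT)."""
    return 1 / absolute_risk_difference

# Attributable risk from the example: 1.68 (95% CI, 1.27-2.08) per 100 person-years
risk = 1.68 / 100
risk_low, risk_high = 1.27 / 100, 2.08 / 100

nnh = round(number_needed(risk))
# The larger risk gives the smaller number needed, so the limits are reversed
nnh_ci = (round(number_needed(risk_high)), round(number_needed(risk_low)))
print(f"NNH {nnh} (95% CI, {nnh_ci[0]}-{nnh_ci[1]})")
```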

number needed to treat (NNT): number of patients who must receive an intervention for a specific period for 1 patient to benefit from the intervention.95 The NNT is the reciprocal of the absolute risk reduction, the difference between event rates in the intervention and placebo groups in a clinical trial. See also number needed to harm.

→ The NNT is preferably reported along with its CI and absolute risks and expressed in a way that is intuitively obvious to the reader; for example, “Meta-analysis using a random-effects model showed that corticosteroids alone were associated with a reduced risk of unsatisfactory recovery (relative risk [RR], 0.69 [95% CI, 0.55-0.87]; P = .001) (number needed to treat to benefit 1 person, 11 [95% CI, 8-25]).”

→ The study patients from whom the NNT is calculated should be representative of the population to whom the numbers will be applied. The NNT does not take into account adverse effects of the intervention.

observational study: An observational study can be used to describe many designs that are not randomized trials (eg, cohort studies or case-control studies that have a goal of establishing causation, studies of prognosis, studies of diagnostic tests, and qualitative studies) (see 19.3, Observational Studies). For example, the term is commonly used in the context of cohort studies and case-control studies in which patient or caregiver preference, or happenstance, determines whether a person is exposed to an intervention or putative harmful agent or behavior (in contrast to the exposure’s being under the control of the investigator, as in a randomized trial). Observational studies are also commonly performed using large administrative or clinical databases. Because the data are not usually collected for the purpose of the study and treatments are given based on a patient’s clinical condition and not randomly allocated, observational studies are limited in what they can conclude because of selection bias or unmeasured confounding.116,136 Causation cannot be concluded from observational studies, and it is best to refer to the relationships between risk factors and outcomes as associations and not use causative language.

odds ratio (OR): ratio of 2 odds. Odds ratio may have different definitions, depending on the study, and therefore should be defined. For example, it may be the ratio of the odds of having the disease if a particular risk factor is present to the odds of having the disease if the risk factor is not present, or the ratio of the odds of having a risk factor present if the person has the disease to the odds of having the risk factor present if the person does not have the disease.

→ The OR typically is used for a case-control or cohort study. For a study of incident cases with an infrequent disease (eg, <2% incidence), the OR approximates the relative risk.96 When the incidence is relatively frequent, the OR may be arithmetically corrected to better approximate the relative risk.137

→ The OR is usually expressed as a point estimate with a measure of uncertainty, such as the 95% confidence interval. An OR for which the CI includes 1 indicates no statistically significant effect on risk; if the point estimate and CI are both less than 1, there is a statistically significant reduction in risk (eg, 0.75; 95% CI, 0.60-0.87); if the point estimate and CI are both greater than 1, there is a statistically significant increase in risk (eg, 1.25; 95% CI, 1.10-1.40).

1-tailed test: test of statistical significance in which deviations from the null hypothesis in only 1 direction are considered.95,120 Most commonly used for the t test.

→ One-tailed tests are more likely to produce a statistically significant result than are 2-tailed tests. The use of a 1-tailed test implies that the intervention can move only in 1 direction (ie, beneficial or harmful). Thus, the use of a 1-tailed test must be justified.

ordinal data: type of data with a limited number of categories with an inherent ordering of the category from lowest to highest but without fixed or equal spacing between increments.95 Examples are Apgar scores, heart murmur rating, and cancer stage and grade. Ordinal data can be summarized by means of the median and quantiles or range.

→ Because increments between the numbers for ordinal data generally are not fixed (eg, the difference between a grade 1 and a grade 2 heart murmur is not quantitatively the same as the difference between a grade 3 and a grade 4 heart murmur), ordinal data should be analyzed by nonparametric statistics.

ordinate: vertical or y-axis of a graph. The x-axis is referred to as the abscissa.

outcome: dependent variable or end point of an investigation. In retrospective studies, such as case-control studies, the outcomes have already occurred before the study is begun; in prospective studies, such as cohort studies and controlled trials, the outcomes occur during the time of the study.95 Primary outcomes are the main object of a study and are usually identified by the study hypothesis. Ideally, there should only be a single primary outcome. Study power is calculated based on the primary outcome. Secondary outcomes are outcomes other than the primary outcome that investigators wish to observe in the results of studies. Secondary outcomes are specified for analysis before the study is performed. Exploratory outcomes are those that an investigator elects to analyze after an experiment is performed but had not prespecified in the analytic plan.

outliers (outlying values): values at the extremes of a distribution. Because the median is far less sensitive to outliers than is the mean, it is preferable to use the median to describe the central tendency of data that have extreme outliers. Outliers can have a large influence on analytic techniques, such as regression analysis. They can also substantially influence study results that break down data into groups (quantiles) and compare high vs low quantiles. Whenever these analyses are performed, investigators should carefully assess the potential influence of outliers.

→ If outliers are excluded from an analysis, the rationale for their exclusion should be explained in the text. A number of tests are available to determine whether an outlier is so extreme that it should be excluded from the analysis.

overfit: a perfect fit between the data and outcome variables will occur if the number of variables in a regression equals the number of data points. In other words, the more variables present in a regression equation, the better the fit will be. Inclusion of an excessive number of variables will result in fits for statistical models that are good but misleading. In general, it is best to include only the minimum number of variables necessary to develop a statistical model. When comparing the results of regression analyses using different numbers of variables, this phenomenon should be corrected for by penalizing the comparator statistic by the number of variables present in the regression equation. This is done in the Akaike and Bayes information criteria.

overmatching: the phenomenon of obscuring by the matching process of a case-control study a true causal relationship between the independent and dependent variables because the variable used for matching is strongly related to the mechanism by which the independent variable exerts its effect.95 For example, matching cases and controls for residence within a certain area could obscure an environmental cause of a disease. Overmatching may also be used to refer to matching on variables that have no effect on the dependent variable, and therefore are unnecessary, or the use of so many variables for matching that no suitable controls can be found.98

oversampling: in survey research, a technique that selectively increases the likelihood of including certain groups or units that would otherwise produce too few responses to provide reliable estimates. For example, Hispanic individuals comprise approximately 17% of the US population. When trying to understand something about the Hispanic population, if one created a statistical sample from the overall US population, the likelihood of finding significant factors relating to Hispanic individuals would be low because there are relatively few Hispanic individuals. In oversampling, a larger number of Hispanic individuals than white individuals would be sampled to increase the likelihood that factors important to whatever is being studied as they relate to Hispanic individuals can be identified.

paired samples: form of matching that can include self-pairing, where each participant serves as his or her own control, or artificial pairing, where 2 participants are matched on prognostic variables.98 Twins may be studied as pairs to attempt to separate the effects of environment and genetics. Paired analyses provide greater power to detect a difference for a given sample size than do nonpaired analyses because interindividual differences are minimized or eliminated. Pairing may also be used to match participants in case-control or cohort studies. See Table 19.3-1.

paired t test: t test for paired data.

parameter: measurable characteristic of a population. One purpose of statistical analysis is to estimate population parameters from sample observations.95 The statistic is the numerical characteristic of the sample; the parameter is the numerical characteristic of the population. Parameter is also used to refer to aspects of a model (eg, a regression model).

parametric statistics: tests used for continuous data that require the assumption that the data being tested are normally distributed, either as collected initially or after transformation to the ln or log of the value or other mathematical conversion.95 The t test is a parametric statistic. See Table 19.5-1.

Pearson product-moment correlation: test of correlation between 2 groups of normally distributed data. See diagnostic discrimination.

The Methods section should specify that this test was performed. For example, “A Pearson product-moment correlation coefficient was calculated between last survey year and change in prevalence of BMI lower than 16 to detect whether there was an association between these variables.” The Results section should show the correlation value; for example, “For 13 countries where DHS program data also contained men, the Pearson product-moment correlation coefficient between rates of BMI lower than 16 among men and women was r = 0.88.”

→ The square of the Pearson correlation coefficient, the coefficient of determination, gives the proportion of the variance in one variable that is explained by its linear relationship with the other; it is not a probability. For the example above, where r = 0.88, squaring 0.88 gives 0.77, so approximately 77% of the variance in the rate of BMI lower than 16 among women is accounted for by the rate among men.
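As a minimal sketch of how the correlation coefficient and its square are obtained, the following computes both from paired observations. The function name and the data are hypothetical and are not taken from the study quoted above.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation coefficient for paired data."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must be paired and contain at least 2 values")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sum_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sum_xx = sum((a - mean_x) ** 2 for a in x)
    sum_yy = sum((b - mean_y) ** 2 for b in y)
    return sum_xy / math.sqrt(sum_xx * sum_yy)


# Made-up paired measurements, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

r = pearson_r(x, y)
r_squared = r ** 2  # coefficient of determination
print(f"r = {r:.2f}; r squared = {r_squared:.2f}")
```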

percentile: see quantile.

per-protocol analysis: analysis of patients in a clinical trial analyzed by the treatment they received rather than which group they were initially randomized into (which would be intention to treat). This approach compromises the prognostic balance that randomization achieves and is therefore likely to provide a biased estimate of treatment effect. See also intention to treat.

placebo: a biologically inactive substance administered to some participants in a clinical trial. A placebo should ideally appear similar in every other way to the experimental treatment under investigation. Assignment, allocation, and assessment should be blinded. See also nocebo.

placebo effect: refers to specific expectations that participants may have of the intervention. These expectations can make the intervention appear more effective than it actually is. Comparison of a group receiving placebo vs those receiving the active intervention allows researchers to identify effects of the intervention itself because the placebo effect should affect both groups equally.

point estimate: single value calculated from sample observations that is used as the estimate of the population value, or parameter,95 and should be accompanied by an interval estimate (eg, 95% confidence interval).

Poisson distribution: distribution that occurs when a nominal event (often disease or death) occurs rarely.98 This distribution is used when there are count data, such as the number of events per unit measure. For example, when assessing the procedure volume-outcome relationship, the number of procedures per hospital studied would follow a Poisson distribution. The Poisson distribution is present when the mean equals the variance of a distribution. The Poisson distribution is used instead of a binomial distribution when sample size is calculated for a study of events that occur rarely.

population: any finite or infinite collection of individuals from which a sample is drawn for a study to obtain estimates to approximate the values that would be obtained if the entire population were sampled.104 A population may be defined narrowly (eg, all individuals exposed to a specific traumatic event) or widely (eg, all individuals at risk for coronary artery disease).

population attributable risk percentage: percentage of risk within a population that is associated with exposure to the risk factor. Population attributable risk takes into account the frequency with which a particular event occurs and the frequency with which a given risk factor occurs in the population. Population attributable risk does not necessarily imply a cause-and-effect relationship. It is also called attributable fraction, attributable proportion, and etiologic fraction.95 The population attributable risk percentage is the percent of the incidence of a disease in the population that would be eliminated if exposure were eliminated. The population attributable risk is calculated by subtracting the incidence of a disease in a population of patients not exposed to some risk factor (Iu) from the incidence of disease in the total population (both exposed and unexposed) (Ip):


Population attributable risk = Ip − Iu

When this is divided by Ip and multiplied by 100, the result is the population attributable risk percentage.
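The calculation described above (Ip minus Iu, then divided by Ip and multiplied by 100) can be sketched as follows. The function names and the incidence values are hypothetical.

```python
def population_attributable_risk(incidence_population, incidence_unexposed):
    """Ip - Iu: excess incidence in the total population."""
    return incidence_population - incidence_unexposed


def population_attributable_risk_percentage(incidence_population, incidence_unexposed):
    """(Ip - Iu) / Ip x 100."""
    if incidence_population <= 0:
        raise ValueError("incidence in the total population must be positive")
    risk = population_attributable_risk(incidence_population, incidence_unexposed)
    return risk / incidence_population * 100.0


# Hypothetical incidences: 12 per 1000 overall, 8 per 1000 among the unexposed.
ip, iu = 0.012, 0.008
print(f"PAR = {population_attributable_risk(ip, iu):.3f}")
print(f"PAR% = {population_attributable_risk_percentage(ip, iu):.1f}%")
```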

positive predictive value (PPV): proportion of those participants or individuals with a positive test result who have the condition or disease as measured by the criterion standard. This measure takes into account the prevalence of the condition or the disease. Clinically, it is the probability that an individual has the disease if the test result is positive. Although sensitivity is often used to assess the efficacy of a diagnostic test in establishing a diagnosis, it is the proportion of test results that will be positive when someone has a disease. When patients are first seen, it is not known if they have the disease. In this context, the PPV provides a better assessment of how a test will perform for establishing a diagnosis. The disease prevalence influences the PPV. Tests for rare diseases have a low PPV, and tests for common diseases have higher PPVs. The preferred measure for a test’s ability to establish a diagnosis is its likelihood ratio (LR).95 See Table 19.3-2 and diagnostic discrimination.
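The dependence of the PPV on prevalence can be sketched with Bayes theorem. The function name and the sensitivity, specificity, and prevalence values below are hypothetical.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """Probability of disease given a positive test result."""
    true_positive_fraction = sensitivity * prevalence
    false_positive_fraction = (1.0 - specificity) * (1.0 - prevalence)
    total_positive_fraction = true_positive_fraction + false_positive_fraction
    if total_positive_fraction == 0:
        raise ValueError("no positive test results are possible with these inputs")
    return true_positive_fraction / total_positive_fraction


# The same hypothetical test (sensitivity 0.90, specificity 0.90)
# applied where the disease is rare and where it is common.
for prevalence in (0.01, 0.30):
    ppv = positive_predictive_value(0.90, 0.90, prevalence)
    print(f"prevalence {prevalence:.2f}: PPV = {ppv:.2f}")
```

With these assumed values, the PPV rises from roughly 0.08 to roughly 0.79 as prevalence increases, although the test itself is unchanged.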

posterior probability: in bayesian analysis, the probability obtained after the prior probability is combined with the probability from the study of interest.98 If one assumes a uniform prior (no useful information for estimating probability exists before the study), the posterior probability is the same as the probability from the study of interest alone.

→ The prior probability is the probability of some events occurring before new information is obtained. For example, the probability of being bitten by a mosquito in the Amazon might be 50%. However, if you learn that this region was sprayed with an insecticide and the mosquito population reduced, application of a mathematical operation reflecting the probability of getting bitten as a function of the mosquito population will change the probability to some other value. If the new value is 15% after spraying, this new value is called the posterior probability.

post hoc analysis (also known as ad hoc analysis): performed after completion of a study and not based on a hypothesis considered before the study. Such analyses should be performed without prior knowledge of the relationship between the dependent and independent variables. A potential hazard of post hoc analysis is the type I error. In general, post hoc analyses are not as definitive as prespecified analyses and should be considered as exploratory. It is important not to use causal language when describing the results of post hoc analyses and to refer to relationships between variables and outcomes as associations.

→ While post hoc analyses may be used to explore intriguing results and generate new hypotheses for future testing, they should not be used to test hypotheses because the comparison is not hypothesis driven. See also data dredging.

posttest probability: the probability that an individual has the disease if the test result is positive (positive predictive value) or that the individual does not have the disease if the test result is negative (negative predictive value).95

power: ability to detect a statistically significant difference with the use of a given sample size and variance; determined by frequency of the condition under study, magnitude of the effect, study design, and sample size.95 Power should be calculated before a study is begun. If the sample is too small to have a reasonable chance (usually 80% or 90%) of rejecting the null hypothesis if a true difference exists, then a negative result may indicate a type II error rather than a true failure to reject the null hypothesis.138

→ Power calculations should be performed as part of the study design. A statement providing the power of the study should be included in the Methods section of all randomized clinical trials (Table 19.2-1) and is appropriate for many other types of studies. A power statement is especially important if the study results are negative, to demonstrate that a type II error was unlikely to have been the reason for the negative result. Performing a post hoc power analysis is controversial, especially if it is based on the study results. Nonetheless, if such calculations were performed, they should be described in the Discussion section and their post hoc nature clearly stated.

→ The study power determines the sample size. The more powerful a study, the more confidence one has that a difference between groups does not truly exist when that is the study result. In general, larger sample sizes are required when the expected differences between groups are small or there is a great deal of variation in the data. When reporting the results of a power analysis, it is important to show the rationale (including citations to published reports) for the anticipated differences between the groups and variation in the data. This is to avoid the appearance of predicting an artificially small variance or anticipating an unreasonably large difference between the groups to minimize the necessary sample size for a study. The observed difference between groups and group variances should be roughly the same as was predicted when the power calculation was made, before patients were enrolled in the study. If not, the sample size calculation might have been in error.

Example: The statistical power to demonstrate a superior success rate (1-sided hypothesis test) in the primary end point for the coil group vs the usual care group was 90% with a significance of α = .05 and a total sample size of 100 patients.139 This was based on achieving the end point of 37% of patients having improved function in the coil group and 5% in the usual care group, with 30% of patients unable to perform the 6-minute walk test or lost to follow-up at 6 months. The hypothesis of a 37% primary end-point achievement in the coil group was based on data provided by PneumRx in 2012. One-sided statistical tests were considered appropriate in view of the favorable results of previous smaller studies and confirmed by a recent meta-analysis. The sample size was calculated using Nquery software, version 7.0 (Statistical Solutions Ltd).

Example: A sample size of 136 participants was planned to have 90% power to detect a difference in change in hemoglobin A1c between treatment groups, assuming a population difference of 0.5%, SD of 26-week values of 1.0%, correlation between baseline and 26-week values of 0.56, type I error rate of 5% (2-sided), and no more than a 15% loss to follow-up.
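A rough sketch of the arithmetic behind a statement like the second example follows. It assumes a normal approximation for comparing 2 means, a variance reduction of (1 − correlation squared) from adjusting for the baseline value, and inflation for loss to follow-up applied before rounding up. None of these choices is stated in the example, so they are assumptions, and the function name is hypothetical.

```python
import math
from statistics import NormalDist


def total_sample_size(difference, sd, correlation, alpha, power, loss_to_follow_up):
    """Approximate total sample size for a 2-group comparison of means.

    Assumptions (not stated in the source example): 2-sided normal
    approximation, baseline adjustment that scales the variance by
    (1 - correlation**2), equal group sizes, and rounding up at the end.
    """
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1.0 - alpha / 2.0)
    z_beta = std_normal.inv_cdf(power)
    adjusted_variance = sd ** 2 * (1.0 - correlation ** 2)
    n_per_group = 2.0 * (z_alpha + z_beta) ** 2 * adjusted_variance / difference ** 2
    n_total = 2.0 * n_per_group / (1.0 - loss_to_follow_up)
    return math.ceil(n_total)


# Inputs taken from the hemoglobin A1c example above.
planned = total_sample_size(
    difference=0.5, sd=1.0, correlation=0.56,
    alpha=0.05, power=0.90, loss_to_follow_up=0.15,
)
print(planned)
```

Under these assumptions the arithmetic gives 136, which matches the example, but the original investigators may have used a different formula or software.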

precision: inverse of the variance in measurement (see measurement error)98; precision refers to the variability or how close together successive measurements of the same phenomenon are. Note that precision and accuracy are independent concepts; if a blood pressure cuff is poorly calibrated against a standard, it may produce measurements that are precise (successive measurements are very close to one another) but inaccurate (the measured blood pressure deviates from the true blood pressure).

pretest probability: same as prevalence.

prevalence (also known as pretest probability): proportion of persons with a particular disease at a given point in time. Prevalence can also be interpreted to mean the likelihood that a person selected at random from the population will have the disease.95 See also incidence.

principal components analysis (PCA): procedure used to group related variables to help describe data. The variables are grouped so that the original set of correlated variables is transformed into a smaller set of uncorrelated variables called the principal components.98 Variables are not grouped according to dependent and independent variables, unlike many forms of statistical analysis. Principal components analysis is similar to factor analysis and is used to detect patterns in a collection of data.

prior probability: in bayesian analysis, the probability of an event based on previous information before the study of interest is considered. The prior probability may be informative, based on previous studies or clinical information, or not, in which case the analysis uses a uniform prior (no information is known before the study of interest). A reference prior is one with minimal information, a clinical prior is based on expert opinion, and a skeptical prior is used when large treatment differences are not expected.104 When bayesian analysis is used to determine the posterior probability of a disease after a patient has undergone a diagnostic test, the prior probability may be estimated as the prevalence of the disease in the population from which the patient is drawn (usually the clinic or hospital population).

probability: in clinical studies, the number of times an event occurs in a study group divided by the number of individuals being studied.95

product-limit method: see Kaplan-Meier method.

propensity analysis: in observational studies, a way of minimizing bias. The propensity for an individual to be in one group or another is calculated from variables available to the investigator by regression analysis.116 Individuals are then matched based on the propensity score. To determine how closely the groups are matched, it is best to display their baseline characteristics as both unmatched and matched.129 The closeness of the match is also assessed by showing the standardized difference between the means or proportions of the various groups. Rather than match, balance can be achieved by calculating the propensity score and then using it as an independent variable in a regression analysis to adjust for the differences between groups.

→ Although propensity analysis helps balance groups in observational studies, it is limited by the variables available to describe the phenomena being studied. Unmeasured confounding can never be overcome, so even propensity-matched groups will not be as well balanced as those from randomized trials. Consequently, propensity-matched groups can only show associations and not causality. Some investigators have criticized propensity methods as not being inherently better than multivariable methods to achieve statistical adjustment.

proportionate mortality ratio: number of individuals who die of a particular disease during a span of time, divided by the number of individuals who die of all diseases during the same period.95 This ratio may also be expressed as a rate, that is, a ratio per unit of time (eg, cardiovascular deaths per total deaths per year).

prospective study: study in which participants with and without an exposure are identified and then followed up over time; the outcomes of interest have not occurred at the time the study commences.104 Antonym is retrospective study.

pseudorandomization: assigning of individuals to groups in a nonrandom manner, for example, selecting every other individual for an intervention or assigning participants by a government identification number or birth date.

publication bias: tendency of articles reporting positive and/or “new” results to be submitted and published and studies with negative or confirmatory results not to be submitted or published; especially important in meta-analysis but also in other systematic reviews. Substantial publication bias has been demonstrated from the “file-drawer” problem.140 See also funnel plot.

purposive sample: set of observations obtained from a population in such a way that the sample distribution of independent variable values is determined by the researcher and is not necessarily representative of distribution of the values in the population.95

P value: probability of obtaining the observed data (or data that are more extreme) if the null hypothesis were exactly true.104 Also expressed as the probability that the observed result was obtained by chance alone. For example, if it is said that the difference between groups was statistically significant with P = .046 then the probability that the difference observed was from chance alone is 4.6%.

→ Although hypothesis testing often results in the P value, P values themselves can only provide information about whether the null hypothesis is rejected. Confidence intervals are much more informative because they provide a plausible range of values for an unknown parameter, as well as some indication of the power of the study as indicated by the width of the CI.92 (For example, an odds ratio of 0.5 with a 95% CI of 0.05 to 4.5 indicates to the reader the [im]precision of the estimate, whereas P = .63 does not provide such information.) CIs are preferred whenever possible. Including both the CI and the P value provides more information than either alone.92 This is especially true if the CI is used to provide an interval estimate and the P value to provide the results of hypothesis testing.

→ When any P value is expressed, it should be clear to the reader what parameters and groups were compared, what statistical test was performed, and the degrees of freedom (df) (when appropriate) and whether the test was 1-tailed or 2-tailed (if these distinctions are relevant for the statistical test).

→ For expressing P values in manuscripts and articles, display P as a capital, italicized letter. The actual value for P should be expressed to 2 digits for P ≥ .01, whether or not P is significant. (When rounding a P value would make the P value nonsignificant, such as P = .049 rounded to .05, the P value can be left as 3 digits.) If P < .01, it should be expressed to 3 digits. The actual P value should be expressed (P = .04), rather than expressing a statement of inequality (P < .05), unless P < .001. In general, expressing P to more than 3 significant digits does not add useful information to P < .001 because precise P values with extreme results are sensitive to biases or departures from the statistical model.92

Exceptions to the 2-digit rule exist in several situations. In genetic studies (particularly genome-wide association studies [GWASs] and in studies in which there are adjustments for multiple comparisons, such as Bonferroni adjustment, and the definition of level of significance is substantially less than P < .05), it may be important to express P values to more significant digits. For example, if the threshold of significance is P < .0004, then by definition the P value must be expressed to at least 4 digits to indicate whether a result is statistically significant. GWASs express P values to very small numbers, using scientific notation. If a manuscript you are editing defines statistical significance as a P value substantially less than .05, possibly even using scientific notation to express P values to very small numbers, it is best to retain the values as the author presents them.

Example: A single-nucleotide variant on chromosome 19 was significantly associated with posttraumatic stress disorder in the European American samples of the New Soldier Study (genetic sequence: rs11085374; odds ratio [OR], 0.77; 95% CI, 0.70-0.85; P = 4.59 × 10^−8).141

Example: A Bonferroni adjustment was made to account for having 6 outcomes such that the P value for statistical significance is .0083 (.05/6) and 99.17% CIs are presented around the effect estimates to reflect the adjusted α level (ie, 1 − .0083 = .9917).142
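The arithmetic in the Bonferroni example can be reproduced with a few lines. The function name is hypothetical.

```python
def bonferroni_adjustment(alpha, n_comparisons):
    """Return the adjusted significance threshold and matching CI level."""
    if n_comparisons < 1:
        raise ValueError("at least 1 comparison is required")
    adjusted_alpha = alpha / n_comparisons
    confidence_level = 1.0 - adjusted_alpha
    return adjusted_alpha, confidence_level


# 6 outcomes at an overall alpha of .05, as in the example above.
adjusted_alpha, confidence_level = bonferroni_adjustment(0.05, 6)
print(f"adjusted alpha = {adjusted_alpha:.4f}")
print(f"CI level = {confidence_level * 100:.2f}%")
```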

P values should not be listed simply as “not significant” or “NS” because for meta-analysis the actual values are important and not providing exact P values is a form of incomplete reporting.92 Because the P value represents the result of a statistical test and not the strength of the association or the clinical importance of the result, P values should be referred to simply as statistically significant or not significant; phrases such as “highly significant” and “very highly significant” should be avoided.

P values should never be reported alone. Because the P value refers only to the probability that something is statistically significant or not, it is necessary to assess the P value in the context of other statistical information, such as absolute or relative risks and CIs. Findings that are statistically significant can be clinically unimportant; however, this cannot be determined from P values alone.

Best practice is to not use a zero to the left of the decimal point because statistically it is not possible to prove or disprove the null hypothesis completely when only a sample of the population is tested (P cannot equal 0 or 1, except by rounding). If P < .00001, P should be expressed as P < .001 as discussed. If P > .999, P should be expressed as P > .99.
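The presentation rules for P values can be sketched as a small formatting helper. It is an illustration only: the function name is hypothetical, the significance threshold defaults to .05, and the exceptions for genetic studies and multiple-comparison thresholds described above are deliberately not handled. Italic type for P is also outside what plain text can show.

```python
def format_p(p, alpha=0.05):
    """Format a P value without a leading zero.

    2 digits when P is at least .01, 3 digits when P is below .01 or when
    rounding to 2 digits would lift P to the threshold, and an inequality
    at the extremes.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("P must lie between 0 and 1")
    if p < 0.001:
        return "P < .001"
    if p < 0.01:
        digits = 3
    elif p < alpha and round(p, 2) >= alpha:
        digits = 3  # eg, .049 would otherwise be shown as .05
    else:
        digits = 2
    text = f"{p:.{digits}f}"
    if text.startswith("1"):
        return "P > .99"  # P cannot equal 1, except by rounding
    return "P = " + text[1:]  # drop the zero left of the decimal point


for value in (0.63, 0.049, 0.04, 0.004, 0.00001, 0.9999):
    print(format_p(value))
```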

Q statistic: See Cochran Q test.

qualitative data: data that fit into discrete categories according to their attributes, such as nominal or ordinal data, as opposed to quantitative data.98

qualitative study: form of study based on observation and interview with individuals that uses inductive reasoning and a theoretical sampling model, with emphasis on validity rather than reliability of results. Qualitative research is used traditionally in sociology, psychology, and group theory but also occasionally in clinical medicine to explore beliefs and motivations of patients and physicians.143

quality-adjusted life-year (QALY): a method used to adjust the survival someone might experience by the quality of life he/she has during that survival period. The number of years survived is multiplied by a utility, which is some measure of the quality of life that ranges from 0 to 1.0. For example, if someone is expected to live 5 years but has a disability that is expected to reduce his/her quality of life by 0.5, he/she would have 2.5 quality-adjusted life-years (QALYs).98
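The multiplication of years survived by a utility can be sketched as follows. The function name is hypothetical, and allowing several periods with different utilities is an extension for illustration rather than something the definition requires.

```python
def quality_adjusted_life_years(periods):
    """Sum years x utility over (years, utility) pairs."""
    total = 0.0
    for years, utility in periods:
        if not 0.0 <= utility <= 1.0:
            raise ValueError("utility must lie between 0 and 1.0")
        total += years * utility
    return total


# The example above: 5 years lived at a utility of 0.5.
qalys = quality_adjusted_life_years([(5, 0.5)])
print(qalys)
```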

quantile: method used for grouping and describing dispersion of data. Commonly used quantiles are the tertile (3 equal divisions of data into lower, middle, and upper ranges), quartile (4 equal divisions of data), quintile (5 divisions), and decile (10 divisions). Quantiles are also referred to as percentiles.93 In general, analyses should use continuous data when they are available rather than data grouped into quantiles. Analyzing quantile data reduces statistical power, and outliers tend to heavily influence the uppermost and lowest quantiles.

→ Data may be expressed as median (quantile range); for example, length of stay was 7.5 days (interquartile range, 4.3-9.7 days). See also interquartile range.
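A minimal sketch of obtaining the median and interquartile range follows, using the Python standard library. The data are made up, and the quartile cut points depend on the interpolation method; the library default ("exclusive") is assumed here, and other software may give slightly different values.

```python
from statistics import median, quantiles

# Hypothetical lengths of stay in days.
lengths_of_stay = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# n=4 returns the 3 cut points that divide the data into quartiles.
q1, q2, q3 = quantiles(lengths_of_stay, n=4)
middle = median(lengths_of_stay)

print(f"median, {middle} days (interquartile range, {q1}-{q3} days)")
```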

quantitative data: data in numerical quantities, such as continuous data or counts98 (as opposed to qualitative data). Nominal and ordinal data may be treated either qualitatively or quantitatively.

quasi-experiment: experimental design in which variables are specified and participants assigned to groups, but interventions cannot be controlled by the experimenter. One type of quasi-experiment is the natural experiment.98

r: correlation coefficient for bivariable analysis.

R: correlation coefficient for multivariable analysis.

r²: coefficient of determination for bivariable analysis. See also correlation coefficient.

R²: coefficient of determination for multivariable analysis. See also correlation coefficient.

random-effects model: used to allow the slope or intercept to vary in a multilevel (hierarchical) model. The model refers to more than 1 layer of analysis. An example might be examining effects on students grouped into schools modeled with a series of regression equations.

Random-effects models can also provide a summary estimate of the magnitude of effect in a meta-analysis. These models assume that the included studies are a random sample of a population of studies addressing the question posed in the meta-analysis. Each study estimates a different underlying true effect, and the distribution of these effects is assumed to be normal around a mean value. Because a random-effects model takes into account both within-study and between-study variability, the CI around the point estimate is, when there is appreciable variability in results across studies, wider than it would be if a fixed-effect model were used.51,144

When there is heterogeneity in a meta-analysis, random-effects models can be used to help determine the sources of heterogeneity.

Because random effects refer to many layers of a model, the term should be used in the plural. Conversely, because fixed-effect models refer to a single layer of a model, the term is singular. Antonym to random-effects model is fixed-effect model. See fixed-effect model for more detail. See also meta-analysis.

randomization: method of assignment in which all individuals have the same chances of being assigned to the conditions in a study. Individuals may be randomly assigned at a 2:1 or 3:1 frequency, in addition to the usual 1:1 frequency. Participants may or may not be representative of a larger population.92 Simple methods of randomization include coin flip or use of a random numbers table. See also block randomization.

randomized clinical trial: A trial in which the participants are randomly assigned to one group or another before the intervention is given, in contrast to an observational trial in which participants are separated into groups based on what intervention they may or may not have gotten after they received treatment. Also distinguished from a case-control trial in which groups are assigned based on whether a patient had some outcome (see 19.2.1, Parallel-Design, Double-blind Trials).

random sample: method of obtaining a sample that ensures that every individual in the population has a known (but not necessarily equal, for example, in weighted sampling techniques) chance of being selected for the sample.95

range: the highest and lowest values of a variable measured in a sample.

Example: The mean age of the participants was 45.6 years (range, 20-64 years).

rank sum test: see Mann-Whitney test or Wilcoxon rank sum test.

rate: measure of the occurrence of a disease or outcome per unit of time, usually expressed as a decimal if the denominator is 100 (eg, the surgical mortality rate was 0.02) (see 18.7.3, Reporting Proportions and Percentages).

ratio: fraction in which the numerator is not necessarily a subset of the denominator, unlike a proportion95 (eg, the assignment ratio was 1:2:1 for each drug dose [twice as many individuals were assigned to the second group as to the first and third groups]).

realization: in statistics, a realization is the actual observed value of a random variable. By convention, the random variable is designated by capital letters and the realization by lowercase letters. A random variable, X, can have many values, but after some process (draw a number at random) it takes on a definite value, x, its realization.

recall bias: systematic error resulting from individuals in one group being more likely than individuals in the other group to remember past events.98

→ Recall bias is especially common in case-control studies that assess risk factors for serious illness in which individuals are asked about past exposures or behaviors, such as environmental exposure in an individual who has cancer.95

receiver operating characteristic curve (ROC curve): graphic means of assessing the extent to which a test can be used to discriminate between persons with and without disease98 and to select an appropriate cut point for defining normal vs abnormal results. The ROC curve is created by plotting sensitivity vs (1 − specificity). The area under the curve provides some measure of how well the test performs; the larger the area, the better the test. The C statistic is a measure of the area under the ROC curve.

Diagnostic tests can be assessed by a graph of the ROC (Figure 19.3-1). The value for what is called a positive test result can be varied to optimize the tradeoff between false-positive and false-negative results. The closer the curve comes to the upper-left corner of the graph, the better the overall test performance.86

→ The appropriate cut point is a function of the test. A screening test would require high sensitivity, whereas a diagnostic or confirmatory test would require high specificity. See Table 19.5-3 and diagnostic discrimination.
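As a rough illustration of how an ROC curve and its area are obtained, the following Python sketch varies the cut point over a set of test scores and plots nothing, only returning the (1 − specificity, sensitivity) pairs. The function names, the scores, and the disease labels are hypothetical and serve only to show the arithmetic; a result is assumed to be positive when the score is at or above the cut point.

```python
def roc_points(scores, has_disease):
    """Return (1 - specificity, sensitivity) pairs, one per candidate cut point."""
    positives = sum(has_disease)
    negatives = len(has_disease) - positives
    points = [(0.0, 0.0)]
    # Assumption: a result is called positive when score >= cut point.
    for cut in sorted(set(scores), reverse=True):
        tp = sum(1 for s, d in zip(scores, has_disease) if s >= cut and d)
        fp = sum(1 for s, d in zip(scores, has_disease) if s >= cut and not d)
        points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Trapezoidal area under the ROC curve (the C statistic)."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical scores; 1 = disease present by the criterion standard.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
has_disease = [1, 1, 0, 1, 1, 0, 0, 0]

points = roc_points(scores, has_disease)
auc = area_under_curve(points)
print(points)
print(auc)
```

A curve that hugs the upper-left corner yields an area close to 1, whereas a test that discriminates no better than chance yields an area close to 0.5.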

Table 19.5-3. Diagnostic Discrimination

Test result | Disease by criterion standard | Disease-free by criterion standard
Positive | a (true positives) | b (false positives)
Negative | c (false negatives) | d (true negatives)
Total | a + c = total number of persons with disease | b + d = total number of persons without disease

Sensitivity = a/(a+c)

Specificity = d/(b+d)

Positive predictive value = a/(a+b)

Negative predictive value = d/(c+d)
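The 4 measures in Table 19.5-3 can be computed directly from the cell counts a, b, c, and d. The following Python sketch is illustrative only; the function name and the counts are hypothetical.

```python
def diagnostic_discrimination(a, b, c, d):
    """Compute the Table 19.5-3 measures from the 2 x 2 cell counts.

    a = true positives, b = false positives,
    c = false negatives, d = true negatives.
    """
    return {
        "sensitivity": a / (a + c),
        "specificity": d / (b + d),
        "positive_predictive_value": a / (a + b),
        "negative_predictive_value": d / (c + d),
    }


# Hypothetical counts for 100 persons with disease and 200 without.
metrics = diagnostic_discrimination(a=90, b=30, c=10, d=170)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```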

reference group: group of presumably disease-free individuals from which a sample of individuals is drawn and tested to establish a range of normal values for a test.95

regression analysis: statistical techniques used to describe a dependent variable as a function of 1 or more independent variables; often used to control for confounding variables.95 See also linear regression and logistic regression.

regression line: diagrammatic presentation of a linear regression equation, with the independent variable plotted on the x-axis and the dependent variable plotted on the y-axis. As many as 3 variables may be depicted on the same graph.98

regression to the mean: the principle that extreme values are unlikely to recur. If a test that produced an extreme value is repeated, it is likely that the second result will be closer to the mean. Thus, after repeated observations results tend to “regress to the mean.” A common example is blood pressure measurement; on repeated measurements, individuals who are initially hypertensive often will have a blood pressure reading closer to the population mean than the initial measurement was.95

relative risk (RR): probability of developing an outcome within a specified period if a risk factor is present divided by the probability of developing the outcome in that same period if the risk factor is absent. The relative risk is applicable to randomized clinical trials and cohort studies95; for case-control studies, the odds ratio can be used to approximate the relative risk if the outcome is infrequent.

→ The relative risk should be accompanied by confidence intervals.

Example: The individuals with untreated mild hypertension had a relative risk of 2.4 (95% CI, 1.9-3.0) for stroke or transient ischemic attack. [In this example, individuals with untreated mild hypertension were 2.4 times as likely as were individuals in the comparison group to have a stroke or transient ischemic attack.]
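To show how a relative risk and its confidence interval might be derived from cohort counts, the following Python sketch uses the common large-sample approximation for the standard error of the natural logarithm of the relative risk. The function name and the counts are hypothetical, and the choice of this particular interval method is an assumption; other methods exist.

```python
import math


def relative_risk(exposed_events, exposed_total,
                  unexposed_events, unexposed_total, z=1.96):
    """Return (RR, lower, upper) using a log-based approximate 95% CI."""
    risk_exposed = exposed_events / exposed_total
    risk_unexposed = unexposed_events / unexposed_total
    rr = risk_exposed / risk_unexposed
    # Large-sample standard error of ln(RR).
    se_log_rr = math.sqrt(
        1 / exposed_events - 1 / exposed_total
        + 1 / unexposed_events - 1 / unexposed_total
    )
    lower = math.exp(math.log(rr) - z * se_log_rr)
    upper = math.exp(math.log(rr) + z * se_log_rr)
    return rr, lower, upper


# Hypothetical cohort: 30 of 100 exposed and 15 of 100 unexposed had the outcome.
rr, lower, upper = relative_risk(30, 100, 15, 100)
print(f"RR = {rr:.1f} (95% CI, {lower:.2f}-{upper:.2f})")
```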

relative risk reduction (RRR): proportion of the control group experiencing a given outcome minus the proportion of the treatment group experiencing the outcome divided by the proportion of the control group experiencing the outcome.
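The definition above translates into a single line of arithmetic. The following Python sketch is illustrative only; the function name and the trial counts are hypothetical.

```python
def relative_risk_reduction(control_events, control_total,
                            treated_events, treated_total):
    """(control proportion - treatment proportion) / control proportion."""
    control_risk = control_events / control_total
    treated_risk = treated_events / treated_total
    return (control_risk - treated_risk) / control_risk


# Hypothetical trial: 20 of 100 controls and 15 of 100 treated had the outcome.
rrr = relative_risk_reduction(20, 100, 15, 100)
print(f"RRR = {rrr:.2f}")
```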

reliability: ability of a test to replicate a result given the same measurement conditions, as distinguished from validity, which is the ability of a test to measure what it is intended to measure.98

repeated measures: analysis designed to take into account the lack of independence of events when measures are repeated in each participant over time (eg, blood pressure, weight, or test scores). This type of analysis emphasizes the change measured for a participant over time rather than the differences between participants over time. A traditional analytic technique rarely in use now for repeated measures is ANOVA with repeated measures. Repeated measures are correlated and therefore violate the regression analysis requirement that data be independent. This correlation can be accounted for by using multilevel models when performing regression. The repeated measures are considered clustered within the individual from whom the measure was obtained. With multilevel modeling, the first level is the repeated measure and the second level is the individual who had repeated measures.

repeated-measures ANOVA: see analysis of variance (ANOVA).

reporting bias: a bias in assessment that can occur when individuals in one group are more likely than individuals in another group to report past events. Reporting bias is especially likely to occur when different groups have different reasons to report or not report information.95 For example, when examining behaviors, adolescent girls may be less likely than adolescent boys to report being sexually active. See also recall bias.

reproducibility: ability of a test to produce consistent results when repeated under the same conditions and interpreted without knowledge of the prior results obtained with the same test95; synonymous with reliability.

residual: measure of the discrepancy between observed and predicted values. The residual SD is a measure of the goodness of fit of the regression line to the data and gives the uncertainty of estimating a point y from a point x.93

residual confounding: in observational studies, the possibility that differences in outcome may be caused by unmeasured or unmeasurable factors.136

response rate: number of complete interviews with reporting units divided by the number of eligible units in the sample88 (see 19.3.9, Survey Studies). See also participation rate.

retrospective study: study performed after the outcomes of interest have already occurred98; most commonly a case-control study but also may be a retrospective cohort study or case series or observational study. Antonym is prospective study.

right-censored data: see censored data.

risk: probability that an event will occur during a specified period. Risk is equal to the number of individuals who develop the disease during the period divided by the number of disease-free persons at the beginning of the period.95

risk factor: characteristic or factor that is associated with an increased probability of developing a condition or disease. Also called a risk marker, a risk factor does not necessarily imply a causal relationship. A modifiable risk factor is one that can be modified through an intervention98 (eg, stopping smoking or treating an elevated cholesterol level, as opposed to a genetically linked characteristic for which there is no effective treatment).

risk ratio: the ratio of 2 risks. See also relative risk.

robustness: term used to indicate that a statistical procedure’s assumptions (most commonly, normal distribution of data) can be violated without a substantial effect on its conclusions.98

root-mean-square: see standard deviation.

rule of 3: method used to estimate the number of observations required to have a 95% chance of observing at least 1 episode of a serious adverse effect. For example, to observe at least 1 case of penicillin anaphylaxis that occurs in approximately 1 in 10 000 cases treated, 30 000 treated cases must be observed. If an adverse event occurs 1 in 15 000 times, 45 000 cases need to be treated and observed.95
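The rule of 3 is a shortcut for solving 1 − (1 − p)^n ≥ .95 for n, where p is the per-patient probability of the event. The following Python sketch compares the shortcut with the exact calculation; the function names are hypothetical, and the exact calculation assumes independent observations.

```python
import math


def rule_of_3(one_in):
    """Approximate number of observations needed: 3 times the denominator."""
    return 3 * one_in


def exact_required_n(one_in, confidence=0.95):
    """Smallest n with 1 - (1 - p)^n >= confidence, where p = 1 / one_in."""
    p = 1 / one_in
    return math.ceil(math.log(1 - confidence) / math.log(1 - p))


approx_n = rule_of_3(10_000)         # event occurs in about 1 in 10 000
exact_n = exact_required_n(10_000)
print(approx_n, exact_n)
```

The shortcut slightly overstates the exact requirement, which makes it a conservative planning figure.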

run-in period: a period at the start of a trial when no treatment is administered (although a placebo may be administered). This can help to ensure that patients are stable and will adhere to treatment. This period may also be used to allow patients to discontinue any previous treatments and so is sometimes also called a washout period.

sample: subset of a larger population, selected for investigation to draw conclusions or make estimates about the larger population.140

sampling error: error introduced by chance differences between the estimate obtained from the sample and the true value in the population from which the sample was drawn. Sampling error is inherent in the use of sampling methods and is measured by the standard error.95

Scheffé test: see multiple comparisons procedures.

SD: see standard deviation.

SE: see standard error.

SEE: see standard error of the estimate.

selection bias: bias in assignment that occurs when the way the study and control groups are chosen causes them to differ from each other by at least 1 factor that affects the outcome of the study.95

→ A common type of selection bias occurs when individuals from the study group are drawn from one population (eg, patients seen in an emergency department or admitted to a hospital) and the control participants are drawn from another (eg, clinic patients). Regardless of the disease under study, the clinic patients will be healthier overall than the patients seen in the emergency department or hospital and will not be comparable controls. A similar example is the “healthy worker effect”: people who hold jobs are likely to have fewer health problems than those who do not, and thus comparisons between these groups may be biased.

SEM: see standard error of the mean.

sensitivity: proportion of individuals with the disease or condition as measured by the criterion standard (or reference standard) who have a positive test result (individuals with true-positive results divided by all those with the disease).95 See Table 19.5-3 and diagnostic discrimination.

→ Although sensitivity is the most commonly cited measure of a test’s efficacy in establishing a diagnosis, it is not the best to use to determine the clinical efficacy of a test. This is because it measures how often a test result is positive when a patient has a disease. When patients are seen in clinic, it is not known what they have, and tests are obtained to make a diagnosis. What must be known is how often a patient for whom the diagnosis is not known has a positive result and has the disease in question (ie, the positive predictive value of the test). Better yet, use likelihood ratios to determine the efficacy of a test in establishing diagnoses.

→ One way to remember how to best use test sensitivity and specificity is SNOUT/SPIN: SNOUT means when the result of a highly Sensitive test is Negative the disease is essentially ruled OUT (because highly sensitive tests have few false-negative results). SPIN means that the result of a highly Specific test when Positive rules a disease IN (because highly specific tests have few false-positive results).
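The dependence of the positive predictive value on how common the disease is can be illustrated with a short Python sketch. The function names and the input values are hypothetical; the likelihood ratio formulas used are the conventional ones (sensitivity divided by 1 − specificity for a positive result, and 1 − sensitivity divided by specificity for a negative result).

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """Probability of disease given a positive result, for a given prevalence."""
    true_positive = sensitivity * prevalence
    false_positive = (1 - specificity) * (1 - prevalence)
    return true_positive / (true_positive + false_positive)


def likelihood_ratios(sensitivity, specificity):
    """Return (positive likelihood ratio, negative likelihood ratio)."""
    return sensitivity / (1 - specificity), (1 - sensitivity) / specificity


# Same hypothetical test (sensitivity 0.90, specificity 0.90) in 2 settings.
ppv_rare = positive_predictive_value(0.90, 0.90, prevalence=0.01)
ppv_common = positive_predictive_value(0.90, 0.90, prevalence=0.50)
lr_positive, lr_negative = likelihood_ratios(0.90, 0.90)
print(f"{ppv_rare:.3f} {ppv_common:.3f} {lr_positive:.1f} {lr_negative:.2f}")
```

With identical sensitivity and specificity, the positive predictive value is far lower when the disease is rare, which is why sensitivity alone does not describe clinical usefulness.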

sensitivity analysis: method to determine the robustness of an assessment by examining the extent to which results are changed by differences in methods, values of variables, or assumptions95; applied in decision analysis to test the robustness of the conclusion to changes in the assumptions. It is performed by inserting a range of values into the variables of a statistical model and assessing how they influence the model’s output. Because regression models are only valid with the range of data used to build them, the data inserted for a sensitivity analysis should only be within the same ranges.

signed rank test: see Wilcoxon signed rank test.

significance: statistically, the testing of the null hypothesis of no difference between groups. A significant result rejects the null hypothesis. Statistical significance is highly dependent on sample size and provides no information about the clinical significance of the result. Large samples will almost always have statistically significant differences. Consequently, it is better to assess the effect size of the difference rather than its statistical significance when determining clinical significance.114,138 Clinical significance also involves a judgment as to whether the risk factor or intervention studied would affect a patient’s outcome enough to make a difference for the patient. The level of clinical significance considered important is sometimes defined prospectively (often by consensus of a group of physicians) as the minimal clinically important difference, but the cutoff is arbitrary. Avoid the use of phrases such as “marginal significance” or “trend toward significance” or speculative words such as “highly significant.”

sign test: a nonparametric test of significance that depends on the signs (positive or negative) of variables and not on their magnitude; used when combining the results of several studies, as in meta-analysis.98 See also Cox-Stuart trend test.

skewness: the degree to which the data are asymmetric on either side of the central tendency. Data for a variable with a longer tail on the right of the distribution curve are referred to as positively skewed; data with a longer left tail are negatively skewed.104

snowball sampling: a sampling method used in qualitative research in which survey respondents are asked to recommend other respondents who might be eligible to participate in the survey. This may be used when the researcher is not entirely familiar with demographic or cultural patterns in the population under investigation.

Spearman rank correlation (ρ): statistical test used to determine the covariance between 2 nominal or ordinal variables.104 The nonparametric equivalent to the Pearson product moment correlation, it can also be used to calculate the coefficient of determination.
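One way to see the relationship to the Pearson correlation is to rank each variable and then compute the Pearson correlation of the ranks. The following Python sketch does this with tied values given their average rank; the function names and the data are hypothetical.

```python
def average_ranks(values):
    """Rank values from 1 upward, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson product moment correlation of 2 equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical ordinal ratings from 2 raters for 5 patients.
rater_1 = [1, 2, 3, 4, 5]
rater_2 = [2, 1, 4, 3, 5]
rho = spearman_rho(rater_1, rater_2)
print(rho, rho ** 2)  # rho and the corresponding coefficient of determination
```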

specificity: proportion of those without the disease or condition as measured by the criterion standard who have negative results by the test being studied95 (individuals with true-negative results divided by all those without the disease). See Table 19.5-3 and diagnostic discrimination.

standard deviation (SD): commonly used descriptive measure of the spread or dispersion of data. It is calculated by obtaining the positive square root of the variance.95 For normally distributed data, the mean ± 1 SD includes approximately 68% of the values, and the mean ± 2 SDs includes approximately the middle 95% of values obtained. It may be represented by the Greek letter sigma (σ) when referring to a population or s when referring to a sample. In contrast to the variance, the units of the SD are the same as are those of the original data, easing the interpretation of the SD as it relates to the original data.108

→ Describing data by means of SD implies that the data are normally distributed; if they are not, then the median value and interquartile range or a similar measure involving quantiles is more appropriate to describe the data. One indication that data are not normally distributed is when the SD is larger than the mean (eg, mean [SD] length of stay = 9 [15] days or mean [SD] age at evaluation = 4 [5.3] days). When this occurs, the mean and SD should not be used to represent the central tendency of the data; quantiles should be used. Note that the format mean (SD) should be used.
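The check described above (an SD larger than the mean) and the alternative summary can be sketched with the Python standard library. The data are hypothetical lengths of stay chosen to be right skewed, and the quartile method is the library default, which is an assumption rather than a recommendation.

```python
import statistics

# Hypothetical lengths of stay in days, with 1 very long stay.
length_of_stay = [2, 3, 3, 4, 4, 5, 6, 7, 12, 44]

mean = statistics.mean(length_of_stay)
sd = statistics.stdev(length_of_stay)          # sample SD (n - 1 denominator)
median = statistics.median(length_of_stay)
q1, _, q3 = statistics.quantiles(length_of_stay, n=4)

if sd > mean:
    # SD exceeds the mean: report median (IQR) instead of mean (SD).
    print(f"median (IQR), {median} ({q1}-{q3}) days")
else:
    print(f"mean (SD), {mean:.1f} ({sd:.1f}) days")
```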

standard error (SE): positive square root of the variance of the sampling distribution of the statistic.93 The SE provides an estimate of the precision with which a parameter (such as the mean value) can be estimated. There are several types of SE; the type intended should be clear. See also standard error of the mean (SEM).108,120

In text and tables that provide descriptive statistics, SD rather than SE is usually appropriate; by contrast, parameter estimates (eg, regression coefficients) should be accompanied by SEs. In figures where error bars are used, the 95% confidence interval is preferred (see Figure 4.2-8 in, Bar Graphs).

standard error of the difference: measure of the dispersion of the differences between samples of 2 populations, usually the differences between the means of 2 samples; used in the t test.

standard error of the estimate (SEE): SD of the observed values about the regression line.93

standard error of the mean (SEM): an inferential statistic, which describes the certainty with which the mean computed from a random sample estimates the true mean of the population from which the sample was drawn.94 If multiple samples of a population were taken, then 95% of the samples would have means that fall within 2 SEMs of the mean of all the sample means. Larger sample sizes will be accompanied by smaller SEMs because larger samples provide a more precise estimate of the population mean than do smaller samples.108,120

→ The SEM is not interchangeable with SD. The SD generally describes the observed dispersion of data around the mean of a sample. By contrast, the SEM provides an estimate of the precision with which the true population mean can be inferred from the sample mean. The mean itself can thus be understood as either a descriptive or an inferential statistic; it is this intended interpretation that governs whether it should be accompanied by the SD or SEM. In the former case the mean simply describes the average value in the sample and should be accompanied by the SD, whereas in the latter it provides an estimate of the population mean and should be accompanied by the SEM. The interpretation of the mean is often clear from the text, but authors may need to be queried to discern their intent in presenting this statistic.

→ When many samples are obtained from a population, the mean of each sample will differ somewhat from the true population mean. The spread (or SD) of this set of mean values is characterized by the SE. Larger sample sizes will result in a smaller SE. The SEM is calculated by dividing the sample SD by the square root of the number of data points in the sample. The units of the SE are the same as those for the original data.
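The calculation of the SEM from the sample SD can be sketched as follows. The function name and the data are hypothetical.

```python
import math
import statistics


def standard_error_of_mean(sample):
    """SEM = sample SD divided by the square root of the sample size."""
    return statistics.stdev(sample) / math.sqrt(len(sample))


# Hypothetical sample of 8 measurements.
sample = [2, 4, 4, 4, 5, 5, 7, 9]
sd = statistics.stdev(sample)
sem = standard_error_of_mean(sample)
print(f"mean = {statistics.mean(sample)}, SD = {sd:.2f}, SEM = {sem:.2f}")
```

Because the divisor is the square root of the sample size, the sample must be 4 times as large to halve the SEM for the same SD.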

standard error of the proportion: SD of the population of all possible values of the proportion computed from samples of a given size.94

standardization (of a rate): adjustment of a rate to account for factors such as age or sex.95

standardized mortality ratio: ratio in which the numerator contains the observed number of deaths and the denominator contains the number of deaths that would be expected in a comparison population. This ratio implies that confounding factors have been controlled for by means of indirect standardization. It is distinguished from proportionate mortality ratio, which is the mortality rate for a specific disease.95

standard normal distribution: a normal distribution in which the raw scores have been recomputed to have a mean of 0 and an SD of 1.104 Such recomputed values are referred to as z scores or standard scores. The mean, median, and mode are all equal to zero. The variance equals 1 (eg, unit variance, σ² = 1).
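The recomputation of raw scores into z scores can be sketched in Python. The function name and the raw scores are hypothetical, and the sample SD (n − 1 denominator) is used here as an assumption; a population SD could be used instead.

```python
import statistics


def z_scores(values):
    """Subtract the mean from each value and divide by the SD."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(value - mean) / sd for value in values]


# Hypothetical raw test scores.
raw_scores = [50, 60, 70, 80, 90]
standardized = z_scores(raw_scores)
print([round(z, 2) for z in standardized])
```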

standard score: see z score.93

statistic: value calculated from sample data that is used to estimate a value or parameter in the larger population from which the sample was obtained,95 as distinguished from data, which refers to the actual values obtained via direct observation (eg, measurement, medical record review, patient interview).

stepped-wedge design: the sequential rollout of a quality improvement (QI) intervention to study units (clinicians, organizations) during a number of periods so that by the end of the study all participants have received the intervention, usually in a cluster randomized trial. The order in which participants receive the intervention may be randomized (similar rigor to cluster randomized designs). Data are collected and outcomes measured at each point at which a new group of participants (“step”) receives the QI intervention. Observed differences in outcomes between the control section of the wedge and the intervention section are attributed to the intervention.

stochastic: type of measure that implies the presence of a random variable.93

stopping rule: rule, based on a test statistic or other function, specified as part of the design of the trial and established before patient enrollment, that specifies a limit for the observed treatment difference for the primary outcome measure, which, if exceeded, will lead to the termination of the trial or one of the study groups.145,146,147 The stopping rules are designed to ensure that a study does not continue to enroll patients after a significant treatment difference has been demonstrated that would still exist regardless of the treatment results of subsequently enrolled patients.

stratification: division into groups. Stratification may be used to compare groups separated according to similar confounding characteristics. Stratified sampling may be used to increase the number of individuals sampled in rare categories of independent variables or to obtain an adequate sample size to examine differences among individuals with certain characteristics of interest.46

Student-Newman-Keuls test: see Newman-Keuls test.

[Student] t test: see t test. W. S. Gosset, who originated the test, wrote under the name Student because his employment as a statistician with the Guinness brewing company precluded individual publication.98 Simply using the term t test is preferred.120

study group: in a controlled clinical trial, the group of individuals who undergo an intervention; in a cohort study, the group of individuals with the exposure or characteristic of interest; and in a case-control study, the group of cases.95

sufficient cause: characteristic that will bring about or cause the disease.95

superiority trial: trial designed to show that a treatment is better than another treatment or placebo. Superiority trials can be thought of as nonequivalence trials. Superiority trials investigate whether a new therapy is better or worse than an established therapy or placebo. Because the new therapy may be better or worse than what it is being compared with, statistical tests should be 2-tailed. These trials tend to require large numbers of patients to have adequate power to be assured that no difference exists when the statistical test shows no significant difference between groups.20,21,22,148 See also equivalence trial.

supportive criteria: substantiation of the existence of a contributory cause. Potential supportive criteria include the strength and consistency of the relationship, the presence of a dose-response relationship, and biological plausibility.95

surrogate end points: in a clinical trial, outcomes that are not of direct clinical importance but that are believed to be related to those that are. Such variables are often physiologic measurements (eg, blood pressure) or biochemical (eg, cholesterol level). Such end points can usually be collected more quickly and economically than clinical end points, such as myocardial infarction or death, but their clinical relevance may be less certain.

survival analysis: statistical procedures for estimating the survival function and for making inferences about how it is affected by treatment and prognostic factors.98 See also life table.

target population: group of individuals to whom one wishes to apply or extrapolate the results of an investigation, not necessarily the population studied.95 If the target population is different from the population studied, whether the study results can be extrapolated to the target population should be discussed.

τ (tau): see Kendall τ rank correlation.

total sum of the square (TSS): sum of the squared differences between each observed value and the overall mean value for a data set. It is used in regression analysis to calculate how well the regression line fits the data and in ANOVA for calculating the variance (which is the TSS divided by the degrees of freedom, which, in turn, is the number of data points minus 1).
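The relationship between the TSS and the variance can be sketched in Python. The function names and the data are hypothetical; the result is compared with the standard library's sample variance as a cross-check.

```python
import statistics


def total_sum_of_squares(values):
    """Sum of squared differences between each value and the overall mean."""
    mean = sum(values) / len(values)
    return sum((value - mean) ** 2 for value in values)


def sample_variance(values):
    """TSS divided by the degrees of freedom (number of data points minus 1)."""
    return total_sum_of_squares(values) / (len(values) - 1)


# Hypothetical data set with an overall mean of 7.
data = [3, 5, 7, 9, 11]
tss = total_sum_of_squares(data)
variance = sample_variance(data)
print(tss, variance, statistics.variance(data))
```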

trend, test for: see χ2 test.

trial: controlled experiment with an uncertain outcome93; used most commonly to refer to a randomized study.

triangulation: in qualitative research, the simultaneous use of several different techniques to study the same phenomenon, thus revealing and avoiding biases that may occur if only a single method were used.

true negative: negative test result in an individual who does not have the disease or condition as determined by the criterion standard.95 See also Table 19.5-3.

true-negative rate: number of individuals who have a negative test result and do not have the disease by the criterion standard divided by the total number of individuals who do not have the disease as determined by the criterion standard; usually expressed as a decimal (eg, the true-negative rate was 0.85). Synonymous with a test’s specificity. See also Table 19.5-3.

true positive: positive test result in an individual who has the disease or condition as determined by the criterion standard.95 See also Table 19.5-3.

true-positive rate: number of individuals who have a positive test result and have the disease as determined by the criterion standard divided by the total number of individuals who have the disease as measured by the criterion standard; usually expressed as a decimal (eg, the true-positive rate was 0.92). Synonymous with a test’s sensitivity. See also Table 19.5-3.

t test: statistical test used to determine whether the means of 2 groups that have continuous data with equal variances are statistically different from one another. Use of the t test assumes that the variables have a normal distribution; if not, nonparametric statistics must be used.95,120

→ Usually the t test is unpaired, unless the data have been measured in the same individual over time. A paired t test is appropriate to assess the change of the parameter in the individual from baseline to final measurement; in this case, the dependent variable is the change from one measurement to the next. These changes are usually compared against 0, on the null hypothesis that there is no change from time 1 to time 2.

→ Presentation of the t statistic should include the degrees of freedom (df), whether the t test was paired or unpaired, and whether a 1-tailed or 2-tailed test was used. Because a 1-tailed test assumes that the study effect can have only 1 possible direction (ie, only beneficial or only harmful), justification for use of the 1-tailed test must be provided. (The 1-tailed test at α = .05 is similar to testing at α = .10 for a 2-tailed test and therefore is more likely to give a significant result.)

Example: The difference was significant by a 2-tailed test for paired samples (t₁₅ = 2.78, P = .05).

→ The t test can also be used to compare different coefficients of variation.
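The paired t statistic described above, in which the within-individual changes are compared against 0, can be sketched in Python. The function name and the blood pressure values are hypothetical. The sketch returns only the t statistic and the degrees of freedom; obtaining a P value requires the t distribution, which is not part of the standard library and is omitted here.

```python
import math
import statistics


def paired_t(before, after):
    """Return (t statistic, df) for the mean within-individual change vs 0."""
    changes = [a - b for b, a in zip(before, after)]
    n = len(changes)
    mean_change = statistics.mean(changes)
    se_change = statistics.stdev(changes) / math.sqrt(n)
    return mean_change / se_change, n - 1


# Hypothetical systolic blood pressure (mm Hg) in 5 individuals.
before = [140, 152, 138, 145, 150]
after = [135, 147, 136, 140, 146]
t_statistic, df = paired_t(before, after)
print(f"t{df} = {t_statistic:.2f}")
```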

Tukey test: a type of multiple comparisons procedure used to determine which groups are different from one another in an ANOVA procedure.

2-tailed test: test of statistical significance in which deviations from the null hypothesis in either direction are considered.95 For most outcomes, the 2-tailed test is appropriate unless there is a plausible reason why only 1 direction of effect is considered and a 1-tailed test is appropriate. Commonly used for the t test but can also be used in other statistical tests. Synonymous with 2-sided test.

type I error: a result in which the sample data lead to a rejection of the null hypothesis despite the fact that the null hypothesis is actually true in the population. The α level is the probability of a type I error that will be permitted, usually .05. Essentially the same as saying that a difference between groups is found when it does not really exist.

→ A frequent cause of a type I error is performing multiple comparisons, which increases the likelihood that a significant result will be found by chance. To avoid a type I error, one of several multiple comparisons procedures can be used.
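The inflation of the type I error with multiple comparisons can be illustrated numerically. The following Python sketch assumes that the comparisons are independent, which is a simplification, and uses the Bonferroni adjustment as 1 example of a multiple comparisons procedure; the function names are hypothetical.

```python
def familywise_error_rate(alpha, comparisons):
    """Chance of at least 1 false-positive result among independent tests."""
    return 1 - (1 - alpha) ** comparisons


def bonferroni_alpha(alpha, comparisons):
    """Per-comparison alpha level after a Bonferroni adjustment."""
    return alpha / comparisons


fwer = familywise_error_rate(0.05, 10)
adjusted_alpha = bonferroni_alpha(0.05, 10)
print(f"{fwer:.2f} {adjusted_alpha:.3f}")
```

With 10 independent comparisons each tested at α = .05, the chance of at least 1 spurious significant result is roughly 40%.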

type II error: the situation where the sample data lead to a failure to reject the null hypothesis despite the fact that the null hypothesis is actually false in the population. Essentially the same as concluding that there is no difference between groups when one really exists.

→ A frequent cause of a type II error is insufficient sample size. Therefore, a power calculation should be performed when a study is planned to determine the sample size needed to avoid a type II error.

uncensored data: continuous data reported as collected, without adjustment, as opposed to censored data.

uniform prior: assumption that no useful information regarding the outcome of interest is available before the study and thus that all individuals have an equal prior probability of the outcome. See also bayesian analysis.

unity: a relative risk of 1 is a relative risk of unity, and a regression line with a slope of 1 is said to have a slope of unity. Synonymous with the number 1.

univariable analysis: statistical tests that involve only 1 dependent variable and no independent variables. Uses measures of central tendency (mean or median) and location or dispersion. The term may also apply to an analysis in which there are no independent variables. In this case, the purposes of the analysis are to describe the sample, determine how the sample compares with the population, and determine whether chance has resulted in a skewed distribution of 1 or more of the variables in the study. If the characteristics of the sample do not reflect those of the population from which the sample was drawn, the results may not be generalizable to that population.95 It is common to use the incorrect term univariate. The suffix -ate means to act on. Because no variable is acted on in a univariable analysis, univariable is a more appropriate term than univariate when there is only a single variable involved. See also multivariable and multivariate analysis.

unpaired analysis: method that compares 2 treatment groups when the 2 treatments are not given to the same individual. Most case-control studies also use unpaired analysis.

unpaired t test: see t test.


U test: see Wilcoxon rank sum test.

utility: in decision theory and clinical decision analysis, a scale used to judge the preference of achieving a particular outcome (used in studies to quantify the value of an outcome vs the discomfort of the intervention to a patient) or the discomfort experienced by the patient with a disease.98 Commonly used methods are the time trade-off and the standard gamble. The result is expressed as a single number along a continuum from death (0) to full health or absence of disease (1.0). This quality number can then be multiplied by the number of years a patient is in the health state produced by a particular treatment to obtain the quality-adjusted life-year (see 19.3.7, Economic Analyses).
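The multiplication described above can be sketched in Python. The function name and the values are hypothetical.

```python
def quality_adjusted_life_years(utility, years):
    """Multiply a utility (0 = death, 1.0 = full health) by years in that state."""
    if not 0 <= utility <= 1:
        raise ValueError("utility must lie between 0 and 1")
    return utility * years


# Hypothetical: 8 years lived in a health state with a utility of 0.75.
qalys = quality_adjusted_life_years(0.75, 8)
print(qalys)
```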

validity (of a measurement): degree to which a measurement is appropriate for the question being addressed or measures what it is intended to measure. For example, a test may be highly consistent and reproducible over time, but unless it is compared with a criterion standard or other validation method, the test cannot be considered valid. Construct validity refers to the extent to which the measurement corresponds to theoretical concepts. Because there are no criterion standards for constructs, construct validity is generally established by comparing the results of one method of measurement with those of other methods. Content validity is the extent to which the measurement samples the entire domain under study (eg, a measurement to assess delirium must evaluate cognition). Criterion validity is the extent to which the measurement is correlated with some quantifiable external criterion (eg, a test that predicts reaction time). Validity can be concurrent (assessed simultaneously) or predictive (eg, ability of a standardized test to predict school performance).98 See also diagnostic discrimination.

→ Validity of a test is sometimes mistakenly used as a synonym of reliability; the two are distinct statistical concepts and should not be used interchangeably. Validity is related to the idea of accuracy or how close the measured value is to the real value, whereas reliability is related to the idea of precision or how close to one another successive measures are.

validity (of a study): internal validity means that the observed differences between the control and comparison groups may, apart from sampling error, be attributed to the effect under study; external validity or generalizability means that a study can produce unbiased inferences regarding the target population, beyond the participants in the study.98

Van der Waerden test: nonparametric test that is sensitive to differences in location for 2 samples from otherwise identical populations.93

variable: characteristic measured as part of a study. Variables may be dependent (usually the outcome of interest) or independent (characteristics of individuals that may affect the dependent variable).

variance: variation measured in a set of data for one variable, defined as the sum of the squared deviations (the difference between each data point and the mean of the variable) divided by the degrees of freedom (number of observations in the sample minus 1).104 The SD is the square root of the variance. The units for the variance are the square of the units of the data from which the variance was calculated.
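
As a minimal sketch of this definition, the following Python computes the sample variance and SD directly from the squared deviations. The function names and the data values are hypothetical, chosen only so the arithmetic is easy to follow.

```python
import math


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("at least 2 observations are needed")
    mean = sum(values) / n
    sum_of_squares = sum((x - mean) ** 2 for x in values)
    return sum_of_squares / (n - 1)  # degrees of freedom = n - 1


def sample_sd(values):
    """The SD is the square root of the variance."""
    return math.sqrt(sample_variance(values))


# Hypothetical data: mean is 5, squared deviations sum to 32, so variance is 32 / 7.
data = [2, 4, 4, 4, 5, 5, 7, 9]
variance = sample_variance(data)
sd = sample_sd(data)
```

If the data were measured in centimeters, the variance here would be in square centimeters and the SD in centimeters.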

variance components analysis: process of isolating the sources of variability in the outcome variable for the purpose of analysis.

variance ratio distribution: synonymous with F distribution; used for ANOVA.98

visual analog scale: scale used to quantify subjective factors, such as pain, satisfaction, or values that individuals attach to possible outcomes. Participants are asked to indicate where their current feelings fall by marking a straight line with 1 extreme, such as “worst pain ever experienced,” at one end of the scale and the other extreme, such as “no pain,” at the other end. The feeling (eg, degree of pain) is quantified by measuring the distance from the mark on the scale to the end of the scale.98

washout period: see 19.2.2, Crossover Trials.

Wilcoxon rank sum test: a nonparametric test that ranks and sums observations from combined samples and compares the result with the sum of ranks from 1 sample.93 U is the statistic that results from the test. Synonymous with Mann-Whitney test.
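
The ranking and summing steps can be sketched as follows. This illustration computes only the rank sum for 1 sample and the corresponding U, not a P value. The function names and data are hypothetical, and assigning tied values their average rank is an assumption of this sketch.

```python
def average_ranks(values):
    """Rank values from 1 upward, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        average_rank = (start + end) / 2 + 1  # mean of 1-based ranks start+1..end+1
        for pos in range(start, end + 1):
            ranks[order[pos]] = average_rank
        start = end + 1
    return ranks


def rank_sum_and_u(sample_a, sample_b):
    """Return the rank sum of sample_a in the combined data and its U statistic."""
    combined = list(sample_a) + list(sample_b)
    ranks = average_ranks(combined)
    n_a = len(sample_a)
    rank_sum_a = sum(ranks[:n_a])
    u_a = rank_sum_a - n_a * (n_a + 1) / 2
    return rank_sum_a, u_a


# Hypothetical data: sample_a holds ranks 1, 2, and 4 in the combined ordering.
rank_sum, u = rank_sum_and_u([1.1, 2.3, 3.0], [2.9, 4.2, 5.5, 6.1])
```

With these values the rank sum is 7 and U is 1, which matches the single pair (3.0 vs 2.9) in which an observation from the first sample exceeds one from the second.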

Wilcoxon signed rank test: nonparametric test in which 2 treatments that have been evaluated by means of matched samples are compared. The difference within each matched pair is ranked according to its absolute size and given the sign of the treatment difference (ie, positive if the treatment effect was positive and vice versa), and the signed ranks are summed.93
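
A minimal sketch of the ranking step follows. It returns the sums of the positive and negative ranks rather than a P value. The function name and data are hypothetical, and two conventions are assumed here: pairs with no difference are dropped, and tied differences receive their average rank.

```python
def signed_rank_sums(first, second):
    """Return (sum of positive ranks, sum of negative ranks) for matched pairs."""
    if len(first) != len(second):
        raise ValueError("matched samples must have the same length")
    # Difference for each pair; zero differences are dropped (assumed convention).
    diffs = [b - a for a, b in zip(first, second) if b != a]
    # Rank the absolute differences, giving ties their average rank.
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    start = 0
    while start < len(order):
        end = start
        while (end + 1 < len(order)
               and abs(diffs[order[end + 1]]) == abs(diffs[order[start]])):
            end += 1
        average_rank = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = average_rank
        start = end + 1
    # Each rank carries the sign of its difference.
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return w_plus, w_minus


# Hypothetical paired data: differences are 2, -1, 4, 0, and 5.
w_plus, w_minus = signed_rank_sums([10, 12, 9, 14, 11], [12, 11, 13, 14, 16])
```

Here the zero difference is dropped, the remaining absolute differences (2, 1, 4, 5) receive ranks 2, 1, 3, and 4, and the positive and negative rank sums are 9 and 1.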

Wilks Λ (lambda): a test used in multivariate analysis of variance (MANOVA) that tests the effect size for all the dependent variables considered simultaneously. It thus adjusts significance levels for multiple comparisons.

x-axis: horizontal axis of a graph. By convention, the independent variable is plotted on the x-axis. Synonymous with abscissa.

Yates correction: continuity correction used to bring a distribution based on discontinuous frequencies closer to the continuous χ² distribution from which χ² tables are derived.98
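
For a 2 × 2 table, the correction is commonly applied by subtracting half the total count from the absolute value of ad − bc before squaring. The sketch below shows that form only. The function name and cell counts are hypothetical, and clamping the corrected difference at 0 is an assumption of this sketch.

```python
def chi_square_2x2(a, b, c, d, yates=True):
    """Chi-square statistic for the 2 x 2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    row_1, row_2 = a + b, c + d
    col_1, col_2 = a + c, b + d
    if 0 in (row_1, row_2, col_1, col_2):
        raise ValueError("every row and column total must be positive")
    difference = abs(a * d - b * c)
    if yates:
        # Continuity correction: shrink the difference by n / 2, never below 0.
        difference = max(0.0, difference - n / 2)
    return n * difference ** 2 / (row_1 * row_2 * col_1 * col_2)


# Hypothetical counts for 2 groups and a binary outcome.
uncorrected = chi_square_2x2(12, 8, 5, 15, yates=False)
corrected = chi_square_2x2(12, 8, 5, 15, yates=True)
```

The corrected statistic is always less than or equal to the uncorrected one, so the correction makes the test more conservative.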

y-axis: vertical axis of a graph. By convention, the dependent variable is plotted on the y-axis. Synonymous with ordinate.

z score: score used to analyze continuous variables that represents the deviation of a value from the mean value, expressed as the number of SDs from the mean. The z score is frequently used to compare children’s height and weight measurements, as well as behavioral scores.98 Synonymous with standard score.
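
The calculation can be sketched as follows. The function name is hypothetical, and the reference mean of 100 and SD of 15 are illustrative values, not figures from a growth chart or test manual.

```python
def z_score(value, mean, sd):
    """Number of SDs by which a value lies above (+) or below (-) the mean."""
    if sd <= 0:
        raise ValueError("SD must be positive")
    return (value - mean) / sd


# Hypothetical reference distribution: mean 100, SD 15.
above = z_score(130, 100, 15)  # 2 SDs above the mean
below = z_score(85, 100, 15)   # 1 SD below the mean
```

Because the result is expressed in SD units, z scores from measurements on different scales can be compared directly.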