

Statistical methods
Item 21c: How missing data were handled in the analysis
Examples
“Regarding the multiple imputation procedure, briefly, for each outcome, the analysis model used was a linear regression with treatment arm, baseline outcome, and ethnicity (randomization stratifier) as explanatory variables. The imputation models contained all the variables of the analysis model(s) as well as factors associated with missingness: age (identified empirically to predict missingness, P = .03) and adherence (number of doses taken of either vitamin D or placebo, P < .001) [386]."
“To consider the potential impact of missing data on trial conclusions, we used multiple imputation (data missing at random) and sensitivity analysis (data not missing at random). Multiple imputation by chained equations was performed using the “mi impute chained” command in Stata. We used a linear regression model to impute missing outcomes for the HOS ADL [activities of daily living subscale of the hip outcome score] at eight months post-randomisation. Variables in the imputation model included all covariates in the analysis model (baseline HOS ADL (continuous), age (continuous), and sex). In addition, we included other variables that were thought to be predictive of the outcome (lateral centre-edge angle, maximum α angle, Kellgren-Lawrence grade, and baseline HADS score). Imputations were run separately by treatment arm and based on a predictive mean matching approach, choosing at random one of the five HOS ADL values with the closest predicted scores. Missing data in the covariates that were included in the multiple imputation model were imputed simultaneously (multiple imputation by chained equation approach). Sensitivity analysis was performed using the “rctmiss” command in Stata, and we considered scenarios where participants with missing data in each arm were assumed to have outcomes that were up to 9 points worse than when data were missing at random [242]."
“Analyses for the 2 primary outcomes compared each treatment with usual care using multiple imputation to handle missing data and a Bonferroni-corrected 2-tailed type I error of .025. We performed 20 imputations with a fully conditional specification using Proc MI in SAS. Imputation was performed with the following prespecified variables: age, study group, study site, clinic, sex, race and ethnicity, body mass index, exercise frequency at baseline, education, employment status, smoking status, other medical conditions at baseline, number of medications used for spine pain at baseline, duration of pain at baseline, number of previous pain episodes, STarT Back score, baseline ODI, baseline self-efficacy, baseline EQ-5D-5L, and scores for patient-reported outcomes at every follow-up point (ODI [Oswestry Disability Index], cost, Lorig et al self-efficacy scale, and EQ-5D-5L [EuroQol 5-dimensional 5-level questionnaire]). Each imputed data set was analyzed separately using Proc GENMOD in SAS (with an identity link and normally distributed errors for ODI and a log link and Poisson-distributed errors for spine-related spending) [387]."
“Missing peak V̇o2 [oxygen consumption] data at week 20, regardless of the type of intercurrent event, was imputed using multiple imputation methodology under the missing at random assumption for the primary analysis. Sensitivity analyses were performed by exploring a missing not at random assumption in the imputation of peak V̇o2. The imputation model used a regression multiple imputation, which includes treatment group, baseline respiratory exchange ratio, persistent atrial fibrillation (yes or no), age, sex, baseline peak V̇o2, baseline hemoglobin level, baseline estimated glomerular filtration rate, baseline body weight, baseline KCCQ [Kansas City cardiomyopathy questionnaire] total symptom score, baseline NYHA [New York Heart Association] class, and baseline average daily activity units (refers to 10 hours of wearing during the awake time for ≥7 days unless otherwise specified). Treatment group, persistent atrial fibrillation (yes or no), baseline NYHA class, and sex were treated as categorical variables. Fifty imputed data sets were generated. Each of the imputed data sets was analyzed using the analysis of covariance model of the primary analysis. Least square mean (LSM) treatment difference and the standard error were combined using Rubin’s rules to produce an LSM estimate of the treatment difference, its 95% CI [confidence interval], and P value for the test of null hypothesis of no treatment effect [388]."
“Multiple imputation was preplanned for the primary outcome measure in the case of missing data; however, because there were no missing data relating to ventilator-free days, imputation was not required [389]."
Explanation
Missing data are common in medical research. Collecting data on all study participants can be challenging even in a trial that has mechanisms to maximise data capture. Missing values can occur in the outcome, in one or more covariates, or, as is often the case, in both. There are many reasons why outcome values may be missing. Patients may stop participating in the trial, withdraw consent for further data collection, or fail to attend follow-up visits, any of which could be related to the treatment allocation, specific (prognostic) factors, or experiencing a specific health outcome [390]. Missing values could also occur in baseline variables, so that the data needed to conduct the trial are only partially recorded. Despite the ubiquity of missing data in medical research, the reporting of missing data and of how they are handled in the analyses is poor [391-399].
Many trialists exclude patients without an observed outcome. Once any randomised participants are excluded, the analysis is not strictly an intention-to-treat analysis. Most randomised trials have some missing observations. Trialists effectively must choose between omitting the participants without final outcome data, imputing their missing outcome data, or using model based approaches such as fitting a linear mixed model to repeated measures data [371]. A complete case (or available case) analysis includes only those participants whose outcome is known. While a few missing outcomes will not cause a problem, many trials have more than 10% of randomised patients with missing outcomes [391-398]. This common situation will result in loss of power by reducing the sample size, and bias may well be introduced if being lost to follow-up is related to a participant’s response to treatment. There should be concern when the frequency or the causes of dropping out differ between the intervention groups.
Participants with missing outcomes can be included in the analysis if their outcomes are imputed (ie, their outcomes are estimated from other information that was collected) or if a model based approach is used. Imputing the values of missing data allows the analysis to potentially conform to intention-to-treat analysis but requires strong assumptions, which may be hard to justify. Simple imputation methods are appealing, but their use may be inadvisable as they fail to account for uncertainty introduced by missing data and may lead to invalid inferences (eg, estimated standard errors for the treatment effect will be too small) [400]. For randomised trials with missing values in repeated measures data, model based approaches such as fitting a linear mixed model can be used to estimate the treatment effect at the final time point; this estimate is valid under a missing-at-random assumption. The model is fitted to outcomes at a (limited) number of time points following randomisation and includes fixed effects for time and randomised group and their interaction [371].
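The point about standard errors can be made concrete with Rubin's rules, which combine results across multiply imputed datasets. The following is a minimal sketch, not a replacement for statistical software: the function name `pool_rubin` and the input values are hypothetical, and the degrees of freedom needed for a confidence interval are not computed. It shows that the pooled variance adds a between-imputation component to the average within-imputation variance, which is the component a single imputation leaves out.

```python
import math
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool per-imputation estimates and their squared standard errors."""
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations with matching variances")
    q_bar = mean(estimates)        # pooled point estimate
    within = mean(variances)       # average within-imputation variance
    between = variance(estimates)  # sample variance of estimates across imputations
    total = within + (1 + 1 / m) * between
    return {
        "estimate": q_bar,
        "within": within,
        "between": between,
        "total_variance": total,
        "se": math.sqrt(total),
    }


# Hypothetical treatment differences and squared standard errors
# from m = 5 imputed datasets (illustrative values only).
estimates = [2.1, 1.8, 2.4, 2.0, 2.2]
variances = [0.25, 0.24, 0.26, 0.25, 0.25]
print(pool_rubin(estimates, variances))
```

Because the between-imputation variance cannot be negative, the pooled standard error is never smaller than the square root of the average within-imputation variance, which is roughly what an analysis of a single imputed dataset would report.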
Another approach that is sometimes used is known as “last observation carried forward,” in which missing final values of the outcome variable are replaced by the last known value before the participant was lost to follow-up. Although this method might appear appealing through its simplicity, the underlying assumption (that the outcome does not change after the last observed value) will rarely be valid, so the method may introduce bias, and it makes no allowance for the uncertainty of imputation. The approach of last observation carried forward has been severely criticised [383, 384, 401]. Sensitivity analyses should be reported to understand the extent to which the results of the trial depend on the missing data assumptions and subsequent analysis (item 21d) [402]. When the findings from the sensitivity analyses are consistent with the results from the primary analysis (eg, complete case for the primary analysis and multiple imputation for a sensitivity analysis), trialists can be reassured that the missing data assumptions and associated methods had little impact on the trial results [403].
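One common form of sensitivity analysis shifts values imputed under a missing-at-random assumption by a fixed amount (a "delta") to represent data that are not missing at random, similar in spirit to the scenarios in the second example above. The sketch below illustrates the idea on a single imputed dataset. The function name, the data, and the assumption that higher scores are better are all hypothetical; in practice the shift would be applied within each imputed dataset and the results pooled.

```python
from statistics import mean


def delta_adjusted_difference(observed_trt, imputed_trt,
                              observed_ctl, imputed_ctl,
                              delta_trt=0.0, delta_ctl=0.0):
    """Mean difference (treatment minus control) after shifting imputed values.

    Imputed values are assumed to come from a missing-at-random model.
    Each delta is subtracted to make the imputed outcomes "worse",
    assuming that a higher score means a better outcome.
    """
    trt = list(observed_trt) + [v - delta_trt for v in imputed_trt]
    ctl = list(observed_ctl) + [v - delta_ctl for v in imputed_ctl]
    return mean(trt) - mean(ctl)


# Hypothetical data: observed outcomes plus values imputed under MAR.
obs_t, imp_t = [70, 75, 80], [74]
obs_c, imp_c = [65, 70, 72], [69]

# Explore increasingly pessimistic assumptions for the treatment arm only.
for delta in (0, 3, 6, 9):
    diff = delta_adjusted_difference(obs_t, imp_t, obs_c, imp_c, delta_trt=delta)
    print(f"delta in treatment arm = {delta}: difference = {diff:.2f}")
```

Reporting the estimated treatment effect across the range of deltas lets readers judge how strong a departure from missing at random would be needed to change the trial's conclusions.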
Regardless of what data are missing, how such data are to be analysed and reported needs to be carefully planned. Authors should describe how missing data were handled in sufficient detail to allow the analysis to be reproduced, at least in principle (box 8).
Box start
Box 8: Guidance for reporting analytical approaches to handle missing data (adapted from Hussain et al [404])
Methods
1) Report any strategies used to reduce missing data throughout the trial process.
2) Report if and/or how the original sample size calculation accounted for missing data (item 16a) and the justification for these decisions. Report if and/or how the sample size was reassessed during the course of the trial (item 16b).
3) Report the assumption about the missing data mechanism for the primary analysis and the justification for this choice, for all outcomes. For multiple imputation methods, report [405]:
• What variables were included in the imputation procedure?
• How were non-normally distributed and binary/categorical variables dealt with?
• If statistical interactions were included in the final analyses (item 21a), were they also included in imputation models?
• Was imputation done separately by randomised group or for all groups combined?
• How many imputed datasets were created?
• How were results from different imputed datasets combined?
4) Report the method used to handle missing data for the primary analysis (eg, complete case, multiple imputation) and the justification for the methods chosen, for all outcomes. Include whether or which auxiliary variables were collected and used.
5) Report the assumptions about the missing data mechanism (eg, missing at random) and methods used to conduct the missing data sensitivity analyses for all outcomes, and the justification for the assumptions and methods chosen.
6) Report how data that were truncated due to death or other causes were handled with a justification for the method(s) (if relevant).
Results
1) Report the numbers and proportions of missing data in each trial arm.
2) Report the reasons for missing data in each trial arm.
3) Report a comparison of the characteristics of those with observed and missing data.
4) Report the primary analysis based on the primary assumption about the missing data mechanism, for all outcomes.
5) Report results of the missing data sensitivity analyses for all outcomes. As a minimum, a summary of the missing data sensitivity analyses should be reported in the main paper with the full results in the supplementary material.
Discussion
1) Discuss the impact of missing data on the interpretation of findings, considering both internal and external validity. For multiple imputation, include whether the variables included in the imputation model make the missing-at-random assumption plausible.
Box end
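The first Results item in box 8 asks for the numbers and proportions of missing data in each trial arm. A minimal sketch of such a tabulation follows. The record layout, the field names `arm` and `outcome`, and the use of `None` to mark a missing value are assumptions made for illustration only.

```python
from collections import defaultdict


def missing_by_arm(records, outcome="outcome", arm="arm"):
    """Return {arm: (n_randomised, n_missing, proportion_missing)}."""
    totals = defaultdict(int)
    missing = defaultdict(int)
    for row in records:
        totals[row[arm]] += 1
        if row.get(outcome) is None:  # None marks a missing outcome
            missing[row[arm]] += 1
    return {a: (n, missing[a], missing[a] / n) for a, n in totals.items()}


# Hypothetical participants (illustrative values only).
participants = [
    {"arm": "intervention", "outcome": 12.5},
    {"arm": "intervention", "outcome": None},
    {"arm": "intervention", "outcome": 10.0},
    {"arm": "intervention", "outcome": 11.2},
    {"arm": "control", "outcome": 9.8},
    {"arm": "control", "outcome": None},
    {"arm": "control", "outcome": None},
    {"arm": "control", "outcome": 8.4},
]

for name, (n, k, p) in missing_by_arm(participants).items():
    print(f"{name}: {k}/{n} missing ({p:.0%})")
```

The denominator here is every randomised participant in the arm, so the proportions stay comparable between arms even when dropout differs.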