R: Imputation of Repeated Measures

rm.impute {Design}

R Documentation

Imputation of Repeated Measures

Description

NOTE: This function is under development and is not correct at present. Uses the method of Lavori, Dawson, and Shera (1995) to analyze uniformly (over subjects) collected repeated measurement data subject to non-random dropout. Separately for each imputation and for each time period, a binary logistic model is developed (using the Design lrm function) to predict the probability that each subject remains in the study at that time period. Predictors for the first time period are those listed in the pformula formula. These are assumed to be baseline variables that are never missing. For later time periods, predictors include the baseline predictors plus the matrix of response (y) values for all earlier periods. These "previous responses" will have missing values imputed from the earlier steps.

Missing responses for time period i are imputed, for one of the n.impute multiple imputations, as follows. The period i fitted propensity model described above is evaluated to obtain the predicted probability that each subject remained in the study until at least period i. The estimated propensity is divided into g quantile groups. If for period i within a propensity quantile group there are a subjects still in the study and b subjects who have dropped out, Rubin's approximate Bayesian bootstrap is used to estimate the predictive distribution of the response values for the b dropouts, given that the propensity for remaining in the study is approximately constant for all subjects (dropouts and non-dropouts) in the group. A sample of size a is selected with replacement from the a subjects still in the study from the propensity group. Then a sample of size b with replacement is selected from this sample of size a. These b responses are used to fill-in the responses for the b dropouts in the quantile group for the current imputation and current time period.

If the right-hand-side of a formula is specified for a univariate response summary (which may be the last response, mean, or area under the time-response curve), rm.impute goes on to fit rformula to this response summary for each of the multiple imputations using a fitting function fitter. After all n.impute imputations have been done, the average "apparent" covariance matrix and the between-imputation covariance matrix are computed to derive Rubin's multiple-imputation-corrected covariance matrix for the average of n.impute sets of regression coefficients. See fit.mult.impute for more details.

The response variable y may be an array to handle multiple responses at each time period. This array has number of rows equal to the number of subjects, number of columns equal to the number of periods, and number of "pages" equal to the number of different response measurements. A utility function pbind is supplied for creating such arrays from a series of matrices. When multiple responses are present, all responses are used in the current propensity model, and the which, nk, rinteraction, and rint.with arguments will apply equally to all responses.

Usage

rm.impute(pformula, y, last, 
          rformula, fitter=ols, which=c("last", "mean", "auc"), 
          data=sys.parent(1), n.impute=10, g=5, 
          nk=0, rinteraction, rint.with=c('all','recent'),
          pr=FALSE, pra=FALSE, npr, 
          keep.prop=FALSE, keep.pfits=FALSE)

pbind(...)

Arguments

`pformula`	right-hand-side propensity formula, e.g., ~ treatment + x1 + x2 + x3*x4. This formula (as well as `rformula` if `fitter` is one of the Design library fitting functions) can contain any of the Design library's transformation functions such as `rcs`, `pol`, etc.
`y`	matrix of responses over time periods. To use `which="auc"`, column names of `y` must contain numeric measurement times.
`last`	an integer vector whose value for the jth subject is the last period before the subject dropped out. A subject who never had a follow-up response measured will have `last=0`.
`...`	a series of matrices to "page bind" into an array. New names may be supplied using the notation `pbind(newname1=y1,newname2=y2)`. The `dimnames` of the first argument (which will be converted to a matrix if it is a vector, for the unusual one-period case) will be used as the first two `dimnames` of the resulting array, and the names of the matrices will form the third vector of `dimnames`.
`rformula`	right-hand-side response formula, e.g., ~ x1 + pol(x2) + treatment. If omitted, `rm.impute` will return only the multiple response imputations.
`fitter`	any S-Plus or Design library fitting function for a univariate response summary. The default is `ols`. If there are multiple response variables at each time period and you want to use a different fitter for different response variables, specify a list of `nr` fitting functions as this argument, where `nr` is the number of response variables.
`which`	which response summary is used if `rformula` is given. The default is the last column of the response matrix.
`data`	usually a data frame, if the variables in `pformula` and `rformula` are not already available via `attach()`
`n.impute`	number of imputations. The more missing data, the higher `n.impute` should be.
`g`	number of propensity quantile groups
`nk`	number of knots to use in expanding each previous response into a restricted cubic spline in the propensity model. Default is zero (assume linearity).
`rinteraction`	a character vector specifying the names of baseline variables that should be interacted with each response in the propensity model. Default is no such interactions.
`rint.with`	set to `"recent"` to allow the variables in `rinteraction` to only interact with the response for the most recent time period, and not with the most recent and all previous responses (the default)
`pr`	set to `TRUE` to print each logistic propensity model fit.
`pra`	if `pr=TRUE`, you can also set `pra=TRUE` to print the Design `anova()` results for each propensity model fit.
`npr`	if `pr=TRUE`, printing will be done for the first `npr` imputations
`keep.prop`	set to `TRUE` to store the array `propensity` in the returned list. The dimensions for `propensity` are the same as `Y`.
`keep.pfits`	set to `TRUE` to store all propensity model fits from `lrm` in the result returned by `rm.impute`

Details

The algorithm used here will not correct for non-random dropout due to variables that are not included in the propensity model. A worst-case would be having dropouts at period i due to unmeasured responses at period i.

Ironically, there must be a sufficient number of dropouts for the propensity score method to work, as the propensity models must have adequate numbers of dropouts and non-dropouts at each time period.

Value

a list with elements Y and optionally fit (if rformula is given) and propensity (if keep.prop=TRUE). Y and propensity are arrays whose last dimension corresponds to the multiple imputations and whose first two dimensions correspond to y. Y is the multiply-imputed response array and fit is the imputation-corrected fit object. Note: Aside from the regression coefficient vector and covariance matrix, this fit object will have parameters from the fit of the response summary for the last imputation. If keep.pfits=TRUE, the returned list will also have an array of propensity fit objects (lrm objects) for all response periods and imputations. If there is more than one response variable at each time period, fit will be a list of nr fit objects for nr response variables.

Side Effects

prints, and creates variables such as y.1, y.2, ... and in.period.i in the session database (frame 0)

Author(s)

Frank Harrell
Department of Biostatistics
Vanderbilt University School of Medicine
f.harrell@vanderbilt.edu
Much valuable input was received from Chris Barker (Roche Pharmaceuticals) and Phil Lavori (Stanford University).

References

Lavori PW, Dawson R, Shera, D: A multiple imputation strategy for clinical trials with truncation of patient data. Stat in Med 14:1913–1925, 1995.

Rubin D, Shenker N: Multiple imputation in health-care data bases: An overview and some applications. Stat in Med 10:585–598, 1991.

Engels JM, Diehr P: Imputation of missing longitudinal data: a comparison of methods. J Clin Epi 56:968–976, 2003.

Examples

## Not run: 
# Generate multiple imputes of the response matrix for later use
Y <- rm.impute(~treatment + pol(age,2)*sex, responses, 
               last=lastvisit, data=mydata)$Y
# Do some analysis for each imputation
fits <- vector('list',10)
for(i in 1:10) {
  y <- Y[,,i]
  fits[[i]] <- my.analysis(X,y)
}

# Function to generate a 4-variate equal correlation pattern response
# with missing-at-random responses; missingness is a function of x and
# previous responses.
#
# pna is a function that computes the probability that a subject
# drops out at the current visit.  For visit 1 pna is a function
# of treatment and baseline covariable x.  For visits > 1 pna is
# a function of the matrix of responses for all previous visits.
#
# If second=TRUE we also generate a second response variable having
# NAs in the same positions as this first one.  y2 is generated
# so that its NAs are completely unrelated to any y2 values if
# y2B.effect=0, as the pna function is only given the first 
# response variable.
# y2 is N(0,1) for treat='A' and N(y2.treat.effect,1) for treat='B'.

testdf <- function(n=1500, seed=7, pna, second=FALSE, y2.treat.effect=0) {

set.seed(seed)
treat <- sample(c('A','B'),n,TRUE)
x <- runif(n)
nt <- 4

mvrnorm <- function(n, p = 1, u = rep(0, p), S = diag(p)) {
  Z <- matrix(rnorm(n * p), p, n)
  t(u + t(chol(S)) %*% Z)
}

# Generate multivariate normal errors for n subjects at nt times
# Assume equal correlations of rho=.5, independent subjects

rho   <- .5
y <- mvrnorm(n, p=nt, S=diag(rep(1-rho,nt))+rho)

y[treat=='B',] <- y[treat=='B',] + 1

cat('\n\nTreatment-specific means for last period in response variable 1 before generating NAs:\n')
print(tapply(y[,4], treat, mean, na.rm=TRUE))

y[runif(n) < pna(treat, x), 1] <- NA
y[is.na(y[,1]) | runif(n) < pna(treat, x, y[,1]),   2] <- NA
y[is.na(y[,2]) | runif(n) < pna(treat, x, y[,1:2]), 3] <- NA
y[is.na(y[,3]) | runif(n) < pna(treat, x, y[,1:3]), 4] <- NA

last <- rep(4, n)
last[is.na(y[,4])] <- 3
last[is.na(y[,3])] <- 2
last[is.na(y[,2])] <- 1
last[is.na(y[,1])] <- 0

cat('\nNumber of NAs for each time period:\n')
print(apply(y, 2, function(x)sum(is.na(x))))
cat('\n\nTreatment-specific means for last period in response variable 1 after excluding NAs:\n')
print(tapply(y[,4], treat, mean, na.rm=TRUE))
cat('\n\nNaive complete-case analysis:\n\n')
prn(ols(y[,4] ~ pol(x,2) + treat))

if(second) {
  y2 <- matrix(rnorm(n*4),ncol=4)
  y2[treat=='B',] <- y2[treat=='B',] + y2.treat.effect
  cat('\n\nTreatment-specific means for last period in response variable 2 before generating NAs:\n')
  print(tapply(y2[,4], treat, mean, na.rm=TRUE))

  y2[is.na(y[,1]),1] <- NA
  y2[is.na(y[,2]),2] <- NA
  y2[is.na(y[,3]),3] <- NA
  y2[is.na(y[,4]),4] <- NA
  cat('\n\nTreatment-specific means for last period in response variable 2 after excluding NAs:\n')
  print(tapply(y2[,4], treat, mean, na.rm=TRUE))

  y <- pbind(y1=y, y2=y2)
}

list(x=x, treat=treat, y=y, last=last)
}

pna <- function(treat, x, yprev) {
# In this model for the probability of dropout just before the
# current visit, the probability does not depend on the baseline
# covariable x.  For treat='B' the probability of dropout is a
# constant 0.1.  For treat='A' it is a curtailed quadratic
# function of the previous visit's response.
#
# If no previous responses available, we are at first follow-up visit

if(missing(yprev)) 0 else {
  if(is.matrix(yprev)) yprev <- yprev[,ncol(yprev)]
  ifelse(treat=='B', .1,
         pmax(0, pmin(1, .124 +.0835*yprev + .020868*yprev^2)))
  }
}

df <- testdf(pna = pna, second=TRUE)

g <- rm.impute(~ pol(x,2) + treat, df$y, last=df$last, 
               rformula=~ pol(x,2) + treat,
               n.impute=10, g=4, nk=3, 
               rinteraction='treat', rint.with='all',
               pr=TRUE, pra=TRUE, data=df, keep.prop=TRUE, keep.pfits=TRUE)
# Base propensity model is in.study ~ pol(x,2) + treat
# for visits 2,3,4, filled-in y's from previous visits will also be
# used as predictors, and these interact with treat.  
# Restricted cubic spline with 3 knots is assumed for the propensity models
# To fit the multiply-imputed last (4th) response an additive model
# in quadratic x and treat is used

g$fit[[1]]       # shows response fit for first response variable
                 # (y1), with variances adj. for imputation
page(g$Y)        # show all 10 imputations for both responses x 4 periods

# Check for the first imputation how well propensity matching achieved 
# balance in baseline and period 3 filled-in responses for
# dropouts and non-dropouts.  For continuous variables show ECDFs
# using the Hmisc ecdf function, for first 4 imputations.  Do this
# with and without stratifying on quintiles of propensity, and also
# show the estimated 3rd period response  vs. propensity stratified 
# by dropout status.  Use only first response (y1) for all of this.

for(imp in 1:4) {
  y3     <- g$Y[,3,1,imp]
  prop3  <- g$propensity[,3,imp]
  prop3g <- cut2(prop3,g=5)
  ti <- paste('Imputation',imp)
  print(ecdf(~ y3, groups=df$last >= 3, subset=unclass(prop3g)<5))
  title(ti)
  print(ecdf(~ y3 | prop3g, groups=df$last >= 3, 
             subset=unclass(prop3g)<5))
  # Not enough dropouts in highest quintile of propensity completing
  # visit 3
  title(ti)
  plsmo(prop3, y3, group=df$last >= 3, datadensity=TRUE, col=1:2)
  title(ti)
}

# Examine propensity fit for sixth imputation, 4th response
f <- g$pfits[4,6][[1]]
dfr <- as.data.frame(df)
# Edit names of dfr so that responses called y.1, y.2, etc.
# For this example, these are already OK
dd <- datadist(dfr)
options(datadist='dd')   
# datadist makes plot below work without specifying variable settings
plot(f, y.3=NA, treat=NA, conf.int=FALSE)

# Analyze multiple response variables.  Both systolic.bp and
# diastolic.bp are matrices (columns = time periods)

f <- rm.impute(~treatment + pol(age,2)*sex,
               pbind(systolic.bp, diastolic.bp),
               last=lastvisit, data=mydata)

# To deal with a continuous and a binary endpoint you can specify
# pbind(sysbolic.bp, stroke), fitter=list(ols, lrm)
## End(Not run)

[Package Design version 2.0-12 Index]