R: Overview of Design Library

Overview {Design}

R Documentation

Overview of Design Library

Description

Design does regression modeling, testing, estimation, validation, graphics, prediction, and typesetting by storing enhanced model design attributes in the fit.

Design is a collection of about 180 functions that assist and streamline modeling, especially for biostatistical and epidemiologic applications. It also contains new functions for binary and ordinal logistic regression models and the Buckley-James multiple regression model for right-censored responses, and implements penalized maximum likelihood estimation for logistic and ordinary linear models. Design works with almost any regression model, but it was especially written to work with logistic regression, Cox regression, accelerated failure time models, ordinary linear models, and the Buckley-James model. You should install the Hmisc library before using Design, as a few of Design's options use Hmisc functions, and Hmisc has several functions useful for data analysis (especially data reduction and imputation).

Details

To make use of automatic typesetting features you must have LaTeX or one of its variants installed.

Some aspects of Design (e.g., latex) will not work correctly if options(contrasts=) other than c("contr.treatment", "contr.poly") are used.

Design relies on a wealth of survival analysis functions written by Terry Therneau of Mayo Clinic. Front-ends have been written for several of Therneau's functions, and other functions have been slightly modified.

Statistical Methods Implemented

Ordinary linear regression models
Binary and ordinal logistic models (proportional odds and continuation ratio models)
Cox model
Parametric survival models in the accelerated failure time class
Buckley-James least-squares linear regression model with possibly right-censored responses
Bootstrap model validation to obtain unbiased estimates of model performance without requiring a separate validation sample
Automatic Wald tests of all effects in the model that are not parameterization-dependent (e.g., tests of nonlinearity of main effects when the variable does not interact with other variables, tests of nonlinearity of interaction effects, tests for whether a predictor is important, either as a main effect or as an effect modifier)
Graphical depictions of model estimates (effect plots, odds/hazard ratio plots, nomograms that allow model predictions to be obtained manually even when there are nonlinear effects and interactions in the model)
Various smoothed residual plots, including some new residual plots for verifying ordinal logistic model assumptions
Composing S functions to evaluate the linear predictor (X*beta hat), hazard function, survival function, quantile functions analytically from the fitted model
Typesetting of fitted model using LaTeX
Robust covariance matrix estimation (Huber or bootstrap)
Cubic regression splines with linear tail restrictions (natural splines)
Tensor splines
Interactions restricted to not be doubly nonlinear
Penalized maximum likelihood estimation for ordinary linear regression and logistic regression models. Different parts of the model may be penalized by different amounts, e.g., you may want to penalize interaction or nonlinear effects more than main effects or linear effects
Estimation of hazard or odds ratios in presence of nonlinearity and interaction
Sensitivity analysis for an unmeasured binary confounder in a binary logistic model
Multiple imputation of repeated measures data with non- random dropout using propensity score matching (experimental, not yet functional)

Motivation

Design was motivated by the following needs:

need to automatically print interesting Wald tests that can be constructed from the design
- tests of linearity with respect to each predictor
- tests of linearity of interactions
- pooled interaction tests (e.g., all interactions involving race)
- pooled tests of effects with higher order effects
  - test of main effect not meaningful when effect in interaction
  - pooled test of main effect + interaction effect is meaningful
  - test of 2nd-order interaction + any 3rd-order interaction containing those factors is meaningful
need to store transformation parameters with the fit
- example: knot locations for spline functions
- these are "remembered" when getting predictions, unlike standard S or R
- for categorical predictors, save levels so that same dummy variables will be generated for predictions; check that all levels in out-of-data predictions were present when model was fitted
need for uniform re-insertion of observations deleted because of NAs when using predict without newdata or when using resid
need to easily plot the regression effect of any predictor
- example: age is represented by a linear spline with knots at 40 and 60y plot effect of age on log odds of disease, adjusting interacting factors to easily specified constants
- vary 2 predictors: plot x1 on x-axis, separate curves for discrete x2 or 3d perspective plot for continuous x2
- if predictor is represented as a function in the model, plots should be with respect to the original variable:
  f <- lrm(y ~ log(cholesterol)+age)
  plot(f, cholesterol=NA) # cholesterol on x-axis, default range
need to store summary of distribution of predictors with the fit
- plotting limits (default: 10th smallest, 10th largest values or %-tiles)
- effect limits (default: .25 and .75 quantiles for continuous vars.)
- adjustment values for other predictors (default: median for continuous predictors, most frequent level for categorical ones)
- discrete numeric predictors: list of possible values example: x=0,1,2,3,5 -> by default don't plot prediction at x=4
- values are on the inner-most variable, e.g. cholesterol, not log(chol.)
- allows estimation/plotting long after original dataset has been deleted
- for Cox models, underlying survival also stored with fit, so original data not needed to obtain predicted survival curves
need to automatically print estimates of effects in presence of non- linearity and interaction
- example: age is quadratic, interacting with sex default effect is inter-quartile-range hazard ratio (for Cox model), for sex=reference level
- user-controlled effects: summary(fit, age=c(30,50), sex="female") -> odds ratios for logistic model, relative survival time for accelerated failure time survival models
- effects for all variables (e.g. odds ratios) may be plotted with multiple-confidence-level bars
need for prettier and more concise effect names in printouts, especially for expanded nonlinear terms and interaction terms
- use inner-most variable name to identify predictors
- e.g. for pmin(x^2-3,10) refer to factor with legal S-name x
need to recognize that an intercept is not always a simple concept
- some models (e.g., Cox) have no intercept
- some models (e.g., ordinal logistic) have multiple intercepts
need for automatic high-quality printing of fitted mathematical model (with dummy variables defined, regression spline terms simplified, interactions "factored"). Focus is on regression splines instead of nonparametric smoothers or smoothing splines, so that explicit formulas for fit may be obtained for use outside S. Design can also compose S functions to evaluate X*Beta from the fitted model analytically, as well as compose SAS code to do this.
need for automatic drawing of nomogram to represent the fitted model
need for automatic bootstrap validation of a fitted model, with only one S command (with respect to calibration and discrimination)
need for robust (Huber sandwich) estimator of covariance matrix, and be able to do all other analysis (e.g., plots, C.L.) using the adjusted covariances
need for robust (bootstrap) estimator of covariance matrix, easily used in other analyses without change
need for Huber sandwich and bootstrap covariance matrices adjusted for cluster sampling
need for routine reporting of how many observations were deleted by missing values on each predictor (see na.delete in Hmisc)
need for optional reporting of descriptive statistics for Y stratified by missing status of each X (see na.detail.response)
need for pretty, annotated survival curves, using the same commands for parametric and Cox models
need for ordinal logistic model (proportional odds model, continuation ratio model)

Fitting Functions Compatible with Design

Design will work with a wide variety of fitting functions, but it is meant especially for the following:

Function Purpose Related S

Functions

ols Ordinary least squares linear model lm

lrm Binary and ordinal logistic regression glm

model cr.setup

psm Accelerated failure time parametric survreg

survival model

cph Cox proportional hazards regression coxph

bj Buckley-James censored least squares survreg

linear model

glmD Version of glm for use with Design

glsD Version of gls for use with Design

Methods in Design

The following generic functions work with fits with Design in effect:

Function Purpose Related

Functions

print Print parameters and statistics of fit

coef Fitted regression coefficients

formula Formula used in the fit

specs Detailed specifications of fit

robcov Robust covariance matrix estimates

bootcov Bootstrap covariance matrix estimates

summary Summary of effects of predictors

plot.summary Plot continuously shaded confidence

bars for results of summary

anova Wald tests of most meaningful hypotheses

contrast General contrasts, C.L., tests

plot.anova Depict results of anova graphically dotchart

plot Plot effects of predictors

gendata Generate data frame with predictor expand.grid

combinations (optionally interactively)

predict Obtain predicted values or design matrix

fastbw Fast backward step-down variable step

selection

residuals Residuals, influence statistics from fit

(or resid)

which.influence Which observations are overly residuals

influential

sensuc Sensitivity of one binary predictor in

lrm and cph models to an unmeasured

binary confounder

latex LaTeX representation of fitted

model or anova or summary table

Function S function analytic representation Function.transcan

of a fitted regression model (X*Beta)

hazard S function analytic representation rcspline.restate

of a fitted hazard function (for psm)

Survival S function analytic representation of

fitted survival function (for psm,cph)

Quantile S function analytic representation of

fitted function for quantiles of

survival time (for psm, cph)

nomogram Draws a nomogram for the fitted model latex, plot

survest Estimate survival probabilities survfit

(for psm, cph)

survplot Plot survival curves (psm, cph) plot.survfit

validate Validate indexes of model fit using val.prob

resampling

calibrate Estimate calibration curve for model

using resampling

vif Variance inflation factors for a fit

naresid Bring elements corresponding to missing

data back into predictions and residuals

naprint Print summary of missing values

pentrace Find optimum penality for penalized MLE

effective.df Print effective d.f. for each type of

variable in model, for penalized fit or

pentrace result

rm.impute Impute repeated measures data with transcan,

non-random dropout fit.mult.impute

experimental, non-functional

Background for Examples

The following programs demonstrate how the pieces of the Design package work together. A (usually) one-time call to the function datadist requires a pass at the entire data frame to store distribution summaries for potential predictor variables. These summaries contain (by default) the .25 and .75 quantiles of continuous variables (for estimating effects such as odds ratios), the 10th smallest and 10th largest values (or .1 and .9 quantiles for small n) for plotting ranges for estimated curves, and the total range. For discrete numeric variables (those having <=10 unique values), the list of unique values is also stored. Such summaries are used by the summary.Design, plot.Design, and nomogram.Design functions. You may save time and defer running datadist. In that case, the distribution summary is not stored with the fit object, but it can be gathered before running summary or plot.

d <- datadist(my.data.frame) # or datadist(x1,x2)
options(datadist="d") # omit this or use options(datadist=NULL)
# if not run datadist yet
cf <- ols(y ~ x1 * x2)
anova(f)
fastbw(f)
predict(f, newdata)

In the Examples section there are three detailed examples using a fitting function designed to be used with Design, lrm (logistic regression model). In Detailed Example 1 we create 3 predictor variables and a two binary response on 500 subjects. For the first binary response, dz, the true model involves only sex and age, and there is a nonlinear interaction between the two because the log odds is a truncated linear relationship in age for females and a quadratic function for males. For the second binary outcome, dz.bp, the true population model also involves systolic blood pressure (sys.bp) through a truncated linear relationship. First, nonparametric estimation of relationships is done using the Hmisc library's plsmo function which uses lowess with outlier detection turned off for binary responses. Then parametric modeling is done using restricted cubic splines. This modeling does not assume that we know the true transformations for age or sys.bp but that these transformations are smooth (which is not actually the case in the population).

For Detailed Example 2, suppose that a categorical variable treat has values "a", "b", and "c", an ordinal variable num.diseases has values 0,1,2,3,4, and that there are two continuous variables, age and cholesterol. age is fitted with a restricted cubic spline, while cholesterol is transformed using the transformation log(cholesterol - 10). Cholesterol is missing on three subjects, and we impute these using the overall median cholesterol. We wish to allow for interaction between treat and cholesterol. The following S program will fit a logistic model, test all effects in the design, estimate effects, and plot estimated transformations. The fit for num.diseases really considers the variable to be a 5-level categorical variable. The only difference is that a 3 d.f. test of linearity is done to assess whether the variable can be re-modeled "asis". Here we also show statements to attach the Design library and store predictor characteristics from datadist.

Detailed Example 3 shows some of the survival analysis capabilities of Design related to the Cox proportional hazards model. We simulate data for 2000 subjects with 2 predictors, age and sex. In the true population model, the log hazard function is linear in age and there is no age x sex interaction. In the analysis below we do not make use of the linearity in age. Design makes use of many of Terry Therneau's survival functions that are builtin to S.

The following is a typical sequence of steps that would be used with Design in conjunction with the Hmisc transcan function to do single imputation of all NAs in the predictors (multiple imputation would be better but would be harder to do in the context of bootstrap model validation), fit a model, do backward stepdown to reduce the number of predictors in the model (with all the severe problems this can entail), and use the bootstrap to validate this stepwise model, repeating the variable selection for each re-sample. Here we take a short cut as the imputation is not repeated within the bootstrap.

In what follows we (atypically) have only 3 candidate predictors. In practice be sure to have the validate and calibrate functions operate on a model fit that contains all predictors that were involved in previous analyses that used the response variable. Here the imputation is necessary because backward stepdown would otherwise delete observations missing on any candidate variable.

Note that you would have to define x1, x2, x3, y to run the following code.

xt <- transcan(~ x1 + x2 + x3, imputed=TRUE)
impute(xt) # imputes any NAs in x1, x2, x3
# Now fit original full model on filled-in data
f <- lrm(y ~ x1 + rcs(x2,4) + x3, x=TRUE, y=TRUE) #x,y allow boot.
fastbw(f)
# derives stepdown model (using default stopping rule)
validate(f, B=100, bw=TRUE) # repeats fastbw 100 times
cal <- calibrate(f, B=100, bw=TRUE) # also repeats fastbw
plot(cal)

Common Problems to Avoid

Don't have a formula like y ~ age + age^2. In S you need to connect related variables using a function which produces a matrix, such as pol or rcs. This allows effect estimates (e.g., hazard ratios) to be computed as well as multiple d.f. tests of association.
Don't use poly or strata inside formulas used in Design. Use pol and strat instead.
Almost never code your own dummy variables or interaction variables in S. Let S do this automatically. Otherwise, anova can't do its job.
Almost never transform predictors outside of the model formula, as then plots of predicted values vs. predictor values, and other displays, would not be made on the original scale. Use instead something like y ~ log(cell.count+1), which will allow cell.count to appear on x-axes. You can get fancier, e.g., y ~ rcs(log(cell.count+1),4) to fit a restricted cubic spline with 4 knots in log(cell.count+1). For more complex transformations do something like
f <- function(x) {
... various 'if' statements, etc.
log(pmin(x,50000)+1)
}
fit1 <- lrm(death ~ f(cell.count))
fit2 <- lrm(death ~ rcs(f(cell.count),4))
}
Don't put $ inside variable names used in formulas. Either attach data frames or use data=.
Don't forget to use datadist. Try to use it at the top of your program so that all model fits can automatically take advantage if its distributional summaries for the predictors.
Don't validate or calibrate models which were reduced by dropping "insignificant" predictors. Proper bootstrap or cross-validation must repeat any variable selection steps for each re-sample. Therefore, validate or calibrate models which contain all candidate predictors, and if you must reduce models, specify the option bw=TRUE to validate or calibrate.
Dropping of "insignificant" predictors ruins much of the usual statistical inference for regression models (confidence limits, standard errors, P-values, chi-squares, ordinary indexes of model performance) and it also results in models which will have worse predictive discrimination.

Accessing the Library

If you are using any of Design's survival analysis functions, create a file called .Rprofile in your working directory that contains the line library(survival). That way, survival will move down the search list as Hmisc and Design are attached during your session. This will allow Hmisc and Design to override some of the survival function such as survfit.

Since the Design library has a .First.lib function, that function will be executed by the library command, to dynamically load the .o or .obj files. You may want to create a .First function such as

.First <- {
options(na.action = "na.delete")
# gives more info than na.omit
library(Hmisc)
library(Design)
invisible()
}

Published Applications of Design and Regression Splines

Spline fits
1. Spanos A, Harrell FE, Durack DT (1989): Differential diagnosis of acute meningitis: An analysis of the predictive value of initial observations. JAMA 2700-2707.
2. Ohman EM, Armstrong PW, Christenson RH, et al. (1996): Cardiac troponin T levels for risk stratification in acute myocardial ischemia. New Eng J Med 335:1333-1341.
Bootstrap calibration curve for a parametric survival model:
1. Knaus WA, Harrell FE, Fisher CJ, Wagner DP, et al. (1993): The clinical evaluation of new drugs for sepsis: A prospective study design based on survival analysis. JAMA 270:1233-1241.
Splines, interactions with splines, algebraic form of fitted model from latex.Design
1. Knaus WA, Harrell FE, Lynn J, et al. (1995): The SUPPORT prognostic model: Objective estimates of survival for seriously ill hospitalized adults. Annals of Internal Medicine 122:191-203.
Splines, odds ratio chart from fitted model with nonlinear and interaction terms, use of transcan for imputation
1. Lee KL, Woodlief LH, Topol EJ, Weaver WD, Betriu A. Col J, Simoons M, Aylward P, Van de Werf F, Califf RM. Predictors of 30-day mortality in the era of reperfusion for acute myocardial infarction: results from an international trial of 41,021 patients. Circulation 1995;91:1659-1668.
Splines, external validation of logistic models, prediction rules using point tables
1. Steyerberg EW, Hargrove YV, et al (2001): Residual mass histology in testicular cancer: development and validation of a clinical prediction rule. Stat in Med 2001;20:3847-3859.
2. van Gorp MJ, Steyerberg EW, et al (2003): Clinical prediction rule for 30-day mortality in Bjork-Shiley convexo-concave valve replacement. J Clinical Epidemiology 2003;56:1006-1012.
Model fitting, bootstrap validation, missing value imputation
1. Krijnen P, van Jaarsveld BC, Steyerberg EW, Man in 't Veld AJ, Schalekamp, MADH, Habbema JDF (1998): A clinical prediction rule for renal artery stenosis. Annals of Internal Medicine 129:705-711.
Model fitting, splines, bootstrap validation, nomograms
1. Kattan MW, Eastham JA, Stapleton AMF, Wheeler TM, Scardino PT. A preoperative nomogram for disease recurrence following radical prostatectomy for prostate cancer. J Natl Ca Inst 1998; 90(10):766-771.
2. Kattan, MW, Wheeler TM, Scardino PT. A postoperative nomogram for disease recurrence following radical prostatectomy for prostate cancer. J Clin Oncol 1999; 17(5):1499-1507
3. Kattan MW, Zelefsky MJ, Kupelian PA, Scardino PT, Fuks Z, Leibel SA. A pretreatment nomogram for predicting the outcome of three-dimensional conformal radiotherapy in prostate cancer. J Clin Oncol 2000; 18(19):3252-3259.
4. Eastham JA, May R, Robertson JL, Sartor O, Kattan MW. Development of a nomogram which predicts the probability of a positive prostate biopsy in men with an abnormal digital rectal examination and a prostate specific antigen between 0 and 4 ng/ml. Urology. (In press).
5. Kattan MW, Heller G, Brennan MF. A competing-risk nomogram fir sarcoma-specific death following local recurrence. Stat in Med 2003; 22; 3515-3525.
Nomogram with 2- and 5-year survival probability and median survival time (but watch out for the use of univariable screening)
1. Clark TG, Stewart ME, Altman DG, Smyth JF. A prognostic model for ovarian cancer. Br J Cancer 2001; 85:944-52.
Comprehensive example of parametric survival modeling with an extensive nomogram, time ratio chart, anova chart, survival curves generated using survplot, bootstrap calibration curve
1. Teno JM, Harrell FE, Knaus WA, et al. Prediction of survival for older hospitalized patients: The HELP survival model. J Am Geriatrics Soc 2000; 48: S16-S24.
Model fitting, imputation, and several nomograms expressed in tabular form
1. Hasdai D, Holmes DR, et al. Cardiogenic shock complicating acute myocardial infarction: Predictors of death. Am Heart J 1999; 138:21-31.
Ordinal logistic model with bootstrap calibration plot
1. Wu AW, Yasui U, Alzola CF et al. Predicting functional status outcomes in hospitalized patients aged 80 years and older. J Am Geriatric Society 2000; 48:S6-S15.
Propensity modeling in evaluating medical diagnosis, anova dot chart
1. Weiss JP, Gruver C, et al. Ordering an echocardiogram for evaluation of left ventricular function: Level of expertise necessary for efficient use. J Am Soc Echocardiography 2000; 13:124-130.
Simulations using Design to study the properties of various modeling strategies
1. Steyerberg EW, Eijkemans MJC, Habbema JDF. Stepwise selection in small data sets: A simulation study of bias in logistic regression analysis. J Clin Epi 1999; 52:935-942.
2. Steyerberg WE, Eijekans MJC, Harrell FE, Habbema JDF. Prognostic modeling with logistic regression analysis: In search of a sensible strategy in small data sets. Med Decision Making 2001; 21:45-56.
Statistical methods and references related to Design, along with case studies which includes the Design code which produced the analyses
1. Harrell FE, Lee KL, Mark DB (1996): Multivariable prognostic models: Issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors. Stat in Med 15:361-387.
2. Harrell FE, Margolis PA, Gove S, Mason KE, Mulholland EK et al. (1998): Development of a clinical prediction model for an ordinal outcome: The World Health Organization ARI Multicentre Study of clinical signs and etiologic agents of pneumonia, sepsis, and meningitis in young infants. Stat in Med 17:909-944.
3. Bender R, Benner, A (2000): Calculating ordinal regression models in SAS and S-Plus. Biometrical J 42:677-699.

Bug Reports

The author is willing to help with problems. Send E-mail to f.harrell@vanderbilt.edu. To report bugs, please do the following:

If the bug occurs when running a function on a fit object (e.g., anova), attach a dump'd text version of the fit object to your note. If you used datadist but not until after the fit was created, also send the object created by datadist. Example: dump("myfit","/tmp/dumpdata") will create a text file called "dumpdata" that can be attached to the E-mail.
If the bug occurs during a model fit (e.g., with lrm, ols, psm, cph), send the statement causing the error with a dump'd version of the data frame used in the fit. If this data frame is very large, reduce it to a small subset which still causes the error.

Copyright Notice

GENERAL DISCLAIMER This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. In short: you may use this code any way you like, as long as you don't charge money for it, remove this notice, or hold anyone liable for its results. Also, please acknowledge the source and communicate changes to the author.

If this software is used is work presented for publication, kindly reference it using for example: Harrell FE (2003): Design: S functions for biostatistical/epidemiologic modeling, testing, estimation, validation, graphics, and prediction. Programs available from biostat.mc.vanderbilt.edu/s/Design.html. Be sure to reference other libraries used as well as S-Plus or R itself.

Acknowledgements

This work was supported by grants from the Agency for Health Care Policy and Research (US Public Health Service) and the Robert Wood Johnson Foundation.

Author(s)

Frank E Harrell Jr
Professor of Biostatistics
Chair, Department of Biostatistics
Vanderbilt University School of Medicine
Nashville, Tennessee
f.harrell@vanderbilt.edu

References

The primary resource for the Design library is Regression Modeling Strategies by FE Harrell (Springer-Verlag, 2001) and the web pages http://biostat.mc.vanderbilt.edu/rms and http://biostat.mc.vanderbilt.edu/s/Design.html. See also the Statistics in Medicine articles by Harrell et al listed below for case studies of modeling and model validation using Design. Also see the free book by Alzola and Harrell at http://biostat.mc.vanderbilt.edu.

Several datasets useful for multivariable modeling with Design are found at http://biostat.mc.vanderbilt.edu/s/data.

Examples

######################
# Detailed Example 1 #
######################
# May want to first invoke the Hmisc store function
# so that new variables will go into a temporary directory
set.seed(17)  # So can repeat random number sequence
n <- 500

sex    <- factor(sample(c('female','male'), n, rep=TRUE))
age    <- rnorm(n, 50, 10)
sys.bp <- rnorm(n, 120, 7)

# Use two population models, one with a systolic
# blood pressure effect and one without

L    <- ifelse(sex=='female', .1*(pmin(age,50)-50), .005*(age-50)^2)
L.bp <- L + .4*(pmax(sys.bp,120)-120)

dz    <- ifelse(runif(n) <= plogis(L),    1, 0)
dz.bp <- ifelse(runif(n) <= plogis(L.bp), 1, 0)

# Use summary.formula in the Hmisc library to summarize the
# data one predictor at a time

s <- summary(dz.bp ~ age + sex + sys.bp) 
options(digits=3)
print(s)
plot(s)

plsmo(age, dz, group=sex, fun=qlogis, ylim=c(-3,3))
plsmo(age, L,  group=sex, method='raw', add=TRUE, prefix='True', trim=0)
title('Lowess-smoothed Estimates with True Regression Functions')

dd <- datadist(age, sex, sys.bp)
options(datadist='dd')
# can also do: dd <- datadist(dd, newvar)

f <- lrm(dz ~ rcs(age,5)*sex, x=TRUE, y=TRUE)
f
# x=TRUE, y=TRUE for pentrace

fpred <- Function(f)
fpred
fpred(age=30, sex=levels(sex))

anova(f)

p <- plot(f, age=NA, sex=NA, conf.int=FALSE, ylim=c(-3,3))
datadensity(p, age, sex)
scat1d(age)

plsmo(age, L, group=sex, method='raw', add=TRUE, prefix='True', trim=0)
title('Spline Fits with True Regression Functions')

f.bp <- lrm(dz.bp ~ rcs(age,5)*sex + rcs(sys.bp,5))

for(method in c('persp','image')) 
  p <- plot(f.bp, age=NA, sys.bp=NA, method=method)
# Legend(p)   # NOTE: Needs subplot - not in R

cat('Doing 25 bootstrap repetitions to validate model\n')
validate(f, B=25)   # in practice try to use 150

cat('Doing 25 bootstrap reps to check model calibration\n')
cal <- calibrate(f, B=25)   # use 150 in practice
plot(cal)
title('Calibration of Unpenalized Model')

p <- if(.R.) pentrace(f, penalty=c(.009,.009903,.02,.2,.5,1)) else
             pentrace(f, penalty=1, method='optimize')

f <- update(f, penalty=p$penalty)
f
specs(f,long=TRUE)
edf <- effective.df(f)

p <- plot(f, age=NA, sex=NA, conf.int=FALSE, ylim=c(-3,3))
datadensity(p, age, sex)
scat1d(age)

plsmo(age, L, group=sex, method='raw', add=TRUE, prefix='True', trim=0)
title('Penalized Spline Fits with True Regression Functions')

options(digits=3)
s <- summary(f)
s
plot(s)

s <- summary(f, sex='male')
plot(s)

fpred <- Function(f)
fpred
fpred(age=30, sex=levels(sex))
sascode(fpred)

cat('Doing 40 bootstrap reps to validate penalized model\n')
validate(f, B=40)

cat('Doing 40 bootstrap reps to check penalized model calibration\n')
cal <- calibrate(f, B=40)
plot(cal)
title('Calibration of Penalized Model')

nomogram(f.bp, fun=plogis,
         funlabel='Prob(dz)',
         fun.at=c(.15,.2,.3,.4,.5,.6,.7,.8,.9,.95,.975),
         fun.side=c(1,3,1,3,1,3,1,3,1,3,1))
options(datadist=NULL)

#####################
#Detailed Example 2 #
#####################
# Simulate the data.  
n <- 1000    # define sample size
set.seed(17) # so can reproduce the results
treat <- factor(sample(c('a','b','c'), n, TRUE))
num.diseases <- sample(0:4, n, TRUE)
age <- rnorm(n, 50, 10)
cholesterol <- rnorm(n, 200, 25)
weight <- rnorm(n, 150, 20)
sex <- factor(sample(c('female','male'), n, TRUE))
label(age) <- 'Age'      # label is in Hmisc
label(num.diseases) <- 'Number of Comorbid Diseases'
label(cholesterol) <- 'Total Cholesterol'
label(weight) <- 'Weight, lbs.'
label(sex) <- 'Sex'
units(cholesterol) <- 'mg/dl'   # uses units.default in Hmisc

# Specify population model for log odds that Y=1
L <- .1*(num.diseases-2) + .045*(age-50) +
  (log(cholesterol - 10)-5.2)*(-2*(treat=='a') +
  3.5*(treat=='b')+2*(treat=='c'))
# Simulate binary y to have Prob(y=1) = 1/[1+exp(-L)]
y <- ifelse(runif(n) < plogis(L), 1, 0)
cholesterol[1:3] <- NA   # 3 missings, at random

ddist <- datadist(cholesterol, treat, num.diseases,
                  age, weight, sex)
# Could have used ddist <- datadist(data.frame.name)
options(datadist="ddist") # defines data dist. to Design
cholesterol <- impute(cholesterol) # see impute in Hmisc library
# impute, describe, and several other basic functions are
# distributed as part of the Hmisc library

fit <- lrm(y ~ treat*log(cholesterol - 10) +
           scored(num.diseases) +  rcs(age))

describe(y ~ treat + scored(num.diseases) + rcs(age))
# or use describe(formula(fit)) for all variables used in fit
# describe function (in Hmisc) gets simple statistics on variables
#fit <- robcov(fit) # Would make all statistics which follow
                    # use a robust covariance matrix
                    # would need x=TRUE, y=TRUE in lrm
specs(fit) # Describe the design characteristics
a <- anova(fit)
print(a, which='subscripts')          # print which parameters being tested
plot(anova(fit)) # Depict Wald statistics graphically
anova(fit, treat, cholesterol) # Test these 2 by themselves
summary(fit) # Estimate effects using default ranges
plot(summary(fit)) # Graphical display of effects with C.L.
summary(fit, treat="b", age=60) 
# Specify reference cell and adjustment val

summary(fit, age=c(50,70)) # Estimate effect of increasing age from
                           # 50 to 70
summary(fit, age=c(50,60,70)) # Increase age from 50 to 70, 
                              # adjust to 60 when estimating 
                              # effects of other factors
# If had not defined datadist, would have to define
# ranges for all var.

# Estimate and test treatment (b-a) effect averaged
# over 3 cholesterols
contrast(fit, list(treat='b',cholesterol=c(150,200,250)),
              list(treat='a',cholesterol=c(150,200,250)),
         type='average')
# Remove type='average' to get 3 separate contrasts for b-a

# Plot effects.  plot(fit) plots effects of all predictors,
# showing values used for interacting factors as subtitles
# The ref.zero parameter is helpful for showing effects of
# predictors on a common scale for comparison of strength
plot(fit, ref.zero=TRUE, ylim=c(-2,2))

plot(fit, age=seq(20,80,length=100), treat=NA, conf.int=FALSE)
# Plots relationship between age and log
# odds, separate curve for each treat, no C.I.
plot(fit, age=NA, cholesterol=NA)
# 3-dimensional perspective plot for age, cholesterol, and
# log odds using default ranges for both variables
plot(fit, num.diseases=NA, fun=function(x) 1/(1+exp(-x)),  #or fun=plogis
     ylab="Prob", conf.int=.9)   
# Plot estimated probabilities instead of log odds
# Again, if no datadist were defined, would have to
# tell plot all limits
logit <- predict(fit, expand.grid(treat="b",num.diseases=1:3,
                 age=c(20,40,60),
                 cholesterol=seq(100,300,length=10)))
#logit <- predict(fit, gendata(fit, nobs=12))
# Interactively specify 12 predictor combinations using UNIX
# For UNIX or Windows, generate 9 combinations with other variables
# set to defaults, get predicted values
logit <- predict(fit, gendata(fit, age=c(20,40,60),
                 treat=c('a','b','c')))

# Since age doesn't interact with anything, we can quickly and
# interactively try various transformations of age,
# taking the spline function of age as the gold standard. We are
# seeking a linearizing transformation.  Here age is linear in the
# population so this is not very productive.  Also, if we simplify the
# model the total degrees of freedom will be too small and
# confidence limits too narrow

ag <- 10:80
logit <- predict(fit, expand.grid(treat="a",
                 num.diseases=0, age=ag,
                 cholesterol=median(cholesterol)),
                 type="terms")[,"age"]
# Note: if age interacted with anything, this would be the age
#               "main effect" ignoring interaction terms
# Could also use
#   logit <- plot(f, age=ag, ...)$x.xbeta[,2]
# which allows evaluation of the shape for any level
# of interacting factors.  When age does not interact with
# anything, the result from
# predict(f, ..., type="terms") would equal the result from
# plot if all other terms were ignored
# Could also use
#   logit <- predict(fit, gendata(fit, age=ag, cholesterol=median...))

plot(ag^.5, logit)  # try square root vs. spline transform.
plot(ag^1.5, logit) # try 1.5 power

# w <- latex(fit)  # invokes latex.lrm, creates fit.tex
# print(w)         # display or print model on screen

# Draw a nomogram for the model fit
nomogram(fit, fun=plogis, funlabel="Prob[Y=1]")

# Compose S function to evaluate linear predictors from fit
g <- Function(fit)
g(treat='b', cholesterol=260, age=50)
# Leave num.diseases at reference value

# Use the Hmisc dataRep function to summarize sample
# sizes for subjects as cross-classified on 2 key
# predictors
drep <- dataRep(~ roundN(age,10) + num.diseases)
print(drep, long=TRUE)

# Some approaches to making a plot showing how
# predicted values vary with a continuous predictor
# on the x-axis, with two other predictors varying

fit <- lrm(y ~ log(cholesterol - 10) + 
           num.diseases + rcs(age) + rcs(weight) + sex)

combos <- gendata(fit, age=10:100,
                  cholesterol=c(170,200,230),
                  weight=c(150,200,250))
# num.diseases, sex not specified -> set to mode
# can also used expand.grid

combos$pred <- predict(fit, combos)
library(lattice)
xyplot(pred ~ age | cholesterol*weight, data=combos)
xYplot(pred ~ age | cholesterol, groups=weight,
       data=combos, type='l') # in Hmisc
xYplot(pred ~ age, groups=interaction(cholesterol,weight),
       data=combos, type='l')

# Can also do this with plot.Design but a single
# plot may be busy:
ch <- c(170, 200, 230)
plot(fit, age=NA, cholesterol=ch, weight=150,
     conf.int=FALSE)
plot(fit, age=NA, cholesterol=ch, weight=200,
     conf.int=FALSE, add=TRUE)
plot(fit, age=NA, cholesterol=ch, weight=250,
     conf.int=FALSE, add=TRUE)

#Here we use plot.Design to make 9 separate plots, with CLs
d <- expand.grid(cholesterol=c(170,200,230),
                 weight=c(150,200,250))
for(i in 1:nrow(d)) {
  plot(fit, age=NA, cholesterol=d$cholesterol[i],
       weight=d$weight[i])
  title(paste('Chol=',format(d$cholesterol[i]),' ',
              'Wt=',format(d$weight[i]),sep=''))
}
options(datadist=NULL)

######################
# Detailed Example 3 #
######################
n <- 2000
set.seed(731)
age <- 50 + 12*rnorm(n)
label(age) <- "Age"
sex <- factor(sample(c('Male','Female'), n, 
              rep=TRUE, prob=c(.6, .4)))
cens <- 15*runif(n)
h <- .02*exp(.04*(age-50)+.8*(sex=='Female'))
t <- -log(runif(n))/h
label(t) <- 'Follow-up Time'
e <- ifelse(t<=cens,1,0)
t <- pmin(t, cens)
units(t) <- "Year"
age.dec <- cut2(age, g=10, levels.mean=TRUE)
dd <- datadist(age, sex, age.dec)
options(datadist='dd')
Srv <- Surv(t,e)

# Fit a model that doesn't assume anything except
# that deciles are adequate representations of age
f <- cph(Srv ~ strat(age.dec)+strat(sex), surv=TRUE)
# surv=TRUE speeds up computations, and confidence limits when
# there are no covariables are still accurate.

# Plot log(-log 3-year survival probability) vs. mean age
# within age deciles and vs. sex
plot(f, age.dec=NA, sex=NA, time=3, 
     loglog=TRUE, val.lev=TRUE, ylim=c(-5,-1))

# Fit a model assuming proportional hazards for age and
# absence of age x sex interaction
f <- cph(Srv ~ rcs(age,4)+strat(sex), surv=TRUE)
survplot(f, sex=NA, n.risk=TRUE)
# Add ,age=60 after sex=NA to tell survplot use age=60
# Validate measures of model performance using the bootstrap
# First must add data (design matrix and Srv) to fit object
f <- update(f, x=TRUE, y=TRUE)
validate(f, B=10, dxy=TRUE, u=5)  # use t=5 for Dxy (only)
# Use B=150 in practice
# Validate model for accuracy of predicting survival at t=1
# Get Kaplan-Meier estimates by divided subjects into groups
# of size 200 (for other values of u must put time.inc=u in
# call to cph)
cal <- calibrate(f, B=10, u=1, m=200)  # B=150 in practice
plot(cal)
# Check proportional hazards assumption for age terms
z <- cox.zph(f, 'identity')
print(z); plot(z)

# Re-fit this model without storing underlying survival
# curves for reference groups, but storing raw data with
# the fit (could also use f <- update(f, surv=FALSE, x=TRUE, y=TRUE))
f <- cph(Srv ~ rcs(age,4)+strat(sex), x=TRUE, y=TRUE) 
# Get accurate C.L. for any age
# Note: for evaluating shape of regression, we would not ordinarily
# bother to get 3-year survival probabilities - would just use X * beta
# We do so here to use same scale as nonparametric estimates
f
anova(f)
ages <- seq(20, 80, by=4)   # Evaluate at fewer points. Default is 100
                            # For exact C.L. formula n=100 -> much memory
plot(f, age=ages, sex=NA, time=3, loglog=TRUE, ylim=c(-5,-1))

# Fit a model assuming proportional hazards for age but
# allowing for general interaction between age and sex
f <- cph(Srv ~ rcs(age,4)*strat(sex), x=TRUE, y=TRUE)
anova(f)
ages <- seq(20, 80, by=6)   
# Still fewer points - more parameters in model

# Plot 3-year survival probability (log-log and untransformed)
# vs. age and sex, obtaining accurate confidence limits
plot(f, age=ages, sex=NA, time=3, loglog=TRUE, ylim=c(-5,-1))
plot(f, age=ages, sex=NA, time=3)
# Having x=TRUE, y=TRUE in fit also allows computation of influence stats
r <- resid(f, "dfbetas")
which.influence(f)
# Use survest to estimate 3-year survival probability and
# confidence limits for selected subjects
survest(f, expand.grid(age=c(20,40,60), sex=c('Female','Male')),
        times=c(2,4,6), conf.int=.95)

# Create an S function srv that computes fitted
# survival probabilities on demand, for non-interaction model
f <- cph(Srv ~ rcs(age,4)+strat(sex), surv=TRUE)
srv <- Survival(f)
# Define functions to compute 3-year estimates as a function of
# the linear predictors (X*Beta)
surv.f <- function(lp) srv(3, lp, stratum="sex=Female")
surv.m <- function(lp) srv(3, lp, stratum="sex=Male")
# Create a function that computes quantiles of survival time
# on demand
quant <- Quantile(f)
# Define functions to compute median survival time
med.f <- function(lp) quant(.5, lp, stratum="sex=Female")
med.m <- function(lp) quant(.5, lp, stratum="sex=Male")
# Draw a nomogram to compute several types of predicted values
nomogram(f, fun=list(surv.m, surv.f, med.m, med.f),
         funlabel=c("S(3 | Male)","S(3 | Female)",
                    "Median (Male)","Median (Female)"),
         fun.at=list(c(.8,.9,.95,.98,.99),c(.1,.3,.5,.7,.8,.9,.95,.98),
                   c(8,12),c(1,2,4,8,12)))
options(datadist=NULL)

########################################################
# Simple examples using small datasets for checking    #
# calculations across different systems in which random#
# number generators cannot be synchronized.            #
########################################################

x1 <- 1:20
x2 <- abs(x1-10)
x3 <- factor(rep(0:2,length.out=20))
y  <- c(rep(0:1,8),1,1,1,1)
dd <- datadist(x1,x2,x3)
options(datadist='dd')
f  <- lrm(y ~ rcs(x1,3) + x2 + x3)
f
specs(f, TRUE)
anova(f)
anova(f, x1, x2)
plot(anova(f))
s <- summary(f)
s
plot(s, log=TRUE)
par(mfrow=c(2,2))
plot(f)
par(mfrow=c(1,1))
nomogram(f)
g <- Function(f)
g(11,7,'1')
contrast(f, list(x1=11,x2=7,x3='1'), list(x1=10,x2=6,x3='2'))
fastbw(f)
gendata(f, x1=1:5)
# w <- latex(f)

f <- update(f, x=TRUE,y=TRUE)
which.influence(f)
residuals(f,'gof')
robcov(f)$var
validate(f, B=10)
cal <- calibrate(f, B=10)
plot(cal)

f <- ols(y ~ rcs(x1,3) + x2 + x3, x=TRUE, y=TRUE)
anova(f)
anova(f, x1, x2)
plot(anova(f))
s <- summary(f)
s
plot(s, log=TRUE)
par(mfrow=c(2,2))
plot(f)
par(mfrow=c(1,1))
nomogram(f)
g <- Function(f)
g(11,7,'1')
contrast(f, list(x1=11,x2=7,x3='1'), list(x1=10,x2=6,x3='2'))
fastbw(f)
gendata(f, x1=1:5)
# w <- latex(f)

f <- update(f, x=TRUE,y=TRUE)
which.influence(f)
residuals(f,'dfbetas')
robcov(f)$var
validate(f, B=10)
cal <- calibrate(f, B=10)
plot(cal)

S <- Surv(c(1,4,2,3,5,8,6,7,20,18,19,9,12,10,11,13,16,14,15,17))
survplot(survfit(S ~ x3))
f <- psm(S ~ rcs(x1,3)+x2+x3, x=TRUE,y=TRUE)
f
# NOTE: LR chi-sq of 39.67 disagrees with that from old survreg
# and old psm (77.65); suspect were also testing sigma=1

for(w in c('survival','hazard'))
 print(survest(f, data.frame(x1=7,x2=3,x3='1'), 
       times=c(5,7), conf.int=.95, what=w))
# S-Plus 2000 using old survival library:
#  S(t):.925 .684 SE:0.729 0.556 Hazard:0.0734 0.255

plot(f, x1=NA, time=5)
f$var
set.seed(3)
# robcov(f)$var when score residuals implemented
bootcov(f, B=30)$var
validate(f, B=10)
cal <- calibrate(f, u=5, B=10, m=10)
plot(cal)
r <- resid(f)
survplot(r)

f <- cph(S ~ rcs(x1,3)+x2+x3, x=TRUE,y=TRUE,surv=TRUE,time.inc=5)
f
plot(f, x1=NA, time=5)
robcov(f)$var
bootcov(f, B=10)
validate(f, B=10)
cal <- calibrate(f, u=5, B=10, m=10)
survplot(f, x1=c(2,19))
options(datadist=NULL)

[Package Design version 2.0-12 Index]

Function	Purpose	Related S
		Functions
`ols`	Ordinary least squares linear model	`lm`
`lrm`	Binary and ordinal logistic regression	`glm`
	model	`cr.setup`
`psm`	Accelerated failure time parametric	`survreg`
	survival model
`cph`	Cox proportional hazards regression	`coxph`
`bj`	Buckley-James censored least squares	`survreg`
	linear model
`glmD`	Version of glm for use with Design
`glsD`	Version of gls for use with Design

Function	Purpose	Related
		Functions
`print`	Print parameters and statistics of fit
`coef`	Fitted regression coefficients
`formula`	Formula used in the fit
`specs`	Detailed specifications of fit
`robcov`	Robust covariance matrix estimates
`bootcov`	Bootstrap covariance matrix estimates
`summary`	Summary of effects of predictors
`plot.summary`	Plot continuously shaded confidence
	bars for results of summary
`anova`	Wald tests of most meaningful hypotheses
`contrast`	General contrasts, C.L., tests
`plot.anova`	Depict results of anova graphically	`dotchart`
`plot`	Plot effects of predictors
`gendata`	Generate data frame with predictor	`expand.grid`
	combinations (optionally interactively)
`predict`	Obtain predicted values or design matrix
`fastbw`	Fast backward step-down variable	`step`
	selection
`residuals`	Residuals, influence statistics from fit
(or `resid`)
`which.influence`	Which observations are overly	`residuals`
	influential
`sensuc`	Sensitivity of one binary predictor in
	lrm and cph models to an unmeasured
	binary confounder
`latex`	LaTeX representation of fitted
	model or `anova` or `summary` table
`Function`	S function analytic representation	`Function.transcan`
	of a fitted regression model (XBeta*)
`hazard`	S function analytic representation	`rcspline.restate`
	of a fitted hazard function (for `psm`)
`Survival`	S function analytic representation of
	fitted survival function (for `psm,cph`)
`Quantile`	S function analytic representation of
	fitted function for quantiles of
	survival time (for `psm, cph`)
`nomogram`	Draws a nomogram for the fitted model	`latex, plot`
`survest`	Estimate survival probabilities	`survfit`
	(for `psm, cph`)
`survplot`	Plot survival curves (psm, cph)	plot.survfit
`validate`	Validate indexes of model fit using	val.prob
	resampling
`calibrate`	Estimate calibration curve for model
	using resampling
`vif`	Variance inflation factors for a fit
`naresid`	Bring elements corresponding to missing
	data back into predictions and residuals
`naprint`	Print summary of missing values
`pentrace`	Find optimum penality for penalized MLE
`effective.df`	Print effective d.f. for each type of
	variable in model, for penalized fit or
	pentrace result
`rm.impute`	Impute repeated measures data with	`transcan`,
	non-random dropout	`fit.mult.impute`
	experimental, non-functional