R: Validate Predicted Probabilities

val.prob {Design}

R Documentation

Validate Predicted Probabilities

Description

The val.prob and val.surv functions are useful for validating predicted probabilities against binary events and predicted survival probabilities against right-censored failure times, respectively. First val.prob is described.

Given a set of predicted probabilities p or predicted log odds logit, and a vector of binary outcomes y that were not used in developing the predictions p or logit, val.prob computes the following indexes and statistics: Somers' D_{xy} rank correlation between p and y [2(C-.5), C=ROC area], Nagelkerke-Cox-Snell-Maddala-Magee R-squared index, Discrimination index D [ (Logistic model L.R. chi-square - 1)/n], L.R. chi-square, its P-value, Unreliability index U, chi-square with 2 d.f. for testing unreliability (H0: intercept=0, slope=1), its P-value, the quality index Q, Brier score (average squared difference in p and y), Intercept, and Slope, and E_{max}=maximum absolute difference in predicted and calibrated probabilities. If pl=TRUE, plots fitted logistic calibration curve and optionally a smooth nonparametric fit using lowess(p,y,iter=0) and grouped proportions vs. mean predicted probability in group. If the predicted probabilities or logits are constant, the statistics are returned and no plot is made.

When group is present, different statistics are computed, different graphs are made, and the object returned by val.prob is different. group specifies a stratification variable. Validations are done separately by levels of group and overall. A print method prints summary statistics and several quantiles of predicted probabilities, and a plot method plots calibration curves with summary statistics superimposed, along with selected quantiles of the predicted probabilities (shown as tick marks on calibration curves). Only the lowess calibration curve is estimated. The statistics computed are the average predicted probability, the observed proportion of events, a 1 d.f. chi-square statistic for testing for overall mis-calibration (i.e., a test of the observed vs. the overall average predicted probability of the event) (ChiSq), and a 2 d.f. chi-square statistic for testing simultaneously that the intercept of a linear logistic calibration curve is zero and the slope is one (ChiSq2), average absolute calibration error (average absolute difference between the lowess-estimated calibration curve and the line of identity, labeled Eavg), Eavg divided by the difference between the 0.95 and 0.05 quantiles of predictive probabilities (Eavg/P90), a "median odds ratio", i.e., the anti-log of the median absolute difference between predicted and calibrated predicted log odds of the event (Med OR), the C-index (ROC area), the Brier quadratic error score (B), a chi-square test of goodness of fit based on the Brier score (B ChiSq), and the Brier score computed on calibrated rather than raw predicted probabilities (B cal). The first chi-square test is a test of overall calibration accuracy ("calibration in the large"), and the second will also detect errors such as slope shrinkage caused by overfitting or regression to the mean. See Cox (1970) for both of these score tests. The goodness of fit test based on the (uncalibrated) Brier score is due to Hilden, Habbema, and Bjerregaard (1978) and is discussed in Spiegelhalter (1986). When group is present you can also specify sampling weights (usually frequencies), to obtained weighted calibration curves.

To get the behavior that results from a grouping variable being present without having a grouping variable, use group=TRUE. In the plot method, calibration curves are drawn and labeled by default where they are maximally separated using the labcurve function. The following parameters do not apply when group is present: pl, smooth, logistic.cal, m, g, cuts, emax.lim, legendloc, riskdist, mkh, connect.group, connect.smooth. The following parameters apply to the plot method but not to val.prob: xlab, ylab, lim, statloc, cex.

val.surv uses Cox-Snell (1968) residuals on the cumulative probability scale to check on the calibration of a survival model against right-censored failure time data. If the predicted survival probability at time t for a subject having predictors X is S(t|X), this method is based on the fact that the predicted probability of failure before time t, 1 - S(t|X), when evaluated at the subject's actual survival time T, has a uniform (0,1) distribution. The quantity 1 - S(T|X) is right-censored when T is. By getting one minus the Kaplan-Meier estimate of the distribution of 1 - S(T|X) and plotting against the 45 degree line we can check for calibration accuracy. A more stringent assessment can be obtained by stratifying this analysis by an important predictor variable. The theoretical uniform distribution is only an approximation when the survival probabilities are estimates and not population values.

When censor is specified to val.surv, a different validation is done that is more stringent but that only uses the uncensored failure times. This method is used for type I censoring when the theoretical censoring times are known for subjects having uncensored failure times. Let T, C, and F denote respectively the failure time, censoring time, and cumulative failure time distribution (1 - S). The expected value of F(T | X) is 0.5 when T represents the subject's actual failure time. The expected value for an uncensored time is the expected value of F(T | T <=q C, X) = 0.5 F(C | X). A smooth plot of F(T|X) - 0.5 F(C|X) for uncensored T should be a flat line through y=0 if the model is well calibrated. A smooth plot of 2F(T|X)/F(C|X) for uncensored T should be a flat line through y=1.0. The smooth plot is obtained by smoothing the (linear predictor, difference or ratio) pairs.

Usage

val.prob(p, y, logit, group, weights=rep(1,length(y)), normwt=FALSE, 
         pl=TRUE, smooth=TRUE, logistic.cal=TRUE,
         xlab="Predicted Probability", ylab="Actual Probability",
         lim=c(0, 1), m, g, cuts, emax.lim=c(0,1),
         legendloc=lim[1] + c(0.55 * diff(lim), 0.27 * diff(lim)), 
         statloc=c(0,0.9), riskdist="calibrated", cex=.75, mkh=.02,
         connect.group=FALSE, connect.smooth=TRUE, g.group=4, 
         evaluate=100, nmin=0)

## S3 method for class 'val.prob':
print(x, ...)

## S3 method for class 'val.prob':
plot(x, xlab="Predicted Probability", 
     ylab="Actual Probability",
     lim=c(0,1), statloc=lim, stats=1:12, cex=.5, 
     lwd.overall=4, quantiles=c(.05,.95), flag, ...)

val.surv(fit, newdata, S, est.surv, censor)

## S3 method for class 'val.surv':
plot(x, group, g.group=4, what=c('difference','ratio'),
     type=c('l','b','p'),
     xlab, ylab, xlim, ylim, datadensity=TRUE, ...)

Arguments

`p`	predicted probability
`y`	vector of binary outcomes
`logit`	predicted log odds of outcome. Specify either `p` or `logit`.
`fit`	a fit object created by `cph` or `psm`
`group`	a grouping variable. If numeric this variable is grouped into `g.group` quantile groups (default is quartiles). Set `group=TRUE` to use the `group` algorithm but with a single stratum for `val.prob`.
`weights`	an optional numeric vector of per-observation weights (usually frequencies), used only if `group` is given.
`normwt`	set to `TRUE` to make `weights` sum to the number of non-missing observations.
`pl`	TRUE to plot calibration curves and optionally statistics
`smooth`	plot smooth fit to `(p,y)` using `lowess(p,y,iter=0)`
`logistic.cal`	plot linear logistic calibration fit to `(p,y)`
`xlab`	x-axis label, default is `"Predicted Probability"` for `val.prob`. Other defaults are used by `val.surv`.
`ylab`	y-axis label, default is `"Actual Probability"` for `val.prob`. Other defaults are used by `val.surv`.
`lim`	limits for both x and y axes
`m`	If grouped proportions are desired, average no. observations per group
`g`	If grouped proportions are desired, number of quantile groups
`cuts`	If grouped proportions are desired, actual cut points for constructing intervals, e.g. `c(0,.1,.8,.9,1)` or `seq(0,1,by=.2)`
`emax.lim`	Vector containing lowest and highest predicted probability over which to compute `Emax`.
`legendloc`	If `pl=TRUE`, list with components `x,y` or vector `c(x,y)` for upper left corner of legend for curves and points. Default is `c(.55, .27)` scaled to `lim`. Use `locator(1)` to use the mouse, `FALSE` to suppress legend.
`statloc`	D_{xy}, C, R^2, D, U, Q, `Brier` score, `Intercept`, `Slope`, and E_{max} will be added to plot, using `statloc` as the upper left corner of a box (default is `c(0,.9)`). You can specify a list or a vector. Use `locator(1)` for the mouse, `FALSE` to suppress statistics. This is plotted after the curve legends.
`riskdist`	Defaults to `"calibrated"` to plot the relative frequency distribution of calibrated robabilities after dividing into 101 bins from `lim[1]` to `lim[2]`. Set to `"predicted"` to use raw assigned risk, `FALSE` to omit risk distribution. Values are scaled so that highest bar is `0.15*(lim[2]-lim[1])`.
`cex`	Character size for legend or for table of statistics when `group` is given
`mkh`	Size of symbols for legend. Default is 0.02 (see `par()`).
`connect.group`	Defaults to `FALSE` to only represent group fractions as triangles. Set to `TRUE` to also connect with a solid line.
`connect.smooth`	Defaults to `TRUE` to draw smoothed estimates using a dashed line. Set to `FALSE` to instead use dots at individual estimates.
`g.group`	number of quantile groups to use when `group` is given and variable is numeric.
`evaluate`	number of points at which to store the `lowess`-calibration curve. Default is 100. If there are more than `evaluate` unique predicted probabilities, `evaluate` equally-spaced quantiles of the unique predicted probabilities, with linearly interpolated calibrated values, are retained for plotting (and stored in the object returned by `val.prob`.
`nmin`	applies when `group` is given. When `nmin` > 0, `val.prob` will not store coordinates of smoothed calibration curves in the outer tails, where there are fewer than `nmin` raw observations represented in those tails. If for example `nmin`=50, the `plot` function will only plot the estimated calibration curve from a to b, where there are 50 subjects with predicted probabilities < a and > b. `nmin` is ignored when computing accuracy statistics.
`x`	result of `val.prob` (with `group` in effect) or `val.surv`
`...`	optional arguments for `labcurve` (through `plot`). Commonly used options are `col` (vector of colors for the strata plus overall) and `lty`. Ignored for `print`.
`stats`	vector of column numbers of statistical indexes to write on plot
`lwd.overall`	line width for plotting the overall calibration curve
`quantiles`	a vector listing which quantiles should be indicated on each calibration curve using tick marks. The values in `quantiles` can be any number of values from the following: .01, .025, .05, .1, .25, .5, .75, .9, .95, .975, .99. By default the 0.05 and 0.95 quantiles are indicated.
`flag`	a function of the matrix of statistics (rows representing groups) returning a vector of character strings (one value for each group, including "Overall"). `plot.val.prob` will print this vector of character values to the left of the statistics. The `flag` function can refer to columns of the matrix used as input to the function by their names given in the description above. The default function returns `"*"` if either `ChiSq2` or `B ChiSq` is significant at the 0.01 level and `" "` otherwise.
`newdata`	a data frame for which `val.surv` should obtain predicted survival probabilities. If omitted, survival estimates are made for all of the subjects used in `fit`.
`S`	an `Surv` object
`est.surv`	a vector of estimated survival probabilities corresponding to times in the first column of `S`.
`censor`	a vector of censoring times. Only the censoring times for uncensored observations are used.
`what`	the quantity to plot when `censor` was in effect. The default is to show the difference between cumulative probabilities and their expectation given the censoring time. Set `what="ratio"` to show the ratio instead.
`type`	Set to the default (`"l"`) to plot the trend line only, `"b"` to plot both individual subjects ratios and trend lines, or `"p"` to plot only points.
`xlim`
`ylim`	axis limits for `plot.val.surv` when the `censor` variable was used.
`datadensity`	By default, `plot.val.surv` will show the data density on each curve that is created as a result of `censor` being present. Set `datadensity=FALSE` to suppress these tick marks drawn by `scat1d`.

Details

The 2 d.f. chi-square test and Med OR exclude predicted or calibrated predicted probabilities <=q 0 to zero or >=q 1, adjusting the sample size as needed.

Value

val.prob without group returns a vector with the following named elements: Dxy, R2, D, D:Chi-sq, D:p, U, U:Chi-sq, U:p, Q, Brier, Intercept, Slope, Emax. When group is present val.prob returns an object of class val.prob containing a list with summary statistics and calibration curves for all the strata plus "Overall".

Side Effects

plots calibration curve

Author(s)

Frank Harrell
Department of Biostatistics, Vanderbilt University
f.harrell@vanderbilt.edu

References

Harrell FE, Lee KL, Mark DB (1996): Multivariable prognostic models: Issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors. Stat in Med 15:361–387.

Harrell FE, Lee KL (1987): Using logistic calibration to assess the accuracy of probability predictions (Technical Report).

Miller ME, Hui SL, Tierney WM (1991): Validation techniques for logistic regression models. Stat in Med 10:1213–1226.

Harrell FE, Lee KL (1985): A comparison of the discrimination of discriminant analysis and logistic regression under multivariate normality. In Biostatistics: Statistics in Biomedical, Public Health, and Environmental Sciences. The Bernard G. Greenberg Volume, ed. PK Sen. New York: North-Holland, p. 333–343.

Cox DR (1970): The Analysis of Binary Data, 1st edition, section 4.4. London: Methuen.

Spiegelhalter DJ (1986):Probabilistic prediction in patient management. Stat in Med 5:421–433.

Cox DR, Snell EJ (1968):A general definition of residuals (with discussion). JRSSB 30:248–275.

May M, Royston P, Egger M, Justice AC, Sterne JAC (2004):Development and validation of a prognostic model for survival time data: application to prognosis of HIV positive patients treated with antiretroviral therapy. Stat in Med 23:2375–2398.

Examples

# Fit logistic model on 100 observations simulated from the actual 
# model given by Prob(Y=1 given X1, X2, X3) = 1/(1+exp[-(-1 + 2X1)]),
# where X1 is a random uniform [0,1] variable.  Hence X2 and X3 are 
# irrelevant.  After fitting a linear additive model in X1, X2,
# and X3, the coefficients are used to predict Prob(Y=1) on a
# separate sample of 100 observations.

set.seed(1)
n <- 200
x1 <- runif(n)
x2 <- runif(n)
x3 <- runif(n)
logit <- 2*(x1-.5)
P <- 1/(1+exp(-logit))
y <- ifelse(runif(n)<=P, 1, 0)
d <- data.frame(x1,x2,x3,y)
f <- lrm(y ~ x1 + x2 + x3, subset=1:100)
pred.logit <- predict(f, d[101:200,])
phat <- 1/(1+exp(-pred.logit))
val.prob(phat, y[101:200], m=20, cex=.5)  # subgroups of 20 obs.

# Validate predictions more stringently by stratifying on whether
# x1 is above or below the median

v <- val.prob(phat, y[101:200], group=x1[101:200], g.group=2)
v
plot(v)
plot(v, flag=function(stats) ifelse(
  stats[,'ChiSq2'] > qchisq(.95,2) |
  stats[,'B ChiSq'] > qchisq(.95,1), '*', ' ') )
# Stars rows of statistics in plot corresponding to significant
# mis-calibration at the 0.05 level instead of the default, 0.01

plot(val.prob(phat, y[101:200], group=x1[101:200], g.group=2), 
              col=1:3) # 3 colors (1 for overall)

# Weighted calibration curves
# plot(val.prob(pred, y, group=age, weights=freqs))

# Survival analysis examples
# Generate failure times from an exponential distribution
set.seed(123)              # so can reproduce results
n <- 2000
age <- 50 + 12*rnorm(n)
sex <- factor(sample(c('Male','Female'), n, rep=TRUE, prob=c(.6, .4)))
cens <- 15*runif(n)
h <- .02*exp(.04*(age-50)+.8*(sex=='Female'))
t <- -log(runif(n))/h
label(t) <- 'Time to Event'
ev <- ifelse(t <= cens, 1, 0)
t <- pmin(t, cens)
S <- Surv(t, ev)

# First validate true model used to generate data
w <- val.surv(est.surv=exp(-h*t), S=S)
plot(w)
plot(w, group=sex)  # stratify by sex

# Now fit an exponential model and validate
# Note this is not really a validation as we're using the
# training data here
f <- psm(S ~ age + sex, dist='exponential', y=TRUE)
w <- val.surv(f)
plot(w, group=sex)

# We know the censoring time on every subject, so we can
# compare the predicted Pr[T <= observed T | T>c, X] to
# its expectation 0.5 Pr[T <= C | X] where C = censoring time
# We plot a ratio that should equal one
w <- val.surv(f, censor=cens)
plot(w)

plot(w, group=age, g=3)   # stratify by tertile of age

[Package Design version 2.0-12 Index]