ols {Design}                                                R Documentation
Description

Fits the usual weighted or unweighted linear regression model using the same fitting routines used by lm, but also stores the variance-covariance matrix var and uses traditional dummy-variable coding for categorical factors. Also fits unweighted models using penalized least squares, with the same penalization options as in the lrm function. For penalized estimation, there is a fitter function called lm.pfit.
Usage

ols(formula, data, weights, subset, na.action=na.delete,
    method="qr", model=FALSE, x=FALSE, y=FALSE, se.fit=FALSE,
    linear.predictors=TRUE, penalty=0, penalty.matrix, tol=1e-7,
    sigma, var.penalty=c('simple','sandwich'), ...)
Arguments

formula
    an S formula object, e.g. Y ~ rcs(x1,5)*lsp(x2,c(10,20))

data
    name of an S data frame containing all needed variables. Omit this to use a data frame already in the S "search list".

weights
    an optional vector of weights to be used in the fitting process. If specified, weighted least squares is used with weights weights (that is, minimizing sum(w*e^2)); otherwise ordinary least squares is used.

subset
    an expression defining a subset of the observations to use in the fit. The default is to use all observations. Specify for example age>50 & sex=="male" or c(1:100,200:300) respectively to use the observations satisfying a logical expression or those having row numbers in the given vector.

na.action
    specifies an S function to handle missing data. The default is the function na.delete, which causes observations with any variable missing to be deleted. The main difference between na.delete and the S-supplied function na.omit is that na.delete makes a list of the number of observations that are missing on each variable in the model. The na.action is usually specified by e.g. options(na.action="na.delete").

method
    specifies a particular fitting method, or "model.frame" instead to return the model frame of the predictor and response variables satisfying any subset or missing value checks.

model
    default is FALSE. Set to TRUE to return the model frame as element model of the fit object.

x
    default is FALSE. Set to TRUE to return the expanded design matrix as element x (without intercept indicators) of the returned fit object. Set both x=TRUE and y=TRUE if you are going to use the residuals function later to return anything other than ordinary residuals.

y
    default is FALSE. Set to TRUE to return the vector of response values as element y of the fit.

se.fit
    default is FALSE. Set to TRUE to compute the estimated standard errors of the estimate of X beta and store them in element se.fit of the fit.

linear.predictors
    set to FALSE to cause predicted values not to be stored

penalty
    see lrm

penalty.matrix
    see lrm

tol
    tolerance for information matrix singularity

sigma
    If sigma is given, it is taken as the actual root mean squared error parameter for the model. Otherwise sigma is estimated from the data using the usual formulas (except for penalized models). It is often convenient to specify sigma=1 for models with no error, when using fastbw to find an approximate model that predicts predicted values from the full model with a given accuracy.

var.penalty
    the type of variance-covariance matrix to be stored in the var component of the fit when penalization is used. The default is the inverse of the penalized information matrix. Specify var.penalty="sandwich" to use the sandwich estimator (see below under var), which limited simulation studies have shown yields variance estimates that are too low.

...
    arguments to pass to lm.wfit or lm.fit
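The weighted least squares criterion described under weights (minimizing sum(w*e^2)) can be checked directly in base R; the data and variable names below are made up for illustration and do not require the Design package:

```r
# Weighted least squares: minimize sum(w * e^2).
# The closed-form solution is (X'WX)^{-1} X'Wy, which lm(..., weights=w)
# computes internally via lm.wfit.
set.seed(2)
n <- 50
x <- runif(n)
y <- 2 + 3*x + rnorm(n)
w <- runif(n)                       # arbitrary positive weights
X <- cbind(1, x)                    # design matrix with intercept

beta.manual <- solve(t(X) %*% (w * X), t(X) %*% (w * y))
beta.lm     <- coef(lm(y ~ x, weights=w))

all.equal(as.vector(beta.manual), as.vector(beta.lm),
          check.attributes=FALSE)
```

The two estimates agree to numerical tolerance, since both solve the same normal equations.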
Details

For penalized estimation, the penalty factor on the log likelihood is -0.5 β' P β / σ^2, where P is defined below under var. The penalized maximum likelihood estimate (penalized least squares or ridge estimate) of β is (X'X + P)^{-1} X'Y. The maximum likelihood estimate of σ^2 is (sse + β' P β) / n, where sse is the sum of squared errors (residuals). The effective.df.diagonal vector is the diagonal of the matrix X'X/(sse/n) σ^2 (X'X + P)^{-1}.
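The penalized least squares formulas above can be sketched in base R. This is only an illustration: the data, the ridge penalty lambda, and the diagonal P below are made up (the real penalty matrix construction is described under lrm), and the effective d.f. is computed via the common equivalent form trace((X'X + P)^{-1} X'X):

```r
# Ridge-type penalized least squares: beta = (X'X + P)^{-1} X'Y
# and sigma^2 = (sse + beta' P beta) / n, as in the formulas above.
set.seed(3)
n <- 100
x <- matrix(rnorm(n*3), n, 3)
y <- x %*% c(1, -1, 0.5) + rnorm(n)
X <- cbind(1, x)                          # intercept + 3 predictors

lambda <- 5
P <- diag(c(0, rep(lambda, 3)))           # no penalty on the intercept

XtX  <- crossprod(X)                      # X'X
beta <- solve(XtX + P, crossprod(X, y))   # penalized estimate
sse  <- sum((y - X %*% beta)^2)
sig2 <- as.numeric((sse + t(beta) %*% P %*% beta) / n)

# Effective degrees of freedom (sum over the diagonal); with lambda = 0
# this reduces to the number of columns of X.
ed <- sum(diag(solve(XtX + P) %*% XtX))
```

As lambda grows, ed shrinks from ncol(X) toward the number of unpenalized parameters, which is what the effective.df.diagonal vector summarizes per parameter.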
Value

the same objects returned from lm (unless penalty or penalty.matrix are given, in which case an abbreviated list is returned since lm.pfit is used as a fitter) plus the design attributes (see Design). Predicted values are always returned, in the element linear.predictors. The vectors or matrix stored if y=TRUE or x=TRUE have rows deleted according to subset and to missing data, and have names or row names that come from the data frame used as input data.

If penalty or penalty.matrix is given, the var matrix returned is an improved variance-covariance matrix for the penalized regression coefficient estimates. If var.penalty="sandwich" (not the default, as limited simulation studies have found it provides variance estimates that are too low) it is defined as σ^2 (X'X + P)^{-1} X'X (X'X + P)^{-1}, where P is penalty factors * penalty.matrix, with a column and row of zeros added for the intercept. When var.penalty="simple" (the default), var is σ^2 (X'X + P)^{-1}.
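The two variance-covariance choices described above can be compared with a small base-R sketch. The data and penalty below are synthetic, and σ^2 is taken as sse/n here purely for simplicity:

```r
# "simple":   sigma^2 (X'X + P)^{-1}
# "sandwich": sigma^2 (X'X + P)^{-1} X'X (X'X + P)^{-1}
set.seed(4)
n <- 100
x <- matrix(rnorm(n*2), n, 2)
y <- x %*% c(1, 2) + rnorm(n)
X <- cbind(1, x)
P <- diag(c(0, 3, 3))                 # zero row/column for the intercept

XtX  <- crossprod(X)
Ainv <- solve(XtX + P)
beta <- Ainv %*% crossprod(X, y)
sig2 <- sum((y - X %*% beta)^2) / n

var.simple   <- sig2 * Ainv
var.sandwich <- sig2 * Ainv %*% XtX %*% Ainv
# Since Ainv XtX Ainv = Ainv - Ainv P Ainv, the sandwich diagonal is
# never larger than the simple diagonal, consistent with the note that
# the sandwich form can yield variance estimates that are too low.
```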
The returned list has a vector stats with named elements n, Model L.R., d.f., R2, Sigma. Model L.R. is the model likelihood ratio chi-square statistic, and R2 is R^2. For penalized estimation, d.f. is the effective degrees of freedom, which is the sum of the elements of another vector returned, effective.df.diagonal, minus one for the intercept. Sigma is the penalized maximum likelihood estimate (see Details above).
Author(s)

Frank Harrell
Department of Biostatistics, Vanderbilt University
f.harrell@vanderbilt.edu
See Also

Design, Design.trans, anova.Design, summary.Design, predict.Design, fastbw, validate, calibrate, plot.Design, specs.Design, cph, lrm, which.influence, lm, summary.lm, print.ols, residuals.ols, latex.ols, na.delete, na.detail.response, naresid, datadist, pentrace, vif, abs.error.pred
Examples

set.seed(1)
x1 <- runif(200)
x2 <- sample(0:3, 200, TRUE)
distance <- (x1 + x2/3 + rnorm(200))^2
d <- datadist(x1,x2)
options(datadist="d")   # No d -> no summary, plot without giving all details

f <- ols(sqrt(distance) ~ rcs(x1,4) + scored(x2), x=TRUE)
# could use d <- datadist(f); options(datadist="d") at this point,
# but predictor summaries would not be stored in the fit object for
# use with plot.Design, summary.Design.  In that case, the original
# dataset or d would need to be accessed later, or all variable values
# would have to be specified to summary, plot
anova(f)
which.influence(f)
summary(f)
summary.lm(f)           # will only work if penalty and penalty.matrix not used

# Fit a complex model and approximate it with a simple one
x1 <- runif(200)
x2 <- runif(200)
x3 <- runif(200)
x4 <- runif(200)
y <- x1 + x2 + rnorm(200)
f <- ols(y ~ rcs(x1,4) + x2 + x3 + x4)
pred <- fitted(f)       # or predict(f) or f$linear.predictors
f2 <- ols(pred ~ rcs(x1,4) + x2 + x3 + x4, sigma=1)
# sigma=1 prevents numerical problems resulting from R2=1
fastbw(f2, aics=100000)
# This will find the best 1-variable model, best 2-variable model, etc.
# in predicting the predicted values from the original model
options(datadist=NULL)