sas.get {Hmisc} | R Documentation |
Converts a SAS dataset into an S data frame.
You may choose to extract only a subset of variables
or a subset of observations in the SAS dataset.
The function will automatically convert PROC FORMAT-coded
variables to factor objects. The original SAS codes are stored in an
attribute called sas.codes, and these may be added back to the levels
of a factor variable using the code.levels function.
Information about special missing values may be captured in an attribute
of each variable having special missing values. This attribute is
called special.miss, and such variables are given class special.miss.
There are print, [], format, and is.special.miss methods for such variables.
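As a rough illustration of how the special.miss attribute can be queried, the sketch below builds a mock variable by hand and checks it with a hypothetical helper, is_special. It assumes the attribute has the list(codes=, obs=) layout described under the special.miss argument; it is not the Hmisc implementation of is.special.miss.

```r
# Mock variable; in practice sas.get(..., special.miss=TRUE) would create it.
# Observation 3 is an ordinary missing value with no special code.
y <- c(1.2, NA, NA, 3.4, NA)
attr(y, "special.miss") <- list(codes = c("E", "G"), obs = c(2, 5))

# Hypothetical stand-in for is.special.miss(), written only for illustration.
is_special <- function(x, code) {
  sm <- attr(x, "special.miss")
  hit <- rep(FALSE, length(x))
  if (is.null(sm)) return(hit)
  # With no code given, flag every special missing value.
  keep <- if (missing(code)) rep(TRUE, length(sm$codes)) else sm$codes == code
  hit[sm$obs[keep]] <- TRUE
  hit
}

is_special(y)        # flags observations 2 and 5
is_special(y, "G")   # flags observation 5 only
```

The attribute stores only positions and codes, so the data values themselves stay plain NA and ordinary numeric code keeps working on them.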
Date, time, and date-time variables use respectively
Dates, DateTimeClasses, and chron variables.
If using S-Plus 5 or 6 or later, the timeDate function is used instead.
If a date variable represents a partial date (.5 added if
month missing, .25 added if day missing, .75 if both), an attribute
partial.date is added to the variable, and the variable also becomes
a class imputed variable.
The describe function uses information about partial dates and
special missing values.
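The fractional encoding of partial dates described above can be illustrated with a small sketch. The helper partial_date_flag is hypothetical and is not part of Hmisc; it only classifies the fractional part of a raw numeric date value, and says nothing about how sas.get itself builds the partial.date attribute.

```r
# Hypothetical helper: classify raw date values by their fractional part,
# following the .25 / .5 / .75 convention stated in the text.
partial_date_flag <- function(x) {
  frac <- x - floor(x)
  out <- rep("complete", length(x))
  out[which(frac == 0.25)] <- "day missing"
  out[which(frac == 0.5)]  <- "month missing"
  out[which(frac == 0.75)] <- "month and day missing"
  out[is.na(x)] <- NA
  out
}

raw <- c(15402, 15402.25, 15402.5, 15402.75, NA)
partial_date_flag(raw)
```

The exact comparisons are safe here because .25, .5, and .75 are exactly representable in binary floating point.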
There is an option to automatically PKUNZIP compressed SAS datasets.
sas.get works by composing and running a SAS job that
creates various ASCII files that are read and analyzed
by sas.get. You can also run the SAS sas_get macro,
which writes the ASCII files for downloading, in a separate
step or on another computer, and then tell sas.get (through the
sasout argument) to access these files instead of running SAS.
sas.get(library, member, variables=character(0), ifs=character(0),
        format.library=library, id,
        dates.=c("sas","yymmdd","yearfrac","yearfrac2"),
        keep.log=TRUE, log.file="_temp_.log", macro=sas.get.macro,
        data.frame.out=existsFunction("data.frame"), clean.up=!.R.,
        quiet=FALSE, temp=tempfile("SaS"), formats=TRUE, recode=formats,
        special.miss=FALSE, sasprog="sas", as.is=.5, check.unique.id=TRUE,
        force.single=FALSE, where, uncompress=FALSE)

is.special.miss(x, code)

x[...]

## S3 method for class 'special.miss':
print(x, ...)

## S3 method for class 'special.miss':
format(x, ...)

sas.codes(object)

code.levels(object)
library |
character string naming the directory in which the dataset is kept.
The default is library=".", indicating that the current
directory is to be used.
|
member |
character string giving the second part of the two part SAS dataset name. (The first part is irrelevant here - it is mapped to the directory name.) |
x |
a variable that may have been created by sas.get with special.miss=T
or with recode in effect.
|
variables |
vector of character strings naming the variables in the SAS dataset.
The resulting data frame will contain only those variables from the
SAS dataset.
To get all of the variables (the default), an empty string may be given.
It is a fatal error if any one of the variables is not
in the SAS dataset. If you have retrieved a subset of the variables
in the SAS dataset and wish to retrieve the same list of variables
from another dataset, you can program the value of variables - see
one of the last examples.
|
ifs |
a vector of character strings, each containing one SAS "subsetting if" statement. These will be used to extract a subset of the observations in the SAS dataset. |
format.library |
The directory containing the file formats.sc2, which contains the definitions of the user defined formats used in this dataset. By default, we look for the formats in the same directory as the data. The user defined formats must be available (so SAS can read the data). |
formats |
Set formats to FALSE to keep sas.get from telling the SAS macro to
retrieve value label formats from format.library. When you do not
specify formats or recode, sas.get will set formats to T if a
SAS format catalog (.sct or .sc2) file exists in format.library.
sas.get stores SAS PROC FORMAT VALUE definitions
as the formats attribute of the returned
object (see below). A format is used if it is referred to by one or more
variables in the dataset, if it contains no ranges of values (i.e., it
identifies value labels for single values), and if it is a character format
or a numeric format that is not used just to label missing values.
To fetch the values and labels for variable x in the dataset d you
could type:
f <- attr(d$x, "format")
formats <- attr(d, "formats")
formats[[f]]$values; formats[[f]]$labels
|
recode |
This parameter defaults to T if formats is T. If it is
T, variables that have an appropriate format (see above) are
recoded as factor objects, which map the values
to the value labels for the format. Alternatively, set recode to
1 to use labels of the form value:label, e.g. 1:good 2:better 3:best.
Set recode to 2 to use labels such as good(1) better(2) best(3).
Since sas.codes and code.levels add flexibility, the usual choice
for recode is T.
|
special.miss |
For numeric variables, any missing values are stored as NA in S.
You can recover special missing values by setting special.miss to
T. This will cause the special.miss attribute and the
special.miss class to be added
to each variable that has at least one special missing value.
Suppose that variable y was .E in observation 3 and .G
in observation 544. The special.miss attribute for y then has the
value
list(codes=c("E","G"),obs=c(3,544))
To fetch this information for variable y you would say for example
s <- attr(y, "special.miss")
s$codes; s$obs
or use is.special.miss(x) or the print.special.miss method, which
will replace NA values for the variable with E or G if they
correspond to special missing values.
The describe function uses this information in printing a data summary.
|
id |
The name of the variable to be used as the row names of the S dataset.
The id variable becomes the row.names attribute of a data frame, but
the id variable is still retained as a variable in the data frame.
You can also specify a vector of variable names as the id
parameter. After fetching the data from SAS, all these variables will be
converted to character format and concatenated (with a space as a separator)
to form a (hopefully) unique ID variable.
|
dates. |
specifies the format for storing SAS dates in the resulting data frame |
as.is |
SAS character variables are converted to S factor
objects if as.is=FALSE or if as.is is a number between 0 and 1 inclusive and
the number of unique values of the variable is less than
the number of observations (n) times as.is. The default for as.is is .5,
so character variables are converted to factors only if they have fewer
than n/2 unique values. The primary purpose of this is to keep unique
identification variables as character values in the data frame instead
of using more space to store both the integer factor codes and the
factor labels.
|
check.unique.id |
If id is specified, the row names are checked for
uniqueness if check.unique.id=T. If any are duplicated, a warning
is printed. Note that if a data frame is being created with duplicate
row names, statements such as my.data.frame["B23",] will retrieve
only the first row with a row name of "B23".
|
force.single |
By default, SAS numeric variables having LENGTHs > 4 are stored as
S double precision numerics, which allow for the same precision as
a SAS LENGTH 8 variable. Set force.single=T to store every
numeric variable in single precision (7 digits of precision).
This option is useful when the creator of the SAS dataset has
failed to use a LENGTH statement.
R does not have single precision,
so no attempt is made to convert to single if running R.
|
keep.log |
logical flag: if FALSE, delete the SAS log file upon completion.
|
log.file |
the name of the SAS log file. |
macro |
the name of an S object in the current search path that contains the text of the SAS macro called by S. The S object is a character vector that can be edited using, for example, sas.get.macro <- editor(sas.get.macro). |
data.frame.out |
set to FALSE to make the result a list instead of a data frame |
clean.up |
logical flag: if TRUE, remove all temporary files when finished. You
may want to keep these while debugging the SAS macro. Not needed for R.
|
quiet |
logical flag: if FALSE, print the contents of the
SAS log file if there has been an error.
|
temp |
the prefix to use for the temporary files. Two characters will be added to this; the resulting name must fit on your file system. |
sasprog |
the name of the system command to invoke SAS |
uncompress |
set to FALSE by default. Set it
to T to automatically invoke the DOS PKUNZIP command
if member.zip exists,
to uncompress the SAS dataset before
proceeding. This assumes you have the file permissions to allow
uncompressing in place. If the file is already uncompressed, this
option is ignored.
|
where |
by default, a list or data frame which contains all the variables
is returned. If you specify where, each individual variable
is placed into a separate object (whose name is the name
of the variable) using the assign function with the
where argument. For example, you can put each variable
in its own file in a directory, which in some cases may
save memory over attaching a data frame.
|
code |
a special missing value code (A through Z or underscore) to check against.
If code is omitted, is.special.miss will return a T for each
observation that has any special missing value.
|
object |
a variable in a data frame created by sas.get |
... |
ignored |
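The as.is rule described in the argument list above can be sketched in a few lines. The helper maybe_factor is hypothetical and is written only to illustrate the stated threshold logic; it is not taken from the Hmisc source.

```r
# Hypothetical helper illustrating the as.is rule; not the Hmisc source.
maybe_factor <- function(x, as.is = 0.5) {
  if (!is.character(x)) return(x)
  n <- length(x)
  # Convert when as.is is FALSE, or when as.is is a number and the count
  # of unique values is below n * as.is.
  to_factor <- identical(as.is, FALSE) ||
    (is.numeric(as.is) && length(unique(x)) < n * as.is)
  if (to_factor) factor(x) else x
}

ids    <- c("A01", "A02", "A03", "A04")                      # 4 unique of 4
strain <- c("nude", "nude", "wild", "nude", "wild", "nude")  # 2 unique of 6

class(maybe_factor(ids))      # stays character: 4 is not < 4 * 0.5
class(maybe_factor(strain))   # becomes a factor: 2 < 6 * 0.5
```

With the default of .5, identifier-like columns stay as character vectors while low-cardinality columns become factors.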
If you specify special.miss=T and there are no special missing
values in the SAS dataset, the SAS step will fail.
For variables having a PROC FORMAT VALUE format with some of the
levels undefined, sas.get will interpret those values as NA
if you are using recode.
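The effect of undefined format levels can be reproduced in base R, as a sketch of what recoding implies rather than of sas.get's own code. The values and labels below are made up for illustration; the point is that any data value without a matching format entry has no factor level to map to.

```r
# Made-up data codes and a format that defines only the values 1, 2, 3.
codes  <- c(1, 2, 3, 9, 2)
values <- c(1, 2, 3)
labels <- c("good", "better", "best")

# Values absent from `levels` become NA in the resulting factor.
f <- factor(codes, levels = values, labels = labels)
f
is.na(f)
```

If codes such as 9 carry meaning, they need an entry in the PROC FORMAT VALUE definition before recoding, or the variable should be read without recode.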
If you leave the sasprog argument at its default value of
"sas", be sure that the SAS executable is in the PATH
specified in your autoexec.bat file. Also make sure that
you invoke S so that your current project directory is known
to be the current working directory. This is best done by creating
a shortcut in Windows95, for which the command to execute will be
something like drive:\spluswin\cmd\splus.exe HOME=. and the
program is flagged to start in drive:\myproject, for example.
In this way, you will be able to examine the SAS log file easily
since it will be placed in drive:\myproject by default.
SAS will create SASWORK and SASUSER directories in what it thinks
are the current working directories. To specify where SAS should
put these instead, edit the config.sas file or specify a
sasprog argument of the following form:
sasprog="\sas\sas.exe -saswork c:\saswork -sasuser c:\sasuser"
When sas.get needs to run SAS it is run in iconized form.
The SAS macro sas_get uses record lengths of up to 4096 in two
places. If you are exporting records that are very long (because of
a large number of variables and/or long character variables), you
may want to edit these LRECLs to quadruple them, for example.
A data frame resembling the SAS dataset. If id
was specified, that column of the data frame will be used
as the row names of the data frame. Each variable in the data frame
or vector in the list will have the attributes label and format
containing SAS labels and formats. Underscores in formats are
converted to periods. Formats for character variables have $ placed
in front of their names.
If formats is T and there are any appropriate format definitions in
format.library, the returned object will have attribute formats
containing lists named the same as the format names (with periods
substituted for underscores and character formats prefixed by $).
Each of these lists has a vector called values and one called labels
with the PROC FORMAT; VALUE ... definitions.
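To make the layout of these attributes concrete, here is a small sketch that builds a mock object with the same structure and looks up value labels from it. The data frame, the format name rating, and its values and labels are all hypothetical; a real object would come from sas.get.

```r
# Mock data frame standing in for a sas.get() result; names are hypothetical.
d <- data.frame(x = c(1, 3, 2))
attr(d$x, "format") <- "rating"
attr(d, "formats") <- list(
  rating = list(values = c(1, 2, 3), labels = c("good", "better", "best"))
)

f   <- attr(d$x, "format")        # format name attached to the variable
fmt <- attr(d, "formats")[[f]]    # look the definition up by that name
x_labels <- fmt$labels[match(d$x, fmt$values)]
x_labels
```

Indexing with [[f]] matters because f holds the format name as a string; writing $f would look for a list element literally called f.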
If a SAS error occurs, the SAS log file will be
printed under the control of the pager function.
The references cited below explain the structure of SAS datasets and how they are stored. See SAS Language for a discussion of the "subsetting if" statement.
If sasout is not given, you must be able to run SAS on your system.
If you are reading time or date-time variables, you will need to
execute the command library(chron) to print those variables or the
data frame if the timeDate function is not available.
Terry Therneau, Mayo Clinic
Frank Harrell, Vanderbilt University
Bill Dunlap, University of Washington and Insightful Corp.
Michael W. Kattan, Cleveland Clinic Foundation
SAS Institute Inc. (1990). SAS Language: Reference, Version 6. First Edition. SAS Institute Inc., Cary, North Carolina.
SAS Institute Inc. (1988). SAS Technical Report P-176, Using the SAS System, Release 6.03, under UNIX Operating Systems and Derivatives. SAS Institute Inc., Cary, North Carolina.
SAS Institute Inc. (1985). SAS Introductory Guide. Third Edition. SAS Institute Inc., Cary, North Carolina.
data.frame, describe, label, upData
## Not run:
mice <- sas.get("saslib", mem="mice", var=c("dose", "strain", "ld50"))
plot(mice$dose, mice$ld50)

nude.mice <- sas.get(lib=unix("echo $HOME/saslib"), mem="mice",
                     ifs="if strain='nude'")

nude.mice.dl <- sas.get(lib=unix("echo $HOME/saslib"), mem="mice",
                        var=c("dose", "ld50"), ifs="if strain='nude'")

# Get a dataset from current directory, recode PROC FORMAT; VALUE ...
# variables into factors with labels of the form "good(1)" "better(2)",
# get special missing values, recode missing codes .D and .R into new
# factor levels "Don't know" and "Refused to answer" for variable q1
d <- sas.get(mem="mydata", recode=2, special.miss=TRUE)
attach(d)
nl <- length(levels(q1))
lev <- c(levels(q1), "Don't know", "Refused")
q1.new <- as.integer(q1)
q1.new[is.special.miss(q1,"D")] <- nl+1
q1.new[is.special.miss(q1,"R")] <- nl+2
q1.new <- factor(q1.new, 1:(nl+2), lev)
# Note: would like to use factor() in place of as.integer ... but
# factor in this case adds "NA" as a category level

d <- sas.get(mem="mydata")
sas.codes(d$x)    # for PROC FORMATted variables returns original data codes
d$x <- code.levels(d$x)   # or attach(d); x <- code.levels(x)
# This makes levels such as "good" "better" "best" into e.g.
# "1:good" "2:better" "3:best", if the original SAS values were 1,2,3

# For the following example, suppose that SAS is run on a
# different machine from the one on which S is run.
# The sas_get macro is used to create files needed by
# sas.get. To make a text file containing the sas_get macro
# run the following S command, for example:
# cat(sas.get.macro, file='/sasmacro/sas_get.sas', sep='\n')

# Here is the SAS job. This job assumes that you put
# sas_get.sas in an autocall macro library.
# libname db '/my/sasdata/area';
# %sas_get(db.mydata, dict, data, formats, specmiss,
#          formats=1, specmiss=1)
# Substitute whatever file names you may want.

# Next the 4 files are moved to the S machine (using
# ASCII file transfer mode) and the following S
# program is run:
mydata <- sas.get(sasout=c('dict','data','formats','specmiss'),
                  id='idvar')

# If PKZIP is run after %sas_get, e.g. "PKZIP port dict data formats"
# (assuming that specmiss was not used here), use
mydata <- sas.get(sasout='a:port', id='idvar')
# which will run PKUNZIP port to unzip a:port.zip, creating the
# dict, data, and formats files which are generated (and later
# deleted) by sas.get

# Retrieve the same variables from another dataset (or an update of
# the original dataset)
mydata2 <- sas.get('mydata2', var=names(mydata))
# This only works if none of the original SAS variable names contained _

# Code from Don MacQueen to generate SAS dataset to test import of
# date, time, date-time variables
# data ssd.test;
#   d1='3mar2002'd ;
#   dt1='3mar2002 9:31:02'dt;
#   t1='11:13:45't;
#   output;
#
#   d1='3jun2002'd ;
#   dt1='3jun2002 9:42:07'dt;
#   t1='11:14:13't;
#   output;
#   format d1 mmddyy10. dt1 datetime. t1 time.;
# run;
## End(Not run)