Reference Manual for COVIMP.EXE (Version 1.4)
(with ADDRES11.EXE and IMPCALC.EXE utilities)

Authors: John W. Graham & Scott M. Hofer

Manual Revision Date: June 11, 1996
Program revision date: June 11, 1996

COVIMP14: A program (with utilities) for doing multiple imputation

BACKGROUND

These programs (COVIMP14.EXE, ADDRES11.EXE, and IMPCALC.EXE) are meant to be used in conjunction with EMCOV (EMCOV23.EXE) (Graham & Hofer, 1993), an implementation of the EM algorithm for continuous data, as described by Little & Rubin (1987).

One use of EMCOV is to analyze the dataset containing missing values, and produce a covariance matrix (TEMP.COV) that takes all the missingness into account, and is useable in any program that can take a covariance matrix as input.

A second possible use of EMCOV is to request additional output. The additional output is in the form of a raw dataset in which the missing values have been replaced by imputed values (TEMP.DAT). Also output with this option is a dataset (TEMP.RES) containing a vector of residuals for each variable in the dataset. If the initial value was nonmissing, the residual is the difference between the actual value and the value predicted by the regression equation (all other variables as predictors). If the initial value was missing the residual is zero.

Note of Caution: We recommend that you use EMCOV23 only for generating the covariance matrix (TEMP.COV), and that you always select the default option for additional output (i.e., you should NOT select additional output). We recommend that all imputing be done with the new imputation utility: COVIMP14.EXE.

Note of Caution: One should NOT make use of the file TEMP.DAT by itself. Because the missing values have been imputed without error (i.e., the imputed value lies exactly on the regression line), variables with imputed values have too little (error) variance. Adding a randomly selected (non-zero) residual term from TEMP.RES restores this variability.
Please see discussion of ADDRES and COVIMP below.

********************************
ADDRES (version 1.1)
********************************

This version of ADDRES is a compiled BASIC program that randomly samples the residuals from TEMP.RES, and adds one residual to each imputed value in TEMP.DAT (the zero residuals corresponding to previously missing values are NOT included in this sampling).

This version of ADDRES reads in the datasets TEMP.DAT and TEMP.RES, which are output from EMCOV, and writes out the new dataset TEMPX.DAT, which contains an imputed data matrix. The variables in the TEMPX.DAT dataset should have variances and covariances that are asymptotically unbiased, and reasonably efficient (e.g., see Graham, Hofer, & Piccinin, 1994; Graham, Hofer, & MacKinnon, in press).

Note of Caution: ADDRES version 1.1 uses the clock timer as the starting seed for random selection. Minor problems in sampling may occur as a result.

Note of Caution: If a particular case has more than one missing value, ADDRES version 1.1 samples a random residual independently for each missing value for that case. An alternative for future versions of ADDRES will be to sample residuals from the same residual case for every imputed value for a given case (Schafer, personal communication, 1995). For example, if a case was initially missing variables 3 and 6, the residuals added to the imputed values for variables 3 and 6 would be taken from a single case for which both variables were nonmissing. This technique for restoring error variability to the imputed values would better retain the structure of error covariance in the data. The current version of ADDRES may thus slightly underestimate error covariance in the model.

*******************************
Multiple Imputation
*******************************

Unfortunately, having a single imputed dataset is not particularly helpful.
Although one is certainly able to analyze the dataset and obtain reasonable parameter estimates, one has no way of obtaining standard errors for these estimates. A good option is to generate multiple imputed datasets (Rubin, 1987; Schafer, in press).

It is important in performing multiple imputation that all sources of variability be retained. By using the utility program ADDRES (or a similar program), one can restore the error variability to the imputed data. However, a second source of variability is due to the fact that the covariance matrix used to impute the data is itself only an estimate of the true covariance matrix. It is, in fact, only one of several plausible covariance matrices. Thus, an important second part of the multiple imputation process is to obtain multiple plausible estimates of the covariance matrix used to generate the imputed values. One way to do this is by bootstrapping the original dataset.

Let's suppose we would like to have five multiply imputed datasets (Schafer, in press, suggests five to 10 datasets are usually sufficient). The first thing we would need is five plausible covariance matrices. The first of these is obtained by using EMCOV on the original dataset. The remaining four covariance matrices are obtained by drawing four bootstrapped samples of the original data (using the BOOTSAMP utility), and running EMCOV on each of these.

The next step involves generating the imputed datasets. For this version of the program, this is a two-step process. To generate each imputed dataset, first use the COVIMP14.EXE utility with the original dataset and one of the five plausible covariance matrices as input. The datasets TEMP.DAT and TEMP.RES are output each time you run COVIMP14.EXE. In the second step, the ADDRES11.EXE utility is used to add random residuals from TEMP.RES to each imputed value in TEMP.DAT, and the dataset TEMPX.DAT is created.

Note of Caution.
You must remember to rename files (TEMP.COV, TEMP.DAT, TEMP.RES, TEMPX.DAT) if you wish to keep them from being overwritten.

*********************
COVIMP
*********************

The COVIMP program, version 1.4, is a compiled BASIC program. The program uses the same sweep operator as used in EMCOV to obtain regression weights based on a specific covariance matrix. These regression weights allow us to predict each variable in the dataset using all others as predictors.

If each case in the dataset has no more than a single missing value, then this estimation is easy. For each missing value, we use the predicted value from the regression equation for that variable. However, what happens if some cases have more than one missing value? In this case, one or more of the predictor variables will also be missing. What values should be used for these predictor variables?

COVIMP is an iterative program. It uses the same input covariance matrix to generate the same regression weights at each iteration. Suppose a case has two missing values. The value from the previous iteration is used for variable 1 in predicting variable 2 (and vice versa). The program iterates until the largest change in imputed values anywhere in the dataset is trivial (.0000001 is currently used as the default criterion).

Note of Caution: It is possible that COVIMP will fail to converge in some cases.

***************************************************
Summary of Multiple Imputation Procedure
***************************************************

Step 1: Analyze the original dataset (e.g., ORIG.DAT) with EMCOV23.EXE. The output is the covariance matrix (with vector of means), TEMP.COV.

Step 2: Run COVIMP14.EXE. The program assumes TEMP.COV is the input covariance matrix. The program will prompt you for the name of the original dataset (e.g., ORIG.DAT), N and K. The program outputs TEMP.DAT and TEMP.RES.

Step 3: Analyze TEMP.DAT and TEMP.RES with ADDRES11.EXE.
The program assumes TEMP.DAT and TEMP.RES as input. The program prompts you for N (the number of subjects) and K (number of variables). The program outputs an imputed dataset called TEMPX.DAT, which you might rename to something like IMPUTE1.DAT.

Step 4: Bootstrap the original dataset with BOOTSAMP.EXE, yielding a bootstrapped dataset (e.g., BOOT1.DAT). The program prompts you for input dataset (e.g., ORIG.DAT), output dataset (e.g., BOOT1.DAT), N and K.

Step 5: Analyze the output from Step 4 (e.g., BOOT1.DAT) with EMCOV. Do NOT request additional output. The output dataset is TEMP.COV.

Step 6: Run COVIMP14.EXE. The program assumes TEMP.COV is the input covariance matrix. The program will prompt you for the name of the original dataset (e.g., ORIG.DAT), N and K. The program outputs TEMP.DAT and TEMP.RES.

Step 7: Analyze TEMP.DAT and TEMP.RES with ADDRES. The program assumes TEMP.DAT and TEMP.RES as input. The program prompts you for N and K. The program outputs an imputed dataset called TEMPX.DAT, which you might rename to something like IMPUTE2.DAT.

Step 8: Repeat Steps 4 - 7 as many times as you like. It is recommended that you obtain 5 or 10 imputed datasets in total (e.g., IMPUTE1.DAT -- IMPUTE10.DAT).

Step 9: Analyze all (e.g., 5) imputed datasets with your analysis of choice (e.g., LISREL, or SAS PROC REG). Set the sample size to the maximum sample size in the original dataset (e.g., ORIG.DAT), pretending for the moment that there are no missing data. Save the parameter estimates of interest, and the corresponding standard errors, from each analysis.

Step 10: When all analyses are complete, organize your results in an ASCII file (e.g., RESULTS.DAT) as follows:

   a. Arrange the parameter estimates of interest from the first analysis in the file such that the elements are separated by at least one space.

   b. Then arrange the corresponding standard error estimates from that analysis (separating elements with spaces), and add these to RESULTS.DAT following the parameter estimates.

   c. Repeat (a) and (b) for each dataset analyzed. That is, in the ASCII dataset, you should have estimates from analysis of IMPUTE1.DAT, standard errors from IMPUTE1.DAT, estimates from IMPUTE2.DAT, standard errors from IMPUTE2.DAT, and so forth.

As an example, the file (RESULTS.DAT) might look like the following (assume five parameter estimates of interest, and five imputed datasets):

   .421  .316  .322 -.205  .045
   .060  .055  .071  .082  .062
   .330  .342  .353 -.176  .129
   .061  .054  .070  .085  .062
   .391  .324  .326 -.221  .058
   .062  .053  .069  .083  .063
   .413  .291  .277 -.210  .098
   .060  .058  .066  .089  .063
   .471  .309  .298 -.169  .011
   .060  .054  .071  .079  .056

Step 11: Run the utility program IMPCALC.EXE. The program will prompt you for input dataset (e.g., RESULTS.DAT) and output dataset (e.g., RESULTS.TOT). It will prompt you for the number of imputations, and the number of parameter estimates.

The final dataset (e.g., RESULTS.TOT) contains

* the point estimate of each parameter (the simple average of the estimates obtained from the 5 or 10 imputed datasets)

* an estimate of the standard error for each parameter estimate. Schafer (in press) gives the following formula. The standard error is the square root of T, where

     T = U(bar) + [ ( 1 + 1/m ) * B ]

  and U(bar) is the average of the squared standard errors for one estimate over the 5-10 datasets, m is the number of imputations (e.g., 5 or 10), and B is the sample variance of the parameter estimate over the 5-10 datasets. Conceptually, the estimate of the standard error is a combination of the within-imputation variability, U(bar), and the between-imputation variability, B.

* the critical ratio (point estimate divided by the estimated standard error). Schafer suggests that this could be treated as a z-value, but might more appropriately be treated as a t-value.
* the estimated degrees of freedom for the t-value using the formula suggested by Schafer (in press)

The output from IMPCALC.EXE for the example data shown above is:

   HOW MANY IMPUTATIONS? 5
   HOW MANY PARAMETER ESTIMATES? 5
   INPUT WHAT DATASET? results.dat
   OUTPUT WHAT DATASET? results.tot

   AVERAGE PARAMETER ESTIMATE = 0.40520
   STANDARD ERROR = 0.08260
   DF (V) = 18.77293
   t(z) = 4.906

   AVERAGE PARAMETER ESTIMATE = 0.31640
   STANDARD ERROR = 0.05857
   DF (V) = 261.81127
   t(z) = 5.402

   AVERAGE PARAMETER ESTIMATE = 0.31520
   STANDARD ERROR = 0.07631
   DF (V) = 134.57886
   t(z) = 4.130

   AVERAGE PARAMETER ESTIMATE = -.19620
   STANDARD ERROR = 0.08723
   DF (V) = 623.97291
   t(z) = -2.249

   AVERAGE PARAMETER ESTIMATE = 0.06820
   STANDARD ERROR = 0.07939
   DF (V) = 24.43794
   t(z) = 0.859

***************************************************
Summary of Differences Between Multiple Imputation
as Implemented in EMCOV/COVIMP and Multiple
Imputation as Implemented by Schafer (in press)
***************************************************

* Basic Imputation

  EMCOV uses an EM algorithm for continuous data as the basic imputation procedure.

  Schafer's procedure uses an EM algorithm for continuous, categorical, or mixed (continuous and categorical) data (whichever is most appropriate for the dataset).

* Restoration of Error Variability

  EMCOV/ADDRES restores variability by adding to each imputed value a randomly selected residual from a case that is nonmissing for that variable. In this version of EMCOV/ADDRES, residuals are selected independently for all imputed values, even if the same case has two or more missing values.

  When a case has missing values for two or more variables, Schafer suggests adding all residuals for that case from a randomly sampled case that has nonmissing values for the two or more variables in question.
* Restoring variability due to estimation of the covariance matrix (sampling from the distribution of plausible covariance matrices)

  EMCOV/BOOTSAMP samples from the distribution of plausible covariance matrices by bootstrapping the original dataset, and analyzing each bootstrapped dataset with EMCOV.

  Schafer's procedure samples from the distribution of plausible covariance matrices using a data augmentation procedure (Tanner & Wong, 1987; Schafer, in press).

Note. Schafer's multiple imputation routines are currently written in FORTRAN for Sun workstations. As currently implemented, the routines are called from the S-plus statistical package. The routines, in their current form, are available in the public domain, and may be available in the near future as a stand-alone package in the PC DOS or Windows environment.

****************************************************
Distribution: Please feel free to distribute these EMCOV programs to anyone you like. Any questions or comments should be directed to:
****************************************************

John W. Graham
Department of Biobehavioral Health
East-210 Health & Human Development Bldg.
Penn State University
University Park, PA 16802-6508

email: jwg4@psuvm.psu.edu
tel: (814) 863-0200

***********************
References:
***********************

1. Graham, J. W., & Donaldson, S. I. (1993). Evaluating interventions with differential attrition: The importance of nonresponse mechanisms and use of followup data. Journal of Applied Psychology, 78, 119-128.

2. Graham, J. W., Hofer, S. M., & Piccinin, A. M. (1994). Analysis with missing data in drug prevention research. In L. M. Collins and L. Seitz (eds.), Advances in data analysis for prevention intervention research. NIDA Research Monograph Series (#142), Washington DC: National Institute on Drug Abuse.

3. Graham, J. W., Hofer, S. M., & MacKinnon, D. P. (in press).
Maximizing the usefulness of data obtained with planned missing value patterns: An application of maximum likelihood procedures. Multivariate Behavioral Research.

4. Schafer, J. L. (in press). Analysis of Incomplete Multivariate Data. New York: Chapman and Hall.

5. Tanner, M. A., & Wong, W. H. (1987). The calculation of posterior distributions by data augmentation (with discussion). Journal of the American Statistical Association, 82, 528-550.
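
Illustration: combining results across imputed datasets

For readers who wish to check the combining rules described in Step 11 by hand, the following is a minimal Python sketch. It is not the source of IMPCALC.EXE (a compiled BASIC program), and the function name pool_estimates is hypothetical. The standard error follows the formula for T given in Step 11. The degrees-of-freedom expression is an assumption: this manual does not print the formula, so the sketch uses the standard one from Rubin (1987), v = (m - 1) * [1 + U(bar) / ((1 + 1/m) * B)]^2.

```python
import math


def pool_estimates(estimates, std_errors):
    """Combine m estimates and standard errors for ONE parameter.

    Returns (average estimate, standard error, degrees of freedom, t).
    Hypothetical helper for illustration; not part of the EMCOV programs.
    """
    m = len(estimates)
    if m < 2 or m != len(std_errors):
        raise ValueError("need the same number (at least 2) of estimates and SEs")

    # Point estimate: simple average over the m imputed datasets.
    q_bar = sum(estimates) / m

    # Within-imputation variability: average of the squared standard errors.
    u_bar = sum(se ** 2 for se in std_errors) / m

    # Between-imputation variability: sample variance of the estimates.
    b = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)

    # Total variance T = U(bar) + (1 + 1/m) * B, as given in Step 11.
    t_var = u_bar + (1 + 1 / m) * b
    se = math.sqrt(t_var)

    # Degrees of freedom (assumed formula, Rubin 1987); infinite if B is zero.
    if b > 0:
        df = (m - 1) * (1 + u_bar / ((1 + 1 / m) * b)) ** 2
    else:
        df = math.inf

    return q_bar, se, df, q_bar / se


# First parameter from the example RESULTS.DAT (five imputed datasets).
first_estimates = [0.421, 0.330, 0.391, 0.413, 0.471]
first_std_errors = [0.060, 0.061, 0.062, 0.060, 0.060]

q_bar, se, df, t = pool_estimates(first_estimates, first_std_errors)
print(f"AVERAGE PARAMETER ESTIMATE = {q_bar:.5f}")
print(f"STANDARD ERROR = {se:.5f}")
print(f"DF (V) = {df:.5f}")
print(f"t(z) = {t:.3f}")
```

Run on the first parameter of the example data, the sketch gives values that agree, up to rounding, with the first block of IMPCALC.EXE output shown in Step 11. To process a whole RESULTS.DAT file, one would read the estimates and standard errors for each imputed dataset in turn and call the function once per parameter.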