- When data are missing, the appropriate missing data analysis procedures do not generate something out of nothing but do make the most out of the data available.
One of the most common forms of analysis with missing data involves simply substituting the mean for the variable whenever a value is missing. Unfortunately, mean substitution can produce badly biased estimates of variances and covariances. In general, substituting the mean for the missing value has the effect of underestimating the magnitude of both variances and covariances.
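To see why the variance shrinks, here is a short worked equation. The symbols are our own notation, not taken from the quoted source: n is the total number of cases, n_obs the number with an observed value, and s²_obs the sample variance of the observed values. Every substituted case sits exactly on the observed mean, so it adds nothing to the sum of squared deviations while still increasing the denominator:

```latex
\bar{x}_{\mathrm{sub}} = \bar{x}_{\mathrm{obs}},
\qquad
s^2_{\mathrm{sub}}
  = \frac{1}{n-1} \sum_{i \in \mathrm{obs}} \left( x_i - \bar{x}_{\mathrm{obs}} \right)^2
  = \frac{n_{\mathrm{obs}} - 1}{n - 1} \, s^2_{\mathrm{obs}}
```

For example, with n = 100 cases of which 80 are observed, the factor is 79/99, or roughly 0.80, so the variance after substitution is about 20% smaller than the variance of the observed values.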
- “1. Whenever possible, use the EM algorithm (or other maximum likelihood procedure, including the multiple-group structural equation-modeling procedure or, where appropriate, multiple imputation) for analyses involving missing data.
2. If other analyses must be used, keep in mind that they produce biased results and should not be relied upon for final analyses. Recommendations:
a. Never use mean substitution, even for preliminary analyses.
b. With minimal missing data, analysis of complete cases may be a reasonable solution.
c. If data are missing completely at random, pairwise deletion or complete cases analysis may be a reasonable solution.
d. If data are not missing completely at random and the cause of missingness has been measured, complete cases may produce unbiased estimates, although it is a generally less powerful approach than the EM algorithm or multiple-group procedure.”
Accordingly, mean substitution:
1. artificially decreases the variation of scores; for each variable, this decrease is proportional to the number of missing values, because the more data are missing, the more “perfectly average scores” are artificially added to the data set; and
2. substitutes missing data with artificially created “average” data points, which can considerably change the values of correlations.
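Both effects can be checked on simulated data. The following is a minimal sketch using only the Python standard library; the data are randomly generated for illustration (not taken from our blog data), and the helper names `pearson` and `mean_substitute` are our own.

```python
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation of two equally long lists of numbers."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def mean_substitute(values):
    """Replace every None with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = statistics.fmean(observed)
    return [fill if v is None else v for v in values]


rng = random.Random(42)  # fixed seed so the run is repeatable
n = 200
x = [rng.gauss(0, 1) for _ in range(n)]
y = [0.8 * xi + rng.gauss(0, 0.6) for xi in x]

# Delete 30% of the y values completely at random.
missing_idx = set(rng.sample(range(n), k=60))
y_missing = [None if i in missing_idx else yi for i, yi in enumerate(y)]
y_filled = mean_substitute(y_missing)

# Compare the observed cases with the mean-substituted data set.
x_observed = [x[i] for i in range(n) if i not in missing_idx]
y_observed = [v for v in y_missing if v is not None]

var_observed = statistics.variance(y_observed)
var_filled = statistics.variance(y_filled)
r_observed = pearson(x_observed, y_observed)
r_filled = pearson(x, y_filled)

print(f"variance:    observed {var_observed:.3f}, after substitution {var_filled:.3f}")
print(f"correlation: observed {r_observed:.3f}, after substitution {r_filled:.3f}")
```

In this setup, where only y has missing values, the substituted variance is smaller than the observed variance and the correlation moves toward zero. With missing values in both variables, or with data that are not missing completely at random, the correlation can shift in less predictable ways.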
We have tried to minimize this issue by calculating the mean only among blogs with the same Google PageRank (e.g., for a blog with Google PageRank 4 that is missing a value on variable x, take the values of x for all blogs with PageRank 4, calculate their average, and use it). In turn, this reduces the impact that outliers, high and low, may have on our results. For more see also:
what we do with missing values
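The PageRank-matched substitution described above can be sketched as follows. This is an illustration rather than our production code: the function name `impute_group_mean`, the field names `pagerank` and `x`, and the sample values are all hypothetical.

```python
from collections import defaultdict
from statistics import fmean


def impute_group_mean(rows, group_key, value_key):
    """Return copies of rows where a missing value (None) is replaced by
    the mean of the observed values in the same group."""
    observed = defaultdict(list)
    for row in rows:
        if row[value_key] is not None:
            observed[row[group_key]].append(row[value_key])
    group_means = {group: fmean(vals) for group, vals in observed.items()}

    filled = []
    for row in rows:
        new_row = dict(row)  # copy, so the input rows stay untouched
        if new_row[value_key] is None:
            # Stays None when the group has no observed value at all.
            new_row[value_key] = group_means.get(new_row[group_key])
        filled.append(new_row)
    return filled


# Hypothetical sample: variable x for blogs grouped by PageRank.
blogs = [
    {"blog": "a", "pagerank": 4, "x": 10.0},
    {"blog": "b", "pagerank": 4, "x": 14.0},
    {"blog": "c", "pagerank": 4, "x": None},
    {"blog": "d", "pagerank": 6, "x": 40.0},
    {"blog": "e", "pagerank": 6, "x": None},
    {"blog": "f", "pagerank": 2, "x": None},
]

filled = impute_group_mean(blogs, "pagerank", "x")
for row in filled:
    print(row["blog"], row["pagerank"], row["x"])
```

Note that a group with no observed values cannot be filled this way, and that every filled value still sits exactly on its group mean, so the variance within each PageRank group shrinks just as it does with overall mean substitution.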
We are currently looking for ways to do multiple imputation. If you want to support our efforts, get in touch and/or leave a comment below. We are wondering how we can integrate this program:
Schafer, J. Software for Multiple Imputation
with our work. Suggestions are welcome.
We need to find a way to run it without relying on any of the statistical packages to do the job for us. If you have any advice, please leave a comment; we need your expertise and appreciate any help we can get.
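One part of multiple imputation that needs no statistical package is the final pooling step, usually called Rubin's rules: each of the m imputed data sets is analysed separately, and the m estimates and their squared standard errors are then combined. The sketch below covers only that step, in plain Python; it assumes the imputed data sets have already been produced by some other tool. The function name `pool_rubin` and the example numbers are invented for illustration.

```python
from math import sqrt
from statistics import fmean, variance


def pool_rubin(estimates, variances):
    """Combine results from m imputed data sets with Rubin's rules.

    estimates: the estimate of one quantity from each imputed data set
    variances: the squared standard error of each of those estimates
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two estimates and one variance per estimate")

    pooled = fmean(estimates)      # pooled point estimate
    within = fmean(variances)      # average within-imputation variance
    between = variance(estimates)  # between-imputation variance (m - 1 denominator)
    total = within + (1 + 1 / m) * between
    return {
        "estimate": pooled,
        "within": within,
        "between": between,
        "total": total,
        "se": sqrt(total),
    }


# Invented example: one regression coefficient from m = 5 imputed data sets.
result = pool_rubin(
    estimates=[0.50, 0.54, 0.46, 0.52, 0.48],
    variances=[0.010, 0.012, 0.011, 0.009, 0.010],
)
print(f"pooled estimate {result['estimate']:.3f}, standard error {result['se']:.3f}")
```

The arithmetic is simple enough to port to any language. The harder part, generating the imputed values themselves, is not shown here.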
{ 3 comments }
Thanks for the kind words. Refreshing-wow!
Here’s my advice:
Get on the Impute mailing list: http://lists.utsouthwestern.edu/mailman/listinfo/impute. All the theoretical statisticians who work with multiple imputation (the guys who derive the equations) seem to be there.
A couple good books on missing data you might want to start with are Allison and Little & Rubin. Allison’s is not equation heavy, but I do know there is an equation in there about how to combine the standard errors for multiple imputation. Little and Rubin is very equation heavy, so probably has much of what you need. Full citations are at my site under Resources: Books.
Dear Karen
The nice words I put down about your blog are well deserved; I meant them and still mean them.
Thanks for the input, I am getting on the list as per your suggestion…
Equations are fine for me… but I am trying to find a program that can help us deal with this and do it right.
Maybe the mailing list will give me the info, or else looking at these books? If you know of a program that we can use (we program in PHP), please let me know; I really would love to know.
Thanks Urs
Hi Urs–I am totally not a programmer, so can’t help you there. The books will help you with the equations to program, but I suspect someone on that Impute list will know about programming it.
You might want to look into R as well. It’s free statistical software, and I believe open source. I’m pretty sure it has multiple imputation.
Good luck–Karen
Comments on this entry are closed.