missing values – your advice is needed

by Urs E. Gattiker on 2008/11/24 1,145 views

    When data are missing, the appropriate missing data analysis procedures do not generate something out of nothing but do make the most out of the data available

One of the most common forms of analysis with missing data involves simply substituting the mean of the variable whenever a value is missing. Unfortunately, mean substitution can produce badly wrong estimates of variances and covariances. In general, substituting the mean for the missing value leads to underestimating the magnitude of both variances and covariances.
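A minimal sketch of why the variance shrinks, using made-up scores (the data and the helper name `mean_substitute` are ours, for illustration only). Every substituted value sits exactly on the mean, so the sum of squared deviations stays the same while the divisor grows from the number of observed cases minus one to the total number of cases minus one:

```python
import statistics

def mean_substitute(values):
    """Replace None entries with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    mean = statistics.fmean(observed)
    return [mean if v is None else v for v in values]

# Hypothetical scores; None marks a missing value.
scores = [2.0, 4.0, None, 6.0, 8.0, None, 10.0, None]

observed = [v for v in scores if v is not None]
filled = mean_substitute(scores)

var_observed = statistics.variance(observed)  # sample variance, observed cases only
var_filled = statistics.variance(filled)      # sample variance after mean substitution

# The filled-in points add nothing to the sum of squares,
# so the variance shrinks by (n_observed - 1) / (n_total - 1).
shrink = (len(observed) - 1) / (len(filled) - 1)
print(var_observed, var_filled, var_observed * shrink)
```

With 3 of 8 values missing, the variance after substitution is 4/7 of the variance of the observed cases.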

    “1. Whenever possible, use the EM algorithm (or other maximum likelihood procedure, including the multiple-group structural equation-modeling procedure or, where appropriate, multiple imputation) for analyses involving missing data.

    2. If other analyses must be used, keep in mind that they produce biased results and should not be relied upon for final analyses. Recommending:

    a. Never use mean substitution, even for preliminary analyses.

    b. With minimal missing data, analysis of complete cases may be a reasonable solution.

    c. If data are missing completely at random, pairwise deletion or complete cases analysis may be a reasonable solution.

    d. If data are not missing completely at random and the cause of missingness has been measured, complete cases may produce unbiased estimates, although it is a generally less powerful approach than the EM algorithm or multiple-group procedure.”

John W. Graham, Scott M. Hofer, and Andrea M. Piccinin (1994). Analysis With Missing Data in Drug Prevention Research. In L. M. Collins & L. A. Seitz (eds.), Advances in Data Analysis for Prevention Intervention Research (pp. 13-63). NIDA Research Monograph 142. Bethesda, MD: U.S. Department of Health and Human Services

Accordingly, mean substitution:

1. artificially decreases the variation of scores. This decrease in each variable's variation is proportional to the number of missing values: the more data are missing, the more “perfectly average scores” are artificially added to the data set; and

2. substitutes missing data with artificially created “average” data points, which can considerably change the values of correlations.
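The effect on correlations can be sketched with simulated data (everything here is an assumption for illustration: the data are random draws, values of x are deleted completely at random, and `pearson` is our own helper). When only x is mean-substituted, the correlation with y comes out no larger in magnitude than the correlation computed from the complete cases:

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation of two equal-length lists."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 200
x = [rng.gauss(0, 1) for _ in range(n)]
y = [xi + rng.gauss(0, 1) for xi in x]  # y is correlated with x by construction

results = []
for fraction in (0.1, 0.3, 0.5):
    # Delete a random share of the x values (missing completely at random).
    missing = set(rng.sample(range(n), int(fraction * n)))
    x_obs = [x[i] for i in range(n) if i not in missing]
    y_obs = [y[i] for i in range(n) if i not in missing]
    mean_x = sum(x_obs) / len(x_obs)
    x_filled = [mean_x if i in missing else x[i] for i in range(n)]
    r_complete = pearson(x_obs, y_obs)   # complete cases only
    r_filled = pearson(x_filled, y)      # after mean substitution in x
    results.append((fraction, r_complete, r_filled))
    print(f"{fraction:.0%} missing: complete cases r={r_complete:.3f}, "
          f"mean-substituted r={r_filled:.3f}")
```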

We have tried to reduce this problem by calculating the mean only across those blogs with the same Google PageRank (e.g., take the values of variable x for blogs with Google PageRank 4 only, calculate their average, and use it for the missing values in that group). In turn, this reduces the impact that outliers, high and low, may have on our results. For more see also:
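A rough sketch of that group-wise substitution, assuming each blog is a record with a PageRank and a value for variable x (the field names, the sample records and the function `group_mean_substitute` are hypothetical, not our production code):

```python
from collections import defaultdict

def group_mean_substitute(rows, group_key, value_key):
    """Fill missing values (None) with the mean of observed values in the same group."""
    observed = defaultdict(list)
    for row in rows:
        if row[value_key] is not None:
            observed[row[group_key]].append(row[value_key])
    means = {group: sum(vals) / len(vals) for group, vals in observed.items()}

    filled = []
    for row in rows:
        new_row = dict(row)  # copy, so the input records are left untouched
        if new_row[value_key] is None:
            # Stays None when the whole group is missing: no value is invented.
            new_row[value_key] = means.get(new_row[group_key])
        filled.append(new_row)
    return filled

# Hypothetical records; None marks a missing value of x.
blogs = [
    {"blog": "a", "pagerank": 4, "x": 10.0},
    {"blog": "b", "pagerank": 4, "x": 14.0},
    {"blog": "c", "pagerank": 4, "x": None},
    {"blog": "d", "pagerank": 6, "x": 40.0},
    {"blog": "e", "pagerank": 6, "x": None},
]
filled = group_mean_substitute(blogs, "pagerank", "x")
print([row["x"] for row in filled])
```

Blog c receives the PageRank 4 average and blog e the PageRank 6 average. Note that this is still mean substitution: the variation within each PageRank group is still understated.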

what we do with missing values

We are currently looking for ways of doing multiple imputation. If you want to support our efforts, get in touch and/or leave a comment below. We are wondering how we can integrate this program:

Schafer, J. Software for Multiple Imputation

with our work. Suggestions are welcome.

We need to find a way to run it without relying on any of the statistical packages to do the job for us. If you have any advice, please leave a comment; we need your expertise and appreciate any help we can get.
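To make the question concrete, here is a very simplified sketch of the idea behind multiple imputation, using only the Python standard library. It is not Schafer's program and not a substitute for it. It assumes a single variable x with missing values, a fully observed helper variable z that predicts x linearly, and missingness that depends at most on z. Each imputed data set comes from a regression refitted on a bootstrap sample with random noise added, and the estimates are combined with Rubin's rules. All function names and the simulated data are ours:

```python
import random
import statistics

def fit_line(zs, xs):
    """Ordinary least squares for x = a + b*z; returns (a, b, residual_sd)."""
    mz, mx = statistics.fmean(zs), statistics.fmean(xs)
    szz = sum((z - mz) ** 2 for z in zs)
    b = sum((z - mz) * (x - mx) for z, x in zip(zs, xs)) / szz
    a = mx - b * mz
    residuals = [x - (a + b * z) for z, x in zip(zs, xs)]
    sd = (sum(r * r for r in residuals) / (len(xs) - 2)) ** 0.5
    return a, b, sd

def impute_once(zs, xs, rng):
    """One completed copy of xs: bootstrap observed pairs, refit, predict plus noise."""
    pairs = [(z, x) for z, x in zip(zs, xs) if x is not None]
    boot = [rng.choice(pairs) for _ in pairs]  # reflects uncertainty in the fit
    a, b, sd = fit_line([p[0] for p in boot], [p[1] for p in boot])
    return [a + b * z + rng.gauss(0, sd) if x is None else x
            for z, x in zip(zs, xs)]

def pool_mean(zs, xs, m=20, seed=0):
    """Mean of x from m imputed data sets, pooled with Rubin's rules."""
    rng = random.Random(seed)
    estimates, variances = [], []
    for _ in range(m):
        completed = impute_once(zs, xs, rng)
        estimates.append(statistics.fmean(completed))
        variances.append(statistics.variance(completed) / len(completed))
    q_bar = statistics.fmean(estimates)        # pooled estimate
    within = statistics.fmean(variances)       # average within-imputation variance
    between = statistics.variance(estimates)   # between-imputation variance
    total = within + (1 + 1 / m) * between
    return q_bar, total ** 0.5

# Simulated example: x depends on z, and every fourth x is deleted.
data_rng = random.Random(1)
z = [float(i) for i in range(40)]
x = [2.0 * zi + data_rng.gauss(0, 3) for zi in z]
x_missing = [None if i % 4 == 0 else xi for i, xi in enumerate(x)]

estimate, std_error = pool_mean(z, x_missing)
print(estimate, std_error)
```

Unlike mean substitution, the imputed values vary around the prediction, and the pooled standard error includes the extra uncertainty caused by the missing data. A real analysis would need a proper imputation model for several variables at once, which is exactly where we are looking for advice.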

other resources

Allen, E. I., & Sharpe, N. R. (2005). Demonstration of Ranking Issues for Students: A Case Study. Journal of Statistics Education, Volume 13, Number 3

Karen Grace-Martin writes The Analysis Factor blog. You should subscribe; it is refreshing and very helpful indeed.

