2008

Data Source

Canadian Community Health Survey (CCHS)

Organizer

David Haziza, Département de mathématiques et de statistique, Université de Montréal, and Gordon Kuromi, Statistics Canada

NOTE 1: Interested students can obtain an electronic version of most of the references cited below by sending us an email.

NOTE 2: The Imputation Bulletin is a newsletter produced by Statistics Canada twice a year.

Objective

Despite the best efforts made by survey staff to maximize response, it is almost certain that some degree of nonresponse will occur in large-scale surveys. Survey statisticians distinguish between two types of nonresponse: total or unit nonresponse (when no information is collected on a sampled unit) and partial or item nonresponse (when the absence of information is limited to some variables only). Unit nonresponse occurs, for example, when the sampled unit is not at home or refuses to participate in the survey. Item nonresponse may occur if the sampled unit refuses to respond to sensitive items, does not know the answer to some items, or because of edit failures. Generally, weighting adjustment methods are used to compensate for unit nonresponse whereas imputation is used to compensate for item nonresponse. The main idea behind a weighting adjustment is to increase the sampling weights of the respondents in order to compensate for the nonrespondents, while imputation is a process where one or more "plausible values" are produced to replace a missing value. It is customary in both weighting and imputation to first classify respondents and nonrespondents into classes, formed on the basis of information recorded for all units in the sample. The main effects of (unit or item) nonresponse include: (i) bias of point estimators; (ii) an increase in the variance of point estimators (since the observed sample size is smaller than the sample size initially planned); and (iii) bias of the complete-data variance estimators. The main objective when treating (unit or item) nonresponse is the reduction of the nonresponse bias, which occurs if respondents and nonrespondents differ with respect to the survey variables.

The student assignment is to find imputation strategies that can reduce the nonresponse bias as much as possible. Also, the student will consider the problem of variance estimation in the presence of imputed data and compare the results with those obtained when the imputed values are treated as if they were observed. A list of questions is given in the Research Question section.

Data:

The data set used for the case study is a subset of a sample collected between January 2005 and December 2005 for the Canadian Community Health Survey (CCHS), which is a cross-sectional survey that collects information related to health status, health care utilization and health determinants for the Canadian population. The original data file was a Public Use Microdata File (PUMF) obtained for the CCHS Cycle 3.1 (2005) consisting of 132,221 records containing 1284 variables. We selected a relatively small subset of variables. Records with missing or incomplete responses were dropped, which reduced the number of records. This led to the creation of an artificial population file (consisting of complete records) containing 97,035 records. This population was then stratified by province / territory, and a stratified simple random sample without replacement of 20,000 records was taken for the case study sample.

In the data file, the variable SamplingWeight denotes the sampling weight of an individual, which is defined as the inverse of its inclusion probability in the sample. Let $w_{hi}$ be the sampling weight of individual $i$ in stratum $h$. We have $w_{hi} = N_h / n_h$, where $N_h$ is the number of individuals in stratum $h$ and $n_h$ is the number of individuals sampled in stratum $h$.
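As a minimal sketch of the weight formula above, the following Python snippet computes $w_{hi} = N_h / n_h$ for each sampled unit. The function name, the stratum labels and the stratum sizes are hypothetical and chosen only for illustration; they are not taken from the case study file.

```python
from collections import Counter

def stratified_weights(sample_strata, stratum_sizes):
    """Return w_hi = N_h / n_h for each sampled unit, given its stratum label."""
    n_h = Counter(sample_strata)  # number of sampled units in each stratum
    return [stratum_sizes[h] / n_h[h] for h in sample_strata]

# Hypothetical toy sample: 4 units drawn from stratum "A" (N_A = 100)
# and 2 units drawn from stratum "B" (N_B = 60).
strata = ["A", "A", "A", "A", "B", "B"]
weights = stratified_weights(strata, {"A": 100, "B": 60})
print(weights)       # [25.0, 25.0, 25.0, 25.0, 30.0, 30.0]
print(sum(weights))  # 160.0, the total population size N_A + N_B
```

Under stratified simple random sampling, the weights within a stratum are equal and the weights over the whole sample sum to the population size, which is a quick sanity check on a weight variable.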

Finally, missing values for the variable BMI (body mass index) were generated according to a specified nonresponse mechanism (that is known only to the organizing team). Note that, in practice, it is highly unusual to have nonresponse to just a single variable. For simplicity, we chose to create missing values for one variable only. Also, the nonresponse mechanism used to create missing values is artificial in the sense that it is likely different from the true unknown nonresponse mechanism observed in practice.

For more details on stratified sampling, see Lohr (1999), Särndal, Swensson and Wretman (1992) and Haziza (2007a). Also, you can obtain information about the CCHS on the Statistics Canada website.

Background

Single Imputation

We distinguish between single and multiple imputation. Single imputation consists of creating a single imputed value to replace a missing value, which leads to the creation of a single complete data file. Multiple imputation, proposed by Rubin (1978, 1987), consists of creating M ≥ 2 imputed values to fill in a missing value, which leads to the creation of M complete data files. Multiple imputation is discussed in the Multiple imputation section below.

Single imputation is widely used in surveys for treating item nonresponse because it presents the following advantages: (1) It leads to the creation of a complete data file; (2) unlike weighting adjustment for each item, imputation allows for the use of a single sampling weight for all items and (3) the results of different analyses are consistent with each other.

It is nonetheless important to note that imputation presents certain risks. The most significant risks are as follows: (1) Even though imputation leads to the creation of a complete data file, inferences are valid only if the underlying assumptions about the response mechanism and/or the imputation model are satisfied. (2) Some imputation methods tend to distort the distribution of the variables of interest (i.e., the variables being imputed). (3) Treating the imputed values as if they were observed may lead to a substantial underestimation of the variance of the estimator, especially if the item nonresponse rate is appreciable. (4) Marginal imputation for each item separately has the effect of distorting relationships between variables.
Where there is no nonresponse…

In the absence of nonresponse, survey samplers usually try to avoid using estimation procedures whose validity depends on that of a given model. To avoid assumptions on the distribution of the data, the properties of estimators are generally based on the sampling design used to select the sample rather than on a particular model. This approach is the so-called design-based approach or randomization approach to inference. This does not mean that models are useless under the design-based approach. In fact, they play an important role in the determination of efficient sampling and estimation procedures. For details on point and variance estimation in the case of complete data, see Lohr (1999), Särndal, Swensson and Wretman (1992) and Haziza (2007a).

Unlike in the case of complete response, the use of models is unavoidable in the presence of nonresponse, and the properties of (point and variance) estimators (e.g., bias and variance) will depend on the validity of the assumed models. Consequently, imputation is essentially a modeling exercise. The quality of the estimates will thus depend on the availability (at the imputation stage) of good auxiliary information and its judicious use in the construction of imputed values and/or imputation classes.
Auxiliary information

Auxiliary information is defined as a set of variables, which is available (at least) for all the sampled units (i.e., variables for which we have complete response). Good auxiliary information refers to a set of variables which is related to the variable being imputed and/or related to the probability of response to the variable being imputed.
Nonresponse mechanism

The sample is selected according to a known random mechanism called the sampling mechanism. In the case of complete response, survey statisticians use the knowledge of specific aspects of the sampling mechanism (i.e., first-order and second-order inclusion probabilities) to construct point and variance estimators that are unbiased under repeated sampling (or design-unbiased).

In the context of nonresponse, there is another random mechanism that, given the selected sample, divides the sample into a random set of respondents and a random set of nonrespondents. This random mechanism is called the nonresponse mechanism. Unlike the sampling mechanism, the nonresponse mechanism is not known to the survey statistician. Thus, it is customary to make assumptions about the nonresponse mechanism. The reader is referred to Beaumont (2002) for a discussion of the nonresponse mechanism and its impact on the bias of point estimators.
Imputation model

The imputation model is a set of assumptions with respect to the distribution of the variable being imputed. It links the variable being imputed to a set of auxiliary variables. For example, if the variable being imputed is continuous, the imputation model could be a multiple linear regression model.

Nonresponse model

The nonresponse model is a set of assumptions with respect to the unknown nonresponse mechanism. Let $r_i = 1$ if unit $i$ responded to a given item of interest and $r_i = 0$ otherwise. Since the response status of a unit is a binary variable, one possible nonresponse model is the usual logistic model linking $P(r_i = 1)$ to a set of auxiliary variables.
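To make the logistic nonresponse model concrete, here is a minimal sketch that fits $P(r_i = 1) = 1 / (1 + \exp(-(b_0 + b_1 x_i)))$ by Newton-Raphson using only the Python standard library. The single auxiliary variable x, the toy data and the function name are assumptions made for illustration; the fit is unweighted, and in practice one would use a statistical package and several auxiliary variables.

```python
import math

def fit_logistic(x, r, iterations=25):
    """Fit P(r_i = 1) = 1 / (1 + exp(-(b0 + b1 * x_i))) by Newton-Raphson."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        p = [1.0 / (1.0 + math.exp(-(b0 + b1 * xi))) for xi in x]
        # Score vector (gradient of the log-likelihood).
        g0 = sum(ri - pi for ri, pi in zip(r, p))
        g1 = sum((ri - pi) * xi for ri, pi, xi in zip(r, p, x))
        # Information matrix (negative Hessian), a symmetric 2 x 2 matrix.
        h00 = sum(pi * (1 - pi) for pi in p)
        h01 = sum(pi * (1 - pi) * xi for pi, xi in zip(p, x))
        h11 = sum(pi * (1 - pi) * xi * xi for pi, xi in zip(p, x))
        det = h00 * h11 - h01 * h01
        # Newton step: add (information matrix)^-1 times the score.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Hypothetical binary auxiliary variable x and response indicator r.
x = [0, 0, 0, 0, 1, 1, 1, 1]
r = [1, 1, 1, 0, 1, 0, 0, 0]
b0, b1 = fit_logistic(x, r)

# Estimated response probabilities for x = 0 and x = 1.
p0 = 1.0 / (1.0 + math.exp(-b0))
p1 = 1.0 / (1.0 + math.exp(-(b0 + b1)))
print(round(p0, 4), round(p1, 4))  # 0.75 0.25
```

With a single binary covariate the fitted probabilities reproduce the observed response rates in each group (3/4 and 1/4 here), which is why grouping units with similar estimated response probabilities is one way to think about forming classes.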
Some population parameters of interest and their estimators

In practice, important parameters include the population total and/or the population mean of a particular variable. Consider a finite population $U$ of $N$ individuals. The population total of a variable of interest $y$ is given by $Y = \sum_{i \in U} y_i$. The population mean is defined as $\bar{Y} = Y / N$.

Suppose we select a random sample, $s$, of size $n$, according to a given sampling design $p(s)$. If we had complete response to the variable $y$, we could, for example, use the Horvitz-Thompson estimator of $Y$ given by $\hat{Y}_{\pi} = \sum_{i \in s} w_i y_i$, where $w_i = 1/\pi_i$ denotes the sampling weight attached to unit $i$ and $\pi_i$ denotes its inclusion probability in the sample. The estimator $\hat{Y}_{\pi}$ is unbiased for $Y$ with respect to the sampling design and we write $E_p(\hat{Y}_{\pi}) = Y$, where $E_p(\cdot)$ denotes the expectation with respect to the sampling design. An estimator of the population mean, $\hat{\bar{Y}}_{\pi}$, is obtained by dividing $\hat{Y}_{\pi}$ by $N$.

In the presence of nonresponse to item $y$, it is not possible to compute the estimator $\hat{Y}_{\pi}$ since some y-values are missing. We define an imputed estimator of $Y$ given by $\hat{Y}_I = \sum_{i \in s} w_i r_i y_i + \sum_{i \in s} w_i (1 - r_i) y_i^*$, where $r_i = 1$ if unit $i$ responded to item $y$ and $r_i = 0$ otherwise, and $y_i^*$ denotes the imputed value used to replace the missing value $y_i$.

In practice, estimates for various domains (subpopulations) are needed. For example, in the context of CCHS, estimates of the average BMI may be required by age-sex group or by province. Let $U_d \subseteq U$ be a domain of interest of size $N_d$. The domain mean, $\bar{Y}_d$, can be expressed as $\bar{Y}_d = \frac{1}{N_d} \sum_{i \in U} d_i y_i$, where $d_i$ is a domain indicator such that $d_i = 1$ if unit $i$ belongs to $U_d$ and $d_i = 0$ otherwise. In the absence of nonresponse, an asymptotically unbiased estimator of $\bar{Y}_d$ is given by $\hat{\bar{Y}}_d = \sum_{i \in s} w_i d_i y_i / \sum_{i \in s} w_i d_i$. That is, $E_p(\hat{\bar{Y}}_d) \approx \bar{Y}_d$. In the presence of nonresponse to item $y$, we define an imputed estimator by $\hat{\bar{Y}}_{Id} = \sum_{i \in s} w_i d_i \{ r_i y_i + (1 - r_i) y_i^* \} / \sum_{i \in s} w_i d_i$.
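The imputed estimators of a total and of a domain mean can be sketched in a few lines of Python. This is an illustrative sketch only: missing values are coded as None, the imputed values are assumed to have been produced beforehand by some imputation method, and all names and numbers are hypothetical.

```python
def complete(y, imputed):
    """Use the observed y_i when available and the imputed value y_i* otherwise."""
    return [yi if yi is not None else yi_star for yi, yi_star in zip(y, imputed)]

def imputed_total(weights, y, imputed):
    """Imputed estimator of the total: weighted sum over the completed data."""
    return sum(w * v for w, v in zip(weights, complete(y, imputed)))

def imputed_domain_mean(weights, y, imputed, in_domain):
    """Imputed estimator of a domain mean (ratio of two weighted sums)."""
    completed = complete(y, imputed)
    numerator = sum(w * d * v for w, d, v in zip(weights, in_domain, completed))
    denominator = sum(w * d for w, d in zip(weights, in_domain))
    return numerator / denominator

# Hypothetical sample of four units; the second unit did not report y.
weights = [10, 10, 20, 20]
y = [22.0, None, 27.0, 31.0]
imputed = [None, 25.0, None, None]   # imputed value for the nonrespondent only
in_domain = [1, 1, 0, 0]             # domain indicator d_i

print(imputed_total(weights, y, imputed))                   # 1630.0
print(imputed_domain_mean(weights, y, imputed, in_domain))  # 23.5
```

Setting every domain indicator to 1 gives the imputed estimator of the overall mean with the estimated population size in the denominator.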

A third parameter of interest is the proportion of individuals in the population who have a particular characteristic, for example, the proportion of the population that is overweight or obese. Note that, according to World Health Organization and Health Canada guidelines, a person is classified as overweight if his BMI is between 25.0 and 29.9 while a person with a BMI of 30.0 or more is classified as obese; see Shields and Tjepkema (2006), Tjepkema (2006) and Le Petit and Berthelot (2006).

In the absence of nonresponse, an estimate of the proportion of the population with a particular characteristic is given by $\hat{P} = \sum_{i \in s} w_i C_i / \sum_{i \in s} w_i$, where $C_i = 1$ if unit $i$ has the characteristic and $C_i = 0$ if unit $i$ does not have the characteristic. In the presence of nonresponse, the imputed estimator is defined as above, except that we are trying to impute a binary variable instead of a continuous variable (e.g., BMI).
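A weighted proportion can be computed as follows. In this sketch the indicator is derived from a complete BMI vector using the obesity cut-off of 30.0 quoted above; the function name and the data are hypothetical.

```python
def weighted_proportion(weights, indicator):
    """Estimate a population proportion as sum(w_i * C_i) / sum(w_i)."""
    return sum(w * c for w, c in zip(weights, indicator)) / sum(weights)

# Hypothetical complete BMI values and sampling weights.
weights = [10, 10, 20, 20]
bmi = [22.0, 25.0, 27.0, 31.0]
obese = [1 if b >= 30.0 else 0 for b in bmi]  # C_i = 1 if BMI >= 30.0

print(weighted_proportion(weights, obese))  # 20 / 60, about 0.333
```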
Imputation methods

In practice, many different imputation methods are used to fill in missing values. Descriptions of some of these methods are given in Kovar and Whitridge (1995), Beaumont (2001), Kalton (2003), Beaumont and Bocci (2005), and Haziza (2005).

Imputation classes

In practice, imputation is rarely performed at the overall sample level. Instead, imputation classes are formed and imputation is then performed independently within each class. Many methods are used to form imputation classes. The reader is referred to Little (1986), Eltinge and Yansaneh (1997), Haziza (2002), and Haziza and Beaumont (2007).
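Imputation within classes can be sketched as follows for two common methods: weighted mean imputation (deterministic) and weighted random hot deck (random). The class labels, weights and values are hypothetical, missing values are coded as None, and the sketch assumes that every class contains at least one respondent.

```python
import random
from collections import defaultdict

def impute_within_classes(y, classes, weights, method="mean", rng=None):
    """Fill missing y-values (None) using respondents from the same class.

    Assumes every class contains at least one respondent.
    """
    rng = rng or random.Random()
    donors = defaultdict(list)  # class -> list of (value, weight) for respondents
    for yi, c, w in zip(y, classes, weights):
        if yi is not None:
            donors[c].append((yi, w))

    completed = []
    for yi, c in zip(y, classes):
        if yi is not None:
            completed.append(yi)  # observed values are kept as they are
            continue
        values = [v for v, _ in donors[c]]
        wts = [w for _, w in donors[c]]
        if method == "mean":
            # Deterministic: weighted respondent mean of the class.
            completed.append(sum(v * w for v, w in zip(values, wts)) / sum(wts))
        else:
            # Random hot deck: draw one donor with probability proportional to its weight.
            completed.append(rng.choices(values, weights=wts, k=1)[0])
    return completed

# Hypothetical data: two classes, one nonrespondent in each.
y = [20.0, 24.0, None, 30.0, None, 34.0]
classes = ["F", "F", "F", "M", "M", "M"]
weights = [1, 3, 2, 1, 1, 1]

print(impute_within_classes(y, classes, weights, method="mean"))
# [20.0, 24.0, 23.0, 30.0, 32.0, 34.0]
print(impute_within_classes(y, classes, weights, method="hotdeck", rng=random.Random(1)))
```

Passing a seeded random.Random instance makes the hot deck draw reproducible, which is convenient when comparing strategies.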
Nonresponse bias

Nonresponse bias occurs when respondents and nonrespondents have different characteristics with respect to the variables measured in the survey. It is defined as the average difference between the imputed estimator and the estimator we would have obtained had complete response been observed. For a discussion of the nonresponse bias, see, for example, Haziza and Beaumont (2007), Haziza and Kuromi (2007) and Haziza (2005). The first objective of imputation is to reduce the nonresponse bias as much as possible. A secondary objective is to control the nonresponse variance (see the section on nonresponse and imputation variance) as much as possible. In order to do so, we need good auxiliary information to construct imputed values and/or imputation classes.

To reduce the nonresponse bias, it is important to identify a set of auxiliary variables that explain the variable being imputed as well as a set of auxiliary variables that explain the response probability to the variable being imputed; see, for example, Haziza and Rao (2006).

In fact, one can eliminate the nonresponse bias if: (i) the imputation model and/or the nonresponse model contain all the appropriate auxiliary variables (i.e., the models are correctly specified) and (ii) the nonresponse mechanism is ignorable. This is discussed in Beaumont (2002).
Nonresponse and imputation variance

In the presence of nonresponse, the observed sample size is smaller than the sample size initially planned, so nonresponse usually leads to estimators with a larger variance than would be attained if complete response were possible. This increase in variance is called the nonresponse variance. When a random imputation method is used (e.g., random hot deck imputation within classes), a third random mechanism is applied to randomly select the residuals. Thus, random imputation methods suffer from an additional component of variance (called the imputation variance) due to the use of a random imputation mechanism. For a discussion of the nonresponse and imputation variance, see, for example, Haziza and Beaumont (2007) and Haziza and Kuromi (2007).
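The imputation variance can be illustrated with a small simulation: holding the sample and the set of respondents fixed, random hot deck imputation is repeated many times and the variability of the resulting estimate is measured. The sketch below assumes equal weights and a single imputation class, and the respondent values are hypothetical.

```python
import random
import statistics

def hot_deck_mean(observed, n_missing, rng):
    """Mean of the completed sample after random hot deck (equal weights)."""
    imputed = [rng.choice(observed) for _ in range(n_missing)]
    return (sum(observed) + sum(imputed)) / (len(observed) + n_missing)

rng = random.Random(2008)
observed = [21.5, 23.0, 24.8, 26.1, 27.4, 29.9, 31.2, 33.0]  # hypothetical respondents
n_missing = 4                                                # number of nonrespondents

# Repeat the random imputation with the sample and the respondents held fixed.
replicates = [hot_deck_mean(observed, n_missing, rng) for _ in range(5000)]
imputation_variance = statistics.pvariance(replicates)

# Mean imputation is deterministic: it returns the same estimate every time.
respondent_mean = statistics.mean(observed)
mean_imputed = (sum(observed) + n_missing * respondent_mean) / (len(observed) + n_missing)

print(imputation_variance)  # strictly positive for random hot deck
print(mean_imputed)         # equal to the respondent mean, with no imputation variance
```

In this simple setting the imputation variance of the mean is the variance of the sum of the n_missing independent donor draws divided by the squared sample size, so the simulated value can be checked against that expression.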
Variance estimation

In recent years, variance estimation in the presence of imputed data has been widely treated in the literature. Before the 1990s, it was customary to treat the imputed values as if they were observed values. As a result, the published variance estimates were too low because they failed to account for the nonresponse variance and, in the case of random imputation methods, the imputation variance. Nowadays, many methods have been developed or adapted to take the nonresponse and imputation variance into account. Resampling methods such as the jackknife (Rao and Shao, 1992) and the bootstrap (Shao and Sitter, 1996) have been considered in the context of imputation. The reader is also referred to Särndal (1992), Rao (1996, 2003), Shao and Steel (1999), Haziza and Rancourt (2004), Matthews (2004), Haziza (2005) and Haziza (2007b).
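For reference, the naive approach described above (treating imputed values as if they were observed) amounts to applying the usual complete-data variance estimator to the completed data file. The sketch below uses the standard variance estimator of a total under stratified simple random sampling without replacement, $\sum_h N_h^2 (1 - n_h/N_h) s_h^2 / n_h$. The data and stratum sizes are hypothetical, and this is not one of the corrected methods cited above: it is expected to underestimate the true variance when some values are imputed.

```python
import statistics
from collections import defaultdict

def naive_variance_of_total(y_completed, strata, stratum_sizes):
    """Complete-data variance estimator of the estimated total under stratified
    simple random sampling, applied to completed data as if nothing were imputed."""
    by_stratum = defaultdict(list)
    for yi, h in zip(y_completed, strata):
        by_stratum[h].append(yi)

    variance = 0.0
    for h, values in by_stratum.items():
        N_h, n_h = stratum_sizes[h], len(values)
        s2_h = statistics.variance(values)  # sample variance, requires n_h >= 2
        variance += N_h ** 2 * (1 - n_h / N_h) * s2_h / n_h
    return variance

# Hypothetical completed data (observed and imputed values mixed together).
y_completed = [20.0, 24.0, 23.0, 30.0, 32.0, 34.0]
strata = ["A", "A", "A", "B", "B", "B"]
stratum_sizes = {"A": 30, "B": 60}

v_naive = naive_variance_of_total(y_completed, strata, stratum_sizes)
print(v_naive)  # approximately 5730
```

Comparing this value with the output of a method that accounts for the nonresponse and imputation variance shows how much the naive estimator understates the uncertainty.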
Some software

For carrying out single imputation and point estimation, one can use the modeling procedures of any standard software package, including SAS, WESVAR, SUDAAN, SPLUS and R. However, these packages do not perform correct variance estimation under single imputation. Students will therefore need to write their own code for variance estimation.
Multiple imputation

Multiple imputation, introduced by Rubin (1978, 1987), involves three distinct steps:

(1) Each missing value is replaced by M ≥ 2 imputed values, which leads to the creation of M completed data files.
(2) The M completed data files are then analyzed using standard SAS procedures.
(3) These results are then combined for inference.

For a good overview of multiple imputation, see, for example, Rubin (1996) and Little and Rubin (2002). In the context of survey sampling, it may be important to take the design features into account to define an appropriate imputation strategy. This is discussed in Reiter, Raghunathan, and Kinney (2006) and Little and Raghunathan (2007).

In SAS (version 9), two procedures are available to the user: PROC MI and PROC MIANALYZE (for a description of the procedures, see, for example, Haziza (2003)). Variance estimation for multiple imputation is also available in WESVAR. SOLAS is another software package for multiple imputation.
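The combining step of multiple imputation is usually carried out with Rubin's combining rules: the point estimate is the average of the M estimates, and the total variance is the average within-imputation variance plus (1 + 1/M) times the between-imputation variance. A minimal Python sketch follows; the five estimates and their variance estimates are hypothetical numbers, not results from the case study data.

```python
import statistics

def rubin_combine(estimates, variances):
    """Combine M point estimates and their variance estimates using Rubin's rules."""
    m = len(estimates)
    q_bar = statistics.mean(estimates)    # combined point estimate
    u_bar = statistics.mean(variances)    # within-imputation variance
    b = statistics.variance(estimates)    # between-imputation variance (divisor M - 1)
    total_variance = u_bar + (1 + 1 / m) * b
    return q_bar, total_variance

# Hypothetical results from M = 5 completed data files.
estimates = [26.0, 26.4, 25.8, 26.2, 26.1]
variances = [0.04, 0.05, 0.04, 0.05, 0.04]

q_bar, total_variance = rubin_combine(estimates, variances)
print(round(q_bar, 3), round(total_variance, 3))  # 26.1 0.104
```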

Research Question

In this section, we propose some guidelines for the students:

(a) Find a set of auxiliary variables related to the variable BMI. Validate your model (i.e., run some diagnostics to check that your model is reasonable).
(b) Find a set of auxiliary variables related to the probability of response to the variable BMI. Validate your model (i.e., run some diagnostics to check that your model is reasonable).
(c) If you wanted to estimate the population mean of the variable BMI, which imputation method would you use? Would you use a deterministic or a random imputation method? A weighted or an unweighted imputation method? Discuss your choice. How would you construct the imputation classes?
(d) Suppose you wanted to estimate the proportion of individuals in the population who are obese (i.e., whose BMI is greater than or equal to 30.0). Answer the same questions as in (c).
(e) Same questions as in (c), but the goal is to estimate the mean BMI by age-sex group. Pay attention to domains whose behaviour differs from that of the overall population.
(f) Estimate the variances of the estimates in (c), (d) and (e), treating the imputed values as if they had been observed. Then, estimate the variances using a variance estimation method that accounts for the nonresponse variance and the imputation variance (in the case of random imputation). Compare and discuss the results. You may also study the effect of using a random or a deterministic imputation method on variance estimation.
(g) Deterministic imputation methods (with the exception of nearest-neighbour imputation) tend to distort the distribution of the variable being imputed, whereas random imputation methods tend to preserve it. Study this aspect.
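As a starting point for studying the effect of imputation on the distribution of the imputed variable, one can compare the spread of the completed data under mean imputation and under random hot deck. The sketch below assumes equal weights and a single imputation class, and uses hypothetical respondent values.

```python
import random
import statistics

def complete_with_mean(observed, n_missing):
    """Mean imputation: every missing value is replaced by the respondent mean."""
    return observed + [statistics.mean(observed)] * n_missing

def complete_with_hot_deck(observed, n_missing, rng):
    """Random hot deck: every missing value is replaced by a randomly chosen donor."""
    return observed + [rng.choice(observed) for _ in range(n_missing)]

rng = random.Random(1)
observed = [21.5, 23.0, 24.8, 26.1, 27.4, 29.9, 31.2, 33.0]  # hypothetical respondents
n_missing = 8

mean_completed = complete_with_mean(observed, n_missing)
hot_deck_completed = complete_with_hot_deck(observed, n_missing, rng)

print(statistics.pstdev(observed))            # spread among respondents
print(statistics.pstdev(mean_completed))      # smaller: imputed values pile up at the mean
print(statistics.pstdev(hot_deck_completed))  # imputed values are actual respondent values
```

With mean imputation the variance of the completed data (computed with divisor n) is exactly the respondent variance multiplied by the response rate, which quantifies the distortion in this simple setting.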

Variables

Variable name	Variable Label (meaning)	Type of variable	Number of values
GEOEGPRV	Province of residence of respondent-(G)	nominal categorical	11
DHHEGAGE	Age - (G)	ordinal categorical	16
DHHE_SEX	Sex	nominal categorical	2
DHHEGMS	Marital status - (G)	nominal categorical	4
CCCE_011	Has food allergies	nominal categorical	2
CCCE_031	Has asthma	nominal categorical	2
CCCE_071	Has high blood pressure	nominal categorical	2
PACEDEE	Daily energy expenditure - (D)	continuous	N/A
PACEDPAI	Physical activity index - (D)	ordinal categorical	3
SMKE_202	Type of smoker	ordinal categorical	3
ETSE_10	Someone smokes inside home	nominal categorical	2
ALCEDTYP	Type of drinker - (D)	nominal categorical	4
ALCEDDLY	Average daily alcohol consumption - (D)	discrete continuous	N/A
INCEGHH	Total hhld inc. from all sources - (D,G)	ordinal categorical	5
HWTEGBMI	BMI / self-report - (D,G)	derived continuous	N/A
SelectionProb	Probability of Selection	continuous	N/A
SamplingWeight	Sampling Weight	derived continuous	N/A
AGE_GROUP		ordinal categorical	7

References

Beaumont J.-F. (2001). The connection between models and commonly used imputation methods, The Imputation Bulletin, vol 1, no 2.
Beaumont J.-F. (2002). When are we in the presence of nonignorable nonresponse?, The Imputation Bulletin, vol 2, no 1.
Beaumont J.-F. and Bocci (2005). Some Thoughts on Nearest-Neighbour Imputation, The Imputation Bulletin, vol 5, no 2.
Eltinge, J. L., and Yansaneh, I. S. (1997), “Diagnostics for formation of Nonresponse Adjustment Cells, With an Application to Income Nonresponse in the U.S. Consumer Expenditure Survey”, Survey Methodology, 23, pp. 33-40.
Haziza, D. (2002). Imputation classes, The Imputation Bulletin, vol 2, no 1.
Haziza, D. (2003). Proc MI and Proc MIANALYZE in SAS, The Imputation Bulletin, vol 3, no 2.
Haziza, D. (2005). Inférence en présence d’imputation simple dans les enquêtes: un survol, Journal de la Société Française de Statistique, 146, 69-118.
Haziza, D. (2007a). Échantillonnage. Course notes. Available at http://www.davidhaziza.com/index_fichiers
Haziza, D. (2007b). Frameworks for variance estimation in the presence of imputed data, The Imputation Bulletin, vol 7, no 1.
Haziza, D. and Beaumont, J.-F. (2007). On the construction of imputation classes in surveys. International Statistical Review, 75, 1, 25-43.
Haziza, D. and Kuromi, G. (2007), Handling item nonresponse in surveys. Journal of Case Studies in Business, Industry and Government Statistics, 1, 102-118.
Haziza, D. and Rancourt, E. (2004), Variance estimation under the two-phase imputation model approach, The Imputation Bulletin, vol 4, no 1.
Haziza, D. and Rao, J. N. K. (2006), A nonresponse model approach to inference under imputation for missing survey data, Survey Methodology, 32, 53-64.
Kalton, G. (2003). Imputation methods, The Imputation Bulletin, vol 3, no 1.
Kovar, J. G. and P. Whitridge (1995), “Imputation of Business Survey Data”, in B. Cox, D. Binder, A. Christianson, M. Colledge, and P. Kott (eds), Business Survey Methods, New York: Wiley, pp. 403-420.
Le Petit, C. and J-M Berthelot (2006). Obesity – a growing issue. Health Reports (Statistics Canada Catalogue 82-003) 17(3) 43-52. http://www.statcan.ca/cgi-bin/downpub/listpub.cgi?catno=82-003-XIE2005003
Little, R. J. A. (1986), “Survey Nonresponse Adjustments for Estimates of Means”, International Statistical Review, 54, pp. 139-157.
Little, R.J.A., and Rubin, D.B. (2002). Statistical Analysis with Missing Data, 2nd Edition. New York: John Wiley & Sons, Inc.
Little, R.J.A., and Raghunathan, T.E. (2007). Multiple imputation for missing data in surveys. The Imputation Bulletin, Vol. ?, no. ??, 12-12
Lohr, S.L. (1999). Sampling: Design and Analysis. Duxbury Press.
Matthews, S. (2004). The reverse approach to variance estimation from survey data with imputed values, The Imputation Bulletin, vol 4, no 1.
Rao, J.N.K. (2003). Variance estimation in the presence of imputation for item nonresponse, The Imputation Bulletin, vol 3, no 2.
Rao, J.N.K. (1996). On variance estimation with imputed survey data. Journal of the American Statistical Association, 91, 499-506.
Rao, J.N.K., and Shao, J. (1992). Jackknife variance estimation with survey data under hotdeck imputation. Biometrika, 79, 811-822.
Reiter, J.S., Raghunathan, T.E. and Kinney, S.K. (2006). The Importance of Modeling the Sampling Design in Multiple Imputation for Missing Data. Survey Methodology, 32, 143–149.
Rubin, D.B. (1978). Multiple imputations in sample surveys. Proceedings of the Survey Research Methods Section, American Statistical Association, 1978, 20-34.
Rubin, D.B. (1987). Multiple Imputation for Nonresponse in Surveys. New York: John Wiley & Sons, Inc.
Rubin, D.B. (1996). Multiple imputation after 18+ years. Journal of the American Statistical Association, 91, 473-489.
Särndal, C.E. (1992). Methods for estimating the precision of survey estimates when imputation has been used. Survey Methodology, 18, 241-252.
Särndal, C.E., Swensson, B. and Wretman, J. (1992). Model Assisted Survey Sampling. New York: Springer-Verlag.
Shao, J., and Steel, P. (1999). Variance estimation for survey data with composite imputation and nonnegligible sampling fractions. Journal of the American Statistical Association, 94, 254-265.
Shields, M. and M. Tjepkema (2006). Trends in adult obesity. Health Reports (Statistics Canada Catalogue 82-003) 17(3) 53-59. http://www.statcan.ca/cgi-bin/downpub/listpub.cgi?catno=82-003-XIE2005003
Tjepkema, M. (2006). Adult obesity. Health Reports (Statistics Canada Catalogue 82-003) 17(3) 9-25. http://www.statcan.ca/cgi-bin/downpub/listpub.cgi?catno=82-003-XIE2005003.