Case Study 1: Does survey design information matter?

2018

Data Source:

Canadian Health Measures Survey

Organizer: 

Tracey Bushnik

 

Background:

 

The Canadian Health Measures Survey (CHMS) is an ongoing national health survey that involves: 1) a household interview that gathers general demographic and socio-economic data and detailed health, nutrition and lifestyle information, and 2) an interview at a mobile examination clinic (MEC) where direct physical measurements are taken, including collection of blood and urine samples. The target population is people aged 3 to 79 living in the ten provinces. The CHMS is designed to produce reliable estimates at the national level for ages 3-5 with males and females combined, and by sex for the age groups 6-11, 12-19, 20-39, 40-59 and 60-79.
 
The CHMS requires the MEC to travel around the country to collect the direct measures, resulting in a unique sample design in which survey respondents are selected from dwellings within collection sites within regions.  The dataset provided for this case study reflects data collected from individuals aged 20 to 79, from 16 collection sites within 5 regional strata: 2 sites from the Atlantic region, 4 sites from the Quebec region, 6 sites from the Ontario region, 2 sites from the Prairies region, and 2 sites from the British Columbia region.
 
While the small number of sampled collection sites can produce national baseline prevalence estimates, it has the drawback of leaving at most 11 “degrees of freedom” (note 1) for variance estimation. Limited degrees of freedom have several consequences for analysis and inference (note 2), in particular:
  • The variability of the variance estimates themselves needs to be taken into account when doing analyses.
  • Estimated covariance matrices of vectors of estimates (such as the vector of estimated coefficients of a model) could be singular or close to singular, and thus possibly not invertible (note 3).
  • It may not be possible to calculate some test statistics.
  • The usual asymptotic distributions of many test statistics may not hold when there are only a small number of primary sampling units (PSUs), in this case the collection sites in the sample, even when the total sample size is large.
  • The number of parameters in a regression model is limited to 10, as one degree of freedom is used to estimate the intercept.
  • Analytical methods that are less affected by the limited degrees of freedom, or that are conservative, should be considered, such as Satterthwaite-adjusted statistics or Bonferroni tests.
The CHMS produces a survey weight and 500 bootstrap weights, the former to produce estimates that are representative of the Canadian population, and the latter for appropriate variance estimation given the CHMS’ complex survey design.
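As a rough illustration of how the survey weight and the bootstrap weights fit together, the sketch below computes a weighted prevalence, a bootstrap standard error, and a t-based confidence interval using 11 degrees of freedom. It rests on several assumptions that are not stated in this case study: the variance is taken as the mean squared deviation of the replicate estimates from the full-sample estimate, the critical value 2.201 is the two-sided 95% point of Student's t with 11 degrees of freedom, the function names are hypothetical, and the toy numbers are invented purely to show the mechanics.

```python
import math

# Two-sided 95% critical value of Student's t with 11 degrees of freedom
# (16 collection sites - 5 regional strata). Hardcoded to avoid a SciPy dependency.
T_CRIT_95_DF11 = 2.201


def weighted_prevalence(flags, weights):
    """Weighted proportion of records with flag == 1; records with flag None are skipped."""
    num = sum(w for f, w in zip(flags, weights) if f == 1)
    den = sum(w for f, w in zip(flags, weights) if f is not None)
    return num / den


def bootstrap_ci(flags, full_weights, replicate_weights, t_crit=T_CRIT_95_DF11):
    """Return (estimate, standard error, (lower, upper)).

    replicate_weights is a list of B weight vectors (for example BSW1..BSW500),
    each the same length as flags.
    """
    estimate = weighted_prevalence(flags, full_weights)
    replicates = [weighted_prevalence(flags, w) for w in replicate_weights]
    # Assumed variance formula: average squared deviation from the full-sample estimate.
    variance = sum((r - estimate) ** 2 for r in replicates) / len(replicates)
    se = math.sqrt(variance)
    return estimate, se, (estimate - t_crit * se, estimate + t_crit * se)


# Toy example with invented numbers: 4 respondents, 2 replicate weight vectors.
toy_flags = [1, 0, 1, 0]
toy_full = [10, 20, 30, 40]
toy_reps = [[20, 20, 30, 30], [10, 30, 20, 40]]
est, se, ci = bootstrap_ci(toy_flags, toy_full, toy_reps)
```

With the real file, the same function would be called with the hypertension indicator, WGT_FULL, and the 500 BSW columns. Replacing the t critical value with the normal value 1.96 shows how much narrower the interval looks when the limited degrees of freedom are ignored.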
 

Research Question: 

The aim of this case study is to assess the impact of using and not using survey design information when producing estimates for the Canadian population from the CHMS. To accomplish this, participants are asked to use the synthetic CHMS data to estimate the prevalence of hypertension in Canada and to identify selected risk factors associated with it.
 
Questions to consider:
  • Which risk factors are associated with hypertension? Do these associations hold with and without the survey design information (survey weight, bootstrap weights, specifying the 11 degrees of freedom)?
  • Does the prevalence of hypertension, and of selected risk factors, vary between men and women? Across age groups? How does your interpretation of these results change when the analysis is run with and without the survey design information?
  • How would you summarize the impact of including and not including the survey design information in your analysis? Did you see a greater impact for certain estimates than for others?
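One way to start on these questions is to put unweighted and weighted prevalence side by side for each group. The sketch below is a minimal illustration, assuming the records have already been read into dictionaries with numeric values. The function name is hypothetical and the toy rows are invented, not taken from the synthetic file.

```python
from collections import defaultdict


def prevalence_by_group(rows, group_key, outcome="HIGHBP", weight="WGT_FULL"):
    """Return {group: (unweighted, weighted)} prevalence of outcome == 1.

    HIGHBP is coded 1 = yes, 2 = no; rows with any other value (for example
    a missing value stored as None) are excluded from both estimates.
    """
    # Per group: [count of yes, count of valid rows, weight of yes, total weight]
    acc = defaultdict(lambda: [0.0, 0.0, 0.0, 0.0])
    for row in rows:
        y = row.get(outcome)
        if y not in (1, 2):
            continue
        a = acc[row[group_key]]
        w = row[weight]
        if y == 1:
            a[0] += 1
            a[2] += w
        a[1] += 1
        a[3] += w
    return {g: (a[0] / a[1], a[2] / a[3]) for g, a in acc.items()}


# Invented toy rows, only to show that the two estimates can differ.
toy_rows = [
    {"CLC_SEX": 1, "HIGHBP": 1, "WGT_FULL": 100.0},
    {"CLC_SEX": 1, "HIGHBP": 2, "WGT_FULL": 300.0},
    {"CLC_SEX": 2, "HIGHBP": 1, "WGT_FULL": 200.0},
    {"CLC_SEX": 2, "HIGHBP": 1, "WGT_FULL": 200.0},
    {"CLC_SEX": 2, "HIGHBP": 2, "WGT_FULL": 400.0},
    {"CLC_SEX": 2, "HIGHBP": None, "WGT_FULL": 500.0},
]
by_sex = prevalence_by_group(toy_rows, "CLC_SEX")
```

The same call with an age-group key in place of CLC_SEX would address the age-group comparison. The age grouping itself would need to be derived from CLC_AGE first.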

Variables: 

 

Data source:
Provided for this case study is a synthetic data file that represents Cycle 3 of the CHMS. It includes 3,060 records for individuals aged 20 to 79. There are missing values for some variables for some respondents. Although the overall distribution of values for each variable resembles that of the actual CHMS data, please note that results produced from this synthetic file are themselves synthetic.

Synthetic file

Number of records: 3,060

Number of variables: 509

 

Variable name: Description

CLINICID: Unique record identifier.

SMK_12: Current smoking status (1 daily; 2 occasional; 3 non-smoker).

CLC_SEX: Sex at clinic visit (1 male, 2 female).

CLC_AGE: Age in years at clinic visit (20 to 79).

HWMDBMI: Body mass index in kg/m², based on measured height and weight. Range of values: 11.56 to 49.35.

HIGHBP: Categorized hypertensive (1 yes, 2 no). A respondent is categorized as hypertensive if they have SBP >= 140 mmHg or DBP >= 90 mmHg, or are treated for hypertension (taking medication and/or having been diagnosed as hypertensive by a medical professional in the past 6 months).

LAB_BCD: Blood cadmium in nmol/L. Range of valid values: 0.71 to 47. The value 999.5 indicates the value was below the limit of detection (LOD) for that respondent.

LAB_BHG: Blood mercury in nmol/L. Range of valid values: 2.1 to 100. The value 999.5 indicates the value was below the LOD for that respondent.

WGT_FULL: Survey weight.

BSW1 to BSW500: Bootstrap weights.
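Before any estimation, the coded values in the variable list need to be handled, in particular the 999.5 sentinel for blood cadmium and mercury, which would badly distort a mean if treated as a real measurement. The sketch below shows one possible cleaning step. It assumes missing values appear as empty cells (the case study does not say how they are coded), treats any non-numeric text as missing, and leaves below-LOD values as missing with a separate flag rather than imputing them, since the LOD values themselves are not given here. The function name and the sample rows are hypothetical.

```python
import csv
import io

BELOW_LOD = 999.5  # sentinel described in the variable list


def clean_record(raw):
    """Convert one CSV row (a dict of strings) into analysis-ready values."""

    def num(name):
        # Assumption: empty or non-numeric cells mean "missing".
        text = (raw.get(name) or "").strip()
        try:
            return float(text)
        except ValueError:
            return None

    rec = {"CLINICID": raw.get("CLINICID"), "WGT_FULL": num("WGT_FULL")}
    highbp = num("HIGHBP")
    # Recode 1 = yes, 2 = no into a 1/0 indicator.
    rec["hypertensive"] = None if highbp is None else int(highbp == 1)
    for lab in ("LAB_BCD", "LAB_BHG"):
        value = num(lab)
        below = value == BELOW_LOD
        rec[lab + "_below_lod"] = None if value is None else below
        rec[lab] = None if (value is None or below) else value
    return rec


# Invented sample rows in the layout assumed for the .csv file.
sample = (
    "CLINICID,HIGHBP,LAB_BCD,LAB_BHG,WGT_FULL\n"
    "1,1,999.5,4.2,1500\n"
    "2,2,3.1,,2000\n"
)
cleaned = [clean_record(r) for r in csv.DictReader(io.StringIO(sample))]
```

How to treat below-LOD values in the analysis (exclusion, substitution, or a censored-data method) is an analytical choice that this sketch deliberately leaves open.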

 

Data Access: 

The dataset has been provided in a .csv file. Please email lisa.lix@umanitoba.ca if you would like the data as a .zip file. 
 

Organizer: 

Tracey Bushnik
Senior Research Analyst
Health Analysis Division
Statistics Canada
 

Email: tracey.bushnik@canada.ca
Phone: 613 854-7906


  1. “Degrees of freedom” is being used as a generic term to reflect the amount of information used to estimate variances and covariances. An approximation often used for the value of “degrees of freedom” is (number of PSUs) - (number of strata). For Cycle 3 of the CHMS, the 16 collection sites are the PSUs and there are 5 regions (strata), resulting in 11 degrees of freedom (16 - 5). This is only an approximation, and it gives the maximum value the degrees of freedom can take.
  2. In particular, confidence intervals and tests of hypotheses.
  3. Invertible covariance matrices are needed to perform Wald tests on vectors of parameters.
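The arithmetic in note 1 can be written out directly from the site counts given in the Background section. The short sketch below only restates those figures; the variable names are illustrative.

```python
# Collection sites (PSUs) per regional stratum, as listed in the Background section.
sites_per_region = {
    "Atlantic": 2,
    "Quebec": 4,
    "Ontario": 6,
    "Prairies": 2,
    "British Columbia": 2,
}

n_psus = sum(sites_per_region.values())   # number of collection sites
n_strata = len(sites_per_region)          # number of regions
max_df = n_psus - n_strata                # approximate upper bound on degrees of freedom
max_model_parameters = max_df - 1         # one degree of freedom goes to the intercept
```

These are the values to supply when a survey procedure asks for the denominator degrees of freedom rather than deriving them from the sample size.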

Data Files: