Case Study 1: Does survey design information matter?
- The variability of the variance estimates of estimated quantities needs to be taken into account when doing analyses.
- Estimated covariance matrices of vectors of estimates (such as the vector of estimated coefficients of a model) could be singular or close to singular, and thus possibly not invertible.
- It may not be possible to calculate some test statistics.
- The usual asymptotic distributions of many test statistics may not hold when there are only a small number of primary sampling units (PSUs), which here are the collection sites in the sample, even when the total sample size is large.
- The number of parameters in a regression model is limited to 10, as one degree of freedom is used to estimate the intercept.
- Analytical methods that are less affected by the limited degrees of freedom, or that are conservative, should be considered, such as Satterthwaite-adjusted statistics or Bonferroni tests.
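As an illustration of the conservative approach mentioned in the last point, the following is a minimal sketch of a Bonferroni adjustment: each of m tests is compared against alpha / m instead of alpha. The function name and the p-values are hypothetical and are not taken from the CHMS data.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Return one decision per test: True if the p-value is below alpha / m."""
    m = len(p_values)
    if m == 0:
        return []
    threshold = alpha / m  # Bonferroni-adjusted significance level
    return [p < threshold for p in p_values]


# Hypothetical p-values for three tests (illustrative only, not real results).
p_values = [0.001, 0.02, 0.04]
decisions = bonferroni_reject(p_values)
print(decisions)
```

With three tests the adjusted threshold is 0.05 / 3, so a p-value of 0.02 that would pass an unadjusted 0.05 cut-off is no longer declared significant.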
Research Questions:
- Which risk factors are associated with hypertension? Do these associations hold with and without the survey design information (the survey weight, the bootstrap weights, and the specification of 11 degrees of freedom)?
- Does the prevalence of hypertension and selected risk factors vary between men and women? Across age groups? How does your interpretation of these results change when the analysis is run with and without the survey design information?
- How would you summarize the impact of including and not including the survey design information in your analysis? Did you see a greater impact on certain estimates than on others?
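To make the "with and without survey design information" comparison concrete, here is a minimal sketch of a prevalence estimate computed unweighted, then with the survey weight, with a standard error derived from bootstrap weights. It assumes the common convention that the bootstrap variance is the average squared deviation of the replicate estimates from the full-sample estimate; the case study does not state the formula, so treat that as an assumption. The toy data and function names are hypothetical; in the real file the weights would come from WGT_FULL and BSW1 to BSW500.

```python
import math


def weighted_prevalence(outcome, weights):
    """Weighted proportion of records coded 1 (yes); missing outcomes are skipped."""
    num = den = 0.0
    for y, w in zip(outcome, weights):
        if y is None:
            continue
        num += w * (1 if y == 1 else 0)
        den += w
    return num / den


def bootstrap_se(outcome, full_weights, replicate_weights):
    """Full-sample estimate and bootstrap standard error (assumed formula)."""
    theta = weighted_prevalence(outcome, full_weights)
    reps = [weighted_prevalence(outcome, w) for w in replicate_weights]
    # Assumption: variance = mean squared deviation from the full-sample estimate.
    variance = sum((r - theta) ** 2 for r in reps) / len(reps)
    return theta, math.sqrt(variance)


# Toy data using the HIGHBP coding (1 yes, 2 no); values are illustrative only.
highbp = [1, 2, 2, 2]
wgt_full = [40.0, 20.0, 20.0, 20.0]
bsw = [[60.0, 20.0, 10.0, 10.0], [20.0, 20.0, 30.0, 30.0]]  # stand-ins for BSW1, BSW2

unweighted = weighted_prevalence(highbp, [1.0] * len(highbp))
theta, se = bootstrap_se(highbp, wgt_full, bsw)
print(unweighted, theta, se)
```

A confidence interval built from this standard error would use a t distribution with 11 degrees of freedom rather than the normal distribution; that step is omitted here to keep the sketch free of third-party dependencies.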
Variables:
Data source:
Provided for this case study is a synthetic data file that represents Cycle 3 of the CHMS. It includes 3,060 records for individuals aged 20 to 79. Some variables have missing values for some respondents. Although the overall distribution of values for each variable resembles that of the actual CHMS data, please note that this synthetic file produces synthetic results.
Synthetic file
Number of records: 3,060
Number of variables: 509
Variable name | Description |
CLINICID | Unique record identifier. |
SMK_12 | Current smoking status: 1 daily; 2 occasional; 3 non-smoker. |
CLC_SEX | Sex at clinic visit: 1 male; 2 female. |
CLC_AGE | Age in years at clinic visit: 20 to 79. |
HWMDBMI | Body mass index in kg/m², based on measured height and weight. Range of values: 11.56 to 49.35. |
HIGHBP | Categorized as hypertensive: 1 yes; 2 no. A respondent is categorized as hypertensive if he/she has SBP >= 140 mmHg or DBP >= 90 mmHg, or is treated for hypertension (taking medication and/or having been diagnosed as hypertensive by a medical professional in the past 6 months). |
LAB_BCD | Blood cadmium in nmol/L. Range of valid values: 0.71 to 47. The value 999.5 indicates that the value was below the limit of detection (LOD) for that respondent. |
LAB_BHG | Blood mercury in nmol/L. Range of valid values: 2.1 to 100. The value 999.5 indicates that the value was below the LOD for that respondent. |
WGT_FULL | Survey weight. |
BSW1 to BSW500 | Bootstrap weights. |
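Before any analysis, the coded values in the table above need to be handled explicitly, otherwise 999.5 would be treated as a real concentration. The following is a minimal sketch of one way to recode a record. The output names `hypertensive` and `*_below_lod` are hypothetical, and how to treat below-LOD values (for example, substitution) is left open because the case study does not specify it.

```python
LOD_CODE = 999.5  # documented code for "below the limit of detection"


def recode_record(row):
    """Return a copy of a raw record with analysis-friendly codes.

    `row` is a dict keyed by the variable names in the table above.
    """
    out = dict(row)
    # HIGHBP: 1 yes, 2 no -> 1/0 indicator; anything else (missing) -> None.
    out["hypertensive"] = {1: 1, 2: 0}.get(row.get("HIGHBP"))
    for var in ("LAB_BCD", "LAB_BHG"):
        value = row.get(var)
        below = value == LOD_CODE
        out[var + "_below_lod"] = below  # hypothetical flag name
        if below:
            out[var] = None  # do not treat the code as a measured value
    return out


# Illustrative record, not taken from the synthetic file.
example = recode_record({"HIGHBP": 1, "LAB_BCD": 999.5, "LAB_BHG": 12.0})
print(example)
```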
Data Access:
The dataset has been provided in a .csv file. Please email lisa.lix@umanitoba.ca if you would like the data as a .zip file.
Organizer:
Tracey Bushnik
Senior Research Analyst
Health Analysis Division
Statistics Canada
Email: tracey.bushnik@canada.ca
Phone: 613 854-7906
- “Degrees of freedom” is used here as a generic term for the amount of information available to estimate variances and covariances. A common approximation is the number of PSUs minus the number of strata. For Cycle 3 of the CHMS, the 16 collection sites are the PSUs and the 5 regions are the strata, giving 11 degrees of freedom (16 - 5). This value is only an approximation and represents the maximum number of degrees of freedom.
- The analyses affected are, in particular, confidence intervals and tests of hypotheses.
- Invertible covariance matrices are needed to perform Wald tests on vectors of parameters.
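The degrees-of-freedom approximation described in the notes above can be written out as a short calculation. This is only a sketch of the arithmetic stated in the text (PSUs minus strata, with one degree of freedom reserved for the intercept); the function and variable names are hypothetical.

```python
def approx_degrees_of_freedom(n_psus, n_strata):
    """Approximate (maximum) design degrees of freedom: PSUs minus strata."""
    return n_psus - n_strata


# CHMS Cycle 3 values given in the text: 16 collection sites, 5 regions.
df = approx_degrees_of_freedom(n_psus=16, n_strata=5)

# One degree of freedom goes to the intercept, leaving this many other parameters.
max_other_parameters = df - 1

print(df, max_other_parameters)
```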