Acknowledgment: We acknowledge Mr. François Brisebois (Methodology Branch, Statistics Canada) for his support in arranging open access data for all participants. We also thank Dr. Pingzhao Hu (Department of Biochemistry and Medical Genetics, University of Manitoba) and Mr. Brisebois for their feedback in preparing this case study.
Background: The Canadian Community Health Survey (CCHS) is a nationwide cross-sectional survey. This survey gathers health-related data for the Canadian population 12 years of age and over living in the 10 provinces and 3 territories, covering about 97% of the target population. In this case study, we will use Public Use Microdata Files (PUMF) from cycles 1.1, 2.1 and 3.1 that contain data collected in years 2000-2001, 2003 and 2005, respectively. Various measures were taken to protect the confidentiality of the participants of the survey.
The survey sampling weight provided corresponds to the number of individuals in the target population represented by the respondent. Incorporating these weights ensures an appropriate representation of the covered population, so they must be used to produce meaningful statistical estimates. Due to confidentiality concerns, only survey weights are made available in the PUMF; neither design information nor bootstrap weights for estimating variances are provided. Using the survey weights will give correct point estimates, but in the absence of bootstrap weights and the necessary design information, variability measures calculated assuming simple random sampling will not be accurate and will often be underestimated.
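To illustrate why the weights matter for point estimates, here is a minimal Python sketch comparing a weighted and an unweighted prevalence. The records and weights are made-up toy values, not CCHS data, and the function names are hypothetical.

```python
# Hypothetical toy records: (has_heart_disease, sampling_weight).
# The weights are invented for illustration; they are not PUMF values.
records = [
    (True, 150.0),
    (False, 300.0),
    (False, 50.0),
    (True, 500.0),
]


def weighted_prevalence(rows):
    """Weighted proportion of rows whose indicator is True."""
    total_weight = sum(weight for _, weight in rows)
    positive_weight = sum(weight for flag, weight in rows if flag)
    return positive_weight / total_weight


def unweighted_prevalence(rows):
    """Simple sample proportion, ignoring the weights."""
    return sum(1 for flag, _ in rows if flag) / len(rows)


weighted = weighted_prevalence(records)      # (150 + 500) / 1000
unweighted = unweighted_prevalence(records)  # 2 / 4
print(f"weighted: {weighted:.2f}, unweighted: {unweighted:.2f}")
```

The two estimates differ because respondents represent different numbers of people. This sketch says nothing about variance; as noted above, standard errors computed under simple random sampling would likely be too small.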
This case study aims to familiarize participants with analyzing the PUMF version of the CCHS dataset (combined data from cycles 1.1, 2.1 and 3.1). To that end, participants are asked to use this PUMF data to first create an ‘analytic dataset’ (with only the appropriate variables and records useful for analyses from all cycles), and then use that dataset to estimate crude and adjusted measures of association between osteoarthritis and self-reported heart diseases.
Questions to consider:
- Among Canadian adults (20-64 years of age), is having osteoarthritis associated with developing heart disease? For the purpose of this case study, assume that, from the literature, we know that the following variables are risk factors for the outcome and confounders in the above relationship: age, sex, ethnicity, education, household income, body mass index (BMI), access to a regular medical doctor, smoking habit, alcohol drinking habit, high blood pressure, and diabetes. Also, assume that physical activity is suspected to be an intermediate factor between osteoarthritis and heart disease.
- Does the relationship between osteoarthritis and heart disease vary (a) between participants living in the northern parts of Canada versus those living in the southern parts, (b) between men and women, (c) by marital status, or (d) by recency of immigration?
- Do the results change when missing values (i.e., invalid responses) for the ‘household income’ are imputed? Which assumptions do you have to make to perform such an analysis?
- With the information provided in the PUMF, what would be your interpretation of the analysis results? What are the limitations of this study? What additional information would be helpful in reaching a more meaningful conclusion?
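For the first two questions, a crude weighted odds ratio and stratum-specific odds ratios are one possible starting point. The sketch below is a minimal Python illustration on made-up records; the tuple layout, the function name and the weights are assumptions for illustration only. Adjusted estimates would need a regression model with the listed confounders, which is not shown here.

```python
from collections import defaultdict

# Hypothetical toy records: (exposed, outcome, stratum, weight).
# "exposed" stands for osteoarthritis, "outcome" for heart disease.
rows = [
    (True, True, "male", 10.0),
    (True, False, "male", 30.0),
    (False, True, "male", 5.0),
    (False, False, "male", 45.0),
    (True, True, "female", 20.0),
    (True, False, "female", 20.0),
    (False, True, "female", 10.0),
    (False, False, "female", 40.0),
]


def weighted_odds_ratio(data):
    """Odds ratio from the weighted 2x2 table of exposure by outcome.

    Raises ZeroDivisionError if an off-diagonal cell has zero weight.
    """
    cells = defaultdict(float)
    for exposed, outcome, _, weight in data:
        cells[(exposed, outcome)] += weight
    a = cells[(True, True)]    # exposed, has outcome
    b = cells[(True, False)]   # exposed, no outcome
    c = cells[(False, True)]   # unexposed, has outcome
    d = cells[(False, False)]  # unexposed, no outcome
    return (a * d) / (b * c)


crude_or = weighted_odds_ratio(rows)

# Stratum-specific odds ratios, a first look at effect modification.
strata = sorted({row[2] for row in rows})
or_by_stratum = {
    s: weighted_odds_ratio([row for row in rows if row[2] == s])
    for s in strata
}
print(crude_or, or_by_stratum)
```

Stratum-specific odds ratios that differ noticeably hint at effect modification, but without design information their precision cannot be assessed reliably from the PUMF alone.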
In order to create an ‘analytic dataset,’ reviewing the corresponding data documentation (e.g., data dictionary, topical index and user guide associated with the data) for further details of the following variables is strongly recommended (e.g., check ‘Universe’). It is often a good idea to cross-tabulate variables with the ‘Age’ variable (from the same cycle) to double-check whether the question was restricted to particular age groups. Similarly, cross-tabulating with the ‘Province’ variable often helps identify variables that were created from an ‘optional CCHS component.’ Note that, unless stated otherwise in the research question, some of the following variables may not be relevant for the relationship of interest. Also, there is no identifying information in cycles 1.1, 2.1 and 3.1 of this public-use data that would enable us to identify whether the same person was surveyed in multiple cycles. Therefore, for the purposes of this study (for simplicity), we will assume that the lists of subjects surveyed in different cycles were different.
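The cross-tabulation check described above can be sketched as follows, ahead of the variable list. This is a minimal Python illustration, assuming each record has already been reduced to an (age group, response label) pair; the age labels, the data and the function names are hypothetical.

```python
from collections import Counter

# Hypothetical toy records: (age_group, response_label) for one variable.
rows = [
    ("12-14", "NOT APPLICABLE"),
    ("15-19", "NOT APPLICABLE"),
    ("20-24", "YES"),
    ("20-24", "NO"),
    ("25-29", "NO"),
    ("25-29", "NOT STATED"),
]


def crosstab(pairs):
    """Count records in each (age_group, response) cell."""
    return Counter(pairs)


def groups_not_asked(table, not_asked="NOT APPLICABLE"):
    """Age groups in which every response carries the 'not asked' label."""
    totals = Counter()
    skipped = Counter()
    for (age_group, response), count in table.items():
        totals[age_group] += count
        if response == not_asked:
            skipped[age_group] += count
    return sorted(g for g in totals if skipped[g] == totals[g])


table = crosstab(rows)
restricted = groups_not_asked(table)
print(restricted)
```

Age groups that appear in `restricted` suggest the question was outside their universe. The same idea applies to the ‘Province’ variable when checking for optional components.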
Variable names in the 3 cycles:

| Variable concept | CCHS 1.1 | CCHS 2.1 | CCHS 3.1 | Comments (see notes below) |
|---|---|---|---|---|
| Has heart disease | CCCA_121 | CCCC_121 | CCCE_121 | Outcome. Only “YES” and “NO” are considered valid responses. (1) |
| Has arthritis or rheumatism | CCCA_051 | CCCC_051 | CCCE_051 | Those who answered ‘NO’ are considered as ‘NOT APPLICABLE’ in the next variable ‘Kind of arthritis’. |
| Kind of arthritis | CCCA_05A | CCCC_05A | CCCE_05A | Useful for creating the exposure variable. Response “OSTEOARTHRITIS” will create the exposed group, and “NOT APPLICABLE” will create the unexposed group. (2) |
| Age | DHHAGAGE | DHHCGAGE | DHHEGAGE | Recode into categories that make sense and apply to all 3 cycles. (3) |
| Sex | DHHA_SEX | DHHC_SEX | DHHE_SEX | |
| Marital status | DHHAGMS | DHHCGMS | DHHEGMS | Recode into categories that make sense and apply to all 3 cycles. (1) |
| Cultural / racial origin | SDCAGRAC | SDCCGRAC | SDCEGCGT | (1) |
| Immigrant status | SDCAFIMM | SDCCFIMM | SDCEFIMM | Those who answered ‘NO’ are considered as ‘NOT APPLICABLE’ in the next variable ‘Length of time in Canada since immigration’. (4) |
| Length of time in Canada since immigration | SDCAGRES | SDCCGRES | SDCEGRES | (4) |
| Highest level of education - respondent | EDUADR04 | EDUCDR04 | EDUEDR04 | Recode into categories that make sense and apply to all 3 cycles. (1) |
| Total household income from all sources | INCAGHH | INCCGHH | INCEGHH | Recode into categories that make sense and apply to all 3 cycles. (1) |
| Body mass index | HWTAGBMI | HWTCGBMI | HWTEGBMI | Recode into 3 categories: underweight (<18.5), healthy weight (between 18.5 and 25), overweight (>25). (1) |
| Physical activity index | PACADPAI | PACCDPAI | PACEDPAI | (1) |
| Has a regular medical doctor | TWDA_5 | HCUC_1AA | HCUE_1AA | (1) |
| Type of smoker | SMKADSTY | SMKCDSTY | SMKEDSTY | Recode into categories that make sense and apply to all 3 cycles. (1) |
| Type of drinker | ALCADTYP | ALCCDTYP | ALCEDTYP | Recode into categories that make sense and apply to all 3 cycles. (1) |
| Has high blood pressure | CCCA_071 | CCCC_071 | CCCE_071 | (1) |
| Has diabetes | CCCA_101 | CCCC_101 | CCCE_101 | (1) |
| Has emphysema or chronic obstructive pulmonary disease (COPD) | CCCA_91B | CCCC_91B | CCCE_91F | (1) |
| Daily consumption - total fruits and vegetables | FVCADTOT | FVCCDTOT | FVCEDTOT | Recode into 3 categories: 0-3, 4-6 and 6+ daily servings. (1) |
| Self-perceived stress | GENA_07 | GENC_07 | GENE_07 | Recode into categories that make sense and apply to all 3 cycles. (1) |
| Province | GEOAGPRV | GEOCGPRV | GEOEGPRV | Recode Northwest Territories, Nunavut, Yukon as ‘north’ and the rest of the provinces/territories as ‘south’. (1) |
| Sampling weight - master weight | WTSAM | WTSC_M | WTSE_M | Divide them by 3 to get a nationally representative sample (on average). |
Notes:
- (1) The following are considered invalid responses, and hence may be treated as missing values, unless otherwise stated: “NOT APPLICABLE”, “DON'T KNOW”, “REFUSAL”, “NOT STATED”. For a complete case analysis, all of these records could be excluded from the study.
- (2) Responses “RHEUMATOID ARTHRITIS” and “OTHER” will be excluded from the study.
- (3) According to the study eligibility criteria, the study will be restricted to participants 20-64 years of age.
- (4) Useful for creating immigration status (potential categories: “not immigrant,” “recent immigrant,” “immigrated more than 10 years ago”).
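The recoding rules in the table and notes can be sketched as follows. This is a minimal Python illustration for three of the variables only. It assumes each record is a dictionary keyed by the cycle-specific variable name and that responses are stored as the text labels quoted above (the raw files may use numeric codes instead, so check the data dictionary). The harmonized names and function names are hypothetical.

```python
# Cycle-specific names come from the variable table; the harmonized
# names on the left ("heart_disease", ...) are hypothetical.
NAME_MAP = {
    "heart_disease": {"1.1": "CCCA_121", "2.1": "CCCC_121", "3.1": "CCCE_121"},
    "arthritis_kind": {"1.1": "CCCA_05A", "2.1": "CCCC_05A", "3.1": "CCCE_05A"},
    "weight": {"1.1": "WTSAM", "2.1": "WTSC_M", "3.1": "WTSE_M"},
}


def recode_outcome(label):
    """Only YES and NO are valid; anything else is treated as missing."""
    return {"YES": True, "NO": False}.get(label)


def recode_exposure(label):
    """OSTEOARTHRITIS = exposed, NOT APPLICABLE = unexposed, else excluded."""
    return {"OSTEOARTHRITIS": True, "NOT APPLICABLE": False}.get(label)


def harmonize(record, cycle):
    """Return an analytic row, or None if the record should be excluded."""
    outcome = recode_outcome(record[NAME_MAP["heart_disease"][cycle]])
    exposure = recode_exposure(record[NAME_MAP["arthritis_kind"][cycle]])
    if outcome is None or exposure is None:
        return None
    return {
        "cycle": cycle,
        "heart_disease": outcome,
        "osteoarthritis": exposure,
        # Pooling 3 cycles: divide the master weight by 3, as in the table.
        "weight": record[NAME_MAP["weight"][cycle]] / 3,
    }


# Made-up example records, one kept and one excluded.
kept = harmonize(
    {"CCCA_121": "YES", "CCCA_05A": "OSTEOARTHRITIS", "WTSAM": 300.0}, "1.1"
)
dropped = harmonize(
    {"CCCE_121": "NO", "CCCE_05A": "RHEUMATOID ARTHRITIS", "WTSE_M": 90.0}, "3.1"
)
print(kept, dropped)
```

A full analytic dataset would extend the same mapping to the remaining variables and apply the 20-64 age restriction from note (3).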
Data and Documentation Files: https://www.dropbox.com/sh/dntqkl6wv54ypop/AACPOf6pnGh4sgithHJRQyYYa?dl=1.
The zip files (inside the download) contain the unedited original Statistics Canada public-use data files. The data files are also provided in RData format (converted from the original datasets), suitable for opening in R. Note that, other than the format of the data, the content and the corresponding documentation are unedited. These data files should be the same as the data that can be accessed through the Data Liberation Initiative (DLI) member universities. Associated documentation (e.g., data dictionary, topical index and user guide associated with the data) and licence agreements are provided in the respective 'documentation' folders within the zip files. Finally, note that any use of these files must comply with the terms and conditions of the Statistics Canada Open Licence (link: https://www.statcan.gc.ca/eng/reference/licence). Please consult the licence agreements provided here before downloading.
Number of records and variables:
| | CCHS 1.1 | CCHS 2.1 | CCHS 3.1 |
|---|---|---|---|
| Number of records | 130,880 | 134,072 | 132,221 |
| Number of variables | 614 | 1,068 | 1,284 |
Data Access Issues: Please email ehsan.karim@ubc.ca (with subject line “SSC2019 Case study”) if you are having trouble accessing the data, documentation or licence agreements.