Gender Gap in Earnings among Young People


Data Source: 

the Data Liberation Initiative (DLI)


Lenka Mach or Karla Fox of Statistics Canada



The data are provided through the Data Liberation Initiative (DLI), which allows faculty and students at participating postsecondary institutions to access numerous Statistics Canada public-use microdata files.

For more information about DLI, please go to:

Please address queries to Lenka Mach or Karla Fox of Statistics Canada.


The differences in employment and earnings between men and women have been studied by many Canadian labour economists, e.g., Christofides and Swidinsky (1994), Baker et al. (1995), Drolet (2001). Frenette and Coulombe (2007) examined the gender gap in labour market outcomes for young people aged 25 to 29 using the Canadian Census of Population long-form data for the years 1981, 1991 and 2001. The main objective of their study is to assess whether the rising educational attainment among young women is having any impact on the gender gap in full-time employment and earnings. Drolet (2011) re-examines the differences in pay between men and women by comparing the hourly wages for full-time workers from 1988 to 2008.

The analysis of the gender gap typically starts with producing estimates of descriptive statistics for specific domains, followed by testing hypotheses about these statistics. For example, Frenette and Coulombe (2007) estimate employment rates for males and females using a sample of labour force participants aged 25 to 29 and then examine the difference between them. Drolet (2001, Table 2) estimates annual earnings for full-year, full-time workers for different domains defined by age, education, etc., and examines the female-male earnings ratios. As this comparison does not ensure that equal quantities of work are being compared, Drolet (2011) estimates hourly wages for full-time men and women and analyses the female-to-male hourly wage ratio and gap. While these authors compare point estimates heuristically, a statistician would also be interested in testing hypotheses to establish whether the population parameters differ between the two genders. Luong (2010) uses a sample of individuals aged 20 to 45 who are not attending school from the 2007 cross-sectional SLID data and tests whether the employment status of a group of interest differs from that of the reference group.

If data on individual factors influencing labour market outcomes are available, then fitting regression models can be very useful when analysing the gender gap in employment and earnings. These factors can help explain the differences we see in the outcomes. For example, Frenette and Coulombe (2007) use logit and probit models to predict the probability of full-time employment and linear regression to model log earnings. In these models, they use explanatory variables such as education, marital status and place of residence (Frenette and Coulombe, 2007, Table 1). Drolet (2001, Appendix 1) also uses regression to model the log hourly wage using the 1997 SLID data, which include a measure of labour market experience, the full-year full-time equivalent (FYFTE), and other important wage-determining characteristics.
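As a rough illustration of the log-earnings regression described above, the following Python sketch fits a survey-weighted least squares model with an intercept, years of education and a female indicator. Everything in it is hypothetical: the data are generated without noise from made-up coefficients, and the variable names are not SLID PUMF names. Because the generated data fit the model exactly, the weights do not change the estimates here; with real data, weighted and unweighted fits would generally differ.

```python
import math

def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: check for collinear regressors")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def weighted_least_squares(rows, y, weights):
    """Return beta solving (X'WX) beta = X'Wy for design rows X and weights W."""
    k = len(rows[0])
    xtwx = [[sum(w * r[i] * r[j] for r, w in zip(rows, weights))
             for j in range(k)] for i in range(k)]
    xtwy = [sum(w * r[i] * yi for r, yi, w in zip(rows, y, weights))
            for i in range(k)]
    return solve_linear_system(xtwx, xtwy)

# Hypothetical people: (years of education, female indicator, earnings, weight).
# Earnings follow log(earnings) = 10 + 0.08 * years - 0.15 * female exactly.
people = []
for female in (0, 1):
    for years, weight in ((12, 150.0), (14, 300.0), (16, 250.0), (18, 100.0)):
        earnings = math.exp(10.0 + 0.08 * years - 0.15 * female)
        people.append((years, female, earnings, weight))

design_rows = [[1.0, float(years), float(female)] for years, female, _, _ in people]
log_earnings = [math.log(e) for _, _, e, _ in people]
weights = [w for _, _, _, w in people]

intercept, educ_coef, female_coef = weighted_least_squares(
    design_rows, log_earnings, weights)
print(f"intercept={intercept:.3f} education={educ_coef:.3f} female={female_coef:.3f}")
```

The coefficient on the female indicator is the adjusted gap in log earnings after controlling for education in this toy model.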

When the goal is to identify the underlying causes of gender differences in the labour market, labour economists often use decomposition procedures. These methods, attributed to Blinder (1973) and Oaxaca (1973), partition the gender gap into two components: one that is explained by differences between the female and male wage-determining characteristics (the explained component), and a second that is due to different effects of these characteristics (the unexplained component). The unexplained component is often used as a measure of discrimination, but it also includes the effects of group differences in unobserved predictors. In the simplest case, the decomposition is of this form:
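One common way to write the twofold case is sketched below. It assumes that the male coefficients are taken as the reference wage structure; that choice, and the notation, are assumptions made here for illustration, and other reference choices are also used in practice.

```latex
\bar{Y}_M - \bar{Y}_F
  = \underbrace{(\bar{X}_M - \bar{X}_F)^{\top} \hat{\beta}_M}_{\text{explained}}
  + \underbrace{\bar{X}_F^{\top} (\hat{\beta}_M - \hat{\beta}_F)}_{\text{unexplained}}
```

Here $\bar{Y}_M$ and $\bar{Y}_F$ are the mean log wages of men and women, $\bar{X}_M$ and $\bar{X}_F$ are the vectors of mean characteristics (including the constant), and $\hat{\beta}_M$ and $\hat{\beta}_F$ are the coefficients estimated from separate regressions for each group. The identity follows because a least squares fit with an intercept passes through the group means, so that $\bar{Y}_g = \bar{X}_g^{\top} \hat{\beta}_g$ for each group $g$.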


The Blinder-Oaxaca procedure and other decomposition methods are discussed in Frenette and Coulombe (2007) and Drolet (2011). SAS and Stata procedures for these decompositions are also available.
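To make the idea concrete, here is a minimal Python sketch of a twofold Blinder-Oaxaca decomposition with a single characteristic (years of education). It is unweighted, uses the male coefficients as the reference, and runs on made-up log wages; all of these are simplifying assumptions for illustration, not the method of any of the cited studies.

```python
def ols_line(x, y):
    """Fit y = intercept + slope * x by ordinary least squares.

    Returns (intercept, slope, mean of x, mean of y).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope, mean_x, mean_y

# Hypothetical data: years of education and log hourly wage by sex.
educ_m = [12, 12, 14, 16, 16]
logw_m = [2.90, 3.00, 3.10, 3.30, 3.35]
educ_f = [12, 14, 16, 16, 18]
logw_f = [2.75, 2.90, 3.05, 3.10, 3.25]

a_m, b_m, xbar_m, ybar_m = ols_line(educ_m, logw_m)
a_f, b_f, xbar_f, ybar_f = ols_line(educ_f, logw_f)

# Raw gap in mean log wages (male minus female).
gap = ybar_m - ybar_f

# Explained: difference in mean characteristics, valued at male coefficients.
explained = (xbar_m - xbar_f) * b_m

# Unexplained: difference in coefficients (intercept and slope),
# evaluated at the female mean characteristics.
unexplained = (a_m - a_f) + xbar_f * (b_m - b_f)

print(f"gap={gap:.4f} explained={explained:.4f} unexplained={unexplained:.4f}")
```

The two components sum to the raw gap by construction. A design-based version would replace the unweighted means and regressions with survey-weighted ones.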

Many of the above discussed studies use data for multiple reference periods and examine the trend in the gender gap over time. Other studies examine the gender gap for one reference period; for example Christofides and Swidinsky (1994) use the 1989 Labour Market Activity Survey and Drolet (2001) uses the 1997 SLID data.

When researchers are using data collected by a complex survey, they must determine whether to use a design-based or model-based approach for their analysis. The choice depends on the objective of their analysis and factors such as the informativeness and the ignorability of the design.

A description of the design-based and model-based approaches is provided here.


The cross-sectional Public-Use Microdata File (PUMF) for the Survey of Labour and Income Dynamics (SLID) will be used for this case study. The cross-sectional PUMF files for SLID are available for the reference years 1996-2008 through the Data Liberation Initiative (DLI). SLID collects data for families as well as for individuals and hence four different SLID PUMF files are created for every reference year. The person file will be used for this study.

The metadata, including the User’s Guide, data dictionary and univariate distributions for collected variables, for all SLID files in the DLI collection are available here:

Sample design of SLID

The samples for SLID are selected from the monthly Labour Force Survey (LFS) and thus inherit the LFS design. The LFS sample is based on a stratified, two-stage design that uses probability sampling. Each province is divided into large geographic strata (economic regions) for which reliable estimates are required. Each stratum consists of smaller geographic areas, called clusters, or primary sampling units (PSUs). In the first stage of sampling, a sample of these PSUs is selected from within each stratum. In each selected PSU, all dwellings are first listed. Then, in the second stage of sampling, a sample of dwellings is selected in each selected PSU. The residents in the selected dwellings form the LFS sample. The cross-sectional SLID sample consists of roughly 34,000 households which rotated out of the LFS sample a few years before. 
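The stratified, two-stage selection described above can be sketched in Python as follows. This is a simplified illustration only: the frame is made up, and both stages use simple random sampling with equal-sized PSUs, which is an assumption made here and not a description of the actual LFS selection procedure.

```python
import random

def select_two_stage_sample(frame, psus_per_stratum, dwellings_per_psu, seed=0):
    """Select a stratified two-stage sample.

    frame maps stratum -> {psu id -> list of dwelling ids}.
    Returns a list of (stratum, psu, dwelling, design weight) tuples.
    """
    rng = random.Random(seed)
    sample = []
    for stratum, psus in frame.items():
        # Stage 1: simple random sample of PSUs within the stratum.
        chosen_psus = rng.sample(sorted(psus), psus_per_stratum)
        stage1_weight = len(psus) / psus_per_stratum
        for psu in chosen_psus:
            dwellings = psus[psu]  # all dwellings listed in the selected PSU
            # Stage 2: simple random sample of dwellings within the PSU.
            chosen_dwellings = rng.sample(dwellings, dwellings_per_psu)
            stage2_weight = len(dwellings) / dwellings_per_psu
            for dwelling in chosen_dwellings:
                # Design weight = inverse of the overall inclusion probability.
                sample.append((stratum, psu, dwelling, stage1_weight * stage2_weight))
    return sample

# Hypothetical frame: 2 strata, 10 PSUs per stratum, 50 dwellings per PSU.
frame = {
    stratum: {
        f"{stratum}-psu{p}": [f"{stratum}-psu{p}-dw{d}" for d in range(50)]
        for p in range(10)
    }
    for stratum in ("stratum_A", "stratum_B")
}

sample = select_two_stage_sample(frame, psus_per_stratum=2, dwellings_per_psu=5)
total_weight = sum(weight for _, _, _, weight in sample)
print(len(sample), "dwellings selected; weights sum to", total_weight)
```

Dwellings from the same PSU tend to resemble one another, which is why variance estimation for such a design needs the stratum and cluster identifiers.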

Using Public-Use Microdata File for design-based analysis

The survey weight is provided on the public-use microdata file (PUMF) for producing design-consistent point estimates. See Sections 4 and 5 of the User’s Guide (User’s Guide for Cross-Sectional Public-Use Microdata File: Survey of Labour and Income Dynamics) for guidelines on producing reliable point estimates.
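As a minimal sketch of how the survey weight enters a point estimate, the following Python computes weighted mean earnings by sex and the female-to-male earnings ratio. The records, the field names (`sex`, `earnings`, `weight`) and the values are made up for illustration and are not the SLID PUMF variable names.

```python
def weighted_mean(values, weights):
    """Weighted mean: sum(w * y) / sum(w)."""
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(w * y for w, y in zip(weights, values)) / total_weight

def domain_mean(records, sex):
    """Weighted mean earnings for the domain defined by sex."""
    subset = [r for r in records if r["sex"] == sex]
    return weighted_mean([r["earnings"] for r in subset],
                         [r["weight"] for r in subset])

# Hypothetical person records with survey weights.
records = [
    {"sex": "F", "earnings": 30000.0, "weight": 100.0},
    {"sex": "F", "earnings": 40000.0, "weight": 300.0},
    {"sex": "M", "earnings": 50000.0, "weight": 200.0},
    {"sex": "M", "earnings": 40000.0, "weight": 200.0},
]

mean_f = domain_mean(records, "F")
mean_m = domain_mean(records, "M")
ratio = mean_f / mean_m  # female-to-male earnings ratio
print(f"women: {mean_f:.0f}, men: {mean_m:.0f}, ratio: {ratio:.3f}")
```

Ignoring the weights would give an unweighted mean of 35,000 for women in this toy example instead of 37,500, which shows how unequal weights can move a point estimate.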

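To estimate the variance of these estimators, additional design information (e.g., stratum and cluster identifiers) is required but is not currently available on the SLID PUMF because of confidentiality considerations. In this situation, analysts can use design effects, if available, to approximate the design-based variance by the design effect multiplied by the variance computed under the assumption of simple random sampling.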
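The following Python sketch shows how an assumed design effect changes a test of the difference between two employment rates. It inflates the simple-random-sampling variance of each proportion by the design effect and treats the two groups as independent. The rates, sample sizes and design effects are hypothetical, and the independence assumption is a simplification.

```python
import math

def srs_variance_of_proportion(p, n):
    """Variance of a sample proportion under simple random sampling."""
    return p * (1.0 - p) / n

def z_statistic(p1, n1, p2, n2, deff=1.0):
    """z statistic for p1 - p2, with the SRS variance multiplied by deff."""
    variance = deff * (srs_variance_of_proportion(p1, n1)
                       + srs_variance_of_proportion(p2, n2))
    return (p1 - p2) / math.sqrt(variance)

# Hypothetical employment rates and sample sizes for men and women.
p_men, n_men = 0.85, 2000
p_women, n_women = 0.80, 2000

z_by_deff = {}
for deff in (1.0, 2.0, 20.0):
    z_by_deff[deff] = z_statistic(p_men, n_men, p_women, n_women, deff)
    significant = abs(z_by_deff[deff]) > 1.96  # two-sided test at the 5% level
    print(f"deff={deff:>4}: z={z_by_deff[deff]:.2f}, significant={significant}")
```

With these made-up inputs, the difference is significant at the 5% level for design effects of 1 and 2 but not for a design effect of 20, since the z statistic shrinks by the square root of the design effect.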


Research Question: 

The students can consider the following questions:

1.    Is there a difference in the employment or earnings between young men and young women (aged 25 to 34)?

a.    Does the conclusion differ under a design-based or a model-based approach?

b.    If you assume different design effects, for example 2 or 20, do your results change?

c.     Is the design ignorable?

2.    If there are any differences, can they be attributed to explanatory variables?

a.    What factors are the most important?

b.    Does weighting change the inference?

3.    Can we decompose the gap?

a.    What are the explained and unexplained portions?




Using the cross-sectional Public-Use Microdata File (PUMF) for the Survey of Labour and Income Dynamics (SLID), analyse the gender gap in employment and earnings for young people (aged 25 to 34) living in the ten Canadian provinces. Learn about design-based analysis and examine the differences between the design-based and model-based approaches.


Data Access: 

Students can obtain access to the SLID PUMF files from the DLI contact person at their university.

To identify the DLI contact person in your university, visit

If your university is not participating in DLI, please contact Lenka Mach to arrange access to the data.

Data Files: 



Baker, M., Benjamin, D., Desaulniers, A. and Grant, M. (1995). The distribution of the male/female earnings differential, 1970-1990. Canadian Journal of Economics. 28, 3: 479–501.

Barnard, G. A. (1971). Discussion of the paper by Professor Godambe and Dr. Thompson. (Bayes, Fiducial and Frequency Aspects of Regression Analysis in Survey-sampling.) Journal of the Royal Statistical Society, Series B, 33, 3, 361-390.

Binder, D.A. and Roberts, G.R. (2001). Can Informative Designs be Ignorable? Survey Research Methods Section Newsletter, Issue 12. American Statistical Association.

Binder, D.A. and Roberts, G.R. (2003), Design-based and Model-based Methods for Estimating Model Parameters. In Analysis of Survey Data (R.L. Chambers and C.J. Skinner, eds.), pp. 29-48, Chichester: Wiley.

Binder, D.A., Kovacevic, M.S. and Roberts, G.R. (2005). How important is the informativeness of the sample design? Proceedings of the Survey Methods Section, Annual Meeting of the Statistical Society of Canada in Saskatoon. (accessed December 22, 2010).

Blinder, A.S. (1973). Wage discrimination: Reduced form and structural estimates. Journal of Human Resources. 8, 4: 436–455.

Christofides, L.N. and Swidinsky, R. (1994). Wage determination by gender and visible minority status: Evidence from the 1989 LMAS. Canadian Public Policy. 21, 1: 34–51.

Drolet, M. (2001). The Persistent Gap: New Evidence on the Canadian Gender Wage Gap. Statistics Canada Catalogue no. 11F0019MPE No. 157. Ottawa, Ontario. Analytical Studies Branch Research Paper Series. (accessed December 30, 2010).

Drolet, M. (2011). Why has the gender wage gap narrowed? Perspectives on Labour and Income. Statistics Canada Catalogue no. 75-001-X. (accessed December 30, 2010).

Frenette, M. and Coulombe, S. (2007). Has Higher Education among Young Women Substantially Reduced the Gender Gap in Employment and Earnings? Statistics Canada Catalogue no. 11F0019MIE - No.301. Ottawa, Ontario. Analytical Studies Branch Research Paper Series. (accessed December 10, 2010).

Heeringa, S.G., West, B.T. and Berglund, P.A. (2010). Applied Survey Data Analysis. Chapman & Hall/CRC.

Kish, L. (1965). Survey Sampling. New York: John Wiley.

Kish, L. (1995). Methods for Design Effects. Journal of Official Statistics, 11, 55-77.

Korn, E.L., and Graubard, B.I. (1990). Simultaneous testing of regression coefficients with complex survey data: Use of Bonferroni t-statistics. American Statistician, 44, 270-276.

Korn, E.L., and Graubard, B.I. (1998) Confidence intervals for proportions with small expected number of positive counts estimated from survey data. Survey Methodology, 24, 193-201.

Korn, E.L., and Graubard, B.I. (1999), Analysis of Health Surveys, New York: Wiley.

Lohr, S. (1999). Sampling: Design and Analysis. Duxbury Press.

Luong, M. (2010). The financial impact of student loans. Perspectives on Labour and Income. Statistics Canada Catalogue no. 75-001-X. (accessed December 9, 2010).

Oaxaca, R.L. (1973). Male-female wage differentials in urban labor markets. International Economic Review. 14, 3: 693–709.

Park, I. and Lee, H. (2004). Design Effects for the Weighted Mean and Total Estimators Under Complex Survey Sampling. Survey Methodology, 30, 183-193.

Pfeffermann, D. (1993). The Role of Sampling Weights when Modeling Survey Data. International Statistical Review, vol. 61, no. 2, pp. 317-337.

Pfeffermann, D. (1996). The Use of Sampling Weights for Survey Data Analysis. Statistical Methods in Medical Research, 5, 239-261.

Pfeffermann, D. and Sverchkov, M. (2003). Fitting Generalized Linear Models Under Informative Sampling. In Analysis of Survey Data (R.L. Chambers and C.J. Skinner, eds.), 175-195, Chichester: Wiley.

Rao, J.N.K. and Scott, A.J. (1981). The analysis of categorical data from complex sample surveys: Chi-squared tests for goodness of fit and independence in two-way tables. Journal of the American Statistical Association, 76, 221-230.

Rao, J.N.K. and Scott, A.J. (1984). On chi-squared tests for multi-way tables with cell proportions estimated with survey data. Annals of Statistics, 12, 46-60.