Case Studies in Data Analysis Poster Competition 2015

There will be a poster competition for Case Studies in Data Analysis during the Annual Meeting in Halifax, June 14-17, 2015. One award will be presented for the best poster in each of the two case studies. The value of the award for each case study in the 2015 competition is $500, with the expectation that this award is shared equally among the members of each winning team. The Committee of the Award for Case Studies in Data Analysis will consider the quality of both the analysis of the data and the presentation of the results in reaching its decision. The Committee reserves the right to decline to make an award for a given case study if the number of entries is insufficient.

The case studies are intended for teams of graduate and/or senior undergraduate students working either with or without faculty mentors. Each participating team will choose to analyse one of the two data sets described below and the teams will present summaries of the methods they used and the results of their analyses in a juried poster presentation session at the Annual Meeting. The date and time of the judging will be communicated to all teams in advance of the Annual Meeting.

Registration for the Case Studies Competition

Interested teams should send an email to Georges Monette indicating the names and email addresses of all team members, and the selected case study. The deadline for registration for the Case Studies Competition is April 30, 2015.

Note that at least one member of each team needs to register for and attend the SSC Annual Meeting. Also note that the deadline for the "early-bird" discount on registration fees is April 15, 2015.


Case Study 1: Youth Employment and the 2008 Crisis

Data provider: Statistics Canada. Labour Force Survey through the Data Liberation Initiative: LFS (2002-2014)

Organizer: Heather Krause, Datassist


The financial crisis of 2008 has been said (Tancer, 2012) to have caused widespread youth unemployment and may have had a long-term effect on relative employment demand in different industries.

Research Questions

Question 1: Based on your analysis, is this true across the board in Canada? How has the impact differed for young men and young women across diverse industries? To what extent has youth employment recovered within various industries?

Question 2: You have a daughter in high school. Based on the trends revealed in your analysis, what career advice do you give her? Is this advice the same or different for her twin brother?

Data access

The Statistics Canada Labour Force Survey (2015) compiles data on variables that allow an analysis of the effects of the crisis. Almost all Canadian universities participate in Statistics Canada’s Data Liberation Initiative (DLI) which gives students access to longitudinal microdata from the Labour Force Survey. A data set comprising a selection of variables from 2002 to 2014 has been compiled and is available to students enrolled at universities participating in the DLI by contacting Georges Monette. You may also obtain additional variables, if you wish, through the DLI representative at your university. See the list of participating institutions and their representatives.




Case Study 2: Baseball Strategies

Provenance of Data: Two R packages, pitchRx and Lahman, and the Retrosheet play-by-play data

Organizer: Dave Campbell


Of all sports, baseball has probably generated the most extensive and complex statistical analyses. Albert (2010) proposes a number of questions to explore with data obtained by combining three readily available data sets. Two data sets are available through packages in R (R Core Team, 2014): the Lahman package (Friendly, 2014) and the pitchRx package (Sievert, 2014). A third data set with play-by-play data, Retrosheet (Pankin, 2015), can be downloaded for use in R following instructions provided by Albert (2014).

The goal of this case study is to use these data sets to explore one of two questions proposed in a personal communication by Green (2015).


Each team chooses which of the two questions it wishes to explore and should present results on only one question.

Question 1: What is the relative value to a team of a (non-pitcher) player's hitting and defence? How much would a batter's offence have to improve to compensate (in terms of runs scored/allowed or wins/losses) for weaker defence in the field? This will change depending on how critical the position is. (Green, 2015)

Question 2: What is the impact (on runs scored or wins) of the order of the batting lineup? Traditionally the fastest player (with a high batting average) hits first. A strong contact hitter (low strikeouts) bats second. Third and fourth are "power" hitters who hit lots of home runs. Some have argued against this tradition and claim that you should simply put your "best" overall hitter first, your "second best" hitter second, and so forth, on the justification that your best hitters will come to bat more often, on average, over the course of a season. How to measure the "best" and "second best" overall hitter engenders some controversy as well. (Green, 2015)


[1] Jim Albert (2010). Baseball Data at Season, Play-by-Play, and Pitch-by-Pitch Levels. Journal of Statistics Education, 18. Retrieved from

[2] Jim Albert (2014). Exploring Baseball Data with R. Retrieved from

[3] Christopher Green (2015). Personal communication.

[4] Michael Friendly (2014). Lahman: Sean Lahman's Baseball Database. R package version 3.0-1.

[5] Mark Pankin (2015). Retrosheet. Retrieved from

[6] R Core Team (2014). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL

[7] Carson Sievert (2014). Taming PITCHf/x Data with {pitchRx} and {XML2R}. The R Journal, 6(1). Retrieved from

[8] Carson Sievert (2014). pitchRx: Tools for Harnessing MLBAM Gameday data and Visualizing PITCHf/x. R package version 1.6.