Case Studies in Data Analysis Poster Competition 2014

There will be a poster competition for Case Studies in Data Analysis during the Annual Meeting in Toronto, May 25-28, 2014. One award will be presented for the best poster in each of the two case studies. The value of the award for each case study in the 2014 competition is $500, with the expectation that the award will be shared equally among the members of the winning team. In reaching its decision, the Committee of the Award for Case Studies in Data Analysis will consider the quality of both the analysis of the data and the presentation of the results. The Committee reserves the right to decline to make an award for a given case study if the number of entries is insufficient.

The case studies are intended for teams of graduate and/or senior undergraduate students working either with or without faculty mentors. Each participating team will choose to analyse one of the two data sets described below and the teams will present summaries of the methods they used and the results of their analyses in a juried poster presentation session at the Annual Meeting. The date and time of the judging will be communicated to all teams in advance of the Annual Meeting.

Interested teams should send an email to Georges Monette (georges+ssc@yorku.ca) indicating the names of all team members and the selected case study. The deadline for registration is March 31, 2014.

Case Study 1: The American Time Use Survey: How have economic and socio-demographic factors affected television viewing in the last decade?

Case Study 2: Player interaction data from a mobile social game: Exploring interactions between game factors and player factors that predict engagement

Case Study 1: The American Time Use Survey: How have economic and socio-demographic factors affected television viewing in the last decade?

Data provider: U.S. Bureau of Labor Statistics, American Time Use Survey

Organizer: Heather Krause, Datassist, Toronto

Description of the ATUS:

The American Time Use Survey is the culmination of a design and development effort that lasted nearly ten years, including a pilot study in 1997 and full-scale field testing in 2002 (Horrigan and Herz, 2005). The ATUS uses a random sample drawn from households that have recently completed their participation in the Current Population Survey (CPS). Thus, for example, a household that had been included in the CPS in January through April 2002 (Month-in-Sample 1–4) and January through April 2003 (Month-in-Sample 5–8) was eligible for inclusion in the June, July or August 2003 ATUS.

Sample households are selected based on the characteristics of the CPS reference person, and the respondent is then randomly selected from the list of adult (age 15 or older) household members. All adults within a household have the same probability of being selected. During 2003, the ATUS collected over 1,700 diaries per month. Beginning in January 2004, the sample size was reduced to approximately 1,100 per month, a rate that is expected to continue indefinitely.

The American Time Use Survey is administered using computer-assisted telephone interviewing, rather than paper diaries as in many other countries. All ATUS respondents are assigned an initial diary day and are called on the following day. If the respondent is unavailable on that day, subsequent contact attempts are made on the same day of subsequent weeks. This procedure maintains the proportional assignment of respondents to days of the week.

The core time diary of the ATUS is very similar to other time-budget surveys. The respondent is asked to take the interviewer through his or her day from 4 AM through 4 AM of the following day (the interview day). The respondent describes each activity, which the interviewer either records verbatim or, for a limited set of commonly performed activities (such as sleeping or watching television), hits a precode button. The verbatim responses are coded to a three-tier scheme, going from top-level category of activity, to sub-categories, to descriptions of very specific actions that together are considered to comprise a single third-tier activity.

Only the respondent’s primary activity is recorded and coded; if the respondent mentions secondary activities performed simultaneously, these are recorded but are not included in the total time inputs and are not classified using the three-tier scheme. For each episode, the ATUS collects either the ending time or the duration of the activity. In addition, for each activity the survey asks where the respondent was and with whom, unless the activity is sleeping or grooming (neither location nor with whom is asked) or working at a job (only location is asked). The “who” codes for household members refer to specific individuals.
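As a minimal sketch of how primary-activity episodes in a 4 AM to 4 AM diary might be totalled, the Python below uses a hypothetical episode layout (start time, stop time, activity label). This is not the ATUS file format, and the activity labels and times are invented for illustration. Secondary activities are left out, in line with the description above.

```python
from datetime import datetime, timedelta

def episode_minutes(start, stop):
    """Minutes between two HH:MM clock times; the episode may cross midnight."""
    fmt = "%H:%M"
    t0 = datetime.strptime(start, fmt)
    t1 = datetime.strptime(stop, fmt)
    if t1 <= t0:  # stop time falls on the next calendar day
        t1 += timedelta(days=1)
    return int((t1 - t0).total_seconds() // 60)

def minutes_by_activity(episodes):
    """Sum episode lengths per primary activity."""
    totals = {}
    for start, stop, activity in episodes:
        totals[activity] = totals.get(activity, 0) + episode_minutes(start, stop)
    return totals

# Hypothetical diary covering 4 AM to 4 AM of the following day
diary = [
    ("04:00", "07:00", "sleeping"),
    ("07:00", "08:00", "grooming"),
    ("08:00", "17:00", "working"),
    ("17:00", "20:00", "household activities"),
    ("20:00", "23:30", "watching television"),
    ("23:30", "04:00", "sleeping"),
]
totals = minutes_by_activity(diary)
print(totals["watching television"])  # 210
print(sum(totals.values()))           # 1440, a full day
```

Checking that the primary activities sum to 1,440 minutes is a simple consistency test for any diary reconstructed this way.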

Person-level Variables Included in the ATUS:

Labour Force Status
Income
Earnings
Gender
Race
Marital Status
Age
Region
State
Household Demographics
Height
Weight
Body Mass Index
Education
And, of course – lots of time use variables!

Research Questions

What effect does the economy have on the amount of time spent watching TV and playing video games? Does this vary by gender? Does this vary by labour force participation? Does this vary by income? What are the strongest sociodemographic predictors of time spent watching TV?

Exploratory question: What activities have been replaced by increased time spent on TV and on video games?

Data access

ATUS microdata from 2003 to 2012 can be downloaded from the U.S. Bureau of Labor Statistics at http://www.bls.gov/tus/.

Case Study 2: Analysis of player interaction data from a mobile social game: Finding interactions between game factors and player factors that predict engagement

Data provider: Uken Games

Organizer: Alex Yakubovich, Uken Games, Toronto

Background

The data are from a social mobile game. Once a user downloads the game from the app store, they go through a tutorial, and then progress through a number of stages. While the game is free, users have the option of making in-app purchases. A user has some chance of winning each round, but even if they don't win they can accrue some in-game currency. Once users acquire enough currency, they can proceed to the next stage. If users connect their game account to Facebook, they can also send/receive gifts amongst their friends.

Measurements

We measure three target variables for each user: revenue, engagement (minutes played), and retention (whether the player comes back after a certain number of days). We also record when different events occur within the game – for example, when the user makes different in-app purchases, when they send or receive gifts, or when they unlock different achievements within the game (e.g., reach a level or win a prize). Finally, for some users, we have demographic data, such as gender, country, and whether they connect their game account to their Facebook account.

Analysis

There are some interesting aspects to these data. First, you will notice that the distributions of revenue and engagement are very heavy-tailed, so it doesn't make sense to talk about an 'average user'. Most users don't become spending players, and most spending players don't make big purchases. However, the outliers make up a large part of the revenue, and in a sense subsidize the game for everyone else.
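To illustrate why an 'average user' can mislead when a distribution is heavy-tailed, the following sketch compares the mean with the median and with the share of revenue contributed by the top spenders. The revenue figures are made up and are not the competition data; the function name `top_share` is hypothetical.

```python
from statistics import mean, median

def top_share(values, fraction):
    """Share of the total contributed by the top `fraction` of values."""
    ordered = sorted(values, reverse=True)
    k = max(1, round(len(ordered) * fraction))
    total = sum(ordered)
    return sum(ordered[:k]) / total if total else 0.0

# Made-up revenue: 90 non-spenders, 9 small spenders, 1 large spender
revenue = [0.0] * 90 + [2.0] * 9 + [182.0]

print(mean(revenue))             # 2.0
print(median(revenue))           # 0.0
print(top_share(revenue, 0.01))  # 0.91
```

In this invented example the mean of 2.0 describes nobody well: the typical user spends nothing, while a single user accounts for 91% of revenue. Quantiles and concentration measures of this kind may be more informative summaries than the mean.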

Second, the game economy is closed - we control how much real currency maps to in-game currency, as well as the number and type of available purchases.

Below are some questions to guide the analysis:

Can you identify clusters of users? What are users like in each cluster? How does the cluster assignment change over time (for example, when does a user start to make lots of purchases or become increasingly engaged with the game?)
How do the user demographics and user actions affect the response variables (engagement, revenue, retention)? Which are the strongest predictors? What interactions are present?
Can you come up with a good way to visualize this data, ideally in an interactive way?
What other insights can you provide?

We would be interested in seeing the code behind your solutions, so reproducible analyses are highly encouraged.

Dataset Characteristics

For each user, measurements are taken from the time they install the app until a certain number of days has passed. For privacy reasons, we cannot reveal the exact observation period, but note that the length of the observation period is the same for every user.

The dataset consists of a single table, user_stats.csv, with one record for each user. There are 300,000 rows and the following columns:

install_date – in the format of year, month, day
user_id - integer uniquely identifying each user
country - string specifying the country the user is from (NA if unknown)
gender - (male, female, NA). Gender is known if and only if the user connects to Facebook during the observation period.
platform - (ipad, iphone)
num_platforms - (1,2) number of platforms on which a user installs the game (a user can install on both iphone and ipad)
num_sessions - number of times a user opened the app during the observation period
games_played - number of games played during the observation period (a session can have multiple games)
fb_connect - date when user connects their game account to Facebook (NA if they don’t do so during the observation period)

retention - does the user return to the game at the end of the observation period?
engagement - number of minutes the game was played during the observation period
revenue - amount of money the user spent during the observation period

tutorial_completed - date when user completes tutorial. NA if they don’t complete it during the observation period.
stage1 - date when the user first plays stage 1. NA if they don’t start it during the observation period
stage2 - date when the user first plays stage 2. NA if they don’t start it during the observation period
The remaining columns follow the same pattern (NA if the event does not occur during the observation period):
stage3
stage4
stage5
stage6
first_win
first_bonus
first_special_purchase
first_purchase_A
first_purchase_B
first_purchase_C
first_purchase_D
first_purchase_E
first_purchase_F
first_purchase_G
first_purchase_H
first_gift_sent
first_gift_received
first_gift2_received
first_gift_accepted
first_collection
first_prize_A
first_prize_B
first_prize_C
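A minimal sketch of reading records with this layout using Python's standard library. It assumes a comma-separated file with a header row, `NA` for missing values, and dates written as `YYYY-MM-DD`; the description above does not specify the exact date format, so that part is an assumption. The sample rows and the helper names `parse_date` and `read_users` are invented for illustration.

```python
import csv
import io
from datetime import datetime

def parse_date(text):
    """Return a date, or None for NA/empty (assumes YYYY-MM-DD)."""
    if text in ("", "NA"):
        return None
    return datetime.strptime(text, "%Y-%m-%d").date()

def read_users(handle, date_columns):
    """Read one dict per user, converting the named columns to dates."""
    rows = []
    for row in csv.DictReader(handle):
        for col in date_columns:
            row[col] = parse_date(row[col])
        rows.append(row)
    return rows

# Invented sample with a subset of the columns
sample = io.StringIO(
    "user_id,install_date,gender,fb_connect,stage1\n"
    "1,2013-05-01,female,2013-05-02,2013-05-01\n"
    "2,2013-05-01,NA,NA,NA\n"
)
users = read_users(sample, ["install_date", "fb_connect", "stage1"])

# Days from install to Facebook connection, where observed
for u in users:
    if u["fb_connect"] is not None:
        print(u["user_id"], (u["fb_connect"] - u["install_date"]).days)
```

For the real file, the same function could be called on `open("user_stats.csv", newline="")` with the full list of date columns. Converting event dates to days since `install_date` puts all users on a common time scale.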

Remarks:

Note that the revenue and engagement numbers have been rescaled.
Stage 1 becomes available as soon as the user completes the tutorial. However, for all subsequent stages, the user doesn't have to complete the previous stage -- they become available as soon as enough in-game currency has been accrued. For example, a user can start playing stage 4 without having started stage 3.
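The remark about stages implies that a user's stage dates need not be contiguous. The sketch below lists the stages a user skipped, meaning stages with no start date that come before the highest stage the user started. It assumes each user's stage dates have already been parsed into a dict with `None` for NA, which is a hypothetical representation; the example user is invented.

```python
from datetime import date

STAGES = ["stage1", "stage2", "stage3", "stage4", "stage5", "stage6"]

def skipped_stages(user):
    """Stages with no start date that precede the highest stage started."""
    started = [i for i, s in enumerate(STAGES) if user.get(s) is not None]
    if not started:
        return []
    highest = max(started)
    return [STAGES[i] for i in range(highest) if user.get(STAGES[i]) is None]

# Invented example: the user started stage 4 without starting stage 3
user = {
    "stage1": date(2013, 5, 1),
    "stage2": date(2013, 5, 3),
    "stage3": None,
    "stage4": date(2013, 5, 7),
    "stage5": None,
    "stage6": None,
}
print(skipped_stages(user))  # ['stage3']
```

A check of this kind may be useful before treating the highest stage reached as an ordinal progress measure, since missing intermediate stages are valid rather than data errors.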

Data access

In order to access the dataset, please complete the confidentiality document (http://s.uken.com/sscdata) and send it to Alex Yakubovich (alex.yakubovich@uken.com).
