Analysis of player interaction data from a mobile social game: which interactions between game factors and player factors predict engagement?


Data Source: 

Uken Games


Alex Yakubovich, Uken Games, Toronto


The data are from a social mobile game. Once a user downloads the game from the app store, they go through a tutorial, and then progress through a number of stages. While the game is free, users have the option of making in-app purchases. A user has some chance of winning each round, but even if they don't win they can accrue some in-game currency. Once users acquire enough currency, they can proceed to the next stage. If users connect their game account to Facebook, they can also send/receive gifts amongst their friends.


We measure three target variables for each user: revenue, engagement (minutes played), and retention (whether the player comes back after a certain number of days). We also record when different events occur within the game: for example, when the user makes different in-app purchases, when they send or receive gifts, or when they unlock different achievements within the game (e.g., reaching a level or winning a prize). Finally, for some users, we have demographic data, such as gender, country, and whether they connected their game account to their Facebook account.


There are some interesting aspects to this data. First, you will notice that the distributions of revenue and engagement are very heavy-tailed, so it doesn't make sense to talk about an 'average user'. Most users never become spending players, and most spending players don't make big purchases. However, the outliers account for a large part of the revenue, and in a sense subsidize the game for everyone else.

Second, the game economy is closed - we control how much real currency maps to in-game currency, as well as the number and type of available purchases.
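The heavy-tail point above can be made concrete with a small calculation. The following is a minimal sketch, not part of the dataset or of Uken's own tooling: the helper top_share and the ten revenue figures are made up for illustration. It shows how the mean and median diverge and how much of the total a small top group can hold.

```python
from statistics import mean, median


def top_share(values, fraction):
    """Return the share of the total contributed by the top `fraction` of values."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    ordered = sorted(values, reverse=True)
    total = sum(ordered)
    if total == 0:
        return 0.0
    # Always include at least one user in the top group.
    k = max(1, int(len(ordered) * fraction))
    return sum(ordered[:k]) / total


# Made-up revenue figures for ten users (NOT taken from the real data).
revenue = [0, 0, 0, 0, 0, 0, 0, 1, 2, 97]

share = top_share(revenue, 0.10)   # share of revenue from the top 10% of users
typical = median(revenue)          # what a "typical" user spends
average = mean(revenue)            # pulled upward by the single big spender

print(f"top 10% share: {share:.2f}, median: {typical}, mean: {average}")
```

On the real data, the same comparison could be run on the revenue and engagement columns; because those values have been rescaled, shares and ranks are more meaningful than absolute amounts.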

Research Question: 


Below are some questions to guide the analysis:

  1. Can you identify clusters of users? What are users like in each cluster? How does the cluster assignment change over time (for example, when does a user start to make lots of purchases or become increasingly engaged with the game?)
  2. How do the user demographics and user actions affect the response variables (engagement, revenue, retention)? Which are the strongest predictors? What interactions are present?
  3. Can you come up with a good way to visualize this data, ideally in an interactive way?
  4. What other insights can you provide?

We would be interested in seeing the code behind your solutions, so reproducible analyses are highly encouraged.
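As one possible starting point for question 2, a two-way table of group medians gives a first look at whether two factors interact. This is only a sketch under stated assumptions: the helper median_by_groups, the derived flag fb_connected (true when fb_connect is not NA), and the six rows of data are all hypothetical. It is not a substitute for a proper model.

```python
from collections import defaultdict
from statistics import median


def median_by_groups(users, key_a, key_b, target):
    """Median of `target` for each (key_a, key_b) combination."""
    groups = defaultdict(list)
    for user in users:
        groups[(user[key_a], user[key_b])].append(user[target])
    return {combo: median(values) for combo, values in groups.items()}


# Made-up rows; `fb_connected` is a hypothetical flag derived from fb_connect.
users = [
    {"platform": "iphone", "fb_connected": False, "engagement": 1.0},
    {"platform": "iphone", "fb_connected": False, "engagement": 3.0},
    {"platform": "iphone", "fb_connected": True, "engagement": 4.0},
    {"platform": "ipad", "fb_connected": False, "engagement": 2.0},
    {"platform": "ipad", "fb_connected": True, "engagement": 9.0},
    {"platform": "ipad", "fb_connected": True, "engagement": 11.0},
]

table = median_by_groups(users, "platform", "fb_connected", "engagement")

# Difference in median engagement between connected and unconnected users,
# computed separately for each platform.
fb_effect = {
    platform: table[(platform, True)] - table[(platform, False)]
    for platform in ("iphone", "ipad")
}
print(fb_effect)
```

If the difference varies a lot between platforms, as it does in this toy data, that suggests an interaction worth testing formally. Medians are used rather than means because of the heavy tails described above.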



For each user, measurements are taken from the time they install the app until a certain number of days has passed. For privacy reasons, we cannot reveal the exact observation period, but note that its length is the same for every user.

The dataset consists of a single table, user_stats.csv, with one record for each user. There are 300,000 rows and the following columns:

install_date - date of installation, in the format year, month, day
user_id - integer uniquely identifying each user
country - string specifying the country the user is from (NA is unknown)
gender - (male, female, NA). Gender is known if and only if the user connects to Facebook during the observation period.
platform - (ipad, iphone)
num_platforms - (1, 2) number of platforms on which a user installs the game (a user can install on both iphone and ipad)
num_sessions - number of times a user opened the app during the observation period
games_played - number of games played during the observation period (a session can have multiple games)
fb_connect - date when the user connects their game account to Facebook (NA if they don't do so during the observation period)

retention - does the user return to the game at the end of the observation period?
engagement - number of minutes the game was played during the observation period
revenue - amount of money the user spent during the observation period

tutorial_completed - date when the user completes the tutorial. NA if they don't complete it during the observation period.
stage1 - date when the user first plays stage 1. NA if they don’t start it during the observation period
stage2 - date when the user first plays stage 2. NA if they don’t start it during the observation period
stage3 - etc.
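The column list above could be read into typed records along the following lines. This is a minimal sketch using only the Python standard library. The date layout in DATE_FORMAT is an assumption (the description only says year, month, day), the sample rows are made up, and retention is left as a raw string because its encoding is not specified.

```python
import csv
import io
from datetime import datetime

DATE_FORMAT = "%Y-%m-%d"  # ASSUMPTION: adjust to the layout used in the real file
DATE_COLUMNS = {"install_date", "fb_connect", "tutorial_completed"}
INT_COLUMNS = {"user_id", "num_platforms", "num_sessions", "games_played"}
FLOAT_COLUMNS = {"engagement", "revenue"}


def parse_row(row):
    """Convert one raw CSV row (all strings) into typed values; NA becomes None."""
    parsed = {}
    for name, raw in row.items():
        if raw in ("", "NA"):
            parsed[name] = None
        elif name in DATE_COLUMNS or name.startswith("stage"):
            parsed[name] = datetime.strptime(raw, DATE_FORMAT).date()
        elif name in INT_COLUMNS:
            parsed[name] = int(raw)
        elif name in FLOAT_COLUMNS:
            parsed[name] = float(raw)
        else:
            parsed[name] = raw  # country, gender, platform, retention stay as text
    return parsed


def read_user_stats(handle):
    """Read every row from an open CSV file-like object."""
    return [parse_row(row) for row in csv.DictReader(handle)]


# Two made-up rows with a subset of the columns, for illustration only.
SAMPLE = """user_id,install_date,country,gender,platform,fb_connect,revenue,stage1,stage2
1,2013-01-05,CA,NA,iphone,NA,0.0,2013-01-05,NA
2,2013-01-06,NA,female,ipad,2013-01-07,3.5,2013-01-06,2013-01-08
"""

users = read_user_stats(io.StringIO(SAMPLE))
print(users[1]["fb_connect"], users[1]["revenue"])
```

For the real file, the same function could be called on open("user_stats.csv", newline=""). With 300,000 rows this fits comfortably in memory, though a dataframe library would be a reasonable alternative.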


  • Note that the revenue and engagement numbers have been rescaled.
  • Stage 1 becomes available as soon as the user completes the tutorial. However, for all subsequent stages, the user doesn't have to complete the previous stage -- they become available as soon as enough in-game currency has been accrued. For example, a user can start playing stage 4 without having started stage 3.
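Because later stages can be started out of order, an analysis should not assume that a missing stage date means the user never progressed further. The sketch below shows one way to flag skipped stages for a single user. The helper skipped_stages and the example dates are hypothetical.

```python
def skipped_stages(stage_dates):
    """Return the stage numbers a user skipped.

    `stage_dates` holds the first-play date of stage 1, stage 2, ... in order,
    with None where the stage was not started during the observation period.
    A stage counts as skipped when it was not started but a later one was.
    """
    started = [n for n, d in enumerate(stage_dates, start=1) if d is not None]
    if not started:
        return []
    highest = max(started)
    return [n for n in range(1, highest) if stage_dates[n - 1] is None]


# Made-up example: the user started stages 1, 2 and 4 but never stage 3.
example = ["2013-01-05", "2013-01-06", None, "2013-01-09"]
print(skipped_stages(example))
```

The function only checks for None, so it works with date objects or date strings alike.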

Data Access: 

In order to access the dataset, please complete the confidentiality document and send it to Alex Yakubovich.