Background
Podcasts are an increasingly popular form of media consumption, and represent a rich potential source of data on consumer patterns and trends. An estimated 51% of the US population has listened to a podcast; there are over 750,000 shows totalling over 30 million episodes. Research on podcasts has generally focused on predicting popular content and making use of the show and episode descriptions for text analysis and natural language processing (Tsagkias et al. 2010). Two primary metrics are used to assess popularity in podcasts: the average rating (usually based on listener ratings), and the number of reviews; the latter is often considered more important, given that it correlates with audience size.
Case study objectives
The goal for this case study project is to explore what predicts podcast popularity, given a recent dataset on podcast ratings and some information about each show (e.g., category, host, topic summary). Each team can also seek out additional covariates or transform the existing ones, such as through text analysis methods.
Teams will be given hourly podcast popularity data for approximately 50 days preceding the start of the competition. An additional unlabelled dataset will also be provided; once your models are built and validated, each team will submit the predicted number of reviews for this unlabelled dataset. Note that the podcasts in the unlabelled dataset may or may not have appeared in your training data, so any custom predictors you use to build your model must be reproducible from the included covariates alone in order to generate predictions for this dataset.
Data description
The podcast ratings on iTunes are the most frequently used and cited source for assessing the popularity of podcasts, as iTunes is one of the primary sources for podcast media. iTunes makes available an API for current real-time ratings of its podcasts, and these data are used for this case study competition. The data were collected by automated scraping techniques designed to capture a large sample of the English-language shows available on iTunes. The API allows for automated scraping of up to 200 shows per main category (many categories have fewer than 200 shows in total). Each hourly dataset has approximately 2000 rows. We included only podcasts in English and excluded video content. For a sample view of the data, see APPENDIX 1.
The case study data sets include a training set with 1,408,901 rows and ten columns, and a prediction (unlabeled) set with approximately 10 days of data.
1. What characteristics of podcasts predict their popularity, both in terms of rating value and number of reviews? Which measure of popularity do you find more meaningful?
2. Do the trends in popularity change over the timespan of the dataset (approximately 50 days)? Are there any daily patterns in the data? Any weekly patterns?
3. What is the predicted number of reviews for each podcast in the new unlabelled dataset?
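Accuracy of the predicted number of reviews will be calculated using the mean absolute error (MAE):
$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|$$
where $y_i$ is the observed number of reviews for row $i$ of the unlabelled dataset, $\hat{y}_i$ is the corresponding predicted number of reviews, and $n$ is the number of predictions.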
MAE is a suitable metric for continuous data when larger differences do not need to be penalized in a non-linear way (as compared to, for example, the root mean squared error) (Hyndman and Koehler 2006). Positive and negative errors are penalized equally in the MAE.
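As a rough illustration of the metric described above, the following minimal Python sketch computes the MAE alongside the root mean squared error for comparison. The function names and the review counts are made up for this example and are not part of the competition materials.

```python
def mean_absolute_error(actual, predicted):
    """Average of the absolute differences between observed and predicted values."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("at least one prediction is required")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def root_mean_squared_error(actual, predicted):
    """Square root of the average squared difference (shown for comparison only)."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("at least one prediction is required")
    squared = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return (squared / len(actual)) ** 0.5


# Hypothetical review counts: three close predictions and one large miss.
observed = [100, 250, 40, 1200]
predicted = [110, 240, 40, 1000]

mae = mean_absolute_error(observed, predicted)       # (10 + 10 + 0 + 200) / 4 = 55.0
rmse = root_mean_squared_error(observed, predicted)  # larger than the MAE here
```

In this example the single large miss contributes to the MAE in proportion to its size, whereas the RMSE weights it more heavily because the errors are squared before averaging.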
Variable | Description
Title | The name of the podcast (as it appears in iTunes)
Summary | A text blurb describing the content and topics of the show and the host.
Sub-category | The detailed sub-category of the show (70 in total)
Artist | The company, organization, or individual producing the show.
Date | A timestamp with the date of the data scrape.
Hour | The hour from 00 (midnight EST) to 23 (11 PM EST)
Release | The date of the most recent episode release.
Rating value | A proprietary metric calculated by Apple that averages listener reviews with other information; minimum 1.0, maximum 5.0
Number of reviews | The number of iTunes reviews, usually rounded to two significant figures.
URL | The website of the podcast show.
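Because custom predictors must be reproducible for podcasts that may not appear in the training data, one option is to derive them only from the covariates listed above. The sketch below is a minimal illustration under stated assumptions: the dictionary keys mirror the variable names in the table, the dates are assumed to be in ISO format (YYYY-MM-DD), and `build_features` and the example row are hypothetical. The actual column names and date formats in the data files may differ.

```python
from datetime import date, datetime


def build_features(row):
    """Derive predictors from one row using only the provided covariates.

    Assumes (hypothetically) ISO-formatted Date and Release values and a
    two-digit Hour string; adjust the parsing to the real file format.
    """
    scrape_day = date.fromisoformat(row["Date"])
    release_day = date.fromisoformat(row["Release"])
    hour = int(row["Hour"])  # "00" (midnight EST) through "23" (11 PM EST)

    # Combine the scrape date and hour into a single timestamp.
    scraped_at = datetime(scrape_day.year, scrape_day.month, scrape_day.day, hour)

    return {
        "scraped_at": scraped_at,
        "hour": hour,                       # supports daily-pattern analysis
        "weekday": scraped_at.weekday(),    # 0 = Monday ... 6 = Sunday
        "days_since_release": (scrape_day - release_day).days,
        "summary_word_count": len(row["Summary"].split()),
    }


# Made-up row for illustration only.
example_row = {
    "Title": "Example Show",
    "Summary": "A weekly show about statistics and data",
    "Date": "2019-03-04",
    "Hour": "07",
    "Release": "2019-03-01",
}

features = build_features(example_row)
```

Since none of these features depend on a show having been seen before, the same function can be applied unchanged to the training set and to the unlabelled set. The hour and weekday fields are also a natural starting point for examining daily and weekly patterns.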
The training data set can be downloaded from here:
Any questions or concerns can be directed to: kathryn@precision-analytics.ca
The unlabeled data set can be downloaded from here.
Organizer contact information
This case study was prepared by Dr. Kathryn Morrison with help and guidance from the other members of the case study committee of the Statistical Society of Canada (Dr. Ehsan Karim, Dr. Pingzhao Hu, and Dr. Chel Hee Lee). The staff at Precision Analytics assisted with the data scraping and preparation.
Hyndman, R.J. and Koehler, A.B., 2006. Another look at measures of forecast accuracy. International Journal of Forecasting, 22(4), pp.679-688.
Tsagkias, M., Larson, M. and De Rijke, M., 2010. Predicting podcast preference: An analysis framework and its application. Journal of the American Society for Information Science and Technology, 61(2), pp.374-391.