Case Study 2: What predicts the popularity of TED Talks?


Data Source: 



Lisa Lix



This dataset was compiled by web scraping, a largely automated technique for extracting data from websites and a rapidly growing area of research. Scraped data are typically analyzed using text processing and artificial intelligence tools.

The data are from TED, a nonpartisan and nonprofit organization. TED spreads ideas, primarily via short talks that can be accessed on the internet. As noted on its website, TED began in 1984 as a conference where technology, entertainment, and design ideas were shared. At present, TED Talks cover topics ranging from science to business to global issues. More information about TED can be found on its website. Learning about the organization and its talks may be useful when developing your data analytic strategy.

This case study is currently a data competition on Kaggle. You may wish to see what others have done with these data, although the analyses to date have been primarily descriptive.

Your analysis in this case study will focus on the use of inferential techniques. You should also consider innovative ways to measure the popularity of the talks, beyond the conventional measure of the number of views of a talk.

Research Question: 

The questions to consider when analyzing these data are:
  • What characteristics of TED Talks predict their popularity? 
  • In what different ways could you measure the popularity of TED Talks? For example, could you develop one or more composite measures? Do the characteristics that predict popularity depend on how you measure this construct?
  • Do the characteristics that predict popularity change over time? 
  • Do the characteristics that predict popularity differ based on the theme of the TED Talks? 
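
One possible way to build a composite popularity measure, sketched here only as an illustration, is to log-transform the number of views and the number of comments, standardize each, and average the resulting z-scores. The function names, the choice of the two components, and the equal weighting are assumptions made for this sketch rather than a recommended method, and the input values are toy numbers invented for illustration.

```python
import math
from statistics import mean, pstdev


def z_scores(values):
    """Standardize values to mean 0 and population standard deviation 1."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # All values identical: no variation to standardize.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def composite_popularity(views, comments):
    """Average the z-scores of log(1 + views) and log(1 + comments).

    Equal weighting of the two components is an assumption of this sketch.
    """
    z_views = z_scores([math.log1p(v) for v in views])
    z_comments = z_scores([math.log1p(c) for c in comments])
    return [(a + b) / 2 for a, b in zip(z_views, z_comments)]


# Toy values invented for illustration; not taken from the dataset.
toy_views = [1_000, 50_000, 2_000_000]
toy_comments = [5, 120, 900]
scores = composite_popularity(toy_views, toy_comments)
```

The log transform is used because counts such as views tend to be strongly right-skewed. Other components (for example, the number of languages or the ratings) and other weightings could be substituted, and comparing results across definitions addresses the question of whether predictors depend on how popularity is measured.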


Description of the Dataset:
This dataset contains information about audio-video recordings of TED Talks uploaded to the official TED website. The data cover the period from 2006 to September 21st, 2017. 
Number of Records: 2550
Number of Columns: 17
Column Name         Description
Comments            The number of first-level comments made on the talk
Description         A description of what the talk is about
Duration            The duration of the talk in seconds
Event               The TED event where the talk took place
Film_date           The Unix timestamp of the filming
Languages           The number of languages in which the talk is available
Main_speaker        The first named speaker of the talk
Name                The official name of the TED Talk, including both the title and the speaker
Num_speaker         The number of speakers in the talk
Published_date      The Unix timestamp of the talk's publication on the TED website
Ratings             A string dictionary of the ratings given to the talk (e.g., inspiring, fascinating, jaw dropping) and their frequency
Related_talks       A list of dictionaries of recommended talks to watch next
Speaker_occupation  The occupation of the main speaker
Tags                The themes associated with the talk
Title               The title of the talk
URL                 The URL of the talk
Views               The number of views of the talk
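
Two of the fields described above need decoding before analysis: the Unix timestamps and the ratings string. The following is a minimal sketch of how that might be done. It assumes the ratings string is a Python-style literal holding a list of dictionaries with 'name' and 'count' keys; that layout, the helper names, and the example values are assumptions for illustration and should be checked against the actual file.

```python
import ast
from datetime import datetime, timezone


def unix_to_date(timestamp):
    """Convert a Unix timestamp (seconds since 1970-01-01 UTC) to a UTC date."""
    return datetime.fromtimestamp(int(timestamp), tz=timezone.utc).date()


def parse_ratings(raw):
    """Turn a ratings string into a {rating name: count} dictionary.

    Assumes the string is a Python literal: a list of dicts with
    'name' and 'count' keys. Verify this against the real data.
    """
    entries = ast.literal_eval(raw)
    return {entry["name"]: entry["count"] for entry in entries}


# Example values invented for illustration; not taken from the dataset.
example_ratings = (
    "[{'id': 1, 'name': 'Inspiring', 'count': 40}, "
    "{'id': 2, 'name': 'Fascinating', 'count': 10}]"
)
ratings = parse_ratings(example_ratings)

# Express each rating as a share of all ratings given to the talk.
total = sum(ratings.values())
shares = {name: count / total for name, count in ratings.items()}

filmed = unix_to_date(1140825600)
```

`ast.literal_eval` is used instead of `eval` because it only accepts literals and therefore does not execute arbitrary code embedded in scraped text. Rating shares, rather than raw counts, may be easier to compare across talks with very different audience sizes.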

Data Access: 

The dataset has been provided in a CSV file. Please email if you would like the data as a .zip file. 
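
As a starting point, the CSV file could be read with Python's standard library along the following lines. This is a sketch under stated assumptions: the helper name `load_talks` is hypothetical, the header names follow the column descriptions in this document and may differ in spelling or case in the actual file, and the sample rows are invented so that the example is self-contained.

```python
import csv
import io

# Assumed numeric columns; header spelling and case in the real file may differ.
NUMERIC_COLUMNS = ("Comments", "Duration")


def load_talks(lines, numeric_columns=NUMERIC_COLUMNS):
    """Read CSV rows into dictionaries, converting the named columns to int."""
    talks = []
    for row in csv.DictReader(lines):
        for column in numeric_columns:
            row[column] = int(row[column])
        talks.append(row)
    return talks


# Two toy rows invented for illustration; not taken from the dataset.
sample = io.StringIO(
    "Comments,Duration,Event\n"
    "12,600,Example Event A\n"
    "340,1080,Example Event B\n"
)
talks = load_talks(sample)

# Duration is recorded in seconds; convert the mean to minutes.
mean_minutes = sum(t["Duration"] for t in talks) / len(talks) / 60
```

For the real file, the same function can be given an open file handle, for example one obtained with `open(path, newline="", encoding="utf-8")`, where `path` stands for wherever the CSV has been saved.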


Lisa Lix
University of Manitoba


Data Files: