
Title: Analysis of Unstructured Text Data.
Facilitators: Dave Campbell (Carleton University) and Nathan Taback (University of Toronto).
Date: Thursday, June 4, 2020

Part 1: 1:00 PM to 2:30 PM Eastern
Part 2: 2:30 PM to 4:00 PM Eastern

 

Part 1 will cover data acquisition, cleaning, tokenization, and exploration.
Part 2 will cover topic models for clustering documents and methods for converting words into low-dimensional numeric vectors.

 

To register, click here.

 

Workshop Description:

Dave and Nathan will introduce participants to tools for collecting, managing, processing, and analyzing large volumes of
unstructured text data from a variety of sources.


In many fields, unstructured text is a natural data source. Accident reports, medical charts, news articles, and product
descriptions all contain large amounts of text. With text data, preparation, cleaning, and analysis cannot be separated from
the domain of the analytic questions. Beyond simply encoding the presence or absence of features, text allows the modelling of
context and sentiment, which requires special techniques. In this hands-on workshop, we introduce participants to tidyverse
tools for manipulating, managing, cleaning, and visualizing text data. Participants will also be introduced to tools for
tracking sentiment across a document, clustering text documents by modelling their topics, and embedding text in vector spaces.


Nathan Taback is an Associate Professor, Teaching Stream, in the Departments of Statistical Sciences and Computer Science
and Director of Data Science Programs in Statistical Sciences at the University of Toronto.


Dave Campbell is a professor in the School of Mathematics and Statistics at Carleton University and an adjunct member of
the Department of Statistics and Actuarial Science at Simon Fraser University.