Using Machine Learning Algorithms for Finding the Topics of COVID-19 Open Research Dataset Automatically

The COVID-19 Open Research Dataset (CORD-19) is a collection of over 400,000 scholarly papers (as of January 11, 2021) about COVID-19, SARS-CoV-2, and related coronaviruses, curated by the Allen Institute for AI. Carrying out an exploratory literature review of these papers has become a time-sensitive and exhausting challenge during the pandemic. The topic modelling pipeline presented in this thesis helps researchers gain an overview of the topics addressed in the papers. The preprocessing framework identifies Unified Medical Language System (UMLS) entities using MedLinker, which handles Word Sense Disambiguation (WSD) through a pre-trained Bidirectional Encoder Representations from Transformers (BERT) model. The topic model used in this research is a Variational Autoencoder implementing ProdLDA, an extension of the Latent Dirichlet Allocation (LDA) model. Applying the pipeline to the CORD-19 dataset yielded highly diverse topics with a coherence value of 0.7.
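
The abstract does not say how topic diversity was measured. As a minimal sketch, assuming the commonly used definition (the fraction of unique words among the top-k words of all topics), the idea can be illustrated in Python. The function name `topic_diversity`, the parameter `top_k`, and the example word lists are hypothetical and are not taken from the thesis.

```python
def topic_diversity(topics, top_k=10):
    """Return the fraction of unique words among the top_k words of every topic.

    Assumption: each topic is a list of words sorted from most to least probable.
    A value near 1.0 means topics share few top words; near 0.0 means heavy overlap.
    """
    top_words = [word for topic in topics for word in topic[:top_k]]
    if not top_words:
        raise ValueError("need at least one topic with at least one word")
    return len(set(top_words)) / len(top_words)


# Hypothetical top words, invented for illustration only (not results from the thesis).
example_topics = [
    ["vaccine", "antibody", "immune", "dose"],
    ["transmission", "mask", "aerosol", "distance"],
    ["vaccine", "trial", "efficacy", "dose"],
]

score = topic_diversity(example_topics, top_k=4)
print(f"topic diversity: {score:.2f}")
```

In this toy input, two of the twelve top words repeat across topics, so the score falls below 1.0. Coherence is a separate measure and is not computed here.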

Session

Big Data

Date and Time

Wed, 06/09/2021 - 14:15 - Wed, 06/09/2021 - 14:30

Language of Oral Presentation

English

Language of Visual Aids

English

Speaker