Boosting Methods for Imbalanced Data Classification

Data imbalance is an important consideration when working with real-world data. Oversampling and undersampling approaches allow us to gather more insight from the limited data available on the minority class; however, many such methods have been proposed. The goal of our study is to identify the optimal over/undersampling approach to use with Adaptive Boosting (AdaBoost). Based on a simulation study, we found that combining AdaBoost with various sampling techniques improves weighted accuracy across classes at progressively larger data imbalances. The three Synthetic Minority Oversampling Technique (SMOTE) variants performed best, with the SMOTE – Edited Nearest Neighbours (SMOTE-ENN) approach being the most accurate at all levels of data imbalance. We then applied the most effective over/undersampling methods to predict upsets (games where the lower-seeded team wins) in the March Madness College Basketball Tournament.
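To illustrate why a class-weighted metric matters under imbalance, the sketch below compares plain accuracy with balanced accuracy (the mean of per-class recall). This is an assumption made for illustration: the abstract does not define its "weighted accuracy across classes", and the function names, labels, and the 95:5 split are hypothetical rather than taken from the study.

```python
from collections import defaultdict


def per_class_recall(y_true, y_pred):
    """Return {class_label: fraction of that class predicted correctly}."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    return {label: correct[label] / total[label] for label in total}


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so each class counts equally.

    Assumed stand-in for the abstract's class-weighted accuracy.
    """
    recalls = per_class_recall(y_true, y_pred)
    return sum(recalls.values()) / len(recalls)


def plain_accuracy(y_true, y_pred):
    """Fraction of all observations predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical 95:5 imbalance; label 1 marks the minority class (e.g. an upset).
y_true = [0] * 95 + [1] * 5

# A classifier that ignores the minority class entirely.
always_majority = [0] * 100

print("plain accuracy:   ", plain_accuracy(y_true, always_majority))
print("balanced accuracy:", balanced_accuracy(y_true, always_majority))
```

A classifier that never predicts the minority class scores well on plain accuracy here but only 0.5 on balanced accuracy, which is the kind of gap that resampling the minority class is meant to close.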

Session

Poster Presentations

Date and Time