Imbalanced binary classification problems arise in many fields, such as wildfire science, health care, and finance. When using random forests to learn from imbalanced data, it is common to subsample the majority class (i.e., undersampling) to create a (more) balanced training dataset. This skews the random forest's predictions, so those who want meaningful probability estimates try to calibrate them. One way to do this is to map the original predictions to new values based on the sampling rates for the majority and minority classes that were used to create the training dataset. However, calibrating a random forest this way has surprising consequences. The result is a prevalence estimate that depends on both (i) the sampling rates used and (ii) the number of predictors considered at each split in the random forest. We explain why these hyperparameters have an impact and show how prevalence estimates can change under different choices of them.
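To make the sampling-rate-based mapping above concrete, the sketch below applies one common correction, a prior-shift (odds-ratio) adjustment, to predictions from a model trained on undersampled data. This is a minimal illustration under stated assumptions, not necessarily the exact procedure used in this work; the function name `calibrate_undersampled_probs` and the specific rates are illustrative.

```python
import numpy as np

def calibrate_undersampled_probs(p_s, minority_rate, majority_rate):
    """Map probabilities from a model trained on undersampled data back to the
    original class balance using a prior-shift (odds-ratio) correction.

    p_s           -- predicted minority-class probabilities on the resampled scale
    minority_rate -- fraction of minority-class cases kept in the training set
    majority_rate -- fraction of majority-class cases kept in the training set
    """
    p_s = np.clip(np.asarray(p_s, dtype=float), 1e-12, 1 - 1e-12)
    odds_s = p_s / (1.0 - p_s)                       # odds on the resampled scale
    odds = odds_s * (majority_rate / minority_rate)  # undo the shift in class priors
    return odds / (1.0 + odds)

# Example: keeping all minority cases and 10% of majority cases,
# a raw prediction of 0.5 maps back to roughly 0.09.
p_raw = np.array([0.2, 0.5, 0.8])
print(calibrate_undersampled_probs(p_raw, minority_rate=1.0, majority_rate=0.1))
```

Averaging such calibrated probabilities over a dataset gives a prevalence estimate; the point of the talk is that this estimate depends not only on the sampling rates fed into the correction but also on the number of predictors considered at each split (e.g., the `max_features` or `mtry` hyperparameter).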
Date and Time
-
Language of Oral Presentation
English
Language of Visual Aids
English

Speaker

Nathan Phelps, University of Western Ontario