Learning from Imbalanced Data
Consider binary classification
Often the classes are highly imbalanced
What counts as "highly imbalanced" is debatable; consider the following scenarios:
Scenario 1: 100,000 negative and 1,000 positive examples
Scenario 2: 10,000 negative and 10 positive examples
Scenario 3: 1,000 negative and 1 positive example
Usually, imbalance is characterized by absolute rarity (how few minority examples there are) rather than relative rarity (the class ratio)
A trivial classifier that always predicts the majority (negative) class will usually get a high accuracy
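A quick numerical sanity check of this point (a minimal sketch with synthetic labels matching Scenario 1):

```python
import numpy as np

# Scenario 1: 100,000 negatives and 1,000 positives
y = np.array([0] * 100_000 + [1] * 1_000)
y_pred = np.zeros_like(y)            # a trivial classifier that always predicts "negative"

accuracy = (y_pred == y).mean()
print(f"accuracy = {accuracy:.4f}")  # ~0.9901, despite never detecting a single positive
```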
Undersampling: create a new training data set by keeping all the minority ("positive") examples and only a random subset of the majority ("negative") examples
Throws away a lot of data/information, but is efficient to train on
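A minimal sketch of random undersampling in NumPy; `X` and `y` (0/1 labels) are assumed to already exist, and balancing the classes exactly is just one possible choice of sampling ratio:

```python
import numpy as np

def undersample(X, y, rng=np.random.default_rng(0)):
    """Keep all minority (y == 1) examples and an equal-sized random subset of the majority."""
    pos_idx = np.where(y == 1)[0]
    neg_idx = np.where(y == 0)[0]
    neg_keep = rng.choice(neg_idx, size=len(pos_idx), replace=False)  # discard most negatives
    keep = np.concatenate([pos_idx, neg_keep])
    rng.shuffle(keep)
    return X[keep], y[keep]
```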
Oversampling: create a new training data set by keeping all the majority ("negative") examples and replicating the minority ("positive") examples (sampling them with replacement)
From the loss function's perspective, the repeated examples simply contribute multiple times to the loss
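The mirror-image sketch for random oversampling (same assumptions about `X` and `y`); the duplicated minority examples are exactly the ones that end up contributing multiple times to the loss:

```python
import numpy as np

def oversample(X, y, rng=np.random.default_rng(0)):
    """Keep all examples and replicate minority (y == 1) examples until the classes are balanced."""
    pos_idx = np.where(y == 1)[0]
    neg_idx = np.where(y == 0)[0]
    extra = rng.choice(pos_idx, size=len(neg_idx) - len(pos_idx), replace=True)  # duplicates
    keep = np.concatenate([neg_idx, pos_idx, extra])
    rng.shuffle(keep)
    return X[keep], y[keep]
```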
Oversampling usually outperforms undersampling because we are using more data to train the model
Some oversampling methods (e.g., SMOTE) are based on creating synthetic examples from the minority class instead of exact copies
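A rough sketch of the SMOTE idea, i.e., interpolating between a minority example and one of its nearest minority-class neighbours (in practice one would typically use the `imbalanced-learn` library's `SMOTE` rather than this toy version):

```python
import numpy as np

def smote_like(X_min, n_new, k=5, rng=np.random.default_rng(0)):
    """Create n_new synthetic examples from the minority-class feature matrix X_min."""
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        dists = np.linalg.norm(X_min - X_min[i], axis=1)   # distances to other minority points
        neighbors = np.argsort(dists)[1:k + 1]             # k nearest neighbours (skip itself)
        j = rng.choice(neighbors)
        lam = rng.random()                                 # interpolation factor in [0, 1)
        synthetic.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(synthetic)
```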
Add costs/weights to the training set
"negative" examples get weight 1
"positive" examples get a much larger weight
Change learning algorithm to optimize weighted training error
Has a similar effect to oversampling but is more efficient (because examples are not duplicated)
Also requires a classifier that can learn from weighted examples
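A minimal scikit-learn sketch of this idea; the weight of 100 for the positive class is an illustrative choice (e.g., roughly the imbalance ratio), and `X_train`/`y_train` are assumed to already exist:

```python
from sklearn.linear_model import LogisticRegression

# "negative" examples get weight 1, "positive" examples a much larger weight
clf = LogisticRegression(class_weight={0: 1.0, 1: 100.0})
clf.fit(X_train, y_train)

# class_weight="balanced" would instead set weights inversely proportional
# to the class frequencies observed in y_train
```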
Traditional loss functions have the form $\sum_{n=1}^{N} \ell \left( y_n, f(x_n)\right)$
Such loss functions look at positive and negative examples individually, so the majority class tends to overwhelm the minority class
Reweighting the loss function differently for different classes can be one way to handle class imbalance, e.g., $\sum_{n=1}^{N} C_{y_n} \ell \left( y_n, f(x_n)\right)$
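As a concrete instance of the reweighted loss above, here is a sketch with $\ell$ taken to be the logistic (cross-entropy) loss, labels in $\{0, 1\}$, and real-valued scores $f(x_n)$:

```python
import numpy as np

def weighted_logistic_loss(y, scores, C_pos=100.0, C_neg=1.0):
    """Computes sum_n C_{y_n} * loss(y_n, f(x_n)) with the logistic loss."""
    C = np.where(y == 1, C_pos, C_neg)              # per-example class weight C_{y_n}
    signed = np.where(y == 1, -scores, scores)      # -f(x) for positives, +f(x) for negatives
    per_example = np.logaddexp(0.0, signed)         # numerically stable log(1 + exp(.))
    return np.sum(C * per_example)
```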
Alternatively, we can use loss functions that look at pairs of examples (a positive example $x_n^+$ and a negative example $x_m^-$). For example: $\sum_{n=1}^{N_+} \sum_{m=1}^{N_-} \ell\left( f(x_n^+) - f(x_m^-) \right)$, where $\ell$ penalizes scoring a negative example above a positive one
These are called "pairwise" loss functions
Why is it a good loss function for imbalanced data? Because every term pairs a minority example with a majority example, the minority class contributes to every term of the loss and cannot be drowned out; minimizing such a pairwise loss is closely related to maximizing the AUC
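A sketch of one such pairwise loss, using the logistic surrogate on the score difference (a hinge on the margin would be another common choice); `scores_pos` and `scores_neg` are assumed to be NumPy arrays of the model's scores $f(x_n^+)$ and $f(x_m^-)$:

```python
import numpy as np

def pairwise_logistic_loss(scores_pos, scores_neg):
    """Sum over all (positive, negative) pairs of log(1 + exp(-(f(x+) - f(x-))))."""
    margins = scores_pos[:, None] - scores_neg[None, :]   # every positive vs every negative
    return np.sum(np.logaddexp(0.0, -margins))
```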