Active Learning


By Prof. Seungchul Lee
http://iailab.kaist.ac.kr/
Industrial AI Lab at KAIST

Table of Contents



1. Active Learning: Two Purposes

Similar to traditional active learning in that it involves sampling from an unlabeled dataset, but…

  • The purpose of sampling is different

  • Objective of "Active Learning with Bayesian Optimization": sample the unlabeled data that are expected to have the highest label values

  • (Example) Sampling unlabeled process parameters to maximize productivity



2. Active Learning with Bayesian Optimization

  • Two important components (a minimal sketch combining them follows below)
    • Surrogate model
      • Returns predicted labels and the uncertainties of those predictions
    • Utility function
      • Uses the surrogate model's outputs to measure which unlabeled data points are more likely to have label values higher than the currently known ones
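
The two components are combined in an iterative loop: fit the surrogate on the labeled data, score the unlabeled pool with the utility function, label the highest-scoring candidate, and repeat. The sketch below illustrates this loop on synthetic data; the `true_label` oracle, the pool sizes, and the use of expected improvement are assumptions for illustration only, not part of the lab that follows.

import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

rng = np.random.default_rng(0)
true_label = lambda x: -np.sum((x - 0.3)**2, axis = 1)   # hypothetical objective ("experiment")

X_labeled = rng.random((10, 2))                          # initial labeled data
y_labeled = true_label(X_labeled)
X_pool = rng.random((200, 2))                            # unlabeled candidate pool
xi = 0.01                                                # exploration parameter

for _ in range(5):
    # 1) Surrogate model: predicted labels + uncertainties for the pool
    gpr = GaussianProcessRegressor(kernel = Matern(nu = 1.5), normalize_y = True)
    gpr.fit(X_labeled, y_labeled)
    mu, sigma = gpr.predict(X_pool, return_std = True)

    # 2) Utility function (expected improvement) over the unlabeled pool
    y_best = np.max(y_labeled)
    z = (mu - y_best - xi) / np.maximum(sigma, 1e-12)
    ei = (mu - y_best - xi) * norm.cdf(z) + sigma * norm.pdf(z)

    # 3) "Run the experiment" on the most promising candidate and add it to the training set
    idx = np.argmax(ei)
    X_labeled = np.vstack([X_labeled, X_pool[idx]])
    y_labeled = np.append(y_labeled, true_label(X_pool[idx:idx + 1]))
    X_pool = np.delete(X_pool, idx, axis = 0)

print('Best label found:', y_labeled.max())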


3. Lab


Load Python Packages

In [ ]:
import numpy as np
import pandas as pd
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel as C, WhiteKernel as Wht, Matern as matk

from warnings import filterwarnings
filterwarnings('ignore')

Load Data into Pandas Dataframe

Download

In [ ]:
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
In [ ]:
train_data = pd.read_csv('/content/drive/MyDrive/ML/ML_data/AL_train.csv')
test_data = pd.read_csv('/content/drive/MyDrive/ML/ML_data/AL_test.csv')

print('Shape of training dataset =', train_data.shape)
print('Shape of test dataset =', test_data.shape)
Shape of training dataset = (500, 9)
Shape of test dataset = (530, 9)
In [ ]:
train_data
Out[ ]:
|     | Cement | Blast Furnace Slag | Fly Ash | Water | Superplasticizer | Coarse Aggregate | Fine Aggregate | Age | Strength |
|-----|--------|--------------------|---------|-------|------------------|------------------|----------------|-----|----------|
| 0   | 0.342009 | 0.000000 | 0.499250 | 0.194089 | 0.385093 | 0.595930 | 0.767185 | 0.035714 | 33.36 |
| 1   | 0.085845 | 0.582638 | 0.000000 | 0.560703 | 0.000000 | 0.715116 | 0.534119 | 0.016484 | 14.59 |
| 2   | 0.305936 | 0.000000 | 0.000000 | 0.576677 | 0.000000 | 0.485465 | 0.730055 | 0.244505 | 21.95 |
| 3   | 0.541553 | 0.000000 | 0.000000 | 0.510383 | 0.000000 | 0.779651 | 0.402158 | 0.016484 | 21.18 |
| 4   | 0.399543 | 0.000000 | 0.000000 | 0.552716 | 0.000000 | 0.485465 | 0.657301 | 0.035714 | 21.26 |
| ... | ...      | ...      | ...      | ...      | ...      | ...      | ...      | ...      | ...   |
| 495 | 0.538584 | 0.525876 | 0.000000 | 0.424121 | 0.295031 | 0.417733 | 0.405921 | 0.247253 | 67.80 |
| 496 | 0.623288 | 0.260991 | 0.000000 | 0.038339 | 0.726708 | 0.148547 | 1.000000 | 0.016484 | 45.70 |
| 497 | 0.664384 | 0.000000 | 0.000000 | 0.560703 | 0.000000 | 0.404070 | 0.411440 | 0.244505 | 48.79 |
| 498 | 0.594977 | 0.525876 | 0.000000 | 0.344249 | 0.360248 | 0.417733 | 0.405921 | 0.005495 | 35.30 |
| 499 | 0.335845 | 0.000000 | 0.493753 | 0.289936 | 0.397516 | 0.543023 | 0.740090 | 0.074176 | 30.85 |

500 rows × 9 columns

In [ ]:
# Split features (first 8 columns) and label (Strength) into NumPy arrays
train_X = train_data.iloc[:, :-1].to_numpy()
test_X = test_data.iloc[:, :-1].to_numpy()
train_Y = train_data.iloc[:, -1].to_numpy()
test_Y = test_data.iloc[:, -1].to_numpy()
In [ ]:
print(train_X.shape)
print(train_Y.shape)
(500, 8)
(500,)
In [ ]:
print(np.max(train_Y))
68.5

Define Surrogate Model

  • Gaussian Process Regression (GPR)
    • GPR derives a distribution over functions that map inputs to label values, based on the training data
    • The form of these functions is determined by a kernel function
      • We will use a kernel based on the Matérn kernel (a toy illustration follows below)
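
As a quick illustration of what the surrogate returns, the toy 1D example below (hypothetical data, not part of the lab) fits a GPR with a Matérn kernel and queries both a predicted mean and a standard deviation for new points:

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

# Toy 1D training data (for illustration only)
X = np.linspace(0, 1, 8).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel()

gpr = GaussianProcessRegressor(kernel = Matern(nu = 1.5), normalize_y = True).fit(X, y)

# Predicted label and uncertainty for each query point
X_query = np.array([[0.25], [0.90]])
mean, std = gpr.predict(X_query, return_std = True)
print(mean, std)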

Define Utility Function

  • Expected Improvement (EI)
    • Fuses an exploration strategy into the probability of improvement (PI)
    • Weights the PI value by the difference between the predicted mean and the current maximum value
    • The probability of obtaining a label value larger than the existing points is important, but how large the obtained value is matters as well (see the formula below)
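
In formula form (notation introduced here for clarity: $\mu(x)$ and $\sigma(x)$ are the surrogate's predicted mean and standard deviation at a candidate $x$, $y_{\text{best}}$ is the best currently known value, $\varepsilon$ is a small exploration parameter, and $\Phi$, $\phi$ are the standard normal CDF and PDF), a common way to write expected improvement is

$$
\mathrm{EI}(x) =
\begin{cases}
\left(\mu(x) - y_{\text{best}} - \varepsilon\right)\Phi(z) + \sigma(x)\,\phi(z), & \sigma(x) > 0 \\
0, & \sigma(x) = 0
\end{cases}
\qquad
z = \frac{\mu(x) - y_{\text{best}} - \varepsilon}{\sigma(x)}
$$

The first term rewards candidates whose predicted mean already exceeds the current best (exploitation); the second term rewards candidates with large predictive uncertainty (exploration).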


In [ ]:
def upperConfidenceBound(xdata, gpr, epsilon):
    # Upper confidence bound: predicted mean plus epsilon-weighted standard deviation
    yu_pred, usigma = gpr.predict(xdata, return_std = True)
    ucb = np.empty(yu_pred.size, dtype = float)
    for ii in range(0, yu_pred.size):
        if usigma[ii] > 0:
            ucb[ii] = (yu_pred[ii] + epsilon * usigma[ii])
        else:
            ucb[ii] = 0.0
    return ucb

def probabilityOfImprovement(xdata, gpr, ybest, epsilon):
    # Probability that a candidate's label exceeds the current best by at least epsilon
    yp_pred, psigma = gpr.predict(xdata, return_std = True)
    poI = np.empty(yp_pred.size, dtype = float)
    for ii in range(0, yp_pred.size):
        if psigma[ii] > 0:
            zzval = (yp_pred[ii] - ybest - epsilon) / float(psigma[ii])
            poI[ii] = norm.cdf(zzval)
        else:
            poI[ii] = 0.0
    return poI

def expectedImprovement(xdata, gpr, ybest, epsilon):
    # Expected improvement: probability of improvement weighted by its expected magnitude
    ye_pred, esigma = gpr.predict(xdata, return_std = True)
    expI = np.empty(ye_pred.size, dtype = float)
    for ii in range(0, ye_pred.size):
        if esigma[ii] > 0:
            zzval = (ye_pred[ii] - ybest - epsilon) / float(esigma[ii])
            expI[ii] = (ye_pred[ii] - ybest - epsilon) * norm.cdf(zzval) + esigma[ii] * norm.pdf(zzval)
        else:
            expI[ii] = 0.0
    return expI

3.1. Build The Active Learning Framework


Step 1: Train GPR Model and Make Predictions

In [ ]:
# Anisotropic Matern kernel (nu = 1.5) scaled by a constant kernel, plus additive white noise
cmean = [1.0] * 8
cbound = [[1e-3, 1e3]] * 8
kernel = C(1.0, (1e-3, 1e3)) * matk(cmean, cbound, 1.5) + Wht(1.0, (1e-3, 1e3))

gp = GaussianProcessRegressor(kernel = kernel, n_restarts_optimizer = 40, normalize_y = False, random_state = 10)
gp.fit(train_X, train_Y)

# Predicted mean and standard deviation on the training data
yt_pred, tsigma = gp.predict(train_X, return_std = True)
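
As an optional sanity check (not part of the original lab), the learned kernel hyperparameters and the in-sample fit can be inspected before the surrogate is used for acquisition:

# Kernel after hyperparameter optimization and in-sample R^2
print(gp.kernel_)
print('Training R^2:', gp.score(train_X, train_Y))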

Step 2: Evaluate The Utility Function and Suggest Next Data Point for Evaluation

In [ ]:
# Current best value: maximum predicted label over the training data
y_bestloc = np.argmax(yt_pred)
y_best_pred = yt_pred[y_bestloc]

# Score every unlabeled point with the chosen utility function
uf_values = expectedImprovement(test_X, gp, y_best_pred, epsilon = 0.01)
#uf_values = probabilityOfImprovement(test_X, gp, y_best_pred, epsilon = 0.01)
#uf_values = upperConfidenceBound(test_X, gp, epsilon = 0.01)

uf_maxloc = np.argmax(uf_values)

print('Index of unlabeled data recommended to be sampled:', uf_maxloc)
print('\nRecommended experiment parameter:', test_X[uf_maxloc])
Index of unlabeled data recommended to be sampled: 513

Recommended experiment parameter: [1.         0.         0.         0.40894569 0.         0.94186047
 0.04766683 0.73901099]
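
If the experiment budget allows several runs per round, an optional extension (not in the original lab) is to inspect the few highest-utility candidates instead of only the single best one:

# Indices of the five unlabeled points with the largest utility values
top5 = np.argsort(uf_values)[::-1][:5]
print(top5)
print(uf_values[top5])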

Step 3: Add Real Label Values for The Sampled Unlabeled Data to The Training Data

  • Label values are obtained by running experiments
In [ ]:
exp_parameter = test_X[uf_maxloc].reshape(1, -1)
exp_result = test_Y[uf_maxloc].reshape(1, -1)

print(exp_result)
[[74.17]]
In [ ]:
# Append the newly labeled point to the training set and remove it from the unlabeled pool
train_X = np.vstack([train_X, exp_parameter])
train_Y = np.append(train_Y, exp_result)
test_X = np.delete(test_X, uf_maxloc, axis = 0)
test_Y = np.delete(test_Y, uf_maxloc, axis = 0)

print('Number of training dataset =', train_X.shape[0])
print('Number of test dataset =', test_X.shape[0])
Number of training dataset = 501
Number of test dataset = 529
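
Steps 1-3 form one acquisition round. In practice they are repeated until the experiment budget is exhausted. A minimal sketch of such a loop, reusing the kernel, utility function, and arrays defined above and treating the held-out test_Y values as the (virtual) experimental measurements, might look like this (the number of rounds and the reduced number of optimizer restarts are assumptions for illustration):

# Repeat the acquire-label-retrain cycle for a few more rounds (illustrative sketch)
n_rounds = 5
for rr in range(n_rounds):
    gp = GaussianProcessRegressor(kernel = kernel, n_restarts_optimizer = 10,
                                  normalize_y = False, random_state = 10)
    gp.fit(train_X, train_Y)

    yt_pred, _ = gp.predict(train_X, return_std = True)
    y_best_pred = np.max(yt_pred)

    uf_values = expectedImprovement(test_X, gp, y_best_pred, epsilon = 0.01)
    uf_maxloc = np.argmax(uf_values)

    # "Run the experiment": the held-out label plays the role of the measurement
    train_X = np.vstack([train_X, test_X[uf_maxloc].reshape(1, -1)])
    train_Y = np.append(train_Y, test_Y[uf_maxloc])
    test_X = np.delete(test_X, uf_maxloc, axis = 0)
    test_Y = np.delete(test_Y, uf_maxloc, axis = 0)

    print('Round', rr + 1, '| best strength so far =', np.max(train_Y))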
In [1]:
%%javascript
$.getScript('https://kmahelona.github.io/ipython_notebook_goodies/ipython_notebook_toc.js')