Active Learning

By Prof. Seungchul Lee
Industrial AI Lab at KAIST


1. Active Learning: Two Purposes

Active learning with Bayesian optimization resembles traditional active learning in that both sample from an unlabeled dataset, but:

  • The purpose of sampling is different

  • Objective of "Active Learning with Bayesian Optimization": sample the unlabeled data expected to have the highest label values

  • (Example) Sampling unlabeled process parameters to maximize productivity
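
The difference in purpose can be illustrated with a minimal sketch. The predicted means and standard deviations below are made-up numbers, and the two selection rules (uncertainty sampling for traditional active learning, a mean-plus-uncertainty score for the optimization-oriented variant) are only common representatives of each purpose, not the only possible choices.

```python
import numpy as np

# Hypothetical surrogate predictions for three unlabeled candidates.
mu = np.array([10.0, 30.0, 20.0])    # predicted label values
sigma = np.array([5.0, 1.0, 2.0])    # predicted uncertainties (std)

# Purpose 1 (traditional active learning): improve the model,
# e.g. by labeling the candidate the model is least sure about.
uncertainty_pick = int(np.argmax(sigma))

# Purpose 2 (with Bayesian optimization): find a high label value,
# e.g. by scoring each candidate with mean + kappa * std.
kappa = 1.0                          # assumed exploration weight
optimization_pick = int(np.argmax(mu + kappa * sigma))

print('Uncertainty sampling picks candidate', uncertainty_pick)
print('Optimization-oriented sampling picks candidate', optimization_pick)
```

With these numbers the two purposes select different candidates: the first rule chooses the most uncertain point, while the second chooses the point that looks most promising for a high label.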

2. Active Learning with Bayesian Optimization

  • Two important components
    • Surrogate model
      • Returns predicted labels together with the uncertainty of each prediction
    • Utility function
      • Uses the outputs of the surrogate model to score which unlabeled data are most likely to have label values higher than the best value currently known

3. Lab

Load Python Packages

In [ ]:
import numpy as np
import pandas as pd
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel as C, WhiteKernel as Wht, Matern as matk

from warnings import filterwarnings

Load Data into Pandas Dataframe


In [ ]:
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
In [ ]:
train_data = pd.read_csv('/content/drive/MyDrive/ML/ML_data/AL_train.csv')
test_data = pd.read_csv('/content/drive/MyDrive/ML/ML_data/AL_test.csv')

print('Shape of training dataset =', train_data.shape)
print('Shape of test dataset =', test_data.shape)
Shape of training dataset = (500, 9)
Shape of test dataset = (530, 9)
In [ ]:
train_data
Out[ ]:
       Cement  Blast Furnace Slag   Fly Ash     Water  Superplasticizer  Coarse Aggregate  Fine Aggregate       Age  Strength
  0  0.342009            0.000000  0.499250  0.194089          0.385093          0.595930        0.767185  0.035714     33.36
  1  0.085845            0.582638  0.000000  0.560703          0.000000          0.715116        0.534119  0.016484     14.59
  2  0.305936            0.000000  0.000000  0.576677          0.000000          0.485465        0.730055  0.244505     21.95
  3  0.541553            0.000000  0.000000  0.510383          0.000000          0.779651        0.402158  0.016484     21.18
  4  0.399543            0.000000  0.000000  0.552716          0.000000          0.485465        0.657301  0.035714     21.26
...       ...                 ...       ...       ...               ...               ...             ...       ...       ...
495  0.538584            0.525876  0.000000  0.424121          0.295031          0.417733        0.405921  0.247253     67.80
496  0.623288            0.260991  0.000000  0.038339          0.726708          0.148547        1.000000  0.016484     45.70
497  0.664384            0.000000  0.000000  0.560703          0.000000          0.404070        0.411440  0.244505     48.79
498  0.594977            0.525876  0.000000  0.344249          0.360248          0.417733        0.405921  0.005495     35.30
499  0.335845            0.000000  0.493753  0.289936          0.397516          0.543023        0.740090  0.074176     30.85

500 rows × 9 columns

In [ ]:
train_X = train_data.iloc[:, :-1].to_numpy()
test_X = test_data.iloc[:, :-1].to_numpy()
train_Y = train_data.iloc[:, -1].to_numpy()
test_Y = test_data.iloc[:, -1].to_numpy()
In [ ]:
print(train_X.shape)
(500, 8)

Define Surrogate Model

  • Gaussian Process Regression (GPR)
    • GPR infers, from the training data, a distribution over functions that map inputs to label values
    • The form of these functions is determined by a kernel function
      • We will use a kernel built on the Matern kernel
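
As a rough illustration of what the surrogate model returns, the sketch below fits a GPR with a Matern-based kernel to a small toy dataset. The 1-D sine data, the variable names, and the reduced number of optimizer restarts are assumptions made for this example only; the kernel structure (constant × Matern + white noise) follows the one used later in the lab.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel, Matern, WhiteKernel

# Toy 1-D data (hypothetical): noisy samples of a sine curve on [0, 5].
rng = np.random.default_rng(0)
X_toy = np.linspace(0.0, 5.0, 12).reshape(-1, 1)
y_toy = np.sin(X_toy).ravel() + 0.05 * rng.standard_normal(12)

# Signal variance * Matern(nu = 1.5) + white noise.
toy_kernel = (ConstantKernel(1.0, (1e-3, 1e3))
              * Matern(length_scale=1.0, length_scale_bounds=(1e-3, 1e3), nu=1.5)
              + WhiteKernel(1.0, (1e-3, 1e3)))

toy_gp = GaussianProcessRegressor(kernel=toy_kernel,
                                  n_restarts_optimizer=2,
                                  random_state=0)
toy_gp.fit(X_toy, y_toy)

# Query one point inside the sampled range and one far outside it.
X_query = np.array([[2.5], [10.0]])
mean, std = toy_gp.predict(X_query, return_std=True)

print('Fitted kernel:', toy_gp.kernel_)
print('Predicted means:', mean)
print('Predicted stds :', std)
```

The query point far from the training data should come back with a larger standard deviation than the one inside the sampled range, which is the uncertainty information the utility function relies on.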

Define Utility Function

  • Expected Improvement (EI)
    • Adds an exploration strategy to probability of improvement (PI)
    • Weights the PI value by the difference between the predicted mean and the current maximum value
    • The probability of obtaining a larger label value than the existing points matters, but so does how much larger that value is
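
A small worked example may help show why the size of the improvement matters. The numbers below are hypothetical, and the exploration offset epsilon is set to zero for simplicity. Candidate A is slightly above the current best with little uncertainty; candidate B has a lower mean but a much wider predictive distribution.

```python
import numpy as np
from scipy.stats import norm

y_best = 50.0                    # best label value so far (hypothetical)
mu = np.array([50.5, 49.0])      # surrogate means for candidates A and B
sigma = np.array([0.5, 5.0])     # surrogate standard deviations

z = (mu - y_best) / sigma

# PI: probability that the label exceeds y_best.
pi = norm.cdf(z)

# EI: expected amount by which the label exceeds y_best.
ei = (mu - y_best) * norm.cdf(z) + sigma * norm.pdf(z)

for name, p, e in zip(['A', 'B'], pi, ei):
    print(f'Candidate {name}: PI = {p:.3f}, EI = {e:.3f}')

print('PI prefers candidate', ['A', 'B'][int(np.argmax(pi))])
print('EI prefers candidate', ['A', 'B'][int(np.argmax(ei))])
```

PI favors A because A is more likely to beat the current best, even if only by a small margin. EI favors B because its wide distribution leaves room for a much larger gain.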

In [ ]:
def upperConfidenceBound(xdata, gpr, epsilon):
    # UCB = predicted mean + epsilon * predicted std
    yu_pred, usigma = gpr.predict(xdata, return_std = True)
    ucb = np.empty(yu_pred.size, dtype = float)
    for ii in range(0, yu_pred.size):
        if usigma[ii] > 0:
            ucb[ii] = (yu_pred[ii] + epsilon * usigma[ii])
        else:
            ucb[ii] = 0.0
    return ucb

def probabilityOfImprovement(xdata, gpr, ybest, epsilon):
    # PI = probability that the label exceeds ybest + epsilon
    yp_pred, psigma = gpr.predict(xdata, return_std = True)
    poI = np.empty(yp_pred.size, dtype = float)
    for ii in range(0, yp_pred.size):
        if psigma[ii] > 0:
            zzval = (yp_pred[ii] - ybest - epsilon) / float(psigma[ii])
            poI[ii] = norm.cdf(zzval)
        else:
            poI[ii] = 0.0
    return poI

def expectedImprovement(xdata, gpr, ybest, epsilon):
    # EI = expected amount by which the label exceeds ybest + epsilon
    ye_pred, esigma = gpr.predict(xdata, return_std = True)
    expI = np.empty(ye_pred.size, dtype = float)
    for ii in range(0, ye_pred.size):
        if esigma[ii] > 0:
            # epsilon enters both the improvement term and the z-value
            zzval = (ye_pred[ii] - ybest - epsilon) / float(esigma[ii])
            expI[ii] = (ye_pred[ii] - ybest - epsilon) * norm.cdf(zzval) + esigma[ii] * norm.pdf(zzval)
        else:
            expI[ii] = 0.0
    return expI

3.1. Build The Active Learning Framework

Step 1: Train GPR Model and Make Predictions

In [ ]:
cmean = [1.0] * 8
cbound = [[1e-3, 1e3]] * 8
kernel = C(1.0, (1e-3, 1e3)) * matk(cmean, cbound, 1.5) + Wht(1.0, (1e-3, 1e3))

gp = GaussianProcessRegressor(kernel = kernel, n_restarts_optimizer = 40, normalize_y = False, random_state = 10).fit(train_X, train_Y)

yt_pred, tsigma = gp.predict(train_X, return_std = True)

Step 2: Evaluate The Utility Function and Suggest Next Data Point for Evaluation

In [ ]:
y_bestloc = np.argmax(yt_pred)
y_best_pred = yt_pred[y_bestloc]

uf_values = expectedImprovement(test_X, gp, y_best_pred, epsilon = 0.01)
#uf_values = probabilityOfImprovement(test_X, gp, y_best_pred, epsilon = 0.01)
#uf_values = upperConfidenceBound(test_X, gp, epsilon = 0.01)

uf_maxloc = np.argmax(uf_values)

print('Index of unlabeled data recommended to be sampled:', uf_maxloc)
print('\nRecommended experiment parameter:', test_X[uf_maxloc])
Index of unlabeled data recommended to be sampled: 513

Recommended experiment parameter: [1.         0.         0.         0.40894569 0.         0.94186047
 0.04766683 0.73901099]

Step 3: Add Real Label Values for The Sampled Unlabeled Data to The Training Data

  • Label values are obtained by running experiments; in this lab, the held-out test labels stand in for the experimental results
In [ ]:
exp_parameter = test_X[uf_maxloc].reshape(1, -1)
exp_result = test_Y[uf_maxloc].reshape(1, -1)

In [ ]:
train_X = np.vstack([train_X, exp_parameter])
train_Y = np.append(train_Y, exp_result)
test_X = np.delete(test_X, uf_maxloc, axis = 0)
test_Y = np.delete(test_Y, uf_maxloc, axis = 0)

print('Number of training dataset =', train_X.shape[0])
print('Number of test dataset =', test_X.shape[0])
Number of training dataset = 501
Number of test dataset = 529
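
Steps 1 to 3 are normally repeated until the experiment budget runs out. The following is a minimal, self-contained sketch of such a loop. It uses a synthetic 2-D pool and a made-up `hidden_label` function in place of real experiments, and a vectorized `expected_improvement` helper written for this example. All names, the pool size, and the five-iteration budget are assumptions, not part of the lab data.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel, Matern, WhiteKernel

def expected_improvement(mu, sigma, y_best, epsilon=0.01):
    """Vectorized EI; candidates with zero predicted std get utility 0."""
    ei = np.zeros_like(mu)
    ok = sigma > 0
    gain = mu[ok] - y_best - epsilon
    z = gain / sigma[ok]
    ei[ok] = gain * norm.cdf(z) + sigma[ok] * norm.pdf(z)
    return ei

def hidden_label(X):
    """Stand-in for an experiment: the label peaks at (0.7, 0.7)."""
    return -np.sum((X - 0.7) ** 2, axis=1)

rng = np.random.default_rng(0)
pool_X = rng.random((60, 2))
pool_Y = hidden_label(pool_X)

# The first 8 points play the role of the initial labeled training set.
lab_X, lab_Y = pool_X[:8], pool_Y[:8]
pool_X, pool_Y = pool_X[8:], pool_Y[8:]

kernel = (ConstantKernel(1.0, (1e-3, 1e3))
          * Matern(length_scale=[1.0, 1.0], length_scale_bounds=(1e-3, 1e3), nu=1.5)
          + WhiteKernel(1e-2, (1e-5, 1e1)))

for iteration in range(5):
    # Step 1: train the surrogate and predict on the unlabeled pool.
    gp = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=2,
                                  random_state=10).fit(lab_X, lab_Y)
    mu, sigma = gp.predict(pool_X, return_std=True)

    # Step 2: evaluate the utility function and pick the best candidate.
    y_best = gp.predict(lab_X).max()
    pick = int(np.argmax(expected_improvement(mu, sigma, y_best)))

    # Step 3: "run the experiment" and move the point to the labeled set.
    lab_X = np.vstack([lab_X, pool_X[pick]])
    lab_Y = np.append(lab_Y, pool_Y[pick])
    pool_X = np.delete(pool_X, pick, axis=0)
    pool_Y = np.delete(pool_Y, pick, axis=0)

    print(f'Iteration {iteration + 1}: best label so far = {lab_Y.max():.4f}')
```

Refitting the surrogate from scratch in every iteration is the simplest option and mirrors the lab; for larger pools, one might instead select several candidates per iteration to reduce the number of refits.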