Active Learning


By Prof. Seungchul Lee
http://iailab.kaist.ac.kr/
Industrial AI Lab at KAIST

Table of Contents



1. Active Learning: Two Purposes

Similar to traditional active learning in that it involves sampling from an unlabeled dataset, but…

  • The purpose of sampling is different

  • Objective of "Active Learning with Bayesian Optimization": sample the unlabeled data that are expected to have the highest label values

  • (Example) Sampling unlabeled process parameters to maximize productivity



2. Active Learning with Bayesian Optimization

  • Two important components (a minimal sketch combining them follows below)
    • Surrogate model
      • Returns predicted labels and the uncertainties of those predictions
    • Utility function
      • Uses the surrogate model's outputs to measure which unlabeled data points are more likely to have label values higher than the currently known ones
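
The two components are combined in an iterative loop: fit the surrogate on the labeled data, score the unlabeled pool with the utility function, label the highest-scoring candidate, and repeat. The sketch below illustrates this loop on synthetic data; the `true_label` oracle, the pool sizes, and the use of expected improvement are assumptions for illustration only, not part of the lab that follows.

import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

rng = np.random.default_rng(0)
true_label = lambda x: -np.sum((x - 0.3)**2, axis = 1)   # hypothetical objective ("experiment")

X_labeled = rng.random((10, 2))                          # initial labeled data
y_labeled = true_label(X_labeled)
X_pool = rng.random((200, 2))                            # unlabeled candidate pool
xi = 0.01                                                # exploration parameter

for _ in range(5):
    # 1) Surrogate model: predicted labels + uncertainties for the pool
    gpr = GaussianProcessRegressor(kernel = Matern(nu = 1.5), normalize_y = True)
    gpr.fit(X_labeled, y_labeled)
    mu, sigma = gpr.predict(X_pool, return_std = True)

    # 2) Utility function (expected improvement) over the unlabeled pool
    y_best = np.max(y_labeled)
    z = (mu - y_best - xi) / np.maximum(sigma, 1e-12)
    ei = (mu - y_best - xi) * norm.cdf(z) + sigma * norm.pdf(z)

    # 3) "Run the experiment" on the most promising candidate and add it to the training set
    idx = np.argmax(ei)
    X_labeled = np.vstack([X_labeled, X_pool[idx]])
    y_labeled = np.append(y_labeled, true_label(X_pool[idx:idx + 1]))
    X_pool = np.delete(X_pool, idx, axis = 0)

print('Best label found:', y_labeled.max())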


3. Lab


Load Python Packages

In [ ]:
import numpy as np
import pandas as pd
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel as C, WhiteKernel as Wht, Matern as matk

from warnings import filterwarnings
filterwarnings('ignore')

Load Data into Pandas Dataframe

Download

In [ ]:
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
In [ ]:
train_data = pd.read_csv('/content/drive/MyDrive/ML/ML_data/AL_train.csv')
test_data = pd.read_csv('/content/drive/MyDrive/ML/ML_data/AL_test.csv')

print('Shape of training dataset =', train_data.shape)
print('Shape of test dataset =', test_data.shape)
Shape of training dataset = (500, 9)
Shape of test dataset = (530, 9)
In [ ]:
train_data
Out[ ]:
|     | Cement | Blast Furnace Slag | Fly Ash | Water | Superplasticizer | Coarse Aggregate | Fine Aggregate | Age | Strength |
|-----|--------|--------------------|---------|-------|------------------|------------------|----------------|-----|----------|
| 0   | 0.342009 | 0.000000 | 0.499250 | 0.194089 | 0.385093 | 0.595930 | 0.767185 | 0.035714 | 33.36 |
| 1   | 0.085845 | 0.582638 | 0.000000 | 0.560703 | 0.000000 | 0.715116 | 0.534119 | 0.016484 | 14.59 |
| 2   | 0.305936 | 0.000000 | 0.000000 | 0.576677 | 0.000000 | 0.485465 | 0.730055 | 0.244505 | 21.95 |
| 3   | 0.541553 | 0.000000 | 0.000000 | 0.510383 | 0.000000 | 0.779651 | 0.402158 | 0.016484 | 21.18 |
| 4   | 0.399543 | 0.000000 | 0.000000 | 0.552716 | 0.000000 | 0.485465 | 0.657301 | 0.035714 | 21.26 |
| ... | ...      | ...      | ...      | ...      | ...      | ...      | ...      | ...      | ...   |
| 495 | 0.538584 | 0.525876 | 0.000000 | 0.424121 | 0.295031 | 0.417733 | 0.405921 | 0.247253 | 67.80 |
| 496 | 0.623288 | 0.260991 | 0.000000 | 0.038339 | 0.726708 | 0.148547 | 1.000000 | 0.016484 | 45.70 |
| 497 | 0.664384 | 0.000000 | 0.000000 | 0.560703 | 0.000000 | 0.404070 | 0.411440 | 0.244505 | 48.79 |
| 498 | 0.594977 | 0.525876 | 0.000000 | 0.344249 | 0.360248 | 0.417733 | 0.405921 | 0.005495 | 35.30 |
| 499 | 0.335845 | 0.000000 | 0.493753 | 0.289936 | 0.397516 | 0.543023 | 0.740090 | 0.074176 | 30.85 |

500 rows × 9 columns

In [ ]:
# Split features (first 8 columns) and label (Strength) into NumPy arrays
train_X = train_data.iloc[:, :-1].to_numpy()
test_X = test_data.iloc[:, :-1].to_numpy()
train_Y = train_data.iloc[:, -1].to_numpy()
test_Y = test_data.iloc[:, -1].to_numpy()
In [ ]:
print(train_X.shape)
print(train_Y.shape)
(500, 8)
(500,)
In [ ]:
print(np.max(train_Y))
68.5

Define Surrogate Model

  • Gaussian Process Regression (GPR)
    • GPR derives a distribution over functions that map inputs to label values, based on the training data
    • The form of these functions is determined by a kernel function
      • We will use a kernel based on the Matérn kernel (a toy illustration follows below)
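
As a quick illustration of what the surrogate returns, the toy 1D example below (hypothetical data, not part of the lab) fits a GPR with a Matérn kernel and queries both a predicted mean and a standard deviation for new points:

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

# Toy 1D training data (for illustration only)
X = np.linspace(0, 1, 8).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel()

gpr = GaussianProcessRegressor(kernel = Matern(nu = 1.5), normalize_y = True).fit(X, y)

# Predicted label and uncertainty for each query point
X_query = np.array([[0.25], [0.90]])
mean, std = gpr.predict(X_query, return_std = True)
print(mean, std)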

Define Utility Function

  • Expected Improvement (EI)
    • Fuses an exploration strategy into the probability of improvement (PI)
    • Weights the PI value by the difference between the predicted mean and the current maximum value
    • The probability of obtaining a label value larger than the existing points is important, but how large the obtained value is matters as well (see the formula below)
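
In formula form (notation introduced here for clarity: $\mu(x)$ and $\sigma(x)$ are the surrogate's predicted mean and standard deviation at a candidate $x$, $y_{\text{best}}$ is the best currently known value, $\varepsilon$ is a small exploration parameter, and $\Phi$, $\phi$ are the standard normal CDF and PDF), a common way to write expected improvement is

$$
\mathrm{EI}(x) =
\begin{cases}
\left(\mu(x) - y_{\text{best}} - \varepsilon\right)\Phi(z) + \sigma(x)\,\phi(z), & \sigma(x) > 0 \\
0, & \sigma(x) = 0
\end{cases}
\qquad
z = \frac{\mu(x) - y_{\text{best}} - \varepsilon}{\sigma(x)}
$$

The first term rewards candidates whose predicted mean already exceeds the current best (exploitation); the second term rewards candidates with large predictive uncertainty (exploration).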


In [ ]:
def upperConfidenceBound(xdata, gpr, epsilon):
    # Upper confidence bound: predicted mean plus epsilon-weighted standard deviation
    yu_pred, usigma = gpr.predict(xdata, return_std = True)
    ucb = np.empty(yu_pred.size, dtype = float)
    for ii in range(0, yu_pred.size):
        if usigma[ii] > 0:
            ucb[ii] = (yu_pred[ii] + epsilon * usigma[ii])
        else:
            ucb[ii] = 0.0
    return ucb

def probabilityOfImprovement(xdata, gpr, ybest, epsilon):
    # Probability that a candidate's label exceeds the current best by at least epsilon
    yp_pred, psigma = gpr.predict(xdata, return_std = True)
    poI = np.empty(yp_pred.size, dtype = float)
    for ii in range(0, yp_pred.size):
        if psigma[ii] > 0:
            zzval = (yp_pred[ii] - ybest - epsilon) / float(psigma[ii])
            poI[ii] = norm.cdf(zzval)
        else:
            poI[ii] = 0.0
    return poI

def expectedImprovement(xdata, gpr, ybest, epsilon):
    # Expected improvement: probability of improvement weighted by its expected magnitude
    ye_pred, esigma = gpr.predict(xdata, return_std = True)
    expI = np.empty(ye_pred.size, dtype = float)
    for ii in range(0, ye_pred.size):
        if esigma[ii] > 0:
            zzval = (ye_pred[ii] - ybest - epsilon) / float(esigma[ii])
            expI[ii] = (ye_pred[ii] - ybest - epsilon) * norm.cdf(zzval) + esigma[ii] * norm.pdf(zzval)
        else:
            expI[ii] = 0.0
    return expI

3.1. Build The Active Learning Framework


Step 1: Train GPR Model and Make Predictions

In [ ]:
# Anisotropic Matern kernel (nu = 1.5) scaled by a constant kernel, plus additive white noise
cmean = [1.0] * 8
cbound = [[1e-3, 1e3]] * 8
kernel = C(1.0, (1e-3, 1e3)) * matk(cmean, cbound, 1.5) + Wht(1.0, (1e-3, 1e3))

gp = GaussianProcessRegressor(kernel = kernel, n_restarts_optimizer = 40, normalize_y = False, random_state = 10)
gp.fit(train_X, train_Y)

# Predicted mean and standard deviation on the training data
yt_pred, tsigma = gp.predict(train_X, return_std = True)
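
As an optional sanity check (not part of the original lab), the learned kernel hyperparameters and the in-sample fit can be inspected before the surrogate is used for acquisition:

# Kernel after hyperparameter optimization and in-sample R^2
print(gp.kernel_)
print('Training R^2:', gp.score(train_X, train_Y))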

Step 2: Evaluate The Utility Function and Suggest Next Data Point for Evaluation

In [ ]:
# Current best value: maximum predicted label over the training data
y_bestloc = np.argmax(yt_pred)
y_best_pred = yt_pred[y_bestloc]

# Score every unlabeled point with the chosen utility function
uf_values = expectedImprovement(test_X, gp, y_best_pred, epsilon = 0.01)
#uf_values = probabilityOfImprovement(test_X, gp, y_best_pred, epsilon = 0.01)
#uf_values = upperConfidenceBound(test_X, gp, epsilon = 0.01)

uf_maxloc = np.argmax(uf_values)

print('Index of unlabeled data recommended to be sampled:', uf_maxloc)
print('\nRecommended experiment parameter:', test_X[uf_maxloc])
Index of unlabeled data recommended to be sampled: 513

Recommended experiment parameter: [1.         0.         0.         0.40894569 0.         0.94186047
 0.04766683 0.73901099]
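
If the experiment budget allows several runs per round, an optional extension (not in the original lab) is to inspect the few highest-utility candidates instead of only the single best one:

# Indices of the five unlabeled points with the largest utility values
top5 = np.argsort(uf_values)[::-1][:5]
print(top5)
print(uf_values[top5])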

Step 3: Add Real Label Values for The Sampled Unlabeled Data to The Training Data

  • Label values are obtained by running experiments
In [ ]:
exp_parameter = test_X[uf_maxloc].reshape(1, -1)
exp_result = test_Y[uf_maxloc].reshape(1, -1)

print(exp_result)
[[74.17]]
In [ ]:
# Append the newly labeled point to the training set and remove it from the unlabeled pool
train_X = np.vstack([train_X, exp_parameter])
train_Y = np.append(train_Y, exp_result)
test_X = np.delete(test_X, uf_maxloc, axis = 0)
test_Y = np.delete(test_Y, uf_maxloc, axis = 0)

print('Number of training dataset =', train_X.shape[0])
print('Number of test dataset =', test_X.shape[0])
Number of training dataset = 501
Number of test dataset = 529
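
Steps 1-3 form one acquisition round. In practice they are repeated until the experiment budget is exhausted. A minimal sketch of such a loop, reusing the kernel, utility function, and arrays defined above and treating the held-out test_Y values as the (virtual) experimental measurements, might look like this (the number of rounds and the reduced number of optimizer restarts are assumptions for illustration):

# Repeat the acquire-label-retrain cycle for a few more rounds (illustrative sketch)
n_rounds = 5
for rr in range(n_rounds):
    gp = GaussianProcessRegressor(kernel = kernel, n_restarts_optimizer = 10,
                                  normalize_y = False, random_state = 10)
    gp.fit(train_X, train_Y)

    yt_pred, _ = gp.predict(train_X, return_std = True)
    y_best_pred = np.max(yt_pred)

    uf_values = expectedImprovement(test_X, gp, y_best_pred, epsilon = 0.01)
    uf_maxloc = np.argmax(uf_values)

    # "Run the experiment": the held-out label plays the role of the measurement
    train_X = np.vstack([train_X, test_X[uf_maxloc].reshape(1, -1)])
    train_Y = np.append(train_Y, test_Y[uf_maxloc])
    test_X = np.delete(test_X, uf_maxloc, axis = 0)
    test_Y = np.delete(test_Y, uf_maxloc, axis = 0)

    print('Round', rr + 1, '| best strength so far =', np.max(train_Y))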
In [1]:
%%javascript
$.getScript('https://kmahelona.github.io/ipython_notebook_goodies/ipython_notebook_toc.js')