Active Learning
By Prof. Seungchul Lee
http://iailab.kaist.ac.kr/
Industrial AI Lab at KAIST
1. Active Learning: Two Purposes
Active learning with Bayesian optimization resembles traditional active learning in that both sample from an unlabeled dataset, but the purpose of the sampling differs.
- Traditional active learning: sample the unlabeled data whose labels are expected to improve the model most
- Objective of "Active Learning with Bayesian Optimization": sample the unlabeled data expected to have the highest label values
- (Example) Sampling unlabeled process parameters to maximize productivity
2. Active Learning with Bayesian Optimization
- Two important components
  - Surrogate model
    - Returns predicted labels and the uncertainties of those predictions
  - Utility function
    - Uses the surrogate model's outputs to score which unlabeled data are most likely to have label values higher than the best currently known label value
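To make the division of labor concrete, the following is a minimal sketch of how a utility function might consume surrogate outputs. The means, standard deviations, and best-so-far value are made-up numbers for illustration, and probability of improvement is used here only as one possible utility.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical surrogate outputs for three unlabeled candidates
# (illustrative numbers, not taken from the lab dataset).
mu = np.array([0.90, 1.05, 0.80])      # predicted label values
sigma = np.array([0.05, 0.02, 0.40])   # predictive standard deviations
y_best = 1.00                          # best label value known so far

# Utility: probability that each candidate's label exceeds y_best.
prob_improve = norm.cdf((mu - y_best) / sigma)

# The candidate with the highest utility is the one to sample next.
best_candidate = int(np.argmax(prob_improve))
print(prob_improve, best_candidate)
```

Note that the third candidate scores higher than the first even though its predicted mean is lower, because its larger uncertainty leaves more room for a value above `y_best`.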
3. Lab
Load Python Packages
In [ ]:
import numpy as np
import pandas as pd
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel as C, WhiteKernel as Wht, Matern as matk
from warnings import filterwarnings
filterwarnings('ignore')
In [ ]:
from google.colab import drive
drive.mount('/content/drive')
In [ ]:
train_data = pd.read_csv('/content/drive/MyDrive/ML/ML_data/AL_train.csv')
test_data = pd.read_csv('/content/drive/MyDrive/ML/ML_data/AL_test.csv')
print('Shape of training dataset =', train_data.shape)
print('Shape of test dataset =', test_data.shape)
In [ ]:
train_data
Out[ ]:
In [ ]:
train_X = train_data.iloc[:, :-1].to_numpy()
test_X = test_data.iloc[:, :-1].to_numpy()
train_Y = train_data.iloc[:, -1].to_numpy()
test_Y = test_data.iloc[:, -1].to_numpy()
In [ ]:
print(train_X.shape)
print(train_Y.shape)
In [ ]:
print(np.max(train_Y))
Define Surrogate Model
- Gaussian Process Regression (GPR)
  - GPR infers a distribution over functions that map inputs to label values, based on the training data
  - The form of these functions is determined by a kernel function
  - We will use a kernel based on the Matern kernel
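As a small illustration of what the surrogate returns, the sketch below fits a GPR with a Matern-based kernel to synthetic 1-D data and queries it at two points. The data, kernel hyperparameters, and query points are assumptions chosen for the example and are unrelated to the lab dataset.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel, Matern, WhiteKernel

# Synthetic training data: noisy samples of sin(x) on [0, 5] (illustrative only).
rng = np.random.default_rng(0)
X_train = rng.uniform(0.0, 5.0, size=(15, 1))
y_train = np.sin(X_train).ravel() + 0.05 * rng.standard_normal(15)

# Signal variance * Matern(nu=1.5) + observation noise.
kernel = ConstantKernel(1.0) * Matern(length_scale=1.0, nu=1.5) + WhiteKernel(0.1)
gpr = GaussianProcessRegressor(kernel=kernel, random_state=0).fit(X_train, y_train)

# Query one point inside the training range and one far outside it.
X_query = np.array([[2.5], [20.0]])
mean, std = gpr.predict(X_query, return_std=True)
print('predicted mean:', mean)
print('predicted std :', std)
```

The predictive standard deviation is typically larger for the point far from the training data, which is the uncertainty signal the utility function relies on.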
Define Utility Function
- Expected Improvement (EI)
  - Incorporates an exploration strategy into probability of improvement (PI)
  - Weights the PI value by the difference between the predicted mean and the current maximum value
  - The probability of obtaining a label value larger than the existing points matters, but so does how much larger that value is
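For reference, the two utilities can be written in their common textbook form. The notation is an assumption introduced here: $\mu(x)$ and $\sigma(x)$ are the surrogate's predicted mean and standard deviation, $y_{\text{best}}$ is the current best value, $\epsilon \ge 0$ is a margin that encourages exploration, and $\Phi$ and $\phi$ are the standard normal CDF and PDF.

```latex
z(x) = \frac{\mu(x) - y_{\text{best}} - \epsilon}{\sigma(x)}

\mathrm{PI}(x) = \Phi\big(z(x)\big)

\mathrm{EI}(x) =
\begin{cases}
\big(\mu(x) - y_{\text{best}} - \epsilon\big)\,\Phi\big(z(x)\big) + \sigma(x)\,\phi\big(z(x)\big), & \sigma(x) > 0 \\
0, & \sigma(x) = 0
\end{cases}
```

The first term of EI is the PI value weighted by the predicted gain, and the second term rewards candidates with high predictive uncertainty.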
In [ ]:
def upperConfidenceBound(xdata, gpr, epsilon):
    # UCB = predicted mean + epsilon * predicted standard deviation
    yu_pred, usigma = gpr.predict(xdata, return_std = True)
    ucb = np.empty(yu_pred.size, dtype = float)
    for ii in range(0, yu_pred.size):
        if usigma[ii] > 0:
            ucb[ii] = (yu_pred[ii] + epsilon * usigma[ii])
        else:
            # With zero uncertainty the bound reduces to the predicted mean
            ucb[ii] = yu_pred[ii]
    return ucb

def probabilityOfImprovement(xdata, gpr, ybest, epsilon):
    # PI = probability that the label exceeds ybest by at least epsilon
    yp_pred, psigma = gpr.predict(xdata, return_std = True)
    poI = np.empty(yp_pred.size, dtype = float)
    for ii in range(0, yp_pred.size):
        if psigma[ii] > 0:
            zzval = (yp_pred[ii] - ybest - epsilon) / float(psigma[ii])
            poI[ii] = norm.cdf(zzval)
        else:
            poI[ii] = 0.0
    return poI

def expectedImprovement(xdata, gpr, ybest, epsilon):
    # EI = expected amount by which the label exceeds ybest + epsilon
    ye_pred, esigma = gpr.predict(xdata, return_std = True)
    expI = np.empty(ye_pred.size, dtype = float)
    for ii in range(0, ye_pred.size):
        if esigma[ii] > 0:
            # Use the same epsilon-shifted improvement in z and in the first term
            zzval = (ye_pred[ii] - ybest - epsilon) / float(esigma[ii])
            expI[ii] = (ye_pred[ii] - ybest - epsilon) * norm.cdf(zzval) + esigma[ii] * norm.pdf(zzval)
        else:
            expI[ii] = 0.0
    return expI
3.1. Build The Active Learning Framework
Step 1: Train GPR Model and Make Predictions
In [ ]:
cmean = [1.0] * 8
cbound = [[1e-3, 1e3]] * 8
kernel = C(1.0, (1e-3, 1e3)) * matk(cmean, cbound, 1.5) + Wht(1.0, (1e-3, 1e3))
gp = GaussianProcessRegressor(kernel = kernel, n_restarts_optimizer = 40, normalize_y = False, random_state = 10)
gp.fit(train_X, train_Y)
yt_pred, tsigma = gp.predict(train_X, return_std = True)
Step 2: Evaluate The Utility Function and Suggest Next Data Point for Evaluation
In [ ]:
y_bestloc = np.argmax(yt_pred)
y_best_pred = yt_pred[y_bestloc]
uf_values = expectedImprovement(test_X, gp, y_best_pred, epsilon = 0.01)
#uf_values = probabilityOfImprovement(test_X, gp, y_best_pred, epsilon = 0.01)
#uf_values = upperConfidenceBound(test_X, gp, epsilon = 0.01)
uf_maxloc = np.argmax(uf_values)
print('Index of unlabeled data recommended to be sampled:', uf_maxloc)
print('\nRecommended experiment parameter:', test_X[uf_maxloc])
Step 3: Add Real Label Values for The Sampled Unlabeled Data to The Training Data
- Label values are obtained by experiment; in this lab, the held-out label in test_Y stands in for the experimental result
In [ ]:
exp_parameter = test_X[uf_maxloc].reshape(1, -1)
exp_result = test_Y[uf_maxloc].reshape(1, -1)
print(exp_result)
In [ ]:
train_X = np.vstack([train_X, exp_parameter])
train_Y = np.append(train_Y, exp_result)
test_X = np.delete(test_X, uf_maxloc, axis = 0)
test_Y = np.delete(test_Y, uf_maxloc, axis = 0)  # keep test_Y aligned with test_X
print('Number of training dataset =', train_X.shape[0])
print('Number of test dataset =', test_X.shape[0])
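Steps 1 to 3 are meant to be repeated. The sketch below shows one way the cycle could be wrapped in a loop, assuming a synthetic 2-D dataset in place of the lab CSV files. The helper names `expected_improvement` and `active_learning_loop` are hypothetical, and unlike the cells above this sketch uses the best observed label (rather than the best predicted value) as the reference for EI.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel, Matern, WhiteKernel

def expected_improvement(mu, sigma, y_best, epsilon=0.01):
    # Vectorized EI; sigma is floored to avoid division by zero.
    sigma = np.maximum(sigma, 1e-12)
    z = (mu - y_best - epsilon) / sigma
    return (mu - y_best - epsilon) * norm.cdf(z) + sigma * norm.pdf(z)

def active_learning_loop(X_lab, y_lab, X_pool, y_pool, n_iter=5):
    # Work on copies so the caller's arrays are left untouched.
    X_lab, y_lab = X_lab.copy(), y_lab.copy()
    X_pool, y_pool = X_pool.copy(), y_pool.copy()
    for _ in range(n_iter):
        # Step 1: refit the surrogate on all labeled data.
        kernel = ConstantKernel(1.0) * Matern(length_scale=1.0, nu=1.5) + WhiteKernel(1e-2)
        gpr = GaussianProcessRegressor(kernel=kernel, random_state=0).fit(X_lab, y_lab)
        # Step 2: score the pool and pick the candidate with the highest EI.
        mu, sigma = gpr.predict(X_pool, return_std=True)
        pick = int(np.argmax(expected_improvement(mu, sigma, y_lab.max())))
        # Step 3: "run the experiment" (look up the held-out label) and move the point.
        X_lab = np.vstack([X_lab, X_pool[pick:pick + 1]])
        y_lab = np.append(y_lab, y_pool[pick])
        X_pool = np.delete(X_pool, pick, axis=0)
        y_pool = np.delete(y_pool, pick)
    return X_lab, y_lab, X_pool, y_pool

# Synthetic stand-in data: the label peaks at the center of the unit square.
rng = np.random.default_rng(0)
X_all = rng.uniform(0.0, 1.0, size=(60, 2))
y_all = -np.sum((X_all - 0.5) ** 2, axis=1)

X_lab, y_lab, X_pool, y_pool = active_learning_loop(
    X_all[:10], y_all[:10], X_all[10:], y_all[10:], n_iter=5)
print('Labeled points:', X_lab.shape[0], '| Pool points:', X_pool.shape[0])
print('Best label found:', y_lab.max())
```

Deleting the sampled row from both the pool inputs and the pool labels in the same iteration keeps the two arrays aligned, which matters as soon as the loop runs more than once.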