Overfitting
http://iailab.kaist.ac.kr/
Industrial AI Lab at KAIST
0. Lecture Video
from IPython.display import YouTubeVideo
YouTubeVideo('xhjVhMHEGgY', width = "560", height = "315")
1. Regression with Polynomial Functions
Nonlinear regression
(= linear regression for non-linearly distributed data)
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
# 10 data points
n = 10
x = np.linspace(-4.5, 4.5, 10)
y = np.array([0.9819, 0.7973, 1.9737, 0.1838, 1.3180, -0.8361, -0.6591, -2.4701, -2.8122, -6.2512])
plt.figure(figsize = (6, 4))
plt.plot(x, y, 'o', label = 'Data')
plt.xlabel('X', fontsize = 15)
plt.ylabel('Y', fontsize = 15)
plt.grid(alpha = 0.3)
plt.show()
p = np.polyfit(x, y, deg = 1)
xp = np.arange(-4.5, 4.5, 0.01)
yp = np.polyval(p, xp)
plt.figure(figsize = (6, 4))
plt.plot(x, y, 'o', label = 'Data')
plt.plot(xp, yp, linewidth = 2, label = 'Polynomial')
plt.xlabel('X', fontsize = 15)
plt.ylabel('Y', fontsize = 15)
plt.legend(fontsize = 15)
plt.grid(alpha = 0.3)
plt.show()
p = np.polyfit(x, y, deg = 9)
xp = np.arange(-4.5, 4.5, 0.01)
yp = np.polyval(p, xp)
plt.figure(figsize = (6, 4))
plt.plot(x, y, 'o', label = 'Data')
plt.plot(xp, yp, linewidth = 2, label = 'Polynomial')
plt.xlabel('X', fontsize = 15)
plt.ylabel('Y', fontsize = 15)
plt.legend(fontsize = 15)
plt.grid(alpha = 0.3)
plt.show()
p = np.polyfit(x, y, deg = 3)
xp = np.arange(-4.5, 4.5, 0.01)
plt.figure(figsize = (6, 4))
plt.plot(x, y, 'o', label = 'Data')
plt.plot(xp, np.polyval(p, xp), linewidth = 2, label = 'Polynomial')
plt.xlabel('X', fontsize = 15)
plt.ylabel('Y', fontsize = 15)
plt.legend(fontsize = 15)
plt.grid(alpha = 0.3)
plt.show()
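The three fits above (degrees 1, 3, and 9) can also be compared numerically. The following is a minimal sketch that reuses the same ten data points; the helper name `training_mse` is introduced here for illustration only. It measures the error on the training points alone, so a lower number does not mean a better model.

```python
import numpy as np

# Same ten data points as in the cells above
x = np.linspace(-4.5, 4.5, 10)
y = np.array([0.9819, 0.7973, 1.9737, 0.1838, 1.3180,
              -0.8361, -0.6591, -2.4701, -2.8122, -6.2512])

def training_mse(degree):
    """Mean squared error of a least-squares polynomial fit, measured on the training points."""
    coeffs = np.polyfit(x, y, deg=degree)
    residuals = np.polyval(coeffs, x) - y
    return float(np.mean(residuals ** 2))

train_mse = {degree: training_mse(degree) for degree in (1, 3, 9)}
for degree, mse in train_mse.items():
    print(f"degree {degree}: training MSE = {mse:.6f}")
```

A degree-9 polynomial has ten coefficients, so it can pass through all ten points and drive the training error to essentially zero, even though the curve oscillates between the points.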
2. Polynomial Regression in TensorFlow
Construct explicit feature vectors
Consider linear combinations of fixed nonlinear functions of the input variables, of the form
$$
\begin{bmatrix}
1 & x_{1} & x_1^2\\1 & x_{2} & x_2^2\\\vdots & \vdots & \vdots\\1 & x_{m} & x_m^2
\end{bmatrix}
\begin{bmatrix}\theta_0\\\theta_1 \\ \theta_2 \end{bmatrix} \quad \Rightarrow \quad
\begin{bmatrix}
\mid & \mid & \mid \\
b_0(x) & b_1(x) & b_2(x)\\
\mid & \mid & \mid
\end{bmatrix}
\begin{bmatrix}\theta_0\\\theta_1 \\ \theta_2 \end{bmatrix}
$$
$$ \hat{y}=\sum_{i=0}^d{\theta_i b_i(x)} = \Phi \theta$$
- Polynomial functions
$$b_i(x) = x^i, \quad i = 0,\cdots,d$$
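As a rough illustration of the feature-vector view, the sketch below builds $\Phi$ for $d = 2$ with `np.vander` and solves the least-squares problem directly. It uses NumPy instead of TensorFlow purely to keep the example short, and the variable names are chosen here for illustration.

```python
import numpy as np

# Same ten data points as in Section 1
x = np.linspace(-4.5, 4.5, 10)
y = np.array([0.9819, 0.7973, 1.9737, 0.1838, 1.3180,
              -0.8361, -0.6591, -2.4701, -2.8122, -6.2512])

d = 2
# Columns of Phi are b_0(x) = 1, b_1(x) = x, b_2(x) = x^2
Phi = np.vander(x, d + 1, increasing=True)

# Least-squares solution of Phi @ theta ≈ y
theta, *_ = np.linalg.lstsq(Phi, y, rcond=None)
y_hat = Phi @ theta

print(Phi.shape)
print(theta)
```

The model is nonlinear in $x$ but linear in $\theta$, which is why an ordinary linear least-squares solver is enough. The coefficients should agree with `np.polyfit(x, y, 2)`, which returns them in the reverse order (highest power first).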
from sklearn.preprocessing import MaxAbsScaler
# 10 data points
m = 10
train_x = np.linspace(-4.5, 4.5, 10).reshape(-1,1)
train_y = np.array([0.9819, 0.7973, 1.9737, 0.1838, 1.3180, -0.8361, -0.6591, -2.4701, -2.8122, -6.2512]).reshape(-1,1)
d = 9
train_X = np.hstack([train_x**(i+1) for i in range(d)])
train_X = MaxAbsScaler().fit_transform(train_X)
train_X = np.asmatrix(train_X)
plt.figure(figsize = (6, 4))
for i in range(d):
    plt.plot(train_X[:,i], label = '$x^{}$'.format(i+1))
plt.title('Polynomial Basis', fontsize = 15)
plt.xlabel('X', fontsize = 15)
plt.ylabel('Y', fontsize = 15)
plt.grid(alpha = 0.3)
plt.legend(fontsize = 12)
plt.show()
import tensorflow as tf
LR = 0.1
n_iter = 10000
w = tf.Variable(tf.random.normal([d, 1]))
b = tf.Variable(tf.random.normal([1, 1]))
train_input = tf.constant(train_X, dtype = tf.float32)
train_output = tf.constant(train_y, dtype = tf.float32)
loss_record = []
for i in range(n_iter):
    with tf.GradientTape() as tape:
        y_pred = tf.add(tf.matmul(train_input,w), b)
        loss = tf.square(y_pred - train_output)
        loss = tf.reduce_mean(loss)
    w_grad, b_grad = tape.gradient(loss, [w, b])
    loss_record.append(loss)
    w.assign_sub(LR * w_grad)
    b.assign_sub(LR * b_grad)
w_val = w.numpy()
b_val = b.numpy()
xp = np.linspace(-4.5, 4.5, 100).reshape(-1,1)
Xp = np.hstack([xp**(i+1) for i in range(d)])
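# Re-fitting the scaler here matches the training scaling only because xp spans the same range [-4.5, 4.5]
Xp = MaxAbsScaler().fit_transform(Xp)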
Xp = np.asmatrix(Xp)
yp = Xp*w_val + b_val
plt.figure(figsize = (6, 4))
plt.plot(train_x, train_y,'ko')
plt.plot(xp, yp, 'r')
plt.title('Data', fontsize = 15)
plt.xlabel('X', fontsize = 15)
plt.ylabel('Y', fontsize = 15)
#plt.axis('equal')
plt.grid(alpha = 0.3)
#plt.xlim([0, 5])
plt.show()
Overfitting problem
Have you come across a situation where your model performed exceptionally well on the training data, but was not able to predict the test data?
Overfitting is one of the most common problems that data science professionals face.
Issue with rich representation
- Low error on input data points, but high error nearby
- Low error on training data, but high error on testing data
Generalization Error
- What we really care about is the prediction loss on new data $(x, y)$
- This is also called the generalization error
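One common way to estimate the generalization error on a small data set is leave-one-out cross-validation, which is not part of the original notebook and is used here only as an illustration. The sketch below reuses the ten data points from Section 1; the helper name `loo_mse` is hypothetical. Degree 8 stands in for the high-degree model because only nine points remain after one is held out.

```python
import numpy as np

# Same ten data points as in Section 1
x = np.linspace(-4.5, 4.5, 10)
y = np.array([0.9819, 0.7973, 1.9737, 0.1838, 1.3180,
              -0.8361, -0.6591, -2.4701, -2.8122, -6.2512])

def loo_mse(degree):
    """Leave-one-out estimate of the squared prediction error on unseen points."""
    errors = []
    for k in range(len(x)):
        keep = np.arange(len(x)) != k                 # hold out point k
        coeffs = np.polyfit(x[keep], y[keep], deg=degree)
        errors.append((np.polyval(coeffs, x[k]) - y[k]) ** 2)
    return float(np.mean(errors))

held_out_mse = {degree: loo_mse(degree) for degree in (1, 3, 8)}
for degree, mse in held_out_mse.items():
    print(f"degree {degree}: leave-one-out MSE = {mse:.4f}")
```

The degree-8 polynomial interpolates the nine remaining points exactly, yet its error on the held-out point is far larger than that of the degree-3 fit. This is the gap between training error and generalization error described above.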
3. Regularization to Reduce Overfitting
3.1. Regularization
With many features, the prediction function becomes very expressive (high model complexity)
- Choose less expressive function (e.g., lower degree polynomial, fewer RBF centers, larger RBF bandwidth)
- Keep the magnitude of the parameter small
- Regularization: penalize large parameters $\theta$
$$\min\; \lVert \Phi \theta - y \rVert_2^2 + \lambda \lVert \theta \rVert_2^2$$
- $\lambda$: regularization parameter, trades off between low loss and small values of $\theta$
Often, overfitting is associated with very large estimated parameters $\theta$
We want to balance
- how well the function fits the data
- the magnitude of the coefficients
$$ \begin{align*} \text{Total cost } = \;&\underbrace{\text{measure of fit}}_{RSS(\theta)} + \;\lambda \cdot \underbrace{\text{measure of magnitude of coefficients}}_{\lVert \theta \rVert_2^2} \\ \\ \implies &\min\; \lVert \Phi \theta - y \rVert_2^2 + \lambda \lVert \theta \rVert_2^2 \end{align*} $$
where $RSS(\theta) = \lVert \Phi\theta - y \rVert^2_2$ is the residual sum of squares, and $\lambda$ is a tuning parameter to be determined separately
The second term, $\lambda \cdot \lVert \theta \rVert_2^2$, called a shrinkage penalty, is small when $\theta_1, \cdots,\theta_d$ are close to zero, and so it has the effect of shrinking the estimates of $\theta_j$ towards zero
The tuning parameter $\lambda$ serves to control the relative impact of these two terms on the regression coefficient estimates
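The effect of $\lambda$ can be checked with the closed-form minimizer of the regularized cost, $\theta = (\Phi^T\Phi + \lambda I)^{-1}\Phi^T y$. The sketch below is only an illustration in NumPy: the helper name `ridge_fit` and the three values of $\lambda$ are assumptions made here, and the penalty is applied to every entry of $\theta$, including $\theta_0$, to match the cost as written above.

```python
import numpy as np

# Same ten data points as in Section 1
x = np.linspace(-4.5, 4.5, 10)
y = np.array([0.9819, 0.7973, 1.9737, 0.1838, 1.3180,
              -0.8361, -0.6591, -2.4701, -2.8122, -6.2512])

d = 9
u = x / np.abs(x).max()                      # rescale inputs to [-1, 1] so the powers stay comparable in size
Phi = np.vander(u, d + 1, increasing=True)   # columns 1, u, u^2, ..., u^d

def ridge_fit(lam):
    """Closed-form minimizer of ||Phi theta - y||^2 + lam * ||theta||^2."""
    A = Phi.T @ Phi + lam * np.eye(Phi.shape[1])
    return np.linalg.solve(A, Phi.T @ y)

results = []
for lam in (1e-6, 1e-2, 1.0):                # illustrative values only
    theta = ridge_fit(lam)
    rss = float(np.sum((Phi @ theta - y) ** 2))
    penalty = float(np.sum(theta ** 2))
    results.append((lam, rss, penalty))
    print(f"lambda = {lam:g}: RSS = {rss:.4f}, ||theta||^2 = {penalty:.4f}")
```

As $\lambda$ grows, the fit term $RSS(\theta)$ increases while $\lVert \theta \rVert_2^2$ shrinks, which is the trade-off that the tuning parameter controls.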
import tensorflow as tf
LR = 0.1
lamb = 0.1
n_iter = 10000
w = tf.Variable(tf.random.normal([d, 1]))
b = tf.Variable(tf.random.normal([1, 1]))
train_input = tf.constant(train_X, dtype = tf.float32)
train_output = tf.constant(train_y, dtype = tf.float32)
loss_record = []
for i in range(n_iter):
    with tf.GradientTape() as tape:
        y_pred = tf.add(tf.matmul(train_input,w), b)
        # mean of squared weights: a scaled version of the L2 penalty on w (b is not penalized)
        reg = tf.reduce_mean(tf.square(w))
        loss = tf.square(y_pred - train_output)
        loss = tf.reduce_mean(loss)
    # lamb weights the penalty against the data-fit term
    w_grad, b_grad = tape.gradient(loss + lamb*reg, [w, b])
    loss_record.append(loss)
    w.assign_sub(LR * w_grad)
    b.assign_sub(LR * b_grad)
w_val = w.numpy()
b_val = b.numpy()
xp = np.linspace(-4.5, 4.5, 100).reshape(-1,1)
Xp = np.hstack([xp**(i+1) for i in range(d)])
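# Re-fitting the scaler here matches the training scaling only because xp spans the same range [-4.5, 4.5]
Xp = MaxAbsScaler().fit_transform(Xp)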
Xp = np.asmatrix(Xp)
yp = Xp*w_val + b_val
plt.figure(figsize = (6, 4))
plt.plot(train_x, train_y,'ko')
plt.plot(xp, yp, 'r')
plt.title('Data', fontsize = 15)
plt.xlabel('X', fontsize = 15)
plt.ylabel('Y', fontsize = 15)
#plt.axis('equal')
plt.grid(alpha = 0.3)
#plt.xlim([0, 5])
plt.show()
3.2. Different Regularization Techniques
- $L_2$ and $L_1$ regularizations
- Big data and data augmentation
    - The simplest way to reduce overfitting is to increase the size of the training data
- Early stopping
    - When the performance on the validation set starts getting worse, we stop training the model
- Dropout (later)
    - This is one of the most interesting types of regularization techniques.
    - It often produces very good results and is widely used in deep learning.
    - At every iteration, it randomly selects some nodes and removes them.
    - It can also be thought of as an ensemble technique in machine learning.
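The "randomly selects some nodes and removes them" step of dropout can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation of any particular library: the function name `dropout` is hypothetical, and the rescaling by `1 / (1 - rate)` (often called inverted dropout) is an assumption added here so that the expected activation stays unchanged.

```python
import numpy as np

def dropout(activations, rate, rng):
    """Zero each activation with probability `rate` and rescale the survivors."""
    keep_mask = rng.random(activations.shape) >= rate   # True where a node is kept
    return activations * keep_mask / (1.0 - rate), keep_mask

rng = np.random.default_rng(0)
hidden = np.ones((4, 5))            # stand-in for one layer's activations on a batch of 4
dropped, keep_mask = dropout(hidden, rate=0.4, rng=rng)
print(dropped)
```

A fresh mask is drawn at every training iteration, so each step trains a different thinned network. At prediction time no nodes are removed.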