Overfitting

By Prof. Seungchul Lee
http://iailab.kaist.ac.kr/
Industrial AI Lab at KAIST

1. Regression with Polynomial Functions

• Nonlinear regression

• (= linear regression applied to data with a nonlinear trend, using nonlinear features)

In [1]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# 10 data points
n = 10
x = np.linspace(-4.5, 4.5, 10)
y = np.array([0.9819, 0.7973, 1.9737, 0.1838, 1.3180, -0.8361, -0.6591, -2.4701, -2.8122, -6.2512])

plt.figure(figsize = (6, 4))
plt.plot(x, y, 'o', label = 'Data')
plt.xlabel('X', fontsize = 15)
plt.ylabel('Y', fontsize = 15)
plt.grid(alpha = 0.3)
plt.show()
In [2]:
p = np.polyfit(x, y, deg = 1)
In [3]:
xp = np.arange(-4.5, 4.5, 0.01)
yp = np.polyval(p, xp)

plt.figure(figsize = (6, 4))
plt.plot(x, y, 'o', label = 'Data')
plt.plot(xp, yp, linewidth = 2, label = 'Polynomial')
plt.xlabel('X', fontsize = 15)
plt.ylabel('Y', fontsize = 15)
plt.legend(fontsize = 15)
plt.grid(alpha = 0.3)
plt.show()
In [4]:
p = np.polyfit(x, y, deg = 9)

xp = np.arange(-4.5, 4.5, 0.01)
yp = np.polyval(p, xp)

plt.figure(figsize = (6, 4))
plt.plot(x, y, 'o', label = 'Data')
plt.plot(xp, yp, linewidth = 2, label = 'Polynomial')
plt.xlabel('X', fontsize = 15)
plt.ylabel('Y', fontsize = 15)
plt.legend(fontsize = 15)
plt.grid(alpha = 0.3)
plt.show()
In [5]:
p = np.polyfit(x, y, deg = 3)

xp = np.arange(-4.5, 4.5, 0.01)

plt.figure(figsize = (6, 4))
plt.plot(x, y, 'o', label = 'Data')
plt.plot(xp, np.polyval(p, xp), linewidth = 2, label = 'Polynomial')
plt.xlabel('X', fontsize = 15)
plt.ylabel('Y', fontsize = 15)
plt.legend(fontsize = 15)
plt.grid(alpha = 0.3)
plt.show()
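
To put numbers on the three fits above, the following sketch compares the training mean squared error for degrees 1, 3, and 9 on the same ten points. The helper name `training_mse` is introduced here for illustration only and is not part of the original notebook.

```python
import numpy as np

# Same 10 data points as above
x = np.linspace(-4.5, 4.5, 10)
y = np.array([0.9819, 0.7973, 1.9737, 0.1838, 1.3180,
              -0.8361, -0.6591, -2.4701, -2.8122, -6.2512])

def training_mse(deg):
    """Fit a degree-`deg` polynomial and return its mean squared error on (x, y)."""
    p = np.polyfit(x, y, deg=deg)
    residual = np.polyval(p, x) - y
    return np.mean(residual**2)

# Training error for each polynomial degree used above
mse = {deg: training_mse(deg) for deg in (1, 3, 9)}
for deg, err in mse.items():
    print(f"degree {deg}: training MSE = {err:.4f}")
```

With 10 points, a degree-9 polynomial has 10 coefficients and can pass through every point, so its training error is essentially zero. That number says nothing about how the curve behaves between the points.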

2. Polynomial Regression in TensorFlow

• Construct explicit feature vectors

• Consider linear combinations of fixed nonlinear functions of the input variables, of the form

$$\begin{bmatrix} 1 & x_{1} & x_1^2\\1 & x_{2} & x_2^2\\\vdots & \vdots & \vdots\\1 & x_{m} & x_m^2 \end{bmatrix} \begin{bmatrix}\theta_0\\\theta_1 \\ \theta_2 \end{bmatrix} \quad \Rightarrow \quad \begin{bmatrix} \mid & \mid & \mid \\ b_0(x) & b_1(x) & b_2(x)\\ \mid & \mid & \mid \end{bmatrix} \begin{bmatrix}\theta_0\\\theta_1 \\ \theta_2 \end{bmatrix}$$

$$\hat{y}=\sum_{i=0}^d{\theta_i b_i(x)} = \Phi \theta$$

• Polynomial functions

$$b_i(x) = x^i, \quad i = 0,\cdots,d$$
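
As a small illustration of $\hat{y} = \Phi\theta$, the sketch below builds $\Phi$ with columns $b_i(x) = x^i$ using `np.vander` and solves the least-squares problem directly. It uses the ten data points from Section 1 and assumes $d = 3$; the variable names are chosen for this example.

```python
import numpy as np

# Ten data points from Section 1
x = np.linspace(-4.5, 4.5, 10)
y = np.array([0.9819, 0.7973, 1.9737, 0.1838, 1.3180,
              -0.8361, -0.6591, -2.4701, -2.8122, -6.2512])

d = 3  # polynomial degree (an assumption for this example)

# Column i of Phi is b_i(x) = x**i, for i = 0, ..., d
Phi = np.vander(x, d + 1, increasing=True)

# Least-squares estimate of theta in y_hat = Phi @ theta
theta, *_ = np.linalg.lstsq(Phi, y, rcond=None)
y_hat = Phi @ theta

print("Phi shape:", Phi.shape)   # (10, 4)
print("theta:", theta)
```

`np.polyfit(x, y, deg = d)` solves the same least-squares problem and returns the coefficients in the opposite order (highest power first).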

In [6]:
from sklearn.preprocessing import MaxAbsScaler

# 10 data points
m = 10
train_x = np.linspace(-4.5, 4.5, 10).reshape(-1,1)
train_y = np.array([0.9819, 0.7973, 1.9737, 0.1838, 1.3180, -0.8361, -0.6591, -2.4701, -2.8122, -6.2512]).reshape(-1,1)

d = 9
train_X = np.hstack([train_x**(i+1) for i in range(d)])
train_X = MaxAbsScaler().fit_transform(train_X)
train_X = np.asmatrix(train_X)
In [7]:
plt.figure(figsize = (6, 4))

for i in range(d):
    plt.plot(train_X[:,i], label = '$x^{}$'.format(i+1))

plt.title('Polynomial Basis', fontsize = 15)
plt.xlabel('X', fontsize = 15)
plt.ylabel('Y', fontsize = 15)
plt.grid(alpha = 0.3)
plt.legend(fontsize = 12)
plt.show()
In [8]:
import tensorflow as tf

LR = 0.1
n_iter = 10000

w = tf.Variable(tf.random.normal([d, 1]))
b = tf.Variable(tf.random.normal([1, 1]))
train_input = tf.constant(train_X, dtype = tf.float32)
train_output = tf.constant(train_y, dtype = tf.float32)

loss_record = []
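for i in range(n_iter):
    # Record the forward pass so the gradients can be computed
    with tf.GradientTape() as tape:
        y_pred = tf.matmul(train_input, w) + b
        loss = tf.square(y_pred - train_output)
        loss = tf.reduce_mean(loss)

    # Plain gradient-descent update with learning rate LR
    w_grad, b_grad = tape.gradient(loss, [w, b])
    w.assign_sub(LR * w_grad)
    b.assign_sub(LR * b_grad)

    loss_record.append(loss)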
In [9]:
w_val = w.numpy()
b_val = b.numpy()

xp = np.linspace(-4.5, 4.5, 100).reshape(-1,1)
Xp = np.hstack([xp**(i+1) for i in range(d)])
Xp = MaxAbsScaler().fit_transform(Xp)
Xp = np.asmatrix(Xp)

yp = Xp*w_val + b_val

plt.figure(figsize = (6, 4))
plt.plot(train_x, train_y,'ko')
plt.plot(xp, yp, 'r')
plt.title('Data', fontsize = 15)
plt.xlabel('X', fontsize = 15)
plt.ylabel('Y', fontsize = 15)
#plt.axis('equal')
plt.grid(alpha = 0.3)
#plt.xlim([0, 5])
plt.show()
• Overfitting problem

• Have you come across a situation where your model performed exceptionally well on the training data but could not predict the test data?

• Avoiding overfitting is one of the most common problems data science professionals face.

• Issue with rich representation

• Low error on input data points, but high error nearby
• Low error on training data, but high error on testing data
• Generalization Error

• What we really care about is the prediction loss on new data $(x, y)$
• This is also called the generalization error
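
One rough way to estimate the generalization error is to hold out some points and measure the error on them after fitting on the rest. The sketch below does this with the ten points from Section 1, assuming an arbitrary split (even-indexed points for training, odd-indexed points for testing). The split, the degrees tried, and the helper name `mse` are choices made for this example and are not part of the original notebook.

```python
import numpy as np

# Ten data points from Section 1
x = np.linspace(-4.5, 4.5, 10)
y = np.array([0.9819, 0.7973, 1.9737, 0.1838, 1.3180,
              -0.8361, -0.6591, -2.4701, -2.8122, -6.2512])

train_idx = np.arange(0, 10, 2)   # 5 points used for fitting
test_idx = np.arange(1, 10, 2)    # 5 held-out points

def mse(p, idx):
    """Mean squared error of polynomial coefficients `p` on the points in `idx`."""
    return np.mean((np.polyval(p, x[idx]) - y[idx])**2)

results = {}
for deg in (1, 2, 4):
    # Fit on the training points only
    p = np.polyfit(x[train_idx], y[train_idx], deg=deg)
    results[deg] = (mse(p, train_idx), mse(p, test_idx))
    print(f"degree {deg}: train MSE = {results[deg][0]:.4f}, "
          f"held-out MSE = {results[deg][1]:.4f}")
```

With 5 training points, the degree-4 polynomial interpolates them, so its training error is essentially zero regardless of how it does on the held-out points. The held-out column is the one that estimates the generalization error, and with so few points it is a noisy estimate.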

3. Regularization to Reduce Overfitting

3.1. Regularization

With many features, the prediction function becomes very expressive (high model complexity), which makes overfitting more likely.

• Choose a less expressive function (e.g., lower-degree polynomial, fewer RBF centers, larger RBF bandwidth)
• Keep the magnitude of the parameters small
• Regularization: penalize large parameters $\theta$

$$\min\; \lVert \Phi \theta - y \rVert_2^2 + \lambda \lVert \theta \rVert_2^2$$
• $\lambda$: regularization parameter, trades off between low loss and small values of $\theta$

Often, overfitting is associated with very large estimated parameters $\theta$.

We want to balance

• how well the function fits the data

• the magnitude of the coefficients

\begin{align*} \text{Total cost } = \;&\underbrace{\text{measure of fit}}_{RSS(\theta)} + \;\lambda \cdot \underbrace{\text{measure of magnitude of coefficients}}_{\lVert \theta \rVert_2^2} \\ \\ \implies &\min\; \lVert \Phi \theta - y \rVert_2^2 + \lambda \lVert \theta \rVert_2^2 \end{align*}
where $RSS(\theta) = \lVert \Phi\theta - y \rVert^2_2$ (the residual sum of squares) and $\lambda$ is a tuning parameter to be determined separately

• The second term, $\lambda \cdot \lVert \theta \rVert_2^2$, called a shrinkage penalty, is small when $\theta_1, \cdots,\theta_d$ are close to zero, so it has the effect of shrinking the estimates of $\theta_j$ towards zero
• The tuning parameter $\lambda$ controls the relative impact of these two terms on the regression coefficient estimates
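
For this quadratic objective the minimizer has a closed form, $\theta = (\Phi^T \Phi + \lambda I)^{-1} \Phi^T y$, which makes the shrinkage effect easy to check numerically. The sketch below is an illustration under a few assumptions: it uses the ten points from Section 1, a degree-9 basis on $x$ rescaled to $[-1, 1]$, and it penalizes every coefficient including $\theta_0$, as the formula above is written. The function name `ridge_fit` and the $\lambda$ values are chosen for this example.

```python
import numpy as np

# Ten data points from Section 1
x = np.linspace(-4.5, 4.5, 10)
y = np.array([0.9819, 0.7973, 1.9737, 0.1838, 1.3180,
              -0.8361, -0.6591, -2.4701, -2.8122, -6.2512])

d = 9
# Rescale x to [-1, 1] so that the powers stay well conditioned
Phi = np.vander(x / np.abs(x).max(), d + 1, increasing=True)

def ridge_fit(Phi, y, lam):
    """Closed-form minimizer of ||Phi @ theta - y||^2 + lam * ||theta||^2."""
    A = Phi.T @ Phi + lam * np.eye(Phi.shape[1])
    return np.linalg.solve(A, Phi.T @ y)

lams = [1e-3, 1e-1, 1e1]
norms, rss = {}, {}
for lam in lams:
    theta = ridge_fit(Phi, y, lam)
    norms[lam] = np.linalg.norm(theta)          # magnitude of coefficients
    rss[lam] = np.sum((Phi @ theta - y)**2)     # measure of fit
    print(f"lambda = {lam:g}: ||theta|| = {norms[lam]:.3f}, RSS = {rss[lam]:.3f}")
```

As $\lambda$ grows, the coefficient norm shrinks and the residual sum of squares grows, which is the trade-off described above.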
In [10]:
import tensorflow as tf

LR = 0.1
lamb = 0.1
n_iter = 10000

w = tf.Variable(tf.random.normal([d, 1]))
b = tf.Variable(tf.random.normal([1, 1]))
train_input = tf.constant(train_X, dtype = tf.float32)
train_output = tf.constant(train_y, dtype = tf.float32)

loss_record = []
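for i in range(n_iter):
    # Record the forward pass so the gradients can be computed
    with tf.GradientTape() as tape:
        y_pred = tf.matmul(train_input, w) + b
        reg = tf.reduce_mean(tf.square(w))
        loss = tf.square(y_pred - train_output)
        # Data-fit term plus the L2 penalty weighted by lamb
        loss = tf.reduce_mean(loss) + lamb*reg

    # Plain gradient-descent update with learning rate LR
    w_grad, b_grad = tape.gradient(loss, [w, b])
    w.assign_sub(LR * w_grad)
    b.assign_sub(LR * b_grad)

    loss_record.append(loss)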

w_val = w.numpy()
b_val = b.numpy()

xp = np.linspace(-4.5, 4.5, 100).reshape(-1,1)
Xp = np.hstack([xp**(i+1) for i in range(d)])
Xp = MaxAbsScaler().fit_transform(Xp)
Xp = np.asmatrix(Xp)

yp = Xp*w_val + b_val

plt.figure(figsize = (6, 4))
plt.plot(train_x, train_y,'ko')
plt.plot(xp, yp, 'r')
plt.title('Data', fontsize = 15)
plt.xlabel('X', fontsize = 15)
plt.ylabel('Y', fontsize = 15)
#plt.axis('equal')
plt.grid(alpha = 0.3)
#plt.xlim([0, 5])
plt.show()

3.2. Different Regularization Techniques

• $L_2$ and $L_1$ regularizations
• Big Data
• Data augmentation
• The simplest way to reduce overfitting is to increase the size of the training data

• Early stopping
• When performance on the validation set starts to get worse, we immediately stop training the model
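
The early-stopping rule can be sketched with plain gradient descent in NumPy. Everything specific here is an assumption made for illustration: the even/odd split of the ten points from Section 1 into training and validation sets, the degree-9 features on $x$ rescaled to $[-1, 1]$, and the `patience` parameter (with `patience = 1` the loop stops at the first iteration where the validation error fails to improve).

```python
import numpy as np

# Ten data points from Section 1
x = np.linspace(-4.5, 4.5, 10)
y = np.array([0.9819, 0.7973, 1.9737, 0.1838, 1.3180,
              -0.8361, -0.6591, -2.4701, -2.8122, -6.2512])

# Degree-9 polynomial features on x rescaled to [-1, 1]
Phi = np.vander(x / np.abs(x).max(), 10, increasing=True)
tr = np.arange(0, 10, 2)   # training points
va = np.arange(1, 10, 2)   # validation points

lr, n_iter, patience = 0.1, 10000, 1
theta = np.zeros(Phi.shape[1])
best_val, best_theta, wait = np.inf, theta.copy(), 0
val_history = []

for it in range(n_iter):
    # One gradient-descent step on the training MSE
    grad = 2 * Phi[tr].T @ (Phi[tr] @ theta - y[tr]) / len(tr)
    theta -= lr * grad

    # Monitor the validation MSE and remember the best parameters so far
    val = np.mean((Phi[va] @ theta - y[va])**2)
    val_history.append(val)
    if val < best_val:
        best_val, best_theta, wait = val, theta.copy(), 0
    else:
        wait += 1
        if wait >= patience:
            break   # validation error stopped improving

print(f"stopped after {len(val_history)} iterations, "
      f"best validation MSE = {best_val:.4f}")
```

The parameters kept at the end are `best_theta`, taken from the iteration with the lowest validation error, not the last iterate. A larger `patience` tolerates short-lived increases in the validation error before stopping.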

• Dropout (later)
• This is one of the most interesting regularization techniques.
• It also produces very good results and is therefore one of the most frequently used regularization techniques in deep learning.
• At every training iteration, it randomly selects some nodes and removes them.
• It can also be thought of as an ensemble technique in machine learning.
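
The random removal of nodes can be sketched as a mask applied to a layer's activations. The version below is a NumPy illustration, not the notebook's implementation: the function name `dropout`, the rate of 0.5, and the rescaling of the surviving activations by $1/(1 - \text{rate})$ (a common convention, often called inverted dropout) are all assumptions of this example.

```python
import numpy as np

def dropout(a, rate, rng):
    """Zero each entry of `a` with probability `rate`; rescale survivors by 1/(1 - rate)."""
    keep = 1.0 - rate
    mask = rng.random(a.shape) < keep   # True where a node is kept
    return a * mask / keep

rng = np.random.default_rng(0)
a = np.ones((4, 5))          # stand-in for one layer's activations

# A fresh mask is drawn on every call, i.e. at every training iteration
out1 = dropout(a, 0.5, rng)
out2 = dropout(a, 0.5, rng)
print(out1)
print(out2)
```

Each call drops a different random subset of nodes, so every iteration trains a different thinned network. This is the sense in which dropout resembles an ensemble.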
