Overfitting


By Prof. Seungchul Lee
http://iailab.kaist.ac.kr/
Industrial AI Lab at KAIST


1. Regression with Polynomial Functions

  • Nonlinear regression

  • (= linear regression for non-linearly distributed data)

In [1]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# 10 data points
n = 10
x = np.linspace(-4.5, 4.5, 10)
y = np.array([0.9819, 0.7973, 1.9737, 0.1838, 1.3180, -0.8361, -0.6591, -2.4701, -2.8122, -6.2512])

plt.figure(figsize = (6, 4))
plt.plot(x, y, 'o', label = 'Data')
plt.xlabel('X', fontsize = 15)
plt.ylabel('Y', fontsize = 15)
plt.grid(alpha = 0.3)
plt.show()
In [2]:
p = np.polyfit(x, y, deg = 1)
In [3]:
xp = np.arange(-4.5, 4.5, 0.01)
yp = np.polyval(p, xp)

plt.figure(figsize = (6, 4))
plt.plot(x, y, 'o', label = 'Data')
plt.plot(xp, yp, linewidth = 2, label = 'Polynomial')
plt.xlabel('X', fontsize = 15)
plt.ylabel('Y', fontsize = 15)
plt.legend(fontsize = 15)
plt.grid(alpha = 0.3)
plt.show()
In [4]:
p = np.polyfit(x, y, deg = 9)

xp = np.arange(-4.5, 4.5, 0.01)
yp = np.polyval(p, xp)

plt.figure(figsize = (6, 4))
plt.plot(x, y, 'o', label = 'Data')
plt.plot(xp, yp, linewidth = 2, label = 'Polynomial')
plt.xlabel('X', fontsize = 15)
plt.ylabel('Y', fontsize = 15)
plt.legend(fontsize = 15)
plt.grid(alpha = 0.3)
plt.show()
In [5]:
p = np.polyfit(x, y, deg = 3)

xp = np.arange(-4.5, 4.5, 0.01)

plt.figure(figsize = (6, 4))
plt.plot(x, y, 'o', label = 'Data')
plt.plot(xp, np.polyval(p, xp), linewidth = 2, label = 'Polynomial')
plt.xlabel('X', fontsize = 15)
plt.ylabel('Y', fontsize = 15)
plt.legend(fontsize = 15)
plt.grid(alpha = 0.3)
plt.show()
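
The three fits above can also be compared numerically. The following is a minimal sketch, reusing the same 10 data points, that reports the training mean squared error for degrees 1, 3, and 9; the helper name `train_mse` is introduced here for illustration only. With 10 points, a degree-9 polynomial has enough coefficients to pass through every point, so its training error is essentially zero even though the curve oscillates between points.

```python
import numpy as np

# same 10 data points as in the cells above
x = np.linspace(-4.5, 4.5, 10)
y = np.array([0.9819, 0.7973, 1.9737, 0.1838, 1.3180,
              -0.8361, -0.6591, -2.4701, -2.8122, -6.2512])

def train_mse(deg):
    """Fit a polynomial of the given degree and return its MSE on the training points."""
    p = np.polyfit(x, y, deg=deg)
    return np.mean((np.polyval(p, x) - y) ** 2)

mse_by_degree = {deg: train_mse(deg) for deg in (1, 3, 9)}

for deg, mse in mse_by_degree.items():
    print(f"degree {deg}: training MSE = {mse:.6f}")
```

A low training error alone therefore says little about how well the model predicts between or beyond the observed points.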

2. Polynomial Regression in TensorFlow

  • Construct explicit feature vectors

  • Consider linear combinations of fixed nonlinear functions of the input variables, of the form



$$ \begin{bmatrix} 1 & x_{1} & x_1^2\\1 & x_{2} & x_2^2\\\vdots & \vdots & \vdots\\1 & x_{m} & x_m^2 \end{bmatrix} \begin{bmatrix}\theta_0\\\theta_1 \\ \theta_2 \end{bmatrix} \quad \Rightarrow \quad \begin{bmatrix} \mid & \mid & \mid \\ b_0(x) & b_1(x) & b_2(x)\\ \mid & \mid & \mid \end{bmatrix} \begin{bmatrix}\theta_0\\\theta_1 \\ \theta_2 \end{bmatrix} $$



$$ \hat{y}=\sum_{i=0}^d{\theta_i b_i(x)} = \Phi \theta$$

  • Polynomial functions


$$b_i(x) = x^i, \quad i = 0,\cdots,d$$
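
As a small illustration of $\hat{y} = \Phi\theta$ with the polynomial basis, the sketch below builds $\Phi$ explicitly with `np.vander` and solves the least-squares problem with `np.linalg.lstsq`. The choice $d = 3$ is only an example, and the comparison against `np.polyfit` is a sanity check, not part of the original notes.

```python
import numpy as np

x = np.linspace(-4.5, 4.5, 10)
y = np.array([0.9819, 0.7973, 1.9737, 0.1838, 1.3180,
              -0.8361, -0.6591, -2.4701, -2.8122, -6.2512])

d = 3  # polynomial degree (example value)

# Feature matrix Phi: column i holds b_i(x) = x^i for i = 0, ..., d
Phi = np.vander(x, N=d + 1, increasing=True)

# Least-squares estimate of theta in y_hat = Phi @ theta
theta, *_ = np.linalg.lstsq(Phi, y, rcond=None)
y_hat = Phi @ theta

# np.polyfit returns the highest power first, so reverse it for comparison
theta_polyfit = np.polyfit(x, y, deg=d)[::-1]
print(np.allclose(theta, theta_polyfit, atol=1e-6))
```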

In [6]:
from sklearn.preprocessing import MaxAbsScaler

# 10 data points
m = 10
train_x = np.linspace(-4.5, 4.5, 10).reshape(-1,1)
train_y = np.array([0.9819, 0.7973, 1.9737, 0.1838, 1.3180, -0.8361, -0.6591, -2.4701, -2.8122, -6.2512]).reshape(-1,1)

d = 9
train_X = np.hstack([train_x**(i+1) for i in range(d)])
train_X = MaxAbsScaler().fit_transform(train_X)
train_X = np.asmatrix(train_X)
In [7]:
plt.figure(figsize = (6, 4))

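for i in range(d):
    # plot each scaled basis function against x rather than against the sample index
    plt.plot(train_x, np.asarray(train_X[:,i]), label = '$x^{}$'.format(i+1))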

plt.title('Polynomial Basis', fontsize = 15)
plt.xlabel('X', fontsize = 15)
plt.ylabel('Y', fontsize = 15)
plt.grid(alpha = 0.3)
plt.legend(fontsize = 12)
plt.show()
In [8]:
import tensorflow as tf

LR = 0.1
n_iter = 10000

w = tf.Variable(tf.random.normal([d, 1]))
b = tf.Variable(tf.random.normal([1, 1]))
train_input = tf.constant(train_X, dtype = tf.float32)
train_output = tf.constant(train_y, dtype = tf.float32)

loss_record = []
for i in range(n_iter):
    with tf.GradientTape() as tape:
        y_pred = tf.add(tf.matmul(train_input,w), b)
        loss = tf.square(y_pred - train_output)
        loss = tf.reduce_mean(loss)   # mean squared error

    # compute the gradients after leaving the tape context
    w_grad, b_grad = tape.gradient(loss, [w, b])

    loss_record.append(loss.numpy())
    # plain gradient descent update
    w.assign_sub(LR * w_grad)
    b.assign_sub(LR * b_grad)
In [9]:
w_val = w.numpy()
b_val = b.numpy()

xp = np.linspace(-4.5, 4.5, 100).reshape(-1,1)
Xp = np.hstack([xp**(i+1) for i in range(d)])
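# Refitting the scaler here matches the training scale only because xp spans
# the same range [-4.5, 4.5] as train_x; in general, reuse the scaler fitted on the training data.
Xp = MaxAbsScaler().fit_transform(Xp)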
Xp = np.asmatrix(Xp)

yp = Xp*w_val + b_val

plt.figure(figsize = (6, 4))
plt.plot(train_x, train_y,'ko')
plt.plot(xp, yp, 'r')
plt.title('Data', fontsize = 15)
plt.xlabel('X', fontsize = 15)
plt.ylabel('Y', fontsize = 15)
#plt.axis('equal')
plt.grid(alpha = 0.3)
#plt.xlim([0, 5])
plt.show()
  • Overfitting problem

    • Have you come across a situation where your model performed exceptionally well on training data but could not predict test data?

    • One of the most common problems data science professionals face is avoiding overfitting.

  • Issue with rich representation

    • Low error on input data points, but high error nearby
    • Low error on training data, but high error on testing data
  • Generalization Error

    • But what we really care about is the prediction loss on new data $(x, y)$
    • also called generalization error
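
The gap between training error and error on unseen data can be estimated by holding out part of the data. The sketch below is an illustration under assumptions: it splits the same 10 points into even-indexed training points and odd-indexed held-out points (a split chosen here only for simplicity), and the degrees 1, 2, and 4 are example values. With 5 training points, a degree-4 polynomial interpolates them exactly, so its training error is near zero while its held-out error is not.

```python
import numpy as np

x = np.linspace(-4.5, 4.5, 10)
y = np.array([0.9819, 0.7973, 1.9737, 0.1838, 1.3180,
              -0.8361, -0.6591, -2.4701, -2.8122, -6.2512])

# hypothetical split: even indices for training, odd indices held out
train_idx = np.arange(0, 10, 2)
test_idx = np.arange(1, 10, 2)

def mse(p, xs, ys):
    """Mean squared error of polynomial coefficients p on points (xs, ys)."""
    return np.mean((np.polyval(p, xs) - ys) ** 2)

results = {}
for deg in (1, 2, 4):
    p = np.polyfit(x[train_idx], y[train_idx], deg=deg)
    results[deg] = (mse(p, x[train_idx], y[train_idx]),
                    mse(p, x[test_idx], y[test_idx]))

for deg, (train_err, test_err) in results.items():
    print(f"degree {deg}: train MSE = {train_err:.4f}, held-out MSE = {test_err:.4f}")
```

The held-out error is only an estimate of the generalization error, and with so few points it depends strongly on which points are held out.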

3. Regularization to Reduce Overfitting

3.1. Regularization

With many features, the prediction function becomes very expressive (high model complexity)

  • Choose a less expressive function (e.g., lower degree polynomial, fewer RBF centers, larger RBF bandwidth)
  • Keep the magnitude of the parameters small
  • Regularization: penalize large parameters $\theta$

    $$\min\; \lVert \Phi \theta - y \rVert_2^2 + \lambda \lVert \theta \rVert_2^2$$
  • $\lambda$: regularization parameter, trades off between low loss and small values of $\theta$


Often, overfitting is associated with very large estimated parameters $\theta$

We want to balance

  • how well function fits data

  • magnitude of coefficients

    $$ \begin{align*} \text{Total cost } = \;&\underbrace{\text{measure of fit}}_{RSS(\theta)} + \;\lambda \cdot \underbrace{\text{measure of magnitude of coefficients}}_{\lVert \theta \rVert_2^2} \\ \\ \implies &\min\; \lVert \Phi \theta - y \rVert_2^2 + \lambda \lVert \theta \rVert_2^2 \end{align*} $$
    where $ RSS(\theta) = \lVert \Phi\theta - y \rVert^2_2 $ is the residual sum of squares and $\lambda$ is a tuning parameter to be determined separately


  • The second term, $\lambda \cdot \lVert \theta \rVert_2^2$, called a shrinkage penalty, is small when $\theta_1, \cdots,\theta_d$ are close to zero, and so it has the effect of shrinking the estimates of $\theta_j$ towards zero
  • The tuning parameter $\lambda$ serves to control the relative impact of these two terms on the regression coefficient estimates
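
The effect of $\lambda$ can be seen directly from the closed-form minimizer of $\lVert \Phi \theta - y \rVert_2^2 + \lambda \lVert \theta \rVert_2^2$, which is $\theta = (\Phi^T\Phi + \lambda I)^{-1}\Phi^T y$. The sketch below is an illustration, not the method used in the TensorFlow cells: the function name `ridge` and the $\lambda$ values are chosen here as examples, and all coefficients including $\theta_0$ are penalized, as in the formula. Dividing $x$ by its maximum absolute value gives the same columns as applying `MaxAbsScaler` to the power features.

```python
import numpy as np

x = np.linspace(-4.5, 4.5, 10)
y = np.array([0.9819, 0.7973, 1.9737, 0.1838, 1.3180,
              -0.8361, -0.6591, -2.4701, -2.8122, -6.2512])

d = 9
xs = x / np.max(np.abs(x))                    # scale inputs to [-1, 1]
Phi = np.vander(xs, N=d + 1, increasing=True) # columns xs^0, ..., xs^d

def ridge(Phi, y, lam):
    """Closed-form minimizer of ||Phi theta - y||^2 + lam * ||theta||^2."""
    n_features = Phi.shape[1]
    A = Phi.T @ Phi + lam * np.eye(n_features)
    return np.linalg.solve(A, Phi.T @ y)

lams = (1e-3, 1e-1, 10.0)  # example values of the regularization parameter
norms, rss = [], []
for lam in lams:
    theta = ridge(Phi, y, lam)
    norms.append(np.linalg.norm(theta))
    rss.append(np.sum((Phi @ theta - y) ** 2))
    print(f"lambda = {lam:g}: ||theta||_2 = {norms[-1]:.4f}, RSS = {rss[-1]:.4f}")
```

As $\lambda$ grows, the coefficient norm shrinks while the residual sum of squares increases, which is the trade-off the tuning parameter controls.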
In [10]:
import tensorflow as tf

LR = 0.1
lamb = 0.1
n_iter = 10000

w = tf.Variable(tf.random.normal([d, 1]))
b = tf.Variable(tf.random.normal([1, 1]))
train_input = tf.constant(train_X, dtype = tf.float32)
train_output = tf.constant(train_y, dtype = tf.float32)

loss_record = []
for i in range(n_iter):
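    with tf.GradientTape() as tape:
        y_pred = tf.add(tf.matmul(train_input,w), b)
        # L2 penalty lamb * ||w||_2^2 (the bias b is not penalized)
        reg = lamb * tf.reduce_sum(tf.square(w))
        loss = tf.square(y_pred - train_output)
        loss = tf.reduce_mean(loss)

    # gradient of the regularized objective: data loss + penalty
    w_grad, b_grad = tape.gradient(loss + reg, [w, b])

    loss_record.append(loss.numpy())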
    w.assign_sub(LR * w_grad)
    b.assign_sub(LR * b_grad)

w_val = w.numpy()
b_val = b.numpy()

xp = np.linspace(-4.5, 4.5, 100).reshape(-1,1)
Xp = np.hstack([xp**(i+1) for i in range(d)])
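# Refitting the scaler here matches the training scale only because xp spans
# the same range [-4.5, 4.5] as train_x; in general, reuse the scaler fitted on the training data.
Xp = MaxAbsScaler().fit_transform(Xp)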
Xp = np.asmatrix(Xp)

yp = Xp*w_val + b_val

plt.figure(figsize = (6, 4))
plt.plot(train_x, train_y,'ko')
plt.plot(xp, yp, 'r')
plt.title('Data', fontsize = 15)
plt.xlabel('X', fontsize = 15)
plt.ylabel('Y', fontsize = 15)
#plt.axis('equal')
plt.grid(alpha = 0.3)
#plt.xlim([0, 5])
plt.show()

3.2. Different Regularization Techniques

  • $L_2$ and $L_1$ regularizations
  • Big Data
  • Data augmentation
    • The simplest way to reduce overfitting is to increase the size of the training data




  • Early stopping
    • When performance on the validation set starts getting worse, we immediately stop training the model




  • Dropout (later)
    • This is one of the most interesting regularization techniques.
    • It also produces very good results and is consequently one of the most frequently used regularization techniques in deep learning.
    • At every iteration, it randomly selects some nodes and removes them.
    • It can also be thought of as an ensemble technique in machine learning.
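
The early stopping rule listed above can be sketched with plain NumPy gradient descent. This is a minimal illustration under assumptions: the even/odd train/validation split, the learning rate `lr`, and the `patience` value (how many steps to wait without improvement before stopping) are all hypothetical choices, and the parameters with the best validation error seen so far are kept.

```python
import numpy as np

x = np.linspace(-4.5, 4.5, 10)
y = np.array([0.9819, 0.7973, 1.9737, 0.1838, 1.3180,
              -0.8361, -0.6591, -2.4701, -2.8122, -6.2512])

xs = x / np.max(np.abs(x))                  # scale inputs to [-1, 1]
Phi = np.vander(xs, N=10, increasing=True)  # degree-9 polynomial features

# hypothetical split: even indices for training, odd indices for validation
train_idx, val_idx = np.arange(0, 10, 2), np.arange(1, 10, 2)
Phi_tr, y_tr = Phi[train_idx], y[train_idx]
Phi_va, y_va = Phi[val_idx], y[val_idx]

def mse(theta, Phi, y):
    return np.mean((Phi @ theta - y) ** 2)

theta = np.zeros(Phi.shape[1])
lr = 0.1        # learning rate (assumed)
patience = 50   # steps to wait without validation improvement (assumed)
best_val, best_theta, best_step = np.inf, theta.copy(), 0

for step in range(1, 10001):
    # gradient of the training MSE
    grad = 2 * Phi_tr.T @ (Phi_tr @ theta - y_tr) / len(y_tr)
    theta -= lr * grad

    val = mse(theta, Phi_va, y_va)
    if val < best_val:
        best_val, best_theta, best_step = val, theta.copy(), step
    elif step - best_step >= patience:
        break   # validation error has stopped improving

print(f"stopped at step {step}, best validation MSE {best_val:.4f} at step {best_step}")
```

Using a patience window instead of stopping at the first increase is a common practical variation, since validation error can fluctuate from step to step.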

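The random removal of nodes in dropout can be illustrated with a mask applied to a layer's activations. The sketch below is a NumPy illustration only: the function name `dropout` and the `keep_prob` value are chosen here as examples, and it assumes the common "inverted dropout" convention, in which the surviving activations are divided by `keep_prob` so that their expected value is unchanged and no rescaling is needed at test time.

```python
import numpy as np

def dropout(h, keep_prob, rng, training=True):
    """Randomly zero activations during training and rescale the survivors."""
    if not training:
        return h                              # no nodes are removed at test time
    mask = rng.random(h.shape) < keep_prob    # True = keep this node
    return h * mask / keep_prob               # inverted-dropout scaling (assumed convention)

rng = np.random.default_rng(0)
h = np.ones((1000, 50))                       # stand-in activations for illustration

h_train = dropout(h, keep_prob=0.8, rng=rng)
h_test = dropout(h, keep_prob=0.8, rng=rng, training=False)

print("fraction of nodes dropped:", np.mean(h_train == 0))
print("mean activation during training:", h_train.mean())
```

Because a different mask is drawn at every iteration, each update trains a different thinned network, which is the sense in which dropout resembles an ensemble.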

