Overfitting


By Prof. Seungchul Lee
http://iailab.kaist.ac.kr/
Industrial AI Lab at KAIST

Table of Contents


0. Lecture Video

In [ ]:
from IPython.display import YouTubeVideo
YouTubeVideo('xhjVhMHEGgY', width = "560", height = "315")
Out[ ]:

1. Regression with Polynomial Functions

In machine learning, one of the most common challenges is overfitting. Overfitting occurs when a model learns not only the important patterns in the training data but also the noise and random fluctuations. Instead of performing well on new, unseen data, the model becomes highly tuned to the training data, resulting in poor generalization.

This is similar to a student who memorizes answers to specific questions rather than understanding the concepts, leading to poor performance when asked different questions. In the same way, an overfitted model may perform very well on the training set but fail to make accurate predictions on test data.

Understanding and addressing overfitting is essential for building models that generalize effectively. We introduce the concept of overfitting, its causes, and its impact on model performance. We will also cover practical methods to prevent overfitting, such as regularization, dropout, early stopping, and data augmentation.

With clear examples and explanations, this section aims to provide an understanding of overfitting and how to handle it. By the end of this section, you will be equipped with the knowledge to create models that strike the right balance between fitting the data and making reliable predictions on new data.


In this tutorial, we will consider a simple case where we have ten data points and build a non-linear regression model using polynomial functions of varying degrees. By visualizing the fitted models, we will observe how the degree of the polynomial affects the fit and identify cases of underfitting, appropriate fitting, and overfitting.


In [ ]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# 10 data points
n = 10
x = np.linspace(-4.5, 4.5, 10)
y = np.array([0.9819, 0.7973, 1.9737, 0.1838, 1.3180, -0.8361, -0.6591, -2.4701, -2.8122, -6.2512])

plt.figure(figsize = (6, 4))
plt.plot(x, y, 'o', label = 'Data')
plt.xlabel('X')
plt.ylabel('Y')
plt.grid(alpha = 0.3)
plt.show()

Degree 1 (Linear Regression):

  • The model underfits the data
  • It fails to capture the non-linear pattern and results in a straight-line fit.

In [ ]:
p = np.polyfit(x, y, deg = 1)
In [ ]:
xp = np.arange(-4.5, 4.5, 0.01)
yp = np.polyval(p, xp)

plt.figure(figsize = (6, 4))
plt.plot(x, y, 'o', label = 'Data')
plt.plot(xp, yp, linewidth = 2, label = 'Polynomial')
plt.xlabel('X')
plt.ylabel('Y')
plt.legend()
plt.grid(alpha = 0.3)
plt.show()

Degree 9 (High-Degree Polynomial):

  • The model overfits the data
  • It closely follows every data point, including noise, resulting in a wavy curve that fails to generalize well to unseen data.

In [ ]:
p = np.polyfit(x, y, deg = 9)

xp = np.arange(-4.5, 4.5, 0.01)
yp = np.polyval(p, xp)

plt.figure(figsize = (6, 4))
plt.plot(x, y, 'o', label = 'Data')
plt.plot(xp, yp, linewidth = 2, label = 'Polynomial')
plt.xlabel('X')
plt.ylabel('Y')
plt.legend()
plt.grid(alpha = 0.3)
plt.show()

Degree 3 (Cubic Polynomial):

  • The model fits the data well
  • It captures the underlying trend without being too flexible or too rigid.

In [ ]:
p = np.polyfit(x, y, deg = 3)

xp = np.arange(-4.5, 4.5, 0.01)

plt.figure(figsize = (6, 4))
plt.plot(x, y, 'o', label = 'Data')
plt.plot(xp, np.polyval(p, xp), linewidth = 2, label = 'Polynomial')
plt.xlabel('X')
plt.ylabel('Y')
plt.legend()
plt.grid(alpha = 0.3)
plt.show()

Key Takeaways

  • Underfitting: Happens when the model is too simple (e.g., a low-degree polynomial) and fails to capture the pattern.

  • Good Fit: Achieved when the model is complex enough to capture the trend but not so flexible that it fits noise.

  • Overfitting: Happens when the model is too complex (e.g., a high-degree polynomial) and memorizes noise, reducing generalization.


By testing different degrees of polynomial functions, this example demonstrates the importance of finding the right balance between model complexity and generalization to avoid underfitting and overfitting.
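
As a quick check, we can print the training error for a few polynomial degrees. This is a minimal sketch that reuses the x and y arrays defined above; note that the training MSE keeps shrinking as the degree grows, which is exactly why training error alone cannot reveal overfitting.

In [ ]:
# Compare training MSE for several polynomial degrees (reuses x, y from above)
for deg in [1, 3, 9]:
    p = np.polyfit(x, y, deg = deg)
    mse = np.mean((np.polyval(p, x) - y)**2)
    print('degree {}: training MSE = {:.4f}'.format(deg, mse))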


2. Overfitting

Overfitting occurs when a machine learning model performs well on the training data but poorly on unseen or test data. The model essentially "memorizes" the noise and details in the training set rather than learning the general patterns that can be applied to new data.


Overfitting Problem

  • Have you ever encountered a situation where your model performed exceptionally well on the training data but failed to predict the test data accurately?
  • Avoiding overfitting is one of the most common problems data science professionals face.

Causes of Overfitting

  • Model Complexity: A model with too many parameters (e.g., deep neural networks) can overfit the training data.
  • Insufficient Training Data: If the training dataset is small, the model may fit every point precisely, capturing noise rather than general patterns.
  • Noisy Data: If the data has a lot of noise or irrelevant features, the model may try to learn this noise.
  • Too Many Training Iterations: Training for too long can lead to overfitting, where the model starts fitting noise instead of actual patterns.

Signs of Overfitting

  • High Training Accuracy, Low Test Accuracy: The model performs well on the training set but poorly on the validation/test set.
  • Large Gap Between Training and Validation Loss: Training loss continues to decrease, but validation loss increases.

Generalization Error

Generalization Error is the difference between a machine learning model's performance on the training data and its performance on unseen data (test or validation data); equivalently, it is the expected loss on unseen test data minus the loss on the training data. It measures how well the model can generalize to new, unseen examples: a model with a low generalization error performs well not only on the training set but also on new data. A short numerical sketch follows the symbol definitions below.

Mathematically, the generalization error can be defined as:


$$ \text{Generalization Error} = \mathbb{E}_{D_{\text{test}}}[\mathcal{L}(f(x), y)] - \mathbb{E}_{D_{\text{train}}}[\mathcal{L}(f(x), y)], $$

where:

  • $D_{\text{test}}$ and $D_{\text{train}}$ are the test and training data distributions.
  • $f(x)$ is the model's prediction function.
  • $\mathcal{L}(f(x), y)$ is the loss function measuring the difference between the predicted and true values.
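
As a small numerical sketch of this definition, the cell below fits an over-flexible polynomial to synthetic data (illustrative only, not the lecture data) and estimates the generalization gap as the test loss minus the training loss.

In [ ]:
import numpy as np

# Illustrative data: a noisy cubic, split into training (even indices) and test (odd indices)
rng = np.random.default_rng(0)
x_all = np.linspace(-1, 1, 20)
y_all = 2*x_all**3 - x_all + rng.normal(0, 0.3, size = 20)

x_train, y_train = x_all[::2], y_all[::2]
x_test, y_test = x_all[1::2], y_all[1::2]

# Deliberately over-flexible model
p = np.polyfit(x_train, y_train, deg = 9)
train_loss = np.mean((np.polyval(p, x_train) - y_train)**2)
test_loss = np.mean((np.polyval(p, x_test) - y_test)**2)

print('train MSE = {:.4f}, test MSE = {:.4f}'.format(train_loss, test_loss))
print('estimated generalization gap = {:.4f}'.format(test_loss - train_loss))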

Techniques to Prevent Overfitting


Regularization

  • Prevent overfitting by adding a penalty to the loss function, encouraging simpler models with smaller weights.
  • This penalty discourages the model from fitting noise in the training data by constraining the values of the model's parameters.
  • Regularization helps the model generalize better to unseen data by making it simpler and less sensitive to outliers.

Dropout (later)

  • Randomly drops a percentage of neurons during training, preventing the network from relying on specific neurons and forcing it to learn more robust features.
    • This is one of the most interesting types of regularization techniques (a minimal Keras sketch follows this list).
    • It also produces very good results and is consequently the most frequently used regularization technique in the field of deep learning.
    • At every iteration, it randomly selects some nodes and removes them.
    • It can also be thought of as an ensemble technique in machine learning.
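
A minimal Keras sketch of a dropout layer is given below. The architecture and dropout rate are illustrative choices, not the lecture's model; the key point is that Dropout(0.3) zeroes a random 30% of the previous layer's activations at each training step and is disabled at inference time.

In [ ]:
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape = (20,)),
    tf.keras.layers.Dense(64, activation = 'relu'),
    tf.keras.layers.Dropout(0.3),    # active during training only
    tf.keras.layers.Dense(1)
])

model.summary()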



Early Stopping

  • Monitors the validation loss during training and stops training when the loss starts to increase, preventing overfitting (see the callback sketch below).
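
A minimal sketch with a Keras callback is shown below. It assumes a compiled model and training arrays (X_train, y_train) that are not defined in this lecture, so the fit call is left as a comment.

In [ ]:
import tensorflow as tf

# Stop training once the validation loss has not improved for 5 consecutive epochs
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor = 'val_loss',
    patience = 5,
    restore_best_weights = True    # roll back to the weights from the best epoch
)

# model.fit(X_train, y_train, validation_split = 0.2, epochs = 200, callbacks = [early_stop])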




Data Augmentation

  • Increases the effective size of the training set by applying transformations (e.g., rotations, flips, cropping) to existing data, making the model more generalizable (see the sketch below).
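
A minimal Keras sketch for image inputs is shown below; the chosen layers and the `images` batch are illustrative assumptions, not part of this lecture, and require a recent TF 2.x release.

In [ ]:
import tensorflow as tf

# On-the-fly augmentation: the network rarely sees the exact same image twice
augment = tf.keras.Sequential([
    tf.keras.layers.RandomFlip('horizontal'),
    tf.keras.layers.RandomRotation(0.1),    # rotate by up to +/- 10% of a full turn
    tf.keras.layers.RandomZoom(0.1)
])

# augmented = augment(images, training = True)    # `images`: a hypothetical batch of images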





Cross-Validation

  • Splits the data into multiple folds and trains and validates the model on different subsets, ensuring that the model is not overfitting to any particular subset of the data (see the sketch below).
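
A minimal sketch using scikit-learn's KFold is shown below. Scikit-learn is an extra dependency not used elsewhere in this lecture; the sketch reuses the x and y arrays from Section 1 to score a degree-3 polynomial on 5 folds.

In [ ]:
import numpy as np
from sklearn.model_selection import KFold

# 5-fold cross-validation of a degree-3 polynomial on the 10 data points from Section 1
kf = KFold(n_splits = 5, shuffle = True, random_state = 0)
fold_mse = []

for train_idx, val_idx in kf.split(x):
    p = np.polyfit(x[train_idx], y[train_idx], deg = 3)
    fold_mse.append(np.mean((np.polyval(p, x[val_idx]) - y[val_idx])**2))

print('validation MSE per fold:', np.round(fold_mse, 3))
print('mean validation MSE:', np.mean(fold_mse))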

Reduce Model Complexity

  • Use simpler models with fewer parameters to avoid memorizing the training data.

Among the various techniques to prevent overfitting, the regularization method will be the primary focus of the following session.



3. Regularization to Reduce Overfitting

In machine learning, achieving a balance between fitting the training data well and generalizing to unseen data is crucial. A common problem that arises when a model fits the training data too closely, including noise and outliers, is called overfitting. To address this issue, regularization methods are used to introduce constraints that simplify the model and prevent it from becoming overly complex.

Regularization works by adding a penalty term to the loss function, discouraging large or unnecessary model weights. This forces the model to prioritize learning essential patterns rather than memorizing specific data points. By doing so, regularization improves the model's ability to generalize to new data, reducing the generalization error.

There are different types of regularization, such as L1 regularization (Lasso), L2 regularization (Ridge), and Elastic Net, but in this session, we will focus solely on L2 regularization. L2 regularization, also known as Ridge regularization, adds a penalty proportional to the square of the weights. This penalty discourages the model from assigning disproportionately large values to certain weights, which helps reduce model complexity. As a result, L2 regularization tends to produce smoother models that generalize well to unseen data.

In this session, you will learn how L2 regularization affects the model's loss function, how to adjust the regularization strength, and how it helps prevent overfitting. By the end, you will have a clear understanding of when and how to apply L2 regularization to improve model performance.


3.1. L2 Regularization

L2 regularization, also known as Ridge Regularization or weight decay, involves adding a penalty equal to the sum of the squared weights to the loss function. It discourages large weights by penalizing their size, helping the model learn simpler and more robust patterns.


The regularized loss function is:


$$ J_{L2}(\omega) = J(\omega) + \lambda \sum_{i=1}^{n} \omega_i^2 $$

$\quad$where:

  • $J(\omega)$ is the original loss function (e.g., Mean Squared Error, Cross-Entropy).
  • $\omega = [\omega_1, \omega_2, \cdots, \omega_n]^T$ are the weights (parameters).
  • $\lambda$ is the regularization parameter that controls the strength of the regularization.

Gradient of L2 Regularization

The gradient of the regularized loss function with respect to each weight $\omega_i$ is:


$$ \frac{\partial J_{L2}(\omega)}{\partial \omega_i} = \frac{\partial J(\omega)}{\partial \omega_i} + 2 \lambda \omega_i $$

Weight Update Rule for Gradient Descent

The weight update rule in gradient descent becomes:


$$ \begin{align*} \omega_i &\leftarrow \omega_i - \alpha \frac{\partial J_{L2}(\omega)}{\partial \omega_i}\\\\ \omega_i &\leftarrow \omega_i - \alpha \left( \frac{\partial J(\omega)}{\partial \omega_i} + 2 \lambda \omega_i \right) \end{align*} $$
  • Here, $\alpha$ is the learning rate.
  • The update equation includes an additional term $-2\alpha \lambda \omega_i$, which shrinks each weight toward zero at every iteration (a one-step numerical sketch follows).
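
The one-step sketch below makes the shrinkage term concrete; the weight vector, gradient, and hyperparameter values are illustrative numbers, not taken from the lecture.

In [ ]:
import numpy as np

w = np.array([0.8, -1.5, 2.0])         # current weights (illustrative)
grad_J = np.array([0.1, -0.2, 0.3])    # gradient of the unregularized loss (illustrative)
alpha, lamb = 0.1, 0.01

# gradient step with the extra -2*alpha*lamb*w term shrinking each weight toward zero
w_new = w - alpha*(grad_J + 2*lamb*w)
print(w_new)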

How L2 Regularization Works

  • Effect on Weights:

    • L2 regularization forces the model to keep weights small.
  • Interpretation:

    • L2 regularization distributes the penalty uniformly across all weights, shrinking their magnitude.
  • Overfitting Reduction:

    • By limiting the model's ability to learn large weights, the model becomes less likely to overfit the noise in the training data.
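
In practice, deep learning frameworks let you attach this penalty directly to a layer. The Keras sketch below is illustrative (the layer size and regularization strength are arbitrary) and is not the implementation used later in this section, which applies the penalty manually.

In [ ]:
import tensorflow as tf

lamb = 0.01    # regularization strength (arbitrary here)

# Keras adds lamb * sum(w**2) for this layer's kernel to the training loss
layer = tf.keras.layers.Dense(32, activation = 'relu',
                              kernel_regularizer = tf.keras.regularizers.l2(lamb))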

3.2. Example: L2 Regularization with Non-linear Polynomial Regression

We will illustrate L2 regularization using a non-linear polynomial regression model.

Polynomial regression extends linear regression by incorporating polynomial terms of the input features to fit non-linear curves. However, as the polynomial degree increases, the model becomes prone to overfitting the training data. L2 regularization mitigates this overfitting by penalizing large weight values.


In [ ]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# 10 data points
m = 10
x = np.linspace(-1, 1, 10).reshape(-1,1)
y = np.array([0.9819, 0.7973, 1.9737, 0.1838, 1.3180, -0.8361, -0.6591, -2.4701, -2.8122, -6.2512]).reshape(-1,1)

plt.figure(figsize = (6, 4))
plt.plot(x, y, 'o', label = 'Data')
plt.xlabel('X')
plt.ylabel('Y')
plt.grid(alpha = 0.3)
plt.show()



$$ J_{L2}(\omega) = J(\omega) + \lambda \sum_{i=1}^{n} \omega_i^2 $$
$$\omega_i \leftarrow \omega_i - \alpha \left( \frac{\partial J(\omega)}{\partial \omega_i} + 2 \lambda \omega_i \right)$$
In [ ]:
import tensorflow as tf

LR = 0.1
lamb = 0.1
n_iter = 10000

d = 9
X = np.hstack([x**(i+1) for i in range(d)])

w = tf.Variable(tf.random.normal([d, 1]))
b = tf.Variable(tf.random.normal([1, 1]))

X = tf.constant(X, dtype = tf.float32)
y = tf.constant(y, dtype = tf.float32)

loss_record = []

for i in range(n_iter):
    with tf.GradientTape() as tape:
        y_pred = tf.add(tf.matmul(X, w), b)

        loss = tf.square(y_pred - y)
        loss = tf.reduce_mean(loss)

        reg = lamb * tf.reduce_sum(tf.square(w))    # L2 penalty: lambda * sum of squared weights

        w_grad, b_grad = tape.gradient(loss + reg, [w, b])

    loss_record.append(loss.numpy())

    w.assign_sub(LR * w_grad)
    b.assign_sub(LR * b_grad)
In [ ]:
# to plot
w_val = w.numpy()
b_val = b.numpy()

xp = np.linspace(-1, 1, 100).reshape(-1,1)
Xp = np.hstack([xp**(i+1) for i in range(d)])

yp = Xp @ w_val + b_val

plt.figure(figsize = (6, 4))
plt.plot(x, y,'o')
plt.plot(xp, yp)
plt.title('L2 Regularized Polynomial Fit')
plt.xlabel('X')
plt.ylabel('Y')
plt.grid(alpha = 0.3)
plt.show()

L2 regularization slows the growth of the weights, making them less likely to become too large and overfit the training data.
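
To see this effect directly, the sketch below refits the same degree-9 model with $\lambda = 0$ and $\lambda = 0.1$ and overlays the two curves. It assumes the variables x, y, d, xp, and Xp from the cells above are still defined.

In [ ]:
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt

def fit_poly_l2(lamb, LR = 0.1, n_iter = 10000):
    Xf = tf.constant(np.hstack([x**(i+1) for i in range(d)]), dtype = tf.float32)
    yf = tf.cast(tf.reshape(y, (-1, 1)), tf.float32)

    w = tf.Variable(tf.random.normal([d, 1], seed = 0))
    b = tf.Variable(tf.random.normal([1, 1], seed = 0))

    for _ in range(n_iter):
        with tf.GradientTape() as tape:
            loss = tf.reduce_mean(tf.square(tf.matmul(Xf, w) + b - yf))
            total = loss + lamb*tf.reduce_sum(tf.square(w))    # L2 penalty

        w_grad, b_grad = tape.gradient(total, [w, b])
        w.assign_sub(LR*w_grad)
        b.assign_sub(LR*b_grad)

    return w.numpy(), b.numpy()

plt.figure(figsize = (6, 4))
plt.plot(x, y, 'o', label = 'Data')

for lamb, name in [(0.0, 'No regularization'), (0.1, r'L2 ($\lambda$ = 0.1)')]:
    w_val, b_val = fit_poly_l2(lamb)
    plt.plot(xp, np.asarray(Xp) @ w_val + b_val, linewidth = 2, label = name)

plt.xlabel('X')
plt.ylabel('Y')
plt.legend()
plt.grid(alpha = 0.3)
plt.show()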
