Overfitting
from IPython.display import YouTubeVideo
YouTubeVideo('xhjVhMHEGgY', width = "560", height = "315")
In machine learning, one of the most common challenges is overfitting. Overfitting occurs when a model learns not only the important patterns in the training data but also the noise and random fluctuations. Instead of performing well on new, unseen data, the model becomes highly tuned to the training data, resulting in poor generalization.
This is similar to a student who memorizes answers to specific questions rather than understanding the concepts, leading to poor performance when asked different questions. In the same way, an overfitted model may perform very well on the training set but fail to make accurate predictions on test data.
Understanding and addressing overfitting is essential for building models that generalize effectively. This section introduces the concept of overfitting, its causes, and its impact on model performance, and then covers practical methods to prevent it, such as regularization, dropout, early stopping, and data augmentation.
By the end of this section, you will be equipped to build models that strike the right balance between fitting the training data and making reliable predictions on new data.
In this tutorial, we will consider a simple case where we have ten data points and build a non-linear regression model using polynomial functions of varying degrees. By visualizing the fitted models, we will observe how the degree of the polynomial affects the fit and identify cases of underfitting, appropriate fitting, and overfitting.
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
# 10 data points
n = 10
x = np.linspace(-4.5, 4.5, 10)
y = np.array([0.9819, 0.7973, 1.9737, 0.1838, 1.3180, -0.8361, -0.6591, -2.4701, -2.8122, -6.2512])
plt.figure(figsize = (6, 4))
plt.plot(x, y, 'o', label = 'Data')
plt.xlabel('X')
plt.ylabel('Y')
plt.grid(alpha = 0.3)
plt.show()
Degree 1 (Linear Regression):
p = np.polyfit(x, y, deg = 1)
xp = np.arange(-4.5, 4.5, 0.01)
yp = np.polyval(p, xp)
plt.figure(figsize = (6, 4))
plt.plot(x, y, 'o', label = 'Data')
plt.plot(xp, yp, linewidth = 2, label = 'Polynomial')
plt.xlabel('X')
plt.ylabel('Y')
plt.legend()
plt.grid(alpha = 0.3)
plt.show()
Degree 9 (High-Degree Polynomial):
p = np.polyfit(x, y, deg = 9)
xp = np.arange(-4.5, 4.5, 0.01)
yp = np.polyval(p, xp)
plt.figure(figsize = (6, 4))
plt.plot(x, y, 'o', label = 'Data')
plt.plot(xp, yp, linewidth = 2, label = 'Polynomial')
plt.xlabel('X')
plt.ylabel('Y')
plt.legend()
plt.grid(alpha = 0.3)
plt.show()
Degree 3 (Cubic Polynomial):
p = np.polyfit(x, y, deg = 3)
xp = np.arange(-4.5, 4.5, 0.01)
plt.figure(figsize = (6, 4))
plt.plot(x, y, 'o', label = 'Data')
plt.plot(xp, np.polyval(p, xp), linewidth = 2, label = 'Polynomial')
plt.xlabel('X')
plt.ylabel('Y')
plt.legend()
plt.grid(alpha = 0.3)
plt.show()
Key Takeaways
Underfitting: Happens when the model is too simple (e.g., a low-degree polynomial) and fails to capture the pattern.
Good Fit: Achieved when the model is complex enough to capture the trend but not so flexible that it fits noise.
Overfitting: Happens when the model is too complex (e.g., a high-degree polynomial) and memorizes noise, reducing generalization.
By testing different degrees of polynomial functions, this example demonstrates the importance of finding the right balance between model complexity and generalization to avoid underfitting and overfitting.
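To make this balance concrete, the short sketch below (an illustrative addition; the held-out indices and the list of degrees are arbitrary choices) fits polynomials of a few degrees on a subset of the ten points above and evaluates them on the remaining held-out points. The training error keeps shrinking as the degree grows, while the validation error typically bottoms out at a moderate degree and then rises.
# Illustrative sketch: hold out a few of the ten points as a validation set and
# compare training vs. validation MSE for several polynomial degrees.
# The held-out indices and the degrees are arbitrary choices for this sketch.
idx = np.arange(10)
val_idx = np.array([2, 5, 8])            # arbitrarily chosen held-out points
tr_idx = np.setdiff1d(idx, val_idx)      # the remaining 7 points are used for fitting

for deg in [1, 3, 6]:                    # degree 6 is the highest that 7 points determine uniquely
    p = np.polyfit(x[tr_idx], y[tr_idx], deg = deg)
    tr_err = np.mean((np.polyval(p, x[tr_idx]) - y[tr_idx])**2)
    val_err = np.mean((np.polyval(p, x[val_idx]) - y[val_idx])**2)
    print('degree {}: train MSE = {:.3f}, validation MSE = {:.3f}'.format(deg, tr_err, val_err))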
Overfitting occurs when a machine learning model performs well on the training data but poorly on unseen or test data. The model essentially "memorizes" the noise and details in the training set rather than learning the general patterns that can be applied to new data.
Overfitting Problem
Causes of Overfitting
Common causes include a model that is too complex for the amount of available data, too little or insufficiently diverse training data, noisy labels or features that the model treats as signal, and training for too many iterations without monitoring validation performance.
Signs of Overfitting
Typical signs are a training error that keeps decreasing while the validation (or test) error stagnates or rises, and a large gap between training and test performance.
Generalization Error
Generalization error measures how well a model performs on new, unseen examples: it is the difference between the model's expected loss on unseen (test or validation) data and its loss on the training data. A model with a low generalization error performs well not only on the training set but also on new data.
Mathematically, the generalization error can be defined as:

$$E_{\text{gen}} = \mathbb{E}_{(x, y) \sim \mathcal{D}}\big[\,\ell\big(f(x), y\big)\big] - \frac{1}{m}\sum_{i=1}^{m} \ell\big(f(x_i), y_i\big)$$

where:
$\quad f$ is the trained model,
$\quad \ell(\cdot, \cdot)$ is the loss function,
$\quad \mathcal{D}$ is the distribution from which unseen examples are drawn, and
$\quad (x_i, y_i),\; i = 1, \dots, m$ are the training examples.
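In practice the expectation over unseen data cannot be computed exactly, so the generalization error is estimated as the gap between the loss on a held-out test set and the loss on the training set. A minimal sketch, assuming a polynomial p fitted with np.polyfit and hypothetical arrays x_train, y_train, x_test, y_test (these names are not defined elsewhere in this notebook):
# Minimal sketch of an empirical estimate of the generalization error.
# Assumes a fitted polynomial p and hypothetical arrays x_train, y_train,
# x_test, y_test that are not defined earlier in this notebook.
def mse(p, x_data, y_data):
    return np.mean((np.polyval(p, x_data) - y_data)**2)

gen_error_estimate = mse(p, x_test, y_test) - mse(p, x_train, y_train)
print('estimated generalization error:', gen_error_estimate)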
Techniques to Prevent Overfitting
Regularization
Dropout (later)
Early Stopping (see the sketch after this list)
Data Augmentation
Cross-Validation
Reduce Model Complexity
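As a quick illustration of one of these techniques, the sketch below applies early stopping with the Keras API. It is an illustrative addition: the small network, the 80/20 validation split, and the patience value are arbitrary choices, and the model is unrelated to the polynomial examples above.
import tensorflow as tf

# Illustrative early-stopping sketch; the architecture, validation split, and
# patience below are arbitrary choices and not part of the original example.
model = tf.keras.Sequential([
    tf.keras.Input(shape = (1,)),
    tf.keras.layers.Dense(32, activation = 'relu'),
    tf.keras.layers.Dense(1)
])
model.compile(optimizer = 'adam', loss = 'mse')

early_stop = tf.keras.callbacks.EarlyStopping(monitor = 'val_loss',
                                              patience = 20,
                                              restore_best_weights = True)

# Training stops once the validation loss has not improved for 20 epochs,
# and the weights from the best epoch are restored.
model.fit(x.reshape(-1, 1), y.reshape(-1, 1),
          validation_split = 0.2, epochs = 500,
          callbacks = [early_stop], verbose = 0)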
Among the various techniques to prevent overfitting, the regularization method will be the primary focus of the following session.
In machine learning, achieving a balance between fitting the training data well and generalizing to unseen data is crucial. A common problem that arises when a model fits the training data too closely, including noise and outliers, is called overfitting. To address this issue, regularization methods are used to introduce constraints that simplify the model and prevent it from becoming overly complex.
Regularization works by adding a penalty term to the loss function, discouraging large or unnecessary model weights. This forces the model to prioritize learning essential patterns rather than memorizing specific data points. By doing so, regularization improves the model's ability to generalize to new data, reducing the generalization error.
There are different types of regularization, such as L1 regularization (Lasso), L2 regularization (Ridge), and Elastic Net, but in this session, we will focus solely on L2 regularization. L2 regularization, also known as Ridge regularization, adds a penalty proportional to the square of the weights. This penalty discourages the model from assigning disproportionately large values to certain weights, which helps reduce model complexity. As a result, L2 regularization tends to produce smoother models that generalize well to unseen data.
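For reference, the three penalties differ only in how they measure the size of the weights. With $\lambda \geq 0$ the regularization strength and $\alpha \in [0, 1]$ a mixing ratio (one common parameterization of Elastic Net):

$$\text{L1 (Lasso): } \lambda \sum_i |\omega_i| \qquad \text{L2 (Ridge): } \lambda \sum_i \omega_i^2 \qquad \text{Elastic Net: } \lambda \left( \alpha \sum_i |\omega_i| + (1 - \alpha) \sum_i \omega_i^2 \right)$$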
In this session, you will learn how L2 regularization affects the model's loss function, how to adjust the regularization strength, and how it helps prevent overfitting. By the end, you will have a clear understanding of when and how to apply L2 regularization to improve model performance.
L2 regularization, also known as Ridge Regularization or weight decay, involves adding a penalty equal to the sum of the squared weights to the loss function. It discourages large weights by penalizing their size, helping the model learn simpler and more robust patterns.
The regularized loss function is:

$$L_{\text{reg}} = L + \lambda \sum_{i} \omega_i^2$$

$\quad$where:
$\quad L$ is the original loss (e.g., the mean squared error),
$\quad \omega_i$ are the model weights, and
$\quad \lambda \geq 0$ is the regularization strength, which controls the size of the penalty.
Gradient of L2 Regularization
The gradient of the regularized loss function with respect to each weight $\omega_i$ is:

$$\frac{\partial L_{\text{reg}}}{\partial \omega_i} = \frac{\partial L}{\partial \omega_i} + 2 \lambda \omega_i$$
Weight Update Rule for Gradient Descent
The weight update rule in gradient descent becomes:

$$\omega_i \leftarrow \omega_i - \eta \left( \frac{\partial L}{\partial \omega_i} + 2 \lambda \omega_i \right) = (1 - 2 \eta \lambda)\, \omega_i - \eta \frac{\partial L}{\partial \omega_i}$$

where $\eta$ is the learning rate. The factor $(1 - 2 \eta \lambda)$ shrinks each weight toward zero at every step, which is why L2 regularization is also called weight decay.
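A minimal NumPy sketch of this update rule for a plain linear model (the toy data, learning rate, and $\lambda$ below are placeholder values chosen only to make the sketch runnable); the polynomial example later in this section applies the same penalty through TensorFlow's automatic differentiation instead:
import numpy as np

# Minimal sketch of gradient descent with the L2 (weight decay) update rule.
# The toy data, learning rate, and lambda are placeholder values.
np.random.seed(0)
X_toy = np.random.randn(20, 3)
y_toy = X_toy @ np.array([1.0, -2.0, 0.5]) + 0.1 * np.random.randn(20)

w = np.zeros(3)
eta, lam = 0.05, 0.1

for _ in range(1000):
    grad = 2 / len(y_toy) * X_toy.T @ (X_toy @ w - y_toy)   # gradient of the MSE term
    w = w - eta * (grad + 2 * lam * w)                      # extra 2*lambda*w term from the penalty

print(w)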
How L2 Regularization Works
Effect on Weights: The penalty shrinks the weights toward zero, but unlike L1 regularization it rarely drives them exactly to zero, so every feature keeps a small contribution.
Interpretation: Smaller weights produce a smoother, less flexible function, so the model cannot bend sharply to pass through individual training points.
Overfitting Reduction: Because the model can no longer rely on a few very large weights to memorize noise, the gap between training and test error tends to shrink.
We will illustrate L2 regularization using a non-linear polynomial regression model.
Polynomial regression extends linear regression by incorporating polynomial terms of the input features to fit non-linear curves. However, as the polynomial degree increases, the model becomes prone to overfitting the training data. L2 regularization mitigates this overfitting by penalizing large weight values.
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
# 10 data points
m = 10
x = np.linspace(-1, 1, 10).reshape(-1,1)
y = np.array([0.9819, 0.7973, 1.9737, 0.1838, 1.3180, -0.8361, -0.6591, -2.4701, -2.8122, -6.2512]).reshape(-1,1)
plt.figure(figsize = (6, 4))
plt.plot(x, y, 'o', label = 'Data')
plt.xlabel('X')
plt.ylabel('Y')
plt.grid(alpha = 0.3)
plt.show()
import tensorflow as tf
LR = 0.1
lamb = 0.1
n_iter = 10000
d = 9
X = np.hstack([x**(i+1) for i in range(d)])
w = tf.Variable(tf.random.normal([d, 1]))
b = tf.Variable(tf.random.normal([1, 1]))
X = tf.constant(X, dtype = tf.float32)
y = tf.constant(y, dtype = tf.float32)
loss_record = []
for i in range(n_iter):
    with tf.GradientTape() as tape:
        y_pred = tf.add(tf.matmul(X, w), b)
        loss = tf.reduce_mean(tf.square(y_pred - y))
        reg = lamb * tf.reduce_mean(tf.square(w))   # L2 penalty scaled by the regularization strength
    # gradients of the regularized loss with respect to the parameters
    w_grad, b_grad = tape.gradient(loss + reg, [w, b])
    w.assign_sub(LR * w_grad)
    b.assign_sub(LR * b_grad)
    loss_record.append(loss.numpy())
# to plot
w_val = w.numpy()
b_val = b.numpy()
xp = np.linspace(-1, 1, 100).reshape(-1,1)
Xp = np.hstack([xp**(i+1) for i in range(d)])
yp = Xp @ w_val + b_val
plt.figure(figsize = (6, 4))
plt.plot(x, y, 'o', label = 'Data')
plt.plot(xp, yp, linewidth = 2, label = 'L2 Regularized Fit')
plt.legend()
plt.xlabel('X')
plt.ylabel('Y')
plt.grid(alpha = 0.3)
plt.show()
L2 regularization slows the growth of the weights, making them less likely to become too large and overfit the training data.
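To see this effect directly, one can retrain the same model with the penalty switched off and compare the size of the learned weights. A minimal sketch, reusing X, y, and d defined above (the iteration count is reduced here only to keep the comparison quick):
# Compare the learned weight norms with and without the L2 penalty.
# Reuses X, y (TensorFlow constants) and d from the cells above; n_iter is reduced for speed.
def train(lamb, n_iter = 2000, LR = 0.1):
    w = tf.Variable(tf.zeros([d, 1]))
    b = tf.Variable(tf.zeros([1, 1]))
    for _ in range(n_iter):
        with tf.GradientTape() as tape:
            loss = tf.reduce_mean(tf.square(tf.matmul(X, w) + b - y))
            reg = lamb * tf.reduce_mean(tf.square(w))
        w_grad, b_grad = tape.gradient(loss + reg, [w, b])
        w.assign_sub(LR * w_grad)
        b.assign_sub(LR * b_grad)
    return w.numpy()

print('||w|| without regularization:', np.linalg.norm(train(lamb = 0.0)))
print('||w|| with regularization   :', np.linalg.norm(train(lamb = 0.1)))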