Stochastic and Mini-batch Gradient Descent
Table of Contents
from IPython.display import YouTubeVideo
YouTubeVideo('ALDd7PTU-JY', width = "560", height = "315")
We previously learned about gradient descent as an optimization technique.
Gradient Descent is an iterative optimization algorithm used to minimize a loss function by updating the model's parameters in the direction of the negative gradient. It is widely used in machine learning and deep learning to train models.
In this lecture, we will cover the gradient descent algorithm and its variants: Batch Gradient Descent, Stochastic Gradient Descent, and Mini-batch Gradient Descent.
We will explore these three gradient descent algorithms with a logistic regression model.
Batch Gradient Descent (= Gradient Descent)
In batch gradient descent, we use the entire training data set in every iteration, so each update requires a sum over all $m$ examples.
By linearity, the gradient of the average loss is the average of the per-example gradients:

$$\nabla_w \mathcal{L}(w) = \nabla_w \left[ \frac{1}{m} \sum_{i=1}^{m} \ell\big(w; x^{(i)}, y^{(i)}\big) \right] = \frac{1}{m} \sum_{i=1}^{m} \nabla_w\, \ell\big(w; x^{(i)}, y^{(i)}\big)$$
Advantages:
Stable Convergence: Utilizing the entire dataset leads to a more stable and predictable convergence path toward the minimum of the loss function.
Accurate Gradient Estimation: Since all data points are considered, the gradient computation is precise, which can result in more accurate parameter updates.
Disadvantages:
Computationally Intensive: Processing the entire dataset in each iteration can be time-consuming and resource-intensive, especially with large datasets.
Memory Requirements: Storing and processing large datasets simultaneously may require substantial memory, which can be a limitation for very large datasets.
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
import time
%matplotlib inline
# data generation
m = 10000
true_w = np.array([[-6], [2], [1]])
train_X = np.hstack([np.ones([m,1]), 5*np.random.rand(m,1), 4*np.random.rand(m,1)])
true_w = np.asmatrix(true_w)
train_X = np.asmatrix(train_X)
train_Y = 1/(1 + np.exp(-train_X*true_w)) > 0.5
C1 = np.where(train_Y == True)[0]
C0 = np.where(train_Y == False)[0]
train_Y = np.empty([m,1])
train_Y[C1] = 1
train_Y[C0] = 0
plt.figure(figsize = (6, 4))
plt.plot(train_X[C1,1], train_X[C1,2], 'r.', alpha = 0.1, label = 'C1')
plt.plot(train_X[C0,1], train_X[C0,2], 'b.', alpha = 0.1, label = 'C0')
plt.xlabel(r'$x_1$', fontsize = 15)
plt.ylabel(r'$x_2$', fontsize = 15)
plt.legend(loc = 1, fontsize = 12)
plt.axis('equal')
# plt.xlim([0,4])
plt.ylim([0,4])
plt.show()
LR = 0.05
n_iter = 10000
w = tf.Variable([[0],[0],[0]], dtype = tf.float32)
train_x = tf.constant(train_X, dtype = tf.float32)
train_y = tf.constant(train_Y, dtype = tf.float32)
start_time = time.time()
loss_record = []
for i in range(n_iter):
    with tf.GradientTape() as tape:
        y_pred = tf.sigmoid(tf.matmul(train_x, w))
        loss = - train_y*tf.math.log(y_pred) - (1-train_y)*tf.math.log(1-y_pred)
        loss = tf.reduce_mean(loss)
    w_grad = tape.gradient(loss, w)
    loss_record.append(loss.numpy())
    w.assign_sub(LR * w_grad)
training_time = time.time() - start_time
w_hat = w.numpy()
print(w_hat)
xp = np.arange(0, 4, 0.01).reshape(-1, 1)
yp = - w_hat[1,0]/w_hat[2,0]*xp - w_hat[0,0]/w_hat[2,0]
plt.figure(figsize = (6, 4))
plt.plot(train_X[C1,1], train_X[C1,2], 'r.', alpha = 0.1, label = 'C1')
plt.plot(train_X[C0,1], train_X[C0,2], 'b.', alpha = 0.1, label = 'C0')
plt.plot(xp, yp, 'g', linewidth = 3, label = 'Logistic Regression')
plt.xlabel(r'$x_1$', fontsize = 15)
plt.ylabel(r'$x_2$', fontsize = 15)
plt.legend(loc = 1, fontsize = 12)
plt.axis('equal')
plt.ylim([0,4])
plt.show()
We will examine two aspects: the total training time and how the loss decreases over iterations.
print(training_time)
print("\n")
plt.figure(figsize = (6, 4))
plt.plot(loss_record)
plt.xlabel('iteration', fontsize = 15)
plt.ylabel('loss', fontsize = 15)
plt.show()
Stochastic Gradient Descent (SGD) is an extreme case of batch gradient descent: learning happens on every single example.
Update the parameters based on the gradient of one randomly selected training example:

$$w \leftarrow w - \eta\, \nabla_w\, \ell\big(w; x^{(i)}, y^{(i)}\big), \qquad i \sim \mathrm{Uniform}\{1, \dots, m\}$$
Advantages:
It adds even more noise to the learning process than mini-batch gradient descent, which can help improve generalization.
This approach allows for more frequent updates and can lead to faster convergence, especially with large datasets.
SGD can handle large datasets more efficiently than methods that require full-dataset computations.
Disadvantages:
It uses only one example per update, and a single example can hardly represent the whole data set. In other words, the variance of the gradient estimate becomes large since we only use one example for each learning step.
Since each update is based on a single data point, the optimization path can be noisy, potentially leading to fluctuations in the loss function.
Due to its stochastic nature, SGD might miss the optimal solution if not properly managed.
Mathematical justification: if you sample a training example uniformly at random, the stochastic gradient is an unbiased estimate of the batch gradient:

$$\mathbb{E}_i\left[ \nabla_w\, \ell\big(w; x^{(i)}, y^{(i)}\big) \right] = \frac{1}{m} \sum_{i=1}^{m} \nabla_w\, \ell\big(w; x^{(i)}, y^{(i)}\big) = \nabla_w \mathcal{L}(w)$$
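As a quick numerical check, here is a minimal sketch reusing train_X, train_Y, and m from the data generation above: the per-example gradients of the logistic loss average exactly to the batch gradient, so a uniformly sampled example gives an unbiased estimate.

# Sketch: verify that per-example gradients average to the batch gradient.
X = np.asarray(train_X)                     # (m, 3)
y = np.asarray(train_Y)                     # (m, 1)
w0 = np.zeros((3, 1))                       # evaluation point (arbitrary)

p = 1/(1 + np.exp(-X @ w0))                 # predicted probabilities
batch_grad = X.T @ (p - y) / m              # gradient of the mean cross-entropy
per_example = X * (p - y)                   # row i = gradient of example i's loss
print(np.allclose(per_example.mean(axis = 0).reshape(-1, 1), batch_grad))  # True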
Below is a graph that shows the gradient descent variants and their paths toward the minimum:
As we can see in the figure, the SGD direction is much noisier than the others.
LR = 0.05
n_iter = 10000
w = tf.Variable([[0],[0],[0]], dtype = tf.float32)
train_x = tf.constant(train_X, dtype = tf.float32)
train_y = tf.constant(train_Y, dtype = tf.float32)
# measure training time so the comparison with batch gradient descent is meaningful
start_time = time.time()
loss_record = []
for i in range(n_iter):
    idx = np.random.choice(m)
    batch_x = tf.expand_dims(train_x[idx,:], axis = 0)
    batch_y = tf.expand_dims(train_y[idx], axis = 0)
    with tf.GradientTape() as tape:
        y_pred = tf.sigmoid(tf.matmul(batch_x, w))
        loss = - batch_y*tf.math.log(y_pred) - (1-batch_y)*tf.math.log(1-y_pred)
        loss = tf.reduce_mean(loss)
    w_grad = tape.gradient(loss, w)
    loss_record.append(loss.numpy())
    w.assign_sub(LR * w_grad)
training_time = time.time() - start_time
w_hat = w.numpy()
print(w_hat)
xp = np.arange(0, 4, 0.01).reshape(-1, 1)
yp = - w_hat[1,0]/w_hat[2,0]*xp - w_hat[0,0]/w_hat[2,0]
plt.figure(figsize = (6, 4))
plt.plot(train_X[C1,1], train_X[C1,2], 'r.', alpha = 0.1, label = 'C1')
plt.plot(train_X[C0,1], train_X[C0,2], 'b.', alpha = 0.1, label = 'C0')
plt.plot(xp, yp, 'g', linewidth = 3, label = 'Logistic Regression')
plt.xlabel(r'$x_1$', fontsize = 15)
plt.ylabel(r'$x_2$', fontsize = 15)
plt.legend(loc = 1, fontsize = 12)
plt.axis('equal')
plt.ylim([0,4])
plt.show()
print(training_time)
print("\n")
plt.figure(figsize = (6, 4))
plt.plot(loss_record)
plt.xlabel('iteration', fontsize = 15)
plt.ylabel('loss', fontsize = 15)
plt.show()
The Mini-batch Gradient Descent method is widely used in machine learning and deep learning training. The main idea is similar to batch gradient descent, but unlike batch gradient descent, we can choose the batch size $s$: the parameters are updated based on one mini-batch per iteration. This works because we assume the examples in the data set are positively correlated; in other words, a large data set contains many similar examples, so a small subset can approximate the full gradient.
Gradients computed on larger mini-batches have smaller variance: for a mini-batch $\mathcal{B}$ of $s$ independently sampled examples,

$$\mathrm{Var}\left[ \frac{1}{s} \sum_{i \in \mathcal{B}} \nabla_w\, \ell\big(w; x^{(i)}, y^{(i)}\big) \right] = \frac{1}{s}\, \mathrm{Var}\left[ \nabla_w\, \ell\big(w; x^{(i)}, y^{(i)}\big) \right]$$

so the updates are more stable than SGD's; a numerical illustration follows below.
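A rough numerical illustration, reusing the data above (the batch sizes and the number of resamples are arbitrary choices): the empirical variance of the mini-batch gradient estimate shrinks roughly as $1/s$.

# Sketch: empirical variance of mini-batch gradient estimates vs. batch size s.
X, y = np.asarray(train_X), np.asarray(train_Y)
w0 = np.zeros((3, 1))

def minibatch_grad(s):
    idx = np.random.choice(m, size = s, replace = False)
    p = 1/(1 + np.exp(-X[idx] @ w0))
    return (X[idx].T @ (p - y[idx]) / s).ravel()

for s in [1, 10, 100, 1000]:
    grads = np.array([minibatch_grad(s) for _ in range(500)])
    print(s, grads.var(axis = 0).sum())     # total variance drops roughly as 1/s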
The mini-batch size $s$ is a hyper-parameter that needs to be set.
This approach balances the efficiency and stability of Batch Gradient Descent with the faster convergence of Stochastic Gradient Descent.
The main advantages of MGD:
Computational efficiency: each update is far cheaper than a full-batch update, and mini-batches map well onto vectorized and GPU computation.
Reduced variance: gradient estimates are less noisy than single-example SGD, giving a more stable convergence path.
Choosing the Mini-Batch Size: the batch size $s$ trades gradient accuracy against per-update cost. Moderate sizes (often 32 to 512, frequently powers of two for hardware efficiency) work well in practice; the code below uses $s = 50$.
LR = 0.05
n_iter = 10000
n_batch = 50
w = tf.Variable([[0],[0],[0]], dtype = tf.float32)
# measure training time so the comparison with the other variants is meaningful
start_time = time.time()
loss_record = []
for i in range(n_iter):
    idx = np.random.choice(m, size = n_batch)
    batch_x = tf.constant(train_X[idx,:], dtype = tf.float32)
    batch_y = tf.constant(train_Y[idx], dtype = tf.float32)
    with tf.GradientTape() as tape:
        y_pred = tf.sigmoid(tf.matmul(batch_x, w))
        loss = - batch_y*tf.math.log(y_pred) - (1-batch_y)*tf.math.log(1-y_pred)
        loss = tf.reduce_mean(loss)
    w_grad = tape.gradient(loss, w)
    loss_record.append(loss.numpy())
    w.assign_sub(LR * w_grad)
training_time = time.time() - start_time
w_hat = w.numpy()
print(w_hat)
xp = np.arange(0, 4, 0.01).reshape(-1, 1)
yp = - w_hat[1,0]/w_hat[2,0]*xp - w_hat[0,0]/w_hat[2,0]
plt.figure(figsize = (6, 4))
plt.plot(train_X[C1,1], train_X[C1,2], 'r.', alpha = 0.1, label = 'C1')
plt.plot(train_X[C0,1], train_X[C0,2], 'b.', alpha = 0.1, label = 'C0')
plt.plot(xp, yp, 'g', linewidth = 3, label = 'Logistic Regression')
plt.xlabel(r'$x_1$', fontsize = 15)
plt.ylabel(r'$x_2$', fontsize = 15)
plt.legend(loc = 1, fontsize = 12)
plt.axis('equal')
plt.ylim([0,4])
plt.show()
print(training_time)
plt.figure(figsize = (6, 4))
plt.plot(loss_record)
plt.xlabel('iteration', fontsize = 15)
plt.ylabel('loss', fontsize = 15)
plt.show()
Batch Gradient Descent: uses all $m$ examples per update; accurate and stable, but each update is slow and memory-hungry on large data sets.
Stochastic Gradient Descent (SGD): uses a single example per update; cheap and frequent updates, but with noisy, high-variance gradients.
Mini-Batch Gradient Descent: uses $s$ examples per update; balances the stability of batch gradient descent with the speed of SGD, and is the default choice in deep learning.
from IPython.display import YouTubeVideo
YouTubeVideo('ALDd7PTU-JY', width = "560", height = "315", start = 1768)
A small learning rate converges slowly and can get stuck in false local minima.
A large learning rate overshoots, becomes unstable, and diverges.
Idea 1: Try many different learning rates and see which one works best.
Idea 2: Design an adaptive learning rate that adapts to the loss landscape.
GD Learning Rate: Spatial
The gradient descent method makes a strong assumption about the magnitude of the “local curvature” to fix the step size, and about its isotropy so that the same step size makes sense in all directions.
The loss landscape in high-dimensional space may be steeper in some directions and flatter in others.
A spatially adaptive learning rate assigns larger steps in flat directions and smaller steps in steep directions to ensure faster yet stable convergence.
GD Learning Rate: Temporal
The temporal learning rate in Gradient Descent (GD) refers to the idea of adjusting the learning rate over time during the optimization process. Instead of using a constant learning rate throughout training, a time-dependent schedule is used to either decrease or adaptively adjust the learning rate based on the number of iterations or training progress.
Time-Decay Learning Rate:
The learning rate decreases over time as the model approaches the optimal solution.
Early in training, a higher learning rate is used to make faster progress.
Later in training, the learning rate is reduced to fine-tune the solution and prevent overshooting.
Temporal Learning Rate Schedule:
The learning rate is updated based on the epoch or iteration.
This prevents the model from taking large steps near the optimum, where small adjustments are needed.
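For example, TensorFlow provides built-in schedules; here is a minimal sketch using tf.keras.optimizers.schedules.ExponentialDecay (the decay constants are illustrative, not values from the lecture):

# The learning rate decays by a factor of 0.96 every 1000 steps.
lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate = 0.05,
    decay_steps = 1000,
    decay_rate = 0.96)

optimizer = tf.keras.optimizers.SGD(learning_rate = lr_schedule)
print(lr_schedule(0).numpy(), lr_schedule(5000).numpy())   # larger early, smaller later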
Adaptive Learning Rate Methods improve the standard gradient descent by adjusting the learning rate dynamically for each parameter during training. These methods aim to speed up convergence and improve stability by using information from past gradients.
Why Adaptive Learning Rates?
SGD applies one global learning rate to every parameter, which can be inefficient when different parameters need very different step sizes.
Adagrad (Adaptive Gradient Algorithm)
Deep learning generally relies on a smarter use of the gradient, using statistics over its past values to make a “smarter step” with the current one.
Scales the learning rate inversely proportional to the square root of the sum of past squared gradients:

$$w_{t+1,\,i} = w_{t,\,i} - \frac{\eta}{\sqrt{G_{t,ii} + \epsilon}}\, g_{t,i}$$

$\quad$where $G_{t,ii} = \sum_{\tau=1}^{t} g_{\tau,i}^2$ is the sum of the squares of the past gradients for parameter $i$, and $\epsilon$ is a small constant for numerical stability.
Perform smaller updates (i.e., low learning rates) for parameters associated with frequently occurring features, and
Perform larger updates (i.e., high learning rates) for parameters associated with infrequent features
Larger steps are taken in less frequently updated directions.
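A minimal NumPy sketch of the Adagrad update, reusing train_X, train_Y, and m from above (the learning rate eta and epsilon are illustrative values):

# Sketch: Adagrad keeps a per-parameter accumulator of squared gradients.
X, y = np.asarray(train_X), np.asarray(train_Y)
w0 = np.zeros((3, 1))
G = np.zeros((3, 1))                        # running sum of squared gradients, G_{t,ii}
eta, eps = 0.5, 1e-8

for t in range(1000):
    p = 1/(1 + np.exp(-X @ w0))
    g = X.T @ (p - y) / m                   # batch gradient of the logistic loss
    G += g**2
    w0 -= eta / np.sqrt(G + eps) * g        # smaller steps where gradients were large
print(w0.ravel())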
RMSProp (Root Mean Square Propagation)
RMSProp is an adaptive learning rate optimization algorithm designed to improve the convergence of Stochastic Gradient Descent (SGD) by adjusting the learning rate based on the average magnitude of recent gradients. It helps to stabilize learning by preventing excessively large or small parameter updates, particularly in cases where the loss surface is non-stationary.
Instead of using a constant learning rate, RMSProp adapts the learning rate for each parameter by normalizing the gradient by the square root of the moving average of its squared gradients.
This prevents large parameter updates when the gradient is large and allows faster updates in flat regions where the gradient is small.
The RMSProp algorithm maintains a moving average of the squared gradients:
(1) Squared Gradient Moving Average:

$$E[g^2]_t = \gamma\, E[g^2]_{t-1} + (1 - \gamma)\, g_t^2$$

$\quad$where $\gamma$ is the decay rate (typically $0.9$) and $g_t$ is the gradient at step $t$.

(2) Parameter Update:

$$w_{t+1} = w_t - \frac{\eta}{\sqrt{E[g^2]_t + \epsilon}}\, g_t$$

$\quad$where $\eta$ is the learning rate and $\epsilon$ is a small constant that prevents division by zero.
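A minimal NumPy sketch of these two equations, reusing the logistic-regression data from above (gamma = 0.9 is the usual default; eta is an illustrative choice):

# Sketch: RMSProp normalizes by a moving average of squared gradients.
X, y = np.asarray(train_X), np.asarray(train_Y)
w0 = np.zeros((3, 1))
Eg2 = np.zeros((3, 1))                      # moving average E[g^2]_t
gamma, eta, eps = 0.9, 0.01, 1e-8

for t in range(1000):
    p = 1/(1 + np.exp(-X @ w0))
    g = X.T @ (p - y) / m
    Eg2 = gamma*Eg2 + (1 - gamma)*g**2      # (1) squared-gradient moving average
    w0 -= eta / np.sqrt(Eg2 + eps) * g      # (2) normalized parameter update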
GD with Momentum
Momentum is an optimization technique used to speed up convergence and stabilize gradient descent by smoothing out updates. Instead of updating the parameters directly based on the gradient, momentum incorporates a fraction of the previous update to accelerate the learning process and prevent oscillations.
Why? Plain gradient descent tends to oscillate across steep directions of the loss surface while making slow progress along shallow ones; momentum damps these oscillations and accelerates movement along directions whose gradients are consistently aligned.
Momentum Update Rule
(1) Velocity Update:

$$v_t = \gamma\, v_{t-1} + \eta\, \nabla_w \mathcal{L}(w_t)$$

$\quad$where $\gamma$ is the momentum coefficient (typically $0.9$), $\eta$ is the learning rate, and $\nabla_w \mathcal{L}(w_t)$ is the current gradient.

(2) Parameter Update:

$$w_{t+1} = w_t - v_t$$

$\quad$where $v_t$ is the accumulated velocity at step $t$.
The velocity $v_t$ accumulates past gradients, leading to smoother and faster convergence, particularly in regions with high curvature or flat surfaces.
How Momentum Works: when consecutive gradients point in the same direction, the velocity builds up and the effective step grows; when gradients alternate in sign, they cancel inside the velocity and the oscillation is damped. A minimal sketch follows below.
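A minimal NumPy sketch of the momentum update, reusing the logistic-regression data from above (gamma = 0.9 is the usual default; eta matches the LR used earlier):

# Sketch: the velocity is an exponentially decaying sum of past gradients.
X, y = np.asarray(train_X), np.asarray(train_Y)
w0 = np.zeros((3, 1))
v = np.zeros((3, 1))
gamma, eta = 0.9, 0.05

for t in range(1000):
    p = 1/(1 + np.exp(-X @ w0))
    g = X.T @ (p - y) / m
    v = gamma*v + eta*g                     # (1) velocity update
    w0 -= v                                 # (2) parameter update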
Adam (Adaptive Moment Estimation)
The Adam Optimizer is an advanced optimization algorithm that combines the benefits of both Momentum and RMSProp:
Adam maintains an adaptive learning rate for each parameter by using first-order (mean) and second-order (variance) moments of the gradients.
Adam Update Rule
(1) First Moment (Mean of Gradients):

$$m_t = \beta_1\, m_{t-1} + (1 - \beta_1)\, g_t$$

$\quad$where $\beta_1$ is the decay rate for the first moment (typically $0.9$) and $g_t$ is the gradient at step $t$.

(2) Second Moment (Variance of Gradients):

$$v_t = \beta_2\, v_{t-1} + (1 - \beta_2)\, g_t^2$$

$\quad$where $\beta_2$ is the decay rate for the second moment (typically $0.999$).

(3) Bias-Correction:

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$

(4) Parameter Update:

$$w_{t+1} = w_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon}\, \hat{m}_t$$

$\quad$where $\eta$ is the learning rate and $\epsilon$ is a small constant for numerical stability.
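A minimal NumPy sketch of the four steps above, reusing the logistic-regression data (beta1 = 0.9 and beta2 = 0.999 are the standard Adam defaults; eta is illustrative). The first moment is named m1 to avoid clashing with the data-set size m:

# Sketch: Adam = momentum-style first moment + RMSProp-style second moment.
X, y = np.asarray(train_X), np.asarray(train_Y)
w0 = np.zeros((3, 1))
m1, v2 = np.zeros((3, 1)), np.zeros((3, 1))
beta1, beta2, eta, eps = 0.9, 0.999, 0.05, 1e-8

for t in range(1, 1001):                    # t starts at 1 for bias correction
    p = 1/(1 + np.exp(-X @ w0))
    g = X.T @ (p - y) / m
    m1 = beta1*m1 + (1 - beta1)*g           # (1) first moment
    v2 = beta2*v2 + (1 - beta2)*g**2        # (2) second moment
    m_hat = m1 / (1 - beta1**t)             # (3) bias correction
    v_hat = v2 / (1 - beta2**t)
    w0 -= eta * m_hat / (np.sqrt(v_hat) + eps)   # (4) parameter update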
The Adam optimizer is one of the most widely used optimization algorithms in deep learning due to its ability to combine the strengths of momentum and adaptive learning rates. It adjusts both the direction and magnitude of updates dynamically, making it efficient for complex models with large datasets. However, careful tuning of hyperparameters may still be required to achieve the best performance.
Most algorithms related to gradient descent are available in TensorFlow.
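For example, the mini-batch loop above can be rewritten with a built-in optimizer; here is a minimal sketch using tf.keras.optimizers.Adam with its default learning rate, reusing train_X, train_Y, and m from above:

# Sketch: let TensorFlow's Adam optimizer handle the update rule.
optimizer = tf.keras.optimizers.Adam(learning_rate = 0.001)
w = tf.Variable([[0], [0], [0]], dtype = tf.float32)

for i in range(1000):
    idx = np.random.choice(m, size = 50)
    batch_x = tf.constant(train_X[idx,:], dtype = tf.float32)
    batch_y = tf.constant(train_Y[idx], dtype = tf.float32)
    with tf.GradientTape() as tape:
        y_pred = tf.sigmoid(tf.matmul(batch_x, w))
        loss = tf.reduce_mean(- batch_y*tf.math.log(y_pred) - (1 - batch_y)*tf.math.log(1 - y_pred))
    grads = tape.gradient(loss, [w])
    optimizer.apply_gradients(zip(grads, [w]))   # applies the Adam update to w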
By understanding Gradient Descent, we have learned the foundational algorithm used in most machine learning optimization problems. It is the core of many more advanced optimization techniques used in neural networks and deep learning frameworks.
%%javascript
$.getScript('https://kmahelona.github.io/ipython_notebook_goodies/ipython_notebook_toc.js')