Stochastic and Mini-batch Gradient Descent


By Prof. Seungchul Lee
http://iailab.kaist.ac.kr/
Industrial AI Lab at KAIST

Table of Contents


0. Lecture Video

In [ ]:
from IPython.display import YouTubeVideo
YouTubeVideo('ALDd7PTU-JY', width = "560", height = "315")
Out[ ]:

1. Gradient Descent Algorithm

We learned Gradient Descent as an optimization technique.


$$\text{Repeat : } x \leftarrow x - \alpha \nabla _x f(x) \quad \quad \text{for some step size } \alpha > 0$$




Gradient Descent is an iterative optimization algorithm used to minimize a loss function by updating the model's parameters in the direction of the negative gradient. It is widely used in machine learning and deep learning to train models.
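For intuition, here is a minimal sketch of this update rule on a simple one-dimensional quadratic (the function, step size, and iteration count are illustrative choices):

In [ ]:
# a minimal gradient descent sketch on f(x) = (x - 3)^2
alpha = 0.1                  # step size
x = 0.0                      # initial guess

for _ in range(100):
    grad = 2*(x - 3)         # gradient of f(x) = (x - 3)^2
    x = x - alpha*grad       # x <- x - alpha * grad f(x)

print(x)                     # approaches the minimizer x = 3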

In this lecture, we will cover the gradient descent algorithm and its variants:

  • Batch Gradient Descent
  • Stochastic Gradient Descent
  • Mini-batch Gradient Descent

We will explore these three gradient descent algorithms using a logistic regression model.


2. Batch Gradient Descent

(= Gradient Descent)

In Batch Gradient Descent, we use the entire training data set for each iteration, so each update requires a sum over all training examples.


$$\mathcal{E} (\omega) = \frac{1}{m} \sum_{i=1}^{m} \ell (\hat y_i, y_i) = \frac{1}{m} \sum_{i=1}^{m} \ell (h_{\omega}(x_i), y_i)$$

By linearity,


$$\nabla_{\omega} \mathcal{E} = \nabla_{\omega} \frac{1}{m} \sum_{i=1}^{m} \ell (h_{\omega}(x_i), y_i)= \frac{1}{m} \sum_{i=1}^{m} \frac{\partial }{\partial \omega}\ell (h_{\omega}(x_i), y_i)$$

$$\omega \leftarrow \omega - \alpha \, \nabla_{\omega} \mathcal{E} $$

Advantages:

  • Stable Convergence: Utilizing the entire dataset leads to a more stable and predictable convergence path toward the minimum of the loss function.

  • Accurate Gradient Estimation: Since all data points are considered, the gradient computation is precise, which can result in more accurate parameter updates.


Disadvantages:

  • Computationally Intensive: Processing the entire dataset in each iteration can be time-consuming and resource-intensive, especially with large datasets.

  • Memory Requirements: Storing and processing large datasets simultaneously may require substantial memory, which can be a limitation for very large datasets.


In [ ]:
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
import time
%matplotlib inline
In [ ]:
# data generation

m = 10000
true_w = np.array([[-6], [2], [1]])
train_X = np.hstack([np.ones([m,1]), 5*np.random.rand(m,1), 4*np.random.rand(m,1)])

true_w = np.asmatrix(true_w)
train_X = np.asmatrix(train_X)

train_Y = 1/(1 + np.exp(-train_X*true_w)) > 0.5

C1 = np.where(train_Y == True)[0]
C0 = np.where(train_Y == False)[0]

train_Y = np.empty([m,1])
train_Y[C1] = 1
train_Y[C0] = 0

plt.figure(figsize = (6, 4))
plt.plot(train_X[C1,1], train_X[C1,2], 'r.', alpha = 0.1, label = 'C1')
plt.plot(train_X[C0,1], train_X[C0,2], 'b.', alpha = 0.1, label = 'C0')
plt.xlabel(r'$x_1$', fontsize = 15)
plt.ylabel(r'$x_2$', fontsize = 15)
plt.legend(loc = 1, fontsize = 12)
plt.axis('equal')
# plt.xlim([0,4])
plt.ylim([0,4])
plt.show()
In [ ]:
LR = 0.05
n_iter = 10000

w = tf.Variable([[0],[0],[0]], dtype = tf.float32)
train_x = tf.constant(train_X, dtype = tf.float32)
train_y = tf.constant(train_Y, dtype = tf.float32)

start_time = time.time()

loss_record = []

for i in range(n_iter):
    with tf.GradientTape() as tape:
        y_pred = tf.sigmoid(tf.matmul(train_x, w))
        loss = - train_y*tf.math.log(y_pred) - (1-train_y)*tf.math.log(1-y_pred)
        loss = tf.reduce_mean(loss)
        w_grad = tape.gradient(loss, w)

    loss_record.append(loss)
    w.assign_sub(LR * w_grad)

training_time = time.time() - start_time

w_hat = w.numpy()
print(w_hat)
[[-9.898782 ]
 [ 3.5175416]
 [ 1.5210328]]
In [ ]:
xp = np.arange(0, 4, 0.01).reshape(-1, 1)
yp = - w_hat[1,0]/w_hat[2,0]*xp - w_hat[0,0]/w_hat[2,0]

plt.figure(figsize = (6, 4))
plt.plot(train_X[C1,1], train_X[C1,2], 'r.', alpha = 0.1, label = 'C1')
plt.plot(train_X[C0,1], train_X[C0,2], 'b.', alpha = 0.1, label = 'C0')
plt.plot(xp, yp, 'g', linewidth = 3, label = 'Logistic Regression')
plt.xlabel(r'$x_1$', fontsize = 15)
plt.ylabel(r'$x_2$', fontsize = 15)
plt.legend(loc = 1, fontsize = 12)
plt.axis('equal')
plt.ylim([0,4])
plt.show()

We will examine two aspects:

  • the duration of the training session, and
  • the noise in the loss function.
In [ ]:
print(training_time)
print("\n")

plt.figure(figsize = (6, 4))
plt.plot(loss_record)
plt.xlabel('iteration', fontsize = 15)
plt.ylabel('loss', fontsize = 15)
plt.show()
43.00573205947876

3. Stochastic Gradient Descent (SGD)

Stochastic Gradient Descent is an extreme case of Batch Gradient Descent: the parameters are updated after every single training example.

Update the parameters based on the gradient for a single randomly selected training data point:


$$f(\omega) = \ell (\hat y_i, y_i) = \ell (h_{\omega}(x_i), y_i) = \ell^{(i)}$$

$$\omega \leftarrow \omega - \alpha \, \frac{\partial \ell^{(i)}}{\partial \omega}$$

The advantages:

  • It adds even more noise to the learning process than mini-batch gradient descent, which can help improve generalization error.

  • This approach allows for more frequent updates and can lead to faster convergence, especially with large datasets.

  • SGD can handle large datasets more efficiently than methods requiring full-dataset computations.


The disadvantages:

  • It uses only one example for each update, but a single example can hardly represent the whole data set. In other words, the variance of the gradient estimate becomes large since we only use one example for each learning step.

  • Since each update is based on a single data point, the optimization path can be noisy, potentially leading to fluctuations in the loss function.

  • Due to its stochastic nature, SGD might miss the optimal solution if not properly managed.


Mathematical justification: if you sample a training example at random, the stochastic gradient is an unbiased estimate of the batch gradient:


$$\mathbb{E} \left[\frac{\partial \ell^{(i)}}{\partial \omega} \right] = \frac{1}{m} \sum_{i=1}^{m} \frac{\partial \ell^{(i)}}{\partial \omega} = \frac{\partial }{\partial \omega} \left[ \frac{1}{m} \sum_{i=1}^{m} \ell^{(i)} \right] = \frac{\partial \mathcal{E}}{\partial \omega}$$
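This unbiasedness is easy to check numerically. Below is a quick sketch using a toy quadratic per-example loss $\ell^{(i)}(\omega) = (\omega - x_i)^2$ (an assumed loss chosen only so the gradients are simple, not the lecture's logistic loss):

In [ ]:
# numerical check: a uniformly sampled stochastic gradient is an unbiased
# estimate of the batch gradient (toy quadratic loss, illustrative only)
import numpy as np

np.random.seed(0)
x = np.random.randn(1000)            # toy data points
w = 0.5                              # current parameter value

per_example_grads = 2*(w - x)        # gradient of (w - x_i)^2 for each i

batch_grad = per_example_grads.mean()
sgd_estimate = np.mean([per_example_grads[np.random.choice(len(x))]
                        for _ in range(100000)])

print(batch_grad, sgd_estimate)      # the two averages should nearly agree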

Below is a graph that shows the gradient descent variants and their directions toward the minimum:



As we can see in the figure, the SGD direction is very noisy compared to the others.


In [ ]:
LR = 0.05
n_iter = 10000

w = tf.Variable([[0],[0],[0]], dtype = tf.float32)
train_x = tf.constant(train_X, dtype = tf.float32)
train_y = tf.constant(train_Y, dtype = tf.float32)

start_time = time.time()

loss_record = []

for i in range(n_iter):
    with tf.GradientTape() as tape:
        idx = np.random.choice(m)
        batch_x = tf.expand_dims(train_x[idx,:], axis = 0)
        batch_y = tf.expand_dims(train_y[idx], axis = 0)

        y_pred = tf.sigmoid(tf.matmul(batch_x, w))
        loss = - batch_y*tf.math.log(y_pred) - (1-batch_y)*tf.math.log(1-y_pred)
        loss = tf.reduce_mean(loss)
        w_grad = tape.gradient(loss, w)

    loss_record.append(loss)
    w.assign_sub(LR * w_grad)

training_time = time.time() - start_time

w_hat = w.numpy()
print(w_hat)
[[-9.925287 ]
 [ 3.5804236]
 [ 1.5582784]]
In [ ]:
xp = np.arange(0, 4, 0.01).reshape(-1, 1)
yp = - w_hat[1,0]/w_hat[2,0]*xp - w_hat[0,0]/w_hat[2,0]

plt.figure(figsize = (6, 4))
plt.plot(train_X[C1,1], train_X[C1,2], 'r.', alpha = 0.1, label = 'C1')
plt.plot(train_X[C0,1], train_X[C0,2], 'b.', alpha = 0.1, label = 'C0')
plt.plot(xp, yp, 'g', linewidth = 3, label = 'Logistic Regression')
plt.xlabel(r'$x_1$', fontsize = 15)
plt.ylabel(r'$x_2$', fontsize = 15)
plt.legend(loc = 1, fontsize = 12)
plt.axis('equal')
plt.ylim([0,4])
plt.show()
In [ ]:
print(training_time)
print("\n")

plt.figure(figsize = (6, 4))
plt.plot(loss_record)
plt.xlabel('iteration', fontsize = 15)
plt.ylabel('loss', fontsize = 15)
plt.show()

4. Mini-batch Gradient Descent

The Mini-batch Gradient Descent method is often used in machine learning and deep learning training. The main idea is similar to batch gradient descent; however, unlike batch gradient descent, we can choose the batch size $s$, and the parameters are updated based on a mini-batch at each iteration. This method works because we assume the examples in the data set are positively correlated; in other words, a large data set contains many similar examples.


$$\mathcal{E} (\omega) = \frac{1}{s} \sum_{i=1}^{s} \ell (\hat y_i, y_i) = \frac{1}{s} \sum_{i=1}^{s} \ell (h_{\omega}(x_i), y_i) = \frac{1}{s} \sum_{i=1}^{s} \ell^{(i)}$$

$$\omega \leftarrow \omega - \alpha \, \nabla_{\omega} \mathcal{E} $$

Gradients computed on larger mini-batches have smaller variance:


$$\text{var} \left[ \frac{1}{s} \sum_{i=1}^{s} \frac{\partial \ell^{(i)}}{\partial \omega} \right] = \frac{1}{s^2} \text{var} \left[ \sum_{i=1}^{s} \frac{\partial \ell^{(i)}}{\partial \omega} \right] = \frac{1}{s} \text{var} \left[ \frac{\partial \ell^{(i)}}{\partial \omega} \right]$$

The mini-batch size $s$ is a hyper-parameter that needs to be set.

This approach balances the efficiency and stability of Batch Gradient Descent with the faster convergence of Stochastic Gradient Descent.
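This $1/s$ variance reduction can be verified empirically. The sketch below reuses the toy quadratic per-example gradients from the unbiasedness check above (again an illustrative assumption, not the lecture's logistic loss):

In [ ]:
# empirical check: the variance of a mini-batch gradient shrinks like 1/s
import numpy as np

np.random.seed(0)
g = 2*(0.5 - np.random.randn(100000))        # toy per-example gradients

for s in [1, 10, 100]:
    # draw many mini-batch gradients of size s and measure their variance
    estimates = [g[np.random.choice(len(g), size = s)].mean()
                 for _ in range(2000)]
    print(s, np.var(estimates))              # roughly var, var/10, var/100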


The main advantages of MGD:

  • Faster than batch gradient descent, since each update goes through far fewer examples than batch gradient descent (which uses all examples).
  • Since we randomly choose the mini-batch examples, we avoid redundant and very similar examples that contribute little to the learning.
  • With batch size $<$ size of the training set, it adds noise to the learning process, which helps improve generalization error.




Choosing the Mini-Batch Size:

  • Selecting an appropriate mini-batch size is crucial. Smaller batch sizes can introduce noise, potentially aiding in escaping local minima, while larger batch sizes provide more accurate gradient estimates but may require more computational resources.

In [ ]:
LR = 0.05
n_iter = 10000
n_batch = 50

w = tf.Variable([[0],[0],[0]], dtype = tf.float32)

start_time = time.time()

loss_record = []

for i in range(n_iter):
    with tf.GradientTape() as tape:
        idx = np.random.choice(m, size = n_batch)
        batch_x = tf.constant(train_X[idx,:], dtype = tf.float32)
        batch_y = tf.constant(train_Y[idx], dtype = tf.float32)

        y_pred = tf.sigmoid(tf.matmul(batch_x, w))
        loss = - batch_y*tf.math.log(y_pred) - (1-batch_y)*tf.math.log(1-y_pred)
        loss = tf.reduce_mean(loss)
        w_grad = tape.gradient(loss, w)

    loss_record.append(loss)
    w.assign_sub(LR * w_grad)

training_time = time.time() - start_time

w_hat = w.numpy()
print(w_hat)
[[-9.897612 ]
 [ 3.5361586]
 [ 1.5278765]]
In [ ]:
xp = np.arange(0, 4, 0.01).reshape(-1, 1)
yp = - w_hat[1,0]/w_hat[2,0]*xp - w_hat[0,0]/w_hat[2,0]

plt.figure(figsize = (6, 4))
plt.plot(train_X[C1,1], train_X[C1,2], 'r.', alpha = 0.1, label = 'C1')
plt.plot(train_X[C0,1], train_X[C0,2], 'b.', alpha = 0.1, label = 'C0')
plt.plot(xp, yp, 'g', linewidth = 3, label = 'Logistic Regression')
plt.xlabel(r'$x_1$', fontsize = 15)
plt.ylabel(r'$x_2$', fontsize = 15)
plt.legend(loc = 1, fontsize = 12)
plt.axis('equal')
plt.ylim([0,4])
plt.show()
In [ ]:
print(training_time)

plt.figure(figsize = (6, 4))
plt.plot(loss_record)
plt.xlabel('iteration', fontsize = 15)
plt.ylabel('loss', fontsize = 15)
plt.show()

5. Summary

  • Batch Gradient Descent:

    • Updates parameters using the entire dataset.
    • Stable but slow for large datasets.
  • Stochastic Gradient Descent (SGD):

    • Updates parameters using a single data point at a time.
    • Faster but introduces noise in updates.
  • Mini-Batch Gradient Descent:

    • Updates parameters using small random subsets of the dataset.
    • Balances speed and stability.


  • There is no guarantee that the behavior shown below will always occur. However, the noisy SGD gradients can sometimes help in escaping local optima.




6. Challenges of Gradient Descent

In [ ]:
from IPython.display import YouTubeVideo
YouTubeVideo('ALDd7PTU-JY', width = "560", height = "315", start = 1768)
Out[ ]:

6.1. Setting the Learning Rate


$$\omega_{k+1} = \omega_k - \alpha \, \nabla f(\omega_k)$$
  • A small learning rate converges slowly and can get stuck in false local minima

  • A large learning rate overshoots, becomes unstable, and diverges


  • Idea 1

    • Try lots of different learning rates and see what works “just right”
  • Idea 2

    • Do something smarter! Design an adaptive learning rate that “adapts” to the landscape
    • Temporal and spatial

GD Learning Rate: Spatial

The gradient descent method makes a strong assumption about the magnitude of the “local curvature” to fix the step size, and about its isotropy so that the same step size makes sense in all directions.

  • We typically assign the same learning rate to all features


  • The loss landscape in high-dimensional space may be steeper in some directions and flatter in others.

  • A spatially adaptive learning rate assigns larger steps in flat directions and smaller steps in steep directions to ensure faster yet stable convergence.


GD Learning Rate: Temporal

The temporal learning rate in Gradient Descent (GD) refers to the idea of adjusting the learning rate over time during the optimization process. Instead of using a constant learning rate throughout training, a time-dependent schedule is used to either decrease or adaptively adjust the learning rate based on the number of iterations or training progress.


Time-Decay Learning Rate:

  • The learning rate decreases over time as the model approaches the optimal solution.

  • Early in training, a higher learning rate is used to make faster progress.

  • Later in training, the learning rate is reduced to fine-tune the solution and prevent overshooting.


Temporal Learning Rate Schedule:

  • The learning rate is updated based on the epoch or iteration.

  • This prevents the model from taking large steps near the optimum, where small adjustments are needed.
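As a concrete sketch, one common time-decay schedule is $\alpha_t = \alpha_0 / (1 + k t)$ (this particular $1/t$ form and its constants are assumptions; step, exponential, and cosine decays are equally standard choices):

In [ ]:
# sketch of a 1/t time-decay learning rate schedule (illustrative constants)
alpha0 = 0.5          # initial learning rate
k = 0.01              # decay-rate hyper-parameter

for t in range(0, 1001, 200):
    alpha_t = alpha0 / (1 + k*t)     # learning rate shrinks as training progresses
    print(t, alpha_t)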


6.2. Adaptive Learning Rate Methods

Adaptive Learning Rate Methods improve the standard gradient descent by adjusting the learning rate dynamically for each parameter during training. These methods aim to speed up convergence and improve stability by using information from past gradients.

Why Adaptive Learning Rates?

  • In standard gradient descent, the learning rate $\alpha$ is fixed.
  • A small learning rate leads to slow convergence, while a large learning rate can cause divergence.
  • Adaptive methods adjust the learning rate for each parameter based on the magnitude of past gradients, allowing for more efficient learning.

SGD


$$\omega_{t+1,i} = \omega_{t,i} - \alpha \cdot g_{t,i} $$

Adagrad (Adaptive Gradient Algorithm)

  • Deep learning generally relies on a smarter use of the gradient, using statistics over its past values to make a “smarter step” with the current one.

  • Scales the learning rate inversely proportional to the sum of past squared gradients.


$$\omega_{t+1,i} = \omega_{t,i} - \frac{\alpha}{\sqrt{G_{t,ii} + \epsilon}} \cdot g_{t,i} $$
  • $G_{t,ii}$ is the sum of the squares of the past gradients with respect to $\omega_i$ up to step $t$

  • Perform smaller updates (i.e., low learning rates) for parameters associated with frequently occurring features, and

  • Perform larger updates (i.e., high learning rates) for parameters associated with infrequent features

  • Larger steps are taken in less frequently updated directions.
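A minimal NumPy sketch of this update (the two-dimensional quadratic loss and the hyper-parameters below are illustrative assumptions, not the lecture's model):

In [ ]:
# Adagrad sketch on the toy loss f(w) = (w0 - 3)^2 + (w1 - 1)^2
import numpy as np

alpha, eps = 0.5, 1e-8
w = np.array([0.0, 0.0])
G = np.zeros(2)                          # running sum of squared gradients

for _ in range(500):
    g = 2*(w - np.array([3.0, 1.0]))     # gradient of the toy loss
    G += g**2                            # accumulate squared gradients
    w -= alpha / np.sqrt(G + eps) * g    # per-parameter scaled update

print(w)                                 # approaches the minimizer [3, 1]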

RMSProp (Root Mean Square Propagation)

RMSProp is an adaptive learning rate optimization algorithm designed to improve the convergence of Stochastic Gradient Descent (SGD) by adjusting the learning rate based on the average magnitude of recent gradients. It helps to stabilize learning by preventing excessively large or small parameter updates, particularly in cases where the loss surface is non-stationary.

  • Instead of using a constant learning rate, RMSProp adapts the learning rate for each parameter by normalizing the gradient by the square root of the moving average of its squared gradients.

  • This prevents large parameter updates when the gradient is large and allows faster updates in flat regions where the gradient is small.


The RMSProp algorithm maintains a moving average of the squared gradients:

(1) Squared Gradient Moving Average:


$$v_t = \beta v_{t-1} + (1 - \beta) g_t^2$$

$\quad$where:

  • $v_t$ is the moving average of squared gradients at time $t$.
  • $\beta$ is the decay rate (typically set to 0.9).
  • $g_t$ is the gradient of the loss function at time $t$.

(2) Parameter Update:


$$\omega_t \leftarrow \omega_{t-1} - \frac{\alpha}{\sqrt{v_t + \epsilon}} g_t$$

$\quad$where:

  • $\alpha$ is the learning rate.
  • $\epsilon$ (a small value like $10^{-8}$) is added to avoid division by zero.
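A matching NumPy sketch of the RMSProp update on the same toy quadratic loss (loss and hyper-parameters are again illustrative assumptions):

In [ ]:
# RMSProp sketch on the toy loss f(w) = (w0 - 3)^2 + (w1 - 1)^2
import numpy as np

alpha, beta, eps = 0.05, 0.9, 1e-8
w = np.array([0.0, 0.0])
v = np.zeros(2)                          # moving average of squared gradients

for _ in range(500):
    g = 2*(w - np.array([3.0, 1.0]))     # gradient of the toy loss
    v = beta*v + (1 - beta)*g**2         # (1) squared-gradient moving average
    w -= alpha / np.sqrt(v + eps) * g    # (2) parameter update

print(w)                                 # settles near the minimizer [3, 1]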

GD with Momentum

Momentum is an optimization technique used to speed up convergence and stabilize gradient descent by smoothing out updates. Instead of updating the parameters directly based on the gradient, momentum incorporates a fraction of the previous update to accelerate the learning process and prevent oscillations.


Why?

  • In steep regions (e.g., along the vertical direction), gradient descent may oscillate and take longer to converge.
  • In flat regions, the update steps can be slow.
  • Momentum helps by "building up velocity" in the right direction, dampening oscillations, and increasing speed in flatter areas.

Momentum Update Rule

(1) Velocity Update:


$$v_t = \beta v_{t-1} + (1-\beta) g_t$$

$\quad$where:

  • $v_t$ is the velocity at step $t$.
  • $\beta$ is the momentum coefficient (typically 0.9).
  • $g_t$ is the gradient of the loss function at step $t$.

(2) Parameter Update:


$$w_t \leftarrow w_{t-1} - \alpha v_t$$

$\quad$where:

  • $\alpha$ is the learning rate.
  • $w_t$ is the updated parameter at step $t$.

The velocity $v_t$ accumulates past gradients, leading to smoother and faster convergence, particularly in regions with high curvature or flat surfaces.


How Momentum Works

  • The velocity term $v_t$ accumulates gradients over time, effectively averaging the direction of the updates.
  • This "memory" of past gradients helps to:
    • Accelerate learning in consistent directions.
    • Reduce oscillations in noisy or steep regions.
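A minimal sketch of gradient descent with momentum on the same toy quadratic loss (loss and hyper-parameters are illustrative):

In [ ]:
# momentum sketch on the toy loss f(w) = (w0 - 3)^2 + (w1 - 1)^2
import numpy as np

alpha, beta = 0.1, 0.9
w = np.array([0.0, 0.0])
v = np.zeros(2)                          # velocity

for _ in range(300):
    g = 2*(w - np.array([3.0, 1.0]))     # gradient of the toy loss
    v = beta*v + (1 - beta)*g            # (1) velocity update
    w -= alpha*v                         # (2) parameter update

print(w)                                 # approaches the minimizer [3, 1]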

Adam (Adaptive Moment Estimation)

The Adam Optimizer is an advanced optimization algorithm that combines the benefits of both Momentum and RMSProp:

  • Momentum helps smooth the updates by incorporating the past gradients.
  • RMSProp adapts the learning rate based on the magnitude of the gradients.

Adam maintains an adaptive learning rate for each parameter by using first-order (mean) and second-order (variance) moments of the gradients.


Adam Update Rule

(1) First Moment (Mean of Gradients):


$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$$

$\quad$where:

  • $m_t$ is the moving average of gradients at step $t$.
  • $\beta_1$ is the decay rate for the first moment (typically 0.9).
  • $g_t$ is the gradient of the loss function at step $t$.

(2) Second Moment (Variance of Gradients):


$$v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$$

$\quad$where:

  • $v_t$ is the moving average of squared gradients at step $t$.
  • $\beta_2$ is the decay rate for the second moment (typically 0.999).

(3) Bias-Correction:


$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$

(4) Parameter Update:


$$w_t \leftarrow w_{t-1} - \frac{\alpha}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t $$

$\quad$where:

  • $\alpha$ is the learning rate.
  • $\epsilon$ is a small constant (e.g., $10^{-8}$) to prevent division by zero.

The Adam optimizer is one of the most widely used optimization algorithms in deep learning due to its ability to combine the strengths of momentum and adaptive learning rates. It adjusts both the direction and magnitude of updates dynamically, making it efficient for complex models with large datasets. However, careful tuning of hyperparameters may still be required to achieve the best performance.
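Putting the four steps together, here is a minimal NumPy sketch of the Adam update on the same toy quadratic loss (loss and hyper-parameters are illustrative assumptions):

In [ ]:
# Adam sketch on the toy loss f(w) = (w0 - 3)^2 + (w1 - 1)^2
import numpy as np

alpha, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8
w = np.array([0.0, 0.0])
m = np.zeros(2)                          # first moment (mean of gradients)
v = np.zeros(2)                          # second moment (variance of gradients)

for t in range(1, 501):
    g = 2*(w - np.array([3.0, 1.0]))     # gradient of the toy loss
    m = beta1*m + (1 - beta1)*g          # (1) first moment update
    v = beta2*v + (1 - beta2)*g**2       # (2) second moment update
    m_hat = m / (1 - beta1**t)           # (3) bias correction
    v_hat = v / (1 - beta2**t)
    w -= alpha / (np.sqrt(v_hat) + eps) * m_hat   # (4) parameter update

print(w)                                 # approaches the minimizer [3, 1]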

Most algorithms related to gradient descent are available in TensorFlow.
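For instance, the hand-coded mini-batch loop above could be rewritten with a built-in optimizer. Below is a sketch that swaps the manual w.assign_sub update for tf.keras.optimizers.Adam (the learning rate and iteration count are illustrative; it reuses train_X, train_Y, and m from the earlier cells):

In [ ]:
# mini-batch training with a built-in TensorFlow optimizer (sketch)
LR = 0.05
n_iter = 1000
n_batch = 50

w = tf.Variable([[0],[0],[0]], dtype = tf.float32)
optimizer = tf.keras.optimizers.Adam(learning_rate = LR)

for i in range(n_iter):
    idx = np.random.choice(m, size = n_batch)
    batch_x = tf.constant(train_X[idx,:], dtype = tf.float32)
    batch_y = tf.constant(train_Y[idx], dtype = tf.float32)

    with tf.GradientTape() as tape:
        y_pred = tf.sigmoid(tf.matmul(batch_x, w))
        loss = tf.reduce_mean(- batch_y*tf.math.log(y_pred)
                              - (1-batch_y)*tf.math.log(1-y_pred))

    w_grad = tape.gradient(loss, w)
    optimizer.apply_gradients([(w_grad, w)])   # Adam update replaces w.assign_sub(LR * w_grad)

print(w.numpy())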




By understanding Gradient Descent, we have learned the foundational algorithm used in most machine learning optimization problems. It is the core of many more advanced optimization techniques used in neural networks and deep learning frameworks.

In [ ]:
%%javascript
$.getScript('https://kmahelona.github.io/ipython_notebook_goodies/ipython_notebook_toc.js')