Stochastic and Mini-batch Gradient Descent


By Prof. Seungchul Lee
http://iailab.kaist.ac.kr/
Industrial AI Lab at KAIST

Table of Contents


0. Lecture Video

In [ ]:
from IPython.display import YouTubeVideo
YouTubeVideo('ALDd7PTU-JY', width = "560", height = "315")
Out[ ]:

1. Gradient Descent Algorithm

We learned Gradient Descent as an optimization technique.


$$\text{Repeat : } x \leftarrow x - \alpha \nabla _x f(x) \quad \quad \text{for some step size } \alpha > 0$$




Gradient Descent is an iterative optimization algorithm used to minimize a loss function by updating the model's parameters in the direction of the negative gradient. It is widely used in machine learning and deep learning to train models.
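For intuition, here is a minimal sketch of this update rule on a simple one-dimensional quadratic (the function, step size, and iteration count are illustrative choices):

In [ ]:
# a minimal gradient descent sketch on f(x) = (x - 3)^2
alpha = 0.1                  # step size
x = 0.0                      # initial guess

for _ in range(100):
    grad = 2*(x - 3)         # gradient of f(x) = (x - 3)^2
    x = x - alpha*grad       # x <- x - alpha * grad f(x)

print(x)                     # approaches the minimizer x = 3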

In this lecture, we will cover the gradient descent algorithm and its variants:

  • Batch Gradient Descent
  • Stochastic Gradient Descent
  • Mini-batch Gradient Descent

We will explore these three gradient descent algorithms using a logistic regression model.


2. Batch Gradient Descent

(= Gradient Descent)

In Batch Gradient Descent, we use the entire training data set for each iteration, so each update requires a sum over all training examples.


$$\mathcal{E} (\omega) = \frac{1}{m} \sum_{i=1}^{m} \ell (\hat y_i, y_i) = \frac{1}{m} \sum_{i=1}^{m} \ell (h_{\omega}(x_i), y_i)$$

By linearity,


$$\nabla_{\omega} \mathcal{E} = \nabla_{\omega} \frac{1}{m} \sum_{i=1}^{m} \ell (h_{\omega}(x_i), y_i)= \frac{1}{m} \sum_{i=1}^{m} \frac{\partial }{\partial \omega}\ell (h_{\omega}(x_i), y_i)$$

$$\omega \leftarrow \omega - \alpha \, \nabla_{\omega} \mathcal{E} $$

Advantages:

  • Stable Convergence: Utilizing the entire dataset leads to a more stable and predictable convergence path toward the minimum of the loss function.

  • Accurate Gradient Estimation: Since all data points are considered, the gradient computation is precise, which can result in more accurate parameter updates.


Disadvantages:

  • Computationally Intensive: Processing the entire dataset in each iteration can be time-consuming and resource-intensive, especially with large datasets.

  • Memory Requirements: Storing and processing large datasets simultaneously may require substantial memory, which can be a limitation for very large datasets.


In [ ]:
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
import time
%matplotlib inline
In [ ]:
# data generation

m = 10000
true_w = np.array([[-6], [2], [1]])
train_X = np.hstack([np.ones([m,1]), 5*np.random.rand(m,1), 4*np.random.rand(m,1)])

true_w = np.asmatrix(true_w)
train_X = np.asmatrix(train_X)

train_Y = 1/(1 + np.exp(-train_X*true_w)) > 0.5

C1 = np.where(train_Y == True)[0]
C0 = np.where(train_Y == False)[0]

train_Y = np.empty([m,1])
train_Y[C1] = 1
train_Y[C0] = 0

plt.figure(figsize = (6, 4))
plt.plot(train_X[C1,1], train_X[C1,2], 'r.', alpha = 0.1, label = 'C1')
plt.plot(train_X[C0,1], train_X[C0,2], 'b.', alpha = 0.1, label = 'C0')
plt.xlabel(r'$x_1$', fontsize = 15)
plt.ylabel(r'$x_2$', fontsize = 15)
plt.legend(loc = 1, fontsize = 12)
plt.axis('equal')
# plt.xlim([0,4])
plt.ylim([0,4])
plt.show()
In [ ]:
LR = 0.05
n_iter = 10000

w = tf.Variable([[0],[0],[0]], dtype = tf.float32)
train_x = tf.constant(train_X, dtype = tf.float32)
train_y = tf.constant(train_Y, dtype = tf.float32)

start_time = time.time()

loss_record = []

for i in range(n_iter):
    with tf.GradientTape() as tape:
        y_pred = tf.sigmoid(tf.matmul(train_x, w))
        loss = - train_y*tf.math.log(y_pred) - (1-train_y)*tf.math.log(1-y_pred)
        loss = tf.reduce_mean(loss)
        w_grad = tape.gradient(loss, w)

    loss_record.append(loss)
    w.assign_sub(LR * w_grad)

training_time = time.time() - start_time

w_hat = w.numpy()
print(w_hat)
[[-9.898782 ]
 [ 3.5175416]
 [ 1.5210328]]
In [ ]:
xp = np.arange(0, 4, 0.01).reshape(-1, 1)
yp = - w_hat[1,0]/w_hat[2,0]*xp - w_hat[0,0]/w_hat[2,0]

plt.figure(figsize = (6, 4))
plt.plot(train_X[C1,1], train_X[C1,2], 'r.', alpha = 0.1, label = 'C1')
plt.plot(train_X[C0,1], train_X[C0,2], 'b.', alpha = 0.1, label = 'C0')
plt.plot(xp, yp, 'g', linewidth = 3, label = 'Logistic Regression')
plt.xlabel(r'$x_1$', fontsize = 15)
plt.ylabel(r'$x_2$', fontsize = 15)
plt.legend(loc = 1, fontsize = 12)
plt.axis('equal')
plt.ylim([0,4])
plt.show()

We will examine two aspects:

  • the duration of the training session, and
  • the noise in the loss function.
In [ ]:
print(training_time)
print("\n")

plt.figure(figsize = (6, 4))
plt.plot(loss_record)
plt.xlabel('iteration', fontsize = 15)
plt.ylabel('loss', fontsize = 15)
plt.show()
43.00573205947876

3. Stochastic Gradient Descent (SGD)

Stochastic Gradient Descent is an extreme case of Batch Gradient Descent: the parameters are updated after every single training example.

Update the parameters based on the gradient for a single randomly selected training data point:


$$f(\omega) = \ell (\hat y_i, y_i) = \ell (h_{\omega}(x_i), y_i) = \ell^{(i)}$$

$$\omega \leftarrow \omega - \alpha \, \frac{\partial \ell^{(i)}}{\partial \omega}$$

The advantages:

  • It adds even more noise to the learning process than mini-batch gradient descent, which can help improve generalization error.

  • This approach allows for more frequent updates and can lead to faster convergence, especially with large datasets.

  • SGD can handle large datasets more efficiently than methods requiring full-dataset computations.


The disadvantages:

  • It uses only one example for each update, but a single example can hardly represent the whole data set. In other words, the variance of the gradient estimate becomes large since we only use one example for each learning step.

  • Since each update is based on a single data point, the optimization path can be noisy, potentially leading to fluctuations in the loss function.

  • Due to its stochastic nature, SGD might miss the optimal solution if not properly managed.


Mathematical justification: if you sample a training example at random, the stochastic gradient is an unbiased estimate of the batch gradient:


$$\mathbb{E} \left[\frac{\partial \ell^{(i)}}{\partial \omega} \right] = \frac{1}{m} \sum_{i=1}^{m} \frac{\partial \ell^{(i)}}{\partial \omega} = \frac{\partial }{\partial \omega} \left[ \frac{1}{m} \sum_{i=1}^{m} \ell^{(i)} \right] = \frac{\partial \mathcal{E}}{\partial \omega}$$
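This unbiasedness is easy to check numerically. Below is a quick sketch using a toy quadratic per-example loss $\ell^{(i)}(\omega) = (\omega - x_i)^2$ (an assumed loss chosen only so the gradients are simple, not the lecture's logistic loss):

In [ ]:
# numerical check: a uniformly sampled stochastic gradient is an unbiased
# estimate of the batch gradient (toy quadratic loss, illustrative only)
import numpy as np

np.random.seed(0)
x = np.random.randn(1000)            # toy data points
w = 0.5                              # current parameter value

per_example_grads = 2*(w - x)        # gradient of (w - x_i)^2 for each i

batch_grad = per_example_grads.mean()
sgd_estimate = np.mean([per_example_grads[np.random.choice(len(x))]
                        for _ in range(100000)])

print(batch_grad, sgd_estimate)      # the two averages should nearly agree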

Below is a graph that shows the gradient descent variants and their directions toward the minimum:



As we can see in the figure, the SGD direction is very noisy compared to the others.


In [ ]:
LR = 0.05
n_iter = 10000

w = tf.Variable([[0],[0],[0]], dtype = tf.float32)
train_x = tf.constant(train_X, dtype = tf.float32)
train_y = tf.constant(train_Y, dtype = tf.float32)

start_time = time.time()

loss_record = []

for i in range(n_iter):
    with tf.GradientTape() as tape:
        idx = np.random.choice(m)
        batch_x = tf.expand_dims(train_x[idx,:], axis = 0)
        batch_y = tf.expand_dims(train_y[idx], axis = 0)

        y_pred = tf.sigmoid(tf.matmul(batch_x, w))
        loss = - batch_y*tf.math.log(y_pred) - (1-batch_y)*tf.math.log(1-y_pred)
        loss = tf.reduce_mean(loss)
        w_grad = tape.gradient(loss, w)

    loss_record.append(loss)
    w.assign_sub(LR * w_grad)

training_time = time.time() - start_time

w_hat = w.numpy()
print(w_hat)
[[-9.925287 ]
 [ 3.5804236]
 [ 1.5582784]]
In [ ]:
xp = np.arange(0, 4, 0.01).reshape(-1, 1)
yp = - w_hat[1,0]/w_hat[2,0]*xp - w_hat[0,0]/w_hat[2,0]

plt.figure(figsize = (6, 4))
plt.plot(train_X[C1,1], train_X[C1,2], 'r.', alpha = 0.1, label = 'C1')
plt.plot(train_X[C0,1], train_X[C0,2], 'b.', alpha = 0.1, label = 'C0')
plt.plot(xp, yp, 'g', linewidth = 3, label = 'Logistic Regression')
plt.xlabel(r'$x_1$', fontsize = 15)
plt.ylabel(r'$x_2$', fontsize = 15)
plt.legend(loc = 1, fontsize = 12)
plt.axis('equal')
plt.ylim([0,4])
plt.show()
In [ ]:
print(training_time)
print("\n")

plt.figure(figsize = (6, 4))
plt.plot(loss_record)
plt.xlabel('iteration', fontsize = 15)
plt.ylabel('loss', fontsize = 15)
plt.show()

4. Mini-batch Gradient Descent

The Mini-batch Gradient Descent method is often used in machine learning and deep learning training. The main idea is similar to batch gradient descent; however, unlike batch gradient descent, we can choose the batch size $s$, and the parameters are updated based on a mini-batch at each iteration. This method works because we assume the examples in the data set are positively correlated; in other words, a large data set contains many similar examples.


$$\mathcal{E} (\omega) = \frac{1}{s} \sum_{i=1}^{s} \ell (\hat y_i, y_i) = \frac{1}{s} \sum_{i=1}^{s} \ell (h_{\omega}(x_i), y_i) = \frac{1}{s} \sum_{i=1}^{s} \ell^{(i)}$$

$$\omega \leftarrow \omega - \alpha \, \nabla_{\omega} \mathcal{E} $$

Gradients computed on larger mini-batches have smaller variance:


$$\text{var} \left[ \frac{1}{s} \sum_{i=1}^{s} \frac{\partial \ell^{(i)}}{\partial \omega} \right] = \frac{1}{s^2} \text{var} \left[ \sum_{i=1}^{s} \frac{\partial \ell^{(i)}}{\partial \omega} \right] = \frac{1}{s} \text{var} \left[ \frac{\partial \ell^{(i)}}{\partial \omega} \right]$$

The mini-batch size $s$ is a hyper-parameter that needs to be set.

This approach balances the efficiency and stability of Batch Gradient Descent with the faster convergence of Stochastic Gradient Descent.
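This $1/s$ variance reduction can be verified empirically. The sketch below reuses the toy quadratic per-example gradients from the unbiasedness check above (again an illustrative assumption, not the lecture's logistic loss):

In [ ]:
# empirical check: the variance of a mini-batch gradient shrinks like 1/s
import numpy as np

np.random.seed(0)
g = 2*(0.5 - np.random.randn(100000))        # toy per-example gradients

for s in [1, 10, 100]:
    # draw many mini-batch gradients of size s and measure their variance
    estimates = [g[np.random.choice(len(g), size = s)].mean()
                 for _ in range(2000)]
    print(s, np.var(estimates))              # roughly var, var/10, var/100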


The main advantages of MGD:

  • Faster than batch gradient descent, since each update goes through far fewer examples than batch gradient descent (which uses all examples).
  • Since we randomly choose the mini-batch examples, we avoid redundant and very similar examples that contribute little to the learning.
  • With batch size $<$ size of the training set, it adds noise to the learning process, which helps improve generalization error.




Choosing the Mini-Batch Size:

  • Selecting an appropriate mini-batch size is crucial. Smaller batch sizes can introduce noise, potentially aiding in escaping local minima, while larger batch sizes provide more accurate gradient estimates but may require more computational resources.

In [ ]:
LR = 0.05
n_iter = 10000
n_batch = 50

w = tf.Variable([[0],[0],[0]], dtype = tf.float32)

start_time = time.time()

loss_record = []

for i in range(n_iter):
    with tf.GradientTape() as tape:
        idx = np.random.choice(m, size = n_batch)
        batch_x = tf.constant(train_X[idx,:], dtype = tf.float32)
        batch_y = tf.constant(train_Y[idx], dtype = tf.float32)

        y_pred = tf.sigmoid(tf.matmul(batch_x, w))
        loss = - batch_y*tf.math.log(y_pred) - (1-batch_y)*tf.math.log(1-y_pred)
        loss = tf.reduce_mean(loss)
        w_grad = tape.gradient(loss, w)

    loss_record.append(loss)
    w.assign_sub(LR * w_grad)

training_time = time.time() - start_time

w_hat = w.numpy()
print(w_hat)
[[-9.897612 ]
 [ 3.5361586]
 [ 1.5278765]]
In [ ]:
xp = np.arange(0, 4, 0.01).reshape(-1, 1)
yp = - w_hat[1,0]/w_hat[2,0]*xp - w_hat[0,0]/w_hat[2,0]

plt.figure(figsize = (6, 4))
plt.plot(train_X[C1,1], train_X[C1,2], 'r.', alpha = 0.1, label = 'C1')
plt.plot(train_X[C0,1], train_X[C0,2], 'b.', alpha = 0.1, label = 'C0')
plt.plot(xp, yp, 'g', linewidth = 3, label = 'Logistic Regression')
plt.xlabel(r'$x_1$', fontsize = 15)
plt.ylabel(r'$x_2$', fontsize = 15)
plt.legend(loc = 1, fontsize = 12)
plt.axis('equal')
plt.ylim([0,4])
plt.show()
In [ ]:
print(training_time)

plt.figure(figsize = (6, 4))
plt.plot(loss_record)
plt.xlabel('iteration', fontsize = 15)
plt.ylabel('loss', fontsize = 15)
plt.show()

5. Summary

  • Batch Gradient Descent:

    • Updates parameters using the entire dataset.
    • Stable but slow for large datasets.
  • Stochastic Gradient Descent (SGD):

    • Updates parameters using a single data point at a time.
    • Faster but introduces noise in updates.
  • Mini-Batch Gradient Descent:

    • Updates parameters using small random subsets of the dataset.
    • Balances speed and stability.


  • There is no guarantee that the behavior shown below will always occur. However, the noisy SGD gradients can sometimes help in escaping local optima.




6. Challenges of Gradient Descent

In [ ]:
from IPython.display import YouTubeVideo
YouTubeVideo('ALDd7PTU-JY', width = "560", height = "315", start = 1768)
Out[ ]:

6.1. Setting the Learning Rate


$$\omega_{k+1} = \omega_k - \alpha \, \nabla f(\omega_k)$$
  • A small learning rate converges slowly and can get stuck in false local minima

  • A large learning rate overshoots, becomes unstable, and diverges


  • Idea 1

    • Try lots of different learning rates and see what works “just right”
  • Idea 2

    • Do something smarter! Design an adaptive learning rate that “adapts” to the landscape
    • Temporal and spatial

GD Learning Rate: Spatial

The gradient descent method makes a strong assumption about the magnitude of the “local curvature” to fix the step size, and about its isotropy so that the same step size makes sense in all directions.

  • We typically assign the same learning rate to all features


  • The loss landscape in high-dimensional space may be steeper in some directions and flatter in others.

  • A spatially adaptive learning rate assigns larger steps in flat directions and smaller steps in steep directions to ensure faster yet stable convergence.


GD Learning Rate: Temporal

The temporal learning rate in Gradient Descent (GD) refers to the idea of adjusting the learning rate over time during the optimization process. Instead of using a constant learning rate throughout training, a time-dependent schedule is used to either decrease or adaptively adjust the learning rate based on the number of iterations or training progress.


Time-Decay Learning Rate:

  • The learning rate decreases over time as the model approaches the optimal solution.

  • Early in training, a higher learning rate is used to make faster progress.

  • Later in training, the learning rate is reduced to fine-tune the solution and prevent overshooting.


Temporal Learning Rate Schedule:

  • The learning rate is updated based on the epoch or iteration.

  • This prevents the model from taking large steps near the optimum, where small adjustments are needed.
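As a concrete sketch, one common time-decay schedule is $\alpha_t = \alpha_0 / (1 + k t)$ (this particular $1/t$ form and its constants are assumptions; step, exponential, and cosine decays are equally standard choices):

In [ ]:
# sketch of a 1/t time-decay learning rate schedule (illustrative constants)
alpha0 = 0.5          # initial learning rate
k = 0.01              # decay-rate hyper-parameter

for t in range(0, 1001, 200):
    alpha_t = alpha0 / (1 + k*t)     # learning rate shrinks as training progresses
    print(t, alpha_t)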


6.2. Adaptive Learning Rate Methods

Adaptive Learning Rate Methods improve the standard gradient descent by adjusting the learning rate dynamically for each parameter during training. These methods aim to speed up convergence and improve stability by using information from past gradients.

Why Adaptive Learning Rates?

  • In standard gradient descent, the learning rate $\alpha$ is fixed.
  • A small learning rate leads to slow convergence, while a large learning rate can cause divergence.
  • Adaptive methods adjust the learning rate for each parameter based on the magnitude of past gradients, allowing for more efficient learning.

SGD


$$\omega_{t+1,i} = \omega_{t,i} - \alpha \cdot g_{t,i} $$

Adagrad (Adaptive Gradient Algorithm)

  • Deep learning generally relies on a smarter use of the gradient, using statistics over its past values to make a “smarter step” with the current one.

  • Scales the learning rate inversely proportional to the sum of past squared gradients.


$$\omega_{t+1,i} = \omega_{t,i} - \frac{\alpha}{\sqrt{G_{t,ii} + \epsilon}} \cdot g_{t,i} $$
  • $G_{t,ii}$ is the sum of the squares of the past gradients with respect to $\omega_i$ up to step $t$

  • Perform smaller updates (i.e., low learning rates) for parameters associated with frequently occurring features, and

  • Perform larger updates (i.e., high learning rates) for parameters associated with infrequent features

  • Larger steps are taken in less frequently updated directions.
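A minimal NumPy sketch of this update (the two-dimensional quadratic loss and the hyper-parameters below are illustrative assumptions, not the lecture's model):

In [ ]:
# Adagrad sketch on the toy loss f(w) = (w0 - 3)^2 + (w1 - 1)^2
import numpy as np

alpha, eps = 0.5, 1e-8
w = np.array([0.0, 0.0])
G = np.zeros(2)                          # running sum of squared gradients

for _ in range(500):
    g = 2*(w - np.array([3.0, 1.0]))     # gradient of the toy loss
    G += g**2                            # accumulate squared gradients
    w -= alpha / np.sqrt(G + eps) * g    # per-parameter scaled update

print(w)                                 # approaches the minimizer [3, 1]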

RMSProp (Root Mean Square Propagation)

RMSProp is an adaptive learning rate optimization algorithm designed to improve the convergence of Stochastic Gradient Descent (SGD) by adjusting the learning rate based on the average magnitude of recent gradients. It helps to stabilize learning by preventing excessively large or small parameter updates, particularly in cases where the loss surface is non-stationary.

  • Instead of using a constant learning rate, RMSProp adapts the learning rate for each parameter by normalizing the gradient by the square root of the moving average of its squared gradients.

  • This prevents large parameter updates when the gradient is large and allows faster updates in flat regions where the gradient is small.


The RMSProp algorithm maintains a moving average of the squared gradients:

(1) Squared Gradient Moving Average:


$$v_t = \beta v_{t-1} + (1 - \beta) g_t^2$$

$\quad$where:

  • $v_t$ is the moving average of squared gradients at time $t$.
  • $\beta$ is the decay rate (typically set to 0.9).
  • $g_t$ is the gradient of the loss function at time $t$.

(2) Parameter Update:


$$\omega_t \leftarrow \omega_{t-1} - \frac{\alpha}{\sqrt{v_t + \epsilon}} g_t$$

$\quad$where:

  • $\alpha$ is the learning rate.
  • $\epsilon$ (a small value like $10^{-8}$) is added to avoid division by zero.
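A matching NumPy sketch of the RMSProp update on the same toy quadratic loss (loss and hyper-parameters are again illustrative assumptions):

In [ ]:
# RMSProp sketch on the toy loss f(w) = (w0 - 3)^2 + (w1 - 1)^2
import numpy as np

alpha, beta, eps = 0.05, 0.9, 1e-8
w = np.array([0.0, 0.0])
v = np.zeros(2)                          # moving average of squared gradients

for _ in range(500):
    g = 2*(w - np.array([3.0, 1.0]))     # gradient of the toy loss
    v = beta*v + (1 - beta)*g**2         # (1) squared-gradient moving average
    w -= alpha / np.sqrt(v + eps) * g    # (2) parameter update

print(w)                                 # settles near the minimizer [3, 1]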

GD with Momentum

Momentum is an optimization technique used to speed up convergence and stabilize gradient descent by smoothing out updates. Instead of updating the parameters directly based on the gradient, momentum incorporates a fraction of the previous update to accelerate the learning process and prevent oscillations.


Why?

  • In steep regions (e.g., along the vertical direction), gradient descent may oscillate and take longer to converge.
  • In flat regions, the update steps can be slow.
  • Momentum helps by "building up velocity" in the right direction, dampening oscillations, and increasing speed in flatter areas.

Momentum Update Rule

(1) Velocity Update:


$$v_t = \beta v_{t-1} + (1-\beta) g_t$$

$\quad$where:

  • $v_t$ is the velocity at step $t$.
  • $\beta$ is the momentum coefficient (typically 0.9).
  • $g_t$ is the gradient of the loss function at step $t$.

(2) Parameter Update:


$$w_t \leftarrow w_{t-1} - \alpha v_t$$

$\quad$where:

  • $\alpha$ is the learning rate.
  • $w_t$ is the updated parameter at step $t$.

The velocity $v_t$ accumulates past gradients, leading to smoother and faster convergence, particularly in regions with high curvature or flat surfaces.


How Momentum Works

  • The velocity term $v_t$ accumulates gradients over time, effectively averaging the direction of the updates.
  • This "memory" of past gradients helps to:
    • Accelerate learning in consistent directions.
    • Reduce oscillations in noisy or steep regions.
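A minimal sketch of gradient descent with momentum on the same toy quadratic loss (loss and hyper-parameters are illustrative):

In [ ]:
# momentum sketch on the toy loss f(w) = (w0 - 3)^2 + (w1 - 1)^2
import numpy as np

alpha, beta = 0.1, 0.9
w = np.array([0.0, 0.0])
v = np.zeros(2)                          # velocity

for _ in range(300):
    g = 2*(w - np.array([3.0, 1.0]))     # gradient of the toy loss
    v = beta*v + (1 - beta)*g            # (1) velocity update
    w -= alpha*v                         # (2) parameter update

print(w)                                 # approaches the minimizer [3, 1]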

Adam (Adaptive Moment Estimation)

The Adam Optimizer is an advanced optimization algorithm that combines the benefits of both Momentum and RMSProp:

  • Momentum helps smooth the updates by incorporating the past gradients.
  • RMSProp adapts the learning rate based on the magnitude of the gradients.

Adam maintains an adaptive learning rate for each parameter by using first-order (mean) and second-order (variance) moments of the gradients.


Adam Update Rule

(1) First Moment (Mean of Gradients):


$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$$

$\quad$where:

  • $m_t$ is the moving average of gradients at step $t$.
  • $\beta_1$ is the decay rate for the first moment (typically 0.9).
  • $g_t$ is the gradient of the loss function at step $t$.

(2) Second Moment (Variance of Gradients):


$$v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$$

$\quad$where:

  • $v_t$ is the moving average of squared gradients at step $t$.
  • $\beta_2$ is the decay rate for the second moment (typically 0.999).

(3) Bias-Correction:


$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$

(4) Parameter Update:


$$w_t \leftarrow w_{t-1} - \frac{\alpha}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t $$

$\quad$where:

  • $\alpha$ is the learning rate.
  • $\epsilon$ is a small constant (e.g., $10^{-8}$) to prevent division by zero.

The Adam optimizer is one of the most widely used optimization algorithms in deep learning due to its ability to combine the strengths of momentum and adaptive learning rates. It adjusts both the direction and magnitude of updates dynamically, making it efficient for complex models with large datasets. However, careful tuning of hyperparameters may still be required to achieve the best performance.
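Putting the four steps together, here is a minimal NumPy sketch of the Adam update on the same toy quadratic loss (loss and hyper-parameters are illustrative assumptions):

In [ ]:
# Adam sketch on the toy loss f(w) = (w0 - 3)^2 + (w1 - 1)^2
import numpy as np

alpha, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8
w = np.array([0.0, 0.0])
m = np.zeros(2)                          # first moment (mean of gradients)
v = np.zeros(2)                          # second moment (variance of gradients)

for t in range(1, 501):
    g = 2*(w - np.array([3.0, 1.0]))     # gradient of the toy loss
    m = beta1*m + (1 - beta1)*g          # (1) first moment update
    v = beta2*v + (1 - beta2)*g**2       # (2) second moment update
    m_hat = m / (1 - beta1**t)           # (3) bias correction
    v_hat = v / (1 - beta2**t)
    w -= alpha / (np.sqrt(v_hat) + eps) * m_hat   # (4) parameter update

print(w)                                 # approaches the minimizer [3, 1]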

Most algorithms related to gradient descent are available in TensorFlow.
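For instance, the hand-coded mini-batch loop above could be rewritten with a built-in optimizer. Below is a sketch that swaps the manual w.assign_sub update for tf.keras.optimizers.Adam (the learning rate and iteration count are illustrative; it reuses train_X, train_Y, and m from the earlier cells):

In [ ]:
# mini-batch training with a built-in TensorFlow optimizer (sketch)
LR = 0.05
n_iter = 1000
n_batch = 50

w = tf.Variable([[0],[0],[0]], dtype = tf.float32)
optimizer = tf.keras.optimizers.Adam(learning_rate = LR)

for i in range(n_iter):
    idx = np.random.choice(m, size = n_batch)
    batch_x = tf.constant(train_X[idx,:], dtype = tf.float32)
    batch_y = tf.constant(train_Y[idx], dtype = tf.float32)

    with tf.GradientTape() as tape:
        y_pred = tf.sigmoid(tf.matmul(batch_x, w))
        loss = tf.reduce_mean(- batch_y*tf.math.log(y_pred)
                              - (1-batch_y)*tf.math.log(1-y_pred))

    w_grad = tape.gradient(loss, w)
    optimizer.apply_gradients([(w_grad, w)])   # Adam update replaces w.assign_sub(LR * w_grad)

print(w.numpy())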




By understanding Gradient Descent, we have learned the foundational algorithm used in most machine learning optimization problems. It is the core of many more advanced optimization techniques used in neural networks and deep learning frameworks.

In [ ]:
%%javascript
$.getScript('https://kmahelona.github.io/ipython_notebook_goodies/ipython_notebook_toc.js')