Stochastic and Mini-batch Gradient Descent


By Prof. Seungchul Lee
http://iailab.kaist.ac.kr/
Industrial AI Lab at KAIST

1. Gradient Descent Algorithm

  • Gradient Descent


$$\text{Repeat : } x \leftarrow x - \alpha \nabla _x f(x) \quad \quad \text{for some step size } \alpha > 0$$
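
As a minimal sketch of the update rule above, the following applies it to a hypothetical one-dimensional objective $f(x) = (x-3)^2$. The function name `gradient_descent`, the objective, and the step size are illustrative assumptions, not part of the lecture.

```python
def gradient_descent(grad, x0, alpha=0.1, n_iter=100):
    """Repeat x <- x - alpha * grad(x) for a fixed number of iterations."""
    x = x0
    for _ in range(n_iter):
        x = x - alpha * grad(x)
    return x

# Hypothetical objective f(x) = (x - 3)^2, whose gradient is 2 * (x - 3)
x_min = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
print(x_min)  # very close to the minimizer x = 3
```

With $\alpha = 0.1$ the distance to the minimizer shrinks by a factor of 0.8 per step, so 100 iterations are more than enough here.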




In this lecture, we will cover the gradient descent algorithm and its variants:

  • Batch Gradient Descent
  • Stochastic Gradient Descent
  • Mini-batch Gradient Descent

We will explore the concept of these three gradient descent algorithms with a logistic regression model.

2. Batch Gradient Descent

(= Gradient Descent)

In batch gradient descent, we use the entire training data set in every iteration, so each update requires a sum over all $m$ examples.


$$\mathcal{E} (\omega) = \frac{1}{m} \sum_{i=1}^{m} \ell (\hat y_i, y_i) = \frac{1}{m} \sum_{i=1}^{m} \ell (h_{\omega}(x_i), y_i)$$


By linearity,


$$\nabla_{\omega} \mathcal{E} = \nabla_{\omega} \frac{1}{m} \sum_{i=1}^{m} \ell (h_{\omega}(x_i), y_i)= \frac{1}{m} \sum_{i=1}^{m} \frac{\partial }{\partial \omega}\ell (h_{\omega}(x_i), y_i)$$



$$\omega \leftarrow \omega - \alpha \, \nabla_{\omega} \mathcal{E} $$
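
The batch gradient can also be written out by hand. The sketch below assumes, as in the code later in this lecture, that $h_{\omega}$ is the sigmoid of a linear model and $\ell$ is the binary cross-entropy, in which case $\nabla_{\omega} \mathcal{E} = \frac{1}{m} X^T (h_{\omega}(X) - y)$. The helper names and the small synthetic data set are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mean_loss(w, X, y):
    """Mean binary cross-entropy over all m examples."""
    p = sigmoid(X @ w)
    return float(np.mean(-y * np.log(p) - (1 - y) * np.log(1 - p)))

def batch_gradient(w, X, y):
    """Gradient of the mean loss: average of the per-example gradients."""
    return X.T @ (sigmoid(X @ w) - y) / X.shape[0]

# Hypothetical toy data: bias column plus two features in [0, 1)
rng = np.random.default_rng(0)
X = np.hstack([np.ones((200, 1)), rng.random((200, 2))])
y = (X @ np.array([[-1.0], [1.0], [1.0]]) > 0).astype(float)

w0 = np.zeros((3, 1))
w1 = w0 - 0.5 * batch_gradient(w0, X, y)   # one batch update with alpha = 0.5
print(mean_loss(w0, X, y), mean_loss(w1, X, y))
```

A single update already lowers the loss from its starting value of $\log 2$, at the cost of touching every example once.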


The main advantages:

  • We can use a fixed learning rate during training without worrying about learning rate decay.
  • It follows a straight trajectory towards the minimum.
  • With a suitably small learning rate, it is guaranteed in theory to converge to the global minimum if the loss function is convex, and to a local minimum (more precisely, a stationary point) if the loss function is not convex.
  • It gives an unbiased estimate of the gradient. As the number of examples increases, the standard error decreases.

Although it is a safe and accurate method, it is computationally very inefficient.

The main disadvantages:

  • When the data set is large, this method may be slow to converge.
  • Each learning step happens only after going over all examples, some of which may be redundant and don't contribute much to the update.
In [1]:
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
import time
%matplotlib inline
In [2]:
# data generation

m = 10000
true_w = np.array([[-6], [2], [1]])
# columns: bias term, x1 in [0, 5), x2 in [0, 4)
train_X = np.hstack([np.ones([m,1]), 5*np.random.rand(m,1), 4*np.random.rand(m,1)])

# label an example 1 when the true logistic model gives probability > 0.5
train_Y = (1/(1 + np.exp(-train_X @ true_w)) > 0.5).astype(float)

# row indices of each class, used for plotting
C1 = np.where(train_Y == 1)[0]
C0 = np.where(train_Y == 0)[0]

plt.figure(figsize = (6, 4))
plt.plot(train_X[C1,1], train_X[C1,2], 'r.', alpha = 0.1, label = 'C1')
plt.plot(train_X[C0,1], train_X[C0,2], 'b.', alpha = 0.1, label = 'C0')
plt.xlabel(r'$x_1$', fontsize = 15)
plt.ylabel(r'$x_2$', fontsize = 15)
plt.legend(loc = 1, fontsize = 12)
plt.axis('equal')
# plt.xlim([0,4])
plt.ylim([0,4])
plt.show()
In [3]:
LR = 0.05
n_iter = 10000

w = tf.Variable([[0],[0],[0]], dtype = tf.float32)
train_x = tf.constant(train_X, dtype = tf.float32)
train_y = tf.constant(train_Y, dtype = tf.float32)

start_time = time.time()

loss_record = []
for i in range(n_iter):
    with tf.GradientTape() as tape:
        # forward pass over all m examples
        y_pred = tf.sigmoid(tf.matmul(train_x, w))
        # mean binary cross-entropy loss
        loss = - train_y*tf.math.log(y_pred) - (1-train_y)*tf.math.log(1-y_pred)
        loss = tf.reduce_mean(loss)

    # compute the gradient after leaving the tape context
    w_grad = tape.gradient(loss, w)

    loss_record.append(loss.numpy())
    w.assign_sub(LR * w_grad)

training_time = time.time() - start_time

w_hat = w.numpy()
print(w_hat)
[[-9.898782 ]
 [ 3.5175416]
 [ 1.5210328]]
In [4]:
xp = np.arange(0, 4, 0.01).reshape(-1, 1)
yp = - w_hat[1,0]/w_hat[2,0]*xp - w_hat[0,0]/w_hat[2,0]

plt.figure(figsize = (6, 4))
plt.plot(train_X[C1,1], train_X[C1,2], 'r.', alpha = 0.1, label = 'C1')
plt.plot(train_X[C0,1], train_X[C0,2], 'b.', alpha = 0.1, label = 'C0')
plt.plot(xp, yp, 'g', linewidth = 3, label = 'Logistic Regression')
plt.xlabel(r'$x_1$', fontsize = 15)
plt.ylabel(r'$x_2$', fontsize = 15)
plt.legend(loc = 1, fontsize = 12)
plt.axis('equal')
plt.ylim([0,4])
plt.show()

We will check two pieces of information:

  • the time spent on the training session, and
  • the noise in the loss function.
In [5]:
print(training_time)

plt.figure(figsize = (6, 4))
plt.plot(loss_record)
plt.xlabel('iteration', fontsize = 15)
plt.ylabel('loss', fontsize = 15)
plt.show()
43.00573205947876

3. Stochastic Gradient Descent (SGD)

Stochastic Gradient Descent is the extreme case of Mini-batch Gradient Descent in which the batch size is 1, so learning happens on every example. In practice it is less common than the mini-batch gradient descent method.

Update the parameters based on the gradient for a single training example:


$$f(\omega) = \ell (\hat y_i, y_i) = \ell (h_{\omega}(x_i), y_i) = \ell^{(i)}$$



$$\omega \leftarrow \omega - \alpha \, \frac{\partial \ell^{(i)}}{\partial \omega}$$

The advantages:

  • It adds even more noise to the learning process than mini-batch gradient descent, which can help reduce generalization error.

The disadvantages:

  • It uses only one example for each update, and a single example can hardly represent the whole data set. In other words, the variance of the gradient estimate is large because only one example is used in each learning step.
  • Due to the noise, the learning steps oscillate more.
  • It becomes very slow, since we can't exploit vectorization over a single example.

Mathematical justification: if you sample a training example at random, the stochastic gradient is an unbiased estimate of the batch gradient:


$$\mathbb{E} \left[\frac{\partial \ell^{(i)}}{\partial \omega} \right] = \frac{1}{m} \sum_{i=1}^{m} \frac{\partial \ell^{(i)}}{\partial \omega} = \frac{\partial }{\partial \omega} \left[ \frac{1}{m} \sum_{i=1}^{m} \ell^{(i)} \right] = \frac{\partial \mathcal{E}}{\partial \omega}$$
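
The unbiasedness claim can be checked numerically. The sketch below assumes a logistic regression model with cross-entropy loss and hypothetical random data. It compares the batch gradient with a Monte Carlo average of gradients of single examples drawn uniformly at random.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical data and an arbitrary parameter vector
rng = np.random.default_rng(0)
m = 500
X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, 2))])
y = (rng.random((m, 1)) < 0.5).astype(float)
w = rng.normal(size=(3, 1))

# Row i holds the gradient of the loss on example i, shape (m, 3)
per_example = (sigmoid(X @ w) - y) * X
batch_grad = per_example.mean(axis=0)

# Average the stochastic gradient over many uniformly sampled examples
idx = rng.integers(0, m, size=200_000)
mc_estimate = per_example[idx].mean(axis=0)

print(batch_grad)
print(mc_estimate)
```

The two printed vectors agree up to sampling error, even though any single stochastic gradient can point in a very different direction from the batch gradient.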


Below is a graph that shows the gradient descent's variants and their direction towards the minimum:




As the figure shows, the SGD direction is much noisier than the others.

In [6]:
LR = 0.05
n_iter = 10000

w = tf.Variable([[0],[0],[0]], dtype = tf.float32)
train_x = tf.constant(train_X, dtype = tf.float32)
train_y = tf.constant(train_Y, dtype = tf.float32)

start_time = time.time()

loss_record = []
for i in range(n_iter):
    # pick one training example uniformly at random
    idx = np.random.choice(m)
    batch_x = tf.expand_dims(train_x[idx,:], axis = 0)
    batch_y = tf.expand_dims(train_y[idx], axis = 0)

    with tf.GradientTape() as tape:
        y_pred = tf.sigmoid(tf.matmul(batch_x, w))
        loss = - batch_y*tf.math.log(y_pred) - (1-batch_y)*tf.math.log(1-y_pred)
        loss = tf.reduce_mean(loss)

    w_grad = tape.gradient(loss, w)

    loss_record.append(loss.numpy())
    w.assign_sub(LR * w_grad)

# record the training time of this SGD run
training_time = time.time() - start_time

w_hat = w.numpy()
print(w_hat)
[[-9.925287 ]
 [ 3.5804236]
 [ 1.5582784]]
In [7]:
xp = np.arange(0, 4, 0.01).reshape(-1, 1)
yp = - w_hat[1,0]/w_hat[2,0]*xp - w_hat[0,0]/w_hat[2,0]

plt.figure(figsize = (6, 4))
plt.plot(train_X[C1,1], train_X[C1,2], 'r.', alpha = 0.1, label = 'C1')
plt.plot(train_X[C0,1], train_X[C0,2], 'b.', alpha = 0.1, label = 'C0')
plt.plot(xp, yp, 'g', linewidth = 3, label = 'Logistic Regression')
plt.xlabel(r'$x_1$', fontsize = 15)
plt.ylabel(r'$x_2$', fontsize = 15)
plt.legend(loc = 1, fontsize = 12)
plt.axis('equal')
plt.ylim([0,4])
plt.show()
In [8]:
print(training_time)

plt.figure(figsize = (6, 4))
plt.plot(loss_record)
plt.xlabel('iteration', fontsize = 15)
plt.ylabel('loss', fontsize = 15)
plt.show()

4. Mini-batch Gradient Descent

Mini-batch gradient descent is often used for training in machine learning and deep learning. The main idea is similar to batch gradient descent, except that we can choose the batch size $s$. Instead of going over all $m$ examples, mini-batch gradient descent sums over a smaller number of examples given by the batch size $s \;(< m)$, so the parameters are updated from one mini-batch in each iteration. This works because the examples in a data set are assumed to be positively correlated: a large data set contains many similar examples, so a small random subset already gives a useful estimate of the full gradient.


$$\mathcal{E} (\omega) = \frac{1}{s} \sum_{i=1}^{s} \ell (\hat y_i, y_i) = \frac{1}{s} \sum_{i=1}^{s} \ell (h_{\omega}(x_i), y_i) = \frac{1}{s} \sum_{i=1}^{s} \ell^{(i)}$$



$$\omega \leftarrow \omega - \alpha \, \nabla_{\omega} \mathcal{E} $$


Stochastic gradients computed on larger mini-batches have smaller variance. Assuming the $s$ examples are sampled independently:


$$\text{var} \left[ \frac{1}{s} \sum_{i=1}^{s} \frac{\partial \ell^{(i)}}{\partial \omega} \right] = \frac{1}{s^2} \text{var} \left[ \sum_{i=1}^{s} \frac{\partial \ell^{(i)}}{\partial \omega} \right] = \frac{1}{s} \text{var} \left[ \frac{\partial \ell^{(i)}}{\partial \omega} \right]$$
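
The $1/s$ scaling can be observed empirically. In the sketch below, the array `g` stands in for per-example gradients of a single scalar parameter. Its values, and the helper name `minibatch_mean_variance`, are hypothetical, and mini-batches are drawn independently with replacement so that the independence assumption holds.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical per-example gradient values for one scalar parameter
g = rng.normal(loc=1.0, scale=2.0, size=10_000)

def minibatch_mean_variance(g, s, n_trials=20_000):
    """Empirical variance of the mean gradient over random mini-batches of size s."""
    idx = rng.integers(0, g.size, size=(n_trials, s))  # i.i.d. sampling with replacement
    return g[idx].mean(axis=1).var()

single_var = g.var()
results = {s: minibatch_mean_variance(g, s) for s in (1, 10, 100)}
for s, v in results.items():
    # empirical variance next to the predicted value var / s
    print(s, v, single_var / s)
```

Each tenfold increase in $s$ cuts the variance by roughly a factor of ten, while also making every update ten times more expensive.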


The mini-batch size $s$ is a hyper-parameter that needs to be set.

The main advantages of mini-batch gradient descent:

  • It is faster than batch gradient descent, since each update goes through far fewer examples than batch gradient descent (which uses all examples).
  • Since we choose the mini-batch examples randomly, we avoid spending every update on redundant or very similar examples that don't contribute much to the learning.
  • With a batch size smaller than the size of the training set, it adds noise to the learning process, which can help reduce generalization error.

The main disadvantages of mini-batch gradient descent:

  • With a fixed learning rate, it does not converge exactly. On each iteration, the learning step may go back and forth due to the noise, so it wanders around the minimum region without settling.
  • Due to the noise, the learning steps oscillate more.





Note that if the batch size is equal to the number of training examples, mini-batch gradient descent is the same as batch gradient descent.
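
One common way to form mini-batches is to shuffle the data once per pass and slice it into consecutive chunks. The sketch below illustrates that scheme with a hypothetical helper `minibatches`. Note that it differs from the code in this lecture, which draws each mini-batch with `np.random.choice`.

```python
import numpy as np

def minibatches(X, y, s, rng):
    """Yield shuffled mini-batches of size s (the last one may be smaller)."""
    m = X.shape[0]
    perm = rng.permutation(m)
    for start in range(0, m, s):
        idx = perm[start:start + s]
        yield X[idx], y[idx]

rng = np.random.default_rng(0)
X = np.arange(20.0).reshape(10, 2)   # 10 hypothetical examples, 2 features
y = np.arange(10.0).reshape(10, 1)

sizes = [len(xb) for xb, _ in minibatches(X, y, s=4, rng=rng)]
print(sizes)        # [4, 4, 2]

full = list(minibatches(X, y, s=10, rng=rng))
print(len(full))    # 1: with s = m there is a single batch holding every example
```

Setting `s` equal to the number of examples yields exactly one batch per pass, which recovers batch gradient descent.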

In [9]:
LR = 0.05
n_iter = 10000
n_batch = 50

w = tf.Variable([[0],[0],[0]], dtype = tf.float32)

start_time = time.time()

loss_record = []
for i in range(n_iter):
    # draw a random mini-batch of n_batch examples (with replacement)
    idx = np.random.choice(m, size = n_batch)
    batch_x = tf.constant(train_X[idx,:], dtype = tf.float32)
    batch_y = tf.constant(train_Y[idx], dtype = tf.float32)

    with tf.GradientTape() as tape:
        y_pred = tf.sigmoid(tf.matmul(batch_x, w))
        loss = - batch_y*tf.math.log(y_pred) - (1-batch_y)*tf.math.log(1-y_pred)
        loss = tf.reduce_mean(loss)

    w_grad = tape.gradient(loss, w)

    loss_record.append(loss.numpy())
    w.assign_sub(LR * w_grad)

# record the training time of this mini-batch run
training_time = time.time() - start_time

w_hat = w.numpy()
print(w_hat)
[[-9.897612 ]
 [ 3.5361586]
 [ 1.5278765]]
In [10]:
xp = np.arange(0, 4, 0.01).reshape(-1, 1)
yp = - w_hat[1,0]/w_hat[2,0]*xp - w_hat[0,0]/w_hat[2,0]

plt.figure(figsize = (6, 4))
plt.plot(train_X[C1,1], train_X[C1,2], 'r.', alpha = 0.1, label = 'C1')
plt.plot(train_X[C0,1], train_X[C0,2], 'b.', alpha = 0.1, label = 'C0')
plt.plot(xp, yp, 'g', linewidth = 3, label = 'Logistic Regression')
plt.xlabel(r'$x_1$', fontsize = 15)
plt.ylabel(r'$x_2$', fontsize = 15)
plt.legend(loc = 1, fontsize = 12)
plt.axis('equal')
plt.ylim([0,4])
plt.show()
In [11]:
print(training_time)

plt.figure(figsize = (6, 4))
plt.plot(loss_record)
plt.xlabel('iteration', fontsize = 15)
plt.ylabel('loss', fontsize = 15)
plt.show()

5. Summary





  • There is no guarantee that the behavior shown below always happens, but the noisy SGD gradients can sometimes help escape local optima.



6. Limitations of Gradient Descent

6.1. Setting the Learning Rate


$$\omega_{k+1} = \omega_k - \alpha \, \nabla f(\omega_k)$$


  • A small learning rate converges slowly and can get stuck in false local minima

  • A large learning rate overshoots, becomes unstable, and diverges

  • Idea 1

    • Try lots of different learning rates and see what works "just right"
  • Idea 2

    • Do something smarter! Design an adaptive learning rate that "adapts" to the landscape
    • Temporal and spatial
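
The effect of the learning rate is easy to see on a toy problem. The sketch below runs the update on the hypothetical objective $f(\omega) = \omega^2$ with three step sizes chosen for illustration only.

```python
def run_gd(alpha, w0=1.0, n_iter=50):
    """Run n_iter gradient descent steps on f(w) = w^2 and return the final w."""
    w = w0
    for _ in range(n_iter):
        w = w - alpha * 2.0 * w   # the gradient of w^2 is 2w
    return w

for alpha in (0.001, 0.1, 1.1):
    # too small: barely moves; moderate: converges; too large: diverges
    print(alpha, run_gd(alpha))
```

Each step multiplies $\omega$ by $(1 - 2\alpha)$, so the iterates shrink only when $0 < \alpha < 1$ and blow up once $|1 - 2\alpha| > 1$.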

SGD Learning Rate: Spatial

The gradient descent method makes a strong assumption about the magnitude of the "local curvature" to fix the step size, and about its isotropy so that the same step size makes sense in all directions.

  • We assign the same learning rate to all features
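
A small sketch of why a single step size is a problem when the curvature differs across directions. It uses a hypothetical quadratic $f(\omega) = \frac{1}{2}(\omega_1^2 + 100\,\omega_2^2)$. The curvature values and step sizes are assumptions chosen for illustration.

```python
import numpy as np

# Hypothetical anisotropic objective: f(w) = 0.5 * sum(curvature * w**2)
curvature = np.array([1.0, 100.0])

def run_gd(alpha, n_iter=100):
    """Plain gradient descent with one shared step size for both coordinates."""
    w = np.array([1.0, 1.0])
    for _ in range(n_iter):
        w = w - alpha * curvature * w   # gradient is curvature * w
    return w

w_stable = run_gd(alpha=0.019)    # just below the stability limit 2 / 100
w_unstable = run_gd(alpha=0.021)  # just above it
print(w_stable)
print(w_unstable)
```

The steep direction forces $\alpha < 0.02$, and with such a small step the flat direction is still far from its minimum after 100 iterations. A slightly larger step makes the steep direction diverge.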





SGD Learning Rate: Temporal

  • Typical strategy:
    • Use a large learning rate early in training so you can get close to the optimum
    • Gradually decay the learning rate to reduce the fluctuations
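
The benefit of decay can be sketched on a noisy one-dimensional problem. The setup below is an assumption for illustration: the objective is $\frac{1}{2}\omega^2$, the stochastic gradient is $\omega$ plus unit Gaussian noise, and the schedule $\alpha_t = 0.5 / (1 + 0.01\,t)$ is one arbitrary choice among many.

```python
import numpy as np

def sgd_tail_spread(schedule, n_iter=5000, seed=0):
    """Run noisy SGD and return the std of w over the last 1000 iterations."""
    rng = np.random.default_rng(seed)
    w, tail = 5.0, []
    for t in range(n_iter):
        grad = w - rng.normal()          # noisy gradient of 0.5 * w^2
        w -= schedule(t) * grad
        if t >= n_iter - 1000:
            tail.append(w)
    return float(np.std(tail))

constant_spread = sgd_tail_spread(lambda t: 0.5)
decayed_spread = sgd_tail_spread(lambda t: 0.5 / (1.0 + 0.01 * t))
print(constant_spread, decayed_spread)
```

Both runs start with the same large step, but only the decayed schedule settles: its late iterates fluctuate far less around the minimum than those of the constant schedule.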

6.2. Adaptive Learning Rate Methods

  • SGD


$$\omega_{t+1,i} = \omega_{t,i} - \alpha \cdot g_{t,i} $$


  • Adagrad

    • Deep learning generally relies on a smarter use of the gradient, using statistics over its past values to make a "smarter step" with the current one.


      $$\omega_{t+1,i} = \omega_{t,i} - \frac{\alpha}{\sqrt{G_{t,ii} + \epsilon}} \cdot g_{t,i} $$

    • $G_{t,ii}$ is the sum of the squares of the past gradients with respect to $\omega_i$, up to step $t$

    • Perform smaller updates (i.e., low learning rates) for parameters associated with frequently occurring features, and
    • Perform larger updates (i.e., high learning rates) for parameters associated with infrequent features
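
The Adagrad update above can be sketched in a few lines of NumPy. The function name `adagrad_step`, the hyper-parameter values, and the test objective $f(\omega) = \frac{1}{2}(\omega_1^2 + 100\,\omega_2^2)$ are assumptions for illustration. The array `G` stores only the diagonal entries $G_{t,ii}$.

```python
import numpy as np

def adagrad_step(w, G, grad, alpha=0.1, eps=1e-8):
    """One Adagrad update; G holds the running sum of squared gradients per parameter."""
    G = G + grad ** 2
    w = w - alpha / np.sqrt(G + eps) * grad
    return w, G

# Hypothetical anisotropic objective: f(w) = 0.5 * sum(curvature * w**2)
curvature = np.array([1.0, 100.0])
w = np.array([1.0, 1.0])
G = np.zeros_like(w)

history = [w.copy()]
for _ in range(200):
    grad = curvature * w
    w, G = adagrad_step(w, G, grad)
    history.append(w.copy())

print(history[1])    # after one step both coordinates have moved by about alpha
print(history[-1])
```

Because each coordinate is divided by its own gradient history, the two coordinates follow nearly identical paths even though their gradients differ by a factor of 100. With a single shared step size, the steep coordinate would dictate the step for both.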



