Stochastic and Mini-batch Gradient Descent


By Prof. Seungchul Lee
http://iai.postech.ac.kr/
Industrial AI Lab at POSTECH

1. Gradient Descent Algorithm

  • Gradient Descent
$$\text{Repeat : } x \leftarrow x - \alpha \nabla _x f(x) \quad \quad \text{for some step size } \alpha > 0$$
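As a minimal sketch of the update rule above, the loop below applies $x \leftarrow x - \alpha \nabla_x f(x)$ to a simple quadratic. The function name `gradient_descent`, the test function $f(x) = \lVert x - c \rVert^2$, and the values of `c`, `alpha`, and `n_iter` are assumptions chosen for illustration, not part of the lecture.

```python
import numpy as np

def gradient_descent(grad_f, x0, alpha=0.1, n_iter=100):
    """Repeat x <- x - alpha * grad_f(x) for a fixed number of iterations."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iter):
        x = x - alpha * grad_f(x)
    return x

# Hypothetical test function: f(x) = ||x - c||^2, whose gradient is 2 (x - c)
c = np.array([3.0, -1.0])
grad_f = lambda x: 2.0 * (x - c)

x_min = gradient_descent(grad_f, x0=np.zeros(2), alpha=0.1, n_iter=100)
print(x_min)  # should be very close to c
```

For this particular quadratic, each step shrinks the distance to `c` by a constant factor, so a fixed step size is enough.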



In this lecture, we will cover the gradient descent algorithm and its variants:

  • Batch Gradient Descent
  • Stochastic Gradient Descent
  • Mini-batch Gradient Descent

We will illustrate these three gradient descent algorithms with a logistic regression model.

2. Batch Gradient Descent

(= Gradient Descent)

In Batch Gradient Descent, we use the entire training data set in each iteration. So, for each update, we have to sum over all examples.


$$\mathcal{E} (\omega) = \frac{1}{m} \sum_{i=1}^{m} \ell (\hat y_i, y_i) = \frac{1}{m} \sum_{i=1}^{m} \ell (h_{\omega}(x_i), y_i)$$

By linearity,


$$\nabla_{\omega} \mathcal{E} = \nabla_{\omega} \frac{1}{m} \sum_{i=1}^{m} \ell (h_{\omega}(x_i), y_i)= \frac{1}{m} \sum_{i=1}^{m} \frac{\partial }{\partial \omega}\ell (h_{\omega}(x_i), y_i)$$


$$\omega \leftarrow \omega - \alpha \, \nabla_{\omega} \mathcal{E} $$
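The batch update can be sketched in plain NumPy, independently of the TensorFlow code used in this lecture. The sketch assumes the logistic (sigmoid cross-entropy) loss, for which the averaged gradient is $\frac{1}{m} X^T (\sigma(X\omega) - y)$. The helper names `sigmoid`, `batch_gradient`, and `mean_loss`, the smaller data set size, and the iteration count are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def batch_gradient(w, X, y):
    """Gradient of the mean cross-entropy loss, averaged over all m examples."""
    m = X.shape[0]
    return X.T @ (sigmoid(X @ w) - y) / m

def mean_loss(w, X, y):
    """Mean cross-entropy loss, written in a numerically stable form."""
    z = X @ w
    return np.mean(np.maximum(z, 0) - z * y + np.log1p(np.exp(-np.abs(z))))

# Synthetic data in the same spirit as the lecture's example (assumed sizes)
rng = np.random.default_rng(0)
m = 1000
X = np.hstack([np.ones((m, 1)), 5 * rng.random((m, 1)), 4 * rng.random((m, 1))])
true_w = np.array([[-6.0], [2.0], [1.0]])
y = (X @ true_w > 0).astype(float)

w = np.zeros((3, 1))
alpha = 0.04
loss_before = mean_loss(w, X, y)
for _ in range(2000):
    # every update uses all m examples
    w = w - alpha * batch_gradient(w, X, y)
loss_after = mean_loss(w, X, y)

print(loss_before, loss_after)
```

Each iteration touches all `m` rows of `X`, which is exactly the per-update cost that the stochastic and mini-batch variants try to avoid.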

The main advantages:

  • We can use a fixed learning rate during training without worrying about learning rate decay.
  • It follows a smooth, fairly direct trajectory towards the minimum.
  • In theory, with a suitably small learning rate, it is guaranteed to converge to the global minimum if the loss function is convex, and to a local minimum (more precisely, a stationary point) if the loss function is not convex.
  • It gives an unbiased estimate of the gradient. As the number of examples increases, the standard error decreases.
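The point about the standard error can be checked numerically. The sketch below draws random subsets of $n$ examples (with replacement), averages their gradients, and measures how much the average fluctuates across trials. The logistic loss, the evaluation point `w = 0`, the helper name `gradient_std`, and the subset sizes are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 10000
X = np.column_stack([np.ones(m), 5 * rng.random(m), 4 * rng.random(m)])
y = (X @ np.array([-6.0, 2.0, 1.0]) > 0).astype(float)
w = np.zeros(3)  # arbitrary point at which gradients are evaluated

# Per-example gradients of the logistic loss at w, one row per example
residual = 1.0 / (1.0 + np.exp(-(X @ w))) - y
g = residual[:, None] * X

def gradient_std(n, n_trials=2000):
    """Empirical std of the gradient averaged over n randomly drawn examples."""
    idx = rng.integers(m, size=(n_trials, n))
    estimates = g[idx].mean(axis=1)  # shape (n_trials, 3)
    return estimates.std(axis=0)

std_by_n = {n: gradient_std(n) for n in (1, 10, 100)}
for n, s in std_by_n.items():
    print(n, s)
```

The spread should shrink roughly like $1/\sqrt{n}$, which is why averaging over more examples gives a more reliable gradient direction.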

Although it is a safe and accurate method, it is computationally very inefficient.

The main disadvantages:

  • When the data set is large, this method may be slow to converge.
  • Each learning step happens only after going over all examples, some of which may be redundant and contribute little to the update.
In [1]:
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
import time
%matplotlib inline
In [2]:
# data generation

m = 10000
true_w = np.array([[-6], [2], [1]])
train_X = np.hstack([np.ones([m,1]), 5*np.random.rand(m,1), 4*np.random.rand(m,1)])

true_w = np.asmatrix(true_w)
train_X = np.asmatrix(train_X)

train_y = 1/(1 + np.exp(-train_X*true_w)) > 0.5 

C1 = np.where(train_y == True)[0]
C0 = np.where(train_y == False)[0]

train_y = np.empty([m,1])
train_y[C1] = 1
train_y[C0] = 0

plt.figure(figsize = (10,8))
plt.plot(train_X[C1,1], train_X[C1,2], 'r.', alpha = 0.3, label = 'C1')
plt.plot(train_X[C0,1], train_X[C0,2], 'b.', alpha = 0.3, label = 'C0')
plt.xlabel(r'$x_1$', fontsize = 15)
plt.ylabel(r'$x_2$', fontsize = 15)
plt.legend(loc = 1, fontsize = 12)
plt.axis('equal')
# plt.xlim([0,4])
plt.ylim([0,4])
plt.show()
In [3]:
LR = 0.04        # learning rate
n_iter = 60000   # number of gradient descent updates
n_prt = 250      # record the loss every n_prt updates

x = tf.placeholder(tf.float32, [m, 3])
y = tf.placeholder(tf.float32, [m, 1])

w = tf.Variable([[0],[0],[0]], dtype = tf.float32)

y_pred = tf.matmul(x,w)
loss = tf.nn.sigmoid_cross_entropy_with_logits(logits=y_pred, labels=y)
loss = tf.reduce_mean(loss)

optm = tf.train.GradientDescentOptimizer(LR).minimize(loss)
init = tf.global_variables_initializer()

start_time = time.time()

loss_record = []
with tf.Session() as sess:
    sess.run(init)
    for epoch in range(n_iter):                                                                         
        sess.run(optm, feed_dict = {x: train_X, y: train_y})
        
        if (epoch + 1) % n_prt == 0:
            loss_record.append(sess.run(loss, feed_dict = {x: train_X, y: train_y}))
    
    w_hat = sess.run(w)

training_time = time.time() - start_time
    
xp = np.arange(0, 4, 0.01).reshape(-1, 1)
yp = - w_hat[1,0]/w_hat[2,0]*xp - w_hat[0,0]/w_hat[2,0]

plt.figure(figsize=(10,8))
plt.plot(train_X[C1,1], train_X[C1,2], 'r.', alpha = 0.3, label='C1')
plt.plot(train_X[C0,1], train_X[C0,2], 'b.', alpha = 0.3, label='C0')
plt.plot(xp, yp, 'g', linewidth = 4)
plt.xlabel(r'$x_1$', fontsize = 15)
plt.ylabel(r'$x_2$', fontsize = 15)
plt.legend(loc = 1, fontsize = 12)
plt.axis('equal')
# plt.xlim([0,4])
plt.ylim([0,4])
plt.show()

We will check two pieces of information:

  • the time spent in the training session, and
  • the noise in the loss function.
In [4]:
print(training_time)

plt.figure(figsize=(10,8))
plt.plot(range(1, n_iter+1, n_prt), loss_record)
plt.xlabel('Iteration', fontsize = 15)
plt.ylabel('Loss', fontsize = 15)
plt.show()
26.01719880104065

3. Stochastic Gradient Descent (SGD)

Stochastic Gradient Descent is the extreme case of Mini-batch Gradient Descent in which the batch size is one, so learning happens on every example. In practice, it is less common than the mini-batch gradient descent method.

Update the parameters based on the gradient for a single training example:


$$f(\omega) = \ell (\hat y_i, y_i) = \ell (h_{\omega}(x_i), y_i) = \ell^{(i)}$$


$$\omega \leftarrow \omega - \alpha \, \frac{\partial \ell^{(i)}}{\partial \omega}$$
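A single-example update can be sketched in NumPy as follows. The sketch assumes the logistic loss, for which $\frac{\partial \ell^{(i)}}{\partial \omega} = (\sigma(x_i^T \omega) - y_i)\, x_i$. The function name `sgd_step`, the data set size, and the number of updates are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_step(w, x_i, y_i, alpha):
    """One SGD update from a single example (x_i: shape (3,), y_i: scalar)."""
    grad_i = (sigmoid(x_i @ w) - y_i) * x_i
    return w - alpha * grad_i

# Synthetic data in the same spirit as the lecture's example (assumed sizes)
rng = np.random.default_rng(0)
m = 1000
X = np.column_stack([np.ones(m), 5 * rng.random(m), 4 * rng.random(m)])
y = (X @ np.array([-6.0, 2.0, 1.0]) > 0).astype(float)

w = np.zeros(3)
for _ in range(20000):
    i = rng.integers(m)  # pick one training example uniformly at random
    w = sgd_step(w, X[i], y[i], alpha=0.04)

accuracy = np.mean((X @ w > 0) == (y == 1))
print(w, accuracy)
```

Each update costs only one row of `X`, but the direction of any individual step depends heavily on which example happened to be drawn.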

The advantages:

  • It adds even more noise to the learning process than mini-batch gradient descent, which can help reduce generalization error.

The disadvantages:

  • It uses only one example for each update, and a single example can hardly represent the whole data set. In other words, the variance of the gradient estimate becomes large.
  • Due to the noise, the learning steps oscillate more.
  • It becomes very slow, since we cannot exploit vectorization over a single example.

Mathematical justification: if you sample a training example uniformly at random, the stochastic gradient is an unbiased estimate of the batch gradient:


$$\mathbb{E} \left[\frac{\partial \ell^{(i)}}{\partial \omega} \right] = \frac{1}{m} \sum_{i=1}^{m} \frac{\partial \ell^{(i)}}{\partial \omega} = \frac{\partial }{\partial \omega} \left[ \frac{1}{m} \sum_{i=1}^{m} \ell^{(i)} \right] = \frac{\partial \mathcal{E}}{\partial \omega}$$
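The identity above can be verified numerically. The sketch below compares the batch gradient with the average of all per-example gradients, which is the exact expectation under uniform sampling of $i$, and with a Monte Carlo average over randomly sampled indices. The logistic loss, the random evaluation point `w`, and the sample sizes are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
m = 500
X = np.column_stack([np.ones(m), 5 * rng.random(m), 4 * rng.random(m)])
y = (X @ np.array([-6.0, 2.0, 1.0]) > 0).astype(float)
w = rng.normal(size=3)  # arbitrary point at which to compare gradients

# Per-example gradients, one row per example: (sigmoid(x_i . w) - y_i) * x_i
per_example_grads = (sigmoid(X @ w) - y)[:, None] * X

# Batch gradient of the mean loss
batch_grad = X.T @ (sigmoid(X @ w) - y) / m

# Exact expectation of the stochastic gradient under uniform sampling of i
expected_sgd_grad = per_example_grads.mean(axis=0)

# Monte Carlo estimate of the same expectation from sampled indices
idx = rng.integers(m, size=200000)
mc_estimate = per_example_grads[idx].mean(axis=0)

print(batch_grad)
print(expected_sgd_grad)
print(mc_estimate)
```

The first two vectors agree up to floating-point error, while the Monte Carlo estimate only approaches them as the number of sampled indices grows.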

The graph below shows the gradient descent variants and their directions towards the minimum:



As we can see in the figure, the SGD direction is much noisier than that of the other variants.

In [5]:
LR = 0.04        # learning rate
n_iter = 60000   # number of single-example updates
n_prt = 250      # record the loss every n_prt updates

x = tf.placeholder(tf.float32, [1, 3])
y = tf.placeholder(tf.float32, [1, 1])

w = tf.Variable(tf.random_normal([3,1]), dtype = tf.float32)

y_pred = tf.matmul(x,w)
loss = tf.nn.sigmoid_cross_entropy_with_logits(logits=y_pred, labels=y)
loss = tf.reduce_mean(loss)

optm = tf.train.GradientDescentOptimizer(LR).minimize(loss)
init = tf.global_variables_initializer()

start_time = time.time()

loss_record = []
with tf.Session() as sess:
    sess.run(init)
    for epoch in range(n_iter):      
        idx = np.random.choice(m, 1)
        batch_X = train_X[idx,:]
        batch_y = train_y[idx]
        sess.run(optm, feed_dict = {x: batch_X, y: batch_y})   
        
        if (epoch + 1) % n_prt == 0:
            # loss on the current single example only, so the recorded curve is noisy
            loss_record.append(sess.run(loss, feed_dict = {x: batch_X, y: batch_y}))
    
    w_hat = sess.run(w)

training_time = time.time() - start_time
    
xp = np.arange(0, 4, 0.01).reshape(-1, 1)
yp = - w_hat[1,0]/w_hat[2,0]*xp - w_hat[0,0]/w_hat[2,0]

plt.figure(figsize=(10,8))
plt.plot(train_X[C1,1], train_X[C1,2], 'r.', alpha = 0.3, label='C1')
plt.plot(train_X[C0,1], train_X[C0,2], 'b.', alpha = 0.3, label='C0')
plt.plot(xp, yp, 'g', linewidth = 4)
plt.xlabel(r'$x_1$', fontsize = 15)
plt.ylabel(r'$x_2$', fontsize = 15)
plt.legend(loc = 1, fontsize = 12)
plt.axis('equal')
# plt.xlim([0,4])
plt.ylim([0,4])
plt.show()