Stochastic and Mini-batch Gradient Descent
Table of Contents
In this lecture, we will cover the gradient descent algorithm and its variants: Batch Gradient Descent, Stochastic Gradient Descent, and Mini-batch Gradient Descent.
We will explore the concept of these three gradient descent algorithms with a logistic regression model.
Batch Gradient Descent (= Gradient Descent)
In Batch Gradient Descent, we use the entire training data set at each iteration, so every update requires summing over all $m$ examples.
By linearity, the gradient of the total loss is the average of the per-example gradients,

$$\frac{\partial \mathcal{L}}{\partial \omega} = \frac{1}{m}\sum_{i=1}^{m}\frac{\partial \ell^{(i)}}{\partial \omega},$$

so each batch update is

$$\omega \leftarrow \omega - \alpha \, \frac{1}{m}\sum_{i=1}^{m}\frac{\partial \ell^{(i)}}{\partial \omega}$$
The main advantages: the gradient is computed exactly from all of the data, so every step follows the true steepest-descent direction and convergence is stable.

Even though it is a safe and accurate method, it is very inefficient in terms of computation.

The main disadvantages: every single update requires a full pass over all $m$ examples, which is slow and expensive for large data sets.
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
import time
%matplotlib inline
# data generation: m points with a bias term and two random features,
# labeled by the true decision boundary defined by true_w
m = 10000
true_w = np.array([[-6], [2], [1]])
train_X = np.hstack([np.ones([m,1]), 5*np.random.rand(m,1), 4*np.random.rand(m,1)])
true_w = np.asmatrix(true_w)
train_X = np.asmatrix(train_X)

# a point belongs to class 1 when its true sigmoid probability exceeds 0.5
train_Y = 1/(1 + np.exp(-train_X*true_w)) > 0.5

C1 = np.where(train_Y == True)[0]
C0 = np.where(train_Y == False)[0]
train_Y = np.empty([m,1])
train_Y[C1] = 1
train_Y[C0] = 0
plt.figure(figsize = (6, 4))
plt.plot(train_X[C1,1], train_X[C1,2], 'r.', alpha = 0.1, label = 'C1')
plt.plot(train_X[C0,1], train_X[C0,2], 'b.', alpha = 0.1, label = 'C0')
plt.xlabel(r'$x_1$', fontsize = 15)
plt.ylabel(r'$x_2$', fontsize = 15)
plt.legend(loc = 1, fontsize = 12)
plt.axis('equal')
# plt.xlim([0,4])
plt.ylim([0,4])
plt.show()
LR = 0.05
n_iter = 10000

w = tf.Variable([[0],[0],[0]], dtype = tf.float32)

train_x = tf.constant(train_X, dtype = tf.float32)
train_y = tf.constant(train_Y, dtype = tf.float32)

start_time = time.time()

# batch gradient descent: use all m examples for every update
loss_record = []
for i in range(n_iter):
    with tf.GradientTape() as tape:
        y_pred = tf.sigmoid(tf.matmul(train_x, w))
        loss = - train_y*tf.math.log(y_pred) - (1-train_y)*tf.math.log(1-y_pred)
        loss = tf.reduce_mean(loss)

    w_grad = tape.gradient(loss, w)
    loss_record.append(loss.numpy())
    w.assign_sub(LR * w_grad)
training_time = time.time() - start_time
w_hat = w.numpy()
print(w_hat)
# decision boundary: w0 + w1*x1 + w2*x2 = 0  =>  x2 = -(w1/w2)*x1 - w0/w2
xp = np.arange(0, 4, 0.01).reshape(-1, 1)
yp = - w_hat[1,0]/w_hat[2,0]*xp - w_hat[0,0]/w_hat[2,0]
plt.figure(figsize = (6, 4))
plt.plot(train_X[C1,1], train_X[C1,2], 'r.', alpha = 0.1, label = 'C1')
plt.plot(train_X[C0,1], train_X[C0,2], 'b.', alpha = 0.1, label = 'C0')
plt.plot(xp, yp, 'g', linewidth = 3, label = 'Logistic Regression')
plt.xlabel(r'$x_1$', fontsize = 15)
plt.ylabel(r'$x_2$', fontsize = 15)
plt.legend(loc = 1, fontsize = 12)
plt.axis('equal')
plt.ylim([0,4])
plt.show()
We will check two pieces of information: the training time and how the loss decreases over iterations.
print(training_time)
plt.figure(figsize = (6, 4))
plt.plot(loss_record)
plt.xlabel('iteration', fontsize = 15)
plt.ylabel('loss', fontsize = 15)
plt.show()
Stochastic Gradient Descent (SGD) is the extreme case of Mini-batch Gradient Descent in which learning happens on every single example, i.e., the batch size is 1. It is less common in practice than the mini-batch method.
Update the parameters based on the gradient for a single training example:
$$\omega \leftarrow \omega - \alpha \, \frac{\partial \ell^{(i)}}{\partial \omega}$$
The advantages: each update is extremely cheap, progress begins immediately without a full pass over the data, and the gradient noise can help escape shallow local minima.

The disadvantages: the gradient estimate is very noisy, so the loss fluctuates from iteration to iteration, and single-example updates cannot exploit vectorized computation.
Mathematical justification: if you sample a training example uniformly at random, the stochastic gradient is an unbiased estimate of the batch gradient:

$$\mathbb{E}\left[\frac{\partial \ell^{(i)}}{\partial \omega}\right] = \frac{1}{m}\sum_{i=1}^{m}\frac{\partial \ell^{(i)}}{\partial \omega} = \frac{\partial \mathcal{L}}{\partial \omega}$$
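We can check this numerically. Below is a minimal NumPy sketch (reusing train_X, train_Y, and m from the data-generation cell; the weight vector w0 is an arbitrary test point chosen only for illustration) showing that the average of the per-example gradients of the logistic loss equals the batch gradient, and that a random sample of examples approximates it:

# per-example gradient of the logistic loss at example i: (sigmoid(x_i w) - y_i) * x_i
X = np.asarray(train_X)                         # (m, 3)
Y = np.asarray(train_Y)                         # (m, 1)
w0 = np.array([[-1.0], [0.5], [0.2]])           # arbitrary test point (illustration only)

p = 1/(1 + np.exp(-X @ w0))                     # predicted probabilities, (m, 1)
per_example_grads = (p - Y) * X                 # (m, 3): row i is the gradient for example i

batch_grad = per_example_grads.mean(axis = 0)   # full-batch gradient
sample_grad = per_example_grads[np.random.choice(m, 2000)].mean(axis = 0)

print(batch_grad)                               # the two should be close
print(sample_grad)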
Below is a graph that shows the gradient descent variants and their directions toward the minimum:

As we can see in the figure, the SGD direction is very noisy compared to the others.
LR = 0.05
n_iter = 10000

w = tf.Variable([[0],[0],[0]], dtype = tf.float32)

train_x = tf.constant(train_X, dtype = tf.float32)
train_y = tf.constant(train_Y, dtype = tf.float32)

start_time = time.time()

loss_record = []
for i in range(n_iter):
    # stochastic gradient descent: one randomly chosen example per update
    idx = np.random.choice(m)
    batch_x = tf.expand_dims(train_x[idx,:], axis = 0)
    batch_y = tf.expand_dims(train_y[idx], axis = 0)

    with tf.GradientTape() as tape:
        y_pred = tf.sigmoid(tf.matmul(batch_x, w))
        loss = - batch_y*tf.math.log(y_pred) - (1-batch_y)*tf.math.log(1-y_pred)
        loss = tf.reduce_mean(loss)

    w_grad = tape.gradient(loss, w)
    loss_record.append(loss.numpy())
    w.assign_sub(LR * w_grad)

training_time = time.time() - start_time
w_hat = w.numpy()
print(w_hat)
xp = np.arange(0, 4, 0.01).reshape(-1, 1)
yp = - w_hat[1,0]/w_hat[2,0]*xp - w_hat[0,0]/w_hat[2,0]
plt.figure(figsize = (6, 4))
plt.plot(train_X[C1,1], train_X[C1,2], 'r.', alpha = 0.1, label = 'C1')
plt.plot(train_X[C0,1], train_X[C0,2], 'b.', alpha = 0.1, label = 'C0')
plt.plot(xp, yp, 'g', linewidth = 3, label = 'Logistic Regression')
plt.xlabel(r'$x_1$', fontsize = 15)
plt.ylabel(r'$x_2$', fontsize = 15)
plt.legend(loc = 1, fontsize = 12)
plt.axis('equal')
plt.ylim([0,4])
plt.show()
print(training_time)
plt.figure(figsize = (6, 4))
plt.plot(loss_record)
plt.xlabel('iteration', fontsize = 15)
plt.ylabel('loss', fontsize = 15)
plt.show()
The Mini-batch Gradient Descent method is the one most often used in machine learning and deep learning training. The main idea is similar to batch gradient descent, but instead of going over all $m$ examples, each update sums over a smaller number of examples given by the batch size $s \;(< m)$, so the parameters are updated from one mini-batch per iteration. This works because the examples in a large data set are highly redundant (positively correlated): there are many similar examples, so a small sample already carries most of the gradient information.
Stochastic gradients computed on larger mini-batches have smaller variance: if the $s$ examples in a mini-batch are sampled independently, then

$$\operatorname{Var}\left[\frac{1}{s}\sum_{i=1}^{s}\frac{\partial \ell^{(i)}}{\partial \omega}\right] = \frac{1}{s}\operatorname{Var}\left[\frac{\partial \ell^{(i)}}{\partial \omega}\right],$$

as the short numerical check below illustrates.
The mini-batch size $s$ is a hyper-parameter that needs to be set.
The main advantages of (mini-batch) SGD: each update is far cheaper than a full batch update, the gradient noise is much smaller than with single-example SGD, and the computation over a mini-batch vectorizes well on modern hardware.

The main disadvantages of (mini-batch) SGD: the gradient is still only a noisy estimate, so the loss fluctuates, and the batch size $s$ must be tuned together with the learning rate.

Note that if the batch size is equal to the number of training examples, mini-batch gradient descent is the same as batch gradient descent.
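We can check the effect of the batch size on the gradient noise directly. Below is a minimal NumPy sketch (reusing train_X, train_Y, m, and the per-example gradient expression from the check above; the weight vector w0 and the batch sizes are arbitrary choices for illustration) that estimates the variance of the mini-batch gradient for several batch sizes and shows that it shrinks roughly as $1/s$:

# empirical variance of the mini-batch gradient for several batch sizes
X = np.asarray(train_X)
Y = np.asarray(train_Y)
w0 = np.array([[-1.0], [0.5], [0.2]])            # arbitrary test point (illustration only)

def minibatch_grad(s):
    idx = np.random.choice(m, size = s)
    p = 1/(1 + np.exp(-X[idx] @ w0))
    return ((p - Y[idx]) * X[idx]).mean(axis = 0)

for s in [1, 10, 100, 1000]:
    grads = np.array([minibatch_grad(s) for _ in range(500)])
    print(s, grads.var(axis = 0).sum())          # total variance shrinks roughly as 1/s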
LR = 0.05
n_iter = 10000
n_batch = 50

w = tf.Variable([[0],[0],[0]], dtype = tf.float32)

start_time = time.time()

loss_record = []
for i in range(n_iter):
    # mini-batch gradient descent: n_batch randomly chosen examples per update
    idx = np.random.choice(m, size = n_batch)
    batch_x = tf.constant(train_X[idx,:], dtype = tf.float32)
    batch_y = tf.constant(train_Y[idx], dtype = tf.float32)

    with tf.GradientTape() as tape:
        y_pred = tf.sigmoid(tf.matmul(batch_x, w))
        loss = - batch_y*tf.math.log(y_pred) - (1-batch_y)*tf.math.log(1-y_pred)
        loss = tf.reduce_mean(loss)

    w_grad = tape.gradient(loss, w)
    loss_record.append(loss.numpy())
    w.assign_sub(LR * w_grad)

training_time = time.time() - start_time
w_hat = w.numpy()
print(w_hat)
xp = np.arange(0, 4, 0.01).reshape(-1, 1)
yp = - w_hat[1,0]/w_hat[2,0]*xp - w_hat[0,0]/w_hat[2,0]
plt.figure(figsize = (6, 4))
plt.plot(train_X[C1,1], train_X[C1,2], 'r.', alpha = 0.1, label = 'C1')
plt.plot(train_X[C0,1], train_X[C0,2], 'b.', alpha = 0.1, label = 'C0')
plt.plot(xp, yp, 'g', linewidth = 3, label = 'Logistic Regression')
plt.xlabel(r'$x_1$', fontsize = 15)
plt.ylabel(r'$x_2$', fontsize = 15)
plt.legend(loc = 1, fontsize = 12)
plt.axis('equal')
plt.ylim([0,4])
plt.show()
print(training_time)
plt.figure(figsize = (6, 4))
plt.plot(loss_record)
plt.xlabel('iteration', fontsize = 15)
plt.ylabel('loss', fontsize = 15)
plt.show()
A small learning rate converges slowly and can get stuck in false local minima.

A large learning rate overshoots, becomes unstable, and diverges.
Idea 1: try many different learning rates and see which one works well.

Idea 2: design an adaptive learning rate that adapts to the loss landscape.
SGD Learning Rate: Spatial
The gradient descent method makes a strong assumption about the magnitude of the "local curvature" to fix the step size, and about its isotropy, so that the same step size makes sense in all directions.
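To see why this matters, consider gradient descent on a quadratic whose curvature differs strongly between the two directions. The sketch below is only an illustration (the function $f(\omega) = \frac{1}{2}(\omega_1^2 + 25\,\omega_2^2)$, the step size, and the iteration count are arbitrary choices, not from the original lecture): any single step size small enough to stay stable along the steep direction makes very slow progress along the shallow one.

# gradient descent on f(w) = 0.5*(w1**2 + 25*w2**2); the gradient is [w1, 25*w2]
w = np.array([1.0, 1.0])
alpha = 0.07        # must be below 2/25 = 0.08 to stay stable in the steep direction

for i in range(30):
    grad = np.array([w[0], 25*w[1]])
    w = w - alpha*grad

print(w)            # w2 is essentially zero, but w1 still keeps about 10% of its initial value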
SGD Learning Rate: Temporal
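A common temporal strategy is to decay the learning rate as training progresses, so that early steps are large and later steps are small. Below is a minimal sketch of SGD with a simple step-decay schedule on the same problem (the initial learning rate, decay factor, and decay interval are illustrative assumptions, not values from the original lecture); it reuses train_x, train_y, and m from the cells above:

# SGD with a step-decay learning rate schedule
LR0 = 0.1                # initial learning rate (illustrative)
decay = 0.5              # multiply the learning rate by this factor ...
decay_every = 2000       # ... every decay_every iterations

w = tf.Variable([[0],[0],[0]], dtype = tf.float32)

for i in range(10000):
    LR = LR0 * decay**(i // decay_every)

    idx = np.random.choice(m)
    batch_x = tf.expand_dims(train_x[idx,:], axis = 0)
    batch_y = tf.expand_dims(train_y[idx], axis = 0)

    with tf.GradientTape() as tape:
        # clip predictions to avoid log(0) with the larger early steps
        y_pred = tf.clip_by_value(tf.sigmoid(tf.matmul(batch_x, w)), 1e-7, 1 - 1e-7)
        loss = tf.reduce_mean(- batch_y*tf.math.log(y_pred) - (1-batch_y)*tf.math.log(1-y_pred))

    w.assign_sub(LR * tape.gradient(loss, w))

print(w.numpy())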
Adagrad
Deep learning generally relies on a smarter use of the gradient, using statistics over its past values to make a "smarter step" with the current one.
$$\omega_{t+1,i} = \omega_{t,i} - \frac{\alpha}{\sqrt{G_{t,ii} + \epsilon}} \cdot g_{t,i} $$
where $G_{t,ii}$ is the sum of the squares of the past gradients with respect to $\omega_i$ up to step $t$, and $\epsilon$ is a small constant that prevents division by zero.
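Below is a minimal sketch of this Adagrad update applied to the same logistic regression problem (an illustration, not part of the original notebook; the learning rate, epsilon, mini-batch size, and iteration count are arbitrary choices); it reuses train_X, train_Y, and m from the cells above:

# Adagrad: accumulate squared gradients and scale each parameter's step by 1/sqrt(G + eps)
LR = 0.1
eps = 1e-8

w = tf.Variable([[0],[0],[0]], dtype = tf.float32)
G = tf.Variable(tf.zeros([3,1]))                 # running sum of squared gradients per parameter

for i in range(10000):
    idx = np.random.choice(m, size = 50)
    batch_x = tf.constant(train_X[idx,:], dtype = tf.float32)
    batch_y = tf.constant(train_Y[idx], dtype = tf.float32)

    with tf.GradientTape() as tape:
        # clip predictions to avoid log(0)
        y_pred = tf.clip_by_value(tf.sigmoid(tf.matmul(batch_x, w)), 1e-7, 1 - 1e-7)
        loss = tf.reduce_mean(- batch_y*tf.math.log(y_pred) - (1-batch_y)*tf.math.log(1-y_pred))

    g = tape.gradient(loss, w)
    G.assign_add(g*g)                            # accumulate squared gradients
    w.assign_sub(LR / tf.sqrt(G + eps) * g)      # per-parameter adaptive step

print(w.numpy())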
%%javascript
$.getScript('https://kmahelona.github.io/ipython_notebook_goodies/ipython_notebook_toc.js')