1. Regression with Polynomial Functions

Nonlinear regression
(= linear regression on nonlinear functions of the input, used when the data are not linearly distributed)

import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# 10 data points
n = 10
x = np.linspace(-4.5, 4.5, 10)
y = np.array([0.9819, 0.7973, 1.9737, 0.1838, 1.3180, -0.8361, -0.6591, -2.4701, -2.8122, -6.2512])

plt.figure(figsize = (6, 4))
plt.plot(x, y, 'o', label = 'Data')
plt.xlabel('X', fontsize = 15)
plt.ylabel('Y', fontsize = 15)
plt.grid(alpha = 0.3)
plt.show()

p = np.polyfit(x, y, deg = 1)

xp = np.arange(-4.5, 4.5, 0.01)
yp = np.polyval(p, xp)

plt.figure(figsize = (6, 4))
plt.plot(x, y, 'o', label = 'Data')
plt.plot(xp, yp, linewidth = 2, label = 'Polynomial')
plt.xlabel('X', fontsize = 15)
plt.ylabel('Y', fontsize = 15)
plt.legend(fontsize = 15)
plt.grid(alpha = 0.3)
plt.show()

p = np.polyfit(x, y, deg = 9)

xp = np.arange(-4.5, 4.5, 0.01)
yp = np.polyval(p, xp)

plt.figure(figsize = (6, 4))
plt.plot(x, y, 'o', label = 'Data')
plt.plot(xp, yp, linewidth = 2, label = 'Polynomial')
plt.xlabel('X', fontsize = 15)
plt.ylabel('Y', fontsize = 15)
plt.legend(fontsize = 15)
plt.grid(alpha = 0.3)
plt.show()

p = np.polyfit(x, y, deg = 3)

xp = np.arange(-4.5, 4.5, 0.01)

plt.figure(figsize = (6, 4))
plt.plot(x, y, 'o', label = 'Data')
plt.plot(xp, np.polyval(p, xp), linewidth = 2, label = 'Polynomial')
plt.xlabel('X', fontsize = 15)
plt.ylabel('Y', fontsize = 15)
plt.legend(fontsize = 15)
plt.grid(alpha = 0.3)
plt.show()
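The three fits above (degrees 1, 3, and 9) can also be compared numerically. The following is a minimal sketch that reuses the same data and computes the training mean squared error for each degree; the name `train_mse` is introduced here for illustration only. A low training error by itself does not say how well the curve behaves between the data points.

```python
import numpy as np

# Same 10 data points as in the cells above
x = np.linspace(-4.5, 4.5, 10)
y = np.array([0.9819, 0.7973, 1.9737, 0.1838, 1.3180,
              -0.8361, -0.6591, -2.4701, -2.8122, -6.2512])

# Training MSE for each polynomial degree (train_mse is an illustrative name)
train_mse = {}
for deg in (1, 3, 9):
    p = np.polyfit(x, y, deg=deg)          # least-squares polynomial coefficients
    residual = np.polyval(p, x) - y        # error at the training points only
    train_mse[deg] = float(np.mean(residual**2))

for deg, mse in train_mse.items():
    print(f"degree {deg}: training MSE = {mse:.3e}")
```

With 10 points, a degree-9 polynomial has enough coefficients to pass through every point, so its training error is essentially zero even though the plotted curve oscillates between the points.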

2. Polynomial Regression in TensorFlow

Construct explicit feature vectors
Consider linear combinations of fixed nonlinear functions of the input variables, of the form

$$ \begin{bmatrix} 1 & x_{1} & x_1^2\\1 & x_{2} & x_2^2\\\vdots & \vdots & \vdots\\1 & x_{m} & x_m^2 \end{bmatrix} \begin{bmatrix}\theta_0\\\theta_1 \\ \theta_2 \end{bmatrix} \quad \Rightarrow \quad \begin{bmatrix} \mid & \mid & \mid \\ b_0(x) & b_1(x) & b_2(x)\\ \mid & \mid & \mid \end{bmatrix} \begin{bmatrix}\theta_0\\\theta_1 \\ \theta_2 \end{bmatrix} $$

$$ \hat{y}=\sum_{i=0}^d{\theta_i b_i(x)} = \Phi \theta$$

Polynomial functions

$$b_i(x) = x^i, \quad i = 0,\cdots,d$$
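As a small illustration of the basis-function view, the sketch below builds the matrix $\Phi$ with columns $b_i(x) = x^i$ and solves the least-squares problem $\min_\theta \lVert \Phi\theta - y \rVert_2^2$ directly. The degree `d = 3` and the use of `np.vander` and `np.linalg.lstsq` are choices made for this example, not part of the original cells.

```python
import numpy as np

# Same 10 data points as in Section 1
x = np.linspace(-4.5, 4.5, 10)
y = np.array([0.9819, 0.7973, 1.9737, 0.1838, 1.3180,
              -0.8361, -0.6591, -2.4701, -2.8122, -6.2512])

d = 3  # polynomial degree (chosen for illustration)

# Phi[j, i] = x_j ** i for i = 0, ..., d, i.e. columns b_0(x), ..., b_d(x)
Phi = np.vander(x, N=d + 1, increasing=True)

# Least-squares solution of Phi @ theta ~= y
theta, *_ = np.linalg.lstsq(Phi, y, rcond=None)
y_hat = Phi @ theta

# np.polyfit returns the highest power first, so reverse it for comparison
theta_polyfit = np.polyfit(x, y, deg=d)[::-1]
print(np.allclose(theta, theta_polyfit))
```

The model is nonlinear in $x$ but linear in $\theta$, which is why an ordinary linear least-squares solver is enough.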

from sklearn.preprocessing import MaxAbsScaler

# 10 data points
m = 10
train_x = np.linspace(-4.5, 4.5, 10).reshape(-1,1)
train_y = np.array([0.9819, 0.7973, 1.9737, 0.1838, 1.3180, -0.8361, -0.6591, -2.4701, -2.8122, -6.2512]).reshape(-1,1)

d = 9
train_X = np.hstack([train_x**(i+1) for i in range(d)])
train_X = MaxAbsScaler().fit_transform(train_X)
train_X = np.asmatrix(train_X)

plt.figure(figsize = (6, 4))

for i in range(d):
    plt.plot(train_X[:,i], label = '$x^{}$'.format(i+1))

plt.title('Polynomial Basis', fontsize = 15)
plt.xlabel('X', fontsize = 15)
plt.ylabel('Y', fontsize = 15)
plt.grid(alpha = 0.3)
plt.legend(fontsize = 12)
plt.show()

import tensorflow as tf

LR = 0.1
n_iter = 10000

w = tf.Variable(tf.random.normal([d, 1]))
b = tf.Variable(tf.random.normal([1, 1]))
train_input = tf.constant(train_X, dtype = tf.float32)
train_output = tf.constant(train_y, dtype = tf.float32)

loss_record = []
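for i in range(n_iter):
    # Record the forward pass so the tape can differentiate the loss
    with tf.GradientTape() as tape:
        y_pred = tf.add(tf.matmul(train_input, w), b)
        loss = tf.reduce_mean(tf.square(y_pred - train_output))  # mean squared error

    # Compute the gradients outside the tape context, then take a gradient-descent step
    w_grad, b_grad = tape.gradient(loss, [w, b])

    loss_record.append(loss)
    w.assign_sub(LR * w_grad)
    b.assign_sub(LR * b_grad)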

w_val = w.numpy()
b_val = b.numpy()

xp = np.linspace(-4.5, 4.5, 100).reshape(-1,1)
Xp = np.hstack([xp**(i+1) for i in range(d)])
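# Refitting the scaler reproduces the training scaling only because xp spans the same
# range [-4.5, 4.5] as train_x; in general, reuse the scaler fitted on the training data.
Xp = MaxAbsScaler().fit_transform(Xp)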
Xp = np.asmatrix(Xp)

yp = Xp*w_val + b_val

plt.figure(figsize = (6, 4))
plt.plot(train_x, train_y,'ko')
plt.plot(xp, yp, 'r')
plt.title('Data', fontsize = 15)
plt.xlabel('X', fontsize = 15)
plt.ylabel('Y', fontsize = 15)
#plt.axis('equal')
plt.grid(alpha = 0.3)
#plt.xlim([0, 5])
plt.show()

Overfitting problem
- Have you come across a situation where your model performed exceptionally well on training data but could not predict test data?
- Avoiding overfitting is one of the most common problems data science professionals face.

Issue with rich representation
- Low error on input data points, but high error nearby
- Low error on training data, but high error on testing data

Generalization Error
- But what we really care about is the prediction loss on new data $(x, y)$
- This is called the generalization error
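The generalization error cannot be computed exactly, but it can be estimated from data the model did not see during fitting. The sketch below uses leave-one-out cross-validation on the 10 points from Section 1, which is one possible estimator chosen here for illustration (the helper names `train_mse` and `loo_mse` and the degrees 1, 3, 8 are assumptions of this example). Degree 8 is used instead of 9 because each leave-one-out fit sees only 9 points.

```python
import numpy as np

# Same 10 data points as in Section 1
x = np.linspace(-4.5, 4.5, 10)
y = np.array([0.9819, 0.7973, 1.9737, 0.1838, 1.3180,
              -0.8361, -0.6591, -2.4701, -2.8122, -6.2512])

def train_mse(x, y, deg):
    """Mean squared error on the same points used for fitting."""
    p = np.polyfit(x, y, deg=deg)
    return float(np.mean((np.polyval(p, x) - y) ** 2))

def loo_mse(x, y, deg):
    """Leave-one-out estimate: fit without point k, then predict point k."""
    errors = []
    for k in range(len(x)):
        keep = np.arange(len(x)) != k
        p = np.polyfit(x[keep], y[keep], deg=deg)
        errors.append((np.polyval(p, x[k]) - y[k]) ** 2)
    return float(np.mean(errors))

# results[deg] = (training MSE, leave-one-out MSE)
results = {deg: (train_mse(x, y, deg), loo_mse(x, y, deg)) for deg in (1, 3, 8)}

for deg, (tr, loo) in results.items():
    print(f"degree {deg}: train MSE = {tr:.3e}, leave-one-out MSE = {loo:.3e}")
```

For least-squares fits the held-out error is never smaller than the corresponding training error, and the gap between the two is a rough indicator of overfitting.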

3. Regularization to Reduce Overfitting

3.1. Regularization

With many features, the prediction function becomes very expressive (high model complexity). To counter this:

- Choose a less expressive function (e.g., lower-degree polynomial, fewer RBF centers, larger RBF bandwidth)
- Keep the magnitude of the parameters small
- Regularization: penalize large parameters $\theta$

$$\min\; \lVert \Phi \theta - y \rVert_2^2 + \lambda \lVert \theta \rVert_2^2$$
$\lambda$: regularization parameter, trades off between low loss and small values of $\theta$

Overfitting is often associated with very large estimated parameters $\theta$

We want to balance

- how well the function fits the data
- the magnitude of the coefficients

$$ \begin{align*} \text{Total cost } = \;&\underbrace{\text{measure of fit}}_{RSS(\theta)} + \;\lambda \cdot \underbrace{\text{measure of magnitude of coefficients}}_{\lVert \theta \rVert_2^2} \\ \\ \implies &\min\; \lVert \Phi \theta - y \rVert_2^2 + \lambda \lVert \theta \rVert_2^2 \end{align*} $$
where $ RSS(\theta) = \lVert \Phi\theta - y \rVert^2_2 $ is the residual sum of squares and $\lambda$ is a tuning parameter to be determined separately

The second term, $\lambda \cdot \lVert \theta \rVert_2^2$, called a shrinkage penalty, is small when $\theta_1, \cdots,\theta_d$ are close to zero, and so it has the effect of shrinking the estimates of $\theta_j$ towards zero

The tuning parameter $\lambda$ serves to control the relative impact of these two terms on the regression coefficient estimates
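The effect of $\lambda$ can be checked with the closed-form minimizer of the regularized cost. Setting the gradient $2\Phi^T(\Phi\theta - y) + 2\lambda\theta$ to zero gives $(\Phi^T\Phi + \lambda I)\,\theta = \Phi^T y$. The sketch below solves this system for a few values of $\lambda$; the helper name `ridge`, the chosen $\lambda$ values, and the rescaling of $x$ to $[-1, 1]$ are assumptions made for this example. Following the formula above, every component of $\theta$ is penalized, including the constant term.

```python
import numpy as np

# Same 10 data points as in Section 1
x = np.linspace(-4.5, 4.5, 10)
y = np.array([0.9819, 0.7973, 1.9737, 0.1838, 1.3180,
              -0.8361, -0.6591, -2.4701, -2.8122, -6.2512])

d = 9
# Rescale inputs to [-1, 1] so that the columns x**i stay comparable in size
x_scaled = x / np.abs(x).max()
Phi = np.vander(x_scaled, N=d + 1, increasing=True)

def ridge(Phi, y, lam):
    """Closed-form minimizer of ||Phi theta - y||^2 + lam * ||theta||^2."""
    A = Phi.T @ Phi + lam * np.eye(Phi.shape[1])
    return np.linalg.solve(A, Phi.T @ y)

# results[lam] = (||theta||_2, RSS(theta))
results = {}
for lam in (1e-3, 1e-1, 10.0):
    theta = ridge(Phi, y, lam)
    rss = float(np.sum((Phi @ theta - y) ** 2))
    results[lam] = (float(np.linalg.norm(theta)), rss)
    print(f"lambda = {lam:g}: ||theta|| = {results[lam][0]:.3f}, RSS = {rss:.3f}")
```

As $\lambda$ grows, the norm of $\theta$ shrinks and the residual sum of squares rises, which is the trade-off the tuning parameter controls.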

import tensorflow as tf

LR = 0.1
lamb = 0.1
n_iter = 10000

w = tf.Variable(tf.random.normal([d, 1]))
b = tf.Variable(tf.random.normal([1, 1]))
train_input = tf.constant(train_X, dtype = tf.float32)
train_output = tf.constant(train_y, dtype = tf.float32)

loss_record = []
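for i in range(n_iter):
    with tf.GradientTape() as tape:
        y_pred = tf.add(tf.matmul(train_input, w), b)
        # L2 penalty: squared norm of the weights, as in the cost above (the bias is not penalized)
        reg = tf.reduce_sum(tf.square(w))
        loss = tf.reduce_mean(tf.square(y_pred - train_output))  # mean squared error
        total_loss = loss + lamb * reg  # lamb trades off fit against weight magnitude

    # Compute the gradients of the regularized loss outside the tape context
    w_grad, b_grad = tape.gradient(total_loss, [w, b])

    loss_record.append(loss)
    w.assign_sub(LR * w_grad)
    b.assign_sub(LR * b_grad)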

w_val = w.numpy()
b_val = b.numpy()

xp = np.linspace(-4.5, 4.5, 100).reshape(-1,1)
Xp = np.hstack([xp**(i+1) for i in range(d)])
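# Refitting the scaler reproduces the training scaling only because xp spans the same
# range [-4.5, 4.5] as train_x; in general, reuse the scaler fitted on the training data.
Xp = MaxAbsScaler().fit_transform(Xp)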
Xp = np.asmatrix(Xp)

yp = Xp*w_val + b_val

plt.figure(figsize = (6, 4))
plt.plot(train_x, train_y,'ko')
plt.plot(xp, yp, 'r')
plt.title('Data', fontsize = 15)
plt.xlabel('X', fontsize = 15)
plt.ylabel('Y', fontsize = 15)
#plt.axis('equal')
plt.grid(alpha = 0.3)
#plt.xlim([0, 5])
plt.show()

3.2. Different Regularization Techniques

$L_2$ and $L_1$ regularizations

Big Data

Data augmentation
- The simplest way to reduce overfitting is to increase the size of the training data

Early stopping
- When we see that the performance on the validation set is getting worse, we immediately stop training the model

Dropout (later)
- This is one of the most interesting types of regularization techniques.
- It also produces very good results and is consequently among the most frequently used regularization techniques in the field of deep learning.
- At every iteration, it randomly selects some nodes and temporarily removes them.
- It can also be thought of as an ensemble technique in machine learning.
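Of the techniques listed above, early stopping is simple to sketch with plain gradient descent. The example below is an illustration under stated assumptions: the data are synthetic (a noisy cubic), the train/validation split, learning rate, and `patience` value are hypothetical, and NumPy is used in place of TensorFlow to keep the block short. Setting `patience = 1` reproduces the "stop immediately" rule described above; a larger value tolerates short-lived fluctuations in the validation loss.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): noisy cubic on [-1, 1]
x = rng.uniform(-1, 1, size=40)
y = x**3 - x + 0.1 * rng.standard_normal(40)

d = 9
Phi = np.vander(x, N=d + 1, increasing=True)   # polynomial features x**0 .. x**d
Phi_train, y_train = Phi[:30], y[:30]          # 30 points for fitting
Phi_val, y_val = Phi[30:], y[30:]              # 10 points held out for validation

def mse(Phi, y, theta):
    return float(np.mean((Phi @ theta - y) ** 2))

lr, max_iter, patience = 0.1, 20000, 50        # hypothetical settings
theta = np.zeros(d + 1)
best_val, best_theta, best_iter = np.inf, theta.copy(), 0

for it in range(max_iter):
    # Gradient of the training MSE with respect to theta
    grad = 2.0 / len(y_train) * Phi_train.T @ (Phi_train @ theta - y_train)
    theta -= lr * grad

    val = mse(Phi_val, y_val, theta)
    if val < best_val:
        # Validation loss improved: remember these parameters
        best_val, best_theta, best_iter = val, theta.copy(), it
    elif it - best_iter >= patience:
        # No improvement for `patience` iterations: stop training
        break

print(f"stopped at iteration {it}, best validation MSE {best_val:.4f} at iteration {best_iter}")
```

The parameters kept at the end are `best_theta`, the ones with the lowest validation loss seen so far, rather than the parameters from the final iteration.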
