Machine Learning


By Prof. Seungchul Lee
http://iailab.kaist.ac.kr/
Industrial AI Lab at KAIST

Table of Contents



1. Linear Regression


Regression is a fundamental concept in machine learning, playing a crucial role in understanding and predicting continuous outcomes based on input features. Unlike classification, which assigns data points to discrete categories, regression models aim to establish a relationship between independent variables (features) and a continuous dependent variable (target).

At its core, regression seeks to capture patterns in data and predict numerical values by minimizing the error between the predicted values and the actual values. This makes regression indispensable for a wide range of applications.

In [ ]:
from IPython.display import YouTubeVideo
YouTubeVideo('IRrGVQV8vZ8', width = "560", height = "315")
Out[ ]:

Consider a linear regression


  • Given $\; \begin{cases}x_{i} \; \text{: inputs} \\y_{i} \; \text{: outputs}\end{cases}$

$$x= \begin{bmatrix} x_{1} \\ x_{2} \\ \vdots \\ x_{m} \end{bmatrix}, \qquad y= \begin{bmatrix} y_{1} \\ y_{2} \\ \vdots \\ y_{m} \end{bmatrix}$$
  • $ \hat{y}_{i} $: predicted output

$$ \hat{y}_{i} = f(x_{i}\,; \theta) \; \text{ in general}$$
  • In many cases, a linear model is used to predict $y_{i}$

  • Unknown model parameters $ \theta = \begin{bmatrix}\theta_{0} \\\theta_{1} \\\end{bmatrix}$

$$ y_i \approx \hat{y}_{i} = \theta_{0} + \theta_{1}x_{i} \; \quad \text{ such that }\quad \min\limits_{\theta_{0}, \theta_{1}}\sum\limits_{i = 1}^{m} \left(\hat{y}_{i} - y_{i} \right)^2$$
  • Need to find $\theta_{0}$ and $\theta_{1}$

$$ y= \begin{bmatrix} y_{1} \\ y_{2} \\ \vdots \\ y_{m} \end{bmatrix} \approx \begin{bmatrix} \hat y_{1} \\ \hat y_{2} \\ \vdots \\ \hat y_{m} \end{bmatrix} = \theta_{0} \begin{bmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{bmatrix} + \theta_{1} \begin{bmatrix} x_{1} \\ x_{2} \\ \vdots \\ x_{m} \end{bmatrix}$$





1.1. Re-cast Problem as a Least Squares

  • For convenience, we define a function that maps inputs to feature vectors, $\phi$

$$\begin{array}{Icr}\begin{align*} \hat{y}_{i} & = \theta_0 + x_i \theta_1 = 1 \cdot \theta_0 + x_i \theta_1 \\ \\ & = \begin{bmatrix}1 & x_{i}\end{bmatrix}\begin{bmatrix}\theta_{0} \\ \theta_{1}\end{bmatrix} \\\\ & =\begin{bmatrix}1 \\ x_{i} \end{bmatrix}^{T}\begin{bmatrix}\theta_{0} \\ \theta_{1}\end{bmatrix} \\\\ & =\phi^{T}(x_{i})\theta \end{align*}\end{array} \begin{array}{Icr} \quad \quad \text{feature vector} \; \phi(x_{i}) = \begin{bmatrix}1 \\ x_{i}\end{bmatrix} \end{array}$$



$$\Phi = \begin{bmatrix}1 & x_{1} \\ 1 & x_{2} \\ \vdots & \vdots \\1 & x_{m} \end{bmatrix}=\begin{bmatrix}\phi^T(x_{1}) \\\phi^T(x_{2}) \\\vdots \\\phi^T(x_{m}) \end{bmatrix} \quad \implies \quad \hat{y} = \begin{bmatrix}\hat{y}_{1} \\\hat{y}_{2} \\\vdots \\\hat{y}_{m}\end{bmatrix}=\Phi\theta$$
  • Optimization problem (or least squares)

$$\min\limits_{\theta_{0}, \theta_{1}}\sum\limits_{i = 1}^{m} (\hat{y}_{i} - y_{i})^2 =\color{red}{\min\limits_{\theta}\lVert\Phi\theta-y\rVert^2_2} \qquad \qquad \left(\text{same as} \; \min_{x} \lVert Ax-b \rVert_2^2 \right)$$



$$ \text{solution} \; \theta^* = (\Phi^{T}\Phi)^{-1}\Phi^{T} y $$

1.2. Solve using Linear Algebra

  • known as least squares

$$ \theta^* = (A^TA)^{-1}A^T y \qquad (\text{here } A = \Phi)$$
In [ ]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
In [ ]:
# data points in column vector [input, output]
x = np.array([0.1, 0.4, 0.7, 1.2, 1.3, 1.7, 2.2, 2.8, 3.0, 4.0, 4.3, 4.4, 4.9]).reshape(-1, 1)
y = np.array([0.5, 0.9, 1.1, 1.5, 1.5, 2.0, 2.2, 2.8, 2.7, 3.0, 3.5, 3.7, 3.9]).reshape(-1, 1)

plt.figure(figsize = (6, 4))
plt.plot(x, y, 'ko', alpha = 0.3)
plt.xlabel('X')
plt.ylabel('Y')
plt.axis('equal')
plt.grid(alpha = 0.3)
plt.xlim([0, 5])
plt.show()
In [ ]:
m = y.shape[0]
# A = np.hstack([np.ones([m, 1]), x])
A = np.hstack([x**0, x])
A = np.asmatrix(A)

theta = (A.T*A).I*A.T*y

print('theta:\n', theta)
theta:
 [[0.65306531]
 [0.67129519]]
In [ ]:
# to plot
plt.figure(figsize = (6, 4))
plt.xlabel('X')
plt.ylabel('Y')
plt.plot(x, y, 'ko', label = "data", alpha = 0.3)

# to plot a straight line (fitted line)
xp = np.arange(0, 5, 0.01).reshape(-1, 1)
yp = theta[0,0] + theta[1,0]*xp

plt.plot(xp, yp, 'r', linewidth = 2, label = "regression")
plt.legend()
plt.axis('equal')
plt.grid(alpha = 0.3)
plt.xlim([0, 5])
plt.show()

1.3. Scikit-Learn




  • Machine Learning in Python

  • Simple and efficient tools for data mining and data analysis

  • Accessible to everybody, and reusable in various contexts

  • Built on NumPy, SciPy, and matplotlib

  • Open source, commercially usable - BSD license

  • https://scikit-learn.org/stable/index.html#



In [ ]:
from sklearn import linear_model
In [ ]:
reg = linear_model.LinearRegression()
reg.fit(x, y)
Out[ ]:
LinearRegression()
In [ ]:
reg.coef_
Out[ ]:
array([[0.67129519]])
In [ ]:
reg.intercept_
Out[ ]:
array([0.65306531])
In [ ]:
# to plot
plt.figure(figsize = (6, 4))
plt.xlabel('X')
plt.ylabel('Y')
plt.plot(x, y, 'ko', label = "data", alpha = 0.3)

# to plot a straight line (fitted line)
plt.plot(xp, reg.predict(xp), 'r', linewidth = 2, label = "regression")
plt.legend()
plt.axis('equal')
plt.grid(alpha = 0.3)
plt.xlim([0, 5])
plt.show()

1.4. More Topics in Regression

The following topics are also important in regression, but we leave them for self-study.

  • Feature Engineering

    • The performance of regression models depends heavily on the quality of input features. Preprocessing techniques such as scaling, encoding, and interaction terms can improve model performance.
  • Multivariate regression

    • An extension of linear regression where multiple output variables (dependent variables) are predicted simultaneously based on one or more input variables.
  • Nonlinear regression

    • Models the relationship between independent variables and dependent variables as a nonlinear function.
    • Unlike linear regression, it can capture complex, non-linear patterns in the data.
  • Overfitting

    • Overfitting occurs when a regression model learns the noise and fluctuations in the training data rather than the true underlying pattern.
    • As a result, the model performs well on the training data but poorly on unseen data (test data).
  • Regularization

    • Adds a penalty term to the loss function to prevent the model from fitting the noise in the training data.
    • It discourages the model from using overly large coefficients, thus simplifying the model (a brief sketch follows this list).
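
As a brief, hedged sketch of the regularization idea (using scikit-learn's Ridge, which is not covered further in these notes), an L2 penalty shrinks the fitted coefficient relative to plain least squares. The data reuses the example from Section 1.2, and alpha = 1.0 is an arbitrary choice.

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# same toy data as in Section 1.2
x = np.array([0.1, 0.4, 0.7, 1.2, 1.3, 1.7, 2.2, 2.8, 3.0, 4.0, 4.3, 4.4, 4.9]).reshape(-1, 1)
y = np.array([0.5, 0.9, 1.1, 1.5, 1.5, 2.0, 2.2, 2.8, 2.7, 3.0, 3.5, 3.7, 3.9]).reshape(-1, 1)

reg = LinearRegression().fit(x, y)
ridge = Ridge(alpha = 1.0).fit(x, y)    # alpha controls the penalty strength

# the ridge slope is shrunk toward zero relative to ordinary least squares
print(reg.coef_, ridge.coef_)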

2. Classification: Perceptron

In [ ]:
from IPython.display import YouTubeVideo
YouTubeVideo('IRrGVQV8vZ8?si=_hkRnh2jEhJVDXsY&start=931', width = "560", height = "315")
Out[ ]:

2.1. Classification

Classification is another core task in machine learning, aimed at categorizing data into predefined classes or categories based on input features. Unlike regression, where the target variable is continuous, classification deals with discrete outputs. It plays a crucial role in a variety of real-world applications, such as spam detection, image recognition, and medical diagnosis.


  • Where $y$ is a discrete value

    • Develop the classification algorithm to determine which class a new input should fall into
  • Start with binary class problems

    • Later look at multiclass classification problem, although this is just an extension of binary classification
  • We could use linear regression

    • Then, threshold the classifier output (i.e., anything over some value is yes, else no)
    • Linear regression with thresholding seems to work
    • However, relying on linear regression with thresholding may lead to suboptimal results and fail to capture more complex decision boundaries, making it prone to misclassification in certain cases.
  • We will learn

    • We will skip Support Vector Machines (SVM) and focus instead on
    • Perceptron
    • Logistic Regression

2.2. Perceptron

The Perceptron is one of the simplest types of artificial neural networks and is used for binary classification. It was invented by Frank Rosenblatt in 1958 and serves as the foundation for more advanced neural networks.

To understand how the perceptron works, let's consider a bank loan approval scenario where the bank decides whether to approve or reject a loan application based on specific criteria.


  • For input $x = \begin{bmatrix}x_1\\ \vdots\\ x_d \end{bmatrix}\;$ 'attributes of a customer'

  • weights $\omega = \begin{bmatrix}\omega_1\\ \vdots\\ \omega_d \end{bmatrix}$

$$\begin{align*} \text{Approve credit if} \; & \sum\limits_{i=1}^{d}\omega_ix_i > \text{threshold}, \\ \text{Deny credit if} \; & \sum\limits_{i=1}^{d}\omega_ix_i < \text{threshold}. \end{align*}$$
$$h(x) = \text{sign} \left(\left( \sum\limits_{i=1}^{d}\omega_ix_i \right)- \text{threshold} \right) = \text{sign}\left(\left( \sum\limits_{i=1}^{d}\omega_ix_i \right)+ \omega_0\right)$$



  • sign function

$$ \text{sign}(x) = \begin{cases} 1, &\text{if }\; x > 0\\ 0, &\text{if }\; x = 0\\ -1, &\text{if }\; x < 0 \end{cases} $$




  • Introduce an artificial coordinate $x_0 = 1$:
    • To simplify the perceptron formulation, we can include the bias term $\omega_0$ into the weight vector by adding an artificial coordinate $x_0 = 1$ to the input vector $x$. This eliminates the explicit bias term and simplifies the computation.

$$h(x) = \text{sign}\left( \sum\limits_{\color{red}{i=0}}^{d}\omega_ix_i \right)$$
  • In a vector form, the perceptron implements

$$h(x) = \text{sign}\left( \omega^T x \right)$$
  • Hyperplane
    • Separates a $d$-dimensional space into two half-spaces
    • Defined by an outward pointing normal vector $\omega$
    • $\omega$ is orthogonal to any vector lying on the hyperplane
    • With the artificial coordinate $x_0 = 1$, the hyperplane can be written as $\omega^T x = 0$, i.e., it passes through the origin of the augmented space




  • Sign with respect to a line

$$ \begin{align*} \omega = \begin{bmatrix}\omega_1 \\ \omega_2 \end{bmatrix}, \quad x = \begin{bmatrix} x_1 \\ x_2\end{bmatrix} &\implies g(x) = \omega_0 + \omega_1 x_1 + \omega_2 x_2 = \omega_0 + \omega^T x \end{align*} $$

$\qquad$or

$$ \begin{align*} \omega = \begin{bmatrix}\omega_0 \\ \omega_1 \\ \omega_2 \end{bmatrix}, \quad x = \begin{bmatrix} 1 \\ x_1 \\ x_2\end{bmatrix} &\implies g(x) = \omega_0 + \omega_1 x_1 + \omega_2 x_2 = \omega^T x \end{align*} $$




  • Goal: to learn the hyperplane $g_{\omega}(x)=0$ using the training data

  • How to find $\omega$

    • All data in class 1

      $$g(x) > 0$$
    • All data in class 0

      $$g(x) < 0$$

Perceptron Algorithm

We will first walk through the perceptron algorithm and then explore the underlying principles that explain how and why it works.


The perceptron implements


$$h(x) = \text{sign}\left( \omega^Tx \right)$$

Given the training set


$$(x_1, y_1), (x_2, y_2), \cdots, (x_N, y_N) \quad \text{where } y_i \in \{-1,1\}$$

(1) pick a misclassified point


$$ \text{sign}\left(\omega^Tx_n \right) \neq y_n$$

(2) and update the weight vector


$$\omega \leftarrow \omega + y_nx_n$$





Why Perceptron Updates Work?


  • Let's look at a misclassified positive example ($y_n = +1$)
    • perceptron (wrongly) thinks $\omega_{old}^T x_n < 0$

  • Updates would be

$$ \begin{align*}\omega_{new} &= \omega_{old} + y_n x_n = \omega_{old} + x_n \\ \\ \omega_{new}^T x_n &= (\omega_{old} + x_n)^T x_n = \omega_{old}^T x_n + x_n^T x_n \geq \omega_{old}^T x_n \end{align*}$$
  • Thus $\omega_{new}^T x_n$ is less negative (or even positive) compared to $\omega_{old}^T x_n$, i.e., the update moves the prediction toward the correct side, as the quick numerical check below confirms
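
A quick numerical check with made-up numbers (a sketch, not part of the original notes):

import numpy as np

# hypothetical misclassified positive example: y_n = +1 but w_old^T x_n < 0
w_old = np.array([1.0, -2.0, 0.5])
x_n   = np.array([1.0,  1.0, 1.0])
y_n   = 1

w_new = w_old + y_n*x_n             # perceptron update

print(w_old @ x_n, w_new @ x_n)     # -0.5 -> 2.5: moved to the correct (positive) side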

Iterations of Perceptron

  1. Randomly assign $\omega$

  2. One iteration of the PLA (perceptron learning algorithm)

    $$\omega \leftarrow \omega + yx$$
    where $(x, y)$ is a misclassified training point

  3. At iteration $t = 1, 2, 3, \cdots,$ pick a misclassified point from

    $$(x_1,y_1),(x_2,y_2),\cdots,(x_N, y_N)$$

  4. and run a PLA iteration on it

  5. That's it!




Summary

The perceptron is a simple yet powerful model for binary classification tasks like bank loan approval. It classifies applicants based on features like credit score, income, and debt. While it works well for linearly separable data, it has limitations for more complex datasets. Understanding how the perceptron learns and updates its weights is fundamental to understanding modern neural networks.






Perceptron Loss Function

If you do not want to explicitly check whether each point is misclassified, you can write the perceptron loss function as a more compact expression that automatically accumulates contributions only from misclassified points.

The loss for an individual sample is defined as:


$$ \mathscr{L}(\omega) = \max \left\{ 0, -y_n \cdot \left(\omega^T x_n \right)\right\} $$
  • Loss $\mathscr{L}(\omega) = 0$ on examples where the perceptron is correct, i.e., $y_n \cdot \left(\omega^T x_n \right) > 0$

  • Loss $\mathscr{L}(\omega) > 0$ on examples where the perceptron misclassifies, i.e., $y_n \cdot \left(\omega^T x_n \right) < 0$

Note:

  • $\text{sign}\left(\omega^T x_n \right) \neq y_n$ is equivalent to $ y_n \cdot \left(\omega^T x_n \right) < 0$

  • $\text{ReLU}(z) = \max(0, z)$: the Rectified Linear Unit (ReLU) will be revisited later in the discussion, as it plays a crucial role not only in perceptron loss formulation but also in modern deep learning architectures where it is widely used as an activation function.


This function returns zero when the point is correctly classified and returns a positive value proportional to the margin violation when misclassified.


Compact Expression for Total Perceptron Loss

The total loss across all samples in the dataset $D$ is given by:


$$ \mathscr{L}(\omega) = \sum_{n =1}^{m} \max \left\{ 0, -y_n \cdot \left(\omega^T x_n \right)\right\} $$

This summation aggregates the loss over all training samples and only penalizes incorrectly classified points.
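
A minimal NumPy sketch of this total loss (assuming each row of X already contains the artificial coordinate $x_0 = 1$ and the labels are in $\{-1, +1\}$; the toy data is hypothetical):

import numpy as np

def perceptron_loss(w, X, y):
    # sum over samples of max(0, -y_n * w^T x_n);
    # correctly classified points contribute exactly zero
    margins = y * (X @ w)
    return np.sum(np.maximum(0, -margins))

# two toy points with the artificial coordinate x0 = 1
X = np.array([[1.0,  2.0,  1.0],
              [1.0, -1.0, -2.0]])
y = np.array([1, -1])

w = np.array([0.0, -1.0, 0.0])      # deliberately poor weights: both points are misclassified
print(perceptron_loss(w, X, y))     # 3.0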



2.3. Perceptron in Python

By adding an artificial coordinate $x_0 = 1$, we simplify the perceptron implementation and avoid the need to handle the bias separately. This technique is commonly used in machine learning to streamline calculations and make the formulation more uniform.


$$g(x) = \omega_0 + \omega^Tx = \omega_0 + \omega_1x_1 + \omega_2x_2 = 0$$



$$ \begin{align*} \omega &= \begin{bmatrix} \omega_0 \\ \omega_1 \\ \omega_2\end{bmatrix}\\ \\ x &= \begin{bmatrix} \left(x^{(1)}\right)^T \\ \left(x^{(2)}\right)^T \\ \left(x^{(3)}\right)^T\\ \vdots \\ \left(x^{(m)}\right)^T \end{bmatrix} = \begin{bmatrix} 1 & x_1^{(1)} & x_2^{(1)} \\ 1 & x_1^{(2)} & x_2^{(2)} \\ 1 & x_1^{(3)} & x_2^{(3)}\\\vdots & \vdots & \vdots \\ 1 & x_1^{(m)} & x_2^{(m)}\end{bmatrix} \\ \\ y &= \begin{bmatrix}y^{(1)} \\ y^{(2)} \\ y^{(3)}\\ \vdots \\ y^{(m)} \end{bmatrix} \end{align*}$$
In [ ]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
In [ ]:
# training data generation
m = 100
x1 = 8*np.random.rand(m, 1)
x2 = 7*np.random.rand(m, 1) - 4

g = 0.8*x1 + x2 - 3
In [ ]:
C1 = np.where(g >= 1)
C0 = np.where(g < -1)
print(C1)
(array([ 3,  4, 10, 11, 13, 16, 18, 21, 23, 27, 28, 33, 34, 37, 39, 40, 45,
       61, 64, 67, 69, 74, 77, 82, 83, 85, 87, 90, 96]), array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0]))
In [ ]:
C1 = np.where(g >= 1)[0]
C0 = np.where(g < -1)[0]
print(C1.shape)
print(C0.shape)
(29,)
(40,)
In [ ]:
plt.figure(figsize = (6, 4))
plt.plot(x1[C1], x2[C1], 'ro', alpha = 0.4, label = 'C1')
plt.plot(x1[C0], x2[C0], 'bo', alpha = 0.4, label = 'C0')
plt.title('Linearly Separable Classes', fontsize = 15)
plt.legend(loc = 1, fontsize = 15)
plt.xlabel(r'$x_1$', fontsize = 15)
plt.ylabel(r'$x_2$', fontsize = 15)
plt.show()
$$ \begin{align*} x &= \begin{bmatrix} \left(x^{(1)}\right)^T \\ \left(x^{(2)}\right)^T \\ \left(x^{(3)}\right)^T\\ \vdots \\ \left(x^{(m)}\right)^T \end{bmatrix} = \begin{bmatrix} 1 & x_1^{(1)} & x_2^{(1)} \\ 1 & x_1^{(2)} & x_2^{(2)} \\ 1 & x_1^{(3)} & x_2^{(3)}\\\vdots & \vdots & \vdots \\ 1 & x_1^{(m)} & x_2^{(m)}\end{bmatrix} \\ y &= \begin{bmatrix}y^{(1)} \\ y^{(2)} \\ y^{(3)}\\ \vdots \\ y^{(m)} \end{bmatrix} \end{align*}$$
In [ ]:
X1 = np.hstack([np.ones([C1.shape[0],1]), x1[C1], x2[C1]])
X0 = np.hstack([np.ones([C0.shape[0],1]), x1[C0], x2[C0]])
X = np.vstack([X1, X0])

y = np.vstack([np.ones([C1.shape[0],1]), -np.ones([C0.shape[0],1])])

X = np.asmatrix(X)
y = np.asmatrix(y)

$$\omega = \begin{bmatrix} \omega_0 \\ \omega_1 \\ \omega_2\end{bmatrix}$$
$$\omega \leftarrow \omega + yx$$

where $(x, y)$ is a misclassified training point

In [ ]:
w = np.ones([3,1])                              # initial weights
w = np.asmatrix(w)

n_iter = y.shape[0]
flag = 0

# repeat until a full pass finds no misclassified point (data is linearly separable)
while flag == 0:
    flag = 1
    for i in range(n_iter):
        if y[i,0] != np.sign(X[i,:]*w)[0,0]:    # misclassified point
            w += y[i,0]*X[i,:].T                # PLA update: w <- w + y*x
            flag = 0

print(w)
[[-14.        ]
 [  4.77848286]
 [  8.61057437]]

$$ \begin{align*} g(x) &= \omega_0 + \omega^Tx = \omega_0 + \omega_1x_1 + \omega_2x_2 = 0 \\\\ \implies x_2 &= -\frac{\omega_1}{\omega_2} x_1 - \frac{\omega_0}{\omega_2} \end{align*} $$
In [ ]:
x1p = np.linspace(0,8,100).reshape(-1,1)
x2p = - w[1,0]/w[2,0]*x1p - w[0,0]/w[2,0]

plt.figure(figsize = (6, 4))
plt.plot(x1[C1], x2[C1], 'ro', alpha = 0.4, label = 'C1')
plt.plot(x1[C0], x2[C0], 'bo', alpha = 0.4, label = 'C0')
plt.plot(x1p, x2p, c = 'k', linewidth = 3, label = 'perceptron')
plt.xlim([0, 8])
plt.xlabel('$x_1$', fontsize = 15)
plt.ylabel('$x_2$', fontsize = 15)
plt.legend(loc = 1, fontsize = 15)
plt.show()

2.4. Perceptron using Scikit-Learn

In scikit-learn, the Perceptron includes the bias term by default, so you don't need to add an artificial coordinate manually. This makes it more convenient for implementing and training perceptron models on datasets.



$$ \begin{align*} x &= \begin{bmatrix} \left(x^{(1)}\right)^T \\ \left(x^{(2)}\right)^T \\ \left(x^{(3)}\right)^T\\ \vdots \\ \left(x^{(m)}\right)^T \end{bmatrix} = \begin{bmatrix} x_1^{(1)} & x_2^{(1)} \\ x_1^{(2)} & x_2^{(2)} \\ x_1^{(3)} & x_2^{(3)}\\ \vdots & \vdots \\ x_1^{(m)} & x_2^{(m)}\end{bmatrix} \\\\ y &= \begin{bmatrix}y^{(1)} \\ y^{(2)} \\ y^{(3)}\\ \vdots \\ y^{(m)} \end{bmatrix} \end{align*}$$
In [ ]:
X1 = np.hstack([x1[C1], x2[C1]])
X0 = np.hstack([x1[C0], x2[C0]])
X = np.vstack([X1, X0])

y = np.vstack([np.ones([C1.shape[0],1]), -np.ones([C0.shape[0],1])])
In [ ]:
from sklearn import linear_model

clf = linear_model.Perceptron(tol = 1e-3)
clf.fit(X, np.ravel(y))
Out[ ]:
Perceptron()
In [ ]:
clf.predict([[3, -2]])
Out[ ]:
array([-1.])
In [ ]:
clf.predict([[6, 2]])
Out[ ]:
array([1.])
In [ ]:
clf.coef_
Out[ ]:
array([[4.30276835, 7.18194009]])
In [ ]:
clf.intercept_
Out[ ]:
array([-12.])
$$ \begin{align*} g(x) &= \omega_0 + \omega^Tx = \omega_0 + \omega_1x_1 + \omega_2x_2 = 0 \\\\ \implies x_2 &= -\frac{\omega_1}{\omega_2} x_1 - \frac{\omega_0}{\omega_2} \end{align*} $$
In [ ]:
w0 = clf.intercept_[0]
w1 = clf.coef_[0,0]
w2 = clf.coef_[0,1]
In [ ]:
x1p = np.linspace(0,8,100).reshape(-1,1)
x2p = - w1/w2*x1p - w0/w2

plt.figure(figsize = (6, 4))
plt.plot(x1[C1], x2[C1], 'ro', alpha = 0.4, label = 'C1')
plt.plot(x1[C0], x2[C0], 'bo', alpha = 0.4, label = 'C0')
plt.plot(x1p, x2p, c = 'k', linewidth = 4, label = 'perceptron')
plt.xlim([0, 8])
plt.xlabel('$x_1$', fontsize = 15)
plt.ylabel('$x_2$', fontsize = 15)
plt.legend(loc = 1, fontsize = 15)
plt.show()

2.5. The best hyperplane separator?

  • Perceptron finds one of the many possible hyperplanes separating the data if one exists

  • Of the many possible choices, which one is the best? $\rightarrow$ lead to optimization

  • Utilize distance information from all data samples

    • We will see this formally when we discuss the logistic regression

2.6. Limitations and Improvements of Perceptron

Limitations

  • Linear Separability:

    • The perceptron only works if the "approved" and "rejected" loans can be separated by a straight line.
    • For complex data patterns (e.g., XOR), it fails to classify correctly.
  • No Probability Outputs:

    • The perceptron outputs a hard "yes" or "no" without providing a probability estimate (unlike logistic regression).

Improvements

  • Use Logistic Regression if probability estimates are needed.

  • Use a Multi-Layer Perceptron (MLP) if non-linear relationships exist in the data (see the brief XOR sketch below).
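
As a brief illustration of both points (a hedged sketch, not part of the original lecture code), scikit-learn's Perceptron cannot fit the XOR pattern, while a small MLPClassifier usually can; the hidden layer size, solver, and random_state below are arbitrary choices.

import numpy as np
from sklearn.linear_model import Perceptron
from sklearn.neural_network import MLPClassifier

# XOR: the four corner points are not linearly separable
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])

lin = Perceptron().fit(X, y)
mlp = MLPClassifier(hidden_layer_sizes = (4,), activation = 'tanh',
                    solver = 'lbfgs', max_iter = 1000, random_state = 0).fit(X, y)

print(lin.score(X, y))   # at most 0.75: no single line separates XOR
print(mlp.score(X, y))   # often 1.0, depending on the random initialization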


2.7. Historical Notes

In the late 1960s, Marvin Minsky and Seymour Papert published the influential book "Perceptrons" (1969), where they mathematically demonstrated the limitations of the perceptron. One key result from their work was that single-layer perceptrons cannot solve the XOR problem (exclusive OR), highlighting a significant limitation of early neural networks.

Stagnation in Neural Network Research:

  • The book led to a significant decline in interest and funding for neural networks, as it was seen as proof that neural networks had severe limitations.
  • This period is often referred to as the "AI Winter".

Rediscovery with Multi-Layer Networks:

  • In the 1980s, researchers like Geoffrey Hinton showed that multi-layer perceptrons (MLPs) with backpropagation could learn non-linear functions like XOR.
  • Adding hidden layers and non-linear activation functions allows neural networks to learn complex, non-linear patterns.

3. Classification: Logistic Regression

  • Logistic regression is a classification algorithm - don't be misled by its name.
In [ ]:
from IPython.display import YouTubeVideo
YouTubeVideo('IRrGVQV8vZ8?si=8H3cmkAUyNIu2NDb&start=2750', width = "560", height = "315")
Out[ ]:

3.1. Using All Distances

  • Perceptron: make use of sign of data

  • We want to use distance information of all data points $\rightarrow$ logistic regression

  • For logistic regression, $y_i \in \{0,1\}$


Let's start with the case of two data points. Which linear classification boundary would be considered better?

  • On the left: The classification boundary is positioned near the center of the data points.

  • On the right: The classification boundary is biased toward one of the data points.





  • Basic idea: to find the decision boundary (hyperplane) of $g(x)=\omega^T x =0$ such that maximizes $\prod_i \lvert h_i \rvert$

  • Why?

    • Inequality of arithmetic and geometric means


$$ \frac{\lvert h_1 \rvert + \lvert h_2 \rvert}{2} \geq \sqrt{\lvert h_1 \rvert \lvert h_2 \rvert} $$

$\qquad \qquad$and that equality holds if and only if $\lvert h_1 \rvert = \lvert h_2 \rvert$


  • Roughly speaking, this optimization of $\max \prod_i \lvert h_i \rvert$ tends to position a hyperplane in the middle of two classes

$$h = \frac{g(x)}{\lVert \omega \rVert} = \frac{\omega^T x}{\lVert \omega \rVert} \sim \omega^T x$$
  • We link or squeeze $(-\infty, +\infty)$ to $(0,1)$ for several reasons, discussed below.





  • If $\sigma(z)$ is the sigmoid function, or the logistic function




$$ \sigma(z) = \frac{1}{1+e^{-z}} \implies \sigma \left(\omega^T x \right) = \frac{1}{1+e^{-\omega^T x}}$$

  • Logistic function always generates a value between 0 and 1
  • Crosses 0.5 at the origin, then flattens out
  • The derivative of the sigmoid function satisfies

$$\sigma'(z) = \sigma(z)\left( 1 - \sigma(z)\right)$$
In [3]:
# plot a sigmoid function

import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

z = np.linspace(-4, 4, 100)
s = 1/(1 + np.exp(-z))

plt.figure(figsize = (6, 3))
plt.plot(z, s, linewidth = 3)
plt.xlim([-4, 4])
plt.axis('equal')
plt.grid(alpha = 0.3)
plt.show()
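
As a quick numerical sanity check (a small sketch added here, not in the original notebook), the derivative identity $\sigma'(z) = \sigma(z)\left(1 - \sigma(z)\right)$ above can be verified with a central finite difference:

import numpy as np

def sigmoid(z):
    return 1/(1 + np.exp(-z))

z = np.linspace(-4, 4, 9)
eps = 1e-6

numerical  = (sigmoid(z + eps) - sigmoid(z - eps))/(2*eps)   # central difference
analytical = sigmoid(z)*(1 - sigmoid(z))

print(np.max(np.abs(numerical - analytical)))                # rounding-level difference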

  • The output of the logistic function is bounded between 0 and 1, making it suitable for binary classification tasks.

  • The logistic function compresses large positive and negative values:

    • Large positive values are mapped close to 1.
    • Large negative values are mapped close to 0.

  • Benefits of mapping via the logistic function
    • monotonic: preserves the same (or a similar) optimization solution
    • continuous and differentiable: well suited to gradient-descent optimization
    • reduces the impact of incorrect predictions made with overly confident, large distances
    • probability or confidence: the output can be interpreted as a probability

$$P\left(y = +1 \mid x\,;\omega\right) = \frac{1}{1+e^{-\omega^T x}} \;\; \in \; [0,1]$$


  • Often we do not care about predicting the label $y$

  • Rather, we want to predict the label probabilities $P\left(y \mid x\,;\omega\right)$

    • the probability that the label is $+1$

    $$P\left(y = +1 \mid x ;\omega\right)$$

    • the probability that the label is $0$

    $$P\left(y = 0 \mid x ;\omega\right) = 1 - P\left(y = +1 \mid x;\omega\right)$$


Goal: we need to fit $\omega$ to our data


For a single data point $(x,y)$ with parameters $\omega$


$$ \begin{align*} P\left(y = +1 \mid x\,;\omega\right) &= h_{\omega}(x) = \sigma \left(\omega^T x \right)\\ P\left(y = 0 \mid x\,;\omega\right) &= 1 - h_{\omega}(x) = 1- \sigma \left(\omega^T x \right) \end{align*} $$

It can be compactly written as (since $y$ is either $0$ or $1$)


$$P\left(y \mid x\,;\omega\right) = \left(h_{\omega}(x) \right)^y \left(1 - h_{\omega}(x)\right)^{1-y}$$

For $m$ training data points, the likelihood function of the parameters:


$$ \begin{align*} \mathscr{L}(\omega) &= P\left(y^{(1)}, \cdots, y^{(m)} \mid x^{(1)}, \cdots, x^{(m)}\,;\omega\right)\\ &= \prod\limits_{i=1}^{m}P\left(y^{(i)} \mid x^{(i)}\,;\omega\right)\\ &= \prod\limits_{i=1}^{m}\left(h_{\omega}\left(x^{(i)}\right) \right)^{y^{(i)}} \left(1 - h_{\omega}\left(x^{(i)}\right)\right)^{1-y^{(i)}} \qquad \left(\sim \prod_i \lvert h_i \rvert \right) \end{align*} $$

It is easier to work with the log-likelihood.

  • Taking the logarithm converts the product into a sum
  • Probabilities close to 0 can cause numerical instability when directly multiplied.
  • The log transformation helps keep values in a manageable range.

$$\ell(\omega) = \log \mathscr{L}(\omega) = \sum_{i=1}^{m} y^{(i)} \log h_{\omega} \left(x^{(i)} \right) + \left(1-y^{(i)} \right) \log \left(1-h_{\omega} \left(x^{(i)} \right) \right)$$

Then, the logistic regression problem can be solved as a (convex) optimization problem as


$$\hat{\omega} = \arg\max_{\omega} \ell(\omega)$$
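
A minimal NumPy sketch of this maximization with hypothetical toy data (the gradient used below, $\nabla_\omega \ell = \sum_i \left(y^{(i)} - h_\omega\left(x^{(i)}\right)\right)x^{(i)}$, is the standard expression for the log-likelihood above):

import numpy as np

def sigmoid(z):
    return 1/(1 + np.exp(-z))

def log_likelihood(w, X, y):
    # l(w) = sum_i [ y_i log h(x_i) + (1 - y_i) log(1 - h(x_i)) ]
    h = sigmoid(X @ w)
    return np.sum(y*np.log(h) + (1 - y)*np.log(1 - h))

# toy data: rows of X are [1, x1, x2], labels are 0 or 1
X = np.array([[1.0, 0.5, 1.0],
              [1.0, 2.0, 2.5],
              [1.0, 3.0, 0.5],
              [1.0, 4.0, 3.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])

w = np.zeros(3)
for _ in range(2000):
    grad = X.T @ (y - sigmoid(X @ w))   # gradient of the log-likelihood
    w += 0.01*grad                      # gradient ascent step (maximization)

print(w)                                # fitted weights
print(log_likelihood(w, X, y))          # increases toward 0 as the fit improves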

Note:

  • Although the term "regression" appears in logistic regression, it is, in fact, a classification algorithm. This terminology may seem counterintuitive at first. However, upon closer examination, logistic regression can be understood as a regression-based approach in which the distances between data points and the linear decision boundary are transformed via a logistic (sigmoid) function. This transformation maps the outputs to a probability distribution within the range [0, 1], thereby enabling binary classification.

3.2. Logistic Regression using Scikit-Learn

We can implement logistic regression from scratch in Python, but instead, we will use scikit-learn for convenience and efficiency.


$$ \begin{align*} \omega &= \begin{bmatrix} \omega_1 \\ \omega_2\end{bmatrix}, \qquad \omega_0, \qquad x = \begin{bmatrix} x_1 \\ x_2\end{bmatrix}\\ \\ X &= \begin{bmatrix} \left(x^{(1)}\right)^T \\ \left(x^{(2)}\right)^T \\ \left(x^{(3)}\right)^T \\ \vdots\end{bmatrix} = \begin{bmatrix} x_1^{(1)} & x_2^{(1)} \\ x_1^{(2)} & x_2^{(2)} \\ x_1^{(3)} & x_2^{(3)} \\ \vdots & \vdots \\\end{bmatrix}\\ \\ y & = \begin{bmatrix} y^{(1)}\\ y^{(2)} \\y^{(3)} \\ \vdots \end{bmatrix} \end{align*} $$
In [ ]:
# data generation

m = 100
w0 = -6
w = np.array([[2], [1]])
X = np.hstack([4*np.random.rand(m,1), 4*np.random.rand(m,1)])

w = np.asmatrix(w)
X = np.asmatrix(X)

y = 1/(1 + np.exp(-w0-X*w)) > 0.5

C1 = np.where(y == True)[0]
C0 = np.where(y == False)[0]

y = np.empty([m,1])
y[C1] = 1
y[C0] = 0

plt.figure(figsize = (6, 4))
plt.plot(X[C1,0], X[C1,1], 'ro', alpha = 0.3, label = 'C1')
plt.plot(X[C0,0], X[C0,1], 'bo', alpha = 0.3, label = 'C0')
plt.xlabel(r'$x_1$', fontsize = 15)
plt.ylabel(r'$x_2$', fontsize = 15)
plt.legend(loc = 1, fontsize = 12)
plt.axis('equal')
plt.xlim([0,4])
plt.ylim([0,4])
plt.show()
In [ ]:
from sklearn import linear_model

clf = linear_model.LogisticRegression(solver = 'lbfgs')
clf.fit(np.asarray(X), np.ravel(y))
Out[ ]:
LogisticRegression()
In [ ]:
clf.coef_
Out[ ]:
array([[3.21453711, 1.29943009]])
In [ ]:
clf.intercept_
Out[ ]:
array([-9.18960726])
In [ ]:
w0 = clf.intercept_[0]
w1 = clf.coef_[0,0]
w2 = clf.coef_[0,1]

xp = np.linspace(0,4,100).reshape(-1,1)
yp = - w1/w2*xp - w0/w2

plt.figure(figsize = (6, 4))
plt.plot(X[C1,0], X[C1,1], 'ro', alpha = 0.3, label = 'C1')
plt.plot(X[C0,0], X[C0,1], 'bo', alpha = 0.3, label = 'C0')
plt.plot(xp, yp, 'g', linewidth = 4, label = 'Logistic Regression')
plt.title('Logistic Regression')
plt.xlabel(r'$x_1$', fontsize = 15)
plt.ylabel(r'$x_2$', fontsize = 15)
plt.legend(loc = 1, fontsize = 12)
plt.axis('equal')
plt.xlim([0,4])
plt.ylim([0,4])
plt.show()
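
Because logistic regression models $P(y \mid x ; \omega)$, the fitted scikit-learn classifier can also return probability estimates rather than hard labels; a brief usage note with two arbitrary query points:

# probability estimates for two query points
print(clf.predict_proba([[1, 1], [3, 2]]))   # columns: P(y = 0 | x), P(y = 1 | x)
print(clf.predict([[1, 1], [3, 2]]))         # hard labels obtained by thresholding at 0.5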

3.3. Cross-Entropy

You might have seen the concept of entropy in physics. In fact, the concept of entropy is closely related to the cross-entropy that we just encountered in logistic regression.


Entropy in Physics

In statistical physics, entropy measures the uncertainty or disorder of a system:


$$ S = -k_B \sum_{i} p_i \log p_i, $$

where:

  • $ S $ is the Boltzmann entropy.
  • $ k_B $ is Boltzmann's constant.
  • $ p_i $ is the probability of the system being in state $ i $.

Entropy quantifies how much "randomness" or "information" is needed to describe the system's configuration.


In [2]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

p = np.linspace(0.01, 0.99, 100)
S = -p*np.log(p) - (1-p)*np.log(1-p)

plt.figure(figsize = (6, 4))
plt.plot(p, S, linewidth = 3)
plt.xlabel('p')
plt.axis('equal')
plt.grid(alpha = 0.3)
plt.show()

Cross-Entropy in Logistic Regression

In logistic regression, cross-entropy measures how well the predicted probability distribution matches the true distribution:


$$ \mathcal{L} = - \sum_{i} \left[\, y_i \log \hat{y}_i + \left(1 - y_i\right) \log \left(1 - \hat{y}_i\right) \right], $$

where:

  • $ y_i $ is the true label (1 or 0).
  • $ \hat{y}_i $ is the predicted probability for class 1.
  • The loss is high when the model predicts probabilities far from the true class (illustrated numerically below).
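
A small numerical illustration with made-up predictions (a sketch, not from the original notes): the loss is small when the predicted probabilities agree with the labels and grows as they move to the wrong side.

import numpy as np

def binary_cross_entropy(y, y_hat):
    # average of -[ y log y_hat + (1 - y) log(1 - y_hat) ]
    return -np.mean(y*np.log(y_hat) + (1 - y)*np.log(1 - y_hat))

y = np.array([1.0, 0.0, 1.0, 0.0])

good = np.array([0.9, 0.1, 0.8, 0.2])   # confident and correct
bad  = np.array([0.4, 0.6, 0.3, 0.7])   # on the wrong side of 0.5

print(binary_cross_entropy(y, good))    # small loss (about 0.16)
print(binary_cross_entropy(y, bad))     # larger loss (about 1.06)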

Connecting the Concepts

  • In physics, Boltzmann entropy measures the disorder of a system based on probabilities of different states.
  • In machine learning, cross-entropy loss measures the "disorder" or "uncertainty" when the predicted probabilities do not align with the true labels.
  • Improving classification performance corresponds to reducing this disorder or uncertainty, i.e., bringing the predicted probabilities closer to the true labels.


4. Deep Learning Libraries

TensorFlow


Keras


5. TensorFlow

TensorFlow is an open-source software library for deep learning.

  • It's a framework to perform computations very efficiently, and it can tap into the GPU (Graphics Processing Unit) to speed them up even further. TensorFlow can be controlled by a simple Python API.

  • TensorFlow is one of the most widely used libraries for implementing machine learning and other algorithms that involve a large number of mathematical operations. It was developed by Google and is one of the most popular machine learning libraries on GitHub.


  • TensorFlow gets its name from "tensors", which are arrays of arbitrary dimensionality.

  • The "flow" part of the name refers to computation flowing through a graph. Training and inference in a neural network, for example, involves the propagation of matrix computations through many nodes in a computational graph.




Side note:

  • In solid mechanics, a tensor is a multi-dimensional array of numbers that describes the physical state of a material. Tensors are used to describe physical quantities like stress and strain, which have magnitude and two or more directions.

  • While the specific applications of tensors in TensorFlow and solid mechanics differ, the underlying concept - a mathematical object capable of representing multi-dimensional relationships - remains consistent, highlighting their conceptual similarity.

5.1. TensorFlow as Optimization Solver

Here we will demonstrate how to implement gradient descent using TensorFlow just to get familiar with it.

By using TensorFlow for this example, you can familiarize yourself with how TensorFlow handles automatic differentiation and gradient descent updates. This can be helpful when building deep learning models and optimizing complex loss functions.


$$\min_{\omega}\;\;(\omega - 4)^2$$
In [ ]:
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
%matplotlib inline
In [ ]:
w = tf.Variable(0, dtype = tf.float32)      # initial guess for w

LR = 0.05                                   # learning rate

# Training
cost_record = []

for i in range(50):
    with tf.GradientTape() as tape:
        cost = w*w - 8*w + 16               # (w - 4)^2 expanded
        w_grad = tape.gradient(cost, w)     # d(cost)/dw via automatic differentiation

    cost_record.append(cost)
    w.assign_sub(LR * w_grad)               # gradient descent step: w <- w - LR*grad

print("\n optimal w =", w.numpy())
print("\n")

plt.figure(figsize = (6, 4))
plt.plot(cost_record)
plt.xlabel('iteration')
plt.ylabel('cost')
plt.show()
 optimal w = 3.979385



5.2. Linear Regression


$$\hat{y} = \omega x + b$$
  • Given $x$ and $y$
  • Want to estimate $\omega$ and $b$

In [ ]:
# data generation
# data points in column vector [input, output]

train_x = np.array([0.1, 0.4, 0.7, 1.2, 1.3, 1.7, 2.2, 2.8, 3.0, 4.0, 4.3, 4.4, 4.9]).reshape(-1, 1)
train_y = np.array([0.5, 0.9, 1.1, 1.5, 1.5, 2.0, 2.2, 2.8, 2.7, 3.0, 3.5, 3.7, 3.9]).reshape(-1, 1)

m = train_x.shape[0]

plt.figure(figsize = (6, 4))
plt.plot(train_x, train_y, 'ko', alpha = 0.3)
plt.title('Data')
plt.xlabel('X')
plt.ylabel('Y')
plt.axis('equal')
plt.grid(alpha = 0.3)
plt.xlim([0, 5])
plt.show()

  • Given $(x_i, y_i)$ for $i=1,\cdots, m$

$$ \hat{y}_{i} = \omega x_{i} + b \; \quad \text{ such that }\quad \min\limits_{\omega, b}\sum\limits_{i = 1}^{m} (\hat{y}_{i} - y_{i})^2$$
In [ ]:
LR = 0.001
n_iter = 1000

w = tf.Variable([[0]], dtype = tf.float32)
b = tf.Variable([[0]], dtype = tf.float32)

loss_record = []

for i in range(n_iter):
    with tf.GradientTape() as tape:
        cost = tf.reduce_mean(tf.square(w*train_x + b - train_y))
        w_grad, b_grad = tape.gradient(cost, [w,b])

    loss_record.append(cost)
    w.assign_sub(LR * w_grad)
    b.assign_sub(LR * b_grad)

w_val = w.numpy()
b_val = b.numpy()
print("\n optimal w =", w_val)
print("\n optimal b =", b_val)
print("\n")

plt.figure(figsize = (6, 4))
plt.plot(loss_record)
plt.xlabel('iteration')
plt.ylabel('loss')
plt.show()
 optimal w = [[0.74257565]]

 optimal b = [[0.41717836]]


In [ ]:
xp = np.arange(0, 5, 0.01).reshape(-1, 1)
yp = w_val*xp + b_val

plt.figure(figsize = (6, 4))
plt.plot(train_x, train_y, 'ko', alpha = 0.3)
plt.plot(xp, yp, 'r')
plt.title('Data')
plt.xlabel('X')
plt.ylabel('Y')
plt.axis('equal')
plt.grid(alpha = 0.3)
plt.xlim([0, 5])
plt.show()

5.3. Logistic Regression


$$ \begin{align*} \omega &= \begin{bmatrix} \omega_0 \\ \omega_1 \\ \omega_2\end{bmatrix}, \qquad x = \begin{bmatrix} 1 \\ x_1 \\ x_2\end{bmatrix}\\ \\ X &= \begin{bmatrix} \left(x^{(1)}\right)^T \\ \left(x^{(2)}\right)^T \\ \left(x^{(3)}\right)^T \\ \vdots\end{bmatrix} = \begin{bmatrix} 1 & x_1^{(1)} & x_2^{(1)} \\ 1 & x_1^{(2)} & x_2^{(2)} \\ 1 & x_1^{(3)} & x_2^{(3)} \\ \vdots & \vdots & \vdots \\\end{bmatrix}, \quad y = \begin{bmatrix} y^{(1)}\\ y^{(2)} \\y^{(3)} \\ \vdots \end{bmatrix} \end{align*} $$
In [ ]:
# data generation

m = 1000
true_w = np.array([[-6], [2], [1]])
train_X = np.hstack([np.ones([m,1]), 5*np.random.rand(m,1), 4*np.random.rand(m,1)])

true_w = np.asmatrix(true_w)
train_X = np.asmatrix(train_X)

train_y = 1/(1 + np.exp(-train_X*true_w)) > 0.5

C1 = np.where(train_y == True)[0]
C0 = np.where(train_y == False)[0]

train_y = np.empty([m,1])
train_y[C1] = 1
train_y[C0] = 0

plt.figure(figsize = (6, 4))
plt.plot(train_X[C1,1], train_X[C1,2], 'ro', alpha = 0.3, label = 'C1')
plt.plot(train_X[C0,1], train_X[C0,2], 'bo', alpha = 0.3, label = 'C0')
plt.xlabel(r'$x_1$')
plt.ylabel(r'$x_2$')
plt.legend(loc = 1)
plt.axis('equal')
plt.ylim([0,4])
plt.show()

$$ \begin{align*} \ell(\omega) = \log \mathscr{L}(\omega) &= \sum_{i=1}^{m} y^{(i)} \log h_{\omega} \left(x^{(i)} \right) + \left(1-y^{(i)} \right) \log \left(1-h_{\omega} \left(x^{(i)} \right) \right)\\\\ &\Rightarrow \frac{1}{m} \sum_{i=1}^{m} y^{(i)} \log h_{\omega} \left(x^{(i)} \right) + \left(1-y^{(i)} \right) \log \left(1-h_{\omega} \left(x^{(i)} \right) \right) \end{align*} $$
In [ ]:
LR = 0.1
n_iter = 3000

w = tf.Variable([[0],[0],[0]], dtype = tf.float32)
train_x = tf.constant(train_X, dtype = tf.float32)
train_y = tf.constant(train_y, dtype = tf.float32)

loss_record = []

for i in range(n_iter):
    with tf.GradientTape() as tape:
        y_pred = tf.sigmoid(tf.matmul(train_x, w))
        loss = - train_y*tf.math.log(y_pred) - (1-train_y)*tf.math.log(1-y_pred)
        loss = tf.reduce_mean(loss)
        w_grad = tape.gradient(loss, w)

    loss_record.append(loss)
    w.assign_sub(LR * w_grad)

w_hat = w.numpy()
print(w_hat)
print("\n")

plt.figure(figsize = (6, 4))
plt.plot(loss_record)
plt.xlabel('iteration')
plt.ylabel('loss')
plt.show()
[[-8.113688 ]
 [ 2.872544 ]
 [ 1.2784871]]


In [ ]:
xp = np.arange(0, 4, 0.01).reshape(-1, 1)
yp = - w_hat[1,0]/w_hat[2,0]*xp - w_hat[0,0]/w_hat[2,0]

plt.figure(figsize = (6, 4))
plt.plot(train_X[C1,1], train_X[C1,2], 'ro', alpha = 0.3, label = 'C1')
plt.plot(train_X[C0,1], train_X[C0,2], 'bo', alpha = 0.3, label = 'C0')
plt.plot(xp, yp, 'g', linewidth = 3, label = 'Logistic Regression')
plt.xlabel(r'$x_1$')
plt.ylabel(r'$x_2$')
plt.legend(loc = 1)
plt.axis('equal')
plt.ylim([0,4])
plt.show()

Instead of manually defining the cross-entropy loss function, TensorFlow's built-in functions can be utilized for greater convenience and efficiency.

TensorFlow embedded functions

In [ ]:
LR = 0.05
n_iter = 1500

w = tf.Variable([[0],[0],[0]], dtype = tf.float32)
train_x = tf.constant(train_X, dtype = tf.float32)
train_y = tf.constant(train_y, dtype = tf.float32)

loss_record = []

for i in range(n_iter):
    with tf.GradientTape() as tape:
        y_pred = tf.matmul(train_x, w)      # logits (pre-sigmoid scores)
        loss = tf.nn.sigmoid_cross_entropy_with_logits(labels = train_y, logits = y_pred)
        w_grad = tape.gradient(loss, w)     # per-example losses are summed when differentiating a non-scalar

    loss_record.append(loss)
    w.assign_sub(LR * w_grad)

w_hat = w.numpy()
print(w_hat)
print("\n")

xp = np.arange(0, 4, 0.01).reshape(-1, 1)
yp = - w_hat[1,0]/w_hat[2,0]*xp - w_hat[0,0]/w_hat[2,0]

plt.figure(figsize = (6, 4))
plt.plot(train_X[C1,1], train_X[C1,2], 'ro', alpha = 0.3, label = 'C1')
plt.plot(train_X[C0,1], train_X[C0,2], 'bo', alpha = 0.3, label = 'C0')
plt.plot(xp, yp, 'g', linewidth = 3, label = 'Logistic Regression')
plt.xlabel(r'$x_1$')
plt.ylabel(r'$x_2$')
plt.legend(loc = 1)
plt.axis('equal')
plt.ylim([0,4])
plt.show()
[[-154.34242 ]
 [  51.494408]
 [  25.676441]]


In [ ]:
%%javascript
$.getScript('https://kmahelona.github.io/ipython_notebook_goodies/ipython_notebook_toc.js')