Machine Learning
Regression is a fundamental concept in machine learning, playing a crucial role in understanding and predicting continuous outcomes based on input features. Unlike classification, which assigns data points to discrete categories, regression models aim to establish a relationship between independent variables (features) and a continuous dependent variable (target).
At its core, regression seeks to capture patterns in data and predict numerical values by minimizing the error between the predicted values and the actual values. This makes regression indispensable for a wide range of applications.
from IPython.display import YouTubeVideo
YouTubeVideo('IRrGVQV8vZ8', width = "560", height = "315")
Consider a simple linear regression problem: fit a straight line $\hat{y} = \theta_0 + \theta_1 x$ to the data below by minimizing the squared error.
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
# data points in column vector [input, output]
x = np.array([0.1, 0.4, 0.7, 1.2, 1.3, 1.7, 2.2, 2.8, 3.0, 4.0, 4.3, 4.4, 4.9]).reshape(-1, 1)
y = np.array([0.5, 0.9, 1.1, 1.5, 1.5, 2.0, 2.2, 2.8, 2.7, 3.0, 3.5, 3.7, 3.9]).reshape(-1, 1)
plt.figure(figsize = (6, 4))
plt.plot(x, y, 'ko', alpha = 0.3)
plt.xlabel('X')
plt.ylabel('Y')
plt.axis('equal')
plt.grid(alpha = 0.3)
plt.xlim([0, 5])
plt.show()
m = y.shape[0]

# design matrix A = [1, x]: the column of ones absorbs the intercept
# A = np.hstack([np.ones([m, 1]), x])
A = np.hstack([x**0, x])
A = np.asmatrix(A)

# normal equation: theta = (A^T A)^{-1} A^T y
theta = (A.T*A).I*A.T*y
print('theta:\n', theta)
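Note that np.matrix (and its .I inverse) is discouraged in modern NumPy. As a sketch of an equivalent plain-ndarray solution, np.linalg.lstsq solves the same least-squares problem using the A and y defined above:

# least-squares solution of min ||A*theta - y||^2 without np.matrix
theta_ls, _, _, _ = np.linalg.lstsq(A, y, rcond = None)
print('theta (lstsq):\n', theta_ls)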
# to plot
plt.figure(figsize = (6, 4))
plt.xlabel('X')
plt.ylabel('Y')
plt.plot(x, y, 'ko', label = "data", alpha = 0.3)
# to plot a straight line (fitted line)
xp = np.arange(0, 5, 0.01).reshape(-1, 1)
yp = theta[0,0] + theta[1,0]*xp
plt.plot(xp, yp, 'r', linewidth = 2, label = "regression")
plt.legend()
plt.axis('equal')
plt.grid(alpha = 0.3)
plt.xlim([0, 5])
plt.show()
scikit-learn: Machine Learning in Python

Simple and efficient tools for data mining and data analysis
Accessible to everybody, and reusable in various contexts
Built on NumPy, SciPy, and matplotlib
Open source, commercially usable - BSD license
from sklearn import linear_model
reg = linear_model.LinearRegression()
reg.fit(x, y)
reg.coef_
reg.intercept_
# to plot
plt.figure(figsize = (6, 4))
plt.xlabel('X')
plt.ylabel('Y')
plt.plot(x, y, 'ko', label = "data", alpha = 0.3)
# to plot a straight line (fitted line)
plt.plot(xp, reg.predict(xp), 'r', linewidth = 2, label = "regression")
plt.legend()
plt.axis('equal')
plt.grid(alpha = 0.3)
plt.xlim([0, 5])
plt.show()
The following topics should also be covered, but we leave them for self-study (a small sketch of nonlinear regression via polynomial features follows this list).

Feature Engineering
Multivariate regression
Nonlinear regression
Overfitting
Regularization
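As a starting point for the "Nonlinear regression" item above, here is a minimal sketch using scikit-learn's PolynomialFeatures; the data and the degree are illustrative assumptions, not from these notes. The key idea: a polynomial model is still linear regression, just in a lifted feature space.

import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# illustrative data with a quadratic trend plus noise (assumed, for demonstration only)
x_nl = np.linspace(0, 4, 30).reshape(-1, 1)
y_nl = 0.5*x_nl**2 - x_nl + 1 + 0.2*np.random.randn(30, 1)

# degree-2 polynomial features, then ordinary linear regression on top
model = make_pipeline(PolynomialFeatures(degree = 2), LinearRegression())
model.fit(x_nl, y_nl)
print(model.predict(np.array([[2.0]])))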
from IPython.display import YouTubeVideo
YouTubeVideo('IRrGVQV8vZ8', width = "560", height = "315", start = 931)
Classification is another core task in machine learning, aimed at categorizing data into predefined classes or categories based on input features. Unlike regression, where the target variable is continuous, classification deals with discrete outputs. It plays a crucial role in a variety of real-world applications, such as spam detection, image recognition, and medical diagnosis.
In classification, the target $y$ is a discrete value. We start with binary class problems. We could use linear regression with a threshold on its output, but classifiers designed for discrete targets work better. We will learn two of them: the perceptron and logistic regression.
The Perceptron is one of the simplest types of artificial neural networks and is used for binary classification. It was invented by Frank Rosenblatt in 1958 and serves as the foundation for more advanced neural networks.
To understand how the perceptron works, let's consider a bank loan approval scenario where the bank decides whether to approve or reject a loan application based on specific criteria.
How to find $\omega$:

All data in class 1:
$$g(x) > 0$$

All data in class 0:
$$g(x) < 0$$
Perceptron Algorithm
We will first walk through the perceptron algorithm and then explore the underlying principles that explain how and why it works.
The perceptron implements

$$h(x) = \text{sign}\left(\omega^T x\right)$$

Given the training set

$$(x_1, y_1), (x_2, y_2), \cdots, (x_N, y_N) \quad \text{where } y_i \in \{-1, +1\}$$

(1) pick a misclassified point

$$\text{sign}\left(\omega^T x_n\right) \neq y_n$$

(2) and update the weight vector

$$\omega \leftarrow \omega + y_n x_n$$
Why Do Perceptron Updates Work?

If $(x, y)$ is misclassified, then $y \, \omega^T x < 0$. After the update $\omega \leftarrow \omega + yx$, the new score on that point is $y\left(\omega + yx\right)^T x = y \, \omega^T x + \lVert x \rVert^2$, which is strictly larger. Each update therefore moves the boundary in the right direction for the point it was applied to.
Iterations of Perceptron
Randomly assign $\omega$
One iteration of the PLA (perceptron learning algorithm)
$$\omega \leftarrow \omega + yx$$
where $(x, y)$ is a misclassified training point
At iteration $t = 1, 2, 3, \cdots,$ pick a misclassified point from
$$(x_1,y_1),(x_2,y_2),\cdots,(x_N, y_N)$$
and run a PLA iteration on it
That's it!
Summary
The perceptron is a simple yet powerful model for binary classification tasks like bank loan approval. It classifies applicants based on features like credit score, income, and debt. While it works well for linearly separable data, it has limitations for more complex datasets. Understanding how the perceptron learns and updates its weights is fundamental to understanding modern neural networks.
Perceptron Loss Function
If you do not want to explicitly check whether each point is misclassified or not, you can write the perceptron loss function in a more compact expression that automatically accumulates contributions only from misclassified points.
The loss for an individual sample is defined as:

$$\mathcal{L}_n(\omega) = \max\left(0, \, -y_n \, \omega^T x_n\right) = \text{ReLU}\left(-y_n \, \omega^T x_n\right)$$
Note:
$\text{sign}\left(\omega^T x_n \right) \neq y_n$ is equivalent to $ y_n \cdot \left(\omega^T x_n \right) < 0$
$\text{ReLU}(z) = \max(0, z)$: the Rectified Linear Unit (ReLU) will be revisited later in the discussion, as it plays a crucial role not only in perceptron loss formulation but also in modern deep learning architectures where it is widely used as an activation function.
This function returns zero when the point is correctly classified and returns a positive value proportional to the margin violation when misclassified.
Compact Expression for Total Perceptron Loss
The total loss across all samples in the dataset $D$ is given by:

$$\mathcal{L}(\omega) = \sum_{n \in D} \text{ReLU}\left(-y_n \, \omega^T x_n\right)$$
This summation aggregates the loss over all training samples and only penalizes incorrectly classified points.
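A direct NumPy transcription of this compact loss may help; this is a sketch of my own (names are illustrative), assuming labels $y_n \in \{-1, +1\}$ and inputs that already include the artificial coordinate discussed next:

import numpy as np

def perceptron_loss(w, X, y):
    # X: (N, d) inputs, y: (N,) labels in {-1, +1}, w: (d,) weights
    margins = y * (X @ w)                   # y_n * (w^T x_n), positive when correct
    return np.sum(np.maximum(0, -margins))  # ReLU(-y_n w^T x_n): only misclassified points contribute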
By adding an artificial coordinate $x_0 = 1$, we simplify the perceptron implementation and avoid the need to handle the bias separately. This technique is commonly used in machine learning to streamline calculations and make the formulation more uniform.
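A one-cell illustration of this augmentation (toy numbers of my own):

import numpy as np

X_raw = np.array([[2.0, 3.0],
                  [1.0, -1.0]])
X_aug = np.hstack([np.ones([X_raw.shape[0], 1]), X_raw])  # prepend x0 = 1 to every sample
print(X_aug)  # the bias is now just the first component of w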
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
# training data generation
m = 100
x1 = 8*np.random.rand(m, 1)
x2 = 7*np.random.rand(m, 1) - 4
g = 0.8*x1 + x2 - 3
C1 = np.where(g >= 1)
C0 = np.where(g < -1)
print(C1)
C1 = np.where(g >= 1)[0]
C0 = np.where(g < -1)[0]
print(C1.shape)
print(C0.shape)
plt.figure(figsize = (6, 4))
plt.plot(x1[C1], x2[C1], 'ro', alpha = 0.4, label = 'C1')
plt.plot(x1[C0], x2[C0], 'bo', alpha = 0.4, label = 'C0')
plt.title('Linearly Separable Classes', fontsize = 15)
plt.legend(loc = 1, fontsize = 15)
plt.xlabel(r'$x_1$', fontsize = 15)
plt.ylabel(r'$x_2$', fontsize = 15)
plt.show()
X1 = np.hstack([np.ones([C1.shape[0],1]), x1[C1], x2[C1]])
X0 = np.hstack([np.ones([C0.shape[0],1]), x1[C0], x2[C0]])
X = np.vstack([X1, X0])
y = np.vstack([np.ones([C1.shape[0],1]), -np.ones([C0.shape[0],1])])
X = np.asmatrix(X)
y = np.asmatrix(y)
The loop below repeatedly applies the update $\omega \leftarrow \omega + yx$, where $(x, y)$ is a misclassified training point, until no point is misclassified:
# initialize the weights
w = np.ones([3,1])
w = np.asmatrix(w)

n_iter = y.shape[0]   # number of training points
flag = 0

# repeat PLA passes until a full pass makes no mistakes
while flag == 0:
    flag = 1
    for i in range(n_iter):
        if y[i,0] != np.sign(X[i,:]*w)[0,0]:
            w += y[i,0]*X[i,:].T   # w <- w + y*x on the misclassified point
            flag = 0

print(w)
x1p = np.linspace(0,8,100).reshape(-1,1)
x2p = - w[1,0]/w[2,0]*x1p - w[0,0]/w[2,0]
plt.figure(figsize = (6, 4))
plt.plot(x1[C1], x2[C1], 'ro', alpha = 0.4, label = 'C1')
plt.plot(x1[C0], x2[C0], 'bo', alpha = 0.4, label = 'C0')
plt.plot(x1p, x2p, c = 'k', linewidth = 3, label = 'perceptron')
plt.xlim([0, 8])
plt.xlabel('$x_1$', fontsize = 15)
plt.ylabel('$x_2$', fontsize = 15)
plt.legend(loc = 1, fontsize = 15)
plt.show()
In scikit-learn, the Perceptron includes the bias term by default, so you don't need to add an artificial coordinate manually. This makes it more convenient for implementing and training perceptron models on datasets.
X1 = np.hstack([x1[C1], x2[C1]])
X0 = np.hstack([x1[C0], x2[C0]])
X = np.vstack([X1, X0])
y = np.vstack([np.ones([C1.shape[0],1]), -np.ones([C0.shape[0],1])])
from sklearn import linear_model
clf = linear_model.Perceptron(tol = 1e-3)
clf.fit(X, np.ravel(y))
clf.predict([[3, -2]])
clf.predict([[6, 2]])
clf.coef_
clf.intercept_
w0 = clf.intercept_[0]
w1 = clf.coef_[0,0]
w2 = clf.coef_[0,1]
x1p = np.linspace(0,8,100).reshape(-1,1)
x2p = - w1/w2*x1p - w0/w2
plt.figure(figsize = (6, 4))
plt.plot(x1[C1], x2[C1], 'ro', alpha = 0.4, label = 'C1')
plt.plot(x1[C0], x2[C0], 'bo', alpha = 0.4, label = 'C0')
plt.plot(x1p, x2p, c = 'k', linewidth = 4, label = 'perceptron')
plt.xlim([0, 8])
plt.xlabel('$x_1$', fontsize = 15)
plt.ylabel('$x_2$', fontsize = 15)
plt.legend(loc = 1, fontsize = 15)
plt.show()
The perceptron finds one of the many possible hyperplanes separating the data, if one exists.

Of the many possible choices, which one is the best? $\rightarrow$ this leads to an optimization problem.

Idea: utilize distance information from all data samples.
Limitations
Linear Separability: the perceptron converges only when the data are linearly separable; if they are not, the algorithm never terminates.
No Probability Outputs: the perceptron produces only a hard class label, with no measure of confidence or probability.
Improvements
Use Logistic Regression if probability estimates are needed.
Use a Multi-Layer Perceptron (MLP) if non-linear relationships exist in the data.
In the late 1960s, Marvin Minsky and Seymour Papert published the influential book "Perceptrons" (1969), where they mathematically demonstrated the limitations of the perceptron. One key result from their work was that single-layer perceptrons cannot solve the XOR problem (exclusive OR), highlighting a significant limitation of early neural networks.
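To see their point concretely, here is a small sketch of my own (not from the original notes): a single perceptron cannot fit the four XOR points, while a small multi-layer network with one hidden layer typically can, though the outcome may depend on the random seed.

from sklearn.linear_model import Perceptron
from sklearn.neural_network import MLPClassifier

# the four XOR points: the two classes are not linearly separable
X_xor = [[0, 0], [0, 1], [1, 0], [1, 1]]
y_xor = [0, 1, 1, 0]

p = Perceptron().fit(X_xor, y_xor)
print('perceptron accuracy:', p.score(X_xor, y_xor))  # cannot reach 1.0

mlp = MLPClassifier(hidden_layer_sizes = (4,), solver = 'lbfgs', random_state = 0)
mlp.fit(X_xor, y_xor)
print('MLP accuracy:', mlp.score(X_xor, y_xor))       # typically 1.0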
Stagnation in Neural Network Research: the book's negative results contributed to a sharp decline in funding and interest in neural network research through the 1970s.
Rediscovery with Multi-Layer Networks: interest revived in the 1980s, when multi-layer networks trained with backpropagation were shown to overcome the single-layer perceptron's limitations, including XOR.
from IPython.display import YouTubeVideo
YouTubeVideo('IRrGVQV8vZ8', width = "560", height = "315", start = 2750)
Perceptron: makes use only of the sign of each data point relative to the boundary.
We want to use the distance information of all data points $\rightarrow$ logistic regression.
For logistic regression, $y_i \in \{0,1\}$.
Let's start with the case of two data points. Which linear classification boundary would be considered better?
On the left: The classification boundary is positioned near the center of the data points.
On the right: The classification boundary is biased toward one of the data points.
Basic idea: find the decision boundary (hyperplane) $g(x)=\omega^T x = 0$ that maximizes $\prod_i \lvert h_i \rvert$, where $h_i$ is the distance from the $i$-th data point to the boundary.
Why? For two points whose distances to the boundary have a fixed sum, the AM-GM inequality gives

$$\lvert h_1 \rvert \lvert h_2 \rvert \le \left( \frac{\lvert h_1 \rvert + \lvert h_2 \rvert}{2} \right)^2$$

and the equality holds if and only if $\lvert h_1 \rvert = \lvert h_2 \rvert$, so the product is maximized when the boundary passes through the center of the two points.
$$ \sigma(z) = \frac{1}{1+e^{-z}} \implies \sigma \left(\omega^T x \right) = \frac{1}{1+e^{-\omega^T x}}$$
# plot a sigmoid function
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
z = np.linspace(-4, 4, 100)
s = 1/(1 + np.exp(-z))
plt.figure(figsize = (6, 3))
plt.plot(z, s, linewidth = 3)
plt.xlim([-4, 4])
plt.axis('equal')
plt.grid(alpha = 0.3)
plt.show()
The output of the logistic function is bounded between 0 and 1, making it suitable for binary classification tasks.
The logistic function compresses large positive and negative values: $\sigma(z) \to 1$ as $z \to \infty$ and $\sigma(z) \to 0$ as $z \to -\infty$. Applied to the linear score $\omega^T x$, it yields a valid probability:
$$P\left(y = +1 \mid x\,;\omega\right) = \frac{1}{1+e^{-\omega^T x}} \;\; \in \; [0,1]$$
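A quick numerical check of this compression (added here for illustration):

# sigmoid squashes large |z| toward 0 or 1
for z in [-10, -2, 0, 2, 10]:
    print(z, 1/(1 + np.exp(-z)))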
Often we do not care about predicting the label $y$ itself
Rather, we want to predict the label probabilities $P\left(y \mid x\,;\omega\right)$
$$P\left(y = +1 \mid x\,;\omega\right) = \sigma\left(\omega^T x\right)$$

$$P\left(y = 0 \mid x\,;\omega\right) = 1 - P\left(y = +1 \mid x\,;\omega\right)$$
Goal: we need to fit $\omega$ to our data
For a single data point $(x,y)$ with parameters $\omega$:

$$P\left(y \mid x\,;\omega\right) = \begin{cases} \sigma\left(\omega^T x\right) & \text{if } y = 1 \\ 1 - \sigma\left(\omega^T x\right) & \text{if } y = 0 \end{cases}$$

It can be compactly written as (since $y$ is either $0$ or $1$):

$$P\left(y \mid x\,;\omega\right) = \left(\sigma\left(\omega^T x\right)\right)^{y} \left(1 - \sigma\left(\omega^T x\right)\right)^{1-y}$$

For $m$ training data points, the likelihood function of the parameters:

$$\mathcal{L}(\omega) = \prod_{i=1}^{m} P\left(y^{(i)} \mid x^{(i)}\,;\omega\right)$$

It would be easier to work with the log-likelihood:

$$\ell(\omega) = \sum_{i=1}^{m} \left[ y^{(i)} \log \sigma\left(\omega^T x^{(i)}\right) + \left(1 - y^{(i)}\right) \log \left(1 - \sigma\left(\omega^T x^{(i)}\right)\right) \right]$$

Then, the logistic regression problem can be solved as a (convex) optimization problem:

$$\hat{\omega} = \arg \max_{\omega} \ell(\omega) = \arg \min_{\omega} \left(-\ell(\omega)\right)$$
Note: the negative log-likelihood $-\ell(\omega)$ is exactly the cross-entropy loss, which we revisit below.
We can implement logistic regression from scratch in Python, but instead, we will use scikit-learn for convenience and efficiency.
# data generation
m = 100
w0 = -6
w = np.array([[2], [1]])
X = np.hstack([4*np.random.rand(m,1), 4*np.random.rand(m,1)])
w = np.asmatrix(w)
X = np.asmatrix(X)
y = 1/(1 + np.exp(-w0-X*w)) > 0.5
C1 = np.where(y == True)[0]
C0 = np.where(y == False)[0]
y = np.empty([m,1])
y[C1] = 1
y[C0] = 0
plt.figure(figsize = (6, 4))
plt.plot(X[C1,0], X[C1,1], 'ro', alpha = 0.3, label = 'C1')
plt.plot(X[C0,0], X[C0,1], 'bo', alpha = 0.3, label = 'C0')
plt.xlabel(r'$x_1$', fontsize = 15)
plt.ylabel(r'$x_2$', fontsize = 15)
plt.legend(loc = 1, fontsize = 12)
plt.axis('equal')
plt.xlim([0,4])
plt.ylim([0,4])
plt.show()
from sklearn import linear_model
clf = linear_model.LogisticRegression(solver = 'lbfgs')
clf.fit(np.asarray(X), np.ravel(y))
clf.coef_
clf.intercept_
w0 = clf.intercept_[0]
w1 = clf.coef_[0,0]
w2 = clf.coef_[0,1]
xp = np.linspace(0,4,100).reshape(-1,1)
yp = - w1/w2*xp - w0/w2
plt.figure(figsize = (6, 4))
plt.plot(X[C1,0], X[C1,1], 'ro', alpha = 0.3, label = 'C1')
plt.plot(X[C0,0], X[C0,1], 'bo', alpha = 0.3, label = 'C0')
plt.plot(xp, yp, 'g', linewidth = 4, label = 'Logistic Regression')
plt.title('Logistic Regression')
plt.xlabel(r'$x_1$', fontsize = 15)
plt.ylabel(r'$x_2$', fontsize = 15)
plt.legend(loc = 1, fontsize = 12)
plt.axis('equal')
plt.xlim([0,4])
plt.ylim([0,4])
plt.show()
You might have seen the concept of entropy in physics. In fact, the concept of entropy is closely related to the cross-entropy that we just encountered in logistic regression.
Entropy in Physics
In statistical physics, entropy measures the uncertainty or disorder of a system. Up to Boltzmann's constant, the Gibbs entropy is:

$$S = -\sum_i p_i \log p_i$$

where: $p_i$ is the probability of finding the system in microstate $i$.
Entropy quantifies how much "randomness" or "information" is needed to describe the system's configuration.
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
p = np.linspace(0.01, 0.99, 100)
# binary entropy: S(p) = -p log p - (1-p) log(1-p), maximized at p = 0.5
S = -p*np.log(p) - (1-p)*np.log(1-p)
plt.figure(figsize = (6, 4))
plt.plot(p, S, linewidth = 3)
plt.xlabel('p')
plt.axis('equal')
plt.grid(alpha = 0.3)
plt.show()
Cross-Entropy in Logistic Regression
In logistic regression, cross-entropy measures how well the predicted probability distribution matches the true distribution:

$$\text{CE} = -\frac{1}{m} \sum_{i=1}^{m} \left[ y_i \log \hat{y}_i + \left(1 - y_i\right) \log \left(1 - \hat{y}_i\right) \right]$$

where: $y_i$ is the true label and $\hat{y}_i = \sigma\left(\omega^T x_i\right)$ is the predicted probability.
Connecting the Concepts

Entropy measures the uncertainty of a single distribution, while cross-entropy measures the mismatch between a predicted distribution and the true one; the two coincide when the prediction equals the truth, which is why minimizing cross-entropy drives the model toward the true label distribution.
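A small numeric sketch (illustrative values of my own) showing that binary cross-entropy is smallest when the predicted probabilities match the true labels:

import numpy as np

def binary_cross_entropy(y_true, y_hat):
    # mean of -[y log(p) + (1 - y) log(1 - p)] over the samples
    y_true = np.asarray(y_true, dtype = float)
    y_hat = np.asarray(y_hat, dtype = float)
    return np.mean(-y_true*np.log(y_hat) - (1 - y_true)*np.log(1 - y_hat))

y_true = [1, 0, 1, 1]
print(binary_cross_entropy(y_true, [0.9, 0.1, 0.8, 0.7]))  # confident and correct: small loss
print(binary_cross_entropy(y_true, [0.6, 0.4, 0.5, 0.5]))  # uncertain: larger loss
print(binary_cross_entropy(y_true, [0.1, 0.9, 0.2, 0.3]))  # confident and wrong: large loss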
TensorFlow
TensorFlow is an open-source software library for deep learning.
It's a framework to perform computation very efficiently, and it can tap into the GPU (Graphics Processor Unit) in order to speed it up even further. TensorFlow can be controlled by a simple Python API.
TensorFlow is one of the most widely used libraries for implementing machine learning and other algorithms that involve a large number of mathematical operations. It was developed by Google and is one of the most popular machine learning libraries on GitHub.
TensorFlow gets its name from "tensors", which are arrays of arbitrary dimensionality.
The "flow" part of the name refers to computation flowing through a graph. Training and inference in a neural network, for example, involves the propagation of matrix computations through many nodes in a computational graph.
Side note:
In solid mechanics, a tensor is a multi-dimensional array of numbers that describes the physical state of a material. Tensors are used to describe physical quantities like stress and strain, which have magnitude and two or more directions.
While the specific applications of tensors in TensorFlow and solid mechanics differ, the underlying concept - a mathematical object capable of representing multi-dimensional relationships - remains consistent, highlighting their conceptual similarity.
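As a tiny illustration of "arrays of arbitrary dimensionality" (a sketch assuming TensorFlow is installed; it is imported in the cells below anyway):

import tensorflow as tf

scalar = tf.constant(3.0)               # rank-0 tensor
vector = tf.constant([1.0, 2.0, 3.0])   # rank-1 tensor
matrix = tf.constant([[1.0, 2.0],
                      [3.0, 4.0]])      # rank-2 tensor
print(scalar.shape, vector.shape, matrix.shape)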
Here we will demonstrate how to implement gradient descent using TensorFlow just to get familiar with it.
By using TensorFlow for this example, you can familiarize yourself with how TensorFlow handles automatic differentiation and gradient descent updates. This can be helpful when building deep learning models and optimizing complex loss functions.
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
%matplotlib inline
w = tf.Variable(0, dtype = tf.float32)
LR = 0.05
# Training: minimize cost(w) = w^2 - 8w + 16 = (w - 4)^2 by gradient descent
cost_record = []

for i in range(50):
    with tf.GradientTape() as tape:
        cost = w*w - 8*w + 16
    w_grad = tape.gradient(cost, w)   # dcost/dw via automatic differentiation
    cost_record.append(cost)
    w.assign_sub(LR * w_grad)         # gradient descent step: w <- w - LR * w_grad
print("\n optimal w =", w.numpy())
print("\n")
plt.figure(figsize = (6, 4))
plt.plot(cost_record)
plt.xlabel('iteration')
plt.ylabel('cost')
plt.show()
# data generation
# data points in column vector [input, output]
train_x = np.array([0.1, 0.4, 0.7, 1.2, 1.3, 1.7, 2.2, 2.8, 3.0, 4.0, 4.3, 4.4, 4.9]).reshape(-1, 1)
train_y = np.array([0.5, 0.9, 1.1, 1.5, 1.5, 2.0, 2.2, 2.8, 2.7, 3.0, 3.5, 3.7, 3.9]).reshape(-1, 1)
m = train_x.shape[0]
plt.figure(figsize = (6, 4))
plt.plot(train_x, train_y, 'ko', alpha = 0.3)
plt.title('Data')
plt.xlabel('X')
plt.ylabel('Y')
plt.axis('equal')
plt.grid(alpha = 0.3)
plt.xlim([0, 5])
plt.show()
LR = 0.001
n_iter = 1000
w = tf.Variable([[0]], dtype = tf.float32)
b = tf.Variable([[0]], dtype = tf.float32)
loss_record = []
for i in range(n_iter):
    with tf.GradientTape() as tape:
        # mean squared error of the line w*x + b
        cost = tf.reduce_mean(tf.square(w*train_x + b - train_y))
    w_grad, b_grad = tape.gradient(cost, [w,b])
    loss_record.append(cost)
    w.assign_sub(LR * w_grad)
    b.assign_sub(LR * b_grad)
w_val = w.numpy()
b_val = b.numpy()
print("\n optimal w =", w_val)
print("\n optimal b =", b_val)
print("\n")
plt.figure(figsize = (6, 4))
plt.plot(loss_record)
plt.xlabel('iteration')
plt.ylabel('loss')
plt.show()
xp = np.arange(0, 5, 0.01).reshape(-1, 1)
yp = w_val*xp + b_val
plt.figure(figsize = (6, 4))
plt.plot(train_x, train_y, 'ko', alpha = 0.3)
plt.plot(xp, yp, 'r')
plt.title('Data')
plt.xlabel('X')
plt.ylabel('Y')
plt.axis('equal')
plt.grid(alpha = 0.3)
plt.xlim([0, 5])
plt.show()
# data generation
m = 1000
true_w = np.array([[-6], [2], [1]])
train_X = np.hstack([np.ones([m,1]), 5*np.random.rand(m,1), 4*np.random.rand(m,1)])
true_w = np.asmatrix(true_w)
train_X = np.asmatrix(train_X)
train_y = 1/(1 + np.exp(-train_X*true_w)) > 0.5
C1 = np.where(train_y == True)[0]
C0 = np.where(train_y == False)[0]
train_y = np.empty([m,1])
train_y[C1] = 1
train_y[C0] = 0
plt.figure(figsize = (6, 4))
plt.plot(train_X[C1,1], train_X[C1,2], 'ro', alpha = 0.3, label = 'C1')
plt.plot(train_X[C0,1], train_X[C0,2], 'bo', alpha = 0.3, label = 'C0')
plt.xlabel(r'$x_1$')
plt.ylabel(r'$x_2$')
plt.legend(loc = 1)
plt.axis('equal')
plt.ylim([0,4])
plt.show()
LR = 0.1
n_iter = 3000
w = tf.Variable([[0],[0],[0]], dtype = tf.float32)
train_x = tf.constant(train_X, dtype = tf.float32)
train_y = tf.constant(train_y, dtype = tf.float32)
loss_record = []
for i in range(n_iter):
    with tf.GradientTape() as tape:
        y_pred = tf.sigmoid(tf.matmul(train_x, w))
        # binary cross-entropy written out explicitly
        loss = - train_y*tf.math.log(y_pred) - (1-train_y)*tf.math.log(1-y_pred)
        loss = tf.reduce_mean(loss)
    w_grad = tape.gradient(loss, w)
    loss_record.append(loss)
    w.assign_sub(LR * w_grad)
w_hat = w.numpy()
print(w_hat)
print("\n")
plt.figure(figsize = (6, 4))
plt.plot(loss_record)
plt.xlabel('iteration')
plt.ylabel('loss')
plt.show()
xp = np.arange(0, 4, 0.01).reshape(-1, 1)
yp = - w_hat[1,0]/w_hat[2,0]*xp - w_hat[0,0]/w_hat[2,0]
plt.figure(figsize = (6, 4))
plt.plot(train_X[C1,1], train_X[C1,2], 'ro', alpha = 0.3, label = 'C1')
plt.plot(train_X[C0,1], train_X[C0,2], 'bo', alpha = 0.3, label = 'C0')
plt.plot(xp, yp, 'g', linewidth = 3, label = 'Logistic Regression')
plt.xlabel(r'$x_1$')
plt.ylabel(r'$x_2$')
plt.legend(loc = 1)
plt.axis('equal')
plt.ylim([0,4])
plt.show()
Instead of manually defining the cross-entropy loss function, TensorFlow's built-in functions can be utilized for greater convenience and efficiency.
TensorFlow Built-in Functions
LR = 0.05
n_iter = 1500
w = tf.Variable([[0],[0],[0]], dtype = tf.float32)
train_x = tf.constant(train_X, dtype = tf.float32)
train_y = tf.constant(train_y, dtype = tf.float32)
loss_record = []
for i in range(n_iter):
    with tf.GradientTape() as tape:
        y_pred = tf.matmul(train_x, w)   # logits: the sigmoid is applied inside the loss below
        # reduce_sum makes the recorded loss a scalar; the gradient is unchanged, since
        # GradientTape implicitly sums a non-scalar target anyway
        loss = tf.reduce_sum(tf.nn.sigmoid_cross_entropy_with_logits(labels = train_y, logits = y_pred))
    w_grad = tape.gradient(loss, w)
    loss_record.append(loss)
    w.assign_sub(LR * w_grad)
w_hat = w.numpy()
print(w_hat)
print("\n")
xp = np.arange(0, 4, 0.01).reshape(-1, 1)
yp = - w_hat[1,0]/w_hat[2,0]*xp - w_hat[0,0]/w_hat[2,0]
plt.figure(figsize = (6, 4))
plt.plot(train_X[C1,1], train_X[C1,2], 'ro', alpha = 0.3, label = 'C1')
plt.plot(train_X[C0,1], train_X[C0,2], 'bo', alpha = 0.3, label = 'C0')
plt.plot(xp, yp, 'g', linewidth = 3, label = 'Logistic Regression')
plt.xlabel(r'$x_1$')
plt.ylabel(r'$x_2$')
plt.legend(loc = 1)
plt.axis('equal')
plt.ylim([0,4])
plt.show()