Machine Learning for Mechanical Engineering

Classification

Prof. Seungchul Lee
http://iailab.kaist.ac.kr/
Industrial AI Lab at KAIST

For your handwritten solutions, please scan or take a picture of them. Alternatively, you can write them in markdown if you prefer.
Only .ipynb files will be graded for your code.
- Ensure that your NAME and student ID are included in your .ipynb files. ex) SeungchulLee_20241234_HW01.ipynb
Compress all the files into a single .zip file.
- In the .zip file's name, include your NAME and student ID.
ex) SeungchulLee_20241234_HW01.zip
- Submit this .zip file on KLMS
Do not submit a printed version of your code, as it will not be graded.

Problem 1¶

(a) Let $f(x_1,x_2) = 1 + 7x_1 + 4x_2 = 0$ be the hyperplane (or decision surface in classification problems). Compute the shortest distances from $x_a = \begin{bmatrix} 2 \\1\end{bmatrix}$ and $x_b = \begin{bmatrix} 1 \\ -8\end{bmatrix}$ to the above hyperplane, repectively.

(b) Let $f(x_1,x_2) = 2x_1 + 3x_2 + 1 = 0$ be the hyperplane (or decision surface in classification problems). Compute the shortest distances from $x_a = \begin{bmatrix} 2 \\1\end{bmatrix}$ and $x_b = \begin{bmatrix} 1 \\ -1\end{bmatrix}$ to the above hyperplane, respectively.

Problem 2¶

Find the solution $r$ of the below optimization problem with the following objective function and constraint. ($x,\omega, \omega_0 \in \mathbb{R}^n$)

$$ \begin{align*} r = \;&\min_x \;\rVert x \lVert_2\\ \text{s.t.}&\; \omega^T x + \omega_0 = 0 \end{align*}$$

Problem 3¶

Suppose we have the following training examples in a 2-dimensional input space, and no bias term. Following our discussion of the perceptron algorithm, we use $t = −1$ rather than $t = 0$ to denote the negative class.

Recall the perceptron algorithm from lecture. It repeatedly applies the following procedure, stopping when the weights are strictly within the feasible region (i.e., not on the boundary):

Here we work through the algorithm by hand.

(a) Sketch out the problem in weight space. In particular: draw and label the axes, draw the half-space corresponding to each of the two training examples, and shade the feasible region. (Remember that there is no bias term.)

(b) Simulate the perceptron algorithm by hand, starting from the initial weights $\omega_1 = 0, \omega_2 = −2$. On the plot from part (a), mark an $X$ on every point in weight space visited by the algorithm until it stops. You do not need to provide anything for this part other than the $X$’s on the plot.

Hint: the weights are updated 12 times. It will be tedious if you compute all the updates using arithmetic; instead, plot the ﬁrst few updates as you go along, and you should start to notice a visual pattern.

Problem 4-1: Feature Extracting from Handwritten Digits¶

In this problem, we will try to classify handwritten digits. For the sake of simplicity, we simplify this digit classification problem into a binary classification between digit 0 and digit 1.

Step 1. Load the data¶

Data	Data dexcription
0	1000 images (28×28 pixels) of handwritten digit 0
1	1000 images (28×28 pixels) of handwritten digit 1

Download data

To read the files in Python, use the following code:

## you do not need to change anyting

import numpy as np
import matplotlib.pyplot as plt
import cvxpy as cvx
%matplotlib inline

from six.moves import cPickle

data = cPickle.load(open('./data_files/data_zero_one.pkl', 'rb'))
data0 = data['0']
data1 = data['1']

Here, we will load 1000 of data for ‘0’ and ‘1’, respectively. To display the image use plt.imshow(img).

Note: make sure you are reading the files correctly. Check by displaying the first few and the last few images in each class.
Note: calculations are more manageable if you go though and convert each of the pixels in $28\times28$ matrix to a binary value first.

(a) Display the few random images in each class.

## you do not need to change anyting

plt.figure(figsize = (12, 8))
plt.subplot(2,4,1), plt.imshow(data0[np.random.randint(1000)], 'gray'), plt.axis('off')
plt.subplot(2,4,2), plt.imshow(data0[np.random.randint(1000)], 'gray'), plt.axis('off')
plt.subplot(2,4,3), plt.imshow(data0[np.random.randint(1000)], 'gray'), plt.axis('off')
plt.subplot(2,4,4), plt.imshow(data0[np.random.randint(1000)], 'gray'), plt.axis('off')
plt.subplot(2,4,5), plt.imshow(data1[np.random.randint(1000)], 'gray'), plt.axis('off')
plt.subplot(2,4,6), plt.imshow(data1[np.random.randint(1000)], 'gray'), plt.axis('off')
plt.subplot(2,4,7), plt.imshow(data1[np.random.randint(1000)], 'gray'), plt.axis('off')
plt.subplot(2,4,8), plt.imshow(data1[np.random.randint(1000)], 'gray'), plt.axis('off')
plt.show()

(b) Convert each of the pixels in $28×28$ matrix to a binary value.

## you do not need to change anyting

data0 = data0 > 125
data1 = data1 > 125

Step 2. Extract features¶

In machine learning and pattern recognition, a feature is an individual measurable property or characteristic of a phenomenon being observed. Choosing informative, discriminating and independent features from raw data (or input) is a crucial step for effective algorithms in pattern recognition, classification and regression.

Now we must select the own ‘features’ from image data to detect digit 0 and digit 1. Two features are recommended as follows:

(a) Feature 1: the total average pixels located at the center of the image (img[10:20,10:20]).

## your code here
#

(b) Feature 2: the total average pixels over the entire image.

## your code here
#

(c) Include the ones as the bias term.

$$\phi(x) = \begin{bmatrix} 1 \\ \text{feature1}\\ \text{feature2} \end{bmatrix} \quad \implies \quad X \; (\text{or } \Phi) = \begin{bmatrix} \phi_1^T\\ \vdots \\ \phi_{1000}^T \\ \phi_{1001}^T \\ \vdots \\ \phi_{2000}^T \end{bmatrix}$$

You should end up with a $2000\times3$ input matrix with the first $1000$ rows correspond to all of the ‘data0’ and the second 1000 rows correspond to all of the given ‘data1’. This matrix is matrix $X$ (or $\Phi$) which we learned in a class.

## your code here
#

(2000, 3)

Step 3. Plot the features of data¶

Plot the features to see if classes are separable. The expected plot is the following:

## your code here
#

Problem 4-2: Digit classification with perceptron¶

We would like to use the perceptron algorithm to classify digit 0 and digit 1 in the training set.

Step 1. Initialization¶

You have to initialize $\omega$. We usually set $\omega$ to a zero vector as an initial guess.

## your code here
#

Step 2. Update $\omega$¶

We update $\omega$ when the prediction is wrong. The update rule is the following:

$$ \omega \leftarrow \omega + y \cdot x$$

We will repeat the same iteration for the number of data points. Here’s a pseudo code:

For k = 1:100 {
      For j = 1:100 {
            i = random integer selection between 1 and 2000
            compute yhat(i)
            if yhat(i) is wrong {
                w = w + y(i)x(i)
            }
    }
}

For every $k$'s iteration, count and store how many predictions are wrong to see if your classifier converges to somewhere or not.

## your code here
#

Step 3. Plot the result¶

You are asked to plot two graphs. First, plot the number of wrong predictions with respect to every $k$'s iteration. The graph is expected as follows:

Second, plot the classifier (decision boundary). Note that the decision boundary is given:

$$\omega_1 x_1 + \omega_2x_2 + \omega_0 = 0$$

## your code here
#

Problem 4-3: Digit classification with SVM¶

We would like to use the SVM algorithm to classify digit 0 and digit 1 in the training set.

Step 1. Define variables¶

Find a classifier using the support vector machine which is equivalent to solving the following optimization problem:

$$\begin{align*} \text{minimize } & \lVert \omega \rVert_2 + \gamma(1^Tu + 1^Tv)\\ \text{subject to }& \\ & X_0^T \omega \geq 1 - u\\ & X_1^T \omega \leq - (1-v)\\ & u \geq 0\\ & v \geq 0 \end{align*}$$

Here, $X_0$ and $X_1$ are $1000\times3$ matrix for each class. You can divide the above matrix $X$ into two parts to reuse it.

## your code here
#

Step 2. CVXPY¶

Run CVXPY to find $\omega$.

## your code here
#

Step 3. Plot the result¶

Now you have an optimal $\omega$ . Plot a decision boundary. Note that our decision boundary is defined as follows:

$$ \omega_1 x_1 + \omega_2 x_2 + \omega_0 = 0$$

## your code here
#

Problem 5¶

Consider the binary classification task that consists of the following points:

$$ \begin{align*} \mathcal{C}_1 &= \left\{ \begin{bmatrix} 1 \\1\end{bmatrix}, \begin{bmatrix} 1 \\-1\end{bmatrix} \right\}\\ \mathcal{C}_0 &= \left\{ \begin{bmatrix} -1 \\1\end{bmatrix}, \begin{bmatrix} -1 \\-1\end{bmatrix} \right\} \end{align*} $$

$g(x) = \omega_0 + \omega_1 x_1 + \omega_2 x_2 = 0$ is a generic expression of a linear classification boundary. Show that the linear classification boundary by the SVM approach is $x_1 = 0$

Problem 6¶

(a) Derive the support vector machine algorithm in a form of the optimization.

(b) Derive the gradient descent algorithm for the logistic regression.

Problem 7¶

(a) Let $f(x) = \omega^Tx$ be a linear boundary of a binary classifier and $y \in \{-1,1\}$. Explain $y \cdot f(x) \leq 0$ when the classifier is wrong and $y \cdot f(x) > 0$ when the classifier is correct.

(b) Let $\ell \left(f(x), y \right)$ be a loss function. Ideally, the loss function for the classification problem is $u \left(-y \cdot f(x) \right)$ where $u(\cdot)$ is a step function. Explain the meaning of $J = \sum_{i=1}^{N} u \left(-y_i \cdot f(x_i) \right)$.

(c) Explain why the ideal loss function cannot be implemented when we want to use a gradient descent algorithm?

(d) In the logistic regression, we define

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

Then, show

$$ \frac{d\sigma(z)}{dz} = \sigma(z)(1-\sigma(z))$$

Note: this property is useful in a gradient descent for a parameter learning.

(e) Plot the ideal loss function and the loss function of the logistic regression. Compare the shape of both loss functions and explain why the logistic regression can approximately solve the classification problem.

Problem 8¶

In this problem we will refer to the binary classification task depicted in Figure 1, which we attempt to solve with the simple linear logistic regression model

$$ P(y = +1 \mid x,\,\omega_1,\, \omega_2) = \sigma(\omega_1 x_1 + \omega_2 x_2) = \frac{1}{1+\exp(-\omega_1 x_1 - \omega_2 x_2)}$$

(For simplicity we do not use the bias parameter $\omega_0$ or think as $\omega_0 = 0$). The training data can be separated with zero training error. See line $L_1$ in Figure 2 for instance.

Figure 1 $\qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad$ Figure 2

(a) Consider a regularization approach where we want to minimize

$$ \sum_{i=1}^{m}\log P\left( y_i \mid x_i, \, \omega_1, \, \omega_2\right) + \lambda \,\omega_2^2 $$

Note that only $\omega_2$ is penalized. We want to know which of the four lines in Figure 2 could arise as a result of such regularization. For each potential line $L_2$, $L_3$, or $L_4$, determine whether it can result from regularizing $\omega_2$. Then explain why.

(b) If we change the form of regularization to L1 norm

$$ \sum_{i=1}^{m}\log P\left( y_i \mid x_i,\omega_1, \omega_2\right) + \lambda \,\left( \lvert \omega_1 \rvert + \lvert \omega_2 \rvert \right)$$

Consider again the problem in Figure 2 and the same linear logistic regression model. As we increase the regularization parameter $\lambda$, which of the following scenarios do you expect to observe (choose only one), and explain why?

$\;$ 1) First $\omega_1$ will become 0, then $\omega_2$

$\;$ 2) $\omega_1$ and $\omega_2$ will become zero simultaneously

$\;$ 3) First $\omega_2$ will become 0, then $\omega_1$

$\;$ 4) None of the weights will become exactly zero, only smaller as $\lambda$ increases.

Problem 9-1: Multiclass class classification with SVM¶

[Classification with more than two classes]

We learned about a binary classification in class (SVM). However, we have to deal with multi-class classifications for some cases. I provide three groups of data in the following python code. Create your own algorithms in python to find linear classification surfaces and classes that each data belongs to.

(hint: use binary classification algorithms and extend it to multi-class classification problems)

import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

## generate three simulated clusters

mu1 = np.array([1, 7])
SIGMA1 = 0.8*np.array([[1, 1.5],
                       [1.5, 3]])
X1 = np.random.multivariate_normal(mu1, SIGMA1, 100)

mu2 = np.array([3, 4])
SIGMA2 = 0.3*np.array([[1, 0],
                       [0, 1]])
X2 = np.random.multivariate_normal(mu2, SIGMA2, 100)

mu3 = np.array([7, 5])
SIGMA3 = 0.3*np.array([[1, -1],
                       [-1, 2]])
X3 = np.random.multivariate_normal(mu3, SIGMA3, 50)

plt.figure(figsize = (10, 8))
plt.title('Generated Data', fontsize=15)
plt.plot(X1[:,0], X1[:,1], '.')
plt.plot(X2[:,0], X2[:,1], '.')
plt.plot(X3[:,0], X3[:,1], '.')
plt.xlabel('$X_1$', fontsize = 15)
plt.ylabel('$X_2$', fontsize = 15)
plt.axis('equal')
plt.grid(alpha = 0.3)
plt.axis([-2, 10, 1, 12])
plt.show()

# your code here
#

Problem 9-2: Multiclass classification with logistic regression¶

Solve the above problem using the logistic regression classifier.

# your code here
#

Problem 10: Nonlinear Classification¶

Suppose we have the following 1-D dataset for a binary and linear classiﬁcation:

$x = -1, t = 1$
$x = 1, t = 0$
$x = 3, t = 1$

(a) Argue brieﬂy (at most a few sentences) that this dataset is not linearly separable. Your answer should mention convexity.

Now suppose we apply the feature map and weights

$$\phi(x) = \begin{bmatrix} \phi_1(x) \\ \phi_2(x) \end{bmatrix} = \begin{bmatrix} x \\ x^2 \end{bmatrix} \qquad \omega = \begin{bmatrix} \omega_1 \\ \omega_2 \end{bmatrix} $$

(b) Assume we have no bias term, so that the parameters are $\omega_1$ and $\omega_2$. Write down the constraints on $\omega_1$ and $\omega_2$ corresponding to each training example, and then ﬁnd a pair of values $(r_1, r_2)$ that correctly classiﬁes all the examples. Remember that there is no bias term.

Problem 11¶

For the XOR problem, write your own answer to the below questions. The truth table of $x_1$ XOR $x_2$ shows that it outputs 1 whenever the inputs differ.

$x_1$	$x_2$	output
-1	-1	-1
1	-1	1
-1	1	1
1	1	-1

(a) Plot the two classes data and show that the data is not linearly separable.

# your code here
#

(b) Discuss whether mapping $\phi(x) = \begin{bmatrix} x_1 - x_2 \\ x_1x_2 \\ x_1+x_2 \end{bmatrix}$ can solve the XOR problem or not. If it can, plot the data on higher dimensional space and find hyperplane. (Use the SVM classifier, and you do not need to visualize the classification hyperplane)

# your code here
#

Problem 12¶

In this problem, we are going to classify EMNIST alphabet images. Images are 28 $\times$ 28 size and labels are 0 to 25. Let's load the dataset.

Download data files
- Alphabet Handwritten_images
- Alphabet Handwritten_labels

import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

img_file = './data_files/Alphabet_Handwritten_1000_images.npy'
lbl_file = './data_files/Alphabet_Handwritten_1000_labels.npy'

x_total = np.load(img_file) / 255
y_total = np.load(lbl_file)

x_train, _, y_train, _ = train_test_split(x_total, y_total, test_size = 0.3, random_state = 77)

print('Trainset shape:', x_train.shape, y_train.shape)

del x_total, y_total

Plot randomly selected alphabet images for each character.

# fill out the blank

plt.figure(figsize = (18, 6))
for i in range(26):
    plt.subplot(3, 9, i+1)

    idx =

    plt.imshow(x_train[idx].reshape(28, 28), 'gray_r')
    plt.title('Class: {}'.format(i))
    plt.xticks([]); plt.yticks([])
plt.show()

Now, we are going to classify these alphabet images into two classes, consonant and vowel (A, E, I, O, U). Do re-labelling to the train labels.

# fill out the blank

train_label = []

for i in y_train:
    if ''' write your code here''':
        train_label.append([1])
    else:
        train_label.append([0])

train_label = np.asarray(train_label).reshape(-1,)

Train a MLP model that classifies 26 alphabets into 2 classes: consonant, and vowel.

Note that model paramters are

solver = 'sgd'
hidden_layer_sizes = (300, 100,)
activation = 'logistic'
max_iter = 100

Train your MLP model with randomly selected 6000 data from x_train.

Check your MLP model with given test data

## Write your code here
#

testx, testy = np.load('./data_files/test_x.npy'), np.load('./data_files/test_y.npy')
print('Accuracy : ', sum(clf.predict(testx) == testy) / len(testy) * 100)

As the dataset is biased to cosonant, the model might be trained to represent biased results. Re-train your MLP model.

Note that model paramters are

solver = 'sgd'
hidden_layer_sizes = (300, 100,)
activation = 'logistic'
max_iter = 100

Train your MLP model with equally selected 6000 data from each class .

Train data should include 3000 consonants and 3000 vowels.

Check your MLP model with given test data

## Write your code here
#

testx, testy = np.load('./data_files/test_x.npy'), np.load('./data_files/test_y.npy')
print('Accuracy : ', sum(clf.predict(testx) == testy) / len(testy) * 100)