Prof. Seungchul Lee

http://iailab.kaist.ac.kr/

Industrial AI Lab at KAIST

http://iailab.kaist.ac.kr/

Industrial AI Lab at KAIST

For your handwritten solutions, please scan or take a picture of them. Alternatively, you can write them in markdown if you prefer.

Only .ipynb files will be graded for your code.

- Ensure that your NAME and student ID are included in your .ipynb files. ex) SeungchulLee_20241234_HW01.ipynb

Compress all the files into a single .zip file.

- In the .zip file's name, include your NAME and student ID.

ex) SeungchulLee_20241234_HW01.zip

- Submit this .zip file on KLMS

Do not submit a printed version of your code, as it will not be graded.

(a) Let $f(x_1,x_2) = 1 + 7x_1 + 4x_2 = 0$ be the hyperplane (or decision surface in classification problems). Compute the shortest distances from $x_a = \begin{bmatrix} 2 \\1\end{bmatrix}$ and $x_b = \begin{bmatrix} 1 \\ -8\end{bmatrix}$ to the above hyperplane, repectively.

(b) Let $f(x_1,x_2) = 2x_1 + 3x_2 + 1 = 0$ be the hyperplane (or decision surface in classification problems). Compute the shortest distances from $x_a = \begin{bmatrix} 2 \\1\end{bmatrix}$ and $x_b = \begin{bmatrix} 1 \\ -1\end{bmatrix}$ to the above hyperplane, respectively.

Find the solution $r$ of the below optimization problem with the following objective function and constraint. ($x,\omega, \omega_0 \in \mathbb{R}^n$)

$$ \begin{align*} r = \;&\min_x \;\rVert x \lVert_2\\ \text{s.t.}&\; \omega^T x + \omega_0 = 0 \end{align*}$$

Suppose we have the following training examples in a 2-dimensional input space, and no bias term. Following our discussion of the perceptron algorithm, we use $t = −1$ rather than $t = 0$ to denote the negative class.

Recall the perceptron algorithm from lecture. It repeatedly applies the following procedure, stopping when the weights are strictly within the feasible region (i.e., not on the boundary):

Here we work through the algorithm by hand.

(a) Sketch out the problem in weight space. In particular: draw and label the axes, draw the half-space corresponding to each of the two training examples, and shade the feasible region. (Remember that there is no bias term.)

(b) Simulate the perceptron algorithm by hand, starting from the initial weights $\omega_1 = 0, \omega_2 = −2$. On the plot from part (a), mark an $X$ on every point in weight space visited by the algorithm until it stops. You do not need to provide anything for this part other than the $X$’s on the plot.

Hint: the weights are updated 12 times. It will be tedious if you compute all the updates using arithmetic; instead, plot the ﬁrst few updates as you go along, and you should start to notice a visual pattern.

In this problem, we will try to classify handwritten digits. For the sake of simplicity, we simplify this digit classification problem into a binary classification between digit 0 and digit 1.

Data | Data dexcription |
---|---|

0 | 1000 images (28×28 pixels) of handwritten digit 0 |

1 | 1000 images (28×28 pixels) of handwritten digit 1 |

To read the files in Python, use the following code:

In [ ]:

```
## you do not need to change anyting
import numpy as np
import matplotlib.pyplot as plt
import cvxpy as cvx
%matplotlib inline
from six.moves import cPickle
data = cPickle.load(open('./data_files/data_zero_one.pkl', 'rb'))
data0 = data['0']
data1 = data['1']
```

Here, we will load 1000 of data for ‘0’ and ‘1’, respectively. To display the image use `plt.imshow(img)`

.

Note: make sure you are reading the files correctly. Check by displaying the first few and the last few images in each class.

Note: calculations are more manageable if you go though and convert each of the pixels in $28\times28$ matrix to a binary value first.

(a) Display the few random images in each class.

In [ ]:

```
## you do not need to change anyting
plt.figure(figsize = (12, 8))
plt.subplot(2,4,1), plt.imshow(data0[np.random.randint(1000)], 'gray'), plt.axis('off')
plt.subplot(2,4,2), plt.imshow(data0[np.random.randint(1000)], 'gray'), plt.axis('off')
plt.subplot(2,4,3), plt.imshow(data0[np.random.randint(1000)], 'gray'), plt.axis('off')
plt.subplot(2,4,4), plt.imshow(data0[np.random.randint(1000)], 'gray'), plt.axis('off')
plt.subplot(2,4,5), plt.imshow(data1[np.random.randint(1000)], 'gray'), plt.axis('off')
plt.subplot(2,4,6), plt.imshow(data1[np.random.randint(1000)], 'gray'), plt.axis('off')
plt.subplot(2,4,7), plt.imshow(data1[np.random.randint(1000)], 'gray'), plt.axis('off')
plt.subplot(2,4,8), plt.imshow(data1[np.random.randint(1000)], 'gray'), plt.axis('off')
plt.show()
```

(b) Convert each of the pixels in $28×28$ matrix to a binary value.

In [ ]:

```
## you do not need to change anyting
data0 = data0 > 125
data1 = data1 > 125
```

In machine learning and pattern recognition, a feature is an individual measurable property or characteristic of a phenomenon being observed. Choosing informative, discriminating and independent features from raw data (or input) is a crucial step for effective algorithms in pattern recognition, classification and regression.

Now we must select the own ‘features’ from image data to detect digit 0 and digit 1. Two features are recommended as follows:

(a) Feature 1: the total average pixels located at the center of the image (`img[10:20,10:20]`

).

In [ ]:

```
## your code here
#
```

(b) Feature 2: the total average pixels over the entire image.

In [ ]:

```
## your code here
#
```

(c) Include the ones as the bias term.

$$\phi(x) = \begin{bmatrix} 1 \\ \text{feature1}\\ \text{feature2} \end{bmatrix} \quad \implies \quad X \; (\text{or } \Phi) = \begin{bmatrix} \phi_1^T\\ \vdots \\ \phi_{1000}^T \\ \phi_{1001}^T \\ \vdots \\ \phi_{2000}^T \end{bmatrix}$$

You should end up with a $2000\times3$ input matrix with the first $1000$ rows correspond to all of the ‘`data0`

’ and the second 1000 rows correspond to all of the given ‘`data1`

’. This matrix is matrix $X$ (or $\Phi$) which we learned in a class.

In [ ]:

```
## your code here
#
```

Plot the features to see if classes are separable. The expected plot is the following:

In [ ]:

```
## your code here
#
```

We would like to use the perceptron algorithm to classify digit 0 and digit 1 in the training set.

You have to initialize $\omega$. We usually set $\omega$ to a zero vector as an initial guess.

In [ ]:

```
## your code here
#
```

We update $\omega$ when the prediction is wrong. The update rule is the following:

$$ \omega \leftarrow \omega + y \cdot x$$

We will repeat the same iteration for the number of data points. Here’s a pseudo code:

```
For k = 1:100 {
For j = 1:100 {
i = random integer selection between 1 and 2000
compute yhat(i)
if yhat(i) is wrong {
w = w + y(i)x(i)
}
}
}
```

For every $k$'s iteration, count and store how many predictions are wrong to see if your classifier converges to somewhere or not.

In [ ]:

```
## your code here
#
```

You are asked to plot two graphs. First, plot the number of wrong predictions with respect to every $k$'s iteration. The graph is expected as follows:

Second, plot the classifier (decision boundary). Note that the decision boundary is given:

$$\omega_1 x_1 + \omega_2x_2 + \omega_0 = 0$$

In [ ]:

```
## your code here
#
```

We would like to use the SVM algorithm to classify digit 0 and digit 1 in the training set.

Find a classifier using the support vector machine which is equivalent to solving the following optimization problem:

$$\begin{align*}
\text{minimize } & \lVert \omega \rVert_2 + \gamma(1^Tu + 1^Tv)\\
\text{subject to }& \\
& X_0^T \omega \geq 1 - u\\
& X_1^T \omega \leq - (1-v)\\
& u \geq 0\\
& v \geq 0
\end{align*}$$

Here, $X_0$ and $X_1$ are $1000\times3$ matrix for each class. You can divide the above matrix $X$ into two parts to reuse it.

In [ ]:

```
## your code here
#
```

Run CVXPY to find $\omega$.

In [ ]:

```
## your code here
#
```

Now you have an optimal $\omega$ . Plot a decision boundary. Note that our decision boundary is defined as follows:

$$ \omega_1 x_1 + \omega_2 x_2 + \omega_0 = 0$$

In [ ]:

```
## your code here
#
```

Consider the binary classification task that consists of the following points:

$$ \begin{align*} \mathcal{C}_1 &= \left\{ \begin{bmatrix} 1 \\1\end{bmatrix}, \begin{bmatrix} 1 \\-1\end{bmatrix} \right\}\\ \mathcal{C}_0 &= \left\{ \begin{bmatrix} -1 \\1\end{bmatrix}, \begin{bmatrix} -1 \\-1\end{bmatrix} \right\} \end{align*} $$

$g(x) = \omega_0 + \omega_1 x_1 + \omega_2 x_2 = 0$ is a generic expression of a linear classification boundary. Show that the linear classification boundary by the SVM approach is $x_1 = 0$

(a) Derive the support vector machine algorithm in a form of the optimization.

(b) Derive the gradient descent algorithm for the logistic regression.

(a) Let $f(x) = \omega^Tx$ be a linear boundary of a binary classifier and $y \in \{-1,1\}$. Explain $y \cdot f(x) \leq 0$ when the classifier is wrong and $y \cdot f(x) > 0$ when the classifier is correct.

(b) Let $\ell \left(f(x), y \right)$ be a loss function. Ideally, the loss function for the classification problem is $u \left(-y \cdot f(x) \right)$ where $u(\cdot)$ is a step function. Explain the meaning of $J = \sum_{i=1}^{N} u \left(-y_i \cdot f(x_i) \right)$.

(c) Explain why the ideal loss function cannot be implemented when we want to use a gradient descent algorithm?

(d) In the logistic regression, we define

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

Then, show

$$ \frac{d\sigma(z)}{dz} = \sigma(z)(1-\sigma(z))$$

Note: this property is useful in a gradient descent for a parameter learning.

(e) Plot the ideal loss function and the loss function of the logistic regression. Compare the shape of both loss functions and explain why the logistic regression can approximately solve the classification problem.

In this problem we will refer to the binary classification task depicted in Figure 1, which we attempt to solve with the simple linear logistic regression model

$$ P(y = +1 \mid x,\,\omega_1,\, \omega_2) = \sigma(\omega_1 x_1 + \omega_2 x_2) = \frac{1}{1+\exp(-\omega_1 x_1 - \omega_2 x_2)}$$

(For simplicity we do not use the bias parameter $\omega_0$ or think as $\omega_0 = 0$). The training data can be separated with zero training error. See line $L_1$ in Figure 2 for instance.

(a) Consider a regularization approach where we want to minimize

$$ \sum_{i=1}^{m}\log P\left( y_i \mid x_i, \, \omega_1, \, \omega_2\right) + \lambda \,\omega_2^2 $$

Note that only $\omega_2$ is penalized. We want to know which of the four lines in Figure 2 could arise as a result of such regularization. For each potential line $L_2$, $L_3$, or $L_4$, determine whether it can result from regularizing $\omega_2$. Then explain why.

(b) If we change the form of regularization to L1 norm

$$ \sum_{i=1}^{m}\log P\left( y_i \mid x_i,\omega_1, \omega_2\right) + \lambda \,\left( \lvert \omega_1 \rvert + \lvert \omega_2 \rvert \right)$$

Consider again the problem in Figure 2 and the same linear logistic regression model. As we increase the regularization parameter $\lambda$, which of the following scenarios do you expect to observe (choose only one), and explain why?

$\;$ 1) First $\omega_1$ will become 0, then $\omega_2$

$\;$ 2) $\omega_1$ and $\omega_2$ will become zero simultaneously

$\;$ 3) First $\omega_2$ will become 0, then $\omega_1$

$\;$ 4) None of the weights will become exactly zero, only smaller as $\lambda$ increases.

[Classification with more than two classes]

We learned about a binary classification in class (SVM). However, we have to deal with multi-class classifications for some cases. I provide three groups of data in the following python code. Create your own algorithms in python to find linear classification surfaces and classes that each data belongs to.

(hint: use binary classification algorithms and extend it to multi-class classification problems)

In [ ]:

```
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
## generate three simulated clusters
mu1 = np.array([1, 7])
SIGMA1 = 0.8*np.array([[1, 1.5],
[1.5, 3]])
X1 = np.random.multivariate_normal(mu1, SIGMA1, 100)
mu2 = np.array([3, 4])
SIGMA2 = 0.3*np.array([[1, 0],
[0, 1]])
X2 = np.random.multivariate_normal(mu2, SIGMA2, 100)
mu3 = np.array([7, 5])
SIGMA3 = 0.3*np.array([[1, -1],
[-1, 2]])
X3 = np.random.multivariate_normal(mu3, SIGMA3, 50)
plt.figure(figsize = (10, 8))
plt.title('Generated Data', fontsize=15)
plt.plot(X1[:,0], X1[:,1], '.')
plt.plot(X2[:,0], X2[:,1], '.')
plt.plot(X3[:,0], X3[:,1], '.')
plt.xlabel('$X_1$', fontsize = 15)
plt.ylabel('$X_2$', fontsize = 15)
plt.axis('equal')
plt.grid(alpha = 0.3)
plt.axis([-2, 10, 1, 12])
plt.show()
```

In [ ]:

```
# your code here
#
```

Solve the above problem using the logistic regression classifier.

In [ ]:

```
# your code here
#
```

Suppose we have the following 1-D dataset for a binary and linear classiﬁcation:

- $x = -1, t = 1$
- $x = 1, t = 0$
- $x = 3, t = 1$

(a) Argue brieﬂy (at most a few sentences) that this dataset is not linearly separable. Your answer should mention convexity.

Now suppose we apply the feature map and weights

$$\phi(x) = \begin{bmatrix} \phi_1(x) \\ \phi_2(x) \end{bmatrix} = \begin{bmatrix} x \\ x^2 \end{bmatrix} \qquad \omega = \begin{bmatrix} \omega_1 \\ \omega_2 \end{bmatrix} $$

(b) Assume we have no bias term, so that the parameters are $\omega_1$ and $\omega_2$. Write down the constraints on $\omega_1$ and $\omega_2$ corresponding to each training example, and then ﬁnd a pair of values $(r_1, r_2)$ that correctly classiﬁes all the examples. Remember that there is no bias term.

For the XOR problem, write your own answer to the below questions. The truth table of $x_1$ XOR $x_2$ shows that it outputs 1 whenever the inputs differ.

$x_1$ | $x_2$ | output |
---|---|---|

-1 | -1 | -1 |

1 | -1 | 1 |

-1 | 1 | 1 |

1 | 1 | -1 |

(a) Plot the two classes data and show that the data is not linearly separable.

In [ ]:

```
# your code here
#
```

(b) Discuss whether mapping $\phi(x) = \begin{bmatrix} x_1 - x_2 \\ x_1x_2 \\ x_1+x_2 \end{bmatrix}$ can solve the XOR problem or not. If it can, plot the data on higher dimensional space and find hyperplane. (Use the SVM classifier, and you do not need to visualize the classification hyperplane)

In [ ]:

```
# your code here
#
```

In this problem, we are going to classify EMNIST alphabet images. Images are 28 $\times$ 28 size and labels are 0 to 25. Let's load the dataset.

- Download data files

In [ ]:

```
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
img_file = './data_files/Alphabet_Handwritten_1000_images.npy'
lbl_file = './data_files/Alphabet_Handwritten_1000_labels.npy'
x_total = np.load(img_file) / 255
y_total = np.load(lbl_file)
x_train, _, y_train, _ = train_test_split(x_total, y_total, test_size = 0.3, random_state = 77)
print('Trainset shape:', x_train.shape, y_train.shape)
del x_total, y_total
```

- Plot randomly selected alphabet images for each character.

In [ ]:

```
# fill out the blank
plt.figure(figsize = (18, 6))
for i in range(26):
plt.subplot(3, 9, i+1)
idx =
plt.imshow(x_train[idx].reshape(28, 28), 'gray_r')
plt.title('Class: {}'.format(i))
plt.xticks([]); plt.yticks([])
plt.show()
```

- Now, we are going to classify these alphabet images into two classes, consonant and vowel (A, E, I, O, U). Do re-labelling to the train labels.

In [ ]:

```
# fill out the blank
train_label = []
for i in y_train:
if ''' write your code here''':
train_label.append([1])
else:
train_label.append([0])
train_label = np.asarray(train_label).reshape(-1,)
```

- Train a MLP model that classifies 26 alphabets into 2 classes: consonant, and vowel.

Note that model paramters are

`solver = 'sgd'`

`hidden_layer_sizes = (300, 100,)`

`activation = 'logistic'`

`max_iter = 100`

Train your MLP model with randomly selected 6000 data from `x_train`

.

Check your MLP model with given test data

In [ ]:

```
## Write your code here
#
```

In [ ]:

```
testx, testy = np.load('./data_files/test_x.npy'), np.load('./data_files/test_y.npy')
print('Accuracy : ', sum(clf.predict(testx) == testy) / len(testy) * 100)
```

- As the dataset is biased to cosonant, the model might be trained to represent biased results. Re-train your MLP model.

Note that model paramters are

`solver = 'sgd'`

`hidden_layer_sizes = (300, 100,)`

`activation = 'logistic'`

`max_iter = 100`

Train your MLP model with equally selected 6000 data from each class .

- Train data should include 3000 consonants and 3000 vowels.

Check your MLP model with given test data

In [ ]:

```
## Write your code here
#
```

In [ ]:

```
testx, testy = np.load('./data_files/test_x.npy'), np.load('./data_files/test_y.npy')
print('Accuracy : ', sum(clf.predict(testx) == testy) / len(testy) * 100)
```