Machine Learning for Mechanical Engineering

Midterm Exam: Part I

04/17/2024, 6:00 PM to 7:50 PM (110 minutes)

Prof. Seungchul Lee
http://iailab.kaist.ac.kr/
Industrial AI Lab at KAIST

Problem 01

Find $J$ for the optimization problem below, with the following objective function and constraint. ($x,\omega\in \mathbb{R}^n, \omega_0\in \mathbb{R}$)


$$ \begin{align*} J = \;\min_x \quad & \lVert x \rVert_2\\ \text{subject to}\quad & \omega^T x + \omega_0 = 0 \end{align*}$$

Solution 01

Since $x$ is constrained to lie on the hyperplane $\omega^T x + \omega_0 = 0$, the objective can be interpreted as the shortest distance $J^{*}$ from the origin to that hyperplane. Geometrically (or intuitively), the distance is minimized when the vector $x$ is orthogonal to the hyperplane, i.e., parallel to $\omega$.

The optimal $x^*$ that minimizes $\|x\|_2$:

$$ \omega^T x^* + \omega_0 = 0 \\ x^* = -\omega_0 \frac{\omega}{\|\omega\|^2} $$

Calculation of $J^*$: $$ J^* = \|x^*\|_2 = \left\|-\omega_0 \frac{\omega}{\|\omega\|^2}\right\|_2 \\ J^* = \left| \omega_0 \right| \left\|\frac{\omega}{\|\omega\|^2}\right\|_2 \\ J^* = \frac{|\omega_0|}{\|\omega\|} $$
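For reference, this result can be checked numerically (a sketch, not part of the original exam; the dimension $n$ and the values of $\omega, \omega_0$ below are arbitrary choices):

```python
# Sanity check: for a random hyperplane w^T x + w0 = 0, the closed-form minimizer
# x* = -w0 * w / ||w||^2 lies on the hyperplane and its norm equals |w0| / ||w||.
import numpy as np

rng = np.random.default_rng(0)
n = 4                                  # arbitrary dimension for the check
w = rng.normal(size=n)
w0 = rng.normal()

x_star = -w0 * w / np.dot(w, w)        # closed-form minimizer
print(np.dot(w, x_star) + w0)          # ~0: x* satisfies the constraint
print(np.linalg.norm(x_star))          # J* = ||x*||_2
print(abs(w0) / np.linalg.norm(w))     # matches |w0| / ||w||
```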

Problem 02

Solve the following optimization problem


$$\begin{align*} \text{minimize} \quad & \frac{1}{2}x^TPx+q^Tx+r \\ \text{subject to} \quad & -1 \leq x_i \leq 1, \quad i = 1,2,3,\\ \end{align*} $$

where

$$ P = \begin{bmatrix} 13 & 12 & -2\\ 12 & 17 & 6\\ -2 & 6 & 12\end{bmatrix} , \quad\quad q = \begin{bmatrix} -22.0\\ -14.5\\ 13.0\end{bmatrix} , \quad\quad r = 1. $$

Prove that $x^* = \left(1,\frac{1}{2},-1 \right)$ is optimal for this optimization problem.

(Hint: the first-order change in the value of a function $f(x)$ when moving from a given point $x^*$ toward an arbitrary point $x$ is $\nabla f(x^*)^T(x - x^*)$.)

Solution 02

Let $x = \begin{bmatrix} x_1\\ x_2\\ x_3\end{bmatrix}$, then


$$ \begin{align*} f &= \frac{1}{2} \begin{bmatrix} x_1 & x_2 & x_3 \end{bmatrix} \begin{bmatrix} 13 & 12 & -2\\ 12 & 17 & 6\\ -2 & 6 & 12 \end{bmatrix} \begin{bmatrix} x_1\\ x_2\\ x_3 \end{bmatrix} + \begin{bmatrix} -22.0 & -14.5 & 13.0 \end{bmatrix} \begin{bmatrix} x_1\\ x_2\\ x_3 \end{bmatrix} + 1\\\\ & = 6.5x_1^2 + 12x_1x_2 - 2x_1x_3 + 8.5x_2^2 + 6x_2x_3 + 6x_3^2 - 22x_1 - 14.5x_2 + 13x_3 + 1\\ \end{align*} $$

The partial derivative of $f$ with respect to each element of $x$, set to zero, gives the following.


$$ \begin{align*} \frac{\partial f}{\partial x_1} &= 13x_1 + 12x_2 - 2x_3 - 22.0 = 0\\ \frac{\partial f}{\partial x_2} &= 12x_1 + 17x_2 + 6x_3 - 14.5 = 0\\ \frac{\partial f}{\partial x_3} &= -2x_1 + 6x_2 + 12x_3 + 13.0 = 0\\ \end{align*} $$

This is expressed in matrix form as follows.


$$ \begin{align*} \begin{bmatrix} 13 & 12 & -2\\ 12 & 17 & 6\\ -2 & 6 & 12\\ \end{bmatrix} \begin{bmatrix} x_1\\ x_2\\ x_3\\ \end{bmatrix} = \begin{bmatrix} 22\\ 14.5\\ -13\\ \end{bmatrix} \end{align*} $$

The gradient at the given point $x^* = \left(1,\frac{1}{2},-1 \right)$ is calculated as follows.


$$ \begin{align*} \nabla f(x^*) = \begin{bmatrix} 13 & 12 & -2\\ 12 & 17 & 6\\ -2 & 6 & 12\\ \end{bmatrix} \begin{bmatrix} 1\\ 0.5\\ -1\\ \end{bmatrix} - \begin{bmatrix} 22\\ 14.5\\ -13\\ \end{bmatrix} = \begin{bmatrix} -1\\ 0\\ 2\\ \end{bmatrix} \end{align*} $$

From the given point $x^*$, the first-order change of $f$ toward an arbitrary point $x$ restricted to $-1 \leq x_i \leq 1, \; i = 1,2,3$, is as follows.


$$ \begin{align*} \nabla f(x^*)^T(x - x^*) = -1(x_1 - 1) + 2(x_3 + 1) \geq 0 \quad\quad (\because x_1 - 1 \leq 0,\quad x_3 + 1 \geq 0) \end{align*} $$

Since the problem is convex ($P \succeq 0$), $f(x) \geq f(x^*) + \nabla f(x^*)^T(x - x^*) \geq f(x^*)$ for every feasible $x$. As a result, $x^*$ is an optimal point: the objective cannot decrease by moving to any point that satisfies the constraints.
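As a cross-check (not part of the original exam), the box-constrained QP can be solved numerically; the sketch below assumes the `cvxpy` package is available.

```python
# Solve the QP of Problem 02 numerically; the optimizer should return (1, 0.5, -1).
import cvxpy as cp
import numpy as np

P = np.array([[13.0, 12.0, -2.0],
              [12.0, 17.0,  6.0],
              [-2.0,  6.0, 12.0]])
q = np.array([-22.0, -14.5, 13.0])

x = cp.Variable(3)
objective = cp.Minimize(0.5 * cp.quad_form(x, P) + q @ x + 1)
constraints = [x >= -1, x <= 1]
cp.Problem(objective, constraints).solve()

print(x.value)   # approximately [ 1.   0.5  -1. ]
```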

Problem 03

Fit a straight line $y = \theta_1 x + \theta_0$ to the points $(1,2), (2,3), (0,0), (-3,-5)$. You must solve by recasting this problem as least squares.

Solution 03


Recast the fit as the least-squares system $A\theta = b$ with $\theta = \begin{bmatrix}\theta_1\\ \theta_0\end{bmatrix}$, and solve the normal equations $A^TA\,\theta = A^Tb$:

$$\begin{bmatrix}1&1\\2&1\\0&1\\-3&1\end{bmatrix}\begin{bmatrix}\theta_1\\\theta_0\end{bmatrix}=\begin{bmatrix}2\\3\\0\\-5\end{bmatrix}$$
$$A^TA=\begin{bmatrix}14&0\\0&4\end{bmatrix}, \qquad A^Tb=\begin{bmatrix}23\\0\end{bmatrix}$$
$$\begin{bmatrix}14&0\\0&4\end{bmatrix}\begin{bmatrix}\theta_1\\\theta_0\end{bmatrix} = \begin{bmatrix}23\\0\end{bmatrix}$$
$$\begin{bmatrix}\theta_1\\\theta_0\end{bmatrix}=\begin{bmatrix}\frac{23}{14}\\0\end{bmatrix}$$
$$\therefore y = \frac{23}{14}x$$
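The same answer can be confirmed numerically (a sketch, assuming NumPy):

```python
# Least-squares fit of y = theta1 * x + theta0 to the four given points.
import numpy as np

x = np.array([1.0, 2.0, 0.0, -3.0])
y = np.array([2.0, 3.0, 0.0, -5.0])

A = np.column_stack([x, np.ones_like(x)])     # columns correspond to [theta1, theta0]
theta, *_ = np.linalg.lstsq(A, y, rcond=None)
print(theta)                                  # [1.6428..., 0.0]  i.e.  [23/14, 0]
```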

Problem 04

  1. What are the advantages (or characteristics) of logistic regression compared to the perceptron?

  2. What is the advantage of having big data?

  3. Why do we divide data into a training set and a validation (testing) set?

  4. Suggest possible approaches to avoid overfitting.

Solution 04

  1. Logistic regression can find a separating hyperplane even when the data are not perfectly separable, and it makes use of information from all of the data points (not just the misclassified ones).

  2. More data helps prevent overfitting (i.e., the model generalizes better).

  3. To detect overfitting: performance on data that was not used for training reveals whether the model generalizes.

  4. Early stopping, regularization, data augmentation, dropout, ...

Problem 05

Suppose we have a linear regression model with two weights and no bias term:


$$y = \omega_1 x_1 + \omega_2 x_2$$

and the usual loss function $\ell(y,\hat y) = \frac{1}{2}(y− \hat y)^2$ and cost $\mathcal{C}(\omega_1, \omega_2) = \frac{1}{m} \sum_{i}\ell(y^{(i)},\hat y ^{(i)})$. Suppose we have a training set consisting of $m = 3$ examples:

  • $x^{(1)} = [1,0]^T,\; y^{(1)} = 2$

  • $x^{(2)} = [0,1]^T,\; y^{(2)} = 2$

  • $x^{(3)} = [\sqrt{3},0]^T,\; y^{(3)} = 0$


Let's sketch one of the contours in weight space.


  1. Write the cost in the form below using $\ell(y,\hat y) = \frac{1}{2}(y - \hat y)^2$ and $\mathcal{C}(\omega_1, \omega_2) = \frac{1}{m} \sum_{i}\ell(y^{(i)},\hat y ^{(i)})$, and find $c_1,d_1,c_2,d_2,c_0$.

$$\mathcal{C} = c_1 (\omega_1 - d_1)^2 + c_2 (\omega_2 - d_2)^2 + c_0$$

  2. Since $c_1, c_2 > 0$, this corresponds to an axis-aligned ellipse in $\omega_1$ and $\omega_2$. Sketch the ellipse by hand for $\mathcal{C} = 4$. Label the center and radii of the ellipse.

Solution 05

$$\begin{align*} \mathcal{C}(\omega_1, \omega_2) & = \frac{1}{m} \sum_{i}\ell(y^{(i)},\hat y^{(i)}) \\ & = \frac{1}{3}\left[\ell(y^{(1)},\hat y^{(1)})+\ell(y^{(2)},\hat y^{(2)}) +\ell(y^{(3)},\hat y^{(3)}) \right] \\ & = \frac{1}{3}\left[\frac{1}{2}\left(\omega_1-2\right)^2+\frac{1}{2}\left(\omega_2-2\right)^2+\frac{1}{2}\left(\sqrt{3}\omega_1\right)^2\right]\\ & = \frac{1}{6}\left[\omega_1^2-4\omega_1+4+3\omega_1^2+\left(\omega_2-2\right)^2\right]\\ & = \frac{1}{6}\left[\left(2\omega_1-1\right)^2+\left(\omega_2-2\right)^2 + 3 \right] \\ & = \frac{1}{6}\cdot 4\left(\omega_1-\frac{1}{2}\right)^2+\frac{1}{6}\left(\omega_2-2\right)^2 + \frac{3}{6} \\ & = \frac{2}{3}\left(\omega_1-\frac{1}{2}\right)^2+\frac{1}{6}\left(\omega_2-2\right)^2 + \frac{1}{2} \end{align*}$$
$$ \therefore c_1 = \frac{2}{3},\; c_2 = \frac{1}{6},\; c_0=\frac{1}{2},\; d_1=\frac{1}{2},\; d_2=2 $$
$$ \begin{align*} \mathcal{C} = \frac{2}{3}&\left(\omega_1-\frac{1}{2}\right)^2+\frac{1}{6}\left(\omega_2-2\right)^2 + \frac{1}{2} = 4 \\ \implies & \frac{\left(\omega_1-\frac{1}{2}\right)^2}{3/2}+\frac{\left(\omega_2-2\right)^2}{6} = \frac{7}{2} \\ & \frac{\left(\omega_1-\frac{1}{2}\right)^2}{21/4}+\frac{\left(\omega_2-2\right)^2}{21} = 1 \end{align*}$$
$$ \text{Center }= \left(\frac{1}{2},\; 2\right) \\ \text{long radius }= \sqrt{21},\quad \text{short radius }= \frac{\sqrt{21}}{2} $$
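The completed-square coefficients can be verified numerically (a sketch, assuming NumPy):

```python
# Compare the cost evaluated from the data with the completed-square form of Solution 05.
import numpy as np

X = np.array([[1.0, 0.0], [0.0, 1.0], [np.sqrt(3), 0.0]])
y = np.array([2.0, 2.0, 0.0])

def cost(w1, w2):
    pred = X @ np.array([w1, w2])            # model output for each example
    return np.mean(0.5 * (y - pred) ** 2)

def cost_form(w1, w2):
    return (2/3) * (w1 - 0.5) ** 2 + (1/6) * (w2 - 2.0) ** 2 + 0.5

for w1, w2 in [(0.0, 0.0), (1.0, -1.0), (0.5, 2.0)]:
    print(cost(w1, w2), cost_form(w1, w2))   # the two values agree at every test point
```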

Problem 06

Consider the binary classification task that consists of the following points:


$$ \begin{align*} \mathcal{C}_1 &= \left\{ \begin{bmatrix} 1 \\1\end{bmatrix}, \begin{bmatrix} 1 \\-1\end{bmatrix} \right\}\\ \mathcal{C}_0 &= \left\{ \begin{bmatrix} -1 \\1\end{bmatrix}, \begin{bmatrix} -1 \\-1\end{bmatrix} \right\} \end{align*} $$

$g(x) = \omega_0 + \omega_1 x_1 + \omega_2 x_2 = 0$ is a generic expression of a linear classification boundary. Show that the linear classification boundary by the SVM approach is $x_1 = 0$

Solution 06

The SVM maximizes the margin (the minimum distance from the data points to the boundary). First, assume that the boundary $g(x) = 0$ is a line with positive slope. Then,


$$\begin{align*} g \left(x \right) = \omega_0 + \omega_1 x_1 + \omega_2 x_2 = 0 \quad \Longrightarrow \quad x_2 = - \frac{\omega_1}{\omega_2} x_1 - \frac{\omega_0}{\omega_2} \end{align*}$$

For the slope to be positive, $-\dfrac{\omega_1}{\omega_2} > 0$, so $\omega_1$ and $\omega_2$ must have opposite signs: $\omega_1\omega_2 < 0$.


Under this assumption, $(1,1)$ and $(-1,-1)$ are the closest points to the boundary on their respective sides, so they must be the support vectors, and their distances from the boundary must be equal:


$$\begin{align*} \frac{\left\vert \omega_0 + \omega_1 + \omega_2 \right\vert}{\sqrt{\omega_1^2 + \omega_2^2}} &= \frac{\left\vert \omega_0 - \omega_1 - \omega_2 \right\vert}{\sqrt{\omega_1^2 + \omega_2^2}} \\ \left\vert \omega_0 + \omega_1 + \omega_2 \right\vert &= \left\vert \omega_0 - \omega_1 - \omega_2 \right\vert \\ 4\,\omega_0\left(\omega_1 + \omega_2\right) &= 0 \\ \therefore \quad \omega_0 = 0 \quad &\text{or} \quad \omega_1 = -\omega_2 \end{align*}$$

But $\omega_1 = -\omega_2$ gives a line of slope $1$, for which $g(1,1) = g(-1,-1) = \omega_0$, so the two support vectors could not lie on opposite sides of the boundary. Hence $\omega_0 = 0$.


Now the problem becomes


$$\begin{align*} \text{maximize} \quad \frac{\left\vert \omega_1 + \omega_2 \right\vert}{\sqrt{\omega_1^2 + \omega_2^2}} \end{align*}$$
$$\begin{align*} \left\vert \omega_1 + \omega_2 \right\vert = \sqrt{\left(\omega_1 + \omega_2 \right)^2} = \sqrt{\omega_1^2 + 2\omega_1\omega_2 + \omega_2^2} < \sqrt{\omega_1^2 + \omega_2^2} \quad \left(\because 2\omega_1\omega_2 < 0 \right) \end{align*}$$

The maximum distance is $1$, attained only when $\omega_1 = 0$ or $\omega_2 = 0$.


If $\omega_1 = 0$, the boundary is $g(x)\colon x_2 = 0$, which cannot separate the two classes. So $\omega_2 = 0$, and the boundary is $g(x)\colon x_1 = 0$.
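As a cross-check (not part of the original exam), a hard-margin linear SVM fitted to these four points recovers the boundary $x_1 = 0$; the sketch below assumes scikit-learn is available and uses a large $C$ to approximate the hard margin.

```python
# Fit a linear SVM to the four points of Problem 06 and inspect the boundary.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 1], [1, -1], [-1, 1], [-1, -1]], dtype=float)
y = np.array([1, 1, 0, 0])

clf = SVC(kernel="linear", C=1e6).fit(X, y)
print(clf.coef_, clf.intercept_)   # roughly [[1., 0.]] and [0.], i.e. the boundary x1 = 0
```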

Problem 07

Assume that we trained a logistic regression model and our class probabilities can be found by


$$ z(x) = \sigma (w^Tx+b)$$

and we classify using the rule


$$ y(x) = \mathbb{1}[z(x) > 0.5]$$

Show that this corresponds to a linear decision boundary in the input space where $\sigma(x) = \cfrac{1}{1+e^{-x}}$


(Hint: $ y(x) = \mathbb{1}[z(x) > 0.5]$ means $y = 1$ if $z(x) > 0.5$)

Solution 07

What decision boundary looks like:

  • predict $y = 1$ if $z(x) > 0.5 \iff w^T x + b > 0$
  • predict $y = 0$ if $z(x) \leq 0.5 \iff w^T x + b \leq 0$
  • decision boundary: $w^T x + b = 0$

The derivation of the decision boundary is:


$$ \begin{align*} \frac{1}{1 + e^{-(w^T x+b)}} &= 0.5 \\ 1 &= 0.5\left(1 + e^{-(w^T x+b)}\right) \\ 1 - 0.5 &= 0.5\, e^{-(w^T x+b)} \\ 1 &= e^{-(w^T x+b)} \\ \ln(1) &= \ln\left(e^{-(w^T x+b)}\right) \\ 0 &= w^T x + b \end{align*} $$

Since $\sigma$ is strictly increasing, $z(x) > 0.5 \iff w^T x + b > 0$, and the set $\{x : w^T x + b = 0\}$ is a hyperplane; hence the decision boundary is linear in the input space.
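A small numerical illustration (a sketch, assuming NumPy; the parameters and query points below are random placeholders):

```python
# Thresholding the sigmoid output at 0.5 gives exactly the same labels as
# thresholding the linear score w^T x + b at 0, since sigma is strictly increasing.
import numpy as np

rng = np.random.default_rng(1)
w, b = rng.normal(size=3), rng.normal()       # stand-in "trained" parameters
X = rng.normal(size=(1000, 3))                # stand-in query points

score = X @ w + b
prob = 1.0 / (1.0 + np.exp(-score))           # sigma(w^T x + b)

print(np.array_equal(prob > 0.5, score > 0))  # True
```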

Problem 08

Let's do principal components analysis (PCA). Consider this sample of six points $X_i \in \mathbb{R}^2$


$$ \left\{ \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} 0 \\ 1 \end{bmatrix}, \begin{bmatrix} 1 \\ 0 \end{bmatrix}, \begin{bmatrix} 1 \\ 2 \end{bmatrix}, \begin{bmatrix} 2 \\ 1 \end{bmatrix}, \begin{bmatrix} 2 \\ 2 \end{bmatrix} \right\} $$
  1. Compute the mean of the sample points and write the centered matrix $\hat{X}$
  • (Hint: You can form the centered matrix by subtracting the mean from each sample.)

  2. Find all the principal components of this sample. Write them as unit vectors.

  • a. Which of those two principal components would be preferred if you use only one?

  • b. What information does the PCA algorithm use to decide that one principal component is better than another?

  • c. From an optimization point of view, why do we prefer that one?


  3. Compute the vector projection of each of the original sample points (not the centered sample points) onto your preferred principal component. By "vector projection" we mean that the projected points are still in $\mathbb{R}^2$. (Don't just give us the principal coordinate; give us the projected point.)
  • (Hint: Vector projection of $X_i$ onto a direction $w$ is $\bar{X}_i = (X_i^T w) w$ where $w$ is a unit vector)

Solution 08

  1. The sample mean is



$$ \mu = \frac{1}{6} \sum_{i=1}^{6} X_i = \begin{bmatrix} 1 \\ 1 \end{bmatrix}. $$


By subtracting the mean from each sample, we form the centered design matrix $\hat{X}$, written simply as $X$ below:
$$ X = \begin{bmatrix} -1 & -1 \\ -1 & 0 \\ 0 & -1 \\ 0 & 1 \\ 1 & 0 \\ 1 & 1 \\ \end{bmatrix}. $$
  2. The principal components of our dataset are the eigenvectors of the matrix



$$ X^TX = \begin{bmatrix} 4 & 2 \\ 2 & 4 \\ \end{bmatrix}. $$


The characteristic polynomial of this symmetric matrix is

$$ \det(sI - X^TX) = \det \begin{bmatrix} s - 4 & -2 \\ -2 & s - 4 \\ \end{bmatrix} = (s - 4)(s - 4) - (-2)(-2) = s^2 - 8s + 12 = (s - 6)(s - 2). $$
Hence the eigenvalues of $X^TX$ are $\lambda_1 = 2$ and $\lambda_2 = 6$. With these eigenvalues, we can compute the eigenvectors of this matrix as follows. (Or you could just guess them and verify them.)
$$ \begin{bmatrix} \lambda_1 - 4 & -2 \\ -2 & \lambda_1 - 4 \\ \end{bmatrix} \begin{bmatrix} v_{11} \\ v_{21} \\ \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ \end{bmatrix} \implies v_1 = \begin{bmatrix} 1/\sqrt{2} \\ -1/\sqrt{2} \\ \end{bmatrix} $$
$$ \begin{bmatrix} \lambda_2 - 4 & -2 \\ -2 & \lambda_2 - 4 \\ \end{bmatrix} \begin{bmatrix} v_{12} \\ v_{22} \\ \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ \end{bmatrix} \implies v_2 = \begin{bmatrix} 1/\sqrt{2} \\ 1/\sqrt{2} \\ \end{bmatrix} $$
    a. We choose $v_{2} = \begin{bmatrix}1/\sqrt2 \\1/\sqrt2\end{bmatrix}$ first.

    b. PCA picks the principal component with the largest eigenvalue.

    c. We prefer it because it maximizes the variance of the sample points after they are projected onto a line parallel to $v_{2}$. (Alternatively, because it minimizes the sum of squared distances from each sample point to its projection on that line.)

  3. The vector projection of $X_{i}$ onto a direction $w$ is



$$ \hat{X_i} = \frac{X_i^Tw}{||w||^2} w. $$


If $w$ is a unit vector, we can just write $\hat{X_i} = (X_i^Tw)w.$

For our sample and our chosen principal component $w = v_2$, the projections are


$$ \hat{X_1} = \begin{bmatrix} 0 \\ 0 \\ \end{bmatrix}, \hat{X_2} = \begin{bmatrix} 1/2 \\ 1/2 \\ \end{bmatrix}, \hat{X_3} = \begin{bmatrix} 1/2 \\ 1/2 \\ \end{bmatrix}, \hat{X_4} = \begin{bmatrix} 3/2 \\ 3/2 \\ \end{bmatrix}, \hat{X_5} = \begin{bmatrix} 3/2 \\ 3/2 \\ \end{bmatrix}, \hat{X_6} = \begin{bmatrix} 2 \\ 2 \\ \end{bmatrix}. $$
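All of the above can be reproduced numerically (a sketch, assuming NumPy):

```python
# PCA of the six sample points: mean, centered matrix, principal components, projections.
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 2], [2, 1], [2, 2]], dtype=float)

mu = X.mean(axis=0)                  # sample mean: [1, 1]
Xc = X - mu                          # centered matrix

evals, evecs = np.linalg.eigh(Xc.T @ Xc)
print(evals)                         # [2. 6.]
print(evecs)                         # columns [1,-1]/sqrt(2) and [1,1]/sqrt(2), up to sign

w = evecs[:, np.argmax(evals)]       # preferred component (largest eigenvalue)
proj = (X @ w)[:, None] * w          # vector projection of the ORIGINAL points onto w
print(proj)                          # matches the six projected points above
```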

Problem 09

  1. As with most pattern recognition methods, tree-based methods work best if proper features are selected. Discuss why preprocessing with PCA can be effective at finding "important" axes, allowing a smaller maximum depth of the tree.
  2. You have 400 photographs. Each photo is 64 $\times$ 64 pixels in size. However, the amount of space left on your computer's hard disk is less than the size of the image dataset. Assume you won't need any high-resolution photographs. Using the knowledge you've gained from the course, describe an approach for reducing the storage of the image dataset.

Solution 09

  1. If the tree is forced to split along the given feature axes, a complicated decision tree with large depth may result even though the data could be split cleanly along another direction. If PCA is first used to find that direction, the splits align with the principal axes of the data and the resulting decision tree can be much simpler, i.e. it needs a smaller maximum depth. (A small numerical sketch is given after this solution.)

  • First, flatten every image into a vector and stack them into one 2-dimensional array of size (64$\times$64)$\times$400, i.e. 4096$\times$400.

  • Second, subtract the average image (the mean over the sample axis) so that the data is zero-centered.

  • Third, decompose the zero-centered array using a truncated SVD, keeping the top 100 singular values and vectors (truncation rank = 100).

  • Finally, the storage is reduced from $(400\times4096)\times 2$ to $\left((400\times100)+(4096\times100)\right)\times 2$, roughly a $3.6\times$ reduction (see the sketch at the end of this solution).
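For part 1, the effect can be illustrated with a small experiment (a sketch; it assumes scikit-learn is available, and the dataset below is made up purely for the demonstration):

```python
# Data separated by a diagonal boundary needs a deep axis-aligned tree; after rotating
# onto the principal axes, far fewer splits are typically needed.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 2)) @ np.array([[1.0, 1.0], [0.0, 0.2]])  # elongated diagonal cloud
y = (X[:, 0] + X[:, 1] > 0).astype(int)                             # diagonal class boundary

tree_raw = DecisionTreeClassifier().fit(X, y)
tree_pca = DecisionTreeClassifier().fit(PCA(n_components=2).fit_transform(X), y)
print(tree_raw.get_depth(), tree_pca.get_depth())   # the PCA-rotated tree is typically much shallower
```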
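For part 2, a minimal sketch of the truncated-SVD compression (assuming NumPy; the array below is random data standing in for the 400 hypothetical 64$\times$64 photos, stored here as rows rather than columns, which does not change the count):

```python
# Zero-center the flattened images, keep a rank-100 truncated SVD, and compare
# the number of stored values before and after.
import numpy as np

rng = np.random.default_rng(0)
images = rng.random((400, 64 * 64))                  # 400 flattened images, 4096 pixels each

mean_image = images.mean(axis=0)
Xc = images - mean_image                             # zero-centered data

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
k = 100                                              # truncation rank from the solution
images_approx = U[:, :k] * s[:k] @ Vt[:k] + mean_image   # rank-k reconstruction

stored = U[:, :k].size + s[:k].size + Vt[:k].size + mean_image.size
print(images.size, stored)                           # 1638400 vs 453796 numbers stored
# Relative reconstruction error (large here only because the demo data is random noise):
print(np.linalg.norm(images - images_approx) / np.linalg.norm(images))
```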

Problem 10

We want you to compute (by hand) part of the singular value decomposition (SVD) $X = U \Sigma V^T$ of the matrix


$$X = \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ -1 & -1 \\ \end{bmatrix}$$
  1. One way to compute the SVD by hand is to exploit the relationship between the singular values of $X$ and the eigenvalues of a related symmetric matrix. What symmetric matrix has eigenvalues related to the singular values of $X$? Compute that matrix, write down its characteristic equation and simplify it, and determine its eigenvalues. Show your work.

  2. What is the relationship between the singular values of $X$ and the eigenvalues of your symmetric matrix? Use that relationship to write down the two specific singular values of $X$.

  3. Generalize your answer above. For any matrix $Z$ (not only $X$), what symmetric matrix has eigenvalues related to the singular values of $Z$? Prove that every singular value of $Z$ can be identified by looking at the eigenvalues of the matrix your method constructs.

  4. Explicitly write down enough eigenvectors of your symmetric matrix to form a complete basis for the space. (Hint: the symmetry of your matrix should make it easy to guess eigenvectors; then you can confirm that they are eigenvectors.) Based on these, write down the value of $U$ or $V$ (whichever is appropriate).

Solution 10

  1. Consider the symmetric matrix



$$ Y = X^TX = \begin{bmatrix} 2 & 1 \\ 1 & 2 \\ \end{bmatrix} $$


The characteristic polynomial of $Y$ is $(2 - \lambda)^{2} - 1^{2} = \lambda^ {2} - 4\lambda + 3 .$ We can factor it into $(\lambda - 3)(\lambda - 1)$ or simply use the quadratic formula to see that its roots are $\lambda = 3$ and $\lambda = 1$, the eigenvalues of $Y = X^TX$.

Note: if you like doing things the hard way, it is also correct to use the matrix


$$ XX^T = \begin{bmatrix} 1 & 0 & -1 \\ 0 & 1 & -1 \\ -1 & -1 & 2 \\ \end{bmatrix} $$

  2. The singular values of $X$ are square roots of eigenvalues of $X^TX$ (or $XX^T$). Therefore, they are $\sqrt{3}$ and $1$.



  3. The singular values of $Z$ are square roots of eigenvalues of $Z^TZ$ (or $ZZ^T$).



Proof: Let $Z = UDV^T$ be the SVD of $Z$; then $Z^TZ = VDU^TUDV^T = VD^2V^T$, which is clearly an eigendecomposition (because $V$ is orthonormal and $D^2$ is diagonal). Hence, if $Z$ has a singular value $\sigma$ with right singular vector $v$, then $v$ is an eigenvector of $Z^TZ$ with eigenvalue $\sigma^2$, so every singular value of $Z$ appears as the square root of an eigenvalue of $Z^TZ$.

  4. The simplest basis of eigenvectors for $X^TX$ is $[1,1]^T$ and $[1,-1]^T$. From this we can choose

$$ V = \begin{bmatrix} 1/\sqrt{2} & -1/\sqrt{2} \\ 1/\sqrt{2} & 1/\sqrt{2} \\ \end{bmatrix} $$
but note that the $-1/\sqrt{2}$ could appear in any of the four slots. Moreover, we could have three negative entries instead of one.

Postscript: By the way, one correct SVD is


$$ X = \begin{bmatrix} 1/\sqrt{6} & -1/\sqrt{2} \\ 1/\sqrt{6} & 1/\sqrt{2} \\ -2/\sqrt{6} & 0 \\ \end{bmatrix} \begin{bmatrix} \sqrt{3} & 0 \\ 0 & 1 \\ \end{bmatrix} \begin{bmatrix} 1/\sqrt{2} & -1/\sqrt{2} \\ 1/\sqrt{2} & 1/\sqrt{2} \\ \end{bmatrix} $$
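For reference, the decomposition can be checked numerically (a sketch, assuming NumPy):

```python
# Verify the singular values and their relationship to the eigenvalues of X^T X.
import numpy as np

X = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])

U, s, Vt = np.linalg.svd(X, full_matrices=False)
print(s)                                      # [1.732..., 1.0] = [sqrt(3), 1]
print(np.sqrt(np.linalg.eigvalsh(X.T @ X)))   # [1.0, 1.732...]: same values, ascending order
print(np.allclose(U * s @ Vt, X))             # True: U Sigma V^T reconstructs X
```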