Machine Learning for Mechanical Engineering

Dimension Reduction

Prof. Seungchul Lee
http://iailab.kaist.ac.kr/
Industrial AI Lab at KAIST
• For your handwritten solutions, please scan or take a picture of them. Alternatively, you can write them in markdown if you prefer.

• Ensure that your NAME and student ID are included in your .ipynb files. ex) SeungchulLee_20241234_HW01.ipynb
• Compress all the files into a single .zip file.

• In the .zip file's name, include your NAME and student ID.

ex) SeungchulLee_20241234_HW01.zip

• Submit this .zip file on KLMS.
• Do not submit a printed version of your code, as it will not be graded.

# Problem 01

In statistics, it is important to understand the characteristics of populations, samples, and distributions.

1. Plot a histogram of 100,000 data points that are randomly generated from the uniform distribution.
In [ ]:
## uniform distribution
#

1. Plot a histogram of 100,000 data points that are randomly generated from the normal distribution.
In [ ]:
## normal distribution
#

1. Draw 100 data points from each distribution 1,000 times, and show that the means of the sampled data points approximately follow a normal distribution.
In [ ]:
## uniform distribution

Sample_N = 1000

In [ ]:
## normal distribution

Sample_N = 1000
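
As a rough illustration of the sampling experiment above, the sketch below draws 1,000 samples of 100 points from each distribution and compares the spread of the sample means with the value the central limit theorem predicts. The seed, the distribution parameters (uniform on [0, 1], standard normal), and all variable names are assumptions made for this example, not part of the assignment.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed: an assumption, for repeatability
n_repeats, n_points = 1000, 100  # 1,000 samples of 100 points each

# Each row is one sample of 100 points; the row mean is one sample mean.
uniform_means = rng.uniform(0.0, 1.0, size=(n_repeats, n_points)).mean(axis=1)
normal_means = rng.normal(0.0, 1.0, size=(n_repeats, n_points)).mean(axis=1)

# The central limit theorem predicts a spread of sigma / sqrt(n_points).
uniform_theory_std = np.sqrt(1.0 / 12.0) / np.sqrt(n_points)
normal_theory_std = 1.0 / np.sqrt(n_points)

print("uniform:", uniform_means.std(), "theory:", uniform_theory_std)
print("normal: ", normal_means.std(), "theory:", normal_theory_std)
```

A histogram of `uniform_means` or `normal_means` (for example with `plt.hist`) should look approximately bell-shaped in both cases.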

1. Calculate the covariance, the covariance matrix, and the correlation coefficient of the data given below.
In [ ]:
import numpy as np
import matplotlib.pyplot as plt

x = np.random.normal(0, 0.3, 100)
data = np.array((x, 2 * x + np.random.normal(0, 0.5, 100)))

plt.figure(figsize = (8, 6))
plt.scatter(data[0], data[1])
plt.axis('equal')
plt.show()

In [ ]:
## write your code here
#
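
One possible way to check such quantities is to compute them from the definitions and compare against NumPy's built-in routines. The sketch below uses its own seeded data with the same parameters as the cell above; the seed and the variable names are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)  # fixed seed: an assumption, for repeatability
x = rng.normal(0, 0.3, 100)
y = 2 * x + rng.normal(0, 0.5, 100)

# Sample covariance from the definition (normalized by n - 1).
cov_xy = np.sum((x - x.mean()) * (y - y.mean())) / (len(x) - 1)

# 2 x 2 covariance matrix; np.cov also normalizes by n - 1 by default.
cov_matrix = np.cov(x, y)

# Correlation coefficient: covariance divided by both standard deviations.
corr_xy = cov_xy / (x.std(ddof=1) * y.std(ddof=1))

print(cov_xy, cov_matrix[0, 1])
print(corr_xy, np.corrcoef(x, y)[0, 1])
```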


# Problem 02

Describe the difference between the process of finding the linear regression line and the PCA line.

# Problem 03

In [ ]:
from six.moves import cPickle


1. Write Python code to perform linear regression (least squares).
In [ ]:
## your code here
#

1. Write Python code to perform PCA. Since mean(x) = mean(y) = 0, the normalization step (mean subtraction and rescaling) can be skipped.
In [ ]:
## your code here
#

1. Plot both the regression line and the PCA line (i.e., the first principal component) at the same time, and comment on why the results are different.
In [ ]:
## your code here
#
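
To make the comparison concrete, here is a minimal sketch on synthetic zero-mean data (not the data set loaded for this problem). It assumes a regression line through the origin, which is valid for centered data, and takes the PCA line as the leading eigenvector of the sample covariance. The seed and variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)  # fixed seed: an assumption, for repeatability
x = rng.normal(0, 0.3, 200)
y = 2 * x + rng.normal(0, 0.5, 200)
x, y = x - x.mean(), y - y.mean()  # center so both lines pass through the origin

# Least squares minimizes the vertical distances from the points to the line.
ols_slope = (x @ y) / (x @ x)

# PCA minimizes the perpendicular distances: take the leading eigenvector
# of the covariance matrix (eigh returns eigenvalues in ascending order).
cov = np.cov(x, y)
eigvals, eigvecs = np.linalg.eigh(cov)
u1 = eigvecs[:, -1]
pca_slope = u1[1] / u1[0]

print("regression slope:", ols_slope)
print("PCA slope:       ", pca_slope)
```

The two slopes generally differ because the two methods measure error along different directions; the PCA slope is never smaller in magnitude than the slope from regressing y on x.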


# Problem 04

Show that, in PCA, minimizing the sum of squared projection errors is equivalent to maximizing the variance of the projected data.
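
One common starting point, given here only as a sketch, is the Pythagorean decomposition of each data point. The notation is an assumption for this note: $x_i \in \mathbb{R}^d$ are $m$ zero-mean data points, $u$ is a unit vector, and $S = \frac{1}{m}\sum_{i} x_i x_i^T$ is the sample covariance matrix.

$$\|x_i\|^2 = (u^T x_i)^2 + \|x_i - (u^T x_i)\,u\|^2$$

Summing over all points and rearranging gives

$$\sum_{i=1}^{m} \|x_i - (u^T x_i)\,u\|^2 = \sum_{i=1}^{m} \|x_i\|^2 - \sum_{i=1}^{m} (u^T x_i)^2 = \sum_{i=1}^{m} \|x_i\|^2 - m\, u^T S u$$

The first term on the right does not depend on $u$, so the remaining work is to argue what happens to the left-hand side as $u^T S u$ grows.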

# Problem 05

1. What is the relationship between PCA and SVD?

2. Explain how SVD can be used for low-rank approximation.

3. Derive Fisher discriminant analysis from scratch.
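
For items 1 and 2, a numerical check can help build intuition before writing the explanation. The sketch below uses random synthetic data (an assumption) and verifies two standard facts: the eigenvalues of the sample covariance equal the squared singular values of the centered data divided by $m - 1$, and the Frobenius error of a rank-$k$ truncation equals the root sum of squares of the discarded singular values.

```python
import numpy as np

rng = np.random.default_rng(3)  # fixed seed: an assumption, for repeatability
X = rng.normal(size=(50, 4)) @ rng.normal(size=(4, 4))
X = X - X.mean(axis=0)          # center the data, as PCA requires
m = X.shape[0]

U, s, Vt = np.linalg.svd(X, full_matrices=False)

# PCA route: eigenvalues of the sample covariance, sorted in descending order.
eigvals = np.linalg.eigvalsh(X.T @ X / (m - 1))[::-1]

# SVD route: squared singular values scaled by 1 / (m - 1).
pca_from_svd = s ** 2 / (m - 1)

# Rank-k approximation: keep only the k largest singular triplets.
k = 2
X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
frob_error = np.linalg.norm(X - X_k)        # Frobenius norm by default
expected_error = np.sqrt(np.sum(s[k:] ** 2))

print(eigvals, pca_from_svd)
print(frob_error, expected_error)
```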

# Problem 06

In [ ]:
import numpy as np
import random
import matplotlib.pyplot as plt

%matplotlib inline

In [ ]:
n0 = 200
n1 = 200

sigma = [[19, -4],
[-4, 1]]

x0 = np.random.multivariate_normal([0.7,0.7], sigma, n0)        # data in class 0
x1 = np.random.multivariate_normal([-0.5,-0.5], sigma, n1)      # data in class 1

x0 = np.asmatrix(x0)
x1 = np.asmatrix(x1)

plt.figure(figsize = (10, 6))
plt.plot(x0[:,0],x0[:,1],'r.')
plt.plot(x1[:,0],x1[:,1],'b.')
plt.ylim([-8, 8])
plt.xlim([-14, 14])
plt.axis('equal')
plt.show()

1. Plot the projection line from Fisher Discriminant Analysis.
In [ ]:
## your code here
#

1. Plot the projection line from PCA (i.e., the first principal component). In the case of PCA, you do not need to consider the labels (i.e., the data classes).
In [ ]:
## your code here
#
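
The two directions can be compared numerically with a small sketch like the one below. It assumes the closed-form two-class FDA direction $w \propto S_W^{-1}(\mu_0 - \mu_1)$ and regenerates its own seeded data with the same parameters as above. The helper `fisher_criterion` and all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed: an assumption, for repeatability
sigma = np.array([[19.0, -4.0], [-4.0, 1.0]])
x0 = rng.multivariate_normal([0.7, 0.7], sigma, 200)    # class 0
x1 = rng.multivariate_normal([-0.5, -0.5], sigma, 200)  # class 1

mu0, mu1 = x0.mean(axis=0), x1.mean(axis=0)
# Within-class scatter: sum of the two per-class scatter matrices.
S_w = (x0 - mu0).T @ (x0 - mu0) + (x1 - mu1).T @ (x1 - mu1)

# FDA direction: proportional to S_w^{-1} (mu0 - mu1), then normalized.
w_fda = np.linalg.solve(S_w, mu0 - mu1)
w_fda /= np.linalg.norm(w_fda)

# PCA direction: leading eigenvector of the pooled covariance, labels ignored.
X = np.vstack([x0, x1])
X_centered = X - X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(X_centered.T @ X_centered / (len(X) - 1))
w_pca = eigvecs[:, -1]

def fisher_criterion(w):
    """Between-class separation divided by within-class scatter along w."""
    return (w @ (mu0 - mu1)) ** 2 / (w @ S_w @ w)

print("FDA:", w_fda, fisher_criterion(w_fda))
print("PCA:", w_pca, fisher_criterion(w_pca))
```

Each projection line can then be drawn through the origin with slope `w[1] / w[0]`. PCA follows the direction of largest overall variance, whereas FDA picks the direction that best separates the class means relative to the within-class scatter.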


# Problem 07

Given 100 pictures of human faces, each of size $(50, 40)$, we will apply PCA to this dataset.

(1) Plot a single face image from the 100 pictures. (You can randomly select one of the face images.)

In [ ]:
## Write your code here
#


(2) Before applying PCA, we need to reshape the face images into vectorized form. Reshape the given dataset into a matrix of size $(100,50 \times 40)$.

(i.e., $(100,50,40)$ $\rightarrow$ $(100,50\times 40)$)

In [ ]:
## Write your code here
#


(3) Apply PCA to the reshaped dataset. First, compute the covariance matrix. Second, compute the eigenvectors of the covariance matrix.

In [ ]:
## Write your code here
#


(4) Show the first eigenvector. First, convert the eigenvector to a real-valued vector using np.real(). Then reshape the eigenvector to the image size $(50,40)$.

In [ ]:
## Write your code here
#
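
The four steps above can be sketched end to end on a synthetic stand-in. The array `faces` below is random noise with the stated shape, not the actual face data, and the variable names are assumptions. This sketch uses `np.linalg.eigh`, which is meant for symmetric matrices and returns real values directly; with `np.linalg.eig`, the `np.real()` conversion mentioned in step (4) may be needed.

```python
import numpy as np

rng = np.random.default_rng(4)     # fixed seed: an assumption
faces = rng.random((100, 50, 40))  # synthetic stand-in for the face images

# Step (2): flatten each 50 x 40 image into a 2000-dimensional row vector.
X = faces.reshape(100, 50 * 40)

# Step (3): covariance matrix of the centered data, shape (2000, 2000).
X_centered = X - X.mean(axis=0)
cov = X_centered.T @ X_centered / (X.shape[0] - 1)

# eigh returns ascending eigenvalues, so reorder to descending.
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Step (4): reshape the first eigenvector back to image size for display.
first_eigenface = eigvecs[:, 0].reshape(50, 40)
print(first_eigenface.shape)
```

The result can be displayed with `plt.imshow(first_eigenface, 'gray')`.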


# Problem 08

Recall that PCA transforms (zero-mean) data into low-dimensional reconstructions that lie in the span of the top $k$ eigenvectors of the sample covariance matrix. Let $U_k$ denote the $d\times k$ matrix of the top $k$ eigenvectors of the covariance matrix ($U_k$ is a truncated version of $U$, which is the matrix of eigenvectors of the covariance matrix).

There are two approaches to computing the low-dimensional reconstruction $\omega \in \mathbb{R}^k$ of a data point $x \in \mathbb{R}^d$:

1. Solve a least squares problem to minimize the reconstruction error

2. Project $x$ onto the span of the columns of $U_k$

In this problem, you will show that these approaches are equivalent.

a) Formulate the least squares problem in terms of $U_k$, $x$, and the variable $\omega$. (Hint: This optimization problem should resemble linear regression.)

b) Show that the solution of the least squares problem is equal to $U_k^Tx$, which is the projection of $x$ onto the span of the columns of $U_k$.
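
Before attempting the proof, the claimed equivalence can be sanity-checked numerically. The sketch below builds $U_k$ from random synthetic data (an assumption), solves the least squares problem with `np.linalg.lstsq`, and compares the result with the projection $U_k^T x$. The dimensions and variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(5)  # fixed seed: an assumption, for repeatability
d, k, m = 6, 2, 200
X = rng.normal(size=(m, d)) @ rng.normal(size=(d, d))
X = X - X.mean(axis=0)          # zero-mean data, as the problem assumes

# Top-k eigenvectors of the sample covariance (eigh sorts ascending).
eigvals, eigvecs = np.linalg.eigh(X.T @ X / m)
U_k = eigvecs[:, ::-1][:, :k]   # d x k matrix

x = X[0]                        # one data point in R^d

# Approach 1: minimize ||x - U_k w||^2 over w by least squares.
omega_ls, *_ = np.linalg.lstsq(U_k, x, rcond=None)

# Approach 2: project x onto the columns of U_k.
omega_proj = U_k.T @ x

print(omega_ls, omega_proj)
```

The two results agree because the columns of $U_k$ are orthonormal, which is the property the analytical argument in (b) relies on.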

# Problem 09

We would like to use FDA (LDA) to classify digit 0 and digit 1.

| Data | Data description |
|---|---|
| 0 | 1000 images (28×28 pixels) of handwritten digit 0 |
| 1 | 1000 images (28×28 pixels) of handwritten digit 1 |

To read the files in Python, use the following code:

In [ ]:
import numpy as np
import matplotlib.pyplot as plt
from six.moves import cPickle

data0 = data['0']
data1 = data['1']

1. Convert each pixel in the 28 $\times$ 28 matrix to a binary value.
In [ ]:
## your code here
#

1. Extract features and plot the feature space. We must now select our own features from the image data to distinguish digit 0 from digit 1. Two features are recommended:
• (feature 1) The average pixel value in the center region of the image (img[10:20,10:20]).

• (feature 2) The average pixel value over the entire image.

$$\Phi(x) = \begin{bmatrix} \text{feature1}\\ \text{feature2} \end{bmatrix}$$

You should end up with a $2000\times2$ input matrix in which the first $1000$ rows hold the two features for each image in ‘data0’ and the last $1000$ rows hold the two features for each image in ‘data1’.

In [ ]:
## your code here
#

In [ ]:
## your code here
#
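
The binarization and feature extraction steps might look roughly like the sketch below. It runs on random stand-in arrays, not the actual digit data. The threshold of 128 assumes 8-bit pixel values, and the helper name `extract_features` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(6)  # fixed seed: an assumption
# Synthetic stand-ins with the stated shape (1000 images of 28 x 28 pixels).
data0 = rng.integers(0, 256, size=(1000, 28, 28))
data1 = rng.integers(0, 256, size=(1000, 28, 28))

def extract_features(images, threshold=128):
    """Binarize the images and return one (feature1, feature2) row per image."""
    binary = (images >= threshold).astype(float)              # pixels become 0 or 1
    center_mean = binary[:, 10:20, 10:20].mean(axis=(1, 2))   # feature 1
    overall_mean = binary.mean(axis=(1, 2))                   # feature 2
    return np.column_stack([center_mean, overall_mean])

# First 1000 rows: data0 features; last 1000 rows: data1 features.
features = np.vstack([extract_features(data0), extract_features(data1)])
print(features.shape)
```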

1. Solve the problem with FDA (LDA) and, for each class, plot a histogram of the data projected onto the projection line.
In [ ]:
## your code here
#

1. Plot the projection line and the classification line in feature space.
In [ ]:
## your code here
#


# Problem 10

We are going to classify some fashion items by FDA (LDA). Let's load the dataset.

In [ ]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline


In [ ]:
target_dict = {
    0: 'T-shirt/top',
    1: 'Trouser',
    2: 'Pullover',
    3: 'Dress',
    4: 'Coat',
    5: 'Sandal',
    6: 'Shirt',
    7: 'Sneaker',
    8: 'Bag',
    9: 'Ankle boot',
}

In [ ]:
plt.figure(figsize = (10,10))
for i in range(0,20):
    plt.subplot(5,5, i+1)
    plt.imshow(data[i], 'gray')
    plt.title(target_dict[(data_labels[i])])
    plt.xticks([])
    plt.yticks([])

1. Choose the T-shirt/top class and the Trouser class from the 10 classes.
In [ ]:
# fill out the blank

T_shirt_top_data =
Trouser_data =

1. Select 1000 samples from each of the two selected classes.
In [ ]:
## Write your code here
#

In [ ]:
print(Trouser_data.shape)

1. Plot random images from each of the two selected classes.
In [ ]:
## Write your code here
#

1. We must now select our own features from the image data to classify T_shirt_top and Trouser. The following two features are recommended:
• (feature 1) The variance of the pixel values over the entire image.

• (feature 2) The average pixel value in the center region of the image (img[15:25,13:15]).

In [ ]:
# fill out the blank

feature_T_shirt_top_var = np.var( , axis = (1, 2))
feature_Trouser_var = np.var( , axis = (1, 2))
feature_T_shirt_top_mean = np.mean( , axis = (1, 2))
feature_Trouser_mean = np.mean( , axis = (1, 2))

1. The shape of each feature should be changed to (1000,1).
In [ ]:
## Write your code here
#

1. Plot all the data in feature space
In [ ]:
## Write your code here
#

1. Solve the problem with FDA (LDA) and visualize the histogram of data on the projected line by class.
In [ ]:
from sklearn import discriminant_analysis

#

1. Plot projection line and classification line in feature space.
In [ ]:
## Write your code here
#

In [ ]:
plt.figure(figsize = (10, 10))
plt.plot(Theta0[:,0], Theta0[:,1], '.', color = 'b', label = 'T_shirt_top')
plt.plot(Theta1[:,0], Theta1[:,1], '.', color = 'hotpink', label ='Trouser')
plt.plot(xp, prj_yp, 'k', label = 'FDA(LDA) projection line', linewidth = 2.5)
plt.plot(xp, clf_yp, 'g', label = 'FDA(LDA) classification boundary', linewidth = 2.5)
plt.legend(fontsize = 10)
plt.grid(alpha = 0.3)
plt.axis('equal')
plt.ylim([-0.2, 1])
plt.xlim([-0.05, 0.24])
plt.xlabel('Feature 1', fontsize = 15)
plt.ylabel('Feature 2', fontsize = 15)
plt.show()


# Problem 11

Consider the following matrix:

\begin{align*} A = \begin{bmatrix} -1 & 1 & 0 \\ 0 & -1 & 1 \\ \end{bmatrix} \end{align*}

(a) Compute the Singular Value Decomposition (SVD) of matrix $A$ using Python.

In [ ]:
## your code here
#


(b) This time, compute the SVD of matrix $A$ by hand.

(c) Compute the rank-1 approximation of matrix $A$ by hand.
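
Hand calculations for this matrix can be cross-checked numerically. The sketch below assumes NumPy's `np.linalg.svd` with `full_matrices=False` (the reduced form); the variable names are illustrative.

```python
import numpy as np

A = np.array([[-1.0, 1.0, 0.0],
              [0.0, -1.0, 1.0]])

# Reduced SVD: U is 2 x 2, s holds the singular values, Vt is 2 x 3.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Rank-1 approximation: keep only the largest singular triplet.
A_rank1 = s[0] * np.outer(U[:, 0], Vt[0])

print("singular values:", s)
print("rank-1 approximation:\n", A_rank1)
```

Note that the signs of corresponding columns of `U` and rows of `Vt` may be flipped relative to a hand computation; the product $U \Sigma V^T$ is unaffected.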

# Problem 12

The data provided in this problem is a set of 20 sequential pictures (frames) taken from a video of people walking. In this problem, you are asked to remove the people from the pictures and extract only the background.

(a) Plot the image 'capimage_0.jpg' in grayscale.

In [ ]:
import numpy as np
import cv2
import matplotlib.pyplot as plt
%matplotlib inline

#


(b) Load each image in grayscale and reshape it to (row $\times$ col, 1), then horizontally stack the results to form matrix $A$ with a shape of (row $\times$ col, 20).

In [ ]:
## your code here
#


(c) Compute the SVD of matrix $A$ and plot the singular values.

In [ ]:
## your code here
#


(d) Apply low rank approximation to matrix $A$ with $k=1$, and then plot $\hat A[:,0]$ with a reshape of (row, col).

$$\hat A = U_k \Sigma_k V_k^T = \sum_{i=1}^{k}\sigma_i u_i v_i^T \qquad (k = 1)$$
In [ ]:
## your code here
#


(e) Explain why we have the above result.
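
The effect in (d) can be reproduced on a toy example without the video frames. The sketch below is entirely synthetic: a fixed random background plus a small dark patch that moves from frame to frame stands in for a walking person. The image size, the patch, and the variable names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(7)  # fixed seed: an assumption
rows, cols, n_frames = 20, 64, 20
background = rng.uniform(0.5, 1.0, size=(rows, cols))  # static scene

frames = []
for t in range(n_frames):
    frame = background.copy()
    frame[8:11, 3 * t:3 * t + 3] = 0.0    # dark "person" at a new position
    frames.append(frame.reshape(-1, 1))   # column vector of length rows * cols
A = np.hstack(frames)                     # shape (rows * cols, n_frames)

# Rank-1 approximation from the largest singular triplet.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_hat = s[0] * np.outer(U[:, 0], Vt[0])
recovered = A_hat[:, 0].reshape(rows, cols)

# Largest pixel-wise deviation from the true background.
frame_error = np.abs(A[:, 0].reshape(rows, cols) - background).max()
recovered_error = np.abs(recovered - background).max()
print(frame_error, recovered_error)
```

In this toy setting the background is shared by every column of $A$, so it dominates the first singular triplet, while the moving patch appears at a different location in each frame and is largely averaged out.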

# Problem 13

I demonstrated the PCA algorithm with a spring-and-mass system in class.

This time, we want to reproduce the same results using SVD. Use the same dataset that was used in class.

In [ ]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

X = data_sp.T
X = np.asmatrix(X)

print(X.shape)

In [ ]:
plt.figure(figsize = (12, 6))

plt.subplot(1,3,1)
plt.plot(X[:, 0], -X[:, 1], 'r')
plt.axis('equal')
plt.title('Camera 1')

plt.subplot(1,3,2)
plt.plot(X[:, 2], -X[:, 3], 'b')
plt.axis('equal')
plt.title('Camera 2')

plt.subplot(1,3,3)
plt.plot(X[:, 4], -X[:, 5], 'k')
plt.axis('equal')
plt.title('Camera 3')

plt.show()

In [ ]:
X = X - np.mean(X, axis = 0)


## write your code here

## write your code here