AI for Mechanical Engineering

Artificial Neural Networks (ANN)

Problem 01

1. Explain why the perceptron cannot solve the XOR problem.

2. An ANN (Artificial Neural Network) is also called an MLP (Multilayer Perceptron). Explain why an MLP is able to solve the XOR problem.

Problem 02

1. (Choose the correct answers) Jonathan has now switched to multilayer neural networks and notices that the training error goes down and converges to a local minimum. However, when he tests on new data, the test error is abnormally high. What is probably going wrong, and what would you recommend he do? (3 choices)

a) The training data set is not large enough. Collect a larger training data set and retrain.

b) Tune the learning rate and add a regularization term to the objective function.

c) Use a different initialization and train the network several times. Use the average of predictions from all nets to predict test data.

d) Use the same training data but add two more hidden layers.

1. Answer true or false for the following statements. (Correct +1, Wrong -1)

a) A single perceptron can solve a linearly inseparable problem with a kernel function.

b) Gradient descent trains neural networks to the global optimum.

1. (Choose all the correct answers) Jonathan is trying to solve the XOR problem using a multilayer perceptron (MLP) with the ReLU activation function. However, as he trains the MLP model, the results vary at every iteration: they are correct in some iterations and wrong in others. What is probably going wrong, and what would you recommend he do?

a) There are not enough training data points. Collect more training data points and retrain.

b) The number of perceptron layers is too large. Remove some perceptron layers and retrain.

c) The number of perceptron layers is too small. Add more perceptron layers and retrain.

d) The learning rate is too high. Reduce the learning rate and retrain.

1. Explain how the sigmoid (or hyperbolic tangent) and rectified linear unit (ReLU) activation functions differ in gradient backpropagation.

Problem 03

To deal with a nonlinearly distributed data set, we need to design an appropriate kernel that maps the original data into a space where it is linearly separable. However, in artificial neural networks, this step is not necessary. Discuss why. (Hint: see the following figure.)

Problem 04

Consider a multilayer perceptron that has an input layer with 10 neurons, a hidden layer with 50 neurons, and an output layer with 3 neurons. The nonlinear activation function for every neuron is ReLU. Answer the following questions.

1. Size of input $X$?

2. Size of weights and biases ($W_h, b_h$) for the hidden layer?

3. Size of weights and biases ($W_o, b_o$) for the output layer?

4. Size of output $Y$?

Solution

1. Size of $X$ is $m \times 10$ where $m$ is batch size.

2. $W_h$: $10 \times 50$, $b_h$: $50$
3. $W_o$: $50 \times 3$, $b_o$: $3$
4. Size of $Y$ is $m\times 3$ where $m$ is batch size.
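The sizes above can be checked with a short NumPy sketch. The batch size `m = 4`, the random initial values, and the variable names are illustrative assumptions, not part of the problem statement.

```python
import numpy as np

m = 4  # hypothetical batch size, chosen only for illustration
rng = np.random.default_rng(0)

X = rng.normal(size=(m, 10))      # input: m x 10

W_h = rng.normal(size=(10, 50))   # hidden-layer weights: 10 x 50
b_h = np.zeros(50)                # hidden-layer biases: 50

W_o = rng.normal(size=(50, 3))    # output-layer weights: 50 x 3
b_o = np.zeros(3)                 # output-layer biases: 3

# ReLU is applied in both layers, as stated in the problem.
H = np.maximum(0.0, X @ W_h + b_h)  # hidden activations: m x 50
Y = np.maximum(0.0, H @ W_o + b_o)  # output: m x 3

print(X.shape, H.shape, Y.shape)
```

The matrix products only work when the inner dimensions agree, which is why $W_h$ must be $10 \times 50$ and $W_o$ must be $50 \times 3$ under the row-per-sample convention used in the solution.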

Problem 05

Backpropagation is used to train neural networks. Briefly explain what backpropagation is. In your discussion, use keywords such as recursive, memoized, dynamic programming, and chain rule.

Problem 06

Build an ANN model that receives three binary-valued (i.e., $0$ or $1$) inputs $x_1,x_2,x_3$, and outputs $1$ if exactly two of the inputs are $1$, and outputs $0$ otherwise. All of the units use a hard threshold activation function applied to the pre-activation $z$:

$$\text{output}(z) = \begin{cases} 1 \quad \text{if } z \geq 0\\ 0 \quad \text{if } z < 0 \end{cases}$$

Suggest one possible set of weights and biases that correctly implements this function.

Denote by

• $\mathbf{W}_{2 \times 3}$ and $\mathbf{V}_{1 \times 2}$ the weight matrices connecting the input to the hidden layer, and the hidden layer to the output, respectively.

• $\mathbf{b}^{(1)}_{2 \times 1}$ and $\mathbf{b}^{(2)}_{1 \times 1}$ the bias vectors at the hidden layer and the output, respectively.

• $x_{3 \times 1}$ and $h_{2 \times 1}$ the node values at the input and hidden layer, respectively.
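A candidate answer can be checked by enumerating all eight inputs. The sketch below tests one possible choice (not necessarily the intended one): the first hidden unit fires when at least two inputs are $1$, the second when all three are $1$, and the output fires only when the first is on and the second is off. The specific numbers and the helper names `step` and `predict` are assumptions made for illustration, and the vectors are stored as 1-D arrays instead of column matrices.

```python
import itertools
import numpy as np

def step(z):
    """Hard threshold: 1 where z >= 0, else 0."""
    return (np.asarray(z) >= 0).astype(int)

# One candidate set of parameters (an illustrative assumption).
W = np.array([[1, 1, 1],
              [1, 1, 1]])     # 2 x 3, input -> hidden
b1 = np.array([-2, -3])       # h1: sum >= 2, h2: sum >= 3
V = np.array([[1, -1]])       # 1 x 2, hidden -> output
b2 = np.array([-0.5])         # fires only when h1 = 1 and h2 = 0

def predict(x):
    h = step(W @ x + b1)
    return int(step(V @ h + b2)[0])

# Enumerate the full truth table and compare with the target function.
for bits in itertools.product([0, 1], repeat=3):
    x = np.array(bits)
    expected = int(x.sum() == 2)
    print(bits, predict(x), expected)
```

Any other parameter set can be swapped in and checked the same way.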

Problem 07

In this problem, we are going to compute the gradient using the chain rule and dynamic programming, and update the weights $\omega \rightarrow \omega^+$. After one back-propagation update, the error is recomputed with the new weights and compared with the error before the update.

Neural Network Model

• The artificial neural network structure: an input layer, a hidden layer, and an output layer.
• All neurons ($h_1$, $h_2$, $\sigma_1$ and $\sigma_2$) in the hidden and output layers use the sigmoid function as the activation function.
• The red numbers are the initial weight values, the blue numbers are the input values, and the ground truth gives the target values.
• The loss function is the mean squared error (MSE). Use 1/2 MSE for calculation convenience. For example, $E = \frac{1}{2}\sum(\text{target} - \text{output})^2$
• Learning rate is set to 0.9.

Step 1: Forward Propagation

1. [hand written] Write and calculate $z_1$, $z_2$, $h_1$, $h_2$, $z_3$, $z_4$, $\sigma_1$, $\sigma_2$, and $E_{\text{total}}$ for forward propagation.

Solution

\begin{align*} z_1 &= \omega_1 x_1 + \omega_2 x_2 = 0.3 \space \times \space 0.2 \space + \space 0.25 \space \times \space 0.3 \space = \space 0.135 \\ z_2 &= \omega_3 x_1 + \omega_4 x_2 = 0.4 \space \times \space 0.2 \space + \space 0.3 \space \times \space 0.3 \space = \space 0.17 \\\\ \end{align*}\begin{align*} h_1 &= \text{sigmoid}(z_1) = 0.5337 \\ h_2 &= \text{sigmoid}(z_2) = 0.5424 \\\\ \end{align*}\begin{align*} z_3 &= \omega_5 h_1 + \omega_6 h_2 = 0.5 \space \times \space h_1 \space + \space 0.4 \space \times \space h_2 \space = \space 0.4838 \\ z_4 &= \omega_7 h_1 + \omega_8 h_2 = 0.7 \space \times \space h_1 \space + \space 0.8 \space \times \space h_2 \space = \space 0.8075 \\\\ \end{align*}\begin{align*} \sigma_1 &= \text{output}_{\sigma_1} = \text{sigmoid}(z_3) = 0.6186 \\ \sigma_2 &= \text{output}_{\sigma_2} = \text{sigmoid}(z_4) = 0.6916 \\\\ \end{align*}\begin{align*} E_{\sigma_1} = \frac{1}{2}(\text{target}_{\sigma_1} - \text{output}_{\sigma_1})^2 = 0.0239 \\ E_{\sigma_2} = \frac{1}{2}(\text{target}_{\sigma_2} - \text{output}_{\sigma_2})^2 = 0.0042 \\\\ \end{align*}

Thus,

\begin{align*} E_{\text{total}} = E_{\sigma_1} + E_{\sigma_2} = 0.0281 \end{align*}
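The hand calculation can be cross-checked numerically. The sketch below uses the inputs and weights that appear in the solution. The first target, $0.4$, is taken from the backpropagation step of the solution; the second target, $0.6$, is an assumption back-computed from the reported $E_{\sigma_2} = 0.0042$, so the figure remains the authoritative source. The matrix layout and variable names are also illustrative choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.2, 0.3])                 # inputs x1, x2

# Row i holds the weights feeding hidden unit i: (w1, w2) and (w3, w4).
W1 = np.array([[0.30, 0.25],
               [0.40, 0.30]])
# Row i holds the weights feeding output unit i: (w5, w6) and (w7, w8).
W2 = np.array([[0.50, 0.40],
               [0.70, 0.80]])

# 0.4 comes from the solution text; 0.6 is an ASSUMPTION inferred
# from the reported E_sigma2 = 0.0042.
target = np.array([0.4, 0.6])

z_hidden = W1 @ x          # z1, z2
h = sigmoid(z_hidden)      # h1, h2
z_out = W2 @ h             # z3, z4
out = sigmoid(z_out)       # sigma1, sigma2

E = 0.5 * (target - out) ** 2
E_total = E.sum()

print("z1, z2 =", np.round(z_hidden, 4))
print("h1, h2 =", np.round(h, 4))
print("z3, z4 =", np.round(z_out, 4))
print("sigma1, sigma2 =", np.round(out, 4))
print("E_total =", round(float(E_total), 4))
```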

Step 2: BackPropagation 1

1. [hand written] Update $\omega_5$, $\omega_6$, $\omega_7$, $\omega_8$ $\rightarrow$ $\omega_5^+$, $\omega_6^+$, $\omega_7^+$, $\omega_8^+$ using back-propagation.

Solution

\begin{align*} \frac{\partial E_{\text{total}}}{\partial \omega_5} = \frac{\partial E_{\text{total}}}{\partial \sigma_1} \times \frac{\partial \sigma_1}{\partial z_3} \times \frac{\partial z_3}{\partial \omega_5} \\\\ \end{align*}\begin{align*} E_{\text{total}} &= \frac{1}{2}(\text{target}_{\sigma_1} - \text{output}_{\sigma_1})^{2} + \frac{1}{2}(\text{target}_{\sigma_2} - \text{output}_{\sigma_2})^2 \\\\ \end{align*}

First term,

\begin{align*}\\ \frac{\partial E_{\text{total}}}{\partial \sigma_1} &= 2 \space \times \space \frac{1}{2}(\text{target}_{\sigma_1} - \text{output}_{\sigma_1})^{2-1}\times(-1) \\\\ &= -(\text{target}_{\sigma_1} - \text{output}_{\sigma_1}) = -(0.4 - 0.6186) = 0.2186 \\\\ \end{align*}

Second term,

\begin{align*}\\ \frac{\partial \sigma_1}{\partial z_3} = \sigma_1 \times (1 \space - \space \sigma_1) = 0.6186(1 \space - \space 0.6186) = 0.2359 \\\\ \end{align*}

Third term,

\begin{align*}\\ \frac{\partial z_3}{\partial \omega_5} = h_1 = 0.5337 \\\\ \end{align*}

Thus,

\begin{align*} \\ \frac{\partial E_{\text{total}}}{\partial \omega_5} = 0.2186 \times 0.2359 \times 0.5337 = 0.0275 \\\\ \end{align*}

Update the weights through gradient descent optimization,

\begin{align*} \\ \omega_5^+ = \omega_5 - \alpha \frac{\partial E_{total}}{\partial \omega_5} = 0.5 - 0.9 \times 0.0275 = 0.4752 \\\\ \end{align*}

Proceed in the same way for other weights,

\begin{align*} \\ \frac{\partial E_{\text{total}}}{\partial \omega_6} = \frac{\partial E_{\text{total}}}{\partial \sigma_1} \times \frac{\partial \sigma_1}{\partial z_3} \times \frac{\partial z_3}{\partial \omega_6} \rightarrow \omega_6^+ = 0.3748 \\\\ \frac{\partial E_{\text{total}}}{\partial \omega_7} = \frac{\partial E_{\text{total}}}{\partial \sigma_2} \times \frac{\partial \sigma_2}{\partial z_4} \times \frac{\partial z_4}{\partial \omega_7} \rightarrow \omega_7^+= 0.6895 \\\\ \frac{\partial E_{\text{total}}}{\partial \omega_8} = \frac{\partial E_{\text{total}}}{\partial \sigma_2} \times \frac{\partial \sigma_2}{\partial z_4} \times \frac{\partial z_4}{\partial \omega_8} \rightarrow \omega_8^+= 0.7894 \\\\ \end{align*}
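The three chain-rule factors for the weights feeding $\sigma_1$ can be reproduced in a few lines. This is a minimal sketch restricted to $\omega_5$ and $\omega_6$, which depend only on the first target ($0.4$, as used in the solution). The variable names are illustrative, and `delta_3` is a name introduced here for the product of the first two factors, which both weights share.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x1, x2 = 0.2, 0.3
w1, w2, w3, w4 = 0.30, 0.25, 0.40, 0.30
w5, w6 = 0.50, 0.40
target_1 = 0.4
alpha = 0.9                      # learning rate

# Forward pass up to sigma_1.
h1 = sigmoid(w1 * x1 + w2 * x2)
h2 = sigmoid(w3 * x1 + w4 * x2)
sigma_1 = sigmoid(w5 * h1 + w6 * h2)

# Chain rule: dE/dw = dE/dsigma_1 * dsigma_1/dz3 * dz3/dw.
dE_dsigma1 = -(target_1 - sigma_1)
dsigma1_dz3 = sigma_1 * (1.0 - sigma_1)
delta_3 = dE_dsigma1 * dsigma1_dz3   # computed once, reused for w5 and w6

grad_w5 = delta_3 * h1               # dz3/dw5 = h1
grad_w6 = delta_3 * h2               # dz3/dw6 = h2

# Gradient descent update.
w5_new = w5 - alpha * grad_w5
w6_new = w6 - alpha * grad_w6

print(round(grad_w5, 4), round(w5_new, 4), round(w6_new, 4))
```

Reusing `delta_3` for both weights is the memoization that makes backpropagation a dynamic programming method.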

Step 3: BackPropagation 2

1. [hand written] Update $\omega_1$, $\omega_2$, $\omega_3$, $\omega_4$ $\rightarrow$ $\omega_1^+$, $\omega_2^+$, $\omega_3^+$, $\omega_4^+$ using back-propagation.

Solution

\begin{align*} \frac{\partial E_{\text{total}}}{\partial \omega_1} = \frac{\partial E_{\text{total}}}{\partial h_1} \times \frac{\partial h_1}{\partial z_1} \times \frac{\partial z_1}{\partial \omega_1} \\\\ \end{align*}

First term,

\begin{align*} \\ \frac{\partial E_{\text{total}}}{\partial h_1} = \frac{\partial E_{\sigma_1}}{\partial h_1} + \frac{\partial E_{\sigma_2}}{\partial h_1} \\\\\\ \end{align*}\begin{align*} \frac{\partial E_{\sigma_1}}{\partial h_1} &= \frac{\partial E_{\sigma_1}}{\partial z_3} \times \frac{\partial z_3}{\partial h_1} = \frac{\partial E_{\sigma_1}}{\partial \sigma_1} \times \frac{\partial \sigma_1}{\partial z_3} \times \frac{\partial z_3}{\partial h_1}\\\\ &= -(\text{target}_{\sigma_1} - \text{output}_{\sigma_1}) \times \sigma_1 \times (1 - \sigma_1) \times \omega_5 \\\\ &= 0.2186 \space \times \space 0.2359 \space \times \space 0.5 \space = \space 0.0258 \\\\ \end{align*}

In the same way,

\begin{align*} \\ \frac{\partial E_{\sigma_2}}{\partial h_1} = \frac{\partial E_{\sigma_2}}{\partial z_4} \times \frac{\partial z_4}{\partial h_1} = \frac{\partial E_{\sigma_2}}{\partial \sigma_2} \times \frac{\partial \sigma_2}{\partial z_4} \times \frac{\partial z_4}{\partial h_1} = 0.0078 \\\\\\ \end{align*}\begin{align*} \frac{\partial E_{\text{total}}}{\partial h_1} = 0.0258 + 0.0078 = 0.0336 \\\\ \end{align*}

Second and third terms,

\begin{align*} \frac{\partial h_1}{\partial z_1} &= h_1 \times (1 - h_1) = 0.5337(1 - 0.5337) = 0.2488 \\\\ \frac{\partial z_1}{\partial \omega_1} &= x_1 = 0.2 \\\\ \end{align*}

Thus,

$$\frac{\partial E_{\text{total}}}{\partial \omega_1} = 0.0336 \times 0.2488 \times 0.2 = 0.0017 \\\\$$

Update the weights through gradient descent optimization,

$$\\ \omega_1^+ = \omega_1 - \alpha \frac{\partial E_{\text{total}}}{\partial \omega_1} = 0.3 - 0.9 \times 0.0017 = 0.2985 \\\\$$

Proceed in the same way for other weights,

\begin{align*} \\ \frac{\partial E_{\text{total}}}{\partial \omega_2} &= \frac{\partial E_{\text{total}}}{\partial h_1} \times \frac{\partial h_1}{\partial z_1} \times \frac{\partial z_1}{\partial \omega_2} \rightarrow \omega_2^+ = 0.2477 \\\\ \frac{\partial E_{\text{total}}}{\partial \omega_3} &= \frac{\partial E_{\text{total}}}{\partial h_2} \times \frac{\partial h_2}{\partial z_2} \times \frac{\partial z_2}{\partial \omega_3} \rightarrow \omega_3^+ = 0.3965 \\\\ \frac{\partial E_{\text{total}}}{\partial \omega_4} &= \frac{\partial E_{\text{total}}}{\partial h_2} \times \frac{\partial h_2}{\partial z_2} \times \frac{\partial z_2}{\partial \omega_4} \rightarrow \omega_4^+ = 0.2965 \\\\ \end{align*}

Step 4: Check the Result for Weight Update

1. [hand written] Write and calculate $E_{\text{total}}$ with the updated weights, and compare it to the previous error.

Solution

\begin{align*} z_1 &= \omega_1 x_1 + \omega_2 x_2 = 0.2985 \space \times \space 0.2 \space + \space 0.2477 \space \times \space 0.3 \space = \space 0.1340 \\ z_2 &= \omega_3 x_1 + \omega_4 x_2 = 0.3965 \space \times \space 0.2 \space + \space 0.2965 \space \times \space 0.3 \space = \space 0.1683\\\\ \end{align*}\begin{align*} h_1 &= \text{sigmoid}(z_1) = 0.5334 \\ h_2 &= \text{sigmoid}(z_2) = 0.5420 \\\\ \end{align*}\begin{align*} z_3 &= \omega_5 h_1 + \omega_6 h_2 = 0.4752 \space \times \space h_1 \space + \space 0.3748 \space \times \space h_2 \space = \space 0.4566 \\ z_4 &= \omega_7 h_1 + \omega_8 h_2 = 0.6895 \space \times \space h_1 \space + \space 0.7894 \space \times \space h_2 \space = \space 0.7956 \\\\ \end{align*}\begin{align*} \sigma_1 &= \text{output}_{\sigma_1} = \text{sigmoid}(z_3) = 0.6122 \\ \sigma_2 &= \text{output}_{\sigma_2} = \text{sigmoid}(z_4) = 0.6890 \\\\ \end{align*}\begin{align*} E_{\sigma_1} = \frac{1}{2}(\text{target}_{\sigma_1} - \text{output}_{\sigma_1})^2 = 0.0225 \\ E_{\sigma_2} = \frac{1}{2}(\text{target}_{\sigma_2} - \text{output}_{\sigma_2})^2 = 0.0040 \\\\ \end{align*}

Thus,

\begin{align*} E_{\text{total}} = E_{\sigma_1} + E_{\sigma_2} = 0.0265 \\\\ \end{align*}

Since the previous error was 0.0281, the error decreased after one back-propagation update.

Training an artificial neural network means repeating forward propagation and backpropagation to find the weights that minimize the error.
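The repetition of forward propagation and backpropagation can be sketched as a short loop on the same network. The second target ($0.6$) is an assumption inferred from the reported errors, the number of iterations is arbitrary, and the vectorized form and variable names are illustrative choices, not the notebook's own solution.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.2, 0.3])
target = np.array([0.4, 0.6])    # second value is an ASSUMPTION
alpha = 0.9                      # learning rate from the problem

W1 = np.array([[0.30, 0.25],     # (w1, w2), (w3, w4)
               [0.40, 0.30]])
W2 = np.array([[0.50, 0.40],     # (w5, w6), (w7, w8)
               [0.70, 0.80]])

errors = []
for _ in range(200):             # arbitrary number of iterations
    # Forward propagation.
    h = sigmoid(W1 @ x)
    out = sigmoid(W2 @ h)
    errors.append(0.5 * np.sum((target - out) ** 2))

    # Backpropagation: output deltas first, then reuse them for the
    # hidden layer (computed before W2 is modified).
    delta_out = (out - target) * out * (1.0 - out)
    delta_hidden = (W2.T @ delta_out) * h * (1.0 - h)

    # Gradient descent update.
    W2 -= alpha * np.outer(delta_out, h)
    W1 -= alpha * np.outer(delta_hidden, x)

print("initial error:", errors[0])
print("final error:  ", errors[-1])
```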

Problem 08

1. Classify the given four points into two classes in the 2D plane using a single-layer structure as shown below. Plot the linear boundary even if it fails to classify them.

Note that bias units are not indicated here.

In [ ]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

x_data = np.array([[0, 0], [1, 1], [0, 1], [1, 0]], dtype=np.float32)
y_data = np.array([[0], [0], [1], [1]], dtype=np.float32)

plt.figure(figsize = (6, 4))
plt.scatter(x_data[:2,0], x_data[:2,1], marker='+', s=100, label='A')
plt.scatter(x_data[2:,0], x_data[2:,1], marker='x', s=100, label='B')
plt.axis('equal')
plt.ylim([-0.5, 1.5]);
plt.grid(alpha=0.15);
plt.legend();
plt.show()

In [ ]:
## write your code here
#

Model: "sequential"
_________________________________________________________________
Layer (type)                Output Shape              Param #
=================================================================
dense (Dense)               (None, 1)                 3

=================================================================
Total params: 3 (12.00 Byte)
Trainable params: 3 (12.00 Byte)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________

1. Classify the given four points in the 2D plane using two layers as shown below (the number of neurons in the output layer can be changed to one).

Note that bias units are not indicated here and you can use either one-hot-encoding or sparse_categorical_crossentropy.

In [ ]:
## write your code here
#

# Initializer is optional

Model: "sequential_1"
_________________________________________________________________
Layer (type)                Output Shape              Param #
=================================================================
dense_1 (Dense)             (None, 2)                 6

dense_2 (Dense)             (None, 1)                 3

=================================================================
Total params: 9 (36.00 Byte)
Trainable params: 9 (36.00 Byte)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________

1. The first layer can be seen as a kernel function $\phi$. Show the locations of the four points in the 2D plane after the first layer.
In [ ]:
## write your code here
#


1. Visualize the kernel space on the 2D plane.

Hint: Make 2d grid points and apply the kernel.

In [ ]:
## write your code here
#


1. Plot the decision boundary in the kernel space.
In [ ]:
## write your code here
#


Problem 09

You will perform binary classification on nonlinearly separable data using an MLP. Plot the given data first.

In [ ]:
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
%matplotlib inline

N = 200
M = 2*N
gamma = 0.01

G0 = np.random.multivariate_normal([0, 0], gamma*np.eye(2), N)
G1 = np.random.multivariate_normal([1, 1], gamma*np.eye(2), N)
G2 = np.random.multivariate_normal([0, 1], gamma*np.eye(2), N)
G3 = np.random.multivariate_normal([1, 0], gamma*np.eye(2), N)

train_X = np.vstack([G0, G1, G2, G3])
train_y = np.vstack([np.ones([M,1]), np.zeros([M,1])])

train_X = np.asmatrix(train_X)
train_y = np.asmatrix(train_y)

print(train_X.shape)
print(train_y.shape)

plt.figure(figsize = (6, 4))
plt.plot(train_X[:M,0], train_X[:M,1], 'b.', alpha = 0.4, label = 'A')
plt.plot(train_X[M:,0], train_X[M:,1], 'r.', alpha = 0.4, label = 'B')
plt.axis('equal')
plt.xlim([-1, 2]); plt.ylim([-1, 2]);
plt.grid(alpha = 0.15)
plt.legend(fontsize = 12)
plt.show()

(800, 2)
(800, 1)