Machine Learning for Mechanical Engineering

kNN and Decision Tree


Prof. Seungchul Lee
http://iailab.kaist.ac.kr/
Industrial AI Lab at KAIST
  • For your handwritten solutions, please scan or take a picture of them. Alternatively, you can write them in markdown if you prefer.

  • Only .ipynb files will be graded for your code.

    • Ensure that your NAME and student ID are included in the name of your .ipynb file. ex) SeungchulLee_20241234_HW01.ipynb

  • Compress all the files into a single .zip file.

    • Include your NAME and student ID in the .zip file's name. ex) SeungchulLee_20241234_HW01.zip

    • Submit this .zip file on KLMS.

  • Do not submit a printed version of your code, as it will not be graded.

Problem 1

In our class, we used the kNN functions built into scikit-learn. Here, you are asked to write your own function for a kNN regressor.

In [ ]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

N = 100
w1 = 0.5
w0 = 2
x = np.random.normal(0, 15, N).reshape(-1,1)
y = w1*x + w0 + 5*np.random.normal(0, 1, N).reshape(-1,1)

plt.figure(figsize = (10, 8))
plt.title('Data Set', fontsize = 15)
plt.plot(x, y, '.', label = 'Data')
plt.xlabel('X', fontsize = 15)
plt.ylabel('Y', fontsize = 15)
plt.legend(fontsize = 15)
plt.axis('equal')
plt.axis([-40, 40, -30, 30])
plt.grid(alpha = 0.3)
plt.show()
In [ ]:
# here is your function for KNNReg
#
In [ ]:
# here is your code for plotting
#
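
If it helps to check your implementation, a minimal brute-force sketch of a kNN regressor is shown below for reference only. The function name KNNReg, its signature, and the choice k = 5 are illustrative assumptions, not a required interface.

In [ ]:
# minimal brute-force kNN regressor sketch (illustrative only)
def KNNReg(x_train, y_train, x_test, k = 5):
    y_pred = np.zeros([x_test.shape[0], 1])
    for i in range(x_test.shape[0]):
        # Euclidean distance from the i-th query point to every training point
        dist = np.linalg.norm(x_train - x_test[i, :], axis = 1)
        # indices of the k nearest training points
        idx = np.argsort(dist)[:k]
        # prediction is the average of the k neighbors' targets
        y_pred[i] = np.mean(y_train[idx])
    return y_pred

# example usage with the data generated above (the query grid xp is an assumed choice)
xp = np.linspace(-30, 30, 200).reshape(-1, 1)
yp = KNNReg(x, y, xp, k = 5)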

Problem 2

In our class, we used the kNN functions built into scikit-learn. Here, you are asked to write your own function for kNN classification.

In [ ]:
m = 1000
X = -1.5 + 3*np.random.uniform(size = (m,2))

y = np.zeros([m,1])
for i in range(m):
    if np.linalg.norm(X[i,:], 2) <= 1:
        if np.random.uniform() < 0.05:
            y[i] = 0
        else:
            y[i] = 1
    else:
        if np.random.uniform() < 0.05:
            y[i] = 1
        else:
            y[i] = 0

C1 = np.where(y == 1)[0]
C0 = np.where(y == 0)[0]

theta = np.linspace(0, 2*np.pi, 100)

plt.figure(figsize = (8,8))
plt.plot(X[C1,0], X[C1,1], 'o', label = 'C1', markerfacecolor = "k", markeredgecolor = 'k', markersize = 4)
plt.plot(X[C0,0], X[C0,1], 'o', label = 'C0', markerfacecolor = "None", alpha = 0.3, markeredgecolor = 'k', markersize = 4)
plt.plot(np.cos(theta), np.sin(theta), '--', color = 'orange')
plt.axis([-1.5, 1.5, -1.5, 1.5])
plt.axis('equal')
plt.axis('off')
plt.show()
In [ ]:
# here is your function for KNNclf
#
In [ ]:
# here is your code for plotting
#
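
As with Problem 1, a minimal brute-force sketch is given below for reference only. The function name KNNclf, the choice k = 5, and the majority vote via rounding the mean of binary labels are illustrative assumptions, not a required design.

In [ ]:
# minimal brute-force kNN classifier sketch (illustrative only, binary 0/1 labels)
def KNNclf(X_train, y_train, X_test, k = 5):
    y_pred = np.zeros([X_test.shape[0], 1])
    for i in range(X_test.shape[0]):
        # Euclidean distance from the i-th query point to every training point
        dist = np.linalg.norm(X_train - X_test[i, :], axis = 1)
        # indices of the k nearest training points
        idx = np.argsort(dist)[:k]
        # majority vote: with labels in {0, 1}, rounding the mean gives the majority class
        y_pred[i] = np.round(np.mean(y_train[idx]))
    return y_pred

# example usage with the data generated above (the query grid is an assumed choice)
X1gr, X2gr = np.meshgrid(np.linspace(-1.5, 1.5, 50), np.linspace(-1.5, 1.5, 50))
Xp = np.hstack([X1gr.reshape(-1, 1), X2gr.reshape(-1, 1)])
yp = KNNclf(X, y, Xp, k = 5)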

Problem 3

Look at the following training set for a decision tree that classifies names as either location or person names:


  1. Calculate the disorder of attribute 1 (the disorder measure is recalled below for reference).

  2. Calculate the disorder of attribute 2.

  3. Which attribute is more useful, attribute 1 or attribute 2? Why?
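
For reference, and assuming the entropy-based disorder measure used in class: if a branch $b$ of a split contains $n_b$ training examples, of which $n_b^{\text{loc}}$ are location names and $n_b^{\text{per}}$ are person names, its disorder is

$$D(b) = -\frac{n_b^{\text{loc}}}{n_b}\log_2\frac{n_b^{\text{loc}}}{n_b} - \frac{n_b^{\text{per}}}{n_b}\log_2\frac{n_b^{\text{per}}}{n_b}$$

and the disorder of an attribute is the weighted average of the disorders of its branches,

$$D(\text{attribute}) = \sum_{b}\frac{n_b}{n_{\text{total}}}\,D(b)$$

A lower average disorder indicates a more useful attribute.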

Problem 4

  1. (Choose the correct answers) Jonathan just trained a decision tree for a digit recognition task. He notices an extremely low training error but an abnormally large test error. He also notices that an SVM with a linear kernel performs much better than his tree. What could be the cause of his problem? (2 choices)

$\quad$ a. Decision tree is too deep

$\quad$ b. Decision tree is overfitting

$\quad$ c. Learning rate too high

$\quad$ d. There is too much training data

  2. (True or False) Random forests are better than decision trees in terms of avoiding overfitting the training set. Why or why not?

  3. We have seen that averaging the outputs from multiple models typically gives better results than using just one model. Let's say that we are going to average the outputs from 10 models. Of course, we want 10 good models, i.e., models that also perform well individually. Do you think it would be better to instead average the weights and biases of the 10 networks? Why or why not?

Problem 5

As with most pattern recognition methods, tree-based methods work best when proper features are selected. Discuss why preprocessing with PCA can be effective at finding “important” axes, allowing a decision tree with a smaller maximum depth.
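
As an optional illustration (not part of the required answer), the sketch below fits fully grown decision trees with and without PCA preprocessing on synthetic 2-D data whose discriminative direction is diagonal to the original axes. The data-generation recipe and all parameter choices are assumptions made purely for illustration.

In [ ]:
# illustrative sketch only: compare tree depth with and without PCA preprocessing
# (synthetic data and parameter choices are assumptions, not part of the problem)
import numpy as np
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeClassifier

np.random.seed(0)
m = 500
t = np.random.normal(0, 2, m)                  # large variation along the 45-degree direction
n = np.random.normal(0, 0.3, m)                # small variation across it
X = np.c_[t + n, t - n] / np.sqrt(2)           # data elongated along the diagonal
y = (t > 0).astype(int)                        # label depends only on the diagonal coordinate

# fully grown tree on the raw (axis-aligned) features
tree_raw = DecisionTreeClassifier().fit(X, y)

# fully grown tree on the PCA-rotated features
Z = PCA(n_components = 2).fit_transform(X)
tree_pca = DecisionTreeClassifier().fit(Z, y)

print('tree depth without PCA:', tree_raw.get_depth())
print('tree depth with PCA   :', tree_pca.get_depth())
In this sketch, a single axis-aligned threshold separates the classes in the PCA-rotated coordinates, whereas in the original coordinates the tree must approximate a diagonal boundary with many axis-aligned splits, hence the larger depth.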