# Mount Google Drive so the Excel datasets under /content/drive are readable
# (Colab-only; prompts for an authorization code on first run).
from google.colab import drive
drive.mount('/content/drive')
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import tensorflow as tf
import seaborn as sns
import random  # NOTE(review): appears unused in this file — confirm before removing
Regression analysis is the process of finding a function $f(x)$ whose output $\hat{y}$ is as close as possible to the dependent variable $y$ corresponding to the independent variable $x$.
$$ \hat{y} = f(x) \approx y $$
If $f(x)$ is a linear function, then this function is called a linear regression model.
$$ \hat{y} = \omega_0 + \omega_1 x_1 + \omega_2 x_2 + \dots + \omega_D x_D = \omega_0 + \omega^T x $$
In the above equation, the independent variable $x = (x_1, x_2, \dots, x_D)$ is a $D$-dimensional vector. The weight vector $\omega = (\omega_1, \dots, \omega_D)$, together with the bias $\omega_0$, gives the coefficients of the function $f(x)$ and constitutes the parameters of this linear regression model.
A network of multiple neurons acts as a complex, nonlinear universal function approximator.
# Load the training set from Google Drive; the first spreadsheet column is
# the row index.
train_dataset = pd.read_excel(
    '/content/drive/MyDrive/tutorials/산학협동강좌/data/train_dataset.xlsx',
    index_col=0,
    engine='openpyxl',
)
# If the files were copied locally (e.g. from USB), use instead:
# train_dataset = pd.read_excel('train_dataset.xlsx', index_col=0, engine='openpyxl')
train_dataset
train_dataset.describe()
# Split into features (all columns but the target) and the target column.
train_x = train_dataset.drop(columns='Weight (g)')
train_y = train_dataset['Weight (g)']
# Load the held-out test set from Google Drive; the first spreadsheet column
# is the row index.
test_dataset = pd.read_excel(
    '/content/drive/MyDrive/tutorials/산학협동강좌/data/test_dataset.xlsx',
    index_col=0,
    engine='openpyxl',
)
# If the files were copied locally (e.g. from USB), use instead:
# test_dataset = pd.read_excel('test_dataset.xlsx', index_col=0, engine='openpyxl')
test_x = test_dataset.drop(columns='Weight (g)')
test_y = test_dataset['Weight (g)']
# Sanity-check the shapes of all four splits.
print(f'train_x: {train_x.shape}, train_y: {train_y.shape}')
print(f'test_x: {test_x.shape}, test_y: {test_y.shape}')
# Ordinary least-squares baseline: fit on the training split, then predict
# the target for the test split.
from sklearn.linear_model import LinearRegression

reg = LinearRegression().fit(train_x, train_y)
print(f'Regression Coefficient: \n{reg.coef_}\n')
print(f'Regression Bias: \n{reg.intercept_}')
pred_lr = reg.predict(test_x)
print(pred_lr)
# Overlay the model's predictions and the ground truth on the same axes.
fig, ax = plt.subplots(figsize=(8, 6))
ax.plot(pred_lr, 'ro--', label='Prediction')
ax.plot(np.array(test_y), 'bo--', label='Ground Truth')
ax.legend(fontsize=13)
ax.set_ylabel('Weight (g)', fontsize=13)
ax.set_title('Linear Regression', fontsize=13)
ax.set_ylim([21, 31])
ax.tick_params(axis='both', labelsize=12)
plt.show()
`n_estimators`: how many trees to include in the ensemble?
from sklearn.ensemble import RandomForestRegressor
# Random-forest baseline. n_estimators is the number of trees in the
# ensemble; max_depth=None lets each tree grow until its leaves are pure
# (the original max_depth=10000 was an effectively-unreachable bound that
# obscured this intent). random_state fixes bootstrap/feature sampling so
# the run is reproducible.
reg = RandomForestRegressor(n_estimators=100,
                            max_depth=None,
                            random_state=42)
reg.fit(train_x, train_y)
pred_rf = reg.predict(test_x)
# Overlay the random-forest predictions and the ground truth.
fig, ax = plt.subplots(figsize=(8, 6))
ax.plot(pred_rf, 'ro--', label='Prediction')
ax.plot(np.array(test_y), 'bo--', label='Ground Truth')
ax.legend(fontsize=13)
ax.set_ylabel('Weight (g)', fontsize=13)
ax.set_title('Random Forest', fontsize=13)
ax.set_ylim([21, 31])
ax.tick_params(axis='both', labelsize=12)
plt.show()
# Fix TensorFlow's RNG so weight initialisation is reproducible.
tf.random.set_seed(42)

# Fully-connected regression network: five hidden ReLU layers of 10 units
# each, followed by one linear output unit.
hidden_layers = [tf.keras.layers.Dense(units=10, activation='relu')
                 for _ in range(5)]
model = tf.keras.models.Sequential(
    [tf.keras.layers.Input(shape=(train_x.shape[1],))]
    + hidden_layers
    + [tf.keras.layers.Dense(units=1, activation=None)]
)
model.summary()

# Mean-squared-error objective with the Adam optimiser.
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
              loss='mse')
# fit() actually returns a History object; kept under the original name.
loss = model.fit(train_x, train_y, epochs=50, verbose=0)
pred_dnn = model.predict(test_x)
# Overlay the network's predictions and the ground truth.
fig, ax = plt.subplots(figsize=(8, 6))
ax.plot(pred_dnn, 'ro--', label='Prediction')
ax.plot(np.array(test_y), 'bo--', label='Ground Truth')
ax.legend(fontsize=13, loc=1)
ax.set_ylabel('Weight (g)', fontsize=13)
ax.set_title('Deep Neural Network', fontsize=13)
ax.set_ylim([21, 31])
ax.tick_params(axis='both', labelsize=12)
plt.show()
Unlike toy examples (e.g., the MNIST dataset), applying artificial intelligence to industrial applications raises a variety of issues; the curse of dimensionality is one of them. Because several physical phenomena overlap in industrial data, it is difficult to tell from domain knowledge alone which features matter for the predicted values.
To ensure statistical stability, the number of features should be kept commensurate with the number of data points. We therefore use correlation analysis to select important features for data-driven prediction.
# Build an upper-triangle mask: True above (and on) the diagonal = hidden,
# False below = shown, so each correlation appears only once.
corr = train_dataset.corr()
mask = np.triu(np.ones_like(corr, dtype=bool))

plt.figure(figsize=(16, 16))
sns.heatmap(corr,
            cmap='RdYlBu_r',
            annot=True,               # print the correlation value in each cell
            mask=mask,                # hide the redundant (upper) triangle
            linewidths=1,             # thin separating lines between cells
            cbar_kws={"shrink": .5},  # half-size colour bar
            vmin=-1, vmax=1           # colour bar spans the full range -1..1
            )
plt.show()
# Correlation of every column with the target 'Weight (g)'.
weight_corr = train_dataset.corr()['Weight (g)']
# Displayed as notebook cell output in descending order; the sorted result
# itself is not stored.
weight_corr.sort_values(ascending = False)
# Candidate columns to drop: those only weakly correlated with the target
# (|r| < 0.8).
del_features = weight_corr[weight_corr.abs() < 0.8]
del_features.keys()
# NOTE(review): only the weak columns from position 5 onward are dropped,
# i.e. the first five weakly-correlated columns are kept — presumably to
# retain a fixed feature budget; confirm this slice is intended.
train_x_fs = train_x.drop(list(del_features.keys()[5:]), axis = 1)
test_x_fs = test_x.drop(list(del_features.keys()[5:]), axis = 1)
print('train_x_fs: {}, train_y: {}'.format(train_x_fs.shape, train_y.shape))
print('test_x_fs: {}, test_y: {}'.format(test_x_fs.shape, test_y.shape))
# Refit the linear baseline on the reduced (feature-selected) inputs and
# plot its predictions against the ground truth.
from sklearn.linear_model import LinearRegression

reg = LinearRegression().fit(train_x_fs, train_y)
print(f'Regression Coefficient: \n{reg.coef_}\n')
print(f'Regression Bias: \n{reg.intercept_}')
pred_s_lr = reg.predict(test_x_fs)

fig, ax = plt.subplots(figsize=(8, 6))
ax.plot(pred_s_lr, 'ro--', label='Prediction')
ax.plot(np.array(test_y), 'bo--', label='Ground Truth')
ax.legend(fontsize=13)
ax.set_ylabel('Weight (g)', fontsize=13)
ax.set_title('Linear Regression with FS', fontsize=13)
ax.set_ylim([21, 31])
ax.tick_params(axis='both', labelsize=12)
plt.show()
# Random forest on the reduced (feature-selected) inputs. max_depth=None
# lets each tree grow until its leaves are pure (the original
# max_depth=10000 was an effectively-unreachable bound with the same
# effect); random_state=42 keeps the run reproducible.
from sklearn.ensemble import RandomForestRegressor

reg = RandomForestRegressor(n_estimators=100,
                            max_depth=None,
                            random_state=42).fit(train_x_fs, train_y)
pred_s_rf = reg.predict(test_x_fs)

# Plot predictions against the ground truth.
plt.figure(figsize=(8, 6))
plt.plot(pred_s_rf, 'ro--', label='Prediction')
plt.plot(np.array(test_y), 'bo--', label='Ground Truth')
plt.legend(fontsize=13)
plt.ylabel('Weight (g)', fontsize=13)
plt.title('Random Forest with FS', fontsize=13)
plt.ylim([21, 31])
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.show()
# Reproducible initialisation, then the same 5x10-unit ReLU network as
# before, trained on the reduced (feature-selected) inputs.
tf.random.set_seed(42)

hidden_layers = [tf.keras.layers.Dense(units=10, activation='relu')
                 for _ in range(5)]
model = tf.keras.models.Sequential(
    [tf.keras.layers.Input(shape=(train_x_fs.shape[1],))]
    + hidden_layers
    + [tf.keras.layers.Dense(units=1, activation=None)]
)
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
              loss='mse')
# fit() actually returns a History object; kept under the original name.
loss = model.fit(train_x_fs, train_y, epochs=50, verbose=0)
pred_s_dnn = model.predict(test_x_fs)

# Plot predictions against the ground truth.
fig, ax = plt.subplots(figsize=(8, 6))
ax.plot(pred_s_dnn, 'ro--', label='Prediction')
ax.plot(np.array(test_y), 'bo--', label='Ground Truth')
ax.legend(fontsize=13)
ax.set_ylabel('Weight (g)', fontsize=13)
ax.set_title('Deep Neural Network with FS', fontsize=13)
ax.set_ylim([21, 31])
ax.tick_params(axis='both', labelsize=12)
plt.show()