KAIST Industry-Academia Cooperation Open Lecture

Artificial Intelligence and Design: From Analysis Prediction to Design Optimization


Prof. Seungchul Lee
http://iailab.kaist.ac.kr/
Industrial AI Lab at KAIST

Practice Aims & Objectives

  1. Implement machine learning and deep learning models for regression tasks
  2. Perform feature selection with correlation analysis
  3. Compare model performance before and after feature selection
In [ ]:
from google.colab import drive
drive.mount('/content/drive')
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
In [ ]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import tensorflow as tf
import seaborn as sns
import random

1. Understand Machine Learning and Deep Learning Models

1.1 Linear Regression

Regression analysis is the process of finding the function $f(x)$ whose output $\hat{y}$ is as close as possible to the dependent variable $y$ corresponding to the independent variable $x$.

$$ \hat{y} = f(x) \approx y$$

If $f(x)$ is a linear function, then this function is called a linear regression model.

$$ \hat{y} = \omega_0 + \omega_1x_1 + \omega_2x_2 + \cdots + \omega_Dx_D = \omega_0 + \omega^Tx$$

In the above equation, the independent variable $x = (x_1, x_2, \ldots, x_D)$ is a $D$-dimensional vector. The bias $\omega_0$ and the weight vector $\omega = (\omega_1, \ldots, \omega_D)$ are the coefficients of the function $f(x)$ and the parameters of this linear regression model.
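The parameters can be fit by ordinary least squares. Below is a minimal NumPy sketch on synthetic data (illustration only, not part of the practice dataset) that computes the closed-form solution of $\left(X^TX\right)\omega = X^Ty$ directly:

In [ ]:
import numpy as np

# Synthetic data for illustration: 100 samples, 3 features
rng = np.random.default_rng(0)
X = rng.normal(size = (100, 3))
y = 2.0 + X @ np.array([1.0, -0.5, 0.3]) + 0.01 * rng.normal(size = 100)

# Append a column of ones so the bias omega_0 is estimated jointly
X_aug = np.hstack([np.ones((100, 1)), X])

# Ordinary least squares: solve (X^T X) omega = X^T y
omega = np.linalg.solve(X_aug.T @ X_aug, X_aug.T @ y)
print(omega)   # approximately [2.0, 1.0, -0.5, 0.3]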



1.2 Random Forest

  • An ensemble learning method for classification, regression, and other tasks that operates by constructing a multitude of decision trees at training time and outputting the mode of the classes (classification) or the mean prediction (regression) of the individual trees, as sketched below
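For regression, the forest output is simply the mean of the individual tree predictions. A minimal sketch on synthetic data (illustration only) that verifies this using scikit-learn's estimators_ attribute:

In [ ]:
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data for illustration
rng = np.random.default_rng(0)
X = rng.normal(size = (200, 4))
y = X[:, 0] + np.sin(X[:, 1])

reg = RandomForestRegressor(n_estimators = 10, random_state = 42).fit(X, y)

# The forest prediction equals the mean over the individual trees
tree_preds = np.stack([tree.predict(X[:5]) for tree in reg.estimators_])
print(np.allclose(tree_preds.mean(axis = 0), reg.predict(X[:5])))   # True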



1.3 Deep Neural Network

  • Complex/nonlinear universal function approximator

    • Linearly connected networks
    • Simple nonlinear neurons
  • Multiple neurons
  • Differentiable activation function
  • In a compact representation (see the sketch after this list)
  • Multi-layer perceptron
  • Hidden layers

    • Autonomous feature learning

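As a sketch of the compact representation mentioned above: with input $h^{(0)} = x$, each layer $l$ applies an affine map followed by a differentiable nonlinear activation $\sigma$,

$$ h^{(l)} = \sigma\left(W^{(l)} h^{(l-1)} + b^{(l)}\right), \qquad \hat{y} = W^{(L)} h^{(L-1)} + b^{(L)} $$

Stacking such layers yields the multi-layer perceptron used in Section 2.5.
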
2. Build and Predict Machine Learning and Deep Learning Models

2.1 Data Description

  • Injection molding dataset consisting of 36 mold shapes
  • This dataset consists of 5 process features (control parameters), 32 mold shape features, and the weight of the molded part (ground truth).


In [ ]:
train_dataset = pd.read_excel('/content/drive/MyDrive/tutorials/산학협동강좌/data/train_dataset.xlsx',
                              index_col = 0,
                              engine = 'openpyxl')

# If you downloaded the file via USB:
# train_dataset = pd.read_excel('train_dataset.xlsx',
#                               index_col = 0,
#                               engine = 'openpyxl')


train_dataset
Out[ ]:
Fill time (sec) Melt temperature (℃) Mold temperature (℃) Packing pressure (MPa) Packing time (sec) OverallProjectionAreaZX MinGateHydraulicDiameter CavityVolume StdFlowLength RunnerSurfaceArea ... NumberOfCavities AvgPartThickness AvgGateHydraulicDiameter Overall_Compactness Cavity_Compactness Runner_Compactness CavitySurfaceToVolume RunnerSurfaceToVolume OverallSurfaceToVolume Weight (g)
MoldNumber
1 0.7 232 31 43 2.7 19.686 0.55 14.804 0.601 16.797 ... 2 1.253 0.813 1.398630 1.323422 2.400429 15.112335 8.331845 14.299703 13.7819
1 0.5 220 40 51 1.3 19.686 0.55 14.804 0.601 16.797 ... 2 1.253 0.813 1.398630 1.323422 2.400429 15.112335 8.331845 14.299703 13.1317
1 1.4 221 51 46 1.3 19.686 0.55 14.804 0.601 16.797 ... 2 1.253 0.813 1.398630 1.323422 2.400429 15.112335 8.331845 14.299703 13.3700
1 0.8 234 34 61 2.0 19.686 0.55 14.804 0.601 16.797 ... 2 1.253 0.813 1.398630 1.323422 2.400429 15.112335 8.331845 14.299703 13.5909
1 1.1 239 44 37 2.6 19.686 0.55 14.804 0.601 16.797 ... 2 1.253 0.813 1.398630 1.323422 2.400429 15.112335 8.331845 14.299703 13.7390
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
35 1.5 225 30 70 2.0 77.579 0.98 14.441 0.259 39.768 ... 2 1.154 1.011 1.352616 1.188687 2.354154 16.825289 8.495621 14.786163 16.1268
35 1.5 225 30 70 3.0 77.579 0.98 14.441 0.259 39.768 ... 2 1.154 1.011 1.352616 1.188687 2.354154 16.825289 8.495621 14.786163 16.4245
35 1.5 240 45 30 1.0 77.579 0.98 14.441 0.259 39.768 ... 2 1.154 1.011 1.352616 1.188687 2.354154 16.825289 8.495621 14.786163 15.4136
35 1.5 240 45 30 2.0 77.579 0.98 14.441 0.259 39.768 ... 2 1.154 1.011 1.352616 1.188687 2.354154 16.825289 8.495621 14.786163 15.8083
35 1.5 240 45 30 3.0 77.579 0.98 14.441 0.259 39.768 ... 2 1.154 1.011 1.352616 1.188687 2.354154 16.825289 8.495621 14.786163 16.1344

3500 rows × 38 columns

In [ ]:
train_dataset.describe()
Out[ ]:
Fill time (sec) Melt temperature (℃) Mold temperature (℃) Packing pressure (MPa) Packing time (sec) OverallProjectionAreaZX MinGateHydraulicDiameter CavityVolume StdFlowLength RunnerSurfaceArea ... NumberOfCavities AvgPartThickness AvgGateHydraulicDiameter Overall_Compactness Cavity_Compactness Runner_Compactness CavitySurfaceToVolume RunnerSurfaceToVolume OverallSurfaceToVolume Weight (g)
count 3500.000000 3500.000000 3500.000000 3500.000000 3500.000000 3500.000000 3500.000000 3500.000000 3500.000000 3500.000000 ... 3500.000000 3500.000000 3500.00000 3500.000000 3500.000000 3500.000000 3500.000000 3500.000000 3500.000000 3500.000000
mean 1.002686 224.865714 45.266000 49.784000 2.018086 31.048743 1.134171 60.416314 2.088086 48.864543 ... 2.085714 2.159286 1.19860 2.254255 2.197544 1.593867 10.974949 22.288319 10.008039 51.881052
std 0.342842 10.043360 9.865042 13.314482 0.673256 27.991635 0.529301 46.290953 6.463764 57.341880 ... 1.360266 0.888018 0.66395 0.801692 0.892460 0.944723 5.255846 16.967633 3.424420 37.638686
min 0.500000 210.000000 30.000000 30.000000 0.500000 3.265000 0.319000 5.932000 0.000000 0.129000 ... 1.000000 0.725000 0.57600 1.086974 0.758980 0.447284 4.462667 6.053965 4.559499 9.887500
25% 0.700000 216.000000 37.000000 38.000000 1.400000 9.128000 0.924000 28.069000 0.000000 0.626000 ... 1.000000 1.418000 0.92400 1.556347 1.464589 0.465116 6.766880 8.495621 6.847776 22.670925
50% 1.000000 225.000000 45.000000 50.000000 2.000000 20.782000 0.980000 49.139000 0.153000 22.034000 ... 1.000000 1.956000 1.00200 1.960123 1.960405 2.026478 10.201973 9.869339 10.203440 45.398450
75% 1.300000 234.000000 54.000000 61.000000 2.600000 45.083000 1.482000 78.400000 0.601000 79.879000 ... 4.000000 2.899000 1.48200 2.920656 2.955572 2.354154 13.655710 43.000000 12.850604 59.060725
max 1.500000 240.000000 60.000000 70.000000 3.000000 125.475000 3.578000 207.283000 37.500000 191.761000 ... 5.000000 3.960000 4.69800 4.386447 4.481625 3.303620 26.351158 44.714286 18.399706 182.257600

8 rows × 38 columns

2.2 Construct Train and Test Dataset

  • Mold shapes 1 ~ 35: train dataset
  • Held-out test mold: test dataset
In [ ]:
train_x = train_dataset.drop('Weight (g)', axis = 1)
train_y = train_dataset.loc[:, 'Weight (g)']

test_dataset = pd.read_excel('/content/drive/MyDrive/tutorials/산학협동강좌/data/test_dataset.xlsx',
                             index_col = 0,
                             engine = 'openpyxl')

# If you downloaded the file via USB:
# test_dataset = pd.read_excel('test_dataset.xlsx',
#                              index_col = 0,
#                              engine = 'openpyxl')

test_x = test_dataset.drop('Weight (g)', axis = 1)
test_y = test_dataset.loc[:, 'Weight (g)']

print('train_x: {}, train_y: {}'.format(train_x.shape, train_y.shape))
print('test_x: {}, test_y: {}'.format(test_x.shape, test_y.shape))
train_x: (3500, 37), train_y: (3500,)
test_x: (100, 37), test_y: (100,)

2.3 Linear Regression Modeling

In [ ]:
from sklearn.linear_model import LinearRegression

reg = LinearRegression().fit(train_x, train_y)
In [ ]:
print('Regression Coefficient: \n{}\n'.format(reg.coef_))
print('Regression Bias: \n{}'.format(reg.intercept_))
Regression Coefficient: 
[ 7.71710835e-01 -3.19812614e-02 -2.49243899e-02  3.31152425e-03
  7.44199836e-01 -1.18719384e-01 -1.39731152e+02 -3.50739358e+03
  1.58253392e+01 -2.19842350e+03 -2.19842406e+03 -2.97184154e+02
  9.53226836e+00  4.44332598e+01 -3.96366900e-01 -2.32062972e+00
 -3.26819880e+01 -8.69045736e-01 -3.74451683e-02 -7.21563953e+00
  2.19844363e+03  3.50826044e+03 -3.50712687e+03 -5.89890297e+01
  1.91988137e+02 -1.16484245e+01 -2.10800016e-01  2.17656934e+00
  2.81864581e+00 -4.65071586e+00 -4.45489515e+01 -3.95507162e+01
  4.18233457e+01  7.86503291e+00  1.97037478e+00  3.47124091e-01
 -2.35971505e+00]

Regression Bias: 
-23.152042554052834
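Note the large, nearly canceling coefficients (e.g., about $-3507$ and $+3508$): a typical symptom of strong multicollinearity among the shape features, which motivates the feature selection in Section 3.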
In [ ]:
pred_lr = reg.predict(test_x)

print(pred_lr)
[28.73689182 27.45943159 27.85232047 28.97871945 29.27536825 29.82853293
 29.32334135 29.7689617  29.11690179 28.85082292 28.10863057 29.22590309
 29.06674493 28.31248811 29.87379821 28.09077154 29.37008561 28.55386492
 28.92239827 28.73599575 27.9104892  28.06897856 28.59764758 28.32907063
 28.5403697  28.57227216 28.34471971 27.99995204 29.28922192 28.56464774
 28.88345475 28.46864172 28.72691885 28.72231828 28.90968127 28.68449853
 29.57019488 28.54721663 28.77939432 30.24654283 29.32049995 29.46878455
 29.32209367 27.73418205 28.99499946 29.29748179 29.05209479 28.51578413
 29.87678766 28.80247587 27.54602986 28.44057129 28.72179073 28.30980943
 28.18470521 29.75556879 27.73003314 30.07298571 28.29758293 28.66809105
 28.97864665 28.29041079 28.36312169 29.33915134 29.96363876 28.57069275
 28.00265501 28.88283972 28.46110322 28.29828081 29.08656903 29.23661205
 28.8327277  28.45849813 29.20269797 29.9468978  27.67114385 28.41534368
 29.15954352 26.88378956 27.6279894  28.37218923 28.60294867 29.34714851
 30.09134834 27.61690293 28.36110277 29.1053026  27.95114619 28.69534603
 29.43954586 28.54870775 29.29290759 30.03710743 28.88295101 29.62715085
 30.37135069 27.89690527 28.64110511 29.38530495]
In [ ]:
plt.figure(figsize = (8, 6))

plt.plot(pred_lr, 'ro--', label = 'Prediction')
plt.plot(np.array(test_y), 'bo--', label = 'Ground Truth')

plt.legend(fontsize = 13)
plt.ylabel('Weight (g)', fontsize = 13)
plt.title('Linear Regression', fontsize = 13)
plt.ylim([21, 31])
plt.xticks(fontsize = 12)
plt.yticks(fontsize = 12)

plt.show()
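
To quantify the fit rather than judge it from the plot alone, standard regression metrics can be computed. A minimal sketch, assuming pred_lr and test_y from the cells above:

In [ ]:
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

mae = mean_absolute_error(test_y, pred_lr)
rmse = np.sqrt(mean_squared_error(test_y, pred_lr))
r2 = r2_score(test_y, pred_lr)
print('MAE: {:.4f}, RMSE: {:.4f}, R2: {:.4f}'.format(mae, rmse, r2))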

2.4 Random Forest

  • n_estimators: how many trees to ensemble?
In [ ]:
from sklearn.ensemble import RandomForestRegressor

reg = RandomForestRegressor(n_estimators = 100,   # number of trees in the forest
                            max_depth = 10000,    # effectively unlimited tree depth
                            random_state = 42)    # for reproducibility
reg.fit(train_x, train_y)
pred_rf = reg.predict(test_x)
In [ ]:
plt.figure(figsize = (8, 6))
plt.plot(pred_rf, 'ro--', label = 'Prediction')
plt.plot(np.array(test_y), 'bo--', label = 'Ground Truth')
plt.legend(fontsize = 13)
plt.ylabel('Weight (g)', fontsize = 13)
plt.title('Random Forest', fontsize = 13)
plt.ylim([21, 31])
plt.xticks(fontsize = 12)
plt.yticks(fontsize = 12)
plt.show()

2.5 Deep Neural Networks (Multi-Layer Perceptron)

  • Raw dataset
  • Feature-selected dataset

In [ ]:
tf.random.set_seed(42)

model = tf.keras.models.Sequential([
    tf.keras.layers.Input(shape = (train_x.shape[1],)),
    tf.keras.layers.Dense(units = 10, activation = 'relu'),
    tf.keras.layers.Dense(units = 10, activation = 'relu'),
    tf.keras.layers.Dense(units = 10, activation = 'relu'),
    tf.keras.layers.Dense(units = 10, activation = 'relu'),
    tf.keras.layers.Dense(units = 10, activation = 'relu'),
    tf.keras.layers.Dense(units = 1, activation = None)
])
In [ ]:
model.summary()
Model: "sequential_4"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 dense_24 (Dense)            (None, 10)                380       
                                                                 
 dense_25 (Dense)            (None, 10)                110       
                                                                 
 dense_26 (Dense)            (None, 10)                110       
                                                                 
 dense_27 (Dense)            (None, 10)                110       
                                                                 
 dense_28 (Dense)            (None, 10)                110       
                                                                 
 dense_29 (Dense)            (None, 1)                 11        
                                                                 
=================================================================
Total params: 831 (3.25 KB)
Trainable params: 831 (3.25 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
In [ ]:
model.compile(optimizer = tf.keras.optimizers.Adam(learning_rate = 0.001),
              loss = 'mse')
In [ ]:
loss = model.fit(train_x, train_y, epochs = 50, verbose = 0)
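The History object returned by fit records the per-epoch training loss. A quick sketch to check convergence, using the loss variable from the cell above:

In [ ]:
plt.figure(figsize = (8, 4))
plt.plot(loss.history['loss'])
plt.xlabel('Epoch', fontsize = 13)
plt.ylabel('MSE Loss', fontsize = 13)
plt.title('DNN Training Loss', fontsize = 13)
plt.show()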
In [ ]:
pred_dnn = model.predict(test_x)

plt.figure(figsize = (8, 6))
plt.plot(pred_dnn, 'ro--', label = 'Prediction')
plt.plot(np.array(test_y), 'bo--', label = 'Ground Truth')
plt.legend(fontsize = 13, loc = 1)
plt.ylabel('Weight (g)', fontsize = 13)
plt.title('Deep Neural Network', fontsize = 13)
plt.ylim([21, 31])
plt.xticks(fontsize = 12)
plt.yticks(fontsize = 12)
plt.show()
4/4 [==============================] - 0s 3ms/step

3. Understand Potential Problems in Industrial Data

3.1 Curse of Dimensionality

Unlike toy examples (e.g., the MNIST dataset), applying artificial intelligence to industrial applications raises a variety of issues, and the curse of dimensionality is one of them. Because several physical phenomena overlap in industrial data, it is difficult to identify, from domain knowledge alone, which features are important for the predicted value.



To ensure statistical stability, the number of features should be kept small relative to the number of data points. Therefore, we use correlation analysis to select important features for data-driven prediction.

3.2 Correlation Analysis

  • Which features are statistically significant?
    • Correlation analysis: a statistical method that measures the strength of the relationship between variables (see the formula below).
  • Analyze which features are highly related to the weight of the molded part.
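
By default, pandas' DataFrame.corr computes the Pearson correlation coefficient. For two variables $x$ and $y$ with $n$ samples,

$$ r_{xy} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}} $$

where $r_{xy}$ near $\pm 1$ indicates a strong linear relationship and $r_{xy}$ near $0$ indicates almost none.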


In [ ]:
# Build a triangular mask (True for the upper triangle, False for the lower triangle)

mask = np.zeros_like(train_dataset.corr(), dtype = bool)
mask[np.triu_indices_from(mask)] = True

plt.figure(figsize = (16, 16))
sns.heatmap(train_dataset.corr(),
            cmap = 'RdYlBu_r',
            annot = True,               # display the actual values
            mask = mask,                # hide the masked (upper-triangle) cells
            linewidths = 1,             # separate cells with lines
            cbar_kws = {"shrink": .5},  # shrink the colorbar to half size
            vmin = -1, vmax = 1         # colorbar range from -1 to 1
           )
plt.show()
In [ ]:
weight_corr = train_dataset.corr()['Weight (g)']
weight_corr.sort_values(ascending = False)
Out[ ]:
Weight (g)                  1.000000
OverallVolume               0.998736
CavityVolume                0.985415
CavitySurfaceArea           0.858996
OverallSurfaceArea          0.856717
OverallProjectionAreaXY     0.729818
OverallProjectionAreaYZ     0.556189
MaxPartThickness            0.514668
OverallProjectionAreaZX     0.485988
Cavity_Compactness          0.484543
Overall_Compactness         0.479901
AvgFlowLength               0.434925
StdPartThickness            0.422464
MaxFlowLength               0.420291
MinFlowLength               0.416272
AvgPartThickness            0.404082
RunnerVolume                0.355200
RunnerSurfaceArea           0.270440
MaxFlowLengthToThickness    0.216423
AvgFlowLengthToThickness    0.207084
MinFlowLengthToThickness    0.199110
MinGateHydraulicDiameter    0.132351
AvgGateHydraulicDiameter    0.094403
StdFlowLength               0.081371
Runner_Compactness          0.069361
MaxGateHydraulicDiameter    0.061674
StdFlowLengthToThickness    0.030716
RunnerSurfaceToVolume       0.021096
Packing pressure (MPa)      0.009901
Packing time (sec)         -0.004789
Fill time (sec)            -0.011950
Mold temperature (℃)       -0.016511
Melt temperature (℃)       -0.019672
StdGateHydraulicDiameter   -0.078172
NumberOfGates              -0.094894
CavitySurfaceToVolume      -0.435327
OverallSurfaceToVolume     -0.443351
NumberOfCavities           -0.457310
Name: Weight (g), dtype: float64
In [ ]:
del_features = weight_corr[weight_corr.abs() < 0.8]
del_features.keys()
Out[ ]:
Index(['Fill time (sec)', 'Melt temperature (℃)', 'Mold temperature (℃)',
       'Packing pressure (MPa)', 'Packing time (sec)',
       'OverallProjectionAreaZX', 'MinGateHydraulicDiameter', 'StdFlowLength',
       'RunnerSurfaceArea', 'StdGateHydraulicDiameter', 'MinFlowLength',
       'MaxFlowLengthToThickness', 'MaxPartThickness', 'AvgFlowLength',
       'AvgFlowLengthToThickness', 'NumberOfGates', 'OverallProjectionAreaXY',
       'MaxFlowLength', 'RunnerVolume', 'StdFlowLengthToThickness',
       'MaxGateHydraulicDiameter', 'MinFlowLengthToThickness',
       'OverallProjectionAreaYZ', 'StdPartThickness', 'NumberOfCavities',
       'AvgPartThickness', 'AvgGateHydraulicDiameter', 'Overall_Compactness',
       'Cavity_Compactness', 'Runner_Compactness', 'CavitySurfaceToVolume',
       'RunnerSurfaceToVolume', 'OverallSurfaceToVolume'],
      dtype='object')
In [ ]:
# Keep the five process features; drop the remaining low-correlation shape features
# (the shape features with |correlation| >= 0.8 are retained automatically)
train_x_fs = train_x.drop(list(del_features.keys()[5:]), axis = 1)
test_x_fs = test_x.drop(list(del_features.keys()[5:]), axis = 1)

print('train_x_fs: {}, train_y: {}'.format(train_x_fs.shape, train_y.shape))
print('test_x_fs: {}, test_y: {}'.format(test_x_fs.shape, test_y.shape))
train_x_fs: (3500, 9), train_y: (3500,)
test_x_fs: (100, 9), test_y: (100,)
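
The nine remaining features are the five process features plus the four shape features whose correlation with the weight exceeds 0.8: OverallVolume, CavityVolume, CavitySurfaceArea, and OverallSurfaceArea.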

3.3 Linear Regression Modeling

In [ ]:
from sklearn.linear_model import LinearRegression

reg = LinearRegression().fit(train_x_fs, train_y)

print('Regression Coefficient: \n{}\n'.format(reg.coef_))
print('Regression Bias: \n{}'.format(reg.intercept_))

pred_s_lr = reg.predict(test_x_fs)
Regression Coefficient: 
[ 0.76727    -0.03221696 -0.0250693   0.00341566  0.72655702 -0.04489469
  0.00220443  0.00469149  0.77575597]

Regression Bias: 
5.309289663305158
In [ ]:
plt.figure(figsize = (8, 6))
plt.plot(pred_s_lr, 'ro--', label = 'Prediction')
plt.plot(np.array(test_y), 'bo--', label = 'Ground Truth')
plt.legend(fontsize = 13)
plt.ylabel('Weight (g)', fontsize = 13)
plt.title('Linear Regression with FS', fontsize = 13)
plt.ylim([21, 31])
plt.xticks(fontsize = 12)
plt.yticks(fontsize = 12)

plt.show()

3.4 Random Forest

In [ ]:
from sklearn.ensemble import RandomForestRegressor

reg = RandomForestRegressor(n_estimators = 100,
                            max_depth = 10000,
                            random_state = 42).fit(train_x_fs, train_y)

pred_s_rf = reg.predict(test_x_fs)
In [ ]:
plt.figure(figsize = (8, 6))
plt.plot(pred_s_rf, 'ro--', label = 'Prediction')
plt.plot(np.array(test_y), 'bo--', label = 'Ground Truth')
plt.legend(fontsize = 13)
plt.ylabel('Weight (g)', fontsize = 13)
plt.title('Random Forest with FS', fontsize = 13)
plt.ylim([21, 31])
plt.xticks(fontsize = 12)
plt.yticks(fontsize = 12)
plt.show()

3.5 Deep Neural Networks

In [ ]:
tf.random.set_seed(42)

model = tf.keras.models.Sequential([
    tf.keras.layers.Input(shape = (train_x_fs.shape[1],)),
    tf.keras.layers.Dense(units = 10, activation = 'relu'),
    tf.keras.layers.Dense(units = 10, activation = 'relu'),
    tf.keras.layers.Dense(units = 10, activation = 'relu'),
    tf.keras.layers.Dense(units = 10, activation = 'relu'),
    tf.keras.layers.Dense(units = 10, activation = 'relu'),
    tf.keras.layers.Dense(units = 1, activation = None)
])

model.compile(optimizer = tf.keras.optimizers.Adam(learning_rate = 0.001),
              loss = 'mse')

loss = model.fit(train_x_fs, train_y, epochs = 50, verbose = 0)
In [ ]:
pred_s_dnn = model.predict(test_x_fs)
4/4 [==============================] - 0s 3ms/step
In [ ]:
plt.figure(figsize = (8, 6))
plt.plot(pred_s_dnn, 'ro--', label = 'Prediction')
plt.plot(np.array(test_y), 'bo--', label = 'Ground Truth')
plt.legend(fontsize = 13)
plt.ylabel('Weight (g)', fontsize = 13)
plt.title('Deep Neural Network with FS', fontsize = 13)
plt.ylim([21, 31])
plt.xticks(fontsize = 12)
plt.yticks(fontsize = 12)

plt.show()
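
Finally, to compare model performance before and after feature selection (Aim 3), the test RMSE of all six models can be tabulated. A minimal sketch, assuming the six prediction arrays from the cells above are still in memory:

In [ ]:
import numpy as np
from sklearn.metrics import mean_squared_error

preds = {'Linear Regression'       : pred_lr,
         'Random Forest'           : pred_rf,
         'Deep Neural Network'     : pred_dnn.ravel(),     # ravel flattens the (100, 1) Keras output
         'Linear Regression + FS'  : pred_s_lr,
         'Random Forest + FS'      : pred_s_rf,
         'Deep Neural Network + FS': pred_s_dnn.ravel()}

for name, pred in preds.items():
    rmse = np.sqrt(mean_squared_error(test_y, pred))
    print('{:<26s} RMSE: {:.4f}'.format(name, rmse))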