ANN Training

By Prof. Seungchul Lee
http://iai.postech.ac.kr/
Industrial AI Lab at POSTECH

1. Recursive Algorithm¶

• One of the central ideas of computer science

• Depends on solutions to smaller instances of the same problem ( = subproblem)

• A function calls itself (a kind of self-reference that is impossible in the physical world)


• Factorial example

$$n ! = n \cdot (n-1) \cdots 2 \cdot 1$$

In [2]:
n = 5

m = 1
for i in range(n):
    m = m*(i+1)

print(m)

120

In [3]:
def fac(n):
    if n <= 1:        # base case (also guards against n = 0)
        return 1
    else:
        return n*fac(n-1)

In [4]:
# recursive

fac(5)

Out[4]:
120

2. Dynamic Programming¶

• Dynamic Programming: general, powerful algorithm design technique

• Fibonacci numbers:

In [5]:
# naive Fibonacci

def fib(n):
    if n <= 2:
        return 1
    else:
        return fib(n-1) + fib(n-2)

In [6]:
fib(10)

Out[6]:
55
In [7]:
# Memoized DP Fibonacci

def mfib(n):
    global memo

    if memo[n-1] != 0:          # already computed: reuse the stored value
        return memo[n-1]
    elif n <= 2:
        return 1
    else:
        memo[n-1] = mfib(n-1) + mfib(n-2)
        return memo[n-1]

In [8]:
import numpy as np

n = 10
memo = np.zeros(n)
mfib(n)

Out[8]:
55.0
In [9]:
n = 30
%timeit fib(30)

172 ms ± 830 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [10]:
memo = np.zeros(n)
%timeit mfib(30)

402 ns ± 3.18 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
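The same memoization idea is available off the shelf: Python's functools.lru_cache decorator caches return values automatically. A minimal sketch (the name cfib is ours, not from the notes):

from functools import lru_cache

@lru_cache(maxsize=None)        # cache every computed value
def cfib(n):
    if n <= 2:
        return 1
    return cfib(n-1) + cfib(n-2)

cfib(30)                        # 832040, with linear-time behavior like mfib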


3. Training Neural Networks¶

$=$ Learning or estimating the weights and biases of a multi-layer perceptron from training data

3.1. Optimization¶

3 key components

1. objective function $f(\cdot)$
2. decision variable or unknown $\omega$
3. constraints $g(\cdot)$

In mathematical form:

\begin{align*} \min_{\omega} \quad &f(\omega) \\ \text{subject to} \quad &g(\omega) \leq 0 \end{align*}
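As a toy instance of these three components (our own illustration, using SciPy's generic solver): minimize $f(\omega) = (\omega - 3)^2$, whose solution is $\omega^* = 3$.

from scipy.optimize import minimize

f = lambda w: (w - 3)**2        # objective function
res = minimize(f, x0=0.0)       # decision variable w, search started at 0
print(res.x)                    # approximately [3.]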

3.2. Loss Function¶

• Measures error between target values and predictions

$$\min_{\omega} \sum_{i=1}^{m}\ell\left( h_{\omega}\left(x^{(i)}\right),y^{(i)}\right)$$

• Examples (a NumPy sketch of both follows this list)
• Squared loss (for regression): $$\frac{1}{m} \sum_{i=1}^{m} \left(h_{\omega}\left(x^{(i)}\right) - y^{(i)}\right)^2$$
• Cross entropy (for classification): $$-\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log\left(h_{\omega}\left(x^{(i)}\right)\right) + \left(1-y^{(i)}\right)\log\left(1-h_{\omega}\left(x^{(i)}\right)\right)\right]$$
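A minimal NumPy sketch of both losses (function and variable names are ours):

import numpy as np

def squared_loss(y_pred, y):
    # mean squared error for regression
    return np.mean((y_pred - y)**2)

def cross_entropy(y_pred, y, eps=1e-12):
    # binary cross entropy for classification; eps guards against log(0)
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y*np.log(y_pred) + (1 - y)*np.log(1 - y_pred))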

3.3. Learning¶

Learning the weights and biases from data using gradient descent (a minimal sketch follows the list below)

$$\omega \Leftarrow \omega - \alpha \nabla_{\omega} \ell \left( h_{\omega}\left(x^{(i)}\right), y^{(i)} \right)$$
• $\frac{\partial \ell}{\partial \omega}$: computed naively, too many computations are required to obtain the gradient for every $\omega$
• The structure of a NN makes this tractable:
• Composition of functions
• Chain rule
• Dynamic programming
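A minimal sketch of the update rule for a linear model $h_{\omega}(x) = \omega x$ with squared loss (the synthetic data and learning rate are our own choices):

import numpy as np

np.random.seed(0)
x = np.random.rand(100)
y = 2*x + 0.1*np.random.randn(100)     # noisy targets with true slope 2

w, alpha = 0.0, 0.1                    # initial weight and learning rate
for i in range(100):                   # one stochastic gradient descent pass
    grad = 2*(w*x[i] - y[i])*x[i]      # d/dw of (w*x - y)^2
    w = w - alpha*grad

print(w)                               # close to the true slope 2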

Backpropagation

• Forward propagation
• The input information propagates forward through the hidden units of each layer and finally produces the output
• Backpropagation
• Allows the information from the cost to flow backwards through the network in order to compute the gradients
• Chain Rule

• Computing the derivative of a composition of functions (a numerical check follows the identities below)

• $\left(f(g(x))\right)' = f'(g(x))\, g'(x)$

• ${dz \over dx} = {dz \over dy} \cdot {dy \over dx}$

• ${dz \over dw} = \left({dz \over dy} \cdot {dy \over dx}\right) \cdot {dx \over dw}$

• ${dz \over du} = \left({dz \over dy} \cdot {dy \over dx} \cdot {dx \over dw}\right) \cdot {dw \over du}$
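A quick numerical check of the chain rule (functions chosen purely for illustration): for $z = \sin(x^2)$, the chain rule gives $dz/dx = \cos(x^2) \cdot 2x$.

import numpy as np

x = 1.5
analytic = np.cos(x**2)*2*x     # chain rule: f'(g(x)) * g'(x)

h = 1e-6                        # finite-difference approximation
numeric = (np.sin((x + h)**2) - np.sin((x - h)**2))/(2*h)

print(analytic, numeric)        # the two values agree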

• Backpropagation

• Update weights recursively, storing and reusing intermediate results (a minimal sketch follows)
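A minimal backpropagation sketch for a one-hidden-layer network with sigmoid activations and squared loss (the sizes, data, and learning rate are illustrative; biases are omitted for brevity):

import numpy as np

def sigmoid(z):
    return 1/(1 + np.exp(-z))

np.random.seed(0)
x = np.random.rand(2, 1)                 # one input sample
y = np.array([[1.0]])                    # its target

W1 = np.random.randn(3, 2)               # input -> hidden weights
W2 = np.random.randn(1, 3)               # hidden -> output weights
alpha = 0.1

for _ in range(100):
    # forward propagation: store the intermediate activations
    a1 = sigmoid(W1 @ x)
    a2 = sigmoid(W2 @ a1)

    # backpropagation: reuse the stored activations via the chain rule
    d2 = (a2 - y)*a2*(1 - a2)            # error signal at the output layer
    d1 = (W2.T @ d2)*a1*(1 - a1)         # error propagated to the hidden layer

    W2 -= alpha*(d2 @ a1.T)
    W1 -= alpha*(d1 @ x.T)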

Optimization procedure

• It is not easy to numerically compute gradients in a network in general.
• The good news: people have already done all the "hard work" of developing numerical solvers (or libraries)
• There is a wide range of tools: $\color{Red}{\text{TensorFlow}}$ (a minimal sketch follows)
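A minimal TensorFlow/Keras sketch (layer sizes are illustrative): define the network, and the library computes the gradients by backpropagation.

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(3, activation='sigmoid', input_shape=(2,)),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

# 'sgd' applies gradient descent; gradients come from automatic
# differentiation (backpropagation)
model.compile(optimizer='sgd', loss='mse')
# model.fit(x, y, epochs=100)            # x, y: NumPy training arrays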

Summary

• Learning weights and biases from data using gradient descent

4. Other Tutorials¶
