Optimization


By Prof. Seungchul Lee
http://iailab.kaist.ac.kr/
Industrial AI Lab at KAIST

Table of Contents



1. Optimization

In [ ]:
from IPython.display import YouTubeVideo
YouTubeVideo('iiKKPAF1inU', width="560", height="315", frameborder="0")
Out[ ]:

Optimization is a mathematical discipline that focuses on finding the best solution to a problem within a defined set of constraints. It involves maximizing or minimizing an objective function, which represents the goal of the optimization process, such as minimizing costs, maximizing efficiency, or achieving the best performance in a system.




3 key components

  1. objective function
  2. decision variable or unknown
  3. constraints

Procedures

  1. The process of identifying the objective function, variables, and constraints for a given problem is known as "modeling"
  2. Once the model has been formulated, an optimization algorithm can be used to find its solution.

In mathematical expression


$$\begin{align*} \min_{x} \quad &f(x) \\ \text{subject to} \quad &g_i(x) \leq 0, \qquad i=1,\cdots,m \end{align*} $$

$\quad$where

  • $x=\begin{bmatrix}x_1 \\ \vdots \\ x_n\end{bmatrix} \in \mathbb{R}^n$ is the decision variable

  • $f: \mathbb{R}^n \rightarrow \mathbb{R}$ is an objective function

  • Feasible region: $\mathcal{C} = \{x: g_i(x) \leq 0, \quad i=1, \cdots,m\}$



Remarks: the following formulations are equivalent


$$\begin{align*} \min_{x} f(x) \quad&\longleftrightarrow \quad \max_{x} -f(x)\\ \quad g_i(x) \leq 0\quad&\longleftrightarrow \quad -g_i(x) \geq 0\\ h(x) = 0 \quad&\longleftrightarrow \quad \begin{cases} h(x) \leq 0 \quad \text{and} \\ h(x) \geq 0 \end{cases} \end{align*} $$

  • The good news:
    • For many classes of optimization problems, people have already done all the "hard work" of developing numerical algorithms
    • A wide range of tools can take optimization problems in "natural" forms and compute a solution

2. Solving Optimization Problems

In [ ]:
from IPython.display import YouTubeVideo
YouTubeVideo('CqYJhPOFPGk', width="560", height="315", frameborder="0")
Out[ ]:

  • Starting with the unconstrained, one-dimensional case


  • To find a minimum point $x^*$, we can look at the derivative of the function, $f'(x)$
    • Any location where $f'(x) = 0$ will be a "flat" point in the function
    • For convex problems, this is guaranteed to be a minimum

  • Generalization for a multivariate function $f:\mathbb{R}^n \rightarrow \mathbb{R}$

    • The gradient of $f$ must be zero

$$ \nabla _x f(x) = 0$$
  • The gradient is an $n$-dimensional vector containing the partial derivatives of $f$ with respect to each dimension

$$ x = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} \quad \quad \quad \quad \nabla _x f(x) = \begin{bmatrix} \dfrac{\partial f(x)}{\partial x_1} \\ \vdots \\ \dfrac{\partial f(x)}{\partial x_n} \end{bmatrix} $$
  • For a continuously differentiable $f$ and unconstrained optimization, an optimal point must satisfy $\nabla _x f(x^*)=0$
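When an analytic gradient is awkward to derive, each partial derivative can also be approximated numerically with finite differences. A minimal sketch (the helper `numerical_gradient` and the test function below are illustrative choices, not part of the lecture):

In [ ]:
import numpy as np

def numerical_gradient(f, x, h=1e-6):
    # central-difference approximation of each partial derivative
    grad = np.zeros_like(x, dtype=float)
    for i in range(x.size):
        e = np.zeros_like(x, dtype=float)
        e[i] = h
        grad[i] = (f(x + e) - f(x - e)) / (2 * h)
    return grad

# check against a function whose gradient is known: f(x) = x1^2 + 3 x2^2
f = lambda x: x[0]**2 + 3 * x[1]**2
print(numerical_gradient(f, np.array([1.0, 2.0])))   # approximately [2. 12.]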

2.1. How to Find $\nabla _x f(x) = 0$: Analytic Approach


  • Direct solution
    • In some cases, it is possible to analytically compute $x^*$ such that $ \nabla _x f(x^*)=0$

  • For example,

$$ \begin{align*} f(x) &= 2x_1^2+ x_2^2 + x_1 x_2 -6 x_1 -5 x_2\\\\ \Longrightarrow \nabla _x f(x) &= \begin{bmatrix} 4x_1+x_2-6\\ 2x_2 + x_1 -5 \end{bmatrix} = \begin{bmatrix}0\\0 \end{bmatrix}\\\\ \therefore x^* &= \begin{bmatrix} 4 & 1\\ 1 & 2 \end{bmatrix} ^{-1} \begin{bmatrix} 6 \\ 5\\ \end{bmatrix} = \begin{bmatrix} 1 \\ 2\\ \end{bmatrix} \end{align*} $$
  • Note: Matrix derivatives
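As a sanity check, the linear system arising from $\nabla_x f(x) = 0$ in this example can also be solved numerically. A minimal sketch with NumPy:

In [ ]:
import numpy as np

# gradient condition written as a linear system: A x = b
A = np.array([[4, 1], [1, 2]])
b = np.array([6, 5])

x_star = np.linalg.solve(A, b)   # solve the 2x2 system
print(x_star)                    # expected: [1. 2.]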

2.2. How to Find $\nabla _x f(x) = 0$: Iterative Approach


  • Iterative methods
    • More commonly, the condition that the gradient equals zero has no analytical solution, so iterative methods are required

  • The gradient points in the direction of "steepest ascent" of the function $f$






3. Gradient Descent


  • This motivates the gradient descent algorithm, which repeatedly takes steps in the direction of the negative gradient

$$ x \leftarrow x - \alpha \nabla _x f(x) \quad \quad \text{for some step size } \alpha > 0$$




  • Gradient Descent

$$\color{red}{\text{Repeat : }} x \leftarrow x - \alpha \nabla _x f(x) \quad \quad \text{for some step size } \alpha > 0$$




Gradient descent is an optimization algorithm widely used in machine learning and deep learning to minimize a loss function by iteratively updating model parameters. The core idea is to adjust the parameters in the direction of the negative gradient of the loss function with respect to the parameters, aiming to find the point where the loss function reaches its minimum value.
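To connect this with the machine learning setting, the sketch below fits a one-variable linear model by gradient descent on a mean squared error loss. The toy data, learning rate, and iteration count are illustrative assumptions, not part of the lecture.

In [ ]:
import numpy as np

# toy data generated from y = 2x + 1 plus noise (illustrative)
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=50)
y = 2 * x + 1 + 0.1 * rng.standard_normal(50)

w, b = 0.0, 0.0          # model parameters
alpha = 0.1              # learning rate

for _ in range(500):
    y_hat = w * x + b
    # gradients of the mean squared error loss with respect to w and b
    grad_w = 2 * np.mean((y_hat - y) * x)
    grad_b = 2 * np.mean(y_hat - y)
    # update in the direction of the negative gradient
    w -= alpha * grad_w
    b -= alpha * grad_b

print(w, b)              # should be close to 2 and 1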


  • Gradient Descent in Higher Dimensions
    • The following visual illustration in 2D provides an intuitive understanding of the gradient descent algorithm






Example


$$ \begin{align*} \min& \quad (x_1-3)^{2} + (x_2-3)^{2}\\\\ =\min& \quad \frac{1}{2} \begin{bmatrix} x_1 & x_2 \end{bmatrix} \begin{bmatrix} 2 & 0 \\ 0 & 2 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} - \begin{bmatrix} 6 & 6 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} + 18 \end{align*} $$



  • In matrix form, with $H = \begin{bmatrix} 2 & 0 \\ 0 & 2 \end{bmatrix}$ and $g = \begin{bmatrix} -6 \\ -6 \end{bmatrix}$ (the constant 18 does not affect the minimizer),

$$\begin{align*} f &= \frac{1}{2}X^THX + g^TX \\ \nabla f &= HX+g \end{align*}$$
  • update rule

$$ \begin{align*} X_{i+1} &= X_{i} - \alpha \, \nabla f(X_i)\\ &= X_{i} - \alpha \, (H X_i + g) \end{align*} $$
In [ ]:
import numpy as np

# quadratic objective: f(X) = 1/2 X^T H X + g^T X
H = np.array([[2, 0], [0, 2]])
g = -np.array([[6], [6]])

x = np.zeros((2, 1))    # initial guess
alpha = 0.2             # step size

# gradient descent: x <- x - alpha * (H x + g)
for i in range(25):
    df = H @ x + g
    x = x - alpha * df

print(x)
[[2.99999147]
 [2.99999147]]

3.1. Choosing Step Size $\alpha$


Learning rate $\alpha$

  • A hyperparameter that controls the step size during parameter updates.

  • A learning rate that is too large may cause the algorithm to diverge, while a learning rate that is too small may slow down convergence.
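A quick way to see both effects is to rerun the quadratic example from Section 3 with different step sizes. A minimal sketch (the specific $\alpha$ values are illustrative):

In [ ]:
import numpy as np

# same quadratic objective as in Section 3
H = np.array([[2, 0], [0, 2]])
g = -np.array([[6], [6]])

for alpha in [0.01, 0.2, 1.2]:       # too small, reasonable, too large
    x = np.zeros((2, 1))
    for _ in range(25):
        x = x - alpha * (H @ x + g)  # gradient step
    print(alpha, x.ravel())          # 0.01 converges slowly, 1.2 diverges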



3.2. Where Will We Converge?



  • In machine learning and deep learning, many loss functions are non-convex, meaning they have multiple local minima, saddle points, and complex surfaces. Applying gradient descent to non-convex optimization presents unique challenges, but it remains effective in practice when combined with careful initialization and step-size choices.

  • For non-convex optimization problems, conducting multiple trials with random initializations is recommended.
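A minimal sketch of this multiple-random-initialization strategy on a simple non-convex function (the function, number of trials, and step size are illustrative assumptions):

In [ ]:
import numpy as np

# a simple non-convex function with several local minima
f = lambda x: np.sin(3 * x) + 0.1 * x**2
df = lambda x: 3 * np.cos(3 * x) + 0.2 * x      # its derivative

rng = np.random.default_rng(1)
best_x, best_f = None, np.inf

for _ in range(10):                   # multiple trials
    x = rng.uniform(-5, 5)            # random initialization
    for _ in range(200):              # plain gradient descent
        x = x - 0.01 * df(x)
    if f(x) < best_f:                 # keep the best local minimum found
        best_x, best_f = x, f(x)

print(best_x, best_f)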


4. Practically Solving Optimization Problems


  • The good news:
    • For many classes of optimization problems, people have already done all the "hard work" of developing numerical algorithms
    • A wide range of tools can take optimization problems in "natural" forms and compute a solution

  • Gradient descent
    • Easy to implement
    • Very general: it can be applied to any differentiable loss function
    • Requires relatively little memory and computation (especially for stochastic variants)
    • Widely used in neural networks/deep learning (e.g., TensorFlow)
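As one example of such off-the-shelf tools, SciPy's `minimize` can take the quadratic example from Section 3 in its natural form and return a solution directly (SciPy is just one choice among many solvers):

In [ ]:
import numpy as np
from scipy.optimize import minimize

# the example from Section 3, written as a plain Python function
fun = lambda x: (x[0] - 3)**2 + (x[1] - 3)**2

res = minimize(fun, x0=np.zeros(2))   # unconstrained solver (BFGS by default)
print(res.x)                          # close to [3. 3.]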
In [ ]:
%%javascript
$.getScript('https://kmahelona.github.io/ipython_notebook_goodies/ipython_notebook_toc.js')