Optimization for Machine Learning



By Seungchul Lee
http://iai.postech.ac.kr/
Industrial AI Lab at POSTECH

Table of Contents

1. Optimization

  • an important tool in 1) engineering problem solving and 2) decision science
  • to optimize: to find the best possible choice among the available alternatives



3 key components

  1. objective
  2. decision variable or unknown
  3. constraints

Procedures

  1. The process of identifying the objective, variables, and constraints for a given problem is known as "modeling"
  2. Once the model has been formulated, an optimization algorithm can be used to find its solution.

In mathematical form


$$\begin{align*} \min_{x} \quad &f(x) \\ \text{subject to} \quad &g_i(x) \leq 0, \qquad i=1,\cdots,m \end{align*} $$

$\;\;\; $where

  • $x=\begin{bmatrix}x_1 \\ \vdots \\ x_n\end{bmatrix} \in \mathbb{R}^n$ is the decision variable
  • $f: \mathbb{R}^n \rightarrow \mathbb{R}$ is an objective function
  • Feasible region: $\mathcal{C} = \{x: g_i(x) \leq 0, \quad i=1, \cdots,m\}$



Remarks) equivalent formulations


$$\begin{align*} \min_{x} f(x) \quad&\leftrightarrow \quad \max_{x} -f(x)\\ \quad g_i(x) \leq 0\quad&\leftrightarrow \quad -g_i(x) \geq 0\\ h(x) = 0 \quad&\leftrightarrow \quad \begin{cases} h(x) \leq 0 \quad \text{and} \\ h(x) \geq 0 \end{cases} \end{align*} $$


2. Convex Optimization

  • Convex optimization is an extremely powerful subset of all optimization problems


$$\begin{align*} \min_{x} \quad &f(x) \\ \text{subject to} \quad & x \in \mathcal{C} \end{align*} $$

$\;\;\; $where

  • $f: \mathbb{R}^n \rightarrow \mathbb{R}$ is a convex function and

  • Feasible region $\mathcal{C}$ is a convex set


Remarks)

  • Key property of convex optimization:

    • All local solutions are global solutions

Linear Interpolation between Two Points





$$ \begin{align*} \vec{z} &= \vec{y} + \theta (\vec{x} - \vec{y}) = \theta \vec{x} + (1-\theta) \vec{y},\qquad 0 \leq \theta \leq 1\\ \\ \text{or }\;\; \vec{z} &= \alpha \vec{x} + \beta \vec{y}, \qquad \alpha + \beta = 1 \;\;\text{and}\; 0 \leq \alpha, \beta \end{align*} $$
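A quick numerical sketch of this convex combination (assuming NumPy; the two points are chosen for illustration), sweeping $\theta$ from 0 to 1:

import numpy as np

x = np.array([1.0, 3.0])
y = np.array([4.0, 0.0])

# z traces the segment from y (theta = 0) to x (theta = 1)
for theta in np.linspace(0, 1, 5):
    z = theta * x + (1 - theta) * y
    print(theta, z)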

Convex Function



  • For any $x,y \in \mathbb{R}^n$ and $\theta \in [0,1]$


$$f(\theta x + (1-\theta)y) \leq \theta f(x) + (1-\theta)f(y)$$

Convex Set



  • For any $x,y \in \mathcal{C}$ and $\theta \in [0,1]$


$$\theta x + (1-\theta)y \in \mathcal{C}$$
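Both definitions are easy to check numerically. A minimal sketch (my own illustrative choices: $f(x) = \lVert x \rVert^2$ as the convex function and the unit box as $\mathcal{C}$) that samples random pairs and verifies the convexity inequality and set membership:

import numpy as np

f = lambda x: np.sum(x**2)                      # a known convex function
in_C = lambda x: np.all((0 <= x) & (x <= 1))    # unit box, a convex set

rng = np.random.default_rng(0)
for _ in range(1000):
    x, y = rng.random(3), rng.random(3)         # points inside the box
    theta = rng.random()
    z = theta * x + (1 - theta) * y
    assert f(z) <= theta * f(x) + (1 - theta) * f(y) + 1e-12
    assert in_C(z)
print("convexity checks passed")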

3. Solving Optimization Problems

  • Starting with the unconstrained, one-dimensional case



  • To find the minimum point $x^*$, we can look at the derivative of the function, $f'(x)$:
    • Any location where $f'(x) = 0$ will be a "flat" point in the function
  • For convex problems, such a point is guaranteed to be a minimum
  • Generalization to a multivariate function $f:\mathbb{R}^n \rightarrow \mathbb{R}$

    • The gradient of $f$ must be zero
$$ \nabla _x f(x) = 0$$
  • The gradient is an $n$-dimensional vector containing partial derivatives with respect to each dimension


$$ x = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} \quad \quad \quad \quad \nabla _x f(x) = \begin{bmatrix} \frac{\partial f(x)}{\partial x_1} \\ \vdots\\ \frac{\partial f(x)}{\partial x_n} \end{bmatrix} $$


  • For a continuously differentiable $f$ and unconstrained optimization, the optimal point must satisfy $\nabla _x f(x^*)=0$
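To make this concrete, a small finite-difference sketch (num_grad is my own helper, assuming NumPy) that approximates each partial derivative and thereby the gradient vector:

import numpy as np

def num_grad(f, x, h=1e-6):
    # central differences, one coordinate at a time
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

f = lambda x: x[0]**2 + 3*x[1]**2
print(num_grad(f, np.array([1.0, 2.0])))   # approximately [2, 12]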

3.1. How To Find $\nabla _x f(x) = 0$: Analytic Approach

  • Direct solution

    • In some cases, it is possible to analytically compute $x^*$ such that $ \nabla _x f(x^*)=0$


$$ \begin{align*} f(x) &= 2x_1^2+ x_2^2 + x_1 x_2 -6 x_1 -5 x_2\\\\ \Longrightarrow \nabla _x f(x) &= \begin{bmatrix} 4x_1+x_2-6\\ 2x_2 + x_1 -5 \end{bmatrix} = \begin{bmatrix}0\\0 \end{bmatrix}\\\\ \therefore x^* &= \begin{bmatrix} 4 & 1\\ 1 & 2 \end{bmatrix} ^{-1} \begin{bmatrix} 6 \\ 5\\ \end{bmatrix} = \begin{bmatrix} 1 \\ 2\\ \end{bmatrix} \end{align*} $$
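As a quick check of the analytic solution above, the same $2 \times 2$ system can be solved numerically (np.linalg.solve avoids forming the inverse explicitly):

import numpy as np

A = np.array([[4, 1],
              [1, 2]])
b = np.array([6, 5])

x_star = np.linalg.solve(A, b)   # solve the gradient-equals-zero system
print(x_star)                    # [1. 2.]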

  • Note: Matrix derivatives



Examples

  • Affine function $g(x) = a^Tx + b$
$$\nabla g(x) = a$$
  • Quadratic function $g(x) = x^T P x + q^T x + r,\qquad P = P^T$
$$\nabla g(x) = 2Px + q$$
  • $g(x) = \lVert Ax - b \rVert ^2 = x^TA^TAx - 2b^TAx + b^Tb$
$$\nabla g(x) = 2A^TAx-2A^Tb$$
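These identities can be spot-checked numerically; a minimal sketch comparing $\nabla g(x) = 2Px + q$ against central differences for a random symmetric $P$ (all data below is made up for illustration):

import numpy as np

rng = np.random.default_rng(1)
P = rng.random((3, 3)); P = (P + P.T) / 2    # make P symmetric
q = rng.random(3)
x = rng.random(3)

g = lambda x: x @ P @ x + q @ x
h = 1e-6
fd = np.array([(g(x + h*np.eye(3)[i]) - g(x - h*np.eye(3)[i])) / (2*h)
               for i in range(3)])
print(np.allclose(fd, 2*P @ x + q))          # should print True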



Note: Revisit the Least-Squares Solution of $J(x) = \lVert Ax - y \rVert ^2$


$$ \begin{align*} J(x) &= (Ax-y)^T(Ax-y)\\ &=(x^TA^T - y^T)(Ax - y)\\ &=x^TA^TAx - x^TA^Ty - y^TAx + y^Ty\\\\ \frac{\partial J}{\partial x} &= A^TAx + (A^TA)^Tx - A^Ty - (y^TA)^T \\ &=A^TAx - 2A^Ty = 0\\\\ &\Rightarrow (A^TA)x = A^Ty\\\\ \therefore x^* &= (A^TA)^{-1}A^Ty \end{align*} $$
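The normal-equation solution can be verified against NumPy's built-in least-squares routine (the data below is randomly generated for illustration):

import numpy as np

rng = np.random.default_rng(2)
A = rng.random((10, 3))
y = rng.random(10)

x_normal = np.linalg.solve(A.T @ A, A.T @ y)    # (A^T A) x = A^T y
x_lstsq, *_ = np.linalg.lstsq(A, y, rcond=None)
print(np.allclose(x_normal, x_lstsq))           # should print True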

3.2. How To Find $\nabla _x f(x) = 0$: Iterative Approach

  • Iterative methods

    • More commonly, the condition that the gradient equal zero will not have an analytical solution, requiring iterative methods





  • The gradient points in the direction of "steepest ascent" for function $f$

3.2.1. Gradient Descent

  • This motivates the gradient descent algorithm, which repeatedly takes steps in the direction of the negative gradient


$$ x \leftarrow x - \alpha \nabla _x f(x) \quad \quad \text{for some step size } \alpha > 0$$
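A minimal one-dimensional sketch of this update rule on $f(x) = (x-2)^2$, whose derivative is $2(x-2)$ (function, starting point, and step size are illustrative choices):

x = 0.0          # initial guess
alpha = 0.1      # step size

for _ in range(100):
    grad = 2 * (x - 2)       # f'(x) for f(x) = (x - 2)^2
    x = x - alpha * grad     # step in the negative gradient direction

print(x)                     # close to the minimizer x* = 2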



  • Gradient Descent
$$\text{Repeat : } x \leftarrow x - \alpha \nabla _x f(x) \quad \quad \text{for some step size } \alpha > 0$$



  • Gradient Descent in Higher Dimension
$$\text{Repeat : } x \leftarrow x - \alpha \nabla _x f(x)$$






3.2.2. Choosing Step Size

  • Learning rate
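The effect of the learning rate can be seen on $f(x) = x^2$, where the update becomes $x \leftarrow (1 - 2\alpha)x$: the iterates converge only when $0 < \alpha < 1$. A sketch with illustrative values:

def gd(alpha, x0=1.0, steps=20):
    x = x0
    for _ in range(steps):
        x = x - alpha * 2 * x   # gradient of x^2 is 2x
    return x

print(gd(0.01))   # too small: converges slowly
print(gd(0.4))    # converges quickly
print(gd(1.1))    # too large: oscillates and diverges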



3.2.3. Where Will We Converge?




  • Random initialization
  • Multiple trials
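For non-convex objectives, gradient descent only finds a local minimum near its starting point, which motivates the two bullets above. A sketch (with the illustrative non-convex function $f(x) = x^4 - 3x^2 + x$) that restarts from random initializations and keeps the best result:

import numpy as np

f  = lambda x: x**4 - 3*x**2 + x
df = lambda x: 4*x**3 - 6*x + 1

rng = np.random.default_rng(3)
best = None
for _ in range(10):                  # multiple trials
    x = rng.uniform(-2, 2)           # random initialization
    for _ in range(500):
        x = x - 0.01 * df(x)         # gradient descent
    if best is None or f(x) < f(best):
        best = x
print(best, f(best))                 # global minimum near x = -1.30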

3.2.4. Example


$$ \begin{align*} \min& \quad (x_1-3)^{2} + (x_2-3)^{2}\\\\ =\min& \quad \frac{1}{2} \begin{bmatrix} x_1 & x_2 \end{bmatrix} \begin{bmatrix} 2 & 0 \\ 0 & 2 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} - \begin{bmatrix} 6 & 6 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} + 18 \end{align*} $$



$$ \begin{align*} f &= \frac{1}{2}X^THX + g^TX \\ \nabla f &= HX+g \end{align*} $$
  • update rule
$$ X_{i+1} = X_{i} - \alpha_i \nabla f(X_i)$$
In [1]:
import numpy as np

H = np.array([[2, 0], [0, 2]])
g = -np.array([[6], [6]])

x = np.zeros((2, 1))
alpha = 0.2

for i in range(25):
    df = H @ x + g        # gradient: Hx + g
    x = x - alpha * df    # gradient descent step

print(x)
[[2.99999147]
 [2.99999147]]

4. Practically Solving Optimization Problems

  • The good news: for many classes of optimization problems, people have already done all the “hard work” of developing numerical algorithms
    • A wide range of tools that can take optimization problems in “natural” forms and compute a solution
  • Gradient descent
    • Neural networks/deep learning

4.1. Example: Linear Programming (Convex)


  • Linear programming: both the objective function and the constraints are linear


$$ \begin{align*} \max \; & \; 3x_1 + \frac{3}{2}x_2 \quad \quad \leftarrow \text{objective function}\\ \\ \text{subject to} \; & -1 \leq x_1 \leq 2 \quad \leftarrow \text{constraints}\\ & \quad 0 \leq x_2 \leq 3 \end{align*} $$

Method 1: graphical approach

$$ 3 x_1 + 1.5 x_2 = C \qquad \Rightarrow \qquad x_2 = -2 x_1 + \frac{2}{3}C $$
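Since a linear program attains its optimum at a vertex of the feasible region, the graphical reading can be confirmed by brute-force evaluation of the objective at the four corners of the box (a check for this tiny example, not a general LP method):

import itertools

corners = itertools.product([-1, 2], [0, 3])       # vertices of the box
best = max(corners, key=lambda c: 3*c[0] + 1.5*c[1])
print(best, 3*best[0] + 1.5*best[1])               # (2, 3) with value 10.5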




Method 2: Gradient Descent


$$ \begin{array}{Icr}\begin{align*} \max_{x} \quad & 3x_1 + {3 \over 2}x_2 \\ \text{subject to} \quad & -1 \leq x_1 \leq 2 \\ & \quad 0 \leq x_2 \leq 3 \end{align*}\end{array} \quad\implies\quad \begin{array}{I} \quad \min_{x} \quad & - \begin{bmatrix} 3 \\ 3 / 2 \end{bmatrix}^T \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \\ \text{subject to} \quad & \begin{bmatrix} -1 \\ 0 \end{bmatrix} \leq \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \leq \begin{bmatrix} 2 \\ 3 \\ \end{bmatrix} \end{array} $$

In [2]:
f = -np.array([[3], 
              [3/2]])

lb = np.array([[-1], 
               [0]])
ub = np.array([[2], 
               [3]])

x = np.zeros(shape = (2, 1))
alpha = 0.2

for i in range(25): 
    x = x - alpha * f
    
    # lower-bound constraints: project x back onto x >= lb
    lb_TF = lb < x 
    x = x * lb_TF + lb * ~lb_TF
    
    # upper-bound constraints: project x back onto x <= ub
    ub_TF = x < ub 
    x = x * ub_TF + ub * ~ub_TF
    
print(x)
[[2.]
 [3.]]

Method 3: CVXPY-based solver


$$ \begin{array}{Icr}\begin{align*} \max_{x} \quad & 3x_1 + {3 \over 2}x_2 \\ \text{subject to} \quad & -1 \leq x_1 \leq 2 \\ & \quad 0 \leq x_2 \leq 3 \end{align*}\end{array} \quad\implies\quad \begin{array}{I} \quad \min_{x} \quad & - \begin{bmatrix} 3 \\ 3 / 2 \end{bmatrix}^T \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \\ \text{subject to} \quad & \begin{bmatrix} -1 \\ 0 \end{bmatrix} \leq \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \leq \begin{bmatrix} 2 \\ 3 \\ \end{bmatrix} \end{array} $$

In [3]:
import numpy as np
import cvxpy as cvx

f = np.array([[3], [3/2]])
lb = np.array([[-1], [0]])
ub = np.array([[2], [3]])

x = cvx.Variable([2,1])

obj = cvx.Minimize(-f.T @ x)
constraints = [lb <= x, x <= ub]

prob = cvx.Problem(obj, constraints)
result = prob.solve()

print(x.value)
print(result)
[[1.99999999]
 [2.99999999]]
-10.499999966365493

4.2. Example: Quadratic Programming (Convex)


$$ \begin{array}{Icr}\begin{align*} \min \quad &\frac{1}{2}x_1^2+3x_1+4x_2 \\ \text{subject to} \quad & x_1+3x_2 \geq 15 \\ & 2x_1+5x_2 \leq 100 \\ & 3x_1+4x_2 \leq 80 \\ & x_1,x_2 \geq 0 \\ \end{align*}\end{array} \quad\implies\quad \begin{array}{I} \min_{X} \quad & X^T HX + f^T X \\ \text{subject to} \quad & AX \leq b \\ & A_{eq}X = b_{eq} \\ & lb \leq X \leq ub \end{array} $$


$$ \begin{array}{Icr}\begin{align*} \min \quad &\frac{1}{2}x_1^2+3x_1+4x_2 \\ \text{subject to} \quad & x_1+3x_2 \geq 15 \\ & 2x_1+5x_2 \leq 100 \\ & 3x_1+4x_2 \leq 80 \\ & x_1,x_2 \geq 0 \\ \end{align*}\end{array} \quad\implies\quad \begin{array}{I} \min_{X} \quad & \begin{bmatrix}x_1 & x_2\end{bmatrix} \begin{bmatrix} {1 \over 2} & 0 \\ 0 & 0 \end{bmatrix} \begin{bmatrix}x_1 \\ x_2\end{bmatrix} + \begin{bmatrix} 3 & 4 \end{bmatrix} \begin{bmatrix}x_1 \\ x_2\end{bmatrix} \\ \text{subject to} \quad & \begin{bmatrix} -1 & -3 \\ 2 & 5 \\ 3 & 4 \end{bmatrix} \begin{bmatrix}x_1 \\ x_2\end{bmatrix} \leq \begin{bmatrix} -15 \\ 100 \\ 80 \end{bmatrix} \\ & \begin{bmatrix} 0 \\ 0 \end{bmatrix} \leq \begin{bmatrix}x_1 \\ x_2\end{bmatrix} \end{array} $$

In [4]:
f = np.array([[3], [4]])
H = np.array([[1/2, 0], [0, 0]])

A = np.array([[-1, -3], 
              [2, 5], 
              [3, 4]])

b = np.array([[-15], 
              [100], 
              [80]])

lb = np.array([[0], 
               [0]])

x = cvx.Variable([2,1])

obj = cvx.Minimize(cvx.quad_form(x, H) + f.T@x)
constraints = [A@x <= b, lb <= x]

prob = cvx.Problem(obj, constraints)
result = prob.solve()

print(x.value)
print(result)
[[-1.15591357e-25]
 [ 5.00000000e+00]]
20.000000000000004
In [5]:
f = np.array([[3], [4]])
H = np.array([[1/2, 0], [0, 0]])

A = np.array([[-1, -3], 
              [2, 5], 
              [3, 4], 
              [-1, 0], 
              [0, -1]])

b = np.array([[-15], 
              [100], 
              [80],
              [0], 
              [0]])

x = cvx.Variable([2,1])

obj = cvx.Minimize(cvx.quad_form(x, H) + f.T@x)
constraints = [A@x <= b]

prob = cvx.Problem(obj, constraints)
result = prob.solve()

print(x.value)
print(result)
[[-1.15591357e-25]
 [ 5.00000000e+00]]
20.000000000000004

4.3. Example: Empty Bucket




$$ \min \; d_1 + d_2 = \min \left\lVert \vec{a} - \begin{bmatrix}x\\0\end{bmatrix}\right\rVert_2 + \left\lVert \vec{b} - \begin{bmatrix}x\\0\end{bmatrix}\right\rVert_2$$


  • $ \vec{a} \rightarrow \begin{bmatrix}x\\0\end{bmatrix}$: walk with an empty bucket
  • $ \begin{bmatrix}x\\0\end{bmatrix} \rightarrow \vec{b}$: walk with a water-filled bucket
  • Since carrying a full bucket takes more effort, the second leg can be weighted by $\mu \geq 1$
$$\implies \min \; d_1 + \mu d_2$$
In [6]:
a = np.array([[0], [1]])
b = np.array([[4], [2]])

Aeq = np.array([0,1])   # pick-up point must lie on the river: x_2 = 0
beq = 0

x = cvx.Variable([2,1])

mu = 1                   # weight on the water-filled leg
obj = cvx.Minimize(cvx.norm(a-x, 2) + mu * cvx.norm(b-x, 2))
constraints = [Aeq @ x == beq]

prob = cvx.Problem(obj, constraints)
result = prob.solve()

print(x.value)
print(result)
[[1.33325114e+00]
 [5.33304239e-12]]
4.9999999941398166

5. Video Lectures

In [7]:
%%html
<center><iframe src="https://www.youtube.com/embed/QJSSWNIAPlw?rel=0" 
width="420" height="315" frameborder="0" allowfullscreen></iframe></center>
In [8]:
%%html
<center><iframe src="https://www.youtube.com/embed/OJ397Y3fVkY?rel=0" 
width="420" height="315" frameborder="0" allowfullscreen></iframe></center>
In [9]:
%%html
<center><iframe src="https://www.youtube.com/embed/tLxsqbLIvx4?rel=0" 
width="420" height="315" frameborder="0" allowfullscreen></iframe></center>
In [10]:
%%html
<center><iframe src="https://www.youtube.com/embed/91DpsoKFpLw?rel=0" 
width="420" height="315" frameborder="0" allowfullscreen></iframe></center>
In [11]:
%%javascript
$.getScript('https://kmahelona.github.io/ipython_notebook_goodies/ipython_notebook_toc.js')