Optimization for Machine Learning



By Seungchul Lee
http://iai.postech.ac.kr/
Industrial AI Lab at POSTECH

Table of Contents

  1. Optimization
  2. Convex Optimization
  3. Solving Optimization Problems

1. Optimization¶

  • Optimization is an important tool in 1) engineering problem solving and 2) decision science
  • To optimize: to find the best solution from a set of available alternatives



3 key components

  1. objective
  2. decision variable or unknown
  3. constraints

Procedures

  1. The process of identifying the objective, variables, and constraints for a given problem is known as "modeling"
  2. Once the model has been formulated, an optimization algorithm can be used to find its solution.

In mathematical form


$$\begin{align*} \min_{x} \quad &f(x) \\ \text{subject to} \quad &g_i(x) \leq 0, \qquad i=1,\cdots,m \end{align*} $$

$\;\;\; $where

  • $x=\begin{bmatrix}x_1 \\ \vdots \\ x_n\end{bmatrix} \in \mathbb{R}^n$ is the decision variable
  • $f: \mathbb{R}^n \rightarrow \mathbb{R}$ is an objective function
  • Feasible region: $\mathcal{C} = \{x: g_i(x) \leq 0, \quad i=1, \cdots,m\}$
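This formulation maps directly onto numerical solvers. Below is a minimal sketch; the toy objective, the constraint, and the use of SciPy's `minimize` are our own choices for illustration, not part of the original notes.

```python
import numpy as np
from scipy.optimize import minimize

# Toy problem: minimize f(x) = (x1 - 1)^2 + (x2 - 2)^2
# subject to  g(x) = x1 + x2 - 2 <= 0
f = lambda x: (x[0] - 1)**2 + (x[1] - 2)**2

# SciPy expects inequality constraints in the form c(x) >= 0,
# so g(x) <= 0 is passed as -g(x) >= 0
cons = {'type': 'ineq', 'fun': lambda x: -(x[0] + x[1] - 2)}

res = minimize(f, x0=np.zeros(2), constraints=[cons])
print(res.x)   # expected: ~[0.5, 1.5], the projection of (1, 2) onto the constraint boundary
```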



Remarks) The following formulations are equivalent:


$$\begin{align*} \min_{x} f(x) \quad&\leftrightarrow \quad \max_{x} -f(x)\\ \quad g_i(x) \leq 0\quad&\leftrightarrow \quad -g_i(x) \geq 0\\ h(x) = 0 \quad&\leftrightarrow \quad \begin{cases} h(x) \leq 0 \quad \text{and} \\ h(x) \geq 0 \end{cases} \end{align*} $$


2. Convex Optimization¶

  • Convex optimization is an extremely powerful subset of all optimization problems


$$\begin{align*} \min_{x} \quad &f(x) \\ \text{subject to} \quad & x \in \mathcal{C} \end{align*} $$

$\;\;\; $where

  • $f: \mathbb{R}^n \rightarrow \mathbb{R}$ is a convex function and

  • Feasible region $\mathcal{C}$ is a convex set


Remarks)

  • Key property of convex optimization:

    • All local solutions are global solutions
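As an illustration, a convex problem of this form can be stated and solved directly, and any solution the solver returns is the global one. This is a minimal sketch assuming the cvxpy package; the toy objective and the unit-ball constraint are our own choices.

```python
import cvxpy as cp
import numpy as np

# Convex objective: squared distance to the point (3, 4)
# Convex feasible set C: the unit Euclidean ball ||x||_2 <= 1
x = cp.Variable(2)
objective = cp.Minimize(cp.sum_squares(x - np.array([3.0, 4.0])))
constraints = [cp.norm(x, 2) <= 1]

prob = cp.Problem(objective, constraints)
prob.solve()
print(x.value)   # projection of (3, 4) onto the unit ball: ~[0.6, 0.8]
```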

Linear Interpolation between Two Points





$$ \begin{align*} \vec{z} &= \vec{y} + \theta (\vec{x} - \vec{y}) = \theta \vec{x} + (1-\theta) \vec{y},\qquad 0 \leq \theta \leq 1\\ \\ \text{or }\;\; \vec{z} &= \alpha \vec{x} + \beta \vec{y}, \qquad \alpha + \beta = 1 \;\;\text{and}\; 0 \leq \alpha, \beta \end{align*} $$
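A quick numerical check that the two forms above agree (a minimal sketch; the example points and $\theta$ values are our own):

```python
import numpy as np

x = np.array([1.0, 3.0])
y = np.array([4.0, -1.0])

for theta in [0.0, 0.25, 0.5, 1.0]:
    z1 = y + theta * (x - y)              # first form
    z2 = theta * x + (1 - theta) * y      # second form (alpha = theta, beta = 1 - theta)
    assert np.allclose(z1, z2)            # both forms give the same point on the segment
```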

Convex Function



  • For any $x,y \in \mathbb{R}^n$ and $\theta \in [0,1]$


$$f(\theta x + (1-\theta)y) \leq \theta f(x) + (1-\theta)f(y)$$
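The inequality can be spot-checked numerically for a known convex function such as $f(x) = \lVert x \rVert^2$ (a minimal sketch; the random test points are our own):

```python
import numpy as np

f = lambda x: np.sum(x**2)        # a convex function

rng = np.random.default_rng(0)
for _ in range(1000):
    x, y = rng.normal(size=2), rng.normal(size=2)
    theta = rng.uniform()
    lhs = f(theta * x + (1 - theta) * y)
    rhs = theta * f(x) + (1 - theta) * f(y)
    assert lhs <= rhs + 1e-12     # convexity inequality holds at every sampled pair
```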

Convex Set



  • For any $x,y \in \mathcal{C}$ and $\theta \in [0,1]$


$$\theta x + (1-\theta)y \in \mathcal{C}$$

3. Solving Optimization Problems¶

  • Starting with the unconstrained, one-dimensional case



  • To find the minimum point $x^*$, we can look at the derivative of the function, $f'(x)$:
    • Any location where $f'(x) = 0$ will be a "flat" point in the function
  • For convex problems, this is guaranteed to be a minimum
  • Generalization to a multivariate function $f:\mathbb{R}^n \rightarrow \mathbb{R}$

    • The gradient of $f$ must be zero
$$ \nabla _x f(x) = 0$$
  • The gradient is an $n$-dimensional vector containing the partial derivatives with respect to each dimension


$$ x = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} \quad \quad \quad \quad \nabla _x f(x) = \begin{bmatrix} \frac{\partial f(x)}{\partial x_1} \\ \vdots\\ \frac{\partial f(x)}{\partial x_n} \end{bmatrix} $$


  • For a continuously differentiable $f$ and unconstrained optimization, the optimal point must satisfy $\nabla _x f(x^*)=0$
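In practice the gradient can also be approximated by finite differences, which is a handy way to sanity-check an analytic gradient. This is a minimal sketch; the helper name is our own, and the test function is the quadratic solved analytically in the next subsection.

```python
import numpy as np

def numerical_gradient(f, x, h=1e-6):
    """Central-difference approximation of the gradient of f at x."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        grad[i] = (f(x + e) - f(x - e)) / (2 * h)
    return grad

# f(x) = 2*x1^2 + x2^2 + x1*x2 - 6*x1 - 5*x2
f = lambda x: 2*x[0]**2 + x[1]**2 + x[0]*x[1] - 6*x[0] - 5*x[1]
print(numerical_gradient(f, np.array([1.0, 2.0])))   # ~[0, 0] at the minimizer
```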

3.1. How To Find $\nabla _x f(x) = 0$: Analytic Approach¶

  • Direct solution

    • In some cases, it is possible to analytically compute $x^*$ such that $ \nabla _x f(x^*)=0$


$$ \begin{align*} f(x) &= 2x_1^2+ x_2^2 + x_1 x_2 -6 x_1 -5 x_2\\\\ \Longrightarrow \nabla _x f(x) &= \begin{bmatrix} 4x_1+x_2-6\\ 2x_2 + x_1 -5 \end{bmatrix} = \begin{bmatrix}0\\0 \end{bmatrix}\\\\ \therefore x^* &= \begin{bmatrix} 4 & 1\\ 1 & 2 \end{bmatrix} ^{-1} \begin{bmatrix} 6 \\ 5\\ \end{bmatrix} = \begin{bmatrix} 1 \\ 2\\ \end{bmatrix} \end{align*} $$
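The same stationary point can be recovered numerically by solving the linear system obtained from $\nabla_x f(x) = 0$ (a minimal NumPy sketch):

```python
import numpy as np

# Setting the gradient of f(x) = 2*x1^2 + x2^2 + x1*x2 - 6*x1 - 5*x2 to zero
# gives the linear system  [[4, 1], [1, 2]] x = [6, 5]
A = np.array([[4.0, 1.0],
              [1.0, 2.0]])
b = np.array([6.0, 5.0])

x_star = np.linalg.solve(A, b)
print(x_star)   # [1. 2.]
```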

  • Note: Matrix derivatives



Examples (verified numerically in the sketch after this list)

  • Affine function $g(x) = a^Tx + b$
$$\nabla g(x) = a$$
  • Quadratic function $g(x) = x^T P x + q^T x + r,\qquad P = P^T$
$$\nabla g(x) = 2Px + q$$
  • $g(x) = \lVert Ax - b \rVert ^2 = x^TA^TAx - 2b^TAx + b^Tb$
$$\nabla g(x) = 2A^TAx-2A^Tb$$
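These gradient formulas can be verified against a central-difference approximation (a minimal sketch; the random $A$, $b$, and $x$ are our own):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 3))
b = rng.normal(size=5)
x = rng.normal(size=3)

g = lambda x: np.sum((A @ x - b)**2)          # g(x) = ||Ax - b||^2
analytic = 2 * A.T @ A @ x - 2 * A.T @ b      # formula: 2 A^T A x - 2 A^T b

# central-difference check of each partial derivative
h = 1e-6
numeric = np.array([(g(x + h*e) - g(x - h*e)) / (2*h) for e in np.eye(3)])
print(np.allclose(analytic, numeric, atol=1e-4))   # True
```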



Note: Revisit the Least-Squares Solution of $J(x) = \lVert Ax - y \rVert ^2$


$$ \begin{align*} J(x) &= (Ax-y)^T(Ax-y)\\ &=(x^TA^T - y^T)(Ax - y)\\ &=x^TA^TAx - x^TA^Ty - y^TAx + y^Ty\\\\ \frac{\partial J}{\partial x} &= A^TAx + (A^TA)^Tx - A^Ty - (y^TA)^T \\ &=2A^TAx - 2A^Ty = 0\\\\ &\Rightarrow (A^TA)x = A^Ty\\\\ \therefore x^* &= (A^TA)^{-1}A^Ty \end{align*} $$
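Numerically, the normal-equation solution can be compared against NumPy's least-squares routine (a minimal sketch; the random $A$ and $y$ are our own):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(10, 3))
y = rng.normal(size=10)

x_normal = np.linalg.solve(A.T @ A, A.T @ y)       # x* = (A^T A)^{-1} A^T y
x_lstsq, *_ = np.linalg.lstsq(A, y, rcond=None)    # library least-squares solver

print(np.allclose(x_normal, x_lstsq))              # True
```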

3.2. How To Find $\nabla _x f(x) = 0$: Iterative Approach¶

  • Iterative methods

    • More commonly, the condition that the gradient equals zero will not have an analytical solution, and iterative methods are required





  • The gradient points in the direction of "steepest ascent" for the function $f$

3.2.1. Gradient Descent¶

  • This motivates the gradient descent algorithm, which repeatedly takes steps in the direction of the negative gradient


$$ x \leftarrow x - \alpha \nabla _x f(x) \quad \quad \text{for some step size } \alpha > 0$$



  • Gradient Descent
$$\text{Repeat : } x \leftarrow x - \alpha \nabla _x f(x) \quad \quad \text{for some step size } \alpha > 0$$



  • Gradient Descent in Higher Dimension
$$\text{Repeat : } x \leftarrow x - \alpha \nabla _x f(x)$$
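A minimal gradient descent sketch for the quadratic from Section 3.1 (the initial point, step size, and iteration count are our own choices):

```python
import numpy as np

# f(x) = 2*x1^2 + x2^2 + x1*x2 - 6*x1 - 5*x2, with gradient [4x1 + x2 - 6, x1 + 2x2 - 5]
grad_f = lambda x: np.array([4*x[0] + x[1] - 6,
                             x[0] + 2*x[1] - 5])

x = np.zeros(2)       # initial guess
alpha = 0.1           # step size

for _ in range(200):
    x = x - alpha * grad_f(x)   # repeat: x <- x - alpha * grad_x f(x)

print(x)   # converges to the analytic solution [1, 2]
```

For this quadratic, the iteration stays stable as long as the step size is below roughly $2/\lambda_{\max}$ of the Hessian (about 0.45 here); larger steps diverge.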