Optimization for Machine Learning



By Seungchul Lee
http://iai.postech.ac.kr/
Industrial AI Lab at POSTECH

Table of Contents

  1. Optimization
  2. Convex Optimization
  3. Solving Optimization Problems

1. Optimization¶

  • Optimization is an important tool in 1) engineering problem solving and 2) decision science
  • To optimize: to find the best solution from a set of available alternatives



3 key components

  1. objective
  2. decision variable or unknown
  3. constraints

Procedures

  1. The process of identifying the objective, variables, and constraints for a given problem is known as "modeling"
  2. Once the model has been formulated, an optimization algorithm can be used to find its solution.

In mathematical form


$$\begin{align*} \min_{x} \quad &f(x) \\ \text{subject to} \quad &g_i(x) \leq 0, \qquad i=1,\cdots,m \end{align*} $$

$\;\;\; $where

  • $x=\begin{bmatrix}x_1 \\ \vdots \\ x_n\end{bmatrix} \in \mathbb{R}^n$ is the decision variable
  • $f: \mathbb{R}^n \rightarrow \mathbb{R}$ is an objective function
  • Feasible region: $\mathcal{C} = \{x: g_i(x) \leq 0, \quad i=1, \cdots,m\}$
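This formulation maps directly onto numerical solvers. Below is a minimal sketch; the toy objective, the constraint, and the use of SciPy's `minimize` are our own choices for illustration, not part of the original notes.

```python
import numpy as np
from scipy.optimize import minimize

# Toy problem: minimize f(x) = (x1 - 1)^2 + (x2 - 2)^2
# subject to  g(x) = x1 + x2 - 2 <= 0
f = lambda x: (x[0] - 1)**2 + (x[1] - 2)**2

# SciPy expects inequality constraints in the form c(x) >= 0,
# so g(x) <= 0 is passed as -g(x) >= 0
cons = {'type': 'ineq', 'fun': lambda x: -(x[0] + x[1] - 2)}

res = minimize(f, x0=np.zeros(2), constraints=[cons])
print(res.x)   # expected: ~[0.5, 1.5], the projection of (1, 2) onto the constraint boundary
```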



Remarks) The following formulations are equivalent:


$$\begin{align*} \min_{x} f(x) \quad&\leftrightarrow \quad \max_{x} -f(x)\\ \quad g_i(x) \leq 0\quad&\leftrightarrow \quad -g_i(x) \geq 0\\ h(x) = 0 \quad&\leftrightarrow \quad \begin{cases} h(x) \leq 0 \quad \text{and} \\ h(x) \geq 0 \end{cases} \end{align*} $$


2. Convex Optimization¶

  • Convex optimization is an extremely powerful subset of all optimization problems


$$\begin{align*} \min_{x} \quad &f(x) \\ \text{subject to} \quad & x \in \mathcal{C} \end{align*} $$

$\;\;\; $where

  • $f: \mathbb{R}^n \rightarrow \mathbb{R}$ is a convex function and

  • Feasible region $\mathcal{C}$ is a convex set


Remarks)

  • Key property of convex optimization:

    • All local solutions are global solutions
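As an illustration, a convex problem of this form can be stated and solved directly, and any solution the solver returns is the global one. This is a minimal sketch assuming the cvxpy package; the toy objective and the unit-ball constraint are our own choices.

```python
import cvxpy as cp
import numpy as np

# Convex objective: squared distance to the point (3, 4)
# Convex feasible set C: the unit Euclidean ball ||x||_2 <= 1
x = cp.Variable(2)
objective = cp.Minimize(cp.sum_squares(x - np.array([3.0, 4.0])))
constraints = [cp.norm(x, 2) <= 1]

prob = cp.Problem(objective, constraints)
prob.solve()
print(x.value)   # projection of (3, 4) onto the unit ball: ~[0.6, 0.8]
```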

Linear Interpolation between Two Points





$$ \begin{align*} \vec{z} &= \vec{y} + \theta (\vec{x} - \vec{y}) = \theta \vec{x} + (1-\theta) \vec{y},\qquad 0 \leq \theta \leq 1\\ \\ \text{or }\;\; \vec{z} &= \alpha \vec{x} + \beta \vec{y}, \qquad \alpha + \beta = 1 \;\;\text{and}\; 0 \leq \alpha, \beta \end{align*} $$
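A quick numerical check that the two forms above agree (a minimal sketch; the example points and $\theta$ values are our own):

```python
import numpy as np

x = np.array([1.0, 3.0])
y = np.array([4.0, -1.0])

for theta in [0.0, 0.25, 0.5, 1.0]:
    z1 = y + theta * (x - y)              # first form
    z2 = theta * x + (1 - theta) * y      # second form (alpha = theta, beta = 1 - theta)
    assert np.allclose(z1, z2)            # both forms give the same point on the segment
```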

Convex Function



  • For any $x,y \in \mathbb{R}^n$ and $\theta \in [0,1]$


$$f(\theta x + (1-\theta)y) \leq \theta f(x) + (1-\theta)f(y)$$
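The inequality can be spot-checked numerically for a known convex function such as $f(x) = \lVert x \rVert^2$ (a minimal sketch; the random test points are our own):

```python
import numpy as np

f = lambda x: np.sum(x**2)        # a convex function

rng = np.random.default_rng(0)
for _ in range(1000):
    x, y = rng.normal(size=2), rng.normal(size=2)
    theta = rng.uniform()
    lhs = f(theta * x + (1 - theta) * y)
    rhs = theta * f(x) + (1 - theta) * f(y)
    assert lhs <= rhs + 1e-12     # convexity inequality holds at every sampled pair
```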

Convex Set



  • For any $x,y \in \mathcal{C}$ and $\theta \in [0,1]$


$$\theta x + (1-\theta)y \in \mathcal{C}$$

3. Solving Optimization Problems¶

  • Starting with the unconstrained, one-dimensional case



  • To find the minimum point $x^*$, we can look at the derivative of the function, $f'(x)$:
    • Any location where $f'(x) = 0$ will be a "flat" point in the function
  • For convex problems, this is guaranteed to be a minimum
  • Generalization to a multivariate function $f:\mathbb{R}^n \rightarrow \mathbb{R}$

    • The gradient of $f$ must be zero
$$ \nabla _x f(x) = 0$$
  • The gradient is an $n$-dimensional vector containing the partial derivatives with respect to each dimension


$$ x = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} \quad \quad \quad \quad \nabla _x f(x) = \begin{bmatrix} \frac{\partial f(x)}{\partial x_1} \\ \vdots\\ \frac{\partial f(x)}{\partial x_n} \end{bmatrix} $$


  • For a continuously differentiable $f$ and unconstrained optimization, the optimal point must satisfy $\nabla _x f(x^*)=0$
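In practice the gradient can also be approximated by finite differences, which is a handy way to sanity-check an analytic gradient. This is a minimal sketch; the helper name is our own, and the test function is the quadratic solved analytically in the next subsection.

```python
import numpy as np

def numerical_gradient(f, x, h=1e-6):
    """Central-difference approximation of the gradient of f at x."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        grad[i] = (f(x + e) - f(x - e)) / (2 * h)
    return grad

# f(x) = 2*x1^2 + x2^2 + x1*x2 - 6*x1 - 5*x2
f = lambda x: 2*x[0]**2 + x[1]**2 + x[0]*x[1] - 6*x[0] - 5*x[1]
print(numerical_gradient(f, np.array([1.0, 2.0])))   # ~[0, 0] at the minimizer
```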

3.1. How To Find $\nabla _x f(x) = 0$: Analytic Approach¶

  • Direct solution

    • In some cases, it is possible to analytically compute $x^*$ such that $ \nabla _x f(x^*)=0$


$$ \begin{align*} f(x) &= 2x_1^2+ x_2^2 + x_1 x_2 -6 x_1 -5 x_2\\\\ \Longrightarrow \nabla _x f(x) &= \begin{bmatrix} 4x_1+x_2-6\\ 2x_2 + x_1 -5 \end{bmatrix} = \begin{bmatrix}0\\0 \end{bmatrix}\\\\ \therefore x^* &= \begin{bmatrix} 4 & 1\\ 1 & 2 \end{bmatrix} ^{-1} \begin{bmatrix} 6 \\ 5\\ \end{bmatrix} = \begin{bmatrix} 1 \\ 2\\ \end{bmatrix} \end{align*} $$
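The same stationary point can be recovered numerically by solving the linear system obtained from $\nabla_x f(x) = 0$ (a minimal NumPy sketch):

```python
import numpy as np

# Setting the gradient of f(x) = 2*x1^2 + x2^2 + x1*x2 - 6*x1 - 5*x2 to zero
# gives the linear system  [[4, 1], [1, 2]] x = [6, 5]
A = np.array([[4.0, 1.0],
              [1.0, 2.0]])
b = np.array([6.0, 5.0])

x_star = np.linalg.solve(A, b)
print(x_star)   # [1. 2.]
```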

  • Note: Matrix derivatives



Examples (verified numerically in the sketch after this list)

  • Affine function $g(x) = a^Tx + b$
$$\nabla g(x) = a$$
  • Quadratic function $g(x) = x^T P x + q^T x + r,\qquad P = P^T$
$$\nabla g(x) = 2Px + q$$
  • $g(x) = \lVert Ax - b \rVert ^2 = x^TA^TAx - 2b^TAx + b^Tb$
$$\nabla g(x) = 2A^TAx-2A^Tb$$
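These gradient formulas can be verified against a central-difference approximation (a minimal sketch; the random $A$, $b$, and $x$ are our own):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 3))
b = rng.normal(size=5)
x = rng.normal(size=3)

g = lambda x: np.sum((A @ x - b)**2)          # g(x) = ||Ax - b||^2
analytic = 2 * A.T @ A @ x - 2 * A.T @ b      # formula: 2 A^T A x - 2 A^T b

# central-difference check of each partial derivative
h = 1e-6
numeric = np.array([(g(x + h*e) - g(x - h*e)) / (2*h) for e in np.eye(3)])
print(np.allclose(analytic, numeric, atol=1e-4))   # True
```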



Note: Revisit the Least-Squares Solution of $J(x) = \lVert Ax - y \rVert ^2$


$$ \begin{align*} J(x) &= (Ax-y)^T(Ax-y)\\ &=(x^TA^T - y^T)(Ax - y)\\ &=x^TA^TAx - x^TA^Ty - y^TAx + y^Ty\\\\ \frac{\partial J}{\partial x} &= A^TAx + (A^TA)^Tx - A^Ty - (y^TA)^T \\ &=2A^TAx - 2A^Ty = 0\\\\ &\Rightarrow (A^TA)x = A^Ty\\\\ \therefore x^* &= (A^TA)^{-1}A^Ty \end{align*} $$
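Numerically, the normal-equation solution can be compared against NumPy's least-squares routine (a minimal sketch; the random $A$ and $y$ are our own):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(10, 3))
y = rng.normal(size=10)

x_normal = np.linalg.solve(A.T @ A, A.T @ y)       # x* = (A^T A)^{-1} A^T y
x_lstsq, *_ = np.linalg.lstsq(A, y, rcond=None)    # library least-squares solver

print(np.allclose(x_normal, x_lstsq))              # True
```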

3.2. How To Find $\nabla _x f(x) = 0$: Iterative Approach¶

  • Iterative methods

    • More commonly, the condition that the gradient equals zero will not have an analytical solution, and iterative methods are required





  • The gradient points in the direction of "steepest ascent" for the function $f$

3.2.1. Gradient Descent¶

  • This motivates the gradient descent algorithm, which repeatedly takes steps in the direction of the negative gradient


$$ x \leftarrow x - \alpha \nabla _x f(x) \quad \quad \text{for some step size } \alpha > 0$$



  • Gradient Descent
$$\text{Repeat : } x \leftarrow x - \alpha \nabla _x f(x) \quad \quad \text{for some step size } \alpha > 0$$



  • Gradient Descent in Higher Dimension
$$\text{Repeat : } x \leftarrow x - \alpha \nabla _x f(x)$$
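A minimal gradient descent sketch for the quadratic from Section 3.1 (the initial point, step size, and iteration count are our own choices):

```python
import numpy as np

# f(x) = 2*x1^2 + x2^2 + x1*x2 - 6*x1 - 5*x2, with gradient [4x1 + x2 - 6, x1 + 2x2 - 5]
grad_f = lambda x: np.array([4*x[0] + x[1] - 6,
                             x[0] + 2*x[1] - 5])

x = np.zeros(2)       # initial guess
alpha = 0.1           # step size

for _ in range(200):
    x = x - alpha * grad_f(x)   # repeat: x <- x - alpha * grad_x f(x)

print(x)   # converges to the analytic solution [1, 2]
```

For this quadratic, the iteration stays stable as long as the step size is below roughly $2/\lambda_{\max}$ of the Hessian (about 0.45 here); larger steps diverge.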