Find $J$ for the optimization problem below, with the following objective function and constraint. ($x,\omega\in \mathbb{R}^n$, $\omega_0\in \mathbb{R}$)
If $x$ lies on the hyperplane $\omega_0 + \omega^Tx = 0$, the objective function can be interpreted as the distance from the origin to that hyperplane, and the optimal value $J^{*}$ is the shortest such distance. Geometrically (or intuitively), this distance is minimized when the vector $x$ is orthogonal to the hyperplane.
The optimal $x^*$ that minimizes $\|x\|_2$:
$$ \omega^T x^* + \omega_0 = 0 \\ x^* = -\omega_0 \frac{\omega}{\|\omega\|^2} $$

Calculation of $J^*$:

$$ J^* = \|x^*\|_2 = \left\|-\omega_0 \frac{\omega}{\|\omega\|^2}\right\|_2 \\ J^* = \left| \omega_0 \right| \left\|\frac{\omega}{\|\omega\|^2}\right\|_2 \\ J^* = \frac{|\omega_0|}{\|\omega\|} $$
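As a quick numerical sanity check (the specific values of $\omega$ and $\omega_0$ below are arbitrary illustrations, not part of the problem), the sketch verifies that $x^*$ satisfies the constraint and that $\|x^*\|_2 = |\omega_0|/\|\omega\|$:

```python
import numpy as np

# Arbitrary example values for illustration only.
omega = np.array([3.0, -1.0, 2.0])
omega_0 = 4.0

# Closed-form minimizer: projection of the origin onto the hyperplane.
x_star = -omega_0 * omega / np.dot(omega, omega)

# x* lies on the hyperplane omega^T x + omega_0 = 0 ...
print(np.isclose(omega @ x_star + omega_0, 0.0))   # True
# ... and its norm equals the claimed optimal value |omega_0| / ||omega||.
print(np.isclose(np.linalg.norm(x_star),
                 abs(omega_0) / np.linalg.norm(omega)))  # True
```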
Solve the following optimization problem
where
$$ P = \begin{bmatrix} 13 & 12 & -2\\ 12 & 17 & 6\\ -2 & 6 & 12\end{bmatrix} , \quad\quad q = \begin{bmatrix} -22.0\\ -14.5\\ 13.0\end{bmatrix} , \quad\quad r = 1. $$

Prove that $x^* = \left(1,\frac{1}{2},-1 \right)$ is optimal for the optimization problem
(Hint: the first-order change in the value of a function $f$ when moving from a given point $x^*$ to an arbitrary point $x$ is $\nabla f(x^*)^T(x - x^*)$.)
Let $x = \begin{bmatrix} x_1\\ x_2\\ x_3\end{bmatrix}$, then
To obtain the optimal solution $x^*$, we take the partial derivative of $f$ with respect to each element of $x$ as follows.
This is expressed in a matrix form as follows.
The gradient at given point $x^* = \left(1,\frac{1}{2},-1 \right)$ is calculated as follows.
From the given point $x^*$, we compute the first-order change $\nabla f(x^*)^T(x - x^*)$ with respect to an arbitrary point $x$ restricted to $-1 \leq x_i \leq 1, \quad i = 1,2,3$, as follows.
As a result, $x^*$ is an optimal point: $\nabla f(x^*)^T(x - x^*) \geq 0$ for every point $x$ satisfying the constraints, so the objective cannot decrease in any feasible direction.
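A quick numerical check of this argument, as a sketch assuming the standard quadratic objective $f(x) = \tfrac{1}{2}x^T P x + q^T x + r$ (which is how $P$, $q$, and $r$ enter the problem):

```python
import numpy as np

P = np.array([[13.0, 12.0, -2.0],
              [12.0, 17.0,  6.0],
              [-2.0,  6.0, 12.0]])
q = np.array([-22.0, -14.5, 13.0])
x_star = np.array([1.0, 0.5, -1.0])

# Gradient of f(x) = 1/2 x^T P x + q^T x + r at x*.
grad = P @ x_star + q
print(grad)  # [-1.  0.  2.]

# First-order optimality: grad^T (x - x*) >= 0 for all feasible x with -1 <= x_i <= 1.
g = np.linspace(-1.0, 1.0, 21)
pts = np.array(np.meshgrid(g, g, g)).reshape(3, -1).T
print(np.all(pts @ grad - x_star @ grad >= -1e-12))  # True
```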
Fit a straight line $y = \theta_1 x + \theta_0$ to the points $(1,2), (2,3), (0,0), (-3,-5)$. You must solve by recasting this problem as least squares.
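A minimal sketch of the least-squares recast (the design matrix with columns $[x, 1]$ is the standard choice for the model $y = \theta_1 x + \theta_0$):

```python
import numpy as np

# Data points (x, y).
x = np.array([1.0, 2.0, 0.0, -3.0])
y = np.array([2.0, 3.0, 0.0, -5.0])

# Recast y = theta_1 * x + theta_0 as min_theta ||A theta - y||^2,
# where each row of A is [x_i, 1] and theta = [theta_1, theta_0].
A = np.column_stack([x, np.ones_like(x)])
theta, *_ = np.linalg.lstsq(A, y, rcond=None)
print(theta)  # [theta_1, theta_0] = [23/14, 0] ~ [1.643, 0.0]
```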
What are the advantages (or characteristics) of the logistic regression compared to the perceptron?
What is the advantage of having big data?
Why do we divide data into a training set and a validation (testing) set?
Suggest possible approaches to avoid overfitting.
LR can still find a decision boundary (hyperplane) even when the data are not fully separable, and the LR algorithm makes use of information from all of the data points.
More data helps prevent overfitting.
To detect overfitting: evaluating on held-out data shows whether the model generalizes beyond the training set.
early stopping, regularization, data augmentation, dropout, ...
Suppose we have a linear regression model with two weights and no bias term:
and the usual loss function $\ell(y,\hat y) = \frac{1}{2}(y - \hat y)^2$ and cost $\mathcal{C}(\omega_1, \omega_2) = \frac{1}{m} \sum_{i}\ell(y^{(i)},\hat y ^{(i)})$. Suppose we have a training set consisting of $m = 3$ examples:
$x^{(1)} = [1,0]^T,\; y^{(1)} = 2$
$x^{(2)} = [0,1]^T,\; y^{(2)} = 2$
$x^{(3)} = [\sqrt{3},0]^T,\; y^{(3)} = 0$
Let's sketch one of the contours in weight space.
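A short plotting sketch (the choice of matplotlib and the grid ranges are mine, not the original's): the predictions are $\hat y^{(1)} = \omega_1$, $\hat y^{(2)} = \omega_2$, $\hat y^{(3)} = \sqrt{3}\,\omega_1$, so the cost is quadratic in $(\omega_1, \omega_2)$ and its contours are ellipses.

```python
import numpy as np
import matplotlib.pyplot as plt

# Cost C(w1, w2) = (1/3) * sum_i 1/2 (y_i - y_hat_i)^2 for the three examples.
w1, w2 = np.meshgrid(np.linspace(-2.0, 3.0, 300), np.linspace(-1.0, 5.0, 300))
cost = ((2.0 - w1) ** 2 + (2.0 - w2) ** 2 + (np.sqrt(3.0) * w1) ** 2) / 6.0

# The minimum is at (w1, w2) = (1/2, 2) with cost 1/2; pick contour levels above that.
plt.contour(w1, w2, cost, levels=[0.75, 1.0, 2.0, 4.0])
plt.xlabel(r"$\omega_1$")
plt.ylabel(r"$\omega_2$")
plt.gca().set_aspect("equal")
plt.show()
```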
Consider the binary classification task that consists of the following points:
$g(x) = \omega_0 + \omega_1 x_1 + \omega_2 x_2 = 0$ is a generic expression of a linear classification boundary. Show that the linear classification boundary found by the SVM approach is $x_1 = 0$.
SVM maximizes the margin (the minimum distance from the boundary to the data points). First, assume that $g(x) = 0$ is a line with positive slope. Then,
For this line to have a positive slope, $- \frac{\omega_1}{\omega_2} > 1, \quad \therefore \omega_1 < -\omega_2$
Under this assumption, $(1,1)$ and $(-1,-1)$ must be the support vectors, so their distances from the boundary line must be equal.
But the case $\omega_1 = -\omega_2$ violates the condition $\omega_1 < -\omega_2$, so $\omega_0 = 0$.
Now the problem becomes
The maximum distance is 1, when $\omega_1 = 0$ or $\omega_2 = 0$.
If $\omega_1 = 0$, then $g(x) = 0$ reduces to $x_2 = 0$, which cannot classify those points. So $\omega_2 = 0$ and the boundary $g(x) = 0$ is the line $x_1 = 0$.
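As a small check of the claimed boundary: the distance from a point to $g(x) = \omega_0 + \omega_1 x_1 + \omega_2 x_2 = 0$ is $|g(x)|/\|\omega\|$, and with the boundary $x_1 = 0$ (i.e., $\omega = (1,0)$, $\omega_0 = 0$) both stated support vectors are at distance 1, matching the maximum margin found above.

```python
import numpy as np

# Boundary x_1 = 0 written as g(x) = w0 + w^T x with w = (1, 0), w0 = 0.
w = np.array([1.0, 0.0])
w0 = 0.0

for p in (np.array([1.0, 1.0]), np.array([-1.0, -1.0])):
    dist = abs(w0 + w @ p) / np.linalg.norm(w)
    print(p, dist)  # both support vectors are at distance 1.0
```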
Assume that we trained a logistic regression model and our class probabilities can be found by
and we classify using the rule
Show that this corresponds to a linear decision boundary in the input space where $\sigma(x) = \cfrac{1}{1+e^{-x}}$
(Hint: $ y(x) = \mathbb{1}[z(x) > 0.5]$ means $y = 1$ if $z(x) > 0.5$)
What the decision boundary looks like:
The derivation of the decision boundary is:
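A sketch of the standard argument, assuming the trained model has the usual form $P(y=1\mid x) = \sigma(\omega^T x + \omega_0)$ (the exact parameterization is not reproduced here): since $\sigma$ is strictly increasing and $\sigma(0) = 0.5$,

$$
\sigma(\omega^T x + \omega_0) > 0.5 \;\Longleftrightarrow\; \omega^T x + \omega_0 > 0,
$$

so the decision boundary $\{x : \omega^T x + \omega_0 = 0\}$ is a hyperplane, i.e., linear in the input space.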
Let's do principal components analysis (PCA). Consider this sample of six points $X_i \in \mathbb{R}^2$
a. Which of those two principal components would be preferred if you use only one?
b. What information does the PCA algorithm use to decide that one principal component is better than another?
c. From an optimization point of view, why do we prefer that one?
$$
\mu = \frac{1}{6} \sum_{i=1}^{6} X_i = \begin{bmatrix}
1 \\
1 \\
\end{bmatrix}.
$$
$$
X^TX = \begin{bmatrix}
4 & 2 \\
2 & 4 \\
\end{bmatrix}.
$$
a. We choose $v_{2} = \begin{bmatrix}1/\sqrt2 \\1/\sqrt2\end{bmatrix}$ first.
b. PCA picks the principal component with the largest eigenvalue.
c. We prefer it because it maximizes the variance of the sample points after they are projected onto a line parallel to $v_{2}$. (Alternatively, because it minimizes the sum of squared distances from each sample point to its projection on that line.)
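A quick numerical check of (a) and (b): the eigendecomposition of the given $X^TX$ shows that $v_{2} = [1/\sqrt2,\,1/\sqrt2]^T$ has the larger eigenvalue (6 versus 2), so it captures more of the projected variance.

```python
import numpy as np

XtX = np.array([[4.0, 2.0],
                [2.0, 4.0]])

# eigh returns eigenvalues in ascending order, eigenvectors as orthonormal columns.
eigvals, eigvecs = np.linalg.eigh(XtX)
print(eigvals)         # [2. 6.]
print(eigvecs[:, -1])  # eigenvector for eigenvalue 6: +/-[0.7071, 0.7071], i.e. v_2
```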
The vector projection of $X_{i}$ onto a direction $w$ is
$$
\hat{X_i} = \frac{X_i^Tw}{||w||^2} w.
$$
For our sample and our chosen principal component $w = v_2$, the projections are
The top figure shows that, depending on the given features, a complicated decision tree may result even though the data could be split better along another direction. However, if PCA is used to find the right direction for splitting the data, the decision tree may be simpler, as shown below.
First, reshape all of the images into a single two-dimensional array of shape (64$\times$64)$\times$400.
Second, subtract the mean of the array along axis 0 to zero-center it.
Third, decompose the zero-centered array using a truncated SVD (truncation rank = 100).
Finally, the storage is reduced from (400$\times$4096)$\times$2 to ((400$\times$100)+(4096$\times$100))$\times$2.
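A minimal sketch of this pipeline (the array name `images`, the placeholder data, and the use of NumPy are assumptions for illustration; the original implementation is not shown):

```python
import numpy as np

# Assume `images` holds 400 grayscale images of size 64x64 (placeholder data here).
images = np.random.rand(400, 64, 64)

# First: flatten each image into a row -> array of shape (400, 4096).
X = images.reshape(400, -1)

# Second: subtract the per-pixel mean (zero-center along axis 0).
X_centered = X - X.mean(axis=0)

# Third: truncated SVD keeping the top k = 100 singular values.
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
k = 100
X_approx = U[:, :k] * S[:k] @ Vt[:k, :]   # rank-100 reconstruction

# Storage comparison: full array vs. the truncated factors.
print(400 * 4096, 400 * k + 4096 * k)
```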
We want you to compute (by hand) part of the singular value decomposition (SVD) $X = U \Sigma V^T$ of the matrix
$$
Y = X^TX = \begin{bmatrix}
2 & 1 \\
1 & 2 \\
\end{bmatrix}
$$
Note: if you like doing things the hard way, it is also correct to use the matrix $XX^T$.
Proof: Let $Z = UDV^T$ be the SVD of $Z$; then $Z^TZ = VDU^TUDV^T = VD^2V^T$, which is clearly an eigendecomposition (because $V$ is orthonormal and $D^2$ is diagonal). If $Z$ has a singular value $\sigma$ with right singular vector $v$, then $v$ is an eigenvector of $Z^TZ$ with eigenvalue $\sigma^2$.
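A small numerical check of this relationship using the given $Y = X^TX$: its eigenvalues are the squared singular values of $X$, and its eigenvectors are right singular vectors of $X$ (up to sign).

```python
import numpy as np

Y = np.array([[2.0, 1.0],
              [1.0, 2.0]])  # Y = X^T X from the problem

# Eigenvalues of X^T X are the squared singular values of X.
eigvals, eigvecs = np.linalg.eigh(Y)
print(np.sqrt(eigvals))  # singular values of X: [1.0, 1.732...] = [1, sqrt(3)]
print(eigvecs)           # columns are right singular vectors of X (up to sign)
```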
Postscript: By the way, one correct SVD is