Probability for Machine Learning


By Prof. Seungchul Lee
http://iai.postech.ac.kr/
Industrial AI Lab at POSTECH

Table of Contents

1. Random Variable (= r.v.)

1.1. Definition


  • (Rough) Definition: a variable whose values occur with given probabilities
  • Probability that $x=a$


$$\triangleq \; P_X(x=a)\;= \; P(x=a) \;\implies\; \begin{cases} \text{1)} \; P(x=a) \geq 0 \\ \text{2)} \; \sum\limits_{\text{all}} P(x)=1 \end{cases}$$


  • $\begin{cases} \text{continuous r.v.} \qquad \text{if} \;x \;\text{is continuous}\\ \text{discrete r.v.} \quad \qquad \, \text{if} \;x \;\text{is discrete} \end{cases}$

Example

  • $x$: die outcome


$$ P(x=1)=P(x=2)= \;\dotsb \;= P(x=6) = \frac{1}{6} $$

  • Question
$$ \begin{align*} y &= x_1 + x_2: \;\;\; \text{ sum of two dice} \\\\ P_Y(y=5) &= \text{?} \end{align*} $$
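One way to answer this is to enumerate all 36 equally likely outcomes of the two dice and count the pairs that sum to 5. A minimal Python sketch (my own illustration, not part of the original example):

from itertools import product

# enumerate all 36 equally likely outcomes of two fair dice
outcomes = list(product(range(1, 7), repeat=2))

# count the pairs (x1, x2) with x1 + x2 = 5: (1,4), (2,3), (3,2), (4,1)
p_y5 = sum(1 for x1, x2 in outcomes if x1 + x2 == 5) / len(outcomes)
print(p_y5)   # 4/36 = 0.111...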



1.2. Expectation (= Mean)



$$ E[x]= \begin{cases} \; \sum\limits_{x}xP(x) & \quad \text{discrete}\\ \; \int_{x} xP(x)dx & \quad \text{continuous} \end{cases}$$


Example


$$ \begin{align*} \text{Sample mean} \quad E[x] &= \sum\limits_{x}x\cdot\frac{1}{m}\;\;(\because\;\text{uniform distribution assumed})\\\\ \text{Variance} \quad \text{var}[x] &=E\left[\left(x-E[x]\right)^2 \right]\;\text{: mean square deviation from mean} \end{align*}$$
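As a sanity check, both quantities can be computed directly from samples. A minimal NumPy sketch (the data array below is hypothetical, chosen only for illustration):

import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])   # hypothetical samples
m = len(x)

sample_mean = np.sum(x) / m                      # E[x] with uniform weight 1/m
sample_var  = np.sum((x - sample_mean)**2) / m   # E[(x - E[x])^2]

print(sample_mean, sample_var)    # 5.0 4.0
print(np.mean(x), np.var(x))      # same values using NumPy's built-ins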

2. Random Vectors (Multivariate R.V.)



$$ x = \begin{bmatrix}x_1\\x_2\\\vdots\\x_n\end{bmatrix}, \;\;\text{$n$ random variables}$$

2.1. Joint Density Probability

  • The joint probability models the co-occurrence of multiple random variables


$$ P_{X_1,\cdots,X_n}(X_1=x_1,\cdots,X_n=x_n)$$

2.2. Marginal Density Probability



$$ \begin{align*} P_{X_1}(X_1&=x_1)\\ P_{X_2}(X_2&=x_2)\\ &\vdots\\ P_{X_n}(X_n&=x_n) \end{align*} $$


  • For two r.v.
$$ \begin{align*} P(X) &= \sum_y P(X,Y=y)\\ P(Y) &= \sum_x P(X=x,Y) \end{align*}$$
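For instance, for two independent fair dice the joint table is 6 x 6 with every entry 1/36, and each marginal is recovered by summing over the other variable. A small NumPy sketch (my own illustration):

import numpy as np

# joint probability table of two independent fair dice: P(X=i, Y=j) = 1/36
P_xy = np.full((6, 6), 1/36)

P_x = P_xy.sum(axis=1)   # P(X) = sum over y of P(X, Y=y)
P_y = P_xy.sum(axis=0)   # P(Y) = sum over x of P(X=x, Y)

print(P_x)   # every entry 1/6
print(P_y)   # every entry 1/6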




2.3. Conditional Probability

  • Probability of one event when we know the outcome of the other


$$P_{X_1 \mid X_2}(X_1=x_1 \mid X_2=x_2) = \frac{P(X_1=x_1,X_2=x_2)}{P(X_2=x_2)}: \qquad \text{Conditional prob. of $x_1$ given $x_2$}$$



  • Independent random variables
    • when one tells nothing about the other


$$\begin{align*} P(X_1=x_1 \mid X_2=x_2) &=P(X_1=x_1) \\\\ &\updownarrow \\\\ P(X_2=x_2 \mid X_1=x_1) &=P(X_2=x_2) \\\\ &\updownarrow \\\\ P(X_1=x_1,X_2=x_2) & = P(X_1=x_1)P(X_2=x_2) \end{align*}$$
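These equivalences can be checked numerically: for the two-dice joint table, conditioning on $X_2$ leaves the distribution of $X_1$ unchanged, and the joint factorizes into the product of the marginals. A small NumPy sketch (my own illustration):

import numpy as np

# joint table of two independent fair dice: P(X1=i, X2=j) = 1/36
P_12 = np.full((6, 6), 1/36)
P_1 = P_12.sum(axis=1)    # marginal P(X1)
P_2 = P_12.sum(axis=0)    # marginal P(X2)

P_1_given_2 = P_12 / P_2                           # P(X1 | X2) = P(X1, X2) / P(X2), column-wise
print(np.allclose(P_1_given_2, P_1[:, None]))      # True: conditioning changes nothing
print(np.allclose(P_12, np.outer(P_1, P_2)))       # True: joint = product of marginals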



Example

  • Four dice $\, \omega_1,\; \omega_2, \; \omega_3, \;\omega_4$


$$\begin{align*}x&=\omega_1+\omega_2\;\;\; \text{: sum of the first two dice}\\\\ y&=\omega_1+\omega_2+\omega_3+\omega_4\;\;\; \text{: sum of all four dice} \\\\ &\text{probability of } \begin{bmatrix}x\\y\end{bmatrix}=\;? \\ \end{align*}$$

  • Marginal probability


$$ P_X(x) = \sum\limits_{y}P_{XY}(x,y)$$



  • Conditional probability
    • Suppose we measured $\,y=19$


$$P_{X \mid Y}(x \mid y =19) =\;?$$
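This can be answered by brute-force enumeration of the $6^4$ equally likely outcomes of the four dice. A minimal Python sketch (my own illustration, not the original solution):

from itertools import product
from collections import Counter

# enumerate all 6^4 equally likely outcomes of four fair dice
counts = Counter()
for w1, w2, w3, w4 in product(range(1, 7), repeat=4):
    x = w1 + w2                  # sum of the first two dice
    y = w1 + w2 + w3 + w4        # sum of all four dice
    counts[(x, y)] += 1

total = 6**4
p_y19 = sum(c for (x, y), c in counts.items() if y == 19) / total   # marginal P(Y = 19)
for x in range(2, 13):
    print(x, counts.get((x, 19), 0) / total / p_y19)                # P(X = x | Y = 19)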




  • Pictorial illustration






Example

  • Suppose we have three bins, labeled A, B, and C.
  • Two of the bins have only white balls, and one bin has only black balls.

$\;\;$ 1) We draw one ball from a randomly chosen bin. What is the probability that it is white? (white = 1)

$$ P(X_1=1)=\frac{2}{3} $$

$\;\;$ 2) Given that a white ball was drawn first, what is the probability of also drawing a white ball from a second, different bin?

$$ P(X_2=1 \mid X_1=1) = \frac{1}{2}$$

$\;\;$ 3) When two balls are drawn from two different bins, what is the probability that both are white?

$$ P(X_1=1,X_2=1)=P(X_2=1 \mid X_1 = 1)P(X_1=1)=\frac{1}{2}\cdot\frac{2}{3}=\frac{1}{3}$$
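A quick Monte Carlo check of the three answers (a rough sketch; the bin labels, seed, and number of trials are arbitrary):

import random

random.seed(0)
trials = 100_000
first_white = both_white = 0

for _ in range(trials):
    bins = ['white', 'white', 'black']
    random.shuffle(bins)             # random arrangement of contents over bins A, B, C
    x1, x2 = bins[0], bins[1]        # draw one ball from each of two different bins
    if x1 == 'white':
        first_white += 1
        if x2 == 'white':
            both_white += 1

print(first_white / trials)          # P(X1 = 1)            ~ 2/3
print(both_white / first_white)      # P(X2 = 1 | X1 = 1)   ~ 1/2
print(both_white / trials)           # P(X1 = 1, X2 = 1)    ~ 1/3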

3. Bayes Rule



  • Enables us to swap the two events in a conditional probability, i.e., to obtain $P(X_2 \mid X_1)$ from $P(X_1 \mid X_2)$


$$ \begin{align*} P(X_2,X_1) &= P(X_2 \mid X_1)P(X_1) = P(X_1 \mid X_2)P(X_2)\\ \\ \therefore \;\; P(X_2 \mid X_1) &= \frac{P(X_1 \mid X_2)P(X_2)}{P(X_1)} \end{align*}$$

Example

  • Suppose that in a group of people, 40% are male and 60% are female.
  • 50% of the males are smokers, 30% of the females are smokers.
  • Find the probability that a smoker is male.



$$\begin{array}{Icr} x = \; \text{M or F} \qquad \\ y = \; \text{S or N} \qquad \end{array} \quad \quad \begin{array}{Icr} \begin{align*} &P(x=\text{M})=0.4\\ &P(x=\text{F})=0.6\\ &P(y=\text{S} \mid x=\text{M})=0.5\\ &P(y=\text{S} \mid x=\text{F})=0.3 \end{align*}\end{array}$$


$$P(x=\text{M} \mid y = \text{S}) = \text{?}$$


  • Bayes' rule + law of total probability (to obtain $P(y=\text{S})$)


$$\begin{align*} P(x=\text{M} \mid y = \text{S}) &=\frac{P(y=\text{S} \mid x=\text{M})P(x=\text{M})}{P(y=\text{S})}=\frac{0.20}{0.38} \approx 0.53 \\\\ P(y=\text{S}) & =P(y=\text{S} \mid x=\text{M})P(x=\text{M}) + P(y=\text{S} \mid x=\text{F})P(x=\text{F}) \\\\ & = 0.5 \times 0.4 + 0.3 \times 0.6 = 0.38 \end{align*}$$
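The same computation in a few lines of Python (values copied from the example above):

p_M, p_F = 0.4, 0.6          # P(x = M), P(x = F)
p_S_M, p_S_F = 0.5, 0.3      # P(y = S | x = M), P(y = S | x = F)

p_S = p_S_M * p_M + p_S_F * p_F     # total probability: P(y = S) = 0.38
p_M_S = p_S_M * p_M / p_S           # Bayes' rule: P(x = M | y = S)
print(p_S, p_M_S)                   # 0.38 0.526...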

4. Linear Transformation of Random Variables

4.1. For Single Random Variable


$$ \begin{align*} X &\mapsto Y =aX\\\\ E[aX] &=aE[X]\\ \text{var}(aX) &= a^2 \text{var}(X) \\ \\ \text{var}(X) &= E[(X-E[X])^2]=E[(X-\bar{X})^2]=E[X^2-2X\bar{X}+\bar{X}^2]\\ &=E[X^2]-2E[X\bar{X}]+\bar{X}^2 = E[X^2]-2E[X]\bar{X}+\bar{X}^2\\ &=E[X^2]-E[X]^2 \end{align*} $$
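These identities are easy to verify empirically. A NumPy sketch (the distribution and the constant $a$ are arbitrary choices for illustration):

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=3.0, size=1_000_000)   # any distribution would do
a = 5.0

print(np.mean(a * x), a * np.mean(x))              # E[aX]   = a E[X]
print(np.var(a * x), a**2 * np.var(x))             # var(aX) = a^2 var(X)
print(np.var(x), np.mean(x**2) - np.mean(x)**2)    # var(X)  = E[X^2] - E[X]^2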

4.2. Sum of Two Random Variables $X$ and $Y$


$$ \begin{align*} Z&=X+Y \;\text{(still univariate)}\\\\ E[X+Y] &=E[X]+E[Y]\\ \text{var}(X+Y) &= E[(X+Y-E[X+Y])^2] = E[((X-\bar{X})+(Y-\bar{Y}))^2]\\ &=E[(X-\bar{X})^2]+E[(Y-\bar{Y})^2]+2E[(X-\bar{X})(Y-\bar{Y})]\\ &=\text{var}(X)+\text{var}(Y)+2 \text{cov}(X,Y)\\ \\ \text{cov}(X,Y) &=E[(X-\bar{X})(Y-\bar{Y})]=E[XY-X\bar{Y}-\bar{X}Y+\bar{X}\bar{Y}]\\ &=E[XY]-E[X]\bar{Y}-\bar{X}E[Y]+\bar{X}\bar{Y}=E[XY]-E[X]E[Y] \end{align*} $$
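The variance-of-a-sum identity can be checked with deliberately correlated samples (a minimal NumPy sketch; the construction of $X$ and $Y$ is arbitrary):

import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=1_000_000)
y = 0.5 * x + rng.normal(size=1_000_000)    # correlated with x on purpose

lhs = np.var(x + y)
rhs = np.var(x) + np.var(y) + 2 * np.cov(x, y, bias=True)[0, 1]
print(lhs, rhs)                             # the two values agree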


  • Note: this identity appears in quality control for manufacturing processes, where the variation of the final product accumulates from the variations of the individual steps


$$\text{var}(X+Y)=\text{var}(X)+\text{var}(Y)+2 \text{cov}(X,Y)$$


  • Remark
    • variance is defined for a single random variable
    • covariance is defined for a pair of random variables
  • Covariance of two r.v.


$$\text{cov}(x,y) = E[(x-\mu_x)(y-\mu_y)]$$

  • Covariance matrix for random vectors


$$ \begin{align*} \text{cov}(X) = E[(X-\mu)(X-\mu)^T] &=\begin{bmatrix}\text{cov}(X_1,X_1) & \text{cov}(X_1,X_2)\\\text{cov}(X_2,X_1) & \text{cov}(X_2,X_2)\end{bmatrix}\\ &=\begin{bmatrix}\text{var}(X_1) & \text{cov}(X_1,X_2)\\\text{cov}(X_2,X_1) & \text{var}(X_2)\end{bmatrix}\end{align*}$$
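NumPy's np.cov builds exactly this matrix from samples (rows are variables in its default convention). A small sketch with an artificially correlated pair:

import numpy as np

rng = np.random.default_rng(2)
X1 = rng.normal(size=100_000)
X2 = 0.8 * X1 + 0.6 * rng.normal(size=100_000)   # correlated with X1 by construction

C = np.cov(np.vstack([X1, X2]))   # 2 x 2 covariance matrix
print(C)                          # diagonal ~ var(X1), var(X2); off-diagonal ~ cov(X1, X2) ~ 0.8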

  • Moments: provide rough clues about the shape of a probability distribution


$$ \int x^k P_X(x)\,dx \;\;\; \text{or} \;\;\; \sum\limits_{x} x^k P_X(x) $$

4.3. Affine Transformation of Random Vectors


$$ \begin{align*} y &= Ax+b\\\\ E[y] &=AE[x]+b \\ \text{cov}(y) &= A \,\text{cov}(x)\,A^T \end{align*} $$
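A quick numerical check of both identities (a sketch with arbitrary $A$, $b$, and input distribution):

import numpy as np

rng = np.random.default_rng(3)
x = rng.multivariate_normal(mean=[1.0, 2.0],
                            cov=[[2.0, 0.5], [0.5, 1.0]],
                            size=1_000_000)             # samples stored as rows
A = np.array([[1.0, 2.0], [0.0, 3.0]])
b = np.array([1.0, -1.0])

y = x @ A.T + b                                          # y = A x + b, applied row-wise
print(np.mean(y, axis=0), A @ np.mean(x, axis=0) + b)    # E[y]   = A E[x] + b
print(np.cov(y.T), A @ np.cov(x.T) @ A.T)                # cov(y) = A cov(x) A^T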


  • IID random variables
    • identically distributed
    • independent
  • Suppose $ x_1,x_2,\cdots,x_m$ are IID with mean $\mu$ and variance $\sigma^2$


$$ \text{Let} \; x=\begin{bmatrix}x_1\\\vdots\\x_m\end{bmatrix},\quad \text{then} \; E[x]=\begin{bmatrix}\mu\\\vdots\\\mu\end{bmatrix},\quad \text{cov}(x) = \begin{bmatrix} \sigma^2 & & &\\ & \sigma^2 & &\\ & & \ddots &\\ & & & \sigma^2 \end{bmatrix}$$

  • Sample mean of IID random variables ($\rightarrow$ single r.v.)



$$S_m = \frac{1}{m}\sum\limits_{i=1}^m x_i \; \implies S_m = Ax \;\; \text{where} \; A=\frac{1}{m}\begin{bmatrix}1 & \cdots & 1 \end{bmatrix}$$



$$ \begin{align*}E[S_m] &= AE[x] = \frac{1}{m}\begin{bmatrix}1 & \cdots & 1 \end{bmatrix} \begin{bmatrix}\mu\\\vdots\\\mu\end{bmatrix} = \frac{1}{m}m\mu =\mu \\\\ \text{var}(S_m) &=A \,\text{cov}(x)\,A^T=A\begin{bmatrix} \sigma^2 & & &\\ & \sigma^2 & &\\ & & \ddots &\\ & & & \sigma^2 \end{bmatrix}A^T=\frac{\sigma^2}{m} \end{align*}$$


  • Averaging reduces the variance by a factor of $m$ $\implies$ law of large numbers and central limit theorem


$$ \bar{x} \longrightarrow N\left(\mu,\left(\frac{\sigma}{\sqrt{m}}\right)^2 \right) $$
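Both the $1/m$ variance reduction and the approach to a Gaussian can be seen numerically. A NumPy sketch (the uniform distribution and m = 100 are arbitrary choices):

import numpy as np

rng = np.random.default_rng(4)
m, trials = 100, 50_000

x = rng.uniform(0.0, 1.0, size=(trials, m))   # IID samples: mu = 0.5, sigma^2 = 1/12
S_m = x.mean(axis=1)                          # one sample mean per trial

print(S_m.mean())                   # ~ mu = 0.5
print(S_m.var(), (1/12) / m)        # ~ sigma^2 / m
# a histogram of S_m would look approximately Gaussian (central limit theorem)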
In [1]:
%%javascript
$.getScript('https://kmahelona.github.io/ipython_notebook_goodies/ipython_notebook_toc.js')