Probability for Machine Learning
Table of Contents
1. Random Variable¶
1.1. Definition¶
A random variable is a numerical quantity whose value is determined by the outcome of a random process. It serves as a mathematical representation of uncertainty.
Formally, a random variable $x$ is associated with a probability distribution, which specifies the likelihood that $x$ takes on specific values.
For a random variable $x$, the probability that it takes the value $a$ is written as:
$$ P(x = a) \;\triangleq\; P_X(x = a) $$
The following conditions must be satisfied:
$$ \begin{cases} 1)\; P(x = a) \geq 0 \quad \text{for all } a \in \mathcal{X} \\ 2)\; \sum\limits_{\text{all } x} P(x) = 1 \quad \text{(for discrete r.v.)} \end{cases} $$
In the case of continuous random variables, the summation is replaced by integration:
$$ \int_{-\infty}^{\infty} P(x)\,dx = 1 $$
Types of Random Variables
Random variables can be classified into two categories:
- Discrete random variable takes values from a countable set
- Continuous random variable takes values from a continuous range
Examples:
- Tossing a coin: discrete r.v. with outcomes $\{0, 1\}$
- Measuring temperature: continuous r.v. over $\mathbb{R}$
The choice of modeling a quantity as discrete or continuous depends on the nature of the data and the resolution at which the random process is observed.
Example
Let $x$ be the outcome of a fair die:
$$P(x = 1) = P(x = 2) = \dotsb = P(x = 6) = \frac{1}{6}$$
Example
Let $y = x_1 + x_2$: the sum of two independent dice rolls.
What is the probability that $y = 5$?
$$P_Y(y = 5) = \;?$$
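The question can be answered by enumerating the $36$ equally likely outcome pairs. Below is a minimal Python sketch of that enumeration (the approach and variable names are illustrative, not prescribed by the notes); it prints $4/36 = 1/9$.

```python
from itertools import product
from fractions import Fraction

# Enumerate all 36 equally likely (x1, x2) outcomes of two fair dice.
outcomes = list(product(range(1, 7), repeat=2))

# Keep the pairs whose sum equals 5: (1,4), (2,3), (3,2), (4,1).
favorable = [pair for pair in outcomes if sum(pair) == 5]

print(Fraction(len(favorable), len(outcomes)))  # 1/9
```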
1.2. Expectation (= Mean)¶
The expectation (or mean) of a random variable is a measure of its central tendency. It represents the long-run average value of the variable if the random process is repeated many times.
The expectation of a random variable $x$ is defined as:
$$ E[x] = \begin{cases} \sum\limits_{x} x \cdot P(x) & \text{(for discrete random variables)} \\\\ \int x \cdot P(x) \, dx & \text{(for continuous random variables)} \end{cases} $$
The expectation is a weighted average, where the weights are given by the probability distribution of the random variable.
Example: Uniform Distribution
Suppose $x$ takes on $m$ equally likely values (i.e., a uniform distribution), so that:
$$P(x) = \frac{1}{m}$$
Then the expectation becomes:
$$E[x] = \sum\limits_{x} x \cdot \frac{1}{m} = \frac{1}{m} \sum_x x$$
The variance of a random variable $x$ is defined as the expected squared deviation from its mean:
$$\mathrm{Var}[x] = E\left[ \left(x - E[x] \right)^2 \right]$$
This quantifies the spread or uncertainty of the random variable around its mean.
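As a quick numerical sanity check of these definitions, the sketch below computes the mean and variance of the fair die from Section 1.1 with NumPy (the array names are illustrative):

```python
import numpy as np

# Fair die: values 1..6, each with probability 1/6.
values = np.arange(1, 7)
probs = np.full(6, 1 / 6)

mean = np.sum(values * probs)               # E[x]   = sum_x x * P(x)
var = np.sum((values - mean) ** 2 * probs)  # Var[x] = E[(x - E[x])^2]

print(mean, var)  # 3.5  ~2.9167
```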
2. Random Vectors (Multivariate R.V.)¶
A random vector is a collection of multiple random variables:
$$ x = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}, \quad \text{where each } x_i \text{ is a random variable} $$
2.1. Joint Probability Density¶
The joint probability density (or mass) function models the probability of co-occurrence of multiple random variables:
$$P_{X_1, \cdots, X_n}(X_1 = x_1, \cdots, X_n = x_n)$$
This represents the probability that all $n$ random variables take on specific values simultaneously.
For the continuous case, a joint probability density function $p(x_1, \dotsc, x_n)$ is defined such that:
$$ P(a_1 \leq X_1 \leq b_1, \cdots, a_n \leq X_n \leq b_n) = \int_{a_1}^{b_1} \dotsm \int_{a_n}^{b_n} p(x_1, \cdots, x_n) \, dx_n \dotsm dx_1 $$
2.2. Marginal Density (or Marginal Probability)¶
The marginal probability describes the probability distribution of a subset of variables from a joint distribution. It is obtained by summing (discrete case) or integrating (continuous case) over the other variables.
For a random vector:
$$x = \begin{bmatrix}x_1 \\ x_2 \\ \vdots \\ x_n\end{bmatrix}$$
The marginal distribution of each component is:
$$ \begin{align*} P_{X_1}(X_1 &= x_1) = \text{probability distribution of } X_1 \\\\ P_{X_2}(X_2 &= x_2) = \text{probability distribution of } X_2 \\\\ &\vdots \\\\ P_{X_n}(X_n &= x_n) = \text{probability distribution of } X_n \end{align*} $$
Two Random Variables
Suppose we have two random variables $X$ and $Y$ with joint distribution $P(X, Y)$.
Discrete Case
$$ \begin{align*} P(X) &= \sum_y P(X, Y = y) \\\\ P(Y) &= \sum_x P(X = x, Y) \end{align*} $$
Continuous Case
$$ \begin{align*} p_X(x) &= \int p(x, y) \, dy \\\\ p_Y(y) &= \int p(x, y) \, dx \end{align*} $$
The marginal probability isolates the distribution of one variable regardless of the others. It is essential for understanding individual variables in multivariate settings.
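To make the discrete formulas concrete, the following sketch builds a small joint table as a NumPy array and marginalizes by summing over an axis. The specific numbers in `P_xy` are made up for illustration:

```python
import numpy as np

# Hypothetical joint distribution P(X, Y): rows index x, columns index y.
P_xy = np.array([[0.10, 0.20, 0.10],
                 [0.25, 0.15, 0.20]])
assert np.isclose(P_xy.sum(), 1.0)

# Marginals: sum out the other variable.
P_x = P_xy.sum(axis=1)   # P(X) = sum_y P(X, Y = y)
P_y = P_xy.sum(axis=0)   # P(Y) = sum_x P(X = x, Y)

print(P_x)  # [0.4 0.6]
print(P_y)  # [0.35 0.35 0.3 ]
```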
2.3. Conditional Probability¶
The conditional probability measures the likelihood of a random variable taking a particular value, given that another variable is already known to take a specific value.
Formally, the conditional probability of $X_1 = x_1$ given $X_2 = x_2$ is defined as:
$$ P_{X_1 \mid X_2}(X_1 = x_1 \mid X_2 = x_2) = \frac{P(X_1 = x_1,\; X_2 = x_2)}{P(X_2 = x_2)} $$
Independence
Two random variables $X_1$ and $X_2$ are independent if the outcome of one provides no information about the outcome of the other.
This is equivalent to any of the following conditions:
$$ \begin{align*} P(X_1 = x_1 \mid X_2 = x_2) &= P(X_1 = x_1) \\\\ \Updownarrow \\\\ P(X_2 = x_2 \mid X_1 = x_1) &= P(X_2 = x_2) \\\\ \Updownarrow \\\\ P(X_1 = x_1,\; X_2 = x_2) &= P(X_1 = x_1) \cdot P(X_2 = x_2) \end{align*} $$
Independence simplifies many probabilistic models and is a common assumption in both theory and practice.
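The conditional-probability definition and the factorization test for independence can be checked directly on a joint table. The sketch below reuses the same style of made-up table as before (the values are assumptions for illustration only):

```python
import numpy as np

# Hypothetical joint table P(X1, X2); rows index x1, columns index x2.
P = np.array([[0.10, 0.20, 0.10],
              [0.25, 0.15, 0.20]])

P_x1 = P.sum(axis=1)                 # marginal P(X1)
P_x2 = P.sum(axis=0)                 # marginal P(X2)

# Conditional P(X1 | X2 = x2) = P(X1, X2 = x2) / P(X2 = x2), one column per x2.
P_x1_given_x2 = P / P_x2

# Independence holds iff the joint factorizes into the product of the marginals.
independent = np.allclose(P, np.outer(P_x1, P_x2))

print(P_x1_given_x2[:, 0])   # P(X1 | X2 = first value)
print(independent)           # False for this particular table
```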
Example: Dice
Let $\omega_1, \omega_2, \omega_3, \omega_4$ be outcomes of four independent dice.
$$ \begin{align*} x &= \omega_1 + \omega_2 \quad \text{(sum of the first two dice)} \\\\ y &= \omega_1 + \omega_2 + \omega_3 + \omega_4 \quad \text{(sum of all four dice)} \end{align*} $$
(1) What is the joint probability?
$$ P\left( \begin{bmatrix}x \\ y\end{bmatrix} = \begin{bmatrix} a \\ b \end{bmatrix} \right) = P_X(x = a) \cdot P_{Y \mid X}(y = b \mid x = a) $$
(2) Marginal probability?
$$ P_X(x) = \sum_y P_{XY}(x, y) $$
(3) Conditional Probability?
Suppose we observed $y = 19$, then:
$$P_{X \mid Y}(x \mid y = 19) = \frac{P_{XY}(x, y = 19)}{P_Y(y = 19)}$$
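All three questions can be answered by brute-force enumeration of the $6^4$ equally likely outcomes of the four dice. The sketch below tabulates part (3), the conditional distribution $P_{X \mid Y}(x \mid y = 19)$ (the enumeration approach is one possible implementation, not the only one):

```python
from itertools import product
from collections import Counter
from fractions import Fraction

# Enumerate all 6^4 equally likely outcomes (w1, w2, w3, w4) of four fair dice.
counts = Counter()           # counts[x] = number of outcomes with y = 19 and that x
for w in product(range(1, 7), repeat=4):
    x = w[0] + w[1]          # sum of the first two dice
    y = sum(w)               # sum of all four dice
    if y == 19:
        counts[x] += 1

n_y19 = sum(counts.values())             # number of outcomes with y = 19
for x_val in sorted(counts):
    # P(x | y = 19) = P(x, y = 19) / P(y = 19) = counts[x] / n_y19
    print(x_val, Fraction(counts[x_val], n_y19))
```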
Summary: Pictorial Illustrations
- Probability
- Marginal Probability
- Conditional Probability
Illustrative Example: Bins and Balls
- There are 3 bins: A, B, C
- Two bins contain only white balls, and one contains only black balls.
(1) What is the probability of drawing a white ball first? (White = 1)
$$ P(X_1 = 1) = \frac{2}{3} $$
(2) Given that a white ball was drawn first, what is the probability of drawing a white ball again from a different bin?
$$P(X_2 = 1 \mid X_1 = 1) = \frac{1}{2}$$
(3) What is the probability of drawing two white balls from two different bins?
$$ \begin{align*} P(X_1 = 1, X_2 = 1) &= P(X_2 = 1 \mid X_1 = 1) \cdot P(X_1 = 1) \\\\ &= \frac{1}{2} \cdot \frac{2}{3} = \frac{1}{3} \end{align*} $$
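A short Monte Carlo simulation can sanity-check these three values. The setup below (random bin choices, white = 1) is an assumed encoding of the experiment, not taken from the notes:

```python
import random

# 3 bins: two contain only white balls (1), one contains only black balls (0).
bins = [1, 1, 0]
trials = 100_000
first_white = both_white = 0

for _ in range(trials):
    a, b = random.sample(range(3), 2)     # two different bins, in random order
    if bins[a] == 1:
        first_white += 1
        if bins[b] == 1:
            both_white += 1

print(first_white / trials)               # ~ 2/3
print(both_white / first_white)           # ~ 1/2  (second white given first white)
print(both_white / trials)                # ~ 1/3
```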
3. Bayes' Rule¶
Bayes' rule allows us to reverse the conditioning in a probability statement. That is, it lets us compute $P(A \mid B)$ when we know $P(B \mid A)$ and the marginal probabilities.
The rule follows from the symmetry of the joint probability:
$$ \begin{align*} P(X_2, X_1) &= P(X_2 \mid X_1) \cdot P(X_1) = P(X_1 \mid X_2) \cdot P(X_2) \\\\ \therefore \quad P(X_{\color{red}2} \mid X_{\color{red}1}) &= \frac{P(X_{\color{red}1} \mid X_{\color{red}2}) \cdot P(X_2)}{P(X_1)} \end{align*} $$
Example
In a population:
- 40% are male (M), 60% are female (F)
- Among males, 50% are smokers (S)
- Among females, 30% are smokers
Question: What is the probability that a randomly selected smoker is male?
Let:
$$ x = \text{M or F}, \qquad y = \text{S (smoker) or N (non-smoker)} $$
Given:
$$ \begin{align*} P(x = \text{M}) &= 0.4 \\ P(x = \text{F}) &= 0.6 \\ P(y = \text{S} \mid x = \text{M}) &= 0.5 \\ P(y = \text{S} \mid x = \text{F}) &= 0.3 \end{align*} $$
We want to compute:
$$ P(x = \text{M} \mid y = \text{S}) = \; ? $$
Apply Bayes' rule
$$ \begin{align*} P(x = \text{M} \mid y = \text{S}) &= \frac{P(y = \text{S} \mid x = \text{M}) \cdot P(x = \text{M})}{P(y = \text{S})} \\\\ &= \frac{0.5 \cdot 0.4}{P(y = \text{S})} = \frac{0.20}{0.38} \approx 0.53 \end{align*} $$
Compute the denominator
$$ \begin{align*} P(y = \text{S}) &= P(y = \text{S} \mid x = \text{M}) \cdot P(x = \text{M}) + P(y = \text{S} \mid x = \text{F}) \cdot P(x = \text{F}) \\\\ &= 0.5 \cdot 0.4 + 0.3 \cdot 0.6 = 0.20 + 0.18 = 0.38 \end{align*} $$
Thus, the probability that a smoker is male is approximately 53%.
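The arithmetic above is easy to reproduce in a few lines of Python; the sketch below is a direct transcription of the given numbers:

```python
# Given quantities from the example.
p_m, p_f = 0.4, 0.6
p_s_given_m, p_s_given_f = 0.5, 0.3

# Total probability of being a smoker (the denominator).
p_s = p_s_given_m * p_m + p_s_given_f * p_f          # 0.38

# Bayes' rule: P(M | S) = P(S | M) P(M) / P(S).
p_m_given_s = p_s_given_m * p_m / p_s
print(p_s, p_m_given_s)                               # 0.38  ~0.526
```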
4. Linear Transformation of Random Variables¶
4.1. Single Random Variable¶
Consider a linear transformation of a scalar random variable $X$:
$$Y = aX$$
Expectation
The expectation of $Y$ is:
$$ E[aX] = aE[X] $$
This result follows from the linearity of expectation.
Variance
The variance of $Y$ is:
$$ \text{Var}(aX) = a^2 \cdot \text{Var}(X) $$
The constant $a$ scales the variance quadratically.
Derivation of Variance Formula
Let $\bar{X} = E[X]$ be the mean of $X$. Then the variance of $X$ is:
$$ \begin{align*} \text{Var}(X) &= E\left[(X - \bar{X})^2\right] \\ &= E[X^2 - 2X\bar{X} + \bar{X}^2] \\ &= E[X^2] - 2\bar{X}E[X] + \bar{X}^2 \\ &= {E}[X^2] - 2 \bar{X}^2 + \bar{X}^2\\ &= {E}[X^2] - \bar{X}^2 \\ &= E[X^2] - E[X]^2 \end{align*} $$
This identity is fundamental in probability theory and appears frequently in statistical analysis.
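Both properties, $E[aX] = aE[X]$ and $\mathrm{Var}(aX) = a^2\,\mathrm{Var}(X)$, can be checked empirically. In the sketch below the choice of distribution for $X$ and the value of $a$ are arbitrary assumptions for the demo:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=1_000_000)   # any distribution works here
a = 3.0
y = a * x

print(y.mean(), a * x.mean())          # E[aX]    vs  a E[X]
print(y.var(),  a**2 * x.var())        # Var(aX)  vs  a^2 Var(X)
```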
4.2. Sum of Two Random Variables¶
Let $Z$ be the sum of two scalar random variables:
$$Z = X + Y \quad \text{(still a scalar r.v.)}$$
Expectation
The expectation of the sum is simply the sum of the expectations:
$$E[X + Y] = E[X] + E[Y]$$
Variance
The variance of the sum is:
$$ \begin{align*} \text{Var}(X + Y) &= E[(X + Y - E[X + Y])^2] \\ &= E[((X - \bar{X}) + (Y - \bar{Y}))^2] \\ &= E[(X - \bar{X})^2] + E[(Y - \bar{Y})^2] + 2E[(X - \bar{X})(Y - \bar{Y})] \\ &= \text{Var}(X) + \text{Var}(Y) + 2\cdot\text{Cov}(X, Y) \end{align*} $$
If $X$ and $Y$ are independent, then $\text{Cov}(X, Y) = 0$, and the variance simplifies to:
$$ \text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y) $$
Covariance
The covariance of two random variables measures their linear dependence:
$$ \begin{align*} \text{Cov}(X, Y) &= E[(X - \bar{X})(Y - \bar{Y})] \\ &= E[XY] - E[X]E[Y] \end{align*} $$
If $X$ and $Y$ are independent, their covariance is zero.
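The decomposition $\mathrm{Var}(X+Y) = \mathrm{Var}(X) + \mathrm{Var}(Y) + 2\,\mathrm{Cov}(X,Y)$ can be verified on correlated samples. The sketch below draws $(X, Y)$ from a bivariate normal with a negative covariance; all parameter values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
cov = np.array([[2.0, -0.8],
                [-0.8, 1.0]])                      # negative Cov(X, Y)
xy = rng.multivariate_normal(mean=[0, 0], cov=cov, size=1_000_000)
x, y = xy[:, 0], xy[:, 1]

lhs = np.var(x + y)
rhs = np.var(x) + np.var(y) + 2 * np.cov(x, y)[0, 1]
print(lhs, rhs)                                    # both close to 2 + 1 - 1.6 = 1.4
```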
Practical Note
In quality control applications, the total variance of a product assembled from multiple components is affected not only by the precision of each component but also by the correlation between them.
Suppose a factory uses two machines, A and B, connected in series. Machine A produces an intermediate part with variance $\text{Var}(X)$, and machine B finishes the product with variance $\text{Var}(Y)$. The final product is assembled from outputs of both machines, so its total variance is:
$$ \text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y) + 2 \cdot \text{Cov}(X, Y) $$
If we assume $X$ and $Y$ are independent, the covariance term disappears, and the only way to reduce the total variance is by improving the precision (i.e., reducing $\text{Var}(X)$ and $\text{Var}(Y)$) of each machine. This often means redesigning hardware or using tighter tolerances, which can be expensive.
However, there is another approach from a system engineering perspective: introduce a negative correlation between $X$ and $Y$. If $\text{Cov}(X, Y) < 0$, it can reduce the total variance, even if the individual variances remain unchanged.
This insight is useful because designing components that compensate for each other's errors is often more cost-effective than increasing the precision of each part in isolation. Rather than treating correlation as a drawback, we can exploit it for system-level benefits.
Remark:
- Variance applies to a single variable
- Covariance applies to pairs of variables
Covariance Matrix for Random Vectors
For a random vector $X = \begin{bmatrix} X_1 \\ X_2 \end{bmatrix}$ with mean vector $\mu = E[X]$:
$$ \text{Cov}(X) = E[(X - \mu)(X - \mu)^T] = \begin{bmatrix} \text{Var}(X_1) & \text{Cov}(X_1, X_2) \\ \text{Cov}(X_2, X_1) & \text{Var}(X_2) \end{bmatrix} $$
This symmetric matrix captures both the spread (variances) and correlation (covariances) of the components.
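In practice this matrix is usually estimated from samples, e.g. with NumPy's `np.cov`. The sketch below uses synthetic data in which $X_2$ depends linearly on $X_1$, so the off-diagonal entries are nonzero (the constants are assumptions for the demo):

```python
import numpy as np

rng = np.random.default_rng(0)
# Two correlated components: X2 depends linearly on X1 plus noise.
x1 = rng.normal(size=100_000)
x2 = 0.5 * x1 + rng.normal(scale=0.3, size=100_000)

# Rows = variables, columns = observations (np.cov's default convention).
C = np.cov(np.vstack([x1, x2]))
print(C)   # approx [[1.0, 0.5], [0.5, 0.34]]
```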
4.3. Affine Transformation of Random Vectors¶
Let a random vector $x \in \mathbb{R}^n$ be transformed via an affine map:
$$ y = Ax + b $$
Then the expectation and covariance of the transformed vector $y$ are given by:
Expectation:
$$ E[y] = A E[x] + b $$
Covariance:
$$ \text{Cov}(y) = A \cdot \text{Cov}(x) \cdot A^T $$
These results follow from the linearity of expectation and properties of covariance under linear transformations.
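Both identities can be verified empirically by applying an arbitrary $A$ and $b$ to samples of a random vector; in the sketch below every numerical value is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(0)

# Random vector x in R^2 with known mean and covariance.
mu_x = np.array([1.0, -2.0])
cov_x = np.array([[1.0, 0.3],
                  [0.3, 2.0]])
x = rng.multivariate_normal(mean=mu_x, cov=cov_x, size=1_000_000)

# Affine map y = A x + b (applied row-wise to the samples).
A = np.array([[2.0, 0.0],
              [1.0, 1.0]])
b = np.array([0.5, -1.0])
y = x @ A.T + b

print(y.mean(axis=0), A @ mu_x + b)          # E[y]    vs  A E[x] + b
print(np.cov(y, rowvar=False))               # Cov(y)  vs  A Cov(x) A^T
print(A @ cov_x @ A.T)
```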
IID Random Variables
Suppose $x_1, \cdots, x_m$ are independent and identically distributed (IID) random variables with:
$$ E[x_i] = \mu, \quad \text{Var}(x_i) = \sigma^2 $$
Define the vector:
$$ x = \begin{bmatrix} x_1 \\ \vdots \\ x_m \end{bmatrix} $$
Then:
$$ E[x] = \begin{bmatrix} \mu \\ \vdots \\ \mu \end{bmatrix}, \quad \text{Cov}(x) = \begin{bmatrix} \sigma^2 & & &\\ & \sigma^2 & &\\ & & \ddots &\\ & & & \sigma^2 \end{bmatrix} = \sigma^2 I_m $$
Sample Mean as Linear Combination
The sample mean is:
$$ S_m = \frac{1}{m} \sum_{i=1}^m x_i = A x, \quad \text{where } A = \frac{1}{m} \begin{bmatrix} 1 & \cdots & 1 \end{bmatrix} $$
Then:
$$ \begin{align*} E[S_m] &= A E[x] = \mu \\\\ \text{Var}(S_m) &= A \cdot \text{Cov}(x) \cdot A^T = \frac{\sigma^2}{m} \end{align*} $$
This result confirms that averaging reduces variance.
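The $\sigma^2/m$ rate is easy to observe empirically: repeat the experiment many times, each time averaging $m$ IID draws, and measure the variance of those averages. The distribution, $\mu$, $\sigma^2$, and $m$ below are arbitrary choices for the demo:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n_repeats = 25, 200_000
sigma2 = 4.0                                  # Var(x_i) = sigma^2

# Each row is one experiment of m IID draws; S_m is the row mean.
samples = rng.normal(loc=1.0, scale=np.sqrt(sigma2), size=(n_repeats, m))
S_m = samples.mean(axis=1)

print(S_m.mean())          # ~ mu = 1.0
print(S_m.var())           # ~ sigma^2 / m = 0.16
```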
Central Limit Approximation
For large $m$, the sample mean $\bar{x}$ approaches a normal distribution:
$$ \bar{x} = \frac{1}{m} \sum_{i=1}^m x_i \;\; \longrightarrow \;\; N\left(\mu,\; \left(\frac{\sigma}{\sqrt{m}}\right)^2 \right) $$
This approximation is a cornerstone of statistical inference, formalized by the Central Limit Theorem.
- The more samples we take, the more concentrated the mean becomes around the true expectation.
- The variance of the mean decreases at a rate of $1/m$.
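To see the normality itself (and not just the shrinking variance), one can average draws from a clearly non-normal distribution and check that the sample means behave like $N(\mu, \sigma^2/m)$. A minimal sketch, assuming an exponential distribution with $\mu = \sigma = 1$:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n_repeats = 100, 100_000

# Exponential draws are heavily skewed, yet their means are nearly normal.
x = rng.exponential(scale=1.0, size=(n_repeats, m))   # mu = 1, sigma = 1
means = x.mean(axis=1)

# Fraction of sample means within one standard error of mu: roughly 0.68 if normal.
se = 1.0 / np.sqrt(m)
print(np.mean(np.abs(means - 1.0) <= se))   # ~ 0.68
```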