Dimension Reduction

Table of Contents

1. Statistics

1.1. Populations and Samples

  • A population includes all the elements from a set of data

  • A parameter is a quantity computed from a population

    • mean, $\mu$
    • variance, $\sigma^2$
  • A sample is a subset of the population.

    • one or more observations
  • A statistic is a quantity computed from a sample

    • sample mean, $\bar{x}$
    • sample variance, $s^2$
    • sample correlation, $S_{xy}$

1.1.1. How to Generate Random Numbers (Samples or data)

  • Data sampled from population/process/generative model
In [ ]:
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline
In [ ]:
## random number generation (1D)
m = 1000

# uniform distribution U(0,1)
x1 = np.random.rand(m, 1)

# uniform distribution U(a,b)
a = 1
b = 5
x2 = a + (b - a)*np.random.rand(m, 1)

# standard normal (Gaussian) distribution N(0,1^2)
# x3 = np.random.normal(0, 1, m)
x3 = np.random.randn(m, 1)

# normal distribution N(5,2^2)
x4 = 5 + 2*np.random.randn(m, 1)

# random integers in [1, 6)
x5 = np.random.randint(1, 6, size = (1, m))

1.1.2. Histogram: graphical representation of data distribution

$\Rightarrow$ gives a rough sense of the density of the data
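As a quick illustration (a sketch that regenerates samples like those in the cell above, rather than reusing the notebook's variables), histograms of uniform and normal samples:

```python
import numpy as np
import matplotlib.pyplot as plt

m = 1000
x1 = np.random.rand(m)           # uniform U(0,1)
x4 = 5 + 2*np.random.randn(m)    # normal N(5, 2^2)

plt.figure(figsize = (6, 4))
plt.hist(x1, 21, alpha = 0.5, label = 'U(0,1)')
plt.hist(x4, 21, alpha = 0.5, label = 'N(5, 2^2)')
plt.legend(fontsize = 12)
plt.show()
```

The uniform histogram is roughly flat on $[0,1]$, while the normal histogram is bell-shaped around 5.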


1.2. Inference

  • True population or process is modeled probabilistically.
  • Sampling supplies us with realizations from probability model.
  • Compute something, but recognize that we could have just as easily gotten a different set of realizations.




  • We want to infer the characteristics of the true probability model from our one sample.


1.3. Law of Large Numbers

  • Sample mean converges to the population mean as sample size gets large

$$ \bar{x} \rightarrow \mu_x \qquad \text{as} \qquad m \rightarrow \infty$$


  • True for any probability density function


  • sample mean and sample variance

$$ \begin{align} \bar{x} &=\frac{x_1+x_2+...+x_m}{m}\\ s^2 &=\frac{\sum_{i=1}^{m}(x_i-\bar{x})^2}{m-1} \end{align} $$


  • suppose $x \sim U[0,1]$
In [ ]:
# statistics
# numerically understand statistics

m = 100
x = np.random.rand(m,1)

#xbar = 1/m*np.sum(x, axis = 0)
#np.mean(x, axis = 0)
xbar = 1/m*np.sum(x)
np.mean(x)

varbar = (1/(m - 1))*np.sum((x - xbar)**2)
np.var(x)

print(xbar)
print(np.mean(x))
print(varbar)
print(np.var(x))
0.5117861127203165
0.5117861127203165
0.08854315742973504
0.08765772585543768
In [ ]:
# various sample size m

m = np.arange(10, 2000, 20)
means = []

for i in m:
    x = np.random.normal(10, 30, i)
    means.append(np.mean(x))

plt.figure(figsize = (6, 4))
plt.plot(m, means, 'bo', markersize = 4)
plt.axhline(10, c = 'k', linestyle='dashed')
plt.xlabel('# of samples (= sample size)', fontsize = 15)
plt.ylabel('sample mean', fontsize = 15)
plt.ylim([0, 20])
plt.show()

1.4. Central Limit Theorem

  • The sample mean (not the individual samples) is approximately normally distributed as the sample size $m \rightarrow \infty$

$$ \bar{x} =\frac{x_1+x_2+...+x_m}{m}$$


  • More samples provide more confidence (or less uncertainty)
  • Note: true regardless of the population's distribution

$$ \bar{x} \rightarrow N\left(\mu_x,\left(\frac{\sigma}{\sqrt{m}}\right)^2 \right) $$




  • Variance gets smaller as $m$ gets larger
    • Sample means look approximately Gaussian distributed
    • Numerically demonstrate that the sample mean follows the Gaussian distribution
In [ ]:
N = 100
m = np.array([10, 40, 160])   # sample of size m

S1 = []   # sample mean (or sample average)
S2 = []
S3 = []

for i in range(N):
    S1.append(np.mean(np.random.rand(m[0], 1)))
    S2.append(np.mean(np.random.rand(m[1], 1)))
    S3.append(np.mean(np.random.rand(m[2], 1)))

plt.figure(figsize = (6, 4))
plt.subplot(1,3,1), plt.hist(S1, 21), plt.xlim([0, 1]), plt.title('m = '+ str(m[0])), plt.yticks([])
plt.subplot(1,3,2), plt.hist(S2, 21), plt.xlim([0, 1]), plt.title('m = '+ str(m[1])), plt.yticks([])
plt.subplot(1,3,3), plt.hist(S3, 21), plt.xlim([0, 1]), plt.title('m = '+ str(m[2])), plt.yticks([])
plt.show()
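The spread of the sample means can also be checked numerically against the CLT prediction $\sigma/\sqrt{m}$ (a sketch with assumed values $N = 10000$ and $m = 160$; for $U(0,1)$ the population standard deviation is $\sqrt{1/12}$):

```python
import numpy as np

np.random.seed(0)
N = 10000   # number of sample means
m = 160     # sample size

# N sample means, each computed from m draws of U(0,1)
xbar = np.random.rand(N, m).mean(axis = 1)

sigma = np.sqrt(1/12)        # population std of U(0,1)
print(np.std(xbar))          # empirical std of the sample mean
print(sigma/np.sqrt(m))      # CLT prediction, about 0.0228
```

The two printed values agree closely, confirming that the sample mean's standard deviation shrinks like $1/\sqrt{m}$.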

1.5. Multivariate Statistics


$$ x^{(i)} = \begin{bmatrix}x_1^{(i)} \\ x_2^{(i)}\\ \vdots \end{bmatrix}, \quad X = \begin{bmatrix} -& (x^{(1)})^T & -\\ - & (x^{(2)})^T & -\\ & \vdots & \\ - & (x^{(m)})^T & -\end{bmatrix}$$


  • $m$ observations $\left(x^{(1)}, x^{(2)}, \cdots , x^{(m)}\right)$

$$ \begin{align*} \text{sample mean} \; \bar x &= \frac{x^{(1)} + x^{(2)} + \cdots + x^{(m)}}{m} = \frac{1}{m} \sum\limits_{i=1}^{m}x^{(i)} \\ \text{sample variance} \; S^2 &= \frac{1}{m-1} \sum\limits_{i=1}^{m}(x^{(i)} - \bar x)^2 \\ (\text{Note: population variance} \; \sigma^2 &= \frac{1}{N}\sum\limits_{i=1}^{N}(x^{(i)} - \mu)^2) \end{align*} $$
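The multivariate sample mean and covariance can be computed directly from the data matrix $X$, whose rows are observations (a sketch on synthetic 2-D Gaussian data; the true mean and covariance below are assumptions for the demo):

```python
import numpy as np

np.random.seed(42)
m = 1000

# each row of X is one observation (x^(i))^T in R^2 (synthetic data)
X = np.random.multivariate_normal([1, 2], [[2, 1], [1, 2]], size = m)

xbar = X.mean(axis = 0)                      # sample mean vector
S = (X - xbar).T @ (X - xbar)/(m - 1)        # sample covariance matrix

print(xbar)
print(S)
print(np.cov(X.T))    # np.cov also divides by m-1 by default
```

The hand-computed covariance matrix matches `np.cov`, and both are close to the assumed population covariance.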

1.5.1. Correlation of Two Random Variables


$$ \begin{align*} \text{Sample variance} : S_x &= \frac{1}{m-1} \sum\limits_{i=1}^{m}\left(x^{(i)}-\bar x\right)^2 \\ \text{Sample covariance} : S_{xy} &= \frac{1}{m-1} \sum\limits_{i=1}^{m}\left(x^{(i)}-\bar x\right)\left(y^{(i)}-\bar y \right)\\ \text{Sample covariance matrix} : S &= \begin{bmatrix} S_x & S_{xy} \\ S_{yx} & S_y \end{bmatrix}\\ \text{Sample correlation coefficient} : r &= \frac{S_{xy}}{ \sqrt {S_{x}\cdot S_{y}} } \end{align*}$$


  • Strength of linear relationship between two variables, $x$ and $y$
  • Assume

$$x_1 \leq x_2 \leq \cdots \leq x_m$$

$$y_1 \leq y_2 \leq \cdots \leq y_m$$
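The sample covariance and correlation formulas above can be verified against `np.corrcoef` (a sketch; the linear relation between $x$ and $y$ is an assumption for the demo):

```python
import numpy as np

np.random.seed(1)
m = 200
x = np.random.randn(m)
y = 0.8*x + 0.6*np.random.randn(m)   # assumed linear relation plus noise

Sx  = np.sum((x - x.mean())**2)/(m - 1)                # sample variance of x
Sy  = np.sum((y - y.mean())**2)/(m - 1)                # sample variance of y
Sxy = np.sum((x - x.mean())*(y - y.mean()))/(m - 1)    # sample covariance

r = Sxy/np.sqrt(Sx*Sy)               # sample correlation coefficient

print(r)
print(np.corrcoef(x, y)[0, 1])       # matches r
```

A value of $r$ near $\pm 1$ indicates a strong linear relationship; near 0, a weak one.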