Basic Statistics Concepts
Standard Deviation and Variance
The variance of a distribution can be estimated with a “biased” or an “unbiased” estimator. The biased estimator systematically underestimates the true variance.
$$
\begin{gather*}
\text{unbiased variance} = \frac{\sum (x - \bar{x})^2}{n-1} \\
\text{biased variance} = \frac{\sum (x - \bar{x})^2}{n}
\end{gather*} $$
The above unbiasing operation is called “Bessel’s correction”.
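As a quick sketch of the two estimators, NumPy’s `var` exposes the divisor through its `ddof` parameter (`ddof=0` divides by $n$, `ddof=1` applies Bessel’s correction); the data values here are arbitrary:

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # mean is exactly 5.0

biased = np.var(x)            # divides by n (ddof=0 is the default) -> 32/8 = 4.0
unbiased = np.var(x, ddof=1)  # divides by n - 1 (Bessel's correction) -> 32/7 ~= 4.571
```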
An isotropic covariance matrix is:
\[\begin{gather*} \begin{aligned} & C = \lambda I \end{aligned} \end{gather*}\]
Stochastic Processes and Stationarity
A stochastic process is a collection of random variables indexed by time. If a random variable is $X$, the corresponding process is $X(t)$. If the mean and variance of $X(t)$ do not change over time, then, loosely speaking, the process is stationary.
Reasoning For Bessel Correction
- Population variance (or the true variance of the entire population) is calculated as:
\(\begin{gather*}
\sigma^2 = \frac{1}{N} \sum_N (x - \mu)^2
\end{gather*}\)
- Where $N$ is the whole population’s size and $\mu$ is the population mean
- Sample variance:
\(\begin{gather*}
s^2 = \frac{1}{n} \sum_n (x - \bar{x})^2
\end{gather*}\)
- Where $n$ is the sample (batch) size and $\bar{x}$ is the sample mean
The sample variance has a slight bias because $\bar{x}$ is a random variable computed from the sample itself. Since $\bar{x}$ minimizes $\sum (x - \bar{x})^2$, the squared deviations from $\bar{x}$ are on average smaller than the squared deviations from the true mean $\mu$, so dividing by $n$ underestimates $\sigma^2$. We divide by $n-1$ instead of $n$ to correct for this.
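The underestimation can be checked empirically. This is a small Monte Carlo sketch (sample size and trial count chosen arbitrarily): for $n = 5$ samples from a unit-variance Gaussian, the divide-by-$n$ estimator averages to about $(n-1)/n = 0.8$, while the Bessel-corrected one averages to about $1.0$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 5, 200_000
# Each row is one sample of size n drawn from N(0, 1), so the true variance is 1
samples = rng.normal(loc=0.0, scale=1.0, size=(trials, n))

mean_biased = samples.var(axis=1, ddof=0).mean()    # averages to ~ (n-1)/n = 0.8
mean_unbiased = samples.var(axis=1, ddof=1).mean()  # averages to ~ 1.0
```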
Distributions
Student’s t-distribution
Student’s t-distribution is similar to a Gaussian distribution, but with heavier tails and a lower peak.
TODO
Covariance And Correlation
Given two random variables $A$, $B$
- Mean
- Standard Deviation
- Standard Deviation indicates how spread out the data is from the mean.
- Covariance
- Covariance indicates **the joint variability of $(A_i, B_i)$ pairs together.** If $A_i$ and $B_i$ are both above (or both below) their respective means, the pair contributes a positive term; if one is above and the other below, it contributes a negative term. Altogether, the terms indicate how related $A$ and $B$ are.
- Correlation
\(\begin{gather*} \text{corr}(A, B) = \frac{\text{cov}(A, B)}{\sigma_A \sigma_B} \end{gather*}\) - Correlation is a standardized measure of “relatedness” between two random variables. It ranges over $[-1, 1]$. If $A = kB$ with $k > 0$ after mean normalization, the correlation is a perfect 1.
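A quick sketch of the perfect-correlation case with NumPy (the data and the linear relationship $b = 2a + 3$ are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)
a = rng.normal(size=1000)
b = 2.0 * a + 3.0   # b is an exact linear function of a

cov_ab = np.cov(a, b)[0, 1]        # sample covariance (uses n - 1)
corr_ab = np.corrcoef(a, b)[0, 1]  # cov(A, B) / (sigma_A * sigma_B)
# corr_ab is 1.0 (up to floating point), since the relationship is perfectly linear
```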
Regression
Regression in statistics means “estimating the relationship model between dependent variables and independent variables.” For example, linear regression models the relationship between $Y$ and its independent variables $x_1, x_2, \dots$:
\[\begin{gather*} \begin{aligned} & Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n + \epsilon \end{aligned} \end{gather*}\]
The Transformer is “autoregressive”. It is regressive because it models the relationship between the output and input sequences; it is “auto” because each output token depends on the previously generated outputs.
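The linear regression model above can be fit by ordinary least squares. A minimal sketch with NumPy, assuming made-up true coefficients $\beta_0 = 1$, $\beta_1 = 2$, $\beta_2 = -3$ and small Gaussian noise:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
eps = rng.normal(scale=0.1, size=n)          # the noise term epsilon
y = 1.0 + 2.0 * x1 - 3.0 * x2 + eps          # true betas: 1, 2, -3

# Design matrix with a column of ones for the intercept beta_0
X = np.column_stack([np.ones(n), x1, x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)  # beta recovers ~ [1, 2, -3]
```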
Random Process
A random process $R(t)$ is basically a collection of random variables that vary along time. The random variables’ mean and standard deviation may or may not change over time. If they don’t change, we call the random process stationary.
Gaussian Random Process
A Gaussian Random Process is
\[\begin{gather*} \begin{aligned} & R(t) \sim \mathcal{GP}(m(t), k(t, t') ) \end{aligned} \end{gather*}\]
Where the mean function of the random process is $m(t)$, and the covariance function $k(t, t')$ could change over time, too.
\[\begin{gather*} \begin{aligned} & m(t) = E[R(t)] \\ & k(t, t') = E[(R(t) - m(t))(R(t') - m(t'))] \end{aligned} \end{gather*}\]One special case is white Gaussian Random noise:
\[\begin{gather*} \begin{aligned} & R(t) \sim \mathcal{GP}(0, \delta(t - t') \sigma^2) \end{aligned} \end{gather*}\]
Here the variance $\sigma^2$ does not change across time. Between different times $t$ and $t'$, there is no correlation, and the values are independent.
$\delta(t - t')$ is the “Dirac delta distribution.” It is zero everywhere except at $t = t'$. Also, $\int_{-\infty}^{\infty} \delta(t-t') f(t) \, dt = f(t')$.
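A discrete-time sketch of these two properties (sample count and $\sigma = 2$ chosen arbitrarily): the empirical autocorrelation of white Gaussian noise is $\sigma^2$ at lag 0 and roughly 0 at any nonzero lag, mirroring $\delta(t - t')\sigma^2$:

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = 2.0
x = rng.normal(scale=sigma, size=100_000)  # discrete white Gaussian noise

# Empirical autocorrelation R(tau) = E[x(t) x(t + tau)]
r0 = np.mean(x * x)            # lag 0: ~ sigma^2 = 4
r1 = np.mean(x[:-1] * x[1:])   # lag 1: ~ 0, since samples are independent
```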
Power Spectral Density
In signal processing, if we view a signal $x(t)$ as a random process, then we can find its power across all frequencies. This is called the “power spectral density” (PSD).
It’s defined as the Fourier Transform of the auto-correlation of the signal function at time difference $\tau$. The autocorrelation is:
\[\begin{gather*} \begin{aligned} & R_{xx}(\tau) = E[x(t)x(t + \tau)] \end{aligned} \end{gather*}\]
So if the signal is periodic with period $T$, $R_{xx}(nT)$ peaks at integer multiples $n$. The PSD $S_{xx}(f)$ is then the Fourier Transform of the autocorrelation across all time differences $\tau$:
\[\begin{gather*} \begin{aligned} & S_{xx}(f) = F( R_{xx}(\tau)) = \int_{-\infty}^{\infty} R_{xx}(\tau) e^{-j2\pi f \tau} d\tau \end{aligned} \end{gather*}\]For white Gaussian noise, the PSD is a constant $\sigma^2$:
\[\begin{gather*} \begin{aligned} & S_{xx}(f) = F( R_{xx}(\tau)) = \int_{-\infty}^{\infty} \sigma^2 \delta(\tau) e^{-j2\pi f \tau} d\tau = \sigma^2 \end{aligned} \end{gather*}\]
Wiener Process
The Wiener Process is a.k.a. Brownian Motion. It is non-stationary:
\[\begin{gather*} \begin{aligned} & W(t + \Delta t) = W(t) + \Delta W \\ & \Delta W \sim \mathcal{N}(0, \Delta t) \end{aligned} \end{gather*}\]
Its mean is 0, but its variance is $t$ (so it increases over time):
\[\begin{gather*} \begin{aligned} & E[W(t)] = 0 \\ & Var[W(t)] = t \end{aligned} \end{gather*}\]
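The growing variance can be checked by simulating many paths of the increment recursion above (step size and path count are arbitrary choices): summing $\Delta W \sim \mathcal{N}(0, \Delta t)$ increments up to $t = 1$ gives $W(1)$ with mean $\approx 0$ and variance $\approx 1$:

```python
import numpy as np

rng = np.random.default_rng(0)
dt, steps, paths = 0.01, 100, 50_000   # simulate each path up to t = 1.0

# Each increment Delta W ~ N(0, dt); the cumulative sum gives W(t)
dW = rng.normal(scale=np.sqrt(dt), size=(paths, steps))
W = np.cumsum(dW, axis=1)

mean_at_1 = W[:, -1].mean()  # ~ 0
var_at_1 = W[:, -1].var()    # ~ t = 1.0
```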