Math - Distance Metrics

KL Divergence, Chi-Squared Similarity

Posted by Rico's Nerd Cluster on June 5, 2017

Statistical Distance & Divergence Metrics

Kullback-Leibler (KL) Divergence

Given two distributions $p(x)$ and $q(x)$, the KL divergence measures how different $p(x)$ is from $q(x)$; equivalently, it quantifies how much information is lost when $q(x)$ is used to approximate $p(x)$:

\[\begin{gather*} D_{KL}(p(x) || q(x)) = \sum_X p(x) \log \frac{p(x)}{q(x)} \end{gather*}\]
  • KL divergence is not a distance metric, because it is not symmetric: $D_{KL}(p(x) || q(x))$ is generally not equal to $D_{KL}(q(x) || p(x))$ (see the sketch after this list)
  • $D_{KL}(p(x) || q(x)) \ge 0$, with equality when $p(x) = q(x)$
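
To see the asymmetry concretely, here is a minimal NumPy sketch; the two discrete distributions `p` and `q` below are made-up examples:

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(p || q) = sum over x of p(x) * log(p(x) / q(x)) for discrete distributions."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sum(p * np.log(p / q))

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.5, 0.3, 0.2])

print(kl_divergence(p, q))  # D_KL(p || q), roughly 0.085
print(kl_divergence(q, p))  # D_KL(q || p), roughly 0.092 -- a different value
```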

If, from empirical counting, we find that $q(x_i)=0$ for a certain value $x_i$ with $p(x_i) > 0$, then technically,

\(D_{KL}(p(x) || q(x)) = \sum_X p(x) \log \frac{p(x)}{q(x)} = \infty\), since the term $p(x_i) \log \frac{p(x_i)}{0}$ is infinite.

However, this could cause a lot of issues in practice. Instead, we can set $q(x) = \epsilon = 10^{-3}$ in this case to avoid numerical errors, as sketched below.
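
A minimal sketch of this workaround, assuming empirical histograms where a bin of $q$ may be empty; the $\epsilon = 10^{-3}$ floor follows the text:

```python
import numpy as np

def kl_divergence_safe(p, q, eps=1e-3):
    """D_KL(p || q) with q floored at eps so empty bins do not send the sum to infinity."""
    p = np.asarray(p, dtype=float)
    q = np.maximum(np.asarray(q, dtype=float), eps)
    mask = p > 0  # terms with p(x) = 0 contribute nothing to the sum
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

p = np.array([0.6, 0.3, 0.1])
q = np.array([0.7, 0.3, 0.0])  # the empty bin in q would otherwise make the divergence infinite
print(kl_divergence_safe(p, q))
```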

Special Case: nn.CrossEntropyLoss()

When the target distribution $p(x)$ is a one-hot vector, the above formulation reduces to cross-entropy:

\[\begin{gather*} D_{KL}(p(x) || q(x)) = - \sum_i y_i \log(\hat{y}_i) \end{gather*}\]
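
A quick check of this equivalence in PyTorch; the logits and class index below are made-up, and the built-in loss (`torch.nn.functional.cross_entropy`) applies the softmax internally:

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, 0.5, -1.0]])  # made-up unnormalized scores for 3 classes
target = torch.tensor([0])                 # true class index, i.e. p(x) is one-hot

# Built-in cross-entropy: softmax over the logits, then negative log-likelihood.
ce = F.cross_entropy(logits, target)

# KL(p || q) with one-hot p and q = softmax(logits): only the true-class term survives.
q = F.softmax(logits, dim=1)
kl = -torch.log(q[0, target[0]])

print(ce.item(), kl.item())  # the two values agree
```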

Chi-Squared Similarity

Chi-Squared Similarity is often used to compare probability distributions over categorical data, such as histograms, counts, and text data represented by term frequencies.

\[\begin{gather*} \chi^2(P,Q) = \sum_i \frac{(P_i-Q_i)^2}{P_i+Q_i} \end{gather*}\]

Where $P_i$, $Q_i$ are the bins of distributions $P$ and $Q$. The denominator $P_i + Q_i$ has a normalizing effect, which accounts for the different scales of the two distributions.
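
A minimal sketch of this comparison, assuming two normalized histograms and adding a small constant to guard against bins that are empty in both distributions:

```python
import numpy as np

def chi_squared_distance(P, Q, eps=1e-12):
    """Chi-squared distance: sum over i of (P_i - Q_i)^2 / (P_i + Q_i)."""
    P, Q = np.asarray(P, dtype=float), np.asarray(Q, dtype=float)
    return np.sum((P - Q) ** 2 / (P + Q + eps))

# Two made-up normalized histograms over 4 bins.
P = np.array([0.1, 0.4, 0.3, 0.2])
Q = np.array([0.2, 0.3, 0.3, 0.2])
print(chi_squared_distance(P, Q))  # 0 for identical histograms; grows as they differ
```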

Bhattacharyya Distance

TODO