Introduction
van der Maaten, Geoff Hinton and Yoshua Bengio came up with “t-SNE” (pronounced as “tee-snee”) to reduce high dimensional data down to low dimensional data, so clusters between data is more easily seen. For example, if we have a 300-long vocabulary, we can plot them in 2D:
Algorithm Intuition
Below illustrations come from StatQuest’s Youtube Video
If we have some 2D data and want to reduce it to 1D, we can’t just project them on x/y axis. We have to move these points in 1D such that their pairwise relative distances are retained
- Calculate pairwise similarities between points
- Initialize positions of these points randomly on an axis.
- Move one point at a time on the 1D axis. For the specific point A, calculate its “net force” from all other points where:
- if a point B is too “close” to A, A will be repelled to the other direction, by a small step. Otherwise, A will be attracted to B
- Repeat this process until no point moves.
Step 1: Calculate pairwise similarities between points
To calculate the similarity between point A and C,
-
Calculate the “unscaled similarity”
- calculate their L2 Distance,
d_{AC}
- Then, plug
d
into a gaussian distribution with a known standard deviation $\sigma_A$? and get an “unscaled similarity” \(\begin{gather*} exp(-d_{AC}^2/2\sigma_A^2) \end{gather*}\)
- calculate their L2 Distance,
- Calculate the unscaled similarity between point A and all other points
- Scale the similarity by:
- So if point A and C are low distance away from each other, their similarity value is low
- The scaled similarity is also the conditional probability “give point A, how similar is C”
- We never have negative distance values. So technically, we are using half of the Gaussian distribution ?
Why put them in a Gaussian Distribution in the first place?
Why is scaling necessary?
Because we want to have the same/similar similarity score if the same cluster becomes less dense at the same scale, as shown below.
Would the similarities A -> C and C -> A the same?
No. This is because the similarity score is finally scaled by A or C’s total distance to other points. If A or C has a lot of points really close to them, the denominator $\sum_{K \neq A} exp(-d_{AK}^2/2\sigma_{i}^2)$ would be large so the similarity values would be scaled down by quite a bit.
Eventually, we would end up with a similarity matrix
In this similarity matrix, the diagonal terms become 0 because the similarity between the point and its self is NOT considered in the scaling at all, (especially in the denominator)
How was standard deviations determined?
Step 3 Calculate Net Force
We use a Student-t distribution for the distance -> similarity
calculation in the 1D space. Similarly, we can come up with a similarity matrix
So by moving one point along the 1D axis, we want the similarity matrix to ultimately look like the one from Step 1.