What is Positional Encoding?
In natural language processing, it’s common to have a pipeline like:
sentence ("I love ice cream") -> token ("I", "love", "ice", "cream") -> embedding(100, 104, 203, 301) -> + positional encoding = (101, 105, 201, 299)
In self-attention, we compute attention weights over all embeddings in the queries, keys and values, but the operation itself is order-agnostic. Word order matters: “I ride bike” is not the same as “bike ride I”.
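To see why this is a problem, here is a minimal sketch of mine (not from the original post) showing that plain scaled dot-product self-attention only permutes its outputs when the inputs are permuted; without positional information the model cannot tell the two orderings apart:

import torch

def self_attention(X):
    # Plain scaled dot-product self-attention with Q = K = V = X
    scores = X @ X.T / X.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ X

torch.manual_seed(0)
X = torch.rand(3, 4)            # 3 tokens, embedding dim 4
perm = torch.tensor([2, 0, 1])  # "bike ride I" instead of "I ride bike"

out = self_attention(X)
out_perm = self_attention(X[perm])

# The outputs are just reordered the same way: attention alone carries no notion of order
print(torch.allclose(out[perm], out_perm))  # True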
Given an input sequence $X_0, X_1, \dots, X_n$, we want to find a time encoding such that:
- the time encoding represents the order of time
- the time encoding values are small relative to the embedding values; otherwise, the encoding could distort the semantic embeddings. sine and cosine are great since they stay within [-1, 1]
- each input position has a unique encoding
- the time encoding dimension is the same as the input embedding dimension
Additionally,
- when the embeddings are reduced to 2 dimensions, semantically closer words are closer on the chart
- in the transformer, the positional encoding is added to the word embedding
We arrange the input sequence into an $n \times d$ matrix. For time step $i$, embedding dimension $d$, and columns $2j$ and $2j+1$, the encodings are:

$$PE_{(i,\,2j)} = \sin\left(\frac{i}{10000^{2j/d}}\right), \qquad PE_{(i,\,2j+1)} = \cos\left(\frac{i}{10000^{2j/d}}\right)$$
Now let’s enjoy some code:
import torch

class PositionalEncoding(torch.nn.Module):
    def __init__(self, max_input_timesteps, hidden_size) -> None:
        super().__init__()
        # The leading 1 is the batch dimension
        self.time_encodings = torch.zeros((1, max_input_timesteps, hidden_size))
        # i / 10000^(2j / hidden_size)
        coeffs = torch.arange(max_input_timesteps, dtype=torch.float32).reshape(-1, 1)  # (max_input_timesteps, 1)
        coeffs = coeffs / torch.pow(
            10000, torch.arange(0, hidden_size, 2, dtype=torch.float32) / hidden_size)  # (max_input_timesteps, hidden_size // 2)
        self.time_encodings[:, :, 0::2] = torch.sin(coeffs)
        self.time_encodings[:, :, 1::2] = torch.cos(coeffs)

    def forward(self, X):
        # Slice to X.shape[1] because X might be shorter than max_input_timesteps
        X = X + self.time_encodings[:, :X.shape[1], :].to(X.device)
        return X

pe = PositionalEncoding(max_input_timesteps=10, hidden_size=4)
X = torch.rand((1, 10, 4))  # (batch, time, hidden)
pe(X)
So, we can see that for a given column, encodings at different timesteps change periodically. Different columns can take the same values, but they vary at different frequencies. For the same timestep $i$, the frequency of the sin and cos components decreases as the column index grows.
from d2l import torch as d2l

encoding_dim, num_steps = 32, 60
pos_encoding = PositionalEncoding(max_input_timesteps=num_steps, hidden_size=encoding_dim)
X = pos_encoding(torch.zeros((1, num_steps, encoding_dim)))
P = pos_encoding.time_encodings[:, :X.shape[1], :]
d2l.plot(torch.arange(num_steps), P[0, :, 6:10].T, xlabel='Row (position)',
         figsize=(6, 2.5), legend=["Col %d" % d for d in torch.arange(6, 10)])
In the chart below, 50 positional encodings of dimension 128 are shown. Each row is the position index of the encoding, and each column is one of the 128 dimensions. For example, for the 50th input embedding, the 0th dimension corresponds to the value $\sin(50/10000^{2 \cdot 0/128})$, and the 127th dimension corresponds to $\cos(50/10000^{126/128})$. As we can see, the frequency at which each encoding “bit” changes decreases as the dimension index goes up.
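A chart like this can be reproduced with a small sketch along these lines (my own minimal version, reusing the PositionalEncoding class from above and assuming matplotlib is available):

import matplotlib.pyplot as plt

pe = PositionalEncoding(max_input_timesteps=50, hidden_size=128)
# (50, 128): rows are positions, columns are encoding dimensions
P = pe.time_encodings[0].numpy()

plt.figure(figsize=(8, 4))
plt.imshow(P, aspect='auto', cmap='RdBu')
plt.xlabel('Encoding dimension')
plt.ylabel('Position')
plt.colorbar()
plt.show()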
Effect of Positional Embedding
Using the GloVe 6B 100d pretrained embeddings, we can visualize some example word embeddings in the 2D plane with PCA. As one can see, similar words are close to each other: man-woman, king-queen, etc.
After adding the positional encoding based on a sample sentence “a queen is a woman, a king is a man”, now they look like this:
So one can see that woman and queen are pushed much closer; this relationship comes directly from their positions in the sequence.
Masking
There are two types of masking used when building a transformer: the padding mask and the look-ahead mask.
Padding Mask
Sometimes the input sequences have different lengths, or exceed the maximum sentence length of our network. For example, we might have the input
[["Do", "you", "know", "when", "Jane", "is", "going", "to", "visit", "Africa"],
["Jane", "visits", "Africa", "in", "September" ],
["Exciting", "!"]
]
Which might get vectorized as:
[[ 71, 121, 4, 56, 99, 2344, 345, 1284, 15],
[ 56, 1285, 15, 181, 545],
[ 87, 600]
]
In that case, we want to:
- Truncate long sequences to a uniform length
- Pad short sequences to the same length, and fill the padded positions with a large negative number (-1e9) rather than 0 before the softmax. Why -1e9? Because later, in scaled dot-product attention, $\mathrm{softmax}(\frac{QK^T}{\sqrt{d_k}})V$ will then assign the padded positions probabilities close to zero
[[ 71, 121, 4, 56, 99],
[2344, 345, 1284, 15, -1e9],
[ 56, 1285, 15, 181, 545],
[ 87, 600, -1e9, -1e9, -1e9]
]
To illustrate:
def create_padding_mask(padded_token_ids):
    # We assume the input has been truncated, then padded with 0 (for short sentences)
    # padded_token_ids: [batch_size, time]
    mask = (padded_token_ids != 0).float()
    return mask

# Sample input sequences with padding (batch_size, seq_len)
input_seq = torch.tensor([
    [5, 7, 9, 0, 0],  # Sequence 1 (padded)
    [3, 2, 4, 1, 0],  # Sequence 2 (padded)
    [6, 1, 8, 4, 2]   # Sequence 3 (no padding)
])

padding_mask = create_padding_mask(input_seq)
# The zeros in input_seq also get probability ~0 after the softmax
print(torch.nn.functional.softmax(input_seq + (1 - padding_mask) * -1e9, dim=-1))
The multi-head attention layer in Keras is implemented this way.
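In practice the mask is applied to the attention scores rather than to the token ids as above; a minimal sketch of mine (not from the original post) could look like this:

def scaled_dot_product_attention(Q, K, V, padding_mask=None):
    # Q, K, V: (batch, time, d_k); padding_mask: (batch, time) with 1 = keep, 0 = pad
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5   # (batch, time, time)
    if padding_mask is not None:
        # Push scores at padded key positions to -1e9 so softmax gives ~0
        scores = scores + (1 - padding_mask)[:, None, :] * -1e9
    weights = torch.softmax(scores, dim=-1)
    return weights @ V

batch, time, d_k = 3, 5, 8
X = torch.rand(batch, time, d_k)
out = scaled_dot_product_attention(X, X, X, padding_mask=padding_mask)
print(out.shape)  # torch.Size([3, 5, 8])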
Look-ahead Mask
Given a full sequence, we want to prevent the model from “cheating” by looking at future tokens during training. In autoregressive models, like language models, when predicting a word, the model should only consider the current and previous tokens, not future ones.
def create_look_ahead_mask(sequence_length):
    """
    Return an upper-triangular mask, e.g. for sequence_length=3:
    tensor([[False,  True,  True],
            [False, False,  True],
            [False, False, False]])
    True marks the future positions that should be masked out.
    """
    # diagonal=0 keeps the diagonal in tril, so each token can still attend to itself
    return (1 - torch.tril(torch.ones(sequence_length, sequence_length), diagonal=0)).bool()
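As a usage sketch (my own illustration), the mask is typically applied to the attention scores with masked_fill before the softmax:

seq_len = 3
scores = torch.rand(seq_len, seq_len)           # raw attention scores
mask = create_look_ahead_mask(seq_len)

masked_scores = scores.masked_fill(mask, -1e9)  # block future positions
weights = torch.softmax(masked_scores, dim=-1)
print(weights)  # upper triangle is ~0: each token attends only to itself and the past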
[Advanced Topic] Tokenization
Tokenization assigns an index to each token, which can then be used for further processing. In its simplest form, a token can be a word.
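As a toy illustration (my own sketch, not a production tokenizer), a word-level tokenizer can be as simple as a dictionary lookup:

def build_vocab(sentences):
    # Reserve 0 for padding, then assign an index to each unique word
    vocab = {"<PAD>": 0}
    for sentence in sentences:
        for word in sentence.split():
            vocab.setdefault(word, len(vocab))
    return vocab

vocab = build_vocab(["I love ice cream", "I love bikes"])
print(vocab)                                            # {'<PAD>': 0, 'I': 1, 'love': 2, ...}
print([vocab[w] for w in "I love ice cream".split()])   # [1, 2, 3, 4]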
Hugging Face has a series of tokenizers.
- `<CLS>` (classification token, often the first token): BERT uses `[CLS]`. It’s similar to `<SOS>`, which is used in machine translation models like Seq2seq
- `<SEP>` (separator token): BERT uses `[SEP]` between sentences. It’s similar to `<EOS>`
- `<PAD>` (padding token for alignment): BERT uses `[PAD]`
Subword-Tokenization
What people do nowadays is “subword-tokenization”; an example is to decompose the word `unsurprisingly` into [`un`, `surprising`, `ly`]. This can be illustrated with the HuggingFace 🤗 Transformers library:
%pip install transformers
from transformers import BertTokenizerFast, BertModel
tokenizer = BertTokenizerFast.from_pretrained("google-bert/bert-base-uncased")
model = BertModel.from_pretrained("google-bert/bert-base-uncased")
text = "unsurprisingly"
encoded_input = tokenizer(text, return_tensors="pt")
# See [101, 4895, 26210, 18098, 9355, 102]. 101 and 102 are [CLS] and [SEP]
print(encoded_input)
tokens = tokenizer.convert_ids_to_tokens(encoded_input['input_ids'].squeeze())
# See ['[CLS]', 'un', '##sur', '##pr', '##ising', '##ly', '[SEP]']
print(tokens)
One technique to create subword-tokenization is Byte Pair Encoding (BPE). That is:
- Break words in a dictionary into single characters
  - e.g., “unpredictable” → ["u", "n", "p", "r", "e", "d", "i", "c", "t", "a", "b", "l", "e"]
- Count the frequency of adjacent character combinations, like `un`, `ble`
- Find the most frequent combos and merge them into new tokens, repeating the process until the vocabulary reaches the desired size (see the sketch below)
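Here is a minimal sketch of the pair-counting step in one BPE merge (my own toy version over a made-up word-frequency dictionary, not a full implementation):

from collections import Counter

def most_frequent_pair(word_freqs):
    # Count adjacent symbol pairs across all words, weighted by word frequency
    pair_counts = Counter()
    for word, freq in word_freqs.items():
        symbols = list(word)
        for a, b in zip(symbols, symbols[1:]):
            pair_counts[(a, b)] += freq
    return pair_counts.most_common(1)[0]

# Toy corpus: each word starts out split into single characters
word_freqs = {"unpredictable": 2, "unable": 3, "table": 1}
# -> a pair such as ('a', 'b') with count 6 (several pairs tie in this toy corpus)
print(most_frequent_pair(word_freqs))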