Sequence Models
Some common applications of sequence models include DNA sequence analysis, audio clips (e.g., speech recognition), sentiment classification, etc. Another example is name indexing, where names appearing in news articles over a past period of time are searched so they can be indexed and retrieved appropriately.
The first step in NLP is to build a dictionary (vocabulary), $X$. Say we have the 10,000 most common English words; we can then use one-hot encoding to represent each word, e.g., the $t$-th word of a sequence is the one-hot vector $x^{\langle t \rangle}$. If we see a word that's not in the dictionary, it is mapped to the <UNK> ("unknown") token.
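As a concrete sketch of the one-hot/<UNK> lookup described above (the toy vocabulary, the word "espresso", and the helper names here are illustrative, not from the original notes):

```python
import numpy as np

# Illustrative toy vocabulary; a real one would hold the 10,000 most common words plus special tokens.
vocab = ["a", "and", "coffee", "love", "the", "<UNK>"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word, word_to_index, vocab_size):
    """Return a one-hot column vector; unknown words map to the <UNK> slot."""
    index = word_to_index.get(word, word_to_index["<UNK>"])
    vector = np.zeros((vocab_size, 1))
    vector[index] = 1.0
    return vector

x = one_hot("espresso", word_to_index, len(vocab))  # not in vocab -> the <UNK> entry is 1
```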
A fully connected network or CNN doesn't work well for sequence data: 1. the sequence could be of arbitrary length; 2. there is no weight sharing across positions in these architectures.
RNN Architecture & Forward Propagation
The word “recurrent” means “appearing repeatedly”. In an RNN, we have hidden states $a^{(t)}$, sequential inputs $x^{(t)}$, and outputs $\hat{y}^{(t)}$. Each superscript $(t)$ represents the time step, e.g., $a^{(1)}$ means $a$ at time $1$. A feedforward network is one that does not have a loop, i.e., the output of a layer is never fed back into it. An RNN is not a feedforward network because its hidden state is fed back into the network at the next time step.
- Dimensions
- The batch is usually dimension 1 (0-indexed), e.g., inputs stored with shape $(n_x, m, T_x)$
- $a^{(0)}$ is usually initialized to zeros or randomly generated values. $a^{(0)}$ does not need to have the same length as $x$.
- $a^{(t-1)}$ is stacked on top of $x^{(t)}$ (see the simplified notation below)
- So $a$ could have shape $[5, 10]$ and $x$ shape $[3, 10]$, with a batch size of 10
- $g_0$ (the hidden-state activation) could be tanh (more common) or ReLU; $g_1$ (the output activation) could be sigmoid or softmax.
The forward propagation equations are:
\[\begin{gather*} a^{(t)} = g_0(W_{aa} a^{(t-1)} + W_{ax} x^{(t)} + b_a) \\ \hat{y}^{(t)} = g_1(W_{ya} a^{(t)} + b_y) \end{gather*}\]
- $W_{ax}$ is a matrix that “generates an $a$-like vector and takes in an $x$-like vector”. The same naming convention applies to $W_{aa}$ and $W_{ya}$.
We can simplify this notation into:
\[\begin{gather*} W_{a} = [W_{aa}, W_{ax}] \\ a^{(t)} = g_0\left(W_{a}\begin{bmatrix} a^{(t-1)} \\ x^{(t)} \end{bmatrix} + b_a\right) \end{gather*}\]
An RNN works best with local context (recent words).
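To make the forward step concrete, here is a minimal NumPy sketch of one RNN cell, assuming $g_0 = \tanh$ and $g_1 = \text{softmax}$ (the function name `rnn_cell_forward` and the random shapes below are illustrative):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z, axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)

def rnn_cell_forward(xt, a_prev, Waa, Wax, Wya, ba, by):
    """One RNN time step: xt has shape (n_x, m), a_prev has shape (n_a, m)."""
    a_next = np.tanh(Waa @ a_prev + Wax @ xt + ba)   # hidden state, g_0 = tanh
    yt_hat = softmax(Wya @ a_next + by)              # output distribution, g_1 = softmax
    return a_next, yt_hat

# Example shapes: n_a = 5, n_x = 3, batch m = 10 (matching the [5, 10] and [3, 10] example above)
n_a, n_x, n_y, m = 5, 3, 4, 10
rng = np.random.default_rng(0)
a_prev = np.zeros((n_a, m))
xt = rng.standard_normal((n_x, m))
Waa, Wax = rng.standard_normal((n_a, n_a)), rng.standard_normal((n_a, n_x))
Wya, ba, by = rng.standard_normal((n_y, n_a)), np.zeros((n_a, 1)), np.zeros((n_y, 1))
a_next, yt_hat = rnn_cell_forward(xt, a_prev, Waa, Wax, Wya, ba, by)
```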
Notations
- Superscript $(i)$ denotes an object associated with the $i^{th}$ example.
- Superscript $[l]$ denotes an object associated with the $l^{th}$ layer.
- Superscript $\langle t \rangle$ denotes an object at the $t^{th}$ time step.
- Subscript $i$ denotes the $i^{th}$ entry of a vector.
Architectures
- The architecture above is many-to-many: multiple inputs and multiple outputs (of the same length).
- There's another type of many-to-many where the inputs and outputs have different lengths, e.g., machine translation.
- There's also many-to-one: multiple inputs and one output, e.g., sentiment analysis.
- And one-to-many, e.g., music generation, which takes one prompt as input and generates multiple notes.
Disadvantages:
- Predictions early in the sequence cannot use information from later in the sequence. For example, in “Teddy bear is on sale”, “Teddy” is not a person's name, while in many other sentences it is; a unidirectional RNN cannot tell the difference when it reaches “Teddy”.
Language Modelling
A language model takes in a sequence of words and outputs a new sequence, as in speech recognition and machine translation; the output chosen is the most probable one. E.g.,
English input: I love Turkish coffee
Turkish output: Ben Türkçe kahvesi seviyorum
Input: tokenize the sentence into one-hot vectors. Add an end-of-sentence token <EOS> to mark the end explicitly. If a word is not in the vocabulary, represent it with the unknown token <UNK>. Before outputting probabilities, we need softmax to normalize.
This is a chain-rule (autoregressive) factorization. Each prediction $\hat{y}^{(t)}$ models $p(y^{(t)} \mid y^{(0)}, y^{(1)}, \dots, y^{(t-1)})$, and the whole sentence's probability is $p(y^{(0)}, y^{(1)}, \dots, y^{(t)}) = \prod_t p(y^{(t)} \mid y^{(0)}, \dots, y^{(t-1)})$.
Let’s walk through an example with arbitrarily-assigned probabilities.
- I (one-hot vector $[0, \dots, 1, \dots, 0]$) -> $P(y_0 = \text{Ben}) = 0.4$ (so take 'Ben' as output)
- love -> $P(\text{seviyor}) = 0.00003$, $P(\text{seviyorum}) = 0.00001$
- $P(\text{seviyor} \mid y_0 = \text{Ben}) = 1e^{-10}$, $P(\text{seviyorum} \mid y_0 = \text{Ben}) = 0.4$
- So take the most probable sequence 'Ben seviyorum' as output.
- Turkish -> $P(\text{Türkçe}) = 0.5$
- $P(y_2 = \text{Türkçe} \mid y_0 = \text{Ben}, y_1 = \text{seviyorum}) = 8e^{-10}$, $P(y_1 = \text{Türkçe} \mid y_0 = \text{Ben}, y_2 = \text{seviyorum}) = 6e^{-7}$
- So take 'Ben Türkçe seviyorum' as output.
- Coffee -> $P(\text{Kahve}) = 0.7$, $P(\text{Kahvesi}) = 0.2$
- $P(y_3 = \text{Kahve} \mid y_0 = \text{Ben}, y_1 = \text{Türkçe}, y_2 = \text{seviyorum}) = 7e^{-15}$
- …
- $P(y_2 = \text{Kahvesi} \mid y_0 = \text{Ben}, y_1 = \text{Türkçe}, y_3 = \text{seviyorum}) = 9e^{-9}$
- The most probable sequence is 'Ben Türkçe kahvesi seviyorum', which is taken as the output.
- The sentence 'Ben Türkçe kahvesi seviyorum' has a probability of $9e^{-9}$, the highest among all candidate sentences (a small numerical sketch of this chain-rule scoring follows the list).
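To make the chain-rule scoring concrete, here is a small sketch that multiplies per-step conditional probabilities together; the probability values and variable names below are made up for illustration and are not the ones from the walkthrough:

```python
import numpy as np

# Per-step conditional probabilities p(y_t | y_0, ..., y_{t-1}), one per output word
# (illustrative values, not actual model outputs).
step_probs = [0.4, 0.5, 0.2, 0.4]   # e.g., Ben, Türkçe, kahvesi, seviyorum

# The sentence probability is the product of the conditionals;
# in practice we sum log-probabilities to avoid numerical underflow.
sentence_log_prob = np.sum(np.log(step_probs))
sentence_prob = np.exp(sentence_log_prob)
print(sentence_prob)  # 0.016 for these made-up numbers
```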
Note that the above example is NOT how modern machine translation with Neural Machine Translation (NMT) works; it is a simplification of word-by-word translation.
- Modern NMT systems, especially transformers, do not rely on Markov assumptions and consider not just the previous words but the entire input sequence.
- Probabilities like $P(y_2 = \text{Türkçe} \mid y_0 = \text{Ben}, y_1 = \text{seviyorum})$ are computed by complex neural networks based on learned representations of both the source and target languages.
RNNs can also have exploding/vanishing gradients. For exploding gradients, use gradient clipping; for vanishing gradients, one can try 1. careful weight initialization, 2. ReLU instead of sigmoid, 3. other RNN variants: LSTM, GRU.
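A minimal sketch of element-wise gradient clipping with NumPy; the dictionary layout of `gradients` and the clipping threshold are assumptions for illustration:

```python
import numpy as np

def clip_gradients(gradients, max_value):
    """Clip every gradient array in-place to [-max_value, max_value] to limit exploding gradients."""
    for grad in gradients.values():
        np.clip(grad, -max_value, max_value, out=grad)
    return gradients

# Usage (assumed layout): gradients = {"dWaa": ..., "dWax": ..., "dWya": ..., "dba": ..., "dby": ...}
# clipped = clip_gradients(gradients, max_value=5.0)
```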
Training
The training set is a large corpus of English -> Turkish text. The loss function is:
\[\begin{gather*} L(y^{(t)}, \hat{y}^{(t)}) = -\sum_i y_i^{(t)} \log \hat{y}_i^{(t)} \\ L = \sum_t L(y^{(t)}, \hat{y}^{(t)}) \end{gather*}\]Where $i$ indexes the dimensions of the one-hot vectors and $t$ indexes time. (The total loss is a plain sum over $t$; the minus sign is already inside each per-step term.)
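A hedged NumPy sketch of this loss, assuming `y_true` and `y_hat` are arrays of shape `(n_y, m, T_x)` holding one-hot targets and softmax outputs (the function and variable names are illustrative):

```python
import numpy as np

def sequence_cross_entropy(y_hat, y_true, eps=1e-12):
    """Cross-entropy summed over vocabulary dimension i and time t, averaged over the batch."""
    # Per-example, per-step loss: -sum_i y_i * log(y_hat_i), shape (m, T_x)
    step_losses = -np.sum(y_true * np.log(y_hat + eps), axis=0)
    # Sum over time steps, then average over the batch
    return np.mean(np.sum(step_losses, axis=-1))
```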
Sampling Novel Sequences
After training, the model should have learned the conditional probability distribution $P(y_t \mid y_1, \dots, y_{t-1})$. We can informally get a sense of what the model learned by sampling novel sequences. For example, suppose our vocabulary is ['I', 'you', 'love', 'coffee', 'apple'].
- At t=0:
    - Choose start tokens: set the previous hidden state and input to zero, $a_{\text{prev}} = 0$, $x^{(0)} = 0$ (the Start-Of-Sequence <SOS> token).
    - Generate the hidden state $a^{(0)} = g_0(W_{aa} a_{\text{prev}} + W_{ax} x^{(0)} + b_a)$ (the $W_{aa}$ term vanishes here because $a_{\text{prev}} = 0$).
    - Generate the distribution $\hat{y}^{(0)} = softmax(W_{ya} a^{(0)} + b_y) = [0.6, 0.2, 0.05, 0.05, 0.1]$.
    - Use a sampling strategy on the distribution $\hat{y}^{(0)}$:
        - Greedy sampling: choose the token with the highest probability. You might miss less probable but still plausible sequences.
        - Stochastic sampling: treat $\hat{y}^{(0)}$ as a categorical distribution, then randomly draw a sample from it; one can use np.random.choice().
    - In this step, we draw 'I' and add it to the sequence.
- At t=1:
    - Generate the hidden state $a^{(1)} = g_0(W_{aa} a^{(0)} + W_{ax} x^{(1)} + b_a)$, where $x^{(1)}$ is the one-hot vector of the word sampled at the previous step ('I').
    - Generate the distribution $\hat{y}^{(1)} = softmax(W_{ya} a^{(1)} + b_y) = [0.2, 0.2, 0.4, 0.1, 0.1]$.
    - After stochastic sampling, we draw 'love' and add it to the sequence (a full sampling loop is sketched after this list).
…
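Putting the sampling steps above together, here is a hedged sketch of a full sampling loop. The tanh/softmax activations, the fixed-length stopping rule, and the function name `sample_sequence` are assumptions; parameter names follow the notation above:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def sample_sequence(Waa, Wax, Wya, ba, by, vocab, max_len=20, seed=0):
    """Stochastically sample a novel sequence from a trained RNN (output size assumed = len(vocab))."""
    rng = np.random.default_rng(seed)
    n_a, n_x = Waa.shape[0], Wax.shape[1]
    a = np.zeros((n_a, 1))          # initial hidden state a = 0
    x = np.zeros((n_x, 1))          # x^{(0)} = 0 (start token)
    sequence = []
    for _ in range(max_len):
        a = np.tanh(Waa @ a + Wax @ x + ba)          # hidden state update
        y_hat = softmax(Wya @ a + by).ravel()        # distribution over the vocabulary
        idx = rng.choice(len(vocab), p=y_hat)        # stochastic sampling
        sequence.append(vocab[idx])
        x = np.zeros((n_x, 1))                       # feed the sampled word back as the next input
        x[idx] = 1.0
    return sequence
```

In practice one would also stop early when an <EOS> token is sampled; a fixed maximum length is used here for simplicity.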
Character-Level RNN
The above is a “word-level” RNN. If the inputs are characters [a-z, A-Z, 0-9, ...], you don't have to worry about <UNK> tokens. One disadvantage is that the output sequence becomes much longer, so long-range dependencies are more easily missed, and this is more computationally expensive.
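For comparison, a character-level vocabulary is small enough to write out directly; the exact character set below is an assumption for illustration:

```python
import string

# Characters instead of words: no <UNK> token needed because every character is in the vocabulary.
chars = list(string.ascii_lowercase + string.ascii_uppercase + string.digits) + [" ", "\n"]
char_to_index = {ch: i for i, ch in enumerate(chars)}
index_to_char = {i: ch for ch, i in char_to_index.items()}

tokens = [char_to_index[c] for c in "I love coffee"]  # 13 character tokens vs. 3 word tokens
```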
Backpropagation Through Time (BPTT)
RNN implementation
- $n_x$ is used here to denote the number of input units in a single time step of a single training example
- $T_x$ denotes the number of time steps in the longest sequence
- Stack 20 columns of $x^{(i)}$ examples to form a mini-batch
It's worth noting that:
\[\begin{gather*} \tanh'(x) = 1-\tanh^2(x) \\ \sigma'(x) = \sigma(x) (1 - \sigma(x)) \end{gather*}\]
So, in the backward pass, the gradient flowing through the tanh of an RNN cell is scaled element-wise by $(1 - (a^{(t)})^2)$.
Note that we need to accumulate the hidden-state gradient as
\[\begin{gather*} da^{(t)} = da_{\text{loss}}^{(t)} + da_{\text{prev}}^{(t)} \end{gather*}\]- $da_{\text{loss}}^{(t)}$ is the hidden-state gradient coming from the loss at time step $t$, and $da_{\text{prev}}^{(t)}$ is the gradient passed back from time step $t+1$. So a key point of BPTT is that the hidden-state gradient combines the gradient flowing back from the next time step with the gradient from the immediate loss.
- The loss only considers the output at the current time step, and can use (binary or categorical) cross-entropy.
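For reference, here is a sketch of the per-cell gradients that follow from the tanh derivative above, assuming $a^{(t)} = \tanh(W_{aa} a^{(t-1)} + W_{ax} x^{(t)} + b_a)$ (this is the standard derivation written in this document's notation, not copied from the original):
\[\begin{gather*} dz^{(t)} = (1 - (a^{(t)})^2) \odot da^{(t)} \\ dW_{ax}^{(t)} = dz^{(t)} (x^{(t)})^T, \quad dW_{aa}^{(t)} = dz^{(t)} (a^{(t-1)})^T, \quad db_a^{(t)} = \sum_{\text{batch}} dz^{(t)} \\ dx^{(t)} = W_{ax}^T dz^{(t)}, \quad da_{\text{prev}}^{(t)} = W_{aa}^T dz^{(t)} \end{gather*}\]
The per-step parameter gradients $dW_{ax}^{(t)}$, $dW_{aa}^{(t)}$, $db_a^{(t)}$ are summed over $t$, which is exactly what the loop below does.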
```python
import numpy as np

# Gradient accumulators (n_x: input size, n_a: hidden size, m: batch size, T_x: time steps).
# `da` holds the loss gradients w.r.t. each hidden state; `caches` comes from forward propagation.
dx = np.zeros((n_x, m, T_x))
dWax, dWaa, dba = np.zeros((n_a, n_x)), np.zeros((n_a, n_a)), np.zeros((n_a, 1))
da_prevt = np.zeros((n_a, m))

for t in reversed(range(T_x)):
    # Total gradient at time step t: loss gradient at t + gradient flowing back from step t+1
    da_next = da[:, :, t] + da_prevt
    gradients = rnn_cell_backward(da_next, caches[t])
    dxt = gradients["dxt"]
    da_prevt = gradients["da_prev"]
    dWax += gradients["dWax"]
    dWaa += gradients["dWaa"]
    dba += gradients["dba"]
    dx[:, :, t] = dxt
```
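For completeness, here is a hedged sketch of what `rnn_cell_backward` could look like, assuming each cache stores `(a_next, a_prev, xt, parameters)`; this cache layout is an assumption for illustration, not necessarily the original notebook's interface:

```python
import numpy as np

def rnn_cell_backward(da_next, cache):
    """Backward pass for a single tanh RNN cell."""
    a_next, a_prev, xt, parameters = cache
    Wax, Waa = parameters["Wax"], parameters["Waa"]

    # Backprop through tanh: tanh'(z) = 1 - tanh(z)^2 = 1 - a_next^2
    dz = (1 - a_next ** 2) * da_next

    return {
        "dxt": Wax.T @ dz,                           # gradient w.r.t. the input x^{(t)}
        "da_prev": Waa.T @ dz,                       # gradient flowing back to a^{(t-1)}
        "dWax": dz @ xt.T,
        "dWaa": dz @ a_prev.T,
        "dba": np.sum(dz, axis=1, keepdims=True),
    }
```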