LSTM
LSTM came out in 1997, and GRU is a simplification of it. In LSTM, we have the forget gate $\Gamma_f$, the output gate $\Gamma_o$, and the update gate $\Gamma_u$. We do NOT have the relevance gate $\Gamma_r$ (from GRU).
With the forget gate, the cell state $c^{\langle t \rangle}$ can largely retain $c^{\langle t-1 \rangle}$.
\[\begin{align*} \tilde{c}^{\langle t \rangle} &= \tanh\left(W_c \left[a^{\langle t-1 \rangle}, x^{\langle t \rangle} \right] + b_c \right) \\ \Gamma_u &= \sigma\left(W_u \left[a^{\langle t-1 \rangle}, x^{\langle t \rangle} \right] + b_u \right) \\ \Gamma_f &= \sigma\left(W_f \left[a^{\langle t-1 \rangle}, x^{\langle t \rangle} \right] + b_f \right) \\ \Gamma_o &= \sigma\left(W_o \left[a^{\langle t-1 \rangle}, x^{\langle t \rangle} \right] + b_o \right) \\ c^{\langle t \rangle} &= \Gamma_u * \tilde{c}^{\langle t \rangle} + \Gamma_f * c^{\langle t-1 \rangle} \\ a^{\langle t \rangle} &= \Gamma_o * \tanh\left(c^{\langle t \rangle}\right) \\ y^{\langle t \rangle} &= \text{softmax}\left(W_{ya} a^{\langle t \rangle} + b_y\right) \end{align*}\]
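As a concrete reference, here is a minimal numpy sketch of one forward step under the equations above. The parameter dictionary keys (`Wc`, `Wu`, `Wf`, `Wo`, `Wya`, and the biases) and the shapes are my own assumptions for illustration, not a prescribed API.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # column-wise, numerically stable softmax
    e = np.exp(z - np.max(z, axis=0, keepdims=True))
    return e / np.sum(e, axis=0, keepdims=True)

def lstm_cell_forward(xt, a_prev, c_prev, p):
    """One LSTM step. xt: (n_x, m); a_prev, c_prev: (n_a, m); W_*: (n_a, n_a + n_x)."""
    concat = np.concatenate([a_prev, xt], axis=0)       # stack a^{<t-1>} and x^{<t>} vertically
    c_tilde = np.tanh(p["Wc"] @ concat + p["bc"])       # candidate cell state
    gamma_u = sigmoid(p["Wu"] @ concat + p["bu"])       # update gate
    gamma_f = sigmoid(p["Wf"] @ concat + p["bf"])       # forget gate
    gamma_o = sigmoid(p["Wo"] @ concat + p["bo"])       # output gate
    c_next = gamma_u * c_tilde + gamma_f * c_prev       # long-term memory update
    a_next = gamma_o * np.tanh(c_next)                  # short-term memory / hidden state
    y_pred = softmax(p["Wya"] @ a_next + p["by"])       # prediction
    return a_next, c_next, y_pred
```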
One variation is the “Peephole” connection. That is, the previous cell state $c^{\langle t-1 \rangle}$ is also used in deciding the gate values.
\[\begin{align*} \Gamma_u &= \sigma\left(W_u \left[a^{\langle t-1 \rangle}, x^{\langle t \rangle}, c^{\langle t-1 \rangle} \right] + b_u \right) \\ \Gamma_f &= \sigma\left(W_f \left[a^{\langle t-1 \rangle}, x^{\langle t \rangle}, c^{\langle t-1 \rangle} \right] + b_f \right) \\ \Gamma_o &= \sigma\left(W_o \left[a^{\langle t-1 \rangle}, x^{\langle t \rangle}, c^{\langle t-1 \rangle} \right] + b_o \right) \end{align*}\]

LSTM is historically the most proven architecture. GRU is simpler, newer, and potentially better suited for scaling up to bigger models.
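For the peephole variant, only the gate computation changes: the previous cell state is appended to the stacked input, so each gate weight matrix gains $n_a$ extra columns. A small sketch, reusing the `sigmoid` helper and shape conventions from the snippet above:

```python
def lstm_gates_peephole(xt, a_prev, c_prev, p):
    # Same gates as before, but c^{<t-1>} is appended to the input,
    # so Wu, Wf, Wo now have shape (n_a, n_a + n_x + n_a).
    concat = np.concatenate([a_prev, xt, c_prev], axis=0)
    gamma_u = sigmoid(p["Wu"] @ concat + p["bu"])
    gamma_f = sigmoid(p["Wf"] @ concat + p["bf"])
    gamma_o = sigmoid(p["Wo"] @ concat + p["bo"])
    return gamma_u, gamma_f, gamma_o
```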
Detailed Explanations:
- $x^{\langle t \rangle}$ and $a^{\langle t-1 \rangle}$ are stacked vertically
- $\Gamma_f * c^{\langle t-1 \rangle}$ is like applying a mask.
- All gate values are in the range $[0,1]$. That is, if a gate value is close to 0, the corresponding values from the previous cell state or the candidate cell state won’t be kept.
- So when a subject changes its state, e.g., a singular noun becomes plural, certain entries of $\Gamma_f$ should change their values (0 -> 1).
- $\tilde{c}^{\langle t \rangle}$ uses tanh, so its values are in $[-1, 1]$. Whether its values are passed into the actual cell state $c^{\langle t \rangle}$ is determined by the update gate $\Gamma_u$.
- $a^{\langle t \rangle} = \Gamma_o * \tanh(c^{\langle t \rangle})$, where the tanh normalizes the cell-state values to $[-1, 1]$.
Why LSTM is useful: LSTM is like GRU, but it keeps a cell state $c$ (long-term memory) in addition to the hidden state $a$ (short-term memory). Its gates control how much the candidate cell state gets into the overall cell state. Ideally, the model can learn to keep the cell-state entries that track things like plurality, and reject the cell-state updates in between.
- A forget gate is added, so long-term memory can be abandoned.
- So the long-term memory (the cell state) is LSTM’s direct contribution to addressing RNNs’ vanishing gradient problem.
- An update gate makes sure that parts of the candidate cell state might not get into the cell state.
- An output gate determines which elements of the cell state get into the short-term memory $a$.
- The short-term memory $a$ changes more rapidly (like a “scratch pad”), and goes directly into the next output.
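A tiny hand-picked example of the masking behaviour described above (the gate values below are made up for illustration, not learned):

```python
import numpy as np

c_prev  = np.array([0.9, -0.4, 0.7])   # previous cell state (long-term memory)
c_tilde = np.array([0.1,  0.8, -0.5])  # candidate cell state
gamma_f = np.array([1.0,  0.0, 1.0])   # forget gate: keep entries 0 and 2, drop entry 1
gamma_u = np.array([0.0,  1.0, 0.0])   # update gate: only entry 1 accepts the candidate

c_next = gamma_u * c_tilde + gamma_f * c_prev
print(c_next)  # [0.9 0.8 0.7] -- entry 1 was overwritten, the others were preserved
```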
Back Propagation of LSTM
\[\begin{gather*} \tanh'(x) = 1-\tanh^2(x) \\ \sigma'(x) = \sigma(x) \left(1 - \sigma(x)\right) \end{gather*}\]
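These two derivatives are all that is needed for the gate and candidate gradients below; in numpy they can be written as, for example:

```python
import numpy as np

def dtanh(x):
    # tanh'(x) = 1 - tanh^2(x)
    return 1.0 - np.tanh(x) ** 2

def dsigmoid(x):
    # sigma'(x) = sigma(x) * (1 - sigma(x))
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)
```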
Parameter Derivatives
\[\begin{gather*} dW_f = d\gamma_f^{\langle t \rangle} \begin{bmatrix} a_{prev} \\ x_t\end{bmatrix}^T \tag{11} \\ dW_u = d\gamma_u^{\langle t \rangle} \begin{bmatrix} a_{prev} \\ x_t\end{bmatrix}^T \tag{12} \\ dW_c = dp\widetilde c^{\langle t \rangle} \begin{bmatrix} a_{prev} \\ x_t\end{bmatrix}^T \tag{13} \\ dW_o = d\gamma_o^{\langle t \rangle} \begin{bmatrix} a_{prev} \\ x_t\end{bmatrix}^T \tag{14} \end{gather*}\]

To calculate $db_f, db_u, db_c, db_o$, you just need to sum across all $m$ examples (axis = 1) of $d\gamma_f^{\langle t \rangle}, d\gamma_u^{\langle t \rangle}, dp\widetilde c^{\langle t \rangle}, d\gamma_o^{\langle t \rangle}$ respectively. Note that you should use the `keepdims=True` option.
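A sketch of equations 11–14 and the bias sums in numpy, assuming the per-gate derivatives $d\gamma_f^{\langle t \rangle}, d\gamma_u^{\langle t \rangle}, dp\widetilde c^{\langle t \rangle}, d\gamma_o^{\langle t \rangle}$ have already been computed earlier in the backward pass; the random arrays below are just stand-ins with plausible shapes.

```python
import numpy as np

n_a, n_x, m = 5, 3, 10
rng = np.random.default_rng(0)

# Stand-ins for quantities computed earlier in the backward pass
dgamma_f  = rng.standard_normal((n_a, m))   # d gamma_f^<t>
dgamma_u  = rng.standard_normal((n_a, m))   # d gamma_u^<t>
dpc_tilde = rng.standard_normal((n_a, m))   # d p c~^<t>
dgamma_o  = rng.standard_normal((n_a, m))   # d gamma_o^<t>
a_prev    = rng.standard_normal((n_a, m))
xt        = rng.standard_normal((n_x, m))

concat = np.concatenate([a_prev, xt], axis=0)   # stacked [a_prev; x_t], shape (n_a + n_x, m)

# Equations (11)-(14)
dWf = dgamma_f @ concat.T
dWu = dgamma_u @ concat.T
dWc = dpc_tilde @ concat.T
dWo = dgamma_o @ concat.T

# Bias gradients: sum across the m examples, keeping the column dimension
dbf = np.sum(dgamma_f, axis=1, keepdims=True)
dbu = np.sum(dgamma_u, axis=1, keepdims=True)
dbc = np.sum(dpc_tilde, axis=1, keepdims=True)
dbo = np.sum(dgamma_o, axis=1, keepdims=True)
```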
Finally, you will compute the derivative with respect to the previous hidden state, previous memory state, and input.
\[\begin{gather*} da_{prev} = W_f^T d\gamma_f^{\langle t \rangle} + W_u^T d\gamma_u^{\langle t \rangle}+ W_c^T dp\widetilde c^{\langle t \rangle} + W_o^T d\gamma_o^{\langle t \rangle} \tag{19} \end{gather*}\]

Here, to account for the concatenation, the weights in equation 19 are the first $n_a$ columns (i.e., $W_f = W_f[:, :n_a]$, etc.).
\[\begin{gather*} dc_{prev} = dc_{next}*\Gamma_f^{\langle t \rangle} + \Gamma_o^{\langle t \rangle} * (1- \tanh^2(c_{next}))*\Gamma_f^{\langle t \rangle}*da_{next} \tag{20} \\ dx^{\langle t \rangle} = W_f^T d\gamma_f^{\langle t \rangle} + W_u^T d\gamma_u^{\langle t \rangle}+ W_c^T dp\widetilde c^{\langle t \rangle} + W_o^T d\gamma_o^{\langle t \rangle}\tag{21} \end{gather*}\]

where the weights in equation 21 are the columns from $n_a$ to the end (i.e., $W_f = W_f[:, n_a:]$, etc.).
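Continuing the stand-ins from the previous snippet, equations 19–21 look like this in numpy; the weights and gate activations below are again hypothetical placeholders for values a full backward pass would already have cached.

```python
# Hypothetical stand-ins for the full weights and cached forward values
Wf, Wu, Wc, Wo = (rng.standard_normal((n_a, n_a + n_x)) for _ in range(4))
gamma_f = rng.uniform(size=(n_a, m))
gamma_o = rng.uniform(size=(n_a, m))
c_next  = rng.standard_normal((n_a, m))
dc_next = rng.standard_normal((n_a, m))
da_next = rng.standard_normal((n_a, m))

# Equation (19): use the first n_a columns of each W (the a_prev part of the concatenation)
da_prev = (Wf[:, :n_a].T @ dgamma_f + Wu[:, :n_a].T @ dgamma_u
           + Wc[:, :n_a].T @ dpc_tilde + Wo[:, :n_a].T @ dgamma_o)

# Equation (20)
dc_prev = dc_next * gamma_f + gamma_o * (1 - np.tanh(c_next) ** 2) * gamma_f * da_next

# Equation (21): use the remaining columns (the x_t part of the concatenation)
dxt = (Wf[:, n_a:].T @ dgamma_f + Wu[:, n_a:].T @ dgamma_u
       + Wc[:, n_a:].T @ dpc_tilde + Wo[:, n_a:].T @ dgamma_o)
```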
Bi-Directional RNN
A Bi-Directional RNN can learn not only dependencies from the past, but also from the “future”. The unrolled network is an acyclic graph, and a bi-directional RNN is actually a good thing to try at the beginning.
The RNN cell now has one set of weights for the forward direction, $W_h^{forward}$, $W_x^{forward}$, $b^{forward}$, and another for the backward direction, $W_h^{backward}$, $W_x^{backward}$, $b^{backward}$.
E.g., “She saw a dog”. Let’s denote:
- x1: “She”
- x2: “saw”
- x3: “a”
- x4: “dog”
Then,

- Initialize the forward and backward hidden states $h_t^{forward}$ and $h_t^{backward}$ to zeros
- Forward pass: from t=1 through t=4
- Backward pass: from t=4 through t=1
- Combine the hidden states: $h_t = [h_t^{backward}, h_t^{forward}]$
- Output: $y_t = \text{softmax}(W_o h_t + b_o)$
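A minimal numpy sketch of these steps on the 4-token example; the embedding size, hidden size, and random initialization are assumptions for illustration only.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z, axis=0, keepdims=True))
    return e / np.sum(e, axis=0, keepdims=True)

rng = np.random.default_rng(0)
n_x, n_h, n_y, T = 8, 4, 5, 4                             # made-up sizes
xs = [rng.standard_normal((n_x, 1)) for _ in range(T)]    # stand-ins for embeddings of x1..x4

Wh_f, Wx_f, b_f = rng.standard_normal((n_h, n_h)), rng.standard_normal((n_h, n_x)), np.zeros((n_h, 1))
Wh_b, Wx_b, b_b = rng.standard_normal((n_h, n_h)), rng.standard_normal((n_h, n_x)), np.zeros((n_h, 1))
Wo, bo = rng.standard_normal((n_y, 2 * n_h)), np.zeros((n_y, 1))

# Forward pass: t = 1 .. 4
h_f = [np.zeros((n_h, 1))]
for t in range(T):
    h_f.append(np.tanh(Wh_f @ h_f[-1] + Wx_f @ xs[t] + b_f))

# Backward pass: t = 4 .. 1
h_b = [np.zeros((n_h, 1))]
for t in reversed(range(T)):
    h_b.append(np.tanh(Wh_b @ h_b[-1] + Wx_b @ xs[t] + b_b))
h_b = list(reversed(h_b[1:]))   # re-order so h_b[t] lines up with input t

# Combine hidden states and produce outputs
ys = []
for t in range(T):
    h_t = np.concatenate([h_b[t], h_f[t + 1]], axis=0)    # [h_backward, h_forward]
    ys.append(softmax(Wo @ h_t + bo))
```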
The biggest advantage is that the RNN can adjust its weights to produce hidden states that reflect not only forward dependencies, like “she” and “saw”, but also backward ones, like “a dog” and “saw”. This is helpful for learning “saw” as a verb and associating “saw” with “a dog”. This is also described as learning “more nuanced patterns”, or “richer features”.
Deep RNN
With RNN, GRU, and LSTM blocks, one can build deep RNNs.
Three layers is already pretty deep for an RNN, especially if we look at the impact of inputs from the first few time steps.
It’s not uncommon to connect the output $y$ to fully connected layers; these layers are applied at each time step independently, so the deeper part of the network is not connected across time.
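A rough sketch of a 3-layer stacked RNN with a per-time-step fully connected output head (sizes and initialization are made up); note that the head carries no temporal connections of its own.

```python
import numpy as np

rng = np.random.default_rng(0)
n_x, n_h, n_y, T, L = 8, 4, 5, 6, 3   # made-up sizes: input, hidden, output, time steps, layers

Wx = [rng.standard_normal((n_h, n_x if l == 0 else n_h)) for l in range(L)]
Wh = [rng.standard_normal((n_h, n_h)) for _ in range(L)]
b  = [np.zeros((n_h, 1)) for _ in range(L)]
Wy, by = rng.standard_normal((n_y, n_h)), np.zeros((n_y, 1))

xs = [rng.standard_normal((n_x, 1)) for _ in range(T)]
h  = [np.zeros((n_h, 1)) for _ in range(L)]   # one recurrent hidden state per layer

ys = []
for t in range(T):
    inp = xs[t]
    for l in range(L):
        # each layer is recurrent across time; its input comes from the layer below
        h[l] = np.tanh(Wh[l] @ h[l] + Wx[l] @ inp + b[l])
        inp = h[l]
    # fully connected head, applied at each time step independently (not connected across time)
    ys.append(Wy @ h[-1] + by)
```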