Introduction And Data Preparation
The goal of this project is to experiment with translating human-readable dates (e.g., "25th of June, 2009") into machine-readable dates ("2009-06-25"). We truncate (or pad) the data as necessary:
- Set the maximum input length $T_x$ to 30 characters
- Set the maximum output length $T_y$ to 10 characters
The code for getting the data is:
"""
human_vocab: {' ': 0, '.': 1, '/': 2 ... 36}
machine_vocab: {'-': 0, '0': 1, '1': 2, '2': 3, '3': 4, '4': 5, '5': 6, '6': 7, '7': 8, '8': 9, '9': 10}
- X becomes a 30-d vector [4, 3, ...], where each element is the index in human_vocab that the character is mapped to
- Y is a 10-d vector of indices of characters in machine_vocab
- Xoh: one-hot representation of X (30x37)
- Yoh: one-hot representation of Y (10x11)
Eventually, we want:
- Source date: 9 may 1998
- Target date: 1998-05-09
"""
Tx = 30
Ty = 10
X, Y, Xoh, Yoh = preprocess_data(dataset, human_vocab, machine_vocab, Tx, Ty)
print(machine_vocab)
index = 0
print("Source date:", dataset[index][0])
print("Target date:", dataset[index][1])
Model
When we read English, we put focus on certain "important parts". The model used here is a global soft attention model within an encoder-decoder framework. In this attention mechanism, there are:
- A pre-attention bi-directional LSTM that runs through the entire input sequence of length $T_x$.
- A post-attention LSTM that runs through the output sequence of length $T_y$. It passes the cell state $c$ and hidden state $s$ from one timestep to the next.

Specific to this model's post-attention LSTM, each step only takes the hidden state $s$ and the cell state $c$. In text generation, the post-attention LSTM would also take the previous output $y^{(t-1)}$, because in language generation adjacent characters have a strong dependency. In a date formatted as YYYY-MM-DD, there isn't such a strong dependency.
- TODO: recreate the structure of context diagram (0.5h)
- Compute the energy $e^{(t, t')}$ as a function of the post-attention hidden state $s^{(t-1)}$ and the pre-attention hidden state $a^{(t')}$. $e^{(t, t')}$ measures how much attention $y^{(t)}$ should pay to $a^{(t')}$.
- $s^{(t-1)}$ and $a^{(t')}$ are fed into a dense layer to get $e^{(t, t')}$. Then $e^{(t, t')}$ is passed through a softmax layer to compute $\alpha^{(t, t')}$.
- Context: the context vector for output step $t$ is the attention-weighted sum of the pre-attention hidden states, i.e., the dot product of the $\alpha^{(t, t')}$ weights with $a$ (summarized in the equations below).
- TODO: more explanation on `RepeatVector`, which copies $s^{(t-1)}$ $T_x$ times.
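In equations (following the notation above), the energies come from a small dense network applied to the concatenation of $a^{(t')}$ and the repeated $s^{(t-1)}$, the attention weights are their softmax over the input timesteps, and the context is the weighted sum:

$$
e^{(t, t')} = \text{Dense}\!\left(\left[a^{(t')};\, s^{(t-1)}\right]\right), \qquad
\alpha^{(t, t')} = \frac{\exp\!\left(e^{(t, t')}\right)}{\sum_{t''=1}^{T_x} \exp\!\left(e^{(t, t'')}\right)}, \qquad
\text{context}^{(t)} = \sum_{t'=1}^{T_x} \alpha^{(t, t')}\, a^{(t')}
$$

These are exactly what `one_step_attention` below computes for a single output step $t$.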
# UNQ_C1 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
# GRADED FUNCTION: one_step_attention
def one_step_attention(a, s_prev):
    """
    Performs one step of attention: outputs a context vector computed as a dot product of the attention weights
    "alphas" and the hidden states "a" of the Bi-LSTM.

    Arguments:
    a -- hidden state output of the Bi-LSTM, numpy-array of shape (m, Tx, 2*n_a)
    s_prev -- previous hidden state of the (post-attention) LSTM, numpy-array of shape (m, n_s)

    Returns:
    context -- context vector, input of the next (post-attention) LSTM cell
    """
    # repeator, concatenator, densor1, densor2, activator, dotor are shared global
    # layers (see the sketch after this function).
    # Use repeator to repeat s_prev to be of shape (m, Tx, n_s) so that you can concatenate it with all hidden states "a" (≈ 1 line)
    m, Tx, n_a2 = a.shape
    m, n_s = s_prev.shape
    # we MUST reuse the same repeator object
    # s_prev = RepeatVector(Tx)(s_prev)
    s_prev = repeator(s_prev)
    # Use concatenator to concatenate a and s_prev on the last axis (≈ 1 line)
    # For grading purposes, please list 'a' first and 's_prev' second, in this order.
    # concat = Concatenate(axis=-1)([a, s_prev])
    concat = concatenator([a, s_prev])
    # concat.shape = TensorShape([10, 30, 128])
    # print(f"concat.shape: {concat.shape, Ty}")
    # Use densor1 to propagate concat through a small fully-connected neural network to compute the "intermediate energies" variable e.
    # Note: the last dimension of e comes from densor1's number of units (which happens to equal Ty here), not from the output timesteps.
    e = densor1(concat)
    # Use densor2 to propagate e through a small fully-connected neural network to compute the "energies" variable energies. (≈ 1 line)
    energies = densor2(e)
    print(f"e.shape, energies.shape: {e.shape, energies.shape}")
    # See TensorShape([10, 30, 10]), TensorShape([10, 30, 1])
    # Rico: part of the reason why tf is not popular is because the layer sizes are partially
    # inferred; you don't know the full dimensions just from the model definition
    # Use "activator" on "energies" to compute the attention weights "alphas" (≈ 1 line)
    alphas = activator(energies)
    # Use dotor together with "alphas" and "a", in this order, to compute the context vector to be given to the next (post-attention) LSTM-cell (≈ 1 line)
    context = dotor([alphas, a])
    return context
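The layer objects used above are not created inside the function; they are defined once as globals so that the exact same weights are shared across all $T_y$ attention steps. A plausible sketch of those definitions, with the unit counts inferred from the printed shapes and the activations and layer name marked as assumptions:

```python
from tensorflow.keras.layers import RepeatVector, Concatenate, Dense, Softmax, Dot

# Shared (global) layers: creating them once means the same weights are reused
# at every output timestep t.
repeator = RepeatVector(Tx)              # copies s_prev Tx times -> (m, Tx, n_s)
concatenator = Concatenate(axis=-1)      # [a ; s_prev] -> (m, Tx, 2*n_a + n_s)
densor1 = Dense(10, activation="tanh")   # intermediate energies e -> (m, Tx, 10); tanh is an assumption
densor2 = Dense(1, activation="relu")    # scalar energy per input step -> (m, Tx, 1); relu is an assumption
# The assignment uses a custom softmax over the Tx axis; Softmax(axis=1) is an
# equivalent stand-in. The layer name "attention_weights" is an assumption.
activator = Softmax(axis=1, name="attention_weights")
dotor = Dot(axes=1)                      # sum over t' of alphas * a -> context of shape (m, 1, 2*n_a)
```

Reusing these same objects at every step is what makes the attention parameters shared across output timesteps, which is why one_step_attention must not create new RepeatVector/Concatenate layers inside the function body.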
The code for `modelf` is:
n_a = 32 # number of units for the pre-attention, bi-directional LSTM's hidden state 'a'
n_s = 64 # number of units for the post-attention LSTM's hidden state "s"
post_activation_LSTM_cell = LSTM(n_s, return_state = True) # Please do not modify this global variable.
output_layer = Dense(len(machine_vocab), activation=softmax)
def modelf(Tx, Ty, n_a, n_s, human_vocab_size, machine_vocab_size):
    """
    Arguments:
    Tx -- length of the input sequence
    Ty -- length of the output sequence
    n_a -- hidden state size of the Bi-LSTM
    n_s -- hidden state size of the post-attention LSTM
    human_vocab_size -- size of the python dictionary "human_vocab"
    machine_vocab_size -- size of the python dictionary "machine_vocab"

    Returns:
    model -- Keras model instance
    """
    # Define the inputs of your model with a shape (Tx, human_vocab_size)
    # Define s0 (initial hidden state) and c0 (initial cell state)
    # for the decoder LSTM with shape (n_s,)
    X = Input(shape=(Tx, human_vocab_size))
    # initial hidden state
    s0 = Input(shape=(n_s,), name='s0')
    # initial cell state
    c0 = Input(shape=(n_s,), name='c0')
    # hidden state
    s = s0
    # cell state
    c = c0

    # Initialize empty list of outputs
    outputs = []

    ### START CODE HERE ###
    # Step 1: Define your pre-attention Bi-LSTM. (≈ 1 line)
    a = Bidirectional(LSTM(units=n_a, return_sequences=True))(X)

    # Step 2: Iterate for Ty steps
    for t in range(Ty):
        # Step 2.A: Perform one step of the attention mechanism to get the context vector at step t
        context = one_step_attention(a, s)
        # Step 2.B: Apply the post-attention LSTM cell to the "context" vector. (≈ 1 line)
        # Don't forget to pass: initial_state = [hidden state, cell state]
        # Remember: s = hidden state, c = cell state
        _, s, c = post_activation_LSTM_cell(context, initial_state=[s, c])
        # Step 2.C: Apply Dense layer to the hidden state output of the post-attention LSTM (≈ 1 line)
        out = output_layer(s)
        # Step 2.D: Append "out" to the "outputs" list (≈ 1 line)
        outputs.append(out)

    # Step 3: Create model instance taking three inputs and returning the list of outputs. (≈ 1 line)
    model = Model(inputs=[X, s0, c0], outputs=outputs)
    ### END CODE HERE ###

    return model
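A hedged sketch of how the model might then be built, compiled, and trained; the optimizer settings, epochs, and batch size below are illustrative rather than the assignment's exact values. The key detail is that, because modelf returns a list of Ty outputs, the targets must be supplied as a list of Ty arrays:

```python
import numpy as np
from tensorflow.keras.optimizers import Adam

model = modelf(Tx, Ty, n_a, n_s, len(human_vocab), len(machine_vocab))

# One categorical cross-entropy loss per output timestep, since the model
# returns a list of Ty softmax outputs.
model.compile(optimizer=Adam(learning_rate=0.005),   # learning rate is illustrative
              loss="categorical_crossentropy",
              metrics=["accuracy"])

m = Xoh.shape[0]
s0 = np.zeros((m, n_s))   # initial hidden state of the post-attention LSTM
c0 = np.zeros((m, n_s))   # initial cell state
# Yoh has shape (m, Ty, machine_vocab_size); the model expects a list of Ty
# arrays of shape (m, machine_vocab_size).
targets = list(Yoh.swapaxes(0, 1))

model.fit([Xoh, s0, c0], targets, epochs=1, batch_size=100)
```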
Learned Result
We can plot the "attention map" for a given input such as "Tuesday 09 Oct 1993". At each output character (i.e., time step), the attention weights concentrate on the relevant input characters.
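The notebook produces that plot from the trained model. Purely as an illustration of what such a map looks like, here is a hypothetical matplotlib sketch using a placeholder weight matrix; in a real plot, the weights would be read off the attention softmax activations:

```python
import numpy as np
import matplotlib.pyplot as plt

source = "Tuesday 09 Oct 1993"
target = "1993-10-09"

# Placeholder weights for illustration only (one row per output character,
# one column per input character); a real map comes from the attention softmax.
attention = np.random.rand(len(target), len(source))
attention /= attention.sum(axis=1, keepdims=True)   # each output step's weights sum to 1

fig, ax = plt.subplots()
im = ax.imshow(attention, cmap="Blues")
ax.set_xticks(range(len(source)))
ax.set_xticklabels(list(source))
ax.set_yticks(range(len(target)))
ax.set_yticklabels(list(target))
ax.set_xlabel("input characters")
ax.set_ylabel("output characters")
fig.colorbar(im, ax=ax)
plt.show()
```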