Rico's Nerd Cluster

"Before leaving this world, everything is a process."

Deep Learning - Neural Machine Translation

Hands-On Attention Project

Introduction And Data Preparation The goal of this project is to experiment with date translation, i.e., turning human-readable dates (“25th of June, 2009”) into machine-readable dates (“2009-06-25”). We need to truncate data...
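A minimal sketch of how such (human-readable, machine-readable) date pairs can be generated for experimentation; this is an illustrative stand-in, not necessarily the post's exact data pipeline:

```python
import random
from datetime import date, timedelta

# Illustrative generator (assumption: not the post's actual code) of
# (human-readable, machine-readable) date pairs for the translation task.
MONTHS = ["January", "February", "March", "April", "May", "June",
          "July", "August", "September", "October", "November", "December"]
SUFFIX = {1: "st", 2: "nd", 3: "rd", 21: "st", 22: "nd", 23: "rd", 31: "st"}

def random_pair(start=date(1980, 1, 1), span_days=20000):
    d = start + timedelta(days=random.randrange(span_days))
    human = f"{d.day}{SUFFIX.get(d.day, 'th')} of {MONTHS[d.month - 1]}, {d.year}"
    machine = d.isoformat()                      # e.g. "2009-06-25"
    return human, machine

pairs = [random_pair() for _ in range(5)]
print(pairs[0])                                  # ("25th of June, 2009"-style, "YYYY-MM-DD")
```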

Deep Learning - Transformer Series 5 - Transformer Hands On

Hands-On Transformer Training and Validation

Tasks and Data It’s common practice to pad input sequences to MAX_SENTENCE_LENGTH. Therefore, the input is always [batch_size, max_sentence_length], and NUM_KEYS = NUM_QUERIES = max_sentence_leng...
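A minimal padding sketch under assumed values (MAX_SENTENCE_LENGTH = 10 and pad id 0 are illustrative), showing how variable-length token-id sequences become a fixed [batch_size, max_sentence_length] batch plus a padding mask:

```python
import torch

# Illustrative sizes; the post's actual constants may differ.
MAX_SENTENCE_LENGTH = 10
PAD_ID = 0

sentences = [torch.tensor([5, 8, 2]), torch.tensor([7, 3, 9, 4, 1])]

# Right-pad each sequence with PAD_ID up to MAX_SENTENCE_LENGTH.
padded = torch.full((len(sentences), MAX_SENTENCE_LENGTH), PAD_ID, dtype=torch.long)
for i, s in enumerate(sentences):
    padded[i, : s.numel()] = s

padding_mask = padded.eq(PAD_ID)      # True where the position is padding
print(padded.shape)                   # torch.Size([2, 10]) -> [batch_size, max_sentence_length]
```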

Deep Learning - Transformer Series 4 - Transformer All Together

Encoder, Decoder

Overview We’ve seen that RNNs and CNNs have a longer maximum path length. CNNs could have better computational complexity for long sequences, but overall, self-attention is the best for deep architect...
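For reference, the standard comparison from "Attention Is All You Need" (n = sequence length, d = representation dimension, k = convolution kernel size):
- Self-attention: per-layer complexity O(n²·d), sequential operations O(1), maximum path length O(1)
- RNN: per-layer complexity O(n·d²), sequential operations O(n), maximum path length O(n)
- CNN: per-layer complexity O(k·n·d²), sequential operations O(1), maximum path length O(log_k(n))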

Deep Learning - Transformer Series 3 - Multi-Head and Self Attention

Multi-Head Attention, Self Attention, Comparison of Self Attention Against CNN, RNN

Multi-Head Attention To learn a richer set of behaviors, we can instantiate multiple attention heads jointly on the same set of queries, keys, and values. Specifically, we are able to capture variou...
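A minimal sketch using PyTorch's built-in nn.MultiheadAttention (sizes are illustrative), where several heads attend jointly to the same queries, keys, and values:

```python
import torch
import torch.nn as nn

# Illustrative dimensions; embed_dim must be divisible by num_heads.
embed_dim, num_heads = 32, 4
mha = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

x = torch.randn(2, 10, embed_dim)               # [batch, seq_len, embed_dim]
out, attn_weights = mha(query=x, key=x, value=x)
print(out.shape, attn_weights.shape)            # [2, 10, 32], [2, 10, 10] (weights averaged over heads)
```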

Deep Learning - Transformer Series 2 Vanilla Attention Mechanism

Attention Intuition, Query-Key-Value, Bahdanau Attention, Scaled-Dot Attention

Attention Intuition Imagine we are sitting in a room. We have a red cup of coffee, and a notebook in front of us. When we first sit down, the red cup stands out. So it attracts our attention “invo...
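As a concrete counterpart to the intuition, here is a minimal scaled dot-product attention sketch (shapes are illustrative): each query scores all keys, the scores are scaled by sqrt(d_k), softmaxed, and used to weight the values.

```python
import math
import torch

def scaled_dot_attention(Q, K, V):
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)   # [..., num_queries, num_keys]
    weights = torch.softmax(scores, dim=-1)
    return weights @ V, weights

Q = torch.randn(2, 5, 16)   # [batch, num_queries, d_k]
K = torch.randn(2, 7, 16)   # [batch, num_keys, d_k]
V = torch.randn(2, 7, 32)   # [batch, num_keys, d_v]
out, w = scaled_dot_attention(Q, K, V)
print(out.shape)            # torch.Size([2, 5, 32])
```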

Deep Learning - Transformer Series 1 - Embedding Pre-Processing

Positional Encoding, Padding Mask, Look-ahead Mask, Tokenization

What is Positional Encoding In natural language processing, it’s common to go from a sentence ("I love ice cream") -> tokens ("I", "love", "ice", "cream") -> embedding (100, 104, 203, 301) ->...
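A sketch of the sinusoidal positional encoding from "Attention Is All You Need" that gets added to the token embeddings (sizes are illustrative):

```python
import math
import torch

def positional_encoding(max_len, d_model):
    # pe[pos, 2i] = sin(pos / 10000^(2i/d_model)), pe[pos, 2i+1] = cos(...)
    position = torch.arange(max_len).unsqueeze(1)                        # [max_len, 1]
    div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe                                                            # added to token embeddings

pe = positional_encoding(max_len=50, d_model=16)
print(pe.shape)   # torch.Size([50, 16])
```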

Deep Learning - Sequence to Sequence Models

seq2seq, encoder-decoder architecture, beam search, BLEU score

Sequence to Sequence Models: The Encoder-Decoder Architecture. Machine Translation: Early sequence models use two RNN/LSTM cells to create an encoder-decoder architecture for machine translation. ...
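A minimal sketch (illustrative sizes and a hypothetical class name) of the two-RNN idea: the encoder's final LSTM state initializes the decoder, which predicts the target sequence token by token:

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, embed_dim=32, hidden=64):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, embed_dim)
        self.tgt_embed = nn.Embedding(tgt_vocab, embed_dim)
        self.encoder = nn.LSTM(embed_dim, hidden, batch_first=True)
        self.decoder = nn.LSTM(embed_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        _, state = self.encoder(self.src_embed(src_ids))    # keep only the final (h, c)
        dec_out, _ = self.decoder(self.tgt_embed(tgt_ids), state)
        return self.out(dec_out)                             # [batch, tgt_len, tgt_vocab]

model = Seq2Seq(src_vocab=100, tgt_vocab=120)
logits = model(torch.randint(0, 100, (2, 7)), torch.randint(0, 120, (2, 5)))
print(logits.shape)   # torch.Size([2, 5, 120])
```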

Deep Learning - Word Emojifier Using Dense and LSTM Layers

Emojifier

Introduction When using word vectors, you’ll see that even if your training set explicitly relates only a few words to a particular emoji, your algorithm will be able to generalize and associate a...
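A toy sketch (hypothetical word vectors and dimensions) of why this generalization happens: a sentence is represented by the average of its word embeddings, so an unseen word that sits near a training word gets a similar prediction.

```python
import torch
import torch.nn as nn

# Hypothetical 2-D word vectors; real ones (e.g. GloVe) are much higher-dimensional.
word_vecs = {"love": torch.tensor([0.9, 0.1]),
             "adore": torch.tensor([0.85, 0.15]),   # close to "love" in embedding space
             "food": torch.tensor([0.1, 0.9])}

def sentence_vec(sentence):
    # Average the embeddings of all words in the sentence.
    return torch.stack([word_vecs[w] for w in sentence.split()]).mean(dim=0)

classifier = nn.Linear(2, 5)                      # 5 emoji classes, toy dimensions
logits = classifier(sentence_vec("adore food"))   # "adore" never seen with an emoji, still works
print(logits.shape)                               # torch.Size([5])
```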

Deep Learning - Hands-On Embedding Similarity

Similarity and Debiasing

This blog post is a summary of the Coursera course on Sequence Models. Embedding Similarity and Debiasing: Since embeddings are very computationally expensive to train, most ML practitioners will load a ...
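A minimal cosine-similarity sketch with toy vectors (in practice you would load pre-trained embeddings such as GloVe rather than train your own):

```python
import torch
import torch.nn.functional as F

# Toy vectors standing in for two pre-trained word embeddings.
u = torch.tensor([0.8, 0.1, 0.3])
v = torch.tensor([0.7, 0.2, 0.25])

cos_sim = F.cosine_similarity(u, v, dim=0)   # u·v / (||u|| * ||v||), in [-1, 1]
print(cos_sim.item())
```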

Deep Learning - PyTorch Versioning And Memory Allocation

In-Place and Out-of-Place Matrix Ops, Gradient Checkpointing

PyTorch Versioning Is Necessary Because We Have In-Place and Out-of-Place Matrix Ops Takeaways: - x.add_() / x.multiply_() perform in-place addition/multiplication and update the gradient. - x+something a...
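A minimal sketch of the version counter behind these takeaways: in-place ops bump a tensor's _version, out-of-place ops allocate a new tensor, and autograd uses the counter to catch tensors modified in place after being saved for backward.

```python
import torch

x = torch.ones(3, requires_grad=True)
y = x * 2                       # non-leaf tensor we are free to modify in place
print(y._version)               # 0

y.add_(1)                       # in-place add: same storage, version bumped
print(y._version)               # 1

w = y + 1                       # out-of-place add: new tensor, y untouched
print(y._version, w._version)   # 1 0

z = (y * y).sum()               # y is saved for backward here
y.add_(1)                       # in-place change after saving...
try:
    z.backward()                # ...so autograd reports a version mismatch
except RuntimeError as e:
    print(type(e).__name__)     # RuntimeError
```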