Demystify Transformer

Ted Mei
10 min read · Feb 5, 2020

The Transformer is a model architecture based solely on attention mechanisms, without recurrence or convolutions. Without the sequential constraint of recurrent networks, the Transformer architecture is more parallelizable and requires less time to train. It also handles long-term dependencies better than an LSTM, so it can be seen as a replacement for LSTMs.

The Transformer was initially proposed for machine translation, so in general it keeps the encoder (on the left of Figure 1) and decoder (on the right of Figure 1) structure of the seq2seq architecture. However, both the encoder and decoder blocks are much more complex than their seq2seq counterparts. Now let's jump right into the Transformer architecture as shown in Figure 1.

Figure 1. Image: https://arxiv.org/pdf/1706.03762.pdf

We will walk through the encoder block first, from bottom to top in Figure 1, and then the decoder block. Most of the components are shared between the encoder and the decoder.

Encoder

- Input Embedding

This is a standard step in NLP applications: we turn each input word into a vector. The vector becomes the representation of the word and can capture both its syntactic and semantic meaning. In the base Transformer model, the embedding dimension is 512, and the embedding is trained together with the other weight matrices.
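As a minimal sketch (the vocabulary size and token ids below are made up for illustration), the embedding is just a trainable lookup table:

```python
import numpy as np

vocab_size, d_model = 10000, 512                   # hypothetical vocabulary size; d_model = 512 as in the base model
embedding_table = np.random.randn(vocab_size, d_model) * 0.01   # learned jointly with the rest of the model

token_ids = np.array([12, 843, 7])                 # made-up ids for a 3-word input sentence
word_embeddings = embedding_table[token_ids]       # shape (3, 512): one 512-d vector per word
```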

- Positional Encoding

The purpose of positional encoding is to inject the order of the sequence. In an RNN, words are fed into the model sequentially, so order comes for free; the Transformer does not have this property intrinsically, since its intent is to process all words in parallel. By adding a positional encoding to each word embedding, the model can learn the position of each word, or the relative distance between words.

As a high-level illustration, positional encoding generates a vector with the same dimension as the input embedding, with values in the range from -1 to 1, and adds it directly to the input embedding.

Mathematically, each element of the positional encoding vector is determined by a sine or a cosine function as follows:
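Following the notation of the original paper, where pos is the position of the word in the sequence, i indexes the dimension, and d_model is the embedding dimension:

$$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{model}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{model}}}\right) \tag{1}$$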

So for different words and different dimensions, Equation (1) gives different values. For even dimension indices the value comes from the sine function, and for odd dimension indices it comes from the cosine function, so each positional encoding vector looks like [sin, cos, sin, cos, …].
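A minimal NumPy sketch of this encoding (the sequence length and dimension passed in at the end are just for illustration):

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding from Equation (1)."""
    pos = np.arange(seq_len)[:, None]                 # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]              # (1, d_model/2)
    angles = pos / np.power(10000, 2 * i / d_model)   # (seq_len, d_model/2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                      # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)                      # odd dimensions: cosine
    return pe

# Added directly to the word embeddings, e.g. for a 3-word sentence:
# x = word_embeddings + positional_encoding(3, 512)
```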

When I saw the idea of positional encoding for the first time, I was concerned that adding this encoding to the word embedding would mess up the quality of the embedding. However, it seems that after a ton of training, the model learns the pattern we added, and during self-attention it is able to make use of this distance information.

- Self-attention

The intuition behind self-attention is that it relates different positions of a single sequence so that we can compute a better representation of the sequence. As an example, in the sentence "Tony didn't go to school because he was sick", when the model processes the word "he", self-attention allows it to look back at the word "Tony".

The Transformer architecture uses scaled dot-product attention, defined as follows:
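$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V \tag{2}$$

Here Q, K and V are the query, key and value matrices, and d_k is the dimension of the key vectors.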

This is a standard form of attention except for the scaling factor in the denominator. When I first saw this equation, I spent a lot of effort trying to understand what exactly key, value and query are. But later I realized it is not worth pinning down a precise definition: they are simply derived by multiplying the input vector by different weight matrices. To better understand what Equation (2) is doing, let's go over a specific example.

Say we have three words A, B and C, each with its own word embedding, along with key, query and value weight matrices.

To get the key representation of A, we take the dot product between the word embedding of A and the key matrix.

Similarly, we can get the key representations of B and C.

We can then use the power of vectorization to get the query and value representations of A, B and C in one shot.

With all of K, Q and V in hand, we can apply Equation (2). For instance, to compute the attention output of word A, we first take the inner product of Q_A and the transpose of K.

Next we divide the result by the square root of the key dimension, which is √3 in this case, and take the softmax.

Finally, we take the dot product between the softmax vector and the value matrix.

This is the self-attention output vector of word A. We can do the same thing for words B and C, or use vectorization to compute all of them at once. In this example, the attention output for each word is a 1×3 matrix; with 3 words in total, the full output is a 3×3 matrix.

A minimal NumPy implementation of scaled dot-product self-attention looks like this (the 3-word, 3-dimensional embeddings and random weight matrices in the demo are made up for illustration):
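```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)      # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_self_attention(X, W_q, W_k, W_v):
    """Self-attention over a sequence X of shape (seq_len, d_model)."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v          # query, key, value representations of every word
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)              # Equation (2): scaled dot products
    weights = softmax(scores, axis=-1)           # one attention distribution per word
    return weights @ V                           # weighted sum of the value vectors

# Hypothetical example: 3 words (A, B, C) with 3-dimensional embeddings and random weight matrices
X = np.array([[1.0, 0.0, 1.0],    # word A
              [0.0, 2.0, 0.0],    # word B
              [1.0, 1.0, 1.0]])   # word C
rng = np.random.default_rng(0)
W_q = rng.standard_normal((3, 3))
W_k = rng.standard_normal((3, 3))
W_v = rng.standard_normal((3, 3))
print(scaled_dot_product_self_attention(X, W_q, W_k, W_v))   # shape (3, 3): one output row per word
```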

- Multi-head attention

The general idea of multi-head attention is somewhat like model ensembling. Instead of doing the self-attention described in the last section once, we do it multiple times with different key, query and value matrices. One benefit of multi-head attention is that it allows the model to focus on different positions; the other comes from the usual benefit of ensembling, which is to reduce variance.

The Transformer paper uses 8 heads, which means the self-attention is repeated 8 times. Following the same example, we end up with eight 3×3 matrices. The next step is to concatenate the attention outputs (giving a 3×24 matrix) and multiply the concatenated result by another matrix (i.e., pass it through a linear layer) to get the multi-head attention output. Mathematically it looks like:
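$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\,W^{O}, \quad \text{where } \mathrm{head}_i = \mathrm{Attention}(QW_i^{Q},\, KW_i^{K},\, VW_i^{V})$$

Here h = 8 and each head has its own projection matrices W_i^Q, W_i^K and W_i^V, while W^O is the linear layer applied to the concatenated outputs.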

Pictorially we have:

Figure 2. Image: https://arxiv.org/pdf/1706.03762.pdf

- Residual connection & layer normalization

Since the invention of ResNet [resnet], residual connections have been widely used in deep learning models.

I find it easier to understand residual connections through the picture. As can be seen in Figure 1, the arrow that starts at the bottom of the multi-head attention block and ends at the left of the Add & Norm block represents a residual connection. Basically, a residual connection forks the input signal: one copy passes through some functional blocks, the other bypasses them, and the two signals are added back together at some point. With this in mind, the output of the multi-head attention block followed by the Add & Norm block is simply LayerNorm(x + MultiHead(x, x, x)).

In brief, layer normalization [layer norm] is like applying batch normalization to a recurrent neural network: it normalizes across the feature dimension instead of the batch. More detail can be found in the cited paper.

- Feed forward network

The feed-forward network is a fairly standard two-layer neural network: two linear layers with a ReLU activation in between. Mathematically we have:
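$$\mathrm{FFN}(x) = \max(0,\, xW_1 + b_1)\,W_2 + b_2$$

In the base model the inner layer has dimensionality 2048, while the input and output have dimensionality 512.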

Notice that the output of this feed-forward network is added to its input x directly (because of the residual connection) and then passed through a layer normalization to get the final output of the encoder layer.

- Summary of Encoder

With the above building blocks, we can summarize the workflow of the encoder. Given the raw input sequence, the encoder does the following:

  1. generate the word embedding of each input word
  2. add the positional encoding to each word embedding
  3. compute multi-head attention and, through the residual connection, add back the result of step 2
  4. apply layer normalization
  5. pass the result through the 2-layer feed-forward network and, through the residual connection, add back the result of step 4
  6. apply another layer normalization

Steps 3–6 are the computation done in one encoder block. The block is repeated 6 times (the N in the paper), with the input of the current encoder block being the output of the previous one; a minimal sketch of one block is shown below.
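Putting the pieces together, here is a minimal single-head NumPy sketch of one encoder block (the weights are random placeholders and the sizes are shrunk for illustration; the real model uses 8 heads, learned parameters, d_model = 512 and an inner dimension of 2048):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-6):
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def encoder_block(x, p):
    """One encoder block, single-head simplification. x: (seq_len, d_model)."""
    # Self-attention sublayer + residual connection + layer norm
    Q, K, V = x @ p["W_q"], x @ p["W_k"], x @ p["W_v"]
    attn = softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V
    x = layer_norm(x + attn @ p["W_o"])
    # Feed-forward sublayer + residual connection + layer norm
    ff = np.maximum(0, x @ p["W_1"] + p["b_1"]) @ p["W_2"] + p["b_2"]
    return layer_norm(x + ff)

# Hypothetical small sizes for illustration
d_model, d_ff, seq_len = 8, 32, 3
rng = np.random.default_rng(0)
p = {name: rng.standard_normal(shape) for name, shape in [
    ("W_q", (d_model, d_model)), ("W_k", (d_model, d_model)),
    ("W_v", (d_model, d_model)), ("W_o", (d_model, d_model)),
    ("W_1", (d_model, d_ff)), ("W_2", (d_ff, d_model)),
]}
p["b_1"], p["b_2"] = np.zeros(d_ff), np.zeros(d_model)

x = rng.standard_normal((seq_len, d_model))   # word embeddings + positional encodings
for _ in range(6):                            # N = 6 stacked blocks (each has its own weights in the real model)
    x = encoder_block(x, p)
print(x.shape)                                # (3, 8)
```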

Decoder

With the knowledge of the encoder, the decoder should be easy to understand. We can briefly work through each component:

- Output Embedding

Same as the input embedding, we learn a vector representation for each output word.

One thing worth clarifying is that the Transformer was designed for machine translation. As in the seq2seq architecture, the input sequence is in one language and the output sequence is in another. The output sequence is offset by one position relative to the input sequence, because the first output word depends on the first input word (if we consider a one-to-one translation). So we usually insert a special token at the beginning of the output sequence to indicate that translation is starting; the word embedding of this token can simply be set to all zeros. The decoder then starts from this token and produces the translated words one position at a time.

- Positional Encoding

Same as before: we add the sine/cosine positional encoding to the word embedding.

- Self Attention

The mechanism of self-attention is the same as in the encoder block, except that we need to prevent positions from attending to future positions. Think about the real application of machine translation: the translation of the current word should only depend on the input words and on what has already been translated, not on future words that have not been produced yet. So if I am at index i of the output sequence, I should only attend to indices before i, not after. The solution is to mask out all indices after i during self-attention (by setting their scores to negative infinity before the softmax, so that after the softmax their weights become 0).
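A minimal NumPy sketch of this causal mask, applied to the attention scores before the softmax (the 4-word scores below are made up):

```python
import numpy as np

def causal_mask(seq_len):
    """True above the diagonal: position i must not attend to positions j > i."""
    return np.triu(np.ones((seq_len, seq_len)), k=1).astype(bool)

def masked_softmax(scores):
    scores = scores.copy()
    scores[causal_mask(scores.shape[-1])] = -1e9       # effectively minus infinity before the softmax
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

scores = np.random.randn(4, 4)                         # hypothetical attention scores for a 4-word output
print(np.round(masked_softmax(scores), 2))             # row i is zero for every column after i
```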

- Multi-head attention

Same as the encoder block for the most part: we basically do an ensemble of 8 self-attentions.

One extra block is the encoder-decoder attention (in the middle of the decoder block), with keys and values coming from the encoder output and queries from the decoder. This helps the decoder focus on relevant positions in the input sequence. So this is not self-attention anymore; instead, the output sequence attends to the input sequence.

- Residual connection & layer normalization

Nothing special to say here; both the residual connections and the layer normalization are the same as in the encoder block.

- Final linear and softmax layer

The last decoder layer outputs a vector for each word. The linear layer projects this vector to a vector whose size equals the vocabulary size. After passing it through the softmax layer, we get the probability of choosing each word in the vocabulary, and we can take the word with the maximum probability as the final output.
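A minimal sketch of this final projection (the vocabulary size and the decoder output vector below are made up):

```python
import numpy as np

vocab_size, d_model = 10000, 512                         # hypothetical vocabulary size; d_model from the paper
W_vocab = np.random.randn(d_model, vocab_size) * 0.01    # final linear (projection) layer
decoder_output = np.random.randn(d_model)                # decoder vector for the current output position

logits = decoder_output @ W_vocab                        # shape (vocab_size,)
probs = np.exp(logits - logits.max())
probs /= probs.sum()                                     # softmax over the whole vocabulary
next_word_id = int(np.argmax(probs))                     # greedy choice; beam search would keep several candidates
```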

During training we know the actual output, so we can compute the loss for each word using cross-entropy (the negative log probability of the actual word in the softmax vector). During inference, we feed the output word from the previous step into the input of the current step, or we use beam search to keep a handful of candidate words rather than only the one with the highest probability.

- Summary of decoder

With the above building blocks, we can summarize the workflow of the decoder. Given its input sequence (the shifted output sequence), the decoder does the following:

  1. generate the word embedding of each decoder input word
  2. add the positional encoding to each word embedding
  3. compute masked multi-head self-attention and, through the residual connection, add back the result of step 2
  4. apply layer normalization
  5. compute multi-head attention with keys and values from the encoder output and queries from the decoder, and add the residual connection
  6. apply layer normalization
  7. pass the result through the 2-layer feed-forward network and, through the residual connection, add back the result of step 6
  8. apply another layer normalization

Steps 3–8 are the computation done in one decoder block. The block is repeated 6 times (the N in the paper), with the input of the current decoder block being the output of the previous one. The result of the sixth decoder block is then passed to a linear layer followed by a softmax to get the probability of choosing each word in the vocabulary as the decoder output (the translated word). During training, we compute the cross-entropy loss and backpropagate to update the weights.

This is the end of the Transformer architecture overview. I hope it helps you gain a better understanding of the original paper.

Useful Resources and References:

[attention is all you need] Vaswani, Ashish, et al. “Attention is all you need.” Advances in Neural Information Processing Systems. 2017.

[resnet] He, Kaiming, et al. “Deep residual learning for image recognition.” Proceedings of the IEEE conference on computer vision and pattern recognition. 2016.

[layer norm] Ba, Jimmy Lei, Jamie Ryan Kiros, and Geoffrey E. Hinton. “Layer normalization.” arXiv preprint arXiv:1607.06450 (2016).
