From Static Embedding to Contextualized Embedding

This post summarizes a number of recent representative and influential results in the area of word embedding. (Even though the idea was proposed as early as 1986 by Dr. Geoffrey Hinton, this post focuses on results starting from word2vec, which is the first word embedding to perform well in industry.) The main problem that word embedding tries to solve is how to represent the meaning of words. In general, static embedding methods represent each word as a fixed dense vector, while contextualized embedding methods also take the surrounding context into account.

Below is the roadmap of this post:

static embedding:

  • Word2vec: Efficient Estimation of Word Representations in Vector Space [2013] + Distributed Representations of Words and Phrases and their Compositionality [2013]
  • Glove

contextualized embedding:

  • ELMo
  • GPT-2
  • BERT
  • ALBERT
  • XLNET

Word2Vec [2013]

Efficient Estimation of Word Representations in Vector Space [2013] and Distributed Representations of Words and Phrases and their Compositionality [2013] together establish the idea of word2vec. The first paper introduces two novel neural network models, Skip-gram and CBOW. The second paper introduces negative sampling, which is an efficient way to train word2vec.

The core idea behind word2vec is that a word can be represented by the set of words that appear nearby (within a fixed window size). Based on this definition, it is natural to use two vectors to represent each word, a center word vector V and a context word vector U, since the same word can be a center word and can also be the context word of another center word.

The Skip-gram model predicts the context words given the center word, while the Continuous Bag of Words (CBOW) model predicts the center word given a bag of context words. Skip-gram has slightly better performance than CBOW and is more often used in NLP.

For skip-gram, the objective is to maximize the average log probability of the context words given the center word, as shown in Equation (1) of the second paper:
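A transcription of that objective, assuming the paper’s notation of training words w_1, ..., w_T and a context window of size c:

```latex
\frac{1}{T} \sum_{t=1}^{T} \sum_{-c \le j \le c,\; j \ne 0} \log p(w_{t+j} \mid w_t)
\tag{1}
```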

The loss is just the negative of this objective, since we maximize log probability but minimize loss. To compute P(context_word | center_word), the second paper uses the softmax function:
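Written with a center vector v_c and context vectors u_w, matching the V and U of this post (this notation is an assumption; the paper itself writes v and v'), the softmax reads:

```latex
P(o \mid c) = \frac{\exp(u_o^{\top} v_c)}{\sum_{w \in \text{Vocab}} \exp(u_w^{\top} v_c)}
\tag{2}
```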

The idea is that, given center word c, the probability that context word o appears is the score of o and c appearing together, normalized over all the words in the vocabulary.
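The softmax probability can be sketched in a few lines of plain Python. This is only an illustration: the three-word vocabulary, the 2-dimensional vectors and the function name `context_probability` are made up for the example.

```python
import math

def context_probability(center_vec, context_vecs, o):
    """P(o | c) = exp(u_o . v_c) / sum_w exp(u_w . v_c)."""
    scores = [sum(u_i * v_i for u_i, v_i in zip(u, center_vec)) for u in context_vecs]
    max_score = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - max_score) for s in scores]
    return exps[o] / sum(exps)

# Hypothetical toy vocabulary of 3 words with 2-dimensional vectors.
U = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]  # context vectors, one row per word
v_c = [2.0, 0.0]                            # center vector of the current word
probs = [context_probability(v_c, U, o) for o in range(len(U))]
```

Note that the denominator touches every row of U, which is exactly the cost over the whole vocabulary.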

Up to this point, skip-gram can already be trained using stochastic gradient descent (SGD). The gradient is also easy to compute manually (refer to page 35 in …). In a deep learning framework such as PyTorch, you don’t even need to derive the gradient; it is computed automatically by the framework. In the end, the paper extracts the center word embedding V as the final word embedding for each word.

However, one drawback lies in the denominator of Equation (2). It loops through the whole vocabulary every time we compute the gradient, which is inefficient and time-consuming. So the second paper introduces two methods, hierarchical softmax and negative sampling, to improve efficiency. As the paper mentions, negative sampling has better performance, so I will only talk about negative sampling in this post.

The idea of negative sampling is also simple: since computing the denominator in Equation (2) above is heavy, we should avoid looping through the whole vocabulary. Instead, we only loop over k words that are outside of the context window (negative samples). The authors also slightly modify the loss function to:
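A common way to write this loss for a single (center, context) pair, assuming u_o is the context vector of the true context word, v_c the center vector, u_i the context vectors of the k sampled negative words, and sigma the logistic sigmoid:

```latex
J = -\log \sigma(u_o^{\top} v_c) - \sum_{i=1}^{k} \log \sigma(-u_i^{\top} v_c)
\tag{3}
```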

where negative samples are drawn from the unigram distribution of words raised to the power of 3/4. The remaining steps are the same as above: we can use SGD to minimize the loss in Equation (3) and extract the weights of center word V as the word embedding.
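A minimal sketch of both pieces, the smoothed noise distribution and the loss for one training pair, under stated assumptions: the word counts and vectors are toy values, and the helper names are hypothetical rather than taken from any word2vec implementation.

```python
import math
import random

def noise_distribution(counts, power=0.75):
    """Unigram counts raised to the 3/4 power, then normalized."""
    weights = {w: c ** power for w, c in counts.items()}
    total = sum(weights.values())
    return {w: x / total for w, x in weights.items()}

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def negative_sampling_loss(v_c, u_o, negatives):
    """-log sigma(u_o . v_c) - sum_k log sigma(-u_k . v_c)."""
    loss = -math.log(sigmoid(dot(u_o, v_c)))
    for u_k in negatives:
        loss -= math.log(sigmoid(-dot(u_k, v_c)))
    return loss

counts = {"the": 1000, "bank": 50, "river": 10}  # hypothetical toy counts
noise = noise_distribution(counts)
rng = random.Random(0)
sampled = rng.choices(list(noise), weights=list(noise.values()), k=5)
loss = negative_sampling_loss([1.0, 0.0], [2.0, 0.0], [[-1.0, 0.0], [0.0, 1.0]])
```

Raising counts to the 3/4 power gives rare words such as "river" a larger share of the samples than their raw frequency would.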

My implementation of word2vec can be seen here.

Glove [2014]

Glove (global vectors for word representation) combines the advantages of two major model families: global matrix factorization and local context window methods.

Global matrix factorization utilizes the statistics of the corpus (usually in the form of a word-word co-occurrence matrix) but performs poorly on word analogy tasks. Local context window methods such as Skip-gram and CBOW do better on word analogy tasks but do not make direct use of corpus-wide statistics.

One observation from Glove is that word meaning can be extracted from ratios of co-occurrence probabilities.

For instance in Table 1:

The ratio between P(solid|ice) and P(solid|steam) is much larger than 1, which indicates that solid is closer in meaning to ice than to steam. The ratio between P(gas|ice) and P(gas|steam) is much smaller than 1, which indicates that gas is closer in meaning to steam than to ice.

With this observation, the authors of Glove try to encode co-occurrence probabilities into word embeddings. Equation (1) through Equation (7) illustrate this process. However, some of the equations rely on the authors’ heuristics (supported by experiments in Section 4) rather than strict mathematical proof. Equation (8) is the loss function of Glove:
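A transcription of that loss in the paper’s notation, where the tilde marks the separate context word vectors and biases and V is the vocabulary size:

```latex
J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2
\tag{8}
```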

where w represents a word vector and b is a bias term. Xij is the number of times word j appears in the context of word i. f(x) is a weighting function defined as:
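In LaTeX form, with x_max as the cutoff above which every co-occurrence count gets full weight:

```latex
f(x) =
\begin{cases}
(x / x_{\max})^{\alpha} & \text{if } x < x_{\max} \\
1 & \text{otherwise}
\end{cases}
```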

where alpha is set to 3/4 based on experiments.
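A small sketch of the weighting function and of one term of the loss. The default x_max = 100 is the value used in the paper and is not stated in this post, and the function names are hypothetical.

```python
import math

def glove_weight(x, x_max=100.0, alpha=0.75):
    """f(x) = (x / x_max)^alpha below x_max, and 1 otherwise."""
    return (x / x_max) ** alpha if x < x_max else 1.0

def glove_pair_loss(w_i, w_j, b_i, b_j, x_ij):
    """One term of the sum: f(X_ij) * (w_i . w_j + b_i + b_j - log X_ij)^2."""
    prediction = sum(a * b for a, b in zip(w_i, w_j)) + b_i + b_j
    return glove_weight(x_ij) * (prediction - math.log(x_ij)) ** 2
```

In practice only pairs with a non-zero count are included in the sum, so log(x_ij) is always defined.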

Section 3 of the paper describes its relationship with word2vec. It turns out that the loss function of word2vec can be rewritten in terms of cross entropy, as shown in Equation (13):
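A transcription of that equation, assuming the paper’s notation: X_i is the total count of words appearing in the context of word i, P_ij = X_ij / X_i is the empirical co-occurrence probability, Q_ij is the softmax probability predicted by the model, and H is the cross entropy.

```latex
J = -\sum_{i} X_i \sum_{j} P_{ij} \log Q_{ij} = \sum_{i} X_i \, H(P_i, Q_i)
\tag{13}
```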

The loss function of Glove just replaces cross entropy with least squares.

With the loss function defined in Equation (8), we can train the model and get the word embeddings w. Section 4 of the paper compares the performance of Glove with other models, and Glove outperforms the other models in most of the tasks.


ELMo [2018]

ELMo (embeddings from language models) is a great milestone in the area of word embedding after word2vec. The main contribution of ELMo is that it incorporates contextualized information into the embedding vector.

One drawback of word2vec is in dealing with polysemous words. For instance, it will assign the same vector to both occurrences of “bank” in the sentence “Tom left bank and played on the bank of river”, even though the two words have completely different meanings (despite the same spelling) and should not be represented by the same vector. ELMo instead works as follows:

  • first, it assigns a basic vector to every word. In the above example, it will assign the same base vector to both occurrences of “bank”
  • then it uses a bidirectional LSTM layer to learn another vector representation. According to the paper, this layer learns syntactic features
  • next, it uses another bidirectional LSTM layer to learn one more vector representation. According to the paper, this layer learns semantic features
  • finally, it uses the weighted sum of the basic vector, the hidden state of the first bi-LSTM and the hidden state of the second bi-LSTM to represent each word. Intuitively, the two occurrences of “bank” will now have different vector representations.
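The last step can be sketched as follows. This is a toy illustration, assuming the layer weights are softmax-normalized scores and the result is scaled by a single factor gamma; the vectors are made-up 2-dimensional values and `elmo_vector` is a hypothetical name.

```python
import math

def elmo_vector(layer_vectors, layer_scores, gamma=1.0):
    """gamma * sum_j softmax(layer_scores)_j * layer_vectors[j]."""
    exps = [math.exp(s - max(layer_scores)) for s in layer_scores]
    weights = [e / sum(exps) for e in exps]
    dim = len(layer_vectors[0])
    return [gamma * sum(w * vec[d] for w, vec in zip(weights, layer_vectors))
            for d in range(dim)]

# Hypothetical 2-dimensional layer outputs for one token:
# base vector, first bi-LSTM state, second bi-LSTM state.
layers = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mixed = elmo_vector(layers, layer_scores=[0.0, 0.0, 0.0])
```

With equal scores every layer gets weight 1/3; a downstream task that needs more syntax or more semantics can shift the scores toward the corresponding layer.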

Mathematically, the token representation Rk for the kth token is as follows:
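In the paper’s notation, with L bi-LSTM layers, arrows marking the forward and backward hidden states, and h_{k,0} standing for the basic vector x_k:

```latex
R_k = \{ x_k^{LM}, \overrightarrow{h}_{k,j}^{LM}, \overleftarrow{h}_{k,j}^{LM} \mid j = 1, \dots, L \}
    = \{ h_{k,j}^{LM} \mid j = 0, \dots, L \}
```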

so it is a combination of the basic vector Xk and 2*L (in ELMo, L=2) hidden vectors from the bi-directional LSTM. The basic vector is computed from character embeddings, by applying a CNN over the characters of each token. The ELMo vector for token k is a weighted sum of the above vectors, where the weights are learned during training:
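In the paper’s notation, where the s_j are softmax-normalized layer weights and gamma is a scalar that rescales the whole vector for the task at hand:

```latex
\mathrm{ELMo}_k^{task} = \gamma^{task} \sum_{j=0}^{L} s_j^{task} \, h_{k,j}^{LM}
```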

Since ELMo is based on a language model, during training we can let the model predict the next word and use cross-entropy as the loss function. Pictorially, the training process looks as follows:

where T represents a token, X is the character embedding of each token, and ELMo is the ELMo vector. Softmax computes the probability of each word in the vocabulary, and we train the model to give the next word a high probability so that we can learn the weights of the LSTM, the character embedding, etc. During testing, we fix all the weights and the ELMo vector becomes the word embedding of each token, so we can build other layers on top of the ELMo vectors:

Based on the structure described above, simply adding ELMo gives relative error reductions ranging from 6–20% across six NLP tasks compared with the previous state-of-the-art results.


Even though ELMo incorporates contextualized information into word embedding, it is not able to consider contextual information from left-to-right and right-to-left at the same time. This constraint comes from the language model, whose mechanism is to predict the next word given the previous words. ELMo relies on LSTMs to memorize contextual information during propagation, which is sub-optimal: ideally we only need contextual information from the words around a word, rather than contextual information propagated from start to end and from end to start.

GPT-2 [2019]

The main contribution of GPT-2 is that it demonstrates the possibility for a language model to perform downstream tasks in a zero-shot setting. Zero-shot means there is no need to train or fine-tune the model on the task; the model is just evaluated as a final test. GPT-2 achieves this by using a large number of parameters and training on a large amount of data.

The architecture of GPT-2 is similar to GPT, which is the decoder of Transformer (described in detail in my other post here) without the encoder-decoder attention. One special component of the Transformer decoder is its masked attention: position i can only attend to positions up to and including i, never to later ones. As a side note, the BERT model (described in the next section) mainly uses the encoder piece of Transformer.
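The masked attention pattern can be sketched as a boolean matrix. This is a toy illustration with a hypothetical helper name; in a real model the False entries are turned into large negative attention scores before the softmax.

```python
def causal_mask(seq_len):
    """mask[i][j] is True when position i may attend to position j."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]

mask = causal_mask(4)
for row in mask:
    print(["x" if allowed else "." for allowed in row])
```

The result is lower triangular: the first position sees only itself, the last position sees the whole prefix.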

The largest GPT-2 model, named GPT-2 Extra Large, has 1,542M (about 1.5 billion) parameters. It achieves state-of-the-art results on 7 out of 8 tested language modeling datasets in a zero-shot setting.

Both GPT-2 and ELMo are auto-regressive models (models based on a language model). ELMo is trained on the corpus in both left-to-right and right-to-left directions, whereas GPT-2 is trained only from left to right. GPT-2 has better performance than ELMo due to its Transformer architecture and much larger size.


BERT [2018]

BERT (Bidirectional Encoder Representations from Transformers) is a new milestone after ELMo. It improves the state-of-the-art results of 11 NLP tasks. The main contribution of BERT is to train word embeddings based on denoising auto-encoders rather than a language model, so it is able to incorporate contextual information from both directions at the same time.

The idea of denoising auto-encoders is borrowed from computer vision, where it is used to train a compact representation of images as shown in the following figure:

Given an image, we can use a stack of linear and convolution layers to encode it and then decode it back. The loss function can simply be the sum of squared differences between each pixel of the original image and the decoded image. Then the activations of the middle layer can be used to represent the image. To make the representation generalize better, we can add different kinds of noise to the input, such as Gaussian noise.

The model architecture of BERT is nothing but a larger version of the encoder part of Transformer (described in detail in my other post here). BERT base has L = 12 Transformer blocks, A = 12 self-attention heads and hidden size of dimension H = 768. BERT large has 24 Transformer blocks, 16 self-attention heads and hidden size of dimension 1024.

The input sequence to BERT is represented as the sum of token embedding, segment embedding and position embedding. For token embedding, the input sequence is tokenized and embedded in a special way called WordPiece embedding. In English, WordPiece does not always just split words by spaces; it can be thought of as a special tool that tokenizes a sentence into sub-word pieces and then embeds them. The vocabulary size of WordPiece embedding is 30K, so the shape of the embedding matrix is V * H, where V represents the vocabulary size and H is the hidden size (usually we choose the embedding size to be the same as the hidden size). The segment embedding indicates which of the two input sentences a token belongs to. It is not required if our downstream task only involves one sentence rather than a pair of sentences. Position embedding plays the same role as the one described in Transformer here.

BERT has two procedures: pre-training and fine-tuning. Pre-training has two tasks, masked language model (MLM) and next sentence prediction (NSP). In the following sections, we describe MLM, NSP and fine-tuning in detail.


Masked language model (MLM)

MLM is the most important step in BERT. It takes the idea of denoising auto-encoders described above and applies it to NLP. It randomly masks 15% of the tokens in a given sequence and forces the model to predict those tokens. Masking can be thought of as adding noise to the input, and predicting the actual values of the masked tokens can be thought of as decoding back (recovering) the actual input. Cross entropy loss is used to guide the model to predict the actual tokens. During this process, weights are learned that incorporate contextual information and better represent the masked tokens. Pictorially, we can illustrate the MLM process as:

One more detail mentioned in the paper: because there is no [MASK] token during testing, there is a mismatch between the training and testing data. To mitigate it, among the 15% of tokens chosen for prediction, 80% of the time the actual token is replaced with [MASK]; 10% of the time it is replaced with a random token; and 10% of the time it is left unchanged. The idea behind this is that manually adding some variance can make the model generalize better.
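The 15% selection and the 80/10/10 rule can be sketched as below. This is a simplified illustration rather than BERT’s actual preprocessing: each token is selected independently with probability 0.15, the vocabulary is a toy list, and `mask_tokens` is a hypothetical name.

```python
import random

def mask_tokens(tokens, vocab, rng, mask_prob=0.15):
    """Return (corrupted tokens, labels); labels are None where nothing is predicted."""
    corrupted, labels = [], []
    for token in tokens:
        if rng.random() >= mask_prob:
            corrupted.append(token)   # not selected: keep the token, no loss here
            labels.append(None)
            continue
        labels.append(token)          # selected: the model must recover this token
        roll = rng.random()
        if roll < 0.8:
            corrupted.append("[MASK]")            # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted.append(rng.choice(vocab))   # 10%: replace with a random token
        else:
            corrupted.append(token)               # 10%: keep the original token
    return corrupted, labels

rng = random.Random(0)
vocab = ["tom", "left", "the", "bank", "river"]  # hypothetical toy vocabulary
corrupted, labels = mask_tokens(vocab * 200, vocab, rng)
```

The cross entropy loss is computed only at positions whose label is not None.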


Next sentence prediction (NSP)

This task is to understand the relationship between two sentences, which is important for tasks such as Question Answering and Natural Language Inference. During pre-training, 50% of the time sentence B is the actual next sentence of sentence A, while 50% of the time B is not. A vector C (Figure 1 in the BERT paper) contains features about the sentence relationship and is used to do the next sentence prediction. Notice that NSP is not required for some NLP tasks such as single sentence classification and single sentence tagging.


Fine-tuning

Fine-tuning applies the pre-trained model to different downstream tasks. We simply need to provide task-specific inputs and outputs and fine-tune all the parameters end-to-end. In this procedure, the pre-trained model can be thought of as a black box. It does not output a specific word embedding; instead it gives the features required to do the downstream task, and we just need to add a few layers to use those features. As an example, for question answering on SQuAD 1.1, fine-tuning introduces a start vector S and an end vector E. The span that maximizes the candidate score S * Ti + E * Tj (with i and j representing positions, i≤j) will be our output. For some tasks such as sentiment analysis, very little needs to be added: a classification layer on top of the vector C produces the prediction, and we use the labels in the training set and our designed loss function to fine-tune the model. More detail can be seen in Figure 4 in the original paper:
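The span selection rule for question answering can be sketched as a brute-force search over all pairs i≤j. The token outputs T and the start and end vectors below are made-up toy values, and `best_span` is a hypothetical helper, not part of any BERT library.

```python
def best_span(token_vectors, start_vec, end_vec):
    """Return (i, j) with i <= j maximizing S . T_i + E . T_j."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    start_scores = [dot(start_vec, t) for t in token_vectors]
    end_scores = [dot(end_vec, t) for t in token_vectors]
    candidates = [(start_scores[i] + end_scores[j], i, j)
                  for i in range(len(token_vectors))
                  for j in range(i, len(token_vectors))]
    _, i, j = max(candidates)
    return i, j

T = [[0.0, 0.0], [3.0, 0.0], [0.0, 1.0], [0.0, 4.0]]  # hypothetical token outputs
span = best_span(T, start_vec=[1.0, 0.0], end_vec=[0.0, 1.0])
```

Here token 1 has the highest start score and token 3 the highest end score, so the predicted answer covers tokens 1 through 3.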

Actually applying BERT is even easier: Hugging Face provides easy-to-use APIs.

Above are the main ideas of BERT. Even though BERT performs very well on multiple NLP tasks, it also has limitations. One big limitation is that BERT inherently uses a conditional independence assumption, which assumes

P(all masked tokens | all unmasked tokens) = product(P(every masked token | all unmasked tokens))

This assumption can be bad when dealing with multi-word expressions. For instance, suppose we have two sentences, “New York is a city” and “Los Angeles is a city”, and we mask the tokens “New”, “York”, “Los” and “Angeles”. BERT might predict something like “New Angeles is a city”.

Also, BERT uses many parameters that do not contribute much to the overall model performance. And the NSP task sometimes does not truly learn the relationship between two sentences; instead it learns the topic of each sentence. These limitations are addressed in ALBERT and XLNET.

ALBERT [2019]

ALBERT (A Lite BERT), as the name suggests, is an optimization of BERT. The largest ALBERT model (ALBERT-xxlarge) has around 70% of BERT-large’s parameters but significantly improves over BERT on multiple NLP tasks such as SQuAD, SQuAD2.0 and RACE.

ALBERT makes several improvements on BERT including:

  • factorized embedding parametrization
  • cross-layer parameter sharing
  • Sentence order prediction (SOP)
  • remove dropout in masked language model (MLM)
  • add 10x more data

Factorized embedding parametrization is an optimization of the token embedding. In BERT, the size of the token embedding is O(V*H), where V is the vocabulary size and H is the embedding size (equal to the hidden size). ALBERT reduces the token embedding size to O(V*E + E*H), where E is much smaller than H. The ALBERT authors give two reasons for this modification. One is to decouple the token embedding, which is context independent, from the hidden layer embedding, which includes the context. The other is that the token embedding is sparsely updated, so it does not need a high-dimensional representation.

Table 3 of the paper shows the result on different values of E:

ALBERT uses E = 128 and BERT uses E = 768, so comparing the results in the second and fourth lines, the total number of parameters drops by 19M while the performance drop is less than 1%.
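The 19M figure can be checked with a quick back-of-the-envelope count, assuming a vocabulary of V = 30,000 (the 30K WordPiece vocabulary mentioned earlier), H = 768 and E = 128. The helper name is hypothetical.

```python
def embedding_params(vocab_size, hidden_size, embedding_size=None):
    """V*H without factorization, V*E + E*H with it."""
    if embedding_size is None:
        return vocab_size * hidden_size
    return vocab_size * embedding_size + embedding_size * hidden_size

bert_style = embedding_params(30000, 768)         # 23,040,000
albert_style = embedding_params(30000, 768, 128)  # 3,938,304
saved = bert_style - albert_style                 # 19,101,696
```

The saving of roughly 19.1M parameters comes almost entirely from shrinking the V-sized matrix; the extra E*H projection costs under 0.1M.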

Cross-layer parameter sharing is an optimization of the Transformer block. As the name indicates, the ALBERT authors evaluate model performance depending on whether the weights of the feed-forward network and the attention network in the Transformer block are shared across layers.

The result is shown in Table 4 of the paper:

Comparing line 1 with line 4, and line 5 with line 8, sharing all the weights reduces the parameter count by 77M, with a performance drop of around 2%.

Combining factorized embedding parametrization and cross-layer parameter sharing, the total number of parameters drops by over 88% while the model performance drops by less than 3% (comparing line 4 and line 5).

The above two improvements contribute most to the parameter reduction. The next improvement, sentence order prediction (SOP), mainly addresses the problem of the next sentence prediction (NSP) task in BERT. Studies have shown that NSP conflates topic prediction and sentence coherence prediction in a single task. Even though the BERT authors wanted the model to learn sentence coherence, it turns out that topic prediction is easier to learn. SOP, in ALBERT, is designed to predict the order of sentences. Positive samples are the same as the positive samples in NSP, while negative samples simply switch the order of the two sentences. Table 5 compares NSP and SOP:

See line 2, column 4: a model trained with NSP does poorly (nearly random guessing) on SOP. However, a model trained with SOP can solve the NSP task to a reasonable degree. Comparing model performance on downstream tasks, SOP is also better.
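How SOP training pairs could be built from two consecutive segments is sketched below. The 50/50 split between positive and negative samples and the function name are assumptions made for the illustration.

```python
import random

def make_sop_example(first, second, rng):
    """Return ((segment_a, segment_b), label); label 1 means original order."""
    if rng.random() < 0.5:
        return (first, second), 1   # positive: segments in their original order
    return (second, first), 0       # negative: same segments, order swapped

rng = random.Random(0)
examples = [make_sop_example("He opened the door.", "Then he walked in.", rng)
            for _ in range(100)]
```

Because both segments come from the same document in either case, topic alone cannot separate positives from negatives, which is the point of the task.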

One other improvement is to remove dropout in the masked language model (MLM) task. Table 8 shows the result:

Removing dropout in MLM is better for downstream tasks. Besides, removing dropout significantly reduces memory usage during training.

One last improvement in ALBERT is that it adds 10x more training data. Table 7 in the paper shows a slight improvement on model performance.

The final step is to scale up the width of the network: since each building block is smaller, we can increase the hidden size. Table 1 shows all the settings:

As can be seen, the hidden size of ALBERT-xxlarge (last line) is 4 times that of BERT-large. In terms of the number of layers, ALBERT has the same performance with L = 12 and L = 24, so L = 12 is used to avoid unnecessary complexity.

The performance of each setting on several NLP tasks is shown in Table 2:

Comparing line 2 and line 6, the number of parameters is reduced by 30%, the performance improvement is around 3.5%, and training is about 3 times slower.

Above are the main ideas of ALBERT. Generally speaking, it is an optimization of BERT. Factorized embedding parametrization, cross-layer parameter sharing and SOP have the most impact, while removing dropout in MLM and adding 10x more data further increase model performance.

XLNET [2019]

XLNET is also an optimization of BERT. It combines the advantages of auto-regressive and auto-encoding models and outperforms BERT on 20 NLP tasks.

The pros of auto-regressive models (models based on a language model, such as ELMo) include:

  • no discrepancy between training and test data
  • no independence assumption

The con of auto-regressive models is:

  • the model cannot be trained to include bi-directional context at the same time

For auto-encoding models, the advantage is:

  • the model can be pre-trained to better capture bi-directional context

The drawbacks are:

  • the input contains noise such as [MASK], which will never appear in the test dataset
  • masked tokens are assumed to be conditionally independent given the unmasked tokens

XLNET is designed to refine the auto-regressive model. It mitigates the drawback of auto-regressive models by introducing a permutation language model. The idea of the permutation language model is simple: instead of considering only the original order of the input sequence, it considers all possible factorization orders.

As an example, the input sequence “New York is a city” can be expanded to “York is New a city”, “New a city is York”, “York a is city New” etc. In this case, if the model parameters are shared across all permutations, then in expectation the model will learn to gather information from all positions on both sides. However, if the sequence length is T, there are T! permutations, which is too many to train on. Therefore, XLNET only samples a portion of all the permutations.
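To get a feel for the numbers, the following sketch counts the factorization orders of the five-token example and samples one of them. It uses 1-based token positions, and `sample_factorization_order` is a hypothetical helper.

```python
import math
import random

def sample_factorization_order(seq_len, rng):
    """Sample one of the seq_len! possible factorization orders."""
    order = list(range(1, seq_len + 1))
    rng.shuffle(order)
    return order

tokens = ["New", "York", "is", "a", "city"]
total_orders = math.factorial(len(tokens))  # 120 orders for 5 tokens
order = sample_factorization_order(len(tokens), random.Random(0))
```

Five tokens already give 120 orders, and a 512-token sequence gives an astronomically large number, which is why sampling is needed.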

Mathematically, the objective of permutation language model is shown in Equation (3) of the paper:
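A transcription in the paper’s notation, where Z_T is the set of all permutations of the index sequence 1..T, z_t is the t-th element of a permutation z, and z_{<t} denotes its first t-1 elements:

```latex
\max_{\theta} \; \mathbb{E}_{z \sim \mathcal{Z}_T} \left[ \sum_{t=1}^{T} \log p_{\theta}(x_{z_t} \mid x_{z_{<t}}) \right]
\tag{3}
```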

attention mask

In the actual implementation, XLNET does not change the original sequence order; instead it introduces an attention mask in self-attention to realize the permutation order. The idea is that if the permutation is 3–2–4–1, then the attention mask will look like:

as shown in Figure 1 of the paper. So token 1 can see tokens 1, 2, 3, 4 (labeled in red), token 2 can only see tokens 2 and 3, etc.
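The mask for the permutation 3–2–4–1 can be derived mechanically: a token may attend to itself and to every token that comes before it in the permutation. The sketch below does this for the content mask only, with a hypothetical helper name and 1-based token numbers.

```python
def content_mask(order):
    """mask[i][j] is True when token i+1 may attend to token j+1."""
    rank = {token: position for position, token in enumerate(order)}
    n = len(order)
    return [[rank[j] <= rank[i] for j in range(1, n + 1)] for i in range(1, n + 1)]

mask = content_mask([3, 2, 4, 1])
visible = {i + 1: [j + 1 for j, ok in enumerate(row) if ok]
           for i, row in enumerate(mask)}
```

Replacing `<=` with `<` would give the stricter variant in which a token cannot see its own content.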

partial prediction

Another implementation detail is that, instead of predicting every token one by one in a sampled permutation, we only predict the last 1/K of the tokens (K is a hyper-parameter). This greatly reduces the optimization difficulty. Mathematically, our new objective is shown in Equation (5):
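A transcription in the paper’s notation, where c is the cutting point that splits a permutation z into a non-target part z_{≤c} and a target part z_{>c}:

```latex
\max_{\theta} \; \mathbb{E}_{z \sim \mathcal{Z}_T} \left[ \log p_{\theta}(x_{z_{>c}} \mid x_{z_{\le c}}) \right]
= \mathbb{E}_{z \sim \mathcal{Z}_T} \left[ \sum_{t=c+1}^{|z|} \log p_{\theta}(x_{z_t} \mid x_{z_{<t}}) \right]
\tag{5}
```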

Two stream self attention

Appendix A.1 illustrates why the standard language model parameterization fails to make the model learn useful information. The main idea is that, with the same parameters theta, two permutations that share the same preceding context but have different target positions would produce exactly the same prediction. To solve this problem, the position of the target token needs to be included, as shown in Equation (4):
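A transcription in the paper’s notation, where e(x) is the embedding of token x and g_theta is a representation that takes the target position z_t as an additional input:

```latex
p_{\theta}(X_{z_t} = x \mid x_{z_{<t}}) =
\frac{\exp\left( e(x)^{\top} g_{\theta}(x_{z_{<t}}, z_t) \right)}
     {\sum_{x'} \exp\left( e(x')^{\top} g_{\theta}(x_{z_{<t}}, z_t) \right)}
\tag{4}
```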

Even though this target-aware representation removes the ambiguity in target prediction, it introduces a new problem. For instance, take the permutation 3–2–4–1. In order to predict token 2, we need the context information of token 3 and the position information of token 2; in order to predict token 1, we need the context information of tokens 3, 2, 4 and the position information of token 1. Therefore, in the first case the representation of token 2 must not contain its own content, but in the second case it must contain its content, which is a contradiction. To solve this problem, XLNET uses two sets of hidden representations:

  • content representation (denoted as h) which serves a similar role to the standard hidden states in Transformer
  • query representation (denoted as g) which has context information of tokens in front of the target and the position information of target

With this setting, self attention can be formulated as:
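A transcription of the two update rules for layer m in the paper’s notation. The only difference between them is the key-value set: the query stream g attends to z_{<t}, while the content stream h attends to z_{≤t}.

```latex
\begin{aligned}
g_{z_t}^{(m)} &\leftarrow \mathrm{Attention}\left(Q = g_{z_t}^{(m-1)},\; KV = h_{z_{<t}}^{(m-1)};\; \theta\right) \\
h_{z_t}^{(m)} &\leftarrow \mathrm{Attention}\left(Q = h_{z_t}^{(m-1)},\; KV = h_{z_{\le t}}^{(m-1)};\; \theta\right)
\end{aligned}
```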

Notice the red part: when we update g, the content of the target token Xzt is not given, but Xzt is given while updating h.

With the knowledge of content and query representation, it is easy to understand the architecture of permutation language model shown in Figure 1:

Notice that hi is initialized to the embedding of token i and gi is initialized to a trainable vector w.

Transformer XL

XLNET also uses Transformer XL rather than the standard Transformer. It is not required, but it gives XLNET better performance. Transformer XL is an optimization of Transformer. The main idea is to split a long input sequence into smaller segments, which reduces the memory usage and improves the speed of the self attention step in the standard Transformer. Each Transformer block works on one segment, and the segments are connected to each other in a way similar to an RNN. A key modification is removing the standard positional encoding and including relative position encoding in multi-head attention.

With all the modifications described above, XLNET can be used similarly to ELMo on various downstream tasks.

In summary, XLNET uses a permutation language model to mitigate the drawback of auto-regressive models. It then finds that the standard language model parameterization does not work with sequence permutations, so it re-parameterizes with target positions. After that, it runs into a contradiction between a token needing and not needing its own content information, so it proposes two-stream self attention. XLNET is complex but powerful: it outperforms BERT on 20 NLP tasks.