A Study on CoVe, Context2Vec, ELMo, ULMFiT and BERT
Note: This post was originally published on AH’s Blog (WordPress) on July 1, 2019, and has been migrated here.
A research study on the models that revolutionized NLP through Transfer Learning — covering architecture, key ideas, and personal notes from implementation experience.
Key Terminology
Vector Space Models (VSMs): Words represented as vectors in a continuous space, fed to downstream ML models.
Word Embedding: Fixed-size vectors where semantically similar words have small Euclidean distance. Foundation for Language Modeling and Machine Translation.
Sentence Embedding: Same idea applied to full sentences.
Language Model: Models a statistical distribution over sentences to predict the next word given context.
Transfer Learning: Store knowledge learned on one task; reuse and optionally fine-tune it for another task.
Multi-Task Learning: Train simultaneously on multiple subtasks; the shared representation captures generalizable knowledge.
Domain Adaptation: A Transfer Learning subfield — adapt a model trained on a source distribution to perform well on a different target distribution.
Context Vectors (CoVe)
Paper: arxiv.org/pdf/1708.00107.pdf
CoVe vectors are learned on top of existing word vectors (GloVe, Word2Vec, FastText) using the encoder of a Neural Machine Translation (NMT) seq2seq model trained on English→German translation. To translate well, the encoder must learn complex semantic relations between words, which makes its hidden representations richer than static embeddings.
Usage:
CoVe = MT-LSTM(GloVe(sentence))
Inspired by the success of pre-trained CNNs on ImageNet, CoVe applies the same transfer idea to NLP: train on a large task (NMT), then use the encoder as an initialization layer for downstream tasks.
The paper introduced Bi-attentive Classification Network (BCN) to validate CoVe quality on tasks like Sentiment Analysis and Paraphrase Detection. BCN accepts two inputs (or duplicates one), passes them through the MT-LSTM encoder, then uses a Bi-LSTM + bi-attention architecture ending in a maxout classifier.
Personal notes:
- You don’t need BCN — just prepend the frozen (or fine-tuned) encoder to your own model.
- Fine-tuning is generally better than freezing to allow slight task-specific adaptation.
- Use FastText over GloVe when character-level distinctions matter (e.g., named entities).
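The usage equation above boils down to concatenating each word's static vector with the encoder's contextual output, [GloVe(w); CoVe(w)], before feeding a downstream model. A minimal NumPy sketch, with a random projection standing in for the trained MT-LSTM encoder (all names here are illustrative, not from the paper's code):

```python
import numpy as np

def cove_features(glove_vecs, encoder):
    """Run static embeddings through a (pre-trained) encoder and
    concatenate the result with the originals: [GloVe(w); CoVe(w)]."""
    cove_vecs = encoder(glove_vecs)               # (seq_len, hidden_dim)
    return np.concatenate([glove_vecs, cove_vecs], axis=-1)

# Toy stand-in for the trained MT-LSTM: a fixed random projection.
rng = np.random.default_rng(0)
W = rng.standard_normal((300, 600))
toy_encoder = lambda x: np.tanh(x @ W)

sentence = rng.standard_normal((5, 300))          # 5 tokens, 300-d GloVe
feats = cove_features(sentence, toy_encoder)
print(feats.shape)                                # (5, 900)
```

In practice the encoder would be the frozen or fine-tuned MT-LSTM from the CoVe paper; only the concatenation pattern is the point here.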
Context to Embeddings (Context2Vec)
Paper: aclweb.org/anthology/K16-1006
Consider the sentence “I can’t find April.” Without context, “April” could be a month or a person. Context2Vec extends CBOW Word2Vec by replacing the simple average-of-context-vectors with a richer parametric model — a Bi-LSTM + feedforward network.
Three-stage architecture:
- Bi-LSTM processes left-to-right and right-to-left context.
- Feedforward network learns from the concatenated Bi-LSTM hidden states.
- Objective function (with Word2Vec negative sampling) compares output to target word embedding.
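The three stages can be sketched in NumPy, with random vectors standing in for the Bi-LSTM hidden states and embeddings (the names and shapes are illustrative, not from the paper's code):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(context_vec, target_vec, negative_vecs):
    """Word2Vec-style negative sampling: pull the context representation
    toward the true target embedding, push it away from k sampled negatives."""
    pos = -np.log(sigmoid(context_vec @ target_vec))
    neg = -np.sum(np.log(sigmoid(-negative_vecs @ context_vec)))
    return pos + neg

rng = np.random.default_rng(1)
d = 8
# Stage 1: left-to-right and right-to-left Bi-LSTM states (random stand-ins).
left_state, right_state = rng.standard_normal(d), rng.standard_normal(d)
# Stage 2: feedforward layer over the concatenated states.
W = rng.standard_normal((2 * d, d))
context = np.tanh(np.concatenate([left_state, right_state]) @ W)
# Stage 3: compare to the target word embedding via negative sampling.
target = rng.standard_normal(d)
negatives = rng.standard_normal((5, d))
loss = neg_sampling_loss(context, target, negatives)
print(round(float(loss), 3))
```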
Personal note: Similar to Doc2Vec, but uses Bi-LSTM instead of a plain projection layer for deeper contextual representation.
Embeddings from Language Models (ELMo)
Paper: arxiv.org/pdf/1802.05365.pdf
ELMo addresses the same polysemy problem (a word’s meaning depends on context) by learning embeddings from a Bi-directional Language Model (BiLM):
- Forward LM: predict a word given the preceding words: P(word | left context)
- Backward LM: predict a word given the following words: P(word | right context)

Each word’s final ELMo representation is a weighted element-wise sum over the BiLM’s layers:
- Context-independent token embedding (a character-level CNN in the original paper; static vectors like GloVe/Word2Vec/FastText are an alternative)
- Hidden states of each forward LSTM layer
- Hidden states of each backward LSTM layer
The per-layer weights (and a global scaling factor) can be task-specific, learned during fine-tuning.
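The weighted sum follows Eq. 1 of the paper: softmax-normalised per-layer weights and a global scale γ. A toy NumPy sketch with random layer states:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def elmo_representation(layer_states, s_weights, gamma=1.0):
    """Task-specific ELMo vector: softmax-weighted sum of the BiLM's
    layer outputs for one token, scaled by gamma."""
    w = softmax(s_weights)                     # one weight per layer
    return gamma * np.tensordot(w, layer_states, axes=1)

# Toy example: 3 layers (token layer + 2 biLSTM layers), 4-d states.
rng = np.random.default_rng(0)
layers = rng.standard_normal((3, 4))
vec = elmo_representation(layers, s_weights=np.zeros(3))
print(vec.shape)                               # (4,)
```

With all-zero weights the softmax is uniform, so the result is simply the mean of the layer states; fine-tuning learns which layers matter for the task.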
Usage steps:
- Train BiLM on a large corpus.
- Freeze the BiLM encoders and attach them at the bottom of your model.
- Replace raw word indices with their ELMo representations.

Personal notes:
- Train the LM on domain-specific data for best downstream results.
- Deeper models or CNN character features can improve the LM quality.
Universal Language Model Fine-tuning (ULMFiT)
Paper: arxiv.org/pdf/1801.06146.pdf
ULMFiT’s goal: one universal language model that can be fine-tuned for any classification task. The base model is AWD-LSTM — a heavily regularized LSTM targeting generalization on long sequences.
AWD-LSTM regularization techniques:
- DropConnect Mask: Randomly zeroes weight connections (not activations).
- Variational Dropout: Same dropout mask applied at every time step within a sequence.
- ASGD (Average SGD): Averages weights over multiple steps for more stable convergence.
- Variable Length BPTT: Randomizes truncation length during training.
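Of these, variational dropout is the easiest to show concretely: one mask is sampled per sequence and reused at every time step, rather than resampled per step. A NumPy sketch (not the AWD-LSTM code itself):

```python
import numpy as np

def variational_dropout(x, p, rng):
    """Variational (locked) dropout: sample ONE mask per sequence and
    broadcast it across the time dimension, with inverted scaling."""
    seq_len, batch, dim = x.shape
    mask = (rng.random((1, batch, dim)) > p) / (1.0 - p)   # shape (1, B, D)
    return x * mask                                         # broadcast over time

rng = np.random.default_rng(0)
x = np.ones((10, 2, 4))                # (time, batch, features)
y = variational_dropout(x, p=0.5, rng=rng)
# Every time step sees the same mask:
print(np.array_equal(y[0], y[5]))
```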
ULMFiT introduces two fine-tuning innovations:
Discriminative Fine-tuning (Discr): Different layers use different learning rates, since lower layers capture more general features (should change slowly) while upper layers capture task-specific features (can change faster).
Slanted Triangular Learning Rates (STLR): The learning rate increases quickly then decreases slowly — a specific schedule designed for fine-tuning pre-trained models.
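The STLR schedule is given in closed form in the paper (Eq. 3), with defaults cut_frac = 0.1 and ratio = 32; a direct Python transcription:

```python
import math

def stlr(t, T, cut_frac=0.1, ratio=32, lr_max=0.01):
    """Slanted triangular learning rate: rise linearly for the first
    cut_frac of the T training steps, then decay linearly, so that
    lr_max / ratio is the smallest learning rate reached."""
    cut = math.floor(T * cut_frac)
    if t < cut:
        p = t / cut
    else:
        p = 1 - (t - cut) / (cut * (1 / cut_frac - 1))
    return lr_max * (1 + p * (ratio - 1)) / ratio

# The peak lr_max is hit at t = cut; early and late steps are much smaller.
T = 100
lrs = [stlr(t, T) for t in range(T)]
print(max(lrs) == stlr(10, T))
```

For discriminative fine-tuning the paper additionally scales this per layer, using lr for the top layer and dividing by 2.6 for each layer below it.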
BERT
Paper: arxiv.org/abs/1810.04805
BERT (Bidirectional Encoder Representations from Transformers) is a pre-trained deep Transformer encoder that redefined the state of the art across NLP benchmarks. Unlike the models above which use LSTMs, BERT uses a multi-layer Transformer architecture with self-attention.
Two novel pre-training objectives:
1. Masked Language Model (MLM): Randomly mask 15% of tokens in the input; train the model to predict those masked tokens. This allows truly bidirectional context — both left and right — unlike unidirectional LMs.
2. Next Sentence Prediction (NSP): Given two sentences, predict whether sentence B actually follows sentence A in the original document. This captures inter-sentence relationships useful for QA and inference tasks.
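The MLM corruption step has one extra detail from the paper: of the selected 15% of tokens, only 80% become [MASK]; 10% are replaced by a random token and 10% are left unchanged, so the model cannot rely on [MASK] always marking a prediction site. A sketch with a toy vocabulary (illustrative code, not the released BERT implementation):

```python
import random

MASK = "[MASK]"
VOCAB = ["the", "cat", "sat", "mat", "dog", "ran"]   # toy vocabulary

def mlm_mask(tokens, rng, mask_prob=0.15):
    """Select ~15% of positions as prediction targets; corrupt them
    with the 80/10/10 [MASK]/random/unchanged scheme."""
    corrupted, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            targets[i] = tok                  # model must predict the original
            r = rng.random()
            if r < 0.8:
                corrupted[i] = MASK
            elif r < 0.9:
                corrupted[i] = rng.choice(VOCAB)
            # else: keep the original token (10% of targets)
    return corrupted, targets

rng = random.Random(0)
toks = ["the", "cat", "sat", "on", "the", "mat"] * 20
corr, tgt = mlm_mask(toks, rng)
print(len(tgt) / len(toks))                   # roughly 0.15
```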
Two model sizes:
- BERT-Base: 12 Transformer layers, 768 hidden units, 12 attention heads (110M parameters)
- BERT-Large: 24 layers, 1024 hidden units, 16 attention heads (340M parameters)
Fine-tuning: Add a task-specific output layer on top of BERT and fine-tune end-to-end. BERT achieved state-of-the-art on 11 NLP tasks including SQuAD, MNLI, and CoLA at time of publication.
Summary
| Model | Core Idea | Architecture | Key Innovation |
|---|---|---|---|
| CoVe | NMT encoder as feature extractor | Bi-LSTM | Transfer from MT task |
| Context2Vec | BiLM-style context modeling | Bi-LSTM + FF | Richer CBOW context |
| ELMo | Contextual word embeddings from BiLM | Stacked Bi-LSTM | Per-layer weighted sum |
| ULMFiT | Universal LM fine-tuning | AWD-LSTM | Discr LR + STLR |
| BERT | Masked LM + NSP pre-training | Transformer | True bidirectionality via masking |
The trajectory is clear: from static word vectors → context-dependent LSTMs → attention-based Transformers. Each step brought deeper, more context-aware representations that better model language semantics.
