The Story That Explains ELMo
Suddenly the dictionary definition is useless. The student is confused. A smart human listener, however, instantly switches context and understands that "bank" now means the edge of a river, not a place to deposit money.
Before 2018, word embedding models like Word2Vec gave every word one fixed vector. "Bank" always had the same vector, regardless of sentence. ELMo changed that: it reads the entire sentence and gives each word a different embedding depending on the context it sits in. Same spelling, different vector. Just like a human.
ELMo stands for Embeddings from Language Models. Published by researchers at the Allen Institute for AI in 2018 (Peters et al.), it was one of the first widely adopted models to produce deep contextualised word representations, meaning the embedding of a word changes based on its surrounding text. It delivered state-of-the-art results on six NLP benchmarks in a single paper and planted the seed for BERT, GPT, and modern LLMs.
Traditional embeddings (Word2Vec, GloVe) assign one fixed vector per word, decided at training time and frozen forever. ELMo assigns a different vector every time — computed on the fly, shaped by the surrounding sentence. The same model, the same weights, but a different output for every context. That is the revolution of contextualised embeddings.
Why Traditional Embeddings Failed
To appreciate ELMo, you must first feel the pain of what came before it. Let us trace the history of word representations in NLP.
Word2Vec and GloVe produce type-level embeddings — one vector per word type. But language has token-level meaning — the same word in two sentences can mean completely different things. Polysemy (multiple meanings) and syntax-driven meaning changes are invisible to static models. Feeding an average of both senses into a classifier is like feeding it noise. ELMo solves this by reading context.
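To make the type-level versus token-level distinction concrete, here is a minimal, dependency-free sketch. The tiny vocabulary, the 3-dimensional vectors, and the `toy_contextual` function (which simply averages a word's vector with its neighbours) are all invented for illustration. This is not how ELMo computes context; it only shows that a lookup table cannot separate the two uses of "bank", while any context-dependent function can.

```python
# Toy static embedding table: one vector per word type (values are made up).
STATIC = {
    "river": [0.0, 1.0, 0.0],
    "bank":  [0.5, 0.5, 0.0],
    "money": [1.0, 0.0, 0.0],
    "the":   [0.0, 0.0, 1.0],
}

def static_embed(tokens, index):
    """Type-level lookup: the rest of the sentence is ignored entirely."""
    return STATIC[tokens[index]]

def toy_contextual(tokens, index, window=1):
    """Token-level stand-in: average the word's vector with its neighbours."""
    lo, hi = max(0, index - window), min(len(tokens), index + window + 1)
    vectors = [STATIC[t] for t in tokens[lo:hi]]
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]

sent_a = ["the", "river", "bank"]
sent_b = ["the", "money", "bank"]

same_static = static_embed(sent_a, 2) == static_embed(sent_b, 2)
same_contextual = toy_contextual(sent_a, 2) == toy_contextual(sent_b, 2)
print(same_static, same_contextual)  # True False
```

The static lookup returns the same vector for "bank" in both sentences, while even this crude neighbour average already yields two different vectors.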
What ELMo Is — The Architecture
ELMo is built on a two-layer bidirectional LSTM language model (biLM) pre-trained on a large text corpus (the 1 Billion Word Benchmark). Let us break every piece of that sentence down.
Earlier transfer learning in NLP just used the top layer of a pre-trained model. ELMo discovered that different layers encode different types of linguistic knowledge. The weighted combination lets the downstream task decide which layer is most useful. A POS tagger will learn high weight on Layer 1 (syntax); a word sense disambiguator will favour Layer 2 (semantics). This is learned automatically, not hard-coded.
The ELMo Formula — In Plain English and Math
Let us nail down exactly what ELMo computes. For a sentence of length N, for each token position k, ELMo produces three internal representations:
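In the notation of Peters et al. (2018), with two biLSTM layers, the three representations at position k and their task-specific combination can be written as follows. The symbols are the paper's, restated here as a reference rather than derived: x_k is the context-independent token embedding from the character CNN, the arrows mark the forward and backward LSTM hidden states, s_j are softmax-normalised layer weights, and γ is a scalar scale.

```latex
\begin{aligned}
h_{k,0}^{LM} &= x_k^{LM}
  && \text{(layer 0: character-CNN token embedding)} \\
h_{k,1}^{LM} &= \left[\overrightarrow{h}_{k,1}^{LM};\ \overleftarrow{h}_{k,1}^{LM}\right]
  && \text{(layer 1: first biLSTM layer)} \\
h_{k,2}^{LM} &= \left[\overrightarrow{h}_{k,2}^{LM};\ \overleftarrow{h}_{k,2}^{LM}\right]
  && \text{(layer 2: second biLSTM layer)} \\
\mathrm{ELMo}_k^{task} &= \gamma^{task} \sum_{j=0}^{2} s_j^{task}\, h_{k,j}^{LM}
\end{aligned}
```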
The biLM's millions of parameters are trained once on a huge corpus and then never changed again. When you fine-tune for a downstream task, you add a task-specific layer on top and only learn the weights s₀, s₁, s₂ and γ (and the task head). This is the essence of transfer learning — expensive compute once, cheap specialisation forever.
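The learned combination of s₀, s₁, s₂ and γ is small enough to write out by hand. The sketch below is a dependency-free illustration for a single token position, assuming the raw layer scores are passed through a softmax before mixing; the function names and the 4-dimensional vectors are made up for the example and are not AllenNLP's API.

```python
import math

def softmax(scores):
    """Normalise raw scores so they are positive and sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def scalar_mix(layers, raw_s, gamma):
    """Weighted sum of per-layer vectors for one token: gamma * sum_j s_j * h_j."""
    weights = softmax(raw_s)
    dim = len(layers[0])
    return [gamma * sum(w * h[d] for w, h in zip(weights, layers))
            for d in range(dim)]

# Three made-up 4-dimensional layer outputs for a single token position.
h0 = [1.0, 0.0, 0.0, 0.0]  # layer 0: character-CNN embedding
h1 = [0.0, 1.0, 0.0, 0.0]  # layer 1: first biLSTM layer
h2 = [0.0, 0.0, 1.0, 0.0]  # layer 2: second biLSTM layer

# Equal raw scores give each layer weight 1/3; gamma rescales the result.
mixed = scalar_mix([h0, h1, h2], raw_s=[0.0, 0.0, 0.0], gamma=3.0)
print(mixed)
```

During downstream training only `raw_s` and `gamma` (plus the task head) would receive gradient updates; the layer vectors themselves come from the frozen biLM.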
Contextualisation in Action — A Diagram
🐕 Sentence A: "The dog let out a sharp bark."
🌳 Sentence B: "The bark of the ancient oak was rough."
Word2Vec assigns the same vector to "bark" in both. ELMo assigns completely different vectors — because the bidirectional LSTM in Sentence A sees "dog", "sharp", and reads "bark" in that context; in Sentence B it sees "ancient oak", "rough" in both directions and produces a vector that sits near "tree", "wood", "trunk" in embedding space. Same spelling. Totally different mathematical identity.
Word2Vec (static):

| Word | Vector (simplified) | Problem |
|---|---|---|
| "bark" (dog) | [0.23, -0.41, 0.87, ...] | Identical vectors! The model cannot tell these apart. Downstream classifiers receive the same signal for two opposite meanings. |
| "bark" (tree) | [0.23, -0.41, 0.87, ...] | Same vector as the row above. |

ELMo (contextual):

| Word | Vector (simplified) | Result |
|---|---|---|
| "bark" (dog) | [0.91, 0.12, -0.34, ...] | Near "growl", "howl", "woof". Downstream model correctly understands the animal context. |
| "bark" (tree) | [-0.18, 0.76, 0.55, ...] | Near "trunk", "wood", "oak". Entirely different region of semantic space. |
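The difference between the two tables can be quantified with cosine similarity. The sketch below uses only the three simplified components shown in the tables (the real vectors have 1024 dimensions), so the numbers are illustrative rather than actual model output.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Simplified vectors copied from the tables above.
static_dog, static_tree = [0.23, -0.41, 0.87], [0.23, -0.41, 0.87]
elmo_dog, elmo_tree = [0.91, 0.12, -0.34], [-0.18, 0.76, 0.55]

static_sim = cosine_similarity(static_dog, static_tree)
contextual_sim = cosine_similarity(elmo_dog, elmo_tree)
print(f"static: {static_sim:.3f}  contextual: {contextual_sim:.3f}")
```

The static pair scores 1.0 (the vectors are identical), while the contextual pair comes out at about -0.28, which is the kind of separation a downstream classifier can exploit.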
What Each Layer Learns — Empirical Evidence
One of the most important findings in the original ELMo paper (Peters et al., 2018) was a systematic study of what information each layer captures. The results were striking:
The authors evaluated each layer in isolation, using deliberately simple models on top of the frozen hidden states (for example, a linear classifier for POS tagging), to measure what information each layer contains. Layer 1 scored best on POS tagging. Layer 2 scored best on Word Sense Disambiguation. This supports a rough hierarchy of linguistic knowledge: syntax is captured lower in the network, and context-dependent semantics higher up.
How to Use ELMo — Installation and Setup
The original ELMo is available through the allennlp library and as a TensorFlow Hub module. ELMo is not one of the architectures shipped in the Hugging Face transformers library, so these two routes remain the practical ones. We will cover both paths.
Path 1 — ELMo via AllenNLP (Original)
The AllenNLP library provides the original ELMo implementation from the authors. It gives you direct access to all three layer representations.
# Install the library
# pip install allennlp
from allennlp.modules.elmo import Elmo, batch_to_ids
# ── 1. Load pre-trained ELMo weights ──────────────────────
# AllenNLP will download these automatically if not present
options_file = (
"https://allennlp.s3.amazonaws.com/models/elmo/"
"2x4096_512_2048cnn_2xhighway/"
"elmo_2x4096_512_2048cnn_2xhighway_options.json"
)
weight_file = (
"https://allennlp.s3.amazonaws.com/models/elmo/"
"2x4096_512_2048cnn_2xhighway/"
"elmo_2x4096_512_2048cnn_2xhighway_weights.hdf5"
)
# num_output_representations: how many weighted sums to produce
# scalar_mix_parameters=None means the weights are learnable
elmo = Elmo(
options_file,
weight_file,
num_output_representations=1, # 1 for most tasks
dropout=0.0
)
# ── 2. Prepare input ──────────────────────────────────────
sentences = [
["The", "dog", "let", "out", "a", "sharp", "bark"],
["The", "bark", "of", "the", "ancient", "oak", "was", "rough"]
]
# batch_to_ids converts tokens → character id tensors
character_ids = batch_to_ids(sentences)
print(f"Character ID tensor shape: {character_ids.shape}")
# → torch.Size([2, 8, 50]) — batch x max_len x max_chars
# ── 3. Get ELMo representations ───────────────────────────
embeddings = elmo(character_ids)
# elmo_representations: list of tensors, one per num_output_representations
elmo_repr = embeddings['elmo_representations'][0]
mask = embeddings['mask']
print(f"ELMo repr shape: {elmo_repr.shape}")
# → torch.Size([2, 8, 1024]) — batch x seq_len x embedding_dim
# ── 4. Inspect "bark" in each sentence ────────────────────
bark_dog_sentence = elmo_repr[0, 6, :] # "bark" at position 6 in sentence 0
bark_tree_sentence = elmo_repr[1, 1, :] # "bark" at position 1 in sentence 1
import torch
cosine_sim = torch.nn.functional.cosine_similarity(
bark_dog_sentence.unsqueeze(0),
bark_tree_sentence.unsqueeze(0)
)
print(f"Cosine similarity between 'bark' (dog) and 'bark' (tree): {cosine_sim.item():.4f}")
ELMo's output dimension is 1024 for the large model (512 for each direction × 2 directions). Every token in every sentence gets its own 1024-dimensional vector. The mask tensor tells you which positions are real tokens vs padding; always use it when computing sentence-level representations (averaging, attention, pooling).
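Why the mask matters is easiest to see on a tiny example. The sketch below is dependency-free and uses made-up 2-dimensional vectors, with the padded rows set to zero purely for illustration; it compares masked mean pooling with a naive mean over the padded length.

```python
def masked_mean(token_vectors, mask):
    """Average only the positions where mask is 1 (real tokens)."""
    real = [vec for vec, keep in zip(token_vectors, mask) if keep]
    dim = len(token_vectors[0])
    return [sum(vec[d] for vec in real) / len(real) for d in range(dim)]

def naive_mean(token_vectors):
    """Average over every position, padding included (the mistake to avoid)."""
    dim = len(token_vectors[0])
    return [sum(vec[d] for vec in token_vectors) / len(token_vectors)
            for d in range(dim)]

# A 2-token sentence padded to length 4.
vectors = [[2.0, 4.0], [4.0, 8.0], [0.0, 0.0], [0.0, 0.0]]
mask = [1, 1, 0, 0]

print(masked_mean(vectors, mask))  # [3.0, 6.0]
print(naive_mean(vectors))         # [1.5, 3.0]
```

The naive mean shrinks the sentence vector in proportion to how much padding the batch happens to contain, so the same sentence would get different embeddings in different batches.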
Path 2 — ELMo via TensorFlow Hub
Google's TF Hub hosts a pre-trained ELMo module. It is published in the TF1 Hub format, so under TensorFlow 2 you load it with hub.load and call it through its "default" signature.
import tensorflow as tf
import tensorflow_hub as hub
import numpy as np
# ── 1. Load ELMo from TF Hub ───────────────────────────────
elmo = hub.load("https://tfhub.dev/google/elmo/3")
# ── 2. Prepare sentences (raw strings, NOT tokenised) ─────
sentences = [
"The dog let out a sharp bark",
"The bark of the ancient oak was rough",
"She opened the bank account online",
"We picnicked on the river bank at sunset"
]
# ── 3. Get token embeddings ────────────────────────────────
# 'output_key' controls what you get back:
# "elmo" → weighted combination (1024-dim per token)
# "lstm_outputs1" → Layer 1 hidden states
# "lstm_outputs2" → Layer 2 hidden states
# "default" → sentence-level fixed embedding (1024-dim)
# hub.load returns a TF1-format module; call it via its "default" signature,
# which returns a dict containing all of the keys listed above
outputs = elmo.signatures["default"](tf.constant(sentences))
token_embeddings = outputs["elmo"]
print(f"Shape: {token_embeddings.shape}")
# → (4, max_tokens_in_batch, 1024)
# ── 4. Compare "bank" across contexts ─────────────────────
# "bank" is at index 3 in sentence 2 (0-indexed)
# "bank" is at index 5 in sentence 3
bank_financial = token_embeddings[2, 3, :].numpy()
bank_river = token_embeddings[3, 5, :].numpy()
def cosine_sim(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
sim = cosine_sim(bank_financial, bank_river)
print(f"'bank' (financial) vs 'bank' (river): {sim:.4f}")
# ── 5. Get sentence-level embeddings (fixed size) ──────────
# The same call already returned the pooled "default" output
sentence_vectors = outputs["default"]
print(f"Sentence vectors shape: {sentence_vectors.shape}")
# → (4, 1024) — one 1024-dim vector per sentence
Path 3 — Accessing Individual Layers
For research or when you want to use a specific layer (e.g. Layer 1 for NER, Layer 2 for WSD), you can extract each LSTM layer independently.
import tensorflow_hub as hub
import tensorflow as tf
import numpy as np
elmo = hub.load("https://tfhub.dev/google/elmo/3")
sentences = ["The dog barked loudly in the park"]
# ── Extract all three representations ─────────────────────
# Call the TF1-format module through its "default" signature (returns a dict)
outputs = elmo.signatures["default"](tf.constant(sentences))
# word_emb: character CNN output (Layer 0)
word_emb = outputs["word_emb"].numpy() # shape: (1, 7, 512)
# lstm_outputs1: first biLSTM layer (Layer 1 — syntactic)
layer1 = outputs["lstm_outputs1"].numpy() # shape: (1, 7, 1024)
# lstm_outputs2: second biLSTM layer (Layer 2 — semantic)
layer2 = outputs["lstm_outputs2"].numpy() # shape: (1, 7, 1024)
# elmo: learned weighted combination
weighted = outputs["elmo"].numpy() # shape: (1, 7, 1024)
tokens = ["The", "dog", "barked", "loudly", "in", "the", "park"]
print("Token | Layer0_norm | Layer1_norm | Layer2_norm")
print("-" * 58)
for i, tok in enumerate(tokens):
    l0 = np.linalg.norm(word_emb[0, i, :])
    l1 = np.linalg.norm(layer1[0, i, :])
    l2 = np.linalg.norm(layer2[0, i, :])
    print(f"{tok:14s} | {l0:11.4f} | {l1:11.4f} | {l2:11.4f}")
Full Pipeline — ELMo for Text Classification
Let us build a complete sentiment analysis classifier using ELMo embeddings with PyTorch, showing the full pipeline end to end on a toy dataset: data prep → embeddings → classifier → training → evaluation.
import torch
import torch.nn as nn
import numpy as np
from allennlp.modules.elmo import Elmo, batch_to_ids
from torch.utils.data import DataLoader, TensorDataset
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
# ── 1. Toy sentiment dataset ───────────────────────────────
raw_sentences = [
["This", "film", "was", "absolutely", "brilliant"], # positive
["I", "loved", "every", "single", "moment"], # positive
["Terrible", "waste", "of", "two", "hours"], # negative
["The", "plot", "made", "absolutely", "no", "sense"], # negative
["A", "masterpiece", "of", "modern", "cinema"], # positive
["Boring", "and", "predictable", "from", "start"], # negative
]
labels = [1, 1, 0, 0, 1, 0]
# ── 2. Load ELMo ──────────────────────────────────────────
options_file = "path/to/elmo_options.json"
weight_file = "path/to/elmo_weights.hdf5"
elmo = Elmo(options_file, weight_file,
num_output_representations=1, dropout=0.0)
elmo.eval()  # switch to inference mode (disables dropout)
# ── 3. Extract ELMo embeddings (mean pooling over tokens) ─
def get_sentence_embedding(sentences_batch):
    with torch.no_grad():
        char_ids = batch_to_ids(sentences_batch)
        result = elmo(char_ids)
        embeddings = result['elmo_representations'][0]  # (B, T, 1024)
        mask = result['mask'].unsqueeze(-1).float()     # (B, T, 1)
        # Mean pooling: sum masked embeddings / token count
        summed = (embeddings * mask).sum(dim=1)         # (B, 1024)
        counts = mask.sum(dim=1)                        # (B, 1)
        return (summed / counts).numpy()
X = get_sentence_embedding(raw_sentences) # shape: (6, 1024)
y = np.array(labels)
# ── 4. Simple classifier on top of ELMo ───────────────────
class SentimentClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(1024, 256),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(256, 2)
        )
    def forward(self, x):
        return self.layers(x)
# stratify=y keeps both classes present in the tiny train and test splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.33, random_state=42, stratify=y)
X_tr_t = torch.FloatTensor(X_tr)
y_tr_t = torch.LongTensor(y_tr)
model = SentimentClassifier()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()
# ── 5. Training loop ───────────────────────────────────────
model.train()
for epoch in range(20):
    optimizer.zero_grad()
    logits = model(X_tr_t)
    loss = criterion(logits, y_tr_t)
    loss.backward()
    optimizer.step()
    if (epoch + 1) % 5 == 0:
        print(f"Epoch {epoch+1}/20 Loss: {loss.item():.4f}")
# ── 6. Evaluation ─────────────────────────────────────────
model.eval()
with torch.no_grad():
    preds = model(torch.FloatTensor(X_te)).argmax(dim=1).numpy()
# labels=[0, 1] keeps the report valid even if a class is never predicted
print(classification_report(y_te, preds, labels=[0, 1],
                            target_names=["Negative", "Positive"],
                            zero_division=0))
Full Pipeline — ELMo for Named Entity Recognition (NER)
NER is where ELMo's syntactic Layer 1 shines. We combine ELMo token embeddings with a CRF (Conditional Random Field) decoder — the standard architecture for sequence labelling tasks.
import torch
import torch.nn as nn
from allennlp.modules.elmo import Elmo, batch_to_ids
# pip install pytorch-crf  (the PyPI package that provides the torchcrf module)
from torchcrf import CRF
# ── Toy NER training example ──────────────────────────────
# Tags: O=outside, B-PER=begin person, I-PER=inside person,
# B-ORG=begin org, I-ORG=inside org,
# B-LOC=begin location, I-LOC=inside location
TAG2IDX = {"O":0, "B-PER":1, "I-PER":2,
           "B-ORG":3, "I-ORG":4, "B-LOC":5, "I-LOC":6}
sentence = ["Alan", "Turing", "worked", "at", "Bletchley", "Park"]
gold_tags = ["B-PER", "I-PER", "O", "O", "B-LOC", "I-LOC"]
tag_ids = torch.LongTensor([[TAG2IDX[t] for t in gold_tags]])
# ── ELMo: learned weighted sum of all three layers ───────
options_file = "path/to/elmo_options.json"
weight_file = "path/to/elmo_weights.hdf5"
# num_output_representations=1: one weighted sum of the three layers.
# Its scalar-mix weights are trained together with the NER head, so the
# task can learn to weight Layer 1 higher if that helps.
elmo = Elmo(options_file, weight_file,
            num_output_representations=1, dropout=0.1)
# ── ELMo-CRF NER model ────────────────────────────────────
class ElmoNER(nn.Module):
    def __init__(self, elmo_model, num_tags):
        super().__init__()
        self.elmo = elmo_model
        self.linear = nn.Linear(1024, num_tags)
        self.crf = CRF(num_tags, batch_first=True)
    def forward(self, char_ids, tags=None, mask=None):
        elmo_out = self.elmo(char_ids)
        embeddings = elmo_out['elmo_representations'][0]  # (B, T, 1024)
        emissions = self.linear(embeddings)               # (B, T, num_tags)
        if tags is not None:
            return -self.crf(emissions, tags, mask=mask)  # negative log-likelihood
        return self.crf.decode(emissions, mask=mask)      # best tag sequence
model = ElmoNER(elmo, num_tags=len(TAG2IDX))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# ── Training step (single sentence demo) ──────────────────
char_ids = batch_to_ids([sentence]) # (1, 6, 50)
mask = torch.ones(1, 6, dtype=torch.bool)
model.train()
for step in range(50):
    optimizer.zero_grad()
    loss = model(char_ids, tags=tag_ids, mask=mask)
    loss.backward()
    optimizer.step()
# ── Inference ─────────────────────────────────────────────
model.eval()
with torch.no_grad():
    predicted_ids = model(char_ids, mask=mask)[0]
IDX2TAG = {v: k for k, v in TAG2IDX.items()}
predicted_tags = [IDX2TAG[i] for i in predicted_ids]
for token, tag in zip(sentence, predicted_tags):
    print(f"{token:12s} → {tag}")
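Per-token tags are rarely the final product; most applications want entity spans. As a follow-up to the tagging loop above, here is a small, dependency-free sketch that groups BIO tags into spans. The helper name `bio_to_spans` is hypothetical (it is not part of AllenNLP or pytorch-crf), and it takes the lenient approach of ignoring a stray I- tag that has no matching open span.

```python
def bio_to_spans(tags):
    """Group BIO tags into (entity_type, start, end) spans; end is exclusive."""
    spans, start, current = [], None, None
    for i, tag in enumerate(tags):
        if tag.startswith("I-") and current == tag[2:]:
            continue  # this token extends the currently open span
        if current is not None:
            spans.append((current, start, i))  # close the open span
            start, current = None, None
        if tag.startswith("B-"):
            start, current = i, tag[2:]
        # "O" and stray "I-" tags open nothing
    if current is not None:
        spans.append((current, start, len(tags)))
    return spans

tokens = ["Alan", "Turing", "worked", "at", "Bletchley", "Park"]
tags = ["B-PER", "I-PER", "O", "O", "B-LOC", "I-LOC"]
for entity_type, start, end in bio_to_spans(tags):
    print(entity_type, " ".join(tokens[start:end]))
# PER Alan Turing
# LOC Bletchley Park
```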
ELMo vs Word2Vec vs BERT — The Comparison
| Property | Word2Vec / GloVe | ELMo | BERT |
|---|---|---|---|
| Embedding type | Static (1 per word type) | Contextual (1 per token) | Contextual (1 per token) |
| Architecture | Shallow neural net | 2-layer biLSTM | Transformer (12–24 layers) |
| Context direction | None (type-level) | Bidirectional (concat forward + backward) | Truly bidirectional (attention) |
| Handles polysemy | No — averages all meanings | Yes — context-dependent | Yes — deeply context-dependent |
| Input representation | Lookup table (word IDs) | Character CNN (handles OOV) | WordPiece subwords (partial OOV) |
| Fine-tuning | N/A (frozen embeddings) | Frozen biLM + learned task weights | Full model fine-tuning |
| Inference speed | Very fast (lookup) | Moderate (LSTM is sequential) | Slow (attention is O(n²)) |
| Memory footprint | Small (embedding matrix) | Medium (~93M params) | Large (110M–340M params) |
| Best task category | Semantic similarity, word clustering | NER, coref, SRL, syntax tasks | QA, NLI, all NLP tasks at SOTA |
| Year released | 2013 / 2014 | 2018 (Feb) | 2018 (Oct) |
ELMo is not obsolete — it is underused in resource-constrained environments. Use ELMo when: (a) you need good contextual embeddings but cannot afford BERT's memory on edge devices; (b) you are working with domain-specific corpora and can pre-train a custom biLM cheaply; (c) you need fast character-level OOV handling for clinical text, code, or multi-lingual corpora with rare words; or (d) you need a strong, interpretable baseline before scaling to transformers.
Training a Custom ELMo on Your Own Corpus
For domain-specific applications (medical notes, legal text, code, social media), a custom ELMo trained on your own data can outperform the generic model, because the biLM sees your domain's vocabulary and phrasing during pre-training. The bilm-tf library from the Allen Institute for AI (allenai/bilm-tf) provides the training code.
# ── Custom biLM training with bilm-tf ─────────────────────
# Install bilm-tf from its source repository (github.com/allenai/bilm-tf)
# (bilm-tf requires TF 1.x — use a virtual environment)
# ── Step 1: Prepare your corpus ───────────────────────────
# One sentence per line, tokens space-separated
# Save as train.txt and valid.txt
corpus_sample = """
The patient presented with acute myocardial infarction .
Left ventricular ejection fraction was measured at 45 percent .
Troponin levels were significantly elevated on admission .
The cardiologist recommended immediate percutaneous coronary intervention .
"""
with open("train.txt", "w") as f:
    f.write(corpus_sample.strip())
# ── Step 2: Create vocabulary ─────────────────────────────
# bilm-tf expects a vocabulary file (one token per line)
# Special tokens: <S>, </S>, <UNK> at the top
from collections import Counter
with open("train.txt") as f:
    tokens = f.read().split()
counts = Counter(tokens)
with open("vocab.txt", "w") as f:
    for special in ["<S>", "</S>", "<UNK>"]:
        f.write(special + "\n")
    for word, count in counts.most_common():
        # Minimum frequency threshold; lower it for a tiny sample like the
        # one above, where almost no token occurs three times
        if count >= 3:
            f.write(word + "\n")
# ── Step 3: biLM options ──────────────────────────────────
import json
options = {
"bidirectional": True,
"char_cnn": {
"activation": "relu",
"embedding": {"dim": 16},
"filters": [[1,32],[2,32],[3,64],[4,128],[5,256],
[6,512],[7,1024]],
"max_characters_per_token": 50,
"n_characters": 261,
"n_highway": 2
},
"dropout": 0.1,
"lstm": {
"cell_clip": 3,
"dim": 4096,
"n_layers": 2,
"proj_clip": 3,
"projection_dim": 512,
"use_skip_connections": True
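},
"all_clip_norm_val": 10.0,
"n_epochs": 10,
# The two counts below are the 1 Billion Word Benchmark values;
# replace them with the numbers for your own corpus and vocab.txt
"n_train_tokens": 768648884,
"batch_size": 128,
"n_tokens_vocab": 793471,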
"unroll_steps": 20,
"n_negative_samples_batch": 8192
}
with open("custom_elmo_options.json", "w") as f:
json.dump(options, f, indent=2)
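print("Options saved. Run training from a bilm-tf checkout with:")
print("  python bin/train_elmo.py --train_prefix train.txt \\")
print("      --vocab_file vocab.txt --save_dir ./custom_elmo_model")
print("Note: the upstream bin/train_elmo.py defines its options inline,")
print("so copy the values from custom_elmo_options.json into that script.")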
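The options above reuse the token and vocabulary counts of the 1 Billion Word Benchmark, which will not match a custom corpus. A small helper can compute the right values. This is a dependency-free sketch under two assumptions: that the corpus is whitespace-tokenised, and that the vocabulary size should equal the number of lines in vocab.txt including the three special tokens. The function name `corpus_stats` is hypothetical.

```python
from collections import Counter

def corpus_stats(lines, min_count=1):
    """Return (n_train_tokens, n_tokens_vocab) for whitespace-tokenised lines."""
    counts = Counter(tok for line in lines for tok in line.split())
    n_train_tokens = sum(counts.values())
    kept = [word for word, c in counts.items() if c >= min_count]
    n_tokens_vocab = len(kept) + 3  # plus <S>, </S>, <UNK>
    return n_train_tokens, n_tokens_vocab

sample = [
    "The patient presented with acute myocardial infarction .",
    "Troponin levels were significantly elevated on admission .",
]
n_train_tokens, n_tokens_vocab = corpus_stats(sample)
print(n_train_tokens, n_tokens_vocab)  # 16 18
```

In practice you would stream the lines from train.txt and pass the same `min_count` you used when writing vocab.txt, so the two numbers stay consistent with the files on disk.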
Key Use Cases — Where ELMo Excels
ELMo's Limitations — What It Cannot Do
A common misconception is that ELMo is fully bidirectional. In fact, ELMo runs a forward LSTM and a separate backward LSTM and concatenates their outputs. At any given token position, the forward pass has not seen the right context and the backward pass has not seen the left. They are combined after the fact, not fused during computation. BERT's transformer attention, by contrast, allows every token to attend to every other token in both directions within each layer. This deeper bidirectionality is one of the main reasons BERT outperforms ELMo on most benchmarks.
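The "combined after the fact" point can be demonstrated with a toy recurrence. In the sketch below a running sum stands in for the LSTM (an assumption made purely to keep the example dependency-free); what matters is the information flow, which is the same: the forward state at position k only ever sees tokens 0..k, and the backward state only sees tokens k..end.

```python
def forward_states(values):
    """Running sum left-to-right: state k depends only on values[0..k]."""
    states, total = [], 0.0
    for v in values:
        total += v
        states.append(total)
    return states

def backward_states(values):
    """Running sum right-to-left: state k depends only on values[k..end]."""
    return forward_states(values[::-1])[::-1]

def concat_bidirectional(values):
    """Pair each position's forward and backward state, computed independently."""
    return list(zip(forward_states(values), backward_states(values)))

original = [1.0, 2.0, 3.0, 4.0]
changed_right = [1.0, 2.0, 3.0, 9.0]  # only the last token differs

k = 1
fwd_same = forward_states(original)[k] == forward_states(changed_right)[k]
bwd_same = backward_states(original)[k] == backward_states(changed_right)[k]
print(fwd_same, bwd_same)  # True False
```

Changing a token to the right of position k leaves the forward half of the concatenated vector untouched. In a self-attention layer, by contrast, every component of the representation at k could change.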
Golden Rules for Using ELMo
When mean pooling, always divide by the real token count (mask.sum(dim=1)), never by the padded length.
For syntactic tasks (POS tagging, NER, parsing), use or upweight lstm_outputs1 (Layer 1). For semantic tasks (WSD, sentiment, STS), use or upweight lstm_outputs2 (Layer 2). When in doubt, use the weighted sum and let the task learn the weights.
Set dropout=0.0 when loading ELMo for inference (frozen). Add a nn.Dropout(0.2–0.5) layer between the ELMo output and your task-specific head. This helps prevent overfitting on small labelled datasets.
Apply nn.LayerNorm(1024) to ELMo outputs before concatenation.
ELMo's Historical Impact
This was unprecedented. Each of those tasks had its own dedicated state-of-the-art system. ELMo swept them all by adding contextual embeddings to existing models. The NLP community called it a watershed moment — proof that pre-training a language model on unlabelled data could provide general-purpose linguistic knowledge transferable to any task.
Just eight months later, BERT (October 2018) extended this idea from LSTMs to Transformers, achieving even larger gains. Then came GPT-2, RoBERTa, T5, and eventually GPT-4 and Claude. Every modern LLM is a direct intellectual descendant of the contextualisation idea ELMo introduced.
ELMo proved: contextual pre-training works. BERT improved: transformers are better than LSTMs for context. GPT-3 scaled: more data and parameters produce emergent capabilities. ChatGPT and Claude added: RLHF alignment makes models useful to people. ELMo is not a legacy tool — it is the conceptual origin of everything that followed.