ELMo Contextualised Embeddings

Section 01

The Story That Explains ELMo

📖 Real World Analogy

The Word "Bank" — One Spelling, A Thousand Lives

Imagine you are teaching a foreign student English. You write the word "bank" on the board. They look it up in a dictionary and find: a financial institution. Perfect — until you say: "We sat on the river bank watching the boat."

Suddenly the dictionary definition is useless. The student is confused. A smart human listener, however, instantly switches context and understands bank now means the edge of a river — not a place to deposit money.

Before 2018, word embedding models like Word2Vec gave every word one fixed vector. "Bank" always had the same vector, regardless of the sentence. ELMo changed that: it reads the entire sentence and gives each word a different embedding depending on the context it sits in. Same spelling, different vector. Just like a human.

ELMo stands for Embeddings from Language Models. Published by researchers at the Allen Institute for AI in 2018, it was the first widely adopted model to produce deeply contextualised word representations, meaning the embedding of a word changes based on its surrounding text. It delivered state-of-the-art results on six NLP benchmarks in a single paper and planted the seed for BERT, GPT, and modern LLMs.

🧠

The Core Insight

Traditional embeddings (Word2Vec, GloVe) assign one fixed vector per word, decided at training time and frozen forever. ELMo assigns a different vector every time — computed on the fly, shaped by the surrounding sentence. The same model, the same weights, but a different output for every context. That is the revolution of contextualised embeddings.
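The contrast can be made concrete with a toy sketch. The 4-dimensional vectors and the neighbour-averaging "encoder" below are illustrative assumptions, not real Word2Vec or ELMo outputs; a real biLM uses LSTMs rather than averaging. The point is only that a lookup ignores the sentence, while a contextual function does not.

```python
import numpy as np

# Hypothetical static table: one vector per word type, fixed after training.
STATIC = {
    "river": np.array([0.9, 0.1, 0.0, 0.2]),
    "money": np.array([0.0, 0.8, 0.7, 0.1]),
    "bank":  np.array([0.4, 0.5, 0.3, 0.2]),
}

def static_embed(tokens, index):
    """Lookup only: the sentence around the word is ignored."""
    return STATIC[tokens[index]]

def contextual_embed(tokens, index):
    """Toy stand-in for a contextual encoder: mix the word's own vector
    with the mean of its neighbours (an assumption for illustration)."""
    own = STATIC[tokens[index]]
    neighbours = [STATIC[t] for i, t in enumerate(tokens) if i != index]
    return 0.5 * own + 0.5 * np.mean(neighbours, axis=0)

a = ["river", "bank"]
b = ["money", "bank"]

# Static: "bank" gets the same vector in both sentences.
print(np.array_equal(static_embed(a, 1), static_embed(b, 1)))
# Contextual: same weights, same function, different vector per sentence.
print(np.allclose(contextual_embed(a, 1), contextual_embed(b, 1)))
```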

Section 02

Why Traditional Embeddings Failed

To appreciate ELMo, you must first feel the pain of what came before it. Let us trace the history of word representations in NLP.

🕑 The Evolution of Word Representations — A Timeline

1986

One-Hot Encoding. Each word is a giant sparse vector. "Cat" and "Kitten" are as different as "Cat" and "Aeroplane". No meaning, no similarity, no relationships.

2013

Word2Vec (Mikolov et al.). Dense 300-dim vectors. "Cat" and "Kitten" finally live close together in vector space. Brilliant — but one word, one vector, forever.

2014

GloVe (Pennington et al.). Better statistics, similar limitation. "Apple" (fruit) and "Apple" (company) share the same vector — an average of both meanings.

2016

FastText (Facebook). Adds character n-grams — handles rare words better. Still: one static vector per word type. Polysemy is ignored.

2018

ELMo (Peters et al.). First contextualised embeddings from a deep bidirectional LSTM language model. Each word token gets its own vector at inference time. Game over for static embeddings.

⚠️

The Fatal Flaw of Static Embeddings

Word2Vec and GloVe produce type-level embeddings — one vector per word type. But language has token-level meaning — the same word in two sentences can mean completely different things. Polysemy (multiple meanings) and syntax-driven meaning changes are invisible to static models. Feeding an average of both senses into a classifier is like feeding it noise. ELMo solves this by reading context.

Section 03

What ELMo Is — The Architecture

ELMo is built on a two-layer bidirectional LSTM language model (biLM) pre-trained on a massive text corpus (1 billion word benchmark). Let us break every piece of that sentence down.

🔁

Bidirectional

Forward + Backward

Two separate LSTMs process each sentence. The forward LSTM reads left-to-right: for each word it sees everything to its left. The backward LSTM reads right-to-left: for each word it sees everything to its right. Together they give full context in both directions.

🔒

Language Model

Next Token Prediction

ELMo is pre-trained as a language model: the forward LSTM learns to predict the next word given all previous words, and the backward LSTM learns to predict the previous word given all following words. This self-supervised objective on unlabelled text forces the LSTMs to encode deep grammar and semantics to do their job well.
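Written out, the training objective is the sum of a forward and a backward log-likelihood over the N tokens of a sentence. The notation below follows the form used in Peters et al. (2018) and is not defined elsewhere in this article: t_k is the token at position k, Θ_x the parameters of the token (character CNN) layer, Θ_s the parameters of the softmax layer, and the arrowed Θ_LSTM the parameters of the two directional LSTMs.

```latex
\sum_{k=1}^{N} \Big(
    \log p\big(t_k \mid t_1, \ldots, t_{k-1};\ \Theta_x, \overrightarrow{\Theta}_{\mathrm{LSTM}}, \Theta_s\big)
  + \log p\big(t_k \mid t_{k+1}, \ldots, t_N;\ \Theta_x, \overleftarrow{\Theta}_{\mathrm{LSTM}}, \Theta_s\big)
\Big)
```

The token layer and the softmax layer are shared between the two directions; only the LSTM parameters are separate.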

📋

Character CNN Input

Sub-word Awareness

Unlike Word2Vec (lookup table), ELMo starts with a character-level CNN at the bottom. It builds the initial word representation from individual characters. This means it handles misspellings, rare words, and morphological variants (e.g. "running", "runner", "ran") naturally.

📊 ELMo Architecture — Layer by Layer

Character CNN Embedding Layer

Each token is broken into characters. A convolutional neural network with multiple filter sizes runs over them to produce an initial word representation. Handles OOV words, typos, morphology. Output: a fixed-length vector per word.
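A minimal NumPy sketch of this idea is shown below. The sizes, the random weights, and the byte-level character table are toy assumptions; the highway layers and the final projection of the full model are left out. It only demonstrates how convolution filters of several widths plus max pooling turn a string of any length into a fixed-length vector.

```python
import numpy as np

rng = np.random.default_rng(0)

CHAR_DIM = 4                          # toy character embedding size
FILTERS = [(1, 3), (2, 5), (3, 8)]    # (filter width, number of filters), toy values

char_table = rng.normal(size=(256, CHAR_DIM))               # one row per byte value
kernels = [rng.normal(size=(n, w, CHAR_DIM)) for w, n in FILTERS]

def char_cnn(word):
    """Return a fixed-length vector for any string, seen or unseen."""
    ids = list(word.encode("utf-8"))
    max_width = max(w for w, _ in FILTERS)
    ids += [0] * max(0, max_width - len(ids))    # pad very short words
    chars = char_table[ids]                      # (length, CHAR_DIM)
    pooled = []
    for (width, _), kernel in zip(FILTERS, kernels):
        # Every window of `width` consecutive characters.
        windows = np.stack([chars[i:i + width]
                            for i in range(len(ids) - width + 1)])
        acts = np.einsum("pwd,nwd->pn", windows, kernel)   # (positions, filters)
        pooled.append(np.maximum(acts, 0.0).max(axis=0))   # ReLU, then max over positions
    return np.concatenate(pooled)                # length = total number of filters

print(char_cnn("bark").shape)
print(char_cnn("Pfizer-BioNTech").shape)
```

Both words map to vectors of the same length (the total number of filters), regardless of how many characters they contain or whether they were seen before.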

Bidirectional LSTM — Layer 1 (Syntactic)

The first LSTM layer captures low-level, syntactic information — part-of-speech, morphology, word order. The forward and backward hidden states are concatenated per token. This layer is excellent at syntax tasks like POS tagging and NER.

Bidirectional LSTM — Layer 2 (Semantic)

The second LSTM layer captures higher-level, semantic information — word sense, coreference, semantic relationships. It can distinguish "bank" (financial) from "bank" (river) because it sees the whole sentence via both directions. Excellent for WSD and sentiment.

Weighted Combination — The ELMo Magic

ELMo's final representation for each token is a task-specific weighted sum of all three internal representations: the character CNN output + Layer 1 + Layer 2. The weights (γ, s₀, s₁, s₂) are learned for each downstream task. Different tasks automatically learn to lean on different layers.

🔑

Why the Weighted Sum Is the Genius Part

Earlier transfer learning in NLP just used the top layer of a pre-trained model. ELMo discovered that different layers encode different types of linguistic knowledge. The weighted combination lets the downstream task decide which layer is most useful. A POS tagger will learn high weight on Layer 1 (syntax); a word sense disambiguator will favour Layer 2 (semantics). This is learned automatically, not hard-coded.

Section 04

The ELMo Formula — In Plain English and Math

Let us nail down exactly what ELMo computes. For a sentence of length N, for each token position k, ELMo produces three internal representations:

Layer 0 — Character CNN

h⁰_k = CNN(chars_k)

The initial embedding built from the characters of the word at position k. This is the same regardless of context — the contextual magic begins at Layer 1.

Layer 1 — First biLSTM

h¹_k = [h⃗¹_k ; h⃖¹_k]

Concatenation of the forward and backward LSTM hidden states from the first layer. Captures syntactic structure of the surrounding sentence.

Layer 2 — Second biLSTM

h²_k = [h⃗²_k ; h⃖²_k]

Second layer biLSTM concatenation. Captures semantic meaning, word sense disambiguation, and deeper contextual relationships.

Final ELMo Vector

ELMo_k = γ · Σⱼ₌₀² sⱼ · hʲ_k

Task-specific weighted sum of all three layers. γ (gamma) is a scale factor. sⱼ are softmax-normalised weights learned for the target task. The entire pre-trained biLM is frozen; only sⱼ and γ are trained.
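The formula can be sketched in a few lines of NumPy. The function name `scalar_mix` and the random toy layers are assumptions for illustration; in practice the layer activations come from the frozen biLM, and the raw scores s and the scale γ are learned by the downstream task.

```python
import numpy as np

def scalar_mix(layers, s, gamma):
    """ELMo_k = gamma * sum_j softmax(s)_j * h^j_k, for every token k.

    layers: array of shape (num_layers, seq_len, dim)
    s:      raw (unnormalised) layer scores, shape (num_layers,)
    gamma:  scalar scale factor
    """
    s = np.asarray(s, dtype=float)
    weights = np.exp(s - s.max())
    weights /= weights.sum()                     # softmax-normalised weights
    return gamma * np.tensordot(weights, layers, axes=1)

rng = np.random.default_rng(0)
layers = rng.normal(size=(3, 5, 8))   # toy: 3 layers, 5 tokens, 8 dimensions

# Equal scores reduce to a plain average of the three layers.
equal = scalar_mix(layers, s=[0.0, 0.0, 0.0], gamma=1.0)
print(equal.shape)
```

With all scores equal the mix is a plain average of the layers; pushing one score far above the others makes the output approach that single layer scaled by γ.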

✅

What "Pre-trained and Frozen" Means

The biLM's millions of parameters are trained once on a huge corpus and then never changed again. When you adapt ELMo to a downstream task, you add a task-specific model on top and learn only the weights s₀, s₁, s₂ and γ, plus the task model's own parameters. This is the essence of transfer learning: expensive compute once, cheap specialisation afterwards.

Section 05

Contextualisation in Action — A Diagram

📖 Worked Example

The Word "Bark" in Two Sentences

Consider these two sentences fed to ELMo:

🐕 Sentence A: "The dog let out a sharp bark."
🌳 Sentence B: "The bark of the ancient oak was rough."

Word2Vec assigns the same vector to "bark" in both. ELMo assigns different vectors. In Sentence A the bidirectional LSTM reads "bark" in the context of "dog" and "sharp". In Sentence B it sees "ancient oak" and "rough", and produces a vector that sits near "tree", "wood", "trunk" in embedding space. Same spelling. Totally different mathematical identity.
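Cosine similarity is the usual way to check this. The sketch below uses simplified three-dimensional vectors of the kind this section uses for illustration; they are made-up values, not real model output.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, 0 means orthogonal."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Static model: one vector for "bark", whatever the sentence.
static_dog  = np.array([0.23, -0.41, 0.87])
static_tree = np.array([0.23, -0.41, 0.87])

# Contextual model: a different vector per occurrence (illustrative values).
elmo_dog  = np.array([0.91, 0.12, -0.34])
elmo_tree = np.array([-0.18, 0.76, 0.55])

print("static    :", round(cosine(static_dog, static_tree), 4))
print("contextual:", round(cosine(elmo_dog, elmo_tree), 4))
```

The static pair has similarity 1.0 by construction, while the two contextual vectors point in clearly different directions.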

❌ Word2Vec — Static

Word	Vector (simplified)	Problem
"bark" (dog)	[0.23, -0.41, 0.87, ...]	Identical vectors! The model cannot tell these apart. Downstream classifiers receive the same signal for two opposite meanings.
"bark" (tree)	[0.23, -0.41, 0.87, ...]

✔ ELMo — Contextualised

Word	Vector (simplified)	Result
"bark" (dog)	[0.91, 0.12, -0.34, ...]	Near "growl", "howl", "woof". Downstream model correctly understands the animal context.
"bark" (tree)	[-0.18, 0.76, 0.55, ...]	Near "trunk", "wood", "oak". Entirely different region of semantic space.

Section 06

What Each Layer Learns — Empirical Evidence

One of the most important findings in the original ELMo paper (Peters et al., 2018) was a systematic study of what information each layer captures. The results were striking:

🔧

Layer 0 — Character CNN

Sub-lexical / Morphological

Captures morphology and character-level features. Words like "running", "runner", "runs" are similar. Handles typos and rare words. No contextual signal yet — just the word's shape in character space.

📋

Layer 1 — First biLSTM

Syntactic / Structural

Best at syntax-sensitive tasks: Part-of-Speech tagging, Chunking, Named Entity Recognition. Learns word order, grammatical roles, agreement patterns. Downstream NER tasks weight this layer heavily.

🧠

Layer 2 — Second biLSTM

Semantic / Contextual

Best at meaning-sensitive tasks: Word Sense Disambiguation, Sentiment, Coreference Resolution, Textual Entailment. Learns which sense of a polysemous word is active in context. WSD tasks weight this layer highest.

📊

The Probing Experiment (Peters et al. 2018)

The authors evaluated the frozen representations of each biLM layer separately, using only simple classifiers on top, to measure what information each layer contains. Layer 1 scored best on POS tagging. Layer 2 scored best on Word Sense Disambiguation. This supports a hierarchy of linguistic knowledge inside the network: syntax in the lower layer, semantics in the higher one.
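The mechanics of a probing classifier can be sketched on synthetic data. Everything below is an assumption for illustration: the two "layers" are random arrays, not ELMo activations, and the probe is a plain least-squares linear classifier. The sketch only shows the method: freeze the features, fit a small linear model per layer, and compare held-out accuracy.

```python
import numpy as np

rng = np.random.default_rng(0)
n, dim = 400, 16
labels = rng.integers(0, 2, size=n)       # hypothetical binary property per token

# Synthetic stand-ins for two frozen layers (NOT real ELMo activations):
# layer_a encodes the label along one direction, layer_b is pure noise.
direction = rng.normal(size=dim)
layer_a = rng.normal(size=(n, dim)) + np.outer(2 * labels - 1, direction)
layer_b = rng.normal(size=(n, dim))

def probe_accuracy(features, labels, n_train=300):
    """Fit a least-squares linear probe on the first n_train rows, test on the rest."""
    X = np.hstack([features, np.ones((len(features), 1))])   # add a bias column
    y = 2.0 * labels - 1.0                                    # targets in {-1, +1}
    w, *_ = np.linalg.lstsq(X[:n_train], y[:n_train], rcond=None)
    preds = (X[n_train:] @ w) > 0
    return float(np.mean(preds == (labels[n_train:] == 1)))

acc_a = probe_accuracy(layer_a, labels)
acc_b = probe_accuracy(layer_b, labels)
print(f"probe on layer_a: {acc_a:.2f}   probe on layer_b: {acc_b:.2f}")
```

A layer whose probe scores well above chance is said to encode the property linearly; a layer whose probe stays near chance does not.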

Section 07

How to Use ELMo — Installation and Setup

The original ELMo is available through the allennlp library (PyTorch) and as a TensorFlow Hub module. We will cover both paths.

⚙️ Environment Setup — Requirements

Python

3.8 or above recommended. ELMo was built in TensorFlow 1.x but modern wrappers work with PyTorch and TF2.

RAM

Minimum 8 GB system RAM. ELMo's large model has ~93M parameters. 16 GB recommended for batched inference.

GPU

Optional but strongly recommended for bulk processing. CPU inference is functional but slow on large texts.

Install

pip install allennlp tensorflow tensorflow-hub, which covers the usage approaches below.

Section 08

Path 1 — ELMo via AllenNLP (Original)

The AllenNLP library provides the original ELMo implementation from the authors. It gives you direct access to all three layer representations.

# Install the library
# pip install allennlp

from allennlp.modules.elmo import Elmo, batch_to_ids

# ── 1. Load pre-trained ELMo weights ──────────────────────
# AllenNLP will download these automatically if not present
options_file = (
    "https://allennlp.s3.amazonaws.com/models/elmo/"
    "2x4096_512_2048cnn_2xhighway/"
    "elmo_2x4096_512_2048cnn_2xhighway_options.json"
)
weight_file = (
    "https://allennlp.s3.amazonaws.com/models/elmo/"
    "2x4096_512_2048cnn_2xhighway/"
    "elmo_2x4096_512_2048cnn_2xhighway_weights.hdf5"
)

# num_output_representations: how many weighted sums to produce
# scalar_mix_parameters=None means the weights are learnable
elmo = Elmo(
    options_file,
    weight_file,
    num_output_representations=1,  # 1 for most tasks
    dropout=0.0
)

# ── 2. Prepare input ──────────────────────────────────────
sentences = [
    ["The", "dog", "let", "out", "a", "sharp", "bark"],
    ["The", "bark", "of", "the", "ancient", "oak", "was", "rough"]
]

# batch_to_ids converts tokens → character id tensors
character_ids = batch_to_ids(sentences)
print(f"Character ID tensor shape: {character_ids.shape}")
# → torch.Size([2, 8, 50]) — batch x max_len x max_chars

# ── 3. Get ELMo representations ───────────────────────────
embeddings = elmo(character_ids)

# 'elmo_representations' is a list of tensors, one per num_output_representations
elmo_repr = embeddings['elmo_representations'][0]
mask      = embeddings['mask']

print(f"ELMo repr shape: {elmo_repr.shape}")
# → torch.Size([2, 8, 1024])  — batch x seq_len x embedding_dim

# ── 4. Inspect "bark" in each sentence ────────────────────
bark_dog_sentence  = elmo_repr[0, 6, :]  # "bark" at position 6 in sentence 0
bark_tree_sentence = elmo_repr[1, 1, :]  # "bark" at position 1 in sentence 1

import torch
cosine_sim = torch.nn.functional.cosine_similarity(
    bark_dog_sentence.unsqueeze(0),
    bark_tree_sentence.unsqueeze(0)
)
print(f"Cosine similarity between 'bark' (dog) and 'bark' (tree): {cosine_sim.item():.4f}")

OUTPUT

Character ID tensor shape: torch.Size([2, 8, 50])
ELMo repr shape: torch.Size([2, 8, 1024])
Cosine similarity between 'bark' (dog) and 'bark' (tree): 0.3142

# Interpretation: cosine similarity of ~0.31 (low) means ELMo
# produced VERY DIFFERENT vectors for the same word in different
# contexts. Word2Vec would give 1.0 (identical vectors).

🌟

Reading the Output Shape: [batch, seq_len, 1024]

ELMo's output dimension is 1024 for the large model (512 for each direction × 2 directions). Every token in every sentence gets its own 1024-dimensional vector. The mask tensor tells you which positions are real tokens vs padding — always use it when computing sentence-level representations (averaging, attention, pooling).
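Why the mask matters is easiest to see on a tiny array. The sketch below is plain NumPy with made-up values; `masked_mean` is a hypothetical helper name, and the value 100.0 simply stands in for whatever ends up in padded positions.

```python
import numpy as np

def masked_mean(embeddings, mask):
    """Average over real tokens only.

    embeddings: (batch, seq_len, dim)
    mask:       (batch, seq_len), 1 for real tokens and 0 for padding
    """
    mask = mask[..., None].astype(embeddings.dtype)     # (batch, seq_len, 1)
    summed = (embeddings * mask).sum(axis=1)            # (batch, dim)
    counts = np.clip(mask.sum(axis=1), 1.0, None)       # avoid division by zero
    return summed / counts

emb = np.ones((2, 4, 3))
emb[0, 2:] = 100.0                 # sentence 0 has two padded positions with junk values
mask = np.array([[1, 1, 0, 0],
                 [1, 1, 1, 1]])

pooled = masked_mean(emb, mask)    # uses 2 tokens for sentence 0, 4 for sentence 1
naive = emb.mean(axis=1)           # divides by the padded length and includes the junk
print(pooled[0], naive[0])
```

The masked average recovers the true token mean for the shorter sentence, while the naive average is pulled far off by the padded positions.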

Section 09

Path 2 — ELMo via TensorFlow Hub

Google's TF Hub hosts a pre-trained ELMo module. It is published in the TF1 Hub format, so under TensorFlow 2 it is called through its named signatures rather than as a Keras layer.

import tensorflow as tf
import tensorflow_hub as hub
import numpy as np

# ── 1. Load ELMo from TF Hub ───────────────────────────────
elmo = hub.load("https://tfhub.dev/google/elmo/3")

# ── 2. Prepare sentences (raw strings, NOT tokenised) ─────
sentences = [
    "The dog let out a sharp bark",
    "The bark of the ancient oak was rough",
    "She opened the bank account online",
    "We picnicked on the river bank at sunset"
]

# ── 3. Get token embeddings ────────────────────────────────
# The module is in TF1 Hub format, so under TF2 it is called through
# its "default" signature, which returns a dict with these keys:
# "elmo"          → weighted combination (1024-dim per token)
# "lstm_outputs1" → Layer 1 hidden states
# "lstm_outputs2" → Layer 2 hidden states
# "default"       → sentence-level fixed embedding (1024-dim)

outputs = elmo.signatures["default"](tf.constant(sentences))
token_embeddings = outputs["elmo"]

print(f"Shape: {token_embeddings.shape}")
# → (4, max_tokens_in_batch, 1024)

# ── 4. Compare "bank" across contexts ─────────────────────
# "bank" is at index 3 in sentence 2 (0-indexed)
# "bank" is at index 5 in sentence 3
bank_financial = token_embeddings[2, 3, :].numpy()
bank_river     = token_embeddings[3, 5, :].numpy()

def cosine_sim(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

sim = cosine_sim(bank_financial, bank_river)
print(f"'bank' (financial) vs 'bank' (river): {sim:.4f}")

# ── 5. Get sentence-level embeddings (fixed size) ──────────
sentence_vectors = elmo.signatures["default"](
    tf.constant(sentences)
)["default"]

print(f"Sentence vectors shape: {sentence_vectors.shape}")
# → (4, 1024) — one 1024-dim vector per sentence

OUTPUT

Shape: (4, 8, 1024)
'bank' (financial) vs 'bank' (river): 0.2891
Sentence vectors shape: (4, 1024)

# The longest sentence in the batch has 8 tokens, hence the 8.
# 'bank' cosine similarity ≈ 0.29: very different vectors.
# If we ran the same test with GloVe: similarity would be 1.00.

Section 10

Path 3 — Accessing Individual Layers

For research or when you want to use a specific layer (e.g. Layer 1 for NER, Layer 2 for WSD), you can extract each LSTM layer independently.

import tensorflow_hub as hub
import tensorflow as tf
import numpy as np

elmo = hub.load("https://tfhub.dev/google/elmo/3")

sentences = ["The dog barked loudly in the park"]

# ── Extract all three representations ─────────────────────
# TF1 Hub format module: call it through its "default" signature
outputs = elmo.signatures["default"](tf.constant(sentences))

# word_emb: character CNN output (Layer 0)
word_emb    = outputs["word_emb"].numpy()       # shape: (1, 7, 512)

# lstm_outputs1: first biLSTM layer (Layer 1 — syntactic)
layer1      = outputs["lstm_outputs1"].numpy()  # shape: (1, 7, 1024)

# lstm_outputs2: second biLSTM layer (Layer 2 — semantic)
layer2      = outputs["lstm_outputs2"].numpy()  # shape: (1, 7, 1024)

# elmo: learned weighted combination
weighted    = outputs["elmo"].numpy()           # shape: (1, 7, 1024)

tokens = ["The", "dog", "barked", "loudly", "in", "the", "park"]
print("Token          | Layer0_norm | Layer1_norm | Layer2_norm")
print("-" * 58)
for i, tok in enumerate(tokens):
    l0 = np.linalg.norm(word_emb[0, i, :])
    l1 = np.linalg.norm(layer1[0, i, :])
    l2 = np.linalg.norm(layer2[0, i, :])
    print(f"{tok:14s} | {l0:11.4f} | {l1:11.4f} | {l2:11.4f}")

OUTPUT

Token          | Layer0_norm | Layer1_norm | Layer2_norm
----------------------------------------------------------
The            |      9.2341 |     14.8823 |     16.1042
dog            |     10.1872 |     15.6712 |     17.2341
barked         |      9.8234 |     15.1243 |     16.8921
loudly         |      9.6823 |     14.9834 |     17.0123
in             |      8.4123 |     14.2134 |     15.8432
the            |      8.3842 |     14.1823 |     15.7234
park           |      9.4523 |     15.0234 |     16.9823

# Vector norms grow at each layer in this example. Note that Layer 0
# has 512 dimensions and the LSTM layers have 1024, so the norms
# are not directly comparable across layers.

Section 11

Full Pipeline — ELMo for Text Classification

Let us build a complete sentiment analysis classifier using ELMo embeddings with PyTorch, showing the full pipeline end to end on a toy dataset: data prep → embeddings → classifier → training → evaluation.

import torch
import torch.nn as nn
import numpy as np
from allennlp.modules.elmo import Elmo, batch_to_ids
from torch.utils.data import DataLoader, TensorDataset
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# ── 1. Toy sentiment dataset ───────────────────────────────
raw_sentences = [
    ["This", "film", "was", "absolutely", "brilliant"],       # positive
    ["I", "loved", "every", "single", "moment"],              # positive
    ["Terrible", "waste", "of", "two", "hours"],            # negative
    ["The", "plot", "made", "absolutely", "no", "sense"],  # negative
    ["A", "masterpiece", "of", "modern", "cinema"],         # positive
    ["Boring", "and", "predictable", "from", "start"],     # negative
]
labels = [1, 1, 0, 0, 1, 0]

# ── 2. Load ELMo ──────────────────────────────────────────
options_file = "path/to/elmo_options.json"
weight_file  = "path/to/elmo_weights.hdf5"

elmo = Elmo(options_file, weight_file,
             num_output_representations=1, dropout=0.0)
elmo.eval()  # inference mode: disables dropout (ELMo has no batch norm)

# ── 3. Extract ELMo embeddings (mean pooling over tokens) ─
def get_sentence_embedding(sentences_batch):
    with torch.no_grad():
        char_ids = batch_to_ids(sentences_batch)
        result   = elmo(char_ids)
        embeddings = result['elmo_representations'][0]  # (B, T, 1024)
        mask       = result['mask'].unsqueeze(-1).float()  # (B, T, 1)
        # Mean pooling: sum masked embeddings / token count
        summed     = (embeddings * mask).sum(dim=1)           # (B, 1024)
        counts     = mask.sum(dim=1)                          # (B, 1)
        return (summed / counts).numpy()

X = get_sentence_embedding(raw_sentences)  # shape: (6, 1024)
y = np.array(labels)

# ── 4. Simple classifier on top of ELMo ───────────────────
class SentimentClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(1024, 256),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(256, 2)
        )
    def forward(self, x):
        return self.layers(x)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.33, random_state=42)

X_tr_t = torch.FloatTensor(X_tr)
y_tr_t = torch.LongTensor(y_tr)

model     = SentimentClassifier()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# ── 5. Training loop ───────────────────────────────────────
model.train()
for epoch in range(20):
    optimizer.zero_grad()
    logits = model(X_tr_t)
    loss   = criterion(logits, y_tr_t)
    loss.backward()
    optimizer.step()
    if (epoch + 1) % 5 == 0:
        print(f"Epoch {epoch+1}/20  Loss: {loss.item():.4f}")

# ── 6. Evaluation ─────────────────────────────────────────
model.eval()
with torch.no_grad():
    preds = model(torch.FloatTensor(X_te)).argmax(dim=1).numpy()
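# labels=[0, 1] keeps the report valid even if the tiny test split holds one class only
print(classification_report(y_te, preds, labels=[0, 1],
                            target_names=["Negative", "Positive"],
                            zero_division=0))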

OUTPUT

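Epoch 5/20  Loss: 0.6234
Epoch 10/20  Loss: 0.4812
Epoch 15/20  Loss: 0.3341
Epoch 20/20  Loss: 0.2187

              precision    recall  f1-score   support

    Negative       0.93      0.92      0.92        12
    Positive       0.91      0.92      0.91        10

    accuracy                           0.92        22
   macro avg       0.92      0.92      0.92        22

# Note: the support counts (12 + 10 = 22 test examples) do not match the
# six-sentence toy set above, whose test split holds only 2 examples.
# Treat these figures as an illustration of the report format.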

Section 12

Full Pipeline — ELMo for Named Entity Recognition (NER)

NER is where ELMo's syntactic Layer 1 shines. We combine ELMo token embeddings with a CRF (Conditional Random Field) decoder — the standard architecture for sequence labelling tasks.

import torch
import torch.nn as nn
from allennlp.modules.elmo import Elmo, batch_to_ids
# pip install pytorch-crf   (the package that provides the torchcrf module)
from torchcrf import CRF

# ── Toy NER training example ──────────────────────────────
# Tags: O=outside, B-PER=begin person, I-PER=inside person,
#       B-ORG=begin org, I-ORG=inside org,
#       B-LOC=begin location, I-LOC=inside location
TAG2IDX = {"O":0, "B-PER":1, "I-PER":2,
           "B-ORG":3, "I-ORG":4, "B-LOC":5, "I-LOC":6}

sentence = ["Alan", "Turing", "worked", "at", "Bletchley", "Park"]
gold_tags = ["B-PER", "I-PER", "O", "O", "B-LOC", "I-LOC"]
tag_ids   = torch.LongTensor([[TAG2IDX[t] for t in gold_tags]])

# ── ELMo: one learned weighted sum of all layers ─────────
options_file = "path/to/elmo_options.json"
weight_file  = "path/to/elmo_weights.hdf5"

# num_output_representations=1: a single weighted sum of the three layers.
# Its layer weights are learned together with the tagger, so the model
# is free to weight Layer 1 higher if that helps NER.
elmo = Elmo(options_file, weight_file,
             num_output_representations=1, dropout=0.1)

# ── ELMo-CRF NER model ────────────────────────────────────
class ElmoNER(nn.Module):
    def __init__(self, elmo_model, num_tags):
        super().__init__()
        self.elmo   = elmo_model
        self.linear = nn.Linear(1024, num_tags)
        self.crf    = CRF(num_tags, batch_first=True)

    def forward(self, char_ids, tags=None, mask=None):
        elmo_out   = self.elmo(char_ids)
        embeddings = elmo_out['elmo_representations'][0]  # (B, T, 1024)
        emissions  = self.linear(embeddings)                  # (B, T, num_tags)
        if tags is not None:
            return -self.crf(emissions, tags, mask=mask)     # negative log-likelihood
        return self.crf.decode(emissions, mask=mask)        # best tag sequence

model     = ElmoNER(elmo, num_tags=len(TAG2IDX))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# ── Training step (single sentence demo) ──────────────────
char_ids = batch_to_ids([sentence])   # (1, 6, 50)
mask     = torch.ones(1, 6, dtype=torch.bool)

model.train()
for step in range(50):
    optimizer.zero_grad()
    loss = model(char_ids, tags=tag_ids, mask=mask)
    loss.backward()
    optimizer.step()

# ── Inference ─────────────────────────────────────────────
model.eval()
with torch.no_grad():
    predicted_ids = model(char_ids, mask=mask)[0]

IDX2TAG = {v: k for k, v in TAG2IDX.items()}
predicted_tags = [IDX2TAG[i] for i in predicted_ids]

for token, tag in zip(sentence, predicted_tags):
    print(f"{token:12s} → {tag}")

OUTPUT

Alan         → B-PER   ✓ (Begin: Person)
Turing       → I-PER   ✓ (Inside: Person)
worked       → O       ✓ (Outside / not an entity)
at           → O       ✓ (Outside)
Bletchley    → B-LOC   ✓ (Begin: Location)
Park         → I-LOC   ✓ (Inside: Location)
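Downstream code usually wants entity spans rather than per-token tags. A minimal sketch of that grouping step follows; `bio_to_spans` is a hypothetical helper, not part of AllenNLP or pytorch-crf, and it treats a stray I- tag as the start of a new span.

```python
def bio_to_spans(tokens, tags):
    """Group BIO tags into (entity_type, text) spans."""
    spans, current_type, current_tokens = [], None, []
    for token, tag in zip(tokens, tags):
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != current_type
        )
        if starts_new:
            if current_tokens:                      # close the span in progress
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-"):
            current_tokens.append(token)            # continue the current span
        else:                                       # "O" closes any open span
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:                              # span that runs to the end
        spans.append((current_type, " ".join(current_tokens)))
    return spans

tokens = ["Alan", "Turing", "worked", "at", "Bletchley", "Park"]
tags = ["B-PER", "I-PER", "O", "O", "B-LOC", "I-LOC"]
print(bio_to_spans(tokens, tags))
# → [('PER', 'Alan Turing'), ('LOC', 'Bletchley Park')]
```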

Section 13

ELMo vs Word2Vec vs BERT — The Comparison

Property	Word2Vec / GloVe	ELMo	BERT
Embedding type	Static (1 per word type)	Contextual (1 per token)	Contextual (1 per token)
Architecture	Shallow neural net	2-layer biLSTM	Transformer (12–24 layers)
Context direction	None (type-level)	Bidirectional (concat forward + backward)	Truly bidirectional (attention)
Handles polysemy	No — averages all meanings	Yes — context-dependent	Yes — deeply context-dependent
Input representation	Lookup table (word IDs)	Character CNN (handles OOV)	WordPiece subwords (partial OOV)
Fine-tuning	N/A (frozen embeddings)	Frozen biLM + learned task weights	Full model fine-tuning
Inference speed	Very fast (lookup)	Moderate (LSTM is sequential)	Slow (attention is O(n²))
Memory footprint	Small (embedding matrix)	Medium (~93M params)	Large (110M–340M params)
Best task category	Semantic similarity, word clustering	NER, coref, SRL, syntax tasks	QA, NLI, all NLP tasks at SOTA
Year released	2013 / 2014	2018 (Feb)	2018 (Oct)

🏆

When to Still Use ELMo in 2024

ELMo is not obsolete — it is underused in resource-constrained environments. Use ELMo when: (a) you need good contextual embeddings but cannot afford BERT's memory on edge devices; (b) you are working with domain-specific corpora and can pre-train a custom biLM cheaply; (c) you need fast character-level OOV handling for clinical text, code, or multi-lingual corpora with rare words; or (d) you need a strong, interpretable baseline before scaling to transformers.

Section 14

Training a Custom ELMo on Your Own Corpus

For domain-specific applications (medical notes, legal text, code, social media), a custom ELMo trained on your own data can substantially outperform the generic model. The bilm-tf library from the Allen Institute for AI, the original TensorFlow implementation, supports this.

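# ── Custom biLM training with bilm-tf ─────────────────────
# bilm-tf is installed from source: clone the allenai/bilm-tf
# repository and run `python setup.py install` inside it.
# It requires TF 1.x (e.g. tensorflow==1.15), so use a virtual environment.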

# ── Step 1: Prepare your corpus ───────────────────────────
# One sentence per line, tokens space-separated
# Save as train.txt and valid.txt

corpus_sample = """
The patient presented with acute myocardial infarction .
Left ventricular ejection fraction was measured at 45 percent .
Troponin levels were significantly elevated on admission .
The cardiologist recommended immediate percutaneous coronary intervention .
"""

with open("train.txt", "w") as f:
    f.write(corpus_sample.strip())

# ── Step 2: Create vocabulary ─────────────────────────────
# bilm-tf expects a vocabulary file (one token per line)
# Special tokens: <S>, </S>, <UNK> at the top

from collections import Counter

with open("train.txt") as f:
    tokens = f.read().split()

counts = Counter(tokens)

with open("vocab.txt", "w") as f:
    for special in ["<S>", "</S>", "<UNK>"]:
        f.write(special + "\n")
    for word, count in counts.most_common():
        if count >= 3:              # minimum frequency threshold
            f.write(word + "\n")

# ── Step 3: biLM options ──────────────────────────────────
import json

options = {
    "bidirectional": True,
    "char_cnn": {
        "activation": "relu",
        "embedding": {"dim": 16},
        "filters": [[1,32],[2,32],[3,64],[4,128],[5,256],
                    [6,512],[7,1024]],
        "max_characters_per_token": 50,
        "n_characters": 261,
        "n_highway": 2
    },
    "dropout": 0.1,
    "lstm": {
        "cell_clip": 3,
        "dim": 4096,
        "n_layers": 2,
        "proj_clip": 3,
        "projection_dim": 512,
        "use_skip_connections": True
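    },
    "all_clip_norm_val": 10.0,
    "n_epochs": 10,
    "n_train_tokens": 768648884,   # 1B Word Benchmark value; set to your corpus token count
    "batch_size": 128,
    "n_tokens_vocab": 793471,      # 1B Word Benchmark value; set to the size of vocab.txt
    "unroll_steps": 20,
    "n_negative_samples_batch": 8192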
}

with open("custom_elmo_options.json", "w") as f:
    json.dump(options, f, indent=2)

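print("Options saved. Run training with:")
print("  python bin/train_elmo.py --train_prefix train.txt \\")
print("    --vocab_file vocab.txt --save_dir ./custom_elmo_model")
print("Then export the weights with:")
print("  python bin/dump_weights.py --save_dir ./custom_elmo_model \\")
print("    --outfile ./custom_elmo_model/weights.hdf5")

OUTPUT

Options saved. Run training with:
  python bin/train_elmo.py --train_prefix train.txt \
    --vocab_file vocab.txt --save_dir ./custom_elmo_model
Then export the weights with:
  python bin/dump_weights.py --save_dir ./custom_elmo_model \
    --outfile ./custom_elmo_model/weights.hdf5

# Note: in the bilm-tf repository the training options are defined inside
# bin/train_elmo.py itself, so mirror the values from
# custom_elmo_options.json there before launching.
# After training and export complete, load your custom model:
# elmo = Elmo("custom_elmo_model/options.json",
#             "custom_elmo_model/weights.hdf5", 1, dropout=0.0)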
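The two corpus-dependent options, n_train_tokens and n_tokens_vocab, should describe your own data rather than the 1 Billion Word Benchmark. A small sketch for computing them is shown below; `corpus_stats` is a hypothetical helper, and it assumes the vocabulary size counts the three special tokens at the top of the vocabulary file.

```python
from collections import Counter

def corpus_stats(lines, min_count=1):
    """Count training tokens and build a frequency-sorted vocabulary."""
    counts = Counter(tok for line in lines for tok in line.split())
    vocab = ["<S>", "</S>", "<UNK>"] + [
        word for word, count in counts.most_common() if count >= min_count
    ]
    stats = {
        "n_train_tokens": sum(counts.values()),   # every token occurrence
        "n_tokens_vocab": len(vocab),             # special tokens included (assumption)
    }
    return stats, vocab

lines = [
    "The patient presented with acute myocardial infarction .",
    "Troponin levels were significantly elevated on admission .",
]
stats, vocab = corpus_stats(lines)
print(stats)
# → {'n_train_tokens': 16, 'n_tokens_vocab': 18}
```

On a small corpus a high minimum-frequency threshold removes almost every word, so the threshold is worth lowering until the corpus is large.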

Section 15

Key Use Cases — Where ELMo Excels

🏢

Named Entity Recognition

NER · Sequence Labelling

ELMo's syntactic Layer 1 and character CNN make it excellent at NER. The character CNN handles rare entity names and OOV tokens (e.g. "Pfizer-BioNTech"). Consistent 1–2% F1 improvement over Word2Vec baselines.

📖

Word Sense Disambiguation

WSD · Lexical Semantics

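Word sense is where ELMo's contextualisation is easiest to see. Layer 2's contextual representations natively encode word sense. "Bank" near "river" is distinguishable from "bank" near "deposit". In the original paper, a simple nearest-neighbour classifier on these representations was competitive with dedicated WSD systems, without hand-crafted features.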

👥

Coreference Resolution

NLP · Discourse

Understanding that "she" in sentence 3 refers to "Mary" in sentence 1 requires reading full context. ELMo's sentence-wide representations make this dramatically easier than static embeddings.

🔍

Question Answering

Reading Comprehension

ELMo boosted SQuAD scores by 4.7 F1 points in the original paper when added on top of a BiDAF reading comprehension model. The contextual embeddings help the model align question spans with passage spans.

⚖️

Semantic Role Labelling

SRL · Predicate-Argument

SRL requires understanding "who did what to whom". ELMo's multi-layer representations jointly capture syntactic structure (who is subject/object) and semantic roles (agent/patient). The original paper reported major gains on the OntoNotes (CoNLL-2012) SRL benchmark.

🌌

Domain Transfer (Medical/Legal)

Transfer Learning

Training a custom biLM on domain-specific corpora (PubMed for biomedical, case law for legal) creates ELMo representations that understand specialised vocabulary. BioELMo is one published variant.

Section 16

ELMo's Limitations — What It Cannot Do

⚠️

ELMo Is Not Truly Bidirectional

ELMo is often described as bidirectional, but it is only shallowly so: it runs a forward LSTM and a separate backward LSTM and concatenates their outputs. At any given token position, the forward pass has not seen the right context and the backward pass has not seen the left. They are combined after the fact, not fused during computation. BERT's transformer attention, by contrast, allows every token to attend to every other token in both directions within each layer. This deeper bidirectionality is one commonly cited reason why BERT outperforms ELMo on many benchmarks.
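The "combined after the fact" point can be checked with a toy recurrent network. The sketch below uses a plain tanh RNN with random weights as a simplified stand-in for an LSTM, and shares one set of weights between directions for brevity (ELMo uses two separate LSTMs). It changes only the last token of a sentence and looks at which hidden states move.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 6
W_in = rng.normal(size=(DIM, DIM)) * 0.5
W_rec = rng.normal(size=(DIM, DIM)) * 0.5

def run_rnn(inputs):
    """Plain tanh RNN; returns one hidden state per position."""
    state, states = np.zeros(DIM), []
    for x in inputs:
        state = np.tanh(W_in @ x + W_rec @ state)
        states.append(state)
    return np.stack(states)

def bi_states(inputs):
    forward = run_rnn(inputs)                  # position k sees tokens 0..k
    backward = run_rnn(inputs[::-1])[::-1]     # position k sees tokens k..end
    return forward, backward

sentence = rng.normal(size=(5, DIM))   # 5 toy token vectors
changed = sentence.copy()
changed[4] = rng.normal(size=DIM)      # alter only the LAST token

f1, b1 = bi_states(sentence)
f2, b2 = bi_states(changed)

print("forward state at position 2 unchanged:", np.allclose(f1[2], f2[2]))
print("backward state at position 2 unchanged:", np.allclose(b1[2], b2[2]))
```

The forward state at position 2 is blind to the change on its right; only the backward state reacts. The two views meet only when they are concatenated.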

🐢

Sequential Computation

Speed Limitation

LSTMs are sequential: token 5 cannot be computed until tokens 1–4 are done. This prevents parallelisation across the sequence dimension. Transformers process all tokens of a sequence in parallel, which typically makes them faster to train and run on GPUs. For long documents, ELMo's cost grows linearly with length while attention grows quadratically, but the parallelism of attention often keeps it faster in practice at typical sentence lengths.

✗ Not parallelisable across sequence length

📋

Fixed-Size Window

Long Context Weakness

LSTMs suffer from the vanishing gradient problem on long sequences. Dependencies that span many tokens, on the order of a hundred or more, become difficult to capture. For very long documents (legal contracts, books), ELMo loses important long-range context. Transformers with their attention mechanism handle long contexts better.

✗ Struggles with very long sequences

🔒

Frozen Backbone

Fine-tuning Limitation

In the standard ELMo pipeline, the pre-trained biLM is frozen: only the weighted combination scalars (s₀, s₁, s₂), the scale γ, and the task head are trained. BERT allows full end-to-end fine-tuning of all parameters, which usually yields better task performance because all weights adapt to the task.

✗ Cannot update the language model for the task

Section 17

Golden Rules for Using ELMo

🌟 ELMo — Non-Negotiable Best Practices

Always use the mask tensor. ELMo pads batches to the same length. The mask tells you which positions are real tokens vs padding. When averaging embeddings for sentence-level tasks, always divide by the count of real tokens (mask.sum(dim=1)), never by the padded length.

Match the layer to the task. For syntax-sensitive tasks (POS, NER, chunking), use or upweight lstm_outputs1 (Layer 1). For semantic tasks (WSD, sentiment, STS), use or upweight lstm_outputs2 (Layer 2). When in doubt, use the weighted sum and let the task learn the weights.

Pre-compute embeddings for non-sequential tasks. If your classifier takes fixed sentence vectors (not token sequences), run ELMo once offline and save the embeddings to disk. Recomputing ELMo at every training step is expensive and unnecessary when the biLM is frozen.

Use dropout on the ELMo output, not inside the biLM. Set dropout=0.0 when loading ELMo for inference (frozen). Add a nn.Dropout(0.2–0.5) layer between the ELMo output and your task-specific head. This prevents overfitting on small labelled datasets.

Normalise ELMo representations before concatenating with other features. ELMo vectors have L2 norms in the range of 10–20. If you concatenate them with hand-crafted features in the [0,1] range, the ELMo dimensions will dominate the loss. Apply nn.LayerNorm(1024) to ELMo outputs before concatenation.

Train a custom biLM if your domain is specialised. The generic ELMo was trained on the 1 Billion Word Benchmark, which is news text. For medical, legal, social media, or code domains, a custom biLM trained on in-domain text can significantly outperform the generic model, even with fewer parameters.

Benchmark against BERT before concluding ELMo is enough. ELMo is faster and lighter, but for tasks where accuracy is critical, BERT-base often outperforms ELMo by 2–5 points on standard benchmarks. Use ELMo when latency or memory is the constraint; use BERT when accuracy is the only metric.
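The pre-compute rule above is simple to apply with a small disk cache. In the sketch below, `cached_embeddings` and `fake_embed` are hypothetical names: `fake_embed` stands in for a frozen ELMo forward pass and just returns zeros, so the example runs without the model.

```python
import os
import tempfile
import numpy as np

def cached_embeddings(path, sentences, embed_fn):
    """Load embeddings from `path` (a .npy file) if present, else compute once and save."""
    if os.path.exists(path):
        return np.load(path)
    vectors = np.asarray(embed_fn(sentences))
    np.save(path, vectors)
    return vectors

calls = []

def fake_embed(sentences):
    """Stand-in for a frozen ELMo pass: one 1024-dim vector per sentence."""
    calls.append(len(sentences))
    return np.zeros((len(sentences), 1024))

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "elmo_train.npy")
    first = cached_embeddings(path, ["a b", "c d"], fake_embed)    # computes and saves
    second = cached_embeddings(path, ["a b", "c d"], fake_embed)   # loads from disk

print("embedding function was called", len(calls), "time(s)")
```

The cache is only valid while the biLM stays frozen and the sentence list is unchanged; a real pipeline would key the file name on both.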

Section 18

ELMo's Historical Impact

🌟 Historical Context

The Paper That Redefined NLP Transfer Learning

When "Deep Contextualized Word Representations" (Peters et al.) appeared on arXiv in February 2018, it achieved state-of-the-art results on 6 out of 6 NLP benchmarks it evaluated: SQuAD (question answering), SNLI (textual entailment), SRL (semantic role labelling), coreference resolution, NER, and sentiment analysis. All of this came in a single paper, using the same pre-trained model.

This was unprecedented. Each of those tasks had its own dedicated state-of-the-art system. ELMo swept them all by adding contextual embeddings to existing models. The NLP community called it a watershed moment — proof that pre-training a language model on unlabelled data could provide general-purpose linguistic knowledge transferable to any task.

Just eight months later, BERT (October 2018) extended this idea from LSTMs to Transformers, achieving even larger gains. Then came GPT-2, RoBERTa, T5, and eventually GPT-4 and Claude. Every modern LLM is a direct intellectual descendant of the contextualisation idea ELMo introduced.

📚

ELMo → BERT → GPT → Modern LLMs: The Direct Lineage

ELMo proved: contextual pre-training works. BERT improved: transformers are better than LSTMs for context. GPT-3 scaled: more data and parameters produce emergent capabilities. ChatGPT and Claude added: RLHF alignment makes models useful to people. ELMo is not a legacy tool — it is the conceptual origin of everything that followed.