Stemming vs Lemmatization

Section 01

The Story That Explains Text Normalisation

📖 Real World Analogy

The Library Catalogue Problem

Imagine you are the head librarian of a giant library with 10 million books. A student walks in asking for books about "running". But in your catalogue, some books are filed under "runs", others under "ran", some under "runner", and a few under "running".

Without a system to link these variants together, you will miss most of the relevant books. The student goes home empty-handed, even though the library is full of exactly what they need.

This is precisely the problem that Stemming and Lemmatization solve — they collapse word variants into a common root so that your NLP system can recognise that run, runs, ran, running, runner all refer to the same core idea. The question is: which approach does it well, and which does it cheaply?

When computers process text, they treat every unique string of characters as a different token. "study", "studies", and "studying" look like three completely different words. For tasks like search, sentiment analysis, or topic modelling, this fragmentation kills accuracy. The solution is text normalisation — reducing inflected or derived word forms to a common base form before any analysis begins.
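The fragmentation described above can be sketched with the standard library alone. The `BASE_FORMS` mapping below is a hand-written assumption for illustration, not a real normaliser; it only shows how three surface forms collapse into one count once they share a base form.

```python
from collections import Counter

# Hypothetical hand-written mapping, for illustration only.
BASE_FORMS = {"studies": "study", "studying": "study"}

def count_tokens(tokens, normalise=False):
    """Count tokens, optionally mapping each one to its base form first."""
    if normalise:
        tokens = [BASE_FORMS.get(t, t) for t in tokens]
    return Counter(tokens)

tokens = "study studies studying study".split()
raw = count_tokens(tokens)                     # three separate entries
merged = count_tokens(tokens, normalise=True)  # one shared entry
print(len(raw), len(merged))
```

Without normalisation the evidence for "study" is split across three counters; with it, all four occurrences land on the same token.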

🌿

Two Philosophies, One Goal

Stemming uses fast heuristic rules to chop off word endings — it's crude but lightning-fast. Lemmatization uses linguistic knowledge and vocabulary lookup to return the dictionary root (the lemma) — it's accurate but requires more computation. Both serve the same purpose: mapping word variants to a shared token. The right choice depends on your speed, accuracy, and language requirements.

Section 02

Stemming — The Fast Axe

📖 Story

The Hasty Woodcutter

A woodcutter is given a stack of branches of different lengths and told to make them all the same size. He picks up an axe and just starts chopping from the right end of every branch — same number of centimetres, no measuring, no checking. It's fast. Most branches end up a similar length. But some are left too short ("university" → "univers"), and a few weird ones look nothing like what you'd expect ("argue" → "argu"). The job is done fast, but not perfectly. That woodcutter is a stemmer.

A stemmer applies a fixed sequence of pattern-based rules to strip suffixes (and sometimes prefixes) from words. It does not consult a dictionary. It does not understand grammar. It simply applies rules like: "if a word ends in -ing, remove it." The output — called a stem — is often not a real word at all.
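To make the idea concrete, here is a deliberately tiny suffix stripper. It is a sketch of the general technique, not the Porter algorithm: the rule list `TOY_RULES` and the `min_stem_len` guard are assumptions chosen for illustration.

```python
# Hypothetical rule list: (suffix, replacement), tried in order.
TOY_RULES = [("ing", ""), ("ed", ""), ("s", "")]

def toy_stem(word, min_stem_len=3):
    """Strip the first matching suffix if enough of the word remains."""
    for suffix, replacement in TOY_RULES:
        if word.endswith(suffix):
            stem = word[: len(word) - len(suffix)] + replacement
            if len(stem) >= min_stem_len:
                return stem
    return word

for w in ["walking", "jumped", "cats", "running", "sing"]:
    print(w, "->", toy_stem(w))
```

Note that "running" comes out as "runn", which is not a word: with no dictionary to consult, the rules cannot know that the doubled consonant should also go. Real stemmers add further rules for cases like this, but the output is still a stem rather than a guaranteed word.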

[Animation: how stemming works. No dictionary is needed, and stems are often not valid English words.]

The Porter Stemmer — The Classic Algorithm

The Porter Stemmer (1980) is the most widely used stemming algorithm. It applies five sequential phases of suffix-stripping rules in order. Each phase applies the longest matching suffix rule it finds.

🔧 Porter Stemmer — Phase Pipeline

Phase 1a

Remove plural suffixes: -sses → ss, -ies → i, -ss → ss, -s → ε

Phase 1b

Remove -eed → ee if stem is non-trivial; -ing / -ed if stem contains a vowel

Phase 1c

Replace terminal -y with -i if stem has a vowel: happy → happi

Phase 2–3

Strip longer derivational suffixes: -ational → ate, -fulness → ful, -ization → ize

Phase 4

Remove common endings: -ance, -ism, -able, -ous, -ive, -ize

Phase 5

Tidy up: remove trailing -e in certain conditions; reduce -ll → l
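As a rough illustration of how one phase works, the Phase 1a rules listed above can be written as an ordered suffix table. This is a sketch of that single phase only; the function name `porter_step_1a` is made up here, and library implementations may add extra conditions.

```python
def porter_step_1a(word):
    """Apply the Phase 1a plural rules; the longest matching suffix wins."""
    # Ordered longest first so that "-sses" is tried before "-ss" and "-s".
    rules = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]
    for suffix, replacement in rules:
        if word.endswith(suffix):
            return word[: -len(suffix)] + replacement
    return word

for w in ["caresses", "ponies", "caress", "cats", "run"]:
    print(w, "->", porter_step_1a(w))
```

The seemingly pointless rule -ss → ss matters: it matches first for words like "caress" and so stops the plain -s rule from stripping a letter that is part of the stem.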

from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer

# The stemmers are purely rule-based, so no NLTK data download is needed.
words = [
    'running', 'studies', 'happily', 'arguing',
    'generous', 'generation', 'electricity', 'caring'
]

porter    = PorterStemmer()
snowball  = SnowballStemmer('english')
lancaster = LancasterStemmer()

print(f"{'Word':15} {'Porter':12} {'Snowball':12} {'Lancaster'}")
print("-" * 52)
for word in words:
    p   = porter.stem(word)
    s   = snowball.stem(word)
    lan = lancaster.stem(word)
    print(f"{word:15} {p:12} {s:12} {lan}")

OUTPUT

Word            Porter       Snowball     Lancaster
----------------------------------------------------
running         run          run          run
studies         studi        studi        study
happily         happili      happili      happy
arguing         argu         argu         argu
generous        gener        generous     gen
generation      gener        generat      gen
electricity     electr       electr       elect
caring          care         care         car

⚠️

Stemming Errors — Two Types

Over-stemming (False Positives): Different words collapse to the same stem — general, generate, and generous can all become gener, creating false matches.
Under-stemming (False Negatives): Related words are not merged — alumnus and alumni may remain separate even though they refer to the same concept.
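One way to spot both error types is to group words by the stem they produce. The helper below is a minimal sketch that accepts any stemming function; the stems in `EXAMPLE_STEMS` are hard-coded for illustration (the "gener" stem comes from the text above, the two alumn* stems are assumed rather than computed here).

```python
from collections import defaultdict

def group_by_stem(words, stem_fn):
    """Map each stem to the list of input words that produced it."""
    groups = defaultdict(list)
    for word in words:
        groups[stem_fn(word)].append(word)
    return dict(groups)

# Hard-coded example stems; in practice pass stemmer.stem as stem_fn instead.
EXAMPLE_STEMS = {
    "general": "gener", "generate": "gener", "generous": "gener",
    "alumnus": "alumnu", "alumni": "alumni",
}

groups = group_by_stem(EXAMPLE_STEMS, EXAMPLE_STEMS.get)
for stem, members in groups.items():
    print(stem, members)
```

A group containing unrelated words signals over-stemming; related words that end up in different groups signal under-stemming.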

Section 03

Lemmatization — The Precise Linguist

📖 Story

The Expert Translator

Now imagine, instead of a woodcutter, you hire an expert linguist. When given the word "ran", she thinks: "This is the past tense of 'run' — its lemma is 'run'." When given "better", she thinks: "This is the comparative form of 'good' — its lemma is 'good'." She looks the word up in her mental dictionary, understands its grammatical role, and returns the true canonical form.

The result is always a real word that exists in the dictionary. It takes her longer than the woodcutter, but every answer is linguistically correct. That expert linguist is a lemmatizer.

Lemmatization returns the lemma — the dictionary form of a word. It uses a morphological analysis combined with a lexical database (like WordNet) to determine the base form correctly. Crucially, a lemmatizer often needs to know the Part-of-Speech (POS) of a word to return the right lemma.

[Animation: how lemmatization works. The lemmatizer consults WordNet and uses POS tags, and returns valid dictionary words.]

import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

nltk.download(['wordnet', 'averaged_perceptron_tagger'], quiet=True)

lemmatizer = WordNetLemmatizer()

# Map Penn Treebank POS tags → WordNet POS codes
def get_wordnet_pos(treebank_tag):
    if treebank_tag.startswith('J'): return wordnet.ADJ
    elif treebank_tag.startswith('V'): return wordnet.VERB
    elif treebank_tag.startswith('N'): return wordnet.NOUN
    elif treebank_tag.startswith('R'): return wordnet.ADV
    else: return wordnet.NOUN  # default to noun

words = [
    ('running', 'VBG'), ('studies', 'NNS'), ('better', 'JJR'),
    ('drove', 'VBD'),    ('geese', 'NNS'),   ('caring', 'VBG')
]

print(f"{'Word':12} {'POS':6} {'Lemma'}")
print("-" * 32)
for word, pos in words:
    wn_pos = get_wordnet_pos(pos)
    lemma  = lemmatizer.lemmatize(word, pos=wn_pos)
    print(f"{word:12} {pos:6} {lemma}")

OUTPUT

Word         POS    Lemma
--------------------------------
running      VBG    run
studies      NNS    study
better       JJR    good     ← irregular adjective!
drove        VBD    drive    ← irregular verb!
geese        NNS    goose    ← irregular plural!
caring       VBG    care

✅

Why POS Tagging Matters So Much

The word "saw" has two completely different lemmas depending on its role: as a verb (to see), its lemma is see; as a noun (a cutting tool), its lemma is saw. A lemmatizer without POS context will pick the wrong one. Always tag before lemmatizing.
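The effect of POS context can be shown with a toy lookup keyed on both the word and its tag. This is only a sketch: `LEXICON` is a hypothetical three-entry dictionary, whereas a real lemmatizer draws on a full lexical database.

```python
# Hypothetical mini-lexicon keyed by (surface form, POS tag).
LEXICON = {
    ("saw", "VERB"): "see",
    ("saw", "NOUN"): "saw",
    ("better", "ADJ"): "good",
}

def toy_lemmatize(word, pos="NOUN"):
    """Look up (word, pos); fall back to the lowercased word itself."""
    key = (word.lower(), pos)
    return LEXICON.get(key, word.lower())

print(toy_lemmatize("saw", "VERB"))    # verb reading
print(toy_lemmatize("saw"))            # default noun reading
print(toy_lemmatize("better"))         # no ADJ tag, so no match
print(toy_lemmatize("better", "ADJ"))
```

Because the tag is part of the lookup key, the same string can map to different lemmas, and a missing or wrong tag silently returns the unhelpful fallback.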

Section 04

Side-by-Side Comparison — Word by Word

The table below shows the same words processed by Porter Stemmer vs WordNet Lemmatizer. Notice where the stemmer breaks words into non-dictionary tokens and where the lemmatizer shines with irregular forms.

Original Word	Porter Stem	Lemma (with POS)	Stem = Real Word?	Notes
running	run	run	YES	Both agree here
studies	studi	study	NO	Stemmer strips incorrectly
happily	happili	happily	NO	Adverbs poorly handled
better	better	good	YES (not merged)	Lemmatizer handles irregulars
drove	drove	drive	YES (not merged)	Stemmer misses irregular verbs
geese	gees	goose	NO	Stemmer fails on irregular plurals
generalise	generalis	generalise	NO	Stemmer over-chops
university	univers	university	NO	Classic stemmer failure
caring	care	care	YES	Both agree
arguing	argu	argue	NO	Stemmer drops final 'e' incorrectly

Section 05

The Three Stemmer Variants

Not all stemmers are created equal. Three algorithms dominate NLP practice, each with a different trade-off between aggression and accuracy.

🔨

Porter Stemmer

nltk.stem.PorterStemmer

The original (1980). Moderate aggression. Well-studied and widely benchmarked. Best for English. Produces results like gener from generation.

✔ Balanced, well-documented

✘ Some over/under-stemming

❄️

Snowball Stemmer

nltk.stem.SnowballStemmer

The English Snowball stemmer is also known as Porter2. It refines Porter's rules, and the Snowball family covers 13+ languages including French, German, and Spanish. Slightly more accurate than Porter. The practical default choice for multilingual projects.

✔ Multi-lingual, improved logic

✘ Still heuristic — not perfect

⚔️

Lancaster Stemmer

nltk.stem.LancasterStemmer

The most aggressive. Strips words down hard — generous becomes gen. High speed, but very lossy. Use only when maximum collapse is desired and readability doesn't matter.

✔ Maximum compression

✘ Often illegible outputs

Section 06

Animated: Stemming vs Lemmatization Flow

[Animation: processing pipeline comparison. Both paths start with raw text; lemmatization needs an extra POS tagging step but produces valid words.]

Section 07

SpaCy Lemmatization — The Modern Approach

While NLTK's WordNet lemmatizer is the classic teaching tool, spaCy is a common choice for production NLP. Its lemmatizer is integrated into the processing pipeline and uses POS context automatically, so no manual tagging is needed.

import spacy

# Load English model — run: python -m spacy download en_core_web_sm
nlp = spacy.load('en_core_web_sm')

sentences = [
    "The geese were flying better than they drove last year.",
    "She is caring for her studies very happily.",
    "The runners are running faster than ever before."
]

for sent in sentences:
    doc = nlp(sent)
    print(f"\nSentence: {sent}")
    print(f"{'Token':15} {'POS':8} {'Lemma'}")
    print("-" * 38)
    for token in doc:
        if not token.is_punct and not token.is_space:
            print(f"{token.text:15} {token.pos_:8} {token.lemma_}")

OUTPUT

Sentence: The geese were flying better than they drove last year.
Token           POS      Lemma
--------------------------------------
The             DET      the
geese           NOUN     goose
were            AUX      be
flying          VERB     fly
better          ADV      well
than            SCONJ    than
they            PRON     they
drove           VERB     drive
last            ADJ      last
year            NOUN     year

🔑

spaCy vs NLTK Lemmatizer

spaCy uses a rule-based + lookup table approach by default, while NLTK uses WordNet's morphology database. spaCy is faster in production because it processes the entire pipeline in a single pass. NLTK is more flexible for experimentation. For production text processing, prefer spaCy; for learning and research, NLTK is excellent.

Section 08

Real Pipeline — Full Text Processing

In practice, neither stemming nor lemmatization works alone. They are always part of a broader NLP preprocessing pipeline. Here is a realistic pipeline for both approaches on real text.

Lowercasing

Convert all characters to lowercase. "Running" = "running". Essential first step so "The" and "the" are not treated as different tokens.

Tokenisation

Split text into individual tokens (words, punctuation). "don't" → ["do", "n't"]. Use nltk.word_tokenize or spaCy's tokeniser.

Stopword Removal

Remove high-frequency words with low semantic value: "the", "is", "a", "in". They add noise without contributing meaning to most tasks.

POS Tagging (Lemmatization only)

Before lemmatizing, tag each token with its grammatical role. "running" is a VERB in "She was running" (lemma run) but a NOUN in "Running is fun" (lemma running), so the lemma differs.

Stemming / Lemmatization

Apply the chosen normalisation. Stemming: call stemmer.stem(token). Lemmatization: call lemmatizer.lemmatize(token, pos=wn_pos).

Downstream Task

Feed the normalised tokens into TF-IDF, bag-of-words, word embeddings, or any NLP model. The normalised vocabulary is smaller and cleaner.

import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords, wordnet
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk import pos_tag

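# 'punkt_tab' and 'averaged_perceptron_tagger_eng' are the resource names newer NLTK releases look for
nltk.download(['punkt', 'punkt_tab', 'stopwords', 'wordnet',
               'averaged_perceptron_tagger', 'averaged_perceptron_tagger_eng'], quiet=True)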

text = "The scientists were studying the rapidly changing climates and their effects on species."

stop_words = set(stopwords.words('english'))
stemmer    = PorterStemmer()
lemmatizer = WordNetLemmatizer()

def get_wn_pos(tag):
    return {'J': wordnet.ADJ, 'V': wordnet.VERB, 'N': wordnet.NOUN, 'R': wordnet.ADV}.get(tag[0], wordnet.NOUN)

# Step 1: Lowercase + Tokenise
tokens = word_tokenize(text.lower())

# Step 2: Remove stopwords + non-alpha
tokens = [t for t in tokens if t.isalpha() and t not in stop_words]

# Step 3: Stemming pipeline
stems  = [stemmer.stem(t) for t in tokens]

# Step 4: Lemmatization pipeline (needs POS)
tagged  = pos_tag(tokens)
lemmas  = [lemmatizer.lemmatize(t, get_wn_pos(pos)) for t, pos in tagged]

print("Original tokens:", tokens)
print("Stems:          ", stems)
print("Lemmas:         ", lemmas)

OUTPUT

Original tokens: ['scientists', 'studying', 'rapidly', 'changing', 'climates', 'effects', 'species']
Stems:           ['scientist', 'studi', 'rapidli', 'chang', 'climat', 'effect', 'speci']
Lemmas:          ['scientist', 'study', 'rapidly', 'change', 'climate', 'effect', 'specie']

Section 09

When to Use What — Decision Framework

📖 Story

The Speed-Accuracy Trade-Off at the Airport

Imagine two passport control queues at an airport. Queue A has an officer who barely glances at your passport and waves you through in 3 seconds — fast, mostly fine, but occasionally waves through an expired document. Queue B has an officer who carefully reads every detail, cross-references databases, and confirms your identity perfectly — it takes 25 seconds per person.

For a low-security domestic terminal, Queue A is fine. For an international flight where errors cost real consequences, Queue B is non-negotiable.

Stemming is Queue A. Lemmatization is Queue B. Choose based on the consequences of error in your task.

⚡

Use Stemming When

Speed is critical (real-time search)
Large corpora where lemmatization is too slow
Simple keyword matching or TF-IDF pipelines
Information retrieval at scale
When you don't need interpretable tokens
Quick-and-dirty prototypes

🧠

Use Lemmatization When

Accuracy matters (chatbots, Q&A systems)
Output tokens need to be human-readable
Domain-specific language (medical, legal)
Sentiment analysis where word meaning matters
Small-to-medium corpora where speed is OK
Multi-class text classification

🚫

Use Neither When

Using contextual embeddings (BERT, GPT)
Named Entity Recognition (NER) tasks
Machine translation pipelines
Tasks where word form itself carries meaning
Subword tokenization models (BPE, WordPiece)
Code / structured text processing
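The three lists above can be condensed into a small helper. This is one possible encoding of the framework, not a definitive rule: the function name and its three boolean parameters are assumptions made for illustration, and real projects will weigh more factors.

```python
def choose_normaliser(uses_subword_model, needs_readable_tokens, latency_critical):
    """Return 'none', 'lemmatization', or 'stemming' for a given task profile."""
    if uses_subword_model:
        # Contextual / subword models are meant to see raw text.
        return "none"
    if needs_readable_tokens:
        # Human-facing output needs real dictionary words.
        return "lemmatization"
    if latency_critical:
        # Speed outweighs the occasional bad merge.
        return "stemming"
    # With no strong constraint, favour accuracy.
    return "lemmatization"

print(choose_normaliser(True, False, False))
print(choose_normaliser(False, True, True))
print(choose_normaliser(False, False, True))
```

The order of the checks encodes the priorities: the model type overrides everything, readability overrides speed, and accuracy is the fallback.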

Section 10

Comprehensive Comparison Table

Property	Stemming	Lemmatization
Definition	Heuristic suffix-stripping to produce a stem	Dictionary-based morphological analysis to produce the lemma
Output	May not be a real word (studi, argu)	Always a valid dictionary word (study, argue)
Technique	Rule-based pattern matching (regex-like)	Morphological analysis + lexical database lookup
Needs POS?	NO	YES (for correct results)
Speed	Very fast — O(n) string operations	Slower — database lookups required
Handles Irregulars?	NO — "drove" stays "drove"	YES — "drove" → "drive"
Over-stemming Risk	HIGH — different words may collide	NONE — linguistically grounded
Language Support	Good (Snowball: 13+ languages)	Good (spaCy: 60+ languages)
Memory Required	Minimal — just rules in code	Moderate — WordNet / model required
Best For	Search engines, IR, fast pipelines	Chatbots, classification, human-facing NLP
Python Library	nltk.stem.PorterStemmer	nltk / spaCy lemmatizer

Section 11

Visualising Impact on Vocabulary Size

[Animated bar chart: vocabulary reduction]

Stemming reduces vocabulary more aggressively because it collapses more variants (sometimes incorrectly). Lemmatization reduces less but keeps only semantically valid distinctions.

Section 12

Edge Cases and Pitfalls

🚨 STEMMING PITFALLS

Input	Stem	Problem
wander	wand	Completely wrong meaning (Lancaster; Porter leaves wander intact)
universe	univers	Not a real word
general	gener	Same as generate, generous
news	new	Plural stripped incorrectly
data	data	No change — correct here
operational	oper	Over-truncated

✅ LEMMATIZATION EDGE CASES

Input	POS	Lemma
saw	VERB	see ← correct
saw	NOUN	saw ← also correct
better	ADJ	good ← irregular
am / is / are	VERB	be ← all unified
mice	NOUN	mouse ← correct
corpora	NOUN	corpus ← Latin plural

⚠️

The "Wander → Wand" Problem

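Aggressive suffix stripping can turn wander into wand: Lancaster-style rules remove the -er ending, whereas Porter leaves wander unchanged because the stem wand is too short to satisfy its condition for removing -er. Once that happens, a document about hiking ("we wandered through the forest") matches searches for magic wands. This kind of collision can poison a search index and is one of the strongest arguments for choosing lemmatization in production systems where precision matters.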
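The collision can be reproduced with a toy inverted index. Everything here is a sketch under stated assumptions: `aggressive_stem` is a deliberately crude rule invented for this example (it is not Porter or Lancaster), and the two documents are made up.

```python
from collections import defaultdict

def aggressive_stem(word):
    """Deliberately crude toy rule: strip -ered, -er, or -s."""
    for suffix in ("ered", "er", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def build_index(docs, stem_fn):
    """Inverted index: normalised token -> set of document ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[stem_fn(token)].add(doc_id)
    return index

docs = {
    "hiking": "we wandered through the forest",
    "magic": "the shop sells wands",
}

stemmed_index = build_index(docs, aggressive_stem)
exact_index = build_index(docs, lambda t: t)

print(sorted(stemmed_index[aggressive_stem("wands")]))  # both documents match
print(sorted(exact_index["wands"]))                     # only the magic shop
```

With the aggressive rule, "wandered" and "wands" share the key "wand", so a query for wands also retrieves the hiking document.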

Section 13

Production-Ready Code — Full NLP Pipeline

import re
import spacy
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

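# Load spaCy model; NER and the parser are not needed for lemmatization
nlp        = spacy.load('en_core_web_sm', disable=['ner', 'parser'])
stemmer    = PorterStemmer()
stop_words = nlp.Defaults.stop_words  # reuse spaCy's list so both pipelines filter alike

# ─── Stemming pipeline ────────────────────────────────────
def preprocess_stem(texts):
    cleaned = []
    for text in texts:
        text   = re.sub(r'[^a-zA-Z\s]', '', text.lower())
        # Drop stopwords and very short tokens before stemming
        tokens = [t for t in text.split() if t not in stop_words and len(t) > 2]
        stems  = [stemmer.stem(t) for t in tokens]
        cleaned.append(' '.join(stems))
    return cleaned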

# ─── Lemmatization pipeline ───────────────────────────────
def preprocess_lemma(texts):
    cleaned = []
    for doc in nlp.pipe(texts, batch_size=50):
        lemmas = [
            token.lemma_.lower()
            for token in doc
            if not token.is_stop and not token.is_punct
            and token.is_alpha and len(token) > 2
        ]
        cleaned.append(' '.join(lemmas))
    return cleaned

# ─── Example corpus ───────────────────────────────────────
corpus = [
    "Scientists are studying the effects of climate change on polar bears.",
    "The study shows that temperature changes affect the species rapidly.",
    "Researchers studied how rising temperatures impact ecosystems globally.",
]

stem_corpus  = preprocess_stem(corpus)
lemma_corpus = preprocess_lemma(corpus)

# Build TF-IDF vocabularies
stem_vect  = TfidfVectorizer().fit(stem_corpus)
lemma_vect = TfidfVectorizer().fit(lemma_corpus)

print("Stem vocabulary: ",  sorted(stem_vect.vocabulary_.keys()))
print("Lemma vocabulary:", sorted(lemma_vect.vocabulary_.keys()))

OUTPUT

Stem vocabulary:  ['affect', 'bear', 'chang', 'climat', 'ecosystem', 'effect', 'global', 'impact', 'polar', 'rapidli', 'research', 'rise', 'scientist', 'show', 'speci', 'studi', 'temperatur']
Lemma vocabulary: ['affect', 'bear', 'change', 'climate', 'ecosystem', 'effect', 'globally', 'impact', 'polar', 'rapidly', 'researcher', 'rise', 'scientist', 'show', 'species', 'study', 'temperature']

🎯

Both Unify the Corpus — But Lemmas Are Readable

Both pipelines merge the corpus's study variants (study, studied, studying) into a single token. The key difference: the stemmed vocabulary contains studi, rapidli, and temperatur, which are not real words, while the lemmatized vocabulary contains study, rapidly, and temperature. For any human-inspectable pipeline, lemmas win every time.

Section 14

Golden Rules

🌿 Stemming vs Lemmatization — Non-Negotiable Rules

Always provide POS tags to your lemmatizer. Without them, NLTK's WordNetLemmatizer treats every word as a noun. The word "better" without a POS tag returns better; with ADJ, it correctly returns good.

Do stemming before indexing, never before classification. Stemming destroys readability and semantics. For search indices, the loss is acceptable. For sentiment analysis or text classification, mangled tokens can confuse your model.

Never apply stemming or lemmatization before BERT, GPT, or transformer models. These models use subword tokenization (BPE/WordPiece) and are trained on raw text. Pre-stemming destroys information that the model's attention mechanism relies on.

Benchmark on your own data. The accuracy figures quoted for stemmers (typically 75–85% for Porter) are measured on general English corpora and may not carry over. Your domain (medical, legal, financial) may have very different morphological patterns, where a custom lemmatizer can massively outperform a generic stemmer.

Snowball is almost always better than Porter. It's more accurate, supports multiple languages, and is just as fast. Default to Snowball when you need a stemmer. Use Porter only when reproducing older research that used it specifically.

For production, prefer spaCy over NLTK's lemmatizer. spaCy processes POS tagging and lemmatization in a single pipeline pass, batch-processes documents efficiently, and has model support for 60+ languages. NLTK is excellent for learning; spaCy is built for shipping.

Always measure vocabulary reduction. Check how many unique tokens your preprocessing step produces before and after normalisation. If stemming and lemmatization produce the same vocabulary size, pick lemmatization — you get the same compression with readable outputs.
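A minimal way to take that measurement is to compare the number of unique tokens before and after normalisation. The helper below is a sketch: `toy_normalise` and its `TOY_FORMS` table are hypothetical stand-ins, and in a real pipeline you would pass `stemmer.stem` or a lemmatizer call instead.

```python
def vocabulary_reduction(tokens, normalise):
    """Return (unique tokens before, unique tokens after, fraction removed)."""
    before = len(set(tokens))
    after = len({normalise(t) for t in tokens})
    reduction = (1 - after / before) if before else 0.0
    return before, after, reduction

# Hypothetical stand-in normaliser, for illustration only.
TOY_FORMS = {"studies": "study", "studied": "study", "changes": "change"}

def toy_normalise(token):
    return TOY_FORMS.get(token, token)

tokens = ["study", "studies", "studied", "change", "changes", "climate"]
before, after, reduction = vocabulary_reduction(tokens, toy_normalise)
print(before, after, f"{reduction:.0%}")
```

Running the same function once with a stemmer and once with a lemmatizer on your own corpus gives the side-by-side numbers this rule asks for.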