
Stemming vs Lemmatization

A comprehensive, beginner-to-advanced tutorial covering how stemming and lemmatization work, why they differ, and when to use each — with animated SVG diagrams, real Python code (NLTK + spaCy), worked examples, pitfalls, and production-ready pipeline code.

Section 01

The Story That Explains Text Normalisation

The Library Catalogue Problem
Imagine you are the head librarian of a giant library with 10 million books. A student walks in asking for books about "running". But in your catalogue, some books are filed under "runs", others under "ran", some under "runner", and a few under "running".

Without a system to link these variants together, you will miss most of the relevant books. The student goes home empty-handed, even though the library is full of exactly what they need.

This is precisely the problem that Stemming and Lemmatization solve — they collapse word variants into a common root so that your NLP system can recognise that run, runs, ran, running, runner all refer to the same core idea. The question is: which approach does it well, and which does it cheaply?

When computers process text, they treat every unique string of characters as a different token. "study", "studies", and "studying" look like three completely different words. For tasks like search, sentiment analysis, or topic modelling, this fragmentation kills accuracy. The solution is text normalisation — reducing inflected or derived word forms to a common base form before any analysis begins.
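To make the fragmentation concrete, here is a minimal sketch in plain Python. The `BASE_FORMS` mapping is a hand-written, hypothetical stand-in for a real stemmer or lemmatizer; it only illustrates how the vocabulary shrinks once variants share a token.

```python
from collections import Counter

# Toy corpus: three surface forms of the same word plus one unrelated word.
tokens = ["study", "studies", "studying", "study", "exam"]

# Hypothetical hand-written mapping standing in for a real normaliser.
BASE_FORMS = {"studies": "study", "studying": "study"}

raw_counts = Counter(tokens)
normalised_counts = Counter(BASE_FORMS.get(t, t) for t in tokens)

print(len(raw_counts), dict(raw_counts))                # 4 distinct tokens
print(len(normalised_counts), dict(normalised_counts))  # 2 distinct tokens
```

Before normalisation the evidence for "study" is split across three counters; afterwards all four occurrences land on one token.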

🌿
Two Philosophies, One Goal

Stemming uses fast heuristic rules to chop off word endings — it's crude but lightning-fast. Lemmatization uses linguistic knowledge and vocabulary lookup to return the dictionary root (the lemma) — it's accurate but requires more computation. Both serve the same purpose: mapping word variants to a shared token. The right choice depends on your speed, accuracy, and language requirements.


Section 02

Stemming — The Fast Axe

The Hasty Woodcutter
A woodcutter is given a stack of branches of different lengths and told to make them all the same size. He picks up an axe and just starts chopping from the right end of every branch — same number of centimetres, no measuring, no checking. It's fast. Most branches end up a similar length. But some are left too short ("university" → "univers"), and a few weird ones look nothing like what you'd expect ("argue" → "argu"). The job is done fast, but not perfectly. That woodcutter is a stemmer.

A stemmer applies a fixed sequence of pattern-based rules to strip suffixes (and sometimes prefixes) from words. It does not consult a dictionary. It does not understand grammar. It simply applies rules like: "if a word ends in -ing, remove it." The output — called a stem — is often not a real word at all.
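A minimal sketch of that idea follows, assuming a three-rule table invented for illustration (`TOY_RULES` and `toy_stem` are hypothetical names, not part of any library, and this is not the Porter rule set). It shows how rule order and a minimum stem length are the only "intelligence" involved.

```python
# Ordered (suffix, replacement) rules; the first matching rule wins.
# These three rules are illustrative only, not the Porter rule set.
TOY_RULES = [
    ("ies", "i"),
    ("ing", ""),
    ("ly", ""),
]

def toy_stem(word: str, min_stem_len: int = 3) -> str:
    """Strip the first matching suffix, keeping at least min_stem_len characters."""
    for suffix, replacement in TOY_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: len(word) - len(suffix)] + replacement
    return word

for w in ["studies", "arguing", "happily", "sing", "cat"]:
    print(f"{w:10} -> {toy_stem(w)}")
```

The length guard is what stops "sing" from collapsing to "s"; real stemmers use more careful conditions, but the principle is the same.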

Figure: Stemming, how it works (simplified illustration). Input words: running, studies, happily, arguing, caring. Suffix rules: -ing → ε, -ies → i, -ly → ε. Stems (output): run, studi (not a word), happi (not a word), argu (not a word), care.

⚡ No dictionary needed — stems are often not valid English words.

The Porter Stemmer — The Classic Algorithm

The Porter Stemmer (1980) is the most widely used stemming algorithm. It applies five sequential phases of suffix-stripping rules in order. Each phase applies the longest matching suffix rule it finds.

🔧 Porter Stemmer — Phase Pipeline
Phase 1a
Remove plural suffixes: -sses → ss, -ies → i, -ss → ss, -s → ε
Phase 1b
Remove -eed → ee if stem is non-trivial; -ing / -ed if stem contains a vowel
Phase 1c
Replace terminal -y with -i if stem has a vowel: happy → happi
Phase 2–3
Strip longer derivational suffixes: -ational → ate, -fulness → ful, -ization → ize
Phase 4
Remove common endings: -ance, -ism, -able, -ous, -ive, -ize
Phase 5
Tidy up: remove trailing -e in certain conditions; reduce -ll → l
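As a rough sketch of how a single phase looks in code, Phase 1a can be written as a chain of checks that tries the longest suffix first. The function name `porter_step_1a` is chosen here for illustration; this is a simplified reading of the rules listed above, not NLTK's implementation, which adds further special cases.

```python
def porter_step_1a(word: str) -> str:
    """Apply the Phase 1a plural rules, longest matching suffix first."""
    if word.endswith("sses"):
        return word[:-2]   # sses -> ss
    if word.endswith("ies"):
        return word[:-2]   # ies -> i
    if word.endswith("ss"):
        return word        # ss -> ss (unchanged)
    if word.endswith("s"):
        return word[:-1]   # s -> (empty)
    return word

for w in ["caresses", "ponies", "caress", "cats"]:
    print(f"{w:10} -> {porter_step_1a(w)}")
```

The order matters: checking "-s" before "-ss" would wrongly turn "caress" into "cares".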
import nltk
from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer

nltk.download('punkt', quiet=True)

words = [
    'running', 'studies', 'happily', 'arguing',
    'generous', 'generation', 'electricity', 'caring'
]

porter    = PorterStemmer()
snowball  = SnowballStemmer('english')
lancaster = LancasterStemmer()

print(f"{'Word':15} {'Porter':12} {'Snowball':12} {'Lancaster'}")
print("-" * 52)
for word in words:
    p = porter.stem(word)
    s = snowball.stem(word)
    lan = lancaster.stem(word)  # avoid the ambiguous single-letter name "l"
    print(f"{word:15} {p:12} {s:12} {lan}")
OUTPUT
Word            Porter       Snowball     Lancaster
----------------------------------------------------
running         run          run          run
studies         studi        studi        study
happily         happili      happili      happy
arguing         argu         argu         argu
generous        gener        generous     gen
generation      gener        generat      gen
electricity     electr       electr       elect
caring          care         care         car
⚠️
Stemming Errors — Two Types

Over-stemming (False Positives): Different words collapse to the same stem — general, generate, and generous can all become gener, creating false matches.
Under-stemming (False Negatives): Related words are not merged — alumnus and alumni may remain separate even though they refer to the same concept.
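One way to spot both error types is to group a word list by stem and inspect the groups. The sketch below assumes a toy stemmer (`toy_derivational_stem`, a hypothetical function written for this example) so it runs without any library; in practice you would pass a real stemmer's `stem` method as `stem_fn`.

```python
from collections import defaultdict

def toy_derivational_stem(word):
    """Hypothetical stemmer: strip one of a few derivational suffixes."""
    for suffix in ("ate", "ous", "al"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[: len(word) - len(suffix)]
    return word

def group_by_stem(words, stem_fn):
    """Map each stem to the list of words that produced it."""
    groups = defaultdict(list)
    for w in words:
        groups[stem_fn(w)].append(w)
    return dict(groups)

words = ["general", "generate", "generous", "alumnus", "alumni"]
groups = group_by_stem(words, toy_derivational_stem)
for stem, members in groups.items():
    print(f"{stem:10} <- {members}")
```

A group containing unrelated words (general, generate, generous under gener) signals over-stemming; related words landing in separate groups (alumnus and alumni) signal under-stemming.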


Section 03

Lemmatization — The Precise Linguist

The Expert Translator
Now imagine, instead of a woodcutter, you hire an expert linguist. When given the word "ran", she thinks: "This is the past tense of 'run' — its lemma is 'run'." When given "better", she thinks: "This is the comparative form of 'good' — its lemma is 'good'." She looks the word up in her mental dictionary, understands its grammatical role, and returns the true canonical form.

The result is always a real word that exists in the dictionary. It takes her longer than the woodcutter, but every answer is linguistically correct. That expert linguist is a lemmatizer.

Lemmatization returns the lemma — the dictionary form of a word. It uses a morphological analysis combined with a lexical database (like WordNet) to determine the base form correctly. Crucially, a lemmatizer often needs to know the Part-of-Speech (POS) of a word to return the right lemma.

Figure: Lemmatization, how it works. Input with POS: running [V], studies [N], better [A], caring [V], drove [V], geese [N]. Morphological analysis (suffix stripping, irregular forms) is combined with a WordNet lexicon lookup that verifies each candidate. Lemmas (real words): run, study, good, care, drive, goose.

🧠 Lemmatizer consults WordNet and uses POS tags — always returns valid dictionary words.
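The lookup-then-verify idea can be sketched in a few lines. Everything below is an assumption made for illustration: `LEXICON`, `EXCEPTIONS`, and `RULES` are tiny hand-written tables, and `toy_lemmatize` is a hypothetical function loosely inspired by the exception-list-then-rules approach, not the actual WordNet or NLTK implementation.

```python
# Tiny hand-written stand-ins for a real lexicon such as WordNet (assumptions).
LEXICON = {
    "n": {"study", "goose", "care"},
    "v": {"study", "care", "drive"},
    "a": {"good"},
}
# Irregular forms, keyed by (word, POS).
EXCEPTIONS = {
    ("geese", "n"): "goose",
    ("drove", "v"): "drive",
    ("better", "a"): "good",
}
# (suffix, replacement) detachment rules per POS, tried in order.
RULES = {
    "n": [("ies", "y"), ("s", "")],
    "v": [("ies", "y"), ("ing", "e"), ("ing", ""), ("ed", ""), ("s", "")],
    "a": [("er", ""), ("est", "")],
}

def toy_lemmatize(word, pos="n"):
    # 1. Irregular forms are resolved by direct lookup.
    if (word, pos) in EXCEPTIONS:
        return EXCEPTIONS[(word, pos)]
    # 2. A word already in the lexicon is its own lemma.
    if word in LEXICON[pos]:
        return word
    # 3. Try each rule; accept a candidate only if the lexicon confirms it.
    for suffix, replacement in RULES[pos]:
        if word.endswith(suffix):
            candidate = word[: len(word) - len(suffix)] + replacement
            if candidate in LEXICON[pos]:
                return candidate
    # 4. Unknown words are returned unchanged.
    return word

for word, pos in [("studies", "n"), ("caring", "v"), ("geese", "n"),
                  ("drove", "v"), ("better", "a"), ("blorfs", "n")]:
    print(f"{word:10} {pos}  -> {toy_lemmatize(word, pos)}")
```

The verification step is the key difference from a stemmer: a candidate such as "studi" is never emitted because it is not in the lexicon, and a word the lexicon does not know is left untouched rather than chopped.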

import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

nltk.download(['wordnet', 'averaged_perceptron_tagger'], quiet=True)

lemmatizer = WordNetLemmatizer()

# Map Penn Treebank POS tags → WordNet POS codes
def get_wordnet_pos(treebank_tag):
    if treebank_tag.startswith('J'): return wordnet.ADJ
    elif treebank_tag.startswith('V'): return wordnet.VERB
    elif treebank_tag.startswith('N'): return wordnet.NOUN
    elif treebank_tag.startswith('R'): return wordnet.ADV
    else: return wordnet.NOUN  # default to noun

words = [
    ('running', 'VBG'), ('studies', 'NNS'), ('better', 'JJR'),
    ('drove', 'VBD'),    ('geese', 'NNS'),   ('caring', 'VBG')
]

print(f"{'Word':12} {'POS':6} {'Lemma'}")
print("-" * 32)
for word, pos in words:
    wn_pos = get_wordnet_pos(pos)
    lemma  = lemmatizer.lemmatize(word, pos=wn_pos)
    print(f"{word:12} {pos:6} {lemma}")
OUTPUT
Word         POS    Lemma
--------------------------------
running      VBG    run
studies      NNS    study
better       JJR    good     ← irregular adjective!
drove        VBD    drive    ← irregular verb!
geese        NNS    goose    ← irregular plural!
caring       VBG    care
Why POS Tagging Matters So Much

The word "saw" has two completely different lemmas depending on its role: as a verb (to see), its lemma is see; as a noun (a cutting tool), its lemma is saw. A lemmatizer without POS context will pick the wrong one. Always tag before lemmatizing.


Section 04

Side-by-Side Comparison — Word by Word

The table below shows the same words processed by Porter Stemmer vs WordNet Lemmatizer. Notice where the stemmer breaks words into non-dictionary tokens and where the lemmatizer shines with irregular forms.

Original Word | Porter Stem | Lemma (with POS) | Stem = Real Word? | Notes
running | run | run | YES | Both agree here
studies | studi | study | NO | Stemmer strips incorrectly
happily | happili | happily | NO | Adverbs poorly handled
better | better | good | MISS | Lemmatizer handles irregulars
drove | drove | drive | MISS | Stemmer misses irregular verbs
geese | gees | goose | NO | Stemmer fails on irregular plurals
generalise | generalis | generalise | NO | Stemmer over-chops
university | univers | university | NO | Classic stemmer failure
caring | care | care | YES | Both agree
arguing | argu | argue | NO | Stemmer drops final 'e' incorrectly
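The table can also be summarised numerically. The sketch below simply copies the ten rows into a list (the variable names are chosen here for illustration) and counts how often the stem and the lemma coincide, and how often the stemmer leaves a word untouched.

```python
# (original, Porter stem, lemma) triples copied from the table above.
ROWS = [
    ("running", "run", "run"),
    ("studies", "studi", "study"),
    ("happily", "happili", "happily"),
    ("better", "better", "good"),
    ("drove", "drove", "drive"),
    ("geese", "gees", "goose"),
    ("generalise", "generalis", "generalise"),
    ("university", "univers", "university"),
    ("caring", "care", "care"),
    ("arguing", "argu", "argue"),
]

# Words where both approaches produce the same token.
agree = [word for word, stem, lemma in ROWS if stem == lemma]
# Words the stemmer passed through unchanged (the irregular forms it misses).
unchanged_by_stemmer = [word for word, stem, _ in ROWS if stem == word]

print(f"Stem equals lemma for {len(agree)} of {len(ROWS)} words: {agree}")
print(f"Left unchanged by the stemmer: {unchanged_by_stemmer}")
```

On this small sample the two methods agree on only two words out of ten, which is why the choice between them is rarely cosmetic.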

Section 05

The Three Stemmer Variants

Not all stemmers are created equal. Three algorithms dominate NLP practice, each with a different trade-off between aggression and accuracy.

🔨
Porter Stemmer
nltk.stem.PorterStemmer
The original (1980). Moderate aggression. Well-studied and widely benchmarked. Best for English. Produces results like gener from generation.
✔ Balanced, well-documented
✘ Some over/under-stemming
❄️
Snowball Stemmer
nltk.stem.SnowballStemmer
Also called Porter2. Improved version with support for 13+ languages including French, German, Spanish. Slightly more accurate than Porter. The practical default choice for multi-lingual projects.
✔ Multi-lingual, improved logic
✘ Still heuristic — not perfect
⚔️
Lancaster Stemmer
nltk.stem.LancasterStemmer
The most aggressive. Strips words down hard — generous becomes gen. High speed, but very lossy. Use only when maximum collapse is desired and readability doesn't matter.
✔ Maximum compression
✘ Often illegible outputs

Section 06

Animated: Stemming vs Lemmatization Flow

Figure: Processing pipeline comparison.
Stemming path (fast): raw text "caring" → suffix rules (-ing → ε) → no dictionary lookup (skipped) → stem output "care".
Lemmatization path (slower): raw text "drove" → POS tagging (VERB, VBD) → WordNet lookup (drove → drive) → lemma output "drive".

Both paths start with raw text — lemmatization needs an extra POS tagging step but produces valid words.


Section 07

spaCy Lemmatization: The Modern Approach

While NLTK's WordNet lemmatizer is the classic teaching tool, spaCy is the industry-standard choice for production NLP. Its lemmatizer is integrated into the processing pipeline and automatically uses POS context — no manual tagging needed.

import spacy

# Load English model — run: python -m spacy download en_core_web_sm
nlp = spacy.load('en_core_web_sm')

sentences = [
    "The geese were flying better than they drove last year.",
    "She is caring for her studies very happily.",
    "The runners are running faster than ever before."
]

for sent in sentences:
    doc = nlp(sent)
    print(f"\nSentence: {sent}")
    print(f"{'Token':15} {'POS':8} {'Lemma'}")
    print("-" * 38)
    for token in doc:
        if not token.is_punct and not token.is_space:
            print(f"{token.text:15} {token.pos_:8} {token.lemma_}")
OUTPUT
Sentence: The geese were flying better than they drove last year.
Token           POS      Lemma
--------------------------------------
The             DET      the
geese           NOUN     goose
were            AUX      be
flying          VERB     fly
better          ADV      well
than            SCONJ    than
they            PRON     they
drove           VERB     drive
last            ADJ      last
year            NOUN     year
🔑
spaCy vs NLTK Lemmatizer

spaCy uses a rule-based + lookup table approach by default, while NLTK uses WordNet's morphology database. spaCy is faster in production because it processes the entire pipeline in a single pass. NLTK is more flexible for experimentation. For production text processing, prefer spaCy; for learning and research, NLTK is excellent.


Section 08

Real Pipeline — Full Text Processing

In practice, neither stemming nor lemmatization works alone. They are always part of a broader NLP preprocessing pipeline. Here is a realistic pipeline for both approaches on real text.

01
Lowercasing
Convert all characters to lowercase. "Running" = "running". Essential first step so "The" and "the" are not treated as different tokens.
02
Tokenisation
Split text into individual tokens (words, punctuation). "don't" → ["do", "n't"]. Use nltk.word_tokenize or spaCy's tokeniser.
03
Stopword Removal
Remove high-frequency words with low semantic value: "the", "is", "a", "in". They add noise without contributing meaning to most tasks.
04
POS Tagging (Lemmatization only)
Before lemmatizing, tag each token with its grammatical role. "running" is a VERB in "She was running" but a NOUN in "Running is fun" — the lemma differs.
05
Stemming / Lemmatization
Apply the chosen normalisation. Stemming: call stemmer.stem(token). Lemmatization: call lemmatizer.lemmatize(token, pos=wn_pos).
06
Downstream Task
Feed the normalised tokens into TF-IDF, bag-of-words, word embeddings, or any NLP model. The normalised vocabulary is smaller and cleaner.
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords, wordnet
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk import pos_tag

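# Newer NLTK releases (3.9+) renamed some resources: if a LookupError appears,
# also download 'punkt_tab' and 'averaged_perceptron_tagger_eng'.
nltk.download(['punkt', 'stopwords', 'wordnet', 'averaged_perceptron_tagger'], quiet=True)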

text = "The scientists were studying the rapidly changing climates and their effects on species."

stop_words = set(stopwords.words('english'))
stemmer    = PorterStemmer()
lemmatizer = WordNetLemmatizer()

def get_wn_pos(tag):
    return {'J': wordnet.ADJ, 'V': wordnet.VERB, 'N': wordnet.NOUN, 'R': wordnet.ADV}.get(tag[0], wordnet.NOUN)

# Step 1: Lowercase + Tokenise
tokens = word_tokenize(text.lower())

# Step 2: Remove stopwords + non-alpha
tokens = [t for t in tokens if t.isalpha() and t not in stop_words]

# Step 3: Stemming pipeline
stems  = [stemmer.stem(t) for t in tokens]

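# Step 4: Lemmatization pipeline (needs POS)
# Note: tagging the already-filtered tokens is a simplification; the tagger is
# usually more accurate when it sees the full sentence, stopwords included.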
tagged  = pos_tag(tokens)
lemmas  = [lemmatizer.lemmatize(t, get_wn_pos(pos)) for t, pos in tagged]

print("Original tokens:", tokens)
print("Stems:          ", stems)
print("Lemmas:         ", lemmas)
OUTPUT
Original tokens: ['scientists', 'studying', 'rapidly', 'changing', 'climates', 'effects', 'species']
Stems:           ['scientist', 'studi', 'rapidli', 'chang', 'climat', 'effect', 'speci']
Lemmas:          ['scientist', 'study', 'rapidly', 'change', 'climate', 'effect', 'specie']

Section 09

When to Use What — Decision Framework

The Speed-Accuracy Trade-Off at the Airport
Imagine two passport control queues at an airport. Queue A has an officer who barely glances at your passport and waves you through in 3 seconds — fast, mostly fine, but occasionally waves through an expired document. Queue B has an officer who carefully reads every detail, cross-references databases, and confirms your identity perfectly — it takes 25 seconds per person.

For a low-security domestic terminal, Queue A is fine. For an international flight where errors cost real consequences, Queue B is non-negotiable.

Stemming is Queue A. Lemmatization is Queue B. Choose based on the consequences of error in your task.
Use Stemming When
  • Speed is critical (real-time search)
  • Large corpora where lemmatization is too slow
  • Simple keyword matching or TF-IDF pipelines
  • Information retrieval at scale
  • When you don't need interpretable tokens
  • Quick-and-dirty prototypes
🧠
Use Lemmatization When
  • Accuracy matters (chatbots, Q&A systems)
  • Output tokens need to be human-readable
  • Domain-specific language (medical, legal)
  • Sentiment analysis where word meaning matters
  • Small-to-medium corpora where speed is OK
  • Multi-class text classification
🚫
Use Neither When
  • Using contextual embeddings (BERT, GPT)
  • Named Entity Recognition (NER) tasks
  • Machine translation pipelines
  • Tasks where word form itself carries meaning
  • Subword tokenization models (BPE, WordPiece)
  • Code / structured text processing

Section 10

Comprehensive Comparison Table

Property | Stemming | Lemmatization
Definition | Heuristic suffix-stripping to produce a stem | Dictionary-based morphological analysis to produce the lemma
Output | May not be a real word (studi, argu) | Always a valid dictionary word (study, argue)
Technique | Rule-based pattern matching (regex-like) | Morphological analysis + lexical database lookup
Needs POS? | NO | YES (for correct results)
Speed | Very fast: O(n) string operations | Slower: database lookups required
Handles Irregulars? | NO: "drove" stays "drove" | YES: "drove" → "drive"
Over-stemming Risk | HIGH: different words may collide | NONE: linguistically grounded
Language Support | Good (Snowball: 13+ languages) | Good (spaCy: 60+ languages)
Memory Required | Minimal: just rules in code | Moderate: WordNet / model required
Best For | Search engines, IR, fast pipelines | Chatbots, classification, human-facing NLP
Python Library | nltk.stem.PorterStemmer | nltk / spaCy lemmatizer

Section 11

Visualising Impact on Vocabulary Size

Figure: Vocabulary reduction (bar chart). Approximate unique tokens in a 1M-word corpus: raw text 100,000; stemmed 62,000 (↓ 38%); lemmatized 70,000 (↓ 30%).

Stemming reduces vocabulary more aggressively because it collapses more variants (sometimes incorrectly). Lemmatization reduces less but keeps only semantically valid distinctions.
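Measuring this on your own data takes only a few lines. The sketch below assumes a helper named `vocabulary_reduction` and a deliberately naive `strip_plural` normaliser, both written for this example; swap in a real stemmer's or lemmatizer's function to get meaningful numbers.

```python
def vocabulary_reduction(tokens, normalise):
    """Return (unique tokens before, unique tokens after, percentage reduction)."""
    before = len(set(tokens))
    after = len({normalise(t) for t in tokens})
    reduction = 100.0 * (before - after) / before if before else 0.0
    return before, after, reduction

# Hypothetical normaliser for the demo: strip a trailing "s" from longer words.
def strip_plural(token):
    return token[:-1] if token.endswith("s") and len(token) > 3 else token

tokens = ["cat", "cats", "dog", "dogs", "bird", "fish"]
before, after, pct = vocabulary_reduction(tokens, strip_plural)
print(f"{before} -> {after} unique tokens ({pct:.1f}% reduction)")

# The chart's percentages follow the same arithmetic.
print(100 * (100_000 - 62_000) / 100_000)  # 38.0
print(100 * (100_000 - 70_000) / 100_000)  # 30.0
```

Running the same function once with a stemmer and once with a lemmatizer gives a direct, corpus-specific comparison instead of relying on approximate figures.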


Section 12

Edge Cases and Pitfalls

🚨 STEMMING PITFALLS
Input | Stem | Problem
wander | wand (Lancaster; Porter keeps wander) | Completely wrong meaning
universe | univers | Not a real word
general | gener | Same as generate, generous
news | new | Plural stripped incorrectly
data | data | No change (correct here)
operational | oper | Over-truncated
✅ LEMMATIZATION EDGE CASES
Input | POS | Lemma
saw | VERB | see ← correct
saw | NOUN | saw ← also correct
better | ADJ | good ← irregular
am / is / are | VERB | be ← all unified
mice | NOUN | mouse ← correct
corpora | NOUN | corpus ← Latin plural
⚠️
The "Wander → Wand" Problem

An aggressive stemmer such as Lancaster turns wander into wand (the Porter rules leave wander unchanged, because the stem that would remain is too short to satisfy the -er rule). Now a document about hiking ("we wandered through the forest") matches searches for magic wands. This kind of collision can poison a search index and is one of the strongest arguments for choosing lemmatization in production systems where precision matters.
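The collision described above can be reproduced with a tiny inverted index. In this sketch `aggressive_stem` is a hypothetical stand-in for an over-eager stemmer (it is not a library function), and `build_index` is a minimal index written for the example.

```python
from collections import defaultdict

def aggressive_stem(word):
    """Hypothetical over-eager stemmer: strips 'ered', 'er' or 's'."""
    for suffix in ("ered", "er", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: len(word) - len(suffix)]
    return word

def build_index(docs, stem_fn):
    """Map each normalised token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[stem_fn(token)].add(doc_id)
    return index

docs = {
    "hiking": "we wandered through the forest",
    "magic": "the wizard raised two wands",
}

stemmed_index = build_index(docs, aggressive_stem)
exact_index = build_index(docs, lambda token: token)

print(sorted(stemmed_index[aggressive_stem("wand")]))  # ['hiking', 'magic']
print(sorted(exact_index["wand"]))                     # []
```

With the aggressive stemmer, a query for "wand" retrieves the hiking document as a false match; the exact index avoids the collision but also misses the legitimate "wands" document, which is the gap a lemmatizer is meant to close.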


Section 13

Production-Ready Code — Full NLP Pipeline

import re
import spacy
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

# Load spaCy model
nlp     = spacy.load('en_core_web_sm', disable=['ner', 'parser'])
stemmer = PorterStemmer()

# ─── Stemming pipeline ────────────────────────────────────
def preprocess_stem(texts):
    cleaned = []
    for text in texts:
        text   = re.sub(r'[^a-zA-Z\s]', '', text.lower())
        tokens = text.split()
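        # Drop stopwords as well, so both pipelines filter the same way
        stems  = [stemmer.stem(t) for t in tokens
                  if len(t) > 2 and t not in nlp.Defaults.stop_words]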
        cleaned.append(' '.join(stems))
    return cleaned

# ─── Lemmatization pipeline ───────────────────────────────
def preprocess_lemma(texts):
    cleaned = []
    for doc in nlp.pipe(texts, batch_size=50):
        lemmas = [
            token.lemma_.lower()
            for token in doc
            if not token.is_stop and not token.is_punct
            and token.is_alpha and len(token) > 2
        ]
        cleaned.append(' '.join(lemmas))
    return cleaned

# ─── Example corpus ───────────────────────────────────────
corpus = [
    "Scientists are studying the effects of climate change on polar bears.",
    "The study shows that temperature changes affect the species rapidly.",
    "Researchers studied how rising temperatures impact ecosystems globally.",
]

stem_corpus  = preprocess_stem(corpus)
lemma_corpus = preprocess_lemma(corpus)

# Build TF-IDF vocabularies
stem_vect  = TfidfVectorizer().fit(stem_corpus)
lemma_vect = TfidfVectorizer().fit(lemma_corpus)

print("Stem vocabulary: ",  sorted(stem_vect.vocabulary_.keys()))
print("Lemma vocabulary:", sorted(lemma_vect.vocabulary_.keys()))
OUTPUT
Stem vocabulary:  ['affect', 'bear', 'chang', 'climat', 'ecosystem', 'effect', 'global', 'impact', 'polar', 'rapidli', 'research', 'rise', 'scientist', 'show', 'speci', 'studi', 'temperatur']
Lemma vocabulary: ['affect', 'bear', 'change', 'climate', 'ecosystem', 'effect', 'global', 'impact', 'polar', 'rapidly', 'researcher', 'rise', 'scientist', 'show', 'species', 'study', 'temperature']
🎯
Both Unify the Corpus — But Lemmas Are Readable

Both pipelines merge "studying / study / studied" into a single token. The key difference is that the stemmed vocabulary contains studi, rapidli, and temperatur, which are not real words, while the lemmatized vocabulary contains study, rapidly, and temperature. For any human-inspectable pipeline, lemmas win every time.


Section 14

Golden Rules

🌿 Stemming vs Lemmatization — Non-Negotiable Rules
1
Always provide POS tags to your lemmatizer. Without them, NLTK's WordNetLemmatizer treats every word as a noun. The word "better" without a POS tag returns better; with the adjective tag, it correctly returns good.
2
Do stemming before indexing, never before classification. Stemming destroys readability and semantics. For search indices, the loss is acceptable. For sentiment analysis or text classification, mangled tokens can confuse your model.
3
Never apply stemming or lemmatization before BERT, GPT, or transformer models. These models use subword tokenization (BPE/WordPiece) and are trained on raw text. Pre-stemming destroys information that the model's attention mechanism relies on.
4
Benchmark on your own data. Accuracy figures quoted for stemmers (often in the 75–85% range for Porter) come from general English corpora. Your domain (medical, legal, financial) may have very different morphological patterns, and a custom lemmatizer may perform far better there.
5
Snowball is almost always better than Porter. It's more accurate, supports multiple languages, and is just as fast. Default to Snowball when you need a stemmer. Use Porter only when reproducing older research that used it specifically.
6
For production, prefer spaCy over NLTK's lemmatizer. spaCy processes POS tagging and lemmatization in a single pipeline pass, batch-processes documents efficiently, and has model support for 60+ languages. NLTK is excellent for learning; spaCy is built for shipping.
7
Always measure vocabulary reduction. Check how many unique tokens your preprocessing step produces before and after normalisation. If stemming and lemmatization produce the same vocabulary size, pick lemmatization — you get the same compression with readable outputs.