The Story That Explains Text Normalisation
Without a system to link these variants together, you will miss most of the relevant books. The student goes home empty-handed, even though the library is full of exactly what they need.
This is precisely the problem that Stemming and Lemmatization solve — they collapse word variants into a common root so that your NLP system can recognise that run, runs, ran, running, runner all refer to the same core idea. The question is: which approach does it well, and which does it cheaply?
When computers process text, they treat every unique string of characters as a different token. "study", "studies", and "studying" look like three completely different words. For tasks like search, sentiment analysis, or topic modelling, this fragmentation kills accuracy. The solution is text normalisation — reducing inflected or derived word forms to a common base form before any analysis begins.
Stemming uses fast heuristic rules to chop off word endings — it's crude but lightning-fast. Lemmatization uses linguistic knowledge and vocabulary lookup to return the dictionary root (the lemma) — it's accurate but requires more computation. Both serve the same purpose: mapping word variants to a shared token. The right choice depends on your speed, accuracy, and language requirements.
Stemming — The Fast Axe
A stemmer applies a fixed sequence of pattern-based rules to strip suffixes (and sometimes prefixes) from words. It does not consult a dictionary. It does not understand grammar. It simply applies rules like: "if a word ends in -ing, remove it." The output — called a stem — is often not a real word at all.
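To make the "apply a rule, chop the ending" idea concrete, here is a minimal sketch of a suffix stripper. The rule list and the `naive_stem` helper are hypothetical and deliberately much cruder than any real stemmer; they only illustrate that the process is pure string manipulation with no dictionary lookup.

```python
# Hypothetical toy rules, checked in order. This is NOT the Porter rule set.
SUFFIX_RULES = [
    ("ing", ""),   # running -> runn
    ("ies", "i"),  # studies -> studi
    ("ed", ""),    # jumped  -> jump
    ("ly", ""),    # quickly -> quick
    ("s", ""),     # cats    -> cat
]


def naive_stem(word: str, min_stem_len: int = 3) -> str:
    """Apply the first matching rule, keeping at least min_stem_len characters."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: len(word) - len(suffix)] + replacement
    return word


examples = ["running", "studies", "jumped", "quickly", "cats", "sing"]
stems = {w: naive_stem(w) for w in examples}
print(stems)
```

Note that "running" becomes "runn" and "studies" becomes "studi": neither is a word, which is the trade-off a stemmer accepts in exchange for speed. The minimum-length guard is what keeps "sing" from being reduced to "s".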
⚡ No dictionary needed — stems are often not valid English words.
The Porter Stemmer — The Classic Algorithm
The Porter Stemmer (1980) is one of the most widely used stemming algorithms. It applies five sequential phases of suffix-stripping rules in order. Within each group of rules, only the rule with the longest matching suffix is applied.
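The "longest matching suffix" idea is easiest to see in the first plural-handling step of Porter's 1980 description (step 1a). The sketch below covers that single step only, and the function name `porter_step_1a` is just a label chosen for this example. Library implementations such as NLTK's may add their own adjustments on top, so their output can differ for some words.

```python
# Step 1a rules, ordered so that the longest suffix is tried first.
STEP_1A_RULES = [
    ("sses", "ss"),  # caresses -> caress
    ("ies", "i"),    # ponies   -> poni
    ("ss", "ss"),    # caress   -> caress (left unchanged)
    ("s", ""),       # cats     -> cat
]


def porter_step_1a(word: str) -> str:
    """Apply only the first (longest) matching rule, then stop."""
    for suffix, replacement in STEP_1A_RULES:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word


for w in ["caresses", "ponies", "caress", "cats"]:
    print(w, "->", porter_step_1a(w))
```

Because "ss" is checked before "s", a word like "caress" is protected from losing its final letter. The later phases follow the same pattern but add conditions on the remaining stem.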
from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer
# Stemmers are purely rule-based, so no NLTK data download is needed here.
words = [
    'running', 'studies', 'happily', 'arguing',
    'generous', 'generation', 'electricity', 'caring'
]
porter = PorterStemmer()
snowball = SnowballStemmer('english')
lancaster = LancasterStemmer()
# Print one row per word so the three stemmers can be compared side by side.
print(f"{'Word':15} {'Porter':12} {'Snowball':12} {'Lancaster'}")
print("-" * 52)
for word in words:
    p = porter.stem(word)
    s = snowball.stem(word)
    lanc = lancaster.stem(word)
    print(f"{word:15} {p:12} {s:12} {lanc}")
Over-stemming (False Positives): Different words collapse to the same stem. For example, general, generate, and generous can all become gener, creating false matches.
Under-stemming (False Negatives): Related words are not merged. For example, alumnus and alumni may remain separate even though they refer to the same concept.
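Both error types can be counted once you have a hand-made list of which words ought to merge. The sketch below is one possible way to do that; `stemming_errors`, the concept groups, and the stem table are all hypothetical. The gener stems come from the example above, while the alumnus/alumni stems are simply assumed to differ. In practice `stem_of` would be filled by calling a real stemmer.

```python
from itertools import combinations


def stemming_errors(concept_groups, stem_of):
    """Return (over, under): word pairs merged wrongly and word pairs split wrongly."""
    concept_of = {w: name for name, words in concept_groups.items() for w in words}
    over, under = [], []
    for a, b in combinations(sorted(concept_of), 2):
        same_concept = concept_of[a] == concept_of[b]
        same_stem = stem_of[a] == stem_of[b]
        if same_stem and not same_concept:
            over.append((a, b))       # false positive: unrelated words share a stem
        elif same_concept and not same_stem:
            under.append((a, b))      # false negative: related words stay apart
    return over, under


# Hypothetical gold standard: which words should be treated as one concept.
concept_groups = {
    "general": ["general"],
    "generate": ["generate"],
    "generous": ["generous"],
    "alumnus": ["alumnus", "alumni"],
}
# Illustrative stems (assumed values, not produced by running a stemmer here).
stem_of = {
    "general": "gener", "generate": "gener", "generous": "gener",
    "alumnus": "alumnu", "alumni": "alumni",
}

over, under = stemming_errors(concept_groups, stem_of)
print("Over-stemmed pairs: ", over)
print("Under-stemmed pairs:", under)
```

Counting pairs rather than words makes the two error types directly comparable, at the cost of needing a gold grouping that someone has to write by hand.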
Lemmatization — The Precise Linguist
The result is always a real word that exists in the dictionary. It takes her longer than the woodcutter, but every answer is linguistically correct. That expert linguist is a lemmatizer.
Lemmatization returns the lemma, the dictionary form of a word. It combines morphological analysis with a lexical database (such as WordNet) to determine the base form correctly. Crucially, a lemmatizer often needs to know the part of speech (POS) of a word to return the right lemma.
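The mechanics can be sketched without any library: look the word up in a list of irregular forms for its POS, and otherwise try suffix rules, accepting a candidate only if it exists in the vocabulary. The code below is a toy under those assumptions. `VOCAB`, `EXCEPTIONS`, `DETACHMENT_RULES`, and `toy_lemmatize` are hypothetical stand-ins, not WordNet's actual data or API.

```python
# Tiny hand-made lexicon; a real lemmatizer consults WordNet or a trained model.
VOCAB = {
    "n": {"study", "goose", "saw"},
    "v": {"study", "see", "saw", "care", "drive"},
    "a": {"good"},
}
# Irregular forms, keyed by POS.
EXCEPTIONS = {
    "n": {"geese": "goose"},
    "v": {"saw": "see", "drove": "drive"},
    "a": {"better": "good"},
}
# (suffix, replacement) pairs tried in order; a candidate must be in VOCAB.
DETACHMENT_RULES = {
    "n": [("ies", "y"), ("s", "")],
    "v": [("ies", "y"), ("ing", "e"), ("ing", ""), ("ed", ""), ("s", "")],
}


def toy_lemmatize(word: str, pos: str = "n") -> str:
    word = word.lower()
    if word in EXCEPTIONS.get(pos, {}):          # 1. irregular form for this POS
        return EXCEPTIONS[pos][word]
    if word in VOCAB.get(pos, set()):            # 2. already a base form
        return word
    for suffix, replacement in DETACHMENT_RULES.get(pos, []):
        if word.endswith(suffix):                # 3. rule + vocabulary check
            candidate = word[: len(word) - len(suffix)] + replacement
            if candidate in VOCAB.get(pos, set()):
                return candidate
    return word                                  # 4. unknown: return unchanged


for word, pos in [("saw", "v"), ("saw", "n"), ("studies", "n"),
                  ("caring", "v"), ("geese", "n"), ("better", "a")]:
    print(f"{word:10} {pos}  {toy_lemmatize(word, pos)}")
```

The vocabulary check in step 3 is what separates this from a stemmer: "caring" becomes "care" only because "care" is a known verb. An unknown word falls through to step 4 and is returned as-is.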
🧠 Lemmatizer consults WordNet and uses POS tags — always returns valid dictionary words.
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
nltk.download(['wordnet', 'averaged_perceptron_tagger'], quiet=True)
lemmatizer = WordNetLemmatizer()
# Map Penn Treebank POS tags → WordNet POS codes
def get_wordnet_pos(treebank_tag):
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN  # default to noun
# POS tags are supplied by hand here; a tagger would normally provide them.
words = [
    ('running', 'VBG'), ('studies', 'NNS'), ('better', 'JJR'),
    ('drove', 'VBD'), ('geese', 'NNS'), ('caring', 'VBG')
]
print(f"{'Word':12} {'POS':6} {'Lemma'}")
print("-" * 32)
for word, pos in words:
    wn_pos = get_wordnet_pos(pos)
    lemma = lemmatizer.lemmatize(word, pos=wn_pos)
    print(f"{word:12} {pos:6} {lemma}")
The word "saw" has two completely different lemmas depending on its role: as a verb (to see), its lemma is see; as a noun (a cutting tool), its lemma is saw. A lemmatizer without POS context cannot tell the two apart. NLTK's WordNetLemmatizer, for instance, treats every word as a noun unless told otherwise. Always tag before lemmatizing.
Side-by-Side Comparison — Word by Word
The table below shows the same words processed by Porter Stemmer vs WordNet Lemmatizer. Notice where the stemmer breaks words into non-dictionary tokens and where the lemmatizer shines with irregular forms.
| Original Word | Porter Stem | Lemma (with POS) | Stem = Real Word? | Notes |
|---|---|---|---|---|
| running | run | run | YES | Both agree here |
| studies | studi | study | NO | Stemmer strips incorrectly |
| happily | happili | happily | NO | Adverbs poorly handled |
| better | better | good | MISS | Lemmatizer handles irregulars |
| drove | drove | drive | MISS | Stemmer misses irregular verbs |
| geese | gees | goose | NO | Stemmer fails on irregular plurals |
| generalise | generalis | generalise | NO | Stemmer over-chops |
| university | univers | university | NO | Classic stemmer failure |
| caring | care | care | YES | Both agree |
| arguing | argu | argue | NO | Stemmer drops final 'e' incorrectly |
The Three Stemmer Variants
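Not all stemmers are created equal. Three algorithms dominate NLP practice, each with a different trade-off between aggression and accuracy: Porter (the classic baseline), Snowball (also known as Porter2, a refinement that also supports several other languages), and Lancaster (the most aggressive of the three).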
Figure (animation): stemming vs lemmatization flow. Both paths start with raw text; lemmatization needs an extra POS tagging step but produces valid words.
SpaCy Lemmatization — The Modern Approach
While NLTK's WordNet lemmatizer is the classic teaching tool, spaCy is a widely used choice for production NLP. Its lemmatizer is integrated into the processing pipeline and uses the POS tags predicted earlier in that pipeline, so no manual tagging is needed.
import spacy
# Load the English model first; install it with: python -m spacy download en_core_web_sm
nlp = spacy.load('en_core_web_sm')
sentences = [
    "The geese were flying better than they drove last year.",
    "She is caring for her studies very happily.",
    "The runners are running faster than ever before."
]
for sent in sentences:
    doc = nlp(sent)
    print(f"\nSentence: {sent}")
    print(f"{'Token':15} {'POS':8} {'Lemma'}")
    print("-" * 38)
    for token in doc:
        # Skip punctuation and whitespace tokens
        if not token.is_punct and not token.is_space:
            print(f"{token.text:15} {token.pos_:8} {token.lemma_}")
spaCy uses a rule-based + lookup table approach by default, while NLTK uses WordNet's morphology database. spaCy is faster in production because it processes the entire pipeline in a single pass. NLTK is more flexible for experimentation. For production text processing, prefer spaCy; for learning and research, NLTK is excellent.
Real Pipeline — Full Text Processing
In practice, neither stemming nor lemmatization works alone. Each is typically one step in a broader NLP preprocessing pipeline. Here is a realistic pipeline that runs both approaches on the same sentence.
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords, wordnet
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk import pos_tag
# Recent NLTK releases look for 'punkt_tab' and 'averaged_perceptron_tagger_eng';
# requesting both the old and new resource names is an assumption that covers either case.
nltk.download(['punkt', 'punkt_tab', 'stopwords', 'wordnet',
               'averaged_perceptron_tagger', 'averaged_perceptron_tagger_eng'], quiet=True)
text = "The scientists were studying the rapidly changing climates and their effects on species."
stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
def get_wn_pos(tag):
    # Map the first letter of a Penn Treebank tag to a WordNet POS; default to noun
    return {'J': wordnet.ADJ, 'V': wordnet.VERB, 'N': wordnet.NOUN, 'R': wordnet.ADV}.get(tag[0], wordnet.NOUN)
# Step 1: Lowercase + Tokenise
tokens = word_tokenize(text.lower())
# Step 2: Remove stopwords + non-alpha
tokens = [t for t in tokens if t.isalpha() and t not in stop_words]
# Step 3: Stemming pipeline
stems = [stemmer.stem(t) for t in tokens]
# Step 4: Lemmatization pipeline (needs POS)
# Tagging the filtered, lowercased tokens is a simplification; taggers work best on full sentences.
tagged = pos_tag(tokens)
lemmas = [lemmatizer.lemmatize(t, get_wn_pos(pos)) for t, pos in tagged]
print("Original tokens:", tokens)
print("Stems:          ", stems)
print("Lemmas:         ", lemmas)
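The same steps can be wrapped into one function that takes the normaliser as a parameter, which makes it easy to swap a stemmer for a lemmatizer later. This is a dependency-free sketch: `preprocess`, the tiny `STOP_WORDS` set, and the `strip_plural_s` stand-in are hypothetical and far simpler than their NLTK counterparts.

```python
import re
from typing import Callable, Iterable, List

# Hypothetical, deliberately tiny stop-word list; NLTK's list is much longer.
STOP_WORDS = {"the", "were", "and", "their", "on", "a", "of"}


def preprocess(text: str,
               normalise: Callable[[str], str] = lambda t: t,
               stop_words: Iterable[str] = STOP_WORDS) -> List[str]:
    """Lowercase, tokenise, drop stop words, then normalise each token."""
    stop = set(stop_words)
    tokens = re.findall(r"[a-z]+", text.lower())   # lowercase + alphabetic tokens
    tokens = [t for t in tokens if t not in stop]  # stop-word removal
    return [normalise(t) for t in tokens]          # stemming or lemmatization


def strip_plural_s(token: str) -> str:
    """Stand-in normaliser: drop a trailing 's' (but not 'ss') from longer tokens."""
    if token.endswith("s") and not token.endswith("ss") and len(token) > 3:
        return token[:-1]
    return token


text = "The scientists were studying the rapidly changing climates and their effects on species."
print(preprocess(text))                  # tokens only
print(preprocess(text, strip_plural_s))  # tokens + crude normalisation
```

Passing `PorterStemmer().stem` as `normalise` would give the stemming pipeline. A POS-aware lemmatizer needs the tags as well, so it would require a slightly richer interface than this sketch offers.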
When to Use What — Decision Framework
For a low-security domestic terminal, Queue A is fine. For an international flight where errors cost real consequences, Queue B is non-negotiable.
Stemming is Queue A. Lemmatization is Queue B. Choose based on the consequences of error in your task.
Use stemming when:
- Speed is critical (real-time search)
- Large corpora where lemmatization is too slow
- Simple keyword matching or TF-IDF pipelines
- Information retrieval at scale
- When you don't need interpretable tokens
- Quick-and-dirty prototypes
Use lemmatization when:
- Accuracy matters (chatbots, Q&A systems)
- Output tokens need to be human-readable
- Domain-specific language (medical, legal)
- Sentiment analysis where word meaning matters
- Small-to-medium corpora where speed is OK
- Multi-class text classification
Use neither when:
- Using contextual embeddings (BERT, GPT)
- Named Entity Recognition (NER) tasks
- Machine translation pipelines
- Tasks where word form itself carries meaning
- Subword tokenization models (BPE, WordPiece)
- Code / structured text processing
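The checklist above can be encoded as a small helper so the choice is made explicitly in code rather than by habit. This is only a sketch: `TaskProfile` and `choose_normalisation` are hypothetical names, and both the precedence order (accuracy before latency) and the fallback to stemming are assumptions, not rules stated in the text.

```python
from dataclasses import dataclass


@dataclass
class TaskProfile:
    uses_subword_model: bool = False     # e.g. BERT/GPT with BPE or WordPiece
    word_form_matters: bool = False      # e.g. NER, machine translation, code
    needs_readable_tokens: bool = False  # humans will inspect the tokens
    accuracy_critical: bool = False      # e.g. chatbots, Q&A, sentiment
    latency_critical: bool = False       # e.g. real-time search


def choose_normalisation(task: TaskProfile) -> str:
    """Return 'none', 'lemmatization', or 'stemming' for a task profile."""
    if task.uses_subword_model or task.word_form_matters:
        return "none"
    # Assumption: when accuracy and latency conflict, accuracy wins.
    if task.needs_readable_tokens or task.accuracy_critical:
        return "lemmatization"
    # Assumption: stemming is the cheap fallback for prototypes and fast pipelines.
    return "stemming"


print(choose_normalisation(TaskProfile(latency_critical=True)))
print(choose_normalisation(TaskProfile(accuracy_critical=True, latency_critical=True)))
print(choose_normalisation(TaskProfile(uses_subword_model=True)))
```

A real project would probably weigh these factors rather than apply them in strict order, but writing the rule down makes the trade-off visible and reviewable.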
Comprehensive Comparison Table
| Property | Stemming | Lemmatization |
|---|---|---|
| Definition | Heuristic suffix-stripping to produce a stem | Dictionary-based morphological analysis to produce the lemma |
| Output | May not be a real word (studi, argu) | A valid dictionary word when the input is in the lexicon (study, argue); unknown words pass through unchanged |
| Technique | Rule-based pattern matching (regex-like) | Morphological analysis + lexical database lookup |
| Needs POS? | NO | YES (for correct results) |
| Speed | Very fast — O(n) string operations | Slower — database lookups required |
| Handles Irregulars? | NO — "drove" stays "drove" | YES — "drove" → "drive" |
| Over-stemming Risk | HIGH — different words may collide | NONE — linguistically grounded |
| Language Support | Good (Snowball: 13+ languages) | Good (spaCy: 60+ languages) |
| Memory Required | Minimal — just rules in code | Moderate — WordNet / model required |
| Best For | Search engines, IR, fast pipelines | Chatbots, classification, human-facing NLP |
| Python Library | nltk.stem.PorterStemmer | nltk / spaCy lemmatizer |
Visualising Impact on Vocabulary Size
Stemming reduces vocabulary more aggressively because it collapses more variants (sometimes incorrectly). Lemmatization reduces less but keeps only semantically valid distinctions.
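The effect is easy to measure: count distinct token types before and after normalisation. The sketch below uses small lookup tables in place of real tools so that it runs without any downloads. The stems and lemmas are the ones quoted elsewhere in this article, and `vocabulary_size` is a hypothetical helper.

```python
def vocabulary_size(tokens, normalise=None):
    """Number of distinct token types, optionally after normalisation."""
    if normalise is None:
        return len(set(tokens))
    return len({normalise(t) for t in tokens})


tokens = ["general", "generate", "generous", "study", "studies", "studying"]

# Lookup tables standing in for a stemmer and a POS-aware lemmatizer.
STEMS = {
    "general": "gener", "generate": "gener", "generous": "gener",
    "study": "studi", "studies": "studi", "studying": "studi",
}
LEMMAS = {
    "general": "general", "generate": "generate", "generous": "generous",
    "study": "study", "studies": "study", "studying": "study",
}

raw_size = vocabulary_size(tokens)
stem_size = vocabulary_size(tokens, STEMS.get)
lemma_size = vocabulary_size(tokens, LEMMAS.get)
print(f"raw={raw_size} stemmed={stem_size} lemmatized={lemma_size}")
```

The stemmed vocabulary is the smallest, but only because three unrelated words were merged into gener. The lemmatized vocabulary is larger because it keeps those three apart.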
Edge Cases and Pitfalls
| Input | Stem | Problem |
|---|---|---|
| wander | wand (Lancaster) | Completely wrong meaning |
| universe | univers | Not a real word |
| general | gener | Same as generate, generous |
| news | new | Plural stripped incorrectly |
| data | data | No change — correct here |
| operational | oper | Over-truncated |

Lemmatization with the correct POS tag:

| Input | POS | Lemma |
|---|---|---|
| saw | VERB | see ← correct |
| saw | NOUN | saw ← also correct |
| better | ADJ | good ← irregular |
| am / is / are | VERB | be ← all unified |
| mice | NOUN | mouse ← correct |
| corpora | NOUN | corpus ← Latin plural |
An aggressive stemmer such as Lancaster turns wander into wand (Porter itself leaves wander unchanged). Now a document about hiking ("we wandered through the forest") matches searches for magic wands. This kind of collision can poison a search index and is one of the strongest arguments for choosing lemmatization in production systems where precision matters.
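The wander/wand collision described above can be reproduced with a toy inverted index. Everything here is illustrative: `build_index` is a hypothetical helper, and the two lookup tables stand in for an aggressive stemmer and a lemmatizer (the stem assigned to "wandered" is an assumed value).

```python
from collections import defaultdict


def build_index(docs, normalise):
    """Map each normalised token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[normalise(token)].add(doc_id)
    return index


docs = {
    "hiking": "we wandered through the forest",
    "magic": "the wizard sells wands",
}

# Assumed outputs of an aggressive stemmer and of a lemmatizer for the key words.
AGGRESSIVE_STEMS = {"wandered": "wand", "wands": "wand"}
LEMMAS = {"wandered": "wander", "wands": "wand"}

stem_index = build_index(docs, lambda t: AGGRESSIVE_STEMS.get(t, t))
lemma_index = build_index(docs, lambda t: LEMMAS.get(t, t))

print("stem index, query 'wand': ", sorted(stem_index["wand"]))
print("lemma index, query 'wand':", sorted(lemma_index["wand"]))
```

With the stemmed index, a query for wand returns the hiking document as well. With the lemma index, it returns only the document that is actually about wands.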
Production-Ready Code — Full NLP Pipeline
import re
import spacy
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
# Load spaCy model (NER and the parser are not needed for lemmatization)
nlp = spacy.load('en_core_web_sm', disable=['ner', 'parser'])
stemmer = PorterStemmer()
# ─── Stemming pipeline ────────────────────────────────────
def preprocess_stem(texts):
    cleaned = []
    for text in texts:
        # Keep letters and whitespace only, then split on whitespace
        text = re.sub(r'[^a-zA-Z\s]', '', text.lower())
        tokens = text.split()
        stems = [stemmer.stem(t) for t in tokens if len(t) > 2]
        cleaned.append(' '.join(stems))
    return cleaned
# ─── Lemmatization pipeline ───────────────────────────────
def preprocess_lemma(texts):
    cleaned = []
    # nlp.pipe processes the texts in batches
    for doc in nlp.pipe(texts, batch_size=50):
        lemmas = [
            token.lemma_.lower()
            for token in doc
            if not token.is_stop and not token.is_punct
            and token.is_alpha and len(token) > 2
        ]
        cleaned.append(' '.join(lemmas))
    return cleaned
# ─── Example corpus ───────────────────────────────────────
corpus = [
    "Scientists are studying the effects of climate change on polar bears.",
    "The study shows that temperature changes affect the species rapidly.",
    "Researchers studied how rising temperatures impact ecosystems globally.",
]
stem_corpus = preprocess_stem(corpus)
lemma_corpus = preprocess_lemma(corpus)
# Build TF-IDF vocabularies
stem_vect = TfidfVectorizer().fit(stem_corpus)
lemma_vect = TfidfVectorizer().fit(lemma_corpus)
print("Stem vocabulary: ", sorted(stem_vect.vocabulary_.keys()))
print("Lemma vocabulary:", sorted(lemma_vect.vocabulary_.keys()))
Both pipelines merge "study / studies / studied / studying" into a single token. The key difference: the stemmed vocabulary contains studi, rapidli, and temperatur, none of which are real words, while the lemmatized vocabulary contains study, rapidly, and temperature. For any pipeline whose tokens a human has to inspect, lemmas are the easier output to read and debug.