The Story That Explains Text Preprocessing
The assistant opens the first box and finds chaos. Some reviews say "Cooking", others say "COOKING", "cookng", "cook!ng", "I love cooking!!!", and "The art of — cooking — is magnificent." All mean the same thing, but a simple word-search would miss half of them.
The wise assistant doesn't panic. They first clean the text (remove the noise), then standardise it (make it consistent), and finally chop it into pieces (tokens) that a machine can compare quickly. Only then do they hand it to the computer.
That is Text Preprocessing — the art of turning raw, messy human language into a clean, structured form a machine can actually learn from.
In Natural Language Processing (NLP), the quality of your preprocessing often matters more than the choice of model. Garbage in, garbage out — but clean text in, clean predictions out. This tutorial covers the three pillars of text preprocessing: Tokenization, Stopword Removal, and Punctuation Handling.
Raw text is the messiest data type in machine learning. Unlike numeric features where a value is just a number, the same meaning can appear in hundreds of linguistic forms. Preprocessing unifies them so your model sees meaning, not noise.
The Preprocessing Pipeline at a Glance
Every NLP project follows a similar cleaning sequence. Think of it as an assembly line — raw text enters one end, clean tokens exit the other.
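The assembly line can be sketched in a few lines of standard-library Python. This is only an illustration of the clean, standardise, tokenize order; the function names (`clean`, `standardise`, `tokenize`, `pipeline`) are made up for this example, and a real project would swap in a proper tokenizer.

```python
import re

def clean(text: str) -> str:
    """Stage 1: replace anything that is not a word character or whitespace."""
    return re.sub(r"[^\w\s]", " ", text)

def standardise(text: str) -> str:
    """Stage 2: lowercase and collapse runs of whitespace."""
    return re.sub(r"\s+", " ", text.lower()).strip()

def tokenize(text: str) -> list[str]:
    """Stage 3: split the normalised string on whitespace."""
    return text.split()

def pipeline(text: str) -> list[str]:
    # Raw text enters one end, clean tokens exit the other.
    return tokenize(standardise(clean(text)))

print(pipeline("I love COOKING!!!"))
# ['i', 'love', 'cooking']
```

Each stage is a plain function, so any one of them can be replaced without touching the others.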
Tokenization — Cutting Language into Pieces
Tokenization is the process of splitting a string of text into smaller units called tokens. A token can be a word, a sentence, a subword fragment, or even a single character. The choice of tokenizer fundamentally shapes what your model can learn.
Choosing the wrong slicer is like slicing a baguette with a cleaver — you can do it, but you'll lose a lot of meaning in the crumbs.
The Three Types of Tokenization
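The three types are word, sentence, and subword tokenization (for example BPE, or the WordPiece scheme used by BERT).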
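To make the granularities concrete, here is a rough standard-library sketch that cuts the same text at the word, sentence, and character level. The regular expressions are simplifications chosen for this example, not what NLTK or spaCy use internally; subword tokenization needs a vocabulary and is covered in its own section.

```python
import re

text = "Tokenizers differ. Pick one carefully!"

# Word level: runs of word characters, or single punctuation marks.
word_tokens = re.findall(r"\w+|[^\w\s]", text)

# Sentence level: split after ., ! or ? when followed by whitespace.
sentence_tokens = re.split(r"(?<=[.!?])\s+", text)

# Character level: every character of one word becomes its own token.
char_tokens = list("Tokenizers")

print(word_tokens)
# ['Tokenizers', 'differ', '.', 'Pick', 'one', 'carefully', '!']
print(sentence_tokens)
# ['Tokenizers differ.', 'Pick one carefully!']
print(char_tokens)
```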
Word Tokenization in Python
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
# Download required data (run once)
nltk.download('punkt')
nltk.download('punkt_tab')
text = "Dr. Smith went to Washington. He couldn't believe it!"
# Word tokenization
words = word_tokenize(text)
print("Word tokens:", words)
# Sentence tokenization
sentences = sent_tokenize(text)
print("Sentence tokens:", sentences)
NLTK's sent_tokenize correctly treats "Dr." as an abbreviation (not a sentence boundary), keeping the first sentence intact. This is why an abbreviation-aware tokenizer such as NLTK's Punkt model is more reliable than a simple split('.').
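The failure of a plain split('.') is easy to reproduce without any library. The sketch below compares it with a small abbreviation-aware splitter; the `ABBREVIATIONS` set and the `split_sentences` helper are hypothetical and far cruder than Punkt, which learns its abbreviation list from data.

```python
text = "Dr. Smith went to Washington. He couldn't believe it!"

# Naive approach: every period is treated as a sentence boundary.
naive = [part.strip() for part in text.split(".") if part.strip()]
print(naive)
# ['Dr', 'Smith went to Washington', "He couldn't believe it!"]

# Slightly better: ignore periods that follow a known abbreviation.
ABBREVIATIONS = {"dr", "mr", "mrs", "ms", "prof"}  # illustrative, not complete

def split_sentences(text: str) -> list[str]:
    sentences, current = [], []
    for word in text.split():
        current.append(word)
        ends_sentence = word[-1] in ".!?"
        is_abbreviation = word.rstrip(".").lower() in ABBREVIATIONS
        if ends_sentence and not is_abbreviation:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing words without closing punctuation
        sentences.append(" ".join(current))
    return sentences

print(split_sentences(text))
# ['Dr. Smith went to Washington.', "He couldn't believe it!"]
```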
Subword Tokenization with Hugging Face
from transformers import AutoTokenizer
# BERT uses WordPiece subword tokenization
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
text = "unhappiness is not impossible"
tokens = tokenizer.tokenize(text)
token_ids = tokenizer.encode(text)
print("Subword tokens:", tokens)
print("Token IDs: ", token_ids)
print(f"Vocabulary size: {tokenizer.vocab_size:,}")
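If you want to see the idea behind WordPiece without downloading a model, the greedy longest-match-first lookup can be sketched in plain Python. The vocabulary below is invented for this example, so its output will not necessarily match what bert-base-uncased produces for the same sentence.

```python
# Toy vocabulary, invented for illustration; real BERT vocabularies hold
# roughly 30,000 entries.
VOCAB = {"un", "##happi", "##ness", "is", "not", "im", "##possible", "[UNK]"}

def wordpiece(word: str, vocab: set[str]) -> list[str]:
    """Greedy longest-match-first split of one lowercase word."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1  # try a shorter piece
        if match is None:
            return ["[UNK]"]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

sentence = "unhappiness is not impossible"
tokens = [p for w in sentence.split() for p in wordpiece(w, VOCAB)]
print(tokens)
# ['un', '##happi', '##ness', 'is', 'not', 'im', '##possible']
```

Because unseen words are broken into known pieces, the out-of-vocabulary problem largely disappears; only words containing material absent from the vocabulary fall back to `[UNK]`.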
| Method | Algorithm | Used By | OOV Problem? | Best For |
|---|---|---|---|---|
| Whitespace Split | str.split() | Toy examples | Very bad | Nothing serious |
| NLTK Word | Punkt + rules | Classic NLP | Moderate | Bag-of-words, TF-IDF |
| spaCy | Rule + model | Production NLP | Moderate | NER, dependency parsing |
| BPE | Byte Pair Encoding | GPT, RoBERTa | None | Generation tasks |
| WordPiece | Likelihood-based | BERT, DistilBERT | None | Classification, NER |
| SentencePiece | Unigram/BPE | T5, XLM-R | None | Multilingual tasks |
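The BPE row in the table refers to a training procedure that repeatedly merges the most frequent adjacent pair of symbols. The following is a minimal sketch of that loop on a toy word-frequency table; the corpus and helper names are assumptions for illustration, and production implementations add byte-level handling, end-of-word markers, and far larger corpora.

```python
from collections import Counter

def most_frequent_pair(words: dict[tuple[str, ...], int]) -> tuple[str, str]:
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)  # ties go to the pair seen first

def merge_pair(words, pair):
    """Rewrite every word so that occurrences of `pair` become one symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

# Toy corpus: word -> frequency. Each word starts as a tuple of characters.
corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
words = {tuple(w): f for w, f in corpus.items()}

merges = []
for _ in range(3):
    pair = most_frequent_pair(words)
    merges.append(pair)
    words = merge_pair(words, pair)

print(merges)
# [('e', 's'), ('es', 't'), ('l', 'o')]
```

After three merges the suffix "est" has become a single symbol, which is how frequent fragments end up in the subword vocabulary.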
Stopword Removal — Cutting the Noise
Stopwords are the most common words in a language: words like the, is, at, which, on, and, a, an. They appear constantly but carry almost no discriminative meaning for most NLP tasks. Removing them reduces noise and substantially cuts the number of tokens your model has to process, although the vocabulary itself shrinks by only as many entries as the stopword list contains.
The meaning is identical. The words "I", "in", and "morning" are padding — your brain fills them in automatically. Stopword removal does the same for your NLP model. The remaining content words — arriving, London, tomorrow — carry all the signal.
Input: "The quick brown fox jumps over the lazy dog near a river"
After Stopword Removal (NLTK English list): "quick brown fox jumps lazy dog near river"
Stopword Removal in Python
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('punkt_tab')  # needed by newer NLTK releases
text = "The quick brown fox jumps over the lazy dog near a river"
# Load English stopwords (179 words here; the count varies between NLTK data releases)
stop_words = set(stopwords.words('english'))
print(f"Stopwords count: {len(stop_words)}")
# Tokenize first, then filter
tokens = word_tokenize(text.lower())
clean = [t for t in tokens if t not in stop_words]
print(f"Original tokens : {tokens}")
print(f"After filtering : {clean}")
print(f"Reduction : {len(tokens)} → {len(clean)} tokens")
Custom Stopword Lists
NLTK's default list is generic. Domain-specific text often needs a custom list. For example, in financial news, words like "company", "market", and "said" appear so frequently they lose meaning.
from nltk.corpus import stopwords
# Start with NLTK's defaults
stop_words = set(stopwords.words('english'))
# Add domain-specific stops (financial news context)
custom_stops = {'said', 'company', 'market', 'year', 'percent'}
stop_words.update(custom_stops)
# Or REMOVE words that matter in your domain
# e.g. "not" is a stopword but crucial for sentiment analysis
stop_words.discard('not')
stop_words.discard('no')
stop_words.discard('never')
print(f"Custom stopword list size: {len(stop_words)}")
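One way to find candidates for such a domain list, rather than guessing them, is to look at document frequency: a token that shows up in nearly every document tells documents apart poorly. The sketch below assumes already-tokenized documents; the `frequent_terms` helper and the 0.8 threshold are illustrative choices, not a standard.

```python
from collections import Counter

def frequent_terms(docs: list[list[str]], min_doc_ratio: float = 0.8) -> set[str]:
    """Return tokens that occur in at least min_doc_ratio of the documents."""
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each token once per document
    threshold = min_doc_ratio * len(docs)
    return {tok for tok, df in doc_freq.items() if df >= threshold}

docs = [
    "the company said profits rose".split(),
    "the company said shares fell".split(),
    "the market said the company grew".split(),
    "analysts said the company missed targets".split(),
]
print(sorted(frequent_terms(docs)))
# ['company', 'said', 'the']
```

Review the candidates by hand before adding them to the list; a frequent word can still matter for your task.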
For sentiment analysis, removing "not" is catastrophic — "not good" becomes "good". For question answering, removing "what", "who", "where" destroys the question's intent. Always tailor your stopword strategy to the downstream task.
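The effect on sentiment is easy to demonstrate with a tiny hand-written stop list. The `STOPS` and `NEGATIONS` sets here are illustrative stand-ins, not NLTK's actual list.

```python
STOPS = {"the", "was", "is", "a", "not", "no"}  # tiny illustrative list
NEGATIONS = {"not", "no", "never"}

def filter_tokens(tokens: list[str], stops: set[str]) -> list[str]:
    return [t for t in tokens if t not in stops]

tokens = "the film was not good".split()

# Default list: the negation disappears and the meaning flips.
print(filter_tokens(tokens, STOPS))
# ['film', 'good']

# Negations protected: the sentiment signal survives.
print(filter_tokens(tokens, STOPS - NEGATIONS))
# ['film', 'not', 'good']
```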
| NLP Task | Remove Stopwords? | Reason |
|---|---|---|
| Topic Modelling (LDA) | ✅ Yes | Function words dominate topics unfairly |
| TF-IDF / Bag of Words | ✅ Yes | Stopwords dominate raw counts and add little discriminative weight |
| Information Retrieval | ✅ Yes | Reduces index size with minimal precision loss |
| Sentiment Analysis | ⚠️ Partial | Keep negations ("not", "never", "no") |
| Named Entity Recognition | ❌ No | Stopwords provide context for entity boundaries |
| Machine Translation | ❌ No | Grammar depends on function words |
| BERT / Transformer models | ❌ No | Models use full context; stopwords help attention |
Punctuation Handling — Signal or Noise?
Punctuation is the most contextual element of text preprocessing. Sometimes it is pure noise ("Hello!!!" — the exclamation marks add nothing). Sometimes it is the most important signal ("Buy now." vs "Buy now?" — the question mark changes the whole intent).
For most NLP tasks, that comma is noise. For legal text analysis, it is life or death. Context decides. Always decide before you delete.
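When punctuation might be signal, one option is to record it as features before any stripping happens. A minimal sketch, assuming simple counts are enough for your task (the feature names are made up for this example):

```python
import re

def punctuation_features(text: str) -> dict[str, int]:
    """Count punctuation signals before the text is cleaned."""
    return {
        "exclamations": text.count("!"),
        "questions": text.count("?"),
        "ellipses": len(re.findall(r"\.{3,}", text)),
    }

print(punctuation_features("Buy now?"))
# {'exclamations': 0, 'questions': 1, 'ellipses': 0}
print(punctuation_features("Hello!!! Wait..."))
# {'exclamations': 3, 'questions': 0, 'ellipses': 1}
```

The counts can travel alongside the cleaned tokens, so deleting the characters later does not discard the information.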
Punctuation Removal — Three Approaches
import re
import string
from nltk.tokenize import word_tokenize
text = "Hello, World!!! I can't believe it's 2025... #NLP @OpenAI"
# ── APPROACH 1: Python string.punctuation (blunt) ────────
# Removes ALL punctuation — including apostrophes!
no_punct_1 = text.translate(str.maketrans('', '', string.punctuation))
print("Blunt:", no_punct_1)
# ── APPROACH 2: Regex (surgical) ─────────────────────────
# Keep every apostrophe (including stray ones); remove all other punctuation
no_punct_2 = re.sub(r"[^\w\s']", '', text)
print("Surgical:", no_punct_2)
# ── APPROACH 3: Post-tokenization filter (safest) ────────
# Tokenize first, then drop pure-punctuation tokens
tokens = word_tokenize(text)
clean_toks = [t for t in tokens
              if t.isalnum() or "'" in t]
print("Post-tokenize:", clean_toks)
Always tokenize before removing punctuation. The blunt approach turns "can't" into "cant" — a real but completely different word. Tokenize first (NLTK correctly splits it as ca + n't), then decide what to keep.
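The tokenize-first rule can be tried without NLTK using a regular expression that splits contractions roughly the way the Treebank convention does. The pattern is a simplification written for this example and will not match NLTK's output on every input.

```python
import re

# Split "n't" and other clitics off the word, then fall back to plain words
# and single punctuation marks. A simplified stand-in for word_tokenize.
TOKEN_RE = re.compile(r"\w+(?=n't)|n't|'\w+|\w+|[^\w\s]")

def tokenize(text: str) -> list[str]:
    return TOKEN_RE.findall(text)

def drop_punctuation(tokens: list[str]) -> list[str]:
    # Keep alphanumeric tokens and clitics such as n't or 's.
    return [t for t in tokens if t.isalnum() or "'" in t]

text = "I can't believe it's 2025..."
tokens = tokenize(text)
print(tokens)
# ['I', 'ca', "n't", 'believe', 'it', "'s", '2025', '.', '.', '.']
print(drop_punctuation(tokens))
# ['I', 'ca', "n't", 'believe', 'it', "'s", '2025']
```

Because the negation lives in its own token, later steps can decide to keep "n't" instead of losing it inside "cant".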
The Complete Preprocessing Pipeline
Here is a reusable preprocessing function that chains all three steps together, with sensible defaults and task-specific options.
import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
nltk.download('punkt', quiet=True)
nltk.download('stopwords', quiet=True)
nltk.download('punkt_tab', quiet=True)
def preprocess_text(
    text,
    lowercase=True,
    remove_stops=True,
    remove_punct=True,
    keep_negations=True,   # keep "not", "no", "never"
    custom_stops=None,     # extra words to remove
    min_len=2,             # discard tokens shorter than this
):
    """Full text preprocessing pipeline.

    Returns a list of clean tokens.
    """
    # ── Step 1: Lowercase ────────────────────────────────
    if lowercase:
        text = text.lower()
    # ── Step 2: Remove URLs and HTML tags ────────────────
    text = re.sub(r'http\S+|www\S+', '', text)
    text = re.sub(r'<[^>]+>', '', text)
    # ── Step 3: Tokenize ─────────────────────────────────
    tokens = word_tokenize(text)
    # ── Step 4: Remove punctuation ───────────────────────
    # Note: isalnum() also drops contraction pieces such as "n't".
    if remove_punct:
        tokens = [t for t in tokens if t.isalnum()]
    # ── Step 5: Remove stopwords ─────────────────────────
    if remove_stops:
        stops = set(stopwords.words('english'))
        if keep_negations:
            stops -= {'not', 'no', 'never', 'nor', 'neither'}
        if custom_stops:
            stops |= set(custom_stops)
        tokens = [t for t in tokens if t not in stops]
    # ── Step 6: Length filter ─────────────────────────────
    tokens = [t for t in tokens if len(t) >= min_len]
    return tokens
# ── Test it ───────────────────────────────────────────────
sample = """
I absolutely CANNOT believe the film was not good at all!!!
Visit https://example.com for more info.
The cinematography <b>was</b> breathtaking, but story failed.
"""
tokens = preprocess_text(sample, keep_negations=True)
print("Clean tokens:", tokens)
print(f"Count: {len(tokens)}")
Starting from 46 raw tokens (including URLs, HTML, punctuation, and stopwords), the pipeline produced 10 clean content tokens, a 78% reduction in noise, while crucially preserving the two "not" negations that carry the sentiment signal. Exact counts may differ slightly depending on your NLTK version.
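For reference, the reduction figure is plain arithmetic on the two token counts. Taking the quoted counts as given, a small helper (the name `reduction_pct` is made up here) reproduces it:

```python
def reduction_pct(before: int, after: int) -> float:
    """Percentage of tokens removed by preprocessing."""
    return 100 * (before - after) / before

print(f"{reduction_pct(46, 10):.1f}%")
# 78.3%
```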
spaCy — The Industry Standard Pipeline
While NLTK is great for learning, spaCy is widely used in production NLP systems. It is fast, accurate, and handles tokenization and stopword detection in one pass, with built-in awareness of abbreviations, contractions, and special cases.
import spacy
# Load the English model (run: python -m spacy download en_core_web_sm)
nlp = spacy.load('en_core_web_sm')
text = "Dr. Smith won't attend the U.N. conference in New York City."
# Process — one call does tokenization, POS, NER, stopword detection
doc = nlp(text)
print(f"{'Token':15} {'Lemma':15} {'Stop?':8} {'Punct?':8} {'POS'}")
print("-" * 60)
for token in doc:
    print(f"{token.text:15} {token.lemma_:15} {str(token.is_stop):8} {str(token.is_punct):8} {token.pos_}")
# Clean tokens in one line
clean = [t.lemma_ for t in doc
if not t.is_stop and not t.is_punct and not t.is_space]
print("\nClean lemmatised tokens:", clean)
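Notice that spaCy leaves attend unchanged because it is already the base form. More importantly, it splits "won't" into "wo" + "n't" and typically lemmatises the pieces to "will" and "not", morphological detail that NLTK's word_tokenize alone does not provide. One caveat: spaCy's default English stop list flags both "n't" and "not" as stopwords, so the one-line filter above drops the negation. Exclude negations from the is_stop check when the task depends on them.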
Visual Summary — Choosing Your Strategy
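Classic bag-of-words, TF-IDF, and topic models: lowercase + tokenize + remove stopwords + remove punctuation.
Sentiment analysis: keep negations and sentiment punctuation (!?).
Transformer models: no stopword removal, no heavy stripping.
Common pitfall: stripping punctuation before tokenizing turns "can't" into "cant", a real but wrong word. Tokenize first, then decide what stays.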
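The strategies above can be captured as named presets whose keys mirror the keyword arguments of the preprocess_text function defined earlier. The preset names and the exact flag choices are assumptions for illustration; adjust them to your own task.

```python
# Presets keyed by task family. Keys match preprocess_text's keyword arguments.
PRESETS = {
    "classic": {
        "lowercase": True, "remove_stops": True,
        "remove_punct": True, "keep_negations": False,
    },
    "sentiment": {
        "lowercase": True, "remove_stops": True,
        "remove_punct": False, "keep_negations": True,
    },
    "transformer": {
        "lowercase": False, "remove_stops": False,
        "remove_punct": False, "keep_negations": True,
    },
}

def options_for(task: str) -> dict:
    """Return a copy of the preset so callers can tweak it safely."""
    try:
        return dict(PRESETS[task])
    except KeyError:
        raise ValueError(
            f"unknown task {task!r}; expected one of {sorted(PRESETS)}"
        ) from None

print(options_for("sentiment"))
```

With the pipeline function in scope, a call would look like `preprocess_text(text, **options_for("sentiment"))`.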