
Tokenization, Stopword Removal & Punctuation Handling

A comprehensive, story-driven tutorial covering the three pillars of NLP text preprocessing: word, sentence, and subword tokenization; stopword removal strategies; and punctuation handling.

Section 01

The Story That Explains Text Preprocessing

The Library Cataloguer's Secret
Imagine a new library assistant given a pile of 10,000 handwritten book reviews. They are told: "Find all the reviews about cooking."

The assistant opens the first box and finds chaos. Some reviews say "Cooking", others say "COOKING", "cookng", "cook!ng", "I love cooking!!!", and "The art of — cooking — is magnificent." All mean the same thing, but a simple word-search would miss half of them.

The wise assistant doesn't panic. They first clean the text (remove the noise), then standardise it (make it consistent), and finally chop it into pieces (tokens) that a machine can compare quickly. Only then do they hand it to the computer.

That is Text Preprocessing — the art of turning raw, messy human language into a clean, structured form a machine can actually learn from.

In Natural Language Processing (NLP), the quality of your preprocessing often matters more than the choice of model. Garbage in, garbage out — but clean text in, clean predictions out. This tutorial covers the three pillars of text preprocessing: Tokenization, Stopword Removal, and Punctuation Handling.

🧠
Why Preprocessing is Step Zero

Raw text is among the messiest data types in machine learning. Unlike a numeric feature, where a value is just a number, the same meaning can appear in hundreds of linguistic forms. Preprocessing unifies them so your model sees meaning, not noise.
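To make the cataloguer's problem concrete, here is a minimal sketch comparing a naive word search with a search over normalised tokens. The review strings are hypothetical samples echoing the story, and the function names are made up for illustration. Misspellings such as "cookng" are deliberately left out: they need spelling correction, which is a separate step from the preprocessing covered here.

```python
import re

# Hypothetical sample reviews, echoing the variants in the library story.
reviews = [
    "Cooking",
    "COOKING",
    "I love cooking!!!",
    "The art of -- cooking -- is magnificent.",
    "A history of gardening",
]

def naive_match(review: str, keyword: str) -> bool:
    """Case-sensitive substring search on the raw text."""
    return keyword in review

def normalised_match(review: str, keyword: str) -> bool:
    """Lowercase, keep only runs of letters, then compare whole tokens."""
    tokens = re.findall(r"[a-z]+", review.lower())
    return keyword in tokens

naive_hits = [r for r in reviews if naive_match(r, "cooking")]
clean_hits = [r for r in reviews if normalised_match(r, "cooking")]

print(len(naive_hits), "of", len(reviews), "reviews found by naive search")
print(len(clean_hits), "of", len(reviews), "reviews found after normalisation")
```

The naive search misses the capitalised variants, while the normalised search finds every cooking review and still skips the gardening one.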


Section 02

The Preprocessing Pipeline at a Glance

Every NLP project follows a similar cleaning sequence. Think of it as an assembly line — raw text enters one end, clean tokens exit the other.

⚙️  NLP Text Preprocessing Pipeline

1. 📥 Raw Text Input: unstructured text from any source (emails, tweets, articles, reviews)
   "Hello, World!! I LOVE NLP!!!"
2. 🔡 Lowercasing & Normalisation: unify character case; collapse extra whitespace
   "hello, world!! i love nlp!!!"
3. ✂️ Tokenization: split text into individual tokens (words or subwords); punctuation marks are still present as tokens at this point
   hello , world ! ! i love nlp ! ! !
4. 🧹 Punctuation & Noise Removal: strip punctuation, special characters, and HTML tags
   hello world i love nlp
5. 🚫 Stopword Removal: remove high-frequency, low-meaning words (the, is, I, a, …)
   hello world love nlp
6. Clean Tokens (Model Ready): meaningful tokens, ready to be vectorised or fed into a model
   hello world love nlp
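The stages above can be sketched in a few lines of standard-library Python. This is only an illustration of the order of operations: the regex tokenizer and the four-word `TOY_STOPWORDS` set are stand-ins chosen for this example, not what NLTK or spaCy actually use.

```python
import re

# Tiny illustrative stopword set; real lists (NLTK, spaCy) are far longer.
TOY_STOPWORDS = {"i", "the", "a", "is"}

raw = "Hello, World!! I LOVE NLP!!!"

# Stage 1: lowercase and collapse repeated whitespace.
normalised = re.sub(r"\s+", " ", raw.lower()).strip()

# Stage 2: tokenize into word runs and single punctuation marks.
tokens = re.findall(r"\w+|[^\w\s]", normalised)

# Stage 3: drop tokens that contain no letter or digit.
no_punct = [t for t in tokens if any(ch.isalnum() for ch in t)]

# Stage 4: drop stopwords.
clean = [t for t in no_punct if t not in TOY_STOPWORDS]

print("normalised:", normalised)
print("tokens    :", tokens)
print("no punct  :", no_punct)
print("clean     :", clean)
```

Printing each intermediate value makes it easy to see which stage removed which token.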

Section 03

Tokenization — Cutting Language into Pieces

Tokenization is the process of splitting a string of text into smaller units called tokens. A token can be a word, a sentence, a subword fragment, or even a single character. The choice of tokenizer fundamentally shapes what your model can learn.

The Bread-Slicing Machine
Imagine a long baguette of text. A word tokenizer slices it at every space, so each slice is one word. A sentence tokenizer slices at every sentence-ending mark, so each slice is one sentence. A subword tokenizer has finer blades that slice further: the word "unhappiness" might become ["un", "happiness"], so the model can reuse what it already knows about the prefix "un" (as in "unhappy") and about "happiness".

Choosing the wrong slicer is like slicing a baguette with a cleaver — you can do it, but you'll lose a lot of meaning in the crumbs.

The Three Types of Tokenization

🔤 Word Tokenization: Splits on whitespace or punctuation. Simple and fast, but it loses morphological information, and out-of-vocabulary (OOV) words are a major problem.
📝 Sentence Tokenization: Splits on sentence boundaries (. ! ?). Used in summarisation, translation, and text classification by sentence.
🧩 Subword Tokenization: Splits words into frequently occurring fragments using BPE, WordPiece, or SentencePiece. Used in BERT, GPT, and most modern LLMs.
🔍  Tokenization Comparison: "unhappiness is not impossible"
Word: unhappiness | is | not | impossible
Sentence: "unhappiness is not impossible" (the whole input is one sentence, so it is one token here)
Subword (WordPiece-style, as used by BERT; illustrative split): un | ##happi | ##ness | is | not | im | ##possible
💡 The ## prefix (WordPiece convention) marks a subword that continues from the previous token. Notice how "un" and "im" (common prefixes) are shared across different words. This is the power of subword tokenization.
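To see where the ## pieces come from, here is a minimal sketch of greedy longest-match-first splitting, the lookup strategy WordPiece-style tokenizers apply at inference time. The vocabulary `TOY_VOCAB` is hypothetical and hand-picked so the example works; a real vocabulary is learned from a corpus and may split these words differently.

```python
# Hypothetical toy vocabulary; a real WordPiece vocabulary has tens of
# thousands of entries learned from a corpus.
TOY_VOCAB = {
    "un", "im", "is", "not", "happy",
    "##happi", "##happy", "##ness", "##possible",
}

def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

def tokenize(text, vocab=TOY_VOCAB):
    """Lowercase, split on whitespace, then split each word into pieces."""
    return [p for w in text.lower().split() for p in wordpiece_tokenize(w, vocab)]

print(tokenize("unhappiness is not impossible"))
print(wordpiece_tokenize("unhappy", TOY_VOCAB))
```

Because "un" is a vocabulary entry of its own, both "unhappiness" and "unhappy" start with the same piece, which is how related words end up sharing parameters.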

Word Tokenization in Python

import nltk
from nltk.tokenize import word_tokenize, sent_tokenize

# Download required data (run once)
nltk.download('punkt')
nltk.download('punkt_tab')

text = "Dr. Smith went to Washington. He couldn't believe it!"

# Word tokenization
words = word_tokenize(text)
print("Word tokens:", words)

# Sentence tokenization
sentences = sent_tokenize(text)
print("Sentence tokens:", sentences)
OUTPUT
Word tokens: ['Dr.', 'Smith', 'went', 'to', 'Washington', '.', 'He', 'could', "n't", 'believe', 'it', '!']
Sentence tokens: ['Dr. Smith went to Washington.', "He couldn't believe it!"]
💡
Notice the Smart Handling of "Dr."

NLTK's sent_tokenize correctly treats "Dr." as an abbreviation (not a sentence boundary), keeping the first sentence intact. Its pre-trained Punkt model has learned common abbreviations from data, which is why it is more reliable than a simple split('.').
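The difference is easy to reproduce without NLTK. The sketch below contrasts a naive split after every full stop with a splitter that consults a small abbreviation list. The `ABBREVIATIONS` set and the `split_sentences` helper are hypothetical and hand-written for this example; Punkt arrives at its abbreviation knowledge by learning from text rather than from a hard-coded list.

```python
import re

text = "Dr. Smith went to Washington. He couldn't believe it!"

# Naive approach: cut after every full stop that is followed by whitespace.
naive = [s.strip() for s in re.split(r"(?<=\.)\s+", text)]

# Hypothetical hand-written abbreviation list, for illustration only.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "prof.", "e.g.", "i.e."}

def split_sentences(text):
    """Split after . ! ? unless the word is a known abbreviation."""
    sentences, current = [], []
    for word in text.split():
        current.append(word)
        ends_sentence = word[-1] in ".!?" and word.lower() not in ABBREVIATIONS
        if ends_sentence:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing text without closing punctuation
        sentences.append(" ".join(current))
    return sentences

print("Naive :", naive)
print("Aware :", split_sentences(text))
```

The naive version produces three fragments because it cuts after "Dr.", while the abbreviation-aware version returns the two real sentences.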

Subword Tokenization with Hugging Face

from transformers import AutoTokenizer

# BERT uses WordPiece subword tokenization
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

text = "unhappiness is not impossible"
tokens = tokenizer.tokenize(text)
token_ids = tokenizer.encode(text)

print("Subword tokens:", tokens)
print("Token IDs:     ", token_ids)
print(f"Vocabulary size: {tokenizer.vocab_size:,}")
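OUTPUT (illustrative; the exact subword splits and IDs depend on the checkpoint's vocabulary)
Subword tokens: ['un', '##happi', '##ness', 'is', 'not', 'im', '##possible']
Token IDs:       [101, 4895, 17662, 8430, 2003, 2025, 4895, 6304, 102]
Vocabulary size: 30,522

encode() wraps the sequence in the special tokens [CLS] (101) and [SEP] (102), so the ID list is two entries longer than the token list. Distinct tokens always map to distinct IDs, so treat the repeated 4895 above as a placeholder and run the snippet to see the real values.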
| Method | Algorithm | Used By | OOV Problem? | Best For |
|---|---|---|---|---|
| Whitespace Split | str.split() | Toy examples | Very bad | Nothing serious |
| NLTK Word | Punkt + rules | Classic NLP | Moderate | Bag-of-words, TF-IDF |
| spaCy | Rule + model | Production NLP | Moderate | NER, dependency parsing |
| BPE | Byte Pair Encoding | GPT, RoBERTa | Rare (none with byte-level BPE) | Generation tasks |
| WordPiece | Likelihood-based | BERT, DistilBERT | Rare ([UNK] fallback) | Classification, NER |
| SentencePiece | Unigram/BPE | T5, XLM-R | Rare | Multilingual tasks |
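For intuition about how a BPE vocabulary is built, here is a simplified sketch of the merge-learning loop: count adjacent symbol pairs, merge the most frequent pair, and repeat. The word counts are hypothetical, and the sketch leaves out details that real implementations add, such as an end-of-word marker and byte-level fallback. Ties are broken by first-seen order, which is an arbitrary choice made here.

```python
from collections import Counter

def get_pair_counts(corpus):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in corpus.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs

def merge_pair(corpus, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in corpus.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def learn_bpe(word_freqs, num_merges):
    """Start from single characters and apply `num_merges` greedy merges."""
    corpus = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(corpus)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # first-seen pair wins ties
        merges.append(best)
        corpus = merge_pair(corpus, best)
    return merges, corpus

# Hypothetical word counts standing in for a training corpus.
word_freqs = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, corpus = learn_bpe(word_freqs, num_merges=3)

print("Learned merges:", merges)
for symbols, freq in corpus.items():
    print(freq, symbols)
```

After three merges the frequent suffix "est" has become a single symbol, which is exactly the kind of reusable fragment the table's subword rows rely on.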

Section 04

Stopword Removal — Cutting the Noise

Stopwords are the most common words in a language: words like the, is, at, which, on, and, a, an. They appear constantly but carry almost no discriminative meaning for most NLP tasks. Removing them reduces noise and cuts the number of tokens substantially, although the vocabulary itself only loses the few hundred entries on the list.

The Telegram Operator's Trick
In the era of paid-per-word telegrams, operators learned a shorthand: "Arriving London tomorrow" instead of "I am arriving in London tomorrow."

The meaning is the same. The words "I", "am", and "in" are padding that your brain fills in automatically. Stopword removal does the same for your NLP model. The remaining content words (arriving, London, tomorrow) carry all the signal.
🚫  Stopword Filter

Input: "The quick brown fox jumps over the lazy dog near a river"

Stopwords removed (per NLTK's English list): The, over, the, a
Content words kept: quick, brown, fox, jumps, lazy, dog, near, river

After Stopword Removal: quick brown fox jumps lazy dog near river

12 tokens → 8 tokens  |  33% reduction

Stopword Removal in Python

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

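nltk.download('stopwords')
nltk.download('punkt')
nltk.download('punkt_tab')  # tokenizer data required by newer NLTK releases

text = "The quick brown fox jumps over the lazy dog near a river"

# Load English stopwords (179 words in older NLTK data; the count varies by data version)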
stop_words = set(stopwords.words('english'))
print(f"Stopwords count: {len(stop_words)}")

# Tokenize first, then filter
tokens = word_tokenize(text.lower())
clean  = [t for t in tokens if t not in stop_words]

print(f"Original tokens : {tokens}")
print(f"After filtering : {clean}")
print(f"Reduction       : {len(tokens)} → {len(clean)} tokens")
OUTPUT (the stopword count depends on the installed NLTK data version)
Stopwords count: 179
Original tokens : ['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', 'near', 'a', 'river']
After filtering : ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog', 'near', 'river']
Reduction       : 12 → 8 tokens

Custom Stopword Lists

NLTK's default list is generic. Domain-specific text often needs a custom list. For example, in financial news, words like "company", "market", and "said" appear so frequently they lose meaning.

from nltk.corpus import stopwords

# Start with NLTK's defaults
stop_words = set(stopwords.words('english'))

# Add domain-specific stops (financial news context)
custom_stops = {'said', 'company', 'market', 'year', 'percent'}
stop_words.update(custom_stops)

# Or REMOVE words that matter in your domain
# e.g. "not" is a stopword but crucial for sentiment analysis
stop_words.discard('not')
stop_words.discard('no')
stop_words.discard('never')

print(f"Custom stopword list size: {len(stop_words)}")
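OUTPUT (with the 179-word default list: 179 + 5 added, minus "not" and "no"; "never" is not on NLTK's list, so discarding it changes nothing)
Custom stopword list size: 182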
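If you are unsure which words deserve a place on a domain-specific list, document frequency is a reasonable first signal. The sketch below flags every word that appears in at least half of a small corpus. The headlines, the `candidate_stopwords` helper, and the 0.5 threshold are all assumptions made for illustration; on real data you would tune the threshold and review the candidates by hand.

```python
from collections import Counter

# Hypothetical mini-corpus of financial headlines.
docs = [
    "the company said profits rose this year",
    "the market fell as the company cut jobs",
    "analysts said the market may recover",
    "the company said the merger is complete",
]

def candidate_stopwords(docs, min_doc_ratio=0.5):
    """Return words that occur in at least `min_doc_ratio` of the documents."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc.split()))  # count each word once per document
    threshold = min_doc_ratio * len(docs)
    return sorted(w for w, df in doc_freq.items() if df >= threshold)

candidates = candidate_stopwords(docs, min_doc_ratio=0.5)
print("Candidate domain stopwords:", candidates)
```

Generic words like "the" surface alongside domain words like "company" and "market", so the output is a list of candidates to review, not a finished stoplist.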
⚠️
Don't Remove Stopwords Blindly

For sentiment analysis, removing "not" is catastrophic — "not good" becomes "good". For question answering, removing "what", "who", "where" destroys the question's intent. Always tailor your stopword strategy to the downstream task.
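A two-line experiment shows the damage. The sketch below uses a small illustrative stopword set that, like NLTK's list, contains "not"; the `remove_stopwords` helper and its `keep_negations` flag are hypothetical names used only in this example.

```python
# Small illustrative stopword set that, like NLTK's list, contains "not".
STOPWORDS = {"the", "was", "is", "a", "not", "no"}
NEGATIONS = {"not", "no", "never"}

def remove_stopwords(tokens, keep_negations=False):
    """Filter stopwords, optionally protecting negation words."""
    stops = STOPWORDS - NEGATIONS if keep_negations else STOPWORDS
    return [t for t in tokens if t not in stops]

tokens = "the film was not good".split()

blind = remove_stopwords(tokens)
careful = remove_stopwords(tokens, keep_negations=True)

print("Blind  :", blind)
print("Careful:", careful)
```

The blind filter leaves a review that reads as positive, while the careful one keeps the negation that flips its meaning.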

| NLP Task | Remove Stopwords? | Reason |
|---|---|---|
| Topic Modelling (LDA) | ✅ Yes | Function words dominate topics unfairly |
| TF-IDF / Bag of Words | ✅ Yes | Stopwords add high-count, low-information features; IDF already down-weights them, but removing them shrinks the feature space |
| Information Retrieval | ✅ Yes | Reduces index size with minimal precision loss |
| Sentiment Analysis | ⚠️ Partial | Keep negations ("not", "never", "no") |
| Named Entity Recognition | ❌ No | Stopwords provide context for entity boundaries |
| Machine Translation | ❌ No | Grammar depends on function words |
| BERT / Transformer models | ❌ No | Models are pre-trained on full text and use function words as context |
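For the TF-IDF row, it helps to see what inverse document frequency does to a word that appears everywhere. The sketch below uses the plain textbook formula, idf(t) = log(N / df(t)), on a hypothetical four-document corpus. Libraries typically apply smoothed variants, so their numbers will differ, and this bare version raises a ZeroDivisionError for a term that appears in no document.

```python
import math

# Hypothetical mini-corpus.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the bird sang",
    "a dog barked at the bird",
]

def idf(term, docs):
    """Unsmoothed inverse document frequency: log(N / df)."""
    df = sum(1 for d in docs if term in d.split())
    return math.log(len(docs) / df)

idf_the = idf("the", docs)    # appears in all 4 documents
idf_cat = idf("cat", docs)    # appears in 2 documents
idf_sang = idf("sang", docs)  # appears in 1 document

print(f"idf('the')  = {idf_the:.3f}")
print(f"idf('cat')  = {idf_cat:.3f}")
print(f"idf('sang') = {idf_sang:.3f}")
```

A word present in every document gets an IDF of zero, so its TF-IDF weight vanishes whether or not you remove it. Removing it still saves a column in the feature matrix.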

Section 05

Punctuation Handling — Signal or Noise?

Punctuation is the most contextual element of text preprocessing. Sometimes it is pure noise ("Hello!!!" — the exclamation marks add nothing). Sometimes it is the most important signal ("Buy now." vs "Buy now?" — the question mark changes the whole intent).

The Court Transcription Dilemma
A court reporter was asked to transcribe: "Execute, not pardon him." and "Execute not, pardon him." These are two completely opposite commands, distinguished only by where the comma falls.

For most NLP tasks, that comma is noise. For legal text analysis, it is life or death. Context decides. Always decide before you delete.
🔣  Punctuation — Signal vs Noise Classification
| Symbol | Name | Verdict | Why |
|---|---|---|---|
| ! | Exclamation | Task-Dependent | Noise for topic models; signal for sentiment (excitement/anger) |
| ? | Question Mark | Signal | Intent detection: marks a question vs statement |
| ' | Apostrophe | Signal | "can't" ≠ "cant"; preserve before tokenizing contractions |
| , | Comma | Usually Noise | Rarely meaningful post-tokenization for most tasks |
| @ | Mention | Task-Dependent | Noise for topic models; entity signal for social media NLP |
| # | Hashtag | Often Signal | Carries strong topical and sentiment information |
| - | Hyphen | Task-Dependent | "state-of-the-art": split or keep? Depends on vocabulary design |
| ... | Ellipsis | Task-Dependent | Can signal hesitation or trailing thought in sentiment analysis |
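One way to act on this classification is to make the set of protected symbols a parameter. The sketch below builds on Python's `string.punctuation`; the `strip_punctuation` helper and its `keep` argument are hypothetical names chosen for this example, and the function only handles ASCII punctuation.

```python
import string

def strip_punctuation(text, keep=""):
    """Remove ASCII punctuation except the characters listed in `keep`."""
    to_remove = "".join(ch for ch in string.punctuation if ch not in keep)
    return text.translate(str.maketrans("", "", to_remove))

statement = "Buy now."
question = "Buy now?"

# Stripping everything makes the two intents indistinguishable.
same_after_blunt = strip_punctuation(statement) == strip_punctuation(question)

# Keeping "?" preserves the distinction needed for intent detection.
kept_question = strip_punctuation(question, keep="?")

tweet = "Loving #NLP with @OpenAI!!!"
topic_view = strip_punctuation(tweet)              # everything stripped
social_view = strip_punctuation(tweet, keep="#@")  # hashtags and mentions kept

print(same_after_blunt, "|", kept_question)
print(topic_view, "|", social_view)
```

The same function serves a topic model and a social media pipeline; only the `keep` string changes.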

Punctuation Removal — Three Approaches

import re
import string
from nltk.tokenize import word_tokenize

text = "Hello, World!!! I can't believe it's 2025... #NLP @OpenAI"

# ── APPROACH 1: Python string.punctuation (blunt) ────────
# Removes ALL punctuation — including apostrophes!
no_punct_1 = text.translate(str.maketrans('', '', string.punctuation))
print("Blunt:", no_punct_1)

# ── APPROACH 2: Regex (surgical) ─────────────────────────
# Keep word characters, whitespace, and apostrophes; remove everything else
no_punct_2 = re.sub(r"[^\w\s']", '', text)
print("Surgical:", no_punct_2)

# ── APPROACH 3: Post-tokenization filter (safest) ────────
# Tokenize first, then drop pure-punctuation tokens
tokens      = word_tokenize(text)
clean_toks  = [t for t in tokens
               if t.isalnum() or "'" in t]
print("Post-tokenize:", clean_toks)
OUTPUT
Blunt: Hello World I cant believe its 2025 NLP OpenAI
Surgical: Hello World I can't believe it's 2025 NLP OpenAI
Post-tokenize: ['Hello', 'World', 'I', 'ca', "n't", 'believe', 'it', "'s", '2025', 'NLP', 'OpenAI']
🔑
The Golden Rule of Punctuation Handling

Always tokenize before removing punctuation. The blunt approach turns "can't" into "cant" — a real but completely different word. Tokenize first (NLTK correctly splits it as ca + n't), then decide what to keep.
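If you would rather keep contractions, hashtags, and mentions as single tokens, a small regex tokenizer is one alternative to the NLTK split. The pattern below is a sketch written for this example, not a standard: it only knows about the straight ASCII apostrophe and will need extending for curly quotes, URLs, or emoji.

```python
import re

TOKEN_PATTERN = re.compile(
    r"""
    [#@]\w+            # hashtags and mentions kept whole
    | \w+(?:'\w+)*     # words, including inner apostrophes (can't, it's)
    | \.\.\.           # an ellipsis as one token
    | [^\w\s]          # any other single punctuation mark
    """,
    re.VERBOSE,
)

def tokenize(text):
    """Return every token matched by TOKEN_PATTERN, in order."""
    return TOKEN_PATTERN.findall(text)

text = "Hello, World!!! I can't believe it's 2025... #NLP @OpenAI"
tokens = tokenize(text)

# Tokenize first, then decide: here we keep anything with a letter or digit.
words_only = [t for t in tokens if any(ch.isalnum() for ch in t)]

print("All tokens:", tokens)
print("Words only:", words_only)
```

Because punctuation is dropped only after tokenization, "can't" survives intact and the hashtag keeps its marker.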


Section 06

The Complete Preprocessing Pipeline

Here is a reusable preprocessing function that chains all three steps together, with sensible defaults and task-specific options.

import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

nltk.download('punkt', quiet=True)
nltk.download('stopwords', quiet=True)
nltk.download('punkt_tab', quiet=True)

def preprocess_text(
    text,
    lowercase     = True,
    remove_stops  = True,
    remove_punct  = True,
    keep_negations= True,   # keep "not", "no", "never"
    custom_stops  = None,   # extra words to remove
    min_len       = 2        # discard tokens shorter than this
):
    """Full text preprocessing pipeline.
    Returns a list of clean tokens.
    """
    # ── Step 1: Lowercase ────────────────────────────────
    if lowercase:
        text = text.lower()

    # ── Step 2: Remove URLs and HTML tags ────────────────
    text = re.sub(r'http\S+|www\S+', '', text)
    text = re.sub(r'<[^>]+>', '', text)

    # ── Step 3: Tokenize ─────────────────────────────────
    tokens = word_tokenize(text)

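    # ── Step 4: Remove punctuation ───────────────────────
    if remove_punct:
        # word_tokenize splits "wasn't" into "was" + "n't"; map "n't" to
        # "not" first so isalnum() does not silently drop the negation.
        if keep_negations:
            tokens = ['not' if t == "n't" else t for t in tokens]
        tokens = [t for t in tokens if t.isalnum()]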

    # ── Step 5: Remove stopwords ─────────────────────────
    if remove_stops:
        stops = set(stopwords.words('english'))
        if keep_negations:
            stops -= {'not', 'no', 'never', 'nor', 'neither'}
        if custom_stops:
            stops |= set(custom_stops)
        tokens = [t for t in tokens if t not in stops]

    # ── Step 6: Length filter ─────────────────────────────
    tokens = [t for t in tokens if len(t) >= min_len]

    return tokens


# ── Test it ───────────────────────────────────────────────
sample = """
    I absolutely CANNOT believe the film was not good at all!!!
    Visit https://example.com for more info.
    The cinematography <b>was</b> breathtaking, but story failed.
"""

tokens = preprocess_text(sample, keep_negations=True)
print("Clean tokens:", tokens)
print(f"Count: {len(tokens)}")
OUTPUT
Clean tokens: ['absolutely', 'not', 'believe', 'film', 'not', 'good', 'visit', 'info', 'cinematography', 'breathtaking', 'story', 'failed']
Count: 12
What the Pipeline Achieved

Starting from raw text containing a URL, HTML tags, punctuation, and stopwords, the pipeline produced 12 clean content tokens while preserving both "not" negations that carry the sentiment signal (the first comes from word_tokenize splitting "cannot" into "can" + "not"). Leftovers such as "visit" and "info" are not on NLTK's list, which shows why passing custom_stops is often worthwhile.


Section 07

spaCy — The Industry Standard Pipeline

While NLTK is great for learning, spaCy is widely used in production NLP systems. It is fast and handles tokenization and stopword detection in one pass, with built-in awareness of abbreviations, contractions, and special cases.

import spacy

# Load the English model (run: python -m spacy download en_core_web_sm)
nlp = spacy.load('en_core_web_sm')

text = "Dr. Smith won't attend the U.N. conference in New York City."

# Process — one call does tokenization, POS, NER, stopword detection
doc = nlp(text)

print(f"{'Token':15} {'Lemma':15} {'Stop?':8} {'Punct?':8} {'POS'}")
print("-" * 60)
for token in doc:
    print(f"{token.text:15} {token.lemma_:15} {str(token.is_stop):8} {str(token.is_punct):8} {token.pos_}")

# Clean tokens in one line
clean = [t.lemma_ for t in doc
         if not t.is_stop and not t.is_punct and not t.is_space]
print("\nClean lemmatised tokens:", clean)
OUTPUT (illustrative; exact lemmas, tags, and stop flags vary with the spaCy and model version)
Token           Lemma           Stop?    Punct?   POS
------------------------------------------------------------
Dr.             Dr.             False    False    PROPN
Smith           Smith           False    False    PROPN
wo              will            False    False    AUX
n't             not             True     False    PART
attend          attend          False    False    VERB
the             the             True     False    DET
U.N.            U.N.            False    False    PROPN
conference      conference      False    False    NOUN
in              in              True     False    ADP
New             New             False    False    PROPN
York            York            False    False    PROPN
City            City            False    False    PROPN
.               .               False    True     PUNCT

Clean lemmatised tokens: ['Dr.', 'Smith', 'will', 'attend', 'U.N.', 'conference', 'New', 'York', 'City']
spaCy's Hidden Superpower: Lemmatisation

spaCy lemmatises as it tokenizes: "won't" is split into "wo" + "n't" and lemmatised to "will" + "not", which NLTK's word_tokenize alone does not do. One caveat: "n't" is on spaCy's default stop list, so the one-line filter above drops the negation. For sentiment work, keep tokens whose lemma is "not", for example by adding `or t.lemma_ == "not"` to the condition.


Section 08

Visual Summary — Choosing Your Strategy

🗺️  Preprocessing Strategy Decision Map
🎯 What is your NLP Task?

Classification / Topic Modelling: ✅ Full Pipeline
    Lowercase + Tokenize + Stopwords + Punct
    NLTK or spaCy, add lemmatisation

Sentiment Analysis: ⚠️ Partial Pipeline
    Keep negations & sentiment punctuation (!?)
    Custom stoplist that excludes "not", "no", "never"

Transformer / LLM: 🚫 Minimal Processing
    No stopword removal, no heavy stripping
    Use built-in subword tokenizer only (BPE/WordPiece)
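The three branches of the decision map can be captured as configuration presets, so the choice of task selects the preprocessing behaviour. This is a sketch under stated assumptions: `PreprocessConfig`, `PRESETS`, the regex tokenizer, and the toy stopword set are all hypothetical names and stand-ins. For the transformer branch, in practice you would hand the raw string to the model's own tokenizer; the regex tokens here only show that nothing is stripped.

```python
import re
from dataclasses import dataclass

@dataclass(frozen=True)
class PreprocessConfig:
    lowercase: bool = True
    remove_stops: bool = True
    remove_punct: bool = True
    keep_negations: bool = False

# Hypothetical presets mirroring the three branches of the decision map.
PRESETS = {
    "topic_modelling": PreprocessConfig(),
    "sentiment": PreprocessConfig(remove_punct=False, keep_negations=True),
    "transformer": PreprocessConfig(
        lowercase=False, remove_stops=False, remove_punct=False
    ),
}

# Toy stopword set for illustration; real lists are much longer.
TOY_STOPWORDS = {"the", "was", "is", "a", "not", "no", "at", "all"}
NEGATIONS = {"not", "no", "never"}

def preprocess(text, config):
    """Apply the steps enabled in `config` and return a token list."""
    if config.lowercase:
        text = text.lower()
    tokens = re.findall(r"\w+|[^\w\s]", text)
    if config.remove_punct:
        tokens = [t for t in tokens if t.isalnum()]
    if config.remove_stops:
        stops = TOY_STOPWORDS - NEGATIONS if config.keep_negations else TOY_STOPWORDS
        tokens = [t for t in tokens if t not in stops]
    return tokens

sample = "The film was not good at all!"
results = {task: preprocess(sample, cfg) for task, cfg in PRESETS.items()}

for task, tokens in results.items():
    print(f"{task:16} {tokens}")
```

Keeping the options in one frozen dataclass also gives you a single object to log alongside each experiment.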
🌿 Text Preprocessing — Non-Negotiable Rules
1. Always tokenize before removing punctuation. Removing punctuation before tokenization turns "can't" into "cant", a real but wrong word. Tokenize first, then decide what stays.
2. Decide on stopwords per task, not globally. Negations such as "not" and "no" appear on standard stopword lists but are critical for sentiment. Always review your stopword list against your task objective.
3. Do not preprocess BERT/GPT inputs heavily. These models have their own tokenizers trained on raw text. Aggressive preprocessing shifts the input away from what the model saw during pre-training and can degrade performance.
4. Preserve case when it carries meaning. For NER tasks, "Apple" (company) vs "apple" (fruit) is critical. Lowercase only after entity extraction, or use spaCy's entity-aware pipeline.
5. Document every preprocessing choice. Preprocessing decisions are as important as model choices. A model trained on aggressively stripped text cannot be reproduced without knowing exactly what was stripped.
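One lightweight way to follow the last rule is to store the preprocessing options as data and log a short fingerprint with every trained model. The sketch below uses only the standard library; the option names in `preprocess_config` and the `config_fingerprint` helper are hypothetical and should be adapted to whatever your pipeline actually exposes.

```python
import hashlib
import json

# Hypothetical record of the choices made for one experiment.
preprocess_config = {
    "lowercase": True,
    "tokenizer": "nltk.word_tokenize",
    "remove_punct": True,
    "remove_stops": True,
    "keep_negations": ["not", "no", "never"],
    "custom_stops": ["said", "company", "market"],
    "min_len": 2,
}

def config_fingerprint(config):
    """Stable short hash of a config dict (key order does not matter)."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]

fingerprint = config_fingerprint(preprocess_config)

# Save or log both the readable config and its fingerprint.
print(json.dumps(preprocess_config, indent=2, sort_keys=True))
print("fingerprint:", fingerprint)
```

Two runs with the same fingerprint used the same preprocessing; any change to an option, however small, produces a different one.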