Tokenization, Stopwords & Punctuation in Python

Section 01

The Story That Explains Text Preprocessing

📖 Real World Analogy

The Library Cataloguer's Secret

Imagine a new library assistant given a pile of 10,000 handwritten book reviews. They are told: "Find all the reviews about cooking."

The assistant opens the first box and finds chaos. Some reviews say "Cooking", others say "COOKING", "cookng", "cook!ng", "I love cooking!!!", and "The art of — cooking — is magnificent." All mean the same thing, but a simple word-search would miss half of them.

The wise assistant doesn't panic. They first clean the text (remove the noise), then standardise it (make it consistent), and finally chop it into pieces (tokens) that a machine can compare quickly. Only then do they hand it to the computer.

That is Text Preprocessing — the art of turning raw, messy human language into a clean, structured form a machine can actually learn from.

In Natural Language Processing (NLP), the quality of your preprocessing often matters more than the choice of model. Garbage in, garbage out — but clean text in, clean predictions out. This tutorial covers the three pillars of text preprocessing: Tokenization, Stopword Removal, and Punctuation Handling.

🧠

Why Preprocessing is Step Zero

Raw text is the messiest data type in machine learning. Unlike numeric features where a value is just a number, the same meaning can appear in hundreds of linguistic forms. Preprocessing unifies them so your model sees meaning, not noise.

Section 02

The Preprocessing Pipeline at a Glance

Every NLP project follows a similar cleaning sequence. Think of it as an assembly line — raw text enters one end, clean tokens exit the other.

⚙️ NLP Text Preprocessing Pipeline

📥

Raw Text Input

Unstructured text from any source — emails, tweets, articles, reviews

"Hello, World!! I LOVE NLP!!!"

🔡

Lowercasing & Normalisation

Unify character case; collapse extra whitespace

"hello, world!! i love nlp!!!"

✂️

Tokenization

Split text into individual tokens (words or subwords)

hello world i love nlp

🧹

Punctuation & Noise Removal

Strip punctuation, special characters, and HTML tags

hello world i love nlp (removed: , !!)

🚫

Stopword Removal

Remove high-frequency, low-meaning words (the, is, I, a, …)

hello world love nlp (removed: i)

✅

Clean Tokens — Model Ready

Meaningful tokens, ready to be vectorised or fed into a model

hello world love nlp
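The stages above can be sketched with the standard library alone. This is a minimal illustration rather than the NLTK pipeline built later in the tutorial: the function name toy_pipeline and the five-word TOY_STOPWORDS set are hypothetical, and the regex is a stand-in for a real tokenizer.

```python
import re

# Hypothetical mini stopword list, just large enough for this example.
TOY_STOPWORDS = {"i", "the", "is", "a", "an"}

def toy_pipeline(raw):
    """Run the pipeline stages on one string and return the surviving tokens."""
    # Lowercase and collapse runs of whitespace.
    lowered = " ".join(raw.lower().split())
    # Tokenize into words or runs of punctuation.
    tokens = re.findall(r"\w+|[^\w\s]+", lowered)
    # Drop punctuation-only tokens.
    words = [t for t in tokens if t.isalnum()]
    # Drop stopwords.
    return [t for t in words if t not in TOY_STOPWORDS]

print(toy_pipeline("Hello, World!! I LOVE NLP!!!"))
# prints ['hello', 'world', 'love', 'nlp']
```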

Section 03

Tokenization — Cutting Language into Pieces

Tokenization is the process of splitting a string of text into smaller units called tokens. A token can be a word, a sentence, a subword fragment, or even a single character. The choice of tokenizer fundamentally shapes what your model can learn.

📖 Story

The Bread-Slicing Machine

Imagine a long baguette of text. A word tokenizer slices it at every space and punctuation mark, so each slice is one word. A sentence tokenizer slices at sentence boundaries, so each slice is one sentence. A subword tokenizer has finer blades: the word "unhappiness" might become ["un", "happi", "ness"], so the model can reuse what it learns about "un" and "ness" in other words such as "unhappy" and "kindness".

Choosing the wrong slicer is like slicing a baguette with a cleaver — you can do it, but you'll lose a lot of meaning in the crumbs.

The Three Types of Tokenization

🔤

Word

Word Tokenization

Splits on whitespace or punctuation. Simple and fast, but it loses morphological information, and out-of-vocabulary (OOV) words are a major problem.

📝

Sentence

Sentence Tokenization

Splits on sentence boundaries (. ! ?). Used in summarisation, translation, and text classification by sentence.

🧩

Subword

Subword Tokenization

Splits words into frequent fragments learned from data, using BPE, WordPiece, or the SentencePiece toolkit. Used in BERT, GPT, and most modern LLMs.

🔍 Live Tokenization Comparison — "unhappiness is not impossible"

Word

unhappiness is not impossible

Sentence

"unhappiness is not impossible" (one full sentence = one token here)

Subword
(WordPiece / BERT)

un ##happi ##ness is not im ##possible

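💡 The ## prefix (WordPiece convention) marks a subword that continues from the previous token. Notice how "un" and "im" (common prefixes) become tokens of their own that can be shared across many different words. This sharing is the power of subword tokenization. The split shown here is illustrative; the exact pieces depend on the vocabulary the tokenizer has learned.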
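The three slicing styles in the comparison can be imitated in a few lines. This is a sketch under stated assumptions: TOY_VOCAB is a hypothetical vocabulary chosen so the example reproduces the split shown above, and subword_tokens uses greedy longest-match-first lookup in the style of WordPiece inference, not the training procedure of any real tokenizer.

```python
import re

# Hypothetical vocabulary; a real model learns tens of thousands of pieces.
TOY_VOCAB = {"un", "##happi", "##ness", "is", "not", "im", "##possible"}

def word_tokens(text):
    """Split on whitespace."""
    return text.split()

def sentence_tokens(text):
    """Naive split after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def subword_tokens(word, vocab=TOY_VOCAB):
    """Greedy longest-match-first split; continuation pieces get a ## prefix."""
    pieces, start = [], 0
    while start < len(word):
        end, match = len(word), None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return ["[UNK]"]  # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

text = "unhappiness is not impossible"
print(word_tokens(text))
print(sentence_tokens(text))
print([p for w in word_tokens(text) for p in subword_tokens(w)])
# last line prints ['un', '##happi', '##ness', 'is', 'not', 'im', '##possible']
```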

Word Tokenization in Python

import nltk
from nltk.tokenize import word_tokenize, sent_tokenize

# Download required data (run once)
nltk.download('punkt')
nltk.download('punkt_tab')

text = "Dr. Smith went to Washington. He couldn't believe it!"

# Word tokenization
words = word_tokenize(text)
print("Word tokens:", words)

# Sentence tokenization
sentences = sent_tokenize(text)
print("Sentence tokens:", sentences)

OUTPUT

Word tokens: ['Dr.', 'Smith', 'went', 'to', 'Washington', '.', 'He', 'could', "n't", 'believe', 'it', '!']
Sentence tokens: ['Dr. Smith went to Washington.', "He couldn't believe it!"]

💡

Notice the Smart Handling of "Dr."

NLTK's sent_tokenize correctly treats "Dr." as an abbreviation (not a sentence boundary), keeping the first sentence intact. It relies on the pre-trained Punkt model, which has learned common abbreviations from data, and this is why a dedicated sentence tokenizer beats a simple split('.').
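To see what goes wrong without abbreviation handling, compare a naive split with a splitter that consults an abbreviation list. This is a simplified stand-in, not how Punkt works internally: the ABBREVIATIONS set and the split_sentences helper are hypothetical, whereas Punkt learns its abbreviations from text.

```python
text = "Dr. Smith went to Washington. He couldn't believe it!"

# Naive approach: every ". " is assumed to end a sentence.
naive = text.split(". ")
print(naive)
# prints ['Dr', 'Smith went to Washington', "He couldn't believe it!"]

# Hypothetical hand-written abbreviation list.
ABBREVIATIONS = {"dr", "mr", "mrs", "ms", "prof"}

def split_sentences(text):
    """End a sentence at . ! or ? unless the word is a known abbreviation."""
    sentences, current = [], []
    for word in text.split():
        current.append(word)
        ends_sentence = word[-1] in ".!?"
        is_abbreviation = word[:-1].lower() in ABBREVIATIONS
        if ends_sentence and not is_abbreviation:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing words without closing punctuation
        sentences.append(" ".join(current))
    return sentences

print(split_sentences(text))
# prints ['Dr. Smith went to Washington.', "He couldn't believe it!"]
```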

Subword Tokenization with Hugging Face

from transformers import AutoTokenizer

# BERT uses WordPiece subword tokenization
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

text = "unhappiness is not impossible"
tokens = tokenizer.tokenize(text)
token_ids = tokenizer.encode(text)

print("Subword tokens:", tokens)
print("Token IDs:     ", token_ids)
print(f"Vocabulary size: {tokenizer.vocab_size:,}")

OUTPUT

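Subword tokens: ['un', '##happi', '##ness', 'is', 'not', 'im', '##possible']
Token IDs:       [101, 4895, 17662, 8430, 2003, 2025, 4895, 6304, 102]
Vocabulary size: 30,522

Treat the splits and IDs above as illustrative: the exact pieces depend on the model's learned vocabulary, and in real output distinct tokens such as 'un' and 'im' receive different IDs, so run the snippet to see the actual values. The IDs 101 and 102 at either end are the [CLS] and [SEP] special tokens that encode() adds, which is why there are two more IDs than tokens.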

Method	Algorithm	Used By	OOV Problem?	Best For
Whitespace Split	str.split()	Toy examples	Very bad	Nothing serious
NLTK Word	Punkt + rules	Classic NLP	Moderate	Bag-of-words, TF-IDF
spaCy	Rule + model	Production NLP	Moderate	NER, dependency parsing
BPE	Byte Pair Encoding	GPT, RoBERTa	None	Generation tasks
WordPiece	Likelihood-based	BERT, DistilBERT	Rare ([UNK] fallback)	Classification, NER
SentencePiece	Unigram/BPE	T5, XLM-R	Rare	Multilingual tasks
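The "OOV Problem?" column can be made concrete with a toy experiment. The sketch below assumes a two-sentence training set and hypothetical helper names (build_vocab, encode_words, encode_chars); it contrasts a word-level vocabulary, where an unseen word becomes an unknown token, with a character-level one, where the same word is still representable. Subword vocabularies sit between these two extremes.

```python
def build_vocab(sentences):
    """Collect every lowercase word seen in training."""
    return {w for s in sentences for w in s.lower().split()}

def encode_words(sentence, vocab):
    """Word-level encoding: unseen words collapse to <unk>."""
    return [w if w in vocab else "<unk>" for w in sentence.lower().split()]

def encode_chars(sentence, alphabet):
    """Character-level encoding: only unseen characters collapse to <unk>."""
    return [c if c in alphabet else "<unk>"
            for c in sentence.lower() if not c.isspace()]

train = ["the film was good", "the story was weak"]
vocab = build_vocab(train)
alphabet = {c for s in train for c in s if not c.isspace()}

unseen = "the remake was good"   # "remake" never appears in training
print(encode_words(unseen, vocab))
# prints ['the', '<unk>', 'was', 'good']
print("<unk>" in encode_chars(unseen, alphabet))
# prints False
```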

Section 04

Stopword Removal — Cutting the Noise

Stopwords are the most common words in a language, such as the, is, at, which, on, and, a, an. They appear constantly but carry almost no discriminative meaning for most NLP tasks. Removing them reduces noise and can cut the number of tokens you process substantially, even though the stopword list itself is only a couple of hundred words long.
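A quick frequency count shows why these words are singled out. The three-sentence corpus below is made up for illustration; on real text the effect is the same, with function words crowding the top of the frequency table.

```python
from collections import Counter

# Tiny made-up corpus, for illustration only.
corpus = [
    "the cat sat on the mat",
    "the dog is on the sofa",
    "a cat and a dog sat on the rug",
]

counts = Counter(word for line in corpus for word in line.split())

# The two most frequent words are both function words.
print(counts.most_common(2))
# prints [('the', 5), ('on', 3)]
```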

📖 Story

The Telegram Operator's Trick

In the era of pay-per-word telegrams, senders learned a shorthand: "Arriving London tomorrow" instead of "I am arriving in London tomorrow."

The meaning survives. The words "I", "am", and "in" are padding that your brain fills in automatically. Stopword removal does the same for your NLP model: the remaining content words (arriving, London, tomorrow) carry the signal.

🚫 Stopword Filter: Before and After

Input: "The quick brown fox jumps over the lazy dog near a river"

After Stopword Removal (NLTK English list):

quick brown fox jumps lazy dog near river

Removed stopwords: the, over, the, a ("near" stays because it is not on NLTK's list)

12 tokens → 8 tokens | 33% reduction

Stopword Removal in Python

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('stopwords')
nltk.download('punkt')

text = "The quick brown fox jumps over the lazy dog near a river"

# Load English stopwords (179 words here; the count may vary with the NLTK data version)
stop_words = set(stopwords.words('english'))
print(f"Stopwords count: {len(stop_words)}")

# Tokenize first, then filter
tokens = word_tokenize(text.lower())
clean  = [t for t in tokens if t not in stop_words]

print(f"Original tokens : {tokens}")
print(f"After filtering : {clean}")
print(f"Reduction       : {len(tokens)} → {len(clean)} tokens")

OUTPUT

Stopwords count: 179
Original tokens : ['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', 'near', 'a', 'river']
After filtering : ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog', 'near', 'river']
Reduction       : 12 → 8 tokens

Custom Stopword Lists

NLTK's default list is generic. Domain-specific text often needs a custom list. For example, in financial news, words like "company", "market", and "said" appear so frequently they lose meaning.

from nltk.corpus import stopwords

# Start with NLTK's defaults
stop_words = set(stopwords.words('english'))

# Add domain-specific stops (financial news context)
custom_stops = {'said', 'company', 'market', 'year', 'percent'}
stop_words.update(custom_stops)

# Or REMOVE words that matter in your domain
# e.g. "not" is a stopword but crucial for sentiment analysis
stop_words.discard('not')
stop_words.discard('no')
stop_words.discard('never')

print(f"Custom stopword list size: {len(stop_words)}")

OUTPUT

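Custom stopword list size: 182

Starting from the 179-word default list, five words are added and two ("not", "no") are removed. "never" is not on NLTK's English list, so discarding it changes nothing.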

⚠️

Don't Remove Stopwords Blindly

For sentiment analysis, removing "not" is catastrophic — "not good" becomes "good". For question answering, removing "what", "who", "where" destroys the question's intent. Always tailor your stopword strategy to the downstream task.
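The "not good" failure is easy to reproduce. The sketch below uses a hypothetical six-word BASIC_STOPS set and a hypothetical remove_stops helper rather than NLTK's list, so it runs without any downloads.

```python
# Hypothetical mini stopword list that, like NLTK's, contains negations.
BASIC_STOPS = {"the", "was", "is", "a", "not", "no"}
NEGATIONS = {"not", "no", "never"}

def remove_stops(text, keep_negations=False):
    """Filter stopwords, optionally exempting negation words."""
    stops = BASIC_STOPS - NEGATIONS if keep_negations else BASIC_STOPS
    return [w for w in text.lower().split() if w not in stops]

review = "The film was not good"
print(remove_stops(review))
# prints ['film', 'good']
print(remove_stops(review, keep_negations=True))
# prints ['film', 'not', 'good']
```

With the default filter the review reads as positive; exempting the negations keeps the sentiment intact.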

NLP Task	Remove Stopwords?	Reason
Topic Modelling (LDA)	✅ Yes	Function words dominate topics unfairly
TF-IDF / Bag of Words	✅ Yes	Stopwords dominate raw counts and add low-signal features (IDF only down-weights them)
Information Retrieval	✅ Yes	Reduces index size with minimal precision loss
Sentiment Analysis	⚠️ Partial	Keep negations ("not", "never", "no")
Named Entity Recognition	❌ No	Stopwords provide context for entity boundaries
Machine Translation	❌ No	Grammar depends on function words
BERT / Transformer models	❌ No	Pre-trained on full sentences; function words carry context the model relies on
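The TF-IDF row can be checked with a small calculation. The sketch below uses the plain textbook formula idf = log(N / df) on a made-up four-document corpus; libraries usually apply a smoothed variant, so their numbers will differ. The idf helper is hypothetical and assumes the term occurs in at least one document.

```python
import math

# Made-up corpus, for illustration only.
docs = [
    "the market fell on the news",
    "the team won the match",
    "the recipe needs fresh basil",
    "the market rallied after the vote",
]

def idf(term, documents):
    """Unsmoothed inverse document frequency: log(N / document frequency)."""
    n_containing = sum(term in doc.split() for doc in documents)
    return math.log(len(documents) / n_containing)

for term in ("the", "market", "basil"):
    print(f"{term:8} {idf(term, docs):.3f}")
# prints:
# the      0.000
# market   0.693
# basil    1.386
```

A word that appears in every document gets an IDF of zero, so it contributes nothing to the weighted score, yet it still occupies a column in the feature matrix unless it is removed.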

Section 05

Punctuation Handling — Signal or Noise?

Punctuation is the most contextual element of text preprocessing. Sometimes it is close to pure noise ("Hello!!!", where the exclamation marks add nothing to a topic model). Sometimes it is the most important signal ("Buy now." vs "Buy now?", where the question mark changes the whole intent).

📖 Story

The Court Transcription Dilemma

A court reporter was asked to transcribe: "Execute, not pardon him." and "Execute not, pardon him." — two completely opposite commands, separated only by a comma.

For most NLP tasks, that comma is noise. For legal text analysis, it is life or death. Context decides. Always decide before you delete.

🔣 Punctuation — Signal vs Noise Classification

Exclamation

Task-Dependent

Noise for topic models; signal for sentiment (excitement/anger)

Question Mark

Signal

Intent detection — marks a question vs statement

Apostrophe

Signal

"can't" ≠ "cant"; Preserve before tokenizing contractions

Comma

Usually Noise

Rarely meaningful post-tokenization for most tasks

Mention

Task-Dependent

Noise for topic models; entity signal for social media NLP

Hashtag

Often Signal

Carries strong topical and sentiment information

Hyphen

Task-Dependent

"state-of-the-art" — split or keep? Depends on vocabulary design

...

Ellipsis

Task-Dependent

Can signal hesitation or trailing thought in sentiment analysis
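When punctuation is signal for your task, one option is to record it as features before stripping it. The sketch below is one possible shape for that step; the function name punctuation_features and the feature keys are hypothetical.

```python
import re

def punctuation_features(text):
    """Capture punctuation-based signals before any cleaning removes them."""
    return {
        "exclamations": text.count("!"),
        "is_question": text.rstrip().endswith("?"),
        "has_ellipsis": "..." in text,
        "hashtags": re.findall(r"#\w+", text),
        "mentions": re.findall(r"@\w+", text),
    }

features = punctuation_features("Wait... is #NLP really this fun, @OpenAI?")
print(features)
```

The cleaned tokens and the feature dictionary can then be passed to the model side by side.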

Punctuation Removal — Three Approaches

import re
import string
from nltk.tokenize import word_tokenize

text = "Hello, World!!! I can't believe it's 2025... #NLP @OpenAI"

# ── APPROACH 1: Python string.punctuation (blunt) ────────
# Removes ALL punctuation — including apostrophes!
no_punct_1 = text.translate(str.maketrans('', '', string.punctuation))
print("Blunt:", no_punct_1)

# ── APPROACH 2: Regex (surgical) ─────────────────────────
# Keep apostrophes inside words; remove all else
no_punct_2 = re.sub(r"[^\w\s']", '', text)
print("Surgical:", no_punct_2)

# ── APPROACH 3: Post-tokenization filter (safest) ────────
# Tokenize first, then drop pure-punctuation tokens
tokens      = word_tokenize(text)
clean_toks  = [t for t in tokens
               if t.isalnum() or "'" in t]
print("Post-tokenize:", clean_toks)

OUTPUT

Blunt: Hello World I cant believe its 2025 NLP OpenAI
Surgical: Hello World I can't believe it's 2025 NLP OpenAI
Post-tokenize: ['Hello', 'World', 'I', 'ca', "n't", 'believe', 'it', "'s", '2025', 'NLP', 'OpenAI']

🔑

The Golden Rule of Punctuation Handling

Always tokenize before removing punctuation. The blunt approach turns "can't" into "cant" — a real but completely different word. Tokenize first (NLTK correctly splits it as ca + n't), then decide what to keep.
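If you would rather keep contractions, hashtags, and mentions as single tokens, one alternative is to let the tokenizer pattern itself decide what counts as a token. The sketch below uses only the re module; TOKEN_PATTERN and tokenize_keep_signals are hypothetical names, and the pattern is deliberately simple (it does not handle URLs or email addresses).

```python
import re

# An optional # or @ prefix, a run of word characters,
# and an optional apostrophe suffix such as 't or 's.
TOKEN_PATTERN = re.compile(r"[#@]?\w+(?:'\w+)?")

def tokenize_keep_signals(text):
    """Return word-like tokens; pure punctuation is never emitted."""
    return TOKEN_PATTERN.findall(text)

text = "Hello, World!!! I can't believe it's 2025... #NLP @OpenAI"
print(tokenize_keep_signals(text))
# prints ['Hello', 'World', 'I', "can't", 'believe', "it's", '2025', '#NLP', '@OpenAI']
```

Because punctuation is dropped during tokenization rather than before it, "can't" is never reduced to "cant".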

Section 06

The Complete Preprocessing Pipeline

Here is a reusable preprocessing function that chains all three steps together, with sensible defaults and task-specific options.

import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

nltk.download('punkt', quiet=True)
nltk.download('stopwords', quiet=True)
nltk.download('punkt_tab', quiet=True)

def preprocess_text(
    text,
    lowercase     = True,
    remove_stops  = True,
    remove_punct  = True,
    keep_negations= True,   # keep "not", "no", "never"
    custom_stops  = None,   # extra words to remove
    min_len       = 2        # discard tokens shorter than this
):
    """Full text preprocessing pipeline.
    Returns a list of clean tokens.
    """
    # ── Step 1: Lowercase ────────────────────────────────
    if lowercase:
        text = text.lower()

    # ── Step 2: Remove URLs and HTML tags ────────────────
    text = re.sub(r'http\S+|www\S+', '', text)
    text = re.sub(r'<[^>]+>', '', text)

    # ── Step 3: Tokenize ─────────────────────────────────
    tokens = word_tokenize(text)

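    # ── Step 4: Remove punctuation ───────────────────────
    if remove_punct:
        # Map the contraction piece "n't" to "not" first, otherwise
        # isalnum() would silently drop negations such as "couldn't"
        tokens = ["not" if t == "n't" else t for t in tokens]
        tokens = [t for t in tokens if t.isalnum()]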

    # ── Step 5: Remove stopwords ─────────────────────────
    if remove_stops:
        stops = set(stopwords.words('english'))
        if keep_negations:
            stops -= {'not', 'no', 'never', 'nor', 'neither'}
        if custom_stops:
            stops |= set(custom_stops)
        tokens = [t for t in tokens if t not in stops]

    # ── Step 6: Length filter ─────────────────────────────
    tokens = [t for t in tokens if len(t) >= min_len]

    return tokens


# ── Test it ───────────────────────────────────────────────
sample = """
    I absolutely CANNOT believe the film was not good at all!!!
    Visit https://example.com for more info.
    The cinematography <b>was</b> breathtaking, but story failed.
"""

tokens = preprocess_text(sample, keep_negations=True)
print("Clean tokens:", tokens)
print(f"Count: {len(tokens)}")

OUTPUT

Clean tokens: ['absolutely', 'not', 'believe', 'film', 'not', 'good', 'visit', 'info', 'cinematography', 'breathtaking', 'story', 'failed']
Count: 12

✅

What the Pipeline Achieved

Starting from a raw sample containing a URL, HTML tags, punctuation, and stopwords, the pipeline produced 12 clean content tokens while preserving the two not negations that carry the sentiment signal. The first not comes from "cannot", which word_tokenize splits into can + not. "visit" and "info" survive because they are not stopwords; pass them through custom_stops if they are noise in your domain.

Section 07

spaCy — The Industry Standard Pipeline

While NLTK is great for learning, spaCy is widely used in production NLP systems. It is fast, accurate, and handles tokenization and stopword detection in one pass, with built-in awareness of abbreviations, contractions, and special cases.

import spacy

# Load the English model (run: python -m spacy download en_core_web_sm)
nlp = spacy.load('en_core_web_sm')

text = "Dr. Smith won't attend the U.N. conference in New York City."

# Process — one call does tokenization, POS, NER, stopword detection
doc = nlp(text)

print(f"{'Token':15} {'Lemma':15} {'Stop?':8} {'Punct?':8} {'POS'}")
print("-" * 60)
for token in doc:
    print(f"{token.text:15} {token.lemma_:15} {str(token.is_stop):8} {str(token.is_punct):8} {token.pos_}")

# Clean tokens in one line
clean = [t.lemma_ for t in doc
         if not t.is_stop and not t.is_punct and not t.is_space]
print("\nClean lemmatised tokens:", clean)

OUTPUT

Token           Lemma           Stop?    Punct?   POS
------------------------------------------------------------
Dr.             Dr.             False    False    PROPN
Smith           Smith           False    False    PROPN
wo              will            True     False    AUX
n't             not             True     False    PART
attend          attend          False    False    VERB
the             the             True     False    DET
U.N.            U.N.            False    False    PROPN
conference      conference      False    False    NOUN
in              in              True     False    ADP
New             New             False    False    PROPN
York            York            False    False    PROPN
City            City            False    False    PROPN
.               .               False    True     PUNCT

Clean lemmatised tokens: ['Dr.', 'Smith', 'attend', 'U.N.', 'conference', 'New', 'York', 'City']

Exact tags, lemmas, and stop flags can vary with the spaCy and model version, so treat this table as illustrative.

⭐

spaCy's Hidden Superpower — Lemmatisation

Notice that spaCy splits "won't" into wo + n't and lemmatises the pieces to will and not, recovering the full words hidden inside the contraction. NLTK's word_tokenize makes the same split but leaves the pieces as wo and n't. One caveat: spaCy's default stop list includes negation forms such as n't and not, so the one-line filter above drops the negation entirely. For sentiment tasks, exempt those tokens from the is_stop check.

Section 08

Visual Summary — Choosing Your Strategy

🗺️ Preprocessing Strategy Decision Map

🎯 What is your NLP Task?

↓

Classification / Topic Modelling

↓

✅ Full Pipeline
Lowercase + Tokenize + Stopwords + Punct

NLTK or spaCy, add lemmatisation

Sentiment Analysis

↓

⚠️ Partial Pipeline
Keep negations & sentiment punctuation (!?)

Custom stoplist — exclude "not", "no", "never"

Transformer / LLM

↓

🚫 Minimal Processing
No stopwords, no heavy stripping

Use built-in subword tokenizer only (BPE/WordPiece)
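The decision map can also be written down as a lookup table in code, so that the choice of strategy is explicit and testable. The sketch below is one possible encoding: PreprocessConfig, STRATEGIES, config_for, and the task keys are all hypothetical names, and the flag values simply restate the three branches above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PreprocessConfig:
    lowercase: bool
    remove_stops: bool
    remove_punct: bool
    keep_negations: bool

# One entry per branch of the decision map.
STRATEGIES = {
    "classification": PreprocessConfig(True, True, True, False),
    "topic_modelling": PreprocessConfig(True, True, True, False),
    "sentiment": PreprocessConfig(
        lowercase=True, remove_stops=True,
        remove_punct=False, keep_negations=True,
    ),
    # Leave the text alone; the model's own subword tokenizer does the work.
    "transformer": PreprocessConfig(False, False, False, True),
}

def config_for(task):
    """Look up the preprocessing flags for a task name."""
    try:
        return STRATEGIES[task]
    except KeyError:
        raise ValueError(f"No preprocessing strategy defined for {task!r}") from None

print(config_for("sentiment"))
```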

🌿 Text Preprocessing — Non-Negotiable Rules

Always tokenize before removing punctuation. Removing punctuation before tokenization turns "can't" into "cant" — a real but wrong word. Tokenize first, then decide what stays.

Decide on stopwords per task, not globally. Negations such as "not" and "no" appear on standard stopword lists but are critical for sentiment. Always review your stopword list relative to your task objective.

Do not preprocess BERT/GPT inputs heavily. These models have their own tokenizers and were pre-trained on largely raw text. Aggressive preprocessing produces inputs unlike anything they saw during training and can degrade performance.

Preserve case information before lowercasing when needed. For NER tasks, "Apple" (company) vs "apple" (fruit) is critical. Lowercase only after entity extraction or use spaCy's entity-aware pipeline.

Document every preprocessing choice. Preprocessing decisions are as important as model choices. A model trained on aggressively stripped text cannot be reproduced without knowing exactly what was stripped.
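One lightweight way to document preprocessing choices is to store them as a canonical JSON string with a short fingerprint next to the trained model. The sketch below assumes the settings fit in a plain dictionary; describe_preprocessing and the example keys are hypothetical.

```python
import hashlib
import json

def describe_preprocessing(config):
    """Return a canonical JSON string and a short fingerprint for a config dict."""
    # sort_keys makes the string independent of the order keys were written in.
    canonical = json.dumps(config, sort_keys=True)
    fingerprint = hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]
    return canonical, fingerprint

settings = {
    "lowercase": True,
    "remove_stops": True,
    "keep_negations": True,
    "custom_stops": ["said", "company"],
    "min_len": 2,
}

canonical, fingerprint = describe_preprocessing(settings)
print(canonical)
print("fingerprint:", fingerprint)
```

Two runs with the same settings produce the same fingerprint, so a mismatch signals that the text was cleaned differently.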