NLP Text Normalization & Sentence Segmentation

Section 01

The Story That Explains Text Normalization

📖 Real World Analogy

The New Librarian and the Messy Catalogue

Imagine a librarian who has just taken over a vast archive. Some books are labelled "USA", others "U.S.A.", some "united states", and a few "United States of America". When a visitor asks for "books about the US", the librarian has to know all those forms mean the same thing, or she'll miss most of the collection.

That is exactly what an NLP model faces with raw human text. People type in capitals or lowercase, use contractions like "won't" or "I'm", scatter punctuation everywhere, and mix emoji with prose. Before any analysis, the text must be normalized: turned into a clean, consistent form the model can reason about.

Text Normalization is the collection of preprocessing steps that convert raw, noisy human language into a standardized representation. It is not about losing meaning — it's about removing irrelevant variation so that identical concepts map to identical tokens. Without it, "Hello", "hello", "HELLO" look like three completely different words to a model.
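To make the effect concrete, here is a minimal sketch that counts distinct vocabulary entries before and after case normalization. The helper name vocabulary_size and the sample tokens are illustrative assumptions, not part of any library.

```python
def vocabulary_size(tokens: list[str]) -> int:
    """Number of distinct strings, i.e. what a naive vocabulary would store."""
    return len(set(tokens))

tokens = ["Hello", "hello", "HELLO"]

# Distinct raw forms versus distinct forms after lowercasing
print(vocabulary_size(tokens))
print(vocabulary_size([t.lower() for t in tokens]))
```

The first count reflects three unrelated entries; after lowercasing they collapse into one.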

🌍

Why It Comes First in Every NLP Pipeline

Every downstream NLP task — sentiment analysis, machine translation, named entity recognition, topic modelling — depends on the quality of its input text. Garbage in, garbage out. Normalization is the sanitation layer that makes every subsequent step work correctly. It is not glamorous, but skipping it is the fastest way to ruin a model.

Section 02

The Full NLP Preprocessing Pipeline

Before diving into each technique, here is where text normalization and sentence segmentation sit in the broader NLP pipeline. Every stage feeds the next — order matters enormously.

Raw Text Ingestion

Text arrives from a web scrape, user input, PDF extraction, or an API. It may contain HTML tags, encoding errors, mixed scripts, or binary artefacts. This is the "before" state — completely unprocessed.

Text Normalization ← You Are Here (Part 1)

Lowercasing, contraction expansion, special character removal, unicode normalization, whitespace cleanup. Transforms chaotic raw text into clean, consistent strings.

Sentence Segmentation ← You Are Here (Part 2)

Splits the cleaned text into individual sentences. This is harder than it looks — a period can end a sentence, appear in an abbreviation, or mark a decimal number.

Tokenization

Splits each sentence into individual words or subwords (tokens). The unit of processing for most NLP models.

Stop Word Removal / Stemming / Lemmatization

Removes filler words ("the", "is"), collapses word variants ("running" → "run"), normalises morphology ("better" → "good").

Model Input (Embeddings / Vectorization)

Clean, normalized tokens are converted to numerical vectors for machine learning. TF-IDF, Word2Vec, BERT embeddings — all require well-normalized input to work properly.
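The stage order above can be sketched as a chain of small functions. This is a deliberately simplified illustration: the function names, the placeholder splitter, and the three-word stop list are assumptions made for the example, and each stage is treated properly in its own section.

```python
import re

def normalize(text: str) -> str:
    # Lowercase and collapse whitespace (a stand-in for full normalization)
    return re.sub(r"\s+", " ", text.lower()).strip()

def segment(text: str) -> list[str]:
    # Placeholder splitter: break after . ! ? followed by whitespace
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]

def tokenize(sentence: str) -> list[str]:
    # Word tokens made of lowercase letters, digits and apostrophes
    return re.findall(r"[a-z0-9']+", sentence)

STOP_WORDS = {"the", "is", "a"}  # tiny illustrative list

def remove_stop_words(tokens: list[str]) -> list[str]:
    return [t for t in tokens if t not in STOP_WORDS]

def preprocess(raw: str) -> list[list[str]]:
    """Normalize, then segment, then tokenize and filter each sentence."""
    sentences = segment(normalize(raw))
    return [remove_stop_words(tokenize(s)) for s in sentences]

print(preprocess("The  model is FAST. The data is clean!"))
```

Swapping two stages changes the result: tokenizing before segmenting, for instance, would discard the punctuation the splitter depends on.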

Section 03

Technique 1 — Lowercasing

The simplest and most universally applied normalization step. Every character is converted to its lowercase form. This ensures that "Python", "PYTHON", and "python" are treated as the same word.

📖 Story

The Angry Commenter

A sentiment analysis system receives two reviews: "The service was AMAZING!" and "The service was amazing." Without lowercasing, AMAZING and amazing are two separate vocabulary entries with separate word vectors — even though they mean the same thing with the same sentiment. The model has to learn them independently, wasting capacity and data. After lowercasing, both map to the same token and the same learned sentiment signal. One step — enormous downstream benefit.

❌ Before Lowercasing

Raw Token	Vocab Entry
Python	#4521
PYTHON	#8833
python	#1102
Hello	#0091
HELLO	#5577

✔ After Lowercasing

Normalized Token	Vocab Entry
python	#1102
python	#1102
python	#1102
hello	#0091
hello	#0091

import re

# ── 1. Basic lowercasing ─────────────────────────────
text = "The QUICK Brown FOX jumps over THE lazy Dog."
lowered = text.lower()
print(lowered)
# → "the quick brown fox jumps over the lazy dog."

# ── 2. Lowercasing with unicode awareness ────────────
# Python's .lower() handles accented chars correctly
text_fr = "École Nationale SUPÉRIEURE"
print(text_fr.lower())
# → "école nationale supérieure"

# ── 3. When NOT to lowercase ─────────────────────────
# Named Entity Recognition (NER) tasks: "apple" (fruit)
# vs "Apple" (company) — case is semantically meaningful.
# For such tasks, skip or apply case-sensitive models.

def safe_lowercase(text: str, preserve_ner: bool = False) -> str:
    """Lowercase text; optionally skip if NER task."""
    if preserve_ner:
        return text   # keep original case for NER
    return text.lower()

OUTPUT

the quick brown fox jumps over the lazy dog.
école nationale supérieure

⚠️

When Lowercasing Hurts

Do not lowercase blindly for Named Entity Recognition (NER), where "Apple" (company) vs "apple" (fruit) is critical. Similarly, "US" (United States) vs "us" (pronoun) carry very different meaning. Know your task before applying this step.
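One possible middle ground, sketched below, is to lowercase everything except tokens written entirely in capitals, so that acronyms such as "US" survive. The function name and the min_len threshold are assumptions for illustration. Note the trade-off: shouted words like "AMAZING" are preserved as well, so this heuristic suits acronym-heavy text better than social media.

```python
def lowercase_except_acronyms(text: str, min_len: int = 2) -> str:
    """Lowercase each whitespace-separated token unless its letters are all caps."""
    out = []
    for token in text.split():
        letters = [c for c in token if c.isalpha()]
        # A token counts as an acronym if it has enough letters and all are uppercase
        is_acronym = len(letters) >= min_len and all(c.isupper() for c in letters)
        out.append(token if is_acronym else token.lower())
    return " ".join(out)

print(lowercase_except_acronyms("The US Team Told us about NASA."))
```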

Section 04

Technique 2 — Contraction Expansion

Contractions are shortened word combinations held together by apostrophes: "don't", "I'm", "they've". Models treat "don't" and "do not" as completely different strings unless you expand them first. Expansion ensures both forms map to the same concept.

📢

Negative Contractions

can't → cannot, won't → will not

The most common group, and crucial for sentiment analysis: "isn't good" has to carry the same negation signal as "is not good", and a model can only capture that reliably if both forms map to the same tokens.

👤

Subject + Be / Have

I'm → I am, they've → they have

Very frequent in conversational text. Expansion prevents the model from treating "I'm" and "I am" as unrelated vocabulary items.

📄

Modals & Futures

I'll → I will, you'd → you would

Important in dialogue systems and chatbots where intent classification depends on future or conditional intent signals — which contractions obscure.

import re

# ── Contraction dictionary (extend as needed) ─────────
CONTRACTIONS = {
    "won't": "will not",
    "can't": "cannot",
    "don't": "do not",
    "doesn't": "does not",
    "didn't": "did not",
    "isn't": "is not",
    "aren't": "are not",
    "wasn't": "was not",
    "weren't": "were not",
    "hasn't": "has not",
    "haven't": "have not",
    "hadn't": "had not",
    "i'm": "i am",
    "i've": "i have",
    "i'll": "i will",
    "i'd": "i would",
    "you're": "you are",
    "they've": "they have",
    "it's": "it is",
    "that's": "that is",
    "let's": "let us",
}

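def expand_contractions(text: str) -> str:
    """Expand English contractions to their full forms."""
    # Build a single regex from all keys; \b avoids matches inside longer words
    pattern = re.compile(
        r'\b(' + '|'.join(re.escape(k) for k in CONTRACTIONS) + r')\b',
        re.IGNORECASE
    )

    def _expand(m: re.Match) -> str:
        expanded = CONTRACTIONS[m.group(0).lower()]
        # Dictionary values are lowercase, so restore a leading capital ("It's" → "It is")
        if m.group(0)[0].isupper():
            expanded = expanded[0].upper() + expanded[1:]
        return expanded

    return pattern.sub(_expand, text)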

# ── Example ───────────────────────────────────────────
sample = "I can't believe they've won't cooperate. It's ridiculous."
print(expand_contractions(sample))

OUTPUT

I cannot believe they have will not cooperate. It is ridiculous.

💡

Pro Tip — Use a Library for Production

The contractions library (pip install contractions) covers over 300 English contractions including informal variants like "gonna", "wanna", "gotta". For production systems, always use this over a hand-rolled dictionary — edge cases accumulate fast.
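One edge case worth handling whichever dictionary you use: text copied from word processors or the web often contains a typographic apostrophe (U+2019) instead of the ASCII one, so "can’t" would not match a key spelled "can't". The sketch below maps a few look-alike characters to the ASCII apostrophe before expanding. The variant list and the two-entry dictionary are illustrative assumptions.

```python
import re

# Characters often used in place of the ASCII apostrophe (U+0027)
APOSTROPHE_VARIANTS = "\u2019\u2018\u02bc\u0060\u00b4"
_APOSTROPHE_TABLE = str.maketrans({ch: "'" for ch in APOSTROPHE_VARIANTS})

SMALL_CONTRACTIONS = {"can't": "cannot", "it's": "it is"}  # illustrative subset

_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(k) for k in SMALL_CONTRACTIONS) + r")\b",
    re.IGNORECASE,
)

def normalize_apostrophes(text: str) -> str:
    """Map typographic apostrophes and similar marks to the ASCII apostrophe."""
    return text.translate(_APOSTROPHE_TABLE)

def expand(text: str) -> str:
    """Normalize apostrophes first, then expand the known contractions."""
    text = normalize_apostrophes(text)
    return _PATTERN.sub(lambda m: SMALL_CONTRACTIONS[m.group(0).lower()], text)

print(expand("I can\u2019t say it\u2019s easy."))
```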

Section 05

Technique 3 — Special Character & Noise Removal

Raw text is littered with characters that carry no linguistic information for most NLP tasks: HTML tags, URLs, email addresses, hashtags, punctuation storms, emoji, and control characters. Removing them reduces noise and vocabulary size dramatically.

🌐

URL Removal

http://..., www....

URLs are almost always irrelevant to textual meaning and explode vocabulary size. Replace with a [URL] placeholder or remove entirely depending on task.

🖊

HTML Tag Stripping

Web-scraped text is full of markup. Use BeautifulSoup or a regex to strip all HTML tags before any further processing.

Hashtag & Mention Handling

@user, #topic

Social media text. For topic modelling, keep the hashtag word and strip the #. For general text, remove mentions entirely or replace with [USER].

🌟

Emoji Removal or Conversion

😊 → :smile: or removed

Emoji carry sentiment! For sentiment analysis, convert to text descriptions using the emoji library before removing. Blindly stripping loses signal.

Punctuation Removal

! ? . , ; : — " '

Remove all non-alphanumeric characters for bag-of-words and TF-IDF tasks. Preserve punctuation for transformer models (BERT, GPT) — they are trained with it.

␣

Whitespace Normalization

multiple spaces, \t, \n

Collapse all multi-space, tab, and newline sequences to a single space. Strip leading and trailing whitespace. Always the last cleaning step.
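A regex that strips tags leaves HTML entities such as &amp;amp; untouched. If BeautifulSoup is not available, the standard library's html.parser can extract text and decode entities in one pass. The sketch below is a minimal version under that assumption; the class and function names are made up for the example, and the contents of script or style elements are not filtered out.

```python
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; before handle_data
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def strip_html(markup: str) -> str:
    """Return the visible text of `markup` with whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(markup)
    parser.close()
    return " ".join(" ".join(parser.parts).split())

print(strip_html("<p>Fish &amp; chips &lt;3 <b>today</b></p>"))
```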

import re
import unicodedata

def remove_html_tags(text: str) -> str:
    """Strip all HTML/XML tags."""
    return re.sub(r'<[^>]+>', ' ', text)

def remove_urls(text: str, placeholder: str = '') -> str:
    """Remove URLs, optionally replace with placeholder."""
    url_pattern = re.compile(
        r'https?://\S+|www\.\S+', re.IGNORECASE
    )
    return url_pattern.sub(placeholder, text)

def remove_mentions_hashtags(text: str) -> str:
    """Remove @mentions; strip # but keep hashtag word."""
    text = re.sub(r'@\w+', '', text)       # remove @user
    text = re.sub(r'#(\w+)', r'\1', text)   # #NLP → NLP
    return text

def remove_special_characters(text: str,
                               keep_punct: bool = False) -> str:
    """Remove everything except ASCII letters, digits and whitespace.

    Note: accented letters such as "é" are removed too.
    """
    if keep_punct:
        return re.sub(r'[^a-zA-Z0-9\s.,!?\'"-]', '', text)
    return re.sub(r'[^a-zA-Z0-9\s]', '', text)

def normalize_whitespace(text: str) -> str:
    """Collapse multiple spaces, tabs, newlines."""
    return re.sub(r'\s+', ' ', text).strip()

def normalize_unicode(text: str) -> str:
    """Normalize to NFC (canonical composition); diacritics are kept."""
    return unicodedata.normalize('NFC', text)

# ── Full pipeline example ─────────────────────────────
raw = "<p>Check out https://example.com! @JohnDoe loves #NLP 😊  &&  AI.</p>"

clean = remove_html_tags(raw)
clean = remove_urls(clean)
clean = remove_mentions_hashtags(clean)
clean = remove_special_characters(clean)
clean = normalize_whitespace(clean)
clean = clean.lower()
print(clean)

OUTPUT

check out loves nlp ai
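The pipeline above drops the 😊 entirely, which loses sentiment signal. The emoji library's demojize function is the usual tool for converting emoji to text. As a dependency-free approximation, the sketch below replaces characters in the Unicode category "So" (Symbol, other) with their official names. This is an assumption-laden shortcut: it also converts symbols such as ©, and it does not handle multi-code-point emoji such as skin-tone or ZWJ sequences.

```python
import unicodedata

def describe_symbols(text: str) -> str:
    """Replace 'Symbol, other' characters (most emoji) with their Unicode names."""
    out = []
    for ch in text:
        if unicodedata.category(ch) == "So":
            name = unicodedata.name(ch, "")
            # Turn "SMILING FACE" into "smiling_face"; drop unnamed symbols
            out.append(f" {name.lower().replace(' ', '_')} " if name else " ")
        else:
            out.append(ch)
    return " ".join("".join(out).split())

print(describe_symbols("loves #NLP 😊"))
```

Run this step before special-character removal, and allow underscores through that filter if the names should stay intact.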

Section 06

The Normalization Decision Matrix

Not every normalization step applies to every task. The table below tells you exactly what to apply — and what to skip — for each common NLP use case.

Normalization Step	Sentiment Analysis	Named Entity Recognition	Machine Translation	Chatbot / QA	Topic Modelling
Lowercasing	✔ Always	✖ Skip	⚠ Partial	✔ Always	✔ Always
Contraction Expansion	✔ Critical	⚠ Optional	✔ Always	✔ Always	⚠ Optional
URL Removal	✔ Always	✖ Preserve	⚠ Context	✔ Always	✔ Always
Punctuation Removal	⚠ Partial	✖ Keep	✖ Keep	✖ Keep	✔ Remove
Emoji Handling	✔ Convert	✖ Remove	✖ Remove	⚠ Optional	✖ Remove
HTML Stripping	✔ Always	✔ Always	✔ Always	✔ Always	✔ Always

✅

The Golden Rule of Normalization

Apply only what your task actually needs. Aggressive normalization that deletes signal, like removing punctuation before sentiment analysis, harms your model. Lax normalization that leaves noise in place, like keeping HTML tags, harms it too. Let the task define the pipeline, not the other way around.
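One way to keep the matrix close to the code is to store it as data and look up the column for the current task. The sketch below is a hypothetical encoding: the key names and the lowercase labels are assumptions, while the values mirror the table row by row. Labels such as "partial", "optional" and "context" still call for a human decision.

```python
TASKS = ("sentiment", "ner", "translation", "chatbot", "topic")

# One row per normalization step, one entry per task (same order as TASKS)
MATRIX = {
    "lowercasing":           ("always", "skip", "partial", "always", "always"),
    "contraction_expansion": ("critical", "optional", "always", "always", "optional"),
    "url_removal":           ("always", "preserve", "context", "always", "always"),
    "punctuation_removal":   ("partial", "keep", "keep", "keep", "remove"),
    "emoji_handling":        ("convert", "remove", "remove", "optional", "remove"),
    "html_stripping":        ("always",) * 5,
}

def recommendations(task: str) -> dict[str, str]:
    """Return the advice for every normalization step for one task."""
    col = TASKS.index(task)  # raises ValueError for an unknown task
    return {step: row[col] for step, row in MATRIX.items()}

for step, advice in recommendations("ner").items():
    print(f"{step:<22} {advice}")
```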

Section 07

The Full Normalization Pipeline in Code

Here is a configurable text normalization class that chains all techniques together in a fixed order, with every step switchable per task.

import re
import unicodedata
from dataclasses import dataclass

CONTRACTIONS = {
    "won't": "will not", "can't": "cannot",
    "don't": "do not", "doesn't": "does not",
    "didn't": "did not", "isn't": "is not",
    "i'm": "i am", "i've": "i have",
    "i'll": "i will", "it's": "it is",
    "that's": "that is", "they've": "they have",
}

@dataclass
class NormalizationConfig:
    lowercase: bool = True
    expand_contractions: bool = True
    remove_html: bool = True
    remove_urls: bool = True
    remove_mentions: bool = True
    remove_punctuation: bool = True
    normalize_unicode: bool = True
    normalize_whitespace: bool = True

class TextNormalizer:
    def __init__(self, config: "NormalizationConfig | None" = None):
        self.cfg = config or NormalizationConfig()
        self._contraction_re = re.compile(
            '(' + '|'.join(re.escape(k) for k in CONTRACTIONS) + ')',
            re.IGNORECASE
        )

    def normalize(self, text: str) -> str:
        """Run the full normalization pipeline."""
        if self.cfg.normalize_unicode:
            text = unicodedata.normalize('NFC', text)
        if self.cfg.remove_html:
            text = re.sub(r'<[^>]+>', ' ', text)
        if self.cfg.remove_urls:
            text = re.sub(r'https?://\S+|www\.\S+', '', text, flags=re.I)
        if self.cfg.remove_mentions:
            text = re.sub(r'@\w+', '', text)
            text = re.sub(r'#(\w+)', r'\1', text)
        if self.cfg.lowercase:
            text = text.lower()
        if self.cfg.expand_contractions:
            text = self._contraction_re.sub(
                lambda m: CONTRACTIONS[m.group(0).lower()], text
            )
        if self.cfg.remove_punctuation:
            # Keep both cases so this step also works when lowercase=False
            text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
        if self.cfg.normalize_whitespace:
            text = re.sub(r'\s+', ' ', text).strip()
        return text

# ── Usage ─────────────────────────────────────────────
normalizer = TextNormalizer()

samples = [
    "<b>I can't BELIEVE it's this good!</b> Visit https://example.com",
    "@alice They've  won't stop   #MachineLearning",
    "Python isn't hard — it's just different.",
]
for s in samples:
    print(normalizer.normalize(s))

OUTPUT

i cannot believe it is this good visit
they have will not stop machinelearning
python is not hard it is just different

Section 08

Sentence Segmentation — The Hidden Hard Problem

📖 Story

The Ambiguous Period

A naive system splits on every period it finds. Then it encounters:

"Dr. Smith works at St. Mary's Hospital in Washington, D.C. She graduated in 1998."

A naïve splitter produces five "sentences", breaking after Dr., St., D., C., and 1998. It has no idea that a period after an abbreviation is usually not a sentence boundary. Real sentence segmentation must distinguish sentence-ending periods from abbreviation periods, decimal points (3.14), ellipses (...), and domain names (nlp.com). To complicate matters, an abbreviation can also close a sentence, as D.C. does here.
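The failure is easy to reproduce. The sketch below treats every period as a boundary, using a deliberately naive pattern chosen for this illustration, and prints the resulting fragments.

```python
import re

text = "Dr. Smith works at St. Mary's Hospital in Washington, D.C. She graduated in 1998."

# Treat every period as a boundary: each fragment is
# "anything up to and including the next period"
fragments = [f.strip() for f in re.findall(r"[^.]+\.", text)]

for i, fragment in enumerate(fragments, 1):
    print(f"[{i}] {fragment}")
```

Only the last fragment is a real sentence; the others are pieces of the first one.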

Sentence Segmentation (also called Sentence Boundary Detection) is the task of splitting a block of text into its constituent sentences. It is the prerequisite for tokenization, parsing, machine translation, and summarization — all of which operate at the sentence level.

📋

Why This Is Harder Than It Looks

English has thousands of abbreviations that contain periods. Add domain names, decimal numbers, quoted speech, parenthetical asides, and multi-sentence quotations, and rule-based splitters fail constantly. Modern systems instead learn these distinctions from data: Punkt from large amounts of unannotated text, neural segmenters from annotated sentences.

Section 09

The Three Approaches to Sentence Segmentation

📈 Rule-Based (Regex)

re.split(), custom patterns

Split on punctuation patterns. Fast and interpretable but brittle. Fails badly on abbreviations, initials, decimal numbers, and ellipses. Only suitable for very clean, well-formatted corpora.

✔ Pros: Zero dependencies, fully transparent, fast

✖ Cons: Breaks on abbreviations, decimals, quotes

🤖 Statistical (Punkt)

nltk.sent_tokenize()

NLTK's Punkt algorithm learns abbreviations and collocations from training data. Language-agnostic and reasonably accurate. The right choice for standard English NLP tasks without neural overhead.

✔ Pros: Handles abbreviations, multi-language, lightweight

✖ Cons: Struggles with domain-specific text, social media

🌐 Neural (spaCy)

nlp(text).sents

spaCy's dependency parser determines sentence boundaries by understanding syntactic structure — not just punctuation. Best accuracy, especially on messy, informal, or domain-specific text.

✔ Pros: Highest accuracy, handles edge cases gracefully

✖ Cons: Slower, requires model download, heavier dependency

Section 10

Approach 1 — Rule-Based Regex Segmentation

import re

def simple_sent_tokenize(text: str) -> list:
    """
    Naïve rule-based sentence splitter.
    Works for clean text; fails on abbreviations.
    """
    # Split after . ! ? followed by whitespace + uppercase
    pattern = re.compile(r'(?<=[.!?])\s+(?=[A-Z])')
    sentences = pattern.split(text)
    return [s.strip() for s in sentences if s.strip()]

# ── Test it ───────────────────────────────────────────
text1 = "The sun rose over the mountains. Birds began to sing. It was a beautiful morning!"
text2 = "Dr. Smith earned a Ph.D. She now works in Washington, D.C. It is remarkable."

print("=== Clean Text ===")
for i, s in enumerate(simple_sent_tokenize(text1), 1):
    print(f"  [{i}] {s}")

print("\n=== Abbreviation Text (fails) ===")
for i, s in enumerate(simple_sent_tokenize(text2), 1):
    print(f"  [{i}] {s}")

OUTPUT

=== Clean Text ===
  [1] The sun rose over the mountains.
  [2] Birds began to sing.
  [3] It was a beautiful morning!

=== Abbreviation Text (fails) ===
  [1] Dr.
  [2] Smith earned a Ph.D.
  [3] She now works in Washington, D.C.
  [4] It is remarkable.

⚠️

The Abbreviation Failure

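The regex wrongly treats "Dr." as a sentence ending because it is followed by whitespace and a capital letter, which is a classic failure mode. The splits after "Ph.D." and "D.C." come out right only because those abbreviations happen to end their sentences here; "She earned a Ph.D. In 2010 she moved." would be broken in the middle. This is why rule-based splitters are only acceptable for very controlled, clean corpora. For anything else, use NLTK or spaCy.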
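A common patch for the rule-based approach is to re-join pieces that end in an abbreviation which rarely closes a sentence. The sketch below is one way to do that. The abbreviation set is a small illustrative assumption, and "Ph.D." and "D.C." are deliberately left out because they can end a sentence, which shows the limit of any fixed list.

```python
import re

# Illustrative list of abbreviations that rarely end a sentence
NON_FINAL_ABBREVIATIONS = {"dr.", "mr.", "mrs.", "ms.", "prof.", "st."}

_SPLIT_RE = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")

def abbrev_aware_split(text: str) -> list[str]:
    """Regex split, then merge pieces that end in a non-final abbreviation."""
    sentences: list[str] = []
    carry = ""  # text waiting to be joined with the next piece
    for piece in _SPLIT_RE.split(text.strip()):
        candidate = f"{carry} {piece}".strip()
        if not candidate:
            continue
        last_word = candidate.split()[-1].lower()
        if last_word in NON_FINAL_ABBREVIATIONS:
            carry = candidate          # false boundary: keep accumulating
        else:
            sentences.append(candidate)
            carry = ""
    if carry:
        sentences.append(carry)
    return sentences

text2 = "Dr. Smith earned a Ph.D. She now works in Washington, D.C. It is remarkable."
for s in abbrev_aware_split(text2):
    print(s)
```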

Section 11

Approach 2 — Statistical Segmentation with NLTK Punkt

NLTK's Punkt algorithm was published by Kiss & Strunk (2006). It uses unsupervised learning to build a list of abbreviations from the training corpus, then uses collocation statistics to decide whether a period ends a sentence. It is the industry standard for lightweight English NLP.

import nltk
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)

from nltk.tokenize import sent_tokenize

# ── Test cases ────────────────────────────────────────
texts = [
    "Dr. Smith earned her Ph.D. at M.I.T. She now works in Washington, D.C.",
    "The price is $3.99. It went up by 0.5%. Buy now!",
    "He said: 'Wait... really?' She nodded. 'Yes, really.'",
    "The U.S.A. is large. Canada is larger. Russia is largest.",
]

for text in texts:
    sentences = sent_tokenize(text)
    print(f"INPUT : {text}")
    print(f"SPLITS: {len(sentences)} sentence(s)")
    for i, s in enumerate(sentences, 1):
        print(f"  [{i}] {s}")
    print()

OUTPUT

INPUT : Dr. Smith earned her Ph.D. at M.I.T. She now works in Washington, D.C.
SPLITS: 2 sentence(s)
  [1] Dr. Smith earned her Ph.D. at M.I.T.
  [2] She now works in Washington, D.C.

INPUT : The price is $3.99. It went up by 0.5%. Buy now!
SPLITS: 3 sentence(s)
  [1] The price is $3.99.
  [2] It went up by 0.5%.
  [3] Buy now!

INPUT : He said: 'Wait... really?' She nodded. 'Yes, really.'
SPLITS: 3 sentence(s)
  [1] He said: 'Wait... really?'
  [2] She nodded.
  [3] 'Yes, really.'

INPUT : The U.S.A. is large. Canada is larger. Russia is largest.
SPLITS: 3 sentence(s)
  [1] The U.S.A. is large.
  [2] Canada is larger.
  [3] Russia is largest.

🌟

Punkt Handles All These Cases Correctly

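Notice that Dr., Ph.D., and U.S.A. are not treated as sentence boundaries, while M.I.T. is, because it really does end its sentence. This is the Punkt algorithm's core strength: learned abbreviation lists prevent false splits on periods that belong to a token. The decimal points in $3.99 and 0.5% are never candidates in the first place, since no whitespace follows them. Exact results can vary with the NLTK version and the pre-trained model.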
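To get a feel for how an abbreviation list can be learned without labels, the sketch below counts how often each short word appears with and without a trailing period. Words that only ever occur with a period become candidates. This is a simplified frequency heuristic written for illustration, not the statistic Punkt actually uses (Punkt relies on log-likelihood ratios); the function name and thresholds are assumptions.

```python
from collections import Counter

def abbreviation_candidates(corpus: str, min_count: int = 2, max_len: int = 4) -> set[str]:
    """Short words that always appear with a trailing period in `corpus`."""
    with_period, without_period = Counter(), Counter()
    for raw in corpus.split():
        token = raw.strip("\"'(),;:!?")
        if not token:
            continue
        if token.endswith("."):
            with_period[token[:-1].lower()] += 1
        else:
            without_period[token.lower()] += 1
    return {
        word for word, count in with_period.items()
        if count >= min_count            # seen often enough with a period
        and without_period[word] == 0    # never seen without one
        and 0 < len(word) <= max_len     # abbreviations tend to be short
        and word.isalpha()
    }

corpus = ("Dr. Lee met Dr. Kim. Mr. Kim saw Mr. Lee at noon. "
          "The doctor left early. Kim left.")
print(sorted(abbreviation_candidates(corpus)))
```

Ordinary words that end a sentence also appear elsewhere without a period, which is what separates them from true abbreviations once the corpus is large enough.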

Section 12

Approach 3 — Neural Segmentation with spaCy

spaCy's segmenter uses a dependency parser: it relies on the grammatical structure of the text to find boundaries, not just punctuation patterns. This often makes it the most accurate option, especially for complex, informal, or domain-specific text, though it is not infallible. In the output below, the fourth segment still contains two sentences.

import spacy

# Load English model (run once: python -m spacy download en_core_web_sm)
nlp = spacy.load("en_core_web_sm")

# ── Test with complex text ────────────────────────────
complex_text = """
Prof. Johnson (B.Sc., M.D.) published new research today.
The study, conducted at Johns Hopkins Univ., found a 3.7% improvement.
"This is groundbreaking," she said. "We never expected these results."
The FDA (U.S. Food & Drug Admin.) will review findings by Q4 2025.
""".strip()

doc = nlp(complex_text)
sentences = list(doc.sents)

print(f"Total sentences detected: {len(sentences)}")
for i, sent in enumerate(sentences, 1):
    print(f"  [{i}] {sent.text.strip()}")

# ── Additional metadata spaCy provides ───────────────
print("\nSentence start tokens:")
for token in doc:
    if token.is_sent_start:
        print(f"  '{token.text}' (pos={token.pos_})")

OUTPUT

Total sentences detected: 4
  [1] Prof. Johnson (B.Sc., M.D.) published new research today.
  [2] The study, conducted at Johns Hopkins Univ., found a 3.7% improvement.
  [3] "This is groundbreaking," she said.
  [4] "We never expected these results." The FDA (U.S. Food & Drug Admin.) will review findings by Q4 2025.

Sentence start tokens:
  'Prof' (pos=NOUN)
  'The' (pos=DET)
  '"' (pos=PUNCT)
  '"' (pos=PUNCT)

Section 13

Visual Diagram — How Sentence Boundaries Are Decided

The diagram below shows the decision tree a robust sentence segmenter uses when it encounters a period. Each node tests a specific property of the token to determine whether the period terminates a sentence.

🕐 SENTENCE BOUNDARY DECISION DIAGRAM

Decision tree used by statistical and neural segmenters to classify each period. Rule-based systems follow a fixed version of this tree; Punkt learns its abbreviation list and collocation statistics from raw text, while spaCy's parser learns boundaries from annotated sentences.
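A fixed, rule-based version of such a tree might look like the sketch below. The order of the checks and the abbreviation set are assumptions chosen for illustration, not a reconstruction of the diagram or of any library's logic.

```python
from typing import Optional

ABBREVIATIONS = {"dr", "mr", "mrs", "prof", "st", "vs", "approx", "dept"}  # illustrative

def period_ends_sentence(token: str, next_token: Optional[str]) -> bool:
    """Classify the period that ends `token`, given the token that follows it."""
    if not token.endswith("."):
        return False                 # period is internal (3.14, nlp.com) or absent
    if not next_token:
        return True                  # nothing follows: end of text
    starts_upper = next_token[0].isupper()
    if token.endswith("..."):
        return starts_upper          # ellipsis: boundary only before a capital
    stem = token[:-1].lower()
    if stem in ABBREVIATIONS:
        return False                 # known abbreviation
    if len(stem) == 1 and stem.isalpha():
        return False                 # single-letter initial such as "J."
    # Otherwise rely on what follows: a capital or an opening quote or bracket
    return starts_upper or next_token[0] in "\"'("

tokens = "Dr. Smith paid $3.14 in Washington, D.C. She left.".split()
for tok, nxt in zip(tokens, tokens[1:] + [None]):
    if tok.endswith("."):
        print(f"{tok!r:>10} -> {period_ends_sentence(tok, nxt)}")
```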

Section 14

Head-to-Head Comparison — All Three Methods

import re
import nltk
import spacy

nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)
from nltk.tokenize import sent_tokenize
nlp = spacy.load("en_core_web_sm")

# ── Benchmark text ────────────────────────────────────
text = ("Dr. Angela Yu (Ph.D., M.D.) presented findings at 9:30 a.m. "
        "The study cost approx. $2.5M. Results showed a 4.7% gain. "
        "She said: 'We didn't expect this.' The FDA, U.S. Dept. of Health, "
        "will review by Jan. 2026. 'Exciting times,' remarked Prof. Lee.")

# Method 1: Regex
regex_sents = re.split(r'(?<=[.!?])\s+(?=[A-Z])', text)

# Method 2: NLTK Punkt
nltk_sents  = sent_tokenize(text)

# Method 3: spaCy neural
spacy_sents = [s.text.strip() for s in nlp(text).sents]

for name, sents in [("REGEX", regex_sents),
                      ("NLTK",  nltk_sents),
                      ("SPACY", spacy_sents)]:
    print(f"── {name} → {len(sents)} segments ──")
    for i, s in enumerate(sents, 1):
        print(f"  [{i}] {s}")
    print()

OUTPUT

── REGEX → 7 segments ──
  [1] Dr.
  [2] Angela Yu (Ph.D., M.D.) presented findings at 9:30 a.m.
  [3] The study cost approx. $2.5M.
  [4] Results showed a 4.7% gain.
  [5] She said: 'We didn't expect this.' The FDA, U.S.
  [6] Dept. of Health, will review by Jan. 2026. 'Exciting times,' remarked Prof.
  [7] Lee.

── NLTK → 4 segments ──
  [1] Dr. Angela Yu (Ph.D., M.D.) presented findings at 9:30 a.m.
  [2] The study cost approx. $2.5M.
  [3] Results showed a 4.7% gain.
  [4] She said: 'We didn't expect this.' The FDA, U.S. Dept. of Health, will review by Jan. 2026. 'Exciting times,' remarked Prof. Lee.

── SPACY → 5 segments ──
  [1] Dr. Angela Yu (Ph.D., M.D.) presented findings at 9:30 a.m.
  [2] The study cost approx. $2.5M.
  [3] Results showed a 4.7% gain.
  [4] She said: 'We didn't expect this.'
  [5] The FDA, U.S. Dept. of Health, will review by Jan. 2026. 'Exciting times,' remarked Prof. Lee.

Property	Regex	NLTK Punkt	spaCy Neural
Handles abbreviations	✖ No	✔ Yes	✔ Yes
Handles decimals	⚠ Partial	✔ Yes	✔ Yes
Handles quoted speech	✖ No	⚠ Partial	✔ Yes
Multi-language support	⚠ Manual	✔ Yes	✔ Yes
Processing speed	⚡ Fastest	⚡ Fast	⏳ Slower
Accuracy (general, rough indication)	Low	Good	Best
Dependencies	None	nltk	spacy + model
Best use case	Clean, controlled text	Standard NLP tasks	Production, messy text
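Rather than relying on general impressions, you can measure a segmenter on your own data by comparing its boundaries against a hand-segmented reference. The sketch below computes boundary precision, recall, and F1. The function names are made up for this example, and whitespace is ignored so that spacing differences between tools do not affect the comparison.

```python
def boundary_offsets(sentences: list[str]) -> set[int]:
    """Cumulative non-whitespace character count at the end of each sentence."""
    offsets, total = set(), 0
    for s in sentences:
        total += len("".join(s.split()))
        offsets.add(total)
    return offsets

def boundary_scores(predicted: list[str], gold: list[str]) -> tuple[float, float, float]:
    """Precision, recall and F1 of predicted sentence ends against the gold ones."""
    pred, ref = boundary_offsets(predicted), boundary_offsets(gold)
    tp = len(pred & ref)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(ref) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1

gold = ["Dr. Smith earned a Ph.D.",
        "She now works in Washington, D.C.",
        "It is remarkable."]
predicted = ["Dr.", "Smith earned a Ph.D.",
             "She now works in Washington, D.C.",
             "It is remarkable."]

print(boundary_scores(predicted, gold))
```

The final boundary (end of text) always matches, so scores on very short texts look better than they should; evaluate on longer passages.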

Section 15

Complete Production Pipeline — Normalization + Segmentation

Here is the fully integrated pipeline: normalize raw text first, then segment into sentences, ready for tokenization and downstream modelling.

import re
import unicodedata
import nltk
from nltk.tokenize import sent_tokenize
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)

CONTRACTIONS = {
    "won't": "will not", "can't": "cannot",
    "don't": "do not", "isn't": "is not",
    "i'm": "i am", "it's": "it is",
    "they've": "they have", "i'll": "i will",
}

_contraction_re = re.compile(
    '(' + '|'.join(re.escape(k) for k in CONTRACTIONS) + ')',
    re.IGNORECASE
)

def normalize(text: str,
              lowercase: bool = True,
              expand_contractions: bool = True,
              remove_html: bool = True,
              remove_urls: bool = True,
              remove_punct: bool = False) -> str:
    """
    Full normalization pipeline.
    NOTE: remove_punct=False by default so sentence
    segmentation can still use punctuation as signal.
    Always segment BEFORE removing punctuation.
    """
    text = unicodedata.normalize('NFC', text)
    if remove_html:
        text = re.sub(r'<[^>]+>', ' ', text)
    if remove_urls:
        text = re.sub(r'https?://\S+|www\.\S+', '', text, flags=re.I)
    text = re.sub(r'@\w+', '', text)
    text = re.sub(r'#(\w+)', r'\1', text)
    if lowercase:
        text = text.lower()
    if expand_contractions:
        text = _contraction_re.sub(
            lambda m: CONTRACTIONS[m.group(0).lower()], text
        )
    if remove_punct:
        text = re.sub(r'[^a-z0-9\s]', '', text)
    text = re.sub(r'\s+', ' ', text).strip()
    return text

def full_pipeline(raw_text: str) -> list[str]:
    """
    Step 1: Normalize (keep punctuation for segmentation)
    Step 2: Segment into sentences
    Step 3: Remove punctuation from each sentence
    Returns: list of clean, normalized sentence strings
    """
    # Step 1 — normalize (preserve punct for segmentation)
    normalized = normalize(raw_text, remove_punct=False)

    # Step 2 — sentence segmentation
    sentences = sent_tokenize(normalized)

    # Step 3 — remove punctuation from each sentence
    clean_sentences = []
    for sent in sentences:
        clean = re.sub(r'[^a-z0-9\s]', '', sent)
        clean = re.sub(r'\s+', ' ', clean).strip()
        if clean:
            clean_sentences.append(clean)
    return clean_sentences

# ── Run the pipeline ──────────────────────────────────
raw = """<p>@alice I can't believe it's true! Dr. Smith
won't confirm — but they've posted at https://example.com.
The result was 99.7% accuracy. Amazing!</p>"""

sentences = full_pipeline(raw)
print(f"Detected {len(sentences)} sentences:")
for i, s in enumerate(sentences, 1):
    print(f"  [{i}] {s}")

OUTPUT

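Detected 3 sentences:
  [1] i cannot believe it is true
  [2] dr smith will not confirm but they have posted at the result was 997 accuracy
  [3] amazing

Two side effects are visible here. The \S+ in the URL pattern also consumed the period after example.com, so the sentence about the URL and the sentence about the result were merged. Removing punctuation also turned 99.7% into 997.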
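One caveat with the https?://\S+ pattern used in normalize(): \S+ also matches punctuation attached to the URL, including a sentence-final period, which removes a boundary the segmenter needs. A possible variant, sketched below, forces the match to end on a character that is not common trailing punctuation. The names URL_RE and remove_urls_keep_punct are assumptions for this example, and URLs that legitimately end in such a character will be cut one character short.

```python
import re

# \S* is greedy; the final character class then forces the match to end on
# something that is neither whitespace nor common trailing punctuation.
URL_RE = re.compile(r"(?:https?://|www\.)\S*[^\s.,!?;:'\")\]]", re.IGNORECASE)

def remove_urls_keep_punct(text: str) -> str:
    """Remove URLs but leave punctuation that follows them in place."""
    text = URL_RE.sub("", text)
    # Close the gap left in front of the surviving punctuation mark
    return re.sub(r"\s+([.,!?;:])", r"\1", text)

print(remove_urls_keep_punct("They posted at https://example.com. The result was clear."))
```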

💡

Critical Ordering Rule — Always Segment Before Removing Punctuation

If you remove punctuation before segmentation, the sentence splitter loses all its boundary signals (. ! ?) and produces one giant run-on. The correct order is always: normalize → segment → then remove punctuation per sentence.
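The effect of the ordering is easy to demonstrate. The sketch below uses a simple regex splitter as a stand-in for any segmenter, an assumption made to keep the example dependency-free, and runs the two steps in both orders.

```python
import re

def split_sentences(text: str) -> list[str]:
    # Simple stand-in splitter; real segmenters rely on the same punctuation signal
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def strip_punct(text: str) -> str:
    # Lowercase, drop everything except letters, digits and whitespace
    return re.sub(r"\s+", " ", re.sub(r"[^a-z0-9\s]", "", text.lower())).strip()

text = "It works. Really well! Try it?"

wrong = split_sentences(strip_punct(text))               # punctuation removed first
right = [strip_punct(s) for s in split_sentences(text)]  # segment first

print(wrong)
print(right)
```

With punctuation removed first, the splitter receives a single unbroken string; with segmentation first, each sentence is cleaned separately.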

Section 16

Golden Rules

🌲 Text Normalization & Sentence Segmentation — Non-Negotiable Rules

Know your task before normalizing. Lowercasing, punctuation removal, and emoji stripping all destroy information. That information may be critical to your task (NER, sentiment, translation). Define the task first; let the task define the pipeline.

Never lowercase for NER tasks. Capitalization is one of the strongest signals a Named Entity Recognition model has. "Apple" vs "apple" and "US" vs "us" are entirely different entities.

Always segment before removing punctuation. Sentence boundaries are marked by punctuation. If you strip . and ! first, the segmenter is blind. Clean punctuation inside sentences after splitting.

Expand contractions before tokenization for traditional models. BERT and GPT handle contractions internally via subword tokenization. Traditional models (TF-IDF, Word2Vec, Naive Bayes) do not — expand explicitly for them.

Use NLTK Punkt for standard tasks; use spaCy for production. Regex is only acceptable for pristine, controlled corpora. For anything involving real-world text (social media, news, PDFs), use statistical or neural segmenters.

Handle emoji as signal, not noise, for sentiment tasks. Convert emoji to text descriptions (😊 → happy face) using the emoji library before removal. Blindly stripping emoji from product reviews or social media loses significant sentiment signal.

Always apply unicode normalization (NFC) first. The same visible character can have multiple unicode representations (e.g. é = U+00E9 or U+0065 + U+0301). Without NFC normalization, identical strings compare as unequal and tokenizers split them differently.
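The last rule can be checked directly with the standard library. The sketch below builds the two representations of é mentioned above and compares them before and after NFC normalization; the variable names are illustrative.

```python
import unicodedata

composed = "\u00e9"      # é as a single code point (U+00E9)
decomposed = "e\u0301"   # e (U+0065) followed by a combining acute accent (U+0301)

# The two strings render identically but compare as different
print(composed == decomposed)

# After NFC normalization both become the single code point
nfc_a = unicodedata.normalize("NFC", composed)
nfc_b = unicodedata.normalize("NFC", decomposed)
print(nfc_a == nfc_b)
print(len(decomposed), len(nfc_b))
```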