POS Tagging in Python: Complete Guide with NLTK, spaCy

Section 01

The Story That Explains POS Tagging

📖 Real World Analogy

The Post Office Sorting Machine

Imagine a massive post office that receives millions of letters every day. Each letter has a word written on it. Before the letters can be processed, a smart sorting machine needs to label every letter: "Is this a name? An action? A describing word? A connector?"

Without these labels, the post office has no idea how to route the mail. It can't tell whether "bank" is a place (a noun) or something you do with your savings (a verb), or whether "run" is an action or a score in cricket. The machine reads the surrounding letters to decide: a letter next to "the" is probably a noun; a letter ending in "-ing" after a helper word is probably a verb.

That sorting machine is Part-of-Speech Tagging, and it is a foundational step in a great many Natural Language Processing pipelines.

Part-of-Speech (POS) Tagging is the process of assigning a grammatical category (noun, verb, adjective, adverb, and so on) to each word (token) in a sentence. A tagger reads context, not just the word itself, to make its decision. The word "flies" is a verb in "time flies" but a noun in "fruit flies are everywhere."

🧠

Why POS Tagging Matters

POS tagging is almost never the end goal; it is a stepping stone. Named Entity Recognition, Dependency Parsing, Sentiment Analysis, Machine Translation, and Question Answering have all traditionally used the grammatical role of each word as an input feature. It turns raw text into structured information a machine can reason about.

Section 02

Understanding POS Tag Sets

Different NLP libraries use different tag conventions. The two you will encounter most often are the Universal POS Tags (17 coarse tags, language-agnostic) and the Penn Treebank Tags (36+ fine-grained tags for English).

🏷️ Universal POS Tags — The Big 17

NOUN

Common nouns — cat, city, freedom, algorithm

PROPN

Proper nouns — London, Python, Marie Curie

VERB

Verbs in any form — run, running, ran, is

ADJ

Adjectives — fast, beautiful, large, neural

ADV

Adverbs — quickly, never, very, here

PRON

Pronouns — he, she, they, it, this, who

DET

Determiners — the, a, an, each, every

ADP

Adpositions (pre/postpositions) — in, on, at, of, with

AUX

Auxiliary verbs — is, was, have, will, should, can

CCONJ

Co-ordinating conjunctions — and, but, or, nor

SCONJ

Subordinating conjunctions — because, although, if, while

NUM

Numerals — one, 42, first, III, 3.14

PART

Particles — 's, not, to (infinitive marker)

INTJ

Interjections — wow, oops, hello, ah

PUNCT

Punctuation — . , ! ? " ( )

SYM

Symbols — $ % @ # & →

X

Other / unknown — abbreviations, foreign words, noise

💡

Penn Treebank vs Universal — When to Use Which

Use Penn Treebank tags (NN, NNS, VBD, JJ …) when you need English-specific detail — for example, distinguishing singular noun (NN) from plural (NNS), or past tense verb (VBD) from present participle (VBG). Use Universal tags when building multilingual models or when coarse categories are enough. NLTK defaults to Penn Treebank; spaCy gives you both via .tag_ (Penn) and .pos_ (Universal).
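To make the relationship between the two tag sets concrete, here is a minimal sketch that collapses Penn Treebank tags into Universal tags with a plain dictionary. The table `PENN_TO_UNIVERSAL` and the helpers `to_universal` and `coarsen` are hypothetical names written for this guide, not part of NLTK or spaCy, and the mapping is deliberately partial and simplified (for instance, it sends every IN to ADP even though IN also covers subordinating conjunctions).

```python
# Partial, hand-written Penn Treebank -> Universal lookup (illustrative only).
PENN_TO_UNIVERSAL = {
    "NN": "NOUN", "NNS": "NOUN",
    "NNP": "PROPN", "NNPS": "PROPN",
    "VB": "VERB", "VBD": "VERB", "VBG": "VERB",
    "VBN": "VERB", "VBP": "VERB", "VBZ": "VERB",
    "MD": "AUX",
    "JJ": "ADJ", "JJR": "ADJ", "JJS": "ADJ",
    "RB": "ADV", "RBR": "ADV", "RBS": "ADV",
    "DT": "DET",
    "IN": "ADP",   # simplification: IN also covers subordinating conjunctions
    "CC": "CCONJ",
    "CD": "NUM",
    "PRP": "PRON", "PRP$": "PRON",
    "TO": "PART",
    "UH": "INTJ",
}


def to_universal(penn_tag: str) -> str:
    """Map one Penn tag to a coarse tag; anything unlisted falls back to X."""
    return PENN_TO_UNIVERSAL.get(penn_tag, "X")


def coarsen(tagged):
    """Convert a list of (word, penn_tag) pairs to (word, universal_tag) pairs."""
    return [(word, to_universal(tag)) for word, tag in tagged]


penn_tagged = [("The", "DT"), ("dogs", "NNS"), ("barked", "VBD"), ("loudly", "RB")]
universal_tagged = coarsen(penn_tagged)
print(universal_tagged)
```

Notice that the conversion only goes one way: NN and NNS both become NOUN, so the singular/plural distinction cannot be recovered from the coarse tag.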

Section 03

Why Is POS Tagging Hard? Ambiguity Everywhere

📖 The Ambiguity Problem

The Word "Light": Four Different Roles

Consider the single word "light". Without context, a computer has no idea what it is:

→ "Turn on the light": NOUN (a lamp)
→ "The feather is very light": ADJ (not heavy)
→ "Please light the candle": VERB (ignite)
→ "She always travels light": ADV (manner)

A human reads the surrounding words effortlessly. A machine needs a model that has seen thousands of examples to make the same call, and even then rare constructions cause errors. This is partly why reported accuracy on standard English benchmarks has hovered around 97–98% for years: many of the remaining errors fall on cases where human annotators themselves disagree.

Word	As NOUN	As VERB	As ADJ
bank	The river bank was steep	She will bank the profits	-
close	The meeting came to a close	Please close the door	She is a close friend
fast	He broke his fast	She will fast for a day	A fast car wins races
run	A run of bad luck	She will run the marathon	-
well	Draw water from the well	Tears well up in his eyes	She is feeling well today
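One way to see why ambiguity is a problem is to build the simplest possible tagger, one that ignores context entirely. The sketch below is a most-frequent-tag baseline trained on a three-sentence corpus invented for this example; `train_unigram` and `tag_unigram` are hypothetical helper names, not library functions.

```python
from collections import Counter, defaultdict

# Tiny hand-tagged corpus, invented for illustration.
TRAINING = [
    [("the", "DET"), ("river", "NOUN"), ("bank", "NOUN"), ("flooded", "VERB")],
    [("the", "DET"), ("bank", "NOUN"), ("closed", "VERB")],
    [("you", "PRON"), ("can", "AUX"), ("bank", "VERB"), ("on", "ADP"), ("it", "PRON")],
]


def train_unigram(sentences):
    """Remember, for every word, the tag it carried most often."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}


def tag_unigram(model, tokens, default="NOUN"):
    """Tag each token in isolation; unseen words get the default tag."""
    return [(tok, model.get(tok.lower(), default)) for tok in tokens]


model = train_unigram(TRAINING)
result = tag_unigram(model, ["you", "can", "bank", "on", "the", "bank"])
print(result)
```

Because "bank" was a noun in two of its three training occurrences, this baseline labels both occurrences in the test sentence as NOUN, including the one after "can" that is clearly a verb. Context-aware taggers exist to fix exactly this kind of mistake.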

Section 04

How POS Taggers Work — The Algorithms

POS taggers have evolved through three distinct generations. Understanding them helps you choose the right tool and debug failures intelligently.

📏

Generation 1 — Rule-Based

Hand-crafted Linguistic Rules

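Linguists manually write rules: "if the word ends in -ly, tag it ADV", "if the word follows the, tag it NOUN." ENGTWOL is the classic example. The Brill Tagger (1992) is also rule-based, but it learns its transformation rules from an annotated corpus instead of relying on hand-written ones. High precision on known patterns, frequent failures on anything outside the rule set.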
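A rule-based tagger of this kind can be sketched in a few lines. The rules below are invented for illustration (they are not taken from ENGTWOL or any real system), and `rule_based_tag` is a hypothetical helper: suffix patterns are tried first, then a single contextual rule about determiners.

```python
import re

# Ordered (pattern, tag) rules: the first match wins. Invented for illustration.
SUFFIX_RULES = [
    (re.compile(r"^\d+(\.\d+)?$"), "NUM"),
    (re.compile(r".+ly$"), "ADV"),
    (re.compile(r".+ing$"), "VERB"),
    (re.compile(r".+(ous|ful|able)$"), "ADJ"),
]
DETERMINERS = {"the", "a", "an"}


def rule_based_tag(tokens):
    """Tag tokens with hand-written rules; unmatched words get X."""
    tags = []
    for i, token in enumerate(tokens):
        word = token.lower()
        if word in DETERMINERS:
            tag = "DET"
        else:
            tag = next((t for pattern, t in SUFFIX_RULES if pattern.match(word)), None)
            if tag is None:
                # Contextual rule: a word right after a determiner is called a noun.
                tag = "NOUN" if i > 0 and tags[-1] == "DET" else "X"
        tags.append(tag)
    return list(zip(tokens, tags))


print(rule_based_tag(["The", "dog", "quickly", "ate", "42", "treats"]))
print(rule_based_tag(["The", "family", "is", "lovely"]))
```

The second call shows the weakness of the approach: "family" and "lovely" both end in "-ly", so the suffix rule labels them ADV even though one is a noun and the other an adjective.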

📊

Generation 2 — Statistical

HMM, CRF, MaxEnt

Models learn probabilities from annotated corpora. Hidden Markov Models compute the most likely tag sequence using emission and transition probabilities. CRF (Conditional Random Fields) improved this by conditioning on arbitrary features simultaneously. Reached ~97% accuracy on Penn Treebank.
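As a rough illustration of "learning probabilities from annotated corpora", the sketch below estimates transition and emission probabilities by simple counting. The three-sentence corpus is invented, `estimate` is a hypothetical helper, `<s>` is an assumed start-of-sentence marker, and no smoothing is applied, so any unseen word or tag pair gets probability zero.

```python
from collections import Counter, defaultdict

# Tiny hand-tagged corpus, invented for illustration.
TAGGED = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("a", "DET"), ("dog", "NOUN"), ("sleeps", "VERB")],
]


def normalise(table):
    """Turn nested counts into nested relative frequencies."""
    return {
        key: {item: n / sum(counter.values()) for item, n in counter.items()}
        for key, counter in table.items()
    }


def estimate(sentences):
    """Maximum-likelihood estimates of P(tag | previous tag) and P(word | tag)."""
    transitions = defaultdict(Counter)  # previous tag -> counts of next tag
    emissions = defaultdict(Counter)    # tag -> counts of words
    for sentence in sentences:
        prev = "<s>"  # assumed start-of-sentence marker
        for word, tag in sentence:
            transitions[prev][tag] += 1
            emissions[tag][word] += 1
            prev = tag
    return normalise(transitions), normalise(emissions)


trans_p, emit_p = estimate(TAGGED)
print("P(next tag | DET): ", trans_p["DET"])
print("P(word | NOUN):    ", emit_p["NOUN"])
```

In this toy corpus a determiner is always followed by a noun, so P(NOUN | DET) comes out as 1.0. A real tagger would train on thousands of sentences and smooth the counts.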

🤖

Generation 3 — Neural

BiLSTM, BERT, Transformers

Deep learning models use contextual embeddings. A BiLSTM reads the sentence both left-to-right and right-to-left and combines the two views. Transformer-based taggers (available in spaCy v3+ through its _trf pipelines) reach roughly 98% accuracy on standard English benchmarks by attending to the whole sentence at once.

📐 Hidden Markov Model — How It Tags a Sentence

HMM finds the tag sequence that maximises P(tags) × P(words|tags) using the Viterbi algorithm — a dynamic programming approach that avoids evaluating all possible sequences.
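The Viterbi idea can be sketched on a two-tag toy model. Every probability below is invented for illustration, and `viterbi` and `log_p` are hypothetical helper names. Log-probabilities are used so that long sentences do not underflow, and the function assumes a non-empty list of words.

```python
import math

# Toy model: every probability below is invented for illustration.
TAGS = ["NOUN", "VERB"]
START_P = {"NOUN": 0.7, "VERB": 0.3}                 # P(first tag)
TRANS_P = {                                          # P(next tag | previous tag)
    "NOUN": {"NOUN": 0.3, "VERB": 0.7},
    "VERB": {"NOUN": 0.8, "VERB": 0.2},
}
EMIT_P = {                                           # P(word | tag)
    "NOUN": {"time": 0.5, "flies": 0.3, "arrows": 0.2},
    "VERB": {"time": 0.1, "flies": 0.6, "arrows": 0.3},
}


def log_p(p):
    return math.log(p) if p > 0 else float("-inf")


def viterbi(words):
    """Return the most probable tag sequence for a non-empty list of words."""
    # Each column maps tag -> (best log-probability of a path ending here, previous tag)
    columns = [{
        tag: (log_p(START_P[tag]) + log_p(EMIT_P[tag].get(words[0], 0.0)), None)
        for tag in TAGS
    }]
    for word in words[1:]:
        column = {}
        for tag in TAGS:
            # Keep only the best predecessor instead of every full path
            best_prev = max(TAGS, key=lambda p: columns[-1][p][0] + log_p(TRANS_P[p][tag]))
            score = columns[-1][best_prev][0] + log_p(TRANS_P[best_prev][tag])
            column[tag] = (score + log_p(EMIT_P[tag].get(word, 0.0)), best_prev)
        columns.append(column)
    # Follow the back-pointers from the best final tag
    tag = max(TAGS, key=lambda t: columns[-1][t][0])
    path = [tag]
    for column in reversed(columns[1:]):
        tag = column[tag][1]
        path.append(tag)
    return path[::-1]


best_path = viterbi(["time", "flies"])
print(best_path)
```

With these numbers the winning path is NOUN then VERB: its probability is 0.7 × 0.5 × 0.7 × 0.6 = 0.147, which beats every other combination. The work grows linearly with sentence length because each column keeps only one best score per tag.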

Section 05

POS Tagging with NLTK

NLTK is Python's classic NLP library. Its pos_tag() function uses a pre-trained Averaged Perceptron tagger and returns Penn Treebank tags.

# Install: pip install nltk
import nltk

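# Download required data (only once)
nltk.download('averaged_perceptron_tagger')
nltk.download('averaged_perceptron_tagger_eng')  # resource name used by newer NLTK releases (3.9+)
nltk.download('punkt')
nltk.download('punkt_tab')  # tokenizer resource used by newer NLTK releases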

sentence = "The quick brown fox jumps over the lazy dog"

# Step 1 — Tokenise
tokens = nltk.word_tokenize(sentence)

# Step 2 — Tag
tags = nltk.pos_tag(tokens)

# Step 3 — Display
for word, tag in tags:
    print(f"{word:12s} → {tag}")

OUTPUT

The          → DT
quick        → JJ
brown        → JJ
fox          → NN
jumps        → VBZ
over         → IN
the          → DT
lazy         → JJ
dog          → NN

📌

Decoding Penn Treebank Tags

DT = Determiner · JJ = Adjective · NN = Noun (singular) · NNS = Noun (plural) · VBZ = Verb (3rd person singular present) · VBD = Verb (past tense) · IN = Preposition · RB = Adverb · PRP = Personal pronoun · CC = Coordinating conjunction

# Accessing the NLTK tagset explanation
nltk.download('tagsets')
nltk.help.upenn_tagset('VBZ')   # explain one tag
nltk.help.upenn_tagset()          # explain all tags

# Tagging multiple sentences efficiently
sentences = [
    "She sells seashells by the seashore",
    "Time flies like an arrow",
    "Fruit flies like a banana"   # same words, different meaning!
]

for sent in sentences:
    tokens = nltk.word_tokenize(sent)
    tagged = nltk.pos_tag(tokens)
    print(f"\nSentence: {sent}")
    print(f"Tags:     {tagged}")

OUTPUT

Sentence: Time flies like an arrow
Tags:     [('Time', 'NN'), ('flies', 'VBZ'), ('like', 'IN'), ('an', 'DT'), ('arrow', 'NN')]

Sentence: Fruit flies like a banana
Tags:     [('Fruit', 'NN'), ('flies', 'NNS'), ('like', 'VBP'), ('a', 'DT'), ('banana', 'NN')]

✅

Context-Sensitivity in Action

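Notice how "flies" is tagged VBZ (verb) in the first sentence shown but NNS (plural noun, i.e. insects) in the second, and how "like" switches from preposition (IN) to verb (VBP). The same words get a completely different analysis because the tagger reads surrounding tokens to disambiguate. Exact tags can differ between NLTK versions and tagger models, so rerun the snippet to see what your installation produces.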

Section 06

POS Tagging with spaCy — Production Grade

spaCy is a widely used library for production NLP. Its pipelines are neural (CNN-based for the sm, md, and lg models; Transformer-based for the trf models) and give you both coarse Universal tags (.pos_) and fine-grained Penn Treebank tags (.tag_) in a single pipeline pass.

# Install: pip install spacy
# Download model: python -m spacy download en_core_web_sm

import spacy

# Load the English model
nlp = spacy.load("en_core_web_sm")

text = "Apple is looking at buying a UK startup for $1 billion"

# Process the text — tagger runs as part of the pipeline
doc = nlp(text)

# Print all token attributes
print(f"{'Token':15s} {'POS':8s} {'Tag':8s} {'Dep':12s} {'Lemma'}")
print("-" * 60)
for token in doc:
    print(f"{token.text:15s} {token.pos_:8s} {token.tag_:8s} {token.dep_:12s} {token.lemma_}")

OUTPUT

Token           POS      Tag      Dep          Lemma
------------------------------------------------------------
Apple           PROPN    NNP      nsubj        Apple
is              AUX      VBZ      aux          be
looking         VERB     VBG      ROOT         look
at              ADP      IN       prep         at
buying          VERB     VBG      pcomp        buy
a               DET      DT       det          a
UK              PROPN    NNP      compound     UK
startup         NOUN     NN       dobj         startup
for             ADP      IN       prep         for
$               SYM      $        quantmod     $
1               NUM      CD       compound     1
billion         NUM      CD       pobj         billion

# Filtering by POS — get only nouns from a document
text2 = "The data scientist trained a powerful language model on a massive corpus"
doc2 = nlp(text2)

nouns   = [t.text for t in doc2 if t.pos_ == "NOUN"]
verbs   = [t.lemma_ for t in doc2 if t.pos_ == "VERB"]
adjs    = [t.text for t in doc2 if t.pos_ == "ADJ"]

print(f"Nouns:      {nouns}")
print(f"Verbs:      {verbs}")
print(f"Adjectives: {adjs}")

# POS tag frequency distribution
from collections import Counter
pos_counts = Counter(t.pos_ for t in doc2 if t.pos_ != "PUNCT")
print("\nPOS distribution:")
for pos, count in pos_counts.most_common():
    print(f"  {pos:8s}: {count}")

OUTPUT

Nouns:      ['data', 'scientist', 'language', 'model', 'corpus']
Verbs:      ['train']
Adjectives: ['powerful', 'massive']

POS distribution:
  NOUN    : 5
  DET     : 3
  ADJ     : 2
  VERB    : 1
  ADP     : 1

Section 07

Visualising POS Tags — Sentence Diagram

A visual breakdown of a tagged sentence helps you see grammatical structure at a glance. Below is a full parse of "The brilliant researcher quickly published her groundbreaking paper."

🔍 Token-Level POS Diagram
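A plain-text version of such a diagram is easy to produce once you have (token, tag) pairs. In the sketch below the tags are assigned by hand for this one sentence, not produced by a model, and `render_diagram` is a hypothetical helper that simply lines each tag up under its token.

```python
# Hand-assigned Universal tags for the example sentence (not model output).
TAGGED = [
    ("The", "DET"), ("brilliant", "ADJ"), ("researcher", "NOUN"),
    ("quickly", "ADV"), ("published", "VERB"), ("her", "PRON"),
    ("groundbreaking", "ADJ"), ("paper", "NOUN"), (".", "PUNCT"),
]


def render_diagram(tagged):
    """Return two aligned lines: tokens on top, their tags underneath."""
    widths = [max(len(word), len(tag)) for word, tag in tagged]
    words = "  ".join(word.ljust(w) for (word, _), w in zip(tagged, widths))
    tags = "  ".join(tag.ljust(w) for (_, tag), w in zip(tagged, widths))
    return f"{words.rstrip()}\n{tags.rstrip()}"


diagram = render_diagram(TAGGED)
print(diagram)
```

To use real predictions, build the pair list from a spaCy document with `[(t.text, t.pos_) for t in doc]` and pass it to the same function.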

Section 08

Transformer-Based POS Tagging with HuggingFace

For maximum accuracy, especially on noisy or domain-specific text, use a fine-tuned Transformer model. HuggingFace's pipeline API makes this a three-liner.

# Install: pip install transformers torch
from transformers import pipeline

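# Load a model fine-tuned for token classification (POS tagging)
# Note: this is an uncased model, so returned words typically come back lower-cased.
# With any aggregation_strategy other than "none", adjacent words that share
# a tag may also be merged into one group (e.g. "interest rates").
pos_pipeline = pipeline(
    "token-classification",
    model="vblagoje/bert-english-uncased-finetuned-pos",
    aggregation_strategy="simple"
)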

text = "The central bank raised interest rates by 0.25 percentage points"

results = pos_pipeline(text)

print(f"{'Word':20s} {'POS Tag':10s} {'Score'}")
print("-" * 42)
for item in results:
    print(f"{item['word']:20s} {item['entity_group']:10s} {item['score']:.4f}")

OUTPUT

Word                 POS Tag    Score
------------------------------------------
The                  DET        0.9997
central              ADJ        0.9981
bank                 NOUN       0.9963
raised               VERB       0.9978
interest             NOUN       0.9874
rates                NOUN       0.9961
by                   ADP        0.9994
0.25                 NUM        0.9988
percentage           NOUN       0.9921
points               NOUN       0.9955

⚠️

Transformer Tradeoff — Accuracy vs Speed

A BERT-based tagger can be one to two orders of magnitude slower than spaCy's small pipeline on CPU. For real-time applications processing millions of documents, stick with spaCy's optimised pipeline. Use Transformers when accuracy matters most (legal documents, medical records, financial filings) and latency is acceptable. On a GPU, the speed gap narrows considerably.
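Speed figures like these depend heavily on hardware and text length, so it is worth measuring on your own machine. The sketch below is one possible timing harness: `measure_throughput` is a hypothetical helper, and `dummy_tagger` is only a stand-in so the example runs without any model installed. Swap in your own callable (for example a function wrapping a spaCy or Transformers pipeline) to compare real taggers.

```python
import time


def measure_throughput(tag_fn, docs, repeats=3):
    """Return documents per second, using the fastest of several runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for doc in docs:
            tag_fn(doc)
        best = min(best, time.perf_counter() - start)
    return len(docs) / max(best, 1e-9)


def dummy_tagger(text):
    # Stand-in for a real tagger: labels every whitespace token as X.
    return [(tok, "X") for tok in text.split()]


docs = ["The central bank raised interest rates"] * 1000
docs_per_second = measure_throughput(dummy_tagger, docs)
print(f"{docs_per_second:,.0f} docs/sec")
```

Taking the fastest of several runs reduces noise from caching and background processes. For a fair comparison, feed both taggers the same documents.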

Section 09

Real-World Applications of POS Tagging

📰

Keyword Extraction

Extract nouns + noun phrases as candidate keywords from news articles, research papers, and blog posts. Filter by NOUN, PROPN, and NP chunks.

😊

Sentiment Analysis

Adjectives carry sentiment signals. POS tags let models focus on ADJ and ADV tokens for opinion mining in product reviews and social media.

🔍

Named Entity Recognition

NER models use POS context as a feature. PROPN tokens are strong signals for Person, Organisation, and Location entities.

🌍

Machine Translation

Translation requires correct word order restructuring. POS tags guide reordering models, especially between Subject-Verb-Object and Subject-Object-Verb languages.

📖

Text Normalisation

Lemmatisation needs POS. The word "saw" lemmatises to "see" (VERB) but stays "saw" (NOUN, a cutting tool). Wrong POS = wrong lemma.

🤖

Chatbot / QA Systems

Understanding user intent. In "Book a flight" (VERB + NP), "book" is a verb and the utterance is a command; in "flight book" (a NOUN compound), it is a noun. POS tags help the system tell commands from mentions.
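As one concrete case, the keyword-extraction idea above can be sketched over already-tagged tokens. `candidate_keywords` is a hypothetical helper and the tagged sentence is hand-written; the pattern used (runs of ADJ, NOUN, and PROPN tokens that end in a noun) is a common heuristic, not the only reasonable choice.

```python
def candidate_keywords(tagged):
    """Collect maximal runs of ADJ/NOUN/PROPN tokens that end in a noun."""
    phrases, current = [], []

    def flush():
        # Trim trailing adjectives so each phrase ends in NOUN or PROPN
        while current and current[-1][1] == "ADJ":
            current.pop()
        if current:
            phrases.append(" ".join(word for word, _ in current))
        current.clear()

    for word, tag in tagged:
        if tag in {"ADJ", "NOUN", "PROPN"}:
            current.append((word, tag))
        else:
            flush()
    flush()
    return phrases


# Hand-tagged example sentence (not model output).
TAGGED = [
    ("The", "DET"), ("central", "ADJ"), ("bank", "NOUN"), ("raised", "VERB"),
    ("interest", "NOUN"), ("rates", "NOUN"), ("sharply", "ADV"),
]
keywords = candidate_keywords(TAGGED)
print(keywords)
```

With a spaCy document you would pass `[(t.text, t.pos_) for t in doc]`, or use `doc.noun_chunks` directly when the dependency parser is enabled.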

Section 10

End-to-End Pipeline — From Raw Text to Structured Features

Here is a complete, production-style pipeline: raw text → tokenisation → POS tagging → lemmatisation → stopword removal → feature extraction. This is the core of most text classification and information extraction systems.

import spacy
from collections import Counter
import pandas as pd

nlp = spacy.load("en_core_web_sm")

def analyse_text(text: str) -> dict:
    """Full NLP analysis pipeline."""
    doc = nlp(text)

    # ── Token table ─────────────────────────────
    rows = []
    for token in doc:
        if not token.is_space:
            rows.append({
                "token"  : token.text,
                "lemma"  : token.lemma_,
                "pos"    : token.pos_,
                "tag"    : token.tag_,
                "is_stop": token.is_stop,
                "dep"    : token.dep_,
            })

    df = pd.DataFrame(rows)

    # ── Content words (no stopwords, no punct) ──
    content_tokens = [
        t.lemma_.lower() for t in doc
        if not t.is_stop
        and not t.is_punct
        and not t.is_space
        and t.pos_ in {"NOUN", "VERB", "ADJ", "ADV", "PROPN"}
    ]

    # ── Noun chunks (multi-word NPs) ─────────────
    noun_phrases = [chunk.text for chunk in doc.noun_chunks]

    # ── POS distribution ─────────────────────────
    pos_dist = Counter(t.pos_ for t in doc if t.pos_ != "PUNCT")

    return {
        "token_table"  : df,
        "content_words": content_tokens,
        "noun_phrases" : noun_phrases,
        "pos_dist"     : pos_dist,
    }

# ── Run the pipeline ─────────────────────────────
sample = """
Researchers at MIT developed a new machine learning algorithm
that significantly reduces training time for large language models.
The breakthrough could accelerate AI development worldwide.
"""

result = analyse_text(sample.strip())

print("=== TOKEN TABLE ===")
print(result["token_table"].to_string(index=False))

print("\n=== CONTENT WORDS (lemmatised) ===")
print(result["content_words"])

print("\n=== NOUN PHRASES ===")
print(result["noun_phrases"])

print("\n=== POS DISTRIBUTION ===")
for pos, count in result["pos_dist"].most_common():
    bar = "█" * count
    print(f"  {pos:8s}: {bar} ({count})")

OUTPUT

=== TOKEN TABLE ===
token        lemma       pos    tag  is_stop  dep
Researchers  researcher  NOUN   NNS  False    nsubj
at           at          ADP    IN   True     prep
MIT          MIT         PROPN  NNP  False    pobj
developed    develop     VERB   VBD  False    ROOT
a            a           DET    DT   True     det
new          new         ADJ    JJ   False    amod
machine      machine     NOUN   NN   False    compound
learning     learn       VERB   VBG  False    compound
algorithm    algorithm   NOUN   NN   False    dobj
...

=== CONTENT WORDS (lemmatised) ===
['researcher', 'mit', 'develop', 'new', 'machine', 'learn', 'algorithm', 'significantly', 'reduce', 'train', 'large', 'language', 'model', 'breakthrough', 'accelerate', 'ai', 'development']

=== NOUN PHRASES ===
['Researchers', 'MIT', 'a new machine learning algorithm', 'training time', 'large language models', 'The breakthrough', 'AI development']

=== POS DISTRIBUTION ===
  NOUN    : ██████████ (10)
  VERB    : ████████ (8)
  ADJ     : ████ (4)
  DET     : ████ (4)
  ADP     : ███ (3)
  ADV     : ██ (2)
  PROPN   : ██ (2)

Section 11

Building POS-Based Features for Machine Learning

Raw POS tags are not directly useful to ML models — you need to convert them to numeric features. Here are three powerful feature engineering strategies.

📊

Strategy 1 — POS Ratios

Fast, Lightweight

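Compute the fraction of each POS in a document. Legal documents tend to have a high NOUN ratio, instructions a high VERB ratio, and poetry a high ADJ ratio. With the full Universal tag set this gives 17 numbers per document (the code in this section uses an 8-tag subset), and they are useful document-type signals.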

🔗

Strategy 2 — POS Bigrams

Captures Grammar Patterns

Count consecutive POS pairs: DET→ADJ, ADJ→NOUN, NOUN→VERB. These bigrams capture phrase structure patterns invisible to bag-of-words. Useful for style detection and readability scoring.

🎯

Strategy 3 — POS-Filtered TF-IDF

Content-Focused Representation

Apply TF-IDF only to NOUN, PROPN, VERB, ADJ, and ADV lemmas. This removes function words that carry little semantic content, producing a smaller, more informative document-term matrix.

import spacy
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

nlp = spacy.load("en_core_web_sm")

def pos_ratio_features(texts):
    """Strategy 1 — compute POS ratio vector for each document."""
    pos_tags = ["NOUN","VERB","ADJ","ADV","PROPN","DET","ADP","PRON"]
    features = []
    for doc in nlp.pipe(texts, batch_size=32):
        total = len([t for t in doc if not t.is_punct])
        vec = [
            sum(1 for t in doc if t.pos_ == tag) / (max(total, 1))
            for tag in pos_tags
        ]
        features.append(vec)
    return np.array(features)

def pos_filtered_lemmas(text: str) -> str:
    """Strategy 3 — keep only content-word lemmas for TF-IDF."""
    doc = nlp(text)
    return " ".join(
        t.lemma_.lower() for t in doc
        if t.pos_ in {"NOUN", "VERB", "ADJ", "ADV", "PROPN"}
        and not t.is_stop
    )

# Example corpus
texts = [
    "The neural network achieved state of the art results on benchmarks",
    "Mix flour butter sugar and bake at 180 degrees for thirty minutes",
    "The defendant was found guilty and sentenced to five years",
]
labels = [0, 1, 2]  # tech / recipe / legal

# Strategy 1 — POS ratios
X_ratio = pos_ratio_features(texts)
print("POS ratio feature shape:", X_ratio.shape)

# Strategy 3 — POS-filtered TF-IDF
filtered_texts = [pos_filtered_lemmas(t) for t in texts]
print("Filtered text[0]:", filtered_texts[0])
print("Filtered text[1]:", filtered_texts[1])

OUTPUT

POS ratio feature shape: (3, 8)
Filtered text[0]: neural network achieve state art result benchmark
Filtered text[1]: mix flour butter sugar bake degree minute
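Strategy 2 (POS bigrams) can be sketched with the standard library alone. The helpers `pos_bigram_counts` and `bigram_vector` are hypothetical names, and the tag sequence is written by hand; in practice it would come from something like `[t.pos_ for t in doc]`.

```python
from collections import Counter


def pos_bigram_counts(tags):
    """Count each pair of consecutive POS tags."""
    return Counter(zip(tags, tags[1:]))


def bigram_vector(tags, vocabulary):
    """Relative frequency of each bigram in a fixed vocabulary order."""
    counts = pos_bigram_counts(tags)
    total = max(sum(counts.values()), 1)  # avoid dividing by zero on short inputs
    return [counts[bigram] / total for bigram in vocabulary]


# Hand-written tag sequence for illustration
tags = ["DET", "ADJ", "NOUN", "VERB", "DET", "ADJ", "NOUN"]
counts = pos_bigram_counts(tags)

# A fixed vocabulary keeps every document's vector the same length
vocabulary = [("DET", "ADJ"), ("ADJ", "NOUN"), ("NOUN", "VERB"), ("VERB", "DET")]
vector = bigram_vector(tags, vocabulary)

print(counts.most_common())
print(vector)
```

Fixing the bigram vocabulary up front (for example, the most frequent pairs in the training set) is what makes the vectors comparable across documents and usable as classifier input.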

Section 12

Evaluating POS Tagger Performance

Accuracy is the standard metric for POS tagging — but you need to understand where errors happen to improve your pipeline.

import spacy
from sklearn.metrics import classification_report, confusion_matrix

nlp = spacy.load("en_core_web_sm")

# Gold standard — manually annotated sentence
gold = [
    ("The",    "DET"),
    ("bank",   "NOUN"),  # financial institution context
    ("can",    "AUX"),
    ("bank",   "VERB"),  # "can bank on it" context
    ("on",     "ADP"),
    ("rising", "VERB"),
    ("rates",  "NOUN"),
]

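sentence = " ".join(w for w, _ in gold)
doc = nlp(sentence)

# zip() below silently truncates, so confirm spaCy's tokenisation matches the gold tokens
assert len(doc) == len(gold), "tokenisation mismatch between spaCy and gold data"

predicted_pos = [token.pos_ for token in doc]
true_pos      = [tag for _, tag in gold]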

# Per-token comparison
print(f"{'Token':10s} {'True':8s} {'Predicted':10s} {'Match'}")
print("-" * 42)
for (word, true), pred in zip(gold, predicted_pos):
    match = "✓" if true == pred else "✗"
    print(f"{word:10s} {true:8s} {pred:10s} {match}")

accuracy = sum(t == p for t, p in zip(true_pos, predicted_pos)) / len(true_pos)
print(f"\nToken Accuracy: {accuracy:.2%}")

# Full report
print("\nClassification Report:")
print(classification_report(true_pos, predicted_pos, zero_division=0))

OUTPUT

Token      True     Predicted  Match
------------------------------------------
The        DET      DET        ✓
bank       NOUN     NOUN       ✓
can        AUX      AUX        ✓
bank       VERB     NOUN       ✗  ← ambiguity error
on         ADP      ADP        ✓
rising     VERB     VERB       ✓
rates      NOUN     NOUN       ✓

Token Accuracy: 85.71%

Classification Report:
              precision    recall  f1-score   support
ADP                1.00      1.00      1.00         1
AUX                1.00      1.00      1.00         1
DET                1.00      1.00      1.00         1
NOUN               0.67      1.00      0.80         2
VERB               1.00      0.50      0.67         2
accuracy                               0.86         7

🔬

The "bank" Error — Understanding Tagger Failures

The second "bank" (a verb meaning "to rely on") is mistagged as NOUN. This happens because "bank" is far more often a noun than a verb in training corpora, and in the unusual sentence "The bank can bank on rising rates" that lexical prior outweighs the syntactic cue from the modal "can". Real-world errors concentrate on: rare words, domain-specific jargon, repeated ambiguous words in short sentences, and social media text (abbreviations, hashtags, emoji).
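To find out where errors concentrate on a larger gold set, it helps to count which tag gets confused with which. The sketch below does this with a `Counter`. `confusion_pairs` is a hypothetical helper, and the two tag lists are copied from the table above, not re-run through a model.

```python
from collections import Counter


def confusion_pairs(true_tags, predicted_tags):
    """Count (true, predicted) pairs for every mistagged token."""
    if len(true_tags) != len(predicted_tags):
        raise ValueError("tag sequences must have the same length")
    return Counter(
        (true, pred) for true, pred in zip(true_tags, predicted_tags) if true != pred
    )


# Tag sequences copied from the comparison table above
true_tags = ["DET", "NOUN", "AUX", "VERB", "ADP", "VERB", "NOUN"]
pred_tags = ["DET", "NOUN", "AUX", "NOUN", "ADP", "VERB", "NOUN"]

errors = confusion_pairs(true_tags, pred_tags)
for (true, pred), n in errors.most_common():
    print(f"{true} mistagged as {pred}: {n}")
```

On a few hundred annotated sentences, the top entries of this counter usually point straight at the confusions worth fixing first, such as VERB/NOUN or ADJ/VERB on participles.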

Section 13

Choosing Your POS Tagger — Side-by-Side Comparison

Library	Algorithm	Tag Set	Accuracy	Speed (CPU)	Best For
NLTK	Averaged Perceptron	Penn Treebank	~96%	Fast	Learning, quick scripts
spaCy (sm)	CNN	Universal + Penn	~97%	Very Fast	Production pipelines
spaCy (trf)	Transformer	Universal + Penn	~98.5%	Slow on CPU	High-accuracy production
HuggingFace BERT	BERT Fine-tuned	Universal	~98–99%	Slow on CPU	Research, max accuracy
Stanza (Stanford)	BiLSTM-CRF	Universal	~98%	Moderate	Multilingual, 66+ languages
Flair	Contextual String Embeddings	Penn Treebank	~97.5%	Moderate	Custom domain fine-tuning

Section 14

Golden Rules for POS Tagging in Production

🌿 POS Tagging — Non-Negotiable Rules

Always tokenise before tagging. POS taggers expect pre-tokenised input. Passing a raw string to nltk.pos_tag() makes it iterate over single characters, and skipping tokenisation in general produces garbage results. Use nltk.word_tokenize() or let spaCy handle it internally.

Process documents, not words in isolation. The word "flies" cannot be tagged correctly without its neighbours. Always pass the full sentence — or at minimum the full clause — to the tagger. Single-word POS lookup is almost always wrong.

Use nlp.pipe() for batch processing. Calling nlp(text) in a loop can be several times slower than nlp.pipe(texts, batch_size=32), depending on the pipeline and hardware. For any corpus above roughly 1,000 documents, batching is well worth it.

Match your tagger to your domain. General models trained on news corpora perform poorly on medical notes, social media, or legal text. If your accuracy is below 94%, fine-tune on domain-specific annotated data before deploying.

Use lemmas, not raw tokens, downstream. After POS tagging, always lemmatise before feeding to TF-IDF or word counts. "running", "ran", and "runs" should all map to "run" — but only if you know they are all verbs. POS enables correct lemmatisation.

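Disable unused pipeline components for speed. If you only need POS tags, load the pipeline without NER and the parser: spacy.load("en_core_web_sm", disable=["ner", "parser"]). This can noticeably raise throughput on large corpora. Keep in mind that doc.noun_chunks needs the parser.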

Evaluate on your own data, not just benchmark scores. A tagger reporting 98% on Penn Treebank may drop to 89% on your biomedical dataset. Always create a small gold-standard set in your domain and measure before trusting any library.