
Chunking & Phrase Structure

A comprehensive, code-first tutorial on NLP chunking and phrase structure. Covers what chunking is and where it sits in the NLP pipeline, the three core chunk types (NP, VP, PP), IOB/BIOES encoding, rule-based chunking with NLTK's RegexpParser

Section 01

The Story That Makes Chunking Click

The Newspaper Editor and the Highlighter
Imagine you hand a seasoned newspaper editor a fresh article and ask: "Who and what is this story about?" The editor doesn't read every word with equal weight. Instead, they uncap a yellow highlighter and swipe across meaningful chunks — "the Prime Minister," "a controversial new bill," "the opposition party," "early morning press conference." They skip the glue words — the, a, of, in — and group the important ones.

In seconds, they've produced a summary skeleton: a list of highlighted phrases that carry the story's meaning. They haven't diagrammed every grammatical relationship — they've just grouped words into meaningful bundles.

That is precisely what chunking does in NLP. It is the art of identifying and extracting flat, non-overlapping phrases — noun groups, verb groups, prepositional groups — from raw text, without the full complexity of a constituency or dependency parse. Fast. Practical. The editor's highlighter for machines.

Chunking sits between two worlds: it is more powerful than simple tokenization and POS tagging, but lighter and faster than full syntactic parsing. This balance makes it one of the most practically useful tools in the NLP practitioner's toolkit.

🌐
What Is Chunking?

Chunking (also called shallow parsing or partial parsing) is the task of identifying and grouping sequences of tokens into syntactically correlated chunks — most commonly Noun Phrases (NP), Verb Phrases (VP), and Prepositional Phrases (PP). Unlike full parsing, chunks are flat (non-recursive, non-overlapping) and are defined by surface patterns over Part-of-Speech tags, making them extremely fast to compute.


Section 02

Where Chunking Lives in the NLP Pipeline

Chunking is a mid-level step. It depends on the outputs of earlier stages and feeds into downstream tasks like named entity recognition, relation extraction, and information retrieval.

01
Raw Text
"The fast red car overtook a slow blue truck on the motorway."
Unprocessed string — no structure yet.
02
Tokenization
Split into individual tokens: ["The", "fast", "red", "car", "overtook", ...]
03
POS Tagging
Tag each token: The/DT fast/JJ red/JJ car/NN overtook/VBD a/DT slow/JJ blue/JJ truck/NN ...
04
★ Chunking / Shallow Parsing
Group POS-tagged tokens into phrases:
[NP The fast red car] [VP overtook] [NP a slow blue truck] [PP on the motorway]
05
Named Entity Recognition (NER)
Classify chunks as PERSON, ORG, LOCATION, etc. NPs are the primary candidates for named entities.
06
Downstream Tasks
Relation extraction, information retrieval, question answering, sentiment analysis, summarization.

Section 03

The Three Core Chunk Types

While chunking can target any phrase type, three dominate practical NLP applications. Each has a characteristic POS tag pattern that a chunker can learn or be programmed to recognize.

📄
NP — Noun Phrase Chunk
The most important and most commonly extracted chunk type. Groups a noun with its preceding determiners and adjectives. Represents the who and what of a sentence.

Pattern: DT? JJ* NN+
Example: "a brilliant young scientist", "the new government policy", "three ancient stone temples"
det? adj* noun+
VP — Verb Phrase Chunk
Groups a main verb with its auxiliaries and modals. Represents the action or state of the sentence.

Pattern: MD? (VB.*|RB)* VB.* (VB.* stands for any verb tag; an adverb may sit between the auxiliaries and the main verb)
Example: "has been running", "will have completed", "was quickly overtaken"
modal? aux* main-verb
📍
PP — Prepositional Phrase Chunk
Groups a preposition with the noun phrase it introduces. Represents location, time, direction, or manner.

Pattern: IN NP
Example: "on the table", "in the morning", "across the busy motorway"
preposition + NP

Penn Treebank POS Tags You Must Know for Chunking

| Tag | Part of Speech | Example Words | Role in Chunking |
|---|---|---|---|
| DT | Determiner | the, a, an, this, every | NP opener: signals start of noun phrase |
| JJ | Adjective | quick, brilliant, heavy | NP filler: modifies the head noun |
| JJR / JJS | Comparative / Superlative Adj | faster, brightest | NP filler variant |
| NN / NNS | Noun (singular / plural) | car, scientists | NP head: the core of the chunk |
| NNP / NNPS | Proper Noun (singular / plural) | Alice, United Nations | NP head for named entities |
| VB / VBD / VBG | Verb base / past / gerund | run, ran, running | VP head or filler |
| VBN / VBP / VBZ | Verb past-part / non-3rd / 3rd | seen, see, sees | VP head or filler |
| MD | Modal | will, would, can, must | VP opener |
| RB / RBR / RBS | Adverb / Comparative / Superlative | quickly, faster | VP or ADVP filler |
| IN | Preposition / Subordinating conjunction | on, in, over, because | PP opener |
| CD | Cardinal Number | three, 42, 1.5 | NP filler (quantifier) |
| PRP / PRP$ | Personal / Possessive Pronoun | he, she, their | Standalone NP |

Section 04

IOB Encoding — How Chunkers Represent Phrases

Painting a Fence — Marking the Start, Middle, and Outside
Imagine painting sections of a very long fence. Each plank is a word. When you start a new section (a new chunk), you mark the first plank with a big B (Begin). Every plank inside the same section gets an I (Inside). Any plank that belongs to no section at all gets an O (Outside).

This is IOB tagging, the standard encoding for chunked text. It turns the grouping problem into a sequence labelling problem, the kind of task that statistical models such as CRFs and neural networks handle well.

IOB Example — Sentence with NP and VP Chunks

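| Token | POS Tag | IOB Tag | Chunk |
|---|---|---|---|
| The | DT | B-NP | NP: "The fast red car" |
| fast | JJ | I-NP | |
| red | JJ | I-NP | |
| car | NN | I-NP | |
| has | VBZ | B-VP | VP: "has overtaken" |
| overtaken | VBN | I-VP | |
| a | DT | B-NP | NP: "a slow blue truck" |
| slow | JJ | I-NP | |
| blue | JJ | I-NP | |
| truck | NN | I-NP | |
| on | IN | O | Outside any chunk |
| the | DT | B-NP | NP: "the motorway" |
| motorway | NN | I-NP | |
| . | . | O | Outside any chunk |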
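
Going the other way, from IOB tags back to chunks, takes a single pass over the labels. The sketch below is illustrative only: `iob_to_spans` is a hypothetical helper (not an NLTK or spaCy function), and it assumes IOB2-style input, treating a stray I- tag as the start of a new chunk.

```python
from typing import List, Tuple


def iob_to_spans(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Decode IOB2 tags into (label, start, end) spans; end is exclusive."""
    spans = []
    start, label = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel "O" closes a trailing chunk
        prefix, _, tag_label = tag.partition("-")
        continues = prefix == "I" and tag_label == label
        if start is not None and not continues:
            spans.append((label, start, i))
            start, label = None, None
        if prefix == "B" or (prefix == "I" and start is None):
            # Assumption: a stray I- with no open chunk starts a new chunk.
            start, label = i, tag_label
    return spans


tokens = ["The", "fast", "red", "car", "has", "overtaken", "a", "slow",
          "blue", "truck", "on", "the", "motorway", "."]
tags = ["B-NP", "I-NP", "I-NP", "I-NP", "B-VP", "I-VP", "B-NP", "I-NP",
        "I-NP", "I-NP", "O", "B-NP", "I-NP", "O"]

for label, start, end in iob_to_spans(tags):
    print(f"{label}: {' '.join(tokens[start:end])}")
```

Returning token offsets rather than strings keeps the result usable for span-level evaluation, where boundaries must match exactly.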
💡
BIOES — A More Precise Alternative to IOB

Some systems use BIOES encoding: Begin, Inside, Outside, End, Singleton (a one-token chunk). For example, a single-word NP like "Alice" gets S-NP rather than a lone B-NP, and the last token of a longer chunk gets E- instead of I-. The explicit boundary labels give sequence models a richer signal and have been reported to improve F1 slightly in some setups, though the gain is not guaranteed. NLTK's IOB output follows the IOB2 convention, in which every chunk starts with B- and an I- tag never opens a chunk.
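
The relationship between the two schemes can be sketched as a small conversion. This is a minimal illustration that assumes well-formed IOB2 input; `iob2_to_bioes` is a hypothetical helper name, not a library function.

```python
def iob2_to_bioes(tags):
    """Convert a well-formed IOB2 tag sequence to BIOES."""
    bioes = []
    for i, tag in enumerate(tags):
        if tag == "O":
            bioes.append(tag)
            continue
        prefix, label = tag.split("-", 1)
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        chunk_continues = next_tag == f"I-{label}"
        if prefix == "B":
            # A B- with nothing after it in the same chunk is a singleton.
            bioes.append(f"B-{label}" if chunk_continues else f"S-{label}")
        else:
            # The last I- of a chunk becomes E-.
            bioes.append(f"I-{label}" if chunk_continues else f"E-{label}")
    return bioes


iob = ["B-NP", "B-VP", "I-VP", "B-NP", "I-NP", "I-NP", "O"]
print(iob2_to_bioes(iob))
```

The chunk boundaries are identical in both encodings; BIOES only makes the end of each chunk explicit in the label itself.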


Section 05

Rule-Based Chunking with NLTK RegexpParser

The simplest chunker uses handcrafted regular expressions over POS tags. NLTK's RegexpParser lets you define grammar rules that look exactly like the phrase structure rules a linguist would write — readable, transparent, and easy to debug.

📚
RegexpParser Grammar Syntax — Quick Reference

{…} — Define what TO include in the chunk (chunk rule)
}<TAG>{ — Define what to EXCLUDE / chink from an existing chunk
<DT>? — Optional tag (zero or one)
<JJ>* — Zero or more adjectives
<JJ>+ — One or more adjectives
<NN.*> — Any tag beginning with NN (NN, NNS, NNP, NNPS)
<VB.*> — Any verb tag (VB, VBD, VBG, VBN, VBP, VBZ)
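
To see why these tag patterns behave like ordinary regular expressions, the sketch below applies the NP pattern to a tag sequence using only Python's `re` module. It is an illustration of the idea, not NLTK's actual implementation; `np_chunks` is a hypothetical helper and the sentence is tagged by hand.

```python
import re


def np_chunks(tagged):
    """Find NP chunks matching <DT>?<JJ.*>*<NN.*>+ in a list of (word, tag) pairs."""
    # Encode the tag sequence as "<DT><JJ><NN>..." so one regex can scan it.
    tag_string = "".join(f"<{tag}>" for _, tag in tagged)
    pattern = re.compile(r"(?:<DT>)?(?:<JJ[^<>]*>)*(?:<NN[^<>]*>)+")
    chunks = []
    for match in pattern.finditer(tag_string):
        # Map character offsets back to token indices by counting "<".
        start = tag_string.count("<", 0, match.start())
        end = start + match.group().count("<")
        chunks.append(" ".join(word for word, _ in tagged[start:end]))
    return chunks


# Hand-tagged input, so no tagger model is needed for this illustration.
tagged = [("The", "DT"), ("brilliant", "JJ"), ("young", "JJ"),
          ("researcher", "NN"), ("published", "VBD"), ("a", "DT"),
          ("groundbreaking", "JJ"), ("paper", "NN"), ("on", "IN"),
          ("AI", "NNP"), ("safety", "NN"), (".", ".")]

print(np_chunks(tagged))
```

Inside a tag pattern, `.*` is restricted to a single tag, which is why the sketch uses `[^<>]*` rather than a bare `.*`.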

Step 1 — Basic NP Chunker

import nltk
from nltk import RegexpParser, pos_tag, word_tokenize

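# Resource names vary by NLTK version: newer releases look for "punkt_tab" and
# "averaged_perceptron_tagger_eng" instead of the older names.
for resource in ("punkt", "punkt_tab",
                 "averaged_perceptron_tagger", "averaged_perceptron_tagger_eng"):
    nltk.download(resource, quiet=True)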

# ── Define a Noun Phrase grammar ──────────────────────────────
# DT? JJ* NN+  →  optional determiner, zero-or-more adjectives, one-or-more nouns
np_grammar = r"""
  NP: {<DT>?<JJ.*>*<NN.*>+}
"""

np_parser = RegexpParser(np_grammar)

# ── Tokenize and POS-tag a sentence ──────────────────────────
sentence = "The brilliant young researcher published a groundbreaking paper on AI safety."
tokens   = word_tokenize(sentence)
pos_tags = pos_tag(tokens)

print("POS Tags:")
print(pos_tags)
print()

# ── Parse into chunks ─────────────────────────────────────────
tree = np_parser.parse(pos_tags)
print("Chunk Tree:")
tree.pretty_print()

# ── Extract only the NP chunks ────────────────────────────────
print("\nExtracted Noun Phrases:")
for subtree in tree.subtrees(filter=lambda t: t.label() == "NP"):
    phrase = " ".join(word for word, tag in subtree.leaves())
    print(f"  NP → '{phrase}'")
OUTPUT
POS Tags:
[('The', 'DT'), ('brilliant', 'JJ'), ('young', 'JJ'), ('researcher', 'NN'), ('published', 'VBD'), ('a', 'DT'), ('groundbreaking', 'JJ'), ('paper', 'NN'), ('on', 'IN'), ('AI', 'NNP'), ('safety', 'NN'), ('.', '.')]

Chunk Tree:
                          S
              ____________|____________
             NP           |           NP
        _____|_____       |       ____|____
  DT    JJ    JJ    NN   VBD   DT   JJ    NN    IN   NNP   NN
  |     |     |     |     |    |    |     |     |     |    |
 The  bril.. young res.. pub..  a  grd.. paper  on    AI  safety

Extracted Noun Phrases:
  NP → 'The brilliant young researcher'
  NP → 'a groundbreaking paper'
  NP → 'AI safety'

Step 2 — Multi-Rule Grammar: NP + VP + PP

# ── Full shallow parse grammar ────────────────────────────────
full_grammar = r"""
  NP:  {<DT>?<CD>?<JJ.*>*<NN.*>+}   # Noun Phrase, with an optional number: "three important reports"
       {<PRP>}                        # Pronoun as standalone NP
  VP:  {<MD>?<RB>?<VB.*>+}          # Verb Phrase: optional modal + optional adverb + verb(s)
  PP:  {<IN|TO><NP>}                 # Prepositional Phrase: preposition (or "to", tagged TO) + NP
"""
# Note: rules run in order, so the number must be part of the first NP rule.
# A separate {<CD><NN.*>+} rule placed after it would never fire, because the
# nouns have already been chunked by the time it runs.

full_parser = RegexpParser(full_grammar)

sentences = [
    "She will quickly send three important reports to the committee.",
    "The old professor has been teaching quantum mechanics at MIT for thirty years.",
]

for sent in sentences:
    tokens   = word_tokenize(sent)
    pos_tags = pos_tag(tokens)
    tree     = full_parser.parse(pos_tags)

    print(f"\nSentence: {sent}")
    print("-" * 60)
    # subtrees() walks the tree top-down, so each PP is listed first,
    # followed by the NP nested inside it.
    for subtree in tree.subtrees(filter=lambda t: t.label() in ("NP", "VP", "PP")):
        phrase = " ".join(w for w, _ in subtree.leaves())
        print(f"  {subtree.label():4} → '{phrase}'")
OUTPUT
Sentence: She will quickly send three important reports to the committee.
------------------------------------------------------------
  NP   → 'She'
  VP   → 'will quickly send'
  NP   → 'three important reports'
  PP   → 'to the committee'
  NP   → 'the committee'

Sentence: The old professor has been teaching quantum mechanics at MIT for thirty years.
------------------------------------------------------------
  NP   → 'The old professor'
  VP   → 'has been teaching'
  NP   → 'quantum mechanics'
  PP   → 'at MIT'
  NP   → 'MIT'
  PP   → 'for thirty years'
  NP   → 'thirty years'

Step 3 — Chinking: Removing Words From Chunks

# Chinking = punching holes in chunks to EXCLUDE certain tags
# Use }<TAG>{ syntax to define what to remove from inside a chunk

chink_grammar = r"""
  NP:  {<.*>+}              # Chunk everything
       }<VBD|VBZ|IN|\.>{    # Then chink (remove) verbs, prepositions, and the final period
"""

chink_parser = RegexpParser(chink_grammar)
sentence  = "The quick fox jumps over the lazy dog near the old river."
tokens    = word_tokenize(sentence)
pos_tags  = pos_tag(tokens)
tree      = chink_parser.parse(pos_tags)

print("After chinking (verbs and prepositions removed from chunks):")
for subtree in tree.subtrees(filter=lambda t: t.label() == "NP"):
    phrase = " ".join(w for w, _ in subtree.leaves())
    print(f"  NP → '{phrase}'")
OUTPUT
After chinking (verbs and prepositions removed from chunks):
  NP → 'The quick fox'
  NP → 'the lazy dog'
  NP → 'the old river'
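
Conceptually, chinking amounts to splitting one big chunk wherever an excluded tag appears. The sketch below shows that idea with the standard library only; `chink` is a hypothetical helper, the tags are assigned by hand, and NLTK's own implementation works differently.

```python
from itertools import groupby


def chink(tagged, chink_tags):
    """Chunk everything, then split wherever a token's tag is in chink_tags."""
    chunks = []
    # groupby yields alternating runs of kept tokens and chinked tokens.
    for is_chink, group in groupby(tagged, key=lambda pair: pair[1] in chink_tags):
        if not is_chink:
            chunks.append(" ".join(word for word, _ in group))
    return chunks


# Hand-tagged input; a real tagger may label some tokens differently.
tagged = [("The", "DT"), ("quick", "JJ"), ("fox", "NN"), ("jumps", "VBZ"),
          ("over", "IN"), ("the", "DT"), ("lazy", "JJ"), ("dog", "NN"),
          ("near", "IN"), ("the", "DT"), ("old", "JJ"), ("river", "NN"),
          (".", ".")]

print(chink(tagged, {"VBD", "VBZ", "IN", "."}))
```

If the period is left out of the excluded set, it stays attached to the last chunk, which is an easy detail to miss when writing chink rules.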

Section 06

Statistical NP Chunking with spaCy

spaCy's noun_chunks property extracts base noun phrases from the dependency parse produced by its statistical parser. On real-world text it is usually more accurate than regex rules because the parse reflects sentence-level context, not just local POS patterns.

🌟
spaCy noun_chunks vs. NLTK RegexpParser

spaCy's noun_chunks are derived from the dependency tree, not from surface POS patterns, so they tend to follow linguistic boundaries more closely, for example around possessives ("the company's CEO"). They are still flat base NPs: coordinated nouns ("cats and dogs") come out as separate chunks, and embedded clauses are not included. For most production NLP work, prefer spaCy over NLTK regex chunking. Use NLTK regex chunking when you need a transparent, auditable rule set or are working without a trained model.

import spacy

nlp = spacy.load("en_core_web_sm")

text = """
The World Health Organization announced new guidelines on Monday.
A team of brilliant researchers from Oxford University has developed
a promising vaccine candidate for the tropical disease.
Global markets reacted positively to the unexpected news.
"""

doc = nlp(text.strip())

print(f"{'Noun Chunk':40} {'Root':15} {'Root Dep':12} {'Root Head'}")
print("-" * 80)
for chunk in doc.noun_chunks:
    print(f"{chunk.text:40} {chunk.root.text:15} {chunk.root.dep_:12} {chunk.root.head.text}")
OUTPUT
Noun Chunk                               Root            Root Dep     Root Head
--------------------------------------------------------------------------------
The World Health Organization            Organization    nsubj        announced
new guidelines                           guidelines      dobj         announced
Monday                                   Monday          pobj         on
A team                                   team            nsubj        developed
brilliant researchers                    researchers     pobj         of
Oxford University                        University      pobj         from
a promising vaccine candidate            candidate       dobj         developed
the tropical disease                     disease         pobj         for
Global markets                           markets         nsubj        reacted
the unexpected news                      news            pobj         to

Filtering and Enriching Noun Chunks

from collections import Counter

# ── Filter by chunk role in the sentence ─────────────────────
doc = nlp("The curious cat chased the tiny scared mouse across the dusty old floor.")

subjects = [chunk.text for chunk in doc.noun_chunks
            if chunk.root.dep_ in ("nsubj", "nsubjpass")]
objects   = [chunk.text for chunk in doc.noun_chunks
             if chunk.root.dep_ in ("dobj", "obj", "pobj")]

print("Subject NPs:", subjects)
print("Object NPs: ", objects)

# ── Count most frequent noun chunks across a corpus ───────────
corpus = [
    "The data science team finished the quarterly report.",
    "A data science expert reviewed the quarterly report.",
    "The engineering team deployed the new model.",
    "The new model exceeded expectations across the entire team.",
]

chunk_counts = Counter()
for text in corpus:
    for chunk in nlp(text).noun_chunks:
        chunk_counts[chunk.text.lower()] += 1

print("\nMost frequent noun chunks in corpus:")
for phrase, count in chunk_counts.most_common(6):
    print(f"  {phrase:30} × {count}")
OUTPUT
Subject NPs: ['The curious cat']
Object NPs:  ['the tiny scared mouse', 'the dusty old floor']

Most frequent noun chunks in corpus:
  the quarterly report           × 2
  the new model                  × 2
  the data science team          × 1
  a data science expert          × 1
  the engineering team           × 1
  expectations                   × 1

Section 07

Visualizing Chunk Structure

Understanding chunk structure visually is critical for debugging your chunker and communicating results. Below are three visualization approaches ranging from terminal ASCII to rich SVG diagrams.

📊 ASCII Bracketed Representation — "She will send three urgent reports to the committee"
[NP She]  [VP will send]  [NP three urgent reports]  to  [NP the committee]

Token-by-token IOB labels:
She/PRP → B-NP
will/MD → B-VP
send/VB → I-VP
three/CD → B-NP
urgent/JJ → I-NP
reports/NNS → I-NP
to/IN → O
the/DT → B-NP
committee/NN → I-NP
    


Rendering with NLTK draw() and spaCy displacy

import nltk
from nltk import RegexpParser, pos_tag, word_tokenize
import spacy
from spacy import displacy

# ── NLTK: Draw chunk tree (opens GUI window) ──────────────────
grammar = r"NP: {<DT>?<JJ.*>*<NN.*>+}"
parser  = RegexpParser(grammar)
tokens  = word_tokenize("The brilliant scientist found a new distant planet.")
tags    = pos_tag(tokens)
tree    = parser.parse(tags)

# In a desktop Python session (not Jupyter):
# tree.draw()

# As a pretty-printed text tree:
tree.pretty_print()

# ── spaCy: Visualize with displacy (works in Jupyter) ─────────
nlp = spacy.load("en_core_web_sm")
doc = nlp("The brilliant scientist found a new distant planet.")

# displacy in "dep" style draws the full dependency tree as SVG.
# It does not mark noun chunks; the custom helper further down does that.
svg = displacy.render(doc, style="dep", jupyter=False)

# Save SVG for embedding in a web page
with open("chunk_tree.svg", "w", encoding="utf-8") as f:
    f.write(svg)

# ── Custom HTML span visualizer ───────────────────────────────
from html import escape

def highlight_chunks(doc):
    """Return HTML with noun chunks wrapped in highlighted <mark> spans."""
    # Index chunks by start token so each position is a single dict lookup.
    chunk_at = {chunk.start: chunk for chunk in doc.noun_chunks}
    html_parts = []
    i = 0
    while i < len(doc):
        chunk = chunk_at.get(i)
        if chunk is not None:
            html_parts.append(
                f'<mark style="background:#6366f120;border:1px solid #6366f1;'
                f'border-radius:4px;padding:1px 4px;">{escape(chunk.text)}</mark>'
            )
            # Re-attach the whitespace that followed the chunk's last token.
            html_parts.append(doc[chunk.end - 1].whitespace_)
            i = chunk.end
        else:
            html_parts.append(escape(doc[i].text_with_ws))
            i += 1
    # Tokens already carry their own spacing, so join without a separator.
    return "".join(html_parts)

html_output = highlight_chunks(doc)
print("HTML output:")
print(html_output)
OUTPUT
                      S
        ______________|___________________
       NP             |                  NP
     ___|__________   |               ___|___
  DT    JJ       NN      VBD     DT   JJ    JJ     NN
  |     |        |        |      |    |     |      |
 The brilliant scientist found   a   new distant planet

HTML output (style attributes abbreviated as [mark]):
[mark]The brilliant scientist[/mark] found [mark]a new distant planet[/mark].

Section 08

Statistical Chunking with a CRF Model

Rule-based chunkers break on irregular text. A Conditional Random Field (CRF) learns chunking from annotated data, handling context and exceptions automatically. The sklearn-crfsuite library makes training a CRF chunker straightforward.

Why CRFs Beat Simple Rules for Chunking
A regex rule sees only the local sequence of POS tags: "Is the tag DT? Then an NP may start here." It cannot recover when the right analysis depends on wider context. What about "fast food", where "fast" is an adjective, not an adverb? Or "flying planes can be dangerous", where "flying" may be part of an NP rather than a VP?

A CRF looks at the surrounding context — the previous and next tags, the previous IOB label, even the word itself — and makes a globally optimal decision across the entire sequence. It learns, from thousands of examples, that "flying" before "planes" usually signals an NP, not a VP. Context wins.
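
The phrase "globally optimal decision" refers to decoding the whole label sequence at once, typically with the Viterbi algorithm. The toy sketch below is not a CRF and involves no training: the emission and transition scores are invented for illustration, and `viterbi` is a hypothetical helper. It only shows how a transition constraint (O can never be followed by I-NP) changes the answer compared with picking the best label for each token independently.

```python
import math

LABELS = ["B-NP", "I-NP", "O"]


def viterbi(emissions, transitions):
    """Return the highest-scoring label sequence.

    emissions:   list of {label: score}, one dict per token.
    transitions: {(prev_label, label): score}; missing pairs score 0.
    """
    best = [{lab: emissions[0][lab] for lab in LABELS}]
    back = []
    for em in emissions[1:]:
        scores, pointers = {}, {}
        for lab in LABELS:
            prev = max(LABELS,
                       key=lambda p: best[-1][p] + transitions.get((p, lab), 0.0))
            scores[lab] = best[-1][prev] + transitions.get((prev, lab), 0.0) + em[lab]
            pointers[lab] = prev
        best.append(scores)
        back.append(pointers)
    # Trace back from the best final label.
    label = max(LABELS, key=lambda lab: best[-1][lab])
    path = [label]
    for pointers in reversed(back):
        label = pointers[label]
        path.append(label)
    return path[::-1]


# Invented log-scores for the tokens "flying", "planes", "land".
emissions = [
    {"B-NP": -1.0, "I-NP": -3.0, "O": -0.8},   # "flying": O looks best locally
    {"B-NP": -1.2, "I-NP": -0.5, "O": -3.0},   # "planes": I-NP looks best locally
    {"B-NP": -3.0, "I-NP": -3.0, "O": -0.2},   # "land"
]
transitions = {("O", "I-NP"): -math.inf}       # I-NP may not follow O

greedy = [max(LABELS, key=em.get) for em in emissions]
print("Greedy :", greedy)
print("Viterbi:", viterbi(emissions, transitions))
```

The per-token choice produces an I-NP with no chunk open before it, while the sequence-level decode relabels "flying" as B-NP so the whole path is valid.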
import sklearn_crfsuite
from sklearn_crfsuite import metrics as crf_metrics
import nltk
from nltk.corpus import conll2000

nltk.download("conll2000", quiet=True)

# ── Feature extraction for CRF chunker ───────────────────────
def word_features(sent, i):
    """Features for token at position i in a CoNLL-2000 sentence.

    Each element of `sent` is a (word, pos, iob) triple, so the IOB label
    is unpacked and discarded here; it must never leak into the features.
    """
    word, pos, _ = sent[i]
    features = {
        "word.lower":    word.lower(),
        "word[-3:]":     word[-3:],
        "word[-2:]":     word[-2:],
        "word.isupper":  word.isupper(),
        "word.istitle":  word.istitle(),
        "pos":           pos,
        "pos[:2]":       pos[:2],
    }
    if i > 0:
        prev_word, prev_pos, _ = sent[i - 1]
        features.update({"-1:word.lower": prev_word.lower(), "-1:pos": prev_pos})
    else:
        features["BOS"] = True   # Beginning of sentence

    if i < len(sent) - 1:
        next_word, next_pos, _ = sent[i + 1]
        features.update({"+1:word.lower": next_word.lower(), "+1:pos": next_pos})
    else:
        features["EOS"] = True   # End of sentence

    return features

def sent_to_features(sent):
    return [word_features(sent, i) for i in range(len(sent))]

def sent_to_labels(sent):
    return [iob for _, _, iob in sent]

# ── Load CoNLL-2000 chunking corpus ───────────────────────────
train_sents = conll2000.iob_sents("train.txt")[:7000]   # 7k training sentences
test_sents  = conll2000.iob_sents("test.txt")[:1000]    # 1k test sentences

X_train = [sent_to_features(s) for s in train_sents]
y_train = [sent_to_labels(s)   for s in train_sents]
X_test  = [sent_to_features(s) for s in test_sents]
y_test  = [sent_to_labels(s)   for s in test_sents]

# ── Train the CRF model ───────────────────────────────────────
crf = sklearn_crfsuite.CRF(
    algorithm="lbfgs",
    c1=0.1,      # L1 regularization
    c2=0.1,      # L2 regularization
    max_iterations=100,
    all_possible_transitions=True
)
crf.fit(X_train, y_train)

# ── Evaluate ──────────────────────────────────────────────────
y_pred  = crf.predict(X_test)
labels  = list(crf.classes_)
labels.remove("O")   # Exclude "Outside" from report

print(crf_metrics.flat_classification_report(y_test, y_pred, labels=labels, digits=4))
OUTPUT
              precision    recall  f1-score   support

      B-ADJP     0.7823    0.7104    0.7446       384
      I-ADJP     0.7512    0.6893    0.7189       269
        B-NP     0.9301    0.9388    0.9344      9091
        I-NP     0.9441    0.9512    0.9476     14161
        B-PP     0.9712    0.9689    0.9700      3402
        I-PP     0.9854    0.9912    0.9883       228
        B-VP     0.9387    0.9298    0.9342      2651
        I-VP     0.9511    0.9623    0.9567      2613

    accuracy                         0.9348     35420
   macro avg     0.9068    0.8990    0.9028     35420
weighted avg     0.9345    0.9348    0.9346     35420
🏆
CoNLL-2000 Benchmark — What Good Looks Like

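The CoNLL-2000 shared task is the standard benchmark for chunking, scored by chunk-level F1 on its test split of Wall Street Journal text. A basic CRF reaches roughly 93% F1, and fine-tuned transformer chunkers are reported at around 97%. Note that the report above is per token label (B-NP, I-NP, ...), not per chunk: its 93.4% for B-NP is a token-level figure, so compute chunk-level F1 separately with seqeval or the CoNLL evaluation script before comparing against published results. Accuracy in this range is usually sufficient for information extraction pipelines.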


Section 09

Phrase Structure — The Linguistic Foundation

Chunking is a practical approximation of full phrase structure. To truly understand it, you need to know what phrase structure theory says — and where chunking takes a deliberate shortcut.

X-Bar Theory — The Universal Blueprint of Phrases
In modern linguistics (X-Bar Theory, Chomsky 1970), every phrase type — noun phrase, verb phrase, prepositional phrase — has the same internal architecture. Each phrase has:

  • A head (the core word that names the phrase type: N, V, P)
  • A specifier (a phrase in the outer left position: the determiner in an NP)
  • Complements (phrases that the head requires: the object of a verb)
  • Adjuncts (optional modifiers that can be freely added or removed)

Chunking captures the specifier, the pre-head modifiers, and the head (the flat left side of the phrase) but ignores complements and recursive embedding. That is its design trade-off: speed and simplicity over completeness.

Phrase Types — Structure and Examples

NP — Noun Phrase
Structure: [Spec DT] [Adj JJ*] [Head NN]
"the brilliant young professor"
Head: professor | Spec: the | Adj: brilliant, young

Chunked flat: DT JJ JJ NN → one NP chunk
VP — Verb Phrase
Structure: [Aux MD?] [Adv RB?] [Head VB]
"will quickly announce"
Head: announce | Aux: will | Mod: quickly

Chunked flat: MD RB VB → one VP chunk
PP — Prepositional Phrase
Structure: [Head P] [Complement NP]
"across the busy motorway"
Head: across | Complement: the busy motorway

Chunked: IN + NP → one PP chunk
ADJP — Adjective Phrase
Structure: [Adv RB?] [Head JJ]
"very tall", "extremely important"
Head: tall/important | Mod: very/extremely

Chunked flat: RB JJ → one ADJP chunk
ADVP — Adverb Phrase
Structure: [Adv RB?] [Head RB]
"very quickly", "quite slowly"
Head: quickly/slowly | Mod: very/quite

Chunked flat: RB RB → one ADVP chunk
SBAR — Subordinate Clause
Structure: [Comp IN/WH] [S clause]
"that she discovered", "when he arrived"
Head: discovered/arrived | Comp: that/when

Not typically chunked (contains full clause)

The Critical Difference — Chunking vs. Full Phrase Structure

📊 Full Constituency Parse (Recursive)

| Node | Content | Depth |
|---|---|---|
| S | the whole sentence | 0 |
| NP | "The old man" | 1 |
| VP | "saw the woman with the telescope" | 1 |
| NP | "the woman with the telescope" | 2 |
| PP | "with the telescope" | 3 |
| NP | "the telescope" | 4 |

⚡ Chunking (Flat, Non-Recursive)

| Chunk | Content | Level |
|---|---|---|
| NP | "The old man" | flat |
| VP | "saw" | flat |
| NP | "the woman" | flat |
| O | "with" | outside |
| NP | "the telescope" | flat |

No nesting: the PP "with the telescope" is split.
⚠️
The Flatness Trade-Off — When Chunking Is Not Enough

Chunking's flatness means it cannot capture nested structure. In the flat analysis above, the PP "with the telescope" in "the woman with the telescope" is split: "with" goes outside, and "the telescope" becomes its own NP. If your task requires knowing whether "with the telescope" modifies "the woman" or the verb "saw", you need full constituency or dependency parsing. For most extraction tasks, flat chunks are sufficient and much faster.
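
One way to picture the relationship between the two analyses: the flat NP chunks are roughly the NP nodes of the full parse that contain no smaller NP. The sketch below illustrates this on a hand-written tree. The nested-tuple format and the helper names are assumptions made for this example, and the tree follows the Penn Treebank habit of giving "the woman" its own inner NP node, which the table above leaves out.

```python
def leaves(node):
    """Return the words under a node; a node is a string or (label, child, ...)."""
    if isinstance(node, str):
        return [node]
    return [word for child in node[1:] for word in leaves(child)]


def contains_label(node, label):
    """True if any proper descendant of node carries the given label."""
    return any(
        not isinstance(child, str)
        and (child[0] == label or contains_label(child, label))
        for child in node[1:]
    )


def base_phrases(node, label="NP"):
    """Collect phrases with this label that contain no nested phrase of the same label."""
    if isinstance(node, str):
        return []
    if node[0] == label and not contains_label(node, label):
        return [" ".join(leaves(node))]
    return [phrase for child in node[1:] for phrase in base_phrases(child, label)]


# Hand-written parse of the noun-attachment reading.
tree = ("S",
        ("NP", "The", "old", "man"),
        ("VP", "saw",
         ("NP",
          ("NP", "the", "woman"),
          ("PP", "with", ("NP", "the", "telescope")))))

print(base_phrases(tree))
```

The attachment information lives only in the nesting, so it disappears the moment the tree is flattened.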


Section 10

Real-World Applications of Chunking

🔍
Keyword & Keyphrase Extraction
NP chunks as keyphrases
Noun phrase chunks are the natural keyphrases of any document. "artificial neural network", "global supply chain disruption", "quantum computing breakthrough" — all are NP chunks that carry the document's key concepts. Used in search engines, tagging systems, and document summarization.
🌟
Named Entity Recognition (NER)
NP chunks → entity candidates
NER systems often first identify NP chunks, then classify each chunk as PERSON, ORGANIZATION, LOCATION, DATE, etc. Chunking provides the candidate spans; a classifier decides the entity type. This two-stage approach can be cheaper to build and run than end-to-end neural NER, which helps in low-resource settings.
📋
Information Extraction
Subject-Action-Object triples
Extract structured facts from text: NP chunk (who/what) + VP chunk (action) + NP chunk (target). "The FDA [NP] approved [VP] a new cancer drug [NP]" → (FDA, approved, new cancer drug). Powers knowledge graph construction and automated report generation.
💬
Question Answering
Answer span detection
The answer to most factual questions is an NP chunk: "Who won?" → NP. "When did it happen?" → NP (date). Chunking identifies candidate answer spans that a downstream reading comprehension model then ranks and selects.
📈
Sentiment Analysis
Aspect-level opinion mining
"The battery life is terrible but the camera quality is outstanding." Chunking isolates "battery life" and "camera quality" as the aspects. Sentiment classifiers then run on each NP + its surrounding VP/ADJP to produce aspect-level scores.
🌎
Machine Translation Pre-processing
Phrase alignment
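Some pre-neural MT systems used source and target NP/VP chunks as alignment units or reordering constraints (mainstream phrase-based MT relied on statistically extracted word sequences rather than linguistic chunks). Even in the neural era, chunking can help with morphologically rich languages where word-level alignment is ambiguous.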
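
The subject-action-object idea from the Information Extraction card can be sketched on top of any chunker's output. The example below assumes the chunks are already available as (label, text) pairs in sentence order; `extract_triples` and `strip_determiner` are hypothetical helpers, and the adjacency heuristic is deliberately naive (it ignores passives, negation, and intervening chunks).

```python
DETERMINERS = {"the", "a", "an"}


def strip_determiner(text):
    """Drop a leading article so 'The FDA' becomes 'FDA'."""
    words = text.split()
    if len(words) > 1 and words[0].lower() in DETERMINERS:
        return " ".join(words[1:])
    return text


def extract_triples(chunks):
    """Return (subject, action, object) for every adjacent NP-VP-NP window."""
    triples = []
    for i in range(len(chunks) - 2):
        (l1, subj), (l2, verb), (l3, obj) = chunks[i:i + 3]
        if (l1, l2, l3) == ("NP", "VP", "NP"):
            triples.append((strip_determiner(subj), verb, strip_determiner(obj)))
    return triples


chunks = [("NP", "The FDA"), ("VP", "approved"), ("NP", "a new cancer drug")]
print(extract_triples(chunks))
```

A production extractor would normally check grammatical roles from a dependency parse rather than rely on adjacency alone.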

Section 11

Production-Ready Chunking Pipeline

Below is a complete, self-contained chunking pipeline that handles raw text, extracts NP and VP chunks, labels each with a grammatical role, and outputs structured JSON, ready to feed into a downstream information extraction or search indexing system.

import spacy
import json
from dataclasses import dataclass, asdict
from typing import List, Dict, Optional

nlp = spacy.load("en_core_web_sm")

# ── Data classes for structured output ───────────────────────
@dataclass
class Chunk:
    text:       str
    label:      str          # "NP" | "VP"
    start_char: int
    end_char:   int
    root_word:  str
    root_dep:   str
    role:       Optional[str] = None  # "subject" | "object" | "pobj" | "predicate"

@dataclass
class ParsedSentence:
    text:   str
    chunks: List[Chunk]

# ── Core chunking function ────────────────────────────────────
def chunk_sentence(sent) -> ParsedSentence:
    """Extract and classify all chunks from a spaCy sentence span."""
    chunks = []

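    # Noun Phrases: from spaCy's dependency-based noun_chunks.
    # Iterating on the sentence span (not sent.as_doc()) keeps character
    # offsets relative to the full document, consistent with token.idx
    # used for the VP spans below, so the final sort orders chunks correctly.
    for np in sent.noun_chunks: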
        role = None
        if np.root.dep_ in ("nsubj", "nsubjpass"):
            role = "subject"
        elif np.root.dep_ in ("dobj", "obj"):
            role = "object"
        elif np.root.dep_ == "pobj":
            role = "pobj"
        chunks.append(Chunk(
            text=np.text, label="NP",
            start_char=np.start_char, end_char=np.end_char,
            root_word=np.root.text, root_dep=np.root.dep_,
            role=role
        ))

    # Verb Phrases — collect auxiliary + main verb spans
    for token in sent:
        if token.dep_ == "ROOT" and token.pos_ == "VERB":
            vp_tokens = [t for t in token.children
                         if t.dep_ in ("aux", "auxpass", "neg", "advmod")
                         and t.i < token.i]
            vp_tokens.append(token)
            vp_tokens.sort(key=lambda t: t.i)
            vp_text = " ".join(t.text for t in vp_tokens)
            chunks.append(Chunk(
                text=vp_text, label="VP",
                start_char=vp_tokens[0].idx,
                end_char=vp_tokens[-1].idx + len(vp_tokens[-1].text),
                root_word=token.text, root_dep="ROOT",
                role="predicate"
            ))

    # Sort chunks by their position in the sentence
    chunks.sort(key=lambda c: c.start_char)
    return ParsedSentence(text=sent.text, chunks=chunks)

# ── Run the pipeline ──────────────────────────────────────────
text = """
The European Space Agency successfully launched a new climate monitoring satellite.
Scientists will analyze the collected data over the next five years.
The mission could dramatically improve our understanding of global warming.
"""

doc = nlp(text.strip())
results = [chunk_sentence(sent) for sent in doc.sents]

for parsed in results:
    print(f"\nSentence: {parsed.text}")
    print(f"  {'Label':4} {'Role':12} Text")
    print(f"  {'-'*50}")
    for chunk in parsed.chunks:
        print(f"  {chunk.label:4} {(chunk.role or ''):12} '{chunk.text}'")

# Export to JSON
output = [asdict(r) for r in results]
print("\nJSON snippet:")
print(json.dumps(output[0], indent=2)[:500] + "\n...")
OUTPUT
Sentence: The European Space Agency successfully launched a new climate monitoring satellite.
  Label Role         Text
  --------------------------------------------------
  NP   subject      'The European Space Agency'
  VP   predicate    'successfully launched'
  NP   object       'a new climate monitoring satellite'

Sentence: Scientists will analyze the collected data over the next five years.
  Label Role         Text
  --------------------------------------------------
  NP   subject      'Scientists'
  VP   predicate    'will analyze'
  NP   object       'the collected data'
  NP   pobj         'the next five years'

Sentence: The mission could dramatically improve our understanding of global warming.
  Label Role         Text
  --------------------------------------------------
  NP   subject      'The mission'
  VP   predicate    'could dramatically improve'
  NP   object       'our understanding'
  NP   pobj         'global warming'

JSON snippet:
{
  "text": "The European Space Agency successfully launched ...",
  "chunks": [
    {"text": "The European Space Agency", "label": "NP", "role": "subject", ...},
    {"text": "successfully launched", "label": "VP", "role": "predicate", ...},
    ...

Section 12

Evaluating a Chunker — Metrics and Benchmarks

Chunker quality is measured at the chunk span level — not per token, but per complete phrase. A predicted chunk is correct only if its boundary (start + end position) AND its label (NP, VP, PP) exactly match the gold annotation.

Chunk Precision
P = correct_chunks / predicted_chunks
Of all chunks the system predicted, what fraction exactly matched a gold chunk in span and label?
Chunk Recall
R = correct_chunks / gold_chunks
Of all gold chunks in the reference annotation, what fraction did the system find?
F1 Score (Primary Metric)
F1 = 2 × P × R / (P + R)
Harmonic mean of precision and recall. The standard metric for CoNLL-2000 benchmarks.
Per-Class Breakdown
F1_NP, F1_VP, F1_PP
Report F1 per chunk type. NP is hardest (most diverse); PP is usually easiest (IN + NP pattern is clear).
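
The formulas above can be checked by hand on a tiny example. The sketch below is a minimal standard-library illustration, not a replacement for seqeval; `chunk_set` and `chunk_prf` are hypothetical helpers that assume strict IOB2 input. It also prints token-level accuracy for contrast.

```python
def chunk_set(sentences):
    """Collect (sentence_idx, label, start, end) for every chunk in IOB2-tagged sentences."""
    chunks = set()
    for s, tags in enumerate(sentences):
        start = label = None
        for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a trailing chunk
            if start is not None and tag != f"I-{label}":
                chunks.add((s, label, start, i))
                start = label = None
            if tag.startswith("B-"):
                start, label = i, tag[2:]
    return chunks


def chunk_prf(gold, pred):
    """Chunk-level precision, recall, and F1 with exact span + label matching."""
    gold_chunks, pred_chunks = chunk_set(gold), chunk_set(pred)
    correct = len(gold_chunks & pred_chunks)
    p = correct / len(pred_chunks) if pred_chunks else 0.0
    r = correct / len(gold_chunks) if gold_chunks else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


gold = [["B-NP", "I-NP", "B-VP", "B-NP", "I-NP"]]
pred = [["B-NP", "I-NP", "B-VP", "B-NP", "O"]]   # last token of the final NP missed

p, r, f1 = chunk_prf(gold, pred)
token_acc = sum(g == q for g, q in zip(gold[0], pred[0])) / len(gold[0])
print(f"P={p:.4f}  R={r:.4f}  F1={f1:.4f}  token accuracy={token_acc:.2f}")
```

One wrong token out of five leaves token accuracy at 0.80, but it invalidates a whole chunk, so two of three chunks match and chunk-level F1 drops to about 0.67.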
from seqeval.metrics import classification_report, f1_score
from seqeval.scheme import IOB2

# ── Example: Compare gold vs predicted IOB sequences ─────────
gold_labels = [
    ["B-NP", "I-NP", "I-NP", "B-VP", "B-NP", "I-NP", "O", "B-NP", "I-NP"],
    ["B-NP", "B-VP", "I-VP", "B-NP", "I-NP", "I-NP"],
]

pred_labels = [
    ["B-NP", "I-NP", "I-NP", "B-VP", "B-NP", "I-NP", "O", "B-NP", "O"],  # last token missed
    ["B-NP", "B-VP", "I-VP", "B-NP", "I-NP", "I-NP"],             # perfect
]

print(
    classification_report(gold_labels, pred_labels, mode="strict", scheme=IOB2)
)
f1 = f1_score(gold_labels, pred_labels, mode="strict", scheme=IOB2)
print(f"Overall F1: {f1:.4f}")
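OUTPUT
              precision    recall  f1-score   support

          NP       0.80      0.80      0.80         5
          VP       1.00      1.00      1.00         2

   micro avg       0.86      0.86      0.86         7
   macro avg       0.90      0.90      0.90         7
weighted avg       0.86      0.86      0.86         7

Overall F1: 0.8571

The gold data contains 7 chunks (5 NP, 2 VP). The prediction also proposes 7 chunks, of which 6 match exactly, so precision and recall are both 6/7.
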
| System | Approach | CoNLL-2000 F1 (approximate) | Speed |
|---|---|---|---|
| NLTK RegexpParser | Rule-based regex | ~82–86% | Instant |
| CRF (sklearn-crfsuite) | Statistical, hand features | ~93–94% | Fast |
| spaCy noun_chunks | Dep-parse based (stat) | ~92–93% (NP only) | Fast |
| BiLSTM-CRF | Neural sequence labelling | ~95–96% | Medium (GPU) |
| BERT fine-tuned | Transformer sequence labelling | ~97%+ | Slow without GPU |
Section 13

Common Pitfalls and Golden Rules

🌳 Chunking — Non-Negotiable Rules
1
Always POS-tag before chunking. Classical chunkers, whether regex or CRF, operate on POS tags rather than raw words (transformer chunkers are the exception, since they label tokens directly). If your POS tagger is poor (wrong domain, no fine-tuning), your chunker will be poor regardless of how good its grammar or model is. Garbage tags in → garbage chunks out.
2
Regex chunkers fail silently on unseen patterns. If your grammar doesn't cover a POS sequence, the tokens simply go O (outside) with no warning. Always test on diverse, real-world sentences and inspect uncovered spans. Add rules or switch to a statistical chunker when coverage gaps appear.
3
Chunk spans are exact — partial matches score zero. Evaluation is strict: predicting "brilliant young researcher" when the gold is "The brilliant young researcher" counts as both a false positive and a false negative. Always include boundary tokens (determiners, possessives) in NP chunks.
4
Coordinated NPs need special handling. A naive NP regex splits "cats and dogs" into two chunks with "and" left outside, so the link between the conjuncts is lost; spaCy's noun_chunks likewise yields "cats" and "dogs" separately. Decide which convention your task needs: add <CC> to the NP rule to keep the coordination as one chunk, or keep the conjuncts separate and recover the link from the dependency parse (the conj relation).
5
Domain shift destroys accuracy. A chunker trained on news text (CoNLL-2000, WSJ) will perform poorly on tweets, biomedical text, or legal documents where POS distributions differ. Fine-tune on domain data or use domain-specific models (SciSpaCy for biomedical, legal-BERT for contracts).
6
Use seqeval for evaluation, never token-level accuracy. Token-level accuracy inflates scores because most tokens are easy to label, and a boundary error that ruins a whole chunk costs only one token. Always use span-level F1 via the seqeval library or CoNLL evaluation scripts, which require exact span + label matches.
7
If you need nesting, upgrade to a full parser. Chunking is deliberately flat. The moment your task requires knowing whether a PP modifies the noun or the verb, whether a relative clause is embedded inside an NP, or which NP is the head of another — stop chunking and use spaCy's dependency parser or Stanza's constituency parser instead.
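
Rule 2 is easier to follow with a small audit script. The sketch below counts the POS patterns of every run of tokens the chunker left outside (O), which gives a quick list of sequences the grammar does not cover. It is a minimal illustration: `uncovered_patterns` is a hypothetical helper, and the input is assumed to be parallel lists of (word, tag) pairs and IOB labels.

```python
from collections import Counter


def uncovered_patterns(tagged_sents, iob_sents):
    """Count the POS-tag sequences of maximal runs of O-labelled tokens."""
    counts = Counter()
    for tagged, iob in zip(tagged_sents, iob_sents):
        run = []
        for (_, pos), label in zip(tagged, iob):
            if label == "O":
                run.append(pos)
            elif run:
                counts[" ".join(run)] += 1
                run = []
        if run:  # close a run that reaches the end of the sentence
            counts[" ".join(run)] += 1
    return counts


# Hand-made example: a VP-only chunker that ignores pronouns and adverbs.
tagged_sents = [[("She", "PRP"), ("sings", "VBZ"), ("very", "RB"),
                 ("loudly", "RB"), (".", ".")]]
iob_sents = [["O", "B-VP", "O", "O", "O"]]

for pattern, count in uncovered_patterns(tagged_sents, iob_sents).most_common():
    print(f"{count:3} × {pattern}")
```

Patterns that show up often in this list are good candidates for new grammar rules, or a sign that a statistical chunker is the better fit.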

Section 14

Quick Reference — Chunking Cheat Sheet

| Task | Tool | Key API |
|---|---|---|
| Rule-based NP chunking | NLTK | RegexpParser(grammar).parse(pos_tags) |
| Extract NLTK chunk spans | NLTK | tree.subtrees(filter=lambda t: t.label()=="NP") |
| Statistical NP chunks | spaCy | doc.noun_chunks |
| Chunk text (string) | spaCy | chunk.text |
| Chunk root word | spaCy | chunk.root.text |
| Chunk grammatical role | spaCy | chunk.root.dep_ |
| Word governing the chunk root | spaCy | chunk.root.head.text |
| Filter chunks by role | spaCy | if chunk.root.dep_ == "nsubj" |
| Train CRF chunker | sklearn-crfsuite | CRF().fit(X_train, y_train) |
| Evaluate chunk F1 | seqeval | f1_score(gold, pred, mode="strict", scheme=IOB2) |
| Chunk-level classification report | seqeval | classification_report(gold, pred) |
| Visualize the dependency tree | spaCy displacy | displacy.render(doc, style="dep") |
🏆
Summary — The Editor's Highlighter, Formalized

Chunking is shallow parsing: it reads POS-tagged text and groups tokens into flat, non-overlapping, non-recursive phrase chunks — primarily NPs, VPs, and PPs — using either hand-crafted regex rules (NLTK) or statistical models (CRF, spaCy, BERT).

It sits in the sweet spot between POS tagging (too fine-grained) and full syntactic parsing (too expensive): fast enough for large corpora, rich enough for extraction, and interpretable enough to debug. Master chunking, and you hold the fastest path from raw text to structured meaning — the newspaper editor's highlighter, running at machine speed.
