Dependency Parsing vs Constituency Parsing

Section 01

The Story That Makes Parsing Click

📖 Real World Analogy

The Detective and the Sentence

Imagine a detective who receives a mysterious note: "The old man saw the woman with the telescope." The detective pauses. Who had the telescope? The old man — or the woman?

This single ambiguous sentence has two completely different meanings, yet the words are identical. To resolve the mystery, the detective must go beyond individual words and understand the grammatical structure — who is doing what to whom, and which phrase belongs with which noun.

This is exactly what a syntactic parser does. It reads a sentence and builds a structural map — a grammar blueprint — that removes ambiguity and reveals meaning. NLP has two dominant traditions for doing this: Dependency Parsing and Constituency Parsing. They approach the problem differently, but both are answering the same detective's question: How do the pieces of this sentence relate to each other?

Before diving into each approach, it helps to understand why parsing matters in the first place. Most NLP pipelines, from chatbots to search engines to machine translation, need to know not just which words appear, but how those words are grammatically connected. A bag-of-words model sees "dog bites man" and "man bites dog" as the same set of words. A parser knows they mean opposite things.
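
To make the bag-of-words point concrete, here is a minimal sketch using only the standard library. The (subject, verb, object) tuples are hand-written stand-ins for parser output, an assumption made purely for illustration.

```python
from collections import Counter

a = "dog bites man".split()
b = "man bites dog".split()

# A bag-of-words model only counts tokens, so the two sentences look identical.
same_bag = Counter(a) == Counter(b)

# Hypothetical parser output: (subject, verb, object) read off a dependency tree.
parse_a = ("dog", "bites", "man")
parse_b = ("man", "bites", "dog")
same_structure = parse_a == parse_b

print("Same bag of words:", same_bag)
print("Same structure:   ", same_structure)
```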

🌐

Syntactic Parsing — The Big Picture

Syntactic parsing is the task of automatically analyzing the grammatical structure of a sentence according to a formal grammar. It is a foundational step in the NLP pipeline, sitting between tokenization/POS tagging and semantic understanding. The two dominant paradigms are Dependency Parsing (which focuses on binary word-to-word relationships) and Constituency Parsing (which groups words into nested phrases).

Section 02

Part I — Dependency Parsing

Dependency Parsing represents a sentence as a directed graph where every word (except one) has exactly one head (a word it depends on), connected by a labeled arc that names the grammatical relationship. The one word with no head is the root — usually the main verb.

📖 Analogy

The Org Chart of a Sentence

Think of a dependency tree like a corporate org chart. The CEO (the root verb) sits at the top. Every employee reports to exactly one manager — never to two, never to zero (unless they are the CEO). Each reporting line has a label: subject, direct object, modifier, and so on. When you look at the chart, you instantly know who controls what and how decisions flow. A dependency parse gives you the same clarity for language: who is the actor, what is acted upon, and what modifies what.

Anatomy of a Dependency Tree

Take the sentence: "The quick brown fox jumps over the lazy dog." Below is its dependency structure broken into key components.

🔍 Dependency Arcs — "The quick brown fox jumps over the lazy dog"

ROOT

jumps — the main verb; the root of the entire tree. No head.

nsubj

fox → jumps — "fox" is the nominal subject of "jumps"

det

The → fox — "The" is a determiner modifying "fox"

amod

quick → fox, brown → fox — adjectival modifiers of "fox"

prep

over → jumps — prepositional phrase attached to "jumps"

pobj

dog → over — object of the preposition "over"

det/amod

the → dog, lazy → dog — determiner and adjective modifying "dog"

📊 Visual Dependency Tree — ASCII Diagram

                      jumps  [ROOT]
                     /      \
                  fox        over
                / | \           \
             The quick brown     dog
                                 / \
                               the  lazy

In the arc list above, each arrow points from the dependent to its head. Every word except the root has exactly one head, so every node in the diagram except ROOT has exactly one parent.
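
The "exactly one head" rule means a dependency tree can be stored as a plain list of head indices. The sketch below is one possible encoding, not any library's internal format: index 0 is an artificial ROOT node (an assumption of this example), and `is_valid_tree` is a hypothetical helper that checks the single-root and no-cycle properties.

```python
# Index 0 is an artificial ROOT node; real words are numbered from 1.
words = ["ROOT", "The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]
# heads[i] is the index of word i's head (None for the artificial ROOT).
heads = [None, 4, 4, 4, 5, 0, 5, 9, 9, 6]


def is_valid_tree(heads):
    """True if exactly one word attaches to ROOT and every word reaches ROOT without a cycle."""
    n = len(heads)
    if sum(1 for h in heads[1:] if h == 0) != 1:
        return False
    for start in range(1, n):
        seen = set()
        node = start
        while node != 0:
            head = heads[node]
            if node in seen or head is None or not (0 <= head < n):
                return False
            seen.add(node)
            node = head
    return True


root_word = words[heads.index(0)]
print("Root word: ", root_word)
print("Valid tree:", is_valid_tree(heads))
print("With cycle:", is_valid_tree([None, 2, 1, 0]))  # words 1 and 2 head each other
```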

Universal Dependency Labels — The Most Important Ones

Label	Full Name	Example	Meaning
nsubj	Nominal Subject	The cat sleeps	Who/what performs the verb
obj	Object	She ate an apple	Who/what receives the action
iobj	Indirect Object	Give her the book	Secondary recipient
amod	Adjectival Modifier	A red car	Adjective modifying a noun
advmod	Adverbial Modifier	She runs fast	Adverb modifying a verb/adj
det	Determiner	The dog	Article or demonstrative
prep	Prepositional Modifier	Ran to the store	Prepositional phrase attachment (spaCy/ClearNLP label; UD v2 uses case with obl or nmod)
conj	Conjunct	Cats and dogs	Coordinated element
aux	Auxiliary	She will leave	Tense/aspect auxiliary verb
cop	Copula	She is happy	Linking verb "to be"
ROOT	Root	She sleeps	Syntactic root of the tree (written root in UD, ROOT in spaCy)

Section 03

Dependency Parsing Algorithms

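Under the hood, dependency parsers fall into two main algorithmic families: transition-based and graph-based. Neural biaffine parsers, the third card below, are the modern graph-based variant. Understanding them helps you choose the right tool and interpret speed vs. accuracy tradeoffs.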

⚡

Transition-Based (Arc-Eager)

Linear Time — O(n)

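Processes words left-to-right using a stack and a buffer. At each step it picks one transition: SHIFT (push the next buffer word onto the stack), LEFT-ARC (make the front of the buffer the head of the stack top, then pop the stack), RIGHT-ARC (make the stack top the head of the front of the buffer, then push that word), or REDUCE (pop a stack top that already has a head). Fast and greedy; spaCy's parser uses an arc-eager style transition system.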

✓ Very fast, suited for production

✗ Greedy — early errors can cascade
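
The transitions above can be sketched in a few lines. This is a minimal illustration rather than spaCy's implementation: instead of a trained classifier it uses a simple oracle that reads the gold heads, so it only shows how SHIFT, LEFT-ARC, RIGHT-ARC and REDUCE rebuild a projective tree. The function name `arc_eager_oracle` and the head-list encoding (index 0 = ROOT) are assumptions of this example.

```python
def arc_eager_oracle(gold_heads):
    """Replay arc-eager transitions that rebuild a projective gold tree.

    gold_heads[i] is the head of word i; index 0 is ROOT (its entry is unused).
    Returns (transitions, predicted_heads).
    """
    n = len(gold_heads) - 1
    stack, buffer = [0], list(range(1, n + 1))
    pred = [None] * (n + 1)
    transitions = []
    while buffer:
        s, b = stack[-1], buffer[0]
        if s != 0 and pred[s] is None and gold_heads[s] == b:
            pred[s] = b                      # LEFT-ARC: buffer front heads the stack top
            stack.pop()
            transitions.append("LEFT-ARC")
        elif gold_heads[b] == s:
            pred[b] = s                      # RIGHT-ARC: stack top heads the buffer front
            stack.append(buffer.pop(0))
            transitions.append("RIGHT-ARC")
        elif pred[s] is not None and all(gold_heads[k] != s for k in buffer):
            stack.pop()                      # REDUCE: stack top has its head and all dependents
            transitions.append("REDUCE")
        else:
            stack.append(buffer.pop(0))      # SHIFT: nothing to attach yet
            transitions.append("SHIFT")
    return transitions, pred


# "The quick brown fox jumps over the lazy dog" (heads as in the arc list above)
gold = [None, 4, 4, 4, 5, 0, 5, 9, 9, 6]
transitions, pred = arc_eager_oracle(gold)
print(transitions)
print("Recovered gold tree:", pred[1:] == gold[1:])
```

Each transition either consumes a buffer word or pops the stack, so a sentence of n words needs at most 2n transitions. That bound is where the linear-time behaviour comes from; a production parser replaces the oracle with a classifier and keeps the bookkeeping constant-time per step.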

📈

Graph-Based (MST)

O(n²) to O(n³), Exhaustive Search

Scores all possible arcs between every word pair and selects the Maximum Spanning Tree: the Eisner algorithm (O(n³)) for projective trees, Chu-Liu/Edmonds (O(n²)) for non-projective trees. The result is globally optimal under the arc-factored scoring model. Stanza's parser is graph-based, as is UDPipe 2.

✓ Higher accuracy, global optimum under the scoring model

✗ Slower: quadratic to cubic in sentence length
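
To see why the tree constraint matters, the sketch below scores every arc of a three-word sentence and compares two strategies: picking the best head for each word independently, and searching for the best-scoring valid tree. It deliberately uses brute-force enumeration instead of Eisner or Chu-Liu/Edmonds, which find the same arg max efficiently. The scores are made-up numbers for illustration.

```python
from itertools import product

# Hypothetical arc scores: SCORES[(head, dependent)], with 0 standing for ROOT.
SCORES = {
    (0, 1): 3, (0, 2): 8, (0, 3): 1,
    (1, 2): 9, (1, 3): 2,
    (2, 1): 10, (2, 3): 7,
    (3, 1): 1, (3, 2): 2,
}
N = 3  # number of words


def reaches_root(heads):
    """heads maps each word to its head; True if every word reaches ROOT without a cycle."""
    for start in heads:
        seen, node = set(), start
        while node != 0:
            if node in seen:
                return False
            seen.add(node)
            node = heads[node]
    return True


def best_tree():
    """Brute-force arg max over all head assignments that form a tree."""
    best, best_score = None, float("-inf")
    options = [[h for h in range(N + 1) if h != d] for d in range(1, N + 1)]
    for choice in product(*options):
        heads = dict(zip(range(1, N + 1), choice))
        if not reaches_root(heads):
            continue
        score = sum(SCORES[(h, d)] for d, h in heads.items())
        if score > best_score:
            best, best_score = heads, score
    return best, best_score


# Choosing each word's best head independently ignores the tree constraint.
greedy = {
    d: max((h for h in range(N + 1) if h != d), key=lambda h: SCORES[(h, d)])
    for d in range(1, N + 1)
}
tree, score = best_tree()
print("Greedy heads:", greedy, "is a tree:", reaches_root(greedy))
print("Best tree:   ", tree, "score:", score)
```

With these scores the independent choice makes words 1 and 2 each other's head, which is a cycle; the tree search gives up one high-scoring arc to obtain a valid structure.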

🌐

Neural Biaffine (Dozat & Manning)

State-of-the-Art

Uses a deep BiLSTM or Transformer encoder + biaffine attention to score head-dependent pairs and label arcs jointly. State-of-the-art accuracy on Universal Dependencies benchmarks. Used by Stanza and modern transformer parsers.

✓ Best accuracy, rich context

✗ Heavier compute, needs GPU for speed
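
The biaffine scorer itself is compact. The sketch below shows its general form, a bilinear term between a dependent vector and a head vector plus a head-only bias term, in plain Python. All vectors and weights are hand-picked toy values (a real parser learns them and gets the vectors from the encoder), and the helper names are assumptions of this example.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def matvec(m, v):
    return [dot(row, v) for row in m]


def biaffine_score(dep_vec, head_vec, U, u):
    """score = dep^T U head + u^T head (bilinear term plus a head-only bias term)."""
    return dot(dep_vec, matvec(U, head_vec)) + dot(u, head_vec)


# Toy, hand-picked values; a trained parser learns U and u and encodes the vectors.
U = [[2.0, 0.0], [0.0, 1.0]]
u = [0.5, 0.0]
head_vecs = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [0.5, 1.0]}  # 0 is ROOT
dep_vecs = {1: [1.0, 0.0], 2: [0.0, 2.0]}

# Score every candidate (head, dependent) pair, excluding self-attachment.
scores = {
    (h, d): biaffine_score(dep_vecs[d], head_vecs[h], U, u)
    for d in dep_vecs
    for h in head_vecs
    if h != d
}
best_head = {
    d: max((h for h in head_vecs if h != d), key=lambda h: scores[(h, d)])
    for d in dep_vecs
}
print(scores)
print(best_head)
```

In a full parser these scores feed a tree decoder (for example a maximum spanning tree search), and a second biaffine layer scores the arc labels.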

Section 04

Dependency Parsing with spaCy — Full Code Walkthrough

🔑

Why spaCy for Dependency Parsing?

spaCy uses a transition-based parser (Honnibal & Johnson, 2015). Its English pipelines are trained on OntoNotes 5 converted to dependencies, which is why they emit labels such as dobj and pobj rather than strict Universal Dependencies v2 labels. It is a go-to library for production NLP because it runs in linear time, integrates with the rest of the spaCy pipeline (tokenizer, POS tagger, NER), and achieves near state-of-the-art accuracy on English with the transformer-based en_core_web_trf model.

Step 1 — Installation and Basic Parse

# Install spaCy and download a model
# pip install spacy
# python -m spacy download en_core_web_sm

import spacy

# Load the small English model (use en_core_web_trf for best accuracy)
nlp = spacy.load("en_core_web_sm")

# Parse a sentence
doc = nlp("The quick brown fox jumps over the lazy dog.")

# Inspect each token's dependency information
print(f"{'Token':12} {'Dep':10} {'Head':12} {'Head POS'}")
print("-" * 45)
for token in doc:
    print(f"{token.text:12} {token.dep_:10} {token.head.text:12} {token.head.pos_}")

OUTPUT

Token        Dep        Head         Head POS
---------------------------------------------
The          det        fox          NOUN
quick        amod       fox          NOUN
brown        amod       fox          NOUN
fox          nsubj      jumps        VERB
jumps        ROOT       jumps        VERB
over         prep       jumps        VERB
the          det        dog          NOUN
lazy         amod       dog          NOUN
dog          pobj       over         ADP
.            punct      jumps        VERB

Step 2 — Extracting Subject-Verb-Object Triples

def extract_svo(doc):
    """Extract (subject, verb, object) triples from a spaCy doc."""
    triples = []
    for token in doc:
        # Find the main verb (root)
        if token.dep_ == "ROOT" and token.pos_ == "VERB":
            subject = None
            obj      = None
            for child in token.children:
                if child.dep_ in ("nsubj", "nsubjpass"):
                    subject = child.text
                if child.dep_ in ("dobj", "obj"):
                    obj = child.text
            triples.append({
                "verb":    token.text,
                "subject": subject,
                "object":  obj,
            })
    return triples

sentences = [
    "Alice ate the pizza quickly.",
    "The company announced record profits.",
    "Scientists discovered a new exoplanet.",
]

for sent in sentences:
    doc = nlp(sent)
    triples = extract_svo(doc)
    for t in triples:
        # !s turns a missing subject (None) into the string "None" so the width spec still works
        print(f"  SUBJ={t['subject']!s:12} VERB={t['verb']:12} OBJ={t['object']}")

OUTPUT

  SUBJ=Alice        VERB=ate          OBJ=pizza
  SUBJ=company      VERB=announced    OBJ=profits
  SUBJ=Scientists   VERB=discovered   OBJ=exoplanet

Step 3 — Visualizing the Parse Tree

from spacy import displacy

doc = nlp("Alice quickly sent Bob the urgent report.")

# Option A: render inline in a Jupyter notebook (SVG)
displacy.render(doc, style="dep", jupyter=True)

# Option B: get the raw SVG markup as a string and save it
# (jupyter=False forces a string return value even inside a notebook)
svg = displacy.render(doc, style="dep", page=False, jupyter=False)
with open("dep_tree.svg", "w", encoding="utf-8") as f:
    f.write(svg)
print("SVG saved to dep_tree.svg")

# Option C: serve as a local web page; this call blocks until interrupted,
# so keep it last
displacy.serve(doc, style="dep", port=5001)

Step 4 — Advanced: Navigating Subtrees and Ancestors

doc = nlp("The elderly professor published a groundbreaking paper on quantum computing.")

for token in doc:
    if token.dep_ == "ROOT":
        verb = token
        break

# Full subtree of the root (the entire sentence)
# Note: subtree and ancestors are properties (generators), not methods
print("Root verb:", verb.text)
print("Subtree:  ", [t.text for t in verb.subtree])

# Direct children classified by function
print("\nDirect children:")
for child in verb.children:
    print(f"  {child.text:15} dep={child.dep_:10} subtree={[t.text for t in child.subtree]}")

# Token ancestors (path back to root)
target = doc[8]  # "quantum"
print(f"\nAncestors of '{target.text}':", [t.text for t in target.ancestors])

OUTPUT

Root verb: published
Subtree:   ['The', 'elderly', 'professor', 'published', 'a', 'groundbreaking', 'paper', 'on', 'quantum', 'computing', '.']

Direct children:
  professor       dep=nsubj      subtree=['The', 'elderly', 'professor']
  paper           dep=dobj       subtree=['a', 'groundbreaking', 'paper', 'on', 'quantum', 'computing']
  .               dep=punct      subtree=['.']

Ancestors of 'quantum': ['computing', 'on', 'paper', 'published']

✅

Key spaCy Token Attributes for Dependency Parsing

token.dep_ — dependency label (e.g. "nsubj")
token.head — the token's syntactic head
token.children — direct dependents iterator
token.subtree — all tokens in the subtree rooted at this token (a property, not a method)
token.ancestors — path from this token up to the root (a property, not a method)
token.lefts / token.rights — left/right children
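
For intuition about what these attributes compute, here is a rough pure-Python equivalent over a plain head list. It is not spaCy's implementation; the head indices are hand-written to match the Step 4 example, with the root pointing to itself as spaCy does.

```python
words = ["The", "elderly", "professor", "published", "a", "groundbreaking",
         "paper", "on", "quantum", "computing", "."]
# heads[i] is the index of token i's head; the root (index 3) is its own head.
heads = [2, 2, 3, 3, 6, 6, 3, 6, 9, 7, 3]


def ancestors(i):
    """Indices on the path from token i up to the root, nearest first."""
    path = []
    while heads[i] != i:
        i = heads[i]
        path.append(i)
    return path


def children(i):
    """Direct dependents of token i, in sentence order."""
    return [j for j, h in enumerate(heads) if h == i and j != i]


def subtree(i):
    """Token i plus all of its descendants, in sentence order."""
    nodes = [i]
    for c in children(i):
        nodes.extend(subtree(c))
    return sorted(nodes)


print([words[i] for i in ancestors(8)])   # path from "quantum" to the root
print([words[i] for i in children(3)])    # direct dependents of "published"
print([words[i] for i in subtree(6)])     # everything under "paper"
```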

Section 05

Real-World Applications of Dependency Parsing

🔍

Information Extraction

Extract structured facts from raw text — who did what to whom. News extraction pipelines use dependency trees to find (entity, relation, entity) triples: "Elon Musk acquired Twitter."

nsubj → dobj → prep

💬

Question Answering

Identify the focus of a question. "Who wrote Hamlet?" — "who" is nsubj of "wrote", so the answer is the nsubj of "wrote" in the source document.

wh-word dependency tracing

🎉

Sentiment on Aspects

"The camera is great but the battery life is terrible." Dependency trees connect "great" to "camera" and "terrible" to "battery life" — enabling aspect-level sentiment.

nsubj + acomp (predicate adjective) attachment

🌎

Machine Translation

Dependency structures are more language-universal than word order. Transferring dependency relations across languages helps handle free word-order languages (Turkish, Japanese, Russian).

cross-lingual transfer

📋

Clinical NLP

Medical records: "Patient denies chest pain." A dependency parse shows that "chest pain" is the object of "denies", so a negation rule keyed on that verb can code the symptom as absent rather than present.

negation detection via dep

🔧

Grammar Checking

Subject-verb agreement errors, dangling modifiers, incorrect preposition use — all become detectable once you know the grammatical structure of the sentence.

nsubj → verb agreement

Section 06

Part II — Constituency Parsing

Constituency Parsing (also called phrase-structure parsing) takes a different perspective. Instead of mapping word-to-word relationships, it groups words into nested phrases (constituents), building a hierarchical tree rooted at the sentence level.

📖 Analogy

Russian Nesting Dolls (Matryoshka)

Picture a set of Russian nesting dolls. The largest doll is the whole sentence — S. Inside it, you find two dolls: a Noun Phrase (NP) and a Verb Phrase (VP). Inside the VP, there's another NP. Inside each NP, there are more dolls — determiners, adjectives, nouns — until you reach the smallest individual wooden tokens: the actual words.

This is constituency structure: a sentence is a hierarchy of nested phrases, each one a meaningful unit that can be replaced by another phrase of the same type. You can replace "The quick brown fox" with "She" — same slot, same NP function — and the sentence remains grammatical.

Core Phrase Types in Constituency Grammar

📄

S — Sentence

Top-level node

The root of every constituency tree. Typically decomposes into NP + VP (subject + predicate).

📄

NP — Noun Phrase

DT + (JJ*) + NN

Groups a noun with its determiners and modifiers. "The quick brown fox" — one NP.

📄

VP — Verb Phrase

VB + NP / PP / ADJP

Groups the verb with its complements and modifiers. "jumps over the lazy dog" — one VP.

📄

PP — Prepositional Phrase

IN + NP

A preposition plus its object NP. "over the lazy dog" — one PP inside the VP.

📄

ADJP — Adjective Phrase

(RB) + JJ

An adjective, optionally preceded by an adverb. "very tall", "extremely happy".

📄

ADVP — Adverb Phrase

(RB) + RB

An adverb, optionally modified by another adverb. "very quickly", "quite slowly". Often modifies a VP.

A Full Constituency Tree — Visualized

Sentence: "The quick brown fox jumps over the lazy dog."

🌳 Penn Treebank Style — Constituency Parse Tree

                          S
           ┌──────────────┴──────────────┐
           NP                            VP
  ┌─────┬──┴──┬─────┐           ┌────────┴─────────┐
  DT    JJ    JJ    NN          VBZ                PP
  │     │     │     │           │           ┌──────┴──────┐
  The   quick brown fox         jumps       IN            NP
                                            │       ┌─────┼─────┐
                                            over    DT    JJ    NN
                                                    │     │     │
                                                    the   lazy  dog

Legend: S = Sentence | NP = Noun Phrase | VP = Verb Phrase | PP = Prepositional Phrase | DT = Determiner, JJ = Adjective, NN = Noun, VBZ = Verb (3rd person sg)
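
The same tree is usually exchanged as a bracketed string. As a minimal sketch using only the standard library, the hypothetical helpers below parse such a string into nested (label, children) tuples and list the words covered by each phrase type. Real toolkits ship their own tree classes, so treat this as an illustration of the data structure only.

```python
def parse_brackets(text):
    """Parse a Penn Treebank style string into nested (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        assert tokens[pos] == "("
        label = tokens[pos + 1]
        pos += 2
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())
            else:
                children.append(tokens[pos])   # a word (leaf)
                pos += 1
        pos += 1                               # consume ")"
        return (label, children)

    return read()


def leaves(tree):
    """All words under a node, left to right."""
    _, children = tree
    out = []
    for c in children:
        out.extend(leaves(c) if isinstance(c, tuple) else [c])
    return out


def phrases(tree, target):
    """Word sequences covered by every node labeled `target`."""
    label, children = tree
    found = [" ".join(leaves(tree))] if label == target else []
    for c in children:
        if isinstance(c, tuple):
            found.extend(phrases(c, target))
    return found


TREE = ("(S (NP (DT The) (JJ quick) (JJ brown) (NN fox)) "
        "(VP (VBZ jumps) (PP (IN over) (NP (DT the) (JJ lazy) (NN dog)))))")
tree = parse_brackets(TREE)
print(phrases(tree, "NP"))
print(phrases(tree, "PP"))
```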

📚

Penn Treebank — The Training Dataset That Defined Constituency Parsing

The Penn Treebank (PTB), released by the University of Pennsylvania in 1993, contains ~49,000 manually annotated parse trees from the Wall Street Journal. It defined the constituency notation (S, NP, VP…) that almost every English parser is still trained or evaluated on today. The standard split is sections 2–21 for training, 22 for development, and 23 for testing, measured by F1 on labeled brackets.

Section 07

Context-Free Grammars — The Formal Foundation

Constituency parsing is rooted in Context-Free Grammars (CFGs). A CFG defines a set of rewrite rules that expand non-terminal symbols (NP, VP…) into sequences of other symbols.

Grammar Rule

S → NP VP

A sentence is a Noun Phrase followed by a Verb Phrase

Grammar Rule

NP → DT JJ* NN

A Noun Phrase is a determiner, zero or more adjectives, and a noun

Grammar Rule

VP → VBZ PP

A Verb Phrase is a 3rd-person-singular verb followed by a prepositional phrase

Parsing Algorithm

CYK / Earley

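Dynamic programming algorithms that build a chart of every valid constituent for a sentence given a CFG, in O(n³) time in sentence length. CYK requires the grammar in Chomsky Normal Form; Earley works with arbitrary CFGs.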
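
A compact CYK sketch makes the chart idea concrete. The toy grammar below is hypothetical and already in Chomsky Normal Form (every rule is either A → B C or A → word); instead of storing trees, each chart cell counts how many distinct trees derive a label over a span, which is enough to see both recognition and ambiguity.

```python
from collections import defaultdict

# Hypothetical CNF grammar: (B, C) -> parents A for every rule A -> B C.
BINARY = {
    ("NP", "VP"): ["S"],
    ("V", "NP"): ["VP"],
    ("VP", "PP"): ["VP"],
    ("NP", "PP"): ["NP"],
    ("DT", "NN"): ["NP"],
    ("P", "NP"): ["PP"],
}
LEXICON = {
    "the": ["DT"], "man": ["NN"], "woman": ["NN"], "telescope": ["NN"],
    "saw": ["V"], "with": ["P"],
}


def count_parses(tokens, start="S"):
    """Number of distinct parse trees rooted at `start` that cover all tokens."""
    n = len(tokens)
    # chart[i][j][A] = number of trees with root A spanning tokens[i:j]
    chart = [[defaultdict(int) for _ in range(n + 1)] for _ in range(n + 1)]
    for i, tok in enumerate(tokens):
        for tag in LEXICON.get(tok, []):
            chart[i][i + 1][tag] += 1
    for width in range(2, n + 1):              # span length
        for i in range(0, n - width + 1):      # span start
            j = i + width
            for k in range(i + 1, j):          # split point
                for left, lcount in chart[i][k].items():
                    for right, rcount in chart[k][j].items():
                        for parent in BINARY.get((left, right), []):
                            chart[i][j][parent] += lcount * rcount
    return chart[0][n][start]


print(count_parses("the man saw the woman".split()))
print(count_parses("the man saw the woman with the telescope".split()))
print(count_parses("saw the man".split()))
```

The three nested loops over span width, start, and split point are the source of the cubic running time.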

⚠️

The Ambiguity Problem in CFGs

Under a broad-coverage CFG, the number of valid parse trees can grow exponentially with sentence length. "I saw the man with the telescope" has at least two: the telescope modifies "man", or the telescope is the instrument of "saw." Probabilistic CFGs (PCFGs) assign a probability to each rule, so the parser can select the most probable tree with a Viterbi-style dynamic program (probabilistic CYK).
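
How a PCFG ranks competing trees can be shown without a full parser: the probability of a tree is the product of the probabilities of the rules it uses. The sketch below scores the two readings of "I saw the man with the telescope" by hand. The rule probabilities are invented for illustration, and a real parser would find the best tree with the Viterbi-style search rather than enumerating candidates.

```python
import math

# Hypothetical rule probabilities; the rules for each left-hand side sum to 1.
RULE_PROB = {
    ("S", ("NP", "VP")): 1.0,
    ("VP", ("V", "NP")): 0.7,
    ("VP", ("VP", "PP")): 0.3,
    ("NP", ("DT", "NN")): 0.6,
    ("NP", ("NP", "PP")): 0.2,
    ("NP", ("I",)): 0.2,
    ("PP", ("P", "NP")): 1.0,
    ("V", ("saw",)): 1.0,
    ("P", ("with",)): 1.0,
    ("DT", ("the",)): 1.0,
    ("NN", ("man",)): 0.5,
    ("NN", ("telescope",)): 0.5,
}


def tree_prob(tree):
    """Probability of a tree = product of the probabilities of the rules it uses."""
    label, children = tree
    if isinstance(children[0], str):               # lexical rule, e.g. NN -> man
        return RULE_PROB[(label, (children[0],))]
    rhs = tuple(child[0] for child in children)
    return RULE_PROB[(label, rhs)] * math.prod(tree_prob(c) for c in children)


the_man = ("NP", [("DT", ["the"]), ("NN", ["man"])])
the_telescope = ("NP", [("DT", ["the"]), ("NN", ["telescope"])])
pp = ("PP", [("P", ["with"]), the_telescope])
subj = ("NP", ["I"])
saw = ("V", ["saw"])

# Reading 1: the PP attaches to the verb phrase (the telescope is the instrument).
verb_attach = ("S", [subj, ("VP", [("VP", [saw, the_man]), pp])])
# Reading 2: the PP attaches to the noun phrase (the man has the telescope).
noun_attach = ("S", [subj, ("VP", [saw, ("NP", [the_man, pp])])])

p_verb, p_noun = tree_prob(verb_attach), tree_prob(noun_attach)
print(f"verb attachment: {p_verb:.5f}")
print(f"noun attachment: {p_noun:.5f}")
print("preferred:", "verb" if p_verb > p_noun else "noun")
```

With these particular numbers the verb attachment wins; changing the probability of VP → VP PP or NP → NP PP flips the decision, which is exactly the knob a treebank-estimated PCFG sets from data.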

Section 08

Constituency Parsing in Python — NLTK & Stanza

Method 1 — NLTK with a Pre-defined Grammar

import nltk
from nltk import CFG, ChartParser

# Download NLTK tokenizer data (once)
nltk.download("punkt", quiet=True)

# Define a simple Context-Free Grammar
grammar = CFG.fromstring("""
  S   -> NP VP
  VP  -> VBZ PP | VBZ NP | VBD NP
  PP  -> IN NP
  NP  -> DT JJ JJ NN | DT JJ NN | DT NN | NNP
  DT  -> 'the' | 'a' | 'The'
  JJ  -> 'quick' | 'brown' | 'lazy' | 'old'
  NN  -> 'fox' | 'dog' | 'man'
  NNP -> 'Alice' | 'Bob'
  VBZ -> 'jumps' | 'runs'
  VBD -> 'saw'
  IN  -> 'over' | 'with'
""")

parser = ChartParser(grammar)
tokens = ["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]

for tree in parser.parse(tokens):
    tree.pretty_print()
    print("Tree string:", str(tree))

OUTPUT

                S
         _______|_____________
        NP                    VP
     ___|________         ____|_____
 DT   JJ    JJ    NN    VBZ         PP
 |    |     |     |      |       ___|___
The quick brown  fox   jumps     IN     NP
                                 |    __|___
                                over  DT JJ  NN
                                      |  |   |
                                     the lazy dog

Tree string: (S
  (NP (DT The) (JJ quick) (JJ brown) (NN fox))
  (VP (VBZ jumps) (PP (IN over) (NP (DT the) (JJ lazy) (NN dog)))))

Method 2 — Stanza (Neural Shift-Reduce Parser)

import stanza

# Download English models (first time only)
# stanza.download("en")

# Initialize with constituency parse component
nlp = stanza.Pipeline(
    lang="en",
    processors="tokenize,pos,constituency",
    use_gpu=False
)

text = "Alice quickly sent Bob the urgent report on AI safety."
doc = nlp(text)

for sentence in doc.sentences:
    print("Constituency Tree:")
    print(sentence.constituency)
    print()
    # Access tree nodes programmatically
    tree = sentence.constituency
    print("Root label:     ", tree.label)
    print("Root children:  ", [child.label for child in tree.children])

OUTPUT

Constituency Tree:
(ROOT (S (NP (NNP Alice)) (VP (ADVP (RB quickly)) (VBD sent) (NP (NNP Bob)) (NP (DT the) (JJ urgent) (NN report) (PP (IN on) (NP (NN AI) (NN safety)))))))

Root label:      ROOT
Root children:   ['S']

Method 3 — Extracting All Noun Phrases (Chunking / Constituency)

def extract_constituents(tree, target_label):
    """Recursively extract all phrases of a given type from a Stanza tree."""
    results = []
    if tree.label == target_label and not tree.is_leaf():
        results.append(" ".join(tree.leaf_labels()))
    for child in tree.children:
        results.extend(extract_constituents(child, target_label))
    return results

# Parse a sentence and extract all NPs and VPs
doc = nlp("The brilliant young researcher published a groundbreaking paper on quantum entanglement.")
tree = doc.sentences[0].constituency

noun_phrases = extract_constituents(tree, "NP")
verb_phrases  = extract_constituents(tree, "VP")

print("Noun Phrases:")
for np in noun_phrases:
    print(f"  NP: {np}")

print("\nVerb Phrases:")
for vp in verb_phrases:
    print(f"  VP: {vp}")

OUTPUT

Noun Phrases:
  NP: The brilliant young researcher
  NP: a groundbreaking paper on quantum entanglement
  NP: quantum entanglement

Verb Phrases:
  VP: published a groundbreaking paper on quantum entanglement

Section 09

Constituency Parsing with Transformers — Berkeley Neural Parser

Modern state-of-the-art constituency parsers use transformer encoders (BERT, RoBERTa) to produce rich contextualized embeddings, then apply a chart-based decoder. The Berkeley Neural Parser (benepar) integrates directly into spaCy.

# Installation
# pip install benepar
# python -c "import benepar; benepar.download('benepar_en3')"

import spacy
import benepar

# Add benepar to a spaCy pipeline
nlp = spacy.load("en_core_web_md")
# benepar_en3 is fetched by benepar.download() (see above), not installed as a
# spaCy package, so add the component unconditionally
nlp.add_pipe("benepar", config={"model": "benepar_en3"})

text = "The old man saw the woman with the telescope."
doc = nlp(text)

for sent in doc.sents:
    print("Parse string:", sent._.parse_string)
    print("Labels:      ", list(sent._.labels))
    print()
    # Access spans for each constituent
    for span in sent._.constituents:
        if span._.labels:
            print(f"  {str(span._.labels):20} → '{span.text}'")

OUTPUT

Parse string: (S (NP (DT The) (JJ old) (NN man)) (VP (VBD saw) (NP (DT the) (NN woman) (PP (IN with) (NP (DT the) (NN telescope))))))
Labels:       ['S']

  ('S',)               → 'The old man saw the woman with the telescope .'
  ('NP',)              → 'The old man'
  ('VP',)              → 'saw the woman with the telescope'
  ('NP',)              → 'the woman with the telescope'
  ('PP',)              → 'with the telescope'
  ('NP',)              → 'the telescope'

🌍

The Classic Attachment Ambiguity — Resolved!

Notice the parse above: "the woman with the telescope" is parsed as a single NP, meaning the PP "with the telescope" attaches to "the woman" (she has the telescope). An alternative parse would attach the PP to the VP headed by "saw" (the man used the telescope to see). This is the PP attachment ambiguity that has driven decades of NLP research. The parser commits to one reading; without wider context, neither is definitively correct. Neural parsers choose between the readings using contextual embeddings, while grammar-only parsers have no basis for preferring one.

Section 10

Head-to-Head — Dependency vs. Constituency Parsing

Both paradigms model sentence structure, but they emphasize different aspects. The choice depends on your downstream task.

Property	Dependency Parsing	Constituency Parsing
Core Representation	Directed graph — word-to-word arcs with labels	Hierarchical tree — nested phrase nodes
What it captures	Grammatical functions (subject, object, modifier)	Phrasal groupings (NP, VP, PP boundaries)
Tree nodes	Only words (no abstract phrase nodes)	Words + abstract non-terminals (NP, VP…)
Speed	Faster — O(n) transition-based parsers	Slower — O(n³) chart algorithms
Language independence	High — Universal Dependencies covers 100+ languages	Lower — phrase structures vary by language
Best for	IE, QA, sentiment, relation extraction	Grammar induction, NLI, tree kernels
Human interpretability	Medium — arc labels can be cryptic	High — NP/VP very intuitive to linguists
Best Python library	spaCy, Stanza, UDPipe	benepar, Stanza, NLTK, AllenNLP

🎯

Practitioner's Rule — Which Parsing Should You Use?

Use Dependency Parsing (spaCy) when you need to know who did what to whom — information extraction, relation mining, aspect-based sentiment, clinical NLP.

Use Constituency Parsing (benepar/Stanza) when you need phrase boundaries — extracting all NPs for entity mentions, syntactic features for NLI models, tree-kernel SVMs, or linguistic research on phrase structure.

Many production pipelines use both: constituency for phrase chunking + dependency for relation labeling.

Section 11

Evaluation Metrics for Parsers

Dependency Parsing — Labeled & Unlabeled Attachment Score

UAS — Unlabeled Attachment Score

UAS = correct_heads / total_tokens

Fraction of tokens where the predicted head is correct — ignoring the arc label

LAS — Labeled Attachment Score

LAS = correct_head_AND_label / total_tokens

Fraction where BOTH head AND dependency label are correct — the standard metric
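
Both scores reduce to simple counting. The sketch below assumes each parse is given as a list of (head index, label) pairs, one per token, which is an encoding chosen for this example; the function name is hypothetical as well.

```python
def attachment_scores(gold, pred):
    """gold and pred are equal-length lists of (head_index, label) pairs, one per token."""
    if len(gold) != len(pred):
        raise ValueError("gold and predicted parses must cover the same tokens")
    total = len(gold)
    correct_heads = sum(g[0] == p[0] for g, p in zip(gold, pred))
    correct_both = sum(g == p for g, p in zip(gold, pred))
    return correct_heads / total, correct_both / total


# "Alice ate the pizza": heads are 1-based token indices, 0 means ROOT.
gold = [(2, "nsubj"), (0, "root"), (4, "det"), (2, "obj")]
pred = [(2, "nsubj"), (0, "root"), (4, "det"), (2, "iobj")]  # right head, wrong label

uas, las = attachment_scores(gold, pred)
print(f"UAS = {uas:.2f}, LAS = {las:.2f}")
```

Because a label can only be counted as correct when the head is also correct, LAS can never exceed UAS.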

Constituency Parsing — Labeled Bracketing F1

Precision

P = correct_brackets / predicted_brackets

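Of all predicted phrase brackets, what fraction exactly match a bracket in the gold tree?

Recall

R = correct_brackets / gold_brackets

Of all brackets in the gold tree, what fraction did the parser predict?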

F1 (EVALB Standard)

F1 = 2 × P × R / (P + R)

Harmonic mean of precision and recall on labeled brackets. Penn Treebank Section 23 benchmark.
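
As a rough sketch of the bracket-matching computation, the code below represents each tree as a set of (label, start, end) spans over token offsets and computes precision, recall (the fraction of gold brackets that were predicted), and F1. This is a simplification: the real EVALB tool applies further conventions, such as handling duplicate brackets and punctuation. The gold tree here is a hypothetical verb-attachment reading, compared against the noun-attachment parse from the benepar example.

```python
def bracket_f1(gold, pred):
    """gold and pred are sets of (label, start, end) spans over token offsets."""
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1


# "The old man saw the woman with the telescope" (tokens 0..8, end-exclusive spans).
# Hypothetical gold tree: the PP attaches to the VP.
gold = {("S", 0, 9), ("NP", 0, 3), ("VP", 3, 9), ("NP", 4, 6), ("PP", 6, 9), ("NP", 7, 9)}
# Predicted tree: the PP attaches inside the object NP.
pred = {("S", 0, 9), ("NP", 0, 3), ("VP", 3, 9), ("NP", 4, 9), ("PP", 6, 9), ("NP", 7, 9)}

precision, recall, f1 = bracket_f1(gold, pred)
print(f"P = {precision:.3f}, R = {recall:.3f}, F1 = {f1:.3f}")
```

A single attachment mistake changes only one bracket here, which is why F1 stays high even when the reading of the sentence is different.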

Parser	Type	En LAS / F1	Speed	Notes
spaCy en_core_web_trf	Dependency	~92.5 LAS	Medium	Transformer-based, production-ready
spaCy en_core_web_sm	Dependency	~89.8 LAS	Very Fast	Good for most tasks
Stanza (biaffine)	Dependency	~93.0 LAS	Medium	State-of-the-art on UD benchmarks
benepar_en3	Constituency	~95.9 F1	Slow (GPU recommended)	Among the most accurate English constituency parsers
Stanza constituency	Constituency	~94.8 F1	Medium	Multi-language support, easy API
NLTK Chart Parser	Constituency	Depends on grammar	Fast for small grammars	Educational use, custom grammars

Section 12

End-to-End Pipeline — Combining Both Parsers

Real NLP systems often combine dependency and constituency information. Below is a compact demonstration pipeline that runs both parsers and prints structural features extracted from text.

import spacy
import stanza
from collections import defaultdict

# ── 1. Dependency parsing with spaCy ─────────────────────────
dep_nlp = spacy.load("en_core_web_sm")

# ── 2. Constituency parsing with Stanza ──────────────────────
con_nlp = stanza.Pipeline(
    lang="en",
    processors="tokenize,pos,constituency",
    use_gpu=False,
    verbose=False
)

def full_parse_report(text):
    """Run both parsers and print a combined structural report."""
    print(f"\n{'='*60}")
    print(f"TEXT: {text}")
    print('='*60)

    # ── Dependency features ──────────────────────────────────────
    dep_doc = dep_nlp(text)
    relations = defaultdict(list)
    for token in dep_doc:
        relations[token.dep_].append(token.text)

    print("\n[DEPENDENCY]")
    print(f"  Root:    {[t.text for t in dep_doc if t.dep_ == 'ROOT']}")
    print(f"  Subjects:{relations.get('nsubj', [])}")
    print(f"  Objects: {relations.get('dobj', relations.get('obj', []))}")
    print(f"  Modifiers:{relations.get('amod', [])}")

    # ── Constituency features ────────────────────────────────────
    con_doc = con_nlp(text)
    print("\n[CONSTITUENCY]")
    for sent in con_doc.sentences:
        tree = sent.constituency
        nps = _get_phrases(tree, "NP")
        vps = _get_phrases(tree, "VP")
        print(f"  Noun Phrases:  {nps}")
        print(f"  Verb Phrases:  {vps}")

def _get_phrases(tree, label):
    results = []
    if tree.label == label and len(tree.children) > 0:
        results.append(" ".join(tree.leaf_labels()))
    for c in tree.children:
        results.extend(_get_phrases(c, label))
    return results

# Run on sample sentences
full_parse_report("The brilliant scientist discovered a new vaccine.")
full_parse_report("The city council approved the new budget yesterday.")

OUTPUT


============================================================
TEXT: The brilliant scientist discovered a new vaccine.
============================================================

[DEPENDENCY]
  Root:    ['discovered']
  Subjects:['scientist']
  Objects: ['vaccine']
  Modifiers:['brilliant', 'new']

[CONSTITUENCY]
  Noun Phrases:  ['The brilliant scientist', 'a new vaccine']
  Verb Phrases:  ['discovered a new vaccine']

============================================================
TEXT: The city council approved the new budget yesterday.
============================================================

[DEPENDENCY]
  Root:    ['approved']
  Subjects:['council']
  Objects: ['budget']
  Modifiers:['new']

[CONSTITUENCY]
  Noun Phrases:  ['The city council', 'the new budget']
  Verb Phrases:  ['approved the new budget yesterday']

Note that "city" in "city council" is a noun-noun compound, which spaCy labels compound rather than amod, so it does not appear under Modifiers.

Section 13

Common Pitfalls and How to Avoid Them

⚠️ Parser Gotchas — Non-Negotiable Checks

Never assume one sentence = one tree. If your text has multiple sentences and you parse the whole string at once, spaCy and Stanza will correctly split them — but NLTK's ChartParser does not handle sentence splitting. Always pre-tokenize into sentences first.

Dependency labels differ between frameworks. spaCy's English models use ClearNLP-style labels (nsubj, dobj, pobj), while Stanza follows Universal Dependencies v2 (nsubj, obj). dobj in spaCy corresponds to obj in UD. Always check which label set your library uses before writing hard-coded string matches.

Domain mismatch degrades accuracy drastically. Most standard parsers are trained on newswire and general web text (Penn Treebank, OntoNotes, UD English-EWT). On social media posts, medical text, or legal documents, LAS can drop by 5–15 points. Fine-tune on domain data or use domain-specific models (e.g., scispaCy for biomedical text).

Long sentences explode parse time. Constituency parsing is O(n³) in sentence length. A 200-word sentence takes ~64× longer than a 50-word sentence. For long documents, consider sentence splitting aggressively or using dependency parsing (O(n)) instead.

PP attachment is the hardest problem in parsing. Prepositional phrase attachment (does the PP modify the noun or the verb?) remains the most frequent error in all parsers. If your downstream task is sensitive to PP attachment (e.g., "give medication with food" vs. "give medication, with food as context"), validate manually on your domain.

Negation is not captured by parse structure alone. In "The patient shows no signs of infection", the dependency tree attaches "signs" to "shows" as its object, and the negation surfaces only as the determiner "no" on "signs". Detecting negation requires tracking negation cues (neg arcs, negative determiners, verbs such as "denies") and their scope. Handle negation as a separate, explicit post-processing step.

Do not serialize Token objects directly. spaCy Doc objects are in-memory; if you need to store parse results, serialize with doc.to_json() or use DocBin for large corpora. Never pickle raw spaCy Docs for production storage.
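
One way to guard against the label mismatch described in this list is to normalize labels before matching on them. The mapping below is a partial, illustrative sketch covering only a few labels that differ between spaCy's English models and UD v2; the names `SPACY_TO_UD`, `normalize_label` and `is_object` are hypothetical, and the table would need extending for real use.

```python
# Partial, illustrative mapping from spaCy's English labels to UD v2 labels.
SPACY_TO_UD = {
    "dobj": "obj",
    "nsubjpass": "nsubj:pass",
    "auxpass": "aux:pass",
}


def normalize_label(label):
    """Map a dependency label to a shared scheme; unknown labels pass through unchanged."""
    return SPACY_TO_UD.get(label, label)


def is_object(label):
    """True for a direct object under either label scheme."""
    return normalize_label(label) == "obj"


for raw in ("dobj", "obj", "nsubjpass", "nsubj"):
    print(f"{raw:10} -> {normalize_label(raw)}")
```

Downstream code can then test `is_object(token.dep_)` instead of hard-coding one framework's string.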

Section 14

Quick Reference — Cheat Sheet

Task	Parser Type	Library	Key API Call
Get dependency label	Dependency	spaCy	token.dep_
Get syntactic head	Dependency	spaCy	token.head
Get direct children	Dependency	spaCy	token.children
Get entire subtree	Dependency	spaCy	token.subtree
Visualize dep tree	Dependency	spaCy	displacy.render(doc, style="dep")
Get constituency parse string	Constituency	Stanza	sent.constituency
Get constituency parse string	Constituency	benepar+spaCy	span._.parse_string
Get phrase spans	Constituency	benepar+spaCy	span._.constituents
Chart parse with custom grammar	Constituency	NLTK	ChartParser(grammar).parse(tokens)
Load CoNLL-U data for LAS/UAS evaluation	Dependency	Stanza	stanza.utils.conll.CoNLL.conll2dict()

🏆

Summary — The Two Lenses of Sentence Structure

Dependency Parsing gives you the functional skeleton of a sentence: a directed graph telling you who is the subject, what is the object, and which word modifies which. It is fast, cross-lingual, and well suited to extraction tasks.

Constituency Parsing gives you the phrasal architecture — a nested tree of noun phrases, verb phrases, and prepositional phrases that reflects the hierarchical building blocks of syntax. It is richer linguistically and ideal for phrase-boundary detection and syntactic feature engineering.

Master both, and you hold two X-ray machines for the human sentence.