Information Extraction in NLP

Section 01

The Story That Explains Information Extraction

📖 Real World Analogy

The Detective and the Mountain of Evidence

Imagine a detective who receives a room full of newspapers, reports, and witness statements after a major crime. There are thousands of pages. Somewhere inside this mountain of unstructured text lies the answer: who did it, where they were, when it happened, and what they used.

The detective cannot read everything. Instead, they skim for names, locations, dates, and relationships — circling, highlighting, connecting dots. What the detective does manually, Information Extraction (IE) does automatically, at scale, across millions of documents in seconds.

IE is the art of turning messy, unstructured human language into clean, structured, machine-readable facts.

Information Extraction (IE) is a subfield of Natural Language Processing (NLP) that automatically pulls structured data — names, dates, relationships, events — from raw text. Where a search engine finds documents, IE finds facts inside those documents.

🔍

What IE Actually Does

Input: "Apple Inc. was founded by Steve Jobs in Cupertino on April 1, 1976."

Output:
{ entity: "Apple Inc.", type: "ORG" }
{ entity: "Steve Jobs", type: "PERSON" }
{ entity: "Cupertino", type: "GPE" }
{ entity: "April 1, 1976", type: "DATE" }
{ relation: ("Apple Inc.", "founded_by", "Steve Jobs") }

Section 02

The Six Core Tasks of Information Extraction

IE is not a single task — it is a family of related tasks. Understanding each one is essential before writing a single line of code.

🏷️

Named Entity Recognition (NER)

Who, Where, When, What

Identifies and classifies named entities in text — people, organisations, locations, dates, money, products. The foundation of almost every IE pipeline.

🔗

Relation Extraction (RE)

How entities connect

Identifies semantic relationships between entities — "founded_by", "works_for", "located_in". Turns isolated facts into a knowledge graph.

📋

Event Extraction (EE)

What happened

Detects events and their arguments — the merger, the attack, the product launch — along with who was involved, when, and where.

👥

Co-reference Resolution

Pronoun linking

Determines that "he", "the CEO", and "Elon Musk" all refer to the same entity within a document. Critical for accurate downstream RE and EE.

📄

Template Filling

Slot extraction

Given a predefined schema — e.g. a job vacancy template with company, role, salary, location slots — fills those slots from unstructured job postings.

🔭

Open IE

No predefined schema

Extracts (subject, relation, object) triples from text without a fixed schema — useful for knowledge base construction where you don't know what relations exist in advance.
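Of the tasks above, template filling is the easiest to sketch without a trained model. The example below is a minimal illustration only: the `JobPosting` schema, the `SLOT_PATTERNS` regular expressions, and the sample posting are all assumptions made for this sketch, and a real system would usually combine NER output with learned slot classifiers.

```python
import re
from dataclasses import dataclass
from typing import Optional

@dataclass
class JobPosting:
    """Hypothetical template: every field is a slot to fill."""
    company: Optional[str] = None
    role: Optional[str] = None
    salary: Optional[str] = None
    location: Optional[str] = None

# Illustrative patterns only; real postings need far more robust rules.
SLOT_PATTERNS = {
    "company": re.compile(r"^(?P<value>[A-Z][\w&. ]+?) is hiring"),
    "role": re.compile(r"is hiring an? (?P<value>[\w ]+?) in "),
    "location": re.compile(r" in (?P<value>[A-Z][\w ]+?)[.,]"),
    "salary": re.compile(
        r"(?P<value>[$€£]\d[\d,]*(?:\s*(?:-|to)\s*[$€£]\d[\d,]*)?)"
    ),
}

def fill_template(text: str) -> JobPosting:
    """Fill each slot with the first regex match; unmatched slots stay None."""
    posting = JobPosting()
    for slot, pattern in SLOT_PATTERNS.items():
        match = pattern.search(text)
        if match:
            setattr(posting, slot, match.group("value").strip())
    return posting

posting = fill_template(
    "Acme Corp is hiring a Senior Data Engineer in Berlin. "
    "Salary: €70,000 to €85,000."
)
print(posting)
```

Slots that no pattern matches stay `None`, which makes missing information explicit instead of guessed.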

Section 03

Named Entity Recognition — Deep Dive

📖 Story

The HR Manager Who Reads 10,000 CVs a Day

A recruitment firm receives 10,000 CVs daily. Each CV mentions universities, companies, job titles, and skills — all buried in different formats and styles. An NER system reads every CV, tags every entity, and populates a structured database: Harvard → ORG/EDU, Software Engineer → TITLE, Python → SKILL, New York → GPE. In seconds, the hiring manager can query: "Show all candidates with a Computer Science degree from an Ivy League school who list TensorFlow." Without NER, this would take a human team weeks.

NER assigns a label to each named entity span in text. CoNLL-2003 defines only four types (PER, ORG, LOC, MISC); the richer OntoNotes 5 label set used by spaCy's English models includes:

🏷️ Standard NER Entity Types

PERSON

People — real or fictional. "Barack Obama", "Sherlock Holmes"

ORG

Companies, agencies, institutions. "Google", "United Nations", "MIT"

GPE

Geo-political entities — countries, cities, states. "France", "New York"

LOC

Non-GPE locations — mountains, rivers, oceans. "Amazon River", "Mount Everest"

DATE

Absolute or relative dates. "June 5, 2023", "last Tuesday", "the 1990s"

MONEY

Monetary values. "$4.2 billion", "€500", "fifty dollars"

PRODUCT

Products, vehicles, software. "iPhone 15", "Tesla Model S"

📚

spaCy's Extended Entity Types

spaCy's en_core_web_trf model recognises 18 entity types, including LAW (legal documents), LANGUAGE, WORK_OF_ART, CARDINAL (numerals), PERCENT, and QUANTITY. Always inspect your model's label set before deployment — different models have different schemas.

Section 04

NER with spaCy — Full Code Walkthrough

spaCy is one of the most widely used production NER libraries in Python, and its transformer-based pipelines score well on standard NER benchmarks. Let's walk through a complete NER example, from installing the model to grouping the extracted entities.

# Step 1 — Install and load
# pip install spacy
# python -m spacy download en_core_web_trf

import spacy
from spacy import displacy

# Load transformer-based model (most accurate)
nlp = spacy.load("en_core_web_trf")

# Our sample news article
text = """
Elon Musk's SpaceX successfully launched the Falcon 9 rocket from 
Kennedy Space Center in Florida on December 3, 2023. The mission, 
valued at approximately $67 million, delivered 22 Starlink satellites 
into low Earth orbit. NASA Administrator Bill Nelson praised the launch, 
calling it a milestone for the Artemis programme.
"""

# Run the NLP pipeline
doc = nlp(text)

# Print all detected entities
print("=== DETECTED ENTITIES ===")
for ent in doc.ents:
    print(f"  {ent.text:30s} → {ent.label_:12s} ({spacy.explain(ent.label_)})")

# Group entities by type
from collections import defaultdict
by_type = defaultdict(list)
for ent in doc.ents:
    by_type[ent.label_].append(ent.text)

print("\n=== GROUPED BY TYPE ===")
for label, entities in sorted(by_type.items()):
    print(f"  {label}: {entities}")

OUTPUT

=== DETECTED ENTITIES ===
  Elon Musk                      → PERSON       (People, including fictional)
  SpaceX                         → ORG          (Companies, agencies, institutions)
  Falcon 9                       → PRODUCT      (Objects, vehicles, foods, etc.)
  Kennedy Space Center           → FAC          (Buildings, airports, highways)
  Florida                        → GPE          (Countries, cities, states)
  December 3, 2023               → DATE         (Absolute or relative dates)
  $67 million                    → MONEY        (Monetary values)
  22                             → CARDINAL     (Numerals that do not fall under another type)
  Starlink                       → ORG          (Companies, agencies, institutions)
  Earth                          → LOC          (Non-GPE locations, mountain ranges, bodies of water)
  NASA                           → ORG          (Companies, agencies, institutions)
  Bill Nelson                    → PERSON       (People, including fictional)
  Artemis                        → WORK_OF_ART  (Titles of books, songs, etc.)

=== GROUPED BY TYPE ===
  CARDINAL: ['22']
  DATE: ['December 3, 2023']
  FAC: ['Kennedy Space Center']
  GPE: ['Florida']
  LOC: ['Earth']
  MONEY: ['$67 million']
  ORG: ['SpaceX', 'Starlink', 'NASA']
  PERSON: ['Elon Musk', 'Bill Nelson']
  PRODUCT: ['Falcon 9']
  WORK_OF_ART: ['Artemis']

⚡

Choosing the Right spaCy Model

en_core_web_sm — fastest, least accurate, good for prototyping.
en_core_web_md / lg — adds word vectors, better generalisation.
en_core_web_trf — RoBERTa-based, most accurate, requires GPU for speed.
In production, always benchmark on your domain's text — a model trained on news may underperform on medical or legal documents.

Section 05

NER with Hugging Face Transformers — Token Classification

For custom domains or when you need fine-grained control, Hugging Face's transformers library lets you use BERT, RoBERTa, and other models with a token classification head — this is the modern standard for state-of-the-art NER.

from transformers import pipeline, AutoTokenizer, AutoModelForTokenClassification

# Using a pretrained NER model from the Hub
ner_pipeline = pipeline(
    "ner",
    model="dslim/bert-base-NER",  # CoNLL-2003 trained BERT
    aggregation_strategy="simple"  # merge subword tokens
)

text = "Jeff Bezos founded Amazon in Bellevue, Washington, in July 1994."

results = ner_pipeline(text)

print("Entity Extraction Results:")
for entity in results:
    print(
        f"  [{entity['entity_group']:6s}] "
        f"'{entity['word']}' "
        f"(score: {entity['score']:.4f}, "
        f"chars: {entity['start']}–{entity['end']})"
    )

# Fine-tuning example skeleton — training your own NER model
from transformers import TrainingArguments, Trainer

# Labels in BIO format: B-PER, I-PER, B-ORG, I-ORG, B-LOC, I-LOC, O
label2id = {
    "O": 0, "B-PER": 1, "I-PER": 2,
    "B-ORG": 3, "I-ORG": 4,
    "B-LOC": 5, "I-LOC": 6
}

# Load base model with classification head
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-cased",
    num_labels=len(label2id),
    id2label={v: k for k, v in label2id.items()},
    label2id=label2id
)
print("Model ready for fine-tuning on custom NER data.")

OUTPUT

Entity Extraction Results:
  [PER   ] 'Jeff Bezos' (score: 0.9991, chars: 0–10)
  [ORG   ] 'Amazon' (score: 0.9987, chars: 19–25)
  [LOC   ] 'Bellevue' (score: 0.9976, chars: 29–37)
  [LOC   ] 'Washington' (score: 0.9981, chars: 39–49)
Model ready for fine-tuning on custom NER data.

Scores are illustrative and vary by model version. The CoNLL-2003 label set has no date type, so "July 1994" is normally left untagged by this model.

⚠️

The BIO Tagging Scheme — Don't Skip This

NER models tag each token using the BIO scheme: B (beginning of an entity), I (inside an entity), and O (outside any entity). "New York City" becomes B-LOC I-LOC I-LOC. The aggregation_strategy="simple" parameter in the pipeline merges these tags back into entity spans automatically. When training your own model, you must convert your annotation spans to BIO format yourself, and this is where most beginners make mistakes. Always validate your label alignment after tokenisation, because subword tokenisers split words into several pieces.
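The span-to-BIO conversion described above can be sketched in a few lines. This is a minimal illustration, assuming character-offset annotations and a plain whitespace tokenizer as a stand-in for a real (subword) tokenizer; `tokenize_with_offsets` and `spans_to_bio` are hypothetical helper names, not part of any library.

```python
from typing import List, Tuple

Token = Tuple[str, int, int]   # (text, start_char, end_char)
Span = Tuple[int, int, str]    # (start_char, end_char, label)

def tokenize_with_offsets(text: str) -> List[Token]:
    """Whitespace tokenizer that keeps character offsets."""
    tokens, pos = [], 0
    for piece in text.split():
        start = text.index(piece, pos)
        tokens.append((piece, start, start + len(piece)))
        pos = start + len(piece)
    return tokens

def spans_to_bio(tokens: List[Token], spans: List[Span]) -> List[str]:
    """Convert character-level spans to one BIO tag per token."""
    tags = ["O"] * len(tokens)
    for span_start, span_end, label in spans:
        # Tokens that lie fully inside the annotated span.
        covered = [i for i, (_, s, e) in enumerate(tokens)
                   if s >= span_start and e <= span_end]
        # Alignment check: the span must start and end on token boundaries.
        if (not covered
                or tokens[covered[0]][1] != span_start
                or tokens[covered[-1]][2] != span_end):
            raise ValueError(
                f"span {span_start}-{span_end} does not align with tokens"
            )
        tags[covered[0]] = f"B-{label}"
        for i in covered[1:]:
            tags[i] = f"I-{label}"
    return tags

text = "She moved to New York City in 2019"
tokens = tokenize_with_offsets(text)
tags = spans_to_bio(tokens, [(13, 26, "LOC")])
for (word, _, _), tag in zip(tokens, tags):
    print(f"{word:8s} {tag}")
```

Raising on a misaligned span is a deliberate choice in this sketch: silently truncating an annotation is exactly the kind of label-alignment bug that is hard to spot later in training.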

Section 06

Relation Extraction — Connecting the Dots

📖 Story

The LinkedIn Database That Builds Itself

LinkedIn knows that "Satya Nadella" works_at "Microsoft" and that "Microsoft" acquired "LinkedIn". But how could a knowledge base like that be built without humans entering every fact? With Relation Extraction: reading millions of news articles and automatically identifying when two named entities in a sentence bear a specific relationship. "Tim Cook, CEO of Apple, announced…" → (Tim Cook, has_title, CEO) + (Tim Cook, employed_by, Apple). Research that would take analysts months can be run in a fraction of a second per document.

Relation Extraction (RE) takes a pair of entities and a sentence as input, and outputs the relationship type (or "no relation"). Modern RE uses span-based models or generative LLMs for this task.

import spacy
from itertools import combinations

nlp = spacy.load("en_core_web_trf")

# Rule-based relation extraction using dependency parsing
def extract_subject_verb_object(doc):
    """Extract (subject, verb, object) triples using dependency parse."""
    triples = []
    for token in doc:
        # Find verbs
        if token.pos_ == "VERB":
            subj = [t for t in token.lefts  if t.dep_ in ("nsubj", "nsubjpass")]
            obj  = [t for t in token.rights if t.dep_ in ("dobj", "pobj", "attr")]
            if subj and obj:
                triples.append((
                    " ".join([t.text for t in subj[0].subtree]),
                    token.lemma_,
                    " ".join([t.text for t in obj[0].subtree])
                ))
    return triples

text = """
Microsoft acquired LinkedIn for $26.2 billion in 2016.
Satya Nadella leads Microsoft as its CEO.
LinkedIn connects professionals across the world.
"""

doc = nlp(text)
triples = extract_subject_verb_object(doc)

print("=== EXTRACTED RELATION TRIPLES ===")
for subj, verb, obj in triples:
    print(f"  ({subj!r}, '{verb}', {obj!r})")

# Also show entity co-occurrence (simpler baseline)
print("\n=== ENTITY CO-OCCURRENCE IN SAME SENTENCE ===")
for sent in doc.sents:
    ents = [(e.text, e.label_) for e in sent.ents]
    if len(ents) >= 2:
        for (a, ta), (b, tb) in combinations(ents, 2):
            print(f"  [{ta}] '{a}'  ↔  [{tb}] '{b}'")

OUTPUT

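=== EXTRACTED RELATION TRIPLES ===
  ('Microsoft', 'acquire', 'LinkedIn for $ 26.2 billion in 2016')
  ('Satya Nadella', 'lead', 'Microsoft as its CEO')
  ('LinkedIn', 'connect', 'professionals across the world')

=== ENTITY CO-OCCURRENCE IN SAME SENTENCE ===
  [ORG] 'Microsoft'  ↔  [ORG] 'LinkedIn'
  [ORG] 'Microsoft'  ↔  [MONEY] '$26.2 billion'
  [ORG] 'Microsoft'  ↔  [DATE] '2016'
  [ORG] 'LinkedIn'  ↔  [MONEY] '$26.2 billion'
  [ORG] 'LinkedIn'  ↔  [DATE] '2016'
  [MONEY] '$26.2 billion'  ↔  [DATE] '2016'
  [PERSON] 'Satya Nadella'  ↔  [ORG] 'Microsoft'

Object spans depend on where the parser attaches prepositional phrases, so your triples may be shorter. The co-occurrence baseline lists every entity pair in a sentence, which is why the first sentence alone yields six pairs.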
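The dependency triples above are untyped, whereas RE as defined earlier maps an entity pair plus its sentence to a relation label or "no relation". The following is a minimal rule-based sketch of that interface, assuming the sentence has already been lemmatised and its entities tagged; the `Entity` tuple, the `RELATION_RULES` table, and the relation names are hypothetical and stand in for a trained classifier.

```python
from typing import List, NamedTuple

class Entity(NamedTuple):
    text: str
    label: str
    start: int  # index of the first token
    end: int    # index one past the last token

# Hypothetical schema: (subject type, trigger lemma, object type) -> relation
RELATION_RULES = {
    ("ORG", "acquire", "ORG"): "acquired",
    ("PERSON", "lead", "ORG"): "leads",
    ("PERSON", "found", "ORG"): "founded",
}

def classify_pair(lemmas: List[str], subj: Entity, obj: Entity) -> str:
    """Label an ordered entity pair using the lemmas between the two mentions."""
    for lemma in lemmas[subj.end:obj.start]:
        relation = RELATION_RULES.get((subj.label, lemma, obj.label))
        if relation:
            return relation
    return "no_relation"

# Pre-annotated sentence: "Microsoft acquired LinkedIn for $26.2 billion in 2016."
lemmas = ["Microsoft", "acquire", "LinkedIn", "for", "$", "26.2",
          "billion", "in", "2016", "."]
microsoft = Entity("Microsoft", "ORG", 0, 1)
linkedin = Entity("LinkedIn", "ORG", 2, 3)
price = Entity("$26.2 billion", "MONEY", 4, 7)

print(classify_pair(lemmas, microsoft, linkedin))
print(classify_pair(lemmas, linkedin, price))
```

Returning an explicit `"no_relation"` label matters: most entity pairs that co-occur in a sentence are unrelated, and a system that always predicts some relation will flood the knowledge graph with false positives.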

Section 07

LLM-Powered Relation Extraction — The Modern Way

For complex, open-ended relation extraction, modern NLP practitioners increasingly use large language models with structured output prompting. This requires no training data for new relation types — you simply describe what you want.

import json
import re

# Prompt-based RE using any LLM API (example structure)
SYSTEM_PROMPT = """You are an expert information extraction system.
Extract all (subject, relation, object) triples from the text.
Return ONLY a valid JSON array of objects with keys: subject, relation, object.
Example: [{"subject": "Apple", "relation": "founded_by", "object": "Steve Jobs"}]"""

def extract_relations_llm(text: str, llm_client) -> list:
    """Extract relations using an LLM with structured output."""
    response = llm_client.complete(
        system=SYSTEM_PROMPT,
        user=f"Extract all relations from:\n\n{text}",
        temperature=0  # minimise sampling randomness for IE tasks
    )
    try:
        # Strip markdown fences if present
        clean = re.sub(r'```(?:json)?|```', '', response.text).strip()
        return json.loads(clean)
    except json.JSONDecodeError:
        return []

# Example expected output for:
# "Tesla CEO Elon Musk acquired Twitter for $44 billion in October 2022"
example_output = [
    {"subject": "Elon Musk", "relation": "is_ceo_of",  "object": "Tesla"},
    {"subject": "Elon Musk", "relation": "acquired",    "object": "Twitter"},
    {"subject": "acquisition", "relation": "valued_at", "object": "$44 billion"},
    {"subject": "acquisition", "relation": "date",      "object": "October 2022"}
]
print(json.dumps(example_output, indent=2))

OUTPUT

[
  {
    "subject": "Elon Musk",
    "relation": "is_ceo_of",
    "object": "Tesla"
  },
  {
    "subject": "Elon Musk",
    "relation": "acquired",
    "object": "Twitter"
  },
  {
    "subject": "acquisition",
    "relation": "valued_at",
    "object": "$44 billion"
  },
  {
    "subject": "acquisition",
    "relation": "date",
    "object": "October 2022"
  }
]

🌟

LLMs vs Traditional RE Models

Traditional RE models need labelled training data for each relation type. If you want to extract a new relation — say patient_diagnosed_with in medical records — you need hundreds of annotated examples. An LLM can do this zero-shot from a description alone. The trade-off: LLMs are slower and more expensive per call, but dramatically faster to deploy. For high-volume production, use a fine-tuned traditional model. For rapid prototyping, use LLMs.
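Whichever LLM you call, its reply should be treated as untrusted input. Below is a minimal sketch of a stricter parser that checks the shape of every triple, not only that the JSON parses. It uses the standard library only; `parse_triples` and `REQUIRED_KEYS` are names invented for this example, and the raw reply is a hand-written stand-in for a model response.

```python
import json
import re
from typing import Dict, List

REQUIRED_KEYS = ("subject", "relation", "object")
FENCE_RE = re.compile(r"```(?:json)?")

def parse_triples(raw: str) -> List[Dict[str, str]]:
    """Parse an LLM reply expected to contain a JSON array of triples."""
    cleaned = FENCE_RE.sub("", raw).strip()
    try:
        data = json.loads(cleaned)
    except json.JSONDecodeError:
        return []
    if not isinstance(data, list):
        return []
    valid = []
    for item in data:
        # Keep only objects whose three required values are non-empty strings.
        if not isinstance(item, dict):
            continue
        if all(isinstance(item.get(key), str) and item[key].strip()
               for key in REQUIRED_KEYS):
            valid.append({key: item[key].strip() for key in REQUIRED_KEYS})
    return valid

# Hand-written stand-in for a model reply: one good triple, two bad ones.
raw_reply = """```json
[
  {"subject": "Elon Musk", "relation": "acquired", "object": "Twitter"},
  {"subject": "Tesla"},
  {"subject": "acquisition", "relation": 44, "object": "$44 billion"}
]
```"""

print(parse_triples(raw_reply))
```

Dropping malformed items one by one, rather than rejecting the whole reply, keeps the usable triples from a long response; whether that is the right policy depends on how much you trust partial output.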

Section 08

Event Extraction — What Happened?

📖 Story

The Automated Financial News Monitor

A hedge fund monitors 50,000 news articles per day looking for company events: mergers, earnings reports, product launches, CEO resignations, regulatory actions. Each event is a trading signal. Miss one — lose millions. A human analyst cannot read 50,000 articles. An Event Extraction system can. It reads each article and asks: What happened? Who triggered it? Who was affected? When? Where? What was the outcome? Then it populates a structured event database that feeds the trading algorithm in real time.

Event Extraction identifies event triggers (verbs/nouns that signal an event) and their arguments (the roles filled by entities: Agent, Patient, Time, Place, Amount).

📋 Event Structure — ACE 2005 Framework

Trigger

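The word that most clearly expresses the event. In ACE 2005, "acquired" in "Microsoft acquired GitHub" triggers a Transaction:Transfer-Ownership event; Business:Merge-Org is reserved for organisations merging into a new one.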

Agent

Who initiated the event. In "Microsoft acquired GitHub", Microsoft is the agent.

Patient

Who or what is affected. GitHub is the patient/target of the acquisition.

Time

When the event occurred. "in June 2018"

Amount

Quantitative details. "for $7.5 billion"

Place

Where the event occurred. Location arguments.

import spacy
from dataclasses import dataclass, field
from typing import List, Optional

nlp = spacy.load("en_core_web_trf")

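# Financial event extraction using pattern matching + NER.
# Triggers are single-token lemmas because extract_events() checks one token
# at a time; multi-word phrases such as "take over" would never match here.
ACQUISITION_TRIGGERS = {"acquire", "buy", "purchase", "merge"}
LAYOFF_TRIGGERS      = {"layoff", "fire", "dismiss", "cut"}  # illustrative; not used below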

@dataclass
class ExtractedEvent:
    event_type: str
    trigger:    str
    agent:      Optional[str] = None
    patient:    Optional[str] = None
    amount:     Optional[str] = None
    time:       Optional[str] = None
    sentence:   str = ""

def extract_events(text: str) -> List[ExtractedEvent]:
    doc = nlp(text)
    events = []

    for sent in doc.sents:
        # Collect the entity types this event template needs, in sentence order
        sent_money = [e.text for e in sent.ents if e.label_ == "MONEY"]
        sent_date  = [e.text for e in sent.ents if e.label_ == "DATE"]
        sent_orgs  = [e.text for e in sent.ents if e.label_ == "ORG"]

        for token in sent:
            lemma = token.lemma_.lower()
            if lemma in ACQUISITION_TRIGGERS and len(sent_orgs) >= 2:
                events.append(ExtractedEvent(
                    event_type="ACQUISITION",
                    trigger=token.text,
                    agent=sent_orgs[0] if sent_orgs else None,
                    patient=sent_orgs[1] if len(sent_orgs) > 1 else None,
                    amount=sent_money[0] if sent_money else None,
                    time=sent_date[0]  if sent_date  else None,
                    sentence=sent.text.strip()
                ))
                break
    return events

news = """
Google acquired DeepMind for approximately $500 million in January 2014.
Amazon purchased Whole Foods Market for $13.7 billion in August 2017.
Microsoft bought Activision Blizzard for $68.7 billion in October 2023.
"""

for event in extract_events(news):
    print(f"\n[{event.event_type}]")
    print(f"  Trigger : {event.trigger}")
    print(f"  Acquirer: {event.agent}")
    print(f"  Target  : {event.patient}")
    print(f"  Amount  : {event.amount}")
    print(f"  Time    : {event.time}")

OUTPUT


[ACQUISITION]
  Trigger : acquired
  Acquirer: Google
  Target  : DeepMind
  Amount  : $500 million
  Time    : January 2014

[ACQUISITION]
  Trigger : purchased
  Acquirer: Amazon
  Target  : Whole Foods Market
  Amount  : $13.7 billion
  Time    : August 2017

[ACQUISITION]
  Trigger : bought
  Acquirer: Microsoft
  Target  : Activision Blizzard
  Amount  : $68.7 billion
  Time    : October 2023
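The lemma check in the extractor above compares one token at a time, so a phrase such as "take over" cannot match. One way to handle multi-word triggers is to match tuples of lemmas against the lemmatised sentence. The sketch below assumes the lemmas are already available as a list; `TRIGGER_PHRASES` and `find_trigger` are hypothetical names, and the company names are placeholders.

```python
from typing import List, Optional, Tuple

# Trigger phrases as tuples of lemmas (hypothetical list for illustration).
TRIGGER_PHRASES = [
    ("take", "over"),
    ("merge", "with"),
    ("acquire",),
    ("buy",),
    ("purchase",),
]

def find_trigger(lemmas: List[str]) -> Optional[Tuple[int, int]]:
    """Return (start, end) token indices of the earliest trigger phrase.

    At each position, longer phrases are tried first, so a two-word
    trigger wins over a one-word trigger starting at the same token.
    """
    phrases = sorted(TRIGGER_PHRASES, key=len, reverse=True)
    for i in range(len(lemmas)):
        for phrase in phrases:
            if tuple(lemmas[i:i + len(phrase)]) == phrase:
                return i, i + len(phrase)
    return None

# Lemmas for a placeholder sentence: "Acme agreed to take over Globex."
lemmas = ["acme", "agree", "to", "take", "over", "globex", "."]
span = find_trigger(lemmas)
print(span, lemmas[span[0]:span[1]])
```

The returned indices can be used in place of the single trigger token when filling the event's argument slots.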

Section 09

Open Information Extraction (OpenIE)

OpenIE extracts (subject, relation, object) triples without a predefined schema. It is ideal for knowledge base population when you don't know in advance what facts you'll find.

# Option 1: Stanford OpenIE via the stanford-openie Python wrapper (needs Java)
# pip install stanford-openie
# from openie import StanfordOpenIE

# Option 2 (used below): a simple dependency-based extractor
import spacy

nlp = spacy.load("en_core_web_trf")

def open_ie_extract(text: str) -> list:
    """Lightweight OpenIE using spaCy dependency parsing."""
    doc = nlp(text)
    triples = []

    for token in doc:
        if token.pos_ not in ("VERB", "AUX"):
            continue

        subjects = [w for w in token.lefts
                    if w.dep_ in ("nsubj", "nsubjpass", "csubj")]
        objects  = [w for w in token.rights
                    if w.dep_ in ("dobj", "pobj", "attr", "acomp")]

        for s in subjects:
            for o in objects:
                subj_phrase = " ".join(w.text for w in s.subtree
                                       if w.dep_ not in ("punct",))
                obj_phrase  = " ".join(w.text for w in o.subtree
                                       if w.dep_ not in ("punct",))
                triples.append({
                    "subject":  subj_phrase,
                    "relation": token.lemma_,
                    "object":   obj_phrase
                })
    return triples

text = """
Marie Curie discovered polonium and radium.
She won the Nobel Prize in Physics in 1903.
Her research pioneered the study of radioactivity.
"""

for t in open_ie_extract(text):
    print(f"  ({t['subject']!r}, {t['relation']!r}, {t['object']!r})")

OUTPUT

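  ('Marie Curie', 'discover', 'polonium and radium')
  ('She', 'win', 'the Nobel Prize in Physics')
  ('Her research', 'pioneer', 'the study of radioactivity')

Coordinated objects stay in one span here: "radium" is attached to "polonium" as a conjunct, so it sits inside that object's subtree. Splitting them into separate triples requires walking the conjuncts explicitly.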

Section 10

Co-reference Resolution — Who Is "She"?

📖 Story

The Puzzle of Pronouns

Consider this: "Serena Williams entered Wimbledon. She dominated the tournament. The champion returned home. Her fans celebrated." A human instantly knows that she, the champion, and her all refer to Serena Williams. A naive NER system would extract "Serena Williams", "she", "champion" as separate entities — and build a broken knowledge graph. Co-reference resolution chains all these mentions into a single entity cluster, so downstream IE is accurate.

# pip install coreferee
# python -m coreferee install en
import spacy
import coreferee

nlp = spacy.load("en_core_web_trf")
nlp.add_pipe("coreferee")  # add co-ref pipe

text = """
Elon Musk founded SpaceX in 2002. He also leads Tesla. 
The company recently released the Cybertruck. 
Musk said he was proud of it.
"""

doc = nlp(text)

print("Co-reference chains:")
for chain in doc._.coref_chains:
    mentions = [doc[mention[0]].text for mention in chain]
    print(f"  Chain: {' → '.join(mentions)}")

# Resolve pronouns for cleaner IE
def resolve_pronouns(doc) -> str:
    """Replace pronouns with their resolved referents."""
    tokens = [t.text for t in doc]
    for chain in doc._.coref_chains:
        if not chain:
            continue
        head_text = doc[chain[0][0]].text  # first mention = canonical
        for mention in chain[1:]:
            idx = mention[0]
            if doc[idx].pos_ == "PRON":
                tokens[idx] = head_text
    return " ".join(tokens)

print("\nResolved text:")
print(resolve_pronouns(doc))

OUTPUT

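Co-reference chains:
  Chain: Musk → He → Musk → he

Resolved text:
Elon Musk founded SpaceX in 2002 . Musk also leads Tesla . The company recently released the Cybertruck . Musk said Musk was proud of it .

This output is illustrative. coreferee identifies each mention by its head token, so the chain shows "Musk" rather than "Elon Musk". The function replaces pronouns only, so "The company" is left unchanged, and tokens are re-joined with single spaces. Additional chains (for example one linking "it" to "Cybertruck") may appear depending on the model version.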

Section 11

Building a Full IE Pipeline

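Real-world IE systems chain the tasks above into a single pipeline. The steps below describe the full sequence; the code that follows is a compact sketch covering NER, subject-verb-object relation extraction, and acquisition-event detection (co-reference resolution is omitted for brevity), and it returns a structured knowledge graph.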

Text Preprocessing

Sentence splitting, tokenisation, noise removal (HTML tags, special chars). Quality in = quality out.

Named Entity Recognition

Tag all named entities: PERSON, ORG, GPE, DATE, MONEY, PRODUCT, etc.

Co-reference Resolution

Chain all mentions of the same entity. Replace pronouns with canonical names.

Relation Extraction

For each entity pair in the same sentence, classify the relationship or extract open triples.

Event Detection

Identify event triggers and fill argument slots: Agent, Patient, Time, Place, Amount.

Knowledge Graph Export

Serialise to JSON-LD, RDF, or a graph database (Neo4j). Query with SPARQL or Cypher.

import spacy
import json
from dataclasses import dataclass, asdict, field
from typing import List, Dict, Any

nlp = spacy.load("en_core_web_trf")

@dataclass
class KnowledgeGraph:
    # default_factory gives every instance its own empty lists
    entities:  List[Dict[str, Any]] = field(default_factory=list)
    relations: List[Dict[str, Any]] = field(default_factory=list)
    events:    List[Dict[str, Any]] = field(default_factory=list)

    def to_json(self) -> str:
        return json.dumps(asdict(self), indent=2)

def full_ie_pipeline(text: str) -> KnowledgeGraph:
    doc = nlp(text)
    kg = KnowledgeGraph()

    # 1. Extract entities
    seen = set()
    for ent in doc.ents:
        key = (ent.text, ent.label_)
        if key not in seen:
            kg.entities.append({"text": ent.text, "type": ent.label_,
                                  "start": ent.start_char, "end": ent.end_char})
            seen.add(key)

    # 2. Extract relations (SVO)
    for token in doc:
        if token.pos_ == "VERB":
            subjs = [t for t in token.lefts  if t.dep_ in ("nsubj", "nsubjpass")]
            objs  = [t for t in token.rights if t.dep_ in ("dobj", "attr")]
            for s in subjs:
                for o in objs:
                    kg.relations.append({
                        "subject":  s.text,
                        "predicate": token.lemma_,
                        "object":   o.text
                    })

    # 3. Event detection (acquisition pattern)
    acquisition_verbs = {"acquire", "buy", "purchase", "merge"}
    for sent in doc.sents:
        orgs   = [e.text for e in sent.ents if e.label_ == "ORG"]
        money  = [e.text for e in sent.ents if e.label_ == "MONEY"]
        dates  = [e.text for e in sent.ents if e.label_ == "DATE"]
        for t in sent:
            if t.lemma_.lower() in acquisition_verbs and len(orgs) >= 2:
                kg.events.append({
                    "type": "ACQUISITION", "trigger": t.text,
                    "acquirer": orgs[0],  "target": orgs[1],
                    "amount": money[0] if money else None,
                    "date":   dates[0] if dates else None
                })
                break
    return kg

# Run the pipeline
text = "Adobe acquired Figma for $20 billion in September 2022, "\
       "but the deal was blocked by EU regulators in December 2023."

kg = full_ie_pipeline(text)
print(kg.to_json())

OUTPUT

{
  "entities": [
    {"text": "Adobe", "type": "ORG", "start": 0, "end": 5},
    {"text": "Figma", "type": "ORG", "start": 15, "end": 20},
    {"text": "$20 billion", "type": "MONEY", "start": 25, "end": 36},
    {"text": "September 2022", "type": "DATE", "start": 40, "end": 54},
    {"text": "EU", "type": "ORG", "start": 84, "end": 86},
    {"text": "December 2023", "type": "DATE", "start": 101, "end": 114}
  ],
  "relations": [
    {"subject": "Adobe", "predicate": "acquire", "object": "Figma"}
  ],
  "events": [
    {"type": "ACQUISITION", "trigger": "acquired", "acquirer": "Adobe", "target": "Figma", "amount": "$20 billion", "date": "September 2022"}
  ]
}

The output is shown one object per line for readability; json.dumps with indent=2 prints each key on its own line. The passive clause "the deal was blocked by EU regulators" yields no relation because the extractor only looks for direct objects, not passive agents.
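For the export step, one option is to turn the extracted relations into parameterised Cypher statements for a graph database. This is a minimal sketch and an assumption on top of the pipeline: the `Entity` node label, the `name` property, and both helper functions are invented for the example, and no database driver is called. Relationship types are written into the query text here, so they are sanitised first, while entity names travel as query parameters.

```python
import re
from typing import Any, Dict, List, Tuple

def to_rel_type(predicate: str) -> str:
    """Turn a predicate lemma into a safe, upper-case relationship type."""
    cleaned = re.sub(r"[^A-Za-z0-9]+", "_", predicate).strip("_").upper()
    if not cleaned:
        return "RELATED_TO"
    # Identifiers should not start with a digit, so add a prefix if needed.
    return f"R_{cleaned}" if cleaned[0].isdigit() else cleaned

def relations_to_cypher(
    relations: List[Dict[str, str]]
) -> List[Tuple[str, Dict[str, Any]]]:
    """Build (query, params) pairs; MERGE avoids duplicate nodes and edges."""
    statements = []
    for rel in relations:
        rel_type = to_rel_type(rel["predicate"])
        query = (
            "MERGE (s:Entity {name: $subject}) "
            "MERGE (o:Entity {name: $object}) "
            f"MERGE (s)-[:{rel_type}]->(o)"
        )
        params = {"subject": rel["subject"], "object": rel["object"]}
        statements.append((query, params))
    return statements

relations = [{"subject": "Adobe", "predicate": "acquire", "object": "Figma"}]
for query, params in relations_to_cypher(relations):
    print(query)
    print(params)
```

Each pair can then be passed to whatever client you use to run queries; keeping names in parameters rather than in the query string avoids quoting problems with entity text such as "O'Reilly".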

Section 12

Evaluation Metrics for IE Systems

How do you know if your IE system is any good? The standard metrics are Precision, Recall, and F1 — but calculated at the entity span level, not the token level.

Precision

TP / (TP + FP)

Of all entities your system extracted, how many were correct? Measures exactness — penalises false positives.

Recall

TP / (TP + FN)

Of all true entities in the text, how many did your system find? Measures completeness — penalises misses.

F1 Score

2 × P × R / (P + R)

Harmonic mean of Precision and Recall. The headline metric for NER benchmarks — balances both concerns.

Exact vs Partial

Span Overlap

Exact match: both boundary and label must match. Partial match: token overlap counts. Exact match is the stricter, preferred standard.

from seqeval.metrics import classification_report, f1_score
# pip install seqeval

# Gold (human-annotated) labels in BIO format
y_true = [[
    "B-PER", "I-PER",  # "Elon Musk"
    "O",               # "founded"
    "B-ORG",           # "SpaceX"
    "O", "O",          # "in" "2002"
]]

# Predicted labels from your model
y_pred = [[
    "B-PER", "I-PER",  # correct
    "O",               # correct
    "B-ORG",           # correct
    "O", "B-DATE",    # DATE predicted for "2002"; gold says O, so this is a false positive
]]

print(classification_report(y_true, y_pred))
print(f"Overall F1: {f1_score(y_true, y_pred):.4f}")

OUTPUT

              precision    recall  f1-score   support

        DATE       0.00      0.00      0.00         0
         ORG       1.00      1.00      1.00         1
         PER       1.00      1.00      1.00         1

   micro avg       0.67      1.00      0.80         2
   macro avg       0.67      0.67      0.67         2
weighted avg       1.00      1.00      1.00         2

Overall F1: 0.8000

The spurious DATE span costs precision (2 correct out of 3 predicted) while recall stays at 1.00, giving a micro-averaged F1 of 0.80.
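To see what span-level scoring does under the hood, here is a minimal standard-library sketch of exact-match precision, recall, and F1 computed from BIO tags. It is an illustration of the idea, not seqeval's implementation; `bio_to_spans` and `prf` are hypothetical helper names, and stray I- tags are treated leniently as the start of a new span.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int, str]  # (start_token, end_token_exclusive, label)

def bio_to_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from one BIO-tagged sentence."""
    spans: Set[Span] = set()
    start, label = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes the last span
        continues = tag.startswith("I-") and label == tag[2:]
        if start is not None and not continues:
            spans.add((start, i, label))
            start, label = None, None
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, label = i, tag[2:]
    return spans

def prf(gold: Set[Span], pred: Set[Span]) -> Tuple[float, float, float]:
    """Exact-match precision, recall, F1: boundaries and label must agree."""
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Tokens: She visited New York City
gold_tags = ["O", "O", "B-LOC", "I-LOC", "I-LOC"]
pred_tags = ["O", "O", "B-LOC", "I-LOC", "O"]  # model stopped at "New York"

scores = prf(bio_to_spans(gold_tags), bio_to_spans(pred_tags))
token_accuracy = sum(g == p for g, p in zip(gold_tags, pred_tags)) / len(gold_tags)

print(f"span-level P/R/F1: {scores}")
print(f"token accuracy:    {token_accuracy:.2f}")
```

The truncated span counts as both a false positive and a false negative, so the span-level scores are all zero even though four of the five token tags are right.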

Section 13

IE in Practice — Real-World Applications

🏥

Clinical NLP

Healthcare

Extracting diagnoses, medications, dosages, procedures from clinical notes and EHRs. Powers adverse drug event detection, patient cohort identification, and medical coding automation.

📈

Financial Intelligence

Finance

Extracting earnings figures, M&A events, regulatory actions from SEC filings, earnings calls, and news. Feeds quantitative trading signals and ESG risk assessments.

🔐

Cybersecurity

Security

Extracting threat actors, malware names, CVE IDs, attack vectors from threat reports. Automates STIX/TAXII knowledge base population for threat intelligence platforms.

⚖️

Legal Discovery

Legal

Extracting parties, dates, monetary amounts, legal citations from contracts and case documents. Reduces document review time from weeks to hours.

🔍

Search & Recommendation

Tech

Building knowledge graphs (Google Knowledge Graph, Bing entities) that power entity-based search, rich snippets, and personalised recommendations.

📰

Media Monitoring

Media

Tracking brand mentions, sentiment, competitor events across millions of news articles and social posts in real time. Powers PR dashboards and competitive intelligence.

Section 14

Approaches Comparison — When to Use What

Approach	Training Data Needed	Accuracy	Speed	Best For
Rule-Based (Regex/Patterns)	None	High for structured text	Very Fast	Dates, codes, structured formats (phone, email, ID)
spaCy (statistical)	Medium (fine-tune)	Good (85–90% F1)	Fast	General-purpose NER in production, real-time APIs
BERT / RoBERTa fine-tuned	Medium–Large	Excellent (90–95% F1)	Moderate (GPU needed)	Domain-specific NER (medical, legal, financial)
LLM Prompting (GPT/Claude)	None (zero-shot)	Excellent for novel tasks	Slow, expensive at scale	Rapid prototyping, complex RE, new domains
LLM Fine-tuned	Large	State-of-the-art	Moderate	Production systems needing peak accuracy

🎯

The Practitioner's Decision Tree

Is the target highly structured? (emails, phone numbers, dates) → Regex first.
Do you have 500+ labelled examples? → Fine-tune BERT or spaCy.
No labels, new domain, complex relations? → LLM prompting to prototype.
Need production scale at low cost? → Distil the LLM's outputs into a smaller supervised model.
The pattern: LLM to generate training data → supervised model in production.
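The "regex first" branch of the decision tree can be sketched with the standard library alone. The patterns below are simplified assumptions for illustration (real email and date validation is stricter), and `PATTERNS` and `regex_extract` are names invented for this example.

```python
import re
from typing import Dict, List

# Simplified patterns for highly structured targets (illustrative only).
PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "ISO_DATE": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    "CVE_ID": re.compile(r"\bCVE-\d{4}-\d{4,}\b"),
}

def regex_extract(text: str) -> List[Dict[str, object]]:
    """Return every pattern match as an entity dict, ordered by position."""
    found = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append({
                "text": match.group(),
                "type": label,
                "start": match.start(),
                "end": match.end(),
            })
    return sorted(found, key=lambda ent: ent["start"])

text = "Report CVE-2021-44228 to security@example.com before 2024-03-01."
for ent in regex_extract(text):
    print(ent)
```

The output uses the same text/type/start/end shape as the NER examples, so regex matches and model predictions can be merged into one entity list.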

Section 15

Golden Rules of Information Extraction

🌲 IE — Non-Negotiable Principles

Domain matters more than model size. A small model fine-tuned on 5,000 labelled sentences from your domain will often beat a much larger general-purpose model on that domain. Avoid deploying a general NER model to a medical or legal system without domain-specific fine-tuning or, at minimum, a domain-specific evaluation.

Always evaluate on your target distribution. CoNLL-2003 F1 scores are measured on news text. If your system processes clinical notes or social media, which look nothing like news, those scores tell you little. Build your own evaluation set from real production data.

Span boundaries matter. Predicting "New York" when the gold entity is "New York City" is a false positive AND a false negative simultaneously. Use seqeval with exact-match scoring, not token-level accuracy — token accuracy is misleadingly high due to the prevalence of O tags.

Co-reference resolution before relation extraction. Running RE without first resolving pronouns means you'll extract relations between "he" and "the company" instead of between "Satya Nadella" and "Microsoft". Always resolve co-refs first in multi-sentence documents.

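Use confidence scores where they are available. The Hugging Face pipeline returns a score for every entity span; spaCy's default NER component does not expose one. Where you have scores, set a threshold (e.g. 0.85) and discard low-confidence extractions rather than letting noisy extractions pollute your downstream knowledge graph.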

Beware entity type conflicts across models. spaCy's en_core_web_trf has 18 types; BERT-NER (CoNLL) has 4 (PER, ORG, LOC, MISC). Do not mix predictions from different schema models without normalisation — you will silently corrupt your data.

For LLM-based IE, always parse structured output. Never treat the raw LLM text response as your extracted data. Always enforce JSON output with a schema, validate it, and handle parsing failures gracefully — LLMs occasionally produce malformed JSON, especially for long documents.
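Two of the rules above, thresholding on confidence and normalising label schemas, fit in one small post-processing step. The sketch below is a minimal illustration: the `ONTONOTES_TO_CONLL` mapping is an assumption chosen for this example (it is lossy and you may prefer a different target schema), and entities are plain dicts with an optional `score` key.

```python
from typing import Dict, List

CONLL_LABELS = {"PER", "ORG", "LOC", "MISC"}

# Hypothetical, lossy mapping from OntoNotes-style labels to CoNLL-2003 types.
ONTONOTES_TO_CONLL = {
    "PERSON": "PER",
    "GPE": "LOC",
    "FAC": "LOC",
    "NORP": "MISC",
    "PRODUCT": "MISC",
    "EVENT": "MISC",
    "WORK_OF_ART": "MISC",
    "LANGUAGE": "MISC",
    "LAW": "MISC",
}

def normalise(entities: List[Dict], min_score: float = 0.85) -> List[Dict]:
    """Drop low-confidence entities and map labels onto one shared schema."""
    kept = []
    for ent in entities:
        score = ent.get("score")  # None when the model exposes no score
        if score is not None and score < min_score:
            continue
        label = ent["label"]
        if label not in CONLL_LABELS:
            label = ONTONOTES_TO_CONLL.get(label)
        if label is None:  # e.g. DATE or MONEY: no counterpart in this schema
            continue
        kept.append({**ent, "label": label})
    return kept

mixed = [
    {"text": "Satya Nadella", "label": "PERSON"},          # no score available
    {"text": "Microsoft", "label": "ORG", "score": 0.99},
    {"text": "Redmond", "label": "LOC", "score": 0.60},    # below threshold
    {"text": "2016", "label": "DATE"},                     # unmapped type
]
print(normalise(mixed))
```

Entities without a score are kept here by design; if you would rather be conservative, treat a missing score as failing the threshold instead.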