The Story That Explains Information Extraction
The detective cannot read everything. Instead, they skim for names, locations, dates, and relationships — circling, highlighting, connecting dots. What the detective does manually, Information Extraction (IE) does automatically, at scale, across millions of documents in seconds.
IE is the art of turning messy, unstructured human language into clean, structured, machine-readable facts.
Information Extraction (IE) is a subfield of Natural Language Processing (NLP) that automatically pulls structured data — names, dates, relationships, events — from raw text. Where a search engine finds documents, IE finds facts inside those documents.
Input: "Apple Inc. was founded by Steve Jobs in Cupertino on April 1, 1976."
Output: { entity: "Apple Inc.", type: "ORG" }, { entity: "Steve Jobs", type: "PERSON" },
{ entity: "Cupertino", type: "GPE" }, { entity: "April 1, 1976", type: "DATE" },
{ relation: ("Apple Inc.", "founded_by", "Steve Jobs") }
The Five Core Tasks of Information Extraction
IE is not a single task — it is a family of related tasks. Understanding each one is essential before writing a single line of code.
Named Entity Recognition — Deep Dive
NER assigns a label to each named entity span in text. The standard CoNLL-2003 label set has four types: PER, ORG, LOC, and MISC.
spaCy's en_core_web_trf model recognises 18 entity types, including LAW (named legal documents), LANGUAGE, WORK_OF_ART, CARDINAL (numerals), PERCENT, and QUANTITY.
Always inspect your model's label set before deployment, because different models use different schemas.
NER with spaCy — Full Code Walkthrough
spaCy is a widely used, production-oriented NLP library for Python, and its transformer-based models score well on standard NER benchmarks. The walkthrough below runs a complete NER pass over a short news article using a pretrained pipeline.
# Step 1 — Install and load
# pip install spacy
# python -m spacy download en_core_web_trf
import spacy
from spacy import displacy
# Load transformer-based model (most accurate)
nlp = spacy.load("en_core_web_trf")
# Our sample news article
text = """
Elon Musk's SpaceX successfully launched the Falcon 9 rocket from
Kennedy Space Center in Florida on December 3, 2023. The mission,
valued at approximately $67 million, delivered 22 Starlink satellites
into low Earth orbit. NASA Administrator Bill Nelson praised the launch,
calling it a milestone for the Artemis programme.
"""
# Run the NLP pipeline
doc = nlp(text)
# Print all detected entities
print("=== DETECTED ENTITIES ===")
for ent in doc.ents:
    print(f" {ent.text:30s} → {ent.label_:12s} ({spacy.explain(ent.label_)})")
# Group entities by type
from collections import defaultdict
by_type = defaultdict(list)
for ent in doc.ents:
    by_type[ent.label_].append(ent.text)
print("\n=== GROUPED BY TYPE ===")
for label, entities in sorted(by_type.items()):
    print(f" {label}: {entities}")
en_core_web_sm — fastest, least accurate, good for prototyping.
en_core_web_md / lg — adds word vectors, better generalisation.
en_core_web_trf — RoBERTa-based, most accurate, requires GPU for speed.
In production, always benchmark on your domain's text — a model trained on news
may underperform on medical or legal documents.
NER with Hugging Face Transformers — Token Classification
For custom domains or when you need fine-grained control, Hugging Face's
transformers library lets you use BERT, RoBERTa, and other models
with a token classification head — this is the modern standard for state-of-the-art NER.
from transformers import pipeline, AutoTokenizer, AutoModelForTokenClassification
# Using a pretrained NER model from the Hub
ner_pipeline = pipeline(
    "ner",
    model="dslim/bert-base-NER",   # CoNLL-2003 trained BERT
    aggregation_strategy="simple"  # merge subword tokens
)
text = "Jeff Bezos founded Amazon in Bellevue, Washington, in July 1994."
results = ner_pipeline(text)
print("Entity Extraction Results:")
for entity in results:
    print(
        f" [{entity['entity_group']:6s}] "
        f"'{entity['word']}' "
        f"(score: {entity['score']:.4f}, "
        f"chars: {entity['start']}–{entity['end']})"
    )
# Fine-tuning example skeleton — training your own NER model
from transformers import TrainingArguments, Trainer
# Labels in BIO format: B-PER, I-PER, B-ORG, I-ORG, B-LOC, I-LOC, O
label2id = {
    "O": 0, "B-PER": 1, "I-PER": 2,
    "B-ORG": 3, "I-ORG": 4,
    "B-LOC": 5, "I-LOC": 6
}
# Load base model with classification head
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-cased",
    num_labels=len(label2id),
    id2label={v: k for k, v in label2id.items()},
    label2id=label2id
)
print("Model ready for fine-tuning on custom NER data.")
Token-classification models label every token with a BIO tag: B (beginning), I (inside), or O (outside). "New York City" becomes B-LOC I-LOC I-LOC. The aggregation_strategy="simple" parameter in the pipeline merges these tags back into entity spans automatically. When training your own model, you must convert your annotation spans to BIO format yourself, and this is where most beginners make mistakes. Always validate your label alignment after tokenisation, because subword tokenisers can split one word into several tokens.
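To make the span-to-BIO conversion and the alignment check concrete, here is a minimal sketch in plain Python. The helper names (spans_to_bio, align_to_subwords), the "IGN" placeholder, and the example tokenisation are assumptions for illustration; in a real Hugging Face training loop the word indices would come from a fast tokenizer's word_ids() and ignored positions are usually given the label id -100.

```python
from typing import List, Optional, Tuple

def spans_to_bio(n_words: int, spans: List[Tuple[int, int, str]]) -> List[str]:
    """Convert word-level (start, end_exclusive, label) spans to BIO tags."""
    tags = ["O"] * n_words
    for start, end, label in spans:
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return tags

def align_to_subwords(word_tags: List[str],
                      word_ids: List[Optional[int]],
                      ignore: str = "IGN") -> List[str]:
    """Project word-level BIO tags onto subword tokens.

    word_ids maps each subword token to the index of its source word,
    or None for special tokens such as [CLS] and [SEP].
    """
    aligned = []
    previous = None
    for wid in word_ids:
        if wid is None:
            aligned.append(ignore)          # special token, excluded from the loss
        elif wid != previous:
            aligned.append(word_tags[wid])  # first piece keeps the word's tag
        else:
            tag = word_tags[wid]
            # Continuation piece: B- becomes I- so the span stays one entity
            aligned.append("I-" + tag[2:] if tag.startswith("B-") else tag)
        previous = wid
    return aligned

words = ["She", "moved", "to", "New", "York", "City", "."]
word_tags = spans_to_bio(len(words), [(3, 6, "LOC")])

# Hypothetical tokenisation: "New" is split into two pieces,
# with special tokens at both ends.
word_ids = [None, 0, 1, 2, 3, 3, 4, 5, 6, None]
token_tags = align_to_subwords(word_tags, word_ids)

print(word_tags)
print(token_tags)
```

A quick sanity check after alignment is that token_tags has exactly as many entries as the tokenizer produced tokens, and that no I- tag appears without a B- tag of the same type before it.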
Relation Extraction — Connecting the Dots
Relation Extraction (RE) takes a pair of entities and a sentence as input, and outputs the relationship type (or "no relation"). Modern RE uses span-based models or generative LLMs for this task.
import spacy
from itertools import combinations
nlp = spacy.load("en_core_web_trf")
# Rule-based relation extraction using dependency parsing
def extract_subject_verb_object(doc):
    """Extract (subject, verb, object) triples using dependency parse."""
    triples = []
    for token in doc:
        # Find verbs
        if token.pos_ == "VERB":
            subj = [t for t in token.lefts if t.dep_ in ("nsubj", "nsubjpass")]
            obj = [t for t in token.rights if t.dep_ in ("dobj", "pobj", "attr")]
            if subj and obj:
                triples.append((
                    " ".join([t.text for t in subj[0].subtree]),
                    token.lemma_,
                    " ".join([t.text for t in obj[0].subtree])
                ))
    return triples
text = """
Microsoft acquired LinkedIn for $26.2 billion in 2016.
Satya Nadella leads Microsoft as its CEO.
LinkedIn connects professionals across the world.
"""
doc = nlp(text)
triples = extract_subject_verb_object(doc)
print("=== EXTRACTED RELATION TRIPLES ===")
for subj, verb, obj in triples:
    print(f" ({subj!r}, '{verb}', {obj!r})")
# Also show entity co-occurrence (simpler baseline)
print("\n=== ENTITY CO-OCCURRENCE IN SAME SENTENCE ===")
for sent in doc.sents:
    ents = [(e.text, e.label_) for e in sent.ents]
    if len(ents) >= 2:
        for (a, ta), (b, tb) in combinations(ents, 2):
            print(f" [{ta}] '{a}' ↔ [{tb}] '{b}'")
LLM-Powered Relation Extraction — The Modern Way
For complex, open-ended relation extraction, modern NLP practitioners increasingly use large language models with structured output prompting. This requires no training data for new relation types — you simply describe what you want.
import json
import re
# Prompt-based RE using any LLM API (example structure)
SYSTEM_PROMPT = """You are an expert information extraction system.
Extract all (subject, relation, object) triples from the text.
Return ONLY a valid JSON array of objects with keys: subject, relation, object.
Example: [{"subject": "Apple", "relation": "founded_by", "object": "Steve Jobs"}]"""
def extract_relations_llm(text: str, llm_client) -> list:
    """Extract relations using an LLM with structured output."""
    response = llm_client.complete(
        system=SYSTEM_PROMPT,
        user=f"Extract all relations from:\n\n{text}",
        temperature=0  # deterministic for IE tasks
    )
    try:
        # Strip markdown fences if present
        clean = re.sub(r'```(?:json)?|```', '', response.text).strip()
        return json.loads(clean)
    except json.JSONDecodeError:
        return []
# Example expected output for:
# "Tesla CEO Elon Musk acquired Twitter for $44 billion in October 2022"
example_output = [
    {"subject": "Elon Musk", "relation": "is_ceo_of", "object": "Tesla"},
    {"subject": "Elon Musk", "relation": "acquired", "object": "Twitter"},
    {"subject": "acquisition", "relation": "valued_at", "object": "$44 billion"},
    {"subject": "acquisition", "relation": "date", "object": "October 2022"}
]
print(json.dumps(example_output, indent=2))
Traditional RE models need labelled training data for each relation type. If you want to extract a new relation — say patient_diagnosed_with in medical records — you need hundreds of annotated examples. An LLM can do this zero-shot from a description alone. The trade-off: LLMs are slower and more expensive per call, but dramatically faster to deploy. For high-volume production, use a fine-tuned traditional model. For rapid prototyping, use LLMs.
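Because an LLM is not guaranteed to follow the requested schema, prompt-based extraction is usually paired with a validation step before the triples are stored. The sketch below is one possible shape for that step, using only the standard library. The function name validate_triples and the deduplication rule are assumptions for illustration, not part of any LLM client API.

```python
import json
from typing import Dict, List

REQUIRED_KEYS = ("subject", "relation", "object")

def validate_triples(raw: str) -> List[Dict[str, str]]:
    """Parse LLM output and keep only well-formed, unique triples."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return []
    if not isinstance(data, list):
        return []
    seen = set()
    valid = []
    for item in data:
        if not isinstance(item, dict):
            continue  # e.g. a stray string inside the array
        values = [item.get(key) for key in REQUIRED_KEYS]
        # Every required field must be a non-empty string
        if not all(isinstance(v, str) and v.strip() for v in values):
            continue
        triple = tuple(v.strip() for v in values)
        if triple in seen:
            continue  # drop exact duplicates
        seen.add(triple)
        valid.append(dict(zip(REQUIRED_KEYS, triple)))
    return valid

raw_output = json.dumps([
    {"subject": "Elon Musk", "relation": "acquired", "object": "Twitter"},
    {"subject": "Elon Musk", "relation": "acquired", "object": "Twitter"},
    {"subject": "Tesla"},
    "noise",
    {"subject": "", "relation": "x", "object": "y"},
])
clean_triples = validate_triples(raw_output)
print(clean_triples)
```

Rejected items are silently dropped here to keep the sketch short; a production system would more likely log them so that prompt regressions become visible.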
Event Extraction — What Happened?
Event Extraction identifies event triggers (verbs/nouns that signal an event) and their arguments (the roles filled by entities: Agent, Patient, Time, Place, Amount).
import spacy
from dataclasses import dataclass, field
from typing import List, Optional
nlp = spacy.load("en_core_web_trf")
# Financial event extraction using pattern matching + NER
# Single-token lemmas only: the lookup below compares one token's lemma at a
# time, so a multi-word trigger such as "take over" would need phrase matching.
ACQUISITION_TRIGGERS = {"acquire", "buy", "purchase", "merge"}
# Not used in this example; kept as a starting point for a second event type.
LAYOFF_TRIGGERS = {"layoff", "fire", "dismiss", "cut", "reduce workforce"}
@dataclass
class ExtractedEvent:
    event_type: str
    trigger: str
    agent: Optional[str] = None
    patient: Optional[str] = None
    amount: Optional[str] = None
    time: Optional[str] = None
    sentence: str = ""
def extract_events(text: str) -> List[ExtractedEvent]:
    doc = nlp(text)
    events = []
    for sent in doc.sents:
        sent_money = [e.text for e in sent.ents if e.label_ == "MONEY"]
        sent_date = [e.text for e in sent.ents if e.label_ == "DATE"]
        sent_orgs = [e.text for e in sent.ents if e.label_ == "ORG"]
        for token in sent:
            lemma = token.lemma_.lower()
            if lemma in ACQUISITION_TRIGGERS and len(sent_orgs) >= 2:
                # Heuristic: the first ORG is taken as the acquirer and the
                # second as the target, which fails on passive sentences.
                events.append(ExtractedEvent(
                    event_type="ACQUISITION",
                    trigger=token.text,
                    agent=sent_orgs[0],
                    patient=sent_orgs[1],
                    amount=sent_money[0] if sent_money else None,
                    time=sent_date[0] if sent_date else None,
                    sentence=sent.text.strip()
                ))
                break  # at most one event per sentence
    return events
news = """
Google acquired DeepMind for approximately $500 million in January 2014.
Amazon purchased Whole Foods Market for $13.7 billion in August 2017.
Microsoft bought Activision Blizzard for $68.7 billion in October 2023.
"""
for event in extract_events(news):
    print(f"\n[{event.event_type}]")
    print(f" Trigger : {event.trigger}")
    print(f" Acquirer: {event.agent}")
    print(f" Target : {event.patient}")
    print(f" Amount : {event.amount}")
    print(f" Time : {event.time}")
Open Information Extraction (OpenIE)
OpenIE extracts (subject, relation, object) triples without a predefined schema. It is ideal for knowledge base population when you don't know in advance what facts you'll find.
# Option A: Stanford OpenIE via the stanford-openie wrapper (requires Java)
# pip install stanford-openie
# from openie import StanfordOpenIE
# Option B (used below): a simple dependency-based extractor
import spacy
nlp = spacy.load("en_core_web_trf")
def open_ie_extract(text: str) -> list:
    """Lightweight OpenIE using spaCy dependency parsing."""
    doc = nlp(text)
    triples = []
    for token in doc:
        if token.pos_ not in ("VERB", "AUX"):
            continue
        subjects = [w for w in token.lefts
                    if w.dep_ in ("nsubj", "nsubjpass", "csubj")]
        objects = [w for w in token.rights
                   if w.dep_ in ("dobj", "pobj", "attr", "acomp")]
        for s in subjects:
            for o in objects:
                subj_phrase = " ".join(w.text for w in s.subtree
                                       if w.dep_ not in ("punct",))
                obj_phrase = " ".join(w.text for w in o.subtree
                                      if w.dep_ not in ("punct",))
                triples.append({
                    "subject": subj_phrase,
                    "relation": token.lemma_,
                    "object": obj_phrase
                })
    return triples
text = """
Marie Curie discovered polonium and radium.
She won the Nobel Prize in Physics in 1903.
Her research pioneered the study of radioactivity.
"""
for t in open_ie_extract(text):
    print(f" ({t['subject']!r}, {t['relation']!r}, {t['object']!r})")
Co-reference Resolution — Who Is "She"?
# pip install coreferee
# python -m coreferee install en
import spacy
import coreferee
nlp = spacy.load("en_core_web_trf")
nlp.add_pipe("coreferee") # add co-ref pipe
text = """
Elon Musk founded SpaceX in 2002. He also leads Tesla.
The company recently released the Cybertruck.
Musk said he was proud of it.
"""
doc = nlp(text)
print("Co-reference chains:")
for chain in doc._.coref_chains:
    mentions = [doc[mention[0]].text for mention in chain]
    print(f" Chain: {' → '.join(mentions)}")
# Resolve pronouns for cleaner IE
def resolve_pronouns(doc) -> str:
    """Replace pronouns with their resolved referents."""
    tokens = [t.text for t in doc]
    for chain in doc._.coref_chains:
        if not chain:
            continue
        head_text = doc[chain[0][0]].text  # first mention = canonical
        for mention in chain[1:]:
            idx = mention[0]
            if doc[idx].pos_ == "PRON":
                tokens[idx] = head_text
    return " ".join(tokens)
print("\nResolved text:")
print(resolve_pronouns(doc))
Building a Full IE Pipeline
Real-world IE systems chain the tasks above into a single pipeline. Here is a simplified pipeline, a starting point and not a hardened production system, that processes a document and returns a structured knowledge graph.
import spacy
import json
from dataclasses import dataclass, asdict, field
from typing import List, Dict, Any
nlp = spacy.load("en_core_web_trf")
@dataclass
class KnowledgeGraph:
    # default_factory gives every instance its own empty list
    entities: List[Dict[str, Any]] = field(default_factory=list)
    relations: List[Dict[str, Any]] = field(default_factory=list)
    events: List[Dict[str, Any]] = field(default_factory=list)
    def to_json(self) -> str:
        return json.dumps(asdict(self), indent=2)
def full_ie_pipeline(text: str) -> KnowledgeGraph:
    doc = nlp(text)
    kg = KnowledgeGraph()
    # 1. Extract entities
    seen = set()
    for ent in doc.ents:
        key = (ent.text, ent.label_)
        if key not in seen:
            kg.entities.append({"text": ent.text, "type": ent.label_,
                                "start": ent.start_char, "end": ent.end_char})
            seen.add(key)
    # 2. Extract relations (SVO)
    for token in doc:
        if token.pos_ == "VERB":
            subjs = [t for t in token.lefts if t.dep_ in ("nsubj", "nsubjpass")]
            objs = [t for t in token.rights if t.dep_ in ("dobj", "attr")]
            for s in subjs:
                for o in objs:
                    kg.relations.append({
                        "subject": s.text,
                        "predicate": token.lemma_,
                        "object": o.text
                    })
    # 3. Event detection (acquisition pattern)
    acquisition_verbs = {"acquire", "buy", "purchase", "merge"}
    for sent in doc.sents:
        orgs = [e.text for e in sent.ents if e.label_ == "ORG"]
        money = [e.text for e in sent.ents if e.label_ == "MONEY"]
        dates = [e.text for e in sent.ents if e.label_ == "DATE"]
        for t in sent:
            if t.lemma_.lower() in acquisition_verbs and len(orgs) >= 2:
                kg.events.append({
                    "type": "ACQUISITION", "trigger": t.text,
                    "acquirer": orgs[0], "target": orgs[1],
                    "amount": money[0] if money else None,
                    "date": dates[0] if dates else None
                })
                break
    return kg
# Run the pipeline
text = "Adobe acquired Figma for $20 billion in September 2022, "\
"but the deal was blocked by EU regulators in December 2023."
kg = full_ie_pipeline(text)
print(kg.to_json())
Evaluation Metrics for IE Systems
How do you know if your IE system is any good? The standard metrics are Precision, Recall, and F1 — but calculated at the entity span level, not the token level.
# pip install seqeval
from seqeval.metrics import classification_report, f1_score
# Gold (human-annotated) labels in BIO format
y_true = [[
    "B-PER", "I-PER",  # "Elon Musk"
    "O",               # "founded"
    "B-ORG",           # "SpaceX"
    "O", "O",          # "in" "2002"
]]
# Predicted labels from your model
y_pred = [[
    "B-PER", "I-PER",  # correct
    "O",               # correct
    "B-ORG",           # correct
    "O", "B-DATE",     # DATE predicted for "2002"; gold is O, so this counts as a false positive
]]
print(classification_report(y_true, y_pred))
print(f"Overall F1: {f1_score(y_true, y_pred):.4f}")
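To show what span-level scoring means, the sketch below recomputes precision, recall, and F1 by hand for the same gold and predicted sequences, and contrasts them with token-level accuracy. It is a simplified illustration in plain Python, not seqeval's implementation: the helper bio_to_spans is hypothetical and simply ignores an I- tag that has no matching B- tag before it.

```python
from typing import List, Set, Tuple

def bio_to_spans(tags: List[str]) -> Set[Tuple[str, int, int]]:
    """Collect (label, start, end_exclusive) spans from one BIO sequence."""
    spans = set()
    start, label = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing span
        inside = tag.startswith("I-") and label == tag[2:]
        if start is not None and not inside:
            spans.add((label, start, i))    # current span ends before position i
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]       # a new span opens here
    return spans

gold = ["B-PER", "I-PER", "O", "B-ORG", "O", "O"]
pred = ["B-PER", "I-PER", "O", "B-ORG", "O", "B-DATE"]

gold_spans = bio_to_spans(gold)
pred_spans = bio_to_spans(pred)
true_positives = len(gold_spans & pred_spans)  # exact label and boundary match

precision = true_positives / len(pred_spans) if pred_spans else 0.0
recall = true_positives / len(gold_spans) if gold_spans else 0.0
f1 = (2 * precision * recall / (precision + recall)
      if precision + recall else 0.0)
token_accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
print(f"token accuracy={token_accuracy:.3f}")
```

Here two of the three predicted spans are correct and both gold spans are found, so span-level F1 is 0.8 while token accuracy is 5/6. On longer documents, where O tags dominate, the gap between the two numbers is typically much larger.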
IE in Practice — Real-World Applications
Approaches Comparison — When to Use What
| Approach | Training Data Needed | Accuracy | Speed | Best For |
|---|---|---|---|---|
| Rule-Based (Regex/Patterns) | None | High for structured text | Very Fast | Dates, codes, structured formats (phone, email, ID) |
| spaCy (statistical) | Medium (fine-tune) | Good (85–90% F1) | Fast | General-purpose NER in production, real-time APIs |
| BERT / RoBERTa fine-tuned | Medium–Large | Excellent (90–95% F1) | Moderate (GPU needed) | Domain-specific NER (medical, legal, financial) |
| LLM Prompting (GPT/Claude) | None (zero-shot) | Excellent for novel tasks | Slow, expensive at scale | Rapid prototyping, complex RE, new domains |
| LLM Fine-tuned | Large | State-of-the-art | Moderate | Production systems needing peak accuracy |
Is the target highly structured? (emails, phone numbers, dates) → Regex first.
Do you have 500+ labelled examples? → Fine-tune BERT or spaCy.
No labels, new domain, complex relations? → LLM prompting to prototype.
Need production scale at low cost? → Distil the LLM's outputs into a smaller supervised model.
The pattern: LLM to generate training data → supervised model in production.
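For the first branch of this decision list, a regex-first extractor can be only a few lines long. The sketch below is a minimal illustration using the standard library; the three patterns are deliberately simple assumptions (ISO-style dates, one US-style phone format, a loose email pattern) and would need hardening for real data.

```python
import re
from typing import Dict, List

# Illustrative patterns only; real-world formats vary far more than this.
PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "ISO_DATE": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    "US_PHONE": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
}

def extract_structured(text: str) -> Dict[str, List[str]]:
    """Return every match for each pattern, keyed by label."""
    return {label: pattern.findall(text) for label, pattern in PATTERNS.items()}

sample = "Contact jane.doe@example.com or call 555-010-9999 before 2024-03-15."
found = extract_structured(sample)
for label, matches in found.items():
    print(f"{label}: {matches}")
```

Patterns like these need no training data and run very fast, which is why they are worth trying before any statistical model when the target format is rigid.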
Golden Rules of Information Extraction
Evaluate with seqeval using exact-match span scoring, not token-level accuracy: token accuracy is misleadingly high because O tags dominate.
en_core_web_trf has 18 types; BERT-NER (CoNLL) has 4 (PER, ORG, LOC, MISC). Do not mix predictions from models with different schemas without normalisation, or you will silently corrupt your data.
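One way to apply the normalisation rule is to map every model's labels into a single target schema before merging results. The sketch below shows the idea in plain Python. The mapping itself is an illustrative assumption, not an official crosswalk between the two label sets, and the right choices (for example, whether FAC belongs under LOC) depend on your application.

```python
from typing import Dict, List, Tuple

# Illustrative mapping (an assumption) from spaCy's OntoNotes-style labels
# to the four CoNLL-2003 types. Labels without an entry are dropped.
SPACY_TO_CONLL: Dict[str, str] = {
    "PERSON": "PER",
    "ORG": "ORG",
    "GPE": "LOC",
    "LOC": "LOC",
    "FAC": "LOC",
    "NORP": "MISC",
    "EVENT": "MISC",
    "PRODUCT": "MISC",
    "WORK_OF_ART": "MISC",
    "LANGUAGE": "MISC",
}

Entity = Tuple[str, str]  # (text, label)

def normalise_labels(entities: List[Entity],
                     mapping: Dict[str, str]) -> List[Entity]:
    """Map labels into the target schema, dropping types with no equivalent."""
    return [(text, mapping[label]) for text, label in entities if label in mapping]

spacy_style = [("Apple Inc.", "ORG"), ("Steve Jobs", "PERSON"),
               ("Cupertino", "GPE"), ("April 1, 1976", "DATE")]
conll_style = [("Amazon", "ORG"), ("Jeff Bezos", "PER")]

merged = normalise_labels(spacy_style, SPACY_TO_CONLL) + conll_style
print(merged)
```

Dropping unmapped types such as DATE is a design choice made here for brevity; an alternative is to widen the target schema so that no information is lost.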