Named Entity Recognition (NER)

Section 01

The Story That Explains NER

📖 Real World Analogy

The Highlighter Problem — Scanning a Newspaper

Imagine you are given a stack of 10,000 newspaper articles and asked to highlight every person's name in yellow, every city or country in blue, every company name in green, and every date in orange.

A human expert could do it — slowly, expensively, and with occasional mistakes. Now imagine doing it for 10 million articles, in 50 languages, in real time.

That is exactly what Named Entity Recognition (NER) does — automatically, at superhuman speed. It reads raw text and labels every important "thing" with a category. It is the highlighter that never tires.

Named Entity Recognition is an NLP (Natural Language Processing) task that identifies and classifies named entities (mentions of real-world objects) within unstructured text. A "named entity" is any word or phrase that refers to a specific, identifiable thing: a person, an organisation, a location, a date, a monetary amount, a product, and so on.

🔎

What Makes NER Different from Other NLP Tasks?

Many NLP tasks classify an entire sentence (sentiment analysis) or generate new text (summarisation). NER is a token-level classification task: it labels every individual word (or sub-word) in a sentence. This adds a difficulty that sentence-level tasks do not have: the model must understand not just what a word means, but where an entity starts, where it ends, and what type of thing it refers to.

Section 02

Why NER Matters — Real Applications

NER is not an academic exercise. It is a core component of some of the most important information systems in the world — powering everything from search engines to financial compliance systems.

📰

News & Media Intelligence

Information Extraction

Reuters, Bloomberg, and Google News use NER to automatically tag articles with people, organisations, and locations — enabling structured search and trend analysis over millions of articles per day.

🏥

Healthcare & Clinical NLP

Medical Records

Hospitals extract drug names, disease terms, dosages, and patient symptoms from unstructured clinical notes, turning clinicians' free-text notes into structured medical records.

📈

Finance & Compliance

Risk & Regulation

Banks monitor transactions and news in real time, extracting entity mentions to detect sanction breaches, insider trading signals, or links between entities in SEC filings.

🔍

Search Engines

Knowledge Graphs

Knowledge graphs such as Google's depend on entity recognition to connect text to known entities. When you search "Tesla CEO", the engine recognises "Tesla" as an ORG and "CEO" as a role relation, then retrieves the correct person.

💬

Chatbots & Voice Assistants

Intent Understanding

When you say "Book a flight to Paris on Friday", an assistant such as Siri or Alexa uses NER to extract "Paris" (GPE) and "Friday" (DATE) before booking anything.

⚖️

Legal & Contract Analysis

Document Intelligence

Legal AI tools extract party names, clause types, jurisdiction, dates, and monetary values from thousands of contracts — work that would take lawyers weeks takes minutes.

Section 03

The Standard Entity Types

Different NER systems use different label sets. CoNLL-2003, the classic benchmark, defines only four types (PER, ORG, LOC, MISC); spaCy's English models use the richer OntoNotes 5 scheme. Here are the most common categories from that scheme:

Label	Entity Type	Example	Domain
PERSON	People, real or fictional	Elon Musk, Sherlock Holmes	Universal
ORG	Companies, agencies, institutions	OpenAI, NASA, WHO	Universal
GPE	Geo-political entities (countries, cities, states)	India, New York, California	Universal
LOC	Non-GPE locations (mountains, rivers)	the Amazon, Mount Everest	Universal
DATE	Absolute or relative dates and periods	January 2024, last Tuesday, Q3	Universal
MONEY	Monetary values including currency	$4.2 billion, £50,000	Finance
PRODUCT	Objects, vehicles, foods	iPhone 15, Tesla Model S	E-commerce
EVENT	Named hurricanes, battles, elections	World War II, the Olympics	News
LAW	Named documents made into laws	GDPR, the US Constitution	Legal
PERCENT	Percentage including "%"	85%, a quarter	Finance

⚠️

The Ambiguity Problem

"Apple announced record profits" — is Apple a fruit or a company? "Jordan signed the agreement" — is Jordan a person or a country? "May confirmed the decision" — is May a month, a name, or a modal verb? This context-dependency is why simple rule-based approaches fail and why modern NER uses deep contextual models like Transformers.

Section 04

How NER Works — The BIO Tagging Scheme

📖 Core Concept

The BIO Puzzle — Turning NER into a Labelling Game

The central challenge in NER is that a named entity can span multiple words. "New York City" is three words that form one entity. "Goldman Sachs Group Inc" is four words forming one entity.

How do we tell a model where an entity starts versus where it continues? The answer is a simple but powerful labelling scheme called BIO tagging: Beginning, Inside, Outside.

🏭 BIO Tags — Sentence: "Barack Obama visited New York City last Tuesday"

B-PERSON

Barack — Beginning of a PERSON entity

I-PERSON

Obama — Inside (continuation) of the PERSON entity

O

visited — Outside (not an entity)

B-GPE

New — Beginning of a GPE (location) entity

I-GPE

York — Inside the GPE entity

I-GPE

City — Still inside the GPE entity

B-DATE

last — Beginning of a DATE entity ("last Tuesday" is one relative date)

I-DATE

Tuesday — Inside the DATE entity
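To make the scheme concrete, here is a minimal sketch of decoding BIO tags back into entity spans. The function name bio_to_spans and the way it treats a stray I- tag as a new beginning are assumptions made for illustration, not part of any particular library.

```python
from typing import List, Tuple

def bio_to_spans(tags: List[str]) -> List[Tuple[int, int, str]]:
    """Decode BIO tags into (start, end, label) token spans; end is exclusive."""
    spans = []
    start, label = None, None
    for i, tag in enumerate(tags):
        # An open span continues only on an "I-" tag of the same type.
        continues = tag.startswith("I-") and label == tag[2:]
        if start is not None and not continues:
            spans.append((start, i, label))
            start, label = None, None
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            # Assumption: a stray "I-" with no open span starts a new entity.
            start, label = i, tag[2:]
    if start is not None:
        spans.append((start, len(tags), label))
    return spans

tokens = ["Barack", "Obama", "visited", "New", "York", "City"]
tags = ["B-PERSON", "I-PERSON", "O", "B-GPE", "I-GPE", "I-GPE"]

for start, end, label in bio_to_spans(tags):
    print(" ".join(tokens[start:end]), "->", label)
```

The B- prefix is what lets two adjacent entities of the same type stay separate: without it, "I-ORG I-ORG" could be one entity or two.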

💡

BIOES — An Extension You May Encounter

Some systems use BIOES: Beginning, Inside, Outside, End, Single. The E tag marks the last token of a multi-word entity, and S marks a single-token entity. This gives the model explicit boundary signals and can improve performance, particularly on longer entity spans.
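Because BIOES carries the same information as BIO, one can be derived from the other. The sketch below assumes a well-formed BIO input; the helper name bio_to_bioes and the example sentence are illustrative only.

```python
from typing import List

def bio_to_bioes(tags: List[str]) -> List[str]:
    """Convert a well-formed BIO sequence to BIOES."""
    out = []
    for i, tag in enumerate(tags):
        if tag == "O":
            out.append(tag)
            continue
        prefix, label = tag.split("-", 1)
        nxt = tags[i + 1] if i + 1 < len(tags) else "O"
        # The entity continues only if the next tag is I- of the same type.
        entity_continues = nxt == f"I-{label}"
        if prefix == "B":
            out.append(f"B-{label}" if entity_continues else f"S-{label}")
        else:  # prefix == "I"
            out.append(f"I-{label}" if entity_continues else f"E-{label}")
    return out

# "Tim Cook left New York City for Paris"
bio = ["B-PERSON", "I-PERSON", "O", "B-GPE", "I-GPE", "I-GPE", "O", "B-GPE"]
print(bio_to_bioes(bio))
```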

Section 05

Three Generations of NER Approaches

NER has evolved dramatically over the past 30 years. Understanding all three generations gives you the context to choose the right tool for any situation.

Rule-Based / Gazetteer Systems (1990s–2000s)

Hard-coded lists of known entities (gazetteers) + hand-crafted regex patterns. If the word appeared in a list of known cities, it was tagged LOC. Fast and highly precise on known entities — but brittle, expensive to maintain, and completely blind to new names. Companies like Bloomberg maintained massive entity dictionaries just to keep these systems running.
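A toy version of this approach shows both its appeal and its brittleness. The gazetteer entries, the regex, and the function name below are hypothetical and far smaller than anything a real system used.

```python
import re
from typing import List, Tuple

# Hypothetical, tiny gazetteer; real systems held many thousands of entries.
GAZETTEER = {
    "london": "LOC",
    "paris": "LOC",
    "apple": "ORG",
    "bloomberg": "ORG",
}
# Hand-written pattern for amounts such as "$4.2 billion" or "$500".
MONEY_RE = re.compile(r"\$\d+(?:\.\d+)?(?: (?:million|billion))?")

def rule_based_ner(text: str) -> List[Tuple[str, str]]:
    """Tag text with dictionary lookup plus one regex; no context is used."""
    found = [(m.start(), m.group(), "MONEY") for m in MONEY_RE.finditer(text)]
    for m in re.finditer(r"[A-Za-z]+", text):
        label = GAZETTEER.get(m.group().lower())
        if label:
            found.append((m.start(), m.group(), label))
    # Return entities in reading order.
    return [(word, label) for _, word, label in sorted(found)]

print(rule_based_ner("Apple opened a London office for $4.2 billion."))
print(rule_based_ner("She ate an apple in Zurich."))
```

The second call illustrates both failure modes: the fruit "apple" is tagged ORG because the lookup ignores context, and "Zurich" is missed because it is not in the list.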

Statistical / ML Models (2000s–2017)

Conditional Random Fields (CRFs) and Hidden Markov Models dominated this era. Features like capitalisation, prefix/suffix, surrounding words, and POS tags were hand-engineered and fed into a probabilistic sequence model. The Stanford NER tagger (still widely used today) is a CRF. Excellent for well-defined domains, but performance plateaus because features must be hand-crafted.
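The kind of hand-engineered features described here can be sketched as a plain function that turns each token into a feature dictionary. The feature names are illustrative, and POS tags are left out because they would require a separate tagger.

```python
from typing import Dict, List

def token_features(tokens: List[str], i: int) -> Dict[str, object]:
    """Hand-engineered features for the token at position i."""
    word = tokens[i]
    return {
        "lower": word.lower(),
        "is_capitalised": word[:1].isupper(),
        "is_all_caps": word.isupper(),
        "has_digit": any(ch.isdigit() for ch in word),
        "prefix3": word[:3],
        "suffix3": word[-3:],
        # Neighbouring words give the model a small context window.
        "prev_lower": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next_lower": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }

tokens = ["Barack", "Obama", "visited", "NASA", "in", "2009"]
features = [token_features(tokens, i) for i in range(len(tokens))]
print(features[3])
```

A sequence model such as a CRF would receive one such dictionary per token, together with the BIO tag to predict. Every new signal has to be added to this function by hand, which is the ceiling the paragraph above describes.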

Neural / Transformer Models (2018–Present)

BERT, RoBERTa, and their fine-tuned variants now dominate NER benchmarks. Pre-trained on billions of words, these models learn rich contextual representations of every token. Fine-tuning on labelled NER data typically takes minutes to hours on a GPU, depending on dataset size. spaCy's en_core_web_trf pipeline and the community model dslim/bert-base-NER on the HuggingFace Hub are common starting points for production NER. Fine-tuned models of this kind report F1 scores above 0.92 on CoNLL-2003, a benchmark on which a one-point gain once took years of research.

Section 06

Quick Start — NER with spaCy

spaCy is a widely used, production-oriented NLP library for Python. Its NER pipeline comes pre-trained and can be run in about five lines of code.

# Step 1: Install spaCy and download a model
# Run in terminal: pip install spacy
# Run in terminal: python -m spacy download en_core_web_sm

import spacy

# Load a pre-trained English model
nlp = spacy.load("en_core_web_sm")

# Process a sentence
text = "Apple CEO Tim Cook announced a $110 billion share buyback in Cupertino on Tuesday."
doc = nlp(text)

# Print all detected entities
for ent in doc.ents:
    print(f"{ent.text:25} | {ent.label_:10} | {spacy.explain(ent.label_)}")

OUTPUT

✅

spacy.explain() — Your Best Friend

spacy.explain("GPE") returns a human-readable description of any label. Never guess what a cryptic tag means — always explain it. Use this liberally when exploring a new model's output or debugging unexpected labels.

Section 07

Visualising NER Output — displaCy

spaCy ships with a built-in visualiser called displaCy that renders NER annotations beautifully inline in Jupyter Notebooks or as standalone HTML.

import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")

text = """Elon Musk founded SpaceX in 2002 in Hawthorne, California.
The company secured a $1.6 billion NASA contract in 2008."""

doc = nlp(text)

# Render inline in a Jupyter Notebook
displacy.render(doc, style="ent", jupyter=True)

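# Or save as a standalone HTML file. jupyter=False makes render() return the
# markup as a string even when this code runs inside a notebook.
html = displacy.render(doc, style="ent", page=True, jupyter=False)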
with open("ner_output.html", "w", encoding="utf-8") as f:
    f.write(html)

🎨 displaCy NER Visualisation — Simulated Output

Elon Musk PERSON founded SpaceX ORG in 2002 DATE in Hawthorne, California GPE. The company secured a $1.6 billion MONEY NASA ORG contract in 2008 DATE.

This is how displaCy renders entity spans directly inside your browser or notebook.

Section 08

Choosing Your Model — sm vs md vs lg vs trf

spaCy ships four English model tiers. Choosing the right one is a trade-off between accuracy, speed, and size. The figures below are approximate and vary between spaCy releases:

Model	Architecture	NER F1	Speed	Package size	Best For
en_core_web_sm	CNN + HashEmbed	~0.85	Very fast	~12 MB	Prototyping, edge devices
en_core_web_md	CNN + static word vectors	~0.85	Fast	~43 MB	General production, better OOV
en_core_web_lg	CNN + large static word vectors	~0.86	Moderate	~741 MB	When embedding coverage matters
en_core_web_trf	RoBERTa Transformer	~0.90	Slow on CPU (GPU recommended)	~435 MB	Maximum accuracy in production

⚡

The Practitioner's Rule for Model Selection

Start with en_core_web_sm for all prototyping. Switch to en_core_web_trf only when accuracy on ambiguous or domain-specific text is critical, and preferably when a GPU is available, since transformer inference on CPU is slow. For most batch-processing pipelines, en_core_web_md is a reasonable balance of accuracy and speed.

Section 09

NER with HuggingFace Transformers

For the highest accuracy — particularly on domain-specific text — the HuggingFace transformers library gives you access to hundreds of fine-tuned NER models from the Model Hub. Here is how to use it with a single pipeline call:

from transformers import pipeline

# Load a BERT-based NER model from HuggingFace Hub
ner = pipeline(
    "ner",
    model="dslim/bert-base-NER",
    aggregation_strategy="simple"  # merges B/I tokens into spans
)

text = "Sundar Pichai, CEO of Google, visited London last week for a summit on AI safety."

results = ner(text)

for entity in results:
    print(f"Entity : {entity['word']}")
    print(f"Type   : {entity['entity_group']}")
    print(f"Score  : {entity['score']:.4f}")
    print(f"Span   : chars {entity['start']}–{entity['end']}")
    print("---")

OUTPUT

Entity : Sundar Pichai
Type   : PER
Score  : 0.9991
Span   : chars 0–13
---
Entity : Google
Type   : ORG
Score  : 0.9998
Span   : chars 22–28
---
Entity : London
Type   : LOC
Score  : 0.9995
Span   : chars 38–44
---

🔨

aggregation_strategy — The Hidden Power Parameter

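Without an aggregation strategy, HuggingFace returns one row per sub-word token, so "New York City" comes back as at least three separate rows tagged B-LOC, I-LOC, I-LOC (this model uses the CoNLL labels PER, ORG, LOC and MISC). Setting aggregation_strategy to "simple" or "first" collapses these rows into a single entity span automatically; "first" additionally keeps a word that was split into several sub-word pieces under one label. Always set this.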
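To show what this merging involves, here is a simplified re-implementation of the idea in plain Python. It is a sketch, not the library's actual code: the function name, the adjacency check, and the example scores are assumptions, and the rows only mimic the shape of the pipeline's unaggregated output.

```python
from typing import Dict, List

def merge_token_predictions(rows: List[Dict]) -> List[Dict]:
    """Group consecutive B-/I- rows of the same type into one span."""
    groups: List[Dict] = []
    for row in rows:
        prefix, label = row["entity"].split("-", 1)
        last = groups[-1] if groups else None
        # Assumption: rows belong together if at most one character
        # (a space) separates them in the original text.
        adjacent = last is not None and row["start"] <= last["end"] + 1
        if prefix == "I" and adjacent and last["entity_group"] == label:
            last["end"] = row["end"]
            last["scores"].append(row["score"])
        else:
            groups.append({"entity_group": label, "start": row["start"],
                           "end": row["end"], "scores": [row["score"]]})
    # Report the mean token score for each merged span.
    return [{"entity_group": g["entity_group"], "start": g["start"],
             "end": g["end"], "score": sum(g["scores"]) / len(g["scores"])}
            for g in groups]

text = "I moved to New York City."
rows = [  # hypothetical per-token predictions
    {"entity": "B-LOC", "score": 0.99, "word": "New", "start": 11, "end": 14},
    {"entity": "I-LOC", "score": 0.98, "word": "York", "start": 15, "end": 19},
    {"entity": "I-LOC", "score": 0.97, "word": "City", "start": 20, "end": 24},
]
merged = merge_token_predictions(rows)
for span in merged:
    print(text[span["start"]:span["end"]], span["entity_group"], round(span["score"], 2))
```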

Section 10

Training a Custom NER Model — The Full Pipeline

📖 Story

The Vet Clinic Problem

A veterinary clinic wants to extract animal species, medication names, and dosages from clinical notes. The general-purpose spaCy model has never heard of "Metacam 0.5mg/ml" or "Bengal cat with FIP" as structured entities.

General-purpose pre-trained models are trained mostly on news and web text. In medical, legal, scientific, or niche domains, you almost always need to train a custom NER model on domain-specific labelled data. Here is the complete workflow.

📄 Custom NER Training Workflow — Step by Step

Step 1

Collect raw text — gather 500–5,000 sentences from your target domain

Step 2

Annotate entities — use Label Studio, Prodigy, or Doccano to label spans

Step 3

Convert to spaCy format — transform annotations into DocBin format

Step 4

Configure training — generate a config.cfg file with spaCy's init utility

Step 5

Train and evaluate — run spacy train config.cfg and watch F1 climb

Step 6

Package and deploy — spacy package produces an installable model

import spacy
from spacy.tokens import DocBin
from spacy.training import Example

# ── Step 1: Define training data ───────────────────────────
# Format: (text, {"entities": [(start, end, label), ...]})
TRAIN_DATA = [
    ("Give Bella 0.5ml of Metacam twice daily.",
     {"entities": [(5, 10, "ANIMAL_NAME"), (11, 16, "DOSAGE"), (20, 27, "DRUG")]}),
    ("Max the Labrador received 10mg of Carprofen.",
     # Offsets are character positions with an exclusive end: text[26:30] == "10mg"
     {"entities": [(0, 3, "ANIMAL_NAME"), (8, 16, "SPECIES"), (26, 30, "DOSAGE"), (34, 43, "DRUG")]}),
]

# ── Step 2: Create a blank English model ───────────────────
nlp = spacy.blank("en")
ner = nlp.add_pipe("ner")

# ── Step 3: Add custom labels ──────────────────────────────
for _, annotations in TRAIN_DATA:
    for _, _, label in annotations["entities"]:
        ner.add_label(label)

# ── Step 4: Build DocBin for efficient training ────────────
db = DocBin()
for text, annotations in TRAIN_DATA:
    doc = nlp.make_doc(text)
    example = Example.from_dict(doc, annotations)
    db.add(example.reference)
db.to_disk("./train.spacy")

# ── Step 5: Train (basic loop; use the spacy train CLI for real projects)
# Build the Example objects once and reuse them in every iteration
examples = [
    Example.from_dict(nlp.make_doc(text), annotations)
    for text, annotations in TRAIN_DATA
]
# initialize() sets up the model weights and returns the optimizer
optimizer = nlp.initialize(lambda: examples)

for itn in range(30):
    losses = {}
    nlp.update(examples, drop=0.3, sgd=optimizer, losses=losses)
    if itn % 10 == 0:
        print(f"Iteration {itn:3d} | Loss: {losses['ner']:.4f}")

# ── Step 6: Save and test the model ───────────────────────
nlp.to_disk("./vet_ner_model")
print("Model saved.")

# Test on a new sentence
nlp2 = spacy.load("./vet_ner_model")
doc = nlp2("Administer 0.2ml of Metacam to Rocky.")
for ent in doc.ents:
    print(f"{ent.text} → {ent.label_}")

OUTPUT

Iteration   0 | Loss: 14.8234
Iteration  10 | Loss: 3.2107
Iteration  20 | Loss: 0.8812
Model saved.
0.2ml → DOSAGE
Metacam → DRUG
Rocky → ANIMAL_NAME
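Hand-counted character offsets are a frequent source of silent training errors, because a span that does not line up with the text cannot be learned. One way to avoid counting by hand is to derive the offsets from the phrases themselves. The annotate helper below is a hypothetical convenience function, not part of spaCy, and it assumes the phrases are listed in reading order.

```python
from typing import Dict, List, Tuple

def annotate(text: str, phrases: List[Tuple[str, str]]) -> Tuple[str, Dict]:
    """Build a (text, {"entities": [...]}) pair from (phrase, label) pairs."""
    entities = []
    search_from = 0
    for phrase, label in phrases:
        start = text.find(phrase, search_from)
        if start == -1:
            raise ValueError(f"{phrase!r} not found in {text!r}")
        end = start + len(phrase)  # exclusive end offset
        entities.append((start, end, label))
        search_from = end  # later phrases must appear after this one
    return text, {"entities": entities}

example = annotate(
    "Give Bella 0.5ml of Metacam twice daily.",
    [("Bella", "ANIMAL_NAME"), ("0.5ml", "DOSAGE"), ("Metacam", "DRUG")],
)
print(example)
```

In a real project an annotation tool exports these offsets for you; a helper like this is mainly useful for small hand-written seed sets.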

Section 11

Evaluating NER — Metrics That Matter

NER is evaluated at the entity span level, not the word level. An entity is only counted as correct if both its boundary and its type are predicted correctly. A partial match counts as a full miss.

Precision

TP / (TP + FP)

Of all entities the model predicted, what fraction were correct? Punishes false alarms.

Recall

TP / (TP + FN)

Of all real entities in the text, what fraction did the model find? Punishes missed entities.

F1 Score

2 · P · R / (P + R)

The harmonic mean of precision and recall. The primary NER benchmark metric.

Partial Match

Boundary OR Type wrong

"New York" tagged as "New York City" is a partial match — counted as a miss in strict evaluation.

from seqeval.metrics import classification_report, f1_score

# True labels (BIO format, one list per sentence)
y_true = [["B-PERSON", "I-PERSON", "O", "B-ORG", "O"]]
y_pred = [["B-PERSON", "I-PERSON", "O", "B-GPE", "O"]]
# Note: model confused ORG → GPE for the 4th token

print(classification_report(y_true, y_pred))
print(f"Overall F1: {f1_score(y_true, y_pred):.4f}")

OUTPUT

              precision    recall  f1-score   support

         GPE       0.00      0.00      0.00         0
         ORG       0.00      0.00      0.00         1
      PERSON       1.00      1.00      1.00         1

   micro avg       0.50      0.50      0.50         2
   macro avg       0.33      0.33      0.33         2
weighted avg       0.50      0.50      0.50         2

Overall F1: 0.5000

⚠️

The seqeval Library — Non-Negotiable for NER Evaluation

Never use scikit-learn's classification_report directly on BIO labels — it evaluates at the token level and gives misleadingly high scores. seqeval correctly handles span-level evaluation. Install with pip install seqeval. Always use it.
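The gap between token-level and span-level scoring can be reproduced by hand. This sketch applies the precision, recall, and F1 formulas to sets of spans with strict matching and compares the result with plain token accuracy. The function names and the example sentence are illustrative, and the spans are written out manually rather than decoded from the tags.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int, str]  # (start_token, end_token_exclusive, label)

def strict_prf(gold: Set[Span], pred: Set[Span]) -> Tuple[float, float, float]:
    """Span-level precision, recall, F1; boundary and type must both match."""
    tp = len(gold & pred)
    fp = len(pred - gold)
    fn = len(gold - pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def token_accuracy(y_true: List[str], y_pred: List[str]) -> float:
    """Fraction of tokens whose tag matches exactly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Tokens: ["Obama", "visited", "New", "York", "City"]
y_true = ["B-PERSON", "O", "B-GPE", "I-GPE", "I-GPE"]
y_pred = ["B-PERSON", "O", "B-GPE", "I-GPE", "O"]      # "City" missed
gold = {(0, 1, "PERSON"), (2, 5, "GPE")}               # "New York City"
pred = {(0, 1, "PERSON"), (2, 4, "GPE")}               # "New York" only

p, r, f1 = strict_prf(gold, pred)
acc = token_accuracy(y_true, y_pred)
print(f"token accuracy={acc:.2f}  span P={p:.2f} R={r:.2f} F1={f1:.2f}")
```

One wrong token out of five costs 20 points of token accuracy but half of the span-level F1, because the truncated "New York" counts as both a false positive and a missed entity.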

Section 12

NER in a Real Pipeline — Building an Entity Extractor

In production, NER is rarely used alone. It feeds downstream tasks like entity linking, relation extraction, or knowledge graph population. Here is a compact entity extraction pipeline of the kind such systems build on:

import spacy
import pandas as pd
from collections import Counter
from typing import List, Dict

# ── Load model ─────────────────────────────────────────────
nlp = spacy.load("en_core_web_sm")

# ── Sample news corpus ─────────────────────────────────────
corpus = [
    "Amazon acquired MGM for $8.45 billion in 2021, strengthening its Prime Video library.",
    "Satya Nadella, CEO of Microsoft, met with EU regulators in Brussels on Monday.",
    "The Federal Reserve raised interest rates by 0.25% to combat inflation in the United States.",
    "Tesla reported $25.1 billion in revenue for Q4 2023, beating Wall Street estimates.",
]

# ── Extract entities from all documents ────────────────────
def extract_entities(texts: List[str]) -> List[Dict]:
    records = []
    for doc in nlp.pipe(texts):  # nlp.pipe() is much faster than looping
        for ent in doc.ents:
            records.append({
                "text": doc.text[:60] + "...",
                "entity": ent.text,
                "label": ent.label_,
                "start": ent.start_char,
                "end": ent.end_char,
                "description": spacy.explain(ent.label_)
            })
    return records

records = extract_entities(corpus)
df = pd.DataFrame(records)

# ── Summarise results ──────────────────────────────────────
print("=== Entity Type Distribution ===")
print(df["label"].value_counts().to_string())

print("\n=== Top Mentioned Organisations ===")
orgs = df[df["label"] == "ORG"]["entity"]
print(Counter(orgs).most_common(5))

OUTPUT

=== Entity Type Distribution ===
ORG        6
MONEY      3
DATE       2
PERSON     1
GPE        1
PERCENT    1

=== Top Mentioned Organisations ===
[('Amazon', 1), ('MGM', 1), ('Microsoft', 1), ('EU', 1), ('Federal Reserve', 1)]

⚡

Always Use nlp.pipe() for Batch Processing

nlp.pipe(texts) processes documents in batches instead of one at a time, which is usually several times faster than calling nlp(text) in a loop. For large corpora, also pass disable=["tagger", "parser", "attribute_ruler", "lemmatizer"] to skip components you don't need. How much time this saves depends on the pipeline, so measure it on your own data.

Section 13

Common NER Pitfalls and How to Fix Them

Pitfall	Symptom	Fix
Wrong entity type	"Apple" → PERSON instead of ORG	Use a larger model (trf); add context in training data
Span boundary errors	"New York" found but "New York City" missed	Fine-tune on domain text; add more multi-word examples
Missed rare entities	New company names not detected	Augment with rule-based patterns using EntityRuler
Domain mismatch	Medical/legal terms unrecognised	Fine-tune on domain-specific labelled data
Overlapping entities	"New York Times" tagged as both ORG and LOC	Prioritise by label; add entity to training to assert type
Slow batch processing	Processing 100k docs takes hours	Use nlp.pipe(); disable unused pipeline components
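For the overlapping-entities row, one common resolution strategy is to let the longest candidate win, which is a different rule from the label priority suggested in the table. The sketch below is in plain Python with hypothetical names; spaCy's spacy.util.filter_spans applies a similar longest-first rule to Span objects.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start_char, end_char_exclusive, label)

def drop_overlaps(spans: List[Span]) -> List[Span]:
    """Keep the longest span wherever candidates overlap; ties go leftmost."""
    kept: List[Span] = []
    # Visit candidates longest first, then by start position.
    for span in sorted(spans, key=lambda s: (-(s[1] - s[0]), s[0])):
        start, end, _ = span
        # Accept the span only if it touches nothing already kept.
        if all(end <= k[0] or start >= k[1] for k in kept):
            kept.append(span)
    return sorted(kept)

text = "The New York Times opened an office in Boston."
candidates = [
    (4, 18, "ORG"),   # "New York Times"
    (4, 12, "GPE"),   # "New York"
    (39, 45, "GPE"),  # "Boston"
]
resolved = drop_overlaps(candidates)
for start, end, label in resolved:
    print(text[start:end], "->", label)
```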

import spacy
from spacy.pipeline import EntityRuler

# Fix: Add rule-based patterns for known entities the model misses
nlp = spacy.load("en_core_web_sm")

# EntityRuler runs BEFORE the statistical NER model
ruler = nlp.add_pipe("entity_ruler", before="ner")

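patterns = [
    {"label": "ORG",    "pattern": "DeepMind"},
    {"label": "ORG",    "pattern": "Anthropic"},
    {"label": "PRODUCT", "pattern": [{"LOWER": "gpt"}, {"IS_DIGIT": True}]},
    # Token pattern with an optional version number, so "Claude" and "Claude 3" both match
    {"label": "PRODUCT", "pattern": [{"LOWER": "claude"}, {"IS_DIGIT": True, "OP": "?"}]},
]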
ruler.add_patterns(patterns)

doc = nlp("Anthropic released Claude 3, while DeepMind launched Gemini Ultra.")
for ent in doc.ents:
    print(f"{ent.text} → {ent.label_}")

OUTPUT

Anthropic → ORG
Claude 3 → PRODUCT
DeepMind → ORG
Gemini → PRODUCT (detected by base model)

Section 14

NER Model Comparison — spaCy vs HuggingFace vs Stanford

Property	spaCy (sm/md/lg)	HuggingFace Transformers	Stanford CoreNLP
Architecture	CNN / HashEmbed	BERT / RoBERTa	CRF (statistical)
NER F1 (approx.; spaCy on OntoNotes, others on CoNLL-2003)	~0.85–0.90	~0.91–0.94	~0.87
Speed (CPU)	Very fast (~50k words/s)	Slow (~2–5k words/s)	Moderate
Custom training	Yes — easy CLI	Yes — Trainer API	Complex Java setup
Multilingual	60+ languages	mBERT, XLM-RoBERTa	Limited
Best for	Production pipelines, speed	Max accuracy, research	Legacy Java systems

Section 15

Golden Rules of NER

🎯 NER — Non-Negotiable Rules

Always use nlp.pipe() for batch processing. Calling nlp(text) in a loop handles one document at a time and skips spaCy's internal batching. For 10,000+ documents, the difference in wall-clock time can be substantial.

Always evaluate with seqeval at the span level, never token-level accuracy. A model that gets the boundary wrong by one word deserves zero credit — seqeval enforces this correctly.

Do not assume a general model works on domain-specific text. Models trained on news text perform poorly on medical, legal, or technical documents. Always evaluate on a held-out sample from your actual target domain before deploying.

Use EntityRuler before the NER component for known, high-precision entities (product names, proprietary terms). Rules set early protect the statistical model from overriding high-confidence known entities.

When training custom NER, aim for at least 200 annotated examples per entity type as a rule of thumb. With fewer examples the model tends to memorise rather than generalise. If you can't annotate 200 examples, consider few-shot approaches with GPT-4 or a fine-tuned LLM instead.

Always call spacy.explain(label) when inspecting unfamiliar labels. Never guess what NORP, FAC, or WORK_OF_ART means — documentation is one call away.

For multilingual NER, consider XLM-RoBERTa from HuggingFace (xlm-roberta-base) before training per-language models. A single cross-lingual model fine-tuned on your target languages often matches or beats separate models in low-resource scenarios, but verify this on your own languages.