
Named Entity Recognition (NER)

A comprehensive, story-driven tutorial on Named Entity Recognition (NER) — covering what NER is, how BIO tagging works, the three generations of NER approaches, and hands-on Python code using spaCy and HuggingFace Transformers, including custom model training, batch processing, and production best practices.

Section 01

The Story That Explains NER

The Highlighter Problem — Scanning a Newspaper
Imagine you are given a stack of 10,000 newspaper articles and asked to highlight every person's name in yellow, every city or country in blue, every company name in green, and every date in orange.

A human expert could do it — slowly, expensively, and with occasional mistakes. Now imagine doing it for 10 million articles, in 50 languages, in real time.

That is exactly what Named Entity Recognition (NER) does — automatically, at superhuman speed. It reads raw text and labels every important "thing" with a category. It is the highlighter that never tires.

Named Entity Recognition is an NLP (Natural Language Processing) task that identifies and classifies named entities (real-world objects) within unstructured text. A "named entity" is any span of text that refers to a specific, identifiable thing: a person, an organisation, a location, a date, a currency, a product, and so on.

🔎
What Makes NER Different from Other NLP Tasks?

Most NLP tasks classify an entire sentence (sentiment analysis) or generate new text (summarisation). NER is a token-level classification task — it labels every individual word (or sub-word) in a sentence. This makes it fundamentally harder: the model must understand not just what a word means, but where it starts, where it ends, and what type of thing it refers to.


Section 02

Why NER Matters — Real Applications

NER is not an academic exercise. It is a core component of some of the most important information systems in the world — powering everything from search engines to financial compliance systems.

📰
News & Media Intelligence
Information Extraction
Reuters, Bloomberg, and Google News use NER to automatically tag articles with people, organisations, and locations — enabling structured search and trend analysis over millions of articles per day.
🏥
Healthcare & Clinical NLP
Medical Records
Hospitals extract drug names, disease terms, dosages, and patient symptoms from unstructured clinical notes, turning free-text records into structured medical databases.
📈
Finance & Compliance
Risk & Regulation
Banks monitor transactions and news in real time, extracting entity mentions to detect sanction breaches, insider trading signals, or links between entities in SEC filings.
🔍
Search Engines
Knowledge Graphs
Knowledge graphs such as Google's rely on entity recognition. When you search "Tesla CEO", the engine recognises "Tesla" as an ORG and "CEO" as a role relation, then retrieves the correct person.
💬
Chatbots & Voice Assistants
Intent Understanding
When you say "Book a flight to Paris on Friday", an assistant such as Siri or Alexa uses NER to extract "Paris" (GPE) and "Friday" (DATE) before booking anything.
⚖️
Legal & Contract Analysis
Document Intelligence
Legal AI tools extract party names, clause types, jurisdiction, dates, and monetary values from thousands of contracts — work that would take lawyers weeks takes minutes.

Section 03

The Standard Entity Types

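Different NER systems use different label sets. CoNLL-2003 defines just four types (PER, ORG, LOC, MISC), while spaCy's English pipelines use the richer OntoNotes 5 scheme. The table below lists the most common OntoNotes-style categories as they appear in spaCy: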

Label | Entity Type | Example | Domain
PERSON | People, real or fictional | Elon Musk, Sherlock Holmes | Universal
ORG | Companies, agencies, institutions | OpenAI, NASA, WHO | Universal
GPE | Geo-political entities (countries, cities) | India, New York, the EU | Universal
LOC | Non-GPE locations (mountains, rivers) | the Amazon, Mount Everest | Universal
DATE | Absolute or relative dates and periods | January 2024, last Tuesday, Q3 | Universal
MONEY | Monetary values including currency | $4.2 billion, £50,000 | Finance
PRODUCT | Objects, vehicles, foods | iPhone 15, Tesla Model S | E-commerce
EVENT | Named hurricanes, battles, elections | World War II, the Olympics | News
LAW | Named documents made into laws | GDPR, the US Constitution | Legal
PERCENT | Percentage including "%" | 85%, twenty percent | Finance
⚠️
The Ambiguity Problem

"Apple announced record profits" — is Apple a fruit or a company? "Jordan signed the agreement" — is Jordan a person or a country? "May confirmed the decision" — is May a month, a name, or a modal verb? This context-dependency is why simple rule-based approaches fail and why modern NER uses deep contextual models like Transformers.


Section 04

How NER Works — The BIO Tagging Scheme

The BIO Puzzle — Turning NER into a Labelling Game
The central challenge in NER is that a named entity can span multiple words. "New York City" is three words that form one entity. "Goldman Sachs Group Inc" is four words forming one entity.

How do we tell a model where an entity starts versus where it continues? The answer is a simple but powerful labelling scheme called BIO tagging: Beginning, Inside, Outside.
🏭 BIO Tags — Sentence: "Barack Obama visited New York City last Tuesday"
B-PERSON
Barack — Beginning of a PERSON entity
I-PERSON
Obama — Inside (continuation) of the PERSON entity
O
visited — Outside (not an entity)
B-GPE
New — Beginning of a GPE (location) entity
I-GPE
York — Inside the GPE entity
I-GPE
City — Still inside the GPE entity
B-DATE
last — Beginning of a DATE entity (relative dates such as "last Tuesday" form one span)
I-DATE
Tuesday — Inside the DATE entity
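Reading BIO tags back into entities is a simple grouping pass. The sketch below is a minimal illustration, not any library's decoder: the function name bio_to_spans is hypothetical, and it uses only the first six tokens of the example sentence.

```python
from typing import List, Optional, Tuple

def bio_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (entity_text, label) pairs."""
    spans: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_label: Optional[str] = None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity starts: flush any entity that is still open.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [token], tag[2:]
        elif tag.startswith("I-") and current_tokens and tag[2:] == current_label:
            # Continuation of the open entity.
            current_tokens.append(token)
        else:
            # "O", or an I- tag that does not continue the open entity.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [], None
    if current_tokens:
        spans.append((" ".join(current_tokens), current_label))
    return spans

tokens = ["Barack", "Obama", "visited", "New", "York", "City"]
tags = ["B-PERSON", "I-PERSON", "O", "B-GPE", "I-GPE", "I-GPE"]
spans = bio_to_spans(tokens, tags)
print(spans)  # [('Barack Obama', 'PERSON'), ('New York City', 'GPE')]
```

The B- prefix is what lets the decoder split two adjacent entities of the same type; without it, back-to-back names would merge into one span.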
💡
BIOES — An Extension You May Encounter

Some systems use BIOES: Beginning, Inside, Outside, End, Single. The E tag marks the last token of a multi-word entity, and S marks a single-token entity. This gives the model stronger positional signals and often improves performance on longer entity spans.
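Converting between the two schemes is mechanical. The following sketch (the helper name bio_to_bioes is an assumption for illustration, and it expects well-formed BIO input) shows how the E and S tags are derived by looking one tag ahead.

```python
from typing import List

def bio_to_bioes(tags: List[str]) -> List[str]:
    """Convert a well-formed BIO tag sequence to BIOES."""
    bioes = []
    for i, tag in enumerate(tags):
        if tag == "O":
            bioes.append(tag)
            continue
        prefix, label = tag.split("-", 1)
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        # The entity continues only if the next tag is I- with the same label.
        continues = next_tag == f"I-{label}"
        if prefix == "B":
            bioes.append(f"B-{label}" if continues else f"S-{label}")
        else:  # prefix == "I"
            bioes.append(f"I-{label}" if continues else f"E-{label}")
    return bioes

# Tags for: "Tim Cook flew from New York City to Paris"
bio = ["B-PERSON", "I-PERSON", "O", "O", "B-GPE", "I-GPE", "I-GPE", "O", "B-GPE"]
bioes = bio_to_bioes(bio)
print(bioes)
# ['B-PERSON', 'E-PERSON', 'O', 'O', 'B-GPE', 'I-GPE', 'E-GPE', 'O', 'S-GPE']
```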


Section 05

Three Generations of NER Approaches

NER has evolved dramatically over the past 30 years. Understanding all three generations gives you the context to choose the right tool for any situation.

01
Rule-Based / Gazetteer Systems (1990s–2000s)
Hard-coded lists of known entities (gazetteers) + hand-crafted regex patterns. If the word appeared in a list of known cities, it was tagged LOC. Fast and highly precise on known entities — but brittle, expensive to maintain, and completely blind to new names. Companies like Bloomberg maintained massive entity dictionaries just to keep these systems running.
02
Statistical / ML Models (2000s–2017)
Conditional Random Fields (CRFs) and Hidden Markov Models dominated this era. Features like capitalisation, prefix/suffix, surrounding words, and POS tags were hand-engineered and fed into a probabilistic sequence model. The Stanford NER tagger (still widely used today) is a CRF. Excellent for well-defined domains, but performance plateaus because features must be hand-crafted.
03
Neural / Transformer Models (2018–Present)
BERT, RoBERTa, and their fine-tuned variants now dominate NER benchmarks. Pre-trained on billions of words, these models learn rich contextual representations of every token. Fine-tuning on labelled NER data typically takes minutes to hours on a GPU, depending on dataset size. spaCy's en_core_web_trf pipeline and the community model dslim/bert-base-NER on the HuggingFace Hub are common starting points for production NER. Fine-tuned transformers reach F1 scores above 0.92 on CoNLL-2003, a benchmark where gaining a single point once took years of research.
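To make the first generation concrete, here is a minimal sketch of a gazetteer tagger. The dictionary contents and the function name gazetteer_tag are hypothetical; real systems used far larger lists plus regex rules. It does a greedy longest-match lookup, and the second call shows the brittleness described above: a name missing from the list is simply invisible.

```python
from typing import Dict, List, Tuple

# Hypothetical mini-gazetteer: lower-cased token tuples mapped to a label.
GAZETTEER: Dict[Tuple[str, ...], str] = {
    ("new", "york"): "GPE",
    ("new", "york", "city"): "GPE",
    ("london",): "GPE",
    ("goldman", "sachs"): "ORG",
}
MAX_LEN = max(len(key) for key in GAZETTEER)

def gazetteer_tag(tokens: List[str]) -> List[Tuple[str, str]]:
    """Greedy longest-match lookup of token n-grams in the gazetteer."""
    lowered = [t.lower() for t in tokens]
    found = []
    i = 0
    while i < len(tokens):
        match_len = 0
        # Try the longest candidate first so "New York City" beats "New York".
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            if tuple(lowered[i:i + n]) in GAZETTEER:
                match_len = n
                break
        if match_len:
            label = GAZETTEER[tuple(lowered[i:i + match_len])]
            found.append((" ".join(tokens[i:i + match_len]), label))
            i += match_len
        else:
            i += 1
    return found

known = gazetteer_tag("Goldman Sachs opened an office in New York City".split())
unknown = gazetteer_tag("Anthropic opened an office in London".split())
print(known)    # [('Goldman Sachs', 'ORG'), ('New York City', 'GPE')]
print(unknown)  # [('London', 'GPE')]  ("Anthropic" is not in the list)
```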

Section 06

Quick Start — NER with spaCy

spaCy is the most production-ready NLP library in Python. Its NER pipeline is pre-trained and ready to use in under five lines of code.

# Step 1: Install spaCy and download a model
# Run in terminal: pip install spacy
# Run in terminal: python -m spacy download en_core_web_sm

import spacy

# Load a pre-trained English model
nlp = spacy.load("en_core_web_sm")

# Process a sentence
text = "Apple CEO Tim Cook announced a $110 billion share buyback in Cupertino on Tuesday."
doc = nlp(text)

# Print all detected entities
for ent in doc.ents:
    print(f"{ent.text:25} | {ent.label_:10} | {spacy.explain(ent.label_)}")
OUTPUT
Apple                     | ORG        | Companies, agencies, institutions, etc.
Tim Cook                  | PERSON     | People, including fictional
$110 billion              | MONEY      | Monetary values, including unit
Cupertino                 | GPE        | Countries, cities, states
Tuesday                   | DATE       | Absolute or relative dates or periods
spacy.explain() — Your Best Friend

spacy.explain("GPE") returns a human-readable description of any label. Never guess what a cryptic tag means — always explain it. Use this liberally when exploring a new model's output or debugging unexpected labels.


Section 07

Visualising NER Output — displaCy

spaCy ships with a built-in visualiser called displaCy that renders NER annotations beautifully inline in Jupyter Notebooks or as standalone HTML.

import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")

text = """Elon Musk founded SpaceX in 2002 in Hawthorne, California.
The company secured a $1.6 billion NASA contract in 2008."""

doc = nlp(text)

# Render inline in a Jupyter Notebook
displacy.render(doc, style="ent", jupyter=True)

# Or save as a standalone HTML file (jupyter=False makes render() return the markup)
html = displacy.render(doc, style="ent", page=True, jupyter=False)
with open("ner_output.html", "w", encoding="utf-8") as f:
    f.write(html)
🎨 displaCy NER Visualisation — Simulated Output

Elon Musk PERSON  founded  SpaceX ORG  in  2002 DATE  in  Hawthorne, California GPE. The company secured a  $1.6 billion MONEY  NASA ORG  contract in  2008 DATE.

This is how displaCy renders entity spans directly inside your browser or notebook.


Section 08

Choosing Your Model — sm vs md vs lg vs trf

spaCy ships four English model tiers. Choosing the right one is a trade-off between accuracy, speed, and size. The figures below are approximate and vary between spaCy releases:

Model | Architecture | NER F1 | Speed | Download size | Best For
en_core_web_sm | CNN + HashEmbed | ~0.85 | Very fast | ~12 MB | Prototyping, edge devices
en_core_web_md | CNN + word vectors | ~0.85 | Fast | ~43 MB | General production, better OOV
en_core_web_lg | CNN + large word vectors | ~0.86 | Moderate | ~741 MB | When embedding coverage matters
en_core_web_trf | RoBERTa Transformer | ~0.90 | Slow on CPU (GPU recommended) | ~435 MB | Maximum accuracy in production
The Practitioner's Rule for Model Selection

Start with en_core_web_sm for all prototyping. Switch to en_core_web_trf only when accuracy on ambiguous or domain-specific text is critical — and only if you have a GPU. For most batch-processing pipelines, en_core_web_md hits the ideal accuracy/speed sweet spot.


Section 09

NER with HuggingFace Transformers

For the highest accuracy — particularly on domain-specific text — the HuggingFace transformers library gives you access to hundreds of fine-tuned NER models from the Model Hub. Here is how to use it with a single pipeline call:

from transformers import pipeline

# Load a BERT-based NER model from HuggingFace Hub
ner = pipeline(
    "ner",
    model="dslim/bert-base-NER",
    aggregation_strategy="simple"  # merges B/I tokens into spans
)

text = "Sundar Pichai, CEO of Google, visited London last week for a summit on AI safety."

results = ner(text)

for entity in results:
    print(f"Entity : {entity['word']}")
    print(f"Type   : {entity['entity_group']}")
    print(f"Score  : {entity['score']:.4f}")
    print(f"Span   : chars {entity['start']}–{entity['end']}")
    print("---")
OUTPUT
Entity : Sundar Pichai
Type   : PER
Score  : 0.9991
Span   : chars 0–13
---
Entity : Google
Type   : ORG
Score  : 0.9998
Span   : chars 22–28
---
Entity : London
Type   : LOC
Score  : 0.9995
Span   : chars 38–44
---
🔨
aggregation_strategy — The Hidden Power Parameter

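Without aggregation_strategy, the pipeline defaults to "none" and returns one row per word-piece token, so "Sundar Pichai" comes back as several rows tagged B-PER or I-PER, with rare words split into pieces prefixed by ##. Setting it to "simple" merges adjacent tokens that share an entity type into a single span; "first", "average", and "max" additionally resolve disagreements between the word pieces of one word. Set it explicitly whenever you want entity spans rather than raw token predictions.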
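The idea behind aggregation can be sketched in plain Python. This is a simplified illustration and not the transformers implementation: the function name merge_token_rows is hypothetical, the token rows and scores are hand-written, and the way "Sundar" is split into word pieces is assumed for the example.

```python
from typing import Dict, List

def merge_token_rows(text: str, rows: List[Dict]) -> List[Dict]:
    """Merge per-token B-/I- rows into entity spans using character offsets."""
    spans: List[Dict] = []
    for row in rows:
        prefix, label = row["entity"].split("-", 1)
        extends_previous = bool(
            prefix == "I"
            and spans
            and spans[-1]["entity_group"] == label
            # Only merge tokens that touch or are separated by one space.
            and row["start"] - spans[-1]["end"] <= 1
        )
        if extends_previous:
            spans[-1]["end"] = row["end"]
            spans[-1]["scores"].append(row["score"])
        else:
            spans.append({"entity_group": label, "start": row["start"],
                          "end": row["end"], "scores": [row["score"]]})
    return [
        {
            "entity_group": s["entity_group"],
            "word": text[s["start"]:s["end"]],  # recover the surface text
            "score": sum(s["scores"]) / len(s["scores"]),  # mean token score
            "start": s["start"],
            "end": s["end"],
        }
        for s in spans
    ]

text = "Sundar Pichai visited London."
# Hand-written token rows in the shape the pipeline returns without aggregation.
rows = [
    {"entity": "B-PER", "word": "Sun", "start": 0, "end": 3, "score": 0.99},
    {"entity": "I-PER", "word": "##dar", "start": 3, "end": 6, "score": 0.98},
    {"entity": "I-PER", "word": "Pichai", "start": 7, "end": 13, "score": 0.99},
    {"entity": "B-LOC", "word": "London", "start": 22, "end": 28, "score": 0.99},
]
merged = merge_token_rows(text, rows)
for m in merged:
    print(m["entity_group"], repr(m["word"]), m["start"], m["end"])
# PER 'Sundar Pichai' 0 13
# LOC 'London' 22 28
```

Slicing the original text by character offsets, rather than joining the word pieces, avoids having to strip the ## markers by hand.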


Section 10

Training a Custom NER Model — The Full Pipeline

The Vet Clinic Problem
A veterinary clinic wants to extract animal species, medication names, and dosages from clinical notes. The general-purpose spaCy model has never heard of "Metacam 0.5mg/ml" or "Bengal cat with FIP" as structured entities.

Pre-trained models are trained on news text. In medical, legal, scientific, or niche domains, you almost always need to train a custom NER model on domain-specific labelled data. Here is the complete workflow.
📄 Custom NER Training Workflow — Step by Step
Step 1
Collect raw text — gather 500–5,000 sentences from your target domain
Step 2
Annotate entities — use Label Studio, Prodigy, or Doccano to label spans
Step 3
Convert to spaCy format — transform annotations into DocBin format
Step 4
Configure training — generate a config.cfg file with spaCy's init utility
Step 5
Train and evaluate — run spacy train config.cfg and watch F1 climb
Step 6
Package and deploy — spacy package produces an installable model
import spacy
from spacy.tokens import DocBin
from spacy.training import Example

# ── Step 1: Define training data ───────────────────────────
# Format: (text, {"entities": [(start, end, label), ...]})
TRAIN_DATA = [
    ("Give Bella 0.5ml of Metacam twice daily.",
     {"entities": [(5, 10, "ANIMAL_NAME"), (11, 16, "DOSAGE"), (20, 27, "DRUG")]}),
    ("Max the Labrador received 10mg of Carprofen.",
     {"entities": [(0, 3, "ANIMAL_NAME"), (8, 16, "SPECIES"), (26, 30, "DOSAGE"), (34, 43, "DRUG")]}),
]

# ── Step 2: Create a blank English model ───────────────────
nlp = spacy.blank("en")
ner = nlp.add_pipe("ner")

# ── Step 3: Add custom labels ──────────────────────────────
for _, annotations in TRAIN_DATA:
    for _, _, label in annotations["entities"]:
        ner.add_label(label)

# ── Step 4: Build DocBin for efficient training ────────────
db = DocBin()
for text, annotations in TRAIN_DATA:
    doc = nlp.make_doc(text)
    example = Example.from_dict(doc, annotations)
    db.add(example.reference)
db.to_disk("./train.spacy")

# ── Step 5: Train (basic loop; use the spacy train CLI for real projects)
# Build the Example objects once instead of on every iteration
examples = [Example.from_dict(nlp.make_doc(text), annotations)
            for text, annotations in TRAIN_DATA]

# initialize() returns the optimizer; begin_training() is deprecated in spaCy v3
optimizer = nlp.initialize(lambda: examples)

for itn in range(30):
    losses = {}
    nlp.update(examples, drop=0.3, sgd=optimizer, losses=losses)
    if itn % 10 == 0:
        print(f"Iteration {itn:3d} | Loss: {losses['ner']:.4f}")

# ── Step 6: Save and test the model ───────────────────────
nlp.to_disk("./vet_ner_model")
print("Model saved.")

# Test on a new sentence
nlp2 = spacy.load("./vet_ner_model")
doc = nlp2("Administer 0.2ml of Metacam to Rocky.")
for ent in doc.ents:
    print(f"{ent.text} → {ent.label_}")
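OUTPUT (illustrative; exact losses and predictions vary between runs, especially with only two training sentences)
Iteration   0 | Loss: 14.8234
Iteration  10 | Loss: 3.2107
Iteration  20 | Loss: 0.8812
Model saved.
0.2ml → DOSAGE
Metacam → DRUG
Rocky → ANIMAL_NAME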
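Character offsets in the (start, end, label) format are easy to get wrong by one, and a misaligned span is typically ignored during training rather than raising an error. A small sanity check before training can catch this. The sketch below is a heuristic under stated assumptions: check_offsets and the sample data are hypothetical, and the last annotation is deliberately off by one.

```python
from typing import Dict, List, Tuple

Annotation = Tuple[str, Dict[str, List[Tuple[int, int, str]]]]

def check_offsets(data: List[Annotation]) -> List[str]:
    """Return a list of human-readable problems for suspicious entity offsets."""
    problems = []
    for text, annotations in data:
        for start, end, label in annotations["entities"]:
            if not (0 <= start < end <= len(text)):
                problems.append(f"{label} ({start}, {end}): out of range")
                continue
            span = text[start:end]
            # The span begins or ends in the middle of a word.
            cuts_left = start > 0 and text[start - 1].isalnum() and span[0].isalnum()
            cuts_right = end < len(text) and span[-1].isalnum() and text[end].isalnum()
            # The span carries stray whitespace or trailing punctuation.
            ragged = span != span.strip() or span[-1] in ".,;:!?"
            if cuts_left or cuts_right or ragged:
                problems.append(f"{label} ({start}, {end}): suspicious span {span!r}")
    return problems

SAMPLE: List[Annotation] = [
    ("Max received 10mg of Carprofen.",
     {"entities": [(0, 3, "ANIMAL_NAME"), (13, 17, "DOSAGE"), (22, 31, "DRUG")]}),
]
problems = check_offsets(SAMPLE)
print(problems)  # ["DRUG (22, 31): suspicious span 'arprofen.'"]
```

Printing text[start:end] for every annotation is the quickest manual version of the same check.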

Section 11

Evaluating NER — Metrics That Matter

NER is evaluated at the entity span level, not the word level. An entity is only counted as correct if both its boundary and its type are predicted correctly. A partial match counts as a full miss.

Precision
TP / (TP + FP)
Of all entities the model predicted, what fraction were correct? Punishes false alarms.
Recall
TP / (TP + FN)
Of all real entities in the text, what fraction did the model find? Punishes missed entities.
F1 Score
2 · P · R / (P + R)
The harmonic mean of precision and recall. The primary NER benchmark metric.
Partial Match
Boundary OR Type wrong
"New York" tagged as "New York City" is a partial match — counted as a miss in strict evaluation.
from seqeval.metrics import classification_report, f1_score

# True labels (BIO format, one list per sentence)
y_true = [["B-PERSON", "I-PERSON", "O", "B-ORG", "O"]]
y_pred = [["B-PERSON", "I-PERSON", "O", "B-GPE", "O"]]
# Note: model confused ORG → GPE for the 4th token

print(classification_report(y_true, y_pred))
print(f"Overall F1: {f1_score(y_true, y_pred):.4f}")
OUTPUT (per-class rows and averages may differ slightly between seqeval versions)
              precision    recall  f1-score   support

      PERSON       1.00      1.00      1.00         1
         ORG       0.00      0.00      0.00         1

   micro avg       0.50      0.50      0.50         2
   macro avg       0.50      0.50      0.50         2

Overall F1: 0.5000
⚠️
The seqeval Library — Non-Negotiable for NER Evaluation

Never use scikit-learn's classification_report directly on BIO labels — it evaluates at the token level and gives misleadingly high scores. seqeval correctly handles span-level evaluation. Install with pip install seqeval. Always use it.
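To see why token-level scores mislead, the following sketch computes both on one sentence. It is a simplified stand-in for strict span matching, not seqeval's code; extract_spans and span_prf are hypothetical names, and stray I- tags without an opening B- are simply ignored here.

```python
from typing import List, Optional, Set, Tuple

Span = Tuple[str, int, int]  # (label, start_token, end_token_exclusive)

def extract_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from a BIO tag sequence."""
    spans: Set[Span] = set()
    start: Optional[int] = None
    label: Optional[str] = None
    for i, tag in enumerate(tags + ["O"]):  # sentinel "O" flushes the last span
        if tag.startswith("I-") and label is not None and tag[2:] == label:
            continue  # still inside the open entity
        if label is not None:
            spans.add((label, start, i))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return spans

def span_prf(y_true: List[str], y_pred: List[str]) -> Tuple[float, float, float]:
    """Strict span-level precision, recall, and F1 for one sentence."""
    true_spans, pred_spans = extract_spans(y_true), extract_spans(y_pred)
    tp = len(true_spans & pred_spans)  # boundary AND type must both match
    precision = tp / len(pred_spans) if pred_spans else 0.0
    recall = tp / len(true_spans) if true_spans else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# "New York City" predicted as "New York": the boundary is off by one token.
y_true = ["B-GPE", "I-GPE", "I-GPE", "O", "B-PERSON"]
y_pred = ["B-GPE", "I-GPE", "O", "O", "B-PERSON"]

token_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
scores = span_prf(y_true, y_pred)
print(token_accuracy)  # 0.8
print(scores)          # (0.5, 0.5, 0.5)
```

Four of five tokens are right, yet only one of two entities is, which is the gap between the token-level and span-level views.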


Section 12

NER in a Real Pipeline — Building an Entity Extractor

In production, NER is rarely used alone. It feeds downstream tasks like entity linking, relation extraction, or knowledge graph population. Here is a complete, production-ready entity extraction pipeline:

import spacy
import pandas as pd
from collections import Counter
from typing import List, Dict

# ── Load model ─────────────────────────────────────────────
nlp = spacy.load("en_core_web_sm")

# ── Sample news corpus ─────────────────────────────────────
corpus = [
    "Amazon acquired MGM for $8.45 billion in 2021, strengthening its Prime Video library.",
    "Satya Nadella, CEO of Microsoft, met with EU regulators in Brussels on Monday.",
    "The Federal Reserve raised interest rates by 0.25% to combat inflation in the United States.",
    "Tesla reported $25.1 billion in revenue for Q4 2023, beating Wall Street estimates.",
]

# ── Extract entities from all documents ────────────────────
def extract_entities(texts: List[str]) -> List[Dict]:
    records = []
    for doc in nlp.pipe(texts):  # nlp.pipe() is much faster than looping
        for ent in doc.ents:
            records.append({
                "text": doc.text[:60] + "...",
                "entity": ent.text,
                "label": ent.label_,
                "start": ent.start_char,
                "end": ent.end_char,
                "description": spacy.explain(ent.label_)
            })
    return records

records = extract_entities(corpus)
df = pd.DataFrame(records)

# ── Summarise results ──────────────────────────────────────
print("=== Entity Type Distribution ===")
print(df["label"].value_counts().to_string())

print("\n=== Top Mentioned Organisations ===")
orgs = df[df["label"] == "ORG"]["entity"]
print(Counter(orgs).most_common(5))
OUTPUT (illustrative; exact counts depend on the model version)
=== Entity Type Distribution ===
ORG        6
MONEY      3
DATE       2
PERSON     1
GPE        1
PERCENT    1

=== Top Mentioned Organisations ===
[('Amazon', 1), ('MGM', 1), ('Microsoft', 1), ('EU', 1), ('Federal Reserve', 1)]
Always Use nlp.pipe() for Batch Processing

nlp.pipe(texts) processes documents in batches rather than one at a time, which can be substantially faster than calling nlp(text) in a loop; the exact gain depends on the model, document length, and hardware. For large corpora, skip components you don't need, for example by passing disable=["tagger", "parser", "attribute_ruler", "lemmatizer"] to spacy.load(), and consider the n_process argument of nlp.pipe() to spread work across CPU cores.


Section 13

Common NER Pitfalls and How to Fix Them

Pitfall | Symptom | Fix
Wrong entity type | "Apple" → PERSON instead of ORG | Use a larger model (trf); add context in training data
Span boundary errors | "New York" found but "New York City" missed | Fine-tune on domain text; add more multi-word examples
Missed rare entities | New company names not detected | Augment with rule-based patterns using EntityRuler
Domain mismatch | Medical/legal terms unrecognised | Fine-tune on domain-specific labelled data
Overlapping entities | "New York Times" tagged as both ORG and LOC | Prioritise by label; add entity to training to assert type
Slow batch processing | Processing 100k docs takes hours | Use nlp.pipe(); disable unused pipeline components
import spacy
from spacy.pipeline import EntityRuler

# Fix: Add rule-based patterns for known entities the model misses
nlp = spacy.load("en_core_web_sm")

# EntityRuler runs BEFORE the statistical NER model
ruler = nlp.add_pipe("entity_ruler", before="ner")

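patterns = [
    {"label": "ORG",    "pattern": "DeepMind"},
    {"label": "ORG",    "pattern": "Anthropic"},
    {"label": "PRODUCT", "pattern": [{"LOWER": "gpt"}, {"IS_DIGIT": True}]},
    # Match "Claude" with an optional version number, so "Claude 3" is one span
    {"label": "PRODUCT", "pattern": [{"LOWER": "claude"}, {"IS_DIGIT": True, "OP": "?"}]},
]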
ruler.add_patterns(patterns)

doc = nlp("Anthropic released Claude 3, while DeepMind launched Gemini Ultra.")
for ent in doc.ents:
    print(f"{ent.text} → {ent.label_}")
OUTPUT
Anthropic → ORG
Claude 3 → PRODUCT
DeepMind → ORG
Gemini → PRODUCT (from the base model; this prediction may vary)

Section 14

NER Model Comparison — spaCy vs HuggingFace vs Stanford

Property | spaCy (sm/md/lg) | HuggingFace Transformers | Stanford CoreNLP
Architecture | CNN / HashEmbed | BERT / RoBERTa | CRF (statistical)
Reported NER F1 | ~0.85–0.86 (OntoNotes) | ~0.91–0.94 (CoNLL-2003) | ~0.87 (CoNLL-2003)
Speed (CPU) | Very fast (~50k words/s) | Slow (~2–5k words/s) | Moderate
Custom training | Yes, easy CLI | Yes, Trainer API | Complex Java setup
Multilingual | 60+ languages | mBERT, XLM-RoBERTa | Limited
Best for | Production pipelines, speed | Max accuracy, research | Legacy Java systems

Section 15

Golden Rules of NER

🎯 NER — Non-Negotiable Rules
1
Always use nlp.pipe() for batch processing. Calling nlp(text) in a loop processes one document at a time and skips spaCy's internal batching; nlp.pipe() batches by default and can use several processes via n_process. For 10,000+ documents, the saving is often substantial.
2
Always evaluate with seqeval at the span level, never token-level accuracy. A model that gets the boundary wrong by one word deserves zero credit — seqeval enforces this correctly.
3
Do not assume a general model works on domain-specific text. Models trained on news text perform poorly on medical, legal, or technical documents. Always evaluate on a held-out sample from your actual target domain before deploying.
4
Use EntityRuler before the NER component for known, high-precision entities (product names, proprietary terms). Entities set by the ruler are kept, and the statistical model only fills in predictions around them.
5
When training custom NER, aim for a few hundred annotated examples per entity type as a starting point (200 is a common rule of thumb, not a hard threshold). With much less data the model tends to memorise rather than generalise. If you can't annotate that many, consider few-shot approaches with GPT-4 or a fine-tuned LLM instead.
6
Always call spacy.explain(label) when inspecting unfamiliar labels. Never guess what NORP, FAC, or WORK_OF_ART means — documentation is one call away.
7
For multilingual NER, consider XLM-RoBERTa from HuggingFace (xlm-roberta-base) before training per-language models. A single cross-lingual model fine-tuned on your target languages often beats separate models in low-resource scenarios.