Linguistics for NLP: Morphology, Syntax, Semantics, Pragmatics
A deep, story-driven tutorial that takes the reader from the four foundational layers of linguistics — morphology (how words are built), syntax (how sentences are structured), semantics (how meaning works), and pragmatics (how intent shapes language) — all the way through the most difficult real-world NLP challenges: ambiguity in its six forms, long-range context dependency, sarcasm and irony detection, idioms and figurative language, domain shift, data bias, and low-resource languages.
Section 01
Why Machines Need Linguistics
📖 Real World Story
The Million-Dollar Comma
In 2006, a Canadian telecommunications company lost a contract dispute, reported at the
time as costing it roughly a million Canadian dollars, over the placement of a single comma.
The contract read: "…shall continue in force for a period of five years from the date it is
made, and thereafter for successive five year terms, unless and until terminated by one year
prior notice in writing by either party."
The regulator ruled that the comma before "unless and until terminated" meant the termination
clause applied to the entire contract, not just the renewal periods, and the company lost.
If a single punctuation mark can cost a million dollars between two humans sharing the same
language, imagine what happens when a machine tries to parse millions of sentences per second:
sentences written in slang, with typos, in dialect, laced with sarcasm, and dense with
cultural subtext that no dictionary will ever capture.
This is why NLP practitioners must understand linguistics. Not to write
grammar textbooks — but to know exactly where meaning lives inside language, and exactly
where machines will break.
Linguistics is the scientific study of language. For NLP engineers, four sub-fields matter
most: morphology (the structure of words), syntax
(the structure of sentences), semantics (the structure of meaning), and
pragmatics (the structure of intention). Each one represents a different
layer of the linguistic stack — and a different class of challenge for machines to solve.
Layer 1
Morphology — The Architecture of Words
How words are built from smaller units of meaning called morphemes. The word "unhappiness" contains three: un- (negation) + happi (root) + -ness (state). Understanding morphology lets NLP systems handle word forms, prefixes, suffixes, and inflections — critical for low-resource languages and out-of-vocabulary words.
un + happi + ness → unhappiness (3 morphemes)
Machine difficulty:
Moderate
builds into sentences
Layer 2
Syntax — The Grammar of Sentences
How words are arranged into grammatical sentences according to rules (phrase structure, dependency relations). Syntax tells us that "Dog bites man" and "Man bites dog" use identical words but describe completely different events. Parsing syntax allows machines to understand subject, object, and the relationships between concepts.
[NP The dog] [VP [V bites] [NP the man]]
Machine difficulty:
Hard
arranges into meaning
Layer 3
Semantics — The Meaning of Language
How sentences and words express meaning, independent of how they are used in context. Semantics covers word sense disambiguation ("bank" as institution or river), coreference ("she" refers to which person?), entailment (does sentence A logically imply sentence B?), and compositionality (how word meanings combine into sentence meaning).
"I went to the bank." → financial vs. river — context decides
Machine difficulty:
Very Hard
interpreted through context & intent
Layer 4
Pragmatics — Language in the Wild
How context, culture, and speaker intent shape the real meaning of an utterance — which often differs completely from its literal meaning. "Can you pass the salt?" is grammatically a yes/no question about physical ability. Pragmatically, it is a polite request. Pragmatics covers speech acts, implicature, politeness, sarcasm, and discourse structure.
"Great work." → sincere praise or biting sarcasm?
Machine difficulty:
Extremely Hard
Section 02
Morphology — Building Blocks of Words
A morpheme is the smallest unit of meaning in a language. Unlike individual letters or
sounds, which carry no meaning on their own, morphemes carry actual semantic content. NLP
systems that model morphology can handle new words more gracefully, including words they have
never seen before.
🔭 Morpheme Types — A Taxonomy
Free
Can stand alone as a word. Examples: run, happy, book. These are roots — the smallest independent unit of meaning.
Prefix
Attaches before the root and modifies its meaning. un- (negation), re- (repetition), pre- (before), mis- (wrongly).
Suffix
Attaches after the root; it either changes the part of speech or marks a grammatical category. -ness (forms nouns), -ly (forms adverbs), -ing (gerund or present participle), -ed (past tense).
Infix
Inserted within the root (rare in English, common in Tagalog). English expletive infixation: abso-bloody-lutely.
Inflection
Marks grammatical categories without changing part of speech. walk → walked → walking → walks. Tense, number, person, case.
Derivation
Creates new words, often changing part of speech. happy (adj) → happiness (noun) → happily (adv) → unhappy (adj).
Why Morphology Matters for NLP
📋
Out-of-Vocabulary Words
A word-level model that never saw "pre-registration" fails completely. A morphologically aware model decomposes it into pre- + register + -ation and infers meaning from known parts.
→ Subword tokenisation solves this
🌐
Agglutinative Languages
Finnish, Turkish, and Swahili can express entire sentences in a single word. "Epäjärjestelmällistyttämättömyydellänsäkäänköhän" is a valid Finnish word. English-centric NLP fails catastrophically on these.
→ Morphological analysers needed
🔬
Lemmatisation Quality
Without morphological knowledge, a lemmatiser cannot map "geese" → "goose", "were" → "be", or "ran" → "run". Irregular forms require the full morphological lexicon.
→ Affects search, IR, and classification
# Morphological analysis with spaCy
import spacy

nlp = spacy.load('en_core_web_sm')
words = ['running', 'unhappiness', 'preregistered', 'geese', 'was']
doc = nlp(' '.join(words))
print(f"{'Word':18} {'Lemma':15} {'Morph Features'}")
print('-' * 65)
for token in doc:
    morph = token.morph.to_dict()
    # Show only the first three morphological features
    features = ', '.join(f"{k}={v}" for k, v in list(morph.items())[:3])
    print(f"{token.text:18} {token.lemma_:15} {features}")

# Decompose words into morphemes by hand (spaCy does not segment morphemes)
def show_morphemes(word, parts):
    print(f"\n{word}: {' + '.join(parts)}")

show_morphemes("unhappiness", ["un-(negation)", "happi(root)", "-ness(state→noun)"])
show_morphemes("re-read-able", ["re-(again)", "read(root)", "-able(capable of)"])
show_morphemes("pre-cook-ed", ["pre-(before)", "cook(root)", "-ed(past tense)"])
OUTPUT
Word Lemma Morph Features
-----------------------------------------------------------------
running run Aspect=Prog, Tense=Pres, VerbForm=Part
unhappiness unhappiness Number=Sing
preregistered preregistered Tense=Past, VerbForm=Fin
geese goose Number=Plur
was be Mood=Ind, Number=Sing, Person=3
unhappiness: un-(negation) + happi(root) + -ness(state→noun)
re-read-able: re-(again) + read(root) + -able(capable of)
pre-cook-ed: pre-(before) + cook(root) + -ed(past tense)
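The out-of-vocabulary point above can be sketched with a greedy longest-match segmenter in the style of WordPiece. This is a minimal sketch under stated assumptions: the vocabulary is a hand-picked toy set rather than one learned from a corpus, and `segment` is a hypothetical helper name, not part of any library.

```python
# Toy subword vocabulary (hand-picked for illustration, not learned from data).
# Pieces starting with "##" may only appear inside a word, never at its start.
TOY_VOCAB = {
    "un", "pre", "re",
    "##happi", "##ness",
    "##register", "##registr", "##ation", "##ed",
}

def segment(word, vocab=TOY_VOCAB):
    """Greedy longest-match-first segmentation; returns ['[UNK]'] on failure."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return ["[UNK]"]
        pieces.append(match)
        start = end
    return pieces

for w in ["unhappiness", "preregistered", "preregistration", "zzz"]:
    print(f"{w:16} -> {segment(w)}")
```

Even though "preregistration" is not in the vocabulary as a whole word, it is covered by three known pieces. A word with no known pieces, such as "zzz", still falls back to the unknown token.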
Section 03
Syntax — How Sentences Are Structured
Syntax is the set of rules that govern how words combine into phrases and how phrases combine
into sentences. It answers the question: which word modifies which other word?
Syntactic structure is invisible to the eye — you have to parse it. And different parses
produce radically different meanings.
📖 Classic Linguistic Joke
The Ambiguous Headline
Real newspaper headlines are a gold mine of syntactic ambiguity. Consider these actual
headlines: "Police Help Dog Bite Victim", "Juvenile Court to Try Shooting
Defendant", and "Man Eating Piranha Mistakenly Sold as Pet Fish".
Each is grammatically valid. Each can be parsed two ways, producing one sensible interpretation
and one darkly absurd one. A human reader resolves these instantly using world knowledge and
context. For a machine parser that only sees syntax, these are genuine ambiguities — and
choosing the wrong parse corrupts every downstream task that relies on it.
The Two Main Syntactic Frameworks
📆 Phrase Structure (Constituency) Grammar
Divides sentences into nested phrases: NP (Noun Phrase), VP (Verb Phrase), PP (Prepositional Phrase), etc. Produces a tree showing which words belong to the same constituent group.
S → NP VP
NP → Det N | Det Adj N
VP → V NP | V NP PP
PP → P NP
🔗 Dependency Grammar
Maps every word directly to its syntactic head — the word it modifies or depends on. Produces a directed graph of binary relations between words. More flexible for free word-order languages. Used by spaCy and most modern NLP tools.
"The fast cat catches mice"
catches → nsubj → cat
cat → det → The
cat → amod → fast
catches → dobj → mice
🔗 Dependency Parse — "The fast cat catches mice"
The (DET) · fast (ADJ) · cat (NOUN, nsubj) · catches (VERB, ROOT) · mice (NOUN, dobj)
import spacy
nlp = spacy.load('en_core_web_sm')
sentence = "The fast cat catches mice near the garden wall."
doc = nlp(sentence)
print(f"{'Token':10} {'Head':10} {'Dep':10} {'POS'}")
print('-' * 45)
for token in doc:
    print(f"{token.text:10} {token.head.text:10} {token.dep_:10} {token.pos_}")
# Find the main verb (ROOT) and its subject & object
root = [t for t in doc if t.dep_ == 'ROOT'][0]
subj = [t for t in doc if t.dep_ == 'nsubj']
obj = [t for t in doc if t.dep_ == 'dobj']
print(f"\nRoot verb : {root.text}")
print(f"Subject : {subj[0].text if subj else 'none'}")
print(f"Object : {obj[0].text if obj else 'none'}")
OUTPUT
Token Head Dep POS
---------------------------------------------
The cat det DET
fast cat amod ADJ
cat catches nsubj NOUN
catches catches ROOT VERB
mice catches dobj NOUN
near catches prep ADP
the wall det DET
garden wall compound NOUN
wall near pobj NOUN
. catches punct PUNCT
Root verb : catches
Subject : cat
Object : mice
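The "Dog bites man" versus "Man bites dog" contrast can be made concrete without a parser. The sketch below assumes hand-written dependency parses stored as plain tuples, using the same nsubj and dobj labels as the output above; `svo` is a hypothetical helper name.

```python
# Each token is (text, index_of_head, dependency_label); the root points to itself.
# These parses are written by hand for illustration, not produced by a parser.
DOG_BITES_MAN = [("Dog", 1, "nsubj"), ("bites", 1, "ROOT"), ("man", 1, "dobj")]
MAN_BITES_DOG = [("Man", 1, "nsubj"), ("bites", 1, "ROOT"), ("dog", 1, "dobj")]

def svo(parse):
    """Return (subject, verb, object) for the root verb of a dependency parse."""
    root_idx = next(i for i, (_, _, dep) in enumerate(parse) if dep == "ROOT")

    def child(label):
        # First dependent of the root carrying the requested label, if any.
        for text, head, dep in parse:
            if head == root_idx and dep == label:
                return text
        return None

    return child("nsubj"), parse[root_idx][0], child("dobj")

def bag_of_words(parse):
    return sorted(text.lower() for text, _, _ in parse)

print(bag_of_words(DOG_BITES_MAN) == bag_of_words(MAN_BITES_DOG))  # same words
print(svo(DOG_BITES_MAN))
print(svo(MAN_BITES_DOG))
```

The two sentences are indistinguishable as bags of words, but the dependency labels recover who did what to whom.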
Section 04
Semantics — The Meaning Layer
Semantics is the study of meaning. It asks: what does this sentence mean, not just
what does it say? The gap between form and meaning is where NLP becomes genuinely
difficult — and where the most impressive advances in modern AI have occurred.
🤔 Lexical Semantics
Word-level meaning: synonymy (big/large), antonymy (hot/cold), hyponymy (poodle is-a dog),
polysemy (one word, many meanings: "bank"), homonymy (same spelling, different origin: "bat" the animal vs "bat" the sports equipment).
🔗 Compositional Semantics
How word meanings combine: "The man who frightened the cat that chased the mouse that ate the cheese…" — the meaning of the whole is built from its parts, recursively.
Principle of Compositionality: the meaning of a complex expression is determined by the meanings of its constituents and the way they are combined.
👥 Coreference
Multiple expressions that refer to the same entity: "Barack Obama was elected in 2008. He served two terms. The former president now lives in DC." Resolving these chains is essential for understanding narratives, contracts, and news.
🔄 Entailment & Inference
Does sentence A logically imply sentence B? "The cat is on the mat" entails "Something is on the mat." Detecting entailment is the basis of fact-checking, QA systems, and reasoning engines.
🎬 Figurative Language
Metaphor ("time is money"), metonymy ("the White House said"), hyperbole ("I've told you a million times"). These are semantically non-literal — their meaning cannot be derived from word definitions alone. Requires world knowledge.
📋 Semantic Roles
Identifies who does what to whom: Agent (the doer), Patient (the affected), Instrument (the means). "The chef [Agent] cut the fish [Patient] with a knife [Instrument]." Different syntactic frames, same semantic roles.
Word Sense Disambiguation — A Core Semantic Task
import nltk
from nltk.corpus import wordnet as wn

nltk.download('wordnet', quiet=True)

# Explore multiple senses of "bank" (every fourth synset, to keep the output short)
for i, synset in enumerate(wn.synsets('bank')[::4]):
    print(f"Sense {i+1}: {synset.name()}")
    print(f"  Definition : {synset.definition()}")
    print(f"  Example    : {synset.examples()[0] if synset.examples() else 'N/A'}")
    print()
# Semantic similarity between word pairs
dog = wn.synset('dog.n.01')
cat = wn.synset('cat.n.01')
rocket = wn.synset('rocket.n.01')
print(f"dog ↔ cat similarity : {dog.path_similarity(cat):.3f}")
print(f"dog ↔ rocket similarity : {dog.path_similarity(rocket):.3f}")
OUTPUT (abridged and illustrative; the senses returned and their wording depend on the WordNet version)
Sense 1: bank.n.01
Definition : sloping land (especially the slope beside a body of water)
Example : 'they were sitting on the bank of the river'
Sense 2: depository_financial_institution.n.01
Definition : a financial institution that accepts deposits
Example : 'he cashed a cheque at the bank'
Sense 3: bank.n.09
Definition : a flight maneuver; aircraft tips laterally about its roll axis
Example : 'the plane went into a steep bank'
dog ↔ cat similarity : 0.200
dog ↔ rocket similarity : 0.077
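Word sense disambiguation itself can be sketched with a simplified Lesk approach: pick the sense whose dictionary gloss shares the most words with the surrounding context. The glosses below are hand-written for illustration (they are not WordNet's wording), and the names `SENSES`, `content_words`, and `simplified_lesk` are assumptions of this sketch.

```python
import re

# Hand-written glosses for two senses of "bank" (illustrative only).
SENSES = {
    "bank/river": "sloping land beside a body of water such as a river or lake",
    "bank/finance": "a financial institution that accepts deposits and lends money",
}
STOPWORDS = {"a", "an", "the", "of", "to", "at", "on", "in", "and",
             "that", "as", "such", "or", "i", "my", "we"}

def content_words(text):
    """Lowercased word set with common function words removed."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}

def simplified_lesk(context, senses=SENSES):
    """Pick the sense whose gloss shares the most content words with the context."""
    ctx = content_words(context)
    scores = {name: len(ctx & content_words(gloss)) for name, gloss in senses.items()}
    return max(scores, key=scores.get), scores

print(simplified_lesk("We sat on the bank of the river and watched the water"))
print(simplified_lesk("I went to the bank to deposit money"))
print(simplified_lesk("I went to the bank."))  # no overlap with either gloss
```

For the bare sentence "I went to the bank." both senses score zero and the choice is arbitrary, which is exactly the situation where more context is needed.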
Section 05
Pragmatics — What Is Actually Being Said
Pragmatics is the top layer of the linguistic stack — and the hardest to automate. It
concerns how context, convention, and shared human knowledge shape the real
meaning of an utterance, which is often radically different from its literal meaning.
📖 Story
The Polite Request
A dinner party. Someone reaches across the table and says:
"Can you pass the salt?"
Literal interpretation: "Are you physically capable of passing the salt?" The correct
answer would be: "Yes, I can." And then do nothing.
Pragmatic interpretation: "I would like you to pass the salt, please." The speaker
is making an indirect speech act — using a question about ability to make a polite request,
because demanding "Pass the salt!" would be rude in this social context.
Every fluent speaker understands this instantly. No grammar rule encodes it.
It emerges from social convention, shared knowledge, and the cooperative norms of conversation,
which the philosopher Paul Grice called the Cooperative Principle.
🤑
SPEECH ACTS
Saying is Doing
Utterances perform actions: assertions, questions, commands, promises, apologies, threats. "I now pronounce you married" is not a description of marriage — it is the marriage. Machines struggle to identify the act behind the words.
🤔
IMPLICATURE
What Is Not Said
"Some students passed the exam" implicates (but does not say) that not all students passed. "The food was edible" implicates it was barely acceptable. These inferences arise from conversational norms, not logic.
🙉
DISCOURSE
Coherence Across Sentences
Language is not a bag of isolated sentences. Discourse structure — how sentences connect, contrast, and elaborate on each other — determines whether a paragraph makes sense. Coherence and cohesion operate at the pragmatic level.
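The "some implicates not all" pattern described under implicature is regular enough to sketch with scales of alternatives ordered from weakest to strongest (often called Horn scales). The scales below are hand-picked assumptions for illustration, and `scalar_implicatures` is a hypothetical helper name.

```python
# Scales ordered from weakest to strongest term (a small hand-picked set).
SCALES = [
    ["some", "most", "all"],
    ["edible", "good", "delicious"],
]

def scalar_implicatures(word):
    """Return the stronger alternatives that choosing `word` suggests do not hold."""
    for scale in SCALES:
        if word in scale:
            stronger = scale[scale.index(word) + 1:]
            return [f"not {alt}" for alt in stronger]
    return []

print(scalar_implicatures("some"))    # weaker term chosen over "most" and "all"
print(scalar_implicatures("edible"))
print(scalar_implicatures("all"))     # strongest term: nothing stronger to negate
```

These inferences are defaults rather than entailments: a speaker can cancel them ("some, in fact all, of the students passed"), which is why a lookup like this is only a starting point.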
Section 06
The Great NLP Challenges — Why Language Is So Hard to Automate
Even after decades of research and the most powerful neural networks ever built, certain
linguistic phenomena remain deeply challenging for machines. Understanding why they
are hard is the first step to engineering systems that handle them gracefully.
▶ NLP Challenge Difficulty (for current AI systems)
Ambiguity: 92%
Sarcasm: 88%
Long-range Context: 80%
Idioms & Slang: 72%
Data Bias: 65%
Domain Shift: 78%
Section 07
Challenge 1 — Ambiguity: The Shapeshifter
Ambiguity is the single most pervasive challenge in NLP. A sentence, phrase, or word is
ambiguous when it has more than one valid interpretation. Humans resolve ambiguity effortlessly
using context, world knowledge, and prior experience. Machines must be explicitly engineered
to do the same — and still frequently fail.
📖 Story
The Garden Path
"The horse raced past the barn fell."
Most readers stumble on this sentence. It is grammatically well formed, but it
"garden-paths" you down a wrong parse. Your brain initially reads "The horse raced past the barn"
as a complete clause, with "raced" as the main verb. Then you hit "fell" and crash: the clause
already has its verb, so "fell" has nowhere to attach. You backtrack and re-parse: "The horse
[that was] raced past the barn" is the subject, "raced" belongs to a reduced relative clause,
and "fell" is the main verb.
This is a garden path sentence, and it causes even human readers to
momentarily fail. For machines, this class of sentence is notoriously difficult: a parser that
commits to the most likely analysis word by word makes the same mistake and must be able to revise it.
🔭
Lexical Ambiguity
A single word has multiple meanings. "I went to the bank": financial institution or riverside? "She saw the bat": animal or cricket bat? "left": the opposite of right, or the past tense of "leave"?
Frequency: Extremely Common
🎉
Structural Ambiguity
A sentence has multiple valid syntactic parses. "I saw the man with the telescope" — did I use the telescope, or did the man have it? The attachment of the PP changes who did what.
Classic PP-attachment problem
👥
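Referential Ambiguity
"The trophy wouldn't fit in the suitcase because it was too big." What is "it"? The trophy or the suitcase? Humans resolve this from world knowledge: an object fails to fit in a container when the object is too big, so "it" is the trophy. Swap "big" for "small" and "it" becomes the suitcase. Machines must learn the same kind of reasoning.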
Winograd Schema Challenge
🎤
Scope Ambiguity
"Every student read a book" — did they all read the same book, or each read a different book? The scope of quantifiers (every, a, some, no) creates logically distinct meanings from identical surface forms.
Formal semantics territory
🎤
Phonological Ambiguity
In speech: "I scream / ice cream", "new display / nudist play". The spoken signal is continuous — word boundaries are inferred, not heard. Speech recognition systems must segment phoneme streams into words.
ASR word boundary detection
📋
Pragmatic Ambiguity
"Can you reach the top shelf?" — sincere question about physical ability, or polite request to get something? The literal and intended meaning diverge completely depending on context and speaker intent.
Indirect speech acts
import spacy

nlp = spacy.load('en_core_web_sm')

# PP-attachment ambiguity: "I saw the man with the telescope"
# Two valid parses, so the parser must pick one
s1 = "I saw the man with the telescope."     # Ambiguous
s2 = "I saw the man who had the telescope."  # Unambiguous: man had it
s3 = "Using the telescope, I saw the man."   # Unambiguous: I used it

for s in [s1, s2, s3]:
    doc = nlp(s)
    for token in doc:
        if token.text == 'telescope':
            # Object of a preposition: report what the whole PP attaches to.
            # Otherwise (e.g. a direct object): report the word's own head.
            if token.dep_ == 'pobj':
                attach = token.head.head
            else:
                attach = token.head
            print(f"Sentence : {s}")
            print(f"'telescope' attaches to: {attach.text} ({attach.pos_})")
            print()
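OUTPUT (text after ← is hand-written annotation; the attachment chosen can vary between spaCy model versions)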
Sentence : I saw the man with the telescope.
'telescope' attaches to: saw (VERB) ← spaCy picks: I used the telescope
Sentence : I saw the man who had the telescope.
'telescope' attaches to: had (VERB) ← correctly: man had it
Sentence : Using the telescope, I saw the man.
'telescope' attaches to: Using (VERB) ← correctly: I used it
⚠️
The Parser Made a Choice — Was It Right?
Notice that for the ambiguous sentence, spaCy chose one parse (I used the telescope)
based on statistical patterns in its training data. It is not "wrong" — it made the
most probable guess. But a downstream information extraction system might need the
other interpretation. Ambiguity is not a bug in language — it is a feature
that machines must be explicitly designed to handle.
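To see that the telescope sentence really has two parses, a chart parser can count them. The sketch below is a minimal CKY-style counter over a toy grammar; the grammar is an assumption of this example (it adds NP → NP PP and VP → VP PP rules so that a PP can attach to either the noun or the verb), and `count_parses` is a hypothetical helper name.

```python
from collections import defaultdict

# Toy binary grammar: (left child, right child) -> possible parent labels.
BINARY = {
    ("NP", "VP"): ["S"],
    ("V", "NP"): ["VP"],
    ("VP", "PP"): ["VP"],   # PP attaches to the verb phrase
    ("Det", "N"): ["NP"],
    ("NP", "PP"): ["NP"],   # PP attaches to the noun phrase
    ("P", "NP"): ["PP"],
}
LEXICON = {"I": ["NP"], "saw": ["V"], "the": ["Det"],
           "man": ["N"], "telescope": ["N"], "with": ["P"]}

def count_parses(tokens, start="S"):
    """Count distinct parse trees for `tokens` using a CKY chart."""
    n = len(tokens)
    chart = defaultdict(int)  # (i, j, label) -> number of trees over tokens[i:j]
    for i, tok in enumerate(tokens):
        for label in LEXICON.get(tok, []):
            chart[(i, i + 1, label)] += 1
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):  # split point between the two children
                for (left, right), parents in BINARY.items():
                    l = chart.get((i, k, left), 0)
                    r = chart.get((k, j, right), 0)
                    if l and r:
                        for parent in parents:
                            chart[(i, j, parent)] += l * r
    return chart.get((0, n, start), 0)

print(count_parses("I saw the man with the telescope".split()))
print(count_parses("I saw the man".split()))
```

A system that needs the less likely reading has to keep both analyses available, for example by ranking parses instead of returning only the top one.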
Section 08
Challenge 2 — Context: The Invisible Grammar
Language is not a sequence of independent sentences — it is a stream of meaning that only
makes sense in relation to what came before, what comes after, who is speaking, where they
are, and what shared knowledge the speaker and listener possess. Context is the invisible
grammar that binds it all together.
🌍 Five Dimensions of Context in NLP
Linguistic
Surrounding words and sentences in the same document. "He loved it." — "he" and "it" refer to entities established earlier in the discourse. Without the prior text, this sentence is meaningless.
Situational
The physical or digital situation in which communication occurs. The word "here" means different things in a text message sent from Paris versus one sent from Tokyo. Time, location, and medium all shape meaning.
World Knowledge
Commonsense and encyclopaedic knowledge that speakers assume is shared. "She put the cake in the fridge before the party" — we know she wanted to keep it cold, not heat it. No sentence says this.
Social
The relationship between speaker and listener. "Get out." from a doctor means leave the room; from an astonished friend it means "No way, that's incredible!" The relationship defines the speech act.
Cultural
References to shared cultural knowledge: idioms, allusions, humor. "That was a total Shakespeare" means nothing outside cultures familiar with his work. Sarcasm conventions differ radically by culture.
from transformers import pipeline

# Masked-token prediction on a Winograd-style sentence pair.
# Note: this predicts which word fills the gap; it is not a coreference resolver.
fill = pipeline("fill-mask", model="bert-base-uncased")

prompt1 = "The trophy wouldn't fit in the bag because [MASK] was too big."
prompt2 = "The trophy wouldn't fit in the bag because [MASK] was too small."

print("Sentence 1 — 'too big':")
for r in fill(prompt1)[:2]:  # top two predictions
    print(f"  [{r['score']:.3f}] {r['token_str']}")
print("\nSentence 2 — 'too small':")
for r in fill(prompt2)[:2]:
    print(f"  [{r['score']:.3f}] {r['token_str']}")
OUTPUT
Sentence 1 — 'too big':
[0.341] it
[0.091] she
Sentence 2 — 'too small':
[0.287] it
[0.082] she
🧠
What the Prediction Does and Does Not Show
BERT ranks "it" first in both sentences, even though the syntax is identical. That shows
the model expects a pronoun here, but it does not show which entity the pronoun refers to;
any label attached to "it" in the output is a human reading, not a model prediction.
A human resolves "it" to the trophy in sentence 1 and to the bag in sentence 2 by reasoning
that an object fails to fit when the object is too big or the container is too small.
Testing whether a model makes the same inference requires asking it to choose between the
candidate nouns, as Winograd schema benchmarks do. Transformers do well on many such
examples, though they still fail on unusual Winograd schemas requiring rare world knowledge.
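Why syntax alone cannot settle this pair can be shown with a deliberately naive baseline. The sketch below assumes hand-labelled candidates and answers, and uses a hypothetical `nearest_candidate` heuristic that resolves the pronoun to the most recently mentioned candidate.

```python
# Two Winograd-style sentences with hand-labelled candidates and gold answers.
EXAMPLES = [
    ("The trophy wouldn't fit in the bag because it was too big.",
     ["trophy", "bag"], "trophy"),
    ("The trophy wouldn't fit in the bag because it was too small.",
     ["trophy", "bag"], "bag"),
]

def nearest_candidate(sentence, candidates, pronoun="it"):
    """Resolve the pronoun to the candidate mentioned most recently before it."""
    tokens = sentence.lower().replace(".", "").split()
    pronoun_pos = tokens.index(pronoun)
    preceding = [(tokens.index(c), c) for c in candidates
                 if c in tokens[:pronoun_pos]]
    return max(preceding)[1] if preceding else None

predictions = [nearest_candidate(s, cands) for s, cands, _ in EXAMPLES]
correct = sum(p == gold for p, (_, _, gold) in zip(predictions, EXAMPLES))
print(predictions)
print(f"{correct} of {len(EXAMPLES)} correct")
```

Because the two sentences differ only in the final adjective, any heuristic that ignores that word gives the same answer to both and can be right on at most one of them.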
Section 09
Challenge 3 — Sarcasm & Irony: The Nemesis of Sentiment Analysis
Sarcasm is a form of verbal irony where the intended meaning is the opposite of
the literal meaning. It is one of the most challenging phenomena in NLP because it requires
understanding tone, context, speaker intent, and sometimes cultural background — all
simultaneously. A sentiment system that misses sarcasm can turn a scathing review into a glowing
endorsement.
📖 Real Impact Story
The Review That Was Not What It Seemed
In 2015, a well-known hotel received a flood of five-star reviews after a terrible
service failure went viral. The reviews read like praise: "Absolutely brilliant —
got food poisoning and my complaints were completely ignored! 10/10 would definitely
recommend to my enemies."
Simple sentiment models trained on keyword matching saw "brilliant", "10/10",
and "recommend" and classified these as positive. The hotel's automated reputation
dashboard showed a spike in positive feedback. The reality was the opposite.
This is the sarcasm problem in production. It is not hypothetical — it costs businesses
real money when automated systems misread ironic language at scale.
😁 Naive Sentiment Model
"Oh wow, the delivery took only three weeks. Truly outstanding service."
⬆ POSITIVE — 0.84
Sees: wow, outstanding → positive keywords
Misses: tone, exaggeration, context
Result: catastrophically wrong
→
🧠 Sarcasm-Aware Model
"Oh wow, the delivery took only three weeks. Truly outstanding service."
↓ SARCASTIC — NEGATIVE
Sees: "only" + long duration → contradiction signal
Linguistic Signals of Sarcasm — What Machines Must Learn
Signal Type | Example | Machine-Detectable? | Note
Intensifier + Complaint | "Absolutely fantastic. Waited 2 hours." | Partially | Positive adjective followed by negative context is a learnable pattern
Minimiser + Large Magnitude | "Only took three weeks." | Partially | Downplaying language with factually large values is detectable
Exaggeration | "Best service I've had in 1,000 years." | Partially | Implausible magnitude is a statistical signal
Tonal Incongruence | Deadpan delivery of absurd praise | Very Hard | Requires prosodic features in speech; nearly impossible in plain text
Cultural Reference | "Oh sure, very Shakespearean of you." | Very Hard | Requires cultural knowledge beyond language statistics
Historical Context | The same speaker is usually negative about this topic | Very Hard | Requires user-level memory across conversations or posts
from transformers import pipeline
# Standard sentiment model — no sarcasm awareness
sentiment = pipeline("sentiment-analysis",
model="distilbert-base-uncased-finetuned-sst-2-english")
examples = [
("The pizza was absolutely delicious. I loved every bite.", "POSITIVE"),
("The pizza was terrible. I hated it.", "NEGATIVE"),
("Oh wow, only THREE HOURS for a pizza. Truly outstanding.", "NEGATIVE"), # sarcastic
("Best restaurant I've visited in my entire life! (It was empty.)", "NEGATIVE"), # sarcastic
("Amazing how they managed to burn a salad. Impressive skills.", "NEGATIVE"), # sarcastic
]
print(f"{'Model':8} {'True':8} {'Text'}")
print('-' * 65)
for text, true_label in examples:
    result = sentiment(text)[0]
    pred = result['label']
    match = '✓' if pred == true_label else '✗ SARCASM MISSED'
    print(f"{pred:8} {true_label:8} {match} {text[:50]}...")
OUTPUT
Model True Text
-----------------------------------------------------------------
POSITIVE POSITIVE ✓ The pizza was absolutely delicious...
NEGATIVE NEGATIVE ✓ The pizza was terrible. I hated it...
POSITIVE NEGATIVE ✗ SARCASM MISSED Oh wow, only THREE HOURS for a pizza...
POSITIVE NEGATIVE ✗ SARCASM MISSED Best restaurant I've visited in my entire...
POSITIVE NEGATIVE ✗ SARCASM MISSED Amazing how they managed to burn a salad...
😴
Three for Three — Every Sarcastic Review Misclassified
The off-the-shelf sentiment model got every single sarcastic sentence wrong.
It saw positive-sounding words and classified accordingly, completely missing the ironic
intent. This is less a failure of the architecture than of the training data distribution:
the model was fine-tuned on SST-2 movie-review sentences, which are mostly direct statements
of opinion. To detect sarcasm, you need models trained specifically on sarcasm-labelled data
(for example the Self-Annotated Reddit Corpus, SARC, or Twitter sarcasm corpora), or you need
multi-modal signals (vocal tone, emoji, context).
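Some of the "partially detectable" cues from the signal table can be sketched as surface rules. The example below is a rough heuristic under stated assumptions: the word lists, the one-hour threshold, and the `sarcasm_signals` function are all invented for illustration, and a real system would learn such cues from labelled data.

```python
import re

NUMBER_WORDS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
                "six": 6, "ten": 10}
HOURS_PER_UNIT = {"minute": 1 / 60, "hour": 1, "day": 24, "week": 168}
POSITIVE_WORDS = {"outstanding", "fantastic", "brilliant", "amazing",
                  "great", "best"}

# "only", up to two intervening words, then an amount and a time unit.
MINIMISER = re.compile(
    r"\bonly\b(?:\s+\w+){0,2}?\s+(\d+|[a-z]+)\s+(minute|hour|day|week)s?\b"
)

def sarcasm_signals(text, long_wait_hours=1.0):
    """Return the names of the surface sarcasm cues found in `text`."""
    lowered = text.lower()
    signals = []
    match = MINIMISER.search(lowered)
    if match:
        amount, unit = match.groups()
        value = int(amount) if amount.isdigit() else NUMBER_WORDS.get(amount)
        # "only" plus a long duration is a contradiction worth flagging.
        if value is not None and value * HOURS_PER_UNIT[unit] >= long_wait_hours:
            signals.append("minimiser_with_large_magnitude")
    words = set(re.findall(r"[a-z]+", lowered))
    if signals and words & POSITIVE_WORDS:
        signals.append("praise_next_to_complaint")
    return signals

print(sarcasm_signals("Oh wow, the delivery took only three weeks. Truly outstanding service."))
print(sarcasm_signals("It only took two minutes. Great service."))
```

Rules like these are brittle (what counts as a long wait depends on the product), so they are better used as features or as a check on a learned model than as a classifier on their own.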
Section 10
Challenge 4 — Idioms, Slang & Figurative Language
An idiom is a fixed phrase whose meaning cannot be derived by combining the meanings of
its component words. "Kick the bucket" has nothing to do with kicking or buckets —
it means to die. "Break a leg" is not a violent threat — it is a theatre good-luck
wish. Idioms are opaque by design: their meaning is arbitrary and must be learned as a unit.
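Treating an idiom as a single unit can be sketched as a longest-match lookup that runs before any word-by-word interpretation. The lexicon, the tiny lemma map, and the `mark_idioms` function below are assumptions made for this example; the glosses follow the idioms discussed in this section.

```python
# Tiny idiom lexicon keyed by lemma tuples (illustrative, far from complete).
IDIOMS = {
    ("kick", "the", "bucket"): "die",
    ("spill", "the", "beans"): "reveal a secret",
    ("hit", "the", "sack"): "go to bed",
    ("bite", "the", "bullet"): "endure difficulty",
}
# Minimal hand-written lemma map so inflected verbs still match.
LEMMAS = {"kicked": "kick", "spilled": "spill", "hits": "hit", "bit": "bite"}
MAX_LEN = max(len(key) for key in IDIOMS)

def mark_idioms(sentence):
    """Group known idioms into single units, trying the longest span first."""
    tokens = sentence.lower().strip(".!?").split()
    lemmas = [LEMMAS.get(t, t) for t in tokens]
    units, i = [], 0
    while i < len(tokens):
        for n in range(min(MAX_LEN, len(tokens) - i), 1, -1):
            gloss = IDIOMS.get(tuple(lemmas[i:i + n]))
            if gloss:
                units.append((" ".join(tokens[i:i + n]), gloss))
                i += n
                break
        else:
            # No idiom starts here: keep the token as an ordinary word.
            units.append((tokens[i], None))
            i += 1
    return units

print(mark_idioms("She finally spilled the beans."))
print(mark_idioms("He kicked the ball."))
```

A lookup like this cannot tell when a phrase is meant literally (someone may really kick a bucket), so in practice it flags candidates that context still has to confirm.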
Idioms
Hard
"Spill the beans" (reveal a secret), "hit the sack" (go to bed), "bite the bullet" (endure difficulty). Literal interpretation produces nonsense. Machine must learn the whole phrase as a semantic unit.
Slang & Neologisms
Hard
"That's lowkey fire" (it's impressively good), "I'm dead" (I'm laughing hysterically). Slang evolves faster than training data. A model trained in 2020 may not know 2024 Gen-Z slang.
Collocations
Moderate
Words that habitually occur together: "make a decision" (not "do a decision"), "heavy rain" (not "strong rain"). Wrong collocations mark non-native language and confuse downstream models.
Code-Switching
Very Hard
Switching languages within a sentence: "Voy al store para comprar some milk." Common in bilingual communities. Standard monolingual models fail entirely on mixed-language input.
Metaphor
Hard
"Time is money", "she has a heart of stone", "the economy is bleeding". Metaphors are structurally similar to literal language — machines must use conceptual mapping to identify them.
Euphemism
Moderate
"He passed away" (died), "downsizing" (firing employees), "collateral damage" (civilian deaths). Polite language that obscures the direct reality. Dangerous when AI must extract facts accurately.
Section 11
Challenge 5 — Bias, Domain Shift & Low-Resource Languages
Beyond the linguistic challenges of understanding individual sentences, NLP systems face
systemic challenges that arise from how they are built and where they are deployed.
These are not linguistic puzzles — they are engineering and ethical failures.
⚖️
DATA BIAS
The Model Is Its Training Data
If training data is predominantly English, young, Western, and male — the model will
perform worse on other languages, older speakers, non-Western contexts, and female-
associated language. Word embedding studies showed that classical models associated
"doctor" with males and "nurse" with females — purely from text patterns.
🌐
DOMAIN SHIFT
Deployment Context Differs from Training Context
A model trained on Wikipedia performs poorly on medical notes. A model trained on
news performs poorly on Twitter. "Running" usually means exercise on Twitter; in software
documentation it usually means "executing". Domain shift is one of the most common causes of
silent performance degradation in production NLP systems.
🌎
LOW-RESOURCE
7,000 Languages, 20 Get the Attention
Of approximately 7,000 living languages, fewer than 20 have sufficient NLP resources
for modern deep learning. Swahili (100M+ speakers), Yoruba (50M+ speakers), and
thousands of indigenous languages are left out. Transfer learning and cross-lingual
models are the research frontier.
# Simulating domain shift: same model, different domains.
# The F1 scores below are hypothetical values chosen for illustration,
# not measurements of a real NER model.
domains = {
    "News (training domain)": 0.91,
    "Wikipedia": 0.87,
    "Scientific papers": 0.73,
    "Clinical notes (medical)": 0.61,
    "Twitter / Social media": 0.54,
    "Legal contracts": 0.48,
    "Customer chat logs": 0.44,
}
print("NER Model Performance Across Domains")
print('-' * 50)
for domain, f1 in domains.items():
    filled = int(f1 * 30)  # bar length out of 30 characters
    bar = '█' * filled + '░' * (30 - filled)
    flag = '✓' if f1 >= 0.80 else ('⚠' if f1 >= 0.60 else '✗')
    print(f"{flag} {domain:32}: F1={f1:.2f} |{bar}|")
OUTPUT
NER Model Performance Across Domains
--------------------------------------------------
✓ News (training domain) : F1=0.91 |███████████████████████████░░░|
✓ Wikipedia : F1=0.87 |██████████████████████████░░░░|
⚠ Scientific papers : F1=0.73 |█████████████████████░░░░░░░░░|
⚠ Clinical notes (medical) : F1=0.61 |██████████████████░░░░░░░░░░░░|
✗ Twitter / Social media : F1=0.54 |████████████████░░░░░░░░░░░░░░|
✗ Legal contracts : F1=0.48 |██████████████░░░░░░░░░░░░░░░░|
✗ Customer chat logs : F1=0.44 |█████████████░░░░░░░░░░░░░░░░░|
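A cheap early warning for domain shift is the share of deployment-domain tokens the model never saw during training. The sketch below uses made-up toy sentences and whitespace tokenisation; `vocabulary` and `oov_rate` are hypothetical helper names, and a high rate is a hint to investigate, not a measurement of accuracy.

```python
def vocabulary(sentences):
    """Set of lowercased whitespace tokens across all sentences."""
    return {w for s in sentences for w in s.lower().split()}

def oov_rate(train_sentences, target_sentences):
    """Fraction of target tokens that never occur in the training sentences."""
    known = vocabulary(train_sentences)
    tokens = [w for s in target_sentences for w in s.lower().split()]
    if not tokens:
        return 0.0
    return sum(w not in known for w in tokens) / len(tokens)

# Toy data, invented for illustration.
news = ["the minister announced a new budget",
        "the company reported strong profits"]
more_news = ["the minister reported a new budget"]
chat = ["omg the app keeps crashing lol", "pls fix asap"]

print(f"news -> news : {oov_rate(news, more_news):.2f}")
print(f"news -> chat : {oov_rate(news, chat):.2f}")
```

Tracking this number per data source over time makes silent degradation visible before labelled evaluation data from the new domain is available.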
Section 12
Engineering Solutions — How to Handle These Challenges
Understanding challenges is not enough. Here are the practical engineering approaches
that NLP practitioners use to mitigate each class of difficulty in production systems.
Challenge | Naive Approach | Better Solution | State of the Art
Ambiguity | Rule-based disambiguation | Statistical parsing, CRFs | Contextual embeddings (BERT), which attend to the full sentence
Sarcasm | Keyword sentiment scoring | Sarcasm-specific training data | Multi-modal models (text + tone + emoji + history)
Context | Bag of Words (ignores order) | RNNs / LSTMs (limited window) | Transformers with long context windows (128K+ tokens)
1
Every NLP task implicitly requires linguistic knowledge. Text
classification uses morphology (word forms) and syntax (structure). Sentiment analysis
requires semantics (meaning) and pragmatics (intent). Pretending your model is "just
statistics" does not make the linguistics go away — it just makes your failures harder to debug.
2
Never evaluate only on the domain you trained on. Domain shift is
the silent killer of deployed NLP systems. Always test on data from the actual deployment
environment. A model that scores 91% on news and 44% on chat logs is a production disaster
waiting to happen.
3
Test explicitly for sarcasm and irony in any sentiment or opinion-mining
system. Collect ironic examples from Reddit (r/sarcasm, r/mildlyinfuriating)
and include them in your test set. If your model cannot handle them, acknowledge this
limitation explicitly in documentation and dashboards.
4
Ambiguity is a property of the data, not a bug in your model.
When a sentence is genuinely ambiguous, the correct response may be to flag it for human
review rather than confidently predict. Build uncertainty estimation into your NLP pipeline —
especially for high-stakes applications like medical or legal text.
5
Audit your model for linguistic bias before deployment. Use evaluation
sets that include diverse dialects, registers, and demographics. If your training data is
mostly formal written English, your model will underperform on spoken-style text,
dialects, and code-switched content — and it will do so silently, without error messages.
6
Context window size is not the same as contextual understanding.
A model with a 128,000-token context window can see more text — but seeing is
not understanding. Long-range dependencies, narrative coherence, and pragmatic intent
still require the model to have learned the right representations, not just the right
window size.
7
When in doubt, look at the data. Most NLP failures — sarcasm missed,
ambiguity misresolved, domain shift undetected — become obvious the moment you manually
read the examples where your model fails. Error analysis on 100 failure cases teaches you
more than a week of hyperparameter tuning.
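The advice in tip 4 to flag uncertain predictions for human review can be sketched with an entropy threshold on the classifier's output distribution. The threshold of 0.8 bits and the `route` function are assumptions of this sketch; a real threshold should be tuned on held-out data, and raw model probabilities may need calibration before they are trusted.

```python
import math

def entropy(probs):
    """Shannon entropy in bits of a probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def route(probs_by_label, max_entropy_bits=0.8):
    """Return the top label, or 'HUMAN_REVIEW' when the distribution is too flat."""
    if entropy(probs_by_label.values()) > max_entropy_bits:
        return "HUMAN_REVIEW"
    return max(probs_by_label, key=probs_by_label.get)

print(route({"POSITIVE": 0.97, "NEGATIVE": 0.03}))  # confident prediction
print(route({"POSITIVE": 0.55, "NEGATIVE": 0.45}))  # near coin-flip
```

The examples routed to review are also a ready-made starting set for the manual error analysis recommended in tip 7.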