Why Machines Cannot Read Words — The Core Problem
That alien's dilemma is exactly what every machine learning model faces. Models are mathematical machines — they consume numbers, multiply them, add them, transform them. They cannot consume the word "dog" or the sentence "the cat sat on the mat".
Text representation is the art of translating human language into numbers a machine can learn from. And like any translation, some methods lose meaning, some preserve it, and some do far better than others.
In NLP (Natural Language Processing), before you build any classifier, sentiment analyser, or search engine, you must first answer one fundamental question: how do I turn this text into numbers? The three classical answers are Bag of Words (BoW), TF-IDF, and N-grams. Each represents a different philosophy, a different trade-off, and a different view of what "meaning" is in text.
By the end of this tutorial you will understand how BoW, TF-IDF, and N-grams work conceptually and mathematically, when to use each, their strengths and limitations, and how to implement all three in Python using scikit-learn — with worked examples on real text.
Bag of Words (BoW) — The Simplest Possible Idea
That jar is your Bag of Words. It doesn't care about order. It doesn't care about grammar. It only cares about which words appear and how often.
How BoW Works — Step by Step
Concrete Example
Consider these three short documents:
| Doc # | Text |
|---|---|
| D1 | "I love machine learning" |
| D2 | "machine learning is great" |
| D3 | "I love great food" |
| Index | Word |
|---|---|
| 0 | food |
| 1 | great |
| 2 | is |
| 3 | learning |
| 4 | love |
| 5 | machine |
| Document | food | great | is | learning | love | machine |
|---|---|---|---|---|---|---|
| D1 | 0 | 0 | 0 | 1 | 1 | 1 |
| D2 | 0 | 1 | 1 | 1 | 0 | 1 |
| D3 | 1 | 1 | 0 | 0 | 1 | 0 |
Each document is now a 6-dimensional vector. D1 = [0, 0, 0, 1, 1, 1]. This is what gets fed to your logistic regression, SVM, or naive Bayes classifier.
Each document becomes a count vector of length = vocabulary size. The matrix is often very sparse (lots of zeros).
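The counting procedure behind the table above can be sketched in plain Python. This is a minimal illustration rather than scikit-learn's actual implementation: the `tokenize` helper and its regular expression are assumptions, chosen to mimic CountVectorizer's default behaviour of lowercasing text and dropping one-character tokens (which is why "I" never enters the vocabulary).

```python
import re
from collections import Counter

corpus = [
    "I love machine learning",
    "machine learning is great",
    "I love great food",
]

def tokenize(text):
    # Lowercase, then keep only tokens of two or more word characters
    # (an assumption that mirrors CountVectorizer's default token pattern).
    return re.findall(r"\b\w\w+\b", text.lower())

# Step 1: tokenize every document
tokenized = [tokenize(doc) for doc in corpus]

# Step 2: build a sorted vocabulary of unique tokens
vocab = sorted({tok for doc in tokenized for tok in doc})

# Step 3: count how often each vocabulary word occurs in each document
vectors = []
for doc in tokenized:
    counts = Counter(doc)
    vectors.append([counts[word] for word in vocab])

print(vocab)
for name, vec in zip(["D1", "D2", "D3"], vectors):
    print(name, vec)
```

The printed vectors should match the rows of the document-term table, one column per vocabulary word.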
BoW in Python — Full Working Example
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
# Our corpus
corpus = [
    "I love machine learning",
    "machine learning is great",
    "I love great food",
]
# Build the BoW model
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
# Inspect the vocabulary
vocab = vectorizer.get_feature_names_out()
print("Vocabulary:", vocab)
# View as a readable DataFrame
df = pd.DataFrame(
    X.toarray(),
    columns=vocab,
    index=["D1", "D2", "D3"]
)
print(df)
BoW — Strengths and Weaknesses
A real corpus of 10,000 documents might have a vocabulary of 50,000 words. If each document uses about 200 distinct words, 49,800 of the 50,000 cells in its row are zero, so the matrix is 99.6% zeros. Stored densely, this would waste memory and can slow learning, which is why scikit-learn returns a SciPy sparse matrix that stores only the non-zero entries.
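Sparsity is easy to measure directly. The sketch below reuses the three-document corpus from earlier and reads the number of stored non-zero entries from the sparse matrix; the final two lines simply redo the 50,000-word arithmetic from the paragraph above, so those numbers are hypothetical rather than measured.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "I love machine learning",
    "machine learning is great",
    "I love great food",
]

X = CountVectorizer().fit_transform(corpus)  # SciPy sparse matrix

# Fraction of cells that are zero: 1 - (stored non-zeros / total cells)
n_cells = X.shape[0] * X.shape[1]
sparsity = 1 - X.nnz / n_cells
print(type(X).__name__, X.shape)
print(f"Non-zero cells: {X.nnz} of {n_cells} (sparsity {sparsity:.1%})")

# Back-of-the-envelope figure from the text: 200 distinct words per
# document against a 50,000-word vocabulary.
row_sparsity = (50_000 - 200) / 50_000
print(f"Hypothetical row sparsity: {row_sparsity:.1%}")
```

On a toy corpus the sparsity is modest; it grows quickly as the vocabulary outpaces the length of individual documents.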
TF-IDF — Rewarding Rare, Punishing Common
But if one novel uses the word "arsenic" 15 times, that word is screaming at you. It's rare across all books, but frequent in this one — which means it's crucial to understanding this specific novel.
TF-IDF is that detective instinct made mathematical. It gives high scores to words that are frequent in a document but rare across the corpus.
The Two Components of TF-IDF
TF-IDF(t, d) = TF(t, d) × IDF(t)
A word scores high only when it is frequent in this document AND rare across all documents.
Words like "the" have low IDF (they appear everywhere), so their TF-IDF score collapses to near zero —
even if they appear 50 times in one document.
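To make the two components concrete, here is a minimal sketch using one common textbook formulation: TF as the share of a document's tokens taken up by the term, and IDF as ln(N / df), where df is the number of documents containing the term. Both definitions are assumptions for illustration (libraries such as scikit-learn use smoothed variants), and the tiny corpus is hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy corpus, already tokenized
docs = [
    ["the", "victim", "drank", "the", "arsenic"],
    ["the", "detective", "found", "the", "letter"],
    ["the", "butler", "read", "the", "letter"],
]

def tf(term, doc):
    # Term frequency: share of the document's tokens that are `term`
    return Counter(doc)[term] / len(doc)

def idf(term, docs):
    # Inverse document frequency: ln(N / number of docs containing term).
    # Assumes the term occurs in at least one document.
    df = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / df)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

for term in ["the", "arsenic"]:
    print(term, round(tf_idf(term, docs[0], docs), 4))
```

Because "the" occurs in every document, its IDF under this definition is ln(1) = 0, so its score is zero no matter how often it appears in a single document.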
TF-IDF Worked Calculation
| Scenario | Word | TF (this doc) | IDF (corpus) | TF-IDF Score | Meaning |
|---|---|---|---|---|---|
| Medical article | arsenic | 0.08 | 4.2 (very rare) | 0.336 | Key topic — reward it |
| Medical article | the | 0.12 | 0.01 (everywhere) | 0.001 | Stopword — penalise it |
| Sports article | touchdown | 0.06 | 3.8 (rare outside sports) | 0.228 | Domain-specific — reward it |
| Sports article | game | 0.10 | 0.5 (somewhat common) | 0.05 | Moderately useful |
TF-IDF Visual — How Scores Change Across Documents
Common words ("the") score near zero across all documents. Rare content words score high only in the document where they frequently appear.
TF-IDF in Python
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
import numpy as np
corpus = [
    "I love machine learning",
    "machine learning is great",
    "I love great food",
]
# Build TF-IDF model
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(corpus)
vocab = tfidf.get_feature_names_out()
df = pd.DataFrame(
    np.round(X.toarray(), 4),
    columns=vocab,
    index=["D1", "D2", "D3"]
)
print(df)
# Inspect IDF scores (higher = rarer word)
for word, idf in zip(vocab, tfidf.idf_):
    print(f" {word:12s}: IDF = {idf:.4f}")
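The IDF values printed by the loop above can be reproduced by hand. The sketch below assumes scikit-learn's documented default of smoothed IDF, ln((1 + n) / (1 + df)) + 1, where n is the number of documents and df the number of documents containing the word; the variable names are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = [
    "I love machine learning",
    "machine learning is great",
    "I love great food",
]

# Document frequency: in how many documents does each word occur?
# (A dense array is fine for a toy corpus.)
counts = CountVectorizer().fit_transform(corpus).toarray()
df = (counts > 0).sum(axis=0)
n_docs = counts.shape[0]

# Smoothed IDF, as used by TfidfVectorizer when smooth_idf=True (the default)
manual_idf = np.log((1 + n_docs) / (1 + df)) + 1

tfidf = TfidfVectorizer().fit(corpus)
print(np.round(manual_idf, 4))
print(np.allclose(manual_idf, tfidf.idf_))
```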
"food" and "is" each appear in only 1 document, so they have the highest IDF (1.6931). "great", "love", "machine", and "learning" each appear in 2 documents — same IDF (1.2877). When you multiply TF × IDF, rarer words that are important to a specific document get rewarded while common words across all docs get suppressed automatically.
BoW vs TF-IDF — Side by Side
| Property | Bag of Words | TF-IDF |
|---|---|---|
| What it counts | Raw word frequency | Frequency × Rarity score |
| Common words ("the", "is") | Treated equally | Automatically down-weighted |
| Domain-specific rare words | No special treatment | Up-weighted automatically |
| Output values | Integer counts | Float scores (0–1 normalised) |
| Needs stopword list? | Yes, manually | Less critical (IDF suppresses them) |
| Best use case | Naive Bayes, quick baseline | Classification, similarity, search |
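The "0–1 normalised" entry in the table comes from scikit-learn's default behaviour rather than from TF-IDF itself: TfidfVectorizer scales each document vector to unit Euclidean length (norm="l2"). A quick check on the same three-document corpus, as a sketch:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "I love machine learning",
    "machine learning is great",
    "I love great food",
]

X = TfidfVectorizer().fit_transform(corpus).toarray()

# Each row is scaled to unit Euclidean length (norm="l2" is the default)
row_norms = np.linalg.norm(X, axis=1)
print(np.round(row_norms, 6))

# Non-negative entries in a unit-length row cannot exceed 1
print(X.min() >= 0.0, X.max() <= 1.0)
```

Unit-length rows are also why the dot product of two TF-IDF vectors can be read directly as their cosine similarity.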
N-grams — Capturing Context and Word Sequences
Now imagine a smarter jar that keeps word pairs together: the jar contains "not_good", "good_at", "at_all". Suddenly, the negative sentiment of "not good" is preserved! The model can see the context.
That is an N-gram — a sequence of N consecutive words treated as a single unit. Unigrams are single words (N=1), bigrams are two-word phrases (N=2), and trigrams are three-word phrases (N=3).
N-gram Types — Visual Breakdown
Use ngram_range=(1, 2) in scikit-learn to combine all unigrams and bigrams as features.
A window of size N slides one position at a time, extracting overlapping phrases. With 5 words, you get 4 bigrams and 3 trigrams.
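The sliding window can be written in a few lines of plain Python. The `ngrams` helper and the example sentence below are hypothetical, meant only to illustrate the mechanics.

```python
def ngrams(tokens, n):
    # Slide a window of size n across the token list, one step at a time
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "this movie is not good".split()  # hypothetical 5-word sentence

bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(len(bigrams), bigrams)
print(len(trigrams), trigrams)
```

A sentence of k words yields k − N + 1 N-grams, which is where the counts of 4 bigrams and 3 trigrams for 5 words come from.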
N-grams in Python
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
corpus = [
    "machine learning is not hard",
    "deep learning is great and not boring",
    "not all machine learning is deep learning",
]
# ── Unigrams only (standard BoW) ─────────────────────────────
uni = CountVectorizer(ngram_range=(1, 1))
X_uni = uni.fit_transform(corpus)
print("Unigram features:", uni.get_feature_names_out())
# ── Bigrams only ─────────────────────────────────────────────
bi = CountVectorizer(ngram_range=(2, 2))
X_bi = bi.fit_transform(corpus)
print("\nBigram features:", bi.get_feature_names_out())
# ── Unigrams + Bigrams combined ───────────────────────────────
combo = CountVectorizer(ngram_range=(1, 2))
X_combo = combo.fit_transform(corpus)
print("\nCombined (1-2) features:", len(combo.get_feature_names_out()),
      "total features")
print(combo.get_feature_names_out())
# ── TF-IDF with bigrams (common in practice) ──────────────────
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_bi = TfidfVectorizer(ngram_range=(1, 2), max_features=20)
X_tfidf_bi = tfidf_bi.fit_transform(corpus)
print("\nTF-IDF + Bigrams top features:")
print(tfidf_bi.get_feature_names_out())
Notice that the bigrams "is not" and "not hard" appear as features. A unigram model sees "not" and "hard" separately, so it might associate "hard" with something negative. The bigram model sees "not hard" as a single unit and can learn that this combination carries a positive or neutral sentiment. This is why TF-IDF + bigrams often outperforms basic BoW on sentiment tasks.
Full Pipeline — Topic Classification on 20 Newsgroups
Let's put all three techniques together on a real task: classifying posts from the 20 Newsgroups dataset as hockey (rec.sport.hockey) or space (sci.space). We compare BoW, TF-IDF, and TF-IDF+Bigrams directly.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
import numpy as np
# ── Load a binary subset: rec.sport.hockey vs sci.space ──────
cats = ['rec.sport.hockey', 'sci.space']
data = fetch_20newsgroups(subset='all', categories=cats,
                          remove=('headers', 'footers', 'quotes'))
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42, stratify=data.target
)
# ── Method 1: Bag of Words ─────────────────────────────────
bow = CountVectorizer(stop_words='english', max_features=10000)
X_bow_tr = bow.fit_transform(X_train)
X_bow_te = bow.transform(X_test)
lr_bow = LogisticRegression(max_iter=500)
lr_bow.fit(X_bow_tr, y_train)
acc_bow = accuracy_score(y_test, lr_bow.predict(X_bow_te))
# ── Method 2: TF-IDF (unigrams) ────────────────────────────
tfidf = TfidfVectorizer(stop_words='english', max_features=10000)
X_tfidf_tr = tfidf.fit_transform(X_train)
X_tfidf_te = tfidf.transform(X_test)
lr_tfidf = LogisticRegression(max_iter=500)
lr_tfidf.fit(X_tfidf_tr, y_train)
acc_tfidf = accuracy_score(y_test, lr_tfidf.predict(X_tfidf_te))
# ── Method 3: TF-IDF + Bigrams ─────────────────────────────
tfidf_bi = TfidfVectorizer(ngram_range=(1, 2), stop_words='english', max_features=10000)
X_bi_tr = tfidf_bi.fit_transform(X_train)
X_bi_te = tfidf_bi.transform(X_test)
lr_bi = LogisticRegression(max_iter=500)
lr_bi.fit(X_bi_tr, y_train)
acc_bi = accuracy_score(y_test, lr_bi.predict(X_bi_te))
# ── Compare ────────────────────────────────────────────────
print(f"BoW Accuracy: {acc_bow:.4f}")
print(f"TF-IDF Accuracy: {acc_tfidf:.4f}")
print(f"TF-IDF+Bigrams Acc.: {acc_bi:.4f}")
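# ── Top TF-IDF features for each class ────────────────────
# A binary LogisticRegression stores a single coefficient row (coef_[0]):
# the most negative weights point to class 0, the most positive to class 1.
feature_names = tfidf_bi.get_feature_names_out()
order = np.argsort(lr_bi.coef_[0])
for cls, top in [(0, order[:10]), (1, order[-10:][::-1])]:
    print(f"\nTop features for '{data.target_names[cls]}':")
    print([feature_names[i] for i in top])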
Look through the highest-weighted features for bigrams such as "stanley cup" (hockey) and "space shuttle" (space); the exact ranking depends on the data split and library version. A unigram model cannot represent these two-word phrases as units: "stanley" alone says little, and "space" alone could belong to either class. The bigram captures the actual meaning.
Head-to-Head — When to Use Which Technique
| Factor | Bag of Words | TF-IDF | N-grams (2-3) |
|---|---|---|---|
| Captures word order? | No | No | Partial (local context) |
| Handles stopwords? | Manual removal needed | Auto-suppressed by IDF | Manual still helpful |
| Vocabulary size | V words | V words | V + V² phrases (much larger) |
| Sentiment analysis | Weak (misses negation) | Moderate | Good (captures "not good") |
| Document classification | Good baseline | Better — highlights key terms | Best with TF-IDF+bigrams |
| Text similarity / search | Acceptable | Excellent (weights rarity) | Excellent for phrase matching |
| Memory/compute cost | Low | Low | High (explodes with N) |
| Naive Bayes compatibility | Excellent | Okay (non-negative needed) | Good |
A vocabulary of 10,000 words produces ~10,000 unigrams.
Adding bigrams can produce up to 100 million possible bigram combinations.
In practice, only a fraction appear in your corpus, but memory can still explode.
Always set max_features (e.g. 10,000–50,000) when using bigrams or trigrams in production.
Shared Limitations — What These Methods Cannot Do
Despite their wide use, all three classical methods share a fundamental blind spot: they represent text as independent word occurrences. They have no understanding of what words actually mean in context.
These limitations motivated the rise of word embeddings (Word2Vec, GloVe, FastText), which represent words as dense low-dimensional vectors where "king − man + woman ≈ queen". Beyond those, transformer models (BERT, GPT, RoBERTa) learn contextual representations: the same word gets a different vector depending on the sentence it appears in. But for many real tasks (spam filtering, topic classification, search), TF-IDF and bigrams remain competitive, fast, and explainable, and they are usually a sensible starting point.
Golden Rules — Non-Negotiable Practices
Fit on training data only: call fit_transform(X_train) and transform(X_test), and never fit_transform on the test set. Fitting on test data leaks vocabulary and IDF scores, inflating your accuracy metrics.
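One way to make this rule hard to break is to wrap the vectorizer and the classifier in a scikit-learn Pipeline, which fits the vectorizer only when `fit` is called and reuses it at prediction time. The miniature dataset below is hypothetical and far too small to train a meaningful model; it only illustrates the mechanics.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical miniature dataset, for illustration only
train_texts = ["great fun film", "boring slow film", "great cast", "slow and boring"]
train_labels = [1, 0, 1, 0]
test_texts = ["great zebra documentary"]

# fit() learns vocabulary and IDF weights from the training text only;
# predict() applies them to new text via transform().
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(train_texts, train_labels)

vectorizer = model.named_steps["tfidfvectorizer"]
print("zebra" in vectorizer.vocabulary_)  # words seen only at test time are ignored
predictions = model.predict(test_texts)
print(predictions)
```

The same pipeline object can be passed to cross-validation utilities, which then refit the vectorizer inside each fold instead of on the full dataset.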
Use stop_words='english' in scikit-learn, or supply a custom list for domain-specific corpora. Stopwords waste features and add noise, but keep them for language modelling tasks where function words matter.
Set max_features when using N-grams. Bigrams and trigrams explode the vocabulary size. Capping at 10,000–100,000 prevents memory issues while retaining the most frequent phrases. Start with 20,000 and tune up if performance keeps improving.
Start with ngram_range=(1, 2); never jump to trigrams first. Bigrams usually capture most of the benefit, while trigrams massively increase vocabulary size for diminishing returns. Benchmark before adding them, because the compute cost is real.