Bag of Words vs TF-IDF vs N-grams

Section 01

Why Machines Cannot Read Words — The Core Problem

📖 Real World Analogy

The Alien Translator

Imagine an alien lands on Earth and wants to understand human language. It has a supercomputer that only understands numbers. Every time you say "dog", the computer stares blankly. But the moment you say "dog = 7", it springs to life and starts processing.

That alien's dilemma is exactly what every machine learning model faces. Models are mathematical machines — they consume numbers, multiply them, add them, transform them. They cannot consume the word "dog" or the sentence "the cat sat on the mat".

Text representation is the art of translating human language into numbers a machine can learn from. And like any translation, some methods lose meaning, some preserve it, and some do far better than others.

In NLP (Natural Language Processing), before you build any classifier, sentiment analyser, or search engine, you must first answer one fundamental question: how do I turn this text into numbers? The three classical answers are Bag of Words (BoW), TF-IDF, and N-grams. Each represents a different philosophy, a different trade-off, and a different view of what "meaning" is in text.

🔵

What You Will Learn

By the end of this tutorial you will understand how BoW, TF-IDF, and N-grams work conceptually and mathematically, when to use each, their strengths and limitations, and how to implement all three in Python using scikit-learn — with worked examples on real text.

Section 02

Bag of Words (BoW) — The Simplest Possible Idea

📖 Story

The Shredder and the Word Jar

Imagine you print a book, shred every sentence, cut each word out individually, and throw all the paper scraps into a big glass jar. You shake the jar. Order is gone. Grammar is gone. The sentences "The dog bit the man" and "The man bit the dog" now produce identical jars: same words, same counts, just shuffled.

That jar is your Bag of Words. It doesn't care about order. It doesn't care about grammar. It only cares about which words appear and how often.

How BoW Works — Step by Step

🔧 Building a Bag of Words Representation

Step 1

Collect your corpus. A corpus is just your collection of documents (sentences, reviews, articles).

Step 2

Build the vocabulary. Find every unique word across all documents. Each unique word becomes a column (a feature).

Step 3

Vectorise each document. For each document, count how many times each vocabulary word appears. That count vector is the document representation.

Step 4

Feed to your model. Each document is now a row of numbers — ready for any ML algorithm.
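The four steps above can be sketched in plain Python before reaching for a library. This is a minimal illustration rather than how scikit-learn does it: the helper names build_vocabulary and vectorise, the whitespace tokenisation, and the toy corpus are all assumptions made for the example.

```python
from collections import Counter


def build_vocabulary(corpus):
    """Step 2: collect every unique lowercase token, sorted for a stable column order."""
    return sorted({token for doc in corpus for token in doc.lower().split()})


def vectorise(doc, vocabulary):
    """Step 3: count how often each vocabulary word occurs in one document."""
    counts = Counter(doc.lower().split())
    return [counts[word] for word in vocabulary]


# Step 1: an illustrative toy corpus (not from the tutorial)
corpus = ["the cat sat", "the cat saw the dog"]

vocabulary = build_vocabulary(corpus)
matrix = [vectorise(doc, vocabulary) for doc in corpus]

# Step 4: each row is now a plain list of numbers a model could consume
print(vocabulary)
for row in matrix:
    print(row)
```

Sorting the vocabulary is a design choice that keeps the column order reproducible between runs; any fixed order would work as long as every document uses the same one.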

Concrete Example

Consider these three short documents:

📄 Raw Documents

Doc #	Text
D1	"I love machine learning"
D2	"machine learning is great"
D3	"I love great food"

✅ Vocabulary Built

Index	Word
0	food
1	great
2	is
3	learning
4	love
5	machine

Document	food	great	is	learning	love	machine
D1	0	0	0	1	1	1
D2	0	1	1	1	0	1
D3	1	1	0	0	1	0

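Each document is now a 6-dimensional vector. D1 = [0, 0, 0, 1, 1, 1]. This is what gets fed to your logistic regression, SVM, or naive Bayes classifier. Note that "I" is missing from the vocabulary: the table follows scikit-learn's CountVectorizer, whose default tokeniser lowercases the text and keeps only tokens of two or more characters.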

📊 Visual — BoW Architecture

Each document becomes a count vector of length = vocabulary size. The matrix is often very sparse (lots of zeros).

BoW in Python — Full Working Example

from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

# Our corpus
corpus = [
    "I love machine learning",
    "machine learning is great",
    "I love great food",
]

# Build the BoW model
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

# Inspect the vocabulary
vocab = vectorizer.get_feature_names_out()
print("Vocabulary:", vocab)

# View as a readable DataFrame
df = pd.DataFrame(
    X.toarray(),
    columns=vocab,
    index=["D1", "D2", "D3"]
)
print(df)

OUTPUT

Vocabulary: ['food' 'great' 'is' 'learning' 'love' 'machine']
    food  great  is  learning  love  machine
D1     0      0   0         1     1        1
D2     0      1   1         1     0        1
D3     1      1   0         0     1        0

BoW — Strengths and Weaknesses

⚡

Speed & Simplicity

PRO

Trivial to compute. Works with any corpus size. Excellent baseline before trying complex approaches.

✔ Fast, interpretable, battle-tested

🔗

Loses Word Order

CON

"not good, just bad" and "not bad, just good" contain exactly the same words, so they produce identical vectors even though they carry opposite sentiments. BoW cannot distinguish them.

✘ "dog bites man" ≡ "man bites dog"

📊

Ignores Word Importance

CON

Words like "the", "is", "a" appear everywhere but carry almost no topical meaning. BoW gives them the same weight as "cancer" or "fraud".

✘ High counts ≠ high importance

⚠️

The Sparsity Problem

A real corpus of 10,000 documents might have a vocabulary of 50,000 words. Each document uses maybe 200 distinct words, so 49,800 of its 50,000 cells are zero. That matrix is 99.6% zeros. Stored densely, this wastes memory and can slow learning. Scikit-learn's vectorisers return SciPy sparse matrices, which store only the non-zero entries.
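Sparsity can be measured on any matrix the vectoriser returns. The sketch below reuses the three-document corpus from the BoW example; the variable names are illustrative. It relies on the nnz attribute of SciPy sparse matrices, which reports how many non-zero entries are stored.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "I love machine learning",
    "machine learning is great",
    "I love great food",
]

X = CountVectorizer().fit_transform(corpus)  # SciPy sparse matrix

n_rows, n_cols = X.shape
total_cells = n_rows * n_cols
nonzero_cells = X.nnz                        # entries actually stored
sparsity = 1 - nonzero_cells / total_cells

print(f"shape: {X.shape}")
print(f"non-zero cells: {nonzero_cells} of {total_cells}")
print(f"sparsity: {sparsity:.1%}")

# The arithmetic from the paragraph above: 200 used words out of 50,000
print(f"realistic example: {1 - 200 / 50_000:.1%} zeros")
```

On a toy corpus the matrix is not very sparse yet; the fraction of zeros grows quickly as the vocabulary grows while documents stay short.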

Section 03

TF-IDF — Rewarding Rare, Punishing Common

📖 Story

The Detective and the Unique Clue

Imagine you are reading 10,000 murder mystery novels. Every single one uses the words "the", "and", "said". These words tell you nothing about which novel is different from which.

But if one novel uses the word "arsenic" 15 times, that word is screaming at you. It's rare across all books, but frequent in this one — which means it's crucial to understanding this specific novel.

TF-IDF is that detective instinct made mathematical. It gives high scores to words that are frequent in a document but rare across the corpus.

The Two Components of TF-IDF

Term Frequency (TF)

TF(t, d) = count(t, d) / total_words(d)

How often does term t appear in document d, normalised by document length. A word used 5 times in a 100-word doc has TF = 0.05.

Inverse Document Frequency (IDF)

IDF(t) = log( N / df(t) )

N = total documents. df(t) = documents containing term t. Rare words (small df) get large IDF. Common words (large df) get IDF near 0.

🔵

Final Score = TF × IDF

TF-IDF(t, d) = TF(t, d) × IDF(t)
A word scores high only when it is frequent in this document AND rare across all documents. Words like "the" have low IDF (they appear everywhere), so their TF-IDF score collapses to near zero — even if they appear 50 times in one document.
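The textbook formulas above translate almost line for line into Python. This is a minimal sketch under a few assumptions: the natural logarithm is used for "log", tokenisation is a plain whitespace split, and the function names and three-sentence corpus are invented for illustration. It is not scikit-learn's variant of the formula.

```python
import math


def tf(term, doc_tokens):
    """TF(t, d) = count(t, d) / total_words(d)"""
    return doc_tokens.count(term) / len(doc_tokens)


def idf(term, corpus_tokens):
    """IDF(t) = log(N / df(t)); assumes the term occurs in at least one document."""
    df = sum(1 for doc in corpus_tokens if term in doc)
    return math.log(len(corpus_tokens) / df)


def tf_idf(term, doc_tokens, corpus_tokens):
    """TF-IDF(t, d) = TF(t, d) * IDF(t)"""
    return tf(term, doc_tokens) * idf(term, corpus_tokens)


# Illustrative corpus: "the" is in every document, "arsenic" in only one
corpus_tokens = [
    "the detective found arsenic in the tea".split(),
    "the butler poured the tea".split(),
    "the detective questioned the butler".split(),
]
doc = corpus_tokens[0]

for term in ("the", "arsenic"):
    print(f"{term:8s} TF-IDF = {tf_idf(term, doc, corpus_tokens):.4f}")
```

Because "the" appears in all three documents, its IDF is log(3 / 3) = 0 and its score is exactly zero no matter how often it occurs, while "arsenic" keeps a positive score.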

TF-IDF Worked Calculation

Scenario	Word	TF (this doc)	IDF (corpus)	TF-IDF Score	Meaning
Medical article	arsenic	0.08	4.2 (very rare)	0.336	Key topic — reward it
Medical article	the	0.12	0.01 (everywhere)	0.001	Stopword — penalise it
Sports article	touchdown	0.06	3.8 (rare outside sports)	0.228	Domain-specific — reward it
Sports article	game	0.10	0.5 (somewhat common)	0.05	Moderately useful

TF-IDF Visual — How Scores Change Across Documents

📊 Diagram — TF-IDF Score Logic

Common words ("the") score near zero across all documents. Rare content words score high only in the document where they frequently appear.

TF-IDF in Python

from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
import numpy as np

corpus = [
    "I love machine learning",
    "machine learning is great",
    "I love great food",
]

# Build TF-IDF model
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(corpus)

vocab = tfidf.get_feature_names_out()
df = pd.DataFrame(
    np.round(X.toarray(), 4),
    columns=vocab,
    index=["D1", "D2", "D3"]
)

print(df)

# Inspect IDF scores — higher = rarer word
for word, idf in zip(vocab, tfidf.idf_):
    print(f"  {word:12s}: IDF = {idf:.4f}")

OUTPUT

      food   great      is  learning    love  machine
D1  0.0000  0.0000  0.0000    0.5774  0.5774   0.5774
D2  0.0000  0.4599  0.6047    0.4599  0.0000   0.4599
D3  0.6809  0.5179  0.0000    0.0000  0.5179   0.0000
  food        : IDF = 1.6931
  great       : IDF = 1.2877
  is          : IDF = 1.6931
  learning    : IDF = 1.2877
  love        : IDF = 1.2877
  machine     : IDF = 1.2877
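The IDF values printed by TfidfVectorizer can be reproduced by hand, which shows where they come from. The sketch below assumes the vectoriser's default settings (smooth_idf=True, norm='l2', raw counts as TF); the variable names are illustrative. With those defaults the IDF is ln((1 + N) / (1 + df)) + 1 rather than the textbook log(N / df), and every row is scaled to unit L2 length.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = [
    "I love machine learning",
    "machine learning is great",
    "I love great food",
]

# Raw term counts and document frequencies
counts = CountVectorizer().fit_transform(corpus).toarray()
n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)

# Smoothed IDF used by scikit-learn's defaults
idf = np.log((1 + n_docs) / (1 + df)) + 1

# Weight the counts, then L2-normalise each row
weighted = counts * idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

reference = TfidfVectorizer().fit_transform(corpus).toarray()
print("IDF:", np.round(idf, 4))
print("Matches TfidfVectorizer:", np.allclose(manual, reference))
```

The "+ 1" keeps a word that appears in every document from being zeroed out entirely, which is why scikit-learn's scores for very common words shrink but do not vanish.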

✅

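Key Insight from the Output

"food" and "is" each appear in only 1 document, so they have the highest IDF (1.6931). "great", "love", "machine", and "learning" each appear in 2 documents and share a lower IDF (1.2877). These values differ from the textbook formula because scikit-learn's TfidfVectorizer defaults to a smoothed IDF, ln((1 + N) / (1 + df(t))) + 1, uses the raw count as TF, and then scales every row to unit L2 length. For "food": ln(4 / 2) + 1 = 1.6931. The principle is unchanged: within a document, words that are rare across the corpus get more weight than words that appear in many documents.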

BoW vs TF-IDF — Side by Side

Property	Bag of Words	TF-IDF
What it counts	Raw word frequency	Frequency × Rarity score
Common words ("the", "is")	Treated equally	Automatically down-weighted
Domain-specific rare words	No special treatment	Up-weighted automatically
Output values	Integer counts	Float scores (0–1 normalised)
Needs stopword list?	Yes, manually	Less critical (IDF suppresses them)
Best use case	Naive Bayes, quick baseline	Classification, similarity, search

Section 04

N-grams — Capturing Context and Word Sequences

📖 Story

The Word Bubble vs. The Phrase Bubble

Suppose you see the words "not" and "good" floating separately in a BoW jar. Looks okay. But what if the original sentence was "not good at all"?

Now imagine a smarter jar that keeps word pairs together: the jar contains "not_good", "good_at", "at_all". Suddenly, the negative sentiment of "not good" is preserved! The model can see the context.

That is an N-gram — a sequence of N consecutive words treated as a single unit. Unigrams are single words (N=1), bigrams are two-word phrases (N=2), and trigrams are three-word phrases (N=3).

N-gram Types — Visual Breakdown

🔤 Sentence: "machine learning is not hard"

1-gram (Unigrams)

machine | learning | is | not | hard

2-gram (Bigrams)

machine learning | learning is | is not | not hard

3-gram (Trigrams)

machine learning is | learning is not | is not hard

1+2 Combined

Use ngram_range=(1,2) in scikit-learn to combine all unigrams and bigrams as features

📊 Diagram — N-gram Sliding Window

A window of size N slides one position at a time, extracting overlapping phrases. With 5 words, you get 4 bigrams and 3 trigrams.
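The sliding window can be written in a few lines of plain Python. This is a minimal sketch, assuming whitespace tokenisation; the function name ngrams is illustrative and not part of any library.

```python
def ngrams(text, n):
    """Slide a window of n tokens over the text, one position at a time."""
    tokens = text.split()
    # A text with fewer than n tokens yields an empty list
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


sentence = "machine learning is not hard"

unigrams = ngrams(sentence, 1)
bigrams = ngrams(sentence, 2)
trigrams = ngrams(sentence, 3)

print(len(unigrams), unigrams)
print(len(bigrams), bigrams)
print(len(trigrams), trigrams)
```

A sentence of k tokens yields k - n + 1 n-grams, which is where the counts of 4 bigrams and 3 trigrams for a 5-word sentence come from.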

N-grams in Python

from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

corpus = [
    "machine learning is not hard",
    "deep learning is great and not boring",
    "not all machine learning is deep learning",
]

# ── Unigrams only (standard BoW) ─────────────────────────────
uni = CountVectorizer(ngram_range=(1, 1))
X_uni = uni.fit_transform(corpus)
print("Unigram features:", uni.get_feature_names_out())

# ── Bigrams only ─────────────────────────────────────────────
bi = CountVectorizer(ngram_range=(2, 2))
X_bi = bi.fit_transform(corpus)
print("\nBigram features:", bi.get_feature_names_out())

# ── Unigrams + Bigrams combined ───────────────────────────────
combo = CountVectorizer(ngram_range=(1, 2))
X_combo = combo.fit_transform(corpus)
print("\nCombined (1-2) features:", len(combo.get_feature_names_out()),
      "total features")
print(combo.get_feature_names_out())

# ── TF-IDF with bigrams (common in practice) ──────────────────
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_bi = TfidfVectorizer(ngram_range=(1, 2), max_features=20)
X_tfidf_bi = tfidf_bi.fit_transform(corpus)
print("\nTF-IDF + Bigrams top features:")
print(tfidf_bi.get_feature_names_out())

OUTPUT

Unigram features: ['all' 'and' 'boring' 'deep' 'great' 'hard' 'is' 'learning' 'machine' 'not']

Bigram features: ['all machine' 'and not' 'deep learning' 'great and' 'is deep' 'is great' 'is not' 'learning is' 'machine learning' 'not all' 'not boring' 'not hard']

Combined (1-2) features: 22 total features
['all' 'all machine' 'and' 'and not' 'boring' 'deep' 'deep learning' 'great' 'great and' 'hard' 'is' 'is deep' 'is great' 'is not' 'learning' 'learning is' 'machine' 'machine learning' 'not' 'not all' 'not boring' 'not hard']

TF-IDF + Bigrams top features:
['all' 'all machine' 'and' 'and not' 'boring' 'deep' 'deep learning' 'great' 'great and' 'hard' 'is' 'is deep' 'is great' 'is not' 'learning' 'learning is' 'machine' 'machine learning' 'not' 'not all']

Note: max_features=20 keeps the 20 n-grams with the highest total count across the corpus, and get_feature_names_out() lists the survivors alphabetically rather than by TF-IDF weight. Fourteen of the 22 n-grams here are tied at a single occurrence, so which two are dropped depends on how ties are broken and may differ from the list shown.

💡

Why Bigrams Capture Sentiment Better

Notice that the bigrams "is not" and "not hard" appear as features. A unigram model sees "not" and "hard" separately and might associate "hard" with something negative. The bigram model sees "not hard" as a single unit and can learn that this combination carries a positive or neutral sentiment. This is why TF-IDF + bigrams often outperforms basic BoW on sentiment tasks.

Section 05

Full Pipeline: Topic Classification on 20 Newsgroups

Let's put all three techniques together on a real task: classifying newsgroup posts as hockey (rec.sport.hockey) or space (sci.space) using the 20 Newsgroups dataset. We compare BoW, TF-IDF, and TF-IDF+Bigrams directly.

Load & Split Data

Load the two newsgroup categories. Split into 80% training / 20% test. Preserve class balance with stratify.

Vectorise Text

Fit CountVectorizer / TfidfVectorizer on the training set only. Transform both train and test (no data leakage).

Train Classifier

Logistic Regression works well with sparse TF-IDF matrices. Naive Bayes is excellent with raw BoW counts.

Evaluate & Compare

Measure accuracy, F1, and inspect the most important features (words / phrases) for each method.

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
import numpy as np

# ── Load a binary subset: rec.sport.hockey vs sci.space ─────
cats = ['rec.sport.hockey', 'sci.space']
data = fetch_20newsgroups(subset='all', categories=cats,
                           remove=('headers', 'footers', 'quotes'))

X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42, stratify=data.target
)

# ── Method 1: Bag of Words ─────────────────────────────────
bow = CountVectorizer(stop_words='english', max_features=10000)
X_bow_tr = bow.fit_transform(X_train)
X_bow_te = bow.transform(X_test)
lr_bow = LogisticRegression(max_iter=500)
lr_bow.fit(X_bow_tr, y_train)
acc_bow = accuracy_score(y_test, lr_bow.predict(X_bow_te))

# ── Method 2: TF-IDF (unigrams) ────────────────────────────
tfidf = TfidfVectorizer(stop_words='english', max_features=10000)
X_tfidf_tr = tfidf.fit_transform(X_train)
X_tfidf_te = tfidf.transform(X_test)
lr_tfidf = LogisticRegression(max_iter=500)
lr_tfidf.fit(X_tfidf_tr, y_train)
acc_tfidf = accuracy_score(y_test, lr_tfidf.predict(X_tfidf_te))

# ── Method 3: TF-IDF + Bigrams ─────────────────────────────
tfidf_bi = TfidfVectorizer(ngram_range=(1, 2), stop_words='english', max_features=10000)
X_bi_tr = tfidf_bi.fit_transform(X_train)
X_bi_te = tfidf_bi.transform(X_test)
lr_bi = LogisticRegression(max_iter=500)
lr_bi.fit(X_bi_tr, y_train)
acc_bi = accuracy_score(y_test, lr_bi.predict(X_bi_te))

# ── Compare ────────────────────────────────────────────────
print(f"BoW Accuracy:         {acc_bow:.4f}")
print(f"TF-IDF Accuracy:      {acc_tfidf:.4f}")
print(f"TF-IDF+Bigrams Acc.:  {acc_bi:.4f}")

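# ── Top TF-IDF features for each class ────────────────────
# A binary LogisticRegression has a single coefficient row (coef_[0]):
# the most negative weights point to class 0, the most positive to class 1.
feature_names = tfidf_bi.get_feature_names_out()
order = np.argsort(lr_bi.coef_[0])
top_per_class = [order[:10], order[-10:][::-1]]  # strongest feature first
for name, top in zip(data.target_names, top_per_class):
    print(f"\nTop features for '{name}':")
    print([feature_names[i] for i in top])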

OUTPUT

BoW Accuracy:         0.9376
TF-IDF Accuracy:      0.9658
TF-IDF+Bigrams Acc.:  0.9712

Top features for 'rec.sport.hockey':
['nhl', 'hockey', 'playoff', 'team', 'season', 'cup', 'stanley cup', 'game', 'players', 'league']

Top features for 'sci.space':
['shuttle', 'orbit', 'nasa', 'moon', 'spacecraft', 'space shuttle', 'launch', 'satellite', 'mars', 'space station']

🏆

Bigrams Reveal Domain Phrases

Look at the top bigram features: "stanley cup" for hockey and "space shuttle" for space. A unigram model never sees these two-word phrases as units: "stanley" alone means little, and "space" alone could belong to either class. The bigram captures the actual meaning.

Section 06

Head-to-Head — When to Use Which Technique

Factor	Bag of Words	TF-IDF	N-grams (2-3)
Captures word order?	No	No	Partial (local context)
Handles stopwords?	Manual removal needed	Auto-suppressed by IDF	Manual still helpful
Vocabulary size	V words	V words	Up to V + V² phrases (much larger)
Sentiment analysis	Weak (misses negation)	Moderate	Good (captures "not good")
Document classification	Good baseline	Better — highlights key terms	Best with TF-IDF+bigrams
Text similarity / search	Acceptable	Excellent (weights rarity)	Excellent for phrase matching
Memory/compute cost	Low	Low	High (explodes with N)
Naive Bayes compatibility	Excellent	Okay (non-negative needed)	Good

🏁

Use BoW when…

You need a fast baseline, you're using Multinomial Naive Bayes, your corpus is small and vocabulary is controlled, or you need raw counts (e.g. for topic modeling with LDA).

CountVectorizer(binary=False)

🎯

Use TF-IDF when…

You're doing document classification, search, or text similarity. Your corpus has noisy common words. You want automatic term weighting without manually building stopword lists.

TfidfVectorizer()

🔗

Use N-grams when…

Sentiment analysis (negation), named entity recognition (multi-word names), domain phrases ("machine learning", "neural network"), or any task where word context matters.

ngram_range=(1, 2)
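The search and similarity use case can be sketched by ranking documents against a query with cosine similarity. The documents and the query below are invented for illustration; cosine_similarity comes from sklearn.metrics.pairwise. The query is transformed with the vectoriser fitted on the documents, so both live in the same feature space.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Illustrative mini "index" and query
documents = [
    "machine learning is not hard",
    "deep learning is great and not boring",
    "the stanley cup final was a great hockey game",
]
query = "great hockey game"

vectorizer = TfidfVectorizer(ngram_range=(1, 2))
doc_matrix = vectorizer.fit_transform(documents)   # fit on the documents only
query_vec = vectorizer.transform([query])          # reuse the same vocabulary

scores = cosine_similarity(query_vec, doc_matrix).ravel()
best = int(scores.argmax())

for doc, score in zip(documents, scores):
    print(f"{score:.3f}  {doc}")
print("Best match:", documents[best])
```

The first document shares no term with the query and scores zero, the second shares only "great", and the third matches the unigrams and the bigrams "great hockey" and "hockey game", so it ranks first.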

⚠️

The N-gram Vocabulary Explosion Warning

A vocabulary of 10,000 words produces ~10,000 unigrams. Adding bigrams can produce up to 100 million possible bigram combinations. In practice, only a fraction appear in your corpus, but memory can still explode. Always set max_features (e.g. 10,000–50,000) when using bigrams or trigrams in production.
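The growth is visible even on a three-sentence corpus. The sketch below counts features for increasing ngram_range values and then applies a cap; the corpus is the one from the N-gram example and the cap of 15 is an arbitrary illustrative value.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "machine learning is not hard",
    "deep learning is great and not boring",
    "not all machine learning is deep learning",
]

# Vocabulary size for unigrams, then +bigrams, then +trigrams
sizes = {}
for n_max in (1, 2, 3):
    vec = CountVectorizer(ngram_range=(1, n_max)).fit(corpus)
    sizes[n_max] = len(vec.vocabulary_)
    print(f"ngram_range=(1, {n_max}): {sizes[n_max]} features")

# max_features keeps only the most frequent n-grams
capped = CountVectorizer(ngram_range=(1, 3), max_features=15).fit(corpus)
print("with max_features=15:", len(capped.vocabulary_), "features")
```

On real corpora the jump from unigrams to bigrams is far steeper than here, which is why the cap matters in production.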

Section 07

Shared Limitations — What These Methods Cannot Do

Despite their wide use, all three classical methods share a fundamental blind spot: they represent text as independent word occurrences. They have no understanding of what words actually mean in context.

🔀

No Semantic Understanding

CORE LIMITATION

"car" and "automobile" are completely different features, even though they mean the same thing. "bank" (financial) and "bank" (river) are treated identically. Context and meaning are invisible.

✘ Synonyms not handled · Homonyms not disambiguated

📐

Fixed Vocabulary

CORE LIMITATION

Words not seen during training are silently ignored at inference time. New words, typos, and unseen domain jargon contribute nothing to the vector, and a document made up entirely of unseen words becomes an all-zero vector that is completely uninformative to the model.

✘ OOV (out-of-vocabulary) problem

📏

High Dimensionality

CORE LIMITATION

Large corpora produce vectors with tens of thousands of dimensions, and most entries are zero. In such sparse, high-dimensional spaces distance metrics become less discriminative, training can slow down, and on small datasets the model has more features than examples to learn from.

✘ Curse of dimensionality in small datasets

🔵

What Comes After These Classical Methods?

These limitations motivated the rise of word embeddings (Word2Vec, GloVe, FastText), which represent words as dense low-dimensional vectors where "king − man + woman ≈ queen". Beyond those, transformer models (BERT, GPT, RoBERTa) learn contextual representations: the same word gets a different vector depending on the sentence it appears in. But for many real tasks (spam filtering, topic classification, search), TF-IDF and bigrams remain competitive, fast, and explainable, and they are usually a sensible starting point.

Section 08

Golden Rules — Non-Negotiable Practices

📋 Text Representation — Non-Negotiable Rules

Always fit on training data only. Call fit_transform(X_train) and transform(X_test) — never fit_transform on the test set. Fitting on test data leaks vocabulary and IDF scores, inflating your accuracy metrics.

Remove stopwords before vectorising for classification. Use stop_words='english' in scikit-learn, or supply a custom list for domain-specific corpora. Stopwords waste features and add noise — but keep them for language modelling tasks where function words matter.

Set max_features when using N-grams. Bigrams and trigrams explode the vocabulary size. Capping at 10,000–100,000 prevents memory issues while retaining the most informative phrases. Start with 20,000 and tune up if performance keeps improving.

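Lowercase and normalise before vectorising. "Machine" and "machine" are different tokens unless the text is lowercased. scikit-learn's CountVectorizer and TfidfVectorizer do this by default (lowercase=True), but lowercase explicitly if you tokenise the text yourself. Consider stemming (Porter) or lemmatisation (spaCy/NLTK WordNetLemmatizer) so "running" and "runs" map to the same feature.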

Use TF-IDF as your default, not BoW. TF-IDF usually outperforms raw counts on classification tasks. BoW is the right choice specifically for Multinomial Naive Bayes (which expects counts) or for LDA topic modelling (which uses term counts directly).

Start with ngram_range=(1, 2), never jump to trigrams first. Bigrams usually capture most of the benefit. Trigrams massively increase vocabulary size for diminishing returns. Benchmark before adding them, because the compute cost is real.

Inspect the top features to validate your vectoriser. After fitting, print the top 20 features by TF-IDF weight. If you see stopwords ("the", "and", "is") dominating — your preprocessing is wrong. If you see domain terms ("refund", "broken", "excellent") — you're on track.
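One way to make the first rule hard to break is to wrap the vectoriser and the classifier in a scikit-learn pipeline, so that fit only ever sees training text. The sketch below is an illustration on a tiny invented dataset, not a benchmark; the texts and labels are assumptions, and make_pipeline names each step after its lowercased class name.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Illustrative training and test data
train_texts = [
    "great hockey game last night",
    "team won stanley cup",
    "nasa launched new satellite",
    "space shuttle reached orbit",
]
train_labels = ["hockey", "hockey", "space", "space"]
test_texts = ["shuttle launch delayed", "what a game by our team"]

model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=500),
)

model.fit(train_texts, train_labels)      # vocabulary and IDF come from training text only
predictions = model.predict(test_texts)   # test text is transformed, never fitted
print(list(predictions))

# Words that occur only in the test set never enter the vocabulary
vocabulary = model.named_steps["tfidfvectorizer"].vocabulary_
print("'delayed' in vocabulary:", "delayed" in vocabulary)
```

The same pipeline object can be passed to cross-validation utilities, which refit the vectoriser inside every fold and so keep the no-leakage rule intact there as well.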