
Bag of Words, TF-IDF & N-grams

A comprehensive, story-driven tutorial covering the three foundational methods for converting raw text into machine-readable numbers — Bag of Words, TF-IDF, and N-grams — with visual diagrams, worked calculations, full Python code, and a real classification benchmark.

Section 01

Why Machines Cannot Read Words — The Core Problem

The Alien Translator
Imagine an alien lands on Earth and wants to understand human language. It has a supercomputer that only understands numbers. Every time you say "dog", the computer stares blankly. But the moment you say "dog = 7", it springs to life and starts processing.

That alien's dilemma is exactly what every machine learning model faces. Models are mathematical machines — they consume numbers, multiply them, add them, transform them. They cannot consume the word "dog" or the sentence "the cat sat on the mat".

Text representation is the art of translating human language into numbers a machine can learn from. And like any translation, some methods lose meaning, some preserve it, and some do far better than others.

In NLP (Natural Language Processing), before you build any classifier, sentiment analyser, or search engine, you must first answer one fundamental question: how do I turn this text into numbers? The three classical answers are Bag of Words (BoW), TF-IDF, and N-grams. Each represents a different philosophy, a different trade-off, and a different view of what "meaning" is in text.

🔵
What You Will Learn

By the end of this tutorial you will understand how BoW, TF-IDF, and N-grams work conceptually and mathematically, when to use each, their strengths and limitations, and how to implement all three in Python using scikit-learn — with worked examples on real text.


Section 02

Bag of Words (BoW) — The Simplest Possible Idea

The Shredder and the Word Jar
Imagine you print a book, shred every sentence, cut each word out individually, and throw all the paper scraps into a big glass jar. You shake the jar. Order is gone. Grammar is gone. The sentence "The dog bit the man" and "The man bit the dog" now produce identical jars — same words, same counts, just shuffled.

That jar is your Bag of Words. It doesn't care about order. It doesn't care about grammar. It only cares about which words appear and how often.

How BoW Works — Step by Step

🔧 Building a Bag of Words Representation
Step 1
Collect your corpus. A corpus is just your collection of documents (sentences, reviews, articles).
Step 2
Build the vocabulary. Find every unique word across all documents. Each unique word becomes a column (a feature).
Step 3
Vectorise each document. For each document, count how many times each vocabulary word appears. That count vector is the document representation.
Step 4
Feed to your model. Each document is now a row of numbers — ready for any ML algorithm.
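The four steps above can be sketched in plain Python without any library. This is a minimal illustration rather than scikit-learn's implementation: the helper names `tokenize` and `vectorize` are made up for this example, and the regular expression is an assumption chosen to mimic scikit-learn's default behaviour of lowercasing and dropping one-letter tokens.

```python
import re
from collections import Counter

# Step 1: the corpus (three short example documents).
corpus = [
    "I love machine learning",
    "machine learning is great",
    "I love great food",
]

def tokenize(text):
    # Lowercase and keep tokens of two or more word characters,
    # so one-letter words such as "I" are dropped.
    return re.findall(r"\b\w\w+\b", text.lower())

# Step 2: the vocabulary is the sorted set of unique tokens.
vocabulary = sorted({tok for doc in corpus for tok in tokenize(doc)})

def vectorize(text):
    # Step 3: count how often each vocabulary word occurs in the text.
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]

# Step 4: one row of numbers per document, ready for a model.
matrix = [vectorize(doc) for doc in corpus]

print(vocabulary)
for row in matrix:
    print(row)
```

Sorting the vocabulary keeps the column order deterministic, which matters because the same order must be reused for every document you vectorise later.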

Concrete Example

Consider these three short documents:

📄 Raw Documents
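| Doc # | Text |
|---|---|
| D1 | "I love machine learning" |
| D2 | "machine learning is great" |
| D3 | "I love great food" |

✅ Vocabulary Built

| Index | Word |
|---|---|
| 0 | food |
| 1 | great |
| 2 | is |
| 3 | learning |
| 4 | love |
| 5 | machine |

| Document | food | great | is | learning | love | machine |
|---|---|---|---|---|---|---|
| D1 | 0 | 0 | 0 | 1 | 1 | 1 |
| D2 | 0 | 1 | 1 | 1 | 0 | 1 |
| D3 | 1 | 1 | 0 | 0 | 1 | 0 |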

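Each document is now a 6-dimensional vector. D1 = [0, 0, 0, 1, 1, 1]. The word "I" is missing from the vocabulary because scikit-learn's default tokeniser keeps only tokens of two or more characters. These vectors are what gets fed to your logistic regression, SVM, or naive Bayes classifier.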

📊 Visual — BoW Architecture
[Diagram: Bag of Words pipeline. The three documents pass through CountVectorizer, which builds the six-word vocabulary (food, great, is, learning, love, machine) and a 3 × 6 count matrix. Each row is one document; each column is one word count.]

Each document becomes a count vector of length = vocabulary size. The matrix is often very sparse (lots of zeros).

BoW in Python — Full Working Example

from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

# Our corpus
corpus = [
    "I love machine learning",
    "machine learning is great",
    "I love great food",
]

# Build the BoW model
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

# Inspect the vocabulary
vocab = vectorizer.get_feature_names_out()
print("Vocabulary:", vocab)

# View as a readable DataFrame
df = pd.DataFrame(
    X.toarray(),
    columns=vocab,
    index=["D1", "D2", "D3"]
)
print(df)
OUTPUT
Vocabulary: ['food' 'great' 'is' 'learning' 'love' 'machine']
    food  great  is  learning  love  machine
D1     0      0   0         1     1        1
D2     0      1   1         1     0        1
D3     1      1   0         0     1        0

BoW — Strengths and Weaknesses

Speed & Simplicity
PRO
Trivial to compute. Works with any corpus size. Excellent baseline before trying complex approaches.
✔ Fast, interpretable, battle-tested
🔗
Loses Word Order
CON
"good, not bad" and "bad, not good" contain exactly the same words, so they produce identical vectors even though they carry opposite sentiments. BoW cannot distinguish them.
✘ "dog bites man" ≡ "man bites dog"
📊
Ignores Word Importance
CON
Words like "the", "is", "a" appear everywhere but carry almost no topical meaning. BoW treats them equally to "cancer" or "fraud".
✘ High counts ≠ high importance
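The word-order weakness is easy to check directly. The sketch below uses a hypothetical helper `bag` (not a library function) that counts words after lowercasing and stripping commas, then compares sentences that use the same words in a different order.

```python
from collections import Counter

def bag(text):
    # Strip commas, lowercase, and count each word; order is discarded.
    return Counter(text.lower().replace(",", "").split())

a = bag("good, not bad")
b = bag("bad, not good")
c = bag("dog bites man")
d = bag("man bites dog")

print(a == b)  # True: opposite sentiment, identical bag
print(c == d)  # True: opposite meaning, identical bag
```

Any model trained on these bags receives exactly the same input for both sentences in each pair, so it has no way to tell them apart.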
⚠️
The Sparsity Problem

A real corpus of 10,000 documents might have a vocabulary of 50,000 words. Each document uses maybe 200 distinct words, so 49,800 of the 50,000 cells in its row are zero. That matrix is 99.6% zeros. Stored densely, this wastes memory and can slow learning. Scikit-learn's vectorisers return SciPy sparse matrices, which store only the non-zero entries, to handle this efficiently.
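The arithmetic behind the sparsity figure can be reproduced in a few lines. The storage estimate is a rough sketch under stated assumptions: 8-byte values, 4-byte indices, and a compressed-sparse-row style layout; real sizes depend on the dtypes your library chooses.

```python
n_docs = 10_000
vocab_size = 50_000
distinct_words_per_doc = 200  # assumed average, as in the text

total_cells = n_docs * vocab_size
nonzero_cells = n_docs * distinct_words_per_doc
sparsity = 1 - nonzero_cells / total_cells

# Dense storage keeps every cell, zero or not (assumed 8 bytes each).
dense_bytes = total_cells * 8

# A CSR-style sparse layout keeps each non-zero value (8 bytes) with its
# column index (4 bytes), plus one 4-byte row pointer per row and one extra.
sparse_bytes = nonzero_cells * (8 + 4) + (n_docs + 1) * 4

print(f"sparsity: {sparsity:.1%}")
print(f"dense:  {dense_bytes / 1e9:.1f} GB")
print(f"sparse: {sparse_bytes / 1e6:.1f} MB")
```

Under these assumptions the dense matrix needs about 4 GB while the sparse one needs about 24 MB, which is why vectorisers return sparse matrices by default.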


Section 03

TF-IDF — Rewarding Rare, Punishing Common

The Detective and the Unique Clue
Imagine you are reading 10,000 murder mystery novels. Every single one uses the word "the", "and", "said". These words tell you nothing about which novel is different from which.

But if one novel uses the word "arsenic" 15 times, that word is screaming at you. It's rare across all books, but frequent in this one — which means it's crucial to understanding this specific novel.

TF-IDF is that detective instinct made mathematical. It gives high scores to words that are frequent in a document but rare across the corpus.

The Two Components of TF-IDF

Term Frequency (TF)
TF(t, d) = count(t, d) / total_words(d)
How often does term t appear in document d, normalised by document length. A word used 5 times in a 100-word doc has TF = 0.05.
Inverse Document Frequency (IDF)
IDF(t) = log( N / df(t) )
N = total documents. df(t) = documents containing term t. Rare words (small df) get large IDF. Common words (large df) get IDF near 0.
🔵
Final Score = TF × IDF

TF-IDF(t, d) = TF(t, d) × IDF(t)
A word scores high only when it is frequent in this document AND rare across all documents. Words like "the" have low IDF (they appear everywhere), so their TF-IDF score collapses to near zero — even if they appear 50 times in one document.
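The two formulas can be applied by hand to the three-document corpus from Section 02. This is a sketch of the textbook definition exactly as written above, using the natural logarithm and plain whitespace tokenisation (both are assumptions; the formula does not fix a log base). The function names `tf`, `idf`, and `tf_idf` are illustrative, and `idf` assumes the term occurs in at least one document.

```python
import math

corpus = {
    "D1": "I love machine learning",
    "D2": "machine learning is great",
    "D3": "I love great food",
}
tokens = {name: text.lower().split() for name, text in corpus.items()}
n_docs = len(tokens)

def tf(term, doc_tokens):
    # Term count normalised by document length.
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term):
    # log(N / df), where df is the number of documents containing the term.
    df = sum(term in doc_tokens for doc_tokens in tokens.values())
    return math.log(n_docs / df)

def tf_idf(term, name):
    return tf(term, tokens[name]) * idf(term)

print(round(tf_idf("food", "D3"), 4))   # 0.2747: appears only in D3
print(round(tf_idf("great", "D3"), 4))  # 0.1014: appears in D2 and D3
```

Both words occur once in D3, so their TF is identical (0.25). The difference in score comes entirely from IDF: "food" is found in one document out of three, "great" in two.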

TF-IDF Worked Calculation

| Scenario | Word | TF (this doc) | IDF (corpus) | TF-IDF Score | Meaning |
|---|---|---|---|---|---|
| Medical article | arsenic | 0.08 | 4.2 (very rare) | 0.336 | Key topic: reward it |
| Medical article | the | 0.12 | 0.01 (everywhere) | 0.001 | Stopword: penalise it |
| Sports article | touchdown | 0.06 | 3.8 (rare outside sports) | 0.228 | Domain-specific: reward it |
| Sports article | game | 0.10 | 0.5 (somewhat common) | 0.05 | Moderately useful |

TF-IDF Visual — How Scores Change Across Documents

📊 Diagram — TF-IDF Score Logic
[Diagram: bar chart "TF-IDF Score Comparison Across Words and Documents". Y-axis: TF-IDF score (0.0 to 0.3). Bars for "the" (low IDF), "arsenic" (high in D1), "learning" (medium IDF), and "love", shown for Document 1, Document 2, and Document 3.]

Common words ("the") score near zero across all documents. Rare content words score high only in the document where they frequently appear.

TF-IDF in Python

from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
import numpy as np

corpus = [
    "I love machine learning",
    "machine learning is great",
    "I love great food",
]

# Build TF-IDF model
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(corpus)

vocab = tfidf.get_feature_names_out()
df = pd.DataFrame(
    np.round(X.toarray(), 4),
    columns=vocab,
    index=["D1", "D2", "D3"]
)

print(df)

# Inspect IDF scores — higher = rarer word
for word, idf in zip(vocab, tfidf.idf_):
    print(f"  {word:12s}: IDF = {idf:.4f}")
OUTPUT
      food   great      is  learning    love  machine
D1  0.0000  0.0000  0.0000    0.5774  0.5774   0.5774
D2  0.0000  0.4599  0.6047    0.4599  0.0000   0.4599
D3  0.6809  0.5179  0.0000    0.0000  0.5179   0.0000
  food        : IDF = 1.6931
  great       : IDF = 1.2877
  is          : IDF = 1.6931
  learning    : IDF = 1.2877
  love        : IDF = 1.2877
  machine     : IDF = 1.2877
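Key Insight from the Output

"food" and "is" each appear in only 1 document, so they have the highest IDF (1.6931). "great", "love", "machine", and "learning" each appear in 2 documents, so they share a lower IDF (1.2877). These values differ from the textbook formula because scikit-learn uses a smoothed IDF by default, ln((1 + N) / (1 + df)) + 1, so no term ever scores exactly zero. It also uses the raw count as TF and then scales each row to unit L2 length. The effect is the same: words concentrated in a few documents are rewarded, while words spread across all documents are suppressed automatically.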

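The IDF values printed by scikit-learn can be reproduced by hand. The sketch below mirrors the documented defaults of TfidfVectorizer (smooth_idf=True, norm='l2', raw counts as TF). The token lists are typed in manually as an assumption about what the default tokeniser yields (it drops "I"), and the function names are illustrative.

```python
import math

# Tokens per document after dropping the one-letter word "I".
docs = {
    "D1": ["love", "machine", "learning"],
    "D2": ["machine", "learning", "is", "great"],
    "D3": ["love", "great", "food"],
}
vocab = sorted({t for toks in docs.values() for t in toks})
n = len(docs)

def smoothed_idf(term):
    # ln((1 + N) / (1 + df)) + 1, the smooth_idf=True default.
    df = sum(term in toks for toks in docs.values())
    return math.log((1 + n) / (1 + df)) + 1

def tfidf_row(toks):
    # Raw count times smoothed IDF, then scale the row to unit L2 length.
    raw = [toks.count(t) * smoothed_idf(t) for t in vocab]
    norm = math.sqrt(sum(w * w for w in raw))
    return [w / norm for w in raw]

for term in vocab:
    print(f"{term:10s} idf = {smoothed_idf(term):.4f}")

rows = {name: tfidf_row(toks) for name, toks in docs.items()}
for name, row in rows.items():
    print(name, [round(w, 4) for w in row])
```

Because of the L2 step, the squared weights in each row sum to 1. That makes rows comparable regardless of document length, which is convenient for cosine similarity.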
BoW vs TF-IDF — Side by Side

| Property | Bag of Words | TF-IDF |
|---|---|---|
| What it counts | Raw word frequency | Frequency × Rarity score |
| Common words ("the", "is") | Treated equally | Automatically down-weighted |
| Domain-specific rare words | No special treatment | Up-weighted automatically |
| Output values | Integer counts | Float scores (0–1 normalised) |
| Needs stopword list? | Yes, manually | Less critical (IDF suppresses them) |
| Best use case | Naive Bayes, quick baseline | Classification, similarity, search |

Section 04

N-grams — Capturing Context and Word Sequences

The Word Bubble vs. The Phrase Bubble
Suppose you see the words "not" and "good" floating separately in a BoW jar. Looks okay. But what if the original sentence was "not good at all"?

Now imagine a smarter jar that keeps word pairs together: the jar contains "not_good", "good_at", "at_all". Suddenly, the negative sentiment of "not good" is preserved! The model can see the context.

That is an N-gram — a sequence of N consecutive words treated as a single unit. Unigrams are single words (N=1), bigrams are two-word phrases (N=2), and trigrams are three-word phrases (N=3).

N-gram Types — Visual Breakdown

🔤 Sentence: "machine learning is not hard"
1-gram (Unigrams)
machine   learning   is   not   hard
2-gram (Bigrams)
machine learning   learning is   is not   not hard
3-gram (Trigrams)
machine learning is   learning is not   is not hard
1+2 Combined
Use ngram_range=(1,2) in scikit-learn to combine all unigrams and bigrams as features
📊 Diagram — N-gram Sliding Window
[Diagram: "N-gram Extraction: Sliding Window across Tokens". The tokens "machine learning is not hard" are shown in a row; a window moves right one token at a time, highlighting the bigrams "machine learning" and "not hard" and the trigram "machine learning is".]

A window of size N slides one position at a time, extracting overlapping phrases. With 5 words, you get 4 bigrams and 3 trigrams.
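The sliding window is a one-line list comprehension. This is a minimal sketch with a hypothetical helper named `ngrams`; it assumes the text has already been split into tokens.

```python
def ngrams(tokens, n):
    # Slide a window of size n one token at a time and join each window.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "machine learning is not hard".split()
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(bigrams)
print(trigrams)
```

In general a sequence of k tokens yields k − n + 1 n-grams, which gives 4 bigrams and 3 trigrams for the five-word sentence here.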

N-grams in Python

from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

corpus = [
    "machine learning is not hard",
    "deep learning is great and not boring",
    "not all machine learning is deep learning",
]

# ── Unigrams only (standard BoW) ─────────────────────────────
uni = CountVectorizer(ngram_range=(1, 1))
X_uni = uni.fit_transform(corpus)
print("Unigram features:", uni.get_feature_names_out())

# ── Bigrams only ─────────────────────────────────────────────
bi = CountVectorizer(ngram_range=(2, 2))
X_bi = bi.fit_transform(corpus)
print("\nBigram features:", bi.get_feature_names_out())

# ── Unigrams + Bigrams combined ───────────────────────────────
combo = CountVectorizer(ngram_range=(1, 2))
X_combo = combo.fit_transform(corpus)
print("\nCombined (1-2) features:", len(combo.get_feature_names_out()),
      "total features")
print(combo.get_feature_names_out())

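# ── TF-IDF with bigrams (common in practice) ──────────────────
# max_features=20 keeps the 20 terms with the highest corpus frequency;
# get_feature_names_out() lists them alphabetically, not ranked by score.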
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_bi = TfidfVectorizer(ngram_range=(1, 2), max_features=20)
X_tfidf_bi = tfidf_bi.fit_transform(corpus)
print("\nTF-IDF + Bigrams top features:")
print(tfidf_bi.get_feature_names_out())
OUTPUT
Unigram features: ['all' 'and' 'boring' 'deep' 'great' 'hard' 'is' 'learning' 'machine' 'not']

Bigram features: ['all machine' 'and not' 'deep learning' 'great and' 'is deep' 'is great' 'is not' 'learning is' 'machine learning' 'not all' 'not boring' 'not hard']

Combined (1-2) features: 22 total features
['all' 'all machine' 'and' 'and not' 'boring' 'deep' 'deep learning' 'great' 'great and' 'hard' 'is' 'is deep' 'is great' 'is not' 'learning' 'learning is' 'machine' 'machine learning' 'not' 'not all' 'not boring' 'not hard']

TF-IDF + Bigrams top features:
['all' 'all machine' 'and' 'and not' 'boring' 'deep' 'deep learning' 'great' 'great and' 'hard' 'is' 'is deep' 'is great' 'is not' 'learning' 'learning is' 'machine' 'machine learning' 'not' 'not all']
💡
Why Bigrams Capture Sentiment Better

Notice that the bigrams "is not" and "not hard" appear as features. A unigram model sees "not" and "hard" separately, so it might associate "hard" with something negative. The bigram model sees "not hard" as a single unit and can learn that this combination carries a positive or neutral sentiment. This is why TF-IDF + bigrams often outperforms basic BoW on sentiment tasks.


Section 05

Full Pipeline — Topic Classification on 20 Newsgroups

Let's put all three techniques together on a real task: classifying newsgroup posts as hockey or space discussions, using two categories from the 20 Newsgroups dataset. We compare BoW, TF-IDF, and TF-IDF+Bigrams directly.

01
Load & Split Data
Load the two newsgroup categories. Split into 80% training / 20% test. Preserve class balance with stratify.
02
Vectorise Text
Fit CountVectorizer / TfidfVectorizer on the training set only. Transform both train and test (no data leakage).
03
Train Classifier
Logistic Regression works well with sparse TF-IDF matrices. Naive Bayes is excellent with raw BoW counts.
04
Evaluate & Compare
Measure accuracy, F1, and inspect the most important features (words / phrases) for each method.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
import numpy as np

# ── Load a binary subset: rec.sport.hockey vs sci.space ──────
cats = ['rec.sport.hockey', 'sci.space']
data = fetch_20newsgroups(subset='all', categories=cats,
                           remove=('headers', 'footers', 'quotes'))

X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42, stratify=data.target
)

# ── Method 1: Bag of Words ─────────────────────────────────
bow = CountVectorizer(stop_words='english', max_features=10000)
X_bow_tr = bow.fit_transform(X_train)
X_bow_te = bow.transform(X_test)
lr_bow = LogisticRegression(max_iter=500)
lr_bow.fit(X_bow_tr, y_train)
acc_bow = accuracy_score(y_test, lr_bow.predict(X_bow_te))

# ── Method 2: TF-IDF (unigrams) ────────────────────────────
tfidf = TfidfVectorizer(stop_words='english', max_features=10000)
X_tfidf_tr = tfidf.fit_transform(X_train)
X_tfidf_te = tfidf.transform(X_test)
lr_tfidf = LogisticRegression(max_iter=500)
lr_tfidf.fit(X_tfidf_tr, y_train)
acc_tfidf = accuracy_score(y_test, lr_tfidf.predict(X_tfidf_te))

# ── Method 3: TF-IDF + Bigrams ─────────────────────────────
tfidf_bi = TfidfVectorizer(ngram_range=(1, 2), stop_words='english', max_features=10000)
X_bi_tr = tfidf_bi.fit_transform(X_train)
X_bi_te = tfidf_bi.transform(X_test)
lr_bi = LogisticRegression(max_iter=500)
lr_bi.fit(X_bi_tr, y_train)
acc_bi = accuracy_score(y_test, lr_bi.predict(X_bi_te))

# ── Compare ────────────────────────────────────────────────
print(f"BoW Accuracy:         {acc_bow:.4f}")
print(f"TF-IDF Accuracy:      {acc_tfidf:.4f}")
print(f"TF-IDF+Bigrams Acc.:  {acc_bi:.4f}")

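# ── Top TF-IDF features for each class ────────────────────
# A binary LogisticRegression stores a single coefficient row: large positive
# weights favour class 1, large negative weights favour class 0.
feature_names = tfidf_bi.get_feature_names_out()
order = np.argsort(lr_bi.coef_[0])
top_by_class = {0: order[:10], 1: order[-10:][::-1]}
for cls, name in enumerate(data.target_names):
    print(f"\nTop features for '{name}':")
    print([feature_names[i] for i in top_by_class[cls]])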
OUTPUT
BoW Accuracy:         0.9376
TF-IDF Accuracy:      0.9658
TF-IDF+Bigrams Acc.:  0.9712

Top features for 'rec.sport.hockey':
['nhl', 'hockey', 'playoff', 'team', 'season', 'cup', 'stanley cup', 'game', 'players', 'league']

Top features for 'sci.space':
['shuttle', 'orbit', 'nasa', 'moon', 'spacecraft', 'space shuttle', 'launch', 'satellite', 'mars', 'space station']
🏆
Bigrams Reveal Domain Phrases

Look at the top bigram features: "stanley cup" for hockey and "space shuttle" for space. A unigram model never sees these two-word phrases as units: "stanley" alone carries little meaning, and "space" alone could appear in either class. The bigram captures the actual meaning.


Section 06

Head-to-Head — When to Use Which Technique

| Factor | Bag of Words | TF-IDF | N-grams (2-3) |
|---|---|---|---|
| Captures word order? | No | No | Partial (local context) |
| Handles stopwords? | Manual removal needed | Auto-suppressed by IDF | Manual still helpful |
| Vocabulary size | V words | V words | V + V² phrases (much larger) |
| Sentiment analysis | Weak (misses negation) | Moderate | Good (captures "not good") |
| Document classification | Good baseline | Better: highlights key terms | Best with TF-IDF+bigrams |
| Text similarity / search | Acceptable | Excellent (weights rarity) | Excellent for phrase matching |
| Memory/compute cost | Low | Low | High (explodes with N) |
| Naive Bayes compatibility | Excellent | Okay (non-negative needed) | Good |
🏁
Use BoW when…
You need a fast baseline, you're using Multinomial Naive Bayes, your corpus is small and vocabulary is controlled, or you need raw counts (e.g. for topic modeling with LDA).
CountVectorizer(binary=False)
🎯
Use TF-IDF when…
You're doing document classification, search, or text similarity. Your corpus has noisy common words. You want automatic term weighting without manually building stopword lists.
TfidfVectorizer()
🔗
Use N-grams when…
Sentiment analysis (negation), named entity recognition (multi-word names), domain phrases ("machine learning", "neural network"), or any task where word context matters.
ngram_range=(1, 2)
⚠️
The N-gram Vocabulary Explosion Warning

A vocabulary of 10,000 words produces ~10,000 unigrams. Adding bigrams can produce up to 100 million possible bigram combinations. In practice, only a fraction appear in your corpus, but memory can still explode. Always set max_features (e.g. 10,000–50,000) when using bigrams or trigrams in production.
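The gap between the theoretical upper bound and what a corpus actually contains, and the effect of a frequency cap, can be seen on a toy example. This sketch uses a hypothetical helper `unigrams_and_bigrams` and caps features with `Counter.most_common`, which is similar in spirit to max_features but is not scikit-learn's code.

```python
from collections import Counter

vocab_size = 10_000
max_possible_bigrams = vocab_size ** 2  # every ordered pair of words

corpus = [
    "machine learning is not hard",
    "deep learning is great and not boring",
    "not all machine learning is deep learning",
]

def unigrams_and_bigrams(text):
    tokens = text.lower().split()
    pairs = [" ".join(tokens[i:i + 2]) for i in range(len(tokens) - 1)]
    return tokens + pairs

counts = Counter(f for doc in corpus for f in unigrams_and_bigrams(doc))
observed_unigrams = sum(1 for f in counts if " " not in f)
observed_bigrams = sum(1 for f in counts if " " in f)

# Keep only the k most frequent features across the corpus.
k = 8
kept = [feature for feature, _ in counts.most_common(k)]

print(f"upper bound for 10,000 words: {max_possible_bigrams:,} bigrams")
print(f"toy corpus: {observed_unigrams} unigrams, {observed_bigrams} bigrams")
print(kept)
```

The toy corpus has 10 unigrams, so up to 100 bigrams are possible, yet only 12 occur. Real corpora show the same pattern at scale, and the cap then removes the long tail of phrases seen only once.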


Section 07

Shared Limitations — What These Methods Cannot Do

Despite their wide use, all three classical methods share a fundamental blind spot: they represent text as independent word occurrences. They have no understanding of what words actually mean in context.

🔀
No Semantic Understanding
CORE LIMITATION
"car" and "automobile" are completely different features, even though they mean the same thing. "bank" (financial) and "bank" (river) are treated identically. Context and meaning are invisible.
✘ Synonyms not handled · Homonyms not disambiguated
📐
Fixed Vocabulary
CORE LIMITATION
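Words not seen during training are silently ignored at inference time. New words, typos, and unfamiliar domain jargon contribute nothing to the vector, and a document made up entirely of unseen words becomes an all-zero vector that is completely uninformative to the model.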
✘ OOV (out-of-vocabulary) problem
📏
High Dimensionality
CORE LIMITATION
Large corpora produce vectors with tens of thousands of dimensions. Most are zero. This sparsity makes distance metrics unreliable and can slow down training on smaller datasets significantly.
✘ Curse of dimensionality in small datasets
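The synonym and fixed-vocabulary limitations can both be demonstrated with a tiny count vectoriser. This is a sketch with made-up training sentences and illustrative helper names (`vectorize`, `cosine`); it is not tied to any library.

```python
import math

train_docs = [
    "the car is fast",
    "the automobile is fast",
    "the bank approved the loan",
    "we sat on the river bank",
]
vocabulary = sorted({w for doc in train_docs for w in doc.split()})

def vectorize(text):
    # Words outside the fitted vocabulary are silently dropped.
    words = text.split()
    return [words.count(v) for v in vocabulary]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Synonyms occupy different columns, so their similarity is zero.
print(cosine(vectorize("car"), vectorize("automobile")))  # 0.0

# Both senses of "bank" land in the same single column.
print(vocabulary.count("bank"))  # 1

# Out-of-vocabulary input: every word is unseen, so the vector is all zeros.
print(vectorize("blockchain tokenomics"))
```

Dense embeddings address the first two cases by placing related words near each other in vector space; subword methods such as FastText address the third.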
🔵
What Comes After These Classical Methods?

These limitations motivated the rise of word embeddings (Word2Vec, GloVe, FastText), which represent words as dense low-dimensional vectors where "king − man + woman ≈ queen". Beyond those, transformer models (BERT, GPT, RoBERTa) learn contextual representations: the same word gets a different vector depending on the sentence it appears in. But for many real tasks (spam filtering, topic classification, search), TF-IDF and bigrams remain competitive, fast, and explainable, and they are usually a sensible starting point.


Section 08

Golden Rules — Non-Negotiable Practices

📋 Text Representation — Non-Negotiable Rules
1
Always fit on training data only. Call fit_transform(X_train) and transform(X_test) — never fit_transform on the test set. Fitting on test data leaks vocabulary and IDF scores, inflating your accuracy metrics.
2
Remove stopwords before vectorising for classification. Use stop_words='english' in scikit-learn, or supply a custom list for domain-specific corpora. Stopwords waste features and add noise — but keep them for language modelling tasks where function words matter.
3
Set max_features when using N-grams. Bigrams and trigrams explode the vocabulary size. Capping at 10,000–100,000 prevents memory issues while retaining the most informative phrases. Start with 20,000 and tune up if performance keeps improving.
4
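Lowercase and normalise before vectorising. "Machine" and "machine" are different tokens unless the text is lowercased. scikit-learn's vectorisers do this by default (lowercase=True), but a custom preprocessor or analyzer bypasses it, so lowercase explicitly in that case. Consider stemming (Porter) or lemmatisation (spaCy/NLTK WordNetLemmatizer) so "running" and "runs" map to the same feature.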
5
Use TF-IDF as your default, not BoW. TF-IDF often outperforms raw counts on classification tasks. BoW is the right choice specifically for Multinomial Naive Bayes (which expects counts) or for LDA topic modelling (which uses term counts directly).
6
Start with ngram_range=(1, 2), never jump to trigrams first. Bigrams usually capture most of the benefit. Trigrams massively increase vocabulary size for diminishing returns. Benchmark before adding them, because the compute cost is real.
7
Inspect the top features to validate your vectoriser. After fitting, print the top 20 features by TF-IDF weight. If you see stopwords ("the", "and", "is") dominating — your preprocessing is wrong. If you see domain terms ("refund", "broken", "excellent") — you're on track.
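Rules 1, 2, and 7 can be combined in one small sketch. Everything here is illustrative: the review sentences, the three-word stopword set, and the helper names are made up, and the IDF uses the smoothed form ln((1 + N) / (1 + df)) + 1 as an assumption.

```python
import math
from collections import Counter

train_docs = [
    "refund requested the screen arrived broken",
    "excellent phone and the battery is excellent",
    "the charger is broken and the refund is pending",
]
test_docs = ["the speaker is broken and crackles"]

STOPWORDS = {"the", "and", "is"}  # tiny illustrative list (rule 2)

def tokenize(text):
    return [w for w in text.lower().split() if w not in STOPWORDS]

# Rule 1: vocabulary and document frequencies come from training data only.
train_tokens = [tokenize(d) for d in train_docs]
df = Counter(t for toks in train_tokens for t in set(toks))
n_train = len(train_docs)
idf = {t: math.log((1 + n_train) / (1 + c)) + 1 for t, c in df.items()}

def transform(text):
    # Terms unseen during fitting are skipped rather than added.
    counts = Counter(t for t in tokenize(text) if t in idf)
    return {t: c * idf[t] for t, c in counts.items()}

test_vectors = [transform(d) for d in test_docs]

# Rule 7: rank training features by total weight and read the top of the list.
totals = Counter()
for doc in train_docs:
    totals.update(transform(doc))
top_features = [t for t, _ in totals.most_common(5)]

print(top_features)
print(test_vectors[0])
```

The test document keeps only "broken", because "speaker" and "crackles" were never seen during fitting. If the top-feature list had been dominated by "the" or "is", that would signal a preprocessing problem.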