The Story That Explains It All
Now imagine you could train a machine to read all of them, instantly sort them by tone (positive, negative, neutral), and flag the angriest ones for immediate customer service action. That's Sentiment Analysis — a type of Text Classification.
Text Classification is simply the task of assigning a predefined category to a piece of text. Sentiment Analysis is the most famous subtype — classifying text by the emotion or opinion it expresses.
In this tutorial, you will build a complete understanding of how machines read and classify text — from the vocabulary they use, to the models they train, to the evaluation they need. Every concept comes with a story and working Python code.
Text preprocessing and feature extraction · Bag of Words and TF-IDF · Naïve Bayes, Logistic Regression, and SVM classifiers · Transformer-based sentiment with HuggingFace · Evaluation metrics (Accuracy, F1, Confusion Matrix) · A full end-to-end pipeline on real movie review data.
What Is Text Classification?
Text Classification is a supervised machine learning task. You give the model labelled text examples, it learns patterns, and then it predicts the label of new, unseen text.
Step 1 — Text Preprocessing
Every text classification project begins with the same set of preprocessing steps. Here is what each step does and why it matters.
# ── Text Preprocessing Pipeline ────────────────────────────
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
nltk.download('stopwords', quiet=True)
nltk.download('wordnet', quiet=True)
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))
def preprocess(text: str) -> str:
    text = text.lower()                                  # Step 1: lowercase
    text = re.sub(r'[^a-z\s]', '', text)                 # Step 2: remove punctuation and digits
    tokens = text.split()                                # Step 3: tokenise
    tokens = [t for t in tokens if t not in stop_words]  # Step 4: stop words
    tokens = [lemmatizer.lemmatize(t) for t in tokens]   # Step 5: lemmatise
    return ' '.join(tokens)
# Example
raw = "This was absolutely the WORST movie I've ever seen!! Terrible acting!!"
print(preprocess(raw))
For tasks like negation detection, removing stop words can destroy meaning. "not bad" → removing "not" leaves "bad" — the exact opposite sentiment. For transformer models (BERT, RoBERTa), skip preprocessing entirely — they handle raw text natively and benefit from every word including stop words.
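One common middle ground is to remove stop words but keep the negations. The sketch below illustrates the idea in plain Python. The stop-word set is a tiny hand-written stand-in (NLTK's English list is much longer and also contains "not", "no", and "nor"), and `NEGATIONS` and `remove_stop_words` are illustrative names, not part of any library.

```python
# Tiny illustrative stop-word list (an assumption, not NLTK's real list).
STOP_WORDS = {"the", "was", "is", "a", "an", "it", "not", "no", "nor", "but"}

# Negation words to keep even though they appear in the stop-word list.
NEGATIONS = {"not", "no", "nor", "never"}


def remove_stop_words(text: str, keep_negations: bool = True) -> str:
    """Drop stop words from lowercased, whitespace-tokenised text."""
    blocked = STOP_WORDS - NEGATIONS if keep_negations else STOP_WORDS
    return " ".join(tok for tok in text.lower().split() if tok not in blocked)


print(remove_stop_words("The movie was not bad", keep_negations=False))  # movie bad
print(remove_stop_words("The movie was not bad", keep_negations=True))   # movie not bad
```

With the real NLTK list, the same effect could be had by subtracting a negation set from `stop_words` before filtering.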
Step 2 — Feature Extraction: Turning Words into Numbers
Machine learning models cannot read text — they need numbers. Feature extraction is the process of converting clean text into a numerical representation. There are three major approaches, each more powerful (and more complex) than the last.
This is Bag of Words — every document becomes a 10,000-dimensional vector of word counts. Simple, effective, and still used in production today.
Method 1 — Bag of Words (BoW)
Count how many times each word appears in a document. The entire vocabulary becomes the feature space. Word order is discarded — the "bag" metaphor is intentional.
from sklearn.feature_extraction.text import CountVectorizer
corpus = [
"the movie was great and the acting was superb",
"terrible film boring plot and bad acting",
"great film superb plot loved every moment",
]
vectorizer = CountVectorizer()
X_bow = vectorizer.fit_transform(corpus)
print("Vocabulary:", vectorizer.vocabulary_)
print("Shape:", X_bow.toarray().shape) # (3 docs, N unique words)
print(X_bow.toarray())
Method 2 — TF-IDF (Term Frequency–Inverse Document Frequency)
BoW treats all words equally. But the word "movie" appears in almost every review — it tells you nothing distinctive. TF-IDF penalises words that appear in many documents and rewards words that appear frequently in this document but rarely elsewhere.
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(max_features=5000, ngram_range=(1, 2))
# ngram_range=(1, 2) captures single words (unigrams) AND 2-word phrases (bigrams)
# e.g. "not good" is a bigram, much more informative than just "good"
X_tfidf = tfidf.fit_transform(corpus)
print(f"TF-IDF shape: {X_tfidf.shape}")
# See which words got the highest IDF scores (most distinctive)
import numpy as np
feature_names = tfidf.get_feature_names_out()
idf_scores = tfidf.idf_
top_idx = np.argsort(idf_scores)[-8:]
for i in top_idx:
    print(f" {feature_names[i]:20s} IDF={idf_scores[i]:.3f}")
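To see where the IDF numbers printed above come from, here is a small worked sketch in plain Python. It assumes scikit-learn's default smoothed formula, idf(t) = ln((1 + n) / (1 + df(t))) + 1, and reuses the three-document corpus from the Bag of Words example. The helper name `smoothed_idf` is illustrative.

```python
import math

corpus = [
    "the movie was great and the acting was superb",
    "terrible film boring plot and bad acting",
    "great film superb plot loved every moment",
]


def smoothed_idf(term: str, docs: list) -> float:
    """Smoothed IDF: ln((1 + n_docs) / (1 + document_frequency)) + 1."""
    n_docs = len(docs)
    df = sum(1 for doc in docs if term in doc.split())
    return math.log((1 + n_docs) / (1 + df)) + 1


# "movie" appears in 1 of 3 documents, "great" in 2 of 3.
for term in ("movie", "great"):
    print(f"{term:6s} IDF={smoothed_idf(term, corpus):.3f}")
# movie  IDF=1.693
# great  IDF=1.288
```

The rarer word gets the higher score, which is exactly the "reward distinctive words" behaviour described earlier.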
Despite being decades old, TF-IDF with logistic regression or SVM still competes surprisingly well with deep learning on short, domain-specific text like reviews or support tickets. It's fast to train, easy to interpret, and runs on a laptop. Always build a TF-IDF baseline before reaching for transformers.
Step 3 — Training Classifiers
With numerical features in hand, you can now train a classification model. We will cover three classifiers that are the standard toolkit for text classification, each with a story to make the intuition stick.
Classifier 1 — Naïve Bayes
Naïve Bayes for text works the same way. It asks: "Given this is a positive review, how likely is the word 'excellent' to appear? How likely is 'boring'?" It multiplies all those probabilities (treating each word as independent) and picks the class with the highest result.
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.datasets import fetch_20newsgroups
# Using 20 Newsgroups — a real text classification benchmark
cats = ['rec.sport.hockey', 'sci.space', 'talk.politics.guns']
data = fetch_20newsgroups(subset='all', categories=cats, remove=('headers', 'footers', 'quotes'))
X_train, X_test, y_train, y_test = train_test_split(
data.data, data.target, test_size=0.25, random_state=42
)
# Pipeline: TF-IDF → Naïve Bayes
nb_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=10000, stop_words='english')),
    ('clf', MultinomialNB(alpha=0.1)),  # alpha = additive smoothing (alpha=1.0 is classic Laplace)
])
nb_pipeline.fit(X_train, y_train)
y_pred = nb_pipeline.predict(X_test)
print(classification_report(y_test, y_pred, target_names=data.target_names))
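The arithmetic behind Naïve Bayes can be sketched by hand. The example below is a minimal multinomial Naïve Bayes in plain Python with Laplace smoothing, trained on a made-up four-document corpus. The data and the names `log_posterior` and `predict` are illustrative assumptions; it is meant to show the calculation, not to replace `MultinomialNB`.

```python
import math
from collections import Counter

# Hypothetical toy training set: (tokens, label)
train = [
    ("excellent acting excellent plot".split(), "pos"),
    ("loved the plot".split(), "pos"),
    ("boring plot".split(), "neg"),
    ("boring acting terrible film".split(), "neg"),
]

vocab = {tok for tokens, _ in train for tok in tokens}
labels = sorted({label for _, label in train})
word_counts = {c: Counter() for c in labels}
doc_counts = Counter()
for tokens, label in train:
    word_counts[label].update(tokens)
    doc_counts[label] += 1


def log_posterior(tokens, label, alpha=1.0):
    """Unnormalised log P(label | tokens) with additive smoothing."""
    score = math.log(doc_counts[label] / len(train))  # log prior
    total = sum(word_counts[label].values())
    for tok in tokens:
        if tok not in vocab:
            continue  # ignore words never seen in training
        likelihood = (word_counts[label][tok] + alpha) / (total + alpha * len(vocab))
        score += math.log(likelihood)  # adding logs = multiplying probabilities
    return score


def predict(tokens):
    return max(labels, key=lambda c: log_posterior(tokens, c))


print(predict("excellent plot".split()))  # pos
print(predict("boring film".split()))     # neg
```

Working in log space avoids the numerical underflow that multiplying many small probabilities would cause.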
Classifier 2 — Logistic Regression
Logistic Regression does the same. Every word in your vocabulary gets a learned weight — the word "outstanding" gets a high positive weight for positive sentiment, "dreadful" gets a large negative weight. The model sums all active word weights, passes through a sigmoid function, and outputs a probability between 0 and 1.
from sklearn.linear_model import LogisticRegression
lr_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(
        max_features=15000,
        ngram_range=(1, 2),    # unigrams + bigrams
        stop_words='english',
        sublinear_tf=True      # use 1 + log(TF): reduces impact of very frequent terms
    )),
    ('clf', LogisticRegression(
        C=5.0,                 # inverse regularisation: higher = less regularised
        max_iter=1000,
        solver='lbfgs'         # fits a multinomial (softmax) model for 3+ classes by default
        # multi_class is omitted: the argument is deprecated in recent scikit-learn releases
    )),
])
lr_pipeline.fit(X_train, y_train)
y_pred_lr = lr_pipeline.predict(X_test)
print(classification_report(y_test, y_pred_lr, target_names=data.target_names))
# Inspect model weights — which words drive each class?
tfidf_step = lr_pipeline.named_steps['tfidf']
clf_step = lr_pipeline.named_steps['clf']
feature_names = tfidf_step.get_feature_names_out()
for i, cls in enumerate(data.target_names):
    top5 = feature_names[clf_step.coef_[i].argsort()[-5:]]  # 5 highest-weighted features for this class
    print(f"{cls}: top words = {list(top5)}")
For most text classification tasks with TF-IDF features, Logistic Regression beats Naïve Bayes and rivals SVM — and it's fully interpretable. You can literally read the learned weights and understand why the model made a decision. This matters enormously in production when you need to explain your model to stakeholders.
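The "sum the weights, then squash with a sigmoid" step can be written out in a few lines. The sketch below covers the binary case only (the three-class newsgroups model above uses a softmax instead), and the weights and bias are made-up values for illustration, not the output of a real fit.

```python
import math

# Hypothetical learned weights for a binary sentiment model (not from a real fit).
weights = {"outstanding": 2.1, "dreadful": -2.4, "plot": 0.1}
bias = -0.2


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def positive_probability(features: dict) -> float:
    """features maps each word to its feature value (e.g. its TF-IDF score)."""
    z = bias + sum(weights.get(word, 0.0) * value for word, value in features.items())
    return sigmoid(z)


print(f"{positive_probability({'outstanding': 1.0, 'plot': 1.0}):.3f}")  # 0.881
print(f"{positive_probability({'dreadful': 1.0, 'plot': 1.0}):.3f}")     # 0.076
```

Because each word contributes one additive term to `z`, the per-word weights can be read directly as evidence for or against the positive class.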
Classifier 3 — Support Vector Machine (SVM)
SVM finds exactly that — the decision boundary with the largest possible margin between the two classes. The data points closest to the boundary are the support vectors, the only points that actually matter.
from sklearn.svm import LinearSVC
from sklearn.calibration import CalibratedClassifierCV
svm_pipeline = Pipeline([
('tfidf', TfidfVectorizer(
max_features=20000,
ngram_range=(1, 2),
sublinear_tf=True
)),
('clf', CalibratedClassifierCV( # wraps LinearSVC to give predict_proba
LinearSVC(C=1.0, max_iter=2000)
)),
])
svm_pipeline.fit(X_train, y_train)
y_pred_svm = svm_pipeline.predict(X_test)
print(classification_report(y_test, y_pred_svm, target_names=data.target_names))
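To make the margin idea concrete, here is a minimal sketch in plain Python. The four 2-D points and the boundary `w`, `b` are hand-picked for illustration (for these points, this boundary happens to be the maximum-margin separator). In a real model, `LinearSVC` learns `w` and `b` from the TF-IDF features.

```python
import math

# Hypothetical linear boundary w·x + b = 0 and labelled points (label is -1 or +1).
w = (1.0, 1.0)
b = -3.0
points = [((1.0, 1.0), -1), ((0.0, 0.5), -1), ((2.0, 2.0), +1), ((4.0, 3.0), +1)]


def functional_margin(x, y):
    """y * (w·x + b): positive when the point is on the correct side."""
    return y * (w[0] * x[0] + w[1] * x[1] + b)


def hinge_loss(x, y):
    """Zero once a point is on the correct side with margin at least 1."""
    return max(0.0, 1.0 - functional_margin(x, y))


# Support vectors sit exactly on the margin, where y * (w·x + b) == 1.
support_vectors = [x for x, y in points if abs(functional_margin(x, y) - 1.0) < 1e-9]
margin_width = 2.0 / math.hypot(*w)

print("support vectors:", support_vectors)   # [(1.0, 1.0), (2.0, 2.0)]
print(f"margin width: {margin_width:.3f}")   # 1.414
print("total hinge loss:", sum(hinge_loss(x, y) for x, y in points))  # 0.0
```

Moving either of the two far-away points would not change the boundary, which is what "only the support vectors matter" means in practice.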
Comparing All Three Classifiers
| Property | Naïve Bayes | Logistic Regression | LinearSVC |
|---|---|---|---|
| Training speed | Fastest — closed form | Fast with lbfgs | Fast with LinearSVC |
| Accuracy (typical) | Good | Very Good | Very Good |
| Interpretability | High — log probabilities | High — feature weights | Medium — SV weights |
| Handles imbalance | Poor — prior skews results | Yes — class_weight='balanced' | Yes — class_weight='balanced' |
| Outputs probabilities | Yes (native) | Yes (native) | Needs CalibratedClassifierCV |
| Best for | Quick baseline, very small data | General purpose, interpretable production | High-dimensional text, max accuracy |
Sentiment Analysis Deep Dive — The IMDB Dataset
Now let us build a complete end-to-end sentiment analysis pipeline using the famous IMDB movie review dataset: 50,000 reviews, perfectly balanced between positive and negative.
import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
import re
# ── Load the IMDB dataset ─────────────────────────────────
# Download from: https://ai.stanford.edu/~amaas/data/sentiment/
# Or load directly from Hugging Face datasets:
# from datasets import load_dataset
# dataset = load_dataset("imdb")
# For this demo, load a CSV with columns [review, sentiment]
df = pd.read_csv("imdb_reviews.csv")
# ── Preprocessing function ────────────────────────────────
def clean_text(text):
    text = re.sub(r'<[^>]+>', ' ', text)      # strip HTML tags
    text = re.sub(r'[^a-zA-Z\s]', ' ', text)  # remove non-alpha
    text = text.lower().strip()
    text = re.sub(r'\s+', ' ', text)          # collapse whitespace
    return text
df['clean'] = df['review'].apply(clean_text)
df['label'] = (df['sentiment'] == 'positive').astype(int)
X_train, X_test, y_train, y_test = train_test_split(
df['clean'], df['label'], test_size=0.20, random_state=42, stratify=df['label']
)
# ── Best-practice pipeline ────────────────────────────────
pipeline = Pipeline([
('tfidf', TfidfVectorizer(
max_features=30000,
ngram_range=(1, 3), # up to trigrams: "not at all"
sublinear_tf=True,
min_df=3, # ignore words appearing in fewer than 3 docs
max_df=0.95 # ignore words appearing in 95%+ of docs
)),
('clf', LogisticRegression(C=4.0, max_iter=2000, solver='lbfgs')),
])
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
y_proba = pipeline.predict_proba(X_test)[:, 1]
print(classification_report(y_test, y_pred, target_names=['Negative', 'Positive']))
print(f"ROC-AUC: {roc_auc_score(y_test, y_proba):.4f}")
This pipeline uses no neural network, no GPU, trains in under 30 seconds on a laptop, and achieves 91% accuracy on one of the most studied NLP benchmarks. Always establish this classical baseline before spending compute on transformers. The gap is often smaller than you expect.
Confusion Matrix — Reading Between the Predictions
Accuracy alone is dangerous. A model that labels every review "positive" scores 50% accuracy on a balanced dataset, and on an imbalanced dataset the same trick can look respectable while being useless. The confusion matrix exposes exactly where and how your model fails.
Always-positive model: Accuracy = 50% · F1 (Negative) = 0.00 · Useless.
TF-IDF + Logistic Regression pipeline: Accuracy = 91% · F1 = 0.91 · Balanced and reliable.
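The always-positive case can be checked by hand. The sketch below counts the four confusion-matrix cells in plain Python for a made-up balanced set of 100 labels. The helper names are illustrative, and in practice `sklearn.metrics.confusion_matrix` (imported earlier) gives the same counts.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn), treating `positive` as the class of interest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return tp, fp, fn, tn


def f1_from_counts(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical balanced test set and a model that always answers "positive" (1).
y_true = [1] * 50 + [0] * 50
y_pred = [1] * 100

tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive=1)
accuracy = (tp + tn) / len(y_true)
f1_pos = f1_from_counts(tp, fp, fn)

# Same predictions, scored from the point of view of the negative class.
tp_n, fp_n, fn_n, _ = confusion_counts(y_true, y_pred, positive=0)
f1_neg = f1_from_counts(tp_n, fp_n, fn_n)

print(f"accuracy={accuracy:.2f}  F1(pos)={f1_pos:.2f}  F1(neg)={f1_neg:.2f}")
# accuracy=0.50  F1(pos)=0.67  F1(neg)=0.00
```

The per-class F1 makes the failure obvious: the model never finds a single negative review.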
Transformer-Based Sentiment — Enter BERT
BERT (Bidirectional Encoder Representations from Transformers) reads every word in the context of all surrounding words simultaneously. It was pre-trained on 3.3 billion words of text from books and Wikipedia, so it has seen phrases like "not bad" many times in context and handles them far better than a bag of words can. Fine-tuning it on your sentiment data is quick on a GPU and typically yields 93–96% accuracy on IMDB.
from transformers import pipeline as hf_pipeline
# Off-the-shelf sentiment with a model already fine-tuned on SST-2
# No training required: just load and infer
sentiment_pipe = hf_pipeline(
"text-classification",
model="distilbert-base-uncased-finetuned-sst-2-english",
truncation=True,
max_length=512
)
reviews = [
"Absolutely stunning. Every scene was a work of art.",
"Not bad for a low-budget film, though the pacing drags.",
"A complete disaster. Two hours I will never get back.",
"The film wasn't as terrible as I expected it to be.", # tricky negation
]
results = sentiment_pipe(reviews)
for review, result in zip(reviews, results):
    label = result['label']
    score = result['score']
    print(f"[{label:8s} {score:.2f}] {review[:60]}...")
Transformers win when: your text is long and context-dependent, you have sufficient compute (GPU), or accuracy above 92% is required. Classical ML wins when: you need explainability, your data is small (< 10k examples), you need sub-second inference on CPU, or you must interpret feature weights for compliance or auditing. In most real production systems, classical TF-IDF + LR is still the first choice.
Fine-Tuning BERT on Your Own Data
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments
from datasets import Dataset
import torch
model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
def tokenize_fn(batch):
    # Calling the tokenizer directly is the idiomatic form of tokenizer.__call__(...)
    return tokenizer(
        batch['text'], truncation=True,
        padding='max_length', max_length=256
    )
# Build HuggingFace Dataset from pandas DataFrames
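# preserve_index=False keeps the shuffled pandas index out of the dataset columns
train_ds = Dataset.from_pandas(pd.DataFrame({'text': X_train, 'label': y_train}), preserve_index=False)
test_ds = Dataset.from_pandas(pd.DataFrame({'text': X_test, 'label': y_test}), preserve_index=False)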
train_ds = train_ds.map(tokenize_fn, batched=True)
test_ds = test_ds.map(tokenize_fn, batched=True)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
training_args = TrainingArguments(
    output_dir="./bert-sentiment",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    eval_strategy="epoch",          # named evaluation_strategy in transformers releases before 4.41
    save_strategy="epoch",
    load_best_model_at_end=True,
    learning_rate=2e-5,
    weight_decay=0.01,
    warmup_ratio=0.1,
    logging_steps=50,
)
trainer = Trainer(
model=model,
args=training_args,
train_dataset=train_ds,
eval_dataset=test_ds,
)
trainer.train()
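As written, the `Trainer` above reports evaluation loss but no accuracy or F1. A minimal sketch of a metrics function that could be passed through the `compute_metrics` argument of `Trainer` is shown below. The function body and the fake logits are illustrative assumptions, it uses only NumPy, and it handles only the binary case with label 1 as the positive class.

```python
import numpy as np


def compute_metrics(eval_pred):
    """Accuracy and positive-class F1 from (logits, labels)."""
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    labels = np.asarray(labels)
    accuracy = float((preds == labels).mean())
    tp = int(((preds == 1) & (labels == 1)).sum())
    fp = int(((preds == 1) & (labels == 0)).sum())
    fn = int(((preds == 0) & (labels == 1)).sum())
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return {"accuracy": accuracy, "f1": f1}


# Quick check with made-up logits for four examples.
fake_logits = np.array([[2.0, -1.0], [0.1, 0.9], [1.5, 0.2], [-0.3, 0.4]])
fake_labels = np.array([0, 1, 1, 1])
print(compute_metrics((fake_logits, fake_labels)))  # {'accuracy': 0.75, 'f1': 0.8}
```

To use it, add `compute_metrics=compute_metrics` to the `Trainer(...)` call. The returned values then appear alongside the loss at each evaluation.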
Handling Imbalanced Classes
In real-world text classification, classes are rarely balanced. Spam might be 2% of email, hate speech 1% of social posts, fraudulent reviews 5% of the total. At a 2% spam rate, a model that predicts "not spam" for every email is 98% accurate and completely useless.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, roc_auc_score
# Always use class_weight='balanced' when classes are unequal
balanced_pipeline = Pipeline([
('tfidf', TfidfVectorizer(max_features=15000, sublinear_tf=True)),
('clf', LogisticRegression(
C=3.0,
max_iter=1000,
class_weight='balanced' # ← the one change that matters
)),
])
balanced_pipeline.fit(X_train, y_train)
y_pred = balanced_pipeline.predict(X_test)
macro_f1 = f1_score(y_test, y_pred, average='macro')
print(f"Macro F1: {macro_f1:.4f}") # use macro F1 for imbalanced tasks
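It helps to see what `class_weight='balanced'` actually computes. scikit-learn documents the weight for each class as `n_samples / (n_classes * count_of_that_class)`. The sketch below reproduces that arithmetic in plain Python for a made-up 98/2 split. The helper name `balanced_class_weights` is illustrative.

```python
from collections import Counter


def balanced_class_weights(labels):
    """n_samples / (n_classes * class_count) for each class."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical labels: 98 "not spam" (0) and 2 "spam" (1).
labels = [0] * 98 + [1] * 2
weights = balanced_class_weights(labels)
print({cls: round(w, 4) for cls, w in weights.items()})  # {0: 0.5102, 1: 25.0}
```

Each spam example counts about 49 times as much as a non-spam example in the loss, so the model can no longer ignore the minority class for free.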
Predicting New Text — Live Inference
# ── Inference on new, unseen text ────────────────────────────
def predict_sentiment(texts, pipeline, label_map=None):
    if isinstance(texts, str):
        texts = [texts]
    cleaned = [clean_text(t) for t in texts]
    preds = pipeline.predict(cleaned)
    probas = pipeline.predict_proba(cleaned)
    if label_map is None:
        label_map = {0: "NEGATIVE", 1: "POSITIVE"}
    for text, pred, proba in zip(texts, preds, probas):
        label = label_map[pred]
        confidence = proba[pred]
        print(f"[{label} {confidence:.2%}] {text[:70]}")
test_reviews = [
"One of the greatest films ever made. Pure cinematic poetry.",
"I walked out after 20 minutes. Absolute rubbish.",
"It was fine. Nothing special but not terrible either.",
"The cinematography is stunning but the script is painfully weak.",
]
predict_sentiment(test_reviews, pipeline)
Notice that the third and fourth reviews score around 60%, close to the 50% decision boundary. In production, never silently trust uncertain predictions: send anything below 70% confidence to human review or a secondary model, or report it as "NEUTRAL / MIXED" rather than forcing a binary label.
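A confidence threshold like this takes only a few lines. The sketch below works on the positive-class probability (for example, column 1 of `predict_proba`). The function name, the "NEEDS_REVIEW" label, and the sample probabilities are illustrative assumptions.

```python
def label_with_threshold(proba_positive: float, threshold: float = 0.70) -> str:
    """Return a label only when the model is confident enough; otherwise defer."""
    confidence = max(proba_positive, 1.0 - proba_positive)
    if confidence < threshold:
        return "NEEDS_REVIEW"
    return "POSITIVE" if proba_positive >= 0.5 else "NEGATIVE"


# Hypothetical positive-class probabilities for three reviews.
for p in (0.97, 0.58, 0.12):
    print(f"{p:.2f} -> {label_with_threshold(p)}")
# 0.97 -> POSITIVE
# 0.58 -> NEEDS_REVIEW
# 0.12 -> NEGATIVE
```

The 0.70 cut-off is a starting point, not a rule. It is worth tuning on held-out data against how much human review capacity you have.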
Multi-Class Classification — Beyond Binary
Real sentiment is not just positive or negative. Customers leave 1-to-5-star reviews. Tweets express joy, anger, sadness, fear, surprise. Here is how to handle more than two classes.
from sklearn.preprocessing import LabelEncoder
# Assumes a 5-class star-rating dataset has already been split into
# X_train_multi, X_test_multi, y_train_multi, y_test_multi (classes: 1-star to 5-star)
multiclass_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(
        max_features=20000,
        ngram_range=(1, 2),
        sublinear_tf=True
    )),
    ('clf', LogisticRegression(
        C=5.0,
        max_iter=2000,
        # the default lbfgs solver fits a multinomial (softmax) model for 3+ classes;
        # the multi_class argument is deprecated in recent scikit-learn releases
        class_weight='balanced'
    )),
])
multiclass_pipeline.fit(X_train_multi, y_train_multi)
y_pred_multi = multiclass_pipeline.predict(X_test_multi)
print(classification_report(
y_test_multi, y_pred_multi,
target_names=['1-star', '2-star', '3-star', '4-star', '5-star']
))
Mid-range reviews are inherently ambiguous: they contain both positive and negative language in roughly equal measure, and humans often disagree on them too. This is a fundamental data problem, not a model problem. One fix: collapse to 3 classes (negative = 1–2 stars, neutral = 3 stars, positive = 4–5 stars), which usually raises F1 noticeably.
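The collapse to three classes is a simple label mapping applied before training. A minimal sketch follows, assuming the star ratings are integers from 1 to 5. The function name `collapse_stars` is illustrative.

```python
def collapse_stars(stars: int) -> str:
    """Map a 1-5 star rating onto negative / neutral / positive."""
    if stars not in (1, 2, 3, 4, 5):
        raise ValueError(f"expected a rating from 1 to 5, got {stars!r}")
    if stars <= 2:
        return "negative"
    if stars == 3:
        return "neutral"
    return "positive"


print([collapse_stars(s) for s in (1, 2, 3, 4, 5)])
# ['negative', 'negative', 'neutral', 'positive', 'positive']
```

Applied to the training and test labels, the rest of the pipeline stays unchanged. Only `target_names` in the report needs updating.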
Golden Rules for Text Classification
Use n-grams (ngram_range=(1,3)) in your TF-IDF.
"Not good", "highly recommend", "not at all": these phrases mean something different from their component words taken one at a time. Single-word BoW misses every one of them.
Use sublinear_tf=True in TfidfVectorizer.
This applies 1 + log(tf) instead of raw TF, reducing the outsized weight of very frequent words within a single document. It often improves performance.
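The effect of sublinear scaling is easy to see with a few numbers. scikit-learn documents the transformation as replacing tf with 1 + log(tf), using the natural logarithm. The sketch below applies it to three raw counts. The helper name `sublinear_tf` is illustrative.

```python
import math


def sublinear_tf(tf: int) -> float:
    """1 + ln(tf) for positive counts; zero counts stay zero."""
    return 1.0 + math.log(tf) if tf > 0 else 0.0


for tf in (1, 10, 100):
    print(f"raw tf={tf:3d} -> scaled={sublinear_tf(tf):.2f}")
# raw tf=  1 -> scaled=1.00
# raw tf= 10 -> scaled=3.30
# raw tf=100 -> scaled=5.61
```

A word repeated 100 times ends up weighted under six times as much as a word used once, rather than a hundred times as much.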