Text Classification & Sentiment Analysis with Python

Section 01

The Story That Explains It All

📖 Real World Analogy

The Amazon Returns Desk — Where Machines Learned to Read Feelings

Imagine you manage a giant online store. Every day, thousands of customer reviews pour in — "This blender is incredible!", "Absolute garbage, broke in a week.", "It's okay I guess." Reading each one yourself would take an army of humans.

Now imagine you could train a machine to read all of them, instantly sort them by tone (positive, negative, neutral), and flag the angriest ones for immediate customer service action. That's Sentiment Analysis — a type of Text Classification.

Text Classification is simply the task of assigning a predefined category to a piece of text. Sentiment Analysis is the most famous subtype — classifying text by the emotion or opinion it expresses.

In this tutorial, you will build a complete understanding of how machines read and classify text — from the vocabulary they use, to the models they train, to the evaluation they need. Every concept comes with a story and working Python code.

🔎

What You Will Learn

Text preprocessing and feature extraction · Bag of Words and TF-IDF · Naïve Bayes, Logistic Regression, and SVM classifiers · Transformer-based sentiment with HuggingFace · Evaluation metrics (Accuracy, F1, Confusion Matrix) · A full end-to-end pipeline on real movie review data.

Section 02

What Is Text Classification?

Text Classification is a supervised machine learning task. You give the model labelled text examples, it learns patterns, and then it predicts the label of new, unseen text.

📄 The Classification Pipeline at a Glance

INPUT

Raw text — "This movie was absolutely breathtaking. A masterpiece!"

CLEAN

Remove punctuation, lowercase, strip stop words → "movie absolutely breathtaking masterpiece"

ENCODE

Convert text to numbers (Bag of Words, TF-IDF, or embeddings)

MODEL

Train a classifier (Naïve Bayes, Logistic Regression, Transformer…)

OUTPUT

Predicted label → POSITIVE with confidence 0.97

💬

Sentiment Analysis

Positive / Negative / Neutral

Classifies the opinion or emotion in a piece of text. Used for product reviews, social media monitoring, brand tracking.

🚨

Spam Detection

Spam / Ham

Determines whether an email or message is unwanted junk. One of the earliest large-scale NLP applications, still running in every inbox.

🏷

Topic Labelling

Sports / Finance / Tech…

Tags a document with its subject area. Used by news aggregators, search engines, and content recommendation systems.

Section 03

Step 1 — Text Preprocessing

📖 Story

The Chef Who Preps Ingredients Before Cooking

A great chef doesn't throw a whole raw chicken directly into a blender and call it soup. They wash it, debone it, chop it into manageable pieces, and discard the parts that add nothing (skin, feathers, packaging). Text preprocessing is the same discipline applied to language. Raw text is messy, inconsistent, and full of noise that confuses a model. Cleaning it first is not optional — it is the difference between a model that works and one that doesn't.

Every text classification project begins with the same set of preprocessing steps. Here is what each step does and why it matters.

Lowercase

Convert all text to lowercase so "Apple", "apple", and "APPLE" are treated as the same word. Without this, your vocabulary fills with case variants that mean identical things.

Remove Punctuation & Special Characters

"Great!!!" and "Great" should be the same token. Punctuation adds vocabulary noise without semantic value in most classification tasks.

Remove Stop Words

Words like "the", "is", "at", "a", "of" appear in almost every sentence. They carry almost no information for classification but balloon the feature space. Remove them.

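Stemming / Lemmatisation

"running" and "runs" share the root "run". Stemming chops suffixes by rule to reach it, which is fast but can produce non-words. Lemmatisation uses a dictionary and part-of-speech information to return the real base form, which also lets it map the irregular "ran" to "run". Both reduce vocabulary size. Note that NLTK's WordNetLemmatizer treats every word as a noun unless you pass a pos argument, so verbs such as "running" are left unchanged by default.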

Tokenisation

Split the cleaned string into individual tokens (words or sub-words). This is the output that gets fed into your vectoriser or model.

# ── Text Preprocessing Pipeline ────────────────────────────
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download('stopwords', quiet=True)
nltk.download('wordnet', quiet=True)

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

def preprocess(text: str) -> str:
    text = text.lower()                              # Step 1: lowercase
    text = re.sub(r'[^a-z\s]', ' ', text)           # Step 2: punctuation -> space, so "i've" splits into "i ve" (both stop words)
    tokens = text.split()                             # Step 3: tokenise
    tokens = [t for t in tokens if t not in stop_words]  # Step 4: stop words
    tokens = [lemmatizer.lemmatize(t) for t in tokens]   # Step 5: lemmatise
    return ' '.join(tokens)

# Example
raw = "This was absolutely the WORST movie I've ever seen!! Terrible acting!!"
print(preprocess(raw))

OUTPUT

absolutely worst movie ever seen terrible acting

⚡

When NOT to Remove Stop Words

For sentiment tasks, removing stop words can destroy meaning, because negation words such as "not" and "no" sit on most stop word lists (including NLTK's). "not bad" → removing "not" leaves "bad", the exact opposite sentiment. For transformer models (BERT, RoBERTa), skip this cleaning entirely: their tokenisers are built for raw text, and the models benefit from every word, including stop words.
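One way to act on this warning is to subtract negation words from the stop word list before filtering. The sketch below uses a deliberately tiny, hypothetical stop word set (BASE_STOP_WORDS) so that it runs without any downloads; in a real project you would start from NLTK's list instead. The names NEGATIONS and remove_stop_words are illustrative, not part of any library.

```python
# Hypothetical, deliberately tiny stop word list for illustration only
BASE_STOP_WORDS = {"the", "is", "a", "was", "not", "no", "it", "this"}

# Words that flip sentiment and should survive filtering
NEGATIONS = {"not", "no", "never", "nor"}

# Stop word list that is safer for sentiment tasks
SENTIMENT_STOP_WORDS = BASE_STOP_WORDS - NEGATIONS


def remove_stop_words(text, stop_words):
    """Lowercase, split on whitespace, and drop every token in stop_words."""
    return " ".join(t for t in text.lower().split() if t not in stop_words)


print(remove_stop_words("This is not bad", BASE_STOP_WORDS))       # bad
print(remove_stop_words("This is not bad", SENTIMENT_STOP_WORDS))  # not bad
```

The same set subtraction works on any stop word collection, so the preprocessing function itself does not need to change.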

Section 04

Step 2 — Feature Extraction: Turning Words into Numbers

Machine learning models cannot read text; they need numbers. Feature extraction converts clean text into a numerical representation. The pipeline overview listed three approaches: this section covers Bag of Words and TF-IDF, and learned embeddings arrive with the transformer models in Section 09.

📖 Story

The Library Card Catalogue

Think of a library whose books, taken together, use 10,000 unique words. Each book can be described by how many times each of those 10,000 words appears in it. A mystery novel will have high counts for "murder", "detective", "clue" and zero for "galaxy" or "spaceship." A sci-fi novel is the reverse.

This is Bag of Words — every document becomes a 10,000-dimensional vector of word counts. Simple, effective, and still used in production today.

Method 1 — Bag of Words (BoW)

Count how many times each word appears in a document. The entire vocabulary becomes the feature space. Word order is discarded — the "bag" metaphor is intentional.
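Before handing the job to a library, the counting described above can be sketched by hand with the standard library. This is a minimal illustration, assuming whitespace tokenisation and an alphabetically sorted vocabulary; the function name bag_of_words and the toy documents are made up for the example.

```python
from collections import Counter


def bag_of_words(docs):
    """Return (vocabulary, count_vectors) for whitespace-tokenised documents."""
    # A sorted vocabulary gives every word a stable column index
    vocab = sorted({word for doc in docs for word in doc.split()})
    vectors = []
    for doc in docs:
        counts = Counter(doc.split())  # missing words count as 0
        vectors.append([counts[word] for word in vocab])
    return vocab, vectors


toy_docs = ["good movie good cast", "bad movie"]
vocab, vectors = bag_of_words(toy_docs)
print(vocab)    # ['bad', 'cast', 'good', 'movie']
print(vectors)  # [[0, 1, 2, 1], [1, 0, 0, 1]]
```

Word order is gone: "good movie good cast" and "cast good good movie" would produce the same vector.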

from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "the movie was great and the acting was superb",
    "terrible film boring plot and bad acting",
    "great film superb plot loved every moment",
]

vectorizer = CountVectorizer()
X_bow = vectorizer.fit_transform(corpus)

print("Vocabulary:", vectorizer.vocabulary_)
print("Shape:", X_bow.toarray().shape)  # (3 docs, N unique words)
print(X_bow.toarray())

OUTPUT

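Vocabulary: {15 word → column-index pairs; columns are in alphabetical order}
Shape: (3, 15)
[[1 1 0 0 0 0 1 0 0 1 0 1 0 2 2]   ← "the movie was great …" row
 [1 1 1 1 0 1 0 0 0 0 1 0 1 0 0]   ← "terrible film boring …" row
 [0 0 0 0 1 1 1 1 1 0 1 1 0 0 0]]  ← "great film superb …" row
Columns: acting, and, bad, boring, every, film, great, loved, moment, movie, plot, superb, terrible, the, was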

Method 2 — TF-IDF (Term Frequency–Inverse Document Frequency)

BoW treats all words equally. But the word "movie" appears in almost every review — it tells you nothing distinctive. TF-IDF penalises words that appear in many documents and rewards words that appear frequently in this document but rarely elsewhere.

TF — Term Frequency

TF(t, d) = count(t in d) / total words in d

How often a word appears in this document. Normalised by document length.

IDF — Inverse Document Frequency

IDF(t) = log( N / df(t) )

Penalises words that appear in many documents (N = total docs, df = docs containing the term).
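The two formulas can be checked by hand on a toy corpus. The sketch below implements them exactly as written, plus the smoothed variant that scikit-learn's TfidfVectorizer uses by default (smooth_idf=True), which is ln((1 + N) / (1 + df)) + 1. The function names and the three toy documents are assumptions made for this example.

```python
import math

# Toy corpus, already tokenised
docs = [
    ["great", "movie", "great", "cast"],
    ["boring", "movie"],
    ["great", "plot"],
]


def tf(term, doc):
    """Term frequency: count of the term divided by document length."""
    return doc.count(term) / len(doc)


def idf(term, docs):
    """Textbook IDF: log(N / df). The term must occur in at least one document."""
    df = sum(1 for d in docs if term in d)
    return math.log(len(docs) / df)


def idf_smooth(term, docs):
    """Smoothed IDF as used by scikit-learn's default settings."""
    df = sum(1 for d in docs if term in d)
    return math.log((1 + len(docs)) / (1 + df)) + 1


print(tf("great", docs[0]))                    # 0.5 (2 of 4 words)
print(round(idf("boring", docs), 4))           # 1.0986 (appears in 1 of 3 docs)
print(round(idf("great", docs), 4))            # 0.4055 (appears in 2 of 3 docs)
print(round(idf_smooth("boring", docs), 4))    # 1.6931
print(round(idf_smooth("great", docs), 4))     # 1.2877
```

The smoothed values explain why library output can differ from a hand calculation with the plain log(N / df) formula: a term found in one of three documents scores about 1.693 rather than 1.099.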

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(max_features=5000, ngram_range=(1, 2))
# ngram_range=(1,2) captures single words (unigrams) AND 2-word phrases (bigrams)
# e.g. "not good" is a bigram, much more informative than just "good"

X_tfidf = tfidf.fit_transform(corpus)
print(f"TF-IDF shape: {X_tfidf.shape}")

# See which words got the highest IDF scores (most distinctive)
import numpy as np
feature_names = tfidf.get_feature_names_out()
idf_scores = tfidf.idf_
top_idx = np.argsort(idf_scores)[-8:]
for i in top_idx:
    print(f"  {feature_names[i]:20s} IDF={idf_scores[i]:.3f}")

OUTPUT

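TF-IDF shape: (3, 35)   ← 15 unigrams + 20 bigrams
  (eight features, each printed with)   IDF=1.693

All 29 features that occur in a single document tie at IDF=1.693, so which eight are printed depends on how the sort breaks ties. A word found in 2 of 3 documents, such as "plot", gets a lower IDF of 1.288.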

🏆

TF-IDF is Still Excellent for Production

Despite being decades old, TF-IDF with logistic regression or SVM still competes surprisingly well with deep learning on short, domain-specific text like reviews or support tickets. It's fast to train, easy to interpret, and runs on a laptop. Always build a TF-IDF baseline before reaching for transformers.

Section 05

Step 3 — Training Classifiers

With numerical features in hand, you can now train a classification model. We will cover three classifiers that are the standard toolkit for text classification, each with a story to make the intuition stick.

Classifier 1 — Naïve Bayes

📖 Story

The Doctor Who Diagnoses by Symptoms Independently

A doctor asks you: "Do you have a fever? A cough? A sore throat?" She then calculates the probability of flu given each symptom independently. In reality, fever and cough are correlated — they often appear together in the flu. But she assumes they're independent anyway. This "naïve" assumption still works remarkably well in practice, especially when you have lots of data.

Naïve Bayes for text works the same way. It asks: "Given this is a positive review, how likely is the word 'excellent' to appear? How likely is 'boring'?" It multiplies all those probabilities (treating each word as independent) and picks the class with the highest result.
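The multiply-the-probabilities idea fits in a few lines of plain Python. This is a from-scratch sketch for intuition, not how scikit-learn implements MultinomialNB: it assumes whitespace tokenisation, works in log space to avoid underflow, and uses Laplace smoothing with a hypothetical default of alpha=1.0. The function names and the four training sentences are invented for the example.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels, alpha=1.0):
    """Count documents per class and words per class."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for doc, label in zip(docs, labels):
        tokens = doc.split()
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab, alpha


def predict_nb(model, doc):
    """Return the class with the highest log prior + summed log likelihoods."""
    class_counts, word_counts, vocab, alpha = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        score = math.log(class_counts[label] / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for token in doc.split():
            if token not in vocab:
                continue  # ignore words never seen in training
            # Laplace-smoothed estimate of P(word | class)
            p = (word_counts[label][token] + alpha) / (total_words + alpha * len(vocab))
            score += math.log(p)  # adding logs = multiplying probabilities
        scores[label] = score
    return max(scores, key=scores.get)


train_docs = [
    "excellent acting great plot",
    "great film loved it",
    "boring plot bad acting",
    "terrible boring film",
]
train_labels = ["pos", "pos", "neg", "neg"]

model = train_nb(train_docs, train_labels)
print(predict_nb(model, "great acting"))  # pos
print(predict_nb(model, "boring film"))   # neg
```

Smoothing matters here: "great" never appears in a negative review, and without alpha its zero probability would wipe out the whole product for that class.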

from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.datasets import fetch_20newsgroups

# Using 20 Newsgroups — a real text classification benchmark
cats = ['rec.sport.hockey', 'sci.space', 'talk.politics.guns']
data = fetch_20newsgroups(subset='all', categories=cats, remove=('headers', 'footers', 'quotes'))

X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.25, random_state=42
)

# Pipeline: TF-IDF → Naïve Bayes
nb_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=10000, stop_words='english')),
    ('clf',  MultinomialNB(alpha=0.1)),  # alpha = Laplace smoothing
])

nb_pipeline.fit(X_train, y_train)
y_pred = nb_pipeline.predict(X_test)

print(classification_report(y_test, y_pred, target_names=data.target_names))

OUTPUT

                    precision    recall  f1-score   support

  rec.sport.hockey       0.95      0.97      0.96       250
         sci.space       0.96      0.94      0.95       247
talk.politics.guns       0.93      0.93      0.93       234

          accuracy                           0.95       731

Classifier 2 — Logistic Regression

📖 Story

The Judge Who Weighs Every Piece of Evidence

A judge in a trial assigns a weight to every piece of evidence: this fingerprint adds 0.8 points toward guilty, this alibi subtracts 0.6, this motive adds 0.4. She sums it all up and applies a threshold. If the total exceeds it, the verdict is guilty.

Logistic Regression does the same. Every word in your vocabulary gets a learned weight — the word "outstanding" gets a high positive weight for positive sentiment, "dreadful" gets a large negative weight. The model sums all active word weights, passes through a sigmoid function, and outputs a probability between 0 and 1.
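The judge's arithmetic can be written out directly. The weights below are hypothetical values chosen for illustration; a trained model learns one weight per vocabulary entry and multiplies each by the word's TF-IDF value, whereas this sketch simply adds the weight once per occurrence.

```python
import math

# Hypothetical learned weights (positive pushes toward POSITIVE sentiment)
weights = {"outstanding": 2.1, "loved": 1.4, "dreadful": -2.3, "boring": -1.7}
bias = 0.0


def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def positive_probability(text):
    """Sum the weights of the words present, then apply the sigmoid."""
    tokens = text.lower().split()
    z = bias + sum(weights.get(t, 0.0) for t in tokens)  # unknown words add 0
    return sigmoid(z)


print(round(positive_probability("outstanding film loved it"), 3))  # 0.971
print(round(positive_probability("dreadful and boring"), 3))        # 0.018
```

A total of zero maps to a probability of exactly 0.5, which is why 0.5 is the natural decision threshold.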

from sklearn.linear_model import LogisticRegression

lr_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(
        max_features=15000,
        ngram_range=(1, 2),     # unigrams + bigrams
        stop_words='english',
        sublinear_tf=True          # use 1 + log(TF): dampens very frequent terms
    )),
    ('clf', LogisticRegression(
        C=5.0,                      # inverse regularisation: higher = less regularised
        max_iter=1000,
        solver='lbfgs'              # multinomial for 3+ classes by default; multi_class is deprecated in recent scikit-learn
    )),
])

lr_pipeline.fit(X_train, y_train)
y_pred_lr = lr_pipeline.predict(X_test)
print(classification_report(y_test, y_pred_lr, target_names=data.target_names))

# Inspect model weights — which words drive each class?
tfidf_step = lr_pipeline.named_steps['tfidf']
clf_step   = lr_pipeline.named_steps['clf']
feature_names = tfidf_step.get_feature_names_out()

for i, cls in enumerate(data.target_names):
    # argsort is ascending, so the last five indices carry the largest weights
    top5 = feature_names[clf_step.coef_[i].argsort()[-5:]]
    print(f"{cls}: top words = {list(top5)}")

OUTPUT

                    precision    recall  f1-score   support

  rec.sport.hockey       0.97      0.98      0.97       250
         sci.space       0.98      0.96      0.97       247
talk.politics.guns       0.96      0.97      0.97       234

          accuracy                           0.97       731

rec.sport.hockey: top words = ['nhl', 'hockey', 'puck', 'playoff', 'goal']
sci.space: top words = ['nasa', 'shuttle', 'orbit', 'spacecraft', 'moon']
talk.politics.guns: top words = ['firearms', 'nra', 'amendment', 'weapon', 'gun']

💡

Logistic Regression is Your First Choice for Text

For most text classification tasks with TF-IDF features, Logistic Regression beats Naïve Bayes and rivals SVM — and it's fully interpretable. You can literally read the learned weights and understand why the model made a decision. This matters enormously in production when you need to explain your model to stakeholders.

Classifier 3 — Support Vector Machine (SVM)

📖 Story

The Border Fence That Maximises the No-Man's-Land

Imagine you are drawing a line on a map to separate two countries. Instead of just any line that separates them, you want the line that keeps as far away from both borders as possible — maximising the "no-man's-land" buffer zone on each side. This maximum-margin boundary is harder to accidentally cross, making it more robust to small shifts in the future.

SVM finds exactly that — the decision boundary with the largest possible margin between the two classes. The data points closest to the boundary are the support vectors, the only points that actually matter.
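The margin and the support vectors can be made concrete with a little geometry. The sketch below does not train an SVM: it takes four hypothetical 2-D points and a hypothetical separating line w·x + b = 0, measures each point's distance to the line, and picks out the closest ones. All values are assumptions chosen so the arithmetic is easy to follow.

```python
import math

# Hypothetical separating line: 1*x1 + 1*x2 - 5 = 0
w = (1.0, 1.0)
b = -5.0

# Hypothetical labelled points: (coordinates, class label)
points = [
    ((1.0, 1.0), -1),
    ((2.0, 2.0), -1),
    ((3.0, 3.0), +1),
    ((5.0, 4.0), +1),
]


def signed_distance(x):
    """Perpendicular distance from x to the line; the sign gives the side."""
    norm = math.hypot(*w)
    return (w[0] * x[0] + w[1] * x[1] + b) / norm


# Every point lies on the side that matches its label
assert all(signed_distance(x) * y > 0 for x, y in points)

distances = [(abs(signed_distance(x)), x) for x, _ in points]
margin = min(d for d, _ in distances)

# Support vectors: the points that sit exactly on the margin
support_vectors = [x for d, x in distances if math.isclose(d, margin)]

print(round(margin, 4))   # 0.7071
print(support_vectors)    # [(2.0, 2.0), (3.0, 3.0)]
```

Moving either of the two far points would leave the boundary unchanged, while moving a support vector would shift it. That is the sense in which only the closest points matter.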

from sklearn.svm import LinearSVC
from sklearn.calibration import CalibratedClassifierCV

svm_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(
        max_features=20000,
        ngram_range=(1, 2),
        sublinear_tf=True
    )),
    ('clf', CalibratedClassifierCV(        # wraps LinearSVC to give predict_proba
        LinearSVC(C=1.0, max_iter=2000)
    )),
])

svm_pipeline.fit(X_train, y_train)
y_pred_svm = svm_pipeline.predict(X_test)
print(classification_report(y_test, y_pred_svm, target_names=data.target_names))

OUTPUT

                    precision    recall  f1-score   support

  rec.sport.hockey       0.98      0.98      0.98       250
         sci.space       0.98      0.97      0.97       247
talk.politics.guns       0.96      0.97      0.96       234

          accuracy                           0.97       731

Section 06

Comparing All Three Classifiers

Property	Naïve Bayes	Logistic Regression	LinearSVC
Training speed	Fastest: closed form	Fast with lbfgs	Fast (liblinear solver)
Accuracy (typical)	Good	Very Good	Very Good
Interpretability	High: log probabilities	High: feature weights	Medium: feature weights, but uncalibrated scores
Handles imbalance	Poor: prior skews results	Yes: class_weight='balanced'	Yes: class_weight='balanced'
Outputs probabilities	Yes (native)	Yes (native)	Needs CalibratedClassifierCV
Best for	Quick baseline, very small data	General purpose, interpretable production	High-dimensional text, max accuracy

Section 07

Sentiment Analysis Deep Dive — The IMDB Dataset

Now let us build a complete end-to-end sentiment analysis pipeline using the famous IMDB movie review dataset: 50,000 reviews, perfectly balanced between positive and negative.

import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
import re

# ── Load the IMDB dataset ─────────────────────────────────
# Download from: https://ai.stanford.edu/~amaas/data/sentiment/
# Or load directly from Hugging Face datasets:
# from datasets import load_dataset
# dataset = load_dataset("imdb")

# This demo assumes a local CSV export with columns [review, sentiment]
df = pd.read_csv("imdb_reviews.csv")

# ── Preprocessing function ────────────────────────────────
def clean_text(text):
    text = re.sub(r'<[^>]+>', ' ', text)    # strip HTML tags
    text = re.sub(r'[^a-zA-Z\s]', ' ', text) # remove non-alpha
    text = text.lower().strip()
    text = re.sub(r'\s+', ' ', text)          # collapse whitespace
    return text

df['clean'] = df['review'].apply(clean_text)
df['label'] = (df['sentiment'] == 'positive').astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    df['clean'], df['label'], test_size=0.20, random_state=42, stratify=df['label']
)

# ── Best-practice pipeline ────────────────────────────────
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(
        max_features=30000,
        ngram_range=(1, 3),     # up to trigrams: "not at all"
        sublinear_tf=True,
        min_df=3,               # ignore words appearing in fewer than 3 docs
        max_df=0.95             # ignore words appearing in 95%+ of docs
    )),
    ('clf', LogisticRegression(C=4.0, max_iter=2000, solver='lbfgs')),
])

pipeline.fit(X_train, y_train)
y_pred  = pipeline.predict(X_test)
y_proba = pipeline.predict_proba(X_test)[:, 1]

print(classification_report(y_test, y_pred, target_names=['Negative', 'Positive']))
print(f"ROC-AUC: {roc_auc_score(y_test, y_proba):.4f}")

OUTPUT

              precision    recall  f1-score   support

    Negative       0.91      0.90      0.90      5000
    Positive       0.90      0.91      0.91      5000

    accuracy                           0.91     10000
   macro avg       0.91      0.91      0.91     10000

ROC-AUC: 0.9672

🎉

91% Accuracy — With Zero Deep Learning

This pipeline uses no neural network and no GPU, trains quickly on a laptop CPU, and achieves 91% accuracy on one of the most studied NLP benchmarks. Always establish this classical baseline before spending compute on transformers. The gap is often smaller than you expect.

Section 08

Confusion Matrix — Reading Between the Predictions

Accuracy alone is dangerous. A model that predicts "positive" for every review scores 50% accuracy on a balanced dataset, and the same lazy strategy scores 90% on a dataset that is 90% positive, which looks respectable while being useless. The confusion matrix exposes exactly where and how your model fails.

🚫 Naive Always-Positive Model

5000 TP: Pos correctly called Pos

5000 FP: Neg wrongly called Pos

0 FN: Pos wrongly called Neg

0 TN: Neg correctly called Neg

Accuracy = 50% · F1 (Negative) = 0.00 · Useless.

✅ Our Trained LR Model

4550 TP: Pos correctly called Pos

470 FP: Neg wrongly called Pos

450 FN: Pos wrongly called Neg

4530 TN: Neg correctly called Neg

Accuracy = 91% · F1 = 0.91 · Balanced and reliable.

Precision

TP / (TP + FP)

Of all reviews flagged as positive, what fraction truly are? Penalises false alarms.

Recall

TP / (TP + FN)

Of all truly positive reviews, what fraction did we catch? Penalises misses.

F1 Score

2 × (P × R) / (P + R)

Harmonic mean of precision and recall. A better single metric than accuracy when classes are imbalanced.

ROC-AUC

Area under TP rate vs FP rate curve

Probability that a randomly chosen positive review scores higher than a random negative one. 1.0 = perfect, 0.5 = random.
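These four definitions translate directly into code. The sketch below computes precision, recall and F1 from raw counts, and computes ROC-AUC from its pairwise definition (the share of positive/negative pairs that the model ranks correctly). The counts and scores are hypothetical, and the helper names are invented for this example; in practice sklearn.metrics provides tested versions.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, f1), using 0.0 where a ratio is undefined."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def roc_auc(scores_pos, scores_neg):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


# Hypothetical counts: 90 true positives, 10 false positives, 30 false negatives
p, r, f1 = precision_recall_f1(tp=90, fp=10, fn=30)
print(round(p, 3), round(r, 3), round(f1, 3))  # 0.9 0.75 0.818

# The always-positive model, scored on the negative class: nothing is ever caught
print(precision_recall_f1(tp=0, fp=0, fn=5000))  # (0.0, 0.0, 0.0)

# Hypothetical model scores for three positive and three negative reviews
auc = roc_auc([0.9, 0.8, 0.4], [0.7, 0.3, 0.2])
print(round(auc, 3))  # 0.889 (8 of 9 pairs ranked correctly)
```

The pairwise loop is quadratic, so it is only suitable for small illustrations.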

Section 09

Transformer-Based Sentiment — Enter BERT

📖 Story

The Reader Who Understands Context, Not Just Words

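A unigram TF-IDF model sees "not bad" as two separate tokens. It knows "bad" has a negative weight, so it leans negative. But "not bad" means good. Adding bigrams rescues adjacent pairs like this one, yet it cannot capture a negation that sits several words away from the word it modifies.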

BERT (Bidirectional Encoder Representations from Transformers) reads every word in the context of all surrounding words simultaneously. It was pre-trained on 3.3 billion words of text from books and Wikipedia, so it has seen "not bad" countless times in context and usually represents it correctly. Fine-tuning it on your sentiment data is a short job on a GPU and typically yields 93–96% accuracy on IMDB.

from transformers import pipeline as hf_pipeline

# Off-the-shelf sentiment with a model already fine-tuned on SST-2
# No training required: just load and infer
sentiment_pipe = hf_pipeline(
    "text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english",
    truncation=True,
    max_length=512
)

reviews = [
    "Absolutely stunning. Every scene was a work of art.",
    "Not bad for a low-budget film, though the pacing drags.",
    "A complete disaster. Two hours I will never get back.",
    "The film wasn't as terrible as I expected it to be.",   # tricky negation
]

results = sentiment_pipe(reviews)
for review, result in zip(reviews, results):
    label = result['label']
    score = result['score']
    print(f"[{label:8s} {score:.4f}] {review[:60]}...")

OUTPUT

[POSITIVE 0.9998] Absolutely stunning. Every scene was a work of art...
[NEGATIVE 0.7412] Not bad for a low-budget film, though the pacing drags...
[NEGATIVE 0.9995] A complete disaster. Two hours I will never get back...
[POSITIVE 0.6831] The film wasn't as terrible as I expected it to be...

⚠️

When to Choose Transformers vs Classical ML

Transformers win when: your text is long and context-dependent, you have sufficient compute (GPU), or accuracy above 92% is required. Classical ML wins when: you need explainability, your data is small (< 10k examples), you need sub-second inference on CPU, or you must interpret feature weights for compliance or auditing. In most real production systems, classical TF-IDF + LR is still the first choice.

Section 10

Fine-Tuning BERT on Your Own Data

import numpy as np
import pandas as pd
from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

model_name = "distilbert-base-uncased"   # DistilBERT: a smaller, faster BERT variant
tokenizer  = AutoTokenizer.from_pretrained(model_name)

def tokenize_fn(batch):
    # Pad or truncate every review to 256 tokens
    return tokenizer(
        batch['text'], truncation=True,
        padding='max_length', max_length=256
    )

# Build HuggingFace Datasets from the Section 07 train/test split
train_ds = Dataset.from_pandas(
    pd.DataFrame({'text': X_train, 'label': y_train}), preserve_index=False
)
test_ds = Dataset.from_pandas(
    pd.DataFrame({'text': X_test, 'label': y_test}), preserve_index=False
)

train_ds = train_ds.map(tokenize_fn, batched=True)
test_ds  = test_ds.map(tokenize_fn,  batched=True)

model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

def compute_metrics(eval_pred):
    # Without this function, Trainer reports only eval_loss, not accuracy
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"accuracy": float((preds == labels).mean())}

training_args = TrainingArguments(
    output_dir="./bert-sentiment",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    eval_strategy="epoch",       # named evaluation_strategy in older transformers releases
    save_strategy="epoch",       # must match the evaluation schedule for load_best_model_at_end
    load_best_model_at_end=True,
    learning_rate=2e-5,
    weight_decay=0.01,
    warmup_ratio=0.1,
    logging_steps=50,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=test_ds,
    compute_metrics=compute_metrics,
)

trainer.train()

OUTPUT (abridged)

Epoch 1/3: loss: 0.2341, eval_accuracy: 0.9312
Epoch 2/3: loss: 0.1122, eval_accuracy: 0.9441
Epoch 3/3: loss: 0.0771, eval_accuracy: 0.9487
Training complete, best model saved.

Section 11

Handling Imbalanced Classes

In real-world text classification, classes are rarely balanced. Spam is 2% of email. Hate speech is 1% of social posts. Fraudulent reviews are 5% of the total. A model that predicts "not spam" for every email is 98% accurate — and completely useless.

⚖️

Class Weights

Easiest Fix

Pass class_weight='balanced' to Logistic Regression or LinearSVC. sklearn automatically upweights the minority class during training. One extra argument, and minority-class recall often improves immediately.

📈

Oversample Minority

SMOTE / RandomOverSampler

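Duplicate or synthetically generate minority class examples until classes are balanced. Use the imbalanced-learn library's SMOTE or RandomOverSampler, wrapped in imbalanced-learn's own Pipeline (scikit-learn's Pipeline does not accept samplers), so that resampling touches only the training data.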

🎯

Right Metric

F1 / AUC, not Accuracy

Stop reporting accuracy on imbalanced data. Use macro F1 or ROC-AUC. These metrics expose a model that ignores the minority class.
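To see what class_weight='balanced' does numerically, the weights can be computed by hand. scikit-learn documents the heuristic as n_samples / (n_classes * count_of_class); the sketch below reproduces that formula in plain Python on a hypothetical 98/2 ham/spam split. The function name balanced_class_weights is made up for this example.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical inbox: 98 legitimate emails, 2 spam
labels = ["ham"] * 98 + ["spam"] * 2
weights = balanced_class_weights(labels)

print(round(weights["ham"], 4))   # 0.5102
print(weights["spam"])            # 25.0
```

Each spam example now counts about 49 times as much as a ham example in the loss, so the classifier can no longer minimise its error by ignoring spam.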

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, roc_auc_score

# Always use class_weight='balanced' when classes are unequal
balanced_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=15000, sublinear_tf=True)),
    ('clf', LogisticRegression(
        C=3.0,
        max_iter=1000,
        class_weight='balanced'  # ← the one change that matters
    )),
])

balanced_pipeline.fit(X_train, y_train)
y_pred = balanced_pipeline.predict(X_test)

macro_f1 = f1_score(y_test, y_pred, average='macro')
print(f"Macro F1: {macro_f1:.4f}")   # use macro F1 for imbalanced tasks

OUTPUT

Macro F1: 0.8847 ← much more honest than 98% accuracy on imbalanced data

Section 12

Predicting New Text — Live Inference

# ── Inference on new, unseen text ────────────────────────────
def predict_sentiment(texts, pipeline, label_map=None):
    if isinstance(texts, str):
        texts = [texts]
    cleaned = [clean_text(t) for t in texts]
    preds   = pipeline.predict(cleaned)
    probas  = pipeline.predict_proba(cleaned)
    if label_map is None:
        label_map = {0: "NEGATIVE", 1: "POSITIVE"}
    for text, pred, proba in zip(texts, preds, probas):
        label      = label_map[pred]
        confidence = proba[pred]
        print(f"[{label} {confidence:.2%}] {text[:70]}")

test_reviews = [
    "One of the greatest films ever made. Pure cinematic poetry.",
    "I walked out after 20 minutes. Absolute rubbish.",
    "It was fine. Nothing special but not terrible either.",
    "The cinematography is stunning but the script is painfully weak.",
]

predict_sentiment(test_reviews, pipeline)

OUTPUT

[POSITIVE 99.31%] One of the greatest films ever made. Pure cinematic poetry.
[NEGATIVE 98.74%] I walked out after 20 minutes. Absolute rubbish.
[NEGATIVE 58.22%] It was fine. Nothing special but not terrible either.
[NEGATIVE 61.87%] The cinematography is stunning but the script is painfully weak.

🔎

Low Confidence Scores Are Signals

Notice the third and fourth reviews score ~60% — near the 50% decision boundary. In production, you should never silently trust uncertain predictions. Flag anything below 70% confidence for human review, secondary model routing, or output as "NEUTRAL / MIXED" rather than forcing a binary label.
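A confidence gate like this takes only a few lines. The sketch below assumes a binary classifier that returns the probability of the positive class; the function name route_prediction, the NEEDS_REVIEW label and the 0.70 default threshold are illustrative choices, and the right threshold depends on your own error costs.

```python
def route_prediction(proba_positive, threshold=0.70):
    """Return (label, confidence), deferring to a human below the threshold."""
    # Confidence is the probability of whichever class the model prefers
    confidence = max(proba_positive, 1.0 - proba_positive)
    if confidence < threshold:
        return "NEEDS_REVIEW", confidence
    label = "POSITIVE" if proba_positive >= 0.5 else "NEGATIVE"
    return label, confidence


# Positive-class probabilities for four hypothetical reviews
for p in (0.9931, 0.0126, 0.4178, 0.3813):
    label, confidence = route_prediction(p)
    print(f"{label:12s} {confidence:.2%}")
```

With the default threshold, the first two reviews keep their labels and the last two are routed to review instead of being forced into NEGATIVE.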

Section 13

Multi-Class Classification — Beyond Binary

Real sentiment is not just positive or negative. Customers leave 1-to-5-star reviews. Tweets express joy, anger, sadness, fear, surprise. Here is how to handle more than two classes.

# X_train_multi, X_test_multi, y_train_multi and y_test_multi are assumed to come
# from a 5-class star-rating dataset (labels 1 to 5), split with train_test_split
# in the same way as the binary examples above

multiclass_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(
        max_features=20000,
        ngram_range=(1, 2),
        sublinear_tf=True
    )),
    ('clf', LogisticRegression(
        C=5.0,
        max_iter=2000,
        # lbfgs fits a multinomial (softmax) model for 3+ classes by default;
        # the multi_class argument is deprecated in recent scikit-learn releases
        class_weight='balanced'
    )),
])

multiclass_pipeline.fit(X_train_multi, y_train_multi)
y_pred_multi = multiclass_pipeline.predict(X_test_multi)

print(classification_report(
    y_test_multi, y_pred_multi,
    target_names=['1-star', '2-star', '3-star', '4-star', '5-star']
))

OUTPUT

              precision    recall  f1-score   support

      1-star       0.88      0.91      0.89       500
      2-star       0.71      0.68      0.69       500   ← 2-star hardest to distinguish
      3-star       0.64      0.62      0.63       500   ← 3-star most confused (neutral)
      4-star       0.73      0.71      0.72       500
      5-star       0.90      0.92      0.91       500

    accuracy                           0.77      2500

📌

Why 3-Star Reviews Are the Hardest to Classify

Mid-range reviews are inherently ambiguous — they contain both positive and negative language in roughly equal measure. Humans often disagree on them too. This is a fundamental data problem, not a model problem. One fix: collapse to 3 classes (negative = 1–2 stars, neutral = 3 stars, positive = 4–5 stars) and watch your F1 jump significantly.
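The collapse described above is a simple relabelling step applied before training. A minimal sketch, assuming integer star ratings from 1 to 5; the function name collapse_stars and the sample ratings are invented for the example.

```python
def collapse_stars(stars):
    """Map a 1-5 star rating onto three sentiment classes."""
    if stars <= 2:
        return "negative"
    if stars == 3:
        return "neutral"
    return "positive"


ratings = [1, 2, 3, 4, 5, 3]
print([collapse_stars(s) for s in ratings])
# ['negative', 'negative', 'neutral', 'positive', 'positive', 'neutral']
```

Note that the collapsed classes are no longer balanced: with equal numbers of each star rating, neutral ends up with half as many examples as the other two, so class weighting is still worth keeping.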

Section 14

Golden Rules for Text Classification

🌍 Text Classification & Sentiment Analysis — Non-Negotiable Rules

Always build a TF-IDF + Logistic Regression baseline first. It will surprise you how competitive it is. Reaching for BERT on day one is premature optimisation. The classical baseline gives you a bar to beat and an explainable comparison point.

Use bigrams and trigrams (ngram_range=(1,3)) in your TF-IDF. "Not good" and "not at all" reverse the meaning of their component words, and "highly recommend" is far stronger than either word alone. Single-word BoW misses every one of them.

Set sublinear_tf=True in TfidfVectorizer. This applies 1 + log(tf) instead of raw TF, reducing the outsized weight of words repeated many times within a single document. It often improves performance, especially on long documents.

Never report accuracy on imbalanced data. Always report macro F1 and ROC-AUC. A model that ignores the minority class can fake high accuracy forever. Macro F1 treats every class equally — it will expose the lie.

Inspect the confusion matrix, not just the headline F1. Which class does your model confuse most? That tells you exactly where to invest next — more data for that class, a different feature representation, or a class-specific threshold.

For transformer fine-tuning, use a learning rate of 2e-5 to 5e-5 with a warmup ratio of 0.06–0.1. These ranges are the commonly used starting points for BERT-style models. Much higher rates tend to destroy the pre-trained weights; much lower rates converge too slowly to be useful in a few epochs.

Flag low-confidence predictions rather than silently forcing a label. A model scoring 54% for "positive" and 46% for "negative" does not know the answer. Route uncertain predictions to a human reviewer or output a "mixed/uncertain" label.
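The sublinear_tf rule above is easy to check by hand. scikit-learn documents the scaling as replacing tf with 1 + log(tf), using the natural logarithm; the short sketch below applies that formula to a few hypothetical raw counts to show how strongly it compresses repetition. The function name sublinear_tf is illustrative.

```python
import math


def sublinear_tf(raw_count):
    """Dampened term frequency: 1 + ln(tf) for tf > 0, else 0."""
    return 1.0 + math.log(raw_count) if raw_count > 0 else 0.0


for count in (1, 10, 100):
    print(count, round(sublinear_tf(count), 4))
# 1 1.0
# 10 3.3026
# 100 5.6052
```

A word repeated 100 times ends up weighted under six times as much as a word used once, instead of 100 times as much.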