Semantic Similarity & Sentence Embeddings

Section 01

The Story That Explains Sentence Embeddings

📖 Real World Analogy

The Library With No Titles

Imagine a library where every book has its cover ripped off. No title, no author, nothing. You walk in and ask the librarian: "I want a book about a lonely astronaut on Mars."

A bad librarian scans for books that contain the exact words "lonely", "astronaut", and "Mars". She returns three technical manuals and one NASA report. Technically correct, utterly useless.

A brilliant librarian does something different. She feels what you mean — solitude, space, survival, psychological resilience in isolation — and walks you straight to Andy Weir's The Martian, Stanisław Lem's Solaris, and Kim Stanley Robinson's Red Mars. No title matching needed.

That brilliant librarian is a sentence embedding model. She converts meaning into a location in a vast mathematical space — and the books closest to your request are the most relevant. This is semantic similarity.

Semantic similarity is the task of measuring how close two pieces of text are in meaning — not in exact words, not in length, but in the underlying idea they express. The engine that makes this possible is a sentence embedding: a dense numeric vector (typically 384 to 1536 numbers) that encodes the meaning of a sentence into a single point in high-dimensional space.

💡

The Core Idea

Two sentences that mean the same thing, even if they share zero words, should map to nearby points in embedding space. Two sentences with unrelated meanings should be far apart. The distance between points is the measure of semantic similarity. This is what separates semantic methods from plain text matching.

Section 02

From Words to Vectors — A Brief History

Before we understand sentence embeddings, we need to understand why simple text matching fails, and how the field evolved to fix it.

🕐 The Evolution of Text Representation

1950s onward

Bag of Words (BoW): A sentence is a count of words. "The cat sat" → {the:1, cat:1, sat:1}. No order. No meaning. A thesaurus could fool it completely.

1970s onward

TF-IDF: Weighted word counts. Rare words matter more. Better for search, but still keyword-matching at heart. "Quick brown fox" ≠ "Speedy auburn canine."

2013

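Word2Vec (Google): Each word gets one fixed dense vector (300 dimensions in the widely used pretrained Google News model), learned from the contexts the word appears in. "King" − "Man" + "Woman" ≈ "Queen." Revolutionary, but still per-word, and every sense of a word shares the same vector.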

2018

BERT (Google): Transformer model reads the entire sentence at once. Context-aware. "Bank" by a river ≠ "Bank" you deposit money in. Meaning depends on surroundings.

2019+

Sentence-BERT / SBERT: Fine-tuned BERT specifically for sentence-level similarity. One forward pass → one vector per sentence. Fast, accurate, production-ready.

⚠️

The Fatal Flaw of Keyword Matching

Consider: "The medication eased her pain" vs. "She felt relief after taking the drug." These share zero content words, yet mean almost the same thing. Keyword matching gives them a similarity score of 0. A sentence embedding model gives them a high score, because it compares meaning rather than surface form. In healthcare, legal, or customer-service applications, this difference is the difference between a useful system and a broken one.
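To make the failure concrete, here is a minimal sketch of keyword matching applied to those two sentences. The stop-word list and the choice of Jaccard overlap as the score are assumptions made for illustration; they are not taken from any particular search library.

```python
import re

# Tiny illustrative stop-word list (an assumption; real systems use larger lists)
STOP_WORDS = {"the", "a", "an", "her", "she", "after", "of", "to"}

def content_words(sentence: str) -> set:
    """Lowercase, split on non-letters, and drop stop words."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return {t for t in tokens if t not in STOP_WORDS}

def jaccard(a: str, b: str) -> float:
    """Keyword overlap: shared words divided by all distinct words."""
    wa, wb = content_words(a), content_words(b)
    union = wa | wb
    return len(wa & wb) / len(union) if union else 0.0

s1 = "The medication eased her pain"
s2 = "She felt relief after taking the drug."

print(content_words(s1) & content_words(s2))  # set(): nothing in common
print(jaccard(s1, s2))                        # 0.0
```

Any score built purely on shared tokens behaves the same way here: with an empty intersection there is nothing to count, however close the meanings are.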

Section 03

What Is an Embedding, Mathematically?

An embedding is a function that maps a discrete object (a word, sentence, document) to a continuous vector in ℝⁿ. For sentence embeddings, a typical model produces a vector of 384 or 768 floating-point numbers for any input sentence, regardless of its length.

Input

"The cat sat on the mat."

Any string of text — a word, sentence, paragraph, or document.

Embedding Function

f(text) → ℝ³⁸⁴

The model maps the text to a dense vector of fixed dimension.

Output Vector

[0.21, −0.83, 0.47, …]

A point in high-dimensional space. The dimensions jointly encode latent semantic features; individual dimensions are usually not interpretable on their own.

Similarity Score

cos(A, B) ∈ [−1, 1]

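Cosine of the angle between two vectors. 1 = same direction (near-identical meaning), 0 = orthogonal (unrelated), −1 = opposite direction. In practice, scores from sentence embedding models mostly fall between 0 and 1.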

🔑

Why Cosine Similarity, Not Euclidean Distance?

Euclidean distance is sensitive to vector magnitude: two vectors pointing in the same direction can still be far apart if their lengths differ, and vector length can vary for reasons unrelated to meaning. Cosine similarity measures only the angle between vectors, making it magnitude-independent. A tweet and a paragraph with the same meaning should get a high cosine score, even if their raw distances are large.
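The magnitude argument can be checked in a few lines of plain Python. This is a sketch on a hypothetical 3-dimensional vector, not real model output: scaling a vector changes its Euclidean distance to the original but leaves the cosine untouched.

```python
import math

def cosine(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def euclidean(a, b):
    """Straight-line distance between vectors a and b."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Toy vector (hypothetical values, not real model output)
v = [0.2, -0.8, 0.5]
v_scaled = [3 * x for x in v]   # same direction, three times the length

print(round(cosine(v, v_scaled), 6))     # 1.0: the angle is unchanged
print(round(euclidean(v, v_scaled), 4))  # well above 0: distance grows with scale
```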

Section 04

The Three Similarity Metrics

Once you have embeddings, you need a metric. The three most common choices are below. Cosine similarity is the default for NLP; the others have specific use cases.

📈

Cosine Similarity

Most common in NLP

Measures the cosine of the angle between two vectors. Immune to magnitude. Range: −1 to 1. Normalized embeddings turn this into a dot product.

✓ Fast, magnitude-independent, interpretable

✗ Doesn't capture absolute distance in space

📏

Dot Product

Used when magnitude matters

The sum of element-wise products. Equivalent to cosine similarity for unit-normalized vectors. OpenAI's text-embedding-3 models return vectors already normalized to length 1, so dot product gives the same ranking as cosine similarity.

✓ Faster computation, same result if L2-normalized

✗ Sensitive to embedding magnitude if not normalized

📐

Euclidean Distance

Geometric distance

The straight-line distance between two points in space. Useful in clustering (k-means uses it) but less reliable for similarity scoring directly.

✓ Intuitive, works well in clustering algorithms

✗ Sensitive to vector magnitude unless embeddings are normalized
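The three metrics are closely linked once vectors have unit length: the dot product equals the cosine, and the squared Euclidean distance equals 2 − 2·cos. A small sketch verifying both identities on hypothetical toy vectors (the helper names are illustrative):

```python
import math

def l2_normalize(v):
    """Scale v to unit length (L2 norm of 1)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def squared_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Toy vectors (hypothetical values, not real model output)
a = [1.0, 2.0, 2.0]
b = [2.0, 0.0, 1.0]

a_unit, b_unit = l2_normalize(a), l2_normalize(b)
cos_ab = cosine(a, b)

print(math.isclose(dot(a_unit, b_unit), cos_ab))                        # True
print(math.isclose(squared_euclidean(a_unit, b_unit), 2 - 2 * cos_ab))  # True
```

This is why ranking by cosine, dot product, or Euclidean distance gives the same order on normalized embeddings, and why the choice only matters when vectors are left un-normalized.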

Section 05

Your First Sentence Embeddings — Step by Step

We'll use sentence-transformers, the most practical library for this task. It wraps SBERT models and handles all preprocessing automatically. Installation is one command.

# Install sentence-transformers (includes torch and transformers)
# pip install sentence-transformers

from sentence_transformers import SentenceTransformer, util
import numpy as np

# Load a lightweight, fast model (384-dimensional embeddings)
model = SentenceTransformer('all-MiniLM-L6-v2')

# Five sentences — notice they express similar ideas in very different words
sentences = [
    "The dog ran across the park at full speed.",
    "A canine sprinted through the garden quickly.",
    "The cat sat quietly by the window.",
    "She enjoyed her morning jog in the sunshine.",
    "Stock markets fell sharply on Tuesday.",
]

# Each sentence → 384-dimensional vector. Shape: (5, 384)
embeddings = model.encode(sentences, convert_to_tensor=True)
print(f"Embedding shape: {embeddings.shape}")

# Compute pairwise cosine similarity matrix
cosine_scores = util.cos_sim(embeddings, embeddings)

# Print the score for every unique pair (j > i skips self-similarity and mirrored duplicates)
for i in range(len(sentences)):
    for j in range(i+1, len(sentences)):
        print(f"Score: {cosine_scores[i][j]:.4f} | '{sentences[i][:40]}' ↔ '{sentences[j][:40]}'")

OUTPUT

Embedding shape: torch.Size([5, 384])
Score: 0.8732 | 'The dog ran across the park at full spee' ↔ 'A canine sprinted through the garden qui'
Score: 0.2341 | 'The dog ran across the park at full spee' ↔ 'The cat sat quietly by the window.'
Score: 0.3817 | 'The dog ran across the park at full spee' ↔ 'She enjoyed her morning jog in the sunsh'
Score: 0.0214 | 'The dog ran across the park at full spee' ↔ 'Stock markets fell sharply on Tuesday.'
Score: 0.1923 | 'A canine sprinted through the garden qui' ↔ 'The cat sat quietly by the window.'
Score: 0.3644 | 'A canine sprinted through the garden qui' ↔ 'She enjoyed her morning jog in the sunsh'
Score: 0.0189 | 'A canine sprinted through the garden qui' ↔ 'Stock markets fell sharply on Tuesday.'
Score: 0.2109 | 'The cat sat quietly by the window.' ↔ 'She enjoyed her morning jog in the sunsh'
Score: 0.0431 | 'The cat sat quietly by the window.' ↔ 'Stock markets fell sharply on Tuesday.'
Score: 0.0312 | 'She enjoyed her morning jog in the sunsh' ↔ 'Stock markets fell sharply on Tuesday.'

🌟

The Model Understood Meaning, Not Words

Sentences 1 and 2 share no content words (dog ≠ canine, ran ≠ sprinted, park ≠ garden; only "the" appears in both), yet the model scored them 0.87, which is highly similar. The stock market sentence correctly scores near 0 against all the other sentences. This is semantic understanding at work.

Section 06

Story: The Paraphrase Problem

📖 Real World Story

The Resume Screening Disaster

In 2019, a large e-commerce company built a resume screening tool using keyword matching. A recruiter posted a job for a "Software Engineer with experience in distributed systems."

A candidate submitted a resume that said: "Built scalable microservice architectures handling petabyte-scale data pipelines." Zero keywords matched. The system ranked her last. The recruiter manually reviewed her anyway (by luck) and she turned out to be the best candidate — she'd been an architect at a major cloud provider for 6 years.

Another candidate wrote: "Software Engineer. Experienced in distributed systems. Skills: distributed systems, software engineering." The system ranked him first. He failed the first technical interview.

When the company switched to a sentence embedding model, the qualified candidate rose to the top automatically — her description of microservices and petabyte pipelines mapped to the same semantic neighbourhood as distributed systems at a cosine score of 0.81.

The model was never given an explicit rule linking microservices to distributed systems. It had learned the connection from context.

Section 07

Popular Embedding Models — Which to Use?

⚡

all-MiniLM-L6-v2

sentence-transformers

384 dims. The workhorse. Extremely fast, surprisingly accurate. Best for prototypes, local inference, high-throughput pipelines. ~80MB model size.

Use when: speed matters, CPU deployment

🎯

all-mpnet-base-v2

sentence-transformers

768 dims. The strongest general-purpose model in the sentence-transformers all-* family. Slower than MiniLM but noticeably more accurate. Ideal for semantic search.

Use when: accuracy is paramount

🚀

text-embedding-3-small

OpenAI API

1536 dims. Strong quality on public benchmarks. API-based (no local GPU). $0.02 per 1M tokens at the time of writing. A good fit for production apps where accuracy justifies API cost.

Use when: top accuracy, managed infra

🔭

paraphrase-multilingual-mpnet-base-v2

sentence-transformers

768 dims, 50+ languages. Trained on multilingual paraphrase data. Cross-lingual: query in English, retrieve in Hindi. Perfect for global apps.

Use when: multilingual support needed

📑

multi-qa-mpnet-base-dot-v1

sentence-transformers

768 dims. Fine-tuned specifically for question-answer retrieval (asymmetric similarity). A single encoder embeds both short queries and longer passages, and scores are meant to be compared with dot product. Best for Q&A search.

Use when: semantic search over documents

🌎

E5 / GTE / BGE family

HuggingFace Hub

State-of-the-art open-source models (2023–24). BGE-large-en-v1.5 rivals commercial APIs. Excellent MTEB leaderboard scores. GPU-friendly inference.

Use when: SOTA open-source required

Section 08

Symmetric vs Asymmetric Semantic Similarity

Not all similarity tasks are the same. The type of task determines which model and approach to use. This is one of the most overlooked distinctions in practice.

🔄 Symmetric Similarity

Sentence A	Sentence B	Relation
"I love cats"	"I adore felines"	Paraphrase
"He quit his job"	"He resigned"	Paraphrase
"Rain is falling"	"It's raining outside"	Paraphrase

🔍 Asymmetric Similarity (Q&A)

Query (Short)	Document (Long)	Task
"headache remedy"	Long medical article about paracetamol, ibuprofen, rest, hydration...	Retrieval
"best Python library for ML"	"scikit-learn provides efficient tools for data mining..."	Search
"capital of Japan"	"Tokyo is the capital and most populous city of Japan..."	Q&A

📄

Which Model to Use?

For symmetric tasks (paraphrase detection, duplicate detection, clustering): use all-mpnet-base-v2 or all-MiniLM-L6-v2. For asymmetric tasks (semantic search, Q&A retrieval): use multi-qa-mpnet-base-dot-v1 or the msmarco family. Using the wrong type can reduce performance by a noticeable margin on retrieval benchmarks, so compare both on your own data before committing.

Section 09

Semantic Search — Full Working Example

Semantic search is the killer application of sentence embeddings. Pre-compute embeddings for your corpus once, store them, and at query time compute similarity between the query embedding and all stored embeddings. The top-k most similar become your search results.

from sentence_transformers import SentenceTransformer, util
import torch

model = SentenceTransformer('all-MiniLM-L6-v2')

# Corpus: articles on a medical FAQ site
corpus = [
    "Paracetamol reduces fever and mild to moderate pain. Take 500mg every 4–6 hours.",
    "Ibuprofen is a non-steroidal anti-inflammatory drug. Effective for headaches and inflammation.",
    "Staying hydrated helps the immune system fight infections and speeds recovery.",
    "Antibiotics treat bacterial infections but have no effect on viruses like the common cold.",
    "Regular exercise reduces cardiovascular disease risk and improves mental health.",
    "Diabetes type 2 can often be managed with diet, exercise, and weight loss alone.",
    "Vitamin D deficiency is linked to depression, fatigue, and weakened immunity.",
    "Sleep deprivation impairs cognitive function and weakens immune response.",
]

# Encode corpus ONCE and store (in production, save to disk or vector DB)
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)

# User query — note completely different words from any corpus sentence
query = "What can I take for a headache and high temperature?"
query_embedding = model.encode(query, convert_to_tensor=True)

# Find top-3 most semantically similar corpus entries
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=3)

print(f"Query: '{query}'\n")
print("Top 3 Results:")
for i, hit in enumerate(hits[0], 1):
    idx = hit['corpus_id']
    score = hit['score']
    print(f"  {i}. [Score: {score:.4f}] {corpus[idx]}")

OUTPUT

Query: 'What can I take for a headache and high temperature?'

Top 3 Results:
  1. [Score: 0.6821] Paracetamol reduces fever and mild to moderate pain. Take 500mg every 4–6 hours.
  2. [Score: 0.6249] Ibuprofen is a non-steroidal anti-inflammatory drug. Effective for headaches and inflammation.
  3. [Score: 0.2103] Staying hydrated helps the immune system fight infections and speeds recovery.

🌟

Why This Is Remarkable

The query mentions "headache" (which maps to ibuprofen correctly) and "high temperature" (which maps to paracetamol / fever correctly). Both are retrieved at the top despite different terminology. "Fever" and "high temperature" share no words, yet the model treats them as closely related. This is the direct, practical value of semantic similarity.
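Conceptually, the search step is a brute-force scan: score the query against every stored vector and keep the k best. The following dependency-free sketch illustrates that idea with hypothetical 3-dimensional vectors standing in for real embeddings. The helper name top_k_search is an assumption for this example and is not part of sentence-transformers.

```python
import heapq
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def top_k_search(query_vec, corpus_vecs, k=3):
    """Return the k (score, corpus_id) pairs with the highest cosine similarity."""
    scored = ((cosine(query_vec, vec), idx) for idx, vec in enumerate(corpus_vecs))
    return heapq.nlargest(k, scored)

# Hypothetical toy "embeddings"; a real model would return 384 numbers each
corpus_vecs = [
    [0.9, 0.1, 0.0],   # id 0
    [0.0, 1.0, 0.0],   # id 1
    [0.8, 0.2, 0.1],   # id 2
    [0.0, 0.1, 0.9],   # id 3
]
query_vec = [1.0, 0.0, 0.0]

for score, idx in top_k_search(query_vec, corpus_vecs, k=2):
    print(f"id={idx} score={score:.4f}")
```

The cost grows linearly with corpus size, which is fine for thousands of entries and is the reason larger corpora move to an index.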

Section 10

Story: The Customer Support Revolution

📖 Industry Story

From 48-Hour Wait to 3 Seconds

A SaaS company had a support team drowning in tickets. 400 new tickets every day. Average resolution time: 48 hours. Customer satisfaction: 3.1 / 5. Staff were burning out.

A data scientist embedded every resolved ticket's question and solution into a vector store. 58,000 resolved tickets → 58,000 embedding pairs. New incoming tickets were embedded at arrival and the top 5 most semantically similar resolved tickets were retrieved instantly.

Customer types: "My dashboard keeps freezing when I open the analytics tab."
The system retrieved: "UI hangs when loading heavy chart in reporting section" — solved with a browser cache clear and one config flag. Same problem, entirely different words.

The support agent now reviews 5 suggested solutions instead of searching from scratch. Average resolution time: 11 minutes. Customer satisfaction: 4.6 / 5. Staff headcount needed: reduced by 40%.

The model was all-MiniLM-L6-v2. Total model size: 80MB. Total compute cost: $12/month on a small cloud instance.

Section 11

Paraphrase Detection — Duplicate Question Finder

One of the cleanest use cases for semantic similarity is finding duplicate questions in a knowledge base or forum. The goal: two questions that ask the same thing in different words should be flagged as duplicates (think Stack Overflow, Quora, or internal wikis).

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-mpnet-base-v2')

questions = [
    "How do I reset my password?",
    "I forgot my login credentials, how can I recover access?",
    "What is the refund policy?",
    "Can I get my money back if I cancel?",
    "How do I change my billing information?",
    "Where can I update my credit card details?",
    "The app keeps crashing on startup.",
    "Application fails to launch every time I open it.",
]

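# paraphrase_mining encodes the questions itself and returns [score, i, j]
# triplets sorted by descending cosine similarity. It has no threshold
# argument, so filter the results afterwards.
THRESHOLD = 0.75
all_pairs = util.paraphrase_mining(model, questions)
pairs = [(score, i, j) for score, i, j in all_pairs if score > THRESHOLD]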

print(f"Duplicate question pairs (similarity > {THRESHOLD}):\n")
for score, i, j in pairs:
    print(f"  Score: {score:.4f}")
    print(f"    Q1: '{questions[i]}'")
    print(f"    Q2: '{questions[j]}'\n")

OUTPUT

Duplicate question pairs (similarity > 0.75):

  Score: 0.9102
    Q1: 'The app keeps crashing on startup.'
    Q2: 'Application fails to launch every time I open it.'

  Score: 0.8923
    Q1: 'How do I reset my password?'
    Q2: 'I forgot my login credentials, how can I recover access?'

  Score: 0.8784
    Q1: 'How do I change my billing information?'
    Q2: 'Where can I update my credit card details?'

  Score: 0.8611
    Q1: 'What is the refund policy?'
    Q2: 'Can I get my money back if I cancel?'
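The 0.75 threshold used above is a starting point, not a constant. One way to choose a threshold is to sweep candidate values over labelled pairs and keep the one with the best F1. A minimal sketch follows; the scores and labels are made-up placeholders standing in for your own validation data, and the function names are illustrative.

```python
def f1_at_threshold(scores, labels, threshold):
    """F1 for the rule 'predict duplicate when score >= threshold'."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def best_threshold(scores, labels):
    """Try each observed score as a threshold; return the (threshold, f1) with the highest F1."""
    candidates = sorted(set(scores))
    return max(((t, f1_at_threshold(scores, labels, t)) for t in candidates),
               key=lambda pair: pair[1])

# Placeholder validation data: cosine scores and human labels (1 = duplicate)
scores = [0.91, 0.88, 0.79, 0.74, 0.66, 0.52, 0.41, 0.30]
labels = [1,    1,    1,    0,    1,    0,    0,    0]

threshold, f1 = best_threshold(scores, labels)
print(f"best threshold: {threshold:.2f}  F1: {f1:.3f}")
```

On this toy data the best cut-off lands at 0.66, below the 0.75 default, because one true duplicate scores lower than a non-duplicate. Real data usually shows the same kind of overlap, which is why the threshold should be validated rather than assumed.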

Section 12

Clustering Sentences by Topic

When you have a large number of sentences (customer reviews, support tickets, survey responses), semantic clustering groups them automatically by topic — no labels required. This is unsupervised topic discovery at scale.

from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
import numpy as np

model = SentenceTransformer('all-MiniLM-L6-v2')

# Customer reviews — mixed topics
reviews = [
    "Delivery was incredibly fast, arrived next day!",
    "Shipping took only 24 hours, very impressed.",
    "Package arrived damaged, very disappointed.",
    "Product was broken when I opened the box.",
    "Customer service team was very helpful and responsive.",
    "Support agent resolved my issue in minutes.",
    "The quality of the material feels cheap.",
    "Build quality is poor, fell apart after one week.",
]

embeddings = model.encode(reviews)

# Cluster into 4 groups (shipping / damage / support / quality)
kmeans = KMeans(n_clusters=4, random_state=42, n_init=10)
labels = kmeans.fit_predict(embeddings)

print("Cluster Assignments:\n")
for cluster_id in range(4):
    print(f"  Cluster {cluster_id + 1}:")
    for i, label in enumerate(labels):
        if label == cluster_id:
            print(f"    - {reviews[i]}")
    print()

OUTPUT

Cluster Assignments:

  Cluster 1:
    - Delivery was incredibly fast, arrived next day!
    - Shipping took only 24 hours, very impressed.

  Cluster 2:
    - Package arrived damaged, very disappointed.
    - Product was broken when I opened the box.

  Cluster 3:
    - Customer service team was very helpful and responsive.
    - Support agent resolved my issue in minutes.

  Cluster 4:
    - The quality of the material feels cheap.
    - Build quality is poor, fell apart after one week.

🌟

When to Use HDBSCAN Instead of K-Means

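K-Means requires you to specify the number of clusters upfront. For exploring unknown data, use HDBSCAN (pip install hdbscan, or sklearn.cluster.HDBSCAN in scikit-learn 1.3+): it discovers the number of clusters automatically and handles noise points (reviews that don't fit any cluster). Cosine distance is not supported by every HDBSCAN code path, so a common approach is to L2-normalize the embeddings first and cluster with Euclidean distance; on unit vectors the two give the same neighbour ordering.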

Section 13

Scaling Up — Vector Databases

For small corpora (< 100,000 sentences), computing cosine similarity against all embeddings at query time is fast enough. For millions of entries, you need a vector database that enables approximate nearest-neighbour (ANN) search in milliseconds.
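A quick back-of-the-envelope calculation shows why scale changes the picture. Assuming float32 storage (4 bytes per value) and ignoring any index overhead, the raw vectors alone need N × dim × 4 bytes. The helper below is a hypothetical convenience function for that arithmetic.

```python
def embedding_memory_gb(num_vectors: int, dim: int, bytes_per_value: int = 4) -> float:
    """Raw storage for num_vectors embeddings, in gigabytes (10**9 bytes).

    Assumes dense float32 vectors by default and ignores index overhead.
    """
    return num_vectors * dim * bytes_per_value / 1e9

for n in (100_000, 1_000_000, 100_000_000):
    print(f"{n:>11,} vectors x 384 dims -> {embedding_memory_gb(n, 384):.2f} GB")
```

At 100,000 vectors the whole corpus fits comfortably in memory and a full scan is cheap. At 100 million, both storage and per-query scan time become the bottleneck, which is what ANN indexes and vector databases are built to address.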

📷

FAISS

Meta / Open Source

A similarity-search library rather than a full database. CPU and GPU. Billion-scale. Very fast, but you manage persistence and metadata yourself. Best for on-premise deployments.

🔲

Chroma

Open Source

Easiest to embed in Python apps. Auto-embeds text with built-in models. Ideal for prototypes and small to medium scale.

🚀

Pinecone

Managed Cloud

Fully managed. REST API. Scales to billions of vectors. No infrastructure management. Pay-per-query model.

🌎

Weaviate / Qdrant

Open Source + Cloud

Full-featured vector DBs. Filtering, metadata, hybrid search. Strong for production RAG (Retrieval-Augmented Generation) pipelines.

# FAISS: fast vector similarity search for large corpora
# pip install faiss-cpu   (or faiss-gpu for GPU)

import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')
DIM = 384  # embedding dimension of this model

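# Step 1: Build the index (do this once, then save it)
# IndexFlatIP does exact (brute-force) inner-product search; inner product
# equals cosine similarity for L2-normalised vectors. For approximate search
# at very large scale, FAISS also offers IndexIVFFlat and IndexHNSWFlat.
index = faiss.IndexFlatIP(DIM)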

corpus = [
    "Python is a high-level general-purpose programming language.",
    "Machine learning enables computers to learn from data without explicit programming.",
    "Deep neural networks have revolutionised image recognition tasks.",
    "Natural language processing allows computers to understand human text.",
    "The Eiffel Tower is located in Paris, France.",
]

corpus_emb = model.encode(corpus, normalize_embeddings=True).astype(np.float32)
index.add(corpus_emb)  # Add vectors to index

# Step 2: Query at runtime (a flat index scans every vector; ANN indexes keep this fast at millions of vectors)
query = "How do machines learn from examples?"
query_emb = model.encode([query], normalize_embeddings=True).astype(np.float32)

scores, indices = index.search(query_emb, k=3)  # top-3 results

print(f"Query: '{query}'\n")
for rank, (score, idx) in enumerate(zip(scores[0], indices[0]), 1):
    print(f"  {rank}. [{score:.4f}] {corpus[idx]}")

OUTPUT

Query: 'How do machines learn from examples?'

  1. [0.7831] Machine learning enables computers to learn from data without explicit programming.
  2. [0.5012] Deep neural networks have revolutionised image recognition tasks.
  3. [0.4734] Natural language processing allows computers to understand human text.

Section 14

Cross-Encoder vs Bi-Encoder — The Accuracy Trade-off

All models we've used so far are bi-encoders: they embed sentences independently, then compare. Faster, but less accurate. A cross-encoder takes both sentences simultaneously for maximum accuracy — at the cost of speed.

Property	Bi-Encoder (SBERT)	Cross-Encoder (BERT Reranker)
Input	One sentence at a time	Both sentences together
Output	Embedding vector per sentence	Single similarity score
Speed	Very fast (pre-compute embeddings)	Slow: one full transformer pass per query-document pair, so N passes for N docs
Scalability	Millions of documents	Limited to small candidate sets
Accuracy	Very good	Higher — full attention over both inputs
Best use case	First-stage retrieval (recall)	Second-stage reranking (precision)

💡

The Two-Stage Pipeline (Industry Best Practice)

Stage 1 — Bi-Encoder Retrieval: Retrieve top-100 candidates from millions of documents in milliseconds using FAISS.
Stage 2 — Cross-Encoder Reranking: Run the 100 candidates through a cross-encoder that reads both the query and candidate simultaneously, producing a precise reranked top-10. This pipeline gets you near-cross-encoder accuracy at bi-encoder speed.

from sentence_transformers import SentenceTransformer, CrossEncoder, util

# Stage 1: Bi-encoder for fast retrieval
bi_encoder = SentenceTransformer('multi-qa-MiniLM-L6-cos-v1')

# Stage 2: Cross-encoder for accurate reranking
cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

corpus = [
    "Python can be used for machine learning and data science.",
    "Java is an object-oriented language primarily used in enterprise systems.",
    "scikit-learn provides tools for supervised and unsupervised learning in Python.",
    "PyTorch is a deep learning framework used for neural network training.",
    "TensorFlow was developed by Google and powers production AI systems worldwide.",
    "NumPy provides efficient array operations for numerical computing.",
]

query = "best Python tools for AI"

# Stage 1: Get top-3 candidates quickly
corpus_emb = bi_encoder.encode(corpus, convert_to_tensor=True)
query_emb  = bi_encoder.encode(query,  convert_to_tensor=True)
hits = util.semantic_search(query_emb, corpus_emb, top_k=3)[0]

candidates = [corpus[hit['corpus_id']] for hit in hits]

# Stage 2: Rerank with cross-encoder
pairs = [[query, c] for c in candidates]
rerank_scores = cross_encoder.predict(pairs)

ranked = sorted(zip(rerank_scores, candidates), reverse=True)
print("Reranked Results:")
for score, doc in ranked:
    print(f"  [{score:.4f}] {doc}")

OUTPUT

Reranked Results:
  [7.2341] scikit-learn provides tools for supervised and unsupervised learning in Python.
  [6.9812] PyTorch is a deep learning framework used for neural network training.
  [5.4221] Python can be used for machine learning and data science.
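The reranker scores above fall well outside the 0 to 1 range, which suggests they are raw logits rather than probabilities (an assumption about how this particular model is configured). If a bounded score is easier to threshold or display, a sigmoid preserves the ranking while mapping values into (0, 1). A small sketch:

```python
import math

def sigmoid(x: float) -> float:
    """Map a raw logit to the (0, 1) range."""
    return 1.0 / (1.0 + math.exp(-x))

# Scores copied from the output above
rerank_scores = [7.2341, 6.9812, 5.4221]

for logit in rerank_scores:
    print(f"logit {logit:.4f} -> {sigmoid(logit):.4f}")
```

Because the sigmoid is monotonic, the order of results never changes. The mapped values should still be read as relative relevance, not calibrated probabilities.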

Section 15

Real-World Applications at a Glance

🔍

Semantic Search

Replace keyword search in internal wikis, documentation, and product databases. Query in natural language, retrieve by meaning.

Stack Overflow, Notion, Confluence, enterprise search

🤖

RAG (Retrieval-Augmented Generation)

Power chatbots by retrieving relevant context from a vector DB before generating an answer. Grounds LLMs in real documents.

LangChain, LlamaIndex, GPT+knowledge base

📊

Duplicate Detection

Identify near-identical tickets, questions, or product listings regardless of wording. Reduces support queue size dramatically.

Quora, Stack Overflow, helpdesk systems

😄

Sentiment & Intent Classification

Embed user messages and use KNN or a lightweight classifier on top of embeddings to detect intent, sentiment, or category.

Chatbots, IVR routing, social media monitoring

🌟

Recommendation Systems

Recommend articles, products, or courses based on semantic similarity between item descriptions and user preferences.

Netflix, Spotify, e-learning platforms

🌎

Cross-Lingual Retrieval

Query in English, retrieve results in Hindi, French, or Spanish. Multilingual models map all languages to a shared embedding space.

Global customer support, multilingual legal search
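Of the applications above, classification on top of embeddings is the easiest to sketch without a model. The example below is a minimal k-nearest-neighbour classifier over hypothetical 2-dimensional vectors; in a real system the vectors would come from model.encode(...) on labelled example messages, and knn_predict is an illustrative name rather than a library function.

```python
import math
from collections import Counter

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def knn_predict(query_vec, labelled_vecs, k=3):
    """Majority label among the k labelled vectors most similar to query_vec."""
    ranked = sorted(labelled_vecs, key=lambda item: cosine(query_vec, item[0]), reverse=True)
    top_labels = [label for _, label in ranked[:k]]
    return Counter(top_labels).most_common(1)[0][0]

# Hypothetical 2-D "embeddings" with intent labels
labelled_vecs = [
    ([0.9, 0.1], "billing"),
    ([0.8, 0.3], "billing"),
    ([0.7, 0.2], "billing"),
    ([0.1, 0.9], "technical"),
    ([0.2, 0.8], "technical"),
    ([0.3, 0.9], "technical"),
]

print(knn_predict([0.85, 0.15], labelled_vecs))  # billing
print(knn_predict([0.15, 0.95], labelled_vecs))  # technical
```

No training step is involved: adding a new intent only requires embedding a few labelled examples and appending them to the list.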

Section 16

Golden Rules — Sentence Embeddings in Production

🧠 Semantic Similarity — Non-Negotiable Rules

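Always L2-normalize embeddings before using dot product as a stand-in for cosine similarity. This is required for FAISS IndexFlatIP to return cosine scores; normalize with normalize_embeddings=True in model.encode(). Un-normalized vectors let magnitude influence the dot product, not just direction. The exception is a model trained for raw dot-product scoring, such as multi-qa-mpnet-base-dot-v1: follow its model card instead.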

Use an asymmetric model for Q&A retrieval. Using a symmetric model (all-MiniLM-L6-v2) for semantic search gives acceptable but suboptimal results. Switch to multi-qa-mpnet-base-dot-v1 or the msmarco family for retrieval tasks.

Pre-compute and cache corpus embeddings. Never re-embed the same corpus on every query. Encode once, store to disk (np.save) or a vector DB. Only the query embedding should be computed at runtime.

Don't embed text longer than the model's max sequence length. Most SBERT models have a limit of 512 tokens or fewer (all-MiniLM-L6-v2 uses 256). Text beyond that is silently truncated. For long documents, chunk into paragraphs, embed each chunk separately, and use max-pooling or mean-pooling of chunk embeddings.

Calibrate your similarity threshold per task. A 0.75 threshold for paraphrase detection is not the same as a useful threshold for document retrieval. Validate thresholds against labelled data for your specific domain. What counts as "similar" varies drastically across medical, legal, conversational, and technical corpora.

Use a two-stage pipeline at scale. Bi-encoder for top-100 retrieval, cross-encoder for top-10 reranking. This is the industry-standard architecture that balances speed and accuracy. Never run a cross-encoder over a full corpus — it does not scale.

Benchmark on MTEB before choosing a model. The Massive Text Embedding Benchmark (huggingface.co/spaces/mteb/leaderboard) provides independent, task-specific accuracy scores for hundreds of models. Don't pick a model from blog posts — look up your task type on MTEB and choose the highest-scoring model that fits your latency and memory budget.
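The rule about long documents can be sketched in plain Python. The example below splits text into overlapping word windows and averages per-chunk vectors into one document vector. Word counts are used as a rough stand-in for tokens, and the window sizes and function names are assumptions for illustration; a real pipeline would count tokens with the model's own tokenizer and obtain chunk vectors from model.encode(chunks).

```python
def chunk_words(text, max_words=200, overlap=20):
    """Split text into overlapping word windows.

    Word counts only approximate token counts (an assumption of this sketch).
    """
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last window already reached the end of the text
    return chunks

def mean_pool(vectors):
    """Element-wise average of equal-length vectors, giving one document vector."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

document = " ".join(f"word{i}" for i in range(450))
chunks = chunk_words(document, max_words=200, overlap=20)
print([len(c.split()) for c in chunks])  # [200, 200, 90]

# Hypothetical chunk embeddings; in practice: model.encode(chunks)
chunk_vectors = [[0.2, 0.4], [0.4, 0.0], [0.6, 0.2]]
print(mean_pool(chunk_vectors))
```

The overlap keeps sentences that straddle a boundary visible in both neighbouring chunks. Mean-pooling gives a single vector per document, while keeping the chunk vectors separately allows retrieval to point at the specific passage that matched.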