
Semantic Similarity & Sentence Embeddings

This tutorial explains how sentence embeddings work, how machines measure meaning rather than words, and how to build real semantic search, duplicate detection, and clustering systems with Python. Includes stories, working code, and production best practices.

Section 01

The Story That Explains Sentence Embeddings

The Library With No Titles
Imagine a library where every book has its cover ripped off. No title, no author, nothing. You walk in and ask the librarian: "I want a book about a lonely astronaut on Mars."

A bad librarian scans for books that contain the exact words "lonely", "astronaut", and "Mars". She returns three technical manuals and one NASA report. Technically correct, utterly useless.

A brilliant librarian does something different. She feels what you mean — solitude, space, survival, psychological resilience in isolation — and walks you straight to Andy Weir's The Martian, Stanisław Lem's Solaris, and Kim Stanley Robinson's Red Mars. No title matching needed.

That brilliant librarian is a sentence embedding model. She converts meaning into a location in a vast mathematical space — and the books closest to your request are the most relevant. This is semantic similarity.

Semantic similarity is the task of measuring how close two pieces of text are in meaning — not in exact words, not in length, but in the underlying idea they express. The engine that makes this possible is a sentence embedding: a dense numeric vector (typically 384 to 1536 numbers) that encodes the meaning of a sentence into a single point in high-dimensional space.

💡
The Core Idea

Two sentences that mean the same thing, even if they share zero words, should map to nearby points in embedding space. Two sentences that mean opposite things should be far apart. The distance between points is the measure of semantic similarity. This is what separates semantic matching from mere text matching.


Section 02

From Words to Vectors — A Brief History

Before we understand sentence embeddings, we need to understand why simple text matching fails, and how the field evolved to fix it.

🕐 The Evolution of Text Representation
1990s
Bag of Words (BoW): A sentence is a count of words. "The cat sat" → {the:1, cat:1, sat:1}. No order. No meaning. A thesaurus could fool it completely.
2000s
TF-IDF: Weighted word counts. Rare words matter more. Better for search, but still keyword-matching at heart. "Quick brown fox" ≠ "Speedy auburn canine."
2013
Word2Vec (Google): Each word gets a 300-dimensional vector. "King" − "Man" + "Woman" ≈ "Queen." Words in context encode meaning. Revolutionary — but still per-word.
2018
BERT (Google): Transformer model reads the entire sentence at once. Context-aware. "Bank" by a river ≠ "Bank" you deposit money in. Meaning depends on surroundings.
2019+
Sentence-BERT / SBERT: Fine-tuned BERT specifically for sentence-level similarity. One forward pass → one vector per sentence. Fast, accurate, production-ready.
⚠️
The Fatal Flaw of Keyword Matching

Consider: "The medication eased her pain" vs. "She felt relief after taking the drug." These share zero content words, yet mean almost the same thing. Keyword matching gives them a similarity score of 0. A sentence embedding model gives them a score close to 1.0. In healthcare, legal, or customer-service applications, this difference is the difference between a useful system and a broken one.
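To make the zero score concrete, here is a minimal sketch of keyword matching as Jaccard overlap on content words. The tiny stop-word list and the helper names are assumptions made for this illustration, not part of any library.

```python
import string

# Hypothetical, deliberately tiny stop-word list for this illustration.
STOPWORDS = {"the", "a", "an", "her", "she", "after", "of", "to"}

def content_words(text: str) -> set[str]:
    """Lowercase the text, strip punctuation, and drop stop words."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return {w for w in cleaned.split() if w not in STOPWORDS}

def jaccard(a: str, b: str) -> float:
    """Shared content words divided by all content words in either text."""
    wa, wb = content_words(a), content_words(b)
    if not wa and not wb:
        return 0.0
    return len(wa & wb) / len(wa | wb)

s1 = "The medication eased her pain"
s2 = "She felt relief after taking the drug."
keyword_score = jaccard(s1, s2)
print(keyword_score)  # 0.0
```

No amount of tuning the stop-word list rescues this pair: the overlap is empty because the two sentences use different vocabulary for the same idea.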


Section 03

What Is an Embedding, Mathematically?

An embedding is a function that maps a discrete object (a word, sentence, document) to a continuous vector in ℝⁿ. For sentence embeddings, a typical model produces a vector of 384 or 768 floating-point numbers for any input sentence, regardless of its length.

Input
"The cat sat on the mat."
Any string of text — a word, sentence, paragraph, or document.
Embedding Function
f(text) → ℝ³⁸⁴
The model maps the text to a dense vector of fixed dimension.
Output Vector
[0.21, −0.83, 0.47, …]
A point in high-dimensional space. Each dimension encodes a latent semantic feature.
Similarity Score
cos(A, B) ∈ [−1, 1]
Cosine of the angle between two vectors. 1 = identical meaning, 0 = unrelated, −1 = opposite.
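Why is the output the same size regardless of input length? Many sentence-transformers models, all-MiniLM-L6-v2 among them, average their per-token vectors (mean pooling). The sketch below illustrates the idea with hand-written 3-dimensional token vectors; the lookup table is purely hypothetical, since a real model computes contextual token vectors with hundreds of dimensions.

```python
# Hypothetical 3-dimensional token vectors, hand-written for illustration.
TOKEN_VECTORS = {
    "the": [0.1, 0.0, 0.2],
    "cat": [0.9, 0.3, 0.1],
    "sat": [0.2, 0.8, 0.1],
    "on":  [0.0, 0.1, 0.1],
    "mat": [0.7, 0.2, 0.3],
}
UNKNOWN = [0.0, 0.0, 0.0]

def mean_pool(sentence: str) -> list[float]:
    """Average the token vectors so any sentence yields one fixed-size vector."""
    tokens = sentence.lower().replace(".", "").split()
    if not tokens:
        return list(UNKNOWN)
    vectors = [TOKEN_VECTORS.get(t, UNKNOWN) for t in tokens]
    dim = len(UNKNOWN)
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]

short_vec = mean_pool("Cat sat.")
long_vec = mean_pool("The cat sat on the mat.")
print(len(short_vec), len(long_vec))  # 3 3
```

Two tokens or six, the result has the same number of dimensions, which is what makes any two sentences directly comparable.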
🔑
Why Cosine Similarity, Not Euclidean Distance?

Euclidean distance is sensitive to vector magnitude: depending on the model and its pooling, two texts with the same meaning (a tweet and a full paragraph, say) can yield vectors of different lengths. Cosine similarity measures only the angle between vectors, making it magnitude-independent. A tweet and a paragraph with the same meaning can still have a cosine score near 1.0, even if their raw distances are large.
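A small from-scratch sketch shows the magnitude argument in numbers. The vector values are arbitrary illustration data; scaling a vector by 3 stands in for "same direction, different magnitude".

```python
import math

def cosine(a, b):
    """Cosine of the angle between a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def euclidean(a, b):
    """Straight-line distance between a and b."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

v = [0.21, -0.83, 0.47]
v_scaled = [3 * x for x in v]  # same direction, three times the magnitude

print(round(cosine(v, v_scaled), 6))  # 1.0
print(euclidean(v, v_scaled) > 1.0)   # True
```

Cosine treats the two vectors as identical, while Euclidean distance reports them as far apart purely because of the scale difference.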


Section 04

The Three Similarity Metrics

Once you have embeddings, you need a metric. The three most common choices are below. Cosine similarity is the default for NLP; the others have specific use cases.

📈
Cosine Similarity
Most common in NLP
Measures the cosine of the angle between two vectors. Immune to magnitude. Range: −1 to 1. Normalized embeddings turn this into a dot product.
✓ Fast, magnitude-independent, interpretable
✗ Doesn't capture absolute distance in space
📏
Dot Product
Used when magnitude matters
The sum of element-wise products. Equivalent to cosine similarity for unit-normalized vectors. OpenAI's text-embedding-3 models return unit-length vectors, so dot product and cosine similarity produce the same ranking.
✓ Faster computation, same result if L2-normalized
✗ Sensitive to embedding magnitude if not normalized
📐
Euclidean Distance
Geometric distance
The straight-line distance between two points in space. Useful in clustering (k-means uses it) but less reliable for similarity scoring directly.
✓ Intuitive, works well in clustering algorithms
✗ Sensitive to vector magnitude; shorter texts get penalized
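The three metrics collapse into one another once vectors are L2-normalized: the dot product equals the cosine, and the squared Euclidean distance equals 2 − 2·cos. The sketch below checks both identities on two arbitrary example vectors (the values carry no meaning).

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(v):
    """Scale v to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a = [1.0, 2.0, 2.0]
b = [2.0, 0.0, 1.0]
a_hat, b_hat = l2_normalize(a), l2_normalize(b)

cos_ab = cosine(a, b)                 # cosine on the raw vectors
dot_normalized = dot(a_hat, b_hat)    # dot product on unit vectors
sq_dist = sum((x - y) ** 2 for x, y in zip(a_hat, b_hat))

print(math.isclose(cos_ab, dot_normalized))        # True
print(math.isclose(sq_dist, 2 - 2 * cos_ab))       # True
```

In practice this means that after normalization, ranking by highest dot product, highest cosine, or lowest Euclidean distance returns the same order.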

Section 05

Your First Sentence Embeddings — Step by Step

We'll use sentence-transformers, the most practical library for this task. It wraps SBERT models and handles all preprocessing automatically. Installation is one command.

# Install sentence-transformers (includes torch and transformers)
# pip install sentence-transformers

from sentence_transformers import SentenceTransformer, util
import numpy as np

# Load a lightweight, fast model (384-dimensional embeddings)
model = SentenceTransformer('all-MiniLM-L6-v2')

# Five sentences — notice they express similar ideas in very different words
sentences = [
    "The dog ran across the park at full speed.",
    "A canine sprinted through the garden quickly.",
    "The cat sat quietly by the window.",
    "She enjoyed her morning jog in the sunshine.",
    "Stock markets fell sharply on Tuesday.",
]

# Each sentence → 384-dimensional vector. Shape: (5, 384)
embeddings = model.encode(sentences, convert_to_tensor=True)
print(f"Embedding shape: {embeddings.shape}")

# Compute pairwise cosine similarity matrix
cosine_scores = util.cos_sim(embeddings, embeddings)

# Print every unique pair with its score (j > i skips self-pairs and mirrored duplicates)
for i in range(len(sentences)):
    for j in range(i+1, len(sentences)):
        print(f"Score: {cosine_scores[i][j]:.4f} | '{sentences[i][:40]}' ↔ '{sentences[j][:40]}'")
OUTPUT
Embedding shape: torch.Size([5, 384])
Score: 0.8732 | 'The dog ran across the park at full spee' ↔ 'A canine sprinted through the garden qui'
Score: 0.2341 | 'The dog ran across the park at full spee' ↔ 'The cat sat quietly by the window.'
Score: 0.3817 | 'The dog ran across the park at full spee' ↔ 'She enjoyed her morning jog in the sunsh'
Score: 0.0214 | 'The dog ran across the park at full spee' ↔ 'Stock markets fell sharply on Tuesday.'
Score: 0.1923 | 'A canine sprinted through the garden qui' ↔ 'The cat sat quietly by the window.'
Score: 0.3644 | 'A canine sprinted through the garden qui' ↔ 'She enjoyed her morning jog in the sunsh'
Score: 0.0189 | 'A canine sprinted through the garden qui' ↔ 'Stock markets fell sharply on Tuesday.'
Score: 0.2109 | 'The cat sat quietly by the window.' ↔ 'She enjoyed her morning jog in the sunsh'
Score: 0.0431 | 'The cat sat quietly by the window.' ↔ 'Stock markets fell sharply on Tuesday.'
Score: 0.0312 | 'She enjoyed her morning jog in the sunsh' ↔ 'Stock markets fell sharply on Tuesday.'
🌟
The Model Understood Meaning, Not Words

Sentences 1 and 2 share no content words (dog ≠ canine, ran ≠ sprinted, park ≠ garden), yet the model scored them 0.87, which is highly similar. The stock market sentence correctly scores near 0 against every other sentence. This is semantic understanding at work.


Section 06

Story: The Paraphrase Problem

The Resume Screening Disaster
In 2019, a large e-commerce company built a resume screening tool using keyword matching. A recruiter posted a job for a "Software Engineer with experience in distributed systems."

A candidate submitted a resume that said: "Built scalable microservice architectures handling petabyte-scale data pipelines." Zero keywords matched. The system ranked her last. The recruiter manually reviewed her anyway (by luck) and she turned out to be the best candidate — she'd been an architect at a major cloud provider for 6 years.

Another candidate wrote: "Software Engineer. Experienced in distributed systems. Skills: distributed systems, software engineering." The system ranked him first. He failed the first technical interview.

When the company switched to a sentence embedding model, the qualified candidate rose to the top automatically — her description of microservices and petabyte pipelines mapped to the same semantic neighbourhood as distributed systems at a cosine score of 0.81.

The model had no idea what microservices were explicitly. It understood context.

Section 07

Popular Embedding Models — Which to Use?

all-MiniLM-L6-v2
sentence-transformers
384 dims. The workhorse. Extremely fast, surprisingly accurate. Best for prototypes, local inference, high-throughput pipelines. ~80MB model size.
Use when: speed matters, CPU deployment
🎯
all-mpnet-base-v2
sentence-transformers
768 dims. The highest-quality general-purpose model in the original sentence-transformers all-* family, according to the library's own benchmarks. Slower than MiniLM but more accurate. Ideal for semantic search.
Use when: accuracy is paramount
🚀
text-embedding-3-small
OpenAI API
1536 dims. State-of-the-art quality. API-based (no local GPU needed). $0.02 per 1M tokens at the time of writing. Best for production apps where accuracy justifies API cost.
Use when: top accuracy, managed infra
🔭
paraphrase-multilingual-mpnet-base-v2
sentence-transformers
768 dims, 50+ languages. Trained on multilingual paraphrase data. Cross-lingual: query in English, retrieve in Hindi. Perfect for global apps.
Use when: multilingual support needed
📑
multi-qa-mpnet-base-dot-v1
sentence-transformers
768 dims. Fine-tuned specifically for question-answer retrieval (asymmetric similarity). Trained on question-answer pairs so that short queries land near the longer passages that answer them, and designed for dot-product scoring. Best for Q&A search.
Use when: semantic search over documents
🌎
E5 / GTE / BGE family
HuggingFace Hub
State-of-the-art open-source models (2023–24). BGE-large-en-v1.5 rivals commercial APIs. Excellent MTEB leaderboard scores. GPU-friendly inference.
Use when: SOTA open-source required

Section 08

Symmetric vs Asymmetric Semantic Similarity

Not all similarity tasks are the same. The type of task determines which model and approach to use. This is one of the most overlooked distinctions in practice.

🔄 Symmetric Similarity
Sentence A | Sentence B | Relation
"I love cats" | "I adore felines" | Paraphrase
"He quit his job" | "He resigned" | Paraphrase
"Rain is falling" | "It's raining outside" | Paraphrase
🔍 Asymmetric Similarity (Q&A)
Query (Short) | Document (Long) | Task
"headache remedy" | Long medical article about paracetamol, ibuprofen, rest, hydration... | Retrieval
"best Python library for ML" | "scikit-learn provides efficient tools for data mining..." | Search
"capital of Japan" | "Tokyo is the capital and most populous city of Japan..." | Q&A
📄
Which Model to Use?

For symmetric tasks (paraphrase detection, duplicate detection, clustering): use all-mpnet-base-v2 or all-MiniLM-L6-v2. For asymmetric tasks (semantic search, Q&A retrieval): use multi-qa-mpnet-base-dot-v1 or the msmarco family. Using the wrong type can reduce performance by 15–25% on retrieval benchmarks.


Section 09

Semantic Search — Full Working Example

Semantic search is the killer application of sentence embeddings. Pre-compute embeddings for your corpus once, store them, and at query time compute similarity between the query embedding and all stored embeddings. The top-k most similar become your search results.

from sentence_transformers import SentenceTransformer, util
import torch

model = SentenceTransformer('all-MiniLM-L6-v2')

# Corpus: articles on a medical FAQ site
corpus = [
    "Paracetamol reduces fever and mild to moderate pain. Take 500mg every 4–6 hours.",
    "Ibuprofen is a non-steroidal anti-inflammatory drug. Effective for headaches and inflammation.",
    "Staying hydrated helps the immune system fight infections and speeds recovery.",
    "Antibiotics treat bacterial infections but have no effect on viruses like the common cold.",
    "Regular exercise reduces cardiovascular disease risk and improves mental health.",
    "Diabetes type 2 can often be managed with diet, exercise, and weight loss alone.",
    "Vitamin D deficiency is linked to depression, fatigue, and weakened immunity.",
    "Sleep deprivation impairs cognitive function and weakens immune response.",
]

# Encode corpus ONCE and store (in production, save to disk or vector DB)
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)

# User query — note completely different words from any corpus sentence
query = "What can I take for a headache and high temperature?"
query_embedding = model.encode(query, convert_to_tensor=True)

# Find top-3 most semantically similar corpus entries
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=3)

print(f"Query: '{query}'\n")
print("Top 3 Results:")
for i, hit in enumerate(hits[0], 1):
    idx = hit['corpus_id']
    score = hit['score']
    print(f"  {i}. [Score: {score:.4f}] {corpus[idx]}")
OUTPUT
Query: 'What can I take for a headache and high temperature?'

Top 3 Results:
  1. [Score: 0.6821] Paracetamol reduces fever and mild to moderate pain. Take 500mg every 4–6 hours.
  2. [Score: 0.6249] Ibuprofen is a non-steroidal anti-inflammatory drug. Effective for headaches and inflammation.
  3. [Score: 0.2103] Staying hydrated helps the immune system fight infections and speeds recovery.
🌟
Why This Is Remarkable

The query mentions "headache" (which maps to ibuprofen correctly) and "high temperature" (which maps to paracetamol / fever correctly), and both are retrieved at the top despite different terminology. "Fever" and "high temperature" share no words, yet the model treats them as clinically equivalent. This is the direct, practical value of semantic similarity.
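Conceptually, the search step is just "score the query against every stored vector and keep the best k". The sketch below reproduces that idea in plain Python with hypothetical 3-dimensional toy vectors; it illustrates the principle and is not the library's actual implementation, which works on batched tensors.

```python
import heapq
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def top_k(query_vec, corpus_vecs, k=3):
    """Return [(corpus_id, score)] for the k highest cosine scores."""
    scored = ((cosine(query_vec, vec), idx) for idx, vec in enumerate(corpus_vecs))
    best = heapq.nlargest(k, scored)  # highest score first
    return [(idx, score) for score, idx in best]

# Hypothetical toy embeddings; real ones would come from model.encode().
corpus_vecs = [
    [0.9, 0.1, 0.0],  # doc 0
    [0.8, 0.3, 0.1],  # doc 1
    [0.0, 0.2, 0.9],  # doc 2
    [0.1, 0.9, 0.2],  # doc 3
]
query_vec = [1.0, 0.2, 0.0]

results = top_k(query_vec, corpus_vecs, k=2)
for corpus_id, score in results:
    print(f"doc {corpus_id}: {score:.4f}")
```

Because every stored vector is scored, the cost grows linearly with corpus size, which is fine for thousands of documents and becomes the bottleneck at millions.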


Section 10

Story: The Customer Support Revolution

From 48-Hour Wait to 3 Seconds
A SaaS company had a support team drowning in tickets. 400 new tickets every day. Average resolution time: 48 hours. Customer satisfaction: 3.1 / 5. Staff were burning out.

A data scientist embedded every resolved ticket's question and solution into a vector store. 58,000 resolved tickets → 58,000 embedding pairs. New incoming tickets were embedded at arrival and the top 5 most semantically similar resolved tickets were retrieved instantly.

Customer types: "My dashboard keeps freezing when I open the analytics tab."
The system retrieved: "UI hangs when loading heavy chart in reporting section" — solved with a browser cache clear and one config flag. Same problem, entirely different words.

The support agent now reviews 5 suggested solutions instead of searching from scratch. Average resolution time: 11 minutes. Customer satisfaction: 4.6 / 5. Staff headcount needed: reduced by 40%.

The model was all-MiniLM-L6-v2. Total model size: 80MB. Total compute cost: $12/month on a small cloud instance.

Section 11

Paraphrase Detection — Duplicate Question Finder

One of the cleanest use cases for semantic similarity is finding duplicate questions in a knowledge base or forum. The goal: two questions that ask the same thing in different words should be flagged as duplicates (think Stack Overflow, Quora, or internal wikis).

from sentence_transformers import SentenceTransformer, util
import itertools

model = SentenceTransformer('all-mpnet-base-v2')

questions = [
    "How do I reset my password?",
    "I forgot my login credentials, how can I recover access?",
    "What is the refund policy?",
    "Can I get my money back if I cancel?",
    "How do I change my billing information?",
    "Where can I update my credit card details?",
    "The app keeps crashing on startup.",
    "Application fails to launch every time I open it.",
]

# paraphrase_mining encodes the questions itself and returns [score, i, j]
# triplets sorted by descending score; keep only the pairs above the threshold
THRESHOLD = 0.75
pairs = [p for p in util.paraphrase_mining(model, questions) if p[0] > THRESHOLD]

print(f"Duplicate question pairs (similarity > {THRESHOLD}):\n")
for score, i, j in pairs:
    print(f"  Score: {score:.4f}")
    print(f"    Q1: '{questions[i]}'")
    print(f"    Q2: '{questions[j]}'\n")
OUTPUT
Duplicate question pairs (similarity > 0.75):

  Score: 0.9102
    Q1: 'The app keeps crashing on startup.'
    Q2: 'Application fails to launch every time I open it.'

  Score: 0.8923
    Q1: 'How do I reset my password?'
    Q2: 'I forgot my login credentials, how can I recover access?'

  Score: 0.8784
    Q1: 'How do I change my billing information?'
    Q2: 'Where can I update my credit card details?'

  Score: 0.8611
    Q1: 'What is the refund policy?'
    Q2: 'Can I get my money back if I cancel?'

Section 12

Clustering Sentences by Topic

When you have a large number of sentences (customer reviews, support tickets, survey responses), semantic clustering groups them automatically by topic — no labels required. This is unsupervised topic discovery at scale.

from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
import numpy as np

model = SentenceTransformer('all-MiniLM-L6-v2')

# Customer reviews — mixed topics
reviews = [
    "Delivery was incredibly fast, arrived next day!",
    "Shipping took only 24 hours, very impressed.",
    "Package arrived damaged, very disappointed.",
    "Product was broken when I opened the box.",
    "Customer service team was very helpful and responsive.",
    "Support agent resolved my issue in minutes.",
    "The quality of the material feels cheap.",
    "Build quality is poor, fell apart after one week.",
]

embeddings = model.encode(reviews)

# Cluster into 4 groups (shipping / damage / support / quality)
kmeans = KMeans(n_clusters=4, random_state=42, n_init=10)
labels = kmeans.fit_predict(embeddings)

print("Cluster Assignments:\n")
for cluster_id in range(4):
    print(f"  Cluster {cluster_id + 1}:")
    for i, label in enumerate(labels):
        if label == cluster_id:
            print(f"    - {reviews[i]}")
    print()
OUTPUT
Cluster Assignments:

  Cluster 1:
    - Delivery was incredibly fast, arrived next day!
    - Shipping took only 24 hours, very impressed.

  Cluster 2:
    - Package arrived damaged, very disappointed.
    - Product was broken when I opened the box.

  Cluster 3:
    - Customer service team was very helpful and responsive.
    - Support agent resolved my issue in minutes.

  Cluster 4:
    - The quality of the material feels cheap.
    - Build quality is poor, fell apart after one week.
🌟
When to Use HDBSCAN Instead of K-Means

K-Means requires you to specify the number of clusters upfront. For exploring unknown data, use HDBSCAN (pip install hdbscan): it discovers the number of clusters automatically and handles noise points (reviews that don't fit any cluster). Running HDBSCAN on L2-normalized embeddings, where Euclidean distance ranks pairs the same way cosine does, is a common choice for semantic clustering in production.


Section 13

Scaling Up — Vector Databases

For small corpora (< 100,000 sentences), computing cosine similarity against all embeddings at query time is fast enough. For millions of entries, you need a vector database that enables approximate nearest-neighbour (ANN) search in milliseconds.

📷
FAISS
Meta / Open Source
CPU and GPU. Billion-scale. Most performant. Harder to set up. Best for on-premise deployments.
🔲
Chroma
Open Source
Easiest to embed in Python apps. Auto-embeds text with built-in models. Ideal for prototypes and small to medium scale.
🚀
Pinecone
Managed Cloud
Fully managed. REST API. Scales to billions of vectors. No infrastructure management. Pay-per-query model.
🌎
Weaviate / Qdrant
Open Source + Cloud
Full-featured vector DBs. Filtering, metadata, hybrid search. Strong for production RAG (Retrieval-Augmented Generation) pipelines.
# FAISS: fast vector similarity search for large corpora
# pip install faiss-cpu   (or faiss-gpu for GPU)

import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')
DIM = 384  # embedding dimension of this model

# Step 1: Build the index (do this once, then save it)
# IndexFlatIP is an exact, brute-force index; inner product equals cosine similarity for L2-normalised vectors
index = faiss.IndexFlatIP(DIM)

corpus = [
    "Python is a high-level general-purpose programming language.",
    "Machine learning enables computers to learn from data without explicit programming.",
    "Deep neural networks have revolutionised image recognition tasks.",
    "Natural language processing allows computers to understand human text.",
    "The Eiffel Tower is located in Paris, France.",
]

corpus_emb = model.encode(corpus, normalize_embeddings=True).astype(np.float32)
index.add(corpus_emb)  # Add vectors to index

# Step 2: Query at runtime (IndexFlatIP scans every vector; for millions of vectors consider an approximate index such as IndexIVFFlat or IndexHNSWFlat)
query = "How do machines learn from examples?"
query_emb = model.encode([query], normalize_embeddings=True).astype(np.float32)

scores, indices = index.search(query_emb, k=3)  # top-3 results

print(f"Query: '{query}'\n")
for rank, (score, idx) in enumerate(zip(scores[0], indices[0]), 1):
    print(f"  {rank}. [{score:.4f}] {corpus[idx]}")
OUTPUT
Query: 'How do machines learn from examples?'

  1. [0.7831] Machine learning enables computers to learn from data without explicit programming.
  2. [0.5012] Deep neural networks have revolutionised image recognition tasks.
  3. [0.4734] Natural language processing allows computers to understand human text.
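IndexFlatIP compares the query with every stored vector, so it is exact rather than approximate. Approximate indexes gain their speed by looking at only part of the corpus. The toy sketch below illustrates one such idea, inverted-file (IVF) partitioning: vectors are grouped around centroids and a query scans only the closest group(s). The 2-dimensional data, hand-picked centroids, and function names are all assumptions for illustration; this is not how FAISS is implemented internally, and real IVF indexes learn their centroids with k-means.

```python
import math

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def build_buckets(vectors, centroids):
    """Put each vector id into the bucket of its highest-scoring centroid."""
    buckets = [[] for _ in centroids]
    for vid, vec in enumerate(vectors):
        nearest = max(range(len(centroids)), key=lambda c: dot(vec, centroids[c]))
        buckets[nearest].append(vid)
    return buckets

def ann_search(query, vectors, centroids, buckets, k=2, nprobe=1):
    """Score only the vectors stored in the nprobe buckets closest to the query."""
    order = sorted(range(len(centroids)), key=lambda c: dot(query, centroids[c]), reverse=True)
    candidates = [vid for c in order[:nprobe] for vid in buckets[c]]
    scored = sorted(((dot(query, vectors[vid]), vid) for vid in candidates), reverse=True)
    return [vid for _, vid in scored[:k]], len(candidates)

# Hypothetical 2-d data and hand-picked centroids.
vectors = [normalize(v) for v in ([1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9], [-1.0, 0.1])]
centroids = [normalize(c) for c in ([1.0, 0.0], [0.0, 1.0], [-1.0, 0.0])]
buckets = build_buckets(vectors, centroids)

query = normalize([1.0, 0.05])
approx_ids, scanned = ann_search(query, vectors, centroids, buckets, k=2, nprobe=1)
exact_ids, scanned_all = ann_search(query, vectors, centroids, buckets, k=2, nprobe=len(centroids))
print(approx_ids, scanned)     # [0, 1] 2
print(exact_ids, scanned_all)  # [0, 1] 5
```

Here the approximate search finds the same top-2 while scoring 2 vectors instead of 5. The trade-off is that a true neighbour sitting in an unprobed bucket would be missed, which is why such indexes expose a knob like nprobe to trade recall for speed.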

Section 14

Cross-Encoder vs Bi-Encoder — The Accuracy Trade-off

All models we've used so far are bi-encoders: they embed sentences independently, then compare. Faster, but less accurate. A cross-encoder takes both sentences simultaneously for maximum accuracy — at the cost of speed.

Property | Bi-Encoder (SBERT) | Cross-Encoder (BERT Reranker)
Input | One sentence at a time | Both sentences together
Output | Embedding vector per sentence | Single similarity score
Speed | Very fast (pre-compute embeddings) | Slow: one full forward pass per query-document pair
Scalability | Millions of documents | Limited to small candidate sets
Accuracy | Very good | Higher: full attention over both inputs
Best use case | First-stage retrieval (recall) | Second-stage reranking (precision)
💡
The Two-Stage Pipeline (Industry Best Practice)

Stage 1 — Bi-Encoder Retrieval: Retrieve top-100 candidates from millions of documents in milliseconds using FAISS.
Stage 2 — Cross-Encoder Reranking: Run the 100 candidates through a cross-encoder that reads both the query and candidate simultaneously, producing a precise reranked top-10. This pipeline gets you near-cross-encoder accuracy at bi-encoder speed.
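A rough back-of-the-envelope count of transformer forward passes shows why the two-stage design wins. The sketch below ignores batching and the cost of the vector search itself, and the corpus and query sizes are made-up illustration values.

```python
def encoder_passes(num_docs: int, num_queries: int, rerank_depth: int = 100) -> dict[str, int]:
    """Count transformer forward passes under each strategy (batching ignored)."""
    return {
        # Corpus is embedded once; each query adds one more pass.
        "bi_encoder_only": num_docs + num_queries,
        # Every (query, document) pair needs its own pass.
        "cross_encoder_only": num_docs * num_queries,
        # Bi-encoder retrieval plus one cross-encoder pass per retrieved candidate.
        "two_stage": num_docs + num_queries + num_queries * min(rerank_depth, num_docs),
    }

costs = encoder_passes(num_docs=1_000_000, num_queries=1_000)
for name, passes in costs.items():
    print(f"{name:>20}: {passes:,}")
# bi_encoder_only: 1,001,000 / cross_encoder_only: 1,000,000,000 / two_stage: 1,101,000
```

Under these assumptions, reranking the top 100 adds about 10% to the bi-encoder cost, while running the cross-encoder over everything costs roughly a thousand times more.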

from sentence_transformers import SentenceTransformer, CrossEncoder, util

# Stage 1: Bi-encoder for fast retrieval
bi_encoder = SentenceTransformer('multi-qa-MiniLM-L6-cos-v1')

# Stage 2: Cross-encoder for accurate reranking
cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

corpus = [
    "Python can be used for machine learning and data science.",
    "Java is an object-oriented language primarily used in enterprise systems.",
    "scikit-learn provides tools for supervised and unsupervised learning in Python.",
    "PyTorch is a deep learning framework used for neural network training.",
    "TensorFlow was developed by Google and powers production AI systems worldwide.",
    "NumPy provides efficient array operations for numerical computing.",
]

query = "best Python tools for AI"

# Stage 1: Get top-3 candidates quickly
corpus_emb = bi_encoder.encode(corpus, convert_to_tensor=True)
query_emb  = bi_encoder.encode(query,  convert_to_tensor=True)
hits = util.semantic_search(query_emb, corpus_emb, top_k=3)[0]

candidates = [corpus[hit['corpus_id']] for hit in hits]

# Stage 2: Rerank with cross-encoder
pairs = [[query, c] for c in candidates]
rerank_scores = cross_encoder.predict(pairs)

ranked = sorted(zip(rerank_scores, candidates), reverse=True)
print("Reranked Results:")
for score, doc in ranked:
    print(f"  [{score:.4f}] {doc}")
OUTPUT
Reranked Results:
  [7.2341] scikit-learn provides tools for supervised and unsupervised learning in Python.
  [6.9812] PyTorch is a deep learning framework used for neural network training.
  [5.4221] Python can be used for machine learning and data science.
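Notice that these rerank scores are well above 1, so they are not cosine similarities. Assuming they are raw, unbounded logits (which is what values like these suggest), a sigmoid maps them into the 0 to 1 range for display without changing the ranking. This is a minimal sketch using the three scores printed above.

```python
import math

def sigmoid(x: float) -> float:
    """Squash an unbounded score into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

# Raw scores copied from the reranked output above.
raw_scores = [7.2341, 6.9812, 5.4221]
probabilities = [sigmoid(s) for s in raw_scores]

for raw, p in zip(raw_scores, probabilities):
    print(f"raw={raw:.4f} -> {p:.4f}")
```

Since sigmoid is monotonic, sort by either value. Treat the squashed numbers as relative relevance, not calibrated probabilities, unless you have validated them against labelled data.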

Section 15

Real-World Applications at a Glance

🔍
Semantic Search
Replace keyword search in internal wikis, documentation, and product databases. Query in natural language, retrieve by meaning.
Stack Overflow, Notion, Confluence, enterprise search
🤖
RAG (Retrieval-Augmented Generation)
Power chatbots by retrieving relevant context from a vector DB before generating an answer. Grounds LLMs in real documents.
LangChain, LlamaIndex, GPT+knowledge base
📊
Duplicate Detection
Identify near-identical tickets, questions, or product listings regardless of wording. Reduces support queue size dramatically.
Quora, Stack Overflow, helpdesk systems
😄
Sentiment & Intent Classification
Embed user messages and use KNN or a lightweight classifier on top of embeddings to detect intent, sentiment, or category.
Chatbots, IVR routing, social media monitoring
🌟
Recommendation Systems
Recommend articles, products, or courses based on semantic similarity between item descriptions and user preferences.
Netflix, Spotify, e-learning platforms
🌎
Cross-Lingual Retrieval
Query in English, retrieve results in Hindi, French, or Spanish. Multilingual models map all languages to a shared embedding space.
Global customer support, multilingual legal search
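The "KNN on top of embeddings" idea from the classification card can be sketched in a few lines: embed the labelled examples once, then let the nearest neighbours of a new message vote. The 3-dimensional vectors and intent labels below are hypothetical stand-ins for real embeddings and real training data.

```python
import math
from collections import Counter

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def knn_predict(query_vec, labelled, k=3):
    """Majority label among the k labelled vectors most similar to query_vec."""
    ranked = sorted(labelled, key=lambda item: cosine(query_vec, item[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical toy embeddings with hand-assigned intent labels.
labelled = [
    ([0.9, 0.1, 0.0], "billing"),
    ([0.8, 0.2, 0.1], "billing"),
    ([0.7, 0.3, 0.0], "billing"),
    ([0.1, 0.9, 0.1], "bug_report"),
    ([0.0, 0.8, 0.3], "bug_report"),
    ([0.2, 0.9, 0.0], "bug_report"),
]

print(knn_predict([0.85, 0.15, 0.05], labelled))  # billing
print(knn_predict([0.10, 0.85, 0.20], labelled))  # bug_report
```

The appeal of this design is that adding a new intent only requires adding labelled examples; the embedding model itself is never retrained.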

Section 16

Golden Rules — Sentence Embeddings in Production

🧠 Semantic Similarity — Non-Negotiable Rules
1
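Always L2-normalize embeddings before using dot product as cosine similarity. When inner product stands in for cosine (as with FAISS IndexFlatIP), normalize with normalize_embeddings=True in model.encode(). Un-normalized vectors turn the dot product into a partial magnitude comparison rather than a pure similarity score. The exception is a model trained for raw dot-product scoring, such as multi-qa-mpnet-base-dot-v1, which should be scored the way its model card specifies.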
2
Use an asymmetric model for Q&A retrieval. Using a symmetric model (all-MiniLM-L6-v2) for semantic search gives acceptable but suboptimal results. Switch to multi-qa-mpnet-base-dot-v1 or the msmarco family for retrieval tasks.
3
Pre-compute and cache corpus embeddings. Never re-embed the same corpus on every query. Encode once, then store to disk (np.save) or a vector DB. Only the query embedding should be computed at runtime.
4
Don't embed text longer than the model's max sequence length. Most SBERT models cap input at a few hundred tokens (for example, 256 for all-MiniLM-L6-v2 and 384 for all-mpnet-base-v2). Text beyond the limit is silently truncated. For long documents, chunk into paragraphs, embed each chunk separately, and either index the chunks individually or combine them with max-pooling or mean-pooling of chunk embeddings.
5
Calibrate your similarity threshold per task. A 0.75 threshold for paraphrase detection is not the same as a useful threshold for document retrieval. Validate thresholds against labelled data for your specific domain. What counts as "similar" varies drastically across medical, legal, conversational, and technical corpora.
6
Use a two-stage pipeline at scale. Bi-encoder for top-100 retrieval, cross-encoder for top-10 reranking. This is the industry-standard architecture that balances speed and accuracy. Never run a cross-encoder over a full corpus — it does not scale.
7
Benchmark on MTEB before choosing a model. The Massive Text Embedding Benchmark (huggingface.co/spaces/mteb/leaderboard) provides independent, task-specific accuracy scores for hundreds of models. Don't pick a model from blog posts — look up your task type on MTEB and choose the highest-scoring model that fits your latency and memory budget.
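Rule 5 is easy to put into practice. Given a small validation set of sentence pairs with cosine scores and human duplicate labels, sweep the candidate thresholds and keep the one with the best F1. The scores and labels below are invented for illustration, and F1 is only one possible selection criterion; a task that punishes false positives harder might optimize precision instead.

```python
def f1_at_threshold(scores, labels, threshold):
    """F1 when every pair scoring at or above the threshold is predicted duplicate."""
    predicted = [s >= threshold for s in scores]
    tp = sum(p and y for p, y in zip(predicted, labels))
    fp = sum(p and not y for p, y in zip(predicted, labels))
    fn = sum((not p) and y for p, y in zip(predicted, labels))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def best_threshold(scores, labels):
    """Try each observed score as a cut-off and return the one with the highest F1."""
    candidates = sorted(set(scores))
    return max(candidates, key=lambda t: f1_at_threshold(scores, labels, t))

# Hypothetical validation pairs: cosine score and human "is duplicate" label.
scores = [0.92, 0.88, 0.81, 0.77, 0.74, 0.69, 0.55, 0.40]
labels = [True, True, True, False, True, False, False, False]

threshold = best_threshold(scores, labels)
print(threshold, round(f1_at_threshold(scores, labels, threshold), 4))  # 0.74 0.8889
```

On this toy data the best cut-off is 0.74, slightly below the 0.75 used earlier, which is exactly the kind of domain-specific shift the rule warns about.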