Maximum Likelihood Estimation: MLE Explained with Examples

Section 01

The Detective and the Evidence 🔍

Imagine a detective who arrives at a crime scene. She cannot go back in time to see what happened — the event is over. But she has evidence: footprints, a broken lock, a timestamp on a security camera. Her job is to reason backwards from what she observes to the most plausible explanation of what caused it.

This is exactly what Maximum Likelihood Estimation does with data. You have observed data — a set of measurements, outcomes, counts, or readings. You cannot see the underlying process that generated them, but you know it follows some probability distribution with unknown parameters. MLE asks: "Which parameter values make this observed data most probable?" The answer is the Maximum Likelihood Estimate.

MLE is among the most widely used parameter estimation techniques in statistics and machine learning. It underpins logistic regression, linear regression (from a probabilistic view), neural network training, survival analysis, time series models, and much of modern statistical inference. Understanding it deeply changes how you see every model you build.

💡

Probability vs Likelihood — The Core Distinction

Probability: Given fixed parameters θ, what is the probability of observing data x? → P(X = x | θ). Parameters are known; data is uncertain.

Likelihood: Given fixed observed data x, how plausible is each value of θ? → L(θ | x). Data is fixed (already observed); parameters are uncertain. The likelihood function reverses the direction of reasoning.

Section 02

The Likelihood Function 📐

The Story: The Biased Coin Inspector

Arjun is a quality inspector at a coin factory. He suspects a newly minted coin is biased — not perfectly fair. He cannot look inside the coin to measure its bias directly. Instead, he flips it 10 times and records the results: H, H, T, H, H, T, H, H, H, T — that is, 7 heads and 3 tails. He wants to use this data to estimate the true probability p of heads.

For each candidate value of p (0.1, 0.2, …, 0.9, 1.0), Arjun asks: "If the true bias were this value of p, how probable is it that I would observe exactly 7 heads in 10 flips?" The function that maps each p to this probability is the Likelihood Function.

Likelihood Function

L(θ | x) = P(X = x | θ)

The likelihood of parameter θ given observed data x. Mathematically the same expression as the probability, but viewed as a function of θ with x fixed — not a function of x with θ fixed.

For i.i.d. Observations

L(θ|x₁…xₙ) = Π P(xᵢ|θ)

When observations are independent and identically distributed (i.i.d.), the joint likelihood is the product of individual probabilities. Each data point contributes a multiplicative factor.

Log-Likelihood (ℓ)

ℓ(θ) = Σ log P(xᵢ|θ)

Taking the logarithm converts products to sums — easier to compute and maximise analytically. Since log is monotone increasing, maximising ℓ(θ) gives the same θ̂ as maximising L(θ).

MLE Estimator

θ̂ = argmax_θ L(θ|x)

The MLE is the parameter value that maximises the likelihood function. Equivalently: θ̂ = argmax_θ ℓ(θ). Found analytically (calculus) or numerically (gradient ascent).

Section 03

Computing the Likelihood — Step by Step 🧮

🧮 Coin Flip — Building the Likelihood Function

Observed Data

10 flips: 7 Heads, 3 Tails. The data is fixed. We vary the parameter p ∈ [0, 1].

Model

Each flip ~ Bernoulli(p). Flips are independent. So 10 flips ~ Binomial(n=10, p).
P(7 heads in 10 flips | p) = C(10,7) × p⁷ × (1−p)³

L(p = 0.1)

C(10,7) × (0.1)⁷ × (0.9)³ = 120 × 0.0000001 × 0.729 = 0.0000087 — very unlikely!

L(p = 0.5)

C(10,7) × (0.5)⁷ × (0.5)³ = 120 × 0.0078125 × 0.125 = 0.1172 — reasonable.

L(p = 0.7)

C(10,7) × (0.7)⁷ × (0.3)³ = 120 × 0.0823543 × 0.027 = 0.2668 — much higher!

L(p = 0.9)

C(10,7) × (0.9)⁷ × (0.1)³ = 120 × 0.4782969 × 0.001 = 0.0574 — drops again.

Pattern

Likelihood rises from p=0.1, peaks somewhere near p=0.7, then falls again toward p=1.0. The peak is the Maximum Likelihood Estimate. Let's find it analytically.
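The hand calculations above can be reproduced with a short script. This is a minimal sketch using only the Python standard library; the function name binomial_likelihood and the grid of candidate values are illustrative choices, not part of the original walkthrough.

```python
from math import comb

def binomial_likelihood(p, n=10, k=7):
    """L(p) = C(n, k) * p^k * (1 - p)^(n - k) for fixed data (n, k)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Candidate values p = 0.1, 0.2, ..., 1.0; the data stay fixed at 7 heads in 10 flips.
grid = [i / 10 for i in range(1, 11)]
likelihoods = {p: binomial_likelihood(p) for p in grid}

for p, value in likelihoods.items():
    print(f"p = {p:.1f}  L(p) = {value:.7f}")

# The grid point with the largest likelihood is a coarse approximation of the MLE.
best_p = max(likelihoods, key=likelihoods.get)
print("grid maximiser:", best_p)
```

A grid like this only locates the peak to the nearest grid step, which is why the next section finds it exactly with calculus.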

Section 04

Maximising the Likelihood — Calculus Approach 🎓

Finding the MLE Analytically

To find the exact peak of the likelihood function, we take its derivative with respect to the parameter θ, set it to zero, and solve. Because products are harder to differentiate than sums, we first take the logarithm — creating the log-likelihood function ℓ(θ). Since log is a monotonically increasing function, the θ that maximises L(θ) also maximises ℓ(θ).

🧮 MLE — Binomial Parameter (Coin Flip Proof)

Likelihood

L(p) = C(n,k) × pᵏ × (1−p)ⁿ⁻ᵏ where n=10, k=7

Log-Likelihood

ℓ(p) = log C(n,k) + k·log(p) + (n−k)·log(1−p)
ℓ(p) = constant + 7·log(p) + 3·log(1−p)

Differentiate

dℓ/dp = 7/p − 3/(1−p)

Set to Zero

7/p − 3/(1−p) = 0
7(1−p) = 3p
7 − 7p = 3p
7 = 10p

MLE Result

p̂ = 7/10 = 0.70
The MLE for the binomial parameter is simply k/n — the observed proportion of successes. This confirms our intuition: the most likely bias is exactly the fraction of heads we saw.

Verify Max

d²ℓ/dp² = −k/p² − (n−k)/(1−p)² < 0 always → confirms p̂=0.7 is a maximum, not a minimum.

✅

General Binomial MLE — Always k/n

The proof above is completely general. For any Binomial(n, p) model with k observed successes in n trials, the MLE is always p̂ = k/n. This elegant result means the MLE for a proportion is simply the observed relative frequency — confirming that MLE gives the most natural, intuitive estimate for this case.
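As a quick numerical check of the derivation, the sketch below evaluates the first and second derivatives of the binomial log-likelihood at p̂ = k/n. The helper names score and curvature are illustrative labels for dℓ/dp and d²ℓ/dp².

```python
def score(p, n, k):
    """First derivative of the binomial log-likelihood: k/p - (n - k)/(1 - p)."""
    return k / p - (n - k) / (1 - p)

def curvature(p, n, k):
    """Second derivative: -k/p^2 - (n - k)/(1 - p)^2."""
    return -k / p**2 - (n - k) / (1 - p) ** 2

n, k = 10, 7
p_hat = k / n

# The score should vanish (up to floating-point rounding) at the MLE,
# be positive just below it and negative just above it.
print("score at p_hat:      ", score(p_hat, n, k))
print("score at p_hat - 0.05:", score(p_hat - 0.05, n, k))
print("score at p_hat + 0.05:", score(p_hat + 0.05, n, k))
print("curvature at p_hat:  ", curvature(p_hat, n, k))
```

A positive slope to the left, a negative slope to the right, and negative curvature at the stationary point together indicate a maximum.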

Section 05

MLE for the Normal Distribution 📊

The Story: Estimating Average Human Reaction Time

A neuroscientist measures the reaction times (in milliseconds) of 5 subjects in a stimulus experiment. The data: 240, 255, 261, 248, 271 ms. She assumes reaction times follow a Normal distribution N(μ, σ²) — but she doesn't know the true mean μ or the true variance σ². She uses MLE to estimate both parameters simultaneously from her 5 observations.

🧮 MLE for Normal Distribution — Deriving μ̂ and σ̂²

Data

x = {240, 255, 261, 248, 271}, n = 5

Normal PDF

f(xᵢ | μ, σ²) = [1/√(2πσ²)] × exp[−(xᵢ−μ)²/(2σ²)]

Joint Likelihood

L(μ,σ²) = Π f(xᵢ|μ,σ²) = (2πσ²)^(−n/2) × exp[−Σ(xᵢ−μ)²/(2σ²)]

Log-Likelihood

ℓ(μ,σ²) = −(n/2)log(2π) − (n/2)log(σ²) − Σ(xᵢ−μ)²/(2σ²)

∂ℓ/∂μ = 0

Σ(xᵢ−μ)/σ² = 0 → Σxᵢ = nμ
μ̂ = (1/n)Σxᵢ = x̄ — the sample mean!

∂ℓ/∂σ² = 0

−n/(2σ²) + Σ(xᵢ−μ)²/(2σ⁴) = 0 → σ² = (1/n)Σ(xᵢ−μ)²
Substituting μ̂ = x̄ for μ: σ̂² = (1/n)Σ(xᵢ−x̄)², the biased sample variance!

Compute μ̂

μ̂ = (240+255+261+248+271)/5 = 1275/5 = 255.0 ms

Compute σ̂²

Deviations from mean: −15, 0, 6, −7, 16
Squared: 225, 0, 36, 49, 256 → Sum = 566
σ̂² = 566/5 = 113.2 ms² | σ̂ = √113.2 = 10.64 ms
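The same numbers can be checked in a few lines of Python. This is a sketch using the five reaction times above; the function normal_log_likelihood is an illustrative helper that implements the log-likelihood formula from this section, and the two perturbed parameter values are arbitrary comparison points.

```python
from math import log, pi, sqrt
from statistics import fmean

reaction_ms = [240, 255, 261, 248, 271]
n = len(reaction_ms)

mu_hat = fmean(reaction_ms)                                   # sample mean
sigma2_hat = sum((x - mu_hat) ** 2 for x in reaction_ms) / n  # divisor n (MLE)
sigma_hat = sqrt(sigma2_hat)

def normal_log_likelihood(mu, sigma2, data):
    """l(mu, sigma^2) = -(n/2) log(2 pi) - (n/2) log(sigma^2) - sum((x - mu)^2) / (2 sigma^2)."""
    m = len(data)
    sse = sum((x - mu) ** 2 for x in data)
    return -0.5 * m * log(2 * pi) - 0.5 * m * log(sigma2) - sse / (2 * sigma2)

print(mu_hat, sigma2_hat, round(sigma_hat, 2))

# The log-likelihood at the MLE should exceed the value at nearby parameters.
ll_at_mle = normal_log_likelihood(mu_hat, sigma2_hat, reaction_ms)
ll_shifted_mean = normal_log_likelihood(mu_hat + 2, sigma2_hat, reaction_ms)
ll_inflated_var = normal_log_likelihood(mu_hat, sigma2_hat * 1.5, reaction_ms)
print(ll_at_mle, ll_shifted_mean, ll_inflated_var)
```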

⚠️

MLE Variance is Biased — An Important Subtlety

The MLE for variance uses divisor n, but the unbiased sample variance uses n−1 (Bessel's correction). The MLE estimator σ̂² = (1/n)Σ(xᵢ−x̄)² is a biased estimator: its expected value is ((n−1)/n)·σ², so on average it underestimates the true population variance. For large n the difference is negligible, but for small samples it matters. This is one case where MLE gives a technically correct but practically suboptimal answer, an important limitation.
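The bias can be made visible with a small simulation. The sketch below is illustrative only: the "true" mean and standard deviation, the number of trials, and the random seed are all assumptions chosen for the demonstration, not values from the experiment above.

```python
import random
from statistics import fmean

random.seed(0)  # fixed seed so the sketch is reproducible

TRUE_MU, TRUE_SIGMA = 255.0, 10.0   # hypothetical "true" population values
n, trials = 5, 20_000

mle_estimates, bessel_estimates = [], []
for _ in range(trials):
    sample = [random.gauss(TRUE_MU, TRUE_SIGMA) for _ in range(n)]
    mean = fmean(sample)
    sse = sum((x - mean) ** 2 for x in sample)
    mle_estimates.append(sse / n)           # MLE: divisor n
    bessel_estimates.append(sse / (n - 1))  # Bessel's correction: divisor n - 1

true_var = TRUE_SIGMA**2
avg_mle = fmean(mle_estimates)
avg_bessel = fmean(bessel_estimates)
print("average MLE variance:     ", avg_mle)
print("average unbiased variance:", avg_bessel)
print("theoretical MLE average:  ", (n - 1) / n * true_var)
```

With n = 5, the divisor-n estimate should average close to 80% of the true variance, while the divisor-(n−1) estimate should average close to the true variance itself.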

MLE for Normal Mean

μ̂ = x̄ = (1/n) Σxᵢ

The sample mean. Intuitive and unbiased. The MLE confirms mathematically what common sense already suggests: use the average of your data.

MLE for Normal Variance

σ̂² = (1/n) Σ(xᵢ − x̄)²

Biased (divides by n, not n−1). In practice, the unbiased estimator s² = Σ(xᵢ−x̄)²/(n−1) is often preferred for small samples.

MLE for Poisson Rate

λ̂ = x̄ = (1/n) Σxᵢ

For count data modelled by Poisson(λ), the MLE is the sample mean — elegant and intuitive. λ is both the mean and variance of the Poisson distribution.

MLE for Exponential Rate

λ̂ = n / Σxᵢ = 1/x̄

For waiting time data modelled by Exponential(λ), the MLE is the reciprocal of the sample mean. The estimated mean waiting time is therefore 1/λ̂ = x̄.

Section 06

Log-Likelihood — Why It Matters 📉

The Story: The Astronomy Data Problem

An astronomer analyses light from 10,000 stars, each providing an independent photon count observation. The likelihood of observing all 10,000 counts simultaneously is the product of 10,000 individual probabilities — each between 0 and 1. On a computer, this product would round to zero due to floating-point underflow before you could even maximise it. The log-likelihood converts this product of tiny numbers into a manageable sum.

🧮 Log-Likelihood — Numerical Example (5 Coin Flips)

Data

5 independent coin flips: H, H, T, H, T (3 heads, 2 tails). p = unknown bias.

Likelihood

L(p) = p × p × (1−p) × p × (1−p) = p³(1−p)²
Each flip multiplies in another factor below 1. With thousands of flips, the product shrinks below the smallest number a floating-point variable can represent → numerical underflow.

Log-Likelihood

ℓ(p) = log[p³(1−p)²] = 3·log(p) + 2·log(1−p)
Products become sums — safe for any sample size.

At p = 0.6

ℓ(0.6) = 3·log(0.6) + 2·log(0.4) = 3×(−0.5108) + 2×(−0.9163) = −1.532 − 1.833 = −3.365

MLE

dℓ/dp = 3/p − 2/(1−p) = 0 → 3(1−p) = 2p → p̂ = 3/5 = 0.60 (as expected: k/n)

📐

Three Reasons to Always Use Log-Likelihood

1. Numerical stability: Products of small probabilities underflow to zero on computers, while sums of log-probabilities stay manageable.
2. Analytical convenience: log converts products (hard to differentiate) into sums (easy to differentiate term by term).
3. Convexity: For many distributions, the negative log-likelihood is convex, guaranteeing a unique global minimum and enabling efficient gradient-based optimisation.
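The underflow problem is easy to reproduce. In the sketch below, the list of 10,000 identical probabilities is a hypothetical stand-in for the astronomer's data, used only to show how the raw product behaves next to the sum of logs.

```python
from math import log, prod

# Hypothetical stand-in for the astronomer's data: 10,000 observations,
# each assigned probability 0.3 under some candidate parameter value.
probabilities = [0.3] * 10_000

raw_likelihood = prod(probabilities)                 # product of tiny factors
log_likelihood = sum(log(p) for p in probabilities)  # sum of logs

print("raw likelihood:", raw_likelihood)   # underflows in double precision
print("log-likelihood:", log_likelihood)   # a large negative but finite number
```

Once the raw likelihood has collapsed to zero, every candidate parameter looks equally bad, so there is nothing left to maximise. The log-likelihood keeps the comparison between candidates intact.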

Section 07

MLE for Logistic Regression 🤖

The Story: Credit Default Prediction

A bank wants to predict whether a loan applicant will default (Y=1) or not (Y=0) based on their income (X). The model is logistic regression: P(Y=1|X) = σ(β₀ + β₁X) where σ is the sigmoid function. The parameters β₀ and β₁ are unknown. There is no closed-form MLE solution here; instead, gradient ascent on the log-likelihood (or descent on the negative log-likelihood) is applied iteratively. This is the training process for logistic regression and, in essence, for neural networks. Linear regression with Gaussian errors is the exception: its MLE has a closed form (ordinary least squares).

Logistic Regression Model

P(Y=1|x) = σ(β₀+β₁x)

σ(z) = 1/(1+e⁻ᶻ) is the sigmoid function mapping any real number to (0,1). The parameters β₀, β₁ define the decision boundary.

Bernoulli Log-Likelihood

ℓ(β) = Σ[yᵢlog(p̂ᵢ) + (1−yᵢ)log(1−p̂ᵢ)]

Each observation contributes: log(p̂) if it's a 1, log(1−p̂) if it's a 0. The negative of this sum, averaged over n, is the binary cross-entropy loss, so maximising ℓ(β) (equivalently, minimising cross-entropy) IS training logistic regression.

Cross-Entropy Loss

Loss = −(1/n) · ℓ(β)

Minimising cross-entropy loss is identical to maximising the log-likelihood. Neural network training minimises cross-entropy = maximises likelihood. MLE IS deep learning training.

Gradient Ascent Update

β ← β + η · ∂ℓ/∂β

No closed form: iterate. Each step moves β in the direction of steepest log-likelihood increase. η is the learning rate. Equivalent to gradient descent on the loss function.

🧮 Logistic Regression — MLE Log-Likelihood Calculation

Mini Dataset

3 loan applicants (simplified):
Person 1: Income X=50k, Defaulted Y=1
Person 2: Income X=80k, Did not default Y=0
Person 3: Income X=40k, Defaulted Y=1

Trial β

Try β₀ = −4, β₁ = 0.06 (i.e., 0.06 per thousand income).
P₁ = σ(−4 + 0.06×50) = σ(−1) = 0.269
P₂ = σ(−4 + 0.06×80) = σ(+0.8) = 0.690
P₃ = σ(−4 + 0.06×40) = σ(−1.6) = 0.168

Log-Likelihood

ℓ(β) = log(P₁) + log(1−P₂) + log(P₃) [Y=1 uses p̂, Y=0 uses 1−p̂]
= log(0.269) + log(0.310) + log(0.168)
= −1.313 + (−1.171) + (−1.784) = −4.268

Interpretation

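A higher (less negative) ℓ means the model fits better. By trying different β values and moving in the direction that increases ℓ (gradient ascent), we approach the MLE estimates β̂₀ and β̂₁. One caveat: this three-person toy set is perfectly separable (both defaulters earn less than the non-defaulter), so ℓ can be pushed ever closer to 0 by ever larger coefficients, and a finite MLE needs overlapping classes or regularisation. In essence this is what sklearn.linear_model.LogisticRegression().fit() does, although by default it uses the L-BFGS solver and adds an L2 penalty, so its default fit is a regularised estimate rather than the plain MLE.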
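The calculation and the update rule can be sketched together. The code below evaluates ℓ at the trial β and then takes a fixed number of gradient ascent steps; the learning rate and the step count are assumptions chosen for this toy data, and the helper names are illustrative. Because the three points are perfectly separable, the sketch only shows ℓ increasing and does not claim convergence to a finite maximiser.

```python
from math import exp, log

# (income in thousands, defaulted?) for the three applicants in the mini dataset
data = [(50.0, 1), (80.0, 0), (40.0, 1)]

def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))

def log_likelihood(b0, b1):
    """Bernoulli log-likelihood: sum of y*log(p) + (1 - y)*log(1 - p)."""
    total = 0.0
    for x, y in data:
        p = sigmoid(b0 + b1 * x)
        total += log(p) if y == 1 else log(1.0 - p)
    return total

def gradient(b0, b1):
    """Partial derivatives of the log-likelihood: sum(y - p) and sum((y - p) * x)."""
    g0 = sum(y - sigmoid(b0 + b1 * x) for x, y in data)
    g1 = sum((y - sigmoid(b0 + b1 * x)) * x for x, y in data)
    return g0, g1

b0, b1 = -4.0, 0.06   # trial values from the text
eta = 1e-4            # assumed learning rate, kept small because x is in the tens
history = [log_likelihood(b0, b1)]
for _ in range(200):
    g0, g1 = gradient(b0, b1)
    b0, b1 = b0 + eta * g0, b1 + eta * g1   # gradient ascent step
    history.append(log_likelihood(b0, b1))

print("initial log-likelihood:", round(history[0], 3))
print("after 200 steps:       ", round(history[-1], 3))
```

Production solvers typically rescale the inputs and use second-order or quasi-Newton updates rather than a fixed step size, which is why they need far fewer iterations.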

Section 08

MLE for Poisson Distribution 📦

The Story: The Call Centre Manager

Priya manages a call centre. Over 8 hours, she records the number of calls per hour: 3, 5, 2, 8, 4, 6, 3, 5. She models call arrivals as a Poisson process with unknown rate λ. She wants the MLE estimate of λ — the true average calls per hour — to staff her centre optimally.

🧮 Poisson MLE — Call Centre Rate Estimation

Data

x = {3, 5, 2, 8, 4, 6, 3, 5}, n = 8. Σxᵢ = 36.

Poisson PMF

P(X=k | λ) = e⁻λ λᵏ / k!

Log-Likelihood

ℓ(λ) = Σᵢ [−λ + xᵢ·log(λ) − log(xᵢ!)]
= −nλ + (Σxᵢ)·log(λ) − Σlog(xᵢ!)
= −8λ + 36·log(λ) − constant

Differentiate

dℓ/dλ = −n + Σxᵢ/λ = −8 + 36/λ

Set to Zero

−8 + 36/λ = 0 → λ = 36/8 = 4.5 calls/hour

General Result

λ̂ = Σxᵢ/n = x̄ = sample mean
The MLE for the Poisson rate is the sample mean — the most intuitive estimate imaginable. MLE confirms common sense.

Business Use

Priya now knows λ̂ = 4.5 calls/hour. She uses P(X > 8) = 1 − Poisson_CDF(8; 4.5) ≈ 4.0% to plan for peak staffing. MLE just powered a real operational decision.
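Both the rate estimate and the tail probability can be computed directly from the Poisson PMF. This is a minimal standard-library sketch; the helper functions poisson_pmf and poisson_cdf are written out here for illustration rather than taken from a statistics package.

```python
from math import exp, factorial

calls_per_hour = [3, 5, 2, 8, 4, 6, 3, 5]
lam_hat = sum(calls_per_hour) / len(calls_per_hour)   # MLE = sample mean

def poisson_pmf(k, lam):
    """P(X = k | lambda) = exp(-lambda) * lambda^k / k!"""
    return exp(-lam) * lam**k / factorial(k)

def poisson_cdf(k, lam):
    """P(X <= k | lambda), summed directly from the PMF."""
    return sum(poisson_pmf(i, lam) for i in range(k + 1))

tail = 1.0 - poisson_cdf(8, lam_hat)   # probability of more than 8 calls in an hour
print("lambda_hat:", lam_hat)
print(f"P(X > 8) = {tail:.4f}")
```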

Section 09

Numerical MLE — When Calculus Fails 🔢

The Story: The Mixture Model

Many real distributions (customer lifetimes, image pixel intensities, gene expression levels) are not simple single distributions. They are mixtures: a blend of two or more components. Setting the derivative of a mixture log-likelihood to zero gives equations with no closed-form solution. Instead, numerical methods are used: gradient ascent, the EM algorithm, or general-purpose numerical optimisation.

Optimisation Method	How It Works	Best For	Used In
Analytical MLE	dℓ/dθ = 0, solve for θ̂	Simple distributions (Normal, Binomial, Poisson)	Statistics textbooks, closed-form solutions
Gradient Ascent	β ← β + η·∂ℓ/∂β iteratively	Smooth, differentiable log-likelihoods	Logistic regression, GLMs
Newton-Raphson	Uses 2nd derivative (Hessian)	Fast convergence near the peak	GLMs, survival models
EM Algorithm	Expectation + Maximisation alternating steps	Latent variable / mixture models	Gaussian Mixture Models (GMMs), HMMs
BFGS / L-BFGS	Quasi-Newton, approximates Hessian	Large parameter spaces	scipy.optimize, ML libraries
SGD / Adam	Stochastic gradient on mini-batches	Very large datasets, deep learning	PyTorch, TensorFlow neural networks
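To make the Newton-Raphson row concrete, the sketch below applies it to the Poisson log-likelihood from Section 08. That problem has a closed-form answer (λ̂ = 4.5), which makes it a convenient check that the iteration lands in the right place. The starting value, tolerance, and iteration cap are assumptions.

```python
calls_per_hour = [3, 5, 2, 8, 4, 6, 3, 5]   # call-centre counts from Section 08
n, total = len(calls_per_hour), sum(calls_per_hour)

def score(lam):
    """dl/d(lambda) = -n + sum(x) / lambda"""
    return -n + total / lam

def hessian(lam):
    """d2l/d(lambda)2 = -sum(x) / lambda^2"""
    return -total / lam**2

lam = 1.0   # deliberately poor starting guess
for step in range(1, 21):
    update = score(lam) / hessian(lam)
    lam -= update   # Newton-Raphson: lambda <- lambda - l'(lambda) / l''(lambda)
    print(f"step {step}: lambda = {lam:.6f}")
    if abs(update) < 1e-10:
        break
```

Newton-Raphson can diverge from a bad starting point, so practical implementations usually add safeguards such as step halving or bounds on the parameter.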

Section 10

Properties of MLE — Why It's So Widely Used ⭐

Consistency

θ̂→θ

As n→∞, θ̂ converges to true θ
More data = more accurate estimate
Holds under standard regularity conditions (via the Law of Large Numbers)
Asymptotically unbiased
Fundamental reliability guarantee

Asymptotic Normality

θ̂ ~ N

For large n, θ̂ is approximately normal
Enables confidence intervals for free
Enables hypothesis tests on parameters
Asymptotic variance = 1/(total Fisher Information)
Central to statistical inference

Efficiency

CRLB

Asymptotically achieves the Cramér-Rao lower bound
The bound is the minimum variance among unbiased estimators
In large samples, no unbiased estimator does better
Asymptotically efficient (large n)
"Best possible" large-sample guarantee

⚠️

Limitations of MLE

Small samples: MLE can be biased for small n (e.g., variance estimator).
Model misspecification: MLE finds the best parameters for your assumed distribution; if the distribution is wrong, the MLE is the best fit to the wrong model.
Overfitting: MLE maximises fit to training data and can overfit without regularisation.
Multiple maxima: Some log-likelihood functions have multiple peaks (they are not concave), so gradient methods may find local, not global, maxima.
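The asymptotic normality property above turns directly into an approximate confidence interval. The sketch below does this for the coin example (7 heads in 10 flips), using the standard result that the Fisher information for n Bernoulli trials is n / (p(1−p)). With n = 10 the large-sample approximation is rough, so treat the interval as illustrative.

```python
from math import sqrt

n, k = 10, 7
p_hat = k / n

# Fisher information for n Bernoulli(p) trials is n / (p (1 - p));
# plugging in p_hat gives the usual large-sample variance estimate.
fisher_information = n / (p_hat * (1 - p_hat))
std_error = sqrt(1 / fisher_information)

z = 1.96   # approximate 97.5th percentile of the standard normal
lower, upper = p_hat - z * std_error, p_hat + z * std_error
print(f"p_hat = {p_hat:.2f}, SE = {std_error:.4f}")
print(f"approximate 95% interval: ({lower:.3f}, {upper:.3f})")
```

The width of the interval reflects how little ten flips can tell us: values of p from roughly 0.42 up to nearly 1 remain plausible.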

Section 11

MLE vs MAP — Adding Prior Knowledge 🧠

MLE finds the parameter that maximises the likelihood of the observed data. Maximum A Posteriori (MAP) estimation adds a prior distribution on the parameters — encoding existing knowledge — and finds the parameter that maximises the posterior P(θ|data) ∝ L(θ|data) × P(θ).

Feature	MLE	MAP
Objective	Maximise L(θ|data)	Maximise L(θ|data)·P(θ)
Uses prior?	No — data only	Yes — data + prior belief
Small data behaviour	Can overfit severely	Prior regularises — more stable
Large data behaviour	Consistent; approaches the true θ	Prior becomes negligible; converges to MLE as n→∞
Equivalent ML concept	No regularisation	L2 regularisation (Gaussian prior)
Log-objective	log L(θ|data)	log L(θ|data) + log P(θ)
Philosophical position	Frequentist	Bayesian
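The small-data and large-data rows can be illustrated with the coin example. The sketch below assumes a Beta(2, 2) prior on p, a mild belief that the coin is close to fair; this prior is an illustrative choice, not something specified in the article. With a Beta(a, b) prior and k successes in n trials, the posterior is Beta(a + k, b + n − k), and its mode gives the MAP estimate.

```python
def mle_estimate(k, n):
    """Binomial MLE: observed proportion k / n."""
    return k / n

def map_estimate(k, n, a=2.0, b=2.0):
    """Mode of the Beta(a + k, b + n - k) posterior; valid when a + k > 1 and b + n - k > 1."""
    return (k + a - 1) / (n + a + b - 2)

# Same 70% head rate at three sample sizes: the prior's pull fades as n grows.
for n, k in [(10, 7), (100, 70), (10_000, 7_000)]:
    print(n, mle_estimate(k, n), round(map_estimate(k, n), 4))
```

At n = 10 the prior pulls the estimate noticeably toward 0.5, while at n = 10,000 the two estimates are nearly identical.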

💡

L2 Regularisation IS MAP with a Gaussian Prior

When you train a neural network or linear regression with L2 regularisation (Ridge), you are implicitly doing MAP estimation with a Gaussian prior on the weights: P(θ) = N(0, σ²). The regularisation term λ||θ||² in the loss function is exactly −log P(θ) up to a constant, with λ = 1/(2σ²). Many penalty-based regularisers in machine learning are MAP estimators in disguise.
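The correspondence can be written out in three lines. This sketch assumes d independent weights, each with the same zero-mean Gaussian prior of variance σ², and a loss defined as the unnormalised negative log-likelihood −ℓ(θ); if the loss is averaged over n observations, λ is rescaled by 1/n.

```latex
\begin{aligned}
P(\theta) &= \prod_{j=1}^{d} \frac{1}{\sqrt{2\pi\sigma^2}}
             \exp\!\left(-\frac{\theta_j^2}{2\sigma^2}\right) \\
-\log P(\theta) &= \frac{1}{2\sigma^2}\,\lVert\theta\rVert^2
                   + \frac{d}{2}\log\!\left(2\pi\sigma^2\right) \\
\hat{\theta}_{\mathrm{MAP}} &= \arg\min_{\theta}
   \left[\, -\ell(\theta) + \lambda\,\lVert\theta\rVert^2 \,\right],
   \qquad \lambda = \frac{1}{2\sigma^2}
\end{aligned}
```

The second term of −log P(θ) does not depend on θ, so it drops out of the optimisation. A tighter prior (smaller σ²) corresponds to a larger penalty λ.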

Section 12

MLE Across Machine Learning — The Unified View 🤖

ML Algorithm	Assumed Distribution	Loss Function (Negative Log-Likelihood, up to constants)	How the MLE Is Found
Linear Regression	Normal errors: ε ~ N(0,σ²)	Mean Squared Error (MSE)	Ordinary Least Squares (OLS)
Logistic Regression	Bernoulli outputs	Binary Cross-Entropy	Gradient ascent on ℓ(β)
Softmax Regression	Categorical outputs	Categorical Cross-Entropy	Gradient ascent multi-class
Naive Bayes	Class-conditional distributions	Joint log-likelihood of labels	Empirical frequency estimates
Gaussian Mixture Model	Mixture of Normals	Mixture log-likelihood	EM Algorithm
Deep Neural Network	Task-dependent (Normal/Bernoulli/Categorical)	MSE or Cross-Entropy	SGD / Adam on −ℓ(θ)
Survival Analysis (Cox)	Hazard function model	Partial log-likelihood	Newton-Raphson iteration
Language Models (LLMs)	Categorical next-token distribution	Cross-entropy over token sequence	SGD on −Σ log P(token|context)

🧮

The Deep Unification

Every row in the table above is the same mathematical operation: find the parameters that maximise the probability of the observed data under an assumed model. Linear regression minimises MSE because under Gaussian error assumptions, minimising MSE IS maximising the likelihood. Neural networks minimise cross-entropy because under categorical output assumptions, minimising cross-entropy IS maximising the likelihood. MLE doesn't just explain statistics — it explains why machine learning training works at all.
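The MSE claim can be checked numerically. The sketch below uses a small made-up dataset and an assumed known error variance, both purely for illustration, and shows that the Gaussian negative log-likelihood equals a constant plus n·MSE/(2σ²), so both are minimised by the same coefficients.

```python
from math import log, pi

# Hypothetical toy data for illustration only (not from the article).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
SIGMA2 = 1.0   # assumed known error variance
n = len(xs)

def mse(b0, b1):
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys)) / n

def gaussian_nll(b0, b1):
    """Negative log-likelihood of y = b0 + b1*x + eps with eps ~ N(0, SIGMA2)."""
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    return 0.5 * n * log(2 * pi * SIGMA2) + sse / (2 * SIGMA2)

# Closed-form ordinary least squares for a single predictor.
x_bar, y_bar = sum(xs) / n, sum(ys) / n
b1_ols = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum((x - x_bar) ** 2 for x in xs)
b0_ols = y_bar - b1_ols * x_bar

# Compare the OLS solution with two perturbed parameter settings.
for b0, b1 in [(b0_ols, b1_ols), (b0_ols + 0.5, b1_ols), (b0_ols, b1_ols - 0.2)]:
    print(f"b0={b0:.3f} b1={b1:.3f}  MSE={mse(b0, b1):.4f}  NLL={gaussian_nll(b0, b1):.4f}")
```

Moving away from the least squares solution raises MSE and the negative log-likelihood together, which is the equivalence the table relies on.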

Section 13

MLE Reference: Common Distributions

Distribution	Parameter(s)	MLE Estimate	Intuition
Bernoulli(p)	p	p̂ = k/n (successes / trials)	Observed proportion
Binomial(n,p)	p	p̂ = k/n	Observed success rate
Poisson(λ)	λ	λ̂ = x̄	Sample mean of counts
Normal(μ, σ²)	μ, σ²	μ̂ = x̄, σ̂² = (1/n)Σ(xᵢ−x̄)²	Sample mean & biased variance
Exponential(λ)	λ	λ̂ = 1/x̄	Reciprocal of mean wait time
Uniform(a,b)	a, b	â = min(xᵢ), b̂ = max(xᵢ)	Observed range (biased!)
Geometric(p)	p	p̂ = 1/x̄	Reciprocal of mean trials
Beta(α,β)	α, β	No closed form; solved numerically	Newton iteration, often started from method-of-moments values

Section 14

The Golden Rules of MLE

🎯 12 Rules Every Data Scientist Must Master

Likelihood is NOT probability — the direction of conditioning is reversed. P(data | θ) treats θ as fixed and data as variable. L(θ | data) treats data as fixed and θ as variable. Both use the same mathematical formula, but the question being asked is entirely different.

Always work with log-likelihood, never raw likelihood. Products of small probabilities cause numerical underflow. Sums of log-probabilities are far more stable numerically. The log transformation preserves the argmax, so you get the same θ̂ either way.

MLE for simple distributions often reduces to a familiar sample statistic. Binomial p̂ = k/n. Normal μ̂ = x̄. Poisson λ̂ = x̄. Exponential λ̂ = 1/x̄. These elegant results confirm that MLE formalises what common sense already suggests.

The Normal MLE for variance is biased: it divides by n, not n−1. For small samples, use the unbiased estimator s² = Σ(xᵢ−x̄)²/(n−1). For large samples (a common rule of thumb is n > 30), the difference is negligible. This is one of MLE's known limitations.

Minimising MSE = maximising likelihood under Gaussian errors. Linear regression's least squares solution is identical to MLE under the assumption that errors are normally distributed. Understanding this connection reveals the probabilistic assumptions hidden in every regression model.

Minimising cross-entropy loss = maximising likelihood for classification. Logistic regression, softmax, and neural network classifiers all minimise cross-entropy — which is exactly −ℓ(θ), the negative log-likelihood under Bernoulli or categorical assumptions. Neural network training IS MLE.

Specify the data-generating model before computing likelihood. MLE finds the best parameters for a given model — but if the model is wrong (wrong distribution family), the MLE is the best fit to the wrong model. Model selection precedes estimation.

L2 regularisation is MAP estimation with a Gaussian prior. Adding λ||θ||² to the loss function is equivalent to placing a Gaussian prior on θ and doing MAP estimation. L1 regularisation corresponds to a Laplace prior. Every regularised model has a Bayesian interpretation.

Verify that d²ℓ/dθ² < 0 at your solution. The first derivative being zero is necessary but not sufficient for a maximum. Always check the second derivative (or Hessian) to confirm the solution is a maximum, not a minimum or saddle point.

For i.i.d. data, the log-likelihood is a sum — exploit this. Σ log P(xᵢ|θ) means each data point contributes additively. This makes MLE decomposable, parallelisable, and allows stochastic gradient methods that process mini-batches instead of the full dataset.

MLE is consistent and asymptotically efficient, but may overfit small samples. For large n and a correctly specified model, MLE is hard to beat. For small n, consider regularisation (MAP) or Bayesian approaches that incorporate prior knowledge to stabilise estimates.

The likelihood ratio is the foundation of hypothesis testing. The likelihood ratio test statistic −2log[L(θ₀)/L(θ̂)] asymptotically follows a chi-square distribution under H₀ (Wilks' theorem), with degrees of freedom equal to the number of parameters fixed by H₀. Wald tests, Score tests, and the AIC/BIC model selection criteria are all built on the likelihood function, so MLE connects parameter estimation to inference.
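The last rule can be illustrated with the coin data from Section 02. The sketch below tests H₀: p = 0.5 against the unrestricted MLE p̂ = 0.7 for 7 heads in 10 flips. It relies on the large-sample chi-square approximation with one degree of freedom, which is only rough at n = 10, and computes the tail probability from the identity P(Z² > x) = erfc(√(x/2)).

```python
from math import erfc, log, sqrt

def log_likelihood(p, n, k):
    """Binomial log-likelihood without the constant log C(n, k), which cancels in the ratio."""
    return k * log(p) + (n - k) * log(1 - p)

n, k = 10, 7
p_null = 0.5    # H0: the coin is fair
p_hat = k / n   # unrestricted MLE

# -2 log[L(p_null) / L(p_hat)] written in terms of log-likelihoods.
lr_stat = -2 * (log_likelihood(p_null, n, k) - log_likelihood(p_hat, n, k))

# Survival function of a chi-square with 1 degree of freedom.
p_value = erfc(sqrt(lr_stat / 2))
print(f"LR statistic = {lr_stat:.4f}, approximate p-value = {p_value:.3f}")
```

An approximate p-value around 0.2 means seven heads in ten flips is not strong evidence against a fair coin, which matches the wide confidence interval such a small sample produces.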

🧮

MLE — The Engine Behind All of Modern Statistics and Machine Learning

From the detective reasoning backwards from evidence to the deep learning system training on billions of tokens — the underlying logic is identical. Observe data. Assume a model. Find the parameters that make the observed data most probable. That is the likelihood function. Maximise it. That is MLE. Every statistical model you fit, every neural network you train, every A/B test you analyse — all of it flows from this one profound idea: the best explanation of data is the one that would have made that data most likely to occur.