The Detective and the Evidence 🔍
Imagine a detective who arrives at a crime scene. She cannot go back in time to see what happened — the event is over. But she has evidence: footprints, a broken lock, a timestamp on a security camera. Her job is to reason backwards from what she observes to the most plausible explanation of what caused it.
This is exactly what Maximum Likelihood Estimation does with data. You have observed data — a set of measurements, outcomes, counts, or readings. You cannot see the underlying process that generated them, but you know it follows some probability distribution with unknown parameters. MLE asks: "Which parameter values make this observed data most probable?" The answer is the Maximum Likelihood Estimate.
MLE is one of the most widely used parameter estimation techniques in statistics and machine learning. It underpins logistic regression, linear regression (from a probabilistic view), neural network training, survival analysis, time series models, and much of modern statistical inference. Understanding it deeply changes how you see every model you build.
Probability: Given fixed parameters θ, what is the probability of observing data x? → P(X = x | θ). Parameters are known; data is uncertain.
Likelihood: Given fixed observed data x, how plausible is each value of θ? → L(θ | x). Data is fixed (already observed); parameters are uncertain.
The likelihood function reverses the direction of reasoning.
The Likelihood Function 📐
The Story: The Biased Coin Inspector
Arjun is a quality inspector at a coin factory. He suspects a newly minted coin is biased — not perfectly fair. He cannot look inside the coin to measure its bias directly. Instead, he flips it 10 times and records the results: H, H, T, H, H, T, H, H, H, T — that is, 7 heads and 3 tails. He wants to use this data to estimate the true probability p of heads.
For each candidate value of p (0.1, 0.2, …, 0.9, 1.0), Arjun asks: "If the true bias were this value of p, how probable is it that I would observe exactly 7 heads in 10 flips?" The function that maps each p to this probability is the Likelihood Function.
Computing the Likelihood — Step by Step 🧮
P(7 heads in 10 flips | p) = C(10,7) × p⁷ × (1−p)³
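Arjun's table of candidate values can be reproduced in a few lines. The following is a minimal sketch using only the Python standard library; the function name `binomial_likelihood` and the grid of candidates are illustrative choices, not part of the original text.

```python
from math import comb

def binomial_likelihood(p, k=7, n=10):
    """P(k heads in n flips | p) for a coin with heads-probability p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Candidate biases 0.1, 0.2, ..., 1.0 (rounded to avoid float noise in the keys)
candidates = [round(0.1 * i, 1) for i in range(1, 11)]
likelihoods = {p: binomial_likelihood(p) for p in candidates}

for p, value in likelihoods.items():
    print(f"p = {p:.1f}  L(p) = {value:.4f}")

# The candidate with the highest likelihood on this coarse grid
best_p = max(likelihoods, key=likelihoods.get)
print("Highest likelihood on this grid at p =", best_p)
```

On this grid the likelihood peaks at p = 0.7, where it is roughly 0.267. A finer grid would locate the peak more precisely, which is what the calculus approach does exactly.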
Maximising the Likelihood — Calculus Approach 🎓
Finding the MLE Analytically
To find the exact peak of the likelihood function, we take its derivative with respect to the parameter θ, set it to zero, and solve. Because products are harder to differentiate than sums, we first take the logarithm — creating the log-likelihood function ℓ(θ). Since log is a monotonically increasing function, the θ that maximises L(θ) also maximises ℓ(θ).
ℓ(p) = constant + 7·log(p) + 3·log(1−p)
dℓ/dp = 7/p − 3/(1−p) = 0
7(1−p) = 3p
7 − 7p = 3p
7 = 10p
p̂ = 7/10 = 0.7
The MLE for the binomial parameter is simply k/n — the observed proportion of successes. This confirms our intuition: the most likely bias is exactly the fraction of heads we saw.
The same derivation works for any Binomial(n, p) model: replacing 7 and 10 with k observed successes in n trials gives p̂ = k/n. The MLE for a proportion is simply the observed relative frequency, which is the most natural, intuitive estimate for this case.
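The analytical result can be checked numerically. The sketch below, with illustrative helper names `log_likelihood` and `score`, evaluates the derivative of the log-likelihood at k/n and compares k/n with the best point on a fine grid.

```python
from math import log

def log_likelihood(p, k, n):
    """Binomial log-likelihood without the constant log C(n, k) term."""
    return k * log(p) + (n - k) * log(1 - p)

def score(p, k, n):
    """Derivative of the log-likelihood with respect to p."""
    return k / p - (n - k) / (1 - p)

k, n = 7, 10
p_hat = k / n  # closed-form MLE

# Brute-force check: best p on a grid of 0.001, 0.002, ..., 0.999
grid = [i / 1000 for i in range(1, 1000)]
p_grid = max(grid, key=lambda p: log_likelihood(p, k, n))

print("closed form:", p_hat)
print("score at closed form (should be ~0):", score(p_hat, k, n))
print("grid search:", p_grid)
```

The score is zero up to floating-point error at p̂, and the grid search lands on the same value. Changing `k` and `n` gives the same agreement for other counts.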
MLE for the Normal Distribution 📊
The Story: Estimating Average Human Reaction Time
A neuroscientist measures the reaction times (in milliseconds) of 5 subjects in a stimulus experiment. The data: 240, 255, 261, 248, 271 ms. She assumes reaction times follow a Normal distribution N(μ, σ²) — but she doesn't know the true mean μ or the true variance σ². She uses MLE to estimate both parameters simultaneously from her 5 observations.
μ̂ = (1/n)Σxᵢ = x̄ — the sample mean!
σ̂² = (1/n)Σ(xᵢ−x̄)² — the biased sample variance!
x̄ = (240 + 255 + 261 + 248 + 271)/5 = 1275/5 = 255 ms
Deviations from x̄: −15, 0, 6, −7, 16
Squared: 225, 0, 36, 49, 256 → Sum = 566
σ̂² = 566/5 = 113.2 ms² | σ̂ = √113.2 ≈ 10.64 ms
The MLE for variance uses divisor n, but the unbiased sample variance uses n−1 (Bessel's correction). The MLE estimator σ̂² = (1/n)Σ(xᵢ−x̄)² is a biased estimator — it systematically underestimates the true population variance. For large n the difference is negligible, but for small samples it matters. This is one case where MLE gives a technically correct but practically suboptimal answer — a rare but important limitation.
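To make the difference between the two divisors concrete, here is a small sketch that computes both variance estimates for the five reaction times. Variable names are illustrative.

```python
from math import sqrt

reaction_times_ms = [240, 255, 261, 248, 271]
n = len(reaction_times_ms)

# MLE of the mean is the sample mean
mu_hat = sum(reaction_times_ms) / n

# Sum of squared deviations from the sample mean
sum_sq = sum((x - mu_hat) ** 2 for x in reaction_times_ms)

var_mle = sum_sq / n             # MLE: divisor n (biased)
var_unbiased = sum_sq / (n - 1)  # Bessel's correction: divisor n - 1

print("mean:", mu_hat)
print("MLE variance:", var_mle, "-> std:", round(sqrt(var_mle), 2))
print("unbiased variance:", var_unbiased, "-> std:", round(sqrt(var_unbiased), 2))
```

With only five observations the two estimates differ noticeably (113.2 versus 141.5 ms²). The ratio between them is (n−1)/n, which approaches 1 as n grows.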
Log-Likelihood — Why It Matters 📉
The Story: The Astronomy Data Problem
An astronomer analyses light from 10,000 stars, each providing an independent photon count observation. The likelihood of observing all 10,000 counts simultaneously is the product of 10,000 individual probabilities — each between 0 and 1. On a computer, this product would round to zero due to floating-point underflow before you could even maximise it. The log-likelihood converts this product of tiny numbers into a manageable sum.
Each flip multiplies in one more factor between 0 and 1. With 100 fair-coin flips the product is already about 10⁻³⁰, and beyond roughly a thousand flips it underflows to zero in double precision.
Products become sums — safe for any sample size.
1. Numerical stability: Products of small probabilities underflow to zero on computers, while sums of log-probabilities stay manageable.
2. Analytical convenience: log converts products (hard to differentiate) into sums (easy to differentiate term by term).
3. Convexity: For many common models (for example, logistic regression and exponential-family models in their natural parameters), the negative log-likelihood is convex, so any local minimum is a global minimum and gradient-based optimisation is reliable.
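The underflow problem is easy to reproduce. The sketch below assumes, purely for illustration, that each of 10,000 observations has probability 0.01; the real astronomy values are not given in the text.

```python
from math import log

# Hypothetical per-observation probabilities (an assumption for illustration)
probabilities = [0.01] * 10_000

# Naive likelihood: multiply everything together
product = 1.0
for p in probabilities:
    product *= p

# Log-likelihood: add the logs instead
log_likelihood = sum(log(p) for p in probabilities)

print("product of probabilities:", product)
print("sum of log-probabilities:", log_likelihood)
```

The product comes out as exactly 0.0 because the true value, 10⁻²⁰⁰⁰⁰, is far below the smallest positive double. The sum of logs is an ordinary finite number (about −46,052) that an optimiser can work with.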
MLE for Logistic Regression 🤖
The Story: Credit Default Prediction
A bank wants to predict whether a loan applicant will default (Y=1) or not (Y=0) based on their income (X, in thousands). The model is logistic regression: P(Y=1|X) = σ(β₀ + β₁X), where σ is the sigmoid function. The parameters β₀ and β₁ are unknown. There is no closed-form MLE solution here. Instead, gradient ascent on the log-likelihood (or descent on the negative log-likelihood) is applied iteratively. The same iterative idea trains logistic regression and neural networks; linear regression can be trained this way too, although it also has a closed-form solution.
Person 1: Income X=50k, Defaulted Y=1
Person 2: Income X=80k, Did not default Y=0
Person 3: Income X=40k, Defaulted Y=1
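At trial parameters β₀ = −4 and β₁ = 0.06, the predicted default probabilities are:
P₁ = σ(−4 + 0.06×50) = σ(−1) = 0.269
P₂ = σ(−4 + 0.06×80) = σ(+0.8) = 0.690
P₃ = σ(−4 + 0.06×40) = σ(−1.6) = 0.168
ℓ(β) = Σ [yᵢ·log(Pᵢ) + (1−yᵢ)·log(1−Pᵢ)]
= log(0.269) + log(1 − 0.690) + log(0.168)
= log(0.269) + log(0.310) + log(0.168)
= −1.313 + (−1.171) + (−1.784) = −4.268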
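The calculation above, plus one step of gradient ascent, can be sketched as follows. The trial parameters β₀ = −4 and β₁ = 0.06 come from the worked example; the learning rate `step` is a hypothetical value chosen only for illustration.

```python
from math import exp, log

def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))

# (income in thousands, defaulted?) for the three applicants
data = [(50, 1), (80, 0), (40, 1)]

def log_likelihood(b0, b1):
    """Bernoulli log-likelihood of the three observations."""
    total = 0.0
    for x, y in data:
        p = sigmoid(b0 + b1 * x)
        total += y * log(p) + (1 - y) * log(1 - p)
    return total

def gradient(b0, b1):
    """Partial derivatives of the log-likelihood with respect to b0 and b1."""
    g0 = sum(y - sigmoid(b0 + b1 * x) for x, y in data)
    g1 = sum((y - sigmoid(b0 + b1 * x)) * x for x, y in data)
    return g0, g1

b0, b1 = -4.0, 0.06
ll_before = log_likelihood(b0, b1)

# One gradient ascent step: move the parameters uphill
step = 1e-4  # hypothetical learning rate
g0, g1 = gradient(b0, b1)
b0_new, b1_new = b0 + step * g0, b1 + step * g1
ll_after = log_likelihood(b0_new, b1_new)

print("log-likelihood before:", round(ll_before, 3))
print("log-likelihood after one step:", round(ll_after, 3))
```

The log-likelihood starts at about −4.27 and rises after the step, which is all gradient ascent promises for a small enough learning rate. Note that these three points are perfectly separable by income, so a full fit would keep pushing the coefficients outward without converging; a realistic dataset or a regularisation term avoids that.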
MLE for Poisson Distribution 📦
The Story: The Call Centre Manager
Priya manages a call centre. Over 8 hours, she records the number of calls per hour: 3, 5, 2, 8, 4, 6, 3, 5. She models call arrivals as a Poisson process with unknown rate λ. She wants the MLE estimate of λ — the true average calls per hour — to staff her centre optimally.
ℓ(λ) = Σ[−λ + xᵢ·log(λ) − log(xᵢ!)]
= −nλ + (Σxᵢ)·log(λ) − Σlog(xᵢ!)
= −8λ + 36·log(λ) − constant
dℓ/dλ = −8 + 36/λ = 0
λ̂ = 36/8 = 4.5 calls per hour
The MLE for the Poisson rate is the sample mean — the most intuitive estimate imaginable. MLE confirms common sense.
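As a quick check of Priya's estimate, the sketch below computes the sample mean and confirms that the Poisson log-likelihood is lower on either side of it. The function name is illustrative; `lgamma(x + 1)` is used as a standard way to compute log(x!).

```python
from math import log, lgamma

calls_per_hour = [3, 5, 2, 8, 4, 6, 3, 5]
n = len(calls_per_hour)

def poisson_log_likelihood(lam):
    """Sum of log Poisson probabilities; log(x!) is lgamma(x + 1)."""
    return sum(-lam + x * log(lam) - lgamma(x + 1) for x in calls_per_hour)

# Closed-form MLE: the sample mean
lam_hat = sum(calls_per_hour) / n

print("lambda_hat:", lam_hat)
for lam in (4.0, 4.4, lam_hat, 4.6, 5.0):
    print(f"lambda = {lam:.1f}  log-likelihood = {poisson_log_likelihood(lam):.4f}")
```

The log-likelihood is highest at 4.5 calls per hour and falls off as λ moves away in either direction.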
Numerical MLE — When Calculus Fails 🔢
The Story: The Mixture Model
Many real distributions (customer lifetimes, image pixel intensities, gene expression levels) are not simple single distributions. They are mixtures: a blend of two or more components. The log-likelihood for a mixture model has no closed-form maximiser. Instead, numerical methods are used: gradient ascent, the EM algorithm, or general-purpose numerical optimisation.
| Optimisation Method | How It Works | Best For | Used In |
|---|---|---|---|
| Analytical MLE | dℓ/dθ = 0, solve for θ̂ | Simple distributions (Normal, Binomial, Poisson) | Statistics textbooks, closed-form solutions |
| Gradient Ascent | β ← β + η·∂ℓ/∂β iteratively | Smooth, differentiable log-likelihoods | Logistic regression, GLMs |
| Newton-Raphson | Uses 2nd derivative (Hessian) | Fast convergence near the peak | GLMs, survival models |
| EM Algorithm | Expectation + Maximisation alternating steps | Latent variable / mixture models | Gaussian Mixture Models (GMMs), HMMs |
| BFGS / L-BFGS | Quasi-Newton, approximates Hessian | Large parameter spaces | scipy.optimize, ML libraries |
| SGD / Adam | Stochastic gradient on mini-batches | Very large datasets, deep learning | PyTorch, TensorFlow neural networks |
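To illustrate one row of the table, here is a minimal Newton-Raphson sketch. It is applied to the Poisson call-centre data, where the closed-form answer (4.5) is known, so the numerical result can be checked. The starting guess and stopping tolerance are arbitrary illustrative choices.

```python
calls_per_hour = [3, 5, 2, 8, 4, 6, 3, 5]
n = len(calls_per_hour)
total = sum(calls_per_hour)

def score(lam):
    """First derivative of the Poisson log-likelihood."""
    return -n + total / lam

def curvature(lam):
    """Second derivative of the Poisson log-likelihood."""
    return -total / lam**2

lam = 1.0  # arbitrary starting guess
for iteration in range(50):
    # Newton-Raphson update: theta <- theta - l'(theta) / l''(theta)
    step = score(lam) / curvature(lam)
    lam -= step
    print(f"iteration {iteration + 1}: lambda = {lam:.6f}")
    if abs(step) < 1e-10:
        break
```

The iterates climb from 1.0 towards 4.5 and then converge very quickly near the peak, which is the behaviour the table describes. For this particular problem the method can diverge if the starting guess is too large (above roughly twice the true value), a reminder that Newton-type methods depend on a reasonable starting point.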
Properties of MLE — Why It's So Widely Used ⭐
Consistency
- As n→∞, θ̂ converges to the true θ (under standard regularity conditions)
- More data = more accurate estimate
- Follows from the Law of Large Numbers applied to the average log-likelihood
- Asymptotically unbiased
- Fundamental reliability guarantee
Asymptotic Normality
- For large n, θ̂ is approximately normal
- Enables approximate confidence intervals
- Enables hypothesis tests on parameters
- Variance ≈ 1/(n × Fisher information per observation)
- Central to statistical inference
Efficiency
- Attains the Cramér-Rao lower bound asymptotically
- Minimum asymptotic variance among regular estimators
- No regular estimator does better in the large-sample limit
- Asymptotically efficient (large n)
- "Best possible" guarantee is a large-sample one, not a finite-sample one
Small samples: MLE can be biased for small n (e.g., the variance estimator).
Model misspecification: MLE finds the best parameters for your assumed distribution. If that distribution is wrong, the estimate is only the closest fit within the wrong family, and conclusions drawn from it can be misleading.
Overfitting: MLE maximises fit to training data and can overfit without regularisation.
Multiple maxima: Some likelihood functions have multiple peaks (non-concave), so gradient methods may find local, not global, maxima.
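The small-sample bias and its disappearance with more data can be seen in a simulation. The sketch below draws samples from a standard normal distribution (true variance 1) and averages the MLE variance estimate over many repetitions. The sample sizes, trial counts, and seed are arbitrary illustrative choices.

```python
import random

def mle_variance(sample):
    """Variance MLE: divisor n, not n - 1."""
    m = sum(sample) / len(sample)
    return sum((x - m) ** 2 for x in sample) / len(sample)

def average_mle_variance(n, trials=20_000, seed=0):
    """Average the variance MLE over many simulated samples of size n."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    total = 0.0
    for _ in range(trials):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        total += mle_variance(sample)
    return total / trials

small = average_mle_variance(5)
large = average_mle_variance(100, trials=2_000)

print("average MLE variance, n = 5:  ", round(small, 3))
print("average MLE variance, n = 100:", round(large, 3))
```

Theory predicts an average of (n−1)/n times the true variance: about 0.8 for n = 5 and 0.99 for n = 100. The simulated averages should land close to those values, showing both the small-sample bias and how it fades as n grows.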
MLE vs MAP — Adding Prior Knowledge 🧠
MLE finds the parameter that maximises the likelihood of the observed data. Maximum A Posteriori (MAP) estimation adds a prior distribution on the parameters — encoding existing knowledge — and finds the parameter that maximises the posterior P(θ|data) ∝ L(θ|data) × P(θ).
| Feature | MLE | MAP |
|---|---|---|
| Objective | Maximise L(θ\|data) | Maximise L(θ\|data)·P(θ) |
| Uses prior? | No, data only | Yes, data + prior belief |
| Small data behaviour | Can overfit severely | Prior regularises, more stable |
| Large data behaviour | Consistent; driven entirely by the data | Prior becomes negligible; converges to MLE as n→∞ |
| Equivalent ML concept | No regularisation | L2 regularisation (Gaussian prior) |
| Log-objective | log L(θ\|data) | log L(θ\|data) + log P(θ) |
| Philosophical position | Frequentist | Bayesian |
When you train a neural network or linear regression with L2 regularisation (Ridge), you are implicitly doing MAP estimation with a Gaussian prior on the weights: P(θ) = N(0, σ²). The regularisation term λ||θ||² in the loss function equals −log P(θ) up to an additive constant, with λ = 1/(2σ²) when the loss is the unscaled negative log-likelihood. Many regularised models in machine learning can be read as MAP estimators in disguise; L1 regularisation (Lasso), for example, corresponds to a Laplace prior.
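A small example shows how a prior pulls the estimate and how that pull fades with more data. The sketch below returns to the coin example and assumes a Beta(2, 2) prior on p, which is an illustrative choice not made in the text. For a Beta(a, b) prior with a, b > 1, the posterior mode is (k + a − 1)/(n + a + b − 2).

```python
def mle_heads(k, n):
    """MLE for the heads probability: observed proportion."""
    return k / n

def map_heads(k, n, a=2.0, b=2.0):
    """Posterior mode under a Beta(a, b) prior (assumes a > 1 and b > 1)."""
    return (k + a - 1) / (n + a + b - 2)

# Small sample: 7 heads in 10 flips
print("n = 10     MLE:", mle_heads(7, 10), " MAP:", round(map_heads(7, 10), 4))

# Large sample with the same proportion: 7000 heads in 10000 flips
print("n = 10000  MLE:", mle_heads(7000, 10000), " MAP:", round(map_heads(7000, 10000), 4))
```

With 10 flips the prior pulls the estimate from 0.7 towards 0.5, giving about 0.667. With 10,000 flips the MAP estimate is almost indistinguishable from the MLE, matching the large-data row of the table.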
MLE Across Machine Learning — The Unified View 🤖
| ML Algorithm | Assumed Distribution | Negative Log-Likelihood = Loss Function | How the MLE Is Found |
|---|---|---|---|
| Linear Regression | Normal errors: ε ~ N(0,σ²) | Mean Squared Error (MSE) | Ordinary Least Squares (OLS) |
| Logistic Regression | Bernoulli outputs | Binary Cross-Entropy | Gradient ascent on ℓ(β) |
| Softmax Regression | Categorical outputs | Categorical Cross-Entropy | Gradient ascent multi-class |
| Naive Bayes | Class-conditional distributions | Joint log-likelihood of labels | Empirical frequency estimates |
| Gaussian Mixture Model | Mixture of Normals | Mixture log-likelihood | EM Algorithm |
| Deep Neural Network | Task-dependent (Normal/Bernoulli/Categorical) | MSE or Cross-Entropy | SGD / Adam on −ℓ(θ) |
| Survival Analysis (Cox) | Hazard function model | Partial log-likelihood | Newton-Raphson iteration |
| Language Models (LLMs) | Categorical next-token distribution | Cross-entropy over token sequence | SGD on −Σ log P(token\|context) |
Every row in the table above is the same mathematical operation: find the parameters that maximise the probability of the observed data under an assumed model. Linear regression minimises MSE because under Gaussian error assumptions, minimising MSE IS maximising the likelihood. Neural networks minimise cross-entropy because under categorical output assumptions, minimising cross-entropy IS maximising the likelihood. MLE doesn't just explain statistics — it explains why machine learning training works at all.
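The link between MSE and the Gaussian likelihood can be checked directly. The sketch below uses a small made-up dataset (hypothetical, not from the text), fits a line by ordinary least squares, and confirms that moving away from the OLS coefficients lowers the Gaussian log-likelihood. The error variance `sigma2` is held fixed at 1 as a simplifying assumption.

```python
from math import log, pi

# Hypothetical toy data for illustration only
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(xs)

def mse(b0, b1):
    """Mean squared error of the line y = b0 + b1 * x."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys)) / n

def gaussian_log_likelihood(b0, b1, sigma2=1.0):
    """Log-likelihood assuming i.i.d. N(0, sigma2) errors, sigma2 fixed."""
    return -0.5 * n * log(2 * pi * sigma2) - n * mse(b0, b1) / (2 * sigma2)

# Closed-form ordinary least squares for a single predictor
x_bar, y_bar = sum(xs) / n, sum(ys) / n
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
s_xx = sum((x - x_bar) ** 2 for x in xs)
b1_ols = s_xy / s_xx
b0_ols = y_bar - b1_ols * x_bar

print("OLS intercept and slope:", round(b0_ols, 4), round(b1_ols, 4))
print("log-likelihood at OLS:", round(gaussian_log_likelihood(b0_ols, b1_ols), 4))
print("log-likelihood nearby:", round(gaussian_log_likelihood(b0_ols + 0.1, b1_ols), 4))
```

Because the log-likelihood is a constant minus a positive multiple of the MSE, whichever coefficients minimise one also maximise the other. That holds for any fixed value of `sigma2`.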
Complete MLE Reference — All Distributions
| Distribution | Parameter(s) | MLE Estimate | Intuition |
|---|---|---|---|
| Bernoulli(p) | p | p̂ = k/n (successes / trials) | Observed proportion |
| Binomial(n,p) | p | p̂ = k/n | Observed success rate |
| Poisson(λ) | λ | λ̂ = x̄ | Sample mean of counts |
| Normal(μ, σ²) | μ, σ² | μ̂ = x̄, σ̂² = (1/n)Σ(xᵢ−x̄)² | Sample mean & biased variance |
| Exponential(λ) | λ | λ̂ = 1/x̄ | Reciprocal of mean wait time |
| Uniform(a,b) | a, b | â = min(xᵢ), b̂ = max(xᵢ) | Observed range (biased!) |
| Geometric(p) | p | p̂ = 1/x̄ | Reciprocal of mean trials |
| Beta(α,β) | α, β | No closed form; solved numerically (e.g. Newton iteration, often started from method-of-moments values) | Numerical optimisation required |
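The closed-form rows of the table translate into one-line functions. The sketch below is an illustrative collection with made-up function names, not an established library API. The Geometric version assumes the data are numbers of trials up to and including the first success.

```python
def mle_bernoulli(outcomes):
    """p_hat for 0/1 outcomes: observed proportion of successes."""
    return sum(outcomes) / len(outcomes)

def mle_poisson(counts):
    """lambda_hat: sample mean of the counts."""
    return sum(counts) / len(counts)

def mle_exponential(waits):
    """Rate lambda_hat: reciprocal of the mean waiting time."""
    return len(waits) / sum(waits)

def mle_geometric(trials):
    """p_hat when each value counts trials up to the first success (>= 1)."""
    return len(trials) / sum(trials)

def mle_uniform(xs):
    """(a_hat, b_hat): the observed minimum and maximum."""
    return min(xs), max(xs)

print(mle_bernoulli([1, 1, 0, 1, 1, 0, 1, 1, 1, 0]))  # the coin flips, H = 1
print(mle_poisson([3, 5, 2, 8, 4, 6, 3, 5]))          # the call-centre counts
print(mle_exponential([2.0, 4.0]))
print(mle_geometric([1, 3, 2]))
print(mle_uniform([2.5, 0.7, 4.1]))
```

The first two calls reuse the data from the coin and call-centre stories and return 0.7 and 4.5. The remaining inputs are arbitrary values chosen only to exercise the functions.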
The Golden Rules of MLE
From the detective reasoning backwards from evidence to the deep learning system training on billions of tokens — the underlying logic is identical. Observe data. Assume a model. Find the parameters that make the observed data most probable. That is the likelihood function. Maximise it. That is MLE. Every statistical model you fit, every neural network you train, every A/B test you analyse — all of it flows from this one profound idea: the best explanation of data is the one that would have made that data most likely to occur.