
Inferential Statistics

A complete beginner-to-intermediate guide to inferential statistics — covering what inference means, the population vs sample distinction, all five major sampling techniques with visual diagrams, the sampling distribution of the mean, the Central Limit Theorem, Standard Error, and Python implementations using NumPy, Pandas, and SciPy.

Section 01

The Taste Tester & The Election Pollster

A soup chef is preparing a 50-litre pot of dal for a wedding feast of 500 guests. Does she pour every drop into bowls and serve one to each guest just to check the salt? Of course not. She picks up a single ladle — maybe 100 ml — stirs well, tastes it, adjusts the seasoning, and confidently serves the entire pot.

One ladle. 500 guests. That is inferential statistics in its purest form.

💡
The Election Poll Story

Before India's 2024 general election, no organisation surveyed all 970 million eligible voters. Instead, polling agencies surveyed roughly 20,000–30,000 people — about 0.003% of voters — and from that tiny sample predicted seat counts with reasonable accuracy for 543 parliamentary constituencies. That is the extraordinary power of inference: learning about the whole from a carefully chosen part.

Every day, decisions worth billions of rupees — drug approvals, crop yield forecasts, product quality checks, credit-risk models — are made not by measuring everyone and everything, but by measuring a few and reasoning about the rest. That is the job of inferential statistics.


Section 02

What Is Inferential Statistics?

Statistics splits into two broad branches. Descriptive statistics summarises and describes the data you already have — means, medians, charts. Inferential statistics goes further: it uses a sample to draw conclusions — inferences — about a larger population that you could not fully measure.

Estimation
Point & Interval
  • Estimate population parameters
  • Confidence intervals
  • E.g. mean income of India
Hypothesis Testing
p-value & α
  • Test claims about populations
  • t-tests, chi-square, ANOVA
  • E.g. does drug A beat drug B?
Prediction
Models & Forecasts
  • Generalise patterns to new data
  • Regression, classification
  • E.g. predict loan default
🎯
The Two Core Assumptions That Make It Work

Inferential statistics rests on two pillars: (1) the sample must be representative of the population — no hidden bias — and (2) randomness must be involved in how the sample is selected, so probability theory can be applied. Violate either and your inferences collapse.


Section 03

Population vs Sample

Before you can infer anything, you must be crystal clear about what you are measuring and what you are trying to say. This is the population-vs-sample distinction — one of the most important conceptual divides in all of statistics.

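Figure: A population (size N, all elements of interest; parameters μ, σ, P, usually unknown) and a sample (size n, a subset selected from the population; statistics x̄, s, p̂, calculated and known). Sampling goes from population to sample; inference goes from sample back to population.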
Population: Parameters (Greek letters or capitals)
μ = mean  |  σ = std dev  |  N = size  |  P = proportion
The true, fixed values of the entire group you care about. Usually unknown, which is why we sample. Examples: average income of all Indian adults, defect rate of an entire production batch.
Sample: Statistics (lowercase Latin letters)
x̄ = mean  |  s = std dev  |  n = size  |  p̂ = proportion
The computed values from your observed data. These are known and are used to estimate the population parameters. Examples: average income of 2,000 surveyed adults, defects found in a 200-unit quality check.
| Concept | Population | Sample |
|---|---|---|
| Definition | Every member of the group of interest | A subset drawn from the population |
| Size notation | N (usually large or infinite) | n (small relative to N) |
| Measures called | Parameters (μ, σ, P) | Statistics (x̄, s, p̂) |
| Usually known? | Rarely | Yes, calculated |
| Goal | What we want to know | The tool we use to find it |
| Example (medicine) | All diabetic patients in India | 500 patients in a clinical trial |
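The parameter/statistic split can be made concrete with a small sketch. The population below is synthetic (a lognormal income distribution chosen purely for illustration), so unlike in real life the parameters μ and σ can be computed directly and compared with the statistics x̄ and s from one sample.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical population: incomes of N = 50,000 people (synthetic data).
population = rng.lognormal(mean=10.5, sigma=0.5, size=50_000)
mu = population.mean()            # parameter: population mean
sigma = population.std(ddof=0)    # parameter: population std dev (divide by N)

# A sample of n = 500, drawn without replacement.
sample = rng.choice(population, size=500, replace=False)
x_bar = sample.mean()             # statistic: sample mean
s = sample.std(ddof=1)            # statistic: sample std dev (divide by n - 1)

print(f"mu    = {mu:,.0f}   x_bar = {x_bar:,.0f}")
print(f"sigma = {sigma:,.0f}   s     = {s:,.0f}")
```

In practice only the last two quantities are available; the first two are what inference tries to pin down.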

Section 04

Sampling Techniques — How to Pick a Sample

Choosing who goes into your sample is one of the most consequential decisions in any study. The wrong technique introduces sampling bias — a systematic distortion that no amount of clever statistics can later fix. There are five major techniques, each suited to different situations.

Figure: Five sampling techniques at a glance. (1) Simple random: every member has an equal chance of selection. (2) Stratified: divide the population into strata (e.g. Rural, Urban, Metro) and sample from each. (3) Cluster: divide into groups and randomly select whole clusters. (4) Systematic: pick a random start, then take every k-th element, where k = N/n is the sampling interval. (5) Convenience (non-probability): select whoever is easy to reach; high bias risk, not suitable for inferential conclusions.

1. Simple Random Sampling

Every member of the population has an equal chance of being selected, and every possible sample of size n is equally likely. Think of a lottery draw: each ticket has the same chance. In Python, this is random.sample() from the standard library or df.sample(n=100) in pandas. It is the gold standard for small, homogeneous populations but becomes impractical when the population is geographically spread or has important sub-groups.
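As a minimal sketch of the lottery idea, the standard-library call mentioned above can be used directly. The ticket numbers and the seed are made up for the example.

```python
import random

random.seed(7)                       # fixed seed so the draw is repeatable

tickets = list(range(1, 1001))       # hypothetical lottery: tickets 1..1000
winners = random.sample(tickets, k=10)   # 10 distinct tickets, no replacement

print(sorted(winners))
```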

2. Stratified Sampling

The population is first divided into non-overlapping sub-groups (strata) based on a key characteristic (income level, age group, state), and random samples are drawn from each stratum proportionally. An NSSO survey of Indian household income uses stratified sampling — it ensures rural, semi-urban, and urban households are all represented, not accidentally dominated by one group.

🎯
Proportional vs Equal Stratified Sampling

In proportional stratified sampling, the sample size from each stratum matches its share of the population (e.g. 70% rural → 70 rural respondents per 100). In equal stratified sampling, each stratum gets the same sample size regardless of its population size — useful when you need to compare strata that differ greatly in size.
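The two allocation schemes can be sketched with pandas' grouped sampling. The household table, the 70/20/10 regional split, the 2% sampling fraction, and the count of 60 per stratum are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)

# Hypothetical population of 10,000 households: ~70% rural, 20% urban, 10% metro.
households = pd.DataFrame({
    "region": rng.choice(["Rural", "Urban", "Metro"], size=10_000,
                         p=[0.7, 0.2, 0.1]),
})

# Proportional allocation: the same fraction (2%) from every stratum,
# so each stratum's share of the sample mirrors its share of the population.
proportional = households.groupby("region").sample(frac=0.02, random_state=3)

# Equal allocation: the same count (60) from every stratum,
# regardless of how large the stratum is.
equal = households.groupby("region").sample(n=60, random_state=3)

print(proportional["region"].value_counts())
print(equal["region"].value_counts())
```

With equal allocation, any population-level estimate has to re-weight the strata by their true shares; otherwise the small strata are over-represented.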

3. Cluster Sampling

The population is divided into natural clusters (villages, schools, city blocks), a random subset of clusters is selected, and every member of the chosen clusters is surveyed. This is far cheaper and faster than travelling to survey individuals scattered across the country. A government health survey might randomly pick 50 villages and survey all households in those villages. The trade-off: members of the same cluster tend to resemble one another, so a cluster sample usually carries less information than a simple random sample of the same size, and an unlucky draw of similar clusters may not be representative.
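A minimal sketch of one-stage cluster sampling, assuming a made-up population of 200 villages with 50 households each (the column names `village_id` and `income` are illustrative):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(5)

# Hypothetical population: 200 villages with 50 households each.
population = pd.DataFrame({
    "village_id": np.repeat(np.arange(200), 50),
    "income": rng.lognormal(mean=10.5, sigma=0.5, size=200 * 50),
})

# Step 1: randomly select 20 whole villages (the clusters).
chosen_villages = rng.choice(population["village_id"].unique(),
                             size=20, replace=False)

# Step 2: survey every household in the chosen villages.
cluster_sample = population[population["village_id"].isin(chosen_villages)]

print(f"Villages selected:   {cluster_sample['village_id'].nunique()}")
print(f"Households surveyed: {len(cluster_sample)}")
```

The randomness here lives at the village level, not the household level, which is what distinguishes this from stratified sampling.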

4. Systematic Sampling

Choose a random starting point, then select every k-th element from an ordered list, where k = N / n (the sampling interval). For example, to sample 200 patients from a hospital register of 4,000: k = 4000 / 200 = 20. Pick a random start between 1 and 20, then take every 20th patient. Simple, fast, and nearly as good as random — unless there is a periodic pattern in the list (e.g. every 20th patient is always a Monday morning emergency case).
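The periodicity risk can be seen in a deliberately extreme sketch. The register below is synthetic: every 20th entry is flagged as an emergency case, which happens to match the sampling interval k = 20 from the example above. The helper `systematic_sample` is a hypothetical name introduced for this illustration.

```python
import numpy as np

N, n = 4000, 200
k = N // n                       # sampling interval: 20

# Hypothetical register where every 20th entry is an emergency case (flag = 1).
register = np.zeros(N, dtype=int)
register[::20] = 1
true_rate = register.mean()      # 200 flagged out of 4000

def systematic_sample(data, start, k):
    """Take every k-th element beginning at 0-based index `start`."""
    return data[start::k]

aligned = systematic_sample(register, start=0, k=k)   # lands on every flagged entry
offset = systematic_sample(register, start=7, k=k)    # misses all flagged entries

print(f"True emergency rate:       {true_rate:.2f}")
print(f"Estimate with start = 0:   {aligned.mean():.2f}")
print(f"Estimate with start = 7:   {offset.mean():.2f}")
```

Both samples have the intended size of 200, yet neither estimate is close to the true rate, because the list's period lines up with k.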

5. Convenience Sampling

Select whoever is easiest to reach — students in the researcher's classroom, followers on social media, shoppers at a nearby mall. It is fast and cheap but almost always biased. It cannot support valid statistical inference about any broader population. Use it only for early exploratory pilots or qualitative insights, never for drawing quantitative conclusions.

| Technique | Probability-based? | Cost | Bias Risk | Best For |
|---|---|---|---|---|
| Simple Random | Yes | Medium | Low | Small homogeneous populations |
| Stratified | Yes | Higher | Very low | Diverse populations with sub-groups |
| Cluster | Yes | Low | Moderate | Geographically spread populations |
| Systematic | Yes | Low | Low | Ordered lists, assembly lines |
| Convenience | No | Very Low | High | Pilot studies only, not for inference |

Section 05

Sampling Distribution — The Heart of Inference

Here is where the magic happens. Imagine you want to estimate the average monthly salary of all 500 million workers in India. You take a random sample of 100 workers and compute the sample mean x̄. You get ₹32,500.

But what if someone else took a different random sample of 100 workers? They would get a slightly different x̄ — maybe ₹31,800. A third researcher gets ₹33,100. Every possible sample of size 100 produces its own x̄. The distribution of all these sample means is called the Sampling Distribution of the Mean.

💡
The Bowling Ball Analogy

Think of the population as a bowling alley with thousands of balls, each labelled with a salary. Every time you close your eyes and pick 100 balls at random, you compute the average. If you repeat this thousands of times and plot those averages, you get the sampling distribution. For a large enough n it forms an approximate bell shape centred at the true population mean, whatever shape the original data had (as long as the population variance is finite). This is the Central Limit Theorem at work.

Figure: From population to many samples to sampling distribution. Step 1: a population of any shape, with μ unknown. Step 2: take many samples of size n and compute x̄ for each (x̄₁ = 32,500, x̄₂ = 31,800, x̄₃ = 33,100, x̄₄ = 32,050, and thousands more). Step 3: plot all the x̄ values to obtain the sampling distribution of x̄, centred at μ_x̄ = μ with spread SE = σ / √n (Standard Error).
Central Limit Theorem (CLT): As sample size n increases, the sampling distribution of x̄ approaches a normal distribution centred at μ with standard error SE = σ/√n, regardless of population shape.
Mean of the Sampling Distribution
μ_x̄ = μ
The average of all possible sample means equals the true population mean. This property is called unbiasedness — on average, your sample mean is exactly right.
Standard Error of the Mean (SE)
SE = σ / √n
The standard deviation of the sampling distribution. It measures how much sample means vary around μ. Larger n → smaller SE → more precise estimates. Quadruple your sample size to halve your standard error.
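The "quadruple n to halve SE" rule follows directly from the square root in the formula. A short sketch, assuming a population standard deviation of σ = 5 (the helper name `standard_error` is introduced here for illustration):

```python
import math

def standard_error(sigma: float, n: int) -> float:
    """Standard error of the mean: sigma / sqrt(n)."""
    return sigma / math.sqrt(n)

sigma = 5.0  # assumed population standard deviation
for n in (25, 100, 400):
    print(f"n = {n:>3}  SE = {standard_error(sigma, n):.2f}")
# n =  25  SE = 1.00
# n = 100  SE = 0.50
# n = 400  SE = 0.25
```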
⚠️
Standard Error ≠ Standard Deviation

Standard deviation (s) measures how individual data points vary around the sample mean — it describes your data's spread.
Standard error (SE) measures how much the sample mean itself varies across repeated samples — it describes the precision of your estimate.
Confusing the two is one of the most common errors in applied statistics. SE always decreases as n grows; standard deviation does not.
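The difference shows up as soon as the sample size changes. In the sketch below the data are synthetic draws from a normal population with σ = 10 (an assumption for the example): the sample standard deviation s stays near 10 at both sizes, while the estimated standard error s/√n drops by roughly a factor of ten.

```python
import numpy as np

rng = np.random.default_rng(11)

results = {}
for n in (100, 10_000):
    # Synthetic measurements from a population with mean 50 and sigma 10.
    sample = rng.normal(loc=50, scale=10, size=n)
    s = sample.std(ddof=1)      # spread of individual observations
    se = s / np.sqrt(n)         # estimated precision of the sample mean
    results[n] = (s, se)
    print(f"n = {n:>6}  s = {s:.2f}  SE = {se:.3f}")
```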


Section 06

The Central Limit Theorem in Action

The CLT is arguably the most powerful theorem in all of statistics. It says that however strange, skewed, or non-normal your population distribution looks, the sampling distribution of the mean becomes approximately normal as the sample size grows, provided the population has a finite variance. A common rule of thumb is n ≥ 30, though heavily skewed populations can need more.

🧮 CLT Demonstration — Exponential Population
Setup
Population: exponential distribution with mean = 5. Heavily right-skewed — nothing like a bell curve. We will take 10,000 random samples at different sizes and plot the distribution of sample means.
n = 2
Sampling distribution still very right-skewed. SE = 5 / √2 = 3.54. Far from normal.
n = 10
Noticeably more bell-shaped but still slightly right-skewed. SE = 5 / √10 = 1.58. Getting closer.
n = 30
Close to a bell curve centred at μ = 5, with only a slight right skew remaining. SE = 5 / √30 = 0.91. CLT kicks in.
n = 100
Very tight, symmetric bell curve. SE = 5 / √100 = 0.50. Sample means cluster tightly around the true population mean.
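"More bell-shaped" can be quantified with skewness, which is 0 for a symmetric distribution. The sketch below is a simplification: it draws each sample directly from an exponential distribution with mean 5 rather than from a stored finite population, and uses scipy.stats.skew to measure the asymmetry of the 10,000 sample means at each n.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

skewness = {}
for n in (2, 10, 30, 100):
    # 10,000 samples of size n from an exponential population with mean 5;
    # each row is one sample, so the row means are the sample means.
    sample_means = rng.exponential(scale=5, size=(10_000, n)).mean(axis=1)
    skewness[n] = stats.skew(sample_means)
    print(f"n = {n:>3}  skewness of sample means = {skewness[n]:.2f}")
```

The skewness should shrink steadily toward 0 as n grows, without ever reaching it exactly.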
Why This Changes Everything

Because the sampling distribution is normal (for large n), we can use the properties of the normal distribution — z-scores, probabilities, confidence intervals — to make precise statements about population parameters. This is exactly why z-tests, t-tests, and confidence intervals all work. They all rely on the CLT to justify treating the sampling distribution as normal.
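As a small worked example of such a statement, take the exponential population from the demonstration (μ = 5, σ = 5) and n = 100, so SE = 0.5. Treating the sampling distribution as normal, which is the CLT approximation rather than an exact result, the chance that a sample mean lands within 1 unit of μ is:

```python
from scipy import stats

mu, sigma, n = 5.0, 5.0, 100          # values from the exponential example
se = sigma / n ** 0.5                 # 0.5

# Approximate sampling distribution of the mean under the CLT.
sampling_dist = stats.norm(loc=mu, scale=se)

# Probability that a sample mean falls between 4 and 6 (within 2 SE of mu).
p_within_1 = sampling_dist.cdf(6.0) - sampling_dist.cdf(4.0)
print(f"P(4 < x_bar < 6) = {p_within_1:.4f}")
# P(4 < x_bar < 6) = 0.9545
```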


Section 07

Python Implementation

Sampling Techniques with Pandas & NumPy

import pandas as pd
import numpy as np

np.random.seed(42)

# Simulated population: 10,000 workers with income data
population = pd.DataFrame({
    'worker_id': range(1, 10001),
    'income':    np.random.lognormal(mean=10.5, sigma=0.5, size=10000),
    'region':    np.random.choice(['Rural', 'Urban', 'Metro'], size=10000,
                                   p=[0.60, 0.30, 0.10]),
    'sector':    np.random.choice(['Agriculture', 'Manufacturing', 'Services'],
                                   size=10000, p=[0.45, 0.25, 0.30])
})

print(f"Population mean income: ₹{population['income'].mean():,.0f}")
print(f"Population size N: {len(population)}")

# ── 1. Simple Random Sampling ──────────────────────────────────
srs = population.sample(n=200, random_state=42)
print(f"\nSRS Sample mean:  ₹{srs['income'].mean():,.0f}")
print(f"SRS Sample size n = {len(srs)}")

# ── 2. Stratified Sampling (by region, proportional) ───────────
strata_sizes = {'Rural': 120, 'Urban': 60, 'Metro': 20}   # 60/30/10 split

stratified = pd.concat([
    population[population['region'] == region].sample(n=n, random_state=42)
    for region, n in strata_sizes.items()
])
print(f"\nStratified Sample mean:  ₹{stratified['income'].mean():,.0f}")
print(f"Region distribution:\n{stratified['region'].value_counts()}")

# ── 3. Systematic Sampling ──────────────────────────────────────
n_sample  = 200
k         = len(population) // n_sample    # sampling interval = 50
start     = np.random.randint(0, k)        # random start between 0 and 49

systematic_idx = range(start, len(population), k)
systematic = population.iloc[systematic_idx]
print(f"\nSystematic Sample mean:  ₹{systematic['income'].mean():,.0f}")
print(f"Sampling interval k = {k}")

Demonstrating the Sampling Distribution & CLT

import numpy as np
import matplotlib.pyplot as plt

# Generator API: its choice() avoids reshuffling the whole population on every
# draw, which makes 40,000 draws without replacement much faster.
rng = np.random.default_rng(0)

# Population: exponential (heavily right-skewed)
population_data = rng.exponential(scale=5, size=100_000)
print(f"Population mean: {population_data.mean():.4f}")    # ~5.0
print(f"Population std:  {population_data.std():.4f}")     # ~5.0

fig, axes = plt.subplots(1, 4, figsize=(16, 3))

# Draw 10,000 samples at each size and record the sample means
for ax, n in zip(axes, [2, 10, 30, 100]):
    sample_means = [
        rng.choice(population_data, size=n, replace=False).mean()
        for _ in range(10_000)
    ]
    se_theoretical = population_data.std() / np.sqrt(n)
    se_observed    = np.std(sample_means)

    print(f"\nn = {n:>3d}  |  "
          f"Mean of x̄ = {np.mean(sample_means):.4f}  |  "
          f"SE theoretical = {se_theoretical:.4f}  |  "
          f"SE observed = {se_observed:.4f}")

    ax.hist(sample_means, bins=50)     # shape of the sampling distribution
    ax.set_title(f"n = {n}")

plt.tight_layout()
plt.show()

# Expected pattern (illustrative values; exact digits depend on the seed):
# n =   2  |  Mean of x̄ = 4.9982  |  SE theoretical = 3.5355  |  SE observed = 3.5101
# n =  10  |  Mean of x̄ = 4.9998  |  SE theoretical = 1.5811  |  SE observed = 1.5693
# n =  30  |  Mean of x̄ = 5.0003  |  SE theoretical = 0.9129  |  SE observed = 0.9041
# n = 100  |  Mean of x̄ = 5.0001  |  SE theoretical = 0.5000  |  SE observed = 0.4987

Calculating Standard Error and Confidence Interval

import numpy as np
from scipy import stats

np.random.seed(7)                              # fixed seed so the run is reproducible

# Sample data: simulated monthly income of 100 workers (₹)
sample = np.random.lognormal(mean=10.5, sigma=0.5, size=100)

x_bar = np.mean(sample)
s     = np.std(sample, ddof=1)                 # ddof=1: sample standard deviation
n     = len(sample)
se    = s / np.sqrt(n)                         # Standard Error, estimated with s

# 95% Confidence Interval using the t-distribution, because σ is unknown
# and estimated by s. Older SciPy releases name the first argument `alpha`.
ci = stats.t.interval(confidence=0.95, df=n-1, loc=x_bar, scale=se)

print(f"Sample mean (x̄):    ₹{x_bar:,.2f}")
print(f"Standard deviation:  ₹{s:,.2f}")
print(f"Standard Error (SE): ₹{se:,.2f}")
print(f"95% CI: (₹{ci[0]:,.2f},  ₹{ci[1]:,.2f})")
# Interpretation: if this sampling procedure were repeated many times, about
# 95% of the intervals built this way would contain the true population mean.
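One way to read "95% confident" is as a statement about the procedure: repeat the sampling many times and roughly 95% of the resulting intervals should contain the true mean. The sketch below checks this by simulation, assuming the same lognormal(10.5, 0.5) population as the example, for which the true mean is known in closed form. The number of trials (2,000) is an arbitrary choice.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2024)
true_mean = np.exp(10.5 + 0.5 ** 2 / 2)     # exact mean of lognormal(10.5, 0.5)

n, trials = 100, 2_000
t_crit = stats.t.ppf(0.975, df=n - 1)       # two-sided 95% critical value
hits = 0
for _ in range(trials):
    sample = rng.lognormal(mean=10.5, sigma=0.5, size=n)
    se = sample.std(ddof=1) / np.sqrt(n)
    low = sample.mean() - t_crit * se
    high = sample.mean() + t_crit * se
    hits += int(low <= true_mean <= high)   # did this interval capture the truth?

coverage = hits / trials
print(f"Intervals containing the true mean: {coverage:.1%}")
```

Because the population is right-skewed and n = 100 is only moderately large, the observed coverage may land slightly below the nominal 95%.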
⚠️
replace=True vs replace=False in Sampling

np.random.choice(data, size=n, replace=False) is sampling without replacement — like drawing names from a hat without putting them back. This is what you almost always want in real surveys. replace=True is used in bootstrapping — a resampling technique for estimating confidence intervals when theoretical distributions are unknown. Do not confuse the two.
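For contrast, here is a minimal sketch of the bootstrap mentioned above, using the percentile method (one of several bootstrap interval variants). The observed sample is synthetic, and the 5,000 resamples are an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical observed sample of 100 incomes (synthetic data).
sample = rng.lognormal(mean=10.5, sigma=0.5, size=100)

# Resample WITH replacement, same size as the original sample, many times,
# recording the mean of each resample.
boot_means = np.array([
    rng.choice(sample, size=sample.size, replace=True).mean()
    for _ in range(5_000)
])

# Percentile bootstrap: the middle 95% of the resampled means.
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])
print(f"Sample mean:      {sample.mean():,.0f}")
print(f"95% bootstrap CI: ({ci_low:,.0f}, {ci_high:,.0f})")
```

Here replace=True is essential: resampling without replacement at the full sample size would return the same 100 values every time.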


Section 08

Key Concepts — Comparison Table

| Concept | Symbol | Applies To | Known? | Purpose |
|---|---|---|---|---|
| Population Mean | μ | Population | Usually No | True average we want to estimate |
| Sample Mean | x̄ | Sample | Yes | Estimates μ from data we collected |
| Population Std Dev | σ | Population | Usually No | True spread of the population |
| Sample Std Dev | s | Sample | Yes | Estimates σ (use ddof=1) |
| Standard Error | SE = σ/√n | Sampling Distribution | Estimated | Measures precision of x̄ as an estimator |
| Sampling Distribution | x̄ ~ N(μ, SE²) | Theoretical | Derived | Foundation for CI and hypothesis tests |

Section 09

Golden Rules

🎯 Inferential Statistics — Key Rules
1
A sample must be representative — this is non-negotiable. No statistical technique can fix a biased sample. If your sampling method systematically excludes part of the population, your inferences will be systematically wrong. Method matters more than sample size.
2
Population parameters conventionally use Greek letters or capitals; sample statistics use lowercase Latin letters. μ vs x̄. σ vs s. N vs n. P vs p̂. Keeping this notation consistent prevents confusion about what is known, what is estimated, and what you are inferring.
3
Standard Error ≠ Standard Deviation. SD measures spread of individual data points. SE measures spread of sample means across repeated samples. SE = σ / √n — it shrinks as n grows. SD does not. Reporting SE when you mean SD (or vice versa) is a critical error in publications.
4
n ≥ 30 is a rule of thumb for the CLT, not a requirement. For heavily skewed or multi-modal populations, you may need n ≥ 50 or n ≥ 100 before the sampling distribution becomes approximately normal. For normally distributed populations, the sample mean is exactly normal at any n, so even n = 5 suffices.
5
Choose your sampling technique before collecting data. Stratified sampling gives more precise estimates for diverse populations. Cluster sampling reduces cost for geographically spread populations. Simple random is the cleanest but not always practical. Never use convenience sampling for inferential conclusions.
6
Larger n reduces Standard Error but never eliminates bias. A sample of 1,000,000 with a biased selection method is worse than a sample of 1,000 with proper random selection. This is exactly why the 1936 Literary Digest poll of 2.4 million people incorrectly predicted the US election — the sample was large but catastrophically biased toward wealthy readers.
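Rule 6 can be illustrated with a rough simulation. Everything here is synthetic: a lognormal income population of one million, and a selection mechanism in which only the wealthier half is reachable. That mechanism is a stand-in for biased selection in general, not a model of the 1936 poll.

```python
import numpy as np

rng = np.random.default_rng(1936)

# Synthetic population of 1,000,000 incomes.
population = rng.lognormal(mean=10.5, sigma=0.5, size=1_000_000)
mu = population.mean()

# Large but biased sample: only the wealthier half can be reached (assumed).
reachable = population[population > np.median(population)]
big_biased = rng.choice(reachable, size=200_000, replace=False)

# Small simple random sample from the whole population.
small_random = rng.choice(population, size=1_000, replace=False)

print(f"True mean:            {mu:,.0f}")
print(f"Biased, n = 200,000:  {big_biased.mean():,.0f}")
print(f"Random, n = 1,000:    {small_random.mean():,.0f}")
```

The biased sample has a tiny standard error, so it is very precise about the wrong number; the small random sample is noisier but centred on the right one.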