What is Federated Learning?
But there's a problem. Patient data is deeply private. Regulations like HIPAA tightly restrict sharing it. Patients didn't consent to their records leaving the building. IT departments would reject the transfer. Legal teams would shut it down before it started.
Federated Learning solves this. Instead of moving the data to the model, you move the model to the data. Each hospital trains on its own records locally. Only the model updates (gradients or weight changes that summarise what was learned) are sent back to a central server. The server combines these updates into a better global model and sends it back. The actual patient records never leave the hospital. Ever.
Federated Learning (FL) is a machine learning paradigm where a model is trained across many decentralised devices or servers holding local data samples, without exchanging the raw data itself. The term was coined by Google researchers McMahan et al. in a paper first posted in 2016 and published at AISTATS in 2017; an early production use was training the Gboard keyboard prediction model on Android phones.
Federated Learning = train on local data, share only model updates, aggregate globally. The data never moves. Only knowledge does.
The global model travels down; only model updates travel up. No raw data ever crosses the boundary.
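The round trip described above can be sketched in a few lines. This is a minimal illustration only, assuming a toy linear-regression model and a hypothetical `client_update` helper (neither comes from the original text). It shows what crosses the network in each direction, not how a production system is built.

```python
import numpy as np

def client_update(global_w, X, y, lr=0.1, steps=5):
    """Hypothetical helper: run a few gradient steps on local data and
    return only the weight change. X and y never leave this function."""
    w = global_w.copy()
    for _ in range(steps):
        grad = 2 * X.T @ (X @ w - y) / len(y)  # gradient of mean squared error
        w -= lr * grad
    return w - global_w  # the update that travels up

rng = np.random.default_rng(0)
true_w = np.array([1.0, -2.0])

# Two clients, each holding private data that stays local
clients = []
for n in (30, 70):
    X = rng.normal(size=(n, 2))
    clients.append((X, X @ true_w))

global_w = np.zeros(2)
sizes = np.array([len(y) for _, y in clients])
for _ in range(20):
    # Model travels down, updates travel up
    deltas = [client_update(global_w, X, y) for X, y in clients]
    # Server: size-weighted average of the updates
    global_w = global_w + sum(s / sizes.sum() * d for s, d in zip(sizes, deltas))
```

After a few rounds `global_w` approaches `true_w`, even though the server only ever sees the two-element update vectors.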
Centralized vs. Federated Training
To appreciate federated learning, you must first deeply understand what it replaces and why the traditional approach breaks down at scale and in sensitive domains.
| Centralized training | Federated training |
|---|---|
| All raw data collected on one server | Raw data stays on local device/server |
| Massive storage infrastructure required | Distributed compute: clients share the load |
| Single point of failure & breach risk | No central data honeypot to breach |
| Data must physically travel over network | Only compressed model updates travel |
| Regulated data (health, finance) often illegal to centralise | Easier to align with HIPAA, GDPR, CCPA |
| User trust eroded: "my data left my device" | User trust maintained: "my data stayed home" |
| Latency: data collection takes time | Train on real-time edge data as it's generated |
Left: centralized training exposes raw data. Right: federated training shares only model updates.
| Dimension | Centralized ML | Federated Learning |
|---|---|---|
| Data location | Single central server | Stays on each device/node |
| Privacy | Raw data exposed to central party | Only model updates shared |
| Regulatory compliance | Complex; centralising data can conflict with GDPR/HIPAA | Easier, since raw data stays local |
| Data diversity | Limited to what you can collect | Real-world distribution from edge |
| Bandwidth cost | Transfer all data (very high) | Transfer gradients only (lower) |
| Scalability | Limited by central storage | Scales to millions of clients |
| Fault tolerance | Single point of failure | Resilient — clients fail gracefully |
| Freshness of data | Batch collection lag | Trains on real-time local data |
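The bandwidth row is easier to judge with a back-of-the-envelope calculation. The sketch below uses hypothetical helper names and made-up numbers, chosen only for illustration; it assumes uncompressed 4-byte parameters and that every selected client downloads the model and uploads an update of the same size each round.

```python
def centralized_bytes(num_clients, dataset_bytes_per_client):
    """Bytes moved if every client uploads its full raw dataset once."""
    return num_clients * dataset_bytes_per_client

def federated_bytes(num_params, rounds, clients_per_round, bytes_per_param=4):
    """Bytes moved if each selected client downloads the model and
    uploads a same-sized update every round (no compression assumed)."""
    model_bytes = num_params * bytes_per_param
    return 2 * model_bytes * rounds * clients_per_round

# Hypothetical numbers, for illustration only
central = centralized_bytes(num_clients=1_000, dataset_bytes_per_client=500 * 10**6)
federated = federated_bytes(num_params=1_000_000, rounds=100, clients_per_round=100)
print(f"centralized: {central / 1e9:.0f} GB, federated: {federated / 1e9:.0f} GB")
```

With these assumptions the federated run moves 80 GB against 500 GB for centralization. The saving is not automatic: a large model, many rounds, or small local datasets can reverse the comparison.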
Why Federated Learning Matters
Federated Learning isn't a minor engineering tweak — it's a fundamental rethinking of who owns data and where intelligence should live. Two forces make it indispensable: privacy and scale.
🔒 The Privacy Argument
Google's Gboard team faced exactly this problem. They needed to improve next-word prediction without reading anyone's messages. Federated Learning was the answer. The model trains on your device. Your words never leave your phone. The server only receives a tiny, averaged update that says "the model got slightly better at predicting X context" — not what you typed.
🏠 The Scale Argument
The second reason FL matters is raw scale. Modern ML needs enormous, diverse datasets. But the most valuable data in the world exists on billions of edge devices — phones, wearables, factory sensors, medical equipment — that will never centralise their data.
Federated Learning unlocks training on data that would otherwise be completely inaccessible to ML systems.
Privacy + Scale together create a virtuous cycle. Because users know their data stays local, they consent to participation. More participation means more training diversity. More diversity means better models. Better models improve user experience. Better UX increases participation. Federated Learning is the engine that makes this cycle possible.
Real-World Use Cases
Federated Learning has moved well beyond the research lab. Here are five of the most impactful domains where it is actively deployed or researched today.
| Industry | Key Players | FL Type | Primary Constraint | Result |
|---|---|---|---|---|
| ⌨️ Mobile / NLP | Google, Apple, Samsung | Cross-Device | User message privacy | +15-20% recall |
| 💊 Healthcare | Intel FeTS, NHS, Mayo | Cross-Silo | HIPAA / GDPR | Nature Medicine SOTA |
| 🔌 Manufacturing | Siemens, Bosch, ABB | Cross-Silo | IP protection | -22-35% downtime |
| 💵 Finance | WeBank, Visa, SWIFT | Vertical FL | Regulatory + competitive | AUC 0.73 → 0.91 |
| 🌎 Autonomous Vehicles | Waymo, BMW, Mobileye | Cross-Device | Edge data volume | Active research |
Key Players: Server, Clients, and Coordinator
Every federated system has three categories of actor. Understanding each one's role, responsibilities, and constraints is essential before you write a single line of FL code.
Three entities work in concert: the Coordinator orchestrates, the Server aggregates, and Clients train locally.
In centralised ML, data is usually assumed to be IID (independent and identically distributed). In federated settings, client data is almost always non-IID. A hospital in rural India sees different disease patterns than one in Tokyo. A teenager's keyboard has different language patterns than a lawyer's. This data heterogeneity is the single biggest technical challenge in federated learning, and the reason simple FedAvg sometimes fails. We will cover non-IID solutions in Topic 3.
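To experiment with this kind of heterogeneity, a common simulation trick is to split a labelled dataset across clients using Dirichlet-distributed class proportions. The sketch below is one way to do that, not a method taken from the text above; the function name `dirichlet_partition` and the parameter `alpha` are illustrative assumptions.

```python
import numpy as np

def dirichlet_partition(labels, num_clients, alpha, seed=0):
    """Split sample indices across clients with label skew.
    Smaller alpha gives more skewed (more non-IID) clients."""
    rng = np.random.default_rng(seed)
    client_indices = [[] for _ in range(num_clients)]
    for cls in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == cls))
        # Share of this class that each client receives
        proportions = rng.dirichlet(alpha * np.ones(num_clients))
        cut_points = (np.cumsum(proportions)[:-1] * len(idx)).astype(int)
        for client, part in enumerate(np.split(idx, cut_points)):
            client_indices[client].extend(part.tolist())
    return client_indices

# 1000 synthetic labels from 4 balanced classes
labels = np.repeat(np.arange(4), 250)
parts = dirichlet_partition(labels, num_clients=5, alpha=0.3)

# Per-client class histogram: rows are clients, columns are classes
histogram = np.array([np.bincount(labels[p], minlength=4) for p in parts])
print(histogram)
```

Each row of `histogram` shows how unevenly the classes land on one client. Raising `alpha` (for example to 100) makes the rows nearly uniform, which approximates the IID case.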
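w_global = Σ_k (n_k / n) × w_k, a weighted average of client weights, where n_k is the number of samples on client k and n is the total across the participating clients.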
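A small worked example may make the weighting concrete. The three weight vectors and dataset sizes below are hypothetical values picked so the arithmetic is easy to check by hand.

```python
import numpy as np

# Hypothetical client weights w_k and dataset sizes n_k
client_weights = [np.array([1.0, 2.0]), np.array([3.0, 4.0]), np.array([5.0, 6.0])]
client_sizes = [100, 300, 600]

n = sum(client_sizes)
w_global = sum((n_k / n) * w_k for n_k, w_k in zip(client_sizes, client_weights))

# 0.1*[1, 2] + 0.3*[3, 4] + 0.6*[5, 6] = [4.0, 5.0]
unweighted = np.mean(client_weights, axis=0)  # plain mean, for comparison
```

The size-weighted result is `[4.0, 5.0]`, while the plain mean is `[3.0, 4.0]`: the client with 600 samples pulls the global model toward its own weights.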
The FedAvg Algorithm — The Heart of FL
The Federated Averaging (FedAvg) algorithm, introduced by McMahan et al. (2017), is the foundation of almost every federated system deployed today. Understanding it precisely is essential before moving to more advanced FL methods.
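For reference, one round of FedAvg can be written compactly as follows. The notation is an assumption made for this sketch: $K$ clients, $F_k$ the local loss on client $k$, $S_t$ the set of clients selected in round $t$, $\eta$ the learning rate, and $w_t$ the global weights (called w_global elsewhere in this section).

```latex
\begin{align*}
  % Global objective: size-weighted average of the client losses
  \min_{w}\; f(w) &= \sum_{k=1}^{K} \frac{n_k}{n}\, F_k(w),
      & n &= \sum_{k=1}^{K} n_k \\
  % Local training: each selected client starts from w_t and repeats this step
  w^{k} &\leftarrow w^{k} - \eta\, \nabla F_k\!\left(w^{k}\right),
      & w^{k} &\text{ initialised to } w_t \\
  % Aggregation: size-weighted average over the selected clients
  w_{t+1} &= \sum_{k \in S_t} \frac{n_k}{n_{S_t}}\, w^{k}_{t+1},
      & n_{S_t} &= \sum_{k \in S_t} n_k
\end{align*}
```

The local step is repeated for $E$ epochs over mini-batches of size $B$ before the client reports $w^{k}_{t+1}$. The aggregation normalises by the sizes of the selected clients only, so the coefficients always sum to one.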
💻 Minimal FedAvg Implementation in Python
import numpy as np
import copy
from collections import OrderedDict
import torch
import torch.nn as nn

# ── Simple model definition ──────────────────────────────
class SimpleNet(nn.Module):
    def __init__(self, input_dim=20, hidden=64, num_classes=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_classes)
        )

    def forward(self, x):
        return self.net(x)

# ── Local training on a single client ────────────────────
def local_train(model, data_loader, lr=0.01, epochs=3):
    model = copy.deepcopy(model)  # each client gets its own copy
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for X, y in data_loader:
            optimizer.zero_grad()
            loss = criterion(model(X), y)
            loss.backward()
            optimizer.step()
    return model.state_dict()  # return updated weights

# ── FedAvg aggregation ────────────────────────────────────
def fedavg(global_weights, client_weights, client_sizes):
    """Weighted average of client model weights by dataset size."""
    total = sum(client_sizes)
    averaged = copy.deepcopy(global_weights)
    # Initialise all tensors to zero
    for key in averaged:
        averaged[key] = torch.zeros_like(averaged[key], dtype=torch.float)
    # Weighted sum across all clients
    for w, n in zip(client_weights, client_sizes):
        for key in averaged:
            averaged[key] += (n / total) * w[key].float()
    return averaged

# ── Federated training loop ───────────────────────────────
def federated_train(global_model, client_data_loaders,
                    rounds=10, fraction=0.5):
    """
    global_model       : initialised PyTorch model
    client_data_loaders: list of DataLoaders, one per client
    rounds             : number of FL communication rounds
    fraction           : fraction of clients selected per round
    """
    num_clients = len(client_data_loaders)
    for r in range(rounds):
        # Step 1: select a subset of clients (coordinator logic)
        selected_idx = np.random.choice(
            num_clients,
            size=max(1, int(fraction * num_clients)),
            replace=False
        )
        client_weights = []
        client_sizes = []
        # Step 2: each selected client trains locally
        for idx in selected_idx:
            local_w = local_train(global_model, client_data_loaders[idx])
            client_weights.append(local_w)
            client_sizes.append(len(client_data_loaders[idx].dataset))
        # Step 3: server aggregates with FedAvg
        new_global_weights = fedavg(
            global_model.state_dict(), client_weights, client_sizes
        )
        global_model.load_state_dict(new_global_weights)
        print(f"Round {r+1}/{rounds} complete — {len(selected_idx)} clients trained")
    return global_model
E (local epochs): More epochs = fewer rounds needed, but more client drift on non-IID data. Start with E=1 for safety.
C (fraction): Higher fraction = more stable aggregation but slower rounds. C=0.1 is typical for mobile FL.
B (batch size): Smaller batches = more gradient noise = more regularisation. B=10 is common.
η (learning rate): Use a schedule that decays over rounds. Start at 0.01–0.1.
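One simple way to decay the learning rate over rounds is an exponential schedule. The function name `lr_for_round` and the default values below are hypothetical; the right decay rate depends on the task and should be tuned.

```python
def lr_for_round(round_idx, lr0=0.05, decay=0.99):
    """Exponential decay: lr0 * decay**round_idx, with round_idx starting at 0."""
    return lr0 * decay ** round_idx

# Learning rate at a few selected rounds
schedule = {r: lr_for_round(r) for r in (0, 10, 50, 100)}
for r, lr in schedule.items():
    print(f"round {r:3d}: lr = {lr:.4f}")
```

In the training loop above, the value for round `r` could be passed as the `lr` argument of `local_train` so that every client selected in that round uses the same rate.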
Three Types of Federated Learning
Not all federated learning systems look the same. The differences in data distribution lead to three distinct FL paradigms. Knowing which type you need determines your entire architecture.
The type of FL required is determined by how much feature and sample overlap exists between clients.
| FL Type | Feature Overlap | Sample Overlap | Typical Setting | Key Algorithm |
|---|---|---|---|---|
| Horizontal FL | High | Low | Same task, different users (phones, hospitals) | FedAvg |
| Vertical FL | Low | High | Different features of same users (bank + retailer) | Split Learning |
| Federated Transfer | Low | Low | Different tasks, different users, related domains | FedTrans |
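The difference between horizontal and vertical FL comes down to how one logical table is cut. The sketch below uses a small hypothetical array, with rows standing for users and columns for features, to show both cuts.

```python
import numpy as np

# Hypothetical table: 6 users (rows) x 4 features (columns)
data = np.arange(24).reshape(6, 4)

# Horizontal FL: clients share the feature space but hold different users
horizontal_a, horizontal_b = data[:3, :], data[3:, :]

# Vertical FL: clients share the users but hold different features
vertical_a, vertical_b = data[:, :2], data[:, 2:]

print(horizontal_a.shape, horizontal_b.shape)  # row-wise split
print(vertical_a.shape, vertical_b.shape)      # column-wise split
```

In the horizontal case each client can train the same model on its own rows, which is why FedAvg applies directly. In the vertical case no single party holds a complete feature vector, so the parties have to cooperate on every sample.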
Summary & What You Now Know
You have covered the complete conceptual foundation of Federated Learning. Here is what to carry forward:
Topic 2: The FedAvg Algorithm in Depth. We will implement a complete federated learning system from scratch using PyTorch and Flower (flwr), simulate non-IID data distribution, compare FedAvg vs FedProx, and analyse convergence behaviour across different client heterogeneity levels. You will build a working FL pipeline end-to-end.