
Introduction to Federated Learning: Privacy-First Machine Learning at Scale

Federated Learning trains AI models across millions of devices and institutions without ever moving raw data. This tutorial covers the core concept, how it differs from centralised ML, why privacy and scale make it essential, four real-world deployments (Gboard, FeTS cancer research, Siemens manufacturing, WeBank fraud detection), and the three key actors — coordinator, server, and clients — with a fully working FedAvg Python implementation.

Section 01

What is Federated Learning?

The Doctor Who Never Leaves Town
Imagine 100 hospitals across the country each have patient records that could help train an AI to detect cancer early. The obvious approach: send all records to one central server, train the model, done.

But there's a problem. Patient data is deeply private. Regulations like HIPAA tightly restrict how it can be shared. Patients didn't consent to their records leaving the building. IT departments would reject the transfer. Legal teams would shut it down before it started.

Federated Learning solves this. Instead of moving the data to the model, you move the model to the data. Each hospital trains on its own records locally. Only the learned gradients — mathematical summaries of what was learned — are sent back to a central server. The server combines these updates into a better global model and sends it back. The actual patient records never leave the hospital. Ever.

Federated Learning (FL) is a machine learning paradigm where a model is trained across many decentralised devices or servers holding local data samples, without exchanging the raw data itself. The term was introduced by Google researchers McMahan et al. (preprint 2016, published 2017), with on-device models such as the Gboard keyboard prediction model on Android phones among the earliest applications.

💡
The One-Line Definition

Federated Learning = train on local data, share only model updates, aggregate globally. The data never moves. Only knowledge does.

🔄 How a Single Federated Round Works
Step 1
Server sends the current global model to a subset of clients (e.g. 100 phones out of 1 million).
Step 2
Each selected client trains the model locally on its own private data for a few epochs.
Step 3
Each client sends back only the model update (gradient or weight delta) — not the raw data.
Step 4
Server aggregates all updates (e.g. via FedAvg) into an improved global model.
Step 5
Repeat for many rounds until the model converges. Raw data stayed local throughout.
🏠 Federated Learning — Architecture Overview
[Figure: a global server that aggregates updates with FedAvg, connected to four clients (Client A: hospital, Client B: smartphone, Client C: IoT sensor, Client D: bank branch), each holding private local data. The model is sent to clients; gradients are sent to the server; raw data never leaves the clients.]

The global model travels down; only gradient updates travel up. No raw data ever crosses the boundary.
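
The five steps above can be sketched in a few lines of dependency-free Python. This is a toy illustration under stated assumptions, not a production pattern: the "model" is a single weight w fitting y ≈ w·x, and the names `local_update` and `run_round`, the synthetic client data, and the learning rate are all made up for the example.

```python
import random

def local_update(w, data, lr=0.01, epochs=1):
    """Step 2: a client refines the global weight on its own (x, y) pairs."""
    for _ in range(epochs):
        for x, y in data:
            grad = 2 * (w * x - y) * x      # gradient of squared error for y ≈ w * x
            w -= lr * grad
    return w

def run_round(global_w, clients, num_selected, rng):
    """Steps 1, 3 and 4: select clients, collect weights, average by data size."""
    selected = rng.sample(clients, num_selected)                        # Step 1
    results = [(local_update(global_w, d), len(d)) for d in selected]   # Steps 2-3
    total = sum(n for _, n in results)
    return sum(w * n / total for w, n in results)                       # Step 4

rng = random.Random(0)
# Hypothetical private datasets: every client observes y = 3x plus small noise.
clients = [
    [(x, 3 * x + rng.gauss(0, 0.1)) for x in (rng.uniform(-1, 1) for _ in range(20))]
    for _ in range(10)
]

w = 0.0
for _ in range(50):                                                     # Step 5
    w = run_round(w, clients, num_selected=5, rng=rng)
print(f"learned w = {w:.2f}")   # should land close to the true slope of 3
```

Only the scalar `w` crosses the client boundary in `run_round`; the `(x, y)` pairs are read inside `local_update` and nowhere else.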


Section 02

Centralized vs. Federated Training

To appreciate federated learning, first look at what it replaces and why the traditional approach breaks down at scale and in sensitive domains.

❌ Centralized Training
All raw data collected on one server
Massive storage infrastructure required
Single point of failure & breach risk
Data must physically travel over network
Centralising regulated data (health, finance) is often unlawful
User trust eroded: "my data left my device"
Latency: data collection takes time
✅ Federated Training
Raw data stays on local device/server
Distributed compute — clients share the load
No central data honeypot to breach
Only compressed model updates travel
Easier to reconcile with HIPAA, GDPR, CCPA
User trust maintained: "my data stayed home"
Train on real-time edge data as it's generated
⚖️ Architecture Comparison
[Figure: side-by-side comparison. Centralized: clients send raw data to a central database that holds all of it (raw data exposed). Federated: clients keep their data locked locally and send only model weights to an aggregator (only gradients shared).]

Left: centralized training exposes raw data. Right: federated training shares only model gradients.

Dimension | Centralized ML | Federated Learning
Data location | Single central server | Stays on each device/node
Privacy | Raw data exposed to central party | Only model updates shared
Regulatory compliance | Complex; often violates GDPR/HIPAA | Designed for compliance
Data diversity | Limited to what you can collect | Real-world distribution from edge
Bandwidth cost | Transfer all data (very high) | Transfer gradients only (lower)
Scalability | Limited by central storage | Scales to millions of clients
Fault tolerance | Single point of failure | Resilient: clients fail gracefully
Freshness of data | Batch collection lag | Trains on real-time local data
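
The bandwidth row is worth checking with arithmetic, because federated training is not automatically cheaper: every round moves a full model in each direction. The sketch below uses invented numbers (record size, model size, number of rounds) purely as assumptions, and `centralised_bytes` and `federated_bytes` are illustrative helper names.

```python
def centralised_bytes(num_records, bytes_per_record):
    """One-off cost of uploading all of a client's raw data."""
    return num_records * bytes_per_record

def federated_bytes(num_params, rounds_participated, bytes_per_param=4):
    """Per-client cost: download the model and upload an update each round."""
    return 2 * num_params * bytes_per_param * rounds_participated

# Assumed scenario: a hospital with 5,000 MRI volumes of about 20 MiB each,
# versus a 10-million-parameter float32 model exchanged over 100 rounds.
raw = centralised_bytes(5_000, 20 * 1024**2)
fed = federated_bytes(10_000_000, 100)
print(f"raw upload : {raw / 1024**3:.1f} GiB")
print(f"FL traffic : {fed / 1024**3:.1f} GiB")
```

With small records and a large model (for example short text snippets on a phone) the comparison can flip, which is one reason update compression matters in cross-device settings.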

Section 03

Why Federated Learning Matters

Federated Learning isn't a minor engineering tweak — it's a fundamental rethinking of who owns data and where intelligence should live. Two forces make it indispensable: privacy and scale.

🔒 The Privacy Argument

The Consent Problem
When you type a message on your phone, that text reflects your relationships, health worries, financial stress, and daily habits. Every autocomplete suggestion your keyboard learns from is a window into your life. Now imagine that data — from a billion phones — being sent to a company's server. Even anonymised, it can be re-identified. Even encrypted in transit, it can be breached at rest.

Google's Gboard team faced exactly this problem. They needed to improve next-word prediction without reading anyone's messages. Federated Learning was the answer. The model trains on your device. Your words never leave your phone. The server only receives a tiny, averaged update that says "the model got slightly better at predicting X context" — not what you typed.
🔒
Data Minimisation
GDPR Article 5 requires that personal data be limited to what is necessary. FL never collects personal data in the first place — it processes it where it lives.
🦔
Reduced Breach Surface
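No central database of personal data = no honeypot. Even if the server is compromised, there are no raw user records to steal. Gradients are much harder to interpret than raw records, although they can still leak information about the training data, which is why deployments often add secure aggregation or differential privacy.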
🕵
User Sovereignty
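Users retain control. If a user deletes their data locally, no copy remains on a central server and their future contributions stop. Removing the influence of past updates from an already trained model (machine unlearning) is harder and remains an active research area. This supports, but does not by itself satisfy, GDPR's "right to erasure."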

🏠 The Scale Argument

The second reason FL matters is raw scale. Modern ML needs enormous, diverse datasets. But the most valuable data in the world exists on billions of edge devices — phones, wearables, factory sensors, medical equipment — that will never centralise their data.

🌍 The Scale FL Can Reach
[Figure: data points reachable by approach. Centralised dataset: ~10M; FL cross-silo (hospitals): ~500M; FL cross-device (smartphones): ~1B; FL IoT edge (sensors): ~15B.]

Federated Learning unlocks training on data that would otherwise be completely inaccessible to ML systems.

🎉
The Compounding Benefit

Privacy + Scale together create a virtuous cycle. Because users know their data stays local, they consent to participation. More participation means more training diversity. More diversity means better models. Better models improve user experience. Better UX increases participation. Federated Learning is the engine that makes this cycle possible.


Section 04

Real-World Use Cases

Federated Learning has moved well beyond the research lab. Here are the four most impactful domains where it is actively deployed today.

⌨️ Use Case 1 — Mobile Keyboard & Next-Word Prediction
Who
Google Gboard, Apple iOS Keyboard, Samsung Keyboard
Problem
Improve autocomplete and next-word prediction across 1B+ phones without ever reading user messages.
FL Solution
Each phone trains a local language model on recent typing history. Only masked gradient updates are uploaded when the phone is idle, charging, and on WiFi. Server aggregates using FedAvg.
Result
15–20% improvement in next-word recall vs. models trained on public text corpora, because the model learns real, personalised language patterns.
💊 Use Case 2 — Healthcare & Medical Imaging
Who
Intel + 29 international cancer research sites (FeTS project), NHS, Mayo Clinic
Problem
Brain tumour segmentation models need MRI scans from diverse populations, but patient scans cannot leave hospitals due to HIPAA and equivalent laws.
FL Solution
Intel's Federated Tumour Segmentation (FeTS) project trained a brain tumour model across 29 institutions in 6 countries. Each site trained locally on its own MRI data. Only weight updates were exchanged over encrypted channels.
Result
Outperformed any single-site model and matched models trained on centralised data, while maintaining full HIPAA compliance. Landmark result published in Nature Communications, 2022.
🔌 Use Case 3 — IoT & Industrial Edge
Who
Siemens, Bosch, ABB, NVIDIA Jetson-based edge deployments
Problem
Predictive maintenance models need sensor data from factory equipment. But manufacturing data reveals production volumes, defect rates, and competitive IP — companies refuse to share it centrally.
FL Solution
Each factory trains anomaly detection locally on its CNC machines, conveyor sensors, and motor vibration data. A shared federated model learns general failure signatures without any factory exposing its specific data.
Result
Reduced unplanned downtime by 22–35% in pilot deployments, while preserving manufacturing IP. Siemens reported zero data sovereignty disputes across cross-country partnerships.
💵 Use Case 4 — Financial Services & Fraud Detection
Who
Visa, Mastercard, WeBank (Tencent), SWIFT
Problem
Fraud detection requires patterns across many banks. But transaction data is fiercely regulated and competitively sensitive — banks cannot pool customer data with rivals.
FL Solution
WeBank pioneered FATE (Federated AI Technology Enabler), enabling cross-bank fraud detection via vertical federated learning — each bank contributes different feature columns about overlapping customers without sharing raw transaction histories.
Result
AUC improved from 0.73 to 0.91 for fraud detection at WeBank, while maintaining full data isolation between institutions. FATE is now open-source with 400+ contributor organisations.
Industry | Key Players | FL Type | Primary Constraint | Result
⌨️ Mobile / NLP | Google, Apple, Samsung | Cross-Device | User message privacy | +15–20% recall
💊 Healthcare | Intel FeTS, NHS, Mayo | Cross-Silo | HIPAA / GDPR | Matched centralised training
🔌 Manufacturing | Siemens, Bosch, ABB | Cross-Silo | IP protection | 22–35% less downtime
💵 Finance | WeBank, Visa, SWIFT | Vertical FL | Regulatory + competitive | AUC 0.73 → 0.91
🌎 Autonomous Vehicles | Waymo, BMW, Mobileye | Cross-Device | Edge data volume | Active research

Section 05

Key Players: Server, Clients, and Coordinator

Every federated system has three categories of actor. Understanding each one's role, responsibilities, and constraints is essential before you write a single line of FL code.

🎸 The Three Roles in a Federated System
[Figure: the Coordinator orchestrates rounds and selects participants; the Aggregation Server merges model updates and runs FedAvg / FedProx; N clients (N = 10 to 10,000,000+) hold private local data, train locally, and send Δw. Gradients flow from clients to the server; the global model flows back.]

Three entities work in concert: the Coordinator orchestrates, the Server aggregates, and Clients train locally.

📋
The Coordinator
Orchestration Layer
Decides which clients participate in each round (client selection), how many rounds to run, and when training converges. In Google's FL system, the coordinator selects ~1% of eligible devices per round based on battery charge, WiFi availability, and idle status. Often implemented as a separate microservice from the aggregation server.
☁️
The Aggregation Server
Model Aggregation
Receives encrypted weight updates from clients, runs the aggregation algorithm (FedAvg, FedProx, SCAFFOLD etc.), and produces an updated global model. The server never sees raw client data — only gradients. In cross-silo settings this is often a trusted third party or a cryptographically secured computation cluster.
📱
The Clients
Local Trainers
Each client holds its own local dataset that never leaves the device. When selected, a client downloads the current global model, runs local SGD for E epochs (typically 1–5), and uploads the resulting weight delta (Δw) or full updated weights back to the server. Clients can be phones, hospital servers, factory machines, or browser instances.
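
The coordinator's selection step can be sketched as a filter followed by random sampling. This is a minimal illustration, not Google's actual selection logic: the `Device` fields, the eligibility rule, and the simulated fleet are assumptions chosen to mirror the battery, WiFi, and idle criteria mentioned above.

```python
import random
from dataclasses import dataclass

@dataclass
class Device:
    device_id: int
    charging: bool
    on_wifi: bool
    idle: bool

def select_clients(devices, fraction, rng):
    """Pick a random fraction of the devices that are currently eligible."""
    eligible = [d for d in devices if d.charging and d.on_wifi and d.idle]
    if not eligible:
        return []                      # nothing to train on this round
    k = max(1, int(fraction * len(eligible)))
    return rng.sample(eligible, k)     # sampling without replacement

rng = random.Random(42)
# Simulated fleet with made-up availability probabilities.
fleet = [
    Device(i, charging=rng.random() < 0.5, on_wifi=rng.random() < 0.7,
           idle=rng.random() < 0.6)
    for i in range(1000)
]
chosen = select_clients(fleet, fraction=0.01, rng=rng)
print(f"{len(chosen)} devices selected for this round")
```

Keeping this logic separate from aggregation is what allows the coordinator to run as its own service.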
⚠️
The Non-IID Problem — The Core Challenge

In centralised ML, data is usually assumed to be IID (independent and identically distributed). In federated settings, client data is almost always non-IID. A hospital in rural India sees different disease patterns than one in Tokyo. A teenager's keyboard has different language patterns than a lawyer's. This data heterogeneity is the single biggest technical challenge in federated learning, and the reason simple FedAvg sometimes fails. We will cover non-IID solutions in Topic 3.
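
One common way to simulate this heterogeneity in experiments is to split each class across clients using proportions drawn from a Dirichlet distribution: a small `alpha` gives strongly skewed clients, a large `alpha` approaches an even split. The sketch below is an illustration of that idea with synthetic labels; `dirichlet_partition` is a hypothetical helper written for this example, and the Dirichlet sample is obtained by normalising Gamma draws from the standard library.

```python
import random
from collections import Counter

def dirichlet_partition(labels, num_clients, alpha, rng):
    """Split sample indices across clients with label skew controlled by alpha."""
    client_indices = [[] for _ in range(num_clients)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        # Dirichlet(alpha) sample via normalised Gamma draws.
        draws = [rng.gammavariate(alpha, 1.0) for _ in range(num_clients)]
        total = sum(draws)
        start = 0
        for c, g in enumerate(draws):
            # The last client takes the remainder so no sample is lost.
            if c == num_clients - 1:
                end = len(idx)
            else:
                end = start + int(len(idx) * g / total)
            client_indices[c].extend(idx[start:end])
            start = end
    return client_indices

rng = random.Random(0)
labels = [i % 4 for i in range(400)]      # 4 balanced classes (synthetic)
parts = dirichlet_partition(labels, num_clients=5, alpha=0.3, rng=rng)
for c, idx in enumerate(parts):
    print(c, dict(Counter(labels[i] for i in idx)))
```

Printing the per-client label counts makes the skew visible: with `alpha=0.3` most clients end up dominated by one or two classes.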

🌟 The Five Principles of Federated Learning
1
Data stays local. This is the non-negotiable axiom. Raw training data must never leave the device or silo where it was generated. Violate this and you have centralised ML, not FL.
2
Share only model updates. Gradients or weight deltas are communicated, never raw features or labels. Even these should be compressed and ideally noise-enhanced with differential privacy.
3
Aggregate to improve. The central server exists solely to aggregate. It has no business logic around the data. FedAvg is the canonical algorithm: w_global = Σ (n_k / n) × w_k — a weighted average of client weights.
4
Tolerate unreliable clients. In cross-device FL, clients drop out constantly. The system must handle stragglers, disconnections, and partial rounds gracefully. Plan for 30–60% of selected clients to fail to report in any given round.
5
Evaluate on a held-out server dataset. Since you can't centralise client data, maintain a small, clean, representative validation set on the server to track global model quality across rounds. This is your only reliable accuracy signal.
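
Principle 2 mentions adding noise to updates. The usual building block is to clip each client update to a maximum L2 norm and then add Gaussian noise. The sketch below shows only that mechanical step on a flat list of floats; `clip_and_noise` is a hypothetical helper, and the `clip_norm` and `noise_std` values are arbitrary assumptions that are not calibrated to any formal (ε, δ) privacy guarantee.

```python
import math
import random

def clip_and_noise(update, clip_norm, noise_std, rng):
    """Scale the update to at most clip_norm (L2), then add Gaussian noise."""
    norm = math.sqrt(sum(v * v for v in update))
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    return [v * scale + rng.gauss(0.0, noise_std) for v in update]

rng = random.Random(0)
delta = [3.0, 4.0]                        # L2 norm is 5.0

# With noise disabled, only the clipping is visible: same direction, norm 1.0.
clipped = clip_and_noise(delta, clip_norm=1.0, noise_std=0.0, rng=rng)
print(clipped)

# With noise enabled, each coordinate is perturbed independently.
noisy = clip_and_noise(delta, clip_norm=1.0, noise_std=0.1, rng=rng)
print(noisy)
```

Clipping bounds how much any single client can move the global model, which is what makes the added noise meaningful; choosing the noise level for a target privacy budget requires a separate privacy accounting step that this sketch does not attempt.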

Section 06

The FedAvg Algorithm — The Heart of FL

The Federated Averaging (FedAvg) algorithm, introduced by McMahan et al. (2017), is the foundation of almost every federated system deployed today. Understanding it precisely is essential before moving to more advanced FL methods.

Global Aggregation
$w_{t+1} = \sum_{k} \frac{n_k}{n} \, w_k^{t+1}$
Weighted average of the locally updated client weights $w_k^{t+1}$ by local dataset size $n_k$, with $n = \sum_k n_k$. Larger clients contribute more.
Local Client Update
$w_k^{t+1} = w_t - \eta \, \nabla L_k(w_t)$
One SGD step on the local loss function $L_k$, starting from the global weights $w_t$. Each client repeats this step over its mini-batches for E epochs.
Communication Round
$\Delta w_k = w_k^{t+1} - w_t$
Only the weight delta is transmitted. This can be compressed to reduce bandwidth by 100–1000×.
Convergence Criterion
$\lVert w_{t+1} - w_t \rVert < \varepsilon$
Training stops when the change in global weights between rounds falls below the threshold $\varepsilon$.
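
One simple form of the update compression mentioned above is top-k sparsification: the client sends only the k largest-magnitude entries of its weight delta as (index, value) pairs, and the server treats everything else as zero. The sketch below illustrates that idea on a flat list; `top_k_sparsify` and `densify` are hypothetical helper names, and the example makes no attempt to reproduce any particular compression ratio.

```python
def top_k_sparsify(delta, k):
    """Keep the k largest-magnitude entries of a flat update as (index, value)."""
    ranked = sorted(range(len(delta)), key=lambda i: abs(delta[i]), reverse=True)
    return [(i, delta[i]) for i in sorted(ranked[:k])]

def densify(sparse, length):
    """Server side: rebuild a dense update, treating dropped entries as zero."""
    dense = [0.0] * length
    for i, v in sparse:
        dense[i] = v
    return dense

delta = [0.01, -0.90, 0.02, 0.00, 0.75, -0.03, 0.40, 0.00]
sparse = top_k_sparsify(delta, k=3)
print(sparse)                        # [(1, -0.9), (4, 0.75), (6, 0.4)]
print(densify(sparse, len(delta)))
```

Dropping small entries loses information, so practical schemes often accumulate the discarded residual locally and add it to the next round's delta.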

💻 Minimal FedAvg Implementation in Python

import numpy as np
import copy
import torch
import torch.nn as nn

# ── Simple model definition ──────────────────────────────
class SimpleNet(nn.Module):
    def __init__(self, input_dim=20, hidden=64, num_classes=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_classes)
        )
    def forward(self, x):
        return self.net(x)

# ── Local training on a single client ────────────────────
def local_train(model, data_loader, lr=0.01, epochs=3):
    model = copy.deepcopy(model)   # each client gets its own copy
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()

    model.train()
    for _ in range(epochs):
        for X, y in data_loader:
            optimizer.zero_grad()
            loss = criterion(model(X), y)
            loss.backward()
            optimizer.step()
    return model.state_dict()        # return updated weights

# ── FedAvg aggregation ────────────────────────────────────
def fedavg(global_weights, client_weights, client_sizes):
    """Weighted average of client model weights by dataset size."""
    total = sum(client_sizes)
    averaged = copy.deepcopy(global_weights)

    # Initialise all tensors to zero
    for key in averaged:
        averaged[key] = torch.zeros_like(averaged[key], dtype=torch.float)

    # Weighted sum across all clients
    for w, n in zip(client_weights, client_sizes):
        for key in averaged:
            averaged[key] += (n / total) * w[key].float()

    return averaged

# ── Federated training loop ───────────────────────────────
def federated_train(global_model, client_data_loaders,
                     rounds=10, fraction=0.5):
    """
    global_model      : initialised PyTorch model
    client_data_loaders: list of DataLoaders, one per client
    rounds            : number of FL communication rounds
    fraction          : fraction of clients selected per round
    """
    num_clients = len(client_data_loaders)

    for r in range(rounds):
        # Step 1: select a subset of clients (coordinator logic)
        selected_idx = np.random.choice(
            num_clients,
            size=max(1, int(fraction * num_clients)),
            replace=False
        )

        client_weights = []
        client_sizes   = []

        # Step 2: each selected client trains locally
        for idx in selected_idx:
            local_w = local_train(global_model, client_data_loaders[idx])
            client_weights.append(local_w)
            client_sizes.append(len(client_data_loaders[idx].dataset))

        # Step 3: server aggregates with FedAvg
        new_global_weights = fedavg(
            global_model.state_dict(), client_weights, client_sizes
        )
        global_model.load_state_dict(new_global_weights)

        print(f"Round {r+1}/{rounds} complete — {len(selected_idx)} clients trained")

    return global_model
CONSOLE OUTPUT (10 rounds, 10 clients, fraction=0.5)
Round 1/10 complete — 5 clients trained
Round 2/10 complete — 5 clients trained
Round 3/10 complete — 5 clients trained
Round 4/10 complete — 5 clients trained
Round 5/10 complete — 5 clients trained
Round 6/10 complete — 5 clients trained
Round 7/10 complete — 5 clients trained
Round 8/10 complete — 5 clients trained
Round 9/10 complete — 5 clients trained
Round 10/10 complete — 5 clients trained
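
The loop above always runs the full number of rounds. The convergence criterion described earlier in this section could be added as an early-stopping check between rounds. The sketch below works on plain dictionaries of float lists so that it runs without PyTorch; with real `state_dict` objects the same sum would be taken over tensors. `weight_change_norm` and `has_converged` are hypothetical helper names, and the threshold is an arbitrary assumption.

```python
import math

def weight_change_norm(old, new):
    """L2 norm of the difference between two weight dictionaries."""
    return math.sqrt(sum(
        (b - a) ** 2
        for key in old
        for a, b in zip(old[key], new[key])
    ))

def has_converged(old, new, eps=1e-3):
    """True when the global weights moved less than eps in the last round."""
    return weight_change_norm(old, new) < eps

# Made-up weights before and after one aggregation round.
w_t = {"layer.weight": [0.50, -0.20], "layer.bias": [0.10]}
w_t1 = {"layer.weight": [0.53, -0.24], "layer.bias": [0.10]}
print(round(weight_change_norm(w_t, w_t1), 3))   # 0.05
print(has_converged(w_t, w_t1))                  # False
```

Because only a subset of clients trains each round, the norm can fluctuate; checking it over several consecutive rounds is a more cautious stopping rule than a single comparison.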
🔑
Key Parameters You Must Tune

E (local epochs): More epochs = fewer rounds needed, but more client drift on non-IID data. Start with E=1 for safety.
C (fraction): Higher fraction = more stable aggregation but slower rounds. C=0.1 is typical for mobile FL.
B (batch size): Smaller batches = more gradient noise = more regularisation. B=10 is common.
η (learning rate): Use a schedule that decays over rounds. Start at 0.01–0.1.
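
How these four parameters interact is easier to see with numbers. The helpers below are illustrative assumptions rather than part of any library: `clients_per_round` mirrors the `max(1, int(fraction * num_clients))` rule in the training loop, `local_steps` counts SGD steps per client, and `decayed_lr` shows exponential decay as one possible schedule.

```python
import math

def clients_per_round(num_clients, fraction):
    """Number of clients selected each round for a given fraction C."""
    return max(1, int(fraction * num_clients))

def local_steps(num_samples, epochs, batch_size):
    """SGD steps a client performs per round: E * ceil(n_k / B)."""
    return epochs * math.ceil(num_samples / batch_size)

def decayed_lr(initial_lr, round_idx, decay=0.99):
    """Exponential per-round decay: eta_r = eta_0 * decay**r."""
    return initial_lr * decay ** round_idx

print(clients_per_round(1000, 0.1))                 # 100
print(local_steps(600, epochs=5, batch_size=10))    # 300
print(local_steps(600, epochs=1, batch_size=10))    # 60
print(round(decayed_lr(0.1, round_idx=50), 4))
```

Moving from E=1 to E=5 multiplies the local steps between aggregations by five, which is exactly where client drift on non-IID data comes from.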


Section 07

Three Types of Federated Learning

Not all federated learning systems look the same. The differences in data distribution lead to three distinct FL paradigms. Knowing which type you need determines your entire architecture.

📊 FL Types by Data Overlap
[Figure: three panels. Horizontal FL (same features, different samples): Client A holds age and income for rows 1-500, Client B holds the same features for rows 501-1000; examples: Gboard, FeTS; the most common FL type. Vertical FL (same samples, different features): a bank holds income and credit score, an e-commerce company holds purchase history and browsing patterns for the same user IDs; example: WeBank FATE; requires secure PSI (private set intersection). Federated Transfer (different features and samples): Domain A holds X-rays (EN, patients A), Domain B holds MRI (FR, patients B), with minimal overlap; examples: cross-language NLP, cross-country health.]

The type of FL required is determined by how much feature and sample overlap exists between clients.

FL Type | Feature Overlap | Sample Overlap | Typical Setting | Key Algorithm
Horizontal FL | High | Low | Same task, different users (phones, hospitals) | FedAvg
Vertical FL | Low | High | Different features of same users (bank + retailer) | Split Learning
Federated Transfer | Low | Low | Different tasks, different users, related domains | FedTrans
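
The difference between horizontal and vertical FL comes down to whether a shared table is cut by rows or by columns. The toy example below makes that concrete; the records, user IDs, and the `horizontal_split` and `vertical_split` helpers are all invented for illustration.

```python
# Toy table: one row per user, keyed by user ID. All values are made up.
records = {
    101: {"age": 34, "income": 52_000, "purchases": 12},
    102: {"age": 45, "income": 61_000, "purchases": 3},
    103: {"age": 29, "income": 48_000, "purchases": 27},
    104: {"age": 52, "income": 75_000, "purchases": 8},
}

def horizontal_split(table, user_groups):
    """Each party holds all features for a disjoint set of users."""
    return [{uid: dict(table[uid]) for uid in group} for group in user_groups]

def vertical_split(table, feature_groups):
    """Each party holds a disjoint set of features for the same users."""
    return [
        {uid: {f: row[f] for f in feats} for uid, row in table.items()}
        for feats in feature_groups
    ]

client_a, client_b = horizontal_split(records, [[101, 102], [103, 104]])
bank, retailer = vertical_split(records, [["age", "income"], ["purchases"]])

print(sorted(client_a), sorted(client_b))   # different users, same columns
print(bank[101], retailer[101])             # same user, different columns
```

In the vertical case the parties must first agree on which user IDs they share without revealing the rest, which is the role of private set intersection.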

Section 08

Summary & What You Now Know

You have covered the complete conceptual foundation of Federated Learning. Here is what to carry forward:

🌟 Topic 1 — Key Takeaways
Federated Learning = train locally, aggregate globally. Raw data never leaves its origin. Only model updates are shared.
It exists because privacy regulations and data sovereignty make centralisation impossible in healthcare, finance, mobile, and industrial IoT.
Three actors: Coordinator (orchestrates), Server (aggregates), Clients (train locally). Each has distinct responsibilities.
FedAvg is the canonical aggregation algorithm: a weighted average of client weights proportional to dataset size. It is the SGD of the federated world.
Non-IID data is the central challenge. Real-world client data is heterogeneous. This is what makes FL hard and what separates novice implementations from production-grade systems.
Three FL types exist: Horizontal (same features, different users), Vertical (different features, same users), and Federated Transfer (limited overlap, related domains).
🚀
Coming Up in Topic 2

Topic 2: The FedAvg Algorithm in Depth. We will implement a complete federated learning system from scratch using PyTorch and Flower (flwr), simulate non-IID data distribution, compare FedAvg vs FedProx, and analyse convergence behaviour across different client heterogeneity levels. You will build a working FL pipeline end-to-end.
