
Federated Learning: The Communication Flow, Explained

A visual, beginner-friendly guide to federated learning — how a server and many clients train one shared model without ever moving raw data. Covers the communication round loop with animated diagrams, the FedAvg aggregation math, a from-scratch Python implementation, communication-cost trade-offs, real-world uses (Gboard, hospitals, voice assistants), the non-IID/straggler/privacy challenges, and when not to use FL.

Section 01

The Story That Explains Federated Learning

Three Hospitals That Refuse to Share Patient Files
Three hospitals each want a better tumour-detection model. The catch: each one has only a few thousand scans — not enough to train a strong model alone — and none of them is legally allowed to ship raw patient images to the others.

So they hire a neutral coordinator. The coordinator never sees a single scan. Instead, it mails each hospital the same blank model. Each hospital trains it privately on its own scans, then mails back only the adjusted dials — the model weights, not the data. The coordinator averages the three sets of dials into one improved model and mails that back out. Repeat a few dozen times.

After a month, all three hospitals own a model that learned from everyone's scans — yet not one image ever left a hospital. That is Federated Learning: the model travels to the data, the data never travels.

Federated Learning (FL) is a way to train one shared machine-learning model across many devices or organisations without centralising their data. A coordinating server sends out the current global model; each client trains it locally on private data; clients send back only model updates; the server aggregates those updates into a better global model. The thing that flows over the network is math, not records.

📡
The Core Insight

Classic machine learning moves data to the model (copy everything into one datacentre, then train). Federated learning inverts this: it moves the model to the data. Because raw data stays put, FL unlocks training on sensitive, regulated, or simply too-large-to-move datasets — phones, hospitals, banks, factories — that could never be pooled centrally.


Section 02

The Communication Flow — Animated

Everything in federated learning revolves around a repeating communication loop between one server and many clients. The diagram below animates a single round: amber packets flow down (the server broadcasts the global model) and green packets flow up (clients return their local updates).

[Diagram: Server ↔ Client Communication (one round). One aggregation server is connected to four clients (A, B, C, D), each holding its own private data. Legend: down = global model broadcast; up = local updates returned; raw data = never transmitted.]

Notice what is missing from the wire: the data. Only model parameters move in either direction. This single property is the foundation of FL's privacy story, although, as Section 09 explains, the updates themselves still need protecting.


Section 03

Anatomy of One Communication Round

A round is the heartbeat of federated learning. Steps 1–5 repeat until the global model converges. Each round is one full up-and-down cycle of the diagram above.

01
Client Selection
The server picks a subset of available clients for this round — maybe 100 phones out of a million that are idle, charging, and on Wi-Fi. Not everyone participates every round.
02
Broadcast (Down)
The server sends the current global model weights to every selected client. This is the amber downstream traffic — identical weights to all participants.
03
Local Training
Each client trains the received model on its own private data for a few local epochs of SGD. No communication happens here — it is pure on-device computation.
04
Upload Updates (Up)
Each client sends back only its updated weights (or the weight delta) plus how many samples it trained on. This is the green upstream traffic. Raw data stays home.
05
Aggregate (FedAvg)
The server merges all uploaded updates into one improved global model — typically a weighted average. The result becomes the starting point for round N+1.
[Diagram: The Round, as a Cycle. Broadcast model → Local train → Upload updates → Aggregate → back to Broadcast.]

The pulse never stops: broadcast → train → upload → aggregate → broadcast again. Convergence usually takes anywhere from tens to thousands of rounds depending on data heterogeneity.
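Step 1 (client selection) can be sketched in a few lines. This is a minimal illustration, assuming a hypothetical `ClientStatus` record and a `select_clients` helper that are not part of any real framework; the eligibility flags mirror the "idle, charging, and on Wi-Fi" example, and the 50% probabilities are made up for the simulation.

```python
import random
from dataclasses import dataclass

@dataclass
class ClientStatus:          # hypothetical record, for illustration only
    client_id: int
    idle: bool
    charging: bool
    on_wifi: bool

def select_clients(statuses, k, rng):
    """Pick up to k clients that are idle, charging, and on Wi-Fi."""
    eligible = [c.client_id for c in statuses
                if c.idle and c.charging and c.on_wifi]
    # Sample without replacement; take everyone if fewer than k are eligible.
    return rng.sample(eligible, min(k, len(eligible)))

rng = random.Random(0)
# Simulated population of 1,000 devices with made-up availability flags.
population = [
    ClientStatus(i,
                 idle=rng.random() < 0.5,
                 charging=rng.random() < 0.5,
                 on_wifi=rng.random() < 0.5)
    for i in range(1000)
]

selected = select_clients(population, k=100, rng=rng)
print(len(selected), "clients selected for this round")
```

Sampling a fresh subset every round is what keeps per-round bandwidth bounded even when the total population is very large.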


Section 04

FedAvg — The Aggregation Math

The default aggregation algorithm is Federated Averaging (FedAvg), introduced by McMahan et al. in 2017. The key idea: don't just average the client models equally — weight each client's contribution by how much data it trained on, so a client with 100 samples influences the result ten times more than one with 10 samples.

📐
The FedAvg Update Rule

With K clients, where client k holds nₖ samples and produced local weights wₖ, and total samples n = Σₖ nₖ:

w_global = Σₖ (nₖ / n) · wₖ

Each client's weight vector is scaled by its data share nₖ/n, then summed. That is the entire aggregation step: deceptively simple, remarkably effective.

Data Share Weight
nₖ / n
Fraction of all training samples held by client k. Bigger local dataset → bigger say in the global model.
Global Aggregation
Σₖ (nₖ/n) wₖ
Weighted sum of every client's local weights. Errors and quirks from individual clients partly cancel out.
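A small worked example may make the weighting concrete. The three weight vectors and sample counts below are made up for illustration, and `fedavg` is a hypothetical helper name; the arithmetic is simply the update rule above.

```python
import numpy as np

def fedavg(weights, sizes):
    """Weighted average of client weight vectors by data share n_k / n."""
    sizes = np.asarray(sizes, dtype=float)
    shares = sizes / sizes.sum()          # n_k / n for each client
    stacked = np.stack(weights)           # shape: (num_clients, num_params)
    return shares @ stacked, shares

# Three hypothetical clients with 100, 10, and 90 samples.
local_weights = [np.array([1.0, 0.0]),
                 np.array([0.0, 1.0]),
                 np.array([1.0, 1.0])]
sample_counts = [100, 10, 90]

w_global, shares = fedavg(local_weights, sample_counts)
print("data shares:", shares)      # 0.5, 0.05, 0.45
print("global weights:", w_global) # 0.95, 0.5
```

The 100-sample client contributes a share of 0.5 and the 10-sample client a share of 0.05, which is the tenfold difference in influence described above.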

Section 05

Centralized vs Federated

❌ Centralized Training
Step | What Happens
1 | Copy all raw data to server
2 | Data leaves user devices
3 | Train one model centrally
Privacy | Weak: data pooled
Bandwidth | Huge upfront data transfer

✅ Federated Training
Step | What Happens
1 | Send model to data
2 | Data stays on device
3 | Aggregate weight updates
Privacy | Strong: data never moves
Bandwidth | Only weights, repeatedly

Section 06

Python Implementation — FedAvg From Scratch

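This self-contained simulation builds a small federation of 5 clients with different-sized local datasets, runs local logistic-regression training on each, and aggregates with weighted FedAvg. Every client here samples from the same distribution, so the data is unbalanced rather than truly non-IID. The script prints the global accuracy after each round, and no client's X, y is ever shared.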

import numpy as np

np.random.seed(42)
NUM_CLIENTS  = 5
ROUNDS       = 10
LOCAL_EPOCHS = 3
LR           = 0.1

# Build a federation with unequal-sized local datasets (same distribution on every client)
def make_client(n):
    X = np.random.randn(n, 4)
    w_true = np.array([1.5, -2.0, 0.8, 1.0])
    y = (X @ w_true + 0.3 * np.random.randn(n) > 0).astype(float)
    return X, y

clients = [make_client(np.random.randint(80, 200)) for _ in range(NUM_CLIENTS)]

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One client trains locally on its OWN data only
def local_update(w, X, y, epochs):
    w = w.copy()
    for _ in range(epochs):
        grad = X.T @ (sigmoid(X @ w) - y) / len(y)
        w -= LR * grad
    return w

def accuracy(w):
    correct = total = 0
    for X, y in clients:
        correct += ((sigmoid(X @ w) > 0.5) == y).sum()
        total   += len(y)
    return correct / total

# Server orchestrates the communication rounds
global_w = np.zeros(4)
for rnd in range(1, ROUNDS + 1):
    updates, sizes = [], []
    for X, y in clients:                      # steps 2-3: broadcast + local train
        updates.append(local_update(global_w, X, y, LOCAL_EPOCHS))
        sizes.append(len(y))

    n = sum(sizes)                            # steps 4-5: weighted FedAvg
    global_w = sum((s / n) * u for u, s in zip(updates, sizes))

    print(f"Round {rnd:2d} | global accuracy: {accuracy(global_w):.3f}")

print("Done — raw data never left any client.")
OUTPUT (illustrative; exact values depend on the random draw, and on this easy synthetic task accuracy may start higher and level off sooner)
Round  1 | global accuracy: 0.604
Round  2 | global accuracy: 0.731
Round  3 | global accuracy: 0.802
Round  4 | global accuracy: 0.851
Round  5 | global accuracy: 0.879
Round  6 | global accuracy: 0.901
Round  7 | global accuracy: 0.915
Round  8 | global accuracy: 0.924
Round  9 | global accuracy: 0.930
Round 10 | global accuracy: 0.934
Done — raw data never left any client.
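The `make_client` function above draws every client from the same distribution. To experiment with genuinely non-IID data, one option is to shift each client's features so that both the inputs and the label balance differ from client to client. The sketch below is an assumption-laden variant, not part of the original script: `make_skewed_client`, the `shift` parameter, and the chosen shift values are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
W_TRUE = np.array([1.5, -2.0, 0.8, 1.0])   # same ground truth as the script above

def make_skewed_client(n, shift, rng):
    """Client whose features are displaced along W_TRUE by `shift` units.

    A positive shift pushes most samples to the positive side of the
    decision boundary, a negative shift to the negative side.
    """
    direction = W_TRUE / np.linalg.norm(W_TRUE)
    X = rng.normal(size=(n, 4)) + shift * direction
    y = (X @ W_TRUE + 0.3 * rng.normal(size=n) > 0).astype(float)
    return X, y

shifts = [-1.5, -0.75, 0.0, 0.75, 1.5]     # hypothetical per-client skew
skewed_clients = [make_skewed_client(150, s, rng) for s in shifts]

for s, (X, y) in zip(shifts, skewed_clients):
    print(f"shift {s:+.2f} | positive-label share: {y.mean():.2f}")
```

Passing `skewed_clients` to the training loop in place of `clients` gives a setting where local updates pull in different directions, which is the situation Section 09 describes.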

What Actually Crosses the Network

The payloads make the privacy story concrete — weights go out and back, data never does.

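# Reuses clients, local_update, global_w, and LOCAL_EPOCHS from the script above.
X_k, y_k = clients[0]                                  # one client's private data (stays local)
w_k = local_update(global_w, X_k, y_k, LOCAL_EPOCHS)   # that client's locally trained weights
n_k = len(y_k)                                         # how many samples it trained on

# server -> client (downstream broadcast)
payload_down = {"global_weights": global_w}

# client -> server (upstream update)
payload_up   = {"weights": w_k, "num_samples": n_k}

# The raw training data (X_k, y_k) is NEVER part of any payload.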
🎯
The Payload Is the Whole Point

In production FL frameworks like Flower, TensorFlow Federated, and OpenFL, the same two payloads appear under framework-specific types. The server-side aggregation strategy (FedAvg is the usual starting point) and the client-side local training step (local_update here) are the two pieces you customise; the communication scaffolding is largely handled for you.


Section 07

Communication Cost — The Real Bottleneck

In FL, computation is cheap (it's spread across thousands of devices) but communication is expensive. Every round ships a full model both ways, often over slow, metered, unreliable mobile links. Reducing rounds and shrinking payloads is where most FL engineering effort goes.

Technique | What It Does | Effect on Communication
More local epochs | Train longer per round before uploading | Fewer rounds needed → less total traffic
Client subsampling | Only a fraction of clients join each round | Lower per-round bandwidth
Gradient compression / quantization | Send low-precision or sparse updates | Smaller payloads (often 10–100×)
Knowledge distillation | Send predictions, not full weights | Much smaller, but adds complexity
Naive FedSGD (upload every step) | Communicate after each mini-batch | Extremely chatty; avoid
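As a rough illustration of the quantization row, the sketch below maps a float32 weight delta onto 8-bit integers with a single scale and offset. It is a deliberately simple scheme with hypothetical helper names (`quantize_uint8`, `dequantize`); on its own it shrinks the payload by 4×, and the larger ratios quoted in the table generally rely on sparsification or lower bit widths on top of this.

```python
import numpy as np

def quantize_uint8(delta):
    """Uniformly quantize a float array to uint8 with one scale and offset."""
    lo, hi = float(delta.min()), float(delta.max())
    scale = (hi - lo) / 255.0 or 1.0          # avoid divide-by-zero for constant input
    levels = np.clip(np.round((delta - lo) / scale), 0, 255)
    return levels.astype(np.uint8), lo, scale

def dequantize(q, lo, scale):
    """Server-side reconstruction of the approximate float delta."""
    return q.astype(np.float32) * scale + lo

rng = np.random.default_rng(0)
delta = rng.normal(scale=0.01, size=10_000).astype(np.float32)  # made-up update

q, lo, scale = quantize_uint8(delta)
restored = dequantize(q, lo, scale)

print("float32 payload bytes:", delta.nbytes)   # 40000
print("uint8 payload bytes:  ", q.nbytes)       # 10000 (plus two floats of metadata)
print("max abs error:", float(np.abs(restored - delta).max()))
```

The reconstruction error is bounded by roughly half a quantization step, so the trade-off is a small amount of noise in each update for a quarter of the upstream traffic.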
A Real Number

In Snips' federated wake-word study, upstream communication was estimated at roughly 8 MB per user to reach target accuracy — reasonable for a smart-home device. The headline finding: an adaptive averaging strategy cut the number of communication rounds dramatically, which matters far more than raw compute.
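Figures like this come from simple arithmetic: payload size times the number of rounds a client takes part in. The back-of-the-envelope sketch below uses hypothetical numbers (a 1-million-parameter model, 20 participations, 100 clients per round) that are not taken from the Snips study, and the function names are made up for illustration.

```python
def upstream_mb_per_client(num_params, bytes_per_param, rounds_participated):
    """Total upload per client, in megabytes (1 MB = 1e6 bytes)."""
    return num_params * bytes_per_param * rounds_participated / 1e6

def round_traffic_mb(num_params, bytes_per_param, clients_per_round):
    """Server-side traffic for one round: one download and one upload per client."""
    return 2 * num_params * bytes_per_param * clients_per_round / 1e6

# Hypothetical 1M-parameter model; each client is selected for 20 rounds.
print(upstream_mb_per_client(1_000_000, 4, 20))  # 80.0 MB with float32 weights
print(upstream_mb_per_client(1_000_000, 1, 20))  # 20.0 MB with 8-bit weights
print(round_traffic_mb(1_000_000, 4, 100))       # 800.0 MB per round at the server
```

Both levers from the table show up directly: fewer rounds reduce `rounds_participated`, and compression reduces `bytes_per_param`.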


Section 08

Where Federated Learning Shines

⌨️
Mobile Keyboards
Gboard next-word
Google's Gboard learns to predict your next word and emoji from typing patterns on millions of phones — without your keystrokes ever leaving the device. The textbook FL deployment.
🏥
Healthcare
multi-hospital models
Hospitals jointly train diagnostic models on scans and records that legally cannot be pooled. FedAvg is the most common aggregator in federated histopathology research.
🎤
Voice Assistants
wake-word detection
"Hey Siri" / "OK Google"-style detectors improve from real on-device audio without that audio being uploaded — speech being among the most sensitive data there is.

Section 09

The Three Hard Problems

⚠️ What Makes Federated Learning Genuinely Difficult
Non-IID
Statistical heterogeneity. Each client's data looks different — one phone types in French, another in code. Local models pull in conflicting directions, and plain FedAvg can converge slowly or to a worse optimum. FedProx, SCAFFOLD, and FedNova were invented to fix this.
Stragglers
Systems heterogeneity. Clients differ wildly in speed and reliability. Some drop out mid-round (battery dies, network drops). The server must tolerate partial, late, or missing updates without stalling.
Privacy
Updates can still leak. Model weights are not data, but a determined attacker can sometimes reconstruct information from them. FL is therefore layered with secure aggregation and differential privacy.
🔐
FL Alone Is Not Full Privacy

Keeping raw data on-device is necessary but not sufficient. Secure aggregation ensures the server only ever sees the sum of updates (never any single client's), and differential privacy adds calibrated noise so no individual's contribution can be singled out. Privacy-sensitive production deployments typically combine all three: on-device data, secure aggregation, and differential privacy.
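The two add-on protections can be illustrated with toy code. This is a conceptual sketch only: real secure aggregation protocols use key agreement and finite-field arithmetic rather than shared Gaussian masks, and real differential privacy requires calibrating the noise to a privacy budget. The names `masked_updates` and `clip_and_noise`, and the `clip_norm` and `noise_std` values, are hypothetical.

```python
import numpy as np

def masked_updates(updates, rng):
    """Toy pairwise masking: each pair (i, j) shares a random mask that
    client i adds and client j subtracts, so all masks cancel in the sum."""
    masked = [u.astype(float).copy() for u in updates]
    for i in range(len(updates)):
        for j in range(i + 1, len(updates)):
            mask = rng.normal(size=updates[i].shape)
            masked[i] += mask
            masked[j] -= mask
    return masked

def clip_and_noise(update, clip_norm, noise_std, rng):
    """Clip the update's L2 norm, then add Gaussian noise (uncalibrated here)."""
    norm = np.linalg.norm(update)
    clipped = update * min(1.0, clip_norm / max(norm, 1e-12))
    return clipped + rng.normal(scale=noise_std, size=update.shape)

rng = np.random.default_rng(0)
updates = [rng.normal(size=4) for _ in range(3)]   # made-up client updates

masked = masked_updates(updates, rng)
print("sum preserved:", np.allclose(sum(masked), sum(updates)))            # True
print("individual update hidden:", not np.allclose(masked[0], updates[0])) # True

noisy = clip_and_noise(updates[0], clip_norm=1.0, noise_std=0.1, rng=rng)
print("noisy clipped update:", noisy)
```

The server can recover the aggregate from the masked uploads but learns nothing useful from any single one, which is the property the callout describes.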


Section 10

FedAvg vs Its Successors

Algorithm | Key Idea | Best For
FedAvg | Weighted average of client weights | The default baseline; near-IID data
FedProx | Adds a proximal term keeping locals near the global | Non-IID data & stragglers
SCAFFOLD | Uses control variates to correct client drift | Heavy heterogeneity, fewer rounds
FedNova | Normalizes for differing local step counts | Clients doing unequal local work
FedAdam / FedYogi | Adaptive optimizer on the server side | Faster, more stable convergence
🏆
The Practitioner's Rule

Start with FedAvg — it is the universally supported baseline and works well when client data is reasonably similar. Only reach for FedProx, SCAFFOLD, or a server-side adaptive optimizer once you actually observe slow or unstable convergence caused by non-IID data. Don't pay the complexity tax until your data demands it.
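If you do need to upgrade, FedProx is the smallest change: the local objective gains a penalty (mu/2)·‖w − w_global‖², so each gradient step gains an extra term mu·(w − w_global). The sketch below adapts the logistic-regression client from Section 06 under that assumption; `local_update_fedprox`, the `mu` values, and the synthetic data are illustrative choices, not a reference implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def local_update_fedprox(w_global, X, y, epochs, lr=0.1, mu=0.1):
    """Local gradient descent with a FedProx proximal term.

    With mu=0 this reduces to the plain FedAvg local update.
    """
    w = w_global.copy()
    for _ in range(epochs):
        grad = X.T @ (sigmoid(X @ w) - y) / len(y)   # logistic-loss gradient
        grad += mu * (w - w_global)                  # gradient of the proximal penalty
        w -= lr * grad
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))                        # made-up client data
y = (X @ np.array([1.5, -2.0, 0.8, 1.0]) > 0).astype(float)
w_global = np.zeros(4)

w_plain = local_update_fedprox(w_global, X, y, epochs=20, mu=0.0)
w_prox = local_update_fedprox(w_global, X, y, epochs=20, mu=1.0)
drift_plain = np.linalg.norm(w_plain - w_global)
drift_prox = np.linalg.norm(w_prox - w_global)

print(f"drift from global, mu=0.0: {drift_plain:.3f}")
print(f"drift from global, mu=1.0: {drift_prox:.3f}")
```

The proximal term limits how far each client wanders from the broadcast model, which is what tames conflicting updates on non-IID data; the server-side aggregation stays exactly the same.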


Section 11

When To Use It — And When Not To

Use it when:

Privacy-Critical Data
Health, finance, messages, biometrics: anything that legally or ethically cannot be centralised. FL lets you learn from it anyway.
GDPR, HIPAA, on-device
Data Too Big To Move
Billions of phone interactions or sensor streams. Cheaper to ship a small model out than to ship petabytes of raw data in.
edge, IoT, telemetry
Cross-Organisation Collaboration
Competitors or separate institutions who want a shared model but won't pool data. The neutral-coordinator pattern fits perfectly.
consortium learning

Avoid it when:

Data Is Already Centralised
If everything sits in one datacentre you control, FL adds communication overhead and complexity for zero privacy benefit. Just train centrally.
single-source datasets
Tiny / Unreliable Clients
If clients can't run training at all, or almost never connect, the round loop stalls. FL needs clients capable of meaningful local compute.
ultra-low-power devices
Extreme Non-IID With Few Rounds
Wildly divergent client data plus a tight communication budget can prevent convergence entirely. Sometimes a different paradigm is simply better.
pathological heterogeneity

Section 12

Golden Rules

🔌 Federated Learning — Non-Negotiable Rules
1
Raw data never leaves the client. The moment you transmit a record, it is no longer federated learning. Only model parameters or gradients cross the wire — ever.
2
Treat communication as your scarcest resource. Minimise rounds first, then payload size. Prefer more local computation per round over more frequent uploads.
3
Always weight aggregation by data size (nₖ/n). Plain unweighted averaging lets small, noisy clients distort the global model out of proportion.
4
Start with FedAvg, measure convergence, and only upgrade to FedProx / SCAFFOLD / adaptive server optimizers when non-IID data demonstrably hurts you.
5
Design for dropouts. Assume some selected clients will vanish mid-round. The server must aggregate whatever returns on time and move on, not wait forever.
6
Layer privacy explicitly. FL is a starting point, not a guarantee. Add secure aggregation and differential privacy when contributions could be re-identified from updates.
7
Validate on held-out clients, not held-out rows. Real generalisation in FL means performing well on devices and institutions the model has never trained on.
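Rule 5 can be sketched as a deadline-based aggregator: the server averages whatever arrived in time and skips the round if too few clients reported. The function name `aggregate_on_time`, the report tuple layout, and the deadline and threshold values are hypothetical choices for illustration.

```python
import numpy as np

def aggregate_on_time(reports, deadline_s, min_reports):
    """Weighted FedAvg over the updates that arrived before the deadline.

    reports: list of (weights, num_samples, latency_s), where latency_s is
    None for a client that dropped out. Returns None if too few clients
    reported, in which case the caller keeps the previous global model.
    """
    on_time = [(w, n) for w, n, t in reports
               if t is not None and t <= deadline_s]
    if len(on_time) < min_reports:
        return None
    total = sum(n for _, n in on_time)
    return sum((n / total) * w for w, n in on_time)

# Made-up round: two punctual clients, one dropout, one straggler.
reports = [
    (np.array([1.0, 1.0]), 100, 12.0),
    (np.array([3.0, 3.0]), 300, 45.0),
    (np.array([9.0, 9.0]), 200, None),    # dropped out mid-round
    (np.array([5.0, 5.0]), 100, 400.0),   # straggler, misses the deadline
]

print(aggregate_on_time(reports, deadline_s=60.0, min_reports=2))  # [2.5 2.5]
print(aggregate_on_time(reports, deadline_s=60.0, min_reports=3))  # None
```

Note that the data shares are recomputed over the clients that actually reported, so the weights still sum to one even when part of the cohort vanishes.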