The Story That Explains FedAvg
She hands each family the same starting recipe card. Each family cooks it many times at home, tweaking the salt, the timing, the heat — using ingredients they will never disclose. At the end of the week, no family sends back food or ingredients. They send back only their adjusted recipe card: "we used a bit more salt, a bit less time."
Now the chef must combine three different recipe cards into one. She does not treat them equally. The family that cooked the dish 500 times clearly learned more than the family that cooked it 50 times, so she lets the busy kitchen's card count ten times more in the blend. She averages the cards — weighted by how much each family cooked — into one improved recipe, and the cycle repeats.
That weighted blending of locally-trained models is exactly the Federated Averaging (FedAvg) algorithm, introduced by McMahan and colleagues at Google in 2017. It is the single most important algorithm in federated learning.
The One Big Idea Behind FedAvg
FedAvg rests on a surprisingly simple insight: instead of sending gradients back to the server after every single mini-batch (expensive, chatty), let each client run many steps of local training first, then send back only its final model weights. The server averages those weights — weighted by each client's data size — to form the next global model.
Compare this to its simpler cousin FedSGD, where clients do one gradient step and report immediately. FedAvg does E local epochs before reporting — slashing communication rounds, often by 10× to 100×, at the cost of clients drifting apart on non-identical data.
FedAvg Has Two Halves: Server & Client
The original paper describes FedAvg as two cooperating procedures. The server orchestrates rounds; each client does the local learning. The animation shows one full round flowing through both halves.
The server half is cheap bookkeeping; the client half does the real learning on data that never moves.
How FedAvg Works — Step by Step
A Worked Example — Averaging Real Weight Vectors
Let the model be a tiny 2-parameter vector w = [w₁, w₂]. Three clients join round t. Each starts from the same global wt = [0.50, 0.50], trains locally, and returns a different result along with how many samples it holds:
| Client | Samples nk | Returned wk = [w₁, w₂] | Share nk/n |
|---|---|---|---|
| Phone A | 600 | [0.90, 0.20] | 0.60 |
| Phone B | 300 | [0.40, 0.80] | 0.30 |
| Phone C | 100 | [0.10, 0.10] | 0.10 |
| Total | 1,000 | — | 1.00 |
The diagram below shows the same weighted blend visually: each client's bar contributes a slice proportional to its data share.
Bar height = the returned w₁ value; the global result (0.67) sits closest to the highest-data client, A.
FedAvg in Python — Server & Client, Step by Step
This mirrors the two procedures exactly: client_update does the local SGD, and fedavg_server runs the rounds and does the weighted average.
import numpy as np
# ═══ CLIENT SIDE — local SGD for E epochs, batch size B ══════
def client_update(w, data, lr=0.05, epochs=5, batch=32):
"""Train a COPY of the global model on this client's private data."""
w = w.copy() # never touch the global weights
X, y = data
n = len(y)
for _ in range(epochs): # E local epochs
idx = np.random.permutation(n)
for s in range(0, n, batch): # mini-batches of size B
b = idx[s:s + batch]
Xb, yb = X[b], y[b]
grad = Xb.T @ (Xb @ w - yb) / len(yb)
w = w - lr * grad # plain SGD step
return w, n # return weights + sample count
# ═══ SERVER SIDE — selection, broadcast, weighted average ════
def fedavg_server(clients, n_features, rounds=15, frac=0.6):
w_global = np.zeros(n_features) # S0: init global model
for t in range(rounds): # each t = one communication round
# S1: select a random fraction C of clients
m = max(1, int(frac * len(clients)))
selected = np.random.choice(len(clients), m, replace=False)
weights, sizes = [], []
for k in selected:
# S2 broadcast w_global → C1–C4 ClientUpdate
wk, nk = client_update(w_global, clients[k])
weights.append(wk) # S3: collect updates
sizes.append(nk)
# S4: WEIGHTED AVERAGE w = Σ (n_k / n) · w_k
n = sum(sizes)
w_global = sum((nk / n) * wk for wk, nk in zip(weights, sizes))
print(f"Round {t+1:2d} | clients={m} | ‖w‖={np.linalg.norm(w_global):.4f}")
return w_global
# ═══ Demo: 5 clients, each with its own private data ═════════
np.random.seed(0)
true_w = np.array([1.5, -2.0, 0.7])
clients = []
for _ in range(5):
size = np.random.randint(100, 400) # uneven data sizes → weighting matters
X = np.random.randn(size, 3)
y = X @ true_w + 0.05 * np.random.randn(size)
clients.append((X, y))
final = fedavg_server(clients, n_features=3)
print("Learned :", np.round(final, 3))
print("True :", true_w)
The global model recovers the true weights [1.5, -2.0, 0.7] in a
handful of rounds — and not a single client's X or y ever left
its device. Only weight vectors crossed the wire.
FedAvg vs FedSGD, and the Knobs That Define It
| Aspect | FedSGD | FedAvg |
|---|---|---|
| Local work per round | 1 gradient step | E epochs of mini-batch SGD |
| What is sent back | A gradient | Updated weights |
| Communication rounds | Many | Far fewer (often 10–100× fewer) |
| Local compute | Light | Heavy |
| Risk on non-IID data | Low | Higher (client drift) |
FedSGD is really just FedAvg with E = 1 and one full-batch step. The three knobs below turn one into the other and control everything in between:
| Knob | Symbol | Meaning | Push it up → |
|---|---|---|---|
| Client fraction | C | Share of clients sampled per round. | More representative average, more network traffic. |
| Local epochs | E | Passes over local data before reporting. | Fewer rounds, but more client drift on non-IID data. |
| Local batch size | B | Mini-batch size in local SGD. | Fewer, coarser updates per epoch. |
Golden Rules for FedAvg
In one line: FedAvg trains a shared model by having each client run several epochs of local SGD, then averaging their resulting weights in proportion to how much data each holds — achieving central-quality accuracy while every dataset stays private and never moves.