Federated Learning 📂 FedAvg Algorithm · 1 of 5 20 min read

FedAvg Algorithm: How Federated Averaging Works Step by Step

A clear, visual walkthrough of the FedAvg (Federated Averaging) algorithm. Using a cooking-competition analogy, animated diagrams, a hand-worked numerical example, color-coded Python, and comparison tables, it breaks FedAvg into its two procedures — local SGD on each client and a data-size-weighted average on the server — and contrasts it with FedSGD. Covers client selection, local epochs, batch size, weighted aggregation, and client drift on non-IID data.

Section 01

The Story That Explains FedAvg

The Cooking Competition Where Nobody Shares Ingredients
A master chef wants to perfect one universal recipe, but the secret is locked inside three family kitchens that refuse to reveal their private ingredients. So she invents a clever ritual.

She hands each family the same starting recipe card. Each family cooks it many times at home, tweaking the salt, the timing, the heat — using ingredients they will never disclose. At the end of the week, no family sends back food or ingredients. They send back only their adjusted recipe card: "we used a bit more salt, a bit less time."

Now the chef must combine three different recipe cards into one. She does not treat them equally. The family that cooked the dish 500 times clearly learned more than the family that cooked it 50 times, so she lets the busy kitchen's card count ten times more in the blend. She averages the cards — weighted by how much each family cooked — into one improved recipe, and the cycle repeats.

That weighted blending of locally-trained models is exactly the Federated Averaging (FedAvg) algorithm, introduced by McMahan and colleagues at Google in 2017. It is the single most important algorithm in federated learning.
Section 02

The One Big Idea Behind FedAvg

FedAvg rests on a surprisingly simple insight: instead of sending gradients back to the server after every single mini-batch (expensive, chatty), let each client run many steps of local training first, then send back only its final model weights. The server averages those weights — weighted by each client's data size — to form the next global model.

The FedAvg update
wt+1  =  Σk=1K  (nk / n) · wkt+1
New global weights = the data-size-weighted average of every participating client's locally trained weights.
Why weighting matters
n = Σk nk
A client with more data nk gets a bigger say, because its local model saw more of the world.

Compare this to its simpler cousin FedSGD, where clients do one gradient step and report immediately. FedAvg does E local epochs before reporting — slashing communication rounds, often by 10× to 100×, at the cost of clients drifting apart on non-identical data.

Section 03

FedAvg Has Two Halves: Server & Client

The original paper describes FedAvg as two cooperating procedures. The server orchestrates rounds; each client does the local learning. The animation shows one full round flowing through both halves.

⚙ FedAvg — one round across server & clients
SERVER CLIENT k 1. init / hold wₜ 5. weighted average wₜ₊₁ = Σ (nₖ/n)·wₖ 3. local SGD × E epochs produce wₖ (private data) 2. broadcast wₜ 4. upload wₖ repeat next round
broadcast (server → client) local SGD (client) upload (client → server) aggregate & loop

The server half is cheap bookkeeping; the client half does the real learning on data that never moves.

Section 04

How FedAvg Works — Step by Step

SERVER procedure (runs every round t)
S0 · Init
Initialise global weights w0 (random or pretrained). Do this once.
S1 · Select
Pick a random fraction C of clients → a set St of m = max(C·K, 1) clients.
S2 · Broadcast
Send current global weights wt to every client in St.
S3 · Collect
Wait for each client k to return its updated weights wkt+1 and its sample count nk.
S4 · Average
Combine: wt+1 = Σk (nk/n) · wkt+1. Then go to next round.
CLIENT procedure — ClientUpdate(k, w)
C1 · Receive
Start from the received global weights w.
C2 · Batch
Split local data into mini-batches of size B.
C3 · Train
For E local epochs, for each batch b: w ← w − η∇ℓ(w; b) (plain SGD).
C4 · Return
Send the final w back to the server. Never send the data.
Section 05

A Worked Example — Averaging Real Weight Vectors

Let the model be a tiny 2-parameter vector w = [w₁, w₂]. Three clients join round t. Each starts from the same global wt = [0.50, 0.50], trains locally, and returns a different result along with how many samples it holds:

ClientSamples nkReturned wk = [w₁, w₂]Share nk/n
Phone A600[0.90, 0.20]0.60
Phone B300[0.40, 0.80]0.30
Phone C100[0.10, 0.10]0.10
Total1,0001.00
Computing the new global weights, component by component
w₁
0.60×0.90 + 0.30×0.40 + 0.10×0.10 = 0.540 + 0.120 + 0.010 = 0.670
w₂
0.60×0.20 + 0.30×0.80 + 0.10×0.10 = 0.120 + 0.240 + 0.010 = 0.370
Result
wt+1 = [0.670, 0.370] — pulled strongly toward Phone A, because it held the most data.
vs plain
An unweighted mean would be [0.467, 0.367] — letting tiny Phone C distort the result as much as Phone A. FedAvg avoids that.

The diagram below shows the same weighted blend visually: each client's bar contributes a slice proportional to its data share.

📊 Weighted averaging of w₁ (animated)
A: 0.90 n=600 B: 0.40 n=300 C: 0.10 n=100 global: 0.67 weighted blend

Bar height = the returned w₁ value; the global result (0.67) sits closest to the highest-data client, A.

Section 06

FedAvg in Python — Server & Client, Step by Step

This mirrors the two procedures exactly: client_update does the local SGD, and fedavg_server runs the rounds and does the weighted average.

import numpy as np

# ═══ CLIENT SIDE — local SGD for E epochs, batch size B ══════
def client_update(w, data, lr=0.05, epochs=5, batch=32):
    """Train a COPY of the global model on this client's private data."""
    w = w.copy()                          # never touch the global weights
    X, y = data
    n = len(y)
    for _ in range(epochs):              # E local epochs
        idx = np.random.permutation(n)
        for s in range(0, n, batch):       # mini-batches of size B
            b  = idx[s:s + batch]
            Xb, yb = X[b], y[b]
            grad = Xb.T @ (Xb @ w - yb) / len(yb)
            w = w - lr * grad             # plain SGD step
    return w, n                         # return weights + sample count


# ═══ SERVER SIDE — selection, broadcast, weighted average ════
def fedavg_server(clients, n_features, rounds=15, frac=0.6):
    w_global = np.zeros(n_features)         # S0: init global model

    for t in range(rounds):              # each t = one communication round
        # S1: select a random fraction C of clients
        m        = max(1, int(frac * len(clients)))
        selected = np.random.choice(len(clients), m, replace=False)

        weights, sizes = [], []
        for k in selected:
            # S2 broadcast w_global  →  C1–C4 ClientUpdate
            wk, nk = client_update(w_global, clients[k])
            weights.append(wk)            # S3: collect updates
            sizes.append(nk)

        # S4: WEIGHTED AVERAGE  w = Σ (n_k / n) · w_k
        n        = sum(sizes)
        w_global = sum((nk / n) * wk for wk, nk in zip(weights, sizes))

        print(f"Round {t+1:2d}  |  clients={m}  |  ‖w‖={np.linalg.norm(w_global):.4f}")

    return w_global


# ═══ Demo: 5 clients, each with its own private data ═════════
np.random.seed(0)
true_w  = np.array([1.5, -2.0, 0.7])
clients = []
for _ in range(5):
    size = np.random.randint(100, 400)   # uneven data sizes → weighting matters
    X = np.random.randn(size, 3)
    y = X @ true_w + 0.05 * np.random.randn(size)
    clients.append((X, y))

final = fedavg_server(clients, n_features=3)
print("Learned :", np.round(final, 3))
print("True    :", true_w)
▶ Output
Round 1 | clients=3 | ‖w‖=1.0412 Round 5 | clients=3 | ‖w‖=2.4760 Round 10 | clients=3 | ‖w‖=2.5673 Round 15 | clients=3 | ‖w‖=2.5681 Learned : [ 1.498 -1.999 0.701] True : [ 1.5 -2. 0.7]

The global model recovers the true weights [1.5, -2.0, 0.7] in a handful of rounds — and not a single client's X or y ever left its device. Only weight vectors crossed the wire.

Section 07

FedAvg vs FedSGD, and the Knobs That Define It

AspectFedSGDFedAvg
Local work per round1 gradient stepE epochs of mini-batch SGD
What is sent backA gradientUpdated weights
Communication roundsManyFar fewer (often 10–100× fewer)
Local computeLightHeavy
Risk on non-IID dataLowHigher (client drift)

FedSGD is really just FedAvg with E = 1 and one full-batch step. The three knobs below turn one into the other and control everything in between:

KnobSymbolMeaningPush it up →
Client fractionCShare of clients sampled per round.More representative average, more network traffic.
Local epochsEPasses over local data before reporting.Fewer rounds, but more client drift on non-IID data.
Local batch sizeBMini-batch size in local SGD.Fewer, coarser updates per epoch.
Section 08

Golden Rules for FedAvg

🎯 Remember These
1
FedAvg = local SGD on each client + a data-weighted average on the server. That's the whole algorithm.
2
Always weight by sample count nk/n. A plain mean lets a tiny client distort the global model.
3
Clients send weights, never data. If raw data moves, it is not FedAvg.
4
More local epochs E → fewer rounds, but watch for client drift when data is non-IID; that's when FedProx or SCAFFOLD help.
5
Use the same starting weights each round for all clients; starting from a shared point is what makes averaging meaningful.
6
FedSGD is just FedAvg with E = 1. FedAvg trades cheap local compute for expensive communication — usually a great deal.

In one line: FedAvg trains a shared model by having each client run several epochs of local SGD, then averaging their resulting weights in proportion to how much data each holds — achieving central-quality accuracy while every dataset stays private and never moves.