FedAvg Algorithm: How Federated Averaging Works

Section 01

The Story That Explains FedAvg

📖 Real World Analogy

The Cooking Competition Where Nobody Shares Ingredients

A master chef wants to perfect one universal recipe, but the secret is locked inside three family kitchens that refuse to reveal their private ingredients. So she invents a clever ritual.

She hands each family the same starting recipe card. Each family cooks it many times at home, tweaking the salt, the timing, the heat — using ingredients they will never disclose. At the end of the week, no family sends back food or ingredients. They send back only their adjusted recipe card: "we used a bit more salt, a bit less time."

Now the chef must combine three different recipe cards into one. She does not treat them equally. The family that cooked the dish 500 times clearly learned more than the family that cooked it 50 times, so she lets the busy kitchen's card count ten times more in the blend. She averages the cards — weighted by how much each family cooked — into one improved recipe, and the cycle repeats.

That weighted blending of locally-trained models is exactly the Federated Averaging (FedAvg) algorithm, introduced by McMahan and colleagues at Google in 2017. It is the single most important algorithm in federated learning.

Section 02

The One Big Idea Behind FedAvg

FedAvg rests on a surprisingly simple insight: instead of sending gradients back to the server after every single mini-batch (expensive, chatty), let each client run many steps of local training first, then send back only its final model weights. The server averages those weights — weighted by each client's data size — to form the next global model.

The FedAvg update

w_t+1 = Σ_k=1^K (n_k / n) · w^k_t+1

New global weights = the data-size-weighted average of every participating client's locally trained weights.

Why weighting matters

n = Σ_k n_k

A client with more data n_k gets a bigger say, because its local model saw more of the world.

Compare this to its simpler cousin FedSGD, where clients do one gradient step and report immediately. FedAvg does E local epochs before reporting — slashing communication rounds, often by 10× to 100×, at the cost of clients drifting apart on non-identical data.

Section 03

FedAvg Has Two Halves: Server & Client

The original paper describes FedAvg as two cooperating procedures. The server orchestrates rounds; each client does the local learning. The animation shows one full round flowing through both halves.

⚙ FedAvg — one round across server & clients

broadcast (server → client) local SGD (client) upload (client → server) aggregate & loop

The server half is cheap bookkeeping; the client half does the real learning on data that never moves.

Section 04

How FedAvg Works — Step by Step

SERVER procedure (runs every round t)

S0 · Init

Initialise global weights w₀ (random or pretrained). Do this once.

S1 · Select

Pick a random fraction C of clients → a set S_t of m = max(C·K, 1) clients.

S2 · Broadcast

Send current global weights w_t to every client in S_t.

S3 · Collect

Wait for each client k to return its updated weights w^k_t+1 and its sample count n_k.

S4 · Average

Combine: w_t+1 = Σ_k (n_k/n) · w^k_t+1. Then go to next round.

CLIENT procedure — ClientUpdate(k, w)

C1 · Receive

Start from the received global weights w.

C2 · Batch

Split local data into mini-batches of size B.

C3 · Train

For E local epochs, for each batch b: w ← w − η∇ℓ(w; b) (plain SGD).

C4 · Return

Send the final w back to the server. Never send the data.

Section 05

A Worked Example — Averaging Real Weight Vectors

Let the model be a tiny 2-parameter vector w = [w₁, w₂]. Three clients join round t. Each starts from the same global w_t = [0.50, 0.50], trains locally, and returns a different result along with how many samples it holds:

Client	Samples n_k	Returned w^k = [w₁, w₂]	Share n_k/n
Phone A	600	[0.90, 0.20]	0.60
Phone B	300	[0.40, 0.80]	0.30
Phone C	100	[0.10, 0.10]	0.10
Total	1,000	—	1.00

Computing the new global weights, component by component

w₁

0.60×0.90 + 0.30×0.40 + 0.10×0.10 = 0.540 + 0.120 + 0.010 = 0.670

w₂

0.60×0.20 + 0.30×0.80 + 0.10×0.10 = 0.120 + 0.240 + 0.010 = 0.370

Result

w_t+1 = [0.670, 0.370] — pulled strongly toward Phone A, because it held the most data.

vs plain

An unweighted mean would be [0.467, 0.367] — letting tiny Phone C distort the result as much as Phone A. FedAvg avoids that.

The diagram below shows the same weighted blend visually: each client's bar contributes a slice proportional to its data share.

📊 Weighted averaging of w₁ (animated)

Bar height = the returned w₁ value; the global result (0.67) sits closest to the highest-data client, A.

Section 06

FedAvg in Python — Server & Client, Step by Step

This mirrors the two procedures exactly: client_update does the local SGD, and fedavg_server runs the rounds and does the weighted average.

import numpy as np

# ═══ CLIENT SIDE — local SGD for E epochs, batch size B ══════
def client_update(w, data, lr=0.05, epochs=5, batch=32):
    """Train a COPY of the global model on this client's private data."""
    w = w.copy()                          # never touch the global weights
    X, y = data
    n = len(y)
    for _ in range(epochs):              # E local epochs
        idx = np.random.permutation(n)
        for s in range(0, n, batch):       # mini-batches of size B
            b  = idx[s:s + batch]
            Xb, yb = X[b], y[b]
            grad = Xb.T @ (Xb @ w - yb) / len(yb)
            w = w - lr * grad             # plain SGD step
    return w, n                         # return weights + sample count


# ═══ SERVER SIDE — selection, broadcast, weighted average ════
def fedavg_server(clients, n_features, rounds=15, frac=0.6):
    w_global = np.zeros(n_features)         # S0: init global model

    for t in range(rounds):              # each t = one communication round
        # S1: select a random fraction C of clients
        m        = max(1, int(frac * len(clients)))
        selected = np.random.choice(len(clients), m, replace=False)

        weights, sizes = [], []
        for k in selected:
            # S2 broadcast w_global  →  C1–C4 ClientUpdate
            wk, nk = client_update(w_global, clients[k])
            weights.append(wk)            # S3: collect updates
            sizes.append(nk)

        # S4: WEIGHTED AVERAGE  w = Σ (n_k / n) · w_k
        n        = sum(sizes)
        w_global = sum((nk / n) * wk for wk, nk in zip(weights, sizes))

        print(f"Round {t+1:2d}  |  clients={m}  |  ‖w‖={np.linalg.norm(w_global):.4f}")

    return w_global


# ═══ Demo: 5 clients, each with its own private data ═════════
np.random.seed(0)
true_w  = np.array([1.5, -2.0, 0.7])
clients = []
for _ in range(5):
    size = np.random.randint(100, 400)   # uneven data sizes → weighting matters
    X = np.random.randn(size, 3)
    y = X @ true_w + 0.05 * np.random.randn(size)
    clients.append((X, y))

final = fedavg_server(clients, n_features=3)
print("Learned :", np.round(final, 3))
print("True    :", true_w)

▶ Output

The global model recovers the true weights [1.5, -2.0, 0.7] in a handful of rounds — and not a single client's X or y ever left its device. Only weight vectors crossed the wire.

Section 07

FedAvg vs FedSGD, and the Knobs That Define It

Aspect	FedSGD	FedAvg
Local work per round	1 gradient step	E epochs of mini-batch SGD
What is sent back	A gradient	Updated weights
Communication rounds	Many	Far fewer (often 10–100× fewer)
Local compute	Light	Heavy
Risk on non-IID data	Low	Higher (client drift)

FedSGD is really just FedAvg with E = 1 and one full-batch step. The three knobs below turn one into the other and control everything in between:

Knob	Symbol	Meaning	Push it up →
Client fraction	C	Share of clients sampled per round.	More representative average, more network traffic.
Local epochs	E	Passes over local data before reporting.	Fewer rounds, but more client drift on non-IID data.
Local batch size	B	Mini-batch size in local SGD.	Fewer, coarser updates per epoch.

Section 08

Golden Rules for FedAvg

🎯 Remember These

FedAvg = local SGD on each client + a data-weighted average on the server. That's the whole algorithm.

Always weight by sample count n_k/n. A plain mean lets a tiny client distort the global model.

Clients send weights, never data. If raw data moves, it is not FedAvg.

More local epochs E → fewer rounds, but watch for client drift when data is non-IID; that's when FedProx or SCAFFOLD help.

Use the same starting weights each round for all clients; starting from a shared point is what makes averaging meaningful.

FedSGD is just FedAvg with E = 1. FedAvg trades cheap local compute for expensive communication — usually a great deal.

In one line: FedAvg trains a shared model by having each client run several epochs of local SGD, then averaging their resulting weights in proportion to how much data each holds — achieving central-quality accuracy while every dataset stays private and never moves.