
Model Aggregation in Federated Learning: FedAvg Guide

A clear, example-driven guide to how a federated learning server merges client model updates into one global model. Covers the FedAvg formula, a fully worked numerical example, an animated round diagram, a comparison of aggregation strategies (FedSGD, FedProx, FedAvgM, secure and robust aggregation), Python implementation, and the key challenges — non-IID data, stragglers, and poisoning attacks.

Section 01

The Story That Explains Model Aggregation

The Three Hospitals That Never Shared a Single Patient File
Three hospitals — a huge city hospital, a mid-size clinic, and a tiny rural practice — want to build one shared AI model that predicts heart disease. But there is a hard wall: patient records are private and legally cannot leave the building. No raw data is allowed to travel.

So they make a clever deal. Each hospital trains the same model on its own private patients, behind its own firewall. Then — instead of sending patients — each sends only the lessons the model learned (its updated numbers, the weights) to a neutral coordinator.

The coordinator never sees a single patient. It does one job: it blends the three sets of lessons into one improved model and sends that combined model back to everyone. Crucially, the city hospital saw 600 patients and the rural practice saw 100 — so the coordinator gives the city hospital's lessons more weight. That blending step is exactly what we call model aggregation at the server.

Federated Learning lets many devices or organizations train one shared model without ever centralizing their raw data. The clients do the learning; the server does the merging. This tutorial zooms into that merge — the single most important operation on the server side.

Section 02

What Actually Happens on the Server

In one federated round, the server runs a tight loop: it broadcasts the current global model, each client trains locally on its private data, every client uploads only its model update, and the server aggregates those updates into the next global model. Watch the data flow below — notice that only model weights move, never raw data.

[Figure: One Federated Round. Updates travel up, the global model travels down. Client A (600 samples), Client B (300 samples), and Client C (100 samples) upload weight updates while private data stays put; the server aggregates them as $w_{t+1} = \sum_k (n_k / n)\, w_k$ and broadcasts the new global model.]
🔑 The Golden Rule of Aggregation
1. Raw data never leaves the client. Only model parameters (or gradients) are uploaded.
2. The server is a blender, not a learner: it combines updates and never sees any examples.
3. Clients with more data get more influence in the blend (weighted by sample count).
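
The round described at the top of this section can be sketched as a small simulation. This is a minimal illustration under stated assumptions, not a production server: the model is a two-parameter linear regressor, the data is synthetic, and `local_train` and `run_round` are hypothetical helper names chosen for this example.

```python
import numpy as np

def local_train(global_w, X, y, lr=0.1, epochs=5):
    """Hypothetical client step: a few epochs of full-batch gradient descent
    on a least-squares loss, starting from the broadcast global weights."""
    w = global_w.copy()
    for _ in range(epochs):
        grad = X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w, len(y)  # only weights and a sample count leave the client

def run_round(global_w, client_data):
    """One round: broadcast, local training, upload, weighted aggregation."""
    updates = [local_train(global_w, X, y) for X, y in client_data]
    total = sum(n for _, n in updates)
    return sum((n / total) * w for w, n in updates)

# Synthetic private datasets for three clients (600, 300, 100 samples).
rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])
client_data = []
for n in (600, 300, 100):
    X = rng.normal(size=(n, 2))
    y = X @ true_w + 0.01 * rng.normal(size=n)
    client_data.append((X, y))

global_w = np.zeros(2)
for _ in range(20):
    global_w = run_round(global_w, client_data)

print("Global weights after 20 rounds:", global_w)  # close to true_w
```

Note that `run_round` never touches `X` or `y` directly; it only sees what `local_train` returns, which mirrors the rule that raw data stays on the client.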
Section 03

FedAvg — The Aggregation Formula

The standard server-side aggregation is Federated Averaging (FedAvg), introduced by McMahan et al. It is a weighted average of every client's model, where each client's weight is the fraction of total data it holds. After clients train locally, the server computes one new global model for the next round.

FedAvg Aggregation
$$w_{t+1} = \sum_{k=1}^{K} \frac{n_k}{n}\, w_{t+1}^{k}$$
The new global model is the sum of each client model $w_{t+1}^{k}$ scaled by its data share $n_k / n$.
Data-Share Weight
$$\alpha_k = \frac{n_k}{n}, \qquad n = \sum_{k=1}^{K} n_k$$
$n_k$ = samples on client $k$; $n$ = total samples across all selected clients. The weights $\alpha_k$ always sum to 1.
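A plain average treats a client that trained on 5 examples the same as one that trained on 5,000, which lets a tiny, noisy client drag the global model around. Weighting by sample count makes each client's influence proportional to its data, so the global model approximates one trained on all the data pooled together, which is the whole point of federation. The match is exact for a single full-batch gradient step and only approximate once clients run several local epochs or hold non-IID data.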
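
The pooled-data intuition can be checked numerically in the one case where it holds exactly: a single full-batch gradient step. The sketch below assumes a least-squares loss and random synthetic data (both are illustrative choices, not part of the original guide) and compares the sample-weighted average of per-client gradients with the gradient computed on the pooled data.

```python
import numpy as np

rng = np.random.default_rng(1)
w = np.array([0.5, -0.3])                 # arbitrary current global weights
sizes = [600, 300, 100]
clients = [(rng.normal(size=(n, 2)), rng.normal(size=n)) for n in sizes]

def mean_grad(w, X, y):
    """Gradient of the loss 0.5 * mean((X @ w - y) ** 2)."""
    return X.T @ (X @ w - y) / len(y)

# Sample-weighted average of the per-client gradients (what FedSGD computes).
n_total = sum(sizes)
weighted = sum((len(y) / n_total) * mean_grad(w, X, y) for X, y in clients)

# Gradient on the pooled dataset, which the server never actually sees.
X_all = np.vstack([X for X, _ in clients])
y_all = np.concatenate([y for _, y in clients])
pooled = mean_grad(w, X_all, y_all)

print(np.allclose(weighted, pooled))  # True
```

The factor $n_k / n$ cancels each client's own $1 / n_k$, leaving the pooled $1 / n$. With several local epochs per round the client models drift apart and this equality becomes an approximation.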
Section 04

A Worked Numerical Example

Let's aggregate a single weight value from our three hospitals. Each trained the model locally and arrived at a different value for one parameter. (Real models have millions of these — the math is applied element-by-element.)

Inputs from the three clients
Client A: $w_A = 0.80$, $n_A = 600$
Client B: $w_B = 0.50$, $n_B = 300$
Client C: $w_C = 0.20$, $n_C = 100$
Total: $n = 600 + 300 + 100 = 1000$

Step-by-step FedAvg
Weights: $\alpha_A = 0.6$, $\alpha_B = 0.3$, $\alpha_C = 0.1$
A term: $0.6 \times 0.80 = 0.48$
B term: $0.3 \times 0.50 = 0.15$
C term: $0.1 \times 0.20 = 0.02$
Sum: $w_{\text{global}} = 0.48 + 0.15 + 0.02 = 0.65$

Compare that to a plain average: (0.80 + 0.50 + 0.20) / 3 = 0.50. FedAvg lands at 0.65 instead, pulled toward Client A's value. That is the intended behavior, because Client A backed its number with six times as much data as Client C. The 0.15 gap between the two results is the data weighting at work.
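
To illustrate the "element-by-element" remark, the same calculation can be run over a short weight vector in one call. In this sketch the first column holds the values from the worked example; the second and third columns are made-up parameters added purely for illustration.

```python
import numpy as np

# One row per client, one column per model parameter.
client_weights = np.array([
    [0.80, -1.2, 0.05],   # Client A
    [0.50, -0.9, 0.10],   # Client B
    [0.20, -0.4, 0.30],   # Client C
])
samples = np.array([600, 300, 100])

# Sample-weighted average down the client axis, applied to every column.
fedavg_w = np.average(client_weights, axis=0, weights=samples)

# Unweighted average for comparison.
plain_w = client_weights.mean(axis=0)

print("FedAvg:", fedavg_w)
print("Plain: ", plain_w)
```

The first entry of `fedavg_w` reproduces the 0.65 from the hand calculation, and every other parameter is blended with the same 0.6 / 0.3 / 0.1 shares.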

Section 05

Aggregation Strategies at a Glance

FedAvg is the default, but the server can blend updates in several ways depending on whether you care about data heterogeneity, privacy, or defending against malicious clients.

| Strategy | What the server combines | Core idea | Best when… |
| --- | --- | --- | --- |
| FedSGD | One gradient per client, per step | Server averages gradients every step; clients do 1 local step | You want exactness; communication is cheap |
| FedAvg (default) | Full local model weights after several epochs | Weighted average by sample count; far fewer rounds | Communication is expensive (the usual case) |
| FedProx | Local weights + a proximal penalty | Keeps local models from drifting too far from global | Clients have very different (non-IID) data |
| FedAvgM | Aggregated update + server momentum | Server keeps momentum across rounds for stability | Training is noisy or oscillating |
| Secure Agg. | Masked / encrypted weights | Server sees only the sum, never an individual update | Privacy of single updates must be guaranteed |
| Krum / Median | A robust statistic, not the mean | Discards outlier updates to resist poisoning | Some clients may be malicious or faulty |
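
Two rows of the table change only a line or two of arithmetic, which a short sketch can make concrete. The code below shows one common formulation of each idea, under assumptions: the function names `fedprox_grad` and `fedavgm_step`, the hyperparameter defaults, and the starting value 0.50 for the previous global weight are all illustrative choices, not taken from a specific library.

```python
import numpy as np

def fedprox_grad(loss_grad, w, global_w, mu=0.1):
    """Client side (FedProx): gradient of the local objective
    F_k(w) + (mu / 2) * ||w - global_w||^2.
    The extra term pulls the local model back toward the global one."""
    return loss_grad + mu * (w - global_w)

def fedavgm_step(global_w, avg_client_w, velocity, beta=0.9, server_lr=1.0):
    """Server side (FedAvgM): apply momentum to the round's pseudo-gradient."""
    delta = global_w - avg_client_w        # old global minus new FedAvg result
    velocity = beta * velocity + delta     # momentum carried across rounds
    return global_w - server_lr * velocity, velocity

global_w = np.array([0.50])       # assumed previous global weight
avg_client_w = np.array([0.65])   # FedAvg result from the worked example

# With beta = 0 and server_lr = 1, FedAvgM reduces to plain FedAvg.
plain_w, _ = fedavgm_step(global_w, avg_client_w, np.zeros(1), beta=0.0)
print("FedAvgM with beta=0:", plain_w)

# A local model that drifted to 0.90 gets an extra push back toward 0.50.
g = fedprox_grad(np.array([0.20]), np.array([0.90]), global_w, mu=0.1)
print("FedProx gradient:", g)
```

FedProx changes what each client optimizes, while FedAvgM changes how the server applies the averaged result; the weighted average itself is untouched in both.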
Section 06

Implementing Server Aggregation in Python

Here is FedAvg as the server would run it. Each client returns its trained weights plus how many samples it trained on. The server blends them into one global model.

import numpy as np

# Each client sends back: (model_weights, num_samples)
# Here we use the 3-hospital values from Section 04.
w_A, w_B, w_C = np.array([0.80]), np.array([0.50]), np.array([0.20])

client_updates = [
    (w_A, 600),
    (w_B, 300),
    (w_C, 100),
]

def federated_average(updates):
    """Weighted average of client weights by sample count (FedAvg)."""
    total_samples = sum(n for _, n in updates)

    # accumulator shaped exactly like the model weights
    global_w = np.zeros_like(updates[0][0])

    for weights, n in updates:
        alpha = n / total_samples      # this client's data share
        global_w += alpha * weights   # weighted contribution

    return global_w

new_global = federated_average(client_updates)
print("Aggregated global weight:", new_global)
▶ Output
Aggregated global weight: [0.65]

Real neural networks store weights as a state_dict — a dictionary of named tensors (one per layer). The server simply applies the same weighted average key-by-key across every layer:

def aggregate_state_dicts(client_states, client_sizes):
    """FedAvg over PyTorch-style state_dicts (layer by layer)."""
    total = sum(client_sizes)
    avg_state = {}

    # every client shares the same layer names (same architecture)
    for key in client_states[0].keys():
        avg_state[key] = sum(
            (client_sizes[i] / total) * client_states[i][key]
            for i in range(len(client_states))
        )

    return avg_state   # load this back into the global model
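
A usage sketch for layer-by-layer aggregation follows. It makes several assumptions: NumPy arrays stand in for tensors so the example runs without PyTorch, the layer names `fc.weight` and `fc.bias` are hypothetical, and `aggregate_checked` is a variant written for this example that adds a key-consistency check.

```python
import numpy as np

def aggregate_checked(client_states, client_sizes):
    """Sample-weighted average over dict-of-array states, with basic checks."""
    if len(client_states) != len(client_sizes):
        raise ValueError("need exactly one sample count per client state")
    expected = set(client_states[0])
    for state in client_states[1:]:
        if set(state) != expected:
            raise ValueError("clients disagree on layer names")

    total = sum(client_sizes)
    return {
        key: sum((n / total) * state[key]
                 for state, n in zip(client_states, client_sizes))
        for key in client_states[0]
    }

# Hypothetical two-layer states; values chosen to match the worked example.
states = [
    {"fc.weight": np.full((2, 2), 0.80), "fc.bias": np.array([1.0, 0.0])},
    {"fc.weight": np.full((2, 2), 0.50), "fc.bias": np.array([0.0, 1.0])},
    {"fc.weight": np.full((2, 2), 0.20), "fc.bias": np.array([0.0, 0.0])},
]
avg = aggregate_checked(states, [600, 300, 100])

print(avg["fc.weight"])  # every entry is approximately 0.65
print(avg["fc.bias"])    # approximately [0.6, 0.3]
```

Checking layer names up front turns an architecture mismatch into a clear error instead of a `KeyError` partway through the loop.
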

In practice the server selects only a fraction of clients each round (say 10 out of millions of phones), waits for updates with a timeout to handle stragglers, and may apply secure aggregation so that it learns only the sum of updates, never any single client's contribution. The averaging math above stays the same; only which updates reach it changes.
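
The "server sees only the sum" idea can be illustrated with a toy pairwise-masking scheme. This is a sketch of the principle only and is not cryptographically secure: real protocols derive the masks from key agreement between clients and handle dropouts, whereas here a shared random generator stands in for that machinery and the update values are made up.

```python
import numpy as np

rng = np.random.default_rng(42)

# Made-up client updates (in a weighted scheme these would be pre-scaled by n_k).
updates = [
    np.array([0.80, 1.0]),
    np.array([0.50, 2.0]),
    np.array([0.20, 3.0]),
]
num_clients = len(updates)

# Each pair (i, j) agrees on a random mask: client i adds it, client j subtracts it.
masked = [u.copy() for u in updates]
for i in range(num_clients):
    for j in range(i + 1, num_clients):
        mask = rng.normal(size=updates[0].shape)
        masked[i] += mask
        masked[j] -= mask

# The server receives only the masked vectors. Each one looks like noise,
# but the masks cancel in the sum.
server_sum = sum(masked)
print(np.allclose(server_sum, sum(updates)))  # True
```

Because every mask is added once and subtracted once, the total is unchanged while each individual upload is obscured.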
Section 07

Challenges the Server Must Handle

| Challenge | Why it hurts aggregation | Common fix |
| --- | --- | --- |
| Non-IID data | Clients hold very different distributions, so their models point in conflicting directions | FedProx, FedAvgM, or personalization layers |
| Stragglers | Slow or offline clients delay the round | Client sampling + timeouts; aggregate whoever returns |
| Communication cost | Sending full models every round is expensive | More local epochs (FedAvg), gradient compression |
| Poisoning attacks | A malicious client sends a corrupt update to skew the mean | Robust aggregation (Krum, trimmed mean, median) |
| Update leakage | Raw updates can leak information about private data | Secure aggregation + differential privacy noise |
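
The poisoning row is easy to demonstrate with numbers. In the sketch below, four honest clients report values near 0.80 and one hypothetical attacker reports 100.0; all values are made up for illustration, and `trimmed_mean` is a small helper written for this example.

```python
import numpy as np

honest = np.array([[0.78], [0.81], [0.80], [0.79]])
poisoned = np.array([[100.0]])            # a single malicious update
updates = np.vstack([honest, poisoned])   # shape: (clients, parameters)

def trimmed_mean(updates, trim=1):
    """Coordinate-wise mean after dropping the `trim` smallest and
    `trim` largest values of each parameter."""
    ordered = np.sort(updates, axis=0)
    return ordered[trim:len(ordered) - trim].mean(axis=0)

mean_w = updates.mean(axis=0)             # dragged far away by the attacker
median_w = np.median(updates, axis=0)     # coordinate-wise median
trimmed_w = trimmed_mean(updates, trim=1)

print("Mean:        ", mean_w)
print("Median:      ", median_w)
print("Trimmed mean:", trimmed_w)
```

One outlier moves the plain mean above 20, while the median and the trimmed mean both stay at about 0.80. The trade-off is that robust statistics also discard some honest information, which matters when clients legitimately differ.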
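🎯 Key Takeaways
1. The server aggregates model updates, not data. Keeping raw data on the clients is the basis of federated learning's privacy story, though updates can still leak information without secure aggregation or differential privacy.
2. FedAvg is the workhorse: a sample-weighted average of client models, $w_{t+1} = \sum_{k} (n_k / n)\, w_{t+1}^{k}$.
3. Weighting by data volume keeps the global model close to one trained on a pooled dataset the server is never allowed to see.
4. Move beyond plain FedAvg when conditions demand it: FedProx or FedAvgM for non-IID data, robust aggregation against attackers, and secure aggregation for strict privacy needs.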