Model Aggregation in Federated Learning: FedAvg Guide

Section 01

The Story That Explains Model Aggregation

📖 Real World Analogy

The Three Hospitals That Never Shared a Single Patient File

Three hospitals — a huge city hospital, a mid-size clinic, and a tiny rural practice — want to build one shared AI model that predicts heart disease. But there is a hard wall: patient records are private and legally cannot leave the building. No raw data is allowed to travel.

So they make a clever deal. Each hospital trains the same model on its own private patients, behind its own firewall. Then — instead of sending patients — each sends only the lessons the model learned (its updated numbers, the weights) to a neutral coordinator.

The coordinator never sees a single patient. It does one job: it blends the three sets of lessons into one improved model and sends that combined model back to everyone. Crucially, the city hospital saw 600 patients and the rural practice saw 100 — so the coordinator gives the city hospital's lessons more weight. That blending step is exactly what we call model aggregation at the server.

Federated Learning lets many devices or organizations train one shared model without ever centralizing their raw data. The clients do the learning; the server does the merging. This tutorial zooms into that merge — the single most important operation on the server side.

Section 02

What Actually Happens on the Server

In one federated round, the server runs a tight loop: it broadcasts the current global model, each client trains locally on its private data, every client uploads only its model update, and the server aggregates those updates into the next global model. Only model weights move, never raw data.

[Figure: One Federated Round. Updates travel up, the global model travels down.]

🔑 The Golden Rule of Aggregation

Raw data never leaves the client. Only model parameters (or gradients) are uploaded.

The server is a blender, not a learner: it combines updates and never sees any training examples.

Clients with more data get more influence in the blend (weighted by sample count).
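
The round described in this section can be sketched in a few lines. This is a minimal illustration rather than a production loop: `local_train`, `run_round`, the one-parameter linear model, and the synthetic datasets are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def local_train(global_w, X, y, lr=0.1, epochs=5):
    """Hypothetical client step: a few gradient-descent passes on least squares."""
    w = global_w.copy()                      # start from the broadcast model
    for _ in range(epochs):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w, len(y)                         # upload weights and sample count only

def run_round(global_w, clients):
    """Broadcast, train locally, collect (weights, n), aggregate by sample count."""
    updates = [local_train(global_w, X, y) for X, y in clients]
    total = sum(n for _, n in updates)
    return sum((n / total) * w for w, n in updates)

# Toy private datasets of different sizes, all generated from y = 3x + noise.
true_w = np.array([3.0])
clients = []
for n in (600, 300, 100):
    X = rng.normal(size=(n, 1))
    y = X @ true_w + 0.1 * rng.normal(size=n)
    clients.append((X, y))

global_w = np.zeros(1)
for _ in range(20):                          # 20 federated rounds
    global_w = run_round(global_w, clients)

print("Global weight after 20 rounds:", global_w)
```

The server-side function never touches `X` or `y` directly; it only sees what `local_train` returns, which mirrors the golden rule above.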

Section 03

FedAvg — The Aggregation Formula

The standard server-side aggregation is Federated Averaging (FedAvg), introduced by McMahan et al. It is a weighted average of every client's model, where each client's weight is the fraction of total data it holds. After clients train locally, the server computes one new global model for the next round.

FedAvg Aggregation

$w_{t+1} = \sum_{k=1}^{K} \frac{n_k}{n}\, w_{t+1}^{k}$

The new global model $w_{t+1}$ is the sum of each client model $w_{t+1}^{k}$ scaled by its data share $n_k / n$.

Data-Share Weight

$\alpha_k = \frac{n_k}{n}, \qquad n = \sum_{k=1}^{K} n_k$

n_k = samples on client k; n = total samples across all selected clients. The weights α_k always sum to 1.

💡 Why weighted, not a plain average?

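A plain average treats a client that trained on 5 examples the same as one that trained on 5,000, which lets a tiny, noisy client drag the global model around. Weighting by sample count fixes that. With a single local gradient step, the weighted average equals one gradient step on all the data pooled together; with several local epochs it is an approximation that works best when client data are similar. Approximating pooled training is the whole point of federation.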
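
A small check can make the pooled-data intuition concrete. The sketch below assumes the simplest possible "model", the mean of each client's data, for which the sample-weighted average matches the pooled result exactly. The client sizes and the random data are made up for illustration, and neural networks trained for several local epochs only approximate this behavior.

```python
import numpy as np

rng = np.random.default_rng(42)

# Three clients with very different amounts of data (sizes are illustrative).
client_data = [rng.normal(loc=1.0, size=n) for n in (5000, 50, 5)]

# Each client's "model" is simply the mean of its own private data.
client_means = np.array([d.mean() for d in client_data])
client_sizes = np.array([len(d) for d in client_data])

alphas = client_sizes / client_sizes.sum()        # data-share weights
weighted = np.sum(alphas * client_means)          # FedAvg-style blend
plain = client_means.mean()                       # every client counts equally
pooled = np.concatenate(client_data).mean()       # what centralizing would give

print("weights sum to 1:", np.isclose(alphas.sum(), 1.0))
print("weighted matches pooled:", np.isclose(weighted, pooled))
print("plain average error:", abs(plain - pooled))
```

The plain average lets the 5-sample client count as much as the 5,000-sample one, so its error against the pooled mean is typically far larger.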

Section 04

A Worked Numerical Example

Let's aggregate a single weight value from our three hospitals. Each trained the model locally and arrived at a different value for one parameter. (Real models have millions of these — the math is applied element-by-element.)

Inputs from the three clients

Client A: w_A = 0.80, n_A = 600

Client B: w_B = 0.50, n_B = 300

Client C: w_C = 0.20, n_C = 100

Total: n = 600 + 300 + 100 = 1000

Step-by-step FedAvg

Weights: α_A = 0.6, α_B = 0.3, α_C = 0.1

A term: 0.6 × 0.80 = 0.48

B term: 0.3 × 0.50 = 0.15

C term: 0.1 × 0.20 = 0.02

Sum: w_global = 0.48 + 0.15 + 0.02 = 0.65

Compare that to a plain average: (0.80 + 0.50 + 0.20) / 3 = 0.50. FedAvg lands at 0.65 instead, pulled toward Client A's value. That is correct, because Client A backed its number with six times as much data as Client C. The 0.15 difference is the data-weighting at work.
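
Since real models have many parameters, the same arithmetic is applied to every element at once. The sketch below is one way to do that with NumPy's `np.average`. The first column reuses the values from this section; the other two columns are invented purely to show the element-wise behavior.

```python
import numpy as np

# Rows are clients A, B, C; columns are three parameters of the same model.
# Column 0 holds the values from the worked example; columns 1-2 are made up.
client_weights = np.array([
    [0.80, -1.0, 2.0],   # Client A
    [0.50,  0.0, 4.0],   # Client B
    [0.20,  1.0, 6.0],   # Client C
])
sample_counts = np.array([600, 300, 100])

# np.average normalizes the weights by their sum, so raw counts can be passed.
fedavg = np.average(client_weights, axis=0, weights=sample_counts)
plain = client_weights.mean(axis=0)

print("FedAvg:       ", fedavg)   # first entry is the 0.65 computed by hand
print("Plain average:", plain)    # first entry is the 0.50 computed by hand
```

Each column is averaged independently, so the hand calculation above is exactly what happens to every one of a model's parameters.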

Section 05

Aggregation Strategies at a Glance

FedAvg is the default, but the server can blend updates in several ways depending on whether you care about data heterogeneity, privacy, or defending against malicious clients.

Strategy	What the server combines	Core idea	Best when…
FedSGD	One gradient per client, per step	Server averages gradients every step; clients do 1 local step	You want each round to match one gradient step on the pooled data and communication is cheap
FedAvg (default)	Full local model weights after several epochs	Weighted average by sample count; far fewer rounds	Communication is expensive (the usual case)
FedProx	Local weights + a proximal penalty	Keeps local models from drifting too far from global	Clients have very different (non-IID) data
FedAvgM	Aggregated update + server momentum	Server keeps momentum across rounds for stability	Training is noisy or oscillating
Secure Aggregation	Masked / encrypted weights	Server sees only the sum, never an individual update	Privacy of single updates must be guaranteed
Krum / Median	A robust statistic, not the mean	Ignores or discards outlier updates to resist poisoning	Some clients may be malicious or faulty
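
To show how a strategy from the table changes the server step, here is a sketch of server-side momentum in the spirit of FedAvgM. It follows one common formulation, in which the gap between the current global model and the FedAvg result is treated as a pseudo-gradient. The class name `MomentumServer` and the parameters `beta` and `server_lr` are assumptions for this example, not a reference implementation.

```python
import numpy as np

class MomentumServer:
    """Hypothetical server that applies momentum to the FedAvg update."""

    def __init__(self, init_w, beta=0.9, server_lr=1.0):
        self.w = np.asarray(init_w, dtype=float).copy()
        self.beta = beta
        self.server_lr = server_lr
        self.velocity = np.zeros_like(self.w)

    def step(self, client_weights, sample_counts):
        # Plain FedAvg result for this round.
        avg = np.average(client_weights, axis=0, weights=sample_counts)
        # Treat (current global - average) as a pseudo-gradient.
        pseudo_grad = self.w - avg
        self.velocity = self.beta * self.velocity + pseudo_grad
        self.w = self.w - self.server_lr * self.velocity
        return self.w

round_weights = [[0.80], [0.50], [0.20]]
round_counts = [600, 300, 100]

# With beta = 0 and server_lr = 1 the update collapses to plain FedAvg.
plain_server = MomentumServer([0.0], beta=0.0)
fedavg_result = plain_server.step(round_weights, round_counts).copy()

# With beta = 0.9 the second round keeps moving in the previous direction.
momentum_server = MomentumServer([0.0], beta=0.9)
first = momentum_server.step(round_weights, round_counts).copy()
second = momentum_server.step(round_weights, round_counts).copy()
print(fedavg_result, first, second)
```

Feeding identical client weights twice is artificial, but it makes the carried-over velocity visible: the second step moves past the FedAvg value because momentum from the first round is still applied.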

Section 06

Implementing Server Aggregation in Python

Here is FedAvg as the server would run it. Each client returns its trained weights plus how many samples it trained on. The server blends them into one global model.

import numpy as np

# Each client sends back: (model_weights, num_samples)
# Here we use the 3-hospital values from Section 04.
w_A, w_B, w_C = np.array([0.80]), np.array([0.50]), np.array([0.20])

client_updates = [
    (w_A, 600),
    (w_B, 300),
    (w_C, 100),
]

def federated_average(updates):
    """Weighted average of client weights by sample count (FedAvg)."""
    total_samples = sum(n for _, n in updates)

    # accumulator shaped exactly like the model weights
    global_w = np.zeros_like(updates[0][0])

    for weights, n in updates:
        alpha = n / total_samples      # this client's data share
        global_w += alpha * weights   # weighted contribution

    return global_w

new_global = federated_average(client_updates)
print("Aggregated global weight:", new_global)

▶ Output

Aggregated global weight: [0.65]

Real neural networks have many weight tensors. In PyTorch they are stored in a state_dict, a dictionary that maps each parameter or buffer name to its tensor. The server simply applies the same weighted average key-by-key across every layer:

def aggregate_state_dicts(client_states, client_sizes):
    """FedAvg over PyTorch-style state_dicts (layer by layer)."""
    total = sum(client_sizes)
    avg_state = {}

    # every client shares the same layer names (same architecture)
    for key in client_states[0].keys():
        avg_state[key] = sum(
            (client_sizes[i] / total) * client_states[i][key]
            for i in range(len(client_states))
        )

    return avg_state   # load this back into the global model

⚙️ Engineering note

In practice the server selects only a fraction of clients each round (say 10 out of millions of phones), waits for updates with a timeout to handle stragglers, and may apply secure aggregation so that it learns only the sum of updates, never any single client's contribution. The averaging math above stays the same, with n counted over the clients that actually report; only which updates reach the server changes.
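
The idea that the server sees only the sum can be illustrated with a toy pairwise-masking scheme. This is a sketch under strong assumptions: the masks come from one shared random generator, sample counts are sent in the clear, and no client drops out. Real secure aggregation protocols add key exchange and dropout recovery, both omitted here. All names (`clients`, `pair_masks`, `masked_upload`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(7)

# (weights, num_samples) per client; the first coordinate reuses Section 04.
clients = {
    "A": (np.array([0.80, 1.0]), 600),
    "B": (np.array([0.50, 2.0]), 300),
    "C": (np.array([0.20, 3.0]), 100),
}
names = sorted(clients)

# Each pair (i, j) with i < j agrees on a random mask: i adds it, j subtracts it.
pair_masks = {
    (i, j): rng.normal(scale=1e3, size=2)
    for idx, i in enumerate(names)
    for j in names[idx + 1:]
}

def masked_upload(name):
    """What client `name` sends: n_k * w_k hidden behind its pairwise masks."""
    weights, n = clients[name]
    payload = n * weights
    for (i, j), mask in pair_masks.items():
        if name == i:
            payload = payload + mask
        elif name == j:
            payload = payload - mask
    return payload

uploads = [masked_upload(name) for name in names]
total_samples = sum(n for _, n in clients.values())  # assumed sent in the clear

# The masks cancel in the sum, so the server recovers the FedAvg result
# without ever holding an unmasked individual update.
global_w = sum(uploads) / total_samples
print("Aggregated global weights:", global_w)
```

Because each client pre-multiplies its weights by its own sample count, the server needs only one sum and one division to obtain the same sample-weighted average as before.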

Section 07

Challenges the Server Must Handle

Challenge	Why it hurts aggregation	Common fix
Non-IID data	Clients hold very different distributions, so their models point in conflicting directions	FedProx, FedAvgM, or personalization layers
Stragglers	Slow or offline clients delay the round	Client sampling + timeouts; aggregate whoever returns
Communication cost	Sending full models every round is expensive	More local epochs (FedAvg), gradient compression
Poisoning attacks	A malicious client sends a corrupt update to skew the mean	Robust aggregation (Krum, trimmed mean, median)
Update leakage	Raw updates can leak information about private data	Secure aggregation + differential privacy noise
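
The poisoning row is easy to demonstrate numerically. The sketch below compares the mean with a coordinate-wise median when one client sends an extreme update. The update values are invented for illustration, and this unweighted median ignores sample counts, which is a simplification.

```python
import numpy as np

# Four honest updates for a two-parameter model (values are illustrative).
honest = np.array([
    [0.80, 1.0],
    [0.50, 1.2],
    [0.20, 0.9],
    [0.55, 1.1],
])
# One malicious client sends a wildly wrong update.
poisoned = np.array([[100.0, -100.0]])
all_updates = np.vstack([honest, poisoned])

mean_agg = all_updates.mean(axis=0)            # dragged far off by the attacker
median_agg = np.median(all_updates, axis=0)    # coordinate-wise median

print("Mean:  ", mean_agg)
print("Median:", median_agg)
```

A single outlier moves the mean by roughly a fifth of its own magnitude here, while the median stays within the range of the honest updates.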

🎯 Key Takeaways

The server aggregates model updates, not data — that is what makes federated learning privacy-preserving.

FedAvg is the workhorse: a sample-weighted average of client models, $w_{t+1} = \sum_{k=1}^{K} \frac{n_k}{n}\, w_{t+1}^{k}$.

Weighting by data volume keeps the global model faithful to a pooled dataset it is never allowed to see.

Swap or extend the averaging step when conditions demand it: FedProx or FedAvgM for non-IID data, robust aggregation (Krum, median) against attackers, and secure aggregation for strict privacy needs.