The Blueprint Before the Bricks
A central architect sends the blueprint to each site. Workers build their floor locally. The architect reviews progress reports (not the buildings themselves), refines the blueprint, and sends updates. No materials ever cross an ocean — only knowledge does.
This is FL System Architecture. The "blueprint" is the global model. The "sites" are clients. The "architect" is the aggregation server. The "progress reports" are gradient updates. And the client–server topology is the engineering scaffold that makes it all work reliably at scale.
Before writing a single line of FL code, every practitioner must understand the system topology — how components connect, communicate, and co-ordinate. Getting this wrong means models that diverge, clients that stall, and systems that collapse the moment a single node drops off the network.
We will dissect the client–server topology that underpins every federated system:
the roles each node plays, how data and model weights flow between them, the communication
protocol stack, synchronous vs asynchronous operation modes, hierarchical and peer-to-peer
extensions, and how to implement a production-ready FL server with Flower (flwr).
The Core Client–Server Topology
The canonical FL topology is a star network: one central server (or a small cluster) at the hub, and N clients at the spokes. This is not the only topology (we will cover hierarchical and P2P variants), but it is where every FL practitioner starts.
Star topology: one central server broadcasts the global model; each client trains locally and returns only weight updates. Raw data never crosses the dashed lines.
📋 What Each Node Does
Δw = w_local − w_global.
Uploads only Δw (optionally compressed + noise-masked).
May decline participation if battery / bandwidth is low.
The Communication Protocol Stack
Data flowing between server and clients doesn't travel as raw Python objects. A full protocol stack handles serialisation, compression, encryption, and transport. Understanding each layer prevents the most common production failures.
Every gradient update passes down all five layers before transmission and back up on receipt. Compression alone can reduce bandwidth by 100–1000×.
| Layer | Technology Used | What It Does | Bandwidth Impact |
|---|---|---|---|
| FL Application | Flower flwr, TensorFlow Federated | Defines training logic, strategy, client/server interface | N/A |
| Serialisation | protobuf, MessagePack, NumPy bytes | Converts tensors to byte streams for transmission | Baseline |
| Compression | Top-k, quantisation, random mask | Reduces gradient size before encryption | 10–1000× reduction |
| Security | TLS 1.3, SecAgg, DP noise | Prevents interception; hides individual updates | +5–15% overhead |
| Transport | gRPC / HTTP2 | Reliable delivery with streaming and multiplexing | Lowest latency option |
FL systems use gRPC (not REST) for three reasons. First, gRPC uses HTTP/2 which supports bidirectional streaming — the server can push the new model to a client while that client is still uploading its gradients from the previous round. Second, Protocol Buffers are 3–10× more compact than JSON. Third, gRPC has built-in deadline/retry semantics that handle the unreliable nature of edge client connections gracefully. Flower uses gRPC by default; TensorFlow Federated offers both gRPC and REST.
Synchronous vs. Asynchronous Operation
One of the most consequential architectural decisions is whether the server waits for all selected clients before aggregating, or whether it aggregates whenever updates arrive. This is the sync vs. async trade-off.
Now imagine a relay race where each runner passes the baton as soon as they finish their leg, without waiting for anyone else. The team's collective speed improves continuously. A tired runner who drops the baton doesn't stop the race. This is asynchronous FL: the global model updates the moment any client returns — but the model may drift if fast clients dominate all the updates.
Sync FL: all clients must finish before aggregation. Async FL: server aggregates each update immediately — faster rounds, but risk of gradient staleness.
| How it works |
| Server selects K clients, broadcasts model |
| Waits until min(K, threshold) clients respond |
| Aggregates all received updates at once |
| Broadcasts updated model for next round |
| Best for: Cross-silo (hospitals, banks) where clients are reliable servers, not mobile devices |
| Risk: One slow client delays every other client |
| How it works |
| Server continuously accepts incoming updates |
| Aggregates each update with momentum into global model |
| Fast clients train on a newer model version |
| Slow clients may upload stale gradients (staleness τ) |
| Best for: Cross-device (billions of phones) where client availability is unpredictable |
| Risk: Gradient staleness degrades convergence if τ is large |
The Full Round Lifecycle — Step by Step
A single FL communication round is more complex than it first appears. Here is every state transition from round start to round end, with the exact data flowing at each step.
Beyond Star: Three Topology Variants
The standard client–server star works well for up to ~10,000 clients. Above that, or in settings with geographic constraints, three extended topologies are used in production.
Left: standard star for small-to-medium deployments. Right: hierarchical two-tier for geographic scale — regional edge servers aggregate locally before reporting to the global server.
Client Selection Strategies
Which clients train in each round is one of the most impactful decisions in FL system design. Random selection is the baseline — but it ignores data quality, connectivity, and model bias.
Random selection (left) treats all eligible clients equally. Power-of-choice (right) biases selection toward clients with higher local loss — these clients have the most to learn and speed up convergence.
| Strategy | Selection Criterion | Convergence Speed | Fairness | Best For |
|---|---|---|---|---|
| Uniform Random | Equal probability for all eligible clients | Baseline | High | Default; most deployments |
| Power-of-Choice | Sample d candidates; pick top-k by local loss | 1.5–3× faster | Medium | Non-IID data; slow convergence |
| Deadline-Aware | Predict training time; select likely finishers | Fewer stragglers | Low (fast clients favoured) | Mobile cross-device FL |
| Importance Weighted | Weight by data quality / label diversity score | Best final accuracy | Medium | Medical imaging; rare class data |
| Oort (Microsoft) | Utility = data utility × system utility | SOTA in heterogeneous nets | High (enforced fairness) | Production cross-device systems |
Implementing FL Architecture with Flower
Flower (flwr) is the most widely used FL framework, designed to be framework-agnostic (works with PyTorch, TensorFlow, JAX, scikit-learn). It implements the full client–server topology we've described, using gRPC for communication.
flwr.server.Server = Aggregation Server | flwr.server.Strategy = Aggregation Algorithm (FedAvg, FedProx, etc.) | flwr.client.Client = Client Node | flwr.server.start_server() = Coordinator entrypoint | flwr.client.start_client() = Client registration + round participation
💻 Complete Flower Server Implementation
# server.py — FL Aggregation Server with Custom FedAvg Strategy
import flwr as fl
from flwr.common import Metrics
from typing import List, Tuple, Optional, Dict
import numpy as np
# ── Custom weighted FedAvg strategy ──────────────────────
class WeightedFedAvg(fl.server.strategy.FedAvg):
"""FedAvg + server-side evaluation logging."""
def aggregate_fit(
self,
server_round: int,
results: List,
failures: List,
) -> Tuple[Optional[fl.common.Parameters], Dict]:
# Log participation stats each round
total = len(results) + len(failures)
success = len(results)
fail_rate = len(failures) / total if total > 0 else 0
print(f"[Round {server_round}] Clients: {success}/{total} "
f"| Failure rate: {fail_rate:.1%}")
# Delegate aggregation to parent FedAvg
aggregated_params, metrics = super().aggregate_fit(
server_round, results, failures
)
return aggregated_params, metrics
def aggregate_evaluate(
self,
server_round: int,
results: List,
failures: List,
) -> Tuple[Optional[float], Dict]:
# Weighted average of client-reported losses
if not results:
return None, {}
total_samples = sum([num for num, _ in results])
weighted_losses = sum([num * loss
for num, loss in results]) / total_samples
print(f"[Round {server_round}] Global loss: {weighted_losses:.4f}")
return weighted_losses, {}
# ── Server configuration ──────────────────────────────────
strategy = WeightedFedAvg(
fraction_fit=0.1, # 10% of clients per round
fraction_evaluate=0.05, # 5% of clients for eval
min_fit_clients=10, # minimum to start a round
min_evaluate_clients=5, # minimum for evaluation
min_available_clients=50, # wait until 50 clients connect
)
# ── Start the server ──────────────────────────────────────
if __name__ == "__main__":
fl.server.start_server(
server_address="0.0.0.0:8080", # gRPC endpoint
config=fl.server.ServerConfig(num_rounds=20),
strategy=strategy,
)
💻 Complete Flower Client Implementation
# client.py — FL Client Node (PyTorch backend)
import flwr as fl
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from collections import OrderedDict
from typing import List, Dict, Tuple
import numpy as np
# ── Model definition ──────────────────────────────────────
class Net(nn.Module):
def __init__(self):
super().__init__()
self.conv = nn.Sequential(
nn.Conv2d(1, 32, 3, padding=1),
nn.ReLU(),
nn.MaxPool2d(2),
nn.Conv2d(32, 64, 3, padding=1),
nn.ReLU(),
nn.AdaptiveAvgPool2d((4, 4)),
)
self.fc = nn.Linear(64 * 4 * 4, 10)
def forward(self, x):
return self.fc(self.conv(x).flatten(1))
# ── Flower client class ───────────────────────────────────
class FLClient(fl.client.NumPyClient):
def __init__(self, model, trainloader, valloader, client_id):
self.model = model
self.trainloader = trainloader
self.valloader = valloader
self.client_id = client_id
self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
self.model.to(self.device)
def get_parameters(self, config) -> List[np.ndarray]:
# Extract model weights as NumPy arrays → Flower serialises to protobuf
return [val.cpu().numpy() for _, val
in self.model.state_dict().items()]
def set_parameters(self, parameters: List[np.ndarray]):
# Load server weights into local model
params_dict = zip(self.model.state_dict().keys(), parameters)
state_dict = OrderedDict(
{k: torch.tensor(v) for k, v in params_dict}
)
self.model.load_state_dict(state_dict, strict=True)
def fit(self, parameters, config) -> Tuple[List, int, Dict]:
# Step 1: load global weights from server
self.set_parameters(parameters)
# Step 2: read hyperparams from server config
lr = config.get("learning_rate", 0.01)
epochs = config.get("local_epochs", 3)
batch_size = config.get("batch_size", 32)
# Step 3: local training
optimizer = torch.optim.SGD(self.model.parameters(),
lr=lr, momentum=0.9)
criterion = nn.CrossEntropyLoss()
self.model.train()
for _ in range(epochs):
for images, labels in self.trainloader:
images, labels = images.to(self.device), labels.to(self.device)
optimizer.zero_grad()
criterion(self.model(images), labels).backward()
optimizer.step()
# Step 4: return updated weights + dataset size (for weighted FedAvg)
return self.get_parameters(config={}), len(self.trainloader.dataset), {}
def evaluate(self, parameters, config) -> Tuple[float, int, Dict]:
self.set_parameters(parameters)
criterion = nn.CrossEntropyLoss()
self.model.eval()
loss, correct = 0.0, 0
with torch.no_grad():
for images, labels in self.valloader:
images, labels = images.to(self.device), labels.to(self.device)
outputs = self.model(images)
loss += criterion(outputs, labels).item()
correct += (outputs.argmax(1) == labels).sum().item()
n = len(self.valloader.dataset)
accuracy = correct / n
return loss / len(self.valloader), n, {"accuracy": accuracy}
# ── Launch the client ─────────────────────────────────────
if __name__ == "__main__":
import sys
client_id = int(sys.argv[1]) if len(sys.argv) > 1 else 0
# Each client loads ONLY its own local data partition
trainloader, valloader = load_local_partition(client_id)
model = Net()
client = FLClient(model, trainloader, valloader, client_id)
fl.client.start_client(
server_address="server-host:8080", # gRPC server address
client=client.to_client(),
)
The implementation above is clean and functional but production systems add:
(1) SecAgg — cryptographic secret sharing so the server never sees individual client gradients.
(2) Differential Privacy — clip + Gaussian noise before upload (flwr has DPFedAvgFixed built-in).
(3) Compression — top-k sparsification plugin for large models.
(4) TLS mutual auth — pass grpc_max_message_length and SSL credentials to start_server().
(5) Client state persistence — save client model between rounds so returning clients resume training.
Architecture Decision Guide
Choosing the right topology and operating mode depends on your specific constraints. Use this table to make the decision systematically.
| Constraint | Recommended Topology | Sync Mode | Selection Strategy | Notes |
|---|---|---|---|---|
| 🏠 <100 reliable servers (hospitals, banks) | Standard Star | Synchronous | Uniform random | Cross-silo; clients are reliable; FedAvg default |
| 📱 10K–10M mobile devices | Standard Star or Hierarchical | Async or Semi-Sync | Deadline-aware / Oort | Cross-device; high churn; need straggler mitigation |
| 🏠🏠 Multi-country, geo-distributed | Hierarchical (2-tier) | Sync within tier, Async across | Regional coordinator | Reduces cross-WAN bandwidth by 80-95% |
| 🦔 No trusted central server | Peer-to-Peer (gossip) | Asynchronous | Neighbour-based | Blockchain FL; slower convergence |
| 🔌 Tiny edge devices (<512MB RAM) | Split Learning | Synchronous | Uniform | Clients only run first few network layers |
| 🏭 Vertical FL (different features, same users) | Standard Star + VFL protocol | Synchronous | All participants always | Needs PSI for user alignment; use FATE framework |
Architecture Golden Rules
min_fit_clients to 60–70% of your selection target —
never require 100% of selected clients. Use min_available_clients
to wait until enough clients are online before starting a round.
num_examples from clients and use it in
FedAvg weighting: w_agg = Σ (n_k / n_total) × w_k.
participation_rate alongside loss every round.
Topic 3: The Non-IID Problem & Advanced Aggregation Algorithms. Now that you understand the topology, we go deeper into what happens when client data is heterogeneous. We will cover FedProx, SCAFFOLD, FedNova, and MOON — algorithms specifically designed to handle non-IID data that FedAvg cannot converge on. We will benchmark all four on pathologically non-IID CIFAR-10 partitions.