FL System Architecture: Client

Section 01

The Blueprint Before the Bricks

📖 Real-World Analogy

Building a Skyscraper Without Moving the Land

Imagine a construction company awarded contracts to build identical floors in skyscrapers across New York, London, and Tokyo simultaneously. Each site has its own workers, materials, and local rules — none of which can leave the city. But the final buildings must all match the same master blueprint.

A central architect sends the blueprint to each site. Workers build their floor locally. The architect reviews progress reports (not the buildings themselves), refines the blueprint, and sends updates. No materials ever cross an ocean — only knowledge does.

This is FL System Architecture. The "blueprint" is the global model. The "sites" are clients. The "architect" is the aggregation server. The "progress reports" are gradient updates. And the client–server topology is the engineering scaffold that makes it all work reliably at scale.

Before writing a single line of FL code, every practitioner must understand the system topology — how components connect, communicate, and co-ordinate. Getting this wrong means models that diverge, clients that stall, and systems that collapse the moment a single node drops off the network.

💡

What This Topic Covers

We will dissect the client–server topology that underpins every federated system: the roles each node plays, how data and model weights flow between them, the communication protocol stack, synchronous vs asynchronous operation modes, hierarchical and peer-to-peer extensions, and how to implement a production-ready FL server with Flower (flwr).

Section 02

The Core Client–Server Topology

The canonical FL topology is a star network: one central server (or a small cluster) at the hub, and N clients at the spokes. This is not the only topology (we will cover hierarchical and P2P variants), but it is where every FL practitioner starts.

🏠 FL Client–Server Star Topology — Full System View

Star topology: one central server broadcasts the global model; each client trains locally and returns only weight updates. Raw data never crosses the dashed lines.

📋 What Each Node Does

☁️

Aggregation Server

Hub of the star

Holds the global model weights between rounds. Selects participating clients each round. Receives compressed weight deltas. Runs the aggregation algorithm (FedAvg, FedProx, etc.). Broadcasts the updated global model. Tracks convergence metrics. Does not store any client raw data — ever.

✓ Single source of truth for model state

✗ Potential bottleneck at very high client counts

📱

Client Node

Spoke of the star

Stores private local data that never leaves. Downloads the current global model when selected. Runs local SGD for E epochs on its own data. Computes weight delta: Δw = w_local − w_global. Uploads only Δw (optionally compressed + noise-masked). May decline participation if battery / bandwidth is low.

✓ Full data sovereignty maintained

✗ Heterogeneous compute; may straggle

📋

Coordinator / Selector

Orchestration layer

Often a separate process from the aggregation server. Maintains a registry of all eligible clients and their metadata (last-seen, battery, WiFi status, data size, round history). Applies selection strategy: random, power-of-choice, importance-weighted. Manages round timing and timeout logic. In small deployments, often co-located with the server.

✓ Decouples selection policy from aggregation

✗ Must handle client churn registration at scale

Section 03

The Communication Protocol Stack

Data flowing between server and clients doesn't travel as raw Python objects. A full protocol stack handles serialisation, compression, encryption, and transport. Understanding each layer prevents the most common production failures.

🔌 FL Communication Protocol Stack

Every gradient update passes down all five layers before transmission and back up on receipt. Compression alone can reduce bandwidth by 10–1000×, depending on the technique and how aggressively it is applied.

Layer	Technology Used	What It Does	Bandwidth Impact
FL Application	Flower flwr, TensorFlow Federated	Defines training logic, strategy, client/server interface	N/A
Serialisation	protobuf, MessagePack, NumPy bytes	Converts tensors to byte streams for transmission	Baseline
Compression	Top-k, quantisation, random mask	Reduces gradient size before encryption	10–1000× reduction
Security	TLS 1.3, SecAgg, DP noise	Prevents interception; hides individual updates	+5–15% overhead
Transport	gRPC / HTTP2	Reliable delivery with streaming and multiplexing	Lowest latency option
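To make the compression layer concrete, here is a minimal sketch of symmetric int8 quantisation with NumPy. The function names quantize_int8 and dequantize_int8 are hypothetical helpers for illustration, not part of Flower or TFF, and the single per-tensor scale is an assumed design choice.

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Symmetric linear quantisation: int8 values plus one float scale per tensor."""
    max_abs = float(np.max(np.abs(x)))
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    """Approximate reconstruction of the original float32 tensor."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
delta = rng.normal(0.0, 0.01, size=10_000).astype(np.float32)  # fake weight delta
q, scale = quantize_int8(delta)
restored = dequantize_int8(q, scale)

print(delta.nbytes, q.nbytes)  # 40000 10000
```

Quantisation on its own gives only the 4× reduction from fp32 to int8; the larger factors in the table come from combining it with sparsification.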

🔑

Why gRPC over REST?

FL systems use gRPC (not REST) for three reasons. First, gRPC uses HTTP/2 which supports bidirectional streaming — the server can push the new model to a client while that client is still uploading its gradients from the previous round. Second, Protocol Buffers are 3–10× more compact than JSON. Third, gRPC has built-in deadline/retry semantics that handle the unreliable nature of edge client connections gracefully. Flower uses gRPC by default; TensorFlow Federated offers both gRPC and REST.

Section 04

Synchronous vs. Asynchronous Operation

One of the most consequential architectural decisions is whether the server waits for all selected clients before aggregating, or whether it aggregates whenever updates arrive. This is the sync vs. async trade-off.

📖 Story

The Synchronous Marathon vs. The Asynchronous Relay Race

Imagine organising a marathon where 100 runners must all cross the finish line before the race clock advances. The fastest runners wait at the finish line for the slowest. Nobody's time is wasted training — but the clock only moves when everyone arrives. This is synchronous FL: slow stragglers block every round.

Now imagine a relay race where each runner passes the baton as soon as they finish their leg, without waiting for anyone else. The team's collective speed improves continuously. A tired runner who drops the baton doesn't stop the race. This is asynchronous FL: the global model updates the moment any client returns — but the model may drift if fast clients dominate all the updates.

⚙️ Synchronous vs Asynchronous FL — Round Timing

Sync FL: all clients must finish before aggregation. Async FL: server aggregates each update immediately — faster rounds, but risk of gradient staleness.

⏳ Synchronous FL

How it works

Server selects K clients, broadcasts model

Waits until min(K, threshold) clients respond

Aggregates all received updates at once

Broadcasts updated model for next round

Best for: Cross-silo (hospitals, banks) where clients are reliable servers, not mobile devices

Risk: One slow client delays every other client

⚡ Asynchronous FL

How it works

Server continuously accepts incoming updates

Aggregates each update with momentum into global model

Fast clients train on a newer model version

Slow clients may upload stale gradients (staleness τ)

Best for: Cross-device (billions of phones) where client availability is unpredictable

Risk: Gradient staleness degrades convergence if τ is large
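One way to limit the damage from stale updates is to shrink the mixing weight as staleness τ grows. The sketch below assumes a polynomial discount; the function names and the default values for base_alpha and decay are illustrative assumptions, not something the text above prescribes.

```python
from typing import List

def staleness_weight(base_alpha: float, staleness: int, decay: float = 0.5) -> float:
    """Mixing weight that shrinks as the update gets staler (illustrative choice)."""
    return base_alpha / (1.0 + staleness) ** decay

def async_apply(global_w: List[float], client_w: List[float],
                server_version: int, client_version: int,
                base_alpha: float = 0.5) -> List[float]:
    """Blend one client's weights into the global model as soon as they arrive."""
    tau = server_version - client_version  # staleness: rounds since the client pulled the model
    alpha = staleness_weight(base_alpha, tau)
    return [(1.0 - alpha) * g + alpha * c for g, c in zip(global_w, client_w)]

global_w = [0.0, 0.0]
fresh = async_apply(global_w, [1.0, 1.0], server_version=10, client_version=10)
stale = async_apply(global_w, [1.0, 1.0], server_version=10, client_version=7)
print(fresh)  # [0.5, 0.5]
print(stale)  # [0.25, 0.25]
```

The same update moves the global model half as far when it is three versions old, which is the intended effect of the discount.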

Section 05

The Full Round Lifecycle — Step by Step

A single FL communication round is more complex than it first appears. Here is every state transition from round start to round end, with the exact data flowing at each step.

Client Registration & Eligibility Check

Before any round begins, clients register with the coordinator by sending a ClientHello message containing: device ID, battery %, connection type (WiFi/4G), available RAM, local dataset size, and last participation timestamp. The coordinator marks clients as ELIGIBLE or INELIGIBLE based on configurable thresholds (e.g. battery > 20%, on WiFi, idle for 5+ minutes). Google's Gboard deployment applies conditions of this kind, training only when the device is idle, charging, and on an unmetered network.
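The eligibility check can be sketched as a small predicate over the registration message. The field names below are assumptions modelled on the list above, and idle_minutes is an extra hypothetical field added so the idle threshold can be evaluated.

```python
from dataclasses import dataclass

@dataclass
class ClientHello:
    device_id: str
    battery_pct: float
    connection: str        # "wifi" or "4g"
    free_ram_mb: int
    num_samples: int
    idle_minutes: float    # hypothetical field, not in the message list above

def is_eligible(hello: ClientHello, min_battery: float = 20.0,
                min_idle_minutes: float = 5.0) -> bool:
    """Apply the example thresholds: battery > 20%, on WiFi, idle for 5+ minutes."""
    return (hello.battery_pct > min_battery
            and hello.connection == "wifi"
            and hello.idle_minutes >= min_idle_minutes)

phone_a = ClientHello("a", 80.0, "wifi", 2048, 1200, 12.0)
phone_b = ClientHello("b", 15.0, "wifi", 2048, 900, 30.0)
print(is_eligible(phone_a), is_eligible(phone_b))  # True False
```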

Client Selection (Coordinator → Server)

The coordinator selects a subset S of eligible clients using a selection strategy. Default: uniform random sample of fraction C (e.g. C=0.1 means 10% of eligible clients). Advanced strategies: power-of-choice (favours clients with the highest local loss to speed up convergence, at some cost in selection fairness), deadline-aware selection (only selects clients likely to finish within the round timeout). Selected clients receive a RoundConfig object containing round ID, local epochs E, batch size B, and learning rate η.
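The default strategy, a uniform random sample of fraction C, fits in a few lines. The function name select_clients and the min_clients floor are assumptions for illustration.

```python
import random
from typing import List, Optional

def select_clients(eligible: List[str], fraction: float,
                   min_clients: int = 1, seed: Optional[int] = None) -> List[str]:
    """Uniformly sample a fraction C of eligible clients, without replacement."""
    rng = random.Random(seed)
    k = max(min_clients, round(len(eligible) * fraction))
    k = min(k, len(eligible))  # never ask for more clients than exist
    return rng.sample(eligible, k)

eligible = [f"client-{i}" for i in range(200)]
chosen = select_clients(eligible, fraction=0.1, seed=42)
print(len(chosen))  # 20
```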

Model Broadcast (Server → Clients)

The server serialises current global weights w_t into a Parameters object (Flower) or ServerMessage (TFF). The weights are compressed (optional quantisation to int8) and encrypted via TLS. For a 100M parameter model, this is ~400 MB in fp32 or ~100 MB quantised. Clients acknowledge receipt with a ModelAck message containing their local dataset size — used later for weighted aggregation.
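The payload figures above follow directly from parameter count times bytes per parameter. A quick check, assuming 1 MB = 10^6 bytes and ignoring serialisation overhead:

```python
def payload_mb(num_params: int, bytes_per_param: int) -> float:
    """Uncompressed payload size in megabytes (1 MB = 10**6 bytes)."""
    return num_params * bytes_per_param / 1e6

n = 100_000_000
print(payload_mb(n, 4))  # 400.0  (fp32: 4 bytes per parameter)
print(payload_mb(n, 1))  # 100.0  (int8: 1 byte per parameter)
```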

Local Training (Client-Side)

Each client initialises its local model with w_t, then runs E epochs of mini-batch SGD on its local dataset D_k. The local loss function L_k(w) = (1/|D_k|) Σ ℓ(w; x_i, y_i). After E epochs, the client has local weights w_k. It computes the update: Δw_k = w_k − w_t. This delta is the only information that will leave the client.
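The local step can be sketched in NumPy. This example swaps in a least-squares loss on a linear model as a stand-in for the generic ℓ above; the function name local_update and all hyperparameter values are assumptions for illustration.

```python
import numpy as np

def local_update(w_global, X, y, epochs=3, batch_size=8, lr=0.1, seed=0):
    """Run E epochs of mini-batch SGD on a least-squares loss; return (delta, n_k)."""
    rng = np.random.default_rng(seed)
    w = w_global.copy()                 # start from the broadcast weights w_t
    n = len(X)
    for _ in range(epochs):
        order = rng.permutation(n)      # reshuffle the local data each epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            residual = X[idx] @ w - y[idx]
            grad = 2.0 * X[idx].T @ residual / len(idx)
            w -= lr * grad
    return w - w_global, n              # only the delta leaves the client

rng = np.random.default_rng(1)
X = rng.normal(size=(64, 3))            # private local features
y = X @ np.array([1.0, -2.0, 0.5])      # private local targets
w_t = np.zeros(3)
delta, n_k = local_update(w_t, X, y)
print(n_k, delta.shape)
```

Note that the function copies w_global before training, so the broadcast weights stay intact and the delta is computed against the exact model the server sent.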

Gradient Upload (Clients → Server)

The client applies optional gradient compression (top-k sparsification keeps only the largest k% of gradient values by magnitude; the rest are zeroed). If differential privacy is enabled, it then clips the update and adds Gaussian noise N(0, σ²). The compressed, noised delta is serialised to protobuf and sent via gRPC streaming. Along with Δw_k, the client also reports its local loss value and the number of samples it trained on, which the server strategy uses as metadata.
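A rough sketch of the upload preparation: clip, sparsify, then add noise. Adding noise only to the retained entries is an assumption made here to preserve sparsity; this is not a calibrated differential privacy mechanism, and all names and default values are hypothetical.

```python
import numpy as np

def top_k_sparsify(delta: np.ndarray, k_fraction: float):
    """Keep the largest-magnitude entries; zero the rest. Returns (sparse, kept indices)."""
    flat = delta.ravel()
    k = max(1, int(round(flat.size * k_fraction)))
    keep = np.argpartition(np.abs(flat), -k)[-k:]
    sparse = np.zeros_like(flat)
    sparse[keep] = flat[keep]
    return sparse.reshape(delta.shape), keep

def clip_l2(delta: np.ndarray, max_norm: float) -> np.ndarray:
    """Scale the update down so its L2 norm is at most max_norm."""
    norm = float(np.linalg.norm(delta))
    return delta * min(1.0, max_norm / norm) if norm > 0 else delta

def prepare_upload(delta, k_fraction=0.1, max_norm=1.0, sigma=0.01, seed=0):
    rng = np.random.default_rng(seed)
    clipped = clip_l2(delta, max_norm)
    sparse, keep = top_k_sparsify(clipped, k_fraction)
    noisy = sparse.copy().ravel()
    noisy[keep] += rng.normal(0.0, sigma, size=keep.size)  # noise on kept entries only (assumption)
    return noisy.reshape(delta.shape)

delta = np.random.default_rng(3).normal(size=1000)
upload = prepare_upload(delta)
print(np.count_nonzero(upload))
```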

Aggregation (Server)

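Once the server has collected the updates for the round, it runs the aggregation algorithm. (In Flower, min_fit_clients sets the minimum number of clients sampled for training, while min_available_clients only controls how many clients must be connected before a round starts.) FedAvg: w_{t+1} = Σ_k (n_k / n) · w_k, where w_k = w_t + Δw_k is client k's locally trained weights, n_k is client k's dataset size, and n = Σ_k n_k. The aggregated weights form the new global model w_{t+1}. Aggregation itself is cheap: typically <1 second even for 1000 clients in a GPU cluster.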
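The FedAvg rule is a per-layer weighted mean. A minimal NumPy sketch, where the function name fedavg and the (weights, n_k) tuple layout are assumptions for illustration rather than Flower's internal representation:

```python
import numpy as np
from typing import List, Tuple

def fedavg(updates: List[Tuple[List[np.ndarray], int]]) -> List[np.ndarray]:
    """updates holds (per-layer weights w_k, dataset size n_k) for each client."""
    n_total = sum(n_k for _, n_k in updates)
    num_layers = len(updates[0][0])
    return [
        sum(w_k[layer] * (n_k / n_total) for w_k, n_k in updates)
        for layer in range(num_layers)
    ]

client_a = ([np.array([1.0, 1.0])], 100)   # 25% of the samples
client_b = ([np.array([3.0, 5.0])], 300)   # 75% of the samples
new_global = fedavg([client_a, client_b])
print(new_global[0])  # 0.25*[1, 1] + 0.75*[3, 5] = [2.5, 4.0]
```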

Evaluation & Round Completion

The server evaluates w_{t+1} on its held-out validation dataset. It logs: global loss, global accuracy, per-round client participation rate, average gradient magnitude, and wall-clock time per round. If convergence criteria are met (e.g. loss plateau for 5 consecutive rounds), training stops, and the final model weights are saved and optionally deployed to all clients. Otherwise, round t+1 begins from the first step.
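A plateau-based stopping rule might look like the following. The function name has_plateaued and the min_delta tolerance are assumptions; the source only specifies "loss plateau for 5 consecutive rounds".

```python
from typing import List

def has_plateaued(losses: List[float], patience: int = 5,
                  min_delta: float = 1e-3) -> bool:
    """True if none of the last `patience` rounds beat the earlier best by min_delta."""
    if len(losses) <= patience:
        return False  # not enough history yet
    best_before = min(losses[:-patience])
    return all(loss > best_before - min_delta for loss in losses[-patience:])

improving = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3]
stalled = [0.9, 0.5, 0.3, 0.3, 0.3001, 0.2998, 0.3, 0.2999]
print(has_plateaued(improving), has_plateaued(stalled))  # False True
```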

Section 06

Beyond Star: Topology Variants

The standard client–server star works well for up to ~10,000 clients. Above that, or in settings with geographic constraints, the extended topologies and related designs below are used.

🌎 FL Topology Variants

Left: standard star for small-to-medium deployments. Right: hierarchical two-tier for geographic scale — regional edge servers aggregate locally before reporting to the global server.

⭐

Standard Star

1 server · N clients

One central aggregation server communicates directly with all selected clients. Simple to implement and reason about. Works well up to ~10K clients with sufficient server bandwidth. The default topology for most FL frameworks (Flower, PySyft, TFF).

🏠

Hierarchical (Two-Tier)

Global server · Edge servers · Clients

Regional edge servers (e.g. 5G MEC nodes or hospital cluster nodes) aggregate locally first, then report to a global cloud server. Drastically reduces WAN traffic. Used by Huawei in 5G FL, and by healthcare consortia spanning multiple countries. Convergence may take slightly more global rounds, but each round completes much faster.
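A small sketch of two-tier aggregation, assuming each edge server forwards its regional average together with its total sample count. Under that assumption, and with a single aggregation step per tier, the result matches flat FedAvg over all clients. Region names and values are made up for illustration.

```python
from typing import List, Tuple

Update = Tuple[List[float], int]  # (weights, number of samples)

def weighted_average(updates: List[Update]) -> Update:
    """Sample-weighted mean of weight vectors; also returns the combined sample count."""
    total = sum(n for _, n in updates)
    dim = len(updates[0][0])
    avg = [sum(w[i] * n for w, n in updates) / total for i in range(dim)]
    return avg, total

region_eu = [([1.0, 2.0], 10), ([3.0, 4.0], 30)]
region_asia = [([5.0, 6.0], 60)]

# Tier 1: each edge server aggregates its own clients.
edge_results = [weighted_average(region_eu), weighted_average(region_asia)]
# Tier 2: the global server aggregates the edge results.
global_w, n_total = weighted_average(edge_results)
# Reference: flat aggregation over every client at once.
flat_w, _ = weighted_average(region_eu + region_asia)

print(global_w, n_total)  # [4.0, 5.0] 100
```

Only two messages cross the WAN here instead of three; with thousands of clients per region the saving grows accordingly.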

🔗

Peer-to-Peer (Decentralised)

No central server · Gossip protocol

No central server at all. Each client communicates with a small neighbourhood of peers and averages their models via gossip protocols (e.g. MATCHA, D-PSGD). Eliminates the single point of failure and the trust requirement on the server. Used in blockchain-integrated FL systems. Convergence proofs are harder; not yet production-standard for most applications.
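The core idea of gossip averaging can be shown with scalar "models" on a ring. This is plain neighbourhood averaging, not an implementation of MATCHA or D-PSGD, and the ring layout is an assumption for illustration.

```python
from typing import List

def gossip_round(values: List[float]) -> List[float]:
    """Each peer replaces its value with the mean of itself and its two ring neighbours."""
    n = len(values)
    return [(values[(i - 1) % n] + values[i] + values[(i + 1) % n]) / 3.0
            for i in range(n)]

models = [0.0, 4.0, 8.0, 12.0, 16.0, 20.0]  # one scalar "model" per peer
for _ in range(50):
    models = gossip_round(models)

# After enough rounds every peer holds (approximately) the global mean, 10.0.
print(max(models) - min(models) < 1e-6)  # True
```

No peer ever sees more than its two neighbours, yet all converge to the same average because each round preserves the overall mean.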

⚙️

Split Learning

Model partitioned across client + server

The neural network is split: clients compute forward pass through early layers, send only the smashed data (intermediate activations) to the server, which computes the rest. The server sends gradients back to complete the backward pass. Used for vertical FL where clients have different feature sets for the same samples. Requires more communication rounds but allows huge models on weak clients.

📋

Cluster-Based FL

Clients grouped by data similarity

Clients with similar data distributions are clustered before training. Each cluster trains its own specialised global model (IFCA algorithm). Addresses the non-IID problem by separating heterogeneous clients. Useful when client populations are genuinely multi-modal (e.g. teenage users vs professional users of the same keyboard app have very different language patterns).

🎉

Personalised FL (pFL)

Global model + local fine-tuning head

Combines topology and training: a shared global backbone is trained in a federated fashion, then each client fine-tunes a small personal head (last 1–2 layers) on local data. Techniques: FedPer, Per-FedAvg (MAML-inspired), Ditto (regularised local objective). Achieves the best of both worlds: global generality + local personalisation. Apple uses this for on-device personalisation of Siri and keyboard.

Section 07

Client Selection Strategies

Which clients train in each round is one of the most impactful decisions in FL system design. Random selection is the baseline — but it ignores data quality, connectivity, and model bias.

📋 Client Selection Strategies — Impact on Model Quality

Random selection (left) treats all eligible clients equally. Power-of-choice (right) biases selection toward clients with higher local loss — these clients have the most to learn and speed up convergence.

Strategy	Selection Criterion	Convergence Speed	Fairness	Best For
Uniform Random	Equal probability for all eligible clients	Baseline	High	Default; most deployments
Power-of-Choice	Sample d candidates; pick top-k by local loss	1.5–3× faster	Medium	Non-IID data; slow convergence
Deadline-Aware	Predict training time; select likely finishers	Fewer stragglers	Low (fast clients favoured)	Mobile cross-device FL
Importance Weighted	Weight by data quality / label diversity score	Best final accuracy	Medium	Medical imaging; rare class data
Oort (University of Michigan)	Utility = data utility × system utility	SOTA in heterogeneous nets	High (enforced fairness)	Production cross-device systems
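The power-of-choice row can be sketched as "sample d candidates, keep the k with the highest local loss". This simplified version samples candidates uniformly and assumes the coordinator already knows each client's latest local loss; the function name is hypothetical.

```python
import random
from typing import Dict, List, Optional

def power_of_choice(local_loss: Dict[str, float], d: int, k: int,
                    seed: Optional[int] = None) -> List[str]:
    """Sample d candidate clients, then keep the k with the highest reported loss."""
    rng = random.Random(seed)
    candidates = rng.sample(sorted(local_loss), min(d, len(local_loss)))
    return sorted(candidates, key=lambda c: local_loss[c], reverse=True)[:k]

losses = {f"client-{i}": i / 10 for i in range(20)}
picked = power_of_choice(losses, d=8, k=3, seed=0)
print(picked)
```

A larger d makes the selection greedier (more biased toward high-loss clients); d = k reduces it to uniform random selection.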

Section 08

Implementing FL Architecture with Flower

Flower (flwr) is the most widely used FL framework, designed to be framework-agnostic (works with PyTorch, TensorFlow, JAX, scikit-learn). It implements the full client–server topology we've described, using gRPC for communication.

🚀

Flower Architecture Map to Our Concepts

flwr.server.Server = Aggregation Server | flwr.server.strategy.Strategy = Aggregation Algorithm (FedAvg, FedProx, etc.) | flwr.client.Client = Client Node | flwr.server.start_server() = Coordinator entrypoint | flwr.client.start_client() = Client registration + round participation

💻 Complete Flower Server Implementation

# server.py — FL Aggregation Server with Custom FedAvg Strategy
import flwr as fl
from flwr.common import Metrics
from typing import List, Tuple, Optional, Dict
import numpy as np

# ── Custom weighted FedAvg strategy ──────────────────────
class WeightedFedAvg(fl.server.strategy.FedAvg):
    """FedAvg + server-side evaluation logging."""

    def aggregate_fit(
        self,
        server_round: int,
        results: List,
        failures: List,
    ) -> Tuple[Optional[fl.common.Parameters], Dict]:

        # Log participation stats each round
        total     = len(results) + len(failures)
        success   = len(results)
        fail_rate = len(failures) / total if total > 0 else 0
        print(f"[Round {server_round}] Clients: {success}/{total} "
              f"| Failure rate: {fail_rate:.1%}")

        # Delegate aggregation to parent FedAvg
        aggregated_params, metrics = super().aggregate_fit(
            server_round, results, failures
        )
        return aggregated_params, metrics

    def aggregate_evaluate(
        self,
        server_round: int,
        results: List,
        failures: List,
    ) -> Tuple[Optional[float], Dict]:
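        # Weighted average of client-reported losses.
        # Each result is a (ClientProxy, EvaluateRes) pair, so the loss and
        # sample count are read from the EvaluateRes object.
        if not results:
            return None, {}

        total_samples = sum(res.num_examples for _, res in results)
        if total_samples == 0:
            return None, {}
        weighted_loss = sum(res.num_examples * res.loss
                            for _, res in results) / total_samples
        print(f"[Round {server_round}] Global loss: {weighted_loss:.4f}")
        return weighted_loss, {}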

# ── Server configuration ──────────────────────────────────
strategy = WeightedFedAvg(
    fraction_fit=0.1,           # 10% of clients per round
    fraction_evaluate=0.05,     # 5% of clients for eval
    min_fit_clients=10,         # minimum to start a round
    min_evaluate_clients=5,     # minimum for evaluation
    min_available_clients=50,   # wait until 50 clients connect
)

# ── Start the server ──────────────────────────────────────
if __name__ == "__main__":
    fl.server.start_server(
        server_address="0.0.0.0:8080",   # gRPC endpoint
        config=fl.server.ServerConfig(num_rounds=20),
        strategy=strategy,
    )

SERVER CONSOLE OUTPUT

INFO flwr 1.8.0 / Starting Flower server, listening on 0.0.0.0:8080
INFO Flower ECE: gRPC server running (20 rounds), SSL disabled
[Round 1] Clients: 12/15 | Failure rate: 20.0%
[Round 1] Global loss: 0.8341
[Round 2] Clients: 14/15 | Failure rate: 6.7%
[Round 2] Global loss: 0.7204
[Round 3] Clients: 13/15 | Failure rate: 13.3%
[Round 3] Global loss: 0.6118
...
[Round 20] Clients: 15/15 | Failure rate: 0.0%
[Round 20] Global loss: 0.1973
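The client implementation reads learning_rate, local_epochs, and batch_size from the config dictionary it receives, but the server strategy above never sets them, so the client's defaults would apply. One way to supply them is FedAvg's on_fit_config_fn hook, which takes the round number and returns a dictionary. The decay schedule below is purely illustrative.

```python
from typing import Dict, Union

Scalar = Union[bool, bytes, float, int, str]

def fit_config(server_round: int) -> Dict[str, Scalar]:
    """Per-round hyperparameters sent to every client selected for training."""
    return {
        "learning_rate": 0.01 * (0.95 ** (server_round - 1)),  # illustrative decay
        "local_epochs": 3,
        "batch_size": 32,
    }

print(fit_config(1))
# Usage sketch: pass on_fit_config_fn=fit_config when constructing WeightedFedAvg.
```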

💻 Complete Flower Client Implementation

# client.py — FL Client Node (PyTorch backend)
import flwr as fl
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from collections import OrderedDict
from typing import List, Dict, Tuple
import numpy as np

# ── Model definition ──────────────────────────────────────
class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)),
        )
        self.fc = nn.Linear(64 * 4 * 4, 10)

    def forward(self, x):
        return self.fc(self.conv(x).flatten(1))

# ── Flower client class ───────────────────────────────────
class FLClient(fl.client.NumPyClient):
    def __init__(self, model, trainloader, valloader, client_id):
        self.model       = model
        self.trainloader = trainloader
        self.valloader   = valloader
        self.client_id   = client_id
        self.device      = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.model.to(self.device)

    def get_parameters(self, config) -> List[np.ndarray]:
        # Extract model weights as NumPy arrays → Flower serialises to protobuf
        return [val.cpu().numpy() for _, val
                in self.model.state_dict().items()]

    def set_parameters(self, parameters: List[np.ndarray]):
        # Load server weights into local model
        params_dict  = zip(self.model.state_dict().keys(), parameters)
        state_dict   = OrderedDict(
            {k: torch.tensor(v) for k, v in params_dict}
        )
        self.model.load_state_dict(state_dict, strict=True)

    def fit(self, parameters, config) -> Tuple[List, int, Dict]:
        # Step 1: load global weights from server
        self.set_parameters(parameters)

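        # Step 2: read hyperparams from server config (defaults apply if the
        # server strategy does not send them)
        lr         = float(config.get("learning_rate", 0.01))
        epochs     = int(config.get("local_epochs",  3))
        # Note: batch_size is read but not applied here, because the DataLoader
        # passed to __init__ already fixes the batch size.
        batch_size = int(config.get("batch_size",    32))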

        # Step 3: local training
        optimizer = torch.optim.SGD(self.model.parameters(),
                                     lr=lr, momentum=0.9)
        criterion = nn.CrossEntropyLoss()
        self.model.train()

        for _ in range(epochs):
            for images, labels in self.trainloader:
                images, labels = images.to(self.device), labels.to(self.device)
                optimizer.zero_grad()
                criterion(self.model(images), labels).backward()
                optimizer.step()

        # Step 4: return updated weights + dataset size (for weighted FedAvg)
        return self.get_parameters(config={}), len(self.trainloader.dataset), {}

    def evaluate(self, parameters, config) -> Tuple[float, int, Dict]:
        self.set_parameters(parameters)
        criterion = nn.CrossEntropyLoss()
        self.model.eval()
        loss, correct = 0.0, 0

        with torch.no_grad():
            for images, labels in self.valloader:
                images, labels = images.to(self.device), labels.to(self.device)
                outputs = self.model(images)
                loss    += criterion(outputs, labels).item()
                correct += (outputs.argmax(1) == labels).sum().item()

        n        = len(self.valloader.dataset)
        accuracy = correct / n
        return loss / len(self.valloader), n, {"accuracy": accuracy}

# ── Launch the client ─────────────────────────────────────
if __name__ == "__main__":
    import sys
    client_id = int(sys.argv[1]) if len(sys.argv) > 1 else 0

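    # Each client loads ONLY its own local data partition.
    # load_local_partition is a user-supplied helper (not defined in this file)
    # that must return a (trainloader, valloader) pair of DataLoaders.
    trainloader, valloader = load_local_partition(client_id)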
    model  = Net()
    client = FLClient(model, trainloader, valloader, client_id)

    fl.client.start_client(
        server_address="server-host:8080",   # gRPC server address
        client=client.to_client(),
    )

CLIENT CONSOLE OUTPUT (client 3, 20 rounds)

⚙️

Scaling to Real Deployments: What the Code Doesn't Show

The implementation above is clean and functional, but production systems add: (1) SecAgg: cryptographic secret sharing so the server never sees individual client gradients. (2) Differential privacy: clip + Gaussian noise before upload (flwr has DPFedAvgFixed built-in). (3) Compression: top-k sparsification plugin for large models. (4) TLS: pass certificates to start_server() and root_certificates to start_client() to encrypt the channel; client authentication is configured separately, and grpc_max_message_length is an unrelated setting that controls the maximum gRPC message size for large models. (5) Client state persistence: save client model between rounds so returning clients resume training.

Section 09

Architecture Decision Guide

Choosing the right topology and operating mode depends on your specific constraints. Use this table to make the decision systematically.

Constraint	Recommended Topology	Sync Mode	Selection Strategy	Notes
🏠 <100 reliable servers (hospitals, banks)	Standard Star	Synchronous	Uniform random	Cross-silo; clients are reliable; FedAvg default
📱 10K–10M mobile devices	Standard Star or Hierarchical	Async or Semi-Sync	Deadline-aware / Oort	Cross-device; high churn; need straggler mitigation
🏠🏠 Multi-country, geo-distributed	Hierarchical (2-tier)	Sync within tier, Async across	Regional coordinator	Reduces cross-WAN bandwidth by 80-95%
🦔 No trusted central server	Peer-to-Peer (gossip)	Asynchronous	Neighbour-based	Blockchain FL; slower convergence
🔌 Tiny edge devices (<512MB RAM)	Split Learning	Synchronous	Uniform	Clients only run first few network layers
🏭 Vertical FL (different features, same users)	Standard Star + VFL protocol	Synchronous	All participants always	Needs PSI for user alignment; use FATE framework

Section 10

Architecture Golden Rules

🌟 FL System Architecture — Non-Negotiable Rules

Never let raw data cross a topology boundary. If any component in your architecture allows raw features or labels to flow outside the originating client node, it is not federated learning — it is distributed learning with privacy violations. Audit every gRPC message type in your implementation.

Design for client dropout from day one. In cross-device FL, expect 20–60% of selected clients to fail in any given round. Set min_fit_clients to 60–70% of your selection target — never require 100% of selected clients. Use min_available_clients to wait until enough clients are online before starting a round.

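Use gRPC with TLS mutual authentication. Plain HTTP is never acceptable in production FL. Clients must verify the server's certificate (prevents model injection attacks) and the server must verify client certificates (keeps unauthenticated participants from submitting updates). Certificate checks do not stop a compromised but authenticated client from sending poisoned gradients; that requires robust aggregation or anomaly detection on top.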

Weight your aggregation by client dataset size. Unweighted averaging gives equal influence to a client with 50 samples and one with 50,000 samples. Always pass num_examples from clients and use it in FedAvg weighting: w_agg = Σ (n_k / n_total) × w_k.

Maintain a server-side validation set. Since client data never reaches the server, you need a small, representative, held-out dataset on the server to track global model quality over rounds. Without it, you are flying blind. Aim for 1–5% of total estimated data size.

Compress before encrypting, not after. Top-k sparsification or int8 quantisation must happen before TLS encryption. Encrypting first then compressing yields almost no size reduction (encrypted data is incompressible). Compression → DP noise → TLS is the correct pipeline order.
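The ordering rule is easy to demonstrate with the standard library. In this sketch os.urandom stands in for ciphertext (an assumption: well-encrypted bytes look like random noise), and a zero-filled buffer stands in for a heavily sparsified update; no real encryption is performed.

```python
import os
import zlib

sparse_update = bytes(100_000)          # all-zero payload, like a very sparse delta
ciphertext_like = os.urandom(100_000)   # stand-in for encrypted bytes (high entropy)

compressed_plain = zlib.compress(sparse_update)
compressed_cipher = zlib.compress(ciphertext_like)

print(len(sparse_update), "->", len(compressed_plain))      # shrinks dramatically
print(len(ciphertext_like), "->", len(compressed_cipher))   # essentially no reduction
```

A generic compressor has nothing to exploit once the bytes are indistinguishable from noise, which is why any size reduction has to happen before the TLS layer.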

Track per-round client participation rate as a first-class metric. A sudden drop in participation (e.g. from 60% to 20%) is almost always a system signal — not a data signal. It means clients are failing eligibility checks, experiencing network issues, or the round timeout is too short. Log participation_rate alongside loss every round.

🚀

Coming Up in Topic 3

Topic 3: The Non-IID Problem & Advanced Aggregation Algorithms. Now that you understand the topology, we go deeper into what happens when client data is heterogeneous. We will cover FedProx, SCAFFOLD, FedNova, and MOON — algorithms specifically designed to handle non-IID data that FedAvg cannot converge on. We will benchmark all four on pathologically non-IID CIFAR-10 partitions.