A Bayesian Incentive Mechanism for Poison-Resilient Federated Learning
The 20-Second Summary
Open federated learning invites poisoning attacks, and many defenses either add heavy aggregation-time overhead or still assume an honest majority. This work takes a different angle: design incentives (as a Bayesian mechanism) so that, for rational clients, poisoning becomes economically irrational—while remaining plug-in compatible with standard FL pipelines.
The Problem
Federated learning works because participants contribute local model updates to a shared global model without exposing raw data. But this openness is also its Achilles’ heel: any participant can submit a poisoned update, and the server has no way to inspect the underlying data.
Poisoning comes in two flavors. Data poisoning corrupts the local training set so the resulting update degrades global accuracy. Model poisoning directly modifies the update - flipping gradient signs, scaling magnitudes, or injecting backdoor patterns - to steer the global model toward attacker-chosen behavior.
Most defenses focus on the aggregation step: robust mean estimators, trimmed statistics, or similarity-based filtering. These help, but they share a common weakness: they are reactive. They try to detect bad updates after they’ve been submitted. A sophisticated attacker can tune their poisoned update to sit just inside the filter boundary - close enough to honest updates to avoid rejection, but biased enough to shift the model over many rounds.
We asked a different question: what if we made poisoning a bad economic decision in the first place?
Why Existing Approaches Fall Short
| Defense Strategy | Limitation |
|---|---|
| Krum / Multi-Krum | Selects updates closest to the pack - fails when a majority is malicious |
| Trimmed Mean | Discards extremes but can still be manipulated by coordinated attackers |
| FLTrust | Requires a clean root dataset at the server, which may not be available |
| Reputation systems | Build trust slowly; vulnerable to strategic long-game poisoning |
| Differential privacy | Clips and noises updates, but does not distinguish honest from malicious |
The core issue: aggregation-based defenses treat all participants as adversaries, penalizing honest clients alongside malicious ones. They add computational overhead to every round and can degrade model quality even in the absence of attacks. We need a mechanism that targets malicious behavior specifically - and ideally prevents it before it happens.
Our Approach: Bayesian Incentive Mechanism
We model the interaction between the FL server and participating clients as a Bayesian game. In this game, each client has a private type - honest or malicious - known only to itself. The server cannot observe types directly but can observe the quality of submitted updates.
The Game Setup
- Players: One server (mechanism designer) and $N$ clients (strategic agents)
- Types: Each client is privately either honest (wants to contribute useful updates) or malicious (wants to degrade the model)
- Actions: Each client chooses an effort level - how faithfully to train on local data
- Payments: The server offers a payment scheme tied to update quality
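To fix ideas, here is a minimal Python sketch of these primitives. The class names, the `damage_value` bonus for malicious types, and the payoff structure are illustrative assumptions for exposition, not the paper's exact formalization.

```python
from dataclasses import dataclass
from enum import Enum, auto

class ClientType(Enum):
    """Private type: known only to the client, never observed by the server."""
    HONEST = auto()
    MALICIOUS = auto()

@dataclass
class RoundOutcome:
    """What the server observes and pays for one client in one round."""
    quality_score: float   # s(u_i): score of the submitted update
    payment: float         # p_i = f(s(u_i)): reward tied to observed quality

def utility(client_type: ClientType, outcome: RoundOutcome,
            training_cost: float, damage_value: float = 0.0) -> float:
    """Illustrative payoff: payment minus effort cost; malicious types additionally
    value (via the hypothetical damage_value term) harm done to the global model."""
    payoff = outcome.payment - training_cost
    if client_type is ClientType.MALICIOUS:
        payoff += damage_value
    return payoff
```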
How the Mechanism Works
The server implements a scoring rule that evaluates each submitted update against a quality benchmark. This benchmark can be computed from the distribution of all received updates, from a small validation set, or from historical round statistics. Based on the score, the server assigns a payment:
1. Client $i$ trains locally and submits update $u_i$
2. Server computes quality score $s(u_i)$
3. Payment $p_i = f(s(u_i))$ - a reward proportional to quality
4. Low-quality (poisoned) updates receive zero or negative payment
5. The server aggregates only the qualifying updates into the global model
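A minimal sketch of this server-side step, under two assumptions that the paper may handle differently: the quality benchmark is the coordinate-wise median of the received updates, and the payment rule is linear above a score threshold.

```python
import numpy as np

def score_updates(updates: list[np.ndarray]) -> np.ndarray:
    """Score each flattened update by (negative) distance to a benchmark.

    Assumption for this sketch: the benchmark is the coordinate-wise median
    of all received updates; being closer to it means a higher score.
    """
    stacked = np.stack(updates)                      # shape: (n_clients, n_params)
    benchmark = np.median(stacked, axis=0)
    return -np.linalg.norm(stacked - benchmark, axis=1)

def payments_and_mask(scores: np.ndarray, threshold: float, scale: float = 1.0):
    """Linear payment above the quality threshold; zero payment and exclusion below it."""
    qualifies = scores >= threshold
    payments = np.where(qualifies, scale * (scores - threshold), 0.0)
    return payments, qualifies

def aggregate(updates: list[np.ndarray], qualifies: np.ndarray) -> np.ndarray:
    """Plain average over the qualifying updates only (assumes at least one qualifies)."""
    return np.stack(updates)[qualifies].mean(axis=0)
```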
The key insight is the design of the payment function $f$. We prove that under our mechanism, submitting an honest update is a Bayesian Nash Equilibrium - no client can improve their expected payoff by deviating to a poisoning strategy, regardless of what other clients do. In game-theoretic terms, poisoning becomes a dominated strategy.
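Schematically, and glossing over the paper's exact notation, the design target is an incentive-compatibility condition of roughly the following form, where $c_h$ and $c_p$ denote a client's costs of honest training and of poisoning, and the expectation is taken over the other clients' types and any randomness in scoring:

$$\mathbb{E}\!\left[f\big(s(u_i^{\text{honest}})\big)\right] - c_h \;\ge\; \mathbb{E}\!\left[f\big(s(u_i^{\text{poison}})\big)\right] - c_p \quad \text{for every client } i.$$

For a malicious type, the same logic additionally requires this expected payment gap to outweigh whatever value the client places on the damage a poisoned update would cause; the precise conditions and the proof are in the paper.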
Why Bayesian?
Classical mechanism design assumes complete information - everyone knows everyone else’s type. That’s unrealistic in FL, where the server doesn’t know which clients are malicious. The Bayesian formulation handles this naturally: the server reasons about the distribution of types and designs payments that are robust to any realization.
In practice, the server never has to identify which specific clients are malicious. The mechanism remains effective even when the malicious fraction is high, and it doesn’t require a trusted subset of clients or a clean server-side dataset.
Integration with FL Pipelines
The mechanism is designed as a drop-in layer on top of standard FedAvg or FedSGD. The server adds a scoring and payment step after receiving updates but before aggregation. No changes are needed to the local training process, the model architecture, or the communication protocol.
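A sketch of one such round, reusing the hypothetical `score_updates`, `payments_and_mask`, and `aggregate` helpers from the earlier sketch; each client object is assumed to expose a `local_update(weights)` method that returns its flattened update as a delta.

```python
import numpy as np

def run_round(global_weights: np.ndarray, clients, threshold_quantile: float = 0.3):
    """One FedAvg-style round with scoring and payments inserted before aggregation."""
    # 1. Local training is untouched: clients return updates exactly as in FedAvg.
    updates = [client.local_update(global_weights) for client in clients]

    # 2. Drop-in layer: score updates, pay clients, keep only qualifying updates.
    scores = score_updates(updates)
    threshold = np.quantile(scores, threshold_quantile)   # illustrative threshold choice
    payments, qualifies = payments_and_mask(scores, threshold)

    # 3. Aggregate the accepted updates (treated as deltas) into the global model.
    new_global_weights = global_weights + aggregate(updates, qualifies)
    return new_global_weights, payments
```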
How We Evaluated
We ran experiments on two standard benchmarks under conditions designed to stress-test the mechanism:
Datasets: MNIST (handwritten digit classification; 10 classes) and Fashion-MNIST (clothing item classification; 10 classes).
Data partitions: Non-IID splits using Dirichlet allocation, simulating the heterogeneous data distributions typical in real FL deployments. Non-IID data is important because it makes honest updates naturally diverse, which makes distinguishing them from poisoned updates harder.
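For reference, a common way to produce such splits is to draw per-client class proportions from a Dirichlet distribution; this is a sketch, and the concentration parameter and exact procedure may differ from the paper's.

```python
import numpy as np

def dirichlet_partition(labels: np.ndarray, n_clients: int, alpha: float = 0.5,
                        seed: int = 0) -> list[np.ndarray]:
    """Split sample indices across clients with Dirichlet(alpha) class proportions.

    Smaller alpha => more skewed (more non-IID) per-client label distributions.
    """
    rng = np.random.default_rng(seed)
    client_indices = [[] for _ in range(n_clients)]
    for cls in np.unique(labels):
        idx = np.flatnonzero(labels == cls)
        rng.shuffle(idx)
        # Fraction of this class assigned to each client.
        proportions = rng.dirichlet(alpha * np.ones(n_clients))
        cuts = (np.cumsum(proportions)[:-1] * len(idx)).astype(int)
        for client, part in enumerate(np.split(idx, cuts)):
            client_indices[client].extend(part.tolist())
    return [np.array(ix) for ix in client_indices]
```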
Malicious fractions: We tested with up to 50% of clients submitting poisoned updates - a regime in which the honest-majority assumption that many defenses rely on no longer holds.
Attack types: Label-flipping attacks and gradient-scaling attacks, representing both data poisoning and model poisoning strategies.
Baselines: Standard FedAvg (no defense), Krum, Trimmed Mean, and Median aggregation.
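For concreteness, minimal sketches of the two attack families as they are commonly implemented; the paper's exact attack parameters (flipping rule, scaling factor) may differ.

```python
import numpy as np

def flip_labels(labels: np.ndarray, n_classes: int = 10) -> np.ndarray:
    """Data poisoning: relabel each class c as (n_classes - 1 - c) before local training."""
    return (n_classes - 1) - labels

def scale_update(update: np.ndarray, factor: float = -10.0) -> np.ndarray:
    """Model poisoning: flip the sign and inflate the magnitude of an otherwise honest update."""
    return factor * update
```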
Key Results
In the paper’s label-flipping attack setting (non-IID partitions), the mechanism remains stable even as the malicious fraction increases from 30% to 50%.
| Dataset | Method | Accuracy @ 30% malicious | Accuracy @ 50% malicious | Change |
|---|---|---|---|---|
| MNIST | FedAvg | 95.27% | 43.52% | −51.75 pts |
| MNIST | Krum | 85.31% | 81.56% | −3.75 pts |
| MNIST | Bayesian incentive mechanism | 96.96% | 96.72% | −0.24 pts |
| Fashion-MNIST | FedAvg | 80.74% | 35.44% | −45.30 pts |
| Fashion-MNIST | Krum | 73.68% | 0.33% | −73.35 pts |
| Fashion-MNIST | Bayesian incentive mechanism | 81.89% | 80.67% | −1.22 pts |
Discussion
Most FL defenses focus on detecting or filtering bad updates at aggregation time. This work explores an alternative angle: using payments tied to update quality so that, under the model assumptions, honest behavior is incentive-compatible.
Key implications:
- Overhead: The added per-round cost is low compared to some robust aggregation rules
- Compatibility: Incentives can complement other defenses rather than replacing them
- Assumptions matter: The guarantees target rational participants; fully adversarial clients outside the incentive model remain out of scope
Limitations and Next Steps
Current Limitations:
- The mechanism assumes rational attackers who respond to incentives. Fully adversarial agents (e.g., state-sponsored) who don’t care about payments are outside the model
- The quality scoring function requires some form of validation signal at the server, even if it’s weaker than a full clean dataset
- Evaluation is on standard benchmarks; validation on production FL systems with realistic scale and churn is still needed
Future Work:
- Extend to settings with collusion, where groups of malicious clients coordinate their poisoning strategy
- Combine with differential privacy to handle both rational and irrational adversaries
- Evaluate on larger-scale tasks (language models, medical imaging) and real federated deployments
- Explore dynamic payment schedules that adapt as the server learns more about client distributions over time