ZKP-FedEval: Verifiable, Privacy-Preserving Federated Evaluation

Zero-Knowledge Proofs · Federated Learning · Privacy

The 20-Second Summary

Federated learning keeps raw data local during training, but evaluation metrics (loss/accuracy) can still leak sensitive information about a client’s dataset. ZKP-FedEval replaces raw metric reporting with a zero-knowledge proof of a predicate (e.g., “my loss is below threshold $\tau$”), so the server can make gating decisions without learning the actual metric value.


The Problem

Federated learning has a well-known privacy story for training: data stays local, only model updates are shared, and techniques like secure aggregation and differential privacy further protect those updates. But training is only half the picture. At some point, the system needs to evaluate the model - and that’s where the privacy story breaks down.

In standard FL evaluation, each client computes metrics on its local test set and reports them to the server. These metrics seem innocuous: a loss value, an accuracy number, maybe per-class precision. But they are computed on private data, and they leak information:

  • Loss values can reveal data difficulty: unusually low loss may indicate easy or common examples, while high loss can indicate rare or sensitive outliers.
  • Per-class metrics can expose label distributions, potentially leaking diagnoses, behaviors, or demographics.
  • Metric trajectories over rounds can hint at data volume, data quality, and distribution shift.
  • Because metrics are data-derived functions, they can support membership inference when an attacker holds auxiliary knowledge.

This is not a theoretical concern. Research has demonstrated practical attacks that reconstruct training data characteristics from gradient updates alone. Evaluation metrics, which are direct functions of the data, are at least as revealing.

The FL community has spent years building privacy-preserving training. We need the same rigor for evaluation.


Why Existing Approaches Fall Short

Approach | Limitation
Plaintext metric reporting | Directly reveals per-client loss, accuracy, and label distributions
Aggregate-only reporting | Hides individual clients but requires a trusted aggregator
Differential privacy on metrics | Adds noise to reported values; noisy metrics are unreliable for decision-making
Secure aggregation of metrics | Protects individual values but still reveals the aggregate; doesn’t support threshold checks
Skip evaluation entirely | Not viable - FL needs evaluation for model selection, convergence, and quality assurance

The core issue: the server needs to know whether clients are performing well, but learning their exact performance leaks private information. We need a mechanism where clients can prove a property about their evaluation (e.g., “my loss is acceptable”) without revealing the evaluation itself.


Our Approach: ZKP-FedEval

ZKP-FedEval introduces zero-knowledge proofs into the FL evaluation pipeline. Instead of reporting raw metrics, each client generates a cryptographic proof that its evaluation result satisfies a predefined condition - without revealing the result.

The Core Idea

The server defines an evaluation policy, for example:

“Each client must demonstrate that its local test loss is below threshold τ.”

In standard FL, the client would report its loss $\ell_i$ and the server would check $\ell_i < \tau$. This reveals $\ell_i$.

In ZKP-FedEval, the client computes $\ell_i$ locally and then generates a zero-knowledge proof $\pi_i$ asserting:

\[\exists \; D_i, \ell_i \text{ such that } \ell_i = \text{Loss}(M, D_i) \text{ and } \ell_i < \tau\]

The server verifies $\pi_i$ and learns only: “yes, this client’s loss is below the threshold” or “no, it isn’t.” The server never sees the actual loss value.
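
One practical detail worth making concrete (an implementation assumption on our part, not something stated above): zk-SNARK circuits operate over finite-field elements, so a real circuit would typically compare a fixed-point encoding of the loss against the encoded threshold rather than floating-point values. A minimal Python sketch of that encoding and check:

```python
# Sketch only: how a float loss and threshold might be mapped to the integer
# domain a zk-SNARK circuit works in. The scale factor is an illustrative
# assumption, not a value from the paper.

SCALE = 10**6  # fixed-point scale: 6 decimal digits of precision

def to_fixed(x: float) -> int:
    """Encode a non-negative real value as a circuit-friendly integer."""
    return int(round(x * SCALE))

def threshold_predicate(loss: float, tau: float) -> bool:
    """The in-circuit comparison would enforce this over encoded integers."""
    return to_fixed(loss) < to_fixed(tau)

# Example: a client with loss 0.182 against a policy threshold tau = 0.25
assert threshold_predicate(0.182, 0.25)      # passes: a valid proof exists
assert not threshold_predicate(0.31, 0.25)   # fails: no valid proof exists
```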

Protocol Detail

1. Server broadcasts global model M and threshold τ
2. Each client evaluates M on local test data D_i → loss ℓ_i
3. Client encodes the evaluation as an arithmetic circuit:
 - Input (private): local data D_i, computed loss ℓ_i
 - Input (public): model M (or its hash), threshold τ
 - Constraint: ℓ_i = Loss(M, D_i) AND ℓ_i < τ
4. Client generates zk-SNARK proof π_i for the circuit
5. Client sends π_i (not ℓ_i) to the server
6. Server verifies π_i - learns pass/fail, nothing more
7. Server uses pass/fail signals for model selection and convergence decisions
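
The numbered protocol above reduces to a small amount of glue code. The sketch below shows only the message flow; `zk_prove` and `zk_verify` stand in for a real zk-SNARK backend (e.g., a Groth16-style toolchain), and every name and signature here is an illustrative assumption rather than the paper’s implementation.

```python
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class Proof:
    blob: bytes  # opaque zk-SNARK proof; verifying it never reveals the loss

# Placeholders for a backend-specific prover/verifier (hypothetical API).
def zk_prove(model_hash: bytes, tau_fixed: int, loss_fixed: int) -> Proof:
    raise NotImplementedError("backend-specific prover")

def zk_verify(model_hash: bytes, tau_fixed: int, proof: Proof) -> bool:
    raise NotImplementedError("backend-specific verifier")

def client_round(evaluate_loss: Callable[[], float],
                 model_hash: bytes, tau_fixed: int) -> Proof:
    """Steps 2-5: evaluate locally, prove the threshold predicate, send only the proof."""
    loss = evaluate_loss()                  # computed on private D_i, never shared
    loss_fixed = int(round(loss * 10**6))   # fixed-point encoding (see earlier sketch)
    return zk_prove(model_hash, tau_fixed, loss_fixed)

def server_round(model_hash: bytes, tau_fixed: int,
                 proofs: Dict[str, Proof]) -> Dict[str, bool]:
    """Steps 6-7: the server learns one boolean per client, nothing more."""
    return {cid: zk_verify(model_hash, tau_fixed, p) for cid, p in proofs.items()}
```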

Why Zero-Knowledge Proofs?

ZKPs are the natural cryptographic tool here because they provide exactly the right guarantee: the verifier is convinced that a statement is true without learning anything beyond the statement’s truth. Other privacy tools don’t fit as cleanly:

Differential privacy would add noise to the answer (making pass/fail unreliable at the margin), secure computation would add interaction rounds between parties, and homomorphic encryption would let the server compute on encrypted metrics, when the core goal is for the server never to learn the metric at all.

ZKPs give us verifiable binary decisions from private continuous data, which is exactly what evaluation needs.

Extending Beyond Loss

The framework generalizes beyond simple loss thresholds. The arithmetic circuit can encode richer evaluation policies:

  • “My accuracy is above 90%” - quality gatekeeping
  • “My per-class F1 variance is below δ” - fairness checks
  • “My evaluation set has at least N samples” - data sufficiency
  • “My loss decreased from last round” - convergence verification

Each policy becomes a circuit, and each circuit produces a proof. The server composes these proofs to make informed decisions without seeing any raw metrics.
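
To make “each policy becomes a circuit” concrete, here are plaintext versions of the example predicates; in ZKP-FedEval each would be compiled into an arithmetic circuit over private inputs so the server sees only the boolean outcome. The function names and signatures are illustrative assumptions, not the paper’s API.

```python
from statistics import pvariance

# Plaintext stand-ins for the evaluation policies listed above. A circuit
# version would enforce the same relations over fixed-point private inputs.

def accuracy_gate(correct: int, total: int, min_acc: float = 0.90) -> bool:
    return correct >= min_acc * total          # "accuracy above 90%"

def fairness_gate(per_class_f1: list[float], delta: float) -> bool:
    return pvariance(per_class_f1) < delta     # "per-class F1 variance below δ"

def sufficiency_gate(num_eval_samples: int, n_min: int) -> bool:
    return num_eval_samples >= n_min           # "at least N evaluation samples"

def convergence_gate(loss_now: float, loss_prev: float) -> bool:
    return loss_now < loss_prev                # "loss decreased from last round"
```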


How We Evaluated

We implemented ZKP-FedEval and tested it on two standard FL benchmark datasets:

Datasets:

We use MNIST (handwritten digit classification; 10 classes, 28×28 grayscale images) and HAR (Human Activity Recognition from smartphone sensors).

HAR is particularly relevant because it represents a natural federated setting: each user’s smartphone generates their own activity data, and privacy expectations are high. Reporting that “User X has low accuracy on activity class ‘Running’” could reveal health or lifestyle information.

FL Setup:

We run a standard FedAvg setup with multiple clients under non-IID splits over multiple rounds, performing evaluation each round via ZKP-FedEval.
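
The summary above does not pin down how the non-IID splits are generated, so the following is only one common recipe (a label-skew Dirichlet partition), offered as an illustrative assumption rather than the paper’s exact setup:

```python
import numpy as np

def dirichlet_partition(labels: np.ndarray, num_clients: int,
                        alpha: float = 0.5, seed: int = 0) -> list[np.ndarray]:
    """Label-skewed non-IID split: smaller alpha -> more heterogeneous clients.
    A standard FL benchmark recipe, not necessarily the one used in the paper."""
    rng = np.random.default_rng(seed)
    client_indices = [[] for _ in range(num_clients)]
    for cls in np.unique(labels):
        cls_idx = rng.permutation(np.where(labels == cls)[0])
        props = rng.dirichlet(alpha * np.ones(num_clients))      # class share per client
        cut_points = (np.cumsum(props)[:-1] * len(cls_idx)).astype(int)
        for cid, part in enumerate(np.split(cls_idx, cut_points)):
            client_indices[cid].extend(part.tolist())
    return [np.array(idx) for idx in client_indices]

# Usage: pass the dataset's label array; each returned array indexes one client's shard.
```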

Metrics:

We measure proof generation and verification time, proof size (communication overhead), end-to-end round time relative to plaintext evaluation, and correctness (pass/fail decisions match the plaintext threshold check).
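
For context, per-client proof timings and sizes like those reported below can be collected with a harness along these lines (illustrative only; `zk_prove`, `zk_verify`, and `Proof` are the hypothetical placeholders from the protocol sketch above):

```python
import time

def measure_client(model_hash: bytes, tau_fixed: int, loss_fixed: int) -> dict:
    t0 = time.perf_counter()
    proof = zk_prove(model_hash, tau_fixed, loss_fixed)   # placeholder prover
    prove_s = time.perf_counter() - t0

    t1 = time.perf_counter()
    passed = zk_verify(model_hash, tau_fixed, proof)      # placeholder verifier
    verify_s = time.perf_counter() - t1

    return {
        "prove_s": prove_s,                   # proof generation time
        "verify_s": verify_s,                 # verification time
        "proof_kib": len(proof.blob) / 1024,  # communication overhead
        "pass": passed,                       # should match the plaintext loss < tau check
    }
```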

Comparison: Standard FL evaluation (plaintext metric reporting) as the baseline, with ZKP-FedEval as the privacy-preserving alternative.


Key Results

ZKP-FedEval demonstrates that privacy-preserving evaluation can be practical for modest models/datasets:

  • Correctness follows from soundness: a verifying proof corresponds to a true threshold statement, so clients can’t “prove” low loss when loss is actually high (under standard cryptographic assumptions).
  • Proof generation: ~0.4–0.5 s per proof on MNIST (CNN circuit) and ~0.12–0.13 s on HAR (MLP circuit) in the reported setup.
  • Verification: ~0.31–0.32 s per proof on MNIST and ~0.10 s on HAR.
  • Proof size: ~0.79 KiB (MNIST) and ~0.26 KiB (HAR).
  • Decision quality: the preprint reports 100% agreement between pass/fail decisions and the plaintext threshold check across tested thresholds.
  • Privacy: the server learns only pass/fail bits, never raw metric values.


Discussion

ZKP-FedEval addresses a gap in FL privacy: evaluation often reports client metrics directly, even when training is privacy-preserving.

This matters because:

  • If evaluation metrics leak information, training-only privacy may be insufficient.
  • In regulated domains, even derived statistics can be sensitive; predicate proofs reduce what gets disclosed.
  • ZKPs add verifiability: clients can’t claim the predicate holds when it doesn’t.
  • Pass/fail signals support simple gating and selection without revealing raw metrics.
  • Evaluation predicates compose cleanly with privacy-preserving training (secure aggregation or DP).


Limitations and Next Steps

Current Limitations:

  • Circuit cost scales with model size: encoding the loss computation for large models becomes expensive, and transformer-scale proof generation is currently impractical.
  • The server must choose $\tau$ in advance, which may require domain knowledge and tuning.
  • Binary pass/fail is less informative than raw metrics for some optimization strategies.
  • MNIST and HAR are standard but modest benchmarks; larger-scale evaluations (medical imaging, NLP) are needed to assess scalability.

Future Work:

Future work is largely about scaling and expressiveness: more efficient circuits for common losses, richer predicates (fairness, calibration, distributional constraints), and system-level evaluation on larger deployments. Defense in depth (combining predicate proofs with DP against inference from repeated pass/fail signals) and recursive proof composition for batching are also natural directions.


Reference