How to Build a Privacy-Preserving Federated Fraud Detection System with Lightweight PyTorch (and OpenAI-Assisted Reporting)

Learn how to simulate federated fraud detection across 10 banks using lightweight PyTorch FedAvg, plus OpenAI-assisted risk reporting—no raw data shared.

Fraud detection models often improve with more data, but financial institutions can’t simply pool sensitive transaction records. A practical alternative is federated learning: each bank trains locally, shares only model updates, and benefits from a stronger global model—without exporting raw customer data. This article walks through a clean, CPU-friendly simulation of that setup using lightweight PyTorch building blocks, plus an optional step that turns the final metrics into an internal, decision-oriented risk report with OpenAI.

What this implementation is trying to achieve

The goal is to demonstrate a privacy-preserving fraud detection workflow that is:

  • Federated: multiple parties (“banks”) train locally and only exchange model weights/updates.
  • Lightweight: no heavyweight federated frameworks; a simple coordination loop is enough for experimentation.
  • Realistic: fraud is rare and client datasets are heterogeneous (non-IID), reflecting how fraud patterns vary by institution.
  • Actionable: after training, results are summarized into a concise fraud-risk report.

The simulation uses ten clients (ten independent banks), a highly imbalanced synthetic dataset (fraud is a small minority class), and a FedAvg aggregation loop for 10 rounds of training.

Environment setup: reproducible, CPU-friendly execution

The notebook-style implementation installs the essentials—torch, scikit-learn, numpy, and openai—and then fixes random seeds to keep results deterministic and repeatable. It also explicitly selects a CPU device to ensure the simulation runs in common environments without special hardware.

From a practical standpoint, this emphasis on determinism matters when you’re experimenting with federated learning behavior. Small changes in initialization or partitioning can lead to noticeably different convergence patterns, especially with skewed class distributions.
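
A minimal sketch of that setup is shown below; the exact seed value and variable names are assumptions, not necessarily those used in the original notebook.

    # Minimal, CPU-only setup sketch: pin every relevant RNG so repeated runs
    # of the federated simulation produce the same partitions and weights.
    import random
    import numpy as np
    import torch

    SEED = 42  # illustrative seed; the notebook's exact value may differ
    random.seed(SEED)
    np.random.seed(SEED)
    torch.manual_seed(SEED)

    DEVICE = torch.device("cpu")  # explicit CPU device so no GPU is required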

Generating a credit-card-like imbalanced fraud dataset

To mimic the class imbalance typical in fraud detection, the workflow uses make_classification to generate a dataset with:

  • n_samples=60000
  • n_features=30
  • n_informative=18
  • n_redundant=8
  • weights=[0.985, 0.015] (fraud is ~1.5%)
  • class_sep=1.5
  • flip_y=0.01
  • random_state=SEED

The data is then split into train and test sets using train_test_split with the following settings (a code sketch covering both the generation and the split follows the list):

  • test_size=0.2
  • stratify=y (preserves the fraud/non-fraud ratio)
  • random_state=SEED
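
Putting those parameters together, the two steps look roughly like this (variable names are illustrative):

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split

    # ~1.5% fraud rate across 60,000 samples and 30 features.
    X, y = make_classification(
        n_samples=60_000,
        n_features=30,
        n_informative=18,
        n_redundant=8,
        weights=[0.985, 0.015],
        class_sep=1.5,
        flip_y=0.01,
        random_state=SEED,
    )

    # Stratified 80/20 split keeps the fraud ratio identical in train and test.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=SEED
    )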

Why there’s a “server-side” scaler here

A StandardScaler is fit on the full training data to produce standardized features for the global test set evaluation. This standardized test loader (batch size 1024, shuffle=False) provides a consistent benchmark to track how the global model changes after each federated round.

Even though the learning is federated, the evaluation step in this simulation is centralized for convenience: the global model is assessed on a single fixed test set after each aggregation round.
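
A sketch of that server-side preprocessing and test loader, assuming the X_train/X_test arrays from the previous step:

    import torch
    from sklearn.preprocessing import StandardScaler
    from torch.utils.data import DataLoader, TensorDataset

    # Server-side scaler fit on the full training split; used only for the
    # centralized evaluation of the global model, never by the clients.
    server_scaler = StandardScaler().fit(X_train)
    X_test_std = server_scaler.transform(X_test)

    test_ds = TensorDataset(
        torch.tensor(X_test_std, dtype=torch.float32),
        torch.tensor(y_test, dtype=torch.float32),
    )
    test_loader = DataLoader(test_ds, batch_size=1024, shuffle=False)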

Simulating non-IID client data with Dirichlet partitioning

A key challenge in federated learning is that client data is rarely independent and identically distributed. Different banks have different customer bases, transaction mixes, and fraud exposure. To model this heterogeneity, the training data is partitioned across clients using a Dirichlet-based splitter:

  • NUM_CLIENTS = 10
  • alpha = 0.35 in the Dirichlet distribution

Dirichlet partitioning can create uneven class distributions and varying client dataset sizes—precisely the kind of skew that makes federated optimization harder than centralized training.
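
One way to implement such a splitter is sketched below. The original notebook's partitioning code may differ in detail, but the idea is the same: for each class, draw per-client proportions from a Dirichlet(alpha) distribution and slice the class indices accordingly.

    import numpy as np

    def dirichlet_partition(y, num_clients=10, alpha=0.35, seed=SEED):
        """Split indices across clients with a per-class Dirichlet prior,
        producing skewed label mixes and uneven client sizes (non-IID)."""
        rng = np.random.default_rng(seed)
        client_indices = [[] for _ in range(num_clients)]
        for cls in np.unique(y):
            cls_idx = np.where(y == cls)[0]
            rng.shuffle(cls_idx)
            # Proportion of this class assigned to each client.
            proportions = rng.dirichlet(alpha * np.ones(num_clients))
            cuts = (np.cumsum(proportions) * len(cls_idx)).astype(int)[:-1]
            for client_id, chunk in enumerate(np.split(cls_idx, cuts)):
                client_indices[client_id].extend(chunk.tolist())
        return [np.array(idx) for idx in client_indices]

    client_splits = dirichlet_partition(y_train, num_clients=10, alpha=0.35)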

Client train/validation splits and a safety check for missing classes

Each client dataset is split into local train and validation partitions with:

  • test_size=0.15
  • stratify=yi
  • random_state=SEED

The implementation includes an important guardrail: if a client ends up with only one class (e.g., no fraud examples), it adds a small number of samples from the opposite class (up to 10) to ensure the local training/evaluation is feasible.
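
A sketch of the per-client split with that guardrail, assuming the borrowed samples come from the global training pool (the original notebook may source them differently):

    import numpy as np
    from sklearn.model_selection import train_test_split

    def make_client_split(Xi, yi, X_pool, y_pool, seed=SEED):
        # Guardrail: if a client drew only one class, borrow up to 10 samples
        # of the missing class so stratified splitting and local evaluation
        # remain feasible.
        if len(np.unique(yi)) < 2:
            missing = 1 - int(yi[0])
            extra = np.where(y_pool == missing)[0][:10]
            Xi = np.vstack([Xi, X_pool[extra]])
            yi = np.concatenate([yi, y_pool[extra]])
        return train_test_split(
            Xi, yi, test_size=0.15, stratify=yi, random_state=seed
        )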

Local feature scaling per client

Each client uses its own StandardScaler, fit on its local training split and applied to its validation split. This better reflects real-world decentralization: banks don’t share feature statistics, and preprocessing can differ subtly across institutions.

Client data loaders are created with the following settings (see the sketch after the list):

  • batch_size=512 for training (shuffled)
  • batch_size=512 for validation
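
A per-client preprocessing sketch along those lines (the helper name and structure are assumptions):

    import torch
    from sklearn.preprocessing import StandardScaler
    from torch.utils.data import DataLoader, TensorDataset

    def make_client_loaders(X_tr, X_val, y_tr, y_val, batch_size=512):
        # Each client fits its own scaler on local training data only;
        # feature statistics are never shared with the server or other banks.
        scaler = StandardScaler().fit(X_tr)
        X_tr, X_val = scaler.transform(X_tr), scaler.transform(X_val)

        def to_ds(X, y):
            return TensorDataset(
                torch.tensor(X, dtype=torch.float32),
                torch.tensor(y, dtype=torch.float32),
            )

        train_loader = DataLoader(to_ds(X_tr, y_tr), batch_size=batch_size, shuffle=True)
        val_loader = DataLoader(to_ds(X_val, y_val), batch_size=batch_size, shuffle=False)
        return train_loader, val_loader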

The fraud model: a compact neural network for tabular data

The fraud detector, FraudNet, is a small feedforward network designed for tabular features. It has:

  • A linear layer from in_dim to 64, ReLU, and Dropout(0.1)
  • A linear layer from 64 to 32, ReLU, and Dropout(0.1)
  • A final linear layer from 32 to 1 logit

The model outputs a single logit (later passed through a sigmoid for probabilities during evaluation), which is standard for binary classification with BCEWithLogitsLoss.
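
A PyTorch sketch of that architecture:

    import torch.nn as nn

    class FraudNet(nn.Module):
        """Compact MLP for tabular fraud features; outputs a single logit."""
        def __init__(self, in_dim: int):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(in_dim, 64), nn.ReLU(), nn.Dropout(0.1),
                nn.Linear(64, 32), nn.ReLU(), nn.Dropout(0.1),
                nn.Linear(32, 1),
            )

        def forward(self, x):
            # Raw logit; pair with BCEWithLogitsLoss during training and apply
            # torch.sigmoid only when probabilities are needed for metrics.
            return self.net(x).squeeze(-1)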

Weight exchange utilities

Because federated learning here is implemented “from scratch,” the code includes helpers to:

  • Extract weights from a model’s state_dict into NumPy arrays.
  • Set weights on a new model instance from those arrays.

This makes it straightforward to simulate sending model parameters from the server to clients and sending updated weights back to the server—without needing specialized infrastructure.
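
A minimal version of those helpers might look like this (function names are assumptions):

    import torch

    def get_weights(model):
        # Copy parameters out of the state_dict as NumPy arrays ("upload").
        return [v.detach().cpu().numpy() for v in model.state_dict().values()]

    def set_weights(model, weights):
        # Load NumPy arrays back into a model instance ("download").
        state = {
            k: torch.tensor(w)
            for k, w in zip(model.state_dict().keys(), weights)
        }
        model.load_state_dict(state, strict=True)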

Evaluation metrics: beyond accuracy

Fraud detection is highly imbalanced, so accuracy can be misleading. The evaluation routine reports:

  • loss (mean BCEWithLogitsLoss)
  • auc via roc_auc_score
  • ap via average_precision_score
  • acc via accuracy_score with a 0.5 threshold

In practice, AUC and especially Average Precision (AP) tend to be more informative than raw accuracy for rare-event detection, because they focus on ranking quality and precision-recall tradeoffs.
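
An evaluation routine along these lines, assuming the model emits one logit per sample:

    import numpy as np
    import torch
    from sklearn.metrics import accuracy_score, average_precision_score, roc_auc_score

    @torch.no_grad()
    def evaluate(model, loader, device=torch.device("cpu")):
        model.eval()
        criterion = torch.nn.BCEWithLogitsLoss()
        losses, probs, labels = [], [], []
        for xb, yb in loader:
            xb, yb = xb.to(device), yb.to(device)
            logits = model(xb)
            losses.append(criterion(logits, yb).item())
            probs.append(torch.sigmoid(logits).cpu().numpy())
            labels.append(yb.cpu().numpy())
        p, y = np.concatenate(probs), np.concatenate(labels)
        return {
            "loss": float(np.mean(losses)),
            "auc": float(roc_auc_score(y, p)),
            "ap": float(average_precision_score(y, p)),
            "acc": float(accuracy_score(y, (p >= 0.5).astype(int))),
        }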

Federated training with FedAvg: 10 rounds of coordination

The coordination layer uses a classic federated averaging approach (sketched after the list):

  • ROUNDS = 10
  • LR = 5e-4
  • Each round:
    • Create a local model per client and initialize it with the current global weights.
    • Train locally using Adam and BCEWithLogitsLoss.
    • Collect updated client weights and the number of local samples.
    • Aggregate using fedavg, weighting each client by its dataset size.
    • Evaluate the new global model on the fixed test loader and print metrics.
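
A condensed sketch of that round loop, reusing the helpers sketched earlier (get_weights, set_weights, evaluate) plus the fedavg function shown in the next subsection; client_loaders is assumed to be the list of per-client (train, validation) loader pairs, and the single local epoch per round is an assumption.

    import copy
    import torch

    ROUNDS, LR = 10, 5e-4
    global_model = FraudNet(in_dim=30)

    for rnd in range(1, ROUNDS + 1):
        client_weights, client_sizes = [], []
        for train_loader, _ in client_loaders:          # one entry per bank
            local_model = copy.deepcopy(global_model)   # start from global weights
            opt = torch.optim.Adam(local_model.parameters(), lr=LR)
            criterion = torch.nn.BCEWithLogitsLoss()
            local_model.train()
            for xb, yb in train_loader:                 # one local epoch (assumption)
                opt.zero_grad()
                loss = criterion(local_model(xb), yb)
                loss.backward()
                opt.step()
            client_weights.append(get_weights(local_model))
            client_sizes.append(len(train_loader.dataset))
        # Size-weighted aggregation, then centralized evaluation.
        set_weights(global_model, fedavg(client_weights, client_sizes))
        print(rnd, evaluate(global_model, test_loader))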

Why size-weighted averaging matters

FedAvg combines client updates proportionally to how much data each client trained on. In heterogeneous settings (especially with Dirichlet partitions), this helps prevent tiny clients from disproportionately steering the global model, while still allowing them to contribute.
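
A size-weighted averaging helper consistent with that description, operating on the per-layer NumPy arrays produced by get_weights:

    def fedavg(client_weights, client_sizes):
        """Weighted average of client parameter lists, proportional to the
        number of local training samples each client used."""
        total = float(sum(client_sizes))
        coeffs = [n / total for n in client_sizes]
        return [
            sum(c * w[layer] for c, w in zip(coeffs, client_weights))
            for layer in range(len(client_weights[0]))
        ]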

At the same time, weighting by size can raise fairness questions in real deployments: larger institutions may dominate the global model. This implementation keeps the logic explicit so you can experiment with alternative weighting strategies if desired.

Turning model metrics into a risk-team report with OpenAI

After federated training completes, the workflow optionally generates an internal-facing fraud-risk summary using the OpenAI API. The API key is requested securely via hidden keyboard input (getpass) and placed into the OPENAI_API_KEY environment variable.

What gets summarized

A compact summary object is constructed containing:

  • rounds (the number of federated rounds)
  • num_clients
  • final_metrics (the last printed evaluation metrics)
  • client_sizes (training dataset sizes for each client)
  • client_fraud_rates (fraud rate per client, derived from the client split)

This is then embedded into a prompt asking the model to: “Write a concise internal fraud-risk report. Include executive summary, metric interpretation, risks, and next steps.” The request uses client.responses.create with model="gpt-5.2" and prints the generated output_text.
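
A sketch of that reporting step, assuming final_metrics, client_sizes, and client_fraud_rates were collected during training (those variable names are illustrative); the model name mirrors the original tutorial.

    import json
    import os
    from getpass import getpass
    from openai import OpenAI

    os.environ["OPENAI_API_KEY"] = getpass("OpenAI API key: ")
    client = OpenAI()

    summary = {
        "rounds": ROUNDS,
        "num_clients": 10,
        "final_metrics": final_metrics,        # last round's evaluation dict
        "client_sizes": client_sizes,
        "client_fraud_rates": client_fraud_rates,
    }

    response = client.responses.create(
        model="gpt-5.2",  # model name as used in the original tutorial
        input=(
            "Write a concise internal fraud-risk report. Include executive "
            "summary, metric interpretation, risks, and next steps.\n\n"
            + json.dumps(summary, indent=2, default=float)
        ),
    )
    print(response.output_text)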

Why this step is useful (even in a simulation)

Federated learning experiments often end with raw metrics that are meaningful to ML practitioners but less digestible for compliance teams, risk leaders, or fraud operations. The reporting step demonstrates how to convert technical outputs (AUC/AP, heterogeneous client sizes, varying fraud rates) into narrative guidance—without changing the underlying privacy premise (no raw transaction exports).

Implementation notes and practical takeaways

  • Privacy boundary: In this setup, clients never transmit raw samples—only model weights are aggregated. That said, the code is a simulation; production-grade privacy typically requires additional protections and threat modeling.
  • Non-IID effects are the point: Using a Dirichlet partition with alpha=0.35 surfaces convergence challenges that are easy to miss with uniform splits.
  • Evaluate with the right metrics: Including AP alongside AUC and accuracy helps keep focus on imbalanced classification realities.
  • Lightweight doesn’t mean simplistic: Even with a compact model and a minimal FedAvg loop, you can study realistic behaviors such as client skew, aggregation effects, and stability across rounds.

Conclusion

This lightweight PyTorch simulation shows how a federated fraud detection pipeline can be built from first principles: ten banks train local models on highly imbalanced data, a FedAvg loop aggregates updates over 10 rounds, and the final results can be summarized into a decision-ready internal report using OpenAI. It’s a practical blueprint for experimenting with privacy-aware collaboration without requiring complex federated infrastructure.

Attribution: This article is based on reporting originally published by www.marktechpost.com. The original tutorial provides the notebook and full implementation details via its embedded “Full Codes here” link.
