
Task 1: Synthetic Adversarial Observation Perturbation Dataset Generation

Project: corpora-task-modelling-1778795810213-620a9917  •  Generated: 2026-05-14 22:57

Generate realistic synthetic datasets of sensor observations with controlled adversarial perturbations using simulation and GAN-based augmentation to evaluate detection and inference pipelines before hardware deployment.

Monte Carlo  •  Simulation  •  GAN  •  Feasibility

Source in Roadmap / Ideate: Chapter 1 – AOI-GBE Foundations & Data Collection
Why model first: Provides a diverse, controllable set of perturbations to benchmark detection and inference algorithms, reducing costly field data collection and enabling early validation of robustness metrics.

What Is Modelled

Multi‑modal sensor observation streams (camera, LiDAR, IMU, radar) from autonomous agents, together with a parametric adversarial perturbation model that injects realistic noise, spoofing, and semantic manipulation into the raw sensor data.
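As a concrete illustration of the parametric perturbation model, the sketch below (hypothetical function and parameter names; assumes NumPy and a single normalized sensor channel) applies additive Gaussian noise and probabilistic frame spoofing:

```python
import numpy as np

def perturb_frame(frame, magnitude=0.1, spoof_prob=0.0, rng=None):
    """Inject a controlled perturbation into one sensor frame.

    magnitude  -- std. dev. of additive Gaussian noise (normalized units)
    spoof_prob -- probability the frame is replaced with a fabricated signal
    """
    rng = rng or np.random.default_rng()
    if rng.random() < spoof_prob:
        # Spoofing: replace the frame with a fabricated signal drawn
        # from the empirical range of the clean data.
        return rng.uniform(frame.min(), frame.max(), size=frame.shape)
    # Additive Gaussian noise scaled to the normalized sensor range.
    return frame + rng.normal(0.0, magnitude, size=frame.shape)
```

Semantic attacks (label-swap, object-removal, path-spoof) would follow the same pattern but operate on annotations or point clouds rather than raw values.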

Objectives

Success Criteria

Output Form

A versioned dataset in Parquet/CSV with accompanying metadata (sensor specs, perturbation parameters, timestamps), trained CC‑GAN checkpoints, a Monte Carlo configuration file, and a Python package containing evaluation utilities and a reproducible Docker image.

Key Parameters & What They Affect

| Parameter | Range / Units | Affects | Notes |
| --- | --- | --- | --- |
| perturbation_magnitude | 0.0–0.5 (normalized sensor value) | quality, reliability | Controls the amplitude of additive Gaussian or Poisson noise; higher values increase attack severity. |
| spoofing_probability | 0.0–1.0 | security, cost | Probability that a given sensor frame is replaced with a fabricated signal. |
| semantic_attack_type | categorical (label‑swap, object‑removal, path‑spoof) | quality, interpretability | Defines the high‑level manipulation applied to image or LiDAR point clouds. |
| latent_dim | 128–512 | speed, quality | Dimensionality of the CC‑GAN latent vector; larger values improve fidelity but increase training time. |
| simulation_fps | 30–120 fps | speed, cost | Frame rate of the physics‑based simulator; higher FPS yields more realistic dynamics but requires more compute. |
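One way to carry the parameter table into code is a small config object with range checks; the names, defaults, and bounds below mirror the table but are illustrative, not part of the delivered package:

```python
from dataclasses import dataclass

# Hypothetical bounds mirroring the parameter table above.
RANGES = {
    "perturbation_magnitude": (0.0, 0.5),
    "spoofing_probability": (0.0, 1.0),
    "latent_dim": (128, 512),
    "simulation_fps": (30, 120),
}
SEMANTIC_ATTACKS = {"label-swap", "object-removal", "path-spoof"}

@dataclass
class PerturbationConfig:
    perturbation_magnitude: float = 0.1
    spoofing_probability: float = 0.05
    semantic_attack_type: str = "label-swap"
    latent_dim: int = 256
    simulation_fps: int = 60

    def validate(self):
        # Reject any value outside the documented range.
        for name, (lo, hi) in RANGES.items():
            value = getattr(self, name)
            if not lo <= value <= hi:
                raise ValueError(f"{name}={value} outside [{lo}, {hi}]")
        if self.semantic_attack_type not in SEMANTIC_ATTACKS:
            raise ValueError(f"unknown attack: {self.semantic_attack_type}")
        return self
```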

Input Data

Required data:

Natural Sources (from the project)

Acquired Sources

  • KITTI, nuScenes, and Waymo Open Dataset for baseline clean imagery and point clouds.
  • OpenStreetMap and OpenSceneGraph for realistic urban terrain.
  • Publicly available adversarial benchmark datasets (e.g., D‑REX, XSTest) for semantic attack templates.

Synthesised Sources

  • Physics‑based simulation in AirSim or CARLA to generate clean and perturbed sensor streams.
  • GAN‑generated perturbations seeded by a DOE‑style Latin‑Hypercube of attack parameters.
  • Synthetic point‑cloud augmentations using Open3D and PyTorch3D.
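The DOE-style Latin-Hypercube seeding mentioned above can be sketched in plain NumPy: each parameter's range is divided into n equal strata and exactly one sample lands in each stratum per dimension. The bounds here are illustrative:

```python
import numpy as np

def latin_hypercube(n, bounds, rng=None):
    """Draw n samples over len(bounds) dimensions, one per stratum
    in each dimension (DOE-style Latin Hypercube)."""
    rng = rng or np.random.default_rng(0)
    d = len(bounds)
    # Row i starts in stratum i of [0, 1) for every dimension.
    u = (rng.random((n, d)) + np.arange(n)[:, None]) / n
    # Decorrelate dimensions by permuting each column independently.
    for j in range(d):
        u[:, j] = rng.permutation(u[:, j])
    lo = np.array([b[0] for b in bounds])
    hi = np.array([b[1] for b in bounds])
    return lo + u * (hi - lo)
```

`scipy.stats.qmc.LatinHypercube` offers the same design with scrambling options, if SciPy is already a dependency.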

Engineer / Scientist Guidance

  1. Set up a reproducible AirSim/CARLA environment with the target vehicle and sensor suite; export clean telemetry logs.
  2. Define a perturbation taxonomy (noise, spoofing, semantic) and encode each as a parameterized function.
  3. Implement a Monte Carlo engine in Python that samples perturbation parameters across the defined ranges and writes a CSV of attack scenarios.
  4. Pre‑process clean logs to create paired datasets (clean, perturbed) for GAN training; normalize sensor data and align timestamps.
  5. Configure a CC‑GAN architecture in PyTorch: a conditional generator with a GRU encoder for temporal context and a multi‑head discriminator for each modality.
  6. Use Optuna or Ray Tune to perform hyper‑heuristic search over GAN hyperparameters (latent_dim, learning rate, batch size) and perturbation generation strategies.
  7. Train the GAN on a GPU cluster; monitor reconstruction loss and FID; save checkpoints every 10 k iterations.
  8. Generate synthetic perturbations by sampling the trained generator with random latent vectors conditioned on clean observations.
  9. Validate synthetic data by comparing statistical distributions (mean, variance, KL‑divergence) to real logs; perform detection model benchmarking.
  10. Package the dataset, GAN checkpoints, and evaluation scripts into a Docker image; publish to a private registry for downstream teams.
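Step 3's Monte Carlo engine might look like the following minimal sketch; the column names, parameter ranges, and file layout are assumptions for illustration, not the delivered spec:

```python
import csv
import random

# Hypothetical uniform sampling ranges for continuous parameters.
PARAM_RANGES = {
    "perturbation_magnitude": (0.0, 0.5),
    "spoofing_probability": (0.0, 1.0),
}
ATTACK_TYPES = ["label-swap", "object-removal", "path-spoof"]

def write_attack_scenarios(path, n_scenarios, seed=0):
    """Sample perturbation parameters uniformly and write one CSV row
    per Monte Carlo attack scenario."""
    rng = random.Random(seed)
    fields = ["scenario_id", *PARAM_RANGES, "semantic_attack_type"]
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fields)
        writer.writeheader()
        for i in range(n_scenarios):
            row = {name: round(rng.uniform(lo, hi), 4)
                   for name, (lo, hi) in PARAM_RANGES.items()}
            row["scenario_id"] = i
            row["semantic_attack_type"] = rng.choice(ATTACK_TYPES)
            writer.writerow(row)
    return path
```

Fixing the seed keeps the scenario set reproducible across reruns, which matters when the same CSV drives both simulation and GAN conditioning.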

Recommended Tools

  • AirSim (C++/Python) or CARLA (Python) for simulation
  • Gazebo + ROS 2 as an alternative physics engine
  • PyTorch 2.0 for GAN implementation
  • TensorFlow 2.12 for alternative training
  • Optuna 3.x or Ray Tune 2.x for hyper‑heuristic orchestration
  • Weights & Biases for experiment tracking
  • NumPy, Pandas, OpenCV, Open3D for data processing
  • Docker and the NVIDIA Container Toolkit for reproducibility
  • Apache Parquet for dataset storage
  • Scikit‑learn for evaluation metrics (F1, KL‑divergence, FID)
  • pytest for unit tests

Validation & Verification

The dataset will be validated by (1) statistical comparison of clean vs. perturbed distributions against real logs (KL‑div < 0.1), (2) reconstruction error of the CC‑GAN on a held‑out test set (MAE < 5 %), (3) detection pipeline performance on synthetic attacks (F1 ≥ 90 %) and on a separate real‑world test set (F1 ≥ 80 %), and (4) inference latency measured on target edge hardware (≤ 50 ms).
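A histogram-based KL-divergence check of the kind used in criterion (1) could be sketched as follows; the bin count and epsilon smoothing are illustrative choices, not mandated by the task:

```python
import numpy as np

def kl_divergence(p_samples, q_samples, bins=50):
    """Histogram estimate of KL(P || Q) between two 1-D sample sets.

    Both histograms share one range so the bins align; a small epsilon
    keeps empty bins from producing log(0).
    """
    lo = min(p_samples.min(), q_samples.min())
    hi = max(p_samples.max(), q_samples.max())
    p, _ = np.histogram(p_samples, bins=bins, range=(lo, hi))
    q, _ = np.histogram(q_samples, bins=bins, range=(lo, hi))
    eps = 1e-9
    p = p / p.sum() + eps
    q = q / q.sum() + eps
    return float(np.sum(p * np.log(p / q)))
```

Running this per sensor channel on real vs. synthetic logs gives the "KL-div < 0.1" gate directly.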

Expected Impact

Quality

Provides high‑fidelity, attack‑rich data that improves detection and inference robustness by exposing models to a broader spectrum of perturbations.

Timescale

Reduces field data collection from 6–12 months to 2–3 months by leveraging simulation and GAN augmentation.

Cost

Cuts hardware procurement and test‑bed maintenance costs by ~40 % through virtual experimentation.

Risk Retired

Mitigates deployment risk by enabling early validation of detection pipelines and policy resilience against unseen adversarial scenarios.

Software Tool Development Prompts

Drop these into a coding assistant to scaffold the supporting software for this modelling task.

Create a Python script that sets up an Optuna study to tune a conditional GAN for multimodal sensor data. The study should explore latent_dim ∈ [128, 512], learning_rate ∈ [1e-4, 1e-3], batch_size ∈ [32, 128], and generator depth ∈ [2, 4]. For each trial, train for 10 k steps on a GPU, evaluate reconstruction MAE on a validation set, and report the best hyperparameters. Include code to log metrics to Weights & Biases and to save the best model checkpoint.
Write a Dockerfile that installs AirSim, ROS2, PyTorch, and all dependencies, copies a pre‑trained CC‑GAN checkpoint, and exposes a REST API endpoint that accepts a clean observation JSON, applies the GAN to generate a perturbed observation, and returns the result. The container should run under NVIDIA runtime and expose port 8000.

Risks & Assumptions