Counterfactual Explanation Robustness to Adversarial Noise
TITLE OF THE INVENTION
Frontier Counterfactual Explanation Architecture for Adversarial Robustness Across Multimodal Domains
FIELD OF THE INVENTION
The present invention relates to machine‑learning explainability, and more particularly to the generation of counterfactual explanations that remain faithful, actionable, and interpretable when subjected to adversarial perturbations at both the input and model levels, including for multimodal data such as images, text, and graphs.
BACKGROUND AND PRIOR ART
Conventional counterfactual explanation (CE) methods rely on gradient‑based or search‑based perturbations that are highly sensitive to adversarial noise. A perturbation that flips a model’s prediction may be mere adversarial noise rather than an actionable change, yet conventional methods present it as a counterfactual, leading to misleading explanations and eroding user trust. Prior work on causally‑guided adversarial steering (CECAS) [1][2] demonstrates that steering perturbations along causal edges can preserve domain semantics, yet these methods still generate off‑manifold artifacts. Diffusion‑constrained manifold projection (ACE‑DMP) [3] mitigates high‑frequency noise but does not address multimodal consistency or model drift. Robust recourse optimization with Lp‑bounded model change (RO‑Lp) [4][5] bounds model updates but lacks a causal or manifold constraint. Validation evidence from attention‑guided adversarial attacks (SAGA) [v4266], diffusion‑based counterfactuals (VCE) [v12930], and multi‑modal recourse modules (MARM) [v9141] further highlights the need for a unified framework that integrates causal guidance, manifold fidelity, multimodal consistency, and model‑change robustness. The technical problem, therefore, is to provide a counterfactual explanation mechanism that simultaneously satisfies causal integrity, manifold fidelity, multimodal actionability, and resilience to model drift or poisoning.
SUMMARY OF THE INVENTION
The invention discloses a Frontier Counterfactual Architecture (FCA) that integrates four complementary innovations: causally‑guided adversarial steering (CECAS‑style), diffusion‑constrained manifold projection (ACE‑DMP), a multi‑modal adversarial recourse module (MARM), and a robust recourse optimizer with Lp‑bounded model change (RO‑Lp). FCA learns or accepts a causal graph, projects adversarial perturbations onto the data manifold via a denoising diffusion probabilistic model, generates multimodal counterfactuals that respect cross‑modal causal constraints, and optimizes for minimal action cost while bounding model change in the ℓp norm. The resulting counterfactual explanations are faithful, actionable, and robust to both input‑level adversarial noise and model‑level shifts, thereby enhancing user trust in adversarial environments.
DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS
Embodiment 1 – Causal Graph Acquisition
The FCA pipeline begins by acquiring a causal graph G=(V,E) that captures domain‑specific causal relationships among features. G may be learned from observational data using causal discovery algorithms such as Fast Causal Inference (FCI) or GAC, or supplied by domain experts. The graph is stored in a causal knowledge base and used to constrain subsequent perturbations. This step ensures that any perturbation respects causal consistency, preventing semantic violations that would otherwise produce misleading counterfactuals [1][2][v13179].
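The causal‑constraint step above can be illustrated with a minimal sketch. The four features, their edge weights, and the linear structural equations are hypothetical stand‑ins (not part of the invention's specification): a candidate perturbation is applied to a feature, and its effect is propagated to causal descendants so that the resulting counterfactual never edits a downstream feature independently of its parents.

```python
# Toy causal graph over four tabular features (illustrative names and
# weights, not learned from data): age -> income, education -> income,
# income -> savings. FEATURES is listed in topological order.
FEATURES = ["age", "education", "income", "savings"]
PARENTS = {"age": [], "education": [], "income": ["age", "education"], "savings": ["income"]}
WEIGHTS = {("age", "income"): 0.3, ("education", "income"): 0.5, ("income", "savings"): 0.8}

def propagate(x, delta):
    """Apply a candidate perturbation `delta` (feature -> additive change)
    and propagate its effect through the linear structural equations, so
    every downstream feature is updated causally rather than clamped."""
    x_new = {}
    for f in FEATURES:  # topological order guarantees parents are computed first
        effect = delta.get(f, 0.0)
        effect += sum(WEIGHTS[(p, f)] * (x_new[p] - x[p]) for p in PARENTS[f])
        x_new[f] = x[f] + effect
    return x_new
```

Under these toy weights, intervening on education by +2 raises income by 1.0 and savings by 0.8 in turn; a perturbation that edited savings while holding income fixed would be rejected as causally inconsistent.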
Embodiment 2 – Diffusion‑Constrained Manifold Projection (ACE‑DMP)
A denoising diffusion probabilistic model (DDPM) projects raw adversarial perturbations onto the data manifold. The diffusion process iteratively denoises a perturbed sample x′, yielding a counterfactual x* that lies on the manifold of realistic data points. A filtering function Fτ removes high‑frequency artifacts while preserving the semantic direction of the perturbation [3][v12930][v2830]. Fast samplers such as DDIM or DPM‑Solver can be used to reduce computational cost [v14059].
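The projection step can be sketched as follows, with deliberate simplifications: the "data manifold" is the unit circle in two dimensions, and `toy_denoiser` is a hand‑written stand‑in for a trained DDPM denoiser. The sketch only illustrates the alternating noise/denoise schedule that drifts an off‑manifold adversarial point onto the manifold while roughly preserving its semantic direction.

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_denoiser(x):
    """Stand-in for a trained DDPM denoiser: pulls a point radially
    toward the unit circle, our toy 'data manifold'."""
    r = np.linalg.norm(x)
    return x / r if r > 0 else x

def manifold_project(x_adv, steps=20, noise=0.05, step_size=0.3):
    """ACE-DMP-style projection (illustrative): alternate a small forward
    noising step with a reverse denoising step so the perturbed sample
    converges to the manifold without losing its direction."""
    x = np.asarray(x_adv, dtype=float)
    for _ in range(steps):
        x = x + noise * rng.normal(size=x.shape)    # forward (noising) step
        x = x + step_size * (toy_denoiser(x) - x)   # reverse (denoising) step
    return x

x_adv = np.array([1.8, 0.4])      # off-manifold adversarial point
x_star = manifold_project(x_adv)  # lands near the unit circle
```

In a real deployment the denoiser would be a trained diffusion model and the loop would follow a DDIM or DPM‑Solver schedule; the structure of the projection loop is the same.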
Embodiment 3 – Multi‑Modal Adversarial Recourse Module (MARM)
MARM extends CE to images, text, and graph data simultaneously. It generates adversarial examples that respect cross‑modal causal constraints by jointly optimizing over a shared latent space. Cross‑modal consistency losses and parameter‑efficient adversarial training (AdvPT, APT) are incorporated to harden embeddings against multimodal perturbations [v9141][v15921]. The module outputs coherent “what‑if” scenarios that are actionable across all modalities.
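The joint optimization over a shared latent space can be sketched with frozen toy encoders. The two random linear maps below stand in for pretrained image and text encoders (they are assumptions, not part of the specification); the sketch shows only the cross‑modal consistency term and the coupled gradient step that edits both modalities together.

```python
import numpy as np

rng = np.random.default_rng(1)
D_IMG, D_TXT, D_LAT = 8, 6, 4
# Frozen toy encoders mapping each modality into a shared 4-d latent space.
W_img = rng.normal(size=(D_LAT, D_IMG)) / np.sqrt(D_IMG)
W_txt = rng.normal(size=(D_LAT, D_TXT)) / np.sqrt(D_TXT)

def consistency_loss(x_img, x_txt):
    """Cross-modal consistency: squared distance between the two embeddings."""
    z_i, z_t = W_img @ x_img, W_txt @ x_txt
    return 0.5 * float(np.sum((z_i - z_t) ** 2))

def joint_recourse_step(x_img, x_txt, lr=0.1):
    """One coupled gradient step that edits both modalities so their
    latent embeddings move toward each other."""
    g = W_img @ x_img - W_txt @ x_txt
    return x_img - lr * (W_img.T @ g), x_txt + lr * (W_txt.T @ g)

x_img, x_txt = rng.normal(size=D_IMG), rng.normal(size=D_TXT)
losses = [consistency_loss(x_img, x_txt)]
for _ in range(1000):
    x_img, x_txt = joint_recourse_step(x_img, x_txt)
    losses.append(consistency_loss(x_img, x_txt))
```

A full MARM objective would add the prediction‑flip loss, per‑modality action costs, and adversarial‑training terms (AdvPT/APT) on top of this consistency term.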
Embodiment 4 – Robust Recourse Optimizer with Lp‑Bounded Model Change (RO‑Lp)
An optimization framework bounds the change in model parameters Δθ in the ℓp norm, ensuring that the counterfactual remains valid even after adversarial training or data poisoning. The objective minimizes action cost while satisfying ||Δθ||p ≤ ε, where ε is a user‑defined tolerance [4][5][v6294][v1977]. The optimizer can be implemented as a convex sub‑problem for generalized linear models or via a tractable approximation for deep networks.
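For a generalized linear model the RO‑Lp constraint admits a simple closed‑form robustness certificate: if the weight change is bounded by ||Δw||₂ ≤ ε, the worst‑case score at a point x is w·x + b − ε·||x||₂ (the dual ℓ₂ norm of x). The sketch below, assuming a linear classifier and an ℓ₂ action cost, walks the counterfactual along the cheapest direction until that worst‑case margin is non‑negative; it is an illustration of the constraint, not the full optimizer.

```python
import numpy as np

def robust_counterfactual(x, w, b, eps, step=0.01, max_iter=10000):
    """Minimal-shift counterfactual for a linear model f(x) = w.x + b that
    stays valid for every parameter change with ||dw||_2 <= eps.
    Worst case over the ball: w.x + b - eps * ||x||_2 (dual-norm bound)."""
    x_cf = np.asarray(x, dtype=float).copy()
    d = w / np.linalg.norm(w)  # cheapest direction to raise the score
    for _ in range(max_iter):
        robust_margin = w @ x_cf + b - eps * np.linalg.norm(x_cf)
        if robust_margin >= 0:
            return x_cf
        x_cf = x_cf + step * d
    raise RuntimeError("no robust counterfactual found within budget")

x = np.array([0.5, 0.5])
w, b, eps = np.array([1.0, 1.0]), -3.0, 0.1
x_cf = robust_counterfactual(x, w, b, eps)
```

Because the returned point satisfies the dual‑norm certificate, it remains classified positive under every weight perturbation of ℓ₂ norm at most ε; for deep networks the same constraint would be enforced via a tractable approximation, as stated above.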
Embodiment 5 – Robustness Oracle Evaluation
A model‑agnostic robustness oracle simulates worst‑case adversarial model variations. It employs metamorphic testing and oracle distillation to evaluate whether the generated counterfactual remains valid across a set of perturbed models. The oracle computes a multiplicity‑based robustness score by sampling models within the ℓp radius and checking counterfactual feasibility [v3453][v10859][v5423][v12247][v12560].
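The multiplicity‑based score can be sketched directly, assuming a linear target model: sample parameter vectors inside an ℓ∞ ball around the deployed model (the 0.05 radius and 100‑model sample size mirror claims 5 and 14) and report the fraction of sampled models for which the counterfactual is still classified positive.

```python
import numpy as np

rng = np.random.default_rng(2)

def robustness_score(x_cf, w, b, radius=0.05, n_models=100):
    """Multiplicity-based robustness score: fraction of models sampled
    within an l-infinity radius of the deployed parameters under which
    the counterfactual x_cf remains valid."""
    valid = 0
    for _ in range(n_models):
        dw = rng.uniform(-radius, radius, size=w.shape)
        db = rng.uniform(-radius, radius)
        if (w + dw) @ x_cf + (b + db) >= 0:
            valid += 1
    return valid / n_models

w, b = np.array([1.0, 1.0]), 0.0
score_far = robustness_score(np.array([2.0, 2.0]), w, b)   # large margin
score_near = robustness_score(np.array([0.0, 0.0]), w, b)  # on the boundary
```

A counterfactual with a comfortable margin scores 1.0, while one sitting on the decision boundary is invalidated by roughly half of the sampled models; for black‑box targets the linear model would be replaced by a distilled surrogate, as in the oracle‑distillation variant above.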
CLAIMS
1. A method for generating counterfactual explanations robust to adversarial perturbations, comprising: learning a causal graph G from domain data; generating a set of candidate perturbations that respect the edges of G; projecting each candidate onto the data manifold using a denoising diffusion probabilistic model; optimizing the projected perturbations to minimize an action cost objective while constraining the ℓp norm of model parameter change to be less than or equal to a predetermined threshold; and evaluating the resulting counterfactuals using a robustness oracle that simulates adversarial model variations.
2. The method of claim 1, wherein the causal graph G is learned using the Fast Causal Inference (FCI) algorithm.
3. The method of claim 1, wherein the diffusion projection employs a DDIM sampler with 50 denoising steps.
4. The method of claim 1, wherein the action cost objective includes a weighted sum of feature‑level L1 and L2 penalties.
5. The method of claim 1, wherein the robustness oracle evaluates counterfactual validity across a set of models perturbed within an ℓ∞ radius of 0.05.
6. The method of claim 1, wherein the candidate perturbations are generated using sparse patch‑wise modifications guided by attention scores.
7. The method of claim 1, wherein the counterfactuals are generated for multimodal inputs comprising images, text, and graph data, and the optimization enforces cross‑modal causal consistency.
8. A system for generating robust counterfactual explanations, comprising: a causal graph module that stores a causal graph G; a diffusion projection module that projects perturbations onto the data manifold; a recourse optimizer module that minimizes action cost under an ℓp model‑change constraint; a multi‑modal recourse module that generates counterfactuals across image, text, and graph modalities; and a robustness oracle module that evaluates counterfactual validity under adversarial model variations.
9. The system of claim 8, wherein the diffusion projection module implements a DDPM with a guidance scale of 3.0.
10. The system of claim 8, wherein the recourse optimizer module employs a convex relaxation for generalized linear models.
11. The system of claim 8, wherein the robustness oracle module performs oracle distillation to train a surrogate classifier that mimics the target model’s decision strategy.
12. The system of claim 8, wherein the multi‑modal recourse module uses a shared latent space and cross‑modal consistency loss to align visual and textual embeddings.
13. The system of claim 8, wherein the causal graph module is updated periodically using differential privacy‑preserving causal discovery.
14. The system of claim 8, wherein the robustness oracle module computes a multiplicity‑based robustness score by sampling 100 perturbed models.
15. The system of claim 8, wherein the counterfactual explanations are rendered as visual heatmaps, textual rationales, and graph edits that can be embedded into electronic health records via HL7/FHIR standards.
ABSTRACT
The present invention provides a Frontier Counterfactual Architecture (FCA) that generates counterfactual explanations robust to adversarial perturbations across multimodal data. FCA learns a causal graph to steer perturbations along semantically consistent edges, projects candidate perturbations onto the data manifold using a denoising diffusion probabilistic model, and optimizes for minimal action cost while bounding model change in the ℓp norm. A multi‑modal adversarial recourse module ensures that explanations remain actionable across images, text, and graphs. A robustness oracle evaluates counterfactual validity under simulated adversarial model variations, yielding a multiplicity‑based robustness score. The resulting counterfactuals are faithful, interpretable, and resilient to both input‑level noise and model‑level shifts, thereby enhancing trust in adversarial environments.