Counterfactual Explanation Robustness to Adversarial Noise
TITLE OF THE INVENTION
Frontier Counterfactual Explanation Architecture for Adversarial Robustness Across Multimodal Domains
FIELD OF THE INVENTION
The present invention relates to machine‑learning explainability, and more particularly to the generation of counterfactual explanations that remain faithful, actionable, and interpretable when subjected to adversarial perturbations at both the input and model levels, including for multimodal data such as images, text, and graphs.
BACKGROUND AND PRIOR ART
Conventional counterfactual explanation (CE) methods rely on gradient‑based or search‑based perturbations that are highly sensitive to adversarial noise. A perturbation that flips a model’s prediction may be mere adversarial noise rather than an actionable change, yet conventional methods present it as a counterfactual, leading to misleading explanations and eroding user trust. Prior work on causally‑guided adversarial steering (CECAS) [1][2] demonstrates that steering perturbations along causal edges can preserve domain semantics, yet these methods still generate off‑manifold artifacts. Diffusion‑constrained manifold projection (ACE‑DMP) [3] mitigates high‑frequency noise but does not address multimodal consistency or model drift. Robust recourse optimization with Lp‑bounded model change (RO‑Lp) [4][5] bounds model updates but lacks a causal or manifold constraint. Validation evidence from attention‑guided adversarial attacks (SAGA) [v4266], diffusion‑based counterfactuals (VCE) [v12930], and multi‑modal recourse modules (MARM) [v9141] further highlights the need for a unified framework that integrates causal guidance, manifold fidelity, multimodal consistency, and model‑change robustness. The technical problem, therefore, is to provide a counterfactual explanation mechanism that simultaneously satisfies causal integrity, manifold fidelity, multimodal actionability, and resilience to model drift or poisoning.
SUMMARY OF THE INVENTION
The invention discloses a Frontier Counterfactual Architecture (FCA) that integrates four complementary innovations: causally‑guided adversarial steering (CECAS‑style), diffusion‑constrained manifold projection (ACE‑DMP), a multi‑modal adversarial recourse module (MARM), and a robust recourse optimizer with Lp‑bounded model change (RO‑Lp). FCA learns or accepts a causal graph, projects adversarial perturbations onto the data manifold via a denoising diffusion probabilistic model, generates multimodal counterfactuals that respect cross‑modal causal constraints, and optimizes for minimal action cost while bounding model change in the ℓp norm. The resulting counterfactual explanations are faithful, actionable, and robust to both input‑level adversarial noise and model‑level shifts, thereby enhancing user trust in adversarial environments.
DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS
Embodiment 1 – Causal Graph Acquisition
The FCA pipeline begins by acquiring a causal graph G=(V,E) that captures domain‑specific causal relationships among features. G may be learned from observational data using causal discovery algorithms such as Fast Causal Inference (FCI) or GAC, or supplied by domain experts. The graph is stored in a causal knowledge base and used to constrain subsequent perturbations. This step ensures that any perturbation respects causal consistency, preventing semantic violations that would otherwise produce misleading counterfactuals [1][2][v13179].
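The causal‑constraint step above can be illustrated with a minimal sketch. The four features, their edge weights, and the linear structural equations are hypothetical stand‑ins (not part of the invention's specification): a candidate perturbation is applied to a feature, and its effect is propagated to causal descendants so that the resulting counterfactual never edits a downstream feature independently of its parents.

```python
# Toy causal graph over four tabular features (illustrative names and
# weights, not learned from data): age -> income, education -> income,
# income -> savings. FEATURES is listed in topological order.
FEATURES = ["age", "education", "income", "savings"]
PARENTS = {"age": [], "education": [], "income": ["age", "education"], "savings": ["income"]}
WEIGHTS = {("age", "income"): 0.3, ("education", "income"): 0.5, ("income", "savings"): 0.8}

def propagate(x, delta):
    """Apply a candidate perturbation `delta` (feature -> additive change)
    and propagate its effect through the linear structural equations, so
    every downstream feature is updated causally rather than clamped."""
    x_new = {}
    for f in FEATURES:  # topological order guarantees parents are computed first
        effect = delta.get(f, 0.0)
        effect += sum(WEIGHTS[(p, f)] * (x_new[p] - x[p]) for p in PARENTS[f])
        x_new[f] = x[f] + effect
    return x_new
```

Under these toy weights, intervening on education by +2 raises income by 1.0 and savings by 0.8 in turn; a perturbation that edited savings while holding income fixed would be rejected as causally inconsistent.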
Embodiment 2 – Diffusion‑Constrained Manifold Projection (ACE‑DMP)
A denoising diffusion probabilistic model (DDPM) projects raw adversarial perturbations onto the data manifold. The diffusion process iteratively denoises a perturbed sample x′, yielding a counterfactual x* that lies on the manifold of realistic data points. A filtering function Fτ removes high‑frequency artifacts while preserving the semantic direction of the perturbation [3][v12930][v2830]. Fast samplers such as DDIM or DPM‑Solver can be used to reduce computational cost [v14059].
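The projection step can be sketched as follows, with deliberate simplifications: the "data manifold" is the unit circle in two dimensions, and `toy_denoiser` is a hand‑written stand‑in for a trained DDPM denoiser. The sketch only illustrates the alternating noise/denoise schedule that drifts an off‑manifold adversarial point onto the manifold while roughly preserving its semantic direction.

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_denoiser(x):
    """Stand-in for a trained DDPM denoiser: pulls a point radially
    toward the unit circle, our toy 'data manifold'."""
    r = np.linalg.norm(x)
    return x / r if r > 0 else x

def manifold_project(x_adv, steps=20, noise=0.05, step_size=0.3):
    """ACE-DMP-style projection (illustrative): alternate a small forward
    noising step with a reverse denoising step so the perturbed sample
    converges to the manifold without losing its direction."""
    x = np.asarray(x_adv, dtype=float)
    for _ in range(steps):
        x = x + noise * rng.normal(size=x.shape)    # forward (noising) step
        x = x + step_size * (toy_denoiser(x) - x)   # reverse (denoising) step
    return x

x_adv = np.array([1.8, 0.4])      # off-manifold adversarial point
x_star = manifold_project(x_adv)  # lands near the unit circle
```

In a real deployment the denoiser would be a trained diffusion model and the loop would follow a DDIM or DPM‑Solver schedule; the structure of the projection loop is the same.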
Embodiment 3 – Multi‑Modal Adversarial Recourse Module (MARM)
MARM extends CE to images, text, and graph data simultaneously. It generates adversarial examples that respect cross‑modal causal constraints by jointly optimizing over a shared latent space. Cross‑modal consistency losses and parameter‑efficient adversarial training (AdvPT, APT) are incorporated to harden embeddings against multimodal perturbations [v9141][v15921]. The module outputs coherent “what‑if” scenarios that are actionable across all modalities.
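The joint optimization over a shared latent space can be sketched with frozen toy encoders. The two random linear maps below stand in for pretrained image and text encoders (they are assumptions, not part of the specification); the sketch shows only the cross‑modal consistency term and the coupled gradient step that edits both modalities together.

```python
import numpy as np

rng = np.random.default_rng(1)
D_IMG, D_TXT, D_LAT = 8, 6, 4
# Frozen toy encoders mapping each modality into a shared 4-d latent space.
W_img = rng.normal(size=(D_LAT, D_IMG)) / np.sqrt(D_IMG)
W_txt = rng.normal(size=(D_LAT, D_TXT)) / np.sqrt(D_TXT)

def consistency_loss(x_img, x_txt):
    """Cross-modal consistency: squared distance between the two embeddings."""
    z_i, z_t = W_img @ x_img, W_txt @ x_txt
    return 0.5 * float(np.sum((z_i - z_t) ** 2))

def joint_recourse_step(x_img, x_txt, lr=0.1):
    """One coupled gradient step that edits both modalities so their
    latent embeddings move toward each other."""
    g = W_img @ x_img - W_txt @ x_txt
    return x_img - lr * (W_img.T @ g), x_txt + lr * (W_txt.T @ g)

x_img, x_txt = rng.normal(size=D_IMG), rng.normal(size=D_TXT)
losses = [consistency_loss(x_img, x_txt)]
for _ in range(1000):
    x_img, x_txt = joint_recourse_step(x_img, x_txt)
    losses.append(consistency_loss(x_img, x_txt))
```

A full MARM objective would add the prediction‑flip loss, per‑modality action costs, and adversarial‑training terms (AdvPT/APT) on top of this consistency term.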
Embodiment 4 – Robust Recourse Optimizer with Lp‑Bounded Model Change (RO‑Lp)
An optimization framework bounds the change in model parameters Δθ in the ℓp norm, ensuring that the counterfactual remains valid even after adversarial training or data poisoning. The objective minimizes action cost while satisfying ||Δθ||p ≤ ε, where ε is a user‑defined tolerance [4][5][v6294][v1977]. The optimizer can be implemented as a convex sub‑problem for generalized linear models or via a tractable approximation for deep networks.
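For a generalized linear model the RO‑Lp constraint admits a simple closed‑form robustness certificate: if the weight change is bounded by ||Δw||₂ ≤ ε, the worst‑case score at a point x is w·x + b − ε·||x||₂ (the dual ℓ₂ norm of x). The sketch below, assuming a linear classifier and an ℓ₂ action cost, walks the counterfactual along the cheapest direction until that worst‑case margin is non‑negative; it is an illustration of the constraint, not the full optimizer.

```python
import numpy as np

def robust_counterfactual(x, w, b, eps, step=0.01, max_iter=10000):
    """Minimal-shift counterfactual for a linear model f(x) = w.x + b that
    stays valid for every parameter change with ||dw||_2 <= eps.
    Worst case over the ball: w.x + b - eps * ||x||_2 (dual-norm bound)."""
    x_cf = np.asarray(x, dtype=float).copy()
    d = w / np.linalg.norm(w)  # cheapest direction to raise the score
    for _ in range(max_iter):
        robust_margin = w @ x_cf + b - eps * np.linalg.norm(x_cf)
        if robust_margin >= 0:
            return x_cf
        x_cf = x_cf + step * d
    raise RuntimeError("no robust counterfactual found within budget")

x = np.array([0.5, 0.5])
w, b, eps = np.array([1.0, 1.0]), -3.0, 0.1
x_cf = robust_counterfactual(x, w, b, eps)
```

Because the returned point satisfies the dual‑norm certificate, it remains classified positive under every weight perturbation of ℓ₂ norm at most ε; for deep networks the same constraint would be enforced via a tractable approximation, as stated above.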
Embodiment 5 – Robustness Oracle Evaluation
A model‑agnostic robustness oracle simulates worst‑case adversarial model variations. It employs metamorphic testing and oracle distillation to evaluate whether the generated counterfactual remains valid across a set of perturbed models. The oracle computes a multiplicity‑based robustness score by sampling models within the ℓp radius and checking counterfactual feasibility [v3453][v10859][v5423][v12247][v12560].
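The multiplicity‑based score can be sketched directly, assuming a linear target model: sample parameter vectors inside an ℓ∞ ball around the deployed model (the 0.05 radius and 100‑model sample size mirror claims 5 and 14) and report the fraction of sampled models for which the counterfactual is still classified positive.

```python
import numpy as np

rng = np.random.default_rng(2)

def robustness_score(x_cf, w, b, radius=0.05, n_models=100):
    """Multiplicity-based robustness score: fraction of models sampled
    within an l-infinity radius of the deployed parameters under which
    the counterfactual x_cf remains valid."""
    valid = 0
    for _ in range(n_models):
        dw = rng.uniform(-radius, radius, size=w.shape)
        db = rng.uniform(-radius, radius)
        if (w + dw) @ x_cf + (b + db) >= 0:
            valid += 1
    return valid / n_models

w, b = np.array([1.0, 1.0]), 0.0
score_far = robustness_score(np.array([2.0, 2.0]), w, b)   # large margin
score_near = robustness_score(np.array([0.0, 0.0]), w, b)  # on the boundary
```

A counterfactual with a comfortable margin scores 1.0, while one sitting on the decision boundary is invalidated by roughly half of the sampled models; for black‑box targets the linear model would be replaced by a distilled surrogate, as in the oracle‑distillation variant above.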
CLAIMS
1. A method for generating counterfactual explanations robust to adversarial perturbations, comprising: learning a causal graph G from domain data; generating a set of candidate perturbations that respect the edges of G; projecting each candidate onto the data manifold using a denoising diffusion probabilistic model; optimizing the projected perturbations to minimize an action cost objective while constraining the ℓp norm of model parameter change to be less than or equal to a predetermined threshold; and evaluating the resulting counterfactuals using a robustness oracle that simulates adversarial model variations.
2. The method of claim 1, wherein the causal graph G is learned using the Fast Causal Inference (FCI) algorithm.
3. The method of claim 1, wherein the diffusion projection employs a DDIM sampler with 50 denoising steps.
4. The method of claim 1, wherein the action cost objective includes a weighted sum of feature‑level L1 and L2 penalties.
5. The method of claim 1, wherein the robustness oracle evaluates counterfactual validity across a set of models perturbed within an ℓ∞ radius of 0.05.
6. The method of claim 1, wherein the candidate perturbations are generated using sparse patch‑wise modifications guided by attention scores.
7. The method of claim 1, wherein the counterfactuals are generated for multimodal inputs comprising images, text, and graph data, and the optimization enforces cross‑modal causal consistency.
8. A system for generating robust counterfactual explanations, comprising: a causal graph module that stores a causal graph G; a diffusion projection module that projects perturbations onto the data manifold; a recourse optimizer module that minimizes action cost under an ℓp model‑change constraint; a multi‑modal recourse module that generates counterfactuals across image, text, and graph modalities; and a robustness oracle module that evaluates counterfactual validity under adversarial model variations.
9. The system of claim 8, wherein the diffusion projection module implements a DDPM with a guidance scale of 3.0.
10. The system of claim 8, wherein the recourse optimizer module employs a convex relaxation for generalized linear models.
11. The system of claim 8, wherein the robustness oracle module performs oracle distillation to train a surrogate classifier that mimics the target model’s decision strategy.
12. The system of claim 8, wherein the multi‑modal recourse module uses a shared latent space and cross‑modal consistency loss to align visual and textual embeddings.
13. The system of claim 8, wherein the causal graph module is updated periodically using differential privacy‑preserving causal discovery.
14. The system of claim 8, wherein the robustness oracle module computes a multiplicity‑based robustness score by sampling 100 perturbed models.
15. The system of claim 8, wherein the counterfactual explanations are rendered as visual heatmaps, textual rationales, and graph edits that can be embedded into electronic health records via HL7/FHIR standards.
ABSTRACT
The present invention provides a Frontier Counterfactual Architecture (FCA) that generates counterfactual explanations robust to adversarial perturbations across multimodal data. FCA learns a causal graph to steer perturbations along semantically consistent edges, projects candidate perturbations onto the data manifold using a denoising diffusion probabilistic model, and optimizes for minimal action cost while bounding model change in the ℓp norm. A multi‑modal adversarial recourse module ensures that explanations remain actionable across images, text, and graphs. A robustness oracle evaluates counterfactual validity under simulated adversarial model variations, yielding a multiplicity‑based robustness score. The resulting counterfactuals are faithful, interpretable, and resilient to both input‑level noise and model‑level shifts, thereby enhancing trust in adversarial environments.