The central research challenge is to develop counterfactual explanation (CE) mechanisms that remain faithful, actionable, and interpretable when subjected to adversarial perturbations—both input‑level noise and model‑level shifts. Existing CE methods exhibit brittleness: perturbations that flip a model’s prediction are often treated as noisy artifacts rather than actionable changes, leading to misleading explanations and compromised user trust. Our objective is to bridge the gap between the optimization goals of adversarial attacks and the human‑interpretable, causally grounded requirements of counterfactual explanations in multi‑agent, adversarial settings.
Conventional CE approaches are largely inspired by adversarial attack frameworks: they search for minimal perturbations that cause a label flip while minimizing a distance metric (e.g., an $\ell_p$ norm) between the original instance and the counterfactual. These methods typically ignore domain‑specific constraints, causal dependencies, and the perceptual plausibility of the generated counterfactuals. Research has shown that CE methods are not robust to model changes (Mishra et al., 2021), input perturbations (Artelt et al., 2021; Virgolin & Fracaros, 2023), or adversarial training (Slack et al., 2021). Moreover, data poisoning can severely degrade CE reliability (Ben‑Said et al., 2024). Recent efforts (e.g., ATEX‑CF for graph neural networks) attempt to unify attack and CE logic but still rely on naïve perturbation strategies that do not guarantee on‑manifold or causal fidelity.
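To make the contrast concrete, the following is a minimal sketch of this attack‑style search, assuming a differentiable PyTorch classifier; `clf`, `x0`, the penalty weight `lam`, and the step settings are illustrative placeholders rather than any published method's configuration.

```python
import torch

def naive_counterfactual(clf, x0, target_class, lam=0.1, steps=500, lr=0.01, p=2):
    """Attack-style CE search: flip clf's prediction to target_class while keeping ||x' - x0||_p small."""
    x = x0.clone().detach().requires_grad_(True)
    opt = torch.optim.Adam([x], lr=lr)
    target = torch.tensor([target_class])
    for _ in range(steps):
        opt.zero_grad()
        logits = clf(x.unsqueeze(0))
        # Push the prediction toward the desired counterfactual class ...
        flip_loss = torch.nn.functional.cross_entropy(logits, target)
        # ... while penalizing the l_p distance from the original instance.
        proximity = torch.norm(x - x0, p=p)
        (flip_loss + lam * proximity).backward()
        opt.step()
    return x.detach()
```

Nothing in this objective keeps the result on the data manifold or respects causal structure, which is exactly the brittleness described above.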
We propose a Frontier CE Architecture (FCA) that integrates four complementary innovations:
Causally‑Guided Adversarial Steering (CECAS‑style) – Employ a causal graph learned from domain data to steer adversarial perturbations only along edges that preserve causal consistency. This prevents unintended alterations that violate domain semantics, as demonstrated in CECAS [1][2].
Diffusion‑Constrained Manifold Projection (ACE‑DMP) – Use a denoising diffusion probabilistic model (DDPM) to project raw adversarial perturbations onto the data manifold before evaluation. The filtering function $F_{\tau}$ removes high‑frequency artifacts while retaining the semantic direction of the perturbation [3]. A combined sketch of this projection and the causal steering above follows the list.
Multi‑Modal Adversarial Recourse Module (MARM) – Extend CE to images, text, and graph data simultaneously by generating adversarial examples that respect cross‑modal causal constraints. This is essential for multi‑agent coordination where agents share heterogeneous observations.
Robust Recourse Optimizer with $L_p$‑Bounded Model Change (RO‑Lp) – Incorporate an optimization framework that bounds model changes in the $\ell_p$ sense [4][5], ensuring that the CE remains valid even when the underlying model undergoes adversarial or data‑poisoning updates.
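As a hedged illustration of the first two components (not the published CECAS or ACE implementations), the sketch below masks the perturbation with a causally derived actionability mask and then blends the result toward a denoiser's reconstruction; `denoiser`, `tau`, and `actionable_idx` are stand‑ins for the learned DDPM filter $F_{\tau}$ and the causal‑graph analysis.

```python
import torch

def actionable_mask(n_features, actionable_idx):
    """Binary mask that is 1 only on features the causal graph marks as intervenable (assumed given)."""
    mask = torch.zeros(n_features)
    mask[list(actionable_idx)] = 1.0
    return mask

def steer_and_project(x0, delta, mask, denoiser, tau=0.3):
    """Causally steered perturbation followed by an F_tau-style manifold projection."""
    # Keep only the perturbation components that respect the causal graph.
    steered = x0 + mask * delta
    # F_tau stand-in: pull the candidate toward the denoiser's on-manifold
    # reconstruction; tau trades off artifact removal against perturbation strength.
    return (1 - tau) * steered + tau * denoiser(steered)
```

In CECAS and ACE the steering and filtering are considerably more involved (causal edge selection, full diffusion sampling); the blend here is only meant to convey the division of labor between the two stages.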
The FCA pipeline first learns a causal graph (or uses an expert‑defined one), then uses diffusion‑based on‑manifold projection to generate candidate counterfactuals, and finally optimizes for minimal action cost under an $\ell_p$ model‑change constraint. The final CE is evaluated against a held‑out robustness oracle that simulates potential adversarial model variations. A minimal sketch of this composition and its robust‑validity check follows.
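For a generalized linear scoring model, the worst case over an $\ell_p$‑bounded model change has a closed form via Hölder's inequality, which suggests a simple robust‑validity filter at the end of the pipeline. The sketch below composes the stages under that linear assumption and reuses `steer_and_project` from the previous sketch; `eps`, the sign convention (positive score = desired outcome), and `cost_fn` are illustrative assumptions, not the algorithm of [4][5].

```python
import numpy as np

def worst_case_score(w, b, x_cf, eps, p=2):
    """Lower bound on the linear score w·x_cf + b over every model w' with ||w' - w||_p <= eps.
    By Hölder's inequality, an adversarial model change can lower the score by at most
    eps * ||x_cf||_q, where q is the dual exponent (1/p + 1/q = 1)."""
    x = np.ravel(x_cf)  # treat the instance as a flat feature vector
    q = p / (p - 1) if p > 1 else np.inf
    return float(w @ x + b - eps * np.linalg.norm(x, ord=q))

def fca_select(x0, candidate_deltas, mask, denoiser, w, b, eps, cost_fn):
    """Compose the FCA stages: causal steering -> manifold projection -> robustly valid, minimum-cost CE."""
    best, best_cost = None, np.inf
    for delta in candidate_deltas:
        x_cf = steer_and_project(x0, delta, mask, denoiser)   # stages 1-2 (previous sketch)
        x_np = x_cf.detach().cpu().numpy()
        # Stage 3: keep only counterfactuals whose desired (positive-score) label
        # survives every admissible l_p-bounded model change.
        if worst_case_score(w, b, x_np, eps) > 0:
            cost = cost_fn(x0, x_cf)
            if cost < best_cost:
                best, best_cost = x_cf, cost
    return best
```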
The proposed FCA surpasses conventional CE methods for several reasons:
Causal Integrity: By steering perturbations along causal edges, FCA eliminates the risk of generating counterfactuals that flip predictions through spurious correlations, a problem noted in many visual CE studies [1][2].
Manifold Fidelity: Diffusion‑based projection guarantees that counterfactuals reside on the true data manifold, directly addressing the “noise” perception issue identified in early CE literature [6][7].
Multi‑Modal Robustness: The MARM component ensures that CE outputs are actionable across all modalities present in a multi‑agent system, a necessity highlighted by the increasing prevalence of vision‑language and graph‑based decision models [8][9].
Resilience to Model Drift and Poisoning: The RO‑Lp optimizer explicitly bounds the magnitude of permissible model changes, thereby safeguarding CE validity against adversarial training, data poisoning, and distribution shifts [4][10].
Scalable Evaluation: FCA’s robustness oracle, which simulates adversarial model variants, allows researchers to quantify CE performance under worst‑case scenarios, overcoming the limitations of current sanity‑check protocols that rely only on randomization tests [11]; a sketch of such an oracle follows this list.
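A minimal version of such an oracle can be approximated by sampling model variants inside an $\ell_p$ budget and measuring how often each counterfactual keeps its target label; the random perturbation scheme, `eps`, and the trial count below are illustrative choices, not a specification of FCA's oracle.

```python
import copy
import torch

def sample_model_variant(model, eps, p=2):
    """Return a deep copy of `model` whose total parameter change has l_p norm at most eps."""
    variant = copy.deepcopy(model)
    with torch.no_grad():
        params = list(variant.parameters())
        noise = [torch.randn_like(q) for q in params]
        flat = torch.cat([n.flatten() for n in noise])
        # Rescale the random direction so the overall change stays inside the l_p ball.
        scale = eps * torch.rand(()) / (flat.norm(p) + 1e-12)
        for q, n in zip(params, noise):
            q.add_(scale * n)
    return variant

def oracle_validity_rate(model, counterfactuals, targets, eps, trials=100):
    """Fraction of counterfactuals whose target label survives every sampled model variant."""
    kept = 0
    for x_cf, y in zip(counterfactuals, targets):
        ok = all(
            sample_model_variant(model, eps)(x_cf.unsqueeze(0)).argmax(dim=1).item() == y
            for _ in range(trials)
        )
        kept += int(ok)
    return kept / max(len(counterfactuals), 1)
```

Random sampling only lower‑bounds the worst case; a full oracle would also include targeted, adversarially optimized model shifts.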
In sum, FCA aligns the optimization objective of adversarial robustness with the interpretability and actionability demands of counterfactual explanations, thereby advancing the frontier of trustworthy, coordinated AI systems in adversarial environments.
References
[1] Counterfactual Visual Explanation via Causally‑Guided Adversarial Steering. 2025-09-29.
[2] Counterfactual Visual Explanation via Causally‑Guided Adversarial Steering. 2025-07-13.
[3] Diffusion Counterfactuals for Image Regressors. 2025-12-31.
[4] Optimal Robust Recourse with $L_p$‑Bounded Model Change. 2025-12-31.
[5] Optimal Robust Recourse with $L_p$‑Bounded Model Change. 2026-04-20.
[6] Counterfactual explanations and adversarial attacks have a related goal: flipping output labels with minimal perturbations regardless of their characteristics. 2026-03-17.
[7] Adversarial Counterfactual Visual Explanations. 2023-03-16.
[8] Towards desiderata-driven design of visual counterfactual explainers. 2026-05-07.
[10] The effect of data poisoning on counterfactual explanations. 2026-05-07.
[11] Mount Sinai Health System explainable AI diagnostic deployment (November 2023) and faithfulness challenges of saliency methods. 2026-04-23.