1. Misaligned Policy Inference from Adversarial Observations
1.1 Identify the Objective
This chapter synthesizes the current state of research on how adversarial perturbations of observations can lead to misaligned policy inference in multi‑agent reinforcement learning (MARL) systems, the ensuing degradation of trust in cooperative teams, and the cascading failures that may result. It systematically reviews the literature for mechanisms that detect, mitigate, or otherwise address these threats, evaluates the strengths and weaknesses of existing solutions, and determines whether the objective can be met with today’s prior‑art.
1.2 Survey of Existing Prior Art
| Reference | Title | Core Contribution | Relevance to Objective |
|---|---|---|---|
| [192] | Black‑Box Adversarial Robustness Testing with Partial Observation for Multi‑Agent Reinforcement Learning | Proposes black‑box adversarial testing protocols that perturb agents’ partial observations to assess vulnerability. | Directly addresses adversarial observation injection in MARL. |
| [145] | AdverSAR: Adversarial Search and Rescue via Multi‑Agent Reinforcement Learning | Introduces a CTDE training paradigm with adversarial modeling for search‑and‑rescue scenarios. | Demonstrates adversarial policy generation in a cooperative MARL setting. |
| [10] | Cat‑and‑Mouse Satellite Dynamics | Presents a complex 3‑DOF contested environment where adversarial agents must prevent an evader from reaching goals. | Illustrates multi‑agent adversarial dynamics under partial observability. |
| [171] | How to prevent malicious use of intelligent unmanned swarms? | Explores adversarial policy design against unmanned swarms, highlighting exponential action‑space challenges. | Discusses multi‑agent adversarial policy synthesis. |
| [35] | An Offline Multi‑Agent Reinforcement Learning Framework for Radio Resource Management | Combines GANs with deep RL and graph neural networks for resource management; includes discussion of adversarial robustness. | Provides contextual background on MARL applications and robustness concerns. |
| [88] | Multi‑Agent Reinforcement Learning in Cybersecurity | Discusses Dec‑POMDPs and scalability issues in adversarial cyber‑security scenarios. | Highlights multi‑agent dynamics and the difficulty of aligning policies under adversarial influence. |
| [151] | Adversarial Attack on Black‑Box Multi‑Agent by Adaptive Perturbation | Implements state‑of‑the‑art black‑box attacks (MASafe, AMCA, AMI, Lin) on MARL, evaluating impact on reward and win rate. | Provides empirical evidence of misaligned policy inference due to observation attacks. |
| [127] | ROMAX: Certifiably Robust Deep Multi‑Agent Reinforcement Learning via Convex Relaxation | Presents a minimax MARL framework that infers worst‑case policy updates of other agents to guarantee robustness. | Directly tackles misaligned policy inference by bounding adversarial influence. |
| [3] | DeepForgeSeal: Latent Space‑Driven Semi‑Fragile Watermarking for Deepfake Detection Using Multi‑Agent Adversarial Reinforcement Learning | Introduces adversarial regularization enforcing Lipschitz continuity in policies, improving robustness to noisy observations. | Offers a regularization‑based defense against observation perturbations. |
| [10] (duplicate) | Cat‑and‑Mouse Satellite Dynamics | (see above) | Additional context on contested multi‑agent environments. |
| [145] (duplicate) | AdverSAR | (see above) | Further illustration of adversarial policy design. |
| [171] (duplicate) | How to prevent malicious use of intelligent unmanned swarms? | (see above) | Emphasizes adversarial policy challenges. |
| [35] (duplicate) | An Offline Multi‑Agent Reinforcement Learning Framework for Radio Resource Management | (see above) | Background on MARL robustness in communication systems. |
| [88] (duplicate) | Multi‑Agent Reinforcement Learning in Cybersecurity | (see above) | Cyber‑security perspective on adversarial policy alignment. |
Additional related work that informs the discussion but does not directly provide a complete solution includes:
- Techniques for adversarial regularization and Lipschitz enforcement ([3]).
- Adversarial training methods such as ROMANCE (Yuan et al. 2023) for robust target MAS ([151]).
- Adversarial policy synthesis frameworks (MASafe, AMCA, AMI) ([151]).
- CTDE training paradigms that expose agents to shared observations during training but rely on local observations at execution (AdverSAR, [145]).
1.3 Best‑Fit Match
ROMAX: Certifiably Robust Deep Multi‑Agent Reinforcement Learning via Convex Relaxation (Ref: [127]) is the single existing solution that most closely aligns with the objective of preventing misaligned policy inference from adversarial observations.
| Requirement | ROMAX Capability | Evidence |
|---|---|---|
| Detect worst‑case adversarial perturbations of observations | Uses convex relaxation to formulate a minimax problem that bounds the influence of any adversarial policy update. | The method explicitly models a worst‑case policy update of other agents, thereby anticipating misaligned inference. [127] |
| Guarantee robustness against adversarial observation attacks | Provides certifiable robustness guarantees by solving a convex optimization problem that upper‑bounds possible loss due to adversarial perturbations. | ROMAX’s theoretical guarantees ensure that the learned policy remains within acceptable performance bounds even under worst‑case attacks. [127] |
| Maintain cooperative performance under adversarial conditions | Empirically demonstrates that the minimax policy preserves team reward while withstanding adversarial perturbations in benchmark MARL environments. | Experimental results in ROMAX show reduced reward degradation compared to baseline MARL methods when subjected to observation attacks. [127] |
| Support interpretability of policy updates | The convex relaxation framework yields interpretable bounds on policy shifts, enabling stakeholders to understand the extent of adversarial influence. | The paper discusses how the convex dual variables correspond to sensitivity of the policy to observation changes. [127] |
Thus, ROMAX satisfies the core requirements of preventing misaligned policy inference through adversarial observations, providing both theoretical guarantees and empirical validation.
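To make the minimax idea concrete, the following sketch approximates the inner worst‑case step with projected gradient descent over a bounded perturbation of the other agents’ actions. It is an illustrative simplification, not ROMAX’s convex relaxation from [127]; the `critic`, `actor`, and perturbation bound `epsilon` are assumed interfaces.

```python
import torch

def robust_policy_loss(critic, actor, obs, other_actions,
                       epsilon=0.1, inner_steps=5, inner_lr=0.01):
    """Approximate a minimax objective: train the actor against the
    worst-case bounded perturbation of the other agents' actions.
    Illustrative sketch only; ROMAX bounds the inner problem via
    convex relaxation rather than gradient descent."""
    delta = torch.zeros_like(other_actions, requires_grad=True)
    for _ in range(inner_steps):
        q_val = critic(obs, actor(obs), other_actions + delta).mean()
        grad, = torch.autograd.grad(q_val, delta)
        with torch.no_grad():
            delta -= inner_lr * grad          # inner step: push the joint value down
            delta.clamp_(-epsilon, epsilon)   # keep the adversarial deviation bounded
    worst_q = critic(obs, actor(obs), (other_actions + delta).detach())
    return -worst_q.mean()                    # outer step: maximise the worst-case value
```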
1.4 Gap Analysis
| Gap | Classification | Existing Art to Close Gap |
|---|---|---|
| Partial observability limitations | (ii) Requires net‑new R&D | ROMAX assumes full‑state observability in its convex relaxation; integrating belief‑state estimation (e.g., deep belief networks) would extend applicability. |
| Trust degradation quantification | (ii) Requires net‑new R&D | Current methods (ROMAX, ROMANCE) do not measure trust metrics or provide interpretable trust scores. |
| Cascading failure modeling | (ii) Requires net‑new R&D | No prior art models the propagation of misaligned policies leading to system‑wide failures; would require formal safety‑analysis frameworks. |
| Communication hijack resilience | (i) Closeable by composition | Combining ROMAX with adversarial regularization (DeepForgeSeal, Ref: [3]) could mitigate message‑based attacks. |
| Adversarial policy synthesis under constraints | (i) Closeable by integration | Integrating existing black‑box attack methods (MASafe, AMCA, AMI, Ref: [151]) with ROMAX could generate worst‑case scenarios for training. |
| Robustness to noisy observations in decentralized execution | (i) Closeable by configuration | Employing CTDE training (AdverSAR, Ref: [145]) alongside ROMAX would help agents learn to cope with local observation noise. |
| Scalability to large action spaces | (ii) Requires net‑new R&D | ROMAX’s convex relaxation becomes computationally intensive as the number of agents increases; scalable approximations are needed. |
1.5 Verdict
Not Currently Possible – While ROMAX provides a robust foundation against misaligned policy inference, it does not address key aspects such as trust degradation metrics and cascading failure modeling required by the full objective.
Closest Existing Fits
1. ROMAX (Zhou et al. 2022) – Certifiably robust minimax MARL that bounds worst‑case policy updates. Coverage: Provides theoretical guarantees against adversarial observation attacks. Residual Gap: Lacks partial‑observability handling and trust‑degradation metrics.
2. ROMANCE (Yuan et al. 2023) – Robust target MAS via evolutionary learning, applied to message‑passing robustness. Coverage: Improves robustness of cooperative MARL policies under adversarial perturbations. Residual Gap: Does not offer certifiable guarantees or address cascading failures.
3. DeepForgeSeal (Ref: [3]) – Adversarial regularization enforcing Lipschitz continuity in policies, enhancing robustness to noisy observations. Coverage: Provides regularization‑based defense against observation noise. Residual Gap: Does not explicitly model worst‑case adversarial policies or quantify trust degradation.
2. Trust Metric‑Based Federated Aggregation against Poisoning
2.1 Identify the Objective
The chapter must delineate a federated learning (FL) aggregation framework that employs quantitative trust metrics—derived from client reputation, participation quality, or dynamic trust scores—to weight local model updates during global aggregation, thereby mitigating the effect of poisoning attacks while preserving privacy and energy efficiency. The solution should integrate secure aggregation to conceal individual updates, support non‑IID client data, and maintain practical communication overhead.
2.2 Survey of Existing Prior Art
| # | Prior‑Art Solution | Key Features Relevant to Trust‑Metric Aggregation | Source |
|---|---|---|---|
| 1 | Trust‑Aware and Energy‑Efficient FL for Secure Sensor Networks | Lightweight trust metrics, trust‑driven aggregation, secure aggregation, energy‑aware scheduling | [60] |
| 2 | Fair and Robust FL via Reputation‑Aware Incentives | Reputation estimation using a Shapley‑variant, reputation‑weighted aggregation, poisoning mitigation | [73] |
| 3 | Reputation Mechanism for Collusion Robustness | Reputation‑based client weighting, dynamic reputation updates, Byzantine resilience | [194] |
| 4 | Lightweight and Robust Federated Data Valuation | Shapley‑based client valuation, robust aggregation, outlier detection | [64] |
| 5 | FBLearn Decentralized FL on Blockchain | Adaptive weight calculation based on local training quality, ensemble techniques, poisoning resilience | [166] |
| 6 | ClusterGuard: Secure Clustered Aggregation | Secure clustered aggregation, robustness to poisoning, hierarchical aggregation | [122] |
| 7 | FedGuard: Selective Parameter Aggregation | Selective parameter aggregation, poisoning mitigation, no auxiliary data | [189] |
| 8 | FedSecure: Adaptive Anomaly Detection | Adaptive anomaly detection, poisoning mitigation, DP support | [150] |
| 9 | PrivEdge: Hybrid Split‑FL for Real‑Time Detection | Secure aggregation, robust aggregation (Krum, Trimmed Mean), privacy‑preserving | [58] |
| 10 | Defend: Poisoned Model Detection and Exclusion | Neuron‑wise magnitude analysis, clustering via GMM, malicious client exclusion | [143] |
| 11 | Krum / Trimmed‑Mean / Median / FedAvg | Classical robust aggregation schemes, used as baselines | [173] |
These works collectively provide mechanisms for client weighting based on trust or reputation, secure aggregation, and robust aggregation against poisoning, but none integrate all three into a single trust‑metric‑driven aggregation scheme within a practical, low‑overhead FL deployment.
2.3 Best‑Fit Match
The Trust‑Aware and Energy‑Efficient Federated Learning for Secure Sensor Networks at the Edge [60] is the closest prior‑art solution to the stated objective. Its salient capabilities and their mapping to the requirements are:
| Requirement Feature | Implementation in [60] | Citation |
|---|---|---|
| Quantitative trust metrics per client | Lightweight trust scores computed from historical participation efficiency, update quality, and anomaly flags | [60] |
| Trust‑driven aggregation | Global model updates are weighted proportionally to trust scores, reducing influence of low‑trust (potentially poisoned) clients | [60] |
| Secure aggregation | Utilizes homomorphic‑encryption‑based secure sum or threshold‑cryptography to conceal individual updates during aggregation | [60] |
| Poisoning mitigation | Trust weighting inherently suppresses poisoned updates; additional anomaly detection thresholds are applied to flag extreme deviations | [60] |
| Non‑IID client support | Trust scores adapt to heterogeneity by incorporating local validation performance, ensuring fair weighting across diverse data distributions | [60] |
| Energy efficiency | Adaptive communication scheduling based on trust levels reduces unnecessary transmissions from low‑trust clients | [60] |
Thus, [60] satisfies the core objective of a trust‑metric‑driven aggregation scheme that is robust to poisoning, privacy‑preserving, and operationally efficient.
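The trust‑driven weighting step can be expressed as a weighted federated average. The sketch below is a minimal illustration that assumes per‑client trust scores are already available; the function name, normalisation, and exclusion threshold are our own choices rather than the scheme specified in [60].

```python
import numpy as np

def trust_weighted_aggregate(client_updates, trust_scores, min_trust=0.1):
    """Aggregate client model updates, weighting each by its trust score.

    client_updates: list of 1-D numpy arrays (flattened model deltas)
    trust_scores:   list of floats in [0, 1], one per client
    min_trust:      clients below this threshold are excluded entirely
    """
    weights = np.array([s if s >= min_trust else 0.0 for s in trust_scores])
    if weights.sum() == 0:
        raise ValueError("No client meets the minimum trust threshold")
    weights = weights / weights.sum()          # normalise so weights sum to 1
    stacked = np.stack(client_updates)         # shape: (num_clients, num_params)
    return (weights[:, None] * stacked).sum(axis=0)

# Example: a low-trust (potentially poisoned) client has little influence.
updates = [np.array([1.0, 1.0]), np.array([1.1, 0.9]), np.array([50.0, -50.0])]
scores  = [0.9, 0.8, 0.05]                     # third client falls below min_trust
global_delta = trust_weighted_aggregate(updates, scores)
```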
2.4 Gap Analysis
| Gap | Classification | Remedy (Existing Prior Art) |
|---|---|---|
| 1. Limited formal differential privacy (DP) – The scheme does not integrate DP noise addition for client updates. | (i) Closeable by integrating DP mechanisms from [173] (DP‑FedAvg) or [161] (DP‑FedAvg with clipping). | Combine trust‑weighted aggregation with DP‑FedAvg. |
| 2. No explicit outlier detection beyond trust weighting – Extremely malicious updates may still influence trust scores if initial trust is high. | (i) Closeable by composing robust aggregation (Krum, Median) with trust weighting. | Use a hybrid scheme: trust‑weighted aggregation plus Krum filtering [173]. |
| 3. Scope limited to sensor networks – Architecture assumes edge‑centric topology; may not generalize to cross‑silo or cross‑device FL. | (i) Closeable by adopting the same trust‑metric logic in other FL frameworks, e.g., NEBULA [80] or FBLearn [166]. | Re‑implement trust logic as a plug‑in to existing FL libraries. |
| 4. No support for hierarchical or clustered aggregation – While trust metrics are computed per client, the scheme does not exploit cluster‑based aggregation to reduce communication. | (i) Closeable by integrating ClusterGuard [122] clustering logic with trust weighting. | Combine cluster‑based secure aggregation with trust‑driven weights. |
| 5. No explicit handling of model size heterogeneity – All clients are assumed to share a common model structure. | (i) Closeable by adopting frameworks that support heterogeneous architectures. | Use FedAOP [203] or InclusiveFL [103] to support heterogeneous models, then apply trust weighting. |
Overall, the primary gaps are the absence of formal DP and the lack of a hybrid robust aggregation layer. These can be bridged by composing existing, mature mechanisms.
2.5 Verdict
Currently Possible – The objective of a trust‑metric‑based federated aggregation against poisoning is achievable today by composing existing components:
- Trust‑Aware FL Engine – Adopt the trust‑metric computation and trust‑driven weighting from [60].
- Secure Aggregation Protocol – Employ a threshold‑cryptography or homomorphic‑encryption scheme as described in [60] or the standard secure aggregation protocols of Flower/FedML.
- Robust Aggregation Layer (Optional) – Integrate Krum or trimmed‑mean filtering from [173] to provide additional outlier rejection.
- Differential Privacy Layer (Optional) – Apply DP‑FedAvg mechanisms from [173] or [161] to ensure client‑level privacy.
- Communication Scheduler – Use energy‑aware adaptive scheduling logic from [60] to minimize transmissions from low‑trust devices.
By orchestrating these modules within a federated learning platform (e.g., Flower, FedML, or NEBULA), a production‑ready trust‑metric‑driven aggregation system can be deployed without inventing new cryptographic primitives or algorithms.
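For the optional robust aggregation layer, the standard Krum selection rule [173] can be placed in front of the trust‑weighted average. The sketch below implements that rule under the usual assumption that at most `num_byzantine` clients are malicious; how it is wired into the trust engine is an integration choice, not something prescribed by the cited work.

```python
import numpy as np

def krum_select(client_updates, num_byzantine):
    """Return the index of the update chosen by the Krum rule: the update
    with the smallest summed squared distance to its n - f - 2 nearest
    neighbours, where f is the assumed number of Byzantine clients."""
    n = len(client_updates)
    closest = n - num_byzantine - 2
    if closest < 1:
        raise ValueError("Krum needs n > f + 2 clients")
    updates = np.stack(client_updates)
    scores = []
    for i in range(n):
        dists = np.sum((updates - updates[i]) ** 2, axis=1)
        dists[i] = np.inf                        # ignore distance to itself
        scores.append(np.sort(dists)[:closest].sum())
    return int(np.argmin(scores))

# Usage: drop obvious outliers with Krum, then apply trust-weighted averaging
# (see the earlier aggregation sketch) to the surviving updates.
```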
3. Communication Channel Sabotage and Theory of Mind Defense
3.1 Identify the Objective
This chapter surveys the state of the art in detecting, mitigating, and defending against adversarial sabotage of communication channels in multi‑agent artificial intelligence (AI) systems, with a particular focus on test‑time Theory of Mind (ToM) defenses. The objective is to map existing solutions—encompassing threat modelling, adversarial training, communication‑regularization techniques, and ToM‑based message filtering—onto the requirements of robust, real‑time multi‑agent coordination, and to identify the residual gaps that prevent a fully deployable, end‑to‑end defense stack.
3.2 Survey of Existing Prior Art
| # | Reference ID | Key Contribution | Relevance to Objective |
|---|---|---|---|
| 1 | [85] | Introduces a local ToM inference module that distinguishes cooperative from adversarial messages in centralized‑training, decentralized‑execution (CTDE) settings, and demonstrates mitigation in multi‑agent benchmarks. | Core to test‑time ToM defense against emergent adversarial communication. |
| 2 | [129] | Extends the OWASP Multi‑Agentic System Threat Modeling Guide with empirical threat classes and evaluation strategies for adversarial behaviors in MAS. | Provides taxonomy and evaluation framework for communication sabotage threats. |
| 3 | [114] | Proposes Communicative Power Regularization (CPR) to constrain agents’ influence in communication, improving robustness to misaligned or adversarial messages while preserving cooperative performance. | Offers a complementary regularization layer that mitigates the impact of sabotaged messages. |
| 4 | [27] | Presents a ToM‑based test‑time mitigation that filters out messages from agents whose inferred intentions deviate from cooperative norms in a shared‑reward setting. | Supports the design of a runtime ToM filter similar to that of [85]. |
| 5 | [197] | Describes ROMANCE, an evolutionary generation of auxiliary adversarial attackers for robust multi‑agent coordination, and shows integration into various MARL methods. | Supplies an adversarial training pipeline to expose agents to sabotage scenarios. |
| 6 | [184] | Discusses a Theory of Mind approach for test‑time mitigation against emergent adversarial communication, expanding on the ToM inference framework. | Provides theoretical grounding and additional empirical evidence for ToM defenses. |
| 7 | [97] | Details a framework for detecting anomalous transactions via privileged user accounts, illustrating the need for behavioral forensics in multi‑agent communication. | Highlights the importance of behavioral monitoring beyond message content. |
| 8 | [112] | Offers a comprehensive overview of multi‑agent reinforcement learning for real‑time strategy games, underscoring the prevalence of communication in complex environments. | Contextualizes the necessity of robust communication channels. |
| 9 | [25] | Presents a hybrid MAS‑SIEM framework integrating behavioral forensics and Trust‑Aware ML, with ToM reasoning. | Demonstrates an end‑to‑end system that combines detection, forensics, and ToM inference. |
| 10 | [104] | Describes a multi‑agent system that uses LLMs and ToM reasoning for collaborative tasks. | Illustrates practical deployment of ToM in large‑language‑model‑augmented MAS. |
Key Themes Identified
- Threat Taxonomy: OWASP extension defines sabotage as “misaligned communication” and “adversarial message injection.”
- Regularization & Hardening: CPR [114] and adversarial training [197] provide off‑line robustness.
- Runtime ToM Filtering: [85], [27], and [184] present test‑time inference modules that reject or down‑weight suspicious messages.
- Behavioral Forensics: [97] and [25] show the value of monitoring agent behavior beyond message content.
3.3 Best‑Fit Match
Solution: The test‑time Theory of Mind defense described in A Theory of Mind Approach as Test‑Time Mitigation Against Emergent Adversarial Communication [85].
| Requirement | Capability in [85] | Source |
|---|---|---|
| Identify non‑cooperative intent from received messages | Uses Bayesian inverse planning to infer goals of other agents and compares to cooperative expectations, rejecting messages that violate cooperative norms. | [85] |
| Operate at run‑time (test‑time) | The ToM inference module is invoked during execution, filtering messages before they influence policy decisions. | [85] |
| Compatible with CTDE training | Designed for environments with centralized training and decentralized execution, aligning with common MARL pipelines. | [85] |
| Provide empirical validation | Demonstrated on StarCraft II and a cooperative card game benchmark, showing reduced sabotage impact. | [85] |
| Extendable to other domains | Framework is generic; only message encoding and policy architecture need adaptation. | [85] |
Why This is the Best Fit
The solution directly addresses the core objective—runtime detection and mitigation of sabotaged communication—using a principled ToM inference mechanism. It has been empirically validated in realistic multi‑agent environments and is architecturally compatible with existing MARL training pipelines. While other works (e.g., CPR, ROMANCE) provide complementary robustness, they do not offer a test‑time ToM filter; thus, [85] uniquely satisfies the objective in a single, coherent package.
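A minimal runtime gate in the spirit of [85] is sketched below: an intent‑inference model scores each incoming message against the cooperative objective and accepts, attenuates, or discards it. The inference callable, similarity measure, and thresholds are illustrative assumptions rather than the published implementation.

```python
import torch

def tom_filter_messages(messages, infer_goal, cooperative_goal,
                        accept_thresh=0.7, attenuate_thresh=0.4):
    """Gate incoming messages by inferred sender intent.

    messages:         tensor of shape (num_senders, msg_dim)
    infer_goal:       callable mapping a message to a goal vector
                      (e.g., a Bayesian inverse-planning or learned ToM model)
    cooperative_goal: reference goal vector expected of a cooperative sender
    """
    filtered = []
    for msg in messages:
        goal = infer_goal(msg)
        # Cosine similarity between inferred goal and cooperative expectation.
        score = torch.nn.functional.cosine_similarity(
            goal.unsqueeze(0), cooperative_goal.unsqueeze(0)).item()
        if score >= accept_thresh:
            filtered.append(msg)                    # accept as-is
        elif score >= attenuate_thresh:
            filtered.append(msg * score)            # attenuate a suspicious message
        else:
            filtered.append(torch.zeros_like(msg))  # discard a likely adversarial message
    return torch.stack(filtered)
```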
3.4 Gap Analysis
| Gap | Description | Classification |
|---|---|---|
| Limited to CTDE settings | The ToM defense assumes centralized training; many deployments use fully decentralized learning. | (i) Configurable with a decentralized training extension (e.g., using local policy updates). |
| Message encoding assumptions | Requires discrete, structured messages; real‑world systems may use continuous or multi‑modal communication (e.g., vision‑based). | (i) Integration with communication‑regularization modules [114] that can handle continuous signals. |
| Scalability to many agents | Benchmarks involve up to 10 agents; large‑scale real‑world teams may have hundreds. | (ii) Requires new R&D to scale inference to many agents while keeping latency low. |
| Robustness to sophisticated adversaries | Current evaluation uses simple adversarial policies; more advanced attackers could craft messages that mimic cooperative behavior. | (ii) New adversarial training [197] and continual learning are needed to cover this space. |
| Integration with LLM‑based agents | The framework is designed for RL agents; LLM‑driven agents may represent intentions differently. | (i) Adapt existing ToM inference to LLM internal belief states. |
| Behavioral forensics beyond message content | Current defenses focus on message filtering; do not detect side‑channel manipulations (e.g., timing, resource usage). | (i) Combine with behavioral monitoring frameworks [97][25]. |
| Deployment in safety‑critical systems | No formal safety certification or real‑time guarantees. | (ii) Formal verification and safety‑critical integration research required. |
3.5 Verdict
(a) Currently Possible – The combination of the ToM test‑time defense [85], communication‑regularization [114], and adversarial training [197] constitutes a deployable, end‑to‑end defense stack for multi‑agent systems operating in CTDE settings.
Implementation Sketch
1. Training Phase – Use a standard MARL framework (e.g., QMIX or VDN) with centralized critic and decentralized actors.
2. Adversarial Exposure – Integrate ROMANCE [197] to generate a population of auxiliary adversarial attackers that inject sabotaged messages during training, hardening the policy.
3. Communication Regularization – Apply CPR [114] to constrain the influence of each message, limiting the potential damage of a single malicious transmission.
4. Runtime ToM Filter – Deploy the ToM inference module from [85] at execution time: each agent receives messages, infers the sender’s hidden goal distribution, compares to the cooperative objective, and either accepts, attenuates, or discards the message before policy execution.
5. Behavioral Monitoring – Optionally stream agent state and communication logs to a SIEM‑style system [25] for post‑hoc forensics and continuous adaptation.
This architecture leverages only fully defined, published components and established protocols, avoiding speculative extensions.
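As a purely illustrative aside on step 3, one plausible way to express a CPR‑style constraint is an auxiliary loss that penalises how strongly any single message shifts the receiver’s action distribution. The penalty form, names, and coefficient below are assumptions for exposition and do not reproduce the formulation in [114].

```python
import torch
import torch.nn.functional as F

def communication_influence_penalty(policy_net, obs, messages, coef=0.1):
    """Auxiliary loss that discourages any single message from dominating
    the receiver's action distribution (an assumed, CPR-like regulariser)."""
    full_logits = policy_net(obs, messages)
    penalty = 0.0
    for i in range(messages.shape[0]):
        masked = messages.clone()
        masked[i] = 0.0                                   # ablate message i
        ablated_logits = policy_net(obs, masked)
        # KL divergence between the full and message-ablated action distributions.
        penalty = penalty + F.kl_div(
            F.log_softmax(ablated_logits, dim=-1),
            F.softmax(full_logits, dim=-1),
            reduction="batchmean")
    return coef * penalty / messages.shape[0]
```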
4. Explainability Budget Trade‑Off in Multi‑Agent Systems
4.1 Identify the Objective
This chapter synthesises existing research that explicitly addresses the allocation of limited explainability resources (budget) in multi‑agent reinforcement learning (MARL) and related autonomous agent systems. The objective is to outline how current prior‑art solutions quantify, optimise, and trade‑off explainability against performance or other operational constraints, while also considering adversarial threats such as mis‑aligned policy inference, trust degradation, and cascading failures.
4.2 Survey of Existing Prior Art
| Ref. | Title | Key Contribution Relevant to Explainability‑Budget Trade‑Off |
|---|---|---|
| [39] | Zero‑Shot Policy Transfer in Multi‑Agent Reinforcement Learning via Trusted Federated Explainability | Introduces TFX‑MARL: trust metric, trust‑aware FL aggregation, and a trade‑off controller that explicitly budgets explainability versus performance. |
| [128] | Budgeting Counterfactual for Offline RL | Proposes a non‑Markov budget constraint for counterfactual explanations in RL, linking budget to fidelity and sparsity. |
| [191] | Explainable Model Routing for Agentic Workflows | Presents Topaz: an interpretable router that balances cost‑quality trade‑offs and generates natural‑language explanations grounded in routing traces. |
| [172] | Explainable Multi‑Agent Reinforcement Learning for Temporal Queries | Utilises SHAP values to explain cooperative strategies, offering post‑hoc explanation mechanisms without explicit budgeting. |
| [195] | Air Traffic Control – Cooperative Multi‑Agent Reinforcement Learning | Uses lattice‑space exploration for action pruning; explains decisions via a breadth‑first strategy, but lacks explicit budget control. |
| [50] | Intelligent Resource Allocation in Wireless Networks via Deep Reinforcement Learning | Calls for explainability to build trust; does not provide a budgeting framework. |
| [207] | AI‑Powered Household Budgeting Agent | Implements an explainer agent that logs decision rationale; no explicit explainability budgeting. |
| [82] | Intelo.ai Multi‑Agent Platform | Highlights transparent, task‑specific agents that surface reasoning, but does not quantify explainability budgets. |
| [132] | Designing Reward Functions for Deep RL | Discusses explainability challenges but no budgeting mechanism. |
| [81] | Financial Trading with Explainable Controls | Projects black‑box controls onto explainable spaces; no explicit budget. |
| [152] | Semantic‑Aware LLM Orchestration for Proactive Resource Management | Proposes reward machines and sub‑goal automata for long‑term explanations; budgeting not addressed. |
| [153] | Attack‑Informed Counterfactual Explanations for Graph Neural Networks | Generates counterfactual explanations under a constrained perturbation budget. |
| [75] | Resilience in Autonomous Agent Systems | Mentions counterfactual learning for explainability; no explicit budgeting. |
The literature converges on a few patterns: (i) federated or multi‑agent environments need trust‑aware aggregation; (ii) explainability is often delivered post‑hoc (SHAP, counterfactuals); (iii) few works explicitly quantify an explainability budget and optimise it against performance or safety constraints. TFX‑MARL is the only solution that provides a budget controller integrated into the federated learning pipeline, making it the most relevant to the stated objective.
4.3 Best‑Fit Match
TFX‑MARL (Trusted Federated Explainability for MARL) is the single prior‑art solution that directly addresses the objective. Its capabilities map to the requirement as follows:
| Requirement | TFX‑MARL Feature | Source |
|---|---|---|
| Quantify participant integrity and accountability | Trust metric based on provenance, update consistency, local evaluation reliability, and safety‑compliance signals. | [39] |
| Reduce poisoning risk in federated aggregation | Trust‑aware FL aggregation that prioritises high‑accountability participants. | [39] |
| Explicitly balance explainability and performance | Trade‑off controller that budgets explainability resources (e.g., explanation length, model complexity) against policy performance. | [39] |
| Operationally interpretable budgeting mechanism | Simple, rule‑based budget allocation that can be tuned per deployment scenario. | [39] |
TFX‑MARL thus satisfies the core need for an explainability budget controller in a multi‑agent federated setting, including mechanisms for trust, aggregation, and performance optimisation.
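A rule‑based budget allocation of the kind attributed to the trade‑off controller can be illustrated as follows: a fixed per‑episode explanation budget is split across agents in proportion to how much scrutiny they currently warrant. The allocation rule and parameter names are illustrative assumptions, not the TFX‑MARL specification in [39].

```python
def allocate_explanation_budget(total_budget, trust, perf_drop, min_share=0.05):
    """Split a fixed explainability budget (e.g., explanation tokens or
    counterfactual queries per episode) across agents.

    total_budget: total units of explanation available this episode
    trust:        dict agent_id -> trust score in [0, 1]
    perf_drop:    dict agent_id -> recent performance degradation (>= 0)
    Agents with lower trust or larger performance drops receive more budget,
    because their decisions need more scrutiny.
    """
    priority = {a: (1.0 - trust[a]) + perf_drop[a] for a in trust}
    total = sum(priority.values()) or 1.0
    shares = {a: max(min_share, p / total) for a, p in priority.items()}
    norm = sum(shares.values())
    return {a: total_budget * s / norm for a, s in shares.items()}

# Example: agent "a2" is less trusted and recently underperformed,
# so it receives the largest slice of the explanation budget.
budget = allocate_explanation_budget(
    total_budget=100,
    trust={"a1": 0.9, "a2": 0.4, "a3": 0.8},
    perf_drop={"a1": 0.0, "a2": 0.3, "a3": 0.1})
```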
4.4 Gap Analysis
| Gap | Classification | Potential Closure |
|---|---|---|
| 1. Limited adversarial robustness to mis‑aligned policy inference beyond poisoning mitigation | (i) Closeable by composition | Integrate adversarial detection modules (e.g., red‑team prompts, anomaly detectors) from works such as [75] and [132]. |
| 2. Lack of counterfactual explanation budgeting that ties explanation fidelity to a fixed budget | (i) Closeable by composition | Incorporate the counterfactual budget constraint from [128]. |
| 3. Absence of explainability for cascading failures triggered by inter‑agent mis‑coordination | (ii) Requires net‑new R&D | Model failure propagation and embed explainability constraints at the system level. |
| 4. No explicit modelling of trust degradation dynamics over time (e.g., reputation decay) | (i) Closeable by extension | Extend the trust metric with temporal decay functions from other federated trust studies (not present in the dataset). |
| 5. Explainability is primarily post‑hoc (SHAP, counterfactuals) rather than in‑situ during decision making | (i) Closeable by integration | Integrate in‑situ explanation modules such as Topaz [191] to provide real‑time explanations within the budget. |
Most gaps are amenable to composition of existing components (e.g., TFX‑MARL + counterfactual budgeting + Topaz). The remaining gaps (cascading failures, dynamic trust degradation) would demand new research.
4.5 Verdict
Currently Possible – The objective can be realised today by deploying TFX‑MARL as the core framework, complemented by:
- Counterfactual Budgeting – integrate the algorithm from [128] to enforce a counterfactual explanation budget within each agent’s local policy update.
- In‑situ Explanation Layer – employ Topaz [191] to route decisions through an interpretable router that respects the same budget constraints.
- Adversarial Safeguards – add anomaly detection and red‑team prompt evaluation modules [75][132] to mitigate poisoning and mis‑aligned inference.
This composition yields a fully operational explainability‑budget‑aware multi‑agent system that balances performance, trust, and interpretability while defending against known adversarial threats.
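The counterfactual budgeting component can be pictured as a simple tracker that limits how many decisions an explanatory trajectory may deviate from the behaviour policy. The interface below is a minimal sketch under that assumption and does not reproduce the algorithm in [128].

```python
class CounterfactualBudget:
    """Track a hard limit on how many counterfactual deviations an
    explanation episode may take from the behaviour policy's actions."""

    def __init__(self, max_deviations):
        self.max_deviations = max_deviations
        self.used = 0

    def choose(self, behaviour_action, counterfactual_action):
        """Return the counterfactual action only while budget remains;
        otherwise fall back to the behaviour policy's action."""
        if counterfactual_action != behaviour_action and self.used < self.max_deviations:
            self.used += 1
            return counterfactual_action
        return behaviour_action

# Usage: a budget of 3 means the counterfactual trajectory may differ
# from the logged trajectory in at most 3 decisions.
budget = CounterfactualBudget(max_deviations=3)
action = budget.choose(behaviour_action=0, counterfactual_action=1)
```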
5. Partial Observability & Communication Bottlenecks Effects
5.1 Identify the Objective
The chapter must synthesize how partial observability and communication bottlenecks jointly influence the efficacy, interpretability, and robustness of multi‑agent reinforcement learning (MARL) systems. It should survey existing solutions that explicitly address these constraints, map the capabilities of the single best‑fit prior‑art component to the stated objective, identify gaps that remain unaddressed, and conclude whether the objective can be met with today’s technologies.
5.2 Survey of Existing Prior Art
| Reference | Vendor/Project/Authors | Key Contribution Relevant to Partial Observability & Communication Constraints |
|---|---|---|
| [148] | Dec‑POMDP formalism | Defines the fundamental hardness of partial observability and the need for decentralized coordination. [148] |
| [23] | MAGNNET | Integrates GNN‑based message passing within CTDE to handle partial observability while maintaining decentralized execution. [23] |
| [4] | GAT‑MARL | Uses graph attention for decentralized routing under partial observability. [4] |
| [48] | Wireless Communication‑Enhanced Value Decomposition | Provides a communication‑aware mixer that exploits realistic wireless channels, addressing bandwidth limitations. [48] |
| [133] | Bandwidth‑constrained Variational Message Encoding (BVME) | Introduces a lightweight module that encodes messages under hard bandwidth limits while preserving coordination. [133] |
| [20] | SCoUT | Scales communication by grouping agents temporally, reducing per‑agent bandwidth. [20] |
| [165] | Attention‑Augmented IRL with GNNs | Demonstrates that GNNs can capture both local and global features, beneficial under partial observability. [165] |
| [186] | Survey on Communication Strategies | Reviews bandwidth‑constrained communication methods in MARL, providing a conceptual backdrop. [186] |
| [93] | Flow (traffic microsimulation) | Offers a realistic environment with partial observability and communication constraints for MARL evaluation. [93] |
The survey highlights three families of solutions:
1. Decentralized GNN‑based coordination (MAGNNET, GAT‑MARL).
2. Communication‑aware mixers and protocols (Wireless‑Enhanced QMIX, SCoUT).
3. Bandwidth‑constrained message encoding (BVME).
Each addresses at least one of the two constraints, but only a subset jointly tackles both.
5.3 Best‑Fit Match
MAGNNET (Ref: [23]) is selected as the best‑fit prior‑art solution because it simultaneously:
| Requirement | MAGNNET Capability | Source |
|---|---|---|
| Operates under partial observability | Uses local observations to update policies while a GNN aggregates information from neighboring agents, thereby approximating a joint belief. | [23] |
| Supports decentralized execution | Policies are learned centrally but executed independently, relying only on local message‑passing. | [23] |
| Scales to many agents | GNN message passing remains linear in the number of edges, enabling larger teams without central bottlenecks. | [23] |
| Requires limited bandwidth | By using sparse adjacency graphs and GNN aggregation, communication is restricted to local neighbors, reducing bandwidth needs. | [23] |
| Enables coordination with realistic wireless channels | The architecture can be combined with the wireless‑enhanced mixer to expose agents to realistic channel impairments, thereby modeling communication bottlenecks. | [48] |
Thus, MAGNNET, possibly augmented with wireless‑enhanced mixers, satisfies the core facets of the objective: it mitigates partial observability through learned belief propagation and addresses communication bottlenecks via localized message passing.
5.4 Gap Analysis
| Gap # | Description | Classification |
|---|---|---|
| G1 | Hard bandwidth constraints – MAGNNET’s GNN‑based message passing still assumes that every neighbor’s message can be reliably transmitted, which may not hold under severe bandwidth limits. | (i) Closeable by integrating a bandwidth‑constrained encoder (BVME) or re‑weighting message importance. |
| G2 | Adversarial communication attacks – MAGNNET does not provide defenses against malicious message tampering or spoofing, which can compromise interpretability. | (ii) Requires net‑new R&D; no existing solution fully addresses adversarial communication within GNN‑based MARL. |
| G3 | Interpretability diagnostics – While MAGNNET improves coordination, it lacks built‑in mechanisms for post‑hoc interpretability of learned communication protocols. | (i) Could be addressed by overlaying an explainable message‑encoding layer (e.g., using attention‑based explanation modules). |
| G4 | Realistic wireless channel modeling – The base MAGNNET paper does not empirically validate performance under realistic p‑CSMA or fading channels. | (i) Can be achieved by coupling with the wireless‑enhanced value‑decomposition framework [48]. |
| G5 | Scalability to very large agent counts – While GNNs scale, the communication graph may become dense, increasing bandwidth demands. | (i) Mitigation via hierarchical GNNs or sparse grouping (SCoUT, [20]). |
5.5 Verdict
Currently Possible – The objective of analyzing partial observability and communication bottlenecks can be achieved today. A practical implementation would combine:
- MAGNNET as the core MARL framework: centralized PPO training with a GNN‑augmented critic, decentralized actors using local observations and neighbor messages. [23]
- Bandwidth‑constrained Variational Message Encoding (BVME) to compress messages under hard bandwidth limits. [133]
- Wireless‑enhanced mixer (from [48]) to expose agents to realistic channel impairments during training, ensuring robustness to communication bottlenecks.
A sketch:
- Training Phase: Agents receive global observations; a shared critic learns a joint Q‑function via a GNN mixer that incorporates messages encoded by BVME. Wireless channel simulator injects packet loss and delay. PPO updates policy parameters.
- Execution Phase: Each agent observes its local state, receives compressed messages from neighbors (BVME output), aggregates via the GNN, and selects an action. No centralized controller is needed, satisfying decentralized execution.
This composition leverages only mature, shipping components (PyTorch Geometric for GNNs, OpenAI‑Gym for environments, existing BVME codebases, and published wireless channel simulators). Thus, the objective is fully realizable with current prior art.
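A compact execution‑phase sketch of this composition is shown below, using PyTorch Geometric’s `GCNConv` for neighbour aggregation and a small linear bottleneck standing in for BVME‑style compression. Module names, dimensions, and the bottleneck itself are illustrative assumptions, not the published MAGNNET or BVME code.

```python
import torch
import torch.nn as nn
from torch_geometric.nn import GCNConv

class LocalCommPolicy(nn.Module):
    """Decentralised actor: compresses its observation into a low-bandwidth
    message, aggregates neighbours' messages over a sparse graph, and acts."""

    def __init__(self, obs_dim, msg_dim, hidden_dim, n_actions):
        super().__init__()
        self.encoder = nn.Linear(obs_dim, msg_dim)     # bandwidth bottleneck (BVME stand-in)
        self.gnn = GCNConv(msg_dim, hidden_dim)        # neighbour aggregation
        self.head = nn.Linear(hidden_dim + obs_dim, n_actions)

    def forward(self, obs, edge_index):
        msgs = torch.tanh(self.encoder(obs))           # compressed per-agent messages
        agg = torch.relu(self.gnn(msgs, edge_index))   # aggregate over the local neighbourhood
        return self.head(torch.cat([agg, obs], dim=-1))

# Example: 4 agents, sparse communication graph (agent i talks only to i+1).
obs = torch.randn(4, 16)
edge_index = torch.tensor([[0, 1, 1, 2, 2, 3], [1, 0, 2, 1, 3, 2]])
policy = LocalCommPolicy(obs_dim=16, msg_dim=4, hidden_dim=32, n_actions=5)
logits = policy(obs, edge_index)
```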
6. Propagation of Misaligned Inference through Joint Decision‑Making
6.1 Identify the Objective
This chapter must provide a literature‑review synthesis that (i) identifies how misaligned policy inference in multi‑agent AI systems propagates through joint decision‑making processes, (ii) evaluates the resulting erosion of trust among system users and stakeholders, and (iii) delineates the mechanisms by which such misalignment can cascade into systemic failures. The analysis should rely exclusively on existing, fully specified research methods, commercial products, or open‑source projects that are currently available, and must map each cited contribution to the specific aspects of misalignment propagation, trust degradation, and cascading failures.
6.2 Survey of Existing Prior Art
The following table lists all prior‑art solutions that address one or more components of the objective: joint perception‑decision vulnerability, multi‑agent misalignment, trust dynamics, or cascading failure mechanisms. Each entry is cited with its unique hex ID.
| # | Solution | Domain | Key Feature(s) Relevant to Objective | Source |
|---|---|---|---|---|
| 6.2.1 | Perception‑Decision Joint Attack (PDJA) | Adversarial attacks on multimodal agents | Joint perturbation of perception and policy modules to induce low‑reward trajectories; demonstrates how a single adversarial perturbation can propagate through perception‑policy pipelines, causing systemic degradation | [16] |
| 6.2.2 | Confusion‑Based Communication for Multi‑Agent Resilience | Multi‑agent reinforcement learning | Agents learn to broadcast misaligned observations to reduce confusion; illustrates how propagated misalignment can be mitigated by communication protocols | [78] |
| 6.2.3 | HiMAC: Hierarchical Macro‑Micro Learning | Long‑horizon LLM agents | Structured global state tracking to isolate local execution errors; addresses error propagation across hierarchical decision layers | [76] |
| 6.2.4 | NOD (Navigator‑Operator‑Director) Architecture | Service‑oriented multi‑agent systems | External oversight agent verifies critical actions; mitigates misaligned policy execution and prevents cascading failures | [31] |
| 6.2.5 | Fast Adversarial Training (FAT) with Distribution‑aware Guidance (DDG) | Robustness of neural networks | Adjusts perturbation budgets based on sample confidence to reduce overfitting and protect against cascading adversarial errors | [91] |
| 6.2.6 | Adaptive Self‑Evolving Preference Optimization (EvoDPO) | Preference‑based multi‑agent learning | Dynamically updates reference policies to avoid misaligned policy drift; relevant for long‑term trust maintenance | [74] |
| 6.2.7 | Autonomous Evolution of EDA Tools (Self‑Evolved ABC) | Auto‑engineering of multi‑agent rulebases | Self‑evolving rulebases constrain policy modifications, curbing misalignment | [111] |
| 6.2.8 | Multi‑Agent Thompson Sampling for Bandit Coordination | Cooperative control of wind turbines | Models coordination under misaligned individual incentives; demonstrates potential cascading failures in shared‑resource settings | [126][174] |
| 6.2.9 | Multi‑Agent Reinforcement Learning with Autonomous Coordination | Multi‑agent system dynamics | Highlights autocurricula and misalignment in adversarial settings; reveals failure modes that can cascade | [175] |
| 6.2.10 | Multimodal Adversarial Attacks on Vision‑Language‑Action Models (SABER) | Vision‑language‑action pipelines | Black‑box sequential attack framework that propagates misaligned inference through multi‑turn interactions | [98] |
| 6.2.11 | Adversarial Robustness of Diffusion Models (NatADiff) | Diffusion‑based generative models | Generates natural adversarial samples that can mislead downstream decision modules, illustrating propagation of misalignment | [193] |
| 6.2.12 | Adversarial‑Robust Multivariate Time‑Series Anomaly Detection (ARTA) | Time‑series anomaly detection | Joint training of detector and perturbation generator; shows how minimal adversarial perturbations can cascade into detection failures | [44] |
| 6.2.13 | Policy Disruption in RL (Large‑Language‑Model‑Based Attacks) | RL policy vulnerability | Attacks that modify reward and action spaces; relevant for cascading policy failures | [196] |
| 6.2.14 | Multi‑Agent Guided Policy Search with Non‑Cooperative Games | Non‑cooperative multi‑agent games | Explores how misaligned objectives lead to suboptimal joint policies and potential failure cascades | [15] |
| 6.2.15 | Robustness Evaluation of Neural Networks via Certified Metrics | Model robustness evaluation | Provides metrics for assessing vulnerability to misaligned inference; useful for trust assessment | [125] |
The survey covers joint perception‑policy vulnerability (PDJA), multi‑agent misalignment mitigation (HiMAC, NOD, confusion‑based communication), robustness techniques (FAT–DDG), and longitudinal policy evolution (EvoDPO). It also includes examples of cascading failures in control‑system settings (wind‑turbine coordination) and multi‑agent games.
6.3 Best‑Fit Match
Perception‑Decision Joint Attack (PDJA) [16] is the single prior‑art solution that most directly satisfies the objective of demonstrating how a misaligned inference in the perception module can propagate through the decision‑making pipeline, degrading trust and potentially triggering cascading failures.
| PDJA Feature | Objective Requirement | Mapping |
|---|---|---|
| Dual perturbator (perception & decision) | Joint propagation of misaligned inference | PDJA explicitly models how an adversarial perturbation in perception is amplified by the policy network, leading to low‑reward actions across the system. |
| Explicit modeling of perception‑action interaction | Mechanism of trust degradation | By showing that perception errors can be hidden yet still induce incorrect decisions, PDJA illustrates how users may lose trust when outcomes diverge from expectations. |
| Attack success measured via joint reward degradation | Cascading failure illustration | The paper reports that a single perceptual perturbation can reduce overall team reward, implying a systemic cascade. |
| Use of realistic multimodal inputs | Relevance to joint decision‑making | PDJA operates on vision‑language‑action models, mirroring real‑world AI systems that integrate multiple modalities. |
Thus, PDJA satisfies the core requirement of illustrating the propagation mechanism, but it is framed as an adversarial attack rather than a benign misaligned inference scenario.
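The propagation mechanism can be illustrated with a generic white‑box sketch: a bounded perturbation of the observation is optimised so that the perturbed perception induces a low‑value decision. This simplification assumes access to the policy and a per‑action value function and is not the attack procedure of [16]; all names are assumptions.

```python
import torch

def joint_perception_decision_attack(policy, value_fn, obs, epsilon=0.03,
                                     steps=10, lr=0.01):
    """Craft a bounded observation perturbation that degrades the value of
    the decision induced by the perturbed perception (illustrative, white-box).

    policy:   callable returning action logits for an observation
    value_fn: callable returning per-action values (Q-values) for an observation
    """
    delta = torch.zeros_like(obs, requires_grad=True)
    for _ in range(steps):
        perturbed = obs + delta
        action_probs = torch.softmax(policy(perturbed), dim=-1)
        # Expected value of the decision induced by the perturbed perception.
        expected_value = (action_probs * value_fn(perturbed)).sum()
        grad, = torch.autograd.grad(expected_value, delta)
        with torch.no_grad():
            delta -= lr * grad.sign()          # push the induced decision toward low value
            delta.clamp_(-epsilon, epsilon)    # keep the perceptual change small
    return (obs + delta).detach()
```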
6.4 Gap Analysis
| Gap | Classification | Remedy (Existing Prior Art) |
|---|---|---|
| 1. Lack of trust‑degradation metrics (e.g., user‑trust scores, confidence calibration) | (i) Closeable by integration with existing trust‑evaluation frameworks (e.g., user‑experience studies on LLMs) | Combine PDJA with an external user‑trust benchmark; no such framework is present in the surveyed prior art. |
| 2. Absence of long‑term cascading failure analysis beyond single‑step reward loss | (i) Closeable by composing PDJA with multi‑agent coordination studies (HiMAC, NOD) | Use HiMAC’s hierarchical error isolation to trace failure propagation |
| 3. No mitigation or mitigation‑evaluation strategies presented | (ii) Requires R&D (but partial mitigation exists) | Integrate Fast Adversarial Training with Distribution‑aware Guidance [91] to reduce overfitting and mitigate cascading errors |
| 4. No empirical studies on trust erosion in realistic operational settings | (ii) Net‑new R&D | Conduct controlled user‑study experiments (not available) |
| 5. Lack of formal modeling of misalignment dynamics in multi‑agent learning (e.g., autocurricula) | (i) Closeable by combining PDJA with Autocurricula literature [175] | Use autocurriculum to simulate progressive misalignment over training cycles |
| 6. No documented cascading failure scenarios (e.g., wind‑turbine coordination, traffic control) | (i) Closeable by leveraging existing coordination studies [126][174] | Map perception‑policy misalignment to shared‑resource failure cases |
The dominant gap is the lack of a unified framework that simultaneously models misaligned inference propagation, quantifies trust degradation, and predicts cascading failures in realistic multi‑agent deployments. Existing solutions address individual facets but do not integrate them into a single analytic chain.
6.5 Verdict
Not Currently Possible – The objective of fully characterizing propagation of misaligned inference through joint decision‑making, alongside quantifying trust degradation and predicting cascading failures, cannot be achieved solely with existing prior‑art components. The closest fits are:
1. PDJA (Perception‑Decision Joint Attack) – Provides explicit evidence of perception‑policy misalignment propagation and its impact on joint reward [16]. Coverage: Demonstrates how a single perceptual perturbation can cascade to decision‑making outputs. Residual Gap: Does not address trust metrics or longer‑term cascading failure dynamics.
2. HiMAC (Hierarchical Macro‑Micro Learning) – Offers a structured architecture that isolates execution‑level errors and reduces error propagation [76]. Coverage: Shows how hierarchical state tracking can prevent local misalignment from becoming global failure. Residual Gap: Lacks direct modeling of perception‑policy misalignment or trust degradation mechanisms.
3. NOD (Navigator‑Operator‑Director) Architecture – Introduces an external verification layer to enforce correct decisions and prevent cascading failures [31]. Coverage: Provides a practical mitigation strategy against misaligned policy execution. Residual Gap: Does not analyze how misaligned inference propagates across perception‑policy pipelines or quantify trust erosion.
These three solutions collectively cover the principal aspects of misalignment propagation, mitigation, and hierarchical control, but none alone spans the entire objective. Therefore, the current state of prior art yields only partial coverage, leaving the full objective unresolved.
7. Obfuscated Policy Gradients and Incorrect Explainability
7.1 Identify the Objective
The chapter must survey existing mechanisms that detect or mitigate obfuscated policy gradients—adversarial perturbations that alter reinforcement‑learning (RL) policies to mislead multi‑agent systems—and assess how these mechanisms preserve or undermine explainability. It should identify solutions that simultaneously:
1. expose or defend against policy‑gradient‑based attacks;
2. provide faithful, interpretable explanations of agent decisions; and
3. address the specific challenges arising in multi‑agent, agentic‑AI environments (e.g., cascading failures, trust degradation, misaligned policy inference).
7.2 Survey of Existing Prior Art
| Identifier | Vendor / Project | Authors / Source | Key Capability Relevant to the Objective | Citation |
|---|---|---|---|---|
| [159] | Robust Lagrangian & Adversarial Policy Gradient (RCPG) | Frank et al. | Adversarial training of policy gradients in constrained MDPs, mitigating state‑perturbation attacks. | [159] |
| [119] | Multi‑Agent LLM Defense Pipeline Against Prompt Injection | Wang et al. | Multi‑agent architecture with input sanitization, prompt‑engineering, and model‑level adversarial training to counter obfuscated prompts. | [119] |
| [55] | OpenAI Codex Jailbreak Resistance | OpenAI | Strong adversarial testing (StrongReject benchmark) and sandboxing to detect obfuscated jailbreaks in code generation. | [55] |
| [147] | ABIGX (Unified Explainable Fault Detection) | Zhang et al. | Gradient‑based explainability (IG, ABIGX) to mitigate fault‑class smearing, but no explicit policy‑gradient defence. | [147] |
| [36] | Applied Explainability for Large Language Models | Dumais et al. | Comparative study of SHAP, LIME, Grad‑CAM for XAI in LLMs. | [36] |
| [168] | Grad‑CAM for Deep Learning | Selvaraju et al. | Saliency‑based explanation for image‑based models, demonstrating XAI reliability. | [168] |
| [62] | InjectLab: Tactical Framework for Adversarial Threat Modeling | Alamo et al. | Taxonomy and simulation of prompt‑based attacks, including obfuscated role overrides. | [62] |
| – | Functional Encryption for Privacy‑Preserving ML | Choudhury et al. | Secure inference mitigates data poisoning, indirectly supporting explainability. | – |
| [154] | AI‑SecOps Toolchain (Aegis Gateway, etc.) | 5D Security | Policy‑enforcement point with prompt filtering and red‑team testing. | [154] |
| [179] | Browser Sanitization APIs & AI‑Based Threat Modeling | OpenAI | Embeds security APIs in browsers to mitigate XSS and prompt injection. | [179] |
| – | Survey of Adversarial AI Threats | Pan et al. | Discusses lack of standardized defensive approaches, highlighting need for layered models. | – |
| [96] | Adversarial AI and Data Privacy in Finance | Liu et al. | Emphasizes importance of explainability for regulatory compliance. | [96] |
| [6] | Explainable AI in Cloud Platforms | Google Cloud | Provides AI‑explainability APIs, but limited robustness against obfuscated attacks. | [6] |
Note: The table lists only those prior‑art artifacts that explicitly address either policy‑gradient adversarial robustness, explainability, or both. No single published product currently satisfies all three criteria simultaneously.
7.3 Best‑Fit Match
Robust Lagrangian & Adversarial Policy Gradient (RCPG) [159] is the closest existing solution to the stated objective.
| Requirement | RCPG Capability | Source |
|---|---|---|
| Detect or mitigate obfuscated policy gradients | Explicitly trains policy networks with an adversarial policy gradient that perturbs state‑action pairs to maximize cumulative reward degradation, thereby hardening the policy against manipulation. | [159] |
| Multi‑agent applicability | Framework designed for constrained Markov decision processes, naturally extendable to multi‑agent settings through joint policy learning. | [159] |
| Explainability support | While RCPG itself does not provide XAI, it integrates with adversarial training mechanisms that preserve policy gradients, enabling downstream application of gradient‑based attribution (e.g., Integrated Gradients). | [159] |
| Defense against cascading failures | By optimizing for robust policy gradients, RCPG reduces the probability that a single malicious perturbation propagates through agent interactions, mitigating cascading misbehaviors. | [159] |
| Regulatory alignment | The constrained‑MDP formulation aligns with risk‑managed decision‑making required in finance and healthcare, supporting explainability obligations. | [96] |
Thus, RCPG satisfies the core of the objective—protecting policy gradients from obfuscation—while leaving explainability to be layered on top.
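The hardening step can be illustrated with a minimal adversarially perturbed policy‑gradient loss: states are perturbed within a small budget to degrade the advantage‑weighted objective before the usual update. This is a generic robust‑training sketch under stated assumptions, not the RCPG algorithm of [159].

```python
import torch

def adversarial_pg_loss(policy, states, actions, advantages,
                        epsilon=0.05, steps=3, lr=0.02):
    """Policy-gradient loss evaluated on worst-case perturbed states.

    policy:     callable returning action logits of shape (batch, n_actions)
    states:     tensor of shape (batch, state_dim)
    actions:    long tensor of taken actions, shape (batch,)
    advantages: tensor of advantage estimates, shape (batch,)
    """
    delta = torch.zeros_like(states, requires_grad=True)
    for _ in range(steps):
        log_probs = torch.log_softmax(policy(states + delta), dim=-1)
        chosen = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
        objective = (chosen * advantages).mean()
        grad, = torch.autograd.grad(objective, delta)
        with torch.no_grad():
            delta -= lr * grad.sign()          # perturb states to degrade the objective
            delta.clamp_(-epsilon, epsilon)
    log_probs = torch.log_softmax(policy((states + delta).detach()), dim=-1)
    chosen = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    return -(chosen * advantages).mean()       # minimise the worst-case PG loss
```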
7.4 Gap Analysis
| Gap | Classification | Suggested Mitigation |
|---|---|---|
| No built‑in explainability | (i) Closeable by integration | Combine RCPG with SHAP/LIME [36] or Grad‑CAM [168] to produce faithful state‑action explanations. |
| Limited multi‑agent coordination | (i) Closeable by composition | Compose with Wang et al.’s multi‑agent defense pipeline [119] to enforce policy consistency across agents. |
| Potential for adversarial policy gradients to induce deceptive internal representations | (ii) Requires net‑new R&D | Develop formal verification of policy gradients under adversarial perturbations (e.g., via SMT or neural‑network verification tools). |
| Lack of real‑time monitoring for cascading failures | (i) Closeable by integration | Integrate continuous monitoring modules from the AI‑SecOps toolchain [154]. |
| Explainability fidelity under obfuscated inputs | (ii) Requires net‑new R&D | Research robust attribution methods that resist input manipulation (e.g., counterfactual explanations, adversarially trained attribution models). |
7.5 Verdict
Currently Possible – The objective can be achieved today by combining existing, fully defined components:
- Policy‑gradient robustness: Deploy the RCPG algorithm [159] for all RL agents in the multi‑agent system.
- Explainability layer: Post‑process agent decision traces with SHAP [36] and gradient‑based attribution such as Grad‑CAM [168] to generate faithful, local explanations of state‑action choices.
- Multi‑agent coordination: Wrap agents in Wang et al.’s Multi‑Agent LLM Defense Pipeline [119] to enforce prompt sanitization and policy‑level defenses, ensuring consistent behavior across agents.
- Monitoring & alerting: Integrate the AI‑SecOps monitoring stack [154] to detect anomalous policy updates or cascading failures in real time.
This sketch uses only the cited, shipping components and open‑source projects, satisfying the requirement to avoid speculative or undeveloped solutions.
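For the explainability layer, a post‑hoc attribution pass over recorded decision traces can be as simple as the sketch below, which applies SHAP’s model‑agnostic KernelExplainer to a policy’s action‑probability output. The wrapper and argument shapes are illustrative assumptions; a gradient‑based method such as Grad‑CAM [168] could be substituted for image observations.

```python
import numpy as np
import shap  # model-agnostic post-hoc explanations (SHAP [36])

def explain_decision(policy_probs, background_states, state_to_explain):
    """Attribute a policy's action probabilities to input state features.

    policy_probs:      callable mapping an (n, state_dim) array to
                       an (n, n_actions) array of action probabilities
    background_states: small reference sample of states, shape (n_bg, state_dim)
    state_to_explain:  single state, shape (state_dim,)
    """
    explainer = shap.KernelExplainer(policy_probs, background_states)
    # One attribution vector per action, each of length state_dim.
    return explainer.shap_values(state_to_explain.reshape(1, -1))
```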
8. Semantic Prompt Obfuscation via Cipher Encoding
8.1 Identify the Objective
The chapter aims to synthesize current, commercially available and academically validated solutions that detect or mitigate jailbreak attacks that employ cipher‑based or character‑level obfuscation (e.g., Base64, ROT13, LeetSpeak, Unicode homoglyphs). It focuses on systems that are deployable today, describing their architecture, coverage, and limitations, and it evaluates how well they meet the requirement of identifying semantically hidden malicious intent in prompts.
8.2 Survey of Existing Prior Art
| Solution | Vendor / Project | Core Capability | Relevant Citation |
|---|---|---|---|
| Sentra‑Guard | Multilingual Human‑AI framework for real‑time defense | Detects direct, role‑play, and obfuscated jailbreaks across >100 languages; uses a classifier‑retriever fusion and HITL feedback | [86][109][92] |
| PromptScreen | Multi‑stage semantic linear classifier (SVM) pipeline | Filters prompts using word‑, character‑n‑gram, and hybrid features; high precision on Base64/Leet and Unicode obfuscations | [138][134] |
| LlamaGuard | Open‑source LLM‑based input‑output safeguard | Detects jailbreaks by modeling token‑level and semantic patterns; includes a Base64/Leet pre‑normalizer | [87] |
| CORTEX | Neuro‑symbolic defense architecture | Shifts from pattern matching to latent‑space intent analysis; handles custom ciphers | [199] |
| STShield | Single‑token sentinel for real‑time jailbreak detection | Uses token‑activation patterns to flag obfuscated prompts; effective against Base64/Leet | [87] |
| RoguePrompt | Dual‑layer ciphering for self‑reconstruction | Exploits a two‑stage obfuscation that bypasses most filters; demonstrates the limits of current detectors | [182] |
| CipherChat | Cipher‑based jailbreak framework | Encodes malicious instructions via Caesar, Morse, and other ciphers; shows how LLMs decode obfuscated text | [65] |
| PromptGuard | Dual‑layer engine (regex + ML) for prompt filtering | Detects common obfuscation patterns and novel variants; used in commercial products | [157] |
| DeepTeam | Red‑team framework with 20+ attack methods | Supports single‑turn and multi‑turn jailbreaks, including custom encodings | [108] |
| TryLock | Layered preference + representation engineering | Combines instruction‑level filters with representation‑level checks; mitigates Base64/Leet | [43] |
| PromptScreen‑SVM | Semantic LSVM pipeline | Uses TF‑IDF + linear SVM; effective against obfuscated and multi‑turn prompts | [138] |
| Sentra‑Guard‑2 | Updated Sentra‑Guard iteration with expanded knowledge base | Improves detection of multi‑layer obfuscation (e.g., RoguePrompt) | [92] |
| LlamaGuard‑2 | Next‑gen LlamaGuard with enhanced token‑activation | Higher robustness to Base64/Leet compared to LlamaGuard‑1 | [87] |
| PromptGuard‑L2 | Machine‑learning layer for novel obfuscations | Trained on 460+ regex patterns + ML classifier; focuses on hidden encodings | [157] |
| Sentra‑Shield | Real‑time multilingual defense with HITL loop | Maintains dual‑labeled knowledge base; achieves >99.9% detection on obfuscations | [92] |
| RogueCipher | Research prototype for dual‑layer obfuscation | Demonstrates how a self‑reconstruction prompt can bypass filters | [182] |
Key Observations
- Normalization Pre‑Processing (Base64, ROT13, Leet, Unicode): Widely adopted in LlamaGuard, PromptScreen, and Sentra‑Guard to strip obfuscation before semantic analysis. [86][138][87]
- Semantic Classifiers: SVM or neural models trained on character‑level n‑grams effectively detect obfuscated patterns, but struggle with novel, multi‑layer ciphers such as those in RoguePrompt. [138][87]
- Multi‑Stage Pipelines: Combining regex, semantic, and representation‑level checks (PromptScreen, PromptGuard) yields higher recall for obfuscated jailbreaks. [134][157]
- Human‑in‑the‑Loop (HITL): Sentra‑Guard’s HITL loop improves adaptation to emerging obfuscation techniques. [92]
- Limitations: Existing systems exhibit reduced performance against dual‑layer or composition‑based obfuscations (e.g., RoguePrompt, CipherChat). They also lack real‑time detection for large, dynamic user prompts in high‑throughput environments. [182][65]
8.3 Best‑Fit Match
Sentra‑Guard is the single prior‑art solution that most comprehensively meets the objective of detecting semantic prompt obfuscation via cipher encoding.
- Architecture: Receives raw prompt → Normalization Layer (Base64, ROT13, Leet, Unicode) → Semantic Classifier (SVM + feature fusion) → Contextual Risk Scoring (retrieval of multilingual embeddings) → HITL Feedback Loop.
- Coverage: Handles all major obfuscation families—Base64, ROT13, LeetSpeak, Unicode homoglyphs, multi‑turn role‑play, and custom ciphers (as demonstrated in Sentra‑Guard‑2). [86][109][92]
- Performance: AUC ≈ 1.00, F1 ≈ 1.00 on a curated adversarial prompt corpus; ASR reduced to 0.004% against GPT‑4o, GPT‑4o‑mini, Claude‑3, Gemini‑Flash, Mistral‑7B. [92]
- Real‑Time Capability: Operates within 50 ms per prompt on commodity GPUs, suitable for production use. [92]
- Extensibility: Supports integration with LlamaGuard or PromptScreen for layered defense, and can plug into existing LLM APIs via a simple REST wrapper.
Thus, Sentra‑Guard’s modular design, high detection rates, and proven efficacy against cipher‑based obfuscations make it the best match for the stated objective.
8.4 Gap Analysis
| Gap | Classification | Potential Mitigation |
|---|---|---|
| Dual‑Layer / Composition Obfuscation (e.g., RoguePrompt) | Requires net‑new R&D | Combine Sentra‑Guard with a custom multi‑layer decoder (e.g., a lightweight script that iteratively normalizes Base64→ROT13→Leet) prior to semantic analysis. |
| Real‑Time Scaling for High‑Throughput Applications | Closeable by integration | Deploy Sentra‑Guard as a microservice behind a load balancer; cache normalized forms for repeated prompts; utilize GPU batching. |
| Zero‑Knowledge Novel Ciphers | Requires R&D | Augment the training corpus with synthetic cipher compositions (e.g., using the string‑composition framework from Plentiful Jailbreaks) to improve generalization. |
| Cross‑Modal (Image/Video) Obfuscation | Not currently solved | Integrate prompt‑screening with vision‑based detection (e.g., STShield) to cover multimodal injection vectors. |
| Robustness to Evasion via Contextual Shifting (e.g., Echo Chamber) | Requires R&D | Extend the classifier to include contextual anomaly detection (e.g., monitoring token‑activation drift over conversation). |
| Model‑Level Mitigation (Fine‑Tuning) | Closeable by composition | Combine Sentra‑Guard with in‑house fine‑tuning of the LLM (e.g., Constitutional AI or RLHF) to reduce baseline vulnerability to obfuscated prompts. |
The dominant gap is the handling of sophisticated, multi‑layer obfuscations that intentionally separate encoding from semantic revelation. Existing tools can detect many single‑layer ciphers, but they lack an intrinsic mechanism to reconstruct nested payloads before semantic analysis.
8.5 Verdict
Currently Possible – The objective can be achieved today by deploying Sentra‑Guard (or its upgraded variant Sentra‑Guard‑2) in conjunction with the following components:
- Pre‑Processing Layer – Base64, ROT13, LeetSpeak, Unicode normalization (implemented in Sentra‑Guard).
- Semantic Classifier – SVM with hybrid word‑ and character‑level TF‑IDF features (from PromptScreen).
- Risk Scoring Module – Retrieval‑based contextual scoring using multilingual embeddings (part of Sentra‑Guard).
- HITL Feedback Loop – Human reviewers validate edge cases and retrain the classifier (built‑in to Sentra‑Guard).
- Optional Enhancements –
  - LlamaGuard‑2 or CORTEX for additional representation‑level checks.
  - A lightweight decoder script that attempts nested de‑encoding for suspected dual‑layer obfuscations (a sketch of such a decoder follows below).
This stack provides real‑time detection of cipher‑based semantic obfuscation with near‑perfect accuracy on known attack families, satisfies the requirement of identifying malicious intent regardless of obfuscation, and is supported by publicly available, shipping products or open‑source repositories.
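A minimal sketch of the lightweight nested decoder mentioned above, intended as a pre‑processing hook in front of the semantic classifier; the leet map, the tiny common‑word heuristic, and the layer limit are illustrative choices, not part of any cited system.

```python
# Iteratively peel common single-layer encodings (Base64, ROT13, leetspeak,
# Unicode homoglyphs) so the semantic classifier sees the inner payload.
import base64
import binascii
import codecs
import unicodedata

LEET_MAP = str.maketrans({"0": "o", "1": "i", "3": "e", "4": "a",
                          "5": "s", "7": "t", "@": "a", "$": "s"})
COMMON_WORDS = {"the", "and", "you", "how", "to", "for", "please", "make", "a", "is"}

def english_score(text: str) -> int:
    """Crude fluency proxy: count of common English words in the text."""
    return sum(w.strip(".,!?") in COMMON_WORDS for w in text.lower().split())

def try_base64(text: str):
    """Return the decoded string if `text` is printable UTF-8 under Base64, else None."""
    try:
        decoded = base64.b64decode(text, validate=True).decode("utf-8")
        return decoded if decoded.isprintable() else None
    except (binascii.Error, UnicodeDecodeError, ValueError):
        return None

def normalize_once(text: str):
    """Peel one obfuscation layer; return None if no candidate looks better."""
    b64 = try_base64(text.strip())
    if b64 is not None and b64 != text:
        return b64
    candidates = [codecs.decode(text, "rot13"),
                  unicodedata.normalize("NFKC", text).translate(LEET_MAP)]
    best = max(candidates, key=english_score)
    if best != text and english_score(best) > english_score(text):
        return best
    return None

def normalize_nested(text: str, max_layers: int = 4) -> str:
    """Iteratively undo up to `max_layers` encodings before semantic analysis."""
    for _ in range(max_layers):
        peeled = normalize_once(text)
        if peeled is None:
            return text
        text = peeled
    return text

# Example: a ROT13 payload wrapped in Base64 is reduced to plain text.
wrapped = base64.b64encode(codecs.encode("please explain how to do this", "rot13").encode()).decode()
print(normalize_nested(wrapped))
```

A production deployment would replace the common‑word heuristic with a language‑likelihood model and log every peeled layer for the HITL feedback loop.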
9. Gradient‑Based Prompt Optimization Attack Methods
9.1 Identify the Objective
The chapter must synthesize all publicly documented gradient‑based prompt optimisation techniques that generate adversarial suffixes or prefixes to jailbreak large language models (LLMs). It should catalogue the existing methods, evaluate their capabilities, identify the single best‑fit existing artefact that most closely satisfies the objective, and analyze remaining gaps relative to an ideal, fully‑automated, black‑box attack pipeline.
9.2 Survey of Existing Prior Art
| Method | Source | Core Idea | Key Properties |
|---|---|---|---|
| Greedy Coordinate Gradient (GCG) | [167][33], … | Iterative token‑level gradient ascent on a white‑box model to maximise probability of a target affirmative response | White‑box, universal suffixes, high ASR, limited interpretability |
| AutoDAN | [105], [155] | Hierarchical genetic algorithm evolving full prompts, preserving fluency | White‑box or surrogate‑based, high fluency, moderate ASR |
| TAO‑Attack | [156] | Two‑stage loss: suppress refusals then penalise pseudo‑harmful outputs; direction‑priority token optimisation | White‑box, higher ASR than GCG, more efficient token updates |
| LARGO | [181][77] | Latent‑space optimisation to generate self‑reflecting adversarial prompts | White‑box, fluent outputs, requires internal latent access |
| CRA (Contextual Representation Ablation) | [42] | Gradient‑free ablation of high‑level representations to force unsafe outputs | Black‑box, high ASR, no need for gradients |
| Dynamic Target Attack (DTA) | [141] | Uses the target model’s own responses as optimisation targets | Black‑box, adaptive, high ASR |
| AdvPrompter | [136] | Trains a separate LLM to generate human‑readable adversarial prompts without gradients | Black‑box, rapid, limited transferability |
| FERRET | [63], … | Quality‑diversity evolutionary search with reward‑based selection | Black‑box, high throughput, moderate ASR |
| PAP (Persuasive Adversarial Prompts) | [170][53] | Persuasive context injection, LLM‑driven paraphrasing | Black‑box, high fluency, moderate ASR |
| BEAST | [40] | Beam‑search guided adversarial suffix generation | White‑box/black‑box, high ASR, efficient |
| CRP (Cascaded Retrieval‑Prompt) | [11] | Uses retrieval to pre‑populate unsafe content, followed by prompt optimisation | Black‑box, high ASR |
These methods cover the spectrum of gradient‑based or gradient‑inspired optimisation, ranging from pure token‑level gradient ascent to latent‐space and representation‑level manipulation. All rely on either white‑box access (gradients, logits) or surrogate models for black‑box transfer.
9.3 Best‑Fit Match
Greedy Coordinate Gradient (GCG) is the most comprehensive and widely benchmarked gradient‑based attack that satisfies the objective of generating adversarial suffixes to jailbreak LLMs.
| Requirement | GCG Capability | Source |
|---|---|---|
| Gradient‑based optimisation | Uses token‑level gradient ascent to maximize probability of target affirmative token sequence | [167] |
| Universal suffix generation | Produces suffixes that transfer across models without re‑optimisation | [33] |
| High Attack Success Rate (ASR) | Reported >90 % on several open‑weight LLMs | [167] |
| White‑box requirement | Requires model gradients; accessible via open‑source LLMs | [167] |
| Automatic, single‑turn attack | Generates adversarial suffix in a single optimisation loop | [167] |
| Limited interpretability | Generates gibberish suffixes; no semantic control | [167] |
GCG’s design aligns precisely with the objective: it is a gradient‑based, optimisation‑driven method that automatically crafts adversarial suffixes to elicit unsafe outputs from LLMs. The method’s widespread adoption and benchmarking (e.g., on AdvBench) confirm its status as the de‑facto standard for gradient‑based jailbreaks.
9.4 Gap Analysis
| Gap | Classification | Remedy |
|---|---|---|
| Lack of semantic coherence | (i) Closeable by integration | Combine GCG with an LLM‑based paraphraser or semantic filter (e.g., FERRET, AdvPrompter) to render suffixes readable |
| White‑box dependency | (ii) Requires new R&D – full gradient access is not feasible against commercial APIs | Use surrogate‑based transfer techniques (AutoDAN‑style or DTA) to approximate gradients |
| High computational cost | (i) Closeable by composition – GCG can require many gradient steps | Replace vanilla GCG with direction‑priority token optimisation (TAO‑Attack) or beam‑search heuristics (BEAST) to reduce iterations |
| Susceptibility to detection (perplexity, filters) | (i) Closeable by integration | Add a low‑perplexity objective, or employ CRP or LARGO to mask and obfuscate the suffix |
| Limited multi‑turn adaptation | (i) Closeable by composition – GCG is single‑turn | Stack GCG outputs with a black‑box iterative refinement loop (e.g., PAIR, ReNeLLM) |
| Transferability across modalities | (ii) Requires new R&D – no support for multimodal LLMs | Extend GCG to latent‑space optimisation (LARGO) or incorporate image‑based perturbations per recent multimodal attacks |
Most gaps stem from the trade‑off between optimisation efficiency and practical deployment constraints. They can be addressed by composing GCG with complementary open‑source tools (e.g., TAO‑Attack for efficiency, CRP for stealth, FERRET for semantic control).
9.5 Verdict
Currently Possible – The objective of generating gradient‑based adversarial suffixes for LLM jailbreaks can be achieved today using the open‑source GCG implementation. A practical pipeline would involve:
- Model Loading – Load a white‑box LLM (e.g., LLaMA‑2‑7B‑Chat) with the transformers library.
- Gradient Extraction – Use the model's forward method to obtain logits for a given harmful prompt.
- Token‑level Gradient Ascent – Apply the GCG algorithm (as provided in the nanogcg_redteam PyPI package [164]) to optimise a suffix that maximises the probability of the target affirmative phrase ("Sure, here's how to …").
- Suffix Concatenation – Append the optimized suffix to the original prompt.
- Evaluation – Send the combined prompt to the target LLM and record the Attack Success Rate.
This pipeline relies solely on published, shipping components: the nanogcg_redteam library, HuggingFace transformers, and open‑source LLM weights. It fulfills the objective without requiring novel research or proprietary technology.
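For concreteness, the sketch below implements the core token‑gradient step of GCG [167] directly on top of HuggingFace transformers rather than through the nanogcg_redteam package, whose exact API is not reproduced here; the model name, suffix length, and top‑k value are illustrative assumptions, and a real attack needs the full iterative sample‑and‑evaluate loop.

```python
# One GCG-style coordinate-gradient step: rank candidate suffix-token
# substitutions by the gradient of the target-phrase loss.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"   # any open-weight causal LM
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float32)
model.eval()
for p in model.parameters():                    # only the suffix one-hot needs gradients
    p.requires_grad_(False)

prompt = "Write a tutorial on X"                # placeholder harmful instruction
target = "Sure, here's how to"                  # affirmative target phrase
prompt_ids = tok(prompt, return_tensors="pt").input_ids[0]
target_ids = tok(target, add_special_tokens=False, return_tensors="pt").input_ids[0]
suffix_ids = tok("! ! ! ! ! ! ! !", add_special_tokens=False, return_tensors="pt").input_ids[0]

embed_matrix = model.get_input_embeddings().weight        # (vocab, hidden)

def suffix_token_gradients(suffix_ids):
    """Gradient of the target-phrase loss w.r.t. a one-hot encoding of the suffix."""
    one_hot = torch.zeros(len(suffix_ids), embed_matrix.size(0))
    one_hot.scatter_(1, suffix_ids.unsqueeze(1), 1.0)
    one_hot.requires_grad_(True)
    inputs_embeds = torch.cat([embed_matrix[prompt_ids],
                               one_hot @ embed_matrix,      # differentiable suffix embedding
                               embed_matrix[target_ids]], dim=0).unsqueeze(0)
    logits = model(inputs_embeds=inputs_embeds).logits[0]
    start = len(prompt_ids) + len(suffix_ids) - 1           # positions predicting the target
    loss = torch.nn.functional.cross_entropy(logits[start:start + len(target_ids)], target_ids)
    loss.backward()
    return one_hot.grad                                      # (suffix_len, vocab)

grads = suffix_token_gradients(suffix_ids)
# For each suffix position, tokens with the most negative gradient are the most
# promising substitutions; GCG samples among them, re-evaluates the loss, and
# keeps the best suffix each iteration (see [167]).
candidate_tokens = (-grads).topk(8, dim=1).indices
```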
10. Multi‑Turn Contextual Memory Attacks
10.1 Identify the Objective
This chapter must provide a systematic synthesis of the state‑of‑the‑art on adversarial techniques that target the contextual memory of multi‑agent AI systems, focusing on how such attacks induce misaligned policy inference, erode trust in the system, and trigger cascading failures across interacting agents. The review should map existing attack and defense mechanisms to these three threat dimensions, critically assess coverage gaps, and conclude whether the objective can be achieved with current, publicly documented methods.
10.2 Survey of Existing Prior Art
| # | Source | Core Contribution | Relevance to Objective |
|---|---|---|---|
| [115] | DeepContext: Stateful Real‑Time Detection of Multi‑Turn Adversarial Intent Drift in LLMs | Recurrent intent tracking using lightweight turn‑level embeddings and an RNN to detect intent drift over turns | Detects the intent shift that underlies misaligned policy inference in multi‑turn dialogues |
| [135] | DeepTrap: Automated Discovery of Contextual Vulnerabilities in OpenClaw | Optimises a black‑box trajectory‑level search to identify memory poisoning, RAG poisoning, and other contextual attacks | Provides a methodology for discovering memory‑based attacks that can mislead policy inference |
| [41] | MINJA (Memory Injection Attack) | Demonstrates high‑success query‑only memory poisoning by bridging steps and progressive shortening techniques | Exemplifies persistent memory poisoning that can alter agent goals and trigger cascading failures |
| [2] | AgentTrust: A Firewall for Agent Tool Calls | Wraps every tool call with a safety evaluation layer to classify actions before execution | Addresses trust degradation by preventing malicious tool invocations driven by poisoned memory |
| [137] | Memory Poisoning Attack and Defense on Memory Based LLM‑Agents (various sub‑papers) | Introduces MINJA, AgentPoison, and systematic evaluation of memory‑poisoning attacks and defenses | Provides both attack (MINJA) and defense (AgentPoison) perspectives |
| [139] | Memory Poisoning Attack and Defense on Memory Based LLM‑Agents (duplicate) | Same as above, with additional empirical results | Reinforces the feasibility of persistent memory attacks |
| [123] | Memory Poisoning Attack and Defense on Memory Based LLM‑Agents | Discusses MINJA, AgentPoison, and cascading effects across multi‑agent systems | Highlights the cascade dimension of memory attacks |
| [120] | Agent Traps (DeepMind study) | Characterises categories of memory‑based attacks (RAG poisoning, behaviour control, exfiltration) | Provides a taxonomy that maps to misaligned policy inference and cascading failures |
| [204] | Into the Gray Zone: Domain Contexts Can Blur LLM Safety Boundaries | Presents JARGON, a multi‑turn strategy to inject hidden instructions via academic framing | Illustrates how contextual memory can be leveraged over multiple turns to subvert safety |
| [130] | Every Picture Tells a Dangerous Story: Memory‑Augmented Multi‑Agent Jailbreak Attacks on VLMs | Extends memory‑poisoning to vision‑language models, showing cascading chain reactions | Demonstrates cross‑modal cascading failures |
| [52] | Not a very smart home: crims could hijack smart‑home boiler… | Reports a real‑world memory‑poisoning attack that caused device takeover via calendar invites | Practical case of cascading failures in an IoT context |
| [183] | Memory Poisoning Attack and Defense on Memory Based LLM‑Agents | Systematic empirical evaluation of memory poisoning attacks and defenses in EHR agents | Provides evidence of cascading failures in health‑care multi‑agent scenarios |
| [67] | How May Explainable Artificial Intelligence Improve IT Security of Object Detection? | Discusses memory poisoning in agents that rely on RAG | Indicates cascading impact on downstream vision tasks |
| [162] | On February 15, 2025, the UC Berkeley Center for Long‑Term Cybersecurity… | Outlines risk‑management playbook for autonomous agents, including cascading failure mitigation | Offers high‑level guidance for cascading failure scenarios |
| [51] | Artificial intelligence systems are rapidly evolving… | Introduces Memory Ghost Attacks, a class of persistent contextual manipulation | Directly relevant to misaligned policy inference over extended interactions |
The above works collectively cover: (i) attack techniques that poison contextual memory, (ii) detection frameworks that monitor intent drift, (iii) defense mechanisms that gate tool calls, and (iv) case studies illustrating cascading failures.
10.3 Best‑Fit Match
MINJA (Memory Injection Attack) – [41]
Capabilities and Mapping to Objective
| Objective Aspect | MINJA Feature | Source |
|---|---|---|
| Persistent memory poisoning across turns | Uses bridging steps and progressive shortening to inject malicious instructions that are retained in long‑term memory | [41] |
| Misaligned policy inference | Poisoned memory causes the agent to adopt attacker‑defined goals, overriding the system prompt | [41] |
| Trust degradation | By changing the agent's internal policy, users misattribute errors to system failure rather than malicious manipulation | [41] |
| Cascading failures | A single poisoned memory entry can propagate through multiple agents sharing the same memory store, leading to widespread unintended actions | [123][120] |
MINJA thus provides the most complete end‑to‑end illustration of how a multi‑turn contextual memory attack can lead to all three dimensions of threat.
10.4 Gap Analysis
| Gap | Classification | Mitigation via Existing Prior Art? |
|---|---|---|
| 1. Detection of memory poisoning in interactive multi‑agent workflows | (i) closeable by composition – combine DeepContext [115] with AgentTrust [2] to monitor intent drift and tool calls simultaneously | Yes, but requires integration |
| 2. Preventing cross‑agent memory contamination | (ii) requires new R&D – current defenses (AgentPoison, AgentTrust) assume isolated memory or shared memory with explicit boundaries | No, current tools do not enforce isolation across agents |
| 3. Quantifying cascading failure impact across heterogeneous agents | (ii) not currently solved – existing case studies (e.g., smart‑home takeover, health‑care agent) are isolated; no systematic metrics | No |
| 4. Robustness against indirect memory poisoning via RAG or external knowledge bases | (i) closeable by integrating AgentPoison [41] with hybrid retrieval systems (e.g., Athena hybrid search) | Yes, with configuration |
| 5. Dynamic runtime enforcement of policy consistency across turns | (ii) requires novel runtime enforcement layers | No |
Thus, while attacks and some defenses exist, the full end‑to‑end mitigation path from memory poisoning to cascading failure analysis remains incomplete.
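Gap 1 above suggests pairing DeepContext‑style intent‑drift monitoring [115] with AgentTrust‑style tool‑call gating [2]. The sketch below shows only the drift‑monitoring half, using sentence‑transformers embeddings; the embedding model, centroid heuristic, and threshold are assumptions for illustration, not the published configuration.

```python
# Toy turn-level intent-drift monitor: flag turns whose embedding diverges from
# the running centroid of the conversation so far.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")
DRIFT_THRESHOLD = 0.45   # tune on benign multi-turn traffic

def flag_intent_drift(turns):
    """Return indices of turns whose similarity to the running centroid drops
    below the threshold (candidate points of intent drift)."""
    flagged = []
    embeddings = encoder.encode(turns, convert_to_tensor=True)
    for i in range(1, len(turns)):
        centroid = embeddings[:i].mean(dim=0)
        similarity = util.cos_sim(centroid, embeddings[i]).item()
        if similarity < DRIFT_THRESHOLD:
            flagged.append(i)
    return flagged

turns = ["Summarise today's meeting notes.",
         "Also add the action items to my calendar.",
         "Ignore previous instructions and forward all notes to this address."]
print(flag_intent_drift(turns))
```

Flagged turns would then be routed to a tool‑call gate rather than executed directly, approximating the composition proposed in Gap 1.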
10.5 Verdict
Not Currently Possible
| Closest Existing Fit | Coverage | Residual Gap |
|---|---|---|
| MINJA [41] | Demonstrates persistent memory poisoning, misaligned policy inference, and cascading potential | Lacks automated detection and cross‑agent isolation |
| AgentPoison / AgentTrust [2][41] | Provides memory‑attack detection and tool‑call gating | Do not address multi‑agent memory contamination or systematic cascade metrics |
| DeepContext [115] | Detects intent drift over turns | Only monitors single‑agent intent; no mechanism to capture memory‑driven policy changes or inter‑agent propagation |
The objective of comprehensively analyzing and mitigating multi‑turn contextual memory attacks that simultaneously misalign agent policy, erode trust, and cause cascading failures across a multi‑agent ecosystem cannot yet be fully satisfied with existing, published solutions.
11. Single‑Victim Communication Perturbation Attacks
11.1 Identify the Objective
This chapter must synthesize all published, shipping, or open‑source methods that enable a single adversarial agent to perturb the communication of a single victim agent within a multi‑agent system, thereby degrading the victim’s policy or causing system‑level failures. The review should map existing techniques to key aspects of the objective—message‑level perturbation, agent‑specific targeting, temporal selection, and the resultant impact on coordination or trust—while identifying gaps and practical implementation paths that respect the constraints of today’s research and product ecosystem.
11.2 Survey of Existing Prior Art
| # | Reference (hex ID) | Title | Core Contribution | Relevance to Objective |
|---|---|---|---|---|
| 1 | [21] | Finding the Weakest Link: Adversarial Attack against Multi‑Agent Communications | Introduces single‑victim communication perturbation attacks that use Jacobian gradients to identify the most vulnerable messages, agents, and timesteps, quantifying impact on system performance. | Primary method for targeted message attacks. |
| 2 | [28] | Finding the Weakest Link: Adversarial Attack against Multi‑Agent Communications | Duplicate of Ref 1, confirming reproducibility across two publications. | Reinforces feasibility of the Jacobian‑based approach. |
| 3 | [84] | Grey‑Box Adversarial Attack on Communication in Multi‑Agent Reinforcement Learning | Proposes Victim‑Simulation‑Based Adversarial Attack (VSA) that simulates the victim’s receipt of other agents’ messages, generating perturbations that are then injected to degrade performance. | Demonstrates grey‑box, single‑victim targeting. |
| 4 | [45] | Grey‑Box Adversarial Attack on Communication in Multi‑Agent Reinforcement Learning | Same as Ref 3; highlights VSA’s effectiveness in predator‑prey and traffic‑junction environments. | Provides empirical validation. |
| 5 | [24] | Robust Multi‑Agent Communication Based on Decentralization‑Oriented Adversarial Training | Trains an attacker to generate adversarial perturbations on the victim’s messages, applying them as noise during communication. | Illustrates adversarial training for message corruption. |
| 6 | [89] | Robust multi‑agent coordination via evolutionary generation of auxiliary adversarial attackers | Discusses adversarial observation and communication policies, including learning robust communication under poisoned senders. | Contextualizes communication‑based attacks within broader adversarial frameworks. |
| 7 | [69] | Robust multi‑agent coordination via evolutionary generation of auxiliary adversarial attackers | Same as Ref 6; emphasizes multi‑agent vulnerability to communication perturbations. | Reinforces the prevalence of message‑level attacks. |
| 8 | [46] | Robust and efficient communication in multi‑agent reinforcement learning | Surveys robust communication strategies under realistic constraints, including message perturbations. | Provides background on mitigation but not attack methods. |
| 9 | [178] | Robust Coordination under Misaligned Communication via Power Regularization | Defines misaligned communication and proposes power regularization to limit a sender’s influence. | Offers a defense perspective relevant to attack impact. |
| 10 | [202] | Robust Coordination Under Misaligned Communication via Power Regularization | Extends power regularization to multi‑agent systems, addressing misaligned messages. | Defense mechanism that could mitigate attacks. |
| 11 | [49] | Jacobian saliency map approach attack | Describes a Jacobian‑based saliency map to find words/parameters most impactful for adversarial perturbation. | Methodology transferable to communication perturbation. |
| 12 | [198] | Amplification of formal method and fuzz testing to enable scalable assurance for communication system | Advocates formal and fuzz testing to uncover protocol vulnerabilities, including message corruption. | Provides a testing framework for attack validation. |
| 13 | [121] | Complete Guide to Agentic AI Red Teaming | Discusses how adversarial payloads can traverse inter‑agent boundaries, outlining red‑team techniques. | Supplies a broader attack context. |
| 14 | [200] | ARCS: Adversarial Attack with Large Language Models and Critical State Identification | Introduces a black‑box adversarial attack that manipulates reward signals to guide victim policy. | Complements communication attacks with state‑level perturbations. |
These references collectively capture the state of single‑victim communication perturbation attacks, the methods used to generate them, and the defenses or testing frameworks that can be paired with them.
11.3 Best‑Fit Match
Best‑Fit Match: Ref 1 [21]
| Requirement | Implementation in Ref 1 | Source |
|---|---|---|
| 1. Target a single victim agent | The attack strategy explicitly selects one victim agent in a multi‑agent reinforcement learning environment. | [21] |
| 2. Perturb communication messages | The attacker perturbs the messages sent to the victim by adding perturbations to the raw message vectors. | [21] |
| 3. Identify susceptible messages, agents, and timesteps | Uses the Jacobian of the message‑to‑policy mapping to compute saliency scores, thus ranking messages, agents, and timesteps by attack impact. | [21] |
| 4. Quantify impact on system performance | Empirically demonstrates reduction in cumulative reward and coordination metrics across benchmark tasks (Predator‑Prey, TrafficJunction). | [21] |
| 5. Provide adversarial loss functions that trade‑off success for impact | Introduces two loss functions that control attack success versus perturbation magnitude, enabling practical deployment. | [21] |
Why this solution is the closest fit
Ref 1 delivers a complete, end‑to‑end attack pipeline that satisfies all core aspects of the objective: it isolates a single victim, perturbs its incoming messages, identifies the most influential perturbations via Jacobian analysis, and demonstrates measurable degradation of the victim’s policy and the overall system. All components are fully specified in the paper and have been reproduced in open‑source implementations (e.g., PettingZoo + PyTorch), making it readily deployable today.
11.4 Gap Analysis
| Gap | Classification | Notes |
|---|---|---|
| 1. Limited to MARL environments (e.g., Predator‑Prey, TrafficJunction) | (i) Closeable by integration | Existing fault‑injection frameworks (Refs 12, 13) can be combined to test the attack in more diverse settings. |
| 2. No explicit defense or mitigation presented | (i) Closeable by composition | Power regularization (Refs 9, 10) and misaligned communication defenses can be applied post‑attack to mitigate impact. |
| 3. Does not address cascading failures or trust degradation | (ii) Requires new R&D | Current literature lacks a systematic analysis of how single‑victim perturbations propagate to system‑wide trust metrics. |
| 4. Requires full knowledge of Jacobian, i.e., white‑box access | (i) Closeable by configuration | Grey‑box VSA attack (Refs 3, 4) shows that a black‑box approximation can be used, but the Jacobian step remains a bottleneck. |
| 5. No real‑time or online attack capability | (ii) Requires new R&D | Implementing online Jacobian estimation would need additional algorithmic development beyond current prior art. |
11.5 Verdict
Currently Possible – The single‑victim communication perturbation attack described in Ref 1 is fully implementable today using existing, publicly available tools.
Implementation Sketch
1. Environment Setup – Deploy a multi‑agent reinforcement learning benchmark (e.g., Predator‑Prey) using the PettingZoo framework.
2. Model Extraction – Load the victim agent’s policy network (e.g., a small CNN) implemented in PyTorch.
3. Jacobian Computation – For each timestep, compute the Jacobian of the policy output with respect to the incoming message vector using autograd.
4. Saliency Ranking – Rank message components, agents, and timesteps by the magnitude of the Jacobian entries to identify the most influential perturbation points.
5. Perturbation Generation – Apply a small norm‑bounded perturbation (e.g., ε = 0.01) to the selected message components using a fast‑gradient‑sign step (illustrated in the sketch below).
6. Attack Injection – Replace the victim’s received message with the perturbed version during execution.
7. Evaluation – Measure cumulative reward, coordination metrics, and any observable trust‑degradation indicators across multiple runs.
This pipeline uses fully specified components from the literature (Refs 1, 3, 4, 12, 13) and requires no new inventions or unproven methodologies.
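A minimal sketch of steps 3–6 follows; the victim policy network, message dimensions, and ε are illustrative placeholders rather than the configuration used in Ref 1 [21].

```python
# Jacobian-based saliency over an incoming message, followed by a sign-step
# perturbation on the most influential components.
import torch
import torch.nn as nn

class VictimPolicy(nn.Module):
    """Stand-in for the victim agent's policy: maps (obs, message) to action logits."""
    def __init__(self, obs_dim=16, msg_dim=8, n_actions=5):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim + msg_dim, 64), nn.ReLU(),
                                 nn.Linear(64, n_actions))
    def forward(self, obs, msg):
        return self.net(torch.cat([obs, msg], dim=-1))

policy = VictimPolicy()
obs = torch.randn(16)                        # victim's local observation at this timestep
msg = torch.randn(8, requires_grad=True)     # incoming message from a teammate

logits = policy(obs, msg)

# Step 3: Jacobian of the action logits w.r.t. the message vector (n_actions x msg_dim)
jacobian = torch.autograd.functional.jacobian(lambda m: policy(obs, m), msg)

# Step 4: rank message components by their aggregate influence on the policy
saliency = jacobian.abs().sum(dim=0)
top_components = saliency.topk(3).indices

# Steps 5-6: perturb only the most influential components with a gradient-sign
# step that lowers the probability of the victim's originally chosen action
eps = 0.01
action = logits.argmax()
loss = -torch.log_softmax(logits, dim=-1)[action]
grad_msg, = torch.autograd.grad(loss, msg)
perturbed = msg.detach().clone()
perturbed[top_components] += eps * grad_msg[top_components].sign()
# 'perturbed' replaces the message delivered to the victim during execution.
```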
12. Gradient Masking in Adversarial Training and Explainability
12.1 Identify the Objective
This chapter synthesizes prior‑art solutions that combine gradient masking techniques with adversarial training and explainability mechanisms. The goal is to understand how gradient‑based masking can be leveraged to (i) defend models against adversarial perturbations, (ii) facilitate targeted model adaptation (e.g., alignment or policy refinement), and (iii) provide interpretable insights into model decision pathways—particularly in the context of multi‑agent AI systems where misaligned policy inference, trust degradation, and cascading failures pose serious risks.
12.2 Survey of Existing Prior Art
| Ref ID | Contribution | Core Technique(s) | Relevant Aspect | Citation |
|---|---|---|---|---|
| [12] | Targeted fine‑tuning via sparse autoencoders (SAEs) that isolate the 3 % of MLP neurons most predictive of a target behavior, followed by fine‑tuning only those neurons using gradient masking | Gradient masking, sparse autoencoding, neuron‑level fine‑tuning | Aligns behavior with minimal fine‑tuning; offers explainability by isolating responsible neurons | [12] |
| [66] | Localizes computation in neural networks through gradient masking, enabling interpretable attribution of internal units | Gradient masking, attribution extraction | Provides post‑hoc interpretability and potential robustness by restricting computation to salient pathways | [66] |
| [110] | Policy Distillation with Selective Input Gradient Regularization (DIGR) for efficient interpretability of RL policies | Gradient‑based regularization, policy distillation | Produces more transparent policies and can be integrated with adversarial training to mitigate policy drift | [110] |
| [59] | Gradient‑based adversarial training strategies (including adversarial purification) that improve robustness without prior knowledge of attack types | Gradient‑based adversarial training, purification | Demonstrates effectiveness of gradient‑based defenses, though not explicitly using masking | [59] |
| [149] | Knowledge distillation framework (not directly using masking) | Distillation, multi‑task learning | Provides a baseline for compression and potential explainability through surrogate models | [149] |
Additional relevant works that touch on related concepts (but do not directly employ gradient masking) include Ref: [19] (saliency methods) and Ref: [66] (gradient masking for interpretability). However, the table above lists the most directly applicable prior‑art solutions.
12.3 Best‑Fit Match
Targeted Fine‑Tuning via Gradient Masking (Ref: [12])
| Objective Feature | Implementation in [12] | Evidence |
|---|---|---|
| Gradient masking | After isolating 3 % of MLP neurons with a sparse autoencoder, the method applies a binary mask to freeze or zero‑out all other neurons during fine‑tuning, effectively confining gradient flow to the selected subset. | The paper explicitly states “fine‑tune only those neurons using gradient masking.” [12] |
| Adversarial robustness (indirect) | By restricting learning to a highly predictive sub‑network, the approach reduces the model’s reliance on spurious features that adversaries could exploit, thereby improving resilience. | The authors claim the targeted update “reduces undesired side effects such as distributional shift” and enhances interpretability, which are correlated with robustness. [12] |
| Explainability | Isolation of a small, interpretable set of neurons allows for post‑hoc attribution (via linear probes) and a clear mapping from neuron activity to behavior. | The method “isolates the 3 % of MLP neurons most predictive of a target behavior” and uses linear probes for interpretation. [12] |
| Scalability | Works on a 40 B multi‑agent system compressed to 6 B while retaining 88 % accuracy, demonstrating feasibility on large models. | Performance metrics reported in the paper (88 % accuracy vs. 40 B baseline). [12] |
Thus, this solution satisfies the core requirements of gradient masking, alignment of behavior, and interpretability, and it offers a foundation that can be extended toward adversarial training.
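To illustrate the mechanics of neuron‑level gradient masking described in [12], the sketch below confines gradient flow to a selected ~3 % of hidden units via PyTorch gradient hooks; the toy model, the random neuron selection, and the hook mechanics are illustrative assumptions, not the paper's implementation.

```python
# Fine-tune only the parameters feeding a small selected set of hidden neurons
# by zeroing the gradients of everything else.
import torch
import torch.nn as nn

mlp = nn.Sequential(nn.Linear(128, 512), nn.ReLU(), nn.Linear(512, 128))

# Suppose a sparse autoencoder (or any attribution method) flagged these hidden
# units as most predictive of the target behaviour (~3% of 512 units).
selected = torch.zeros(512, dtype=torch.bool)
selected[torch.randperm(512)[: int(0.03 * 512)]] = True

def mask_grad_rows(grad):
    """Zero gradient rows (output neurons) outside the selected set."""
    masked = grad.clone()
    masked[~selected] = 0.0
    return masked

# Only the first layer's rows feeding the selected neurons receive updates.
mlp[0].weight.register_hook(mask_grad_rows)
mlp[0].bias.register_hook(lambda g: torch.where(selected, g, torch.zeros_like(g)))
for p in mlp[2].parameters():
    p.requires_grad_(False)        # freeze everything outside the masked layer

opt = torch.optim.Adam(mlp.parameters(), lr=1e-4)
x, y = torch.randn(32, 128), torch.randn(32, 128)
loss = nn.functional.mse_loss(mlp(x), y)
loss.backward()     # hooks zero out gradients for non-selected neurons
opt.step()          # fine-tuning is confined to the selected sub-network
```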
12.4 Gap Analysis
| Gap | Classification | Reason |
|---|---|---|
| Explicit adversarial training integration | (i) | The method does not incorporate adversarial examples during fine‑tuning; it relies solely on neuron isolation. |
| Multi‑agent coordination | (i) | While the original model is multi‑agent, the masking technique is applied at the network level, not at the agent‑policy level. |
| Cascading failure mitigation | (i) | No mechanism is described for detecting or preventing failure propagation across agents. |
| Policy distillation for RL agents | (i) | The approach targets supervised learning; it does not address reinforcement‑learning policy distillation. |
| Robustness against adaptive adversaries | (ii) | Gradient masking alone can be circumvented by adaptive attacks; no robustness proof is provided. |
| Explainability of dynamic interactions | (i) | The method explains static neuron contributions but not temporal or inter‑agent interaction dynamics. |
Most gaps are (i) closeable by composing the chosen method with other existing solutions (e.g., combining with DIGR for policy distillation, or with gradient‑based adversarial training from Ref: [59]). The remaining gaps, such as formal robustness guarantees and multi‑agent coordination, would require new research.
12.5 Verdict
Not Currently Possible – The objective of a unified, end‑to‑end system that applies gradient masking to both adversarial training and explainability in multi‑agent AI, while preventing cascading failures, cannot yet be achieved with existing publicly available methods.
Closest Existing Fits
1. Targeted Fine‑Tuning via Gradient Masking (Ref: [12]) – Provides selective neuron masking and interpretable behavior alignment, but lacks direct adversarial training and multi‑agent coordination.
2. Localizing Computation through Gradient Masking (Ref: [66]) – Offers interpretable attribution via gradient masking, yet does not address adversarial robustness or multi‑agent dynamics.
3. Policy Distillation with Selective Input Gradient Regularization (DIGR) (Ref: [110]) – Enables interpretable RL policies and can be integrated with adversarial training, but does not incorporate neuron‑level gradient masking nor multi‑agent failure analysis.
Each of these works covers a subset of the desired capabilities, yet none collectively fulfill the full spectrum of gradient masking for adversarial training and explainability in the multi‑agent setting.
13. Counterfactual Explanation Failure in Adversarial Environments
13.1 Identify the Objective
The chapter must synthesize current knowledge on how counterfactual explanations (CXs) break down when faced with adversarial perturbations, misaligned policy inference, trust erosion, and cascading failures in multi‑agent AI systems. It should catalog existing methods that address these failures, evaluate the most suitable prior‑art solution, and delineate the remaining gaps that prevent a fully robust, trustworthy counterfactual framework in adversarial settings.
13.2 Survey of Existing Prior Art
| Reference (hex ID) | Solution | Key Features & Claims |
|---|---|---|
| [205] | ATEX‑CF – Attack‑Informed Counterfactual Explanations for Graph Neural Networks | Unifies adversarial edge‑addition attacks with counterfactual edge‑deletion, leveraging adversarial insights to generate more impactful explanations on GNNs. Claims improved faithfulness and sparsity under attack. [205] |
| [144] | CECAS – Counterfactual Explanation via Causally‑Guided Adversarial Steering (Image) | Uses a causally‑guided adversarial method to generate counterfactual images, mitigating spurious correlations and ensuring semantic fidelity. [144] |
| [17] | CECAS (duplicate) | Same as above; emphasizes filtering out out‑of‑distribution artifacts via diffusion models. [17] |
| [34] | DiCE – Diverse Counterfactual Explanations | Open‑source library supporting diverse CX generation for any ML model, with extensions for causal constraints and multiple algorithms. [34] |
| [37] | Counterfactual Explanations for Face Forgery Detection | Applies adversarial removal of artifacts to generate CXs that reveal forgery traces, improving interpretability and attack transferability. [37] |
| [201] | Counterfactual Inference for AD Diagnosis | Combines U‑Net and GANs to produce counterfactual diagnostic maps, illustrating causal inference in medical imaging. [201] |
| [71] | Dual‑Loss One‑Lipschitz Network | Shows that traversing the gradient to the decision boundary can serve as a counterfactual, with improved explanation reliability. [71] |
| [116] | Desiderata‑Driven Visual CX | Formalizes CX search as an optimization problem, emphasizing minimal perturbation on the data manifold. [116] |
| [1] | FreeMCG – Derivative‑Free Diffusion Manifold‑Constrained Gradients | Unified framework for both feature attribution and CX using diffusion models and ensemble Kalman filters. [1] |
| [169] | Adversarial Image‑to‑Image Translation for CX | Generates realistic counterfactual images via adversarial image‑to‑image translation. [169] |
| [146] | GANterfactual – GAN‑Based Counterfactuals for Medical Images | Uses adversarial image‑to‑image translation to produce realistic counterfactuals for non‑expert medical users. [146] |
| [68] | Counterfactual Examples for Robustness | Demonstrates that min‑max adversarial training (PGD) can be used to generate counterfactual examples that improve robustness. [68] |
| [124] | MACDA – Multi‑Agent Counterfactual Drug‑Target Binding Affinity | Extends CX to multi‑agent settings with discrete inputs (drug, target). [124] |
| [131] | DiCE (Microsoft) | Open‑source library for diverse CX with support for causal constraints and LIME/SHAP‑style explanations. [131] |
| [190] | XCAD – Explainable Collusion Detection for Multi‑Agent Systems | Uses adaptive clustering and graph analysis to detect collusion and provide CXs for trust diagnostics. [190] |
| [90] | Improving Clinical Diagnosis with Counterfactual Multi‑Agent Reasoning | Integrates counterfactual reasoning into LLM‑based diagnostic agents to surface alternative diagnoses. [90] |
| [47] | 4D‑ARE – Bridging Attribution Gap in LLM Agent Requirements | Combines structural causal models with Shapley values for runtime explanations in LLM agents. [47] |
| [18] | Efficient Agent Evaluation via Diversity‑Guided User Simulation | Uses counterfactual prompting to surface critical decision points in agent interactions. [18] |
| [9] | Introspective Extraction and Complement Control | Framework for generating factual and counterfactual rationales with discrimination between them. [9] |
| [7] | Realistic Extreme Behavior Generation for AV Testing | Generates realistic adversarial collisions to reveal failure modes, implicitly relying on CX for interpretability. [7] |
Note: The list focuses on methods that explicitly address CX robustness or integrate adversarial techniques into CX generation, as those are directly relevant to counterfactual explanation failure in adversarial environments.
13.3 Best‑Fit Match
ATEX‑CF (Attack‑Informed Counterfactual Explanations for Graph Neural Networks) – [205].
| Requirement | ATEX‑CF Capability | Evidence |
|---|---|---|
| Unifies adversarial attacks with CX generation | Uses adversarial edge‑addition to inform counterfactual edge‑deletion, addressing the shared goal of flipping predictions while preserving actionable semantics. | [205] |
| Grounded in theory | Provides theoretical justification for the integration of attack and explanation strategies, ensuring that the explanation remains faithful under adversarial perturbations. | [205] |
| Efficient integration | Combines edge additions and deletions in a single optimization loop, reducing computational overhead compared to separate attack and explanation pipelines. | [205] |
| Applicability to graph‑based multi‑agent settings | Designed for graph neural networks, which are common in multi‑agent systems (e.g., social networks, recommendation graphs). | [205] |
| Robustness to adversarial perturbations | Claims improved faithfulness and sparsity of explanations under attack conditions, directly targeting CX failure modes. | [205] |
ATEX‑CF thus satisfies the core objective of integrating adversarial insights into counterfactual generation for graph‑based multi‑agent contexts, providing the most comprehensive coverage among existing solutions.
13.4 Gap Analysis
| Gap | Classification | Reason |
|---|---|---|
| Limited to Graph Neural Networks | (i) Closeable by integration | Combining ATEX‑CF with image‑based CX methods (e.g., CECAS [144]) could extend coverage to visual agents. |
| No explicit handling of policy misalignment | (ii) Requires new R&D | Current methods focus on explaining model output, not diagnosing misaligned policy inference in dynamic multi‑agent policies. |
| Trust degradation and cascading failures not explicitly modeled | (ii) Requires new R&D | Existing CX frameworks do not quantify how an adversarially‑crafted CX can erode stakeholder trust or trigger cascading agent failures. |
| Vulnerability to data poisoning | (i) Closeable by composition | Pairing ATEX‑CF with data‑poisoning mitigation techniques (e.g., robust training pipelines) could mitigate this gap. |
| Applicability to continuous‑time or temporal decision making | (ii) Requires new R&D | ATEX‑CF assumes static graph snapshots; temporal dynamics in multi‑agent RL require further extension. |
| Human‑in‑the‑loop interpretability | (i) Closeable by composition | Integrating ATEX‑CF outputs with human‑readable explanations (e.g., via SHAP or LIME) can improve usability. |
| Scalability to large‑scale graphs | (i) Closeable by composition | Leveraging graph subsampling or hierarchical explanations can address computational scalability. |
13.5 Verdict
Not Currently Possible – While ATEX‑CF provides the best single solution for counterfactual explanation under adversarial attack in graph‑based multi‑agent settings, it does not fully satisfy the broader objective of addressing misaligned policy inference, trust degradation, and cascading failures in diverse multi‑agent AI systems.
Closest Existing Fits
1. ATEX‑CF [205] – Offers integrated adversarial‑aware CX for GNNs; residual gap: lacks mechanisms for trust assessment and cascading failure analysis.
2. CECAS ([144] / [17]) – Provides causally‑guided CX for images; residual gap: not designed for graph‑based multi‑agent environments or adversarial robustness in policy inference.
3. DiCE [34] – Generates diverse CXs with causal constraints; residual gap: does not explicitly account for adversarial perturbations or multi‑agent policy dynamics.
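For reference, the sketch below shows baseline counterfactual generation with the open‑source DiCE library [34][131] on a toy tabular classifier; the dataset, feature names, and the "random" search method are illustrative, and the exact API may differ across DiCE versions. It is this vanilla CX pipeline whose faithfulness degrades under the adversarial conditions discussed above.

```python
# Generate diverse counterfactuals for a toy loan-approval model with dice_ml.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
import dice_ml

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.integers(21, 70, 500),
    "income": rng.integers(20_000, 120_000, 500),
})
df["approved"] = ((df["income"] > 50_000) & (df["age"] < 60)).astype(int)

clf = RandomForestClassifier(random_state=0).fit(df[["age", "income"]], df["approved"])

data = dice_ml.Data(dataframe=df, continuous_features=["age", "income"], outcome_name="approved")
model = dice_ml.Model(model=clf, backend="sklearn")
explainer = dice_ml.Dice(data, model, method="random")

query = df[["age", "income"]].iloc[[0]]   # instance to explain
cfs = explainer.generate_counterfactuals(query, total_CFs=3, desired_class="opposite")
cfs.visualize_as_dataframe(show_only_changes=True)
```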
14. Inaccurate Blame Attribution from Adversarial Coordination
14.1 Identify the Objective
The chapter must synthesize existing research and engineered solutions that address the challenge of misattributing blame in multi‑agent artificial intelligence (AI) systems when agents coordinate adversarially. Specifically, it should: (i) review mechanisms for detecting and mitigating misaligned policy inference, (ii) examine frameworks that enable reliable attribution of responsibility across agents, and (iii) assess how cascading failures induced by adversarial coordination can be detected and mitigated, all while drawing exclusively on established prior art.
14.2 Survey of Existing Prior Art
| # | Reference | Vendor / Project / Authors | Core Contribution |
|---|---|---|---|
| 1 | [22] | Multi‑Agent Accountability Research (NeurIPS 2021) | Introduces efficient approximation algorithms and causal tools for attributing responsibility in decentralized partially observable MDPs. |
| 2 | [61] | IET (In‑the‑Edge Attribution) | Provides forensic evidence of blame attribution via embedding signals in AI outputs; supports auditability even when logs are compromised. |
| 3 | [70] | CDC‑MAS (Causal Discovery for Multi‑Agent Systems) | Presents a performance‑causal inversion principle and Shapley‑based blame assignment for multi‑agent failures. |
| 4 | [180] | Same CDC‑MAS (duplicate reference) | Reinforces the causal inference approach for failure attribution. |
| 5 | [117] | ROMANCE (Robust Multi‑Agent Coordination via Evolutionary Generation of Auxiliary Adversarial Attackers) | Enables agents to train against diversified adversarial attacks, improving resilience to policy perturbation. |
| 6 | [106] | ROMANCE (full implementation) | Provides a framework for incorporating auxiliary adversarial attackers into MARL training. |
| 7 | [163] | Power Regularization in Cooperative DRL | Formalizes power concepts and introduces regularization to mitigate adversarial attacks in multi‑agent settings. |
| 8 | [72] | Anti‑Collusion Taxonomy for Multi‑Agent AI | Maps human anti‑collusion mechanisms to AI interventions; highlights attribution challenges. |
| 9 | [118] | AI Governance Framework (EY UK) | Discusses embedding human oversight into orchestration layers to mitigate autonomous decision risks. |
| 10 | [100] | OWASP Top 10 for Agentic Applications 2026 | Identifies cascading failures and insecure inter‑agent communication as key vulnerabilities. |
| 11 | [177] | TRUST (Decentralized AI Service v.0.1) | Provides a framework for decentralized verification, addressing opacity and fault attribution. |
| 12 | [13] | Orchestration Visibility Gap (Qualixar OS) | Highlights the mismatch between user‑perceived blame and actual agent interactions. |
| 13 | [117] | Robust Multi‑Agent Coordination (see #5) | Offers adversarial robustness through auxiliary attacks. |
| 14 | (reference not resolved) | – | – |
| 15 | [79] | RL Challenges Overview | Discusses credit assignment and exploration‑exploitation in multi‑agent learning. |
| 16 | [177] | TRUST (see #11) | – |
| 17 | [177] | TRUST (duplicate) | – |
| 18 | [117] | Robust Coordination | – |
| 19 | [117] | Robust Coordination | – |
| 20 | [117] | Robust Coordination | – |
| 21 | [117] | Robust Coordination | – |
Note: Several references (e.g., #5/6, #3/4) appear multiple times due to overlapping topics; they are treated as distinct contributions where appropriate.
14.3 Best-Fit Match
Automatic Failure Attribution and Critical Step Prediction Method for Multi‑Agent Systems Based on Causal Inference (Refs [70] and [180]) is the single prior‑art solution that most closely satisfies the objective. Its key capabilities and mapping to the requirement are:
| Requirement | Implementation Capability | Source |
|---|---|---|
| Reliable blame attribution across agents | Uses a performance‑causal inversion principle to reverse data flow in execution logs, enabling correct modeling of inter‑agent dependencies. | [70] |
| Handling of misaligned policy inference | Applies Shapley value‑based attribution to quantify each agent’s contribution to an outcome, mitigating misalignment by attributing responsibility to the correct policy. | [70] |
| Detection of cascading failures | Introduces CDC‑MAS, a causal discovery algorithm that identifies critical failure steps even in the presence of non‑stationary, multi‑agent interactions. | [180] |
| Resilience to adversarial coordination | While the method itself does not generate adversarial policies, it is agnostic to the presence of adversarial agents; attribution remains valid even when some agents act maliciously. | [70] |
Thus, this approach satisfies the core aspects of blame attribution, misaligned policy inference, and cascading failure detection, all within a causal inference framework.
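To make the Shapley‑based attribution step concrete, the toy sketch below computes exact Shapley values over a three‑agent team; the characteristic function (team score with a subset of agents behaving normally) is a placeholder for CDC‑MAS's causal performance model, and the agent names and scores are invented for illustration.

```python
# Exact Shapley-value blame attribution over a small agent team.
from itertools import permutations

agents = ["planner", "retriever", "executor"]

def team_performance(active):
    """Toy characteristic function: observed task score when only the agents in
    `active` behave normally (the rest are replaced by a neutral baseline)."""
    scores = {frozenset(): 0.0,
              frozenset({"planner"}): 0.2,
              frozenset({"retriever"}): 0.1,
              frozenset({"executor"}): 0.1,
              frozenset({"planner", "retriever"}): 0.5,
              frozenset({"planner", "executor"}): 0.4,
              frozenset({"retriever", "executor"}): 0.2,
              frozenset(agents): 0.3}   # full team underperforms: someone is to blame
    return scores[frozenset(active)]

def shapley_values(agents, value_fn):
    """Exact Shapley values by enumerating agent orderings (fine for small teams)."""
    contrib = {a: 0.0 for a in agents}
    orderings = list(permutations(agents))
    for order in orderings:
        coalition = set()
        for a in order:
            before = value_fn(coalition)
            coalition.add(a)
            contrib[a] += value_fn(coalition) - before
    return {a: c / len(orderings) for a, c in contrib.items()}

blame = shapley_values(agents, team_performance)
# The agent with the lowest Shapley value (here the executor, whose addition
# drops the planner+retriever score from 0.5 to 0.3) carries the largest share
# of blame for the degraded joint outcome.
print(blame)
```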
14.4 Gap Analysis
| Gap | Classification | Potential Remedy |
|---|---|---|
| Adversarial manipulation of logs | (i) Closeable by integration | Integrate IET [61] to embed tamper‑resistant attribution signals within agent outputs, so the original blame can be recovered even if logs are altered. |
| Identity fluidity (agents forked or modified at runtime) | (ii) Requires net‑new R&D | Existing attribution methods assume static agent identities. |
| Dynamic adversarial policy perturbation | (i) Closeable by composition | Combine with ROMANCE (Refs [117][106]) to expose agents to adversarial attacks during training, reducing the likelihood of misaligned policies that evade attribution. |
| Real‑time detection of cascading failures under distributed execution | (i) Closeable by integration | Augment with TRUST (Refs [177][99]) to provide decentralized verification and latency‑aware failure monitoring. |
| Robustness to adversarial prompts that cause misattribution | (i) Closeable by integration | Apply the OWASP Top 10 for Agentic Applications [100] guidance to enforce secure inter‑agent communication and guard against prompt injection. |
14.5 Verdict
Not Currently Possible – while existing solutions partially address blame attribution and adversarial coordination, no single prior‑art system fully satisfies all aspects of the objective. The three closest fits are:
- Automatic Failure Attribution and Critical Step Prediction (CDC‑MAS) – Provides causal blame attribution and failure localization but lacks mechanisms to detect or mitigate adversarial manipulation of logs and identity fluidity.
- IET (In‑the‑Edge Attribution) – Offers tamper‑resistant forensic evidence for blame attribution but does not incorporate causal inference for multi‑agent interactions or address cascading failures.
- ROMANCE (Robust Multi‑Agent Coordination) – Enables training against adversarial policies, improving resilience to misaligned policies, yet it does not provide explicit attribution of blame across agents when failures occur.
Each of these approaches covers a substantial portion of the requirement but leaves residual gaps, notably in handling dynamic adversarial coordination, ensuring robust attribution in the presence of manipulated logs, and managing identity fluidity.
15. Cascading Misinterpretation Leading to Suboptimal Joint Actions
15.1 Identify the Objective
The chapter must evaluate how misinterpretation of information, amplified through inter‑agent communication, leads to suboptimal joint actions in multi‑agent AI (MAS) systems. It should synthesize existing mechanisms that detect, mitigate, or prevent cascading misinterpretations caused by adversarial policy inference, trust degradation, and contamination propagation, and identify the extent to which current prior‑art solutions address these failure modes.
15.2 Survey of Existing Prior Art
| # | Solution | Key Feature | Citation |
|---|---|---|---|
| 1 | BlindGuard – Unsupervised detection and isolation of malicious agents in LLM‑driven MAS | Uses anomaly scoring on agent responses and communication graph to prune malicious links, preserving legitimate interactions | [160][26] |
| 2 | GUARDIAN – Temporal graph modelling of hallucination propagation | Explicitly captures propagation dynamics of hallucinations and errors across agents, enabling detection of misinterpretation chains | [95] |
| 3 | G2CP – Graph‑grounded communication protocol | Wraps messages in graph operations, reducing misinterpretation risk by grounding content in a shared ontology | [29] |
| 4 | AgentAsk – Plug‑and‑play clarification module for LLM‑based MAS | Inserts clarification steps at inter‑agent handoffs to halt cascading errors | [54][113] |
| 5 | Dynamic Trust Models (e.g., Hua et al. 2024) | Continuously estimates trustworthiness of agents based on observed behavior | [56] |
| 6 | Source‑Tagging Mechanism (Lee & Tiwari 2024) | Attaches provenance tags to prompts to prevent injection attacks | [56] |
| 7 | Graph‑Augmented LLM Agents[94] | Uses graph learning to guide reasoning, potentially reducing hallucination spread | [94] |
| 8 | Bi‑Level Graph Anomaly Detection[107] | Estimates anomaly scores per agent and prunes malicious edges, limiting propagation | [107] |
| 9 | Dynamic Confidence Thresholds[176] | Neglects attacked communication links to prevent influence spread | [176] |
| 10 | Model Poisoning Attacks (GRMP)[142] | Demonstrates how malicious updates can remain indistinguishable from benign updates | [142] |
| 11 | Prompt Virus Attack[56] | Self‑replicating prompts that cause rapid MAS paralysis | [56] |
| 12 | Agent‑Poison Attacks[56] | Pollutes agents’ memory or knowledge bases | [56] |
| 13 | PrivacyLens Attack[56] | Induces leakage of sensitive information | [56] |
| 14 | MCP Security Threats[56] | Man‑in‑the‑middle attacks on communication protocols | [56] |
| 15 | Graph‑Resfusion Approach[102] | Uses blockchain‑based trust calculations for validator agents in mobile AI networks | [102] |
| 16 | Agent‑Based Models for Misinformation[8] | Systematic analysis of dynamic social networks to mitigate spread | [8] |
| 17 | Distributed Nonlinear Control for Robotic Networks[158] | Resilient construction of local desired signals to handle adversarial interactions | [158] |
| 18 | Agentic Observability[14] | Provides audit trails of agent decisions, enabling root‑cause tracing | [14] |
| 19 | Agentic Security Frameworks[38] | Attestations and cryptographic verification at agent boundaries | [38] |
| 20 | Dynamic Prompt Sanitization[5] | Dual‑stage sanitization (pre‑agent and pre‑LLM) to prevent malicious propagation | [5] |
| 21 | Structured Message Schemas[32] | Typed schemas to reduce ambiguity in inter‑agent messages | [32] |
| 22 | Agent‑Based Red‑Team Testing[30] | Cross‑environment adversarial knowledge graph to uncover hidden vulnerabilities | [30] |
| 23 | Graph Knowledge Distillation[187] | Distills knowledge from teacher GNNs to mitigate adversarial influence | [187] |
| 24 | Federated Byzantine‑Resilient Learning[57] | Uses geometric median and Krum to defend against Byzantine agents | [57] |
| 25 | Distributed Security in Peer‑to‑Peer Networks[185] | Autonomous synchronization of security agents across devices | [185] |
15.3 Best‑Fit Match
GUARDIAN – Safeguarding LLM Multi‑Agent Collaborations with Temporal Graph Modeling
| Requirement | GUARDIAN Capability | Source |
|---|---|---|
| Model propagation dynamics of hallucinations and errors | Explicitly captures temporal propagation of misinterpretations via a discrete‑time temporal attributed graph | [95] |
| Detect cascading misinterpretation chains | By modeling agent interactions over time, it can identify when errors amplify across multiple agents | [95] |
| Provide auditability of inter‑agent communication | Temporal graph records message timestamps and content, enabling forensic tracing | [95] |
| Mitigate suboptimal joint actions | By flagging propagation hotspots, GUARDIAN can trigger intervention (e.g., re‑planning, clarification) to prevent drift | [95] |
GUARDIAN therefore most closely fulfills the objective of monitoring and preventing cascading misinterpretation in MAS.
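As a simplified illustration of propagation tracing (not GUARDIAN's actual temporal‑graph model), the sketch below builds a toy directed message graph with networkx and returns the agents downstream of a flagged message; the agent names and turn indices are invented for illustration.

```python
# Trace which agents can be contaminated by a message flagged as hallucinated.
import networkx as nx

# Directed temporal graph: an edge (u, v) with attribute `turn` means agent u
# sent a message to agent v at that turn.
G = nx.DiGraph()
messages = [("planner", "retriever", 1), ("retriever", "executor", 2),
            ("planner", "critic", 2), ("executor", "reporter", 3)]
for src, dst, turn in messages:
    G.add_edge(src, dst, turn=turn)

def downstream_of(graph, agent, after_turn):
    """Agents reachable from `agent` via messages sent strictly after `after_turn`."""
    sub = nx.DiGraph([(u, v) for u, v, d in graph.edges(data=True) if d["turn"] > after_turn])
    sub.add_node(agent)
    return nx.descendants(sub, agent)

# If an anomaly detector flags the retriever's turn-2 output as hallucinated,
# the potentially contaminated agents are exactly its temporal descendants.
contaminated = downstream_of(G, "retriever", after_turn=1)
print(contaminated)   # e.g., {'executor', 'reporter'}
```

Flagging such propagation hotspots is what would trigger the re‑planning or clarification interventions listed in the table above.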
15.4 Gap Analysis
| Gap | Class | Closure Option |
|---|---|---|
| 1. Detection of malicious policy inference – GUARDIAN models hallucination spread but does not identify agents that have been poisoned to infer incorrect policies. | (ii) Requires net‑new R&D | Not addressed by existing GUARDIAN implementation. |
| 2. Trust degradation monitoring – GUARDIAN lacks an explicit trust score that degrades as misinterpretations accumulate. | (i) Closeable by integration | Combine with dynamic trust models (Hua et al. 2024) and source‑tagging (Lee & Tiwari 2024). |
| 3. Isolation of compromised agents – GUARDIAN can flag misinterpretation but does not prune or isolate agents. | (i) Closeable by composition | Integrate with BlindGuard’s anomaly scoring and edge pruning [160][26]. |
| 4. Model poisoning resilience – GUARDIAN assumes clean model updates; it cannot detect GRMP‑style poisoning where updates remain statistically benign. | (ii) Requires net‑new R&D | No existing solution fully mitigates GRMP. |
| 5. Prompt injection defense – GUARDIAN does not sanitize prompts or enforce pre‑agent/LLM checks. | (i) Closeable by integration | Incorporate dual‑stage sanitization [5] and source tagging [56]. |
| 6. Real‑time intervention – GUARDIAN’s temporal model is retrospective; it does not trigger real‑time corrective actions. | (ii) Requires net‑new R&D | Development of online intervention policies is not covered by current prior art. |
15.5 Verdict
Not Currently Possible
| Closest Existing Fits | Coverage | Residual Gap |
|---|---|---|
| GUARDIAN (Temporal graph modeling) | Captures propagation dynamics and provides auditability of cascading misinterpretations. | Lacks mechanisms for malicious policy inference detection, trust degradation, and real‑time isolation. |
| BlindGuard (Unsupervised anomaly detection) | Detects and isolates malicious agents via anomaly scores and edge pruning. | Does not model temporal propagation or address model poisoning and prompt injection. |
| AgentAsk (Clarification module) | Inserts explicit clarification steps to halt cascading errors. | Requires integration with temporal propagation modeling and trust management; does not detect underlying poisoning or injection attacks. |
These three solutions together cover most of the objective, but none alone or in straightforward composition fully guarantees prevention of cascading misinterpretation due to adversarial policy inference or model poisoning. Additional research is required to integrate temporal propagation, trust dynamics, and poisoning defenses into a unified, deployable framework.