This chapter must provide a literature‑review synthesis that (i) identifies how misaligned policy inference in multi‑agent AI systems propagates through joint decision‑making processes, (ii) evaluates the resulting erosion of trust among system users and stakeholders, and (iii) delineates the mechanisms by which such misalignment can cascade into systemic failures. The analysis should rely exclusively on existing, fully specified research methods, commercial products, or open‑source projects that are currently available, and must map each cited contribution to the specific aspects of misalignment propagation, trust degradation, and cascading failures.
The following table lists all prior‑art solutions that address one or more components of the objective: joint perception‑decision vulnerability, multi‑agent misalignment, trust dynamics, or cascading failure mechanisms. Each entry is cited with its bracketed reference number ([1]–[16]).
| # | Solution | Domain | Key Feature(s) Relevant to Objective | Source |
|---|---|---|---|---|
| 6.2.1 | Perception‑Decision Joint Attack (PDJA) | Adversarial attacks on multimodal agents | Joint perturbation of perception and policy modules to induce low‑reward trajectories; demonstrates how a single adversarial perturbation can propagate through perception‑policy pipelines, causing systemic degradation | [1] |
| 6.2.2 | Confusion‑Based Communication for Multi‑Agent Resilience | Multi‑agent reinforcement learning | Agents broadcast observations with priority set by measured confusion, i.e., the misalignment between expected and observed rewards; illustrates how communication protocols can contain propagated misalignment (a minimal sketch follows the survey summary below) | [2] |
| 6.2.3 | HiMAC: Hierarchical Macro‑Micro Learning | Long‑horizon LLM agents | Decouples global macro planning from micro‑level execution so that minor local errors do not cascade into loss of the global goal; addresses error propagation across hierarchical decision layers | [3] |
| 6.2.4 | NOD (Navigator‑Operator‑Director) Architecture | Service‑oriented multi‑agent systems | External oversight agent verifies critical actions; mitigates misaligned policy execution and prevents cascading failures | [4] |
| 6.2.5 | Fast Adversarial Training (FAT) with Distribution‑aware Guidance (DDG) | Robustness of neural networks | Adjusts perturbation budgets based on sample confidence to reduce overfitting and protect against cascading adversarial errors | [5] |
| 6.2.6 | Evolving DPO (EvoDPO, within the ATLAS agent) | Preference‑based multi‑agent learning | Adaptive reference management dynamically updates reference policies to avoid misaligned references and policy drift; relevant for long‑term trust maintenance | [6] |
| 6.2.7 | Autonomous Evolution of EDA Tools (Self‑Evolved ABC) | Auto‑engineering of multi‑agent rulebases | Self‑evolving rulebase governs multi‑agent coordination and constrains code and heuristic modifications, curbing misaligned changes | [7] |
| 6.2.8 | Multi‑Agent Thompson Sampling for Bandit Coordination | Cooperative control of wind turbines | Coordinates agents whose locally beneficial actions reduce neighbours' returns (wake effects); illustrates how uncoordinated local decisions can cascade into shared‑resource performance loss | [8][9] |
| 6.2.9 | Multi‑Agent Reinforcement Learning with Autonomous Coordination | Multi‑agent system dynamics | Highlights autocurricula and misalignment in adversarial settings; reveals failure modes that can cascade | [10] |
| 6.2.10 | Multimodal Adversarial Attacks on Vision‑Language‑Action Models (SABER) | Vision‑language‑action pipelines | Black‑box sequential attack framework that propagates misaligned inference through multi‑turn interactions | [11] |
| 6.2.11 | Natural Adversarial Sample Generation via Diffusion (NatADiff) | Diffusion‑based adversarial sample generation | Uses denoising diffusion to generate natural adversarial samples that can mislead downstream decision modules, illustrating how misalignment propagates from realistic inputs | [12] |
| 6.2.12 | Adversarial‑Robust Multivariate Time‑Series Anomaly Detection (ARTA) | Time‑series anomaly detection | Joint training of detector and perturbation generator; shows how minimal adversarial perturbations can cascade into detection failures | [13] |
| 6.2.13 | Policy Disruption in RL (LLM‑Based Attacks) | RL policy vulnerability | Attacks spanning environment poisoning (reward and transition manipulation), state perturbation, and adversarial action insertion; relevant for cascading policy failures | [14] |
| 6.2.14 | Multi‑Agent Guided Policy Search with Non‑Cooperative Games | Non‑cooperative multi‑agent games | Shows how misaligned objectives drive instability and limit‑cycle behavior in multi‑agent policy gradients, yielding suboptimal joint policies and potential failure cascades | [15] |
| 6.2.15 | Perturbation‑Based Robustness and Simplicity‑Bias Analysis | Model robustness evaluation | Analyzes model outputs under perturbations and shows the limits of output‑sensitivity measures; informs how vulnerability to misaligned inference is assessed for trust evaluation | [16] |
The survey covers joint perception‑policy vulnerability (PDJA), multi‑agent misalignment mitigation (HiMAC, NOD, confusion‑based communication), robustness techniques (FAT–DDG), and longitudinal policy evolution (EvoDPO). It also includes examples of cascading failures in control‑system settings (wind‑turbine coordination) and multi‑agent games.
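The confusion mechanism of entry 6.2.2 [2] lends itself to a compact illustration. The sketch below assumes a tabular Q‑function and treats confusion as the magnitude of the temporal‑difference error between the observed and estimated reward signal; the broadcast threshold and message format are illustrative assumptions, not the paper's exact protocol.

```python
import numpy as np

def confusion(q_table, state, action, reward, next_state, gamma=0.99):
    """Confusion = |observed reward signal - estimated value|, via the Q function."""
    estimate = q_table[state, action]
    target = reward + gamma * np.max(q_table[next_state])
    return abs(target - estimate)

def maybe_broadcast(agent_id, obs, conf, threshold=0.5):
    """Share an observation only when the agent is sufficiently confused."""
    if conf > threshold:
        return {"sender": agent_id, "obs": obs, "priority": conf}
    return None

# Usage: after each environment step, every agent measures its confusion and
# broadcasts high-confusion observations so teammates can adapt their policies.
q = np.zeros((10, 4))                      # toy 10-state, 4-action Q-table
c = confusion(q, state=3, action=1, reward=1.0, next_state=4)
print(maybe_broadcast(agent_id=0, obs=(3, 1), conf=c))
```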
Perception‑Decision Joint Attack (PDJA)[1] is the single prior‑art solution that most directly satisfies the objective of demonstrating how a misaligned inference in the perception module can propagate through the decision‑making pipeline, degrading trust and potentially triggering cascading failures.
| PDJA Feature | Objective Requirement | Mapping |
|---|---|---|
| Dual perturbator (perception & decision) | Joint propagation of misaligned inference | PDJA explicitly models how an adversarial perturbation in perception is amplified by the policy network, leading to low‑reward actions across the system. |
| Explicit modeling of perception‑action interaction | Mechanism of trust degradation | By showing that perception errors can be hidden yet still induce incorrect decisions, PDJA illustrates how users may lose trust when outcomes diverge from expectations. |
| Attack success measured via joint reward degradation | Cascading failure illustration | The paper reports that a single perceptual perturbation can reduce the overall joint reward, implying a systemic cascade. |
| Use of realistic multimodal inputs | Relevance to joint decision‑making | PDJA operates on vision‑language‑action models, mirroring real‑world AI systems that integrate multiple modalities. |
Thus, PDJA satisfies the core requirement of illustrating the propagation mechanism, but it is framed as an adversarial attack rather than a benign misaligned inference scenario.
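To make the propagation mechanism concrete, the toy below shows how a small input perturbation can be amplified by a downstream policy into a decision flip. This is our illustration, not the authors' implementation; the weight matrices, dimensions, and budget are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
W_perc = rng.normal(size=(8, 8))           # stand-in perception encoder
W_pol = rng.normal(size=(3, 8))            # stand-in policy head (3 actions)

def act(obs):
    feat = np.tanh(W_perc @ obs)           # perception module
    return int(np.argmax(W_pol @ feat))    # decision module

flips = 0
for _ in range(100):
    obs = rng.normal(size=8)
    delta = 0.05 * np.sign(rng.normal(size=8))   # small perturbation budget
    flips += act(obs) != act(obs + delta)        # did the decision change?
print(f"{flips}/100 small perturbations flipped the downstream action")
# Every flipped action turns a correct decision into an incorrect one: the
# perception-level error is amplified by the policy rather than absorbed.
```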
| Gap | Classification | Remedy (Existing Prior Art) |
|---|---|---|
| 1. Lack of trust‑degradation metrics (e.g., user‑trust scores, confidence calibration) | (i) Closeable by integration with existing trust‑evaluation frameworks (e.g., user‑experience studies on LLMs) | The referenced "User‑Trust in LLMs" benchmark is not in the surveyed corpus; confidence calibration offers an interim proxy (see the calibration sketch after this table) |
| 2. Absence of long‑term cascading failure analysis beyond single‑step reward loss | (i) Closeable by composing PDJA with multi‑agent coordination studies (HiMAC, NOD) | Use HiMAC’s hierarchical error isolation to trace failure propagation |
| 3. No mitigation or mitigation‑evaluation strategies presented | (ii) Requires R&D (partial mitigation exists) | Integrate Fast Adversarial Training with Distribution‑aware Guidance [5] to reduce overfitting and mitigate cascading errors (see the budget‑scaling sketch below) |
| 4. No empirical studies on trust erosion in realistic operational settings | (ii) Net‑new R&D | Conduct controlled user‑study experiments (not available) |
| 5. Lack of formal modeling of misalignment dynamics in multi‑agent learning (e.g., autocurricula) | (i) Closeable by combining PDJA with Autocurricula literature [10] | Use autocurriculum to simulate progressive misalignment over training cycles |
| 6. No documented cascading failure scenarios (e.g., wind‑turbine coordination, traffic control) | (i) Closeable by leveraging existing coordination studies [8][9] | Map perception‑policy misalignment to shared‑resource failure cases |
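Gap 1 admits at least an interim, quantitative proxy. The sketch below computes expected calibration error (ECE), a standard measure of how far a system's stated confidence diverges from its accuracy; a system that stays confident while failing gives users a measurable reason to distrust it. The binning scheme and bin count are common defaults, not drawn from any cited paper.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: occupancy-weighted gap between mean confidence and accuracy per bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap  # weight the gap by bin occupancy
    return ece

# Usage: a rising ECE after a misalignment event is a measurable trust signal.
print(expected_calibration_error([0.9, 0.8, 0.95, 0.6], [1, 0, 1, 1]))
```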
The dominant gap is the lack of a unified framework that simultaneously models misaligned inference propagation, quantifies trust degradation, and predicts cascading failures in realistic multi‑agent deployments. Existing solutions address individual facets but do not integrate them into a single analytic chain.
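For Gap 3, the remedy drawn from [5] can be sketched in its general shape: a fast (single‑step) adversarial training step whose per‑sample perturbation budget shrinks for low‑confidence samples. Scaling the budget by softmax confidence is our reading of the distribution‑aware idea, not the paper's exact rule, and `model` stands for any image classifier.

```python
import torch
import torch.nn.functional as F

def fgsm_step(model, x, y, eps_max=8 / 255):
    """Single-step attack whose budget shrinks for low-confidence samples."""
    x = x.clone().requires_grad_(True)
    logits = model(x)
    # Confident samples get a budget near eps_max; uncertain samples get less,
    # which is intended to curb catastrophic overfitting during fast training.
    conf = F.softmax(logits, dim=1).gather(1, y[:, None]).squeeze(1)
    eps = eps_max * conf.detach().view(-1, 1, 1, 1)
    loss = F.cross_entropy(logits, y)
    grad = torch.autograd.grad(loss, x)[0]
    return (x + eps * grad.sign()).clamp(0, 1).detach()

# Training loop (as in standard FAT): x_adv = fgsm_step(model, x, y), then
# minimize cross-entropy on x_adv; only the budget schedule differs here.
```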
Not Currently Possible – The objective of fully characterizing propagation of misaligned inference through joint decision‑making, alongside quantifying trust degradation and predicting cascading failures, cannot be achieved solely with existing prior‑art components. The closest fits are:
PDJA (Perception‑Decision Joint Attack) provides explicit evidence of perception‑policy misalignment propagation and its impact on joint reward [1]. Coverage: demonstrates how a single perceptual perturbation can cascade to decision‑making outputs. Residual gap: does not address trust metrics or longer‑term cascading‑failure dynamics.
HiMAC (Hierarchical Macro‑Micro Learning) offers a structured architecture that isolates execution‑level errors and reduces error propagation [3]. Coverage: shows how hierarchical state tracking can prevent local misalignment from becoming global failure. Residual gap: lacks direct modeling of perception‑policy misalignment or trust‑degradation mechanisms.
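A minimal sketch of this error‑isolation pattern, assuming HiMAC's macro/micro split reduces to an explicit sub‑goal plan tracked outside the executor; the class and function names are ours, not HiMAC's API.

```python
from dataclasses import dataclass, field

@dataclass
class GlobalState:
    """Explicit plan state, tracked outside the executor (illustrative)."""
    subgoals: list
    cursor: int = 0
    failures: dict = field(default_factory=dict)

def run(plan: GlobalState, micro_execute, max_retries=2):
    """Advance sub-goal by sub-goal; contain local failures at the step level."""
    while plan.cursor < len(plan.subgoals):
        goal = plan.subgoals[plan.cursor]
        if micro_execute(goal):        # micro-policy: atomic actions for one sub-goal
            plan.cursor += 1           # local success advances the global plan
        else:
            n = plan.failures.get(goal, 0) + 1
            plan.failures[goal] = n
            if n > max_retries:
                return False           # isolate the error instead of drifting on
    return True

# Usage: run(GlobalState(["open_app", "fill_form", "submit"]), my_executor)
```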
NOD (Navigator‑Operator‑Director) introduces an external verification layer to enforce correct decisions and prevent cascading failures [4]. Coverage: provides a practical mitigation strategy against misaligned policy execution. Residual gap: does not analyze how misaligned inference propagates across perception‑policy pipelines or quantify trust erosion.
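The oversight pattern can likewise be sketched. The critical‑action set, the verifier predicate, and the interfaces below are assumptions about the general Director pattern, not NOD's concrete design.

```python
# Hypothetical critical-action registry and verifier predicate.
CRITICAL = {"transfer_funds", "delete_record", "cancel_order"}

def director_approves(action, args, global_state):
    """Stand-in Director check: verify the action against externalized state."""
    return action not in CRITICAL or global_state.get("user_confirmed", False)

def execute(action, args, global_state, tools):
    """Operator path: every action passes the Director before any tool runs."""
    if not director_approves(action, args, global_state):
        return {"status": "blocked", "reason": f"{action} requires verification"}
    return {"status": "ok", "result": tools[action](**args)}

# A single blocked misaligned action at this gate is precisely the cut-point
# that keeps one bad policy inference from cascading into downstream steps.
```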
These three solutions collectively cover the principal aspects of misalignment propagation, mitigation, and hierarchical control, but none alone spans the entire objective. Therefore, the current state of prior art yields only partial coverage, leaving the full objective unresolved.
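Finally, the missing unified analytic chain can at least be prototyped end to end. The toy below, with entirely hypothetical dynamics, injects one misaligned inference into a chain of agents and jointly tracks task reward and a naive trust score; the point is the shape of the combined measurement the surveyed works never integrate, not the numbers.

```python
import random

def simulate(n_agents=5, p_amplify=0.6, trials=1000, seed=0):
    """Inject one misaligned inference; measure joint reward and a trust proxy."""
    rng = random.Random(seed)
    rewards, trust = [], 1.0
    for _ in range(trials):
        corrupted = True                      # agent 0 starts misaligned
        for _ in range(1, n_agents):
            # each downstream agent either amplifies or absorbs the error
            corrupted = corrupted and rng.random() < p_amplify
        r = -1.0 if corrupted else 1.0        # joint reward of the chain
        rewards.append(r)
        if r < 0:
            trust *= 0.99                     # each visible failure erodes trust
    return sum(rewards) / trials, trust

print(simulate())  # (mean joint reward, residual trust) after 1000 episodes
```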
| # | Reference |
|---|---|
| 1 | Perception‑Decision Joint Attack (PDJA): a unified framework that jointly perturbs observations and actions in multimodal agents. 2026‑02‑07. |
| 2 | Promoting Resilience in Multi‑Agent Reinforcement Learning via Confusion‑Based Communication. 2020‑12‑31. |
| 3 | HiMAC: Hierarchical Macro‑Micro Learning for Long‑Horizon LLM Agents. 2026‑02‑28. |
| 4 | No Action Without a NOD: A Heterogeneous Multi‑Agent Architecture for Reliable Service Agents. 2026‑05‑13. |
| 5 | Mitigating Error Amplification in Fast Adversarial Training. 2026‑04‑27. |
| 6 | ATLAS: Adaptive Self‑Evolutionary Research Agent with Task‑Distributed Multi‑LLM Supporters (introduces EvoDPO). 2026‑02‑01. |
| 7 | Autonomous Evolution of EDA Tools: Multi‑Agent Self‑Evolved ABC. 2026‑04‑15. |
| 8 | Thompson Sampling for Factored Multi‑Agent Bandits. 2020‑05‑04. |
| 9 | Multi‑Agent Thompson Sampling for Bandit Applications with Sparse Neighbourhood Structures. 2020‑04‑20. |
| 10 | Multi‑agent reinforcement learning (overview article). 2026‑02‑20. |
| 11 | SABER: A Stealthy Agentic Black‑Box Attack Framework for Vision‑Language‑Action Models. 2026‑03‑25. |
| 12 | NatADiff: Adversarial Boundary Guidance for Natural Adversarial Diffusion. 2025‑05‑26. |
| 13 | ARTA: Adversarial‑Robust Multivariate Time‑Series Anomaly Detection via Sparsity‑Constrained Perturbations. 2026‑05‑06. |
| 14 | Policy Disruption in Reinforcement Learning: Adversarial Attack with Large Language Models and Critical State Identification. 2025‑07‑23. |
| 15 | Multi‑Agent Guided Policy Search for Non‑Cooperative Dynamic Games. 2025‑09‑28. |
| 16 | A modern look at simplicity bias in image classification tasks. 2026‑05‑07. |