You will create the first adversarial system that watches agents’ belief evolution in real time, detecting subtle misalignments before they cascade into catastrophic failures—an essential safety layer for any large‑scale, partially observable MARL deployment.
By treating belief trajectories as temporal sequences and training a discriminator to distinguish expert trajectories from agent trajectories, you will bridge adversarial learning, imitation learning, and multi-agent RL in a way that has never been attempted at this scale.
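To make the approach concrete, below is a minimal GAIL-style sketch of such a trajectory discriminator in PyTorch. Everything in it is an illustrative assumption rather than the project's actual design: the name BeliefDiscriminator, the GRU encoder, the dimensions, and the training-step interface.

```python
import torch
import torch.nn as nn

class BeliefDiscriminator(nn.Module):
    """Scores a belief trajectory; a higher logit means 'more expert-like'."""
    def __init__(self, belief_dim: int, hidden_dim: int = 128):
        super().__init__()
        # Recurrent encoder summarizes the whole trajectory into one state.
        self.encoder = nn.GRU(belief_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, 1)

    def forward(self, beliefs: torch.Tensor) -> torch.Tensor:
        # beliefs: (batch, time, belief_dim)
        _, h_n = self.encoder(beliefs)          # h_n: (1, batch, hidden_dim)
        return self.head(h_n[-1]).squeeze(-1)   # (batch,) logits

def discriminator_step(disc, optimizer, expert_batch, agent_batch):
    """One adversarial update: expert trajectories labeled 1, agent ones 0."""
    logits = torch.cat([disc(expert_batch), disc(agent_batch)])
    labels = torch.cat([torch.ones(expert_batch.size(0)),
                        torch.zeros(agent_batch.size(0))])
    loss = nn.functional.binary_cross_entropy_with_logits(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

As in GAIL, the agent would then be optimized against the discriminator's score, so belief dynamics that drift away from expert behavior are penalized even while the task reward still looks healthy.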
Adversarial Alignment Detection
From: Partial Observability Amplification of Misalignment
Mission: To design and train a discriminator that monitors joint belief trajectories, flags abnormal divergences, and provides an adversarial signal that protects against reward hacking and deceptive policies.
Deliverables: A temporal belief-trajectory discriminator, a training framework with expert trajectories, integration hooks for the JBWM and reward-decomposition modules, and evaluation pipelines.
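As one hypothetical shape for those integration hooks, the check below builds on the sketch above and flags a trajectory once the discriminator's expert-probability drops below a threshold; the function name, the 0.5 default, and the per-trajectory calling convention are all assumptions for illustration.

```python
@torch.no_grad()
def flag_divergence(disc: BeliefDiscriminator,
                    trajectory: torch.Tensor,
                    threshold: float = 0.5) -> bool:
    """Return True when a belief trajectory no longer looks expert-like."""
    # trajectory: (time, belief_dim); add a batch dimension for the model.
    p_expert = torch.sigmoid(disc(trajectory.unsqueeze(0))).item()
    return p_expert < threshold
```

In deployment, such a check would presumably run over sliding windows of the joint belief state produced by the JBWM, with flagged windows routed into the evaluation pipeline.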
Requirements: PhD in Computer Science, Electrical Engineering, or a related field with a focus on machine learning or AI safety.
Success metrics: Within 12 months, achieve ≥90% detection of misalignment events on benchmark tasks, reduce reward-hacking incidents by 80%, and embed the discriminator into the BAAC production pipeline.
Growth path: Lead a research group focused on AI safety and alignment, mentor junior scientists, and shape the company's long-term strategy for trustworthy multi-agent systems.
If this sounds like the challenge you have been looking for, we want to hear from you. We value what you can build over where you have been.