Lead the design of a self‑reinforcing safety loop that turns every detected deception into a learning signal for the model. You’ll build the RL controller that continuously adapts the safety reward, ensuring the LLM remains trustworthy even as attackers evolve.
This role pioneers safety‑RL at the scale of billions of parameters, integrating internal confidence signals (e.g., low‑entropy refusals) into a dynamic reward model, an approach demonstrated in recent research but, to our knowledge, not yet deployed in production.
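To give a flavor of the signal involved: below is a minimal sketch, assuming per‑token logits are available for the refusal span, of turning token entropy into a scalar confidence feature that a dynamic reward model could consume. The function name and shapes are illustrative, not an existing API.

```python
import torch
import torch.nn.functional as F

def refusal_confidence(logits: torch.Tensor) -> torch.Tensor:
    """Map per-token logits from a refusal span to a scalar in (0, 1].

    Low mean entropy across the span means the model refused decisively;
    high entropy suggests an uncertain, possibly unreliable, refusal.
    logits: [seq_len, vocab_size] for the tokens of the refusal.
    """
    log_probs = F.log_softmax(logits, dim=-1)
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1)  # [seq_len]
    return torch.exp(-entropy.mean())  # 1.0 = maximally confident refusal

# Hypothetical usage: 12 refusal tokens over a 32k vocabulary.
logits = torch.randn(12, 32_000)
confidence = refusal_confidence(logits)
```

A dynamic reward model could then weight refusal rewards by this confidence, so hedged, low‑conviction refusals earn less credit than decisive ones.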
Continuous Adversarial Feedback Loop (CAFL)
Derived from: Adversarial Prompt Injection and Misleading Explanations
CAFL requires a reinforcement‑learning controller that dynamically tunes the model’s safety reward function based on real‑time fidelity scores. The role must blend safety‑RL, adversarial training, and large‑scale policy optimization.
The core deliverable is an end‑to‑end RL pipeline that ingests fidelity scores, converts them into safety rewards, and fine‑tunes the LLM policy to penalize deceptive strategies while preserving utility; a sketch of the controller at its heart follows.
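This is one way such a controller could look: a minimal sketch, under assumed interfaces (a fidelity score in [0, 1] from the deception detector and a scalar helpfulness judgment), of a dual‑ascent weight that tightens the deception penalty when fidelity drops and relaxes it when the model behaves. Every name here is illustrative.

```python
from dataclasses import dataclass

@dataclass
class SafetyRewardController:
    """Adapts the deception-penalty weight from streaming fidelity scores.

    fidelity_score in [0, 1]: 1.0 means the output was fully faithful,
    0.0 means confirmed deception. target is the fidelity floor we defend.
    """
    lam: float = 1.0         # current deception-penalty weight
    target: float = 0.95     # desired average fidelity
    step_size: float = 0.05  # controller learning rate
    lam_max: float = 10.0    # cap so safety never fully drowns out utility

    def update(self, fidelity_score: float) -> None:
        # Dual-ascent style update: tighten the penalty when fidelity
        # falls below target, relax it when the model is behaving.
        self.lam += self.step_size * (self.target - fidelity_score)
        self.lam = min(max(self.lam, 0.0), self.lam_max)

    def reward(self, helpfulness: float, fidelity_score: float) -> float:
        # Composite reward fed to the policy optimizer (e.g., PPO):
        # preserve utility, penalize deception in proportion to lam.
        return helpfulness - self.lam * (1.0 - fidelity_score)

# Example: a detected deception (fidelity 0.2) drives lam up before the
# next batch of rollouts is scored.
ctrl = SafetyRewardController()
ctrl.update(fidelity_score=0.2)
r = ctrl.reward(helpfulness=0.9, fidelity_score=0.2)
```

The design choice worth noting is that the penalty weight is state, not a constant: every detected deception shifts it, which is what makes the loop self‑reinforcing rather than a one‑time calibration.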
PhD or Master’s in Machine Learning, Computer Science, or a related field with a focus on reinforcement learning or AI safety.
Within 12 months, deploy a CAFL pipeline that raises the model's defense success rate above 80% against a standing suite of jailbreaks while maintaining helpfulness above 90%, and establish a continuous learning loop that cuts the success rate of newly observed attacks by 50% month over month.
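As one illustration of how those two headline numbers might be tracked (the model, attack suite, and judges below are all placeholders, not an existing harness):

```python
from typing import Callable, Sequence

def defense_success_rate(model: Callable[[str], str],
                         jailbreaks: Sequence[str],
                         is_harmful: Callable[[str], bool]) -> float:
    """Fraction of jailbreak prompts that fail to elicit harmful output."""
    defended = sum(not is_harmful(model(p)) for p in jailbreaks)
    return defended / len(jailbreaks)

def helpfulness_rate(model: Callable[[str], str],
                     benign_prompts: Sequence[str],
                     is_helpful: Callable[[str], bool]) -> float:
    """Fraction of benign prompts answered helpfully, per an external judge."""
    return sum(is_helpful(model(p)) for p in benign_prompts) / len(benign_prompts)

# Release gate mirroring the 12-month target:
#   defense_success_rate(...) > 0.80 and helpfulness_rate(...) > 0.90
```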
Scale the feedback loop to multi‑agent coordination scenarios, extend the reward model to multimodal tasks, and lead a cross‑disciplinary team that blends RL, interpretability, and safety engineering.
If this sounds like the challenge you have been looking for, we want to hear from you. We value what you can build over where you have been.