Lead the design of a self‑reinforcing safety loop that turns every detected deception into a learning signal for the model. You’ll build the RL controller that continuously adapts the safety reward, ensuring the LLM remains trustworthy even as attackers evolve.
This role pioneers safety‑RL at the scale of billions of parameters, integrating internal confidence signals (e.g., low‑entropy refusals) into a dynamic reward model, an approach demonstrated in recent research but, to our knowledge, not yet deployed in production.
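To give a flavor of the signal involved: below is a minimal sketch, assuming per‑token logits are available for the refusal span, of turning token entropy into a scalar confidence feature that a dynamic reward model could consume. The function name and shapes are illustrative, not an existing API.

```python
import torch
import torch.nn.functional as F

def refusal_confidence(logits: torch.Tensor) -> torch.Tensor:
    """Map per-token logits from a refusal span to a scalar in (0, 1].

    Low mean entropy across the span means the model refused decisively;
    high entropy suggests an uncertain, possibly unreliable, refusal.
    logits: [seq_len, vocab_size] for the tokens of the refusal.
    """
    log_probs = F.log_softmax(logits, dim=-1)
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1)  # [seq_len]
    return torch.exp(-entropy.mean())  # 1.0 = maximally confident refusal

# Hypothetical usage: 12 refusal tokens over a 32k vocabulary.
logits = torch.randn(12, 32_000)
confidence = refusal_confidence(logits)
```

A dynamic reward model could then weight refusal rewards by this confidence, so hedged, low‑conviction refusals earn less credit than decisive ones.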
Continuous Adversarial Feedback Loop (CAFL)
Derived from: Adversarial Prompt Injection and Misleading Explanations
CAFL requires a reinforcement‑learning controller that dynamically tunes the model’s safety reward function based on real‑time fidelity scores. The role must blend safety‑RL, adversarial training, and large‑scale policy optimization.
The core deliverable is an end‑to‑end RL pipeline that ingests fidelity scores, converts them into safety rewards, and fine‑tunes the LLM policy to penalize deceptive strategies while preserving utility; a sketch of the controller at its heart follows.
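This is one way such a controller could look: a minimal sketch, under assumed interfaces (a fidelity score in [0, 1] from the deception detector and a scalar helpfulness judgment), of a dual‑ascent weight that tightens the deception penalty when fidelity drops and relaxes it when the model behaves. Every name here is illustrative.

```python
from dataclasses import dataclass

@dataclass
class SafetyRewardController:
    """Adapts the deception-penalty weight from streaming fidelity scores.

    fidelity_score in [0, 1]: 1.0 means the output was fully faithful,
    0.0 means confirmed deception. target is the fidelity floor we defend.
    """
    lam: float = 1.0         # current deception-penalty weight
    target: float = 0.95     # desired average fidelity
    step_size: float = 0.05  # controller learning rate
    lam_max: float = 10.0    # cap so safety never fully drowns out utility

    def update(self, fidelity_score: float) -> None:
        # Dual-ascent style update: tighten the penalty when fidelity
        # falls below target, relax it when the model is behaving.
        self.lam += self.step_size * (self.target - fidelity_score)
        self.lam = min(max(self.lam, 0.0), self.lam_max)

    def reward(self, helpfulness: float, fidelity_score: float) -> float:
        # Composite reward fed to the policy optimizer (e.g., PPO):
        # preserve utility, penalize deception in proportion to lam.
        return helpfulness - self.lam * (1.0 - fidelity_score)

# Example: a detected deception (fidelity 0.2) drives lam up before the
# next batch of rollouts is scored.
ctrl = SafetyRewardController()
ctrl.update(fidelity_score=0.2)
r = ctrl.reward(helpfulness=0.9, fidelity_score=0.2)
```

The design choice worth noting is that the penalty weight is state, not a constant: every detected deception shifts it, which is what makes the loop self‑reinforcing rather than a one‑time calibration.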
PhD or Master’s in Machine Learning, Computer Science, or a related field with a focus on reinforcement learning or AI safety.
Within 12 months, deploy a CAFL pipeline that raises the model's defense success rate above 80% against a standing suite of jailbreaks while maintaining helpfulness above 90%, and establish a continuous learning loop that cuts the success rate of newly observed attacks by 50% month over month.
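As one illustration of how those two headline numbers might be tracked (the model, attack suite, and judges below are all placeholders, not an existing harness):

```python
from typing import Callable, Sequence

def defense_success_rate(model: Callable[[str], str],
                         jailbreaks: Sequence[str],
                         is_harmful: Callable[[str], bool]) -> float:
    """Fraction of jailbreak prompts that fail to elicit harmful output."""
    defended = sum(not is_harmful(model(p)) for p in jailbreaks)
    return defended / len(jailbreaks)

def helpfulness_rate(model: Callable[[str], str],
                     benign_prompts: Sequence[str],
                     is_helpful: Callable[[str], bool]) -> float:
    """Fraction of benign prompts answered helpfully, per an external judge."""
    return sum(is_helpful(model(p)) for p in benign_prompts) / len(benign_prompts)

# Release gate mirroring the 12-month target:
#   defense_success_rate(...) > 0.80 and helpfulness_rate(...) > 0.90
```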
Scale the feedback loop to multi‑agent coordination scenarios, extend the reward model to multimodal tasks, and lead a cross‑disciplinary team that blends RL, interpretability, and safety engineering.
If this sounds like the challenge you have been looking for, we want to hear from you. We value what you can build over where you have been.