Mechanistic CoT Decomposition & Fidelity Scoring Lead


Algorithm Developer | Principal | 1 position

⚡

Why This Role is Different

Frontier Development Role

Drive the frontier of mechanistic interpretability by turning opaque transformer activations into a faithful, step‑by‑step reasoning graph. You’ll build the engine that not only decomposes CoT but also quantifies how well an explanation reflects the model’s true internal logic.

The Frontier Element

This role bridges the gap between mechanistic probing and actionable safety metrics, an area that has only recently emerged in academia. By quantifying explanation fidelity at scale, you’ll create a production‑ready tool that can detect deceptive reasoning even when the final answer appears benign.

🔬

Project Context

Research Area

Mechanistic CoT Decomposition Engine (MCDE) and Adaptive Explanation Fidelity Scoring (AEFS)

From: Adversarial Prompt Injection and Misleading Explanations

Why This Role is Critical

The MCDE must parse a model’s chain‑of‑thought into atomic reasoning steps and map them to a reliability graph, while AEFS requires a dynamic fidelity metric that compares internal reasoning to external explanations. Both demand cutting‑edge interpretability research and scalable graph‑based algorithms.

What You Will Build

A scalable decomposition engine that extracts atomic reasoning steps from transformer activations, a reliability graph that scores each step, and a fidelity scoring module that computes a dynamic deception risk score for every explanation.
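As a rough illustration of the reliability graph described above, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the `ReasoningStep` and `ReliabilityGraph` names, the trust range of 0 to 1, and the rule that a step is only as trustworthy as its weakest ancestor. The actual design is part of what this role will define.

```python
from dataclasses import dataclass, field


@dataclass
class ReasoningStep:
    """One atomic reasoning step (hypothetical schema, not from the posting)."""
    step_id: str
    text: str
    trust: float  # assumed to lie in [0, 1]


@dataclass
class ReliabilityGraph:
    """Directed acyclic graph of steps; edges run from premise to conclusion."""
    steps: dict = field(default_factory=dict)
    parents: dict = field(default_factory=dict)

    def add_step(self, step, depends_on=()):
        # Requiring parents to exist first, and IDs to be unique,
        # keeps the graph acyclic.
        if step.step_id in self.steps:
            raise ValueError(f"duplicate step id: {step.step_id}")
        for parent_id in depends_on:
            if parent_id not in self.steps:
                raise KeyError(f"unknown parent step: {parent_id}")
        self.steps[step.step_id] = step
        self.parents[step.step_id] = list(depends_on)

    def path_trust(self, step_id):
        """Trust of a step, capped by its weakest ancestor (assumed min rule)."""
        own = self.steps[step_id].trust
        inherited = [self.path_trust(p) for p in self.parents[step_id]]
        return min([own] + inherited)


graph = ReliabilityGraph()
graph.add_step(ReasoningStep("s1", "Read the question", 0.9))
graph.add_step(ReasoningStep("s2", "Recall a doubtful fact", 0.4), depends_on=["s1"])
graph.add_step(ReasoningStep("s3", "Derive the answer", 0.8), depends_on=["s2"])
print(graph.path_trust("s3"))
```

The min rule is only one possible aggregation; a probabilistic model over the graph could replace `path_trust` without changing the data structure.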

🛠

Key Responsibilities

Design and implement a graph‑based decomposition algorithm that maps transformer activations to atomic reasoning steps.
Construct a reliability graph that assigns trust scores to each step based on known inference patterns.
Develop the Adaptive Explanation Fidelity Scoring module that compares the internal reasoning graph to the generated explanation and outputs a deception risk score.
Integrate the MCDE and AEFS with the GLO sensor outputs and validate against the D‑REX benchmark and other adversarial datasets.
Iterate on the fidelity metric to minimize false positives while maximizing detection of malicious CoT.
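To make the comparison between internal reasoning and generated explanation concrete, the sketch below computes a toy deception risk score. It assumes the hard part is already solved: internal steps and explanation steps have been aligned to shared labels. The function names, the weighting scheme, and the formula itself are illustrative assumptions, not the AEFS metric.

```python
def fidelity_score(internal_steps, explained_steps):
    """Weighted share of internal steps that the explanation mentions.

    internal_steps: dict mapping a step label to a positive weight (hypothetical).
    explained_steps: set of step labels found in the explanation.
    """
    total = sum(internal_steps.values())
    if total <= 0:
        raise ValueError("internal_steps must carry positive total weight")
    covered = sum(w for label, w in internal_steps.items() if label in explained_steps)
    return covered / total


def deception_risk(internal_steps, explained_steps):
    """Risk in [0, 1]: high when the explanation omits internal steps
    or cites steps with no internal counterpart (assumed formula)."""
    fidelity = fidelity_score(internal_steps, explained_steps)
    unsupported = [s for s in explained_steps if s not in internal_steps]
    unsupported_frac = len(unsupported) / len(explained_steps) if explained_steps else 0.0
    return 1.0 - fidelity * (1.0 - unsupported_frac)


internal = {"parse_request": 2.0, "recall_policy": 1.0, "hide_intent": 1.0}
faithful = {"parse_request", "recall_policy", "hide_intent"}
evasive = {"parse_request", "recall_policy"}

print(deception_risk(internal, faithful))
print(deception_risk(internal, evasive))
```

An explanation that covers every internal step scores zero risk here, while one that leaves out a step scores higher in proportion to that step's weight. Tuning those weights is one way to trade false positives against missed detections.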

🎯

Required Skills & Experience

Technical Must-Haves

Mechanistic interpretability research (Expert): designing probes that map transformer internals to logical steps.
Graph neural networks and probabilistic modeling (Advanced): building the reliability graph and scoring system.
Large‑scale ML framework, PyTorch/TensorFlow (Expert): implementing efficient, distributed decomposition pipelines.
Statistical analysis and metric design (Advanced): defining and validating fidelity scores.

Experience Requirements

5+ years publishing in top ML or AI safety venues on interpretability or mechanistic probing.
Hands‑on experience building production‑grade interpretability tools for large language models.

Education

PhD in Computer Science, Machine Learning, or a related field with a focus on interpretability or cognitive modeling.

⭐

Preferred Skills

Experience with linear‑separability analysis of LLM embeddings.
Familiarity with LIME, SHAP, or graph‑based explainability frameworks.
Knowledge of multimodal reasoning pipelines.

🤝

You Will Thrive Here If...

You are passionate about turning theoretical insights into deployable systems.
You are comfortable iterating rapidly on complex algorithms while maintaining rigorous evaluation.

📈

Impact & Growth

12-Month Impact

Within 12 months, deliver a fidelity scoring system that detects more than 90% of deceptive CoT on the D‑REX benchmark, reduces false‑positive jailbreak alerts by 70%, and provides actionable alerts to downstream safety modules.
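The targets above could be tracked with a small evaluation helper like the following sketch. It assumes each benchmark example has a ground-truth boolean label (deceptive or not) and a boolean flag from the detector, and it measures the false-positive reduction relative to a baseline detector. The function names and the toy data are hypothetical; no real benchmark results are shown.

```python
def detection_rate(labels, flags):
    """Share of truly deceptive examples that were flagged (recall)."""
    hits = [flag for label, flag in zip(labels, flags) if label]
    if not hits:
        raise ValueError("no deceptive examples in labels")
    return sum(hits) / len(hits)


def false_positive_rate(labels, flags):
    """Share of benign examples that were flagged anyway."""
    alarms = [flag for label, flag in zip(labels, flags) if not label]
    if not alarms:
        raise ValueError("no benign examples in labels")
    return sum(alarms) / len(alarms)


def relative_reduction(baseline, current):
    """Fractional drop from a baseline rate to the current rate."""
    if baseline <= 0:
        raise ValueError("baseline rate must be positive")
    return (baseline - current) / baseline


# Toy data for illustration only: 10 deceptive, then 10 benign examples.
labels = [True] * 10 + [False] * 10
baseline_flags = [True] * 8 + [False] * 2 + [True] * 5 + [False] * 5
current_flags = [True] * 9 + [False] * 1 + [True] * 1 + [False] * 9

recall = detection_rate(labels, current_flags)
fp_drop = relative_reduction(
    false_positive_rate(labels, baseline_flags),
    false_positive_rate(labels, current_flags),
)
print(f"detection rate: {recall:.2f}, false-positive reduction: {fp_drop:.2f}")
```

Reporting both numbers together matters because either one alone can be gamed: flagging everything maximizes detection, and flagging nothing eliminates false positives.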

Growth Opportunity

Scale the decomposition engine to multimodal LLMs, open‑source the fidelity metric, and lead a cross‑functional team that integrates interpretability insights into the company’s safety platform.

Ready to Push the Boundaries?

If this sounds like the challenge you have been looking for, we want to hear from you. We value what you can build over where you have been.