Push the frontier of mechanistic interpretability by turning opaque transformer activations into a faithful, step‑by‑step reasoning graph. You’ll build the engine that not only decomposes chain‑of‑thought (CoT) reasoning but also quantifies how well an explanation reflects the model’s true internal logic.
This role bridges the gap between mechanistic probing and actionable safety metrics, an area that has only recently emerged in academic research. By quantifying explanation fidelity at scale, you’ll create the first production‑ready tool that can detect deceptive reasoning even when the final answer appears benign.
Mechanistic CoT Decomposition Engine (MCDE) and Adaptive Explanation Fidelity Scoring (AEFS)
From: Adversarial Prompt Injection and Misleading Explanations
The MCDE must parse a model’s chain‑of‑thought into atomic reasoning steps and map them to a reliability graph, while AEFS requires a dynamic fidelity metric that compares internal reasoning to external explanations. Both demand cutting‑edge interpretability research and scalable graph‑based algorithms.
A scalable decomposition engine that extracts atomic reasoning steps from transformer activations, a reliability graph that scores each step, and a fidelity scoring module that computes a dynamic deception risk score for every explanation.
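To make these three components concrete, here is a minimal sketch of how they might fit together. Everything in it is an illustrative assumption rather than the actual MCDE or AEFS design: the `Step` schema, the "weakest ancestor" rule for propagating reliability, and the reliability‑weighted coverage formula for deception risk are all hypothetical, and the local reliability scores are simply assumed to come from some upstream probe.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Step:
    """One atomic reasoning step (hypothetical schema, not from the posting)."""
    step_id: str
    reliability: float  # local score in [0, 1]; assumed to come from a probe
    parents: list[str] = field(default_factory=list)  # steps this one depends on


def effective_reliability(graph: dict[str, Step], step_id: str,
                          _seen: frozenset[str] = frozenset()) -> float:
    """Assumed rule: a step is only as reliable as its weakest ancestor."""
    if step_id in _seen:
        raise ValueError(f"dependency cycle at {step_id!r}")
    step = graph[step_id]
    scores = [step.reliability]
    for parent_id in step.parents:
        scores.append(effective_reliability(graph, parent_id, _seen | {step_id}))
    return min(scores)


def deception_risk(graph: dict[str, Step], explained_ids: set[str]) -> float:
    """Assumed metric: 1 minus the reliability-weighted share of internal
    steps that the external explanation actually mentions."""
    weights = {sid: effective_reliability(graph, sid) for sid in graph}
    total = sum(weights.values())
    if total == 0:
        return 1.0  # nothing reliable was recovered; treat as maximally risky
    covered = sum(w for sid, w in weights.items() if sid in explained_ids)
    return 1.0 - covered / total


# Toy chain s1 -> s2 -> s3, where the explanation omits the last step.
graph = {
    "s1": Step("s1", 0.9),
    "s2": Step("s2", 0.8, parents=["s1"]),
    "s3": Step("s3", 0.5, parents=["s2"]),
}
print(deception_risk(graph, explained_ids={"s1", "s2"}))
```

In this toy setup the risk rises when the explanation leaves out steps that were recovered with high reliability. A production metric would likely need calibrated probes, semantic matching between internal steps and explanation text, and a more careful treatment of graph structure.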
PhD in Computer Science, Machine Learning, or a related field with a focus on interpretability or cognitive modeling.
Within 12 months, deliver a fidelity scoring system that detects more than 90% of deceptive CoT traces on the D‑REX benchmark, reduces false‑positive jailbreak detections by 70%, and provides actionable alerts to downstream safety modules.
Scale the decomposition engine to multimodal LLMs, open‑source the fidelity metric, and lead a cross‑functional team that integrates interpretability insights into the company’s safety platform.
If this sounds like the challenge you have been looking for, we want to hear from you. We value what you can build over where you have been.