Research Overview
My research spans four areas connected by a common thread: understanding what’s happening inside models, fixing what’s wrong, and building the tools and standards that make AI systems trustworthy and empowering. It has been published at top venues including NeurIPS, ICML, ICLR, ACL, EMNLP, and FAccT.
Interpretability—We build systems that make consequential decisions, but we often can’t explain how or why they made them. I work on changing that—figuring out which parts of a model drive a decision, what it’s actually doing when it fails, and what to do when something is wrong.
Safety & Alignment—Seeing inside a model could help catch what testing alone can miss. A model can pass existing evaluations and still behave differently in practice—retaining behaviours we thought we’d removed, or producing confident answers that aren’t grounded in what the model actually processed. I study how that gap emerges and how to close it.
Technical Governance—Understanding models matters more when it’s actionable. I work on translating what we find inside models into structured methods that regulators and auditors can use to evaluate safety claims, verify that interventions worked, and hold those responsible accountable.
Societal Impact—As AI systems become more capable, there’s a risk that humans gradually lose agency—over decisions, institutions, and the systems that shape society. I study how these dynamics emerge and what technical and institutional interventions can keep humans in charge.
For more details, see my Research Agenda and recent papers.