Selected publications by research direction. Full list on Google Scholar.
Interpretability
2025
- Preprint
- ICLR: In International Conference on Learning Representations (ICLR), 2025
2024
- NeurIPS: Interpreting Learned Feedback Patterns in Large Language Models. In Advances in Neural Information Processing Systems (NeurIPS), 2024
Safety & Alignment
2025
- Preprint
- ICML: PoisonBench: Assessing Large Language Model Vulnerability to Data Poisoning. In International Conference on Machine Learning (ICML), 2025
2024
- ACL: In Annual Meeting of the Association for Computational Linguistics (ACL), 2024
Technical Governance
Societal Impact
2026
- Preprint
2025
- Preprint
- Preprint