Selected publications by research direction. Full list on Google Scholar.

Interpretability

2025

  1. Preprint
    Fazl Barez, Tung-Yu Wu, Iván Arcuschin, Michael Lan, Vincent Wang, Noah Siegel, Nicolas Collignon, Clement Neo, Isabelle Lee, Alasdair Paren, Adel Bibi, Robert Trager, Damiano Fornasiere, John Yan, Yanai Elazar, and Yoshua Bengio
    Under review, 2025
  2. ICLR
    Clement Neo, Luke Ong, Philip Torr, Mor Geva, David Krueger, and Fazl Barez
    In International Conference on Learning Representations (ICLR), 2025

2024

  1. NeurIPS
    Interpreting Learned Feedback Patterns in Large Language Models
    Luke Marks, Amir Abdullah, Clement Neo, Rauno Arike, David Krueger, Philip Torr, and Fazl Barez
    In Advances in Neural Information Processing Systems (NeurIPS), 2024

Safety & Alignment

2025

  1. Preprint
    Fazl Barez, Tingchen Fu, Ameya Prabhu, Stephen Casper, Amartya Sanyal, Adel Bibi, Aidan O’Gara, Robert Kirk, Ben Bucknall, Tim Fist, Luke Ong, Philip Torr, Kwok-Yan Lam, Robert Trager, David Krueger, Sören Mindermann, José Hernandez-Orallo, Mor Geva, and Yarin Gal
    Under review, 2025
  2. ICML
    PoisonBench: Assessing Large Language Model Vulnerability to Data Poisoning
    Tingchen Fu, Mrinank Sharma, Philip Torr, Shay B. Cohen, David Krueger, and Fazl Barez
    In International Conference on Machine Learning (ICML), 2025

2024

  1. ACL
    Michelle Lo, Shay B. Cohen, and Fazl Barez
    In Annual Meeting of the Association for Computational Linguistics (ACL), 2024

Technical Governance

2026

  1. Report
    Automated Interpretability-Driven Model Auditing and Control: A Research Agenda
    Fazl Barez
    AI Governance Initiative, Oxford Martin School, University of Oxford, 2026

2024

  1. SSRN
    Safeguarding AI in Finance: Lessons for Regulated Industries
    Fazl Barez and Luke Marks
    SSRN Working Paper 4937924, 2024

Societal Impact

2026

  1. Preprint
    Jakaria Sania, Marta Ziosi, and Fazl Barez
    Under review, 2026

2025

  1. Preprint
    Toward Resisting AI-Enabled Authoritarianism
    Fazl Barez, Isaac Friend, Keir Reid, Igor Krawczuk, Vincent Wang, Jakob Mökander, Philip Torr, Julia Morse, and Robert Trager
    Under review, 2025
  2. Preprint
    Aman Gupta, Daniel O’Shea, and Fazl Barez
    Under review, 2025