Publications | Fazl Barez

Selected publications by research direction. Full list on Google Scholar.

Interpretability

2025

Preprint

Chain-of-Thought Is Not Explainability

Fazl Barez, Tung-Yu Wu, Iván Arcuschin, Michael Lan, Vincent Wang, Noah Siegel, Nicolas Collignon, Clement Neo, Isabelle Lee, Alasdair Paren, Adel Bibi, Robert Trager, Damiano Fornasiere, John Yan, Yanai Elazar, and Yoshua Bengio

Under review, 2025

BibTeX

@article{barez2025cotnotexplainability,
  category = {Interpretability},
  title = {Chain-of-Thought Is Not Explainability},
  author = {Barez, Fazl and Wu, Tung-Yu and Arcuschin, Iv{\'a}n and Lan, Michael and Wang, Vincent and Siegel, Noah and Collignon, Nicolas and Neo, Clement and Lee, Isabelle and Paren, Alasdair and Bibi, Adel and Trager, Robert and Fornasiere, Damiano and Yan, John and Elazar, Yanai and Bengio, Yoshua},
  journal = {Under review},
  year = {2025},
}

ICLR

Towards Interpreting Visual Information Processing in Vision-Language Models

Clement Neo, Luke Ong, Philip Torr, Mor Geva, David Krueger, and Fazl Barez

In International Conference on Learning Representations (ICLR), 2025

arXiv BibTeX

@inproceedings{neo2025vlm,
  category = {Interpretability},
  title = {Towards Interpreting Visual Information Processing in Vision-Language Models},
  author = {Neo, Clement and Ong, Luke and Torr, Philip and Geva, Mor and Krueger, David and Barez, Fazl},
  booktitle = {International Conference on Learning Representations (ICLR)},
  year = {2025},
}

2024

NeurIPS

Interpreting Learned Feedback Patterns in Large Language Models

Luke Marks, Amir Abdullah, Clement Neo, Rauno Arike, David Krueger, Philip Torr, and Fazl Barez

In Advances in Neural Information Processing Systems (NeurIPS), 2024

BibTeX

@inproceedings{marks2024ilfp,
  category = {Interpretability},
  title = {Interpreting Learned Feedback Patterns in Large Language Models},
  author = {Marks, Luke and Abdullah, Amir and Neo, Clement and Arike, Rauno and Krueger, David and Torr, Philip and Barez, Fazl},
  booktitle = {Advances in Neural Information Processing Systems (NeurIPS)},
  year = {2024},
}

Safety & Alignment

2025

Preprint

Open Problems in Machine Unlearning for AI Safety

Fazl Barez, Tingchen Fu, Ameya Prabhu, Stephen Casper, Amartya Sanyal, Adel Bibi, Aidan O’Gara, Robert Kirk, Ben Bucknall, Tim Fist, Luke Ong, Philip Torr, Kwok-Yan Lam, Robert Trager, David Krueger, Sören Mindermann, José Hernandez-Orallo, Mor Geva, and Yarin Gal

Under review, 2025

arXiv BibTeX

@article{barez2025unlearning,
  category = {Safety},
  title = {Open Problems in Machine Unlearning for AI Safety},
  author = {Barez, Fazl and Fu, Tingchen and Prabhu, Ameya and Casper, Stephen and Sanyal, Amartya and Bibi, Adel and O'Gara, Aidan and Kirk, Robert and Bucknall, Ben and Fist, Tim and Ong, Luke and Torr, Philip and Lam, Kwok-Yan and Trager, Robert and Krueger, David and Mindermann, S{\"o}ren and Hernandez-Orallo, Jos{\'e} and Geva, Mor and Gal, Yarin},
  journal = {Under review},
  year = {2025},
}

ICML

PoisonBench: Assessing Large Language Model Vulnerability to Data Poisoning

Tingchen Fu, Mrinank Sharma, Philip Torr, Shay B. Cohen, David Krueger, and Fazl Barez

In International Conference on Machine Learning (ICML), 2025

BibTeX

@inproceedings{fu2025poisonbench,
  category = {Safety},
  title = {PoisonBench: Assessing Large Language Model Vulnerability to Data Poisoning},
  author = {Fu, Tingchen and Sharma, Mrinank and Torr, Philip and Cohen, Shay B. and Krueger, David and Barez, Fazl},
  booktitle = {International Conference on Machine Learning (ICML)},
  year = {2025},
}

2024

ACL

Large Language Models Relearn Removed Concepts

Michelle Lo, Shay B. Cohen, and Fazl Barez

In Annual Meeting of the Association for Computational Linguistics (ACL), 2024

arXiv BibTeX

@inproceedings{lo2024relearn,
  category = {Safety},
  title = {Large Language Models Relearn Removed Concepts},
  author = {Lo, Michelle and Cohen, Shay B. and Barez, Fazl},
  booktitle = {Annual Meeting of the Association for Computational Linguistics (ACL)},
  year = {2024},
}

Technical Governance

2026

Report

Automated Interpretability-Driven Model Auditing and Control: A Research Agenda

Fazl Barez

AI Governance Initiative, Oxford Martin School, University of Oxford, 2026

BibTeX

@techreport{barez2026auditingagenda,
  category = {Governance},
  title = {Automated Interpretability-Driven Model Auditing and Control: A Research Agenda},
  author = {Barez, Fazl},
  institution = {AI Governance Initiative, Oxford Martin School, University of Oxford},
  year = {2026},
}

2024

SSRN

Safeguarding AI in Finance: Lessons for Regulated Industries

Fazl Barez and Luke Marks

SSRN Working Paper 4937924, 2024

BibTeX

@article{barez2024safeguardingfinance,
  category = {Governance},
  title = {Safeguarding AI in Finance: Lessons for Regulated Industries},
  author = {Barez, Fazl and Marks, Luke},
  journal = {SSRN Working Paper 4937924},
  year = {2024},
}

Societal Impact

2026

Preprint

From Democracies to Autocracies: How AI Systems Enable Authoritarianism by Design

Jakaria Sania, Marta Ziosi, and Fazl Barez

Under review, 2026

arXiv BibTeX

@article{sania2026autocracies,
  category = {Society},
  title = {From Democracies to Autocracies: How AI Systems Enable Authoritarianism by Design},
  author = {Sania, Jakaria and Ziosi, Marta and Barez, Fazl},
  journal = {Under review},
  year = {2026},
}

2025

Preprint

Toward Resisting AI-Enabled Authoritarianism

Fazl Barez, Isaac Friend, Keir Reid, Igor Krawczuk, Vincent Wang, Jakob Mökander, Philip Torr, Julia Morse, and Robert Trager

Under review, 2025

BibTeX

@article{barez2025authoritarianism,
  category = {Society},
  title = {Toward Resisting AI-Enabled Authoritarianism},
  author = {Barez, Fazl and Friend, Isaac and Reid, Keir and Krawczuk, Igor and Wang, Vincent and M{\"o}kander, Jakob and Torr, Philip and Morse, Julia and Trager, Robert},
  journal = {Under review},
  year = {2025},
}

Preprint

VAL-Bench: Measuring Value Alignment in Language Models

Aman Gupta, Daniel O’Shea, and Fazl Barez

Under review, 2025

arXiv BibTeX

@article{gupta2025valbench,
  category = {Society},
  title = {VAL-Bench: Measuring Value Alignment in Language Models},
  author = {Gupta, Aman and O'Shea, Daniel and Barez, Fazl},
  journal = {Under review},
  year = {2025},
}