I’m interested in building trustworthy, interpretable, and steerable machine learning systems.
I also run ML Safety Daily.
We introduce ProxyBench, a benchmark for evaluating the robustness of reward models in language models. We find that common techniques such as adversarial training yield only limited robustness gains, motivating further work on robustness techniques.