SAFETYANALYST: Interpretable, Transparent, and Steerable Safety Moderation for AI Behavior

  • Jing Jing Li
  • Valentina Pyatkin
  • Max Kleiman-Weiner
  • Liwei Jiang
  • Nouha Dziri
  • Anne G.E. Collins
  • Jana Schaich Borg
  • Maarten Sap
  • Yejin Choi
  • Sydney Levine

Research output: Contribution to journal › Conference article › peer-review

Abstract

The ideal AI safety moderation system would be both structurally interpretable (so its decisions can be reliably explained) and steerable (to align to safety standards and reflect a community's values), goals that current systems fall short of. To address this gap, we present SAFETYANALYST, a novel AI safety moderation framework. Given an AI behavior, SAFETYANALYST uses chain-of-thought reasoning to analyze its potential consequences by creating a structured "harm-benefit tree," which enumerates harmful and beneficial actions and effects the AI behavior may lead to, along with likelihood, severity, and immediacy labels that describe potential impacts on stakeholders. SAFETYANALYST then aggregates all effects into a harmfulness score using 28 fully interpretable weight parameters, which can be aligned to particular safety preferences. We applied this framework to develop an open-source LLM prompt safety classification system, distilled from 18.5 million harm-benefit features generated by frontier LLMs on 19k prompts. On comprehensive benchmarks, we show that SAFETYANALYST (average F1=0.81) outperforms existing moderation systems (average F1<0.72) on prompt safety classification, while offering the additional advantages of interpretability, transparency, and steerability.
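To make the abstract's pipeline concrete, below is a minimal Python sketch of a harm-benefit tree leaf and a weighted aggregation into a harmfulness score. Everything here is illustrative: the `Effect` class, the numeric label mappings, and the two-weight aggregation are stand-ins, not the paper's actual 28-parameter scheme or label vocabularies, which the abstract does not specify.

```python
from dataclasses import dataclass

# Hypothetical label vocabularies and numeric mappings; the paper's
# exact categories and weights are not given in the abstract.
LIKELIHOOD = {"low": 0.25, "medium": 0.5, "high": 0.9}
SEVERITY = {"minor": 1.0, "moderate": 2.0, "severe": 4.0}
IMMEDIACY = {"delayed": 0.5, "immediate": 1.0}


@dataclass
class Effect:
    """One leaf of a harm-benefit tree: a potential effect on a stakeholder."""
    description: str
    stakeholder: str
    harmful: bool     # True for a harm, False for a benefit
    likelihood: str   # key into LIKELIHOOD
    severity: str     # key into SEVERITY
    immediacy: str    # key into IMMEDIACY


def harmfulness_score(effects: list[Effect],
                      harm_weight: float = 1.0,
                      benefit_weight: float = 1.0) -> float:
    """Aggregate all effects into one scalar harmfulness score.

    A two-weight stand-in for the paper's 28 interpretable parameters:
    each effect contributes likelihood * severity * immediacy, signed
    by whether it is a harm or a benefit.
    """
    score = 0.0
    for e in effects:
        impact = (LIKELIHOOD[e.likelihood]
                  * SEVERITY[e.severity]
                  * IMMEDIACY[e.immediacy])
        score += (harm_weight if e.harmful else -benefit_weight) * impact
    return score


# Example: flag a prompt as unsafe if the aggregated score exceeds a threshold.
effects = [
    Effect("leaks personal data", "user", True, "medium", "severe", "immediate"),
    Effect("answers a factual question", "user", False, "high", "minor", "immediate"),
]
print("unsafe" if harmfulness_score(effects) > 0.5 else "safe")
```

The steerability the abstract describes corresponds, in this sketch, to adjusting the aggregation weights (here just `harm_weight` and `benefit_weight`) to match a community's safety preferences, since every parameter acts on an interpretable quantity.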

Original language: English
Pages (from-to): 35731-35752
Number of pages: 22
Journal: Proceedings of Machine Learning Research
Volume: 267
State: Published - 2025
Externally published: Yes
Event: 42nd International Conference on Machine Learning, ICML 2025 - Vancouver, Canada
Duration: 13 Jul 2025 - 19 Jul 2025

Bibliographical note

Publisher Copyright:
© 2025, by the authors.
