Diverging Preferences: When do Annotators Disagree and do Models Know?

  • Michael J.Q. Zhang
  • Zhilin Wang
  • Jena D. Hwang
  • Yi Dong
  • Olivier Delalleau
  • Yejin Choi
  • Eunsol Choi
  • Xiang Ren
  • Valentina Pyatkin

Research output: Contribution to journal › Conference article › peer-review

Abstract

We examine diverging preferences in human-labeled preference datasets. We develop a taxonomy of disagreement sources spanning ten categories across four high-level classes and find that the majority of disagreements stem from factors such as task underspecification or response style. These findings challenge a standard assumption in reward modeling methods: that annotator disagreement can be attributed to simple noise. We then explore how these findings impact two areas of LLM development: reward model training and evaluation. Our experiments demonstrate how standard reward modeling approaches (e.g., Bradley-Terry) and LLM-as-Judge evaluation methods fail to account for divergence between annotators. These findings highlight challenges both in LLM evaluation, which is strongly influenced by divisive features such as response style, and in developing pluralistically aligned LLMs. To address these issues, we develop methods for identifying diverging preferences and mitigating their influence in evaluation and during LLM training.
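For context on the Bradley-Terry objective mentioned above: a minimal sketch of the standard pairwise reward-modeling loss, written in our own notation rather than taken from the paper, is

\[
\mathcal{L}_{\mathrm{BT}}(\theta) \;=\; -\,\mathbb{E}_{(x,\, y_w,\, y_l)}\!\left[ \log \sigma\!\big( r_\theta(x, y_w) - r_\theta(x, y_l) \big) \right],
\]

where \(x\) is a prompt, \(y_w\) and \(y_l\) are the responses an annotator preferred and rejected, \(r_\theta\) is the learned reward model, and \(\sigma\) is the logistic function. Because this loss admits exactly one winning response per pair, examples on which annotators genuinely diverge can only be fit by treating the minority judgment as label noise, which is precisely the assumption the abstract challenges.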

Original language: English
Pages (from-to): 76193-76212
Number of pages: 20
Journal: Proceedings of Machine Learning Research
Volume: 267
State: Published - 2025
Externally published: Yes
Event: 42nd International Conference on Machine Learning, ICML 2025 - Vancouver, Canada
Duration: 13 Jul 2025 - 19 Jul 2025

Bibliographical note

Publisher Copyright:
© 2025 by the author(s).
