TY - JOUR
T1 - The Art of Saying No
T2 - 38th Conference on Neural Information Processing Systems, NeurIPS 2024
AU - Brahman, Faeze
AU - Kumar, Sachin
AU - Balachandran, Vidhisha
AU - Dasigi, Pradeep
AU - Pyatkin, Valentina
AU - Ravichander, Abhilasha
AU - Wiegreffe, Sarah
AU - Dziri, Nouha
AU - Chandu, Khyathi
AU - Hessel, Jack
AU - Tsvetkov, Yulia
AU - Smith, Noah A.
AU - Choi, Yejin
AU - Hajishirzi, Hannaneh
N1 - Publisher Copyright:
© 2024 Neural information processing systems foundation. All rights reserved.
PY - 2024
Y1 - 2024
N2 - Chat-based language models are designed to be helpful, yet they should not comply with every user request. While most existing work primarily focuses on refusal of “unsafe” queries, we posit that the scope of noncompliance should be broadened. We introduce a comprehensive taxonomy of contextual noncompliance describing when and how models should not comply with user requests. Our taxonomy spans a wide range of categories including incomplete, unsupported, indeterminate, and humanizing requests (in addition to unsafe requests). To test noncompliance capabilities of language models, we use this taxonomy to develop a new evaluation suite of 1000 noncompliance prompts. We find that most existing models show significantly high compliance rates in certain previously understudied categories with models like GPT-4 incorrectly complying with as many as 30% of requests. To address these gaps, we explore different training strategies using a synthetically-generated training set of requests and expected noncompliant responses. Our experiments demonstrate that while direct finetuning of instruction-tuned models can lead to both over-refusal and a decline in general capabilities, using parameter efficient methods like low rank adapters helps to strike a good balance between appropriate noncompliance and other capabilities.
AB - Chat-based language models are designed to be helpful, yet they should not comply with every user request. While most existing work primarily focuses on refusal of “unsafe” queries, we posit that the scope of noncompliance should be broadened. We introduce a comprehensive taxonomy of contextual noncompliance describing when and how models should not comply with user requests. Our taxonomy spans a wide range of categories including incomplete, unsupported, indeterminate, and humanizing requests (in addition to unsafe requests). To test noncompliance capabilities of language models, we use this taxonomy to develop a new evaluation suite of 1000 noncompliance prompts. We find that most existing models show significantly high compliance rates in certain previously understudied categories with models like GPT-4 incorrectly complying with as many as 30% of requests. To address these gaps, we explore different training strategies using a synthetically-generated training set of requests and expected noncompliant responses. Our experiments demonstrate that while direct finetuning of instruction-tuned models can lead to both over-refusal and a decline in general capabilities, using parameter efficient methods like low rank adapters helps to strike a good balance between appropriate noncompliance and other capabilities.
UR - https://www.scopus.com/pages/publications/105000548783
M3 - ???researchoutput.researchoutputtypes.contributiontojournal.conferencearticle???
AN - SCOPUS:105000548783
SN - 1049-5258
VL - 37
JO - Advances in Neural Information Processing Systems
JF - Advances in Neural Information Processing Systems
Y2 - 9 December 2024 through 15 December 2024
ER -