TY - JOUR
T1 - A controlled trial examining large Language model conformity in psychiatric assessment using the Asch paradigm
AU - Shoval, Dorit Hadar
AU - Gigi, Karny
AU - Haber, Yuval
AU - Itzhaki, Amir
AU - Asraf, Kfir
AU - Piterman, David
AU - Elyoseph, Zohar
N1 - Publisher Copyright:
© The Author(s) 2025.
PY - 2025/5/12
Y1 - 2025/5/12
N2 - Background: Despite significant advances in AI-driven medical diagnostics, the integration of large language models (LLMs) into psychiatric practice presents unique challenges. While LLMs demonstrate high accuracy in controlled settings, their performance in collaborative clinical environments remains unclear. This study examined whether LLMs exhibit conformity behavior under social pressure across different diagnostic certainty levels, with a particular focus on psychiatric assessment. Methods: Using an adapted Asch paradigm, we conducted a controlled trial examining GPT-4o’s performance across three domains representing increasing levels of diagnostic uncertainty: circle similarity judgments (high certainty), brain tumor identification (intermediate certainty), and psychiatric assessment using children’s drawings (high uncertainty). The study employed a 3 × 3 factorial design with three pressure conditions: no pressure, full pressure (five consecutive incorrect peer responses), and partial pressure (mixed correct and incorrect peer responses). We conducted 10 trials per condition combination (90 total observations), using standardized prompts and multiple-choice responses. The binomial test and chi-square analyses assessed performance differences across conditions. Results: Under no pressure, GPT-4o achieved 100% accuracy across all domains. Under full pressure, accuracy declined systematically with increasing diagnostic uncertainty: 50% in circle recognition, 40% in tumor identification, and 0% in psychiatric assessment. Partial pressure showed a similar pattern, with maintained accuracy in basic tasks (80% in circle recognition, 100% in tumor identification) but complete failure in psychiatric assessment (0%). All differences between no pressure and pressure conditions were statistically significant (P <.05), with the most severe effects observed in psychiatric assessment (χ²₁=16.20, P <.001). Conclusions: This study reveals that LLMs exhibit conformity patterns that intensify with diagnostic uncertainty, culminating in complete performance failure in psychiatric assessment under social pressure. These findings suggest that successful implementation of AI in psychiatry requires careful consideration of social dynamics and the inherent uncertainty in psychiatric diagnosis. Future research should validate these findings across different AI systems and diagnostic tools while developing strategies to maintain AI independence in clinical settings. Trial registration: Not applicable.
AB - Background: Despite significant advances in AI-driven medical diagnostics, the integration of large language models (LLMs) into psychiatric practice presents unique challenges. While LLMs demonstrate high accuracy in controlled settings, their performance in collaborative clinical environments remains unclear. This study examined whether LLMs exhibit conformity behavior under social pressure across different diagnostic certainty levels, with a particular focus on psychiatric assessment. Methods: Using an adapted Asch paradigm, we conducted a controlled trial examining GPT-4o’s performance across three domains representing increasing levels of diagnostic uncertainty: circle similarity judgments (high certainty), brain tumor identification (intermediate certainty), and psychiatric assessment using children’s drawings (high uncertainty). The study employed a 3 × 3 factorial design with three pressure conditions: no pressure, full pressure (five consecutive incorrect peer responses), and partial pressure (mixed correct and incorrect peer responses). We conducted 10 trials per condition combination (90 total observations), using standardized prompts and multiple-choice responses. The binomial test and chi-square analyses assessed performance differences across conditions. Results: Under no pressure, GPT-4o achieved 100% accuracy across all domains. Under full pressure, accuracy declined systematically with increasing diagnostic uncertainty: 50% in circle recognition, 40% in tumor identification, and 0% in psychiatric assessment. Partial pressure showed a similar pattern, with maintained accuracy in basic tasks (80% in circle recognition, 100% in tumor identification) but complete failure in psychiatric assessment (0%). All differences between no pressure and pressure conditions were statistically significant (P <.05), with the most severe effects observed in psychiatric assessment (χ²₁=16.20, P <.001). Conclusions: This study reveals that LLMs exhibit conformity patterns that intensify with diagnostic uncertainty, culminating in complete performance failure in psychiatric assessment under social pressure. These findings suggest that successful implementation of AI in psychiatry requires careful consideration of social dynamics and the inherent uncertainty in psychiatric diagnosis. Future research should validate these findings across different AI systems and diagnostic tools while developing strategies to maintain AI independence in clinical settings. Trial registration: Not applicable.
KW - Clinical conformity
KW - Diagnostic uncertainty
KW - Large language models
KW - Psychiatric assessment
KW - Social influence
UR - https://www.scopus.com/pages/publications/105004746451
U2 - 10.1186/s12888-025-06912-2
DO - 10.1186/s12888-025-06912-2
M3 - ???researchoutput.researchoutputtypes.contributiontojournal.article???
C2 - 40355854
AN - SCOPUS:105004746451
SN - 1471-244X
VL - 25
JO - BMC Psychiatry
JF - BMC Psychiatry
IS - 1
M1 - 478
ER -