Dr. ChatGPT: Correlation Between a Novel AI Chatbot and AUA Male Sexual Dysfunction Guidelines
Leelakrishna Channa, BS1, Ilene Staff, PhD2, Tara McLaughlin, PhD2, Jared Bieniek, MD3.
1University of Connecticut School of Medicine, Farmington, CT, USA, 2Hartford Hospital, Hartford, CT, USA, 3Hartford HealthCare, Hartford, CT, USA.
BACKGROUND: The advent of artificial intelligence (AI) chatbots such as ChatGPT will fundamentally change clinical practice given the convenience and accessibility of medical answers for patients and providers. Despite training on vast amounts of data, however, these models can still generate incorrect responses owing to the inherent limitations of large language models. Medical providers should be aware of these tools and their potential for misinformation. There is an urgent need to validate ChatGPT's responses to questions about common conditions, especially sensitive ones for which patients frequently turn to online sources for answers.

METHODS: Using male sexual dysfunction as a test case, we performed a qualitative study comparing ChatGPT-generated answers to current American Urological Association (AUA) practice guidelines for erectile dysfunction, Peyronie's disease, and disorders of ejaculation. A single query was designed for each guideline statement to elicit the details of that statement. ChatGPT responses were graded against the existing guidelines for accuracy and completeness by a board-certified urologist specializing in andrology. Differences in accuracy and completeness among the conditions and among strength-of-recommendation groups were assessed using chi-square and Fisher-Freeman-Halton tests. SPSS v26 was used for comparative statistics and the Statology online calculator for the one-sample proportion test.

RESULTS: When queried on 73 guideline statements from the three AUA male sexual dysfunction guidelines, ChatGPT answers included inaccurate information in 30% (CI 20-41%) of responses and incomplete information in 36% (CI 25-47%) (Table 1). Each of these frequencies differs significantly from 0 (p<0.0001). There were no significant differences in accuracy (p=0.16) or completeness (p=0.99) among the condition guidelines, and the variability in accuracy (p=0.52) and completeness (p=0.39) across strength-of-recommendation groups was not statistically significant.

CONCLUSIONS: Approximately one-third of ChatGPT responses to male sexual dysfunction guideline-based questions contained inaccurate or incomplete information. AI-powered chatbots are likely the future of immediate-access medical answers, but caution must be exercised as these language models continue to improve.
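As a sanity check on the reported intervals, here is a minimal sketch of the proportion estimates, assuming back-calculated counts of 22/73 inaccurate and 26/73 incomplete responses (the abstract reports only percentages, so these counts are guesses) and an exact Clopper-Pearson 95% confidence interval. The authors used the Statology online calculator, so the exact interval method may differ slightly.

```python
# Sketch of the pooled proportion estimates and 95% CIs, assuming
# back-calculated counts (22/73 and 26/73); the abstract gives only
# percentages, so these counts are illustrative, not the study data.
from scipy.stats import binomtest

n = 73  # guideline statements queried across the three AUA guidelines
for label, k in [("inaccurate", 22), ("incomplete", 26)]:
    res = binomtest(k, n)  # exact binomial test against the default null p=0.5
    ci = res.proportion_ci(confidence_level=0.95)  # Clopper-Pearson interval
    print(f"{label}: {k}/{n} = {k/n:.0%}, 95% CI {ci.low:.0%}-{ci.high:.0%}")
```

Note that a one-sample proportion test against 0 is degenerate (any nonzero count rejects the null), so in practice the confidence interval excluding 0 carries the same message as the reported p<0.0001.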
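Similarly, the between-guideline comparison can be illustrated with a chi-square test on a condition-by-accuracy contingency table. The per-guideline counts below are invented for demonstration (the abstract gives only pooled figures); the Fisher-Freeman-Halton exact variant, appropriate when expected cell counts are small, was run in SPSS per the abstract.

```python
# Hypothetical illustration of the between-guideline accuracy comparison.
# Rows: erectile dysfunction, Peyronie's disease, disorders of ejaculation.
# Columns: accurate, inaccurate. Counts are invented, summing to 73 total
# statements and 22 inaccurate responses to stay consistent with the sketch above.
from scipy.stats import chi2_contingency

table = [[20, 8],
         [16, 7],
         [15, 7]]
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi-square = {chi2:.2f}, dof = {dof}, p = {p:.2f}")
```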