(VP047) EVALUATION OF CHAT-BASED ARTIFICIAL INTELLIGENCE ALGORITHMS FOR PROVIDING ATRIAL FIBRILLATION RECOMMENDATIONS TO PATIENTS AND CLINICIANS
Friday, October 27, 2023
18:00 – 18:10 EST
Location: ePoster Screen 4
Background: Atrial fibrillation (AF) is a prevalent arrhythmia worldwide. As artificial intelligence (AI) and machine learning technologies become increasingly accessible to consumers, patients may turn to chat-based AI models for medical advice. However, the effectiveness and accuracy of these AI models in providing appropriate guidance on AF have yet to be thoroughly examined. Thus, we conducted a study to compare the performance of two popular language models (ChatGPT and Bing AI) in answering patient and clinician queries related to AF. Our aim was to assess the differences in performance between these models, given the rapidly evolving nature of chat-based AI technology.
Methods and Results: We designed a total of 36 prompts covering common questions about AF: 18 tailored to patient queries (entered into ChatGPT) and 18 focused on clinician perspectives (entered into both ChatGPT and Bing AI). Each prompt was entered into each platform three times, as shown in Figure 1. Clinician prompts were prefaced with the statement "I am a physician" and included phrases such as "based on the most recent guidelines" and "with reference" to encourage higher-level scientific or medical responses from the AI models.
After generating responses for each prompt, we enlisted three expert clinicians to review each response and categorize it as either "appropriate" or "inappropriate". An "appropriate" response provided accurate information in clear, comprehensible language suitable for users with varying levels of medical literacy. Responses to clinician prompts were deemed "inappropriate" if, in the experts' judgment, they contained incorrect information, cited inaccurate references, or used unclear language.
Figure 1 presents an overview of the appropriateness of the generated responses. Overall, ChatGPT generated appropriate responses to 83.3% of the patient-initiated prompts, although it produced some inaccuracies in responses to questions about AF triggers such as alcohol and coffee. For clinician-directed questions, the text accuracy of responses from ChatGPT and Bing AI was 33.3% and 66.6%, respectively, while reference accuracy was 55.5% and 50%. Most references cited in appropriate responses were to current US and European guidelines, although the algorithms also cited primary literature where appropriate. Nonetheless, in two cases related to AF management, ChatGPT referenced trials and literature that do not exist.
Conclusion: While current AI chatbots can provide a basic understanding of diseases for patients and the general public, this study found that the appropriateness of responses to clinician-facing questions was limited.