Forte, G., Mauro, D., Raimondi, M., Pantano, I., Gandolfo, S., Cauli, A., et al. (2025). ChatGPT vs rheumatologists: cross-sectional study on accuracy and patient perception of AI-generated information for psoriatic arthritis. Annals of the Rheumatic Diseases. https://doi.org/10.1016/j.ard.2025.11.012
ChatGPT vs rheumatologists: cross-sectional study on accuracy and patient perception of AI-generated information for psoriatic arthritis
Chimenti, Maria Sole
2025-12-12
Abstract
Objectives: Patients with rheumatic diseases frequently turn to online sources for medical information. Large language models, such as ChatGPT, may offer an accessible alternative to conventional patient-education resources; however, their reliability remains underexplored. We conducted an exploratory, descriptive comparison to examine whether ChatGPT-4 might provide responses comparable to those of experts. Methods: Seventy-six psoriatic arthritis (PsA) patients generated 32 questions (296 selections) grouped into 6 themes. Each question was answered by ChatGPT-4 and by 12 Italian PsA specialists (each drafted 2-3 answers). Fourteen clinicians rated the accuracy (1-5 Likert scale) and completeness (1-3 scale) of the AI- and human-generated answers. Interrater reliability was calculated, and mixed-effects ordinal logistic models were used to compare sources. In a separate arm, 67 PsA patients reviewed 16 randomly selected answer pairs and indicated their preference. Readability was assessed. No formal sample size calculation was performed; P values were descriptive and interpreted alongside effect sizes and 95% CIs. Results: Patients most frequently sought information on prognosis/comorbidities (54/76, 71.1%), therapy strategy (48/76, 63.2%), and treatment risks (38/76, 50.0%). Accuracy appeared comparable between ChatGPT and experts, but ChatGPT scored lower in completeness. Accuracy was lower for pregnancy/fertility questions, with no clearly relevant differences in other domains. ChatGPT answers were preferred 491/998 times (49.2%), clinician answers 343/998 times (34.4%), and no preference was expressed 164/998 times (16.4%; P < .001), with a relative preference for ChatGPT responses on prognosis and therapy. ChatGPT responses were, on average, more readable across indices. Conclusions: In this exploratory study, ChatGPT-4 appeared able to generate accurate and readable responses to PsA-related questions and was often preferred by patients.
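
The abstract names several quantitative steps without detail: readability indices, interrater reliability, and mixed-effects ordinal logistic comparisons. As an illustration only, the Python sketch below shows how two of those steps (a readability index and a chance-corrected agreement statistic) could be computed with common libraries; the sample answer text, the toy ratings matrix, and the choice of textstat and statsmodels are assumptions for demonstration, not the authors' actual analysis pipeline. The mixed-effects ordinal models described in the Methods would typically require a cumulative link mixed model fitted with dedicated software (for example, the R ordinal package) and are not reproduced here.

# Hypothetical sketch (not the study's analysis code): readability of an
# answer and interrater agreement on ordinal accuracy ratings.
import numpy as np
import textstat
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Readability of a single answer (higher Flesch Reading Ease = easier text).
answer = (
    "Psoriatic arthritis is a chronic inflammatory disease that affects "
    "the joints and skin. Treatment aims to control inflammation and "
    "protect joint function."
)
print("Flesch Reading Ease:", textstat.flesch_reading_ease(answer))
print("Gunning Fog index:  ", textstat.gunning_fog(answer))

# Interrater agreement: rows = answers, columns = raters, values = 1-5
# accuracy scores (toy data, 6 answers x 4 raters, for illustration only).
ratings = np.array([
    [5, 5, 4, 5],
    [3, 4, 3, 3],
    [4, 4, 4, 5],
    [2, 3, 2, 2],
    [5, 4, 5, 5],
    [3, 3, 4, 3],
])
table, _ = aggregate_raters(ratings)          # answers x rating-category counts
print("Fleiss kappa:", fleiss_kappa(table))   # chance-corrected agreement

In this sketch, a higher Flesch Reading Ease score indicates more accessible text (consistent with the abstract's claim that ChatGPT responses were more readable across indices), and Fleiss kappa summarizes agreement among the raters on the ordinal accuracy scale.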


