Comparative performance of humans versus GPT-4.0 and GPT-3.5 in the self-assessment program of American Academy of Ophthalmology

Coco, Giulia;
2023-10-29

Abstract

To compare the performance of humans, GPT-4.0 and GPT-3.5 in answering multiple-choice questions from the American Academy of Ophthalmology (AAO) Basic and Clinical Science Course (BCSC) self-assessment program, available at https://www.aao.org/education/self-assessments. In June 2023, text-based multiple-choice questions were submitted to GPT-4.0 and GPT-3.5. The AAO provides the percentage of humans who selected the correct answer, which was analyzed for comparison. All questions were classified by 10 subspecialties and 3 practice areas (diagnostics/clinics, medical treatment, surgery). Out of 1023 questions, GPT-4.0 achieved the best score (82.4%), followed by humans (75.7%) and GPT-3.5 (65.9%), with significant differences in accuracy rates (P < 0.0001 for all comparisons). Both GPT-4.0 and GPT-3.5 showed the worst results in surgery-related questions (74.6% and 57.0%, respectively). For difficult questions (answered incorrectly by > 50% of humans), both GPT models compared favorably with humans, without reaching statistical significance. The word count of answers provided by GPT-4.0 was significantly lower than that of answers produced by GPT-3.5 (160 ± 56 and 206 ± 77 words, respectively; P < 0.0001); however, incorrect responses were longer (P < 0.02). GPT-4.0 represented a substantial improvement over GPT-3.5, achieving better performance than humans in an AAO BCSC self-assessment test. However, ChatGPT is still limited by inconsistency across different practice areas, especially surgery.
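For readers interested in how comparisons of this kind are typically computed, the minimal sketch below (Python, not the authors' analysis code) applies a chi-squared test to correct/incorrect answer counts and Welch's t-test to answer word counts. The counts and distributions are placeholders chosen only to mirror the figures reported in the abstract, not the study's raw data, and the choice of tests is an assumption.

```python
# Hypothetical sketch: comparing answer accuracy and response length
# between two answer sets, in the spirit of the comparisons reported
# in the abstract. All numbers below are placeholders, not study data.
import numpy as np
from scipy.stats import chi2_contingency, ttest_ind

N_QUESTIONS = 1023  # total questions in the BCSC self-assessment set

# Correct/incorrect counts for two hypothetical answer sets
correct_a, correct_b = 843, 674  # ~82.4% vs ~65.9% accuracy
table = [[correct_a, N_QUESTIONS - correct_a],
         [correct_b, N_QUESTIONS - correct_b]]
chi2, p_accuracy, _, _ = chi2_contingency(table)
print(f"accuracy difference: chi2={chi2:.1f}, P={p_accuracy:.2e}")

# Word counts per answer (placeholder normally distributed samples)
rng = np.random.default_rng(0)
words_a = rng.normal(160, 56, N_QUESTIONS)  # shorter answers
words_b = rng.normal(206, 77, N_QUESTIONS)  # longer answers
t, p_words = ttest_ind(words_a, words_b, equal_var=False)
print(f"word-count difference: t={t:.1f}, P={p_words:.2e}")
```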
29-Oct-2023
Published
International relevance
Article
Anonymous experts
Sector MED/30
English
Taloni, A., Borselli, M., Scarsi, V., Rossi, C., Coco, G., Scorcia, V., et al. (2023). Comparative performance of humans versus GPT-4.0 and GPT-3.5 in the self-assessment program of American Academy of Ophthalmology. SCIENTIFIC REPORTS, 13(1), 18562 [10.1038/s41598-023-45837-2].
Taloni, A; Borselli, M; Scarsi, V; Rossi, C; Coco, G; Scorcia, V; Giannaccare, G
Journal article
Files in this record:
There are no files associated with this record.

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/2108/347332
Citations
  • PMC 0
  • Scopus 10
  • ISI Web of Science 10