Comparative performance of humans versus GPT-4.0 and GPT-3.5 in the self-assessment program of American Academy of Ophthalmology

Coco, Giulia;
2023-10-29

Abstract

To compare the performance of humans, GPT-4.0 and GPT-3.5 in answering multiple-choice questions from the American Academy of Ophthalmology (AAO) Basic and Clinical Science Course (BCSC) self-assessment program, available at https://www.aao.org/education/self-assessments. In June 2023, text-based multiple-choice questions were submitted to GPT-4.0 and GPT-3.5. The AAO provides the percentage of humans who selected the correct answer, which was analyzed for comparison. All questions were classified by 10 subspecialties and 3 practice areas (diagnostics/clinics, medical treatment, surgery). Out of 1023 questions, GPT-4.0 achieved the best score (82.4%), followed by humans (75.7%) and GPT-3.5 (65.9%), with significant differences in accuracy rates (P < 0.0001 for all comparisons). Both GPT-4.0 and GPT-3.5 showed the worst results in surgery-related questions (74.6% and 57.0%, respectively). For difficult questions (answered incorrectly by > 50% of humans), both GPT models compared favorably with humans, without reaching statistical significance. The word count of answers provided by GPT-4.0 was significantly lower than that of answers produced by GPT-3.5 (160 ± 56 and 206 ± 77 words, respectively; P < 0.0001); however, incorrect responses were longer (P < 0.02). GPT-4.0 represented a substantial improvement over GPT-3.5, achieving better performance than humans in an AAO BCSC self-assessment test. However, ChatGPT is still limited by inconsistency across different practice areas, especially surgery.
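For readers interested in how comparisons of this kind are typically computed, the minimal sketch below (Python, not the authors' analysis code) applies a chi-squared test to correct/incorrect answer counts and Welch's t-test to answer word counts. The counts and distributions are placeholders chosen only to mirror the figures reported in the abstract, not the study's raw data, and the choice of tests is an assumption.

```python
# Hypothetical sketch: comparing answer accuracy and response length
# between two answer sets, in the spirit of the comparisons reported
# in the abstract. All numbers below are placeholders, not study data.
import numpy as np
from scipy.stats import chi2_contingency, ttest_ind

N_QUESTIONS = 1023  # total questions in the BCSC self-assessment set

# Correct/incorrect counts for two hypothetical answer sets
correct_a, correct_b = 843, 674  # ~82.4% vs ~65.9% accuracy
table = [[correct_a, N_QUESTIONS - correct_a],
         [correct_b, N_QUESTIONS - correct_b]]
chi2, p_accuracy, _, _ = chi2_contingency(table)
print(f"accuracy difference: chi2={chi2:.1f}, P={p_accuracy:.2e}")

# Word counts per answer (placeholder normally distributed samples)
rng = np.random.default_rng(0)
words_a = rng.normal(160, 56, N_QUESTIONS)  # shorter answers
words_b = rng.normal(206, 77, N_QUESTIONS)  # longer answers
t, p_words = ttest_ind(words_a, words_b, equal_var=False)
print(f"word-count difference: t={t:.1f}, P={p_words:.2e}")
```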
29-Oct-2023
Published
International relevance
Article
Anonymous experts
Sector MED/30
English
Taloni, A., Borselli, M., Scarsi, V., Rossi, C., Coco, G., Scorcia, V., et al. (2023). Comparative performance of humans versus GPT-4.0 and GPT-3.5 in the self-assessment program of American Academy of Ophthalmology. SCIENTIFIC REPORTS, 13(1), 18562 [10.1038/s41598-023-45837-2].
Taloni, A; Borselli, M; Scarsi, V; Rossi, C; Coco, G; Scorcia, V; Giannaccare, G
Journal article
Files in this record:
There are no files associated with this record.

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/2108/347332
Citations
  • PMC 0
  • Scopus 10
  • ISI Web of Science 10