Investigating the Impact of Data Contamination of Large Language Models in Text-to-SQL Translation

Understanding textual descriptions to generate code seems to be an achieved capability of instruction-following Large Language Models (LLMs) in zero-shot scenarios. However, there is a serious possibility that this translation ability is influenced by the model having seen the target textual descriptions and the related code. This effect is known as Data Contamination. In this study, we investigate the impact of Data Contamination on the performance of GPT-3.5 in Text-to-SQL code-generation tasks. To this end, we introduce a novel method to detect Data Contamination in GPTs and examine GPT-3.5's Text-to-SQL performance using the well-known Spider dataset and our new, unfamiliar dataset, Termite. Furthermore, we analyze GPT-3.5's efficacy on databases with modified information via an adversarial table disconnection (ATD) approach, which complicates Text-to-SQL tasks by removing structural pieces of information from the database. Our results indicate a significant performance drop in GPT-3.5 on the unfamiliar Termite dataset, even with ATD modifications, highlighting the effect of Data Contamination on LLMs in Text-to-SQL translation tasks.

Ranaldi, F., Ruzzetti, E.S., Onorati, D., Ranaldi, L., Giannone, C., Favalli, A., et al. (2024). Investigating the Impact of Data Contamination of Large Language Models in Text-to-SQL Translation. In Findings of the Association for Computational Linguistics: ACL 2024 (pp. 13909-13920). Association for Computational Linguistics [10.18653/v1/2024.findings-acl.827].

Ranaldi F.;Ruzzetti E. S.;Onorati D.;Ranaldi L.;Giannone C.;Favalli A.;Romagnoli R.;Zanzotto F. M.

2024-01-01

Conference name: 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024)
Conference location: Bangkok, Thailand
Conference year: 2024
Conference number: 62
Conference relevance: International
Publication date: 2024
DOI of the contribution: https://dx.doi.org/10.18653/v1/2024.findings-acl.827
Scientific-disciplinary sector of the contribution (valid from 9 May 2024): Sector IINF-05/A - Information Processing Systems
Content language: English
Type: Conference contribution
Appears in types: 02 - Conference contribution

Use this identifier to cite or link to this document: https://hdl.handle.net/2108/389003
