VALIDATE: a deep dive into vulnerability prediction datasets


Context: Software vulnerabilities are a critical issue today: they cause economic damage to industry and endanger daily life by threatening critical national security infrastructures. Vulnerability prediction helps software engineers prevent malicious attackers from exploiting vulnerabilities, thus improving the security and reliability of software. Datasets are vital to vulnerability prediction studies because machine learning models depend on them, yet dataset creation is time-consuming, error-prone, and difficult to validate. Objectives: This study aims to characterise the datasets of vulnerability prediction studies in terms of availability and features. Moreover, to support researchers in finding and sharing datasets, we provide the first VulnerAbiLty predIction DatAseT rEpository (VALIDATE). Methods: We perform a systematic literature review of the datasets of vulnerability prediction studies. Results: Our results show that out of 50 primary studies, only 22 studies (i.e., 38%) provide a reachable dataset. Of these 22 studies, only one provides a dataset in a stable repository. Conclusions: Our repository of 31 datasets (22 reachable plus nine provided by authors via email) helps researchers find datasets of interest and avoid reinventing the wheel; this translates into less effort, more reliability, and more reproducibility in dataset creation and use.

Esposito, M., Falessi, D. (2024). VALIDATE: a deep dive into vulnerability prediction datasets. INFORMATION AND SOFTWARE TECHNOLOGY, 170 [10.1016/j.infsof.2024.107448].


Matteo Esposito; Davide Falessi

2024-01-01



Publication date: 2024
Publication status: Published
Article DOI: https://dx.doi.org/10.1016/j.infsof.2024.107448
Relevance: International
Type: Review
Referee: Scientific committee
Disciplinary sector of the article (valid from 9 May 2024): Sector IINF-05/A - Sistemi di elaborazione delle informazioni (Information Processing Systems)
Content language: English
Keywords: Dataset; Machine learning; Replicability; Repository; Security; Vulnerability
All authors: Esposito, M; Falessi, D
Publication type: Journal article
Appears in collections: 01 - Journal article

Files in this item:

File	Access	Type	License	Size	Format
1-s2.0-S0950584924000533-main (1).pdf	Open access	Publisher's version (PDF)	Creative Commons	2.76 MB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/2108/394012
