The Impact of Dormant Defects on Defect Prediction: A Study of 19 Apache Projects

Defect prediction models can be beneficial to prioritize testing, analysis, or code review activities, and have been the subject of substantial effort in academia and of some applications in industrial contexts. A necessary precondition when creating a defect prediction model is the availability of defect data from the history of projects. If this data is noisy, the resulting defect prediction model could be unreliable. One cause of noise in defect datasets is the presence of "dormant defects," i.e., defects discovered several releases after their introduction. This can cause a class to be labeled as defect-free while it is not; such a class is therefore "snoring." In this article, we investigate the impact of snoring on classifiers' accuracy and the effectiveness of a possible countermeasure, i.e., dropping too-recent data from a training set. We analyze the accuracy of 15 machine learning defect prediction classifiers on data from more than 4,000 defects and 600 releases of 19 open source projects from the Apache ecosystem. Our results show that, on average across projects, (i) the presence of dormant defects decreases the recall of defect prediction classifiers, and (ii) removing from the training set the classes that are labeled as not defective in the last release significantly improves the accuracy of the classifiers. In summary, this article provides insights on how to create defect datasets that mitigate the negative effect of dormant defects on defect prediction.
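
The countermeasure named in the abstract (removing from the training set the classes labeled as not defective in the last release) can be illustrated with a minimal sketch. This is not the authors' implementation: the `ClassRecord` type, the `drop_recent_non_defective` function, and the sample data are all hypothetical, and the sketch assumes that releases are identified by increasing integers.

```python
from typing import List, NamedTuple


class ClassRecord(NamedTuple):
    """One labeled class in one release (hypothetical schema)."""
    release: int       # assumption: higher number means more recent release
    class_name: str
    defective: bool    # label as currently known; may hide a dormant defect


def drop_recent_non_defective(training_set: List[ClassRecord]) -> List[ClassRecord]:
    """Keep every record except those from the most recent release
    that are labeled as not defective (the possibly "snoring" ones)."""
    if not training_set:
        return []
    last_release = max(record.release for record in training_set)
    return [
        record
        for record in training_set
        if record.defective or record.release != last_release
    ]


# Illustrative data only, not taken from the study.
training = [
    ClassRecord(1, "Parser", True),
    ClassRecord(1, "Lexer", False),
    ClassRecord(2, "Parser", False),  # last release, not defective: dropped
    ClassRecord(2, "Cache", True),
]
filtered = drop_recent_non_defective(training)
```

Older non-defective records are kept because their labels have had more releases in which a dormant defect could surface; only the most recent negative labels are treated as untrustworthy in this sketch.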

Falessi, D., Ahluwalia, A., Di Penta, M. (2022). The Impact of Dormant Defects on Defect Prediction: A Study of 19 Apache Projects. ACM TRANSACTIONS ON SOFTWARE ENGINEERING AND METHODOLOGY, 31(1), 1-26 [10.1145/3467895].

Falessi, D; Ahluwalia, A; Di Penta, M

2022-01-01

Publication date: 2022
Publication status: Published
Article DOI: https://dx.doi.org/10.1145/3467895
Relevance: International
Type: Article
Referee: Scientific committee
Disciplinary sector of the article (valid until 24/06/2024): ING-INF/05 - Information processing systems
Disciplinary sector of the article (valid from 09/05/2024): IINF-05/A - Information processing systems
Content language: English
Keywords: Defect prediction; fix-inducing changes; dataset bias
Category: Journal article
Appears in categories: 01 - Journal article

Files in this record:

File	Size	Format
2105.12372.pdf (open access; type: pre-print document; license: authors' copyright)	1.6 MB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/2108/329067
