Ahluwalia, A., Falessi, D., Di Penta, M. (2019). Snoring: A noise in defect prediction datasets. In IEEE International Working Conference on Mining Software Repositories (pp.63-67). 1515 BROADWAY, NEW YORK, NY 10036-9998 USA : IEEE Computer Society [10.1109/MSR.2019.00019].

Snoring: A noise in defect prediction datasets

Falessi D.;
2019-01-01

Abstract

To develop and train defect prediction models, researchers rely on datasets in which a defect is often attributed to the release where it is discovered. However, in many circumstances a defect is discovered only several releases after its introduction. This can bias the dataset, i.e., the intermediate releases are treated as defect-free and the later one as defect-prone. We call this phenomenon 'sleeping defects'. We call 'snoring' the phenomenon in which classes affected only by sleeping defects are treated as defect-free until the defect is discovered. In this paper we analyze, on data from 282 releases of six open source projects from the Apache ecosystem, the magnitude of sleeping defects and of snoring classes. Our results indicate that 1) in all projects, most of the defects slept for more than 20% of the existing releases, and 2) in the majority of the projects the missing rate is more than 25% even if we remove 50% of releases.
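The notion of a sleeping defect can be illustrated with a minimal sketch. It is not the authors' tooling. The `Defect` record, the 0-based release indices, and the `slept_fraction` definition are all assumptions made for illustration: a defect is taken to sleep in every release from its introduction up to, but not including, the release where it is discovered.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Defect:
    """Hypothetical record; release indices are 0-based positions in the release history."""
    defect_id: str
    introduced_in: int  # index of the release where the defect was introduced
    discovered_in: int  # index of the release where the defect was discovered


def sleeping_releases(defect: Defect) -> list[int]:
    """Releases that contain the defect but precede its discovery (assumed definition)."""
    return list(range(defect.introduced_in, defect.discovered_in))


def slept_fraction(defect: Defect, total_releases: int) -> float:
    """Share of all releases during which the defect went unnoticed (assumed definition)."""
    return len(sleeping_releases(defect)) / total_releases


if __name__ == "__main__":
    d = Defect("D-1", introduced_in=2, discovered_in=5)
    print(sleeping_releases(d))   # releases in which D-1 is present but not yet discovered
    print(slept_fraction(d, 10))  # fraction of a hypothetical 10-release history
```

Under these assumptions, a dataset that labels only the discovery release as defect-prone would mark each release returned by `sleeping_releases` as defect-free for that defect, which is the bias the abstract describes.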
16th IEEE/ACM International Conference on Mining Software Repositories, MSR 2019
2019
Association for Computing Machinery (ACM)
International relevance
2019
Sector ING-INF/05 - Information Processing Systems
English
Dataset bias
Defect prediction
Fix-inducing changes
Conference paper
Ahluwalia, A; Falessi, D; Di Penta, M
Files in this record:
File: 08816788.pdf
Access: authorized users only
Type: Publisher's version (PDF)
License: Publisher's copyright
Size: 118.8 kB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/2108/273900
Citations
  • PMC: N/A
  • Scopus: 17
  • ISI: 13