To develop and train defect prediction models, researchers rely on datasets in which a defect is often attributed to the release where the defect is discovered. However, in many circumstances a defect is discovered only several releases after its introduction. This might introduce a bias in the dataset, i.e., treating the intermediate releases as defect-free and the later one as defect-prone. We call this phenomenon 'sleeping defects'. We call 'snoring' the phenomenon where classes are affected only by sleeping defects and would therefore be treated as defect-free until the defect is discovered. In this paper we analyze, on data from 282 releases of six open source projects from the Apache ecosystem, the magnitude of the sleeping defects and of the snoring classes. Our results indicate that 1) on all projects, most of the defects in a project slept for more than 20% of the existing releases, and 2) in the majority of the projects the missing rate is more than 25% even if we remove 50% of releases.
Ahluwalia, A., Falessi, D., Di Penta, M. (2019). Snoring: A noise in defect prediction datasets. In IEEE International Working Conference on Mining Software Repositories (pp. 63-67). New York, NY, USA: IEEE Computer Society. doi:10.1109/MSR.2019.00019.
Snoring: A noise in defect prediction datasets
Falessi D.
2019-01-01