REPRESENTATION-BASED DATA QUALITY AUDITS FOR AUDIO
Author
Gonzalez-Jimenez, Alvaro
Gröger, Fabian
Wermelinger, Linda
Bürli, Andrin
Kastanis, Iason
Lionetti, Simone
Pouly, Marc
DOI
Abstract
Data quality issues such as off-topic samples, near duplicates, and label errors often limit the performance of audio-based systems. This paper addresses these issues by adapting SelfClean, a representation-to-rank data auditing framework, from the image to the audio domain. The approach leverages self-supervised audio representations to identify common data quality issues, creating ranked review lists that surface distinct issues within a single, unified process. The method is benchmarked on ESC-50, GTZAN, and a proprietary industrial dataset, using both synthetic and naturally occurring corruptions. The results demonstrate that this framework achieves state-of-the-art ranking performance, often outperforming issue-specific baselines and enabling significant annotation savings by efficiently guiding human review.
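The representation-to-rank idea in the abstract can be illustrated with a minimal sketch. This is not the SelfClean implementation: the scoring rules (cosine distance for near duplicates, mean k-nearest-neighbour distance for off-topic samples, and a same-class versus other-class distance ratio for label errors), the function names, and the toy 2-D embeddings are all assumptions chosen for illustration. In practice the embeddings would come from a self-supervised audio encoder.

```python
import numpy as np


def cosine_distance_matrix(emb):
    """Pairwise cosine distances; the diagonal is set to inf to ignore self-matches."""
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    dist = np.clip(1.0 - unit @ unit.T, 0.0, None)  # clip tiny negative round-off
    np.fill_diagonal(dist, np.inf)
    return dist


def rank_near_duplicates(dist):
    """Rank all index pairs (i < j), closest pair first."""
    i, j = np.triu_indices(dist.shape[0], k=1)
    order = np.argsort(dist[i, j])
    return list(zip(i[order].tolist(), j[order].tolist()))


def rank_off_topic(dist, k=3):
    """Rank samples by mean distance to their k nearest neighbours, most isolated first.

    Assumption: isolation in embedding space is used as a stand-in for "off-topic".
    """
    knn = np.sort(dist, axis=1)[:, :k]
    return np.argsort(-knn.mean(axis=1)).tolist()


def rank_label_errors(dist, labels):
    """Rank samples whose nearest other-class neighbour is much closer than
    their nearest same-class neighbour, most suspicious first.

    Assumption: every class has at least two samples, so both distances are finite.
    """
    same = labels[:, None] == labels[None, :]
    d_same = np.where(same, dist, np.inf).min(axis=1)
    d_other = np.where(~same, dist, np.inf).min(axis=1)
    score = d_same / (d_same + d_other + 1e-12)
    return np.argsort(-score).tolist()


# Hypothetical toy embeddings standing in for encoder outputs.
embeddings = np.array([
    [1.0, 0.0],     # 0: class 0
    [1.0, 0.001],   # 1: class 0, near duplicate of sample 0
    [1.0, 0.1],     # 2: class 0
    [0.0, 1.0],     # 3: class 1
    [0.1, 1.0],     # 4: class 1
    [0.05, 1.0],    # 5: class 1
    [1.0, 0.05],    # 6: lies in the class-0 cluster but is labelled 1
    [-1.0, -1.0],   # 7: far from both clusters
])
labels = np.array([0, 0, 0, 1, 1, 1, 1, 0])

dist = cosine_distance_matrix(embeddings)
pairs = rank_near_duplicates(dist)
off_topic = rank_off_topic(dist, k=3)
label_errors = rank_label_errors(dist, labels)

print("near-duplicate candidates:", pairs[:3])
print("off-topic candidates:", off_topic[:3])
print("label-error candidates:", label_errors[:3])
```

Each function returns a ranked list, so a reviewer would inspect candidates from the top and stop once the issues become rare, which is where the annotation savings mentioned in the abstract would come from.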
Publication Reference
Gonzalez-Jimenez, Alvaro; Gröger, Fabian; Wermelinger, Linda; Bürli, Andrin; Kastanis, Iason; Lionetti, Simone & Pouly, Marc (2026). Representation-Based Data Quality Audits for Audio. ICASSP’26: Proceedings of the International Conference on Acoustics, Speech, and Signal Processing,
Year
2026-02-13