EXPLORING LLM AGENTS FOR CLEANING TABULAR MACHINE LEARNING DATASETS
Author
Bendinelli, Tommaso
Dox, Artur
Holz, Christian
DOI
Abstract
High-quality, error-free datasets are a key ingredient in building reliable, accurate, and unbiased machine learning (ML) models. However, real-world datasets
often suffer from errors caused by sensor malfunctions, data entry mistakes, or improper data integration across multiple sources, and these errors can severely degrade model
performance. Detecting and correcting these issues typically require tailor-made
solutions and demand extensive domain expertise. Consequently, automation is
challenging, rendering the process labor-intensive and tedious. In this study, we
investigate whether Large Language Models (LLMs) can help alleviate the burden
of manual data cleaning. We set up an experiment in which an LLM, paired with
Python, is tasked with cleaning the training dataset to improve the performance
of a learning algorithm without having the ability to modify the training pipeline
or perform any feature engineering. We run this experiment on multiple Kaggle
datasets that have been intentionally corrupted with errors. Our results show that
LLMs can identify and correct erroneous entries, such as illogical values or outliers, by leveraging contextual information from other features within the same
row, as well as feedback from previous iterations. However, they struggle to detect
more complex errors that require understanding data distribution across multiple
rows, such as trends and biases.
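
As a rough illustration of the row-level corrections described in the abstract, the following is a minimal sketch and not the authors' method. The toy rows, the column names (`birth_year`, `survey_year`, `age`), the `clean_row` helper, and the `max_age` threshold are all hypothetical and chosen only for illustration. The sketch repairs an implausible value using other features in the same row; it does not attempt to catch errors that are visible only across rows, such as trends or biases.

```python
# Hypothetical toy rows; the column names are illustrative and not taken from the paper.
rows = [
    {"birth_year": 1990, "survey_year": 2020, "age": 30},
    {"birth_year": 1985, "survey_year": 2020, "age": -35},  # illogical (negative) value
    {"birth_year": 2000, "survey_year": 2020, "age": 200},  # outlier
    {"birth_year": 1975, "survey_year": 2020, "age": 45},
]


def clean_row(row, max_age=120):
    """Return a copy of `row` whose `age` is repaired when it is implausible.

    The repair uses only other features in the same row (within-row context).
    """
    fixed = dict(row)  # copy, so the original row is left untouched
    if not (0 <= row["age"] <= max_age):
        # Recompute the value from redundant information in the same row.
        fixed["age"] = row["survey_year"] - row["birth_year"]
    return fixed


cleaned = [clean_row(r) for r in rows]
ages = [r["age"] for r in cleaned]
print(ages)
```

A check of this kind looks at one row at a time, so a systematic shift applied to every row would pass it unnoticed as long as each value stays within the plausible range.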
Publication Reference
ICML Workshop
Year
2025-03-01