Optimizing Data Acquisition to Enhance Machine Learning Performance

Publication Name

Proceedings of the VLDB Endowment

Abstract

In this paper, we study how to acquire labeled data points from a large data pool to enrich a training set and thereby improve supervised machine learning (ML) performance. The state-of-the-art solution is the clustering-based training set selection (CTS) algorithm, which first clusters the data points in the pool and then selects new data points from the clusters. The efficiency of CTS is constrained by its frequent retraining of the target ML model, and its effectiveness is limited by its selection criteria, which represent the state of data points within each cluster and restrict selection to a single cluster per iteration. To overcome these limitations, we propose a new algorithm, CTS with incremental estimation of adaptive score (IAS). IAS employs online learning, which updates the model incrementally with new data and removes the need to fully retrain the target model, thereby improving efficiency. To enhance the effectiveness of IAS, we introduce adaptive score estimation, a novel selection criterion that identifies clusters and selects new data points by balancing the trade-off between exploitation and exploration during data acquisition. To further enhance the effectiveness of IAS, we introduce a new adaptive mini-batch selection method that, in each iteration, selects data points from multiple clusters rather than a single cluster, eliminating the potential bias due to using only one cluster. Integrating this method into IAS yields a novel algorithm, IAS with adaptive mini-batch selection (IAS-AMS). Experimental results show that IAS-AMS is the most effective, and that IAS also outperforms the other competing algorithms. In terms of efficiency, IAS is the fastest, while IAS-AMS is on par with the existing CTS algorithm.
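The acquisition loop described in the abstract can be illustrated with a small sketch. This is not the authors' IAS or IAS-AMS implementation: the abstract does not define the adaptive score, so a standard UCB1-style bandit score stands in for it here. The names `ucb_scores`, `pick_batch`, `acquire`, `update_and_gain`, `top_k`, and `c` are all hypothetical. The caller is assumed to supply pre-clustered data and a callback that updates the model incrementally with one point and returns the observed gain.

```python
import math


def ucb_scores(mean_gain, pulls, c=1.0):
    """UCB1-style score per cluster (an assumed stand-in for the adaptive score):
    mean observed gain (exploitation) plus a bonus for rarely used clusters (exploration)."""
    total = sum(pulls)
    scores = []
    for gain, n in zip(mean_gain, pulls):
        if n == 0:
            scores.append(math.inf)  # never-sampled clusters are tried first
        else:
            scores.append(gain + c * math.sqrt(2.0 * math.log(total) / n))
    return scores


def pick_batch(scores, pools, batch_size, top_k):
    """Draw one mini-batch round-robin from the top_k highest-scoring non-empty clusters."""
    ranked = sorted(
        (i for i in range(len(pools)) if pools[i]),
        key=lambda i: scores[i],
        reverse=True,
    )[:top_k]
    batch = []  # list of (cluster_id, point) pairs
    while len(batch) < batch_size and any(pools[i] for i in ranked):
        for i in ranked:
            if pools[i] and len(batch) < batch_size:
                batch.append((i, pools[i].pop()))
    return batch


def acquire(clusters, update_and_gain, rounds, batch_size=4, top_k=2, c=1.0):
    """Run the acquisition loop and return the acquired points in order.

    update_and_gain(point) is a hypothetical callback: it should update the
    model incrementally with the point (no full retrain) and return the gain.
    """
    pools = [list(points) for points in clusters]
    mean_gain = [0.0] * len(pools)
    pulls = [0] * len(pools)
    acquired = []
    for _ in range(rounds):
        batch = pick_batch(ucb_scores(mean_gain, pulls, c), pools, batch_size, top_k)
        if not batch:
            break  # data pool exhausted
        for cluster_id, point in batch:
            gain = update_and_gain(point)
            pulls[cluster_id] += 1
            # Running mean of the gain observed for this cluster.
            mean_gain[cluster_id] += (gain - mean_gain[cluster_id]) / pulls[cluster_id]
            acquired.append(point)
    return acquired
```

In this sketch, setting `top_k=1` restricts each iteration to a single cluster, which roughly mirrors the restriction the abstract attributes to CTS. A larger `top_k` spreads each mini-batch over several clusters, in the spirit of the multi-cluster selection the abstract describes.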

Open Access Status

This publication is not available as open access

Volume

17

Issue

6

First Page

1310

Last Page

1323

Link to publisher version (DOI)

http://dx.doi.org/10.14778/3648160.3648172