Cody Coleman - Data selection for Data-Centric AI: Data Quality Over Quantity
Data selection methods, such as active learning and core-set selection, improve the data efficiency of machine learning by identifying the most informative data points to label or train on. Across the data selection literature, there are many ways to identify these training examples. However, classical data selection methods are prohibitively expensive to apply in deep learning because of the much larger datasets and models involved. This talk will describe two techniques that make data selection more tractable. First, "selection via proxy" (SVP) avoids expensive training and reduces the computation per example by using smaller proxy models to quantify the informativeness of each example. Second, "similarity search for efficient active learning and search" (SEALS) reduces the number of examples processed by restricting the candidate pool for labeling to the nearest neighbors of the currently labeled set, rather than scanning all of the unlabeled data. Both methods lead to order-of-magnitude performance improvements, making active learning practical on billions of unlabeled images for the first time.
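To make the two ideas concrete, here is a minimal NumPy sketch of both selection strategies as described in the abstract. This is not the authors' implementation: real SVP proxies are small neural networks rather than raw probability tables, and real SEALS uses approximate nearest-neighbor indexes (e.g., Faiss) instead of brute-force distances. All function names here are illustrative.

```python
import numpy as np

def entropy_scores(probs):
    # Uncertainty of each example under a model's predicted class
    # probabilities: higher entropy = more informative to label.
    eps = 1e-12
    return -np.sum(probs * np.log(probs + eps), axis=1)

def select_via_proxy(proxy_probs, budget):
    # SVP idea: score examples with a cheap *proxy* model's predictions
    # instead of the large target model, then take the top-`budget`.
    scores = entropy_scores(proxy_probs)
    return np.argsort(-scores)[:budget]

def seals_candidate_pool(labeled, unlabeled, k):
    # SEALS idea: restrict the candidate pool to the k nearest unlabeled
    # neighbors of each labeled point, instead of scanning everything.
    # Brute-force L2 distances for illustration only.
    d = np.linalg.norm(unlabeled[:, None, :] - labeled[None, :, :], axis=2)
    nn = np.argsort(d, axis=0)[:k]          # k nearest per labeled point
    return np.unique(nn)                    # deduplicated candidate indices

# Toy demonstration with random feature embeddings.
rng = np.random.default_rng(0)
labeled = rng.normal(size=(5, 8))           # 5 labeled embeddings
unlabeled = rng.normal(size=(1000, 8))      # 1000 unlabeled embeddings
pool = seals_candidate_pool(labeled, unlabeled, k=10)

# Only the small pool is scored by the (hypothetical) proxy model.
proxy_probs = rng.dirichlet(np.ones(3), size=pool.size)
picked = pool[select_via_proxy(proxy_probs, budget=5)]
```

The pool contains at most 5 × 10 = 50 candidates, so the proxy scores a small fraction of the 1000 unlabeled points each round; that pool-size reduction is what makes the approach scale to billions of examples.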