The ALCHEmist: Automated Labeling 500x CHEaper than LLM Data Annotators

University of Wisconsin-Madison
NeurIPS Spotlight 2024

Abstract

Large pretrained models can be used as annotators, helping to replace or augment crowdworkers and enabling the distillation of generalist models into smaller specialist models. Unfortunately, this comes at a cost: employing top-of-the-line models often requires paying thousands of dollars for API calls, while the resulting datasets are static and challenging to audit. To address these challenges, we propose a simple alternative: rather than directly querying labels from pretrained models, we task models with generating programs that can produce labels. These programs can be stored and applied locally, re-used and extended, and cost orders of magnitude less. Our system, Alchemist, obtains performance comparable to or better than large language model-based annotation on a range of tasks, for a fraction of the cost: on average, a 12.9% improvement, while the total labeling cost across all datasets is reduced by a factor of approximately 500x.

Challenges

LLM-based annotators offer efficiency over costly human labelers but come with several challenges:

  1. High Cost: Labeling a dataset can be expensive, particularly when each data point consists of many tokens. For example, labeling a moderately-sized dataset (8k data points) using GPT-4 costs over $1,200 (see the rough cost sketch after this list).
  2. Lack of Extensibility: Making even small changes to specifications necessitates re-running the entire pipeline to obtain new labels. This inflexibility means the resulting labels are static.
  3. Inability to Audit: API access to pretrained models does not permit inspecting most aspects of the model. Users must simply accept the provided labels with minimal additional information. Techniques that ask the model for explanations for its decisions may not be reliable.
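
To make the first point concrete, here is a back-of-the-envelope cost sketch in Python. The token counts and per-token price are illustrative assumptions, not the paper's exact accounting.

# Rough cost comparison. All numbers below are assumptions for illustration.
N_EXAMPLES = 8_000            # "moderately-sized dataset" from the list above
TOKENS_PER_EXAMPLE = 2_500    # assumed prompt + completion tokens per labeled point
PRICE_PER_1K_TOKENS = 0.06    # assumed GPT-4-class price in USD per 1K tokens

# Direct annotation: one API call per data point, so cost scales with dataset size.
direct_cost = N_EXAMPLES * TOKENS_PER_EXAMPLE / 1_000 * PRICE_PER_1K_TOKENS

# Program-based annotation: a fixed handful of calls to generate labeling programs.
N_PROGRAM_CALLS = 10
program_cost = N_PROGRAM_CALLS * TOKENS_PER_EXAMPLE / 1_000 * PRICE_PER_1K_TOKENS

print(f"direct labeling  ~ ${direct_cost:,.0f}")    # ~ $1,200 with these assumptions
print(f"program approach ~ ${program_cost:,.2f}")   # ~ $1.50
print(f"savings factor   ~ {direct_cost / program_cost:.0f}x")  # same order of magnitude as the paper's ~500x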

Our Solution

We address these challenges with a new approach! Instead of directly prompting for labels, we ask the LLM to generate just a few programs that act as annotators. These programs either label data directly or label a training dataset used to train a distilled specialist model.
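
As a concrete illustration, here is a minimal sketch of spending one API call on a labeling program rather than one call per example. It is not the authors' exact prompt or pipeline; the model name, prompt wording, and task are assumptions for illustration.

# Ask an LLM once for a labeling *program* instead of asking it to label every example.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT = (
    "Write a Python function `label(text: str) -> int` that labels a movie review "
    "as positive (1) or negative (0). Return only the code."
)

response = client.chat.completions.create(
    model="gpt-4o",  # hypothetical model choice
    messages=[{"role": "user", "content": PROMPT}],
)

program_source = response.choices[0].message.content
print(program_source)  # store this program; it can be reviewed, edited, and re-used locally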

Why is this good?

Generated programs allow unlimited predictions locally at nearly zero cost: the number of API calls no longer scales with dataset size. Plus, the code can be inspected and corrected by subject-matter experts (SMEs), making it easy to adapt to changes in your labeling rules.
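
For intuition, here is a minimal sketch of labeling locally with stored programs. The two "generated" programs are hypothetical keyword heuristics standing in for LLM output, and the simple majority vote is an illustrative aggregation choice, not necessarily the paper's.

from collections import Counter

# Hypothetical generated programs, saved locally after a single round of API calls.
def label_keywords(text: str) -> int:
    return 1 if any(w in text.lower() for w in ("great", "excellent", "loved")) else 0

def label_negatives(text: str) -> int:
    return 0 if any(w in text.lower() for w in ("terrible", "boring", "waste")) else 1

PROGRAMS = [label_keywords, label_negatives]

def annotate(texts):
    # No API calls here: every prediction is computed locally,
    # so cost does not grow with dataset size.
    labeled = []
    for text in texts:
        votes = [program(text) for program in PROGRAMS]
        labeled.append(Counter(votes).most_common(1)[0][0])  # naive majority vote
    return labeled

print(annotate(["Loved it, excellent pacing.", "A boring waste of two hours."]))  # -> [1, 0]

Because the programs are plain code, a domain expert can read them, fix a keyword list, or add a rule without any further API calls.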

Main Result

With Alchemist, using just 10 API calls to generate programs improves accuracy on five out of eight datasets compared to zero-shot prompting, while reducing API expenses by a factor of roughly 500.

Still prompting ChatGPT for labels, one example at a time? Try generating labeling programs instead and cut your project's annotation costs!

Poster

BibTeX

@inproceedings{huang2024the,
  title={The {ALCHE}mist: Automated Labeling 500x {CHE}aper than {LLM} Data Annotators},
  author={Tzu-Heng Huang and Catherine Cao and Vaishnavi Bhargava and Frederic Sala},
  booktitle={The Thirty-eighth Annual Conference on Neural Information Processing Systems},
  year={2024},
  url={https://openreview.net/forum?id=T0glCBw28a}
}