Large pretrained models can be used as annotators, helping to replace or augment crowdworkers and enabling the distillation of generalist models into smaller specialist models. Unfortunately, this comes at a cost: employing top-of-the-line models often requires paying thousands of dollars for API calls, and the resulting datasets are static and challenging to audit. To address these challenges, we propose a simple alternative: rather than directly querying pretrained models for labels, we task them with generating programs that can produce labels. These programs can be stored and applied locally, re-used and extended, and cost orders of magnitude less. Our system, Alchemist, obtains performance comparable to or better than large language model-based annotation on a range of tasks, for a fraction of the cost: on average, it improves performance by 12.9% while reducing total labeling costs across all datasets by a factor of roughly 500.
LLM-based annotators offer efficiency over costly human labelers, but they come with several challenges: API costs that scale with dataset size, datasets that are static rather than re-usable and extensible, and labels that are difficult to audit.
We address these with a new approach! Instead of directly prompting for labels, we ask LLMs to generate just a few programs that act as annotators. These programs either label data directly or label a training dataset used to distill a smaller specialist model.
Generated programs allow unlimited predictions to be made locally at nearly zero cost: the number of API calls no longer scales with dataset size. Better still, the code can be inspected and corrected by subject-matter experts (SMEs), making it easy to adapt to changes in your labeling rules.
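To make the idea concrete, here is a minimal sketch of how a few generated programs can annotate a dataset locally. The two keyword heuristics below stand in for LLM-generated code, and the simple majority-vote aggregation is an illustrative assumption, not the paper's exact combination method:

```python
# Hypothetical sketch: instead of one API call per example, a handful of
# LLM-generated programs label the entire dataset locally for free.
# program_1 / program_2 are stand-ins for code an LLM might emit.

def program_1(text: str) -> int:
    # Illustrative heuristic: positive keywords -> label 1
    return 1 if any(w in text.lower() for w in ("great", "love", "excellent")) else 0

def program_2(text: str) -> int:
    # A second heuristic: negative keywords -> label 0
    return 0 if any(w in text.lower() for w in ("terrible", "awful", "hate")) else 1

def annotate(dataset):
    """Apply every program locally and combine outputs by majority vote."""
    programs = [program_1, program_2]
    labels = []
    for text in dataset:
        votes = [p(text) for p in programs]
        labels.append(1 if sum(votes) >= len(votes) / 2 else 0)
    return labels

reviews = ["I love this movie", "Terrible plot, awful acting"]
print(annotate(reviews))  # -> [1, 0]
```

Because the programs are plain code, an SME can read `program_1`, spot a bad rule, and edit it directly; no further API calls are needed no matter how large `dataset` grows.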
With Alchemist, using just 10 API calls to generate programs improves accuracy on five out of eight datasets compared to zero-shot prompting, while reducing the expense of API calls by 500 times.
Still prompting ChatGPT for your labels over and over? Try generating labeling programs instead to save on project expenses!
@inproceedings{
huang2024the,
title={The {ALCHE}mist: Automated Labeling 500x {CHE}aper than {LLM} Data Annotators},
author={Tzu-Heng Huang and Catherine Cao and Vaishnavi Bhargava and Frederic Sala},
booktitle={The Thirty-eighth Annual Conference on Neural Information Processing Systems},
year={2024},
url={https://openreview.net/forum?id=T0glCBw28a}
}