Welcome on Board

Email
thuang273@wisc.edu
Location

Madison, Wisconsin

About me

Greetings! I'm your captain for this conversation, Tzu-Heng (Brian), a final-year CS Ph.D. student at the University of Wisconsin-Madison (UW-Madison), working with Frederic Sala in the Sprocket Lab. This summer, I interned in Meta GenAI (now MSL), working on synthetic data generation. In the summer of 2024, I interned in Apple, working on data curation, data mixing for large-scale multimodal models, advising by Javier Movellan and Manjot Bilkhu. Before joining UW-Madison, I earned my B.S. degree in CS from National Chengchi University (NCCU), where I was fortunately advised by Man-Kwan Shan (NCCU) and Ling-Jyh Chen (Academia Sinica) in the realm of spatio-temporal machine learning and large-scale sensor networks. In 2019, I interned in Argonne National Laboratory in the team, Array of Things, working with Charlie Catlett and Rajesh Sankaran.

I am passionate about advancing machine learning to empower models to learn more with less supervision. I am focusing on Data-centric AI, particularly in designing methods for multimodal data selection, universal data mixing, zero-cost labeling systems, and LLM verification through synthesized programs. These strategies are rooted in weak supervision frameworks to build foundation models with fewer human annotations. Additionally, I am developing a new notion, parameter marketplace, to accelerate training while monetizing parameters as a second profit center.

Recent News

Paper Accepcted by NeurIPS'25

Sep. 2025

"Shrinking the Generation-Verification Gap with Weak Verifiers"

Paper Abstract: Verifiers can improve language model capabilities by scoring and ranking responses from generated candidates. Currently, high-quality verifiers are either unscalable (e.g., humans) or limited in utility (e.g., tools like Lean). While LM judges and reward models have become broadly useful as general-purpose verifiers, a significant performance gap remains between them and oracle verifiers (verifiers with perfect accuracy). To help close this gap, we introduce Weaver, a framework for designing a strong verifier by combining multiple weak, imperfect verifiers. We find weighted ensembles of verifiers, which typically require learning from labeled data, significantly outperform unweighted combinations due to differences in verifier accuracies. To reduce dependency on labeled data, Weaver leverages weak supervision to estimate each verifier's accuracy and combines outputs into a unified score that better reflects true response quality. However, directly applying weak supervision algorithms poses challenges, including inconsistent verifier output formats and handling low-quality verifiers. Weaver addresses these using dataset statistics to normalize outputs and filter specific verifiers. We study Weaver's effectiveness in test-time repeated sampling, where a model generates multiple candidate responses and selects one. Our evaluations show Weaver significantly improves over Pass@1-performance when selecting the first candidate-across reasoning and math tasks, achieving o3-mini-level accuracy with Llama 3.3 70B Instruct as generator, and an ensemble of 70B or smaller judge and reward models as verifiers (87.7% average). This gain mirrors the jump between GPT-4o and o3-mini (69.0% vs. 86.7%), which required extensive finetuning and post-training. To reduce computational costs of verifier ensembles, we train a 400M cross-encoder using Weaver's combined output scores.
New Preprint for LLM evaluation

Jun. 2025

"Time To Impeach LLM-as-a-Judge: Programs are the Future of Evaluation"

Paper Abstract: Large language models (LLMs) are widely used to evaluate the quality of LLM generations and responses, but this leads to significant challenges: high API costs, uncertain reliability, inflexible pipelines, and inherent biases. To address these, we introduce PAJAMA (Program-As-a-Judge for Automated Model Assessment), a new alternative that uses LLMs to synthesize executable judging programs instead of directly scoring responses. These synthesized programs can be stored and run locally, costing orders of magnitude less while providing interpretable, and auditable judging logic that can be easily adapted. Program-based judges mitigate biases, improving judgment consistency by 15.83% and reducing biased responses by 23.7% on average compared to a Qwen2.5-14B-based LLM-as-a-judge. When program judgments are distilled into a model, PAJAMA outperforms LLM-as-a-judge on the challenging CHAT-HARD subset of RewardBench, outperforming metrics by 2.19% on Prometheus and 8.67% on the JudgeLM dataset, all at three orders of magnitude lower cost.
ICML'25 Workshop Oral Paper

Jun. 2025

Our work, "Evaluating Sample Utility For Efficient Data Selection by Mimicking Model Weights", has been selected as an oral paper in ICML'25 DataWorld workshop. We introduce a new data utility metric and data selection framework for multimodal models. See you in Vancouver!
Interning in Meta AI

May. 2025

This summer, I join Meta as a research intern, working with David Kant, Yiting Lu, Sang Michael Xie and Ernie Chang on synthetic data generation for multimodal models.
Passed Ph.D. preliminary exam

Apr. 2025

Passed my preliminary exam! The talk is titled “Data Recipes: Automated Labeling and Efficient Selection.” Many thanks to professors in my committee---Frederic Sala, Ramya Vinayak, and Kirthi Kandasamy.
New Preprint for Online Data Mixing

Mar. 2025

"R&B: Domain Regrouping and Data Mixture Balancing for Efficient Foundation Model Training"

Paper Abstract: Data mixing strategies have successfully reduced the costs involved in training language models. While promising, such methods suffer from two flaws. First, they rely on predetermined data domains (e.g., data sources, task types), which may fail to capture critical semantic nuances, leaving performance on the table. Second, these methods scale with the number of domains in a computationally prohibitive way. We address these challenges via R&B, a framework that re-partitions training data based on semantic similarity to create finer-grained domains. and efficiently optimizes the data composition by leveraging a Gram matrix induced by domain gradients obtained throughout training. Unlike prior works, it removes the need for additional compute to obtain evaluation information such as losses or gradients. We analyze this technique under standard regularity conditions and provide theoretical insights that justify R&B's effectiveness compared to non-adaptive mixing approaches. Empirically, we demonstrate the effectiveness of R&B on five diverse datasets ranging from natural language to reasoning and multimodal tasks. With as little as 0.01% additional compute overhead, R&B matches or exceeds the performance of state-of-the-art data mixing strategies.
New Preprint for Data Selection

Nov. 2024

"Evaluating Sample Utility for Data Selection by Mimicking Model Weights"

Paper Abstract: Foundation models rely on large-scale web-crawled datasets, which frequently contain noisy data, biases, and irrelevant content. Existing data selection techniques typically use human heuristics, downstream evaluation datasets, or specialized scoring models, and can overlook samples' utility in the training process. Instead, we propose a new approach, Mimic Score, a data quality metric that uses a pretrained reference model as a guide to assess the usefulness of data samples for training a new model. It relies on the alignment between the gradient of the new model parameters and the vector pointing toward the reference model in weight space. Samples that misalign with this direction are considered low-value and can be filtered out. Motivated by the Mimic score, we develop Grad-Mimic, a data selection framework that identifies and prioritizes useful samples, automating the selection process to create effective filters. Empirically, using Mimic scores to guide model training results in consistent performance gains across six image datasets and enhances the performance of CLIP models. Moreover, Mimic scores and their associated filters improve upon existing filtering methods and offer accurate estimation of dataset quality.
Attending NeurIPS'24 in Vancouver!

Dec. 2024

Heading to Vancouver to present Alchemist! Welcome to chat more me about automated data labeling, data selection, and more!"
NeurIPS'24 Spotlight Paper

Sep. 2024

Alchemist has been accepcted in NeurIPS as a spotlight paper.
Paper Accepcted by NeurIPS'24

Sep. 2024

"The ALCHEmist: Automated Labeling 500x CHEaper Than LLM Data Annotators"

Paper Abstract: Large pretrained models can be used as annotators, helping replace or augment crowdworkers and enabling distilling generalist models into smaller specialist models. Unfortunately, this comes at a cost: employing top-of-the-line models often requires paying thousands of dollars for API calls, while the resulting datasets are static and challenging to audit. To address these challenges, we propose a simple alternative: rather than directly querying labels from pretrained models, we task models to generate programs that can produce labels. These programs can be stored and applied locally, re-used and extended, and cost orders of magnitude less. Our system, Alchemist, obtains comparable to or better performance than large language model-based annotation in a range of tasks for a fraction of the cost: on average, improvements amount to a 12.9% enhancement while the total labeling costs across all datasets are reduced by a factor of approximately 500×.
Interning in Apple

May. 2024

This summer, I join Apple as a AIML research intern, working with Javier Movellan and Manjot Bilkhu on data-centric AI for multimodal models.
Paper Accepted by NeurIPS'23

Sep. 2023

"Train 'n Trade: Foundations of Parameter Markets"

Paper Abstract: Organizations typically train large models individually. This is costly and time-consuming, particularly for large-scale foundation models. Such vertical production is known to be suboptimal. Inspired by this economic insight, we ask whether it is possible to leverage others' expertise by trading the constituent parts in models, i.e., sets of weights, as if they were market commodities. While recent advances in aligning and interpolating models suggest that doing so may be possible, a number of fundamental questions must be answered to create viable parameter markets. In this work, we address these basic questions, propose a framework containing the infrastructure necessary for market operations to take place, study strategies for exchanging parameters, and offer means for agents to monetize parameters. Excitingly, compared to agents who train siloed models from scratch, we show that it is possible to mutually gain by using the market, even in competitive settings. This suggests that the notion of parameter markets may be a useful paradigm for improving large-scale model training in the future.
Paper Accepted by NeurIPS'23

Sep. 2023

"Geometry-Aware Adaptation for Pretrained Models"

Paper Abstract: Machine learning models---including prominent zero-shot models---are often trained on datasets whose labels are only a small proportion of a larger label space. Such spaces are commonly equipped with a metric that relates the labels via distances between them. We propose a simple approach to exploit this information to adapt the trained model to reliably predict new classes---or, in the case of zero-shot prediction, to improve its performance---without any additional training. Our technique is a drop-in replacement of the standard prediction rule, swapping with the Fréchet mean. We provide a comprehensive theoretical analysis for this approach, studying (i) learning-theoretic results trading off label space diameter, sample complexity, and model dimension, (ii) characterizations of the full range of scenarios in which it is possible to predict any unobserved class, and (iii) an optimal active learning-like next class selection procedure to obtain optimal training classes for when it is not possible to predict the entire range of unobserved classes. Empirically, using easily-available external metrics, our proposed approach, Loki, gains up to 29.7% relative improvement over SimCLR on ImageNet and scales to hundreds of thousands of classes. When no such metric is available, Loki can use self-derived metrics from class embeddings and obtains a 10.5% improvement on pretrained zero-shot models such as CLIP.
Paper Accepted by ICCV'23 Datacomp Workshop

Sep. 2023

"Multimodal Data Curation via Object Detection and Filter Ensembles"

Paper Abstract: We propose an approach for curating multimodal data that we used for our entry in the 2023 DataComp competition filtering track. Our technique combines object detection and weak supervision-based ensembling. In the first of two steps in our approach, we employ an out-of-the-box zero-shot object detection model to extract granular information and produce a variety of filter designs. In the second step, we employ weak supervision to ensemble filtering rules. This approach results in a 4% performance improvement when compared to the best-performing baseline, producing the top-ranking position in the small scale track at the time of writing. Furthermore, in the medium scale track, we achieve a noteworthy 4.2% improvement over the baseline by simply ensembling existing baselines with weak supervision.
Rank #1 in the Datacomp'23 competition

Aug. 2023

Top-ranking position in the ICCV Datacomp'23 competition (small-scale filtering track).
Paper Accepted by ICLR'23 DL4C Workshop

Mar. 2023

"ScriptoriumWS: A Code Generation Assistant for Weak Supervision"

Paper Abstract: Weak supervision is a popular framework for overcoming the labeled data bottleneck: the need to obtain labels for training data. In weak supervision, multiple noisy-but-cheap sources are used to provide guesses of the label and are aggregated to produce high-quality pseudolabels. These sources are often expressed as small programs written by domain experts—and so are expensive to obtain. Instead, we argue for using code-generation models to act as coding assistants for crafting weak supervision sources. We study prompting strategies to maximize the quality of the generated sources, settling on a multi-tier strategy that incorporates multiple types of information. We explore how to best combine hand-written and generated sources. Using these insights, we introduce ScriptoriumWS, a weak supervision system that, when compared to hand-crafted sources, maintains accuracy and greatly improves coverage.
Paper Accepted by NeurIPS'22

Sep. 2022

"AutoWS-Bench-101: Benchmarking Automated Weak Supervision with 100 Labels".

Paper Abstract: Weak supervision (WS) is a powerful method to build labeled datasets for training supervised models in the face of little-to-no labeled data. It replaces hand-labeling data with aggregating multiple noisy-but-cheap label estimates expressed by labeling functions (LFs). While it has been used successfully in many domains, weak supervision's application scope is limited by the difficulty of constructing labeling functions for domains with complex or high-dimensional features. To address this, a handful of methods have proposed automating the LF design process using a small set of ground truth labels. In this work, we introduce AutoWS-Bench-101: a framework for evaluating automated WS (AutoWS) techniques in challenging WS settings---a set of diverse application domains on which it has been previously difficult or impossible to apply traditional WS techniques.
President of SAT at UW-Madison

May. 2022

Our main mission is to bring taiwanese students together from all fields of study for recreational, academic, and cultural purposes.
Joining CS Dept at UW-Madison

Aug. 2021

Here is the start of my Ph.D. journey. Mamba mentality always.

Experience

Resume [PDF]

Education

University of Wisconsin, Madison (UW-Madison)
Aug. 2021 — Present (final-year)
Ph.D. in Computer Science.
National Chengchi University (NCCU)
Sep. 2016 — Jul. 2020
B.S. in Computer Science.

Publications and Preprints

Time to Impeach LLM-as-a-Judge: Programs are the Future of Evaluation
ICML'25 Programmatic Representations for Agent Learning (PRAL) Workshop
Tzu-Heng Huang, Harit Vishwakarma, Frederic Sala
[PDF]
Shrinking the Generation-Verification Gap by Scaling Compute for Verification
NeurIPS'25 & ICML'25 Efficient Systems for Foundation Models (ES-FoMo III) Workshop & ICML'25 Multi-Agent Systems in the Era of Foundation Models: Opportunities, Challenges and Futures (MAS) Workshop
Jon Saad-Falcon, E. Kelly Buchanan, Mayee F Chen, Tzu-Heng Huang, Brendan McLaughlin, Tanvir Bhathal, Shang Zhu, Ben Athiwaratkun, Frederic Sala, Scott Linderman, Azalia Mirhoseini, Christopher Re
[PDF]
From Many Voices to One: A Statistically Principled Aggregation of LLM Judges
NeurIPS'25 Evaluating the Evolving LLM Lifecycle: Benchmarks, Emergent Abilities, and Scaling Workshop & NeurIPS'25 Reliable ML from Unreliable Data Workshop
Jitian Zhao, Changho Shin, Tzu-Heng Huang, Satya Sai Srinath Namburi GNVV, Frederic Sala
R&B: Domain Regrouping and Data Mixture Balancing for Efficient Foundation Model Training
ICML'25 Unifying Data Curation Frameworks Across Domains (DataWorld) Workshop & ICML'25 Data in Generative Models (The Bad, the Ugly, and the Greats) (DIG-BUGS) Workshop
Albert Ge, Tzu-Heng Huang, John Cooper, Avi Trost, Ziyi Chu, Satya Sai Srinath Namburi GNVV, Ziyang Cai, Kendall Park, Nicholas Roberts, Frederic Sala
[PDF]
Evaluating Sample Utility For Efficient Data Selection by Mimicking Model Weights
ICML'25 Unifying Data Curation Frameworks Across Domains (DataWorld) Workshop (Oral)
Tzu-Heng Huang, Manjot Bilkhu, John Cooper, Frederic Sala, Javier Movellan
[PDF]
The ALCHEmist: Automated Labeling 500x CHEaper Than LLM Data Annotators
NeurIPS'24 (Spotlight)
Tzu-Heng Huang, Catherine Cao, Vaishnavi Bhargava, Frederic Sala
[PDF]
MoRe Fine-Tuning with 10x Fewer Parameters
ICML'24 Efficient Systems for Foundation Models (ES-FoMo) Workshop & ICML'24 Foundation Models in the Wild Workshop
Wenxuan Tan, Nicholas Roberts, Tzu-Heng Huang, Jitian Zhao, John Cooper, Samuel Guo, Chengyu Duan, Frederic Sala
[PDF]
Train 'n Trade: Foundations of Parameter Markets
NeurIPS'23
Tzu-Heng Huang, Harit Vishwakarma, Frederic Sala
[PDF]
Geometry-Aware Adaptation for Pretrained Models
NeurIPS'23
Nicholas Roberts, Xintong Li, Dyah Adila, Sonia Cromp, Tzu-Heng Huang, Jitian Zhao, Frederic Sala
[PDF]
Multimodal Data Curation via Object Detection and Filter Ensembles
ICCV'23 Towards the Next Generation of Computer Vision Datasets (TNGCV) Workshop 1st place on the Datacomp leaderboard (small-scale filtering track)
Tzu-Heng Huang*, Changho Shin*, Sui Jiet Tay, Dyah Adila, Frederic Sala
[PDF]
ScriptoriumWS: A Code Generation Assistant for Weak Supervision
ICLR'23 Deep Learning for Code (DL4C) Workshop & 2023 Midwest Machine Learning Symposium
Tzu-Heng Huang, Catherine Cao, Spencer Schoenberg, Harit Vishwakarma, Nicholas Roberts, Frederic Sala
[PDF]
AutoWS-Bench-101: Benchmarking Automated Weak Supervision with 100 Labels
NeurIPS'22
Nicholas Roberts, Xintong Li, Tzu-Heng Huang, Dyah Adila, Spencer Schoenberg, Cheng-Yu Liu, Lauren Pick, Haotian Ma, Aws Albarghouthi, Frederic Sala
[PDF]
Key Sensor Discovery for Quality Audit of Air Sensor Networks
MobiSys'20
Tzu-Heng Huang, Cheng-Hsien Tsai, Man-Kwan Shan
[PDF]

Experience

Research Intern, Meta GenAI (now MSL)
May. 2025 — Sep. 2025, advised by Ernie Chang, Sang Michael Xie, Yiting Lu, and David Kant
AIML Research Intern, Apple
May. 2024 — Dec. 2024, advised by Javier Movellan and Manjot Bilkhu
CEO & Co-founder, Awan.AI LLC
May. 2023 — Apr. 2024, established with Eric Lin and Jet Lin
Graduate Research Student, University of Wisconsin-Madison
Feb. 2022 — Present, advised by Frederic Sala
Research Intern, Argonne National Laboratory
Jun. 2019 — Sep. 2019, advised by Charlie Catlett and Rajesh Sankaran
Research Assistant, National Chengchi University
Sep. 2018 — Aug. 2021, advised by Man-Kwan Shan
Research Intern, Academia Sinica
Feb. 2018 — Jul. 2020, advised by Ling-Jyh Chen
Research Assistant, National Chengchi University
Jul. 2017 — Jul. 2020, advised by Changya Hu