About me
Greetings! I'm Tzu-Heng (Brian), a final-year CS Ph.D. student at the University of Wisconsin-Madison (UW-Madison), working with Frederic Sala in the Sprocket Lab.
I have been fortunate to intern across both industry and national labs. Most recently, I interned at Apple (second time), where I worked on automated verifiers for open-ended tasks in visual reinforcement learning. Previously, I interned at Meta GenAI (now MSL), focusing on synthetic data generation, and earlier at Apple on data curation and data mixing strategies for multimodal pretraining, under the guidance of Javier Movellan and Manjot Bilkhu. In 2019, I was a research intern at Argonne National Laboratory with the Array of Things team, working with Charlie Catlett and Rajesh Sankaran on large-scale urban sensing systems.
Before joining UW-Madison, I earned my B.S. in CS from National Chengchi University (NCCU), where I was advised by Man-Kwan Shan and Ling-Jyh Chen. My early research focused on spatio-temporal machine learning and large-scale sensor networks.
My research focuses on data-centric AI for multimodal models, with the goal of enabling systems to learn more from less, higher-quality supervision. I have worked on projects across the data lifecycle, including multimodal data selection, universal data mixing, zero-cost labeling systems, and LLM verification through synthetic programs and rubrics. These efforts are grounded in weak supervision frameworks, aiming to build foundation models with fewer human annotations. Additionally, I am exploring a new notion, the parameter marketplace, to accelerate training while monetizing parameters as a second profit center.
Recent News
-
Interning at Apple
Oct. 2025
This winter, I join Apple again as a research intern, working with Javier Movellan and Manjot Bilkhu on automated verifiers for open-ended tasks in visual reinforcement learning.
-
Paper Accepted by NeurIPS'25
Sep. 2025
"Shrinking the Generation-Verification Gap with Weak Verifiers"
Paper Abstract: Verifiers can improve language model capabilities by scoring and ranking responses from generated candidates. Currently, high-quality verifiers are either unscalable (e.g., humans) or limited in utility (e.g., tools like Lean). While LM judges and reward models have become broadly useful as general-purpose verifiers, a significant performance gap remains between them and oracle verifiers (verifiers with perfect accuracy). To help close this gap, we introduce Weaver, a framework for designing a strong verifier by combining multiple weak, imperfect verifiers. We find weighted ensembles of verifiers, which typically require learning from labeled data, significantly outperform unweighted combinations due to differences in verifier accuracies. To reduce dependency on labeled data, Weaver leverages weak supervision to estimate each verifier's accuracy and combines outputs into a unified score that better reflects true response quality. However, directly applying weak supervision algorithms poses challenges, including inconsistent verifier output formats and handling low-quality verifiers. Weaver addresses these using dataset statistics to normalize outputs and filter specific verifiers. We study Weaver's effectiveness in test-time repeated sampling, where a model generates multiple candidate responses and selects one. Our evaluations show Weaver significantly improves over Pass@1 (performance when selecting the first candidate) across reasoning and math tasks, achieving o3-mini-level accuracy with Llama 3.3 70B Instruct as generator, and an ensemble of 70B or smaller judge and reward models as verifiers (87.7% average). This gain mirrors the jump between GPT-4o and o3-mini (69.0% vs. 86.7%), which required extensive finetuning and post-training. To reduce computational costs of verifier ensembles, we train a 400M cross-encoder using Weaver's combined output scores.
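As a rough illustration of the weighted-ensemble idea in the abstract, here is a minimal Python sketch. It assumes each verifier emits a binary vote and that per-verifier accuracies have already been estimated (e.g., by a weak supervision label model); the log-odds weighting is a standard Naive-Bayes-style choice for illustration, not Weaver's exact procedure.

```python
import math

def weighted_score(votes, accuracies):
    """Combine binary verifier votes into one score, weighting each
    verifier by the log-odds of its (estimated) accuracy, so more
    accurate verifiers count for more than coin-flip ones."""
    score = 0.0
    for vote, acc in zip(votes, accuracies):
        w = math.log(acc / (1.0 - acc))     # larger weight for accurate verifiers
        score += w if vote == 1 else -w     # vote for (+) or against (-) the response
    return score

def select_best(candidates, votes_per_candidate, accuracies):
    """Repeated-sampling selection: pick the candidate response whose
    combined verifier score is highest."""
    scores = [weighted_score(v, accuracies) for v in votes_per_candidate]
    return candidates[max(range(len(scores)), key=scores.__getitem__)]
```

An unweighted majority vote would treat a 90%-accurate verifier and a 55%-accurate one identically; the log-odds weights are what let the strong verifier dominate the ensemble.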
-
New Preprint for LLM Evaluation
Jun. 2025
"Time To Impeach LLM-as-a-Judge: Programs are the Future of Evaluation"
Paper Abstract: Large language models (LLMs) are widely used to evaluate the quality of LLM generations and responses, but this leads to significant challenges: high API costs, uncertain reliability, inflexible pipelines, and inherent biases. To address these, we introduce PAJAMA (Program-As-a-Judge for Automated Model Assessment), a new alternative that uses LLMs to synthesize executable judging programs instead of directly scoring responses. These synthesized programs can be stored and run locally, costing orders of magnitude less while providing interpretable and auditable judging logic that can be easily adapted. Program-based judges mitigate biases, improving judgment consistency by 15.83% and reducing biased responses by 23.7% on average compared to a Qwen2.5-14B-based LLM-as-a-judge. When program judgments are distilled into a model, PAJAMA outperforms LLM-as-a-judge on the challenging CHAT-HARD subset of RewardBench, outperforming metrics by 2.19% on Prometheus and 8.67% on the JudgeLM dataset, all at three orders of magnitude lower cost.
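To make the program-as-judge idea concrete, here is a hypothetical example of the kind of small, executable judging program an LLM might synthesize. The criteria and weights below are invented for illustration and are not from the paper; the point is that such checks are cheap, local, and fully auditable, unlike another API call to an LLM judge.

```python
def judge_math_response(question: str, response: str) -> float:
    """A toy synthesized judging program: score a math answer with
    cheap, inspectable checks instead of an LLM call. All criteria
    and weights here are hypothetical."""
    score = 0.0
    if "\\boxed" in response or "answer:" in response.lower():
        score += 0.5                      # final answer is clearly marked
    if any(tok in response.lower() for tok in ("because", "therefore", "so ")):
        score += 0.3                      # some reasoning is shown
    if len(response.split()) >= 20:
        score += 0.2                      # not a bare guess
    return score
```

Because the judging logic is plain code, a reviewer can read exactly why a response scored as it did, and the program can be rerun on millions of responses at essentially zero cost.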
-
ICML'25 Workshop Oral Paper
Jun. 2025
Our work, "Evaluating Sample Utility For Efficient Data Selection by Mimicking Model Weights", has been selected for an oral presentation at the ICML'25 DataWorld workshop. We introduce a new data utility metric and data selection framework for multimodal models. See you in Vancouver!
-
Interning at Meta AI
May. 2025
This summer, I join Meta as a research intern, working with David Kant, Yiting Lu, Sang Michael Xie and Ernie Chang on synthetic data generation for multimodal models.
-
Passed Ph.D. preliminary exam
Apr. 2025
Passed my preliminary exam! The talk is titled "Data Recipes: Automated Labeling and Efficient Selection." Many thanks to the professors on my committee: Frederic Sala, Ramya Vinayak, and Kirthi Kandasamy.
-
New Preprint for Online Data Mixing
Mar. 2025
"R&B: Domain Regrouping and Data Mixture Balancing for Efficient Foundation Model Training"
Paper Abstract: Data mixing strategies have successfully reduced the costs involved in training language models. While promising, such methods suffer from two flaws. First, they rely on predetermined data domains (e.g., data sources, task types), which may fail to capture critical semantic nuances, leaving performance on the table. Second, these methods scale with the number of domains in a computationally prohibitive way. We address these challenges via R&B, a framework that re-partitions training data based on semantic similarity to create finer-grained domains, and efficiently optimizes the data composition by leveraging a Gram matrix induced by domain gradients obtained throughout training. Unlike prior works, it removes the need for additional compute to obtain evaluation information such as losses or gradients. We analyze this technique under standard regularity conditions and provide theoretical insights that justify R&B's effectiveness compared to non-adaptive mixing approaches. Empirically, we demonstrate the effectiveness of R&B on five diverse datasets ranging from natural language to reasoning and multimodal tasks. With as little as 0.01% additional compute overhead, R&B matches or exceeds the performance of state-of-the-art data mixing strategies.
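A minimal sketch of the Gram-matrix intuition from the abstract: form pairwise inner products of per-domain gradients that are already available from training, and upweight domains whose gradients align with the others. The softmax-over-row-sums reweighting below is a simplification for illustration, not R&B's exact objective.

```python
import numpy as np

def mixture_from_gradients(domain_grads, temperature=1.0):
    """Toy gradient-based mixture reweighting. domain_grads is an
    (n_domains, n_params) array of averaged per-domain gradients,
    reused from training rather than recomputed."""
    G = domain_grads @ domain_grads.T       # Gram matrix: G[i, j] = <g_i, g_j>
    benefit = G.sum(axis=1)                 # how much domain i aligns with all domains
    w = np.exp(benefit / temperature)
    return w / w.sum()                      # normalized mixture weights
```

Because the Gram matrix only needs gradients the optimizer already computes, this style of reweighting adds essentially no extra forward/backward passes, which is the efficiency point the abstract makes.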
-
New Preprint for Data Selection
Nov. 2024
"Evaluating Sample Utility for Data Selection by Mimicking Model Weights"
Paper Abstract: Foundation models rely on large-scale web-crawled datasets, which frequently contain noisy data, biases, and irrelevant content. Existing data selection techniques typically use human heuristics, downstream evaluation datasets, or specialized scoring models, and can overlook samples' utility in the training process. Instead, we propose a new approach, Mimic Score, a data quality metric that uses a pretrained reference model as a guide to assess the usefulness of data samples for training a new model. It relies on the alignment between the gradient of the new model parameters and the vector pointing toward the reference model in weight space. Samples that misalign with this direction are considered low-value and can be filtered out. Motivated by the Mimic score, we develop Grad-Mimic, a data selection framework that identifies and prioritizes useful samples, automating the selection process to create effective filters. Empirically, using Mimic scores to guide model training results in consistent performance gains across six image datasets and enhances the performance of CLIP models. Moreover, Mimic scores and their associated filters improve upon existing filtering methods and offer accurate estimation of dataset quality.
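The alignment idea behind the Mimic Score can be sketched in a few lines. This toy version scores a single flattened per-sample gradient against the direction from the current weights toward a reference model, glossing over how per-sample gradients are obtained in practice.

```python
import numpy as np

def mimic_score(grad, theta_new, theta_ref):
    """Toy sketch of the Mimic Score idea: how well does the update a
    sample would induce (the negative gradient) align with the vector
    pointing from the new model toward the reference model in weight
    space? Low or negative scores mark candidate samples to filter."""
    direction = theta_ref - theta_new            # vector toward the reference model
    step = -grad                                 # update direction the sample induces
    denom = np.linalg.norm(step) * np.linalg.norm(direction) + 1e-12
    return float(step @ direction / denom)       # cosine-style alignment in [-1, 1]
```

A sample whose gradient pushes the new model toward the reference scores near +1; one that pushes away scores near -1 and would be treated as low-value by a Grad-Mimic-style filter.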
-
Attending NeurIPS'24 in Vancouver!
Dec. 2024
Heading to Vancouver to present Alchemist! Welcome to chat with me about automated data labeling, data selection, and more!
-
NeurIPS'24 Spotlight Paper
Sep. 2024
Alchemist has been accepted at NeurIPS'24 as a spotlight paper.
-
Paper Accepted by NeurIPS'24
Sep. 2024
"The ALCHEmist: Automated Labeling 500x CHEaper Than LLM Data Annotators"
Paper Abstract: Large pretrained models can be used as annotators, helping replace or augment crowdworkers and enabling distilling generalist models into smaller specialist models. Unfortunately, this comes at a cost: employing top-of-the-line models often requires paying thousands of dollars for API calls, while the resulting datasets are static and challenging to audit. To address these challenges, we propose a simple alternative: rather than directly querying labels from pretrained models, we task models to generate programs that can produce labels. These programs can be stored and applied locally, re-used and extended, and cost orders of magnitude less. Our system, Alchemist, obtains comparable to or better performance than large language model-based annotation in a range of tasks for a fraction of the cost: on average, improvements amount to a 12.9% enhancement while the total labeling costs across all datasets are reduced by a factor of approximately 500×.
-
Interning at Apple
May. 2024
This summer, I join Apple as an AIML research intern, working with Javier Movellan and Manjot Bilkhu on data-centric AI for multimodal models.
-
Paper Accepted by NeurIPS'23
Sep. 2023
"Train 'n Trade: Foundations of Parameter Markets"
Paper Abstract: Organizations typically train large models individually. This is costly and time-consuming, particularly for large-scale foundation models. Such vertical production is known to be suboptimal. Inspired by this economic insight, we ask whether it is possible to leverage others' expertise by trading the constituent parts in models, i.e., sets of weights, as if they were market commodities. While recent advances in aligning and interpolating models suggest that doing so may be possible, a number of fundamental questions must be answered to create viable parameter markets. In this work, we address these basic questions, propose a framework containing the infrastructure necessary for market operations to take place, study strategies for exchanging parameters, and offer means for agents to monetize parameters. Excitingly, compared to agents who train siloed models from scratch, we show that it is possible to mutually gain by using the market, even in competitive settings. This suggests that the notion of parameter markets may be a useful paradigm for improving large-scale model training in the future.
-
Paper Accepted by NeurIPS'23
Sep. 2023
"Geometry-Aware Adaptation for Pretrained Models"
Paper Abstract: Machine learning models---including prominent zero-shot models---are often trained on datasets whose labels are only a small proportion of a larger label space. Such spaces are commonly equipped with a metric that relates the labels via distances between them. We propose a simple approach to exploit this information to adapt the trained model to reliably predict new classes---or, in the case of zero-shot prediction, to improve its performance---without any additional training. Our technique is a drop-in replacement of the standard prediction rule, swapping with the Fréchet mean. We provide a comprehensive theoretical analysis for this approach, studying (i) learning-theoretic results trading off label space diameter, sample complexity, and model dimension, (ii) characterizations of the full range of scenarios in which it is possible to predict any unobserved class, and (iii) an optimal active learning-like next class selection procedure to obtain optimal training classes for when it is not possible to predict the entire range of unobserved classes. Empirically, using easily-available external metrics, our proposed approach, Loki, gains up to 29.7% relative improvement over SimCLR on ImageNet and scales to hundreds of thousands of classes. When no such metric is available, Loki can use self-derived metrics from class embeddings and obtains a 10.5% improvement on pretrained zero-shot models such as CLIP.
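The drop-in prediction rule from the abstract can be sketched compactly. Instead of an argmax over observed-class probabilities, the Loki-style rule below returns the label (possibly one never seen in training) minimizing the expected squared distance under the model's probabilities, i.e., a discrete Fréchet mean in the label metric. Array shapes and names here are illustrative simplifications.

```python
import numpy as np

def loki_predict(probs, dist):
    """Toy Frechet-mean prediction rule over a metric label space.

    probs: (n_observed,) model probabilities over observed classes
    dist:  (n_all, n_observed) distances from every label in the full
           label space to each observed label
    Returns the index (in the full space) minimizing expected squared
    distance, so unobserved labels can be predicted without retraining.
    """
    expected = (dist ** 2) @ probs      # expected squared distance per candidate label
    return int(np.argmin(expected))
```

With labels 0, 1, 2 on a line and only 0 and 2 observed, a model split 50/50 between the two observed classes yields the unobserved middle label 1, which a plain argmax could never produce.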
-
Paper Accepted by ICCV'23 Datacomp Workshop
Sep. 2023
"Multimodal Data Curation via Object Detection and Filter Ensembles"
Paper Abstract: We propose an approach for curating multimodal data that we used for our entry in the 2023 DataComp competition filtering track. Our technique combines object detection and weak supervision-based ensembling. In the first of two steps in our approach, we employ an out-of-the-box zero-shot object detection model to extract granular information and produce a variety of filter designs. In the second step, we employ weak supervision to ensemble filtering rules. This approach results in a 4% performance improvement when compared to the best-performing baseline, producing the top-ranking position in the small scale track at the time of writing. Furthermore, in the medium scale track, we achieve a noteworthy 4.2% improvement over the baseline by simply ensembling existing baselines with weak supervision.
-
Rank #1 in the Datacomp'23 competition
Aug. 2023
Top-ranking position in the ICCV Datacomp'23 competition (small-scale filtering track).
-
Paper Accepted by ICLR'23 DL4C Workshop
Mar. 2023
"ScriptoriumWS: A Code Generation Assistant for Weak Supervision"
Paper Abstract: Weak supervision is a popular framework for overcoming the labeled data bottleneck: the need to obtain labels for training data. In weak supervision, multiple noisy-but-cheap sources are used to provide guesses of the label and are aggregated to produce high-quality pseudolabels. These sources are often expressed as small programs written by domain experts—and so are expensive to obtain. Instead, we argue for using code-generation models to act as coding assistants for crafting weak supervision sources. We study prompting strategies to maximize the quality of the generated sources, settling on a multi-tier strategy that incorporates multiple types of information. We explore how to best combine hand-written and generated sources. Using these insights, we introduce ScriptoriumWS, a weak supervision system that, when compared to hand-crafted sources, maintains accuracy and greatly improves coverage.
-
Paper Accepted by NeurIPS'22
Sep. 2022
"AutoWS-Bench-101: Benchmarking Automated Weak Supervision with 100 Labels".
Paper Abstract: Weak supervision (WS) is a powerful method to build labeled datasets for training supervised models in the face of little-to-no labeled data. It replaces hand-labeling data with aggregating multiple noisy-but-cheap label estimates expressed by labeling functions (LFs). While it has been used successfully in many domains, weak supervision's application scope is limited by the difficulty of constructing labeling functions for domains with complex or high-dimensional features. To address this, a handful of methods have proposed automating the LF design process using a small set of ground truth labels. In this work, we introduce AutoWS-Bench-101: a framework for evaluating automated WS (AutoWS) techniques in challenging WS settings---a set of diverse application domains on which it has been previously difficult or impossible to apply traditional WS techniques.
-
President of SAT at UW-Madison
May. 2022
Our main mission is to bring Taiwanese students together from all fields of study for recreational, academic, and cultural purposes.
-
Joining CS Dept at UW-Madison
Aug. 2021
Here is the start of my Ph.D. journey. Mamba mentality always.