Qingqing Cao

I am a research scientist in the System Intelligence and Machine Learning (SIML) group at Apple. My current focus is developing high-quality ML systems and applications that are optimized for scalability and efficiency. In the past, I built efficient and practical NLP systems for both edge devices and the cloud, such as on-device (visual) question answering and faster Transformer models.

Previously, I was a postdoc in the UW NLP group at the University of Washington, where I won the postdoc research award twice. I hold a Ph.D. in computer science from Stony Brook University. I received the Catacosinos Fellowship at Stony Brook University and was named a Rising Star in Data Science at the University of Chicago.

News

May 28, 2025	Thanks ICML 2025 for recognizing me as a top reviewer (1.9%)!
May 02, 2025	CtrlSynth got accepted to ICML 2025 as a poster paper!
May 01, 2025	Glad to serve as Senior Area Chair for EMNLP 2025!
Apr 11, 2025	Glad to serve as Area Chair for NeurIPS 2025!
Nov 02, 2024	Thanks NeurIPS for recognizing me as a top reviewer (8.6%)!

Recent publications

2024

ICML 2025
# CtrlSynth: Controllable Image Text Synthesis for Data-Efficient Multimodal Learning

Qingqing Cao, Mahyar Najibi, and Sachin Mehta

Oct 2024

Pretraining robust vision or multimodal foundation models (e.g., CLIP) relies on large-scale datasets that may be noisy, potentially misaligned, and have long-tail distributions. Previous works have shown promising results in augmenting datasets by generating synthetic samples. However, they only support domain-specific ad hoc use cases (e.g., either image or text only, but not both), and are limited in data diversity due to a lack of fine-grained control over the synthesis process. In this paper, we design a controllable image-text synthesis pipeline, CtrlSynth, for data-efficient and robust multimodal learning. The key idea is to decompose the visual semantics of an image into basic elements, apply user-specified control policies (e.g., remove, add, or replace operations), and recompose them to synthesize images or texts. The decompose and recompose feature in CtrlSynth allows users to control data synthesis in a fine-grained manner by defining customized control policies to manipulate the basic elements. CtrlSynth leverages the capabilities of pretrained foundation models such as large language models or diffusion models to reason and recompose basic elements such that synthetic samples are natural and composed in diverse ways. CtrlSynth is a closed-loop, training-free, and modular framework, making it easy to support different pretrained models. With extensive experiments on 31 datasets spanning different vision and vision-language tasks, we show that CtrlSynth substantially improves zero-shot classification, image-text retrieval, and compositional reasoning performance of CLIP models.
@misc{caoCtrlSynthControllableImage2024, title = {{CtrlSynth}: {Controllable} {Image} {Text} {Synthesis} for {Data}-{Efficient} {Multimodal} {Learning}}, shorttitle = {{CtrlSynth}}, doi = {10.48550/arXiv.2410.11963}, urldate = {2024-10-23}, publisher = {arXiv}, author = {Cao, Qingqing and Najibi, Mahyar and Mehta, Sachin}, month = oct, year = {2024}, }
SCOPE 2025
# KV Prediction for Improved Time to First Token

Maxwell Horton, Qingqing Cao, Chenfan Sun, Yanzi Jin, Sachin Mehta, Mohammad Rastegari, and Moin Nabi

Oct 2024

Inference with transformer-based language models begins with a prompt processing step. In this step, the model generates the first output token and stores the KV cache needed for future generation steps. This prompt processing step can be computationally expensive, taking tens of seconds or more for billion-parameter models on edge devices when prompt lengths or batch sizes rise. This degrades user experience by introducing significant latency into the model’s outputs. To reduce the time spent producing the first output (known as the “time to first token”, or TTFT) of a pretrained model, we introduce a novel method called KV Prediction. In our method, a small auxiliary model is used to process the prompt and produce an approximation of the KV cache used by a base model. This approximated KV cache is then used with the base model for autoregressive generation without the need to query the auxiliary model again. We demonstrate that our method produces a Pareto-optimal efficiency-accuracy trade-off when compared to baselines. On TriviaQA, we demonstrate relative accuracy improvements in the range of 15%-50% across a range of TTFT FLOPs budgets. We also demonstrate accuracy improvements of up to 30% on HumanEval Python code completion at fixed TTFT FLOPs budgets. Additionally, we benchmark models on an Apple M2 Pro CPU and demonstrate that our improvement in FLOPs translates to a TTFT speedup on hardware. We release our code at https://github.com/apple/corenet/tree/main/projects/kv-prediction.
@misc{hortonKVPredictionImproved2024, title = {{KV} {Prediction} for {Improved} {Time} to {First} {Token}}, doi = {10.48550/arXiv.2410.08391}, urldate = {2024-10-24}, publisher = {First Workshop on Scalable Optimization for Efficient and Adaptive Foundation Models @ ICLR 2025}, author = {Horton, Maxwell and Cao, Qingqing and Sun, Chenfan and Jin, Yanzi and Mehta, Sachin and Rastegari, Mohammad and Nabi, Moin}, month = oct, year = {2024}, }
arXiv
# Efficient Vision-Language Models by Summarizing Visual Tokens into Compact Registers

Yuxin Wen, Qingqing Cao, Qichen Fu, Sachin Mehta, and Mahyar Najibi

Oct 2024

Recent advancements in vision-language models (VLMs) have expanded their potential for real-world applications, enabling these models to perform complex reasoning on images. In the widely used fully autoregressive transformer-based models like LLaVA, projected visual tokens are prepended to textual tokens. Often, visual tokens significantly outnumber prompt tokens, resulting in increased computational overhead during both training and inference. In this paper, we propose Visual Compact Token Registers (Victor), a method that reduces the number of visual tokens by summarizing them into a smaller set of register tokens. Victor adds a few learnable register tokens after the visual tokens and summarizes the visual information into these registers using the first few layers in the language tower of VLMs. After these few layers, all visual tokens are discarded, significantly improving computational efficiency for both training and inference. Notably, our method is easy to implement and requires a small number of new trainable parameters with minimal impact on model performance. In our experiment, with merely 8 visual registers (about 1% of the original tokens), Victor shows less than a 4% accuracy drop while reducing the total training time by 43% and boosting the inference throughput by 3.3x.
@misc{wenEfficientVisionLanguageModels2024, title = {Efficient {Vision}-{Language} {Models} by {Summarizing} {Visual} {Tokens} into {Compact} {Registers}}, doi = {10.48550/arXiv.2410.14072}, urldate = {2024-10-24}, publisher = {arXiv}, author = {Wen, Yuxin and Cao, Qingqing and Fu, Qichen and Mehta, Sachin and Najibi, Mahyar}, month = oct, year = {2024}, }
ES-FoMo 2024
# OpenELM: An Efficient Language Model Family with Open Training and Inference Framework

Sachin Mehta, Mohammad Hossein Sekhavat, Qingqing Cao, Maxwell Horton, Yanzi Jin, Chenfan Sun, Iman Mirzadeh, Mahyar Najibi, Dmitry Belenko, Peter Zatloukal, and Mohammad Rastegari

Apr 2024

The reproducibility and transparency of large language models are crucial for advancing open research, ensuring the trustworthiness of results, and enabling investigations into data and model biases, as well as potential risks. To this end, we release OpenELM, a state-of-the-art open language model. OpenELM uses a layer-wise scaling strategy to efficiently allocate parameters within each layer of the transformer model, leading to enhanced accuracy. For example, with a parameter budget of approximately one billion parameters, OpenELM exhibits a 2.36% improvement in accuracy compared to OLMo while requiring 2x fewer pre-training tokens. Diverging from prior practices that only provide model weights and inference code, and pre-train on private datasets, our release includes the complete framework for training and evaluation of the language model on publicly available datasets, including training logs, multiple checkpoints, and pre-training configurations. We also release code to convert models to the MLX library for inference and fine-tuning on Apple devices. This comprehensive release aims to empower and strengthen the open research community, paving the way for future open research endeavors. Our source code, along with pre-trained model weights and training recipes, is available at https://github.com/apple/corenet. Additionally, OpenELM models can be found on HuggingFace at: https://huggingface.co/apple/OpenELM.
@misc{mehtaOpenELMEfficientLanguage2024, title = {{OpenELM}: {An} {Efficient} {Language} {Model} {Family} with {Open} {Training} and {Inference} {Framework}}, shorttitle = {{OpenELM}}, urldate = {2024-04-29}, publisher = {Workshop on Efficient Systems for Foundation Models II @ ICML2024}, author = {Mehta, Sachin and Sekhavat, Mohammad Hossein and Cao, Qingqing and Horton, Maxwell and Jin, Yanzi and Sun, Chenfan and Mirzadeh, Iman and Najibi, Mahyar and Belenko, Dmitry and Zatloukal, Peter and Rastegari, Mohammad}, month = apr, year = {2024}, }
ICML 2024
# APT: Adaptive Pruning and Tuning Pretrained Language Models for Efficient Training and Inference

Bowen Zhao, Hannaneh Hajishirzi, and Qingqing Cao

Jan 2024

Oral (1.5%)

Fine-tuning and inference with large language models (LMs) are generally known to be expensive. Parameter-efficient fine-tuning over pretrained LMs reduces training memory by updating a small number of LM parameters but does not improve inference efficiency. Structured pruning improves LM inference efficiency by removing consistent parameter blocks, yet often increases training memory and time. To improve both training and inference efficiency, we introduce APT, which adaptively prunes and tunes parameters for the LMs. At the early stage of fine-tuning, APT dynamically adds salient tuning parameters for fast and accurate convergence while discarding unimportant parameters for efficiency. Compared to baselines, our experiments show that APT maintains up to 98% task performance when pruning RoBERTa and T5 models with 40% of parameters left, while keeping 86.4% of LLaMA models’ performance with 70% of parameters remaining. Furthermore, APT speeds up LM fine-tuning by up to 8x and reduces the training memory footprint of large LMs by up to 70%.
@misc{zhaoAPTAdaptivePruning2024, title = {{{APT}}: {{Adaptive Pruning}} and {{Tuning Pretrained Language Models}} for {{Efficient Training}} and {{Inference}}}, shorttitle = {{{APT}}}, author = {Zhao, Bowen and Hajishirzi, Hannaneh and Cao, Qingqing}, year = {2024}, month = jan, url = {https://openreview.net/forum?id=sb81Xl50JG}, number = {arXiv:2401.12200}, eprint = {2401.12200}, primaryclass = {cs}, publisher = {{arXiv}}, urldate = {2024-01-23}, highlight = {Oral (1.5%)}, highlight_url = {https://icml.cc/virtual/2024/events/oral#event-35453} }

2023

ICLR 2024
# BTR: Binary Token Representations for Efficient Retrieval Augmented Language Models

Qingqing Cao, Sewon Min, Yizhong Wang, and Hannaneh Hajishirzi

Oct 2023

Spotlight (5%)

Retrieval augmentation addresses many critical problems in large language models such as hallucination, staleness, and privacy leaks. However, running retrieval-augmented language models (LMs) is slow and difficult to scale due to processing large amounts of retrieved text. We introduce binary token representations (BTR), which use 1-bit vectors to precompute every token in passages, significantly reducing computation during inference. Despite the potential loss of accuracy, our new calibration techniques and training objectives restore performance. Combined with offline and runtime compression, this only requires 127GB of disk space for encoding 3 billion tokens in Wikipedia. Our experiments show that on five knowledge-intensive NLP tasks, BTR accelerates state-of-the-art inference by up to 4x and reduces storage by over 100x while maintaining over 95% task performance. Our code is publicly available at https://github.com/csarron/BTR.
@inproceedings{caoBTRBinaryToken2024, title = {{BTR}: {Binary} {Token} {Representations} for {Efficient} {Retrieval} {Augmented} {Language} {Models}}, shorttitle = {{BTR}}, url = {https://openreview.net/forum?id=3TO3TtnOFl}, language = {en}, urldate = {2024-04-29}, author = {Cao, Qingqing and Min, Sewon and Wang, Yizhong and Hajishirzi, Hannaneh}, month = oct, year = {2023}, highlight = {Spotlight (5%)}, }
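
As a rough illustration of the 1-bit idea mentioned in the BTR abstract above, the sketch below binarizes a token vector by sign and packs the bits into a single integer. This is not the BTR implementation: the calibration techniques, training objectives, and compression steps described in the abstract are omitted, and the function names `binarize` and `hamming_similarity`, the sign threshold, and the example vector are all assumptions made for illustration.

```python
from typing import List


def binarize(vector: List[float]) -> int:
    """Pack the sign bits of a float vector into one int (1 bit per dimension).

    Assumption: a positive value maps to 1, anything else to 0.
    """
    bits = 0
    for value in vector:
        bits = (bits << 1) | (1 if value > 0 else 0)
    return bits


def hamming_similarity(a: int, b: int, dim: int) -> float:
    """Fraction of the `dim` dimensions on which two packed vectors agree."""
    differing = bin(a ^ b).count("1")
    return 1.0 - differing / dim


# Hypothetical 8-dimensional token representation.
token_vec = [0.7, -1.2, 0.1, -0.3, 2.4, -0.9, 0.0, 1.5]
packed = binarize(token_vec)
print(format(packed, "08b"))  # prints 10101001
```

Storing one bit per dimension instead of a 32-bit float shrinks each vector by 32x on its own; the larger storage reduction reported in the abstract presumably also depends on the offline and runtime compression it mentions.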