Publications | Qingqing Cao

2024

ICML 2025
# CtrlSynth: Controllable Image Text Synthesis for Data-Efficient Multimodal Learning

Qingqing Cao, Mahyar Najibi, and Sachin Mehta

Oct 2024

Abstract arXiv BibTeX Paper

Pretraining robust vision or multimodal foundation models (e.g., CLIP) relies on large-scale datasets that may be noisy, potentially misaligned, and have long-tail distributions. Previous works have shown promising results in augmenting datasets by generating synthetic samples. However, they only support domain-specific ad hoc use cases (e.g., either image or text only, but not both), and are limited in data diversity due to a lack of fine-grained control over the synthesis process. In this paper, we design a _controllable_ image-text synthesis pipeline, CtrlSynth, for data-efficient and robust multimodal learning. The key idea is to decompose the visual semantics of an image into basic elements, apply user-specified control policies (e.g., remove, add, or replace operations), and recompose them to synthesize images or texts. The decompose and recompose feature in CtrlSynth allows users to control data synthesis in a fine-grained manner by defining customized control policies to manipulate the basic elements. CtrlSynth leverages the capabilities of pretrained foundation models such as large language models or diffusion models to reason and recompose basic elements such that synthetic samples are natural and composed in diverse ways. CtrlSynth is a closed-loop, training-free, and modular framework, making it easy to support different pretrained models. With extensive experiments on 31 datasets spanning different vision and vision-language tasks, we show that CtrlSynth substantially improves zero-shot classification, image-text retrieval, and compositional reasoning performance of CLIP models.
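As a rough illustration of the decompose, control, and recompose loop described above, the sketch below treats an image's visual semantics as a plain list of tag strings and applies remove, replace, and add policies to it. The `apply_policy` and `recompose_prompt` functions, the policy dictionary layout, and the prompt template are hypothetical and are not taken from CtrlSynth; in the actual pipeline, recomposition is delegated to pretrained language or diffusion models.

```python
from typing import Dict, List


def apply_policy(tags: List[str], policy: Dict[str, object]) -> List[str]:
    """Apply a hypothetical control policy to a list of visual tags."""
    removed = set(policy.get("remove", []))
    replaced = dict(policy.get("replace", {}))
    # Drop removed tags, then swap replaced ones, preserving order.
    result = [replaced.get(t, t) for t in tags if t not in removed]
    # Append added tags that are not already present.
    for tag in policy.get("add", []):
        if tag not in result:
            result.append(tag)
    return result


def recompose_prompt(tags: List[str]) -> str:
    """Stand-in for recomposition: build a text prompt for a generator."""
    return "A photo containing " + ", ".join(tags) + "."


tags = ["dog", "frisbee", "grass"]
policy = {"remove": ["frisbee"], "replace": {"grass": "beach"}, "add": ["ball"]}
new_tags = apply_policy(tags, policy)
prompt = recompose_prompt(new_tags)
```

Keeping the policy as data rather than code is one way to let users define their own controls without touching the synthesis step.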
@misc{caoCtrlSynthControllableImage2024, title = {{CtrlSynth}: {Controllable} {Image} {Text} {Synthesis} for {Data}-{Efficient} {Multimodal} {Learning}}, shorttitle = {{CtrlSynth}}, doi = {10.48550/arXiv.2410.11963}, urldate = {2024-10-23}, publisher = {arXiv}, author = {Cao, Qingqing and Najibi, Mahyar and Mehta, Sachin}, month = oct, year = {2024}, }
SCOPE 2025
# KV Prediction for Improved Time to First Token

Maxwell Horton, Qingqing Cao, Chenfan Sun, Yanzi Jin, Sachin Mehta, Mohammad Rastegari, and Moin Nabi

Oct 2024

Abstract arXiv BibTeX Paper

Inference with transformer-based language models begins with a prompt processing step. In this step, the model generates the first output token and stores the KV cache needed for future generation steps. This prompt processing step can be computationally expensive, taking tens of seconds or more for billion-parameter models on edge devices when prompt lengths or batch sizes rise. This degrades user experience by introducing significant latency into the model’s outputs. To reduce the time spent producing the first output (known as the “time to first token”, or TTFT) of a pretrained model, we introduce a novel method called KV Prediction. In our method, a small auxiliary model is used to process the prompt and produce an approximation of the KV cache used by a base model. This approximated KV cache is then used with the base model for autoregressive generation without the need to query the auxiliary model again. We demonstrate that our method produces a Pareto-optimal efficiency-accuracy trade-off when compared to baselines. On TriviaQA, we demonstrate relative accuracy improvements in the range of 15%-50% across a range of TTFT FLOPs budgets. We also demonstrate accuracy improvements of up to 30% on HumanEval Python code completion at fixed TTFT FLOPs budgets. Additionally, we benchmark models on an Apple M2 Pro CPU and demonstrate that our improvement in FLOPs translates to a TTFT speedup on hardware. We release our code at https://github.com/apple/corenet/tree/main/projects/kv-prediction .
@misc{hortonKVPredictionImproved2024, title = {{KV} {Prediction} for {Improved} {Time} to {First} {Token}}, doi = {10.48550/arXiv.2410.08391}, urldate = {2024-10-24}, publisher = {First Workshop on Scalable Optimization for Efficient and Adaptive Foundation Models @ ICLR 2025}, author = {Horton, Maxwell and Cao, Qingqing and Sun, Chenfan and Jin, Yanzi and Mehta, Sachin and Rastegari, Mohammad and Nabi, Moin}, month = oct, year = {2024}, }
arXiv
# Efficient Vision-Language Models by Summarizing Visual Tokens into Compact Registers

Yuxin Wen, Qingqing Cao, Qichen Fu, Sachin Mehta, and Mahyar Najibi

Oct 2024

Abstract arXiv BibTeX Paper

Recent advancements in vision-language models (VLMs) have expanded their potential for real-world applications, enabling these models to perform complex reasoning on images. In the widely used fully autoregressive transformer-based models like LLaVA, projected visual tokens are prepended to textual tokens. Visual tokens often significantly outnumber prompt tokens, resulting in increased computational overhead during both training and inference. In this paper, we propose Visual Compact Token Registers (Victor), a method that reduces the number of visual tokens by summarizing them into a smaller set of register tokens. Victor adds a few learnable register tokens after the visual tokens and summarizes the visual information into these registers using the first few layers in the language tower of VLMs. After these few layers, all visual tokens are discarded, significantly improving computational efficiency for both training and inference. Notably, our method is easy to implement and requires a small number of new trainable parameters with minimal impact on model performance. In our experiment, with merely 8 visual registers (about 1% of the original tokens), Victor shows less than a 4% accuracy drop while reducing the total training time by 43% and boosting the inference throughput by 3.3X.
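The token bookkeeping described above can be sketched at the level of array shapes. This is only an illustration under stated assumptions: the function name `victor_style_forward` is hypothetical, the layers are identity stand-ins rather than transformer layers, the registers are random rather than learned, and the visual and text token counts (576 and 32) are assumed values that do not come from the abstract.

```python
import numpy as np


def victor_style_forward(visual, registers, text, layers, k):
    """Run `layers` over [visual; registers; text], dropping visual tokens after layer k."""
    n_visual = visual.shape[0]
    seq = np.concatenate([visual, registers, text], axis=0)
    lengths = []
    for i, layer in enumerate(layers):
        seq = layer(seq)
        if i + 1 == k:
            # Discard the visual tokens; registers and text tokens remain.
            seq = seq[n_visual:]
        lengths.append(seq.shape[0])
    return seq, lengths


rng = np.random.default_rng(0)
d = 16
visual = rng.normal(size=(576, d))   # assumed number of visual tokens
registers = rng.normal(size=(8, d))  # 8 registers, as in the abstract
text = rng.normal(size=(32, d))      # assumed prompt length
layers = [lambda x: x] * 6           # identity stand-ins for transformer layers
out, lengths = victor_style_forward(visual, registers, text, layers, k=2)
```

In this toy setup, `lengths` records the sequence length after each layer, which shows where the savings come from: every layer after the first `k` processes only the registers and the text tokens.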
@misc{wenEfficientVisionLanguageModels2024, title = {Efficient {Vision}-{Language} {Models} by {Summarizing} {Visual} {Tokens} into {Compact} {Registers}}, doi = {10.48550/arXiv.2410.14072}, urldate = {2024-10-24}, publisher = {arXiv}, author = {Wen, Yuxin and Cao, Qingqing and Fu, Qichen and Mehta, Sachin and Najibi, Mahyar}, month = oct, year = {2024}, }
ES-FoMo 2024
# OpenELM: An Efficient Language Model Family with Open Training and Inference Framework

Sachin Mehta, Mohammad Hossein Sekhavat, Qingqing Cao, Maxwell Horton, Yanzi Jin, Chenfan Sun, Iman Mirzadeh, Mahyar Najibi, Dmitry Belenko, Peter Zatloukal, and Mohammad Rastegari

Apr 2024

Abstract arXiv BibTeX

The reproducibility and transparency of large language models are crucial for advancing open research, ensuring the trustworthiness of results, and enabling investigations into data and model biases, as well as potential risks. To this end, we release OpenELM, a state-of-the-art open language model. OpenELM uses a layer-wise scaling strategy to efficiently allocate parameters within each layer of the transformer model, leading to enhanced accuracy. For example, with a parameter budget of approximately one billion parameters, OpenELM exhibits a 2.36% improvement in accuracy compared to OLMo while requiring 2x fewer pre-training tokens. Diverging from prior practices that only provide model weights and inference code, and pre-train on private datasets, our release includes the complete framework for training and evaluation of the language model on publicly available datasets, including training logs, multiple checkpoints, and pre-training configurations. We also release code to convert models to the MLX library for inference and fine-tuning on Apple devices. This comprehensive release aims to empower and strengthen the open research community, paving the way for future open research endeavors. Our source code along with pre-trained model weights and training recipes is available at https://github.com/apple/corenet. Additionally, OpenELM models can be found on HuggingFace at: https://huggingface.co/apple/OpenELM.
@misc{mehtaOpenELMEfficientLanguage2024, title = {{OpenELM}: {An} {Efficient} {Language} {Model} {Family} with {Open} {Training} and {Inference} {Framework}}, shorttitle = {{OpenELM}}, urldate = {2024-04-29}, publisher = {Workshop on Efficient Systems for Foundation Models II @ ICML2024}, author = {Mehta, Sachin and Sekhavat, Mohammad Hossein and Cao, Qingqing and Horton, Maxwell and Jin, Yanzi and Sun, Chenfan and Mirzadeh, Iman and Najibi, Mahyar and Belenko, Dmitry and Zatloukal, Peter and Rastegari, Mohammad}, month = apr, year = {2024}, }
ICML 2024
# APT: Adaptive Pruning and Tuning Pretrained Language Models for Efficient Training and Inference

Bowen Zhao, Hannaneh Hajishirzi, and Qingqing Cao

Jan 2024

Abstract arXiv BibTeX Paper Oral (1.5%)

Fine-tuning and inference with large language models (LMs) are generally known to be expensive. Parameter-efficient fine-tuning over pretrained LMs reduces training memory by updating a small number of LM parameters but does not improve inference efficiency. Structured pruning improves LM inference efficiency by removing consistent parameter blocks, yet often increases training memory and time. To improve both training and inference efficiency, we introduce APT that adaptively prunes and tunes parameters for the LMs. At the early stage of fine-tuning, APT dynamically adds salient tuning parameters for fast and accurate convergence while discarding unimportant parameters for efficiency. Compared to baselines, our experiments show that APT maintains up to 98% task performance when pruning RoBERTa and T5 models with 40% of parameters left while keeping 86.4% of LLaMA models’ performance with 70% of parameters remaining. Furthermore, APT speeds up LM fine-tuning by up to 8x and reduces the training memory footprint of large LMs by up to 70%.
@misc{zhaoAPTAdaptivePruning2024, title = {{{APT}}: {{Adaptive Pruning}} and {{Tuning Pretrained Language Models}} for {{Efficient Training}} and {{Inference}}}, shorttitle = {{{APT}}}, author = {Zhao, Bowen and Hajishirzi, Hannaneh and Cao, Qingqing}, year = {2024}, month = jan, url = {https://openreview.net/forum?id=sb81Xl50JG}, number = {arXiv:2401.12200}, eprint = {2401.12200}, primaryclass = {cs}, publisher = {{arXiv}}, urldate = {2024-01-23}, highlight = {Oral (1.5%)}, highlight_url = {https://icml.cc/virtual/2024/events/oral#event-35453} }

2023

ICLR 2024
# BTR: Binary Token Representations for Efficient Retrieval Augmented Language Models

Qingqing Cao, Sewon Min, Yizhong Wang, and Hannaneh Hajishirzi

Oct 2023

Abstract arXiv BibTeX Paper Poster Slides Video Spotlight (5%)

Retrieval augmentation addresses many critical problems in large language models such as hallucination, staleness, and privacy leaks. However, running retrieval-augmented language models (LMs) is slow and difficult to scale due to processing large amounts of retrieved text. We introduce binary token representations (BTR), which use 1-bit vectors to precompute every token in passages, significantly reducing computation during inference. Despite the potential loss of accuracy, our new calibration techniques and training objectives restore performance. Combined with offline and runtime compression, this only requires 127GB of disk space for encoding 3 billion tokens in Wikipedia. Our experiments show that on five knowledge-intensive NLP tasks, BTR accelerates state-of-the-art inference by up to 4x and reduces storage by over 100x while maintaining over 95% task performance. Our code is publicly available at https://github.com/csarron/BTR.
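To make the 1-bit idea above concrete, the sketch below binarizes float token vectors by sign and packs eight bits per byte with NumPy. It is a minimal illustration, not the BTR method: the function names are hypothetical, the embedding size of 768 is an assumption, and the calibration techniques, training objectives, and additional compression mentioned in the abstract are not modeled.

```python
import numpy as np


def binarize(token_vectors: np.ndarray) -> np.ndarray:
    """Map each float to one bit (1 if positive, else 0) and pack 8 bits per byte."""
    bits = (token_vectors > 0).astype(np.uint8)
    return np.packbits(bits, axis=-1)


def unpack_to_signs(packed: np.ndarray, dim: int) -> np.ndarray:
    """Recover a +1/-1 vector per token from the packed bits."""
    bits = np.unpackbits(packed, axis=-1, count=dim)
    return bits.astype(np.float32) * 2.0 - 1.0


rng = np.random.default_rng(0)
tokens = rng.normal(size=(100, 768)).astype(np.float32)  # 100 tokens, assumed dim 768
packed = binarize(tokens)
signs = unpack_to_signs(packed, dim=768)
ratio = tokens.nbytes / packed.nbytes  # float32 storage vs. packed 1-bit storage
```

Going from 32-bit floats to single bits alone gives a 32x reduction in this sketch; the larger savings reported in the abstract also rely on the offline and runtime compression it mentions.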
@inproceedings{caoBTRBinaryToken2024, title = {{BTR}: {Binary} {Token} {Representations} for {Efficient} {Retrieval} {Augmented} {Language} {Models}}, shorttitle = {{BTR}}, url = {https://openreview.net/forum?id=3TO3TtnOFl}, language = {en}, urldate = {2024-04-29}, author = {Cao, Qingqing and Min, Sewon and Wang, Yizhong and Hajishirzi, Hannaneh}, month = oct, year = {2023}, highlight = {Spotlight (5%)}, }
arXiv
# Efficiency Pentathlon: A Standardized Arena for Efficiency Evaluation

Hao Peng, Qingqing Cao, Jesse Dodge, Matthew E. Peters, Jared Fernandez, Tom Sherborne, Kyle Lo, Sam Skjonsberg, Emma Strubell, Darrell Plessas, Iz Beltagy, Evan Pete Walsh, Noah A. Smith, and Hannaneh Hajishirzi

Jul 2023

Abstract arXiv BibTeX

Rising computational demands of modern natural language processing (NLP) systems have increased the barrier to entry for cutting-edge research while posing serious environmental concerns. Yet, progress on model efficiency has been impeded by practical challenges in model evaluation and comparison. For example, hardware is challenging to control due to disparate levels of accessibility across different institutions. Moreover, improvements in metrics such as FLOPs often fail to translate to progress in real-world applications. In response, we introduce Pentathlon, a benchmark for holistic and realistic evaluation of model efficiency. Pentathlon focuses on inference, which accounts for a majority of the compute in a model’s lifecycle. It offers a strictly-controlled hardware platform, and is designed to mirror real-world application scenarios. It incorporates a suite of metrics that target different aspects of efficiency, including latency, throughput, memory overhead, and energy consumption. Pentathlon also comes with a software library that can be seamlessly integrated into any codebase and enable evaluation. As a standardized and centralized evaluation platform, Pentathlon can drastically reduce the workload to make fair and reproducible efficiency comparisons. While initially focused on natural language processing (NLP) models, Pentathlon is designed to allow flexible extension to other fields. We envision Pentathlon will stimulate algorithmic innovations in building efficient models, and foster an increased awareness of the social and environmental implications in the development of future-generation NLP models.
@misc{peng2023efficiency, title = {Efficiency Pentathlon: A Standardized Arena for Efficiency Evaluation}, author = {Peng, Hao and Cao, Qingqing and Dodge, Jesse and Peters, Matthew E. and Fernandez, Jared and Sherborne, Tom and Lo, Kyle and Skjonsberg, Sam and Strubell, Emma and Plessas, Darrell and Beltagy, Iz and Walsh, Evan Pete and Smith, Noah A. and Hajishirzi, Hannaneh}, month = jul, year = {2023}, archiveprefix = {arXiv}, primaryclass = {cs.CL}, }
NeurIPS 2023
# AdANNS: A Framework for Adaptive Semantic Search

Aniket Rege, Aditya Kusupati, Sharan Ranjit S, Alan Fan, Qingqing Cao, Sham M. Kakade, Prateek Jain, and Ali Farhadi

In Thirty-Seventh Conference on Neural Information Processing Systems, Nov 2023

Abstract arXiv BibTeX

Web-scale search systems learn an encoder to embed a given query which is then hooked into an approximate nearest neighbor search (ANNS) pipeline to retrieve similar data points. To accurately capture tail queries and data points, learned representations typically are _rigid, high-dimensional_ vectors that are generally used as-is in the entire ANNS pipeline and can lead to computationally expensive retrieval. In this paper, we argue that instead of rigid representations, different stages of ANNS can leverage _adaptive representations_ of varying capacities to achieve significantly better accuracy-compute trade-offs, i.e., stages of ANNS that can get away with more approximate computation should use a lower-capacity representation of the same data point. To this end, we introduce AdANNS, a novel ANNS design framework that explicitly leverages the flexibility of Matryoshka Representations. We demonstrate state-of-the-art accuracy-compute trade-offs using novel AdANNS-based key ANNS building blocks like search data structures (AdANNS-IVF) and quantization (AdANNS-OPQ). For example on ImageNet retrieval, AdANNS-IVF is up to 1.5% more accurate than the rigid representations-based IVF at the same compute budget; and matches accuracy while being up to 90x faster in _wall-clock time_. For Natural Questions, 32-byte AdANNS-OPQ matches the accuracy of the 64-byte OPQ baseline constructed using rigid representations – _same accuracy at half the cost!_ We further show that the gains from AdANNS translate to modern-day composite ANNS indices that combine search structures and quantization. Finally, we demonstrate that AdANNS can enable inference-time adaptivity for compute-aware search on ANNS indices built non-adaptively on matryoshka representations. Code is open-sourced at https://github.com/RAIVNLab/AdANNS.
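The idea that different search stages can use representations of different capacity can be sketched with a brute-force two-stage search: a cheap pass over the first few dimensions builds a shortlist, and a full-dimensional pass reranks it. This is a simplified stand-in, not AdANNS-IVF or AdANNS-OPQ. The function name and all parameter values are hypothetical, and the random vectors used here are not Matryoshka Representations, whose prefixes are trained to be useful on their own.

```python
import numpy as np


def two_stage_search(query, database, low_dim, shortlist_size, top_k):
    """Shortlist with the first `low_dim` dimensions, then rerank with all dimensions."""
    # Stage 1: cheap distances on the low-capacity prefix of each vector.
    coarse = np.linalg.norm(database[:, :low_dim] - query[:low_dim], axis=1)
    shortlist = np.argsort(coarse)[:shortlist_size]
    # Stage 2: exact distances on the full vectors, only for the shortlist.
    fine = np.linalg.norm(database[shortlist] - query, axis=1)
    return shortlist[np.argsort(fine)[:top_k]]


rng = np.random.default_rng(0)
database = rng.normal(size=(1000, 64))  # stand-in embeddings, assumed dim 64
query = database[42].copy()             # query identical to a stored vector
result = two_stage_search(query, database, low_dim=8, shortlist_size=50, top_k=5)
```

Here the first stage touches 8 of 64 dimensions for all 1000 vectors, while the second stage touches all 64 dimensions for only 50 of them.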
@inproceedings{regeAdANNSFrameworkAdaptive2023, title = {{{AdANNS}}: {{A Framework}} for {{Adaptive Semantic Search}}}, shorttitle = {{{AdANNS}}}, booktitle = {Thirty-Seventh {{Conference}} on {{Neural Information Processing Systems}}}, author = {Rege, Aniket and Kusupati, Aditya and S, Sharan Ranjit and Fan, Alan and Cao, Qingqing and Kakade, Sham M. and Jain, Prateek and Farhadi, Ali}, year = {2023}, month = nov, urldate = {2024-01-24}, langid = {english}, }
ACL 2023
# PuMer: Pruning and Merging Tokens for Efficient Vision Language Models

Qingqing Cao, Bhargavi Paranjape, and Hannaneh Hajishirzi

In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Jul 2023

Abstract arXiv BibTeX Paper Code Poster Slides

Large-scale vision language (VL) models use Transformers to perform cross-modal interactions between the input text and image. These cross-modal interactions are computationally expensive and memory-intensive due to the quadratic complexity of processing the input image and text. We present PuMer: a token reduction framework that uses text-informed Pruning and modality-aware Merging strategies to progressively reduce the tokens of input image and text, improving model inference speed and reducing memory footprint. PuMer learns to keep salient image tokens related to the input text and merges similar textual and visual tokens by adding lightweight token reducer modules at several cross-modal layers in the VL model. Training PuMer is mostly the same as finetuning the original VL model but faster. Our evaluation for two vision language models on four downstream VL tasks shows PuMer increases inference throughput by up to 2x and reduces memory footprint by over 50% while incurring less than a 1% accuracy drop.
@inproceedings{cao-etal-2023-pumer, title = {{P}u{M}er: Pruning and Merging Tokens for Efficient Vision Language Models}, author = {Cao, Qingqing and Paranjape, Bhargavi and Hajishirzi, Hannaneh}, editor = {Rogers, Anna and Boyd-Graber, Jordan and Okazaki, Naoaki}, booktitle = {Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)}, month = jul, year = {2023}, address = {Toronto, Canada}, publisher = {Association for Computational Linguistics}, url = {https://aclanthology.org/2023.acl-long.721}, doi = {10.18653/v1/2023.acl-long.721}, pages = {12890--12903}, }
ACL 2023
# A Survey for Efficient Open Domain Question Answering

Qin Zhang, Shangsi Chen, Dongkuan Xu, Qingqing Cao, Xiaojun Chen, Trevor Cohn, and Meng Fang

In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Jul 2023

Abstract arXiv BibTeX Paper

Open domain question answering (ODQA) is a longstanding task aimed at answering factual questions from a large knowledge corpus without any explicit evidence in natural language processing (NLP). Recent works have predominantly focused on improving the answering accuracy and have achieved promising progress. However, higher accuracy often requires more memory consumption and inference latency, which might not necessarily be efficient enough for direct deployment in the real world. Thus, a trade-off between accuracy, memory consumption and processing speed is pursued. In this paper, we survey recent advancements in the efficiency of ODQA models and summarize the core techniques for achieving efficiency. Additionally, we provide a quantitative analysis of memory cost, query speed, accuracy, and overall performance comparison. Our goal is to keep scholars informed of the latest advancements and open challenges in ODQA efficiency research and contribute to the further development of ODQA efficiency.
@inproceedings{zhang-etal-2023-survey-efficient, title = {A Survey for Efficient Open Domain Question Answering}, author = {Zhang, Qin and Chen, Shangsi and Xu, Dongkuan and Cao, Qingqing and Chen, Xiaojun and Cohn, Trevor and Fang, Meng}, editor = {Rogers, Anna and Boyd-Graber, Jordan and Okazaki, Naoaki}, booktitle = {Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)}, month = jul, year = {2023}, address = {Toronto, Canada}, publisher = {Association for Computational Linguistics}, url = {https://aclanthology.org/2023.acl-long.808}, doi = {10.18653/v1/2023.acl-long.808}, pages = {14447--14465}, }
TACL 2023
# Efficient Methods for Natural Language Processing: A Survey

Marcos Treviso, Ji-Ung Lee, Tianchu Ji, Betty van Aken, Qingqing Cao, Manuel R. Ciosici, Michael Hassid, Kenneth Heafield, Sara Hooker, Colin Raffel, Pedro H. Martins, André F. T. Martins, Jessica Zosa Forde, Peter Milder, Edwin Simpson, Noam Slonim, Jesse Dodge, Emma Strubell, Niranjan Balasubramanian, Leon Derczynski, Iryna Gurevych, and Roy Schwartz

Transactions of the Association for Computational Linguistics, Jul 2023

Abstract arXiv BibTeX

Recent work in natural language processing (NLP) has yielded appealing results from scaling model parameters and training data; however, using only scale to improve performance means that resource consumption also grows. Such resources include data, time, storage, or energy, all of which are naturally limited and unevenly distributed. This motivates research into efficient methods that require fewer resources to achieve similar results. This survey synthesizes and relates current methods and findings in efficient NLP. We aim to provide both guidance for conducting NLP under limited resources, and point towards promising research directions for developing more efficient methods.
@article{trevisoEfficientMethodsNatural2023, title = {Efficient {{Methods}} for {{Natural Language Processing}}: {{A Survey}}}, shorttitle = {Efficient {{Methods}} for {{Natural Language Processing}}}, author = {Treviso, Marcos and Lee, Ji-Ung and Ji, Tianchu and van Aken, Betty and Cao, Qingqing and Ciosici, Manuel R. and Hassid, Michael and Heafield, Kenneth and Hooker, Sara and Raffel, Colin and Martins, Pedro H. and Martins, Andr{\'e} F. T. and Forde, Jessica Zosa and Milder, Peter and Simpson, Edwin and Slonim, Noam and Dodge, Jesse and Strubell, Emma and Balasubramanian, Niranjan and Derczynski, Leon and Gurevych, Iryna and Schwartz, Roy}, year = {2023}, month = jul, journal = {Transactions of the Association for Computational Linguistics}, volume = {11}, pages = {826--860}, issn = {2307-387X}, doi = {10.1162/tacl_a_00577}, urldate = {2024-01-24}, }

2022

IMWUT 2022

# MobiVQA: Efficient On-Device Visual Question Answering

Qingqing Cao, Prerna Khanna, Nicholas D. Lane, and Aruna Balasubramanian

Proc. ACM Interact. Mob. Wearable Ubiquitous Technol., Jul 2022

BibTeX Paper PDF

@article{mobivqa,
  author = {Cao, Qingqing and Khanna, Prerna and Lane, Nicholas D. and Balasubramanian, Aruna},
  title = {MobiVQA: Efficient On-Device Visual Question Answering},
  year = {2022},
  issue_date = {July 2022},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  volume = {6},
  number = {2},
  url = {https://doi.org/10.1145/3534619},
  doi = {10.1145/3534619},
  journal = {Proc. ACM Interact. Mob. Wearable Ubiquitous Technol.},
  month = jul,
  articleno = {44},
  numpages = {23},
  keywords = {visual question answering, on-device applications, mobile computing, edge computing},
}

2021

EMNLP 2021
# IrEne-viz: Visualizing Energy Consumption of Transformer Models

Yash Kumar Lal, Reetu Singh, Harsh Trivedi, Qingqing Cao, Aruna Balasubramanian, and Niranjan Balasubramanian

In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Nov 2021

Abstract BibTeX Paper

IrEne is an energy prediction system that accurately predicts the interpretable inference energy consumption of a wide range of Transformer-based NLP models. We present the IrEne-viz tool, an online platform for visualizing and exploring energy consumption of various Transformer-based models easily. Additionally, we release a public API that can be used to access granular information about energy consumption of transformer models and their components. The live demo is available at http://stonybrooknlp.github.io/irene/demo/.
@inproceedings{lal-etal-2021-irene, title = {{I}r{E}ne-viz: Visualizing Energy Consumption of Transformer Models}, author = {Lal, Yash Kumar and Singh, Reetu and Trivedi, Harsh and Cao, Qingqing and Balasubramanian, Aruna and Balasubramanian, Niranjan}, editor = {Adel, Heike and Shi, Shuming}, booktitle = {Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing: System Demonstrations}, month = nov, year = {2021}, address = {Online and Punta Cana, Dominican Republic}, publisher = {Association for Computational Linguistics}, url = {https://aclanthology.org/2021.emnlp-demo.29}, doi = {10.18653/v1/2021.emnlp-demo.29}, pages = {251--258}, }
ACL 2021
# IrEne: Interpretable Energy Prediction for Transformers

Qingqing Cao, Yash Kumar Lal, Harsh Trivedi, Aruna Balasubramanian, and Niranjan Balasubramanian

In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Aug 2021

Abstract BibTeX Paper Code

Existing software-based energy measurements of NLP models are not accurate because they do not consider the complex interactions between energy consumption and model execution. We present IrEne, an interpretable and extensible energy prediction system that accurately predicts the inference energy consumption of a wide range of Transformer-based NLP models. IrEne constructs a model tree graph that breaks down the NLP model into modules that are further broken down into low-level machine learning (ML) primitives. IrEne predicts the inference energy consumption of the ML primitives as a function of generalizable features and fine-grained runtime resource usage. IrEne then aggregates these low-level predictions recursively to predict the energy of each module and finally of the entire model. Experiments across multiple Transformer models show IrEne predicts inference energy consumption of transformer models with an error of under 7% compared to the ground truth. In contrast, existing energy models see an error of over 50%. We also show how IrEne can be used to conduct energy bottleneck analysis and to easily evaluate the energy impact of different architectural choices. We release the code and data at https://github.com/StonyBrookNLP/irene.
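The recursive aggregation over a model tree described above can be sketched as follows. This is a toy illustration under loud assumptions: the `Node` class, the tree layout, and the per-primitive energy values are invented for the example, and a plain sum stands in for whatever aggregation IrEne actually uses at each module.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """A node in a hypothetical model tree; leaves are ML primitives."""
    name: str
    leaf_energy: float = 0.0  # assumed predicted energy (joules) for a primitive
    children: List["Node"] = field(default_factory=list)


def predict_energy(node: Node) -> float:
    """Aggregate child predictions bottom-up; a plain sum is assumed here."""
    if not node.children:
        return node.leaf_energy
    return sum(predict_energy(child) for child in node.children)


attention = Node("attention", children=[Node("matmul", 0.5), Node("softmax", 0.25)])
ffn = Node("feed_forward", children=[Node("linear_1", 1.0), Node("linear_2", 1.0)])
model = Node("encoder_layer", children=[attention, ffn])
total = predict_energy(model)
```

Because every module gets its own prediction, calling `predict_energy` on a subtree such as `attention` or `ffn` gives the kind of per-module breakdown that a bottleneck analysis would start from.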
@inproceedings{cao-etal-2021-irene, title = {{I}r{E}ne: Interpretable Energy Prediction for Transformers}, author = {Cao, Qingqing and Lal, Yash Kumar and Trivedi, Harsh and Balasubramanian, Aruna and Balasubramanian, Niranjan}, editor = {Zong, Chengqing and Xia, Fei and Li, Wenjie and Navigli, Roberto}, booktitle = {Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)}, month = aug, year = {2021}, address = {Online}, publisher = {Association for Computational Linguistics}, url = {https://aclanthology.org/2021.acl-long.167}, doi = {10.18653/v1/2021.acl-long.167}, pages = {2145--2157}, }
EMDL 2021
# Are Mobile DNN Accelerators Accelerating DNNs

Qingqing Cao, Alexandru E. Irimiea, Mohamed Abdelfattah, Aruna Balasubramanian, and Nicholas D. Lane

In Proceedings of the 5th International Workshop on Embedded and Mobile Deep Learning, Aug 2021

Abstract BibTeX Paper PDF Code Slides

Deep neural networks (DNNs) are running on many mobile and embedded devices with the goal of energy efficiency and highest possible performance. However, DNN workloads are getting more computationally intensive, and simultaneously their deployment is ever-increasing. This has led to the creation of many purpose-built low-power neural accelerators to replace or augment traditional mobile CPUs and GPUs. In this work, we provide an in-depth study of one set of commercially-available mobile accelerators, the Intel Neural Compute Sticks (NCS). We perform a systematic measurement study of the latency and energy of this accelerator under a variety of DNNs including convolutional neural networks (CNNs) for vision tasks and attention-based Transformer models for NLP tasks. We compare to the mobile processors (CPU, GPU, and DSP) on a smartphone and a mobile board. Our study shows commercial mobile accelerators like NCS are not ready yet to provide the performance as claimed. We also point out directions in optimizing the model architectures to better suit these accelerators.
@inproceedings{10.1145/3469116.3470011, author = {Cao, Qingqing and Irimiea, Alexandru E. and Abdelfattah, Mohamed and Balasubramanian, Aruna and Lane, Nicholas D.}, title = {Are Mobile DNN Accelerators Accelerating DNNs}, year = {2021}, isbn = {9781450385978}, publisher = {Association for Computing Machinery}, address = {New York, NY, USA}, url = {https://doi.org/10.1145/3469116.3470011}, doi = {10.1145/3469116.3470011}, booktitle = {Proceedings of the 5th International Workshop on Embedded and Mobile Deep Learning}, pages = {7--12}, numpages = {6}, location = {Virtual, WI, USA}, series = {EMDL'21}, }

2020

SustaiNLP 2020
# Towards Accurate and Reliable Energy Measurement of NLP Models

Qingqing Cao, Aruna Balasubramanian, and Niranjan Balasubramanian

In Proceedings of SustaiNLP: Workshop on Simple and Efficient Natural Language Processing, Nov 2020

Abstract BibTeX Paper Code Slides

Accurate and reliable measurement of energy consumption is critical for making well-informed design choices when choosing and training large scale NLP models. In this work, we show that existing software-based energy estimations are not accurate because they do not take into account hardware differences and how resource utilization affects energy consumption. We conduct energy measurement experiments with four different models for a question answering task. We quantify the error of existing software-based energy estimations by using a hardware power meter that provides highly accurate energy measurements. Our key takeaway is the need for a more accurate energy estimation model that takes into account hardware variabilities and the non-linear relationship between resource utilization and energy consumption. We release the code and data at https://github.com/csarron/sustainlp2020-energy.
@inproceedings{cao-etal-2020-towards, title = {Towards Accurate and Reliable Energy Measurement of {NLP} Models}, author = {Cao, Qingqing and Balasubramanian, Aruna and Balasubramanian, Niranjan}, editor = {Moosavi, Nafise Sadat and Fan, Angela and Shwartz, Vered and Glava{\v{s}}, Goran and Joty, Shafiq and Wang, Alex and Wolf, Thomas}, booktitle = {Proceedings of SustaiNLP: Workshop on Simple and Efficient Natural Language Processing}, month = nov, year = {2020}, address = {Online}, publisher = {Association for Computational Linguistics}, url = {https://aclanthology.org/2020.sustainlp-1.19}, doi = {10.18653/v1/2020.sustainlp-1.19}, pages = {141--148}, }
ACL 2020
# DeFormer: Decomposing Pre-trained Transformers for Faster Question Answering

Qingqing Cao, Harsh Trivedi, Aruna Balasubramanian, and Niranjan Balasubramanian

In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Jul 2020

Abstract BibTeX Paper Code Slides

Transformer-based QA models use input-wide self-attention – i.e. across both the question and the input passage – at all layers, causing them to be slow and memory-intensive. It turns out that we can get by without input-wide self-attention at all layers, especially in the lower layers. We introduce DeFormer, a decomposed transformer, which substitutes the full self-attention with question-wide and passage-wide self-attentions in the lower layers. This allows for question-independent processing of the input text representations, which in turn enables pre-computing passage representations reducing runtime compute drastically. Furthermore, because DeFormer is largely similar to the original model, we can initialize DeFormer with the pre-training weights of a standard transformer, and directly fine-tune on the target QA dataset. We show DeFormer versions of BERT and XLNet can be used to speed up QA by over 4.3x and with simple distillation-based losses they incur only a 1% drop in accuracy. We open source the code at https://github.com/StonyBrookNLP/deformer.
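One way to picture the decomposition described above is as a per-layer attention mask: block-diagonal in the lower layers, so question and passage tokens attend only within their own segment, and full in the upper layers. The sketch below builds such masks with NumPy. The function names, the 12-layer model, and the choice of 9 decomposed lower layers are illustrative assumptions, not details taken from the abstract.

```python
import numpy as np


def attention_mask(n_question: int, n_passage: int, decomposed: bool) -> np.ndarray:
    """Return a boolean mask where mask[i, j] is True if token i may attend to token j."""
    n = n_question + n_passage
    if not decomposed:
        # Upper layers: full input-wide self-attention.
        return np.ones((n, n), dtype=bool)
    # Lower layers: question and passage tokens attend only within their own segment.
    mask = np.zeros((n, n), dtype=bool)
    mask[:n_question, :n_question] = True
    mask[n_question:, n_question:] = True
    return mask


def layer_masks(n_layers: int, n_lower: int, n_question: int, n_passage: int):
    """Decomposed masks for the first `n_lower` layers, full masks for the rest."""
    return [attention_mask(n_question, n_passage, decomposed=(i < n_lower))
            for i in range(n_layers)]


masks = layer_masks(n_layers=12, n_lower=9, n_question=3, n_passage=5)
```

Since no passage token sees a question token in the decomposed layers, the passage side of those layers does not depend on the question, which is what makes precomputing passage representations possible.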
@inproceedings{cao-etal-2020-deformer, title = {{D}e{F}ormer: Decomposing Pre-trained Transformers for Faster Question Answering}, author = {Cao, Qingqing and Trivedi, Harsh and Balasubramanian, Aruna and Balasubramanian, Niranjan}, editor = {Jurafsky, Dan and Chai, Joyce and Schluter, Natalie and Tetreault, Joel}, booktitle = {Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics}, month = jul, year = {2020}, address = {Online}, publisher = {Association for Computational Linguistics}, url = {https://aclanthology.org/2020.acl-main.411}, doi = {10.18653/v1/2020.acl-main.411}, pages = {4487--4497}, }
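The decomposition described in the DeFormer abstract can be illustrated with a small attention-mask sketch. This is not the authors' implementation: it is a single-head toy without learned query/key/value projections, and the names `attention_mask` and `masked_self_attention` are hypothetical. It only shows why restricting lower-layer attention to question-wide and passage-wide blocks makes the passage representations independent of the question, so they could be computed ahead of time.

```python
import math


def attention_mask(num_question, num_passage, decomposed):
    """Return mask[i][j] == True when token i may attend to token j.

    Tokens are ordered question first, then passage. With decomposed=True,
    a token only sees tokens from its own segment (block-diagonal mask).
    """
    n = num_question + num_passage

    def same_segment(i, j):
        return (i < num_question) == (j < num_question)

    return [[(not decomposed) or same_segment(i, j) for j in range(n)]
            for i in range(n)]


def masked_self_attention(tokens, mask):
    """Toy self-attention: scaled dot-product scores, softmax over visible tokens."""
    dim = len(tokens[0])
    outputs = []
    for i, query in enumerate(tokens):
        visible = [j for j in range(len(tokens)) if mask[i][j]]
        scores = [sum(a * b for a, b in zip(query, tokens[j])) / math.sqrt(dim)
                  for j in visible]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        outputs.append([
            sum(w * tokens[j][d] for w, j in zip(weights, visible)) / total
            for d in range(dim)
        ])
    return outputs


# Illustrative values only: one question token, three passage tokens.
passage = [[0.2, 0.1], [0.4, -0.3], [-0.5, 0.6]]
question_a = [[1.0, 0.0]]
question_b = [[-2.0, 3.0]]

lower_layer = attention_mask(1, 3, decomposed=True)
out_a = masked_self_attention(question_a + passage, lower_layer)
out_b = masked_self_attention(question_b + passage, lower_layer)

# The passage rows do not change when the question changes,
# so they could be cached before any question arrives.
passage_is_question_independent = out_a[1:] == out_b[1:]
```

With `decomposed=False` the mask is all `True` and the passage rows depend on the question again, which corresponds to the full self-attention that the abstract says is retained in the upper layers.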

2019

MobiSys 2019
# DeQA: On-Device Question Answering

Qingqing Cao, Noah Weber, Niranjan Balasubramanian, and Aruna Balasubramanian

In Proceedings of the 17th Annual International Conference on Mobile Systems, Applications, and Services, Jul 2019

Today there is no effective support for device-wide question answering on mobile devices. State-of-the-art QA models are deep learning behemoths designed for the cloud, which run extremely slowly and require more memory than is available on phones. We present DeQA, a suite of latency and memory optimizations that adapts existing QA systems to run completely locally on mobile phones. Specifically, we design two latency optimizations that (1) stop processing documents if further processing cannot improve answer quality, and (2) identify computation that does not depend on the question and move it offline. These optimizations do not depend on the QA model internals and can be applied to several existing QA models. DeQA also implements a set of memory optimizations by (i) loading partial indexes in memory, (ii) working with smaller units of data, and (iii) replacing in-memory lookups with a key-value database. We use DeQA to port three state-of-the-art QA systems to the mobile device and evaluate them over three datasets. The first is the large-scale SQuAD dataset defined over a Wikipedia collection. We also create two on-device QA datasets, one over a publicly available email data collection and the other using a cross-app data collection we obtained from two users. Our evaluations show that DeQA can run QA models with only a few hundred MBs of memory and provides at least a 13x speedup on average on the mobile phone across all three datasets, with less than a 1% drop in accuracy.
@inproceedings{deqa, author = {Cao, Qingqing and Weber, Noah and Balasubramanian, Niranjan and Balasubramanian, Aruna}, title = {DeQA: On-Device Question Answering}, year = {2019}, isbn = {9781450366618}, publisher = {Association for Computing Machinery}, address = {New York, NY, USA}, url = {https://doi.org/10.1145/3307334.3326071}, doi = {10.1145/3307334.3326071}, booktitle = {Proceedings of the 17th Annual International Conference on Mobile Systems, Applications, and Services}, pages = {27-40}, numpages = {14}, keywords = {mobile devices, mobile systems, question answering}, location = {Seoul, Republic of Korea}, series = {MobiSys '19}, }

2017

MobiCom 2017
# UIWear: Easily Adapting User Interfaces for Wearable Devices

Jian Xu*, Qingqing Cao*, Aditya Prakash, Aruna Balasubramanian, and Donald E. Porter

In Proceedings of the 23rd Annual International Conference on Mobile Computing and Networking, Jul 2017

Wearable devices such as smartwatches offer exciting new opportunities for users to interact with their applications. However, the current wearable programming model requires the developer to write a custom companion app for each wearable form factor; the companion app extends the smartphone display onto the wearable, relays user interactions from the wearable to the phone, and updates the wearable display as needed. The development effort required to write a companion app is significant and will not scale to an increasing diversity of form factors. This paper argues for a different programming model for wearable devices. The developer writes an application for the smartphone, but only specifies a UI design for the wearable. Our UIWear system abstracts a logical model of the smartphone GUI, re-tailors the GUI for the wearable device based on the specified UI design, and compiles it into a companion app that we call the UICompanion app. We implemented UIWear on Android smartphones, AndroidWear smartwatches, and Sony SmartEyeGlasses. We evaluate 20 developer-written companion apps from the AndroidWear category on Google Play against the UIWear-created UICompanion apps. The number of lines of code required for the developer to specify the UI design in UIWear is an order of magnitude smaller than that of the companion app. Further, in most cases, the UICompanion app performed comparably to or better than the corresponding companion app in terms of both quantitative metrics, including latency and energy, and qualitative metrics, including look-and-feel.
@inproceedings{uiwear, author = {Xu*, Jian and Cao*, Qingqing and Prakash, Aditya and Balasubramanian, Aruna and Porter, Donald E.}, title = {UIWear: Easily Adapting User Interfaces for Wearable Devices}, year = {2017}, isbn = {9781450349161}, publisher = {Association for Computing Machinery}, address = {New York, NY, USA}, url = {https://doi.org/10.1145/3117811.3117819}, doi = {10.1145/3117811.3117819}, booktitle = {Proceedings of the 23rd Annual International Conference on Mobile Computing and Networking}, pages = {369-382}, numpages = {14}, keywords = {accessibility, smartphone, wearable, android, smartwatch}, location = {Snowbird, Utah, USA}, series = {MobiCom '17}, }
MobiCom 2017
# Demo: UIWear: Easily Adapting User Interfaces for Wearable Devices

Jian Xu*, Qingqing Cao*, Aruna Balasubramanian, and Donald E. Porter

In Proceedings of the 23rd Annual International Conference on Mobile Computing and Networking, Jul 2017

Wearable devices such as smartwatches offer exciting new opportunities for users to interact with their applications. However, the current wearable programming model requires the developer to write a custom companion app for each wearable form factor; the companion app extends the smartphone display onto the wearable, relays user interactions from the wearable to the phone, and updates the wearable display as needed. The development effort required to write a companion app is significant and will not scale to an increasing diversity of form factors. This paper argues for a different programming model for wearable devices. The developer writes an application for the smartphone, but only specifies a UI design for the wearable. Our UIWear system abstracts a logical model of the smartphone GUI, re-tailors the GUI for the wearable device based on the specified UI design, and compiles it into a companion app that we call the UICompanion app. We implemented UIWear on Android smartphones, AndroidWear smartwatches, and Sony SmartEyeGlasses. We evaluate 20 developer-written companion apps from the AndroidWear category on Google Play against the UIWear-created UICompanion apps. The number of lines of code required for the developer to specify the UI design in UIWear is an order of magnitude smaller than that of the companion app. Further, in most cases, the UICompanion app performed comparably to or better than the corresponding companion app in terms of both quantitative metrics, including latency and energy, and qualitative metrics, including look-and-feel.
@inproceedings{uiwear_demo, author = {Xu*, Jian and Cao*, Qingqing and Balasubramanian, Aruna and Porter, Donald E.}, title = {Demo: UIWear: Easily Adapting User Interfaces for Wearable Devices}, year = {2017}, isbn = {9781450349161}, publisher = {Association for Computing Machinery}, address = {New York, NY, USA}, url = {https://doi.org/10.1145/3117811.3124769}, doi = {10.1145/3117811.3124769}, booktitle = {Proceedings of the 23rd Annual International Conference on Mobile Computing and Networking}, pages = {510-512}, numpages = {3}, keywords = {accessibility, smartphone, android wear, smartwatch, android}, location = {Snowbird, Utah, USA}, series = {MobiCom '17}, }
EMDL 2017
# MobiRNN: Efficient Recurrent Neural Network Execution on Mobile GPU

Qingqing Cao, Niranjan Balasubramanian, and Aruna Balasubramanian

In Proceedings of the 1st International Workshop on Deep Learning for Mobile Systems and Applications, Jul 2017

In this paper, we explore optimizations to run Recurrent Neural Network (RNN) models locally on mobile devices. RNN models are widely used for natural language processing, machine translation, and other tasks. However, existing mobile applications that use RNN models do so on the cloud. To address privacy and efficiency concerns, we show how RNN models can be run locally on mobile devices. Existing work on porting deep learning models to mobile devices focuses on Convolutional Neural Networks (CNNs) and cannot be applied directly to RNN models. In response, we present MobiRNN, a mobile-specific optimization framework that implements GPU offloading specifically for mobile GPUs. Evaluations using an RNN model for activity recognition show that MobiRNN significantly decreases the latency of running RNN models on phones.
@inproceedings{mobirnn, author = {Cao, Qingqing and Balasubramanian, Niranjan and Balasubramanian, Aruna}, title = {MobiRNN: Efficient Recurrent Neural Network Execution on Mobile GPU}, year = {2017}, isbn = {9781450349628}, publisher = {Association for Computing Machinery}, address = {New York, NY, USA}, url = {https://doi.org/10.1145/3089801.3089804}, doi = {10.1145/3089801.3089804}, booktitle = {Proceedings of the 1st International Workshop on Deep Learning for Mobile Systems and Applications}, pages = {1-6}, numpages = {6}, keywords = {recurrent neural network, mobile GPU, renderscript, performance optimizations}, location = {Niagara Falls, New York, USA}, series = {EMDL '17}, }