Wittawat Jitkrittum

https://wittaw.at/
Authored Publications
Large-scale machine learning models deliver strong performance across a wide range of tasks, but impose significant computational and resource demands. To mitigate these challenges, smaller local models are often deployed alongside larger models, relying on routing and deferral mechanisms to offload complex tasks. However, existing approaches inadequately balance the capabilities of these models, often resulting in unnecessary deferrals or sub-optimal resource usage. In this work, we introduce a novel loss function called Gatekeeper for calibrating smaller models in cascade setups. Our approach fine-tunes the smaller model to confidently handle tasks it can perform correctly, while deferring complex tasks to the larger model. Moreover, it incorporates a mechanism for managing the trade-off between model performance and deferral accuracy, and is broadly applicable across various tasks and domains without any architectural changes. We evaluated our method on encoder-only, decoder-only, and encoder-decoder architectures. Experiments across image classification, language modeling, and vision-language tasks show that our approach substantially improves deferral performance.
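The abstract describes the Gatekeeper loss only at a high level. As a rough PyTorch sketch of the general idea (reward confidence on examples the small model gets right, suppress it otherwise), the form of the calibration term and the weight `alpha` below are illustrative assumptions, not the paper's actual formulation:

```python
import torch
import torch.nn.functional as F

def gatekeeper_style_loss(logits, labels, alpha=0.5):
    """Hypothetical sketch of a cascade-calibration loss; `alpha` trades
    off task accuracy against deferral calibration."""
    # Standard task loss on every example.
    ce = F.cross_entropy(logits, labels, reduction="none")
    # Confidence of the small model's top prediction.
    conf = logits.softmax(dim=-1).max(dim=-1).values.clamp(1e-6, 1 - 1e-6)
    correct = (logits.argmax(dim=-1) == labels).float()
    # Push confidence up on examples the model gets right,
    # and down (towards deferral) on examples it gets wrong.
    calib = F.binary_cross_entropy(conf, correct)
    return ce.mean() + alpha * calib
```

At inference time, a simple deferral rule under this sketch would route an input to the larger model whenever `conf` falls below a chosen threshold.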
Bipartite ranking is a fundamental supervised learning problem, with the goal of learning a ranking over instances with maximal area under the ROC curve (AUC) against a single binary target label. However, one may often observe multiple binary target labels, e.g., from distinct human annotators. How can one synthesize such labels into a single coherent ranking? In this work, we formally analyze two approaches to this problem, loss aggregation and label aggregation, by characterizing their Bayes-optimal solutions. We show that while both approaches can yield Pareto-optimal solutions, loss aggregation can exhibit label dictatorship: one can inadvertently (and undesirably) favor one label over others. This suggests that label aggregation can be preferable to loss aggregation, which we empirically verify.
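A toy contrast of the two approaches, using scikit-learn and synthetic data: label aggregation first synthesizes one target (here, a majority vote over annotators) and measures a single AUC against it, while loss aggregation instead averages the per-annotator AUC objectives. The data and aggregation choices below are made up for illustration.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
scores = rng.normal(size=200)               # a learned ranker's scores
labels = rng.integers(0, 2, size=(200, 3))  # binary labels from 3 annotators

# Label aggregation: synthesize one target first, then one AUC.
majority = labels.mean(axis=1) > 0.5
label_agg_auc = roc_auc_score(majority, scores)

# Loss aggregation: average the per-annotator AUC objectives instead.
per_annotator = [roc_auc_score(labels[:, k], scores) for k in range(3)]
loss_agg_auc = float(np.mean(per_annotator))
```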
    Faster Cascades via Speculative Decoding
    Seungyeon Kim
    Neha Gupta
    Aditya Menon
International Conference on Learning Representations (ICLR) (2025)
Cascades and speculative decoding are two common approaches to improving language models' inference efficiency. Both approaches involve interleaving models of different sizes, but via fundamentally distinct mechanisms: cascades employ a deferral rule that invokes the larger model only for "hard" inputs, while speculative decoding uses speculative execution to primarily invoke the larger model in parallel verification mode. These mechanisms offer different benefits: empirically, cascades are often capable of yielding better quality than even the larger model, while theoretically, speculative decoding offers a guarantee of quality-neutrality. In this paper, we leverage the best of both these approaches by designing new speculative cascading techniques that implement their deferral rule through speculative execution. We characterize the optimal deferral rule for our speculative cascades, and employ a plug-in approximation to the optimal rule. Through experiments with T5 and Gemma models on benchmark language tasks, we show that the proposed cascading approach matches the quality of a regular cascade, but at reduced inference costs.
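To make the mechanism concrete, here is a minimal single-step sketch of a speculative cascade: the small model drafts a few tokens, the large model scores them in one parallel pass, and a deferral rule decides whether to keep each draft token. The threshold rule below (accept unless the large model's probability for the draft token falls under `tau`) is a simple illustrative plug-in, not the paper's optimal deferral rule; `small` and `large` are assumed to map a token sequence to per-position next-token logits.

```python
import torch

@torch.no_grad()
def speculative_cascade_step(small, large, ctx, k=4, tau=0.3):
    """Toy sketch of one speculative-cascade decoding step."""
    draft = ctx
    for _ in range(k):                       # autoregressive drafting
        nxt = small(draft)[-1].argmax()
        draft = torch.cat([draft, nxt.view(1)])
    big_logits = large(draft)                # one parallel verification pass
    probs = big_logits.softmax(-1)
    for i in range(len(ctx), len(draft)):
        # Probability the large model assigns to the drafted token.
        p_big = probs[i - 1, draft[i]]
        if p_big < tau:                      # defer: take the large model's token
            fixed = big_logits[i - 1].argmax().view(1)
            return torch.cat([draft[:i], fixed])
    return draft                             # all draft tokens accepted
```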
    A Little Help Goes a Long Way: Efficient LLM Training by Leveraging Small LMs
    Vlad Feinberg
    Afshin Rostamizadeh
    Nikunj Saunshi
    Seungyeon Kim
    Veeru Sadhanala
    Rakesh Shivanna
    Rohan Anil
    Aditya Menon
    Hrayr Harutyunyan
arXiv (2024)
A primary challenge in large language model (LLM) development is their onerous pre-training cost. This paper explores a promising paradigm to improve LLM pre-training efficiency and quality by leveraging a small language model (SLM). In particular, this paradigm relies on an SLM to both (1) provide soft labels as additional supervision, and (2) select a small subset of valuable training examples. Put together, this enables an effective transfer of the SLM's predictive distribution to the LLM, while prioritizing specific regions of the training data distribution. Empirically, this leads to reduced LLM training time compared to standard training, while improving the overall quality. Theoretically, we develop a statistical framework to study the utility of SLMs in enabling efficient training of high-quality LLMs. Our framework characterizes how the SLM's seemingly low-quality supervision can enhance the training of a much more capable LLM. Furthermore, it also highlights the need for an adaptive utilization of such supervision, by striking a balance between the bias and variance introduced by the SLM-provided soft labels. We corroborate our theoretical framework by improving the pre-training of LLMs with 2.8B and 8.6B parameters by utilizing smaller LMs on the Pile dataset.
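A minimal sketch of the two ingredients the abstract names, in PyTorch: (1) distill the SLM's soft labels into the LLM, and (2) keep only a subset of valuable examples. Here "valuable" is proxied by the SLM's per-example loss, a stand-in for the paper's selection criterion; the weights `lam` and `keep` are illustrative names, not the paper's.

```python
import torch
import torch.nn.functional as F

def slm_assisted_loss(llm_logits, slm_logits, labels, lam=0.5, keep=0.5):
    """Hypothetical sketch: soft-label distillation from an SLM plus
    SLM-guided example selection, per batch."""
    hard = F.cross_entropy(llm_logits, labels, reduction="none")
    # Per-example KL from the SLM's predictive distribution (soft labels).
    soft = F.kl_div(llm_logits.log_softmax(-1),
                    slm_logits.softmax(-1), reduction="none").sum(-1)
    # Select the examples the SLM itself finds hardest.
    slm_loss = F.cross_entropy(slm_logits, labels, reduction="none")
    k = max(1, int(keep * len(labels)))
    idx = slm_loss.topk(k).indices
    # Balance hard labels against (possibly biased) SLM soft labels.
    return ((1 - lam) * hard[idx] + lam * soft[idx]).mean()
```

The bias-variance balance the paper highlights corresponds, in this sketch, to tuning `lam`: heavier weight on the SLM's soft labels reduces variance but inherits the SLM's bias.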
Recent advances in language model (LM) design have yielded a series of models with remarkably improved quality on complex NLP tasks, but significantly increased inference cost. A simple strategy to achieve more favourable cost-quality tradeoffs is cascading: here, a small model is invoked for most "easy" instances, while a large model is invoked for a few "hard" instances. Typically, "easy" instances are those where the small model has high confidence in its prediction. While the principles underpinning effective cascading are well-studied for classification problems, a similar understanding is lacking for generative tasks. The extension of the simple "Chow" rule, which defers based on the probability of predicting an answer, is not straightforward for generative tasks, where the number of output tokens is variable. Moreover, LMs are known to suffer from length bias, where longer answers are penalized more than shorter ones, which complicates matters further. In this work, we initiate a systematic study of deferral rules for cascades of language models. For example, how does one best summarise model confidence across a variable number of output tokens? We show experimentally that there is no single straightforward extension of probability-based uncertainty for LMs that works well across all tasks. Via experiments on a range of benchmarks with FLAN-T5 models, we find that incorporating token-level uncertainty can significantly improve the cost-quality tradeoff of cascades. We further show that incorporating embeddings from the smaller model and intermediate-layer embeddings from the larger model can further boost performance.
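One plausible way to summarise confidence over a variable number of output tokens, in the spirit of the token-level uncertainty discussed above, is to aggregate per-token log-probabilities with a quantile rather than their sum (which is length-biased) or mean. The quantile `q` and `threshold` below are illustrative choices, not the paper's tuned values.

```python
import torch

def defer_from_token_logprobs(token_logprobs, q=0.2, threshold=-1.5):
    """Sketch of a length-robust deferral rule for LM cascades.
    token_logprobs: 1-D tensor of log p(token_t | prefix) for the
    small model's generated answer."""
    score = torch.quantile(token_logprobs, q)  # low-confidence tail
    return bool(score < threshold)             # True => defer to large model
```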
    USTAD: Unified Single-model Training Achieving Diverse Scores for Information Retrieval
    Seungyeon Kim
    Manzil Zaheer
    Veeru Sadhanala
    Sadeep Jayasumana
    Aditya Menon
    Rob Fergus
    International Conference on Machine Learning (ICML) (2024)
Modern information retrieval (IR) systems consist of multiple stages, such as retrieval and ranking. Transformers are employed across these different IR stages, achieving state-of-the-art performance, but each model is trained separately, leading to complex pipelines and increased cost for maintaining multiple models. The apparent need for separate models is due to different input/output semantics at different stages. In this paper, we challenge this tradition of using separate models: since transformers are highly expressive models, would changing just the score function suffice? We present a new unified approach, USTAD, to train a single network that can provide powerful ranking scores as a cross-encoder (CE) as well as factorized embeddings for large-scale retrieval as a dual-encoder (DE). Empirically, we find a single USTAD model to be competitive with separate ranking CE and retrieval DE models. Furthermore, USTAD enables new distillation techniques, significantly improving CE-to-DE distillation. Moreover, using a USTAD teacher, we can deploy novel asymmetric architectures for student models, which realize better embedding alignment without increasing online inference cost. On standard benchmarks like MSMARCO, we show that our approach successfully distills from both dual-encoder (DE) and cross-encoder (CE) teacher models to asymmetric students 1/10th the size that retain 95-97% of the teacher's performance.
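A toy sketch of one transformer serving both stages, in the spirit of USTAD: a concatenated (query, document) input scored through a ranking head for the CE role, and separate encodings compared by dot product for the DE role. The pooling, head, and naming below are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn.functional as F

class UnifiedScorer(torch.nn.Module):
    """Hypothetical single-model CE/DE scorer; `encoder` is any
    module mapping token ids to a pooled [batch, dim] embedding."""
    def __init__(self, encoder, dim):
        super().__init__()
        self.encoder = encoder               # one shared transformer
        self.ce_head = torch.nn.Linear(dim, 1)

    def ce_score(self, query_doc_ids):       # ranking stage (cross-encoder)
        h = self.encoder(query_doc_ids)
        return self.ce_head(h).squeeze(-1)

    def de_embed(self, ids):                 # retrieval stage (dual-encoder)
        return F.normalize(self.encoder(ids), dim=-1)

    def de_score(self, q_ids, d_ids):
        return (self.de_embed(q_ids) * self.de_embed(d_ids)).sum(-1)
```

Because `de_embed` factorizes over queries and documents, document embeddings can be indexed offline, while `ce_score` remains available for reranking a short list.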
    A Sketch Is Worth a Thousand Words: Image Retrieval with Text and Sketch
    Patsorn Sangkloy
    Diyi Yang
    James Hays
    The European Conference on Computer Vision (ECCV) (2022) (to appear)
We address the problem of retrieving in-the-wild images with both a sketch and a text query. We present TASK-former (Text And SKetch transformer), an end-to-end trainable model for image retrieval using a text description and a sketch as input. We argue that the two input modalities complement each other in a manner that cannot be achieved easily by either one alone. TASK-former follows the late-fusion dual-encoder approach, similar to CLIP, which allows efficient and scalable retrieval since the retrieval set can be indexed independently of the queries. We empirically demonstrate that using an input sketch (even a poorly drawn one) in addition to text considerably increases retrieval recall compared to traditional text-based image retrieval. To evaluate our approach, we collect 5,000 hand-drawn sketches for images in the test set of the COCO dataset. The collected sketches are available at https://janesjanes.github.io/tsbir/.
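A minimal late-fusion retrieval sketch in the spirit of TASK-former, assuming the text and sketch encoders have already produced embeddings: the two query embeddings are combined, then scored against a pre-indexed gallery of image embeddings. Summation is one simple fusion choice for illustration; the paper's exact fusion may differ.

```python
import torch
import torch.nn.functional as F

def fuse_and_retrieve(text_emb, sketch_emb, image_embs):
    """Toy late-fusion retrieval: text_emb, sketch_emb are [dim];
    image_embs is a pre-computed [num_images, dim] index."""
    query = F.normalize(text_emb + sketch_emb, dim=-1)
    gallery = F.normalize(image_embs, dim=-1)
    sims = gallery @ query                   # cosine similarities
    return sims.argsort(descending=True)     # ranked image indices
```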