Publications

Our teams aspire to make discoveries that impact everyone, and core to our approach is sharing our research and tools to fuel progress in the field[a19ad0].

1 - 15 of 11067 publications
    FreshBrew: A Benchmark for Evaluating AI Agents on Java Code Migration
    Diganta Misra
    Yanqi Luo
    Anjali Sridhar
    Justine Gehring
    Silvio Soares Ribeiro Junior
    2026
    AI coding assistants are rapidly becoming integral to modern software development. A key challenge in this space is the continual need to migrate and modernize codebases in response to evolving software ecosystems. Traditionally, such migrations have relied on rule-based systems and human intervention. With the advent of powerful large language models (LLMs), AI-driven agentic frameworks offer a promising alternative, but their effectiveness remains underexplored. In this paper, we introduce FreshBrew, a novel benchmark for evaluating AI-based agentic frameworks on project-level Java migrations. We benchmark several such frameworks, powered by state-of-the-art LLMs, and compare their performance against established rule-based tools. Our evaluation of AI agents on this benchmark of 228 repositories shows that the top-performing model, Gemini 2.5 Flash, can successfully migrate 56.5% of projects to JDK 17. Our empirical analysis reveals the critical strengths and limitations of current agentic approaches, offering actionable insights into their real-world applicability. By releasing FreshBrew publicly upon acceptance, we aim to facilitate rigorous, reproducible evaluation and catalyze progress in AI-driven codebase modernization.
    How many T gates are needed to approximate an arbitrary n-qubit quantum state to within a given precision ε? Improving prior work of Low, Kliuchnikov and Schaeffer, we show that the optimal asymptotic scaling is Θ(√(2^n log(1/ε)) + log(1/ε)) if we allow an unlimited number of ancilla qubits. We also show that this is the optimal T-count for implementing an arbitrary diagonal n-qubit unitary to within error ε. We describe an application to batched synthesis of single-qubit unitaries: we can approximate a tensor product of m = O(log log(1/ε)) arbitrary single-qubit unitaries to within error ε with the same asymptotic T-count as is required to approximate just one single-qubit unitary.
    CrossCheck: Input Validation for WAN Control Systems
    Rishabh Iyer
    Isaac Keslassy
    Sylvia Ratnasamy
    Networked Systems Design and Implementation (NSDI) (2026) (to appear)
    We present CrossCheck, a system that validates inputs to the Software-Defined Networking (SDN) controller in a Wide Area Network (WAN). By detecting incorrect inputs, which often stem from bugs in the SDN control infrastructure, CrossCheck alerts operators before those inputs trigger network outages. Our analysis at a large-scale WAN operator identifies invalid inputs as a leading cause of major outages, and we show how CrossCheck would have prevented those incidents. We deployed CrossCheck as a shadow validation system for four weeks in a production WAN, during which it accurately detected the single incident of invalid inputs that occurred while sustaining a 0% false positive rate under normal operation, hence imposing little additional burden on operators. In addition, we show through simulation that CrossCheck reliably detects a wide range of invalid inputs (e.g., detecting demand perturbations as small as 5% with 100% accuracy) and maintains a near-zero false positive rate for realistic levels of noisy, missing, or buggy telemetry data (e.g., sustaining zero false positives with up to 30% of corrupted telemetry data).
    mmMUSE: An mmWave-based Motion-resilient Universal Speech Enhancement System
    Chenming He
    Yanyong Zhang
    Kai Wang
    Dequan Wang
    Lingyu Wang
    the Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies (IMWUT), ACM (2026) (to appear)
    Voice-based smart systems can greatly enhance user experiences by allowing higher-quality interactions through better voice perception. Speech enhancement can benefit such systems by isolating noise from speech. Recently, integrating millimeter-wave (mmWave) with audio for speech perception has gained increasing attention due to microphones' limitations in noisy environments. However, mmWave-based vocal extraction is severely affected by motion, which disperses vocal signals across ranges and introduces distortions. In this paper, we propose an mmWave-based motion-resilient universal speech enhancement system called mmMUSE, which fuses mmWave and audio signals. To mitigate motion interference, we develop a Doppler-based method for motion-robust vocal signal extraction. Moreover, by introducing the Vocal-Noise-Ratio metric to assess the prominence of vocal signals from mmWave, we achieve real-time voice activity detection that gains 3.81 dB of SISDR in noisy speech. Additionally, we design a two-stage complex-valued network that includes an attention-based fusion network for cross-modal complementing and a time-frequency masking network for correcting the amplitude and phase of speech to isolate noise. Using mmWave and audio datasets from 46 participants, mmMUSE outperforms the state-of-the-art speech enhancement models, achieving an average SISDR improvement of 3.12 dB. Additionally, mmMUSE achieves SISDR improvements of 16.51 dB, 17.93 dB, 14.93 dB, and 18.95 dB in controlled environments involving intense noise, extensive motion, multiple speakers, and various obstructive materials, respectively. Finally, we evaluate mmMUSE in real-world scenarios including running, public spaces, and driving, maintaining a word error rate (WER) below 10%.
    Semantic data models express high-level business concepts and metrics, capturing the business logic needed to query a database correctly. Most data modeling solutions are built as layers above SQL query engines, with bespoke query languages or APIs. The layered approach means that semantic models can’t be used directly in SQL queries. This paper focuses on an open problem in this space – can we define semantic models in SQL, and make them naturally queryable in SQL? In parallel, graph query is becoming increasingly popular, including in SQL. SQL/PGQ extends SQL with an embedded subset of the GQL graph query language, adding property graph views and making graph traversal queries easy. We explore a surprising connection: semantic data models are graphs, and defining graphs is a data modeling problem. In both domains, users start by defining a graph model, and need query language support to easily traverse edges in the graph, which means doing joins in the underlying data. We propose some useful SQL extensions that make it easier to use higher-level data model abstractions in queries. Users can define a “semantic data graph” view of their data, encapsulating the complex business logic required to query the underlying tables correctly. Then they can query that semantic graph model easily with SQL. Our SQL extensions are useful independently, simplifying many queries – particularly, queries with joins. We make declared foreign key relationships usable for joins at query time – a feature that seems obvious but is notably missing in standard SQL. In combination, these extensions provide a practical approach to extend SQL incrementally, bringing semantic modeling and graph query together with the relational model and SQL.
    Productionizing Quantum Mass Production
    Bill Huggins
    Nathan Wiebe
    arXiv for now (2026) (to appear)
    For many practical applications of quantum computing, the slowest and most costly steps involve coherently accessing classical data. We help address this challenge by applying mass production techniques, which can sometimes allow us to perform operations many times in parallel for a cost that is comparable to a single execution[1-3]. We combine existing mass-production results with modern approaches for loading classical data using “quantum read-only memory.” We show that quantum mass production techniques offer no benefit when we consider a cost model that focuses purely on the number of non-Clifford gates. However, analyzing the constant factors in a more nuanced cost model, we find that it may be possible to obtain a reduction in cost of an order of magnitude or more for a variety of reasonably sized fault-tolerant quantum algorithms. We present several applications of quantum mass-production techniques beyond naive parallelization, including a strategy for reducing the cost of serial calls to the same data loading step.
    Text-guided image editing, fueled by recent advancements in generative AI, is becoming increasingly widespread. This trend highlights the need for a comprehensive framework to verify text-guided edits and assess their quality. To address this need, we introduce EditInspector, a novel benchmark for evaluation of text-guided image edits, based on human annotations collected using an extensive template for edit verification. We leverage EditInspector to evaluate the performance of state-of-the-art (SoTA) vision and language models in assessing edits across various dimensions, including accuracy, artifact detection, visual quality, seamless integration with the image scene, adherence to common sense, and the ability to describe edit-induced changes. Our findings indicate that current models struggle to evaluate edits comprehensively and frequently hallucinate when describing the changes. To address these challenges, we propose two novel methods that outperform SoTA models in both artifact detection and difference caption generation.
    Computer use agents (CUAs) need to plan long-horizon task workflows grounded in diverse, ever-changing applications and environments, but learning is hindered by the scarcity of large-scale, high-quality training data. Existing datasets are small, domain-specific, and costly to annotate, while current synthetic data generation methods often yield brittle, simplistic, or misaligned task demonstrations. We introduce Watch & Learn (W&L), a framework that transforms human demonstration videos available on the Internet into executable UI trajectories at scale. Inspired by robotics, we train an inverse dynamics model that accurately predicts user actions from consecutive screens, bypassing the need for complex heuristics. To scale to the web, we curate a large state-transition corpus and design a retrieval framework that identifies relevant video tutorials, enabling automatic conversion of raw videos into structured UI trajectories without requiring manual annotations. Beyond training data, we show that the generated UI trajectories can also serve as in-context exemplars, providing CUAs with long-horizon priors and domain-specific knowledge at inference time. On the challenging OSWorld and Mind2Web benchmarks, UI trajectories extracted with W&L consistently improve both general-purpose and state-of-the-art frameworks when used in-context, and deliver stronger gains for open-source models when used in training. These results highlight web-scale human demonstration videos as a practical and scalable foundation for advancing CUAs towards real-world deployment.
    Improved best-of-both-worlds regret for bandits with delayed feedback
    Ofir Schlisselberg
    Tal Lancewicki
    Peter Auer
    NeurIPS (2025)
    We study the multi-armed bandit problem with adversarially chosen delays in the Best-of-Both-Worlds (BoBW) framework, which aims to achieve near-optimal performance in both stochastic and adversarial environments. While prior work has made progress toward this goal, existing algorithms suffer from significant gaps to the known lower bounds, especially in the stochastic settings. Our main contribution is a new algorithm that, up to logarithmic factors, matches the known lower bounds in each setting individually. In the adversarial case, our algorithm achieves regret of $\widetilde{O}(\sqrt{KT} + \sqrt{D})$, which is optimal up to logarithmic terms, where $T$ is the number of rounds, $K$ is the number of arms, and $D$ is the cumulative delay. In the stochastic case, we provide a regret bound which scales as $\sum_{i:\Delta_i>0}\left(\log T/\Delta_i\right) + \frac{1}{K}\sum \Delta_i \sigma_{\max}$, where $\Delta_i$ is the sub-optimality gap of arm $i$ and $\sigma_{\max}$ is the maximum number of missing observations. To the best of our knowledge, this is the first BoBW algorithm to simultaneously match the lower bounds in both stochastic and adversarial regimes in delayed environments. Moreover, even beyond the BoBW setting, our stochastic regret bound is the first to match the known lower bound under adversarial delays, improving the second term over the best known result by a factor of $K$.
    Privacy-preserved LLM Cascade via CoT-enhanced Policy Learning
    Xiaozhong Liu
    Kai Zhang
    Congchao Wang
    Liqian Peng
    2025
    Large Language Models (LLMs) have seen increasing attention in on-device applications due to their exceptional ability in real-world tasks. However, on-device LLMs often perform suboptimally due to hardware limitations. Cascading a weaker local (on-device) LLM with a stronger server LLM presents a promising solution to this challenge. While existing research on LLM cascades primarily focuses on optimizing the performance-cost trade-off, privacy concerns remain largely unaddressed. In this work, we prioritize privacy-preserved LLM cascading while enhancing cascade efficiency. To this end, we propose a novel CoT-enhanced policy learning strategy for deferral decision-making, which accounts for both performance-cost trade-offs and privacy considerations. Extensive experiments on three benchmark datasets validate the effectiveness and superiority of our approach.
    Reducing Symbiosis Bias through Better A/B Tests of Recommendation Algorithms
    Yahu Cong
    Yiwei Yu
    Lina Lin
    Yajun Peng
    Changping Meng
    Ningren (Peter) Han
    David Holtz
    Proceedings of WWW'25 (2025)
    It is increasingly common in digital environments to use A/B tests to compare the performance of recommendation algorithms. However, such experiments often violate the stable unit treatment value assumption (SUTVA), particularly SUTVA's “no hidden treatments” assumption, due to the shared data between algorithms being compared. This results in a novel form of bias, which we term “symbiosis bias,” where the performance of each algorithm is influenced by the training data generated by its competitor. In this paper, we investigate three experimental designs (cluster-randomized, data-diverted, and user-corpus co-diverted experiments) aimed at mitigating symbiosis bias. We present a theoretical model of symbiosis bias and simulate the impact of each design in dynamic recommendation environments. Our results show that while each design reduces symbiosis bias to some extent, they also introduce new challenges, such as reduced training data in data-diverted experiments. We further validate the existence of symbiosis bias using data from a large-scale A/B test conducted on a global recommender system, demonstrating that symbiosis bias affects treatment effect estimates in the field. Our findings provide actionable insights for researchers and practitioners seeking to design experiments that accurately capture algorithmic performance without bias in treatment effect estimates introduced by shared data.
    Beyond Digital Literacy: Building Youth Digital Resilience Through Existing “Information Sensibility” Practices
    Mia Hassoun
    Ian Beacock
    Todd Carmody
    Patrick Gage Kelley
    Beth Goldberg
    Devika Kumar
    Laura Murray
    Rebekah Park
    Behzad Sarmadi
    Social Sciences Journal, 14(4) (2025)
    Youth media consumption and disordered eating practices have historically been subjects of moral panics, often resulting in protective, deficit-based interventions like content removal. We argue for interventions which instead equip youth to evaluate and manage risks in their online environments, building upon their existing “information sensibility” practices. Drawing upon ethnographic research and intervention testing with 77 participants in the US and India, we analyze how youth (aged 13–26), including those with diverse political perspectives and those recovering from disordered eating (DE), engage with online news and health information. Participants generally encountered information online algorithmically (rather than searching for it), and their engagement was shaped more by social motivations, like belonging, than by truth seeking. Participants interpreted online information collaboratively, relying on social cues and peer validation within their online communities. They demonstrated a preference for personal testimonies and relatable sources, particularly those with similar social identities. We propose resilience-building interventions that build upon these youth online information practices by: (1) leveraging peer networks, promoting critical information engagement through collaborative learning and peer-to-peer support within online communities; (2) developing social media sensibility, equipping youth to critically evaluate information sources in situ; (3) providing pathways offline, connecting youth to desired in-person communities; and (4) encouraging probabilistic thinking.
    Scaling Embedding Layers in Language Models
    Da Yu
    Yangsibo Huang
    Pritish Kamath
    Daogao Liu
    Chiyuan Zhang
    2025
    As large language models (LLMs) improve in their capacity to serve as personal AI assistants, their ability to output uniquely tailored, personalized responses that align with the soft preferences of their users is imperative for maximizing user satisfaction and retention. However, lay users are notoriously bad at prompt specification and often struggle with conveying their latent preferences to AI assistants. To resolve this, we demonstrate that activation steering, an inference-time method, can effectively steer the responses of LLMs towards expressing different preferences. In contrast to memory-based personalization methods that require long user history, steering is extremely lightweight and easily controllable via an interpretable linear strength factor. We further conduct a within-subjects user study (n=14) to investigate how end users personalize their conversations through three different steerable chatbot interfaces. The results demonstrate the effectiveness of preference-based steering for aligning real-world conversations with user preferences, and we discuss qualitative findings on how diverse values around control, transparency, and usability of personalization lead users to prefer different interfaces.
    Fast and Effective GNN Training through Sequences of Random Path Graphs
    Fabio Vitale
    Francesco Paolo Nerini
    Andre Panisson
    Francesco Bonchi
    2025
    We present a novel scalable framework for training GNNs in node classification tasks, based on effective resistance, a standard tool in spectral graph theory. Unlike other spectral and graph modification approaches to GNN training, our method progressively refines the GNN weights on a sequence of random spanning trees suitably transformed into path graphs which, despite their simplicity, are shown to retain essential topological and node information of the original input graph. The sparse nature of these path graphs substantially lightens the computational burden of GNN training. This not only enhances scalability but also improves accuracy in subsequent test phases. In particular, we focus on small training set regimes, which are of great practical importance, since in many real-world scenarios labels may be challenging to obtain. We show that our framework yields very good empirical results because it effectively counters the training deterioration caused by overfitting when the training set is small. Moreover, we successfully address common issues like over-squashing and over-smoothing while, at the same time, avoiding under-reaching phenomena. Although our framework is flexible and can be deployed in several types of GNNs, in this paper we focus on graph convolutional networks and carry out an extensive experimental investigation on a number of real-world graph benchmarks, where we achieve simultaneous improvement of training speed and test accuracy over a wide pool of representative baselines.
