Publications

Our teams aspire to make discoveries that impact everyone, and core to our approach is sharing our research and tools to fuel progress in the field[a19ad0].

1 - 15 of 11254 publications
    Communicating spatial tasks via text or speech creates "a mental mapping gap" that limits an agent’s expressiveness. Inspired by co-speech gestures in face-to-face conversation, we propose AgentHands, an LLM-powered XR system that equips agents with hands to render responses clearer and more engaging. Guided by a design taxonomy distilled from a formative study (N=10), we implement a novel pipeline to generate and render a hand agent that augments conversational responses with synchronized, space-aware, and interactive hand gestures: using a meta-instruction, AgentHands generates verbal responses embedded with GestureEvents aligned to specific words; each event specifies gesture type and parameters. At runtime, a parser converts events into time-stamped poses and motions, driving an animation system that renders expressive hands synchronized with speech. In a within-subjects study (N=12), AgentHands increased engagement and made spatially grounded conversations easier to follow compared to a speech-only baseline.
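To make the parsing step concrete, here is a minimal sketch of turning a response with embedded gesture events into plain text plus time-stamped events. The abstract does not give the GestureEvent format, so the `[gesture:TYPE key=value]` tag syntax, the `GestureEvent` fields, and the fixed per-word duration are all assumptions made for illustration, not the AgentHands implementation.

```python
import re
from dataclasses import dataclass

# Hypothetical inline tag syntax, e.g. "[gesture:point target=shelf]".
# The real GestureEvent encoding is not specified in the abstract.
TAG = re.compile(r"\[gesture:(\w+)((?:\s+\w+=[^\s\]]+)*)\]")


@dataclass
class GestureEvent:
    kind: str          # gesture type, e.g. "point"
    params: dict       # gesture parameters as key/value strings
    word_index: int    # index of the spoken word the gesture aligns with
    start_s: float     # assumed start time in seconds


def parse_response(text, seconds_per_word=0.4):
    """Split a tagged response into spoken text and time-stamped events.

    Timing is a crude assumption: every word takes `seconds_per_word`.
    """
    words, events = [], []
    # The capturing group makes re.split keep the tags in the result.
    for token in re.split(r"(\[gesture:[^\]]*\])", text):
        match = TAG.fullmatch(token)
        if match:
            params = dict(p.split("=", 1) for p in match.group(2).split())
            index = len(words)  # the gesture attaches to the next word
            events.append(
                GestureEvent(match.group(1), params, index, index * seconds_per_word)
            )
        else:
            words.extend(token.split())
    return " ".join(words), events
```

A real system would presumably take word timings from the speech synthesizer rather than a constant rate; the constant here only keeps the sketch self-contained.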
    The emergence of Agentic AI (autonomous systems capable of reasoning, decision-making, and multi-step execution) represents a paradigm shift in enterprise technology. Moving beyond simple generative tasks, these agents offer the potential to solve long-standing industry pain points, with over 90% of enterprises planning integration within the next three years. However, the transition from successful proof-of-concept (PoC) to a resilient, production-grade system presents significant hurdles. This article categorizes these challenges into three primary domains: (1) Technical and Engineering Hurdles: issues such as "entangled workflows" that complicate debugging, the struggle to maintain output quality and mitigate hallucinations, and the unpredictability caused by shifting underlying models or data sources. (2) People, Process, and Ecosystem Hurdles: the high operational costs and unclear ROI of large models, the necessity of a new "Agent Ops" skillset, the complexity of integrating agents with disparate enterprise systems, and a rapidly evolving regulatory landscape. (3) The Pace of Change and Security Risks: the technical debt incurred by shifting software frameworks and the expanded attack surface created by autonomous agents. The article concludes that successful deployment requires a shift from informal "vibe-testing" to rigorous engineering discipline. By adopting code-first frameworks, establishing robust evaluation metrics (KPIs), and prioritizing functional deployment over theoretical optimization, organizations can effectively manage the lifecycle of Agentic AI and realize its transformative business value.
    The field of Human-Computer Interaction is approaching a critical inflection point, moving beyond the era of static, deterministic systems into a new age of self-evolving systems. We introduce the concept of Adaptive generative interfaces that move beyond static artifacts to autonomously expand their own feature sets at runtime. Rather than relying on fixed layouts, these systems utilize generative methods to morph and grow in real-time based on a user’s immediate intent. The system operates through three core mechanisms: Directed synthesis (generating new features from direct commands), Inferred synthesis (generating new features for unmet needs via inferred commands), and Real-time adaptation (dynamically restructuring the interface's visual and functional properties at runtime). To empirically validate this paradigm, we executed a within-subject (repeated measures) comparative study (N=72) utilizing 'Penny,' a digital banking prototype. The experimental design employed a counterbalanced Latin Square approach to mitigate order effects, such as learning bias and fatigue, while comparing a Deterministic interface baseline against an Adaptive generative interface. Participant performance was verified through objective screen-capture evidence, with perceived usability quantified using the industry-standard System Usability Scale (SUS). The results demonstrated a profound shift in user experience: the Adaptive generative version achieved a SUS score of 84.38 ('Excellent'), significantly outperforming the Deterministic version’s score of 53.96 ('Poor'). With a statistically significant mean difference of 30.42 points (p < 0.0001) and a large effect size (d=1.04), these findings confirm that reducing 'navigation tax' through adaptive generative interfaces directly correlates with a substantial increase in perceived usability.
    We conclude that deterministic interfaces are no longer sufficient to manage the complexity of modern workflows. The future of software lies not in a fixed set of pre-shipped features, but in dynamic capability sets that grow, adapt, and restructure themselves in real-time to meet the specific intent of the user. This paradigm shift necessitates a fundamental transformation in product development, requiring designers to transcend traditional, linear workflows and evolve into 'System Builders': architects of the design principles and rules that facilitate this new age of self-evolving software.
    Modern user interfaces are complex composites, with elements originating from various sources, such as the operating system, apps, a web browser, or websites. Many security and privacy models implicitly depend on users correctly identifying an element's source, a concept we term "surface attribution." Through two large-scale vignette-based surveys (N=4,400 and N=3,057), we present the first empirical measurement of this ability. We find that users struggle, correctly attributing UI source only 55% of the time on desktop and 53% on mobile. Familiarity and strong brand cues significantly improve accuracy, whereas UI positioning, a long-held security design concept especially for browsers, has minimal impact. Furthermore, simply adding a "Security & Privacy" brand cue to Android permission prompts failed to improve attribution. These findings demonstrate a fundamental gap in users' mental models, indicating that relying on them to distinguish trusted UI is a fragile security paradigm.
    Expert evaluation of LLM world models: A high-Tc superconductivity case study
    Haoyu Guo
    Maria Tikhanovskaya
    Paul Raccuglia
    Alexey Vlaskin
    Chris Co
    Scott Ellsworth
    Matthew Abraham
    Lizzie Dorfman
    Peter Armitage
    Chunhan Feng
    Antoine Georges
    Olivier Gingras
    Dominik Kiese
    Steve Kivelson
    Vadim Oganesyan
    Brad Ramshaw
    Subir Sachdev
    Senthil Todadri
    John Tranquada
    Eun-Ah Kim
    Proceedings of the National Academy of Sciences (2026)
    Large Language Models (LLMs) show great promise as a powerful tool for scientific literature exploration. However, their effectiveness in providing scientifically accurate and comprehensive answers to complex questions within specialized domains remains an active area of research. This work evaluates the performance of six different LLM-based systems for answering scientific literature questions, including commercially available closed models and a custom retrieval-augmented generation (RAG) system capable of retrieving images alongside text. We conduct a rigorous expert evaluation of the systems in the domain of high-temperature cuprate superconductors, a research area that involves material science, experimental physics, computation, and theoretical physics. We use an expert-curated database of 1726 scientific papers and a set of 67 expert-formulated questions. The evaluation employs a multi-faceted rubric assessing balanced perspectives, factual comprehensiveness, succinctness, evidentiary support, and image relevance. Our results demonstrate that RAG-based systems, powered by curated data and multimodal retrieval, outperform existing closed models across key metrics, particularly in providing comprehensive and well-supported answers, and in retrieving relevant visual information. This study provides valuable insights into designing and evaluating specialized scientific literature understanding systems, particularly with expert involvement, while also highlighting the importance of rich, domain-specific data in such systems.
    A growing body of qualitative research has identified contextual risk factors that elevate people’s chances of experiencing digital-safety attacks. However, the lack of quantitative data on the population-level distribution of these risk factors prevents policymakers and tech companies from developing targeted, evidence-based interventions to improve digital safety. To address this gap, we surveyed 5,001 adults in the United States to analyze: (1) the frequency of and relationship between digital-safety attacks (e.g., scams, harassment, account hacking), and (2) how these attacks align with 10 contextual risk factors. Nearly half of our respondents identify as resource constrained, which significantly correlates with a higher likelihood of experiencing four common attacks. We also present qualitative insights to expand our understanding of the factors beyond the existing literature (e.g., “prominence” included high-visibility roles in local communities). This study provides the first large-scale quantitative analysis correlating digital-safety attacks with contextual risk factors and demographics.
    We introduce AMS (Activation-based Model Scanner), a tool for verifying whether a language model is safe to deploy by analyzing its internal activation patterns. While "uncensored" and maliciously fine-tuned models pose increasing risks, current detection methods rely on behavioral testing that is slow, incomplete, and easily evaded. AMS takes a fundamentally different approach: measuring the geometric structure of safety-relevant concepts in the model's activation space. Safe models exhibit strong class separation (4-8σ) between harmful and benign content; models with removed or degraded safety training show collapsed separation (<2σ). Using contrastive prompt pairs and direction vector analysis, AMS performs model-level verification rather than prompt-level classification. We validate AMS across 14 model configurations spanning 3 architecture families (Llama, Gemma, Qwen), 3 quantization levels (FP16, INT8, INT4), and multiple model categories (instruction-tuned, base, abliterated, uncensored). In our validation set: (1) all four instruction-tuned models pass with 3.8-8.4σ separation; (2) three tested uncensored models (Dolphin, Lexi, LLama-3-8b-Uncensored) are flagged as CRITICAL with 1.1-1.3σ on harmful content; (3) an abliterated Llama variant is flagged as WARNING (3.33σ); (4) the Llama base model shows 0.69σ, confirming absence of safety training; (5) quantization has minimal impact (<5% drift). One model labeled "uncensored" (DarkIdol) unexpectedly passed, suggesting either mislabeling or a technique that preserves activation geometry. AMS also provides identity verification via direction vector comparison. Scanning completes in 10-40 seconds per model on GPU hardware. We discuss threshold calibration, limitations of our validation scope, and directions for broader evaluation.
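One plausible reading of "class separation in σ" along a direction vector can be sketched as follows. The abstract does not define the statistic, so the difference-of-means direction and the pooled-standard-deviation normalization are assumptions for illustration, not the AMS method; `harmful` and `benign` are hypothetical lists of activation vectors collected from contrastive prompt pairs.

```python
import math
from statistics import mean, pvariance


def direction_vector(harmful, benign):
    """Unit vector pointing from the benign mean to the harmful mean."""
    dims = len(harmful[0])
    diff = [
        mean(h[i] for h in harmful) - mean(b[i] for b in benign)
        for i in range(dims)
    ]
    norm = math.sqrt(sum(x * x for x in diff))
    return [x / norm for x in diff]  # assumes the two means differ


def separation_sigma(harmful, benign):
    """Gap between projected class means, in pooled standard deviations.

    Assumed definition: |mean_h - mean_b| / sqrt((var_h + var_b) / 2),
    computed on activations projected onto the direction vector.
    """
    direction = direction_vector(harmful, benign)

    def project(vec):
        return sum(a * b for a, b in zip(vec, direction))

    proj_h = [project(v) for v in harmful]
    proj_b = [project(v) for v in benign]
    pooled_std = math.sqrt((pvariance(proj_h) + pvariance(proj_b)) / 2)
    return abs(mean(proj_h) - mean(proj_b)) / pooled_std  # assumes nonzero spread
```

Under this reading, a model whose harmful and benign activations overlap along the direction yields a small value, while well-separated clusters yield a large one.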
    Multimodal large language models (LLMs) integrate and process information from multiple modalities such as text, images, audio, and video, enabling complex tasks such as audio translation and visual question answering. While powerful, this complexity introduces novel vulnerabilities to sophisticated adversarial attacks. This survey paper provides a comprehensive overview of this rapidly expanding field, systematically categorizing attacks that range from manipulations of single modalities (e.g., perturbed images or audio) to those exploiting cross-modal interactions. We review how these attacks exploit weaknesses in model fusion, attention mechanisms, and representation learning, and provide analyses of their potential for real-world consequences.
    The rapid expansion of the Internet of Things (IoT) and smart home ecosystems has led to a fragmented landscape of user data management across consumer electronics (CE) such as Smart TVs, gaming consoles, and set-top boxes. Current onboarding processes on these devices are characterized by high friction due to manual data entry and opaque data-sharing practices. This paper introduces the User Data Sharing System (UDSS), a platform-agnostic framework designed to facilitate secure, privacy-first PII (Personally Identifiable Information) exchange between device platforms and third-party applications. Our system implements a Contextual Scope Enforcement (CSE) mechanism that programmatically restricts data exposure based on user intent, specifically distinguishing between Sign-In and Sign-Up workflows. Unlike cloud-anchored identity standards such as FIDO2/WebAuthn, UDSS is designed for shared, device-centric CE environments where persistent user-to-device binding cannot be assumed. We further propose a tiered access model that balances developer needs with regulatory compliance (GDPR/CCPA). A proof-of-concept implementation on a reference ARMv8 Linux-based middleware demonstrates that UDSS reduces user onboarding latency by 65% and measurably reduces PII over-exposure risk through protocol-enforced data minimization. This framework provides a standardized approach to identity management in the heterogeneous CE market.
    Bi-level Hierarchical Neural Contextual Bandits for Online Recommendation
    Yunzhe Qi
    Yikun Ban
    Allan Stewart
    Chuanwei Ruan
    Jiachuan He
    Shishir Kumar Prasad
    Haixun Wang
    Jingrui He
    Transactions on Machine Learning Research (2026)
    Contextual bandit algorithms aim to identify the optimal choice among a set of candidate arms, based on their contextual information. Among others, the neural contextual bandit algorithms have demonstrated generally superior performance compared to traditional linear and kernel-based methods. Nevertheless, neural methods are not inherently suitable to handle a large number of candidate arms due to their high computational cost when performing neural exploration. Motivated by the widespread availability of arm category information (e.g., movie genres, retailer types), we formulate contextual bandits into a bi-level recommendation problem based on the accessible arm category information, and propose a novel neural bandit framework, named H2N-Bandit, which utilizes a bi-level hierarchical neural structure to mitigate the substantial computational cost found in conventional neural bandit methods. To demonstrate its effectiveness, we provide the regret bound for H2N-Bandit under the over-parameterized neural bandit settings. Furthermore, to illustrate its efficiency, we conduct extensive experiments on multiple real-world public data sets with various specifications, showing that H2N-Bandit can significantly reduce the computational cost over existing non-linear methods while achieving better or comparable performances against state-of-the-art baselines.
    The Perfection Paradox: From Architect to Curator in AI-Assisted API Design
    JJ Geewax
    David R Karger
    Extended Abstracts of the 2026 CHI Conference on Human Factors in Computing Systems (CHI EA '26), ACM, Barcelona, Spain, TBD
    Enterprise API design is often bottlenecked by the tension between rapid feature delivery and the rigorous maintenance of usability standards. We present an industrial case study evaluating an AI-assisted design workflow trained on API Improvement Proposals (AIPs). Through a controlled study with 16 industry experts, we compared AI-generated API specifications against human-authored ones. While quantitative results indicated AI superiority in 10 of 11 usability dimensions and an 87% reduction in authoring time, qualitative analysis revealed a paradox: experts frequently misidentified AI work as human (19% accuracy) yet described the designs as unsettlingly “perfect.” We characterize this as a “Perfection Paradox,” where hyper-consistency signals a lack of pragmatic human judgment. We discuss the implications of this perfection paradox, proposing a shift in the human designer’s role from the “drafter” of specifications to the “curator” of AI-generated patterns.
    Neural general circulation models for modeling precipitation
    Stephan Hoyer
    Dmitrii Kochkov
    Janni Yuval
    Ian Langmore
    Science Advances (2026)
    Climate models struggle to accurately simulate precipitation, particularly extremes and the diurnal cycle. While hybrid models combining machine learning and physics have emerged with the premise of improving precipitation simulations, none have proven sufficiently skillful or stable to outperform existing models in simulating precipitation. Here, we present the first hybrid model that is trained directly on precipitation observations. The model runs at 2.8 degrees resolution and is built on the differentiable NeuralGCM framework. This model is stable for decadal simulations and demonstrates significant improvements over existing GCMs, ERA5 reanalysis, and a Global Cloud-Resolving Model in simulating precipitation. Our approach yields reduced biases, a more realistic precipitation distribution, improved representation of extremes, and a more accurate diurnal cycle. Furthermore, it outperforms the ECMWF ensemble for mid-range weather forecasting. This advance paves the way for more reliable simulations of current climate and for the ability to fully utilize the abundance of existing observations to further improve GCMs.
    ARM MTE Performance in Practice
    Taehyun Noh
    Yingchen Wang
    Tal Garfinkel
    Mahesh Madhav
    Mattan Erez
    Shravan Narayan
    Usenix Security (2026)
    As artificial intelligence (AI) is rapidly integrated into healthcare, ensuring that this innovation helps to combat health inequities requires engaging marginalized communities in health AI futuring. However, little research has examined Black populations’ perspectives on the use of AI in health contexts, despite the widespread health inequities they experience, inequities that are already perpetuated by AI. Addressing this research gap, through qualitative workshops with 18 Black adults, we characterize participants’ cautious optimism for health AI addressing structural well-being barriers (e.g., by providing second opinions that introduce fairness into an unjust healthcare system), and their concerns that AI will worsen health inequities (e.g., through health AI biases they deemed inevitable and the problematic reality of having to trust healthcare providers to use AI equitably). We advance health AI research by articulating previously-unreported health AI perspectives from a population experiencing significant health inequities, and presenting key considerations for future work.
    Type-Aware Ranking of Urban Similarity from Aerial Imagery
    Idan Kligvasser
    Yotam Intrator
    Yuval Desheh
    Aviad Barzilai
    Niv Efron
    Ehud Rivlin
    Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) Workshops (2026), pp. 821-829
    Estimating and ranking cross-city similarity from aerial imagery is a fundamental challenge in remote sensing and geospatial representation learning. Urban environments differ widely in road layout, marking conventions, and infrastructure design, yet standard visual representations often struggle to disentangle these meaningful structural variations from superficial appearances. In this work, we propose a type-aware contrastive learning framework that measures urban similarity by explicitly modeling distinct infrastructure elements. Leveraging open-vocabulary retrieval, we construct a globally diverse dataset of road-related features, such as intersections, crosswalks, and bus lanes, and train a type-conditioned Vision Transformer that fuses visual features with CLIP-derived semantic embeddings. Crucially, we introduce an adaptive per-type contrastive loss that dynamically emphasizes infrastructure categories with high discriminative power while down-weighting less informative types. To quantify city-level similarity, we aggregate per-type cosine similarities via a lightweight classifier to generate a global city-to-city similarity matrix. Experiments demonstrate that this type-aware approach significantly improves clustering quality and successfully generalizes to unseen cities, establishing a scalable, interpretable foundation for comparative urban analysis.
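The aggregation of per-type cosine similarities into one city-to-city score can be illustrated with a small sketch. The paper aggregates with a lightweight classifier whose details are not in the abstract; this example swaps in a plain weighted mean instead, and the per-type embeddings, type names, and weights are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def city_similarity(city_a, city_b, weights):
    """Weighted mean of per-type cosine similarities.

    city_a, city_b: dicts mapping an infrastructure type to an embedding.
    weights: dict mapping a type to a non-negative weight (assumed given;
    a stand-in for the paper's learned aggregation).
    Only types present in both cities contribute; at least one shared
    type with positive weight is assumed.
    """
    shared = [t for t in weights if t in city_a and t in city_b]
    total = sum(weights[t] for t in shared)
    return sum(weights[t] * cosine(city_a[t], city_b[t]) for t in shared) / total


def similarity_matrix(cities, weights):
    """Pairwise similarity for a dict of city name -> per-type embeddings."""
    names = list(cities)
    return {
        a: {b: city_similarity(cities[a], cities[b], weights) for b in names}
        for a in names
    }
```

Keeping the per-type similarities separate until the last step is what makes a score like this interpretable: each type's contribution to the final number can be inspected directly.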
