Oriana Riva

Oriana is a researcher at Google DeepMind, where she leads a team working on reinforcement learning and computer control agents. Prior to Google, Oriana was at Microsoft Research in Redmond.
Authored Publications
    AndroidWorld: An Open World for Autonomous Agents
    Jonathan Waltz
    Marybeth Fair
    Daniel Toyama
    Will Bishop
    Sarah Clinckemaillie
    Timothy Lillicrap
    Chris Rawles
    Robert Berry
    Gabrielle Lau
    Divya Tyam
    Yifan Chang
    Alice Li
    Folawiyo Campbell-Ajala
    Wei Li
    ICLR 2025 (2025)
    Abstract: Autonomous computer control agents that execute human tasks by controlling user interfaces (UIs) are emerging. Such agents would be valuable for humans, and progress in the field will be driven by realistic and reproducible benchmarks. We present AndroidWorld, a fully functioning Android environment that provides reward signals across 20 apps on 114 programmatic tasks. Instead of a static test set, the tasks in AndroidWorld are parameterized, allowing for unlimited variation in language and task parameters. Reward signals are derived from Android system state, making them highly durable and extensible across different applications. To demonstrate AndroidWorld's extensibility, we integrate the popular MiniWoB++ benchmark into it. To evaluate AndroidWorld, we introduce a new multimodal autonomous agent for Android, M3A. Our agent achieves a 27% success rate, leaving ample room for future work. Furthermore, we adapt a popular desktop web agent for Android, which we find to be less effective on mobile, suggesting future research is needed to build universal, cross-domain agents. Finally, we conduct robustness testing by testing M3A against a suite of real-world variations on a representative subset of tasks. AndroidWorld and the experiments in this paper are available at https://github.com/google-research/android-world.
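    The parameterized-task design described above can be illustrated with a brief sketch. The class and field names below (TimerTask, device_state, active_timer) are hypothetical illustrations rather than the actual android_world API; they only show how sampling task parameters yields unlimited instruction variants while the reward is checked against system state rather than pixels.

```python
import dataclasses
import random


@dataclasses.dataclass
class TimerTask:
    """Hypothetical parameterized task: set a countdown timer."""
    minutes: int
    seconds: int

    @classmethod
    def sample(cls) -> "TimerTask":
        # New parameters every episode -> new instruction text, same task logic.
        return cls(minutes=random.randint(1, 59), seconds=random.randint(0, 59))

    @property
    def instruction(self) -> str:
        return f"Set a timer for {self.minutes} minutes and {self.seconds} seconds."

    def reward(self, device_state: dict) -> float:
        # Success is checked against device/system state (e.g. the clock app's
        # stored timer), which is more durable than inspecting screen pixels.
        timer = device_state.get("active_timer", {})
        hit = timer.get("minutes") == self.minutes and timer.get("seconds") == self.seconds
        return 1.0 if hit else 0.0


task = TimerTask.sample()
print(task.instruction)
print(task.reward({"active_timer": {"minutes": task.minutes, "seconds": task.seconds}}))  # 1.0
```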
    Abstract: Large language models and large vision models are increasingly capable of solving compositional reasoning tasks, as measured by breakthroughs in visual question-answering benchmarks. However, state-of-the-art solutions often involve careful construction of large pre-training and fine-tuning datasets, which can be expensive. The use of external tools, whether other ML models, search engines, or APIs, can significantly improve performance by breaking down high-level reasoning questions into sub-questions that are answerable by individual tools, but this approach has similar dataset construction costs to teach fine-tuned models how to use the available tools. We propose a technique in which existing training sets can be directly used for constructing computational environments with task metrics as rewards. This enables a model to autonomously teach itself to use itself or another model as a tool. By doing so, we augment training sets by integrating external signals. The proposed method starts with zero-shot prompts and iteratively refines them by selecting few-shot examples that maximize the task metric on the training set. Our experiments showcase how Gemini learns how to use itself, or another smaller and specialized model such as ScreenAI, to iteratively improve performance on training sets. Our approach successfully generalizes and improves upon zero-shot performance on charts, infographics, and document visual question-answering datasets.
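    The iterative prompt-refinement loop described in this abstract can be sketched roughly as follows. call_model and task_metric are hypothetical stand-ins for a model call and an evaluation metric, not an API from the paper; the sketch simply shows a greedy search that keeps adding a few-shot exemplar only while it improves the metric on the training set.

```python
from typing import Callable, List, Tuple

Example = Tuple[str, str]  # (question, gold answer)


def refine_prompt(
    train_set: List[Example],
    call_model: Callable[[str, str], str],     # (prompt, question) -> prediction
    task_metric: Callable[[str, str], float],  # (prediction, gold) -> score
    base_prompt: str = "Answer the question.",
    max_shots: int = 4,
) -> str:
    """Greedily grow a few-shot prompt, guided only by the training-set metric."""

    def score(prompt: str) -> float:
        preds = [call_model(prompt, q) for q, _ in train_set]
        return sum(task_metric(p, a) for p, (_, a) in zip(preds, train_set)) / len(train_set)

    prompt, best = base_prompt, score(base_prompt)  # start from the zero-shot prompt
    for _ in range(max_shots):
        # Try each training example as an additional exemplar; keep the best one.
        candidates = [prompt + f"\nQ: {q}\nA: {a}" for q, a in train_set]
        new_best, new_prompt = max((score(c), c) for c in candidates)
        if new_best <= best:
            break  # no exemplar improves the metric any further
        best, prompt = new_best, new_prompt
    return prompt
```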
    Abstract: Computer use agents (CUAs) need to plan long-horizon task workflows grounded in diverse, ever-changing applications and environments, but learning is hindered by the scarcity of large-scale, high-quality training data. Existing datasets are small, domain-specific, and costly to annotate, while current synthetic data generation methods often yield brittle, simplistic, or misaligned task demonstrations. We introduce Watch & Learn (W&L), a framework that transforms human demonstration videos available on the Internet into executable UI trajectories at scale. Inspired by robotics, we train an inverse dynamics model that accurately predicts user actions from consecutive screens, bypassing the need for complex heuristics. To scale to the web, we curate a large state-transition corpus and design a retrieval framework that identifies relevant video tutorials, enabling automatic conversion of raw videos into structured UI trajectories without requiring manual annotations. Beyond training data, we show that the generated UI trajectories can also serve as in-context exemplars, providing CUAs with long-horizon priors and domain-specific knowledge at inference time. On the challenging OSWorld and Mind2Web benchmarks, UI trajectories extracted with W&L consistently improve both general-purpose and state-of-the-art frameworks when used in-context, and deliver stronger gains for open-source models when used in training. These results highlight web-scale human demonstration videos as a practical and scalable foundation for advancing CUAs towards real-world deployment.
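    A minimal sketch of the inverse-dynamics idea, assuming a hypothetical InverseDynamicsModel and Action type (not the W&L implementation): given two consecutive frames, predict the action that caused the transition, and chain those predictions to turn a raw screen recording into a structured trajectory.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Action:
    kind: str        # e.g. "click", "type", "scroll"
    x: float = 0.0   # normalized screen coordinates
    y: float = 0.0
    text: str = ""


class InverseDynamicsModel:
    """Hypothetical stand-in for a model trained on consecutive-screen pairs."""

    def predict(self, screen_before, screen_after) -> Action:
        # In practice this would be a learned model; a dummy action keeps the sketch runnable.
        return Action(kind="click", x=0.5, y=0.5)


def video_to_trajectory(frames: Sequence, idm: InverseDynamicsModel) -> List[Action]:
    """Convert consecutive video frames into a structured UI trajectory."""
    return [idm.predict(before, after) for before, after in zip(frames, frames[1:])]
```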
    Agent-Initiated Interaction in Phone UI Automation
    Noam Kahlon
    Guy Rom
    Tal Efros
    Omri Berkovitch
    Sapir Caduri
    Will Bishop
    Ido Dagan
    Association for Computing Machinery, New York, NY, USA, 2391–2400
    Abstract: Phone automation agents aim to autonomously perform a given natural-language user request, such as scheduling appointments or booking a hotel. While much research effort has been devoted to screen understanding and action planning, complex tasks often necessitate user interaction for successful completion. Aligning the agent with the user's expectations is crucial for building trust and enabling personalized experiences. This requires the agent to proactively engage the user when necessary, avoiding actions that violate their preferences while refraining from unnecessary questions where a default action is expected. We argue that such subtle agent-initiated interaction with the user deserves focused research attention. To promote such research, this paper introduces a task formulation for detecting the need for user interaction and generating appropriate messages. We thoroughly define the task, including aspects like interaction timing and the scope of the agent's autonomy. Using this definition, we derived annotation guidelines and created a diverse dataset for the task, leveraging an existing UI automation dataset. We tested several text-based and multimodal baseline models for the task, finding that it is very challenging for current LLMs. We suggest that our task formulation, dataset, baseline models and analysis will be valuable for future UI automation research, specifically in addressing this crucial yet often overlooked aspect of agent-initiated interaction. This work provides a needed foundation to allow personalized agents to properly engage the user when needed, within the context of phone UI automation.
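    The task formulation can be pictured as a decision step placed before each planned agent action: proceed with a sensible default, ask the user, or stop, and produce a message when asking. The enum, fields, and rule below are an illustrative sketch, not the paper's annotation schema.

```python
import enum
from dataclasses import dataclass
from typing import Optional


class Decision(enum.Enum):
    PROCEED = "proceed"    # a sensible default exists; don't bother the user
    ASK_USER = "ask_user"  # missing details or user preferences need confirmation
    STOP = "stop"          # the request cannot or should not be completed


@dataclass
class InteractionStep:
    decision: Decision
    message: Optional[str] = None  # only required when decision == ASK_USER


def decide(planned_action: str) -> InteractionStep:
    """Toy rule: confirm with the user before anything irreversible or payment-related."""
    risky = any(word in planned_action.lower() for word in ("pay", "delete", "book"))
    if risky:
        return InteractionStep(
            Decision.ASK_USER,
            f"I'm about to '{planned_action}'. Should I go ahead?",
        )
    return InteractionStep(Decision.PROCEED)


print(decide("Pay the invoice").decision)  # Decision.ASK_USER
```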
    Visual Grounding for User Interfaces
    Yijun Qian
    Yujie Lu
    Alexander G. Hauptmann
    2024 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2024) - Industry Track
    Abstract: Enabling autonomous language agents to drive application user interfaces (UIs) as humans do can significantly expand the capability of today's API-based agents. Essential to this vision is the ability of agents to ground natural language commands to on-screen UI elements. Prior UI grounding approaches work by relying on developer-provided UI metadata (UI trees, such as the web DOM, and accessibility labels) to detect on-screen elements. However, such metadata is often unavailable or incomplete. Object detection techniques applied to UI screens remove this dependency by inferring the location and types of UI elements directly from the UI's visual appearance. The extracted semantics, however, are too limited to directly enable grounding. We overcome the limitations of both approaches by introducing the task of visual UI grounding, which unifies detection and grounding. A model takes as input a UI screenshot and a free-form language expression, and must identify the referenced UI element. We propose a solution to this problem, LVG, which learns UI element detection and grounding using a new technique called layout-guided contrastive learning, where the semantics of individual UI objects are also learned from their visual organization. Due to the scarcity of UI datasets, LVG integrates synthetic data in its training using multi-context learning. LVG outperforms baselines pre-trained on much larger datasets by over 4.9 points in top-1 accuracy, thus demonstrating its effectiveness.
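    A rough sketch of a contrastive grounding objective, in which each referring expression should score highest against the embedding of its own UI element. This is generic InfoNCE-style code under the assumption of paired, L2-normalized embeddings; it is not LVG's layout-guided variant.

```python
import numpy as np


def contrastive_grounding_loss(expr_emb: np.ndarray,
                               elem_emb: np.ndarray,
                               temperature: float = 0.07) -> float:
    """InfoNCE-style loss: expression i should best match UI element i.

    expr_emb, elem_emb: (N, D) L2-normalized embeddings of N paired
    (referring expression, UI element) examples from a batch of screens.
    """
    logits = expr_emb @ elem_emb.T / temperature      # (N, N) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)       # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))        # pull matched pairs together
```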
    UINav: A Practical Approach to Train On-Device Automation Agents
    Wei Li
    Fu-Lin Hsu
    Will Bishop
    Folawiyo Campbell-Ajala
    Max Lin
    2024 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2024) - Industry Track
    Abstract: Automation systems that can autonomously drive application user interfaces to complete user tasks are of great benefit, especially when users are situationally or permanently impaired. Prior automation systems do not produce generalizable models, while AI-based automation agents work reliably only in simple, hand-crafted applications or incur high computation costs. We propose UINav, a demonstration-based approach to train automation agents that fit on mobile devices yet achieve high success rates with modest numbers of demonstrations. To reduce the demonstration overhead, UINav uses a referee model that provides users with immediate feedback on tasks where the agent fails, and automatically augments human demonstrations to increase diversity in training data. Our evaluation shows that with only 10 demonstrations UINav can achieve 70% accuracy, and that with enough demonstrations it can surpass 90% accuracy.
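    The demonstration loop the abstract describes might be sketched as below: run the agent, let a referee model flag failing tasks so the user only demonstrates where needed, and augment each demonstration before it is added to the training data. All function names here are placeholders, not the UINav implementation.

```python
from typing import Callable, List


def collect_demonstrations(
    tasks: List[str],
    run_agent: Callable[[str], list],         # task -> episode (list of steps)
    referee_ok: Callable[[str, list], bool],  # did the episode succeed?
    demonstrate: Callable[[str], list],       # ask the user for a demonstration
    augment: Callable[[list], List[list]],    # e.g. perturb layouts / rephrase text
) -> List[list]:
    """Only request human demonstrations for tasks the agent currently fails."""
    new_data: List[list] = []
    for task in tasks:
        episode = run_agent(task)
        if referee_ok(task, episode):
            continue  # agent already handles this task; no demonstration needed
        demo = demonstrate(task)
        new_data.extend([demo, *augment(demo)])  # augmented copies add diversity
    return new_data
```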
    On the Effects of Data Scale on UI Control Agents
    Wei Li
    Will Bishop
    Alice Li
    Chris Rawles
    Folawiyo Campbell-Ajala
    Divya Tyam
    Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024
    Abstract: Autonomous agents that control computer interfaces to accomplish human tasks are emerging. Leveraging LLMs to power such agents has been of special interest, but unless fine-tuned on human-collected task demonstrations, performance is still relatively low. In this work we study whether fine-tuning alone is a viable approach for building real-world computer control agents. In particular, we investigate how performance, measured on both high- and low-level tasks in domain and out of domain, scales as more training data is collected. To this end we collect and release a new dataset, AndroidControl, consisting of 15,283 demonstrations of everyday tasks with Android apps. Compared to existing datasets, each AndroidControl task instance includes both high- and low-level human-generated instructions, allowing us to explore the level of task complexity an agent can handle. Moreover, AndroidControl is the most diverse computer control dataset to date, including 14,548 unique tasks over 833 Android apps, thus allowing us to conduct in-depth analysis of model performance in and out of the domain of the training data. Using the dataset, we find that, when tested in domain, fine-tuned models outperform zero- and few-shot baselines and scale in such a way that robust performance might feasibly be obtained simply by collecting more data. Out of domain, performance scales significantly more slowly, suggesting that, particularly for high-level tasks, fine-tuning on more data alone may be insufficient for achieving robust out-of-domain performance.
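    The distinction between high- and low-level instructions can be made concrete with a hypothetical episode record; the field names below are illustrative only and do not reflect the released AndroidControl schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Step:
    low_level_instruction: str  # e.g. "Tap the Alarm tab"
    action: dict                # e.g. {"type": "tap", "x": 0.2, "y": 0.95}
    screenshot_path: str = ""


@dataclass
class Episode:
    high_level_instruction: str  # the overall task the user wants done
    app: str
    steps: List[Step] = field(default_factory=list)


episode = Episode(
    high_level_instruction="Set an alarm for 7:30 am on weekdays",
    app="com.google.android.deskclock",
    steps=[Step("Open the Alarm tab", {"type": "tap", "x": 0.2, "y": 0.95})],
)
```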
    Android in the Wild: A Large-Scale Dataset for Android Device Control
    Chris Rawles
    Alice Li
    Timothy Lillicrap
    NeurIPS 2023 Datasets and Benchmarks Track
    Abstract: There is a growing interest in device-control systems that can interpret human natural language instructions and execute them on a digital device by directly controlling its user interface. We present a dataset for device-control research, Android in the Wild (AitW), which is orders of magnitude larger than current datasets. The dataset contains human demonstrations of device interactions, including the screens and actions, and corresponding natural language instructions. It consists of 715k episodes spanning 30k unique instructions, four versions of Android (v10-13), and eight device types (Pixel 2 XL to Pixel 6) with varying screen resolutions. It contains multi-step tasks that require semantic understanding of language and visual context. This dataset poses a new challenge: actions available through the user interface must be inferred from their visual appearance, and, instead of simple UI element-based actions, the action space consists of precise gestures (e.g., horizontal scrolls to operate carousel widgets). We organize our dataset to encourage robustness analysis of device-control systems, i.e., how well a system performs in the presence of new task descriptions, new applications, or new platform versions. We develop two agents and report performance across the dataset. The dataset is available at https://github.com/google-research/google-research/tree/master/android_in_the_wild.
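    The gesture-level action space mentioned in the abstract (as opposed to element-based actions) might be represented along the following lines; the class and thresholds are hypothetical, not the released AitW format.

```python
from dataclasses import dataclass


@dataclass
class Gesture:
    """A touch gesture defined purely by its touch-down and lift-off points."""
    touch_x: float  # normalized [0, 1]
    touch_y: float
    lift_x: float
    lift_y: float

    @property
    def is_tap(self) -> bool:
        # A tap is a gesture whose start and end points (nearly) coincide.
        return abs(self.touch_x - self.lift_x) < 0.02 and abs(self.touch_y - self.lift_y) < 0.02


# A horizontal swipe to operate a carousel widget: touch on the right, lift on the left.
swipe_left = Gesture(touch_x=0.8, touch_y=0.5, lift_x=0.2, lift_y=0.5)
print(swipe_left.is_tap)  # False
```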