Yiwen Song

Yiwen Song is a research scientist at Google. Her research focuses on Large Language Models (LLMs), particularly at the intersection of multimodality and generative AI.
Authored Publications
Decomposing complex problems into simple subtasks, a crucial part of human-like natural planning, has recently been shown to significantly boost the performance of large language models (LLMs). However, leveraging such planning structures during post-training to improve smaller open-source LLMs remains underexplored. Motivated by this, we introduce Plan-Tuning, a unified post-training framework that (i) distills synthetic task decompositions (termed “planning trajectories”) from large-scale LLMs and (ii) fine-tunes smaller models via supervised and reinforcement-learning objectives designed to mimic these planning processes and thereby improve complex reasoning. On the GSM8k and MATH benchmarks, plan-tuned models outperform strong baselines by an average of ~7%. Furthermore, plan-tuned models generalize better to out-of-domain datasets, with average performance improvements of ~10% and ~12% on OlympiadBench and AIME 2024, respectively. Our detailed analysis demonstrates how planning trajectories improve complex reasoning capabilities, showing that Plan-Tuning is an effective strategy for improving the task-specific performance of smaller LLMs.
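
A minimal sketch of the plan-distillation step described above, assuming a hypothetical `teacher_generate` callable that wraps the large teacher LLM; the prompt wording and helper names are illustrative, not the paper's actual pipeline:

```python
# Sketch of Plan-Tuning's data-distillation stage: elicit "planning
# trajectories" (task decomposition + step-by-step solution) from a large
# teacher model, then collect them as supervised fine-tuning targets.
# `teacher_generate` is a placeholder for any text-generation API.

PLAN_PROMPT = (
    "Decompose the following problem into an ordered list of subtasks, "
    "then solve it step by step.\n\nProblem: {problem}"
)

def distill_planning_trajectory(problem: str, teacher_generate) -> dict:
    """Ask the teacher model for one synthetic planning trajectory."""
    trajectory = teacher_generate(PLAN_PROMPT.format(problem=problem))
    return {"prompt": problem, "target": trajectory}

def build_sft_dataset(problems, teacher_generate) -> list:
    """Collect (problem, planning-trajectory) pairs for fine-tuning a smaller model."""
    return [distill_planning_trajectory(p, teacher_generate) for p in problems]
```

The resulting pairs feed a standard supervised fine-tuning loop; the paper additionally applies reinforcement-learning objectives on top, which this sketch omits.
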
The proliferation of Large Language Models (LLMs) has opened new opportunities in data science, yet their practical deployment is often constrained by the challenge of discovering relevant data within large and heterogeneous data lakes. Existing approaches, including single-agent and master–slave multi-agent systems, struggle with scalability, information heterogeneity, and robustness to irrelevant files. To address these limitations, we propose a novel multi-agent communication paradigm inspired by the blackboard architecture from traditional AI and software design. In this framework, a central agent posts information requests to a shared blackboard, and autonomous subordinate agents, each responsible for a partition of the data lake, volunteer to respond based on their capabilities. This distributed design improves scalability and flexibility by eliminating the need for a central coordinator with prior knowledge of agent expertise. We evaluate the approach on three benchmarks that require explicit data discovery: KramaBench and modified versions of DS-Bench and DA-Code that incorporate data discovery. Experimental results demonstrate that the blackboard architecture substantially outperforms baselines, including RAG and the master–slave paradigm, achieving 13% to 57% relative improvement in end-to-end task success and up to a 9% relative gain in F1 score for data discovery across both proprietary and open-source LLMs. These findings establish the blackboard paradigm as a scalable and generalizable communication framework for multi-agent data science systems.
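
A minimal sketch of the blackboard communication pattern, with illustrative class and method names (`Blackboard`, `PartitionAgent.can_answer`) rather than the paper's actual implementation:

```python
# Blackboard pattern: a central agent posts a request to shared state, and
# partition-owning agents volunteer responses. No coordinator needs to know
# which agent holds which data. The relevance check here is a toy substring
# match standing in for an LLM-based capability judgment.

from dataclasses import dataclass, field

@dataclass
class Blackboard:
    requests: list = field(default_factory=list)
    responses: dict = field(default_factory=dict)

    def post(self, request: str) -> None:
        self.requests.append(request)

class PartitionAgent:
    """Owns one partition of the data lake and volunteers answers it can serve."""

    def __init__(self, name: str, files: list):
        self.name, self.files = name, files

    def can_answer(self, request: str) -> bool:
        # Volunteer only if some file in our partition looks relevant.
        return any(f.rsplit(".", 1)[0] in request for f in self.files)

    def respond(self, request: str) -> list:
        return [f for f in self.files if f.rsplit(".", 1)[0] in request]

def run_round(board: Blackboard, agents: list, request: str) -> list:
    board.post(request)  # central agent posts; it does not route to anyone
    for agent in agents:
        if agent.can_answer(request):  # agents volunteer based on capability
            board.responses.setdefault(request, []).extend(agent.respond(request))
    return board.responses.get(request, [])

agents = [PartitionAgent("a1", ["sales_2023.csv"]), PartitionAgent("a2", ["hr.csv"])]
print(run_round(Blackboard(), agents, "find tables about sales_2023 revenue"))
```
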
Test-time scaling has shown considerable success in improving the performance of language models on complex reasoning tasks without requiring fine-tuning. However, current strategies, such as self-reflection or ensembling, primarily focus on logical or structural refinement; they do not leverage the guiding potential of affective feedback. Inspired by psychological research showing that emotions can modulate cognitive performance, we introduce HEART, a novel framework that uses emotionally driven prompts for iterative self-correction. HEART provides feedback on a model's incorrect response using a curated set of concise, emotionally charged phrases based on Paul Ekman's six basic emotions. By systematically varying the emotional tone of the feedback across iterations, our method guides the model to escape flawed reasoning paths and explore more promising alternatives. We evaluate our framework on challenging reasoning benchmarks including OlympiadBench, Humanity's Last Exam, and SimpleQA. Across these benchmarks, our approach elicits significantly deeper reasoning, leading to consistent and significant increases in accuracy compared to existing prompting methods. Crucially, these gains hold across a diverse range of model architectures, demonstrating the broad applicability of our technique. Overall, our findings suggest that the next frontier in machine reasoning may lie not just in refining logic, but also in understanding and leveraging the 'HEART' of the models.
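
A minimal sketch of HEART-style iterative self-correction, assuming placeholder `llm` and `is_correct` callables; the feedback phrases below are invented examples keyed to Ekman's six emotions, not the paper's curated set:

```python
# Iterative self-correction where the feedback's emotional tone, not its
# logical content, varies across rounds. `llm` is any prompt -> text callable;
# `is_correct` stands in for whatever verifier or self-check flags a response
# as needing another attempt.

EMOTION_FEEDBACK = {
    "anger":    "This answer is wrong. That is unacceptable; fix it now.",
    "fear":     "I'm worried this mistake will cause serious problems.",
    "sadness":  "It's disappointing to see this error again.",
    "disgust":  "This sloppy reasoning is hard to look at. Redo it.",
    "surprise": "I can't believe you missed this. Look again!",
    "joy":      "You're so close! One more careful pass and you'll have it.",
}

def heart_correct(question: str, llm, is_correct, max_iters: int = 6) -> str:
    answer = llm(question)
    for emotion in list(EMOTION_FEEDBACK)[:max_iters]:
        if is_correct(answer):
            break
        # Vary the emotional tone of the feedback on each iteration.
        prompt = (f"{question}\nYour previous answer: {answer}\n"
                  f"{EMOTION_FEEDBACK[emotion]}")
        answer = llm(prompt)
    return answer
```
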
Computer use agents (CUAs) need to plan long-horizon task workflows grounded in diverse, ever-changing applications and environments, but learning is hindered by the scarcity of large-scale, high-quality training data. Existing datasets are small, domain-specific, and costly to annotate, while current synthetic data generation methods often yield brittle, simplistic, or misaligned task demonstrations. We introduce Watch & Learn (W&L), a framework that transforms human demonstration videos available on the Internet into executable UI trajectories at scale. Inspired by robotics, we train an inverse dynamics model that accurately predicts user actions from consecutive screens, bypassing the need for complex heuristics. To scale to the web, we curate a large state-transition corpus and design a retrieval framework that identifies relevant video tutorials, enabling automatic conversion of raw videos into structured UI trajectories without manual annotation. Beyond training data, we show that the generated UI trajectories can also serve as in-context exemplars, providing CUAs with long-horizon priors and domain-specific knowledge at inference time. On the challenging OSWorld and Mind2Web benchmarks, UI trajectories extracted with W&L consistently improve both general-purpose and state-of-the-art frameworks when used in context, and deliver stronger gains for open-source models when used in training. These results highlight web-scale human demonstration videos as a practical and scalable foundation for advancing CUAs toward real-world deployment.
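
A minimal sketch of an inverse dynamics model over consecutive screenshots, using a generic PyTorch encoder as a stand-in for the paper's actual architecture and a discrete action vocabulary as a simplifying assumption:

```python
# Inverse dynamics for UI trajectories: given a pair of consecutive screens
# (before, after), predict the user action that connects them. A shared CNN
# encodes each screenshot; a linear head classifies the action from the
# concatenated pair embedding.

import torch
import torch.nn as nn

class InverseDynamicsModel(nn.Module):
    def __init__(self, num_actions: int, feat_dim: int = 256):
        super().__init__()
        # Shared screen encoder (placeholder CNN over RGB screenshots).
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim),
        )
        # The action head sees the pair (screen_t, screen_t+1).
        self.head = nn.Linear(2 * feat_dim, num_actions)

    def forward(self, screen_t: torch.Tensor, screen_t1: torch.Tensor):
        z = torch.cat([self.encoder(screen_t), self.encoder(screen_t1)], dim=-1)
        return self.head(z)  # logits over the action vocabulary

# Example forward pass on a batch of 224x224 screenshot pairs.
model = InverseDynamicsModel(num_actions=50)
logits = model(torch.randn(2, 3, 224, 224), torch.randn(2, 3, 224, 224))
```
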
The alignment of language models (LMs) with human values increasingly relies on using other LMs as automated judges, or "autoraters". However, their reliability is limited by a foundational issue: they are trained on deterministic preference labels, forcing a single ground truth onto tasks that are often subjective, ambiguous, or nuanced. We argue that a truly reliable autorater must learn to model the full distribution of preferences defined by a target population. In this paper, we propose a general framework for calibrating probabilistic autoraters to any given preference distribution. We formalize the problem and present two learning methods tailored to different data conditions: direct supervised fine-tuning for dense, probabilistic labels, and a reinforcement learning approach for sparse, binary labels. Our empirical results show that fine-tuning autoraters with a distribution-matching objective leads to verbalized probability predictions that are better aligned with the target preference distribution, with improved calibration and significantly lower positional bias, all while preserving performance on objective tasks.
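
A minimal sketch of the dense-label case, assuming the distribution-matching objective is a KL divergence between the autorater's predicted preference probabilities and the population's label distribution; tensor names and the two-option setup are illustrative:

```python
# Distribution matching for autorater calibration: instead of fitting a hard
# 0/1 preference label, fit the population's preference distribution directly
# via KL divergence (soft cross-entropy) over the preference options.

import torch
import torch.nn.functional as F

def distribution_matching_loss(pred_logits: torch.Tensor,
                               target_probs: torch.Tensor) -> torch.Tensor:
    """KL(target || model) over preference options, e.g. (A preferred, B preferred)."""
    log_pred = F.log_softmax(pred_logits, dim=-1)
    return F.kl_div(log_pred, target_probs, reduction="batchmean")

# Example: the target population prefers response A 70/30 over response B.
logits = torch.tensor([[0.2, 0.1]])   # autorater's raw scores for (A, B)
target = torch.tensor([[0.7, 0.3]])   # dense probabilistic label
loss = distribution_matching_loss(logits, target)
```

A deterministic label would push the model toward probability 1.0 for one option; the soft target instead rewards matching the 70/30 split, which is what yields the improved calibration reported above.
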