Pengcheng Yin

Hi! I am Pengcheng, a research scientist on the learning-for-code team at Google Brain. I work on problems at the intersection of natural language processing and machine learning for software engineering. My long-term research goal is to build models that let developers communicate with computers in their own language. You can find more about my research on my personal website (http://pengcheng.in).
Authored Publications
    UQE: A Query Engine for Unstructured Databases
    Hanjun Dai
    Bethany Wang
    Sherry Yang
    Phitchaya Mangpo Phothilimthana
    Advances in Neural Information Processing Systems (NeurIPS) (2024)
    Abstract: Analytics on structured data is a mature field with many successful methods. However, most real-world data exists in unstructured form, such as images and conversations. We investigate the potential of Large Language Models (LLMs) to enable unstructured data analytics. In particular, we propose a new Universal Query Engine (UQE) that directly interrogates and draws insights from unstructured data collections. This engine accepts queries in a Universal Query Language (UQL), a dialect of SQL that provides full natural language flexibility in specifying conditions and operators. The new engine leverages the ability of LLMs to conduct analysis of unstructured data, while also allowing us to exploit advances in sampling and optimization techniques to achieve efficient and accurate query execution. In addition, we borrow techniques from classical compiler theory to better orchestrate the workflow between sampling methods and foundation model calls. We demonstrate the efficiency of UQE on data analytics across different modalities, including images, dialogs, and reviews, and across a range of useful query types, including conditional aggregation, semantic retrieval, and abstraction aggregation.
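    As a rough illustration of the idea, a UQL-style natural-language predicate might be evaluated by sampling rows and asking an LLM to judge each one. Everything below (the llm_judge stub, the function names, the toy data) is a hypothetical sketch motivated by the abstract, not the paper's actual interface.

```python
# Hypothetical sketch of how a UQL-style engine might evaluate a
# natural-language WHERE clause by sampling rows and calling an LLM judge.
import random

def llm_judge(row: str, condition: str) -> bool:
    # Stub: a real system would prompt an LLM with the row and the condition.
    return "battery" in row.lower()  # toy heuristic standing in for the model

def uql_where(rows: list[str], condition: str, sample_size: int = 100) -> list[str]:
    """Evaluate a natural-language predicate on a sampled subset of rows,
    mimicking the sampling-for-efficiency idea described in the abstract."""
    sampled = random.sample(rows, min(sample_size, len(rows)))
    return [r for r in sampled if llm_judge(r, condition)]

reviews = [
    "Battery dies after two hours.",
    "Great screen, love it.",
    "Shipping was slow but the product is fine.",
]
print(uql_where(reviews, "the review complains about battery life"))
```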
    Abstract: Text-to-SQL, the process of translating natural language into Structured Query Language (SQL), is a transformative application of large language models (LLMs), with the potential to revolutionize how humans interact with data. This paper introduces the SQL-PaLM framework, a comprehensive solution for understanding and enhancing Text-to-SQL with LLMs in two learning regimes: few-shot prompting and instruction fine-tuning. With few-shot prompting, we explore the effectiveness of consistency decoding with execution-based error filtering. With instruction fine-tuning, we investigate the critical paradigms that influence the performance of tuned LLMs. In particular, we study how performance can be improved through expanded training data coverage and diversity, synthetic data augmentation, and integration of query-specific database content. We propose a test-time selection method that further refines accuracy by integrating SQL outputs from multiple paradigms, with execution feedback as guidance. Additionally, we tackle the practical challenge of navigating intricate databases with large numbers of tables and columns, proposing efficient techniques for accurately selecting the relevant database elements to enhance Text-to-SQL performance. Our holistic approach yields substantial advancements in Text-to-SQL, as demonstrated on two key public benchmarks, Spider and BIRD. Through comprehensive ablations and error analyses, we shed light on the strengths and weaknesses of our framework, offering valuable insights for future Text-to-SQL work.
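    A minimal sketch of consistency decoding with execution-based error filtering, as described in the abstract: sample several SQL candidates, drop those that error out, and return one whose execution result agrees with the majority. The sqlite3 setup and candidate list below are illustrative stand-ins, not the paper's implementation.

```python
# Hedged sketch of consistency decoding with execution-based error filtering.
# The candidate list stands in for samples from an LLM; sqlite3 stands in
# for the target database.
import sqlite3
from collections import Counter

def execute(db: sqlite3.Connection, sql: str):
    try:
        return tuple(db.execute(sql).fetchall())  # hashable result signature
    except sqlite3.Error:
        return None  # execution error: filter this candidate out

def consistency_decode(db: sqlite3.Connection, candidates: list[str]) -> str | None:
    results = {sql: execute(db, sql) for sql in candidates}
    valid = {sql: r for sql, r in results.items() if r is not None}
    if not valid:
        return None
    majority = Counter(valid.values()).most_common(1)[0][0]
    # Return any candidate whose execution result matches the majority result.
    return next(sql for sql, r in valid.items() if r == majority)

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE t(x INTEGER)")
db.executemany("INSERT INTO t VALUES (?)", [(1,), (2,), (3,)])
print(consistency_decode(db, [
    "SELECT COUNT(*) FROM t",
    "SELECT COUNT(x) FROM t",
    "SELECT COUNT(*) FROM missing_table",  # filtered out by execution error
]))
```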
    Spider2.0-GUI: Can Multimodal Agents Achieve Expert Proficiency in Data Science and Engineering?
    Ruisheng Cao
    Fangyu Lei
    Haoyuan Wu
    Jixuan Chen
    Yeqiao Fu
    Hongcheng Gao
    Xinzhuang Xiong
    Hanchong Zhang
    Yuchen Mao
    Wenjing Hu
    Tianbao Xie
    Hongshen Xu
    Danyang Zhang
    Sida Wang
    Caiming Xiong
    Ansong Ni
    Qian Liu
    Victor Zhong
    Lu Chen
    Kai Yu
    Tao Yu
    2024
    Abstract: The field of data science and engineering is crucial for harnessing large-scale data to assist both individuals and enterprises with analytical processing and automated orchestration. Despite this significance, large language model (LLM)-based data agents remain underexplored, particularly with respect to professional data engineering tools such as dbt, Airflow, and Airbyte, which are complex to use and involve intensive GUI operations. To bridge this gap, we introduce Spider2.0-GUI, the first benchmark focusing on enterprise data engineering software across a full data pipeline. It encapsulates 486 tasks involving 20 professional applications, spanning data warehousing, ingestion, transformation, analysis, visualization, and orchestration. Each task is paired with both abstract and verbose instructions, accommodating different levels of user expertise. We also build a comprehensive document warehouse of 11,231 documents for Spider2.0-GUI to support retrieval-augmented agent frameworks. The benchmark is further enhanced with a real-time, executable Ubuntu desktop environment that interacts with the real-world internet, providing a realistic and dynamic testing ground. Preliminary results with state-of-the-art vision language models (VLMs) indicate that even the most advanced model achieves only an 11% success rate (SR) with abstract instructions and a 21% SR with verbose instructions (i.e., step-by-step tutorials). This benchmark not only probes the competencies of data agents, but also paves the way for future advancements in real-world automated data science and engineering tasks.
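    To make the two instruction granularities concrete, a single task might be phrased at both levels roughly as below; the task itself is invented for illustration and is not drawn from the benchmark.

```python
# Hypothetical example of one task at the two instruction levels described
# in the abstract (the task content is invented, not from Spider2.0-GUI).
task = {
    "software": "dbt",
    "abstract_instruction": "Build a daily-revenue model from the raw orders table.",
    "verbose_instruction": [
        "Open the dbt project in the editor.",
        "Create models/daily_revenue.sql selecting order_date and SUM(amount) "
        "from raw.orders grouped by order_date.",
        "Run `dbt run --select daily_revenue` in the terminal.",
        "Verify that the model materialized in the warehouse.",
    ],
}
```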
    Abstract: Text-to-SQL aims to automate the process of generating SQL queries over a database from natural language text. In this work, we propose SQLPrompt, tailored to improve the few-shot prompting capabilities of Text-to-SQL for Large Language Models (LLMs). Our methods include innovative prompt design; an execution-based consistency decoding strategy, which selects the SQL whose execution outcome is most consistent among the SQL proposals; and a method that improves performance by diversifying the SQL proposals during consistency selection with different prompt designs ("MixPrompt") and foundation models ("MixLLMs"). We show that SQLPrompt outperforms previous in-context learning approaches with few labeled examples by a large margin, closing the gap with state-of-the-art fine-tuning that uses thousands of labeled examples.
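    The MixPrompt/MixLLMs diversification step might be sketched as follows: pool SQL proposals across several prompt templates and several models, then hand the pool to an execution-based consistency selector such as the one sketched above. The templates and toy model callables here are assumptions for illustration only.

```python
# Hedged sketch of MixPrompt/MixLLMs-style diversification: pool SQL
# proposals across prompt designs and foundation models before
# execution-based consistency selection.
from typing import Callable

def diversified_proposals(
    question: str,
    prompt_templates: list[str],          # MixPrompt: different prompt designs
    models: list[Callable[[str], str]],   # MixLLMs: different foundation models
) -> list[str]:
    proposals = []
    for template in prompt_templates:
        prompt = template.format(question=question)
        for model in models:
            proposals.append(model(prompt))
    return proposals

# Toy stand-ins for two "models" that always emit fixed SQL.
model_a = lambda p: "SELECT COUNT(*) FROM t"
model_b = lambda p: "SELECT COUNT(x) FROM t"
pool = diversified_proposals(
    "How many rows are in t?",
    ["Write SQL for: {question}", "-- Task: {question}\nSQL:"],
    [model_a, model_b],
)
print(pool)  # four proposals to feed into consistency selection
```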
    Abstract: Identifying invariants in programs is an important program analysis task with applications in program understanding, vulnerability analysis, and formal verification. Existing tools for identifying invariants rely on dynamic analysis, requiring traces collected from multiple executions in order to produce reliable invariants. We study the application of large language models to invariant prediction, finding that models trained on source code and fine-tuned for invariant prediction can perform the task as a static rather than dynamic analysis. Using a scratchpad approach gives the best performance: it statically finds invariants of quality comparable to those obtained by a dynamic analysis tool with access to five program traces.
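    A scratchpad-style prompt for static invariant prediction could be assembled along these lines; the prompt wording and the call_model stub are assumptions, since the paper fine-tunes a code-pretrained model rather than relying on a canned template.

```python
# Hedged sketch of a scratchpad prompt for static invariant prediction.
# `call_model` is a stub standing in for a fine-tuned code LLM.

SCRATCHPAD_TEMPLATE = """Program:
{source}

Reason step by step about the values each variable can take at the marked
program point, then list the invariants.

Scratchpad:"""

def predict_invariants(source: str,
                       call_model=lambda prompt: "i >= 0, i <= n") -> list[str]:
    prompt = SCRATCHPAD_TEMPLATE.format(source=source)
    completion = call_model(prompt)  # stub returns a canned answer here
    return [inv.strip() for inv in completion.split(",")]

program = "for i in range(n):  # invariant point: loop entry\n    total += a[i]"
print(predict_invariants(program))
```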
    PaLM: Scaling Language Modeling with Pathways
    Aakanksha Chowdhery
    Sharan Narang
    Jacob Devlin
    Maarten Bosma
    Hyung Won Chung
    Sebastian Gehrmann
    Parker Schuh
    Sasha Tsvyashchenko
    Abhishek Rao
    Yi Tay
    Noam Shazeer
    Nan Du
    Reiner Pope
    James Bradbury
    Guy Gur-Ari
    Toju Duke
    Henryk Michalewski
    Xavier Garcia
    Liam Fedus
    David Luan
    Barret Zoph
    Ryan Sepassi
    David Dohan
    Shivani Agrawal
    Mark Omernick
    Marie Pellat
    Aitor Lewkowycz
    Erica Moreira
    Rewon Child
    Oleksandr Polozov
    Zongwei Zhou
    Brennan Saeta
    Michele Catasta
    Jason Wei
    Kathy Meier-Hellstern
    arXiv:2204.02311 (2022)
    Abstract: Large language models have been shown to achieve remarkable performance across a variety of natural language tasks using few-shot learning, which drastically reduces the number of task-specific training examples needed to adapt the model to a particular application. To further our understanding of the impact of scale on few-shot learning, we trained a 540-billion-parameter, densely activated Transformer language model, which we call the Pathways Language Model (PaLM). We trained PaLM on 6144 TPU v4 chips using Pathways, a new ML system which enables highly efficient training across multiple TPU Pods. We demonstrate continued benefits of scaling by achieving state-of-the-art few-shot learning results on hundreds of language understanding and generation benchmarks. On a number of these tasks, PaLM 540B achieves breakthrough performance, outperforming the fine-tuned state of the art on a suite of multi-step reasoning tasks and outperforming average human performance on the recently released BIG-bench benchmark. A significant number of BIG-bench tasks showed discontinuous improvements from model scale, meaning that performance increased steeply as we scaled to our largest model. PaLM also has strong capabilities in multilingual tasks and source code generation, which we demonstrate on a wide array of benchmarks. We additionally provide a comprehensive analysis of bias and toxicity, and study the extent of training data memorization with respect to model scale. Finally, we discuss the ethical considerations related to large language models and discuss potential mitigation strategies.
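    Few-shot learning here means conditioning on a handful of worked examples in the prompt rather than updating model weights; a prompt might be assembled as below, with an invented toy task.

```python
# Minimal sketch of few-shot prompt construction: the model sees k worked
# examples in context and completes the final query, with no gradient updates.
def few_shot_prompt(examples: list[tuple[str, str]], query: str) -> str:
    shots = "\n".join(f"Q: {q}\nA: {a}" for q, a in examples)
    return f"{shots}\nQ: {query}\nA:"

print(few_shot_prompt(
    [("2 + 2", "4"), ("7 - 3", "4")],   # illustrative arithmetic shots
    "5 + 8",
))
```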
    Abstract: When writing programs, people can tackle a new complex task by decomposing it into smaller, more familiar subtasks. While it is difficult to measure whether neural program synthesis methods have similar capabilities, what we can measure is whether they compositionally generalize, that is, whether a model trained on simpler subtasks is subsequently able to solve more complex tasks. In this paper, we focus on measuring the ability of learned program synthesizers to compositionally generalize. We first characterize several axes along which program synthesis methods would be desired to generalize, e.g., length generalization, or the ability to combine known subroutines in new ways that do not occur in the training data. Based on this characterization, we introduce a benchmark suite of tasks to assess these abilities, built on two popular existing datasets, SCAN and RobustFill. Finally, we make a first attempt to improve the compositional generalization of Transformer models along these axes through novel attention mechanisms inspired by a human-like decomposition strategy. Empirically, we find that our modified Transformer models generally perform better than natural baselines, but the tasks remain challenging.
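    One axis named in the abstract, length generalization, can be made concrete with a split like the following: train on short outputs and test on strictly longer ones. The data and length threshold are illustrative, not the benchmark's actual construction.

```python
# Hedged sketch of a length-generalization split in the spirit of the
# abstract: train on short command/output pairs, evaluate on longer ones.
def length_split(pairs: list[tuple[str, str]], max_train_len: int):
    train = [(x, y) for x, y in pairs if len(y.split()) <= max_train_len]
    test = [(x, y) for x, y in pairs if len(y.split()) > max_train_len]
    return train, test

data = [
    ("jump", "JUMP"),
    ("jump twice", "JUMP JUMP"),
    ("jump twice and walk", "JUMP JUMP WALK"),  # SCAN-like toy examples
]
train, test = length_split(data, max_train_len=2)
print(len(train), len(test))  # 2 short training pairs, 1 longer test pair
```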