Sercan O. Arik
Sercan Arik is a Research Scientist at Google Cloud AI. Motivated by the mission of democratizing AI and bringing it to the most impactful use cases (in healthcare, finance, retail, media, education, communications, and many other industries), he works on making AI high-performing for the most in-demand data types, as well as interpretable, fair, data-efficient, robust, and reliable.
Before joining Google, he was a Research Scientist at Baidu Silicon Valley AI Lab. At Baidu, he focused on deep learning research, particularly for applications in human-technology interfaces. He co-developed state-of-the-art speech synthesis, keyword spotting, voice cloning, and neural architecture search systems. Prior to Baidu, he completed a PhD degree in Electrical Engineering at Stanford University in 2016. He has co-authored more than 50 journal and conference publications.
Authored Publications
Preview abstract
Agents based on large language models (LLMs) for machine learning engineering (MLE) can automatically implement ML models via code generation. However, existing approaches to build such agents often rely heavily on inherent LLM knowledge and employ coarse exploration strategies that modify the entire code structure at once. This limits their ability to select effective task-specific models and perform deep exploration within specific components, such as experimenting extensively with feature engineering options. To overcome these limitations, we propose MLE-STAR, a novel approach to build MLE agents. MLE-STAR first leverages external knowledge by using a search engine to retrieve effective models from the web, forming an initial solution, then iteratively refines it by exploring various strategies targeting specific ML components. This exploration is guided by ablation studies analyzing the impact of individual code blocks. Furthermore, we introduce a novel ensembling method using an effective strategy suggested by MLE-STAR. Our experimental results show that MLE-STAR achieves medals in 64% of the Kaggle competitions on the MLE-bench Lite, significantly outperforming the best alternative.
From Few to Many: Self-Improving Many-Shot Reasoners Through Iterative Optimization and Generation
Han Zhou
Hootan Nakhost
Ke Jiang
International Conference on Learning Representations (ICLR) (2025)
Preview abstract
Recent advances in long-context large language models (LLMs) have led to the emerging paradigm of many-shot in-context learning (ICL), where it is observed that scaling to many more demonstration examples beyond the conventional few-shot setup in the context can lead to performance benefits. However, despite its promise, it is unclear what aspects dominate the benefits and whether simply scaling to more examples is the most effective way of improving many-shot ICL. In this work, we first provide an analysis of the factors driving many-shot ICL, and we find that 1) many-shot performance can often still be attributed to a few disproportionately influential examples and 2) identifying such influential examples ("optimize") and using them as demonstrations to regenerate new examples ("generate") can lead to further improvements. Inspired by the findings, we propose BRIDGE, an algorithm that alternates between the optimize step with Bayesian optimization to discover the influential sets of examples and the generate step to reuse this set to expand the reasoning paths of the examples back to the many-shot regime automatically. On Gemini, Claude, and Mistral LLMs of different sizes, we show that BRIDGE leads to significant improvements across a diverse set of tasks, including symbolic reasoning, numerical reasoning, and code generation.
Reasoning-SQL: Reinforcement Learning with Partial Rewards for Reasoning-Enhanced Text-to-SQL
Mohammadreza Pourreza
Shayan Talaei
Hailong Li
Azalia Mirhoseini
Amin Saberi
Conference on Language Modeling (COLM) (2025) (to appear)
Preview abstract
Text-to-SQL is a challenging task involving multiple reasoning-intensive subtasks, including natural language understanding, database schema comprehension, and precise SQL query formulation. Existing approaches often rely on handcrafted reasoning paths with inductive biases that can limit their overall effectiveness. Motivated by the recent success of reasoning-enhanced models such as DeepSeek R1 and OpenAI o1, which effectively leverage reward-driven self-exploration to enhance reasoning capabilities and generalization, we propose a novel set of partial rewards tailored specifically for the Text-to-SQL task. Our reward set includes schema-linking, AI feedback, n-gram similarity, and syntax check, explicitly designed to address the reward sparsity issue prevalent in reinforcement learning (RL). Leveraging group relative policy optimization (GRPO), our approach explicitly encourages large language models (LLMs) to develop intrinsic reasoning skills necessary for accurate SQL query generation. With models of different sizes, we demonstrate that RL-only training with our proposed rewards consistently achieves higher accuracy and superior generalization compared to supervised fine-tuning (SFT). Remarkably, our RL-trained 14B-parameter model significantly outperforms larger proprietary models, e.g., o3-mini by 4% and Gemini-1.5-Pro-002 by 3%, on the BIRD benchmark. These results highlight the efficacy of our proposed RL-training framework with partial rewards for enhancing both accuracy and reasoning capabilities in Text-to-SQL tasks.
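To make two of the ideas in this abstract concrete, here is a minimal sketch of an n-gram similarity reward and of the group-relative normalization that GRPO applies to the rewards of a sampled group. The abstract does not give the paper's exact reward definitions, so the Jaccard-over-bigrams formula, the whitespace tokenization, and the function names below are illustrative assumptions rather than the authors' implementation.

```python
from collections import Counter
from statistics import mean, pstdev


def ngrams(tokens, n=2):
    """Return the multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def ngram_similarity(predicted_sql, gold_sql, n=2):
    """Jaccard overlap of token n-grams, in [0, 1] (an assumed, illustrative reward)."""
    pred = ngrams(predicted_sql.lower().split(), n)
    gold = ngrams(gold_sql.lower().split(), n)
    union = sum((pred | gold).values())
    return sum((pred & gold).values()) / union if union else 0.0


def group_relative_advantages(rewards):
    """Normalize each reward by the mean and standard deviation of its group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    if sigma == 0:
        # All samples scored the same, so no sample is preferred over another.
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]


# Toy group of sampled queries for one question (hypothetical data).
gold = "select name from users where age > 30"
candidates = [gold, "select name from users", "drop table users"]
rewards = [ngram_similarity(c, gold) for c in candidates]
advantages = group_relative_advantages(rewards)
```

Because the similarity score is graded rather than all-or-nothing, a partially correct query still receives a nonzero reward, which is the sense in which such partial rewards can ease reward sparsity.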
Preview abstract
Large language models (LLMs), optimized through human feedback, have rapidly emerged as a leading paradigm for developing intelligent conversational assistants. However, despite their strong performance across many benchmarks, LLM-based agents might still lack conversational skills such as disambiguation -- when they are faced with ambiguity, they often overhedge or implicitly guess users' true intents rather than asking clarification questions. Under task-specific settings, high-quality conversation samples are often limited, constituting a bottleneck for LLMs' ability to learn optimal dialogue action policies. We propose Action-Based Contrastive Self-Training (ACT), a quasi-online preference optimization algorithm based on Direct Preference Optimization (DPO), that enables data-efficient dialogue policy learning in multi-turn conversation modeling. We demonstrate ACT's efficacy under data-efficient tuning scenarios, even when there is no action label available, using multiple real-world conversational tasks: tabular-grounded question-answering, machine reading comprehension, and AmbigSQL, a novel task for disambiguating information-seeking requests for complex SQL generation towards data analysis agents. Additionally, we propose evaluating LLMs' ability to function as conversational agents by examining whether they can implicitly recognize and reason about ambiguity in conversation. ACT demonstrates substantial conversation modeling improvements over standard tuning approaches like supervised fine-tuning and DPO.
Astute RAG: Overcoming Imperfect Retrieval Augmentation and Knowledge Conflicts for Large Language Models
Fei Wang
The Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL 2025) (2025) (to appear)
Preview abstract
Retrieval-Augmented Generation (RAG), while effective in integrating external knowledge to address the limitations of large language models (LLMs), can be undermined by imperfect retrieval, which may introduce irrelevant, misleading, or even malicious information. Despite its importance, previous studies have rarely explored the behavior of RAG through joint analysis on how errors from imperfect retrieval attribute and propagate, and how potential conflicts arise between the LLMs' internal knowledge and external sources. We find that imperfect retrieval augmentation might be inevitable and quite harmful, through controlled analysis under realistic conditions. We identify the knowledge conflicts between LLM-internal and external knowledge from retrieval as a bottleneck to overcome in the post-retrieval stage of RAG. To render LLMs resilient to imperfect retrieval, we propose Astute RAG, a novel RAG approach that adaptively elicits essential information from LLMs' internal knowledge, iteratively consolidates internal and external knowledge with source-awareness, and finalizes the answer according to information reliability. Our experiments using Gemini and Claude demonstrate that Astute RAG significantly outperforms previous robustness-enhanced RAG methods. Notably, Astute RAG is the only approach that matches or exceeds the performance of LLMs without RAG under worst-case scenarios. Further analysis reveals that Astute RAG effectively resolves knowledge conflicts, improving the reliability and trustworthiness of RAG systems.
Multi-Agent Design: Optimizing Agents with Better Prompts and Topologies
Ivan Vulić
Anna Korhonen
Han Zhou
Shariq Iqbal
2025
Preview abstract
Large language models (LLMs), employed as multiple agents that interact and collaborate with each other, have excelled at solving complex tasks. The agents are programmed with prompts that declare their functionality, along with the workflows that orchestrate interactions within a structured flow. Designing prompts and workflows for multi-agent systems is inherently complex, especially when addressing a new task. It often demands expert-level knowledge and involves significant trial and error. Gaining a deep understanding of the factors that contribute to effective multi-agent systems is essential for automating the entire process. Motivated by this, we first conduct an in-depth analysis of the design spaces for multi-agent systems, focusing on the impact of prompts, scaling the number of agents, and common types of agentic modules. Our findings reveal that top-performing systems often emerge from simpler design spaces, where prompts play a critical role in enhancing agent functionality and enabling more effective scaling. Based on the insights, we propose Multi-Agent System Search (MASS), a multi-stage optimization framework that performs the optimization in a pruned design space, with prompts and an influential subset of modules. We show that MASS-optimized multi-agent systems outperform existing alternatives by a substantial margin. Based on the MASS-found systems, we finally propose design principles behind building effective multi-agent systems.
Teach Better or Show Smarter? On Instructions and Exemplars in Automatic Prompt Optimization
Hootan Nakhost
Advances in Neural Information Processing Systems (NeurIPS) (2024)
Preview abstract
Large language models have demonstrated remarkable capabilities, but their performance is heavily reliant on effective prompt engineering. Automatic prompt optimization (APO) methods are designed to automate this and can be broadly categorized into those targeting instructions (instruction optimization, IO) vs. those targeting exemplars (exemplar selection, ES). Despite their shared objective, these have evolved rather independently, with IO recently receiving more research attention. This paper seeks to bridge this gap by comprehensively comparing the performance of representative IO and ES techniques, both in isolation and in combination, on a diverse set of challenging tasks. Our findings reveal that intelligently reusing model-generated input-output pairs obtained from evaluating prompts on the validation set as exemplars consistently improves performance over IO methods but is currently under-investigated. We also find that despite the recent focus on IO, how we select exemplars can outweigh how we optimize instructions, with ES strategies as simple as random search outperforming state-of-the-art IO methods with seed instructions without any optimization. Moreover, we observe synergy between ES and IO, with optimal combinations surpassing individual contributions. We conclude that studying exemplar selection as a standalone method and its optimal combination with instruction optimization remains a crucial aspect of APO and deserves greater consideration in future research, even in the era of highly capable instruction-following models.
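As a rough illustration of what "ES strategies as simple as random search" can look like, the sketch below samples random exemplar subsets and keeps the best-scoring one. The function name, its parameters, and `toy_score` are hypothetical: in practice `score_fn` would stand for validation accuracy of an LLM prompted with a fixed seed instruction plus the candidate exemplars, which is replaced here by a toy scoring rule so the example runs on its own.

```python
import random


def random_search_exemplars(pool, score_fn, k=2, trials=20, seed=0):
    """Sample `trials` random k-subsets of `pool` and keep the best-scoring one."""
    rng = random.Random(seed)
    best_subset, best_score = None, float("-inf")
    for _ in range(trials):
        subset = rng.sample(pool, k)
        score = score_fn(subset)
        if score > best_score:
            best_subset, best_score = subset, score
    return best_subset, best_score


# Hypothetical pool of (input, output) pairs; two of the outputs are wrong.
pool = [("2", "4"), ("3", "6"), ("5", "11"), ("7", "15")]


def toy_score(subset):
    """Stand-in for validation accuracy: counts exemplars whose output is 2 * input."""
    return sum(int(x) * 2 == int(y) for x, y in subset)


best_subset, best_score = random_search_exemplars(pool, toy_score, k=2, trials=20)
```

The search itself never inspects the exemplars; it only needs a scoring function, which is why it can be combined with any instruction, optimized or not.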
Preview abstract
With the development of Large Language Models (LLMs), collaboration between LLMs to solve complex tasks has attracted more and more attention. An important and challenging task is reasoning over long text that cannot be input into LLMs. Thus far, limited research has explored how to solve long-context tasks via pure multi-agent collaboration.
In this paper, we propose Chain-of-Agents (CoA), a novel framework that leverages multi-agent collaboration via natural language to solve complex tasks. In CoA, the long text is split into chunks that agents process in sequence, each one appending to the information passed on by the preceding agents. A manager model is finally employed to obtain the final answer utilizing the output of the last agent.
On a wide range of datasets for long-context question answering, summarization, and code completion, and with many LLMs (including PaLM 2, Claude, and Gemini), we show that the CoA framework outperforms strong baselines, including the commonly used retrieval-augmented generation (RAG) systems, by a large margin. For instance, text-bison obtains a 13.30% performance gain on NarrativeQA and 10.22% on the MuSiQue dataset.
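The control flow described in this abstract (split, process chunks in sequence, hand the last output to a manager) can be sketched as follows. This is only a schematic under stated assumptions: `worker` and `manager` are hypothetical callables standing in for LLM calls, the word-count chunking rule is an arbitrary choice, and the toy agents below exist only so the example runs without a model.

```python
def split_into_chunks(text, chunk_words):
    """Split text into consecutive chunks of at most `chunk_words` words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_words])
            for i in range(0, len(words), chunk_words)]


def chain_of_agents(text, question, worker, manager, chunk_words=50):
    """Run workers over the chunks in order, then let the manager answer."""
    message = ""  # what the preceding worker passed on; empty for the first one
    for chunk in split_into_chunks(text, chunk_words):
        message = worker(chunk=chunk, previous=message, question=question)
    return manager(summary=message, question=question)


def toy_worker(chunk, previous, question):
    """Stand-in for an LLM worker: forwards every number seen so far."""
    found = [w for w in chunk.split() if w.isdigit()]
    return " ".join([previous] + found).strip()


def toy_manager(summary, question):
    """Stand-in for the manager LLM: sums the numbers the workers collected."""
    return str(sum(int(w) for w in summary.split()))


answer = chain_of_agents(
    "a 1 b 2 c 3 d 4",
    "What is the sum of the numbers?",
    toy_worker,
    toy_manager,
    chunk_words=3,
)
```

No single call here ever sees more than one chunk plus the forwarded message, which is how such a chain can handle text longer than any one model's input window.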
Preview abstract
Large language models (LLMs) have achieved remarkable advancements in natural language understanding, generation, and manipulation of text-based data. However, one major obstacle to their widespread deployment in the real world is that they can generate "hallucinated" answers that are not factual. To this end, this paper focuses on improving grounding from a holistic perspective with a novel framework, AGREE. We start with the design of a test-time adaptation capability that takes into account the support information generated in self-grounded responses. To effectively enable this capability, we propose that the model tuning needs to be redesigned with a novel tuning objective mimicking the test-time adaptation setup for grounding. This tuning on top of the pre-trained LLMs requires a small amount of data that needs to be constructed in a particular way to learn the grounding information, for which we introduce a data construction method. Our results show that AGREE pushes the state of the art in grounding, as demonstrated across many datasets.
SQL-PaLM: Improved Large Language Model Adaptation for Text-to-SQL
Satya Gundabathula
Hanjun Dai
Hootan Nakhost
TMLR (2024)
Preview abstract
Text-to-SQL, the process of translating natural language into Structured Query Language (SQL), represents a transformative application of large language models (LLMs), potentially revolutionizing how humans interact with data. This paper introduces the SQL-PaLM framework, a comprehensive solution for understanding and enhancing Text-to-SQL using LLMs in the learning regimes of few-shot prompting and instruction fine-tuning. With few-shot prompting, we explore the effectiveness of consistency decoding with execution-based error filtering. With instruction fine-tuning, we delve deep into understanding the critical paradigms that influence the performance of tuned LLMs. In particular, we investigate how performance can be improved through expanded training data coverage and diversity, synthetic data augmentation, and integrating query-specific database content. We propose a test-time selection method to further refine accuracy by integrating SQL outputs from multiple paradigms with execution feedback as guidance. Additionally, we tackle the practical challenge of navigating intricate databases with a significant number of tables and columns, proposing efficient techniques for accurately selecting relevant database elements to enhance Text-to-SQL performance. Our holistic approach yields substantial advancements in Text-to-SQL, as demonstrated on two key public benchmarks, Spider and BIRD. Through comprehensive ablations and error analyses, we shed light on the strengths and weaknesses of our framework, offering valuable insights into future work on Text-to-SQL.
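One plausible reading of "consistency decoding with execution-based error filtering" is sketched below: run each sampled query, discard those that fail to execute, and keep a query whose result agrees with the most other candidates. The abstract does not spell out the procedure, so the voting rule, the function names, and the toy `users` table are assumptions for illustration; SQLite from the Python standard library stands in for the target database.

```python
import sqlite3
from collections import Counter


def execute_or_none(conn, sql):
    """Return the query's rows as a hashable tuple, or None if execution fails."""
    try:
        return tuple(conn.execute(sql).fetchall())
    except sqlite3.Error:
        return None


def pick_by_execution_consistency(conn, candidates):
    """Drop candidates that error, then return one whose result is most common."""
    results = [(sql, execute_or_none(conn, sql)) for sql in candidates]
    valid = [(sql, rows) for sql, rows in results if rows is not None]
    if not valid:
        return None
    top_rows, _ = Counter(rows for _, rows in valid).most_common(1)[0]
    return next(sql for sql, rows in valid if rows == top_rows)


# Toy database and hypothetical sampled queries.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, age INTEGER)")
conn.executemany("INSERT INTO users VALUES (?, ?)", [("Ada", 36), ("Bo", 25)])

candidates = [
    "SELECT name FROM users WHERE age > 30",
    "SELECT name FROM users WHERE age >= 31",
    "SELECT name FROM user WHERE age > 30",  # misspelled table: filtered out
    "SELECT name FROM users",
]
chosen = pick_by_execution_consistency(conn, candidates)
```

Since model-generated SQL may contain statements that modify data, a sketch like this should only ever be run against a read-only or disposable copy of the database.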