Avi Caciularu

Avi Caciularu is a Research Scientist at Google Research Tel-Aviv, working on natural language processing and its applications with large language models.
Authored Publications
    A critical component in the trustworthiness of LLMs is reliable uncertainty communication, yet LLMs often use assertive language when conveying false claims, leading to over-reliance and eroded trust. We present the first systematic study of faithful confidence calibration of LLMs, benchmarking models' ability to use linguistic expressions of uncertainty that faithfully reflect their intrinsic uncertainty, across a comprehensive array of models, datasets, and prompting strategies. Our results demonstrate that LLMs largely fail at this task, and that existing interventions are insufficient: standard prompt approaches provide only marginal gains, and existing, factuality-based calibration techniques can even harm faithful calibration. To address this critical gap, we introduce MetaFaith, a novel prompt-based calibration approach inspired by human metacognition. We show that MetaFaith robustly improves faithful calibration across diverse models and task domains, enabling up to 61% improvement in faithfulness and achieving an 83% win rate over original generations as judged by humans.
    Retrieval Augmented Generation (RAG) is a commonly used approach for enhancing LLMs with relevant and up-to-date information. However, the retrieved sources can often contain conflicting information, and it is not clear how models address such discrepancies. In this work, we first point out that knowledge conflicts arise for various reasons and thus require tailored solutions in order to better align model responses to human preferences. To that end, we introduce a novel taxonomy of knowledge conflicts in RAG and define the desired model behavior for each category. Additionally, we construct a high-quality benchmark by asking two expert annotators to identify the conflict type within realistic RAG instances, each comprising a query and its associated search results. Finally, we conduct extensive experiments and show that explicitly informing LLMs about the potential conflict category significantly improves the quality and appropriateness of the responses. Yet there is still substantial room for improvement. Taken together, our work highlights the importance of evaluating RAG systems not only on factual accuracy but also on their ability to manage and resolve knowledge conflicts effectively.
    A growing area of research investigates augmenting language models with tools (e.g., search engines, calculators) to overcome their shortcomings (e.g., missing or incorrect knowledge, incorrect logical inferences). Various few-shot tool-usage strategies have been proposed. However, there is no systematic and fair comparison across different strategies, or between these strategies and strong baselines that do not leverage tools. We conduct an extensive empirical analysis, finding that (1) across various datasets, example difficulty levels, and models, strong no-tool baselines are competitive with tool-assisted strategies, implying that effectively using tools with in-context demonstrations is a difficult unsolved problem; (2) for knowledge-retrieval tasks, strategies that *refine* incorrect outputs with tools outperform strategies that retrieve relevant information *ahead of* or *during generation*; (3) tool-assisted strategies are expensive in the number of tokens they require, increasing costs by orders of magnitude without a corresponding significant improvement in performance. Overall, our findings suggest that few-shot tool integration is still an open challenge, emphasizing the need for comprehensive evaluations of future strategies to accurately assess their *benefits* and *costs*.