Blog test prior to deployment

June 3, 2024

Update • Apr 26, 2024

Speculative RAG is a novel Retrieval Augmented Generation framework that uses a smaller specialist LM to generate draft texts that are then fed to a larger generalist LM to verify and select the best draft. Speculative RAG achieves state-of-the-art performance in both accuracy and efficiency.

Large language models (LLMs) are increasingly being used to power services that can respond to user inquiries. Yet despite their widespread use, LLMs often struggle with factual inaccuracies and may generate hallucinated content (i.e., descriptions that cannot be verified by a given input), particularly when faced with knowledge-intensive questions that demand up-to-date information or obscure facts. For example, if a user asks, “What are the new features of the latest Google Pixel phone?”, an LLM might generate outdated or inaccurate information.

Retrieval augmented generation (RAG) recently emerged as a promising solution to mitigate these issues. RAG leverages an external knowledge base to retrieve documents with related information, and incorporates this information into its generated content. By retrieving timely and accurate information, RAG effectively reduces factual errors in knowledge-intensive tasks. While RAG improves the accuracy of LLM responses, longer documents require more complex reasoning and can significantly delay response times. Recent studies have explored paths to extend the context length limit of LLMs, yet achieving well-grounded reasoning over such extended contexts remains an open challenge. Consequently, striking a balance between efficiency and effectiveness in RAG has become a central research focus.
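To make the retrieve-then-generate pattern concrete, here is a toy sketch of a RAG front end. The term-overlap scorer, the function names, and the prompt layout are illustrative inventions, not any production retriever; real systems use dense embeddings and learned rankers.

```python
from collections import Counter

def score(query, doc):
    # Toy relevance score: number of overlapping terms between query and document.
    q, d = Counter(query.lower().split()), Counter(doc.lower().split())
    return sum((q & d).values())

def retrieve(query, knowledge_base, k=2):
    # Return the k documents most related to the query.
    return sorted(knowledge_base, key=lambda doc: score(query, doc), reverse=True)[:k]

def build_rag_prompt(query, docs):
    # Ground the LM's answer in the retrieved documents.
    context = "\n".join(f"[{i + 1}] {d}" for i, d in enumerate(docs))
    return f"Answer using the context below.\n{context}\nQuestion: {query}"
```

The efficiency tension described above shows up directly here: the more documents `retrieve` returns, the longer the prompt the LM must reason over.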

In “Speculative RAG: Enhancing Retrieval Augmented Generation through Drafting”, we propose a novel framework that offloads computational burden to a smaller specialist RAG drafter (a specialized LM fine-tuned for RAG), which serves as an efficient and robust RAG module for the existing generalist LM.

Speculative RAG follows the drafting approach described in speculative decoding, a method that accelerates auto-regressive LM inference by using a smaller model to rapidly propose multiple subsequent tokens (e.g., words or word segments) that are then verified in parallel by a base model. Applying this drafting idea at the level of whole answers improves both the effectiveness and efficiency of RAG systems. We demonstrate that Speculative RAG yields significant improvements and achieves state-of-the-art performance in both accuracy and latency on the TriviaQA, MuSiQue, PubHealth, and ARC-Challenge benchmarks.
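The token-level draft-and-verify idea can be illustrated with a deliberately simplified greedy sketch. Real speculative decoding verifies the whole draft in a single batched forward pass and accepts tokens probabilistically; here `verify_fn` stands in for the base model's next-token choice.

```python
def speculative_decode(draft_tokens, verify_fn):
    # Accept the draft's tokens left to right until the base model disagrees;
    # verify_fn(prefix) returns the base model's next token for that prefix.
    accepted = []
    for token in draft_tokens:
        if verify_fn(accepted) == token:
            accepted.append(token)  # draft token confirmed by the base model
        else:
            accepted.append(verify_fn(accepted))  # fall back to the base model's token
            break
    return accepted
```

Because verification checks many drafted tokens at once, the base model's output is unchanged while its number of sequential decoding steps drops.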

Speculative RAG

Speculative RAG consists of two components: (1) a specialist RAG drafter, and (2) a generalist RAG verifier. First, the base model’s knowledge retriever retrieves related documents from the knowledge base. Then, Speculative RAG offloads computational burden to the specialist RAG drafter, a small LM specialized in answering questions using retrieved documents; it is not expected to handle general-purpose tasks. This smaller module excels at reasoning over retrieved documents and can rapidly produce responses with their corresponding rationale. It serves as an efficient and robust RAG module for the generalist LM. The specialist drafter enables the generalist verifier to bypass the detailed review of potentially repetitive documents, focusing instead on validating the drafts and selecting the most accurate answer.
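The two-stage flow above can be sketched end to end. Both models below are stand-ins so the sketch runs: the drafter just echoes its evidence, and the verifier uses a placeholder score, whereas the actual verifier computes the LLM's conditional generation probability of each draft.

```python
def specialist_draft(question, doc_subset):
    # Stand-in for the specialist RAG drafter: produce an answer draft plus
    # the rationale grounding it in this subset of retrieved documents.
    evidence = doc_subset[0]
    return f"answer based on: {evidence}", f"the documents state: {evidence}"

def generalist_verify(question, answer, rationale):
    # Stand-in for the generalist RAG verifier; a real verifier would score
    # the draft by its conditional generation probability under the LLM.
    return float(len(rationale))  # placeholder confidence

def speculative_rag(question, doc_subsets):
    # Draft once per document subset, then keep the highest-scoring draft.
    drafts = [specialist_draft(question, subset) for subset in doc_subsets]
    return max(drafts, key=lambda d: generalist_verify(question, *d))[0]
```

The key structural point survives the stubs: the generalist model never reads the retrieved documents themselves, only the drafts and rationales.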

For example, when answering, “Which actress or singer starred as Doralee Rhodes in the 1980 film, Nine to Five?”, we retrieve a number of documents from the knowledge base with a retriever. We feed subsets of retrieved documents into the RAG drafter and generate multiple answer drafts with corresponding rationale in parallel. Generating the drafts in parallel enables fast processing of a large number of documents.
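Draft generation over subsets is embarrassingly parallel. A minimal sketch with a thread pool and a stubbed drafter call (a real deployment would instead batch the drafter's forward passes on an accelerator):

```python
from concurrent.futures import ThreadPoolExecutor

def draft_one(args):
    question, subset = args
    # Stubbed drafter call: a real system would invoke the specialist LM here.
    return {"answer": f"draft for {subset[0]}", "rationale": " | ".join(subset)}

def draft_in_parallel(question, subsets, max_workers=4):
    # One draft per document subset; map preserves the subsets' order.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(draft_one, [(question, s) for s in subsets]))
```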

Because the knowledge retriever has limited capability, some retrieved documents may not be relevant to the question. In this example, the retrieved documents contain information about both the Nine to Five movie (1980) and the Nine to Five musical (2010). To determine the most accurate draft, the generalist RAG verifier, a general LLM, calculates the conditional generation probability of the answer drafts with rationales and outputs a confidence score. Since answer drafts based on the Nine to Five musical would be inaccurate, the generalist RAG verifier assigns those drafts lower scores and filters them out. Finally, the generalist verifier selects the answer draft with the highest confidence score, which is based on the Nine to Five movie, as the final answer.
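Selection by conditional generation probability can be mimicked with per-token log-probabilities. In this sketch the log-probs are fabricated inputs standing in for what the verifier LLM would actually assign to each draft's answer and rationale tokens.

```python
import math

def draft_confidence(token_logprobs):
    # Confidence as the exponentiated average token log-probability
    # (the geometric mean of the per-token probabilities).
    return math.exp(sum(token_logprobs) / len(token_logprobs))

def select_best(drafts):
    # drafts: list of (answer_text, token_logprobs) pairs; keep the
    # draft the verifier finds most probable.
    return max(drafts, key=lambda d: draft_confidence(d[1]))[0]
```

A draft grounded in the wrong source (here, the musical) would receive low token probabilities under the verifier and be filtered out exactly as described above.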