Parameter Efficient Reinforcement Learning from Human Feedback
Abstract
While Reinforcement Learning from Human Feedback (RLHF) effectively aligns pretrained Large Language Models (LLMs) with human preferences, its computational cost and complexity hinder wider adoption.
This work introduces Parameter-Efficient Reinforcement Learning (PERL): by leveraging Low-Rank Adaptation (LoRA) \citep{hu2021lora} for reward model training and reinforcement learning, we are able to perform RL loops while updating only a fraction of the parameters required by traditional RLHF.
We demonstrate that the effectiveness of this method is not confined to a specific task: we compare PERL to conventional fine-tuning (full-tuning) across X highly diverse tasks, ranging from summarization to X and X, for a total of X different benchmarks, including two novel preference datasets released with this paper. Our findings show that PERL achieves performance comparable to RLHF while significantly reducing training time (up to 2x faster for reward models and 15\% faster for RL loops) and memory footprint (up to 50\% reduction for reward models and 25\% for RL loops). Finally, we provide a single set of hyperparameters that achieves results on par with RLHF on every task, underscoring the accessibility of the method.
By mitigating the computational cost and the burden of hyperparameter search, PERL facilitates broader adoption of RLHF as an LLM alignment technique.
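To make the parameter-efficient setup concrete, the following is a minimal sketch, not the authors' implementation, of attaching LoRA adapters to a pretrained backbone used as a reward model so that only the low-rank matrices are trained while the backbone stays frozen. It assumes the Hugging Face \texttt{transformers} and \texttt{peft} libraries; the \texttt{gpt2} backbone, rank, and target modules are illustrative placeholders rather than the settings used in this work.
\begin{verbatim}
# Minimal sketch (assumes Hugging Face `transformers` + `peft`):
# wrap a pretrained backbone with LoRA adapters so that only the
# low-rank matrices are trainable; the backbone weights stay frozen.
# Backbone, rank, and target modules are hypothetical placeholders,
# not the configuration used in this paper.
from transformers import AutoModelForSequenceClassification
from peft import LoraConfig, TaskType, get_peft_model

# Reward model: a pretrained LM with a scalar head (num_labels=1).
backbone = AutoModelForSequenceClassification.from_pretrained(
    "gpt2", num_labels=1
)

lora_cfg = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=8,                        # low-rank dimension (illustrative)
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["c_attn"],  # GPT-2 attention projection
)
reward_model = get_peft_model(backbone, lora_cfg)

# Only the LoRA parameters receive gradients; everything else is
# frozen, which is what reduces training time and memory.
reward_model.print_trainable_parameters()
\end{verbatim}
The same adapter wrapping can be applied to the policy model in the RL loop, so that the reinforcement learning step likewise updates only the small set of adapter weights.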