Parameter Efficient Reinforcement Learning from Human Feedback
Abstract
While Reinforcement Learning from Human Feedback (RLHF) effectively aligns pretrained Large Language Models (LLMs) with human preferences, its computational cost and complexity hinder wider adoption.
This work introduces Parameter-Efficient Reinforcement Learning (PERL): by leveraging Low-Rank Adaptation (LoRA) \citep{hu2021lora} for reward model training and reinforcement learning, we are able to perform RL loops while updating only a fraction of the parameters required by traditional RLHF.
We demonstrate that the effectiveness of this method is not confined to a specific task: we compare PERL to conventional fine-tuning (full-tuning) across X highly diverse tasks, ranging from summarization to X and X, for a total of X different benchmarks, including two novel preference datasets released with this paper. Our findings show that PERL achieves performance comparable to RLHF while significantly reducing training time (up to 2x faster for reward models and 15\% faster for RL loops) and memory footprint (up to 50\% reduction for reward models and 25\% for RL loops). Finally, we provide a single set of hyperparameters that achieves results on par with RLHF on every task, underscoring the accessibility of the method.
By mitigating the computational cost and the burden of hyperparameter search, PERL facilitates broader adoption of RLHF as an LLM alignment technique.
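To make the parameter-efficient setup concrete, the following is a minimal sketch, not the authors' implementation, of attaching LoRA adapters to a pretrained backbone used as a reward model so that only the low-rank matrices are trained while the backbone stays frozen. It assumes the Hugging Face \texttt{transformers} and \texttt{peft} libraries; the \texttt{gpt2} backbone, rank, and target modules are illustrative placeholders rather than the settings used in this work.
\begin{verbatim}
# Minimal sketch (assumes Hugging Face `transformers` + `peft`):
# wrap a pretrained backbone with LoRA adapters so that only the
# low-rank matrices are trainable; the backbone weights stay frozen.
# Backbone, rank, and target modules are hypothetical placeholders,
# not the configuration used in this paper.
from transformers import AutoModelForSequenceClassification
from peft import LoraConfig, TaskType, get_peft_model

# Reward model: a pretrained LM with a scalar head (num_labels=1).
backbone = AutoModelForSequenceClassification.from_pretrained(
    "gpt2", num_labels=1
)

lora_cfg = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=8,                        # low-rank dimension (illustrative)
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["c_attn"],  # GPT-2 attention projection
)
reward_model = get_peft_model(backbone, lora_cfg)

# Only the LoRA parameters receive gradients; everything else is
# frozen, which is what reduces training time and memory.
reward_model.print_trainable_parameters()
\end{verbatim}
The same adapter wrapping can be applied to the policy model in the RL loop, so that the reinforcement learning step likewise updates only the small set of adapter weights.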