A Little Help Goes a Long Way: Efficient LLM Training by Leveraging Small LMs
Abstract
A primary challenge in large language model (LLM) development is their onerous pre-training cost. Typically, such pre-training involves optimizing a self-supervised objective (such as next-token prediction) over a large corpus. This paper explores a promising paradigm
to improve LLM pre-training efficiency and quality by suitably leveraging a small language model (LM). In particular, this paradigm relies on a small LM to both (1) provide soft labels as additional training supervision, and (2) select a small subset of valuable (``informative'' and ``hard'') training examples. Put together, this enables an effective transfer of the small LM's predictive distribution to the LLM, while prioritizing specific regions of the training data distribution. Empirically, this leads to reduced LLM training time compared with standard training, while improving overall model quality. Theoretically, we develop a statistical framework to systematically study the utility of small LMs in enabling efficient training of high-quality LLMs.
In particular, our framework characterizes how the small LM's seemingly low-quality supervision
can enhance the training of a much more capable LLM. Furthermore, it also highlights the need for an adaptive utilization of such supervision, by striking a balance between the bias and variance introduced by the small LM-provided soft labels. We corroborate our theoretical framework by improving the pre-training of an LLM with 2.8B parameters by utilizing a smaller LM with 1.5B parameters on the Pile dataset.
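The two mechanisms described above can be illustrated with a minimal sketch. Note this is not the paper's implementation: the blending weight `alpha`, the selection criterion, and all function names are illustrative assumptions; the abstract's adaptive bias-variance balancing is approximated here by a fixed weight.

```python
import numpy as np

def softmax(logits, axis=-1):
    # Numerically stable softmax over the vocabulary axis.
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def distill_loss(student_logits, teacher_probs, labels, alpha=0.5):
    """Blend hard-label cross-entropy with KL divergence to the small LM's
    soft labels. `alpha` weights the soft-label term; a fixed value is used
    here for illustration, whereas the paper argues for adapting it."""
    p = softmax(student_logits)
    n = len(labels)
    ce = -np.mean(np.log(p[np.arange(n), labels] + 1e-12))
    kl = np.mean(np.sum(
        teacher_probs * (np.log(teacher_probs + 1e-12) - np.log(p + 1e-12)),
        axis=-1))
    return (1 - alpha) * ce + alpha * kl

def select_hard_examples(teacher_probs, labels, keep_frac=0.5):
    """Keep the fraction of examples the small LM finds hardest, measured
    by its own per-example cross-entropy (one possible notion of 'hard')."""
    n = len(labels)
    losses = -np.log(teacher_probs[np.arange(n), labels] + 1e-12)
    k = max(1, int(keep_frac * n))
    return np.argsort(-losses)[:k]
```

For example, with two training tokens where the small LM assigns probability 0.9 and 0.4 to the correct label respectively, `select_hard_examples(..., keep_frac=0.5)` retains only the second (harder) example.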