Matryoshka Model Learning for Improved Elastic Student Models
Abstract
Production machine learning models in the industry are often developed with a primary focus on maximizing model quality. However, these models must ultimately operate within the resource constraints of their serving infrastructure, including limitations on compute, memory and bandwidth. The rapid evolution of serving hardware, particularly with advancements in accelerator technology, necessitates periodic retraining to leverage newer, more efficient infrastructure. This cyclical retraining process is resource-intensive, demanding significant model development time and incurring substantial training costs. This challenge is further amplified by the trend towards increasingly complex models, which inherently require greater computational resources for training and deployment. While prior work has explored techniques like supernet sub-model extraction to address training efficiency, a critical gap remains: the efficient generation of a spectrum of high-quality models from an existing production model, a common requirement in diverse industrial applications. To bridge this gap, we introduce a novel approach leveraging a "Teaching Assistant" (TA) model, derived from a given production model (referred to as the Student model). We demonstrate that through co-training the Student and TA models with Matryoshka structure while using online distillation, we not only enhance the Student model’s performance but also enable the flexible creation of a model family offering a compelling trade-off between model quality and model size.