Video-STaR: Bootstrapping Weak Video Supervision for Visual Instruction Tuning

Xiaohan Wang
Serena Yeung
Orr Zohar
2024

Abstract

The performance of instruction-following Large Vision-Language Models (LVLMs) heavily depends on the size and quality of the instruction-tuning dataset. Existing video instruction-tuning datasets are derived by prompting large language models with video captions to generate question-answer pairs, and they suffer from data-quality and scaling issues. While many existing video datasets carry diverse labels and supervision for a variety of tasks, integrating them into LVLMs is non-trivial. Herein, we present Video Self-Taught Reasoners (Video-STaR), a novel approach that allows any labeled video dataset to be utilized for video instruction tuning. Video-STaR uses an LVLM to generate video question-answer pairs from video content and labels, and then uses the video labels to filter these pairs, selecting only correctly answered instances for instruction tuning. This filtering effectively employs the existing video labels as weak supervision over the quality of the question-answer pairs, and the model is iteratively enhanced through cycles of self-training until performance plateaus. Our results demonstrate that LVLMs tuned with Video-STaR exhibit superior robustness, showing marked improvement on VQA benchmarks and adapted downstream tasks. For instance, on Kinetics700, Video-STaR improved accuracy from 50.0 to 59.9, and on zero-shot MSVD-QA from 69.7 to 71.3.
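To make the generate-filter-retrain cycle described above concrete, the following is a minimal sketch of one plausible reading of the loop. All names (`VideoExample`, `QAPair`, `generate_qa`, `finetune`, `evaluate`) are hypothetical stand-ins, and the substring check is only a simple proxy for the paper's label-based verification, not the authors' actual implementation.

```python
from dataclasses import dataclass
from typing import Callable, Iterable

@dataclass
class VideoExample:
    video_id: str
    label: str          # existing weak supervision, e.g. an action class

@dataclass
class QAPair:
    video_id: str
    question: str
    answer: str

def label_verified(qa: QAPair, label: str) -> bool:
    """Weak-supervision filter: keep a generated QA pair only if the
    answer is consistent with the existing video label. A substring
    check stands in here for the paper's verification step."""
    return label.lower() in qa.answer.lower()

def video_star_cycle(
    videos: Iterable[VideoExample],
    generate_qa: Callable[[VideoExample], list[QAPair]],  # LVLM prompted with video + label
    finetune: Callable[[list[QAPair]], None],             # instruction-tune on kept pairs
    evaluate: Callable[[], float],                        # held-out metric
    max_rounds: int = 5,
) -> None:
    """Iterate generate -> filter -> fine-tune until performance plateaus."""
    best = evaluate()
    for round_idx in range(max_rounds):
        kept: list[QAPair] = []
        for ex in videos:
            for qa in generate_qa(ex):
                if label_verified(qa, ex.label):
                    kept.append(qa)
        finetune(kept)
        score = evaluate()
        print(f"round {round_idx}: kept {len(kept)} pairs, score {score:.2f}")
        if score <= best:   # stop once self-training no longer improves
            break
        best = score
```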