MAGVIT: Masked Generative Video Transformer

Kihyuk Sohn
Han Zhang
Huiwen Chang
Alex Hauptmann
Lu Jiang
CVPR (2023)

Abstract

This paper introduces a Masked Generative Video Transformer, named MAGVIT, for multi-task video generation. We train a single MAGVIT model and apply it to multiple video generation tasks at inference time. To this end, we propose two new designs: an improved 3D tokenizer that quantizes a video into spatial-temporal visual tokens, and a novel technique that embeds conditions inside the mask to facilitate multi-task training.
We conduct extensive experiments to demonstrate the compelling quality, efficiency, and flexibility of the proposed model.
First, MAGVIT radically improves the previous best fidelity on two video generation tasks.
In terms of efficiency, MAGVIT offers leading video generation speed at inference time, estimated to be one to two orders of magnitude faster than other models. As for flexibility, we verify that a single trained MAGVIT can generically perform 8+ tasks on several video benchmarks from drastically different visual domains. We will open-source our framework and models.
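To make the condition-embedding idea concrete, below is a minimal sketch, not the authors' released code, of how condition tokens might be placed inside the masked sequence during multi-task training. The function name, arguments, and the single `mask_ratio` parameter are illustrative assumptions; the actual masking schedule and task definitions are described in the paper.

```python
import torch

def embed_conditions_in_mask(target_tokens, cond_tokens, cond_region,
                             mask_ratio, mask_id):
    # target_tokens: LongTensor of quantized video tokens.
    # cond_tokens:   LongTensor, same shape; tokens from the task condition
    #                (e.g., observed frames for frame prediction).
    # cond_region:   BoolTensor, same shape; True where the condition applies.
    # mask_ratio:    fraction of remaining positions to mask this step.
    # mask_id:       id of the special [MASK] token.
    inputs = target_tokens.clone()
    # Embed condition tokens inside the sequence rather than masking them,
    # so the transformer sees the task's partial video as real context.
    inputs[cond_region] = cond_tokens[cond_region]
    # Randomly mask a fraction of the non-condition positions; the model is
    # trained to predict the original tokens only at these positions.
    rand = torch.rand(target_tokens.shape, device=target_tokens.device)
    masked = (~cond_region) & (rand < mask_ratio)
    inputs[masked] = mask_id
    return inputs, masked

# Illustrative usage with assumed sizes and a hypothetical codebook of 8192:
B, N = 4, 1024
tokens = torch.randint(0, 8192, (B, N))
cond = torch.randint(0, 8192, (B, N))
region = torch.zeros(B, N, dtype=torch.bool)
region[:, :256] = True  # e.g., tokens covering observed first frames
inputs, masked = embed_conditions_in_mask(tokens, cond, region, 0.7, 8192)
```

Under this sketch, switching tasks amounts to changing `cond_region` and `cond_tokens`, which is one way a single model could serve many generation tasks at inference time.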