A Comprehensive Survey on Text-to-Video Generation
-
Graphical Abstract
-
Abstract
Since the release of Sora, Text-to-Video (T2V) generation has brought profound changes to AI-generated content. T2V generation aims to produce high-quality videos from a given text description, which is challenging because large-scale, high-quality text-video pairs for training are scarce and high-dimensional video data is complex to model. Although several valuable surveys on T2V generation exist, they present approaches in a relatively isolated way, do not cover the development of evaluation metrics, and omit the advances made in T2V generation since 2023. Given the rapid expansion of the field, a comprehensive review of the relevant studies is both necessary and challenging. This survey attempts to connect and systematize existing research. Unlike previous surveys, it reviews nearly ninety representative T2V generation approaches, including methods published as recently as March 2024, from the perspectives of models, data, evaluation metrics, and available open-source resources. It is intended to help readers understand the current state of research and its main ideas, and to get started quickly with accessible open-source models. Finally, the future challenges and methodological trends of T2V generation are discussed in detail.
-
-