Loong: Generating Minute-level Long Videos with Autoregressive Language Models

Yuqing Wang1, Tianwei Xiong1, Daquan Zhou2, Zhijie Lin2

Yang Zhao2, Bingyi Kang2, Jiashi Feng2, Xihui Liu1

1University of Hong Kong, 2ByteDance

Abstract

It is desirable but challenging to generate content-rich long videos at the scale of minutes. Autoregressive large language models (LLMs) have achieved great success in generating coherent and long sequences of tokens in the domain of natural language processing, yet the exploration of autoregressive LLMs for video generation has been limited to short videos of a few seconds. In this work, we conduct a deep analysis of the challenges that prevent autoregressive LLM-based video generators from producing long videos. Based on the observations and analysis, we propose Loong, a new autoregressive LLM-based video generator that can generate minute-long videos. Specifically, we model the text tokens and video tokens as a unified sequence for autoregressive LLMs and train the model from scratch. We propose progressive short-to-long training with a loss re-weighting scheme to mitigate the loss imbalance problem in long-video training. We further investigate inference strategies, including video token re-encoding and sampling strategies, to diminish error accumulation during inference. Our proposed Loong can be trained on 10-second videos and extended to generate minute-level long videos conditioned on text prompts, as demonstrated by the results.

Approach

Framework

Given the input text tokens, the model predicts video tokens autoregressively. All text and video information is formulated as a single unidirectional sequence of discrete tokens, where the model predicts each next token conditioned on all previous tokens. A video tokenizer is utilized to convert video frames into discrete video tokens. We follow a progressive short-to-long training pipeline to train on long videos. A minimal sketch of this unified formulation is given below.
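The snippet below is a minimal sketch, not the exact implementation: it assumes a generic decoder-only transformer `model` and already-tokenized inputs, and the helper name `unified_ar_loss` and the optional `video_weights` re-weighting argument are illustrative stand-ins for the loss re-weighting scheme mentioned above.

```python
import torch
import torch.nn.functional as F

def unified_ar_loss(model, text_tokens, video_tokens, video_weights=None):
    """Next-token prediction over a unified [text | video] token sequence.

    text_tokens:  (B, T_text)  discrete text token ids
    video_tokens: (B, T_video) discrete video token ids from the video tokenizer
    video_weights: optional per-position weights for the video-token loss
                   (e.g., down-weighting later frames during long-video training).
    """
    seq = torch.cat([text_tokens, video_tokens], dim=1)       # (B, T)
    inputs, targets = seq[:, :-1], seq[:, 1:]                  # shift by one position
    logits = model(inputs)                                      # (B, T-1, vocab)

    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        reduction="none",
    ).view(targets.shape)                                       # (B, T-1) per-token loss

    # Supervise only positions whose target is a video token.
    t_text = text_tokens.size(1)
    mask = torch.zeros_like(loss)
    mask[:, t_text - 1:] = 1.0                                  # targets drawn from the video part
    if video_weights is not None:
        mask[:, t_text - 1:] *= video_weights                   # loss re-weighting
    return (loss * mask).sum() / mask.sum().clamp(min=1.0)
```

At inference time the same model is simply sampled token by token after the text prompt, with the generated video tokens decoded back to frames by the tokenizer's decoder.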

Generated High-Resolution Videos

Clown fish swimming through the coral reef

Aerial view of Santorini during the blue hour, showcasing the stunning architecture of white Cycladic buildings with blue domes. The caldera views are breathtaking, and the lighting creates a beautiful, serene atmosphere

A panda eating bamboo on a rock

Hulk wearing virtual reality goggles

A bigfoot walking in the snowstorm

Two pandas sitting at a table playing cards

Two raccoons reading books in NYC Times Square

A koala bear playing piano in the forest

A panda standing on a surfboard in the ocean in sunset

Reconstructed Videos using the Discrete Video Tokenizer

Left: original video, Right: reconstructed video

Reconstructed video 1
Reconstructed video 2
Reconstructed video 3
Reconstructed video 4
Reconstructed video 5
Reconstructed video 6

The discrete video tokenizer converts raw video frames into compact discrete tokens, which can then be processed by the autoregressive language model. The tokenizer uses a 3D causal CNN architecture, where the encoder maps the video frames into discrete codes and the decoder reconstructs the original frames from the discrete tokens. The tokenizer significantly compresses the data: for example, a 17x128x128 video clip is compressed by 4x in the temporal dimension and 8x in each spatial dimension (height and width), so the clip is represented by only 256x5 tokens (see the sketch below). The GIFs above show the reconstruction quality of videos from the WebVid dataset after compression by our tokenizer.
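As a sanity check on this compression ratio, here is a small illustrative sketch. It assumes the causal tokenizer encodes the first frame on its own and downsamples the remaining frames by the temporal stride; the function name and stride arguments are hypothetical.

```python
def token_count(frames, height, width, t_stride=4, s_stride=8):
    """Number of discrete tokens produced for one clip by a causal tokenizer."""
    t_latent = 1 + (frames - 1) // t_stride   # 17 frames -> 1 + 16/4 = 5 latent frames
    h_latent = height // s_stride             # 128 -> 16
    w_latent = width // s_stride              # 128 -> 16
    return t_latent, h_latent * w_latent      # (latent frames, tokens per latent frame)

t, per_frame = token_count(17, 128, 128)
print(t, per_frame, t * per_frame)            # 5 256 1280, i.e. 256x5 tokens
```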

Generated Low-Resolution Short Videos (128x128)

A happy elephant wearing a birthday hat walking under the sea

A happy elephant wearing a birthday hat walking under the sea

A big futuristic robot walking in post apocalypse world

A white and orange tabby cat is seen happily darting through a dense garden, as if chasing something.

A bear is giving a presentation in classroom.

An astronaut typing on a keyboard, arc shot.

Teddy bear walking down 5th Avenue, front view, beautiful sunset.

Red sports car coming around a bend in a mountain road.

Drone view of waves crashing against the rugged cliffs along Big Sur’s garay point beach.

BibTeX

@article{wang2024loong,
      title={Loong: Generating Minute-level Long Videos with Autoregressive Language Models},
      author={Wang, Yuqing and Xiong, Tianwei and Zhou, Daquan and Lin, Zhijie and Zhao, Yang and Kang, Bingyi and Feng, Jiashi and Liu, Xihui},
      journal={arXiv preprint arXiv:2410.02757},
      year={2024}
}

Acknowledgements

We thank Zhaoliang Xu, Lijun Yu, and Dingdong Yang for helpful discussions.