Emu Video is a simple method for text-to-video generation based on diffusion models, factorizing the generation into two steps: first generating an image conditioned on the text, and then generating a video conditioned on both the text and the generated image.
Factorized generation allows us to train high-quality video generation models efficiently. Unlike prior work that requires a deep cascade of models, our approach only requires two diffusion models to generate 512px, 4-second-long videos at 16fps.
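As a rough illustration of this factorization, below is a minimal Python sketch of the two-step pipeline. The class names and generate() signatures are hypothetical stand-ins, not the Emu Video implementation or a released API, and both models are stubbed with random outputs so that only the control flow of the image-then-video factorization is shown.

```python
# Minimal sketch of the factorized two-step text-to-video pipeline described above.
# TextToImageDiffusion and ImageToVideoDiffusion are hypothetical stand-ins,
# stubbed with random arrays so the example runs end to end.
import numpy as np


class TextToImageDiffusion:
    """Step 1 (stub): generate a single 512px image conditioned on the text."""

    def generate(self, prompt: str, size: int = 512) -> np.ndarray:
        return np.random.rand(size, size, 3)  # placeholder for a denoised image


class ImageToVideoDiffusion:
    """Step 2 (stub): generate video frames conditioned on the text and the image."""

    def generate(self, prompt: str, image: np.ndarray,
                 num_frames: int, fps: int) -> np.ndarray:
        h, w, c = image.shape
        return np.random.rand(num_frames, h, w, c)  # placeholder frames


def factorized_text_to_video(prompt: str, seconds: int = 4, fps: int = 16) -> np.ndarray:
    # Only two diffusion models are involved, rather than a deep cascade.
    image_model = TextToImageDiffusion()
    video_model = ImageToVideoDiffusion()
    first_image = image_model.generate(prompt)                        # text -> image
    video = video_model.generate(prompt, first_image,
                                 num_frames=seconds * fps, fps=fps)   # (text, image) -> video
    return video  # e.g. shape (64, 512, 512, 3) for 4 s at 16 fps


if __name__ == "__main__":
    clip = factorized_text_to_video("a corgi surfing a wave at sunset")
    print(clip.shape)
```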
We compared Emu Video against state-of-the-art text-to-video generation models on a variety of prompts, by asking human raters to select the most convincing videos based on quality and faithfulness to the prompt.
Our 512px, 16fps, 4-second-long videos win on both metrics against prior work: Make-A-Video (MAV), Imagen Video (Imagen), Align Your Latents (AYL), Reuse & Diffuse (R&D), CogVideo (Cog), Gen-2 (Gen2), and Pika Labs (Pika).
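For context on how such rater preferences turn into win rates, here is a small illustrative sketch. This is not the authors' evaluation code; the function name and the example preference data are hypothetical, and the win rate is simply the fraction of pairwise comparisons in which raters picked our video.

```python
# Sketch (not the authors' evaluation code): each entry in `preferences` records
# which model a rater found more convincing for one prompt; the win rate is the
# fraction of comparisons won by our model.
from collections import Counter


def win_rate(preferences: list[str], ours: str = "emu_video") -> float:
    """preferences: one winner label per (prompt, rater) comparison."""
    counts = Counter(preferences)
    return counts[ours] / len(preferences)


# Hypothetical example: 3 of 4 raters preferred Emu Video on video quality.
quality_prefs = ["emu_video", "emu_video", "baseline", "emu_video"]
print(f"quality win rate: {win_rate(quality_prefs):.0%}")  # -> 75%
```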
(*): equal technical contribution
We thank Baixue Zheng, Baishan Guo, Jeremy Teboul, Milan Zhou, Shenghao Lin, Kunal Pradhan, Jort Gemmeke, Jacob Xu, Dingkang Wang, Samyak Datta, Guan Pang, Symon Perriman, Vivek Pai, and Shubho Sengupta for their help with the data and infra. We would like to thank Uriel Singer, Adam Polyak, Shelly Sheynin, Yaniv Taigman, Licheng Yu, Luxin Zhang, Yinan Zhao, David Yan, Yaqiao Luo, Xiaoliang Dai, Zijian He, Peizhao Zhang, Peter Vajda, Roshan Sumbaly, Armen Aghajanyan, Michael Rabbat, and Michal Drozdzal for helpful discussions. We are also grateful for the help from Lauren Cohen, Mo Metanat, Lydia Baillergeau, Amanda Felix, Ana Paula Kirschner Mofarrej, Kelly Freed, and Somya Jain. We thank Ahmad Al-Dahle and Manohar Paluri for their support.