On December 14, the Russian House in Brussels hosted an online master class on using the new Kandinsky Video neural network to generate video from text.
Denis Dimitrov spoke in detail about how Kandinsky Video was created, how it works, its key features, and what improvements Kandinsky 3.0 offers over Kandinsky 2.2. Tatiana Nikulina demonstrated how to use Kandinsky 3.0 and Kandinsky Video to generate animations and videos from text.
Approaches to multimedia content generation occupy a prominent place in modern artificial intelligence research: over the past few years, models that synthesize images from text have achieved high-quality results.
Kandinsky Video is the first generative model in Russia for creating full-fledged videos from text descriptions. The model generates video sequences up to eight seconds long at a frame rate of 30 frames per second. The Kandinsky Video architecture consists of two key blocks: the first is responsible for creating the keyframes that form the structure of the video's plot, and the second generates the interpolated frames that make movement in the final video smooth. Both blocks are built on a new text-to-image model, Kandinsky 3.0. The generated video is a continuous scene in which both the subject and the background move. The neural network creates videos with a resolution of 512 x 512 pixels and various aspect ratios. The model was trained on a dataset of more than 300 thousand text-video pairs. Generating a video takes up to three minutes.
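The two-stage scheme above can be sketched in miniature. In this hypothetical illustration, frames are stand-in floats rather than images, and plain linear interpolation stands in for the learned interpolation block; the function name and frame counts are assumptions for the example, not part of the actual model's API.

```python
# Sketch of a keyframe + interpolation pipeline: a keyframe block first
# produces sparse "plot" frames, then an interpolation block fills in the
# frames between them. Floats stand in for real image frames.

def interpolate_frames(keyframes, factor):
    """Insert `factor - 1` intermediate frames between consecutive keyframes
    (linear blending as a stand-in for the learned interpolation model)."""
    frames = []
    for a, b in zip(keyframes, keyframes[1:]):
        for i in range(factor):
            t = i / factor
            frames.append(a * (1 - t) + b * t)
    frames.append(keyframes[-1])  # keep the final keyframe
    return frames

# 25 keyframes (roughly 3 per second over 8 seconds) expanded by a factor
# of 10 yield 241 frames: about 8 seconds at 30 frames per second.
keyframes = [float(i) for i in range(25)]
video = interpolate_frames(keyframes, factor=10)
```

The point of the split is economy: only the sparse keyframes need the full text-conditioned generation, while the cheaper interpolation stage supplies the in-between frames that make motion smooth.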
In addition, an option has been implemented for generating animated videos, in which dynamics are achieved by simulating camera motion relative to a static scene. One request produces a four-second video with the selected animation effect, at a frame rate of 24 frames per second and a resolution of 640 x 640 pixels. Synthesizing one second of video takes about 20 seconds on average. Several types of image animation were implemented, making it possible to move objects, zoom in and out, and apply a variety of motion effects to static images. The animation modes are built on the function that redraws an image according to a text description (image2image). Generation of composite scenes for creating "mini-movies" is also available (you can enter up to 3 queries at once).
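The simulated-camera idea can be illustrated with a minimal sketch: motion comes not from the model but from moving a crop window over a static scene. Here a 2-D grid of numbers stands in for a generated image, and the function name, window size, and frame count are assumptions for the example only (a real clip at 24 frames per second over four seconds would have 96 frames).

```python
# Sketch of the animation mode: a fixed-size crop window slides left to
# right across a static "image", and each crop becomes one video frame,
# simulating a camera pan over an unchanging scene.

def pan_frames(image, window, num_frames):
    """Slide a `window`-column-wide crop across `image`, one crop per frame."""
    width = len(image[0])
    frames = []
    for i in range(num_frames):
        t = i / max(num_frames - 1, 1)      # progress 0.0 .. 1.0 through the clip
        x = round(t * (width - window))     # current left edge of the crop
        frames.append([row[x:x + window] for row in image])
    return frames

# A tiny 2x8 "image"; a 4-column window panned over 5 frames.
image = [list(range(8)), list(range(8, 16))]
clip = pan_frames(image, window=4, num_frames=5)
```

Zooming works analogously by shrinking or growing the crop window over time; in the actual system each step is additionally passed through the image2image redrawing function rather than being a raw crop.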