
On December 14, the Russian House in Brussels will host an online workshop on using the new Kandinsky Video neural network to generate video from text.

Speakers: Denis Dimitrov, Managing Director of Data Science at Sber AI and scientific consultant at the AIRI Institute of Artificial Intelligence, and Tatyana Nikulina, chief prompt engineer at Sber AI.

Approaches to multimedia content generation occupy a prominent place in modern artificial intelligence research: over the past few years, models for synthesizing images from text have achieved high-quality results.

Kandinsky Video is the first generative model in Russia for creating full-fledged videos from text descriptions. The model generates video sequences up to eight seconds long at 30 frames per second. Its architecture consists of two key blocks: the first creates the key frames that define the structure of the video's plot, and the second generates interpolation frames that make movement in the final video smooth. Both blocks are built on Kandinsky 3.0, a new text-to-image synthesis model. The generated video takes the form of a continuous scene in which both the subject and the background move. The neural network produces videos with a resolution of 512 x 512 pixels in various aspect ratios. The model was trained on a dataset of more than 300 thousand text-video pairs. Generating a video takes up to three minutes.
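
The sketch below illustrates only the control flow of the two-block design described above (key-frame generation followed by frame interpolation), not the actual Kandinsky Video code. The function names, the key-frame rate, and the linear blend used in place of the learned interpolation model are all illustrative assumptions.

```python
import numpy as np

RESOLUTION = 512      # output frames are 512 x 512 pixels
TARGET_FPS = 30       # final frame rate
DURATION_S = 8        # maximum clip length in seconds
KEYFRAME_FPS = 3.75   # assumed key-frame rate (8x fewer frames than the output)

def generate_keyframes(prompt: str, n_keyframes: int) -> list[np.ndarray]:
    """Stand-in for block 1: text -> key frames that define the plot structure."""
    rng = np.random.default_rng(abs(hash(prompt)) % 2**32)
    return [rng.random((RESOLUTION, RESOLUTION, 3)) for _ in range(n_keyframes)]

def interpolate(a: np.ndarray, b: np.ndarray, n_between: int) -> list[np.ndarray]:
    """Stand-in for block 2: learned interpolation, approximated by a linear blend."""
    return [(1 - t) * a + t * b for t in np.linspace(0, 1, n_between + 2)[1:-1]]

def text_to_video(prompt: str) -> list[np.ndarray]:
    n_key = int(DURATION_S * KEYFRAME_FPS)            # key frames for the whole clip
    per_gap = int(TARGET_FPS / KEYFRAME_FPS) - 1      # frames inserted between key frames
    keyframes = generate_keyframes(prompt, n_key)
    frames: list[np.ndarray] = []
    for a, b in zip(keyframes, keyframes[1:]):
        frames.append(a)
        frames.extend(interpolate(a, b, per_gap))
    frames.append(keyframes[-1])
    return frames

video = text_to_video("a rocket launching over the sea at sunset")
print(len(video), video[0].shape)   # ~8 s of 512 x 512 x 3 frames at 30 fps
```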

In addition, there is an option to generate animated videos in which the dynamics come from simulating camera movement relative to a static scene. A single request produces a four-second video with the selected animation effect at 24 frames per second and a resolution of 640 x 640 pixels; synthesizing one second of video takes about 20 seconds on average. Several types of image animation are implemented, making it possible to move objects, zoom in and out on them, and otherwise animate static images. The animation modes are based on redrawing an image according to a text description (image2image). Composite scenes for creating “mini-movies” are also available: up to three queries can be entered at once.
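
As a rough illustration of this animation mode, the sketch below simulates a camera moving into a static scene and passes each frame through a text-guided redraw step. The redraw model is replaced by an identity stand-in, and the function names, zoom step, and nearest-neighbour resize are illustrative assumptions rather than the actual implementation.

```python
import numpy as np

SIZE = 640        # animated clips are 640 x 640 pixels
FPS = 24          # 24 frames per second
DURATION_S = 4    # one request yields a four-second clip

def image2image(image: np.ndarray, prompt: str) -> np.ndarray:
    """Stand-in for the text-guided redraw step used by the animation modes."""
    return image  # the real model would repaint the frame toward the prompt

def camera_zoom(image: np.ndarray, factor: float) -> np.ndarray:
    """Crop toward the centre and resize back, simulating a camera moving in."""
    h, w, _ = image.shape
    ch, cw = int(h / factor), int(w / factor)
    y0, x0 = (h - ch) // 2, (w - cw) // 2
    crop = image[y0:y0 + ch, x0:x0 + cw]
    ys = np.linspace(0, ch - 1, h).astype(int)   # nearest-neighbour resize
    xs = np.linspace(0, cw - 1, w).astype(int)
    return crop[ys][:, xs]

def animate(image: np.ndarray, prompt: str, zoom_step: float = 1.01) -> list[np.ndarray]:
    frames, frame = [], image
    for _ in range(FPS * DURATION_S):          # 4 s at 24 fps
        frame = camera_zoom(frame, zoom_step)  # simulated camera movement
        frame = image2image(frame, prompt)     # redraw toward the text description
        frames.append(frame)
    return frames

still = np.random.default_rng(0).random((SIZE, SIZE, 3))
clip = animate(still, "waves rolling onto a beach")
print(len(clip), clip[0].shape)   # 96 frames of 640 x 640 x 3
```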

Denis Dimitrov will explain in detail how the Kandinsky Video neural network was created, how it works and what its key features are, as well as what improvements and new capabilities Kandinsky 3.0 offers compared to Kandinsky 2.2. Tatyana Nikulina will use specific examples to show how to work with the Kandinsky 3.0 and Kandinsky Video neural networks to generate animations and videos from text.

Language: Russian/English
