Generative AI has become the buzzword of 2023. Whether it is the text-generating ChatGPT or the image-generating Midjourney, generative AI tools have transformed businesses and dominated the content creation industry. With Microsoft’s partnership with OpenAI and Google creating its own AI-powered chatbot called Bard, generative AI is fast becoming one of the hottest areas within the tech sphere.
Generative AI aims to generate new data similar to the training dataset. It utilizes machine learning algorithms called generative models to learn the patterns and distributions underlying the training data. Although different generative models are available that produce text, images, audio, code and videos, this article will take a deep dive into generative video models.
From generating video using text descriptions to generating new scenes and characters and enhancing the quality of a video, generative video models offer a wealth of opportunities for video content creators. Generative video platforms are often powered by sophisticated models like GANs, VAEs, or CGANs, capable of translating human language to build images and videos. In this article, you will learn about generative video models, their advantages, and how they work, followed by a step-by-step guide on creating your own generative video model.
Generative models and their types
Generative models create new data similar to the training data using machine learning algorithms. To do so, these models are trained on large datasets. They learn the underlying patterns and relationships in the training data and use that knowledge to produce similar synthetic data. Once trained, these models take text prompts (sometimes image prompts) and generate content based on the prompt.
There are several different types of generative models, including:
These are some of the most commonly used generative models, but many others have been developed for specific use cases. The choice of which model to use will depend on the specific requirements of the task at hand.
What is a generative video model?
Generative video models are machine learning algorithms that generate new video data based on patterns and relationships learned from training datasets. These models learn the underlying structure of the video data, which allows them to create synthetic video data similar to the original. Different types of generative video models are available, like GANs, VAEs, CGANs and more, each of which takes a different training approach based on its architecture.
Generative video models mostly work from text-to-video prompts, where users enter their requirements as text and the model generates the video from the textual description. Depending on the tool, generative video models can also accept sketch or image prompts to generate videos.
What tasks can a generative video model perform?
A wide range of activities can be carried out by generative video models, including:
Benefits of generative video models
Compared to more conventional techniques, generative video models have a number of benefits:
How do generative video models work?
Like any other AI model, generative video models are trained on large data sets to produce new videos. However, the training process varies from model to model depending on the model’s architecture. Let us understand how this may work by taking the example of two different models: VAE and GAN.
Variational Autoencoders (VAEs)
A Variational Autoencoder (VAE) is a generative model that can generate videos and images. A VAE has two main components: an encoder and a decoder. The encoder maps a video to a lower-dimensional representation, called a latent code, while the decoder maps a latent code back to a video.
A VAE uses the encoder and decoder to model the distribution of videos in the training data. The encoder maps each video to the parameters of a probability distribution over latent codes (typically the mean and variance of a normal distribution). A latent code is sampled from that distribution, and the decoder maps it back to a video, which is compared with the original video to calculate a reconstruction loss.
Training minimizes the reconstruction loss together with a regularization term (the KL divergence) that encourages the latent codes to follow a simple prior distribution, such as a standard normal distribution. This regularization is what allows the trained VAE to generate diverse new videos: latent codes are sampled from the prior distribution and passed through the decoder.
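The two terms of the VAE objective described above can be sketched in plain Python. This is only an illustration under simplifying assumptions: the "videos" are short flattened lists of numbers, the encoder output (`mu`, `log_var`) is hard-coded rather than produced by a network, and all function names are made up for this example.

```python
import math
import random


def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, 1), element-wise."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, sigma^2) || N(0, 1) ), summed over latent dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv
                     for m, lv in zip(mu, log_var))


def reconstruction_loss(original, reconstructed):
    """Mean squared error between two flattened videos."""
    return sum((a - b) ** 2 for a, b in zip(original, reconstructed)) / len(original)


def vae_loss(original, reconstructed, mu, log_var, beta=1.0):
    """Reconstruction term plus a beta-weighted KL regularizer."""
    return (reconstruction_loss(original, reconstructed)
            + beta * kl_to_standard_normal(mu, log_var))


rng = random.Random(0)
mu, log_var = [0.5, -0.5], [0.0, 0.0]   # hypothetical encoder output
z = reparameterize(mu, log_var, rng)    # latent code that would be fed to the decoder
kl = kl_to_standard_normal(mu, log_var)
print(len(z), kl)
```

The KL term is zero only when the encoder outputs exactly the prior (zero mean, unit variance), so it pulls the latent codes toward the distribution that is later sampled from at generation time.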
Generative Adversarial Networks (GANs)
A GAN is a deep learning model that can generate images or videos; conditional variants can additionally take an input such as a text prompt. A GAN has two core components: a generator and a discriminator, both of which are neural networks. The generator produces fake videos, while the discriminator judges whether a video is real or generated, and its verdict serves as feedback to the generator.
Using a random noise vector as input, the generator in the GAN generates a video. The discriminator takes a video as input and produces a probability score indicating how likely the video is to be real. Videos taken from the training data are labeled as real, and videos produced by the generator are labeled as fake.
The generator and the discriminator are trained adversarially. The generator is trained to create fake videos that the discriminator cannot detect, while the discriminator is trained to identify fake videos created by the generator. This process continues until the generator produces videos that the discriminator can no longer reliably distinguish from actual videos.
Following the training process, a noise vector can be sampled and passed through the generator to generate a brand-new video. While incorporating some randomness and diversity, the resultant videos should reflect the characteristics of the training data.
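The adversarial training loop described above can be sketched with PyTorch as follows. This is a toy example, not a production video GAN: the networks are small fully connected layers, the "videos" are random tensors standing in for real data, and the sizes and names (`FRAMES`, `NOISE_DIM`, `train_step`) are assumptions chosen for illustration.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Toy sizes (assumptions): 4 frames of 8x8 grayscale video, 16-dim noise.
FRAMES, HEIGHT, WIDTH, NOISE_DIM = 4, 8, 8, 16
VIDEO_DIM = FRAMES * HEIGHT * WIDTH

generator = nn.Sequential(
    nn.Linear(NOISE_DIM, 64), nn.ReLU(),
    nn.Linear(64, VIDEO_DIM), nn.Tanh())
discriminator = nn.Sequential(
    nn.Linear(VIDEO_DIM, 64), nn.LeakyReLU(0.2),
    nn.Linear(64, 1))  # outputs a logit; the sigmoid is folded into the loss

g_optim = torch.optim.Adam(generator.parameters(), lr=2e-4)
d_optim = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()


def train_step(real_videos):
    batch = real_videos.size(0)
    real_labels = torch.ones(batch, 1)
    fake_labels = torch.zeros(batch, 1)

    # 1) Discriminator: real videos -> "real", generated videos -> "fake".
    fake_videos = generator(torch.randn(batch, NOISE_DIM))
    d_loss = (bce(discriminator(real_videos), real_labels)
              + bce(discriminator(fake_videos.detach()), fake_labels))
    d_optim.zero_grad()
    d_loss.backward()
    d_optim.step()

    # 2) Generator: try to make the discriminator say "real" for fakes.
    g_loss = bce(discriminator(fake_videos), real_labels)
    g_optim.zero_grad()
    g_loss.backward()
    g_optim.step()
    return d_loss.item(), g_loss.item()


# Stand-in for a batch of real training videos, flattened and scaled to [-1, 1].
real_batch = torch.rand(8, VIDEO_DIM) * 2 - 1
d_loss, g_loss = train_step(real_batch)

# After training, sampling a new video is a single forward pass.
with torch.no_grad():
    new_video = generator(torch.randn(1, NOISE_DIM)).view(FRAMES, HEIGHT, WIDTH)
print(new_video.shape)
```

A real video GAN would replace the linear layers with convolutional (often 3D or recurrent) networks and call `train_step` many times over batches of actual video data.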
How to create a generative video model?
Here, we discuss how to create a generative video model similar to the VToonify framework that combines the advantages of StyleGAN and Toonify frameworks.
Set up the environment
The first step in creating a generative video model is setting up the environment. Start by choosing a programming language; here, we are moving forward with Python. Next, install the required software packages, including a deep learning framework such as TensorFlow or PyTorch, and any additional libraries you will need to preprocess and visualize your data.
Install the following dependencies:
PyTorch:
pip install torch torchvision
NumPy:
pip install numpy
OpenCV:
pip install opencv-python
Matplotlib:
pip install matplotlib
Other necessary dependencies can be found here. You may need to modify the file ‘vtoonify_env.yaml’ to install a PyTorch build that matches your own CUDA version.
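Once the packages are installed, a quick check along the following lines can confirm that they are importable. This is a small convenience sketch using only the standard library; the helper name `missing_packages` is made up for this example, and the list covers only the packages installed above.

```python
import importlib.util

# Import names for the packages installed above; "cv2" is the module
# that the opencv-python package provides.
REQUIRED = ["torch", "torchvision", "numpy", "cv2", "matplotlib"]


def missing_packages(names):
    """Return the subset of top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are installed.")
```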
Model architecture design
You cannot create a generative video model without designing the architecture of the model. It determines the quality and capacity of the generated video sequences. Considering the sequential nature of video data is critical when designing the architecture of the generative model since video sequences consist of multiple frames linked by time. Combining CNNs with RNNs or creating a custom architecture may be an option.
As we are designing a model similar to VToonify, an in-depth understanding of the framework is necessary. So, what is VToonify?
VToonify is a framework developed by MMLab@NTU for generating high-quality artistic portrait videos. It combines the advantages of two existing frameworks: the image translation framework and the StyleGAN-based framework. The image translation framework supports variable input size, but achieving high-resolution and controllable style transfer is difficult. On the other hand, the StyleGAN-based framework is good for high-resolution and controllable style transfer but is limited to fixed image size and may lose details.
VToonify uses the StyleGAN model to achieve high-resolution and controllable style transfer and removes its limitations by adapting the StyleGAN architecture into a fully convolutional encoder-generator architecture. It uses an encoder to extract multi-scale content features of the input frame and combines them with the StyleGAN model to preserve the frame details and control the style. The framework has two instantiations, namely, VToonify-T and VToonify-D, wherein the first uses Toonify and the latter follows DualStyleGAN.
The backbone of VToonify-D is DualStyleGAN, developed by MMLab@NTU. DualStyleGAN utilizes the benefits of StyleGAN and can be considered an advanced version of it. In this article, we will be moving forward with VToonify-D.
The following steps need to be considered while designing a model architecture:
Since the model we develop is VToonify-like, human face sequences should be fed as input to the generative model, and anime or cartoon face sequences should be the output. Images, optical flows, or feature maps can be input and output data formats.
The encoder network can be written as follows:
# Excerpt from the model's __init__; in_size, out_size, channels,
# img_channels and num_res_layers are expected to be defined by the surrounding class.
num_styles = int(np.log2(out_size)) * 2 - 2
encoder_res = [2**i for i in range(int(np.log2(in_size)), 4, -1)]
self.encoder = nn.ModuleList()
self.encoder.append(
    nn.Sequential(
        nn.Conv2d(img_channels+19, 32, 3, 1, 1, bias=True),
        nn.LeakyReLU(negative_slope=0.2, inplace=True),
        nn.Conv2d(32, channels[in_size], 3, 1, 1, bias=True),
        nn.LeakyReLU(negative_slope=0.2, inplace=True)))

for res in encoder_res:
    in_channels = channels[res]
    if res > 32:
        out_channels = channels[res // 2]
        block = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, 3, 2, 1, bias=True),
            nn.LeakyReLU(negative_slope=0.2, inplace=True),
            nn.Conv2d(out_channels, out_channels, 3, 1, 1, bias=True),
            nn.LeakyReLU(negative_slope=0.2, inplace=True))
        self.encoder.append(block)
    else:
        layers = []
        for _ in range(num_res_layers):
            layers.append(VToonifyResBlock(in_channels))
        self.encoder.append(nn.Sequential(*layers))
        block = nn.Conv2d(in_channels, img_channels, 1, 1, 0, bias=True)
        self.encoder.append(block)
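To see what the first two lines of the encoder code compute, the size arithmetic can be run on its own. The sketch below uses `math.log2` in place of `np.log2` so that it needs no extra packages, and the sizes (256 for the input, 1024 for the output) are example values assumed for illustration.

```python
import math


def encoder_schedule(in_size, out_size):
    """Reproduce the two size computations from the encoder code."""
    num_styles = int(math.log2(out_size)) * 2 - 2
    encoder_res = [2 ** i for i in range(int(math.log2(in_size)), 4, -1)]
    return num_styles, encoder_res


# Example sizes (assumptions): 256x256 input frames, 1024x1024 output.
num_styles, encoder_res = encoder_schedule(in_size=256, out_size=1024)
print(num_styles)    # 18
print(encoder_res)   # [256, 128, 64, 32]

# Which branch of the loop in the encoder code each resolution takes.
for res in encoder_res:
    kind = "stride-2 downsampling block" if res > 32 else "residual blocks + 1x1 conv"
    print(res, "->", kind)
```

With these sizes, the encoder halves the resolution three times (256, 128 and 64 each get a stride-2 block) and then applies the residual blocks at resolution 32.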
You can refer to this GitHub link to add the generator network.
Model training
First, import argparse, math and random at the top of the training script:
import argparse
import math
import random
After importing the prerequisites, specify the parameters for training. These include the total number of training iterations, the batch size for each GPU, the local rank for distributed training, the interval for saving a checkpoint, the learning rate and more. The following argument definitions show how this is done.
self.parser = argparse.ArgumentParser(description="Train VToonify-D")
self.parser.add_argument("--iter", type=int, default=2500, help="total training iterations")
self.parser.add_argument("--batch", type=int, default=9, help="batch sizes for each gpus")
self.parser.add_argument("--lr", type=float, default=0.0001, help="learning rate")
self.parser.add_argument("--local_rank", type=int, default=0, help="local rank for distributed training")
self.parser.add_argument("--start_iter", type=int, default=0, help="start iteration")
self.parser.add_argument("--save_every", type=int, default=25000, help="interval of saving a checkpoint")
self.parser.add_argument("--save_begin", type=int, default=35000, help="when to start saving a checkpoint")
self.parser.add_argument("--log_every", type=int, default=300, help="interval of logging")
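The argument definitions above live inside a class (hence `self.parser`). A standalone version that can be run directly might look like the sketch below; it keeps only a subset of the options, and the `build_parser` helper is a name introduced for this example.

```python
import argparse


def build_parser():
    """Standalone parser with a subset of the training options shown above."""
    parser = argparse.ArgumentParser(description="Train VToonify-D")
    parser.add_argument("--iter", type=int, default=2500, help="total training iterations")
    parser.add_argument("--batch", type=int, default=9, help="batch size for each GPU")
    parser.add_argument("--lr", type=float, default=0.0001, help="learning rate")
    parser.add_argument("--start_iter", type=int, default=0, help="start iteration")
    return parser


# Passing a list instead of reading sys.argv makes the parser easy to try out.
args = build_parser().parse_args(["--iter", "5000", "--lr", "0.0002"])
print(args.iter, args.batch, args.lr, args.start_iter)
```

Options that are not passed on the command line fall back to their defaults, so here `args.batch` stays at 9 while `args.iter` and `args.lr` are overridden.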
Next, we have to pre-train the encoder network for the model.
def pretrain(args, generator, g_optim, g_ema, parsingpredictor, down, directions, styles, device):
    pbar = range(args.iter)

    if get_rank() == 0:
        pbar = tqdm(pbar, initial=args.start_iter, dynamic_ncols=True, smoothing=0.01)

    recon_loss = torch.tensor(0.0, device=device)
    loss_dict = {}

    if args.distributed:
        g_module = generator.module
    else:
        g_module = generator

    accum = 0.5 ** (32 / (10 * 1000))

    requires_grad(g_module.encoder, True)

    for idx in pbar:
        i = idx + args.start_iter

        if i > args.iter:
            print("Done!")
            break
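The constant `accum` in the excerpt is not used in the lines shown. In StyleGAN2-style training scripts, a value like this typically serves as the decay of an exponential moving average (EMA) of the generator weights, which is presumably what the `g_ema` argument holds; the sketch below assumes that role. The `accumulate` helper and the single-weight "models" are simplified stand-ins written for this example.

```python
accum = 0.5 ** (32 / (10 * 1000))   # same constant as in pretrain()


def accumulate(ema_weights, live_weights, decay):
    """In-place exponential moving average over a dict of scalar weights."""
    for name, value in live_weights.items():
        ema_weights[name] = decay * ema_weights[name] + (1 - decay) * value


# Toy "models": one scalar weight each (hypothetical values).
g_ema = {"w": 0.0}
g_live = {"w": 1.0}
for _ in range(1000):
    accumulate(g_ema, g_live, accum)

print(round(accum, 5))
print(round(g_ema["w"], 3))
```

With this decay, the influence of old weights halves every 10000 / 32 = 312.5 update steps, so the averaged copy follows the live weights smoothly instead of jumping with every optimizer step.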
Now train both the generator and the discriminator using paired data.
def train(args, generator, discriminator, g_optim, d_optim, g_ema, percept, parsingpredictor, down, pspencoder, directions, styles, device):
    pbar = range(args.iter)

    if get_rank() == 0:
        pbar = tqdm(pbar, initial=args.start_iter, smoothing=0.01, ncols=130, dynamic_ncols=False)

    d_loss = torch.tensor(0.0, device=device)
    g_loss = torch.tensor(0.0, device=device)
    grec_loss = torch.tensor(0.0, device=device)
    gfeat_loss = torch.tensor(0.0, device=device)
    temporal_loss = torch.tensor(0.0, device=device)
    gmask_loss = torch.tensor(0.0, device=device)
    loss_dict = {}

    surffix = '_s'
    if args.fix_style:
        surffix += '%03d'%(args.style_id)
    surffix += '_d'
    if args.fix_degree:
        surffix += '%1.1f'%(args.style_degree)
    if not args.fix_color:
        surffix += '_c'

    if args.distributed:
        g_module = generator.module
        d_module = discriminator.module
    else:
        g_module = generator
        d_module = discriminator
In the above code snippet, the function ‘train’ initializes the loss tensors for the generator and the discriminator, creates an empty dictionary for loss values, builds a suffix string from the style options, and unwraps the models when distributed training is used. The rest of the function (not shown) loops over the specified number of iterations, calculating the losses and minimizing them with backpropagation.
You can find the complete code to train the model here.
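To make the suffix logic in ‘train’ concrete, it can be pulled out into a small function and run with sample options. The function name `build_suffix` and the option values below are assumptions chosen for illustration; only the string-building steps mirror the excerpt.

```python
from types import SimpleNamespace


def build_suffix(args):
    """Mirror the suffix logic from train() as a standalone function."""
    suffix = "_s"
    if args.fix_style:
        suffix += "%03d" % args.style_id
    suffix += "_d"
    if args.fix_degree:
        suffix += "%1.1f" % args.style_degree
    if not args.fix_color:
        suffix += "_c"
    return suffix


# Hypothetical option values, for illustration only.
args = SimpleNamespace(fix_style=True, style_id=26,
                       fix_degree=True, style_degree=0.5, fix_color=False)
print(build_suffix(args))   # _s026_d0.5_c
```

The suffix therefore records which style index and style degree were fixed during training, and whether color transfer was left enabled.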
Model evaluation and fine-tuning
Model evaluation involves evaluating the model’s quality, efficiency, and effectiveness. When developers evaluate a model carefully, they can identify areas for improvement and fine-tune its parameters to improve its functionality. This process involves assessing the quality of the generated video sequences using quantitative metrics such as the structural similarity index (SSIM), mean squared error (MSE) or peak signal-to-noise ratio (PSNR), and visually inspecting the generated video sequences.
Based on the evaluation results, fine-tune the model by adjusting the architecture, configuration, or training process to improve its performance. It is also worth tuning the hyperparameters, adjusting the loss function and refining the optimization algorithm to enhance the generative video model’s performance.
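Two of the metrics mentioned above, MSE and PSNR, are simple enough to sketch in plain Python. The example below assumes frames are flattened lists of 8-bit pixel values and averages PSNR over frames; the function names and the tiny sample data are made up for illustration.

```python
import math


def mse(frame_a, frame_b):
    """Mean squared error between two equally sized, flattened frames."""
    if len(frame_a) != len(frame_b):
        raise ValueError("frames must have the same number of pixels")
    return sum((a - b) ** 2 for a, b in zip(frame_a, frame_b)) / len(frame_a)


def psnr(frame_a, frame_b, max_value=255.0):
    """Peak signal-to-noise ratio in decibels; infinite for identical frames."""
    error = mse(frame_a, frame_b)
    if error == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / error)


def video_psnr(video_a, video_b):
    """Average per-frame PSNR over two videos given as lists of frames."""
    scores = [psnr(a, b) for a, b in zip(video_a, video_b)]
    return sum(scores) / len(scores)


# Two tiny 2-frame "videos" with 4 pixels per frame (hypothetical data).
reference = [[0, 64, 128, 255], [10, 20, 30, 40]]
generated = [[1, 63, 129, 254], [11, 19, 31, 39]]
print(round(video_psnr(reference, generated), 2))
```

Higher PSNR means the generated frames are closer to the reference. SSIM is more involved to compute; libraries such as scikit-image provide an implementation.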
Develop web UI
Building a web User Interface (UI) is necessary if your project needs end-users to interact with the video model. It enables users to supply input parameters like effects, style types, image rescale, style degree and more. For this, you must design the layout, typography, colors and other visual elements based on your set parameters.
Now, develop the front end as per the design. Once the UI is developed, test it thoroughly to make it free of bugs and optimize the functionality. You can also use Gradio to build a custom UI for the project with only a few lines of Python and no front-end coding.
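A minimal Gradio interface for such a model might look like the sketch below. It assumes the `gradio` package is installed, and `toonify_video` is a placeholder standing in for real model inference: it simply returns the input so the UI can be tried without a trained model.

```python
def toonify_video(video_path, style_degree):
    """Placeholder for model inference (hypothetical).

    A real implementation would run the trained model on the uploaded video
    and return the path of the stylized result.
    """
    return video_path


def build_demo():
    """Create the Gradio interface; Gradio is imported only when this is called."""
    import gradio as gr

    return gr.Interface(
        fn=toonify_video,
        inputs=[gr.Video(label="Input video"),
                gr.Slider(0.0, 1.0, value=0.5, label="Style degree")],
        outputs=gr.Video(label="Stylized video"),
        title="Generative video demo")


# To start the app locally: build_demo().launch()
```

Calling `build_demo().launch()` starts a local web server with an upload widget, a slider for the style degree and a player for the result.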
Deployment
Once the model is trained and fine-tuned and the web UI is built, the model needs to be deployed to a production environment to generate new videos. Depending on the requirements, deployment may involve integrating the model with a mobile or web app, setting up a data processing and streaming pipeline, and configuring the hardware and software infrastructure.
Wrapping up
Creating a generative video model is a complex process that runs from preprocessing the video dataset and designing the model architecture, through adding layers to the basic architecture, to training and evaluating the model. Generative Adversarial Networks (GANs) or Variational Autoencoders (VAEs) are frequently used as the foundation architecture, and the model’s capacity and complexity can be increased by including convolutional, pooling, recurrent, or dense layers.
There are several applications for generative video models, such as video synthesis, video toonification, and video style transfer. Existing image-oriented models can be trained to produce high-quality, artistic videos with adaptable style settings. The field of generative video models is rapidly evolving, and new techniques and models are continually being developed to improve the quality and flexibility of the generated videos.