Video Tech Deep Dive: Super-Resolution with Machine Learning Part 3
https://bitmovin.com/blog/super-resolution-deployments-machine-learning-p3/

Practical Super-Resolution Deployments and Ensuing Results


Introduction

Welcome to Part 3 of Bitmovin's Video Tech Deep Dive series: "Super-Resolution with Machine Learning".
Before continuing with this post, I would highly recommend that you view the first two installments:

  • Part 1: Super-Resolution: What's the buzz and why does it matter?
  • Part 2: Super-Resolution: Why is it good and how can you incorporate it?

However, if you would rather jump right into it, here is a quick summary: 

  • Spatially upsampling videos is a huge business opportunity.
  • Super-Resolution (SR) is a class of techniques to spatially upsample videos. 
  • Machine Learning (ML) based SR methods are superior to the conventional SR methods.
  • SR can be incorporated into your video workflow in several ways, and consequently, help you improve the end-user experience. 

In this closing post, we explore:

  • How to deploy super-resolution practically in your video workflows
  • Which tools to use
  • Some real-life results from these practical deployments

Practical Super-Resolution Deployments

So, we understand how ML-based super-resolution works in theory. But how is it actually deployed in practice? 

Classic 3-Step Playbook

Like any other ML-based deployment, it follows the classic 3-step playbook. (This is by no means a comprehensive explanation of how ML models work, but rather a simplified representation to build a basic understanding.) You need to:

  1. Choose the right ML model
  2. Train the chosen ML model
  3. Use the trained ML model
Figure: The classic 3-step playbook used in machine learning deployments.

Choosing a model

The first step is choosing the right model structure to train and deploy. The model you select determines the tradeoff between computational complexity and performance. For example, you can select a "complex" model that is hard to train in Step 2 but gives great results in Step 3. Conversely, you can choose a "simpler" model that is easy to train but performs comparatively worse.
This is similar to the tradeoff you make when choosing between codecs, where the tradeoff is defined by compression efficiency vs encoding complexity. There is no single "correct" model; the right choice is determined by the requirements of the particular use case.
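To make that spectrum concrete, here is a sketch of SRCNN, which the literature commonly cites as the "simple" end of the scale: just three convolutions applied to an input that has already been bicubic-upscaled. The 9-1-5 layer sizes follow the commonly cited configuration; treat this as an illustrative PyTorch sketch, not the exact model used in any production deployment.

```python
import torch.nn as nn

# Minimal sketch of the SRCNN architecture (Dong et al.): three stacked
# convolutions over a bicubic-upscaled input. Illustrative only.
class SRCNN(nn.Module):
    def __init__(self, channels: int = 3):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, 64, kernel_size=9, padding=4),  # patch extraction
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 32, kernel_size=1),                   # non-linear mapping
            nn.ReLU(inplace=True),
            nn.Conv2d(32, channels, kernel_size=5, padding=2),  # reconstruction
        )

    def forward(self, x):
        # x: a low-resolution frame already upscaled to target size (bicubic).
        return self.body(x)
```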

Training the model

Once you've selected the right model for the specific project, you need to feed it training data. In the case of super-resolution, the training data is a collection of high-resolution videos and their corresponding low-resolution versions. The model is like an empty brain, and the training data is its sensory input from the real world. Based on the training data, the model learns how to upsample videos.
As you might have guessed, the choice of training data strongly influences what the model learns. If you only feed it one particular type of content, say cartoons, it will perform exceptionally well on that type of content, but not so well on others. So the training dataset has to be chosen carefully. Once the model is trained, you can start deploying it to upsample real low-resolution videos.
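In practice, the low-resolution half of each training pair is usually produced by downscaling the high-resolution source. A minimal sketch of that data-preparation step follows; the directory names and scale factor are illustrative assumptions.

```python
from pathlib import Path
from PIL import Image

# Build (low-res, high-res) training pairs by downscaling HR frames.
SCALE = 2
hr_dir = Path("frames_hr")   # extracted high-resolution frames
lr_dir = Path("frames_lr")   # degraded counterparts the model learns to invert
lr_dir.mkdir(exist_ok=True)

for hr_path in sorted(hr_dir.glob("*.png")):
    hr = Image.open(hr_path)
    w, h = hr.size
    # Bicubic downscaling is a common choice of degradation model; if your
    # real sources are noisy or compressed, the degradation should mimic
    # that instead, or the model will learn the wrong inverse.
    lr = hr.resize((w // SCALE, h // SCALE), Image.BICUBIC)
    lr.save(lr_dir / hr_path.name)
```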

Using the model

The last step is to use the model trained in Step 2 to upsample videos: feed in a low-resolution video and it produces an upsampled high-resolution video.


Implementation of the model

Once you've selected an ML model, the next critical step is deciding how to actually implement it. This is similar to choosing a codec and its corresponding implementation in an encoding workflow:

  1. Selecting the codec (e.g. HEVC, VP9, AV1) – the standard that specifies how the encoding should be applied
  2. Selecting the implementation of the codec – there are several implementations of the same codec. In the case of HEVC (H.265), there are x265, Beamr HEVC, and NVIDIA hardware HEVC, among others. Implementations can be hardware- or software-based; software implementations, in turn, can be open-source or closed-source.

Similarly, for super-resolution ML models, the same differentiation applies to implementations of a given model: open-source vs closed-source, and hardware-accelerated vs software-only. For example, SRCNN is a popular Super-Resolution ML model from the academic literature, and Waifu2x is an open-source implementation of it.
Figure: A codec and its implementation (HEVC and x265) are analogous to a Super-Resolution ML model and its corresponding implementation (SRCNN and Waifu2x).

Some popular open-source implementations of Super-Resolution ML models are Waifu2x, Anime4KCPP, and ACNet.


Super-Resolution Practical Results

In this section, we look at some results obtained from implementations of Super-Resolution models. First, we look at a deployment where we manually trained the model; then, one where we used a pre-trained model.

Manually Trained model

Methodology

In this section, we follow all three steps laid out earlier in the "Classic 3-Step Playbook" section, using FFmpeg for the testing. The following settings were used to evaluate super-resolution in FFmpeg:

Figure: Evaluation of the inbuilt super-resolution video filter in FFmpeg.
  1. Choose the model: We chose the Efficient Sub-Pixel Convolutional Neural Network (ESPCN) model. 
  2. Train the model: We used two 1080p sample videos to train the model that was then fed into FFmpeg.
  3. Use the model: Once the model is ready, we can use it to upsample any video.

To test the performance of the super-resolution upsampling method in a real video workflow, we selected a 720p video as input. The input video was upscaled and transcoded using H.264 with high-quality encoding settings at several different bitrates. To determine the effectiveness of the super-resolution method, we first ran the same test with the traditional bicubic method as a control.
Once we had both upsampled outputs, we compared the final output quality against the input quality using the VMAF quality metric. The results are shown in the following figure.

Figure: VMAF vs bitrate (in Kbps) for the different upsampling methods.
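As an aside, a VMAF comparison like this can be scripted around FFmpeg's libvmaf filter. A hedged sketch, assuming an FFmpeg build compiled with libvmaf enabled; the file names are illustrative:

```python
import subprocess

# Score an upsampled output against a reference with FFmpeg's libvmaf filter.
cmd = [
    "ffmpeg",
    "-i", "upsampled_1080p.mp4",   # distorted video (first input)
    "-i", "reference_1080p.mp4",   # pristine reference (second input)
    "-lavfi", "libvmaf=log_path=vmaf.json:log_fmt=json",
    "-f", "null", "-",             # decode and score only, write no video
]
subprocess.run(cmd, check=True)    # per-frame and pooled scores land in vmaf.json
```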

Results

We observed an average difference of 6 VMAF points. If you work in the video field, you will recognize that as a significant gain in video quality.
Admittedly, this is neither a scientific evaluation of the upsampling methods nor a fair comparison: we used only two videos to train the model, and we fed in a similar type of low-resolution video for super-resolution upsampling, so the experiment is heavily biased towards the super-resolution method.
Nevertheless, it illustrates the advantage of super-resolution methods over traditional upsampling. This was a test-grade result; in the next section, we will look at a production-grade result.
If you are an engineer or developer and want to try your hand at the steps above, you are in luck: the popular multimedia tool FFmpeg has supported super-resolution as a built-in video filter since version 4.2.
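A minimal sketch of invoking that filter is below. It assumes an FFmpeg build with TensorFlow DNN support, and "espcn.pb" is a placeholder name for a model file you trained yourself (for example, via the steps above).

```python
import subprocess

# Upscale a 720p input 2x with FFmpeg's built-in "sr" filter (FFmpeg >= 4.2).
# Requires an FFmpeg build with the TensorFlow DNN backend; "espcn.pb" is a
# placeholder for your own trained model file.
cmd = [
    "ffmpeg", "-i", "input_720p.mp4",
    "-vf", "sr=dnn_backend=tensorflow:scale_factor=2:model=espcn.pb",
    "-c:v", "libx264", "-crf", "18",   # re-encode the upscaled frames
    "output_2x.mp4",
]
subprocess.run(cmd, check=True)
```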


Pre-Trained Model

It's important to note that steps one and two of the "Classic 3-Step Playbook" are not always necessary when using machine learning. In some cases, the best-performing and most appropriate models have already been chosen and trained. In the case of super-resolution, there are plenty of publicly available models for upsampling anime content that have been trained on large datasets and proven to work well. You therefore don't need to carry the burden of training your own models; you can simply plug an existing model into your production workflow and "turn on" super-resolution.

Figure: For certain use cases, such as anime, you can skip step 1 (choosing) and step 2 (training) and directly use a pre-trained model.

For our second super-resolution deployment test, we chose the popular Waifu2x implementation with a pre-trained model to do production-grade testing on a very popular (but old) anime series. Given that this model is a perfect fit for upsampling anime-style art, we selected production assets that are old, low-resolution, and noisy.

Figure: Production assets used for testing the pre-trained super-resolution model.
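For orientation, here is a hedged sketch of what such a frame-based pipeline can look like. The file names are hypothetical, and waifu2x-ncnn-vulkan stands in for whichever Waifu2x build you use; this is not the exact production setup.

```python
import pathlib
import subprocess

# Frame-based pipeline around a pre-trained upscaler (illustrative sketch).
src = "old_anime_episode.mp4"   # hypothetical asset name
pathlib.Path("lr_frames").mkdir(exist_ok=True)
pathlib.Path("hr_frames").mkdir(exist_ok=True)

# 1. Split the source into individual frames.
subprocess.run(["ffmpeg", "-i", src, "lr_frames/%06d.png"], check=True)

# 2. Upscale every frame with the pre-trained model (2x, light denoising).
subprocess.run(
    ["waifu2x-ncnn-vulkan", "-i", "lr_frames", "-o", "hr_frames",
     "-s", "2", "-n", "1"],
    check=True,
)

# 3. Re-encode the upscaled frames, carrying the original audio across.
subprocess.run(
    ["ffmpeg", "-framerate", "24", "-i", "hr_frames/%06d.png",
     "-i", src, "-map", "0:v", "-map", "1:a?",
     "-c:v", "libx264", "-crf", "18", "upscaled_2x.mp4"],
    check=True,
)
```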

We fed the assets through the pre-trained model and compared the output against the conventionally upsampled version. The video upsampled with super-resolution is significantly better: edges and boundary details are crisper and blur artifacts are reduced, among other improvements.
Since we did not have a high-resolution reference video, we could not compute an objective quality metric such as VMAF. Instead, we ran a subjective quality test: we played the two videos, conventionally upsampled vs super-resolution upsampled, and asked viewers to vote on which looked better. 83% of viewers voted that the super-resolution upsampled video looked better.

Figure: Subjective quality testing. Video 2 is the upsampled video obtained with the Super-Resolution method; Video 1 is the upsampled video obtained with the conventional bicubic method.

These results are "super" encouraging for the future of upsampling content with machine learning: new models don't always need to be picked and trained. For certain specific production use cases, one can use pre-trained models and obtain superior results compared to traditional methods, and one can reasonably expect super-resolution models to be applied to many new variations of use cases.


Conclusion 

As we've learned throughout this series, super-resolution with machine learning is a huge business opportunity, especially given the exponential rise of 4K (and higher) resolution consumer devices and the growing number of streaming services. Since many content owners have massive backlog libraries of standard-definition content, upsampling is a great way to re-engage an older audience and introduce new viewers to older content. Super-resolution, one of several upsampling methods, is best applied using machine learning techniques whose quality will only improve over time. However, much like a classic encoding workflow, there are countless ways to implement ML-based super-resolution upsampling, and finding the right balance between existing and newly trained models will ultimately help improve the end-user experience.
Did you enjoy this post? Check out the following content:
Part one of the Super-Resolution series: What’s the buzz and why does it matter?
Part two: Why is it good and how can you incorporate it?
View my comparison test of Bicubic vs Super-Resolution content here

Video Tech Deep Dive: Super-Resolution with Machine Learning P2
https://bitmovin.com/blog/super-resolution-machine-learning-p2/

Super-Resolution: Why is it good and how can you incorporate it?


Introduction 

Welcome to Part 2 of Bitmovin's Video Tech Deep Dive series: Super-Resolution with Machine Learning. Before you get started, I highly recommend that you read Part 1. But if you would rather jump right into it, here is a quick summary:

  • Spatially upsampling videos is a huge business opportunity.
  • Super-Resolution is a class of techniques to spatially upsample videos. 
  • Super-Resolution can be categorized into two categories: machine-learning based and non-machine-learning based. 
  • This blog series will focus on machine learning-based super-resolution.
Figure: The focus of this series of blog posts is machine learning-based super-resolution.

In this post, we will examine:

  • What factors led to the current popularity of machine learning-based super-resolution?
  • Why is it better than other, conventional methods?
  • And finally, how can you incorporate it into your video workflow, and what benefits will it yield?

The Holy Trinity: Super-Resolution, Machine Learning, and Video Upscaling

Super-Resolution, Machine Learning (ML), and video upscaling are a match made in heaven. These three factors coming together is the reason behind the current popularity of machine learning-based super-resolution applications. In this section, we will see why.

Super-Resolution and its beginnings

The concept of super-resolution has existed since the 1980s. The basic idea behind super-resolution was (and continues to be) to intelligently combine non-redundant information from multiple related low-resolution images to generate a single high-resolution image. 

Figure: Super-Resolution uses non-redundant information from several related images to produce a single image.

A classic early application was recovering license plate information from several low-resolution images.

Figure: Several low-resolution snapshots of a moving car provide non-redundant but related information. Super-Resolution uses this related non-redundancy to create higher-resolution images, which can reveal information such as the license plate or the driver's identity [Source].
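To make the classical multi-frame idea concrete, here is a toy "shift-and-add" sketch. The known sub-pixel shifts and the nearest-cell placement are simplifying assumptions; real systems first register the frames to estimate the shifts and interpolate the grid cells no sample lands on.

```python
import numpy as np

def shift_and_add(lr_frames, shifts, scale=2):
    """Toy multi-frame super-resolution: place every low-res sample onto a
    finer grid at its (known) sub-pixel offset, then average the overlaps."""
    h, w = lr_frames[0].shape
    acc = np.zeros((h * scale, w * scale))
    hits = np.zeros_like(acc)
    for frame, (dy, dx) in zip(lr_frames, shifts):
        # Nearest high-res grid position for each low-res sample.
        yi = np.clip(np.round((np.arange(h) + dy) * scale).astype(int), 0, h * scale - 1)
        xi = np.clip(np.round((np.arange(w) + dx) * scale).astype(int), 0, w * scale - 1)
        np.add.at(acc, (yi[:, None], xi[None, :]), frame)
        np.add.at(hits, (yi[:, None], xi[None, :]), 1.0)
    # Sub-pixel shifts between frames make samples land on *different* fine
    # cells: that is the "non-redundant information" being combined.
    return np.where(hits > 0, acc / np.maximum(hits, 1), 0.0)
```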

When super-resolution started out, the "intelligence" was, roughly speaking, a set of predefined, complex mathematical formulas (image observation models, interpolation-restoration, among others). The "intelligence" at the beginning had nothing to do with ML.
But the recent wave of interest in super-resolution has been driven primarily by ML.


Machine learning’s resurgence

So, why ML, and what has changed now?
ML, in essence, is about learning the "intelligence" for a well-defined problem. With the right architecture and enough data, ML can be significantly more "intelligent" than a human-defined solution, at least in that narrow domain. We saw this demonstrated stunningly in the cases of AlphaZero (for chess) and AlphaGo (for the board game Go).
Super-resolution is a well-defined problem, and one could reasonably argue that ML is a natural fit for solving it. With that motivation, early theoretical solutions were proposed in the literature long ago.
But exorbitant computational requirements and unresolved fundamental complexities kept the practical applications of ML-based super-resolution at bay.
However, in the last few years, there were two major developments:

  1. An enormous increase in computational power density, especially purpose-built Graphics Processing Units (GPUs), along with their growing affordability.
  2. Fundamental advances in ML, especially Convolutional Neural Networks (CNNs), and their ease of use. 

These developments have led to a resurgence of ML-based super-resolution methods.
It should be mentioned that ML-based super-resolution is a versatile hammer that can drive many nails: it has wide applications, ranging from medical imaging to remote sensing and astronomical observations. But as mentioned in Part 1 of this series, we will focus on how the ML super-resolution hammer can nail the problem of video upscaling.


The convergence of the three factors

The last missing puzzle piece in this story is video upscaling.

Figure: Video upscaling, machine learning, and Super-Resolution: a match made in heaven.

When you think about it, video upscaling is almost a perfect “nail” for the ML-based super-resolution “hammer”. 
Video provides the core features needed for ML-based Super-Resolution, namely:

  • Related non-redundancy built in: Every frame in a video almost always has a set of closely related frames, and if an object moves enough across those frames, they provide non-redundant information about it.
  • A vast amount of available data: We have no shortage of video. This vast amount of data can be used to train the ML network and let it learn the best upscaling "intelligence".

The convergence of these three factors is why we are witnessing a huge uptick in research in this area, as well as the first practical applications of ML super-resolution-powered video upscaling.


Why is it better than traditional methods?

So far, I have provided a historical timeline and the factors that led to ML super-resolution-powered video upscaling. But it might still not be clear why it is superior to the traditional methods (bilinear, bicubic, Lanczos, among others). In this section, I will offer a simplified, intuitive explanation.
The superior performance boils down to the fact that the algorithm understands the nature of the content it is upsampling and tunes itself to upsample that content in the best possible way. This is in contrast to the traditional methods, where there is no "tuning": the same formula is applied with no consideration of the nature of the content.
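For intuition, here is the standard bicubic (Keys) kernel as a small sketch. Notice that no image data appears anywhere in the formula: the same fixed weights are applied at every pixel, whatever the content happens to be.

```python
def bicubic_weight(x, a=-0.5):
    """Standard bicubic (Keys) kernel with the common a = -0.5.
    The weighting depends only on distance, never on content."""
    x = abs(x)
    if x <= 1:
        return (a + 2) * x**3 - (a + 3) * x**2 + 1
    if x < 2:
        return a * x**3 - 5 * a * x**2 + 8 * a * x - 4 * a
    return 0.0

# Weights of the four taps for a sample halfway between two pixels:
# identical for an anime edge, a face, or grass.
print([round(bicubic_weight(d), 4) for d in (-1.5, -0.5, 0.5, 1.5)])
# -> [-0.0625, 0.5625, 0.5625, -0.0625]
```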
One could say that: 

ML-based super-resolution is to upsampling, what Per-Title is to encoding.

In Per-Title, we use different encoding recipes for different pieces of content. In the same way, ML-based super-resolution uses different upsampling recipes for different pieces of content.
The recipe can adapt itself at both the:

  • Macro level: using different upsampling recipes for different types of content (anime, movies, sports, among others)
  • Micro level: using different upsampling recipes for different types of frames within the same content (high-complexity frames, low-complexity frames).
Figure: The superior performance of ML-based super-resolution comes from the fact that it understands the content, at both the macro level and the micro level.

Why do you need Super-Resolution with Machine Learning? 

Hopefully, by now, you are excited about the possibilities of this idea. In this section, I offer some suggestions on how you can incorporate it into your own video workflows, along with the potential benefits you might expect.

Quality

Broadly speaking, a video processing workflow typically involves three steps:

  1. Pre-processing (decoding, upsampling, filtering, among others) 
  2. Encoding
  3. Post-processing (filtering, muxing, among others) 

Typically, there is a heavy emphasis on the encoding block for visual quality optimizations (Per-Title, 3-Pass, Codec-Configuration, among others).

Figure: The three major blocks in a video processing workflow. The emphasis is typically on the encoding block, while the pre- and post-processing blocks are often ignored.

But the other two (often overlooked) blocks are just as important when it comes to visual quality optimization. In this context, upsampling is a pre-processing step, and by choosing the right upsampling method, such as super-resolution, one can improve the visual quality of the entire workflow, sometimes significantly more than the other blocks can provide.
In Part 3 of this series, we delve more deeply into this and quantify, using some real-life examples, how much quality improvement one can expect from tuning the pre-processing block with super-resolution.


Synergies with other blocks

(This section is primarily meant for advanced readers who know what Per-Title, VMAF, and convex hull mean. Feel free to skip it.)
As explained earlier, there are broadly three blocks in a video workflow. Roughly speaking, they work independently. But if we are smart about the design, we can extract synergies that improve the overall video pipeline in ways that otherwise would not exist.
One illustrative example is how Per-Title can work in conjunction with Super-Resolution. This idea is depicted in the following figure.

Figure: VMAF vs bitrate convex hulls for a piece of video content. Green = 360p, Red = 720p, Blue = 1080p; BC = bicubic, SR = Super-Resolution.

The solid lines represent the convex hull when the traditional bicubic upsampling method is used, whereas the dotted lines represent the convex hull with the super-resolution method (as explained in the previous section, visual quality improves with Super-Resolution).
In the figure, at the illustrated bitrate, the choice is clear when using the traditional method: we pick the 720p rendition. But when using Super-Resolution, the choice is less obvious. We could pick any of:

  • 720p Super-Resolution rendition, or
  • 360p Super-Resolution rendition, or
  • 720p Bicubic rendition.

The choice is determined by the complexity vs quality tradeoff we are willing to make.
The takeaway is that two blocks working together synergistically give the Per-Title algorithm more options and flexibility to work with, and a higher number of options translates to better overall results.
This is just one illustrative example; within your own video workflows, you can identify places where super-resolution can work synergistically and improve the overall performance, as the sketch below illustrates.
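Here is a minimal sketch of that selection step once super-resolution renditions join the candidate set. The (bitrate, VMAF) numbers are made up for illustration, not measurements.

```python
# Per-title rendition selection over an expanded candidate set (illustrative).
candidates = [
    # (label, bitrate_kbps, vmaf) - invented numbers for illustration
    ("360p-bicubic",   800, 62.0),
    ("360p-superres",  800, 70.0),
    ("720p-bicubic",  1500, 78.0),
    ("720p-superres", 1500, 84.0),
    ("1080p-bicubic", 3000, 88.0),
]

def pick_rendition(target_kbps, candidates):
    """Pick the highest-VMAF rendition that fits the bitrate budget."""
    feasible = [c for c in candidates if c[1] <= target_kbps]
    return max(feasible, key=lambda c: c[2], default=None)

# With super-resolution in the mix, the winner at 1500 kbps changes:
print(pick_rendition(1500, candidates))  # ('720p-superres', 1500, 84.0)
```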


Targeted Upsampling 

If your entire video catalog is a specific kind of content (anime, for example) and you want to do targeted upsampling of that content, then ML Super-Resolution is without doubt the way to go!
In fact, many companies already do exactly that, and this trend will only accelerate, especially considering the popularity of consumer 4K TVs.
Visual quality enhancement, synergies, and targeted upsampling are some of the ways you can incorporate Super-Resolution into your video workflows.

Figure: Super-Resolution applied to targeted content such as anime [Source].

Summary

We continued the story from Part 1 and learned that:

  • The convergence of three factors has led to a resurgence in machine learning-based Super-Resolution.
  • The superior performance of Super-Resolution boils down to the fact that it understands the content it is upsampling.
  • Super-Resolution can be incorporated into your video workflow in several ways and, consequently, help you improve the end-user experience. 

Did you enjoy this post?
Check out part one of the super-resolution series: Super-Resolution with Machine Learning P1 
Check out part three: Practical Super-Resolution Deployments and Ensuing Results
