ATHENA’s first 5 years of research and innovation
https://bitmovin.com/blog/5-years-of-research-and-innovation/ | Mon, 19 Aug 2024

Since forming in October 2019, the Christian Doppler Laboratory ATHENA at Universität Klagenfurt, run by Bitmovin co-founder Dr. Christian Timmerer, has been advancing research and innovation for adaptive bitrate (ABR) streaming technologies. Over the past five years, the lab has addressed critical challenges in video streaming, from encoding and delivery to playback and end-to-end quality of experience. As Bitmovin’s research partner, they are breaking new ground using edge computing, machine learning, neural networks and generative AI for video applications, contributing significantly to both academic knowledge and industry practice.

In this blog, we’ll take a look at the highlights of the ATHENA lab’s work over the past five years and its impact on the future of the streaming industry.

Publications

ATHENA has made its mark with high-impact publications on the topics of multimedia, signal processing, and computer networks. Their research has been featured in prestigious journals such as IEEE Communications Surveys & Tutorials and IEEE Transactions on Multimedia. With 94 papers published or accepted by the time of the 5-year evaluation, the lab has established itself as a leader in video streaming research.

ATHENA has also contributed to reproducibility in research. Their open-source tools, the Video Complexity Analyzer and LLL-CAdViSE, have already been used by Bitmovin and others in the industry, and their open, multi-codec UHD dataset enables research and development of multi-codec playback solutions for 8K video.

ATHENA has also explored applications of AI in video coding and streaming, something that will become more of a focus over the next two years. You can read more about ATHENA’s AI video research in this blog post.

Patents

But it’s not all just theoretical research. The ATHENA lab has successfully translated its findings into practical solutions, filing 16 invention disclosures and 13 patent applications. As of publication, six patents have been granted.

Workflow diagram for Fast Multi-Rate Encoding using convolutional neural networks. More detail available here.

PhDs

ATHENA has also made an educational impact, guiding its inaugural cohort of seven PhD students to successful dissertation defenses, with research topics ranging from edge computing in video streaming to machine learning applications in video coding.

The lab’s two postdoctoral scholars have also made significant contributions to this work.

Practical applications with Bitmovin

As Bitmovin’s academic partner, ATHENA plays a critical role in developing and enhancing technologies that can differentiate our streaming solutions. As ATHENA’s company partner, Bitmovin helps guide and test practical applications of the research, with regular check-ins for in-depth discussions about new innovations and potential technology transfers. The collaboration has resulted in several advancements over the years, including recent projects like CAdViSE and WISH ABR. 

CAdViSE

CAdViSE (Cloud-based Adaptive Video Streaming Evaluation) is a framework for automated testing of media players. It allows you to test how different players and ABR configurations perform and react to fluctuations in different network parameters. Bitmovin is using CAdViSE to evaluate the performance of different custom ABR algorithms. The code is available in this GitHub repo.

WISH ABR

WISH stands for Weighted Sum model for HTTP Adaptive Streaming, and it allows customization of ABR logic for different devices and applications. WISH’s logic is based on a model that weighs the bandwidth, buffer and quality costs of playing back a segment. By setting weights for the importance of those metrics, you create a custom ABR algorithm optimized for your content and use case. You can learn more about WISH ABR in this blog post.

Decision process for WISH ABR, weighing data/bandwidth cost, buffer cost, and quality cost of each segment.
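
To make the weighted-sum idea concrete, below is a minimal C++ sketch of a WISH-style decision: it scores each available rendition by a weighted sum of bandwidth, buffer and quality costs and requests the cheapest one. The cost formulas, struct fields and function names are simplified assumptions for illustration, not the published WISH model.

// A minimal sketch of a weighted-sum ABR decision in the spirit of WISH.
// The cost terms below are simplified placeholders, not the WISH formulas.
#include <algorithm>
#include <cstddef>
#include <limits>
#include <vector>

struct Representation {
    double bitrateKbps;  // encoded bitrate of this rendition
    double vmaf;         // quality of the segment at this rendition (0..100 assumed)
};

// Returns the index of the rendition with the lowest weighted cost.
std::size_t wishLikeDecision(const std::vector<Representation>& reps,
                             double estThroughputKbps, double bufferSec,
                             double wBandwidth, double wBuffer, double wQuality) {
    std::size_t best = 0;
    double bestCost = std::numeric_limits<double>::max();
    for (std::size_t i = 0; i < reps.size(); ++i) {
        // bandwidth/data cost: share of the estimated throughput this rendition needs
        double bandwidthCost = reps[i].bitrateKbps / estThroughputKbps;
        // buffer cost: picking a rendition above the throughput is riskier
        // the emptier the buffer is
        double bufferCost = (reps[i].bitrateKbps > estThroughputKbps)
                                ? 1.0 / std::max(bufferSec, 0.1)
                                : 0.0;
        // quality cost: distance from the best possible quality
        double qualityCost = (100.0 - reps[i].vmaf) / 100.0;

        double cost = wBandwidth * bandwidthCost +
                      wBuffer * bufferCost +
                      wQuality * qualityCost;
        if (cost < bestCost) { bestCost = cost; best = i; }
    }
    return best;  // rendition to request for the next segment
}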

Project spinoffs

The success of ATHENA has led to three spinoff projects:

APOLLO

APOLLO is funded by the Austrian Research Promotion Agency FFG and is a cooperative project between Bitmovin and Alpen-Adria-Universität Klagenfurt. The main objective of APOLLO is to research and develop an intelligent video platform for HTTP adaptive streaming which provides distribution of video transcoding across large and small-scale computing environments, using AI and ML techniques for the actual video distribution.

GAIA

GAIA is also funded by the Austrian Research Promotion Agency FFG and is a cooperative project between Bitmovin and Alpen-Adria-Universität Klagenfurt. The GAIA project researches and develops a climate-friendly adaptive video streaming platform that provides complete energy awareness and accountability along the entire delivery chain. It also aims to reduce energy consumption and GHG emissions through advanced analytics and optimizations on all phases of the video delivery chain.

SPIRIT

SPIRIT (Scalable Platform for Innovations on Real-time Immersive Telepresence) is an EU Horizon Europe-funded innovation action. It brings together cutting-edge companies and universities in the field of telepresence applications with advanced and complementary expertise in extended reality (XR) and multimedia communications. SPIRIT’s mission is to create Europe’s first multisite and interconnected framework capable of supporting a wide range of application features in collaborative telepresence.

What’s next

Over the next two years, the ATHENA project will focus on advancing deep neural network and AI-driven techniques for image and video coding. This work will include making video coding more energy- and cost-efficient, exploring immersive formats like volumetric video and holography, and enhancing QoE while being mindful of energy use. Other focus areas include AI-powered, energy-efficient live video streaming and generative AI applications for adaptive streaming. 

Get in touch or let us know in the comments if you’d like to learn more about Bitmovin and ATHENA’s research and innovation, AI, or sustainability-related projects.

The AI Video Research Powering a Higher Quality Future
https://bitmovin.com/blog/ai-video-research/ | Sun, 05 May 2024

*This post was originally published in June 2023. It was updated in May 2024 with more recent research publications and updates.*

This post will summarize the current state of Artificial Intelligence (AI) applications for video in 2024, including recent progress and announcements. We’ll also take a closer look at AI video research and collaboration between Bitmovin and the ATHENA laboratory that has the potential to deliver huge leaps in quality improvements and bring an end to playback stalls and buffering. This includes ATHENA’s FaRes-ML, which was recently granted a US Patent. Keep reading to learn more!

AI for video at NAB 2024

At NAB 2024, the AI hype train continued gaining momentum and we saw more practical applications of AI for video than ever before: AI-powered encoding optimization, Super Resolution upscaling, automatic subtitling and translation, and generative AI video descriptions and summarizations. Bitmovin also presented some new AI-powered solutions, including our Analytics Session Interpreter, which won a Best of Show award from TV Technology. It uses machine learning and large language models to generate a summary, analysis and recommendations for every viewer session. The early feedback has been positive and we’ll continue to refine and add more capabilities that will help companies better understand and improve their viewers’ experience.

L to R: Product Manager Jacob Arends, CEO Stefan Lederer and Engineer Peter Eder accepting the award for Bitmovin’s AI-powered Analytics Session Interpreter

Other AI highlights from NAB included Jan Ozer’s “Beyond the Hype: A Critical look at AI in Video Streaming” presentation, NETINT and Ampere’s live subtitling demo using OpenAI Whisper, and Microsoft and Mediakind sharing AI applications for media and entertainment workflows. You can find more detail about these sessions and other notable AI solutions from the exhibition floor in this post.

FaRes-ML granted US Patent

For a few years before this recent wave of interest, Bitmovin and our ATHENA project colleagues have been researching the practical applications of AI for video streaming services. It’s something we’re exploring from several angles, from boosting visual quality and upscaling older content to more intelligent video processing for adaptive bitrate (ABR) switching. One of the projects that was first published in 2021 (and covered below in this post) is Fast Multi-Resolution and Multi-Rate Encoding for HTTP Adaptive Streaming Using Machine Learning (FaRes-ML). We’re happy to share that FaRes-ML was recently granted a US Patent! Congrats to the authors, Christian Timmerer, Hadi Amirpour, Ekrem Çetinkaya and the late Prof. Mohammad Ghanbari, who sadly passed away earlier this year.

Recent Bitmovin and ATHENA AI Research

In this section, I’ll give a short summary of projects that were shared and published since the original publication of this blog, and link to details for anyone interested in learning more. 

Generative AI for Adaptive Video Streaming

Presented at the 2024 ACM Multimedia Systems Conference, this research proposal outlines the opportunities at the intersection of advanced AI algorithms and digital entertainment for elevating quality, increasing user interactivity and improving the overall streaming experience. Research topics that will be investigated include AI-generated recommendations for user engagement and AI techniques for reducing video data transmission. You can learn more here.

DeepVCA: Deep Video Complexity Analyzer

The ATHENA lab developed and released the open-source Video Complexity Analyzer (VCA) to extract and predict video complexity faster than existing methods like ITU-T’s Spatial Information (SI) and Temporal Information (TI). DeepVCA extends VCA using deep neural networks to accurately predict video encoding parameters, like bitrate, and the encoding time of video sequences. The spatial complexities of the current and previous frames are used to rapidly predict the temporal complexity of a sequence, and the results show significant improvements over unsupervised methods. You can learn more and access the source code and dataset here.

DeepVCA’s spatial and temporal complexity prediction process

DIGITWISE: Digital Twin-based Modeling of Adaptive Video Streaming Engagement

DIGITWISE leverages the concept of a digital twin, a digital replica of an actual viewer, to model user engagement based on past viewing sessions. The digital twin receives input about streaming events and utilizes supervised machine learning to predict user engagement for a given session. The system model consists of a data processing pipeline, machine learning models acting as digital twins, and a unified model to predict engagement (XGBoost). The DIGITWISE system architecture demonstrates the importance of personal user sensitivities, reducing user engagement prediction error by up to 5.8% compared to non-user-aware models. It can also be used to optimize content provisioning and delivery by identifying the features that maximize engagement, providing an average engagement increase of up to 8.6%. You can learn more here.

System overview of DIGITWISE user engagement prediction

Previous Bitmovin and ATHENA AI Research

Better quality with neural network-driven Super Resolution upscaling

The first group of ATHENA publications we’re looking at all involve using neural networks to drive visual quality improvements with Super Resolution upscaling techniques.

DeepStream: Video streaming enhancements using compressed deep neural networks

Deep learning-based approaches keep getting better at enhancing and compressing video, but the quality of experience (QoE) improvements they offer are usually only available to devices with GPUs. This paper introduces DeepStream, a scalable, content-aware per-title encoding approach that supports both CPU-only and GPU-available end-users. To support backward compatibility, DeepStream constructs a bitrate ladder based on any existing per-title encoding approach, with an enhancement layer for GPU-available devices. The added layer contains lightweight video super-resolution deep neural networks (DNNs) for each bitrate-resolution pair of the bitrate ladder. GPU-available end-users get ~35% bitrate savings at equivalent PSNR and VMAF quality scores, while CPU-only users receive the video as usual. You can learn more here.

DeepStream system architecture

LiDeR: Lightweight video Super Resolution for mobile devices

Although DNN-based Super Resolution methods like DeepStream show huge improvements over traditional methods, their computational complexity makes it hard to use them on devices with limited power, like smartphones. Recent improvements in mobile hardware, especially GPUs, made it possible to use DNN-based techniques, but existing DNN-based Super Resolution solutions are still too complex. This paper proposes LiDeR, a lightweight video Super Resolution network specifically tailored toward mobile devices. Experimental results show that LiDeR can achieve competitive Super Resolution performance with state-of-the-art networks while improving the execution speed significantly. You can learn more here or watch the video presentation from an IEEE workshop.

Quantitative results comparing Super Resolution methods. LiDeR achieves near equivalent PSNR and SSIM quality scores while running ~3 times faster than its closest competition.

Super Resolution-based ABR for mobile devices

This paper introduces another new lightweight Super Resolution network, SR-ABR Net, that can be deployed on mobile devices to upgrade low-resolution/low-quality videos while running in real-time. It also introduces a novel ABR algorithm, WISH-SR, that leverages Super Resolution networks at the client to improve the video quality depending on the client’s context. By taking into account device properties, video characteristics, and user preferences, it can significantly boost the visual quality of the delivered content while reducing both bandwidth consumption and the number of stalling events. You can learn more here or watch the video presentation from Mile High Video.

System architecture for proposed Super Resolution based adaptive bitrate algorithm

Less buffering and higher QoE with applied machine learning

The next group of research papers involve applying machine learning at different stages of the video workflow to improve QoE for the end user.

FaRes-ML: Fast multi-resolution, multi-rate encoding

Fast multi-rate encoding approaches aim to address the challenge of encoding multiple representations from a single video by re-using information from already encoded representations. In this paper, a convolutional neural network is used to speed up both multi-rate and multi-resolution encoding for ABR streaming. Experimental results show that the proposed method for multi-rate encoding can reduce the overall encoding time by 15.08% and parallel encoding time by 41.26%. Simultaneously, the proposed method for multi-resolution encoding can reduce the encoding time by 46.27% for the overall encoding and 27.71% for the parallel encoding on average. You can learn more here.

FaRes-ML flowchart

ECAS-ML: Edge assisted adaptive bitrate switching

As video streaming traffic in mobile networks increases, utilizing edge computing support is a key way to improve the content delivery process. At an edge node, we can deploy ABR algorithms with a better understanding of network behavior and access to radio and player metrics. This project introduces ECAS-ML, Edge Assisted Adaptation Scheme for HTTP Adaptive Streaming with Machine Learning. It uses machine learning techniques to analyze radio throughput traces and balance the tradeoffs between bitrate, segment switches and stalls to deliver a higher QoE, outperforming other client-based and edge-based ABR algorithms. You can learn more here.

ECAS-ML system architecture

Challenges ahead

The road from research to practical implementation is not always quick or direct or even possible in some cases, but fortunately that’s an area where Bitmovin and ATHENA have been working together closely for several years now. Going back to our initial implementation of HEVC encoding in the cloud, we’ve had success using small trials and experiments with Bitmovin’s clients and partners to provide real-world feedback for the ATHENA team, informing the next round of research and experimentation toward creating viable, game-changing solutions. This innovation-to-product cycle is already in progress for the research mentioned above, with promising early quality and efficiency improvements.  

Many of the advancements we’re seeing in AI are the result of aggregating lots and lots of processing power, which in turn means lots of energy use. Even with processors becoming more energy efficient, the sheer volume involved in large-scale AI applications means energy consumption can be a concern, especially with increasing focus on sustainability and energy efficiency.  From that perspective, for some use cases (like Super Resolution) it will be worth considering the tradeoffs between doing server-side upscaling during the encoding process and client-side upscaling, where every viewing device will consume more power.  

Learn more

Want to learn more about Bitmovin’s AI video research and development? Check out the links below. 

Analytics Session Interpreter webinar

AI-powered video Super Resolution and Remastering

Super Resolution blog series

Super Resolution with Machine Learning webinar

Athena research

MPEG Meeting Updates 

GAIA project blogs

AI Video Glossary

Machine Learning – Machine learning is a subfield of artificial intelligence that deals with developing algorithms and models capable of learning and making predictions or decisions based on data. It involves training these algorithms on large datasets to recognize patterns and extract valuable insights. Machine learning has diverse applications, such as image and speech recognition, natural language processing, and predictive analytics.

Neural Networks – Neural networks are sophisticated algorithms designed to replicate the behavior of the human brain. They are composed of layers of artificial neurons that analyze and process data. In the context of video streaming, neural networks can be leveraged to optimize video quality, enhance compression techniques, and improve video annotation and content recommendation systems, resulting in a more immersive and personalized streaming experience for users.

Super Resolution – Super Resolution upscaling is an advanced technique used to enhance the quality and resolution of images or videos. It involves using complex algorithms and computations to analyze the available data and generate additional details. By doing this, the image or video appears sharper, clearer, and more detailed, creating a better viewing experience, especially on 4K and larger displays. 

Graphics Processing Unit (GPU) – A GPU is a specialized hardware component that focuses on handling and accelerating graphics-related computations. Unlike the central processing unit (CPU), which handles general-purpose tasks, the GPU is specifically designed for parallel processing and rendering complex graphics, such as images and videos. GPUs are widely used in various industries, including gaming, visual effects, scientific research, and artificial intelligence, due to their immense computational power.

Video Understanding – Video understanding is the ability to analyze and comprehend the information present in a video. It involves breaking down the visual content, movements, and actions within the video to make sense of what is happening.

Efficiently Predicting Quality with ATHENA’s Video Complexity Analyzer (VCA) project
https://bitmovin.com/blog/video-complexity-analyzer-vca/ | Fri, 25 Mar 2022

For online prediction in live streaming applications, selecting low-complexity features is critical to ensure low-latency video streaming without disruptions. For each frame, video, or video segment, two features are determined: the average texture energy and the average gradient of the texture energy. A DCT-based energy function is introduced to determine the block-wise texture of each frame, and the spatial and temporal features of the video or video segment are derived from this energy function. The Video Complexity Analyzer (VCA) project was launched in 2022 with the aim of providing the most efficient, highest-performance spatial and temporal complexity prediction for each frame, video, or video segment, which can be used for a variety of applications such as shot/scene detection and online per-title encoding.
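
As a rough illustration of a DCT-based texture measure, the C++ sketch below sums the absolute AC coefficients of an 8x8 luma block; averaging this value over all blocks of a frame gives a spatial complexity figure, and the frame-to-frame difference of that figure gives a temporal one. VCA's actual energy function uses its own block size and coefficient weighting, so treat this as an illustrative approximation rather than VCA's implementation.

// Illustrative only: sum of absolute AC DCT-II coefficients of an 8x8 block.
// VCA's real energy function uses its own coefficient weighting.
#include <cmath>
#include <cstdint>

double blockTextureEnergy(const std::uint8_t* block, int stride) {
    const int N = 8;
    const double PI = 3.14159265358979323846;
    double energy = 0.0;
    for (int u = 0; u < N; ++u) {
        for (int v = 0; v < N; ++v) {
            if (u == 0 && v == 0) continue;  // skip DC: it reflects brightness, not texture
            double coeff = 0.0;
            for (int x = 0; x < N; ++x)
                for (int y = 0; y < N; ++y)
                    coeff += block[x * stride + y] *
                             std::cos((2 * x + 1) * u * PI / (2.0 * N)) *
                             std::cos((2 * y + 1) * v * PI / (2.0 * N));
            double cu = (u == 0) ? std::sqrt(1.0 / N) : std::sqrt(2.0 / N);
            double cv = (v == 0) ? std::sqrt(1.0 / N) : std::sqrt(2.0 / N);
            energy += std::fabs(cu * cv * coeff);  // accumulate AC energy
        }
    }
    return energy;  // average over all blocks of a frame to get its spatial complexity
}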

What is the Video Complexity Analyzer

The primary objective of the Video Complexity Analyzer is to become the best spatial and temporal complexity predictor for every frame, video segment, and video, which aids in predicting encoding parameters for applications like scene-cut detection and online per-title encoding. VCA leverages x86 SIMD and multi-threading optimizations for effective performance. While VCA is primarily designed as a video complexity analyzer library, a command-line executable is provided to facilitate testing and development. We expect VCA to be utilized in many leading video encoding solutions in the coming years.

VCA is available as an open-source library, published under the GPLv3 license. For more details, please visit the software’s online documentation here. The source code can be found here.
 

Heatmap of spatial complexity (E)

Heatmap of temporal complexity (h)

A performance comparison (frames analyzed per second) of VCA (with different levels of threading enabled) against Spatial Information/Temporal Information (SITI) [Github] is shown below:

Video Complexity Analyzer vs. Spatial Information/Temporal Information

How to Build a Video Complexity Analyzer

The software is tested mostly on Linux and Windows. It requires some prerequisite software to be installed before compiling. The steps to build the project on Linux and Windows are explained below.

Prerequisites

  1. CMake version 3.13 or higher.
  2. Git.
  3. C++ compiler with C++11 support
  4. NASM assembly compiler (for x86 SIMD support)

The following C++11 compilers have been known to work:

  • Visual Studio 2015 or later
  • GCC 4.8 or later
  • Clang 3.3 or later

Execute Build

The following commands will check out the project source code and create a directory called ‘build’ where the compiler output will be placed. CMake is then used for generating build files and compiling the VCA binaries.

$ git clone https://github.com/cd-athena/VCA.git
$ cd VCA
$ mkdir build
$ cd build
$ cmake ../
$ cmake --build .

This will create VCA binaries in the VCA/build/source/apps/ folder.

Command-Line Options

General

Displaying Help Text:

--help, -h

Displaying version details:

--version, -v

Logging/Statistic Options

--complexity-csv <filename>

Write the spatial (E) and temporal complexity (h), epsilon, brightness (L) statistics to a Comma Separated Values log file. Creates the file if it doesn’t already exist. The following statistics are available:

  • POC Picture Order Count – The display order of the frames
  • E Spatial complexity of the frame
  • h Temporal complexity of the frame
  • epsilon Gradient of the temporal complexity of the frame
  • L Brightness of the frame

Unless the --no-chroma option is used, the following chroma statistics are also available:

  • avgU Average U chroma component of the frame
  • energyU Average U chroma texture of the frame
  • avgV Average V chroma component of the frame
  • energyV Average V chroma texture of the frame

--shot-csv <filename>

Write the shot ID and the first POC of every shot to a Comma Separated Values log file. Creates the file if it doesn’t already exist.

--yuvview-stats <filename>

Write the per block results (L, E, h) to a stats file that can be visualized using YUView.

Performance Options

--no-chroma

Disable analysis of chroma planes (which is enabled by default).

--no-simd

The Video Complexity Analyzer will use all detected CPU SIMD architectures by default. This will disable that detection.

--threads <integer>

Specify the number of threads to use. Default: 0 (autodetect).

Input/Output

--input <filename>

Input filename. Raw YUV or Y4M supported. Use stdin to read from standard input. For example, piping input from ffmpeg works like this:

ffmpeg.exe -i Sintel.2010.1080p.mkv -f yuv4mpegpipe - | vca.exe --input stdin

--y4m

Parse input stream as YUV4MPEG2 regardless of file extension. Primarily intended for use with stdin. This option is implied if the input filename has a “.y4m” extension.

--input-depth <integer>

Bit-depth of input file or stream. Any value between 8 and 16. Default is 8. For Y4M files, this is read from the Y4M header.

--input-res <wxh>

Source picture size [w x h]. For Y4M files, this is read from the Y4M header.

--input-csp <integer or string>

Chroma Subsampling. 4:0:0(monochrome), 4:2:0, 4:2:2, and 4:4:4 are supported. For Y4M files, this is read from the Y4M header.

--input-fps <double>

The framerate of the input. For Y4M files, this is read from the Y4M header.

--skip <integer>

Number of frames to skip at start of input file. Default 0.

--frames, -f <integer>

Number of frames of input sequence to be analyzed. Default 0 (all).

Analyzer Configuration

--block-size <8/16/32>

Size of the non-overlapping blocks used to determine the E, h features. Default: 32.

--min-thresh <double>

Minimum threshold of epsilon for shot detection.

--max-thresh <double>

Maximum threshold of epsilon for shot detection.

Using the VCA API

VCA is written primarily in C++ and x86 assembly language. The API is wholly defined within vcaLib.h in the source/lib/ folder of the source tree. All of the functions, variables, and enumerations meant to be used by the end-user are present in this header.

vca_analyzer_open(vca_param param)

Create a new analyzer handle; all parameters from vca_param are copied. The returned pointer is then passed to all of the functions pertaining to this analyzer. Since vca_param is copied internally, the user may release their copy after allocating the analyzer. Changes made to their copy of the param structure have no effect on the analyzer after it has been allocated.

vca_result vca_analyzer_push(vca_analyzer *enc, vca_frame *frame)

Push a frame to the analyzer and start the analysis. Note that only the pointers will be copied, but no ownership of the memory is transferred to the library. The caller must make sure the pointers are valid until the frame has been analyzed. Once the results for a frame have been pulled, the library will no longer use the pointers. This call may block until there is a slot available to work on. The number of frames that will be processed in parallel can be set using nrFrameThreads.

bool vca_result_available(vca_analyzer *enc)

Check if a result is available to pull.

vca_result vca_analyzer_pull_frame_result(vca_analyzer *enc, vca_frame_results *result)

Pull a result from the analyzer. This may block until a result is available. Use vca_result_available() if you want to only check if a result is ready.

void vca_analyzer_close(vca_analyzer *enc)

Finally, the analyzer must be closed in order to free all of its resources. An analyzer that has been flushed cannot be restarted and reused. Once vca_analyzer_close() has been called, the analyzer handle must be discarded.
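
Putting these calls together, a minimal usage sketch could look like the following. It uses only the functions documented above; how vca_param and vca_frame are populated depends on the definitions in vcaLib.h, so the surrounding scaffolding is an assumption rather than a verbatim example from the repository.

#include "vcaLib.h"

// Analyze a sequence of frames and pull one result per pushed frame.
void analyzeSequence(vca_param params, vca_frame** frames, int frameCount) {
    vca_analyzer* analyzer = vca_analyzer_open(params);  // params are copied internally
    int pulled = 0;
    for (int i = 0; i < frameCount; ++i) {
        vca_analyzer_push(analyzer, frames[i]);  // frame memory must stay valid
                                                 // until its result is pulled
        while (vca_result_available(analyzer)) { // drain results as they become ready
            vca_frame_results result;
            vca_analyzer_pull_frame_result(analyzer, &result);
            ++pulled;
            // consume per-frame statistics (E, h, epsilon, L) from `result` here
        }
    }
    while (pulled < frameCount) {                // blocking pulls for the remaining frames
        vca_frame_results result;
        vca_analyzer_pull_frame_result(analyzer, &result);
        ++pulled;
    }
    vca_analyzer_close(analyzer);                // free analyzer resources
}
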
Try out the Video Complexity Analyzer for yourself, along with other exciting innovations, at https://athena.itec.aau.at/ and bitmovin.com.

Efficient Multi-Encoding Algorithms for HTTP Adaptive Bitrate Streaming
https://bitmovin.com/blog/multi-encoding-has/ | Thu, 23 Dec 2021

The Future of HTTP Adaptive Streaming (HAS)

According to multiple reports, video viewing will account for as much as 82% of all internet traffic by the end of 2022, and the popularity of HTTP Adaptive Streaming (HAS) is steadily increasing to efficiently support that demand. Furthermore, improvements in video characteristics such as frame rate, resolution, and bit depth raise the need to develop a large-scale, highly efficient video encoding environment. This is even more crucial for DASH-based content provisioning, as it requires encoding multiple representations of the same video content. Each video is encoded at multiple bitrates and spatial resolutions (i.e., representations) to adapt to the heterogeneity of network conditions, device characteristics, and end-user preferences, as shown in Figure 1. However, encoding the same content at multiple representations requires substantial resources and costs for content providers.

Figure 1: A systematic representation of encoding scheme in HAS

As seen in Figure 2, as resolution doubles, encoding time complexity also doubles! To address this challenge, we must employ multi-encoding schemes to accelerate the encoding process of multiple representations without impacting quality. This is achieved by exploiting a high correlation of encoder analysis decisions (like block partitioning and prediction mode decisions) across multiple representations.

Figure 2: Relative time complexity of encoding representations in x265 HEVC encoding.

What is Multi-Encoding?

To encode multiple renditions of the same video at multiple bitrates and resolutions, we reuse encoder analysis information across renditions, because there is a strong correlation of encoder decisions across bitrate and resolution renditions. Sharing analysis information across multiple bitrates within a resolution is termed “multi-rate encoding”, while sharing it across multiple resolutions is termed “multi-resolution encoding”. “Multi-encoding” is a generalized term that combines both multi-rate and multi-resolution encoding schemes.

Proposed Heuristics

To aid the encoding process of the dependent renditions in HEVC, the ATHENA Labs research team proposes two groups of new encoder decision heuristics, Prediction Mode and Motion Estimation; a simplified code sketch of the Prediction Mode rules follows the two lists below.

Prediction Mode Heuristics

Prediction Mode heuristics apply when the selected Coding Unit (CU) size for the dependent renditions is the same as in the reference representation. They can be broken down into the following rules:

  1. If the SKIP mode was chosen in the highest bitrate rendition, rate-distortion optimization is evaluated for only MERGE/SKIP modes.
  2. If the 2Nx2N mode was chosen in the highest bitrate rendition, RDO is skipped for AMP modes.
  3. If the inter-prediction mode was chosen in the highest bitrate rendition, RDO is skipped for intra-prediction modes.
  4. If the intra-prediction mode was chosen for the highest and lowest bitrate rendition, RDO is evaluated for only intra-prediction modes in the intermediate renditions.

Motion Estimation Heuristics

Motion Estimation heuristics apply when the CU size and Prediction Unit (PU) selected for the dependent representations are the same as in the reference representation:

  1. The same reference frame is forced as that of the highest bitrate rendition.
  2. The Motion Vector Predictor (MVP) is set to be the Motion Vector (MV) of the highest bitrate rendition.
  3. The motion search range is decreased to a smaller window if the MVs of the highest and the lowest bitrate renditions are close to each other.
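
As a simplified illustration of the Prediction Mode heuristics above, the following C++ sketch maps the modes chosen in the highest and lowest bitrate renditions to the set of candidate modes that rate-distortion optimization (RDO) evaluates for a dependent rendition. The enum and function names are illustrative, not x265 internals.

#include <set>

// Candidate prediction modes for a CU in a dependent rendition, derived from
// the modes chosen for the co-located CU in the highest and lowest bitrate
// renditions (illustrative types, not the encoder's internal ones).
enum class Mode { MergeSkip, Inter2Nx2N, InterAMP, Intra };

std::set<Mode> candidateModes(Mode highestBitrateMode, Mode lowestBitrateMode) {
    // Rule 1: SKIP chosen in the highest bitrate rendition -> evaluate only MERGE/SKIP.
    if (highestBitrateMode == Mode::MergeSkip)
        return {Mode::MergeSkip};

    std::set<Mode> candidates = {Mode::MergeSkip, Mode::Inter2Nx2N,
                                 Mode::InterAMP, Mode::Intra};

    // Rule 2: 2Nx2N chosen in the highest bitrate rendition -> skip AMP partitions.
    if (highestBitrateMode == Mode::Inter2Nx2N)
        candidates.erase(Mode::InterAMP);

    // Rule 3: inter prediction chosen in the highest bitrate rendition -> skip intra RDO.
    if (highestBitrateMode != Mode::Intra)
        candidates.erase(Mode::Intra);

    // Rule 4: intra chosen in both the highest and lowest bitrate renditions
    // -> evaluate only intra modes in the intermediate renditions.
    if (highestBitrateMode == Mode::Intra && lowestBitrateMode == Mode::Intra)
        candidates = {Mode::Intra};

    return candidates;
}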

Based on the above-mentioned heuristics, two multi-encoding schemes are proposed.

Proposed Multi-encoding Schemes

In our first proposed multi-encoding approach we perform the following steps:

  1. The first resolution tier (i.e., 540p in our example) is encoded using the combination of double-bound for CU depth estimation (cf. previous blog post), Prediction Mode Heuristics, and Motion Estimation Heuristics.
  2. The CU depth from the highest bit rate representation of the first resolution tier (i.e., 540p) is shared with the highest bit rate representation of the next resolution tier (i.e., 1080p in our example). In particular, the information is used as a lower bound, i.e., the CU is forced to split if the current encode depth is lower than the reference encode CU depth. The remaining bitrate representations of this resolution tier are encoded using the multi-rate scheme as used in Step 1.
  3. Repeat Step 2 for the remaining resolution tiers in ascending order with respect to the resolution until no more resolution tiers are left (i.e., only for 2160p in our example).

Figure 3: An example of the first proposed multi-encoding scheme.

The second proposed multi-encoding scheme is a minor variation of the first, aiming to extend the double-bound CU depth estimation scheme across resolution tiers. It is performed in the following steps, with a minimal sketch of the resulting CU depth bounds after the list:

  1. The first resolution tier (i.e., 540p in our example) is encoded using the combination of double-bound for CU depth estimation, Prediction Mode Heuristics, and Motion Estimation Heuristics.
  2. The CU depth from the lowest bit rate representation of the first resolution tier (i.e., 540p) is shared with the highest bit rate representation of the next resolution tier (i.e., 1080p in our example). In particular, the information is used as a lower bound, i.e., the CU is forced to split if the current encode depth is lower than the reference encode CU depth.
  3. The scaled CU depth from the lowest bit rate representation of the previous resolution tier (i.e., 540p) and CU depth information from the highest bit rate representation of the current resolution tier are shared with the lowest bit rate representation of the current resolution tier (i.e., 1080p in our example) and are used as the lower bound and upper bound respectively for CU depth search.
  4. The remaining bit rate representations of this resolution tier (i.e., 1080p) are encoded using the multi-rate scheme as used in Step 1.
  5. Repeat Step 2 for the remaining resolution tiers in ascending order with respect to the resolution until no more resolution tiers are left (i.e., only for 2160p in our example).
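
The following C++ sketch illustrates the double-bound idea from Steps 2 and 3: the CU depth search of the current encode is clamped between a lower bound taken from one reference encode and an upper bound taken from another, so rate-distortion optimization only evaluates depths inside that range. The names and per-block depth representation are assumptions for illustration.

#include <algorithm>
#include <cstddef>
#include <vector>

struct DepthBounds { int minDepth; int maxDepth; };

// lowerRef: scaled CU depths from the lowest bitrate rendition of the previous
//           resolution tier (lower bound: force a split below this depth)
// upperRef: CU depths from the highest bitrate rendition of the current
//           resolution tier (upper bound: do not search deeper than this)
std::vector<DepthBounds> buildDepthBounds(const std::vector<int>& lowerRef,
                                          const std::vector<int>& upperRef,
                                          int maxCuDepth) {
    std::vector<DepthBounds> bounds(lowerRef.size());
    for (std::size_t i = 0; i < lowerRef.size(); ++i) {
        int lo = std::clamp(lowerRef[i], 0, maxCuDepth);
        int hi = std::clamp(upperRef[i], 0, maxCuDepth);
        if (hi < lo) hi = lo;          // degenerate case: collapse to a single depth
        bounds[i] = {lo, hi};          // RDO only evaluates depths within [lo, hi]
    }
    return bounds;
}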

Figure 4: An example of the second proposed multi-encoding scheme.

Results

It is observed that the state-of-the-art scheme yields the highest average encoding time saving, i.e., 80.05%, but it comes with bitrate increases of 13.53% and 9.59% to maintain the same PSNR and VMAF, respectively, compared to the stand-alone encodings. The first proposed multi-encoding scheme has the lowest increase in bitrate to maintain the same PSNR and VMAF (2.32% and 1.55%, respectively) compared to the stand-alone encodings. The second proposed multi-encoding scheme improves the encoding time savings of the first by 11% with a negligible increase in bitrate to maintain the same PSNR and VMAF. These results are shown in Table 1, where Delta T represents the overall encoding time savings compared to the stand-alone encodings, and BDR_P and BDR_V refer to the average difference in bitrate with respect to stand-alone encodings to maintain the same PSNR and VMAF, respectively.

Table 1: Results of the proposed multi-encoding schemes

View the full multi-encoding research paper from ATHENA here.
If you liked this article, check out some of our other great ATHENA content at the following links:
