Hadi Amirpour – Bitmovin (https://bitmovin.com)

Efficiently Predicting Quality with ATHENA’s Video Complexity Analyzer (VCA) project
https://bitmovin.com/blog/video-complexity-analyzer-vca/ (25 Mar 2022)

For online prediction in live streaming applications, selecting low-complexity features is critical to ensure low-latency video streaming without disruptions. For each frame, video, or video segment, two features are determined: the average texture energy and the average gradient of the texture energy. A DCT-based energy function is introduced to determine the block-wise texture of each frame, and the spatial and temporal features of the video or video segment are derived from this energy function. The Video Complexity Analyzer (VCA) project was launched in 2022 with the aim of providing the most efficient, highest-performance spatial and temporal complexity prediction for each frame, video, or video segment, which can be used for a variety of applications such as shot/scene detection and online per-title encoding.
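
To give a rough sense of how such DCT-based features can be formed (this is an illustrative sketch of the general idea, not the exact formulation used in VCA; see the project documentation for the precise definitions), the block-wise texture energy and the derived spatial and temporal features can be written as:

H_k = \sum_{(i,j) \ne (0,0)} w_{i,j} \, |\mathrm{DCT}_k(i,j)|

E = \frac{1}{C \cdot w^2} \sum_{k=1}^{C} H_k, \qquad h = \frac{1}{C \cdot w^2} \sum_{k=1}^{C} |H_k(t) - H_k(t-1)|

where w is the block size, C is the number of blocks per frame, DCT_k(i,j) is the (i,j)-th DCT coefficient of block k, H_k(t) is the energy of block k in frame t, and w_{i,j} is a frequency-dependent weighting (the exact weighting used by VCA is an assumption here). E is then the average texture energy of a frame and h the average temporal gradient of that energy.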

What is the Video Complexity Analyzer

The primary objective of the Video Complexity Analyzer is to become the best spatial and temporal complexity predictor for every frame, video segment, and video, aiding the prediction of encoding parameters for applications like scene-cut detection and online per-title encoding. VCA leverages x86 SIMD and multi-threading optimizations for effective performance. While VCA is primarily designed as a video complexity analyzer library, a command-line executable is provided to facilitate testing and development. We expect VCA to be utilized in many leading video encoding solutions in the coming years.

VCA is available as an open-source library, published under the GPLv3 license. For more details, please refer to the online software documentation; the source code is available on GitHub at https://github.com/cd-athena/VCA.
 

Heatmap of spatial complexity (E)

Heatmap of temporal complexity (h)

A performance comparison (frames analyzed per second) of VCA, with different levels of threading enabled, against Spatial Information/Temporal Information (SITI) [GitHub] is shown below.

Video Complexity Analyzer (VCA) vs. Spatial Information/Temporal Information (SITI)

How to Build a Video Complexity Analyzer

The software has been tested mainly on Linux and Windows. It requires some prerequisite software to be installed before compiling. The steps to build the project on Linux and Windows are explained below.

Prerequisites

  1. CMake version 3.13 or higher.
  2. Git.
  3. C++ compiler with C++11 support.
  4. NASM assembly compiler (for x86 SIMD support).

The following C++11 compilers have been known to work:

  • Visual Studio 2015 or later
  • GCC 4.8 or later
  • Clang 3.3 or later

Execute Build

The following commands will check out the project source code and create a directory called ‘build’ where the compiler output will be placed. CMake is then used for generating build files and compiling the VCA binaries.

$ git clone https://github.com/cd-athena/VCA.git
$ cd VCA
$ mkdir build
$ cd build
$ cmake ../
$ cmake --build .

This will create VCA binaries in the VCA/build/source/apps/ folder.

Command-Line Options

General

Displaying Help Text:

--help, -h

Displaying version details:

--version, -v

Logging/Statistic Options

--complexity-csv <filename>

Write the spatial complexity (E), temporal complexity (h), epsilon, and brightness (L) statistics to a Comma Separated Values log file. Creates the file if it doesn’t already exist. The following statistics are available:

  • POC: Picture Order Count, the display order of the frames
  • E: Spatial complexity of the frame
  • h: Temporal complexity of the frame
  • epsilon: Gradient of the temporal complexity of the frame
  • L: Brightness of the frame

Unless the option --no-chroma is used, the following chroma statistics are also available:

  • avgU: Average U chroma component of the frame
  • energyU: Average U chroma texture of the frame
  • avgV: Average V chroma component of the frame
  • energyV: Average V chroma texture of the frame

--shot-csv <filename>

Write the shot ID and the first POC of every shot to a Comma Separated Values log file. Creates the file if it doesn’t already exist.

--yuvview-stats <filename>

Write the per block results (L, E, h) to a stats file that can be visualized using YUView.

Performance Options

--no-chroma

Disable analysis of chroma planes (which is enabled by default).

--no-simd

The Video Complexity Analyzer will use all detected CPU SIMD architectures by default. This will disable that detection.

--threads <integer>

Specify the number of threads to use. Default: 0 (autodetect).

Input/Output

--input <filename>

Input filename. Raw YUV or Y4M supported. Use stdin to read from standard input. For example, piping input from ffmpeg works like this:

ffmpeg.exe -i Sintel.2010.1080p.mkv -f yuv4mpegpipe - | vca.exe --input stdin

--y4m

Parse the input stream as YUV4MPEG2 regardless of file extension. Primarily intended for use with stdin. This option is implied if the input filename has a “.y4m” extension.

--input-depth <integer>

Bit-depth of input file or stream. Any value between 8 and 16. Default is 8. For Y4M files, this is read from the Y4M header.

--input-res <wxh>

Source picture size [w x h]. For Y4M files, this is read from the Y4M header.

--input-csp <integer or string>

Chroma subsampling. 4:0:0 (monochrome), 4:2:0, 4:2:2, and 4:4:4 are supported. For Y4M files, this is read from the Y4M header.

--input-fps <double>

The framerate of the input. For Y4M files, this is read from the Y4M header.

--skip <integer>

Number of frames to skip at start of input file. Default 0.

--frames, -f <integer>

Number of frames of input sequence to be analyzed. Default 0 (all).

Analyzer Configuration

--block-size <8/16/32>

Size of the non-overlapping blocks used to determine the E and h features. Default: 32.

--min-thresh <double>

Minimum threshold of epsilon for shot detection.

--max-thresh <double>

Maximum threshold of epsilon for shot detection.
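
Putting several of the options above together (an illustrative invocation only – the file names are placeholders, and the Linux binary is assumed to be named vca), a typical analysis run that pipes a video through ffmpeg and writes per-frame complexity and shot statistics could look like this:

ffmpeg -i input.mkv -f yuv4mpegpipe - | vca --input stdin --y4m --complexity-csv complexity.csv --shot-csv shots.csv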

Using the VCA API

VCA is written primarily in C++ and x86 assembly language. The public API is wholly defined within vcaLib.h in the source/lib/ folder of the source tree. All functions, variables, and enumerations meant to be used by the end user are declared in this header.

vca_analyzer_open(vca_param param)

Create a new analyzer handle; all parameters from vca_param are copied. The returned pointer is then passed to all of the functions pertaining to this analyzer. Since vca_param is copied internally, the user may release their copy after allocating the analyzer. Changes made to their copy of the param structure have no effect on the analyzer after it has been allocated.

vca_result vca_analyzer_push(vca_analyzer *enc, vca_frame *frame)

Push a frame to the analyzer and start the analysis. Note that only the pointers are copied; no ownership of the memory is transferred to the library. The caller must make sure that the pointers remain valid until the frame has been analyzed. Once the results for a frame have been pulled, the library will not use its pointers anymore. This call may block until a slot is available to work on. The number of frames that are processed in parallel can be set using nrFrameThreads.

bool vca_result_available(vca_analyzer *enc)

Check if a result is available to pull.

vca_result vca_analyzer_pull_frame_result(vca_analyzer *enc, vca_frame_results *result)

Pull a result from the analyzer. This may block until a result is available. Use vca_result_available() if you only want to check whether a result is ready without blocking.

void vca_analyzer_close(vca_analyzer *enc)

Finally, the analyzer must be closed in order to free all of its resources. An analyzer that has been flushed cannot be restarted and reused. Once vca_analyzer_close() has been called, the analyzer handle must be discarded.
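
To tie these calls together, the following is a hedged C++ sketch of a typical push/pull loop (only the function signatures above are taken from vcaLib.h; the surrounding frame handling and the contents of vca_param, vca_frame, and vca_frame_results are placeholders to be filled in from the actual header):

#include <vector>
#include "vcaLib.h"

// Sketch: analyze a sequence of frames and collect per-frame results.
void analyzeSequence(vca_param param, const std::vector<vca_frame *> &frames)
{
    vca_analyzer *analyzer = vca_analyzer_open(param); // param is copied internally

    for (vca_frame *frame : frames)
    {
        // May block until an analysis slot is free; the frame memory must stay
        // valid until its result has been pulled.
        vca_analyzer_push(analyzer, frame);

        // Pull any results that are already finished (non-blocking check first).
        while (vca_result_available(analyzer))
        {
            vca_frame_results result;
            vca_analyzer_pull_frame_result(analyzer, &result);
            // result now holds the per-frame statistics (E, h, ...).
        }
    }

    // Note: a complete application would also wait for frames still being
    // analyzed before closing; that draining logic is omitted here for brevity.
    vca_analyzer_close(analyzer);
}
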
Try out the Video Complexity Analyzer for yourself, amongst other exciting innovations, at https://athena.itec.aau.at/ and bitmovin.com.

Efficient Multi-Encoding Algorithms for HTTP Adaptive Bitrate Streaming
https://bitmovin.com/blog/multi-encoding-has/ (23 Dec 2021)

The Future of HTTP Adaptive Streaming (HAS)

According to multiple reports, video viewing will account for as much as 82% of all internet traffic by the end of 2022, so the popularity of HTTP Adaptive Streaming (HAS) is steadily increasing to efficiently support modern demand. Furthermore, improvements in video characteristics such as frame rate, resolution, and bit depth raise the need for a large-scale, highly efficient video encoding environment. This is even more crucial for DASH-based content provisioning, as it requires encoding multiple representations of the same video content. Each video is encoded at multiple bitrates and spatial resolutions (i.e., representations) to adapt to the heterogeneity of network conditions, device characteristics, and end-user preferences, as shown in Figure 1. However, encoding the same content at multiple representations requires substantial resources and costs for content providers.

Figure 1: A systematic representation of the encoding scheme in HAS

As seen in Figure 2, as resolution doubles, encoding time complexity also doubles! To address this challenge, we must employ multi-encoding schemes to accelerate the encoding process of multiple representations without impacting quality. This is achieved by exploiting a high correlation of encoder analysis decisions (like block partitioning and prediction mode decisions) across multiple representations.

Figure 2: Relative time complexity of encoding representations in x265 HEVC encoding.

What is Multi-Encoding?

To encode multiple renditions of the same video at multiple bitrates and resolutions, we reuse encoder analysis information across renditions. This is possible because there is a strong correlation of encoder decisions across the different bitrate and resolution renditions. The scheme of sharing analysis information across multiple bitrates within a resolution is termed “multi-rate encoding”, while sharing the information across multiple resolutions is termed “multi-resolution encoding”. “Multi-encoding” is a generalized term that combines both multi-rate and multi-resolution encoding schemes.

Proposed Heuristics:

To aid the encoding process of the dependent renditions in HEVC, the ATHENA Labs research team proposes a set of new encoder decision heuristics in two categories, Prediction Mode and Motion Estimation:
Prediction Mode Heuristics:
Prediction Mode heuristics are applied when the Coding Unit (CU) size selected for the dependent renditions is the same as in the reference representation – they can be further broken down into the following modes (a simplified sketch of these checks follows the two lists below):

  1. If the SKIP mode was chosen in the highest bitrate rendition, rate-distortion optimization (RDO) is evaluated for only MERGE/SKIP modes.
  2. If the 2Nx2N mode was chosen in the highest bitrate rendition, RDO is skipped for asymmetric motion partitioning (AMP) modes.
  3. If the inter-prediction mode was chosen in the highest bitrate rendition, RDO is skipped for intra-prediction modes.
  4. If the intra-prediction mode was chosen for the highest and lowest bitrate rendition, RDO is evaluated for only intra-prediction modes in the intermediate renditions.

Motion Estimation Heuristics:
Motion Estimation heuristics are applied when the CU size and Prediction Unit (PU) selected for the dependent representations are the same as in the reference representation:

  1. The same reference frame is forced as that of the highest bitrate rendition.
  2. The Motion Vector Predictor (MVP) is set to be the Motion Vector (MV) of the highest bitrate rendition.
  3. The motion search range is decreased to a smaller window if the MVs of the highest and the lowest bitrate renditions are close to each other.
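
To make the prediction-mode heuristics above concrete, here is a simplified, self-contained C++ sketch (hypothetical types and names, not the actual x265 integration) of how the decisions of the highest- and lowest-bitrate renditions can prune the RDO candidates of an intermediate rendition; the motion-estimation heuristics restrict the reference frame, MVP, and search range in an analogous way:

#include <set>

// Modes considered for a Coding Unit (CU); simplified for illustration.
enum Mode { MODE_SKIP, MODE_MERGE, MODE_INTER_2Nx2N, MODE_INTER_AMP, MODE_INTRA };

// Decision stored for this CU in a previously encoded (reference) rendition.
struct RefDecision { Mode mode; };

std::set<Mode> candidateModes(const RefDecision &highest, const RefDecision &lowest)
{
    // Heuristic 1: SKIP chosen in the highest bitrate rendition
    //              -> evaluate RDO only for MERGE/SKIP.
    if (highest.mode == MODE_SKIP)
        return {MODE_MERGE, MODE_SKIP};

    std::set<Mode> modes = {MODE_SKIP, MODE_MERGE, MODE_INTER_2Nx2N,
                            MODE_INTER_AMP, MODE_INTRA};

    // Heuristic 2: 2Nx2N chosen -> skip RDO for asymmetric motion partitions (AMP).
    if (highest.mode == MODE_INTER_2Nx2N)
        modes.erase(MODE_INTER_AMP);

    if (highest.mode != MODE_INTRA)
        // Heuristic 3: inter chosen in the highest rendition -> skip intra RDO.
        modes.erase(MODE_INTRA);
    else if (lowest.mode == MODE_INTRA)
        // Heuristic 4: intra chosen in both the highest and lowest renditions
        //              -> evaluate only intra for the intermediate renditions.
        modes = {MODE_INTRA};

    return modes;
}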

Based on the above-mentioned heuristics, two multi-encoding schemes are proposed.

Proposed Multi-encoding Schemes

In our first proposed multi-encoding approach we perform the following steps:

  1. The first resolution tier (i.e., 540p in our example) is encoded using the combination of double-bound for CU depth estimation (cf. our previous blog post), Prediction Mode Heuristics, and Motion Estimation Heuristics.
  2. The CU depth from the highest bit rate representation of the first resolution tier (i.e., 540p) is shared with the highest bit rate representation of the next resolution tier (i.e., 1080p in our example). In particular, the information is used as a lower bound, i.e., the CU is forced to split if the current encode depth is lower than the reference encode CU depth. The remaining bitrate representations of this resolution tier are encoded using the multi-rate scheme as used in Step 1.
  3. Repeat Step 2 for the remaining resolution tiers in ascending order with respect to the resolution until no more resolution tiers are left (i.e., only for 2160p in our example).

Figure 3: An example of the first proposed multi-encoding scheme.

The second proposed multi-encoding scheme is a minor variation of the first scheme which aims to extend the double-bound for CU depth estimation scheme across resolution tiers. It is performed in the following steps:

  1. The first resolution tier (i.e., 540p in our example) is encoded using the combination of double-bound for CU depth estimation, Prediction Mode Heuristics, and Motion Estimation Heuristics.
  2. The CU depth from the lowest bit rate representation of the first resolution tier (i.e., 540p) is shared with the highest bit rate representation of the next resolution tier (i.e., 1080p in our example). In particular, the information is used as a lower bound, i.e., the CU is forced to split if the current encode depth is lower than the reference encode CU depth.
  3. The scaled CU depth from the lowest bit rate representation of the previous resolution tier (i.e., 540p) and the CU depth information from the highest bit rate representation of the current resolution tier are shared with the lowest bit rate representation of the current resolution tier (i.e., 1080p in our example) and are used as the lower bound and upper bound, respectively, for the CU depth search (see the sketch after Figure 4 below).
  4. The remaining bit rate representations of this resolution tier (i.e., 1080p) are encoded using the multi-rate scheme as used in Step 1.
  5. Repeat Step 2 for the remaining resolution tiers in ascending order with respect to the resolution until no more resolution tiers are left (i.e., only for 2160p in our example).
Figure 4: An example of the second proposed multi-encoding scheme.
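
To make the depth sharing in Steps 2 and 3 above concrete, here is a minimal sketch (hypothetical helper functions, not actual x265 code) of how a reference CU depth acts as a lower bound (forcing a split until that depth is reached) while a second reference depth can act as an upper bound on the CU depth search:

// Lower bound: the CU is forced to split while its depth is below the
// reference CU depth (taken from the previously encoded representation).
bool mustSplit(int currentDepth, int lowerBoundDepth)
{
    return currentDepth < lowerBoundDepth;
}

// Double bound (scheme 2, Step 3): a depth is evaluated by RDO only if it lies
// between the scaled depth of the lower-resolution reference (lower bound) and
// the depth of the highest bitrate representation of the current tier (upper bound).
bool evaluateDepth(int currentDepth, int lowerBoundDepth, int upperBoundDepth)
{
    return currentDepth >= lowerBoundDepth && currentDepth <= upperBoundDepth;
}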

Results

It is observed that the state-of-the-art scheme yields the highest average encoding time saving, i.e., 80.05%, but it comes with bitrate increases of 13.53% and 9.59% to maintain the same PSNR and VMAF, respectively, compared to the stand-alone encodings. The first proposed multi-encoding scheme has the lowest increase in bitrate to maintain the same PSNR and VMAF (2.32% and 1.55%, respectively) compared to the stand-alone encodings. The second proposed multi-encoding scheme improves the encoding time savings of the first proposed scheme by 11% with a negligible increase in bitrate to maintain the same PSNR and VMAF. These results are shown in Table 1, where ΔT represents the overall encoding time saving compared to the stand-alone encodings, and BDR_P and BDR_V refer to the average difference in bitrate with respect to the stand-alone encodings to maintain the same PSNR and VMAF, respectively.

Table 1: Results of the proposed multi-encoding schemes

View the full multi-encoding research paper from ATHENA here.

ATHENA Lab: Fast Multi-Resolution and Multi-Rate Encoding for HTTP Adaptive Streaming Using Machine Learning (FaRes-ML)
https://bitmovin.com/blog/multi-rate-encoding-fares-ml/ (14 Jul 2021)

The heterogeneity of devices on the Internet and the differences among users’ network conditions make it a tricky problem to design a video delivery tool that can adapt to all these differences while maximizing the quality of experience (QoE) for each user. HTTP Adaptive Streaming (HAS) is the de-facto solution for video delivery over the Internet. In HAS, multiple representations are stored for each video, with each representation having a different quality level and/or resolution. This way, HAS streaming sessions can alternate between different quality options based on the network and viewing conditions while delivering the content. However, the requirement to store multiple representations of a single video in HAS brings additional encoding challenges, since the source video needs to be encoded efficiently at multiple bitrates and resolutions. Multi-rate encoding aims to tackle this problem.
This blog post introduces our new approach to multi-rate encoding: Fast Multi-Resolution and Multi-Rate Encoding for HTTP Adaptive Streaming Using Machine Learning (FaRes-ML). But first…

What is Multi-Rate Encoding?

In multi-rate encoding, a single source video needs to be encoded at multiple bitrates and resolutions in order to provide a suitable representation for a variety of network and viewing conditions. The quality level of the encoded video is controlled by the quantization parameter (QP) in the encoder. An example multi-rate encoding scheme is given in Fig. 1.

Fig. 1: Multi-rate encoding workflow

This is a computationally expensive process due to the large data size of videos and the high complexity of video codecs. However, since all of these representations contain the same content, there is a significant amount of redundancy. Multi-rate encoding approaches exploit this redundancy to speed up the encoding process.
In multi-rate encoding, one representation is chosen as the reference representation (usually the highest [1] or the lowest quality [2] representation), and its information is used to speed up the encoding of the remaining, dependent representations. Since block partitioning is one of the most time-consuming processes in the encoding pipeline, the majority of multi-rate encoding approaches focus on speeding up this portion of the process.
In block partitioning, each frame is divided into smaller pieces called blocks to achieve more precise motion compensation. Smaller block sizes are used for motion-intense areas, while larger block sizes are used for stationary areas.
The High Efficiency Video Coding (HEVC) standard uses a Coding Tree Unit (CTU) for block partitioning. By default, each CTU covers a 64×64-pixel square region, and each CTU can be divided recursively up to three times, with the smallest block size being 8×8 pixels. Each split operation increases the depth level by 1 (i.e., depth 0 for 64×64 pixels and depth 3 for 8×8 pixels). An example of block partitioning for a frame is illustrated in Fig. 2.

Fig. 2: Block partitioning using a CTU
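
For orientation (a trivial sketch of the depth/size relation described above), each additional split halves the block dimensions, so the block width at a given depth is simply 64 shifted right by the depth:

// Depth 0 -> 64x64, depth 1 -> 32x32, depth 2 -> 16x16, depth 3 -> 8x8.
int blockWidthAtDepth(int depth)
{
    return 64 >> depth; // valid for depth 0..3 with HEVC's default 64x64 CTU
}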

Introducing the FaRes-ML

FaRes-ML uses Convolutional Neural Networks (CNNs) to predict the CTU split decisions for the dependent representations. The highest quality representation of the lowest resolution is chosen as the reference representation. The reference is selected from the lowest resolution to improve parallel encoding performance since, in parallel encoding, the highest-complexity representation bounds the overall encoding time; thus, choosing the reference from a low resolution can increase the parallel encoding performance.
The encoding process in FaRes-ML consists of three main steps:

  1. The reference representation is encoded with the HEVC reference encoder. Then, the encoding information obtained is stored to be used while encoding the dependent representations. 
  2. Once the encoding information is obtained, the pixel values from the source video at the corresponding resolution and the encoding information from the reference representation are fed into the CNN for the given quality level and resolution.
  3. The output of the CNN is the split decision for the given depth level. This decision is used to speed up the encoding of the dependent representation, as sketched below.
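
As a high-level control-flow sketch (placeholder names – CnnModel, ReferenceInfo, predictSplit, and encodeAtDepth are not the actual FaRes-ML or HM interfaces), the CNN's per-depth split decision can prune the recursive CTU depth search of a dependent representation like this:

// Placeholder types standing in for the trained network, the stored reference
// encoding information, and one square block of the current frame.
struct CnnModel {};
struct ReferenceInfo {};
struct Block { int x, y, size; };

// Stub: in FaRes-ML this is the CNN inference that takes the block's pixel
// values plus the reference encoding information and returns the split decision.
bool predictSplit(const CnnModel &, const ReferenceInfo &, const Block &, int /*depth*/)
{
    return false;
}

// Stub: encode the block at the chosen depth with the regular HEVC tools.
void encodeAtDepth(const Block &, int /*depth*/) {}

// Recursive CTU processing: instead of evaluating every depth with full RDO,
// the predicted split decision decides whether to descend further (max depth 3).
void encodeCu(const CnnModel &model, const ReferenceInfo &ref, const Block &cu, int depth)
{
    if (depth < 3 && predictSplit(model, ref, cu, depth))
    {
        int half = cu.size / 2;
        Block subs[4] = {{cu.x, cu.y, half}, {cu.x + half, cu.y, half},
                         {cu.x, cu.y + half, half}, {cu.x + half, cu.y + half, half}};
        for (const Block &sub : subs)
            encodeCu(model, ref, sub, depth + 1);
    }
    else
    {
        encodeAtDepth(cu, depth);
    }
}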

The overall encoding scheme of FaRes-ML is given in Fig. 3.

Fig. 3: FaRes-ML encoding scheme workflow

To measure the encoding performance of the FaRes-ML approach, we compared the results to the HEVC reference software (HM 16.21) and the lower bound approach [3]. FaRes-ML achieves 27.71% time savings for parallel encoding and 46.27% for the overall encoding while maintaining a minimal bitrate increase (2.05%). The resulting normalized encoding time graph is given in Fig. 4.

Fig. 4: Encoding time comparison of FaRes-ML vs. the lower bound approach vs. the HEVC reference software

Conclusion

As content resolution reaches new heights, with 4K+ resolutions becoming the norm, organizations and researchers are finding new ways to improve back-end delivery technologies to match the content to its respective device. One of the latest approaches to improving encoding speed is FaRes-ML, a machine learning-based method that handles multiple representations in different qualities and resolutions. By applying CNNs to exploit the redundant information in the multi-rate encoding pipeline, FaRes-ML speeds up overall encodings by nearly 50% in ATHENA’s early-stage experiments, with additional improvements for parallel encoding, all while maintaining a minimal bitrate increase.
Although the FaRes-ML method has been proven in lab environments for single and parallel encodes, its potential can be extended to cover even more encoding decisions (e.g., reference frame selection) to further improve encoding performance in the near future. Furthermore, extending the proposed method to recent video codecs such as Versatile Video Coding (VVC) is interesting due to the increased encoding complexity of recent video coding standards; it could significantly decrease the time it takes organizations that operate back-end encoding workflows to adopt a brand-new codec.
The team at ATHENA will work closely with Bitmovin in the coming months to determine how FaRes-ML works in real-world applications. If you’re interested in learning more about the Fast Multi-Resolution and Multi-Rate Encoding approach, you can find the full study published in the IEEE Open Journal of Signal Processing journal as an open-access article. More information about the full study can be found in the following links:


Sources

[1] D. Schroeder, A. Ilangovan, M. Reisslein, and E. Steinbach, “Efficient multi-rate video encoding for HEVC-based adaptive HTTP streaming,” IEEE Trans. Circuits Syst. Video Technol., vol. 28, no. 1, pp. 143–157, Jan. 2018.
[2] K. Goswami et al., “Adaptive multi-resolution encoding for ABR streaming,” in Proc. 25th IEEE Int. Conf. Image Process., 2018, pp. 1008–1012.
[3] H. Amirpour, E. Çetinkaya, C. Timmerer, and M. Ghanbari, “Fast multi-rate encoding for adaptive HTTP streaming,” in Proc. Data Compression Conf., 2020, pp. 358–358.

ATHENA Lab: Improving Viewer Experiences with Scalable Light Field Coding (SLFC)
https://bitmovin.com/blog/scalable-light-field-coding/ (15 Jun 2021)


Immersive Viewer Experiences with Light Field Imaging 

Light field imaging is a promising technology that will provide a more immersive viewing experience. It enables post-processing tasks like depth estimation, changing the viewport, refocusing, etc. To this end, a huge amount of data needs to be collected, processed, stored, and transmitted, which makes the compression and transmission of light field images a challenging task [1]. Unlike conventional photography, which integrates the rays from all directions into a single pixel, light field imaging captures the rays from the different directions separately, resulting in a multiview representation of the scene. An example of a multiview representation of a light field image is shown in Fig 1 below and in an interactive format here:

Fig 1. Multiview representation of a light field image. (u,v) represents the view number, and (x,y) denotes pixels inside each view. [2]
Light field image coding solutions exploit the high redundancy that exists between the views of a light field. Pseudo Video Sequence-based (PVS) solutions convert the views of a light field into a sequence of pictures and encode these pseudo videos using an advanced video encoder. This methodology exploits the dependency between views, reducing the redundancy across the multiple views and thereby improving the encoding efficiency of light field compression. In other words, PVS employs a similar method of efficiency optimization as per-title encoding, wherein similar features are identified and carried over from view to view so that recurring elements are not encoded repeatedly. However, as the technology behind PVS solutions develops further, new challenges arise for other important functionalities of light field coding, such as viewport scalability, quality scalability, viewport random access, and uniform quality distribution among viewports.
In this post, we introduce a novel light field coding solution, namely Scalable Light Field Coding (SLFC), which addresses the above-mentioned functionalities in addition to encoding efficiency.

Functionalities of Light field Coding

Aside from the baseline function of reducing redundancies by collecting and comparing images from multiple views, the complexity of light field coding is affected by four key factors:

  • Viewport scalability: Unlike conventional 2D images (and media in general), light field image coding solutions typically require all views to be encoded, transmitted, and decoded because of the high dependency between views, even when only an arbitrary single view (such as the standard 2D central view) is needed. By comparison, conventional 2D displays show only a central view, which is a significantly less immersive experience. To increase the compatibility of light field coding solutions with capturing devices, displays, network conditions, processing power, and storage capacity, viewports must therefore be grouped into different layers [3] so that they can be encoded, transmitted, decoded, and displayed one after another, which is a significantly more complex task than conventional coding.
  • Quality scalability: To increase compatibility with network conditions and processing power, light field images can be provided at two (or more) quality levels. As more bandwidth and/or processing power becomes available, the quality of the light field image can be improved by transmitting the remaining layers.
  • Viewport random access: To avoid decoding delay, high bandwidth requirement, and huge processing power while navigating between various viewports, random access (the number of views required to access a specific view) to the image views should be considered in light field image coding. 
  • Uniform quality distribution: To avoid facing quality fluctuation when navigating between viewports, light field image views should have similar qualities at each bitrate.

Introducing SLFC: Scalable Light Field Coding

To address the additional complexities that come with standard light field coding, we propose the Scalable Light Field Coding (SLFC) solution. The first functionality that SLFC addresses is viewport scalability, by dividing the views into seven layers and encoding them efficiently.

Fig 2. Seven layers of multiview encoding

In each layer, the views shown in red belong to that layer, while gray views belong to the previous layers and black views belong to the next layers. To provide compatibility with 2D displays, the first layer contains the central view. The second layer contains the four corner views. For the remaining layers, the available horizontal and vertical intermediate views are added.

Encoding the views

Each layer is encoded in three steps, defined by the horizontal and vertical relationships between the layers and views:

  • First, the central view (the first layer) is independently intra-coded; it is shown as the red central dot in Fig 2.
  • Second, the views of the second layer are encoded independently of each other, using the central view as their reference image.
  • The remaining layers are made of horizontal and vertical intermediate views of previously encoded views. For example, in layer 3, four possible horizontal and vertical intermediate views are added. In each layer (3 to 7), two views from the previously encoded layers are used to synthesize their intermediate view. SepConv [4], which was designed for video frame interpolation, is used for view synthesis. You can see an example of this process in the image below:

 

Fig 3. The right-most view in layer 3 is synthesized using the top-right and bottom-right views from layer 2.

In the example above, the right-most view of layer 3 is synthesized from the top-right and bottom-right views of layer 2. As a result, predicting from the synthesized view leaves less residual data than predicting from the top-right or bottom-right views individually. Therefore, the synthesized view is added to the reference list of the video encoder as a virtual reference frame. All in all, four reference views are used for encoding each view in layers 3 to 7: (i) the most central view, (ii, iii) the two views that are used for synthesizing the virtual reference frame, and (iv) the synthesized view.
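
As a small illustrative sketch (hypothetical structures; in practice this is handled inside the video encoder's reference picture list management), the four reference views for a view in layers 3 to 7 can be assembled as follows:

#include <vector>

// A view is identified here simply by its (u, v) position in the multiview grid.
struct View { int u, v; };

// Reference list for one view in layers 3-7: the central view, the two
// previously encoded views used for synthesis, and the synthesized view,
// which is inserted as a virtual reference frame.
std::vector<View> buildReferenceList(const View &central, const View &parentA,
                                     const View &parentB, const View &synthesized)
{
    return {central, parentA, parentB, synthesized};
}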

Experimental results of Applied SLFC 

Encoding efficiency: The encoding efficiency for the Table light field test image [5], compared to the JPEG Pleno anchor [6], WaSP [7], MuLE [8], and PSB [9], is shown in Fig. 4. The BD-Rate and BD-PSNR for the other test images against the best competitor (PSB) are given in Table 1.

Fig. 4: RD-curves for the Table test image.

 

Table 1: BD-Rate and BD-PSNR of SLFC vs. PSB

Scalability: The number of views inside each layer and the bitrate allocated to each layer at bpp = 0.75 are shown in Fig. 5.

Fig. 5: (left) The number of views inside each layer; (right) the bitrate allocated to each layer at bpp = 0.75.

Random Access: The required bitrate to access each view at bpp = 0.75 for the Table test image is shown in Fig. 6.

Fig. 6: The required bitrate to access each view at bpp = 0.75.

Quality Scalability: The synthesized view is considered quality layer 1, and utilizing the synthesized view for inter-coding results in quality layer 2.
Quality Distribution: The PSNR heatmap for the Table light field test image at bpp = 0.005 is shown in Fig. 7.

Fig. 7: PSNR heatmap plot for the Table light field test image at bpp = 0.005.

Conclusion

The study of Scalable Light Field Coding (SLFC) was undertaken to optimize “standard” light field coding by improving the applied compression. Our methodology adds multiple critical features, such as viewport scalability (how many views are delivered), quality scalability, random access, and uniform quality distribution (wherein there are very few differences in quality between different views). The results of our research show that the SLFC method improves the quality of experience (QoE) for multiview content by a significant margin. In the future, applying SLFC to video and image workflows will help create more immersive and higher-quality VR/AR experiences, conceivably allowing consumers to truly feel like they are within the environment being simulated.
Check out our full study and more at the following link here.

Sources:

[1] C. Conti, L. D. Soares, and P. Nunes, “Dense Light Field Coding: A Survey,” in IEEE Access, vol. 8, pp. 49244-49284, 2020, DOI: 10.1109/ACCESS.2020.2977767.
[2] G. Wu et al., “Light Field Image Processing: An Overview,” in IEEE Journal of Selected Topics in Signal Processing, vol. 11, no. 7, pp. 926-954, Oct. 2017, DOI: 10.1109/JSTSP.2017.2747126.
[3] Ricardo Jorge Santos Monteiro, “Scalable light field representation and coding,” 2020.
[4] S. Niklaus, L. Mai, and F. Liu, “Video Frame Interpolation via Adaptive Separable Convolution,” 2017 IEEE International Conference on Computer Vision (ICCV), Venice, 2017, pp. 261-270, DOI: 10.1109/ICCV.2017.37.
[5] Katrin Honauer, Ole Johannsen, Daniel Kondermann, and Bastian Goldluecke, “A dataset and evaluation methodology for depth estimation on 4D light fields,” in Computer Vision – ACCV 2016, Shang-Hong Lai, Vincent Lepetit, Ko Nishino, and Yoichi Sato, Eds., Cham, 2017, pp. 19–34, Springer International Publishing.
[6] F. Pereira, C. Pagliari, E. A. B. da Silva, I. Tabus, H. Amirpour, M. Bernardo, and A. Pinheiro, “JPEG Pleno light field coding common test conditions v3.2,” Doc. ISO/IEC JTC, vol. 1.
[7] P. Astola and I. Tabus, “Wasp: Hierarchical warping, merging, and sparse prediction for light field image compression,” in The 7th European Workshop on Visual Information Processing (EUVIP), Oct 2018, pp. 435–439.
[8] M. B. de Carvalho, M. P. Pereira, G. Alves, E. A. B. da Silva, C. L. Pagliari, F. Pereira, and V. Testoni, “A 4D DCT-Based lenslet light field codec,” in 2018 25th IEEE International Conference on Image Processing (ICIP), Oct 2018, pp. 435–439.
[9] L. Li, Z. Li, B. Li, D. Liu, and H. Li, “Pseudo-Sequence-Based 2-D Hierarchical Coding Structure for Light-Field Image Compression,” in 2017 Data Compression Conference (DCC), April 2017, pp. 131–140.
