Video Tech Deep-Dive: Live Low Latency Streaming Part 3 – Low-Latency HLS
https://bitmovin.com/blog/live-low-latency-hls/
Mon, 10 Aug 2020

This blog post is the final piece of our Live Low-Latency Streaming series, where we previously covered the basic principles of low-latency streaming in OTT and LL-DASH. This final post focuses on latency when using Apple’s HTTP Live Streaming (HLS) protocol and how the latency time can be reduced. This article assumes that you are already familiar with the basics of HLS and its manifest/playlist mechanics. You can view the first two posts below:

Why is latency high in HLS?

HLS in its current specification favors stream reliability over latency: higher latency is accepted in exchange for stable playback without interruptions. In Section 6.3.3, Playing the Media Playlist File, the HLS specification states that a playback client

SHOULD NOT choose a segment that starts less than three target durations from the end of the playlist file

Illustration: earliest stream segment a client may join
Honoring this requirement results in having a latency of at least 3 target durations. Given typical target durations for current HLS deployments of 10 or 6 seconds, we would end up with a latency of at least 30 or 18 seconds, which is far from low. Even if we choose to ignore the above requirement, the fact that segments are typically produced, transferred, and consumed in their entirety poses a high risk of buffer underruns and subsequent playback interruptions, as described in more detail in the first part of this blog series.
The HLS media playlist for the live stream depicted above would look something like this:
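For illustration, a sketch of such a live media playlist with a 6-second target duration (segment names and sequence numbers are made up):

#EXTM3U
#EXT-X-VERSION:3
#EXT-X-TARGETDURATION:6
#EXT-X-MEDIA-SEQUENCE:120
#EXTINF:6.000,
segment120.ts
#EXTINF:6.000,
segment121.ts
#EXTINF:6.000,
segment122.ts
#EXTINF:6.000,
segment123.ts
#EXTINF:6.000,
segment124.ts

Following the rule quoted above, a client joining this stream should start no later than the beginning of segment122.ts, i.e. at least 18 seconds behind the newest media in the playlist.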

Road to Low-Latency HLS

In 2017, Periscope, then the most popular platform for live streaming of user-generated content, investigated streaming solutions to replace their RTMP- and HLS-based hybrid approach with a more scalable one. The requirement was to offer end-to-end latency similar to RTMP but in a more cost-effective way, considering that their use case was streaming to large audiences. Periscope's solution to the high latency issue took Apple's HLS protocol, made two fundamental changes, and called the result Low-Latency HLS (LHLS):

  1. Segments are delivered using HTTP/1.1 Chunked Transfer Coding
  2. Segments are advertised in the HLS playlist before they are available

If you read our previous blog posts about low-latency streaming, you might recognize these simple concepts as the key ingredients of today's OTT-based low-latency streaming approaches, like LL-DASH. Periscope's work likely sparked and influenced subsequent developments around low-latency streaming, such as LL-DASH and a community-driven initiative, started at the end of 2018, to define latency-reducing modifications to HLS.
The core of the community proposal for LHLS was the same as the aforementioned concepts. Segments should be loaded in chunks using HTTP CTE and early availability of incomplete segments should be signaled using a new #EXT-X-PREFETCH tag in the playlist. In the example below, the client can already load and consume the currently available data of 6.ts and continue to do so as the chunks become available over time. Furthermore, the request for the segment 7.ts can be made early on to save network round-trip time, even though production had not started yet. It is also worth mentioning that the LHLS proposal preserves full backward-compatibility allowing standard HLS clients to consume such streams. This was the gist of the proposed implementation; you can find the full proposal in the hlsjs-rfcs GitHub repository.
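A sketch of such a playlist following the community proposal, assuming segments 3.ts to 5.ts are complete while 6.ts is still being produced and 7.ts is upcoming (tag placement follows the proposal; exact syntax may differ):

#EXTM3U
#EXT-X-TARGETDURATION:6
#EXT-X-MEDIA-SEQUENCE:3
#EXTINF:6.000,
3.ts
#EXTINF:6.000,
4.ts
#EXTINF:6.000,
5.ts
#EXT-X-PREFETCH:6.ts
#EXT-X-PREFETCH:7.ts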
Individuals from several companies across the media industry came together to work on this proposal, hoping that Apple, as the driving force behind HLS, would join in and work the proposal into the official HLS specification. However, things turned out very differently than expected: at its 2019 Worldwide Developers Conference, Apple presented its own, very different preliminary approach.
Despite it being (and staying) a proprietary approach, some companies, like Twitch, are successfully using it in their production systems.

Apple’s Low-Latency HLS

In this section we’ll cover the principles of Apple’s preliminary specification for Low-Latency HLS.

Generation of Partial Media Segments

While HLS content is split into individual segments, in low-latency HLS each segment further consists of parts that are independently addressable by the client. For example, a 6-second segment can consist of 30 parts of 200 ms each. Depending on the container format, such parts can be CMAF chunks or sequences of TS packets. This partitioning of segments decouples the end-to-end latency from the long segment duration and allows the client to load parts of a segment as soon as they become available. In LL-DASH, by comparison, this is achieved using HTTP CTE, and the MPD does not advertise the individual parts/chunks of a segment.
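A shortened, illustrative sketch of such a playlist (part durations, file names, and sequence numbers are made up, and version/server-control tags are omitted for brevity):

#EXTM3U
#EXT-X-TARGETDURATION:6
#EXT-X-PART-INF:PART-TARGET=1.00000
#EXT-X-MEDIA-SEQUENCE:270
#EXTINF:6.00000,
fileSequence270.mp4
#EXT-X-PART:DURATION=1.00000,URI="filePart271.0.mp4",INDEPENDENT=YES
#EXT-X-PART:DURATION=1.00000,URI="filePart271.1.mp4"
#EXT-X-PART:DURATION=1.00000,URI="filePart271.2.mp4"
#EXT-X-PART:DURATION=1.00000,URI="filePart271.3.mp4"
#EXT-X-PART:DURATION=1.00000,URI="filePart271.4.mp4"
#EXT-X-PART:DURATION=1.00000,URI="filePart271.5.mp4"
#EXTINF:6.00000,
fileSequence271.mp4
#EXT-X-PART:DURATION=1.00000,URI="filePart272.0.mp4",INDEPENDENT=YES
#EXT-X-PART:DURATION=1.00000,URI="filePart272.1.mp4"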
Partial segments are advertised using a new EXT-X-PART tag. Note that partial segments are only advertised for the most recent segments in the playlist. Furthermore, both the partial segments (filePart272.x.mp4) and the respective full segments (fileSequence272.mp4) are offered.
Partial segments can also reference the same file but at different byte ranges. Clients can thereby load multiple partial segments with a single request and save round-trips compared to making separate requests for each part (as seen below).
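An illustrative sketch of parts that address byte ranges of the parent segment file (BYTERANGE is given as length@offset; all values are made up):

#EXT-X-PART:DURATION=1.00000,URI="fileSequence272.mp4",BYTERANGE="245000@0",INDEPENDENT=YES
#EXT-X-PART:DURATION=1.00000,URI="fileSequence272.mp4",BYTERANGE="198000@245000"
#EXT-X-PART:DURATION=1.00000,URI="fileSequence272.mp4",BYTERANGE="210000@443000"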

Preload hints and blocking of Media downloads

Soon-to-be-available partial segments are advertised in the playlist prior to their actual availability using a new EXT-X-PRELOAD-HINT tag. This enables clients to open a request early, and the server will respond once the data becomes available. This way the client can “save” the round-trip time for the request.
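An illustrative sketch (file names and durations are made up):

#EXT-X-PART:DURATION=1.00000,URI="filePart272.1.mp4"
#EXT-X-PRELOAD-HINT:TYPE=PART,URI="filePart272.2.mp4"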

Playlist Delta Updates

Clients have to refresh HLS playlists much more frequently for low-latency HLS. Playlist Delta Updates can be used to reduce the amount of data transferred with each playlist request: a new EXT-X-SKIP tag replaces the portion of the playlist that the client already received with a previous request.
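As a rough sketch: the server advertises how far back a client may skip via EXT-X-SERVER-CONTROL, the client requests a delta update with the _HLS_skip query parameter, and the response replaces the older segments with an EXT-X-SKIP tag (values are illustrative, playlist shortened):

Request:
GET /media.m3u8?_HLS_skip=YES

Response (excerpt):
#EXTM3U
#EXT-X-TARGETDURATION:6
#EXT-X-SERVER-CONTROL:CAN-SKIP-UNTIL=36.0,CAN-BLOCK-RELOAD=YES
#EXT-X-MEDIA-SEQUENCE:266
#EXT-X-SKIP:SKIPPED-SEGMENTS=5
#EXTINF:6.00000,
fileSequence271.mp4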

Blocking of Playlist reload

The discovery of new segments becoming available for an HLS live stream usually happens by the client reloading the playlist file at regular intervals and checking for newly appended segments. In the case of low-latency streaming, it is desirable to avoid any delay between a (partial) segment becoming available in the playlist and the client discovering its availability. With the playlist reloading approach, such discovery delay can, in the worst case, be as high as the reload interval.
With the new blocking playlist reload feature, clients can specify which future segment's availability they are awaiting, and the server will hold that playlist request until the specified segment becomes available in the playlist. The segment to be awaited is specified using query parameters on the playlist request.
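In Apple's preliminary specification this is done with the _HLS_msn (Media Sequence Number) and _HLS_part query parameters; an illustrative request could look like this:

GET /media.m3u8?_HLS_msn=273&_HLS_part=2 HTTP/1.1

The server would then withhold the playlist response until part 2 of media sequence number 273 (or anything newer) is present in the playlist.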

Rendition Reports

When playing at low latencies, fast bitrate adaptation is crucial to avoid playback interruptions due to buffer underruns. To save round-trips when switching between renditions, playlists must contain rendition reports via a new EXT-X-RENDITION-REPORT tag that informs the client about the most recent segment and part in the respective rendition.
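An illustrative sketch of rendition reports at the end of a media playlist (URIs and numbers are made up):

#EXT-X-RENDITION-REPORT:URI="../1M/media.m3u8",LAST-MSN=273,LAST-PART=3
#EXT-X-RENDITION-REPORT:URI="../4M/media.m3u8",LAST-MSN=273,LAST-PART=3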

Conclusion

For more detailed information on Apple's low-latency HLS, take a look at the Preliminary Specification and the latest IETF draft containing low-latency extensions for HLS.
We can conclusively say that low-latency HLS increases complexity quite significantly compared to standard HLS. The server's responsibilities expand from simply serving segments to supporting several additional mechanisms that clients use to save network round-trips and speed up segment delivery, which ultimately enables lower end-to-end latency. Considering that the specification remains subject to change and is yet to be finalized, it might still take a while until streaming vendors pick it up and we finally see low-latency HLS in the wild. In short, live low-latency streaming using HLS is possible, but at a significant cost in server complexity; measures are being developed to reduce complexity and server load, but wider adoption by major streaming providers will be needed for that to happen.

Video Tech Deep-Dive: Live Low Latency Streaming Part 2
https://bitmovin.com/blog/live-low-latency-streaming-p2/
Thu, 25 Jun 2020

This blog post is a continuation of an ongoing blog and webinar technical deep-dive series. You can find the first blog post here. The first post covered the fundamentals of live low latency and defined chunked delivery methods with CMAF.
This blog post expands on chunked CMAF delivery by explaining its application with MPEG-DASH to achieve low latency. We'll lay some foundations and cover the basic approaches behind low-latency DASH, then look into the future developments that are expected, as low-latency streaming is a heavily researched subject and is quickly becoming a media industry standard.

Basics of MPEG-DASH Live Streaming

Before diving into how Low Latency Streaming works in MPEG-DASH we first need to understand some basic stream mechanics of DASH live streams, most importantly, the concept of segment availability.
The DASH Media Presentation Description (MPD) is an XML document containing essential metadata of a DASH stream. Among many other things, it describes which segments a stream consists of and how a playback client can obtain them. The main difference between on-demand and live stream segments within DASH is that all segments of the stream are available at all times for on-demand; whereas the segments are produced continuously one after another as time progresses for live-streams. Every time a new segment is produced, its availability is signaled to playback clients through the MPD. It is important to note that a segment is only made available once it is fully encoded and written to the origin.

Fig. 1 Live stream with template-based addressing scheme (simplified)

The MPD would specify the start of the stream availability (i.e. the Availability Start Time) and a constant segment duration, e.g. 2 seconds. Using these values the player can calculate how many segments are currently in the availability window and also their individual availability start time. For example, the segment availability start time for the second segment would be AST + segment_duration * 2.

Low Latency Streaming with MPEG-DASH

In the first part of this blog post series, we described how chunked encoding and transfer enables partial loads and consumption of segments that are still in the process of being encoded. To make a player aware of this action, the segment availability in the MPD is adjusted to signal an earlier availability, i.e. when the first chunk is complete. This is done using the availabilityTimeOffset in the MPD. As a result, the player will not wait for a segment to be fully available and will load and consume it earlier.
Consider the example of Fig.1 with a segment duration of 2 seconds and a chunk duration of 0.033 seconds (i.e. one video frame duration with 29.97 fps). To signal the segment availability once the first chunk is completed we would set the availabilityTimeOffset to 1.967 seconds (segment_duration – chunk_duration). This would signal the greyed-out segment in Fig. 1 to become partially available.
The below MPD represents this example:

<?xml version="1.0" encoding="utf-8"?>
<MPD
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xmlns="urn:mpeg:dash:schema:mpd:2011"
  xmlns:xlink="http://www.w3.org/1999/xlink"
  xsi:schemaLocation="urn:mpeg:DASH:schema:MPD:2011 http://standards.iso.org/ittf/PubliclyAvailableStandards/MPEG-DASH_schema_files/DASH-MPD.xsd"
  profiles="urn:mpeg:dash:profile:isoff-live:2011"
  type="dynamic"
  minimumUpdatePeriod="PT500S"
  suggestedPresentationDelay="PT2S"
  availabilityStartTime="2019-08-20T05:00:03Z"
  publishTime="2019-08-20T12:42:07Z"
  minBufferTime="PT2.0S">
  <Period start="PT0.0S">
    <AdaptationSet
      contentType="video"
      segmentAlignment="true"
      bitstreamSwitching="true"
      frameRate="30000/1001">
      <Representation
        id="0"
        mimeType="video/mp4"
        codecs="avc1.64001f"
        bandwidth="2000000"
        width="1280"
        height="720">
        <SegmentTemplate
          timescale="1000000"
          duration="2000000"
          availabilityTimeOffset="1.967"
          initialization="1566277203/init-stream$RepresentationID$.m4s"
          media="1566277203/chunk-stream_t_$RepresentationID$-$Number%05d$.m4s"
          startNumber="1">
        </SegmentTemplate>
      </Representation>
    </AdaptationSet>
  </Period>
</MPD>

To recap, for low-latency DASH we are mainly doing two things:

  • Chunked encoding and transfer (i.e. chunked CMAF)
  • Signaling early availability of in-progress segments

While the previous approach enables a basic low-latency DASH setup, there are additional considerations to be made to further optimize and stabilize streaming experience. The DASH Industry Forum is working on guidelines for low-latency DASH to be released in the next version of the DASH-IF Interoperability Points (DASH-IF IOP) – expected in early July 2020. The change request for that can be found here. The following will explain key parts of these guidelines. Please note that some features were not officially finalized and standardized at the time of this post’s publication (June 2020).

Wallclock Time Mapping

For the purpose of measuring latency, a mapping between the media’s presentation time and the wall-clock time is needed. This is so that for any given presentation time of the stream the corresponding wall-clock time is known. The latency for a given playback position can then be calculated by determining the corresponding wall-clock time and subtracting it from the current wall-clock time.
This mapping can be achieved by specifying a so-called Producer Reference Time, either in the segments (i.e. inband as a prft box) or in the MPD. It essentially specifies the wallclock time at which the respective segment/chunk was produced, as seen below.

<ProducerReferenceTime
  id="0"
  type="encoder"
  presentationTime="538590000000"
  wallclockTime="2020-05-19T14:57:45Z">
</ProducerReferenceTime>

The type attribute specifies whether the reference time was set by the capturing device or by the encoder, allowing calculation of the End-to-End Latency (EEL) or the Encoder-Display Latency (EDL), respectively.

Client Time Synchronization

A precise time/clock at the playback client is necessary for calculations that involve the client's wallclock time, such as segment availability and latency calculations. It is recommended that the MPD include a UTCTiming element which specifies a time source that can be used to correct any drift of the client clock, as seen below.

<UTCTiming
  schemeIdUri="urn:mpeg:dash:utc:http-iso:2014"
  value="https://time.akamai.com/?iso"/>

Low Latency Service Description

A ServiceDescription element should be used to specify the service provider’s desired target latency and minimum/maximum latency boundaries in milliseconds. Furthermore, playback rate boundaries may be specified that define the allowed range for playback acceleration/deceleration by the playout client to fulfill the latency requirements.

<ServiceDescription id="0">
  <Latency target="3500" min="2000" max="10000" referenceId="0"/>
  <PlaybackRate min="0.9" max="1.1"/>
</ServiceDescription>

In most player implementations such parameters are provided externally using configurations and APIs.

Resynchronization Points

The previous post pointed out that chunked delivery decouples the achievable latency from the segment durations and enables us to choose relatively long segment durations to maintain good video encoding efficiency. In turn, this prevents fast quality adaptation of the player as quality switching can only be done on segment boundaries. In a low-latency scenario with low buffer levels, fast adaptation — especially down-switching — would be desirable to avoid buffer underruns and consequently playback interruptions.
To that end, Resync elements may be used that specify segment properties like chunk duration and chunk size. Playback clients can utilize them to locate resync points and:

  • Join streams mid-segment, based on latency requirements
  • Switch representations mid-segment
  • Resynchronize at mid-segment position after buffer underruns
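A rough sketch of how such a resynchronization point description might look inside a Representation (attribute names follow the draft Resync element; values are illustrative, with dT assumed to be the maximum distance between resync points in @timescale units, matching the chunk duration of the earlier example):

<Resync type="0" dT="33367"/>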

The previous sections were a glimpse of what to expect in the near future and show the great effort the media industry is putting into kick-starting low-latency streaming with MPEG-DASH and getting it ready for production services.
Want to learn more? Check out Part 3: Video Tech Deep-Dive: Live Low Latency Streaming Part 3 – Low-Latency HLS
… or take a look at some of the supporting documentation below:
[Tool] DASH-IF Conformance Tool 
[Blog Post] Video Tech Deep-Dive: Live Low Latency Streaming Part 1 
[Demo] Low Latency Streaming with Bitmovin’s Player 

Video Tech Deep Dive: Super-Resolution with Machine Learning P1
https://bitmovin.com/blog/super-resolution-machine-learning-p1/
Wed, 20 May 2020

Super-Resolution: What’s the buzz and why does it matter?

Super-resolution has been gaining steam recently. Many companies, secretly and not-so-secretly, have been incorporating this technology into their workflow. Most notably: 

  • Samsung has started advertising this feature in their latest flagship phone cameras, boasting 64MP sensors that use super-resolution for zooming during photo capture
  • Other upstart companies are exploiting super-resolution to “upsample” videos and bring footage captured over a century ago back to life.
  • Image editing applications like Pixelmator Pro are using this feature to provide an enhanced end-user experience.
Companies, large and small, have been incorporating super-resolution into their products.

Although the idea of super-resolution has been around for quite some time, its recent resurgence in media applications has been driven primarily by advances in Machine Learning (ML). In the age of 4K and 8K content, super-resolution is an increasingly relevant topic in the field of video and will only continue to grow.

So in this series, I will try to shed some light on:

  • What is ML-based super-resolution?
  • Why is super-resolution so enticing for video companies, and what are its advantages?
  • Why does super-resolution matter for you, and how can you incorporate it within your own video workflows?  

What is Video Sampling?

Before jumping into the deep end, let's get some basics in order, starting from simple digital signals and building all the way up to ML-based super-resolution.

Digital Signals

Videos are sequences of images, and an image is essentially nothing but a two-dimensional digital signal. Processing this digital signal is of utmost importance for most electronic devices, especially if you concern yourself with video transmission and picture quality.

Resampling digital signals

Within a digital signal, post-processing is almost always required. One of the commonly applied post-processing steps is known as “resampling”. Resampling means changing the sampling rate of a digital signal; in simple words, for a given time duration, we change the number of samples in the signal.
Within resampling, one could do two things. You could either:
Upsample

  • To predict new information from the pre-existing information. In other words, you are increasing the number of samples in a given time. This is sometimes also known as interpolation.

Downsample

  • To throw away existing information. In other words, you are decreasing the number of samples in a given time.

This idea is depicted in the following figure.

A digital source signal can be resampled in two ways: upsampled or downsampled.

Resampling video

As explained earlier, video is nothing but a digital signal, and there is usually a need to resample this digital video signal (we will look at some practical examples below). Since super-resolution concerns itself only with upsampling, the rest of the series will focus solely on video upsampling.
To reiterate, video upsampling is the process of predicting new video samples from pre-existing video samples. 


Why Upsample?

Is there a need to upsample videos? And more importantly, is there a business opportunity behind it?
Let’s look at some real-world use cases and the types of upsampling to explain its relevance towards modern-day media. There are two primary types of upsampling: Temporal and Spatial.

Temporal vs. Spatial Upsampling

Temporal Upsampling 

Temporal upsampling means predicting video information across the time dimension using pre-existing information. This is best displayed in the iconic film series The Matrix; if you are old enough to remember the famous “Neo vs Agent Smith” fight scene from the movie Matrix Reloaded, you'll know that this movie was shot in 2003. One of the fascinating aspects of the scene is that it alternates between 12,000 frames per second (fps) (super-slow-motion) and 24 fps (normal speed).

In 2003, filmmakers certainly did not have a camera that could shoot at 12,000 fps; their cameras were only capable of shooting a maximum of 24 fps. So the filmmakers had to do sophisticated interpolation to obtain the 12,000 frames (per second) from the pre-existing 24 frames (per second). In other words, they predicted digital samples across the temporal dimension.

Spatial Upsampling

On the other hand, spatial upsampling is the process of predicting information across the spatial dimension.

Classic movies shot in low resolution have to be resampled to a higher resolution.

Imagine that you have an old catalog of classic movies that you want to enjoy on your new, crisp 4K TV. The classic movies were (expectedly) not shot in 4K resolution. Converting a low-resolution movie, say 360p, to 4K requires spatial upsampling: you need to go from 172,800 pixels (360p) to 8,294,400 pixels (4K). In other words, you predict digital information across spatial dimensions.
So, to answer the original questions from the beginning:

  • YES! There is a need to upsample videos.
  • And an even more emphatic YES! There is a huge business opportunity behind it.

Super-resolution primarily deals with spatial upsampling. Hence, we will restrict our focus for the remainder of the series to spatial upsampling.


Spatial Upsampling in Video

You might already be familiar with some of the other well-known methods to spatially upsample videos; the most common ones are bilinear, bicubic, and Lanczos. Essentially, the idea behind all of these methods is that they use a single predefined mathematical function to predict new digital video samples from the pre-existing ones.

Most commonly used upsampling methods work on the same idea. They use a single predefined mathematical function to interpolate.
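As a concrete illustration, a conventional spatial upscale to 4K with one such fixed interpolation kernel (Lanczos) can be done with ffmpeg's scale filter; file names and encoder settings here are placeholders:

ffmpeg -i input_360p.mp4 -vf "scale=3840:2160:flags=lanczos" -c:v libx264 -crf 18 -c:a copy output_4k.mp4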

The emphasis on “single predefined mathematical function” is important; this is a key point that we will revisit later.
Now, in a similar vein, super-resolution is a class of techniques to perform upsampling of videos. There are several flavors of super-resolution. But they are based on the same core principle: they use information from several images (typically neighboring images in video) to spatially upsample a single low-resolution image to a high-resolution image.

Super-resolution uses information from several low-resolution images to interpolate a single low-resolution image to a high-resolution image.

Note that in contrast to the spatial upsampling methods mentioned before, super-resolution uses information from several neighboring images to interpolate a single low-resolution image. Because it combines multiple information sources, it is able to interpolate the image better than the methods mentioned above. This is one of the reasons why super-resolution is already being applied by companies such as AMD and NVIDIA to render video games at high resolutions (4K).
Going forward, we will focus on super-resolution applications in a typical video workflow, especially ML-based super-resolution, and discover the superior benefits it offers over conventional methods in video workflows.
Why ML-based super-resolution and why is it so much better? We will answer this in the next series of posts. So stay tuned!


Conclusion

The following figure summarizes everything we learned in this blog post. 

The focus of this series of blog posts will be on machine learning-based super-resolution.

There is a big business opportunity behind upsampling videos, especially spatial upsampling of videos. Super-resolution is a class of techniques to spatially upsample video. Broadly, super-resolution can be categorized into two categories: machine learning-based and non-machine learning-based. This blog series will focus on machine learning-based super-resolution and the superior benefits it offers in video workflows.
Did you enjoy this post? Want to learn more?
Check out part two of the Super-Resolution series: Super-Resolution with Machine Learning P2 
Check out part three: Practical Super-Resolution Deployments and Ensuing Results

Video Tech Deep-Dive: Live Low Latency Streaming Part 1
https://bitmovin.com/blog/live-low-latency-streaming-p1/
Wed, 22 Apr 2020

What is Live Low Latency?

Latency in live streaming is the time delay between an event's content being captured at one end of the media delivery chain and played out to a user at the other end. Consider a goal scored at a football game: live latency is the delay between the moment a goal is scored and captured by a camera and the moment a viewer sees the goal on their own device. A few different terms effectively describe the same thing: end-to-end latency, hand-waving latency, or glass-to-glass latency.

End-to-end video encoding workflow (where latency matters)

In our most recent developer report, low latency was identified as one of the biggest challenges for the media industry. This blog series will take an in-depth look into why that's the case. Welcome to our Live Latency Deep Dive series!

Why care about Low Latency? 

Most use cases where live latency is crucial can be categorized into the following:

Live content delivered across multiple distribution channels

OTT delivery traditionally suffers from high live latency in comparison to linear broadcast delivery via satellite, terrestrial, or cable services. Over-the-top (OTT) delivery methods like MPEG-DASH and Apple HLS have become the de facto standard for delivering video to audiences on devices such as smartphones, tablets, laptops, and Smart TVs. Live network content, like sports or news, drives the need for low live latency as these networks attempt to deliver content simultaneously over various distribution means (e.g. OTT vs. cable).
Picture a scenario where you are streaming your favorite football team playing in the global final, while your neighbor and equal fan (with incredibly thin walls) watches via traditional linear cable. It's the final moments of the game, but you hear the neighbor cursing loudly, despite the fact that there is well over 1 minute left in the game on your screen. The thrill is spoiled and you know your team has certainly lost. The need for lower live latency becomes clear; the difference between broadcast and streaming is unacceptable in today's digital world. But a lot of factors affect how quickly content appears on a viewer's screen. Aside from infrastructural issues (like not being optimized for low latency), modern streaming methods may suffer additional perceived delays from social media feeds, push notifications, and second-screen experiences running in parallel to the live event.

Interactive live content

Whenever audience interaction is involved, live latency should be as low as possible to ensure a good quality of experience (QoE). Such use cases include webinars, auctions, user-generated content where the broadcaster interacts with the audience (e.g. Twitch, Periscope, Facebook Live, etc.) and more. Latency is often measured on a spectrum, where high latency is the least sought after delay, and Real-Time is the most sought after. See the Latency Spectrum below (including the latency types, delay time, and streaming formats):

Latency Spectrum in Video Streaming

The latency spectrum shows that unoptimized OTT delivery accounts for around 30+ seconds of delay while cable broadcast TV clocks in at around 5 seconds – give or take. Furthermore, sub-second latencies may not be achievable with OTT methods and require other protocols like WebRTC.

Where does live latency come from?

First, a slightly more technical definition of live latency: It’s the time difference between a video frame being captured and the moment it’s presented to the playback client. In other words, it’s the time that a video frame spends in the media processing and delivery chain. Every component in the chain introduces a certain amount of latency and eventually accumulates to what is considered live latency. 
Let’s have a look at the main sources of live latency:

Buffering ahead for playback stability at the player-level

Live stream timeline

A video player will aim to maintain a pre-defined amount of buffered data ahead of its playback position. The standard value is about 30 seconds of buffer loaded ahead at all times during playback. One of the reasons behind this is that if network bandwidth drops during playback, there would still be 30 seconds of data to play out without interruption. During this time the player can react to the new bandwidth conditions appropriately, buying itself some time to adapt. Buffer level also typically influences bitrate adaptation decisions, as low buffer levels may imply more aggressive downward adaptations.
However, when aiming for 30 seconds of buffer with a live stream, the player must stay at least 30 seconds behind the live edge (the most recent point) of the stream with its playback position; this would result in a live latency of 30 seconds. Conversely, this means that aiming for a low latency would require being even closer to the live edge and implies having a minimum buffer. If we aim for 5 seconds of latency, the player would have 5 seconds of buffer at most. Thus, the difficult decision of trading off between latency and playback stability must be made.

Segments are produced, transferred and consumed in their entirety

Live streams are encoded in real-time. This means that if the segment duration is 6 seconds, it will take the encoder 6 seconds to produce one full segment. Additionally, if fragmented MP4 is used as the container format, encoders can only write a segment to the desired storage once it is encoded completely, i.e. 6 seconds after starting to encode the segment. So once a segment is transferred to the storage, its oldest frame is already 6 seconds old. On the other side of the delivery chain, the player can only decode an fMP4 segment in its entirety and therefore needs to download a segment fully before it can process it. Network transfers (uploading to a CDN origin server, transferring the content within the CDN, and downloading from the CDN edge server to the client) add to the overall latency to a lesser degree.
In summary, the fact that segments are only processed and transferred in their entirety results in latency being correlated directly to segment duration.

Data Segments in the Encoding Workflow

What can we do?

Naive approach: Short segments

As latency is correlated to segment duration, a simple way to decrease latency would be to use short segments, e.g. 1-second duration. However, this comes with negative side effects such as:

  • Video coding efficiency suffers: The requirement that each video segment starts with a key frame implies having small groups of pictures (GOPs). This, in turn, causes the efficiency of differential/predictive coding to suffer. With short segments, you would have to spend more bits to reach the same perceptual quality as longer segments of the same content.
  • More network requests and everything negative associated with them, e.g. time to first byte (TTFB) wasted on every request.
  • Increased number of segments may decrease CDN caching efficiency.
  • Buffer at the player grows in a jumpy fashion which increases the risk of playback stalls due to rebuffering.

Chunked encoding and transfer

To solve the problem of segments being produced and consumed only in their entirety, we can make use of the chunked encoding scheme specified in the MPEG-CMAF (Common Media Application Format) standard. CMAF defines a container format based on the ISO Base Media File Format (ISO BMFF), similar to the MP4 container format, which is already widely supported by browsers and end devices. Within its chunked encoding feature, CMAF introduces the notion of CMAF chunks. Compared to an “ordinary” fMP4 segment that has its media payload in a single big mdat box, chunked CMAF allows segments to consist of a sequence of CMAF chunks (moof+mdat tuples). In extreme cases, every frame can be put into its own CMAF chunk. This enables the encoder to produce and the player’s decoder to consume segments in a chunk-by-chunk fashion instead of limiting use to entire segment consumption. Admittedly, the MPEG-TS container format offers similar properties as chunked CMAF, but it’s fading as a format for OTT due to the lack of native device and platform support that fMP4 and CMAF provide.

6s fMP4 segment compared to chunked CMAF
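A rough sketch of the difference in box layout (simplified):

Ordinary 6s fMP4 segment:  [moof][              mdat (6 s of frames)              ]

Chunked CMAF segment:      [moof][mdat][moof][mdat][moof][mdat] ... [moof][mdat]
                            chunk 1     chunk 2     chunk 3         chunk N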

Chunked encoding on its own does not help us decrease the latency, but it is a key ingredient. To capitalize on chunked encodes, we need to combine the process with HTTP/1.1 chunked transfer encoding (CTE). CTE is a feature of HTTP that allows resource transfers where the size is unknown at the time of transfer. It does so by transferring resources chunk-wise and signaling the end of a resource with a chunk of length 0. We can utilize CTE at the encoder to write CMAF chunks to the storage as soon as they are produced, without waiting for the encode of the full segment to finish. This enables the player to request (also using CTE) the available CMAF chunks of a segment that is still being encoded and forward them as fast as possible to the decoder, allowing playback as soon as the first CMAF chunk is received.
Chunked CMAF data segment in storage

Implications of low latency chunked delivery

… besides enabling low latency:

  • Smoother and less jumpy client buffer levels due to the constant flow of CMAF chunks received, lowering the risk of buffer underruns and improving playback stability.
  • Faster stream startup (time to first frame) and seeking at the client due to being able to decode and playout segments partially during their download.
  • Higher overhead in segment file size compared to non-chunked segments as a result of the additional metadata (moof boxes, mdat headers) introduced with chunked encodes.
  • Low buffer levels at the client impact playback stability. A low live latency implies the client is playing close to the live edge and has a low buffer level. Therefore the longest achievable buffer level is limited by the current live latency. It’s a QoE tradeoff: low latency vs. playback stability.
  • Bandwidth estimation for adaptive streaming at the client is hard. When loading a segment at the bleeding live edge, the download rate will be limited by the source/encoder. As content is produced in real-time it takes, for example, 6 seconds to encode a 6-second long segment. So the download rate/time for segments is no longer limited by networks but by encoders. This causes a problem in bandwidth estimation methods that are currently commonplace in the industry and based on the download duration. The standard formula to calculate bandwidth estimation is:

estimatedBW = segmentSize / downloadDuration
e.g. estimatedBW = 1 MB / 2 s = 4 Mbit/s

As download duration roughly equals the segment duration when loading at the bleeding live edge using CTE, it can no longer be used to estimate client bandwidth. Bandwidth estimation is a crucial part of any adaptive streaming player and the lack of estimated bandwidth must be addressed. Research for better ways to estimate bandwidth in chunked low-latency delivery scenarios is ongoing in academia and throughout the streaming industry, e.g. ACTE.
Did you enjoy this post? Want to learn more? Check out Part two of the Low Latency series: Video Tech Deep-Dive: Live Low Latency Streaming Part 2
…or if you want to jump ahead, take a look at Part three: Video Tech Deep-Dive: Live Low Latency Streaming Part 3 – Low-Latency HLS

Fun with Container Formats – Part 2
https://bitmovin.com/blog/fun-with-container-formats-2/
Mon, 01 Jul 2019

In continuation of our Fun with Container Formats series, this week we’ll be diving into the MP4 and CMAF container formats. If you need a refresher on the terminology or the handling of these formats in your player, please check back to Post 1: Fun with Container Formats

MP4

Overview of Standards
MPEG-4 Part 14 (MP4) is one of the most commonly used container formats and often has a .mp4 file ending. It is used for Dynamic Adaptive Streaming over HTTP (DASH) and can also be used for Apple’s HLS streaming.
MP4 is based on the ISO Base Media File Format (MPEG-4 Part 12), which is based on the QuickTime File Format. MPEG stands for Moving Pictures Experts Group and is a cooperation of the International Organization for Standardization (ISO) and the International Electrotechnical Commission (IEC). MPEG was formed to set standards for audio and video compression and transmission. MPEG-4 specifies the Coding of audio-visual objects. 
MP4 supports a wide range of codecs. The most commonly used video codecs are H.264 and HEVC. AAC is the most commonly used audio codec. AAC is the successor of the famous MP3 audio codec.
ISO Base Media File Format
ISO Base Media File Format (ISOBMFF, MPEG-4 Part 12) is the base of the MP4 container format. ISOBMFF is a standard that defines time-based multimedia files. Time-based multimedia usually refers to audio and video, often delivered as a steady stream. The format is designed to be flexible and easy to extend, and it enables interchangeability, management, editing, and presentability of multimedia data.

The base components of ISOBMFF are boxes, which are also called atoms. The standard defines boxes using classes and an object-oriented approach: through inheritance, all boxes extend a base class Box and are made specific to their purpose by adding new class properties.
The base class:
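The standard defines it roughly as follows (a paraphrased sketch of the syntax in ISO/IEC 14496-12, not normative text):

aligned(8) class Box (unsigned int(32) boxtype,
                      optional unsigned int(8)[16] extended_type) {
    unsigned int(32) size;
    unsigned int(32) type = boxtype;
    if (size==1) {
        unsigned int(64) largesize;   // 64-bit size for very large boxes
    } else if (size==0) {
        // box extends to the end of the file
    }
    if (boxtype=='uuid') {
        unsigned int(8)[16] usertype = extended_type;
    }
}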
  Example FileTypeBox:
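And the FileTypeBox, roughly:

aligned(8) class FileTypeBox extends Box('ftyp') {
    unsigned int(32) major_brand;
    unsigned int(32) minor_version;
    unsigned int(32) compatible_brands[];   // continues to the end of the box
}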

The FileTypeBox is used to identify the purpose and usage of an ISOBMFF file. It is often at the beginning of a file.
A box can also have children and form a tree of boxes. For example, the MovieBox (moov) can have multiple TrackBoxes (trak). A track in the context of ISOBMFF is a single media stream; e.g., a MovieBox contains one TrackBox for video and one for audio.
The binary codec data can be stored in a Media Data Box (mdat). A track usually references its binary codec data.
Fragmented MP4 (fMP4)
Using MP4, it is also possible to split a movie into multiple fragments. This has the advantage that, for DASH or HLS, a player only needs to download the fragments the viewer wants to watch. A fragmented MP4 file consists of the usual MovieBox with its TrackBoxes to signal which media streams are available. A Movie Extends Box (mvex) is used to signal that the movie is continued in the fragments. Another advantage is that fragments can be stored in different files. A fragment consists of a Movie Fragment Box (moof), which is very similar to a MovieBox (moov): it contains the information about the media streams contained in one single fragment, e.g. the timestamp information for the 10 seconds of video stored in that fragment. Each fragment has its own Media Data Box (mdat).
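A simplified sketch of the resulting box layout (not exhaustive; the exact child boxes vary):

ftyp
moov
  mvhd
  trak (video)
  trak (audio)
  mvex          <- signals that the movie continues in fragments
moof            <- fragment 1: timing and sample information (mfhd, traf/tfhd, tfdt, trun)
mdat            <- fragment 1: media data
moof            <- fragment 2
mdat            <- fragment 2: media data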
Debugging (f)MP4 files
Viewing the boxes (atoms) of an (f)MP4 file is often necessary to discover bugs and other unwanted configurations of specific boxes. To get a summary of what a media file contains, the best tools are:

These tools will, however, not show you the box structure of an (f)MP4 file. For this you could use the following tools:

(Screenshot of isoviewer)

CMAF

MPEG-CMAF (Common Media Application Format)
Serving every platform as a content distributor can prove to be challenging, as some platforms only support certain container formats. To distribute a certain piece of content, it can be necessary to produce and serve copies of the content in different container formats, e.g. MPEG-TS and fMP4. Clearly, this causes additional costs in infrastructure for content creation as well as storage costs for hosting multiple copies of the same content. On top of that, it also makes CDN caching less efficient. MPEG-CMAF aims to solve these problems, not by creating yet another container format, but by converging on a single, already existing container format for OTT media delivery. CMAF is closely related to fMP4, which should make the transition from fMP4 to CMAF very low effort. Further, with Apple being involved in CMAF, the necessity of muxing content in MPEG-TS to serve Apple devices should hopefully become a thing of the past, and CMAF can be used everywhere.
With MPEG-CMAF, there are also improvements in the interoperability of DRM (Digital Rights Management) solutions through the use of MPEG-CENC (Common Encryption). It is theoretically possible to encrypt the content once and still use it with all the different state-of-the-art DRM systems, such as Widevine and PlayReady. In practice, however, Common Encryption allows more than one encryption mode (AES-CTR and AES-CBC based schemes) and not all DRM systems support the same modes, so a single encrypted copy does not yet work everywhere. The industry is slowly converging on a common scheme.
Chunked CMAF
One interesting feature of MPEG-CMAF is the possibility to encode segments in so-called CMAF chunks. Such chunked encoding of the content, in combination with delivering the media files using HTTP chunked transfer encoding, enables lower latencies in live streaming use cases than before.
With traditional fMP4, a whole segment had to be fully downloaded before it could be played out. With chunked encoding, any completely loaded chunks of a segment can already be decoded and played out while the rest of the segment is still loading. Hereby, the achievable live latency no longer depends on the segment duration, as the muxed chunks of an incomplete segment can already be loaded and played out at the client.
Make sure to check back with us next week as we dive into MPEG-TS and Matroska.
Jump to Part 3 of Fun with Container Formats 
