Video encoding is unquestionably one of the most demanding computational tasks: compressing one hour of raw video into a standard-definition file encoded with AVC (the most common video codec) can take two or more hours, depending on the capabilities of your hardware (processor, GPU, RAM) or your cloud service provider's resources. The difficulty varies with the codec (AVC, HEVC, AV1, VP9, etc.) and the output configuration (resolution, bitrate, quality, color depth, etc.), following a general trend: the higher the target quality and the more advanced the compression format, the more complex the encoding task becomes. But what makes encoding such a challenging task?
In essence, the video compression process consists of three main phases: Prediction, Transform, and Encoding. The objective is to generate a compressed version of an input video that can be decoded and played back later. Let’s dive into each phase.
During the Prediction phase, the algorithm seeks out blocks and patterns shared between neighboring frames. Self-contained frames (known as "I-frames") are coded without references, while subsequent frames ("P-frames" and "B-frames") are coded by referencing other frames: P-frames point back to earlier frames, and B-frames can point both backward and forward. When a match (or "prediction") is found, the redundant information is omitted, thus achieving compression. This process is repeated until all frames are scanned and compressed as much as possible, exploiting references both within a single frame and across multiple frames.
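To make this concrete, here is a minimal, codec-agnostic sketch of block matching, the search at the heart of inter-frame prediction. Everything in it is illustrative: the frames are synthetic, and the 16x16 block size and 8-pixel search window are arbitrary choices, not any codec's actual parameters.

```python
import numpy as np

def find_best_match(ref_frame, cur_block, top, left, search=8):
    """Exhaustive block matching: find the offset within +/-search pixels
    of (top, left) where cur_block best matches the reference frame,
    measured by the sum of absolute differences (SAD)."""
    h, w = cur_block.shape
    best_sad, best_offset = float("inf"), (0, 0)
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = top + dy, left + dx
            if y < 0 or x < 0 or y + h > ref_frame.shape[0] or x + w > ref_frame.shape[1]:
                continue
            candidate = ref_frame[y:y + h, x:x + w].astype(int)
            sad = np.abs(candidate - cur_block.astype(int)).sum()
            if sad < best_sad:
                best_sad, best_offset = sad, (dy, dx)
    return best_offset, best_sad

# Simulate a camera pan: the current frame is the previous frame
# shifted down 2 pixels and right 3 pixels.
rng = np.random.default_rng(0)
prev = rng.integers(0, 256, (64, 64), dtype=np.uint8)
cur = np.roll(prev, shift=(2, 3), axis=(0, 1))

offset, sad = find_best_match(prev, cur[16:32, 16:32], top=16, left=16)
print(offset, sad)  # (-2, -3) 0: a perfect match 2 px up and 3 px left
```

Instead of storing the raw block, an encoder stores only the motion vector and the (hopefully tiny) residual between the block and its prediction.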
Every group of frames that are interlinked by predictions is termed a "GOP" (Group Of Pictures). The size of a GOP can vary based on the scene (a change in scene will trigger a new GOP) and the method of distribution (live streaming favors short GOPs of 0.5 - 2 seconds, while video-on-demand can accommodate longer GOPs). For more insights into GOPs and frame types, you can explore OTTverse’s guide on I, P, and B-frames.
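If you'd like to see GOP structure for yourself, and assuming you have FFmpeg's ffprobe on your PATH, a few lines of Python can list each frame's picture type; every I-frame marks the start of a new GOP. The file name input.mp4 is a placeholder.

```python
import subprocess

# Ask ffprobe for the picture type (I, P, or B) of every video frame.
result = subprocess.run(
    ["ffprobe", "-v", "error", "-select_streams", "v:0",
     "-show_entries", "frame=pict_type", "-of", "csv=p=0", "input.mp4"],
    capture_output=True, text=True, check=True,
)
types = "".join(result.stdout.split())  # e.g. "IBBPBBP...IBBP..."
print(types)

# Each I-frame opens a GOP, so GOP sizes are the gaps between I-frames.
gop_sizes = [len(run) + 1 for run in types.split("I")[1:]]
print(gop_sizes)
```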
The Transform phase follows, where all residual blocks (i.e., the parts of the video not accounted for by prediction) are converted from pixel values into frequency coefficients, typically with a DCT-like transform, and then quantized. These coefficients serve as a compact blueprint from which the decoder can reconstruct the frames later, and discarding the coefficients that matter least is what conserves data.
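Here is a toy rendition of that idea, assuming NumPy and SciPy are available: an 8x8 residual block is mapped to frequency coefficients with a 2-D DCT and then quantized, after which most coefficients round to zero. The flat quantization step of 16 is invented for illustration; real codecs use perceptually tuned quantization matrices.

```python
import numpy as np
from scipy.fft import dctn, idctn

# A small, noise-like prediction residual for one 8x8 block.
rng = np.random.default_rng(1)
residual = rng.normal(0, 4, (8, 8))

coeffs = dctn(residual, norm="ortho")           # forward 2-D DCT
quantized = np.round(coeffs / 16).astype(int)   # lossy quantization
print("nonzero coefficients:", np.count_nonzero(quantized), "of 64")

# The decoder rescales and inverts the transform to approximate
# the residual: a close, but not exact, reconstruction.
reconstructed = idctn(quantized * 16.0, norm="ortho")
print("max reconstruction error:", np.abs(residual - reconstructed).max())
```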
Imagine a painter working in reverse: instead of applying colors from a palette to a canvas, they are removing the colors from the canvas and arranging them back onto the palette. This analogy helps to conceptualize what happens during the Transform phase.
Finally, in the Encoding phase, all the values and parameters from the earlier steps (predictions, transform coefficients, and metadata) are entropy-coded into a compact, computer-readable bitstream. This compressed bitstream is what can be stored or transmitted, and it is what we typically understand as "video."
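Real codecs rely on sophisticated entropy coders (AVC, for instance, uses CAVLC or CABAC), but a bare-bones run-length scheme conveys the gist: long runs of zero coefficients collapse into short (run, value) pairs. The sketch below is a stand-in for the idea, not any codec's actual bitstream syntax.

```python
def run_length_encode(values):
    """Pack a mostly-zero coefficient sequence as (zero_run, value) pairs,
    with a trailing (zero_run, None) acting as an end-of-block marker."""
    pairs, run = [], 0
    for v in values:
        if v == 0:
            run += 1
        else:
            pairs.append((run, v))
            run = 0
    pairs.append((run, None))
    return pairs

coeffs = [52, 0, 0, -3, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
print(run_length_encode(coeffs))
# [(0, 52), (2, -3), (3, 1), (8, None)]: 16 values in 4 pairs
```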
For those who are curious about the technical details, VCodex offers comprehensive overviews on AVC and HEVC encoding.
The process above is not only intricate; it is also exceedingly demanding for computers. Video is bulky in terms of data, and that sheer volume is a fundamental bottleneck for processing performance: it drives both execution times and data-transfer times, which is particularly evident in cloud encoding services.
To illustrate the magnitude of data involved in storing a compressed video, consider three different ways of representing a scene from the animated film "Big Buck Bunny": as text, as an image, and as a video. Each format tells the same story but varies dramatically in size: the text is the smallest at only 397 bytes, the image larger at 618,149 bytes, and the video the largest at 6,992,559 bytes. The video offers the richest detail and is the most enjoyable to watch, but it is also roughly 17,600 times larger than the text representation!
An enormous, fluffy, and utterly adorable grey-furred rabbit is heartlessly harassed by the ruthless, loud, bullying gang of a flying squirrel, who is determined to squash his happiness. While he's peacefully strolling through a meadow, the gang throws acorns and nuts at him while he's standing helpless: after a dodged shot, he succumbs to a barrage of pesky bullets that whack him and stun him.
Video encoding is about managing these massive volumes of data and compressing them without sacrificing quality, a task not to be taken lightly.
Traditionally, video encoding is performed on a single machine that processes the workload sequentially. The video is streamed from its source, and frames pass through the phases above in order: prediction groups them into GOPs, whose residuals are then transformed and encoded.
ByteNite approaches encoding complexity from a different angle. Instead of trying to squeeze ever more speed out of a single machine, it breaks videos down into chunks that are encoded in parallel across a network of devices. This grid computing technique enables the handling of longer videos and larger workloads without the need for powerful and costly hardware.
It's akin to diluting a bitter medicine in a glass of water! The video segmentation process is surprisingly efficient: our software automatically identifies potential GOPs, extracts them from the streaming video, and distributes them across multiple devices for encoding. Because GOPs are independent, ByteNite's Assembler module can seamlessly merge the encoded chunks back together within seconds.
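ByteNite's pipeline itself is proprietary, but the general split/encode/merge pattern can be sketched with off-the-shelf tools. The script below, which assumes FFmpeg is installed and uses input.mp4 as a placeholder file, splits a video at keyframes, encodes the chunks in a local process pool (standing in for a grid of devices), and stream-copies them back together.

```python
import glob
import subprocess
from concurrent.futures import ProcessPoolExecutor

SRC = "input.mp4"  # placeholder source file

def encode_chunk(path):
    """Re-encode one self-contained chunk (one device's job in a grid)."""
    out = path.replace("chunk_", "enc_")
    subprocess.run(["ffmpeg", "-y", "-i", path, "-c:v", "libx264",
                    "-crf", "23", out], check=True)
    return out

if __name__ == "__main__":
    # 1. Split without re-encoding; when stream-copying, the segment
    #    muxer can only cut on keyframes, so every chunk is an
    #    independent run of GOPs.
    subprocess.run(["ffmpeg", "-y", "-i", SRC, "-c", "copy", "-f", "segment",
                    "-segment_time", "10", "-reset_timestamps", "1",
                    "chunk_%03d.mp4"], check=True)

    # 2. Encode the chunks in parallel.
    with ProcessPoolExecutor() as pool:
        encoded = list(pool.map(encode_chunk, sorted(glob.glob("chunk_*.mp4"))))

    # 3. Reassemble with the concat demuxer: a stream copy, so the merge
    #    takes seconds regardless of the video's length.
    with open("list.txt", "w") as f:
        f.writelines(f"file '{name}'\n" for name in encoded)
    subprocess.run(["ffmpeg", "-y", "-f", "concat", "-safe", "0",
                    "-i", "list.txt", "-c", "copy", "output.mp4"], check=True)
```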
This method proves to be 10 or more times faster than traditional encoding. Our test results for the cases detailed below speak volumes.
Encoding job #1 (Meridian) on ByteNite
Encoding job #2 (Unbelievable Beauty) on ByteNite
Video encoding is a complex process that demands significant computational power. Companies and associations in the video broadcasting industry strive to optimize encoding software and develop more efficient codecs, yet the fundamental encoding task remains largely unchanged. ByteNite introduces revolutionary distributed computing software that leverages the intrinsic segmentation of video encoding, the GOP, to reduce complexity. By having each device in a network process a portion of the video concurrently, ByteNite achieves encoding speeds 10+ times faster than conventional methods.