
WIP-0007: Whitebox Video Processing Implementation Proposal

Introduction

This document proposes an implementation of the video processing operations that Whitebox performs on camera-provided video streams and files.

Goals

  • Ingest video streams from various camera models
  • Prepare videos for streaming and playback
  • Make videos available in multiple resolutions for smoother playback on different devices and network conditions
  • Extract video metadata (e.g., resolution, framerate, codec)

Proposed Implementation

Video processing mostly revolves around FFmpeg pipelines, which are used to ingest, transcode, and prepare videos for streaming and playback. They can be run in their entirety or in parts, depending on the input, the desired output, the available (or reserved) resources, and so on.

We have two kinds of video processing pipelines:

  1. Real-time video processing pipeline: Ingests video streams from cameras, transcodes them into streaming-friendly formats, and makes them available for immediate playback after flight sessions while the high-definition videos are being processed.
  2. Post-flight video processing pipeline: Downloads high-definition video files off the camera devices after flight sessions, transcodes them into multiple resolutions, and makes them available for playback.

For real-time streaming from the camera to the clients (frontend), we use the WebRTC protocol for low latency. For both immediate post-flight playback and high-definition transcoding, we use the HLS (HTTP Live Streaming) protocol for stable playback across different devices and network conditions.

The H.264 codec is used for compatibility with most devices and browsers in both streaming and playback, with fMP4 as the container format for HLS segments in playback.

Real-time video processing pipeline

When a camera is connected, Whitebox starts ingesting its video stream, transcodes the video to H.264, and pushes it to SRS (Simple Realtime Server).

SRS then makes the stream available, without re-transcoding, for real-time playback over both the RTMP and WebRTC protocols. Clients (frontend) connect to the WebRTC endpoint for low-latency real-time video playback.
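
As a rough sketch of the ingest step (the camera URL, SRS address, and stream key are placeholders, and the camera is assumed to expose an RTSP feed):

# Ingest the camera's RTSP feed, transcode it to H.264 with low-latency settings
# (zerolatency also disables B-frames), and push it to the local SRS instance over RTMP
ffmpeg \
  -rtsp_transport tcp \
  -i rtsp://camera.local/stream \
  -c:v libx264 -preset veryfast -tune zerolatency \
  -c:a aac -b:a 128k \
  -f flv rtmp://localhost/live/camera1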

During flight sessions, Whitebox also records the stream to HLS files for immediate post-flight playback. When a flight session ends, the recorded video files are available for playback right away, but the stream continues until the camera is disconnected.
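
A minimal sketch of that recording, assuming the stream published to SRS is already H.264 and can therefore be remuxed into HLS without re-encoding (URLs and filenames are placeholders):

# Pull the live stream back from SRS and remux it into HLS segments without
# re-encoding; -hls_list_size 0 keeps every segment in the playlist
ffmpeg \
  -i rtmp://localhost/live/camera1 \
  -c copy \
  -f hls \
  -hls_time 2 \
  -hls_segment_type fmp4 \
  -hls_segment_filename "recording_%05d.mp4" \
  -hls_list_size 0 \
  recording.m3u8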

For a high-level overview, the real-time video processing pipeline looks like this:

[Video Stream Ingestion]
           |
  [Video Preparation]
           |
           ▼          on demand
     [SRS Server] --     e.g.      --> [Record for Immediate Post-Flight Playback]
           ▲         during flight
           |
    [WebRTC Clients]

Videos are played using Video.js, which provides HLS support across all major browsers.

Post-flight video processing pipeline

After a flight session ends, Whitebox downloads the high-definition video files off the camera device, transcodes them into multiple resolutions, packages them into HLS format, and makes them available for playback over HTTP. As this process can be resource-intensive and time-consuming, it runs in batches so that it can be easily tracked, paused, resumed, or retried in case of failures/crashes.

To achieve the best user experience and performance, videos are first transcoded to the lowest target resolution so that a playable version is available as quickly as possible. Once the lowest resolution is ready, Whitebox processes the remaining resolution targets from highest to lowest, with each resolution transcoded from the previously transcoded file rather than the original, which shortens overall processing time.
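
As an illustrative sketch of this ordering (file names, CRF values, and the exact ladder are assumptions, and the real pipeline writes HLS output as described below), assuming a source above 4K:

# 1. Lowest target first, straight from the source, so something is playable quickly
ffmpeg -i input.mp4 -vf scale=-2:720 -c:v libx264 -crf 20 -preset medium -c:a copy 720p.mp4

# 2. Remaining targets from highest to lowest, each fed by the previously produced file
ffmpeg -i input.mp4 -vf scale=-2:2160 -c:v libx264 -crf 20 -preset medium -c:a copy 4k.mp4
ffmpeg -i 4k.mp4    -vf scale=-2:1080 -c:v libx264 -crf 20 -preset medium -c:a copy 1080p.mp4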

As a single video can be quite large, Whitebox leverages the HLS format's segmented nature to process videos in smaller segments. This allows for transcoding to continue where it left off in case of failures/crashes, rather than trying to reprocess the entire video.

Depending on the video format provided by the camera, additional processing may be required to convert the video into a playback-friendly format. For example, Insta360's insv format requires fisheye to equirectangular conversion.

For a high-level overview, the post-flight video processing pipeline looks like this:

        [Flight session ends]
                  |
   [Download High-Definition Videos]
                  |
[Transcode to the next target resolution] <--+-- [Recovering from failures/crashes]
                  |                          |
                  ▼                          |
 [Recover from previous failure, if any]     +
                  |                          |
                  ▼                          |
      [Process Video in Segments] -----------+
                  |
    [Video Available for Playback]

As this operation can be very resource-intensive, Whitebox limits the resources used by post-flight video processing to ensure that real-time video processing always has enough headroom to operate smoothly. This is done using process niceness and CPU affinity settings.
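
For example (the core numbers and niceness value are illustrative), a post-flight transcode could be pinned to a subset of CPU cores and given a lower scheduling priority:

# Run the post-flight transcode only on cores 4-7 and with reduced priority,
# leaving cores 0-3 free for the real-time pipeline
taskset -c 4-7 nice -n 15 ffmpeg \
  -i input.mp4 \
  -c:v libx264 -crf 20 -preset medium \
  -c:a aac -b:a 128k \
  -f hls \
  -hls_time 2 \
  -hls_segment_type fmp4 \
  -hls_segment_filename "output_%05d.mp4" \
  -hls_playlist_type vod \
  -hls_flags independent_segments \
  output.m3u8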

Making videos available in multiple resolutions for smoother playback

To ensure smooth playback on different devices and network conditions, Whitebox transcodes the high-definition videos into multiple resolutions: the original one, plus 4K, 1080p, and 720p for every target lower than the original resolution.

During playback, client players can switch between resolutions based on network conditions, ensuring a smooth viewing experience without buffering or interruptions. This adaptive switching is handled automatically by the player using the HLS master playlist, but the user can also manually select the desired resolution if needed.

By default, Whitebox uses Video.js for standard video playback, and Pannellum for 360-degree video playback.

Segment processing

To handle large video files efficiently, Whitebox leverages the segmented nature of HLS format. Videos are processed in smaller segments, and written to separate files with a fixed duration. In case of failures or crashes during processing, the video needs to be reprocessed only from the last failed segment, rather than the entire video. This approach improves reliability and reduces processing time.

As an example, the following shows how a processed video file is structured:

flight_recordings/1/device_1/input.mp4

flight_recordings/1/device_1/1080p/init.mp4
                                   output_00001.mp4
                                   output_00002.mp4
                                   output_00003.mp4
                                   ...
                                   output.m3u8

flight_recordings/1/device_1/4K/init.mp4
                                output_00001.mp4
                                output_00002.mp4
                                output_00003.mp4
                                ...
                                output.m3u8

# After each resolution is processed, it is added to the master playlist
flight_recordings/1/device_1/playlist.m3u8

With this approach, until a resolution finishes processing, it will simply not appear in the master playlist, and clients will only see the available resolutions.
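
For illustration, once the 1080p and 4K renditions are ready, playlist.m3u8 could look roughly like this (the bandwidth values are made up):

#EXTM3U
#EXT-X-STREAM-INF:BANDWIDTH=8000000,RESOLUTION=1920x1080
1080p/output.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=30000000,RESOLUTION=3840x2160
4K/output.m3u8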

As an example, you can use the following FFmpeg command to process a video file into segments of 2 seconds each:

ffmpeg \
  -i input.mp4 \
  -c:v libx264 -crf 20 -preset medium \
  -c:a aac -b:a 128k \
  -f hls \
  -hls_time 2 \
  -hls_segment_type fmp4 \
  -hls_segment_filename "output_%05d.mp4" \
  -hls_playlist_type vod \
  -hls_flags independent_segments \
  output.m3u8

In case of a failure or crash during processing, Whitebox can resume processing from the last successfully processed segment, rather than starting from the beginning. This is achieved by checking the existing segments in the output directory and retrying from the last one.

To do this, you would first need to determine the last successfully processed segment. You can do this by listing the existing segment files and finding the highest numbered one. Then, you can use the -ss option in FFmpeg to seek to the appropriate timestamp and continue processing from there. For example, if the last successfully processed segment is output_00005.mp4, you would seek to the timestamp corresponding to the end of that segment (e.g., 10 seconds if each segment is 2 seconds long):

# Seek to 10 seconds to continue from the last processed segment (5 segments x 2 seconds)
# +append_list makes FFmpeg append to the existing playlist
# -start_number 6 continues segment numbering from the latest segment
ffmpeg \
  -ss 10 \
  -i input.mp4 \
  -c:v libx264 -crf 20 -preset medium \
  -c:a aac -b:a 128k \
  -f hls \
  -hls_time 2 \
  -hls_segment_type fmp4 \
  -hls_segment_filename "output_%05d.mp4" \
  -hls_playlist_type vod \
  -hls_flags independent_segments+append_list \
  -start_number 6 \
  output.m3u8
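
A small shell sketch (assuming 1-based segment numbering and the 2-second segment duration used above) for deriving the seek offset and the next segment number:

# Find the highest-numbered existing segment, e.g. output_00005.mp4 -> 5
last=$(ls output_*.mp4 2>/dev/null | grep -o '[0-9]\{5\}' | sort -n | tail -1 | sed 's/^0*//')
last=${last:-0}

# With 2-second segments: seek to last * 2 seconds, continue numbering at last + 1
seek=$((last * 2))
next=$((last + 1))
echo "resume with: -ss $seek ... -start_number $next"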

You can verify that the produced segments are valid by inspecting them with ffprobe.
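
A minimal verification sketch, assuming the fMP4 layout produced above (fMP4 media segments need the init segment to be decodable on their own):

# Probe the playlist to confirm codec, resolution, and framerate of the output
ffprobe -v error -show_entries stream=codec_name,width,height,avg_frame_rate \
  -of default=noprint_wrappers=1 output.m3u8

# To inspect a single fMP4 segment, prepend the init segment first
cat init.mp4 output_00005.mp4 | ffprobe -v error -show_format -show_streams -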

H.264

Whitebox processes all videos into H.264 (AVC) codec, as it has great compatibility with most devices and browsers.

To explain how H.264 compression works in simple terms, video frames are compressed using three types of frames:

  • I-frames (Intra-coded frames): These are keyframes that contain a complete image. They do not rely on any other frames for decoding. I-frames are typically larger in size compared to other frame types
  • P-frames (Predicted frames): These frames are encoded based on the data from previous frames (I-frames or other P-frames). They store only the changes from the reference frames, making them smaller in size (think of a diff)
  • B-frames (Bidirectional predicted frames): These frames use both previous and future frames as references for encoding. B-frames can achieve higher compression ratios than P-frames, but they require more processing power to decode. Additionally, B-frames can introduce latency in streaming scenarios due to their dependency on future frames, so for real-time streaming, we do not use them

While processing segments with the independent_segments HLS flag, the playlist is tagged with #EXT-X-INDEPENDENT-SEGMENTS and segments are cut so that each one begins with an I-frame, ensuring that every segment can be decoded independently and that seeking during playback is smoother.
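
If segment boundaries must land exactly every 2 seconds regardless of the source's keyframe interval, one option (an assumption, not something the pipeline necessarily requires) is to force keyframes at those boundaries:

  # Force a keyframe every 2 seconds so each HLS segment starts on an I-frame
  -force_key_frames "expr:gte(t,n_forced*2)"

This matches the 2-second hls_time used in the examples above.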

Camera-specific video information and processing

Insta360's insv

At the time of writing, for Insta360's 360 cameras, the insv format is an MP4 container with multiple video streams inside (one per lens) and a single audio stream. Videos can come in various resolutions and use either the AVC (H.264) or HEVC (H.265) codec.

Alongside these streams, the file carries additional proprietary metadata, such as GPS data, stabilization info, and gyroscope data.

The two video streams, along with the audio stream, need to be transcoded into a single video file for playback. We achieve this with a fisheye-to-equirectangular conversion using FFmpeg's v360 filter, which plays nicely with Pannellum and other 360-degree video players. We keep the same framerate as the source video and use H.264 as the output codec for compatibility.

Example FFmpeg commands

Whitebox uses HLS format for video processing and playback. This is configured using the following FFmpeg options:

  # Use HLS format
  -f hls \
  # Set segment duration to 2 seconds
  -hls_time 2 \
  # Use fMP4 format for segments
  -hls_segment_type fmp4 \
  # Set segment filename pattern, %05d for zero-padded 5-digit numbering
  -hls_segment_filename "output_%05d.mp4" \
  # Set playlist type to VOD (video on demand)
  -hls_playlist_type vod \
  # Ensure segments are independent (i.e., start with I-frames)
  -hls_flags independent_segments

For convenience and testing, the following commands will simply output to a single MP4 file (indicated by the output.mp4 part).

Transcoding fisheye to equirectangular

As an example, to transcode an insv file into a single equirectangular MP4 file, you can use the following FFmpeg command:

ffmpeg \
  -i input.insv \
  -filter_complex "\
    [0:v:0][0:v:1]hstack=inputs=2[df]; \
    [df]v360=input=dfisheye:output=equirect:ih_fov=195:iv_fov=195:w=3840:h=2160[out]" \
  -map "[out]" -map 0:a:0 \
  -c:v libx264 -crf 20 -preset medium \
  -c:a copy \
  output.mp4

This will:

  • Take the two video streams from the insv file
  • Stack them horizontally using hstack filter
  • Convert the stacked fisheye video into equirectangular format using v360 filter
  • Adjust the output resolution to 3840x2160 (for 4K video)
  • Map the audio stream from the insv file
  • Encode the video using H.264 codec
  • Copy the audio stream without re-encoding

This assumes that the input insv file has a fisheye format with 195-degree field of view for both horizontal and vertical directions. ih_fov and iv_fov parameters need to be adjusted per camera model, video capture resolution, and settings.

Using hardware acceleration

Whitebox runs on the Orange Pi 5+, which has hardware acceleration support for H.264 and H.265. To use hardware acceleration and transcode H.265 to H.264, you can use the following FFmpeg command:

ffmpeg \
  -c:v hevc_rkmpp \
  -i input.insv \
  -filter_complex "\
    [0:v:0][0:v:1]hstack=inputs=2[df]; \
    [df]v360=input=dfisheye:output=equirect:ih_fov=195:iv_fov=195:w=3840:h=2160[out]" \
  -map "[out]" -map 0:a:0 \
  -c:v h264_rkmpp \
  -rc_mode CBR \
  -b:v 25M -maxrate 25M -bufsize 50M \
  -level 5.1 \
  -c:a copy \
  -movflags +faststart \
  output.mp4

However, due to the nature of the v360 filter, hardware acceleration cannot be used for the dual-fisheye to equirectangular conversion step, as the filter requires CPU processing. More details can be found in #448.

Metadata extraction

More details can be found in #68.

Videos may contain metadata, such as GPS data, camera orientation, etc. During transcoding, this metadata should be used or preserved in a way that would allow video players to augment playback with this information.

As video players cannot read the proprietary metadata directly from the video files, Whitebox should extract this data and apply its effects during transcoding into playback-ready formats.
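
For the standard metadata (resolution, framerate, codec, container tags), ffprobe already provides a machine-readable dump; extracting the proprietary Insta360 metadata requires vendor-specific handling and is not covered by this sketch:

# Dump container and stream metadata (codec, resolution, framerate, tags) as JSON
ffprobe -v error -print_format json -show_format -show_streams input.insv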