This is an experimental video codec that scales based on content (i.e. physical objects) and point of gaze (i.e. where the viewer is looking). Scalability, in this context, is the ability to reconstruct meaningful video information from partially decompressed streams, thereby helping video systems meet their client device processing power and network bandwidth requirements. In particular, this video codec provides quality scalability by adaptively varying the color pattern accuracy of regions in each frame. Regions where the viewer is looking or that are classified as foreground objects are reconstructed at higher color pattern accuracy.
- Hierarchical block matching algorithm (HBMA) for motion estimation
- RANSAC (random sample consensus) for global motion estimation
- Hardware acceleration via SSE2 (Streaming SIMD Extensions 2) instructions and instruction-level parallelism (ILP) for mean absolute difference (MAD) calculations
  - MAD is used as the error criterion in HBMA
- Multithreaded I/O
  - Encoder app has a main thread, a read thread, and a write thread
  - Decoder app has a main thread and a read thread
  - A thread-safe circular queue is used between a "producing" thread and a "consuming" thread, such as a read thread and a main thread, or a main thread and a write thread
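The producer/consumer hand-off described in the last bullet can be sketched as a bounded circular queue guarded by a mutex and two condition variables. This is only an illustration under assumptions: the class name CircularQueue and its Push/Pop interface are hypothetical and are not taken from the project's source.

```cpp
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical bounded circular queue shared by one producing thread and one
// consuming thread. T must be default-constructible; capacity must be > 0.
template <typename T>
class CircularQueue {
 public:
  explicit CircularQueue(std::size_t capacity) : buf_(capacity) {}

  // Blocks while the queue is full, then stores the item at the tail slot.
  void Push(T item) {
    std::unique_lock<std::mutex> lock(mutex_);
    not_full_.wait(lock, [this] { return count_ < buf_.size(); });
    buf_[(head_ + count_) % buf_.size()] = std::move(item);
    ++count_;
    lock.unlock();
    not_empty_.notify_one();
  }

  // Blocks while the queue is empty, then removes and returns the head item.
  T Pop() {
    std::unique_lock<std::mutex> lock(mutex_);
    not_empty_.wait(lock, [this] { return count_ > 0; });
    T item = std::move(buf_[head_]);
    head_ = (head_ + 1) % buf_.size();
    --count_;
    lock.unlock();
    not_full_.notify_one();
    return item;
  }

 private:
  std::vector<T> buf_;     // fixed-size storage reused in a ring
  std::size_t head_ = 0;   // index of the oldest item
  std::size_t count_ = 0;  // number of items currently stored
  std::mutex mutex_;
  std::condition_variable not_full_;
  std::condition_variable not_empty_;
};
```

In this sketch a read thread would call Push with each decoded frame while the main thread calls Pop, so neither side busy-waits and memory use stays bounded by the capacity.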
The encoder first performs block-based motion estimation and then segments the video into regions, which are sets of spatially connected blocks with similar motion. Each region is classified either as part of the background or as a distinct moving (i.e. foreground) object. Afterwards, the encoder applies the Discrete Cosine Transform (DCT), and the resulting coefficients and the identifier of each region are written to a stream.
In a typical streaming architecture, each region would be converted into quantized DCT coefficients based on bandwidth and gaze requirements, and the compressed data would be buffered for streaming. In such architectures, continuous bandwidth sensing would control the quantization of the DCT coefficients of each region, with background transform blocks generally being quantized more coarsely than foreground transform blocks. The gaze requirements would determine which transform blocks need to be clearer, with transform blocks in the gaze area being quantized less coarsely than those outside of it. In these cases, both the bandwidth and gaze information would be communicated by the client to the streaming encoder.
For the sake of simplicity, I emulate the streaming feature. The encoder computes and stores all the transform coefficients, but the decoder decides on the degree of quantization to apply based on where the user is gazing and whether a region corresponds to a foreground object or a part of the background. This approach allows me to focus on block-based motion estimation and segmentation without tackling the complexities of streaming. I also emulate the gaze point, which is determined by the position of the mouse cursor.
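As a rough illustration of the emulated gaze control, the test for whether a transform block falls in the gaze region could look like the sketch below. The circular region, the radius parameter, and the names Point and BlockInGazeRegion are assumptions made for this example; the source does not state the shape or size of the gaze region.

```cpp
#include <cmath>

// Hypothetical 2D point in pixel coordinates, e.g. the mouse cursor position.
struct Point {
  double x;
  double y;
};

// Returns true when the center of the square block whose top-left corner is
// (col, row) lies within `radius` pixels of the gaze point.
// Assumption: the gaze region is a circle around the cursor.
bool BlockInGazeRegion(int col, int row, int block_size, Point gaze,
                       double radius) {
  const double center_x = col + block_size / 2.0;
  const double center_y = row + block_size / 2.0;
  return std::hypot(center_x - gaze.x, center_y - gaze.y) <= radius;
}
```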
- Illumination model: constant intensity assumption
  - Valid for spatially and temporally invariant ambient illumination sources and diffuse reflecting surfaces
  - The surface reflectance of an object does not change as the object moves
  - Describes neither moving shadows nor reflections due to glossy surfaces
- Scene model: 2D
  - All objects are flat and lie on the same image plane
  - Objects are limited to moving in a 2D plane
- Motion model: 2D
  - Motion is translational
  - Applies to camera and objects
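Under the constant intensity and translational motion assumptions listed above, estimating the motion of one block reduces to finding the displacement that minimizes a block error such as the MAD. The following is a minimal single-level, exhaustive-search sketch in scalar code. It is not the project's hierarchical or SSE2 implementation, and Frame, MotionVector, Mad, and SearchBlock are hypothetical names.

```cpp
#include <cstdint>
#include <cstdlib>
#include <limits>
#include <vector>

// Hypothetical luminance frame: row-major 8-bit Y samples.
struct Frame {
  int width;
  int height;
  std::vector<std::uint8_t> y;
  std::uint8_t At(int col, int row) const { return y[row * width + col]; }
};

struct MotionVector {
  int dx;
  int dy;
};

// Mean absolute difference between the block at (col, row) in `anchor` and
// the block displaced by (dx, dy) in `target`. Both blocks must be in bounds.
double Mad(const Frame& anchor, const Frame& target, int col, int row,
           int block, int dx, int dy) {
  long sum = 0;
  for (int r = 0; r < block; ++r) {
    for (int c = 0; c < block; ++c) {
      sum += std::abs(int(anchor.At(col + c, row + r)) -
                      int(target.At(col + c + dx, row + r + dy)));
    }
  }
  return double(sum) / (block * block);
}

// Tries every displacement in [-range, range]^2 and keeps the one with the
// smallest MAD. Candidates that leave the target frame are skipped.
MotionVector SearchBlock(const Frame& anchor, const Frame& target, int col,
                         int row, int block, int range) {
  MotionVector best{0, 0};
  double best_err = std::numeric_limits<double>::max();
  for (int dy = -range; dy <= range; ++dy) {
    for (int dx = -range; dx <= range; ++dx) {
      if (col + dx < 0 || row + dy < 0 || col + dx + block > target.width ||
          row + dy + block > target.height) {
        continue;
      }
      const double err = Mad(anchor, target, col, row, block, dx, dy);
      if (err < best_err) {
        best_err = err;
        best = {dx, dy};
      }
    }
  }
  return best;
}
```

A hierarchical scheme would run a search like this on a coarse, downsampled level first and then refine the result with a small search range on each finer level.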
- Convert frame from the RGB color space to the YUV color space
- Extract Y (i.e. luminance) channel
  - All motion estimation is intensity-based and relies solely on the Y channel
- Estimate block-wise motion
  - Achieved by a variation of the hierarchical block matching algorithm (HBMA)
    - For more details, open the file motion.hpp and read the comment block at the top of the file and the comment block for the function EstimateMotionHierarchical
  - If Streaming SIMD Extensions 2 (SSE2) is supported on your platform, then an SSE2-based HBMA implementation (see function EstimateMotionHierarchical16x16Sse2 in file motion.hpp) is used to improve performance
- Estimate global motion using random sample consensus (RANSAC)
  - Global motion is assumed to be the camera motion
  - Inlier group is assumed to be the background motion vectors
  - Outlier group is assumed to be the foreground motion vectors
  - For more details, open the file motion.hpp and read the comment block at the top of the file and the comment block for the function EstimateGlobalMotionRansac
- Create foreground mask from outlier group
- Improve spatial connectivity and remove noise in the foreground mask
  - Achieved by applying the closing and opening morphological operators in the stated order
    - Structuring element is rectangular
- Segment foreground layer into regions
  - Apply k-means on foreground layer
    - Each feature vector consists of motion vector position and components
    - A cluster is not guaranteed to be spatially connected. The next substep addresses this
  - Find the connected components of each cluster
    - Each connected component becomes a region
- Compute the Discrete Cosine Transform (DCT)
  - The frame in the RGB color space is divided into transform blocks, which undergo the DCT
  - Each channel is processed independently
- Write encoded frames to a file
  - Each transform block is assigned a block type
    - The background and each foreground region are each mapped to a unique block type
  - For every transform block, the block type and the DCT coefficients for each channel are written
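The global motion step in the list above can be illustrated with a small RANSAC sketch. Because the motion model is translational, a single motion vector is enough to form a hypothesis. The code below is an assumption-laden illustration, not the contents of motion.hpp: Vec2, GlobalMotion, EstimateGlobalTranslation, and the iteration count, threshold, and seed parameters are all hypothetical.

```cpp
#include <cmath>
#include <cstddef>
#include <random>
#include <utility>
#include <vector>

struct Vec2 {
  double x;
  double y;
};

struct GlobalMotion {
  Vec2 motion;               // estimated camera translation
  std::vector<bool> inlier;  // true: background vector, false: foreground
};

// RANSAC for a purely translational model: sample one motion vector as the
// hypothesis, count the vectors within `threshold` of it, keep the largest
// consensus set, and finally average that set.
GlobalMotion EstimateGlobalTranslation(const std::vector<Vec2>& vectors,
                                       int iterations, double threshold,
                                       unsigned seed) {
  GlobalMotion best{{0.0, 0.0}, std::vector<bool>(vectors.size(), false)};
  if (vectors.empty()) return best;

  std::mt19937 rng(seed);
  std::uniform_int_distribution<std::size_t> pick(0, vectors.size() - 1);
  std::size_t best_count = 0;

  for (int it = 0; it < iterations; ++it) {
    const Vec2 candidate = vectors[pick(rng)];
    std::vector<bool> inlier(vectors.size(), false);
    std::size_t count = 0;
    for (std::size_t i = 0; i < vectors.size(); ++i) {
      const bool close = std::hypot(vectors[i].x - candidate.x,
                                    vectors[i].y - candidate.y) <= threshold;
      inlier[i] = close;
      if (close) ++count;
    }
    if (count > best_count) {
      best_count = count;
      best.inlier = std::move(inlier);
    }
  }
  if (best_count == 0) return best;  // only when iterations <= 0

  // Refine the estimate by averaging the consensus set.
  double sum_x = 0.0;
  double sum_y = 0.0;
  for (std::size_t i = 0; i < vectors.size(); ++i) {
    if (best.inlier[i]) {
      sum_x += vectors[i].x;
      sum_y += vectors[i].y;
    }
  }
  best.motion = {sum_x / best_count, sum_y / best_count};
  return best;
}
```

The vectors flagged as outliers in this sketch correspond to the outlier group from which the foreground mask is created.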
- Emulate bandwidth scalability and gaze control
  - If a block is in the gaze region, then no quantization occurs and the substeps below are skipped
  - The DCT coefficients of a block are quantized by dividing by either foreground-quant-step or background-quant-step (see decoder usage section), depending on whether the block belongs to the foreground or background
  - The resulting quantized values are rounded off, and the reverse process is applied to obtain the dequantized DCT coefficients
- Compute the inverse DCT of dequantized blocks
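The quantization emulation in the decoder steps above amounts to a divide, round, and multiply round trip on each DCT coefficient. The sketch below shows that round trip under assumptions: BlockKind and EmulateQuantization are hypothetical names, and the two step parameters merely mirror the foreground-quant-step and background-quant-step options.

```cpp
#include <cmath>
#include <vector>

// Hypothetical classification of a transform block.
enum class BlockKind { kBackground, kForeground };

// Returns the coefficients the decoder would feed to the inverse DCT.
// Blocks in the gaze region are passed through unquantized; all other blocks
// are quantized with the step for their kind and then dequantized.
std::vector<float> EmulateQuantization(const std::vector<float>& coeffs,
                                       BlockKind kind, bool in_gaze_region,
                                       float foreground_quant_step,
                                       float background_quant_step) {
  if (in_gaze_region) return coeffs;

  const float step = kind == BlockKind::kForeground ? foreground_quant_step
                                                    : background_quant_step;
  std::vector<float> out;
  out.reserve(coeffs.size());
  for (float c : coeffs) {
    // Quantize (divide and round), then dequantize (multiply back).
    out.push_back(std::round(c / step) * step);
  }
  return out;
}
```

A larger step discards more detail, so choosing a larger background step than foreground step keeps foreground regions sharper.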
The videos I used for testing were from the 2014 IEEE Change Detection Workshop (CDW-2014) dataset, which can be found here: http://changedetection.net/.
The names of the result videos follow the format <executable-name>_<video-category>_<video-name>.mov.
For example, the names of the videos used in the description section are encoder-visualizer_dynamic-background_fall.mov and decoder_dynamic-background_fall.mov.
- C++ compiler
  - Must support at least C++17
- CMake >= 3.2
- OpenCV == 3.4.*
I have not built the project on other platforms besides the one used for my development environment:
- Hardware: MacBook Pro (Retina, 13-inch, Early 2015)
- Operating System: macOS Mojave (version 10.14.16)
- C++ compiler: Clang == 10.0.01
- CMake == 3.25.1
- OpenCV == 3.4.16
However, there is a good chance that the project can be built on your platform without much hassle because I use CMake to generate the build system. CMake is best known for taking a generic project description (see file CMakeLists.txt) and generating a platform-specific build system.
To install OpenCV, follow the OpenCV installation instructions for your platform. Once installed, CMake can use the command find_package to find the OpenCV package and load its package-specific details.
The following build steps assume a POSIX-compliant shell.
The build system must be generated before the project can be built. From the project directory, generate the build system.
mkdir build
cd build
cmake -DCMAKE_BUILD_TYPE:STRING=Release ..
The -G option is omitted in cmake -DCMAKE_BUILD_TYPE:STRING=Release .., so CMake will choose a default build system generator type based on your platform. To learn more about generators, see the CMake docs.
If SSE2 is supported on your platform, then an SSE2-based HBMA implementation is used by default to improve performance. To disable the SSE2-based HBMA implementation and use the fallback implementation instead, execute cmake -D SVC_MOTION_SSE2:BOOL=OFF .., which additionally sets the cache variable SVC_MOTION_SSE2 to OFF.
After having generated the build system, go back to the project directory
cd ..
and build all the targets of the project.
cmake --build build --config Release
The built targets are placed in the directory build.
The executable targets are
- encoder-visualizer
- encoder
- decoder
To build only a specific target, replace mytarget with the name of the target and execute:
cmake --build build --config Release --target mytarget
Each option provided at the command line must have its name prefixed with "--" and must be followed by its associated argument. Options must also come before positional parameters.
For example, the option kmeans-cluster-count, its associated argument of 12, and the video file path foreman.mp4, a positional parameter, would be passed to encoder like so:
./build/apps/encoder --kmeans-cluster-count 12 foreman.mp4
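The option rules above can be expressed as a short parsing sketch. This only illustrates the stated rules and is not the project's ParseConfig function: ParsedArgs and ParseArgs are hypothetical names, and treating any later "--" token as an error is an assumption.

```cpp
#include <cstddef>
#include <map>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical result of parsing: option names (without "--") mapped to
// their arguments, followed by the positional parameters in order.
struct ParsedArgs {
  std::map<std::string, std::string> options;
  std::vector<std::string> positionals;
};

ParsedArgs ParseArgs(const std::vector<std::string>& args) {
  ParsedArgs parsed;
  std::size_t i = 0;

  // Leading "--name value" pairs are options.
  while (i < args.size() && args[i].rfind("--", 0) == 0) {
    if (i + 1 >= args.size()) {
      throw std::invalid_argument("option " + args[i] +
                                  " is missing its argument");
    }
    parsed.options[args[i].substr(2)] = args[i + 1];
    i += 2;
  }

  // Everything after the options is positional.
  for (; i < args.size(); ++i) {
    if (args[i].rfind("--", 0) == 0) {
      throw std::invalid_argument(
          "options must come before positional parameters");
    }
    parsed.positionals.push_back(args[i]);
  }
  return parsed;
}
```

With the example command above, the sketch would yield the option kmeans-cluster-count with the argument 12 and a single positional parameter, foreman.mp4.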
encoder writes the encoded video to the standard output stream stdout.
To run encoder with the default configuration and write the encoded video to a file, execute the following command, which redirects stdout to encoded_video_file_path:
./build/apps/encoder video_file_path > encoded_video_file_path
If you do not want to create an encoded video file but you still want to run the decoder on the encoder output, run encoder and decoder concurrently and connect the stdout of encoder to the stdin of decoder. Achieve this by executing the following command:
./build/apps/encoder video_file_path | ./build/apps/decoder
To run encoder-visualizer with the default configuration, execute the following command:
./build/apps/encoder-visualizer video_file_path
encoder and encoder-visualizer have the same options.
To see the name and type of each option, search for #options in apps/encoder.cpp. You'll see an array called opts in the function ParseConfig. Each element of the array corresponds to an option and contains the name and type of the option.
To see the default values of the options, search for #default-cfg in apps/encoder.cpp. You'll see some functions containing the default values.
If the SSE2-based HBMA implementation is being used, then the motion vector block dimensions and pyramid level count cannot be set and the default values are used.
The decoder reads encoded video from the standard input stream stdin.
To run decoder with the default configuration and read encoded video from a file, execute the following command, which redirects stdin to read from encoded_video_file_path:
./build/apps/decoder < encoded_video_file_path
If you want to run decoder on the output of encoder without creating an encoded video file, run encoder and decoder concurrently and connect the stdout of encoder to the stdin of decoder. Achieve this by executing the following command:
./build/apps/encoder video_file_path | ./build/apps/decoder
To see the name and type of each option, search for #options in apps/decoder.cpp. You'll see an array called opts in the function ParseConfig. Each element of the array corresponds to an option and contains the name and type of the option.
To see the default values of the options, search for #default-cfg in apps/decoder.cpp. You'll see some functions containing the default values.
- Address oversegmentation by merging regions
- Implement entropy coding
- Derive prediction error images and compress them using the JPEG pipeline
- Compress motion vectors using entropy coding
- Adaptive tuning of certain parameters
- Implement network streaming
- Eliminate dependence on OpenCV
- Multimedia Systems: Algorithms, Standards, and Industry Practices
  - By Parag Havaldar and Gerard Medioni
- Video Processing and Communications
  - By Yao Wang, Jörn Ostermann, and Ya-Qin Zhang