Introduction

For computer vision algorithms that are deployed in the real world such as a robot that is moving through the environment, high processing speeds are essential to provide safe and efficient operation. As sequential processing has reached its limits the world has moved to multi core systems, with GPUs that run 1000s of operations in parallel on the extreme side.

While in computer vision we often have natually parallizable problems, such as the same operation applied to all pixels, implementing the same for different computing architectures poses a challenge. Often this requires the knowledge of specific frameworks and programming languages (e.g. CUDA for NVIDIA GPUs) or hardware specific instructions for other embedded platforms. The code that is implemented for a certain device, can not be run on another device and thus multiple versions for the same algorithm/platform are necessary.

This lack of portability has led to several mechanisms to write device agnostic code that are now getting more and more merged into the C++ standard, allowing us to write standard C++ code that is parallelizable by default. It can be compiled for different accelerators such as CPUs and GPUs without having to adapt a single line of code.

NVIDIA offers its HPC SDK which targets this use case for NVIDIA GPUs in high performance computing, allowing to scale applicatons from single CPUs to multiple GPUs, to multiple nodes with multiple GPUs. While this is is not the primary use case for robotics, we can use the same toolchain to work on much smaller devices such as the NVIDIA Jetson family, which is very suitable for robotics.

In this post we take a look at a typical example of a computer vision algorithm and thow to write it in Standard C++ prepared for execution on CPUs and GPUs. We also compare the execution times on the different accelerators at various workloads. Our target device which we also use for measuring execution times is a NVIDIA Jetson Xavier AGX with 4 CPU cores enabled and 512 GPU cores.

Table of contents

  1. The Algorithm
  2. The “Classic” Loop
  3. STL Algorithms
  4. The Beauty: A Standardized Abstraction to Parallelism
  5. The Catch: Limitations
  6. Conclusion
  7. References
  8. Appendix

All source code can be found at this repo

The Algorithm

We will take a look at a typical operation in computer vision, the computation of the photometric error between two images. We will assume we have two images and depth maps taken at two different timestamps from a calibrated depth camera following the pinhole model. The two images are related by the relative movement of the robot, the pose. This error can serve for many applications such as estimating the movement of the robot itself or tracking other objects in the scene.

Expressed mathematically it looks like this:

$$ err_x(x) = I_0(x) - I_1(W(x)) \tag{1}$$

$$ err = \sum_{x \in X} err_x(x) \tag{2}$$

Where \(X\) is a set of two dimensional vectors containing all valid pixels positions in the image \(I_0\) and \(W(x)\) is the reprojection function that maps \(x\) to its corresponding position in the second image \(I_1\). The detailed formulation of \(W(x)\) is given in the appendix but not really relevant for now.

The “Classic” Loop

In the “old days” we would implement the equations in a nested for loop that iterates over the pixel positions and accumulates the computed pixel error.

template<typename Reproject>
double compute(Reproject reproject, const cv::Mat& I0, const cv::Mat& I1) {
 float r = 0.;
  for (int v0 = 0; v0 < I0.rows; v0++) {
    for (int u0 = 0; u0 < I0.cols; u0++) {
      const Vec2f uv1 = reproject({u0, v0});
      const bool withinImage = 0 < uv1(0) && uv1(0) < I0.cols && 0 < uv1(1) && uv1(1) < I0.rows;

      r += withinImage ? (float)(I1.at<uint8_t>((int)uv1(1), (int)uv1(0))) - (float)(I0.at<uint8_t>(v0, u0)) : 0.;
    }
  }
  return r/255.0f;
}

This works of course, but executes all the cols*rows operations one after another. Let’s execute it on some sample data to measure the runtime.

Running for [10] iterations on scale: [640,480] with method: [classic]
Runtime = 0.00910059 +- 0.000219555

This gives us a baseline runtime for the photometric error computation.

STL Algorithms

Since C++17 we have the header in the standard that provides templates for common data operations inspired by the functional programming languages. The algorithms provide a way of designing our code a long common design patterns making it a more understandable. Once you got a hang of the horrible syntax of course. The understandability however, is not the predominant reason why we should make use of the algorithms. But first let’s see how our loop would look like if we replace it with an STL algorithm.

In our case we can separate the loop in two operations. First, we transform each pixel location to an error using Eq.1. Then we reduce the set errors to a single error by summing over them in Eq.2. This is referred to a transform-reduce or map-reduce operation and C++17 has a corresponding algorithm available in the header.

Implementing our loop with the transform-reduce algorithm results in the following:

#include <algorithm>
#include <numeric>
#include <ranges>
namespace stdv=std::views;
...
template<typename Reproject>
double compute(Reproject reproject, const cv::Mat& I0, const cv::Mat& I1) {
  auto idxx = stdv::iota(0,I0.rows*I0.cols);
  return std::transform_reduce(
            idxx.begin(),
            idxx.end(),
            0.,
            std::plus<float>{},
            [I0 = I0.data, I1 = I1.data, w = I0.cols, h = I0.rows, reproject](int idx) {
              const int u0 = idx % w;
              const int v0 = (idx - u0) / w;
              const Vec2f uv1 = reproject({u0, v0});
              const bool withinImage = 0 < uv1(0) && uv1(0) < w && 0 < uv1(1) && uv1(1) < h;
              return withinImage ? (float)(I1[(int)uv1(1) * w + (int)uv1(0)]) - (float)(I0[v0 * w + u0]) : 0.f;
            }) /
          255.f;
}

We first create the indices we iterate over. For this we use a C++20 feature, the iota-view as it supports “lazy” evaluation, meaning the index is only computed within the functor. Unfortunately, we have to choose 1D indices and compute the respective 2D indices within the functor, which makes to code less readable. However, with the upcoming C++23 and the cartesian product view this is about to change.

Anyway, I think the code is still understandable, we have set of iterators, an initial value of the error (0), the reduce operation (a sum) and the transform operation (reproject and subtract). Let’s see how it performs:

Running for [10] iterations on scale: [640,480] with method: [sequential]
Runtime = 0.00956295 +- 8.52892e-05

As expected, the runtimes are comparable to the simple classic loop.

Execution Policies

The benefit of STL algorithms get’s appearant when we introduce execution policies. These are available for all STL algorithms (where they make sense) and allow the programmer to tell the compiler in which order the code can be executed. In total there are 4 policies:

  • seq : Code has to run sequentually
  • unseq: Code does not have to run sequentually. This enables vectorization, the execution of multiple iterations in parallel on a single processor.
  • par: Code can run in parallel. This enables execution on multiple cores in parallel.
  • par_unseq: Enables parallel execution and vectorization.

We can specify them as argument when we call the STL algorithm:

#include <algorithm>
#include <numeric>
#include <ranges>
namespace stdv=std::views;
#include <execution>
namespace exec = std::execution;
...
template<typename Reproject>
double compute(Reproject reproject, const cv::Mat& I0, const cv::Mat& I1) {
  auto idxx = stdv::iota(0,I0.rows*I0.cols);
  return std::transform_reduce(
            exec::par_unseq, //<---------------------
            idxx.begin(),
            idxx.end(),
            0.,
            std::plus<float>{},
            [I0 = I0.data, I1 = I1.data, w = I0.cols, h = I0.rows, reproject](int idx) {
              const int u0 = idx % w;
              const int v0 = (idx - (idx % w)) / w;
              const Vec2f uv1 = reproject({u0, v0});
              const bool withinImage = 0 < uv1(0) && uv1(0) < w && 0 < uv1(1) && uv1(1) < h;
              return withinImage ? (float)(I1[(int)uv1(1) * w + (int)uv1(0)]) - (float)(I0[v0 * w + u0]) : 0.f;
            }) /
          255.f;
}

We have marginally modified the code but now the compiler can assume that each loop iteration can run in parallel which can be exploited when running the program. Let’s see about the runtime.

Running for [10] iterations on scale: [640,480] with method: [parallel]
Runtime = 0.00513411 +- 0.000899916

Wow, we just gained 5ms / 50% with almost no code changes!

You might wonder now who and were decides how the loop will be executed. This actually depends on the library that is used for parallelization and the system load at runtime. We just “informed” the program that it can execute the loops in parallel the is no guarantee that it will do that. One one hand this does not give us a lot of control about the program execution, on the other hand this abstracts away quite some unecessary complexity of the algorithm.

The Beauty: A Standardized Abstraction to Parallelism

The real beauty of a standardized approach to parallelism is that the STL algorithms and their execution policies can be implemented by different compiler / parallization library providers. With the High-Performance-Computing (HPC) compiler SDK, NVIDIA has released a toolchain that can compile the code for NVIDIA platforms.

You will have already guessed it, this allows us to compile the exact same code from above for a GPU to exploit massive parallelism, without having to change a single line of code! The predominant change we have to compile the programm with their compiler and provide some additional flags:

nvc++ src/photometric_error.cpp -o photometric_error -stdpar=gpu

If you are intersted in how to setup the compilation toolchain within a docker and using cmake checkout the appendix. Let’s see about the runtime of the GPU version:

Running for [10] iterations on scale: [640,480] with method: [parallel]
Runtime = 0.00252343 +- 0.00118023

We have gained another 2.5 ms and reduction of 50% compared to the CPU version without changing any code!

Ok honestly, I was expecting it to give greater benefits compared to the CPU. In the end we have 512 cores compare to only 4 cores on the CPU. However, our example application is still quite small, let’s see what happens when we drastically increase the problem size. For now let’s simulate an image which is 10 times larger:

Running for [10] iterations on scale: [6400,4800] with method: [classic]
Runtime = 1.42811 +- 0.0155183

Running for [10] iterations on scale: [6400,4800] with method: [sequential]
Runtime = 0.934116 +- 0.00315484

Running for [10] iterations on scale: [6400,4800] with method: [CPU parallel]
Runtime = 0.312329 +- 0.00731599

Running for [10] iterations on scale: [6400,4800] with method: [GPU parallel]
Runtime = 0.053725 +- 0.00623579

Now, the benefit of the GPU get’s much more visible, compared to the parallel CPU version it is 6x faster, compared to the classic version it’s almost 30x faster! All that with minor code modifications and without implementing any device specific code!

The Catch

Now let’s come to the catch. Of course there is never a free lunch and even in this small application we can see several limitations of the desribed approach:

Lack of Control

The current interface only allows us to tell how the code can be executed. There is no guarantee that it will be executed in parallel and if so by how many threads etc. We have to rely on the automatic optimization by the compiler / underlying parallization library. This abstracts away some complexity but also prevents more fine grained optimization of our code. E.g. we also can’t specify that some loops should run in parallel on the CPU, while others should run on the GPU.

In future the parallel programming model within the C++ Standard will be further extended. For example you can checkout this talk for the available proposals and proof-of-concept implementations, which will allow much more fined grained definition for task executions. However, it will take at least until 2026 until these mechanisms are standardized and available in the common toolchains.

Memory Management

GPU and CPU memory typically have different address spaces. Even on devices like the NVIDIA Jetson which share a physical memory, there is a logical distinction between the memories. The nvc++ compiler takes care that the memory is allocated and migrated appropriately when compiling our code with the stdpar=gpu flag, using NVIDIAs Unified Memory. However, the memory can only be managed by Unified Memory if the data is allocated on the heap and the allocation takes place in the code that is compiled by the nvc++ compiler.

For example in a normal CPU program we could create lambdas above like the following:

[&I0, &I1, w = I0.cols, h = I0.rows, reproject](int idx) {...}

We would simply pass a reference to the object and we could access its functions like we are used to. However, when we compile the same for the GPU we end up with a compiler error, because the reference points to memory on the CPU stack:

NVC++-S-0000-Internal compiler error. BAD sptr in var_refsym 

Instead, we can pass a pointer to the underlying memory using the .data method, which points to the image memory on the heap:

[I0 = I0.data, I1 = I1.data, w = I0.cols, h = I0.rows, reproject](int idx) { ... }

This will allow the compiler to trigger the automatic management of the accessed memory. The changes are not big but in a way they limit our set of possible instructions when we work on the GPU and force us to work again with raw pointers and the like.

If you now wonder why we can’t handle the memory management ourselves, well this brings us back to the point of Lack of Control. Until now there is no standardized mechanism to do so and we would have to fall back to other more low level frameworks like OpenACC or CUDA to do so.

Third Party Libraries

The memory management problem also leads to issues when using third party libraries like OpenCV. When we compile our program for the GPU, the compiler will make sure all memory in our code is allocated in Unified Memory and can be migrated to the GPU. However, it cannot do so in the binaries that we link from external libraries. Hence, we have to make sure that any data is allocated in unified memory before we are trying to access it. For example when we load the images for our benchmark it looks like the following:

  //By default cv::Mat allocates not in unified memory so we can not access it on the GPU
  cv::Mat I0_ = cv::imread(RESOURCE_DIR "/rgb0.png", cv::IMREAD_GRAYSCALE);
  ...
  // We have to create a vector ourselves, which we know will be allocated in unified memory
  std::vector<uint8_t> i0v(I0_.rows * I0_.cols * args.sx * args.sy);
  ...
  // We create the cv mat type so we have all the opencv functionality available
  cv::Mat I0{I0_.rows * args.sy, I0_.cols * args.sx, CV_8U, i0v.data()};
  ...
  // exploiting the fact the resize also copies and copies to our unified memory
  cv::resize(I0_, I0, cv::Size{0, 0}, args.sx, args.sy);
  ...

This clutters the code and limits the use of some third party libraries. In the case of OpenCV there are other mechanisms to make sure the memory is allocated in unified memory like overwriting the custom allocator for the cv::Mat class, but these things also need to be considered during implementation.

Another issue occured when using Eigen. Eigen is heavily optimized for processing on the CPU using vectorized instructions and the like. The problem: These instructions can not be executed on the GPU, meaning lambdas, that execute a matrix multiplication throw an “unsupported operation” error when compiled for the GPU. Instead, we have to “manually” apply the multiplication. Again, things like this can be considered but they are hard to debug and they also take a way some of the nice Eigen syntax. If you are interested in more details simply check out the appendix.

Luckily, with C++26 it is planned to bring more linear algebra to the C++ standard with for and foremost the mdspan type for handling multidimensional data. This might reduce of the need of some thirdparty libraries greatly.

Conclusion

Let’s recap the core elements of this post:

  • By applying STL algorithms instead of classic loops we can integrate a portable approach to parallelism into our code
  • The approach has some limitations which need to be considered during implementation and require early on device testing
  • The limitations should be adressed in upcoming releases of the C++ Standard

If we consider and handle the respective limitations we can write CPU and GPU optimized algorithms without having to dig into device specific instructions and the like!

References

Appendix

The Reprojection Functor

The reprojection function \(W(x)\) consists of three steps:

  1. Reconstructing the pixel \(x = \begin{bmatrix}u \ v \end{bmatrix}\) at depth \(Z_0(x)\) using the camera intrinsics (inverse camera projection):

$$ X^0 = Z_0(x) \begin{bmatrix} (u - c_x)/f_x & (v- c_y)/f_y & 1 \end{bmatrix}^T $$

  1. Transforming the 3D point by the relative pose, consisting of a rotation matrix \(R\) and a translation vector \(t\):

$$ X^1 = R X^0 + t$$

  1. Projecting the 3D point to \(I_1\) using the camera parameters:

$$ x^1= \begin{bmatrix} f_xX^1_0 + c_x & f_yX^1_1 + c_y & X^1_2 \end{bmatrix}^T $$

$$ \begin{bmatrix}u^1 \ v^1 \end{bmatrix} = \begin{bmatrix}\frac{x^1_1}{x^1_3} \ \frac{x^1_2}{x^1_3} \end{bmatrix} $$

We can implement it in the following functor:

//All parameters are from the TUM-RGBD dataset / camera
auto reproject = [fx = 525.0f * args.sx,
                    fy = 525.0f * args.sy,
                    cx = 319.5f * args.sx,
                    cy = 239.5f * args.sy,
                    w = (int)(640 * args.sx),
                    h = (int)(480 * args.sy),
                    Z0 = (float*)Z0.data,
                    pose = Mat4f::Identity()](const Vec2f& uv0) -> Vec2f {
    const float z = Z0[(int)uv0(1) * w + (int)uv0(0)];
    if (!std::isfinite(z) || z <= 0) {
      return {-1, -1};
    }
    const Vec3f p0{(uv0(0) - cx) / fx * z, (uv0(1) - cy) / fy * z, z};

    /*Need to apply the transformation "manually" as otherwise cuda version does not compile due to "unsupported operation"*/
    const Vec3f p0t = {
      pose(0, 0) * p0(0) + pose(0, 1) * p0(1) + pose(0, 2) * p0(2) + pose(0, 3),
      pose(1, 0) * p0(0) + pose(1, 1) * p0(1) + pose(1, 2) * p0(2) + pose(1, 3),
      pose(2, 0) * p0(0) + pose(2, 1) * p0(1) + pose(2, 2) * p0(2) + pose(2, 3)};

    if (p0t(2) <= 0) {
      return {-1, -1};
    }
    return {(fx * p0t(0) / p0t(2)) + cx, (fy * p0t(1) / p0t(2)) + cy};
  };

As this is only a benchmark application, we just use the identity pose.

Toolchain Setup

We can make use of NVIDIAs docker container containing the compilation toolchain. All source code can be found at this repo

FROM nvcr.io/nvidia/nvhpc:23.11-devel-cuda_multi-ubuntu22.04 as dev

RUN apt update && apt install -y --no-install-recommends libopencv-dev

ADD install_eigen.sh /opt/
RUN cd /opt/ && . /etc/profile.d/lmod.sh && module load nvhpc-hpcx/23.11 && /opt/install_eigen.sh

FROM dev as runtime
ADD src /workspace/src
ADD CMakeLists.txt /workspace/

WORKDIR /workspace
RUN . /etc/profile.d/lmod.sh && module load nvhpc-hpcx/23.11 && mkdir -p /workspace/build && cd /workspace/build && cmake .. && make
ADD resource /workspace/resource

For Eigen a small modification was required that it can be compiled with nvc++. Which is why we have to use a custom install script install_eigen.sh:

#!/bin/bash
git clone --branch 3.4 https://gitlab.com/libeigen/eigen.git
cd eigen
# https://forums.developer.nvidia.com/t/possible-bug-in-nvc-23-05/260553
sed -i 's/#if EIGEN_COMP_CLANG || EIGEN_COMP_CASTXML/#if EIGEN_COMP_CLANG || EIGEN_COMP_CASTXML || __NVCOMPILER_LLVM__/' ./Eigen/src/Core/arch/NEON/Complex.h
mkdir build
cd build
cmake ..
make -j2
make install
cd ..
rm build -r

With the build environment set up we can compile our code with the following CMakeListst.txt:

cmake_minimum_required(VERSION 3.20.0)
project(parallel_cpp_example LANGUAGES CXX)
############################
# Compile for GPU
############################
# make sure to select nvcc as compiler with 
# source /etc/profiles.d/module.sh
# module load ..

find_package(OpenCV REQUIRED)
find_package(Eigen3 3.4 REQUIRED)

add_executable(photometric_error src/photometric_error.cpp)

target_include_directories(photometric_error PUBLIC ${OpenCV_INCLUDE_DIRS})
target_link_libraries(photometric_error ${OpenCV_LIBRARIES})

target_link_libraries(photometric_error Eigen3::Eigen)

target_compile_definitions(photometric_error PRIVATE RESOURCE_DIR="${CMAKE_CURRENT_LIST_DIR}/resource")
target_compile_features(photometric_error PUBLIC cxx_std_20)

#######################################################
#   Link various parallelization backends if available
#######################################################

find_package(NVHPC)
if(NVHPC_FOUND)# If compiled with NVHPC generate an additional executable for GPU

    target_compile_options(photometric_error PRIVATE -stdpar=multicore)
    target_link_options(photometric_error PRIVATE -stdpar=multicore)

    # Additional compile options to get info about the optimizations the compiler does
    target_compile_options(photometric_error PRIVATE -Minfo -fast -march=native -Mllvm-fast)
    target_link_options(photometric_error PRIVATE -Minfo -fast -march=native -Mllvm-fast)

    add_executable(photometric_error_gpu src/photometric_error.cpp)
    target_include_directories(photometric_error_gpu PUBLIC ${OpenCV_INCLUDE_DIRS})
    target_link_libraries(photometric_error_gpu ${OpenCV_LIBRARIES})
    target_link_libraries(photometric_error_gpu Eigen3::Eigen)

    target_compile_definitions(photometric_error_gpu PRIVATE RESOURCE_DIR="${CMAKE_CURRENT_LIST_DIR}/resource")
    target_compile_features(photometric_error_gpu PUBLIC cxx_std_20)

    target_compile_options(photometric_error_gpu PRIVATE -stdpar=gpu)
    target_compile_options(photometric_error_gpu PRIVATE -Minfo -fast -march=native -Mllvm-fast)
    target_link_options(photometric_error_gpu PRIVATE -Minfo -fast -march=native -Mllvm-fast)
    target_link_options(photometric_error_gpu PRIVATE -stdpar=gpu)

else() # If compiled without NVHPC e.g. by GCC use the TBB library for parallization
    message(WARNING "NVHPC SDK not found")
    find_package(TBB)
    if(TBB_FOUND)
        target_link_libraries(photometric_error TBB::tbb)
    endif()
endif()

The Benchmark

The main consists of a simple for loop that runs the function several times to measure average and standard deviation:


int main(int argc, char* argv[]) {

  const Args args = Args::parse(argc, argv);

  cv::Mat I0_ = cv::imread(RESOURCE_DIR "/rgb0.png", cv::IMREAD_GRAYSCALE);
  cv::Mat Z0_ = convertDepthMat(cv::imread(RESOURCE_DIR "/depth0.png", cv::IMREAD_ANYDEPTH), 1.0 / 5000.0);
  cv::Mat I1_ = cv::imread(RESOURCE_DIR "/rgb1.png", cv::IMREAD_GRAYSCALE);

  // We have to allocate memory ourselves as otherwise its not "unified memory"
  std::vector<uint8_t> i0v(I0_.rows * I0_.cols * args.sx * args.sy);
  std::vector<uint8_t> i1v(I0_.rows * I0_.cols * args.sx * args.sy);
  std::vector<float> z0v(I0_.rows * I0_.cols * args.sx * args.sy);
  cv::Mat I0{I0_.rows * args.sy, I0_.cols * args.sx, CV_8U, i0v.data()};
  cv::Mat Z0{I0_.rows * args.sy, I0_.cols * args.sx, CV_32F, z0v.data()};
  cv::Mat I1{I0_.rows * args.sy, I0_.cols * args.sx, CV_8U, i1v.data()};

  cv::resize(I0_, I0, cv::Size{0, 0}, args.sx, args.sy);
  cv::resize(Z0_, Z0, cv::Size{0, 0}, args.sx, args.sy, cv::INTER_NEAREST);  // No bilinear interpolation for depth
  cv::resize(I1_, I1, cv::Size{0, 0}, args.sx, args.sy);

  auto computePhotometricError = photometric_error::construct(args, Z0);

  using timer = std::chrono::high_resolution_clock;

  std::cout << "Running for [" << args.N << "] iterations on scale: [" << I0.cols << "," << I0.rows << "] with method: [" << args.method
            << "]" << std::endl;

  std::vector<double> dt(args.N);
  double err = 0.;
  for (int i = 0; i < args.N; i++) {
    auto t0 = timer::now();

    err = computePhotometricError(I0, I1);

    auto t1 = timer::now();
    dt[i] = (t1 - t0).count() / 1e9;
  }
  std::cout << "Runtime = " << mean(dt) << " +- " << stddev(dt) << " Error: " << err << std::endl;
  return 0;
}

The data are few images from the TUM-RGBD Dataset