Common Vision Blox 14.0
Modular Embedded - Applications (CVB & CUDA)




CUDA Background

Acquiring over CVB into CUDA memory

Use CUDA for CVB Shapefinder

Application Compatibility


The modular embedded technology combines two powerful tools: highly efficient image acquisition with Common Vision Blox (CVB) and parallelized general-purpose GPU computation on NVIDIA's CUDA platform. More general information on the 3rd generation of the CVB acquisition stack and on NVIDIA CUDA is available in the respective CVB and NVIDIA documentation.


Although this is a specific guide for the modular embedded system (aarch64), there are ways to build the sample on other platforms.

  1. Install the NVIDIA CUDA toolkit (either via sdkmanager or this guide). Afterwards, verify that nvcc is available on the command line. If it is not, add the directory containing nvcc to the PATH.
  2. Verify CVB version >= 14.00.003 is installed
  3. Get sources of the CppCudaMemory example ($CVB/tutorial/ImageManager/Cvb++/CppCudaMemory)
  4. Adjust the power settings (NVP model) to your needs and also consider jetson_clocks (documentation).

See this page for Windows installation or this documentation for other Linux platforms.

CUDA Background

Kernels for NVIDIA GPUs are developed with the CUDA programming model. Besides C++ kernels (dedicated, highly parallel functions that run on the GPU), it provides runtime API functions that control the GPU's behavior. This heterogeneous programming approach uses both the CPU and the GPU: the CPU and its memory are referred to as the host, the GPU and its memory as the device. The host manages memory on both the host and the device and launches kernels, which are functions executed on the device by many GPU threads simultaneously.

Memory on the GPU is allocated with cudaMalloc, for example. To give a kernel access to the relevant data, there are three options:

  • Explicit copy: the data is copied from CPU to GPU memory (cudaMemcpy, cudaMemcpyAsync).
  • Mapped memory: the data resides in device-accessible host memory (cudaHostAlloc with the flag cudaHostAllocMapped). This is effectively "zero-copy" memory, as the GPU works directly on the data through the corresponding device pointer (cudaHostGetDevicePointer).
  • Managed memory: the data is allocated with cudaMallocManaged. The same pointer is used on the CPU and the GPU, and the CUDA driver ensures that the memory is automatically available on the device where it is used.

Note: On Windows and on embedded devices, managed memory is subject to restrictions; cudaStreamAttachMemAsync has to be used there.

Mapped memory and managed memory both allow acquired data to be streamed to the GPU more efficiently. In the tutorial, the constant USE_MANAGED_MEMORY selects whether the mapped or the managed variant is used.

Acquisition using CVB and CUDA memory

The CVB acquisition stack can stream acquired data from cameras directly into GPU-accessible memory. The CppCudaMemory tutorial demonstrates this compatibility with the CUDA programming model as an example. It uses the CVB++ API wrappers and CUDA.

The main customization point in the CVB acquisition stack for streaming into GPU-accessible memory is the flow set. Its general usage is already described in the CppUserAllocatedMemory example and the corresponding documentation. This section describes the direct interaction with CUDA.


A developer-defined UserFlowSetPool derived from Cvb::Driver::FlowSetPool overrides the memory management behavior. With it, the user is free to choose between managed and mapped memory allocation. First, the stream requirements are queried from CVB++:

// get the flow set information that is needed for the current stream
auto flowSetInfo = stream->FlowSetInfo();
// create a subclass of FlowSetPool to store the created buffer
auto flowSetPoolPtr = Tutorial::UserFlowSetPool::Create(flowSetInfo);

Flow Set Pool Creation and Registration

When creating the custom flow set pool, the memory required for the flow sets is allocated. GPU-accessible memory has to be used here. std::generate_n inserts NUM_BUFFERS elements into the user-defined flowSetPoolPtr. In this example, an independent memory region is allocated for every flow inside a flow set.

  • Managed: cudaMallocManaged is used to allocate the buffer with the requested size. cudaMemAttachHost is given as flag to specify that the allocated memory is initially accessible by the host. This is required if cudaDevAttrConcurrentManagedAccess is not available, as the CVB driver will write to the buffer from the CPU.
  • Mapped: cudaHostAlloc with the flag cudaHostAllocMapped allocates memory on the host that is accessible by the GPU. When additionally setting cudaHostAllocWriteCombined, reading that memory from the GPU is more efficient but caching of the data is disabled on the host, which in turn leads to slow reads on CPU. Thus, the acquired buffers should mainly be consumed on the GPU.
std::generate_n(std::back_inserter(*flowSetPoolPtr), NUM_BUFFERS, [&flowSetInfo]() {
  auto flows = std::vector<void *>(flowSetInfo.size());
  std::transform(flowSetInfo.begin(), flowSetInfo.end(), flows.begin(),
                 [](const auto &info) {
                   void *ptr = nullptr;
                   if (USE_MANAGED_MEMORY) {
                     // allocate managed memory, but attach to host
                     CUDA_CHECK(cudaMallocManaged(&ptr, info.Size, cudaMemAttachHost));
                   } else {
                     // allocate host memory, which is readable from the GPU
                     CUDA_CHECK(cudaHostAlloc(&ptr, info.Size,
                                              cudaHostAllocWriteCombined | cudaHostAllocMapped));
                   }
                   return ptr;
                 });
  return flows;
});
// finally register the user flow set pool
stream->RegisterExternalFlowSetPool(flowSetPoolPtr);


After the flow set pool has been registered, the acquisition engine uses it internally. Therefore, when a frame is acquired with stream->WaitFor(TIMEOUT), the memory behind the composite is the CUDA-allocated memory. The tutorial assumes that the first element of the composite is an image with linear memory layout. The base pointer of the linearAccess corresponds to the originally allocated pointer. How that pointer is used again depends on the way the memory was allocated:

  • Managed: After attaching this pointer to a CUDA stream (cudaStreamAttachMemAsync), it can be used on the GPU inside a CUDA kernel. When done using the memory on the GPU, the attachment should be changed to the CPU again.
std::uint8_t *dInput = reinterpret_cast<std::uint8_t *>(linearAccess.BasePtr());
// attach memory to stream -> make writable from GPU
CUDA_CHECK(cudaStreamAttachMemAsync(gpuStream, dTarget));
CUDA_CHECK(cudaStreamAttachMemAsync(gpuStream, dInput));
// enqueue CUDA kernel ...
// detach from stream -> make writable from acquisition engine again
CUDA_CHECK(cudaStreamAttachMemAsync(gpuStream, dInput, 0, cudaMemAttachHost));
  • Mapped: The base pointer of the linear access is a CPU-only pointer. To get the corresponding GPU pointer cudaHostGetDevicePointer is used.
// get GPU pointer for mapped host pointer
std::uint8_t *dImage;
CUDA_CHECK(cudaHostGetDevicePointer(&dImage, reinterpret_cast<void *>(linearAccess.BasePtr()), 0));
// enqueue CUDA kernel ...
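The linear memory layout assumed above can be checked before the base pointer is handed to a kernel. The following is a minimal host-only sketch under assumptions: the struct LinearLayout, its members, and both helper functions are hypothetical stand-ins for the values a linear access provides (only the base pointer is named in the text above). It shows how a pixel address follows from the base pointer and constant increments, and when a kernel may index the buffer simply as y * width + x.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for the layout information of a linear image plane.
struct LinearLayout
{
  std::uint8_t *basePtr; // address of pixel (0, 0)
  std::ptrdiff_t xInc;   // bytes between horizontally adjacent pixels
  std::ptrdiff_t yInc;   // bytes between vertically adjacent pixels (may include padding)
};

// Address of pixel (x, y) in a linear layout.
inline std::uint8_t *PixelAddress(const LinearLayout &layout, int x, int y)
{
  assert(layout.basePtr != nullptr);
  return layout.basePtr + y * layout.yInc + x * layout.xInc;
}

// True if an 8-bit mono image has no padding between pixels or rows,
// so that pixel (x, y) lives at index y * width + x.
inline bool IsTightlyPackedGrey8(const LinearLayout &layout, int width)
{
  return layout.xInc == 1 && layout.yInc == static_cast<std::ptrdiff_t>(width);
}
```

If the rows are padded, the kernel needs the row increment as an additional parameter instead of deriving it from the image width.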

Independent of the memory management mode used: before releasing the composite from the WaitFor call, which enables the acquisition engine to reuse that memory, it has to be ensured that all uses of the pointer (including the cudaStreamAttachMemAsync call) have been completed. cudaStreamSynchronize is used in the tutorial to ensure this.


In this example, a Sobel kernel is run on the acquired image. The kernel's target buffer is manually managed CUDA memory. To use the target memory with CVB processing functions, it can be wrapped in a Cvb::WrappedImage. Note: The image is not copied; the given buffer is worked on directly.

  • Managed: When managed memory is used for the target buffer as well, make sure that the memory is attached to the host before it is accessed from the CPU.
// attach target buffer to the host again -> make accessible from the CPU
CUDA_CHECK(cudaStreamAttachMemAsync(gpuStream, dTarget, 0, cudaMemAttachHost));
auto wrapped =
    Cvb::WrappedImage::FromGrey8Pixels(target, image->Width() - 2 * SOBEL_RADIUS,
                                       image->Height() - 2 * SOBEL_RADIUS);
  • Mapped / fully manually managed: The memory should be copied to host memory manually. It is generally not advised to use CUDA allocated device mapped memory for further processing on the host.
// copy output to host memory
CUDA_CHECK(cudaMemcpyAsync(target, dTarget,
                           (image->Width() - 2 * SOBEL_RADIUS) * (image->Height() - 2 * SOBEL_RADIUS),
                           cudaMemcpyDeviceToHost, gpuStream));
// map host memory to Cvb image
auto wrapped =
    Cvb::WrappedImage::FromGrey8Pixels(target, image->Width() - 2 * SOBEL_RADIUS,
                                       image->Height() - 2 * SOBEL_RADIUS);
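The target size of Width() - 2 * SOBEL_RADIUS by Height() - 2 * SOBEL_RADIUS follows from the filter omitting the image border. A host-side reference makes this visible and can be used to sanity-check the wrapped GPU result. The following is a minimal sketch and not the tutorial's kernel. It assumes a tightly packed 8-bit mono image, a 3x3 operator (SOBEL_RADIUS of 1), and a gradient magnitude approximated as |gx| + |gy| clamped to 255. The function name SobelReference is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <vector>

constexpr int SOBEL_RADIUS = 1; // assumption: 3x3 Sobel operator

// Hypothetical host-side reference for a tightly packed Grey8 image.
// The border is skipped, so the result is (width - 2R) x (height - 2R) pixels.
std::vector<std::uint8_t> SobelReference(const std::vector<std::uint8_t> &in, int width, int height)
{
  const int outW = width - 2 * SOBEL_RADIUS;
  const int outH = height - 2 * SOBEL_RADIUS;
  if (outW <= 0 || outH <= 0)
    return {};
  assert(in.size() == static_cast<std::size_t>(width) * height);

  auto px = [&](int x, int y) {
    return static_cast<int>(in[static_cast<std::size_t>(y) * width + x]);
  };

  std::vector<std::uint8_t> out(static_cast<std::size_t>(outW) * outH);
  for (int y = 0; y < outH; ++y)
  {
    for (int x = 0; x < outW; ++x)
    {
      const int cx = x + SOBEL_RADIUS; // center in input coordinates
      const int cy = y + SOBEL_RADIUS;
      const int gx = -px(cx - 1, cy - 1) + px(cx + 1, cy - 1)
                     - 2 * px(cx - 1, cy) + 2 * px(cx + 1, cy)
                     - px(cx - 1, cy + 1) + px(cx + 1, cy + 1);
      const int gy = -px(cx - 1, cy - 1) - 2 * px(cx, cy - 1) - px(cx + 1, cy - 1)
                     + px(cx - 1, cy + 1) + 2 * px(cx, cy + 1) + px(cx + 1, cy + 1);
      out[static_cast<std::size_t>(y) * outW + x] =
          static_cast<std::uint8_t>(std::min(255, std::abs(gx) + std::abs(gy)));
    }
  }
  return out;
}
```

For a 4x3 input, the function returns 2x1 pixels. A comparison against the GPU output is only meaningful if the kernel uses the same magnitude approximation.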


For resource cleanup, the destructor of the UserFlowSetPool should use the respective memory freeing function for the corresponding memory regions:

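virtual ~UserFlowSetPool() {
  for (auto &flowSet : *this)
    for (auto &flow : flowSet)
      // assumed member name Buffer; cudaFree releases managed memory, cudaFreeHost releases mapped host memory
      CUDA_CHECK(USE_MANAGED_MEMORY ? cudaFree(flow.Buffer) : cudaFreeHost(flow.Buffer));
}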
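The pairing of allocation and release can be tried on the host without a GPU. The following sketch is an illustration only: MockFlowInfo, MockFlow, and MockFlowSetPool are hypothetical stand-ins and not CVB types, and std::malloc and std::free take the place of the CUDA allocation and freeing functions. It shows one independent buffer per flow, a configurable number of flow sets, and a destructor that releases every buffer with the function matching its allocation.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <iterator>
#include <vector>

struct MockFlowInfo { std::size_t Size; };           // hypothetical: bytes required per flow
struct MockFlow { std::size_t Size; void *Buffer; }; // hypothetical: one allocated flow

// Hypothetical pool that owns the buffers of all flow sets.
class MockFlowSetPool
{
public:
  MockFlowSetPool() = default;
  MockFlowSetPool(const MockFlowSetPool &) = delete; // owning raw pointers: no copies
  MockFlowSetPool &operator=(const MockFlowSetPool &) = delete;

  ~MockFlowSetPool()
  {
    for (auto &flowSet : flowSets_)
      for (auto &flow : flowSet)
        std::free(flow.Buffer); // with CUDA: the free function matching the allocation
  }

  // Adds numBuffers flow sets, each with one independent buffer per flow.
  void Allocate(const std::vector<MockFlowInfo> &flowSetInfo, std::size_t numBuffers)
  {
    std::generate_n(std::back_inserter(flowSets_), numBuffers, [&flowSetInfo]() {
      std::vector<MockFlow> flows;
      flows.reserve(flowSetInfo.size());
      std::transform(flowSetInfo.begin(), flowSetInfo.end(), std::back_inserter(flows),
                     [](const MockFlowInfo &info) {
                       void *ptr = std::malloc(info.Size); // stand-in for the CUDA allocation
                       assert(ptr != nullptr);
                       return MockFlow{info.Size, ptr};
                     });
      return flows;
    });
  }

  const std::vector<std::vector<MockFlow>> &FlowSets() const { return flowSets_; }

private:
  std::vector<std::vector<MockFlow>> flowSets_;
};
```

Keeping allocation and release in the same class makes it harder to mix up the freeing functions when switching between the managed and the mapped variant.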

Use CUDA for CVB Shapefinder

CVB also offers a CUDA-specific implementation of its Shapefinder tool. The deployed example application can be found at $CVB/tutorial/ShapeFinder/Cvb++/QtShapeFinder2Cuda. Wrapped in a simple Qt app, it compares the performance of a Shapefinder search on the CPU with that of the CUDA implementation. Be aware that the full performance difference, and especially the advantage of the CUDA implementation, is only available with a valid CVB Shapefinder license.

Application Compatibility

We aim to make embedded boards easy to use. However, the difficulty of finding suitable and compatible versions of 3rd party software can be an obstacle. This section provides verified installation guidelines and details on the compatibility of various applications and software versions with the modular embedded system, specifically for the Jetson Xavier NX platform.


  • Supported Version: JetPack Version 4.6.1
  • JetPack is initially installed by the supplier. Do not re-flash it on your own after it has been initially flashed. If a problem occurs, please contact the RMA. In the future the flash utility will be tested and released.
  • Before Installation: CUDA and other applications require a lot of storage space. Moving parts of the rootfs to an external drive is therefore recommended.
  • More Information can be found here: NVIDIA JetPack SDK 4.6.1

Installation Steps

1. CUDA, cuDNN and TensorRT Installation:

  • Version: CUDA 10.2, cuDNN 8.2.1, TensorRT 8.2.1
  • Installation: Using the SDK Manager (SDK Manager version: 1.9.3) is recommended for ease of installation.

    1. Older versions of the SDK Manager can be downloaded here: NVIDIA SDK Manager Archives. Archived SDK versions are enabled with the command sdkmanager --archivedversions
    2. In the SDK Manager, select CUDA (yellow box) and CUDA-X AI for cuDNN and TensorRT (blue box). Do not select the image file (red box), as it is already installed with the patch applied.

    SDK Manager CUDA Installation setting

  • Verification: Executing dpkg -l | grep cuda displays output similar to the one shown below, which verifies that the installation is complete.

    Input (Shell):

    dpkg -l | grep cuda

    Output (Shell):

    ii cuda-toolkit-10-2 10.2.460-1 arm64 CUDA Toolkit 10.2 meta-package
    ii cuda-tools-10-2 10.2.460-1 arm64 CUDA Tools meta-package
    ii libcudnn8 arm64 cuDNN runtime libraries
    ii libcudnn8-dev arm64 cuDNN development libraries and headers
    ii libnvinfer-bin 8.2.1-1+cuda10.2 arm64 TensorRT binaries
    ii libnvinfer-dev 8.2.1-1+cuda10.2 arm64 TensorRT development libraries and headers

2. TensorFlow

  • Version: 1.15.5
  • Installation (Shell):
    sudo apt-get update
    sudo apt-get install -y python3-pip pkg-config
    sudo apt-get install -y libhdf5-serial-dev hdf5-tools libhdf5-dev zlib1g-dev zip libjpeg8-dev liblapack-dev libblas-dev gfortran
    sudo ln -s /usr/include/locale.h /usr/include/xlocale.h
    sudo pip3 install protobuf==3.19.6
    sudo pip3 install --verbose 'Cython<3'
    sudo wget --no-check-certificate
    sudo pip3 install --verbose tensorflow-1.15.5+nv22.1-cp36-cp36m-linux_aarch64.whl
  • Requirements: Python 3.6, Protobuf Python package version 3.19.6
  • Verification: Check the TensorFlow installation and GPU usage.

    Input (Shell):

    pip show tensorflow

    Output (Shell):

    Name: tensorflow
    Version: 1.15.5+nv22.1
    Summary: TensorFlow is an open source machine learning framework for everyone.

    Input (Python):

    import tensorflow as tf
    print("Tensorflow version", tf.__version__)
    print("Is GPU available?", tf.test.is_gpu_available())

    Output (Shell):

    Tensorflow version 1.15.5
    2023-11-16 09:52:07.921047: I tensorflow/core/common_runtime/gpu/] Found device 0 with properties: name: Xavier major: 7 minor: 2 memoryClockRate(GHz): 1.109
    2023-11-16 09:52:10.328446: I tensorflow/core/common_runtime/gpu/] Created TensorFlow device (/device:GPU:0 with 10969 MB memory) -> physical GPU (device: 0, name: Xavier, pci bus id: 0000:00:00.0, compute capability: 7.2)
    Is GPU available? True
  • More Information can be found here: Official Tensorflow for Jetson, TensorFlow for Jetson Platform Release Notes

3. PyTorch

  • Version: 1.11.0a0+17540c5
  • Installation (Shell):
    # Install system packages required by PyTorch:
    sudo apt-get -y update;
    sudo apt-get -y install autoconf bc build-essential g++-8 gcc-8 clang-8 lld-8 gettext-base gfortran-8 iputils-ping libbz2-dev libc++-dev libcgal-dev libffi-dev libfreetype6-dev libhdf5-dev libjpeg-dev liblzma-dev libncurses5-dev libncursesw5-dev libpng-dev libreadline-dev libssl-dev libsqlite3-dev libxml2-dev libxslt-dev locales moreutils openssl python-openssl rsync scons python3-pip libopenblas-dev;
    # Export with the following command:
    export TORCH_INSTALL=
    # Install PyTorch.
    python3 -m pip install --upgrade pip
    python3 -m pip install aiohttp numpy=='1.19.4' scipy=='1.5.3'
    export "LD_LIBRARY_PATH=/usr/lib/llvm-8/lib:$LD_LIBRARY_PATH"
    python3 -m pip install --upgrade protobuf
    python3 -m pip install --no-cache $TORCH_INSTALL
  • Verification: Check the PyTorch installation and GPU usage.

    Input (Shell):

    pip show torch

    Output (Shell):

    Name: torch
    Version: 1.11.0a0+17540c5
    Summary: Tensors and Dynamic neural networks in Python with strong GPU acceleration

    Input (Python):

    import torch
    print("PyTorch version:", torch.__version__)
    print("Is CUDA available:", torch.cuda.is_available())
    print("CUDA device count:", torch.cuda.device_count())
    print("CUDA device name:", torch.cuda.get_device_name(0) if torch.cuda.is_available() else "No CUDA device")

    Output (Shell):

    PyTorch version: 1.11.0a0+17540c5
    Is CUDA available: True
    CUDA device count: 1
    CUDA device name: Xavier
  • More Information can be found here: Official PyTorch for Jetson, PyTorch for Jetson Platform Release Notes

4. ONNX Runtime

  • Version: 1.4.0
  • Installation (Shell):
    # Download pip wheel:
    wget -O onnxruntime_gpu-1.4.0-cp36-cp36m-linux_aarch64.whl
    # Install pip wheel
    pip3 install onnxruntime_gpu-1.4.0-cp36-cp36m-linux_aarch64.whl
  • Verification: Check the ONNX Runtime installation.

    Input (Shell):

    pip show onnxruntime_gpu

    Output (Shell):

    Name: onnxruntime-gpu
    Version: 1.4.0
    Summary: ONNX Runtime Python bindings
  • More Information can be found here: ONNX Runtime