Common Vision Blox 14.1
Modular Embedded - Applications (CVB & CUDA)

Contents:

Introduction

Prerequisites

CUDA Background

Acquisition using CVB and CUDA memory

Use CUDA for CVB ShapeFinder


Introduction

The modular embedded technology unites two powerful tools: highly efficient image acquisition via Common Vision Blox (CVB) and parallelized general-purpose GPU computation via NVIDIA's CUDA platform. More general documentation on the 3rd generation of the CVB acquisition stack and NVIDIA CUDA can be found in the following links:


Prerequisites

Although this is a specific guide for the modular embedded system (aarch64), there are ways to build the sample on other platforms.

  1. Install the NVIDIA CUDA toolkit (either via sdkmanager or this guide). Afterwards, verify that nvcc is available on the command line. If not, expose it in the PATH:
    PATH=$PATH:/usr/local/cuda:/usr/local/cuda/bin
  2. Verify CVB version >= 14.00.003 is installed
  3. Get sources of the CppCudaMemory example ($CVB/tutorial/ImageManager/Cvb++/CppCudaMemory)
  4. Adjust the power settings (nvpmodel) to your needs and also consider jetson_clocks (documentation).

See this page for Windows installation or this documentation for other Linux platforms.


CUDA Background

To develop kernels for NVIDIA GPUs, the CUDA programming model can be used. Apart from the ability to write C++ kernels (dedicated highly parallel computation functions) that run on the GPU, it also provides runtime API functions for controlling the GPU's behavior. The heterogeneous programming approach uses both the CPU and the GPU. In this model, the CPU and its memory are referred to as the host, while the GPU and its memory are denoted as the device. The host can manage memory on both the host and the device, and it can initiate kernels, which are functions executed on the device by many GPU threads simultaneously.

For allocating memory on the GPU, cudaMalloc is provided, for example. To provide access to relevant data, memory can either be explicitly copied from CPU to GPU memory (cudaMemcpy, cudaMemcpyAsync), or provided in mapped, device-accessible host memory (cudaHostAlloc with the flag cudaHostAllocMapped). Mapped memory is effectively “zero-copy” memory, as the GPU can work directly on that data through the corresponding device pointer (cudaHostGetDevicePointer). Additionally, so-called managed memory can be used (cudaMallocManaged). With this last option, the same pointer is used on the CPU and the GPU; the CUDA driver ensures that the memory is automatically available on the device where it is used. Note: Windows and embedded devices have restrictions when using managed memory; cudaStreamAttachMemAsync has to be used there.

Device-visible host memory and managed memory can be used to stream acquired data to the GPU more efficiently. In the tutorial, the constant value USE_MANAGED_MEMORY selects whether the mapped or the managed memory variant is used.
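The three ways of making data GPU-accessible described above can be illustrated with a minimal, self-contained sketch. The CUDA_CHECK macro below is a common error-checking convenience defined here for the sketch, not part of the CUDA API; buffer size and usage are arbitrary:

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

// simple error-checking helper (convention used throughout the tutorial code)
#define CUDA_CHECK(call)                                          \
  do {                                                            \
    cudaError_t err_ = (call);                                    \
    if (err_ != cudaSuccess) {                                    \
      std::fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err_)); \
      std::exit(EXIT_FAILURE);                                    \
    }                                                             \
  } while (0)

int main()
{
  const size_t size = 1024;

  // 1) explicit copy: separate host and device buffers, data moved via cudaMemcpy
  void *hBuf = std::malloc(size);
  void *dBuf = nullptr;
  CUDA_CHECK(cudaMalloc(&dBuf, size));
  CUDA_CHECK(cudaMemcpy(dBuf, hBuf, size, cudaMemcpyHostToDevice));

  // 2) mapped ("zero-copy"): one host allocation, device pointer fetched on demand
  void *hMapped = nullptr, *dMapped = nullptr;
  CUDA_CHECK(cudaHostAlloc(&hMapped, size, cudaHostAllocMapped));
  CUDA_CHECK(cudaHostGetDevicePointer(&dMapped, hMapped, 0));

  // 3) managed: a single pointer valid on both host and device
  void *managed = nullptr;
  CUDA_CHECK(cudaMallocManaged(&managed, size));

  CUDA_CHECK(cudaFree(dBuf));
  std::free(hBuf);
  CUDA_CHECK(cudaFreeHost(hMapped));
  CUDA_CHECK(cudaFree(managed));
  return 0;
}
```

Each allocation must be released with the matching free function (cudaFree for device and managed memory, cudaFreeHost for host allocations), as the Cleanup section below also shows.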


Acquisition using CVB and CUDA memory

The CVB acquisition stack allows streaming acquired data from cameras directly into GPU-accessible memory. The CppCudaMemory tutorial demonstrates this compatibility with the CUDA programming model, using the CVB++ API wrappers together with CUDA.

The main customization point in the CVB acquisition stack for streaming into GPU-accessible memory is the flow set pool. Its general usage is already described in the CppUserAllocatedMemory example and the corresponding documentation. Here, the process for direct interaction with CUDA is described.

Preparation

A developer-defined UserFlowSetPool derived from Cvb::Driver::FlowSetPool is used to override the memory management behavior. Within it, the user is free to choose between managed and mapped memory allocation. First, the stream requirements are queried from CVB++:

// get the flow set information that is needed for the current stream
auto flowSetInfo = stream->FlowSetInfo();
// create a subclass of FlowSetPool to store the created buffer
auto flowSetPoolPtr = Tutorial::UserFlowSetPool::Create(flowSetInfo);
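The UserFlowSetPool type used above is defined by the tutorial, not by the CVB API itself. A minimal sketch of such a subclass might look as follows; the header name, the base-class constructor arguments, and the ProtectedTag pattern are assumptions modeled on the CppUserAllocatedMemory example, so the actual tutorial sources should be consulted for the exact signatures:

```cuda
#include <cvbpp.hpp> // header name is an assumption

namespace Tutorial
{
  // subclass of FlowSetPool that owns CUDA-allocated flow buffers
  class UserFlowSetPool final : public Cvb::Driver::FlowSetPool
  {
  public:
    // constructor signature assumed to mirror the base class
    UserFlowSetPool(const std::vector<Cvb::Driver::FlowInfo> &flowSetInfo,
                    FlowSetPool::ProtectedTag tag)
      : FlowSetPool(flowSetInfo, tag)
    {
    }

    // destructor frees the CUDA buffers, see the Cleanup section below
    virtual ~UserFlowSetPool();

    // factory mirroring the Create() call used above
    static std::shared_ptr<UserFlowSetPool>
    Create(const std::vector<Cvb::Driver::FlowInfo> &flowSetInfo)
    {
      return std::make_shared<UserFlowSetPool>(flowSetInfo,
                                               FlowSetPool::ProtectedTag{});
    }
  };
}
```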

Flow Set Pool Creation and Registration

When creating the custom flow set pool, the required memory for the flow sets is allocated; GPU-accessible memory is used here. std::generate_n inserts NUM_BUFFERS elements into the user flowSetPoolPtr. In this example, an independent memory region is allocated for every flow inside a flow set.

  • Managed: cudaMallocManaged is used to allocate the buffer with the requested size. cudaMemAttachHost is given as flag to specify that the allocated memory is initially accessible by the host. This is required if cudaDevAttrConcurrentManagedAccess is not available, as the CVB driver will write to the buffer from the CPU.
  • Mapped: cudaHostAlloc with the flag cudaHostAllocMapped allocates memory on the host that is accessible by the GPU. When additionally setting cudaHostAllocWriteCombined, reading that memory from the GPU is more efficient but caching of the data is disabled on the host, which in turn leads to slow reads on CPU. Thus, the acquired buffers should mainly be consumed on the GPU.
std::generate_n(std::back_inserter(*flowSetPoolPtr), NUM_BUFFERS, [&flowSetInfo]() {
  auto flows = std::vector<void *>(flowSetInfo.size());
  std::transform(flowSetInfo.begin(), flowSetInfo.end(), flows.begin(), [](const auto &info) {
    void *ptr = nullptr;
    if (USE_MANAGED_MEMORY)
    {
      // allocate managed memory, but attach to host
      CUDA_CHECK(cudaMallocManaged(&ptr, info.Size, cudaMemAttachHost));
    }
    else
    {
      // allocate host memory, which is readable from the GPU
      CUDA_CHECK(cudaHostAlloc(&ptr, info.Size,
                               cudaHostAllocWriteCombined | cudaHostAllocMapped));
    }
    return ptr;
  });
  return flows;
});
// finally register the user flow set pool
stream->RegisterExternalFlowSetPool(std::move(flowSetPoolPtr));

Acquisition

After registering the flow set pool, it is used internally by the acquisition engine. Therefore, when using stream->WaitFor(TIMEOUT) to acquire a frame, the memory used by the composite is the CUDA-allocated memory. In the tutorial, the first element of the composite is assumed to be an image with a linear memory layout. The base pointer of the linearAccess corresponds to the originally allocated pointer. How that pointer is to be handled again depends on the way the memory was allocated:

  • Managed: After attaching this pointer to a CUDA stream (cudaStreamAttachMemAsync), it can be used on the GPU inside a CUDA kernel. When done using the memory on the GPU, the attachment should be changed to the CPU again.
std::uint8_t *dInput = reinterpret_cast<std::uint8_t *>(linearAccess.BasePtr());
// attach memory to stream -> make writable from GPU
CUDA_CHECK(cudaStreamAttachMemAsync(gpuStream, dTarget));
CUDA_CHECK(cudaStreamAttachMemAsync(gpuStream, dInput));
// enqueue CUDA kernel ...
(...)
// detach from stream -> make writable from acquisition engine again
CUDA_CHECK(cudaStreamAttachMemAsync(gpuStream, dInput, 0, cudaMemAttachHost));
CUDA_CHECK(cudaStreamSynchronize(gpuStream));
  • Mapped: The base pointer of the linear access is a CPU-only pointer. To get the corresponding GPU pointer cudaHostGetDevicePointer is used.
// get GPU pointer for mapped host pointer
std::uint8_t *dImage;
CUDA_CHECK(cudaHostGetDevicePointer(&dImage, reinterpret_cast<void *>(linearAccess.BasePtr()), 0));
// enqueue CUDA kernel ...
(...)
CUDA_CHECK(cudaStreamSynchronize(gpuStream));

Independent of the memory management mode used: before releasing the composite returned by the WaitFor call (which enables the acquisition engine to reuse that memory), it has to be ensured that all uses of the pointer (including the cudaStreamAttachMemAsync calls) have completed. cudaStreamSynchronize is used in the tutorial to ensure this.
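Putting the steps above together, one acquisition and processing iteration for the managed-memory variant can be sketched like this. The structured-binding return of WaitFor follows the usual CVB++ tutorial style but is an assumption here, as are the kernel name and launch geometry; TIMEOUT, gpuStream and dTarget come from the surrounding context:

```cuda
// hedged sketch of one acquisition/processing iteration (managed memory)
for (;;)
{
  // wait for the next frame; the acquisition engine fills a CUDA-allocated buffer
  auto [composite, waitStatus, nodeMaps] = stream->WaitFor(TIMEOUT);
  // ... extract the first composite element as an image and obtain its
  //     linear access (omitted here, see the tutorial sources) ...

  auto *dInput = reinterpret_cast<std::uint8_t *>(linearAccess.BasePtr());

  // attach input and target to the CUDA stream -> writable from the GPU
  CUDA_CHECK(cudaStreamAttachMemAsync(gpuStream, dTarget));
  CUDA_CHECK(cudaStreamAttachMemAsync(gpuStream, dInput));

  // placeholder kernel launch; name and geometry are assumptions
  sobel<<<grid, block, 0, gpuStream>>>(dInput, dTarget, width, height);

  // hand the input buffer back to the CPU side and wait for completion
  CUDA_CHECK(cudaStreamAttachMemAsync(gpuStream, dInput, 0, cudaMemAttachHost));
  CUDA_CHECK(cudaStreamSynchronize(gpuStream));

  // composite goes out of scope here -> the buffer may be reused by the engine
}
```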

Processing

In this example, a Sobel kernel is run on the acquired image. The kernel's target buffer is manually managed CUDA memory. To use the target memory with CVB processing functions, it can be wrapped in a Cvb::WrappedImage. Note: the image is not copied; the given buffer is worked on directly.

  • Managed: When using managed memory for the target buffer as well, care must be taken that the memory is attached to the host.
// prefetch target buffer on CPU
CUDA_CHECK(cudaStreamAttachMemAsync(gpuStream, dTarget, 0, cudaMemAttachHost));
CUDA_CHECK(cudaStreamSynchronize(gpuStream));
auto wrapped = Cvb::WrappedImage::FromGrey8Pixels(target, image->Width() - 2 * SOBEL_RADIUS,
                                                  image->Height() - 2 * SOBEL_RADIUS);
  • Mapped / fully manually managed: The memory should be copied to host memory manually. It is generally not advisable to use CUDA-allocated, device-mapped memory for further processing on the host.
// copy output to host memory
CUDA_CHECK(cudaMemcpyAsync(target, dTarget, (image->Width() - 2 * SOBEL_RADIUS) *
(image->Height() - 2 * SOBEL_RADIUS), cudaMemcpyDeviceToHost, gpuStream));
CUDA_CHECK(cudaStreamSynchronize(gpuStream));
// map host memory to Cvb image
auto wrapped = Cvb::WrappedImage::FromGrey8Pixels(target, image->Width() - 2 * SOBEL_RADIUS,
                                                  image->Height() - 2 * SOBEL_RADIUS);
wrapped->Save(outputName);
wrapped->Save(outputName);
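The tutorial's Sobel kernel itself is not shown above. A minimal sketch of what such a kernel could look like follows; the kernel name, the thread-to-pixel mapping and the magnitude clamping are assumptions for illustration, not the tutorial's exact code (SOBEL_RADIUS is 1 for a 3x3 filter):

```cuda
#include <cstdint>

constexpr int SOBEL_RADIUS = 1; // 3x3 filter -> output shrinks by 1 pixel per side

// naive Sobel magnitude kernel (sketch); each thread writes one output pixel
__global__ void sobel(const std::uint8_t *in, std::uint8_t *out,
                      int width, int height)
{
  const int x = blockIdx.x * blockDim.x + threadIdx.x + SOBEL_RADIUS;
  const int y = blockIdx.y * blockDim.y + threadIdx.y + SOBEL_RADIUS;
  if (x >= width - SOBEL_RADIUS || y >= height - SOBEL_RADIUS)
    return;

  // fetch a neighbor pixel relative to (x, y)
  auto px = [&](int dx, int dy) {
    return static_cast<int>(in[(y + dy) * width + (x + dx)]);
  };

  // horizontal and vertical Sobel responses
  const int gx = -px(-1, -1) - 2 * px(-1, 0) - px(-1, 1)
                 + px(1, -1) + 2 * px(1, 0) + px(1, 1);
  const int gy = -px(-1, -1) - 2 * px(0, -1) - px(1, -1)
                 + px(-1, 1) + 2 * px(0, 1) + px(1, 1);

  // approximate the gradient magnitude and clamp to 8 bit
  const int mag = min(255, abs(gx) + abs(gy));
  out[(y - SOBEL_RADIUS) * (width - 2 * SOBEL_RADIUS) + (x - SOBEL_RADIUS)] =
      static_cast<std::uint8_t>(mag);
}
```

The output buffer is addressed with the reduced width (width - 2 * SOBEL_RADIUS), which matches the dimensions passed to Cvb::WrappedImage::FromGrey8Pixels above.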

Cleanup

For resource cleanup, the destructor of the UserFlowSetPool should use the respective memory freeing function for the corresponding memory regions:

virtual ~UserFlowSetPool()
{
  for (auto &flowSet : *this)
  {
    for (auto &flow : flowSet)
    {
      if (USE_MANAGED_MEMORY)
        CUDA_CHECK(cudaFree(flow.Buffer));
      else
        CUDA_CHECK(cudaFreeHost(flow.Buffer));
    }
  }
}

Use CUDA for CVB ShapeFinder

CVB also offers a CUDA-specific implementation of its ShapeFinder tool. The deployed example application can be found at $CVB/tutorial/ShapeFinder/Cvb++/QtShapeFinder2Cuda. It shows the performance difference between applying a ShapeFinder search on the CPU and using the CUDA implementation, wrapped in a simple Qt app. Be aware that the full performance difference, and especially the advantage of the CUDA implementation, is only available with a valid CVB ShapeFinder license.