
I have a limited background in parallel programming. I took a course during my undergrad, as well as the Udacity offering a few years ago. While I've been intrigued, for a long time the hardware was difficult to get access to and the libraries were difficult to use.

While things aren't perfect (the driver packages still don't play very nicely with each other), there has been a marked improvement. The Cuda drivers and libraries can be installed with apt-get, I didn't have to build a kernel image, and the Thrust library is now included with the Cuda toolkit. As we'll see, this makes it a little bit easier to work with.

In this post I take a look at creating a simple program three ways: a serialized CPU version, a version using Cuda's core library, and a version using Thrust.

The source code can be found here.

Set Up and Build Config

My computer has an Nvidia GTX 1060 (6GB) graphics card. I'm running Ubuntu 18.04, using the nvidia-driver-390 metapackage.

There were a couple of gotchas I ran into. Because Thrust is included as part of the Cuda toolkit, though, I don't need to link any external libraries, which allowed my final Makefile to be fairly simple.

The compiler, referenced as CC, is packaged with Cuda version 9.1, the maximum supported by my driver version. You'll also notice the ccbin compiler option for each of the builds. As of this writing, Ubuntu's default g++ version is 7.3.0, and Cuda 9.1 only supports up to g++ 6.
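The actual Makefile lives in the repository; the sketch below only illustrates the shape it takes. The file names and the CCFLAGS variable are assumptions, and it presumes nvcc and g++-6 are installed.

```makefile
# Illustrative sketch only; serial.cpp, cuda.cu, and thrust.cu are
# hypothetical names, not necessarily those used in the repository.
CC = nvcc
CCFLAGS = -ccbin g++-6 -std=c++11   # point nvcc at a host compiler Cuda 9.1 supports

all: serial cuda thrust

serial: serial.cpp
	$(CC) $(CCFLAGS) -o serial serial.cpp

cuda: cuda.cu
	$(CC) $(CCFLAGS) -o cuda cuda.cu

thrust: thrust.cu
	$(CC) $(CCFLAGS) -o thrust thrust.cu

clean:
	rm -f serial cuda thrust
```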

Serial Version

The serial version isn't meant to be well-formed C++ code; rather, it mimics the structure of the Cuda program closely. First the memory size of the vectors is calculated. This is 8000x8000 (64M) floating point numbers. Next, the vectors are allocated on the heap using malloc and initialized with random floating point values between 0 and 1. Then the vectors are summed 10 times and the results are stored in output. Lastly the memory is freed up.
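The full listing is in the repository; a minimal sketch of that structure, with hypothetical variable names, looks something like this:

```cpp
// Sketch of the serial version described above (illustrative names only).
#include <cstdlib>

int main() {
    const size_t N = 8000 * 8000;             // 64M elements per vector
    const size_t bytes = N * sizeof(float);   // memory size of each vector

    // Allocate the vectors on the heap.
    float *x = (float *)malloc(bytes);
    float *y = (float *)malloc(bytes);
    float *output = (float *)malloc(bytes);

    // Initialize with random floats between 0 and 1.
    for (size_t i = 0; i < N; i++) {
        x[i] = (float)rand() / RAND_MAX;
        y[i] = (float)rand() / RAND_MAX;
    }

    // Sum the vectors 10 times, storing the result in output.
    for (int run = 0; run < 10; run++) {
        for (size_t i = 0; i < N; i++) {
            output[i] = x[i] + y[i];
        }
    }

    free(x);
    free(y);
    free(output);
    return 0;
}
```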

It's a simple, if contrived and inelegant, program. On my CPU the run-time is around 4.5 seconds.

Cuda Version

As mentioned above, the Cuda version is intentionally similar to the serial version.

The allocation is done with the cudaMallocManaged function instead of malloc, and the free function is replaced by cudaFree. This means that the memory is available both on the host (code run on the CPU) and on the device (code run on the GPU).
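A minimal sketch of that change (error checking omitted, names illustrative):

```cuda
// Unified ("managed") memory is reachable from both host and device code.
size_t bytes = 8000UL * 8000UL * sizeof(float);   // 64M floats per vector

float *x = nullptr;
cudaMallocManaged(&x, bytes);   // replaces malloc(bytes)

// ... fill x in a kernel, read it back on the CPU ...

cudaFree(x);                    // replaces free(x)
```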

The way the data is accessed and updated, though, is quite different.

Notice first that this is a .cu file, so some keywords are available that aren't in .cpp files. The __global__ keyword marks a kernel: a function that runs on the device but is launched from host code. Both kernels return void; they mutate the managed memory rather than returning values.

The first function, add, as you might guess, produces the output of the sum operation. This is done element-wise: each thread is responsible for a single element of the input, found by calculating idx from the block location and thread index of that particular thread. Thus each thread only performs a single addition operation.
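A sketch of what such a kernel looks like (the exact signature in the repository may differ):

```cuda
// Each thread computes one element of the output vector.
__global__ void add(int n, const float *x, const float *y, float *output) {
    // Global index of this thread: block offset plus thread offset.
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        output[idx] = x[idx] + y[idx];
    }
}
```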

The second function is the random initialization. This function actually uses the Thrust API to sample uniform random floats between 0 and 1, matching the other versions. There are some other options for doing this on the GPU, but I wasn't able to find anything that was this easy to use.
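A sketch of such a kernel, assuming Thrust's device-side random engine and distribution (names here are illustrative, and the repository version may differ):

```cuda
#include <thrust/random.h>

// Fill x with uniform random floats in [0, 1); one thread per element.
__global__ void initRandom(int n, float *x, unsigned int seed) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        // Advance the engine by idx draws so each element gets a different value.
        thrust::default_random_engine rng(seed);
        rng.discard(idx);
        thrust::uniform_real_distribution<float> dist(0.0f, 1.0f);
        x[idx] = dist(rng);
    }
}
```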

The calls to these kernel functions are a little strange looking. The function parameters are provided normally, but the call is parameterized with <<<numBlocks, blockSize>>>. This expression tells Cuda how many blocks and threads to launch. In our case, we are making blocks of 256 threads, and there are 64M / 256 blocks. Cuda then distributes work to all of those threads. The cudaDeviceSynchronize call is kind of like join on threads or await in JavaScript async functions: it causes the program to stop and wait for all of the kernel work to finish before the program continues.
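Putting those pieces together, the launch and synchronization look roughly like this. This sketch assumes the add and initRandom kernels above and x, y, and output allocated with cudaMallocManaged; the names are illustrative.

```cuda
int N = 8000 * 8000;
int blockSize = 256;                              // threads per block
int numBlocks = (N + blockSize - 1) / blockSize;  // enough blocks to cover all 64M elements

// Fill the input vectors with random values on the device.
initRandom<<<numBlocks, blockSize>>>(N, x, 1u);
initRandom<<<numBlocks, blockSize>>>(N, y, 2u);

// Sum the vectors 10 times, as in the serial version.
for (int run = 0; run < 10; run++) {
    add<<<numBlocks, blockSize>>>(N, x, y, output);
}

// Block until every queued kernel has finished before reading results on the host.
cudaDeviceSynchronize();
```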

This is a bit more work than the serialized version, and way less readable than a well written C++ program. Is the juice worth the squeeze? On my hardware, this is a dramatic yes. The Cuda version finishes in around 0.14s, over 30x faster than the serialized version. If we look at the output of nvprof we can get a bit more detail.

The most expensive operations were memory allocation and device synchronization, and the most expensive kernel operation was the random initialization, despite the summation being run 10x more often. This is a good demonstration of an important parallel programming principle: communication is expensive.

Thrust Version

The Thrust version differs quite a bit from the Cuda version. The patterns are idiomatic, STL-style C++ rather than C-like.

The custom kernel operation initRandomPrg is organized as a struct that implements the () operator, meaning an instance can be called like a function after it is constructed. In a JavaScript context I would think of this as a second-order function: the constructor sets the bounds for the random floats, and each subsequent call returns a random number sampled from a uniform distribution.
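A sketch of what such a functor looks like (the repository's initRandomPrg may differ in detail):

```cuda
#include <thrust/random.h>

// Functor: construction fixes the bounds, operator() draws one sample per index.
struct initRandomPrg {
    float minValue, maxValue;

    __host__ __device__
    initRandomPrg(float mn = 0.f, float mx = 1.f) : minValue(mn), maxValue(mx) {}

    __host__ __device__
    float operator()(unsigned int n) const {
        thrust::default_random_engine rng;
        thrust::uniform_real_distribution<float> dist(minValue, maxValue);
        rng.discard(n);   // advance the engine per index so elements differ
        return dist(rng);
    }
};
```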

The data is stored in device_vectors, which are initialized and can be interacted with much like std::vector objects. Despite their name, these vectors can be accessed and mutated from the host as well as the device. It isn't necessary to calculate the size of the vectors in bytes or to allocate them manually.

The kernel executions are organized much like the functions in the STL algorithm package. There is no declaration of block sizes or counts, and the device doesn't have to be synchronized explicitly.
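Here is a sketch of how those pieces might fit together, assuming the initRandomPrg functor above (illustrative only; the actual code is in the repository):

```cuda
#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/functional.h>

int main() {
    const int N = 8000 * 8000;

    // device_vector handles allocation and deallocation on the GPU itself.
    thrust::device_vector<float> x(N), y(N), output(N);

    // Fill x and y with uniform random floats; the counting iterator supplies
    // the per-element index consumed by initRandomPrg. Offsetting the second
    // range keeps y from being identical to x.
    thrust::counting_iterator<unsigned int> first(0);
    thrust::transform(first, first + N, x.begin(), initRandomPrg(0.f, 1.f));
    thrust::transform(first + N, first + 2 * N, y.begin(), initRandomPrg(0.f, 1.f));

    // Element-wise sum, repeated 10 times as in the other versions.
    for (int run = 0; run < 10; run++) {
        thrust::transform(x.begin(), x.end(), y.begin(),
                          output.begin(), thrust::plus<float>());
    }

    return 0;
}
```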

The convenience of not having to manually allocate and de-allocate memory seems like it should come at a cost. However, that appears not to be the case: in multiple trials, the Thrust version ran slightly faster than the Cuda version. This could come down to a number of factors, not least that I'm not extremely familiar with the performance characteristics of the Cuda API.

The output of nvprof is a little more difficult to read.

The function names are generated monstrosities rather than the easy-to-read ones from before. I'm assuming the most expensive GPU operation is the random initialization, but I'm not sure. And I don't know what the third GPU operation is. At least the API calls are still broken out nicely.

There are no calls to cudaDeviceSynchronize. More intelligent synchronization could explain why the Thrust version was a little faster.

Conclusions

Thrust appears to be supported internally by Nvidia. Despite what the project's former homepage might suggest, it is going to be under active development going forward.

From my perspective Thrust has a lot of value. With allocation, deallocation, block sizing, and synchronization handled for me, there are fewer opportunities to shoot myself in the foot. There are also a few nice algorithms available.

Assuming I can come up with a suitable application (some kind of simulation, perhaps), I'm anxious to dig more into Cuda, Thrust, and the other libraries Nvidia has provided.