Personal blog of Matthias M. Fischer


CUDA on the NVIDIA Jetson, Part 3: Threads, Atomics, Fuzzing, and Floating Point Errors

18th April 2022

Introduction

As promised in my last post, I will here present my implementation of a simple dot product routine in C for the NVIDIA Jetson Nano GPU computer. My implementation is based on this excellent tutorial by NVIDIA.

I will also use my implementation to demonstrate the use of threads, thread synchronisation and shared memory. In addition, revisiting my remark in my post about the implementation of my own memory manager, I will use the dot product routine to showcase the usefulness of software testing based on "fuzzing". In particular, fuzzing my implementation reveals the danger of significant floating point imprecisions in certain situations -- these will be examined in this post as well.

The Algorithm and its Implementation

To calculate the dot product of two vectors of length \(n\), we need to carry out \(n\) multiplications. Ideally, we carry out all of these in parallel. For this, we could of course use \(n\) thread blocks with a single thread each. However, we would then need to add up all \(n\) results one after another, which for sufficiently long vectors might introduce considerable overhead. To make this more efficient, we can use multiple threads per thread block (up to 1024 on my machine) and let each thread block first take care of adding up its (up to) 1024 products, with the different blocks working in parallel.
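For illustration, the required number of thread blocks follows from a standard rounding-up integer division; the constant MAX_THREADS_PER_BLOCK is the same one used in the kernel below, the remaining names are placeholders of my own:

// Number of thread blocks needed so that all N vector elements are
// covered by blocks of (at most) MAX_THREADS_PER_BLOCK threads each.
// Adding MAX_THREADS_PER_BLOCK - 1 before dividing rounds up.
#define MAX_THREADS_PER_BLOCK 1024

int n_blocks = (N + MAX_THREADS_PER_BLOCK - 1) / MAX_THREADS_PER_BLOCK;

// Example: N = 2000 yields n_blocks = 2, i.e. one full block of
// 1024 threads plus one partially filled block.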

This per-block addition is possible because all threads of a thread block can share memory. Thus, we simply let one of the 1024 threads take care of the addition after all multiplications in the respective thread block have been carried out. To ensure that the addition only happens after all threads of a thread block have finished, we synchronise the execution of all threads of a thread block. Only after this first per-block addition step do we add up the results of the individual thread blocks sequentially.

In order to avoid race conditions between the different thread blocks, this needs to be carried out using an atomic addition command, i.e. a command that "happens in one step" instead of consisting of multiple, separate steps (reading from memory, performing the addition, writing back to memory). If we were to use a non-atomic addition command instead, the individual steps of the addition might "overlap" across thread blocks, and the sum would not be computed correctly.
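To make the danger more concrete, here is a small sketch (not code from the actual implementation) of the three hidden steps that a naive addition consists of, contrasted with the atomic version:

// WRONG: "*c += sum" silently decomposes into three separate steps:
//     float old_value = *c;               // 1. read from memory
//     float new_value = old_value + sum;  // 2. perform the addition
//     *c = new_value;                     // 3. write back to memory
// If two thread blocks both read *c before either has written back,
// one of the two partial sums is simply lost.

// CORRECT: the read-add-write cycle happens as one indivisible step.
atomicAdd(c, sum);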

The "kernel" routine to run on the GPU thus looks as follows:


__global__ void dot (float *a, float *b, float *c) {
    // This will be shared between all threads of a thread block
    // In this way, a single thread can add up the results of
    // all the threads in its thread block
    __shared__ float tmp[MAX_THREADS_PER_BLOCK];
       
    // Element-wise parallel multiplication across threads
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    if (index < N) {
        tmp[threadIdx.x] = a[index] * b[index];
    } else {
        // In the last thread block, we might have some
        // empty threads at the end.
        // Thus, we fill the corresponding vector elements
        // with zero, so that we may safely add up the
        // complete vector later.
        tmp[threadIdx.x] = 0;
    }
    
    // All threads wait here until every thread has reached this line.
    // Only then do we carry out the addition of the threads' results.
    __syncthreads();
    
    // Thread 0 ...
    if (threadIdx.x == 0) {
        // ... sums up the results of the block's threads ...
        float sum = 0;
        for (int i = 0; i < blockDim.x; i++) {
            sum += tmp[i];
        }
        
        // ... and then adds the sum to the return variable *c.
        // This needs to be an atomic command to avoid race conditions.
        atomicAdd(c, sum);
    }
}
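For completeness, here is a sketch of what the host side around this kernel might look like; a and b are the host input vectors, and the variable names (dev_a, dev_b, dev_c) as well as the exact structure are assumptions of mine rather than a verbatim copy of my code:

float c = 0.0f;
float *dev_a, *dev_b, *dev_c;

cudaMalloc(&dev_a, N * sizeof(float));
cudaMalloc(&dev_b, N * sizeof(float));
cudaMalloc(&dev_c, sizeof(float));

cudaMemcpy(dev_a, a, N * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(dev_b, b, N * sizeof(float), cudaMemcpyHostToDevice);
// Explicitly zero the accumulator -- forgetting this turns out to be
// one of the bugs discussed below.
cudaMemset(dev_c, 0, sizeof(float));

// n_blocks as computed above
dot<<<n_blocks, MAX_THREADS_PER_BLOCK>>>(dev_a, dev_b, dev_c);

// Copy back exactly sizeof(float) bytes -- getting this number wrong
// is the second bug discussed below.
cudaMemcpy(&c, dev_c, sizeof(float), cudaMemcpyDeviceToHost);

cudaFree(dev_a);
cudaFree(dev_b);
cudaFree(dev_c);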

Extensive Testing via Fuzzing

Now that we have implemented the algorithm, we can (and should!) test it. Of course, we could just build and run a number of "hand-crafted" test cases and check whether the results are correct. In fact, I did so first and, importantly, did not notice any problems with the implementation at all. However, this approach comes with a problem: if we construct all tests manually, we will end up testing only those things that we have explicitly thought of.

For this reason, I have decided to "fuzz" my implementation instead. "Fuzzing" refers to a testing approach where we generate a multitude of test cases randomly and throw them at our implementation. To check the correctness of the GPU calculation, we additionally calculate the result on the CPU as a reference and compare the two.

My concrete fuzzing strategy was based on 10'000 randomly generated test cases. Every test case consisted of a pair of vectors with 2'000 elements randomly and uniformly chosen from the range of \([-50.0; 50.0]\). When comparing the GPU result with the CPU result, I allowed for a maximum relative error of \(10^{-6}\).
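In code, such a fuzzing loop could look roughly as follows; the wrapper dot_on_gpu() (which runs the kernel and host-side code shown above) and the concrete structure are assumptions of mine:

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define N          2000
#define NUM_TESTS  10000

// Assumed wrapper around the CUDA code shown above
float dot_on_gpu(const float *a, const float *b);

// Uniform random float in [-50.0; 50.0]
static float random_element(void) {
    return -50.0f + 100.0f * ((float) rand() / (float) RAND_MAX);
}

void fuzz(void) {
    static float a[N], b[N];
    for (int t = 0; t < NUM_TESTS; t++) {
        // Generate a random test case
        for (int i = 0; i < N; i++) {
            a[i] = random_element();
            b[i] = random_element();
        }

        // Result computed on the GPU
        float gpu_result = dot_on_gpu(a, b);

        // Reference result computed on the CPU
        float cpu_result = 0.0f;
        for (int i = 0; i < N; i++)
            cpu_result += a[i] * b[i];

        // Compare, allowing a maximum relative error of 1e-6
        float rel_error = fabsf(gpu_result - cpu_result) / fabsf(cpu_result);
        if (rel_error > 1e-6f)
            printf("Test %d failed: CPU %f vs. GPU %f\n",
                   t, cpu_result, gpu_result);
    }
}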

Running the tests, some errors quickly emerged:

  1. I forgot to set the return variable dev_c to zero before executing the kernel. Performing only a single calculation still worked, because memory allocated on the GPU apparently is set to zero by default. (Or did I just get lucky?)
  2. While copying back the result from the GPU to the CPU, I specified a wrong number of bytes to copy. For some reason, the implementation still worked perfectly fine while I was running my manually constructed test cases. (I suspect this has something to do with the fact that my manual test cases were based on very small numbers, so perhaps all relevant bits were copied over anyway?)
  3. Finally, and importantly, once every few hundred or so test cases, a large floating point error popped up. This was the first problem I could not just easily fix, so I decided to examine it in more detail.

Examining the Floating Point Errors in Detail

From my fuzzing, I just knew that cases exist where the CPU and GPU results differed significantly; however, I had no way of telling which of the two results was the right one. For this reason, I came up with a (in my opinion pretty cool) strategy:

I changed my fuzzing strategy to randomly sampling integer vector elements, but did not change the implementation of the dot product routine itself, letting it continue to operate with floating-point numbers. With all input values being integers, it was possible to calculate a "ground truth" result on the CPU using long long int variables, thus avoiding all potential CPU floating point imprecisions.
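Sketched in code (continuing with the hypothetical helpers from the fuzzing snippet above), the idea looks roughly like this:

// Sample integer values, but keep storing them as floats, so that the
// GPU dot product routine itself stays completely unchanged.
for (int i = 0; i < N; i++) {
    a[i] = (float) (rand() % 101 - 50);   // integers in [-50; 50]
    b[i] = (float) (rand() % 101 - 50);
}

// Exact "ground truth" on the CPU: pure integer arithmetic,
// so no floating point rounding can occur here.
long long int exact = 0;
for (int i = 0; i < N; i++)
    exact += (long long int) a[i] * (long long int) b[i];

// The GPU still works on floats; any deviation from the exact
// result is therefore a genuine GPU floating point error.
float gpu_result = dot_on_gpu(a, b);
double abs_error = fabs(gpu_result - (double) exact);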

I then examined the floating point error of the GPU result in more detail, observing a correlation of its magnitude with the length of the input vectors (see diagram below). This naturally makes sense since larger input vectors require a larger number of (imprecise) floating point operations.

In addition, larger input vectors could also imply that input values are spread out more widely, which is known to further amplify floating point imprecisions. However, because I sampled the input vector elements uniformly from a rather small range of \([-50.0; 50.0]\), I don't think this should be too much of a problem.
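One way to measure this relationship is to sweep over increasing vector lengths and record the error for each length. The following is only a rough sketch: fill_with_random_integers(), exact_dot_cpu() and dot_on_gpu_n() are hypothetical helpers following the pattern of the snippets above, and a and b are assumed to be allocated large enough:

// Hypothetical sweep over increasing vector lengths
for (int n = 1000; n <= 1000000; n *= 10) {
    fill_with_random_integers(a, b, n);
    long long int exact = exact_dot_cpu(a, b, n);   // exact integer ground truth
    float gpu_result    = dot_on_gpu_n(a, b, n);    // floating point GPU result
    printf("n = %7d, abs. error = %f\n",
           n, fabs(gpu_result - (double) exact));
}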

Conclusions

Multiple things have become clear during this tiny little project, most importantly:

  1. Thread synchronisation and shared memory in CUDA are very easy to implement and extremely useful.
  2. The same applies to atomic operations for avoiding race conditions between different thread blocks.
  3. Fuzzing can easily reveal important problems which manually written tests may overlook, for instance floating point imprecisions.
  4. Such floating point imprecisions can, however, be exactly quantified and studied if suitable test cases are constructed.