Personal blog of Matthias M. Fischer


CUDA on the NVIDIA Jetson, Part 2: Game of Life, Benchmarking, and some surprising Profiling Results

3rd April 2022

Introduction

To start my experiments with the NVIDIA Jetson (which I set up a few days ago; see here for some notes on it), I've decided to implement a simulation of the classic Conway's Game of Life.

My implementation can be found here; an exemplary simulation run (slowed down by waiting 100ms between every step for better visibility) might look as follows:

In this post, I will share some interesting observations I made while benchmarking its execution time in comparison to a CPU-only implementation, and while profiling the GPU implementation in greater detail.

Some Implementation Details

As alluded to already in my previous post, there are some hardware-imposed limits on the maximum degree of parallelisation. In particular, on the Jetson Nano we can run at most 1024 threads per so-called thread block. The maximum number of thread blocks, however, is virtually unlimited (\( \sim 2^{63}\)).
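These limits can also be queried at runtime via the CUDA runtime API. A minimal sketch for doing so (not part of my actual implementation, just for illustration) might look like this:

    #include <cstdio>
    #include <cuda_runtime.h>

    // Query and print the relevant launch-configuration limits of device 0,
    // i.e. the Jetson's integrated GPU.
    int main() {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);

        printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
        printf("Max grid size:         %d x %d x %d blocks\n",
               prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
        return 0;
    }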

For this reason, to simulate a grid of \(i \times j\) cells, I decided to simply use \(i \times j\) thread blocks with one thread each. Since we don't need any of the mechanisms to synchronise or communicate between a thread block's threads, this strategy appears reasonable.
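As an illustration, a stripped-down sketch of such a kernel and its launch configuration might look roughly as follows. The names (step_kernel, WIDTH, HEIGHT, d_in, d_out) are placeholders, and the actual implementation linked above may differ in details such as how the grid boundaries are handled:

    #include <cuda_runtime.h>

    // One thread block per grid cell, one thread per block: each block's
    // (blockIdx.x, blockIdx.y) directly identifies the cell it updates.
    __global__ void step_kernel(const int *in, int *out, int width, int height) {
        int x = blockIdx.x;
        int y = blockIdx.y;

        // Count live neighbours, wrapping around the grid edges (toroidal topology).
        int alive = 0;
        for (int dy = -1; dy <= 1; ++dy) {
            for (int dx = -1; dx <= 1; ++dx) {
                if (dx == 0 && dy == 0) continue;
                int nx = (x + dx + width)  % width;
                int ny = (y + dy + height) % height;
                alive += in[ny * width + nx];
            }
        }

        // Standard Game of Life rule: a cell is alive in the next step if it has
        // exactly three live neighbours, or if it is alive and has exactly two.
        int self = in[y * width + x];
        out[y * width + x] = (alive == 3) || (self && alive == 2);
    }

    // Launch with a WIDTH x HEIGHT grid of blocks, each containing a single thread:
    //     step_kernel<<<dim3(WIDTH, HEIGHT), 1>>>(d_in, d_out, WIDTH, HEIGHT);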

CPU vs GPU Execution Time

First, I was interested in the time required for calculating e.g. 100 steps of the simulation as a function of the input grid size. In particular, I wanted to compare the execution time using either the CPU or the GPU. To measure the execution time, I simply counted the ticks elapsed between the start of the simulation (after randomly initialising the input array) and its end.
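A minimal sketch of this tick-counting approach, with a hypothetical run_simulation() standing in for either the CPU loop or the CUDA version, could look as follows:

    #include <cstdio>
    #include <ctime>

    // Placeholder for the actual simulation (CPU loop or CUDA kernel launches);
    // the real implementation is linked above.
    static void run_simulation(int *grid, int width, int height, int steps) { /* ... */ }

    int main() {
        const int WIDTH = 1000, HEIGHT = 1000;
        int *grid = new int[WIDTH * HEIGHT]();       // randomly initialised in the real code

        clock_t start = clock();                     // start counting ticks after initialisation
        run_simulation(grid, WIDTH, HEIGHT, 100);    // 100 simulation steps
        clock_t stop = clock();

        printf("Elapsed: %ld ticks (%.3f s)\n",
               (long)(stop - start), (double)(stop - start) / CLOCKS_PER_SEC);

        delete[] grid;
        return 0;
    }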

For the case of the CPU, I expected to see a linear increase in required computation time, since every grid cell is calculated sequentially in one and the same thread. Further, I expected GPU execution time to be independent of the grid size, because all grid cells are calculated in individual parallel threads at once.

The first expectation was clearly fulfilled, with CPU computation time scaling perfectly linearly with the input grid size (blue line in the plot below; note the slope of 1.0 in the log-log plot, indicating a linear relationship). However, GPU execution time was only constant up to a grid size of approximately \(10^6\) cells, beyond which a linear relationship started to emerge as well (orange line), which confused me.

Profiling GPU Execution using nvprof

To shed light on potential reasons for the increase in GPU execution time for larger grid sizes, I next ran the nvprof profiler on a single simulation of a randomly initialised \(2000 \times 2000\) grid:

            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
      API calls:   92.64%  5.20410s         1  5.20410s  5.20410s  5.20410s  cudaFree
                    6.61%  371.11ms         2  185.56ms  9.4499ms  361.66ms  cudaMalloc
                    0.59%  33.076ms         1  33.076ms  33.076ms  33.076ms  cudaMemcpy
                    0.16%  9.2675ms       202  45.878us  28.438us  908.61us  cudaLaunchKernel
                    0.00%  137.92us        97  1.4210us     625ns  28.178us  cuDeviceGetAttribute
                    0.00%  50.626us         3  16.875us  1.4580us  46.043us  cuDeviceGetCount
                    0.00%  9.6360us         1  9.6360us  9.6360us  9.6360us  cuDeviceTotalMem
                    0.00%  3.3330us         2  1.6660us  1.4060us  1.9270us  cuDeviceGet
                    0.00%  1.7710us         1  1.7710us  1.7710us  1.7710us  cuDeviceGetName
                    0.00%  1.0420us         1  1.0420us  1.0420us  1.0420us  cuDeviceGetUuid

Strikingly, 92.6% of the execution time (5.2 s) was spent on the one single cudaFree() which is called directly before the program terminates. Another 6.6% was spent on the two cudaMalloc() calls right at the beginning of the code. Only 0.16%, i.e. roughly 9 ms, was spent on the actual kernel launches. This suggested that setting up and tearing down the CUDA infrastructure can induce significant overhead, which might offset the computational efficiency gained through parallelisation.

A More Detailed Profiling

To study in more detail how the execution time of the different CUDA functions varies with input size, I went back to my "tick counting" approach, but this time counted the elapsed ticks separately for the four individual parts of the program: cudaMalloc(), cudaMemcpy(), the actual kernel launches, and cudaFree().
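A sketch of what this per-phase measurement might look like is shown below. It reuses the hypothetical step_kernel from above; the real program is structured somewhat differently (for instance, nvprof reported only a single cudaFree() call), so this is only meant to illustrate the idea:

    #include <cstdio>
    #include <ctime>
    #include <utility>
    #include <cuda_runtime.h>

    int main() {
        const int WIDTH = 2000, HEIGHT = 2000, STEPS = 100;
        const size_t bytes = (size_t)WIDTH * HEIGHT * sizeof(int);

        int *h_grid = new int[(size_t)WIDTH * HEIGHT]();   // randomly initialised in the real code
        int *d_in, *d_out;

        clock_t t0 = clock();
        cudaMalloc(&d_in,  bytes);                          // part 1: device memory allocation
        cudaMalloc(&d_out, bytes);
        clock_t t1 = clock();

        cudaMemcpy(d_in, h_grid, bytes, cudaMemcpyHostToDevice);   // part 2: copy grid to device
        clock_t t2 = clock();

        for (int s = 0; s < STEPS; ++s) {                   // part 3: kernel launches
            step_kernel<<<dim3(WIDTH, HEIGHT), 1>>>(d_in, d_out, WIDTH, HEIGHT);
            std::swap(d_in, d_out);                         // output becomes next step's input
        }
        clock_t t3 = clock();

        cudaFree(d_in);                                     // part 4: freeing device memory
        cudaFree(d_out);
        clock_t t4 = clock();

        printf("malloc: %ld  memcpy: %ld  kernels: %ld  free: %ld (ticks)\n",
               (long)(t1 - t0), (long)(t2 - t1), (long)(t3 - t2), (long)(t4 - t3));

        delete[] h_grid;
        return 0;
    }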

It's reassuring to see that the time spent on the actual simulation (green line) seems to be independent of the input grid size as expected. The only exception is the last data point for an input grid size of \(20'000 \times 20'000 = 400'000'000\) cells, where execution time seems to slightly increase for unknown reasons. I sadly couldn't extend the profiling to even larger input sizes due to a lack of memory.

It also makes sense that the time spent on cudaMemcpy() (orange line) somehow scales with input size; however, I don't really understand the nonlinearity of the relationship. Intuitively, I would have expected some linear relationship over the complete range of input sizes, but maybe some low-level optimisations ameliorate this for smaller input sizes.

Finally, the strongly nonlinear relationship between input size and the time spent on cudaMalloc() (blue line) and cudaFree() (red line) doesn't really make sense to me at all. Having written my own memory manager recently, I would have assumed that the time required for allocating and freeing memory would only depend on the number of memory blocks already allocated instead. Maybe looking at the source code of these functions will shed some light on this...

Conclusions and Outlook

This has been a super interesting little project and the first "real" piece of software I've ever written using CUDA. The specific profiling results for cudaMalloc(), cudaMemcpy(), and cudaFree() still don't really make much sense to me, as I don't know much about what exactly is happening under the hood when using them. I also don't quite understand why the results of the nvprof profiler and my own "manual" profiling show such a strong discrepancy, although I have some speculations about it. This is something I will want to examine in more detail in the future.

In any case, the results have made it apparent that being mindful of such "extra costs" of using CUDA is important, especially when working with larger amounts of data. They have also demonstrated nicely how detailed profiling can bring to light some -- potentially unexpected -- findings which may be worth examining further.

Until now, I have not made use of the possibility of running multiple threads per thread block, sharing memory between them, and synchronising their execution. Due to the trivially parallelisable nature of Conway's Game of Life, this was simply not required for this little project; however, I am now very eager to give it a try. For this reason, I'll next implement the vector product routine from this excellent tutorial by NVIDIA, which makes use of all of these features.

I will also examine in detail all the example projects provided in the CUDA SD card image to get some more inspiration for interesting things to try. Since I'm very much into modelling (biological) processes using systems of differential equations, I might implement some things related to this topic as well.

In any case, you can expect quite a bit more CUDA content in the very near future -- so stay tuned :).