Personal blog of Matthias M. Fischer


CUDA on the NVIDIA Jetson, Part 1: Setup Notes

29th March 2022

Introduction

Some days ago, a good friend of mine lent me an NVIDIA Jetson Nano, a small single-board computer with an NVIDIA Maxwell GPU.

(Picture taken from the NVIDIA page.)

Although generally designed with AI-related purposes in mind (such as training neural networks for image recognition), it can nonetheless also be used for just learning how to write massively parallel algorithms for GPUs, e.g. for numerical mathematics -- which is exactly what I have been doing. Some posts about that will follow as soon as I have found the time for a proper write-up.

For now, I just want to mention some (super minor) issues I encountered during the setup process (which in itself is really easy, see for instance here) and how to fix them. I'll also share some small general tips for working with the Jetson that I found useful. All of this is primarily intended as a future reference for myself and (maybe) for other people as well.

Fixing the path in the .bashrc

After completing the setup, the CUDA compiler nvcc was not available in the terminal. Some googling revealed that this is easy to fix by adding the following lines to the ~/.bashrc:

export PATH=/usr/local/cuda/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}

Example projects

The Jetson comes with a bunch of example projects which are perfect for getting a first understanding of what can be done and how. The path we just added also points to their location: in my case, the examples can be found at /usr/local/cuda/samples.

All examples come with a Makefile. Building them in place requires super-user privileges (the samples directory is owned by root), so just run sudo make.

deviceQuery: A particularly useful example project

The deviceQuery project in /.../samples/1_Utilities/deviceQuery is especially useful. It produces a printout of the system's specs, containing some important information for making best use of its resources.

An example printout might look as follows:

./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "NVIDIA Tegra X1"
  CUDA Driver Version / Runtime Version          10.2 / 10.2
  CUDA Capability Major/Minor version number:    5.3
  Total amount of global memory:                 3964 MBytes (4156719104 bytes)
  ( 1) Multiprocessors, (128) CUDA Cores/MP:     128 CUDA Cores
  GPU Max Clock rate:                            922 MHz (0.92 GHz)
  Memory Clock rate:                             13 Mhz
  Memory Bus Width:                              64-bit
  L2 Cache Size:                                 262144 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 32768
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 1 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            Yes
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Compute Preemption:            No
  Supports Cooperative Kernel Launch:            No
  Supports MultiDevice Co-op Kernel Launch:      No
  Device PCI Domain ID / Bus ID / location ID:   0 / 0 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 10.2, CUDA Runtime Version = 10.2, NumDevs = 1
Result = PASS

Especially important for me were the lines "Maximum number of threads per block" and "Max dimension size of a grid size". These describe the limits of the GPU's parallel processing capabilities. More information about that will follow in a future post.

Profiling software execution

A very helpful tool for profiling the execution of a program is nvprof. However, due to some security concerns, NVIDIA has decided to require super-user privileges for running the profiler. Because the PATH variable might differ in the super-user context, it makes sense to execute the command as sudo $(which nvprof) ./software_to_profile.

When run from within a Makefile, escape the $ with another $, i.e.: sudo $$(which nvprof) ./software_to_profile
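For reference, a corresponding Makefile target might look like the following sketch (the target name is, of course, just a placeholder):

```make
profile: software_to_profile
	sudo $$(which nvprof) ./software_to_profile
```

Here make expands $$ to a single $ before handing the line to the shell, which then performs the $(which nvprof) substitution as usual.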

Conclusions

I've already started implementing some things on the Jetson and profiling their performance against equivalent CPU implementations. Expect some posts on that in the near future. For now, I hope you have found these little pointers useful. Happy (parallel) programming!