CUDA on the NVIDIA Jetson, Part 1: Setup Notes
29th March 2022

Introduction
A few days ago, a good friend of mine lent me an NVIDIA Jetson Nano, a small computer with an NVIDIA Maxwell GPU.
Although it is primarily designed with AI-related purposes in mind (such as training neural networks for image recognition), it can also be used simply for learning how to write massively parallel algorithms for GPUs, e.g. for numerical mathematics. That is exactly what I have been doing; some posts about it will follow as soon as I have found the time for a proper writeup.
For now, I just want to mention some (very minor) issues I encountered during the setup process (which in itself is really easy and well documented) and how to fix them. I'll also share some small general tips for working with the Jetson which I found useful. All of this is primarily intended as a future reference for myself and (maybe) for other people as well.
Fixing the path in the .bashrc
After completing the setup, the nvcc compiler was not available in the terminal. Some googling revealed that this is easy to fix by adding the following lines to the ~/.bashrc:
export PATH=/usr/local/cuda/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
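Once these lines are in place, opening a new terminal (or running source ~/.bashrc in the current one) should make the compiler available; running nvcc --version is a quick way to verify this.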
Example projects
The Jetson comes with a bunch of example projects which are perfect for getting a first understanding of what can be done and how. The path we just added reveals their location: in my case, the examples can be found at /usr/local/cuda/samples.
All examples come with a Makefile. Building them requires superuser privileges, so just run sudo make.
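Each example directory contains a Makefile of its own as well, so a single sample can also be built by running sudo make inside just that directory. (Copying the samples tree to a writable location such as your home directory should also allow building without sudo.)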
deviceQuery: A particularly useful example project
The deviceQuery project in /.../samples/1_Utilities/deviceQuery is especially useful. It produces a printout of the system's specs, containing some important information for making the best use of its resources.
An example printout might look as follows:
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "NVIDIA Tegra X1"
CUDA Driver Version / Runtime Version 10.2 / 10.2
CUDA Capability Major/Minor version number: 5.3
Total amount of global memory: 3964 MBytes (4156719104 bytes)
( 1) Multiprocessors, (128) CUDA Cores/MP: 128 CUDA Cores
GPU Max Clock rate: 922 MHz (0.92 GHz)
Memory Clock rate: 13 Mhz
Memory Bus Width: 64-bit
L2 Cache Size: 262144 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 1 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: Yes
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device supports Compute Preemption: No
Supports Cooperative Kernel Launch: No
Supports MultiDevice Co-op Kernel Launch: No
Device PCI Domain ID / Bus ID / location ID: 0 / 0 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 10.2, CUDA Runtime Version = 10.2, NumDevs = 1
Result = PASS
Especially important for me were the lines "Maximum number of threads per block" and "Max dimension size of a grid size": they state how many threads a single block may contain and how many blocks a grid may span, i.e. the limits of the GPU's parallel processing abilities. They can also be queried programmatically, as sketched below. More information about that will follow in a future post.
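For completeness, here is a minimal sketch (my own, not one of the bundled samples) of how these limits can be read programmatically via the CUDA runtime function cudaGetDeviceProperties, which is handy when a program should adapt to whatever device it runs on:
#include <cstdio>
#include <cuda_runtime.h>
int main() {
    cudaDeviceProp prop;
    // Query the properties of device 0, the Jetson's only GPU.
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("Device name:            %s\n", prop.name);
    printf("Max threads per block:  %d\n", prop.maxThreadsPerBlock);
    printf("Max block dim (x,y,z):  (%d, %d, %d)\n",
           prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    printf("Max grid dim (x,y,z):   (%d, %d, %d)\n",
           prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    return 0;
}
Compiled with nvcc (the file name, say query_limits.cu, is arbitrary), this should print numbers matching the deviceQuery output above.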
Profiling software execution
A very helpful tool to profile the execution of a program is nvprof. However, due to some security concerns, NVIDIA has decided to require superuser privileges for running the profiler. Because the PATH variable might differ in a superuser context, it makes sense to execute the command as sudo $(which nvprof) ./software_to_profile.
When run from within a Makefile, escape the $ with another $, i.e.: sudo $$(which nvprof) ./software_to_profile
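To see what the profiler reports, any small CUDA program will do. As a toy example (again a sketch of my own, not from the samples), the following vector addition uses unified memory to keep the code short, which fits the Jetson well since its CPU and GPU share physical memory anyway:
#include <cstdio>
#include <cuda_runtime.h>
// Toy kernel: element-wise addition of two vectors.
__global__ void add(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}
int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);
    float *a, *b, *c;
    cudaMallocManaged(&a, bytes);
    cudaMallocManaged(&b, bytes);
    cudaMallocManaged(&c, bytes);
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }
    // One thread per element; 256 threads per block stays well below
    // the 1024-thread limit reported by deviceQuery.
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    add<<<blocks, threads>>>(a, b, c, n);
    cudaDeviceSynchronize();
    printf("c[0] = %f\n", c[0]);  // expect 3.0
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
After compiling with nvcc add.cu -o add, running sudo $(which nvprof) ./add should produce a short profile that includes the kernel's runtime.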
Conclusions
I've already started implementing some things using the Jetson and profiling their performance compared to calculations using the CPU instead of the GPU. Expect some posts on that in the near future. For now, I hope you have found these little pointers useful. Happy (parallel) programming!