CUDA is NVIDIA's parallel computing platform and programming model for running code at higher performance on Graphics Processing Units (GPUs). CUDA can currently be used from C, C++, C#, Fortran, Java and Python.
CUDA versions 6.0, 6.5 and 7.0 are installed in /opt/cuda.
CUDA compiler: nvcc
CUDA file extension: .cu
CUDA binaries and libraries are installed in /opt/cuda. To set up the environment for CUDA, use the module command:
module load cuda/6.0.37 (or, for version 6.5, module load cuda/6.5)
Example of compiling a CUDA file
Please edit and compile on the Management node, and run the program on a GPU node. To compile a CUDA file:
- nvcc -arch=sm_35 filename.cu
This generates an executable named "a.out" in the current working directory. The -arch=sm_35 option tells the compiler to target the sm_35 GPU architecture (compute capability 3.5). To optimize the code and name the executable:
- nvcc -arch=sm_35 -O3 filename.cu -o outCUDA
This applies level 3 optimization to the serial part of the code and generates the executable "outCUDA".
GPU/CUDA Tesla K20 Architecture
Compute Capability: 3.5
Max Threads per Thread Block: 1024
Max Threads per SM: 2048
Max Thread Blocks per SM: 16
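The limits listed above can be checked on a GPU node, and they put a ceiling on how many thread blocks one SM can hold at a time. The following is a minimal sketch, not an official tool: the helper name residentBlocksPerSM is hypothetical, device 0 is assumed to be the GPU of interest, and the 16-blocks-per-SM figure is hard-coded from the list above rather than queried. The bound counts threads only; register and shared memory usage can lower the real figure.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: upper bound on thread blocks resident on one SM,
// derived from thread-count limits only. Register and shared memory
// pressure are ignored, so the real figure may be lower.
int residentBlocksPerSM(int maxThreadsPerSM, int maxBlocksPerSM, int threadsPerBlock)
{
    int byThreads = maxThreadsPerSM / threadsPerBlock;
    return byThreads < maxBlocksPerSM ? byThreads : maxBlocksPerSM;
}

int main()
{
    cudaDeviceProp prop;
    // Assumption: device 0 is the GPU we want to inspect.
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("Device: %s\n", prop.name);
    printf("Compute capability: %d.%d\n", prop.major, prop.minor);
    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("Max threads per SM: %d\n", prop.maxThreadsPerMultiProcessor);

    // Assumption: 16 blocks per SM, taken from the documented limits.
    const int maxBlocksPerSM = 16;
    const int sizes[] = {64, 256, 1024};
    for (int i = 0; i < 3; ++i) {
        printf("%4d threads/block -> at most %d resident blocks per SM\n", sizes[i],
               residentBlocksPerSM(prop.maxThreadsPerMultiProcessor, maxBlocksPerSM, sizes[i]));
    }
    return 0;
}
```

With the K20 figures, 256 threads per block gives 2048 / 256 = 8 resident blocks per SM, while 64 threads per block would allow 32 by thread count but is capped at 16 by the block limit.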
CUDA C example program
The simple vector addition sample program vectorAdd.cu, located in /opt/cuda/cuda-6.0/samples/0_Simple/vectorAdd, is one of the official CUDA samples shipped with the CUDA Toolkit. It fills two float vectors with random values and uses the GPU to compute their element-wise sum. At the end, the GPU result is compared with the CPU result to verify that it is correct.
More sample programs can be found at /opt/cuda/cuda-6.0/samples/
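To make the steps described for the vector addition sample concrete, here is a condensed sketch of the same pattern. It is not the source of the shipped vectorAdd.cu: the names vectorAddSketch and blocksFor are hypothetical, the 1e-5 tolerance is an assumption, and error handling is kept to a minimum.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cmath>
#include <cuda_runtime.h>

// Each thread adds one pair of elements. The bounds check protects the
// last block, which is usually only partly filled.
__global__ void vectorAddSketch(const float *a, const float *b, float *c, int n)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) {
        c[i] = a[i] + b[i];
    }
}

// Hypothetical helper: blocks needed to cover n elements (ceiling division).
int blocksFor(int n, int threadsPerBlock)
{
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

int main()
{
    const int n = 50000;
    const size_t bytes = n * sizeof(float);

    // Host vectors filled with random values in [0, 1].
    float *hA = (float *)malloc(bytes);
    float *hB = (float *)malloc(bytes);
    float *hC = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) {
        hA[i] = rand() / (float)RAND_MAX;
        hB[i] = rand() / (float)RAND_MAX;
    }

    // Device vectors, then copy the inputs from host to device.
    float *dA, *dB, *dC;
    cudaMalloc((void **)&dA, bytes);
    cudaMalloc((void **)&dB, bytes);
    cudaMalloc((void **)&dC, bytes);
    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

    const int threads = 256;
    const int blocks = blocksFor(n, threads);
    vectorAddSketch<<<blocks, threads>>>(dA, dB, dC, n);
    cudaError_t err = cudaGetLastError();  // reports launch failures
    if (err != cudaSuccess) {
        fprintf(stderr, "Kernel launch failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Copy the result back; this call waits for the kernel to finish.
    cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);

    // Compare the GPU result with a CPU computation of the same sum.
    bool ok = true;
    for (int i = 0; i < n; ++i) {
        if (fabs(hA[i] + hB[i] - hC[i]) > 1e-5) {
            ok = false;
            break;
        }
    }
    printf("Test %s\n", ok ? "PASSED" : "FAILED");

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    free(hA); free(hB); free(hC);
    return ok ? 0 : 1;
}
```

For 50000 elements and 256 threads per block, blocksFor returns (50000 + 255) / 256 = 196, which matches the "196 blocks of 256 threads" reported in the sample's output.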
Compile the code
- module load cuda
- nvcc -arch=sm_35 vectorAdd.cu (copy the program to your home directory or give the full path)
The above command will create an executable file named ‘a.out’. Alternatively, you may specify your executable filename:
- nvcc -arch=sm_35 vectorAdd.cu -o outCUDA
To run the CUDA program, you need to request a GPU node. A sample batch file for requesting the GPU node and running the above sample program is provided:
- #PBS -l nodes=1:ppn=1
- #PBS -l feature=gpunode
- #PBS -N GPUJob
- #PBS -l walltime=00:05:00
- cd $PBS_O_WORKDIR
When the job runs the compiled executable (for example, ./a.out), the sample program prints output similar to the following:
- [Vector addition of 50000 elements]
- Copy input data from the host memory to the CUDA device
- CUDA kernel launch with 196 blocks of 256 threads
- Copy output data from the CUDA device to the host memory
- Test PASSED
Running CUDA programs
Once you have compiled your CUDA code, you will need to run it on one of the GPU Nodes. Instructions on how to submit a job are detailed on the GPU Nodes page.