CUDA old

From Montana Tech Computer Science Department

CUDA is NVIDIA's parallel computing platform and programming model for running code at higher performance on Graphics Processing Units (GPUs). CUDA can be used directly from C, C++, and Fortran, and through bindings or wrappers from C#, Java, and Python.

CUDA versions 4.2, 5.5, and 6.0 are installed in /opt/CUDA.

CUDA compiler: nvcc

CUDA file extension: .cu

CUDA Environment

CUDA binaries and libraries are installed in /opt/CUDA. To set the environment for using CUDA, use the module command:

module load cuda/6.0.37

or, for version 5.5:

module load cuda/5.5

Example of compiling a CUDA file

Please do your editing and compiling on the management node, and execute the program on a GPU node. To simply compile a CUDA file (file.cu is a placeholder for your own source file):

nvcc -arch=sm_30 file.cu

This generates a standard "a.out" executable in the current working directory. The -arch=sm_30 option selects the GPU architecture to compile for (compute capability 3.0).

Alternatively:

nvcc -arch=sm_30 -O3 -o outCUDA file.cu

This optimizes the serial (host) part of the code at level 3 and generates the executable "outCUDA".

GPU/CUDA Tesla K20 Architecture

Compute Capability: 3.5
Max Threads per Thread Block: 1024
Max Threads per SM: 2048
Max Thread Blocks per SM: 16
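The limits above interact: the number of threads that can be resident on one SM depends on the block size chosen at launch. The following is a minimal host-side sketch of that arithmetic, not an official occupancy calculator. The helper name residentThreadsPerSM is hypothetical, the constants are copied from the table, and register and shared-memory limits are ignored.

```cuda
#include <stdio.h>

// Limits copied from the Tesla K20 table on this page.
#define MAX_THREADS_PER_BLOCK 1024
#define MAX_THREADS_PER_SM    2048
#define MAX_BLOCKS_PER_SM     16

// Hypothetical helper: upper bound on the threads resident on one SM for a
// given block size. Only the three limits above are considered; register and
// shared-memory usage (which can lower the bound) are ignored.
int residentThreadsPerSM(int threadsPerBlock) {
    if (threadsPerBlock < 1 || threadsPerBlock > MAX_THREADS_PER_BLOCK)
        return 0;                                       // not a launchable block size
    int blocks = MAX_THREADS_PER_SM / threadsPerBlock;  // blocks that fit by thread count
    if (blocks > MAX_BLOCKS_PER_SM)
        blocks = MAX_BLOCKS_PER_SM;                     // capped by the block limit
    return blocks * threadsPerBlock;
}

int main() {
    int sizes[] = {1, 32, 128, 256, 768, 1024};
    for (int i = 0; i < 6; i++)
        printf("%4d threads/block -> at most %4d resident threads per SM\n",
               sizes[i], residentThreadsPerSM(sizes[i]));
    return 0;
}
```

Under these assumptions, a launch with one thread per block keeps at most 16 threads resident per SM, while any block size of 128 or more that divides 2048 evenly can reach the full 2048.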

CUDA C example program

//Hello CUDA program - Example of data computation on the Device and return Results to Host

#include <stdio.h>

#define COLUMNS 1023
#define ROWS 511

// Kernel: one thread block (with a single thread) per matrix element
__global__ void add(double *a, double *b, double *c) {

       int x = blockIdx.x;              // column index
       int y = blockIdx.y;              // row index
       int i = (COLUMNS*y) + x;         // row-major linear index
       c[i] = a[i] + b[i];
}

// Host matrices are static: about 4 MB each, too large for a typical stack
static double a[ROWS][COLUMNS], b[ROWS][COLUMNS], c[ROWS][COLUMNS];

int main() {

       double *dev_a, *dev_b, *dev_c;
       //================== Memory Allocation to Device =========================
       cudaMalloc((void **) &dev_a, ROWS*COLUMNS*sizeof(double));
       cudaMalloc((void **) &dev_b, ROWS*COLUMNS*sizeof(double));
       cudaMalloc((void **) &dev_c, ROWS*COLUMNS*sizeof(double));
       //================= Generate Matrices ====================================
       for (int y = 0; y < ROWS; y++) {
               for (int x = 0; x < COLUMNS; x++) {
                       a[y][x] = x;
                       b[y][x] = y;
               }
       }
       //================= Transfer data to Device from Host ====================
       cudaMemcpy(dev_a, a, ROWS*COLUMNS*sizeof(double), cudaMemcpyHostToDevice);
       cudaMemcpy(dev_b, b, ROWS*COLUMNS*sizeof(double), cudaMemcpyHostToDevice);
       //================= Dimension in how threads will be used ================
       dim3 grid(COLUMNS,ROWS);
       //================= Call to Kernel =======================================
       add<<<grid,1>>>(dev_a, dev_b, dev_c);
       //================= Transfer results to Host from Device =================
       cudaMemcpy(c, dev_c, ROWS*COLUMNS*sizeof(double), cudaMemcpyDeviceToHost);
       //================= Print results obtained from Device ===================
       for (int y = 0; y < ROWS; y++)
               for (int x = 0; x < COLUMNS; x++)
                       printf("[%d][%d]=%f ",y,x,c[y][x]);
       //=================== Free up Allocated Memory on Device ===================
       cudaFree(dev_a);
       cudaFree(dev_b);
       cudaFree(dev_c);
       return 0;
}


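The example program launches one thread per block, which leaves most of each SM idle given the 16-blocks-per-SM limit. As a possible variation, the sketch below flattens the matrix into a 1D array, launches blocks of 256 threads, and checks the return codes of the CUDA calls. The names addFlat, failed, runAdd, and THREADS_PER_BLOCK are hypothetical, and 256 is an assumed block size rather than a tuned value.

```cuda
#include <stdio.h>
#include <stdlib.h>

#define COLUMNS 1023
#define ROWS 511
#define N (ROWS * COLUMNS)
#define THREADS_PER_BLOCK 256   // assumption: a multiple of 32, at most 1024

// Each thread adds one element; the bounds check guards the partly
// filled last block when N is not a multiple of the block size.
__global__ void addFlat(const double *a, const double *b, double *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];
}

// Hypothetical helper: report a CUDA error and return nonzero on failure.
static int failed(cudaError_t err, const char *what) {
    if (err == cudaSuccess) return 0;
    fprintf(stderr, "%s: %s\n", what, cudaGetErrorString(err));
    return 1;
}

// Runs the addition on the device. Returns the number of wrong elements,
// or -1 if an allocation or CUDA call failed.
int runAdd(void) {
    size_t bytes = (size_t) N * sizeof(double);
    double *a = (double *) malloc(bytes);
    double *b = (double *) malloc(bytes);
    double *c = (double *) malloc(bytes);
    double *dev_a = NULL, *dev_b = NULL, *dev_c = NULL;
    int result = -1;

    if (a && b && c) {
        for (int i = 0; i < N; i++) {
            a[i] = i % COLUMNS;   // column index, as in the example program
            b[i] = i / COLUMNS;   // row index
        }
        int blocks = (N + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK;  // round up
        if (!failed(cudaMalloc((void **) &dev_a, bytes), "cudaMalloc a") &&
            !failed(cudaMalloc((void **) &dev_b, bytes), "cudaMalloc b") &&
            !failed(cudaMalloc((void **) &dev_c, bytes), "cudaMalloc c") &&
            !failed(cudaMemcpy(dev_a, a, bytes, cudaMemcpyHostToDevice), "copy a") &&
            !failed(cudaMemcpy(dev_b, b, bytes, cudaMemcpyHostToDevice), "copy b")) {
            addFlat<<<blocks, THREADS_PER_BLOCK>>>(dev_a, dev_b, dev_c, N);
            if (!failed(cudaGetLastError(), "kernel launch") &&
                !failed(cudaMemcpy(c, dev_c, bytes, cudaMemcpyDeviceToHost), "copy c")) {
                result = 0;
                for (int i = 0; i < N; i++)
                    if (c[i] != a[i] + b[i]) result++;   // values are exact integers
            }
        }
    }
    cudaFree(dev_a); cudaFree(dev_b); cudaFree(dev_c);   // cudaFree(NULL) is a no-op
    free(a); free(b); free(c);
    return result;
}

int main() {
    int wrong = runAdd();
    if (wrong < 0) { printf("CUDA error\n"); return 1; }
    printf("%d mismatches\n", wrong);
    return wrong != 0;
}
```

Checking the results on the host instead of printing all 522,753 elements keeps the output short; the printing loop from the example program can be put back if the values themselves are needed.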
Running CUDA programs

Once you have compiled your CUDA code, you will need to run it on one of the GPU nodes. Instructions on how to submit a job are detailed on the GPU Nodes page.
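Before running a larger program, it can help to confirm that the job landed on a node with a usable GPU. The sketch below is one way to do that with the CUDA runtime's device query calls; the function name listDevices is hypothetical, and the output depends on the node it runs on.

```cuda
#include <stdio.h>

// Prints one line per visible CUDA device.
// Returns the device count, or -1 if a CUDA runtime call fails
// (for example when the node has no GPU or no driver).
int listDevices(void) {
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceCount: %s\n", cudaGetErrorString(err));
        return -1;
    }
    for (int d = 0; d < count; d++) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, d) != cudaSuccess)
            return -1;
        printf("device %d: %s, compute capability %d.%d, %d SMs, "
               "max %d threads/block, max %d threads/SM\n",
               d, prop.name, prop.major, prop.minor, prop.multiProcessorCount,
               prop.maxThreadsPerBlock, prop.maxThreadsPerMultiProcessor);
    }
    return count;
}

int main() {
    // Exit status 0 only if at least one device was found.
    return listDevices() > 0 ? 0 : 1;
}
```

On a Tesla K20 node the reported compute capability and thread limits should match the architecture table on this page.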