# Compiling a program with CUDA

Kay contains a GPU partition with high-spec NVIDIA Volta V100 GPUs. Users can utilise these in various ways. The approach that gives the greatest freedom and control is to write programs in CUDA, NVIDIA's extension of C/C++ designed specifically for programming its GPUs. Here we give an example of writing and compiling a program that multiplies two large matrices from first principles using CUDA.

## Matrix Multiplication

The first thing is to load the correct module for CUDA on Kay:

`module load cuda/10.1.243`

It is recommended to use the latest module, as it is compatible with both the GCC and Intel compilers.

Next, we create a source file, `matmul.cu`, and include the CUDA header.

`#include <cuda.h>`

We now fill out the code. Note that the usual C/C++ headers are all available, but functions from the standard headers can generally only be called on the CPU side (in this case, in the `main` function).

```
#include <cuda.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char** argv)
{
    if (argc < 2) {
        printf("Usage: ./matmul N\n");
        printf("where N is a matrix dimension, ");
        printf("100 <= N <= 30000.\n");
        exit(0);
    }
    int N = atoi(argv[1]);
    if (N < 100 || N > 30000) {
        printf("Dimension is out of bounds!\n");
        exit(1);
    }
```

For now, let us do all the work on the GPU and not transfer the result back to the CPU. This is not very realistic, but it will at least give us a feel for how fast the GPU can compute a matrix multiplication. Thus we see our first CUDA-specific code:

```
matmul<<<numblocks,32>>>(N, M1_d, M2_d, M3_d);
cudaDeviceSynchronize();
```

This means 'run the `matmul` kernel, which takes the parameters N, M1_d, M2_d and M3_d, with numblocks blocks of 32 threads each'. The launch returns immediately, so to make the CPU wait for the GPU to finish we call `cudaDeviceSynchronize()`.
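Because the launch is asynchronous, any error in the launch configuration or in the kernel itself surfaces later. A common pattern is to check both the launch and the synchronisation; this is a sketch only, and the program below omits these checks for brevity:

```
matmul<<<numblocks,32>>>(N, M1_d, M2_d, M3_d);
cudaError_t err = cudaGetLastError();   // catches a bad launch configuration
if (err != cudaSuccess)
    printf("Launch failed: %s\n", cudaGetErrorString(err));
err = cudaDeviceSynchronize();          // catches errors during kernel execution
if (err != cudaSuccess)
    printf("Kernel failed: %s\n", cudaGetErrorString(err));
```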

Let us first see how we would naively compute the N x N matrix product M3 = M1 x M2 on a classical CPU:

```
void matmul(int N, float* M1, float* M2, float* M3)
{
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            for (int k = 0; k < N; k++) {
                M3[i*N + j] += M1[i*N + k] * M2[k*N + j];
            }
        }
    }
}
```

(where we assume M3 has been initialised to zero). We can view the philosophy of CUDA programming as replacing an 'outer loop' by the 'execution of thousands of threads in parallel': each thread performs one iteration of what was the outer loop.
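As a minimal illustration of this loop-to-threads mapping, consider a hypothetical `scale` routine (not part of matmul.cu) that multiplies every element of a vector by a constant:

```
// CPU version: an explicit loop over i
void scale_cpu(int N, float* v, float a)
{
    for (int i = 0; i < N; i++)
        v[i] *= a;
}

// GPU version: the loop disappears; each thread handles one value of i
__global__ void scale_gpu(int N, float* v, float a)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N)
        v[i] *= a;
}
```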

Here is how we might do the multiplication in CUDA:

```
__global__ void matmul(int N, float* M1, float* M2, float *M3)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) {
        for (int j = 0; j < N; j++) {
            for (int k = 0; k < N; k++) {
                M3[i*N + j] += M1[i*N + k] * M2[k*N + j];
            }
        }
    }
}
```

Here we meet the special built-in variables `blockIdx`, `blockDim` and `threadIdx`. For now it is enough to know that together they give each thread a unique global index, which we use as the row number. The keyword `__global__` indicates that the function is a GPU kernel.

We have not yet shown how to calculate `numblocks`. This requires a brief explanation:

```
int numblocks = (N + 31) / 32;   // integer arithmetic: this is ceil(N/32)
```

In fact we want to compute the so-called 'ceiling' of N / 32. This means we will have enough blocks of 32 threads to cover all N iterations of the outermost loop. The count is either exact, if N happens to be a multiple of 32, or we get one 'partial' block; that is why every thread checks that it is within bounds (the `if (i < N)` condition). Adding 31 before the integer division rounds any remainder up, which gives exactly the ceiling. This is somewhat technical, but considerations of this kind are very common in CUDA programming and debugging.

We have also not initialised the values in M1, M2 and M3. For this we create two further kernels: one to set the entries of the product matrix M3 to zero, and another to fill M1 or M2 with simple deterministic values in the range (-100.0, 100.0], for simplicity.

The finished matmul.cu looks like this:

```
#include <cuda.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

__global__ void matinit(int N, float* M);
__global__ void matzero(int N, float* M);
__global__ void matmul(int N, float* M1, float* M2, float *M3);

int main(int argc, char** argv)
{
    if (argc < 2) {
        printf("Usage: ./matmul N\n");
        printf("where N is a matrix dimension, ");
        printf("100 <= N <= 30000.\n");
        exit(0);
    }
    int N = atoi(argv[1]);
    if (N < 100 || N > 30000) {
        printf("Dimension is out of bounds!\n");
        exit(1);
    }
    // we represent each matrix as a 1-d array of N*N elements
    float* M1_d;
    float* M2_d;
    float* M3_d;
    // allocate memory on the device (i.e. the GPU) for the matrices
    cudaMalloc( (void**) &M1_d, N*N*sizeof(float) );
    cudaMalloc( (void**) &M2_d, N*N*sizeof(float) );
    cudaMalloc( (void**) &M3_d, N*N*sizeof(float) );
    // compute the number of 32-thread blocks required to
    // cover N iterations
    int numblocks = (N + 31) / 32; // ceil(N/32)
    // initialise the matrices. It is paramount to set M3 to zero;
    // the algorithm will not work correctly otherwise
    matinit<<<numblocks,32>>>(N, M1_d);
    matinit<<<numblocks,32>>>(N, M2_d);
    matzero<<<numblocks,32>>>(N, M3_d);
    cudaDeviceSynchronize();
    printf("Starting...");
    clock_t start = clock();
    matmul<<<numblocks,32>>>(N, M1_d, M2_d, M3_d);
    cudaDeviceSynchronize();
    clock_t stop = clock();
    printf("\nFinished!\n");
    double s = (double)(stop - start) / CLOCKS_PER_SEC;
    printf("Time taken: %0.3f seconds\n", s);
    // free device memory
    cudaFree(M3_d);
    cudaFree(M2_d);
    cudaFree(M1_d);
    exit(0);
}

__global__ void matmul(int N, float* M1, float* M2, float *M3)
{
    // global thread index: one thread per row of the result
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) {
        for (int j = 0; j < N; j++) {
            for (int k = 0; k < N; k++) {
                M3[i*N + j] += M1[i*N + k] * M2[k*N + j];
            }
        }
    }
}

__global__ void matinit(int N, float* M)
{
    // one thread per column j; fill with values in (-100, 100]
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (j < N) {
        for (int i = 0; i < N; i++)
            M[i*N + j] = 100.0f - 200.0f * j / N;
    }
}

__global__ void matzero(int N, float* M)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (j < N) {
        for (int i = 0; i < N; i++)
            M[i*N + j] = 0.0f;
    }
}
```

The CUDA function to allocate memory on the device (GPU) is `cudaMalloc`. It plays the role of C's `malloc`, but device memory must always be allocated with `cudaMalloc` rather than `malloc`.
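`cudaMalloc` returns a `cudaError_t`, and it is good practice to check it; the listing above omits these checks for brevity. A sketch of one checked allocation:

```
float* M1_d;
cudaError_t err = cudaMalloc( (void**) &M1_d, N*N*sizeof(float) );
if (err != cudaSuccess) {
    printf("cudaMalloc failed: %s\n", cudaGetErrorString(err));
    exit(1);
}
```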

We can compile `matmul.cu` with the following line:

```
nvcc matmul.cu -o matmul -arch=sm_70 -O3
```

where we must specify the GPU architecture (on Kay, for the V100s, it is sm_70). Run the program with e.g.

```
./matmul 10000
```

in an appropriate sbatch script, using the GpuQ partition. Using the V100s on Kay, the timings for this unoptimised kernel range from about 40 milliseconds for the smallest size to about 140 milliseconds for the largest size. Try writing a single-threaded CPU-only version of this routine and compare it to the GPU speeds.
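A minimal sbatch script might look like the following; the account name `myproj` is a placeholder for your own project, and the walltime and node settings are only an assumption:

```
#!/bin/bash
#SBATCH -p GpuQ           # the GPU partition on Kay
#SBATCH -N 1              # one node
#SBATCH -t 00:10:00       # ten-minute walltime (an assumption)
#SBATCH -A myproj         # placeholder: your project/account name

module load cuda/10.1.243
./matmul 10000
```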

A final remark on the matrix dimension. Each float entry takes 4 bytes, we have 3 matrices, and each one has N*N entries. If we allow the dimension to go up to 30000, this already consumes over 10 gigabytes of RAM, and each V100 on Kay has at most 16 gigabytes. It is a common bug to overallocate memory without noticing, only to find that the kernel (i.e. the GPU program) returns almost instantaneously. In that case it is failing at the memory allocation stage and not computing what we might think it is, which gives a false sense of performance, not to mention incorrect results. If you start the adventure of CUDA programming, be prepared for such situations, and if you need extra assistance, our support team is there to help!