
GPGPU Infrastructure

There is growing interest in programming general-purpose graphics processing units (GPGPUs) within CPU-based heterogeneous architectures, both to meet the demand for high-end computing resources and to model more complex problems. ICHEC is a major player in GPGPU computing research at the European level. To obtain access to the GPU facility on the ICHEC clusters, please open a helpdesk call.

Technical Specifications

There are currently 32 NVIDIA K20X GPU cards on Fionn (2 cards per compute node).

Hardware Specifications for K20X

More details can be found in the Tesla-Kepler-Family-Datasheet.

The GPGPU Environment

The CUDA programming model is available to facilitate GPGPU computing on Fionn.

To use a version of CUDA on Fionn, load the relevant environment module:

module load dev cuda/5.5

More details can be found on the ICHEC CUDA software webpage: CUDA
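Once the module is loaded, the CUDA toolchain is available in your environment. A quick sanity check (nvcc --version and nvidia-smi are standard NVIDIA tools; note that nvidia-smi only reports devices when run on a GPU-equipped node) is:

mlysaght@fionn3:~> nvcc --version
mlysaght@fionn3:~> nvidia-smi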

Compiling a Simple Vector Addition Program

Using CUDA

//VecAdd.cu
#include <stdio.h>
#include <stdlib.h>
#define N 1024
#define BLOCK_SIZE 32

__global__ void VecAdd(float *a, float *b, float *c){
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < N){
        c[i] = a[i] + b[i];
    }
}

int main () {
    int i;
    size_t size = N * sizeof(float);

    /*Vector allocation on host*/
    float *a, *b, *c;
    a = (float *)malloc(size);
    b = (float *)malloc(size);
    c = (float *)malloc(size);

    /*Initialise vectors*/
    printf("Initialising vectors\n");
    for (i = 0; i < N; i++){
        a[i] = 1.0f; b[i] = 2.0f; c[i] = 0.0f;
    }

    /*Allocate device memory for a, b and c*/
    printf("Allocating vectors in device memory\n");
    float *dev_a, *dev_b, *dev_c;
    cudaMalloc((void **)&dev_a, size);
    cudaMalloc((void **)&dev_b, size);
    cudaMalloc((void **)&dev_c, size);

    /*Set up the grid for the kernel*/
    dim3 block(BLOCK_SIZE), grid((N + BLOCK_SIZE - 1) / BLOCK_SIZE);

    /*Copy a and b to the device*/
    printf("Copying vectors to the device\n");
    cudaMemcpy(dev_a, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b, size, cudaMemcpyHostToDevice);

    /*Launch the vector addition kernel*/
    printf("Launching kernel\n");
    VecAdd<<<grid, block>>>(dev_a, dev_b, dev_c);
    cudaDeviceSynchronize();   /*kernel launches are asynchronous*/
    printf("Kernel finished\n");

    /*Copy dev_c back to the host*/
    printf("Copying vector to the host\n");
    cudaMemcpy(c, dev_c, size, cudaMemcpyDeviceToHost);

    cudaFree(dev_a);
    cudaFree(dev_b);
    cudaFree(dev_c);
    free(a);
    free(b);
    free(c);

    return 0;
}
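The example above omits error checking for brevity. As a minimal sketch (using the standard CUDA runtime calls cudaGetLastError, cudaDeviceSynchronize and cudaGetErrorString, which are not part of the original example), the kernel launch could be checked as follows:

/*Sketch: catch launch-configuration errors and errors raised during execution*/
VecAdd<<<grid, block>>>(dev_a, dev_b, dev_c);
cudaError_t err = cudaGetLastError();
if (err == cudaSuccess) err = cudaDeviceSynchronize();
if (err != cudaSuccess){
    fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
    return 1;
}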

The nvcc compiler is NVIDIA's CUDA compiler. CUDA code runs on both the CPU and the GPU. nvcc separates these two parts: it passes the host code (the part that runs on the CPU) to a host C compiler such as GCC or the Intel C/C++ Compiler, and compiles the device code (the part that runs on the GPU) itself into GPU code.

To compile a simple Vector Addition program enabled for the GPUs on Fionn, simply load the cuda/5.5 module and compile your CUDA source code (all CUDA source code files end with the .cu suffix) with nvcc as shown below:

mlysaght@fionn3:~> module load dev cuda/5.5

mlysaght@fionn3:~> nvcc VecAdd.cu -o VecAdd

For more information about the development environment please see: cuda toolkit
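By default nvcc generates code for a generic GPU architecture. Since the K20X cards are compute capability 3.5 devices, it can be worth targeting them explicitly with the standard -arch flag; the exact flag value shown here is an assumption, not taken from the original instructions:

mlysaght@fionn3:~> nvcc -arch=sm_35 VecAdd.cu -o VecAdd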

Using OpenACC

//VecAdd.c
#include <stdio.h>
#include <stdlib.h>
#define N 1024

int main () {
    int i;

    /*Vector allocation on host*/
    float *a, *b, *c;
    a = (float *)malloc(N * sizeof(float));
    b = (float *)malloc(N * sizeof(float));
    c = (float *)malloc(N * sizeof(float));

    /*Initialise vectors*/
    printf("Initialising vectors\n");
    for (i = 0; i < N; i++){
        a[i] = 1.0f; b[i] = 2.0f; c[i] = 0.0f;
    }

    /*Apply the vector addition; the compiler generates and launches the GPU kernel*/
    printf("Vector Addition:\n");
    #pragma acc kernels loop independent
    for (i = 0; i < N; i++){
        c[i] = a[i] + b[i];
    }

    free(a);
    free(b);
    free(c);

    return 0;
}
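With the kernels directive above, the compiler decides what data to move between the host and the GPU. The transfers can also be stated explicitly with an OpenACC data region; the following is a sketch using the standard copyin/copyout clauses rather than code from the original example:

/*Sketch: explicit data movement around the accelerated loop*/
#pragma acc data copyin(a[0:N], b[0:N]) copyout(c[0:N])
{
    #pragma acc kernels loop independent
    for (i = 0; i < N; i++){
        c[i] = a[i] + b[i];
    }
}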

The pgcc compiler is the PGI C compiler, which supports OpenACC. For Fortran, you can use pgfortran or pgf90. To compile a simple Vector Addition program enabled for the GPUs on Fionn, simply load the pgi/12.8 module and compile your C source code with pgcc as shown below:

mlysaght@fionn3:~> module load dev pgi/12.8

mlysaght@fionn3:~> pgcc VecAdd.c -o VecAdd -acc

You can use the -Minfo=acc compiler option to enable informational messages from the compiler. For more information about the development environment please see: OpenACC official web page.
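For example, recompiling the vector addition code with the messages enabled shows how the compiler handles the loop:

mlysaght@fionn3:~> pgcc VecAdd.c -o VecAdd -acc -Minfo=acc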

GPGPU-enabled Applications on ICHEC systems

GPU-enabled versions of several software packages are also available at ICHEC. For job submission details and benchmark results, please see the relevant software webpages.

Job Submission

Job Submission Example on Fionn

#!/bin/bash
#PBS -N MyJobName
#PBS -j oe
#PBS -r n
#PBS -A MyProjectCode
#PBS -l nodes=1:ppn=20
#PBS -l walltime=00:10:00
#PBS -q GpuQ
module load dev
#For CUDA v5.5
module load cuda/5.5
cd $PBS_O_WORKDIR
#For CUDA version only
./executable
#For 2 MPI processes, 1 node
mpiexec -n 2 -ppn 2 ./executable
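Assuming the script above is saved as gpu_job.pbs (the filename is only illustrative), it can be submitted and monitored with the standard PBS commands:

mlysaght@fionn3:~> qsub gpu_job.pbs
mlysaght@fionn3:~> qstat -u $USER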

Job Submission Example on Stoney (retired)

#!/bin/bash
#PBS -N MyJobName
#PBS -j oe
#PBS -r n
#PBS -A MyProjectCode
#PBS -l nodes=1:ppn=8
#PBS -l walltime=00:10:00
#PBS -q GpuQ
#For CUDA v4.1
module load cuda/4.1
#For OpenACC
module load pgi/12.8
cd $PBS_O_WORKDIR
#For CUDA/OpenACC version only
./executable
#For 2 MPI processes, 1 node
mpiexec -n 2 -npernode 2 ./executable

Simple Benchmarks

The figure below shows the performance of DGEMM using CUBLAS and MAGMA versus Intel MKL BLAS. All tests were conducted on the Hybrid partition of Fionn (2x 10-core 2.2 GHz Intel Ivy Bridge with 64 GB of RAM and two NVIDIA K20X GPUs per node). For the CUBLAS tests, CUDA v5.5 was used. For the MAGMA tests, MAGMA v1.4.1 and CUDA v5.0 were used. For the Intel MKL tests, Intel v14.0.0 was used and KMP_AFFINITY was set to "nowarnings,granularity=fine,scatter". CUBLAS delivers near-peak performance for matrix sizes larger than 5000. For N=10000, CUBLAS performs 1.87x better than MAGMA and 3.15x better than Intel MKL with 20 threads.
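As an illustration of how a DGEMM call is issued through CUBLAS, the following is a minimal sketch using the cublas_v2 API; it is not the benchmark code itself, and the matrix size and initial values are only placeholders:

//DgemmSketch.cu
#include <stdio.h>
#include <stdlib.h>
#include <cublas_v2.h>

int main(){
    int n = 1024;                        /*matrix dimension; the benchmarks ran up to N=10000*/
    size_t bytes = (size_t)n * n * sizeof(double);
    double alpha = 1.0, beta = 0.0;

    /*Allocate and initialise host matrices*/
    double *A = (double *)malloc(bytes);
    double *B = (double *)malloc(bytes);
    double *C = (double *)malloc(bytes);
    for (int i = 0; i < n * n; i++){ A[i] = 1.0; B[i] = 2.0; C[i] = 0.0; }

    /*Allocate device matrices and copy the inputs over*/
    double *dA, *dB, *dC;
    cudaMalloc((void **)&dA, bytes);
    cudaMalloc((void **)&dB, bytes);
    cudaMalloc((void **)&dC, bytes);
    cudaMemcpy(dA, A, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B, bytes, cudaMemcpyHostToDevice);

    /*C = alpha*A*B + beta*C on the GPU*/
    cublasHandle_t handle;
    cublasCreate(&handle);
    cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, dA, n, dB, n, &beta, dC, n);
    cudaMemcpy(C, dC, bytes, cudaMemcpyDeviceToHost);
    printf("C[0] = %f\n", C[0]);         /*expect 2*n for these inputs*/

    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    free(A); free(B); free(C);
    return 0;
}

The sketch can be compiled with nvcc by linking against the CUBLAS library, e.g. nvcc DgemmSketch.cu -o DgemmSketch -lcublas.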