Ireland's High-Performance Computing Centre | ICHEC

Xeon Phi Infrastructure

There are currently sixteen compute nodes, each with two Xeon Phi cards, on fionn within the accelerators partition. We shall call this the Xeon Phi partition (XPp). In addition, there are two extra Xeon Phi cards on the shared memory partition.

Technical Details

Each compute node on the XPp comprises a host with 20 cores (dual-socket Intel(R) Xeon(R) CPU E5-2660 v2 @ 2.20GHz), 64GiB of RAM and two Xeon Phi 5110P cards, each with 8GiB of RAM. The table below summarises the host and Xeon Phi characteristics at ICHEC.

Property                 CPU (2 per node)                      Xeon Phi (2 per node)
Model                    Intel Xeon E5-2660 v2 (Ivy Bridge)    Intel Xeon Phi 5110P
Cores                    10                                    60
Threads                  20 (2 HT per core)                    240 (4 HW threads per core)
Clock Speed              2.20 GHz                              1.053 GHz
Memory                   64 GB                                 8 GB
L1 Cache                 32 KB (per core)                      32 KB (per core)
L2 Cache                 256 KB (per core)                     512 KB (per core)
L3 Cache                 25 MB (shared)                        none
Vector Unit              256 bit (4 DP)                        512 bit (8 DP)
Flops/clock              8                                     16
Theoretical Peak (DP)    352.8 GF/s                            1010.88 GF/s
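
The peak figures in the table can be checked as cores x clock speed x flops per clock. The short C sketch below does this arithmetic; the function name peak_gflops is our own choice for illustration. For one Xeon Phi card it gives 60 x 1.053 x 16 = 1010.88 GF/s, as listed. For the host it gives 10 x 2.20 x 8 = 176 GF/s per socket, i.e. 352 GF/s for both sockets, marginally below the 352.8 GF/s listed; the table does not say how that value was obtained.

```c
#include <assert.h>

/* Theoretical double-precision peak in GF/s:
   number of cores x clock speed in GHz x flops per clock per core.
   peak_gflops is a hypothetical helper name used only for this illustration. */
double peak_gflops(int cores, double clock_ghz, int flops_per_clock)
{
    assert(cores > 0 && clock_ghz > 0.0 && flops_per_clock > 0);
    return cores * clock_ghz * flops_per_clock;
}
```

For example, peak_gflops(60, 1.053, 16) evaluates the Xeon Phi row and 2 * peak_gflops(10, 2.20, 8) the two host sockets together.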

Building Applications

Applications can run on a Xeon Phi in two modes: native and offload. For native mode one only needs to build the application in a specific way (see below); however, while it may be relatively simple to build and run a native application, achieving good performance can be a challenge. Offload mode has to be coded explicitly. There are different ways to achieve this, two of which are OpenMP 4.0 and Intel offload directives. Below you can find Hello World versions for offload with OpenMP 4.0.

Software infrastructure

MPSS 3.1.1
Flash Version : 2.1.02.0390
SMC Firmware Version : 1.16.5078
SMC Boot Loader Version : 1.8.4326
uOS Version : 2.6.38.8+mpss3.1.1

The available software can be accessed via modules by loading the category phi:

module load phi

At the moment we have the Intel compilers, Intel Math Kernel Library, Intel MPI, Intel VTune, Intel Trace Analyzer and Collector, and gcc 4.9 available for all users (gcc 4.9 supports OpenMP 4.0 but does not support offload). Due to the novelty of the platform we have installed several versions of each of the above.

------------------------------------------------------- /ichec/modulefiles/phi -----------------------------
gcc/4.9.0 intel/impi/4.1.2.040 intel/itac/9.0.0.007 intel/runtime/3.1.1-14.0.1
intel/comp/14.0.0 intel/impi/4.1.3.048 intel/mkl/11.1.0 intel/runtime/3.1.1-14.0.2
intel/comp/14.0.1 intel/impi/5.0.0.007 intel/mkl/11.1.1 intel/vtune/2013.16
intel/comp/14.0.2 intel/itac/8.1.3.037 intel/mkl/11.1.2 valgrind/gcc/3.9.0
intel/impi/4.1.1.036 intel/itac/8.1.4.045 intel/runtime/3.1.1-14.0.0

Access

On each node the cards are configured as separate network nodes that share a file system with the host. For example, if one is allocated a node, e.g. service61, one also gets its two cards, mic0 and mic1, which can be reached with ssh mic0 or ssh service61-mic0. The cards can also be accessed via the SCIF interface.

Hello World!

Native mode

Let us start with a simple example: a hello world with one OpenMP region; see hello_v1.F90 (Fortran version), hello_v1.cpp (C++ version) or hello_v1.c (C version).
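
The linked source files are not reproduced on this page. To give a rough idea of what such a program contains, here is a minimal C sketch that prints the same kind of lines as the runs shown further down. It is our own illustration, not the actual hello_v1.c: the function name hello and the serial fallbacks are assumptions, while __MIC__ is the macro the Intel compiler predefines when compiling for the card.

```c
#include <stdio.h>

#ifdef _OPENMP
#include <omp.h>
#else
/* Serial fallbacks so the sketch also builds when OpenMP is not enabled. */
static int omp_get_thread_num(void)  { return 0; }
static int omp_get_num_threads(void) { return 1; }
static int omp_get_max_threads(void) { return 1; }
#endif

/* Hypothetical stand-in for the body of hello_v1.c.
   Every thread prints a greeting; the function returns the team size. */
int hello(void)
{
    int nthreads = 1;

#ifdef __MIC__   /* predefined by the Intel compiler when targeting the card */
    printf("Running on Xeon Phi\n");
#else
    printf("Running on host\n");
#endif

#pragma omp parallel
    {
        printf("Hello from thread id %d\n", omp_get_thread_num());

#pragma omp single
        nthreads = omp_get_num_threads();   /* written by one thread only */
    }

    printf("The number of threads is %d\n", nthreads);
    printf("Maximum number of threads is %d\n", omp_get_max_threads());
    printf("=============================\n");
    return nthreads;
}
```

Calling hello() from a main function gives a complete program. Because the only difference between host and card is the __MIC__ test, the same source serves both the host build and the -mmic build.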

To build the Fortran code for the host:

module load phi intel/comp/14.0.2
ifort -o hello_v1.X hello_v1.F90 -fopenmp

for C++:

module load phi intel/comp/14.0.2
icpc -o hello_v1.X hello_v1.cpp -openmp

for C:

module load phi intel/comp/14.0.2
icc -o hello_v1.X hello_v1.c -fopenmp

then we run it as usual:

[alin@service61: ...hello]: OMP_NUM_THREADS=10 ./hello_v1.X
Running on host
Hello from thread id 7
Hello from thread id 2
Hello from thread id 0
Hello from thread id 5
Hello from thread id 8
Hello from thread id 4
Hello from thread id 1
Hello from thread id 6
Hello from thread id 3
Hello from thread id 9
The number of threads is 10
Maximum number of threads is 10
=============================

To generate a native binary, add -mmic to the previous build line. In Fortran:

[alin@service61: ...hello]: ifort -o hello_v1.MIC hello_v1.F90 -openmp -mmic

In C++:

[clalanne@service61: ...hello]: icpc -o hello_v1.MIC hello_v1.cpp -openmp -mmic

In C:

[alin@service61: ...hello]: icc -o hello_v1.MIC hello_v1.c -fopenmp -mmic

There are two different ways to run it:

    via ssh into the card
  • [alin@service61: ...hello]: ssh mic0 LD_LIBRARY_PATH=$MIC_LD_LIBRARY_PATH $(pwd)/hello_v1.MIC
    Running on Xeon Phi
    Hello from thread id 144
    Hello from thread id 24
    Hello from thread id 12
    Hello from thread id 72
    Hello from thread id 130
    ...
    Hello from thread id 7
    Hello from thread id 34
    Hello from thread id 33
    Hello from thread id 6
    Hello from thread id 195
    The number of threads is 240
    Maximum number of threads is 240
    =============================
  • [alin@service61: ...hello]: ssh mic0 LD_LIBRARY_PATH=$MIC_LD_LIBRARY_PATH OMP_NUM_THREADS=10 $(pwd)/hello_v1.MIC
    Running on Xeon Phi
    Hello from thread id 0
    Hello from thread id 5
    Hello from thread id 6
    Hello from thread id 9
    Hello from thread id 3
    Hello from thread id 8
    Hello from thread id 7
    Hello from thread id 4
    Hello from thread id 2
    Hello from thread id 1
    The number of threads is 10
    Maximum number of threads is 10
    =============================
    using micnativeloadex
  • [alin@service61: ...hello]: micnativeloadex ./hello_v1.MIC -d 0
    Running on Xeon Phi
    Hello from thread id 0
    Hello from thread id 68
    Hello from thread id 9
    Hello from thread id 136
    ...
    Hello from thread id 96
    Hello from thread id 31
    Hello from thread id 143
    Hello from thread id 111
    The number of threads is 236
    Maximum number of threads is 236
    =============================
  • [alin@service61: ...hello]: micnativeloadex ./hello_v1.MIC -d 0 -e "OMP_NUM_THREADS=10"
    Running on Xeon Phi
    Hello from thread id 2
    Hello from thread id 1
    Hello from thread id 0
    Hello from thread id 8
    Hello from thread id 9
    Hello from thread id 4
    Hello from thread id 3
    Hello from thread id 6
    Hello from thread id 7
    Hello from thread id 5
    The number of threads is 10
    Maximum number of threads is 10
    =============================

Offload mode with one card

The next example is a hello world with one OpenMP region which is run on the host and also offloaded to a card; see hello_v2.F90 (Fortran version), hello_v2.cpp (C++ version) or hello_v2.c (C version).
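
As the source is only linked, here is a hedged C sketch of how such an offload can be written with OpenMP 4.0 directives. It is not the actual hello_v2.c: the names hello and hello_card_then_host and the serial fallbacks are our own assumptions. The function called inside the target region is enclosed in declare target so that it is also compiled for the card; if no card is available, an OpenMP 4.0 target region falls back to running on the host.

```c
#include <stdio.h>

#ifdef _OPENMP
#include <omp.h>
#else
/* Serial fallbacks so the sketch also builds when OpenMP is not enabled. */
static int omp_get_thread_num(void)  { return 0; }
static int omp_get_num_threads(void) { return 1; }
static int omp_get_max_threads(void) { return 1; }
static int omp_get_num_devices(void) { return 0; }
#endif

/* Code called from a target region must be compiled for the card as well. */
#pragma omp declare target
int hello(void)
{
    int nthreads = 1;

#ifdef __MIC__   /* predefined by the Intel compiler in the card-side compilation */
    printf("Running on Xeon Phi\n");
#else
    printf("Running on host\n");
#endif

#pragma omp parallel
    {
        printf("Hello from thread id %d\n", omp_get_thread_num());
#pragma omp single
        nthreads = omp_get_num_threads();
    }

    printf("The number of threads is %d\n", nthreads);
    printf("Maximum number of threads is %d\n", omp_get_max_threads());
    printf("=============================\n");
    return nthreads;
}
#pragma omp end declare target

/* Hypothetical driver: run hello() once in a target region, then on the host.
   Returns how many times the region was executed. */
int hello_card_then_host(void)
{
    int on_card = 0;
    int on_host;

    printf("No of cards: %d\n", omp_get_num_devices());

#pragma omp target map(from: on_card)
    {
        on_card = hello();   /* executed on the default device */
    }

    on_host = hello();
    return (on_card > 0) + (on_host > 0);
}
```

The map(from: on_card) clause copies the thread count back from the card, which is the only data this example moves.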

To build the Fortran code:

[alin@service61: ...hello]: ifort -o hello_v2.MIX hello_v2.F90 -openmp

the C++ code:

[clalanne@service61: ...hello]: icpc -o hello_v2.MIX hello_v2.cpp -openmp

the C code:

[alin@service61: ...hello]: icc -o hello_v2.MIX hello_v2.c -fopenmp
    run
  • [alin@service61: ...hello]: ./hello_v2.MIX
    No of cards: 2
    Running on Xeon Phi
    Hello from thread id 0
    Hello from thread id 64
    Hello from thread id 10
    Hello from thread id 129
    ...
    Hello from thread id 5
    Hello from thread id 132
    Hello from thread id 6
    Hello from thread id 2
    The number of threads is 236
    Maximum number of threads is 236
    =============================
    Running on host
    Hello from thread id 2
    Hello from thread id 7
    Hello from thread id 11
    Hello from thread id 36
    Hello from thread id 8
    ...
    Hello from thread id 33
    Hello from thread id 28
    Hello from thread id 15
    Hello from thread id 18
    Hello from thread id 19
    The number of threads is 40
    Maximum number of threads is 40
    =============================
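    run with separate thread counts for host and card (with MIC_ENV_PREFIX=PHI, only variables prefixed with PHI_ are passed to the card, with the prefix removed)
  • [alin@service61: ...hello]: MIC_ENV_PREFIX=PHI OMP_NUM_THREADS=5 PHI_OMP_NUM_THREADS=6 ./hello_v2.MIX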
    No of cards: 2
    Running on Xeon Phi
    Hello from thread id 0
    Hello from thread id 1
    Hello from thread id 2
    Hello from thread id 3
    Running on host
    Hello from thread id 4
    Hello from thread id 5
    The number of threads is 6
    Maximum number of threads is 6
    =============================
    Hello from thread id 0
    Hello from thread id 2
    Hello from thread id 3
    Hello from thread id 4
    Hello from thread id 1
    The number of threads is 5
    Maximum number of threads is 5
    =============================

Offload mode with multiple cards

The next example is a hello world with one OpenMP region which is run on the host and offloaded to all available cards; see hello_v3.F90 (Fortran version), hello_v3.cpp (C++ version) or hello_v3.c (C version).
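
One way to address every card in turn is to loop over the devices reported by omp_get_num_devices() and select each with the device clause. The C sketch below illustrates this under our own assumptions (the names hello and hello_all_cards and the serial fallbacks are hypothetical); it is not the actual hello_v3.c.

```c
#include <stdio.h>

#ifdef _OPENMP
#include <omp.h>
#else
/* Serial fallbacks so the sketch also builds when OpenMP is not enabled. */
static int omp_get_thread_num(void)  { return 0; }
static int omp_get_num_threads(void) { return 1; }
static int omp_get_num_devices(void) { return 0; }
#endif

#pragma omp declare target
/* card >= 0 labels the output with a card number, card < 0 means the host. */
int hello(int card)
{
    int nthreads = 1;

    if (card >= 0)
        printf("Running on Xeon Phi no: %d\n", card);
    else
        printf("Running on host\n");

#pragma omp parallel
    {
        printf("Hello from thread id %d\n", omp_get_thread_num());
#pragma omp single
        nthreads = omp_get_num_threads();
    }

    printf("The number of threads is %d\n", nthreads);
    printf("=============================\n");
    return nthreads;
}
#pragma omp end declare target

/* Hypothetical driver: offload to each card in turn, then run on the host.
   Returns how many times the region was executed. */
int hello_all_cards(void)
{
    int ncards = omp_get_num_devices();
    int runs = 0;
    int card;

    printf("No of cards: %d\n", ncards);

    for (card = 0; card < ncards; ++card) {
        int n = 0;
#pragma omp target device(card) map(from: n)
        {
            n = hello(card);   /* executed on card number `card` */
        }
        runs += (n > 0);
    }

    runs += (hello(-1) > 0);   /* host run */
    return runs;
}
```

In this sketch the cards are visited one after another, since each target region completes before the loop continues.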

To build the Fortran code:

[alin@service61: ...hello]: ifort -o hello_v3.MIX hello_v3.F90 -fopenmp

To build the C++ code:

[clalanne@service61: ...hello]: icpc -o hello_v3.MIX hello_v3.cpp -fopenmp

To build the C code:

[alin@service61: ...hello]: icc -o hello_v3.MIX hello_v3.c -fopenmp
    run
  • [alin@service61: ...hello]: ./hello_v3.MIX
    No of cards: 2
    Running on Xeon Phi no: 0
    Hello from thread id 0
    Hello from thread id 40
    ...
    Hello from thread id 24
    Hello from thread id 48
    Hello from thread id 66
    The number of threads is 236
    Maximum number of threads is 236
    =============================
    Running on Xeon Phi no: 1
    Hello from thread id 0
    Hello from thread id 40
    ...
    Hello from thread id 68
    Hello from thread id 18
    Hello from thread id 17
    The number of threads is 236
    Maximum number of threads is 236
    =============================
    Running on host
    Hello from thread id 0
    Hello from thread id 9
    ...
    Hello from thread id 13
    Hello from thread id 33
    Hello from thread id 17
    The number of threads is 40
    Maximum number of threads is 40
    =============================
  • [alin@service61: ...hello]: MIC_ENV_PREFIX=PHI OMP_NUM_THREADS=5 PHI_OMP_NUM_THREADS=6 ./hello_v3.MIX
    No of cards: 2
    Running on Xeon Phi no: 0
    Hello from thread id 0
    Hello from thread id 4
    Hello from thread id 1
    Hello from thread id 2
    Hello from thread id 5
    Hello from thread id 3
    The number of threads is 6
    Maximum number of threads is 6
    =============================
    Running on Xeon Phi no: 1
    Hello from thread id 0
    Hello from thread id 1
    Hello from thread id 2
    Hello from thread id 5
    Hello from thread id 4
    Hello from thread id 3
    Running on host
    The number of threads is 6
    Maximum number of threads is 6
    =============================
    Hello from thread id 0
    Hello from thread id 2
    Hello from thread id 1
    Hello from thread id 3
    Hello from thread id 4
    The number of threads is 5
    Maximum number of threads is 5
    =============================
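    run with a different thread count on each card (with MIC_ENV_PREFIX=PHI, a variable prefixed with PHI_<n>_ is passed to card n only)
  • [alin@service61: ...hello]: MIC_ENV_PREFIX=PHI OMP_NUM_THREADS=5 PHI_0_OMP_NUM_THREADS=3 PHI_1_OMP_NUM_THREADS=4 ./hello_v3.MIX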
    No of cards: 2
    Running on Xeon Phi no: 0
    Hello from thread id 0
    Hello from thread id 2
    Hello from thread id 1
    The number of threads is 3
    Maximum number of threads is 3
    =============================
    Running on Xeon Phi no: 1
    Hello from thread id 2
    Hello from thread id 1
    Hello from thread id 0
    Hello from thread id 3
    The number of threads is 4
    Maximum number of threads is 4
    =============================
    Hello from thread id 4
    Hello from thread id 0
    Hello from thread id 1
    Hello from thread id 2
    Hello from thread id 3
    The number of threads is 5
    Maximum number of threads is 5
    =============================

Disable offload

There are cases when we want to disable offloading. We reuse the one-card example: a hello world with one OpenMP region which is run on the host and offloaded to a card; see hello_v2.F90 (Fortran version), hello_v2.cpp (C++ version) or hello_v2.c (C version).

To build with the offload region disabled we use the -no-openmp-offload option:

[alin@service61: ...hello]: ifort -o hello_v2.X hello_v2.F90 -fopenmp -no-openmp-offload
hello_v2.F90(11): remark #8711: OpenMP* directive disabled via command line.
!$omp declare target
^
/ichec/work/scratch/tmp/155065.service1.cb3.ichec.ie/ifortavSkLO.i90(14): remark #8711: *MIC* OpenMP* directive disabled via command line.

Similarly for C++ and C:

[clalanne@service61: ...hello]: icpc -o hello_v2.X hello_v2.cpp -fopenmp -no-openmp-offload
[alin@service61: ...hello]: icc -o hello_v2.X hello_v2.c -fopenmp -no-openmp-offload

but when we run

[alin@service61: ...hello]: OMP_NUM_THREADS=5 ./hello_v2.X
No of cards: 2
Running on host
Hello from thread id 0
Hello from thread id 3
Hello from thread id 4
Hello from thread id 1
Hello from thread id 2
The number of threads is 5
Maximum number of threads is 5
=============================
Running on host
Hello from thread id 3
Hello from thread id 4
Hello from thread id 0
Hello from thread id 2
Hello from thread id 1
The number of threads is 5
Maximum number of threads is 5
=============================

the code is run twice on the host. To properly disable the offload region one needs to protect it with a preprocessor macro, e.g. NOOFFLOAD; see hello_v4.F90 (Fortran version), hello_v4.cpp (C++ version) or hello_v4.c (C version).
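
A minimal C sketch of such a guard is given here. It is an illustration under our own assumptions, not the actual hello_v4.c: the names hello and hello_guarded and the serial fallbacks are hypothetical. The point is that the whole offloaded call, not only the directive, sits inside #ifndef NOOFFLOAD, so nothing is left behind to run a second time on the host.

```c
#include <stdio.h>

#ifdef _OPENMP
#include <omp.h>
#else
/* Serial fallbacks so the sketch also builds when OpenMP is not enabled. */
static int omp_get_thread_num(void)  { return 0; }
static int omp_get_num_threads(void) { return 1; }
#endif

#pragma omp declare target
int hello(void)
{
    int nthreads = 1;

#ifdef __MIC__   /* predefined by the Intel compiler in the card-side compilation */
    printf("Running on Xeon Phi\n");
#else
    printf("Running on host\n");
#endif

#pragma omp parallel
    {
        printf("Hello from thread id %d\n", omp_get_thread_num());
#pragma omp single
        nthreads = omp_get_num_threads();
    }

    printf("The number of threads is %d\n", nthreads);
    printf("=============================\n");
    return nthreads;
}
#pragma omp end declare target

/* Hypothetical driver. Returns how many times hello() was executed:
   2 in a normal build, 1 when compiled with -DNOOFFLOAD. */
int hello_guarded(void)
{
    int runs = 0;

#ifndef NOOFFLOAD
    {
        int on_card = 0;
#pragma omp target map(from: on_card)
        {
            on_card = hello();
        }
        runs += (on_card > 0);
    }
#endif

    runs += (hello() > 0);   /* the host run is always kept */
    return runs;
}
```

Without the guard, disabling the directive alone leaves the enclosed call in place as ordinary host code, which is why the hello_v2 binary printed the host output twice.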

We then build with the NOOFFLOAD macro defined:

[alin@service61: ...hello]: ifort -o hello_v4.X hello_v4.F90 -fopenmp -DNOOFFLOAD -no-openmp-offload
hello_v4.F90(11): remark #8711: OpenMP* directive disabled via command line.
!$omp declare target
^
/ichec/work/scratch/tmp/155065.service1.cb3.ichec.ie/ifortkgCKD8.i90(14): remark #8711: *MIC* OpenMP* directive disabled via command line.

and similarly for C++ and C:

[clalanne@service61: ...hello]: icpc -o hello_v4.X hello_v4.cpp -fopenmp -DNOOFFLOAD -no-openmp-offload
[alin@service61: ...hello]: icc -o hello_v4.X hello_v4.c -fopenmp -DNOOFFLOAD -no-openmp-offload

now when we run, the code is executed only once, as expected:

[alin@service61: ...hello]: OMP_NUM_THREADS=5 ./hello_v4.X
No of cards: 2
Running on host
Hello from thread id 0
Hello from thread id 2
Hello from thread id 1
Hello from thread id 3
Hello from thread id 4
The number of threads is 5
Maximum number of threads is 5
=============================

Offload reports

One can obtain some debug information about the offloading by setting the OFFLOAD_REPORT environment variable. Reusing the binary we built for offload mode with one card, we can run:

[alin@service61: ...hello]: OFFLOAD_REPORT=3 OMP_NUM_THREADS=3 ./hello_v2.MIX
No of cards: 2
[Offload] [HOST] [State] Initialize logical card 0 = physical card 0
[Offload] [HOST] [State] Initialize logical card 1 = physical card 1
[Offload] [MIC 0] [File] hello_v2.F90
[Offload] [MIC 0] [Line] 72
[Offload] [MIC 0] [Tag] Tag 0
[Offload] [HOST] [Tag 0] [State] Start Offload
[Offload] [HOST] [Tag 0] [State] Initialize function __offload_entry_hello_v2_F90_72MAIN__ifort1188938271dQKtGe
[Offload] [HOST] [Tag 0] [State] Send pointer data
[Offload] [HOST] [Tag 0] [State] CPU->MIC pointer data 0
[Offload] [HOST] [Tag 0] [State] CPU->MIC copyin data 0
[Offload] [HOST] [Tag 0] [State] Compute task on MIC
[Offload] [HOST] [Tag 0] [State] Receive pointer data
[Offload] [HOST] [Tag 0] [State] MIC->CPU pointer data 0
[Offload] [MIC 0] [Tag 0] [State] Start target function __offload_entry_hello_v2_F90_72MAIN__ifort1188938271dQKtGe
[Offload] [MIC 0] [Tag 0] [State] Scatter copyin data
Running on Xeon Phi
Hello from thread id 1
Hello from thread id 0
Hello from thread id 2
The number of threads is 3
Maximum number of threads is 3
=============================
[Offload] [MIC 0] [Tag 0] [State] Gather copyout data
[Offload] [HOST] [Tag 0] [State] Scatter copyout data
[Offload] [HOST] [Tag 0] [CPU Time] 0.962837(seconds)
[Offload] [MIC 0] [Tag 0] [CPU->MIC Data] 0 (bytes)
[Offload] [MIC 0] [Tag 0] [MIC Time] 0.030520(seconds)
[Offload] [MIC 0] [Tag 0] [MIC->CPU Data] 0 (bytes)

Running on host
[Offload] [MIC 0] [Tag 0] [State] MIC->CPU copyout data 0
Hello from thread id 0
Hello from thread id 2
Hello from thread id 1
The number of threads is 3
Maximum number of threads is 3
=============================
[Offload] [MIC 1] [State] Unregister data tables
[Offload] [MIC 0] [State] Unregister data tables
[Offload] [HOST] [State] Unregister data tables

To reduce the verbosity of the report, set OFFLOAD_REPORT to 1 or 2.
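
If one wants to collect the timings from such a report, the [CPU Time] and [MIC Time] lines can be picked out with a few lines of C. The helper below is a sketch of our own (offload_time is a hypothetical name, not part of any Intel tool) and assumes the line layout shown in the report above, where the value in seconds directly follows the label.

```c
#include <stdio.h>
#include <string.h>

/* Looks for `label` (e.g. "[MIC Time]") in one report line and, if found,
   reads the number that follows it into *seconds.
   Returns 1 on success, 0 if the label or the number is missing. */
int offload_time(const char *line, const char *label, double *seconds)
{
    const char *p = strstr(line, label);

    if (p == NULL)
        return 0;
    /* The leading space in the format skips the blank after the label. */
    return sscanf(p + strlen(label), " %lf", seconds) == 1;
}
```

Applied to the line "[Offload] [MIC 0] [Tag 0] [MIC Time] 0.030520(seconds)" with the label "[MIC Time]", it stores 0.030520 in *seconds.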