
ICHEC Software

Information about software packages installed on the ICHEC systems.

DL_POLY

Versions Installed

Stoney: 4.01.1 / 4.03.1

Fionn: 4.05.1

Description

DL_POLY is a package of subroutines, programs and data files, designed to facilitate molecular dynamics simulations of macromolecules, polymers, ionic systems and solutions on a distributed memory parallel computer. DL_POLY was developed at the STFC Daresbury Laboratory by W. Smith, T.R. Forester and I.T. Todorov. Three versions of DL_POLY are currently available:

DL_POLY_2 is the original version. It is parallelised using the Replicated Data strategy and is suitable for simulations of up to 30,000 atoms on up to 100 processors.

DL_POLY_3 uses Domain Decomposition to achieve parallelism and is suitable for simulations of the order of 1 million atoms on 8-1024 processors.

DL_POLY_4's general design provides scalable performance from a single-processor workstation to a high-performance parallel computer. DL_POLY_4 offers fully parallel I/O as well as a netCDF alternative (with an HDF5 library dependency) to the default ASCII trajectory file. It is also available as a CUDA+OpenMP port, developed in collaboration with ICHEC, to harness the power offered by NVIDIA® GPUs. A full description of the available DL_POLY_4 functionality may be obtained from the DL_POLY_4 User Manual (PDF).

License

Currently only one version of the DL_POLY software, DL_POLY_4, is available under an STFC license, with support provided to UK academia only. The former DL_POLY_2 version (authored by W. Smith, T.R. Forester and I.T. Todorov) has been renamed DL_POLY_CLASSIC and is available as open source under the BSD license.

Users should contact the Helpdesk to request access to the executables.

Benchmarks

4.01.1 on Stoney:

The figure below shows the strong scaling performance of both the vanilla MPI version of DL_POLY 4 and the GPU-enabled version of DL_POLY 4 run across 16 nodes of Stoney. The benchmark calculation used to obtain these results was TEST2 (available from the DL_POLY FTP site). The only modification made to TEST2 for this benchmark was to the 'CONTROL' input file, where the nfold parameter was set to (2,2,2) in order to increase the size of the calculation for scaling purposes (a sketch of this modification is given below). The figure shows that, for the configurations investigated, harnessing the two additional GPU cards on each node results in a ~30% reduction in wall-clock time at all node counts. (The acceleration gained from GPU enablement is problem specific; users may see a larger or smaller speedup than shown here, depending on their problem type.)
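
As a rough guide, the modification described above could be applied as follows. This is only a sketch: it assumes the TEST2 input files have been copied into the working directory and that the CONTROL file follows the usual DL_POLY_4 layout, closing with the finish keyword; the exact nfold directive syntax should be checked against the DL_POLY_4 User Manual.

#Insert the system-expansion directive before the closing 'finish' line of CONTROL
cd TEST2
sed -i '/^finish/i nfold 2 2 2' CONTROL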

4.05.1 on Fionn:

The figure below shows the relative speedup as a function of the number of cores. The benchmark calculation used to obtain these results was TEST40 (available from the DL_POLY FTP site). The figure shows very good scaling behaviour, with linear speedup achieved up to 192 cores (8 nodes).

Additional Notes

To use a version of DL_POLY on Stoney, load the relevant environment module:

module load dl_poly/4.03.1

To use a version of DL_POLY on Fionn, load the relevant environment module:

module load molmodel dl_poly/intel/4.05.1
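
On either system, the standard environment-module commands can be used to see which DL_POLY builds are installed and to confirm what is currently loaded, for example (on Fionn the DL_POLY modules may only become visible once the molmodel module has been loaded):

module avail dl_poly
module list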

Job Submission Example on Stoney

#!/bin/bash
#PBS -l nodes=2:ppn=8
#PBS -l walltime=30:00:00
#PBS -N MyJobName
#PBS -A MyProjectName

#Load the DL_POLY module
module load dl_poly/4.01.1

cd $PBS_O_WORKDIR

mpiexec -n 16 ./DLPOLY.Z
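
Note that the script above launches the executable from the job's working directory, so the DL_POLY input files must be present there. A minimal pre-submission check might look like the following; CONTROL, CONFIG and FIELD are the standard DL_POLY input files, and the run directory path is a placeholder to be replaced with your own.

#Confirm the executable and the standard DL_POLY input files are in place
cd /path/to/run/directory
ls -l DLPOLY.Z CONTROL CONFIG FIELD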

Job Submission Example on Stoney using GPGPUs

When the DLPOLY.Z.cu executable is invoked, an initialiser in dl_poly_init_cu.cu is invoked from dl_poly.f90. The initialiser binds the MPI processes running on each node to the available CUDA-enabled devices attached to that node in a round-robin fashion. Information regarding the MPI process affinity and device binding is printed to the standard output.

Please note that the affinity of the MPI processes is not decided by the initialiser; it is the responsibility of the user to specify this when the executable is run. The documentation for the particular MPI implementation in use should be consulted in order to achieve this. Examples of the MPI process-device binding and process affinity output (on a system with 8-core nodes, 2 CUDA devices per node, and using MVAPICH2) are:

Example 1: 2 MPI processes, 1 node:

#!/bin/bash
#PBS -l nodes=1:ppn=8
#PBS -l walltime=30:00:00
#PBS -q GpuQ
#PBS -N MyJobName
#PBS -A MyProjectName

#Load the DL_POLY module
module load dl_poly/4.01.1

cd $PBS_O_WORKDIR

mpiexec -n 2 ./DLPOLY.Z.cu

Output:

Bound MPI process 0 (pid=14299; affined to CPU(s) 0; 1 OpenMP thread(s)) to device 0@stoney52
Bound MPI process 1 (pid=14300; affined to CPU(s) 1; 1 OpenMP thread(s)) to device 1@stoney52

Example 2: 2 MPI processes, 1 node, 4 threads each:

#!/bin/bash
#PBS -l nodes=1:ppn=8
#PBS -l walltime=30:00:00
#PBS -q GpuQ
#PBS -N MyJobName
#PBS -A MyProjectName

#Load the DL_POLY module
module load dl_poly/4.01.1

cd $PBS_O_WORKDIR

export OMP_NUM_THREADS=4
mpiexec -n 2 ./DLPOLY.Z.cu

Output:

Bound MPI process 0 (pid=14319; affined to CPU(s) 0; 4 OpenMP thread(s)) to device 0@stoney52
Bound MPI process 1 (pid=14320; affined to CPU(s) 1; 4 OpenMP thread(s)) to device 1@stoney52

Example 3: 8 MPI processes, 1 node (GPU over-subscription):

#!/bin/bash
#PBS -l nodes=1:ppn=8
#PBS -l walltime=30:00:00
#PBS -q GpuQ
#PBS -N MyJobName
#PBS -A MyProjectName

#Load the DL_POLY module
module load dl_poly/4.01.1

cd $PBS_O_WORKDIR

mpiexec -n 8 ./DLPOLY.Z.cu

Output:

Bound MPI process 0 (pid=14345; affined to CPU(s) 0; 4 OpenMP thread(s)) to device 0@stoney52
Bound MPI process 1 (pid=14346; affined to CPU(s) 1; 4 OpenMP thread(s)) to device 1@stoney52
Bound MPI process 2 (pid=14347; affined to CPU(s) 2; 4 OpenMP thread(s)) to device 0@stoney52
Bound MPI process 3 (pid=14348; affined to CPU(s) 3; 4 OpenMP thread(s)) to device 1@stoney52
Bound MPI process 4 (pid=14349; affined to CPU(s) 4; 4 OpenMP thread(s)) to device 0@stoney52
Bound MPI process 5 (pid=14350; affined to CPU(s) 5; 4 OpenMP thread(s)) to device 1@stoney52
Bound MPI process 6 (pid=14351; affined to CPU(s) 6; 4 OpenMP thread(s)) to device 0@stoney52
Bound MPI process 7 (pid=14352; affined to CPU(s) 7; 4 OpenMP thread(s)) to device 1@stoney52

** WARNING: The number of MPI processes (8) on node stoney52 is greater than the number of devices (2) on that node
Ideally, the number of MPI processes on a given node should match the number of available devices on that node.
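
A quick way to confirm how many CUDA devices a node actually exposes, assuming the NVIDIA driver utilities are available on the GPU nodes, is:

#List the CUDA devices visible on the current node and count them
nvidia-smi -L
nvidia-smi -L | wc -l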

Example 4: 2 MPI processes, 1 node, 4 threads each, explicit affinity (set using the MVAPICH2 environment variable, MV2_CPU_MAPPING):

#!/bin/bash
#PBS -l nodes=1:ppn=8
#PBS -l walltime=30:00:00
#PBS -q GpuQ
#PBS -N MyJobName
#PBS -A MyProjectName

#Load the DL_POLY module
module load dl_poly/4.01.1

cd $PBS_O_WORKDIR

export OMP_NUM_THREADS=4
export MV2_CPU_MAPPING="0,2,4,6:1,3,5,7"
mpiexec -n 2 ./DLPOLY.Z.cu

Output:

Bound MPI process 0 (pid=14409; affined to CPU(s) 0 2 4 6; 4 OpenMP thread(s)) to device 0@stoney52
Bound MPI process 1 (pid=14410; affined to CPU(s) 1 3 5 7; 4 OpenMP thread(s)) to device 1@stoney52

Example 5: 4 MPI processes, 2 nodes, 4 threads each, explicit affinity (set using the MVAPICH2 environment variable MV2_CPU_MAPPING):

#!/bin/bash
#PBS -l nodes=2:ppn=8
#PBS -l walltime=30:00:00
#PBS -q GpuQ
#PBS -N MyJobName
#PBS -A MyProjectName

#Load the DL_POLY module
module load dl_poly/4.01.1

cd $PBS_O_WORKDIR

export OMP_NUM_THREADS=4
export MV2_CPU_MAPPING="0,2:4,6:1,3:5,7"
mpiexec -n 4 -npernode 2 ./DLPOLY.Z.cu

Output:

Bound MPI process 0 (pid=14627; affined to CPU(s) 0 2 4 6; 4 OpenMP thread(s)) to device 0@stoney52
Bound MPI process 1 (pid=14628; affined to CPU(s) 1 3 5 7; 4 OpenMP thread(s)) to device 1@stoney52
Bound MPI process 2 (pid=8581; affined to CPU(s) 0 2 4 6; 4 OpenMP thread(s)) to device 0@stoney51
Bound MPI process 3 (pid=8582; affined to CPU(s) 1 3 5 7; 4 OpenMP thread(s)) to device 1@stoney51

Assuming that this PBS script is saved as dl_poly.pbs, the job can be submitted to the queue by running the following command in the same directory:

qsub dl_poly.pbs
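
Once submitted, the job can be monitored with the standard PBS commands, for example:

#Check the state of your queued and running jobs
qstat -u $USER
#Delete a job if required, using the job ID reported by qsub or qstat
qdel <job_id>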

Job Submission Example on Fionn

#!/bin/bash
#PBS -l nodes=2:ppn=24
#PBS -l walltime=30:00:00
#PBS -N MyJobName
#PBS -A MyProjectName

#Load the DL_POLY module
module load molmodel dl_poly/intel/4.05.1

cd $PBS_O_WORKDIR

mpiexec -n 48 -ppn 24 ./DLPOLY.Z

Job Submission Example on Fionn using GPGPUs

#!/bin/bash
#PBS -l nodes=2:ppn=20
#PBS -l walltime=30:00:00
#PBS -q GpuQ
#PBS -N MyJobName
#PBS -A MyProjectName

#Load the DL_POLY module
module load molmodel dl_poly/intel/4.05.1

cd $PBS_O_WORKDIR

mpiexec -n 4 -ppn 2 ./DLPOLY.Z.cu

Further Information

More information can be obtained at http://www.stfc.ac.uk/CSE/randd/ccg/software/DL_POLY/25526.aspx.
