
Using COSMA8

COSMA8 has 2 login nodes, accessed via login8.cosma.dur.ac.uk

COSMA8 has 360 compute nodes, each with 1TB RAM and 128 cores (2x AMD 7H12 processors)

There are 2 high RAM (4TB) fat nodes, which should be accessed via the cosma8-shm queue.

There are a number of GPU-enabled servers (see below), a 1TB AMD Milan test node and a 1TB Milan-X test node.

There are 4 relevant SLURM queues:

cosma8: provides exclusive access to whole nodes; the underlying node pool is shared with cosma8-serial

cosma8-serial: provides non-exclusive access to nodes. Use this if you need fewer than 128 cores (and remember to specify your memory requirement too; a minimal example script follows this list)

cosma8-shm: access to the mad04 and mad05 servers, each with 4TB RAM. This queue is also non-exclusive, so these nodes may be shared with other users if you don't require all 128 cores or all 4TB RAM.

cosma8-shm2: access to the ga004 server with a Milan 7703 processor and MI100 GPU.
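
As a minimal sketch of submitting to these queues, the batch script below requests 32 cores on cosma8-serial with an explicit memory requirement; the project name, module version and executable are placeholders to replace with your own.

#!/bin/bash -l
#SBATCH --job-name=example
#SBATCH --partition=cosma8-serial     # non-exclusive queue
#SBATCH --account=my_project          # placeholder project/account
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=32            # fewer than the full 128 cores
#SBATCH --mem=64G                     # always state memory on the shared queues
#SBATCH --time=01:00:00

module purge
module load gnu_comp/11.1.0           # example module; check module avail

./my_program                          # placeholder executable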

Useful information

Numerical libraries

Intel MKL

MKL: The Intel Math Kernel Library is known to be hobbled on AMD systems, so a fix must be applied; see the MKL fix instructions linked from this page.

MKL is available via the intel_comp and oneAPI modules.
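
The linked instructions should be followed; purely as an illustration, one widely used workaround for older MKL releases (up to the 2019 versions, and assumed here rather than taken from the linked fix) is to force MKL's AVX2 code paths with an environment variable:

export MKL_DEBUG_CPU_TYPE=5   # honoured by MKL releases up to 2019; ignored by newer ones

Newer MKL releases (including the oneAPI modules) ignore this variable, so the fix described on the linked page must be used instead.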

OpenBLAS

OpenBLAS is available via the openblas modules.

GSL

The Gnu Scientific Library can be accessed via the gsl modules.
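
As a sketch of using these libraries once the relevant module is loaded (this assumes the module puts the headers and libraries on the compiler's search paths, which can be checked with module show):

module load gsl/2.5                   # example version; check module avail
gcc -O2 my_analysis.c -o my_analysis -lgsl -lgslcblas -lm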

Compilers

The recommended compiler depends on how well your application fares with each; see the PDFs below for recommended compiler options on AMD systems. Wisdom about the best compilers for particular codes is collected in the Known code issues section below. The available compilers are listed here, with example architecture flags after the list:

icc

intel_comp/2018 - generally stable

intel_comp/latest - possibly better optimisations

oneAPI - The newest versions of the Intel compiler, aliased to intel_comp

gcc

gnu_comp/ - versions 10.2 and 11.1 know about the Zen2 architecture, so code will be better optimised

aocc (AMD optimised compiler collection)

aocc/ - the AMD Optimised Compiler Collection - based on LLVM, use the latest version

llvm

Available via the llvm modules

pgi

Available via the pgi modules
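
As a rough illustration of the kind of architecture flags discussed in the PDFs below (these are generic Zen2 options rather than COSMA8-specific recommendations, and the module versions are examples only):

module load gnu_comp/11.1.0           # gcc 10.2/11.1 understand the Zen2 target
gcc -O3 -march=znver2 -fopenmp my_code.c -o my_code

module load aocc                      # aocc is clang-based; use the latest version
clang -O3 -march=znver2 -fopenmp my_code.c -o my_code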

MPI modules

OpenMPI

Usually best to use the newest openmpi module. A version of this with .no-ucx (e.g. openmpi/4.1.1.no-ucx) may offer more stable performance in some cases.

Large jobs may suffer from performance issues. This can sometimes be resolved by selecting the UD protocol over the newer DC (dynamically connected) protocol by setting:

export UCX_TLS=self,sm,ud

or

export UCX_TLS=self,sm,ud,rc,dc

in the job script. See discussion here.

If openmpi is complaining about running out of resources (memory pools being empty), the following may help:

export UCX_MM_RX_MAX_BUFS=65536
export UCX_IB_RX_MAX_BUFS=65536
export UCX_IB_TX_MAX_BUFS=65536

(or some larger value).

UCX settings can be seen with: /cosma/local/ucx/1.10.1/bin/ucx_info -f

For Gadget-4, setting export UCX_UD_MLX5_RX_QUEUE_LEN=16384 has also been shown to help.
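
Putting the above together, an OpenMPI job-script fragment for a large run might look like the following sketch (module versions and buffer sizes are illustrative and should be adjusted to what is installed and to your job's behaviour):

module load gnu_comp/11.1.0 openmpi/4.1.1   # example versions
export UCX_TLS=self,sm,ud                   # prefer UD over DC for large jobs
export UCX_IB_RX_MAX_BUFS=65536             # enlarge pools if resources run out
mpirun -np $SLURM_NTASKS ./my_mpi_program   # placeholder executable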

Intel_mpi

The intel_mpi/2018 module is the fallback option for SWIFT.

Later versions use UCX underneath, and initially suffered from stability issues. However, the newest versions are much improved.

Mvapich

The mvapich module can sometimes offer improved performance. However, it can also increase RAM usage in some cases.

Tuning information

Options for compiling on AMD systems

Tuning for AMD systems

GPUs

A number of GPU servers are accessible - please ask if you are unsure how to use these:

gn001: 10x NVIDIA V100 GPUs

ga003: 6x AMD MI50 GPUs

ga004: 1x AMD MI100 GPU, 2x 64 core AMD Milan processors.

ga005, ga006: 2x AMD MI200 GPUs, 2x 32 core 7513 EPYC processors

login8b, mad04, mad05: between 0 and 3 NVIDIA A100 GPUs (these can be reconfigured/moved between hosts as required; please ask if you need a particular setup)

SWIFT

The current recommended setup (July 2021) is this:

module load intel_comp/2021.1.0 compiler
module load intel_mpi/2018
module load ucx/1.8.1
module load fftw/3.3.9epyc parallel_hdf5/1.10.6 parmetis/4.0.3-64bit gsl/2.5

OpenMPI 4.0.5 can be swapped in instead of Intel MPI, at the cost of slightly worse performance.

--bind-to none is required to use all the cores correctly.

The MPI library from intel_comp/2021.1.0 also works, if processes are bound correctly to cores.
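
If OpenMPI is swapped in, a launch line following the binding note above might look like this sketch (the executable name, thread count and parameter file are placeholders, not a tested SWIFT command line):

mpirun --bind-to none -np $SLURM_NTASKS ./swift_mpi --threads=64 parameter_file.yml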

Arm Forge

Arm Forge (formerly Allinea), including DDT and MAP (used for code profiling), is available via the allinea/ddt/20.2.1 module.
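
As an illustration (the module name is taken from above; the profiling and debugging commands are generic Arm Forge invocations and may need adapting to your MPI launcher):

module load allinea/ddt/20.2.1
map --profile mpirun -np 8 ./my_mpi_program   # non-interactive profiling, writes a .map file
ddt ./my_program                              # launch the DDT debugger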

Profiles collected during the commissioning period are available in the commissioning report.

SLURM batch scripts

See examples here.

COSMA8 FAQ

The COSMA8 FAQ details some of the known issues or peculiarities related to COSMA8. Please let us know if there is something you would like added.

Known code issues

A collection of accumulated wisdom on running particular codes on COSMA8 is available.

Resources