Running Jobs on Scafell Pike

Last modified: 3/5/2018

Logging On

To use Scafell Pike you will normally log into the Fairthorpe x86 node hcxlogin1 (hcxlogin1.hartree.stfc.ac.uk). If you are adventurous, it is also possible to access Scafell Pike from the Power nodes, hcplogin1 and hcplogin2.

What will You See?

File Systems

[myid-mygroup@hcxlogin1 scafellpike?]$ df
Filesystem                             1K-blocks         Used     Available Use% Mounted on
cds131-dm0:/gpfs/cds/fairthorpe    1498317193216 149716768768 1348600424448  10% /gpfs/cds
sqg1cdata8-dm0:/lustre/scafellpike 1443356825600  79160509440 1349644140544   6% /lustre/scafellpike
fairfiler01:/data/home                2121790464   1025147904    1096642560  49% /gpfs/fairthorpe/local

The file systems seen above are used as follows.

Environment variable  File system                                               Description
$HOME                 /gpfs/fairthorpe/local/myproject/mygroup/myid-mygroup     Home file system on Fairthorpe
$HCBASE               /lustre/scafellpike/local/myproject/mygroup/myid-mygroup  File system on the Fairthorpe cluster
$HCCDS                /gpfs/cds/local/myproject/mygroup/myid-mygroup            Permanent file system on the Common Data Store

$HOME and $HCBASE should be used as scratch space for job development and for running applications on the cluster. $HCCDS should be used to store all large data sets and executables. Jobs can be launched directly from $HCCDS using Data Mover (see below), or data and executables can be copied to $HCBASE for use during the batch run.

Note: $HCCDS is backed up. The other file systems are not.
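
As a minimal sketch of the copy-to-$HCBASE approach, a job script might stage its inputs from $HCCDS into a run directory before starting. The helper name stage_in and the directory names mycase and run1 are hypothetical and are not part of the managed software; only standard mkdir and cp are used.

```shell
#!/bin/bash
# stage_in SRC DST: copy the contents of SRC into DST, creating DST if needed.
# Hypothetical helper for illustration only.
stage_in() {
    local src="$1" dst="$2"
    [ -d "$src" ] || { echo "stage_in: no such directory: $src" >&2; return 1; }
    mkdir -p "$dst" && cp -r "$src"/. "$dst"/
}

# Example use inside a job script (directory names are placeholders):
# stage_in "$HCCDS/mycase" "$HCBASE/run1" && cd "$HCBASE/run1"
```

Because only $HCCDS is backed up, results worth keeping would need to be copied back there once the run has finished.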

Managed Software

The software which we have installed and manage on behalf of our user community can be accessed via environment modules. This is described in a separate chapter Managed Software. Examples of using specific packages are given in the HOWTOs.

Batch Queues

The main batch queues available for Scafell Pike are shown here.

[myid-mygroup@hcxlogin1 scafellpike?]$ bqueues
scafellpikeKNL   40  Open:Active       -    -    -    -     0     0     0     0
scafellpikeSKL   40  Open:Active       -    -    -    -  2748    20  2728     0
scafellpikeI     40  Open:Active       -    -    -    -     0     0     0     0
universeScafell  40  Open:Active       -    -    -    -     0     0     0     0
SKLShort         40  Open:Active       -    -    -    -     0     0     0     0
KNLShort         40  Open:Active       -    -    -    -     0     0     0     0
KNLHimem         40  Open:Active       -    -    -    -     0     0     0     0
XRV              40  Open:Active       -    -    -    -     0     0     0     0

The cluster effectively has two partitions, referred to as KNL and SKL.

The most important queues, which are further described below, are scafellpikeKNL for using the Xeon Phi Knights Landing nodes and scafellpikeSKL for using the Xeon Gold Skylake nodes.

There are 846 "regular" SKL nodes, each with 32 x86_64 Xeon cores running at up to 3.7 GHz and 192 GB of memory.

There are 846 "accelerator" KNL nodes, each with 64 cores running at 1.3 GHz and 384 GB of memory.

To see which queues are enabled at any time type "bclusters".

Batch Jobs

This section describes how to submit batch jobs. Specific examples for managed software are given in the HOWTOs.

LSF Job Scheduler

The job scheduler on Scafell Pike is IBM Platform LSF v10.1. Having first compiled your executable, or using one that we have installed, you can submit it to the job queue with a suitable submission script. We provide some examples below.

Here is a link to the on-line LSF documentation: external link: https://www.ibm.com/support/knowledgecenter/en/SSWRJV_10.1.0/lsf_welcome/lsf_welcome.html . There are also man pages for LSF available on Fairthorpe and Scafell Pike.

Using Intel MPI

Intel MPI is installed and managed and is expected to give good performance on Scafell Pike. For this reason it is part of the "default" software stack.

Please use mpiexec.hydra as in this example LSF script, which runs a variant of the IMB benchmark. Note the use of ptile=32 to request all 32 cores per node. This assumes that $HCBASE/imb has previously been created.

#BSUB -o stdout.imb.txt
#BSUB -e stderr.imb.txt
#BSUB -R "span[ptile=32]"
#BSUB -q scafellpikeSKL
#BSUB -n 96            
#BSUB -J imb
#BSUB -W 00:19

cd $HCBASE/imb

# setup modules
. /etc/profile.d/modules.sh
module load intel_mpi > /dev/null 2>&1

export MYJOB="$HCBASE/imb/IMB-MPI1-intel sendrecv"

mpiexec.hydra -np 96 ${MYJOB}

An explanation of the parameters used:

#BSUB -o stdout.imb.txt   <-- Specify an output filename
#BSUB -e stderr.imb.txt   <-- Specify an error filename
#BSUB -R "span[ptile=32]"   <-- Request 32 processes per node, which matches the number of cores per node.
  You can set ptile to less than 32 if required (e.g. to get more memory per process).
#BSUB -q scafellpikeSKL <-- use SKL nodes, alternative is scafellpikeKNL
#BSUB -R "rusage[mem=15000]" <-- Optionally request memory, in this case 15000 MB (about 15GB). Not used in the script above.
#BSUB -n 96   <-- Request the number of MPI tasks. In this example we are asking for 3 nodes (3x32) in total.
#BSUB -J imb   <-- Give the job a name.
#BSUB -W 00:19   <-- Request 19 minutes of wallclock time.

mpiexec.hydra -np 96 ${MYJOB}  <-- Tell mpiexec.hydra to start 96 processes (should match the -n value above).
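
The relationship between -n, ptile and memory per task can be checked with a little shell arithmetic. This is only a sketch: the function names are made up for illustration, and the 192 GB figure is the nominal SKL node memory quoted earlier, so the share actually usable by each task will be somewhat lower.

```shell
#!/bin/bash
# nodes_needed TASKS PTILE: number of nodes LSF must allocate (rounded up).
nodes_needed() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

# mem_per_task_gb NODE_MEM_GB PTILE: nominal memory share per task, in whole GB.
mem_per_task_gb() {
    echo $(( $1 / $2 ))
}

nodes_needed 96 32        # prints 3, the three nodes of the example
mem_per_task_gb 192 32    # prints 6, a fully packed SKL node
mem_per_task_gb 192 16    # prints 12, with ptile=16
```

Halving ptile doubles the nominal memory per task but also doubles the number of nodes needed for the same -n value.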

Using OpenMPI

OpenMPI is also installed and managed as an alternative MPI version.

Specifying appropriate environment variables directly:

#BSUB -o stdout.imb.txt
#BSUB -e stderr.imb.txt
#BSUB -R "span[ptile=32]"
#BSUB -q scafellpikeSKL
#BSUB -n 96            
#BSUB -J imb
#BSUB -W 00:19

cd $HCBASE/imb
export MYHOME=`pwd`
export OPENMPI_ROOT=/lustre/scafellpike/local/apps/gcc/openmpi/2.1.1
export PATH=$OPENMPI_ROOT/bin:$PATH
export MYJOB="${MYHOME}/IMB-MPI1 sendrecv"

mpirun -np 96 ${MYJOB}

Or using environment modules:

#BSUB -o stdout.imb.txt
#BSUB -e stderr.imb.txt
#BSUB -R "span[ptile=32]"
#BSUB -q scafellpikeSKL
#BSUB -n 96            
#BSUB -J imb
#BSUB -W 00:19

cd $HCBASE/imb
export MYHOME=`pwd`

# setup modules
. /etc/profile.d/modules.sh
module load openmpi-gcc > /dev/null 2>&1

export MYJOB="${MYHOME}/IMB-MPI1 sendrecv"

mpirun -np 96 ${MYJOB}
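
Because the module load lines above send their output to /dev/null, a failed load goes unnoticed until the launcher cannot be found. One possible guard, sketched here with a hypothetical helper called require_cmd, is to check that the launcher is on PATH before using it.

```shell
#!/bin/bash
# require_cmd NAME: succeed only if NAME can be found on PATH.
# Hypothetical helper for illustration only.
require_cmd() {
    command -v "$1" > /dev/null 2>&1 || {
        echo "error: '$1' not found on PATH; did the module load?" >&2
        return 1
    }
}

# Example use in a job script, after the module load line:
# require_cmd mpirun || exit 1
```

The same check works for mpiexec.hydra in the Intel MPI script.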

Some of these scripts can get quite complex, especially if Data Mover is used. Please check the HOWTOs and contact the Help Desk if you have any questions: external link: https://stfc.service-now.com

Submitting jobs

Submit your job like this:

bsub < myjob.sh

Don't forget the redirect symbol "<". Without it, bsub does not read the #BSUB directives in the script.
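
When a job is accepted, bsub prints a confirmation of the form "Job <12345> is submitted to queue <scafellpikeSKL>." A script that needs the job ID, for example to pass it to bjobs, could extract it as sketched below. The function name job_id_from_bsub is hypothetical, and the sketch assumes that standard message format.

```shell
#!/bin/bash
# job_id_from_bsub: read bsub's confirmation message on stdin, print the job ID.
# Prints nothing if the message does not have the expected form.
job_id_from_bsub() {
    sed -n 's/^Job <\([0-9][0-9]*\)> is submitted.*/\1/p'
}

# Example use on the cluster:
# jobid=$(bsub < myjob.sh | job_id_from_bsub)
# bjobs -W "$jobid"
```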

Requesting large memory nodes

Note: this section to be checked

There are 24x high memory nodes each with 1TB RAM. They have Intel Xeon Broadwell processors, 32 cores per node. You can request them with this syntax in your submission script:

#BSUB -R "rusage[mem=250000]"

This will request 250,000MB, or ~250GB, of memory on each node. Please note that because there are only 4 nodes available, you cannot request more than 96 mpi tasks (3x32).


A single job can actually consist of several independent tasks, which are either MPI applications or possibly serial or multi-threaded programs. Running several of these together on the same node can be a more effective use of resources (because node allocation is exclusive).

Here is an example of running four benchmark tests on one node. LSF allocates 32 cores and we pin the 8 processes of each task to a selected range of cores. The syntax is appropriate for Intel MPI.

#BSUB -J IMB-multi
#BSUB -oo stdout.IMB.txt
#BSUB -eo stderr.IMB.txt
#BSUB -R "span[ptile=32] affinity[core(1):cpubind=core]"
#BSUB -n 32
# target KNL
#BSUB -W 0:59

cd $HCBASE/knl-testing/IMB-intelmpi/TESTING

#Load modules
source /etc/profile.d/modules.sh
module load intel_mpi/18.0.128

# try various fabric settings shm, dapl, ofa, tcp
export I_MPI_FABRICS=shm:ofa

export myexe="./IMB-MPI1"
export myargs="Allreduce"

export I_MPI_PIN=yes
export I_MPI_DEBUG=4

# execute IMB 4 times with process mapping
(date > output1.IMB.txt; mpirun -np 8 -genvall -env I_MPI_PIN_PROCESSOR_LIST=0-7 $myexe $myargs >> output1.IMB.txt; sleep 10; date >> output1.IMB.txt) &
(date > output2.IMB.txt; mpirun -np 8 -genvall -env I_MPI_PIN_PROCESSOR_LIST=8-15 $myexe $myargs >> output2.IMB.txt; sleep 11; date >> output2.IMB.txt) &
(date > output3.IMB.txt; mpirun -np 8 -genvall -env I_MPI_PIN_PROCESSOR_LIST=16-23 $myexe $myargs >> output3.IMB.txt; sleep 12; date >> output3.IMB.txt) &
(date > output4.IMB.txt; mpirun -np 8 -genvall -env I_MPI_PIN_PROCESSOR_LIST=24-31 $myexe $myargs >> output4.IMB.txt; sleep 13; date >> output4.IMB.txt) &

# wait for all the above tasks to complete
wait
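
The core ranges 0-7, 8-15, 16-23 and 24-31 used above follow from splitting 32 cores evenly among four tasks. As a sketch, the same ranges can be generated in a loop rather than written out by hand. The function name core_range is hypothetical, and echo stands in for the real mpirun command line.

```shell
#!/bin/bash
# core_range INDEX TASKS CORES: print the core list "first-last" for task
# INDEX (counting from 0) when CORES cores are split evenly among TASKS tasks.
core_range() {
    local per_task=$(( $3 / $2 ))
    local first=$(( $1 * per_task ))
    echo "${first}-$(( first + per_task - 1 ))"
}

# Launch four placeholder tasks in the background, then wait for all of them.
# Replace the echo with the real mpirun command line.
for i in 0 1 2 3; do
    ( echo "task $i would run on cores $(core_range "$i" 4 32)" ) &
done
wait
```

The value printed by core_range is what would be passed to I_MPI_PIN_PROCESSOR_LIST. The final wait keeps the job script alive until every background task has finished.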

Job Arrays

A "job array" is a set of jobs submitted using a single script. Typically this is used in cases where the same job has to be run many times with different input data. This is sometimes referred to as task farming, parameter sweep or design of experiments and is typical of an optimisation procedure.

Using job array syntax will permit the LSF scheduler to make fair use of the resources available. The job script has syntax as follows.

# define an array job
#BSUB -J "My_Jobname[1-20]"
#BSUB -o stdout.%J.%I.txt
#BSUB -e stderr.%J.%I.txt
# … other BSUB options

# look up the index for this job instance
if [ -z "$LSB_REMOTEJID" ]; then # runs in local queue
    job_id=$LSB_JOBID; array_id=$LSB_JOBINDEX
else # job has been forwarded to remote queue
    job_id=$LSB_REMOTEJID; array_id=$LSB_REMOTEINDEX
fi

# open a different input file for each job instance
# this also runs 4 instances concurrently from different directories
# there could be many ways to do this…
(cd try1; mpiexec.hydra -n 8 my_executable input.$array_id) &
(cd try2; mpiexec.hydra -n 8 my_executable input.$array_id) &
(cd try3; mpiexec.hydra -n 8 my_executable input.$array_id) &
(cd try4; mpiexec.hydra -n 8 my_executable input.$array_id) &

# wait for all processes to complete in this script shell
wait

When the job runs, you will see output from bjobs as follows.

345998  myid-h RUN   q12h32     login1      ida7c16     My_Jobname[16]   Aug 18 19:18
345998  myid-h RUN   q12h32     login1      ida3c36     My_Jobname[15]   Aug 18 19:18
345998  myid-h RUN   q12h32     login1      ida7a18     My_Jobname[17]   Aug 18 19:18
345998  myid-h RUN   q12h32     login1      ida2a41     My_Jobname[18]   Aug 18 19:18
345998  myid-h RUN   q12h32     login1      ida7a27     My_Jobname[19]   Aug 18 19:18
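
For a parameter sweep, an alternative to one input file per array index is a single file with one line of parameters per job instance. The sketch below assumes a hypothetical file params.txt and a hypothetical helper params_for_index. LSB_JOBINDEX is the index variable set by LSF, as used in the script above.

```shell
#!/bin/bash
# params_for_index FILE INDEX: print line INDEX (counting from 1) of FILE.
# Hypothetical helper for illustration only.
params_for_index() {
    sed -n "${2}p" "$1"
}

# LSB_JOBINDEX is set by LSF for array jobs; default to 1 when testing elsewhere.
index="${LSB_JOBINDEX:-1}"

# Example use in the job script (params.txt and my_executable are placeholders):
# mpiexec.hydra -n 32 my_executable $(params_for_index params.txt "$index")
```

For jobs forwarded to a remote queue, the array_id value worked out in the script above could be passed in place of LSB_JOBINDEX.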

Monitoring your job

You can use:

bjobs -W

to see your running jobs, or:

bjobs -W -u all

to see all user jobs.
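
When many jobs are listed, it can help to summarise them by state. The following sketch counts jobs per state from bjobs output. It assumes that the state (RUN, PEND and so on) is the third whitespace-separated column and that the header line starts with JOBID. The function name count_states is hypothetical.

```shell
#!/bin/bash
# count_states: read bjobs output on stdin and print "STATE COUNT" per job state.
# Assumes the state is in column 3 and that any header line starts with JOBID.
count_states() {
    awk '$1 != "JOBID" && NF >= 3 { n[$3]++ } END { for (s in n) print s, n[s] }' | sort
}

# Example use on the cluster:
# bjobs -W | count_states
```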

More information is available with:

bjobs -W -l

And you can check scheduling information (perhaps if your job is showing with status "SSUSP") with:

bjobs -W -s <jobid>

"bjobs" has many options - please check the man page for further details.

To see the status of the compute nodes in the system, you can use:


E-Mail Notification

As an alternative method of monitoring, e-mail notification has been enabled on Scafell Pike. Use the -B flag to send an e-mail when the job is dispatched and begins execution, the -N flag to send an e-mail report when the job completes, and the -u flag to set the e-mail address. Here is an example command:

bsub -u user@example.co.uk -B -N < myjob.sh

The options can also be included in the job script itself. Samples of the messages sent are on a separate page here.

Killing a job


bkill <jobid>

As you might expect, you can only kill your own jobs.

Data Mover

Data Mover is described in chapter Power8Jobs.

Further information to Follow

Interactive Access

Development Nodes

You can run interactive jobs using the "-I" flag, for example:

bsub -q scafellpikeI -W 00:59 -n 96 -R "span[ptile=32]" -Is "<command>"

You can also use this method to compile or edit on a compute node, thus freeing up resources on the login node, e.g.

bsub -q scafellpikeI -W 00:59 -Is emacs <my_file>

Note the use of "-Is" to get a pseudo-terminal with stdin.

Further information is given in the chapter on code development Development2.

Extreme Factory - XCS and XRV

Information to Follow

More Information
