
Running Jobs on Scafell Pike

Last modified: 3/5/2018

Logging On

To use Scafell Pike you will normally log in to the Fairthorpe x86 login node hcxlogin1: hcxlogin1.hartree.stfc.ac.uk. If you are adventurous, it is also possible to access Scafell Pike from the Power nodes, hcplogin1 and hcplogin2.
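
For example, using the placeholder username that appears in the prompts later on this page:

ssh myid-mygroup@hcxlogin1.hartree.stfc.ac.uk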

What Will You See?

File Systems

[myid-mygroup@hcxlogin1 scafellpike?]$ df
Filesystem                             1K-blocks         Used     Available Use% Mounted on
cds131-dm0:/gpfs/cds/fairthorpe    1498317193216 149716768768 1348600424448  10% /gpfs/cds
sqg1cdata8-dm0:/lustre/scafellpike 1443356825600  79160509440 1349644140544   6% /lustre/scafellpike
fairfiler01:/data/home                2121790464   1025147904    1096642560  49% /gpfs/fairthorpe/local

The file systems seen above are used as follows.

environment variable    file system                                                  description
$HOME                   /gpfs/fairthorpe/local/myproject/mygroup/myid-mygroup        Home file system on Fairthorpe
$HCBASE                 /lustre/scafellpike/local/myproject/mygroup/myid-mygroup     File system on the Fairthorpe cluster
$HCCDS                  /gpfs/cds/local/myproject/mygroup/myid-mygroup               Permanent file system on the Common Data Store

$HOME and $HCBASE should be used as scratch space for job development and for running applications on the cluster. $HCCDS should be used to store all large data sets and executables. Jobs can be launched directly from $HCCDS using Data Mover (see below), or data and executables can be copied to $HCBASE for use during the batch run.

Note: $HCCDS is backed up. The other file systems are not.
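
For example, a hypothetical input directory my_dataset held on the CDS could be staged to scratch before a run and the results copied back afterwards (the directory names are purely illustrative):

# stage input data from the CDS to Lustre scratch
cp -r $HCCDS/my_dataset $HCBASE/my_dataset

# ... run the batch job from $HCBASE/my_dataset ...

# copy the results back to the backed-up CDS file system
cp -r $HCBASE/my_dataset/results $HCCDS/my_dataset/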

Managed Software

The software which we have installed and manage on behalf of our user community can be accessed via environment modules. This is described in a separate chapter Managed Software. Examples of using specific packages are given in the HOWTOs.
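
For example, to list the managed packages and load one of the MPI modules used later on this page:

module avail
module load intel_mpi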

Batch Queues

The main batch queues available for Scafell Pike are shown here.

[myid-mygroup@hcxlogin1 scafellpike?]$ bqueues
QUEUE_NAME      PRIO STATUS          MAX JL/U JL/P JL/H NJOBS  PEND   RUN  SUSP 
scafellpikeKNL   40  Open:Active       -    -    -    -     0     0     0     0
scafellpikeSKL   40  Open:Active       -    -    -    -  2748    20  2728     0
scafellpikeI     40  Open:Active       -    -    -    -     0     0     0     0
universeScafell  40  Open:Active       -    -    -    -     0     0     0     0
SKLShort         40  Open:Active       -    -    -    -     0     0     0     0
KNLShort         40  Open:Active       -    -    -    -     0     0     0     0
KNLHimem         40  Open:Active       -    -    -    -     0     0     0     0
XRV              40  Open:Active       -    -    -    -     0     0     0     0

The cluster effectively has two partitions, referred to as KNL and SKL.

The most important queues, which are further described below, are scafellpikeKNL for using the Xeon Phi Knights Landing nodes and scafellpikeSKL for using the Xeon Gold Skylake nodes.

There are 846 "regular" SKL nodes each with 32x x86_64 Xeon cores running at up to 3.7 GHz with 192GB memory.

There are 846 "accelerator" KNL nodes each with 64 cores running at 1.3 HGz with 384 GB memory.

To see which queues are enabled at any time, type "bclusters".

Batch Jobs

This section describes how to submit batch jobs. Specific examples for managed software are given in the HOWTOs.

LSF Job Scheduler

The job scheduler on Scafell Pike is IBM Platform LSF v10.1. Having first compiled your executable, or using one that we have installed, you can submit it to the job queue with a suitable submission script. We provide some examples below.

Here is a link to the on-line LSF documentation: https://www.ibm.com/support/knowledgecenter/en/SSWRJV_10.1.0/lsf_welcome/lsf_welcome.html. There are also man pages for LSF available on Fairthorpe and Scafell Pike.

Job submission filter

This section to be re-written

There is a job submission filter on Scafell Pike which will set the following defaults if you do not override them:

  • Number of processors (slots) requested => 24 (i.e. 1 node)
  • Wallclock time => 1hr
  • No resource requirements

Jobs will by default go into the q1h32 queue (1 hr wall clock, 32 node maximum, 512 cores maximum).

You may request a walltime longer than 1 hour, in which case your job will go into q12h32 (12 hr wall clock, 32 node maximum, 512 cores maximum). Note that jobs with walltimes > 1 hour are currently considered to be "at risk" and may be terminated without warning. See the sample submission scripts for details on how to do this.
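
For example, a minimal sketch of the relevant directive (assuming the queue routing described above is still in place):

#BSUB -W 12:00   <-- request 12 hours of wallclock; a walltime over 1 hour routes the job to q12h32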

Using Intel MPI

Intel MPI is installed and managed and is expected to give good performance on Scafell Pike. For this reason it is part of the "default" software stack.

Please use mpiexec.hydra, as in this example LSF script, which runs a variant of the IMB benchmark. Note the use of ptile=32 to request all 32 cores per node. This assumes that the directory $HCBASE/imb has previously been created.

#BSUB -o stdout.imb.txt
#BSUB -e stderr.imb.txt
#BSUB -R "span[ptile=32]"
#BSUB -q scafellpikeSKL
#BSUB -n 96            
#BSUB -J imb
#BSUB -W 00:19

cd $HCBASE/imb

# setup modules
. /etc/profile.d/modules.sh
module load intel_mpi > /dev/null 2>&1

export MYJOB="$HCBASE/imb/IMB-MPI1-intel sendrecv"

mpiexec.hydra -np 96 ${MYJOB}

An explanation of the parameters used:

#BSUB -o stdout.imb.txt   <-- Specify an output filename
#BSUB -e stderr.imb.txt   <-- Specify an error filename
#BSUB -R "span[ptile=32]"   <-- Request 32 processes per node, which matches the number of cores per node.
  You can set ptile to less than 32 if required (e.g. to get more memory per process).
#BSUB -q scafellpikeSKL <-- use SKL nodes, alternative is scafellpikeKNL
#BSUB -R "rusage[mem=15000]" <-- Request memory, in this case 15GB 
#BSUB -n 96   <-- Request number of mpi tasks.  So in this example, we're asking for 3 nodes (3x32) in total.
#BSUB -J imb   <-- Give the job a name.
#BSUB -W 00:19   <-- Request 19 minutes of wallclock time.

mpiexec.hydra -np 96 ${MYJOB}  <-- Tell mpirun to start 96 processes (should be the same as above).

Using OpenMPI

OpenMPI is also installed and managed as an alternative MPI version.

Specifying appropriate environment variables directly:

#BSUB -o stdout.imb.txt
#BSUB -e stderr.imb.txt
#BSUB -R "span[ptile=32]"
#BSUB -q scafellpikeSKL
#BSUB -n 96            
#BSUB -J imb
#BSUB -W 00:19

cd $HCBASE/imb
export MYHOME=`pwd`
export OPENMPI_ROOT=/lustre/scafellpike/local/apps/gcc/openmpi/2.1.1
export PATH=$OPENMPI_ROOT/bin:$PATH
export LD_LIBRARY_PATH=$OPENMPI_ROOT/lib:$OPENMPI_ROOT/lib/openmpi:$LD_LIBRARY_PATH
export MYJOB="${MYHOME}/IMB-MPI1 sendrecv"

mpirun -np 96 ${MYJOB}

Or using environment modules:

#BSUB -o stdout.imb.txt
#BSUB -e stderr.imb.txt
#BSUB -R "span[ptile=32]"
#BSUB -q scafellpikeSKL
#BSUB -n 96            
#BSUB -J imb
#BSUB -W 00:19

cd $HCBASE/imb
export MYHOME=`pwd`

# setup modules
. /etc/profile.d/modules.sh
module load openmpi-gcc > /dev/null 2>&1

export MYJOB="${MYHOME}/IMB-MPI1 sendrecv"

mpirun -np 96 ${MYJOB}

Some of these scripts can get quite complex, especially if Data Mover is used. Please check the HOWTOs and contact the Help Desk if you have any questions: https://stfc.service-now.com.

Submitting jobs

Submit your job like this:

bsub < myjob.sh

Don't forget the re-direct symbol "<" or nothing will happen.

Requesting large memory nodes

Note: this section to be checked

There are 24 high memory nodes, each with 1 TB of RAM. They have Intel Xeon Broadwell processors with 32 cores per node. You can request them with this syntax in your submission script:

#BSUB -R "rusage[mem=250000]"

This will request 250,000 MB, or ~250 GB, of memory on each node. Please note that because there are only 4 nodes available, you cannot request more than 96 MPI tasks (3x32).
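
A minimal submission script sketch combining this directive with the options used elsewhere on this page (the queue name, directory and executable are assumptions, not tested settings; check the available queues with bqueues):

#BSUB -o stdout.himem.txt
#BSUB -e stderr.himem.txt
# assumption: replace scafellpikeSKL with whichever queue serves the high memory nodes
#BSUB -q scafellpikeSKL
#BSUB -R "span[ptile=32]"
#BSUB -R "rusage[mem=250000]"
#BSUB -n 32
#BSUB -W 01:00
#BSUB -J himem_test

# hypothetical working directory under scratch
cd $HCBASE/himem_test

# setup modules
. /etc/profile.d/modules.sh
module load intel_mpi > /dev/null 2>&1

# my_executable is a placeholder for your own program
mpiexec.hydra -np 32 ./my_executable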

Job Arrays

A "job array" is a set of jobs submitted using a single script. Typically this is used in cases where the same job has to be run many times with different input data. This is sometimes referred to as task farming, parameter sweep or design of experiments and is typical of an optimisation procedure.

Using job array syntax allows the LSF scheduler to make fair use of the available resources. The job script has the following form.

#!/bin/bash
# define an array job
#BSUB -J "My_Jobname[1-20]"
#BSUB -o stdout.%J.%I.txt
#BSUB -e stderr.%J.%I.txt
# … other BSUB options

# look up the index for this job instance
if [ -z "$LSB_REMOTEJID" ] # runs in local queue
then
    job_id=$LSB_JOBID; array_id=$LSB_JOBINDEX
else # job has been forwarded to remote queue
    job_id=$LSB_REMOTEJID; array_id=$LSB_REMOTEINDEX
fi

# open a different input file for each job instance
# this also runs 4 instances concurrently from different directories
# there could be many ways to do this…
(cd try1; mpiexec.hydra -n 8 my_executable input.$array_id) &
(cd try2; mpiexec.hydra -n 8 my_executable input.$array_id) &
(cd try3; mpiexec.hydra -n 8 my_executable input.$array_id) &
(cd try4; mpiexec.hydra -n 8 my_executable input.$array_id) &

# wait for all processes to complete in this script shell
wait

When the job runs, you will see output from bjobs as follows.

345998  myid-h RUN   q12h32     login1      ida7c16     My_Jobname[16]   Aug 18 19:18
345998  myid-h RUN   q12h32     login1      ida3c36     My_Jobname[15]   Aug 18 19:18
345998  myid-h RUN   q12h32     login1      ida7a18     My_Jobname[17]   Aug 18 19:18
345998  myid-h RUN   q12h32     login1      ida2a41     My_Jobname[18]   Aug 18 19:18
345998  myid-h RUN   q12h32     login1      ida7a27     My_Jobname[19]   Aug 18 19:18

Monitoring your job

You can use:

bjobs -W

to see your running jobs, or:

bjobs -W -u all

to see all user jobs.

More information is available with:

bjobs -W -l

And you can check scheduling information (perhaps if your job is showing with status "SSUSP") with:

bjobs -W -s <jobid>

"bjobs" has many options - please check the man page for further details.

To see the status of the compute nodes in the system, you can use:

bhosts

E-Mail Notification

As an alternative method of monitoring, email notification has been enabled on Scafell Pike. You can use the -B flag to send an email when the job is dispatched and begins execution, the -N flag to send an email report when the job completes, and the -u flag to set the email address. Here is an example command:

bsub -u user@example.co.uk -B -N < myjob.sh

The options can also be included in the job script itself. Samples of the messages sent are on a separate page here.
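
For example, the equivalent directives inside the job script (the address is a placeholder):

#BSUB -u user@example.co.uk   <-- email address for notifications
#BSUB -B                      <-- send mail when the job begins execution
#BSUB -N                      <-- send a mail report when the job completes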

Killing a job

Use:

bkill <jobid>

As you might expect, you can only kill your own jobs.

Data Mover

Data Mover is described in chapter Power8Jobs.

Further information to Follow

Interactive Access

Development Nodes

You can run interactive jobs using the "-I" flag, for example:

bsub -q scafellpikeI -W 00:59 -n 96 -R "span[ptile=32]" -Is "<command>"

You can also use this method to compile or edit on a compute node, thus freeing up resources on the login node, e.g.

bsub -q scafellpikeI -W 00:59 -Is emacs <my_file>

Note the use of "-Is" to get a pseudo-terminal with stdin.

Further information is given in the chapter on code development Development2.

Extreme Factory - XCS and XRV

Information to Follow

More Information

Back to Contents Page