Slurm - Queue Policies & Advice
Our Slurm setup runs with the following goals and constraints in mind:
- allow short jobs to run without having to wait more than a few hours
- do not permit many long jobs to take over the entire cluster for long periods
- try to divide the cluster equally among users
- keep all of the cluster’s processors as busy as possible all of the time
To do this, we primarily use three queues/partitions called short, medium and long (referring to their runtime), with medium being the default queue that jobs will go to unless you specify otherwise.
Important
You can access a maximum of 256 cores per queue at a time. Any subsequent jobs will queue until your usage allows for more to start running.
| Queue | CPUs | Max RAM | Time Limit | Description |
|---|---|---|---|---|
| short | 640 | 192 - 512 GB | 6 hours | This is a high priority queue for smaller jobs, with thresholds set to let them squeeze through when they might otherwise have to wait in the other queues. |
| medium | 1,152 | 192 - 512 GB | 24 hours | This is the default queue that all jobs will submit to unless otherwise requested. |
| long | 1,888 | 192 - 384 GB | 14 days | This queue is for long running jobs. |
Note
When running array jobs, each individual job has its own time limit, so even an array job with 1000 parts that each take 3 hours to run could still use the short queue.
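For example, a minimal sketch of such an array submission to the short queue might look like the following (the array size, processing script and its argument are placeholders, not anything provided on the cluster):
#!/bin/bash
#SBATCH --partition=short      # each array task gets its own 6-hour limit
#SBATCH --array=1-1000         # 1,000 independent tasks
#SBATCH --cpus-per-task=1

# SLURM_ARRAY_TASK_ID tells each task which element of the array it is
./process_part.sh "$SLURM_ARRAY_TASK_ID"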
There are also two special queues that should only be used for jobs that require large amounts of memory or access to GPU Processing:
| Queue | CPUs | Max RAM | Time Limit | Description |
|---|---|---|---|---|
| himem | 440 | 1.5 - 4.0 TB | 14 days | This queue is for jobs requiring a very large amount of RAM. |
| gpu | 336 | 192 - 512 GB | 14 days | This queue is for jobs requiring GPUs. |
Note
All queues run with the same priority across all nodes. Only the time limits differ, with the short and medium queues automatically killing a job if it exceeds their limits. GPUs can only be accessed from the gpu queue, and large RAM requests can only run on the himem queue.
Note
The maximum amount of memory you can request (per node) will be a few GB less than shown above, because Slurm reserves some of it for the operating system.
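You can also query these limits directly on the cluster with the standard sinfo command; for example, the format string below (built from sinfo's standard field specifiers) prints each partition's time limit, CPUs per node and memory per node:
sinfo -o "%P %l %c %m"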
Specifying queues
Note
No special options are required to submit to the medium queue.
To submit to the short queue, use:
sbatch --partition=short myscript.sh
Or to submit to the long queue, use:
sbatch --partition=long myscript.sh
To submit to the high memory (himem) queue, use:
sbatch --partition=himem myscript.sh
To submit to the gpu queue, where n specifies how many GPUs you want to use:
sbatch --partition=gpu --gpus=[n] myscript.sh
For more details on accessing the GPUs, see GPU Processing.
To get a job list for an individual queue rather than all queues, use the -p or --partition option for squeue, for example:
squeue --partition=short
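As an alternative to the command-line option, the partition can also be set inside the submission script itself with a standard #SBATCH directive; a minimal sketch (the job name and commands are placeholders):
#!/bin/bash
#SBATCH --partition=long       # same effect as sbatch --partition=long
#SBATCH --job-name=my_long_job

# ...the rest of your job's commands go here...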
Additional advice and guidance
Below are some additional questions you may have about using the cluster in a sensible - and fair - manner. Don’t hesitate to Contact Us if you’re unsure though.
Can I use the entire cluster at once?
It depends.
While there are currently no limits to prevent you from submitting a job that uses every CPU across one or more queues, you first need to ask yourself how sensible that would be. Consider:
- how long will the job last? Short-running tasks allow others’ jobs to rise in priority above yours (the fair-share policy), so submitting 10,000 jobs that only last a few minutes each will ‘hog’ the cluster much less than just a few tens or hundreds of jobs that last for hours and hours.
- how busy is the cluster? If it’s 2am and no-one else is using it, then your job is less likely to be detrimental to anyone else (a quick way to check is shown below).
- how much do you value your friendship with other cluster users? Seriously. This is a shared resource, and while it’s here to be used, it’s not here to be abused.
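If you are unsure how busy things are, the standard Slurm commands give a quick overview before you submit:
squeue        # every job currently running or waiting, across all users
sinfo         # node states per partition (idle, mix, alloc, etc.)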
Which queue/partition should I use?
It depends.
Based purely on historical observation and anecdotal evidence, a significant number of jobs seem to complete OK within 24 hours (so the default medium queue is probably fine), but obviously the bigger the job or data set you want to process, the more likely it is to overrun, and the safer it therefore is to use the long queue. However, if the long queue is busy, you may then have to wait longer for your job to start. Note, though, that each task of an array job has its own time allocation, so you could still successfully run a week-long job on the medium queue if each of its subtasks completes in less than 24 hours.
If it’s an interactive job, then you’re probably better off running it on the short queue.
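The usual Slurm way to start an interactive session is srun with a pseudo-terminal; the resource values below are only illustrative, and local defaults or wrapper scripts may differ:
srun --partition=short --cpus-per-task=1 --mem=4G --pty bash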
Where should I write data to?
It depends.
During a job, you should almost always be writing output data to one of the scratch locations; however, there’s a choice of storage locations, each with their own pros and cons:
- Shared network BeeGFS scratch space ($SCRATCH or /mnt/shared/scratch/$USER) is accessible from any node and may be where your data is already residing. It’s a parallel storage array and reasonably fast when dealing with very large sequential reads or writes - so great for stream reading from multiple large .bam files, for instance - but not so good if your job has to read or write hundreds of thousands (or millions) of very tiny files. As part of the main storage array it also has plenty of free space.
- Node-specific scratch space ($TMPDIR) is local to each node and uses an array of SSDs for performance, so it can be much faster than BeeGFS for certain use cases, but each node’s capacity is limited (see System Overview for details) and you need to copy your data there first.
Note
$TMPDIR is automatically created - and destroyed! - as part of a job submission, so it’s up to you to copy any input data here as the first step of an sbatch submission, and to copy data out again at the end.
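A typical pattern - sketched below with placeholder file names and a hypothetical analysis command - is to stage input into $TMPDIR at the start of the script and copy results back to $SCRATCH before the job finishes:
#!/bin/bash
#SBATCH --partition=medium

# Stage input data onto the node-local SSDs (file names are placeholders)
cp "$SCRATCH/input.bam" "$TMPDIR/"

# Run the analysis against the local copy
cd "$TMPDIR"
my_analysis_tool input.bam > results.txt    # hypothetical command

# Copy results back before the job ends - $TMPDIR is deleted afterwards
cp results.txt "$SCRATCH/"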
How much CPU/memory should I allocate to a job?
It depends.
Although gruffalo can automatically manage and prioritise jobs well - most of the time - you still need to ensure sensible job-allocation requests are made.
Try to avoid submitting jobs that lock out too much of the cluster at once, either by using too many CPUs simultaneously for an excessive amount of time, or by requesting resources far beyond those actually used (e.g. asking for 16 CPUs for a process that only uses one for the majority of its runtime, or 100 GB of memory for a job that only uses a fraction of that). Over-allocation of resources negatively affects both other users and additional jobs of your own.
However, if you under-allocate on memory, the cluster will kill jobs that try to go beyond their requested allocation. It may therefore be tempting to just over-allocate everything for every job, asking for all the CPUs or all the memory, but this is easily spotted and we’ll take action if we notice your jobs continually requesting resources significantly beyond what they’re using. Jobs requesting more resources also tend to take longer to run, as they must wait until all those resources become available if the cluster is busy. It may just take a little trial and error until you get comfortable with how much to request for a given job or data set.
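A good way to calibrate future requests is to compare what a finished job actually used with what it asked for. sacct is part of standard Slurm, and seff is a commonly bundled summary script (it may or may not be installed here); in both commands [jobid] is the ID of a completed job:
sacct -j [jobid] --format=JobID,Elapsed,ReqCPUS,TotalCPU,ReqMem,MaxRSS
seff [jobid]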
Finally, you should also take Green Computing into account. A single node running 32 tasks uses far less energy than 32 nodes running 1 task each. If you over-allocate resources, then more nodes need to be online to meet your requirements, which wastes power if they’re not being used effectively.