Using Partitions¶
A partition is a collection of compute nodes, think of it as a sub-cluster or slice of the larger cluster. Each partition has its own rules and configurations.
For example, the quicktest partition has a maximum job run-time of 1 hour, whereas the partition bigmem has a maximum runtime of 10 days. Partitions can also limit who can run a job. Currently any user can use any partition but there may come a time when certain research groups purchase their own nodes and they are given exclusive access.
To view the partitions available to use you can type the vuw-partitions command, eg
harrelwe@raapoi-master:~$ vuw-partitions
VUW CLUSTER PARTITIONS
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
quicktest* up 1:00:00 1 idle c03n01
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
bigmem up 10-00:00:0 2 idle c10n01,c11n01
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
parallel up 10-00:00:00 6 down* c04n01,c05n04,c06n[01-04]
parallel up 10-00:00:00 27 idle
c03n[02-04],c04n[02-04],c05n[01-02],c07n[01,03-04],c08n[01-04],c09n[01-04],c12n[01-04],c13n[01-04]
NOTE: This utility is a wrapper for the Slurm command:
sinfo -p PARTITION
Notice the STATE field, this describes the current condition of nodes within the partition, the most common states are defined as:
- idle - nodes in an idle state have no jobs running, all resources are available for work
- mix - nodes in a mixed state have some jobs running, but still have some resources available for work
- alloc - nodes in an alloc state are completely full, all resources are in use.
- drain - nodes in a drain state have some running jobs, but no new jobs can be run. This is typically done before the node goes into maintenance
- maint - node is in maintenance mode, no jobs can be submitted
- resv - node is in a reservation. A reservation is setup for future maintenance or for special purposes such as temporary dedicated access
- down - node is down, either for maitnenance or due to failure
Also notice the TIMELIMIT field, this describes the maximum runtime of a partition. For example, the quicktest partition has a maximum runtime of 1 hour and the parallel partition has a max runtime of 10 days.
Partition Descriptions¶
Partition: quicktest
This partition is for quick tests of code, environment, software builds or similar short-run jobs. Since the max time limit is 1 hour it should not take long for your job to run. This can also be used for near-on-demand interactive jobs.
- Maximum CPU available per task: 24
- Maximum memory available per task: 62G
- Maximum Runtime: 1 hour
Partition: bigmem
This partition is primarily useful for jobs that require very large shared memory (generally greater than 125 GB). These are known as memory-bound jobs.
- Maximum CPU available per task: 48
- Maximum memory available per task: 1 TB (Note: maximum CPU for 1 TB is 40)
- Maximum Runtime: 10 days
Partition: parallel
This partition is useful for parallel workflows, either loosely coupled or jobs requiring MPI or other message passing protocols for tightly bound jobs.
- Maximum CPU available per task: 64
- Maximum memory available per task: 125G
- Maximum Runtime: 10 days
Cluster Defaults
Please note that if you do not specify CPU, Memory or Time in your job request you will be given the cluster defaults which are:
- Default CPU: 2
- Default Memory: 2 GB
- Default Time: 1 hour
You can change these with the -c, --mem and --time parameters to the srun and sbatch commands. Please see this documentation for more information about srun and sbatch.