Overview
Login Node
When you log in to Komondor, you arrive at the login node (v01). Apart from being the gateway to the supercomputer system, the login node can be used to manage your projects on the supercomputer. Management tasks include:
uploading programs and data to the storage;
compiling and installing software;
preparing your application to be run on the supercomputer;
submitting computation jobs to the compute nodes using the Slurm scheduler;
monitoring and handling submitted jobs;
checking job efficiency to improve subsequent jobs;
arranging, downloading and backing up computation results.
Important
Small management tasks that do not require significant resources may be executed on the login node, but resource-intensive computation tasks are subject to termination without prior notice.
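The management tasks above can be sketched as a typical session; the hostname, username, paths and module name below are illustrative placeholders, not actual Komondor values:

```shell
# Upload program sources and data to the storage (placeholder host/path)
scp mydata.tar.gz username@komondor.example:/project/myproject/

# Connect to the login node
ssh username@komondor.example

# Compile software on the login node (module name may differ on Komondor)
module load gcc
gcc -O2 -o myapp myapp.c

# Submit a computation job to the compute nodes via the Slurm scheduler
sbatch job.sh
```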
Compute Nodes
Actual computation tasks must be run on the compute nodes. These are the nodes that provide the computing power of the supercomputer. Based on the amount and type of resources they offer, there are four types of compute nodes in Komondor: CPU-only, GPU, AI and BigData. Slurm organizes these resources into partitions (queues) that serve different computing requirements.
Job Queues (Partitions)
The main compute partitions of Komondor are:
| Partition | Compute nodes | CPUs / node | CPU cores / node | GPUs / node | Memory / node |
|---|---|---|---|---|---|
| "CPU" (`cpu`, default) | 184 | 2 AMD CPUs | 128 CPU cores | N/A | 256 GB RAM |
| "GPU" (`gpu`) | 58 | 1 AMD CPU | 64 CPU cores | 4 A100 GPUs | 256 GB RAM |
| "AI" (`ai`) | 4 | 2 AMD CPUs | 128 CPU cores | 8 A100 GPUs | 512 GB RAM |
| "BigData" (`bigdata`) | 1 | 16 Intel CPUs | 288 CPU cores | N/A | 12 TB RAM |
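A job is directed to one of these partitions with the standard Slurm `--partition` directive. The following job-script sketch requests a GPU node; the job name, GPU count, core count and time limit are example values, not site policy:

```shell
#!/bin/bash
# Example Slurm job script targeting the "gpu" partition from the table above.
#SBATCH --job-name=gpu-test
#SBATCH --partition=gpu        # one of: cpu (default), gpu, ai, bigdata
#SBATCH --gres=gpu:1           # request one A100 GPU on the node
#SBATCH --cpus-per-task=16
#SBATCH --time=04:00:00

srun ./my_gpu_app              # placeholder application binary
```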
Special-purpose Slurm partitions
In addition to the main compute partitions above, Komondor provides a few specialized Slurm partitions for interactive work and for improving overall utilization.
| Partition | Intended use | Default time | Max time | Notes |
|---|---|---|---|---|
| | Short interactive tests, debugging, quick experiments (including GPU), JupyterHub notebooks | 00:10:00 | 01:00:00 | Dedicated GPU node |
| `cpu-short` | Short CPU-only jobs that may run on otherwise idle GPU nodes | 01:00:00 | 02:00:00 | Hidden partition |
Short jobs and the cpu-short partition
To reduce waiting time and increase utilization, short CPU jobs submitted to the default cpu partition can be automatically eligible
to run on GPU nodes via the hidden cpu-short partition. This happens when the job:
requests a walltime of 2 hours or less, and
requests 64 CPU cores or fewer in total.
This is most useful when the CPU partition is saturated while GPU nodes have available CPU capacity. You do not need to submit directly to
cpu-short; the partition list is extended automatically by a Slurm job submission (Lua) policy. Just set an accurate time limit for short jobs.
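A job script that satisfies both conditions might look like the sketch below; the job name and application binary are placeholders, and the resource values are chosen only to sit within the stated limits:

```shell
#!/bin/bash
# Sketch of a job eligible for the hidden cpu-short partition:
# submitted to the default cpu partition, <= 2 hours, <= 64 cores in total.
#SBATCH --job-name=short-cpu
#SBATCH --partition=cpu        # submit to the default partition as usual
#SBATCH --ntasks=64            # at most 64 CPU cores in total
#SBATCH --time=01:30:00        # at most 02:00:00

srun ./my_cpu_app              # placeholder application binary
```

The Lua submission policy then extends the job's partition list automatically; no extra directive is needed.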
You can instruct the Slurm scheduler to allocate the necessary resources and launch your tasks on the compute nodes using Slurm commands and special directives in your job script. Slurm queues all submitted jobs (from all users) according to their calculated priority and starts them as the required resources become available.
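Assuming the standard Slurm client tools are available on the login node (and the commonly installed `seff` utility for efficiency reports), the submit-monitor-review cycle looks like this; `<jobid>` stands for the numeric ID printed by `sbatch`:

```shell
sbatch job.sh        # submit the job script; prints the assigned job ID
squeue --me          # list your own pending and running jobs
scancel <jobid>      # cancel a submitted or running job
seff <jobid>         # CPU/memory efficiency report for a finished job
```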