Overview

Login Node

When you log in to Komondor, you arrive at the login node (v01). Besides being the gateway to the supercomputer system, the login node is where you manage your projects on the supercomputer. Management tasks include:

  • uploading programs and data to the storage;

  • compiling and installing software;

  • preparing your application to be run on the supercomputer;

  • submitting computation jobs to the compute nodes using the Slurm scheduler;

  • monitoring and handling submitted jobs;

  • checking job efficiency to improve subsequent jobs;

  • arranging, downloading, and backing up computation results.

Important

Small management tasks that do not require significant resources may be executed on the login node; resource-intensive computation tasks, however, are subject to interruption without prior notice.
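To illustrate the submit-from-the-login-node workflow, a minimal Slurm batch script might look like the following. This is a generic sketch: the program name is a placeholder, the resource values are illustrative, and Komondor may require additional options (e.g. an accounting project) not shown here.

```bash
#!/bin/bash
#SBATCH --job-name=example      # job name shown in the queue
#SBATCH --partition=cpu         # default CPU partition
#SBATCH --ntasks=1              # number of tasks (processes)
#SBATCH --cpus-per-task=4       # CPU cores per task
#SBATCH --mem=8G                # memory for the whole job
#SBATCH --time=00:30:00         # walltime limit (HH:MM:SS)

srun ./my_program               # placeholder for your application
```

Saved as, say, `job.sh`, it would be submitted from the login node with `sbatch job.sh`; the computation itself then runs on a compute node, not on the login node.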

Compute Nodes

Actual computation tasks must be run on the compute nodes, which provide the computing power of the supercomputer. Based on the amount and type of resources they offer, there are four types of compute nodes in Komondor: CPU-only, GPU, AI, and BigData. Slurm organizes these resources into partitions (queues), which serve different computing requirements.

Job Queues (Partitions)

The main compute partitions of Komondor are:

  • cpu (default)

  • gpu

  • ai

  • bigdata

| Partition | Compute nodes | CPUs / node | CPU cores / node | GPUs / node | Memory / node |
|---|---|---|---|---|---|
| “CPU” (cpu, default) | 184 | 2 AMD CPUs | 128 | N/A | 256 GB RAM |
| “GPU” (gpu) | 58 | 1 AMD CPU | 64 | 4 A100 GPUs | 256 GB RAM |
| “AI” (ai) | 4 | 2 AMD CPUs | 128 | 8 A100 GPUs | 512 GB RAM |
| “BigData” (bigdata) | 1 | 16 Intel CPUs | 288 | N/A | 12 TB RAM |

Special-purpose Slurm partitions

In addition to the main compute partitions above, Komondor provides a few specialized Slurm partitions for interactive work and for improving overall utilization.

| Partition | Intended use | Default time | Max time | Notes |
|---|---|---|---|---|
| test | Short interactive tests, debugging, quick experiments (including GPU), JupyterHub notebooks | 00:10:00 | 01:00:00 | Dedicated GPU node; MaxMemPerCPU=4000 MB; oversubscription enabled (OverSubscribe=FORCE:2) |
| cpu-short | Short CPU-only jobs that may run on otherwise idle GPU nodes | 01:00:00 | 02:00:00 | Hidden partition; MaxMemPerCPU=4000 MB; uses GPU nodes (GPUs are not allocated unless requested) |
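For quick interactive work on the test partition, one plausible pattern uses standard Slurm interactive allocation; the option values below are illustrative, and the time requests stay within the partition's limits stated above.

```bash
# request an interactive shell on the test partition for 30 minutes
srun --partition=test --time=00:30:00 --cpus-per-task=2 --pty bash

# quick GPU sanity check on the dedicated GPU node
srun --partition=test --time=00:10:00 --gres=gpu:1 --pty nvidia-smi
```

When the allocation starts, the shell (or `nvidia-smi`) runs on the test node rather than the login node, and is released when you exit or the time limit expires.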

Short jobs and the cpu-short partition

To reduce waiting time and increase utilization, short CPU jobs submitted to the default cpu partition can be automatically eligible to run on GPU nodes via the hidden cpu-short partition. This happens when the job:

  • requests a walltime of 2 hours or less, and

  • requests 64 CPU cores or less in total.

This is most useful when the CPU partition is saturated while GPU nodes have available CPU capacity. You do not need to submit directly to cpu-short; the partition list is extended automatically by a Slurm job submission (Lua) policy. Just set an accurate time limit for short jobs.
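The actual routing is done server-side by the Slurm submission (Lua) policy and is not shown here; purely to make the two conditions concrete, the hypothetical shell function below mirrors the stated eligibility check (walltime at most 120 minutes and at most 64 cores in total).

```bash
#!/bin/bash
# Hypothetical illustration of the cpu-short eligibility conditions.
# Arguments: requested walltime in minutes, total CPU cores requested.
eligible_for_cpu_short() {
  local walltime_minutes=$1 total_cores=$2
  if [ "$walltime_minutes" -le 120 ] && [ "$total_cores" -le 64 ]; then
    echo "eligible"
  else
    echo "not eligible"
  fi
}

eligible_for_cpu_short 90 32    # -> eligible (short and small enough)
eligible_for_cpu_short 180 32   # -> not eligible (walltime over 2 hours)
eligible_for_cpu_short 90 128   # -> not eligible (more than 64 cores)
```

In practice this means a job submitted with, e.g., `--time=01:30:00 --ntasks=32` to the cpu partition may also be scheduled on an idle GPU node, with no change to the job script.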

You can instruct the Slurm scheduler to allocate the necessary resources and launch your tasks on the compute nodes using Slurm commands and special directives in your job script. Slurm queues all submitted jobs (from all users) according to their calculated priority and starts them on a schedule based on available resources.
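The typical command-line workflow for submitting and monitoring jobs looks roughly like the following; `job.sh` and the job ID 12345 are placeholders, and `seff` is a commonly installed (but optional) Slurm efficiency utility.

```bash
sbatch job.sh             # submit the job script; prints the assigned job ID
squeue -u "$USER"         # list your pending and running jobs
scontrol show job 12345   # detailed state of one job
scancel 12345             # cancel a job
sacct -j 12345            # accounting data after the job finishes
seff 12345                # CPU/memory efficiency summary, if available
```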