How to use "suspend" to let an urgent job run first without affecting the resumption of preempted jobs

Scenario

In a compute-intensive cluster, users usually request an appropriate number of CPUs. However, because they have poor control over memory usage, they often request far more memory than their jobs need.

As a result, when a high-priority job needs to run ahead of the others, suspending the running low-priority jobs is not enough. A suspended job keeps its memory allocation, so the scheduler keeps the high-priority job pending because there is not enough available memory.

At this point, the usual approach is to cancel the low-priority jobs that occupy the required resources. However, cancelled jobs cannot resume without additional cost. Jobs without a checkpoint mechanism must be rerun from scratch. Jobs that save checkpoints must reload the last saved checkpoint, and any results computed after that checkpoint are lost.

Solution

This article describes how an administrator can manually handle an urgent job so that it runs immediately, without losing the results already computed by the preempted jobs. (Make sure the nodes in the cluster are configured with enough memory.)

Note: This method is a manual operation by the administrator and is intended only for handling jobs under special circumstances. It is not recommended for routine use and cannot replace Slurm's automatic preemption function.

Adjust the Slurm cluster according to the following configuration to avoid the situation mentioned earlier. Changes to the following configuration files take effect upon restart of Slurm daemons, daemon receipt of the SIGHUP signal, or execution of the command scontrol reconfigure.

  1. slurm.conf

    SelectTypeParameters=CR_CPU_Memory

  2. cgroup.conf

    ConstrainRAMSpace=yes
    MaxRAMPercent=50

Note

  1. MaxRAMPercent=50 means that the memory used by a job cannot exceed 50% of the node's total memory as configured in Slurm. Adjust this value to suit the actual situation. The rest of this article assumes 50.
  2. With the configuration above in place, the administrator can also set default memory allocation rules at the partition level, according to business needs, to make things easier for users.
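
As a rough illustration of what the 50% limit means in absolute terms, the sketch below computes the per-job memory ceiling from a node's configured memory. The function name `max_job_mem_mb` and the sample value of 256000 MB are hypothetical, and the arithmetic is integer-only, so it is just an approximation of how the limit is derived.

```shell
# max_job_mem_mb REAL_MEMORY_MB [PERCENT]
# Hypothetical helper: prints the per-job memory ceiling in MB implied by
# MaxRAMPercent for a node whose configured memory is REAL_MEMORY_MB.
# PERCENT defaults to 50, the value used in this article.
max_job_mem_mb() {
    real_mb=$1
    percent=${2:-50}
    # Integer arithmetic only; fractional percentages are not handled.
    echo $(( real_mb * percent / 100 ))
}

# Example: a node configured with 256000 MB of memory.
max_job_mem_mb 256000 50   # prints 128000
```

With this limit, a suspended job holds at most half of the node's memory, which presumably is what leaves room for an urgent job on the same node.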

With the cluster configured this way, if a job requests all CPUs and more than 50% of a node's memory, the OOM mechanism is triggered as soon as the memory actually used by the job exceeds 50% of the node's memory.

Administrators can announce this rule to all users of the cluster, so that when a user requests memory in violation of the rule, the administrator can cancel the offending job.
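
To enforce the rule, an administrator needs to spot jobs whose memory request exceeds the limit. The sketch below assumes the requests have already been collected into plain text lines of the form `JOBID REQUESTED_MB`. Both that input format and the function name `list_mem_violations` are hypothetical, and the sketch does not query Slurm itself.

```shell
# list_mem_violations CAP_MB
# Hypothetical helper: reads "JOBID REQUESTED_MB" lines on stdin and prints
# the IDs of jobs whose requested memory is strictly greater than CAP_MB.
list_mem_violations() {
    awk -v cap="$1" 'NF >= 2 && ($2 + 0) > (cap + 0) { print $1 }'
}

# Example with a 128000 MB limit: only job 1001 requests more than the limit.
printf '%s\n' '1001 200000' '1002 64000' '1003 128000' | list_mem_violations 128000
# prints 1001
```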

With this in place, when a job that must run first is submitted, administrators can follow these steps:

  1. Adjust the priority of the urgent job to the highest.

  2. Locate the resources that need to be preempted and suspend the job currently occupying them. The urgent job will then run.

  3. Once the urgent job is running, update the node state to DRAIN so that no new jobs are scheduled on the node.

  4. After the urgent job ends, resume the preempted job.

  5. Restore the node state. The preempted job resumes from the point at which it was suspended and runs to completion.
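
The five steps above can be sketched with standard `scontrol` subcommands (`update JobId=... Priority=...`, `suspend`, `update NodeName=... State=DRAIN Reason=...`, `resume`, and `update NodeName=... State=RESUME`). The function names, the default priority of 1000000, and the `Reason` text are illustrative assumptions, not part of the original procedure. The sketch prints the commands by default and only executes them when `DRY_RUN=0` is set.

```shell
# Dry-run wrapper: prints the command unless DRY_RUN=0 is set.
# Running for real requires Slurm administrator rights.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Steps 1-2: raise the urgent job's priority, then suspend the low-priority job.
# The default priority is a hypothetical value; choose one that is higher than
# the priority of every other pending job in the cluster.
start_urgent() {
    urgent_job=$1
    low_job=$2
    priority=${3:-1000000}
    run scontrol update "JobId=$urgent_job" "Priority=$priority"
    run scontrol suspend "$low_job"
}

# Step 3: call this once the urgent job is confirmed to be running.
# scontrol requires a Reason when a node is set to DRAIN.
drain_node() {
    node=$1
    urgent_job=$2
    run scontrol update "NodeName=$node" State=DRAIN "Reason=urgent_job_$urgent_job"
}

# Steps 4-5: after the urgent job ends, resume the suspended job and
# return the node to service.
finish_urgent() {
    low_job=$1
    node=$2
    run scontrol resume "$low_job"
    run scontrol update "NodeName=$node" State=RESUME
}
```

For example, with a hypothetical urgent job 1002, suspended job 1001, and node node01, the administrator would call `start_urgent 1002 1001`, then `drain_node node01 1002` once job 1002 is running, and finally `finish_urgent 1001 node01` after job 1002 ends. Keeping the phases as separate functions reflects that the administrator has to check the urgent job's state between them.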