Job Submission Failed on Host Cluster
In most cases, a job submission failure is caused by an incorrect Slurm status. You can investigate with the following methods:
Check the job output in the LiCO GUI.
SSH to the login node of the cluster and re-submit the job:
- Go to the user's home directory and find the job file.
- Run the command sbatch jobfile.slurm to re-submit the job, as shown in the sketch below.
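For example, assuming the job file is jobfile.slurm in the home directory (the file name is illustrative):
cd ~
sbatch jobfile.slurm
If the submission succeeds, sbatch prints the new job ID, and the job output is typically written to slurm-<jobid>.out in the working directory unless --output is set in the job file.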
Note: In many failed-job cases, the cause is that the requested resources exceed the cluster's limits. For example, if the cluster has 80 cores, a job that requests more than 80 cores will fail. In the example below, the job requests at least 200 CPUs on a single node (--mincpus=200), which no node in the cluster can satisfy:
cat a.slurm
#!/bin/bash
#SBATCH --job-name='test'
#SBATCH --workdir=/home/hpcadmin/test
#SBATCH --partition=compute
#SBATCH --nodes=1
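# The next line requests at least 200 CPUs on a single node; no node in this cluster has that many, so sbatch rejects the job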
#SBATCH --mincpus=200
mpirun /home/hpcadmin/test/icpi
sbatch a.slurm
sbatch: error: CPU count per node can not be satisfied
sbatch: error: Batch job submission failed: Requested node configuration is not available
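Before re-submitting, compare the request against what the nodes actually provide. A minimal sketch (the format string is one possible choice):
sinfo -N -o "%N %c %m %G"
This lists each node with its CPU count, memory in MB, and generic resources (GRES) such as GPUs, so you can size --mincpus and other requests accordingly.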
SSH to the login node of the cluster and run Slurm commands to investigate.
Run sinfo to check the queues, squeue to check running jobs, and scancel to cancel a running job.
sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
compute up infinite 1 mix c1
compute up infinite 1 idle c2
compute1 up infinite 1 idle c2
wujq up infinite 1 mix c1
wujq up infinite 1 idle c2
test up infinite 1 mix c1
wujtest up infinite 1 mix c1
wujtest up infinite 1 idle c2
squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
7429 compute test hpcadmin R 0:05 1 c1
scancel 7429
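If a job does not fail at submission but stays pending, scontrol and sacct can show why. A short sketch (job ID 7429 is taken from the example above; sacct requires Slurm accounting to be enabled):
scontrol show job 7429
sacct -j 7429 --format=JobID,JobName,State,ExitCode
The Reason field in the scontrol output explains why the job is waiting, for example Resources or Priority.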
If you are the administrator of the cluster, you can also SSH to the Slurm management node.
- Run the command scontrol show nodes to check the status of the compute nodes in the cluster.
scontrol show nodes
NodeName=c1 Arch=x86_64 CoresPerSocket=1
CPUAlloc=0 CPUTot=72 CPULoad=0.01
AvailableFeatures=(null)
ActiveFeatures=(null)
Gres=gpu:2
NodeAddr=c1 NodeHostName=c1 Version=18.08
OS=Linux 3.10.0-1062.el7.x86_64 #1 SMP Wed Aug 7 18:08:02 UTC 2019
RealMemory=200000 AllocMem=0 FreeMem=31158 Sockets=72 Boards=1
State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
Partitions=compute,wujq,test,wujtest
BootTime=2020-01-15T09:40:01 SlurmdStartTime=2020-01-15T09:40:30
CfgTRES=cpu=72,mem=200000M,billing=72
AllocTRES=
CapWatts=n/a
CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
NodeName=c2 Arch=x86_64 CoresPerSocket=1
CPUAlloc=0 CPUTot=72 CPULoad=0.01
AvailableFeatures=(null)
ActiveFeatures=(null)
Gres=gpu:2
NodeAddr=c2 NodeHostName=c2 Version=18.08
OS=Linux 3.10.0-1062.el7.x86_64 #1 SMP Wed Aug 7 18:08:02 UTC 2019
RealMemory=200000 AllocMem=0 FreeMem=30217 Sockets=72 Boards=1
State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
Partitions=compute,compute1,wujq,wujtest
BootTime=2020-01-15T09:37:50 SlurmdStartTime=2020-01-15T09:38:31
CfgTRES=cpu=72,mem=200000M,billing=72
AllocTRES=
CapWatts=n/a
CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
- Run the command sinfo to check the status of the queues. If some nodes are in the drain state, use the following command to resume them (a sketch for checking the drain reason follows the command):
scontrol update nodename=c1 state=idle
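Before resuming a node, it can help to check why it was drained. A short sketch (c1 is the node from the example above):
sinfo -R
scontrol show node c1 | grep -i reason
sinfo -R lists the reason recorded when each node was set to drain or down; the Reason field from scontrol shows the same information for a single node.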
For more Slurm commands, refer to https://slurm.schedmd.com/man_index.html