How to Solve Common Slurm Problems
Use the Slurm command sinfo to check the node status:
If the node status is drain, return the node to service:
$ sudo scontrol update NodeName=<hostname> State=RESUME
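Before resuming nodes it can help to see exactly which ones are drained. The following is a minimal sketch, not part of the original procedure: the helper name drained_nodes is hypothetical, and it assumes the input comes from sinfo with the format string "%N %T" (node name, extended state).

```shell
# Print the names of nodes whose state starts with "drain"
# (drained, draining, drained*, ...).
# Reads "<node> <state>" lines on stdin, as produced by:
#   sinfo -N -h -o "%N %T"
# drained_nodes is a hypothetical helper name.
drained_nodes() {
    # sort -u removes duplicates: with -N a node is listed once per partition.
    awk 'tolower($2) ~ /^drain/ { print $1 }' | sort -u
}

# Typical use on a cluster (not executed here):
#   sinfo -R                                  # down/drained nodes with the recorded reason
#   sinfo -N -h -o "%N %T" | drained_nodes    # only the drained node names
```

Each name printed can then be passed as NodeName to the scontrol update command.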
If the node status is down:
Use the following command to show detailed node information; the Reason field in the output explains why the node is down.
$ sudo scontrol show nodes
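On a large cluster the full output is long. As a rough sketch, the reason lines can be pulled out per node with a small filter; the helper name node_reasons is hypothetical, and the filter assumes each node record starts with an unindented NodeName= field and carries its reason in a Reason= field.

```shell
# Print "<node>: <reason>" for every node record that has a Reason= field.
# Reads the output of "scontrol show nodes" on stdin.
# node_reasons is a hypothetical helper name.
node_reasons() {
    awk '
        /^NodeName=/ { split($1, kv, "="); node = kv[2] }          # remember current node
        /Reason=/    { sub(/^.*Reason=/, ""); print node ": " $0 } # keep text after Reason=
    '
}

# Typical use on a cluster (not executed here):
#   sudo scontrol show nodes | node_reasons
#   sudo scontrol show node <hostname>        # details for a single node
```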
Check whether all the nodes have the same slurm.conf file under /etc/slurm.
Check whether the slurmd and munge services are active on all the nodes, and whether the slurmctld service is active on the management node.
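One way to carry out these two checks is sketched below. It assumes systemd-managed services and passwordless ssh to each node; the node names node1 and node2 and the helper name same_checksum are hypothetical.

```shell
# Succeed only if every "<node> <checksum>" line on stdin carries the
# same checksum, i.e. all nodes have an identical slurm.conf.
# same_checksum is a hypothetical helper name.
same_checksum() {
    awk '{ seen[$2] = 1 }
         END { n = 0; for (k in seen) n++; exit (n == 1 ? 0 : 1) }'
}

# Typical use (not executed here; node1/node2 are placeholder host names):
#   for n in node1 node2; do
#       printf '%s %s\n' "$n" "$(ssh "$n" md5sum /etc/slurm/slurm.conf | cut -d' ' -f1)"
#   done | same_checksum || echo "slurm.conf differs between nodes"
#
#   systemctl is-active slurmd munge    # on every node
#   systemctl is-active slurmctld       # on the management node
```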
Check whether all the nodes have the same date and time, and whether the ntpd service is active on all the nodes.
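A quick way to compare clocks is to collect the epoch time from each node and look at the spread. This is only a sketch: it assumes passwordless ssh, the node names are placeholders, and max_skew is a hypothetical helper. Because the nodes are queried one after another, a difference of a second or two is expected even when the clocks agree.

```shell
# Print the difference in seconds between the largest and smallest
# timestamp found in "<node> <epoch-seconds>" lines on stdin.
# max_skew is a hypothetical helper name.
max_skew() {
    awk '{ t = $2 + 0 }
         NR == 1 { min = t; max = t }
         t < min { min = t }
         t > max { max = t }
         END { print max - min }'
}

# Typical use (not executed here; node1/node2 are placeholder host names):
#   for n in node1 node2; do
#       printf '%s %s\n' "$n" "$(ssh "$n" date +%s)"
#   done | max_skew
#
#   systemctl is-active ntpd            # on every node
```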
If you see the following warning when using srun/prun to run an MPI program:
Failed to create a completion queue (CQ): ...... Error: Cannot allocate memory
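To remove the locked-memory limit, add the following two lines to /etc/security/limits.conf on the nodes:
* soft memlock unlimited
* hard memlock unlimited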
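The two memlock entries are in limits.conf format. As a small sketch, assuming they live in /etc/security/limits.conf, the following filter checks that both are present; the helper name has_unlimited_memlock is hypothetical.

```shell
# Succeed if the limits.conf-format text on stdin sets both the soft and
# the hard memlock limit to "unlimited" for all users ("*").
# has_unlimited_memlock is a hypothetical helper name.
has_unlimited_memlock() {
    awk '$1 == "*" && $3 == "memlock" && $4 == "unlimited" { found[$2] = 1 }
         END { exit ((found["soft"] && found["hard"]) ? 0 : 1) }'
}

# Typical use (not executed here):
#   has_unlimited_memlock < /etc/security/limits.conf || echo "memlock entries missing"
#   srun -N1 bash -c 'ulimit -l'        # limit seen inside a job; "unlimited" is the goal
```

The new limits apply only to sessions and services started after the change.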