How to Solve Common Slurm Problems

  • Use the Slurm command sinfo to check the node status:

    If the node status is drain:

    Use the following command to return the node to its normal state:
    $ sudo scontrol update NodeName=<hostname> State=RESUME
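
    Before resuming a node, it can help to see why Slurm drained it. The following is a minimal sketch, assuming a bash shell on the management node. resume_node is a hypothetical helper name, not part of Slurm, and node01 in the usage note is a placeholder hostname.

```shell
# Hypothetical helper: show the recorded drain/down reason for one node,
# then return it to service with the scontrol command shown above.
resume_node() {
    # -R lists the reason recorded for the node's state; --nodes limits output to one node.
    sinfo -R --nodes="$1"
    sudo scontrol update NodeName="$1" State=RESUME
}
```

    Usage: resume_node node01. If the reason points to an unresolved fault, fix that first, since the node may otherwise be drained again.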
    

    If the node status is down:

    1. Use the following command to show detailed node information, and look for the Reason field in its output.

    $ sudo scontrol show nodes
    
    2. Check whether all nodes have the same slurm.conf file under /etc/slurm.

    3. Check whether the slurmd and munge services are active on all nodes, and whether the slurmctld service is active on the management node.

    4. Check whether all nodes have the same date and time, and whether the ntpd service is active on all nodes.
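
    The configuration, service, and clock checks above can be sketched as a few shell helpers. This assumes passwordless ssh from the management node to every node and systemd-managed services. The function names are hypothetical, and the node names in the usage note are placeholders.

```shell
# Print each node's slurm.conf checksum; differing values mean the files differ.
conf_checksums() {
    for node in "$@"; do
        printf '%s ' "$node"
        ssh "$node" md5sum /etc/slurm/slurm.conf | awk '{print $1}'
    done
}

# Print the state of slurmd, munge, and ntpd on each node, in that order.
service_states() {
    for node in "$@"; do
        printf '%s ' "$node"
        ssh "$node" systemctl is-active slurmd munge ntpd | tr '\n' ' '
        echo
    done
}

# Print each node's clock as seconds since the epoch, for a quick skew comparison.
node_clocks() {
    for node in "$@"; do
        printf '%s ' "$node"
        ssh "$node" date +%s
    done
}
```

    Usage: conf_checksums node01 node02, and likewise for service_states and node_clocks. On the management node itself, systemctl is-active slurmctld covers the remaining check.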

  • If you see the following warning when using srun/prun to run an MPI program:

    Failed to create a completion queue (CQ):
    ......
    Error: Cannot allocate memory
    
    Check whether the soft and hard memlock limits are set to unlimited in /etc/security/limits.conf on the management node and the compute nodes. If they are not, set them to unlimited as follows, then restart the nodes for the change to take effect:
    * soft memlock unlimited
    * hard memlock unlimited
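
    After the restart, the limits can be verified from a shell on each node. This is a minimal sketch. memlock_ok is a hypothetical helper name, and it reports only the limits of the shell it runs in, which may differ from those seen by jobs launched through slurmd.

```shell
# Succeeds only if both the soft and hard locked-memory limits are unlimited.
memlock_ok() {
    [ "$(ulimit -S -l)" = "unlimited" ] && [ "$(ulimit -H -l)" = "unlimited" ]
}
```

    Usage: memlock_ok && echo "memlock is unlimited" || echo "memlock is still limited".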