Kubernetes must be configured before LiCO can run AI model training on it. The YAML files used in this guide are located at https://hpc.lenovo.com/lico/downloads/6.2/examples/k8s/; download them to a local directory before running the steps below.
Deploy the NGINX Ingress controller:

```shell
kubectl apply -f https://raw.githubusercontent.com/kubernetes/ingress-nginx/nginx-0.30.0/deploy/static/mandatory.yaml
kubectl apply -f https://raw.githubusercontent.com/kubernetes/ingress-nginx/nginx-0.30.0/deploy/static/provider/baremetal/service-nodeport.yaml
```

Apply the downloaded system-anonymous.yaml and clusterrole.yaml files:

```shell
kubectl apply -f system-anonymous.yaml
kubectl apply -f clusterrole.yaml
```

Step 1. Create the monitoring namespace.
```shell
kubectl create namespace monitoring
```

Step 2. Label all GPU nodes with hardware-type=NVIDIAGPU.
```shell
kubectl label node <gpu node name> hardware-type=NVIDIAGPU
```

For example:
```shell
kubectl label node gpunode1 hardware-type=NVIDIAGPU
kubectl label node gpunode2 hardware-type=NVIDIAGPU
```

Step 3. In the file prometheus-deployment.yaml, add the addresses of all GPU nodes and modify the node port.
```yaml
static_configs:
  - targets:
      - '<gpu node address>:9400'
# ...
nodePort: <port>
```

For example:
```yaml
static_configs:
  - targets:
      - '10.240.212.120:9400'
      - '10.240.212.122:9400'
# ...
nodePort: 31893
```

Step 4. In the file dcgm-exporter.yaml, modify the hostPath of the 'dcgm-exporter-counters' volume. The file counters.csv must be in this path, and every node must be able to access it.
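The counters.csv file follows the dcgm-exporter CSV counter format: one line per counter, giving the DCGM field name, the Prometheus metric type, and a help string. The specific counters below are an illustrative sketch, not the required set for LiCO:

```csv
# Illustrative counters.csv (counter selection is an assumption)
DCGM_FI_DEV_GPU_UTIL, gauge, GPU utilization (in %).
DCGM_FI_DEV_FB_USED,  gauge, Framebuffer memory used (in MiB).
DCGM_FI_DEV_GPU_TEMP, gauge, GPU temperature (in C).
```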
```yaml
volumes:
  - name: "dcgm-exporter-counters"
    hostPath:
      path: "/etc/dcgm-exporter"
```

Step 5. Deploy Prometheus and the GPU metrics exporter.
```shell
kubectl apply -f prometheus-deployment.yaml
kubectl apply -f dcgm-exporter.yaml
```

Add --streaming-connection-idle-timeout=0 to the kubelet configuration file on all compute nodes, then restart the kubelet service on all compute nodes:
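For example, on systemd-based nodes installed via kubeadm packages, extra kubelet flags can often be set through KUBELET_EXTRA_ARGS; the file path below varies by distribution and is an assumption:

```shell
# /etc/sysconfig/kubelet (path varies by distribution -- an assumption)
KUBELET_EXTRA_ARGS=--streaming-connection-idle-timeout=0
```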
```shell
systemctl restart kubelet
```

Use the commands below to check the node ports of the services. These node ports are needed when adding the Kubernetes server into LiCO:
```shell
kubectl get svc ingress-nginx -n ingress-nginx
kubectl get svc prometheus -n monitoring
```