Kubernetes (K8S) must be configured before LiCO can run AI model training on it. The YAML files used in this guide are located at https://hpc.lenovo.com/lico/downloads/6.2/examples/k8s/. Download these files to a local directory before performing the steps below.
kubectl apply -f https://raw.githubusercontent.com/kubernetes/ingress-nginx/nginx-0.30.0/deploy/static/mandatory.yaml
kubectl apply -f https://raw.githubusercontent.com/kubernetes/ingress-nginx/nginx-0.30.0/deploy/static/provider/baremetal/service-nodeport.yaml
kubectl apply -f system-anonymous.yaml
kubectl apply -f clusterrole.yaml
Step 1. Create monitoring namespace
kubectl create namespace monitoring
Step 2. Label all GPU nodes with hardware-type=NVIDIAGPU
kubectl label node <gpu node name> hardware-type=NVIDIAGPU
For example:
kubectl label node gpunode1 hardware-type=NVIDIAGPU
kubectl label node gpunode2 hardware-type=NVIDIAGPU
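When the cluster has many GPU nodes, the label commands can be generated instead of typed one by one. The following is a minimal sketch: the function name gen_label_cmds is hypothetical, and gpunode1 and gpunode2 are the example node names from this step. It only prints the commands so that they can be reviewed first.

```shell
# Print one "kubectl label" command per GPU node name passed as an argument.
# Nothing is executed here; the output is meant to be reviewed before use.
gen_label_cmds() {
    for node in "$@"; do
        printf 'kubectl label node %s hardware-type=NVIDIAGPU\n' "$node"
    done
}

# Preview the commands for the two example nodes.
gen_label_cmds gpunode1 gpunode2
```

Once the printed commands look correct, they can be applied by piping the output to sh, for example: gen_label_cmds gpunode1 gpunode2 | sh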
Step 3. In the file prometheus-deployment.yaml, add the address of every GPU node and set the node port
static_configs:
- targets:
  - <gpu node address>:9400
nodePort: <port>
For example:
static_configs:
- targets:
  - 10.240.212.120:9400
  - 10.240.212.122:9400
nodePort: 31893
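The targets list and the node port value can be prepared and checked before editing the file. The sketch below is illustrative only: gen_targets and valid_node_port are hypothetical helper names, and the range 30000-32767 is the Kubernetes default NodePort range, which is assumed not to have been changed on this cluster.

```shell
# Print a static_configs fragment with one target per GPU node address.
# 9400 is the exporter port used in this guide.
gen_targets() {
    printf 'static_configs:\n- targets:\n'
    for addr in "$@"; do
        printf '  - %s:9400\n' "$addr"
    done
}

# Succeed only if the argument is an integer inside 30000-32767,
# the default Kubernetes NodePort range (assumed unchanged here).
valid_node_port() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;
    esac
    [ "$1" -ge 30000 ] && [ "$1" -le 32767 ]
}

# Reproduce the example values from this step.
gen_targets 10.240.212.120 10.240.212.122
valid_node_port 31893 && echo "nodePort: 31893"
```

The printed fragment still has to be copied into the correct places in the file by hand, with the indentation adjusted to match the surrounding YAML.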
Step 4. In the file dcgm-exporter.yaml, modify the path of the volume 'dcgm-exporter-counters'. The file counters.csv must be in this path, and every node must be able to access it.
volumes:
- name: "dcgm-exporter-counters"
  hostPath:
    path: "/etc/dcgm-exporter"
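Because counters.csv has to be reachable on every node, a quick check on each node can catch a missing file before the exporter is deployed. This is a minimal sketch with a hypothetical helper name, has_counters_file. It uses the example path /etc/dcgm-exporter from this step and should be run on each node (or with the path adjusted if a different directory is configured).

```shell
# Succeed only if counters.csv is a readable regular file in the given directory.
has_counters_file() {
    [ -f "$1/counters.csv" ] && [ -r "$1/counters.csv" ]
}

# Check the example hostPath from this step on the local node.
if has_counters_file /etc/dcgm-exporter; then
    echo "counters.csv found"
else
    echo "counters.csv missing in /etc/dcgm-exporter"
fi
```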
Step 5. Deploy Prometheus and the GPU metrics exporter.
kubectl apply -f prometheus-deployment.yaml
kubectl apply -f dcgm-exporter.yaml
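After the two files are applied, it can be useful to confirm that the pods were actually scheduled. The sketch below is not part of the original procedure: show_monitoring_pods is a hypothetical helper name, and it assumes that both YAML files create their pods in the monitoring namespace created in Step 1.

```shell
# List the pods in the monitoring namespace together with the node each runs on.
show_monitoring_pods() {
    kubectl get pods -n monitoring -o wide
}

# Run the check only where kubectl is installed; report a failed query
# instead of aborting.
if command -v kubectl >/dev/null 2>&1; then
    show_monitoring_pods || echo "could not query the cluster" >&2
fi
```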
Add --streaming-connection-idle-timeout=0 to the kubelet configuration file on all compute nodes, then restart the kubelet service on all compute nodes.
systemctl restart kubelet
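Before restarting the service, the presence of the flag can be confirmed on each compute node. This is a minimal sketch: has_idle_timeout_flag is a hypothetical helper name, and KUBELET_CONF is a placeholder variable that must be set to the kubelet configuration file actually used on the node, since its location depends on how Kubernetes was installed.

```shell
# Succeed only if the given file contains the flag from this step.
# -F matches the text literally; "--" keeps grep from parsing it as an option.
has_idle_timeout_flag() {
    grep -qF -- '--streaming-connection-idle-timeout=0' "$1"
}

# KUBELET_CONF is a placeholder: set it to the node's kubelet configuration file.
if [ -n "${KUBELET_CONF:-}" ]; then
    if has_idle_timeout_flag "$KUBELET_CONF"; then
        echo "flag present in $KUBELET_CONF"
    else
        echo "flag missing in $KUBELET_CONF"
    fi
fi
```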
Use the following commands to check the node ports of the services. These node ports are required when adding the Kubernetes server to LiCO.
kubectl get svc ingress-nginx -n ingress-nginx
kubectl get svc prometheus -n monitoring
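If only the port numbers are needed, they can be extracted with a jsonpath output expression instead of being read from the table. The following is a sketch under that assumption: svc_node_ports is a hypothetical helper name, and the service names and namespaces are the ones used in the commands above.

```shell
# Print the node port(s) of a service, separated by spaces.
# $1 = service name, $2 = namespace.
svc_node_ports() {
    kubectl get svc "$1" -n "$2" -o jsonpath='{.spec.ports[*].nodePort}'
}

# Query both services only where kubectl is installed.
if command -v kubectl >/dev/null 2>&1; then
    echo "ingress-nginx: $(svc_node_ports ingress-nginx ingress-nginx)"
    echo "prometheus:    $(svc_node_ports prometheus monitoring)"
fi
```

A service that exposes more than one port, such as one serving both HTTP and HTTPS, prints one node port per exposed port.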