Kubernetes (K8S) must be configured before LiCO can run AI model training on it. The YAML files used in this guide are located at https://hpc.lenovo.com/lico/downloads/6.0/examples/k8s/. Download these files to the local machine before running the steps below.
kubectl apply -f https://raw.githubusercontent.com/kubernetes/ingress-nginx/nginx-0.30.0/deploy/static/mandatory.yaml
kubectl apply -f https://raw.githubusercontent.com/kubernetes/ingress-nginx/nginx-0.30.0/deploy/static/provider/baremetal/service-nodeport.yaml
kubectl apply -f system-anonymous.yaml
kubectl apply -f clusterrole.yaml
Step 1. Create monitoring namespace
kubectl create namespace monitoring
Step 2. Label all GPU nodes with hardware-type=NVIDIAGPU
kubectl label node <gpu node name> hardware-type=NVIDIAGPU
For example:
kubectl label node gpunode1 hardware-type=NVIDIAGPU
kubectl label node gpunode2 hardware-type=NVIDIAGPU
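When there are many GPU nodes, the labeling step can be wrapped in a small loop. The following is only a sketch: the function name label_gpu_nodes and the KUBECTL override variable are hypothetical helpers introduced here, not part of LiCO or kubectl.

```shell
# Hypothetical helper: apply the hardware-type=NVIDIAGPU label to every
# node name passed as an argument.
# KUBECTL is an assumed override variable; set KUBECTL="echo kubectl"
# to preview the commands without touching the cluster.
label_gpu_nodes() {
    for node in "$@"; do
        ${KUBECTL:-kubectl} label node "$node" hardware-type=NVIDIAGPU || return 1
    done
}
```

For the example above, the call would be: label_gpu_nodes gpunode1 gpunode2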
Step 3. In the file prometheus-deployment.yaml, add the addresses of all GPU nodes
static_configs:
- targets:
  - <gpu node address>:9400
For example:
static_configs:
- targets:
  - 10.240.212.120:9400
  - 10.240.212.122:9400
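For a longer list of GPU nodes, the target entries can be generated rather than typed by hand. This is a minimal sketch: gpu_targets is a hypothetical function name, and the indentation it prints is an assumption that may need adjusting to match the surrounding lines in prometheus-deployment.yaml.

```shell
# Hypothetical helper: print a static_configs fragment with one
# <address>:9400 target per GPU node address given as an argument.
# The output is meant to be pasted into the file by hand; the
# indentation is an assumption, not taken from the shipped file.
gpu_targets() {
    printf '%s\n' 'static_configs:' '- targets:'
    for addr in "$@"; do
        printf '%s\n' "  - ${addr}:9400"
    done
}
```

Calling gpu_targets 10.240.212.120 10.240.212.122 prints the same fragment as the example above.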
Step 4. Deploy Prometheus and the GPU metrics exporter
kubectl apply -f prometheus-deployment.yaml
kubectl apply -f pod-gpu-metrics-exporter-daemonset.yaml
By default, the metrics-server service is not exposed outside the cluster, so it must be exposed as a NodePort service. Edit the service and set its type to NodePort:
kubectl edit svc metrics-server -n kube-system
For example:
spec:
  clusterIP: 10.254.24.154
  externalTrafficPolicy: Cluster
  ports:
  - nodePort: 43731
    port: 443
    protocol: TCP
    targetPort: main-port
  selector:
    k8s-app: metrics-server
  sessionAffinity: None
  type: NodePort
status:
  loadBalancer: {}
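As a non-interactive alternative to kubectl edit, the service type can probably be switched with kubectl patch. This is a sketch rather than the guide's documented procedure: expose_as_nodeport and the KUBECTL override are hypothetical names, and Kubernetes assigns the node port automatically unless one is set explicitly.

```shell
# Hypothetical helper: change a service's type to NodePort.
#   $1 = service name, $2 = namespace
# KUBECTL is an assumed override variable for previewing the command.
expose_as_nodeport() {
    ${KUBECTL:-kubectl} patch svc "$1" -n "$2" -p '{"spec":{"type":"NodePort"}}'
}
```

For the service above, the call would be: expose_as_nodeport metrics-server kube-system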
Add --streaming-connection-idle-timeout=0 to the kubelet configuration file on all compute nodes, then restart the kubelet service on each of those nodes.
systemctl restart kubelet
Use the commands below to check the node ports of the services. These node ports are needed when adding the Kubernetes server to LiCO.
kubectl get svc ingress-nginx -n ingress-nginx
kubectl get svc prometheus -n monitoring
kubectl get svc metrics-server -n kube-system
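To read only the node port numbers instead of the full service table, kubectl's jsonpath output can be used. The sketch below assumes the service names and namespaces shown above; svc_node_ports and the KUBECTL override are hypothetical names introduced for illustration.

```shell
# Hypothetical helper: print the nodePort value(s) of a service.
#   $1 = service name, $2 = namespace
# KUBECTL is an assumed override variable for previewing the command.
svc_node_ports() {
    ${KUBECTL:-kubectl} get svc "$1" -n "$2" -o jsonpath='{.spec.ports[*].nodePort}'
}
```

For example: svc_node_ports prometheus monitoring. A service with several ports returns several values, so pick the one that matches the port LiCO should connect to.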