Kubernetes (K8S) must be configured before LiCO can run AI model training on it. The YAML files used in this guide are available from https://hpc.lenovo.com/lico/downloads/6.0/examples/k8s/ and must be downloaded to the local machine before you run the steps below.
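The download can be scripted. The following is a minimal sketch, assuming that the four YAML files named later in this guide sit directly under the URL above and that curl is installed; the function name download_lico_yaml and the CURL override variable are hypothetical and only serve the illustration.

```shell
# Base URL taken from this guide.
BASE_URL="https://hpc.lenovo.com/lico/downloads/6.0/examples/k8s"

# Hypothetical helper: fetch each YAML file used in this guide into the
# current directory. Assumes the files are served directly under BASE_URL.
# Set CURL=echo to print the commands instead of running them.
download_lico_yaml() {
    for f in system-anonymous.yaml clusterrole.yaml \
             prometheus-deployment.yaml pod-gpu-metrics-exporter-daemonset.yaml; do
        "${CURL:-curl}" -fsSLO "${BASE_URL}/${f}" || return 1
    done
}

# Usage: download_lico_yaml
```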
kubectl apply -f https://raw.githubusercontent.com/kubernetes/ingress-nginx/nginx-0.30.0/deploy/static/mandatory.yaml
kubectl apply -f https://raw.githubusercontent.com/kubernetes/ingress-nginx/nginx-0.30.0/deploy/static/provider/baremetal/service-nodeport.yaml
kubectl apply -f system-anonymous.yaml
kubectl apply -f clusterrole.yaml
Step 1. Create monitoring namespace
kubectl create namespace monitoring
Step 2. Label all GPU nodes with hardware-type=NVIDIAGPU
kubectl label node <gpu node name> hardware-type=NVIDIAGPU
For example:
kubectl label node gpunode1 hardware-type=NVIDIAGPU
kubectl label node gpunode2 hardware-type=NVIDIAGPU
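When a cluster has many GPU nodes, the labeling can be done in a loop. The sketch below assumes the node names are passed as arguments; the function name label_gpu_nodes and the KUBECTL override variable are hypothetical.

```shell
# Hypothetical helper: apply the hardware-type=NVIDIAGPU label to every
# node name passed as an argument, stopping at the first failure.
# Set KUBECTL=echo to print the commands instead of running them.
label_gpu_nodes() {
    for node in "$@"; do
        "${KUBECTL:-kubectl}" label node "$node" hardware-type=NVIDIAGPU || return 1
    done
}

# Usage: label_gpu_nodes gpunode1 gpunode2
```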
Step 3. In the file prometheus-deployment.yaml, add the addresses of all GPU nodes
static_configs:
- targets:
  - <gpu node address>:9400
For example:
static_configs:
- targets:
  - 10.240.212.120:9400
  - 10.240.212.122:9400
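For a long node list, the targets fragment can be generated rather than typed by hand. The sketch below prints a fragment for the addresses given as arguments, using port 9400 as in the example above; the function name print_gpu_targets is hypothetical, and the indentation of the output may need to be adjusted to match the surrounding lines in prometheus-deployment.yaml.

```shell
# Hypothetical helper: print a static_configs fragment that lists each
# address passed as an argument with the exporter port 9400.
print_gpu_targets() {
    printf '%s\n' 'static_configs:'
    printf '%s\n' '- targets:'
    for addr in "$@"; do
        printf '%s\n' "  - ${addr}:9400"
    done
}

# Usage: print_gpu_targets 10.240.212.120 10.240.212.122
```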
Step 4. Deploy Prometheus and the GPU metrics exporter.
kubectl apply -f prometheus-deployment.yaml
kubectl apply -f pod-gpu-metrics-exporter-daemonset.yaml
By default, the metrics-server service is not exposed, so it must be exposed as a NodePort service. Edit the service and set its type to NodePort:
kubectl edit svc metrics-server -n kube-system
For example:
spec:
  clusterIP: 10.254.24.154
  externalTrafficPolicy: Cluster
  ports:
  - nodePort: 43731
    port: 443
    protocol: TCP
    targetPort: main-port
  selector:
    k8s-app: metrics-server
  sessionAffinity: None
  type: NodePort
status:
  loadBalancer: {}
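As a non-interactive alternative to editing the service by hand, the service type can be changed with kubectl patch. This is only a sketch: it changes the type and leaves the node port to be assigned by Kubernetes, and the function name expose_metrics_server and the KUBECTL override variable are hypothetical.

```shell
# Hypothetical helper: switch the metrics-server service to type NodePort
# without opening an editor. Kubernetes assigns the node port itself.
# Set KUBECTL=echo to print the command instead of running it.
expose_metrics_server() {
    "${KUBECTL:-kubectl}" patch svc metrics-server -n kube-system \
        -p '{"spec":{"type":"NodePort"}}'
}

# Usage: expose_metrics_server
```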
Add --streaming-connection-idle-timeout=0 to the kubelet configuration file on all compute nodes, then restart the kubelet service on each of them.
systemctl restart kubelet
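Before restarting, it may help to confirm that the flag is actually present on each node. The sketch below assumes you pass the path of the kubelet configuration file yourself, because its location depends on how kubelet was installed; the function name has_idle_timeout_flag is hypothetical.

```shell
# Hypothetical check: succeed (exit status 0) if the file given as the
# first argument contains the flag --streaming-connection-idle-timeout=0.
has_idle_timeout_flag() {
    grep -q -e '--streaming-connection-idle-timeout=0' "$1"
}

# Usage: has_idle_timeout_flag <path to kubelet configuration file> && systemctl restart kubelet
```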
Use the following commands to check the node ports of the services. These node ports are required when adding the Kubernetes server to LiCO.
kubectl get svc ingress-nginx -n ingress-nginx
kubectl get svc prometheus -n monitoring
kubectl get svc metrics-server -n kube-system
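To print only the node port numbers instead of the full service table, kubectl's jsonpath output can be used. This is a minimal sketch; the function name get_node_port and the KUBECTL override variable are hypothetical.

```shell
# Hypothetical helper: print the node port(s) of a service.
# $1 = service name, $2 = namespace.
# Set KUBECTL=echo to print the command instead of running it.
get_node_port() {
    "${KUBECTL:-kubectl}" get svc "$1" -n "$2" -o jsonpath='{.spec.ports[*].nodePort}'
}

# Usage:
#   get_node_port ingress-nginx ingress-nginx
#   get_node_port prometheus monitoring
#   get_node_port metrics-server kube-system
```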