Kubernetes must be configured before LiCO can run AI model training on it. The YAML files used in this guide are located at https://hpc.lenovo.com/lico/downloads/6.2/examples/k8s/; download them to a local directory before running the steps below.
Deploy the NGINX Ingress controller:

```shell
kubectl apply -f https://raw.githubusercontent.com/kubernetes/ingress-nginx/nginx-0.30.0/deploy/static/mandatory.yaml
kubectl apply -f https://raw.githubusercontent.com/kubernetes/ingress-nginx/nginx-0.30.0/deploy/static/provider/baremetal/service-nodeport.yaml
```

Apply the downloaded system-anonymous.yaml and clusterrole.yaml files:

```shell
kubectl apply -f system-anonymous.yaml
kubectl apply -f clusterrole.yaml
```

Step 1. Create the monitoring namespace.
```shell
kubectl create namespace monitoring
```

Step 2. Label all GPU nodes with hardware-type=NVIDIAGPU.
```shell
kubectl label node <gpu node name> hardware-type=NVIDIAGPU
```

For example:
```shell
kubectl label node gpunode1 hardware-type=NVIDIAGPU
kubectl label node gpunode2 hardware-type=NVIDIAGPU
```

Step 3. In the file prometheus-deployment.yaml, add the addresses of all GPU nodes and modify the node port.
```yaml
static_configs:
  - targets:
      - '<gpu node address>:9400'
# ...
nodePort: <port>
```

For example:
```yaml
static_configs:
  - targets:
      - '10.240.212.120:9400'
      - '10.240.212.122:9400'
# ...
nodePort: 31893
```

Step 4. In the file dcgm-exporter.yaml, modify the hostPath of the 'dcgm-exporter-counters' volume. The file counters.csv must be in this path, and every node must be able to access it.
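The counters.csv file follows the dcgm-exporter CSV counter format: one line per counter, giving the DCGM field name, the Prometheus metric type, and a help string. The specific counters below are an illustrative sketch, not the required set for LiCO:

```csv
# Illustrative counters.csv (counter selection is an assumption)
DCGM_FI_DEV_GPU_UTIL, gauge, GPU utilization (in %).
DCGM_FI_DEV_FB_USED,  gauge, Framebuffer memory used (in MiB).
DCGM_FI_DEV_GPU_TEMP, gauge, GPU temperature (in C).
```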
```yaml
volumes:
  - name: "dcgm-exporter-counters"
    hostPath:
      path: "/etc/dcgm-exporter"
```

Step 5. Deploy Prometheus and the GPU metrics exporter.
```shell
kubectl apply -f prometheus-deployment.yaml
kubectl apply -f dcgm-exporter.yaml
```

Add --streaming-connection-idle-timeout=0 to the kubelet configuration file on all compute nodes, then restart the kubelet service on all compute nodes:
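For example, on systemd-based nodes installed via kubeadm packages, extra kubelet flags can often be set through KUBELET_EXTRA_ARGS; the file path below varies by distribution and is an assumption:

```shell
# /etc/sysconfig/kubelet (path varies by distribution -- an assumption)
KUBELET_EXTRA_ARGS=--streaming-connection-idle-timeout=0
```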
```shell
systemctl restart kubelet
```

Use the commands below to check the node ports of the services. These node ports are needed when adding the Kubernetes server into LiCO:
```shell
kubectl get svc ingress-nginx -n ingress-nginx
kubectl get svc prometheus -n monitoring
```