Configure K8S for LiCO

Kubernetes (K8S) must be configured before LiCO can run AI model training on it. The YAML files used in this guide are located at https://hpc.lenovo.com/lico/downloads/6.2/examples/k8s/. Download these files to the local machine before performing the steps below.
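As a rough sketch of the download step, the helper below fetches named files from the directory given above with curl. It assumes each file is served directly under that directory by the name this guide uses (only Prometheus-deployment.yaml and dcgm-exporter.yaml are named here). The function names yaml_url and fetch_yaml are hypothetical.

```shell
#!/bin/sh
# Directory that holds the example YAML files (URL taken from this guide).
BASE_URL="https://hpc.lenovo.com/lico/downloads/6.2/examples/k8s"

# yaml_url NAME: print the assumed download URL for one file.
yaml_url() {
    printf '%s/%s\n' "$BASE_URL" "$1"
}

# fetch_yaml NAME...: download each named file into the current directory.
# Assumption: the files sit directly under BASE_URL; adjust if the layout differs.
fetch_yaml() {
    for name in "$@"; do
        curl -fsSL -o "$name" "$(yaml_url "$name")" || return 1
    done
}

# Usage (hypothetical):
#   fetch_yaml Prometheus-deployment.yaml dcgm-exporter.yaml
```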

1. Deploy nginx ingress controller

2. Enable clusterrole system:anonymous

3. Create cluster role clusterrole-for-lico

4. Deploy Prometheus for GPU monitoring

Step 1. Create monitoring namespace

Step 2. Label all GPU nodes with hardware-type=NVIDIAGPU

For example:
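A minimal sketch of the labeling step using kubectl label. The node names gpu-node1 and gpu-node2 are placeholders, and the helper name label_gpu_nodes is hypothetical. The KUBECTL variable exists only so the commands can be previewed before running them.

```shell
#!/bin/sh
# KUBECTL can be overridden, e.g. KUBECTL="echo kubectl" for a dry run.
KUBECTL="${KUBECTL:-kubectl}"

# label_gpu_nodes NODE...: apply the hardware-type label to each named node.
label_gpu_nodes() {
    for node in "$@"; do
        $KUBECTL label nodes "$node" hardware-type=NVIDIAGPU || return 1
    done
}

# Usage (node names are placeholders):
#   label_gpu_nodes gpu-node1 gpu-node2
```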

Step 3. In the file Prometheus-deployment.yaml, add the node addresses of all GPU nodes and modify the node port.

For example:
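The entries to add depend on the layout of Prometheus-deployment.yaml, which is not shown here. To gather the addresses themselves, one option is to list the nodes that carry the label from Step 2. The sketch below assumes that label has already been applied and does not edit the file. The helper name list_gpu_nodes is hypothetical.

```shell
#!/bin/sh
# KUBECTL can be overridden, e.g. KUBECTL="echo kubectl" for a dry run.
KUBECTL="${KUBECTL:-kubectl}"

# list_gpu_nodes: show the labeled GPU nodes; the wide output format
# includes each node's internal IP address.
list_gpu_nodes() {
    $KUBECTL get nodes -l hardware-type=NVIDIAGPU -o wide
}

# Usage:
#   list_gpu_nodes
```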

Step 4. In the file dcgm-exporter.yaml, modify the path of the volume named 'dcgm-exporter-counters'. The file counters.csv must be in this path, and every node must be able to access it.

Step 5. Deploy Prometheus and the GPU metrics exporter.
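Steps 1 and 5 might look like the following sketch. It assumes the namespace is literally named monitoring, that the two files named in Steps 3 and 4 are the ones to deploy, and that they are in the current directory after editing. The helper names are hypothetical.

```shell
#!/bin/sh
# KUBECTL can be overridden, e.g. KUBECTL="echo kubectl" for a dry run.
KUBECTL="${KUBECTL:-kubectl}"

# Step 1: create the namespace (name assumed to be "monitoring").
create_monitoring_namespace() {
    $KUBECTL create namespace monitoring
}

# Step 5: submit the two edited manifests.
# Assumption: these two files are the complete set for this step.
deploy_gpu_monitoring() {
    $KUBECTL apply -f Prometheus-deployment.yaml -f dcgm-exporter.yaml
}

# Usage:
#   create_monitoring_namespace   # before Steps 2 to 4
#   deploy_gpu_monitoring         # after the files have been edited
```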

5. Disable the streaming connection idle timeout. The default value is 4 hours.

Add --streaming-connection-idle-timeout=0 to the kubelet configuration file on all compute nodes, then restart the kubelet service on all compute nodes.
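The location of the kubelet configuration file varies by installation, so the sketch below only verifies that a given file contains the flag. The path is passed in by the caller, and the helper name check_kubelet_flag is hypothetical. The restart command in the comment assumes a systemd-managed kubelet.

```shell
#!/bin/sh
# check_kubelet_flag FILE: report whether FILE contains the flag.
# FILE is the kubelet configuration file on this node (path is site-specific).
check_kubelet_flag() {
    if grep -qF -e '--streaming-connection-idle-timeout=0' "$1"; then
        echo "ok: $1"
    else
        echo "missing: $1"
        return 1
    fi
}

# Usage on each compute node (path is a placeholder):
#   check_kubelet_flag /path/to/kubelet/config
# After adding the flag, restart the service (assumes systemd):
#   systemctl restart kubelet
```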

6. Check the exported NodePort of the services

Use the command below to check the node ports of the services. These node ports are needed when adding the Kubernetes server to LiCO.
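One way to do this, offered as a sketch rather than the guide's exact command, is to list services in all namespaces with kubectl get svc and keep only those of type NodePort. The helper name extract_node_ports is hypothetical. It assumes the default column layout of kubectl get svc --all-namespaces (namespace, name, type, cluster IP, external IP, ports, age).

```shell
#!/bin/sh
# extract_node_ports: read "kubectl get svc --all-namespaces --no-headers"
# output on stdin and print "namespace name nodeport" for NodePort services.
extract_node_ports() {
    awk '$3 == "NodePort" {
        # The ports column looks like 80:30080/TCP,443:30443/TCP
        n = split($6, ports, ",")
        for (i = 1; i <= n; i++) {
            split(ports[i], part, "[:/]")
            print $1, $2, part[2]
        }
    }'
}

# Usage:
#   kubectl get svc --all-namespaces --no-headers | extract_node_ports
```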