Configure K8S for LiCO

Kubernetes (K8S) must be configured before LiCO can run AI model training on it. The YAML files used in this guide are available at https://hpc.lenovo.com/lico/downloads/6.0/examples/k8s/. Download them to a local directory before performing the steps below.
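A minimal sketch of the download step, assuming curl is installed. The destination directory `./lico-k8s` and the helper names `run` and `fetch_yaml` are illustrative and not part of LiCO. The only file name taken from this guide is Prometheus-deployment.yaml; pass whichever file names the download page actually lists. By default the script only prints the commands so they can be reviewed first.

```shell
# Base URL from this guide; DEST and the helper names are placeholders.
BASE_URL="https://hpc.lenovo.com/lico/downloads/6.0/examples/k8s/"
DEST="${DEST:-./lico-k8s}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = print commands only, 0 = execute them

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then echo "$*"; else "$@"; fi
}

# Download one named YAML file from BASE_URL into DEST.
fetch_yaml() {
    run mkdir -p "$DEST"
    run curl -fsSL -o "$DEST/$1" "$BASE_URL$1"
}

fetch_yaml Prometheus-deployment.yaml
```

Set DRY_RUN=0 to perform the download, and call `fetch_yaml` once per file.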

1. Deploy nginx ingress controller

2. Enable clusterrole system:anonymous

3. Create cluster role clusterrole-for-lico

4. Deploy Prometheus for GPU monitoring

Step 1. Create the monitoring namespace

Step 2. Label all GPU nodes with hardware-type=NVIDIAGPU

For example:
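A minimal sketch of Steps 1 and 2, assuming kubectl is already configured for the cluster and that the namespace is literally named `monitoring`. The node names `gpu-node1` and `gpu-node2` are hypothetical placeholders, and the helper names are illustrative. By default the script only prints the commands.

```shell
DRY_RUN="${DRY_RUN:-1}"   # 1 = print commands only, 0 = execute them

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then echo "$*"; else "$@"; fi
}

# Step 1: create the namespace (name assumed to be "monitoring").
create_monitoring_namespace() {
    run kubectl create namespace monitoring
}

# Step 2: apply the label to every node name passed as an argument.
label_gpu_nodes() {
    for node in "$@"; do
        run kubectl label nodes "$node" hardware-type=NVIDIAGPU
    done
}

create_monitoring_namespace
label_gpu_nodes gpu-node1 gpu-node2   # placeholder node names
```

Replace the placeholder names with the GPU node names reported by `kubectl get nodes`, then rerun with DRY_RUN=0.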

Step 3. In the file Prometheus-deployment.yaml, add the node addresses of all GPU nodes

For example:

Step 4. Deploy Prometheus and the GPU metrics exporter.

5. Configure metrics-server for CPU/memory monitoring.

By default, the metrics-server service is not exposed, so it must be exposed as a NodePort service.

For example:
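One possible way to do this is to patch the existing Service so that its type becomes NodePort. The sketch below assumes the Service is named `metrics-server` and lives in the `kube-system` namespace, which is where the upstream manifests usually place it; adjust both if your deployment differs. The helper names are illustrative, and by default the script only prints the command.

```shell
NAMESPACE="${NAMESPACE:-kube-system}"   # assumed namespace of metrics-server
DRY_RUN="${DRY_RUN:-1}"                 # 1 = print commands only, 0 = execute them

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then echo "$*"; else "$@"; fi
}

# Switch the metrics-server Service to type NodePort.
expose_metrics_server() {
    run kubectl -n "$NAMESPACE" patch svc metrics-server \
        -p '{"spec":{"type":"NodePort"}}'
}

expose_metrics_server
```

With DRY_RUN=0 the patch is applied, and Kubernetes assigns a node port from the cluster's NodePort range.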

6. Disable the streaming connection idle timeout. The default value is 4 hours.

Add --streaming-connection-idle-timeout=0 to the kubelet configuration file on all compute nodes, then restart the kubelet service on each of those nodes.
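A minimal sketch of this edit, assuming the kubelet reads extra flags from a `KUBELET_EXTRA_ARGS=` line in an environment file such as /etc/sysconfig/kubelet (common on kubeadm-based RPM installs; Debian-style installs typically use /etc/default/kubelet). The file location and the helper name `add_kubelet_flag` are assumptions, so check how kubelet is started on your nodes before using it.

```shell
# Assumed location of the kubelet environment file; override if it differs.
KUBELET_ENV_FILE="${KUBELET_ENV_FILE:-/etc/sysconfig/kubelet}"
FLAG="--streaming-connection-idle-timeout=0"

# Append FLAG to KUBELET_EXTRA_ARGS in the given file unless it is already set.
add_kubelet_flag() {
    file="$1"
    # Leave the file untouched if the option is already present.
    if grep -q -- "--streaming-connection-idle-timeout" "$file"; then
        return 0
    fi
    if grep -q '^KUBELET_EXTRA_ARGS=' "$file"; then
        # Keep the existing arguments, drop optional quotes, re-quote with FLAG added.
        sed -E 's|^KUBELET_EXTRA_ARGS="?([^"]*)"?$|KUBELET_EXTRA_ARGS="\1 '"$FLAG"'"|' \
            "$file" > "$file.tmp" && mv "$file.tmp" "$file"
    else
        echo "KUBELET_EXTRA_ARGS=\"$FLAG\"" >> "$file"
    fi
}

# Run as root on every compute node:
#   add_kubelet_flag "$KUBELET_ENV_FILE"
#   systemctl restart kubelet
```

The function does nothing on a second run, so it can be applied repeatedly across nodes.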

7. Check the exported NodePort of the services

Use the command below to check the node ports of the services. These node ports are needed when adding the Kubernetes server to LiCO.
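As a sketch, the node ports can be listed with `kubectl get svc --all-namespaces` and narrowed to Services of type NodePort. The helper name `filter_nodeport` is illustrative; it relies on TYPE being the third column, which holds when `--all-namespaces` is used.

```shell
# Keep the header row and every Service whose TYPE column is NodePort.
filter_nodeport() {
    awk 'NR == 1 || $3 == "NodePort"'
}

# List the Services only when kubectl is available on this machine.
if command -v kubectl >/dev/null 2>&1; then
    kubectl get svc --all-namespaces | filter_nodeport
fi
```

In the PORT(S) column, the number after the colon (the second number in a `port:nodePort/protocol` entry) is the node port.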