Configure Accurate Scheduling of Inference Services Based on the CUDA Version

Introduction

In a Kubernetes cluster, inconsistencies in GPU models and CUDA driver versions across different GPU nodes lead to the following issues:

  1. Version mismatch: The CUDA runtime version used by applications may be incompatible with the CUDA driver version on certain nodes, resulting in failures or performance problems.

  2. Scheduling challenges: The native Kubernetes scheduler is unaware of CUDA version dependencies and cannot guarantee that applications are scheduled onto GPU nodes with compatible versions.

  3. High maintenance overhead: Manually managing the CUDA version dependencies between nodes and applications increases operational complexity.

This document provides a step-by-step guide to configuring accurate scheduling of inference services based on the CUDA runtime version and the NVIDIA driver version. With these settings, you can resolve CUDA runtime and CUDA driver version mismatches at the Kubernetes scheduling level and ensure that applications are scheduled onto compatible GPU nodes.

Steps

Adding the CUDA version to node labels

  1. On each GPU node, run the following command to retrieve the maximum CUDA runtime version supported by the installed driver:

    nvidia-smi | sed -n 's/.*CUDA Version: \([0-9.]\+\).*/\1/p'

    For example, the output might be 12.4.

  2. On the control node, label the GPU node with the corresponding major and minor versions (both steps are combined into a single script in the sketch after this list):

    kubectl label node <node-name> \
    nvidia.com/cuda.runtime.major=12 \
    nvidia.com/cuda.runtime.minor=4
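
If you prefer to script both steps, below is a minimal sketch intended to be run on each GPU node. It assumes that kubectl is configured on the node and that the node's Kubernetes name matches the output of hostname; adjust these assumptions to your environment.

    # Extract the CUDA runtime version reported by nvidia-smi (for example, 12.4)
    CUDA_VERSION=$(nvidia-smi | sed -n 's/.*CUDA Version: \([0-9.]\+\).*/\1/p')
    # Split the version into its major (12) and minor (4) parts
    CUDA_MAJOR=${CUDA_VERSION%%.*}
    CUDA_MINOR=${CUDA_VERSION#*.}
    # Label this node; "$(hostname)" assumes the node name equals the hostname
    kubectl label node "$(hostname)" \
      nvidia.com/cuda.runtime.major="${CUDA_MAJOR}" \
      nvidia.com/cuda.runtime.minor="${CUDA_MINOR}" \
      --overwrite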
TIP

If your cluster has many GPU nodes, labeling them manually is tedious and error-prone. Instead, you can deploy the Node Feature Discovery (NFD) cluster plugin and enable the GPU Feature Discovery (GFD) extension; GPU nodes will then be labeled with their CUDA versions automatically. The Node Feature Discovery cluster plugin can be retrieved from the Customer Portal. Please contact Customer Support for more information.
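
Whichever approach you use, you can verify that the labels are present with the -L flag of kubectl get nodes, which prints each label as a column:

    kubectl get nodes \
      -L nvidia.com/cuda.runtime.major \
      -L nvidia.com/cuda.runtime.minor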

Configure Accurate Scheduling of Inference Services Based on the CUDA Version

Starting from Alauda AI 1.5, the product automatically schedules inference service pods based on the CUDA version. For earlier versions, follow these steps:

  1. Determine which ClusterServingRuntime you need to select when creating an inference service.
  2. Parse the ClusterServingRuntime labels: if cpaas.io/accelerator-type is nvidia, also parse cpaas.io/cuda-version (for example, 11.8); a lookup sketch follows the example in step 3.
  3. Add a nodeAffinity field to the inference service, for example:
    apiVersion: serving.kserve.io/v1beta1
    kind: InferenceService
    spec:
      predictor:
        affinity:
          nodeAffinity:
            preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              preference:
                matchExpressions:
                - key: nvidia.com/cuda.runtime.major
                  operator: In
                  values: ["11"]
                - key: nvidia.com/cuda.runtime.minor
                  operator: Gt
                  values: ["7"] # Since the k8s operator only supports Gt, which means greater than but not equal to, we use the rt version minus one to meet the requirements.