Making use of GPU

To make use of GPUs, you must add (or already have) worker nodes with GPU flavors. Currently this means you can only use On-demand Kubernetes clusters in STO2 that have worker nodes with flavors suffixed with gA2.

Because we use Talos Linux, we cannot make use of the NVIDIA GPU Operator; instead we install the NVIDIA device plugin for Kubernetes.
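
Once a GPU worker node is present and the device plugin is running, the node advertises the nvidia.com/gpu resource in its capacity and allocatable fields. One quick way to confirm this is:

➜ kubectl describe nodes | grep -i nvidia.com/gpu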

Supported NVIDIA Drivers

Currently we only support the Talos NVIDIA OSS drivers.

Validate the NVIDIA Runtime is Available

To replicate the example make surekubeconf-demo is obtained for that specific cluster and active in current shell via KUBECONFIG environment variable or specified via --kubeconfig flag for helm and kubectl command line tools.
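
For example, assuming the kubeconfig file was saved as kubeconf-demo in the current working directory:

➜ export KUBECONFIG=$PWD/kubeconf-demo
➜ kubectl config current-context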

➜ kubectl get runtimeclasses.node.k8s.io -A
NAME     HANDLER   AGE
nvidia   nvidia    5h42m

We can make use of the NVIDIA System Management Interface (nvidia-smi) to list the current GPU capabilities; the nvcr.io/nvidia/cuda:12.9.1-base-ubuntu24.04 image is available through the NVIDIA Container Registry.

➜ kubectl run -n nvidia nvidia-test --restart=Never -ti --rm  --image nvcr.io/nvidia/cuda:12.9.1-base-ubuntu24.04 --overrides '{"spec": {"runtimeClassName": "nvidia"}}' nvidia-smi
Tue Nov 18 16:12:48 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.172.08             Driver Version: 570.172.08     CUDA Version: 12.9     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A2                      On  |   00000000:00:05.0 Off |                    0 |
|  0%   35C    P8              5W /   60W |       0MiB /  15356MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
pod "nvidia-test" deleted

Example GPU pod

➜ cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
  namespace: nvidia
spec:
  restartPolicy: Never
  runtimeClassName: nvidia
  containers:
    - name: cuda-container
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda12.5.0
      resources:
        limits:
          nvidia.com/gpu: 1 # requesting 1 GPU
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
EOF

The result of the above pod would be:

➜ kubectl get pods -n nvidia gpu-pod
NAME      READY   STATUS      RESTARTS   AGE
gpu-pod   0/1     Completed   0          19s

with the output looking like:

➜ kubectl logs -f -n nvidia gpu-pod
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
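
Since the pod does not delete itself after completion, it can be cleaned up with:

➜ kubectl delete pod -n nvidia gpu-pod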

Example vLLM Deployment with DeepSeek

This example demonstrates deploying a vLLM inference server running the DeepSeek-R1-Distill-Qwen-1.5B model.

Prerequisites

Before deploying, you'll need a Hugging Face API token to download the model, which requires a Hugging Face user account. Create a token with at least read permissions; see HF security tokens for more details.

Create a Kubernetes secret with your token:

kubectl create secret generic hf-secret \
  --from-literal=hf_api_token='your_huggingface_token_here'
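
You can confirm the secret exists before deploying:

➜ kubectl get secret hf-secret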

Deploy vLLM with DeepSeek

➜ cat <<EOF | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-deepseek-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: deepseek-server
  template:
    metadata:
      labels:
        app: deepseek-server
    spec:
      runtimeClassName: nvidia
      containers:
        - name: inference-server
          image: docker.io/vllm/vllm-openai:v0.10.0
          resources:
            requests:
              cpu: "2"
              memory: "10Gi"
              ephemeral-storage: "10Gi"
              nvidia.com/gpu: "1"
            limits:
              cpu: "2"
              memory: "10Gi"
              ephemeral-storage: "10Gi"
              nvidia.com/gpu: "1"
          command: ["python3", "-m", "vllm.entrypoints.openai.api_server"]
          args:
            - --model=$(MODEL_ID)
            - --tensor-parallel-size=1
            - --host=0.0.0.0
            - --port=8000
          env:
            - name: LD_LIBRARY_PATH
              value: /usr/local/nvidia/lib64
            - name: MODEL_ID
              value: deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
            - name: HUGGING_FACE_HUB_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-secret
                  key: hf_api_token
          volumeMounts:
            - mountPath: /dev/shm
              name: dshm
      volumes:
        - name: dshm
          emptyDir:
            medium: Memory
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
---
apiVersion: v1
kind: Service
metadata:
  name: llm-service
spec:
  selector:
    app: deepseek-server
  type: ClusterIP
  ports:
    - protocol: TCP
      port: 8000
      targetPort: 8000
EOF

Verify the Deployment

Check that the deployment is running:

➜ kubectl get pods -l app=deepseek-server
NAME                                        READY   STATUS    RESTARTS   AGE
vllm-deepseek-deployment-xxxxxxxxxx-xxxxx   1/1     Running   0          2m

Check the logs to ensure the model is loaded:

➜ kubectl logs -l app=deepseek-server --tail=50
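
The first start can take several minutes while the model weights are downloaded from Hugging Face. One way to wait for the rollout to finish is:

➜ kubectl wait --for=condition=Available deployment/vllm-deepseek-deployment --timeout=15m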

Test the Inference Server

Once the pod is running and the model is loaded, you can test the OpenAI-compatible API:

➜ kubectl run -it --rm curl-test --image=curlimages/curl --restart=Never -- \
  curl -X POST http://llm-service:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
    "prompt": "Explain quantum computing in simple terms:",
    "max_tokens": 100,
    "temperature": 0.7
  }'

Or test with a chat completion:

➜ kubectl run -it --rm curl-test --image=curlimages/curl --restart=Never -- \
  curl -X POST http://llm-service:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ],
    "max_tokens": 50
  }'
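
The server also exposes the OpenAI-compatible /v1/models endpoint, which is a quick way to confirm which model is being served:

➜ kubectl run -it --rm curl-test --image=curlimages/curl --restart=Never -- \
  curl http://llm-service:8000/v1/models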

Access from Outside the Cluster

To access the service from outside the cluster, you can use port-forwarding:

➜ kubectl port-forward service/llm-service 8000:8000

Then test from your local machine:

➜ curl -X POST http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
    "prompt": "Hello, how are you?",
    "max_tokens": 50
  }'