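Deploying an AI inference service on a single machine is fine at low volume, but once concurrency rises, problems such as a single point of failure, no horizontal scaling, and downtime on every release all surface at once. Kubernetes is the standard solution to these problems.

Last year we migrated our inference service from a single machine to K8s and hit quite a few pitfalls along the way. This post lays out the full deployment process and the pitfalls we ran into.

Prerequisites

The K8s cluster needs GPU nodes with the NVIDIA device plugin installed: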

kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.16.2/deployments/static/nvidia-device-plugin.yml

Verify that the GPUs are available:

kubectl get nodes -o json | jq '.items[].status.allocatable["nvidia.com/gpu"]'
# Should return a number such as "1" or "4" for each GPU node (nodes without GPUs print null)

Deployment definition

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-inference
  labels:
    app: vllm-inference
spec:
  replicas: 2
  selector:
    matchLabels:
      app: vllm-inference
  template:
    metadata:
      labels:
        app: vllm-inference
    spec:
      containers:
      - name: vllm
        image: your-registry/vllm-server:latest
        env:
        - name: MODEL_NAME
          value: "Qwen/Qwen2.5-7B-Instruct"
        - name: MAX_MODEL_LEN
          value: "8192"
        ports:
        - containerPort: 8000
          name: http
        resources:
          limits:
            nvidia.com/gpu: 1
            memory: "32Gi"
          requests:
            nvidia.com/gpu: 1
            memory: "16Gi"
        readinessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 300
          periodSeconds: 10
        volumeMounts:
        - name: model-cache
          mountPath: /root/.cache/huggingface
      volumes:
      - name: model-cache
        persistentVolumeClaim:
          claimName: model-cache-pvc

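A few key points. Declaring nvidia.com/gpu: 1 under resources makes the K8s scheduler place the Pod on a node that has a free GPU. The readinessProbe's initialDelaySeconds of 300 leaves enough time for model loading: the probe does not run during that window, so the Pod is not marked Ready and receives no traffic. The PVC mounts the model cache directory, so a rebuilt Pod does not have to download the model again.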
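A fixed initialDelaySeconds makes every Pod wait the full 300 seconds even when the model loads faster. One alternative is a startupProbe, which holds back the readiness probe until it succeeds. The fragment below is a sketch of the probe section for the vllm container, not part of our original manifest; the 600-second budget (periodSeconds x failureThreshold) is an assumed value to tune against your own load times.

```yaml
# Goes under the "vllm" container entry, replacing the readinessProbe above.
startupProbe:
  httpGet:
    path: /health
    port: 8000
  periodSeconds: 10
  failureThreshold: 60      # assumed budget: up to 10 s x 60 = 600 s for model loading
readinessProbe:             # only starts running after the startupProbe has succeeded
  httpGet:
    path: /health
    port: 8000
  periodSeconds: 10
```

If the startupProbe exhausts its failureThreshold, the kubelet restarts the container, so the budget should comfortably exceed the slowest expected load.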

Service and Ingress

apiVersion: v1
kind: Service
metadata:
  name: vllm-service
spec:
  selector:
    app: vllm-inference
  ports:
  - port: 80
    targetPort: 8000
  type: ClusterIP
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: vllm-ingress
  annotations:
    nginx.ingress.kubernetes.io/proxy-read-timeout: "300"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "300"
spec:
  rules:
  - host: llm.internal.company.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: vllm-service
            port:
              number: 80

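proxy-read-timeout and proxy-send-timeout must be raised: a large-model inference response can take longer than 60 seconds, and with the ingress-nginx default of 60 seconds those requests would fail.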

Autoscaling

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-inference
  minReplicas: 1
  maxReplicas: 4
  metrics:
  - type: Resource
    resource:
      name: nvidia.com/gpu
      target:
        type: Utilization
        averageUtilization: 70

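Note: the manifest above shows the intent but will not scale as written. The HPA's Resource metric type is backed by the resource metrics API, which only reports cpu and memory, so there is no utilization data behind nvidia.com/gpu. GPU-based autoscaling needs custom metrics. We recommend using Prometheus Adapter to feed the gpu_cache_usage_perc metric exposed by vLLM into the HPA.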
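A sketch of what the custom-metric variant could look like, assuming Prometheus Adapter is already installed and configured to expose the vLLM gauge through the custom metrics API. The metric name vllm_gpu_cache_usage_perc is hypothetical: it depends entirely on the naming rule in your adapter config. The target also assumes the gauge is a fraction between 0 and 1, so 0.7 is written as 700m.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-inference
  minReplicas: 1
  maxReplicas: 4
  metrics:
  - type: Pods                  # per-Pod custom metric, averaged across replicas
    pods:
      metric:
        name: vllm_gpu_cache_usage_perc   # hypothetical name, set by the adapter rule
      target:
        type: AverageValue
        averageValue: "700m"    # 0.7, assuming the gauge ranges from 0 to 1
```

Whether the adapter is serving the metric can be checked with kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1 before applying the HPA.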

Model cache PVC

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-cache-pvc
spec:
  accessModes:
    - ReadOnlyMany    # shared read-only by multiple Pods
  storageClassName: nfs
  resources:
    requests:
      storage: 100Gi

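ReadOnlyMany is the key: multiple inference Pods share one model cache instead of each downloading its own copy. Use NFS or Ceph to provide the shared storage. Because the inference Pods only read from this volume, the model has to be placed on it beforehand; if the Pods are expected to download the model themselves on first start, the claim needs ReadWriteMany instead.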

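Monitoring

Prometheus scrapes vLLM's built-in metrics and Grafana handles visualization. The endpoint is /metrics on container port 8000 of each Pod. Note that the Service listens on port 80, so through it the URL is http://vllm-service/metrics rather than port 8000, and a request through the Service only reaches one replica at a time; scrape the Pods directly to get per-Pod data. Key metrics: request latency at P50/P95/P99, QPS, GPU memory utilization, KV cache usage, and request queue length.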
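For per-Pod scraping, a minimal sketch of a Prometheus scrape job follows. It assumes a plain Prometheus configuration file (not the Prometheus Operator) and that Prometheus already has RBAC permission to list Pods; the job name vllm is an arbitrary choice.

```yaml
scrape_configs:
  - job_name: vllm              # arbitrary job name
    metrics_path: /metrics
    kubernetes_sd_configs:
      - role: pod               # discover every Pod, then filter below
    relabel_configs:
      # Keep only Pods labeled app=vllm-inference (the Deployment's Pod label)
      - source_labels: [__meta_kubernetes_pod_label_app]
        regex: vllm-inference
        action: keep
      # Keep only the container port named "http" (8000 in the Deployment)
      - source_labels: [__meta_kubernetes_pod_container_port_name]
        regex: http
        action: keep
      # Attach namespace and pod labels so each series maps back to a Pod
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
```

The namespace and pod labels matter beyond dashboards: Prometheus Adapter typically relies on them to associate a series with a Pod when serving custom metrics.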

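Pitfalls

Pod eviction: K8s evicts Pods when a node runs short of resources. An evicted inference Pod has to reload the model after it is rescheduled, so recovery takes a long time. A PodDisruptionBudget keeps at least one Pod running during voluntary disruptions such as node drains, but the kubelet does not honor it for node-pressure eviction. To lower that risk, size the memory request to cover real usage, since Pods that use more than they requested are evicted first.

Image pulls: the vLLM image plus the CUDA base layers comes to roughly 10 GB or more, so the first pull is slow. We recommend pre-pulling it onto every GPU node.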
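For the PodDisruptionBudget mentioned above, a minimal sketch could look like this. The name vllm-pdb is an assumption; the selector reuses the Pod label from the Deployment.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: vllm-pdb                # hypothetical name
spec:
  minAvailable: 1               # keep at least one inference Pod available
  selector:
    matchLabels:
      app: vllm-inference       # same label as the Deployment's Pod template
```

One interaction to keep in mind: with minReplicas: 1 in the HPA, the service can scale down to a single Pod, and at that point this budget blocks a node drain until another replica becomes available.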

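Final thoughts

Deploying an AI inference service on K8s is not simple, but the payoff is substantial: self-healing, rolling updates, and elastic scaling. For a production service that has to run 24/7, it is a step you cannot skip.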