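Deploying an AI inference service on a single machine is fine at low volume, but once concurrency rises the problems arrive together: a single point of failure, no horizontal scaling, and downtime on every release. Kubernetes is the standard answer to these problems.
Last year we migrated our inference service from a single machine to K8s and hit plenty of pitfalls along the way. This post lays out the complete deployment process and the pitfalls we ran into.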
Prerequisites
The K8s cluster needs GPU nodes with the NVIDIA device plugin installed:
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.16.2/deployments/static/nvidia-device-plugin.yml
Verify that the GPUs are available:
kubectl get nodes -o json | jq '.items[].status.allocatable["nvidia.com/gpu"]'
# Should return a number such as "1" or "4" (null for nodes without a GPU)
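Beyond checking the allocatable count, it can help to confirm that a container actually sees a GPU. The manifest below is a minimal smoke-test sketch, not part of the original setup: the Pod name `gpu-smoke-test` and the image `your-registry/cuda-base:latest` are hypothetical placeholders, and it assumes the image is a CUDA base image into which the NVIDIA runtime injects `nvidia-smi`.

```yaml
# Sketch of a one-off GPU smoke test (names and image are placeholders).
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never            # run once, do not restart on exit
  containers:
    - name: cuda
      image: your-registry/cuda-base:latest   # hypothetical CUDA base image
      command: ["nvidia-smi"]     # prints the GPUs visible to the container
      resources:
        limits:
          nvidia.com/gpu: 1       # request one GPU from the device plugin
```

If the Pod completes, `kubectl logs gpu-smoke-test` should show the GPU table; a Pod stuck in Pending usually means no node has a free `nvidia.com/gpu`.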
Deployment definition
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-inference
  labels:
    app: vllm-inference
spec:
  replicas: 2
  selector:
    matchLabels:
      app: vllm-inference
  template:
    metadata:
      labels:
        app: vllm-inference
    spec:
      containers:
        - name: vllm
          image: your-registry/vllm-server:latest
          env:
            - name: MODEL_NAME
              value: "Qwen/Qwen2.5-7B-Instruct"
            - name: MAX_MODEL_LEN
              value: "8192"
          ports:
            - containerPort: 8000
              name: http
          resources:
            limits:
              nvidia.com/gpu: 1
              memory: "32Gi"
            requests:
              nvidia.com/gpu: 1
              memory: "16Gi"
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 300
            periodSeconds: 10
          volumeMounts:
            - name: model-cache
              mountPath: /root/.cache/huggingface
      volumes:
        - name: model-cache
          persistentVolumeClaim:
            claimName: model-cache-pvc
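A few key points. Declaring nvidia.com/gpu: 1 under resources makes the K8s scheduler place the Pod on a node that has an allocatable GPU. The readinessProbe sets initialDelaySeconds to 300 to leave enough time for model loading: probing does not start during that window, so the Pod is not marked Ready and receives no traffic. The PVC is mounted at the model cache directory, so a recreated Pod does not have to download the model again.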
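A fixed initialDelaySeconds makes every Pod wait the full 300 seconds even when the model loads faster. One commonly used alternative is a startupProbe, which holds back the other probes until it succeeds once. The fragment below is a sketch of that variant, not the configuration we ran; it assumes the same /health endpoint on port 8000, and the threshold values are illustrative.

```yaml
# Container-level fragment (goes under spec.template.spec.containers[0]).
startupProbe:
  httpGet:
    path: /health
    port: 8000
  periodSeconds: 10
  failureThreshold: 60      # allow up to 60 x 10s = 600s for model loading
readinessProbe:
  httpGet:
    path: /health
    port: 8000
  periodSeconds: 10         # starts only after the startupProbe has succeeded
```

With this shape the Pod becomes Ready as soon as the model has loaded, and a container that never comes up within the startup window is restarted.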
Service and Ingress
apiVersion: v1
kind: Service
metadata:
  name: vllm-service
spec:
  selector:
    app: vllm-inference
  ports:
    - port: 80
      targetPort: 8000
  type: ClusterIP
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: vllm-ingress
  annotations:
    nginx.ingress.kubernetes.io/proxy-read-timeout: "300"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "300"
spec:
  rules:
    - host: llm.internal.company.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: vllm-service
                port:
                  number: 80
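proxy-read-timeout and proxy-send-timeout must be raised: a large-model inference response can take longer than 60 seconds, and the default 60-second timeout would cause those requests to fail.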
Autoscaling
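apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-inference
  minReplicas: 1
  maxReplicas: 4
  metrics:
    - type: Pods
      pods:
        metric:
          # Assumed name under which Prometheus Adapter exposes vLLM's gpu_cache_usage_perc
          name: vllm_gpu_cache_usage_perc
        target:
          type: AverageValue
          averageValue: "700m"   # 0.7, i.e. 70% (the metric ranges from 0 to 1)
Note: GPU-based HPA requires custom metrics. The Resource metric type only covers CPU and memory, so a utilization target written as a Resource metric on nvidia.com/gpu will not work. We recommend using Prometheus Adapter to feed the gpu_cache_usage_perc metric exposed by vLLM into the HPA, which is what the manifest above assumes.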
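To make the Prometheus Adapter suggestion concrete, here is a sketch of an adapter rule. It assumes vLLM exports the series as `vllm:gpu_cache_usage_perc`, that the scraped series carry `namespace` and `pod` labels, and that the metric is re-exposed under the hypothetical name `vllm_gpu_cache_usage_perc`; adjust all three to the actual setup.

```yaml
# Fragment of the Prometheus Adapter configuration (rules section).
rules:
  - seriesQuery: 'vllm:gpu_cache_usage_perc{namespace!="",pod!=""}'
    resources:
      overrides:
        namespace: {resource: "namespace"}   # map the label to the K8s namespace
        pod: {resource: "pod"}               # map the label to the K8s Pod
    name:
      matches: "^vllm:gpu_cache_usage_perc$"
      as: "vllm_gpu_cache_usage_perc"        # hypothetical name seen by the HPA
    metricsQuery: 'avg(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
```

An HPA metric of type Pods can then reference that name. Since the value runs from 0 to 1, a 70% target corresponds to an average value of 0.7.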
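Model cache PVC
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-cache-pvc
spec:
  accessModes:
    - ReadOnlyMany   # shared read-only by multiple Pods
  storageClassName: nfs
  resources:
    requests:
      storage: 100Gi
ReadOnlyMany is the key: multiple inference Pods share a single model cache instead of each downloading its own copy. Use NFS or Ceph to provide the shared storage. Since the claim is meant to be mounted read-only, the model files need to be placed on the volume beforehand.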
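An access mode on the claim does not by itself stop a container from writing; a read-only mount is requested in the Pod spec. The fragment below sketches how the Deployment could do that. It assumes the model is already on the volume, and the HF_HUB_OFFLINE setting (a huggingface_hub environment variable that disables network lookups) is our suggestion rather than part of the original manifest.

```yaml
# Fragment of the Deployment's Pod template (spec.template.spec).
containers:
  - name: vllm
    env:
      - name: HF_HUB_OFFLINE
        value: "1"                 # rely on the pre-populated cache only
    volumeMounts:
      - name: model-cache
        mountPath: /root/.cache/huggingface
        readOnly: true             # mount the cache read-only in the container
volumes:
  - name: model-cache
    persistentVolumeClaim:
      claimName: model-cache-pvc
      readOnly: true               # request a read-only attachment of the claim
```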
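Monitoring
Prometheus scrapes vLLM's built-in metrics and Grafana handles visualization. The metrics endpoint is /metrics on container port 8000; through the Service defined above it is http://vllm-service/metrics, since the Service listens on port 80. Key metrics: request latency at P50/P95/P99, QPS, GPU memory utilization, KV cache usage, and request queue length.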
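Scraping through the Service hits one load-balanced replica per scrape, so per-Pod discovery usually gives cleaner data. The scrape job below is a sketch under that assumption; it relies on the `app: vllm-inference` label and the container port named `http` from the Deployment, and the job name `vllm` is arbitrary.

```yaml
# Fragment of prometheus.yml: discover and scrape every vLLM Pod directly.
scrape_configs:
  - job_name: vllm
    metrics_path: /metrics
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Keep only Pods labelled app=vllm-inference
      - source_labels: [__meta_kubernetes_pod_label_app]
        regex: vllm-inference
        action: keep
      # Keep only the container port named "http" (8000 in the Deployment)
      - source_labels: [__meta_kubernetes_pod_container_port_name]
        regex: http
        action: keep
      # Attach namespace and pod labels so metrics can be tied back to Pods
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
```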
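Pitfalls
Pod eviction: K8s evicts Pods when a node runs short of resources. An evicted inference Pod has to reload the model after rescheduling, so recovery takes a long time. Use a PodDisruptionBudget to keep at least one Pod running during voluntary disruptions such as node drains. Note that kubelet evictions under node pressure do not honor a PDB, so keeping memory requests close to limits also helps.
Image pulls: the vLLM image plus the CUDA base layers is around 10 GB or more, so the first pull is slow. We recommend pre-pulling it onto all GPU nodes.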
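Both mitigations can be written as manifests. The sketch below is one way to do it, not the exact objects we deployed: the names `vllm-pdb` and `vllm-image-prepull` and the node label `gpu-node: "true"` are hypothetical, and the pre-pull trick assumes the vLLM image contains a shell.

```yaml
# Keep at least one inference Pod up during voluntary disruptions (e.g. drains).
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: vllm-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: vllm-inference
---
# Pre-pull the large image on GPU nodes: the init container pulls it and exits,
# then a tiny pause container keeps the DaemonSet Pod alive.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: vllm-image-prepull
spec:
  selector:
    matchLabels:
      app: vllm-image-prepull
  template:
    metadata:
      labels:
        app: vllm-image-prepull
    spec:
      nodeSelector:
        gpu-node: "true"          # hypothetical label marking GPU nodes
      initContainers:
        - name: pull
          image: your-registry/vllm-server:latest
          command: ["sh", "-c", "true"]   # exit immediately after the pull
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9
```

The DaemonSet requests no GPU, so it does not compete with the inference Pods for devices.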
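Closing thoughts
Deploying an AI inference service on K8s is far from simple, but the payoff is substantial: self-healing, rolling updates, and elastic scaling. For a production service that has to run 24/7, it is a necessary step.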