vLLM
Configure vLLM, a high-performance LLM serving engine, through agentgateway. This guide covers two deployment patterns.
- External vLLM: Connect to a vLLM server running outside your Kubernetes cluster on dedicated GPU hardware.
- In-cluster vLLM: Deploy vLLM as a workload inside your Kubernetes cluster.
Before you begin
Install and set up an agentgateway proxy.

Set up vLLM
Choose your deployment option and follow the corresponding steps to set up the vLLM server and create the required Kubernetes resources.
Option 1: External vLLM
Install vLLM on a GPU-enabled machine. See the vLLM installation guide.
Start the vLLM OpenAI-compatible server.
```shell
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --dtype auto
```

Verify that the server is accessible.
```shell
curl http://<VLLM_SERVER_IP>:8000/v1/models
```

Create a headless Service and EndpointSlice that point to the external vLLM server. Replace `<VLLM_SERVER_IP>` with the actual IP address.

```shell
kubectl apply -f- <<EOF
apiVersion: v1
kind: Service
metadata:
  name: vllm
  namespace: agentgateway-system
spec:
  type: ClusterIP
  clusterIP: None
  ports:
  - port: 8000
    targetPort: 8000
    protocol: TCP
---
apiVersion: discovery.k8s.io/v1
kind: EndpointSlice
metadata:
  name: vllm
  namespace: agentgateway-system
  labels:
    kubernetes.io/service-name: vllm
addressType: IPv4
endpoints:
- addresses:
  - <VLLM_SERVER_IP>
ports:
- port: 8000
  protocol: TCP
EOF
```
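Before wiring up the cluster resources, you can sanity-check the external server from any machine that can reach it. The following is a minimal Python sketch, assuming the OpenAI-style list-models response shape that vLLM serves on `/v1/models`; the sample body below is illustrative, not captured from a real server.

```python
import json

def models_url(host: str, port: int = 8000) -> str:
    """Build the OpenAI-compatible list-models URL served by vLLM."""
    return f"http://{host}:{port}/v1/models"

def served_model_ids(body: str) -> list[str]:
    """Extract model IDs from a /v1/models response body.

    vLLM returns the OpenAI list format:
    {"object": "list", "data": [{"id": "...", "object": "model", ...}]}
    """
    return [m["id"] for m in json.loads(body).get("data", [])]

if __name__ == "__main__":
    # Against a live server you would fetch the body, e.g. with
    # urllib.request.urlopen(models_url("<VLLM_SERVER_IP>")).
    # Here we parse a canned response to show the expected shape.
    sample = (
        '{"object": "list", "data": '
        '[{"id": "meta-llama/Llama-3.1-8B-Instruct", "object": "model"}]}'
    )
    print(served_model_ids(sample))  # ['meta-llama/Llama-3.1-8B-Instruct']
```

If the served model ID does not match the name you configure later in the backend resource, requests will be rejected by vLLM, so this check is worth doing first.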
Option 2: Deploy vLLM in a Kubernetes cluster
Use this option to deploy vLLM directly in your Kubernetes cluster alongside agentgateway.
Before you begin, make sure that your cluster has:
- GPU nodes (NVIDIA GPUs with CUDA support).
- NVIDIA GPU Operator or device plugin installed.
- Sufficient GPU memory for your chosen model.
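One way to confirm the first two prerequisites is to check whether any node advertises `nvidia.com/gpu` capacity. A small Python sketch that tallies allocatable GPUs from `kubectl get nodes -o json` output follows; the node JSON in the example is illustrative, not from a real cluster.

```python
import json

def allocatable_gpus(nodes_json: str) -> int:
    """Sum nvidia.com/gpu allocatable across all nodes in
    `kubectl get nodes -o json` output. Nodes without the GPU
    device plugin simply do not report the resource."""
    nodes = json.loads(nodes_json).get("items", [])
    return sum(
        int(n.get("status", {}).get("allocatable", {}).get("nvidia.com/gpu", "0"))
        for n in nodes
    )

if __name__ == "__main__":
    # Illustrative shape; in practice pipe real output in:
    #   kubectl get nodes -o json | python check_gpus.py
    sample = json.dumps({
        "items": [
            {"status": {"allocatable": {"cpu": "8", "nvidia.com/gpu": "1"}}},
            {"status": {"allocatable": {"cpu": "8"}}},
        ]
    })
    print(allocatable_gpus(sample))  # 1
```

A result of zero means the GPU Operator or device plugin is not advertising GPUs, and the vLLM pod will stay Pending (see Troubleshooting below).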
Example steps:
Create a vLLM Deployment with GPU resources.
ℹ️ For gated models such as Llama, create a Hugging Face token secret before deploying.
```shell
kubectl create secret generic hf-token \
  -n agentgateway-system \
  --from-literal=token=<your-hf-token>
```

```shell
kubectl apply -f- <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm
  namespace: agentgateway-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm
  template:
    metadata:
      labels:
        app: vllm
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        args:
        - "--model"
        - "meta-llama/Llama-3.1-8B-Instruct"
        - "--host"
        - "0.0.0.0"
        - "--port"
        - "8000"
        - "--dtype"
        - "auto"
        ports:
        - containerPort: 8000
          name: http
        resources:
          limits:
            nvidia.com/gpu: "1"
        env:
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-token
              key: token
              optional: true
---
apiVersion: v1
kind: Service
metadata:
  name: vllm
  namespace: agentgateway-system
spec:
  selector:
    app: vllm
  ports:
  - port: 8000
    targetPort: 8000
    protocol: TCP
EOF
```

Wait for the vLLM pod to be ready.
ℹ️ vLLM downloads model weights on first startup, which can take several minutes depending on model size and network speed. Monitor progress with the following command.
```shell
kubectl logs -f deployment/vllm -n agentgateway-system
```

```shell
kubectl wait --for=condition=ready pod \
  -l app=vllm \
  -n agentgateway-system \
  --timeout=300s
```
Create the agentgateway backend resources
These steps are the same for both external and in-cluster vLLM.
Create an AgentgatewayBackend resource. The `openai` provider type is used because vLLM exposes an OpenAI-compatible API.

```shell
kubectl apply -f- <<EOF
apiVersion: agentgateway.dev/v1alpha1
kind: AgentgatewayBackend
metadata:
  name: vllm
  namespace: agentgateway-system
spec:
  ai:
    provider:
      openai:
        model: meta-llama/Llama-3.1-8B-Instruct
        host: vllm.agentgateway-system.svc.cluster.local
        port: 8000
EOF
```

Review the following table to understand this configuration. For more information, see the API reference.
| Setting | Description |
| --- | --- |
| `ai.provider.openai` | The OpenAI-compatible provider type. vLLM exposes an OpenAI-compatible API, so the `openai` type is used here. |
| `openai.model` | The model name as served by vLLM. This must match the `--model` argument used when starting vLLM. |
| `host` | The in-cluster DNS name of the Service pointing to the vLLM instance. |
| `port` | The port vLLM listens on. The default is 8000. |

Create an HTTPRoute to expose the vLLM backend through the gateway.
```shell
kubectl apply -f- <<EOF
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: vllm
  namespace: agentgateway-system
spec:
  parentRefs:
  - name: agentgateway-proxy
    namespace: agentgateway-system
  rules:
  - backendRefs:
    - name: vllm
      namespace: agentgateway-system
      group: agentgateway.dev
      kind: AgentgatewayBackend
EOF
```

Send a request to verify that agentgateway can route to vLLM. If your gateway has an external address, send the request to it directly.

```shell
curl "$INGRESS_GW_ADDRESS" \
  -H "content-type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [
      {
        "role": "user",
        "content": "Explain the benefits of vLLM for serving large language models."
      }
    ]
  }' | jq
```

Alternatively, use port-forwarding. In one terminal, start a port-forward to the gateway.

```shell
kubectl port-forward -n agentgateway-system svc/agentgateway-proxy 8080:80
```

In a second terminal, send a request.

```shell
curl "localhost:8080" \
  -H "content-type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [
      {
        "role": "user",
        "content": "Explain the benefits of vLLM for serving large language models."
      }
    ]
  }' | jq
```
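The same request can be sent programmatically. The sketch below builds the chat-completions payload the gateway forwards to vLLM; the address and model name come from this guide, and everything else is the standard OpenAI chat format. The `build_chat_request` helper is illustrative, not part of any agentgateway or vLLM API.

```python
import json
from urllib.request import Request

def build_chat_request(base_url: str, model: str, prompt: str) -> Request:
    """Assemble an OpenAI-style chat-completions request for the gateway."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    return Request(base_url, data=body,
                   headers={"content-type": "application/json"})

if __name__ == "__main__":
    req = build_chat_request(
        "http://localhost:8080",  # the port-forward address from above
        "meta-llama/Llama-3.1-8B-Instruct",
        "Explain the benefits of vLLM for serving large language models.",
    )
    # To actually send it against a running gateway:
    #   urllib.request.urlopen(req).read()
    print(json.loads(req.data)["model"])  # meta-llama/Llama-3.1-8B-Instruct
```

Note that the model field must match `openai.model` in the AgentgatewayBackend and the `--model` flag vLLM was started with.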
Troubleshooting
Connection refused or 503 response
What’s happening:
The gateway returns a 503 response or requests fail with a connection error.
Why it’s happening:
For external vLLM, the cluster cannot reach the server — check the EndpointSlice IP and firewall rules. For in-cluster vLLM, the pod may still be starting or may have failed to schedule.
How to fix it:
For external vLLM, verify the server is reachable and the EndpointSlice is correct:
```shell
curl http://<VLLM_SERVER_IP>:8000/v1/models
kubectl get endpointslice vllm -n agentgateway-system -o yaml
```

For in-cluster vLLM, check the pod status and logs:
```shell
kubectl get pods -l app=vllm -n agentgateway-system
kubectl logs deployment/vllm -n agentgateway-system
```
Pod stuck in Pending state (in-cluster only)
What’s happening:
The vLLM pod does not start and shows a Pending status.
Why it’s happening:
No GPU nodes are available in the cluster, or the GPU resource requests cannot be satisfied.
How to fix it:
Check GPU node availability:
```shell
kubectl describe nodes | grep -A 5 "nvidia.com/gpu"
```

Check the pod events for scheduling errors:
```shell
kubectl describe pod -l app=vllm -n agentgateway-system
```