vLLM
Configure vLLM, a high-performance LLM serving engine, through agentgateway. This guide covers two deployment patterns.
- External vLLM: Connect to a vLLM server running outside your Kubernetes cluster on dedicated GPU hardware.
- In-cluster vLLM: Deploy vLLM as a workload inside your Kubernetes cluster.
Before you begin
Install and set up an agentgateway proxy.

Set up vLLM
Choose your deployment option and follow the corresponding steps to set up the vLLM server and create the required Kubernetes resources.
Option 1: External vLLM
Install vLLM on a GPU-enabled machine. See the vLLM installation guide.
Start the vLLM OpenAI-compatible server.
```shell
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --dtype auto
```

Verify that the server is accessible.
```shell
curl http://<VLLM_SERVER_IP>:8000/v1/models
```

Create a headless Service and EndpointSlice that point to the external vLLM server. Replace `<VLLM_SERVER_IP>` with the actual IP address.

```shell
kubectl apply -f- <<EOF
apiVersion: v1
kind: Service
metadata:
  name: vllm
  namespace: agentgateway-system
spec:
  type: ClusterIP
  clusterIP: None
  ports:
  - port: 8000
    targetPort: 8000
    protocol: TCP
---
apiVersion: discovery.k8s.io/v1
kind: EndpointSlice
metadata:
  name: vllm
  namespace: agentgateway-system
  labels:
    kubernetes.io/service-name: vllm
addressType: IPv4
endpoints:
- addresses:
  - <VLLM_SERVER_IP>
ports:
- port: 8000
  protocol: TCP
EOF
```
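Before wiring up the cluster resources, you can sanity-check the external server from any machine that can reach it. The following is a minimal Python sketch, assuming the OpenAI-style list-models response shape that vLLM serves on `/v1/models`; the sample body below is illustrative, not captured from a real server.

```python
import json

def models_url(host: str, port: int = 8000) -> str:
    """Build the OpenAI-compatible list-models URL served by vLLM."""
    return f"http://{host}:{port}/v1/models"

def served_model_ids(body: str) -> list[str]:
    """Extract model IDs from a /v1/models response body.

    vLLM returns the OpenAI list format:
    {"object": "list", "data": [{"id": "...", "object": "model", ...}]}
    """
    return [m["id"] for m in json.loads(body).get("data", [])]

if __name__ == "__main__":
    # Against a live server you would fetch the body, e.g. with
    # urllib.request.urlopen(models_url("<VLLM_SERVER_IP>")).
    # Here we parse a canned response to show the expected shape.
    sample = (
        '{"object": "list", "data": '
        '[{"id": "meta-llama/Llama-3.1-8B-Instruct", "object": "model"}]}'
    )
    print(served_model_ids(sample))  # ['meta-llama/Llama-3.1-8B-Instruct']
```

If the served model ID does not match the name you configure later in the backend resource, requests will be rejected by vLLM, so this check is worth doing first.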
Option 2: Deploy vLLM in a Kubernetes cluster
Use this option to deploy vLLM directly in your Kubernetes cluster alongside agentgateway.
Before you begin, make sure that your cluster has:
- GPU nodes (NVIDIA GPUs with CUDA support).
- NVIDIA GPU Operator or device plugin installed.
- Sufficient GPU memory for your chosen model.
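One way to confirm the first two prerequisites is to check whether any node advertises `nvidia.com/gpu` capacity. A small Python sketch that tallies allocatable GPUs from `kubectl get nodes -o json` output follows; the node JSON in the example is illustrative, not from a real cluster.

```python
import json

def allocatable_gpus(nodes_json: str) -> int:
    """Sum nvidia.com/gpu allocatable across all nodes in
    `kubectl get nodes -o json` output. Nodes without the GPU
    device plugin simply do not report the resource."""
    nodes = json.loads(nodes_json).get("items", [])
    return sum(
        int(n.get("status", {}).get("allocatable", {}).get("nvidia.com/gpu", "0"))
        for n in nodes
    )

if __name__ == "__main__":
    # Illustrative shape; in practice pipe real output in:
    #   kubectl get nodes -o json | python check_gpus.py
    sample = json.dumps({
        "items": [
            {"status": {"allocatable": {"cpu": "8", "nvidia.com/gpu": "1"}}},
            {"status": {"allocatable": {"cpu": "8"}}},
        ]
    })
    print(allocatable_gpus(sample))  # 1
```

A result of zero means the GPU Operator or device plugin is not advertising GPUs, and the vLLM pod will stay Pending (see Troubleshooting below).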
Example steps:
Create a vLLM Deployment with GPU resources.
ℹ️ For gated models such as Llama, create a Hugging Face token secret before deploying.
```shell
kubectl create secret generic hf-token \
  -n agentgateway-system \
  --from-literal=token=<your-hf-token>
```

```shell
kubectl apply -f- <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm
  namespace: agentgateway-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm
  template:
    metadata:
      labels:
        app: vllm
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        args:
        - "--model"
        - "meta-llama/Llama-3.1-8B-Instruct"
        - "--host"
        - "0.0.0.0"
        - "--port"
        - "8000"
        - "--dtype"
        - "auto"
        ports:
        - containerPort: 8000
          name: http
        resources:
          limits:
            nvidia.com/gpu: "1"
        env:
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-token
              key: token
              optional: true
---
apiVersion: v1
kind: Service
metadata:
  name: vllm
  namespace: agentgateway-system
spec:
  selector:
    app: vllm
  ports:
  - port: 8000
    targetPort: 8000
    protocol: TCP
EOF
```

Wait for the vLLM pod to be ready.
ℹ️ vLLM downloads model weights on first startup, which can take several minutes depending on model size and network speed. Monitor progress with the following command.
```shell
kubectl logs -f deployment/vllm -n agentgateway-system
```

```shell
kubectl wait --for=condition=ready pod \
  -l app=vllm \
  -n agentgateway-system \
  --timeout=300s
```
Create the agentgateway backend resources
These steps are the same for both external and in-cluster vLLM.
Create an AgentgatewayBackend resource. The `openai` provider type is used because vLLM exposes an OpenAI-compatible API.

```shell
kubectl apply -f- <<EOF
apiVersion: agentgateway.dev/v1alpha1
kind: AgentgatewayBackend
metadata:
  name: vllm
  namespace: agentgateway-system
spec:
  ai:
    provider:
      openai:
        model: meta-llama/Llama-3.1-8B-Instruct
        host: vllm.agentgateway-system.svc.cluster.local
        port: 8000
EOF
```

Review the following table to understand this configuration. For more information, see the API reference.
| Setting | Description |
| --- | --- |
| `ai.provider.openai` | The OpenAI-compatible provider type. vLLM exposes an OpenAI-compatible API, so the `openai` type is used here. |
| `openai.model` | The model name as served by vLLM. This must match the `--model` argument used when starting vLLM. |
| `host` | The in-cluster DNS name of the Service pointing to the vLLM instance. |
| `port` | The port vLLM listens on. The default is 8000. |

Create an HTTPRoute to expose the vLLM backend through the gateway.
```shell
kubectl apply -f- <<EOF
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: vllm
  namespace: agentgateway-system
spec:
  parentRefs:
  - name: agentgateway-proxy
    namespace: agentgateway-system
  rules:
  - backendRefs:
    - name: vllm
      namespace: agentgateway-system
      group: agentgateway.dev
      kind: AgentgatewayBackend
EOF
```

Send a request to verify that agentgateway can route to vLLM. If your gateway has an external address, send the request to it directly.

```shell
curl "$INGRESS_GW_ADDRESS" \
  -H "content-type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [
      {
        "role": "user",
        "content": "Explain the benefits of vLLM for serving large language models."
      }
    ]
  }' | jq
```

Alternatively, use port-forwarding. In one terminal, start a port-forward to the gateway.

```shell
kubectl port-forward -n agentgateway-system svc/agentgateway-proxy 8080:80
```

In a second terminal, send a request.

```shell
curl "localhost:8080" \
  -H "content-type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [
      {
        "role": "user",
        "content": "Explain the benefits of vLLM for serving large language models."
      }
    ]
  }' | jq
```
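The same request can be sent programmatically. The sketch below builds the chat-completions payload the gateway forwards to vLLM; the address and model name come from this guide, and everything else is the standard OpenAI chat format. The `build_chat_request` helper is illustrative, not part of any agentgateway or vLLM API.

```python
import json
from urllib.request import Request

def build_chat_request(base_url: str, model: str, prompt: str) -> Request:
    """Assemble an OpenAI-style chat-completions request for the gateway."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    return Request(base_url, data=body,
                   headers={"content-type": "application/json"})

if __name__ == "__main__":
    req = build_chat_request(
        "http://localhost:8080",  # the port-forward address from above
        "meta-llama/Llama-3.1-8B-Instruct",
        "Explain the benefits of vLLM for serving large language models.",
    )
    # To actually send it against a running gateway:
    #   urllib.request.urlopen(req).read()
    print(json.loads(req.data)["model"])  # meta-llama/Llama-3.1-8B-Instruct
```

Note that the model field must match `openai.model` in the AgentgatewayBackend and the `--model` flag vLLM was started with.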
Troubleshooting
Connection refused or 503 response
What’s happening:
The gateway returns a 503 response or requests fail with a connection error.
Why it’s happening:
For external vLLM, the cluster cannot reach the server — check the EndpointSlice IP and firewall rules. For in-cluster vLLM, the pod may still be starting or may have failed to schedule.
How to fix it:
For external vLLM, verify the server is reachable and the EndpointSlice is correct:
```shell
curl http://<VLLM_SERVER_IP>:8000/v1/models
kubectl get endpointslice vllm -n agentgateway-system -o yaml
```

For in-cluster vLLM, check the pod status and logs:
```shell
kubectl get pods -l app=vllm -n agentgateway-system
kubectl logs deployment/vllm -n agentgateway-system
```
Pod stuck in Pending state (in-cluster only)
What’s happening:
The vLLM pod does not start and shows a Pending status.
Why it’s happening:
No GPU nodes are available in the cluster, or the GPU resource requests cannot be satisfied.
How to fix it:
Check GPU node availability:
```shell
kubectl describe nodes | grep -A 5 "nvidia.com/gpu"
```

Check the pod events for scheduling errors:
```shell
kubectl describe pod -l app=vllm -n agentgateway-system
```