Running Gemma 4 on your laptop with Ollama is great for development. But when you need to serve hundreds of concurrent users, handle thousands of requests per minute, and keep latency under a second — you need a production-grade inference engine.
That's where vLLM comes in. It's the gold standard for serving large language models in production, and it works beautifully with Gemma 4.
## Why vLLM?
vLLM uses PagedAttention, which manages GPU memory the way an operating system manages RAM — dynamically allocating and freeing memory as requests come and go. The result:
- 2-4x higher throughput vs. naive inference
- OpenAI-compatible API — swap in Gemma 4 without changing your client code
- Continuous batching — no wasted GPU cycles between requests
- Tensor parallelism — split big models across multiple GPUs
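To make the PagedAttention idea concrete, here is a toy sketch of the core mechanism: the KV cache is carved into fixed-size blocks that are handed out per request and reclaimed the moment a request finishes, like OS pages. The block size and pool size below are illustrative, not vLLM's actual values.

```python
class BlockPool:
    """Toy paged KV-cache allocator (illustrative only)."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size          # tokens per block
        self.free = list(range(num_blocks))   # ids of free blocks

    def allocate(self, num_tokens):
        """Reserve enough blocks for num_tokens; None if the pool is full."""
        needed = -(-num_tokens // self.block_size)  # ceiling division
        if needed > len(self.free):
            return None  # request waits; continuous batching queues it
        return [self.free.pop() for _ in range(needed)]

    def release(self, blocks):
        """Return a finished request's blocks to the pool immediately."""
        self.free.extend(blocks)
```

Because memory is allocated in small blocks instead of one worst-case contiguous slab per request, far more requests fit on the GPU at once, which is where the throughput gains come from.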
## Installing vLLM
```bash
# Create a clean environment
python -m venv vllm-env
source vllm-env/bin/activate

# Install vLLM with GPU support
pip install vllm
```

Make sure you have the right CUDA drivers. vLLM needs CUDA 12.1+ and a GPU with at least 16GB VRAM for the smaller Gemma 4 models.
## Serving Gemma 4 with the OpenAI-Compatible API
The simplest way to get started — one command and you have an API endpoint:
```bash
vllm serve google/gemma-4-26b \
    --host 0.0.0.0 \
    --port 8000 \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.90 \
    --dtype bfloat16
```

Now you can hit it with any OpenAI SDK client:
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed",  # vLLM doesn't require a key by default
)

response = client.chat.completions.create(
    model="google/gemma-4-26b",
    messages=[
        {"role": "user", "content": "Explain quantum computing in one paragraph."}
    ],
    temperature=0.7,
    max_tokens=256,
)

print(response.choices[0].message.content)
```

The beauty of this: if your app already uses the OpenAI API, you just change the `base_url` and model name. Everything else stays the same.
## GPU Memory Planning
This is the part most people get wrong. Here's what you actually need:
| Model | Precision | Min VRAM | Recommended VRAM | Max Context |
|---|---|---|---|---|
| Gemma 4 E4B | FP16 | 10 GB | 16 GB | 32K |
| Gemma 4 E4B | INT8 | 6 GB | 10 GB | 16K |
| Gemma 4 26B | BF16 | 52 GB | 80 GB (A100) | 32K |
| Gemma 4 26B | INT8 | 28 GB | 40 GB (A100) | 32K |
| Gemma 4 31B | BF16 | 62 GB | 80 GB (A100) | 32K |
Pro tip: The `--gpu-memory-utilization` flag (default 0.9) controls how much VRAM vLLM pre-allocates. Lower it to 0.8 if you're running other processes on the same GPU. Need help figuring out your hardware? Check our hardware guide.
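If you want a back-of-the-envelope number rather than a table lookup, VRAM is roughly model weights plus KV cache. The sketch below uses that decomposition; the layer count, KV-head count, and head dimension are illustrative placeholders, not official Gemma 4 architecture specs, so treat the output as a rough estimate only.

```python
def estimate_vram_gb(params_b, bytes_per_param=2,
                     n_layers=46, n_kv_heads=8, head_dim=128,
                     max_seqs=8, max_len=8192, kv_bytes=2):
    """Rough serving-VRAM estimate: weights + KV cache, in GB.

    params_b: parameter count in billions; bytes_per_param=2 assumes BF16/FP16.
    Architecture numbers are illustrative, not official Gemma 4 specs.
    """
    weights = params_b * 1e9 * bytes_per_param
    # KV cache: 2 (K and V) * layers * kv_heads * head_dim * tokens * bytes
    kv_cache = 2 * n_layers * n_kv_heads * head_dim * max_seqs * max_len * kv_bytes
    return (weights + kv_cache) / 1e9
```

With these placeholder numbers, a 26B model lands between the table's minimum (weights alone, ~52 GB) and the recommended 80 GB once a realistic KV cache is added, which is why the minimum column alone is not enough for real traffic.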
For multi-GPU setups:
```bash
# Split Gemma 4 26B across 2 GPUs
vllm serve google/gemma-4-26b \
    --tensor-parallel-size 2 \
    --host 0.0.0.0 \
    --port 8000
```

## Docker Setup

Docker is the right way to deploy in production. Here's a complete `docker-compose.yml`:
```yaml
version: "3.8"

services:
  vllm:
    image: vllm/vllm-openai:latest
    ports:
      - "8000:8000"
    volumes:
      - model-cache:/root/.cache/huggingface
    environment:
      - HUGGING_FACE_HUB_TOKEN=${HF_TOKEN}
    command: >
      --model google/gemma-4-26b
      --host 0.0.0.0
      --port 8000
      --max-model-len 8192
      --gpu-memory-utilization 0.90
      --dtype bfloat16
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3

volumes:
  model-cache:
```

Launch it:
```bash
# Set your HuggingFace token
export HF_TOKEN=your_token_here

# Start the service
docker compose up -d

# Check logs
docker compose logs -f vllm

# Verify it's running
curl http://localhost:8000/v1/models
```

For a deeper dive into Docker-specific setup, see our Docker guide.
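Model downloads and weight loading can take minutes on first boot, so don't route traffic the moment the container starts. A minimal readiness probe, assuming the `/health` endpoint from the compose file above on localhost:8000, might look like:

```python
import time
import urllib.request
import urllib.error


def wait_for_vllm(url="http://localhost:8000/health", timeout_s=300, interval_s=5):
    """Poll the health endpoint until it returns 200 or we time out."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not up yet; keep polling
        time.sleep(interval_s)
    return False
```

Call `wait_for_vllm()` in your deploy script before flipping the load balancer over; it returns `False` if the server never comes up within the timeout.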
## Batch Inference
When you need to process a large dataset — say, classifying 10,000 documents — batch inference is way more efficient than sending one request at a time:
```python
import asyncio

from openai import AsyncOpenAI

# Use the async client so requests actually run concurrently
client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="none")

async def process_batch(items, max_concurrent=50):
    semaphore = asyncio.Semaphore(max_concurrent)

    async def process_one(item):
        async with semaphore:
            response = await client.chat.completions.create(
                model="google/gemma-4-26b",
                messages=[{"role": "user", "content": item}],
                max_tokens=128,
            )
            return response.choices[0].message.content

    tasks = [process_one(item) for item in items]
    return await asyncio.gather(*tasks)

# Process 1000 items with up to 50 concurrent requests
# (your_texts is your dataset of input strings)
items = ["Classify this text: " + text for text in your_texts]
results = asyncio.run(process_batch(items))
```

vLLM handles the batching internally through continuous batching — it groups requests together automatically for maximum GPU utilization.
## Load Balancing
For high availability, run multiple vLLM instances behind a load balancer:
```nginx
# nginx.conf
upstream vllm_backend {
    least_conn;
    server vllm-1:8000;
    server vllm-2:8000;
    server vllm-3:8000;
}

server {
    listen 80;

    location /v1/ {
        proxy_pass http://vllm_backend;
        proxy_set_header Host $host;
        proxy_read_timeout 120s;
    }
}
```

## Vertex AI: The Managed Option
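Even with a balancer in front, individual requests can fail transiently while an instance restarts or drains. A small retry-with-backoff wrapper on the client side absorbs those blips; this is a generic sketch, not a feature of the OpenAI SDK or vLLM:

```python
import time


def with_retries(fn, attempts=3, base_delay_s=0.5):
    """Call fn(), retrying with exponential backoff on any exception."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries; surface the error
            time.sleep(base_delay_s * (2 ** attempt))
```

You would wrap each chat-completion call, e.g. `with_retries(lambda: client.chat.completions.create(...))`, so a single instance restart costs one retry instead of a user-facing error.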
If you don't want to manage infrastructure at all, Google's Vertex AI offers managed Gemma 4 deployment:
```bash
# Deploy via gcloud CLI
gcloud ai endpoints create \
    --region=us-central1 \
    --display-name=gemma4-endpoint

gcloud ai models upload \
    --region=us-central1 \
    --display-name=gemma-4-26b \
    --container-image-uri=us-docker.pkg.dev/vertex-ai/prediction/vllm-serve:latest \
    --artifact-uri=gs://your-bucket/gemma-4-26b
```

Vertex AI handles scaling, monitoring, and GPU allocation. You pay per prediction. It's more expensive per query but way less operational overhead.
For a comparison with Google AI Studio (free tier), check our Google AI Studio guide.
## Monitoring
You should be watching these metrics in production:
```python
import requests

# vLLM exposes Prometheus metrics
metrics = requests.get("http://localhost:8000/metrics").text

# Key metrics to track:
#   vllm:num_requests_running — current concurrent requests
#   vllm:num_requests_waiting — queue depth
#   vllm:avg_generation_throughput — tokens per second
#   vllm:gpu_cache_usage_perc — KV cache utilization
```

Set up alerts for:
- Queue depth > 100: You need more instances or a bigger GPU
- GPU cache > 95%: Reduce `max-model-len` or add memory
- p99 latency > 5s: Time to scale horizontally
- Error rate > 1%: Check OOM errors and model health
A quick Grafana dashboard with these four metrics will catch most production issues before your users notice.
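If you want a quick script-level check before Grafana is wired up, you can parse the Prometheus text output yourself. The metric names follow the `vllm:*` convention shown above, but the exact set varies by vLLM version, so missing metrics are treated as "no data" here rather than errors:

```python
def parse_prometheus(text):
    """Return {metric_name: value} from Prometheus text exposition format."""
    metrics = {}
    for line in text.splitlines():
        if line.startswith("#") or not line.strip():
            continue  # skip HELP/TYPE comments and blank lines
        name, _, value = line.rpartition(" ")
        name = name.split("{")[0]  # drop any label set
        try:
            metrics[name] = float(value)
        except ValueError:
            pass  # ignore non-numeric samples
    return metrics


def check_alerts(metrics):
    """Apply the queue-depth and KV-cache thresholds from the list above."""
    alerts = []
    if metrics.get("vllm:num_requests_waiting", 0) > 100:
        alerts.append("queue depth > 100: scale out")
    if metrics.get("vllm:gpu_cache_usage_perc", 0) > 0.95:
        alerts.append("KV cache > 95%: lower max-model-len")
    return alerts
```

Run it against the `/metrics` response on a cron or in your smoke tests to catch saturation early.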
## Next Steps
- Set up Docker containers for reproducible deployments
- Enable structured JSON output for API consumers
- Compare Gemma 4 architectures to pick the right model for your workload
- Learn about fine-tuning to customize the model for your use case



