Running Gemma 4 on your laptop with Ollama is great for development. But when you need to serve hundreds of concurrent users, handle thousands of requests per minute, and keep latency under a second — you need a production-grade inference engine.
That's where vLLM comes in. It's the gold standard for serving large language models in production, and it works beautifully with Gemma 4.
## Why vLLM?
vLLM uses PagedAttention, which manages GPU memory the way an operating system manages RAM — dynamically allocating and freeing memory as requests come and go. The result:
- 2-4x higher throughput vs. naive inference
- OpenAI-compatible API — swap in Gemma 4 without changing your client code
- Continuous batching — no wasted GPU cycles between requests
- Tensor parallelism — split big models across multiple GPUs
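To make the PagedAttention idea concrete, here is a toy sketch of the core mechanism: the KV cache is carved into fixed-size blocks that are handed out per request and reclaimed the moment a request finishes, like OS pages. The block size and pool size below are illustrative, not vLLM's actual values.

```python
class BlockPool:
    """Toy paged KV-cache allocator (illustrative only)."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size          # tokens per block
        self.free = list(range(num_blocks))   # ids of free blocks

    def allocate(self, num_tokens):
        """Reserve enough blocks for num_tokens; None if the pool is full."""
        needed = -(-num_tokens // self.block_size)  # ceiling division
        if needed > len(self.free):
            return None  # request waits; continuous batching queues it
        return [self.free.pop() for _ in range(needed)]

    def release(self, blocks):
        """Return a finished request's blocks to the pool immediately."""
        self.free.extend(blocks)
```

Because memory is allocated in small blocks instead of one worst-case contiguous slab per request, far more requests fit on the GPU at once, which is where the throughput gains come from.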
## Installing vLLM
```bash
# Create a clean environment
python -m venv vllm-env
source vllm-env/bin/activate

# Install vLLM with GPU support
pip install vllm
```

Make sure you have the right CUDA drivers. vLLM needs CUDA 12.1+ and a GPU with at least 16GB VRAM for the smaller Gemma 4 models.
## Serving Gemma 4 with the OpenAI-Compatible API
The simplest way to get started — one command and you have an API endpoint:
```bash
vllm serve google/gemma-4-26b \
    --host 0.0.0.0 \
    --port 8000 \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.90 \
    --dtype bfloat16
```

Now you can hit it with any OpenAI SDK client:
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed",  # vLLM doesn't require a key by default
)

response = client.chat.completions.create(
    model="google/gemma-4-26b",
    messages=[
        {"role": "user", "content": "Explain quantum computing in one paragraph."}
    ],
    temperature=0.7,
    max_tokens=256,
)

print(response.choices[0].message.content)
```

The beauty of this: if your app already uses the OpenAI API, you just change the `base_url` and model name. Everything else stays the same.
## GPU Memory Planning
This is the part most people get wrong. Here's what you actually need:
| Model | Precision | Min VRAM | Recommended VRAM | Max Context |
|---|---|---|---|---|
| Gemma 4 E4B | FP16 | 10 GB | 16 GB | 32K |
| Gemma 4 E4B | INT8 | 6 GB | 10 GB | 16K |
| Gemma 4 26B | BF16 | 52 GB | 80 GB (A100) | 32K |
| Gemma 4 26B | INT8 | 28 GB | 40 GB (A100) | 32K |
| Gemma 4 31B | BF16 | 62 GB | 80 GB (A100) | 32K |
Pro tip: The `--gpu-memory-utilization` flag (default 0.9) controls how much VRAM vLLM pre-allocates. Lower it to 0.8 if you're running other processes on the same GPU. Need help figuring out your hardware? Check our hardware guide.
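If you want a back-of-the-envelope number rather than a table lookup, VRAM is roughly model weights plus KV cache. The sketch below uses that decomposition; the layer count, KV-head count, and head dimension are illustrative placeholders, not official Gemma 4 architecture specs, so treat the output as a rough estimate only.

```python
def estimate_vram_gb(params_b, bytes_per_param=2,
                     n_layers=46, n_kv_heads=8, head_dim=128,
                     max_seqs=8, max_len=8192, kv_bytes=2):
    """Rough serving-VRAM estimate: weights + KV cache, in GB.

    params_b: parameter count in billions; bytes_per_param=2 assumes BF16/FP16.
    Architecture numbers are illustrative, not official Gemma 4 specs.
    """
    weights = params_b * 1e9 * bytes_per_param
    # KV cache: 2 (K and V) * layers * kv_heads * head_dim * tokens * bytes
    kv_cache = 2 * n_layers * n_kv_heads * head_dim * max_seqs * max_len * kv_bytes
    return (weights + kv_cache) / 1e9
```

With these placeholder numbers, a 26B model lands between the table's minimum (weights alone, ~52 GB) and the recommended 80 GB once a realistic KV cache is added, which is why the minimum column alone is not enough for real traffic.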
For multi-GPU setups:
```bash
# Split Gemma 4 26B across 2 GPUs
vllm serve google/gemma-4-26b \
    --tensor-parallel-size 2 \
    --host 0.0.0.0 \
    --port 8000
```

## Docker Setup

Docker is the right way to deploy in production. Here's a complete `docker-compose.yml`:
```yaml
version: "3.8"

services:
  vllm:
    image: vllm/vllm-openai:latest
    ports:
      - "8000:8000"
    volumes:
      - model-cache:/root/.cache/huggingface
    environment:
      - HUGGING_FACE_HUB_TOKEN=${HF_TOKEN}
    command: >
      --model google/gemma-4-26b
      --host 0.0.0.0
      --port 8000
      --max-model-len 8192
      --gpu-memory-utilization 0.90
      --dtype bfloat16
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3

volumes:
  model-cache:
```

Launch it:
```bash
# Set your HuggingFace token
export HF_TOKEN=your_token_here

# Start the service
docker compose up -d

# Check logs
docker compose logs -f vllm

# Verify it's running
curl http://localhost:8000/v1/models
```

For a deeper dive into Docker-specific setup, see our Docker guide.
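Model downloads and weight loading can take minutes on first boot, so don't route traffic the moment the container starts. A minimal readiness probe, assuming the `/health` endpoint from the compose file above on localhost:8000, might look like:

```python
import time
import urllib.request
import urllib.error


def wait_for_vllm(url="http://localhost:8000/health", timeout_s=300, interval_s=5):
    """Poll the health endpoint until it returns 200 or we time out."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not up yet; keep polling
        time.sleep(interval_s)
    return False
```

Call `wait_for_vllm()` in your deploy script before flipping the load balancer over; it returns `False` if the server never comes up within the timeout.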
## Batch Inference
When you need to process a large dataset — say, classifying 10,000 documents — batch inference is way more efficient than sending one request at a time:
```python
import asyncio

from openai import AsyncOpenAI

# Use the async client so requests actually run concurrently
client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="none")

async def process_batch(items, max_concurrent=50):
    semaphore = asyncio.Semaphore(max_concurrent)

    async def process_one(item):
        async with semaphore:
            response = await client.chat.completions.create(
                model="google/gemma-4-26b",
                messages=[{"role": "user", "content": item}],
                max_tokens=128,
            )
            return response.choices[0].message.content

    tasks = [process_one(item) for item in items]
    return await asyncio.gather(*tasks)

# Process 1000 items with up to 50 concurrent requests
# (your_texts is your dataset of input strings)
items = ["Classify this text: " + text for text in your_texts]
results = asyncio.run(process_batch(items))
```

vLLM handles the batching internally through continuous batching — it groups requests together automatically for maximum GPU utilization.
## Load Balancing
For high availability, run multiple vLLM instances behind a load balancer:
```nginx
# nginx.conf
upstream vllm_backend {
    least_conn;
    server vllm-1:8000;
    server vllm-2:8000;
    server vllm-3:8000;
}

server {
    listen 80;

    location /v1/ {
        proxy_pass http://vllm_backend;
        proxy_set_header Host $host;
        proxy_read_timeout 120s;
    }
}
```

## Vertex AI: The Managed Option
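Even with a balancer in front, individual requests can fail transiently while an instance restarts or drains. A small retry-with-backoff wrapper on the client side absorbs those blips; this is a generic sketch, not a feature of the OpenAI SDK or vLLM:

```python
import time


def with_retries(fn, attempts=3, base_delay_s=0.5):
    """Call fn(), retrying with exponential backoff on any exception."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries; surface the error
            time.sleep(base_delay_s * (2 ** attempt))
```

You would wrap each chat-completion call, e.g. `with_retries(lambda: client.chat.completions.create(...))`, so a single instance restart costs one retry instead of a user-facing error.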
If you don't want to manage infrastructure at all, Google's Vertex AI offers managed Gemma 4 deployment:
```bash
# Deploy via gcloud CLI
gcloud ai endpoints create \
    --region=us-central1 \
    --display-name=gemma4-endpoint

gcloud ai models upload \
    --region=us-central1 \
    --display-name=gemma-4-26b \
    --container-image-uri=us-docker.pkg.dev/vertex-ai/prediction/vllm-serve:latest \
    --artifact-uri=gs://your-bucket/gemma-4-26b
```

Vertex AI handles scaling, monitoring, and GPU allocation. You pay per prediction. It's more expensive per query but way less operational overhead.
For a comparison with Google AI Studio (free tier), check our Google AI Studio guide.
## Monitoring
You should be watching these metrics in production:
```python
import requests

# vLLM exposes Prometheus metrics
metrics = requests.get("http://localhost:8000/metrics").text

# Key metrics to track:
#   vllm:num_requests_running — current concurrent requests
#   vllm:num_requests_waiting — queue depth
#   vllm:avg_generation_throughput — tokens per second
#   vllm:gpu_cache_usage_perc — KV cache utilization
```

Set up alerts for:
- Queue depth > 100: You need more instances or a bigger GPU
- GPU cache > 95%: Reduce `max-model-len` or add memory
- p99 latency > 5s: Time to scale horizontally
- Error rate > 1%: Check OOM errors and model health
A quick Grafana dashboard with these four metrics will catch most production issues before your users notice.
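If you want a quick script-level check before Grafana is wired up, you can parse the Prometheus text output yourself. The metric names follow the `vllm:*` convention shown above, but the exact set varies by vLLM version, so missing metrics are treated as "no data" here rather than errors:

```python
def parse_prometheus(text):
    """Return {metric_name: value} from Prometheus text exposition format."""
    metrics = {}
    for line in text.splitlines():
        if line.startswith("#") or not line.strip():
            continue  # skip HELP/TYPE comments and blank lines
        name, _, value = line.rpartition(" ")
        name = name.split("{")[0]  # drop any label set
        try:
            metrics[name] = float(value)
        except ValueError:
            pass  # ignore non-numeric samples
    return metrics


def check_alerts(metrics):
    """Apply the queue-depth and KV-cache thresholds from the list above."""
    alerts = []
    if metrics.get("vllm:num_requests_waiting", 0) > 100:
        alerts.append("queue depth > 100: scale out")
    if metrics.get("vllm:gpu_cache_usage_perc", 0) > 0.95:
        alerts.append("KV cache > 95%: lower max-model-len")
    return alerts
```

Run it against the `/metrics` response on a cron or in your smoke tests to catch saturation early.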
## Next Steps
- Set up Docker containers for reproducible deployments
- Enable structured JSON output for API consumers
- Compare Gemma 4 architectures to pick the right model for your workload
- Learn about fine-tuning to customize the model for your use case



