How to Run Gemma 4 in Docker (Complete Container Guide)

Apr 7, 2026

Docker gives you reproducible, isolated AI deployments. Same container, same results — whether it's your laptop, a staging server, or production. No more "it works on my machine."

Let's set up Gemma 4 in Docker from scratch.

Why Docker for AI?

  • Reproducible: Pin your Ollama version, model files, and config
  • Isolated: Won't mess with your host system's Python, CUDA, or anything else
  • Portable: Build once, deploy anywhere
  • Easy cleanup: docker compose down and it's gone

If you're just running Gemma 4 for personal use, Ollama directly is simpler. Docker shines when you need consistent deployments across environments or want to bundle Gemma 4 into a larger application stack.

Quick Start with Docker Run

The fastest way to get Gemma 4 running in Docker:

# Run Ollama in Docker
docker run -d \
  --name gemma4 \
  -p 11434:11434 \
  -v ollama-data:/root/.ollama \
  ollama/ollama

# Pull and run Gemma 4
docker exec gemma4 ollama pull gemma4:26b
docker exec -it gemma4 ollama run gemma4:26b

That's it — three commands. The -v ollama-data:/root/.ollama ensures your model persists when the container restarts.
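The container can take a few seconds before it starts answering requests, so scripted setups should wait for the API before pulling. A minimal polling helper (the function name and retry defaults are ours, not part of Ollama):

```shell
#!/bin/sh
# Poll the Ollama API until it responds, or give up after a number of tries.
wait_for_ollama() {
  url="${1:-http://localhost:11434}"
  tries="${2:-30}"
  i=0
  while [ "$i" -lt "$tries" ]; do
    if curl -sf "$url/api/tags" >/dev/null 2>&1; then
      echo "Ollama is ready at $url"
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  echo "Ollama did not come up at $url" >&2
  return 1
}
```

With this in place, `wait_for_ollama && docker exec gemma4 ollama pull gemma4:26b` won't race the server startup.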

Dockerfile with Ollama

For more control, build a custom image:

# syntax=docker/dockerfile:1
FROM ollama/ollama:latest

# Set environment
ENV OLLAMA_HOST=0.0.0.0
ENV OLLAMA_KEEP_ALIVE=24h

# Create a startup script that pulls the model on first run
COPY <<'EOF' /start.sh
#!/bin/bash
ollama serve &
sleep 5

# Pull model if not already present
if ! ollama list | grep -q "gemma4:26b"; then
    echo "Pulling Gemma 4 26B..."
    ollama pull gemma4:26b
fi

# Keep container running
wait
EOF

RUN chmod +x /start.sh

EXPOSE 11434

CMD ["/start.sh"]
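Optionally, the image can advertise its own health status. The `ollama/ollama` base image may not ship `curl`, so probing with the bundled CLI is the safer bet (a sketch; tune the intervals to taste):

```dockerfile
# The CLI talks to the local API, so a successful `ollama list` means the server is up.
HEALTHCHECK --interval=30s --timeout=10s --start-period=30s --retries=5 \
  CMD ollama list || exit 1
```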

Build and run:

docker build -t gemma4-server .
docker run -d --name gemma4 -p 11434:11434 -v ollama-data:/root/.ollama gemma4-server

For a proper setup, use docker-compose.yml:

services:
  ollama:
    image: ollama/ollama:latest
    container_name: gemma4-ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama-models:/root/.ollama
    environment:
      - OLLAMA_HOST=0.0.0.0
      - OLLAMA_KEEP_ALIVE=24h
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "ollama", "list"]
      interval: 30s
      timeout: 10s
      retries: 5
      start_period: 30s

  webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: gemma4-webui
    ports:
      - "3000:8080"
    volumes:
      - webui-data:/app/backend/data
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
    depends_on:
      ollama:
        condition: service_healthy
    restart: unless-stopped

volumes:
  ollama-models:
    driver: local
  webui-data:
    driver: local

This gives you Ollama + Open WebUI — a complete ChatGPT-like interface for Gemma 4:

# Start everything
docker compose up -d

# Pull Gemma 4
docker exec gemma4-ollama ollama pull gemma4:26b

# Open the web UI
open http://localhost:3000

GPU Passthrough (NVIDIA)

To use your GPU inside Docker, you need the NVIDIA Container Toolkit:

# Install NVIDIA Container Toolkit (Ubuntu/Debian)
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
  sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg

curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit

# Configure Docker to use NVIDIA runtime
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

# Verify
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi

Update your docker-compose.yml to use the GPU:

services:
  ollama:
    image: ollama/ollama:latest
    ports:
      - "11434:11434"
    volumes:
      - ollama-models:/root/.ollama
    environment:
      - OLLAMA_HOST=0.0.0.0
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    restart: unless-stopped

Note: On Mac with Apple Silicon, Docker runs in a Linux VM and cannot access Metal acceleration. For Mac, run Ollama natively instead — you'll get Metal GPU acceleration automatically. See our Mac performance guide for details.

Persistent Model Storage

Models are large files. You don't want to re-download them every time a container restarts.

Named volume (recommended — Docker manages the storage):

volumes:
  ollama-models:
    driver: local

Bind mount (you choose the path — good for managing disk space):

volumes:
  - /data/ollama-models:/root/.ollama

Check model storage size:

docker exec gemma4-ollama du -sh /root/.ollama/models
Model         Approximate Size (Q4)
Gemma 4 E2B   ~1.5 GB
Gemma 4 E4B   ~2.5 GB
Gemma 4 26B   ~15 GB
Gemma 4 31B   ~18 GB
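To see which models are actually eating the disk, a small helper can rank the entries under the models directory, biggest first (the function name is ours; the default path matches the volume mount used above):

```shell
#!/bin/sh
# Rank the largest entries under an Ollama models directory, biggest first.
ollama_disk_usage() {
  dir="${1:-/root/.ollama/models}"
  du -sh "$dir"/* 2>/dev/null | sort -rh | head -n 10
}
```

Point it at a bind-mounted path on the host, or run the equivalent one-liner inside the container with `docker exec gemma4-ollama sh -c 'du -sh /root/.ollama/models/* | sort -rh'`.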

Multi-Model Setup

Want to run multiple Gemma 4 sizes for different use cases? Easy:

# Pull multiple models
docker exec gemma4-ollama ollama pull gemma4:e4b   # Fast, simple tasks
docker exec gemma4-ollama ollama pull gemma4:26b    # Most tasks
docker exec gemma4-ollama ollama pull gemma4:31b    # Maximum quality

# List all models
docker exec gemma4-ollama ollama list

Ollama loads models on demand and unloads idle ones. Only the active model uses VRAM. You can configure how long models stay loaded:

environment:
  - OLLAMA_KEEP_ALIVE=5m     # Unload after 5 minutes of idle
  - OLLAMA_MAX_LOADED_MODELS=2  # Keep up to 2 models loaded

Exposing the API

The Ollama API runs on port 11434 by default. Once your container is running:

# List available models
curl http://localhost:11434/api/tags

# Generate a response
curl http://localhost:11434/api/chat -d '{
  "model": "gemma4:26b",
  "messages": [{"role": "user", "content": "Hello!"}]
}'

# The API is also OpenAI-compatible
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma4:26b",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
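For scripting, it helps to build the request body separately from the transport. A tiny helper (the name is ours, and the naive `printf` quoting assumes the prompt contains no double quotes):

```shell
#!/bin/sh
# Build a non-streaming chat request body for the Ollama API.
chat_body() {
  model="$1"; shift
  printf '{"model":"%s","stream":false,"messages":[{"role":"user","content":"%s"}]}' \
    "$model" "$*"
}

# Once the container is up, pipe it straight to the API:
# chat_body gemma4:26b "Hello!" | curl -s http://localhost:11434/api/chat -d @-
```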

For detailed API usage, see our API tutorial. For production-grade serving with higher throughput, consider vLLM in Docker.

Useful Docker Commands

# View logs
docker compose logs -f ollama

# Check resource usage
docker stats gemma4-ollama

# Enter the container
docker exec -it gemma4-ollama bash

# Stop everything
docker compose down

# Stop and remove model data
docker compose down -v

# Update Ollama image
docker compose pull && docker compose up -d
