You downloaded Gemma 4, ran it, and... it's painfully slow. Maybe 2 tokens per second. Maybe worse. Before you blame the model, let's figure out what's actually going wrong — because in most cases, a few configuration tweaks can 5-10x your speed.
Step 1: Diagnose Why It's Slow
There are five common reasons Gemma 4 runs slower than expected. Let's check each one.
Reason 1: CPU Fallback
This is the number one speed killer. Your model is running on CPU instead of GPU, and you might not even realize it.
How to check:
```shell
# Mac: Activity Monitor → GPU History (Window menu)
# Or check if Metal is being used:
sudo powermetrics --samplers gpu_power -n 1

# NVIDIA: GPU utilization should be > 0%
nvidia-smi

# AMD: same check
rocm-smi
```

If GPU utilization stays at 0% during inference, you're on CPU. Fix this first — nothing else matters until GPU acceleration is working.
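If you want to script this check on NVIDIA, `nvidia-smi` has a machine-readable query mode (`nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader`), which prints lines like `87 %`. A minimal parser sketch — worth verifying the exact output format against your driver version:

```python
def parse_gpu_utilization(line: str) -> int:
    # One line of `nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader`
    # looks like "87 %".
    return int(line.strip().rstrip("%").strip())

# During inference, a healthy reading is well above zero;
# a steady "0 %" means you're on CPU fallback.
assert parse_gpu_utilization("87 %") == 87
```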
Reason 2: Wrong Quantization
Not all quantizations are created equal for speed:
| Quantization | File Size (12B) | Speed | Quality | Best For |
|---|---|---|---|---|
| Q4_K_M | ~7 GB | Fastest | Good | Daily use, most tasks |
| Q5_K_M | ~8.5 GB | Fast | Better | When quality matters |
| Q6_K | ~10 GB | Medium | Very good | Balanced |
| Q8_0 | ~13 GB | Slow | Near-original | Quality-critical tasks |
| FP16 | ~24 GB | Slowest | Original | Only if you have the VRAM |
| IQ4_XS | ~6 GB | Fastest | Acceptable | Tight VRAM budget |
If you're running Q8 or FP16 and wondering why it's slow, switch to Q4_K_M. The quality difference is marginal for most tasks, but the speed difference is dramatic. Our GGUF guide has detailed benchmarks for each quantization level.
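The file sizes in the table follow from a simple rule of thumb: size ≈ parameter count × effective bits per weight ÷ 8. The bits-per-weight figures below are rough estimates (real GGUF files add some overhead for scales and metadata), but they reproduce the table to within a few hundred MB:

```python
# Approximate effective bits per weight for common GGUF quantizations.
# These are rough estimates, not exact format specifications.
BITS_PER_WEIGHT = {
    "IQ4_XS": 4.3, "Q4_K_M": 4.8, "Q5_K_M": 5.7,
    "Q6_K": 6.6, "Q8_0": 8.5, "FP16": 16.0,
}

def approx_size_gb(n_params: float, quant: str) -> float:
    # File size ≈ parameters × effective bits per weight / 8 bytes per byte.
    return n_params * BITS_PER_WEIGHT[quant] / 8 / 1e9

print(f"{approx_size_gb(12e9, 'Q4_K_M'):.1f} GB")  # ~7.2 GB for a 12B model
```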
Reason 3: Context Length Too Long
Gemma 4 supports up to 256K context, but longer context = slower inference. The relationship isn't linear — it gets worse as context grows:
| Context Length | Relative Speed | VRAM Usage (12B Q4) |
|---|---|---|
| 2K | 1.0x (baseline) | ~7 GB |
| 8K | ~0.9x | ~8 GB |
| 32K | ~0.7x | ~12 GB |
| 128K | ~0.4x | ~20 GB |
| 256K | ~0.25x | ~30 GB+ |
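You can use the relative-speed column above to estimate throughput at a given context length: multiply your 2K-context baseline by the slowdown factor. A small sketch using the table's factors:

```python
# Relative-speed factors from the table above (context length → factor).
RELATIVE_SPEED = {2_048: 1.0, 8_192: 0.9, 32_768: 0.7, 131_072: 0.4, 262_144: 0.25}

def throughput_at_context(baseline_tok_s: float, ctx_len: int) -> float:
    # Scale the 2K-context baseline by the slowdown factor for ctx_len.
    return baseline_tok_s * RELATIVE_SPEED[ctx_len]

# A machine doing 25 tok/s at 2K drops to ~10 tok/s at 128K context.
print(round(throughput_at_context(25.0, 131_072), 1))
```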
Fix: Set a reasonable context length for your task:
```shell
# Ollama: limit context from inside a run session
ollama run gemma4:12b
>>> /set parameter num_ctx 8192

# llama.cpp
./llama-server -m model.gguf -c 8192
# Don't use 256K unless you actually need it
```

Reason 4: KV Cache Bloat
The KV (key-value) cache stores attention information and grows with conversation length. Long conversations eat VRAM and slow things down.
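To see why the cache gets big: every token in context stores one key vector and one value vector per layer, per KV head. A back-of-envelope sketch — the layer and head counts below are illustrative assumptions, not Gemma's actual config:

```python
def kv_cache_bytes(ctx_len: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_elem: int) -> int:
    # Keys + values: one head_dim vector each, per layer, per KV head, per token.
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem

# Illustrative 12B-class config (48 layers, 8 KV heads, head_dim 256 — assumptions):
fp16 = kv_cache_bytes(8192, 48, 8, 256, 2)  # 16-bit cache
q8 = kv_cache_bytes(8192, 48, 8, 256, 1)    # q8_0 cache
print(fp16 // 2**30, q8 / 2**30)  # 3 GiB vs 1.5 GiB at 8K context
```

The `bytes_per_elem` parameter is why quantizing the cache to q8_0 helps: halving the bytes per element halves the cache's VRAM footprint.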
Fix: Start fresh conversations regularly, or set a cache limit:
```shell
# llama.cpp: limit KV cache
./llama-server -m model.gguf -c 8192 --cache-type-k q8_0 --cache-type-v q8_0
# Quantized KV cache uses less VRAM with minimal quality loss
```

Reason 5: Batch Size Issues
If you're serving multiple requests, wrong batch sizes hurt throughput:
```shell
# vLLM: tune batch size
python -m vllm.entrypoints.openai.api_server \
    --model google/gemma-4-12b-it \
    --max-num-batched-tokens 4096 \
    --max-num-seqs 8
```

Platform-Specific Fixes
Mac (Apple Silicon)
Mac performance depends entirely on Metal GPU acceleration working correctly:
```shell
# Check Metal support
system_profiler SPDisplaysDataType | grep Metal

# Ollama automatically uses Metal on Apple Silicon.
# If speed is still slow, check unified memory pressure:
memory_pressure

# For llama.cpp, ensure Metal is enabled
cmake -B build -DGGML_METAL=ON
cmake --build build

# M1/M2/M3 recommended settings
./llama-server -m model.gguf -ngl 999 -c 8192
```

| Mac Model | Unified Memory | 12B Q4 Speed | Notes |
|---|---|---|---|
| M1 8GB | 8GB | ~12 tok/s | Usable but tight |
| M1 Pro 16GB | 16GB | ~18 tok/s | Comfortable |
| M2 Pro 16GB | 16GB | ~22 tok/s | Good daily driver |
| M3 Pro 18GB | 18GB | ~25 tok/s | Sweet spot |
| M3 Max 36GB | 36GB | ~30 tok/s | Can run 27B Q4 |
| M4 Max 48GB | 48GB | ~35 tok/s | Runs everything |
Mac-specific tip: Close memory-hungry apps (Chrome, Docker) before running large models. Apple Silicon shares memory between CPU and GPU, so there's no separate VRAM pool.
Windows (NVIDIA CUDA)
```shell
# Make sure CUDA is actually being used
# In Ollama, check with:
ollama ps

# Common Windows issue: power settings.
# Set to "High Performance" in Windows power options;
# laptop GPUs throttle aggressively on "Balanced".

# For llama.cpp on Windows:
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release
```

Windows-specific tip: Disable Windows Defender real-time scanning for your model directory. It scans every file read and can tank performance:

```powershell
# PowerShell (admin)
Add-MpPreference -ExclusionPath "C:\Users\you\models"
```

Linux (NVIDIA or AMD)
```shell
# NVIDIA: make sure persistence mode is on
sudo nvidia-smi -pm 1

# Set GPU to max performance (clock values vary by GPU)
sudo nvidia-smi -ac 1215,1410

# AMD: check ROCm is active
rocm-smi

# For both: ensure you're not running a Wayland compositor;
# X11 has less GPU overhead for compute tasks
```

Quick Speed Checklist
Run through this checklist to maximize speed:
1. [ ] GPU acceleration is working (not CPU fallback)
2. [ ] Using Q4_K_M quantization (unless quality is critical)
3. [ ] Context length set to what you actually need (not 256K default)
4. [ ] KV cache quantized (--cache-type-k q8_0)
5. [ ] Flash Attention enabled (if available)
6. [ ] No memory-hungry background apps
7. [ ] Power settings on "High Performance" (laptops)
8. [ ] Latest drivers installed

When Slow Is Expected
Sometimes Gemma 4 is slow and that's just how it is:
- First token latency: The first token always takes longer (prompt processing). This is normal.
- Very long prompts: Processing a 100K token input takes time no matter what.
- 27B model on 16GB: It fits, but barely. Consider the 12B instead.
- CPU-only inference: Without a GPU, expect 1-5 tok/s. That's the reality of running LLMs on CPU.
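A rough latency model ties these points together: the prompt is processed at a fast prefill rate before the first token appears, then output streams at the much slower decode rate. The speeds below are assumed for illustration, not measured figures:

```python
def response_time_s(prompt_tokens: int, output_tokens: int,
                    prefill_tok_s: float, decode_tok_s: float):
    # Time to first token is dominated by prompt processing (prefill);
    # the remaining tokens stream out at the decode rate.
    ttft = prompt_tokens / prefill_tok_s
    total = ttft + output_tokens / decode_tok_s
    return ttft, total

# A 100K-token prompt at an assumed 500 tok/s prefill and 20 tok/s decode:
ttft, total = response_time_s(100_000, 500, 500.0, 20.0)
print(f"first token after {ttft:.0f}s, full reply after {total:.0f}s")
```

This is why a huge prompt feels "stuck" before anything appears: here the first token alone takes 200 seconds, no matter how fast decoding is afterward.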
If you're experiencing issues beyond just speed, like crashes or errors, check our troubleshooting guide for solutions to OOM errors, GPU detection problems, and more.
Next Steps
- Not sure your hardware is enough? Check the Hardware Requirements Guide for minimum and recommended specs
- Confused about which model size to pick? Read Which Gemma 4 Model Should You Pick? to match model size to your hardware
- Want to understand quantization better? See our GGUF Quantization Guide for detailed comparisons
Speed optimization is mostly about getting the basics right. Fix CPU fallback, pick the right quantization, and set a reasonable context length — those three changes alone will solve 90% of performance complaints.



