You downloaded Gemma 4, ran it, and... it's painfully slow. Maybe 2 tokens per second. Maybe worse. Before you blame the model, let's figure out what's actually going wrong — because in most cases, a few configuration tweaks can 5-10x your speed.
Step 1: Diagnose Why It's Slow
There are five common reasons Gemma 4 runs slower than expected. Let's check each one.
Reason 1: CPU Fallback
This is the number one speed killer. Your model is running on CPU instead of GPU, and you might not even realize it.
How to check:
```shell
# Mac: Activity Monitor → GPU History (Window menu)
# Or check if Metal is being used:
sudo powermetrics --samplers gpu_power -n 1

# NVIDIA: GPU utilization should be > 0%
nvidia-smi

# AMD: same check
rocm-smi
```

If GPU utilization stays at 0% during inference, you're on CPU. Fix this first — nothing else matters until GPU acceleration is working.
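If you want to script this check on NVIDIA, `nvidia-smi` has a machine-readable query mode (`nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader`), which prints lines like `87 %`. A minimal parser sketch — worth verifying the exact output format against your driver version:

```python
def parse_gpu_utilization(line: str) -> int:
    # One line of `nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader`
    # looks like "87 %".
    return int(line.strip().rstrip("%").strip())

# During inference, a healthy reading is well above zero;
# a steady "0 %" means you're on CPU fallback.
assert parse_gpu_utilization("87 %") == 87
```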
Reason 2: Wrong Quantization
Not all quantizations are created equal for speed:
| Quantization | File Size (12B) | Speed | Quality | Best For |
|---|---|---|---|---|
| Q4_K_M | ~7 GB | Fastest | Good | Daily use, most tasks |
| Q5_K_M | ~8.5 GB | Fast | Better | When quality matters |
| Q6_K | ~10 GB | Medium | Very good | Balanced |
| Q8_0 | ~13 GB | Slow | Near-original | Quality-critical tasks |
| FP16 | ~24 GB | Slowest | Original | Only if you have the VRAM |
| IQ4_XS | ~6 GB | Fastest | Acceptable | Tight VRAM budget |
If you're running Q8 or FP16 and wondering why it's slow, switch to Q4_K_M. The quality difference is marginal for most tasks, but the speed difference is dramatic. Our GGUF guide has detailed benchmarks for each quantization level.
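The file sizes in the table follow from a simple rule of thumb: size ≈ parameter count × effective bits per weight ÷ 8. The bits-per-weight figures below are rough estimates (real GGUF files add some overhead for scales and metadata), but they reproduce the table to within a few hundred MB:

```python
# Approximate effective bits per weight for common GGUF quantizations.
# These are rough estimates, not exact format specifications.
BITS_PER_WEIGHT = {
    "IQ4_XS": 4.3, "Q4_K_M": 4.8, "Q5_K_M": 5.7,
    "Q6_K": 6.6, "Q8_0": 8.5, "FP16": 16.0,
}

def approx_size_gb(n_params: float, quant: str) -> float:
    # File size ≈ parameters × effective bits per weight / 8 bytes per byte.
    return n_params * BITS_PER_WEIGHT[quant] / 8 / 1e9

print(f"{approx_size_gb(12e9, 'Q4_K_M'):.1f} GB")  # ~7.2 GB for a 12B model
```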
Reason 3: Context Length Too Long
Gemma 4 supports up to 256K context, but longer context = slower inference. The relationship isn't linear — it gets worse as context grows:
| Context Length | Relative Speed | VRAM Usage (12B Q4) |
|---|---|---|
| 2K | 1.0x (baseline) | ~7 GB |
| 8K | ~0.9x | ~8 GB |
| 32K | ~0.7x | ~12 GB |
| 128K | ~0.4x | ~20 GB |
| 256K | ~0.25x | ~30 GB+ |
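You can use the relative-speed column above to estimate throughput at a given context length: multiply your 2K-context baseline by the slowdown factor. A small sketch using the table's factors:

```python
# Relative-speed factors from the table above (context length → factor).
RELATIVE_SPEED = {2_048: 1.0, 8_192: 0.9, 32_768: 0.7, 131_072: 0.4, 262_144: 0.25}

def throughput_at_context(baseline_tok_s: float, ctx_len: int) -> float:
    # Scale the 2K-context baseline by the slowdown factor for ctx_len.
    return baseline_tok_s * RELATIVE_SPEED[ctx_len]

# A machine doing 25 tok/s at 2K drops to ~10 tok/s at 128K context.
print(round(throughput_at_context(25.0, 131_072), 1))
```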
Fix: Set a reasonable context length for your task:
```shell
# Ollama: limit context from inside a run session
ollama run gemma4:12b
>>> /set parameter num_ctx 8192

# llama.cpp
./llama-server -m model.gguf -c 8192
# Don't use 256K unless you actually need it
```

Reason 4: KV Cache Bloat
The KV (key-value) cache stores attention information and grows with conversation length. Long conversations eat VRAM and slow things down.
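To see why the cache gets big: every token in context stores one key vector and one value vector per layer, per KV head. A back-of-envelope sketch — the layer and head counts below are illustrative assumptions, not Gemma's actual config:

```python
def kv_cache_bytes(ctx_len: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_elem: int) -> int:
    # Keys + values: one head_dim vector each, per layer, per KV head, per token.
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem

# Illustrative 12B-class config (48 layers, 8 KV heads, head_dim 256 — assumptions):
fp16 = kv_cache_bytes(8192, 48, 8, 256, 2)  # 16-bit cache
q8 = kv_cache_bytes(8192, 48, 8, 256, 1)    # q8_0 cache
print(fp16 // 2**30, q8 / 2**30)  # 3 GiB vs 1.5 GiB at 8K context
```

The `bytes_per_elem` parameter is why quantizing the cache to q8_0 helps: halving the bytes per element halves the cache's VRAM footprint.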
Fix: Start fresh conversations regularly, or set a cache limit:
```shell
# llama.cpp: limit KV cache
./llama-server -m model.gguf -c 8192 --cache-type-k q8_0 --cache-type-v q8_0
# Quantized KV cache uses less VRAM with minimal quality loss
```

Reason 5: Batch Size Issues
If you're serving multiple requests, wrong batch sizes hurt throughput:
```shell
# vLLM: tune batch size
python -m vllm.entrypoints.openai.api_server \
    --model google/gemma-4-12b-it \
    --max-num-batched-tokens 4096 \
    --max-num-seqs 8
```

Platform-Specific Fixes
Mac (Apple Silicon)
Mac performance depends entirely on Metal GPU acceleration working correctly:
```shell
# Check Metal support
system_profiler SPDisplaysDataType | grep Metal

# Ollama automatically uses Metal on Apple Silicon.
# If speed is still slow, check unified memory pressure:
memory_pressure

# For llama.cpp, ensure Metal is enabled
cmake -B build -DGGML_METAL=ON
cmake --build build

# M1/M2/M3 recommended settings
./llama-server -m model.gguf -ngl 999 -c 8192
```

| Mac Model | Unified Memory | 12B Q4 Speed | Notes |
|---|---|---|---|
| M1 8GB | 8GB | ~12 tok/s | Usable but tight |
| M1 Pro 16GB | 16GB | ~18 tok/s | Comfortable |
| M2 Pro 16GB | 16GB | ~22 tok/s | Good daily driver |
| M3 Pro 18GB | 18GB | ~25 tok/s | Sweet spot |
| M3 Max 36GB | 36GB | ~30 tok/s | Can run 27B Q4 |
| M4 Max 48GB | 48GB | ~35 tok/s | Runs everything |
Mac-specific tip: Close memory-hungry apps (Chrome, Docker) before running large models. Apple Silicon shares memory between CPU and GPU, so there's no separate VRAM pool.
Windows (NVIDIA CUDA)
```shell
# Make sure CUDA is actually being used
# In Ollama, check with:
ollama ps

# Common Windows issue: power settings.
# Set to "High Performance" in Windows power options;
# laptop GPUs throttle aggressively on "Balanced".

# For llama.cpp on Windows:
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release
```

Windows-specific tip: Disable Windows Defender real-time scanning for your model directory. It scans every file read and can tank performance:

```powershell
# PowerShell (admin)
Add-MpPreference -ExclusionPath "C:\Users\you\models"
```

Linux (NVIDIA or AMD)
```shell
# NVIDIA: make sure persistence mode is on
sudo nvidia-smi -pm 1

# Set GPU to max performance (clock values vary by GPU)
sudo nvidia-smi -ac 1215,1410

# AMD: check ROCm is active
rocm-smi

# For both: ensure you're not running a Wayland compositor;
# X11 has less GPU overhead for compute tasks
```

Quick Speed Checklist
Run through this checklist to maximize speed:
1. [ ] GPU acceleration is working (not CPU fallback)
2. [ ] Using Q4_K_M quantization (unless quality is critical)
3. [ ] Context length set to what you actually need (not 256K default)
4. [ ] KV cache quantized (--cache-type-k q8_0)
5. [ ] Flash Attention enabled (if available)
6. [ ] No memory-hungry background apps
7. [ ] Power settings on "High Performance" (laptops)
8. [ ] Latest drivers installed

When Slow Is Expected
Sometimes Gemma 4 is slow and that's just how it is:
- First token latency: The first token always takes longer (prompt processing). This is normal.
- Very long prompts: Processing a 100K token input takes time no matter what.
- 27B model on 16GB: It fits, but barely. Consider the 12B instead.
- CPU-only inference: Without a GPU, expect 1-5 tok/s. That's the reality of running LLMs on CPU.
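A rough latency model ties these points together: the prompt is processed at a fast prefill rate before the first token appears, then output streams at the much slower decode rate. The speeds below are assumed for illustration, not measured figures:

```python
def response_time_s(prompt_tokens: int, output_tokens: int,
                    prefill_tok_s: float, decode_tok_s: float):
    # Time to first token is dominated by prompt processing (prefill);
    # the remaining tokens stream out at the decode rate.
    ttft = prompt_tokens / prefill_tok_s
    total = ttft + output_tokens / decode_tok_s
    return ttft, total

# A 100K-token prompt at an assumed 500 tok/s prefill and 20 tok/s decode:
ttft, total = response_time_s(100_000, 500, 500.0, 20.0)
print(f"first token after {ttft:.0f}s, full reply after {total:.0f}s")
```

This is why a huge prompt feels "stuck" before anything appears: here the first token alone takes 200 seconds, no matter how fast decoding is afterward.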
If you're experiencing issues beyond just speed, like crashes or errors, check our troubleshooting guide for solutions to OOM errors, GPU detection problems, and more.
Next Steps
- Not sure your hardware is enough? Check the Hardware Requirements Guide for minimum and recommended specs
- Confused about which model size to pick? Read Which Gemma 4 Model Should You Pick? to match model size to your hardware
- Want to understand quantization better? See our GGUF Quantization Guide for detailed comparisons
Speed optimization is mostly about getting the basics right. Fix CPU fallback, pick the right quantization, and set a reasonable context length — those three changes alone will solve 90% of performance complaints.



