Why Is Gemma 4 Slow? Speed Up Guide for Mac, Windows & Linux

Apr 7, 2026

You downloaded Gemma 4, ran it, and... it's painfully slow. Maybe 2 tokens per second. Maybe worse. Before you blame the model, let's figure out what's actually going wrong — because in most cases, a few configuration tweaks can 5-10x your speed.

Step 1: Diagnose Why It's Slow

There are five common reasons Gemma 4 runs slower than expected. Let's check each one.

Reason 1: CPU Fallback

This is the number one speed killer. Your model is running on CPU instead of GPU, and you might not even realize it.

How to check:

# Mac: Activity Monitor → GPU History (Window menu)
# Or check if Metal is being used:
sudo powermetrics --samplers gpu_power -n 1

# NVIDIA: GPU utilization should be > 0%
nvidia-smi

# AMD: Same check
rocm-smi

If GPU utilization stays at 0% during inference, you're on CPU. Fix this first — nothing else matters until GPU acceleration is working.
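One way to script this check on NVIDIA: `nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits` prints a bare percentage you can compare against a threshold. The sketch below hard-codes a sample reading so the decision logic is visible:

```shell
# During inference, read GPU utilization. On NVIDIA you'd capture it with:
#   util=$(nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits)
# Here a sample value stands in for a real reading:
util=0   # pretend nvidia-smi reported 0% while the model was generating

if [ "$util" -lt 5 ]; then
  echo "GPU idle (${util}%) - likely CPU fallback"
else
  echo "GPU busy (${util}%) - acceleration is working"
fi
```

Poll this a few times mid-generation; a single reading between tokens can legitimately be near zero.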

Reason 2: Wrong Quantization

Not all quantizations are created equal for speed:

| Quantization | File Size (12B) | Speed | Quality | Best For |
|---|---|---|---|---|
| Q4_K_M | ~7 GB | Fastest | Good | Daily use, most tasks |
| Q5_K_M | ~8.5 GB | Fast | Better | When quality matters |
| Q6_K | ~10 GB | Medium | Very good | Balanced |
| Q8_0 | ~13 GB | Slow | Near-original | Quality-critical tasks |
| FP16 | ~24 GB | Slowest | Original | Only if you have the VRAM |
| IQ4_XS | ~6 GB | Fastest | Acceptable | Tight VRAM budget |

If you're running Q8 or FP16 and wondering why it's slow, switch to Q4_K_M. The quality difference is marginal for most tasks, but the speed difference is dramatic. Our GGUF guide has detailed benchmarks for each quantization level.
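The file sizes above follow directly from bits per weight: size ≈ parameters × bpw ÷ 8. A quick sanity check in shell (the ~4.5 bpw average for Q4_K_M is an approximation; it varies by tensor):

```shell
# Rough GGUF size estimate: params * bits-per-weight / 8 bytes.
# ~4.5 bpw is an approximate average for Q4_K_M, not an exact figure.
params=12000000000        # 12B parameters
bpw_x10=45                # 4.5 bits/weight, scaled x10 for integer math

bytes=$(( params * bpw_x10 / 10 / 8 ))
gb=$(( bytes / 1000000000 ))
echo "Q4_K_M estimate: ~${gb} GB"   # lands near the ~7 GB in the table
```

The same arithmetic explains why Q8_0 (~8.5 bpw) nearly doubles the file, and why FP16 (16 bpw) more than triples it.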

Reason 3: Context Length Too Long

Gemma 4 supports up to 256K context, but longer context = slower inference. The relationship isn't linear — it gets worse as context grows:

| Context Length | Relative Speed | VRAM Usage (12B Q4) |
|---|---|---|
| 2K | 1.0x (baseline) | ~7 GB |
| 8K | ~0.9x | ~8 GB |
| 32K | ~0.7x | ~12 GB |
| 128K | ~0.4x | ~20 GB |
| 256K | ~0.25x | ~30 GB+ |

Fix: Set a reasonable context length for your task:

# Ollama: limit context inside a session
ollama run gemma4:12b
>>> /set parameter num_ctx 8192

# llama.cpp
./llama-server -m model.gguf -c 8192

# Don't use 256K unless you actually need it
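Much of the long-context slowdown is prompt processing (prefill) before the first token appears. A back-of-the-envelope estimate, assuming an illustrative 500 tok/s prefill rate (your hardware will differ):

```shell
# Time-to-first-token is roughly prompt_tokens / prefill_rate.
# 500 tok/s prefill is an illustrative figure, not a benchmark.
prefill_rate=500

for ctx in 2048 8192 32768 131072; do
  secs=$(( ctx / prefill_rate ))
  echo "${ctx} prompt tokens -> ~${secs}s before the first token"
done
```

At 128K tokens that is minutes of waiting before generation even starts, which is why trimming context is the cheapest speedup available.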

Reason 4: KV Cache Bloat

The KV (key-value) cache stores attention information and grows with conversation length. Long conversations eat VRAM and slow things down.

Fix: Start fresh conversations regularly, or set a cache limit:

# llama.cpp: limit KV cache
./llama-server -m model.gguf -c 8192 --cache-type-k q8_0 --cache-type-v q8_0

# Quantized KV cache uses less VRAM with minimal quality loss
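To see why the cache matters, you can estimate its size: per token it stores keys and values for every layer. The architecture numbers below are illustrative placeholders, not Gemma 4's real config — substitute values from your model's metadata:

```shell
# KV cache per token = 2 (K and V) * layers * kv_heads * head_dim * bytes/elem.
# layers/kv_heads/head_dim below are placeholder values for illustration.
layers=48; kv_heads=8; head_dim=128
bytes_fp16=2

per_token=$(( 2 * layers * kv_heads * head_dim * bytes_fp16 ))
ctx=8192
total_mb=$(( per_token * ctx / 1048576 ))
echo "fp16 KV cache at ${ctx} ctx: ~${total_mb} MiB (q8_0 roughly halves this)"
```

Scale the context to 128K in this estimate and the cache alone swells into the tens of gigabytes — which is exactly the VRAM growth shown in the context-length table above.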

Reason 5: Batch Size Issues

If you're serving multiple requests, wrong batch sizes hurt throughput:

# vLLM: tune batch size
python -m vllm.entrypoints.openai.api_server \
  --model google/gemma-4-12b-it \
  --max-num-batched-tokens 4096 \
  --max-num-seqs 8
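These two flags interact: `--max-num-batched-tokens` caps the tokens processed per scheduling step, and `--max-num-seqs` caps concurrent sequences, so their ratio is the rough per-sequence token budget each step. A quick sketch of that arithmetic:

```shell
# Per scheduling step, vLLM processes at most max_num_batched_tokens tokens
# spread across at most max_num_seqs sequences.
max_batched_tokens=4096
max_seqs=8

per_seq=$(( max_batched_tokens / max_seqs ))
echo "~${per_seq} tokens/sequence per step"
```

If your typical prompts are much longer than that budget, prefill gets chunked across many steps; raise the token cap (or lower `--max-num-seqs`) for prompt-heavy workloads, and do the reverse for many short chats.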

Platform-Specific Fixes

Mac (Apple Silicon)

Mac performance depends entirely on Metal GPU acceleration working correctly:

# Check Metal support
system_profiler SPDisplaysDataType | grep Metal

# Ollama automatically uses Metal on Apple Silicon
# If speed is still slow, check unified memory pressure:
memory_pressure

# For llama.cpp, ensure Metal is enabled
cmake -B build -DGGML_METAL=ON
cmake --build build

# M1/M2/M3 recommended settings
./llama-server -m model.gguf -ngl 999 -c 8192

| Mac Model | Unified Memory | 12B Q4 Speed | Notes |
|---|---|---|---|
| M1 8GB | 8GB | ~12 tok/s | Usable but tight |
| M1 Pro 16GB | 16GB | ~18 tok/s | Comfortable |
| M2 Pro 16GB | 16GB | ~22 tok/s | Good daily driver |
| M3 Pro 18GB | 18GB | ~25 tok/s | Sweet spot |
| M3 Max 36GB | 36GB | ~30 tok/s | Can run 27B Q4 |
| M4 Max 48GB | 48GB | ~35 tok/s | Runs everything |

Mac-specific tip: Close memory-hungry apps (Chrome, Docker) before running large models. Apple Silicon shares memory between CPU and GPU, so there's no separate VRAM pool.
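Because everything shares one pool, a quick budget check tells you whether a model will fit comfortably. The ~4 GB OS/app overhead below is a rough assumption — measure your real baseline with `memory_pressure`:

```shell
# Unified memory budget: model + KV cache + OS overhead must fit in total RAM.
# The os_gb figure is a rough assumption, not a measured value.
total_gb=16          # e.g. M1 Pro 16GB
model_gb=7           # 12B Q4_K_M file size
kv_gb=2              # generous KV cache allowance at 8K context
os_gb=4

need=$(( model_gb + kv_gb + os_gb ))
if [ "$need" -le "$total_gb" ]; then
  echo "fits with $(( total_gb - need )) GB headroom"
else
  echo "over budget by $(( need - total_gb )) GB - close apps or use a smaller quant"
fi
```

Run the same numbers for 27B Q4 (~16 GB file) on a 16GB machine and the budget fails immediately, which matches the table above.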

Windows (NVIDIA CUDA)

# Make sure CUDA is actually being used
# In Ollama, check with:
ollama ps

# Common Windows issue: power settings
# Set to "High Performance" in Windows power options
# Laptop GPUs throttle aggressively on "Balanced"

# For llama.cpp on Windows:
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release

Windows-specific tip: Disable Windows Defender real-time scanning for your model directory. It scans every file read and can tank performance:

# PowerShell (admin)
Add-MpPreference -ExclusionPath "C:\Users\you\models"

Linux (NVIDIA or AMD)

# NVIDIA: Make sure persistence mode is on
sudo nvidia-smi -pm 1

# Set GPU to max performance
sudo nvidia-smi -ac 1215,1410  # Values vary by GPU

# AMD: Check ROCm is active
rocm-smi

# For both: ensure you're not running Wayland compositor
# X11 has less GPU overhead for compute tasks

Quick Speed Checklist

Run through this checklist to maximize speed:

1. [ ] GPU acceleration is working (not CPU fallback)
2. [ ] Using Q4_K_M quantization (unless quality is critical)
3. [ ] Context length set to what you actually need (not 256K default)
4. [ ] KV cache quantized (--cache-type-k q8_0)
5. [ ] Flash Attention enabled (if available)
6. [ ] No memory-hungry background apps
7. [ ] Power settings on "High Performance" (laptops)
8. [ ] Latest drivers installed

When Slow Is Expected

Sometimes Gemma 4 is slow and that's just how it is:

  • First token latency: The first token always takes longer (prompt processing). This is normal.
  • Very long prompts: Processing a 100K token input takes time no matter what.
  • 27B model on 16GB: It fits, but barely. Consider the 12B instead.
  • CPU-only inference: Without a GPU, expect 1-5 tok/s. That's the reality of running LLMs on CPU.
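To put a number on "slow," measure it yourself: tokens per second is just generated tokens divided by wall-clock time. The readings below are samples for illustration; take yours from your runtime's timing output (llama.cpp prints eval timings, and `ollama run --verbose` reports an eval rate):

```shell
# Tokens/second = generated tokens / elapsed wall-clock time.
# The numbers below are sample readings, not a real benchmark.
tokens=96
elapsed_ms=48000    # 48 seconds

tok_per_s=$(( tokens * 1000 / elapsed_ms ))
echo "${tok_per_s} tok/s"   # 2 tok/s: squarely in CPU-only territory
```

Compare your measured figure against the Mac table above (or the 1-5 tok/s CPU range) to tell whether you have a configuration problem or expected hardware limits.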

If you're experiencing issues beyond just speed, like crashes or errors, check our troubleshooting guide for solutions to OOM errors, GPU detection problems, and more.

Next Steps

Speed optimization is mostly about getting the basics right. Fix CPU fallback, pick the right quantization, and set a reasonable context length — those three changes alone will solve 90% of performance complaints.

Gemma 4 AI
