Gemma 4 Not Working? Common Fixes for OOM, Slow Speed & GPU Issues

Apr 7, 2026

So Gemma 4 isn't working the way you expected. Don't worry — most issues have straightforward fixes. This guide covers the real problems people hit, sourced from Reddit threads, GitHub issues, and community forums.

Let's troubleshoot.

Problem 1: Out of Memory (OOM)

Symptoms: Your system freezes, the process gets killed, or you see errors like CUDA out of memory, mmap failed, or the system just starts swapping like crazy.

Why it happens: The model weights + KV cache exceed your available RAM or VRAM.
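To see why, it helps to run the numbers. Here's a back-of-envelope KV-cache estimate; the layer count, head dimensions, and context below are made-up illustration values, not Gemma 4's actual architecture:

```shell
# KV cache bytes = 2 (K and V) x layers x kv_heads x head_dim x context x bytes/elem
# All dimensions below are hypothetical, chosen only to illustrate the formula.
layers=48; kv_heads=8; head_dim=128; ctx=8192; bytes=2   # fp16 = 2 bytes/element
kv=$((2 * layers * kv_heads * head_dim * ctx * bytes))
echo "KV cache at ${ctx} tokens: $((kv / 1024 / 1024)) MiB"
# → KV cache at 8192 tokens: 1536 MiB
```

The cache scales linearly with context length, which is why Fix 3 below (reducing context) is so effective.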

Fix 1: Use a Smaller Model

The most reliable fix. If you're trying to run 31B on 16 GB RAM, it's just not going to work.

# Instead of this (needs ~20GB)
ollama run gemma4:31b

# Try this (needs ~6GB)
ollama run gemma4:e4b

Check our model comparison guide to find the right size for your hardware.

Fix 2: Use a More Aggressive Quantization

If you're loading GGUF files, grab a smaller quantization. Our GGUF guide explains all the quantization options in detail.

# Q4_K_M is much smaller than Q8 or FP16
huggingface-cli download google/gemma-4-26b-GGUF \
  --include "gemma-4-26b-Q4_K_M.gguf"

Quantization   Memory Savings (vs FP16)   Quality Impact
Q4_K_M         ~75% smaller               Minimal
Q5_K_M         ~65% smaller               Very small
Q8_0           ~50% smaller               Negligible

Fix 3: Reduce Context Length

The KV cache grows with context length. Gemma 4 supports up to 262K tokens, but that cache is massive — community reports show the 31B model's KV cache alone can eat ~22 GB at full context.

# Limit context to 4K inside the interactive session
ollama run gemma4:31b
>>> /set parameter num_ctx 4096

In LM Studio, go to Settings and reduce the "Context Length" slider.

Fix 4: Enable KV Cache Quantization

Some backends support quantizing the KV cache itself, which dramatically reduces memory:

# In llama.cpp (quantizing the V cache requires flash attention)
./llama-server -m gemma4-31b-Q4_K_M.gguf \
  --ctx-size 8192 \
  --flash-attn \
  --cache-type-k q8_0 \
  --cache-type-v q8_0

Fix 5: Close Other Apps

Sounds obvious, but Chrome alone can eat 4-8 GB of RAM. Close browsers, IDEs, and other heavy apps before running large models.
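To find out what's actually eating RAM, one quick check (GNU ps on Linux; on macOS, `top -o mem` gives a similar view):

```shell
# Show the header plus the five biggest memory consumers
ps aux --sort=-%mem | head -n 6
```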

Problem 2: Slow Inference

Symptoms: Tokens come out painfully slow — like 1-2 tokens per second when you expected 20+. For a comprehensive walkthrough of every speed optimization available, see our speed optimization guide.

Fix 1: Check If GPU Is Actually Being Used

This is the number one cause of slow inference. The model might be running entirely on CPU.

# Check if Ollama is using GPU
ollama ps

Look at the "PROCESSOR" column. If it says "CPU" instead of showing your GPU, that's your problem.

Fix 2: Make Sure GPU Offloading Is Enabled

For Ollama, GPU offloading should be automatic, but sometimes it doesn't detect your GPU:

# Check what the server detected at startup (Linux/systemd)
journalctl -u ollama | grep -i gpu

# Force all layers onto the GPU (inside the run session)
ollama run gemma4:e4b
>>> /set parameter num_gpu 999

For llama.cpp, use the -ngl flag:

# Offload all layers to GPU
./llama-cli -m gemma4-e4b-Q4_K_M.gguf -ngl 999

Fix 3: You Might Be CPU-Bottlenecked

If the model doesn't fully fit in VRAM, some layers run on CPU, creating a bottleneck. Options:

  • Use a smaller model that fits entirely in VRAM
  • Use a smaller quantization (Q4 instead of Q8)
  • Reduce context length to free up VRAM for model layers
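If you want partial offloading instead, here's a rough way to pick an `-ngl` value; the model size and layer count are purely illustrative:

```shell
# If an ~18 GB quantized model has 62 layers and you have 12 GB of VRAM,
# roughly vram/model_size of the layers fit on the GPU. Numbers are illustrative.
model_gb=18; layers=62; vram_gb=12
fit=$((vram_gb * layers / model_gb))
echo "Try -ngl $fit ($fit of $layers layers on GPU)"
# → Try -ngl 41 (41 of 62 layers on GPU)
```

In practice, leave a couple of GB of headroom for the KV cache and scratch buffers, so round down.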

Fix 4: Check Your Power Settings

On laptops, power-saving mode throttles both CPU and GPU. Make sure you're on "High Performance" or plugged in.

On Mac:

# Check if low power mode is active
pmset -g | grep lowpowermode

Problem 3: GPU Not Detected

NVIDIA Users

Check CUDA drivers:

# Verify CUDA is installed and working
nvidia-smi

If nvidia-smi doesn't work or shows an error:

  1. Install or update NVIDIA drivers from nvidia.com/drivers
  2. Install CUDA Toolkit from developer.nvidia.com/cuda-downloads
  3. Restart your machine
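Once drivers are in place, this sketch prints a one-line summary (the query flags are standard `nvidia-smi` options):

```shell
# One-line GPU summary: name, total VRAM, driver version.
# Falls back to a hint if the driver isn't installed yet.
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi --query-gpu=name,memory.total,driver_version --format=csv,noheader
else
  echo "nvidia-smi not found - install or update the NVIDIA driver first"
fi
```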

Check that Ollama sees the GPU:

# The server log lists detected GPUs at startup (Linux/systemd)
journalctl -u ollama | grep -i -E 'gpu|cuda'

AMD Users

AMD GPU support requires ROCm, and it's more finicky:

  1. Install ROCm: follow the ROCm installation guide
  2. Make sure you have a supported GPU (RX 7000 series works best)
  3. Use the ROCm-compatible build of your inference engine

# Check ROCm installation
rocminfo | head -20

Known issue: Some AMD GPUs (especially older ones) aren't supported. Check the ROCm compatibility list.

Mac Users (Apple Silicon)

Good news — Metal acceleration is enabled by default in Ollama and llama.cpp on Apple Silicon. If it's not working:

# Check that Metal is available
system_profiler SPDisplaysDataType | grep Metal

If it shows "Metal: Supported" you're good. Ollama should automatically use Metal acceleration on M1/M2/M3/M4 Macs.

Problem 4: Model Download Gets Stuck

Ollama Download Stuck

# Cancel and retry
# Ctrl+C to stop, then:
ollama pull gemma4:e4b

If it keeps getting stuck:

  • Check your internet connection
  • Try a different network (VPN might help or hurt)
  • Check disk space: df -h
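Since Ollama resumes partial downloads, a blunt but effective approach is to just retry in a loop. A minimal sketch:

```shell
# Re-run a command up to 3 times, pausing between attempts.
retry() {
  n=0
  until "$@"; do
    n=$((n + 1))
    [ "$n" -ge 3 ] && { echo "giving up after $n attempts" >&2; return 1; }
    echo "attempt $n failed, retrying in 10s..." >&2
    sleep 10
  done
}

# Usage: retry ollama pull gemma4:e4b
```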

Hugging Face Download Stuck

# Enable faster downloads
pip install hf-transfer
HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download google/gemma-4-e4b

If you're in a region with slow access to Hugging Face, try a mirror or download during off-peak hours.
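Pointing the CLI at a mirror is just a configuration fragment: `huggingface_hub` honors the `HF_ENDPOINT` environment variable. The mirror URL below is only an example; substitute whichever mirror you trust:

```shell
# Route downloads through a mirror (example URL; use one you trust)
HF_ENDPOINT=https://hf-mirror.com \
  huggingface-cli download google/gemma-4-e4b
```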

Not Enough Disk Space

# Check available space
df -h

# Clean up old Ollama models
ollama list          # See what's installed
ollama rm modelname  # Remove ones you don't need

For reference, here's how much space you need:

Model   Disk Space (Q4_K_M)
E2B     ~1.5 GB
E4B     ~3 GB
26B     ~8 GB
31B     ~18 GB
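You can automate the check; the 20 GiB threshold below is arbitrary, and the path falls back to `$HOME` (Ollama's default model directory is `~/.ollama/models`, overridable via `OLLAMA_MODELS`):

```shell
# Warn when the filesystem holding your models is low on space
dir="${OLLAMA_MODELS:-$HOME}"
avail_gb=$(df -Pk "$dir" | awk 'NR==2 {print int($4 / 1024 / 1024)}')
echo "Available under $dir: ${avail_gb} GiB"
[ "$avail_gb" -ge 20 ] || echo "Low disk space: the 31B model (~18 GB) will not fit"
```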

Problem 5: Ollama-Specific Errors

"Error: model not found"

Make sure you're using the correct model name:

# Correct
ollama run gemma4
ollama run gemma4:e4b

# Wrong (common mistakes)
ollama run gemma-4     # Hyphen doesn't work
ollama run google/gemma4  # Don't include org name

Tokenizer Issues

There have been reports of tokenizer-related bugs with Gemma 4 in early versions of llama.cpp. If you're getting garbled output:

# Update Ollama to the latest version
# macOS
brew upgrade ollama

# Linux
curl -fsSL https://ollama.com/install.sh | sh

The fix was merged into llama.cpp and Ollama picked it up in recent releases. Make sure you're on the latest version.

"Unexpected token" or Parsing Errors

This usually means the GGUF file is corrupted or incompatible:

# Remove and re-download the model
ollama rm gemma4:e4b
ollama pull gemma4:e4b

Problem 6: Running on CPU Instead of GPU

This is a known issue (referenced in GitHub issue #15237 for Ollama). The model loads but runs on CPU even though you have a GPU.

Diagnosis

# Check what Ollama is using
ollama ps
# Look at the PROCESSOR column

Solutions

Step 1: Update Ollama to the latest version (many GPU detection bugs have been fixed):

brew upgrade ollama  # macOS
# Or re-run the install script on Linux

Step 2: Point Ollama at the right GPU, then force full offload:

# NVIDIA: make sure the server sees the GPU
export CUDA_VISIBLE_DEVICES=0
ollama run gemma4:e4b

# Inside the session, force all layers onto the GPU
>>> /set parameter num_gpu 999

Step 3: Check if the model is too large for your GPU:

If the model doesn't fit in VRAM, Ollama might fall back to CPU entirely instead of doing partial offloading. Try a smaller model or quantization.

Step 4: Restart the Ollama service:

# macOS
brew services restart ollama

# Linux (systemd)
sudo systemctl restart ollama

Troubleshooting Decision Tree

Not sure where to start? Follow this:

  1. Is the model downloading?

    • No → Check internet, disk space, model name spelling
    • Yes → Continue
  2. Does it start running?

    • No, OOM error → Use smaller model or quantization, reduce context length
    • No, other error → Update Ollama, check model name, re-download
    • Yes → Continue
  3. Is it using the GPU?

    • No → Check drivers (NVIDIA: nvidia-smi, AMD: rocminfo), update Ollama, set env vars
    • Yes → Continue
  4. Is it fast enough?

    • No → Check power settings, close other apps, try smaller quantization
    • Yes → You're good!
  5. Is the output quality bad?

    • Garbled text → Update Ollama (tokenizer fix), re-download model
    • Low quality → Try a larger model or less aggressive quantization
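If you'd rather gather everything in one go, this sketch collects the facts the decision tree asks about (Linux-leaning; macOS equivalents are noted in the comments):

```shell
# One-shot diagnostic: disk, RAM, and GPU tooling in a single run
echo "== Disk =="; df -h . | tail -n 1
echo "== RAM ==";  free -h 2>/dev/null || vm_stat 2>/dev/null | head -n 3  # vm_stat on macOS
echo "== GPU =="
command -v nvidia-smi >/dev/null 2>&1 && nvidia-smi --query-gpu=name,memory.used,memory.total --format=csv,noheader
command -v rocminfo  >/dev/null 2>&1 && rocminfo 2>/dev/null | grep -m 1 'Marketing Name'
echo "(nothing under GPU = no NVIDIA/AMD tooling detected)"
```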

Still Stuck?

If none of the above fixed your issue, search the Ollama GitHub issues and community forums such as r/LocalLLaMA; chances are someone has already hit (and solved) the same problem.
