So Gemma 4 isn't working the way you expected. Don't worry — most issues have straightforward fixes. This guide covers the real problems people hit, sourced from Reddit threads, GitHub issues, and community forums.
Let's troubleshoot.
Problem 1: Out of Memory (OOM)
Symptoms: Your system freezes, the process gets killed, or you see errors like CUDA out of memory, mmap failed, or the system just starts swapping like crazy.
Why it happens: The model weights + KV cache exceed your available RAM or VRAM.
Fix 1: Use a Smaller Model
The most reliable fix. If you're trying to run 31B on 16 GB RAM, it's just not going to work.
# Instead of this (needs ~20GB)
ollama run gemma4:31b
# Try this (needs ~6GB)
ollama run gemma4:e4b
Check our model comparison guide to find the right size for your hardware.
Fix 2: Use a More Aggressive Quantization
If you're loading GGUF files, grab a smaller quantization. Our GGUF guide explains all the quantization options in detail.
# Q4_K_M is much smaller than Q8 or FP16
huggingface-cli download google/gemma-4-26b-GGUF \
--include "gemma-4-26b-Q4_K_M.gguf"
| Quantization | Memory Savings | Quality Impact |
|---|---|---|
| Q4_K_M | ~75% smaller | Minimal |
| Q5_K_M | ~65% smaller | Very small |
| Q8_0 | ~50% smaller | Negligible |
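The savings in the table follow directly from bits per weight: FP16 stores 16 bits per parameter, while Q4_K_M averages roughly 4.5. Here's a quick sketch of the arithmetic (the parameter count is illustrative):

```shell
# Size math (sketch): Q4_K_M averages ~4.5 bits/weight vs 16 bits for FP16
params_b=26                        # parameter count in billions (illustrative)
fp16_gb=$(( params_b * 2 ))        # 16 bits = 2 bytes per weight
q4_gb=$(( params_b * 9 / 16 ))     # 4.5 bits = 9/16 of a byte per weight
echo "FP16 ~${fp16_gb} GB vs Q4_K_M ~${q4_gb} GB"
```

That works out to roughly the ~75% reduction shown above; real file sizes vary a little because some tensors are kept at higher precision.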
Fix 3: Reduce Context Length
The KV cache grows with context length. Gemma 4 supports up to 262K tokens, but that cache is massive — community reports show the 31B model's KV cache alone can eat ~22 GB at full context.
# Limit context to 4K or 8K
ollama run gemma4:31b --ctx-size 4096
In LM Studio, go to Settings and reduce the "Context Length" slider.
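To see why context length dominates memory, here's the back-of-envelope formula: KV cache ≈ 2 (K and V) × layers × KV heads × head dimension × context length × bytes per value. The layer and head counts below are illustrative placeholders, not Gemma 4's actual architecture:

```shell
# KV cache estimate: 2 (K+V) x layers x kv_heads x head_dim x ctx x bytes/value
# NOTE: layers/kv_heads/head_dim are illustrative, not Gemma 4's real config
layers=48; kv_heads=8; head_dim=128; ctx=262144; bytes=2   # fp16 cache
kv_bytes=$(( 2 * layers * kv_heads * head_dim * ctx * bytes ))
echo "KV cache at ${ctx} tokens: $(( kv_bytes / 1024 / 1024 / 1024 )) GiB"
```

Drop ctx from 262144 to 8192 and the same cache shrinks 32x, which is why reducing context length is such an effective fix.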
Fix 4: Enable KV Cache Quantization
Some backends support quantizing the KV cache itself, which dramatically reduces memory:
# In llama.cpp
./llama-server -m gemma4-31b-Q4_K_M.gguf \
--ctx-size 8192 \
--cache-type-k q8_0 \
--cache-type-v q8_0
Fix 5: Close Other Apps
Sounds obvious, but Chrome alone can eat 4-8 GB of RAM. Close browsers, IDEs, and other heavy apps before running large models.
Problem 2: Slow Inference
Symptoms: Tokens come out painfully slow — like 1-2 tokens per second when you expected 20+. For a comprehensive walkthrough of every speed optimization available, see our speed optimization guide.
Fix 1: Check If GPU Is Actually Being Used
This is the number one cause of slow inference. The model might be running entirely on CPU.
# Check if Ollama is using GPU
ollama ps
Look at the "PROCESSOR" column. If it says "CPU" instead of showing your GPU, that's your problem.
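If you want to script that check, here's a minimal sketch. The check_processor helper and the sample output are hypothetical; in practice you'd feed it the real ollama ps output:

```shell
# Sketch: flag CPU fallback given `ollama ps` output (helper name is ours)
check_processor() {
  if printf '%s\n' "$1" | grep -q 'CPU'; then
    echo "CPU fallback detected: check GPU drivers and offload settings"
  else
    echo "GPU in use"
  fi
}

# Illustrative sample; in practice: check_processor "$(ollama ps)"
check_processor "NAME        SIZE   PROCESSOR
gemma4:e4b  6 GB   100% GPU"
```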
Fix 2: Make Sure GPU Offloading Is Enabled
For Ollama, GPU offloading should be automatic, but sometimes it doesn't detect your GPU:
# Check available GPUs
ollama show --system
# Force GPU layers (all layers)
OLLAMA_NUM_GPU=999 ollama run gemma4:e4b
For llama.cpp, use the -ngl flag:
# Offload all layers to GPU
./llama-cli -m gemma4-e4b-Q4_K_M.gguf -ngl 999
Fix 3: You Might Be CPU-Bottlenecked
If the model doesn't fully fit in VRAM, some layers run on CPU, creating a bottleneck. Options:
- Use a smaller model that fits entirely in VRAM
- Use a smaller quantization (Q4 instead of Q8)
- Reduce context length to free up VRAM for model layers
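A quick way to reason about the first option: compare your free VRAM against the model's on-disk size plus some headroom for the KV cache. The numbers below are placeholders; on NVIDIA you'd get the real free-VRAM figure from nvidia-smi:

```shell
# Sketch: will the model fit entirely in VRAM? (values are placeholders)
free_vram_mib=16384   # e.g. from: nvidia-smi --query-gpu=memory.free --format=csv,noheader,nounits
model_mib=18432       # ~18 GB for 31B at Q4_K_M, with KV cache still on top
if [ "$model_mib" -le "$free_vram_mib" ]; then
  echo "fits fully in VRAM"
else
  echo "partial offload: expect a CPU bottleneck"
fi
```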
Fix 4: Check Your Power Settings
On laptops, power-saving mode throttles both CPU and GPU. Make sure you're on "High Performance" or plugged in.
On Mac:
# Check if low power mode is active
pmset -g | grep lowpowermode
Problem 3: GPU Not Detected
NVIDIA Users
Check CUDA drivers:
# Verify CUDA is installed and working
nvidia-smi
If nvidia-smi doesn't work or shows an error:
- Install or update NVIDIA drivers from nvidia.com/drivers
- Install CUDA Toolkit from developer.nvidia.com/cuda-downloads
- Restart your machine
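The steps above can be rolled into a tiny sanity script. This is only a sketch: it reports what it finds and hints at the likely fix:

```shell
# Sketch: NVIDIA driver sanity check
if command -v nvidia-smi >/dev/null 2>&1; then
  gpu_info=$(nvidia-smi --query-gpu=name,driver_version --format=csv,noheader 2>/dev/null) \
    || gpu_info="nvidia-smi present but failing: reinstall the driver, then reboot"
else
  gpu_info="nvidia-smi not found: install or update the NVIDIA driver, then reboot"
fi
echo "$gpu_info"
```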
Check that Ollama sees the GPU:
# Should show your GPU
ollama show --system
AMD Users
AMD GPU support requires ROCm, and it's more finicky:
- Install ROCm: follow the ROCm installation guide
- Make sure you have a supported GPU (RX 7000 series works best)
- Use the ROCm-compatible build of your inference engine
# Check ROCm installation
rocminfo | head -20
Known issue: Some AMD GPUs (especially older ones) aren't supported. Check the ROCm compatibility list.
Mac Users (Apple Silicon)
Good news — Metal acceleration is enabled by default in Ollama and llama.cpp on Apple Silicon. If it's not working:
# Check that Metal is available
system_profiler SPDisplaysDataType | grep Metal
If it shows "Metal: Supported", you're good. Ollama should automatically use Metal acceleration on M1/M2/M3/M4 Macs.
Problem 4: Model Download Gets Stuck
Ollama Download Stuck
# Cancel and retry
# Ctrl+C to stop, then:
ollama pull gemma4:e4b
If it keeps getting stuck:
- Check your internet connection
- Try a different network (VPN might help or hurt)
- Check disk space:
df -h
Hugging Face Download Stuck
# Enable faster downloads
pip install hf-transfer
HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download google/gemma-4-e4b
If you're in a region with slow access to Hugging Face, try a mirror or download during off-peak hours.
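Flaky connections often die mid-download, so a small retry wrapper can help; huggingface-cli resumes partially downloaded files, so retries pick up where they left off. The retry helper below is our own sketch, not part of the CLI:

```shell
# Sketch: re-run a command up to 5 times with linear backoff
retry() {
  for i in 1 2 3 4 5; do
    "$@" && return 0
    sleep "$i"
  done
  return 1
}

# Replace the echo with your real command, e.g.:
#   retry huggingface-cli download google/gemma-4-e4b
retry echo "download ok"
```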
Not Enough Disk Space
# Check available space
df -h
# Clean up old Ollama models
ollama list # See what's installed
ollama rm modelname # Remove ones you don't need
For reference, here's how much space you need:
| Model | Disk Space (Q4_K_M) |
|---|---|
| E2B | ~1.5 GB |
| E4B | ~3 GB |
| 26B | ~8 GB |
| 31B | ~18 GB |
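Before pulling a big model, you can script the space check. This sketch compares free space on the current filesystem against the table above:

```shell
# Sketch: check free disk space before a pull (need_gb comes from the table above)
need_gb=18   # 31B at Q4_K_M
free_gb=$(df -Pk . | awk 'NR==2 {print int($4/1024/1024)}')   # available KB -> GB
if [ "$free_gb" -ge "$need_gb" ]; then
  echo "enough space (${free_gb} GB free)"
else
  echo "need ~${need_gb} GB, only ${free_gb} GB free"
fi
```

df -Pk uses the portable POSIX output format, so the awk column index is stable across Linux and macOS.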
Problem 5: Ollama-Specific Errors
"Error: model not found"
Make sure you're using the correct model name:
# Correct
ollama run gemma4
ollama run gemma4:e4b
# Wrong (common mistakes)
ollama run gemma-4 # Hyphen doesn't work
ollama run google/gemma4 # Don't include org name
Tokenizer Issues
There have been reports of tokenizer-related bugs with Gemma 4 in early versions of llama.cpp. If you're getting garbled output:
# Update Ollama to the latest version
# macOS
brew upgrade ollama
# Linux
curl -fsSL https://ollama.com/install.sh | sh
The fix was merged into llama.cpp and Ollama picked it up in recent releases. Make sure you're on the latest version.
"Unexpected token" or Parsing Errors
This usually means the GGUF file is corrupted or incompatible:
# Remove and re-download the model
ollama rm gemma4:e4b
ollama pull gemma4:e4b
Problem 6: Running on CPU Instead of GPU
This is a known issue (referenced in GitHub issue #15237 for Ollama). The model loads but runs on CPU even though you have a GPU.
Diagnosis
# Check what Ollama is using
ollama ps
# Look at the PROCESSOR column
Solutions
Step 1: Update Ollama to the latest version (many GPU detection bugs have been fixed):
brew upgrade ollama # macOS
# Or re-run the install script on Linux
Step 2: Set GPU environment variables explicitly:
# NVIDIA
export CUDA_VISIBLE_DEVICES=0
ollama run gemma4:e4b
# Force GPU usage
OLLAMA_NUM_GPU=999 ollama run gemma4:e4b
Step 3: Check if the model is too large for your GPU:
If the model doesn't fit in VRAM, Ollama might fall back to CPU entirely instead of doing partial offloading. Try a smaller model or quantization.
Step 4: Restart the Ollama service:
# macOS
brew services restart ollama
# Linux (systemd)
sudo systemctl restart ollama
Troubleshooting Decision Tree
Not sure where to start? Follow this:
1. Is the model downloading?
   - No → Check internet, disk space, model name spelling
   - Yes → Continue
2. Does it start running?
   - No, OOM error → Use smaller model or quantization, reduce context length
   - No, other error → Update Ollama, check model name, re-download
   - Yes → Continue
3. Is it using the GPU?
   - No → Check drivers (NVIDIA: nvidia-smi, AMD: rocminfo), update Ollama, set env vars
   - Yes → Continue
4. Is it fast enough?
   - No → Check power settings, close other apps, try smaller quantization
   - Yes → You're good!
5. Is the output quality bad?
   - Garbled text → Update Ollama (tokenizer fix), re-download model
   - Low quality → Try a larger model or less aggressive quantization
Still Stuck?
If none of the above fixed your issue:
- Ollama issues: Check github.com/ollama/ollama/issues and search for your specific error
- llama.cpp issues: Check github.com/ggml-org/llama.cpp/issues
- Reddit: Search r/LocalLLaMA — the community is incredibly helpful and someone has probably hit the same issue
Next Steps
- Pick the right model for your hardware → Which Gemma 4 Model Should I Use?
- Check hardware requirements → Gemma 4 Hardware Guide
- Download or re-download → Gemma 4 Download Guide
- Try the browser version (no install needed) → Google AI Studio Guide