GGUF quantization is how you shrink Gemma 4 from a 24GB behemoth into something that actually fits on your hardware. But with a dozen different quantization levels to choose from, picking the right one is confusing. This guide cuts through the noise and tells you exactly which format to use.
What Is GGUF?
GGUF (the successor to the older GGML format, from the llama.cpp project) is a file format designed specifically for running large language models on consumer hardware. It stores model weights in compressed form, trading a small amount of quality for dramatically smaller files and faster inference.
The key concept is quantization — reducing the precision of model weights from 16-bit floating point (FP16) down to 8-bit, 4-bit, or even lower. Lower precision = smaller file = faster inference = slightly less accurate.
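To make that concrete, here is a toy sketch of what quantization does to a handful of weights. This is deliberately simplified: real GGUF K-quants use per-block scales and mixed sub-formats, not a single global scale as below.

```python
# Toy illustration of weight quantization (NOT the actual GGUF scheme,
# which uses per-block scales and several sub-formats).
# Map FP32 weights to signed 4-bit integers with one scale factor,
# then dequantize and measure the round-trip error.

def quantize_4bit(weights):
    """Quantize a list of floats to signed 4-bit ints (-8..7) with one scale."""
    scale = max(abs(w) for w in weights) / 7.0
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

weights = [0.12, -0.53, 0.97, -0.08, 0.31]
q, scale = quantize_4bit(weights)
restored = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(q)         # small integers: 4 bits of storage each instead of 32
print(max_err)   # reconstruction error stays below about scale/2
```

Each weight now costs 4 bits instead of 32, at the price of a small reconstruction error; that error, accumulated across billions of weights, is the "quality loss" in the table below.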
Quantization Levels Compared
Here's the complete comparison for Gemma 4 12B:
| Quantization | File Size | VRAM Needed | Speed (tok/s)* | Quality Loss | Best For |
|---|---|---|---|---|---|
| FP16 | ~24 GB | ~26 GB | Baseline | None | Research, fine-tuning |
| Q8_0 | ~13 GB | ~15 GB | 1.2x faster | Negligible | Quality-critical tasks |
| Q6_K | ~10 GB | ~12 GB | 1.4x faster | Very small | Balance of quality and size |
| Q5_K_M | ~8.5 GB | ~10 GB | 1.6x faster | Small | Better quality daily driver |
| Q5_K_S | ~8 GB | ~10 GB | 1.6x faster | Small | Slightly smaller Q5 |
| Q4_K_M | ~7 GB | ~9 GB | 1.8x faster | Moderate | Most users' best choice |
| Q4_K_S | ~6.5 GB | ~8.5 GB | 1.8x faster | Moderate | Tight VRAM budget |
| IQ4_XS | ~6 GB | ~8 GB | 1.9x faster | Noticeable | Minimum viable quality |
| Q3_K_M | ~5.5 GB | ~7.5 GB | 2.0x faster | Significant | Not recommended |
| Q2_K | ~4.5 GB | ~6.5 GB | 2.1x faster | Severe | Experimentation only |
\* Speed relative to FP16 on the same hardware; actual tok/s varies by GPU.
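The file sizes in the table follow almost directly from bits-per-weight: parameters × bpw ÷ 8. A rough estimator (the bpw figures below are approximate averages we're assuming; real GGUF files mix precisions across tensors and add metadata, so expect small deviations):

```python
# Rough GGUF file-size estimate: parameters * bits-per-weight / 8.
# The bpw values are approximate averages (assumed, not official specs);
# K-quants mix several precisions per block, so real files differ slightly.

BITS_PER_WEIGHT = {
    "FP16": 16.0,
    "Q8_0": 8.5,
    "Q6_K": 6.6,
    "Q5_K_M": 5.7,
    "Q4_K_M": 4.8,
    "Q3_K_M": 3.9,
    "Q2_K": 3.3,
}

def estimated_size_gb(n_params: float, quant: str) -> float:
    """Approximate on-disk size in GB for a given parameter count."""
    return n_params * BITS_PER_WEIGHT[quant] / 8 / 1e9

for quant in ("FP16", "Q8_0", "Q4_K_M"):
    print(f"{quant}: ~{estimated_size_gb(12e9, quant):.1f} GB")
```

For 12B parameters this reproduces the table's ballpark figures (~24 GB FP16, ~13 GB Q8_0, ~7 GB Q4_K_M), and it also explains the VRAM column: the loaded weights plus a few GB for context and activations.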
The Recommendations
- Q4_K_M — Best balance for most people. Quality is surprisingly close to FP16 for everyday tasks like coding, writing, and Q&A. This is the default in most Ollama models.
- Q5_K_M — Pick this if you have the extra VRAM and want noticeably better quality on complex reasoning tasks.
- Q8_0 — Near-original quality. Only use if your hardware can handle it — the quality improvement over Q5 is marginal for most tasks.
- IQ4_XS — The smallest format that's still usable. Great for testing or when you're 1-2 GB short on VRAM.
Avoid Q3 and Q2 — the quality drop is too steep to be useful for anything serious.
Where to Download GGUF Files
Unsloth on Hugging Face (Recommended)
Unsloth provides high-quality GGUF conversions for all Gemma 4 models:
# Browse available files
# https://huggingface.co/unsloth/gemma-4-12b-it-GGUF
# Download with huggingface-cli
pip install huggingface_hub
huggingface-cli download unsloth/gemma-4-12b-it-GGUF \
gemma-4-12b-it-Q4_K_M.gguf \
--local-dir ./models
# Or download with wget
wget https://huggingface.co/unsloth/gemma-4-12b-it-GGUF/resolve/main/gemma-4-12b-it-Q4_K_M.gguf
Available repos:
| Model | Hugging Face Repo |
|---|---|
| Gemma 4 1B | unsloth/gemma-4-1b-it-GGUF |
| Gemma 4 4B | unsloth/gemma-4-4b-it-GGUF |
| Gemma 4 12B | unsloth/gemma-4-12b-it-GGUF |
| Gemma 4 27B | unsloth/gemma-4-27b-it-GGUF |
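If you prefer scripting downloads over the CLI, the same fetch works through the `huggingface_hub` Python API. The filename helper below is our own convention, assuming Unsloth's usual `<model>-<quant>.gguf` naming; verify it against the repo's file list before relying on it.

```python
def gguf_filename(model: str, quant: str) -> str:
    # Assumes Unsloth's usual "<model>-<quant>.gguf" naming; check the repo.
    return f"{model}-{quant}.gguf"

def download_gguf(model: str, quant: str, local_dir: str = "./models") -> str:
    """Fetch one quantization of a model; returns the local file path."""
    from huggingface_hub import hf_hub_download  # pip install huggingface_hub
    return hf_hub_download(
        repo_id=f"unsloth/{model}-GGUF",
        filename=gguf_filename(model, quant),
        local_dir=local_dir,
    )

# download_gguf("gemma-4-12b-it", "Q4_K_M")  # ~7 GB download
```

`hf_hub_download` resumes interrupted transfers and caches files, which matters for multi-gigabyte GGUFs.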
Running GGUF Files
With llama.cpp
The most direct way to run GGUF files:
# Clone and build llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON  # or -DGGML_METAL=ON on Mac
cmake --build build
# Run inference
./build/bin/llama-server \
-m ./models/gemma-4-12b-it-Q4_K_M.gguf \
-ngl 999 \
-c 8192 \
--host 0.0.0.0 \
--port 8080
# Now you have an OpenAI-compatible API at http://localhost:8080
With Ollama
Ollama uses GGUF under the hood. You can create custom models from GGUF files:
# Method 1: Use pre-built Ollama models (easiest)
ollama run gemma4:12b
# Method 2: Import your own GGUF file
# Create a Modelfile
cat > Modelfile << 'EOF'
FROM ./gemma-4-12b-it-Q4_K_M.gguf
PARAMETER temperature 0.7
PARAMETER num_ctx 8192
TEMPLATE """{{ .System }}
{{ .Prompt }}"""
EOF
# Create the model
ollama create my-gemma4 -f Modelfile
ollama run my-gemma4
With LM Studio
LM Studio provides a GUI for downloading and running GGUF files:
- Open LM Studio
- Search for "gemma 4" in the model browser
- Select the quantization level you want
- Click Download
- Go to the Chat tab and select your model
- Start chatting
LM Studio also exposes a local API compatible with the OpenAI format, so you can use it as a drop-in backend for applications expecting an OpenAI-style endpoint.
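Because these servers speak the OpenAI wire format, any OpenAI-style client can talk to them. A minimal stdlib-only sketch (LM Studio's default port is 1234; the llama-server example above listens on 8080 — adjust `base_url` accordingly):

```python
# Minimal chat call against a local OpenAI-compatible endpoint.
import json
import urllib.request

def build_payload(model: str, prompt: str, temperature: float = 0.7) -> dict:
    """Assemble an OpenAI-style chat.completions request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
    }

def chat(base_url: str, model: str, prompt: str) -> str:
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_payload(model, prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# chat("http://localhost:1234", "gemma-4-12b-it", "Explain GGUF in one sentence.")
```

The official `openai` Python package works the same way: point its `base_url` at the local server and pass any non-empty string as the API key.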
Quality vs Speed: Real-World Testing
Here's how different quantizations perform on actual tasks with Gemma 4 12B:
| Task | Q4_K_M | Q5_K_M | Q8_0 | FP16 |
|---|---|---|---|---|
| Code generation | 92% match | 95% match | 98% match | 100% (baseline) |
| Creative writing | Minor diffs | Near identical | Identical | Baseline |
| Math reasoning | ~85% accuracy | ~90% accuracy | ~95% accuracy | ~96% accuracy |
| Summarization | Very close | Very close | Identical | Baseline |
| Translation | Small quality drop | Near identical | Identical | Baseline |
For most users, Q4_K_M is the sweet spot. You lose a few percentage points on hard math and complex reasoning, but for coding, writing, summarization, and general Q&A, the difference is barely noticeable.
Choosing by Hardware
| Your Hardware | Recommended Quant | Model Size |
|---|---|---|
| 8GB VRAM GPU | Q4_K_M or IQ4_XS | 12B |
| 12GB VRAM GPU | Q5_K_M or Q6_K | 12B |
| 16GB VRAM GPU | Q8_0 | 12B |
| 24GB VRAM GPU | Q8_0 (12B) or Q4_K_M (27B) | 12B or 27B |
| 16GB Mac | Q4_K_M | 12B |
| 32GB Mac | Q5_K_M (12B) or Q4_K_M (27B) | 12B or 27B |
| 64GB+ Mac | Q8_0 for any size | 27B |
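The table above can be collapsed into a simple lookup. The thresholds below mirror the approximate "VRAM needed" column for the 12B model; as with the table, leave headroom for long contexts and whatever else is using the GPU.

```python
# Map a VRAM budget (GB) to a suggested quantization for Gemma 4 12B,
# mirroring the hardware table. Thresholds follow the approximate
# "VRAM needed" figures and intentionally leave some headroom.

def suggest_quant_12b(vram_gb: float) -> str:
    if vram_gb >= 15:
        return "Q8_0"
    if vram_gb >= 12:
        return "Q6_K"
    if vram_gb >= 10:
        return "Q5_K_M"
    if vram_gb >= 9:
        return "Q4_K_M"
    if vram_gb >= 8:
        return "IQ4_XS"
    return "offload to CPU or use a smaller model"

print(suggest_quant_12b(8))   # IQ4_XS
print(suggest_quant_12b(12))  # Q6_K
print(suggest_quant_12b(24))  # Q8_0
```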
Next Steps
- Need to download models? Check our Download Guide for all the ways to get Gemma 4
- Want more details on hardware requirements? See the Hardware Guide for VRAM calculations by model and quantization
- Downloading from Hugging Face? Read How to Download from Hugging Face for detailed instructions
The bottom line: start with Q4_K_M. If you notice quality issues on your specific tasks, step up to Q5_K_M. Only go higher if you have the VRAM to spare and genuinely need the extra precision.