Gemma 4 GGUF: Which Quantization Should I Pick?

Apr 7, 2026

GGUF quantization is how you shrink Gemma 4 from a 24GB behemoth into something that actually fits on your hardware. But with a dozen different quantization levels to choose from, picking the right one is confusing. This guide cuts through the noise and tells you exactly which format to use.

What Is GGUF?

GGUF is a file format, the successor to the older GGML format, designed specifically for running large language models on consumer hardware with llama.cpp and compatible tools. It stores model weights in compressed formats that trade a small amount of quality for dramatically smaller file sizes and faster inference.

The key concept is quantization — reducing the precision of model weights from 16-bit floating point (FP16) down to 8-bit, 4-bit, or even lower. Lower precision = smaller file = faster inference = slightly less accurate.
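
To make the idea concrete, here is a toy sketch of block quantization in the same spirit as GGUF's 4-bit formats. This is illustrative only: the real Q4_K layout is more elaborate (super-blocks, per-block minimums, 6-bit scales), but the core trade is the same — store small integers plus a scale instead of full floats.

```python
# Toy 4-bit symmetric block quantization. Real GGUF Q4 formats differ in
# detail, but the principle is identical: int weights + a per-block scale.

def quantize_block(weights):
    """Map a block of floats to 4-bit ints in [-7, 7] plus one float scale."""
    scale = max(abs(w) for w in weights) / 7 or 1.0
    q = [round(w / scale) for w in weights]
    return scale, q

def dequantize_block(scale, q):
    """Recover approximate floats from the quantized block."""
    return [scale * x for x in q]

block = [0.12, -0.53, 0.90, -0.07]
scale, q = quantize_block(block)
restored = dequantize_block(scale, q)
# Each restored weight lands within one quantization step (scale/2)
# of the original -- that rounding error is the "quality loss".
```

Storing each weight as 4 bits plus a shared scale is what shrinks a 16-bit model to roughly a quarter of its size.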

Quantization Levels Compared

Here's the complete comparison for Gemma 4 12B:

| Quantization | File Size | VRAM Needed | Speed (tok/s)* | Quality Loss | Best For |
| --- | --- | --- | --- | --- | --- |
| FP16 | ~24 GB | ~26 GB | Baseline | None | Research, fine-tuning |
| Q8_0 | ~13 GB | ~15 GB | 1.2x faster | Negligible | Quality-critical tasks |
| Q6_K | ~10 GB | ~12 GB | 1.4x faster | Very small | Balance of quality and size |
| Q5_K_M | ~8.5 GB | ~10 GB | 1.6x faster | Small | Better quality daily driver |
| Q5_K_S | ~8 GB | ~10 GB | 1.6x faster | Small | Slightly smaller Q5 |
| Q4_K_M | ~7 GB | ~9 GB | 1.8x faster | Moderate | Most users' best choice |
| Q4_K_S | ~6.5 GB | ~8.5 GB | 1.8x faster | Moderate | Tight VRAM budget |
| IQ4_XS | ~6 GB | ~8 GB | 1.9x faster | Noticeable | Minimum viable quality |
| Q3_K_M | ~5.5 GB | ~7.5 GB | 2.0x faster | Significant | Not recommended |
| Q2_K | ~4.5 GB | ~6.5 GB | 2.1x faster | Severe | Experimentation only |

*Speed relative to FP16 on the same hardware. Actual tok/s varies by GPU.
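
The file sizes above follow roughly from bits-per-weight: parameters × bpw ÷ 8. The bpw figures below are approximations (K-quants store scales and minimums alongside the nominal 4- or 5-bit weights, which is why Q4_K_M is closer to 4.85 bits than 4), so treat this as a back-of-the-envelope estimator, not an exact formula:

```python
# Rough GGUF size estimate from parameter count and bits-per-weight.
# bpw values are approximate; the exact figure depends on the tensor mix.
APPROX_BPW = {
    "FP16": 16.0,
    "Q8_0": 8.5,     # 8-bit weights + one fp16 scale per 32 weights
    "Q6_K": 6.56,
    "Q5_K_M": 5.69,
    "Q4_K_M": 4.85,
    "Q2_K": 3.35,
}

def estimate_size_gb(params_billion, quant):
    """Estimate GGUF file size in decimal GB."""
    total_bits = params_billion * 1e9 * APPROX_BPW[quant]
    return total_bits / 8 / 1e9

print(f"{estimate_size_gb(12, 'Q4_K_M'):.1f} GB")  # in line with ~7 GB in the table
```

The same arithmetic explains the VRAM column: the weights plus a couple of GB of headroom for the KV cache and activations.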

The Recommendations

  • Q4_K_M — Best balance for most people. Quality is surprisingly close to FP16 for everyday tasks like coding, writing, and Q&A. This is the default in most Ollama models.
  • Q5_K_M — Pick this if you have the extra VRAM and want noticeably better quality on complex reasoning tasks.
  • Q8_0 — Near-original quality. Only use if your hardware can handle it — the quality improvement over Q5 is marginal for most tasks.
  • IQ4_XS — The smallest format that's still usable. Great for testing or when you're 1-2 GB short on VRAM.

Avoid Q3 and Q2 — the quality drop is too steep to be useful for anything serious.

Where to Download GGUF Files

Unsloth provides high-quality GGUF conversions for all Gemma 4 models:

# Browse available files
# https://huggingface.co/unsloth/gemma-4-12b-it-GGUF

# Download with huggingface-cli
pip install huggingface_hub
huggingface-cli download unsloth/gemma-4-12b-it-GGUF \
  gemma-4-12b-it-Q4_K_M.gguf \
  --local-dir ./models

# Or download with wget
wget https://huggingface.co/unsloth/gemma-4-12b-it-GGUF/resolve/main/gemma-4-12b-it-Q4_K_M.gguf

Available repos:

| Model | Hugging Face Repo |
| --- | --- |
| Gemma 4 1B | unsloth/gemma-4-1b-it-GGUF |
| Gemma 4 4B | unsloth/gemma-4-4b-it-GGUF |
| Gemma 4 12B | unsloth/gemma-4-12b-it-GGUF |
| Gemma 4 27B | unsloth/gemma-4-27b-it-GGUF |
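
If you're scripting downloads across several sizes or quantizations, the repo and file names above follow a regular pattern, so the direct-download URLs (the same `resolve/main` scheme the wget example uses) can be built programmatically:

```python
# Build direct download URLs for the Unsloth GGUF repos listed above.
def gguf_url(size: str, quant: str) -> str:
    """URL for an Unsloth Gemma 4 GGUF file, e.g. size='12b', quant='Q4_K_M'."""
    repo = f"unsloth/gemma-4-{size}-it-GGUF"
    fname = f"gemma-4-{size}-it-{quant}.gguf"
    return f"https://huggingface.co/{repo}/resolve/main/{fname}"

print(gguf_url("12b", "Q4_K_M"))
# → https://huggingface.co/unsloth/gemma-4-12b-it-GGUF/resolve/main/gemma-4-12b-it-Q4_K_M.gguf
```

Feed the result to wget, curl, or any HTTP client; for resumable downloads, `huggingface-cli download` as shown above is still the more robust option.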

Running GGUF Files

With llama.cpp

The most direct way to run GGUF files:

# Clone and build llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON  # or -DGGML_METAL=ON on macOS
cmake --build build

# Run inference
./build/bin/llama-server \
  -m ./models/gemma-4-12b-it-Q4_K_M.gguf \
  -ngl 999 \
  -c 8192 \
  --host 0.0.0.0 \
  --port 8080

# Now you have an OpenAI-compatible API at http://localhost:8080
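
Once llama-server is up, any OpenAI-style client can talk to it. A minimal stdlib-only sketch, assuming the server started above is listening on localhost:8080:

```python
import json
import urllib.request

def build_chat_request(prompt, base_url="http://localhost:8080"):
    """Build an OpenAI-style chat completion request for llama-server."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,
        "max_tokens": 256,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

if __name__ == "__main__":
    # Requires the llama-server from above to be running locally.
    req = build_chat_request("Explain GGUF quantization in one sentence.")
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    print(body["choices"][0]["message"]["content"])
```

Because the endpoint mimics the OpenAI schema, the official `openai` Python package also works by pointing its `base_url` at the server.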

With Ollama

Ollama uses GGUF under the hood. You can create custom models from GGUF files:

# Method 1: Use pre-built Ollama models (easiest)
ollama run gemma4:12b

# Method 2: Import your own GGUF file
# Create a Modelfile
cat > Modelfile << 'EOF'
FROM ./gemma-4-12b-it-Q4_K_M.gguf
PARAMETER temperature 0.7
PARAMETER num_ctx 8192
TEMPLATE """{{ .System }}
{{ .Prompt }}"""
EOF

# Create the model
ollama create my-gemma4 -f Modelfile
ollama run my-gemma4
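
If you import several GGUF files (say, to compare quantization levels), generating the Modelfile beats editing it by hand. A small sketch that renders the same Modelfile as the heredoc above; the parameter values are just the ones used in that example:

```python
def render_modelfile(gguf_path, temperature=0.7, num_ctx=8192):
    """Render a minimal Ollama Modelfile for a local GGUF file."""
    return (
        f"FROM {gguf_path}\n"
        f"PARAMETER temperature {temperature}\n"
        f"PARAMETER num_ctx {num_ctx}\n"
        'TEMPLATE """{{ .System }}\n{{ .Prompt }}"""\n'
    )

print(render_modelfile("./gemma-4-12b-it-Q4_K_M.gguf"))
```

Write the output to a file named `Modelfile`, then run `ollama create` against it as shown above.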

With LM Studio

LM Studio provides a GUI for downloading and running GGUF files:

  1. Open LM Studio
  2. Search for "gemma 4" in the model browser
  3. Select the quantization level you want
  4. Click Download
  5. Go to the Chat tab and select your model
  6. Start chatting

LM Studio also exposes a local API compatible with the OpenAI format, so you can use it as a drop-in backend for applications expecting an OpenAI-style endpoint.

Quality vs Speed: Real-World Testing

Here's how different quantizations perform on actual tasks with Gemma 4 12B:

| Task | Q4_K_M | Q5_K_M | Q8_0 | FP16 |
| --- | --- | --- | --- | --- |
| Code generation | 92% match | 95% match | 98% match | 100% (baseline) |
| Creative writing | Minor diffs | Near identical | Identical | Baseline |
| Math reasoning | ~85% accuracy | ~90% accuracy | ~95% accuracy | ~96% accuracy |
| Summarization | Very close | Very close | Identical | Baseline |
| Translation | Small quality drop | Near identical | Identical | Baseline |

For most users, Q4_K_M is the sweet spot. You lose a few percentage points on hard math and complex reasoning, but for coding, writing, summarization, and general Q&A, the difference is barely noticeable.

Choosing by Hardware

| Your Hardware | Recommended Quant | Model Size |
| --- | --- | --- |
| 8GB VRAM GPU | Q4_K_M or IQ4_XS | 12B |
| 12GB VRAM GPU | Q5_K_M or Q6_K | 12B |
| 16GB VRAM GPU | Q8_0 | 12B |
| 24GB VRAM GPU | Q8_0 (12B) or Q4_K_M (27B) | 12B or 27B |
| 16GB Mac | Q4_K_M | 12B |
| 32GB Mac | Q5_K_M (12B) or Q4_K_M (27B) | 12B or 27B |
| 64GB+ Mac | Q8_0 for any size | 27B |
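
For the 12B model, the table collapses to a simple rule of thumb. The thresholds below come from the VRAM column of the comparison table earlier and are assumptions, not hard limits — long contexts need extra headroom for the KV cache:

```python
# Pick a quantization for Gemma 4 12B from available VRAM (GB), using the
# approximate VRAM figures from the comparison table. Rough guidance only.
def pick_quant_12b(vram_gb):
    for quant, needed in [("Q8_0", 15), ("Q6_K", 12), ("Q5_K_M", 10),
                          ("Q4_K_M", 9), ("IQ4_XS", 8)]:
        if vram_gb >= needed:
            return quant
    return None  # under ~8 GB: consider the 4B model instead

print(pick_quant_12b(16))  # → Q8_0
print(pick_quant_12b(8))   # → IQ4_XS
```

The same ladder works for the other model sizes once you scale the VRAM thresholds to their file sizes.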

Next Steps

The bottom line: start with Q4_K_M. If you notice quality issues on your specific tasks, step up to Q5_K_M. Only go higher if you have the VRAM to spare and genuinely need the extra precision.
