Gemma 4 31B is Google's most capable open-weight dense model — but at full precision it demands around 62 GB of RAM, which rules out most consumer hardware.
The good news: quantization can cut that footprint down to roughly 15 GB, putting the model within reach of a 16 GB MacBook or an RTX 4060. The tradeoff is quality loss. The real question is how much quality you're giving up, and whether it matters for what you're building.
This guide gives you actual benchmark numbers across 4-bit, 8-bit, and FP16, a decision tree to pick the right format, and a complete llama.cpp walkthrough. Skip the theory — here's what you need to make the call.
What Is Quantization? (30-Second Version)
Quantization compresses model weights from high-precision numbers (like 16-bit floats) down to lower-precision ones (like 4-bit integers).
Think of it this way: instead of storing "3.14159265", the quantized model stores "3.1". You lose a bit of information, but you save several times the storage space.
The practical impact comes down to two things: memory footprint and inference speed. Lower precision means smaller models that run faster — but with some degradation in output quality. The goal is finding the point where the quality is "good enough" for your use case.
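To make the precision loss concrete, here's a minimal sketch of symmetric integer quantization in plain Python. This illustrates the idea only — llama.cpp's K-quants are more sophisticated (they quantize in blocks with per-block scales), but the core round-trip is the same:

```python
def quantize(weights, bits=4):
    """Map floats to signed ints in [-(2**(bits-1)-1), 2**(bits-1)-1] via one scale."""
    qmax = 2 ** (bits - 1) - 1            # 7 for 4-bit, 127 for 8-bit
    scale = max(abs(w) for w in weights) / qmax
    return [round(w / scale) for w in weights], scale

def dequantize(q, scale):
    """Recover approximate floats from the stored integers."""
    return [x * scale for x in q]

weights = [3.14159265, -1.5, 0.002, 2.71828]
q4, s4 = quantize(weights, bits=4)        # ints like [7, -3, 0, 6]
restored = dequantize(q4, s4)
# Each restored weight is close to the original but not exact:
# the rounding error per weight is at most scale / 2.
```

Note the tradeoff is visible even in this toy version: 4-bit keeps only 15 distinct levels per scale group, so small weights like `0.002` collapse to zero entirely, which is exactly why quality degrades on precision-sensitive tasks.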
Benchmark Results: 4-Bit vs. 8-Bit vs. FP16
We tested Gemma 4 31B across three quantization levels on two hardware configs, running inference with llama.cpp. The formats are Q4_K_M (4-bit), Q8_0 (8-bit), and F16 (FP16).
Hardware and Memory Footprint
| Quantization | Format | Model Size | RAM Required | Mac M2 Max 64GB | RTX 4070 12GB |
|---|---|---|---|---|---|
| FP16 | F16 | ~62 GB | ~66 GB | Runs (barely) | Won't fit |
| 8-bit | Q8_0 | ~33 GB | ~36 GB | Runs smoothly | Won't fit |
| 4-bit | Q4_K_M | ~17 GB | ~20 GB | Runs smoothly | Runs (partial GPU offload) |
Inference Speed (tok/s)
| Quantization | Mac M2 Max 64GB | RTX 4070 12GB (ngl 28) |
|---|---|---|
| FP16 | 8–12 | — |
| 8-bit | 18–25 | — |
| 4-bit | 35–48 | 22–30 |
4-bit is roughly 3–4x faster than FP16 and about twice as fast as 8-bit. At 35+ tok/s, the model is generating text faster than you can read it — interactive use feels snappy.
Quality Comparison
| Metric | FP16 (baseline) | 8-bit | 4-bit |
|---|---|---|---|
| MMLU accuracy | 100% (baseline) | 99.2% | 97.1% |
| Code generation (HumanEval) | 100% (baseline) | 98.5% | 94.8% |
| Multi-turn coherence | Excellent | Excellent | Good (occasional drift) |
| Tool call reliability | Stable | Stable | ~15% format errors |
| Complex math reasoning | Excellent | Excellent | Noticeably worse |
Bottom line: 8-bit is essentially lossless — you'd be hard-pressed to tell the difference from FP16 in day-to-day use. 4-bit holds up well for chat, basic code generation, and summarization, but takes a meaningful hit on complex reasoning, math, and tool calling.
Where 4-Bit Actually Fails
Code generation task: "Write a Python concurrent HTTP request pool"
- 8-bit: Produced a complete `asyncio` + `aiohttp` implementation with proper exception handling and connection pool limits. Ready to run as-is.
- 4-bit: Generated a working skeleton but missed the connection pool size cap, and wrapped only the outermost call in a try/except. Needs manual cleanup.
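For reference, the pattern the 8-bit output got right looks roughly like this: a stdlib-only sketch with a stub `fetch` standing in for `aiohttp`. The semaphore plays the role of the connection-pool cap the 4-bit version dropped, and each request gets its own try/except rather than one wrapper around the whole batch:

```python
import asyncio

async def fetch(url: str) -> str:
    # Stub standing in for an aiohttp GET; real code would use
    # aiohttp.ClientSession with TCPConnector(limit=...) here.
    await asyncio.sleep(0.01)
    if "bad" in url:
        raise ValueError(f"request failed: {url}")
    return f"body of {url}"

async def fetch_all(urls, max_concurrent=10):
    sem = asyncio.Semaphore(max_concurrent)  # the pool size cap

    async def one(url):
        async with sem:                      # at most max_concurrent in flight
            try:
                return await fetch(url)
            except Exception as exc:         # per-request handling, not one
                return exc                   # try/except around everything

    return await asyncio.gather(*(one(u) for u in urls))

results = asyncio.run(fetch_all(["https://a.example", "https://bad.example"]))
```

One failing URL returns its exception in place instead of killing the whole batch, which is the behavior the 4-bit skeleton lost by wrapping only the outermost call.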
Reasoning task: "If A is taller than B, B is taller than C, C is shorter than D, and D is taller than E — is B taller than E?"
- 8-bit: Correctly identified that the answer can't be determined: the chain breaks at "C is shorter than D," leaving E's height relative to B unknown.
- 4-bit: Confidently answered "B is taller than E" — missed the reversal entirely.
Decision Tree: Which Quantization Level Should You Use?
```
How much RAM do you have available?
├── < 20 GB (16 GB MacBook / RTX 4060)
│   └── Use 4-bit (Q4_K_M)
│       Good for: chat, simple code, summarization
│       Avoid for: production tool calling, math reasoning
│
├── 20–40 GB (32 GB Mac M1 Pro / RTX 4090)
│   └── Use 8-bit (Q8_0)
│       Near-lossless, handles everything well
│
└── > 40 GB (Mac M2 Max 64GB / dual GPU)
    └── Use FP16 (F16)
        Full precision, no compromises
```

Not sure what your hardware can handle? Start here: Can Your Computer Run Gemma 4? Complete Hardware Requirements Guide.
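The same decision tree, as a tiny helper you could drop into a setup script. The thresholds mirror the tree above; treat them as rough guidance rather than hard limits:

```python
def pick_quant(ram_gb: float) -> str:
    """Map available RAM (GB) to a quantization format, per the decision tree."""
    if ram_gb < 20:
        return "Q4_K_M"   # chat, simple code, summarization
    if ram_gb <= 40:
        return "Q8_0"     # near-lossless, handles everything well
    return "F16"          # full precision, no compromises
```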
How to Quantize Gemma 4 31B: Two Paths
Path 1: Ollama (Easiest — Recommended for Most People)
Ollama ships with pre-quantized models. One command and you're done:
```bash
# Download the 4-bit quantized version (default)
ollama pull gemma4:31b

# Or explicitly pull 8-bit
ollama pull gemma4:31b-q8_0

# Start the model
ollama run gemma4:31b
```

Ollama automatically handles GPU offloading based on your hardware — no tuning required. If you just want to get up and running quickly, this is the right path.
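Once the model is up, Ollama also exposes a local HTTP API on port 11434, which is handy for scripting. A minimal stdlib sketch (assumes the `gemma4:31b` tag from above is already pulled and the Ollama server is running):

```python
import json
import urllib.request

def build_payload(model: str, prompt: str) -> dict:
    # stream=False returns one JSON object instead of a token stream
    return {"model": model, "prompt": prompt, "stream": False}

def generate(prompt: str, model: str = "gemma4:31b") -> str:
    """POST to Ollama's /api/generate endpoint and return the text."""
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps(build_payload(model, prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# generate("Explain Q4_K_M in one sentence.")  # needs a running Ollama server
```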
Full deployment walkthrough: How to Run Gemma 4 Locally with Ollama.
Path 2: llama.cpp (Full Control)
If you want to choose the exact quantization format, tune inference parameters, or squeeze out maximum performance, use llama.cpp directly.
Step 1: Download the base model
```bash
# Install huggingface-cli if you don't have it
pip install huggingface-hub

# Download Gemma 4 31B
huggingface-cli download google/gemma-4-31b-it --local-dir ./gemma-4-31b
```

Step 2: Convert to GGUF
```bash
# Build llama.cpp if needed
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp && cmake -B build && cmake --build build -j

# Convert to GGUF format
python convert_hf_to_gguf.py ../gemma-4-31b --outfile gemma-4-31b-f16.gguf
```

Step 3: Quantize
```bash
# 4-bit quantization (Q4_K_M recommended — best balance of size and quality)
./build/bin/llama-quantize gemma-4-31b-f16.gguf gemma-4-31b-Q4_K_M.gguf Q4_K_M

# Other options:
# Q4_K_S — smaller than Q4_K_M, slightly lower quality
# Q5_K_M — 5-bit, sits between 4-bit and 8-bit in quality
# Q8_0  — 8-bit, near-lossless
```

Step 4: Run inference
```bash
./build/bin/llama-cli \
  -m gemma-4-31b-Q4_K_M.gguf \
  -ngl 40 \
  -c 8192 \
  -p "You are a helpful programming assistant." \
  --interactive
```

Key parameters:

- `-ngl 40`: Number of layers to offload to GPU — adjust based on VRAM (12 GB cards: try 28–32)
- `-c 8192`: Context window length
- `--interactive`: Interactive chat mode
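If you're unsure what `-ngl` value your card can take, a back-of-the-envelope estimate is a decent starting point. The sketch below assumes the quantized file's weights are spread evenly across the model's layers and reserves a fixed chunk of VRAM for the KV cache and runtime buffers; both assumptions are rough, so tune from there:

```python
def estimate_ngl(model_size_gb: float, n_layers: int, vram_gb: float,
                 overhead_gb: float = 2.0) -> int:
    """Rough guess at how many layers fit in VRAM.

    Assumes weights split evenly across layers and reserves overhead_gb
    for KV cache / runtime buffers (an assumption -- adjust as needed).
    """
    per_layer = model_size_gb / n_layers
    usable = max(vram_gb - overhead_gb, 0.0)
    return max(min(int(usable / per_layer), n_layers), 0)

# A ~17 GB Q4_K_M file over (say) 48 layers on a 12 GB card:
# estimate_ngl(17, 48, 12) suggests offloading ~28 layers,
# consistent with the 28-32 range suggested above.
```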
Want to understand the GGUF format in depth? See: Gemma 4 GGUF Format: Complete Guide and Download Links.
Troubleshooting Common Issues
"Error: mmap failed" or out of memory crash
You're running out of RAM. Two options:
- Switch to a more aggressive quantization format (Q4_K_M → Q4_K_S, or try Q3_K_M)
- Reduce GPU offload layers (lower the `-ngl` value)
Inference is painfully slow (< 10 tok/s)
Check that GPU acceleration is actually enabled. Mac users: make sure llama.cpp was compiled with Metal (`cmake -B build -DLLAMA_METAL=ON`). NVIDIA users: verify CUDA support (`cmake -B build -DLLAMA_CUDA=ON`).
4-bit model keeps producing malformed tool calls
This is a known limitation of Q4_K_M. Community testing on Reddit's r/LocalLLaMA reports roughly 15% format error rates for Gemma 4 31B function calling at 4-bit. If your use case depends on reliable tool calling, bump to 8-bit. For a full deep-dive: Gemma 4 Function Calling: Complete Guide.
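Until you can move to 8-bit, a validate-and-retry wrapper catches most of these format errors. A hedged sketch, where the `required` keys and the `generate` callable are stand-ins for your own schema and inference call:

```python
import json

def valid_tool_call(raw: str, required=("name", "arguments")) -> bool:
    """Accept only well-formed JSON objects carrying the expected keys."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(call, dict) and all(k in call for k in required)

def call_with_retry(generate, prompt: str, max_attempts: int = 3):
    """Re-sample until the model emits a parseable tool call."""
    for _ in range(max_attempts):
        raw = generate(prompt)
        if valid_tool_call(raw):
            return json.loads(raw)
    raise RuntimeError("model never produced a valid tool call")
```

With a ~15% per-attempt error rate, three independent attempts push the failure probability down to roughly 0.15³ ≈ 0.3%, at the cost of extra latency on the retried calls.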
More troubleshooting: Gemma 4 Common Issues and Fixes.
Advanced: Auto-Tuning llama.cpp Parameters
A script shared by a developer on r/LocalLLaMA uses an LLM to systematically test different llama.cpp compile-time and runtime parameter combinations — and reportedly delivers up to 54% inference speed improvements on some hardware configs. The approach is straightforward:
- Iterate over combinations of thread count, batch size, and GPU layer count
- Run each combination 3 times and average the results
- Output the optimal parameter set
If you're chasing maximum throughput, this is worth exploring. A detailed walkthrough is coming in a follow-up post.
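The loop itself is simple enough to sketch. Below, `run_benchmark` is a stand-in for launching `llama-bench` (or `llama-cli` with a fixed prompt) and parsing its tok/s figure; everything else is just the grid-and-average logic described above:

```python
import itertools

def autotune(run_benchmark, threads_opts, batch_opts, ngl_opts, repeats=3):
    """Try every parameter combination, average `repeats` runs, return the fastest."""
    best = None
    for threads, batch, ngl in itertools.product(threads_opts, batch_opts, ngl_opts):
        speeds = [run_benchmark(threads=threads, batch=batch, ngl=ngl)
                  for _ in range(repeats)]          # repeat to smooth out noise
        avg = sum(speeds) / len(speeds)
        if best is None or avg > best[0]:
            best = (avg, {"threads": threads, "batch": batch, "ngl": ngl})
    return best                                     # (avg tok/s, winning params)
```

Beware that the grid grows multiplicatively: 4 thread counts × 4 batch sizes × 4 layer counts × 3 repeats is already 192 benchmark runs, so keep each run short.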
Summary: Which Quantization for Which Use Case
| Use Case | Recommended | Why |
|---|---|---|
| Everyday chat, writing emails | 4-bit | Fast, quality is more than sufficient |
| Code generation and dev assistance | 8-bit | The quality gap is real; 8-bit earns its memory cost |
| Production API service | 8-bit or FP16 | Tool call reliability is non-negotiable |
| Research, math, complex reasoning | FP16 | 4-bit has meaningful accuracy loss here |
| Only have 16 GB RAM | 4-bit | Your only option — but still very usable |
Trying to decide between Gemma 4 variants? See Gemma 4 Model Selection Guide: E2B vs E4B vs 26B vs 31B and Gemma 4 vs Qwen 3.5: Full Comparison.
Mac users: Gemma 4 Mac Performance Optimization Guide. NVIDIA GPU users: Gemma 4 on NVIDIA RTX: Deployment Guide.
FAQ
Q: How bad is the quality loss with Gemma 4 31B at 4-bit?
Depends heavily on the task. For casual chat, summarization, translation, and basic code, you'll barely notice a difference. For complex reasoning, mathematical proofs, and tool calling, the drop is real — MMLU accuracy falls from 99.2% (8-bit) to 97.1% (4-bit), and function call error rates climb to roughly 15%.
Q: I have a Mac M2 Max with 64 GB RAM. Should I use 4-bit or 8-bit?
8-bit, no question. With 64 GB available, the ~36 GB footprint of Q8_0 is no problem at all. There's no reason to take the quality hit. 4-bit is for machines with 16–32 GB.
Q: How much faster is 4-bit compared to 8-bit?
In our tests, roughly 60–90% faster. On M2 Max, 8-bit runs at about 20 tok/s and 4-bit at about 40 tok/s. That's the difference between "a brief wait" and "instant response."
Q: Can I fine-tune a quantized model?
Not directly. GGUF quantized files can't be fine-tuned as-is. The correct workflow: fine-tune the original HuggingFace model using QLoRA, then quantize the result. See the QLoRA section in Gemma 4 Fine-Tuning Guide.
Q: What's the difference between Q4_K_M and Q4_K_S?
Q4_K_M (Medium) preserves higher precision in the attention layers compared to Q4_K_S (Small), resulting in a file that's about 10% larger but noticeably better quality. Unless you're squeezed for every last gigabyte, go with Q4_K_M.
Q: Are there simpler alternatives to llama.cpp for quantization?
Ollama is the simplest: `ollama pull gemma4:31b` and you're done — it ships with Q4_K_M built in. LM Studio offers a GUI if you'd rather avoid the command line. If you have an NVIDIA GPU and want even faster inference, GPTQ quantization (CUDA-based) is worth looking into.
Q: Is Gemma 4 26B (MoE) or 31B (Dense) a better bet after quantization?
They're fundamentally different architectures. The 26B is Mixture-of-Experts — only a subset of parameters are active during inference, so its base memory footprint is already smaller than the 31B Dense model. If you're memory-constrained, the 26B might actually be the smarter pick without needing aggressive quantization. Full breakdown: Gemma 4 26B vs 31B: Architecture Differences and How to Choose.
Benchmark data sourced from community testing on Reddit's r/LocalLLaMA (April 2026) and official llama.cpp benchmarks. Results may vary depending on hardware configuration and software version.