Gemma 4 gives you two options at the top end: a 26B Mixture of Experts (MoE) model and a 31B Dense model. They're surprisingly different in how they work, and the right choice depends on what you're optimizing for. Let's break it down.
MoE Explained Simply
The 26B MoE model has 26 billion total parameters, but here's the twist — it doesn't use all of them at once. Instead, it has multiple "expert" sub-networks, and a routing mechanism picks which experts to activate for each token. Only about 3.8 billion parameters are active during any given forward pass.
Think of it like a hospital with 20 specialists. When a patient comes in, they don't see all 20 doctors — they get routed to the 2-3 specialists relevant to their condition. The hospital has 20 doctors' worth of knowledge, but each visit only uses a fraction of the staff.
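The routing step can be sketched in a few lines of NumPy. This is a minimal illustration of top-k routing, not Gemma 4's actual layer: the expert count, dimensions, and `top_k=2` are made-up numbers, and real MoE models use learned router weights inside every transformer block.

```python
import numpy as np

rng = np.random.default_rng(0)

def moe_forward(x, experts, router_w, top_k=2):
    """Route token vector x to its top_k experts and mix their outputs."""
    logits = x @ router_w                     # one score per expert
    top = np.argsort(logits)[-top_k:]         # pick the best-scoring experts
    weights = np.exp(logits[top])
    weights /= weights.sum()                  # softmax over the chosen experts
    # Only the selected experts run -- this is the source of the compute savings.
    return sum(w * experts[i](x) for i, w in zip(top, weights))

dim, n_experts = 8, 16
# Stand-in "experts": tiny linear maps in place of full feed-forward blocks.
mats = [rng.standard_normal((dim, dim)) for _ in range(n_experts)]
experts = [lambda x, m=m: x @ m for m in mats]
router_w = rng.standard_normal((dim, n_experts))

token = rng.standard_normal(dim)
y = moe_forward(token, experts, router_w, top_k=2)
print(y.shape)  # (8,)
```

The key point mirrors the hospital analogy: all 16 expert matrices sit in memory, but only 2 of them do any work for this token.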
MoE 26B Architecture:
┌─────────────────────────────┐
│ Router: "Which experts?" │
├──────┬──────┬──────┬───────┤
│ Exp1 │ Exp2 │ Exp3 │ ... │ ← 26B total parameters
├──────┴──────┴──────┴───────┤
│ Only ~3.8B active/token │ ← Actual compute cost
└─────────────────────────────┘
Dense Explained
The 31B Dense model is straightforward — all 31 billion parameters are active for every single token. No routing, no experts, just one big network doing all the work every time.
Dense 31B Architecture:
┌─────────────────────────────┐
│ All 31B parameters active │ ← Every token uses everything
│ for every token │
└─────────────────────────────┘
Head-to-Head Comparison
| Metric | 26B MoE | 31B Dense |
|---|---|---|
| Total parameters | 26B | 31B |
| Active parameters | ~3.8B | 31B |
| VRAM (FP16) | ~52 GB | ~62 GB |
| VRAM (Q4_K_M) | ~15 GB | ~18 GB |
| Speed (tok/s, RTX 4090) | ~45 | ~18 |
| Speed (tok/s, M3 Max 36GB) | ~25 | ~10 |
Benchmark Comparison
| Benchmark | 26B MoE | 31B Dense | Winner |
|---|---|---|---|
| MMLU | 79.5 | 81.3 | Dense (+1.8) |
| HumanEval | 75.2 | 77.1 | Dense (+1.9) |
| GSM8K | 87.0 | 88.9 | Dense (+1.9) |
| MATH | 52.1 | 54.8 | Dense (+2.7) |
| ARC-Challenge | 68.3 | 69.1 | Dense (+0.8) |
| Average | 72.4 | 74.2 | Dense (+1.8 avg) |
The Dense model wins on raw quality across the board, but the margins are small — typically 1-3 points. The question is whether that small quality edge justifies the massive speed difference.
Speed Comparison
This is where MoE shines. Because only 3.8B parameters are active per token, inference is dramatically faster:
| Hardware | 26B MoE Q4 (tok/s) | 31B Dense Q4 (tok/s) | MoE Speedup |
|---|---|---|---|
| RTX 4090 24GB | ~45 | ~18 | 2.5x faster |
| RTX 3090 24GB | ~30 | ~12 | 2.5x faster |
| M3 Max 36GB | ~25 | ~10 | 2.5x faster |
| M4 Max 48GB | ~32 | ~14 | 2.3x faster |
The MoE model is consistently around 2.3-2.5x faster. For interactive use cases where you're waiting for responses, this difference is huge.
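To make the tok/s numbers concrete, here's the wall-clock time for a typical reply using the RTX 4090 figures from the table above (the 300-token reply length is an illustrative assumption):

```python
def response_time(num_tokens, tok_per_s):
    """Seconds to generate num_tokens at a given decode speed."""
    return num_tokens / tok_per_s

# A typical ~300-token chat reply, at the table's RTX 4090 Q4 speeds.
moe = response_time(300, 45)    # 26B MoE
dense = response_time(300, 18)  # 31B Dense
print(f"MoE: {moe:.1f}s, Dense: {dense:.1f}s, speedup: {dense/moe:.1f}x")
# MoE: 6.7s, Dense: 16.7s, speedup: 2.5x
```

Waiting under 7 seconds versus nearly 17 seconds per reply is the difference between a fluid conversation and a noticeably laggy one.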
VRAM Comparison
Here's the catch with MoE — even though only 3.8B parameters are active, all 26B need to be loaded into memory:
| Format | 26B MoE | 31B Dense | Difference |
|---|---|---|---|
| FP16 | ~52 GB | ~62 GB | MoE saves ~10 GB |
| Q8_0 | ~28 GB | ~33 GB | MoE saves ~5 GB |
| Q5_K_M | ~19 GB | ~22 GB | MoE saves ~3 GB |
| Q4_K_M | ~15 GB | ~18 GB | MoE saves ~3 GB |
MoE uses less VRAM than Dense at every quantization level, but the savings aren't as dramatic as the speed difference. Both models need serious hardware at full precision.
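The table's figures follow from simple arithmetic: weight memory is parameter count times bits per weight. A quick sketch of the estimate (the ~4.5 bits/weight figure for Q4_K_M is an approximation, and real usage adds KV cache and runtime buffers on top):

```python
def vram_gb(params_billions, bits_per_weight):
    """Weight-only memory estimate: params x bits / 8.
    Real usage adds KV cache and runtime buffers on top."""
    return params_billions * bits_per_weight / 8

# MoE VRAM depends on TOTAL parameters (26B), not active ones (3.8B).
print(f"26B MoE  FP16:   ~{vram_gb(26, 16):.0f} GB")   # matches the ~52 GB above
print(f"26B MoE  Q4_K_M: ~{vram_gb(26, 4.5):.0f} GB")  # assuming ~4.5 bits/weight
print(f"31B Dense FP16:  ~{vram_gb(31, 16):.0f} GB")   # matches the ~62 GB above
```

This is why the routing trick helps speed far more than memory: every expert must be resident even though only a few run per token.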
Use Case Recommendations
Pick the 26B MoE When:
- Interactive chat and coding assistance — the 2.5x speed advantage makes conversations feel natural
- API serving with multiple users — faster inference means higher throughput and lower cost per query
- Hardware is the bottleneck — fits in slightly less VRAM and runs much faster
- Quality is "good enough" — for most practical tasks, the 1-2 point benchmark difference doesn't matter
- You're running on consumer hardware — Q4 MoE on a 16GB GPU is actually usable
Pick the 31B Dense When:
- Fine-tuning — Dense models are more straightforward to fine-tune than MoE; the expert routing adds complexity
- Maximum quality on hard tasks — when you need every last point on math, reasoning, or code generation
- Batch processing — if you're processing offline and don't care about per-token speed
- Research and evaluation — when you need the absolute best baseline
- Simple deployment — Dense models have broader framework support and fewer edge cases
Quick Decision Table
| Your Priority | Pick |
|---|---|
| Speed | 26B MoE |
| Quality | 31B Dense |
| Cost efficiency | 26B MoE |
| Fine-tuning | 31B Dense |
| Interactive use | 26B MoE |
| Offline batch processing | 31B Dense |
Framework Support
Not all frameworks handle MoE models equally well:
| Framework | MoE Support | Dense Support |
|---|---|---|
| Ollama | Yes | Yes |
| llama.cpp | Yes | Yes |
| vLLM | Yes | Yes |
| SGLang | Yes | Yes |
| LM Studio | Partial | Yes |
| TensorRT-LLM | Yes | Yes |
| transformers | Yes | Yes |
MoE support has matured significantly, but if you encounter issues with a specific framework, Dense is the safer bet.
Next Steps
- Still deciding on model size? Read Which Gemma 4 Model Should You Pick? for the full lineup including smaller models
- Want to understand quantization options? Check the GGUF Guide for Q4/Q5/Q8 comparisons
- Ready to run one of these? Follow our Ollama tutorial to get started in minutes
For most people, the 26B MoE is the better choice. It's 2.5x faster with only a tiny quality trade-off. Save the 31B Dense for fine-tuning or when you genuinely need maximum quality and can afford to wait for responses.