"Can I run it on my machine?" — that's the first question everyone asks. The answer depends on which Gemma 4 model you're trying to run and what hardware you've got. Let's cut through the confusion and give you actual numbers.
The Complete Hardware Requirements Table
Here's what each model needs at different quantization levels:
| Model | 4-bit (Q4) | 8-bit (Q8) | 16-bit (FP16) | Minimum RAM/VRAM |
|---|---|---|---|---|
| E2B (2B) | ~1.5GB | ~2.5GB | ~4GB | 4GB RAM |
| E4B (4B) | ~3GB | ~5GB | ~8GB | 6GB RAM |
| 26B MoE | ~8GB | ~18GB | ~28GB | 8GB VRAM |
| 31B Dense | ~20GB | ~34GB | ~62GB | 20GB VRAM |
What does "quantization" mean? It's a way to compress the model by using less precision for the numbers. 4-bit is the most compressed (smallest, fastest, slightly less accurate). 16-bit is full precision (largest, most accurate, needs the most memory). For most people, 4-bit is the sweet spot — the quality difference is barely noticeable.
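As a rough rule of thumb, the weight memory is just parameter count times bits per weight. Real files run somewhat larger because of embeddings, metadata, and mixed-precision layers, but the estimate gets you in the ballpark of the table above. A minimal sketch:

```python
def weight_size_gb(params_billion: float, bits_per_weight: int) -> float:
    """Rough size of model weights: parameters x bits per weight, in GB.

    Ignores file metadata and mixed-precision layers, so real downloads
    are a bit larger (e.g. the ~3GB Q4 figure for E4B in the table above).
    """
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9

print(weight_size_gb(4, 4))    # 4B model at 4-bit  -> 2.0 GB before overhead
print(weight_size_gb(31, 16))  # 31B model at FP16 -> 62.0 GB, matching the table
```

This is why halving the bit width halves the memory: the parameter count never changes, only how many bits each one occupies.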
The KV Cache Gotcha
Here's something most guides don't mention. The model weights are only part of the memory story. When Gemma 4 processes long conversations, it builds up a KV cache (key-value cache) that stores attention information from previous tokens.
For the 31B model at its full 262K context length, the KV cache alone can eat ~22GB of memory — on top of the model weights. That means even if you have 24GB of VRAM for the model, you might run out during long conversations.
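The cache grows linearly with context length: two tensors (keys and values) per layer, each of size context × KV heads × head dimension. Gemma 4's exact layer and head counts aren't given here, so the numbers below are illustrative placeholders chosen to land near the ~22GB figure, not the real architecture:

```python
def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 context_len: int, bytes_per_elem: int = 2) -> float:
    """Approximate KV cache size in GiB.

    The factor of 2 covers keys + values; bytes_per_elem=2 assumes an
    FP16 cache. The config used below is hypothetical, for illustration.
    """
    total = 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem
    return total / 2**30

# Hypothetical config: 40 layers, 4 KV heads, head_dim 128, full 262K context
print(kv_cache_gib(40, 4, 128, 262_144))  # -> 20.0 GiB, near the ~22GB cited above
print(kv_cache_gib(40, 4, 128, 4_096))    # -> ~0.3 GiB at a 4K context
```

The second call shows why shrinking the context window is the first fix for out-of-memory errors: at 4K tokens the same cache is a rounding error.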
Practical advice:
- Reduce context length if you're hitting OOM errors. You don't always need 262K tokens.
- With Ollama, use `num_ctx` to limit context. In the interactive session, run `/set parameter num_ctx 4096`, or bake it into a Modelfile with `PARAMETER num_ctx 4096`.
- For most tasks, 4K-8K context is plenty.
Will It Run on MY Machine?
Let's go through specific hardware:
MacBook Air M2 (8GB)
| Model | Works? | Notes |
|---|---|---|
| E2B | Yes | Runs great, fast responses |
| E4B | Yes | Good performance, the sweet spot |
| 26B | No | Not enough unified memory |
| 31B | No | Not even close |
Verdict: E4B is your best bet. Surprisingly capable for an 8GB machine.
MacBook Pro M3/M4 (16GB)
| Model | Works? | Notes |
|---|---|---|
| E2B | Yes | Overkill but fast |
| E4B | Yes | Excellent performance |
| 26B | Yes (4-bit) | Works but tight on memory. Close other apps. |
| 31B | No | Needs more memory |
Verdict: You can actually run the 26B MoE model at 4-bit quantization. That's a serious model on a laptop — see our 26B vs 31B comparison to understand the tradeoffs. Just don't expect to have Chrome open with 50 tabs at the same time.
MacBook Pro M3/M4 (36GB/48GB)
| Model | Works? | Notes |
|---|---|---|
| E2B | Yes | Way overkill |
| E4B | Yes | Fast and smooth |
| 26B | Yes | Comfortable at 8-bit |
| 31B | Yes (4-bit, 36GB) | Tight but works |
Verdict: This is the sweet spot for running large models. 36GB handles everything up to 31B at 4-bit. 48GB gives you breathing room.
Mac Studio M2 Ultra (64GB+)
| Model | Works? | Notes |
|---|---|---|
| All models | Yes | No compromises |
Verdict: You can run every Gemma 4 model comfortably, including 31B at 8-bit. The M2 Ultra's unified memory architecture handles these workloads beautifully.
Gaming PC — RTX 3060 (12GB VRAM)
| Model | Works? | Notes |
|---|---|---|
| E2B | Yes | GPU-accelerated, very fast |
| E4B | Yes | Fast inference |
| 26B | Yes (4-bit) | Fits in 12GB VRAM |
| 31B | No | Needs 20GB+ VRAM |
Verdict: The RTX 3060 is actually a solid AI card for its price. 12GB VRAM runs the 26B model nicely at 4-bit.
Gaming PC — RTX 4090 (24GB VRAM)
| Model | Works? | Notes |
|---|---|---|
| E2B | Yes | Lightning fast |
| E4B | Yes | Lightning fast |
| 26B | Yes | Comfortable even at 8-bit |
| 31B | Yes (4-bit) | Fits with room for KV cache |
Verdict: The king of consumer GPUs for AI. Runs everything Gemma 4 offers. The 31B model fits at 4-bit with enough headroom for reasonable context lengths.
Cloud — A100 (80GB VRAM)
| Model | Works? | Notes |
|---|---|---|
| All models | Yes | Full speed, full precision |
Verdict: If you need maximum performance or full-precision models, rent an A100. Available on Google Cloud, AWS, Lambda Labs, and RunPod.
CPU-Only: Possible but Painful
Don't have a GPU? You can still run Gemma 4, just on CPU. Here's what to expect:
- E2B on CPU: ~5-10 tokens/sec. Totally usable.
- E4B on CPU: ~2-5 tokens/sec. Usable if you're patient.
- 26B on CPU: ~0.5-2 tokens/sec. Painfully slow but technically works.
- 31B on CPU: Don't bother. Under 1 token/sec on most machines.
CPU inference is roughly 2-10x slower than GPU inference, depending on your CPU and the model size. Apple Silicon handles CPU inference better than Intel/AMD thanks to its unified memory architecture and high memory bandwidth.
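Tokens per second translates directly into how long you wait for a reply. A quick back-of-the-envelope helper, using the rough CPU speeds from the list above:

```python
def response_seconds(response_tokens: int, tokens_per_sec: float) -> float:
    """How long a reply of a given length takes at a given generation speed."""
    return response_tokens / tokens_per_sec

# A 300-token answer at the rough CPU speeds listed above:
print(response_seconds(300, 7.5))  # E2B: 40.0s -- fine for casual use
print(response_seconds(300, 3.5))  # E4B: ~85.7s -- usable with patience
print(response_seconds(300, 1.0))  # 26B: 300.0s -- five minutes per answer
```

Five minutes per answer is what "painfully slow but technically works" means in practice.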
Quantization: Which Format to Use
If you're using Ollama, it handles quantization automatically. But if you're downloading GGUF files from Hugging Face, here's what to pick:
| Format | Size vs FP16 | Quality | Speed | When to Use |
|---|---|---|---|---|
| Q4_K_M | ~25% | 95-97% | Fastest | Recommended default. Best balance. |
| Q5_K_M | ~35% | 97-98% | Fast | Slight quality bump, still small |
| Q6_K | ~50% | 98-99% | Medium | When quality matters more |
| Q8_0 | ~65% | 99%+ | Slower | Near-lossless, needs more RAM |
| FP16 | 100% | 100% | Slowest | Only if you have tons of VRAM |
My recommendation: Q4_K_M. It's the sweet spot that the community has converged on. The quality loss is minimal and you get the best performance and smallest file size. If you have extra VRAM to spare, Q5_K_M is a small step up.
Tips to Squeeze More Performance
For a comprehensive optimization walkthrough covering all platforms, see our speed optimization guide.
Close other apps. Especially browsers. Chrome alone can eat 2-4GB of RAM. When running 26B+ models, every GB counts.
Reduce context length. If you're getting out-of-memory errors, limit the context window. Most conversations don't need 262K tokens. Set `num_ctx` to 4096 or 8192.
Use Metal (Mac) or CUDA (NVIDIA). Make sure GPU acceleration is actually enabled. Ollama does this automatically, but if you're using other tools, check your backend settings.
Monitor memory usage. On Mac, use Activity Monitor. On Linux, use `nvidia-smi` for GPU memory. Watch for swap usage — if you're hitting swap, performance tanks.
Consider offloading layers. Some tools like llama.cpp let you put some layers on GPU and the rest on CPU. This lets you run models that are slightly too big for your GPU, though it's slower than full GPU inference.
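You can estimate how many layers will fit on the GPU before the rest spills to CPU. A simplified sketch, assuming roughly equally sized layers (approximately true for transformer blocks) and a hypothetical model config:

```python
def gpu_layers(vram_gb: float, model_size_gb: float, n_layers: int,
               reserve_gb: float = 2.0) -> int:
    """How many of n_layers fit in VRAM, keeping reserve_gb free for the
    KV cache and activations. Assumes layers are roughly equal in size."""
    per_layer = model_size_gb / n_layers
    usable = max(vram_gb - reserve_gb, 0.0)
    return min(n_layers, int(usable / per_layer))

# Hypothetical: a ~20GB 4-bit model with 40 layers on a 12GB card
print(gpu_layers(12, 20, 40))  # -> 20 layers on GPU, the rest on CPU
```

With llama.cpp you'd pass that number via `--n-gpu-layers` (`-ngl`); expect throughput somewhere between full-GPU and full-CPU speed, dominated by the CPU-resident layers.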
What Should I Buy?
If you're shopping for AI hardware, here's what I'd recommend at different budgets:
| Budget | Recommendation | Can Run |
|---|---|---|
| $0 | Use your existing laptop + E4B | E2B, E4B |
| $200-400 | Used RTX 3060 12GB | Up to 26B (4-bit) |
| $500-800 | RTX 4060 Ti 16GB | Up to 26B (8-bit) |
| $1,000-1,500 | RTX 4090 24GB | Up to 31B (4-bit) |
| $2,000-4,000 | Mac Studio M2 Max 32-64GB | All models comfortably |
| $5,000+ | Mac Studio M2 Ultra 64GB+ | Everything, no compromises |
| Pay-per-use | Cloud A100 (~$1-2/hr) | Everything at full speed |
Best value pick: A used RTX 3060 12GB. It's absurdly cheap now and runs the 26B model. For most people, that's enough.
Best Mac pick: MacBook Pro with 36GB unified memory. Runs everything up to 31B (tight at 4-bit) and you get a great laptop for everything else too.
Don't need local? Skip the hardware entirely and use the Gemma 4 API. Google AI Studio gives you free access with no hardware requirements.
Quick Decision Flowchart
- Do you have 4GB RAM? → You can run E2B. That's something.
- Do you have 8GB RAM? → Run E4B. It's genuinely good.
- Do you have a GPU with 8GB+ VRAM? → Run 26B at 4-bit. This is the quality jump.
- Do you have 20GB+ VRAM? → Run 31B. Top-tier local AI.
- None of the above? → Use the cloud API. No shame in that.
Not sure which model size is right for your use case? Check out our model comparison guide.
Next Steps
- Ready to install? Follow our Ollama setup guide
- Picking a model? Read Gemma 4: Which Model Should You Use?
- Running into issues? Check our troubleshooting guide
- Want to skip local setup? Try the API approach



