Yes, you can run Gemma 4 on a Raspberry Pi. No, it won't be fast. But it works, and there are some genuinely good reasons to do it. Let me show you how, and be honest about what to expect.
What's Realistic
Let's set expectations before we start:
| | Raspberry Pi 5 (8GB) | MacBook M2 16GB |
|---|---|---|
| Model | Gemma 4 E2B (Q4) | Gemma 4 26B (Q4) |
| Speed | 2-5 tokens/sec | 14-18 tokens/sec |
| Feel | Slow but functional | Smooth and interactive |
| Cost | ~$80 | ~$1200+ |
| Power | 5-15W | 20-50W |
At 2-5 tokens per second, you're waiting a few seconds for a short answer and maybe 30 seconds for a longer response. It's not interactive chat speed. But for automated tasks, offline assistants, and tinkering? Totally viable.
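To turn those rates into expected wait times, the arithmetic is just token count divided by tokens per second. A quick sketch using the figures from the table above:

```python
def wait_seconds(n_tokens: int, tokens_per_sec: float) -> float:
    """Rough wall-clock estimate for generating n_tokens at a given rate."""
    return n_tokens / tokens_per_sec

# A short ~50-token answer at 3 tok/s on the Pi:
print(round(wait_seconds(50, 3.0)))   # 17 (seconds)

# The same answer at 16 tok/s on an M2 MacBook:
print(round(wait_seconds(50, 16.0)))  # 3 (seconds)
```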
Requirements
- Raspberry Pi 5 with 8GB RAM (required — 4GB won't cut it)
- microSD card (at least 32GB, ideally 64GB) or USB SSD
- Active cooling (fan or heatsink — the CPU will run hot)
- Raspberry Pi OS 64-bit (Bookworm or later)
The Pi 4 with 8GB can technically run E2B too, but the Pi 5 is significantly faster (~2x) and I'd recommend it if you're buying new hardware.
Installing Ollama on ARM
Ollama supports ARM64 natively, so installation on the Pi is straightforward:
```bash
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Verify installation
ollama --version

# Start the service
sudo systemctl enable ollama
sudo systemctl start ollama
```

Now pull the smallest Gemma 4 model:
```bash
# Pull E2B — the only model that fits in 8GB
ollama pull gemma4:e2b

# Run it
ollama run gemma4:e2b
```

The initial download takes a while on the Pi (the model is about 1.5GB). Once loaded, you should see a prompt. Type something and wait — your first response will take a few seconds to start generating.
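Once the pull finishes, you can also confirm the model is installed by querying Ollama's REST API on port 11434. A minimal standard-library sketch (the `/api/tags` endpoint is part of Ollama's HTTP API):

```python
import json
import urllib.request

def model_names(tags_json: dict) -> list:
    """Extract model names from an /api/tags response."""
    return [m["name"] for m in tags_json.get("models", [])]

def list_models(base_url: str = "http://localhost:11434") -> list:
    """Ask the local Ollama server which models are installed."""
    with urllib.request.urlopen(f"{base_url}/api/tags") as resp:
        return model_names(json.load(resp))

# With the service running, the list should include "gemma4:e2b":
# print(list_models())
```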
Performance Reality Check
I ran some benchmarks on a Raspberry Pi 5 8GB with active cooling:
```
Model: gemma4:e2b (Q4_K_M quantization)
Prompt: "Explain what an API is in 3 sentences."

Prompt eval: ~1.5 seconds
Generation speed: 3.2 tokens/second
Total time for ~50 token response: ~17 seconds
```

```
Model: gemma4:e2b (Q4_K_M quantization)
Prompt: "Write a Python function to reverse a string."

Prompt eval: ~2 seconds
Generation speed: 2.8 tokens/second
Total time for ~80 token response: ~30 seconds
```

It's slow. There's no getting around it. The Pi's ARM CPU is doing all the work — there's no GPU acceleration here. But the answers are correct and coherent, and it's the same Gemma 4 you'd run on a $3,000 Mac — just slower.
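If you want to reproduce these numbers, Ollama's API responses include `eval_count` (tokens generated) and `eval_duration` (in nanoseconds), and `ollama run --verbose` prints the same stats. Converting them to tokens per second is a single division; a sketch:

```python
def tokens_per_second(eval_count: int, eval_duration_ns: int) -> float:
    """Convert Ollama's eval_count / eval_duration (nanoseconds) into tok/s."""
    return eval_count / (eval_duration_ns / 1e9)

# The first benchmark above: ~50 tokens in ~15.6 s of generation time
print(round(tokens_per_second(50, 15_600_000_000), 1))  # 3.2
```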
Practical Use Cases
At this speed, interactive chat isn't ideal. But these use cases work great:
Offline Personal Assistant
```python
import requests

def ask_gemma(question):
    response = requests.post("http://localhost:11434/api/chat", json={
        "model": "gemma4:e2b",
        "messages": [{"role": "user", "content": question}],
        "stream": False,
    })
    return response.json()["message"]["content"]

# Process a question overnight, have the answer in the morning
answer = ask_gemma("Summarize the key points of this article: ...")
```

Home Automation Brain
Hook it up to Home Assistant for natural language control:
```python
# Parse voice commands into structured actions
command = "Turn on the living room lights and set them to 50%"
response = ask_gemma(f"""Parse this home command into JSON:
Command: {command}
Format: {{"device": "...", "action": "...", "value": "..."}}""")
```

At 2-5 tok/s, parsing a simple command takes ~5 seconds. That's fine for home automation — you're not in a hurry to turn on a light.
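Small models don't always return clean JSON; they sometimes wrap it in prose or a code fence. It's worth validating the reply before acting on it. A defensive parser sketch (the expected keys match the format string above):

```python
import json
import re

def parse_command(model_output: str) -> dict:
    """Extract the first JSON object from a model reply and check its keys."""
    match = re.search(r"\{.*\}", model_output, re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in model output")
    data = json.loads(match.group(0))
    missing = [k for k in ("device", "action", "value") if k not in data]
    if missing:
        raise ValueError(f"missing keys: {missing}")
    return data

cmd = parse_command('Sure! {"device": "living room lights", "action": "on", "value": "50%"}')
print(cmd["device"])  # living room lights
```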
Privacy-First AI
The biggest selling point: your data never leaves your house. No cloud, no API keys, no terms of service. Just an $80 computer running AI on your desk.
For people who want a private AI assistant for journal entries, personal notes, or sensitive questions — a Pi running Gemma 4 is hard to beat on price.
Learning and Education
A Raspberry Pi running Gemma 4 is an amazing teaching tool:
- Students can experiment with AI without needing cloud accounts
- Schools can set up AI workstations for under $100 each
- Learn about LLM inference, tokenization, and quantization hands-on
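As a hands-on quantization exercise, the back-of-the-envelope memory math is simple: parameter count times bits per weight, divided by 8 bits per byte. A sketch (treating E2B as roughly a 2B-parameter model is my assumption from the name; real GGUF files run somewhat larger than this estimate because of mixed-precision layers and metadata):

```python
def model_size_gb(n_params: float, bits_per_weight: int) -> float:
    """Approximate weight memory: params * bits / 8 bytes, in gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9

# A 2B-parameter model at 4-bit quantization:
print(round(model_size_gb(2e9, 4), 1))  # 1.0 (GB of weights)

# The same model at 8 bits doubles that:
print(round(model_size_gb(2e9, 8), 1))  # 2.0
```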
Optimization Tips
1. Use Q4 quantization (or lower)
Q4_K_M gives the best speed-to-quality ratio on the Pi. Don't try Q8 — it'll be too slow and might not fit in memory.
2. Keep context short
```bash
# Reduce the context window to save memory and speed up processing
ollama run gemma4:e2b
# ...then, at the interactive prompt:
>>> /set parameter num_ctx 1024
```

The default context window eats into your limited RAM. For simple Q&A, 1024 tokens is plenty.
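If you'd rather bake the smaller context into the model than set it each session, a Modelfile does the same thing (`PARAMETER num_ctx` is standard Modelfile syntax; `gemma4-small` is just a name I picked):

```
FROM gemma4:e2b
PARAMETER num_ctx 1024
```

Build it with `ollama create gemma4-small -f Modelfile`, then run `ollama run gemma4-small` as usual.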
3. Use an SSD instead of microSD
A USB 3.0 SSD dramatically speeds up model loading. The microSD card is the bottleneck when the model first loads into memory.
```bash
# Check if your model is on slow storage
ls -la ~/.ollama/models/
```

4. Add swap space
If you're running tight on memory:
```bash
# Add 4GB swap
sudo fallocate -l 4G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile

# Make it permanent
echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab
```

Warning: swap on microSD will be very slow. Use an SSD if possible.
5. Close everything else
The Pi has only 8GB. Close the desktop environment if you're running headless:
```bash
# Switch to CLI only
sudo systemctl set-default multi-user.target
sudo reboot
```

This frees up ~500MB of RAM — which matters when you're working with tight margins.
6. Lower the temperature
I mean the physical temperature. The Pi 5 throttles when it gets hot. Make sure you have:
- A proper heatsink
- Active cooling (fan)
- Good ventilation
What About the Pi 4?
The Raspberry Pi 4 with 8GB can run Gemma 4 E2B, but:
- ~1.5-3 tok/s (roughly 40% slower than Pi 5)
- Older Cortex-A72 cores (the Pi 5 uses Cortex-A76), so weaker per-core performance for inference
- Still works for the same use cases, just with more patience
If you already have a Pi 4 8GB, try it. If you're buying new, get the Pi 5.
The Fun Factor
Let's be real: running AI on a credit-card-sized computer is just cool. It's a conversation starter, a weekend project, and a genuine learning experience. The fact that it produces coherent, useful text at all is remarkable.
Show up at a meetup with a Raspberry Pi running Gemma 4 and people will want to talk to you.
For a more practical setup, check out running Gemma 4 on a Mac or in Docker. And if you want to understand why the E2B model fits on such tiny hardware, our architecture guide explains the different model sizes.
Next Steps
- Compare with more powerful setups: Mac performance guide
- Learn about model sizes: which Gemma 4 model to pick
- Understand the architecture: Gemma 4 architecture explained
- Set up a proper server: Docker deployment



