Yes, you can run Gemma 4 on a Raspberry Pi. No, it won't be fast. But it works, and there are some genuinely good reasons to do it. Let me show you how, and be honest about what to expect.
What's Realistic
Let's set expectations before we start:
| | Raspberry Pi 5 (8GB) | MacBook M2 16GB |
|---|---|---|
| Model | Gemma 4 E2B (Q4) | Gemma 4 26B (Q4) |
| Speed | 2-5 tokens/sec | 14-18 tokens/sec |
| Feel | Slow but functional | Smooth and interactive |
| Cost | ~$80 | ~$1200+ |
| Power | 5-15W | 20-50W |
At 2-5 tokens per second, you're waiting a few seconds for a short answer and maybe 30 seconds for a longer response. It's not interactive chat speed. But for automated tasks, offline assistants, and tinkering? Totally viable.
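To turn those rates into expected wait times, the arithmetic is just token count divided by tokens per second. A quick sketch using the figures from the table above:

```python
def wait_seconds(n_tokens: int, tokens_per_sec: float) -> float:
    """Rough wall-clock estimate for generating n_tokens at a given rate."""
    return n_tokens / tokens_per_sec

# A short ~50-token answer at 3 tok/s on the Pi:
print(round(wait_seconds(50, 3.0)))   # 17 (seconds)

# The same answer at 16 tok/s on an M2 MacBook:
print(round(wait_seconds(50, 16.0)))  # 3 (seconds)
```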
Requirements
- Raspberry Pi 5 with 8GB RAM (required — 4GB won't cut it)
- microSD card (at least 32GB, ideally 64GB) or USB SSD
- Active cooling (fan or heatsink — the CPU will run hot)
- Raspberry Pi OS 64-bit (Bookworm or later)
The Pi 4 with 8GB can technically run E2B too, but the Pi 5 is significantly faster (~2x) and I'd recommend it if you're buying new hardware.
Installing Ollama on ARM
Ollama supports ARM64 natively, so installation on the Pi is straightforward:
```bash
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Verify installation
ollama --version

# Start the service
sudo systemctl enable ollama
sudo systemctl start ollama
```

Now pull the smallest Gemma 4 model:
```bash
# Pull E2B — the only model that fits in 8GB
ollama pull gemma4:e2b

# Run it
ollama run gemma4:e2b
```

The initial download takes a while on the Pi (the model is about 1.5GB). Once loaded, you should see a prompt. Type something and wait — your first response will take a few seconds to start generating.
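Once the pull finishes, you can also confirm the model is installed by querying Ollama's REST API on port 11434. A minimal standard-library sketch (the `/api/tags` endpoint is part of Ollama's HTTP API):

```python
import json
import urllib.request

def model_names(tags_json: dict) -> list:
    """Extract model names from an /api/tags response."""
    return [m["name"] for m in tags_json.get("models", [])]

def list_models(base_url: str = "http://localhost:11434") -> list:
    """Ask the local Ollama server which models are installed."""
    with urllib.request.urlopen(f"{base_url}/api/tags") as resp:
        return model_names(json.load(resp))

# With the service running, the list should include "gemma4:e2b":
# print(list_models())
```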
Performance Reality Check
I ran some benchmarks on a Raspberry Pi 5 8GB with active cooling:
```
Model: gemma4:e2b (Q4_K_M quantization)
Prompt: "Explain what an API is in 3 sentences."

Prompt eval: ~1.5 seconds
Generation speed: 3.2 tokens/second
Total time for ~50 token response: ~17 seconds
```

```
Model: gemma4:e2b (Q4_K_M quantization)
Prompt: "Write a Python function to reverse a string."

Prompt eval: ~2 seconds
Generation speed: 2.8 tokens/second
Total time for ~80 token response: ~30 seconds
```

It's slow. There's no getting around it. The Pi's ARM CPU is doing all the work — there's no GPU acceleration here. But the answers are correct and coherent, and it's the same Gemma 4 you'd run on a $3,000 Mac — just slower.
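If you want to reproduce these numbers, Ollama's API responses include `eval_count` (tokens generated) and `eval_duration` (in nanoseconds), and `ollama run --verbose` prints the same stats. Converting them to tokens per second is a single division; a sketch:

```python
def tokens_per_second(eval_count: int, eval_duration_ns: int) -> float:
    """Convert Ollama's eval_count / eval_duration (nanoseconds) into tok/s."""
    return eval_count / (eval_duration_ns / 1e9)

# The first benchmark above: ~50 tokens in ~15.6 s of generation time
print(round(tokens_per_second(50, 15_600_000_000), 1))  # 3.2
```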
Practical Use Cases
At this speed, interactive chat isn't ideal. But these use cases work great:
Offline Personal Assistant
```python
import requests

def ask_gemma(question):
    response = requests.post("http://localhost:11434/api/chat", json={
        "model": "gemma4:e2b",
        "messages": [{"role": "user", "content": question}],
        "stream": False,
    })
    return response.json()["message"]["content"]

# Process a question overnight, have the answer in the morning
answer = ask_gemma("Summarize the key points of this article: ...")
```

Home Automation Brain
Hook it up to Home Assistant for natural language control:
```python
# Parse voice commands into structured actions
command = "Turn on the living room lights and set them to 50%"
response = ask_gemma(f"""Parse this home command into JSON:
Command: {command}
Format: {{"device": "...", "action": "...", "value": "..."}}""")
```

At 2-5 tok/s, parsing a simple command takes ~5 seconds. That's fine for home automation — you're not in a hurry to turn on a light.
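Small models don't always return clean JSON; they sometimes wrap it in prose or a code fence. It's worth validating the reply before acting on it. A defensive parser sketch (the expected keys match the format string above):

```python
import json
import re

def parse_command(model_output: str) -> dict:
    """Extract the first JSON object from a model reply and check its keys."""
    match = re.search(r"\{.*\}", model_output, re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in model output")
    data = json.loads(match.group(0))
    missing = [k for k in ("device", "action", "value") if k not in data]
    if missing:
        raise ValueError(f"missing keys: {missing}")
    return data

cmd = parse_command('Sure! {"device": "living room lights", "action": "on", "value": "50%"}')
print(cmd["device"])  # living room lights
```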
Privacy-First AI
The biggest selling point: your data never leaves your house. No cloud, no API keys, no terms of service. Just an $80 computer running AI on your desk.
For people who want a private AI assistant for journal entries, personal notes, or sensitive questions — a Pi running Gemma 4 is hard to beat on price.
Learning and Education
A Raspberry Pi running Gemma 4 is an amazing teaching tool:
- Students can experiment with AI without needing cloud accounts
- Schools can set up AI workstations for under $100 each
- Learn about LLM inference, tokenization, and quantization hands-on
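As a hands-on quantization exercise, the back-of-the-envelope memory math is simple: parameter count times bits per weight, divided by 8 bits per byte. A sketch (treating E2B as roughly a 2B-parameter model is my assumption from the name; real GGUF files run somewhat larger than this estimate because of mixed-precision layers and metadata):

```python
def model_size_gb(n_params: float, bits_per_weight: int) -> float:
    """Approximate weight memory: params * bits / 8 bytes, in gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9

# A 2B-parameter model at 4-bit quantization:
print(round(model_size_gb(2e9, 4), 1))  # 1.0 (GB of weights)

# The same model at 8 bits doubles that:
print(round(model_size_gb(2e9, 8), 1))  # 2.0
```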
Optimization Tips
1. Use Q4 quantization (or lower)
Q4_K_M gives the best speed-to-quality ratio on the Pi. Don't try Q8 — it'll be too slow and might not fit in memory.
2. Keep context short
```bash
# Reduce the context window to save memory and speed up processing
ollama run gemma4:e2b
# ...then, at the interactive prompt:
>>> /set parameter num_ctx 1024
```

The default context window eats into your limited RAM. For simple Q&A, 1024 tokens is plenty.
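If you'd rather bake the smaller context into the model than set it each session, a Modelfile does the same thing (`PARAMETER num_ctx` is standard Modelfile syntax; `gemma4-small` is just a name I picked):

```
FROM gemma4:e2b
PARAMETER num_ctx 1024
```

Build it with `ollama create gemma4-small -f Modelfile`, then run `ollama run gemma4-small` as usual.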
3. Use an SSD instead of microSD
A USB 3.0 SSD dramatically speeds up model loading. The microSD card is the bottleneck when the model first loads into memory.
```bash
# Check if your model is on slow storage
ls -la ~/.ollama/models/
```

4. Add swap space
If you're running tight on memory:
```bash
# Add 4GB swap
sudo fallocate -l 4G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile

# Make it permanent
echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab
```

Warning: swap on microSD will be very slow. Use an SSD if possible.
5. Close everything else
The Pi has only 8GB. Close the desktop environment if you're running headless:
```bash
# Switch to CLI only
sudo systemctl set-default multi-user.target
sudo reboot
```

This frees up ~500MB of RAM — which matters when you're working with tight margins.
6. Lower the temperature
I mean the physical temperature. The Pi 5 throttles when it gets hot. Make sure you have:
- A proper heatsink
- Active cooling (fan)
- Good ventilation
What About the Pi 4?
The Raspberry Pi 4 with 8GB can run Gemma 4 E2B, but:
- ~1.5-3 tok/s (roughly 40% slower than Pi 5)
- Older Cortex-A72 cores (the Pi 5 uses Cortex-A76), so weaker per-core performance for inference
- Still works for the same use cases, just with more patience
If you already have a Pi 4 8GB, try it. If you're buying new, get the Pi 5.
The Fun Factor
Let's be real: running AI on a credit-card-sized computer is just cool. It's a conversation starter, a weekend project, and a genuine learning experience. The fact that it produces coherent, useful text at all is remarkable.
Show up at a meetup with a Raspberry Pi running Gemma 4 and people will want to talk to you.
For a more practical setup, check out running Gemma 4 on a Mac or in Docker. And if you want to understand why the E2B model fits on such tiny hardware, our architecture guide explains the different model sizes.
Next Steps
- Compare with more powerful setups: Mac performance guide
- Learn about model sizes: which Gemma 4 model to pick
- Understand the architecture: Gemma 4 architecture explained
- Set up a proper server: Docker deployment



