Running a real AI model on your phone — no internet, no cloud, no API key. Just your iPhone doing the thinking. Sounds wild, but Gemma 4 actually makes this possible.
Before you get too excited, though, let's set realistic expectations. It works, but it's not going to feel like ChatGPT. Here's what you need to know.
Which Models Actually Run on iPhone?
Not all Gemma 4 models fit on a phone. Here's the breakdown:
| Model | iPhone Compatibility | Notes |
|---|---|---|
| E2B (2B) | All modern iPhones | Recommended for mobile |
| E4B (4B) | iPhone 15 Pro and newer | Needs 8GB RAM (A17 Pro+) |
| 26B | No | Way too large |
| 31B | No | Not even close |
The sweet spot is E2B. It runs on basically any iPhone from the last few years and gives you surprisingly useful results for a 2-billion-parameter model. E4B is better quality, but you'll need at least an iPhone 15 Pro — that's the first iPhone with 8GB of RAM.
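As a rough rule of thumb, you can estimate whether a model fits in RAM from its parameter count and quantization level. Here's a minimal sketch; the bytes-per-parameter figures and the overhead factor are illustrative assumptions, not measured values for any specific runtime:

```python
def estimated_ram_gb(params_billions: float, bytes_per_param: float = 0.5,
                     overhead: float = 1.3) -> float:
    """Rough RAM estimate for running a quantized model.

    bytes_per_param: ~0.5 for 4-bit quantization, 1.0 for 8-bit,
    2.0 for fp16. `overhead` accounts for KV cache and runtime
    buffers (an illustrative assumption, not a measured figure).
    """
    return params_billions * bytes_per_param * overhead

# A 2B model at 4-bit quantization lands around 1.3 GB, which fits
# comfortably on recent iPhones. A 26B model at the same quantization
# still wants roughly 17 GB, far past the 8GB ceiling.
for size in (2, 4, 26, 31):
    print(f"{size}B -> ~{estimated_ram_gb(size):.1f} GB")
```

This is why the table above cuts off where it does: it's not a software limitation, just arithmetic against available RAM.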
Not sure which model is right for your use case? Check out our model comparison guide.
How to Set It Up: Google AI Edge Gallery
Google's official path for running Gemma models on mobile is the AI Edge Gallery app. It's the easiest way to get started.
Step-by-Step Setup
1. Download AI Edge Gallery from the App Store. Search "Google AI Edge Gallery" or look for it in Google's developer tools section.
2. Open the app and browse the available models. You'll see Gemma 4 E2B listed (and E4B if your device supports it).
3. Download the model. This is the big step — the E2B model is roughly 1.5-2GB. Make sure you're on Wi-Fi.
4. Wait for the download and conversion. The app needs to optimize the model for your specific chip. This can take a few minutes. Don't close the app.
5. Start chatting. Once it's ready, you can type prompts and get responses. Everything runs locally — try putting your phone in airplane mode to prove it.
First Test
Once it's running, try something simple:
"Summarize what a REST API is in 3 sentences."

You should see tokens appearing one by one. It's slower than you're used to, but it's doing real inference on your phone's Neural Engine.
Performance: What to Actually Expect
Let's be honest. This isn't going to replace your cloud AI setup. Here's what the experience is actually like:
- Speed: About 5-15 tokens per second on E2B (iPhone 15 Pro). That's readable but not fast. E4B is slower — maybe 3-8 tokens/sec.
- Quality: E2B handles simple tasks well: summaries, translations, quick questions, basic code. Don't expect GPT-4 level reasoning.
- First response: There's a 2-5 second startup delay while the model loads into memory.
- Battery: Running inference is compute-intensive. Expect noticeable battery drain during active use. Maybe 10-15% per hour of continuous chatting.
- Heat: Your phone will get warm. After 15-20 minutes of heavy use, thermal throttling might kick in and slow things down further.
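Those token rates translate directly into wait time. A quick back-of-envelope sketch using the figures above (the startup delay and token counts are illustrative assumptions):

```python
def response_seconds(tokens: int, tokens_per_sec: float,
                     startup_sec: float = 3.0) -> float:
    """Total wall-clock time for one response: model load/warm-up
    delay plus token-by-token generation."""
    return startup_sec + tokens / tokens_per_sec

# A 150-token answer (a short paragraph) at E2B's ~10 tok/s:
# about 18 seconds including the startup delay.
print(f"E2B: {response_seconds(150, 10):.0f} s")

# The same answer on E4B at ~5 tok/s: about 33 seconds.
print(f"E4B: {response_seconds(150, 5):.0f} s")
```

In practice this means short, focused prompts feel fine, while long-form generation tests your patience.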
The Killer Feature: Offline AI
Here's why this actually matters despite the limitations. Your phone works everywhere. On a plane. In a subway tunnel. In a country with restricted internet. In a location with zero cell service.
Offline use cases that actually make sense:
- Travel translator — works without roaming data
- Quick writing help — draft emails, fix grammar on the go
- Code snippets — generate quick utility functions while commuting
- Note summarization — paste long text and get a summary
- Privacy-sensitive queries — nothing leaves your device, ever
Limitations You Should Know
No large models. The 26B and 31B models need 16-20+ GB of RAM. iPhones max out at 8GB. This isn't changing anytime soon.
Battery drain is real. Running neural network inference is power-hungry. Don't expect to use this for hours without a charger nearby.
Thermal throttling. After extended use, your iPhone will heat up and the OS will reduce performance to protect the hardware. Responses get slower.
Context length is limited. On-device models typically use shorter context windows to save memory. Don't expect to paste a 10,000-word document and get a perfect summary.
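A common workaround for the context limit is map-reduce summarization: split the document into chunks that fit the window, summarize each chunk, then summarize the summaries. A minimal sketch, where `summarize` stands in for whatever generation call your app exposes, and the word-based chunking with a 1,500-word limit is an illustrative assumption:

```python
def chunk_words(text: str, max_words: int = 1500) -> list[str]:
    """Split text into word-count-bounded chunks that each fit a
    short on-device context window (illustrative limit)."""
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]

def summarize_long(text: str, summarize) -> str:
    """Map-reduce summarization: summarize each chunk, then
    summarize the concatenated chunk summaries."""
    chunks = chunk_words(text)
    if len(chunks) == 1:
        return summarize(chunks[0])
    partials = [summarize(c) for c in chunks]
    return summarize("\n".join(partials))
```

Quality degrades a bit with each reduction pass, but it lets a small on-device model handle documents it could never fit in one prompt.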
No multimodal on-device (yet). While Gemma 4 supports image input in its cloud/desktop versions, on-device image analysis may be limited depending on the app implementation.
Android Gets More Options
Fair warning — if you're considering Android, the on-device AI situation is more mature there. Android has:
- Google AICore — system-level AI integration for Pixel devices
- MediaPipe — Google's ML framework with broader model support
- More RAM — some Android flagships have 12-16GB
That said, Apple's Neural Engine is excellent for inference, so iPhone isn't at a huge disadvantage for the models that do fit.
Is It Worth It?
Yes, if you want offline AI, you care about privacy, or you just think it's cool to run a real language model on your phone. The E2B model is genuinely useful for quick tasks.
No, if you need high-quality reasoning, long context, or fast responses. Use the cloud API for that.
The honest answer: It's a glimpse of where things are going. In two years, on-device AI will be dramatically better. Right now, it's useful but limited. Try it, appreciate how far we've come, and use the cloud when you need serious horsepower.
You can also run Gemma 4 in your browser via WebGPU — check our browser guide for another no-install option.
Next Steps
- Want to run Gemma 4 on your laptop instead? See Can My Laptop Run Gemma 4?
- Not sure which model size fits your needs? Read Gemma 4: Which Model?
- Prefer running in a browser? Try the WebGPU browser guide
- Ready for API-level power? Check out our API tutorial