Running a real AI model on your phone — no internet, no cloud, no API key. Just your iPhone doing the thinking. Sounds wild, but Gemma 4 actually makes this possible.
Before you get too excited, though, let's set realistic expectations. It works, but it's not going to feel like ChatGPT. Here's what you need to know.
Which Models Actually Run on iPhone?
Not all Gemma 4 models fit on a phone. Here's the breakdown:
| Model | iPhone Compatibility | Notes |
|---|---|---|
| E2B (2B) | All modern iPhones | Recommended for mobile |
| E4B (4B) | iPhone 15 Pro and newer | Needs 8GB RAM (A17 Pro+) |
| 26B | No | Way too large |
| 31B | No | Not even close |
The sweet spot is E2B. It runs on basically any iPhone from the last few years and gives you surprisingly useful results for a 2-billion-parameter model. E4B is better quality, but you'll need at least an iPhone 15 Pro — that's the first iPhone with 8GB of RAM.
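As a rough rule of thumb, you can estimate whether a model fits in RAM from its parameter count and quantization level. Here's a minimal sketch; the bytes-per-parameter figures and the overhead factor are illustrative assumptions, not measured values for any specific runtime:

```python
def estimated_ram_gb(params_billions: float, bytes_per_param: float = 0.5,
                     overhead: float = 1.3) -> float:
    """Rough RAM estimate for running a quantized model.

    bytes_per_param: ~0.5 for 4-bit quantization, 1.0 for 8-bit,
    2.0 for fp16. `overhead` accounts for KV cache and runtime
    buffers (an illustrative assumption, not a measured figure).
    """
    return params_billions * bytes_per_param * overhead

# A 2B model at 4-bit quantization lands around 1.3 GB, which fits
# comfortably on recent iPhones. A 26B model at the same quantization
# still wants roughly 17 GB, far past the 8GB ceiling.
for size in (2, 4, 26, 31):
    print(f"{size}B -> ~{estimated_ram_gb(size):.1f} GB")
```

This is why the table above cuts off where it does: it's not a software limitation, just arithmetic against available RAM.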
Not sure which model is right for your use case? Check out our model comparison guide.
How to Set It Up: Google AI Edge Gallery
Google's official path for running Gemma models on mobile is the AI Edge Gallery app. It's the easiest way to get started.
Step-by-Step Setup
1. Download AI Edge Gallery from the App Store. Search "Google AI Edge Gallery" or look for it in Google's developer tools section.
2. Open the app and browse the available models. You'll see Gemma 4 E2B listed (and E4B if your device supports it).
3. Download the model. This is the big step — the E2B model is roughly 1.5-2GB. Make sure you're on Wi-Fi.
4. Wait for the download and conversion. The app needs to optimize the model for your specific chip. This can take a few minutes. Don't close the app.
5. Start chatting. Once it's ready, you can type prompts and get responses. Everything runs locally — try putting your phone in airplane mode to prove it.
First Test
Once it's running, try something simple:
"Summarize what a REST API is in 3 sentences."

You should see tokens appearing one by one. It's slower than you're used to, but it's doing real inference on your phone's Neural Engine.
Performance: What to Actually Expect
Let's be honest. This isn't going to replace your cloud AI setup. Here's what the experience is actually like:
- Speed: About 5-15 tokens per second on E2B (iPhone 15 Pro). That's readable but not fast. E4B is slower — maybe 3-8 tokens/sec.
- Quality: E2B handles simple tasks well: summaries, translations, quick questions, basic code. Don't expect GPT-4 level reasoning.
- First response: There's a 2-5 second startup delay while the model loads into memory.
- Battery: Running inference is compute-intensive. Expect noticeable battery drain during active use. Maybe 10-15% per hour of continuous chatting.
- Heat: Your phone will get warm. After 15-20 minutes of heavy use, thermal throttling might kick in and slow things down further.
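Those token rates translate directly into wait time. A quick back-of-envelope sketch using the figures above (the startup delay and token counts are illustrative assumptions):

```python
def response_seconds(tokens: int, tokens_per_sec: float,
                     startup_sec: float = 3.0) -> float:
    """Total wall-clock time for one response: model load/warm-up
    delay plus token-by-token generation."""
    return startup_sec + tokens / tokens_per_sec

# A 150-token answer (a short paragraph) at E2B's ~10 tok/s:
# about 18 seconds including the startup delay.
print(f"E2B: {response_seconds(150, 10):.0f} s")

# The same answer on E4B at ~5 tok/s: about 33 seconds.
print(f"E4B: {response_seconds(150, 5):.0f} s")
```

In practice this means short, focused prompts feel fine, while long-form generation tests your patience.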
The Killer Feature: Offline AI
Here's why this actually matters despite the limitations. Your phone works everywhere. On a plane. In a subway tunnel. In a country with restricted internet. In a location with zero cell service.
Offline use cases that actually make sense:
- Travel translator — works without roaming data
- Quick writing help — draft emails, fix grammar on the go
- Code snippets — generate quick utility functions while commuting
- Note summarization — paste long text and get a summary
- Privacy-sensitive queries — nothing leaves your device, ever
Limitations You Should Know
No large models. The 26B and 31B models need 16-20+ GB of RAM. iPhones max out at 8GB. This isn't changing anytime soon.
Battery drain is real. Running neural network inference is power-hungry. Don't expect to use this for hours without a charger nearby.
Thermal throttling. After extended use, your iPhone will heat up and the OS will reduce performance to protect the hardware. Responses get slower.
Context length is limited. On-device models typically use shorter context windows to save memory. Don't expect to paste a 10,000-word document and get a perfect summary.
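A common workaround for the context limit is map-reduce summarization: split the document into chunks that fit the window, summarize each chunk, then summarize the summaries. A minimal sketch, where `summarize` stands in for whatever generation call your app exposes, and the word-based chunking with a 1,500-word limit is an illustrative assumption:

```python
def chunk_words(text: str, max_words: int = 1500) -> list[str]:
    """Split text into word-count-bounded chunks that each fit a
    short on-device context window (illustrative limit)."""
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]

def summarize_long(text: str, summarize) -> str:
    """Map-reduce summarization: summarize each chunk, then
    summarize the concatenated chunk summaries."""
    chunks = chunk_words(text)
    if len(chunks) == 1:
        return summarize(chunks[0])
    partials = [summarize(c) for c in chunks]
    return summarize("\n".join(partials))
```

Quality degrades a bit with each reduction pass, but it lets a small on-device model handle documents it could never fit in one prompt.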
No multimodal on-device (yet). While Gemma 4 supports image input in its cloud/desktop versions, on-device image analysis may be limited depending on the app implementation.
Android Gets More Options
Fair warning — if you're considering Android, the on-device AI situation is more mature there. Android has:
- Google AICore — system-level AI integration for Pixel devices
- MediaPipe — Google's ML framework with broader model support
- More RAM — some Android flagships have 12-16GB
That said, Apple's Neural Engine is excellent for inference, so iPhone isn't at a huge disadvantage for the models that do fit.
Is It Worth It?
Yes, if you want offline AI, you care about privacy, or you just think it's cool to run a real language model on your phone. The E2B model is genuinely useful for quick tasks.
No, if you need high-quality reasoning, long context, or fast responses. Use the cloud API for that.
The honest answer: It's a glimpse of where things are going. In two years, on-device AI will be dramatically better. Right now, it's useful but limited. Try it, appreciate how far we've come, and use the cloud when you need serious horsepower.
You can also run Gemma 4 in your browser via WebGPU — check our browser guide for another no-install option.
Next Steps
- Want to run Gemma 4 on your laptop instead? See Can My Laptop Run Gemma 4?
- Not sure which model size fits your needs? Read Gemma 4: Which Model?
- Prefer running in a browser? Try the WebGPU browser guide
- Ready for API-level power? Check out our API tutorial