CoreML-LLM just dropped v0.2.0 and the numbers are absurd. Gemma 4 E2B running natively on iPhone with Apple's Neural Engine — 11 tokens per second, 250MB of RAM, 2 watts of power. This is the most efficient way to run Gemma 4 on an iPhone, period.
X/Twitter has been blowing up about this for good reason. Let's walk through what it is, how it works, and how to set it up.
What Is CoreML-LLM?
CoreML-LLM is an open-source framework that converts LLMs to Apple's CoreML format and runs them directly on the Neural Engine — the dedicated AI chip inside every modern iPhone. Unlike GPU-based inference, the Neural Engine is purpose-built for matrix operations, which means dramatically lower power consumption and memory usage.
Version 0.2.0 added full support for Gemma 4 E2B, and the community benchmarks speak for themselves.
The Benchmarks Everyone Is Talking About
Here's what CoreML-LLM v0.2.0 achieves with Gemma 4 E2B on iPhone:
| Metric | Result |
|---|---|
| Prefill (33 tokens) | 188ms (15.8x speedup vs CPU) |
| Decode speed | 11 tok/s |
| Context window | 2048 tokens |
| RAM usage | ~250MB |
| Power draw | ~2W |
Compare that to the standard approach of running Gemma 4 E2B via AI Edge Gallery or MediaPipe, which typically uses 3GB of RAM and drains your battery noticeably. CoreML-LLM uses roughly 12x less memory and sips power instead of gulping it.
For a broader view of mobile deployment options, see our mobile deployment guide.
How It Gets These Numbers
CoreML-LLM doesn't just dump the model onto the Neural Engine and hope for the best. There are four key technical optimizations that make this work:
Sliding Window Attention
Instead of full self-attention over the entire context, CoreML-LLM uses a sliding window approach. The model only attends to a fixed window of recent tokens at each layer. This keeps memory constant regardless of sequence length (up to the 2048 context limit) and is a perfect fit for the Neural Engine's fixed-size tensor operations.
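To make the idea concrete, here's a toy NumPy sketch of a sliding-window causal mask. This is illustrative only, not CoreML-LLM's actual implementation (which compiles the pattern into fixed-size Neural Engine ops):

```python
import numpy as np

def sliding_window_mask(seq_len: int, window: int) -> np.ndarray:
    """Causal attention mask where token i attends only to tokens in
    [i - window + 1, i]. Attended positions per row stay constant."""
    i = np.arange(seq_len)[:, None]   # query positions
    j = np.arange(seq_len)[None, :]   # key positions
    causal = j <= i                   # never attend to the future
    recent = j > i - window           # only the last `window` tokens
    return causal & recent

mask = sliding_window_mask(seq_len=8, window=3)
# Row 5 attends only to positions 3, 4, 5:
print(mask[5].astype(int))  # [0 0 0 1 1 1 0 0]
```

Because every row of the mask covers at most `window` positions, attention cost per token is constant rather than growing with sequence length.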
INT4 Palettized Weights with mmap
The weights are quantized to INT4 using Apple's palettization technique — a form of lookup-table quantization where each weight is an index into a small codebook. The model file is memory-mapped (mmap) rather than loaded entirely into RAM, which is why you see that 250MB figure instead of the full model size. Only the pages being actively used get loaded into physical memory.
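A minimal sketch of the palettization idea, assuming a uniform 16-entry codebook. Apple's actual tooling learns the codebook (e.g. via k-means clustering); this toy version just snaps each weight to the nearest of 16 evenly spaced values:

```python
import numpy as np

def palettize_int4(weights: np.ndarray, n_colors: int = 16):
    """Toy INT4 palettization: replace each weight with a 4-bit index
    into a 16-entry palette. Uniform levels here; real palettization
    learns the codebook from the weight distribution."""
    lo, hi = weights.min(), weights.max()
    codebook = np.linspace(lo, hi, n_colors)             # the 16-entry palette
    indices = np.abs(weights[..., None] - codebook).argmin(-1)
    return indices.astype(np.uint8), codebook

def dequantize(indices: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Inference-time lookup: index back into the palette."""
    return codebook[indices]

w = np.random.randn(4, 4).astype(np.float32)
idx, palette = palettize_int4(w)
assert idx.max() < 16                                    # fits in 4 bits
w_hat = dequantize(idx, palette)
```

Storing 4-bit indices instead of 16- or 32-bit floats is where the size reduction comes from; the mmap trick then ensures even those indices are only paged into RAM as they're touched.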
Stateless KV Cache
Traditional KV caches grow with sequence length and eat memory. CoreML-LLM uses a stateless approach where the KV cache is managed as fixed-size CoreML tensors. This avoids dynamic memory allocation and keeps the Neural Engine pipeline clean.
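A rough sketch of the pattern, with hypothetical shapes: the cache is preallocated at its maximum size and threaded through each call as an explicit input/output tensor, so nothing grows or reallocates during decode:

```python
import numpy as np

# Hypothetical shapes for illustration only
MAX_LEN, N_HEADS, HEAD_DIM = 2048, 8, 64

def init_cache():
    """Fixed-size tensors allocated once, up front; they never grow."""
    k = np.zeros((MAX_LEN, N_HEADS, HEAD_DIM), dtype=np.float16)
    v = np.zeros((MAX_LEN, N_HEADS, HEAD_DIM), dtype=np.float16)
    return k, v

def decode_step(k_cache, v_cache, pos, new_k, new_v):
    """'Stateless' step: the cache is an explicit input and output,
    written in place at a fixed slot, not state held by the model."""
    k_cache[pos] = new_k
    v_cache[pos] = new_v
    return k_cache, v_cache, pos + 1
```

Because the tensor shapes never change from step to step, the Neural Engine can compile one static graph and reuse it for every decode step.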
Batched Prefill
The 188ms prefill time (for 33 tokens) comes from processing the entire prompt as a single batched operation rather than token-by-token. This is a 15.8x speedup over sequential processing and makes the initial response feel nearly instant.
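The difference can be sketched as one matmul over all prompt tokens at once versus a per-token loop. Illustrative NumPy, not the real pipeline:

```python
import numpy as np

def prefill_sequential(x: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Token-by-token: one (1, d) x (d, d) matmul per prompt token."""
    return np.stack([t @ w for t in x])

def prefill_batched(x: np.ndarray, w: np.ndarray) -> np.ndarray:
    """All prompt tokens in a single (n, d) x (d, d) matmul."""
    return x @ w

x = np.random.randn(33, 256).astype(np.float32)  # a 33-token prompt
w = np.random.randn(256, 256).astype(np.float32)
assert np.allclose(prefill_sequential(x, w), prefill_batched(x, w), atol=1e-4)
```

Both produce identical results; the batched form simply gives the hardware one large, parallel operation instead of 33 small sequential ones, which is where the prefill speedup comes from.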
Which iPhones Are Compatible?
You need an A16 chip or newer — that means iPhone 14 Pro and above.
| Device | Chip | Neural Engine | Compatible |
|---|---|---|---|
| iPhone 14 Pro / Pro Max | A16 Bionic | 16-core | Yes |
| iPhone 15 / 15 Plus | A16 Bionic | 16-core | Yes |
| iPhone 15 Pro / Pro Max | A17 Pro | 16-core | Yes |
| iPhone 16 / 16 Plus | A18 | 16-core | Yes |
| iPhone 16 Pro / Pro Max | A18 Pro | 16-core | Yes (fastest) |
| iPhone 14 / 14 Plus | A15 Bionic | 16-core | No |
| iPhone 13 and older | A15 or older | — | No |
The A16's Neural Engine has the instruction set needed for the INT4 palettized operations. Older chips technically have a Neural Engine, but they lack support for the specific quantization format CoreML-LLM uses.
Want to know what hardware you need for larger models? Check the hardware requirements guide.
Step-by-Step Setup Guide
Prerequisites
- Mac with Xcode 15.4+ installed
- iPhone 14 Pro or newer, running iOS 17+
- About 2GB free storage on your iPhone
- Python 3.10+ on your Mac (for model conversion)
Step 1: Install CoreML-LLM
```bash
# Clone the repository
git clone https://github.com/nicklimmm/coreml-llm.git
cd coreml-llm

# Install Python dependencies
pip install -r requirements.txt
```

Step 2: Download and Convert the Model
```bash
# Download Gemma 4 E2B and convert to CoreML format
python convert.py \
  --model google/gemma-4-e2b-it \
  --output gemma4-e2b.mlpackage \
  --quantize int4-palettized \
  --context-length 2048
```

This step takes 10-20 minutes depending on your Mac. The conversion handles quantization, palettization, and Neural Engine optimization automatically.
Step 3: Build the iOS App
```bash
# Open the Xcode project
open CoreMLLLM.xcodeproj
```

- In Xcode, select your iPhone as the target device.
- Drag `gemma4-e2b.mlpackage` into the project's Resources folder.
- Set your development team in Signing & Capabilities.
- Hit Build and Run (Cmd+R).
Step 4: First Run
The first launch takes a minute or two while CoreML compiles the model for your specific Neural Engine. After that, it's cached and subsequent launches are fast.
Try a simple prompt:
```
Explain what a neural engine is in two sentences.
```

You should see tokens streaming at roughly 11 per second. Put your phone in airplane mode — it still works. That's the whole point.
Step 5: Verify Performance
The app includes a built-in benchmark mode. Tap the settings icon and select "Run Benchmark" to see your device's actual prefill and decode speeds. Compare against the numbers in this article to make sure everything is working optimally.
CoreML-LLM vs AI Edge Gallery
There are two main ways to run Gemma 4 on iPhone right now. Here's how they compare:
| Feature | CoreML-LLM v0.2.0 | AI Edge Gallery |
|---|---|---|
| Setup difficulty | Medium (requires Mac + Xcode) | Easy (App Store download) |
| Decode speed | 11 tok/s | 5-15 tok/s |
| RAM usage | ~250MB | ~3GB |
| Power consumption | ~2W | ~5-8W |
| Runs on Neural Engine | Yes (dedicated) | Partial (mostly GPU) |
| Custom app integration | Yes (open-source) | No (standalone app) |
| Model support | Gemma 4 E2B only | E2B + E4B |
| Multimodal | Not yet | Limited |
Use CoreML-LLM if you want maximum efficiency, you're building your own app, or you want to squeeze the most out of your battery.
Use AI Edge Gallery if you just want to try Gemma 4 quickly without any development setup, or you need E4B. For more on the AI Edge Gallery approach, see our iPhone guide.
Limitations
E2B only. CoreML-LLM v0.2.0 supports Gemma 4 E2B. The larger E4B, 12B, and 26B models are not yet converted. E4B support is planned but no timeline has been announced.
No multimodal yet. Gemma 4 E2B supports text, vision, and audio in its full form, but CoreML-LLM currently only handles text inference. Multimodal support is on the roadmap for v0.3.
2048 context window. The sliding window attention keeps memory low, but limits how much text the model can consider at once. For longer documents, you'll need to chunk your input.
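A simple chunking helper for working around the context limit. The 128-token overlap is an assumption for illustration, not a CoreML-LLM parameter; it just keeps some shared context across chunk boundaries:

```python
def chunk_tokens(tokens: list[int], window: int = 2048, overlap: int = 128):
    """Split a long token sequence into windows that each fit the
    2048-token context, with a small overlap between neighbors so
    context carries across chunk boundaries."""
    step = window - overlap
    return [tokens[i:i + window]
            for i in range(0, max(len(tokens) - overlap, 1), step)]

# A 5000-token document becomes three overlapping <=2048-token chunks
chunks = chunk_tokens(list(range(5000)))
```

Summarize each chunk separately, then summarize the summaries; for a 2B model on-device, that map-reduce pattern is usually more reliable than trying to cram everything into one prompt anyway.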
Requires a Mac for setup. The model conversion and app building process requires Xcode on macOS. There's no Windows or Linux path right now.
Quality is E2B-level. This is a 2B parameter model. It's great for quick tasks — summaries, translations, simple Q&A, basic code generation — but don't expect GPT-4 level reasoning. See our E2B vs E4B comparison for a detailed quality breakdown.
Why This Matters
250MB of RAM and 2 watts. That means Gemma 4 can run alongside your other apps without killing your phone. It means you can use it for hours without destroying your battery. It means on-device AI stops being a "demo" and starts being a real feature you'd actually ship in a production app.
The Neural Engine has been sitting in iPhones for years, mostly running camera processing and keyboard predictions. CoreML-LLM is showing what happens when you actually target it properly with a real language model.
Next Steps
- New to Gemma 4 on mobile? Start with the Mobile Deployment Guide for the full picture
- Want the easier (but less efficient) option? See the iPhone Guide for AI Edge Gallery setup
- Choosing between model sizes? Read our E2B vs E4B comparison
- Need to check your hardware? See Hardware Requirements for all platforms
On-device AI just got a lot more practical. CoreML-LLM v0.2.0 with Gemma 4 E2B is the most efficient way to run a real language model on an iPhone today — and v0.3 with multimodal support is coming.
Stop reading. Start building.


