CoreML-LLM just dropped v0.2.0 and the numbers are absurd. Gemma 4 E2B running natively on iPhone with Apple's Neural Engine — 11 tokens per second, 250MB of RAM, 2 watts of power. This is the most efficient way to run Gemma 4 on an iPhone, period.
X/Twitter has been blowing up about this for good reason. Let's walk through what it is, how it works, and how to set it up.
What Is CoreML-LLM?
CoreML-LLM is an open-source framework that converts LLMs to Apple's CoreML format and runs them directly on the Neural Engine — the dedicated AI chip inside every modern iPhone. Unlike GPU-based inference, the Neural Engine is purpose-built for matrix operations, which means dramatically lower power consumption and memory usage.
Version 0.2.0 added full support for Gemma 4 E2B, and the community benchmarks speak for themselves.
The Benchmarks Everyone Is Talking About
Here's what CoreML-LLM v0.2.0 achieves with Gemma 4 E2B on iPhone:
| Metric | Result |
|---|---|
| Prefill (33 tokens) | 188ms (15.8x speedup vs CPU) |
| Decode speed | 11 tok/s |
| Context window | 2048 tokens |
| RAM usage | ~250MB |
| Power draw | ~2W |
Compare that to the standard approach of running Gemma 4 E2B via AI Edge Gallery or MediaPipe, which typically uses 3GB of RAM and drains your battery noticeably. CoreML-LLM uses roughly 12x less memory and sips power instead of gulping it.
For a broader view of mobile deployment options, see our mobile deployment guide.
How It Gets These Numbers
CoreML-LLM doesn't just dump the model onto the Neural Engine and hope for the best. There are four key technical optimizations that make this work:
Sliding Window Attention
Instead of full self-attention over the entire context, CoreML-LLM uses a sliding window approach. The model only attends to a fixed window of recent tokens at each layer. This keeps memory constant regardless of sequence length (up to the 2048 context limit) and is a perfect fit for the Neural Engine's fixed-size tensor operations.
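To make the idea concrete, here's a toy NumPy sketch of a sliding-window causal mask. This is illustrative only, not CoreML-LLM's actual implementation (which compiles the pattern into fixed-size Neural Engine ops):

```python
import numpy as np

def sliding_window_mask(seq_len: int, window: int) -> np.ndarray:
    """Causal attention mask where token i attends only to tokens in
    [i - window + 1, i]. Attended positions per row stay constant."""
    i = np.arange(seq_len)[:, None]   # query positions
    j = np.arange(seq_len)[None, :]   # key positions
    causal = j <= i                   # never attend to the future
    recent = j > i - window           # only the last `window` tokens
    return causal & recent

mask = sliding_window_mask(seq_len=8, window=3)
# Row 5 attends only to positions 3, 4, 5:
print(mask[5].astype(int))  # [0 0 0 1 1 1 0 0]
```

Because every row of the mask covers at most `window` positions, attention cost per token is constant rather than growing with sequence length.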
INT4 Palettized Weights with mmap
The weights are quantized to INT4 using Apple's palettization technique — a form of lookup-table quantization where each weight is an index into a small codebook. The model file is memory-mapped (mmap) rather than loaded entirely into RAM, which is why you see that 250MB figure instead of the full model size. Only the pages being actively used get loaded into physical memory.
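A minimal sketch of the palettization idea, assuming a uniform 16-entry codebook. Apple's actual tooling learns the codebook (e.g. via k-means clustering); this toy version just snaps each weight to the nearest of 16 evenly spaced values:

```python
import numpy as np

def palettize_int4(weights: np.ndarray, n_colors: int = 16):
    """Toy INT4 palettization: replace each weight with a 4-bit index
    into a 16-entry palette. Uniform levels here; real palettization
    learns the codebook from the weight distribution."""
    lo, hi = weights.min(), weights.max()
    codebook = np.linspace(lo, hi, n_colors)             # the 16-entry palette
    indices = np.abs(weights[..., None] - codebook).argmin(-1)
    return indices.astype(np.uint8), codebook

def dequantize(indices: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Inference-time lookup: index back into the palette."""
    return codebook[indices]

w = np.random.randn(4, 4).astype(np.float32)
idx, palette = palettize_int4(w)
assert idx.max() < 16                                    # fits in 4 bits
w_hat = dequantize(idx, palette)
```

Storing 4-bit indices instead of 16- or 32-bit floats is where the size reduction comes from; the mmap trick then ensures even those indices are only paged into RAM as they're touched.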
Stateless KV Cache
Traditional KV caches grow with sequence length and eat memory. CoreML-LLM uses a stateless approach where the KV cache is managed as fixed-size CoreML tensors. This avoids dynamic memory allocation and keeps the Neural Engine pipeline clean.
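A rough sketch of the pattern, with hypothetical shapes: the cache is preallocated at its maximum size and threaded through each call as an explicit input/output tensor, so nothing grows or reallocates during decode:

```python
import numpy as np

# Hypothetical shapes for illustration only
MAX_LEN, N_HEADS, HEAD_DIM = 2048, 8, 64

def init_cache():
    """Fixed-size tensors allocated once, up front; they never grow."""
    k = np.zeros((MAX_LEN, N_HEADS, HEAD_DIM), dtype=np.float16)
    v = np.zeros((MAX_LEN, N_HEADS, HEAD_DIM), dtype=np.float16)
    return k, v

def decode_step(k_cache, v_cache, pos, new_k, new_v):
    """'Stateless' step: the cache is an explicit input and output,
    written in place at a fixed slot, not state held by the model."""
    k_cache[pos] = new_k
    v_cache[pos] = new_v
    return k_cache, v_cache, pos + 1
```

Because the tensor shapes never change from step to step, the Neural Engine can compile one static graph and reuse it for every decode step.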
Batched Prefill
The 188ms prefill time (for 33 tokens) comes from processing the entire prompt as a single batched operation rather than token-by-token. This is a 15.8x speedup over sequential processing and makes the initial response feel nearly instant.
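The difference can be sketched as one matmul over all prompt tokens at once versus a per-token loop. Illustrative NumPy, not the real pipeline:

```python
import numpy as np

def prefill_sequential(x: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Token-by-token: one (1, d) x (d, d) matmul per prompt token."""
    return np.stack([t @ w for t in x])

def prefill_batched(x: np.ndarray, w: np.ndarray) -> np.ndarray:
    """All prompt tokens in a single (n, d) x (d, d) matmul."""
    return x @ w

x = np.random.randn(33, 256).astype(np.float32)  # a 33-token prompt
w = np.random.randn(256, 256).astype(np.float32)
assert np.allclose(prefill_sequential(x, w), prefill_batched(x, w), atol=1e-4)
```

Both produce identical results; the batched form simply gives the hardware one large, parallel operation instead of 33 small sequential ones, which is where the prefill speedup comes from.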
Which iPhones Are Compatible?
You need an A16 chip or newer — that means iPhone 14 Pro and above.
| Device | Chip | Neural Engine | Compatible |
|---|---|---|---|
| iPhone 14 Pro / Pro Max | A16 Bionic | 16-core | Yes |
| iPhone 15 / 15 Plus | A16 Bionic | 16-core | Yes |
| iPhone 15 Pro / Pro Max | A17 Pro | 16-core | Yes |
| iPhone 16 / 16 Plus | A18 | 16-core | Yes |
| iPhone 16 Pro / Pro Max | A18 Pro | 16-core | Yes (fastest) |
| iPhone 14 / 14 Plus | A15 Bionic | 16-core | No |
| iPhone 13 and older | A15 or older | — | No |
The A16's Neural Engine has the instruction set needed for the INT4 palettized operations. Older chips technically have a Neural Engine, but they lack support for the specific quantization format CoreML-LLM uses.
Want to know what hardware you need for larger models? Check the hardware requirements guide.
Step-by-Step Setup Guide
Prerequisites
- Mac with Xcode 15.4+ installed
- iPhone 14 Pro or newer, running iOS 17+
- About 2GB free storage on your iPhone
- Python 3.10+ on your Mac (for model conversion)
Step 1: Install CoreML-LLM
```bash
# Clone the repository
git clone https://github.com/nicklimmm/coreml-llm.git
cd coreml-llm

# Install Python dependencies
pip install -r requirements.txt
```

Step 2: Download and Convert the Model
```bash
# Download Gemma 4 E2B and convert to CoreML format
python convert.py \
  --model google/gemma-4-e2b-it \
  --output gemma4-e2b.mlpackage \
  --quantize int4-palettized \
  --context-length 2048
```

This step takes 10-20 minutes depending on your Mac. The conversion handles quantization, palettization, and Neural Engine optimization automatically.
Step 3: Build the iOS App
```bash
# Open the Xcode project
open CoreMLLLM.xcodeproj
```

- In Xcode, select your iPhone as the target device.
- Drag `gemma4-e2b.mlpackage` into the project's Resources folder.
- Set your development team in Signing & Capabilities.
- Hit Build and Run (Cmd+R).
Step 4: First Run
The first launch takes a minute or two while CoreML compiles the model for your specific Neural Engine. After that, it's cached and subsequent launches are fast.
Try a simple prompt:
```
Explain what a neural engine is in two sentences.
```

You should see tokens streaming at roughly 11 per second. Put your phone in airplane mode — it still works. That's the whole point.
Step 5: Verify Performance
The app includes a built-in benchmark mode. Tap the settings icon and select "Run Benchmark" to see your device's actual prefill and decode speeds. Compare against the numbers in this article to make sure everything is working optimally.
CoreML-LLM vs AI Edge Gallery
There are two main ways to run Gemma 4 on iPhone right now. Here's how they compare:
| Feature | CoreML-LLM v0.2.0 | AI Edge Gallery |
|---|---|---|
| Setup difficulty | Medium (requires Mac + Xcode) | Easy (App Store download) |
| Decode speed | 11 tok/s | 5-15 tok/s |
| RAM usage | ~250MB | ~3GB |
| Power consumption | ~2W | ~5-8W |
| Runs on Neural Engine | Yes (dedicated) | Partial (mostly GPU) |
| Custom app integration | Yes (open-source) | No (standalone app) |
| Model support | Gemma 4 E2B only | E2B + E4B |
| Multimodal | Not yet | Limited |
Use CoreML-LLM if you want maximum efficiency, you're building your own app, or you want to squeeze the most out of your battery.
Use AI Edge Gallery if you just want to try Gemma 4 quickly without any development setup, or you need E4B. For more on the AI Edge Gallery approach, see our iPhone guide.
Limitations
E2B only. CoreML-LLM v0.2.0 supports Gemma 4 E2B. The larger E4B, 12B, and 26B models are not yet converted. E4B support is planned but no timeline has been announced.
No multimodal yet. Gemma 4 E2B supports text, vision, and audio in its full form, but CoreML-LLM currently only handles text inference. Multimodal support is on the roadmap for v0.3.
2048 context window. The sliding window attention keeps memory low, but limits how much text the model can consider at once. For longer documents, you'll need to chunk your input.
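A simple chunking helper for working around the context limit. The 128-token overlap is an assumption for illustration, not a CoreML-LLM parameter; it just keeps some shared context across chunk boundaries:

```python
def chunk_tokens(tokens: list[int], window: int = 2048, overlap: int = 128):
    """Split a long token sequence into windows that each fit the
    2048-token context, with a small overlap between neighbors so
    context carries across chunk boundaries."""
    step = window - overlap
    return [tokens[i:i + window]
            for i in range(0, max(len(tokens) - overlap, 1), step)]

# A 5000-token document becomes three overlapping <=2048-token chunks
chunks = chunk_tokens(list(range(5000)))
```

Summarize each chunk separately, then summarize the summaries; for a 2B model on-device, that map-reduce pattern is usually more reliable than trying to cram everything into one prompt anyway.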
Requires a Mac for setup. The model conversion and app building process requires Xcode on macOS. There's no Windows or Linux path right now.
Quality is E2B-level. This is a 2B parameter model. It's great for quick tasks — summaries, translations, simple Q&A, basic code generation — but don't expect GPT-4 level reasoning. See our E2B vs E4B comparison for a detailed quality breakdown.
Why This Matters
250MB of RAM and 2 watts. That means Gemma 4 can run alongside your other apps without killing your phone. It means you can use it for hours without destroying your battery. It means on-device AI stops being a "demo" and starts being a real feature you'd actually ship in a production app.
The Neural Engine has been sitting in iPhones for years, mostly running camera processing and keyboard predictions. CoreML-LLM is showing what happens when you actually target it properly with a real language model.
Next Steps
- New to Gemma 4 on mobile? Start with the Mobile Deployment Guide for the full picture
- Want the easier (but less efficient) option? See the iPhone Guide for AI Edge Gallery setup
- Choosing between model sizes? Read our E2B vs E4B comparison
- Need to check your hardware? See Hardware Requirements for all platforms
On-device AI just got a lot more practical. CoreML-LLM v0.2.0 with Gemma 4 E2B is the most efficient way to run a real language model on an iPhone today — and v0.3 with multimodal support is coming.
Stop reading. Start building.


