Running an AI model directly on your phone — no internet required, no data leaving your device — sounds futuristic, but Gemma 4 makes it real. The smaller E2B and E4B models are specifically designed for mobile deployment. This guide covers everything you need to get Gemma 4 running on Android and iOS.
Which Models Work on Mobile?
Not every Gemma 4 model fits on a phone. Here's what's realistic:
| Model | Parameters | RAM Needed | Android | iOS | Recommended? |
|---|---|---|---|---|---|
| Gemma 4 E2B | 2B | ~3 GB | Yes | Yes | Best for most phones |
| Gemma 4 E4B | 4B | ~5 GB | Yes | Yes | Flagship phones only |
| Gemma 4 1B | 1B | ~2 GB | Yes | Yes | Fastest, lower quality |
| Gemma 4 4B | 4B | ~5 GB | Possible | Possible | Tight fit |
| Gemma 4 12B+ | 12B+ | ~9 GB+ | No | No | Too large for mobile |
The E2B and E4B ("Edge") models are optimized for mobile — they include multimodal capabilities (text, vision, and audio) at sizes that actually fit on a phone. You can grab the model files from any source listed in our download guide. For detailed RAM and storage specs, check the hardware requirements.
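Picking a model can be automated from available memory. Here is a minimal sketch in plain Java (your Android code would likely be Kotlin); the thresholds mirror the table above and the `ModelPicker` helper name is illustrative, not part of any SDK:

```java
public class ModelPicker {
    // Map free RAM (in GB) to the largest model from the table that fits.
    // Thresholds are the approximate "RAM Needed" figures above.
    static String pickModel(double freeRamGb) {
        if (freeRamGb >= 5.0) return "gemma-4-e4b-it"; // flagship phones
        if (freeRamGb >= 3.0) return "gemma-4-e2b-it"; // best for most phones
        if (freeRamGb >= 2.0) return "gemma-4-1b-it";  // fastest, lower quality
        return "none"; // 12B+ models are too large for mobile
    }

    public static void main(String[] args) {
        System.out.println(pickModel(6.0)); // gemma-4-e4b-it
        System.out.println(pickModel(3.5)); // gemma-4-e2b-it
        System.out.println(pickModel(1.5)); // none
    }
}
```

On Android you could feed this from `ActivityManager.MemoryInfo`; on iOS, from `os_proc_available_memory`.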
Android Deployment
Android has the most mature ecosystem for on-device Gemma 4, thanks to Google's tight integration.
Option 1: Google AI Edge SDK
The AI Edge SDK is Google's official solution for running Gemma on Android:
```kotlin
// build.gradle.kts
dependencies {
    implementation("com.google.ai.edge:ai-edge-sdk:0.3.0")
}
```

```kotlin
// In your Activity or ViewModel
import com.google.ai.edge.InferenceSession
import com.google.ai.edge.ModelConfig

class GemmaViewModel : ViewModel() {
    private var session: InferenceSession? = null

    fun initModel(context: Context) {
        val config = ModelConfig.builder()
            .setModelPath("gemma-4-e2b-it.task")
            .setMaxTokens(1024)
            .setTemperature(0.7f)
            .build()
        session = InferenceSession.create(context, config)
    }

    fun generateResponse(prompt: String): String {
        return session?.generateResponse(prompt) ?: "Model not loaded"
    }
}
```

Option 2: AICore (Pixel and Samsung)
AICore is built into recent Pixel phones and Samsung Galaxy devices. It provides system-level AI acceleration:
```kotlin
// Check if AICore is available on this device
val aiCoreAvailable = AICore.isAvailable(context)

if (aiCoreAvailable) {
    // AICore handles model management and optimization
    val session = AICore.createSession(
        model = "gemma-4-e2b-it",
        options = AICore.Options.builder()
            .setAccelerator(AICore.Accelerator.GPU)
            .build()
    )
    val response = session.generate("Explain photosynthesis simply")
}
```

AICore advantage: the model may already be cached on the device, so users don't need to download 2-3 GB separately.
Option 3: MediaPipe LLM Inference API
MediaPipe is more flexible and works across a wider range of Android devices:
```kotlin
// build.gradle.kts
dependencies {
    implementation("com.google.mediapipe:tasks-genai:0.10.20")
}
```

```kotlin
// Initialize the LLM
val options = LlmInference.LlmInferenceOptions.builder()
    .setModelPath("/data/local/tmp/gemma-4-e2b-it.bin")
    .setMaxTokens(1024)
    .setTopK(40)
    .setTemperature(0.7f)
    .setRandomSeed(42)
    .build()
val llmInference = LlmInference.createFromOptions(context, options)

// Generate text
val response = llmInference.generateResponse("What is machine learning?")

// Stream responses token by token
llmInference.generateResponseAsync(prompt) { partialResult, done ->
    // Update UI with each token
    textView.append(partialResult)
}
```

iOS Deployment
Option 1: AI Edge Gallery App
The easiest way to test Gemma 4 on iOS — download the AI Edge Gallery app from the App Store. For Apple-specific optimizations and setup details, see our dedicated iPhone guide.
- Install AI Edge Gallery
- Browse available models
- Download Gemma 4 E2B or E4B
- Start chatting — completely offline
This is great for personal use and testing, but not for embedding in your own app.
Option 2: LiteRT (TensorFlow Lite Runtime)
For integrating Gemma 4 into your own iOS app:
```swift
import LiteRT

class GemmaModel {
    private var interpreter: Interpreter?

    func loadModel() throws {
        guard let modelPath = Bundle.main.path(
            forResource: "gemma-4-e2b-it",
            ofType: "tflite"
        ) else {
            throw GemmaError.modelNotFound
        }

        var options = Interpreter.Options()
        options.threadCount = 4

        // Use the Metal GPU delegate for acceleration
        let gpuDelegate = MetalDelegate()
        interpreter = try Interpreter(
            modelPath: modelPath,
            options: options,
            delegates: [gpuDelegate]
        )
    }

    func generate(prompt: String) throws -> String {
        // Tokenize input
        let tokens = tokenize(prompt)

        // Run inference
        try interpreter?.allocateTensors()
        try interpreter?.copy(tokens, toInputAt: 0)
        try interpreter?.invoke()

        // Decode output
        let output = try interpreter?.output(at: 0)
        return decode(output)
    }
}
```

Option 3: MediaPipe for iOS
MediaPipe also works on iOS:
```swift
import MediaPipeTasksGenAI

let options = LlmInference.Options()
options.modelPath = Bundle.main.path(
    forResource: "gemma-4-e2b-it",
    ofType: "bin"
)!
options.maxTokens = 1024
options.temperature = 0.7

let llm = try LlmInference(options: options)
let response = try llm.generateResponse(inputText: "Hello!")
```

Performance Expectations
Be realistic about what mobile AI can do. Here's what to expect:
| Device | Model | Speed (tok/s) | First Token (ms) | RAM Usage |
|---|---|---|---|---|
| Pixel 9 Pro | E2B | ~15-20 | ~800 | ~3 GB |
| Pixel 9 Pro | E4B | ~8-12 | ~1500 | ~5 GB |
| Samsung S24 Ultra | E2B | ~15-18 | ~900 | ~3 GB |
| iPhone 15 Pro | E2B | ~12-15 | ~1000 | ~3 GB |
| iPhone 16 Pro | E2B | ~15-18 | ~800 | ~3 GB |
| iPhone 16 Pro | E4B | ~8-10 | ~1500 | ~5 GB |
These speeds are slower than desktop, but perfectly usable for interactive chat. The first token takes longer because the model must process the entire prompt before it can start generating.
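As a rough arithmetic check on the table, total response time is approximately first-token latency plus tokens divided by throughput. A small sketch in plain Java (the helper is illustrative):

```java
public class LatencyEstimate {
    // Rough wall-clock time to generate n tokens:
    // first-token latency (ms) plus steady-state generation time.
    static double estimateSeconds(int tokens, double firstTokenMs, double tokensPerSec) {
        return firstTokenMs / 1000.0 + tokens / tokensPerSec;
    }

    public static void main(String[] args) {
        // Pixel 9 Pro on E2B: ~800 ms first token, ~15 tok/s
        System.out.printf("%.1f s%n", estimateSeconds(200, 800, 15)); // 14.1 s
    }
}
```

So a 200-token answer on a Pixel 9 Pro with E2B lands around 14 seconds: slow by cloud standards, but fine for a chat UI that streams tokens as they arrive.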
Battery and Thermal Considerations
Running AI inference is compute-intensive. Here's what to keep in mind:
| Concern | Reality | Mitigation |
|---|---|---|
| Battery drain | ~5-8% per hour of active use | Limit max generation length |
| Heat | Phone gets warm during inference | Add cooldown pauses between long generations |
| Background use | OS may kill the process | Keep model loaded only when needed |
| Storage | 2-5 GB per model | Offer model download as optional |
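One way to apply the "limit max generation length" mitigation is to scale the token budget with battery level. A hypothetical policy, sketched in plain Java (the numbers and the `GenerationBudget` helper are illustrative):

```java
public class GenerationBudget {
    // Shrink the max-token budget as the battery drains,
    // trading answer length for battery life and less heat.
    static int maxTokens(int batteryPercent) {
        if (batteryPercent > 50) return 1024; // full budget, as in the configs above
        if (batteryPercent > 20) return 512;  // half budget
        return 128;                           // short answers only on low battery
    }

    public static void main(String[] args) {
        System.out.println(maxTokens(80)); // 1024
        System.out.println(maxTokens(15)); // 128
    }
}
```

The returned value would feed whatever `setMaxTokens`-style option your inference API exposes.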
```kotlin
// Good practice: release the model when the app leaves the foreground
override fun onPause() {
    super.onPause()
    session?.close()
}

override fun onResume() {
    super.onResume()
    if (session == null) initModel()
}
```

Offline: The Killer Feature
The biggest advantage of on-device AI is that it works without internet. Think about the use cases:
- Travel: AI assistant works on airplane mode
- Privacy-sensitive tasks: Medical questions, personal journaling, private coding — nothing leaves your device
- Poor connectivity: Rural areas, subway, developing regions
- Speed: No network latency — responses start immediately
- Cost: No API fees after the initial model download
This is something cloud APIs fundamentally cannot offer. When you run Gemma 4 on your phone, your data stays on your phone. Period.
Mobile vs Cloud API
| Factor | On-Device (Gemma 4 E2B) | Cloud API (Gemini) |
|---|---|---|
| Speed | ~15 tok/s | ~50-100 tok/s |
| Quality | Good (2B model) | Excellent (large model) |
| Privacy | Complete | Data sent to server |
| Offline | Yes | No |
| Cost | Free after download | Per-token pricing |
| Battery impact | High | Minimal |
| Setup | Model download required | API key only |
The ideal approach is hybrid: use on-device Gemma 4 for privacy-sensitive and offline tasks, and fall back to a cloud API when you need higher quality or when the phone is connected.
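The hybrid policy can be stated as a tiny routing function. A sketch in plain Java (the `HybridRouter` name and flags are illustrative assumptions, not any SDK's API):

```java
public class HybridRouter {
    enum Backend { ON_DEVICE, CLOUD }

    // Privacy-sensitive or offline requests stay on-device;
    // otherwise prefer the cloud for higher quality.
    static Backend route(boolean privacySensitive, boolean online) {
        if (privacySensitive) return Backend.ON_DEVICE; // data never leaves the phone
        if (!online) return Backend.ON_DEVICE;          // no network, no choice
        return Backend.CLOUD;                           // best quality when allowed
    }

    public static void main(String[] args) {
        System.out.println(route(true, true));   // ON_DEVICE
        System.out.println(route(false, false)); // ON_DEVICE
        System.out.println(route(false, true));  // CLOUD
    }
}
```

In a real app you would also want a quality escape hatch, such as retrying a failed on-device answer against the cloud when the user permits it.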
Next Steps
- Want to run Gemma 4 on iPhone specifically? Check our detailed iPhone Guide for Apple-specific optimizations
- Not sure which model to pick? Read Which Gemma 4 Model? to understand the full lineup
- Curious about hardware requirements for desktop? See the Hardware Guide for desktop and laptop recommendations
Mobile AI is still early, but it's real and it works today. Start with the E2B model, test it on your phone, and build from there. The fact that a capable AI runs entirely on a phone you carry in your pocket — with no internet, no API keys, no monthly bills — is kind of amazing.