How to Deploy Gemma 4 on Android & iOS (Mobile AI Guide)

Apr 7, 2026

Running an AI model directly on your phone — no internet required, no data leaving your device — sounds futuristic, but Gemma 4 makes it real. The smaller E2B and E4B models are specifically designed for mobile deployment. This guide covers everything you need to get Gemma 4 running on Android and iOS.

Which Models Work on Mobile?

Not every Gemma 4 model fits on a phone. Here's what's realistic:

| Model | Parameters | RAM Needed | Android | iOS | Recommended? |
| --- | --- | --- | --- | --- | --- |
| Gemma 4 E2B | 2B | ~3 GB | Yes | Yes | Best for most phones |
| Gemma 4 E4B | 4B | ~5 GB | Yes | Yes | Flagship phones only |
| Gemma 4 1B | 1B | ~2 GB | Yes | Yes | Fastest, lower quality |
| Gemma 4 4B | 4B | ~5 GB | Possible | Possible | Tight fit |
| Gemma 4 12B+ | 12B+ | ~9 GB+ | No | No | Too large for mobile |

The E2B and E4B ("Edge") models are optimized for mobile — they include multimodal capabilities (text, vision, and audio) at sizes that actually fit on a phone. You can grab the model files from any source listed in our download guide. For detailed RAM and storage specs, check the hardware requirements.

Android Deployment

Android has the most mature ecosystem for on-device Gemma 4, thanks to Google's tight integration.

Option 1: Google AI Edge SDK

The AI Edge SDK is Google's official solution for running Gemma on Android:

// build.gradle.kts
dependencies {
    implementation("com.google.ai.edge:ai-edge-sdk:0.3.0")
}

// In your Activity or ViewModel
import android.content.Context
import androidx.lifecycle.ViewModel
import com.google.ai.edge.InferenceSession
import com.google.ai.edge.ModelConfig

class GemmaViewModel : ViewModel() {
    private var session: InferenceSession? = null

    fun initModel(context: Context) {
        val config = ModelConfig.builder()
            .setModelPath("gemma-4-e2b-it.task")
            .setMaxTokens(1024)
            .setTemperature(0.7f)
            .build()

        session = InferenceSession.create(context, config)
    }

    fun generateResponse(prompt: String): String {
        return session?.generateResponse(prompt) ?: "Model not loaded"
    }

    // Release native resources when the ViewModel is destroyed
    override fun onCleared() {
        super.onCleared()
        session?.close()
    }
}

Option 2: AICore (Pixel and Samsung)

AICore is built into recent Pixel phones and Samsung Galaxy devices. It provides system-level AI acceleration:

// Check if AICore is available
val aiCoreAvailable = AICore.isAvailable(context)

if (aiCoreAvailable) {
    // AICore handles model management and optimization
    val session = AICore.createSession(
        model = "gemma-4-e2b-it",
        options = AICore.Options.builder()
            .setAccelerator(AICore.Accelerator.GPU)
            .build()
    )

    val response = session.generate("Explain photosynthesis simply")
}

AICore advantage: the model may already be cached on the device, so users don't need to download 2-3 GB separately.
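One way to take advantage of that caching is to check for a cached model first and only trigger a download when it's missing. This is a sketch only: `isModelCached` and `downloadModel` are assumed names for illustration, not confirmed AICore API.

```kotlin
import android.content.Context

// Hypothetical sketch: prefer the system-cached model, fall back to a
// one-time download. isModelCached/downloadModel are illustrative names.
suspend fun obtainSession(context: Context): AICore.Session {
    require(AICore.isAvailable(context)) { "AICore not supported on this device" }

    if (!AICore.isModelCached("gemma-4-e2b-it")) {
        // Model not cached yet — download once, then reuse on later launches
        AICore.downloadModel("gemma-4-e2b-it")
    }

    // From here on, the system serves the model from its cache
    return AICore.createSession(model = "gemma-4-e2b-it")
}
```

Wrapping the check in one function keeps the download decision out of your UI code; callers simply await a ready session.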

Option 3: MediaPipe LLM Inference API

MediaPipe is more flexible and works across a wider range of Android devices:

dependencies {
    implementation("com.google.mediapipe:tasks-genai:0.10.20")
}

// Initialize the LLM
val options = LlmInference.LlmInferenceOptions.builder()
    .setModelPath("/data/local/tmp/gemma-4-e2b-it.bin")
    .setMaxTokens(1024)
    .setTopK(40)
    .setTemperature(0.7f)
    .setRandomSeed(42)
    .build()

val llmInference = LlmInference.createFromOptions(context, options)

// Generate text
val response = llmInference.generateResponse("What is machine learning?")

// Stream responses
llmInference.generateResponseAsync(prompt) { partialResult, done ->
    // Update UI with each token
    textView.append(partialResult)
}

iOS Deployment

Option 1: AI Edge Gallery App

The easiest way to test Gemma 4 on iOS is to download the AI Edge Gallery app from the App Store. For Apple-specific optimizations and setup details, see our dedicated iPhone guide.

  1. Install AI Edge Gallery
  2. Browse available models
  3. Download Gemma 4 E2B or E4B
  4. Start chatting — completely offline

This is great for personal use and testing, but not for embedding in your own app.

Option 2: LiteRT (TensorFlow Lite Runtime)

For integrating Gemma 4 into your own iOS app:

import LiteRT

// Error cases surfaced by the wrapper below
enum GemmaError: Error {
    case modelNotFound
}

class GemmaModel {
    private var interpreter: Interpreter?

    func loadModel() throws {
        guard let modelPath = Bundle.main.path(
            forResource: "gemma-4-e2b-it",
            ofType: "tflite"
        ) else {
            throw GemmaError.modelNotFound
        }

        var options = Interpreter.Options()
        options.threadCount = 4

        // Use the GPU (Metal) delegate for acceleration
        let gpuDelegate = MetalDelegate()
        interpreter = try Interpreter(
            modelPath: modelPath,
            options: options,
            delegates: [gpuDelegate]
        )
    }

    func generate(prompt: String) throws -> String {
        // tokenize(_:) and decode(_:) are app-provided helpers wrapping the
        // model's tokenizer — they are not part of the LiteRT API
        let tokens = tokenize(prompt)

        // Run inference
        try interpreter?.allocateTensors()
        try interpreter?.copy(tokens, toInputAt: 0)
        try interpreter?.invoke()

        // Decode output tokens back into text
        let output = try interpreter?.output(at: 0)
        return decode(output)
    }
}

Option 3: MediaPipe for iOS

MediaPipe also works on iOS:

import MediaPipeTasksGenAI

let options = LlmInference.Options()
options.modelPath = Bundle.main.path(
    forResource: "gemma-4-e2b-it",
    ofType: "bin"
)!
options.maxTokens = 1024
options.temperature = 0.7

let llm = try LlmInference(options: options)
let response = try llm.generateResponse(inputText: "Hello!")

Performance Expectations

Be realistic about what mobile AI can do. Here's what to expect:

| Device | Model | Speed (tok/s) | First Token (ms) | RAM Usage |
| --- | --- | --- | --- | --- |
| Pixel 9 Pro | E2B | ~15-20 | ~800 | ~3 GB |
| Pixel 9 Pro | E4B | ~8-12 | ~1500 | ~5 GB |
| Samsung S24 Ultra | E2B | ~15-18 | ~900 | ~3 GB |
| iPhone 15 Pro | E2B | ~12-15 | ~1000 | ~3 GB |
| iPhone 16 Pro | E2B | ~15-18 | ~800 | ~3 GB |
| iPhone 16 Pro | E4B | ~8-10 | ~1500 | ~5 GB |

These speeds are slower than desktop, but perfectly usable for interactive chat. The first token takes a bit longer as the model initializes.
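One practical way to hide that first-token delay is to run a throwaway warm-up generation while your loading screen is still visible, so weights are paged in before the user's first real prompt. A minimal sketch against the MediaPipe `LlmInference` instance from the Android section, assuming `kotlinx.coroutines` is available:

```kotlin
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.withContext

// Warm up the model off the main thread so the user's first real prompt
// doesn't pay the full initialization cost.
suspend fun warmUp(llmInference: LlmInference) = withContext(Dispatchers.Default) {
    // A short throwaway generation forces weight loading and cache setup
    llmInference.generateResponse("Hi")
}
```

Call this once right after model creation; subsequent prompts then start at the steady-state first-token latency.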

Battery and Thermal Considerations

Running AI inference is compute-intensive. Here's what to keep in mind:

| Concern | Reality | Mitigation |
| --- | --- | --- |
| Battery drain | ~5-8% per hour of active use | Limit max generation length |
| Heat | Phone gets warm during inference | Add cooldown pauses between long generations |
| Background use | OS may kill the process | Keep model loaded only when needed |
| Storage | 2-5 GB per model | Offer model download as optional |
// Good practice: release the model when the app is backgrounded
override fun onPause() {
    super.onPause()
    session?.close()
    session = null  // so onResume knows to re-initialize
}

override fun onResume() {
    super.onResume()
    if (session == null) initModel(this)
}

Offline: The Killer Feature

The biggest advantage of on-device AI is that it works without internet. Think about the use cases:

  • Travel: AI assistant works on airplane mode
  • Privacy-sensitive tasks: Medical questions, personal journaling, private coding — nothing leaves your device
  • Poor connectivity: Rural areas, subway, developing regions
  • Speed: No network latency — responses start immediately
  • Cost: No API fees after the initial model download

This is something cloud APIs fundamentally cannot offer. When you run Gemma 4 on your phone, your data stays on your phone. Period.

Mobile vs Cloud API

| Factor | On-Device (Gemma 4 E2B) | Cloud API (Gemini) |
| --- | --- | --- |
| Speed | ~15 tok/s | ~50-100 tok/s |
| Quality | Good (2B model) | Excellent (large model) |
| Privacy | Complete | Data sent to server |
| Offline | Yes | No |
| Cost | Free after download | Per-token pricing |
| Battery impact | High | Minimal |
| Setup | Model download required | API key only |

The ideal approach is hybrid: use on-device Gemma 4 for privacy-sensitive and offline tasks, and fall back to a cloud API when you need higher quality or when the phone is connected.
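That hybrid pattern can be sketched as a simple router. Everything here is illustrative: `localSession` stands in for whichever on-device API you chose earlier, and `cloudClient` for your Gemini API wrapper.

```kotlin
// Illustrative hybrid router: stay on device when the task is private or
// the network is down; otherwise use the cloud for quality.
// The function parameters are placeholders, not real SDK types.
class HybridAssistant(
    private val localSession: (String) -> String,  // on-device Gemma 4 E2B
    private val cloudClient: (String) -> String,   // cloud API (e.g. Gemini)
    private val isOnline: () -> Boolean
) {
    fun respond(prompt: String, privacySensitive: Boolean): String {
        // Privacy-sensitive or offline -> never leave the device
        return if (privacySensitive || !isOnline()) {
            localSession(prompt)
        } else {
            cloudClient(prompt)
        }
    }
}
```

Keeping the routing decision in one place makes it easy to add more signals later, such as battery level or prompt length.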

Next Steps

  • Want to run Gemma 4 on iPhone specifically? Check our detailed iPhone Guide for Apple-specific optimizations
  • Not sure which model to pick? Read Which Gemma 4 Model? to understand the full lineup
  • Curious about hardware requirements for desktop? See the Hardware Guide for desktop and laptop recommendations

Mobile AI is still early, but it's real and it works today. Start with the E2B model, test it on your phone, and build from there. The fact that a capable AI runs entirely on a phone you carry in your pocket — with no internet, no API keys, no monthly bills — is kind of amazing.

Gemma 4 AI
