Running an AI model directly on your phone — no internet required, no data leaving your device — sounds futuristic, but Gemma 4 makes it real. The smaller E2B and E4B models are specifically designed for mobile deployment. This guide covers everything you need to get Gemma 4 running on Android and iOS.
Which Models Work on Mobile?
Not every Gemma 4 model fits on a phone. Here's what's realistic:
| Model | Parameters | RAM Needed | Android | iOS | Recommended? |
|---|---|---|---|---|---|
| Gemma 4 E2B | 2B | ~3 GB | Yes | Yes | Best for most phones |
| Gemma 4 E4B | 4B | ~5 GB | Yes | Yes | Flagship phones only |
| Gemma 4 1B | 1B | ~2 GB | Yes | Yes | Fastest, lower quality |
| Gemma 4 4B | 4B | ~5 GB | Possible | Possible | Tight fit |
| Gemma 4 12B+ | 12B+ | ~9 GB+ | No | No | Too large for mobile |
The E2B and E4B ("Edge") models are optimized for mobile — they include multimodal capabilities (text, vision, and audio) at sizes that actually fit on a phone. You can grab the model files from any source listed in our download guide. For detailed RAM and storage specs, check the hardware requirements.
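Picking a model can be automated from available memory. Here is a minimal sketch in plain Java (your Android code would likely be Kotlin); the thresholds mirror the table above and the `ModelPicker` helper name is illustrative, not part of any SDK:

```java
public class ModelPicker {
    // Map free RAM (in GB) to the largest model from the table that fits.
    // Thresholds are the approximate "RAM Needed" figures above.
    static String pickModel(double freeRamGb) {
        if (freeRamGb >= 5.0) return "gemma-4-e4b-it"; // flagship phones
        if (freeRamGb >= 3.0) return "gemma-4-e2b-it"; // best for most phones
        if (freeRamGb >= 2.0) return "gemma-4-1b-it";  // fastest, lower quality
        return "none"; // 12B+ models are too large for mobile
    }

    public static void main(String[] args) {
        System.out.println(pickModel(6.0)); // gemma-4-e4b-it
        System.out.println(pickModel(3.5)); // gemma-4-e2b-it
        System.out.println(pickModel(1.5)); // none
    }
}
```

On Android you could feed this from `ActivityManager.MemoryInfo`; on iOS, from `os_proc_available_memory`.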
Android Deployment
Android has the most mature ecosystem for on-device Gemma 4, thanks to Google's tight integration.
Option 1: Google AI Edge SDK
The AI Edge SDK is Google's official solution for running Gemma on Android:
```kotlin
// build.gradle.kts
dependencies {
    implementation("com.google.ai.edge:ai-edge-sdk:0.3.0")
}
```

```kotlin
// In your Activity or ViewModel
import com.google.ai.edge.InferenceSession
import com.google.ai.edge.ModelConfig

class GemmaViewModel : ViewModel() {
    private var session: InferenceSession? = null

    fun initModel(context: Context) {
        val config = ModelConfig.builder()
            .setModelPath("gemma-4-e2b-it.task")
            .setMaxTokens(1024)
            .setTemperature(0.7f)
            .build()
        session = InferenceSession.create(context, config)
    }

    fun generateResponse(prompt: String): String {
        return session?.generateResponse(prompt) ?: "Model not loaded"
    }
}
```

Option 2: AICore (Pixel and Samsung)
AICore is built into recent Pixel phones and Samsung Galaxy devices. It provides system-level AI acceleration:
```kotlin
// Check if AICore is available on this device
val aiCoreAvailable = AICore.isAvailable(context)

if (aiCoreAvailable) {
    // AICore handles model management and optimization
    val session = AICore.createSession(
        model = "gemma-4-e2b-it",
        options = AICore.Options.builder()
            .setAccelerator(AICore.Accelerator.GPU)
            .build()
    )
    val response = session.generate("Explain photosynthesis simply")
}
```

AICore advantage: the model may already be cached on the device, so users don't need to download 2-3 GB separately.
Option 3: MediaPipe LLM Inference API
MediaPipe is more flexible and works across a wider range of Android devices:
```kotlin
// build.gradle.kts
dependencies {
    implementation("com.google.mediapipe:tasks-genai:0.10.20")
}
```

```kotlin
// Initialize the LLM
val options = LlmInference.LlmInferenceOptions.builder()
    .setModelPath("/data/local/tmp/gemma-4-e2b-it.bin")
    .setMaxTokens(1024)
    .setTopK(40)
    .setTemperature(0.7f)
    .setRandomSeed(42)
    .build()
val llmInference = LlmInference.createFromOptions(context, options)

// Generate text
val response = llmInference.generateResponse("What is machine learning?")

// Stream responses token by token
llmInference.generateResponseAsync(prompt) { partialResult, done ->
    // Update UI with each token
    textView.append(partialResult)
}
```

iOS Deployment
Option 1: AI Edge Gallery App
The easiest way to test Gemma 4 on iOS — download the AI Edge Gallery app from the App Store. For Apple-specific optimizations and setup details, see our dedicated iPhone guide.
- Install AI Edge Gallery
- Browse available models
- Download Gemma 4 E2B or E4B
- Start chatting — completely offline
This is great for personal use and testing, but not for embedding in your own app.
Option 2: LiteRT (TensorFlow Lite Runtime)
For integrating Gemma 4 into your own iOS app:
```swift
import LiteRT

class GemmaModel {
    private var interpreter: Interpreter?

    func loadModel() throws {
        guard let modelPath = Bundle.main.path(
            forResource: "gemma-4-e2b-it",
            ofType: "tflite"
        ) else {
            throw GemmaError.modelNotFound
        }

        var options = Interpreter.Options()
        options.threadCount = 4

        // Use the Metal GPU delegate for acceleration
        let gpuDelegate = MetalDelegate()
        interpreter = try Interpreter(
            modelPath: modelPath,
            options: options,
            delegates: [gpuDelegate]
        )
    }

    func generate(prompt: String) throws -> String {
        // Tokenize input
        let tokens = tokenize(prompt)

        // Run inference
        try interpreter?.allocateTensors()
        try interpreter?.copy(tokens, toInputAt: 0)
        try interpreter?.invoke()

        // Decode output
        let output = try interpreter?.output(at: 0)
        return decode(output)
    }
}
```

Option 3: MediaPipe for iOS
MediaPipe also works on iOS:
```swift
import MediaPipeTasksGenAI

let options = LlmInference.Options()
options.modelPath = Bundle.main.path(
    forResource: "gemma-4-e2b-it",
    ofType: "bin"
)!
options.maxTokens = 1024
options.temperature = 0.7

let llm = try LlmInference(options: options)
let response = try llm.generateResponse(inputText: "Hello!")
```

Performance Expectations
Be realistic about what mobile AI can do. Here's what to expect:
| Device | Model | Speed (tok/s) | First Token (ms) | RAM Usage |
|---|---|---|---|---|
| Pixel 9 Pro | E2B | ~15-20 | ~800 | ~3 GB |
| Pixel 9 Pro | E4B | ~8-12 | ~1500 | ~5 GB |
| Samsung S24 Ultra | E2B | ~15-18 | ~900 | ~3 GB |
| iPhone 15 Pro | E2B | ~12-15 | ~1000 | ~3 GB |
| iPhone 16 Pro | E2B | ~15-18 | ~800 | ~3 GB |
| iPhone 16 Pro | E4B | ~8-10 | ~1500 | ~5 GB |
These speeds are slower than desktop, but perfectly usable for interactive chat. The first token takes longer because the model must process the entire prompt before it can start generating.
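As a rough arithmetic check on the table, total response time is approximately first-token latency plus tokens divided by throughput. A small sketch in plain Java (the helper is illustrative):

```java
public class LatencyEstimate {
    // Rough wall-clock time to generate n tokens:
    // first-token latency (ms) plus steady-state generation time.
    static double estimateSeconds(int tokens, double firstTokenMs, double tokensPerSec) {
        return firstTokenMs / 1000.0 + tokens / tokensPerSec;
    }

    public static void main(String[] args) {
        // Pixel 9 Pro on E2B: ~800 ms first token, ~15 tok/s
        System.out.printf("%.1f s%n", estimateSeconds(200, 800, 15)); // 14.1 s
    }
}
```

So a 200-token answer on a Pixel 9 Pro with E2B lands around 14 seconds: slow by cloud standards, but fine for a chat UI that streams tokens as they arrive.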
Battery and Thermal Considerations
Running AI inference is compute-intensive. Here's what to keep in mind:
| Concern | Reality | Mitigation |
|---|---|---|
| Battery drain | ~5-8% per hour of active use | Limit max generation length |
| Heat | Phone gets warm during inference | Add cooldown pauses between long generations |
| Background use | OS may kill the process | Keep model loaded only when needed |
| Storage | 2-5 GB per model | Offer model download as optional |
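One way to apply the "limit max generation length" mitigation is to scale the token budget with battery level. A hypothetical policy, sketched in plain Java (the numbers and the `GenerationBudget` helper are illustrative):

```java
public class GenerationBudget {
    // Shrink the max-token budget as the battery drains,
    // trading answer length for battery life and less heat.
    static int maxTokens(int batteryPercent) {
        if (batteryPercent > 50) return 1024; // full budget, as in the configs above
        if (batteryPercent > 20) return 512;  // half budget
        return 128;                           // short answers only on low battery
    }

    public static void main(String[] args) {
        System.out.println(maxTokens(80)); // 1024
        System.out.println(maxTokens(15)); // 128
    }
}
```

The returned value would feed whatever `setMaxTokens`-style option your inference API exposes.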
```kotlin
// Good practice: release the model when the app leaves the foreground
override fun onPause() {
    super.onPause()
    session?.close()
}

override fun onResume() {
    super.onResume()
    if (session == null) initModel()
}
```

Offline: The Killer Feature
The biggest advantage of on-device AI is that it works without internet. Think about the use cases:
- Travel: AI assistant works on airplane mode
- Privacy-sensitive tasks: Medical questions, personal journaling, private coding — nothing leaves your device
- Poor connectivity: Rural areas, subway, developing regions
- Speed: No network latency — responses start immediately
- Cost: No API fees after the initial model download
This is something cloud APIs fundamentally cannot offer. When you run Gemma 4 on your phone, your data stays on your phone. Period.
Mobile vs Cloud API
| Factor | On-Device (Gemma 4 E2B) | Cloud API (Gemini) |
|---|---|---|
| Speed | ~15 tok/s | ~50-100 tok/s |
| Quality | Good (2B model) | Excellent (large model) |
| Privacy | Complete | Data sent to server |
| Offline | Yes | No |
| Cost | Free after download | Per-token pricing |
| Battery impact | High | Minimal |
| Setup | Model download required | API key only |
The ideal approach is hybrid: use on-device Gemma 4 for privacy-sensitive and offline tasks, and fall back to a cloud API when you need higher quality or when the phone is connected.
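The hybrid policy can be stated as a tiny routing function. A sketch in plain Java (the `HybridRouter` name and flags are illustrative assumptions, not any SDK's API):

```java
public class HybridRouter {
    enum Backend { ON_DEVICE, CLOUD }

    // Privacy-sensitive or offline requests stay on-device;
    // otherwise prefer the cloud for higher quality.
    static Backend route(boolean privacySensitive, boolean online) {
        if (privacySensitive) return Backend.ON_DEVICE; // data never leaves the phone
        if (!online) return Backend.ON_DEVICE;          // no network, no choice
        return Backend.CLOUD;                           // best quality when allowed
    }

    public static void main(String[] args) {
        System.out.println(route(true, true));   // ON_DEVICE
        System.out.println(route(false, false)); // ON_DEVICE
        System.out.println(route(false, true));  // CLOUD
    }
}
```

In a real app you would also want a quality escape hatch, such as retrying a failed on-device answer against the cloud when the user permits it.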
Next Steps
- Want to run Gemma 4 on iPhone specifically? Check our detailed iPhone Guide for Apple-specific optimizations
- Not sure which model to pick? Read Which Gemma 4 Model? to understand the full lineup
- Curious about hardware requirements for desktop? See the Hardware Guide for desktop and laptop recommendations
Mobile AI is still early, but it's real and it works today. Start with the E2B model, test it on your phone, and build from there. The fact that a capable AI runs entirely on a phone you carry in your pocket — with no internet, no API keys, no monthly bills — is kind of amazing.