Most Gemma 4 tutorials only show you how to use it for text. But here's the thing — multimodal is Gemma 4's biggest upgrade over previous versions. Every single Gemma 4 model can understand images, and the smaller E2B/E4B models even handle audio. If you're not sending images to Gemma 4, you're only using half the model.
Let's fix that.
What Can Gemma 4 "See"?
Gemma 4's vision capabilities cover a lot of ground:
- Photos — describe scenes, identify objects, read signs
- Screenshots — extract text, analyze UI layouts
- Charts and graphs — interpret data visualizations
- Documents — read printed or scanned text (OCR)
- Handwriting — read handwritten notes (quality depends on legibility)
- Video frames — analyze individual frames from video
- Diagrams — understand flowcharts, architecture diagrams, wireframes
- Code screenshots — read and explain code from images
One important thing to understand: Gemma 4 does image understanding, not image generation. It can look at a picture and tell you what's in it, but it can't create images. If you want image generation, that's a different model entirely (like Imagen).
Send Images via Ollama CLI
If you've got Ollama running locally, sending images is dead simple:
```bash
ollama run gemma4 "Describe this image in detail" --image /path/to/photo.jpg
```

Multiple images work too:

```bash
ollama run gemma4 "Compare these two screenshots" --image before.png --image after.png
```

That's it. One flag. Ollama handles the encoding and everything else behind the scenes.
Send Images via API (Python)
For programmatic use, you need to base64-encode the image and include it in your API call. Here's how with Ollama's local API:
```python
import requests
import base64

# Read and encode the image
with open("screenshot.png", "rb") as f:
    image_data = base64.b64encode(f.read()).decode("utf-8")

response = requests.post("http://localhost:11434/api/generate", json={
    "model": "gemma4",
    "prompt": "What text is visible in this screenshot?",
    "images": [image_data],
    "stream": False
})

print(response.json()["response"])
```

Using the Chat API with Images
For multi-turn conversations about images:
```python
import requests
import base64

with open("chart.png", "rb") as f:
    image_data = base64.b64encode(f.read()).decode("utf-8")

response = requests.post("http://localhost:11434/api/chat", json={
    "model": "gemma4",
    "messages": [
        {
            "role": "user",
            "content": "What trends do you see in this chart?",
            "images": [image_data]
        }
    ],
    "stream": False
})

print(response.json()["message"]["content"])
```

Using Google AI Studio SDK
```python
import google.generativeai as genai
from pathlib import Path

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemma-4-27b-it")

image = genai.upload_file(Path("diagram.png"))
response = model.generate_content([
    "Explain this architecture diagram. What are the main components and how do they connect?",
    image
])

print(response.text)
```

For more API options, check out our complete API tutorial.
5 Practical Use Cases (with Example Prompts)
1. OCR: Extract Text from Screenshots
```
Extract all visible text from this screenshot. Format it as plain text, preserving the layout as much as possible.
```

This works surprisingly well for app screenshots, web pages, receipts, and business cards. It's not perfect with very small text or unusual fonts, but for most use cases it gets the job done.
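If you're running OCR on many screenshots, it helps to factor the request body into a helper and reuse it. Here's a minimal sketch against Ollama's local API shown above; `build_ocr_payload` is a name introduced here for illustration, not part of any SDK:

```python
import base64


def build_ocr_payload(image_bytes: bytes, model: str = "gemma4") -> dict:
    """Build the JSON body for Ollama's /api/generate endpoint,
    asking the model to OCR a single image."""
    return {
        "model": model,
        "prompt": (
            "Extract all visible text from this screenshot. "
            "Format it as plain text, preserving the layout "
            "as much as possible."
        ),
        # Ollama expects base64-encoded image data in the "images" list
        "images": [base64.b64encode(image_bytes).decode("utf-8")],
        "stream": False,
    }
```

Pass the returned dict straight to `requests.post(..., json=payload)` as in the earlier example.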
2. Chart and Data Analysis
```
Analyze this chart. What type of chart is it? What are the key data points? What trends or patterns do you notice? Summarize the main takeaway in one sentence.
```

Gemma 4 can read bar charts, line graphs, pie charts, and scatter plots. It'll identify axes, labels, and approximate values. Great for quickly understanding data visualizations without digging into the raw data.
3. UI Design Review
```
Review this UI screenshot as a UX designer. Identify: 1) Visual hierarchy issues, 2) Accessibility concerns (contrast, text size), 3) Layout inconsistencies, 4) Suggestions for improvement. Be specific and reference exact elements.
```

This is a genuinely useful workflow. Drop in a screenshot of your app and get a quick design critique. It catches things like poor contrast, inconsistent spacing, and unclear CTAs.
4. Photo Description (Accessibility)
```
Write a detailed alt-text description for this image suitable for screen readers. Include: the main subject, setting, colors, mood, and any text visible. Keep it under 150 words.
```

Perfect for generating alt text for websites. It's not a replacement for hand-written descriptions for critical content, but it's great for bulk processing.
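For bulk processing, the first step is just collecting the image files to feed through the API. A small stdlib-only sketch (the extension list and `find_images` name are assumptions, adjust for your assets):

```python
from pathlib import Path

# Common web image formats; extend as needed
IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".webp"}


def find_images(folder: str) -> list[Path]:
    """Return every image file under `folder` (recursively),
    sorted so repeated runs process files in a stable order."""
    return sorted(
        p for p in Path(folder).rglob("*")
        if p.suffix.lower() in IMAGE_EXTS
    )
```

Each path can then be read, base64-encoded, and sent with the alt-text prompt using the generate or chat call shown earlier, writing one description per file.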
5. Handwritten Notes Transcription
```
Transcribe the handwritten text in this image. If any words are unclear, indicate them with [unclear]. Preserve the original structure (bullet points, numbered lists, etc.).
```

Quality depends heavily on the handwriting. Clean print-style writing works well. Messy cursive? Hit or miss. But even imperfect transcription is faster than retyping.
Which Models Support What?
| Capability | E2B | E4B | 26B | 31B |
|---|---|---|---|---|
| Text input | Yes | Yes | Yes | Yes |
| Image input | Yes | Yes | Yes | Yes |
| Audio input | Yes | Yes | No | No |
| Video frames | Yes | Yes | Yes | Yes |
| Image generation | No | No | No | No |
Key things to note:
- All models support image input. Even the smallest E2B can analyze images.
- Audio input is E2B/E4B only. The larger models don't support audio natively.
- No image generation. Gemma 4 is an understanding model, not a generative image model.
- Video = frames. You send individual frames, not video files. Extract keyframes first and send them as images.
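One common way to extract those keyframes is ffmpeg's `fps` filter, which samples a fixed number of frames per second. A sketch that builds the command (this assumes ffmpeg is installed; the function name and output pattern are mine):

```python
def ffmpeg_keyframe_cmd(video_path: str, out_dir: str, fps: float = 1.0) -> list[str]:
    """Build an ffmpeg invocation that samples `fps` frames per second
    from `video_path` into numbered JPEGs under `out_dir`."""
    return [
        "ffmpeg", "-i", video_path,
        "-vf", f"fps={fps}",          # fps=1.0 -> one frame per second
        f"{out_dir}/frame_%04d.jpg",  # frame_0001.jpg, frame_0002.jpg, ...
    ]
```

Run it with `subprocess.run(cmd, check=True)`, then send the resulting JPEGs to Gemma 4 as ordinary images. For long videos, drop `fps` to something like 0.2 (one frame every five seconds) to keep the frame count manageable.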
Tips for Better Image Analysis
Be specific in your prompts. "Describe this image" gives you generic output. "List every product visible on this shelf with approximate prices" gives you useful data.
Image quality matters. A blurry photo gives blurry answers. Crop to the relevant area before sending — less noise means better results.
Use the right model size. For simple OCR, E2B is fine. For complex scene understanding or nuanced analysis, the 26B or 31B models are noticeably better.
Multiple images work. You can send 2-3 images and ask for comparisons, differences, or combined analysis. Don't go crazy though — more images means more processing time and memory usage.
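For comparison prompts, all the images go in a single chat message. A minimal sketch against Ollama's chat format used earlier (`build_compare_message` is a helper name introduced here):

```python
import base64


def build_compare_message(image_paths: list[str], question: str) -> dict:
    """Build one Ollama chat message carrying several images
    alongside a comparison question."""
    images = []
    for path in image_paths:
        with open(path, "rb") as f:
            images.append(base64.b64encode(f.read()).decode("utf-8"))
    return {"role": "user", "content": question, "images": images}
```

Drop the result into the `messages` list of the /api/chat payload, e.g. `build_compare_message(["before.png", "after.png"], "Compare these two screenshots")`.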
Limitations
- No image generation. Can't draw, create, or edit images.
- Hallucinations happen. Gemma 4 might "read" text that isn't there or misidentify objects. Always verify critical information.
- Small text is hard. If you can barely read it, Gemma 4 probably can't either. Zoom in and crop.
- Complex diagrams. Very dense technical diagrams with lots of overlapping elements can confuse the model. Break them into sections if needed.
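When cropping to the relevant area, it's worth adding a little padding so context around the small text survives, while clamping to the image bounds. A pure-Python sketch (the box tuple matches what Pillow's `Image.crop` expects, but the helper itself is an illustration):

```python
def padded_crop_box(region, image_size, pad=20):
    """Expand a (left, top, right, bottom) region by `pad` pixels
    on every side, clamped to the (width, height) of the image."""
    left, top, right, bottom = region
    width, height = image_size
    return (
        max(0, left - pad), max(0, top - pad),
        min(width, right + pad), min(height, bottom + pad),
    )
```

With Pillow that would look like `img.crop(padded_crop_box((10, 10, 100, 50), img.size))` before saving and sending the smaller image.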
Next Steps
- Need the right prompts for image tasks? Check 50 Best Gemma 4 Prompts
- Want to call the image API programmatically? See our API tutorial
- Not sure which model to use for your vision tasks? Read Gemma 4: Which Model?
- Running locally? Start with our Ollama setup guide



