Gemma 4 has a built-in thinking mode that lets the model "reason out loud" before giving you an answer. It's like asking someone to show their work on a math problem — the extra steps often lead to a better answer.
But it's not always worth the tradeoff. Let's break down when to use it and when to skip it.
What Is Thinking Mode?
In thinking mode, Gemma 4 generates a chain of reasoning before producing the final answer. The model essentially has an internal scratchpad where it works through the problem step by step.
Without thinking:
User: What's 17 × 23?
Gemma 4: 391
With thinking:
User: What's 17 × 23?
Gemma 4: <think>
17 × 23
= 17 × 20 + 17 × 3
= 340 + 51
= 391
</think>
391
The thinking happens inside <think> tags. Your application can either show this reasoning to the user or strip it out and just use the final answer.
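Stripping the reasoning out is mostly a matter of pattern matching. Here is a minimal sketch, assuming the response arrives as one string with the reasoning wrapped in <think> tags as in the example above. The helper name split_thinking is made up for illustration, and the unclosed-tag branch is a guess at how a response cut off mid-reasoning would look.

```python
import re

# Matches one <think>...</think> block, including newlines inside it.
THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_thinking(text: str) -> tuple[str, str]:
    """Return (reasoning, answer) from a raw model response."""
    reasoning_parts = [part.strip() for part in THINK_BLOCK.findall(text)]
    answer = THINK_BLOCK.sub("", text)
    # An unclosed <think> usually means generation stopped mid-reasoning
    # (for example, the token limit was hit), so there is no answer yet.
    if "<think>" in answer:
        before, _, unfinished = answer.partition("<think>")
        reasoning_parts.append(unfinished.strip())
        answer = before
    return "\n".join(reasoning_parts), answer.strip()


raw = "<think>\n17 × 20 + 17 × 3\n= 340 + 51\n= 391\n</think>\n391"
reasoning, answer = split_thinking(raw)
print(answer)  # 391
```

Returning both parts lets the caller decide whether to display the reasoning, log it, or drop it.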
How to Enable Thinking Mode
With Ollama
# Use the thinking variant of the model
ollama run gemma4:26b-thinking
Or via the API. Note that num_predict caps the total number of generated tokens (reasoning plus answer), so set it high enough that the reasoning doesn't crowd out the answer:
import requests

response = requests.post("http://localhost:11434/api/chat", json={
    "model": "gemma4:26b",
    "messages": [
        {"role": "user", "content": "Solve: If 3x + 7 = 22, what is x?"}
    ],
    "options": {
        "num_predict": 2048,  # Allow enough tokens for thinking
    },
    "stream": False,
})
print(response.json()["message"]["content"])
With the System Prompt
You can also trigger thinking behavior through the system prompt:
messages = [
    {
        "role": "system",
        "content": "Think step by step before answering. Show your reasoning in <think> tags, then provide the final answer."
    },
    {
        "role": "user",
        "content": "A train leaves Chicago at 9 AM traveling 60 mph toward New York. Another leaves New York at 10 AM traveling 80 mph toward Chicago. The distance is 800 miles. When do they meet?"
    }
]
Budget Tokens
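Some implementations let you control how much thinking the model does with a token budget. In the Ollama example here, num_predict plays that role loosely: it caps the total tokens generated (reasoning plus answer), so a higher value leaves more room for thinking:
# Higher limit = more room to think = potentially slower, but better on hard problems
response = requests.post("http://localhost:11434/api/chat", json={
    "model": "gemma4:26b",
    "messages": [{"role": "user", "content": "Complex reasoning task here..."}],
    "options": {
        "num_predict": 4096,  # Higher limit for more thinking room
    },
    "stream": False,  # Return one JSON object instead of a stream of chunks
})
When Thinking Mode Helps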
Thinking mode shines on tasks that require multi-step reasoning:
Math and logic problems:
Without thinking: "The answer is 42" (sometimes wrong)
With thinking: Step-by-step work → correct answer
Complex coding:
Without thinking: Generates code that looks right but has subtle bugs
With thinking: Reasons about edge cases, data flow, then generates cleaner code
Analysis and comparison:
Without thinking: Surface-level answer
With thinking: Considers multiple angles, weighs tradeoffs
Here's a practical comparison on the same problems:
| Problem Type | Without Thinking | With Thinking | Improvement |
|---|---|---|---|
| Basic math (2+2) | Correct | Correct | None |
| Multi-step math | ~70% correct | ~90% correct | Significant |
| Logic puzzles | ~50% correct | ~80% correct | Major |
| Code debugging | Finds obvious bugs | Finds subtle bugs | Significant |
| Simple Q&A | Fast, correct | Slower, correct | None (worse: slower) |
| Translation | Good | Same quality | None |
| Creative writing | Natural flow | Can feel overthought | Worse |
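The percentages in the table are approximate, so it's worth measuring on your own prompts before committing to one mode. The following is a minimal sketch of a comparison harness. The accuracy function takes any callable that maps a prompt to an answer string; the two fake_* functions are made-up stand-ins for real model calls so the sketch runs offline, and their answers are invented for the demo, not measured results.

```python
from typing import Callable, Iterable


def accuracy(ask: Callable[[str], str], cases: Iterable[tuple[str, str]]) -> float:
    """Fraction of (prompt, expected) cases where ask(prompt) matches expected."""
    cases = list(cases)
    if not cases:
        raise ValueError("need at least one test case")
    correct = sum(1 for prompt, expected in cases if ask(prompt).strip() == expected)
    return correct / len(cases)


# Hypothetical stand-ins for real model calls (answers are invented).
def fake_without_thinking(prompt: str) -> str:
    return {"17 × 23": "391", "3x + 7 = 22": "6"}[prompt]


def fake_with_thinking(prompt: str) -> str:
    return {"17 × 23": "391", "3x + 7 = 22": "5"}[prompt]


cases = [("17 × 23", "391"), ("3x + 7 = 22", "5")]
print(accuracy(fake_without_thinking, cases), accuracy(fake_with_thinking, cases))
```

In real use, each callable would send the prompt to the model with thinking on or off and return the final answer with any <think> block removed. Exact string matching is the simplest scoring rule; free-form answers usually need something more forgiving.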
When to Skip Thinking Mode
Don't use thinking mode for:
- Simple Q&A: "What's the capital of France?" doesn't need a chain of thought
- Translation: Thinking mode adds latency without improving translation quality
- Creative writing: The extra reasoning can make output feel stilted and over-planned
- Chat conversations: Natural back-and-forth doesn't benefit from formal reasoning
- Data extraction: Pulling names, dates, and numbers from text is pattern matching, not reasoning
- High-throughput applications: If you're processing thousands of requests, the 2-3x slowdown matters. See our batch processing guide.
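If your application handles several kinds of requests, the guidance above can be turned into a small routing rule. This is only a sketch: the task labels and the function name should_use_thinking are made up for illustration, and how you classify an incoming request is left to your application.

```python
# Task labels are hypothetical; the split mirrors the guidance in this guide.
THINKING_TASKS = {"math", "logic", "code_debugging", "analysis"}
DIRECT_TASKS = {"simple_qa", "translation", "creative_writing", "chat", "data_extraction"}


def should_use_thinking(task_type: str, high_throughput: bool = False) -> bool:
    """Decide whether a request should go through thinking mode."""
    if high_throughput:
        # Batch pipelines usually cannot absorb the extra latency.
        return False
    if task_type in THINKING_TASKS:
        return True
    # Known direct tasks and unknown labels both take the cheaper path.
    return False


print(should_use_thinking("math"))         # True
print(should_use_thinking("translation"))  # False
```

Defaulting unknown task types to the non-thinking path keeps latency predictable. Flip that default if accuracy matters more than speed in your case.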
Performance Impact
Thinking mode has a real cost:
| Metric | Without Thinking | With Thinking |
|---|---|---|
| Tokens generated | 100-500 | 300-2000 |
| Time to answer | 1-5 seconds | 3-15 seconds |
| VRAM usage | Baseline | Same (slightly more KV cache) |
| Answer quality (reasoning) | Good | Better |
| Answer quality (simple) | Good | Same |
The model generates 2-5x more tokens in thinking mode because it's writing out its reasoning. On a Mac with Apple Silicon, this means:
- M2 Pro 16GB: Simple answer in 2s → Thinking answer in 6-8s
- M4 Max 64GB: Simple answer in 1s → Thinking answer in 3-4s
For interactive chat, the delay is noticeable. For batch processing, it can double or triple your processing time.
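To see where those numbers come from, you can estimate generation time from output length and decode speed. This is a back-of-the-envelope sketch, not a benchmark: the 50 tokens per second figure and the token counts are assumptions chosen for illustration, and the estimate ignores model loading and prompt processing.

```python
def generation_seconds(output_tokens: int, tokens_per_second: float) -> float:
    """Rough decode time; ignores model load and prompt processing."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return output_tokens / tokens_per_second


def batch_hours(num_requests: int, output_tokens: int, tokens_per_second: float) -> float:
    """Total time for requests processed one after another, in hours."""
    return num_requests * generation_seconds(output_tokens, tokens_per_second) / 3600


# Illustrative numbers only: 50 tokens/s is an assumed decode speed.
plain = generation_seconds(100, 50.0)     # 2.0 seconds
thinking = generation_seconds(350, 50.0)  # 7.0 seconds
print(plain, thinking, batch_hours(10_000, 350, 50.0))
```

Measure your own tokens per second first (Ollama prints timing stats when run with --verbose), then plug in realistic output lengths for your prompts.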
Practical Example: With vs Without
Let's look at a real problem:
Prompt: "I have 3 red balls, 5 blue balls, and 2 green balls in a bag. If I draw 2 balls without replacement, what's the probability both are blue?"
Without thinking:
The probability is 5/10 × 4/9 = 20/90 = 2/9 ≈ 0.222
Correct, but if the problem were trickier, the model might skip a step and get it wrong.
With thinking:
<think>
Total balls: 3 + 5 + 2 = 10
Blue balls: 5
Probability of first ball being blue: 5/10
After drawing one blue ball:
- Remaining balls: 9
- Remaining blue: 4
Probability of second ball being blue: 4/9
Combined probability: 5/10 × 4/9 = 20/90 = 2/9
</think>
The probability of drawing 2 blue balls is 2/9, or approximately 22.2%.
Same answer, but the thinking version is more likely to catch errors in harder problems because it explicitly tracks the state at each step.
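When you're evaluating model answers like this one, it helps to have a ground truth you didn't get from the model. Here is a short check of the 2/9 result, computed two ways with exact fractions: the sequential product from the thinking trace, and a brute-force count over every possible pair of balls.

```python
from fractions import Fraction
from itertools import combinations

bag = ["red"] * 3 + ["blue"] * 5 + ["green"] * 2

# Sequential reasoning, as in the thinking trace: P(blue) then P(blue | blue).
sequential = Fraction(5, 10) * Fraction(4, 9)

# Independent check: count unordered pairs of distinct balls that are both blue.
pairs = list(combinations(range(len(bag)), 2))
both_blue = sum(1 for i, j in pairs if bag[i] == bag[j] == "blue")
enumerated = Fraction(both_blue, len(pairs))

print(sequential, enumerated)  # 2/9 2/9
```

There are 45 possible pairs and 10 of them are blue-blue, which reduces to the same 2/9.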
Combining Thinking with Other Features
Thinking + JSON output: Use thinking mode for the reasoning, then output structured JSON:
messages = [
    {
        "role": "system",
        "content": """Think through the problem in <think> tags.
Then output your final answer as JSON (no other text after the JSON):
{"answer": number, "confidence": number, "reasoning_summary": string}"""
    },
    {"role": "user", "content": "Complex problem here..."}
]
For more on structured output, see our JSON output guide.
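On the receiving side, the reasoning has to come off before the JSON can be parsed. A minimal sketch, assuming the model followed the system prompt and put nothing but the JSON object after the closing </think> tag. The function name parse_thinking_json and the sample response are made up for illustration.

```python
import json
import re

THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL)


def parse_thinking_json(text: str) -> dict:
    """Drop <think> blocks, then parse what remains as a JSON object."""
    remainder = THINK_BLOCK.sub("", text).strip()
    result = json.loads(remainder)  # raises json.JSONDecodeError on bad output
    if not isinstance(result, dict):
        raise ValueError("expected a JSON object")
    return result


raw = (
    "<think>\n3x = 15, so x = 5\n</think>\n"
    '{"answer": 5, "confidence": 0.9, "reasoning_summary": "Subtract 7, divide by 3."}'
)
print(parse_thinking_json(raw)["answer"])  # 5
```

Models don't always follow format instructions, so wrap the call in a try/except and decide whether to retry or fall back when parsing fails.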
Thinking + fine-tuned models: If you've fine-tuned Gemma 4 for a specific domain, thinking mode can still improve reasoning quality on complex domain-specific problems.
Next Steps
- Try thinking mode with different models: model selection guide
- Combine with structured output: JSON output guide
- Run thinking mode locally: Ollama setup guide
- See performance on your hardware: Mac performance guide



