Gemma 4 Thinking Mode: What It Does & When to Use It

Gemma 4 has a built-in thinking mode that lets the model "reason out loud" before giving you an answer. It's like asking someone to show their work on a math problem: the extra steps often lead to a better answer.

But it's not always worth the tradeoff. Let's break down when to use it and when to skip it.

What Is Thinking Mode?

In thinking mode, Gemma 4 generates a chain of reasoning before producing the final answer. The model essentially has an internal scratchpad where it works through the problem step by step.

Without thinking:

User: What's 17 × 23?
Gemma 4: 391

With thinking:

User: What's 17 × 23?
Gemma 4: <think>
17 × 23
= 17 × 20 + 17 × 3
= 340 + 51
= 391
</think>
391

The thinking happens inside <think> tags. Your application can either show this reasoning to the user or strip it out and just use the final answer.
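If you want to strip the reasoning out, a small helper is usually enough. The sketch below assumes the response arrives as one string with the reasoning wrapped in <think> tags, as in the example above; split_thinking is a hypothetical name, not part of any library.

```python
import re

# Matches one <think>...</think> block, including newlines inside it.
THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_thinking(text: str) -> tuple[str, str]:
    """Return (reasoning, answer) from a raw model response."""
    reasoning = "\n".join(m.strip() for m in THINK_BLOCK.findall(text))
    answer = THINK_BLOCK.sub("", text).strip()
    return reasoning, answer


raw = "<think>\n17 × 20 + 17 × 3\n= 340 + 51\n= 391\n</think>\n391"
reasoning, answer = split_thinking(raw)
print(answer)  # 391
```

A response with no <think> tags comes back unchanged as the answer, with an empty reasoning string.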

How to Enable Thinking Mode

With Ollama

# Use the thinking variant of the model
ollama run gemma4:26b-thinking

Or via the API, raising num_predict (the cap on generated tokens) so the response has room for the reasoning:

import requests

response = requests.post("http://localhost:11434/api/chat", json={
    "model": "gemma4:26b",
    "messages": [
        {"role": "user", "content": "Solve: If 3x + 7 = 22, what is x?"}
    ],
    "options": {
        "num_predict": 2048,  # Allow enough tokens for thinking
    },
    "stream": False,
})

print(response.json()["message"]["content"])

With the System Prompt

You can also trigger thinking behavior through the system prompt:

messages = [
    {
        "role": "system",
        "content": "Think step by step before answering. Show your reasoning in <think> tags, then provide the final answer."
    },
    {
        "role": "user",
        "content": "A train leaves Chicago at 9 AM traveling 60 mph. Another leaves New York at 10 AM traveling 80 mph toward Chicago. The distance is 800 miles. When do they meet?"
    }
]
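When you test a prompt like this, it helps to know the right answer in advance so you can tell whether the reasoning actually landed. Here is a quick reference calculation for the train problem, assuming the two trains start 800 miles apart and travel toward each other (the variable names are just for illustration).

```python
from fractions import Fraction

distance = 800              # miles between the two cities
speed_a, speed_b = 60, 80   # mph: Chicago train, New York train
head_start_hours = 1        # Chicago train leaves at 9 AM, the other at 10 AM

# Solve 60*t + 80*(t - 1) = 800, where t is hours after 9 AM.
t = Fraction(distance + speed_b * head_start_hours, speed_a + speed_b)

total_minutes = round(t * 60)
hours, minutes = divmod(total_minutes, 60)
meet_hour_24 = 9 + hours
print(f"{meet_hour_24}:{minutes:02d}")  # 15:17, about 3:17 PM
```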

Budget Tokens

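Some implementations let you control how much thinking the model does with a token budget. In the Ollama examples here, num_predict plays that role only loosely: it caps the total number of generated tokens, reasoning and final answer combined, so raising it gives the model more room to think.

# Higher cap = more room for thinking = slower, but often better on hard problems
response = requests.post("http://localhost:11434/api/chat", json={
    "model": "gemma4:26b",
    "messages": [{"role": "user", "content": "Complex reasoning task here..."}],
    "options": {
        "num_predict": 4096,  # Caps thinking + answer tokens combined
    },
    "stream": False,  # Return one JSON object instead of a stream
})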
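One side effect of a token cap is that the model can run out of room mid-thought and never reach the final answer. A simple way to catch that, assuming the reasoning shows up inline in <think> tags, is to check for an opening tag with no matching close. The helper name below is hypothetical. Ollama responses also include a done_reason field that you can inspect for a length-related stop, but check the exact values against your version.

```python
def thinking_was_cut_off(content: str) -> bool:
    """True if a <think> block was opened but never closed."""
    return content.count("<think>") > content.count("</think>")


truncated = "<think>\nFirst, isolate x: 3x = 22 - 7"
complete = "<think>\n3x = 15, so x = 5\n</think>\nx = 5"

print(thinking_was_cut_off(truncated))  # True
print(thinking_was_cut_off(complete))   # False
```

If this returns True, retrying with a higher num_predict is a reasonable first step.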

When Thinking Mode Helps

Thinking mode shines on tasks that require multi-step reasoning:

Math and logic problems:

Without thinking: "The answer is 42" (sometimes wrong)
With thinking: Step-by-step work → correct answer

Complex coding:

Without thinking: Generates code that looks right but has subtle bugs
With thinking: Reasons about edge cases, data flow, then generates cleaner code

Analysis and comparison:

Without thinking: Surface-level answer
With thinking: Considers multiple angles, weighs tradeoffs

Here's a rough comparison by problem type (the percentages are approximate):

Problem Type	Without Thinking	With Thinking	Improvement
Basic math (2+2)	Correct	Correct	None
Multi-step math	~70% correct	~90% correct	Significant
Logic puzzles	~50% correct	~80% correct	Major
Code debugging	Finds obvious bugs	Finds subtle bugs	Significant
Simple Q&A	Fast, correct	Slower, correct	None (just slower)
Translation	Good	Same quality	None
Creative writing	Natural flow	Can feel overthought	Worse

When to Skip Thinking Mode

Don't use thinking mode for:

Simple Q&A: "What's the capital of France?" doesn't need a chain of thought
Translation: Thinking mode adds latency without improving translation quality
Creative writing: The extra reasoning can make output feel stilted and over-planned
Chat conversations: Natural back-and-forth doesn't benefit from formal reasoning
Data extraction: Pulling names, dates, and numbers from text is pattern matching, not reasoning
High-throughput applications: If you're processing thousands of requests, the 2-3x slowdown matters. See our batch processing guide.
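If your application handles several kinds of requests, you can encode these rules in a small router instead of deciding per call. This is only a sketch: the task labels are hypothetical, and how you classify an incoming request is up to you.

```python
# Task labels are hypothetical; map them to whatever categories your app uses.
THINKING_TASKS = {"math", "logic", "code_debugging", "analysis"}


def should_think(task: str, high_throughput: bool = False) -> bool:
    """Decide whether to enable thinking mode for a request."""
    if high_throughput:
        # Batch jobs usually can't afford the extra tokens.
        return False
    # Anything not known to benefit (simple Q&A, translation, chat...) skips it.
    return task in THINKING_TASKS


print(should_think("math"))         # True
print(should_think("translation"))  # False
```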

Performance Impact

Thinking mode has a real cost:

Metric	Without Thinking	With Thinking
Tokens generated	100-500	300-2000
Time to answer	1-5 seconds	3-15 seconds
VRAM usage	Baseline	Slightly higher (longer KV cache)
Answer quality (reasoning)	Good	Better
Answer quality (simple)	Good	Same

The model generates 2-5x more tokens in thinking mode because it's writing out its reasoning. On a Mac with Apple Silicon, this means:

M2 Pro 16GB: Simple answer in 2s → Thinking answer in 6-8s
M4 Max 64GB: Simple answer in 1s → Thinking answer in 3-4s

For interactive chat, the delay is noticeable. For batch processing, it can double or triple your processing time.
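The numbers above will vary with your hardware, so it's worth measuring on your own machine. Non-streaming Ollama responses include eval_count (tokens generated) and eval_duration (in nanoseconds), which is enough to compute throughput. The sketch below works on a response dictionary; the sample values are made up for illustration.

```python
def generation_stats(response_json: dict) -> tuple[int, float, float]:
    """Return (tokens, seconds, tokens_per_second) from an Ollama response."""
    tokens = response_json["eval_count"]
    seconds = response_json["eval_duration"] / 1e9  # nanoseconds to seconds
    return tokens, seconds, tokens / seconds


# Made-up values standing in for response.json()
sample = {"eval_count": 600, "eval_duration": 6_000_000_000}
tokens, seconds, tps = generation_stats(sample)
print(f"{tokens} tokens in {seconds:.1f}s ({tps:.0f} tok/s)")
```

Run the same prompt with and without thinking and compare the token counts to see the real overhead for your workload.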

Practical Example: With vs Without

Let's look at a real problem:

Prompt: "I have 3 red balls, 5 blue balls, and 2 green balls in a bag. If I draw 2 balls without replacement, what's the probability both are blue?"

Without thinking:

The probability is 5/10 × 4/9 = 20/90 = 2/9 ≈ 0.222

Correct, but if the problem were trickier, the model might skip a step and get it wrong.

With thinking:

<think>
Total balls: 3 + 5 + 2 = 10
Blue balls: 5

Probability of first ball being blue: 5/10
After drawing one blue ball:
- Remaining balls: 9
- Remaining blue: 4

Probability of second ball being blue: 4/9

Combined probability: 5/10 × 4/9 = 20/90 = 2/9
</think>

The probability of drawing 2 blue balls is 2/9, or approximately 22.2%.

Same answer, but the thinking version is more likely to catch errors in harder problems because it explicitly tracks the state at each step.
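You can confirm the 2/9 result independently with exact fractions. The check below computes it two ways: by multiplying the sequential draw probabilities, and by counting blue pairs out of all possible pairs.

```python
from fractions import Fraction
from math import comb

# Sequential draws: first blue, then blue again from the remaining 9.
sequential = Fraction(5, 10) * Fraction(4, 9)

# Counting: ways to pick 2 of 5 blue balls over ways to pick 2 of 10 balls.
counting = Fraction(comb(5, 2), comb(10, 2))

print(sequential, counting)  # 2/9 2/9
```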

Combining Thinking with Other Features

Thinking + JSON output: Use thinking mode for the reasoning, then output structured JSON:

messages = [
    {
        "role": "system",
        "content": """Think through the problem in <think> tags.
Then output your final answer as JSON (no other text after the JSON):
{"answer": number, "confidence": number, "reasoning_summary": string}"""
    },
    {"role": "user", "content": "Complex problem here..."}
]

For more on structured output, see our JSON output guide.
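On the receiving side, you need to drop the reasoning before parsing. Here is a minimal sketch, assuming the model followed the system prompt and put a JSON object after the closing </think> tag; parse_thinking_json is a hypothetical helper name, and the sample response is made up.

```python
import json
import re


def parse_thinking_json(text: str) -> dict:
    """Strip <think> blocks, then parse the first JSON object that follows."""
    without_thinking = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL)
    start = without_thinking.find("{")
    if start == -1:
        raise ValueError("no JSON object found in response")
    # raw_decode tolerates stray text after the JSON object.
    obj, _ = json.JSONDecoder().raw_decode(without_thinking[start:])
    return obj


raw = (
    "<think>3x = 15, so x = 5</think>\n"
    '{"answer": 5, "confidence": 0.9, "reasoning_summary": "Subtract 7, divide by 3."}'
)
result = parse_thinking_json(raw)
print(result["answer"])  # 5
```

Removing the reasoning first matters because the thinking text can contain braces of its own, which would otherwise confuse the search for the JSON object.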

Thinking + fine-tuned models: If you've fine-tuned Gemma 4 for a specific domain, thinking mode can still improve reasoning quality on complex domain-specific problems.