Running AI agents through cloud APIs costs money, leaks your data, and stops working when you lose internet. With Gemma 4 + Ollama + OpenClaw, you can build a fully local AI agent that calls tools, searches the web privately, and runs a Telegram bot — all on your own hardware, for free.
This is consistently the most requested tutorial topic we see on X. Here's the complete setup.
Why Local Agents Matter
Three reasons people are building local instead of calling GPT-4 or Claude APIs:
- Zero cost. No per-token billing. Run as many queries as you want. Leave your agent running 24/7 without watching a meter.
- Privacy. Your prompts, documents, and tool results never leave your machine. No terms-of-service surprises.
- Offline. Works on a plane, in a cabin, behind a corporate firewall. The model runs locally, and tools like SearXNG give you local search without hitting Google.
The catch has always been quality — local models used to be too dumb for real agent work. Gemma 4 changes that. The 26B model handles 5-step tool calling chains without crashing, which is genuinely impressive for a model that fits on a single GPU.
The 3-Step Setup
Step 1: Pull Gemma 4 with Ollama
If you don't have Ollama installed yet, grab it from ollama.com. Then pull the recommended model:
```shell
ollama pull gemma4:26b-a4b
```

Why 26B-A4B specifically? It's a Mixture-of-Experts model: only 4 billion parameters are active at any time, but it draws from 26 billion total. This gives you the best quality-per-active-parameter of any Gemma 4 variant. On a MacBook M1, it uses about 13GB of RAM and runs at 20-40 tokens per second.
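The ~13GB figure lines up with simple quantization arithmetic: Ollama serves 4-bit quantized weights by default, and weight memory scales with total (not active) parameters. A rough back-of-envelope estimate, ignoring KV cache and runtime overhead:

```python
def quantized_weight_gb(total_params: float, bits_per_weight: float) -> float:
    """Approximate weight memory for a quantized model, in gigabytes."""
    return total_params * (bits_per_weight / 8) / 1e9

# 26B total parameters at 4 bits (0.5 bytes) per weight:
print(quantized_weight_gb(26e9, 4))  # -> 13.0
```

The active-parameter count (4B) is what drives tokens per second; the total count (26B) is what drives memory.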
If you want to understand the full model lineup and pick the right size for your hardware, see Which Gemma 4 Model Should You Use?.
For the Ollama setup details (custom parameters, GPU configuration, context window settings), check out How to Run Gemma 4 with Ollama.
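Before wiring up an agent, it's worth confirming the model answers through Ollama's OpenAI-compatible endpoint. A minimal stdlib-only check, assuming the default port (11434):

```python
import json
import urllib.request

def chat_payload(model: str, prompt: str) -> dict:
    """Build an OpenAI-style chat completion request body."""
    return {"model": model, "messages": [{"role": "user", "content": prompt}]}

def ask_ollama(prompt: str, base_url: str = "http://localhost:11434/v1") -> str:
    """POST a single prompt to Ollama's OpenAI-compatible endpoint."""
    body = json.dumps(chat_payload("gemma4:26b-a4b", prompt)).encode()
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# Usage (requires a running Ollama instance):
#   print(ask_ollama("Reply with the single word: ready"))
```

If this returns a sentence instead of an error, the agent layer has everything it needs.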
Step 2: Install OpenClaw
OpenClaw is an open-source agent framework designed for local LLMs. It handles the hard parts: tool registration, multi-turn conversation management, and integrations with services like Telegram and SearXNG.
```shell
git clone https://github.com/AstraBert/OpenClaw.git
cd OpenClaw
pip install -r requirements.txt
cp .env.example .env
```

Edit the .env file to point at your local Ollama instance:
```shell
LLM_BASE_URL=http://localhost:11434/v1
LLM_MODEL=gemma4:26b-a4b
LLM_API_KEY=ollama  # Ollama doesn't need a real key, but the field is required
```

Step 3: Connect Tools and Run
OpenClaw comes with built-in tools you can enable in the config:
```yaml
tools:
  - name: searxng
    enabled: true
    base_url: http://localhost:8888  # Local SearXNG instance
  - name: calculator
    enabled: true
  - name: web_scraper
    enabled: true
  - name: code_executor
    enabled: true
```

Start the agent:

```shell
python main.py
```

That's it. You now have a local AI agent with multi-tool calling, powered by Gemma 4.
What OpenClaw Gives You
OpenClaw isn't just a wrapper around the Ollama API. It handles several things that are painful to build yourself:
Telegram integration. Connect your agent as a Telegram bot. Your friends or team can chat with it from their phones while it runs on your machine.
SearXNG local search. Instead of calling Google's API (which costs money and tracks you), OpenClaw connects to a local SearXNG instance. You get web search without any external API calls.
Multi-tool calling. Gemma 4's native function calling support means the agent can chain multiple tools in a single query. Ask "search for the latest Gemma 4 benchmarks and calculate the average score" and it will call search, then calculator, then give you the answer.
Conversation memory. OpenClaw manages the conversation history and handles the tool-call-response loop automatically. You don't have to manually append messages and re-send them.
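OpenClaw's internals aren't reproduced here, but the tool-call-response cycle it manages for you looks roughly like this sketch (the `call_llm` helper and message shapes are simplified placeholders, not OpenClaw's actual API):

```python
def run_agent(call_llm, tools, user_message, max_steps=10):
    """Minimal tool-calling loop: keep invoking the model until it
    answers in plain text instead of requesting another tool."""
    messages = [{"role": "user", "content": user_message}]
    for _ in range(max_steps):
        reply = call_llm(messages)  # dict from the model
        if "tool_call" not in reply:
            return reply["content"]  # final answer, loop ends
        call = reply["tool_call"]
        result = tools[call["name"]](**call["arguments"])
        # Record both the model's request and the tool's result,
        # then loop so the model can use the result.
        messages.append({"role": "assistant", "tool_call": call})
        messages.append({"role": "tool", "name": call["name"], "content": str(result)})
    return "Stopped: max_steps reached"
```

The `max_steps` guard is what prevents a confused model from calling tools forever, which is why the tips table below recommends setting it.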
Real-World Performance
What people are actually reporting on X and GitHub:
| Setup | Performance |
|---|---|
| MacBook M1 16GB | 26B model, 13GB RAM, 20-40 tok/s |
| RTX 3090 24GB | 26B model, full GPU offload, 50+ tok/s |
| MacBook M2 Pro 32GB | 26B model with 128K context window, comfortable headroom |
| RTX 4060 8GB | 12B model recommended instead, 26B won't fit |
Users report the 26B model reliably completing 5-step tool calling chains — search, parse, calculate, format, respond — without losing coherence or crashing. This is a significant step up from earlier local models that would hallucinate tool call formats after 2-3 steps.
Known Issue: KV Cache Bug
There's a known bug in some versions of llama.cpp (which Ollama uses under the hood) that causes issues with multi-turn conversations. The KV cache can get corrupted after many tool call rounds, leading to garbled output or crashes.
Workaround:
```shell
# Set a lower context window to reduce KV cache pressure
ollama run gemma4:26b-a4b --num-ctx 8192
```

Or in your Ollama Modelfile:

```
PARAMETER num_ctx 8192
```

If you're hitting this issue, keeping the context window at 8K-16K instead of the full 256K significantly reduces the chance of KV cache corruption. The Ollama team is tracking this and a fix is expected in upcoming releases.
For long conversations, you can also periodically restart the conversation or implement a sliding window in your agent code that only keeps the last N exchanges.
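A sliding window can be as simple as keeping the system prompt plus the last few exchanges. A sketch, using the OpenAI-style message list (names are illustrative, not OpenClaw's API):

```python
def trim_history(messages, keep_exchanges=8):
    """Keep the system prompt (if present) plus the most recent
    user/assistant messages, bounding KV cache growth over long chats."""
    system = [m for m in messages if m["role"] == "system"][:1]
    rest = [m for m in messages if m["role"] != "system"]
    # Each exchange is roughly two messages (user + assistant).
    return system + rest[-keep_exchanges * 2:]
```

Call this before each model request and the context size stays bounded no matter how long the conversation runs.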
Example Use Cases
Local Telegram Bot
The most popular setup. Run a Telegram bot on your home server that your family or team can message. It searches the web, answers questions, does calculations — all without any API costs or data leaving your network.
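The allow-list itself boils down to an ID check against the env value. Conceptually, something like this (function names are illustrative, not OpenClaw's internals):

```python
def allowed_user_ids(env_value: str) -> set:
    """Parse a comma-separated TELEGRAM_ALLOWED_USERS value into IDs."""
    return {int(uid) for uid in env_value.split(",") if uid.strip()}

def is_authorized(user_id: int, allowed: set) -> bool:
    # An empty allow-list means the bot answers everyone.
    return not allowed or user_id in allowed
```

Restricting users matters more for local agents than it sounds: anyone who can message the bot can trigger web searches and code execution on your machine.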
```shell
TELEGRAM_BOT_TOKEN=your_bot_token_here
TELEGRAM_ALLOWED_USERS=user_id_1,user_id_2
```

Web Automation with Playwright
Combine OpenClaw with Playwright for browser automation. The agent can navigate websites, fill forms, extract data, and take screenshots — all orchestrated by Gemma 4's tool calling.
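Local models occasionally emit malformed tool calls, so it pays to validate the model's proposed arguments against the tool definition (like the browse_url schema below) before touching a browser. A minimal sketch, not OpenClaw's built-in validation:

```python
def validate_tool_args(tool_schema: dict, args: dict) -> None:
    """Check a model-proposed tool call against a JSON-schema-style tool
    definition: required keys must be present, enum values must match."""
    params = tool_schema["function"]["parameters"]
    for key in params.get("required", []):
        if key not in args:
            raise ValueError(f"missing required argument: {key}")
    for key, spec in params.get("properties", {}).items():
        if key in args and "enum" in spec and args[key] not in spec["enum"]:
            raise ValueError(f"invalid value for {key!r}: {args[key]!r}")
```

Rejecting a bad call and feeding the error message back to the model usually gets a corrected call on the next turn.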
```python
tools = [
    {
        "type": "function",
        "function": {
            "name": "browse_url",
            "description": "Open a URL in a headless browser and return the page content",
            "parameters": {
                "type": "object",
                "properties": {
                    "url": {"type": "string", "description": "URL to visit"},
                    "action": {
                        "type": "string",
                        "enum": ["read", "screenshot", "click"],
                        "description": "What to do on the page",
                    },
                },
                "required": ["url"],
            },
        },
    }
]
```

Local Code Assistant
Point the agent at your codebase and let it answer questions, find bugs, or generate code. With 256K context, Gemma 4 can hold an entire medium-sized project in context.
```shell
# Feed your project files as context
find ./src -name "*.py" -exec cat {} \; | python openclaw_cli.py \
  "Review this code for potential bugs and suggest fixes"
```

Tips for Stable Agent Loops
| Tip | Why |
|---|---|
| Use 26B-A4B, not 12B, for agent work | MoE architecture handles tool calling better |
| Keep context under 16K for multi-turn | Avoids KV cache issues in current llama.cpp |
| Set max_steps to 10 | Prevents infinite tool calling loops |
| Write detailed tool descriptions | Gemma 4 relies heavily on descriptions to pick the right tool |
| Test tools individually first | Make sure each tool works before chaining them |
Next Steps
- New to Ollama? Start with How to Run Gemma 4 with Ollama for the basics
- Want to understand tool calling first? Read Gemma 4 Function Calling for the underlying API
- Need to pick the right model size? See Which Gemma 4 Model? — the 26B A4B is our recommendation for agents
Local AI agents went from a novelty to genuinely useful in 2026. Gemma 4's function calling reliability, combined with OpenClaw's batteries-included approach, means you can have a production-quality agent running on your own hardware in under 10 minutes. No API keys, no monthly bills, no data leaving your machine.
Stop reading. Start building.


