Running AI agents through cloud APIs costs money, leaks your data, and stops working when you lose internet. With Gemma 4 + Ollama + OpenClaw, you can build a fully local AI agent that calls tools, searches the web privately, and runs a Telegram bot — all on your own hardware, for free.
This is consistently the most requested tutorial topic we see on X. Here's the complete setup.
Why Local Agents Matter
Three reasons people are building local instead of calling GPT-4 or Claude APIs:
- Zero cost. No per-token billing. Run as many queries as you want. Leave your agent running 24/7 without watching a meter.
- Privacy. Your prompts, documents, and tool results never leave your machine. No terms-of-service surprises.
- Offline. Works on a plane, in a cabin, behind a corporate firewall. The model runs locally, and tools like SearXNG give you local search without hitting Google.
The catch has always been quality — local models used to be too dumb for real agent work. Gemma 4 changes that. The 26B model handles 5-step tool calling chains without crashing, which is genuinely impressive for a model that fits on a single GPU.
The 3-Step Setup
Step 1: Pull Gemma 4 with Ollama
If you don't have Ollama installed yet, grab it from ollama.com. Then pull the recommended model:
```shell
ollama pull gemma4:26b-a4b
```

Why 26B-A4B specifically? It's a Mixture-of-Experts model: only 4 billion parameters are active at any time, but it draws from 26 billion total. This gives you the best quality-per-active-parameter of any Gemma 4 variant. On a MacBook M1, it uses about 13GB of RAM and runs at 20-40 tokens per second.
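The ~13GB figure lines up with simple quantization arithmetic: Ollama serves 4-bit quantized weights by default, and weight memory scales with total (not active) parameters. A rough back-of-envelope estimate, ignoring KV cache and runtime overhead:

```python
def quantized_weight_gb(total_params: float, bits_per_weight: float) -> float:
    """Approximate weight memory for a quantized model, in gigabytes."""
    return total_params * (bits_per_weight / 8) / 1e9

# 26B total parameters at 4 bits (0.5 bytes) per weight:
print(quantized_weight_gb(26e9, 4))  # -> 13.0
```

The active-parameter count (4B) is what drives tokens per second; the total count (26B) is what drives memory.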
If you want to understand the full model lineup and pick the right size for your hardware, see Which Gemma 4 Model Should You Use?.
For the Ollama setup details (custom parameters, GPU configuration, context window settings), check out How to Run Gemma 4 with Ollama.
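Before wiring up an agent, it's worth confirming the model answers through Ollama's OpenAI-compatible endpoint. A minimal stdlib-only check, assuming the default port (11434):

```python
import json
import urllib.request

def chat_payload(model: str, prompt: str) -> dict:
    """Build an OpenAI-style chat completion request body."""
    return {"model": model, "messages": [{"role": "user", "content": prompt}]}

def ask_ollama(prompt: str, base_url: str = "http://localhost:11434/v1") -> str:
    """POST a single prompt to Ollama's OpenAI-compatible endpoint."""
    body = json.dumps(chat_payload("gemma4:26b-a4b", prompt)).encode()
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# Usage (requires a running Ollama instance):
#   print(ask_ollama("Reply with the single word: ready"))
```

If this returns a sentence instead of an error, the agent layer has everything it needs.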
Step 2: Install OpenClaw
OpenClaw is an open-source agent framework designed for local LLMs. It handles the hard parts: tool registration, multi-turn conversation management, and integrations with services like Telegram and SearXNG.
```shell
git clone https://github.com/AstraBert/OpenClaw.git
cd OpenClaw
pip install -r requirements.txt
cp .env.example .env
```

Edit the .env file to point at your local Ollama instance:
```shell
LLM_BASE_URL=http://localhost:11434/v1
LLM_MODEL=gemma4:26b-a4b
LLM_API_KEY=ollama  # Ollama doesn't need a real key, but the field is required
```

Step 3: Connect Tools and Run
OpenClaw comes with built-in tools you can enable in the config:
```yaml
tools:
  - name: searxng
    enabled: true
    base_url: http://localhost:8888  # Local SearXNG instance
  - name: calculator
    enabled: true
  - name: web_scraper
    enabled: true
  - name: code_executor
    enabled: true
```

Start the agent:

```shell
python main.py
```

That's it. You now have a local AI agent with multi-tool calling, powered by Gemma 4.
What OpenClaw Gives You
OpenClaw isn't just a wrapper around the Ollama API. It handles several things that are painful to build yourself:
Telegram integration. Connect your agent as a Telegram bot. Your friends or team can chat with it from their phones while it runs on your machine.
SearXNG local search. Instead of calling Google's API (which costs money and tracks you), OpenClaw connects to a local SearXNG instance. You get web search without any external API calls.
Multi-tool calling. Gemma 4's native function calling support means the agent can chain multiple tools in a single query. Ask "search for the latest Gemma 4 benchmarks and calculate the average score" and it will call search, then calculator, then give you the answer.
Conversation memory. OpenClaw manages the conversation history and handles the tool-call-response loop automatically. You don't have to manually append messages and re-send them.
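OpenClaw's internals aren't reproduced here, but the tool-call-response cycle it manages for you looks roughly like this sketch (the `call_llm` helper and message shapes are simplified placeholders, not OpenClaw's actual API):

```python
def run_agent(call_llm, tools, user_message, max_steps=10):
    """Minimal tool-calling loop: keep invoking the model until it
    answers in plain text instead of requesting another tool."""
    messages = [{"role": "user", "content": user_message}]
    for _ in range(max_steps):
        reply = call_llm(messages)  # dict from the model
        if "tool_call" not in reply:
            return reply["content"]  # final answer, loop ends
        call = reply["tool_call"]
        result = tools[call["name"]](**call["arguments"])
        # Record both the model's request and the tool's result,
        # then loop so the model can use the result.
        messages.append({"role": "assistant", "tool_call": call})
        messages.append({"role": "tool", "name": call["name"], "content": str(result)})
    return "Stopped: max_steps reached"
```

The `max_steps` guard is what prevents a confused model from calling tools forever, which is why the tips table below recommends setting it.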
Real-World Performance
What people are actually reporting on X and GitHub:
| Setup | Performance |
|---|---|
| MacBook M1 16GB | 26B model, 13GB RAM, 20-40 tok/s |
| RTX 3090 24GB | 26B model, full GPU offload, 50+ tok/s |
| MacBook M2 Pro 32GB | 26B model with 128K context window, comfortable headroom |
| RTX 4060 8GB | 12B model recommended instead, 26B won't fit |
Users report the 26B model reliably completing 5-step tool calling chains — search, parse, calculate, format, respond — without losing coherence or crashing. This is a significant step up from earlier local models that would hallucinate tool call formats after 2-3 steps.
Known Issue: KV Cache Bug
There's a known bug in some versions of llama.cpp (which Ollama uses under the hood) that causes issues with multi-turn conversations. The KV cache can get corrupted after many tool call rounds, leading to garbled output or crashes.
Workaround:
```shell
# Set a lower context window to reduce KV cache pressure
ollama run gemma4:26b-a4b --num-ctx 8192
```

Or in your Ollama Modelfile:

```
PARAMETER num_ctx 8192
```

If you're hitting this issue, keeping the context window at 8K-16K instead of the full 256K significantly reduces the chance of KV cache corruption. The Ollama team is tracking this and a fix is expected in upcoming releases.
For long conversations, you can also periodically restart the conversation or implement a sliding window in your agent code that only keeps the last N exchanges.
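A sliding window can be as simple as keeping the system prompt plus the last few exchanges. A sketch, using the OpenAI-style message list (names are illustrative, not OpenClaw's API):

```python
def trim_history(messages, keep_exchanges=8):
    """Keep the system prompt (if present) plus the most recent
    user/assistant messages, bounding KV cache growth over long chats."""
    system = [m for m in messages if m["role"] == "system"][:1]
    rest = [m for m in messages if m["role"] != "system"]
    # Each exchange is roughly two messages (user + assistant).
    return system + rest[-keep_exchanges * 2:]
```

Call this before each model request and the context size stays bounded no matter how long the conversation runs.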
Example Use Cases
Local Telegram Bot
The most popular setup. Run a Telegram bot on your home server that your family or team can message. It searches the web, answers questions, does calculations — all without any API costs or data leaving your network.
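The allow-list itself boils down to an ID check against the env value. Conceptually, something like this (function names are illustrative, not OpenClaw's internals):

```python
def allowed_user_ids(env_value: str) -> set:
    """Parse a comma-separated TELEGRAM_ALLOWED_USERS value into IDs."""
    return {int(uid) for uid in env_value.split(",") if uid.strip()}

def is_authorized(user_id: int, allowed: set) -> bool:
    # An empty allow-list means the bot answers everyone.
    return not allowed or user_id in allowed
```

Restricting users matters more for local agents than it sounds: anyone who can message the bot can trigger web searches and code execution on your machine.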
```shell
TELEGRAM_BOT_TOKEN=your_bot_token_here
TELEGRAM_ALLOWED_USERS=user_id_1,user_id_2
```

Web Automation with Playwright
Combine OpenClaw with Playwright for browser automation. The agent can navigate websites, fill forms, extract data, and take screenshots — all orchestrated by Gemma 4's tool calling.
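Local models occasionally emit malformed tool calls, so it pays to validate the model's proposed arguments against the tool definition (like the browse_url schema below) before touching a browser. A minimal sketch, not OpenClaw's built-in validation:

```python
def validate_tool_args(tool_schema: dict, args: dict) -> None:
    """Check a model-proposed tool call against a JSON-schema-style tool
    definition: required keys must be present, enum values must match."""
    params = tool_schema["function"]["parameters"]
    for key in params.get("required", []):
        if key not in args:
            raise ValueError(f"missing required argument: {key}")
    for key, spec in params.get("properties", {}).items():
        if key in args and "enum" in spec and args[key] not in spec["enum"]:
            raise ValueError(f"invalid value for {key!r}: {args[key]!r}")
```

Rejecting a bad call and feeding the error message back to the model usually gets a corrected call on the next turn.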
```python
tools = [
    {
        "type": "function",
        "function": {
            "name": "browse_url",
            "description": "Open a URL in a headless browser and return the page content",
            "parameters": {
                "type": "object",
                "properties": {
                    "url": {"type": "string", "description": "URL to visit"},
                    "action": {
                        "type": "string",
                        "enum": ["read", "screenshot", "click"],
                        "description": "What to do on the page",
                    },
                },
                "required": ["url"],
            },
        },
    }
]
```

Local Code Assistant
Point the agent at your codebase and let it answer questions, find bugs, or generate code. With 256K context, Gemma 4 can hold an entire medium-sized project in context.
```shell
# Feed your project files as context
find ./src -name "*.py" -exec cat {} \; | python openclaw_cli.py \
  "Review this code for potential bugs and suggest fixes"
```

Tips for Stable Agent Loops
| Tip | Why |
|---|---|
| Use 26B-A4B, not 12B, for agent work | MoE architecture handles tool calling better |
| Keep context under 16K for multi-turn | Avoids KV cache issues in current llama.cpp |
| Set max_steps to 10 | Prevents infinite tool calling loops |
| Write detailed tool descriptions | Gemma 4 relies heavily on descriptions to pick the right tool |
| Test tools individually first | Make sure each tool works before chaining them |
Next Steps
- New to Ollama? Start with How to Run Gemma 4 with Ollama for the basics
- Want to understand tool calling first? Read Gemma 4 Function Calling for the underlying API
- Need to pick the right model size? See Which Gemma 4 Model? — the 26B A4B is our recommendation for agents
Local AI agents went from a novelty to genuinely useful in 2026. Gemma 4's function calling reliability, combined with OpenClaw's batteries-included approach, means you can have a production-quality agent running on your own hardware in under 10 minutes. No API keys, no monthly bills, no data leaving your machine.
Stop reading. Start building.


