Gemma 4 + OpenAI Codex CLI: The Free, Private Local Coding Assistant (2026)
When OpenAI shipped Codex CLI in early April 2026, the first question that lit up Hacker News wasn’t does it work? — it was can I point it at a local model and stop paying per token?
The short answer: yes. Codex CLI will happily talk to any OpenAI-compatible endpoint, and Gemma 4 running under Ollama is one of the cleanest drop-in replacements you can put behind it. You get a terminal coding agent that costs nothing to run, never ships your source code to a third party, and keeps working on a flight or behind a corporate firewall.
This guide walks the whole path end to end — install Ollama, pull the right Gemma 4 variant, wire the environment variables, and run four realistic coding tasks to prove the setup. If you’ve been stalling on Codex because of the API bill or the compliance review, this is the post you were looking for.
What Is OpenAI Codex CLI?
Codex CLI is OpenAI’s terminal-native coding agent — think of it as a command-line cousin to GitHub Copilot, without the editor. You stay in your shell and talk to it in plain English:
- Generate code: codex "scaffold a REST API with JWT auth"
- Explain a file: codex explain app/main.py
- Refactor in place: codex refactor src/cache.ts --goal "reduce memory"
- Fix an error: codex fix --error "TypeError: undefined is not a function"
Out of the box it talks to api.openai.com. But under the hood it’s a standard OpenAI-format HTTP client, so anything that speaks the chat completions protocol — Ollama, LM Studio, vLLM, Llama.cpp’s server — will work. That’s the whole trick.
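To make that concrete, here is a minimal sketch (using the openai Python SDK, and assuming the gemma4:26b-a4b tag we pull below) of what speaking the chat completions protocol looks like; Codex CLI does roughly this against whatever base URL you point it at:
```python
# Minimal sketch: any OpenAI-format client can target Ollama's local endpoint.
# Assumes Ollama is serving on the default port and gemma4:26b-a4b is pulled.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",  # Ollama's OpenAI-compatible API
    api_key="ollama",                      # any non-empty string; Ollama ignores it
)

response = client.chat.completions.create(
    model="gemma4:26b-a4b",
    messages=[{"role": "user", "content": "Write a one-line Python hello world."}],
)
print(response.choices[0].message.content)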
Why Replace OpenAI With Gemma 4?
Three reasons that hold up in practice.
Privacy. Your code never leaves your machine. No vendor ever sees your business logic, your secrets, or your database schemas. For teams under GDPR, SOC 2, HIPAA, or an internal IP policy, that’s not a nice-to-have.
Cost. OpenAI bills per token, and Codex likes to pass the whole file in context. A refactor on a medium repo can quietly burn five to ten dollars. Gemma 4 running on your laptop costs exactly nothing — you can leave an agent loop running overnight and it still costs nothing.
Offline. Trains, planes, client sites, air-gapped networks. If you work anywhere with flaky or forbidden internet, a local model isn’t a luxury, it’s the only option.
The reason this wasn’t practical two years ago was quality — local models couldn’t write real code. Gemma 4 closes that gap. The 26B MoE and 31B Dense checkpoints are good enough at instruction following, function calling, and structured output that you stop noticing you’re offline.
Prerequisites
Before you start:
- Ollama installed (ollama.com)
- Gemma 4 pulled locally (26B or 31B — we’ll pick below)
- OpenAI Codex CLI installed
- Hardware: 16 GB RAM minimum for 26B, 24 GB+ for 31B. See our hardware guide for details
- A working Node.js 20+ toolchain
Not sure which Gemma 4 size fits your machine? The short version: 26B MoE is the sweet spot for laptops because only ~4B parameters activate per token. If you’re on a desktop with headroom, 31B Dense is meaningfully smarter on long-horizon tasks.
Step 1: Install Ollama and Pull Gemma 4
Ollama handles model downloads, quantization, and the local HTTP server in one binary. Nothing else to configure.
Install Ollama
# macOS
brew install ollama
# Linux
curl -fsSL https://ollama.com/install.sh | sh
# Windows: download the installer from ollama.com and run it
Pull the model
For most coding work, Gemma 4 26B (MoE) is the best price-performance pick. A 16 GB MacBook can run it smoothly because only ~4B parameters activate per token:
ollama pull gemma4:26b-a4b
If you have more headroom (24 GB+ unified memory or a 24 GB GPU) and want maximum quality, pull the 31B Dense:
ollama pull gemma4:31b
Expect a 15–20 GB download. Verify it works:
ollama run gemma4:26b-a4b "Write a Python function that reverses a string"
If you see a sane answer, you’re done with this step.
Step 2: Start the Ollama API Server
Ollama exposes an OpenAI-compatible API on http://localhost:11434/v1. Start it in a dedicated terminal:
ollama serve
Leave that window open. From a second terminal, confirm the API is live:
curl http://localhost:11434/v1/models
You should get a JSON list that includes gemma4:26b-a4b. If curl hangs or refuses the connection, the server didn’t bind — check port 11434 isn’t already in use.
Step 3: Install OpenAI Codex CLI
Codex CLI ships via npm:
npm install -g @openai/codex-cli
codex --version
You should see a 1.x.x version string. If the install fails, make sure your Node.js is 20 or newer — Codex uses newer fetch APIs.
Step 4: Point Codex CLI at Gemma 4
Codex CLI honours the standard OpenAI SDK environment variables. Override three of them to redirect it to Ollama.
macOS / Linux
Add to ~/.zshrc or ~/.bashrc:
export OPENAI_API_BASE="http://localhost:11434/v1"
export OPENAI_API_KEY="ollama" # any non-empty string
export CODEX_MODEL="gemma4:26b-a4b"
Reload: source ~/.zshrc.
Windows (PowerShell)
$env:OPENAI_API_BASE = "http://localhost:11434/v1"
$env:OPENAI_API_KEY = "ollama"
$env:CODEX_MODEL = "gemma4:26b-a4b"
Smoke test
codex "print hello world in Python"
Expected output:
print("Hello, World!")
If you see Model not found or Connection refused, jump to Troubleshooting below.
Step 5: Four Realistic Tasks
Hello World is a bad benchmark. Here are four tasks that match how people actually use Codex CLI day to day.
1. Scaffold a REST endpoint
codex "create a FastAPI endpoint that accepts email + password, validates them, and returns a JWT"Gemma 4 26B will produce something close to:
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel, EmailStr
from datetime import datetime, timedelta
import jwt
app = FastAPI()
SECRET_KEY = "your-secret-key"
class UserLogin(BaseModel):
    email: EmailStr
    password: str

@app.post("/login")
def login(user: UserLogin):
    if user.email == "user@example.com" and user.password == "password123":
        token = jwt.encode(
            {"sub": user.email, "exp": datetime.utcnow() + timedelta(hours=1)},
            SECRET_KEY, algorithm="HS256",
        )
        return {"access_token": token, "token_type": "bearer"}
    raise HTTPException(status_code=401, detail="Invalid credentials")
2. Explain unfamiliar code
Save as complex.py:
def f(n): return n if n < 2 else f(n-1) + f(n-2)
Then:
codex explain complex.py
Gemma 4 will identify the naive recursive Fibonacci, flag its O(2^n) complexity, and suggest memoisation.
3. Refactor for performance
codex refactor complex.py --goal "optimize for speed"
Expected rewrite:
from functools import lru_cache
@lru_cache(maxsize=None)
def f(n):
    return n if n < 2 else f(n-1) + f(n-2)
O(2^n) down to O(n) with a decorator. Ten tokens of code, the classic textbook fix.
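If you want to convince yourself the decorator really changes the complexity class, a quick informal check (not part of the Codex output) is to count how many times each version is invoked:
```python
# Count invocations of the naive vs. memoised Fibonacci for n = 30.
from functools import lru_cache

calls = {"naive": 0, "cached": 0}

def fib_naive(n):
    calls["naive"] += 1
    return n if n < 2 else fib_naive(n - 1) + fib_naive(n - 2)

@lru_cache(maxsize=None)
def fib_cached(n):
    calls["cached"] += 1  # only counts real executions, not cache hits
    return n if n < 2 else fib_cached(n - 1) + fib_cached(n - 2)

fib_naive(30)
fib_cached(30)
print(calls)  # {'naive': 2692537, 'cached': 31}
```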
4. Fix a real error
codex fix --error "TypeError: 'NoneType' object is not subscriptable" --file app.py
Gemma 4 will scan app.py, find the line indexing into a function return that can be None, and emit a patch that adds the missing if x is None: return guard.
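The shape of that patch is the standard None-guard pattern. Here is a hypothetical before/after to show what it looks like; the function and data are illustrative, not taken from a real app.py:
```python
# Before: indexing a return value that can be None.
def get_user_name(users, user_id):
    user = users.get(user_id)  # returns None when the id is missing
    return user["name"]        # TypeError: 'NoneType' object is not subscriptable

# After: guard the None case before subscripting.
def get_user_name_fixed(users, user_id):
    user = users.get(user_id)
    if user is None:
        return None
    return user["name"]
```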
Gemma 4 Local vs. OpenAI Cloud: How Close?
| Dimension | Gemma 4 26B (local) | Gemma 4 31B (local) | GPT-4 class (cloud) |
|---|---|---|---|
| Code-gen quality | ~GPT-3.5 tier | ~GPT-4 tier | Best in class |
| Throughput | 20–40 t/s on M1/M2 | 15–30 t/s | 100+ t/s |
| Network latency | 0 ms | 0 ms | 200–500 ms |
| Per-task cost | $0 | $0 | $0.15–$0.30 |
| Monthly cost (100 tasks/week) | $0 | $0 | $60–$120 |
| Privacy | Local | Local | Prompt uploaded |
| Offline | Yes | Yes | No |
The raw throughput gap looks scary on paper, but with zero network latency the perceived time for short tasks is close to identical. Where cloud still wins is multi-step agent loops over huge repos — that’s where the 100+ t/s advantage compounds.
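As a sanity check on the table’s monthly column, 100 tasks a week at the cloud’s per-task range works out as follows (assuming roughly 4.3 weeks per month):
```python
# How the monthly estimate in the table is derived.
tasks_per_week = 100
weeks_per_month = 4.3
low_rate, high_rate = 0.15, 0.30  # cloud per-task range from the table, in USD

low = tasks_per_week * weeks_per_month * low_rate
high = tasks_per_week * weeks_per_month * high_rate
print(f"${low:.0f}–${high:.0f} per month")  # ≈ $65–$129, the same ballpark as $60–$120 above
```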
For detailed hardware numbers, see Gemma 4 Mac performance: M1 to M4 and the NVIDIA RTX guide.
Troubleshooting
"Connection refused"
Ollama isn’t running, or it’s on the wrong port. Run curl http://localhost:11434/v1/models. If that fails, restart ollama serve and make sure nothing else owns port 11434.
"Model not found"
ollama list will show you the exact tag you installed. The CODEX_MODEL env var has to match character-for-character (gemma4:26b-a4b, not gemma4-26b).
Codex hangs or times out
You’re probably paging to disk because the model doesn’t fit in RAM. Swap 31B for 26B, or drop to a Q4 quant. You can also raise Codex’s timeout with --timeout 120.
Garbage output
Vague prompts hurt local models more than they hurt GPT-4. Don’t say "make it better" — say "convert this function to async/await and add structured error handling."
Going Further: Function Calling
Gemma 4 supports OpenAI-style tool calls, which means Codex CLI can drive real external actions: query your GitHub issues, run your test suite, hit your staging database. Define your tools as JSON Schema, pass them with --tools tools.json, and Gemma 4 will pick the right one and loop on the results. Full walk-through in Gemma 4 function calling in practice.
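To make that concrete, here is a minimal sketch of an OpenAI-style tool definition sent straight to the local endpoint. The run_tests tool and its schema are hypothetical examples; Codex CLI builds the equivalent payload from the JSON Schema file you pass with --tools:
```python
# Sketch of an OpenAI-format tool call against the local Ollama endpoint.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

tools = [{
    "type": "function",
    "function": {
        "name": "run_tests",  # hypothetical tool for illustration
        "description": "Run the project's test suite and return the summary line",
        "parameters": {
            "type": "object",
            "properties": {
                "path": {"type": "string", "description": "Directory to run tests in"},
            },
            "required": ["path"],
        },
    },
}]

response = client.chat.completions.create(
    model="gemma4:26b-a4b",
    messages=[{"role": "user", "content": "Run the tests in ./src and summarise any failures."}],
    tools=tools,
)
print(response.choices[0].message.tool_calls)  # the model's chosen tool and arguments, if any
```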
FAQ
Can I use Gemma 4 E2B or E4B instead of 26B?
Technically yes, practically no. The small edge variants will generate valid code for trivial snippets but fall apart on anything with real context. Stick to 26B or 31B for Codex work.
Does this work on Windows?
Yes. Ollama runs natively on Windows 11, Codex CLI is just Node.js. The only difference is using PowerShell $env: syntax for the environment variables.
Can I use the same setup with Cursor, Continue, or Aider?
Yes — anything that speaks the OpenAI API format works. Point its base URL at http://localhost:11434/v1, set the model name, done. We have dedicated guides for Aider and Claude Code Router.
What does this actually cost?
Zero dollars. Ollama is free, Gemma 4 is open-weight, Codex CLI is free. Your electricity bill goes up by a few cents per day of heavy use.
Gemma 4 vs. Qwen 3 for coding — which is better?
Gemma 4 is stronger on instruction following and structured JSON; Qwen 3 is sometimes faster on identical hardware and has a slight edge on Chinese-language tasks. See the full comparison.
Will this run on a Raspberry Pi?
Only the E2B edge model, and only for toy prompts. Codex-grade work needs a laptop-class machine with 16 GB+ of RAM.
What if 31B still isn’t good enough?
Unset OPENAI_API_BASE and you’re back on OpenAI’s cloud. A lot of teams run a hybrid: local Gemma 4 for fast/routine edits, cloud GPT-4 for the one or two hard problems per week.
Related Articles
- How to run Gemma 4 with Ollama — the full Ollama deep-dive, including GPU flags and context window tuning
- Best local AI models for coding in 2026 — Gemma 4 in context with Qwen 3, DeepSeek, Llama, and the rest
- Gemma 4 vs Qwen 3 — side-by-side comparison for when you’re choosing your local default
Stop reading. Start building.