NVIDIA GPUs are the easiest path to running Gemma 4 locally. Whether you've got a budget RTX 3060 or a beefy RTX 4090, the CUDA ecosystem makes setup straightforward. This guide covers everything from driver requirements to advanced TensorRT-LLM optimization.
CUDA Driver Requirements
Before anything else, make sure your NVIDIA driver and CUDA toolkit are up to date:
| Component | Minimum Version | Recommended |
|---|---|---|
| NVIDIA Driver | 535+ | 560+ |
| CUDA Toolkit | 12.1 | 12.4+ |
| cuDNN | 8.9 | 9.0+ |
| Python | 3.10 | 3.11+ |
Check your current setup:
```bash
# Check driver version
nvidia-smi

# Check CUDA version
nvcc --version

# If nvcc isn't found, the CUDA toolkit may not be in your PATH
export PATH=/usr/local/cuda/bin:$PATH
```

Updating Drivers
On Linux:
```bash
# Ubuntu/Debian
sudo apt update
sudo apt install nvidia-driver-560
sudo reboot
```

On Windows, download the latest driver from nvidia.com/drivers or use GeForce Experience.
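After updating, it's worth confirming you actually meet the minimums from the table above. Here's a small sketch that parses the `nvidia-smi` banner and compares versions; the regex assumes the standard banner format ("Driver Version: … CUDA Version: …"), and the minimums mirror this guide's table.

```python
import re
import subprocess

MIN_DRIVER = (535, 0)
MIN_CUDA = (12, 1)

def parse_nvidia_smi(banner: str):
    """Pull driver and CUDA versions out of the nvidia-smi banner text."""
    driver = re.search(r"Driver Version:\s*([\d.]+)", banner)
    cuda = re.search(r"CUDA Version:\s*([\d.]+)", banner)
    to_tuple = lambda m: tuple(int(p) for p in m.group(1).split(".")[:2])
    return to_tuple(driver), to_tuple(cuda)

def meets_minimums(banner: str) -> bool:
    driver, cuda = parse_nvidia_smi(banner)
    return driver >= MIN_DRIVER and cuda >= MIN_CUDA

if __name__ == "__main__":
    banner = subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout
    print("OK" if meets_minimums(banner) else "Update your driver")
```

Note that the "CUDA Version" shown by `nvidia-smi` is the maximum the driver supports, which can differ from the toolkit version `nvcc --version` reports.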
The Easiest Way: Ollama
Ollama auto-detects NVIDIA GPUs and handles everything for you. No CUDA toolkit installation needed — Ollama bundles its own:
```bash
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Run Gemma 4
ollama run gemma4:12b

# Verify the GPU is being used
ollama ps
# Should show "GPU" in the processor column
```

That's it. Ollama detects your NVIDIA GPU, loads the model into VRAM, and starts generating. For most users, this is all you need.
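Ollama also exposes a local REST API on port 11434, which is handy for scripting. A minimal standard-library sketch against the documented `/api/generate` endpoint (the model tag is the one used in this guide; the prompt is illustrative):

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"

def build_generate_request(model: str, prompt: str) -> dict:
    # stream=False returns one JSON object instead of a token stream
    return {"model": model, "prompt": prompt, "stream": False}

def generate(model: str, prompt: str) -> str:
    payload = json.dumps(build_generate_request(model, prompt)).encode()
    req = urllib.request.Request(OLLAMA_URL, data=payload,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

if __name__ == "__main__":
    print(generate("gemma4:12b", "Explain CUDA in one sentence."))
```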
GPU Offloading Settings
When your model doesn't fully fit in VRAM, you can split it between GPU and CPU. This is called partial offloading:
```bash
# Ollama: control how many layers go to the GPU
OLLAMA_NUM_GPU=35 ollama run gemma4:12b

# llama.cpp: specify GPU layers
./llama-server -m gemma-4-12b-Q4_K_M.gguf -ngl 35
# Set -ngl to 0 for CPU-only, or 999 for full GPU
```

The sweet spot depends on your VRAM. A general rule:
| VRAM | Recommended Layers (12B Q4) | What It Means |
|---|---|---|
| 6GB | 15-20 | ~50% on GPU |
| 8GB | 25-30 | ~75% on GPU |
| 12GB | 35-40 | ~95% on GPU |
| 16GB+ | 999 (all) | Fully GPU-accelerated |
| 24GB+ | 999 (all) | Room for longer context |
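The table values come from simple arithmetic: layer weights plus per-layer KV cache have to fit in whatever VRAM is left after driver and runtime overhead. A back-of-envelope sketch, where the layer count, per-layer sizes, and overhead are illustrative assumptions for a 12B model at Q4, not measured values:

```python
import math

# Rough, illustrative numbers for a 12B model at Q4 quantization:
TOTAL_LAYERS = 48                            # assumed transformer layer count
WEIGHTS_PER_LAYER_GB = 7.0 / TOTAL_LAYERS    # ~7 GB of Q4 weights spread over layers
KV_PER_LAYER_GB = 0.10                       # assumed KV-cache cost per layer
OVERHEAD_GB = 2.0                            # CUDA context, buffers, display output

def gpu_layers(vram_gb: float) -> int:
    """Estimate the -ngl / OLLAMA_NUM_GPU value for a given VRAM budget."""
    usable = vram_gb - OVERHEAD_GB
    per_layer = WEIGHTS_PER_LAYER_GB + KV_PER_LAYER_GB
    return max(0, min(TOTAL_LAYERS, math.floor(usable / per_layer)))

for vram in (6, 8, 12, 16):
    print(f"{vram} GB VRAM -> -ngl {gpu_layers(vram)}")
```

If generation crashes with an out-of-memory error, lower the layer count a few at a time; if VRAM sits well under capacity in `nvidia-smi`, raise it.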
RTX Performance Comparison
Here's what to expect for Gemma 4 12B inference speed across different RTX cards:
| GPU | VRAM | Q4_K_M (tok/s) | Q8_0 (tok/s) | FP16 (tok/s) | Notes |
|---|---|---|---|---|---|
| RTX 3060 | 12GB | ~25 | ~15 | OOM | Great budget option |
| RTX 3060 Ti | 8GB | ~20* | OOM | OOM | *Partial offload |
| RTX 3070 | 8GB | ~22* | OOM | OOM | *Partial offload |
| RTX 3090 | 24GB | ~40 | ~25 | ~12 | Still excellent |
| RTX 4060 | 8GB | ~28* | OOM | OOM | *Partial offload |
| RTX 4070 Ti | 12GB | ~38 | ~22 | OOM | Good mid-range |
| RTX 4080 | 16GB | ~50 | ~30 | OOM | Strong performer |
| RTX 4090 | 24GB | ~65 | ~40 | ~20 | Consumer king |
OOM = Out of Memory at that quantization level
The RTX 3060 12GB is honestly the best value pick — 12GB of VRAM at a fraction of the 4090's price, and it runs Q4 models at perfectly usable speeds.
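These numbers aren't arbitrary: single-stream decoding is mostly memory-bandwidth bound, because every generated token streams the entire weight file through the GPU once. That gives a rough ceiling of bandwidth divided by model size. A sketch using published peak-bandwidth specs; the ~50% efficiency factor is an assumption, not a measurement:

```python
# Decode speed is mostly memory-bandwidth bound: each generated token has to
# read all the model weights once.
MODEL_GB = 7.0        # approximate size of a 12B Q4_K_M GGUF
EFFICIENCY = 0.5      # assumed fraction of peak bandwidth actually achieved

# Published peak memory bandwidth, GB/s
BANDWIDTH = {"RTX 3060": 360, "RTX 4070 Ti": 504, "RTX 4090": 1008}

def estimated_tok_s(gpu: str) -> float:
    return BANDWIDTH[gpu] / MODEL_GB * EFFICIENCY

for gpu in BANDWIDTH:
    print(f"{gpu}: ~{estimated_tok_s(gpu):.0f} tok/s")
```

This is also why quantization speeds things up: a Q4 model is roughly a quarter the bytes of FP16, so each token needs a quarter of the memory traffic.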
NVIDIA Jetson Orin Support
Gemma 4 runs on NVIDIA's Jetson platform, making it possible to deploy on edge devices:
```bash
# On Jetson Orin (JetPack 6.x)
# Install the Ollama ARM64 build
curl -fsSL https://ollama.com/install.sh | sh

# Run smaller models
ollama run gemma4:4b

# The 1B model is best for the Jetson Orin Nano
ollama run gemma4:1b
```

| Jetson Model | RAM | Best Gemma 4 Model | Use Case |
|---|---|---|---|
| Orin Nano 8GB | 8GB | 1B or 4B Q4 | Embedded AI assistant |
| Orin NX 16GB | 16GB | 4B or 12B Q4 | Edge inference |
| AGX Orin 64GB | 64GB | 12B FP16 or 27B Q4 | Full-featured edge AI |
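The pairings in the table follow from a simple fit check against unified memory. A sketch with approximate model sizes (these are ballpark figures; real GGUF sizes vary by a few percent) and an assumed 25% headroom for the OS and KV cache:

```python
# Approximate loaded sizes in GB; real files vary by a few percent.
MODEL_SIZE_GB = {
    "1B Q4": 0.8,
    "4B Q4": 2.6,
    "12B Q4": 7.0,
    "27B Q4": 16.5,
    "12B FP16": 24.0,
}
HEADROOM = 0.75  # leave ~25% of unified memory for the OS and KV cache

def models_that_fit(ram_gb: float) -> list[str]:
    budget = ram_gb * HEADROOM
    return [m for m, size in MODEL_SIZE_GB.items() if size <= budget]

for name, ram in (("Orin Nano", 8), ("Orin NX", 16), ("AGX Orin", 64)):
    print(f"{name} ({ram} GB): {models_that_fit(ram)}")
```

Remember that Jetson memory is shared between CPU and GPU, so whatever the OS and your application use comes straight out of the model's budget.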
DGX Spark
NVIDIA's DGX Spark is a desktop AI workstation with 128GB of unified memory — it runs the full Gemma 4 27B in FP16 without breaking a sweat:
```bash
# On DGX Spark, run the full 27B model
ollama run gemma4:27b

# Or serve at full precision with vLLM
python -m vllm.entrypoints.openai.api_server \
  --model google/gemma-4-27b-it \
  --dtype float16 \
  --max-model-len 32768
```

TensorRT-LLM Optimization
For maximum throughput on NVIDIA hardware, TensorRT-LLM compiles the model specifically for your GPU:
```bash
# Install TensorRT-LLM
pip install tensorrt-llm

# Convert and optimize the model
python convert_checkpoint.py \
  --model_dir google/gemma-4-12b-it \
  --output_dir ./gemma4-trt \
  --dtype float16

# Build the TensorRT engine
trtllm-build \
  --checkpoint_dir ./gemma4-trt \
  --output_dir ./gemma4-engine \
  --max_batch_size 4 \
  --max_input_len 4096 \
  --max_seq_len 8192

# Run inference
python run.py --engine_dir ./gemma4-engine --max_output_len 512
```

TensorRT-LLM typically gives a 2-3x throughput improvement over vanilla PyTorch, but the build process takes 10-30 minutes and the engine is locked to your specific GPU model.
Flash Attention
Make sure Flash Attention is enabled for better memory efficiency and speed:
```bash
# Install Flash Attention 2
pip install flash-attn --no-build-isolation

# Verify it's installed (in Python)
python -c "import flash_attn; print(flash_attn.__version__)"
```

Most frameworks (vLLM, SGLang, transformers) automatically use Flash Attention when available. It reduces VRAM usage and increases speed, especially at longer context lengths.
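To see why it matters at long context: naive attention materializes an n×n score matrix per head, while Flash Attention processes attention in tiles and never stores the full matrix. A quick illustration of the quadratic cost (the head count is an assumed figure for illustration):

```python
# Naive attention stores one n x n score matrix per head; memory grows with
# the square of the context length.
N_HEADS = 16       # assumed head count, for illustration
BYTES = 2          # fp16

def naive_attn_bytes(seq_len: int) -> int:
    """Memory needed for the full attention score matrices."""
    return N_HEADS * seq_len * seq_len * BYTES

for n in (2048, 8192, 32768):
    print(f"{n:>6} tokens: {naive_attn_bytes(n) / 1e9:.1f} GB of score matrices")
```

At 32K tokens the naive score matrices alone would outgrow any consumer GPU's VRAM, which is why long-context inference effectively requires tiled attention.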
Next Steps
- Need hardware buying advice? Check the Hardware Requirements Guide for detailed recommendations by budget
- Running into errors? The Troubleshooting Guide covers CUDA-specific issues like driver mismatches and OOM errors
- Want to try Ollama first? Follow our Ollama Setup Guide for the simplest path to running Gemma 4
NVIDIA GPUs remain the gold standard for local AI. The combination of mature drivers, broad framework support, and tools like TensorRT-LLM means you'll spend less time debugging and more time actually using Gemma 4.



