How to Run Gemma 4 on NVIDIA RTX (CUDA Setup & Optimization)

Apr 7, 2026

NVIDIA GPUs are the easiest path to running Gemma 4 locally. Whether you've got a budget RTX 3060 or a beefy RTX 4090, the CUDA ecosystem makes setup straightforward. This guide covers everything from driver requirements to advanced TensorRT-LLM optimization.

CUDA Driver Requirements

Before anything else, make sure your NVIDIA driver and CUDA toolkit are up to date:

| Component | Minimum Version | Recommended |
|---|---|---|
| NVIDIA Driver | 535+ | 560+ |
| CUDA Toolkit | 12.1 | 12.4+ |
| cuDNN | 8.9 | 9.0+ |
| Python | 3.10 | 3.11+ |

Check your current setup:

# Check driver version
nvidia-smi

# Check CUDA version
nvcc --version

# If nvcc isn't found, CUDA toolkit may not be in your PATH
export PATH=/usr/local/cuda/bin:$PATH

Updating Drivers

On Linux:

# Ubuntu/Debian
sudo apt update
sudo apt install nvidia-driver-560
sudo reboot

On Windows, download the latest driver from nvidia.com/drivers or use GeForce Experience.

The Easiest Way: Ollama

Ollama auto-detects NVIDIA GPUs and handles everything for you. No CUDA toolkit installation needed — Ollama bundles its own:

# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Run Gemma 4
ollama run gemma4:12b

# Verify GPU is being used
ollama ps
# Should show "GPU" in the processor column

That's it. Ollama detects your NVIDIA GPU, loads the model into VRAM, and starts generating. For most users, this is all you need.
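
Ollama also exposes a local HTTP API on port 11434, which is handy once you move past the terminal. A minimal sketch using only the standard library (the `gemma4:12b` tag matches the model pulled above; the request only succeeds while `ollama serve` is running):

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"

def build_request(prompt: str, model: str = "gemma4:12b") -> urllib.request.Request:
    """Build a non-streaming generate request for Ollama's local API."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

def generate(prompt: str) -> str:
    """Send the prompt and return the model's full response text."""
    with urllib.request.urlopen(build_request(prompt)) as resp:
        return json.loads(resp.read())["response"]

# generate("Why is the sky blue?")  # requires the Ollama server to be running
```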

GPU Offloading Settings

When your model doesn't fully fit in VRAM, you can split it between GPU and CPU. This is called partial offloading:

# Ollama: set the number of GPU layers from the interactive session
ollama run gemma4:12b
>>> /set parameter num_gpu 35

# llama.cpp: specify GPU layers with -ngl
./llama-server -m gemma-4-12b-Q4_K_M.gguf -ngl 35

# Use -ngl 0 for CPU-only, or 999 to offload every layer

The sweet spot depends on your VRAM. A general rule:

| VRAM | Recommended Layers (12B Q4) | What It Means |
|---|---|---|
| 6GB | 15-20 | ~50% on GPU |
| 8GB | 25-30 | ~75% on GPU |
| 12GB | 35-40 | ~95% on GPU |
| 16GB+ | 999 (all) | Fully GPU-accelerated |
| 24GB+ | 999 (all) | Room for longer context |
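
You can derive numbers like these yourself by dividing usable VRAM by a per-layer cost. A rough sketch, assuming ~48 layers for a 12B model and ~250 MB per Q4 layer including its share of KV cache and overhead (both figures are illustrative, not measured):

```python
def layers_for_vram(vram_gb: float, n_layers: int = 48,
                    layer_mb: float = 250, reserve_gb: float = 2.0) -> int:
    """Rough estimate of how many layers fit on the GPU.

    reserve_gb leaves headroom for CUDA context and buffers.
    All constants are illustrative, not measured values.
    """
    usable_mb = max(vram_gb - reserve_gb, 0) * 1024
    return min(n_layers, int(usable_mb // layer_mb))

print(layers_for_vram(8))   # → 24, close to the 25-30 row above
print(layers_for_vram(16))  # → 48, every layer fits
```

In practice, start near the estimate, watch `nvidia-smi` for memory pressure, and nudge the layer count up or down from there.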

RTX Performance Comparison

Here's what to expect for Gemma 4 12B inference speed across different RTX cards:

| GPU | VRAM | Q4_K_M (tok/s) | Q8_0 (tok/s) | FP16 (tok/s) | Notes |
|---|---|---|---|---|---|
| RTX 3060 | 12GB | ~25 | ~15 | OOM | Great budget option |
| RTX 3060 Ti | 8GB | ~20* | OOM | OOM | *Partial offload |
| RTX 3070 | 8GB | ~22* | OOM | OOM | *Partial offload |
| RTX 3090 | 24GB | ~40 | ~25 | ~12 | Still excellent |
| RTX 4060 | 8GB | ~28* | OOM | OOM | *Partial offload |
| RTX 4070 Ti | 12GB | ~38 | ~22 | OOM | Good mid-range |
| RTX 4080 | 16GB | ~50 | ~30 | OOM | Strong performer |
| RTX 4090 | 24GB | ~65 | ~40 | ~20 | Consumer king |

OOM = Out of Memory at that quantization level

The RTX 3060 12GB is honestly the best value pick — 12GB of VRAM at a fraction of the 4090's price, and it runs Q4 models at perfectly usable speeds.
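
To see where your own card lands in this table, run `ollama run gemma4:12b --verbose` and read the eval rate, or compute it from the metrics the API returns with each response (`eval_count` and `eval_duration`, the latter in nanoseconds). A small sketch, with illustrative numbers:

```python
def tokens_per_second(eval_count: int, eval_duration_ns: int) -> float:
    """Generation speed from Ollama's API metrics (duration is in ns)."""
    return eval_count / (eval_duration_ns / 1e9)

# Example figures from a hypothetical /api/generate response:
print(tokens_per_second(512, 8_000_000_000))  # 512 tokens in 8 s → 64.0 tok/s
```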

NVIDIA Jetson Orin Support

Gemma 4 runs on NVIDIA's Jetson platform, making it possible to deploy on edge devices:

# On Jetson Orin (JetPack 6.x)
# Install Ollama ARM64 build
curl -fsSL https://ollama.com/install.sh | sh

# Run smaller models
ollama run gemma4:4b

# The 1B model is best for Jetson Orin Nano
ollama run gemma4:1b

| Jetson Model | RAM | Best Gemma 4 Model | Use Case |
|---|---|---|---|
| Orin Nano 8GB | 8GB | 1B or 4B Q4 | Embedded AI assistant |
| Orin NX 16GB | 16GB | 4B or 12B Q4 | Edge inference |
| AGX Orin 64GB | 64GB | 12B FP16 or 27B Q4 | Full-featured edge AI |
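
If you're provisioning a fleet of mixed Jetson hardware, the table above collapses into a simple rule of thumb. A sketch (the RAM thresholds mirror the table and the `gemma4:*` tags follow the naming used throughout this guide):

```python
def pick_gemma_model(ram_gb: int) -> str:
    """Rough model recommendation by device RAM, mirroring the table above."""
    if ram_gb >= 64:
        return "gemma4:27b"  # Q4 fits with room for the OS and other workloads
    if ram_gb >= 16:
        return "gemma4:12b"
    if ram_gb >= 8:
        return "gemma4:4b"
    return "gemma4:1b"

print(pick_gemma_model(8))  # → gemma4:4b
```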

DGX Spark

NVIDIA's DGX Spark is a desktop AI workstation with 128GB of unified memory — it runs the full Gemma 4 27B in FP16 without breaking a sweat:

# On DGX Spark, run the full 27B model
ollama run gemma4:27b

# Or run at full precision
python -m vllm.entrypoints.openai.api_server \
  --model google/gemma-4-27b-it \
  --dtype float16 \
  --max-model-len 32768
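
The vLLM server speaks the OpenAI chat-completions protocol, so any OpenAI-style client can talk to it. A standard-library sketch that builds the request against the server started above (sending it requires the server to be running; `max_tokens` here is just an example value):

```python
import json
import urllib.request

def chat_request(messages: list[dict], model: str = "google/gemma-4-27b-it",
                 base: str = "http://localhost:8000/v1") -> urllib.request.Request:
    """Build an OpenAI-compatible chat completion request for a vLLM server."""
    payload = {"model": model, "messages": messages, "max_tokens": 256}
    return urllib.request.Request(
        f"{base}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

req = chat_request([{"role": "user", "content": "Summarize CUDA in one line."}])
# urllib.request.urlopen(req)  # requires the vLLM server to be running
```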

TensorRT-LLM Optimization

For maximum throughput on NVIDIA hardware, TensorRT-LLM compiles the model specifically for your GPU:

# Install TensorRT-LLM
pip install tensorrt-llm

# Convert and optimize the model
python convert_checkpoint.py \
  --model_dir google/gemma-4-12b-it \
  --output_dir ./gemma4-trt \
  --dtype float16

# Build the TensorRT engine
trtllm-build \
  --checkpoint_dir ./gemma4-trt \
  --output_dir ./gemma4-engine \
  --max_batch_size 4 \
  --max_input_len 4096 \
  --max_seq_len 8192

# Run inference
python run.py --engine_dir ./gemma4-engine --max_output_len 512

TensorRT-LLM typically gives 2-3x throughput improvement over vanilla PyTorch, but the build process takes 10-30 minutes and the engine is locked to your specific GPU model.

Flash Attention

Make sure Flash Attention is enabled for better memory efficiency and speed:

# Install Flash Attention 2
pip install flash-attn --no-build-isolation

# Verify it's being used (in Python)
python -c "import flash_attn; print(flash_attn.__version__)"

Most frameworks (vLLM, SGLang, transformers) automatically use Flash Attention when available. It reduces VRAM usage and increases speed, especially at longer context lengths.
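
The reason the savings grow with context is that naive attention materializes an n×n score matrix per head, while Flash Attention computes the same result in tiles without ever storing it. A quick back-of-envelope sketch (the head count is an illustrative figure, not Gemma 4's actual configuration):

```python
def attn_matrix_gb(seq_len: int, n_heads: int = 16, bytes_per: int = 2) -> float:
    """VRAM needed to materialize the full FP16 attention matrix,
    which Flash Attention avoids by computing attention in tiles."""
    return n_heads * seq_len * seq_len * bytes_per / 1024**3

print(attn_matrix_gb(4096))   # → 0.5 GB at 4K context
print(attn_matrix_gb(32768))  # → 32.0 GB at 32K context
```

The quadratic blow-up is why long-context runs that OOM without Flash Attention often fit comfortably with it.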

Next Steps

NVIDIA GPUs remain the gold standard for local AI. The combination of mature drivers, broad framework support, and tools like TensorRT-LLM means you'll spend less time debugging and more time actually using Gemma 4.
