How to Run Gemma 4 on NVIDIA RTX (CUDA Setup & Optimization)

Apr 7, 2026

NVIDIA GPUs are the easiest path to running Gemma 4 locally. Whether you've got a budget RTX 3060 or a beefy RTX 4090, the CUDA ecosystem makes setup straightforward. This guide covers everything from driver requirements to advanced TensorRT-LLM optimization.

CUDA Driver Requirements

Before anything else, make sure your NVIDIA driver and CUDA toolkit are up to date:

| Component | Minimum Version | Recommended |
|---|---|---|
| NVIDIA Driver | 535+ | 560+ |
| CUDA Toolkit | 12.1 | 12.4+ |
| cuDNN | 8.9 | 9.0+ |
| Python | 3.10 | 3.11+ |

Check your current setup:

# Check driver version
nvidia-smi

# Check CUDA version
nvcc --version

# If nvcc isn't found, CUDA toolkit may not be in your PATH
export PATH=/usr/local/cuda/bin:$PATH

Updating Drivers

On Linux:

# Ubuntu/Debian
sudo apt update
sudo apt install nvidia-driver-560
sudo reboot

On Windows, download the latest driver from nvidia.com/drivers or use GeForce Experience.

The Easiest Way: Ollama

Ollama auto-detects NVIDIA GPUs and handles everything for you. No CUDA toolkit installation needed — Ollama bundles its own:

# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Run Gemma 4
ollama run gemma4:12b

# Verify GPU is being used
ollama ps
# Should show "GPU" in the processor column

That's it. Ollama detects your NVIDIA GPU, loads the model into VRAM, and starts generating. For most users, this is all you need.
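
Ollama also exposes a local HTTP API on port 11434, which is handy once you move past the terminal. A minimal sketch using only the standard library (the `gemma4:12b` tag matches the model pulled above; the request only succeeds while `ollama serve` is running):

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"

def build_request(prompt: str, model: str = "gemma4:12b") -> urllib.request.Request:
    """Build a non-streaming generate request for Ollama's local API."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

def generate(prompt: str) -> str:
    """Send the prompt and return the model's full response text."""
    with urllib.request.urlopen(build_request(prompt)) as resp:
        return json.loads(resp.read())["response"]

# generate("Why is the sky blue?")  # requires the Ollama server to be running
```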

GPU Offloading Settings

When your model doesn't fully fit in VRAM, you can split it between GPU and CPU. This is called partial offloading:

# Ollama: set the number of GPU layers from the interactive session
ollama run gemma4:12b
>>> /set parameter num_gpu 35

# llama.cpp: specify GPU layers with -ngl
./llama-server -m gemma-4-12b-Q4_K_M.gguf -ngl 35

# Use -ngl 0 for CPU-only, or 999 to offload every layer

The sweet spot depends on your VRAM. A general rule:

| VRAM | Recommended Layers (12B Q4) | What It Means |
|---|---|---|
| 6GB | 15-20 | ~50% on GPU |
| 8GB | 25-30 | ~75% on GPU |
| 12GB | 35-40 | ~95% on GPU |
| 16GB+ | 999 (all) | Fully GPU-accelerated |
| 24GB+ | 999 (all) | Room for longer context |
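
You can derive numbers like these yourself by dividing usable VRAM by a per-layer cost. A rough sketch, assuming ~48 layers for a 12B model and ~250 MB per Q4 layer including its share of KV cache and overhead (both figures are illustrative, not measured):

```python
def layers_for_vram(vram_gb: float, n_layers: int = 48,
                    layer_mb: float = 250, reserve_gb: float = 2.0) -> int:
    """Rough estimate of how many layers fit on the GPU.

    reserve_gb leaves headroom for CUDA context and buffers.
    All constants are illustrative, not measured values.
    """
    usable_mb = max(vram_gb - reserve_gb, 0) * 1024
    return min(n_layers, int(usable_mb // layer_mb))

print(layers_for_vram(8))   # → 24, close to the 25-30 row above
print(layers_for_vram(16))  # → 48, every layer fits
```

In practice, start near the estimate, watch `nvidia-smi` for memory pressure, and nudge the layer count up or down from there.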

RTX Performance Comparison

Here's what to expect for Gemma 4 12B inference speed across different RTX cards:

| GPU | VRAM | Q4_K_M (tok/s) | Q8_0 (tok/s) | FP16 (tok/s) | Notes |
|---|---|---|---|---|---|
| RTX 3060 | 12GB | ~25 | ~15 | OOM | Great budget option |
| RTX 3060 Ti | 8GB | ~20* | OOM | OOM | *Partial offload |
| RTX 3070 | 8GB | ~22* | OOM | OOM | *Partial offload |
| RTX 3090 | 24GB | ~40 | ~25 | ~12 | Still excellent |
| RTX 4060 | 8GB | ~28* | OOM | OOM | *Partial offload |
| RTX 4070 Ti | 12GB | ~38 | ~22 | OOM | Good mid-range |
| RTX 4080 | 16GB | ~50 | ~30 | OOM | Strong performer |
| RTX 4090 | 24GB | ~65 | ~40 | ~20 | Consumer king |

OOM = Out of Memory at that quantization level

The RTX 3060 12GB is honestly the best value pick — 12GB of VRAM at a fraction of the 4090's price, and it runs Q4 models at perfectly usable speeds.
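
To see where your own card lands in this table, run `ollama run gemma4:12b --verbose` and read the eval rate, or compute it from the metrics the API returns with each response (`eval_count` and `eval_duration`, the latter in nanoseconds). A small sketch, with illustrative numbers:

```python
def tokens_per_second(eval_count: int, eval_duration_ns: int) -> float:
    """Generation speed from Ollama's API metrics (duration is in ns)."""
    return eval_count / (eval_duration_ns / 1e9)

# Example figures from a hypothetical /api/generate response:
print(tokens_per_second(512, 8_000_000_000))  # 512 tokens in 8 s → 64.0 tok/s
```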

NVIDIA Jetson Orin Support

Gemma 4 runs on NVIDIA's Jetson platform, making it possible to deploy on edge devices:

# On Jetson Orin (JetPack 6.x)
# Install Ollama ARM64 build
curl -fsSL https://ollama.com/install.sh | sh

# Run smaller models
ollama run gemma4:4b

# The 1B model is best for Jetson Orin Nano
ollama run gemma4:1b

| Jetson Model | RAM | Best Gemma 4 Model | Use Case |
|---|---|---|---|
| Orin Nano 8GB | 8GB | 1B or 4B Q4 | Embedded AI assistant |
| Orin NX 16GB | 16GB | 4B or 12B Q4 | Edge inference |
| AGX Orin 64GB | 64GB | 12B FP16 or 27B Q4 | Full-featured edge AI |
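
If you're provisioning a fleet of mixed Jetson hardware, the table above collapses into a simple rule of thumb. A sketch (the RAM thresholds mirror the table and the `gemma4:*` tags follow the naming used throughout this guide):

```python
def pick_gemma_model(ram_gb: int) -> str:
    """Rough model recommendation by device RAM, mirroring the table above."""
    if ram_gb >= 64:
        return "gemma4:27b"  # Q4 fits with room for the OS and other workloads
    if ram_gb >= 16:
        return "gemma4:12b"
    if ram_gb >= 8:
        return "gemma4:4b"
    return "gemma4:1b"

print(pick_gemma_model(8))  # → gemma4:4b
```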

DGX Spark

NVIDIA's DGX Spark is a desktop AI workstation with 128GB of unified memory — it runs the full Gemma 4 27B in FP16 without breaking a sweat:

# On DGX Spark, run the full 27B model
ollama run gemma4:27b

# Or run at full precision
python -m vllm.entrypoints.openai.api_server \
  --model google/gemma-4-27b-it \
  --dtype float16 \
  --max-model-len 32768
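
The vLLM server speaks the OpenAI chat-completions protocol, so any OpenAI-style client can talk to it. A standard-library sketch that builds the request against the server started above (sending it requires the server to be running; `max_tokens` here is just an example value):

```python
import json
import urllib.request

def chat_request(messages: list[dict], model: str = "google/gemma-4-27b-it",
                 base: str = "http://localhost:8000/v1") -> urllib.request.Request:
    """Build an OpenAI-compatible chat completion request for a vLLM server."""
    payload = {"model": model, "messages": messages, "max_tokens": 256}
    return urllib.request.Request(
        f"{base}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

req = chat_request([{"role": "user", "content": "Summarize CUDA in one line."}])
# urllib.request.urlopen(req)  # requires the vLLM server to be running
```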

TensorRT-LLM Optimization

For maximum throughput on NVIDIA hardware, TensorRT-LLM compiles the model specifically for your GPU:

# Install TensorRT-LLM
pip install tensorrt-llm

# Convert and optimize the model
python convert_checkpoint.py \
  --model_dir google/gemma-4-12b-it \
  --output_dir ./gemma4-trt \
  --dtype float16

# Build the TensorRT engine
trtllm-build \
  --checkpoint_dir ./gemma4-trt \
  --output_dir ./gemma4-engine \
  --max_batch_size 4 \
  --max_input_len 4096 \
  --max_seq_len 8192

# Run inference
python run.py --engine_dir ./gemma4-engine --max_output_len 512

TensorRT-LLM typically gives 2-3x throughput improvement over vanilla PyTorch, but the build process takes 10-30 minutes and the engine is locked to your specific GPU model.

Flash Attention

Make sure Flash Attention is enabled for better memory efficiency and speed:

# Install Flash Attention 2
pip install flash-attn --no-build-isolation

# Verify it's being used (in Python)
python -c "import flash_attn; print(flash_attn.__version__)"

Most frameworks (vLLM, SGLang, transformers) automatically use Flash Attention when available. It reduces VRAM usage and increases speed, especially at longer context lengths.
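
The reason the savings grow with context is that naive attention materializes an n×n score matrix per head, while Flash Attention computes the same result in tiles without ever storing it. A quick back-of-envelope sketch (the head count is an illustrative figure, not Gemma 4's actual configuration):

```python
def attn_matrix_gb(seq_len: int, n_heads: int = 16, bytes_per: int = 2) -> float:
    """VRAM needed to materialize the full FP16 attention matrix,
    which Flash Attention avoids by computing attention in tiles."""
    return n_heads * seq_len * seq_len * bytes_per / 1024**3

print(attn_matrix_gb(4096))   # → 0.5 GB at 4K context
print(attn_matrix_gb(32768))  # → 32.0 GB at 32K context
```

The quadratic blow-up is why long-context runs that OOM without Flash Attention often fit comfortably with it.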

Next Steps

NVIDIA GPUs remain the gold standard for local AI. The combination of mature drivers, broad framework support, and tools like TensorRT-LLM means you'll spend less time debugging and more time actually using Gemma 4.
