How to Download Gemma 4 from Hugging Face (Weights & GGUF)

Apr 7, 2026

Hugging Face is the primary hub for downloading Gemma 4 model weights. Whether you want the original FP16 weights for fine-tuning or GGUF quantized files for local inference, everything lives on HF. This guide walks through every download method and shows you how to start using the models right away.

Official Repositories

Google publishes the original Gemma 4 weights on Hugging Face:

| Model | Hugging Face Repo | Size | Format |
|---|---|---|---|
| Gemma 4 1B IT | google/gemma-4-1b-it | ~2 GB | SafeTensors |
| Gemma 4 4B IT | google/gemma-4-4b-it | ~8 GB | SafeTensors |
| Gemma 4 12B IT | google/gemma-4-12b-it | ~24 GB | SafeTensors |
| Gemma 4 27B IT | google/gemma-4-27b-it | ~54 GB | SafeTensors |
| Gemma 4 E2B IT | google/gemma-4-e2b-it | ~4 GB | SafeTensors |
| Gemma 4 E4B IT | google/gemma-4-e4b-it | ~8 GB | SafeTensors |

Base (pre-trained, non-instruction-tuned) models are also available with the -pt suffix instead of -it.
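If you script against several variants, the naming scheme in the table above is regular enough to generate repo ids programmatically. A minimal sketch, assuming that scheme holds; the helper function is illustrative and not part of any library:

```python
# Build a Gemma 4 Hugging Face repo id from a size and variant.
# The naming scheme is taken from the tables above; this helper is
# illustrative only, not part of huggingface_hub.

SIZES = {"1b", "4b", "12b", "27b", "e2b", "e4b"}

def gemma4_repo_id(size: str, instruction_tuned: bool = True) -> str:
    size = size.lower()
    if size not in SIZES:
        raise ValueError(f"unknown Gemma 4 size: {size}")
    suffix = "it" if instruction_tuned else "pt"
    return f"google/gemma-4-{size}-{suffix}"

print(gemma4_repo_id("12b"))        # google/gemma-4-12b-it
print(gemma4_repo_id("4b", False))  # google/gemma-4-4b-pt
```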

GGUF Repositories

For running with llama.cpp, Ollama, or LM Studio, grab the GGUF versions from Unsloth:

| Model | Hugging Face Repo | Quantizations Available |
|---|---|---|
| Gemma 4 1B | unsloth/gemma-4-1b-it-GGUF | Q4_K_M, Q5_K_M, Q8_0, IQ4_XS |
| Gemma 4 4B | unsloth/gemma-4-4b-it-GGUF | Q4_K_M, Q5_K_M, Q8_0, IQ4_XS |
| Gemma 4 12B | unsloth/gemma-4-12b-it-GGUF | Q4_K_M, Q5_K_M, Q6_K, Q8_0, IQ4_XS |
| Gemma 4 27B | unsloth/gemma-4-27b-it-GGUF | Q4_K_M, Q5_K_M, Q6_K, Q8_0, IQ4_XS |
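To pick a quantization before downloading, you can estimate the file size as parameters × bits per weight. The bits-per-weight figures below are rough averages for llama.cpp quantization schemes, not exact values; real files vary by a few percent:

```python
# Rough GGUF file-size estimate: parameters x bits-per-weight / 8.
# Bits-per-weight figures are approximate averages for llama.cpp
# quantization schemes, so treat results as ballpark numbers.

BITS_PER_WEIGHT = {
    "Q4_K_M": 4.8, "Q5_K_M": 5.7, "Q6_K": 6.6,
    "Q8_0": 8.5, "IQ4_XS": 4.3,
}

def gguf_size_gb(params_billions: float, quant: str) -> float:
    bits = BITS_PER_WEIGHT[quant]
    return params_billions * 1e9 * bits / 8 / 1e9  # bytes -> GB

print(f"{gguf_size_gb(12, 'Q4_K_M'):.1f} GB")  # ~7.2 GB for the 12B model
```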

Download Methods

Method 1: Hugging Face CLI

The Hugging Face CLI is the most reliable way to download large model files:

# Install the CLI
pip install huggingface_hub

# Login (required for gated models)
huggingface-cli login

# Download a specific GGUF file
huggingface-cli download unsloth/gemma-4-12b-it-GGUF \
  gemma-4-12b-it-Q4_K_M.gguf \
  --local-dir ./models

# Download the full official model
huggingface-cli download google/gemma-4-12b-it \
  --local-dir ./models/gemma-4-12b-it

# Resume interrupted downloads automatically
# Just run the same command again — it picks up where it left off

Method 2: Git LFS

For downloading entire repositories including all files:

# Install git-lfs
# macOS
brew install git-lfs

# Ubuntu
sudo apt install git-lfs

# Initialize git-lfs
git lfs install

# Clone the model repo
git clone https://huggingface.co/google/gemma-4-12b-it

# For GGUF — clone only the file you need
GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/unsloth/gemma-4-12b-it-GGUF
cd gemma-4-12b-it-GGUF
git lfs pull --include="gemma-4-12b-it-Q4_K_M.gguf"

The GIT_LFS_SKIP_SMUDGE=1 trick clones the repo metadata without downloading the large files; you then selectively pull only the quantization you want. This saves bandwidth when a repo holds multiple multi-gigabyte files.
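The --include argument takes glob patterns, so one pattern can also pull several quantizations at once. The filtering is ordinary glob matching, which Python's fnmatch models closely (file names below are hypothetical repo contents):

```python
# Preview which files a git-lfs --include glob would pull.
# git-lfs uses glob-style matching, modeled here with fnmatch.
from fnmatch import fnmatch

files = [  # hypothetical repo contents
    "gemma-4-12b-it-Q4_K_M.gguf",
    "gemma-4-12b-it-Q5_K_M.gguf",
    "gemma-4-12b-it-Q8_0.gguf",
    "README.md",
]

pattern = "*Q4_K_M.gguf"
matched = [f for f in files if fnmatch(f, pattern)]
print(matched)  # ['gemma-4-12b-it-Q4_K_M.gguf']
```

The same idea works in huggingface_hub: snapshot_download accepts an allow_patterns argument with glob patterns, so you can fetch only matching files from a repo.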

Method 3: Python API

Download programmatically in your scripts:

from huggingface_hub import hf_hub_download, snapshot_download

# Download a single file
path = hf_hub_download(
    repo_id="unsloth/gemma-4-12b-it-GGUF",
    filename="gemma-4-12b-it-Q4_K_M.gguf",
    local_dir="./models"
)
print(f"Downloaded to: {path}")

# Download entire model
snapshot_download(
    repo_id="google/gemma-4-12b-it",
    local_dir="./models/gemma-4-12b-it"
)

Using with Transformers Library

Once you've downloaded the official weights, load them directly with the transformers library:

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load model and tokenizer
model_id = "google/gemma-4-12b-it"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto"  # Automatically distribute across available GPUs
)

# Generate text
messages = [
    {"role": "user", "content": "Explain quantum computing in simple terms."}
]

input_ids = tokenizer.apply_chat_template(
    messages,
    return_tensors="pt",
    add_generation_prompt=True
).to(model.device)

outputs = model.generate(
    input_ids,
    max_new_tokens=512,
    temperature=0.7,
    do_sample=True
)

response = tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True)
print(response)
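The temperature=0.7 passed to generate rescales the logits before sampling: values below 1 sharpen the token distribution, values above 1 flatten it. A minimal illustration of the underlying math, with made-up logits for three tokens:

```python
import math

def softmax_with_temperature(logits, temperature):
    # Divide logits by T before softmax: T < 1 sharpens the
    # distribution, T > 1 flattens it toward uniform.
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # made-up scores for three tokens
for t in (0.7, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```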

With 4-bit Quantization (BitsAndBytes)

Run the full model on less VRAM using on-the-fly quantization:

from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4"
)

model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-4-12b-it",
    quantization_config=quantization_config,
    device_map="auto"
)
# Now runs in roughly 8GB of VRAM instead of ~24GB at bf16
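The VRAM figures follow from simple arithmetic: parameters × bytes per parameter. This weights-only estimate ignores activations, the KV cache, and quantization overhead, so real usage sits somewhat higher:

```python
# Weights-only memory estimate: parameters x bits per parameter / 8.
# Ignores activations, KV cache, and quantization overhead, so
# actual VRAM usage is somewhat higher than these numbers.

def weights_gb(params_billions: float, bits_per_param: float) -> float:
    return params_billions * 1e9 * bits_per_param / 8 / 1e9

print(f"bf16 (16-bit): {weights_gb(12, 16):.0f} GB")  # 24 GB
print(f"nf4 (4-bit):   {weights_gb(12, 4):.0f} GB")   # 6 GB
```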

Using with Text Generation Inference (TGI)

For production serving, Hugging Face's TGI provides optimized inference:

# Run with Docker (HF_TOKEN is needed because the Gemma weights are gated)
docker run --gpus all \
  -p 8080:80 \
  -v $PWD/models:/data \
  -e HF_TOKEN=$HF_TOKEN \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id google/gemma-4-12b-it \
  --max-input-tokens 4096 \
  --max-total-tokens 8192 \
  --dtype bfloat16

# Query the API
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "google/gemma-4-12b-it",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 256
  }'

HF Mirror for Chinese Users

If you're in China and Hugging Face is slow or blocked, use the official mirror:

# Set the mirror endpoint
export HF_ENDPOINT=https://hf-mirror.com

# Now all huggingface-cli commands use the mirror
huggingface-cli download unsloth/gemma-4-12b-it-GGUF \
  gemma-4-12b-it-Q4_K_M.gguf \
  --local-dir ./models

# Or in Python
import os
os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"

from huggingface_hub import hf_hub_download
path = hf_hub_download(
    repo_id="unsloth/gemma-4-12b-it-GGUF",
    filename="gemma-4-12b-it-Q4_K_M.gguf"
)

The mirror syncs with the main HF hub, so all models and files are available.

Download Tips

| Tip | Details |
|---|---|
| Use huggingface-cli over git clone | Better resume support, progress bars, and error handling |
| Download specific files when possible | Don't clone entire repos with 10+ quantization files |
| Check disk space first | The 27B FP16 model needs 54GB+ free space |
| Use --cache-dir for a custom cache location | Defaults to ~/.cache/huggingface/, which may sit on a small drive |
| Verify file integrity | huggingface-cli checks SHA256 automatically |
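The disk-space check is easy to automate before kicking off a long download. A small stdlib sketch; the path and 54 GB threshold are examples:

```python
# Check free disk space before a large download (the 27B FP16
# weights need ~54 GB). Pure stdlib; path and threshold are examples.
import shutil

def has_free_space(path: str, needed_gb: float) -> bool:
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= needed_gb * 1e9

if not has_free_space(".", 54):
    print("Not enough space for the 27B FP16 weights - "
          "pick a smaller model or a GGUF quantization.")
```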

Next Steps

Hugging Face makes model distribution painless. Whether you're grabbing a quick GGUF for Ollama or the full weights for a research project, the download process is straightforward and resumable.
