Hugging Face is the primary hub for downloading Gemma 4 model weights. Whether you want the original FP16 weights for fine-tuning or GGUF quantized files for local inference, everything lives on HF. This guide walks through every download method and shows you how to start using the models right away.
Official Repositories
Google publishes the original Gemma 4 weights on Hugging Face:
| Model | Hugging Face Repo | Size | Format |
|---|---|---|---|
| Gemma 4 1B IT | google/gemma-4-1b-it | ~2 GB | SafeTensors |
| Gemma 4 4B IT | google/gemma-4-4b-it | ~8 GB | SafeTensors |
| Gemma 4 12B IT | google/gemma-4-12b-it | ~24 GB | SafeTensors |
| Gemma 4 27B IT | google/gemma-4-27b-it | ~54 GB | SafeTensors |
| Gemma 4 E2B IT | google/gemma-4-e2b-it | ~4 GB | SafeTensors |
| Gemma 4 E4B IT | google/gemma-4-e4b-it | ~8 GB | SafeTensors |
Base (pre-trained, non-instruction-tuned) models are also available with the -pt suffix instead of -it.
GGUF Repositories
For running with llama.cpp, Ollama, or LM Studio, grab the GGUF versions from Unsloth:
| Model | Hugging Face Repo | Quantizations Available |
|---|---|---|
| Gemma 4 1B | unsloth/gemma-4-1b-it-GGUF | Q4_K_M, Q5_K_M, Q8_0, IQ4_XS |
| Gemma 4 4B | unsloth/gemma-4-4b-it-GGUF | Q4_K_M, Q5_K_M, Q8_0, IQ4_XS |
| Gemma 4 12B | unsloth/gemma-4-12b-it-GGUF | Q4_K_M, Q5_K_M, Q6_K, Q8_0, IQ4_XS |
| Gemma 4 27B | unsloth/gemma-4-27b-it-GGUF | Q4_K_M, Q5_K_M, Q6_K, Q8_0, IQ4_XS |
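If you're unsure which quantization fits your disk or VRAM, a file's size can be estimated from the parameter count and the format's bits per weight. The bits-per-weight figures below are rough assumptions for illustration (real GGUF files vary because different tensors use different quant types):

```python
# Rough GGUF size estimator. Bits-per-weight values are approximations,
# not published specs; actual files will differ by a few percent.
APPROX_BPW = {
    "Q4_K_M": 4.8,
    "Q5_K_M": 5.7,
    "Q6_K": 6.6,
    "Q8_0": 8.5,
}

def estimate_gguf_gb(params_billions: float, quant: str) -> float:
    """Estimated file size in GB: params * bits-per-weight / 8."""
    bits = params_billions * 1e9 * APPROX_BPW[quant]
    return bits / 8 / 1e9

for quant in APPROX_BPW:
    print(f"12B {quant}: ~{estimate_gguf_gb(12, quant):.1f} GB")
```

This is why Q4_K_M of the 12B model lands around 7-8 GB while Q8_0 is closer to 13 GB.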
Download Methods
Method 1: huggingface-cli (Recommended)
The Hugging Face CLI is the most reliable way to download large model files:
```bash
# Install the CLI
pip install huggingface_hub

# Login (required for gated models)
huggingface-cli login

# Download a specific GGUF file
huggingface-cli download unsloth/gemma-4-12b-it-GGUF \
  gemma-4-12b-it-Q4_K_M.gguf \
  --local-dir ./models

# Download the full official model
huggingface-cli download google/gemma-4-12b-it \
  --local-dir ./models/gemma-4-12b-it

# Resume interrupted downloads automatically:
# just run the same command again and it picks up where it left off
```

Method 2: Git LFS
For downloading entire repositories including all files:
```bash
# Install git-lfs
brew install git-lfs       # macOS
sudo apt install git-lfs   # Ubuntu

# Initialize git-lfs
git lfs install

# Clone the model repo
git clone https://huggingface.co/google/gemma-4-12b-it

# For GGUF: clone only the file you need
GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/unsloth/gemma-4-12b-it-GGUF
cd gemma-4-12b-it-GGUF
git lfs pull --include="gemma-4-12b-it-Q4_K_M.gguf"
```

The GIT_LFS_SKIP_SMUDGE=1 trick clones the repo metadata without downloading the large files; you then selectively pull only the quantization you want. This saves bandwidth when a repo contains multiple large files.
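For context, what a skip-smudge clone actually leaves on disk is a tiny text pointer per large file, following git-lfs's pointer format. A quick sketch of what one looks like (the oid here is a dummy value; git-lfs writes the real SHA256):

```python
# A git-lfs pointer stub: ~130 bytes of text standing in for a ~7 GB model
# file until `git lfs pull` replaces it. Dummy oid for illustration only.
pointer = (
    "version https://git-lfs.github.com/spec/v1\n"
    "oid sha256:" + "0123456789abcdef" * 4 + "\n"
    "size 7300000000\n"
)
print(f"pointer stub is {len(pointer)} bytes")
```

Seeing a file this small where you expected gigabytes is the telltale sign that an LFS file hasn't been pulled yet.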
Method 3: Python API
Download programmatically in your scripts:
```python
from huggingface_hub import hf_hub_download, snapshot_download

# Download a single file
path = hf_hub_download(
    repo_id="unsloth/gemma-4-12b-it-GGUF",
    filename="gemma-4-12b-it-Q4_K_M.gguf",
    local_dir="./models"
)
print(f"Downloaded to: {path}")

# Download the entire model
snapshot_download(
    repo_id="google/gemma-4-12b-it",
    local_dir="./models/gemma-4-12b-it"
)
```

Using with Transformers Library
Once you've downloaded the official weights, load them directly with the transformers library:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load model and tokenizer
model_id = "google/gemma-4-12b-it"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto"  # Automatically distribute across available GPUs
)

# Generate text
messages = [
    {"role": "user", "content": "Explain quantum computing in simple terms."}
]
input_ids = tokenizer.apply_chat_template(
    messages,
    return_tensors="pt",
    add_generation_prompt=True
).to(model.device)

outputs = model.generate(
    input_ids,
    max_new_tokens=512,
    temperature=0.7,
    do_sample=True
)
response = tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True)
print(response)
```

With 4-bit Quantization (BitsAndBytes)
Run the full model on less VRAM using on-the-fly quantization:
```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4"
)
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-4-12b-it",
    quantization_config=quantization_config,
    device_map="auto"
)
# Now runs in ~8 GB of VRAM instead of ~26 GB
```

Using with Text Generation Inference (TGI)
For production serving, Hugging Face's TGI provides optimized inference:
```bash
# Run with Docker (bind mounts require an absolute host path)
docker run --gpus all \
  -p 8080:80 \
  -v "$PWD/models":/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id google/gemma-4-12b-it \
  --max-input-tokens 4096 \
  --max-total-tokens 8192 \
  --dtype bfloat16

# Query the API
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "google/gemma-4-12b-it",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 256
  }'
```

HF Mirror for Chinese Users
If you're in China and Hugging Face is slow or blocked, use the community-run mirror hf-mirror.com:
```bash
# Set the mirror endpoint
export HF_ENDPOINT=https://hf-mirror.com

# Now all huggingface-cli commands use the mirror
huggingface-cli download unsloth/gemma-4-12b-it-GGUF \
  gemma-4-12b-it-Q4_K_M.gguf \
  --local-dir ./models
```

Or in Python, setting the variable before importing huggingface_hub (the endpoint is read at import time):

```python
import os
os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"

from huggingface_hub import hf_hub_download
path = hf_hub_download(
    repo_id="unsloth/gemma-4-12b-it-GGUF",
    filename="gemma-4-12b-it-Q4_K_M.gguf"
)
```

The mirror syncs with the main HF hub, so all models and files are available.
Download Tips
| Tip | Details |
|---|---|
| Use huggingface-cli over git clone | Better resume support, progress bars, and error handling |
| Download specific files when possible | Don't clone entire repos with 10+ quantization files |
| Check disk space first | The 27B FP16 model needs 54 GB+ of free space |
| Use --cache-dir for a custom cache location | Defaults to ~/.cache/huggingface/, which may be on a small drive |
| Verify file integrity | huggingface-cli checks SHA256 automatically |
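The disk-space tip is easy to automate before kicking off a large download. A minimal sketch using only the standard library (the 5 GB headroom default is an arbitrary choice, not a Hugging Face recommendation):

```python
import shutil

def has_room(path: str, needed_gb: float, headroom_gb: float = 5.0) -> bool:
    """Return True if `path` has enough free space for a download of needed_gb."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= needed_gb + headroom_gb

# Example: the 27B FP16 weights need roughly 54 GB on disk.
if not has_room(".", 54):
    print("Warning: not enough free space for gemma-4-27b-it FP16 weights.")
```

Running a check like this before a multi-hour download beats discovering a full disk at 90%.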
Next Steps
- Not sure which GGUF to pick? Read our GGUF Quantization Guide for detailed format comparisons
- Want all download options in one place? Check the Complete Download Guide covering Ollama, LM Studio, and direct downloads
- Ready to run the model? Follow our Ollama tutorial for the quickest setup
Hugging Face makes model distribution painless. Whether you're grabbing a quick GGUF for Ollama or the full weights for a research project, the download process is straightforward and resumable.