Hugging Face is the primary hub for downloading Gemma 4 model weights. Whether you want the original FP16 weights for fine-tuning or GGUF quantized files for local inference, everything lives on HF. This guide walks through every download method and shows you how to start using the models right away.
Official Repositories
Google publishes the original Gemma 4 weights on Hugging Face:
| Model | Hugging Face Repo | Size | Format |
|---|---|---|---|
| Gemma 4 1B IT | google/gemma-4-1b-it | ~2 GB | SafeTensors |
| Gemma 4 4B IT | google/gemma-4-4b-it | ~8 GB | SafeTensors |
| Gemma 4 12B IT | google/gemma-4-12b-it | ~24 GB | SafeTensors |
| Gemma 4 27B IT | google/gemma-4-27b-it | ~54 GB | SafeTensors |
| Gemma 4 E2B IT | google/gemma-4-e2b-it | ~4 GB | SafeTensors |
| Gemma 4 E4B IT | google/gemma-4-e4b-it | ~8 GB | SafeTensors |
Base (pre-trained, non-instruction-tuned) models are also available with the -pt suffix instead of -it.
GGUF Repositories
For running with llama.cpp, Ollama, or LM Studio, grab the GGUF versions from Unsloth:
| Model | Hugging Face Repo | Quantizations Available |
|---|---|---|
| Gemma 4 1B | unsloth/gemma-4-1b-it-GGUF | Q4_K_M, Q5_K_M, Q8_0, IQ4_XS |
| Gemma 4 4B | unsloth/gemma-4-4b-it-GGUF | Q4_K_M, Q5_K_M, Q8_0, IQ4_XS |
| Gemma 4 12B | unsloth/gemma-4-12b-it-GGUF | Q4_K_M, Q5_K_M, Q6_K, Q8_0, IQ4_XS |
| Gemma 4 27B | unsloth/gemma-4-27b-it-GGUF | Q4_K_M, Q5_K_M, Q6_K, Q8_0, IQ4_XS |
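If you're unsure which quantization fits your disk or VRAM, a file's size can be estimated from the parameter count and the format's bits per weight. The bits-per-weight figures below are rough assumptions for illustration (real GGUF files vary because different tensors use different quant types):

```python
# Rough GGUF size estimator. Bits-per-weight values are approximations,
# not published specs; actual files will differ by a few percent.
APPROX_BPW = {
    "Q4_K_M": 4.8,
    "Q5_K_M": 5.7,
    "Q6_K": 6.6,
    "Q8_0": 8.5,
}

def estimate_gguf_gb(params_billions: float, quant: str) -> float:
    """Estimated file size in GB: params * bits-per-weight / 8."""
    bits = params_billions * 1e9 * APPROX_BPW[quant]
    return bits / 8 / 1e9

for quant in APPROX_BPW:
    print(f"12B {quant}: ~{estimate_gguf_gb(12, quant):.1f} GB")
```

This is why Q4_K_M of the 12B model lands around 7-8 GB while Q8_0 is closer to 13 GB.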
Download Methods
Method 1: huggingface-cli (Recommended)
The Hugging Face CLI is the most reliable way to download large model files:
```bash
# Install the CLI
pip install huggingface_hub

# Login (required for gated models)
huggingface-cli login

# Download a specific GGUF file
huggingface-cli download unsloth/gemma-4-12b-it-GGUF \
  gemma-4-12b-it-Q4_K_M.gguf \
  --local-dir ./models

# Download the full official model
huggingface-cli download google/gemma-4-12b-it \
  --local-dir ./models/gemma-4-12b-it

# Resume interrupted downloads automatically:
# just run the same command again and it picks up where it left off
```

Method 2: Git LFS
For downloading entire repositories including all files:
```bash
# Install git-lfs
brew install git-lfs       # macOS
sudo apt install git-lfs   # Ubuntu

# Initialize git-lfs
git lfs install

# Clone the model repo
git clone https://huggingface.co/google/gemma-4-12b-it

# For GGUF: clone only the file you need
GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/unsloth/gemma-4-12b-it-GGUF
cd gemma-4-12b-it-GGUF
git lfs pull --include="gemma-4-12b-it-Q4_K_M.gguf"
```

The GIT_LFS_SKIP_SMUDGE=1 trick clones the repo metadata without downloading the large files; you then selectively pull only the quantization you want. This saves bandwidth when a repo contains multiple large files.
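For context, what a skip-smudge clone actually leaves on disk is a tiny text pointer per large file, following git-lfs's pointer format. A quick sketch of what one looks like (the oid here is a dummy value; git-lfs writes the real SHA256):

```python
# A git-lfs pointer stub: ~130 bytes of text standing in for a ~7 GB model
# file until `git lfs pull` replaces it. Dummy oid for illustration only.
pointer = (
    "version https://git-lfs.github.com/spec/v1\n"
    "oid sha256:" + "0123456789abcdef" * 4 + "\n"
    "size 7300000000\n"
)
print(f"pointer stub is {len(pointer)} bytes")
```

Seeing a file this small where you expected gigabytes is the telltale sign that an LFS file hasn't been pulled yet.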
Method 3: Python API
Download programmatically in your scripts:
```python
from huggingface_hub import hf_hub_download, snapshot_download

# Download a single file
path = hf_hub_download(
    repo_id="unsloth/gemma-4-12b-it-GGUF",
    filename="gemma-4-12b-it-Q4_K_M.gguf",
    local_dir="./models"
)
print(f"Downloaded to: {path}")

# Download the entire model
snapshot_download(
    repo_id="google/gemma-4-12b-it",
    local_dir="./models/gemma-4-12b-it"
)
```

Using with Transformers Library
Once you've downloaded the official weights, load them directly with the transformers library:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load model and tokenizer
model_id = "google/gemma-4-12b-it"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto"  # Automatically distribute across available GPUs
)

# Generate text
messages = [
    {"role": "user", "content": "Explain quantum computing in simple terms."}
]
input_ids = tokenizer.apply_chat_template(
    messages,
    return_tensors="pt",
    add_generation_prompt=True
).to(model.device)

outputs = model.generate(
    input_ids,
    max_new_tokens=512,
    temperature=0.7,
    do_sample=True
)
response = tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True)
print(response)
```

With 4-bit Quantization (BitsAndBytes)
Run the full model on less VRAM using on-the-fly quantization:
```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4"
)
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-4-12b-it",
    quantization_config=quantization_config,
    device_map="auto"
)
# Now runs in ~8 GB of VRAM instead of ~26 GB
```

Using with Text Generation Inference (TGI)
For production serving, Hugging Face's TGI provides optimized inference:
```bash
# Run with Docker (bind mounts require an absolute host path)
docker run --gpus all \
  -p 8080:80 \
  -v "$PWD/models":/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id google/gemma-4-12b-it \
  --max-input-tokens 4096 \
  --max-total-tokens 8192 \
  --dtype bfloat16

# Query the API
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "google/gemma-4-12b-it",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 256
  }'
```

HF Mirror for Chinese Users
If you're in China and Hugging Face is slow or blocked, use the community-run mirror hf-mirror.com:
```bash
# Set the mirror endpoint
export HF_ENDPOINT=https://hf-mirror.com

# Now all huggingface-cli commands use the mirror
huggingface-cli download unsloth/gemma-4-12b-it-GGUF \
  gemma-4-12b-it-Q4_K_M.gguf \
  --local-dir ./models
```

Or in Python, setting the variable before importing huggingface_hub (the endpoint is read at import time):

```python
import os
os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"

from huggingface_hub import hf_hub_download
path = hf_hub_download(
    repo_id="unsloth/gemma-4-12b-it-GGUF",
    filename="gemma-4-12b-it-Q4_K_M.gguf"
)
```

The mirror syncs with the main HF hub, so all models and files are available.
Download Tips
| Tip | Details |
|---|---|
| Use huggingface-cli over git clone | Better resume support, progress bars, and error handling |
| Download specific files when possible | Don't clone entire repos with 10+ quantization files |
| Check disk space first | The 27B FP16 model needs 54 GB+ of free space |
| Use --cache-dir for a custom cache location | Defaults to ~/.cache/huggingface/, which may be on a small drive |
| Verify file integrity | huggingface-cli checks SHA256 automatically |
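The disk-space tip is easy to automate before kicking off a large download. A minimal sketch using only the standard library (the 5 GB headroom default is an arbitrary choice, not a Hugging Face recommendation):

```python
import shutil

def has_room(path: str, needed_gb: float, headroom_gb: float = 5.0) -> bool:
    """Return True if `path` has enough free space for a download of needed_gb."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= needed_gb + headroom_gb

# Example: the 27B FP16 weights need roughly 54 GB on disk.
if not has_room(".", 54):
    print("Warning: not enough free space for gemma-4-27b-it FP16 weights.")
```

Running a check like this before a multi-hour download beats discovering a full disk at 90%.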
Next Steps
- Not sure which GGUF to pick? Read our GGUF Quantization Guide for detailed format comparisons
- Want all download options in one place? Check the Complete Download Guide covering Ollama, LM Studio, and direct downloads
- Ready to run the model? Follow our Ollama tutorial for the quickest setup
Hugging Face makes model distribution painless. Whether you're grabbing a quick GGUF for Ollama or the full weights for a research project, the download process is straightforward and resumable.