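How to Download Gemma 4 from Hugging Face (Weights and GGUF)

Hugging Face is the main hub for downloading Gemma 4 model weights. Whether you want the original FP16 weights for fine-tuning or quantized GGUF files for local inference, both are hosted on HF. This guide covers each download method and shows how to start using the model right away.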

Official Repositories

Google publishes the original Gemma 4 weights on Hugging Face:

Model	Hugging Face repository	Size	Format
Gemma 4 1B IT	google/gemma-4-1b-it	~2 GB	SafeTensors
Gemma 4 4B IT	google/gemma-4-4b-it	~8 GB	SafeTensors
Gemma 4 12B IT	google/gemma-4-12b-it	~24 GB	SafeTensors
Gemma 4 27B IT	google/gemma-4-27b-it	~54 GB	SafeTensors
Gemma 4 E2B IT	google/gemma-4-e2b-it	~4 GB	SafeTensors
Gemma 4 E4B IT	google/gemma-4-e4b-it	~8 GB	SafeTensors

Base models (pretrained, not instruction-tuned) are also available; replace the -it suffix with -pt.
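
The naming convention above can be captured in a small helper. This is a minimal sketch: `gemma_repo_id` is a hypothetical name, and it assumes only the `google/gemma-4-<size>-<it|pt>` pattern shown in the table, so check the repository page if a name does not resolve.

```python
def gemma_repo_id(size: str, instruction_tuned: bool = True) -> str:
    """Build a repo ID from a size label such as "12B" or "E4B".

    Assumes the naming pattern listed in the table above; this helper is
    illustrative and not part of any Hugging Face library.
    """
    suffix = "it" if instruction_tuned else "pt"
    return f"google/gemma-4-{size.lower()}-{suffix}"


print(gemma_repo_id("12B"))                           # google/gemma-4-12b-it
print(gemma_repo_id("E4B", instruction_tuned=False))  # google/gemma-4-e4b-pt
```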

GGUF Repositories

For llama.cpp, Ollama, or LM Studio, get the GGUF builds from Unsloth:

Model	Hugging Face repository	Available quantizations
Gemma 4 1B	unsloth/gemma-4-1b-it-GGUF	Q4_K_M, Q5_K_M, Q8_0, IQ4_XS
Gemma 4 4B	unsloth/gemma-4-4b-it-GGUF	Q4_K_M, Q5_K_M, Q8_0, IQ4_XS
Gemma 4 12B	unsloth/gemma-4-12b-it-GGUF	Q4_K_M, Q5_K_M, Q6_K, Q8_0, IQ4_XS
Gemma 4 27B	unsloth/gemma-4-27b-it-GGUF	Q4_K_M, Q5_K_M, Q6_K, Q8_0, IQ4_XS
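
To choose among the quantizations listed above, a rough size estimate helps. The sketch below multiplies the parameter count by an approximate bits-per-weight figure for each quantization type. Those figures are assumptions (commonly quoted ballpark values for llama.cpp quant types), and the function names are hypothetical; real file sizes depend on the model architecture, so treat the result as a first guess only.

```python
# Approximate bits per weight for common llama.cpp quantization types.
# ASSUMPTION: ballpark figures; actual GGUF file sizes vary by model.
APPROX_BITS_PER_WEIGHT = {
    "IQ4_XS": 4.3,
    "Q4_K_M": 4.9,
    "Q5_K_M": 5.7,
    "Q6_K": 6.6,
    "Q8_0": 8.5,
}


def estimate_gguf_gb(params_billion: float, quant: str) -> float:
    """Estimate file size in GB: billions of parameters times bytes per weight."""
    return params_billion * APPROX_BITS_PER_WEIGHT[quant] / 8


def largest_quant_that_fits(params_billion, budget_gb, available):
    """Return the highest-precision quant whose estimate fits the budget, or None."""
    fitting = [q for q in available
               if estimate_gguf_gb(params_billion, q) <= budget_gb]
    if not fitting:
        return None
    return max(fitting, key=lambda q: APPROX_BITS_PER_WEIGHT[q])


quants_12b = ["Q4_K_M", "Q5_K_M", "Q6_K", "Q8_0", "IQ4_XS"]
print(largest_quant_that_fits(12, budget_gb=8, available=quants_12b))  # Q4_K_M
```

Leave headroom beyond the file size itself, since the context window and runtime buffers also consume memory.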

Download Methods

Method 1: huggingface-cli (recommended)

The Hugging Face CLI is the most reliable way to download large model files:

# Install the CLI
pip install huggingface_hub

# Log in (required for gated models)
huggingface-cli login

# Download a specific GGUF file
huggingface-cli download unsloth/gemma-4-12b-it-GGUF \
  gemma-4-12b-it-Q4_K_M.gguf \
  --local-dir ./models

# Download the complete official model
huggingface-cli download google/gemma-4-12b-it \
  --local-dir ./models/gemma-4-12b-it

# Resume an interrupted download automatically
# Just run the same command again; it continues where it left off
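
Because re-running the same command resumes the download, a flaky connection can be handled by retrying in a loop. The following is a minimal sketch of that idea using only the standard library; `run_with_retries` is a hypothetical helper, and the retry count and delays are arbitrary choices rather than recommended values.

```python
import subprocess
import time


def run_with_retries(command, attempts=5, base_delay=2.0,
                     runner=subprocess.run, sleep=time.sleep):
    """Re-run `command` until it exits with code 0.

    Waits base_delay, 2*base_delay, 4*base_delay, ... seconds between
    attempts. Returns True on success, False if every attempt failed.
    `runner` and `sleep` are injectable so the loop can be tested offline.
    """
    for attempt in range(attempts):
        result = runner(command)
        if result.returncode == 0:
            return True
        if attempt < attempts - 1:
            sleep(base_delay * 2 ** attempt)
    return False


# Example usage (performs a real download, so it is left commented out):
# run_with_retries([
#     "huggingface-cli", "download", "unsloth/gemma-4-12b-it-GGUF",
#     "gemma-4-12b-it-Q4_K_M.gguf", "--local-dir", "./models",
# ])
```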

Method 2: Git LFS

To download an entire repository, including every file:

# Install git-lfs
# macOS
brew install git-lfs

# Ubuntu
sudo apt install git-lfs

# Initialize git-lfs
git lfs install

# Clone the model repository
git clone https://huggingface.co/google/gemma-4-12b-it

# For GGUF: clone the metadata, then pull only the file you need
GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/unsloth/gemma-4-12b-it-GGUF
cd gemma-4-12b-it-GGUF
git lfs pull --include="gemma-4-12b-it-Q4_K_M.gguf"

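The GIT_LFS_SKIP_SMUDGE=1 trick clones the repository's metadata without downloading the large files, after which you selectively pull only the quantization you want. This saves bandwidth when a repository holds several large files.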
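
A similar selective download is possible without Git, using the `allow_patterns` argument of `snapshot_download` from `huggingface_hub`. The sketch below is one way to do it; `quant_patterns` and `download_quant` are hypothetical helper names, and the glob assumes the quantization label appears in the file name, as in the example file above.

```python
def quant_patterns(quant: str) -> list[str]:
    """Glob patterns matching GGUF files for one quantization label.

    ASSUMPTION: the label (e.g. "Q4_K_M") appears in the file name.
    """
    return [f"*{quant}*.gguf"]


def download_quant(repo_id: str, quant: str, local_dir: str = "./models") -> str:
    """Download only the files for one quantization; returns the local folder."""
    # Imported here so the pattern helper works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=repo_id,
        allow_patterns=quant_patterns(quant),
        local_dir=local_dir,
    )


# Example usage (performs a real download, so it is left commented out):
# download_quant("unsloth/gemma-4-12b-it-GGUF", "Q4_K_M")
```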

Method 3: Python API

Download programmatically from your own scripts:

from huggingface_hub import hf_hub_download, snapshot_download

# Download a single file
path = hf_hub_download(
    repo_id="unsloth/gemma-4-12b-it-GGUF",
    filename="gemma-4-12b-it-Q4_K_M.gguf",
    local_dir="./models"
)
print(f"Downloaded to: {path}")

# Download the entire model
snapshot_download(
    repo_id="google/gemma-4-12b-it",
    local_dir="./models/gemma-4-12b-it"
)

Using with the Transformers Library

Once the official weights are downloaded, load them directly with the transformers library:

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load the model and tokenizer
model_id = "google/gemma-4-12b-it"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto"  # automatically spread the model across available GPUs
)

# Generate text
messages = [
    {"role": "user", "content": "Explain quantum computing in simple terms."}
]

input_ids = tokenizer.apply_chat_template(
    messages,
    return_tensors="pt",
    add_generation_prompt=True
).to(model.device)

outputs = model.generate(
    input_ids,
    max_new_tokens=512,
    temperature=0.7,
    do_sample=True
)

response = tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True)
print(response)

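With 4-bit Quantization (BitsAndBytes)

Use on-the-fly quantization to run the full model in less VRAM:

import torch  # needed for torch.bfloat16 below
from transformers import AutoModelForCausalLM, BitsAndBytesConfig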

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4"
)

model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-4-12b-it",
    quantization_config=quantization_config,
    device_map="auto"
)
# Now runs in ~8 GB of VRAM instead of ~26 GB

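Using with Text Generation Inference (TGI)

For production serving, Hugging Face's TGI provides optimized inference:

# Run with Docker (an absolute host path for the volume works across Docker versions)
docker run --gpus all \
  -p 8080:80 \
  -v "$PWD/models:/data" \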
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id google/gemma-4-12b-it \
  --max-input-tokens 4096 \
  --max-total-tokens 8192 \
  --dtype bfloat16

# Query the API
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "google/gemma-4-12b-it",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 256
  }'
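
The same request can be sent from Python with only the standard library. This is a sketch under assumptions: it targets the local endpoint started above and expects an OpenAI-style response body (`choices[0].message.content`); `build_chat_request` and `chat` are hypothetical helper names.

```python
import json
import urllib.request


def build_chat_request(prompt, base_url="http://localhost:8080",
                       model="google/gemma-4-12b-it", max_tokens=256):
    """Build (but do not send) a POST request mirroring the curl call above."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, **kwargs):
    """Send the request and return the reply text.

    ASSUMPTION: the server answers with an OpenAI-style JSON body.
    """
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Example usage (requires the TGI container to be running):
# print(chat("Hello!"))
```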

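HF Mirror for Users in China

If you are in China and Hugging Face is slow or blocked, use the community-run mirror hf-mirror.com (it is not operated by Hugging Face):

# Set the mirror endpoint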
export HF_ENDPOINT=https://hf-mirror.com

# All huggingface-cli commands now use the mirror
huggingface-cli download unsloth/gemma-4-12b-it-GGUF \
  gemma-4-12b-it-Q4_K_M.gguf \
  --local-dir ./models

# Or in Python (set the variable before importing huggingface_hub)
import os
os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"

from huggingface_hub import hf_hub_download
path = hf_hub_download(
    repo_id="unsloth/gemma-4-12b-it-GGUF",
    filename="gemma-4-12b-it-Q4_K_M.gguf"
)

The mirror syncs with the main HF Hub, so the models and files used in this guide should be available there as well.

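Download Tips

Tip	Details
Use `huggingface-cli` rather than `git clone`	Better resume support, progress bars, and error handling
Download specific files whenever possible	Don't clone an entire repository holding 10+ quantized files
Check disk space first	The 27B FP16 model needs 54 GB+ of free space
Use `--cache-dir` to set a custom cache location	The default is `~/.cache/huggingface/`, which may sit on a small disk
Verify file integrity	`huggingface-cli` checks SHA256 automatically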
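
Two of the tips above, checking disk space and verifying integrity, can also be done by hand. The sketch below uses only the Python standard library; `has_free_space` and `sha256_of` are hypothetical helper names, and the expected hash to compare against would have to come from the file's page on the Hub.

```python
import hashlib
import shutil


def has_free_space(path: str, required_gb: float) -> bool:
    """True if the filesystem containing `path` has at least required_gb free.

    Uses decimal gigabytes (10**9 bytes), matching the sizes quoted above.
    """
    return shutil.disk_usage(path).free >= required_gb * 10**9


def sha256_of(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Compute a file's SHA256 hex digest, reading in chunks to limit memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Example usage before fetching the 27B weights:
if not has_free_space(".", required_gb=54):
    print("Not enough free space for the 27B FP16 weights.")
```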