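Downloading Gemma 4 from Hugging Face (original weights and GGUF versions)

Hugging Face is the main channel for downloading Gemma 4 model weights. Whether you need the original 16-bit weights for fine-tuning or a GGUF quantized version for local inference, both are hosted there. This tutorial covers each download method and shows how to start using the model right away.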

Official repositories

Google publishes the original Gemma 4 weights on Hugging Face:

Model	Hugging Face repository	Size	Format
Gemma 4 1B IT	google/gemma-4-1b-it	~2 GB	SafeTensors
Gemma 4 4B IT	google/gemma-4-4b-it	~8 GB	SafeTensors
Gemma 4 12B IT	google/gemma-4-12b-it	~24 GB	SafeTensors
Gemma 4 27B IT	google/gemma-4-27b-it	~54 GB	SafeTensors
Gemma 4 E2B IT	google/gemma-4-e2b-it	~4 GB	SafeTensors
Gemma 4 E4B IT	google/gemma-4-e4b-it	~8 GB	SafeTensors

Pretrained base models (not instruction-tuned) use the -pt suffix instead of -it.

GGUF repositories

To run the model with llama.cpp, Ollama, or LM Studio, download the GGUF versions from Unsloth:

Model	Hugging Face repository	Available quantizations
Gemma 4 1B	unsloth/gemma-4-1b-it-GGUF	Q4_K_M, Q5_K_M, Q8_0, IQ4_XS
Gemma 4 4B	unsloth/gemma-4-4b-it-GGUF	Q4_K_M, Q5_K_M, Q8_0, IQ4_XS
Gemma 4 12B	unsloth/gemma-4-12b-it-GGUF	Q4_K_M, Q5_K_M, Q6_K, Q8_0, IQ4_XS
Gemma 4 27B	unsloth/gemma-4-27b-it-GGUF	Q4_K_M, Q5_K_M, Q6_K, Q8_0, IQ4_XS

Download methods

Method 1: huggingface-cli (recommended)

The Hugging Face command-line tool is the most reliable way to download large model files:

# Install the CLI
pip install huggingface_hub

# Log in (required for some models)
huggingface-cli login

# Download a specific GGUF file
huggingface-cli download unsloth/gemma-4-12b-it-GGUF \
  gemma-4-12b-it-Q4_K_M.gguf \
  --local-dir ./models

# Download the full official model
huggingface-cli download google/gemma-4-12b-it \
  --local-dir ./models/gemma-4-12b-it

# Resuming an interrupted download: just re-run the same command
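
Because re-running the same download resumes it, a flaky connection can be handled with a simple retry loop. The following is a minimal sketch, not part of the original tutorial: `download_with_retries` and `fetch_q4_gguf` are hypothetical helper names, and it assumes that a repeated `hf_hub_download` call picks up where the previous attempt stopped, as described above.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def download_with_retries(fetch: Callable[[], T], attempts: int = 5, wait_s: float = 10.0) -> T:
    """Call `fetch` until it succeeds, sleeping between failed attempts."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except Exception as err:  # network errors vary by backend, so catch broadly
            last_error = err
            print(f"attempt {attempt}/{attempts} failed: {err}")
            if attempt < attempts:
                time.sleep(wait_s)
    raise RuntimeError(f"download failed after {attempts} attempts") from last_error


def fetch_q4_gguf() -> str:
    """Download the Q4_K_M file used in this tutorial and return its local path."""
    # Imported lazily so the retry helper works without huggingface_hub installed.
    from huggingface_hub import hf_hub_download

    return hf_hub_download(
        repo_id="unsloth/gemma-4-12b-it-GGUF",
        filename="gemma-4-12b-it-Q4_K_M.gguf",
        local_dir="./models",
    )

# Usage (performs a real download):
# path = download_with_retries(fetch_q4_gguf)
```

Passing the download as a callable keeps the retry logic independent of which file or repository is being fetched.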

Method 2: Git LFS

Best suited to downloading an entire repository:

# Install git-lfs
# macOS
brew install git-lfs

# Ubuntu
sudo apt install git-lfs

# Initialize Git LFS
git lfs install

# Clone the model repository
git clone https://huggingface.co/google/gemma-4-12b-it

# GGUF repository: download only the file you need
GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/unsloth/gemma-4-12b-it-GGUF
cd gemma-4-12b-it-GGUF
git lfs pull --include="gemma-4-12b-it-Q4_K_M.gguf"

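The GIT_LFS_SKIP_SMUDGE=1 trick clones the repository metadata without downloading the large files, which are checked out as small LFS pointer files. You then pull only the quantization you need. This saves a lot of bandwidth when a repository holds a dozen or more GGUF files.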
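
The same "fetch only one quantization" idea can be sketched in Python with the `allow_patterns` argument of `snapshot_download`, which filters repository files by glob pattern. This is an illustrative sketch: `select_files` and `download_quant` are hypothetical helpers, and the pattern assumes the quantization label appears in the file name, as it does in the file names used in this tutorial.

```python
from fnmatch import fnmatch


def select_files(repo_files: list[str], patterns: list[str]) -> list[str]:
    """Preview which files a set of glob patterns would match."""
    return [f for f in repo_files if any(fnmatch(f, p) for p in patterns)]


def download_quant(repo_id: str, quant: str, local_dir: str = "./models") -> str:
    """Download only the GGUF files of one quantization level; returns the folder path."""
    # Imported lazily so select_files works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=repo_id,
        allow_patterns=[f"*{quant}*.gguf"],  # skip every other file in the repo
        local_dir=local_dir,
    )

# Usage (performs a real download):
# download_quant("unsloth/gemma-4-12b-it-GGUF", "Q4_K_M")
```

`select_files` lets you check a pattern against a file listing before committing to a multi-gigabyte download.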

Method 3: Python API

Download from within your code:

from huggingface_hub import hf_hub_download, snapshot_download

# Download a single file
path = hf_hub_download(
    repo_id="unsloth/gemma-4-12b-it-GGUF",
    filename="gemma-4-12b-it-Q4_K_M.gguf",
    local_dir="./models"
)
print(f"Downloaded to: {path}")

# Download the full model
snapshot_download(
    repo_id="google/gemma-4-12b-it",
    local_dir="./models/gemma-4-12b-it"
)
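
The file names used in this tutorial are taken as given, so it can help to list what a repository actually contains before calling `hf_hub_download`. A minimal sketch using `list_repo_files` from `huggingface_hub`; the helper names `gguf_files` and `list_gguf` are hypothetical.

```python
def gguf_files(repo_files: list[str]) -> list[str]:
    """Keep only the .gguf entries from a repository file listing, sorted by name."""
    return sorted(f for f in repo_files if f.endswith(".gguf"))


def list_gguf(repo_id: str) -> list[str]:
    """Fetch the file listing of a Hub repository and return its GGUF files."""
    # Imported lazily so gguf_files works without huggingface_hub installed.
    from huggingface_hub import list_repo_files

    return gguf_files(list_repo_files(repo_id))

# Usage (needs network access):
# for name in list_gguf("unsloth/gemma-4-12b-it-GGUF"):
#     print(name)
```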

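Loading with the transformers library

Once the official weights are downloaded, load them directly with transformers. To reuse files saved with --local-dir, pass that directory path in place of the repository ID: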

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "google/gemma-4-12b-it"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto"  # automatically place layers on the available GPUs
)

# Generate text
messages = [
    {"role": "user", "content": "Explain quantum computing in simple terms."}
]

input_ids = tokenizer.apply_chat_template(
    messages,
    return_tensors="pt",
    add_generation_prompt=True
).to(model.device)

outputs = model.generate(
    input_ids,
    max_new_tokens=512,
    temperature=0.7,
    do_sample=True
)

response = tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True)
print(response)

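Loading with 4-bit quantization (BitsAndBytes)

Use on-the-fly quantization to run the full model with less VRAM:

# Requires the bitsandbytes and accelerate packages
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig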

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4"
)

model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-4-12b-it",
    quantization_config=quantization_config,
    device_map="auto"
)
# Now needs roughly 8 GB of VRAM instead of roughly 26 GB
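
The VRAM figures above can be sanity-checked with back-of-the-envelope arithmetic: the weights take roughly (parameter count) × (bits per parameter) / 8 bytes. The sketch below assumes the nominal parameter count (exactly 12 billion for the 12B model) and counts weights only; the gap up to the quoted ~8 GB and ~26 GB is runtime overhead such as activations, the KV cache, and quantization constants. `weight_memory_gb` is a hypothetical helper name.

```python
def weight_memory_gb(params_billion: float, bits_per_param: float) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_param / 8


bf16_gb = weight_memory_gb(12, 16)  # 16-bit weights: 24.0 GB
nf4_gb = weight_memory_gb(12, 4)    # 4-bit weights: 6.0 GB

print(f"bf16: {bf16_gb} GB, nf4: {nf4_gb} GB")
```

The same arithmetic reproduces the ~24 GB and ~54 GB sizes listed for the 12B and 27B repositories.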

Production deployment with TGI

Hugging Face's Text Generation Inference (TGI) provides an optimized inference server:

# Run with Docker
docker run --gpus all \
  -p 8080:80 \
  -v "$PWD/models:/data" \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id google/gemma-4-12b-it \
  --max-input-tokens 4096 \
  --max-total-tokens 8192 \
  --dtype bfloat16

# Call the API
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "google/gemma-4-12b-it",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 256
  }'
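
The same request can be sent from Python using only the standard library. This is a sketch under two assumptions: the server from the Docker command above is listening on localhost:8080, and the response follows the OpenAI-style chat completion shape (`choices[0].message.content`). `build_chat_request` and `chat` are hypothetical helper names.

```python
import json
import urllib.request


def build_chat_request(
    prompt: str,
    model: str = "google/gemma-4-12b-it",
    max_tokens: int = 256,
    base_url: str = "http://localhost:8080",
) -> urllib.request.Request:
    """Build the POST request for the /v1/chat/completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str, **kwargs) -> str:
    """Send the prompt to the server and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs), timeout=120) as resp:
        body = json.load(resp)
    # Assumed OpenAI-compatible response layout.
    return body["choices"][0]["message"]["content"]

# Usage (needs the TGI container running):
# print(chat("Hello!"))
```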

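Mirror for faster downloads in mainland China

If access to Hugging Face from mainland China is slow or fails, use a mirror site:

# Set the mirror endpoint
export HF_ENDPOINT=https://hf-mirror.com

# All huggingface-cli commands in this shell now go through the mirror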
huggingface-cli download unsloth/gemma-4-12b-it-GGUF \
  gemma-4-12b-it-Q4_K_M.gguf \
  --local-dir ./models

# In Python: set the endpoint before importing huggingface_hub
import os
os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"

from huggingface_hub import hf_hub_download
path = hf_hub_download(
    repo_id="unsloth/gemma-4-12b-it-GGUF",
    filename="gemma-4-12b-it-Q4_K_M.gguf"
)

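The mirror syncs with the main Hugging Face site, so the models and files available there can be downloaded the same way. In our own testing the speed difference was substantial, so the mirror is strongly recommended for users in mainland China.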
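
To see what the endpoint setting changes, the sketch below builds the download URL for a file by hand, following the `/{repo_id}/resolve/{revision}/{filename}` layout of Hub file links. It is illustrative only: `resolve_url` is a hypothetical helper that skips URL-escaping, and `huggingface_hub` constructs these URLs itself in real use.

```python
import os


def resolve_url(repo_id: str, filename: str, revision: str = "main", endpoint=None) -> str:
    """Build the file URL, honoring HF_ENDPOINT when no endpoint is passed."""
    base = endpoint or os.environ.get("HF_ENDPOINT", "https://huggingface.co")
    return f"{base.rstrip('/')}/{repo_id}/resolve/{revision}/{filename}"


print(resolve_url(
    "unsloth/gemma-4-12b-it-GGUF",
    "gemma-4-12b-it-Q4_K_M.gguf",
    endpoint="https://hf-mirror.com",
))
```

Only the host part differs between the main site and the mirror, which is why the same repository IDs and file names work unchanged.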

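Download tips

Tip	Explanation
Use `huggingface-cli` instead of `git clone`	Better resume support, plus progress bars and error handling
Download specific files where possible	Avoid cloning a whole repository that holds 10+ quantized files
Check disk space first	27B at 16-bit precision needs 54 GB or more
Use `--cache-dir` to set the cache location	By default files are cached under `~/.cache/huggingface/`, which may sit on a small disk
Verify file integrity	`huggingface-cli` checks downloads for consistency; for extra assurance, compare the file's SHA256 with the hash shown on its Hugging Face file page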
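
Two of the tips above, checking disk space and verifying integrity, can be scripted with the standard library. This is a minimal sketch with hypothetical helper names (`has_free_space`, `sha256_of`); the expected hash has to come from a source you trust, such as the file's page on Hugging Face.

```python
import hashlib
import shutil


def has_free_space(path: str, required_gb: float) -> bool:
    """Return True if the filesystem holding `path` has at least `required_gb` GB free."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_gb * 1e9


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute a file's SHA256 in 1 MiB chunks so large models never load into RAM at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Usage:
# if not has_free_space(".", 54):
#     raise SystemExit("Not enough disk space for the 27B weights")
# print(sha256_of("./models/gemma-4-12b-it-Q4_K_M.gguf"))
```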