How to Download Gemma 4 from Hugging Face (Weights and GGUF)

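Hugging Face is the main hub for downloading Gemma 4 model weights. The original 16-bit weights for fine-tuning and the GGUF quantized files for local inference are both hosted there. This guide walks through each download method and shows how to start using the model right away.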

Official Repositories

Google publishes the original Gemma 4 weights on Hugging Face:

Model	Hugging Face repository	Size	Format
Gemma 4 1B IT	google/gemma-4-1b-it	~2 GB	SafeTensors
Gemma 4 4B IT	google/gemma-4-4b-it	~8 GB	SafeTensors
Gemma 4 12B IT	google/gemma-4-12b-it	~24 GB	SafeTensors
Gemma 4 27B IT	google/gemma-4-27b-it	~54 GB	SafeTensors
Gemma 4 E2B IT	google/gemma-4-e2b-it	~4 GB	SafeTensors
Gemma 4 E4B IT	google/gemma-4-e4b-it	~8 GB	SafeTensors

Base (pretrained, not instruction-tuned) models are also available under the -pt suffix in place of -it.
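The sizes in the table follow from simple arithmetic: a 16-bit weight takes 2 bytes, so the download is roughly twice the parameter count in gigabytes. The sketch below reproduces those figures. The function name is illustrative, and the parameter counts are nominal values read off the model names, not exact counts.

```python
def estimate_size_gb(num_params: float, bits_per_weight: float = 16.0) -> float:
    """Approximate weight size in GB (1 GB = 1e9 bytes) for a dense model."""
    return num_params * bits_per_weight / 8 / 1e9


# Nominal parameter counts implied by the model names (an assumption).
for name, params in [("1B", 1e9), ("4B", 4e9), ("12B", 12e9), ("27B", 27e9)]:
    size = estimate_size_gb(params)
    print(f"Gemma 4 {name}: ~{size:.0f} GB at 16 bits per weight")
```

Passing a different `bits_per_weight` gives a quick estimate for quantized variants as well; real files are slightly larger because of metadata and layers kept at higher precision.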

GGUF Repositories

To run the model with llama.cpp, Ollama, or LM Studio, get the GGUF builds from Unsloth:

Model	Hugging Face repository	Available quantizations
Gemma 4 1B	unsloth/gemma-4-1b-it-GGUF	Q4_K_M, Q5_K_M, Q8_0, IQ4_XS
Gemma 4 4B	unsloth/gemma-4-4b-it-GGUF	Q4_K_M, Q5_K_M, Q8_0, IQ4_XS
Gemma 4 12B	unsloth/gemma-4-12b-it-GGUF	Q4_K_M, Q5_K_M, Q6_K, Q8_0, IQ4_XS
Gemma 4 27B	unsloth/gemma-4-27b-it-GGUF	Q4_K_M, Q5_K_M, Q6_K, Q8_0, IQ4_XS
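Choosing among these quantizations usually comes down to how much memory you can spare. The following sketch picks the highest-precision quantization whose file should fit a given budget. The bits-per-weight figures are rough values commonly quoted for llama.cpp quantization types and are an assumption here, as are the function names; actual file sizes vary by model.

```python
from typing import Iterable, Optional

# Approximate bits per weight for llama.cpp quantization types (assumed values).
APPROX_BITS_PER_WEIGHT = {
    "IQ4_XS": 4.25,
    "Q4_K_M": 4.85,
    "Q5_K_M": 5.69,
    "Q6_K": 6.59,
    "Q8_0": 8.50,
}


def approx_file_gb(num_params: float, quant: str) -> float:
    """Estimate the GGUF file size in GB for a given quantization type."""
    return num_params * APPROX_BITS_PER_WEIGHT[quant] / 8 / 1e9


def pick_quant(num_params: float, budget_gb: float,
               available: Iterable[str]) -> Optional[str]:
    """Return the highest-precision quantization that fits the budget, or None."""
    fitting = [q for q in available if approx_file_gb(num_params, q) <= budget_gb]
    if not fitting:
        return None
    return max(fitting, key=APPROX_BITS_PER_WEIGHT.__getitem__)


all_quants = ["Q4_K_M", "Q5_K_M", "Q6_K", "Q8_0", "IQ4_XS"]
print(pick_quant(12e9, budget_gb=8.0, available=all_quants))
```

The budget here covers the file only. At run time the KV cache and buffers need additional memory, so leave some headroom.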

Download Methods

Method 1: huggingface-cli (Recommended)

The Hugging Face CLI is the most reliable way to download large model files:

# Install the CLI
pip install huggingface_hub

# Log in (required for gated models)
huggingface-cli login

# Download a specific GGUF file
huggingface-cli download unsloth/gemma-4-12b-it-GGUF \
  gemma-4-12b-it-Q4_K_M.gguf \
  --local-dir ./models

# Download the complete official model
huggingface-cli download google/gemma-4-12b-it \
  --local-dir ./models/gemma-4-12b-it

# Interrupted downloads resume automatically:
# re-run the same command and it continues where it left off
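Because re-running the download picks up where it stopped, an unreliable connection can be handled by retrying in a loop. Below is a minimal sketch of that idea with exponential backoff. `run_with_retries` is a hypothetical helper, not part of huggingface_hub, and it retries on `OSError` by default, which may need adjusting for the HTTP client your installed version uses.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def run_with_retries(
    download: Callable[[], T],
    attempts: int = 5,
    base_delay: float = 2.0,
    retry_on: Tuple[Type[BaseException], ...] = (OSError,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call download() until it succeeds, doubling the wait after each failure."""
    for attempt in range(1, attempts + 1):
        try:
            return download()
        except retry_on as err:
            if attempt == attempts:
                raise  # give up after the final attempt
            delay = base_delay * 2 ** (attempt - 1)
            print(f"attempt {attempt} failed ({err}); retrying in {delay:.0f}s")
            sleep(delay)
    raise ValueError("attempts must be at least 1")


# Usage (requires huggingface_hub and network access):
# from huggingface_hub import hf_hub_download
# path = run_with_retries(lambda: hf_hub_download(
#     repo_id="unsloth/gemma-4-12b-it-GGUF",
#     filename="gemma-4-12b-it-Q4_K_M.gguf",
#     local_dir="./models",
# ))
```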

Method 2: Git LFS

Download the entire repository, including every file:

# Install git-lfs
# macOS
brew install git-lfs

# Ubuntu
sudo apt install git-lfs

# Initialize git-lfs
git lfs install

# Clone the model repository
git clone https://huggingface.co/google/gemma-4-12b-it

# For GGUF: clone without the large files, then pull only the file you need
GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/unsloth/gemma-4-12b-it-GGUF
cd gemma-4-12b-it-GGUF
git lfs pull --include="gemma-4-12b-it-Q4_K_M.gguf"

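The GIT_LFS_SKIP_SMUDGE=1 technique clones the repository metadata without downloading the large files, leaving small pointer files in their place. You then selectively pull only the quantization you need. This saves bandwidth when a repository holds several large files.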
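After a clone with smudging skipped, each large file is a short text pointer in the Git LFS format with `version`, `oid`, and `size` fields. The sketch below parses such a pointer, which is handy for checking a file's size before pulling it. The helper name is illustrative, and the sample digest and size are made up for the example.

```python
def parse_lfs_pointer(text: str) -> dict:
    """Parse a Git LFS pointer file into hash algorithm, digest, and size."""
    fields = {}
    for line in text.splitlines():
        if line.strip():
            key, _, value = line.partition(" ")
            fields[key] = value
    if not fields.get("version", "").startswith("https://git-lfs.github.com/spec/"):
        raise ValueError("not a Git LFS pointer file")
    algo, _, digest = fields["oid"].partition(":")
    return {"hash_algo": algo, "digest": digest, "size_bytes": int(fields["size"])}


# Illustrative pointer contents; the digest and size are invented.
sample = (
    "version https://git-lfs.github.com/spec/v1\n"
    "oid sha256:" + "ab" * 32 + "\n"
    "size 7300000000\n"
)
info = parse_lfs_pointer(sample)
print(f"{info['hash_algo']} digest, {info['size_bytes'] / 1e9:.1f} GB")
```

Once `git lfs pull` has replaced the pointer with the real file, the same parser raises `ValueError`, which makes it easy to tell pulled files from pending ones.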

Method 3: Python API

Download programmatically from within a script:

from huggingface_hub import hf_hub_download, snapshot_download

# Download a single file
path = hf_hub_download(
    repo_id="unsloth/gemma-4-12b-it-GGUF",
    filename="gemma-4-12b-it-Q4_K_M.gguf",
    local_dir="./models"
)
print(f"Downloaded to: {path}")

# Download the whole model
snapshot_download(
    repo_id="google/gemma-4-12b-it",
    local_dir="./models/gemma-4-12b-it"
)
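Between one file and the whole repository there is a middle ground: `snapshot_download` accepts an `allow_patterns` argument that restricts the download to matching files. The sketch below previews which files a set of patterns would select, assuming shell-style matching as in Python's `fnmatch`. The file listing and both helper names are hypothetical.

```python
from fnmatch import fnmatch
from typing import List


def preview_selection(filenames: List[str], allow_patterns: List[str]) -> List[str]:
    """Return the filenames that match at least one shell-style pattern."""
    return [name for name in filenames
            if any(fnmatch(name, pattern) for pattern in allow_patterns)]


# Hypothetical listing of a GGUF repository.
repo_files = [
    "README.md",
    "gemma-4-12b-it-Q4_K_M.gguf",
    "gemma-4-12b-it-Q5_K_M.gguf",
    "gemma-4-12b-it-Q8_0.gguf",
]
selected = preview_selection(repo_files, ["*Q4_K_M.gguf", "*.md"])
print(selected)


def download_selected(repo_id: str, allow_patterns: List[str], local_dir: str) -> str:
    """Download only matching files (needs huggingface_hub and network access)."""
    from huggingface_hub import snapshot_download  # imported lazily on purpose
    return snapshot_download(
        repo_id=repo_id, allow_patterns=allow_patterns, local_dir=local_dir
    )
```

Calling `download_selected("unsloth/gemma-4-12b-it-GGUF", ["*Q4_K_M.gguf"], "./models")` would then fetch a single quantization without naming the file exactly.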

Using the Weights with Transformers

Once the official weights are downloaded, load them directly with the transformers library:

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load the model and tokenizer
model_id = "google/gemma-4-12b-it"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto"  # spread the model across the available GPUs automatically
)

# Generate text
messages = [
    {"role": "user", "content": "Explain quantum computing in simple terms."}
]

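inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True  # return input_ids together with the attention mask
).to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=0.7,
    do_sample=True
)

# Decode only the newly generated tokens, skipping the prompt
prompt_length = inputs["input_ids"].shape[-1]
response = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)
print(response)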

With 4-bit Quantization (BitsAndBytes)

Use on-the-fly quantization to run the full model in less VRAM:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4"
)

model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-4-12b-it",
    quantization_config=quantization_config,
    device_map="auto"
)
# Runs in roughly 8 GB of VRAM instead of roughly 26 GB

Serving with Text Generation Inference (TGI)

For production serving, Hugging Face's TGI provides optimized inference:

# Run with Docker (export HF_TOKEN first if the model is gated)
docker run --gpus all \
  -p 8080:80 \
  -e HF_TOKEN="$HF_TOKEN" \
  -v "$PWD/models:/data" \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id google/gemma-4-12b-it \
  --max-input-tokens 4096 \
  --max-total-tokens 8192 \
  --dtype bfloat16

# Query the API
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "google/gemma-4-12b-it",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 256
  }'
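The same request can be sent from Python with only the standard library. This is a sketch that assumes the server started above is listening on localhost:8080 and returns an OpenAI-style response with a `choices` list. The function names are illustrative.

```python
import json
import urllib.request


def build_chat_request(
    prompt: str,
    base_url: str = "http://localhost:8080",
    model: str = "google/gemma-4-12b-it",
    max_tokens: int = 256,
) -> urllib.request.Request:
    """Build a POST request for the chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str, timeout: float = 60.0, **kwargs) -> str:
    """Send the prompt and return the assistant's reply (needs a running server)."""
    request = build_chat_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    # Assumes the OpenAI-compatible response shape.
    return body["choices"][0]["message"]["content"]


# Usage, once the container is up:
# print(ask("Hello!"))
```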

HF Mirror for Users in China

If you are in China and Hugging Face is slow or blocked, use a mirror such as the community-run hf-mirror.com:

# Set the mirror endpoint
export HF_ENDPOINT=https://hf-mirror.com

# All huggingface-cli commands now go through the mirror
huggingface-cli download unsloth/gemma-4-12b-it-GGUF \
  gemma-4-12b-it-Q4_K_M.gguf \
  --local-dir ./models

# The same works in Python; set the variable before importing huggingface_hub
import os
os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"

from huggingface_hub import hf_hub_download
path = hf_hub_download(
    repo_id="unsloth/gemma-4-12b-it-GGUF",
    filename="gemma-4-12b-it-Q4_K_M.gguf"
)

The mirror syncs with the main HF hub, so public models and files are generally available there as well.
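Switching the endpoint only changes the base of the file URL; the repository path and filename stay the same. The sketch below builds a direct download URL under that assumption, following the hub's `/resolve/<revision>/<filename>` layout. `resolve_url` is a hypothetical helper written for illustration.

```python
import os
from typing import Optional
from urllib.parse import quote

DEFAULT_ENDPOINT = "https://huggingface.co"


def resolve_url(repo_id: str, filename: str, revision: str = "main",
                endpoint: Optional[str] = None) -> str:
    """Build a direct file URL, honoring HF_ENDPOINT when no endpoint is given."""
    base = endpoint or os.environ.get("HF_ENDPOINT") or DEFAULT_ENDPOINT
    base = base.rstrip("/")
    return f"{base}/{repo_id}/resolve/{quote(revision, safe='')}/{quote(filename)}"


print(resolve_url(
    "unsloth/gemma-4-12b-it-GGUF",
    "gemma-4-12b-it-Q4_K_M.gguf",
    endpoint="https://hf-mirror.com",
))
```

A URL built this way can be handed to tools such as `curl` or `wget` when the Python client is not available.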

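Download Tips

Tip	Details
Prefer `huggingface-cli` over `git clone`	Better resume support, progress bars, and error handling
Download specific files when you can	Avoid cloning an entire repository that holds 10 or more quantized files
Check disk space first	The 27B FP16 model needs more than 54 GB of free space
Use `--cache-dir` for a custom cache location	The default, `~/.cache/huggingface/`, may sit on a small drive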
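Verify file integrity	For large files, compare the local SHA256 with the hash shown on the file's Hugging Face page instead of relying on the download alone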
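Two of these tips are easy to automate: checking free disk space before a download and verifying the file's SHA256 afterwards. The sketch below uses only the standard library. The helper names are illustrative, and the expected hash has to come from a source you trust, such as the file's page on the hub.

```python
import hashlib
import shutil


def has_free_space(directory: str, required_gb: float) -> bool:
    """Return True if the drive holding `directory` has enough free space."""
    free_bytes = shutil.disk_usage(directory).free
    return free_bytes >= required_gb * 1e9


def sha256_of(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Hash a file in chunks so multi-gigabyte models never load fully into RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Usage (paths and the expected hash are placeholders):
# if not has_free_space("./models", required_gb=54):
#     raise SystemExit("not enough disk space for the 27B weights")
# assert sha256_of("./models/gemma-4-12b-it-Q4_K_M.gguf") == expected_sha256
```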