Gemma 4 is already impressive out of the box, but what if you need it to speak like your brand, follow your specific output format, or crush a niche domain that general models stumble on? That's where fine-tuning comes in.
The good news: you don't need a cluster of A100s anymore. With LoRA and a tool called Unsloth, you can fine-tune Gemma 4 on a single GPU in under an hour. Let's walk through the whole process.
Why Fine-Tune at All?
Before you invest the effort, make sure fine-tuning is actually what you need. Here's a quick decision guide:
| Situation | Solution |
|---|---|
| Model doesn't know your domain | Fine-tune with domain data |
| Model ignores your output format | Fine-tune with format examples |
| Model needs updated info | Use RAG instead |
| Model is too verbose/terse | Try prompt engineering first |
| Model gives wrong answers sometimes | Try few-shot prompting first |
If prompt engineering and RAG don't cut it, fine-tuning is your next move. Before diving in, try our best Gemma 4 prompts — you might solve the problem with better prompting alone. For a broader look at what Gemma 4 can do before you start customizing it, check out our use cases guide.
LoRA and QLoRA Explained (Simply)
LoRA (Low-Rank Adaptation) doesn't modify the original model weights. Instead, it trains a small set of adapter weights that sit on top of the base model. Think of it like putting a custom lens on a camera rather than rebuilding the camera.
- Base model: frozen, untouched
- Adapter: tiny (usually 1-5% of base model size)
- Result: near full fine-tune quality at a fraction of the cost
QLoRA goes one step further — it loads the base model in 4-bit quantized form, cutting memory usage roughly in half. You get 90%+ of LoRA quality while fitting on much smaller GPUs.
| Method | VRAM Needed (Gemma 4 E4B) | VRAM Needed (Gemma 4 26B) | Quality |
|---|---|---|---|
| Full fine-tune | 32GB+ | 100GB+ | Best |
| LoRA | 16GB | 48GB | ~98% of full |
| QLoRA | 8GB | 24GB | ~95% of full |
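To get a feel for why adapters are so small, you can count LoRA parameters by hand: each adapted weight matrix of shape d_out × d_in gets two low-rank factors, A (r × d_in) and B (d_out × r). A minimal sketch, using hypothetical layer dimensions and a hypothetical 8B-parameter base model (real Gemma 4 shapes will differ):

```python
# Each adapted d_out x d_in weight gains two low-rank factors:
# A (r x d_in) and B (d_out x r), so r * (d_in + d_out) parameters.
def lora_params(d_in, d_out, r):
    return r * (d_in + d_out)

# Hypothetical transformer layer: 4096 hidden size, 14336 FFN width, rank 16.
d_model, d_ffn, rank = 4096, 14336, 16

attn = 4 * lora_params(d_model, d_model, rank)  # q/k/v/o projections
ffn = (2 * lora_params(d_model, d_ffn, rank)    # gate and up projections
       + lora_params(d_ffn, d_model, rank))     # down projection
per_layer = attn + ffn
total = 32 * per_layer  # assume 32 layers

base_params = 8e9  # hypothetical 8B-parameter base model
print(f"Adapter params: {total / 1e6:.1f}M "
      f"({100 * total / base_params:.2f}% of base)")
```

Even at rank 16 across every projection, the adapter is well under a percent of the base model's parameter count, which is why the trainable footprint and the saved adapter files stay so small.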
Setting Up Unsloth
Unsloth is the fastest way to fine-tune Gemma 4 with LoRA. It patches the model for 2x faster training and 60% less memory usage compared to vanilla HuggingFace.
```shell
# Create a virtual environment
python -m venv gemma4-finetune
source gemma4-finetune/bin/activate

# Install Unsloth (includes all dependencies)
pip install unsloth

# For QLoRA on older GPUs, also install:
pip install bitsandbytes
```

Verify your setup:
```python
import torch

print(f"CUDA available: {torch.cuda.is_available()}")
print(f"GPU: {torch.cuda.get_device_name(0)}")
print(f"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
```

Preparing Your Training Data
Your data should be in JSONL format, with each line containing a conversation. Here's the structure Gemma 4 expects:
```jsonl
{"messages": [{"role": "user", "content": "What's the return policy?"}, {"role": "assistant", "content": "Our return policy allows returns within 30 days of purchase with original receipt."}]}
{"messages": [{"role": "user", "content": "Do you ship internationally?"}, {"role": "assistant", "content": "Yes, we ship to 45 countries. Shipping takes 7-14 business days."}]}
{"messages": [{"role": "system", "content": "You are a medical coding assistant."}, {"role": "user", "content": "Code: Patient presents with acute bronchitis"}, {"role": "assistant", "content": "ICD-10: J20.9 - Acute bronchitis, unspecified"}]}
```

Data quality tips:
- 500-1000 examples is a sweet spot for most tasks
- More isn't always better — 500 high-quality examples beat 5000 sloppy ones
- Include edge cases and negative examples
- Keep responses consistent in format and tone
- Validate your JSONL before training:
```python
import json

def validate_jsonl(filepath):
    valid = 0
    errors = 0
    with open(filepath, 'r') as f:
        for i, line in enumerate(f, 1):
            try:
                data = json.loads(line)
                assert "messages" in data, "Missing 'messages' key"
                assert len(data["messages"]) >= 2, "Need at least 2 messages"
                valid += 1
            except (json.JSONDecodeError, AssertionError) as e:
                print(f"Line {i}: {e}")
                errors += 1
    print(f"\nValid: {valid}, Errors: {errors}")

validate_jsonl("training_data.jsonl")
```

Training Configuration
Here's a complete training script using Unsloth:
```python
from unsloth import FastLanguageModel
import torch

# Load model with QLoRA (4-bit)
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="google/gemma-4-e4b",
    max_seq_length=4096,
    dtype=None,  # Auto-detect
    load_in_4bit=True,  # QLoRA
)

# Add LoRA adapters
model = FastLanguageModel.get_peft_model(
    model,
    r=16,  # Rank: higher = more capacity, more VRAM
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_alpha=16,
    lora_dropout=0,  # Unsloth optimized: keep at 0
    bias="none",
    use_gradient_checkpointing="unsloth",
)

# Load training data
from datasets import load_dataset

dataset = load_dataset("json", data_files="training_data.jsonl", split="train")

# Training arguments
from trl import SFTTrainer
from transformers import TrainingArguments

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    max_seq_length=4096,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_steps=10,
        num_train_epochs=3,
        learning_rate=2e-4,
        fp16=not torch.cuda.is_bf16_supported(),
        bf16=torch.cuda.is_bf16_supported(),
        logging_steps=10,
        output_dir="outputs",
        optim="adamw_8bit",
        seed=42,
    ),
)

# Train
trainer.train()

# Save LoRA adapter
model.save_pretrained("gemma4-finetuned-lora")
tokenizer.save_pretrained("gemma4-finetuned-lora")
```

Key parameters to tune:
- `r=16`: LoRA rank. Start with 16; go to 32 or 64 if quality isn't there.
- `num_train_epochs=3`: Usually 2-5 is enough. Watch for overfitting.
- `learning_rate=2e-4`: The default works well. Lower to 1e-4 if training is unstable.
- `per_device_train_batch_size=2`: Increase if you have VRAM headroom.
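One sanity check worth doing before you launch: the effective batch size is per_device_train_batch_size times gradient_accumulation_steps, and together with your dataset size it determines how many optimizer steps you'll actually run. A quick back-of-the-envelope, assuming a hypothetical 800-example dataset:

```python
import math

per_device_batch = 2   # per_device_train_batch_size
grad_accum = 4         # gradient_accumulation_steps
epochs = 3             # num_train_epochs
num_examples = 800     # hypothetical dataset size

effective_batch = per_device_batch * grad_accum  # examples per optimizer step
steps_per_epoch = math.ceil(num_examples / effective_batch)
total_steps = steps_per_epoch * epochs

print(f"Effective batch size: {effective_batch}")
print(f"Optimizer steps: {steps_per_epoch}/epoch, {total_steps} total")
```

If total steps come out very low (a few dozen), warmup_steps=10 eats a large fraction of training; consider more epochs or a smaller warmup.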
Exporting to GGUF
Once your adapter is trained, you probably want to run it locally with Ollama. That means converting to GGUF format. For a detailed breakdown of quantization options and their tradeoffs, see our GGUF guide.
```python
# Merge adapter and export to GGUF (Q4_K_M quantization)
model.save_pretrained_gguf(
    "gemma4-finetuned-gguf",
    tokenizer,
    quantization_method="q4_k_m",
)

# Or export at multiple quantization levels
for method in ["q4_k_m", "q5_k_m", "q8_0"]:
    model.save_pretrained_gguf(
        f"gemma4-finetuned-{method}",
        tokenizer,
        quantization_method=method,
    )
```

| Quantization | File Size (E4B) | Quality Loss | Best For |
|---|---|---|---|
| Q4_K_M | ~2.5 GB | Minimal | Most users |
| Q5_K_M | ~3.2 GB | Very small | Quality-focused |
| Q8_0 | ~4.8 GB | Negligible | When VRAM allows |
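If you're scripting exports for machines with different amounts of free space, the table above is easy to encode. A rough helper, using the approximate E4B file sizes from the table (a hypothetical convenience, and the figures will differ for other model sizes):

```python
# Approximate GGUF file sizes in GB for Gemma 4 E4B, per the table above,
# ordered from highest quality to smallest file.
QUANT_SIZES_GB = {"q8_0": 4.8, "q5_k_m": 3.2, "q4_k_m": 2.5}

def pick_quantization(free_gb: float) -> str:
    """Return the highest-quality method whose file fits in free_gb."""
    for method, size_gb in QUANT_SIZES_GB.items():
        if size_gb <= free_gb:
            return method
    raise ValueError(f"Need at least {min(QUANT_SIZES_GB.values())} GB free")

print(pick_quantization(6.0))  # enough room for Q8_0
print(pick_quantization(3.0))  # only Q4_K_M fits
```

Remember that conversion itself needs extra headroom (see the troubleshooting section below), so leave a margin beyond the final file size.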
Deploying with Ollama
Create a Modelfile to bring your fine-tuned model into Ollama:
```
# Modelfile
FROM ./gemma4-finetuned-gguf/gemma4-finetuned-q4_k_m.gguf
PARAMETER temperature 0.7
PARAMETER top_p 0.9
SYSTEM "You are a helpful assistant fine-tuned for customer support."
```

Then build and run:
```shell
# Create the custom model
ollama create my-gemma4 -f Modelfile

# Test it
ollama run my-gemma4 "What's your return policy?"

# Verify it's listed
ollama list
```

Your fine-tuned model now runs exactly like any other Ollama model. You can use it with the Ollama API, build apps on top of it, or share it with your team.
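For example, here's a minimal way to hit the model through Ollama's REST API using only the Python standard library. It assumes Ollama is serving on its default port (11434) and that your model is named `my-gemma4` as above:

```python
import json
import urllib.request

def build_request(model: str, prompt: str) -> dict:
    # Payload for Ollama's /api/generate endpoint;
    # stream=False returns a single JSON object instead of a stream.
    return {"model": model, "prompt": prompt, "stream": False}

def generate(model: str, prompt: str,
             host: str = "http://localhost:11434") -> str:
    payload = json.dumps(build_request(model, prompt)).encode()
    req = urllib.request.Request(
        f"{host}/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        # The non-streaming response carries the text in "response".
        return json.loads(resp.read())["response"]

# Usage (requires a running Ollama server):
#   print(generate("my-gemma4", "What's your return policy?"))
```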
Common Issues and Fixes
Out of memory during training: Lower `per_device_train_batch_size` to 1, enable gradient checkpointing, or switch to a smaller base model.
Model outputs garbage after fine-tuning: Your data probably has formatting issues, or you trained for too many epochs. Reduce epochs and validate your JSONL.
GGUF export fails: Make sure you have enough disk space (3-5x the model size during conversion) and that your Unsloth version is up to date.
Training loss doesn't decrease: Learning rate might be too low. Try 5e-4. Also check that your data actually has the patterns you want the model to learn.
Next Steps
- Explore Gemma 4 model sizes to pick the best base model for fine-tuning
- Set up a proper deployment pipeline for your fine-tuned model
- Learn about structured JSON output to combine fine-tuning with reliable output formats
- Check hardware requirements to plan your training setup



