Gemma 4 is already impressive out of the box, but what if you need it to speak like your brand, follow your specific output format, or crush a niche domain that general models stumble on? That's where fine-tuning comes in.
The good news: you don't need a cluster of A100s anymore. With LoRA and a tool called Unsloth, you can fine-tune Gemma 4 on a single GPU in under an hour. Let's walk through the whole process.
Why Fine-Tune at All?
Before you invest the effort, make sure fine-tuning is actually what you need. Here's a quick decision guide:
| Situation | Solution |
|---|---|
| Model doesn't know your domain | Fine-tune with domain data |
| Model ignores your output format | Fine-tune with format examples |
| Model needs updated info | Use RAG instead |
| Model is too verbose/terse | Try prompt engineering first |
| Model gives wrong answers sometimes | Try few-shot prompting first |
If prompt engineering and RAG don't cut it, fine-tuning is your next move. Before diving in, try our best Gemma 4 prompts — you might solve the problem with better prompting alone. For a broader look at what Gemma 4 can do before you start customizing it, check out our use cases guide.
LoRA and QLoRA Explained (Simply)
LoRA (Low-Rank Adaptation) doesn't modify the original model weights. Instead, it trains a small set of adapter weights that sit on top of the base model. Think of it like putting a custom lens on a camera rather than rebuilding the camera.
- Base model: frozen, untouched
- Adapter: tiny (usually 1-5% of base model size)
- Result: near full fine-tune quality at a fraction of the cost
QLoRA goes one step further — it loads the base model in 4-bit quantized form, cutting memory usage roughly in half. You get 90%+ of LoRA quality while fitting on much smaller GPUs.
| Method | VRAM Needed (Gemma 4 E4B) | VRAM Needed (Gemma 4 26B) | Quality |
|---|---|---|---|
| Full fine-tune | 32GB+ | 100GB+ | Best |
| LoRA | 16GB | 48GB | ~98% of full |
| QLoRA | 8GB | 24GB | ~95% of full |
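To get a feel for why adapters are so small, you can count LoRA parameters by hand: each adapted weight matrix of shape d_out × d_in gets two low-rank factors, A (r × d_in) and B (d_out × r). A minimal sketch, using hypothetical layer dimensions and a hypothetical 8B-parameter base model (real Gemma 4 shapes will differ):

```python
# Each adapted d_out x d_in weight gains two low-rank factors:
# A (r x d_in) and B (d_out x r), so r * (d_in + d_out) parameters.
def lora_params(d_in, d_out, r):
    return r * (d_in + d_out)

# Hypothetical transformer layer: 4096 hidden size, 14336 FFN width, rank 16.
d_model, d_ffn, rank = 4096, 14336, 16

attn = 4 * lora_params(d_model, d_model, rank)  # q/k/v/o projections
ffn = (2 * lora_params(d_model, d_ffn, rank)    # gate and up projections
       + lora_params(d_ffn, d_model, rank))     # down projection
per_layer = attn + ffn
total = 32 * per_layer  # assume 32 layers

base_params = 8e9  # hypothetical 8B-parameter base model
print(f"Adapter params: {total / 1e6:.1f}M "
      f"({100 * total / base_params:.2f}% of base)")
```

Even at rank 16 across every projection, the adapter is well under a percent of the base model's parameter count, which is why the trainable footprint and the saved adapter files stay so small.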
Setting Up Unsloth
Unsloth is the fastest way to fine-tune Gemma 4 with LoRA. It patches the model for 2x faster training and 60% less memory usage compared to vanilla HuggingFace.
```shell
# Create a virtual environment
python -m venv gemma4-finetune
source gemma4-finetune/bin/activate

# Install Unsloth (includes all dependencies)
pip install unsloth

# For QLoRA on older GPUs, also install:
pip install bitsandbytes
```

Verify your setup:
```python
import torch

print(f"CUDA available: {torch.cuda.is_available()}")
print(f"GPU: {torch.cuda.get_device_name(0)}")
print(f"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
```

Preparing Your Training Data
Your data should be in JSONL format, with each line containing a conversation. Here's the structure Gemma 4 expects:
```jsonl
{"messages": [{"role": "user", "content": "What's the return policy?"}, {"role": "assistant", "content": "Our return policy allows returns within 30 days of purchase with original receipt."}]}
{"messages": [{"role": "user", "content": "Do you ship internationally?"}, {"role": "assistant", "content": "Yes, we ship to 45 countries. Shipping takes 7-14 business days."}]}
{"messages": [{"role": "system", "content": "You are a medical coding assistant."}, {"role": "user", "content": "Code: Patient presents with acute bronchitis"}, {"role": "assistant", "content": "ICD-10: J20.9 - Acute bronchitis, unspecified"}]}
```

Data quality tips:
- 500-1000 examples is a sweet spot for most tasks
- More isn't always better — 500 high-quality examples beat 5000 sloppy ones
- Include edge cases and negative examples
- Keep responses consistent in format and tone
- Validate your JSONL before training:
```python
import json

def validate_jsonl(filepath):
    valid = 0
    errors = 0
    with open(filepath, 'r') as f:
        for i, line in enumerate(f, 1):
            try:
                data = json.loads(line)
                assert "messages" in data, "Missing 'messages' key"
                assert len(data["messages"]) >= 2, "Need at least 2 messages"
                valid += 1
            except (json.JSONDecodeError, AssertionError) as e:
                print(f"Line {i}: {e}")
                errors += 1
    print(f"\nValid: {valid}, Errors: {errors}")

validate_jsonl("training_data.jsonl")
```

Training Configuration
Here's a complete training script using Unsloth:
```python
from unsloth import FastLanguageModel
import torch

# Load model with QLoRA (4-bit)
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="google/gemma-4-e4b",
    max_seq_length=4096,
    dtype=None,  # Auto-detect
    load_in_4bit=True,  # QLoRA
)

# Add LoRA adapters
model = FastLanguageModel.get_peft_model(
    model,
    r=16,  # Rank: higher = more capacity, more VRAM
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_alpha=16,
    lora_dropout=0,  # Unsloth optimized: keep at 0
    bias="none",
    use_gradient_checkpointing="unsloth",
)

# Load training data
from datasets import load_dataset

dataset = load_dataset("json", data_files="training_data.jsonl", split="train")

# Training arguments
from trl import SFTTrainer
from transformers import TrainingArguments

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    max_seq_length=4096,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_steps=10,
        num_train_epochs=3,
        learning_rate=2e-4,
        fp16=not torch.cuda.is_bf16_supported(),
        bf16=torch.cuda.is_bf16_supported(),
        logging_steps=10,
        output_dir="outputs",
        optim="adamw_8bit",
        seed=42,
    ),
)

# Train
trainer.train()

# Save LoRA adapter
model.save_pretrained("gemma4-finetuned-lora")
tokenizer.save_pretrained("gemma4-finetuned-lora")
```

Key parameters to tune:
- `r=16`: LoRA rank. Start with 16; go to 32 or 64 if quality isn't there.
- `num_train_epochs=3`: Usually 2-5 is enough. Watch for overfitting.
- `learning_rate=2e-4`: The default works well. Lower to 1e-4 if training is unstable.
- `per_device_train_batch_size=2`: Increase if you have VRAM headroom.
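One sanity check worth doing before you launch: the effective batch size is per_device_train_batch_size times gradient_accumulation_steps, and together with your dataset size it determines how many optimizer steps you'll actually run. A quick back-of-the-envelope, assuming a hypothetical 800-example dataset:

```python
import math

per_device_batch = 2   # per_device_train_batch_size
grad_accum = 4         # gradient_accumulation_steps
epochs = 3             # num_train_epochs
num_examples = 800     # hypothetical dataset size

effective_batch = per_device_batch * grad_accum  # examples per optimizer step
steps_per_epoch = math.ceil(num_examples / effective_batch)
total_steps = steps_per_epoch * epochs

print(f"Effective batch size: {effective_batch}")
print(f"Optimizer steps: {steps_per_epoch}/epoch, {total_steps} total")
```

If total steps come out very low (a few dozen), warmup_steps=10 eats a large fraction of training; consider more epochs or a smaller warmup.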
Exporting to GGUF
Once your adapter is trained, you probably want to run it locally with Ollama. That means converting to GGUF format. For a detailed breakdown of quantization options and their tradeoffs, see our GGUF guide.
```python
# Merge adapter and export to GGUF (Q4_K_M quantization)
model.save_pretrained_gguf(
    "gemma4-finetuned-gguf",
    tokenizer,
    quantization_method="q4_k_m",
)

# Or export at multiple quantization levels
for method in ["q4_k_m", "q5_k_m", "q8_0"]:
    model.save_pretrained_gguf(
        f"gemma4-finetuned-{method}",
        tokenizer,
        quantization_method=method,
    )
```

| Quantization | File Size (E4B) | Quality Loss | Best For |
|---|---|---|---|
| Q4_K_M | ~2.5 GB | Minimal | Most users |
| Q5_K_M | ~3.2 GB | Very small | Quality-focused |
| Q8_0 | ~4.8 GB | Negligible | When VRAM allows |
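If you're scripting exports for machines with different amounts of free space, the table above is easy to encode. A rough helper, using the approximate E4B file sizes from the table (a hypothetical convenience, and the figures will differ for other model sizes):

```python
# Approximate GGUF file sizes in GB for Gemma 4 E4B, per the table above,
# ordered from highest quality to smallest file.
QUANT_SIZES_GB = {"q8_0": 4.8, "q5_k_m": 3.2, "q4_k_m": 2.5}

def pick_quantization(free_gb: float) -> str:
    """Return the highest-quality method whose file fits in free_gb."""
    for method, size_gb in QUANT_SIZES_GB.items():
        if size_gb <= free_gb:
            return method
    raise ValueError(f"Need at least {min(QUANT_SIZES_GB.values())} GB free")

print(pick_quantization(6.0))  # enough room for Q8_0
print(pick_quantization(3.0))  # only Q4_K_M fits
```

Remember that conversion itself needs extra headroom (see the troubleshooting section below), so leave a margin beyond the final file size.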
Deploying with Ollama
Create a Modelfile to bring your fine-tuned model into Ollama:
```
# Modelfile
FROM ./gemma4-finetuned-gguf/gemma4-finetuned-q4_k_m.gguf
PARAMETER temperature 0.7
PARAMETER top_p 0.9
SYSTEM "You are a helpful assistant fine-tuned for customer support."
```

Then build and run:
```shell
# Create the custom model
ollama create my-gemma4 -f Modelfile

# Test it
ollama run my-gemma4 "What's your return policy?"

# Verify it's listed
ollama list
```

Your fine-tuned model now runs exactly like any other Ollama model. You can use it with the Ollama API, build apps on top of it, or share it with your team.
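For example, here's a minimal way to hit the model through Ollama's REST API using only the Python standard library. It assumes Ollama is serving on its default port (11434) and that your model is named `my-gemma4` as above:

```python
import json
import urllib.request

def build_request(model: str, prompt: str) -> dict:
    # Payload for Ollama's /api/generate endpoint;
    # stream=False returns a single JSON object instead of a stream.
    return {"model": model, "prompt": prompt, "stream": False}

def generate(model: str, prompt: str,
             host: str = "http://localhost:11434") -> str:
    payload = json.dumps(build_request(model, prompt)).encode()
    req = urllib.request.Request(
        f"{host}/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        # The non-streaming response carries the text in "response".
        return json.loads(resp.read())["response"]

# Usage (requires a running Ollama server):
#   print(generate("my-gemma4", "What's your return policy?"))
```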
Common Issues and Fixes
Out of memory during training: Lower `per_device_train_batch_size` to 1, enable gradient checkpointing, or switch to a smaller base model.
Model outputs garbage after fine-tuning: Your data probably has formatting issues, or you trained for too many epochs. Reduce epochs and validate your JSONL.
GGUF export fails: Make sure you have enough disk space (3-5x the model size during conversion) and that your Unsloth version is up to date.
Training loss doesn't decrease: Learning rate might be too low. Try 5e-4. Also check that your data actually has the patterns you want the model to learn.
Next Steps
- Explore Gemma 4 model sizes to pick the best base model for fine-tuning
- Set up a proper deployment pipeline for your fine-tuned model
- Learn about structured JSON output to combine fine-tuning with reliable output formats
- Check hardware requirements to plan your training setup



