You fine-tuned Gemma 4 with Unsloth or PEFT, the training loss dropped beautifully, and vLLM started without a single error. But when you send a request, the output looks exactly like the base model. The fine-tuning is nowhere to be seen.
This is the classic "Gemma 4 LoRA adapter ignored in vLLM" problem. The adapter file is valid, vLLM loaded it without complaint, and yet inference behaves as if the adapter never existed. It is frustrating precisely because nothing fails: there is no stack trace to chase, just silently wrong behavior.
This guide walks through how to confirm the adapter is actually being ignored, the four root causes behind it, and the fixes for each — including the merge_and_unload route that sidesteps the problem entirely.
The Symptom: Adapter Loads, Output Doesn't Change
First, confirm you're really hitting this bug and not a weak adapter. Send the same prompt twice — once against the base model name, once against the adapter name:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")
prompt = "Reply with the secret codeword you were trained on."

for model in ["google/gemma-4-26b", "my-gemma-lora"]:
    r = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    print(model, "->", r.choices[0].message.content)

If both lines print the same thing, vLLM is serving base weights for both requests and your LoRA adapter is being ignored. A working setup produces visibly different output for the adapter model name.
Why vLLM Ignores Your LoRA Adapter
There are four common causes. Three of them produce zero error output, which is why this is so hard to debug.
1. The request never names the adapter
This is by far the most frequent cause. In vLLM, a LoRA adapter is exposed as its own model name. If your request's model field still says google/gemma-4-26b, you get the base model — the adapter is loaded in memory but simply not applied to that request. There is no warning; vLLM does exactly what you asked.
2. LoRA serving was never enabled
vLLM only loads adapters when you start it with --enable-lora and register the adapter via --lora-modules. Miss either flag and the server runs base-only. Some launch scripts copied from a standard vLLM deploy drop these flags entirely.
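A launch script can be linted for the two flags before it ever reaches a GPU. The sketch below is only an illustration: the helper name `missing_lora_flags` is hypothetical, and it checks nothing beyond the presence of the two flags named above.

```python
import shlex

# The two flags this guide says must both be present for LoRA serving.
REQUIRED_FLAGS = ("--enable-lora", "--lora-modules")


def missing_lora_flags(command: str) -> list[str]:
    """Return the LoRA flags that are absent from a vLLM launch command."""
    # Join backslash-continued lines so shlex sees one logical command.
    tokens = shlex.split(command.replace("\\\n", " "))
    # Accept both "--flag value" and "--flag=value" spellings.
    present = {tok.split("=", 1)[0] for tok in tokens if tok.startswith("--")}
    return [flag for flag in REQUIRED_FLAGS if flag not in present]
```

An empty list means both flags are present; anything else names what the script dropped.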
3. --max-lora-rank is lower than your adapter's rank
Unsloth commonly trains with r=16, r=32, or r=64. vLLM's default --max-lora-rank is 16. If your adapter has rank 32 but the server caps at 16, loading either errors out or silently clips the adapter so its effect is heavily muted. Always set --max-lora-rank to match (or exceed) the rank you trained with.
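If you are not sure which rank the adapter was trained with, the value PEFT saved can be read straight from the adapter directory. This is a minimal sketch that assumes a standard PEFT export, where adapter_config.json stores the rank under the key `r`; the function names are hypothetical, and per-module overrides (if your config uses them) are not considered.

```python
import json
from pathlib import Path


def adapter_rank(adapter_dir: str) -> int:
    """Read the LoRA rank (`r`) recorded in adapter_config.json."""
    config_path = Path(adapter_dir) / "adapter_config.json"
    config = json.loads(config_path.read_text())
    return int(config["r"])


def rank_fits(adapter_dir: str, max_lora_rank: int) -> bool:
    """True when the value passed to --max-lora-rank covers the adapter's rank."""
    return adapter_rank(adapter_dir) <= max_lora_rank
```

For example, `rank_fits("/path/to/lora_adapter", 16)` returns False for an adapter trained with r=32, which tells you to raise --max-lora-rank before starting the server.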
4. The adapter targets modules vLLM won't apply
This is the subtle Unsloth-specific trap. Unsloth's defaults sometimes include embed_tokens and lm_head in target_modules. vLLM's LoRA runtime only applies adapters to the standard projection layers (q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj) by default. If most of your fine-tuning signal lived in the embedding or LM-head layers, vLLM loads the adapter, applies the projection deltas, and quietly drops the embedding/head deltas — so the output barely shifts. This is the root cause reported in vLLM issue threads for Unsloth adapters (e.g. #41754).
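To see whether an adapter falls into this trap, compare its configured targets against the projection layers listed above. The sketch below assumes a standard PEFT adapter_config.json with `target_modules` and, optionally, `modules_to_save`; the function name is hypothetical, and the supported set is simply the list from this guide, not something queried from vLLM.

```python
import json
from pathlib import Path

# Projection layers this guide lists as applied by vLLM's LoRA runtime by default.
SUPPORTED = {"q_proj", "k_proj", "v_proj", "o_proj",
             "gate_proj", "up_proj", "down_proj"}


def unsupported_targets(adapter_dir: str) -> list[str]:
    """List adapter targets that fall outside the standard projection layers."""
    config = json.loads((Path(adapter_dir) / "adapter_config.json").read_text())
    targets = config.get("target_modules") or []
    if isinstance(targets, str):
        # PEFT also allows a single string pattern here; it cannot be matched
        # by name, so it is reported as-is for manual review.
        targets = [targets]
    extra = config.get("modules_to_save") or []
    names = list(targets) + list(extra)
    # Compare on the last dotted component so fully qualified names also match.
    return sorted({n for n in names if n.split(".")[-1] not in SUPPORTED})
```

A non-empty result such as `["embed_tokens", "lm_head"]` suggests the adapter is a candidate for the merge route in Fix 3.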
Fix 1: Serve the Adapter Correctly
Start vLLM with LoRA enabled and the adapter registered under an explicit name:
vllm serve google/gemma-4-26b \
  --enable-lora \
  --lora-modules my-gemma-lora=/path/to/lora_adapter \
  --max-lora-rank 64 \
  --max-loras 4 \
  --host 0.0.0.0 \
  --port 8000

Note --max-lora-rank 64: set it to whatever rank you trained with. The adapter directory must contain adapter_config.json and adapter_model.safetensors; if you only see merged model shards there, you exported the wrong thing (see Fix 3).
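The directory check can be done before launch so a wrong export is caught early. A small sketch, assuming only the two file names mentioned above; the helper name is hypothetical.

```python
from pathlib import Path

# Files this guide says a LoRA adapter directory must contain.
REQUIRED_FILES = ("adapter_config.json", "adapter_model.safetensors")


def missing_adapter_files(adapter_dir: str) -> list[str]:
    """Return the required adapter files that are not present in adapter_dir."""
    root = Path(adapter_dir)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]
```

An empty list means the directory looks like an adapter export; a non-empty list names what is missing before you pass the path to --lora-modules.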
Fix 2: Reference the Adapter in Every Request
Once registered, the adapter name appears in the model list. Verify it:
curl http://localhost:8000/v1/models
# You should see BOTH "google/gemma-4-26b" and "my-gemma-lora"

Then send the adapter name as the model, not the base:
r = client.chat.completions.create(
    model="my-gemma-lora",  # NOT google/gemma-4-26b
    messages=[{"role": "user", "content": "..."}],
)

If my-gemma-lora isn't in the /v1/models list, the server never registered it. Go back to Fix 1 and check your --lora-modules path.
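The same registration check can be scripted instead of eyeballing curl output. This sketch assumes the OpenAI-compatible response shape, a JSON object whose `data` list holds one entry per model with an `id` field; the function names and the default URL are illustrative.

```python
import json
from urllib.request import urlopen


def fetch_models(base_url: str = "http://localhost:8000/v1") -> dict:
    """GET /models from an OpenAI-compatible server and parse the JSON body."""
    with urlopen(f"{base_url}/models", timeout=10) as resp:
        return json.load(resp)


def served_model_ids(models_response: dict) -> set[str]:
    """Collect the model ids from a /v1/models response."""
    return {entry["id"] for entry in models_response.get("data", [])}


def adapter_is_registered(models_response: dict, adapter_name: str) -> bool:
    """True when the adapter name is listed alongside the base model."""
    return adapter_name in served_model_ids(models_response)
```

With the server running, `adapter_is_registered(fetch_models(), "my-gemma-lora")` should return True; False points back at the --lora-modules argument.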
Fix 3: The merge_and_unload Alternative
When the adapter targets embedding or LM-head layers, or when you simply want to stop fighting vLLM's runtime LoRA limits, merge the adapter into the base weights before serving. A merged model has no LoRA at runtime at all — every layer's delta is baked in, so nothing can be ignored:
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained(
    "google/gemma-4-26b", torch_dtype="bfloat16", device_map="cpu"
)
model = PeftModel.from_pretrained(base, "/path/to/lora_adapter")

# Bake the adapter into the base weights
merged = model.merge_and_unload()
merged.save_pretrained("/path/to/gemma4-merged")
AutoTokenizer.from_pretrained("google/gemma-4-26b").save_pretrained(
    "/path/to/gemma4-merged"
)

If you trained with Unsloth, you can merge in one step instead:
model.save_pretrained_merged(
    "/path/to/gemma4-merged", tokenizer, save_method="merged_16bit"
)

Then serve the merged directory as an ordinary model, with no --enable-lora and no per-request adapter name:
vllm serve /path/to/gemma4-merged \
  --host 0.0.0.0 --port 8000 --dtype bfloat16

The trade-off: a merged model costs full base-model VRAM and you lose the ability to hot-swap adapters. But for a single fine-tune in production it's the most reliable path, and it makes the "adapter ignored" failure mode impossible. Check the hardware guide to confirm you have the VRAM headroom for a full merged copy.
Verify the Adapter Is Actually Active
After applying any fix, confirm it stuck — don't trust "it started without errors":
- Check the startup logs. vLLM logs a line like Loaded LoRA adapter my-gemma-lora when registration succeeds. No such line means no adapter.
- Diff the outputs. Re-run the two-model comparison from the top of this guide. Different output for the adapter name means success.
- Use a canary fact. Fine-tune a tiny, unmistakable behavior (a fixed phrase, a format quirk) so you can tell base from adapter at a glance with temperature=0.
If you containerize the server, bake these checks into your Docker healthcheck so a misconfigured adapter fails the deploy instead of silently shipping base weights.
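One way to script such a check is sketched below. It is an illustration under assumptions, not a drop-in healthcheck: the function names are hypothetical, the endpoint is assumed to be vLLM's OpenAI-compatible /v1/chat/completions, and the canary string is whatever unmistakable behavior you trained in.

```python
import json
from typing import Callable, Optional
from urllib.request import Request, urlopen


def ask_vllm(model: str, prompt: str,
             base_url: str = "http://localhost:8000/v1") -> str:
    """Send one deterministic chat request and return the reply text."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0,
    }).encode()
    req = Request(f"{base_url}/chat/completions", data=body,
                  headers={"Content-Type": "application/json"})
    with urlopen(req, timeout=60) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]


def adapter_is_active(ask: Callable[[str, str], str], base: str, adapter: str,
                      prompt: str, canary: Optional[str] = None) -> bool:
    """Compare base and adapter replies to the same prompt."""
    base_reply = ask(base, prompt)
    adapter_reply = ask(adapter, prompt)
    if canary is not None:
        # Strong check: the trained-in phrase appears only in the adapter reply.
        return canary in adapter_reply and canary not in base_reply
    # Weak check: any difference at temperature 0 counts as evidence.
    return adapter_reply != base_reply
```

A healthcheck script could call `adapter_is_active(ask_vllm, "google/gemma-4-26b", "my-gemma-lora", prompt, canary)` and exit non-zero on False. Passing `ask` as a parameter keeps the comparison logic testable without a running server.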
FAQ
Why does vLLM load the adapter but ignore it at inference?
Because loading and applying are separate steps. vLLM happily registers the adapter, but it's only applied to requests that name the adapter as their model, and only to layers it supports. A request that names the base model gets base weights, and deltas on unsupported layers are dropped; neither case raises an error.
Does the rank really need to match?
--max-lora-rank must be greater than or equal to your adapter's rank. Setting it too low is one of the few cases that does throw — but some vLLM versions clamp instead, which produces a muted, partially-ignored adapter.
Should I always just merge?
If you serve one fine-tune and have the VRAM, merging via merge_and_unload is the simplest, most foolproof option. Keep runtime LoRA when you need to hot-swap many adapters on one base model.
My adapter works in Transformers but not vLLM — why?
Transformers applies LoRA to every target module including embed_tokens and lm_head. vLLM's default runtime doesn't, so adapters that lean on those layers diverge. Merging removes the discrepancy.
Next Steps
- Review the full vLLM production deploy guide for serving and scaling settings.
- See the fine-tuning guide to keep your target_modules vLLM-friendly from the start.
- Hit other deployment snags? The Gemma 4 troubleshooting guide covers OOM, slow startup, and quantization issues.