DGX Spark · part 5
[vLLM] FP8 KV Cache on GB10: Why Outputs Collapse into Repetition Loops
Preface
Optimization has prerequisites. When they aren't met, you don't get a worse version of the optimization; you get something that looks like it worked and then quietly fails at output token 500.
This is about FP8 KV cache on GB10, what the failure looks like, and why the optimization was wrong to apply here in the first place.
The Symptom
The vLLM serve script had two flags added for memory efficiency:
--kv-cache-dtype fp8 --calculate-kv-scales
Startup looked normal. First responses looked normal. Then, around 500 tokens into a longer output, the model started repeating. Not subtle drift — hard repetition:
The model continues to analyze the situation, analyzing the situation continues to analyze...
Temperature adjustments had no effect. repetition_penalty had no effect. The loop always won once it started.
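One quick way to tell this hard repetition apart from ordinary sampling drift is to count repeated n-grams in the output. A minimal sketch (the function name and thresholds are mine, not part of vLLM):

```python
from collections import Counter

def is_hard_repetition(text: str, n: int = 4, threshold: int = 3) -> bool:
    """True if any n-word sequence occurs at least `threshold` times."""
    words = text.split()
    grams = Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    return max(grams.values(), default=0) >= threshold
```

Healthy long answers do occasionally repeat a 4-gram, so tune n and threshold against your own outputs before wiring this into anything automated.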
The vLLM startup log had a warning that was easy to miss:
W Calculating KV cache scales for FP8 activation, but no calibration
data found. Using default scale q_scale=1.0
Root Cause
FP8 quantization for KV cache requires per-layer scale factors — values that map the float16/bfloat16 distribution of each layer's activations to the FP8 range without losing precision. These scale factors come from a calibration run against a representative dataset.
Without calibration data, vLLM falls back to q_scale=1.0. This is a uniform scale that makes no assumptions about the actual activation distribution. For short outputs, the approximation is acceptable. For long sequences, the accumulated quantization error compounds. Around 500 tokens, precision degrades enough that the model's logits become unreliable and the output collapses into repetition.
--calculate-kv-scales exists for the case where you have calibration data and want vLLM to load and apply it. Without --kv-cache-scales-path pointing to actual scale factors, the flag is a no-op with a warning.
The Fix
Remove both flags:
# BEFORE — causes repetition after ~500 tokens
vllm serve /models/qwen35 \
--kv-cache-dtype fp8 \
--calculate-kv-scales \
...
# AFTER — BF16 KV cache, no calibration needed
vllm serve /models/qwen35 \
...
Without these flags, vLLM uses BF16 for KV cache. Outputs are stable at any length.
Why This Optimization Didn't Apply Here
FP8 KV cache trades memory for precision. The tradeoff is worth it when:
- VRAM is the limiting constraint
- Calibration data is available to set correct scale factors
Neither was true on GB10.
GB10 has 128 GB of unified memory. Running Qwen3.5-35B with BF16 KV cache at 200K context and 90% GPU utilization leaves approximately 63 GiB available for KV cache. At that scale, memory isn't the constraint — there's room for hundreds of thousands of tokens without quantization.
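The arithmetic is worth doing explicitly. A back-of-the-envelope sketch; the layer/head/dim numbers below are illustrative, not Qwen3.5-35B's actual config:

```python
# Illustrative transformer dimensions -- substitute values from your model's config.json
n_layers, n_kv_heads, head_dim = 64, 8, 128
dtype_bytes = 2  # BF16

# K and V tensors, per layer, per token
kv_bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * dtype_bytes
total_gib = 200_000 * kv_bytes_per_token / 2**30

print(f"{kv_bytes_per_token // 1024} KiB/token, {total_gib:.1f} GiB at 200K context")
# -> 256 KiB/token, 48.8 GiB at 200K context
```

Under those assumed dimensions, BF16 KV cache at full 200K context needs under 49 GiB, comfortably inside the ~63 GiB budget. That is why halving it with FP8 buys nothing here.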
The calibration dataset issue isn't a configuration problem to work around. Generating a calibration dataset requires running the model on representative prompts and recording per-layer activation statistics. It's a separate offline step, not a flag to set at serve time.
The Broader Pattern
This is an instance of a broader pattern: a technically valid optimization that silently degrades when its prerequisites aren't met. It appears in other places in the vLLM / local LLM stack:
--reasoning-parser qwen3: Routes <think>...</think> output to the reasoning field, keeping content clean. Works correctly when the model reliably exits thinking. When the model produces only thinking tokens (no final answer), content stays null and the client gets nothing.
--enforce-eager: Disables CUDAGraph for debugging. Silently halves decode speed when left in a production serve script. No error, no warning.
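The reasoning-parser case can at least be guarded on the client side. A minimal sketch, assuming an OpenAI-style message dict where vLLM's parser places thinking in a reasoning_content field (verify the field name against your vLLM version's response shape):

```python
def extract_answer(message: dict) -> str:
    """Prefer the final answer; fall back to reasoning if content is null."""
    content = message.get("content")
    if content:
        return content
    # Model never exited thinking: surface the reasoning instead of nothing
    return message.get("reasoning_content") or ""
```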
The common thread: the flag is accepted, the server starts, early outputs look fine, failure appears later. The startup log is the only signal — and only if you know what to look for.
For FP8 KV cache, the signal is:
Using default scale q_scale=1.0
If that line appears, the optimization is not active in a meaningful sense. Either provide calibration data or remove the flags.
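A cheap guard is to grep the startup log before trusting the server. A sketch, assuming the serve script's stdout is captured to a file (the path is hypothetical; the demo writes a sample warning so the snippet is self-contained):

```shell
# Hypothetical log location -- point at wherever your serve script captures stdout
LOG=vllm-startup.log

# Simulate the warning for this demo; in practice the file already exists
printf 'W Calculating KV cache scales for FP8 activation, but no calibration\ndata found. Using default scale q_scale=1.0\n' > "$LOG"

if grep -q 'q_scale=1.0' "$LOG"; then
  echo "FP8 KV cache is uncalibrated: remove the flags or provide scales"
fi
```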
What Was Gained
What cost the most time: The failure mode looks identical to other causes of repetition loops — temperature, top_p, model quality, prompt format. The standard debugging path (adjusting sampling parameters, trying different prompts) doesn't help, and it consumes time before anyone examines the startup log. The log had the answer from the start.
Transferable diagnostics:
- Repetition loop that starts at a consistent token count (not immediately) → precision degradation, not a sampling parameter issue. Check KV cache dtype and whether quantization is calibrated.
- q_scale=1.0 in startup log → FP8 KV cache is running without calibration. Remove --kv-cache-dtype fp8 or provide --kv-cache-scales-path.
- On GB10 with 128 GB of unified memory, BF16 KV cache is almost always the right default. FP8 KV cache matters on GPUs where VRAM is actually scarce, not here.
The pattern that applies everywhere: Read the startup log before debugging model behavior. Repetition, garbage output, and degraded quality are almost always diagnosable from server initialization — not from the outputs themselves.
Checklist
Before enabling FP8 KV cache:
- Confirm VRAM is actually the limiting constraint. If not, use BF16.
- Check the startup log for q_scale=1.0. If present, calibration data is missing.
- Generate calibration data offline and pass it via --kv-cache-scales-path before enabling FP8 KV cache in production.
- Test at long outputs (1000+ tokens). Repetition from FP8 precision loss doesn't appear immediately.
Also in this series: DGX Spark: Why Your Model Outputs !!!!! · vLLM + Qwen3.5 on DGX Spark