
OpenClaw · part 2

[vLLM] Don't Add --enable-chunked-prefill to SSM Models

2026-03-06 · 6 min read · #vllm #ssm #qwen #dgx-spark

Preface

A conveyor belt makes assembly faster — unless the thing you're assembling requires each station to hand a physical object to the next one in sequence. Then you've turned a line into a bottleneck.

That's what --enable-chunked-prefill does to an SSM model. The flag is a documented throughput optimization for Transformer models. For hybrid SSM+MoE architectures like Qwen3.5-35B-A3B, it is the opposite of an optimization. This post covers the OpenClaw agent stack running Qwen3.5-35B on vLLM — specifically the one configuration mistake that cut throughput by 8x before I understood what had happened.


The Incident

The vLLM migration for Qwen3.5-35B was already done. The model was running at ~47 tok/s. I was trying to squeeze out more throughput for concurrent agent workloads and went looking for flags to tune.

--enable-chunked-prefill is listed in vLLM's documentation as a way to improve GPU utilization during prefill by interleaving prefill and decode operations. It sounds like a safe thing to try. I added it.

Throughput dropped from 47 tok/s to 5.7 tok/s.

That is an 8.2x regression. Not a margin of error. Not measurement noise. A complete collapse of useful throughput, from comfortably interactive to slower than CPU inference.

The flag didn't crash vLLM. It didn't produce an error. The server started, accepted requests, and returned responses — just at 5.7 tok/s. If you're not measuring throughput, you might not notice until you have actual users (or in this case, an agent) waiting on responses.


Root Cause

Qwen3.5-35B-A3B is an SSM+MoE hybrid architecture. The "A3B" suffix follows the Qwen naming convention for active parameter count — roughly 3B parameters active per token across the mixture-of-experts layers. The hybrid part comes from DeltaNet, a state space model used in place of standard attention in many of the layers. This is not a standard Transformer.

Here's the difference that matters:

Transformer attention is parallel. Every token in the sequence can be processed in any order — attention is computed over the full sequence as a matrix operation. Chunking this into segments is fine because the attention computation doesn't care about order; it just needs all the tokens to be present.

SSM (State Space Model) layers maintain a recurrent hidden state h_t. Each step's state depends on the previous step's state:

h_t = f(h_{t-1}, x_t)

This is recurrence. It is inherently sequential. You cannot process token 100 without first having processed tokens 1 through 99, because h_100 depends on h_99, which depends on h_98, and so on through the entire sequence.

Chunked prefill splits the prompt into segments and processes them in interleaved batches. For a pure Transformer, this is fine — segments don't depend on each other's intermediate states. For SSM layers, each chunk needs to receive the recurrent hidden state from the end of the previous chunk, then produce a new state to hand to the next chunk. This inter-chunk state passing isn't free — it happens at each segment boundary, and the overhead compounds as sequence length grows.

On a long sequence with many chunk boundaries, the cumulative overhead of passing hidden state across every boundary overwhelms any parallelism gains. The throughput floor is reached quickly. In the measured case: 5.7 tok/s from a machine capable of 47 tok/s.
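The dependency shape is easy to see in a toy scan. This is an illustrative sketch only — a scalar linear SSM, far simpler than DeltaNet — but the constraint is the same: a chunked prefill over an SSM only produces correct states if every chunk receives the final hidden state of the chunk before it.

```python
# Toy linear SSM: h_t = a*h_{t-1} + b*x_t. Illustrative only; real
# DeltaNet layers are more complex, but the dependency shape is identical.

def ssm_scan(xs, h0=0.0, a=0.9, b=1.0):
    """Sequential scan: each state depends on the previous one."""
    h = h0
    states = []
    for x in xs:
        h = a * h + b * x          # h_t = f(h_{t-1}, x_t)
        states.append(h)
    return states

def chunked_scan(xs, chunk_size, a=0.9, b=1.0):
    """Chunked prefill over an SSM: each chunk must wait for the final
    hidden state of the previous chunk -- the inter-chunk handoff that
    serializes the whole prefill at every segment boundary."""
    h = 0.0
    states = []
    for i in range(0, len(xs), chunk_size):
        chunk_states = ssm_scan(xs[i:i + chunk_size], h0=h, a=a, b=b)
        h = chunk_states[-1]       # state handoff at the chunk boundary
        states.extend(chunk_states)
    return states

xs = [0.1 * i for i in range(32)]
assert chunked_scan(xs, 8) == ssm_scan(xs)  # correct only with the handoff
```

Drop the handoff (start every chunk from `h0=0.0`) and the states diverge immediately — which is exactly why the scheduler cannot launch SSM chunks independently the way it can launch attention chunks.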


Which Models Are Affected

Any model with recurrent layers (SSM, Mamba, DeltaNet, Mamba2, RWKV, or similar) is affected. The key identifier is the model architecture, not the name:

  • Qwen3.5-35B-A3B — SSM+MoE hybrid (DeltaNet layers) — affected
  • qwen3-coder-next 79.7B — SSM+MoE hybrid — affected
  • GLM-4.7-Flash — pure MoE, standard attention — safe to use chunked prefill
  • Standard Llama/Qwen/Mistral — pure Transformer — safe to use chunked prefill

If you're unsure, check the model card or config.json. Look for any of: mamba, ssm, deltanet, state_space, or recurrent in the architecture description. If those words appear, do not add --enable-chunked-prefill.
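That check can be scripted. A minimal sketch — the path in the example is hypothetical, and the keyword list simply mirrors the markers above:

```python
# Scan a model's config.json for recurrent-layer markers before touching
# any vLLM prefill flags.
import json
import re

# Keywords that indicate SSM / recurrent layers (Mamba, DeltaNet, RWKV, ...)
RECURRENT = re.compile(r"mamba|ssm|deltanet|state_space|recurrent|rwkv", re.I)

def has_recurrent_layers(config_path):
    """True if anything in config.json mentions a recurrent architecture.
    If this returns True, do not add --enable-chunked-prefill."""
    with open(config_path) as f:
        cfg = json.load(f)
    # Serialize back to text so both keys and values get scanned.
    return bool(RECURRENT.search(json.dumps(cfg)))

# Example (path is hypothetical):
# has_recurrent_layers("/models/qwen35/config.json")
```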


The Fix

Remove the flag. That's it. The working serve command for Qwen3.5-35B-A3B-FP8 does not include --enable-chunked-prefill:

docker run -d --name qwen35 --restart unless-stopped \
  --gpus all --ipc host --shm-size 64gb -p 8000:8000 \
  -v /home/coolthor/models/qwen35-35b-hf:/models/qwen35 \
  vllm/vllm-openai:cu130-nightly \
  --model /models/qwen35 \
  --served-model-name qwen3.5-35b \
  --max-model-len 200000 \
  --gpu-memory-utilization 0.90 \
  --kv-cache-dtype fp8 \
  --calculate-kv-scales \
  --max-num-batched-tokens 4096 \
  --enable-prefix-caching \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder
  # NOTE: --enable-chunked-prefill is absent

Also verify that vLLM isn't enabling it automatically. Some vLLM versions enable chunked prefill by default for certain configurations. Check startup logs:

docker logs qwen35 2>&1 | grep -i "chunked"

You want to see either no output, or Chunked prefill: disabled. If you see Chunked prefill: enabled and you didn't put it there, pass the flag explicitly set to off (--no-enable-chunked-prefill, or --enable-chunked-prefill=False, depending on your vLLM version) to force it disabled.


What Was Gained

The diagnostic pattern here is transferable: throughput collapses without errors, investigate flags before the model.

When vLLM's throughput is catastrophically low but the server appears healthy, the first suspects are configuration flags — not the model, not the hardware, not CUDA. The model is doing what you told it to do. The question is what you told it to do.

Checklist for "why is my SSM model slow on vLLM":

  1. Check if --enable-chunked-prefill is in the serve command (or enabled in logs)
  2. Check if Ollama has models loaded in shared GPU memory (see Ollama KEEP_ALIVE conflict)
  3. Check --max-num-batched-tokens against the SSM block size requirement
  4. Check whether enable_thinking is actually disabled per-request (thinking tokens consume throughput that never reaches the visible output)

The 8x regression from a single flag is the kind of thing that doesn't make the documentation because it's a flag interaction with architecture type rather than a bug. The flag works exactly as designed — it's just designed for a different class of model.


Conclusion

Before adding any vLLM throughput flags, look up the model's architecture type. If it has any recurrent state (SSM, Mamba, DeltaNet, RWKV), treat --enable-chunked-prefill as off-limits. The flag will not produce an error. It will not warn you. It will simply reduce throughput by one order of magnitude and leave the server running, quietly, at 5.7 tok/s.

The only way to catch it is to measure before and after any flag change.
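A minimal way to do that measurement, sketched against a vLLM OpenAI-compatible endpoint. The URL, model name, and prompt are assumptions drawn from this post's setup — substitute your own, and run it once before and once after any flag change:

```python
# Minimal decode-throughput check against a vLLM OpenAI-compatible server.
# URL, model name, and prompt are assumptions -- substitute your own.
import json
import time
import urllib.request

def tok_per_s(completion_tokens, elapsed_s):
    """Tokens per second from a completion's usage stats and wall time."""
    return completion_tokens / elapsed_s

def measure(url="http://localhost:8000/v1/completions",
            model="qwen3.5-35b",
            prompt="Count from one to one hundred, one number per line.",
            max_tokens=256):
    """Send one completion request and report decode throughput."""
    body = json.dumps({"model": model, "prompt": prompt,
                       "max_tokens": max_tokens}).encode()
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"})
    start = time.monotonic()
    with urllib.request.urlopen(req) as resp:
        usage = json.load(resp)["usage"]
    return tok_per_s(usage["completion_tokens"], time.monotonic() - start)

if __name__ == "__main__":
    print(f"{measure():.1f} tok/s")
```

One request is a coarse measurement (it folds prefill time into the denominator), but it is more than enough to catch a 47 → 5.7 tok/s cliff.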


Also in this series: Migrating Qwen3.5 from Ollama to vLLM on DGX Spark · Ollama KEEP_ALIVE Is Silently Eating Your vLLM Headroom · Pure MoE vs SSM Hybrid: Context Decay and Why It Matters for Agents