OpenClaw · part 3
[vLLM] Ollama's KEEP_ALIVE Is Silently Eating Your vLLM Headroom
Preface
A parking lot with 128 spaces sounds like it has plenty of room. It doesn't, if 40 of those spaces are occupied by cars that finished their errands two hours ago and nobody told them they could leave.
That's what Ollama's KEEP_ALIVE does to GPU memory on a unified-memory machine. This came up repeatedly when restarting vLLM for the OpenClaw agent stack, and took longer to diagnose than it should have because the failure mode — OOM on a 128GB machine — seems impossible until you understand what's actually in memory. A full benchmark of what was running on this machine before vLLM is at 8 Models on DGX Spark.
The Incident
Every vLLM container restart on the GX10 had about a 50% chance of failing with an OOM error during model load. The machine has 128GB of unified memory. Qwen3.5-35B-A3B-FP8 needs roughly 35GB for weights. With --gpu-memory-utilization 0.90 and fp8 KV cache, vLLM wants about 115GB total at startup.
128GB minus 35GB is 93GB. There should be room. But vLLM was OOMing.
The diagnosis command:
```bash
curl -s http://localhost:11434/api/ps
```
This returns what Ollama currently has loaded in memory:
```json
{
  "models": [
    {
      "name": "glm-4.7-flash:latest",
      "size": 19456000000,
      "size_vram": 19456000000,
      ...
    }
  ]
}
```
GLM-4.7-Flash, 19GB, sitting in memory. Nobody asked it to be there. It was there because KEEP_ALIVE=2h means Ollama keeps any model in memory for two hours after its last use, on the assumption that you'll probably ask it something again soon.
That assumption is wrong when vLLM is your primary serving path. Ollama sits idle, holding ~19-51GB depending on which model was last used, while vLLM tries to start up and runs into the difference between "128GB on paper" and "128GB minus whatever Ollama decided to park."
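The budget arithmetic is worth making explicit. A minimal sketch using the numbers from this incident (128GB pool, 0.90 utilization target, the 19GB GLM model parked by Ollama):

```python
# Memory budget on the 128GB unified-memory machine (numbers from this incident).
TOTAL_GB = 128
GPU_MEM_UTIL = 0.90       # vLLM's --gpu-memory-utilization
OLLAMA_PARKED_GB = 19     # GLM-4.7-Flash held resident by KEEP_ALIVE=2h

vllm_target = TOTAL_GB * GPU_MEM_UTIL      # ~115 GB: what vLLM tries to claim at startup
free_pool = TOTAL_GB - OLLAMA_PARKED_GB    # 109 GB: what is actually left

print(f"vLLM wants {vllm_target:.1f} GB, pool has {free_pool} GB free")
print("OOM" if vllm_target > free_pool else "fits")
```

With the 50GB model parked instead, the gap widens from ~6GB to ~37GB, which is why the failure was intermittent: it depended on which model Ollama had last served.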
Why KEEP_ALIVE=2h Exists
This is worth understanding before dismissing the default as stupid.
When Ollama is your primary model server, KEEP_ALIVE=2h makes sense. Model loading on this hardware takes 15-25 seconds depending on model size. If you're switching between models for different tasks — fast MoE model for chat, larger model for complex analysis — keeping the last model warm means your second request comes back instantly instead of waiting for a full load cycle.
The default was designed for interactive use where you make a request, get a response, think for a bit, and make another request. Two hours covers most working sessions.
It becomes a problem once you add vLLM to the same machine. Now you have two model servers competing for the same unified memory pool, and Ollama's "I'll keep this warm in case you need it" behavior starts conflicting with vLLM's "I need to allocate as much of this pool as I can" behavior.
The Manual Fix
When vLLM needs a restart and Ollama has something loaded, unload it first:
```bash
# 1. Check what Ollama has in memory
curl -s http://localhost:11434/api/ps

# 2. Unload specific model (replace with actual model name)
curl -X POST http://localhost:11434/api/generate \
  -d '{"model": "glm-4.7-flash:latest", "keep_alive": 0}'

# 3. Verify it's gone
curl -s http://localhost:11434/api/ps
# Should return: {"models": []}

# 4. Now restart vLLM
docker stop qwen35
docker start qwen35
docker logs -f qwen35
```
Setting keep_alive: 0 in the API call is an immediate eviction — not "set TTL to zero going forward," but "unload this model right now." This is the correct API for manually clearing Ollama's memory.
For models with larger footprints, the sequence is the same:
```bash
# For a 50GB model like qwen3-coder-next
curl -X POST http://localhost:11434/api/generate \
  -d '{"model": "qwen3-coder-next:latest", "keep_alive": 0}'
```
The Permanent Fix
If vLLM is your primary serving path and Ollama is a secondary/backup, set KEEP_ALIVE=0 globally:
```ini
# /etc/systemd/system/ollama.service.d/override.conf
[Service]
Environment="OLLAMA_FLASH_ATTENTION=1"
Environment="OLLAMA_KV_CACHE_TYPE=q8_0"
# Unload immediately after use (systemd ignores comments only on their own line)
Environment="OLLAMA_KEEP_ALIVE=0"
```
Apply the change:
```bash
sudo systemctl daemon-reload
sudo systemctl restart ollama
```
With KEEP_ALIVE=0, Ollama unloads a model as soon as the request completes. The memory is immediately available for vLLM. The tradeoff is that every Ollama request now incurs the full model load time (~20 seconds cold). For a workflow where Ollama is a backup that runs occasionally, this is acceptable.
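The global setting is only a default: Ollama also honors keep_alive per request, so a known burst of Ollama work can opt back into warmth without touching the service config. A sketch of the request body (the helper name is my own):

```python
import json

def generate_payload(model: str, prompt: str, keep_alive=0) -> bytes:
    """Ollama /api/generate body. keep_alive overrides the server default
    for this request only, e.g. "10m" keeps the model warm for a burst of
    work even when OLLAMA_KEEP_ALIVE=0 is set globally."""
    return json.dumps(
        {"model": model, "prompt": prompt, "keep_alive": keep_alive}
    ).encode()
```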
When to keep KEEP_ALIVE at 2h:
- Ollama is your only model server
- You frequently alternate between Ollama requests with short pauses
- No vLLM container competing for memory
When to set KEEP_ALIVE=0:
- vLLM is your primary serving path
- Ollama is used occasionally or as a fallback
- You need predictable memory availability for vLLM restarts
The nvidia-smi Dead End
One natural debugging instinct on a GPU machine is to run nvidia-smi and look at memory usage. On GB10, this doesn't work:
```bash
nvidia-smi --query-gpu=memory.used --format=csv
# Returns: N/A
```
GB10 uses unified memory — CPU and GPU share the same physical memory pool. nvidia-smi expects a dedicated VRAM allocation to query. There is no separate VRAM. The query returns N/A and you learn nothing.
The alternatives:
```bash
# What Ollama has loaded
curl -s http://localhost:11434/api/ps

# vLLM's KV cache allocation (once running)
curl -s http://localhost:8000/metrics | grep kv_cache

# System memory overview (shared pool)
free -h
```
free -h shows total system memory including what's been allocated to GPU operations, which on unified memory means everything. It's not as clean as nvidia-smi, but it's what's available.
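For scripting a pre-restart headroom check, the closest thing to a "GPU memory free" number on a unified-memory box is MemAvailable from /proc/meminfo, which is what free reads. A minimal parsing sketch (Linux-only; the ~115GB comparison target is this machine's vLLM startup budget from above):

```python
def mem_available_gb(meminfo_text: str) -> float:
    """Extract MemAvailable (reported in kB) from /proc/meminfo text,
    converted to GB. On unified memory this is the whole shared pool."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            kb = int(line.split()[1])
            return kb / 1024 / 1024
    raise ValueError("MemAvailable not found")

# Usage on the box:
#   avail = mem_available_gb(open("/proc/meminfo").read())
#   vLLM's startup target here is ~115 GB (0.90 * 128), so compare against that
#   before restarting the container.
```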
What Was Gained
The diagnostic pattern: when vLLM OOMs on a machine with ostensibly enough memory, check what else is holding memory before assuming the configuration is wrong. The configuration might be fine. Something else might be squatting in the pool.
On unified memory machines, every process that allocates memory — Ollama, Docker containers, system processes — pulls from the same pool. There's no hardware boundary that protects vLLM's allocation from other processes.
The operational consequence: before any vLLM restart, run curl -s http://localhost:11434/api/ps as a habit. It takes one second. A vLLM OOM on startup takes 2-3 minutes to diagnose, because the failure surfaces partway through model load rather than immediately at launch, so you watch the logs for two minutes before the error appears.
Conclusion
If you're running Ollama and vLLM on the same machine, KEEP_ALIVE=2h will eventually bite you. It's not a bug — it's a design choice optimized for single-server interactive use that conflicts with co-hosted multi-server deployment. Check what Ollama has loaded before any vLLM restart. If you're committed to vLLM as your primary backend, set KEEP_ALIVE=0 and accept the cold load latency for Ollama's occasional use.
The symptom is OOM on a machine that should have plenty of memory. The cause is always something else already in that memory. On this hardware, that something is usually Ollama.
Also in this series: 8 Models on DGX Spark: Finding the Best Stack for AI Agents · Don't Add --enable-chunked-prefill to SSM Models · Pure MoE vs SSM Hybrid: Context Decay and Why It Matters for Agents