
NemoClaw · part 3

[AI Agent] NemoClaw Without the Cloud: Swapping Nemotron for a Local Ollama Model

2026-03-24 · 6 min read · #nemoclaw #openclaw #openshell #ollama

Running an agent through cloud inference is convenient until the 30-day free tier runs out. Then it's a decision: pay for NVIDIA's API, or point the stack at the hardware already sitting on the desk.

This continues from Part 2, which covered getting NemoClaw installed and the my-assistant sandbox into Phase: Ready. At that point, the agent was routing all inference to integrate.api.nvidia.com via the nvidia/nemotron-3-super-120b-a12b model. This article covers replacing that with a local Ollama endpoint running on the same GX10.

TL;DR

NemoClaw's inference backend is OpenAI-compatible and configurable per-sandbox via ~/.nemoclaw/sandboxes/<name>/config.yaml. Pointing it at a local Ollama endpoint (http://localhost:11434/v1) requires three changes: endpoint URL, model name, and API key (set to "ollama"). OpenShell's filesystem and tool policy enforcement is independent of the inference backend — it applies regardless of which model is running.

Why Swap the Backend

Three reasons, in order of how often they come up:

Cost. The free NVIDIA API tier lasts 30 days. After that, running inference through integrate.api.nvidia.com requires a paid account. The GX10 has 128GB unified memory and a 273 GB/s memory bus — using it as a proxy to a cloud API is a waste.

Privacy. Every prompt sent to Nemotron cloud leaves the machine. For a personal agent with filesystem access and memory, that's a meaningful surface area.

Model choice. The default nvidia/nemotron-3-super-120b-a12b is capable but opinionated — tuned for helpfulness and content safety in ways that can get in the way of certain use cases. A local model removes that constraint. Which brings up a point worth addressing directly.

The Uncensored Model Question

The model being loaded here is HauhauCS/Qwen3.5-35B-A3B-Uncensored-HauhauCS-Aggressive — a fine-tune of Qwen3.5-35B-A3B with the standard safety filters removed.

The natural question: if the model has no content restrictions, what's stopping the agent from doing whatever it's asked?

OpenShell. And this is the important architectural point that gets lost when people treat NemoClaw as just a chatbot wrapper.

OpenShell enforces policy at the tool execution layer, not the language model layer. When the agent decides to call a tool — read a file, execute a command, make a network request — OpenShell intercepts that call before it executes and checks it against the policy configuration. The model's content preferences are irrelevant at this point. The sandbox allows access to /sandbox and /tmp, and only the tools that appear in the allowlist. Everything else is blocked at the gateway, before the tool runs.

Cloud inference adds a second layer (the model's own content refusals). Local uncensored inference removes that layer. But the OpenShell layer — the one that actually controls what the agent does to the filesystem and network — stays in place either way.

This means: uncensored model + OpenShell sandbox is a coherent configuration. The model can reason and respond without content restrictions; the agent's actions are still bounded by what the sandbox permits. The two concerns are orthogonal.
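The gateway-side check is easy to illustrate. The sketch below is not OpenShell's real code — it's a minimal stand-in for the pattern described above: resolve the path a tool call wants to touch, then allow it only if it falls under a permitted root (/sandbox or /tmp, per the sandbox defaults mentioned earlier), regardless of what the model asked for.

```shell
# Conceptual sketch of a path-policy check at the tool layer (not OpenShell's
# actual implementation). Allowed roots mirror the sandbox defaults above.
check_path() {
  # readlink -m canonicalizes the path (resolving ..) without requiring it to exist
  target=$(readlink -m -- "$1")
  for root in /sandbox /tmp; do
    case "$target" in
      "$root"|"$root"/*) echo "ALLOW $target"; return ;;
    esac
  done
  echo "DENY $target"
}

check_path /tmp/scratch.txt        # ALLOW /tmp/scratch.txt
check_path /etc/passwd             # DENY /etc/passwd
check_path /sandbox/../etc/shadow  # DENY /etc/shadow — traversal is resolved first
```

The last case is the important one: the check runs on the resolved path, so a path that merely starts with an allowed prefix can't escape via `..`.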

Finding the Config

After nemoclaw onboard completes, the sandbox configuration lives under:

~/.nemoclaw/sandboxes/<sandbox-name>/config.yaml

For the default sandbox name from non-interactive onboard:

cat ~/.nemoclaw/sandboxes/my-assistant/config.yaml

The relevant section:

inference:
  endpoint: "https://integrate.api.nvidia.com/v1"
  model: "nvidia/nemotron-3-super-120b-a12b"
  api_key: "nvapi-..."

This is the only place that references the NVIDIA cloud. The OpenShell gateway config (~/.openshell/gateways/nemoclaw/) and the sandbox policy files don't touch inference routing.

Switching to Ollama

Ollama exposes an OpenAI-compatible API at http://localhost:11434/v1. The model name to use is whatever Ollama calls it after ollama pull or ollama run completes.

Check what's loaded:

ollama list
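The first column of that table is the exact model tag the config needs, copied verbatim including the `:latest` suffix. The sample output below is illustrative (the ID and size are made up), but the awk one-liner works on the real table the same way:

```shell
# ollama list prints a header row, then one row per model; field 1 is the tag.
# Sample output for illustration only — run `ollama list` for the real thing.
sample_output='NAME                                                             ID              SIZE     MODIFIED
hauhaucs/qwen3.5-35b-a3b-uncensored-hauhaucs-aggressive:latest   abc123def456    21 GB    2 hours ago'

printf '%s\n' "$sample_output" | awk 'NR > 1 { print $1 }'
```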

Then update the sandbox config:

# Back up first
cp ~/.nemoclaw/sandboxes/my-assistant/config.yaml \
   ~/.nemoclaw/sandboxes/my-assistant/config.yaml.nemotron-backup

Edit ~/.nemoclaw/sandboxes/my-assistant/config.yaml:

inference:
  endpoint: "http://localhost:11434/v1"
  model: "hauhaucs/qwen3.5-35b-a3b-uncensored-hauhaucs-aggressive:latest"
  api_key: "ollama"

The api_key field is required by the config schema, but Ollama ignores its value. Setting it to "ollama" satisfies the schema while making it obvious the key is a placeholder.
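The three-field swap can also be scripted. The demo below runs against a sample copy created in a temp file, and the sed patterns assume the exact key layout shown above (each field quoted, on its own line); for real configs a YAML-aware tool is safer than sed.

```shell
# Demo of the three-field swap with sed, on a throwaway copy of the config.
cfg=$(mktemp)
cat > "$cfg" <<'EOF'
inference:
  endpoint: "https://integrate.api.nvidia.com/v1"
  model: "nvidia/nemotron-3-super-120b-a12b"
  api_key: "nvapi-..."
EOF

# Rewrite each field in place; patterns assume the quoted single-line layout above.
sed -i \
  -e 's|endpoint: ".*"|endpoint: "http://localhost:11434/v1"|' \
  -e 's|model: ".*"|model: "hauhaucs/qwen3.5-35b-a3b-uncensored-hauhaucs-aggressive:latest"|' \
  -e 's|api_key: ".*"|api_key: "ollama"|' \
  "$cfg"

cat "$cfg"
```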

Restart the sandbox to pick up the change:

nemoclaw my-assistant restart

Verify the new backend is active:

nemoclaw my-assistant status
# Phase: Ready
# Model: hauhaucs/qwen3.5-35b-a3b-uncensored-hauhaucs-aggressive:latest

Testing the Swap

Connect to the sandbox and send a test message:

nemoclaw connect my-assistant

The first response will have higher latency than Nemotron cloud — Ollama needs to load the model into memory if it isn't already warm. Subsequent responses are served from the already-loaded model.

One behavioral difference worth noting: Qwen3.5-35B-A3B is a thinking model. By default, Ollama runs it with thinking enabled, which means the agent generates reasoning tokens before responding. This looks like a long pause before output starts. If the latency is unacceptable for interactive use, Ollama supports disabling thinking via the model's system prompt or via a think: false parameter in the request — but NemoClaw doesn't expose that parameter directly. The vLLM path handles this more cleanly.
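For reference, the think parameter lives in the request body of Ollama's native /api/chat endpoint, not the OpenAI-compatible /v1 route that NemoClaw talks to. A minimal request body with thinking disabled looks like this (model name matches the one pulled above):

```json
{
  "model": "hauhaucs/qwen3.5-35b-a3b-uncensored-hauhaucs-aggressive:latest",
  "messages": [{ "role": "user", "content": "ping" }],
  "think": false
}
```

Since NemoClaw only speaks the /v1 route, there's no clean way to pass this through — which is why the vLLM approach in the next section is worth considering when thinking latency matters.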

The vLLM Alternative

If Ollama's latency profile isn't acceptable, the vLLM container already running on port 8000 is a direct drop-in:

inference:
  endpoint: "http://localhost:8000/v1"
  model: "qwen3.5-35b"
  api_key: "none"

The model name qwen3.5-35b is the --served-model-name set when the container was started. vLLM's OpenAI-compatible server handles the rest identically.

vLLM also handles thinking mode more cleanly — the --reasoning-parser qwen3 flag started with the container tells vLLM to strip reasoning tokens from the response before it reaches the client. The agent sees clean output without internal monologue.

The tradeoff: Ollama is faster to set up, easier to swap models, and handles multiple models in memory simultaneously. vLLM gives lower time-to-first-token (TTFT) on warm requests (~0.12s vs Ollama's ~0.4s) and cleaner handling of structured output and tool calls.

For this setup — GX10, personal use, single model running at a time — either works. Ollama wins on convenience; vLLM wins on production behavior.

What Was Gained

What cost the most time: Locating the config file. NemoClaw's documentation doesn't mention where per-sandbox inference config lives. The ~/.nemoclaw/sandboxes/ directory structure isn't surfaced by any nemoclaw CLI command — there's no nemoclaw config show or equivalent in v0.1.0. Finding it required listing the directory after onboard.

Transferable diagnostic: When a tool's config path isn't documented, check ~/.toolname/ first. For CLI tools that follow XDG conventions, also check ~/.config/toolname/. ls -la ~/.nemoclaw/ after onboard makes the layout visible immediately.
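That diagnostic fits in a small shell function. The probe below just checks the two conventional locations named above; the demo runs it against a fake home directory under /tmp so the output is reproducible (the real invocation is simply `find_tool_config nemoclaw` in your actual home).

```shell
# Probe the two conventional config locations for a CLI tool.
find_tool_config() {
  for d in "$HOME/.$1" "$HOME/.config/$1"; do
    [ -d "$d" ] && echo "$d"
  done
}

# Demo against a fabricated home layout, so the output is deterministic.
export HOME=/tmp/demo-home
mkdir -p "$HOME/.nemoclaw/sandboxes/my-assistant"
find_tool_config nemoclaw   # prints /tmp/demo-home/.nemoclaw
```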

The pattern that applies everywhere: Content policy (what the model refuses to say) and sandbox policy (what the agent is permitted to do) operate at different layers. They can be configured independently. Treating them as the same concern is a category error that leads to either over-restricting the model or under-restricting the agent's capabilities — neither of which is the right tradeoff.

Swap Checklist

To point NemoClaw at a local inference backend:

  1. cat ~/.nemoclaw/sandboxes/my-assistant/config.yaml → find current inference config
  2. Back up the file
  3. Update endpoint, model, and api_key fields
  4. ollama list or check vLLM /v1/models to confirm model name
  5. nemoclaw my-assistant restart
  6. nemoclaw my-assistant status → verify model name updated
  7. nemoclaw connect my-assistant → send a test message

Also in this series: Part 1 — NemoClaw: What It Is, Why It Exists, and How It Works · Part 2 — Installing NemoClaw on a GX10 from Scratch