Two Qwen3 models on one DGX Spark: the residency math

devashish.me|58 points|28 comments|by devashish86|Jun 18, 2026

By Devashish | June 16, 2026

My current agent stack, powered by Hermes, operates on a split architecture. To ensure the workstation remains responsive, I offload the heavy GPU lifting to a DGX Spark, with the two communicating via an HTTP proxy.

As I've scaled my agent fleet using Clawrium, the number of Hermes instances has grown. This has shifted the workload from a simple "one-laptop, one-model" setup to a fleet of agents hammering a single backend—a traffic pattern that a standard single-model server simply cannot handle.

Architecture Overview

The Infrastructure Goal

For months, the Spark served models via ollama. However, ollama lacks granular control: there is no per-process memory budget and no gpu_memory_utilization toggle. This makes it nearly impossible to co-reside a "heavy" reasoning model alongside a "fast" model for quick interactions.

ollama sits on top of llama.cpp. vLLM, by contrast, uses PagedAttention to manage the KV cache in reclaimable blocks rather than pinning one contiguous region, and it exposes the per-process controls I needed. My target setup:

  • Hardware: One DGX Spark (GB10) with 119.67 GiB of unified memory.
  • Software: Multiple vLLM containers orchestrated by a LiteLLM proxy on port :4000.
  • Models:
    1. Qwen3-Next-80B-Instruct-FP8 (The "Heavy Lifter")
    2. Qwen3-4B-Instruct-2507 (The "Fast Responder")

The Trial and Error Process

Attempt 1: The gpu_memory_utilization Trap

I started by trusting the target configurations. For the 80B model, I set:

  • gpu_memory_utilization: 0.75
  • max_model_len: 65536
  • max_num_seqs: 4

The Result: vLLM crashed during KV cache initialization with: "No available memory for the cache blocks."

Because Qwen3-Next is a hybrid architecture that mixes linear-attention (Mamba-style state) layers with standard attention, the per-block page alignment increases KV pool demands. The ~14 GiB remaining after loading weights wasn't enough. When I tried to raise the target to 0.85, the free-memory check failed because the 4B model was already occupying ~16 GiB.

The Critical Realization: gpu_memory_utilization is a fraction of total GPU memory, not free memory.

To avoid OOMs (out-of-memory errors) or KV starvation, the sum of all co-resident vLLM utilization fractions must be ≤ 0.95 to account for CUDA framework overhead.
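The rule above is easy to check before launching anything. Here is a minimal sketch, assuming the 119.67 GiB total and the 0.95 ceiling stated in this post. The function names and dictionary keys are illustrative and are not part of vLLM.

```python
TOTAL_GIB = 119.67       # unified memory on the Spark, from the hardware list above
MAX_FRACTION_SUM = 0.95  # ceiling on summed fractions, as stated in this post


def budgets_gib(fractions, total_gib=TOTAL_GIB):
    """Turn each gpu_memory_utilization fraction into a GiB budget of TOTAL memory."""
    return {name: round(frac * total_gib, 2) for name, frac in fractions.items()}


def fits_together(fractions, limit=MAX_FRACTION_SUM):
    """True when the co-resident fractions stay at or under the ceiling."""
    return sum(fractions.values()) <= limit + 1e-9


# Hypothetical labels for the two containers; the fractions are from this post.
attempt_3 = {"qwen3-80b": 0.85, "qwen3-4b": 0.12}
final = {"qwen3-80b": 0.80, "qwen3-4b": 0.10}

if __name__ == "__main__":
    for label, cfg in (("attempt 3", attempt_3), ("final", final)):
        print(label, budgets_gib(cfg), "fits:", fits_together(cfg))
```

Under these assumptions, the 0.85 + 0.12 pairing exceeds the ceiling while 0.80 + 0.10 stays below it.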

Attempt 2: The Tool-Call Failure

Once the models were resident, Hermes encountered a functional bug: tool calls were returning as plain text. Both hermes_tool_parser.py and qwen3xml_tool_parser.py expect a singular <tool_call> tag, but the model was outputting reasoning inside <think> tags and concluding without the actual call.

I discovered that Qwen3-Next-80B-Thinking only supports thinking mode; enable_thinking: false is a structural no-op. This breaks any SDK relying on tool_choice: auto.

The Fix: I launched the model with the following flags: --enable-auto-tool-choice --tool-call-parser hermes (and specifically no --reasoning-parser).

This allowed three LiteLLM aliases (writer, reviewer, and sources) to successfully return finish_reason: tool_calls.
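A quick way to tell the two outcomes apart is to inspect the chat completion response itself. The sketch below assumes an OpenAI-style response dictionary, which is the shape LiteLLM and vLLM return from their chat endpoints. The function name and the sample payloads are hypothetical.

```python
def tool_call_status(response):
    """Classify a chat completion as a structured tool call, leaked text, or neither."""
    choice = response["choices"][0]
    message = choice.get("message", {})
    if choice.get("finish_reason") == "tool_calls" and message.get("tool_calls"):
        return "structured"
    content = message.get("content") or ""
    # Tags showing up in plain content mean the parser did not extract a call.
    if "<tool_call>" in content or "<think>" in content:
        return "leaked_as_text"
    return "no_tool_call"


# Hypothetical payloads for illustration only.
good = {
    "choices": [{
        "finish_reason": "tool_calls",
        "message": {
            "content": None,
            "tool_calls": [{
                "type": "function",
                "function": {"name": "search", "arguments": "{}"},
            }],
        },
    }]
}
bad = {
    "choices": [{
        "finish_reason": "stop",
        "message": {"content": "<think>I should call search</think> Done."},
    }]
}
```

Running this check against each alias after a config change catches a regression before an agent silently stops calling tools.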

Tooling Config

Attempt 3: The Stability Crash

The Reviewer agent required a 64k context window. This triggered a restart loop for the 4B model (19 times!). The 80B model, at 0.85 utilization, was actually occupying 101.5 GiB, leaving the 4B model with insufficient room to meet its own 0.12 target.

The Final Adjustment:

  • 80B Model: Toned down to gpu_memory_utilization: 0.80.
  • 4B Model: Dropped to 0.10 utilization, max_model_len: 16384, and max_num_seqs: 8.

I had to lower the 4B's context length because a 0.10 allocation only leaves ~3.5 GiB for the KV pool. A 32k sequence requires ~4.8 GiB, but 16k fits at ~2.4 GiB.
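The context-length arithmetic can be sketched as below. It uses only the figures quoted in this post (~4.8 GiB for one 32k sequence, ~3.5 GiB of KV pool) and assumes that KV demand scales linearly with sequence length on the 4B model. That linear scaling is an assumption of this sketch, and the function names are illustrative.

```python
KV_GIB_AT_32K = 4.8  # this post's figure for one 32k sequence on the 4B
KV_POOL_GIB = 3.5    # this post's figure for the KV pool at 0.10 utilization
REF_LEN = 32768


def kv_gib(max_model_len):
    """Estimated KV demand for one full-length sequence (assumes linear scaling)."""
    return KV_GIB_AT_32K * max_model_len / REF_LEN


def fits(max_model_len, pool_gib=KV_POOL_GIB):
    """True when one full-length sequence fits inside the KV pool."""
    return kv_gib(max_model_len) <= pool_gib


def max_len_that_fits(pool_gib=KV_POOL_GIB):
    """Longest single sequence the pool can hold under the same assumption."""
    return int(pool_gib / KV_GIB_AT_32K * REF_LEN)


if __name__ == "__main__":
    for length in (16384, 32768):
        print(length, round(kv_gib(length), 2), "GiB, fits:", fits(length))
    print("longest fitting length:", max_len_that_fits())
```

The estimate covers a single full-length sequence. With max_num_seqs above 1, concurrent requests share the same pool, so the usable length per request depends on how many are in flight.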


The Residency Math

Here is the data I should have compiled from the start:

Model      Target Util  Actual Residency  Max Model Len  Max Seqs
Qwen3-80B  0.80         ~93.5 GiB         65,536         4
Qwen3-4B   0.10         ~13.8 GiB         16,384         8

Key Observations:

  1. The Cushion: At 0.80 the 80B model settles at ~93.5 GiB, about 8 GiB below the 101.5 GiB it occupied at 0.85 and roughly 2 GiB under its own 0.80 budget (0.80 × 119.67 ≈ 95.7 GiB). This buffer is the only reason the 4B model doesn't crash when memory use fluctuates.
  2. Overhead: The 4B model's actual residency (13.8 GiB) is higher than its target (12 GiB). CUDA framework overhead is a constant, regardless of model size.
  3. Mamba vs. Attention: On Qwen3-Next, the memory demand for max_model_len × max_num_seqs is driven by Mamba state alignment, not standard attention KV. Halving the length does not halve the memory demand.
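To make the table easier to reason about, the following sketch recomputes each model's target in GiB from its fraction and compares it with the measured residency. The numbers are the ones in the table. The function names are illustrative.

```python
TOTAL_GIB = 119.67  # unified memory on the Spark

# (model, gpu_memory_utilization, measured residency in GiB), from the table above
ROWS = [("Qwen3-80B", 0.80, 93.5), ("Qwen3-4B", 0.10, 13.8)]


def residency_report(rows, total_gib=TOTAL_GIB):
    """Per model: the GiB target implied by the fraction, and actual minus target."""
    report = []
    for name, util, actual in rows:
        target = util * total_gib
        report.append({
            "model": name,
            "target_gib": round(target, 1),
            "delta_gib": round(actual - target, 1),
        })
    return report


def headroom_gib(rows, total_gib=TOTAL_GIB):
    """Memory left over once every model's measured residency is subtracted."""
    return round(total_gib - sum(actual for _, _, actual in rows), 1)


if __name__ == "__main__":
    for row in residency_report(ROWS):
        print(row)
    print("headroom GiB:", headroom_gib(ROWS))
```

A negative delta means the model sits under its budget and a positive delta means it overshoots. The headroom is what remains for everything else sharing the unified memory.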

Memory Graph


Final Insights & Playbook

The core takeaway is that gpu_memory_utilization is merely a starting snapshot. Actual residency is the only ground truth.

The Co-residency Playbook:

  1. Load the largest model first.
  2. Allow it to stabilize.
  3. Run nvidia-smi to determine the actual memory used.
  4. Calculate the remaining free pool.
  5. Size the smaller model's gpu_memory_utilization against that free pool, subtracting ~5 GiB for framework overhead.
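Steps 3 to 5 can be sketched as a single calculation. Because gpu_memory_utilization is a fraction of total memory, the free pool has to be converted back into a fraction of the total. The helper name is hypothetical, and the 5 GiB overhead default is the rough figure from step 5.

```python
def max_small_model_fraction(total_gib, used_gib, overhead_gib=5.0):
    """Upper bound on the second model's gpu_memory_utilization.

    used_gib is the measured residency of the model already loaded.
    """
    usable = (total_gib - used_gib) - overhead_gib
    if usable <= 0:
        return 0.0
    return usable / total_gib


if __name__ == "__main__":
    total = 119.67
    # 80B at 0.80 (~93.5 GiB) and at 0.85 (~101.5 GiB), as measured in this post
    for used in (93.5, 101.5):
        print(used, round(max_small_model_fraction(total, used), 3))
```

With the 80B at ~101.5 GiB, the bound comes out just under the 0.12 the 4B was asking for in Attempt 3. At ~93.5 GiB, the final 0.10 fits with room to spare. Treat the result as a ceiling, not a recommendation.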

🚨 24-Hour Action Item

If you are running a vLLM deployment, execute this command immediately:

nvidia-smi --query-gpu=memory.used --format=csv

Compare this actual number to your gpu_memory_utilization target.

  • Check: Does the actual residency diverge from the target by > 10%?
  • Action: If yes, your sizing model is incorrect. Fix it before deploying agent stacks or fallback chains to avoid silent failures.
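The check can be scripted. The sketch below parses the CSV output of the command above, which prints a header line followed by values such as "95744 MiB", and applies the 10% rule. The function names are illustrative, and read_used_gib only works on a machine where nvidia-smi is installed and reports a numeric value.

```python
import re
import subprocess


def parse_used_mib(csv_text):
    """Extract memory.used values in MiB, skipping the header and non-numeric rows."""
    values = []
    for line in csv_text.splitlines():
        match = re.fullmatch(r"\s*(\d+)\s*MiB\s*", line)
        if match:
            values.append(int(match.group(1)))
    return values


def diverges(actual_gib, util, total_gib, threshold=0.10):
    """True when actual residency differs from util * total by more than threshold."""
    target = util * total_gib
    return abs(actual_gib - target) / target > threshold


def read_used_gib():
    """Run nvidia-smi and return total used memory in GiB across reported GPUs."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv"],
        capture_output=True, text=True, check=True,
    ).stdout
    return sum(parse_used_mib(out)) / 1024
```

Note that memory.used is a device-wide figure. With several containers resident, compare it against the sum of their targets, or measure each model right after it loads as in the playbook.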

Final Setup