Inside OpenClaw #3: Why We Replaced Our AI Model (and Now Need 6x Less Memory)
From Mistral Small 24B to Qwen3.5-35B-A3B: Mamba2 architecture on RTX 4090 with 262K context and 110+ tok/s. Local AI agent, GDPR-compliant. Experience report.
A Model That Changes the Rules
Model changes like this one, and what they imply, feed directly into my AI and automation consulting. Our previous setup was solid: Mistral Small 24B on vLLM, AWQ-quantized, the right flags configured, tool calling working reliably. But two limits remained: 32,000 tokens of context, and 14 GB of VRAM consumed by model weights alone, on a GPU with 24 GB total.
Then Qwen3.5-35B-A3B appeared. A model that sounds impossible on paper: 35 billion parameters, but only 3 billion active at any given time. A 262,000-token context window. And faster than our previous, smaller model.
After three days of intensive testing, benchmarking, and configuration, it now runs in production — on the same hardware, without a single euro of additional investment.
The Architecture: Why This Model Is Different
Traditional transformer models store a key-value entry in GPU memory for every token processed — the KV cache. With longer conversations, this grows linearly: more context means more memory. At 32,000 tokens with a 24B model like Mistral, the KV cache alone consumes roughly 4 GB.
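That 4 GB figure follows directly from the per-token cost. A quick sanity check, assuming roughly 130 KB of KV cache per token for a dense 24B model (the figure used in the comparison below):

```shell
# KV cache for a dense 24B model at 32K context, at ~130 KB per token
echo "$(( 32000 * 130 / 1024 )) MB"   # ≈ 4 GB
```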
Qwen3.5-35B-A3B breaks this pattern on two levels:
Mixture of Experts (MoE): The model has 35 billion parameters but activates only 3 billion per token. The remaining parameters are specialized “experts” that only engage when needed. The result: inference speed comparable to a 3B model with the quality of a much larger one.
Mamba2 State Space Model: 30 of the model’s 40 layers no longer use classical attention. Instead, they operate with a fixed state of just 15 MB — regardless of context length. Only 10 layers retain a traditional KV cache, and even those use just 2 KV heads each instead of the typical 8 or more.
The numbers speak for themselves:
| | Mistral Small 24B | Qwen3.5-35B-A3B |
|---|---|---|
| Active Parameters | 24B (dense) | 3B (MoE) |
| KV Cache per Token | ~130 KB | ~20 KB |
| Native Context | 32K | 262K |
| Tool Calling (BFCL-V4) | — | 67.3 |
| Agent Benchmark (TAU2) | — | 81.2 |
| Text Generation | ~45 tok/s | ~110 tok/s |
6.5x less KV cache per token means the context window scales essentially for free in memory terms. Speed holds up well, too: generation at 16,000 tokens of context is virtually identical to generation with no context at all, and even at 262,000 tokens the model still runs at half its peak rate.
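The scaling difference is easiest to see at full context. A rough back-of-envelope using the per-token figures from the comparison table (plain arithmetic; real allocations add some overhead):

```shell
ctx=262144   # full 262K context window
echo "Qwen3.5 KV cache:  $(( ctx * 20  / 1024 )) MB"   # ~5 GB, fits next to 16 GB of weights
echo "Mistral-class KV:  $(( ctx * 130 / 1024 )) MB"   # ~33 GB, more than the entire GPU
```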
Getting There Was Not Trivial
Putting a three-day-old model into production comes with challenges. Three hurdles we had to clear:
vLLM Doesn’t Know the Model Yet
Our existing infrastructure runs on vLLM. But vLLM 0.16.0 simply doesn’t have the Qwen3.5 architecture (Qwen3_5MoeForConditionalGeneration) in its registry. The model is too new. Upgrading to the nightly build would have changed 21 packages and downgraded CUDA — too risky for a production system.
Ollama Works — Until It Hits the Memory Wall
Ollama 0.17 supports the model. The first test with standard quantization (Q4_K_M, 23 GB) looked promising. Until we set context to 262K: 30+ GB total consumption on a 24 GB GPU. The system fell back to unified memory — and every response took over six minutes.
The Solution: Smaller Quantization + llama-server Directly
Instead of Ollama, we now use llama-server (from the llama.cpp project) directly. With a Q3_K_XL quantization from bartowski, the model weights fit in just 16 GB instead of 23 GB. The critical lever was the Q8 KV cache: instead of FP16, the KV cache is stored as 8-bit integers, with minimal quality loss and half the memory footprint.
The final configuration:
- Q3_K_XL quantization — 15.96 GB weights (vs. 23 GB with Q4)
- Q8 KV cache — halves cache memory consumption
- Flash Attention — mandatory for the Mamba2 architecture
- All 41 layers on GPU — no CPU offloading, no unified memory fallback
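Put together, a launch command along these lines reproduces that configuration. The model path and port are placeholders, and exact flag spellings can differ between llama.cpp builds, so check `llama-server --help` for yours:

```shell
# Sketch of the launch command; model path and port are placeholders.
# -c 262144: full context window; -ngl 99: offload all layers to the GPU;
# -fa: Flash Attention; --cache-type-k/v q8_0: 8-bit KV cache.
llama-server \
  -m /models/Qwen3.5-35B-A3B-Q3_K_XL.gguf \
  -c 262144 -ngl 99 -fa \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  --port 8080
```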
Benchmark Results: What the Hardware Actually Delivers
Measured with llama-bench on an RTX 4090, Flash Attention enabled, Q8 KV cache:
| Context Depth | Prompt Processing | Text Generation |
|---|---|---|
| 0 (no context) | 4,236 tok/s | 110 tok/s |
| 16K tokens | 3,854 tok/s | 112 tok/s |
| 65K tokens | 2,964 tok/s | 88 tok/s |
| 131K tokens | 2,335 tok/s | 77 tok/s |
| 262K tokens | 1,353 tok/s | 55 tok/s |
In typical OpenClaw conversations — a system prompt of ~13,500 tokens plus chat history — we operate in the 16K to 65K token range. That translates to 88 to 112 tokens per second for text generation. Noticeably faster than the ~45 tok/s with Mistral.
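Translated into wait time for a single turn, with hypothetical round numbers (a 16K-token prompt and a 400-token reply, using the 16K row of the benchmark table):

```shell
prompt=16384   # system prompt plus chat history
reply=400      # generated answer
echo "prefill:    ~$(( prompt / 3854 )) s"   # prompt processing at 16K: 3,854 tok/s
echo "generation: ~$(( reply / 112 )) s"     # text generation at 16K: 112 tok/s
```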
Even at full 262K context, the model still responds at 55 tok/s — faster than many cloud APIs.
What This Means in Practice
The model switch has three concrete implications:
No new hardware required. The same RTX 4090 that previously ran Mistral Small 24B with 32K context now runs Qwen3.5-35B-A3B with 262K context. The investment in local GPU infrastructure was already made; the new model makes it dramatically more capable.
262K context changes what an agent can do. With 32K tokens, long documents had to be truncated, conversations compressed, and context discarded. With 262K tokens, the agent can hold entire contract libraries, code repositories, or hours-long chat histories in context — without losing information.
GDPR-compliant, no usage-based costs. No cloud API, no data leaving the company network. Every request is processed locally; the only running cost is electricity.
And if the new model encounters issues in an edge case: Mistral Small 24B stands ready as a fallback. A swap script switches between both models in under 60 seconds.
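The swap script itself is nothing exotic. A minimal sketch of its shape, with hypothetical paths and a dry-run `echo` at the end; it also assumes a GGUF build of the fallback model so both profiles run under llama-server, whereas in the setup described above Mistral actually runs under vLLM, so the real script swaps services rather than flags:

```shell
#!/usr/bin/env bash
# swap-model.sh: pick a model profile and print the launch command (dry run).
# Paths are placeholders; replace the final echo with an actual stop/restart
# of the server process to make the switch live.
set -euo pipefail

case "${1:-qwen}" in
  qwen)    GGUF=/models/Qwen3.5-35B-A3B-Q3_K_XL.gguf; CTX=262144 ;;
  mistral) GGUF=/models/Mistral-Small-24B.gguf;       CTX=32768  ;;
  *) echo "usage: $0 [qwen|mistral]" >&2; exit 1 ;;
esac

CMD="llama-server -m $GGUF -c $CTX -ngl 99 -fa \
  --cache-type-k q8_0 --cache-type-v q8_0 --port 8080"
echo "$CMD"
```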
Next Step
Want local AI infrastructure that keeps pace with model development — without buying new hardware for every upgrade? I’ll show you how it works in practice.
→ Or read more first: Local LLM as AI Agent — Without Cloud