Inside OpenClaw #2: The Hidden vLLM Flags for Mistral

Getting Mistral models to run on vLLM sounds straightforward — until tool calling silently fails. Which flags make the difference, and what we learned along the way.

Running a Mistral model on vLLM should be simple. Load the model, start the server, send requests. And for basic text generation, it is. But the moment you need tool calling — the backbone of any AI agent — things start breaking in ways that are surprisingly hard to diagnose. Deep dives like this one are part of my AI and automation consulting work.

The Silent Failures

When vLLM’s tool-calling configuration is wrong, it doesn’t throw an error. Instead, you get one of these failure modes:

  • The model ignores tools entirely. You define functions, send them in the request, and the model responds as if they don’t exist. It generates a plain text answer instead of a tool call.
  • Broken JSON output. The model attempts a tool call but produces malformed JSON — missing brackets, incorrect types, truncated strings. The Gateway can’t parse it, and the request fails silently.
  • Hallucinated tool names. The model invents tool names that don’t exist in your schema. It confidently calls search_internet when the actual tool is called web_search. The Gateway rejects the call, and the model retries with another invented name.

All of these look like model quality issues on the surface. You might think the model is too small, the quantization is too aggressive, or the prompt needs work. In reality, it’s a configuration problem.
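A quick way to tell these three failure modes apart is to triage the assistant message the server returns. The sketch below assumes an OpenAI-style response shape (as served by vLLM's compatible endpoint); the function and tool names are illustrative:

```python
import json

def classify_tool_response(message: dict, known_tools: set[str]) -> str:
    """Rough triage for the three silent failure modes described above.

    `message` is assumed to be an OpenAI-style assistant message dict;
    this is an illustrative helper, not part of vLLM itself.
    """
    tool_calls = message.get("tool_calls") or []
    if not tool_calls:
        return "no_tool_call"          # model answered in plain text
    for call in tool_calls:
        fn = call.get("function", {})
        try:
            json.loads(fn.get("arguments", ""))
        except (json.JSONDecodeError, TypeError):
            return "malformed_json"    # broken arguments payload
        if fn.get("name") not in known_tools:
            return "unknown_tool"      # hallucinated tool name
    return "ok"
```

Logging this classification per request makes it obvious whether you are looking at a model problem or, as in our case, a configuration problem.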

The Three Critical Flags

After extensive debugging, we identified three vLLM flags that are essential for reliable Mistral tool calling:

--tool-call-parser mistral

This tells vLLM to use Mistral’s specific tool-calling format. Without it, vLLM applies a generic parser that doesn’t match Mistral’s expected output structure. The result: the model generates tool calls in the right format, but vLLM misinterprets them.

--chat-template

Mistral models require a specific chat template that includes tool-calling tokens. The default template often strips or mishandles these tokens, causing the model to lose its ability to signal tool invocations. We point this flag to the official Mistral chat template included in the model repository.

--enable-auto-tool-choice

This flag allows the model to decide autonomously whether a response should be a tool call or a text response. Without it, tool calling may require explicit forcing through the API, which breaks the natural agent workflow where the model dynamically decides when to use tools.
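With these flags set on the server, a request against vLLM's OpenAI-compatible chat endpoint can simply pass tools and let the model decide. A minimal request body might look like this — the model name and tool schema are placeholders, not the actual production values:

```python
import json

# Illustrative request body for an OpenAI-compatible /v1/chat/completions
# endpoint. Model name and tool schema are placeholders.
payload = {
    "model": "mistral-small-awq",
    "messages": [
        {"role": "user", "content": "What is the weather in Berlin right now?"}
    ],
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "web_search",
                "description": "Search the web for current information.",
                "parameters": {
                    "type": "object",
                    "properties": {"query": {"type": "string"}},
                    "required": ["query"],
                },
            },
        }
    ],
    # "auto" lets the model decide between a tool call and a text answer,
    # which is what --enable-auto-tool-choice makes possible server-side.
    "tool_choice": "auto",
}

body = json.dumps(payload)
```

Note that no forcing is needed: the model is free to answer in plain text when no tool applies.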

The Full Startup Command

Here’s what our production vLLM startup looks like with all critical parameters:

vllm serve model-path \
  --tool-call-parser mistral \
  --chat-template path/to/mistral-chat-template.jinja \
  --enable-auto-tool-choice \
  --quantization awq \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.92 \
  --tensor-parallel-size 1

Each parameter serves a purpose:

  • --quantization awq — Tells vLLM to load the AWQ-quantized model weights, fitting the 24B model into 24 GB VRAM.
  • --max-model-len 32768 — Sets the maximum context length. Going higher is possible but reduces throughput and increases memory pressure.
  • --gpu-memory-utilization 0.92 — Allows vLLM to use 92% of available VRAM. Leaves a small buffer for system stability.
  • --tensor-parallel-size 1 — Single GPU, no parallelism needed.
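The memory numbers above can be sanity-checked with rough arithmetic. The figures below are back-of-envelope assumptions (4-bit AWQ ≈ 0.5 bytes per weight, ignoring activation overhead), not measurements:

```python
# Back-of-envelope VRAM budget for the setup above.
# All numbers are rough assumptions, not measurements.
params = 24e9                  # 24B parameters
bytes_per_param = 0.5          # AWQ 4-bit quantization ~ 0.5 bytes/weight
weights_gb = params * bytes_per_param / 1e9          # ~ 12 GB of weights

total_vram_gb = 24
usable_gb = total_vram_gb * 0.92                     # --gpu-memory-utilization 0.92
kv_cache_budget_gb = usable_gb - weights_gb          # what remains for KV cache

print(f"weights ~ {weights_gb:.0f} GB, KV-cache budget ~ {kv_cache_budget_gb:.1f} GB")
```

This is why --max-model-len matters: the leftover budget caps how many tokens of KV cache vLLM can hold across concurrent requests.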

What We Learned

Tokenizer Compatibility

Make sure the tokenizer matches the model exactly. Mismatched tokenizers can cause subtle issues where the model’s special tokens (including tool-calling tokens) are decoded incorrectly. Always use the tokenizer bundled with the specific model checkpoint you’re running.
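A cheap pre-flight check is to verify that the tool-calling control tokens actually exist in the tokenizer's vocabulary. The token name "[TOOL_CALLS]" is what recent Mistral tokenizers use, but treat it as an assumption and adjust for your checkpoint:

```python
def has_tool_tokens(vocab: dict) -> bool:
    """Check that the tool-calling control tokens survive in the vocab.

    "[TOOL_CALLS]" is the token name used by recent Mistral tokenizers;
    verify against your specific checkpoint.
    """
    required = {"[TOOL_CALLS]"}
    return required.issubset(vocab)

# With Hugging Face transformers this would typically be called as:
#   tok = AutoTokenizer.from_pretrained(model_path)
#   has_tool_tokens(tok.get_vocab())
```

If this check fails, the model literally cannot signal a tool call, no matter how good the prompt is.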

Lower Temperature for Tool Calling

We’ve found that temperature 0.1 to 0.3 is the sweet spot for tool-calling interactions. Higher temperatures introduce variability in how the model formats tool calls, leading to more parsing failures. Save higher temperatures for creative text generation — not for structured agent interactions.
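In practice this means switching sampling settings by task type. The exact values below reflect our experience, not an official recommendation:

```python
def sampling_params(task: str) -> dict:
    """Pick sampling settings by task type.

    Values are based on our own experience with Mistral tool calling,
    not an official recommendation.
    """
    if task == "tool_call":
        return {"temperature": 0.2, "top_p": 0.9}   # low variance for structured output
    return {"temperature": 0.7, "top_p": 0.95}      # looser sampling for free-form text
```

The Gateway can apply this per request, so creative chats and structured tool calls get different settings from the same server.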

Debug Logging Saves Hours

When tool calling fails silently, vLLM’s debug logs are your best friend. Enable verbose logging during development to see exactly how the model’s output is being parsed. Many of the issues we encountered were only visible in the raw token stream, not in the final API response.
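One way to wire this up is to launch the server with vLLM's log-level environment variable set to DEBUG. The sketch below only builds the environment and command (the server is not actually started here); the command mirrors the startup shown earlier:

```python
import os
import shlex

# Sketch: prepare a vLLM launch with verbose logging. VLLM_LOGGING_LEVEL
# is the environment variable vLLM reads for its log level.
env = {**os.environ, "VLLM_LOGGING_LEVEL": "DEBUG"}
cmd = shlex.split(
    "vllm serve model-path "
    "--tool-call-parser mistral "
    "--enable-auto-tool-choice"
)
# subprocess.run(cmd, env=env) would start the server with debug logs,
# including how raw model output is parsed into tool calls.
```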

The Bigger Picture

Local AI infrastructure is not plug-and-play. The gap between “the model runs” and “the model works reliably as an agent” is filled with configuration details that are poorly documented and often discovered through trial and error.

That’s exactly why we’re writing this series — to save you the debugging time we spent. If you’re building on local models, these details matter.

For more context on OpenClaw’s overall architecture, see our overview of OpenClaw as a personal AI assistant and our deep dive into running the entire stack on a single GPU.


Next Step

Planning to deploy local AI models for your business? I’ve been through the trial-and-error phase so you don’t have to.

Book a free consultation

→ Or read more first: Local LLM as AI Agent — Without Cloud

About the Author

René Pfisterer

10+ years in ERP integration, data migration, and process automation for mid-sized companies. Specialized in DATEV, SAP, and AI implementation.

Full profile →
