Inside OpenClaw #2: The Hidden vLLM Flags for Mistral
Getting Mistral models to run on vLLM sounds straightforward — until tool calling silently fails. Which flags make the difference, and what we learned along the way.
Running a Mistral model on vLLM should be simple. Load the model, start the server, send requests. And for basic text generation, it is. But the moment you need tool calling — the backbone of any AI agent — things start breaking in ways that are surprisingly hard to diagnose. Technical details like these are part of my AI and automation consulting.
The Silent Failures
When vLLM’s tool-calling configuration is wrong, it doesn’t throw an error. Instead, you get one of these failure modes:
- The model ignores tools entirely. You define functions, send them in the request, and the model responds as if they don’t exist. It generates a plain text answer instead of a tool call.
- Broken JSON output. The model attempts a tool call but produces malformed JSON — missing brackets, incorrect types, truncated strings. The Gateway can’t parse it, and the request fails silently.
- Hallucinated tool names. The model invents tool names that don’t exist in your schema. It confidently calls `search_internet` when the actual tool is called `web_search`. The Gateway rejects the call, and the model retries with another invented name.
All of these look like model quality issues on the surface. You might think the model is too small, the quantization is too aggressive, or the prompt needs work. In reality, it’s a configuration problem.
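To make these failure modes concrete, here is a minimal sketch of the kind of gateway-side check that surfaces them instead of letting them pass silently. The tool names and the helper function are illustrative, not part of vLLM or OpenClaw:

```python
import json

# Hypothetical gateway-side check. KNOWN_TOOLS stands in for whatever
# tool schema your agent actually registers.
KNOWN_TOOLS = {"web_search", "read_file"}

def classify_tool_call(raw: str):
    """Return ("ok", parsed) or a failure label for a raw tool-call string."""
    if not raw.strip():
        return ("no_tool_call", None)    # model ignored the tools entirely
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return ("malformed_json", None)  # broken or truncated JSON
    if call.get("name") not in KNOWN_TOOLS:
        return ("unknown_tool", call)    # hallucinated tool name
    return ("ok", call)

print(classify_tool_call('{"name": "search_internet", "arguments": {}}')[0])
# prints "unknown_tool"
```

Logging which of these labels you hit, rather than just "request failed", is what turns a vague model-quality suspicion into a concrete configuration bug.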
The Three Critical Flags
After extensive debugging, we identified three vLLM flags that are essential for reliable Mistral tool calling:
--tool-call-parser mistral
This tells vLLM to use Mistral’s specific tool-calling format. Without it, vLLM applies a generic parser that doesn’t match Mistral’s expected output structure. The result: the model generates tool calls in the right format, but vLLM misinterprets them.
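To illustrate what the parser actually has to handle: Mistral instruct models signal tool calls with a special `[TOOL_CALLS]` token followed by a JSON array. The sketch below is a simplified stand-in for the real parser, just to show why a generic parser that doesn’t expect this prefix treats the whole output as plain text:

```python
import json

# Simplified stand-in for a Mistral-style tool-call parser, for illustration
# only. Real model output uses a special [TOOL_CALLS] token followed by a
# JSON array of {"name": ..., "arguments": ...} objects.
def parse_mistral_tool_calls(output: str):
    marker = "[TOOL_CALLS]"
    if marker not in output:
        return None  # plain text response, no tool call to extract
    payload = output.split(marker, 1)[1].strip()
    return json.loads(payload)

raw = '[TOOL_CALLS] [{"name": "web_search", "arguments": {"query": "vllm flags"}}]'
calls = parse_mistral_tool_calls(raw)
# calls[0]["name"] == "web_search"
```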
--chat-template
Mistral models require a specific chat template that includes tool-calling tokens. The default template often strips or mishandles these tokens, causing the model to lose its ability to signal tool invocations. We point this flag to the official Mistral chat template included in the model repository.
--enable-auto-tool-choice
This flag allows the model to decide autonomously whether a response should be a tool call or a text response. Without it, tool calling may require explicit forcing through the API, which breaks the natural agent workflow where the model dynamically decides when to use tools.
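On the client side, this corresponds to `tool_choice: "auto"` in an OpenAI-compatible request. A sketch of such a request payload, with a made-up `web_search` tool schema:

```python
# Sketch of an OpenAI-compatible request against a vLLM server started with
# --enable-auto-tool-choice. The tool schema is a made-up example; with
# tool_choice "auto" the model itself decides between text and a tool call.
request = {
    "model": "model-path",
    "messages": [{"role": "user", "content": "What changed in vLLM recently?"}],
    "tools": [{
        "type": "function",
        "function": {
            "name": "web_search",
            "description": "Search the web",
            "parameters": {
                "type": "object",
                "properties": {"query": {"type": "string"}},
                "required": ["query"],
            },
        },
    }],
    "tool_choice": "auto",  # let the model decide; needs the server flag above
}
```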
The Full Startup Command
Here’s what our production vLLM startup looks like with all critical parameters:
```shell
vllm serve model-path \
  --tool-call-parser mistral \
  --chat-template path/to/mistral-chat-template.jinja \
  --enable-auto-tool-choice \
  --quantization awq \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.92 \
  --tensor-parallel-size 1
```
Each parameter serves a purpose:
- `--quantization awq` — Tells vLLM to load the AWQ-quantized model weights, fitting the 24B model into 24 GB VRAM.
- `--max-model-len 32768` — Sets the maximum context length. Going higher is possible but reduces throughput and increases memory pressure.
- `--gpu-memory-utilization 0.92` — Allows vLLM to use 92% of available VRAM. Leaves a small buffer for system stability.
- `--tensor-parallel-size 1` — Single GPU, no parallelism needed.
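A rough back-of-envelope calculation shows why these numbers hang together. The figures below are approximations, not measurements: AWQ stores weights in roughly 4 bits per parameter.

```python
# Back-of-envelope VRAM budget for the setup above (approximate, not measured).
params = 24e9                        # 24B parameters
weight_gb = params * 0.5 / 1e9       # ~4-bit AWQ weights -> ~0.5 bytes/param
budget_gb = 24 * 0.92                # --gpu-memory-utilization 0.92 on 24 GB
kv_cache_gb = budget_gb - weight_gb  # what's left for KV cache + activations

print(f"weights ~{weight_gb:.0f} GB, leaving ~{kv_cache_gb:.1f} GB for KV cache")
# prints "weights ~12 GB, leaving ~10.1 GB for KV cache"
```

The same model in fp16 would need roughly 48 GB for weights alone, which is why quantization is not optional on a single 24 GB card.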
What We Learned
Tokenizer Compatibility
Make sure the tokenizer matches the model exactly. Mismatched tokenizers can cause subtle issues where the model’s special tokens (including tool-calling tokens) are decoded incorrectly. Always use the tokenizer bundled with the specific model checkpoint you’re running.
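A quick sanity check is to diff the special-token maps of the tokenizer you intend to use against the one bundled with the checkpoint. The helper below is hypothetical and operates on plain dicts so it runs standalone; with Hugging Face transformers you would feed it something like `tokenizer.special_tokens_map` from each side:

```python
# Hypothetical helper: compare two tokenizers' special-token maps and report
# any keys whose values differ or are missing on one side.
def special_token_mismatches(bundled: dict, other: dict) -> dict:
    keys = set(bundled) | set(other)
    return {k: (bundled.get(k), other.get(k))
            for k in keys if bundled.get(k) != other.get(k)}

# Illustrative values only -- the key names are made up for the example.
bundled = {"bos_token": "<s>", "eos_token": "</s>", "tool_call_token": "[TOOL_CALLS]"}
other   = {"bos_token": "<s>", "eos_token": "</s>"}
print(special_token_mismatches(bundled, other))
# prints "{'tool_call_token': ('[TOOL_CALLS]', None)}"
```

An empty dict means the maps agree; anything else is a red flag before you even start the server.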
Lower Temperature for Tool Calling
We’ve found that temperature 0.1 to 0.3 is the sweet spot for tool-calling interactions. Higher temperatures introduce variability in how the model formats tool calls, leading to more parsing failures. Save higher temperatures for creative text generation — not for structured agent interactions.
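In practice this means keeping two sampling presets and picking one per request. The split into presets is our convention, not a vLLM API; the creative-side values here are illustrative, only the 0.1–0.3 tool-calling range comes from our testing:

```python
# Two sampling presets: tight sampling for structured tool calls, looser
# sampling for free-form text. The preset split is our own convention.
TOOL_CALL_SAMPLING = {"temperature": 0.2}  # within the 0.1-0.3 sweet spot
CREATIVE_SAMPLING  = {"temperature": 0.8}  # illustrative free-form value

def sampling_for(request_has_tools: bool) -> dict:
    return TOOL_CALL_SAMPLING if request_has_tools else CREATIVE_SAMPLING
```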
Debug Logging Saves Hours
When tool calling fails silently, vLLM’s debug logs are your best friend. Enable verbose logging during development to see exactly how the model’s output is being parsed. Many of the issues we encountered were only visible in the raw token stream, not in the final API response.
The Bigger Picture
Local AI infrastructure is not plug-and-play. The gap between “the model runs” and “the model works reliably as an agent” is filled with configuration details that are poorly documented and often discovered through trial and error.
That’s exactly why we’re writing this series — to save you the debugging time we spent. If you’re building on local models, these details matter.
For more context on OpenClaw’s overall architecture, see our overview of OpenClaw as a personal AI assistant and our deep dive into running the entire stack on a single GPU.
Next Step
Planning to deploy local AI models for your business? I’ve been through the trial-and-error phase so you don’t have to.
→ Or read more first: Local LLM as AI Agent — Without Cloud