Zero API Costs: How We Run an AI Agent on a Single GPU

OpenClaw runs a local 24B model on a single GPU: no cloud, no recurring costs. An experience report on how the GDPR-compliant architecture works.

Most AI agents depend on cloud APIs — and every request costs money. With agent workloads, those costs add up fast: a single complex task can consume 50,000 to 100,000 tokens. Run that a few dozen times a day, and you’re looking at a serious monthly bill.

We took a different approach with OpenClaw. Our AI agent runs entirely on local hardware — one GPU, no cloud dependency, no recurring API fees. Local AI solutions are a focus of my AI and automation consulting.

Why Local Matters

Cloud APIs are convenient, but they come with trade-offs that matter for business use:

  • Cost explosion with agent usage. Agents don’t just answer questions — they reason, plan, and execute multi-step tasks. That means significantly more tokens per interaction than a simple chatbot.
  • Data stays on your infrastructure. Every API call sends your data to a third-party server. With a local setup, sensitive business data never leaves your network — a clear advantage for GDPR compliance.
  • No vendor lock-in. When your entire workflow depends on a single API provider, you’re at the mercy of their pricing changes, rate limits, and service availability.

The Hardware

Our setup runs on a single NVIDIA RTX 4090 with 24 GB VRAM. The model is Mistral Small 24B, quantized with AWQ to fit into the available memory while maintaining strong output quality.

That’s it. One consumer-grade GPU. No multi-node cluster, no cloud instances.
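As an illustration, launching vLLM with an AWQ-quantized Mistral Small model might look like this (the model ID, context length, and memory fraction here are assumptions for the sketch, not our exact values):

```shell
vllm serve mistralai/Mistral-Small-24B-Instruct-2501 \
  --quantization awq \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90 \
  --port 8000
```

The `--gpu-memory-utilization` fraction is what lets a 24B AWQ model plus KV cache coexist inside 24 GB of VRAM.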

Three-Layer Architecture

OpenClaw’s architecture separates concerns into three distinct layers:

Layer 1: vLLM Inference Server

vLLM handles model inference. It serves the quantized Mistral model via an OpenAI-compatible API, managing memory efficiently with PagedAttention and handling concurrent requests.
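Because the endpoint is OpenAI-compatible, no vendor SDK is needed; any HTTP client works. A minimal TypeScript sketch (base URL, port, and served model name are assumptions):

```typescript
// Build and send a chat-completion request to a local vLLM server.
// Endpoint and model name are illustrative assumptions.
const VLLM_BASE_URL = "http://localhost:8000/v1";

interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

function buildChatRequest(messages: ChatMessage[]) {
  return {
    model: "mistral-small-24b-awq", // whatever name vLLM serves the model under
    messages,
    temperature: 0.2,
    max_tokens: 1024,
  };
}

// Sending it is a plain POST against the OpenAI-compatible route:
async function chat(messages: ChatMessage[]): Promise<string> {
  const res = await fetch(`${VLLM_BASE_URL}/chat/completions`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(buildChatRequest(messages)),
  });
  const data = await res.json();
  return data.choices[0].message.content as string;
}
```

Swapping the cloud provider for this local endpoint is usually a one-line base-URL change in existing client code.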

Layer 2: OpenClaw Gateway (Node.js)

The Gateway is the brain of the operation. It manages conversation context, routes tool calls, enforces safety constraints, and orchestrates the interaction between the model and external tools. Built in Node.js for fast async I/O.
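The routing and safety idea can be sketched like this (names and the tool format are illustrative, not OpenClaw's actual internals): the Gateway keeps a registry of tools and only dispatches calls that pass an explicit allowlist.

```typescript
// Minimal tool-routing sketch: register tools, enforce an allowlist,
// and dispatch the calls the model requests. Illustrative only.
type Tool = (args: Record<string, unknown>) => Promise<string>;

class Gateway {
  private tools = new Map<string, Tool>();
  constructor(private allowlist: Set<string>) {}

  register(name: string, tool: Tool): void {
    this.tools.set(name, tool);
  }

  // Safety constraint: refuse anything not explicitly allowed.
  async dispatch(name: string, args: Record<string, unknown>): Promise<string> {
    if (!this.allowlist.has(name)) {
      throw new Error(`Tool "${name}" is not allowed`);
    }
    const tool = this.tools.get(name);
    if (!tool) throw new Error(`Tool "${name}" is not registered`);
    return tool(args);
  }
}
```

Keeping the allowlist check in one place means a misbehaving model can request anything, but only vetted tools ever execute.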

Layer 3: Channels (Discord and Beyond)

Channels are the user-facing interfaces. Currently, OpenClaw runs primarily through Discord, but the channel layer is designed to support additional frontends without changes to the Gateway or inference layer.
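One way to express that separation (a hypothetical interface; OpenClaw's actual contract may differ): every channel implements the same small surface, so the Gateway never knows whether a message came from Discord or somewhere else.

```typescript
// A channel only needs to deliver messages in and out.
// Hypothetical interface for illustration.
interface Channel {
  name: string;
  send(userId: string, text: string): Promise<void>;
  onMessage(handler: (userId: string, text: string) => void): void;
}

// An in-memory channel, handy for tests; a Discord channel would
// implement the same interface on top of the Discord API.
class MemoryChannel implements Channel {
  name = "memory";
  outbox: string[] = [];
  private handler?: (userId: string, text: string) => void;

  async send(_userId: string, text: string): Promise<void> {
    this.outbox.push(text);
  }
  onMessage(handler: (userId: string, text: string) => void): void {
    this.handler = handler;
  }
  // Simulate an incoming user message.
  receive(userId: string, text: string): void {
    this.handler?.(userId, text);
  }
}
```

Adding a new frontend then means writing one adapter, with no changes to the Gateway or the inference layer.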

Production-Ready with systemd

We run the entire stack as systemd services with health checks and automatic restarts. If the inference server crashes or the Gateway hits an unexpected state, systemd brings it back up — no manual intervention needed.
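A unit file for the Gateway might look like this (paths, user, and service names are assumptions, not our exact configuration):

```ini
[Unit]
Description=OpenClaw Gateway
After=network.target vllm.service
Requires=vllm.service

[Service]
ExecStart=/usr/bin/node /opt/openclaw/gateway.js
Restart=on-failure
RestartSec=5
User=openclaw

[Install]
WantedBy=multi-user.target
```

The `Requires=`/`After=` pairing ensures the Gateway only starts once the inference service is up, and `Restart=on-failure` handles the recovery automatically.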

When Does This Break Even?

Depending on your usage volume, the hardware investment pays for itself in 2 to 4 months compared to equivalent cloud API costs. After that, every interaction is essentially free.
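The arithmetic behind that estimate, with illustrative numbers (a roughly €2,000 GPU against assumed cloud API spend; your task volume and prices will differ):

```typescript
// Break-even estimate: how many months until the one-time hardware
// cost undercuts recurring API spend. All numbers are illustrative.
function monthsToBreakEven(hardwareCost: number, monthlyApiCost: number): number {
  return Math.ceil(hardwareCost / monthlyApiCost);
}

// Example: 40 complex tasks/day at ~75k tokens each is ~90M tokens/month.
// At an assumed blended rate of €8 per million tokens, that is €720/month.
const monthlyTokens = 40 * 75_000 * 30;                  // 90,000,000
const monthlyApiCost = (monthlyTokens / 1_000_000) * 8;  // 720
const months = monthsToBreakEven(2_000, monthlyApiCost); // 3
```

Lower daily volume stretches the payback period toward the upper end of the 2-to-4-month range; heavier use shortens it.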

What Works — and What Doesn’t

In our experience, roughly 90% of typical agent tasks work well with the 24B model: research, summarization, drafting, code generation, tool orchestration, and structured data extraction.

Where it hits its limits:

  • Very complex multi-step reasoning — tasks that require chaining many dependent logical steps
  • Context windows beyond 32K tokens — long documents or extended conversations can degrade quality
  • Languages beyond German and English — the model performs best in DE and EN; other languages are less reliable

What’s Next

We’re continuing to refine OpenClaw’s architecture and share what we learn along the way. If you want to understand how OpenClaw works as a personal AI assistant, start with our overview article. For a deeper look at how agents coordinate complex tasks, see our piece on autonomous agent orchestration.

The full source code is available on GitHub.


Next Step

Interested in running AI on your own infrastructure? I help businesses build local AI setups that are cost-effective, privacy-compliant, and tailored to their needs.

Book a free consultation

→ Or read more first: AI Workshop: Business Processes

About the Author René Pfisterer

10+ years in ERP integration, data migration, and process automation for mid-sized companies. Specialized in DATEV, SAP, and AI implementation.

Full profile →
