← all posts
[ guide ]June 2, 20267 min read

Cline on your own hardware: the honest guide

Cline is an autonomous VS Code agent built for frontier models — so most local LLM setups disappoint. Here is what actually works: a 32B coder model, a big context window, and your own GPU behind one endpoint.

Cline (formerly Claude Dev) is an autonomous coding agent that lives in VS Code. Give it a task and it plans the work, reads and edits files across your repo, runs terminal commands, and can even drive a browser to check its own work — pausing for your approval at each step. It speaks the OpenAI API, so in principle you can point it at any endpoint, including a model running on a GPU you own.

In practice, "in principle" hides a lot. Cline is one of the most demanding agents you can run, and most local-LLM tutorials gloss over the part where the small model you downloaded falls apart on the first real task. This guide does not do that.

The hard truth up front: Cline was designed for frontier models, and it shows. A 7B model will disappoint you — it loses the plan, mangles diffs, and fails the tool-call format Cline depends on. The first open model that genuinely handles Cline is Qwen2.5-Coder-32B, and running it well needs roughly 24GB of VRAM. If you do not have that, read the hybrid and cloud sections before you give up — there is a good answer for you too.

Why small models fail at Cline specifically

Cline's system prompt is enormous. Before your task even starts, the model receives detailed instructions for a dozen tools, strict XML formatting rules, and the conventions for proposing diffs. Then it has to hold a multi-step plan in its head while file contents, command output, and your replies stream in. Two things break small models here:

01

Instruction following under load

A 7B model can write a clean function in isolation but drifts when it must obey a long tool protocol and reason about your codebase at the same time. You get malformed tool calls, ignored steps, and edits applied to the wrong file.
02

Long-context coherence

Cline routinely runs 20k–50k tokens deep into a session. Smaller models degrade as context fills — they forget the original task or re-introduce bugs they already fixed. The 32B class holds the thread far better.
03

Diff discipline

Cline edits via precise search-and-replace blocks. If the model gets the surrounding lines slightly wrong, the edit is rejected and it has to retry — burning time and context. Bigger coder models nail the exact-match format much more often.

What each hardware tier actually gets you

Be honest with yourself about which row you are on. The difference between tiers is not subtle with Cline — it is the difference between a usable agent and a frustrating one.

your hardwaremodel that fitscline experience
8GB VRAM / 16GB MacQwen2.5-Coder-7B · Q4_K_MNot recommended. Single-file tweaks only; fails most multi-step tasks.
12–16GB VRAM / 32GB MacQwen2.5-Coder-14B · Q4_K_MWorkable for small, well-scoped edits. Still drops complex plans.
24GB VRAM / 64GB MacQwen2.5-Coder-32B · Q4_K_MThe sweet spot. The first setup that feels like a real Cline agent.
2× 24GB / 48GB / 96GB MacQwen2.5-Coder-32B · Q6 or 70B-classBest local quality. Headroom for big context and higher quant.

Notice there is no 70B row promising miracles. A 70B general model is often worse at Cline than the 32B coder, because Cline rewards coding-specific instruction tuning over raw size. Qwen2.5-Coder-32B at Q4_K_M is the recommendation for almost everyone with the VRAM to run it.

Step 1 — Bring a node online

In the dashboard, go to Nodes → Add a node, name it, and run the one-line installer on your 24GB machine. It supports macOS (Metal), Linux (Docker with NVIDIA passthrough), and Windows (native CUDA, no WSL). The node opens an outbound Cloudflare Tunnel — no port forwarding, no static IP, no firewall changes. It shows CONNECTED within a minute.

Step 2 — Deploy Qwen2.5-Coder-32B

Go to Models, search qwen2.5 coder 32b, and deploy the Q4_K_Mquantization — about 20GB of weights, pulled straight from Hugging Face's GGUF catalog. The dashboard only offers quantizations that fit your node's memory, so you will not accidentally pick one that thrashes. Click [ deploy ], choose your node, and watch the download progress on the node's page. When it flips to READY it is serving the model.

Q4_K_M is the right default at 32B — the quality gap to Q6 is small and the memory savings let you give Cline a much bigger context window, which matters more. Spend your VRAM on context before you spend it on precision.

Step 3 — Raise the context window to 65536

This is the step people skip and then wonder why Cline misbehaves. A node's llama-server defaults to a 4,096-token context window — fine for chat, hopeless for Cline. Cline's system prompt alone can approach that, before a single file is read. You want 65536 if your hardware can hold it.

Open your node's detail page (Nodes → your node), find the context window setting, enter 65536 (or 64k), and save. The node restarts its inference server on the next heartbeat and reports the applied value — under a minute. Managing the node by hand instead? The same change lives in the agent config:

manual alternative
# Managing the node by hand? Set the window in the agent config.
# macOS / Linux:
echo 'WAI_LLAMA_ARGS="-ngl 999 -c 65536"' >> ~/.wai/env && wai restart
# Windows PowerShell:
Add-Content "$env:LOCALAPPDATA\wai\env.ps1" '$WAI_LLAMA_ARGS = "-ngl 999 -c 65536"'; wai restart

The cost is memory. The KV cache grows linearly with context on top of the ~20GB of 32B weights:

context windowkv cache ≈weights + kv ≈fits on
4k (default)~0.4GB~20.5GB24GB — but useless for Cline
32k~3GB~23GB24GB VRAM (tight) / 64GB Mac
65k~6GB~26GB32GB+ VRAM / 64GB Mac — recommended
128k (model max)~12GB~32GB2× 24GB / 48GB / 96GB Mac

On a single 24GB card the 32B model plus a 65k cache is over budget — so either drop to 32k context, or offload a few layers to system RAM (Cline tolerates the slowdown better than you might expect, because it spends most of its time waiting on tool results). A 48GB card or a 64GB Mac runs 65k comfortably.

Step 4 — Create a gateway key

Go to API Keys → Create a key, name it cline, and copy the wai_sk_…value — it is shown once. A dedicated key keeps Cline's request logs separate and lets you revoke it without touching your other tools.

Step 5 — Point Cline at your gateway

In VS Code, open the Cline panel and click the settings gear. At the top, set API Provider to OpenAI Compatible — this reveals three fields. Fill them in exactly:

cline · settings
API Provider:  OpenAI Compatible
Base URL:      https://wideareaai.com/api/v1
API Key:       wai_sk_...
Model ID:      Qwen2.5-Coder-32B-Instruct-Q4_K_M

The Base URL field takes the gateway root with the /api/v1 path — Cline appends /chat/completions itself, so do not add it. The Model ID must match the .gguf name your node serves (shown on the Nodes page, minus the extension). If you are unsure, ask the gateway directly:

list your models
# Confirm the exact model name your node serves right now
curl https://wideareaai.com/api/v1/models \
  -H "Authorization: Bearer wai_sk_..."

Save the settings, then give Cline a real task — not hello world, something like "add a rate limiter to the /login route and a test that proves it returns 429 after five attempts". Watch it plan, propose edits for your approval, and run the test. Every request routes through Wide Area Intelligence to your node; you can see them arrive live on the dashboard's Overview page, with tokens and generation speed broken out on Analytics.

Run Cline in Plan mode first for anything non-trivial. Local 32B models do better when they commit to a plan you have approved before they start editing — it keeps them from wandering halfway through and burning context on dead ends.

The hybrid strategy that actually makes this great

Here is where self-hosting Cline stops being a compromise. You do not have to choose between "always local" and "always cloud." Wide Area Intelligence puts both behind one endpoint, so you can route by difficulty:

01

Local for the routine 90%

Refactors, test writing, boilerplate, fixing the build, renaming across files — Qwen2.5-Coder-32B on your own GPU handles all of it with zero per-token cost and your code never leaving your hardware.
02

Cloud for the hard planning

Architecting a new subsystem, untangling a gnarly bug, or a task that needs the absolute best reasoning? Configure cloud failover with prepaid credits, and those requests can reach Claude or another frontier model — through the same wai_sk_… key and base URL Cline already uses.
03

One endpoint, automatic resilience

Failover also covers the unplanned case: if your node reboots or someone trips over the power cord mid-session, in-flight requests fall over to the cloud instead of erroring out. Cline never sees a dropped connection.

If you keep two nodes (the free tier covers two), you can run different models on each and pin Cline to one with an X-WAI-Node: your-node-nameheader when you want a specific machine with no fallback. The dashboard's Playground → Compare tab is the quickest way to A/B two nodes on the same prompt before you commit.

When the cloud is simply the right call

Trust is built by saying this plainly: if you do not have 24GB of VRAM and you are not buying it, a frontier cloud model through Cline will beat any local setup you can run today, full stop. Self-hosting Cline pays off when you have the hardware and a reason to keep code on your own machines — privacy, compliance, air-gapped work, or just the economics of an agent that re-reads your repo thousands of times a day. If that is you, a 32B coder on your own GPU is genuinely excellent. If it is not, use the cloud and do not feel bad about it.

Either way, Wide Area Intelligence is the layer that lets you change your mind without rewiring anything: deploy a model to a node, create a key, point Cline at the gateway, and flip between local and cloud whenever the task demands it.

Deploy a coder model and wire up Cline →

/// get started

That GPU is already paid for.
Put it on the network.

Create your gateway — free →