[ guide ] · June 1, 2026 · 9 min read

Run Qwen Code on your own GPU with Wide Area Intelligence

Point Alibaba's open-source coding agent at hardware you own: deploy a Qwen Coder model to a node, wire up three environment variables, and code with zero token costs and full privacy.

Qwen Code is Alibaba's open-source coding agent — a terminal CLI in the same family as Claude Code and Gemini CLI. It reads your repository, plans multi-step changes, edits files, and runs commands. And because it speaks the OpenAI API, you can point it at any OpenAI-compatible endpoint.

That last part is what this guide is about. Run a Qwen Coder model on your own GPU through Wide Area Intelligence and you get a coding agent with zero per-token costs, complete privacy (your code never leaves hardware you own), and no rate limits beyond what your GPU can physically generate.

What you need

01

A Wide Area Intelligence account

It's free for up to 2 nodes; sign in with Google.

02

A machine with a GPU

A gaming PC with an RTX 3060 (12GB), any Apple Silicon Mac with 16GB+ unified memory, or anything bigger. This is your "node": it can be the same machine you code on, or a PC in another room, another building, or another city.

03

Node.js 20+ on your dev machine

Only needed for the Qwen Code CLI itself.

Step 1 — Bring a node online

In the dashboard, go to Nodes → Add a node, give it a name, and run the one-line installer on the GPU machine. It works on macOS (native Metal acceleration), Linux (Docker with NVIDIA passthrough), and Windows (native CUDA — no WSL needed). The node opens an outbound Cloudflare Tunnel, so there's no port forwarding, no static IP, and no firewall changes.

Within a minute the node shows CONNECTED in your dashboard, and after its first model download it flips to READY.

Step 2 — Deploy a Qwen Coder model

Go to Models, search qwen coder, and pick a model from the results — they come straight from Hugging Face's GGUF catalog. The dashboard shows which quantizations actually fit your node's memory, so you don't have to guess:

your hardware | recommended model | download
8GB VRAM / 16GB Mac | Qwen2.5-Coder-7B-Instruct · Q4_K_M | ~4.7GB
12–16GB VRAM / 32GB Mac | Qwen2.5-Coder-14B-Instruct · Q4_K_M | ~9GB
24GB VRAM / 64GB Mac | Qwen2.5-Coder-32B-Instruct · Q4_K_M | ~20GB
2× 24GB or 96GB+ Mac | Qwen3-Coder-30B-A3B-Instruct · Q5_K_M | ~22GB

Click [ deploy ], pick your node, and watch the progress on the node's page. The node keeps serving its current model while the new one downloads, then swaps over and reports the new model name to your dashboard.

Bigger is better for coding agents — if your hardware can hold the 14B or 32B variant, use it. Agentic coding leans hard on instruction following and long-context reasoning, which is where the larger models pull away.

Step 3 — Create a gateway key

Go to API Keys → Create a key and name it something like qwen-code. Copy the wai_sk_… key right away, because it's shown only once. One key per tool keeps your request logs tidy and lets you revoke access per app later.

Step 4 — Install and configure Qwen Code

install
# Qwen Code needs Node.js 20+
npm install -g @qwen-code/qwen-code
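
Before installing, it can help to confirm the Node.js version. The sketch below is an illustration, not part of Qwen Code: the helper names node_major and node_ok are hypothetical, and it assumes node --version prints a string such as v20.11.0.

```shell
# Hypothetical helper: extract the major version from a string like "v20.11.0"
node_major() {
  v=${1#v}            # drop the leading "v"
  echo "${v%%.*}"     # keep everything before the first dot
}

# Succeeds (exit status 0) when the major version is 20 or newer
node_ok() {
  [ "$(node_major "$1")" -ge 20 ]
}

# Usage, on your dev machine:
#   node_ok "$(node --version)" || echo "Upgrade Node.js to 20+ first"
```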

Qwen Code reads three environment variables to decide where to send requests. Point them at your gateway:

configure + run
# Point Qwen Code at your Wide Area Intelligence gateway
export OPENAI_BASE_URL="https://wideareaai.com/api/v1"
export OPENAI_API_KEY="wai_sk_..."                        # from API Keys
export OPENAI_MODEL="Qwen2.5-Coder-7B-Instruct-Q4_K_M"    # what your node serves

# Start it inside any project
cd ~/code/my-project
qwen

optional · persist across terminals
# Make it permanent (zsh — use ~/.bashrc for bash)
cat >> ~/.zshrc << 'EOF'
export OPENAI_BASE_URL="https://wideareaai.com/api/v1"
export OPENAI_API_KEY="wai_sk_..."
export OPENAI_MODEL="Qwen2.5-Coder-7B-Instruct-Q4_K_M"
EOF
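
A missing variable is an easy mistake to make when switching terminals. As a minimal sketch, the function below checks that all three variables from this step are set before you launch qwen. The name check_qwen_env is hypothetical and is not something Qwen Code ships.

```shell
# Hypothetical preflight check: report any of the three variables that is unset or empty
check_qwen_env() {
  missing=0
  [ -n "${OPENAI_BASE_URL:-}" ] || { echo "OPENAI_BASE_URL is not set" >&2; missing=1; }
  [ -n "${OPENAI_API_KEY:-}" ]  || { echo "OPENAI_API_KEY is not set" >&2; missing=1; }
  [ -n "${OPENAI_MODEL:-}" ]    || { echo "OPENAI_MODEL is not set" >&2; missing=1; }
  return "$missing"   # 0 when everything is set, 1 otherwise
}

# Usage:
#   check_qwen_env && qwen
```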

The OPENAI_MODEL value must match what your node serves: it's the .gguf filename without the extension, shown on your Nodes page. Or ask the gateway directly:

list your models
# See exactly which model names your nodes serve right now
curl https://wideareaai.com/api/v1/models \
  -H "Authorization: Bearer wai_sk_..."
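
To tie the two together, here is a rough sketch that derives the model name from a .gguf filename and checks whether it appears in the gateway's model list. Both function names are hypothetical, and the check assumes the response follows the usual OpenAI-style shape, with each model listed under an "id" field. If your gateway's output looks different, adjust the match.

```shell
# Hypothetical helper: model name = .gguf filename without directory or extension
gguf_model_name() {
  basename "$1" .gguf
}

# Hypothetical helper: reads the /models JSON on stdin, succeeds if "id": "<name>" is present.
# Whitespace is stripped first so the fixed-string match does not depend on JSON formatting.
model_is_served() {
  tr -d ' \t\n\r' | grep -qF -- "\"id\":\"$1\""
}

# Usage:
#   curl -s https://wideareaai.com/api/v1/models \
#     -H "Authorization: Bearer $OPENAI_API_KEY" \
#     | model_is_served "$OPENAI_MODEL" || echo "No node serves $OPENAI_MODEL"
```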

Step 5 — Raise the context window (don't skip this)

This is the step that makes or breaks coding agents on local hardware. Out of the box, a node's llama-server runs with a 4,096-token context window. That's fine for chat, but a coding agent stuffs its system prompt, your files, diffs, and the whole conversation into context. Qwen Code will blow past 4k in its first turn or two, and you'll see truncated answers or "context length exceeded" errors.

The fix takes ten seconds: open your node's detail page in the dashboard (Nodes → your node), find the context window setting, type 32768 (or just 32k), and hit save. The node picks the change up on its next heartbeat, restarts its inference server with the new window, and reports back — the whole thing takes under a minute, and the page shows the applied value when it's done.

Managing the node by hand instead? The same thing works through the agent's config file:

manual alternative
# Alternative: set it on the node machine itself (older agents)
# macOS / Linux:
echo 'WAI_LLAMA_ARGS="-ngl 999 -c 32768"' >> ~/.wai/env && wai restart
# Windows PowerShell:
Add-Content "$env:LOCALAPPDATA\wai\env.ps1" '$WAI_LLAMA_ARGS = "-ngl 999 -c 32768"'; wai restart

The trade-off is memory: the KV cache grows linearly with context length, on top of the model weights. For Qwen2.5-Coder-7B (Q4_K_M ≈ 4.7GB of weights):

context window | extra memory (KV cache) | total ≈ | fits on
4k (default) | ~0.2GB | ~5.5GB | 8GB VRAM (but too small for agents)
32k | ~1.8GB | ~7GB | 12GB VRAM / 16GB Mac (the sweet spot)
64k | ~3.6GB | ~9GB | 16GB VRAM / 32GB Mac
128k (model max) | ~7.2GB | ~12.5GB | 24GB VRAM / 32GB+ Mac
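
The KV-cache column can be approximated from the model's shape. The sketch below assumes the published Qwen2.5-7B architecture (28 layers, 4 key/value heads, head dimension 128) and a 16-bit cache with no KV quantization. These values are assumptions, not something the dashboard reports, and kv_cache_gib is a hypothetical helper name. It prints GiB, so expect results close to, not identical to, the rounded figures in the table.

```shell
# Hypothetical helper: estimate KV cache size in GiB for a given context length.
# Assumed model shape: Qwen2.5-7B (28 layers, 4 KV heads, head_dim 128), 2 bytes per value.
kv_cache_gib() {
  LC_ALL=C awk -v ctx="$1" 'BEGIN {
    layers = 28; kv_heads = 4; head_dim = 128; bytes = 2
    per_token = 2 * layers * kv_heads * head_dim * bytes   # keys + values, all layers
    printf "%.2f\n", ctx * per_token / (1024 * 1024 * 1024)
  }'
}

# Usage:
#   kv_cache_gib 32768
#   kv_cache_gib 65536
```

Under these assumptions each token costs 57,344 bytes of cache, which is why doubling the context window doubles the extra memory.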

Rule of thumb: start with -c 32768. It's enough for Qwen Code to hold a meaningful slice of your repo plus the conversation, and it fits alongside the 7B model on a 12GB GPU or a 16GB Mac. If you have headroom (check the node's detail page for memory math), go to 65536.

Two gotchas. ① If your node also sets --parallel N, the context is split between slots — each request gets c ÷ N tokens, so raise -c accordingly. ② Qwen Code itself also has a token budget for how much history it keeps; the server-side window is the hard ceiling, so set the server first.
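
The first gotcha is plain integer arithmetic, sketched here with two hypothetical helpers: one shows what each request gets for a given -c and --parallel value, the other shows what -c to set for a target per-request window.

```shell
# Hypothetical helper: tokens available to each request, given total -c and --parallel N
per_slot_ctx() {
  echo $(( $1 / $2 ))
}

# Hypothetical helper: total -c needed so every one of N slots gets the desired window
total_ctx_for() {
  echo $(( $1 * $2 ))
}

# Usage:
#   per_slot_ctx 32768 4     # -c 32768 with --parallel 4
#   total_ctx_for 32768 2    # 32k per request with --parallel 2
```

So a node running -c 32768 with --parallel 4 gives each request only 8,192 tokens, and keeping 32k per request with two slots means setting -c 65536, with the memory cost that implies.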

Step 6 — Start coding

That's it. Open a project, run qwen, and ask it to do something real: "add input validation to the signup form and write tests for it". Every request routes through Wide Area AI to your node; you can watch them arrive in real time on the dashboard's Overview page, and the Analytics page shows tokens and generation speed per model and per node.

Requests are load-balanced across all your ready nodes. If a node goes down mid-session (the machine reboots, someone trips over the power cord), requests can fail over to a cloud model on prepaid credits.

Tips

01

Pin Qwen Code to one node

Have multiple nodes with different models? Most OpenAI-compatible tools can send extra headers. Add X-WAI-Node: your-node-name to route every request to one specific machine, with no fallback.

02

Compare models before committing

The dashboard's Playground → Compare tab sends the same prompt to two nodes side by side, an easy way to decide whether the 14B model is worth the extra VRAM over the 7B.

03

Watch your speed

Coding agents are chatty. If generation feels slow, check the node's detail page: it estimates tokens/sec for every model size your hardware can run, so you can pick the best speed/quality trade-off.

04

Same setup, other tools

The exact same three environment variables work for Aider, Continue, Cline, and any other OpenAI-compatible tool. One gateway, one key, every tool pointed at your own hardware.
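
For the node-pinning tip above, a quick way to try the X-WAI-Node header outside any tool is a raw request. This is a sketch under assumptions: it supposes the gateway exposes the standard OpenAI chat completions path under the base URL from Step 4, the function name wai_chat_pinned is hypothetical, and the node name in the usage line is a placeholder.

```shell
# Hypothetical helper: send one chat request pinned to a specific node.
# $1 = node name, $2 = prompt (plain text only; quotes in the prompt would need JSON escaping)
wai_chat_pinned() {
  curl -sS "${OPENAI_BASE_URL}/chat/completions" \
    -H "Authorization: Bearer ${OPENAI_API_KEY}" \
    -H "Content-Type: application/json" \
    -H "X-WAI-Node: $1" \
    -d "{\"model\": \"${OPENAI_MODEL}\", \"messages\": [{\"role\": \"user\", \"content\": \"$2\"}]}"
}

# Usage (with the three environment variables from Step 4 exported):
#   wai_chat_pinned your-node-name "Say hello"
```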

Why this beats a cloud API for coding agents

Coding agents burn tokens — a single afternoon of agentic work can run through millions of tokens of context re-reads, diffs, and retries. On a metered API that's real money; on your own GPU it's the electricity you were already paying for. Add the privacy angle (your proprietary code never leaves machines you control) and the fact that Qwen2.5-Coder models now punch well above their weight, and self-hosting stops being the compromise option.

Create your gateway and put that GPU to work →
