← all posts
[ guide ]June 2, 20268 min read

Continue.dev in VS Code without the subscription

Continue is the open-source Copilot alternative — chat, autocomplete, and inline edit in VS Code and JetBrains. Wire it to a Qwen Coder model on your own GPU and pay nothing per request, with no code leaving hardware you own.

Continue is the open-source coding assistant for VS Code and JetBrains — the closest thing to GitHub Copilot you can fully own. It does the three things you actually use an AI assistant for: chat in a sidebar that can read your open files, tab-autocomplete as you type, and inline edit (highlight code, press a shortcut, describe the change). The extension is MIT-licensed and free. What it does not ship with is a model — you bring your own endpoint.

That is the whole opportunity. Point Continue at a model running on your own GPU through Wide Area Intelligence and you get a Copilot-class workflow with no subscription, no per-seat fee, and no code leaving hardware you control. This guide walks the real setup, including the part most tutorials skip: chat and autocomplete want completely different models.

The two-model reality

People treat "the model" as one decision. For Continue it is two, and they pull in opposite directions:

01

Chat + edit want a big, smart model

When you ask a question or request a multi-file change, you want reasoning and instruction-following. Latency barely matters — a second or two before the stream starts is fine. This is a job for Qwen2.5-Coder-14B or 32B.
02

Tab-autocomplete wants a tiny, fast model

Autocomplete fires on almost every keystroke pause. If the suggestion does not land in roughly 200ms, you have already typed past it and it is noise. That rules out big models — you want a 1.5B or 3B coder model whose entire job is fast fill-in-the-middle.

Here is why this matters for cost. A focused day of coding can fire thousandsof autocomplete requests. On a metered API that is a real line item; even Copilot's flat fee is a subscription you pay forever. On a GPU you already own, ten thousand completions cost the same as zero: the electricity you were burning anyway. Autocomplete volume is exactly the workload where owning the hardware wins hardest.

What you need

01

A Wide Area Intelligence account

Free for up to 2 nodes — which, conveniently, is exactly the chat node + autocomplete node split this guide recommends.
02

A machine with a GPU

An RTX 3060 (12GB), any Apple Silicon Mac with 16GB+ unified memory, or bigger. This is your node — it can be the laptop you code on, or a PC in another room entirely.
03

VS Code or a JetBrains IDE

Install the Continue extension from the marketplace. The config file is identical across both.

Step 1 — Bring a node online

In the dashboard, go to Nodes → Add a node, name it, and run the one-line installer on the GPU machine. It works on macOS (Metal), Linux (Docker with NVIDIA passthrough), and Windows (native CUDA — no WSL needed). The node opens an outbound Cloudflare Tunnel, so there is no port forwarding, no static IP, and no firewall surgery. Within a minute it shows CONNECTED; after its first model download it flips to READY.

Step 2 — Deploy a chat model

Go to Models, search qwen coder, and deploy the largest variant your node's memory can hold — the dashboard greys out quantizations that will not fit, so you cannot misjudge it. For chat and edit, Qwen2.5-Coder-14B-Instruct · Q4_K_M (~9GB) is the sweet spot on a 12–16GB GPU or a 32GB Mac. Click [ deploy ], pick your node, and watch the progress on the node's page.

Coding assistants lean hard on instruction-following. If your hardware can hold the 14B or 32B variant, use it — the jump from 7B is very noticeable in edit quality and in how well chat respects your constraints.

Step 3 — Create a gateway key

Go to API Keys → Create a key, name it continue, and copy the wai_sk_… value — it is shown once. One key per tool keeps your request logs readable and lets you revoke access for just this app later.

Step 4 — Write the Continue config

Continue reads ~/.continue/config.json (Windows: %USERPROFILE%\.continue\config.json). Open it from the command palette with Continue: Open config.json. The minimal chat-only setup is four lines that matter — provider, model, apiBase, apiKey:

config.json · chat + edit only
// ~/.continue/config.json — chat + edit only (simplest setup)
{
  "models": [
    {
      "title": "Qwen Coder 14B (my GPU)",
      "provider": "openai",
      "model": "Qwen2.5-Coder-14B-Instruct-Q4_K_M",
      "apiBase": "https://wideareaai.com/api/v1",
      "apiKey": "wai_sk_..."
    }
  ]
}

The provider is openai because Wide Area Intelligence speaks the OpenAI API — Continue does not need a Wide Area Intelligence-specific plugin. The model string must match what your node serves exactly: it is the .gguf filename without the extension, shown on your Nodes page. If you are unsure, ask the gateway directly:

list your models
# Confirm the exact model names your nodes serve right now
curl https://wideareaai.com/api/v1/models \
  -H "Authorization: Bearer wai_sk_..."

Save, and the chat sidebar plus inline edit (highlight code, then Cmd/Ctrl+I) are live. Requests route through the gateway to your node, load-balanced across any ready nodes serving that model, with optional cloud failover if the machine drops.

Step 5 — Add fast autocomplete

Autocomplete is a separate tabAutocompleteModel block. The clean answer: run a second node dedicated to a small, fast model. Deploy Qwen2.5-Coder-1.5B-Instruct · Q4_K_M (~1GB, runs comfortably on almost anything, including an old laptop or a mini-PC you leave on), then pin each model to its own node with the X-WAI-Node header so completions never wait behind a heavy chat request:

config.json · chat + autocomplete, two nodes
// ~/.continue/config.json — chat through WAI, autocomplete on a 2nd node
{
  "models": [
    {
      "title": "Qwen Coder 14B (my GPU)",
      "provider": "openai",
      "model": "Qwen2.5-Coder-14B-Instruct-Q4_K_M",
      "apiBase": "https://wideareaai.com/api/v1",
      "apiKey": "wai_sk_...",
      "requestOptions": {
        "headers": { "X-WAI-Node": "workstation" }
      }
    }
  ],
  "tabAutocompleteModel": {
    "title": "Qwen Coder 1.5B (fast)",
    "provider": "openai",
    "model": "Qwen2.5-Coder-1.5B-Instruct-Q4_K_M",
    "apiBase": "https://wideareaai.com/api/v1",
    "apiKey": "wai_sk_...",
    "requestOptions": {
      "headers": { "X-WAI-Node": "always-on-mini" }
    }
  },
  "tabAutocompleteOptions": {
    "debounceDelay": 350,
    "maxPromptTokens": 1024
  }
}

Be honest with yourself about latency, though. A completion request travels from your editor, out to the gateway, down the Cloudflare Tunnel to your node, and back — and that round trip adds roughly 50–100mson top of the model's own generation time, even when the node is in the next room. For chat that is invisible. For autocomplete it eats into your 200ms budget. The debounceDelay of 350ms above helps by not firing on every keystroke, but if your node is geographically far away, tab-autocomplete through any tunnel will feel slightly behind.

So here is the setup a lot of people actually land on, and it is a legitimate one: local Ollama for autocomplete, Wide Area Intelligence for chat. Autocomplete runs on the same machine as your editor (zero network hop, sub-100ms total), while the heavyweight chat model lives on a beefier GPU box you reach through the gateway. You get the best of both:

config.json · hybrid (local tab + remote chat)
// ~/.continue/config.json — local Ollama for tab, WAI for chat (hybrid)
{
  "models": [
    {
      "title": "Qwen Coder 14B (remote GPU)",
      "provider": "openai",
      "model": "Qwen2.5-Coder-14B-Instruct-Q4_K_M",
      "apiBase": "https://wideareaai.com/api/v1",
      "apiKey": "wai_sk_..."
    }
  ],
  "tabAutocompleteModel": {
    "title": "Qwen Coder 1.5B (local Ollama)",
    "provider": "ollama",
    "model": "qwen2.5-coder:1.5b"
  }
}

The hybrid is not a cop-out — it is correct engineering. Put the latency-critical workload (autocomplete) where the network can't hurt it, and the compute-heavy workload (chat) where the big GPU is. Wide Area AI is what lets that big GPU live on another machine without port forwarding.

Step 6 — Raise the context window

Do not skip this on the chat node. Out of the box a node's llama-server runs a 4,096-tokencontext window — fine for a one-liner, but Continue's chat folds in your open files, selected code, and the conversation, and edit mode sends whole functions. You will hit truncated answers fast.

Open the chat node's detail page (Nodes → your node), find the context window setting, enter 32768 (or just 32k), and save. The node restarts its inference server on the next heartbeat and reports the applied value — under a minute. The autocomplete node can stay small; completions use a tight prompt, so the default window is fine there.

How this stacks up

The honest comparison. Continue + Wide Area Intelligence is not strictly better than the commercial tools — it is a different set of trade-offs, and for some teams the paid products are the right call.

Continue + WAIGitHub CopilotCursor
Cost / month$0 (your power bill)$10 individual / $19 business$20 Pro
Code privacyStays on your hardwareSent to GitHub/OpenAISent to Cursor/providers
Chat modelAny GGUF you deployFixed (GPT/Claude tier)Fixed menu (GPT/Claude)
AutocompleteLocal or your GPUProprietary, very fastProprietary, very fast
Works offlineYes (local Ollama tab)NoNo
EditorVS Code + JetBrainsBroadOwn VS Code fork only
Setup effortConfig file + a nodeOne clickOne click

The honest trade-offs

Where the paid tools genuinely win: autocomplete latency and polish. Copilot and Cursor have spent years tuning purpose-built completion models served from low-latency infrastructure, and on raw tab-completion responsiveness they are hard to beat. If you live and die by instant ghost-text and don't care where your code goes, pay them — that is a reasonable choice.

Where Continue + Wide Area Intelligence wins: cost at volume, privacy, and model choice. You are never metered, your proprietary code never leaves machines you own, and you can swap in any GGUF on Hugging Face the day it drops instead of waiting for a vendor to add it. For a team that already has GPUs, or anyone under a contractual obligation to keep source code in-house, that is not a nice-to-have — it is the only option that qualifies.

The middle ground is the hybrid: local autocomplete for snappiness, your own GPU for chat and edit, zero subscription. That is the setup I'd point most developers at.

Tips

01

Use @-context in chat

Continue's @file, @code, and @docs providers feed precise context to your model. Pair them with the 32k window so the node actually has room to hold what you send.
02

Compare models before committing VRAM

The dashboard's Playground → Compare tab runs the same prompt against two nodes side-by-side — a quick way to decide whether 32B chat is worth the extra memory over 14B.
03

One key, every tool

The same gateway and key work for Aider, Cline, Qwen Code, and any OpenAI-compatible client. Continue is just one tool pointed at your hardware.

Put it together

The flow is always the same three moves: deploy a model to a node, create a key, point the tool at the gateway. For Continue that means a 14B coder model on your GPU for chat and edit, an optional small model (on a second node or local Ollama) for autocomplete, and a few lines in config.json. The result is a Copilot-class assistant with no subscription and no code leaving hardware you own.

Create your gateway and drop the subscription →

/// get started

That GPU is already paid for.
Put it on the network.

Create your gateway — free →