field notes

The blog.
Running AI where you can touch it.

[ guide ] June 24, 2026 · 9 min read

What Is an LLM Gateway? (and Why Route Local + Cloud)

An LLM gateway is a single OpenAI-compatible endpoint that routes requests across model backends — your own GPUs and the cloud. Here's what a gateway (a.k.a. LLM router) actually does, and why unifying local + cloud behind one URL is the setup serious local-LLM users land on.

[ guide ] June 24, 2026 · 11 min read

Best Local LLMs in 2026

An honest, ranked guide to the best open-weight LLMs you can run on your own GPU in 2026 (Llama, Qwen3, DeepSeek, Mistral, Gemma, Phi), each with size, VRAM, and what it's actually good at. Plus the one tool that tells you whether your card can run it.

[ guide ] June 24, 2026 · 11 min read

Best Local LLMs for Coding in 2026

The best open-weight coding models you can run on your own GPU in 2026 — Qwen3 Coder, DeepSeek Coder, Codestral/Devstral, and more — ranked by size and VRAM, plus how an OpenAI-compatible gateway lets Cline, Aider, Continue, and other agents use them.

[ guide ] June 24, 2026 · 10 min read

How to Use Claude Code & Coding Agents with a Local LLM

Can you run Claude Code on a local LLM? Technically yes — but there are two real catches: it speaks the Anthropic API, and its large system prompt eats a small model's context window. Here's the honest path, plus the lighter terminal agent that actually works well on your own GPU.

[ comparison ] June 24, 2026 · 9 min read

Best Ollama Alternatives in 2026

Ollama is the default on-ramp to local LLMs, but it isn't the only one. An honest rundown of the best alternatives — LM Studio, Jan, llama.cpp, LocalAI, vLLM — and where a gateway like Wide Area Intelligence fits when localhost stops being enough.

[ comparison ] June 24, 2026 · 9 min read

Best LM Studio Alternatives in 2026

LM Studio has the best desktop GUI for local LLMs, but it's closed source and single-machine. The best alternatives in 2026 — Jan, Ollama, llama.cpp, GPT4All, vLLM — and where a gateway like Wide Area Intelligence makes any of them reachable from anywhere.

[ comparison ] June 24, 2026 · 9 min read

Best OpenRouter Alternatives in 2026

OpenRouter unifies cloud LLM providers behind one API — but every token still bills to someone else's GPU. The best alternatives in 2026, including the option OpenRouter doesn't offer: routing to GPUs you own first, with cloud as failover.

[ guide ] June 24, 2026 · 10 min read

How to Run Llama 3 Locally (2026 Guide)

A practical 2026 guide to running Meta's Llama 3 family on your own GPU — picking the right size and quant, the fastest way to get it chatting, and how to make it reachable from your other machines.

[ guide ] June 24, 2026 · 10 min read

How to Run Qwen Locally (2026 Guide)

Alibaba's Qwen3 family, including the standout Qwen3-Coder models, is among the best open-weight model families you can run at home. A 2026 guide to picking a size, running it on your GPU, and making it reachable from your editor anywhere.

[ guide ] June 24, 2026 · 10 min read

How to Run DeepSeek Locally (2026 Guide)

DeepSeek's R1 reasoning models are open-weight — but the full model is enormous. The 2026 guide to running DeepSeek locally: which distill actually fits your GPU, how to run it, and how to reach for the full model only when you need it.

[ deep dive ] June 11, 2026 · 7 min read

Anatomy of an inference request: how the gateway decides where it runs

Your code calls one OpenAI-compatible endpoint, but each request quietly walks three stages — edge cache, your own GPU, then capability-aware cloud failover. Here's the whole decision path, and the settings that shape it.

[ deep dive ] June 11, 2026 · 8 min read

Failover that doesn't break your app: capability-aware routing

Swapping models on the fly — for cost or for resilience — quietly breaks any request that needs vision or tools, if the substitute can't do them. Here's how the gateway routes by what a request actually needs, and how to set a vision model and backups.

[ guide ] June 10, 2026 · 8 min read

Run the oh-my-pi coding agent on your own GPU with Wide Area Intelligence

Point can1357's oh-my-pi (omp) terminal agent at hardware you own: install it, declare a WideAreaAI provider in models.yml, set it as the default model, and code against your own Gemma node with zero token costs.

[ guide ] June 2, 2026 · 8 min read

Aider + your own GPU: AI pair programming with zero API bill

Aider is a terminal AI pair programmer that edits your code and commits the diffs. Point it at a coder model running on your own GPU through Wide Area Intelligence and you get git-aware AI pairing with no per-token bill, no rate limits, and full privacy.

[ guide ] June 2, 2026 · 8 min read

Continue.dev in VS Code without the subscription

Continue is the open-source Copilot alternative — chat, autocomplete, and inline edit in VS Code and JetBrains. Wire it to a Qwen Coder model on your own GPU and pay nothing per request, with no code leaving hardware you own.

[ guide ] June 2, 2026 · 7 min read

Cline on your own hardware: the honest guide

Cline is an autonomous VS Code agent built for frontier models — so most local LLM setups disappoint. Here is what actually works: a 32B coder model, a big context window, and your own GPU behind one endpoint.

[ deep dive ] June 2, 2026 · 9 min read

The real cost of the OpenAI API vs your gaming PC (2026 numbers)

A no-fluff cost breakdown: real 2026 API prices, real electricity math for an RTX 4090, honest break-even points, and the hidden factors a calculator won't show you. With a verdict for every usage level.

[ deep dive ] June 2, 2026 · 9 min read

How much VRAM do you actually need? The honest LLM sizing guide

Stop guessing whether a 70B model fits your GPU. The real formula is weights + KV cache + overhead — here are the numbers for every popular model, every context length, and every GPU tier, with no marketing rounding.

[ deep dive ] June 2, 2026 · 8 min read

Q4_K_M vs Q5_K_M vs Q8_0: what quantization actually costs you

Open a GGUF repo on Hugging Face and you get 20 files. This is how the naming works, how much quality each quant really loses, why smaller is faster, and which one to actually download.

[ comparison ] June 2, 2026 · 9 min read

Ollama vs LM Studio vs llama.cpp vs Wide Area Intelligence (2026)

Four ways to run the same open-weight model on your own hardware, and why they do different jobs rather than compete. An honest breakdown of where each one wins, with a full feature comparison.

[ deep dive ] June 2, 2026 · 8 min read

Expose your local LLM to the internet safely — no port forwarding

Your model runs great on localhost — until you're at work, on your phone, or shipping an app. A clear-eyed comparison of port forwarding, Tailscale, ngrok, and Cloudflare Tunnel, plus the auth architecture that makes a public endpoint safe.

[ guide ] June 2, 2026 · 7 min read

Your GPU should work the night shift: batch inference on idle hardware

A coding agent uses your GPU for two hours a day. The other twenty-two, $1,600 of silicon sits idle. Here's how to queue overnight batch jobs — dataset labeling, embeddings, summarization — and let your nodes chew through them while you sleep.

[ guide ] June 1, 2026 · 9 min read

Run Qwen Code on your own GPU with Wide Area Intelligence

Point Alibaba's open-source coding agent at hardware you own: deploy a Qwen Coder model to a node, wire up three environment variables, and code with zero token costs and full privacy.
