Your model runs beautifully. You typed llama-server, watched the weights load, and got tokens streaming back at a healthy clip. Then you closed the laptop, walked to the office, and reached for that same model from your work machine — and there's nothing there. localhostdoesn't travel.
This is the wall everyone running a local LLM hits eventually. You want to query the model from your phone, from a second computer, from a teammate's laptop, or — the big one — from an app you're building that needs a stable HTTP endpoint. The model is on your GPU at home; the request is coming from somewhere else. Whether you run Ollama, llama.cpp, LM Studio, or vLLM, the problem and the options are identical.
There are four common ways to bridge that gap. They are not equally good, and most of the advice online glosses over the one that matters most: authentication. Exposing an inference server to the internet without it is handing strangers free compute on your hardware. Let's go through all four honestly.
Option 1 — Port forwarding (don't)
The instinct is to open port 8080 on your router and forward it to your machine. It works, technically, and it's the worst option on the list. Here's what you're actually signing up for:
You expose llama-server directly to the entire internet
No auth by default
llama-server and Ollama serve anyone who connects. Unless you explicitly add an API key, your GPU is now a free public endpoint.Your home IP becomes public
Dynamic IPs break it
If you take one thing from this post: never put a raw inference server on an open inbound port. Even with an API key, you're exposing your home IP and trusting a piece of software that was never meant to face the public internet.
Option 2 — Tailscale / WireGuard
A mesh VPN is genuinely excellent — for personal use. Tailscale (a friendly layer over WireGuard) puts all your devices on one private encrypted network. Your phone, your work laptop, and your GPU box get stable internal addresses like 100.x.y.z, and you reach the model as if it were on the same LAN. No ports open to the public, no home IP exposed, traffic encrypted end to end.
The catch is in the word private. Every client that wants to reach the model has to be on the VPN. That's fine for your own devices — install Tailscale once and forget it. It falls apart the moment you need a publicendpoint: a web app you're shipping, a webhook, a serverless function, a teammate who isn't going to install your VPN. There's no URL you can hand to an app that lives outside the mesh.
Use Tailscale when the only consumers are devices you personally own and control. Reach for something else the moment "an application" needs to call your model.
Option 3 — ngrok / localtunnel
These give you a public URL in one command. Great for a demo or a quick share — paste the link, someone hits your model, done.
# Quick public URL for a llama-server on :8080 ngrok http 8080 # Forwarding https://a1b2-203-0-113-7.ngrok-free.app -> http://localhost:8080 # Note: the URL changes every restart on the free tier, # and anyone who finds it can hit your model — no auth.
For anything beyond a demo, the trade-offs add up fast. On the free tier the URL rotates every restart, which is fatal for an app that needs a constant base URL. There are connection and rate limits. A stable custom domain means a paid plan. And like port forwarding, the tunnel itself adds no authentication— if you don't put a key on the inference server, the random URL is the only thing standing between strangers and your GPU, and random URLs leak.
Option 4 — Cloudflare Tunnel
This is the one that actually fits the "public endpoint for apps" case, and it's what Wide Area Intelligence automates under the hood. A small daemon (cloudflared) running next to your model opens an outboundconnection to Cloudflare's edge and holds it open. Requests to your hostname arrive at Cloudflare and ride back down that existing connection to your machine.
Read that again, because it's the whole point: nothing inbound is ever opened. There is no port on your router. There is no public listener on your home IP. The connection is always initiated from inside your network outward — the same direction as any browser tab — so it sails through NAT and firewalls untouched, and it survives a dynamic IP change because the tunnel re-establishes itself. The hostname is stable, and the free tier covers personal use comfortably.
Cloudflare Tunnel solves connectivity, not authorization. The edge still forwards every request to your model. You must still put a key on the inference server — the tunnel being outbound-only protects your network, not your GPU budget.
How they actually compare
Five questions decide which tool you want: how long it takes to set up, whether it protects your network, whether the URL stays put, whether an app (not just you) can call it, and what it costs.
| approach | setup | your home ip | stable url | works for apps | cost |
|---|---|---|---|---|---|
| Port forwarding | 10 min | exposed | no (dynamic) | yes, unsafely | free |
| Tailscale / WireGuard | 5 min | hidden | yes (private) | no — VPN only | free / paid |
| ngrok / localtunnel | 1 min | hidden | paid only | demos only | free / paid |
| Cloudflare Tunnel | 15 min | hidden | yes | yes | free |
| Wide Area Intelligence | 2 min | hidden | yes | yes | free (2 nodes) |
Cloudflare Tunnel wins on the merits for a public, app-ready endpoint — the only cost is a fiddly first-time setup and the fact that you still have to bolt authentication on yourself. That last gap is exactly what Wide Area Intelligence exists to close.
How Wide Area Intelligence does it under the hood
WAI is a managed gateway built on the Cloudflare Tunnel pattern, with the authentication and routing already wired in. When you add a node from the dashboard, the one-line installer drops a small agenton your GPU machine. Here's the request path, end to end:
The agent opens an outbound tunnel
The gateway authenticates apps with hashed API keys
wai_sk_… key in the Authorization header. The gateway looks it up — keys are stored only as SHA-256 hashes, never in plaintext, so a database leak exposes no usable credentials.llama-server itself demands a separate key
Requests route through Cloudflare's edge to your node
X-WAI-Node to pin a request to one specific machine.The two-layer key design is the part most DIY setups skip. A gateway key authorizes your application; a separate node-derived key protects the inference server. Compromising one doesn't hand over the other, and because the node key is generated and held on your own hardware, it's never transmitted to the gateway at all.
The DIY version (honest)
You can absolutely build this yourself — Cloudflare Tunnel is a public product and llama.cpp supports API keys. If you only have one machine and one consumer, rolling your own is a perfectly reasonable weekend project. Here's the shape of it.
First, set up cloudflared and a named tunnel:
# 1. Install cloudflared and log in (opens a browser once) brew install cloudflared # macOS — or download the binary cloudflared tunnel login # 2. Create a named tunnel; this writes a credentials JSON file cloudflared tunnel create my-llm # Created tunnel my-llm with id 6f4e...c91a
Map a hostname you own to the local inference port:
# 3. ~/.cloudflared/config.yml — route a hostname to local llama-server
tunnel: 6f4e...c91a
credentials-file: /Users/you/.cloudflared/6f4e...c91a.json
ingress:
- hostname: llm.example.com
service: http://localhost:8080
- service: http_status:404
# 4. Point DNS at the tunnel, then run it
# cloudflared tunnel route dns my-llm llm.example.com
# cloudflared tunnel run my-llmAnd — the step you must not skip — start the inference server with a key, so the tunnel URL by itself can't be abused:
# 5. NEVER expose llama-server without a key. Start it with one: llama-server -m ./Qwen2.5-Coder-7B-Q4_K_M.gguf \ -ngl 999 -c 32768 \ --host 127.0.0.1 --port 8080 \ --api-key "$(openssl rand -hex 32)" # Now the tunnel URL alone is useless — every request needs: # Authorization: Bearer <that key>
That gets you a stable, authenticated, public endpoint for one model on one machine. What it doesn't get you is everything around it: load-balancing across several nodes, cloud failover when a machine reboots, a dashboard showing live requests and tokens-per-second, deploying a new Hugging Face GGUF with one click, swapping context windows without SSHing in, per-app keys you can revoke individually, and usage analytics. WAI is the same Cloudflare Tunnel architecture you just wired by hand — plus the routing, security, and operations layer that turns one tunnel into a fleet.
Doing it yourself is the right call for a single model you query occasionally. Once you have two machines, an app in production, or anyone besides yourself depending on it, the operations work is where the time goes — and that's the part WAI takes over.
Getting there in two minutes
If you'd rather not maintain tunnels, keys, and DNS by hand, the managed path is short: bring a node online, deploy a model, mint a key, and point your tool at the gateway.
Add a node
Deploy a model
[ deploy ]. The dashboard shows which quantizations fit your hardware.Create a key and point your app at it
https://wideareaai.com/api/v1 with the wai_sk_… key. Any OpenAI-compatible client now reaches your GPU from anywhere.Your model stops being a thing that only works at your desk and becomes a real endpoint — safe to call from your phone, your app, or another continent, with your home network never exposed.