← all posts
[ deep dive ]June 2, 20268 min read

Expose your local LLM to the internet safely — no port forwarding

Your model runs great on localhost — until you're at work, on your phone, or shipping an app. A clear-eyed comparison of port forwarding, Tailscale, ngrok, and Cloudflare Tunnel, plus the auth architecture that makes a public endpoint safe.

Your model runs beautifully. You typed llama-server, watched the weights load, and got tokens streaming back at a healthy clip. Then you closed the laptop, walked to the office, and reached for that same model from your work machine — and there's nothing there. localhostdoesn't travel.

This is the wall everyone running a local LLM hits eventually. You want to query the model from your phone, from a second computer, from a teammate's laptop, or — the big one — from an app you're building that needs a stable HTTP endpoint. The model is on your GPU at home; the request is coming from somewhere else. Whether you run Ollama, llama.cpp, LM Studio, or vLLM, the problem and the options are identical.

There are four common ways to bridge that gap. They are not equally good, and most of the advice online glosses over the one that matters most: authentication. Exposing an inference server to the internet without it is handing strangers free compute on your hardware. Let's go through all four honestly.

Option 1 — Port forwarding (don't)

The instinct is to open port 8080 on your router and forward it to your machine. It works, technically, and it's the worst option on the list. Here's what you're actually signing up for:

01

You expose llama-server directly to the entire internet

Inference servers are not hardened web servers. They're built to sit behind something. Scanners find open ports within hours.
02

No auth by default

Out of the box, llama-server and Ollama serve anyone who connects. Unless you explicitly add an API key, your GPU is now a free public endpoint.
03

Your home IP becomes public

The endpoint is your residential IP address — tied to your name, your location, your other devices on the same network.
04

Dynamic IPs break it

Most home connections rotate IP addresses. Your endpoint silently dies whenever your ISP renews the lease, so you're back to dynamic-DNS hacks.

If you take one thing from this post: never put a raw inference server on an open inbound port. Even with an API key, you're exposing your home IP and trusting a piece of software that was never meant to face the public internet.

Option 2 — Tailscale / WireGuard

A mesh VPN is genuinely excellent — for personal use. Tailscale (a friendly layer over WireGuard) puts all your devices on one private encrypted network. Your phone, your work laptop, and your GPU box get stable internal addresses like 100.x.y.z, and you reach the model as if it were on the same LAN. No ports open to the public, no home IP exposed, traffic encrypted end to end.

The catch is in the word private. Every client that wants to reach the model has to be on the VPN. That's fine for your own devices — install Tailscale once and forget it. It falls apart the moment you need a publicendpoint: a web app you're shipping, a webhook, a serverless function, a teammate who isn't going to install your VPN. There's no URL you can hand to an app that lives outside the mesh.

Use Tailscale when the only consumers are devices you personally own and control. Reach for something else the moment "an application" needs to call your model.

Option 3 — ngrok / localtunnel

These give you a public URL in one command. Great for a demo or a quick share — paste the link, someone hits your model, done.

ngrok
# Quick public URL for a llama-server on :8080
ngrok http 8080
# Forwarding  https://a1b2-203-0-113-7.ngrok-free.app -> http://localhost:8080
# Note: the URL changes every restart on the free tier,
# and anyone who finds it can hit your model — no auth.

For anything beyond a demo, the trade-offs add up fast. On the free tier the URL rotates every restart, which is fatal for an app that needs a constant base URL. There are connection and rate limits. A stable custom domain means a paid plan. And like port forwarding, the tunnel itself adds no authentication— if you don't put a key on the inference server, the random URL is the only thing standing between strangers and your GPU, and random URLs leak.

Option 4 — Cloudflare Tunnel

This is the one that actually fits the "public endpoint for apps" case, and it's what Wide Area Intelligence automates under the hood. A small daemon (cloudflared) running next to your model opens an outboundconnection to Cloudflare's edge and holds it open. Requests to your hostname arrive at Cloudflare and ride back down that existing connection to your machine.

Read that again, because it's the whole point: nothing inbound is ever opened. There is no port on your router. There is no public listener on your home IP. The connection is always initiated from inside your network outward — the same direction as any browser tab — so it sails through NAT and firewalls untouched, and it survives a dynamic IP change because the tunnel re-establishes itself. The hostname is stable, and the free tier covers personal use comfortably.

Cloudflare Tunnel solves connectivity, not authorization. The edge still forwards every request to your model. You must still put a key on the inference server — the tunnel being outbound-only protects your network, not your GPU budget.

How they actually compare

Five questions decide which tool you want: how long it takes to set up, whether it protects your network, whether the URL stays put, whether an app (not just you) can call it, and what it costs.

approachsetupyour home ipstable urlworks for appscost
Port forwarding10 minexposedno (dynamic)yes, unsafelyfree
Tailscale / WireGuard5 minhiddenyes (private)no — VPN onlyfree / paid
ngrok / localtunnel1 minhiddenpaid onlydemos onlyfree / paid
Cloudflare Tunnel15 minhiddenyesyesfree
Wide Area Intelligence2 minhiddenyesyesfree (2 nodes)

Cloudflare Tunnel wins on the merits for a public, app-ready endpoint — the only cost is a fiddly first-time setup and the fact that you still have to bolt authentication on yourself. That last gap is exactly what Wide Area Intelligence exists to close.

How Wide Area Intelligence does it under the hood

WAI is a managed gateway built on the Cloudflare Tunnel pattern, with the authentication and routing already wired in. When you add a node from the dashboard, the one-line installer drops a small agenton your GPU machine. Here's the request path, end to end:

01

The agent opens an outbound tunnel

No inbound ports, no port forwarding, no static IP. The agent dials out to the gateway through Cloudflare's edge and keeps that connection alive. Your router config never changes.
02

The gateway authenticates apps with hashed API keys

Your app sends a wai_sk_… key in the Authorization header. The gateway looks it up — keys are stored only as SHA-256 hashes, never in plaintext, so a database leak exposes no usable credentials.
03

llama-server itself demands a separate key

The inference server runs with its own API key derived from the node key, which never leaves the machine. So even if someone learned your tunnel URL, a raw request bounces — the edge URL alone is useless without going through the authenticated gateway.
04

Requests route through Cloudflare's edge to your node

The gateway picks a ready node, forwards the request down its outbound tunnel, streams tokens back, and logs usage. Add X-WAI-Node to pin a request to one specific machine.

The two-layer key design is the part most DIY setups skip. A gateway key authorizes your application; a separate node-derived key protects the inference server. Compromising one doesn't hand over the other, and because the node key is generated and held on your own hardware, it's never transmitted to the gateway at all.

The DIY version (honest)

You can absolutely build this yourself — Cloudflare Tunnel is a public product and llama.cpp supports API keys. If you only have one machine and one consumer, rolling your own is a perfectly reasonable weekend project. Here's the shape of it.

First, set up cloudflared and a named tunnel:

install + create tunnel
# 1. Install cloudflared and log in (opens a browser once)
brew install cloudflared              # macOS — or download the binary
cloudflared tunnel login

# 2. Create a named tunnel; this writes a credentials JSON file
cloudflared tunnel create my-llm
# Created tunnel my-llm with id 6f4e...­c91a

Map a hostname you own to the local inference port:

~/.cloudflared/config.yml
# 3. ~/.cloudflared/config.yml — route a hostname to local llama-server
tunnel: 6f4e...c91a
credentials-file: /Users/you/.cloudflared/6f4e...c91a.json

ingress:
  - hostname: llm.example.com
    service: http://localhost:8080
  - service: http_status:404

# 4. Point DNS at the tunnel, then run it
#    cloudflared tunnel route dns my-llm llm.example.com
#    cloudflared tunnel run my-llm

And — the step you must not skip — start the inference server with a key, so the tunnel URL by itself can't be abused:

llama-server with auth
# 5. NEVER expose llama-server without a key. Start it with one:
llama-server -m ./Qwen2.5-Coder-7B-Q4_K_M.gguf \
  -ngl 999 -c 32768 \
  --host 127.0.0.1 --port 8080 \
  --api-key "$(openssl rand -hex 32)"

# Now the tunnel URL alone is useless — every request needs:
#   Authorization: Bearer <that key>

That gets you a stable, authenticated, public endpoint for one model on one machine. What it doesn't get you is everything around it: load-balancing across several nodes, cloud failover when a machine reboots, a dashboard showing live requests and tokens-per-second, deploying a new Hugging Face GGUF with one click, swapping context windows without SSHing in, per-app keys you can revoke individually, and usage analytics. WAI is the same Cloudflare Tunnel architecture you just wired by hand — plus the routing, security, and operations layer that turns one tunnel into a fleet.

Doing it yourself is the right call for a single model you query occasionally. Once you have two machines, an app in production, or anyone besides yourself depending on it, the operations work is where the time goes — and that's the part WAI takes over.

Getting there in two minutes

If you'd rather not maintain tunnels, keys, and DNS by hand, the managed path is short: bring a node online, deploy a model, mint a key, and point your tool at the gateway.

01

Add a node

In the dashboard, Nodes → Add a node, then run the one-line installer on your GPU machine. It opens the outbound Cloudflare Tunnel for you — no port forwarding, no firewall edits.
02

Deploy a model

On the Models page, search any Hugging Face GGUF and click [ deploy ]. The dashboard shows which quantizations fit your hardware.
03

Create a key and point your app at it

API Keys → Create a key, then set your base URL to https://wideareaai.com/api/v1 with the wai_sk_… key. Any OpenAI-compatible client now reaches your GPU from anywhere.

Your model stops being a thing that only works at your desk and becomes a real endpoint — safe to call from your phone, your app, or another continent, with your home network never exposed.

Put your local model on a safe public endpoint →

/// get started

That GPU is already paid for.
Put it on the network.

Create your gateway — free →