
Routing & failover

The gateway speaks the OpenAI API, but where a request actually runs is its own decision. It tries to serve every request from the cheapest place that can answer — the edge cache, then your own hardware, then the credit-billed cloud — and it never hands a request to a model that can't serve it.

The request pipeline

Every chat completion walks the same three stages. The first one that can answer wins; the rest never run.

Figure: request pipeline (local-first). The first stage that can answer wins: the cache short-circuits repeats, your nodes serve everything they can for free, and only what's left reaches the cloud, where the chain only offers models that can actually serve the request.

Edge cache

Identical requests are served from a KV cache at the edge — instantly, for free. Toggle it and set the TTL in Settings.

Your GPU nodes

On a cache miss, the gateway ranks your online nodes — ones already serving the requested model first, then the least loaded — and routes there over a private tunnel. This runs on hardware you own, so there are no token fees.

Cloud failover

Only when no node can serve the request does it reach the credit-billed cloud, through a capability-aware model chain (below). You're billed for the model that actually answered.
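The three stages can be sketched in a few lines of Python. This is an illustrative model, not the gateway's actual code: the `Node` fields, the `active_requests` load signal, and the `route` function are hypothetical names, and the sketch ignores the busy-node policies and capability checks described further down.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    name: str
    loaded_model: str
    active_requests: int  # hypothetical load signal
    online: bool = True


def rank_nodes(nodes: list[Node], model: str) -> list[Node]:
    """Online nodes only: ones already serving `model` first, then least loaded."""
    online = [n for n in nodes if n.online]
    # False sorts before True, so nodes with the model loaded come first.
    return sorted(online, key=lambda n: (n.loaded_model != model, n.active_requests))


def route(cache: dict[str, str], nodes: list[Node], cache_key: str,
          model: str) -> tuple[str, Optional[str]]:
    """Return (stage, target node). The first stage that can answer wins."""
    if cache_key in cache:
        return "cache", None           # stage 1: edge cache hit
    ranked = rank_nodes(nodes, model)
    if ranked:
        return "node", ranked[0].name  # stage 2: your own hardware
    return "cloud", None               # stage 3: credit-billed failover
```

Later stages never run once an earlier one returns, which is why a cache hit costs nothing and the cloud is only reached when no node is available.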

When every node is busy

The cloud failover policy decides what happens when all your ready nodes are at capacity:

policy	behavior	optimizes
queue	Always prefer local; the node queues internally. Cloud only after the node timeout.	cost (free-first)
overflow	All nodes busy? Go straight to cloud.	latency
wait	Hold the request up to a set timeout for a node to free up, then cloud.	balance
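As a rough sketch, the three policies in the table reduce to one decision made while every ready node is at capacity. The function name, the return labels, and the two timeout parameters are assumptions for illustration; the gateway's real settings may be named and structured differently.

```python
from enum import Enum


class Policy(Enum):
    QUEUE = "queue"
    OVERFLOW = "overflow"
    WAIT = "wait"


def when_all_busy(policy: Policy, waited_s: float,
                  node_timeout_s: float, wait_timeout_s: float) -> str:
    """Next step for a request while all ready nodes are at capacity."""
    if policy is Policy.OVERFLOW:
        return "cloud"  # latency first: never wait for a node
    if policy is Policy.QUEUE:
        # Cost first: stay local until the node timeout expires.
        return "cloud" if waited_s >= node_timeout_s else "queue_on_node"
    # WAIT: hold for a bounded time, then fall back to the cloud.
    return "cloud" if waited_s >= wait_timeout_s else "hold"
```

For example, with a 30-second node timeout and a 5-second wait timeout, a request that has waited 10 seconds stays local under `queue` but has already moved to the cloud under `wait`.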

Substitution mode

With the "Always use default models" setting on, the gateway ignores the model a request names and serves your account defaults instead: your local default on your nodes, your cloud default on failover. This is how an app hard-coded to a cloud-only model (a coding agent pinned to a specific model, say) can run on your own hardware for free. Responses report the model that actually served the request.

One exception keeps it safe: a request carrying an image is never substituted onto a text-only default (see Capability-aware failover below).

Capability-aware failover

Providers don't all support the same features. If a request uses vision (an image_url part) or tool-calling, the gateway only routes it to models that can actually serve it. The model you explicitly named always heads the chain untouched; every fallback the gateway picks for you is capability-filtered, so a substitution or a provider outage can never silently swap in a model that returns blanks or a hard error.

Example: resolved chain for a vision request (capability-filtered). The request carries an image_url part, so it needs a model that can see.

#	role	model	capabilities	verdict
01	caller	openai/gpt-4o-mini	vision, tools	✓ head — never filtered
02	vision default	google/gemini-2.5-flash	vision, tools	✓ leads image requests
03	vision backup	anthropic/claude-sonnet	vision, tools	✓ if Google is down
04	text default	google/gemma-text-only	tools	✗ pruned — can't see
05	platform	openai/gpt-4o-mini	vision, tools	✓ always backstops

The text-only default is dropped for this request — sending the image there would return blanks. For a plain text request it stays in the chain. The filter is per-request, not per-account.

Image requests also skip local nodes. A node reports which model it has loaded but sends no capability signal, so the gateway can't tell a vision model from a text-only one. Vision requests therefore go straight to a vision-capable cloud model.
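The per-request filter can be illustrated with a short sketch. It assumes an OpenAI-style request body, where an image arrives as a content part with `"type": "image_url"` and tools arrive in a top-level `tools` list; the function names and the capability sets are hypothetical, with the model list taken from the example table above.

```python
def required_capabilities(body: dict) -> set[str]:
    """Infer what a chat-completion request needs from its body."""
    needs: set[str] = set()
    if body.get("tools"):
        needs.add("tools")
    for message in body.get("messages", []):
        content = message.get("content")
        # Multimodal content is a list of parts; plain text is a string.
        if isinstance(content, list) and any(
            isinstance(part, dict) and part.get("type") == "image_url"
            for part in content
        ):
            needs.add("vision")
    return needs


def filter_chain(chain: list[tuple[str, set[str]]], needs: set[str]) -> list[str]:
    """Keep the head (the caller's named model) untouched; prune incapable fallbacks."""
    return [
        model
        for i, (model, caps) in enumerate(chain)
        if i == 0 or needs <= caps
    ]


chain = [
    ("openai/gpt-4o-mini", {"vision", "tools"}),       # caller
    ("google/gemini-2.5-flash", {"vision", "tools"}),  # vision default
    ("anthropic/claude-sonnet", {"vision", "tools"}),  # vision backup
    ("google/gemma-text-only", {"tools"}),             # text default
    ("openai/gpt-4o-mini", {"vision", "tools"}),       # platform
]
```

Because the needs are computed from each request body, the same chain keeps all five entries for a plain text request and drops only the text-only default when an image is present.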

Vision model & per-chain backups

Three settings shape the cloud chain. Each is optional — leave them blank and the platform default backstops everything.

Default cloud model

The model credit-billed failover reaches for on a plain chat request. Heads the text chain.

Default vision model

Where requests carrying an image go. Set this when your cloud default can't see (a text-only or local model) — images route here instead of coming back blank.

Per-chain backups

One backup model each for the text and vision chains, tried after the primary when its provider is down — only with cross-provider failover on. You're billed for whichever one served.

The resolved order for an image request is: your named model (if any) → vision default → vision backup → text default & backup → platform default — with every text-only entry filtered out. For plain text it's just the text default → backup → platform default.
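A minimal sketch of that ordering follows. The parameter names are hypothetical stand-ins for the three settings, `text_only` is an assumed way of marking models that can't see, and the sketch assumes a named model also heads the plain-text chain, as the capability section states.

```python
from typing import Optional


def resolve_order(
    needs_vision: bool,
    named: Optional[str],
    text_default: Optional[str],
    text_backup: Optional[str],
    vision_default: Optional[str],
    vision_backup: Optional[str],
    platform_default: str,
    text_only: frozenset[str] = frozenset(),
    cross_provider: bool = True,
) -> list[str]:
    """Build the cloud chain for one request; blank (None) settings are skipped."""
    if not cross_provider:
        # Backups are only tried with cross-provider failover on.
        text_backup = vision_backup = None
    fallbacks = [text_default, text_backup, platform_default]
    if needs_vision:
        fallbacks = [vision_default, vision_backup] + fallbacks
        # Text-only entries are pruned for image requests only.
        fallbacks = [m for m in fallbacks if m not in text_only]
    # The named model, if any, heads the chain and is never filtered.
    return ([named] if named else []) + [m for m in fallbacks if m]
```

With every setting left blank the result is just the platform default, which matches the "leave them blank" behavior described above.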

Cross-provider failover

With cross-provider cloud failover on, an unavailable vendor (5xx, rate-limit, timeout) advances the chain to the next capable model — e.g. Anthropic down, answer with Google or OpenAI. A deterministic client error (a bad request, a content-policy refusal) never switches models: it would fail the same way everywhere, so the gateway surfaces it rather than hiding it behind another vendor. Each model bills at its own rate.
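The distinction between an unavailable vendor and a deterministic client error might look like the sketch below. The status-code mapping is an assumption (429 for rate limits, `None` standing in for a timeout), and `run_chain` and its `call` callback are hypothetical names rather than the gateway's API.

```python
from typing import Callable, Optional

# (HTTP status, or None on timeout; response body)
Result = tuple[Optional[int], str]


def is_vendor_unavailable(status: Optional[int]) -> bool:
    """5xx, rate limit, or timeout: worth trying the next model."""
    return status is None or status == 429 or 500 <= status <= 599


def run_chain(chain: list[str],
              call: Callable[[str], Result]) -> tuple[str, Optional[int], str]:
    """Walk the chain; return (model, status, body) from the first conclusive answer."""
    for model in chain:
        status, body = call(model)
        if not is_vendor_unavailable(status):
            # A success or a deterministic client error (e.g. 400) is
            # surfaced as-is; switching vendors would not change the outcome.
            return model, status, body
    raise RuntimeError("no model in the chain could be reached")
```

The returned model name is the one that actually answered, which is also the one a caller would be billed for.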
