/// how it works
Routing & failover
The gateway speaks the OpenAI API, but where a request actually runs is its own decision. It tries to serve every request from the cheapest place that can answer — the edge cache, then your own hardware, then the credit-billed cloud — and it never hands a request to a model that can't serve it.
The request pipeline
Every chat completion walks the same three stages. The first one that can answer wins; the rest never run.
The stages, in order, are the edge cache, your GPU nodes, and cloud failover. The cache short-circuits repeats; your nodes serve everything they can for free; only what's left reaches the cloud, and the cloud chain only offers models that can actually serve the request.
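The three-stage walk can be sketched in a few lines. This is only an illustration of "first stage that can answer wins", not the gateway's code: the `Stage` signature, the toy stage functions, and the `nodes_busy` flag are all hypothetical.

```python
from typing import Callable, Optional

# Hypothetical stage signature: return a response, or None when the
# stage cannot answer this request.
Stage = Callable[[dict], Optional[str]]


def run_pipeline(request: dict, stages: list) -> tuple:
    """Try each (name, stage) pair in order; the first answer wins."""
    for name, stage in stages:
        response = stage(request)
        if response is not None:
            return name, response  # later stages never run
    raise RuntimeError("no stage could serve the request")


# Toy stand-ins for the real stages (assumptions for illustration only).
_cache = {"hello": "cached hi"}


def edge_cache(req: dict) -> Optional[str]:
    return _cache.get(req["prompt"])


def gpu_nodes(req: dict) -> Optional[str]:
    return None if req.get("nodes_busy") else "node:" + req["prompt"]


def cloud_failover(req: dict) -> Optional[str]:
    return "cloud:" + req["prompt"]


STAGES = [
    ("edge cache", edge_cache),
    ("gpu nodes", gpu_nodes),
    ("cloud failover", cloud_failover),
]
```

Calling `run_pipeline({"prompt": "hello"}, STAGES)` returns from the cache stage without touching the other two, which is the short-circuit behavior described above.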
When every node is busy
The cloud failover policy decides what happens when all your ready nodes are at capacity:
| policy | behavior | optimizes |
|---|---|---|
| queue | Always prefer local; the node queues internally. Cloud only after the node timeout. | cost (free-first) |
| overflow | All nodes busy? Go straight to cloud. | latency |
| wait | Hold the request up to a set timeout for a node to free up, then cloud. | balance |
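One way to read the table is as a mapping from policy to "where the request waits, and for how long, before the cloud takes it". The sketch below encodes that reading. `BusyPlan`, `plan_when_busy`, and both timeout parameters are hypothetical names, and the `waits_in` labels are this sketch's own wording rather than gateway terminology.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BusyPlan:
    waits_in: str         # where the request sits while nodes are busy
    cloud_after_s: float  # seconds before the request goes to the cloud


def plan_when_busy(policy: str, *, node_timeout_s: float, wait_timeout_s: float) -> BusyPlan:
    """Describe what happens when every ready node is at capacity."""
    if policy == "queue":
        # Always prefer local: the node queues, cloud only after its timeout.
        return BusyPlan("node queue", node_timeout_s)
    if policy == "overflow":
        # Latency first: go straight to the cloud.
        return BusyPlan("nowhere", 0.0)
    if policy == "wait":
        # Hold for a free node up to the set timeout, then cloud.
        return BusyPlan("held for a free node", wait_timeout_s)
    raise ValueError(f"unknown policy: {policy!r}")
```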
Substitution mode
With "Always use default models" on, the gateway ignores the model a request names and serves your account defaults instead: your local default on your nodes, your cloud default on failover. This is how an app hard-coded to a cloud-only model (a coding agent pinned to a specific model, say) can ride your own hardware for free. Responses report the model that actually served.
One exception keeps it safe: a request carrying an image is never substituted onto a text-only default (see Capability-aware failover below).
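A minimal sketch of the substitution rule and its image exception follows. `AccountDefault` and `substitute` are hypothetical names. Returning `None` means "do not substitute here"; what the gateway then does with the request is left to capability-aware routing and is not modeled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AccountDefault:
    model: str
    vision: bool  # assumed flag: can this default accept images?


def substitute(target: AccountDefault, *, enabled: bool, has_image: bool) -> Optional[str]:
    """Return the default to serve instead of the named model, or None."""
    if not enabled:
        return None  # substitution mode is off: the named model stands
    if has_image and not target.vision:
        return None  # never put an image request on a text-only default
    return target.model


text_only = AccountDefault("google/gemma-text-only", vision=False)
```

With substitution on, `substitute(text_only, enabled=True, has_image=False)` yields the default's name, while the same call with `has_image=True` yields `None`.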
Capability-aware failover
Providers don't all support the same features. If a request uses vision (an image_url part) or tool-calling, the gateway only routes it to models that can actually serve it. The model you explicitly named always heads the chain untouched; every fallback the gateway picks for you is capability-filtered, so a substitution or a provider outage can never silently swap in a model that returns blanks or a hard error.
| # | role | model | capabilities | verdict |
|---|---|---|---|---|
| 01 | caller | openai/gpt-4o-mini | vision, tools | ✓ head — never filtered |
| 02 | vision default | google/gemini-2.5-flash | vision, tools | ✓ leads image requests |
| 03 | vision backup | anthropic/claude-sonnet | vision, tools | ✓ if Google is down |
| 04 | text default | google/gemma-text-only | tools | ✗ pruned — can't see |
| 05 | platform | openai/gpt-4o-mini | vision, tools | ✓ always backstops |
The text-only default is dropped for this request — sending the image there would return blanks. For a plain text request it stays in the chain. The filter is per-request, not per-account.
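The table above can be reproduced with a small per-request filter. This is a sketch under assumptions: `Candidate` and `filter_chain` are hypothetical names, and the capability sets are copied from the table rather than read from any real provider metadata.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    role: str
    model: str
    capabilities: frozenset


def filter_chain(chain: list, required: set) -> list:
    """Keep the caller's named model untouched; capability-filter every fallback."""
    kept = []
    for candidate in chain:
        if candidate.role == "caller" or required <= candidate.capabilities:
            kept.append(candidate)
    return kept


FULL = frozenset({"vision", "tools"})
CHAIN = [
    Candidate("caller", "openai/gpt-4o-mini", FULL),
    Candidate("vision default", "google/gemini-2.5-flash", FULL),
    Candidate("vision backup", "anthropic/claude-sonnet", FULL),
    Candidate("text default", "google/gemma-text-only", frozenset({"tools"})),
    Candidate("platform", "openai/gpt-4o-mini", FULL),
]
```

`filter_chain(CHAIN, {"vision"})` drops only the text default, and `filter_chain(CHAIN, set())` keeps all five entries, which matches the per-request behavior described above.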
Image requests also skip local nodes. A node reports the model it has loaded but no capability signal, so the gateway can't tell a vision model from a text-only one — vision goes straight to a vision-capable cloud model.
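Detecting what a request needs could look like the sketch below, which inspects an OpenAI-style chat completion body for `image_url` content parts and a non-empty `tools` field. The function names are hypothetical, and the gateway's real detection logic may differ.

```python
def required_capabilities(request: dict) -> set:
    """Infer the capabilities a chat completion request needs."""
    needs = set()
    if request.get("tools"):
        needs.add("tools")
    for message in request.get("messages", []):
        content = message.get("content")
        # Multi-part content is a list of typed parts; plain text is a string.
        if isinstance(content, list):
            for part in content:
                if isinstance(part, dict) and part.get("type") == "image_url":
                    needs.add("vision")
    return needs


def may_use_local_nodes(request: dict) -> bool:
    """Image requests skip local nodes, since nodes report no capability signal."""
    return "vision" not in required_capabilities(request)
```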
Vision model & per-chain backups
Three settings shape the cloud chain. Each is optional — leave them blank and the platform default backstops everything.
Default cloud model
The model credit-billed failover reaches for on a plain chat request. Heads the text chain.
Default vision model
Where requests carrying an image go. Set this when your cloud default can't see (a text-only or local model) — images route here instead of coming back blank.
Per-chain backups
One backup model each for the text and vision chains, tried after the primary when its provider is down — only with cross-provider failover on. You're billed for whichever one served.
The resolved order for an image request is: your named model (if any) → vision default → vision backup → text default & backup → platform default — with every text-only entry filtered out. For plain text it's just the text default → backup → platform default.
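The resolved order can be written out as a small function. This is a sketch: `resolve_chain` and its parameters are hypothetical, unset settings are passed as `None`, and `vision_capable` stands in for whatever capability data the gateway really uses. It places a named model at the head in both cases, which is this sketch's reading of "the model you explicitly named always heads the chain".

```python
from typing import Optional


def resolve_chain(
    *,
    named: Optional[str],
    text_default: Optional[str],
    text_backup: Optional[str],
    vision_default: Optional[str],
    vision_backup: Optional[str],
    platform_default: str,
    has_image: bool,
    cross_provider: bool,
    vision_capable: frozenset,
) -> list:
    """Build the ordered cloud chain for one request."""
    if not cross_provider:
        # Backups are only tried with cross-provider failover on.
        text_backup = vision_backup = None
    if has_image:
        order = [vision_default, vision_backup, text_default, text_backup, platform_default]
        # Drop unset settings and every text-only entry.
        fallbacks = [m for m in order if m is not None and m in vision_capable]
    else:
        order = [text_default, text_backup, platform_default]
        fallbacks = [m for m in order if m is not None]
    head = [named] if named else []  # a named model is never filtered
    return head + fallbacks
```

Feeding it the models from the capability table, with the text default being text-only, gives the same four-entry chain for an image request as the table shows.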
Cross-provider failover
With cross-provider cloud failover on, an unavailable vendor (5xx, rate-limit, timeout) advances the chain to the next capable model — e.g. Anthropic down, answer with Google or OpenAI. A deterministic client error (a bad request, a content-policy refusal) never switches models: it would fail the same way everywhere, so the gateway surfaces it rather than hiding it behind another vendor. Each model bills at its own rate.
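The distinction between "vendor unavailable" and "deterministic client error" maps naturally onto two exception types. The sketch below assumes a hypothetical `call(model)` function that raises one of them; the exception names and `complete_with_failover` are illustrative, not part of any real SDK.

```python
from typing import Callable


class ProviderUnavailable(Exception):
    """5xx, rate limit, or timeout: a different vendor might succeed."""


class DeterministicClientError(Exception):
    """Bad request or content-policy refusal: fails the same way everywhere."""


def complete_with_failover(chain: list, call: Callable[[str], str], *, cross_provider: bool = True) -> tuple:
    """Return (model, response) from the first model in the chain that answers."""
    if not chain:
        raise ValueError("empty chain")
    last_error = None
    for model in chain:
        try:
            return model, call(model)
        except ProviderUnavailable as err:
            last_error = err
            if not cross_provider:
                raise  # failover is off: surface the outage
        # DeterministicClientError is deliberately not caught, so it reaches
        # the caller instead of being retried on another vendor.
    raise last_error
```

Returning the model alongside the response reflects that each model bills at its own rate, so the caller needs to know which one actually served.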