From your application's point of view, Wide Area Intelligence is a single OpenAI-compatible endpoint: you point the SDK at it, send a chat completion, get one back. Underneath, every request makes a short journey — and where it actually runs is decided fresh each time. The goal is simple: serve each request from the cheapest place that can answer it, and never hand it to a model that can't.
First stage that can answer wins. The cache short-circuits repeats; your nodes serve everything they can for free; only what's left reaches the cloud — and the cloud chain only offers models that can actually serve the request.
There are three stages. The first one that can answer wins; the rest never run.
Stage 1 — the edge cache
Before anything else, the gateway checks a key-value cache that lives at the edge. If an identical request has been seen recently, the stored response comes straight back — no model runs, no tokens are spent, and the round-trip is a few milliseconds. You control whether caching is on and how long entries live with the cache TTL setting. For workloads with repeated prompts — classification, templated generation, retries — this quietly removes a large slice of traffic before it ever touches a GPU.
Stage 2 — your own GPUs
On a cache miss, the gateway looks at the nodes you've connected — your workstations, servers, or on-prem boxes running the agent. It ranks them: nodes already serving the requested model come first, then the least loaded. The winner receives the request over a private tunnel and answers directly. Because this is hardware you own, there are no per-token fees — the marginal cost of the request is the electricity it took to run.
If a node is mid-request and you've allowed concurrency, the gateway accounts for in-flight load so one busy node doesn't get piled on while another sits idle. If the chosen node doesn't respond within your node timeout, the next-best node is tried before the request leaves your network.
Stage 3 — cloud failover
Only when no node of yours can serve the request — none online, all at capacity, or none loaded with a capable model — does it reach the credit-billed cloud. This is a fallback, not the default path, and it runs through a model chain that's aware of what the request actually needs (more on that below). You're billed for the model that answered, at its published rate, out of your prepaid balance. No credits, no cloud — your nodes still serve for free.
The knob: what happens when nodes are busy
The one decision you tune is what a request should do when every ready node is at capacity. Three policies, three trade-offs:
| policy | behavior | optimizes for |
|---|---|---|
| queue | Always prefer local; the node queues internally. Cloud only after the node timeout. | cost |
| overflow | All nodes busy → go straight to cloud immediately. | latency |
| wait | Hold for a free node up to a set timeout, then cloud. | balance |
queue keeps the most traffic on hardware you own; overflow trades credits for lower latency under load; waitsplits the difference. They're per-account and take effect immediately.
Reading it back
Every request lands in your dashboard with where it was served (cache, local, or cloud), which node answered, the model, token counts, latency, and — for cache and local hits — an estimate of what the same request would have cost in the cloud. That last number is the running tally of what your own hardware is saving you.
The fast path is the cheap path. A well-cached, well-provisioned setup answers most traffic from the edge or your own GPUs — the cloud bill is only what overflows.
Want the exact rules — substitution, vision models, cross-provider failover? The routing & failover reference lays out every stage, or create a gateway and watch your first request route in the dashboard.