Multi-model pipelines
The cheapest way to run a mixed workload is to use a different model for each step: a fast, cheap one to route or extract, a strong one to generate. Because the gateway speaks the OpenAI API, the only thing that changes between steps is the model parameter: same endpoint, same key. And because it's your endpoint, the cheap step can run on your own hardware for free.
One endpoint, no lock-in
This pattern is sometimes pitched as a reason to centralize on a single model marketplace. It isn't one. “One endpoint, one key, change only the model parameter” is exactly what https://wideareaai.com/api/v1 already gives you — and a third-party marketplace is just one more upstream behind it (see cross-provider failover). The only thing that ever marries you to a vendor is hard-coding its model ids in your caller. Keep the orchestration on your side and the upstream stays a config detail.
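One way to keep model ids out of the caller is to resolve them from configuration by step name. The sketch below is only an illustration, not part of the gateway: the WAI_MODEL_ROUTE and WAI_MODEL_GENERATE environment variables and the model_for helper are hypothetical names, and the defaults are simply the model ids used in the recipes on this page.

```python
import os

# Defaults are the model ids used in the recipes on this page.
# The WAI_MODEL_<STEP> variable names are hypothetical, chosen for this sketch.
DEFAULT_MODELS = {
    "route": "llama-3.1-8b-instruct",
    "generate": "anthropic/claude-sonnet-4-6",
}


def model_for(step: str, env=None) -> str:
    """Return the model id configured for a pipeline step.

    Looks up WAI_MODEL_<STEP> in the environment and falls back to the
    default for that step. Raises KeyError for an unknown step.
    """
    env = os.environ if env is None else env
    default = DEFAULT_MODELS[step]  # KeyError if the step is not defined
    return env.get(f"WAI_MODEL_{step.upper()}", default)
```

The caller would then pass model=model_for("route") or model=model_for("generate"), so swapping an upstream model becomes a configuration change rather than a code change.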
Recipe 1 — classify, then generate
Use a small model to decide what kind of request this is, then hand the real work to a strong model. Routing is a one-token job: it doesn't need a frontier model, and on a small local model it costs nothing.
from openai import OpenAI

client = OpenAI(
    base_url="https://wideareaai.com/api/v1",
    api_key="wai_sk_...",  # from the API Keys page
)

# Example input; replace with the request you want to route.
query = "How do I export last month's invoices?"

# Step 1: classify with a fast, cheap model.
# Pin it to your own hardware with X-WAI-Node: this step runs for free.
route = client.chat.completions.create(
    model="llama-3.1-8b-instruct",
    messages=[{"role": "user", "content": f"Classify this request in one word: {query}"}],
    extra_headers={"X-WAI-Node": "workstation-01"},
)
label = route.choices[0].message.content.strip()

# Step 2: generate with a strong model, only where it earns its rate.
answer = client.chat.completions.create(
    model="anthropic/claude-sonnet-4-6",
    messages=[{"role": "user", "content": f"[{label}] {query}"}],
)
print(answer.choices[0].message.content)

The X-WAI-Node header pins step one to a node you own, so classification runs on your hardware while only the generation call touches the credit-billed cloud. Drop the header and the gateway still prefers a local node that serves the model before failing over; see the request pipeline.
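A small model does not always reply with exactly one word, so it can help to map the raw reply onto a fixed label set before passing it to step two. The following is a minimal sketch under stated assumptions: the label set (billing, technical, other) and the normalize_label helper are hypothetical and do not come from the gateway.

```python
import re

# Hypothetical label set for illustration; use the categories that fit your workload.
ALLOWED_LABELS = ("billing", "technical", "other")


def normalize_label(raw: str, allowed=ALLOWED_LABELS, default: str = "other") -> str:
    """Map a free-form classifier reply onto one of the allowed labels.

    Returns the first word of the reply (case-insensitive) that is an
    allowed label, or the default when no word matches.
    """
    words = re.findall(r"[a-z]+", raw.lower())
    for word in words:
        if word in allowed:
            return word
    return default
```

In Recipe 1 this would take the place of the bare .strip() call, for example label = normalize_label(route.choices[0].message.content or "").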
Recipe 2 — extract, then write
Structured extraction is cheap and fast on a small model. Long-form writing is where a frontier model earns its per-token rate. Splitting the two keeps the expensive call short — it only sees clean fields, not the raw document.
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://wideareaai.com/api/v1",
  apiKey: "wai_sk_...",
});

// Example input; replace with the raw document you want to process.
const doc = "<raw document text>";

// Step 1: structured extraction on a small model (cheap, fast, local).
const extracted = await client.chat.completions.create(
  {
    model: "llama-3.1-8b-instruct",
    response_format: { type: "json_object" },
    messages: [{ role: "user", content: `Extract fields as JSON:\n${doc}` }],
  },
  { headers: { "X-WAI-Node": "workstation-01" } },
);

// Step 2: long-form writing from the structured fields, on a frontier model.
const draft = await client.chat.completions.create({
  model: "anthropic/claude-sonnet-4-6",
  messages: [{
    role: "user",
    content: `Write a report from these fields:\n${extracted.choices[0].message.content}`,
  }],
});
console.log(draft.choices[0].message.content);

Where the money goes
On a mixed workload, splitting the cheap routing/extraction step from the expensive generation step is what makes the bill small: the frontier model only ever sees short, pre-digested prompts, and the high-volume step runs on a model that costs a fraction as much — or nothing at all when it lands on your own node. Every call, local or cloud, shows up in your dashboard with the model that actually served it, so you can see exactly where each step landed.
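The effect can be estimated with back-of-the-envelope arithmetic. The sketch below uses made-up numbers: the per-million-token rate and the token counts are placeholders chosen for illustration, not actual prices of any model or of the gateway, and it counts input tokens only.

```python
def call_cost(tokens: int, rate_per_million: float) -> float:
    """Cost of one call, given its input token count and a per-million-token rate."""
    return tokens * rate_per_million / 1_000_000


# Placeholder rates (currency units per million input tokens), NOT real prices.
FRONTIER_RATE = 10.0
LOCAL_RATE = 0.0  # the step lands on your own node

RAW_DOC_TOKENS = 8_000    # hypothetical size of the raw document
EXTRACTED_TOKENS = 500    # hypothetical size of the extracted fields

# Single-model pipeline: the frontier model reads the whole raw document.
single = call_cost(RAW_DOC_TOKENS, FRONTIER_RATE)

# Split pipeline: the local model reads the raw document,
# and the frontier model reads only the extracted fields.
split = call_cost(RAW_DOC_TOKENS, LOCAL_RATE) + call_cost(EXTRACTED_TOKENS, FRONTIER_RATE)

print(f"single: {single:.4f}  split: {split:.4f}")
```

With these placeholder figures the frontier model's input shrinks from 8,000 tokens to 500, so its share of the bill shrinks by the same factor; the real ratio depends on your documents and on the rates that apply to your account.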