Here's a number that should bother you. A capable inference GPU — an RTX 4070, say — runs about $600 for the card, and once you add the PC around it you're north of $1,600 of silicon. Now count the hours it actually does work. A coding agent fires off a burst of requests, you read the diff, you think, you type. Real GPU utilization for interactive AI work is maybe two hours a day, and that's generous.
The other twenty-two hours, that hardware sits at idle, drawing 15 watts and producing nothing. You bought a machine that can generate tens of millions of tokens a night and you're using a rounding error of its capacity. The fix isn't a faster card. It's giving the card a night shift.
What batch inference is actually good for
Interactive inference — chat, coding agents, anything with a human waiting — needs low latency. Batch inference is the opposite: nobody is waiting, so you trade latency for throughput and let the GPU run flat out. That trade unlocks a whole category of work that's painful or expensive to do request-by-request:
Dataset labeling & classification
Embeddings for search
Summarizing archives
Synthetic training data
Eval suite runs
Notice the common thread: volume, no deadline, repeatable. If a job has those three properties, paying cloud per-token rates for it is lighting money on fire when you own a GPU that's asleep.
How the Wide Area Intelligence night shift works
The gateway already knows when your nodes are busy — it's routing your live traffic. Batch night shift uses that signal. You queue a batch from the dashboard, and your nodes pick up batch work only when they're idle. The instant a real request arrives — a coding agent, a Playground chat — interactive traffic takes priority and the batch yields. Your overnight job never makes your daytime work feel slow.
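One way to picture the "batch yields to live traffic" rule is as a two-level priority queue on each node. The sketch below is a toy model of that idea only, not the gateway's actual scheduler; `NodeQueue` and the two priority constants are hypothetical names made up for illustration.

```python
import heapq
import itertools

# Hypothetical priority classes: the lower number is always served first.
INTERACTIVE, BATCH = 0, 1


class NodeQueue:
    """Toy per-node queue where interactive requests jump ahead of batch lines."""

    def __init__(self):
        self._heap = []
        # Monotonic counter keeps first-in, first-out order within a priority class.
        self._order = itertools.count()

    def submit(self, kind, request_id):
        heapq.heappush(self._heap, (kind, next(self._order), request_id))

    def next_request(self):
        """Return the id of the next request to run, or None when the queue is empty."""
        if not self._heap:
            return None
        _, _, request_id = heapq.heappop(self._heap)
        return request_id


q = NodeQueue()
q.submit(BATCH, "ticket-0001")
q.submit(BATCH, "ticket-0002")
q.submit(INTERACTIVE, "chat-1")  # arrives after the batch lines were queued
order = [q.next_request() for _ in range(3)]
print(order)  # ['chat-1', 'ticket-0001', 'ticket-0002']
```

The chat request runs first even though it arrived last, and the batch lines keep their original order behind it.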
The input format is the same JSONL the OpenAI Batch API uses, so if you already have tooling, it carries over. Each line is one independent request keyed by a custom_id you choose. The gateway streams them to whichever nodes are free, load-balancing across all of them, and collects the results into a single output file you download when it's done.
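Because each line stands alone and results are matched on custom_id, it is worth checking a file before you queue it. Here is a minimal sketch of such a check, assuming the four top-level keys used in the examples in this post (custom_id, method, url, body); `validate_batch_lines` is a hypothetical helper, not part of any gateway tooling.

```python
import json

# Top-level keys used by the request lines shown in this post.
REQUIRED_KEYS = {"custom_id", "method", "url", "body"}


def validate_batch_lines(lines):
    """Return (line_number, problem) tuples; an empty list means no problems found."""
    problems = []
    seen = set()
    for n, line in enumerate(lines, start=1):
        if not line.strip():
            problems.append((n, "blank line"))
            continue
        try:
            req = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((n, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(req, dict):
            problems.append((n, "not a JSON object"))
            continue
        missing = REQUIRED_KEYS - req.keys()
        if missing:
            problems.append((n, "missing keys: " + ", ".join(sorted(missing))))
            continue
        if req["custom_id"] in seen:
            problems.append((n, f"duplicate custom_id {req['custom_id']!r}"))
        seen.add(req["custom_id"])
    return problems


sample = [
    '{"custom_id": "a", "method": "POST", "url": "/v1/chat/completions", "body": {}}',
    '{"custom_id": "a", "method": "POST", "url": "/v1/chat/completions", "body": {}}',
    '{"custom_id": "b", "method": "POST"}',
    "not json",
]
problems = validate_batch_lines(sample)
for problem in problems:
    print(problem)
```

To check a real file, pass the open file object directly, since iterating it yields one line at a time.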
Jobs survive node restarts. If a Windows machine reboots overnight or someone trips the power, the batch resumes from where it left off — completed lines aren't re-run. A half-finished 50k job picks back up on the next heartbeat, no babysitting.
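The resume behaviour can be sketched with nothing more than the two JSONL files: anything whose custom_id already has a result line is skipped. This is an illustration of the idea under that assumption, not a description of how the gateway stores job state; `completed_ids` and `pending_requests` are hypothetical names.

```python
import json
import os


def completed_ids(results_path):
    """Collect the custom_ids that already have a complete result line on disk."""
    done = set()
    if os.path.exists(results_path):
        with open(results_path, encoding="utf-8") as f:
            for line in f:
                if not line.strip():
                    continue
                try:
                    done.add(json.loads(line)["custom_id"])
                except json.JSONDecodeError:
                    # A line cut off mid-write (power loss, say) counts as not done.
                    continue
    return done


def pending_requests(batch_path, results_path):
    """Yield only the input requests that have no recorded result yet."""
    done = completed_ids(results_path)
    with open(batch_path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue
            req = json.loads(line)
            if req["custom_id"] not in done:
                yield req
```

Keying on custom_id rather than on line position is what makes this safe when several nodes finish lines out of order.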
Walkthrough: classify 50,000 tickets overnight
Prepare a JSONL file
Queue it from the dashboard
Upload your .jsonl, choose which nodes may run it (or let any idle node help), and submit. The job shows as QUEUED.
Let the nodes chew through it
Download the results
When the job shows DONE, download results.jsonl and join it back to your source data on custom_id.
The JSONL format is one self-contained request per line. The body is exactly what you'd POST to /v1/chat/completions. Pin a deterministic classifier with temperature: 0 and a tiny max_tokens so each request finishes fast:
{"custom_id": "ticket-0001", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "Qwen2.5-7B-Instruct-Q4_K_M", "messages": [{"role": "system", "content": "Classify the support ticket as one of: billing, bug, feature_request, account, other. Reply with the label only."}, {"role": "user", "content": "I was charged twice for my May subscription."}], "max_tokens": 4, "temperature": 0}}
{"custom_id": "ticket-0002", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "Qwen2.5-7B-Instruct-Q4_K_M", "messages": [{"role": "system", "content": "Classify the support ticket as one of: billing, bug, feature_request, account, other. Reply with the label only."}, {"role": "user", "content": "The export button does nothing on Firefox."}], "max_tokens": 4, "temperature": 0}}
{"custom_id": "ticket-0003", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "Qwen2.5-7B-Instruct-Q4_K_M", "messages": [{"role": "system", "content": "Classify the support ticket as one of: billing, bug, feature_request, account, other. Reply with the label only."}, {"role": "user", "content": "Can you add dark mode to the dashboard?"}], "max_tokens": 4, "temperature": 0}}
Writing 50,000 of those by hand is not the move. Generate them from your data with a few lines of script:
# Turn a CSV of tickets into OpenAI batch JSONL
python3 - <<'PY'
import csv, json

SYSTEM = ("Classify the support ticket as one of: billing, bug, "
          "feature_request, account, other. Reply with the label only.")

# tickets.csv needs "id" and "text" columns; newline="" is what the csv module expects
with open("tickets.csv", newline="", encoding="utf-8") as src, \
     open("batch.jsonl", "w", encoding="utf-8") as out:
    for row in csv.DictReader(src):
        # One self-contained request per line, keyed by the ticket id
        out.write(json.dumps({
            "custom_id": f"ticket-{row['id']}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": "Qwen2.5-7B-Instruct-Q4_K_M",
                "messages": [
                    {"role": "system", "content": SYSTEM},
                    {"role": "user", "content": row["text"]},
                ],
                "max_tokens": 4,
                "temperature": 0,
            },
        }) + "\n")
PY
wc -l batch.jsonl # 50000 batch.jsonl
Upload batch.jsonl, submit, and go to bed. The output file mirrors the input: one line per request, matched by custom_id, with the full response body and token usage:
{"custom_id": "ticket-0001", "response": {"status_code": 200, "body": {"choices": [{"message": {"role": "assistant", "content": "billing"}}], "usage": {"prompt_tokens": 41, "completion_tokens": 1, "total_tokens": 42}}}}
{"custom_id": "ticket-0002", "response": {"status_code": 200, "body": {"choices": [{"message": {"role": "assistant", "content": "bug"}}], "usage": {"prompt_tokens": 39, "completion_tokens": 1, "total_tokens": 40}}}}
{"custom_id": "ticket-0003", "response": {"status_code": 200, "body": {"choices": [{"message": {"role": "assistant", "content": "feature_request"}}], "usage": {"prompt_tokens": 38, "completion_tokens": 2, "total_tokens": 40}}}}
Because every result carries its custom_id, joining back to your source data is trivial and order-independent, so it doesn't matter which node finished which line:
# Join results back to your data by custom_id
python3 - <<'PY'
import json

labels = {}
with open("results.jsonl") as f:
    for line in f:
        rec = json.loads(line)
        cid = rec["custom_id"]
        labels[cid] = rec["response"]["body"]["choices"][0]["message"]["content"].strip()

print(len(labels), "tickets classified")
# write labels.csv, update your DB, build the index — whatever you need
PY
The numbers: one RTX 4070 night vs the cloud
Let's price the ticket job honestly. 50,000 tickets, a 7B model at Q4_K_M, classified to a single-word label. Each request is roughly a 40-token prompt plus a 1–2 token answer: call it ~42 tokens of work per ticket, almost all of it prefill. An RTX 4070 running a 7B at Q4 generates around 40 tokens/sec of decode, but classification is prefill-bound, so throughput is dominated by how fast it can ingest prompts, comfortably several hundred tickets a minute in practice. Even at a conservative 80 tickets/minute, 50,000 tickets take 625 minutes, about 10.4 hours: one overnight run.
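The wall-time estimate is simple enough to redo for your own volumes. A small sketch, assuming every node sustains the same request rate and that throughput scales linearly with node count (an idealisation; `wall_time_hours` is a hypothetical helper):

```python
def wall_time_hours(num_requests, requests_per_minute, num_nodes=1):
    """Hours to finish a batch if every node sustains the same request rate."""
    return num_requests / (requests_per_minute * num_nodes) / 60


print(round(wall_time_hours(50_000, 80), 1))     # 10.4  (the conservative case)
print(round(wall_time_hours(50_000, 80, 2), 1))  # 5.2   (two nodes sharing the job)
print(round(wall_time_hours(50_000, 300), 1))    # 2.8   (a few hundred tickets a minute)
```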
Now compare the cost. The OpenAI Batch API is 50% off the synchronous price, and that discount is real and worth crediting. Here's the comparison at GPT-4o-mini batch rates (~$0.075/1M input, $0.30/1M output) against running it on hardware you already own:
| approach | model class | wall time | marginal cost |
|---|---|---|---|
| OpenAI sync API | GPT-4o-mini | minutes | ~$0.32 |
| OpenAI Batch API (50% off) | GPT-4o-mini | up to 24h | ~$0.16 |
| WAI night shift | Qwen2.5-7B (yours) | ~one night | ~$0.38 power |
| WAI night shift | Llama-3.1-8B (yours) | ~one night | ~$0.38 power |
The honest read: for this specific job, the dollar amounts are all small. 50k short classifications is a cheap job everywhere; sixteen cents on OpenAI's batch tier is hardly worth optimizing. The ~$0.38 "power" figure is just the marginal electricity for a ~250W card running ten hours at $0.15/kWh (2.5 kWh); the GPU was already bought and sitting idle. On a job this small, that is no cheaper than the cloud.
Where it compounds is scale and repetition. Make the outputs long instead of one word (full ticket summaries, generated training pairs, document rewrites) and the cloud bill grows with every output token, while your power bill only grows with the hours the card runs. Run the job every night on fresh data and the cloud line is a recurring subscription; the GPU line is sunk cost you already paid. A job that produces 500 tokens each across 50k rows is closer to $8 on the batch API per run, or $240/month if it's nightly, versus roughly forty cents of electricity for each ten-hour night the card runs.
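The arithmetic behind these figures fits in two small functions. This is a sketch using the batch rates quoted above, a flat 40-token prompt, and the 250W, $0.15/kWh power assumption; `api_cost_usd` and `power_cost_usd` are hypothetical helper names.

```python
def api_cost_usd(rows, in_tokens, out_tokens, in_rate, out_rate):
    """Token cost in dollars; rates are dollars per 1M tokens."""
    return rows * (in_tokens * in_rate + out_tokens * out_rate) / 1_000_000


def power_cost_usd(watts, hours, usd_per_kwh):
    """Marginal electricity cost of running a card at a steady draw."""
    return watts / 1000 * hours * usd_per_kwh


BATCH_IN, BATCH_OUT = 0.075, 0.30  # GPT-4o-mini batch rates quoted in the text

short_job = api_cost_usd(50_000, 40, 2, BATCH_IN, BATCH_OUT)   # one-word labels
long_job = api_cost_usd(50_000, 40, 500, BATCH_IN, BATCH_OUT)  # 500-token outputs
one_night = power_cost_usd(250, 10, 0.15)                      # 2.5 kWh

print(f"{short_job:.2f}")  # 0.18
print(f"{long_job:.2f}")   # 7.65
print(f"{one_night:.3f}")  # 0.375
```

Output tokens dominate the long job: the prompt side is the same $0.15 in both cases, and everything else comes from the 500 generated tokens per row.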
When cloud batch is the right call
Self-hosting isn't always the answer, and pretending otherwise would cost you trust. Reach for a cloud batch API when:
The job is genuinely huge and has a deadline
You need frontier-model quality
You don't own a GPU yet
The sweet spot for the night shift is the boring, recurring, high-volume middle: classification, embeddings, summarization, synthetic data, evals — work a 7B–14B model handles fine, that you run again and again, where a morning turnaround is perfectly acceptable. That describes a surprising amount of real production AI.
Put the idle hours to work
The pattern is the same one that runs through everything here: deploy a model to a node, create a gateway key, point your job at the gateway. From the Models page, deploy a 7B or 14B instruct model to a node in a click. Build your JSONL. Queue it under Batch, pick which nodes may help, and submit. Your free tier covers two nodes, which is two GPUs working the night shift instead of zero.
Your hardware is going to sit there overnight regardless. The only question is whether it produces 50,000 labeled rows by morning or just idles at 15 watts.