Your GPU should work the night shift: batch inference on idle hardware

A coding agent uses your GPU for two hours a day. The other twenty-two, $1,600 of silicon sits idle. Here's how to queue overnight batch jobs — dataset labeling, embeddings, summarization — and let your nodes chew through them while you sleep.

Here's a number that should bother you. A capable inference GPU — an RTX 4070, say — runs about $600 for the card, and once you add the PC around it you're north of $1,600 of silicon. Now count the hours it actually does work. A coding agent fires off a burst of requests, you read the diff, you think, you type. Real GPU utilization for interactive AI work is maybe two hours a day, and that's generous.


The other twenty-two hours, that hardware sits at idle, drawing 15 watts and producing nothing. You bought a machine that can generate tens of millions of tokens a night and you're using a rounding error of its capacity. The fix isn't a faster card. It's giving the card a night shift.

What batch inference is actually good for

Interactive inference — chat, coding agents, anything with a human waiting — needs low latency. Batch inference is the opposite: nobody is waiting, so you trade latency for throughput and let the GPU run flat out. That trade unlocks a whole category of work that's painful or expensive to do request-by-request:

Dataset labeling & classification

Tag 50,000 support tickets, moderate a comment backlog, score leads, detect language, extract entities. Anything where you have a pile of rows and need a label on each one.

Embeddings for search

Generate vector embeddings for an entire document corpus so you can build a semantic search index or a RAG knowledge base. This is embarrassingly parallel and embarrassingly cheap on idle hardware.

Summarizing archives

Condense a few years of meeting notes, PDFs, or call transcripts into searchable summaries. One pass, overnight, done.

Synthetic training data

Generate instruction/response pairs, paraphrases, or hard negatives to fine-tune a smaller model. Quantity matters more than instant turnaround.

Eval suite runs

Re-run a 2,000-question benchmark every time you swap a model or tweak a prompt. Kick it off at night, read the scorecard with coffee.

Notice the common thread: volume, no deadline, repeatable. If a job has those three properties, paying cloud per-token rates for it is lighting money on fire when you own a GPU that's asleep.

How the Wide Area Intelligence night shift works

The gateway already knows when your nodes are busy — it's routing your live traffic. Batch night shift uses that signal. You queue a batch from the dashboard, and your nodes pick up batch work only when they're idle. The instant a real request arrives — a coding agent, a Playground chat — interactive traffic takes priority and the batch yields. Your overnight job never makes your daytime work feel slow.
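One way to picture the yield-to-interactive behavior is a dispatcher with two queues. This is a toy model, not the gateway's actual scheduler: the `Dispatcher` class and its method names are hypothetical, invented here for illustration.

```python
from collections import deque


class Dispatcher:
    """Toy two-queue dispatcher: interactive work always goes first."""

    def __init__(self):
        self.interactive = deque()  # live traffic: agents, Playground chats
        self.batch = deque()        # queued night-shift requests

    def submit_interactive(self, request_id):
        self.interactive.append(request_id)

    def submit_batch(self, request_id):
        self.batch.append(request_id)

    def next_job(self):
        # A node asks for work: live traffic wins, batch only fills idle gaps.
        if self.interactive:
            return self.interactive.popleft()
        if self.batch:
            return self.batch.popleft()
        return None  # nothing queued, the node stays idle


d = Dispatcher()
d.submit_batch("ticket-0001")
d.submit_batch("ticket-0002")
d.submit_interactive("chat-1")  # arrives after the batch lines were queued
order = [d.next_job() for _ in range(4)]
print(order)  # ['chat-1', 'ticket-0001', 'ticket-0002', None]
```

Even though the chat request arrived last, it is served first. That ordering is the property the night shift relies on.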

The input format is the same JSONL the OpenAI Batch API uses, so if you already have tooling, it carries over. Each line is one independent request keyed by a custom_id you choose. The gateway streams them to whichever nodes are free, load-balancing across all of them, and collects the results into a single output file you download when it's done.
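Since each line has to stand alone, it can be worth sanity-checking a file before uploading it. The following is a minimal sketch. The required field list (`custom_id`, `method`, `url`, `body`) is inferred from the example lines in this article, and `validate_batch_lines` is a hypothetical helper, not part of any gateway tooling.

```python
import json

REQUIRED_FIELDS = ("custom_id", "method", "url", "body")  # inferred from the examples


def validate_batch_lines(lines):
    """Return a list of (line_number, problem) tuples for a batch JSONL input."""
    problems = []
    seen_ids = set()
    for n, line in enumerate(lines, start=1):
        if not line.strip():
            problems.append((n, "blank line"))
            continue
        try:
            req = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((n, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(req, dict):
            problems.append((n, "not a JSON object"))
            continue
        for field in REQUIRED_FIELDS:
            if field not in req:
                problems.append((n, f"missing {field}"))
        cid = req.get("custom_id")
        if isinstance(cid, str):
            if cid in seen_ids:
                problems.append((n, f"duplicate custom_id {cid}"))
            seen_ids.add(cid)
    return problems


sample = [
    '{"custom_id": "ticket-0001", "method": "POST", "url": "/v1/chat/completions", "body": {}}',
    '{"custom_id": "ticket-0001", "method": "POST", "url": "/v1/chat/completions", "body": {}}',
    '{"custom_id": "ticket-0003", "method": "POST"}',
    "not json",
]
problems = validate_batch_lines(sample)
for line_number, problem in problems:
    print(line_number, problem)
```

To check a real file, pass the open file object: `validate_batch_lines(open("batch.jsonl", encoding="utf-8"))`. Duplicate custom_id values matter most, because results are matched back by that key.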

Jobs survive node restarts. If a Windows machine reboots overnight or someone trips the power, the batch resumes from where it left off — completed lines aren't re-run. A half-finished 50k job picks back up on the next heartbeat, no babysitting.
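Resuming without re-running finished lines comes down to a set difference on custom_id. The sketch below only illustrates that idea; how the gateway actually tracks progress isn't described here, and `remaining_requests` is a hypothetical helper name.

```python
import json


def remaining_requests(batch_lines, result_lines):
    """Yield the batch lines whose custom_id has no result yet."""
    done = {json.loads(line)["custom_id"] for line in result_lines if line.strip()}
    for line in batch_lines:
        if line.strip() and json.loads(line)["custom_id"] not in done:
            yield line


batch = [
    '{"custom_id": "ticket-0001", "body": {}}',
    '{"custom_id": "ticket-0002", "body": {}}',
    '{"custom_id": "ticket-0003", "body": {}}',
]
# Pretend the node rebooted after finishing only the first line.
partial_results = ['{"custom_id": "ticket-0001", "response": {"status_code": 200}}']

todo = [json.loads(line)["custom_id"] for line in remaining_requests(batch, partial_results)]
print(todo)  # ['ticket-0002', 'ticket-0003']
```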

Walkthrough: classify 50,000 tickets overnight

Prepare a JSONL file

One request per line, in OpenAI batch format. Build it from whatever your data lives in — a CSV, a database query, a folder of files.

Queue it from the dashboard

Go to Batch → New batch, upload the .jsonl, choose which nodes may run it (or let any idle node help), and submit. The job shows as QUEUED.

Let the nodes chew through it

As nodes go idle they pull work. The dashboard shows a live progress bar — lines completed, tokens generated, current throughput across all participating nodes.

Download the results

When the job flips to DONE, download results.jsonl and join it back to your source data on custom_id.

The JSONL format is one self-contained request per line. The body is exactly what you'd POST to /v1/chat/completions — pin a deterministic classifier with temperature: 0 and a tiny max_tokens so each request finishes fast:

batch.jsonl · one request per line

{"custom_id": "ticket-0001", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "Qwen2.5-7B-Instruct-Q4_K_M", "messages": [{"role": "system", "content": "Classify the support ticket as one of: billing, bug, feature_request, account, other. Reply with the label only."}, {"role": "user", "content": "I was charged twice for my May subscription."}], "max_tokens": 4, "temperature": 0}}
{"custom_id": "ticket-0002", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "Qwen2.5-7B-Instruct-Q4_K_M", "messages": [{"role": "system", "content": "Classify the support ticket as one of: billing, bug, feature_request, account, other. Reply with the label only."}, {"role": "user", "content": "The export button does nothing on Firefox."}], "max_tokens": 4, "temperature": 0}}
{"custom_id": "ticket-0003", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "Qwen2.5-7B-Instruct-Q4_K_M", "messages": [{"role": "system", "content": "Classify the support ticket as one of: billing, bug, feature_request, account, other. Reply with the label only."}, {"role": "user", "content": "Can you add dark mode to the dashboard?"}], "max_tokens": 4, "temperature": 0}}

Writing 50,000 of those by hand is not the move. Generate them from your data with a few lines of script:

build batch.jsonl from a CSV

# Turn a CSV of tickets into OpenAI batch JSONL
python3 - <<'PY'
import csv, json

SYSTEM = ("Classify the support ticket as one of: billing, bug, "
          "feature_request, account, other. Reply with the label only.")

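# newline="" lets the csv module handle quoted multi-line ticket text correctly
with open("tickets.csv", newline="", encoding="utf-8") as src, \
     open("batch.jsonl", "w", encoding="utf-8") as out: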
    for row in csv.DictReader(src):
        out.write(json.dumps({
            "custom_id": f"ticket-{row['id']}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": "Qwen2.5-7B-Instruct-Q4_K_M",
                "messages": [
                    {"role": "system", "content": SYSTEM},
                    {"role": "user", "content": row["text"]},
                ],
                "max_tokens": 4,
                "temperature": 0,
            },
        }) + "\n")
PY
wc -l batch.jsonl   # 50000 batch.jsonl

Upload batch.jsonl, submit, and go to bed. The output file mirrors the input — one line per request, matched by custom_id, with the full response body and token usage:

results.jsonl · downloaded in the morning

{"custom_id": "ticket-0001", "response": {"status_code": 200, "body": {"choices": [{"message": {"role": "assistant", "content": "billing"}}], "usage": {"prompt_tokens": 41, "completion_tokens": 1, "total_tokens": 42}}}}
{"custom_id": "ticket-0002", "response": {"status_code": 200, "body": {"choices": [{"message": {"role": "assistant", "content": "bug"}}], "usage": {"prompt_tokens": 39, "completion_tokens": 1, "total_tokens": 40}}}}
{"custom_id": "ticket-0003", "response": {"status_code": 200, "body": {"choices": [{"message": {"role": "assistant", "content": "feature_request"}}], "usage": {"prompt_tokens": 38, "completion_tokens": 2, "total_tokens": 40}}}}

Because every result carries its custom_id, joining back to your source data is trivial — and order-independent, so it doesn't matter which node finished which line:

join results back to your data

# Join results back to your data by custom_id
python3 - <<'PY'
import json

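labels, failed = {}, []
with open("results.jsonl", encoding="utf-8") as f:
    for line in f:
        if not line.strip():
            continue  # tolerate a trailing blank line
        rec = json.loads(line)
        cid = rec["custom_id"]
        # Assumption: a failed request has no 200 status_code in its response
        resp = rec.get("response") or {}
        if resp.get("status_code") != 200:
            failed.append(cid)  # collect these for a retry batch
            continue
        labels[cid] = resp["body"]["choices"][0]["message"]["content"].strip()

print(len(labels), "tickets classified,", len(failed), "failed")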
# write labels.csv, update your DB, build the index — whatever you need
PY
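As one possible version of the "write labels.csv" step, here is a small sketch that attaches each label to its source row. It assumes the same `id` and `text` columns and the `ticket-{id}` key scheme as the build script; `join_labels` is a hypothetical helper, and the in-memory strings stand in for real files.

```python
import csv
import io


def join_labels(ticket_rows, labels):
    """Attach a 'label' column to each ticket row; unlabeled rows get ''."""
    for row in ticket_rows:
        cid = f"ticket-{row['id']}"  # same key scheme as the build script
        yield {**row, "label": labels.get(cid, "")}


# Stand-ins for tickets.csv and the labels dict built from results.jsonl
tickets_csv = io.StringIO("id,text\n0001,I was charged twice.\n0002,Export button is broken.\n")
labels = {"ticket-0001": "billing"}

out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["id", "text", "label"], lineterminator="\n")
writer.writeheader()
writer.writerows(join_labels(csv.DictReader(tickets_csv), labels))
print(out.getvalue())
```

Rows with an empty label are the ones that never got a result, which makes them easy to filter into a follow-up batch.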

The numbers: one RTX 4070 night vs the cloud

Let's price the ticket job honestly. 50,000 tickets, a 7B model at Q4_K_M, classified to a single-word label. Each request is roughly a 40-token prompt plus a 1–2 token answer, so call it ~42 tokens of work per ticket, almost all of it prefill. An RTX 4070 running a 7B at Q4 generates around 40 tokens/sec of decode, but classification is prefill-bound, so throughput is dominated by how fast it can ingest prompts: comfortably several hundred tickets a minute in practice. Even at a conservative 80 tickets/minute, 50,000 tickets take 625 minutes, or about 10.4 hours: one overnight run.
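The wall-time estimate is a single division, sketched here so you can plug in your own measured rate. The 300 and 500 tickets/minute rows are illustrative stand-ins for "several hundred", not benchmarks.

```python
def hours_for_job(rows, rows_per_minute):
    """Wall time in hours for a job of `rows` requests at a steady rate."""
    return rows / rows_per_minute / 60


for rate in (80, 300, 500):
    print(f"{rate:>4} tickets/min -> {hours_for_job(50_000, rate):.1f} h")
```

If your nodes are only idle for part of the night, compare the result against the idle window rather than the full twenty-two hours.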

Now compare the cost. The OpenAI Batch API is 50% off the synchronous price, and that discount is real and worth crediting. Here's the comparison at GPT-4o-mini batch rates (~$0.075/1M input, $0.30/1M output) against running it on hardware you already own:

approach	model class	wall time	marginal cost
OpenAI sync API	GPT-4o-mini	minutes	~$0.32
OpenAI Batch API (50% off)	GPT-4o-mini	up to 24h	~$0.16
WAI night shift	Qwen2.5-7B (yours)	~one night	~$0.38 power
WAI night shift	Llama-3.1-8B (yours)	~one night	~$0.38 power

The honest read: for this specific job, the dollar amounts are all small. 50k short classifications is a cheap job everywhere; sixteen cents on OpenAI's batch tier is hardly worth optimizing. The ~$0.38 "power" figure is the marginal electricity for a ~250W card running ten hours at $0.15/kWh (0.25 kW × 10 h × $0.15/kWh). At the conservative 80 tickets/minute pace that is slightly more than the cloud batch price, and a faster run uses proportionally less. Either way the difference is pennies, and the GPU was already bought and sitting idle.

Where it compounds is scale and repetition. Make the outputs long instead of one word (full ticket summaries, generated training pairs, document rewrites) and the cloud bill grows with every output token, while your only marginal cost is electricity for the hours the card runs. Run the job every night on fresh data and the cloud line is a recurring subscription; the GPU line is sunk cost you already paid. A job that produces 500 tokens each across 50k rows is closer to $8 on the batch API per run, or $240/month if it's nightly, versus an electricity bill measured in cents per GPU-hour.
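The arithmetic behind these comparisons can be sketched in a few lines so you can substitute your own numbers. The per-row token counts (40 in and 2 out for labels, 40 in and 500 out for long outputs) are round-number assumptions. The rates, wattage, and electricity price are the ones quoted above.

```python
def cloud_cost(rows, in_tokens, out_tokens, in_rate, out_rate):
    """Dollar cost from per-row token counts and $/1M-token rates."""
    return rows * (in_tokens * in_rate + out_tokens * out_rate) / 1_000_000


def power_cost(watts, hours, dollars_per_kwh):
    """Marginal electricity cost of running a card for a number of hours."""
    return watts / 1000 * hours * dollars_per_kwh


ROWS = 50_000
BATCH_IN, BATCH_OUT = 0.075, 0.30  # $/1M tokens, GPT-4o-mini batch rates

short_job = cloud_cost(ROWS, 40, 2, BATCH_IN, BATCH_OUT)   # one-word labels
long_job = cloud_cost(ROWS, 40, 500, BATCH_IN, BATCH_OUT)  # 500-token outputs
night = power_cost(250, 10, 0.15)                          # ten-hour run

print(f"cloud batch, short outputs: ${short_job:.2f}")
print(f"cloud batch, long outputs:  ${long_job:.2f} per run, ${long_job * 30:.0f} per month")
print(f"local power, ten hours:     ${night:.2f}")
```

The cloud line scales with tokens and the local line scales with hours, which is why longer outputs and nightly repetition shift the comparison.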

When cloud batch is the right call

Self-hosting isn't always the answer, and pretending otherwise would cost you trust. Reach for a cloud batch API when:

The job is genuinely huge and has a deadline

Ten million rows that must ship by Friday won't fit in your nodes' idle windows. Cloud batch farms out across thousands of GPUs; your two nodes can't. Throughput you don't own beats throughput you do when the clock is real.

You need frontier-model quality

If the task genuinely needs GPT-class or Claude-class reasoning — nuanced extraction, hard judgment calls — a 7B on your desk won't match it. Use cloud batch for the hard subset; even then, consider running a local model first and only escalating the rows it's unsure about.

You don't own a GPU yet

Obviously. But the moment you have idle silicon, the math on recurring volume work tilts hard toward the night shift.
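The "local first, escalate the unsure rows" idea mentioned above can be sketched as a filter over the overnight labels. The rule for "unsure" is an assumption made for illustration: here it means any reply outside the allowed label set, plus the catch-all `other`. `split_for_escalation` is a hypothetical helper.

```python
ALLOWED = {"billing", "bug", "feature_request", "account", "other"}


def split_for_escalation(labels, unsure=("other",)):
    """Separate clean labels from rows worth re-running on a stronger model."""
    accepted, escalate = {}, []
    for cid, raw in labels.items():
        label = raw.strip().lower()
        if label in ALLOWED and label not in unsure:
            accepted[cid] = label
        else:
            escalate.append(cid)  # off-list reply or a catch-all label
    return accepted, escalate


overnight = {
    "ticket-0001": "billing",
    "ticket-0002": "Bug ",
    "ticket-0003": "I think this is billing",
    "ticket-0004": "other",
}
accepted, escalate = split_for_escalation(overnight)
print(accepted)  # {'ticket-0001': 'billing', 'ticket-0002': 'bug'}
print(escalate)  # ['ticket-0003', 'ticket-0004']
```

Only the escalated custom_id values need to go into a cloud batch file, which keeps the paid portion of the job small.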

The sweet spot for the night shift is the boring, recurring, high-volume middle: classification, embeddings, summarization, synthetic data, evals — work a 7B–14B model handles fine, that you run again and again, where a morning turnaround is perfectly acceptable. That describes a surprising amount of real production AI.

Put the idle hours to work

The pattern is the same one that runs through everything here: deploy a model to a node, create a gateway key, point your job at the gateway. From the Models page, deploy a 7B or 14B instruct model to a node in a click. Build your JSONL. Queue it under Batch, pick which nodes may help, and submit. Your free tier covers two nodes, which is two GPUs working the night shift instead of zero.

Your hardware is going to sit there overnight regardless. The only question is whether it produces 50,000 labeled rows by morning or just idles at 15 watts.

Queue your first overnight batch →