free tool · no signup · always-on reference
OpenAI API Compatibility Matrix
Almost every local LLM server claims to be 'OpenAI compatible' — but compatibility is a spectrum, not a checkbox. This matrix shows which OpenAI endpoints and features llama.cpp, Ollama, vLLM, and the Wide Area Intelligence gateway actually implement, with footnotes for the partial cases. Toggle columns to compare just the stacks you're choosing between.
show / hide columns
| endpoint / feature | OpenAI API | llama.cpp server | Ollama | vLLM | Wide Area Intelligence |
|---|---|---|---|---|---|
| POST /v1/chat/completions | yes | yes | yes | yes | yes1 |
| POST /v1/completions (legacy) | yes | yes | partial2 | yes | partial1 |
| POST /v1/embeddings | yes | yes3 | yes3 | yes3 | yes4 |
| POST /v1/images/generations | yes | no | no | no | yes24 |
| POST /v1/audio/transcriptions | yes | partial25 | no | no | yes26 |
| GET /v1/models | yes | yes | yes | yes | yes |
| Streaming (SSE) | yes | yes | yes | yes | yes |
| Tool / function calling | yes | partial5 | partial6 | partial7 | partial1 |
| JSON mode / structured output | yes | yes8 | yes8 | yes8 | yes1 |
| Vision (image input) | yes | partial9 | partial10 | partial11 | partial1 |
| logprobs | yes | yes | partial12 | yes | partial1 |
| n > 1 choices | yes | partial13 | no | yes | partial1 |
| temperature / top_p | yes | yes | yes | yes | yes |
| max_tokens / max_completion_tokens | yes | yes | yes14 | yes | yes |
| stop sequences | yes | yes | yes | yes | yes |
| seed param (reproducible) | yes | yes | yes | yes | yes |
| API key auth | yes | partial15 | no16 | partial15 | yes17 |
| Multi-model serving | yes | no18 | yes19 | partial20 | yes21 |
| Remote access built in | yes | no22 | no22 | no22 | yes23 |
notes
- 1.WAI proxies to your node's llama.cpp server, so it inherits whatever that build + model supports — and adds auth, routing, caching, and cloud failover on top.
- 2.Ollama exposes its own /api/generate; its OpenAI-compatible /v1/completions shim exists but is thin and less exercised than chat.
- 3.Embeddings work only when you load an embedding model (e.g. nomic-embed, bge, e5). A chat model on the same endpoint will not return useful vectors.
- 4.WAI serves /v1/embeddings on your own nodes first, then fails over to credit-billed cloud embeddings (OpenAI text-embedding-3, Cloudflare bge) — so it works even with no embedding node loaded. Pass a cloud model id or set a default in Settings; disable failover to stay strictly local.
- 5.llama.cpp tool calling depends on the model's chat template and usually needs the --jinja flag; reliability varies a lot by model and grammar.
- 6.Ollama supports tools for models whose templates declare them (Llama 3.1+, Qwen, Mistral, etc.); streaming + tools and parallel calls are uneven across versions.
- 7.vLLM tool calling requires a per-model tool-call parser (--enable-auto-tool-choice with --tool-call-parser); coverage depends on the model family.
- 8.Constrained / structured output uses GBNF grammars or JSON-schema guidance (llama.cpp), format=json (Ollama), or guided decoding (vLLM) — strictness and schema fidelity differ from OpenAI's.
- 9.llama.cpp handles vision via multimodal projector (mmproj) files and only for vision-capable models (LLaVA, Qwen-VL, etc.).
- 10.Ollama accepts images only for vision models (llava, llama3.2-vision, etc.); text-only models reject image content.
- 11.vLLM vision input depends on the model architecture and the right multimodal config; not every served model accepts images.
- 12.Ollama's OpenAI shim returns limited or no logprobs depending on version and model; treat as best-effort.
- 13.llama.cpp accepts n>1 in some builds but generates sequentially; it is not a true parallel sampling guarantee like OpenAI/vLLM.
- 14.Ollama maps max_tokens to num_predict internally; the OpenAI field is accepted on the /v1 path.
- 15.llama.cpp (--api-key) and vLLM (--api-key) support a single static API key — no per-user keys, scopes, or rotation.
- 16.Ollama ships with no built-in auth; anyone who can reach the port can use it. Lock it down at the network layer.
- 17.WAI issues per-user keys (wai_sk_...) with scopes and rotation, independent of what the underlying node supports.
- 18.A single llama.cpp server process serves one model. Multiple models means multiple processes/ports (or the newer router build).
- 19.Ollama loads and swaps between multiple pulled models on demand within one daemon.
- 20.vLLM serves one model per process by default; multi-model needs multiple instances or a router in front.
- 21.WAI routes across all of your nodes and the models deployed on them from one base URL; pin a node with the X-WAI-Node header.
- 22.These servers bind to localhost/LAN by default. Remote access means you run a tunnel, reverse proxy, or open a port yourself.
- 23.WAI nodes connect out over a Cloudflare Tunnel — no port forwarding, no inbound firewall holes. The gateway is reachable at https://wideareaai.com/api/v1.
- 24.WAI exposes an OpenAI-compatible /v1/images/generations — your own sd-server image nodes first, then credit-billed Google (Gemini 2.5 Flash Image / Imagen 4) cloud failover. The other servers don't ship an OpenAI images endpoint.
- 25.whisper.cpp's whisper-server transcribes via POST /inference (multipart), but it's not the OpenAI /v1/audio/transcriptions path or response shape — you adapt it yourself.
- 26.WAI exposes an OpenAI-compatible /v1/audio/transcriptions backed by your own nodes running whisper.cpp (deploy a Whisper model). Transcription runs on your hardware (free); cloud failover for audio isn't available yet.
Capabilities reflect the state of each project in early 2026 and move fast — always confirm against the version you deploy. Where a feature is uneven across versions or models, we mark it partial rather than overstate.
What "OpenAI compatible" actually means
When a server says it is OpenAI compatible, it means it accepts HTTP requests in the same shape as OpenAI's API — the same JSON body, the same /v1/chat/completions path, the same streaming format — so that an existing OpenAI SDK client works after you only change the base URL and the API key. That is the whole appeal: you keep your code, your tooling, and your prompts, and just point them at hardware you control.
The catch is that "compatible" rarely means "identical." A server can implement the chat endpoint perfectly but ignore logprobs, treat n>1 as n=1, or accept a tools array and then never actually emit a tool call. The request succeeds, the response validates, and your feature silently does nothing. The matrix above exists to surface exactly those gaps before they reach production.
Why the matrix matters when choosing a stack
Each server optimizes for a different job. llama.cpp's server is the low-level reference: broad model and quant support, GBNF grammars, and a single static --api-key, but one model per process and no remote access out of the box. Ollama wraps llama.cpp with a friendly model manager and hot model swapping, trading some OpenAI-surface fidelity for ease of use. vLLM is the throughput-first inference engine — excellent batching, real parallel sampling for n>1, and per-model tool-call parsers — but it is one model per process and heavier to run.
If you pick on the marketing line alone, you can be three weeks into a build before you discover the function calling you depended on needs a specific chat template, a --jinja flag, or a model that was actually trained for tools. Reading the matrix first turns those surprises into a checklist.
Where does Wide Area Intelligence sit?
The Wide Area Intelligence gateway is not a fourth inference engine — it is a routing layer in front ofthe llama.cpp servers running on your own nodes. So for the raw model behavior (tool calling, JSON mode, vision, logprobs) it inherits whatever your node's llama.cpp build and the deployed model support. What it adds on top is the production plumbing those servers leave to you: per-user API keys (wai_sk_...), load-balanced routing across all your nodes, request caching, and automatic cloud failover on prepaid credits when your GPUs are busy or offline. You pin a request to a specific node with the X-WAI-Node header.
FAQ
Does llama.cpp support embeddings? Yes, on /v1/embeddings — but only when the loaded model is an embedding model (nomic-embed, bge, e5). A chat model will not return useful vectors. The same caveat applies to Ollama and vLLM.
Does Ollama do function calling? Partially. Tools work for models whose templates declare them (Llama 3.1+, Qwen, Mistral, and similar), but combining streaming with tools and doing parallel tool calls is uneven across versions, so test your exact model.
Can I get reproducible output? All four honor a seed parameter, but determinism still depends on identical model weights, quantization, and backend — a seed is necessary, not sufficient.
Ready to put your own GPU behind an OpenAI-compatible endpoint with auth and failover already handled? Add your first node free →
/// wide area ai
These numbers are theory. Your GPU is real — put it on the network.
Wide Area Intelligence turns any machine with a GPU into an OpenAI-compatible endpoint — routed, cached, and failed over automatically. Free for 2 nodes.
Start routing — free →