free tool · no signup · runs in your browser
Can I Run It? — Local AI edition
Wondering if your PC can run Llama, Qwen, or DeepSeek locally? Pick your GPU and RAM and get an instant verdict: which models run great, which run with trade-offs, and which won't fit — plus estimated tokens per second and the best quantization for each.
Your hardware
12 GB VRAM @ 504 GB/s · 32 GB system RAM
35 of 55 models run well on your RTX 4070 (12 GB).
25 run great · 10 run with trade-offs · 8 technically run · 12 won't fit.
Best for chat
Qwen3 14B
14.8B · Q5_K_M
est. generation speed: 28 tok/s
Best for coding
GLM-4.7 Flash (31B-A3B)
31.2B · IQ2_M
est. generation speed: 210 tok/s
Best for reasoning
DeepSeek R1 Distill Qwen 14B
14.8B · Q4_K_M
est. generation speed: 31 tok/s
Runs great
(25) Fits fully in your GPU memory at good quality: fast, everyday usable
| Model | Params | Best quant | Est. speed | Memory needed | Max context |
|---|---|---|---|---|---|
| DeepSeek R1 Distill Qwen 14B | 14.8B | Q4_K_M | 31 tok/s | 10.6 GB | 8K |
| Qwen3 14B | 14.8B | Q5_K_M | 28 tok/s | 11.8 GB | 8K |
| Phi-4 14B | 14.7B | Q4_K_M | 31 tok/s | 10.6 GB | 8K |
| Mistral Nemo 12B | 12.2B | Q6_K | 29 tok/s | 11.4 GB | 8K |
| Gemma 3 12B | 12.2B | Q6_K | 30 tok/s | 10.9 GB | 16K |
| Qwen3 8B | 8.2B | Q8_0 | 33 tok/s | 10.0 GB | 16K |
| Llama 3.1 8B | 8.0B | Q8_0 | 34 tok/s | 9.7 GB | 16K |
| DeepSeek R1 Distill Llama 8B | 8.0B | Q8_0 | 34 tok/s | 9.7 GB | 16K |
| Gemma 4 E4B | 8.0B | Q8_0 | 75 tok/s | 8.8 GB | 128K |
| Gemma 3n E4B | 7.8B | Q8_0 | 75 tok/s | 8.6 GB | 32K |
| Qwen2.5 7B | 7.6B | Q8_0 | 38 tok/s | 8.7 GB | 32K |
| Qwen2.5 Coder 7B | 7.6B | Q8_0 | 38 tok/s | 8.7 GB | 32K |
| DeepSeek R1 Distill Qwen 7B | 7.6B | Q8_0 | 38 tok/s | 8.7 GB | 64K |
| Mistral 7B v0.3 | 7.2B | Q8_0 | 37 tok/s | 8.9 GB | 32K |
| Gemma 3n E2B | 5.4B | Q8_0 | 147 tok/s | 6.2 GB | 32K |
| Gemma 4 E2B | 5.1B | Q8_0 | 150 tok/s | 5.9 GB | 128K |
| Gemma 3 4B | 4.3B | Q8_0 | 67 tok/s | 5.3 GB | 128K |
| Qwen3 4B | 4.0B | Q8_0 | 60 tok/s | 5.8 GB | 32K |
| Phi-4 Mini 3.8B | 3.8B | Q8_0 | 64 tok/s | 5.5 GB | 32K |
| Llama 3.2 3B | 3.2B | Q8_0 | 75 tok/s | 4.8 GB | 64K |
| Qwen3 1.7B | 2.0B | Q8_0 | 106 tok/s | 3.6 GB | 32K |
| DeepSeek R1 Distill Qwen 1.5B | 1.8B | Q8_0 | 155 tok/s | 2.7 GB | 128K |
| Llama 3.2 1B | 1.2B | Q8_0 | 208 tok/s | 2.2 GB | 128K |
| Gemma 3 1B | 1000M | Q8_0 | 296 tok/s | 1.8 GB | 32K |
| Qwen3 0.6B | 752M | Q8_0 | 189 tok/s | 2.4 GB | 32K |
Runs with trade-offs
(10) Works, but needs reduced quantization or feels slow
| Model | Params | Best quant | Est. speed | Memory needed | Max context |
|---|---|---|---|---|---|
| Mixtral 8x7B (47B, 13B active) | 46.7B | Q4_K_M | 5.1 tok/s | 28.3 GB | 8K |
| Nemotron 3 Nano 30B-A3B | 31.6B | IQ2_M | 193 tok/s | 11.7 GB | 8K |
| GLM-4.7 Flash (31B-A3B) | 31.2B | IQ2_M | 210 tok/s | 11.6 GB | 8K |
| Qwen3 30B-A3B (MoE) | 30.5B | IQ2_M | 164 tok/s | 11.7 GB | 8K |
| Qwen3 Coder 30B-A3B | 30.5B | IQ2_M | 164 tok/s | 11.7 GB | 8K |
| Gemma 3 27B | 27.4B | IQ2_M | 30 tok/s | 11.0 GB | 16K |
| Gemma 4 26B (A4B MoE) | 26.5B | IQ2_M | 165 tok/s | 10.2 GB | 32K |
| Devstral Small 2 24B | 24.0B | IQ2_M | 33 tok/s | 10.1 GB | 16K |
| Mistral Small 3.2 24B | 24.0B | IQ2_M | 33 tok/s | 10.1 GB | 16K |
| GPT-OSS 20B (3.6B active) | 21.5B | Q3_K_M | 163 tok/s | 11.0 GB | 32K |
Technically runs
(8) Heavy CPU offloading or extreme quantization: expect single-digit tokens/sec
| Model | Params | Best quant | Est. speed | Memory needed | Max context |
|---|---|---|---|---|---|
| GLM-4.5 Air (106B, 12B active) | 110.5B | IQ2_M | 6.6 tok/s | 39.2 GB | 8K |
| Llama 4 Scout (109B, 17B active) | 108.6B | IQ2_M | 5.1 tok/s | 38.7 GB | 8K |
| Qwen3 Coder Next (80B-A3B) | 79.7B | IQ2_M | 24 tok/s | 28.2 GB | 8K |
| Llama 3.3 70B | 70.6B | IQ2_M | 1.7 tok/s | 26.9 GB | 8K |
| DeepSeek R1 Distill Llama 70B | 70.6B | IQ2_M | 1.7 tok/s | 26.9 GB | 8K |
| DeepSeek R1 Distill Qwen 32B | 32.8B | Q4_K_M | 2.6 tok/s | 21.4 GB | 8K |
| Qwen3 32B | 32.8B | Q4_K_M | 2.6 tok/s | 21.4 GB | 8K |
| Gemma 4 31B | 32.7B | Q4_K_M | 2.6 tok/s | 21.3 GB | 8K |
Won't run
(12) Doesn't fit in your GPU memory + system RAM at any quantization
| Model | Params | Best quant | Est. speed | Memory needed | Max context |
|---|---|---|---|---|---|
| Kimi K2.6 (1T, 32B active) | 1.1T | Q4_K_M | — | 602.7 GB | — |
| Kimi K2 (1T, 32B active) | 1.0T | Q4_K_M | — | 584.4 GB | — |
| DeepSeek V4 Pro (862B MoE) | 861.6B | Q4_K_M | — | 490.8 GB | — |
| GLM-5 (754B MoE) | 753.9B | Q4_K_M | — | 429.7 GB | — |
| DeepSeek V3.2 (671B, 37B active) | 685.4B | Q4_K_M | — | 390.7 GB | — |
| DeepSeek R1 (671B, 37B active) | 684.5B | Q4_K_M | — | 390.2 GB | — |
| Llama 3.1 405B | 405.9B | Q4_K_M | — | 235.3 GB | — |
| Llama 4 Maverick (400B, 17B active) | 401.6B | Q4_K_M | — | 230.4 GB | — |
| Qwen3 235B-A22B (MoE) | 235.1B | Q4_K_M | — | 135.8 GB | — |
| MiniMax M2.7 (229B, 10B active) | 228.7B | Q4_K_M | — | 132.6 GB | — |
| DeepSeek V4 Flash (158B MoE) | 158.1B | Q4_K_M | — | 90.9 GB | — |
| GPT-OSS 120B (5.1B active) | 120.4B | Q4_K_M | — | 69.4 GB | — |
How "Can I run it?" works for local AI
Running a large language model on your own GPU comes down to two questions: does it fit, and how fast does it generate. This tool answers both from the specs of the hardware you select — no download, no install, no signup.
First, fit. A model needs room for its weights (parameters × bytes per parameter, set by the quantization), plus a KV cache that grows with your context window, plus a little runtime overhead. If that total is larger than your GPU memory, layers spill into system RAM ("CPU offload"), which is far slower. If it doesn't fit even in VRAM and system RAM combined, the model won't run at all.
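To make the fit check concrete, here is a minimal Python sketch of the arithmetic described above. It is not the tool's actual code: the function names, the 0.5 GB overhead, the 16-bit KV cache, and the example model shape (32 layers, 8 KV heads, head size 128, typical of an 8B Llama-style model) are all assumptions for illustration, so the result will not necessarily match the "Memory needed" column.

```python
# Rough memory-fit estimate. Every constant here is an illustrative
# assumption, not a value taken from the tool.

def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Weights: parameters x bytes per parameter (billions of bytes = GB)."""
    return params_billion * bits_per_weight / 8


def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_tokens: int, bytes_per_value: int = 2) -> float:
    """KV cache: one key and one value vector per layer, per token."""
    values = 2 * n_layers * n_kv_heads * head_dim * context_tokens
    return values * bytes_per_value / 1e9


def fit_verdict(total_gb: float, vram_gb: float, ram_gb: float) -> str:
    """Compare the total against VRAM, then against VRAM plus system RAM."""
    if total_gb <= vram_gb:
        return "fits in VRAM"
    if total_gb <= vram_gb + ram_gb:
        return "CPU offload"
    return "won't run"


# Example: an 8B model at roughly 8.5 bits per weight with a 16K context.
OVERHEAD_GB = 0.5  # assumed runtime overhead
total = weights_gb(8.0, 8.5) + kv_cache_gb(32, 8, 128, 16_384) + OVERHEAD_GB
print(f"{total:.1f} GB -> {fit_verdict(total, vram_gb=12, ram_gb=32)}")
```

The KV cache term is the one that moves with your settings: doubling the context window doubles it, which is why the same model can fit at 8K and spill into system RAM at 32K.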
Why memory bandwidth decides your tokens/sec
For a single user generating one token at a time, local LLM inference is almost entirely memory-bandwidth bound, not compute bound. Every generated token requires reading the active model weights from memory once. So a rough speed estimate is:
tokens/sec ≈ memory bandwidth ÷ bytes read per token
A 7B model at Q4 reads roughly 4 GB per token-step. An RTX 4070 moves about 504 GB/s, and real-world efficiency on GPUs is around 65% of theoretical, so the estimate works out to about 504 × 0.65 ÷ 4 ≈ 80 tokens per second. Mixture-of-experts models read only their active parameters per token, so they generate faster than their total size suggests. CPU and unified-memory systems run at lower efficiency, which the estimates account for.
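The bandwidth rule is easy to try yourself. The sketch below is an illustration rather than the tool's implementation; the function names are made up here, and the 65% default is the GPU efficiency figure quoted above. For the dense models in the tables, plugging in the "Memory needed" value as the bytes read per token appears to land close to the listed speed.

```python
# tokens/sec ~= (memory bandwidth x efficiency) / bytes read per token.
# Function names are hypothetical; 0.65 is the GPU efficiency quoted above.

def tokens_per_sec(bandwidth_gb_s: float, gb_read_per_token: float,
                   efficiency: float = 0.65) -> float:
    """Bandwidth-bound estimate for single-user, one-token-at-a-time decoding."""
    return bandwidth_gb_s * efficiency / gb_read_per_token


def active_gb_per_token(active_params_billion: float,
                        bits_per_weight: float) -> float:
    """For mixture-of-experts models, only the active parameters are read."""
    return active_params_billion * bits_per_weight / 8


# Dense 7B at Q4 on an RTX 4070 (about 4 GB read per token).
print(round(tokens_per_sec(504, 4.0)))

# An 8.7 GB dense model on the same card.
print(round(tokens_per_sec(504, 8.7)))

# MoE with 3B active parameters at an assumed 4 bits per weight.
print(round(tokens_per_sec(504, active_gb_per_token(3.0, 4.0))))
```

Swapping 504 for a different card's bandwidth shows how much of the speed difference between GPUs comes from memory alone.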
What the tiers mean
| Tier | What it means |
|---|---|
| Runs great | Fits fully in GPU memory at good quality — fast, everyday usable. |
| Runs with trade-offs | Works, but needs a reduced quant or feels slow. |
| Technically runs | Heavy CPU offload or extreme quant — expect single-digit tokens/sec. |
| Won't run | Doesn't fit in GPU memory + system RAM at any quantization. |
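One way to read the tiers is as a small decision procedure. The sketch below is a guess at that logic, not the tool's real rules: the page does not state its cut-offs, so the function name, the 4-bit "good quality" threshold, and the approximate bits-per-weight values in the examples are all assumptions. It will not reproduce every row in the tables above.

```python
# Hypothetical tier classifier. The 4.0-bit "good quality" threshold is an
# assumption; the tool's actual cut-offs are not published on this page.

def classify_tier(memory_gb: float, bits_per_weight: float,
                  vram_gb: float, ram_gb: float,
                  good_quality_bits: float = 4.0) -> str:
    if memory_gb > vram_gb + ram_gb:
        return "Won't run"             # does not fit anywhere
    if memory_gb > vram_gb:
        return "Technically runs"      # needs heavy CPU offload
    if bits_per_weight < good_quality_bits:
        return "Runs with trade-offs"  # fits in VRAM only at a reduced quant
    return "Runs great"                # fits in VRAM at good quality


# 12 GB VRAM, 32 GB system RAM; bits-per-weight values are approximate.
for name, mem, bits in [
    ("14B at ~4.8 bits", 10.6, 4.8),
    ("31B at ~2.7 bits", 11.6, 2.7),
    ("70B at ~2.7 bits", 26.9, 2.7),
    ("120B at ~4.8 bits", 69.4, 4.8),
]:
    print(name, "->", classify_tier(mem, bits, vram_gb=12, ram_gb=32))
```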
Does the RTX 4070 / 4090 / Apple Silicon matter most?
For fit, it's the memory size that gates you: 12 GB cards comfortably run 7B–13B models, 24 GB cards reach 30B-class, and Apple Silicon with 64–128 GB of unified memory can hold genuinely large models (slowly). For speed, it's the bandwidth: a 4090 at ~1 TB/s generates roughly twice as fast as a 4070 on the same model.
Honest caveats
These are estimates, not benchmarks. Real speeds vary with your llama.cpp build, the exact GGUF quant, your context length, batching, flash attention, thermals, and background load. Treat the tokens/sec figures as a ballpark — useful for deciding what to try, not a guarantee. The honest way to know your number is to run the model.
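If you want to check an estimate against reality, timing a generation is enough. The harness below is a minimal sketch: `generate` is a hypothetical callable you would wrap around whatever backend you use, taking a prompt and a token limit and returning how many tokens it produced. `fake_generate` is only a stand-in so the example runs without a model.

```python
import time
from typing import Callable


def measure_tokens_per_sec(generate: Callable[[str, int], int],
                           prompt: str, max_tokens: int = 128,
                           runs: int = 3) -> float:
    """Return the best observed tokens/sec over several timed runs."""
    best = 0.0
    for _ in range(runs):
        start = time.perf_counter()
        produced = generate(prompt, max_tokens)  # tokens actually generated
        elapsed = time.perf_counter() - start
        if elapsed > 0:
            best = max(best, produced / elapsed)
    return best


def fake_generate(prompt: str, max_tokens: int) -> int:
    """Stand-in backend (hypothetical): pretends each token takes 1 ms."""
    time.sleep(0.001 * max_tokens)
    return max_tokens


rate = measure_tokens_per_sec(fake_generate, "Hello", max_tokens=50)
print(f"{rate:.0f} tok/s")
```

Because the timer wraps the whole call, the figure includes prompt processing as well as generation, so short prompts give the cleanest comparison with the estimates on this page.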
When you do, Wide Area Intelligence turns that GPU into an OpenAI-compatible endpoint: deploy any Hugging Face GGUF to a node in one click, set your context window, and route requests with automatic cloud failover. Free for 2 nodes.