← all tools

free tool · no signup · runs in your browser

Can I Run It? — Local AI edition

Wondering if your PC can run Llama, Qwen, or DeepSeek locally? Pick your GPU and RAM and get an instant verdict: which models run great, which run with trade-offs, and which won't fit — plus estimated tokens per second and the best quantization for each.

Your hardware

12 GB VRAM @ 504 GB/s · 32 GB system RAM

35 of 55 models run well on your RTX 4070 (12GB).

25 run great · 10 run with trade-offs · 8 technically run · 12won't fit.

Best for chat

Qwen3 14B

14.8B · Q5_K_M

28 tok/s

est. generation speed

Best for coding

GLM-4.7 Flash (31B-A3B)

31.2B · IQ2_M

210 tok/s

est. generation speed

Best for reasoning

DeepSeek R1 Distill Qwen 14B

14.8B · Q4_K_M

31 tok/s

est. generation speed

Runs great

(25)

Fits fully in your GPU memory at good quality — fast, everyday usable

ModelParamsBest quantEst. speedMemory neededMax context
DeepSeek R1 Distill Qwen 14B14.8BQ4_K_M31 tok/s10.6 GB8K
Qwen3 14B14.8BQ5_K_M28 tok/s11.8 GB8K
Phi-4 14B14.7BQ4_K_M31 tok/s10.6 GB8K
Mistral Nemo 12B12.2BQ6_K29 tok/s11.4 GB8K
Gemma 3 12B12.2BQ6_K30 tok/s10.9 GB16K
Qwen3 8B8.2BQ8_033 tok/s10.0 GB16K
Llama 3.1 8B8.0BQ8_034 tok/s9.7 GB16K
DeepSeek R1 Distill Llama 8B8.0BQ8_034 tok/s9.7 GB16K
Gemma 4 E4B8.0BQ8_075 tok/s8.8 GB128K
Gemma 3n E4B7.8BQ8_075 tok/s8.6 GB32K
Qwen2.5 7B7.6BQ8_038 tok/s8.7 GB32K
Qwen2.5 Coder 7B7.6BQ8_038 tok/s8.7 GB32K
DeepSeek R1 Distill Qwen 7B7.6BQ8_038 tok/s8.7 GB64K
Mistral 7B v0.37.2BQ8_037 tok/s8.9 GB32K
Gemma 3n E2B5.4BQ8_0147 tok/s6.2 GB32K
Gemma 4 E2B5.1BQ8_0150 tok/s5.9 GB128K
Gemma 3 4B4.3BQ8_067 tok/s5.3 GB128K
Qwen3 4B4.0BQ8_060 tok/s5.8 GB32K
Phi-4 Mini 3.8B3.8BQ8_064 tok/s5.5 GB32K
Llama 3.2 3B3.2BQ8_075 tok/s4.8 GB64K
Qwen3 1.7B2.0BQ8_0106 tok/s3.6 GB32K
DeepSeek R1 Distill Qwen 1.5B1.8BQ8_0155 tok/s2.7 GB128K
Llama 3.2 1B1.2BQ8_0208 tok/s2.2 GB128K
Gemma 3 1B1000MQ8_0296 tok/s1.8 GB32K
Qwen3 0.6B752MQ8_0189 tok/s2.4 GB32K

Runs with trade-offs

(10)

Works, but needs reduced quantization or feels slow

ModelParamsBest quantEst. speedMemory neededMax context
Mixtral 8x7B (47B, 13B active)46.7BQ4_K_M5.1 tok/s28.3 GB8K
Nemotron 3 Nano 30B-A3B31.6BIQ2_M193 tok/s11.7 GB8K
GLM-4.7 Flash (31B-A3B)31.2BIQ2_M210 tok/s11.6 GB8K
Qwen3 30B-A3B (MoE)30.5BIQ2_M164 tok/s11.7 GB8K
Qwen3 Coder 30B-A3B30.5BIQ2_M164 tok/s11.7 GB8K
Gemma 3 27B27.4BIQ2_M30 tok/s11.0 GB16K
Gemma 4 26B (A4B MoE)26.5BIQ2_M165 tok/s10.2 GB32K
Devstral Small 2 24B24.0BIQ2_M33 tok/s10.1 GB16K
Mistral Small 3.2 24B24.0BIQ2_M33 tok/s10.1 GB16K
GPT-OSS 20B (3.6B active)21.5BQ3_K_M163 tok/s11.0 GB32K

Technically runs

(8)

Heavy CPU offloading or extreme quantization — expect single-digit tokens/sec

ModelParamsBest quantEst. speedMemory neededMax context
GLM-4.5 Air (106B, 12B active)110.5BIQ2_M6.6 tok/s39.2 GB8K
Llama 4 Scout (109B, 17B active)108.6BIQ2_M5.1 tok/s38.7 GB8K
Qwen3 Coder Next (80B-A3B)79.7BIQ2_M24 tok/s28.2 GB8K
Llama 3.3 70B70.6BIQ2_M1.7 tok/s26.9 GB8K
DeepSeek R1 Distill Llama 70B70.6BIQ2_M1.7 tok/s26.9 GB8K
DeepSeek R1 Distill Qwen 32B32.8BQ4_K_M2.6 tok/s21.4 GB8K
Qwen3 32B32.8BQ4_K_M2.6 tok/s21.4 GB8K
Gemma 4 31B32.7BQ4_K_M2.6 tok/s21.3 GB8K

Won't run

(12)

Doesn't fit in your GPU memory + system RAM at any quantization

ModelParamsBest quantEst. speedMemory neededMax context
Kimi K2.6 (1T, 32B active)1.1TQ4_K_M602.7 GB
Kimi K2 (1T, 32B active)1.0TQ4_K_M584.4 GB
DeepSeek V4 Pro (862B MoE)861.6BQ4_K_M490.8 GB
GLM-5 (754B MoE)753.9BQ4_K_M429.7 GB
DeepSeek V3.2 (671B, 37B active)685.4BQ4_K_M390.7 GB
DeepSeek R1 (671B, 37B active)684.5BQ4_K_M390.2 GB
Llama 3.1 405B405.9BQ4_K_M235.3 GB
Llama 4 Maverick (400B, 17B active)401.6BQ4_K_M230.4 GB
Qwen3 235B-A22B (MoE)235.1BQ4_K_M135.8 GB
MiniMax M2.7 (229B, 10B active)228.7BQ4_K_M132.6 GB
DeepSeek V4 Flash (158B MoE)158.1BQ4_K_M90.9 GB
GPT-OSS 120B (5.1B active)120.4BQ4_K_M69.4 GB

How "Can I run it?" works for local AI

Running a large language model on your own GPU comes down to two questions: does it fit, and how fast does it generate. This tool answers both from the specs of the hardware you select — no download, no install, no signup.

First, fit. A model needs room for its weights (parameters × bytes per parameter, set by the quantization) plus a KV cachethat grows with your context window, plus a little runtime overhead. If that total is larger than your GPU memory, layers spill into system RAM ("CPU offload"), which is far slower. If it doesn't fit even in VRAM plus RAM, the model won't run at all.

Why memory bandwidth decides your tokens/sec

For a single user generating one token at a time, local LLM inference is almost entirely memory-bandwidth bound, not compute bound. Every generated token requires reading the active model weights from memory once. So a rough speed estimate is:

tokens/sec ≈ memory bandwidth ÷ bytes read per token

A 7B model at Q4 reads roughly 4 GB per token-step. An RTX 4070 moves about 504 GB/s, and real-world efficiency on GPUs is around 65% of theoretical — so you land in the tens of tokens per second. Mixture-of-experts models read only their active params per token, so they punch above their total size. CPU and unified-memory systems run at lower efficiency, which the estimates account for.

What the tiers mean

TierWhat it means
Runs greatFits fully in GPU memory at good quality — fast, everyday usable.
Runs with trade-offsWorks, but needs a reduced quant or feels slow.
Technically runsHeavy CPU offload or extreme quant — expect single-digit tokens/sec.
Won't runDoesn't fit in GPU memory + system RAM at any quantization.

Does the RTX 4070 / 4090 / Apple Silicon matter most?

For fit, it's the memory sizethat gates you: 12 GB cards comfortably run 7B–13B models, 24 GB cards reach 30B-class, and Apple Silicon with 64–128 GB of unified memory can hold genuinely large models (slowly). For speed, it's the bandwidth: a 4090 at ~1 TB/s generates roughly twice as fast as a 4070 on the same model.

Honest caveats

These are estimates, not benchmarks. Real speeds vary with your llama.cpp build, the exact GGUF quant, your context length, batching, flash attention, thermals, and background load. Treat the tokens/sec figures as a ballpark — useful for deciding what to try, not a guarantee. The honest way to know your number is to run the model.

When you do, Wide Area Intelligence turns that GPU into an OpenAI-compatible endpoint: deploy any Hugging Face GGUF to a node in one click, set your context window, and route requests with automatic cloud failover. Free for 2 nodes.

/// wide area ai

These numbers are theory. Your GPU is real — put it on the network.

Wide Area Intelligence turns any machine with a GPU into an OpenAI-compatible endpoint — routed, cached, and failed over automatically. Free for 2 nodes.

Start routing — free →