Can my GPU run local AI models?

Most modern GPUs can run small-to-mid LLMs locally. As a rule of thumb: 8GB VRAM handles 7B–8B models at Q4, 12GB comfortably runs 7B–13B, 16–24GB reaches 30B-class, and Apple Silicon with 64–128GB of unified memory can hold genuinely large models (slowly). Pick your exact card above for a per-model verdict.

How many tokens per second will I get?

For single-user generation, local LLM speed is memory-bandwidth bound: tokens/sec ≈ memory bandwidth ÷ bytes read per token. A 7B Q4 model reads ~4GB per token-step, so an RTX 4070 (~504 GB/s) lands in the tens of tokens per second and a 4090 (~1 TB/s) is roughly twice as fast. Mixture-of-experts models read only their active parameters, so they punch above their size.

What if a model is too big for my hardware?

When a model won't fit in VRAM plus system RAM, you have two options: run a smaller model or a lower quantization, or route those requests to the cloud. Wide Area Intelligence does this automatically — it serves what your GPU can handle locally for free and fails over to cloud models only when needed, behind one OpenAI-compatible endpoint.

Is this benchmark exact?

No — these are estimates from hardware specs, not measured benchmarks. Real speeds vary with your llama.cpp build, the exact GGUF quant, context length, batching, flash attention, and thermals. Treat the tokens/sec figures as a ballpark for deciding what to try, not a guarantee.

← all tools

free tool · no signup · runs in your browser

Can I Run This LLM? — Local AI edition

Can I run this LLM? Pick your GPU and RAM and get an instant verdict across text, image, audio and video — what runs great locally vs. what to send to the cloud — plus, for LLMs, which models run well, estimated tokens per second, and the best quantization for each.

Your hardware

GPU / acceleratorSystem RAM

12 GB VRAM @ 504 GB/s · 32 GB system RAM

What can this hardware do?

Across text, image, audio and video — and what's better run in the cloud.

Text / chat

Runs great · ~28 tok/s · Best on RTX 4070

Qwen3 14B · Mixtral 8x7B (47B, 13B active)

Coding

Runs · ~210 tok/s · Best on RTX 4070

GLM-4.7 Flash (31B-A3B)

Image

Runs great · Fast on RTX 4070

SDXL Turbo · RealVisXL V5.0 — realistic people & products · DreamShaper XL Turbo — illustration & concept art · CyberRealistic XL — uncensored realistic

Audio & speech

Runs great · Fast on RTX 4070

Whisper Base (English) · Whisper Large v3 Turbo

Video

Use the cloud · Best done in the cloud

Wan 2.2 TI2V 5B · LTX-2.3 (22B)

Run locally

Text / chat (LLMs) — Best on RTX 4070
Image generation — Fast on RTX 4070
Speech & transcription — Fast on RTX 4070

Send to the cloud

Video generation — Best done in the cloud

36 of 56 models run well on your RTX 4070 (12GB).

26 run great · 10 run with trade-offs · 8 technically run · 12won't fit.

Best for chat

Qwen3 14B

14.8B · Q5_K_M

28 tok/s

est. generation speed

Best for coding

GLM-4.7 Flash (31B-A3B)

31.2B · IQ2_M

210 tok/s

est. generation speed

Best for reasoning

DeepSeek R1 Distill Qwen 14B

14.8B · Q4_K_M

31 tok/s

est. generation speed

Runs great

(26)

Fits fully in your GPU memory at good quality — fast, everyday usable

Model	Params	Best quant	Est. speed	Memory needed	Max context
DeepSeek R1 Distill Qwen 14B	14.8B	Q4_K_M	31 tok/s	10.6 GB	8K
Qwen3 14B	14.8B	Q5_K_M	28 tok/s	11.8 GB	8K
Phi-4 14B	14.7B	Q4_K_M	31 tok/s	10.6 GB	8K
Mistral Nemo 12B	12.2B	Q6_K	29 tok/s	11.4 GB	8K
Gemma 4 12B	12.2B	Q6_K	30 tok/s	10.9 GB	16K
Gemma 3 12B	12.2B	Q6_K	30 tok/s	10.9 GB	16K
Qwen3 8B	8.2B	Q8_0	33 tok/s	10.0 GB	16K
Llama 3.1 8B	8.0B	Q8_0	34 tok/s	9.7 GB	16K
DeepSeek R1 Distill Llama 8B	8.0B	Q8_0	34 tok/s	9.7 GB	16K
Gemma 4 E4B	8.0B	Q8_0	75 tok/s	8.8 GB	128K
Gemma 3n E4B	7.8B	Q8_0	75 tok/s	8.6 GB	32K
Qwen2.5 7B	7.6B	Q8_0	38 tok/s	8.7 GB	32K
Qwen2.5 Coder 7B	7.6B	Q8_0	38 tok/s	8.7 GB	32K
DeepSeek R1 Distill Qwen 7B	7.6B	Q8_0	38 tok/s	8.7 GB	64K
Mistral 7B v0.3	7.2B	Q8_0	37 tok/s	8.9 GB	32K
Gemma 3n E2B	5.4B	Q8_0	147 tok/s	6.2 GB	32K
Gemma 4 E2B	5.1B	Q8_0	150 tok/s	5.9 GB	128K
Gemma 3 4B	4.3B	Q8_0	67 tok/s	5.3 GB	128K
Qwen3 4B	4.0B	Q8_0	60 tok/s	5.8 GB	32K
Phi-4 Mini 3.8B	3.8B	Q8_0	64 tok/s	5.5 GB	32K
Llama 3.2 3B	3.2B	Q8_0	75 tok/s	4.8 GB	64K
Qwen3 1.7B	2.0B	Q8_0	106 tok/s	3.6 GB	32K
DeepSeek R1 Distill Qwen 1.5B	1.8B	Q8_0	155 tok/s	2.7 GB	128K
Llama 3.2 1B	1.2B	Q8_0	208 tok/s	2.2 GB	128K
Gemma 3 1B	1000M	Q8_0	296 tok/s	1.8 GB	32K
Qwen3 0.6B	752M	Q8_0	189 tok/s	2.4 GB	32K

Runs with trade-offs

(10)

Works, but needs reduced quantization or feels slow

Model	Params	Best quant	Est. speed	Memory needed	Max context
Mixtral 8x7B (47B, 13B active)	46.7B	Q4_K_M	5.1 tok/s	28.3 GB	8K
Nemotron 3 Nano 30B-A3B	31.6B	IQ2_M	193 tok/s	11.7 GB	8K
GLM-4.7 Flash (31B-A3B)	31.2B	IQ2_M	210 tok/s	11.6 GB	8K
Qwen3 30B-A3B (MoE)	30.5B	IQ2_M	164 tok/s	11.7 GB	8K
Qwen3 Coder 30B-A3B	30.5B	IQ2_M	164 tok/s	11.7 GB	8K
Gemma 3 27B	27.4B	IQ2_M	30 tok/s	11.0 GB	16K
Gemma 4 26B (A4B MoE)	26.5B	IQ2_M	165 tok/s	10.2 GB	32K
Devstral Small 2 24B	24.0B	IQ2_M	33 tok/s	10.1 GB	16K
Mistral Small 3.2 24B	24.0B	IQ2_M	33 tok/s	10.1 GB	16K
GPT-OSS 20B (3.6B active)	21.5B	Q3_K_M	163 tok/s	11.0 GB	32K

Technically runs

(8)

Heavy CPU offloading or extreme quantization — expect single-digit tokens/sec

Model	Params	Best quant	Est. speed	Memory needed	Max context
GLM-4.5 Air (106B, 12B active)	110.5B	IQ2_M	6.6 tok/s	39.2 GB	8K
Llama 4 Scout (109B, 17B active)	108.6B	IQ2_M	5.1 tok/s	38.7 GB	8K
Qwen3 Coder Next (80B-A3B)	79.7B	IQ2_M	24 tok/s	28.2 GB	8K
Llama 3.3 70B	70.6B	IQ2_M	1.7 tok/s	26.9 GB	8K
DeepSeek R1 Distill Llama 70B	70.6B	IQ2_M	1.7 tok/s	26.9 GB	8K
DeepSeek R1 Distill Qwen 32B	32.8B	Q4_K_M	2.6 tok/s	21.4 GB	8K
Qwen3 32B	32.8B	Q4_K_M	2.6 tok/s	21.4 GB	8K
Gemma 4 31B	32.7B	Q4_K_M	2.6 tok/s	21.3 GB	8K

Won't run

(12)

Doesn't fit in your GPU memory + system RAM at any quantization

Model	Params	Best quant	Est. speed	Memory needed	Max context
Kimi K2.6 (1T, 32B active)	1.1T	Q4_K_M	—	602.7 GB	—
Kimi K2 (1T, 32B active)	1.0T	Q4_K_M	—	584.4 GB	—
DeepSeek V4 Pro (862B MoE)	861.6B	Q4_K_M	—	490.8 GB	—
GLM-5 (754B MoE)	753.9B	Q4_K_M	—	429.7 GB	—
DeepSeek V3.2 (671B, 37B active)	685.4B	Q4_K_M	—	390.7 GB	—
DeepSeek R1 (671B, 37B active)	684.5B	Q4_K_M	—	390.2 GB	—
Llama 3.1 405B	405.9B	Q4_K_M	—	235.3 GB	—
Llama 4 Maverick (400B, 17B active)	401.6B	Q4_K_M	—	230.4 GB	—
Qwen3 235B-A22B (MoE)	235.1B	Q4_K_M	—	135.8 GB	—
MiniMax M2.7 (229B, 10B active)	228.7B	Q4_K_M	—	132.6 GB	—
DeepSeek V4 Flash (158B MoE)	158.1B	Q4_K_M	—	90.9 GB	—
GPT-OSS 120B (5.1B active)	120.4B	Q4_K_M	—	69.4 GB	—

How "Can I run it?" works for local AI

Running a large language model on your own GPU comes down to two questions: does it fit, and how fast does it generate. This tool answers both from the specs of the hardware you select — no download, no install, no signup.

First, fit. A model needs room for its weights (parameters × bytes per parameter, set by the quantization) plus a KV cachethat grows with your context window, plus a little runtime overhead. If that total is larger than your GPU memory, layers spill into system RAM ("CPU offload"), which is far slower. If it doesn't fit even in VRAM plus RAM, the model won't run at all.

Why memory bandwidth decides your tokens/sec

For a single user generating one token at a time, local LLM inference is almost entirely memory-bandwidth bound, not compute bound. Every generated token requires reading the active model weights from memory once. So a rough speed estimate is:

tokens/sec ≈ memory bandwidth ÷ bytes read per token

A 7B model at Q4 reads roughly 4 GB per token-step. An RTX 4070 moves about 504 GB/s, and real-world efficiency on GPUs is around 65% of theoretical — so you land in the tens of tokens per second. Mixture-of-experts models read only their active params per token, so they punch above their total size. CPU and unified-memory systems run at lower efficiency, which the estimates account for.

What the tiers mean

Tier	What it means
Runs great	Fits fully in GPU memory at good quality — fast, everyday usable.
Runs with trade-offs	Works, but needs a reduced quant or feels slow.
Technically runs	Heavy CPU offload or extreme quant — expect single-digit tokens/sec.
Won't run	Doesn't fit in GPU memory + system RAM at any quantization.

Does the RTX 4070 / 4090 / Apple Silicon matter most?

For fit, it's the memory sizethat gates you: 12 GB cards comfortably run 7B–13B models, 24 GB cards reach 30B-class, and Apple Silicon with 64–128 GB of unified memory can hold genuinely large models (slowly). For speed, it's the bandwidth: a 4090 at ~1 TB/s generates roughly twice as fast as a 4070 on the same model.

Honest caveats

These are estimates, not benchmarks. Real speeds vary with your llama.cpp build, the exact GGUF quant, your context length, batching, flash attention, thermals, and background load. Treat the tokens/sec figures as a ballpark — useful for deciding what to try, not a guarantee. The honest way to know your number is to run the model.

When you do, Wide Area Intelligence turns that GPU into an OpenAI-compatible endpoint: deploy any Hugging Face GGUF to a node in one click, set your context window, and route requests with automatic cloud failover. Free for 2 nodes.

Frequently asked questions

Can my GPU run local AI models?: Most modern GPUs can run small-to-mid LLMs locally. As a rule of thumb: 8GB VRAM handles 7B–8B models at Q4, 12GB comfortably runs 7B–13B, 16–24GB reaches 30B-class, and Apple Silicon with 64–128GB of unified memory can hold genuinely large models (slowly). Pick your exact card above for a per-model verdict.
How many tokens per second will I get?: For single-user generation, local LLM speed is memory-bandwidth bound: tokens/sec ≈ memory bandwidth ÷ bytes read per token. A 7B Q4 model reads ~4GB per token-step, so an RTX 4070 (~504 GB/s) lands in the tens of tokens per second and a 4090 (~1 TB/s) is roughly twice as fast. Mixture-of-experts models read only their active parameters, so they punch above their size.
What if a model is too big for my hardware?: When a model won't fit in VRAM plus system RAM, you have two options: run a smaller model or a lower quantization, or route those requests to the cloud. Wide Area Intelligence does this automatically — it serves what your GPU can handle locally for free and fails over to cloud models only when needed, behind one OpenAI-compatible endpoint.
Is this benchmark exact?: No — these are estimates from hardware specs, not measured benchmarks. Real speeds vary with your llama.cpp build, the exact GGUF quant, context length, batching, flash attention, and thermals. Treat the tokens/sec figures as a ballpark for deciding what to try, not a guarantee.

Related reading: The best local LLMs in 2026. Ready to use that hardware? Turn your GPU into an OpenAI-compatible endpoint — free for 2 nodes.

/// wide area ai

These numbers are theory. Your GPU is real — put it on the network.

Wide Area Intelligence turns any machine with a GPU into an OpenAI-compatible endpoint — routed, cached, and failed over automatically. Free for 2 nodes.

Start routing — free →