RTX 3090 vs RTX PRO 4000 (Blackwell): Inference, Training, and Power Cost
I wanted to know how my local RTX 3090 (used, ~€700) compares to one or more RTX PRO 4000 Blackwell cards (140W, new ~€1400) for both LLM inference (single user / scripts / my own agents) and fine-tuning image models (AI-toolkit).
My takeaway up front: with llama.cpp the 3090 is often faster at decoding; for training the RTX PRO 4000 is slightly ahead for me. In terms of power cost per token/epoch the RTX PRO 4000 is clearly the cheaper card to run.
Important: the numbers below are measurements from my own tests. Absolute values depend heavily on model, quantization, context length, drivers/build and settings. For the power cost calculations I intentionally use max power as a rough upper bound (3090: 370W, RTX PRO 4000: 140W) and an electricity price of €0.30/kWh.
For the RTX 5090 and the RTX PRO 6000 (Workstation) I also have measured average power draw, so for those two cards I calculate with avg power, which is closer to reality.
LLM inference (llama.cpp)
With llama.cpp it’s worth separating prompt processing (pp) and token generation (tg):
- pp (Prompt Processing): how fast the prompt is processed.
- tg (Token Generation / Decoding): how fast new output tokens are generated (what you usually “feel” as speed).
I also used llama-bench, because it’s more reproducible than a chat session with a growing context.
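For context, the kind of llama-bench call I mean looks roughly like this (the model path and the -ngl value are placeholders, not my exact setup):

```bash
# pp512 / tg128 benchmark, fully offloaded to the GPU:
#   -p 512  -> prompt-processing length (pp512)
#   -n 128  -> number of generated tokens (tg128)
#   -ngl 99 -> offload all layers to the GPU
# the model path is a placeholder, not my exact file
./llama-bench -m models/Qwen3-VL-32B-Q4_K_M.gguf -p 512 -n 128 -ngl 99
```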
llama-bench (tg128 / pp512)
Qwen3-VL-32B (Q4), roughly:
| GPU | pp512 (t/s) | tg128 (t/s) |
|---|---|---|
| RTX 3090 | ~1138 | ~36.24 |
| RTX PRO 4000 | ~1018 | ~27.11 |
| 4× RTX PRO 4000 | ~1136 | ~28.26 |
gpt-oss-20b (Q4), roughly:
| GPU | pp512 (t/s) | tg128 (t/s) |
|---|---|---|
| RTX 3090 | ~4217 | ~216.12 |
| RTX PRO 4000 | ~5540 | ~184.66 |
What I take from this:
- The RTX 3090 is noticeably faster at decoding (tg) in my tests, especially on “larger” models like Qwen3-VL-32B.
- The RTX PRO 4000 can be ahead in pp (e.g. gpt-oss-20b), but that doesn’t automatically translate to higher tg.
- 4× RTX PRO 4000 barely scales for single-stream tg when the model easily fits on a single GPU. Multi-GPU is more useful to fit larger models / longer contexts or to run parallel workloads (multiple concurrent runs); see the sketch below.
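For the multi-GPU point: the comparison is essentially pinning the run to one card versus letting llama.cpp split the model across all of them. A minimal sketch, assuming a current llama.cpp build (paths are placeholders):

```bash
# single card: hide the other GPUs from llama.cpp
CUDA_VISIBLE_DEVICES=0 ./llama-bench -m models/Qwen3-VL-32B-Q4_K_M.gguf -ngl 99 -p 512 -n 128

# all four cards: split the layers across every visible GPU
./llama-bench -m models/Qwen3-VL-32B-Q4_K_M.gguf -ngl 99 -p 512 -n 128 -sm layer
```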
120B MoE: why CPU MoE can be useful (and when it is not)
One reason I’m looking at MoE (Mixture of Experts) models: they can make very large models usable even when a GPU has too little VRAM for a full GPU run. In practice, some of the experts are executed on the CPU (e.g. via --n-cpu-moe), while the rest stays on the GPU.
This can work out better for performance than simply leaving some layers on the CPU in a dense model via --n-gpu-layers (ngl): with plain layer offloading, every CPU-resident layer has to run for every single token. With MoE, typically only a small subset of experts is active per token, so the CPU offload can be more "targeted".
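A minimal sketch of such a run, using the same --n-cpu-moe value as my gpt-oss-120b test further down (the model path is a placeholder):

```bash
# keep the MoE expert weights of the first 24 layers on the CPU,
# offload everything else to the GPU
./llama-cli -m models/gpt-oss-120b-Q4_K_XL.gguf -ngl 99 --n-cpu-moe 24
```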
The downside: if too much ends up on the CPU, or the CPU (and memory bandwidth) can’t keep up, --n-cpu-moe quickly becomes the bottleneck and tg drops hard. As a rough reference from llama-bench on 4× RTX PRO 4000:
| Setting | pp512 (t/s) | tg128 (t/s) |
|---|---|---|
| n_cpu_moe=0 | ~2586 | ~134 |
| n_cpu_moe=24 | ~542 | ~22 |
I also ran gpt-oss-120b as a “single-GPU real test” with llama-cli (Q4_K_XL, --n-cpu-moe 24). Prompt t/s varies with prompt length/session; generation is more stable:
| GPU | Prompt (t/s) | Generation (t/s) |
|---|---|---|
| RTX 3090 | ~65–301 | ~30.0–31.3 |
| RTX PRO 4000 | ~50.7–65.3 | ~11.9–12.9 |
Note: the gpt-oss-120b chat session on 1× RTX PRO 4000 ran on a different VastAI instance than the llama-bench results above (which I ran on 4× RTX PRO 4000).
Fine-tuning (image models)
My expectation here: a plain "GPU vs GPU" run without exotic offload setups should be a fairer comparison than layer offloading (where the host CPU, PCIe and timing quickly get involved).
“Qwen 2512” (layer offloading)
For both fine-tuning tests: 1 epoch = 250 steps.
Times are in mm:ss format.
| GPU | Sampling (per image) | Epoch |
|---|---|---|
| RTX PRO 4000 | 01:21 | 31:35 |
| RTX 3090 | 01:27 | 32:43 |
Z-image (batch size 1, no layer offloading)
This test is more meaningful for me.
Times are in mm:ss format.
| GPU | VRAM | Price (rough) | Sampling | Epoch | Power (basis) |
|---|---|---|---|---|---|
| RTX 3090 | 24 GB | ~€700 | 00:11.45 | 13:21 | Max 370W |
| RTX PRO 4000 | 24 GB | ~€1400 | 00:10.80 | 12:03 | Max 140W |
| RTX 5090 | 32 GB | ~€3000 | 00:04.30 | 04:36 | Avg ~540W |
| RTX PRO 6000 Blackwell (Workstation) | 96 GB | ~€8000 | 00:03.00 | 03:24 | Avg ~590W |
Between the RTX 3090 and the RTX PRO 4000 the difference is roughly 5–10% for me (depending on the metric).
Cost comparison for one fine-tuning run over 10 epochs (Z-image, batch 1):
| GPU | Time (10 epochs, hh:mm:ss) | Power basis | Power cost (10 epochs) | VastAI price | VastAI cost (10 epochs) |
|---|---|---|---|---|---|
| RTX 3090 | 2:13:30 | Max power (370W) | ~€0.25 | €0.19/h | ~€0.42 |
| RTX PRO 4000 | 2:00:30 | Max power (140W) | ~€0.08 | €0.26/h | ~€0.52 |
| RTX 5090 | 0:46:00 | Avg power (~540W) | ~€0.12 | €0.39/h | ~€0.30 |
| RTX PRO 6000 Blackwell (Workstation) | 0:34:00 | Avg power (~590W) | ~€0.10 | €0.85/h | ~€0.48 |
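The cost columns are just the epoch times pushed through the power formula from the next section, plus the hourly rental price. A minimal sketch for the RTX PRO 4000 row (numbers from the table above):

```bash
# 10 epochs at 12:03 per epoch = 7230 s
hours=$(echo "7230 / 3600" | bc -l)
# GPU-only power cost at max power (140 W) and 0.30 EUR/kWh
echo "power:  $(echo "140 / 1000 * $hours * 0.30" | bc -l) EUR"
# rental cost at the listed VastAI price of 0.26 EUR/h
echo "vastai: $(echo "$hours * 0.26" | bc -l) EUR"
```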
Power cost (local, GPU only)
I intentionally only count GPU power here and ignore the rest of the system. Formula:
- Cost = (W / 1000) * time_in_hours * €0.30/kWh
- time_in_hours = tokens / (t/s) / 3600
1,000,000 output tokens
Based on my llama-bench tg128 values (using max power as an upper bound):
| Model | GPU | tg128 (t/s) | Power cost per 1M output tokens |
|---|---|---|---|
| Qwen3-VL-32B (Q4) | RTX 3090 | ~36.24 | ~€0.85 |
| Qwen3-VL-32B (Q4) | RTX PRO 4000 | ~27.11 | ~€0.43 |
| gpt-oss-20b (Q4) | RTX 3090 | ~216.12 | ~€0.14 |
| gpt-oss-20b (Q4) | RTX PRO 4000 | ~184.66 | ~€0.06 |
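To make that concrete, a minimal sketch for the first row (Qwen3-VL-32B on the RTX 3090, ~36.24 t/s at 370 W max power):

```bash
# time to generate 1,000,000 output tokens at ~36.24 t/s
hours=$(echo "1000000 / 36.24 / 3600" | bc -l)
# GPU-only power cost at 370 W max power and 0.30 EUR/kWh
echo "$(echo "370 / 1000 * $hours * 0.30" | bc -l) EUR"   # ~0.85 EUR
```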
So: in my setup the RTX PRO 4000 is clearly more efficient in terms of “power per token”, even though it is often slower at tg decoding.
Conclusion (for my use case)
For single user / agents / scripts the RTX 3090 is usually faster in my llama.cpp tests. Over time, though, the RTX PRO 4000 looks much more attractive (power, heat, potentially noise), and in my fine-tuning tests it is at least on par, often slightly ahead.