NPU LLM Inference Benchmark Standard · v1.0

NPU Benchmark

LLM Inference Performance Testing for PC NPUs

A benchmark standard for measuring Large Language Model inference performance on PC NPUs — covering Qualcomm Snapdragon X, AMD Ryzen AI 300, and Intel Core Ultra Series 2. Each platform runs independently with its native execution path.

🚧 Currently in development · Release date TBD

Methodology references MLPerf Client v1.5 (MLCommons). This tool is not affiliated with or endorsed by MLCommons. MLPerf® is a registered trademark of MLCommons Association.

Supported Platforms

Three independent benchmark applications, one common standard for comparable results

Qualcomm
Snapdragon X Series
Snapdragon X Elite · Snapdragon X Plus
With Hexagon NPU
Execution Provider: NativeQNN
OS: Windows ARM64 · Min Memory: 32 GB
Min Driver: ≥ 30.0.140.1000
In Development
AMD
Ryzen AI 300 Series
Ryzen AI Max+ 395 · Ryzen AI 9 HX 370
XDNA2 architecture and later
Execution Provider: OrtGenAI-RyzenAI
OS: Windows x64 · Min Memory: 32 GB
Min Driver: ≥ 32.0.203.280
In Development
Intel
Core Ultra Series 2
Core Ultra 9 288V · Core Ultra 7 268V
With integrated NPU
Execution Provider: NativeOpenVINO
OS: Windows x64 · Min Memory: 16 GB
Min Driver: ≥ 32.0.100.4297
In Development

Each platform is a separate application using its vendor-native execution path. Results are measured independently per platform and are cross-comparable because every platform follows the same standard methodology.

Benchmark Metrics

Six core metrics defined in NPUBench Standard v1.0

TTFT

Time To First Token

Measures Prefill phase latency — time from complete input submission to first output token. Score is the average of 4 warm runs (runs 2–5). Cold start is recorded separately as reference only. Lower is better.
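As an illustration of this scoring rule, here is a minimal Python sketch that separates the cold start from the four warm runs. The function name and input format are illustrative only and are not taken from the tool.

```python
def score_ttft(ttft_runs_s: list[float]) -> dict:
    """Score TTFT per NPUBench Standard v1.0: average of warm runs 2-5.

    ttft_runs_s: per-run TTFT in seconds, run 1 first (cold start).
    """
    if len(ttft_runs_s) < 5:
        raise ValueError("need 5 runs: 1 cold start + 4 warm runs")
    cold_start = ttft_runs_s[0]    # recorded separately, reference only
    warm_runs = ttft_runs_s[1:5]   # runs 2-5
    return {
        "ttft_score_s": sum(warm_runs) / len(warm_runs),  # lower is better
        "cold_start_s": cold_start,
    }

# Example with illustrative numbers: 1 cold start + 4 warm runs
print(score_ttft([0.180, 0.031, 0.029, 0.028, 0.030]))
```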

🚀
TPS

Tokens Per Second

Average token generation rate in the Decode phase after the first token. Human reading speed reference: ~4–5 tokens/sec. Score is the average of 4 warm runs (runs 2–5). Higher is better.
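For the decode phase, a per-run TPS value can be derived from output-token timestamps. The sketch below assumes the harness records one wall-clock timestamp per emitted token (illustrative names only); the per-run values would then be averaged over warm runs 2–5, as with TTFT.

```python
def decode_tps(token_times_s: list[float]) -> float:
    """Decode-phase tokens per second for one run.

    token_times_s: timestamps (seconds) of each emitted output token, in
    order. The first entry is the first token (end of prefill); only
    generation after it counts toward TPS.
    """
    if len(token_times_s) < 2:
        raise ValueError("need at least 2 tokens to measure decode rate")
    decode_tokens = len(token_times_s) - 1              # tokens after the first
    decode_time = token_times_s[-1] - token_times_s[0]  # decode window
    return decode_tokens / decode_time                   # higher is better

# Example: 4 tokens emitted at these times -> 3 tokens in 0.09 s ≈ 33 t/s
print(decode_tps([0.500, 0.530, 0.560, 0.590]))
```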

NPUBench Exclusive

NPU Acceleration Ratio

Speedup of the NPU primary path relative to the CPU baseline; for TTFT it is computed as TTFT_CPU / TTFT_NPU. Quantifies the actual acceleration the NPU delivers for both TTFT and TPS. Every test session requires a CPU baseline run to compute this metric. Higher is better.
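The TTFT form of the ratio follows the Section 7 formula (NPU_Accel_TTFT = TTFT_CPU / TTFT_NPU); expressing the TPS speedup as TPS_NPU / TPS_CPU is an assumption based on "higher is better", and the numbers below are purely illustrative.

```python
def npu_accel_ttft(ttft_cpu_s: float, ttft_npu_s: float) -> float:
    """NPU_Accel_TTFT = TTFT_CPU / TTFT_NPU (Section 7); >1 means the NPU path is faster."""
    return ttft_cpu_s / ttft_npu_s

def npu_accel_tps(tps_npu: float, tps_cpu: float) -> float:
    """TPS speedup, assumed here as TPS_NPU / TPS_CPU so that higher is better."""
    return tps_npu / tps_cpu

# Purely illustrative numbers: 10 s CPU prefill vs 0.5 s on the NPU -> 20x
print(npu_accel_ttft(10.0, 0.5), npu_accel_tps(30.0, 6.0))
```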

🔋
NPUBench Exclusive

Power Efficiency (TPS/W)

Tokens per second per watt of power consumption. A core AI PC evaluation dimension for understanding the energy cost of on-device inference. NPU mode typically uses significantly less power than CPU mode. Higher is better.
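A minimal sketch of the Section 7 formula TPS_per_Watt = TPS / avg_power_W; how average power is sampled is not specified here, so the input is assumed to be supplied by the harness.

```python
def tps_per_watt(tps: float, avg_power_w: float) -> float:
    """TPS_per_Watt = TPS / avg_power_W (Section 7). Higher is better."""
    if avg_power_w <= 0:
        raise ValueError("average power must be positive")
    return tps / avg_power_w

# Example (illustrative): 30 t/s at 5 W average draw -> 6.0 tokens/s per watt
print(tps_per_watt(30.0, 5.0))
```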

🌐
NPUBench Exclusive

Multilingual Translation Testing

Translation quality evaluation across Japanese ↔ Chinese ↔ English. Results include BLEU score validation. Quality thresholds: JA↔ZH ≥ 12, EN↔JA ≥ 15. Tests real multilingual inference capability, not just speed.
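The standard names BLEU thresholds but, in this excerpt, no specific scoring tool. The sketch below uses sacrebleu as one possible implementation (an assumption); the threshold table, tokenizer choices, and function names are chosen for illustration only.

```python
import sacrebleu  # one possible BLEU implementation; not mandated by the standard

# Quality thresholds from the metric description, keyed by (source, target)
BLEU_THRESHOLDS = {("ja", "zh"): 12.0, ("zh", "ja"): 12.0,
                   ("en", "ja"): 15.0, ("ja", "en"): 15.0}

def bleu_passes(hypotheses: list[str], references: list[str],
                src: str, tgt: str) -> tuple[float, bool]:
    """Corpus BLEU for one language pair, checked against its threshold."""
    # ja/zh output is not whitespace-delimited, so pick a matching tokenizer
    # ("ja-mecab" requires sacrebleu's mecab extra); "13a" is the default.
    tokenize = {"ja": "ja-mecab", "zh": "zh"}.get(tgt, "13a")
    bleu = sacrebleu.corpus_bleu(hypotheses, [references], tokenize=tokenize)
    return bleu.score, bleu.score >= BLEU_THRESHOLDS[(src, tgt)]
```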

🧪

3 Test Modes

Mode A — NPU Primary (NPU handles ≥80% of Prefill compute). Mode B — Hybrid (NPU + CPU/GPU cooperation). Mode C — CPU Baseline (required in every session to compute the acceleration ratio). Aligned with MLPerf Client v1.5 execution paths.
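Because Mode C must appear in every session, a harness could enforce it when assembling the run plan. The sketch below is illustrative; the enum and function names are not taken from the tool.

```python
from enum import Enum

class TestMode(Enum):
    NPU_PRIMARY = "A"   # NPU handles >= 80% of prefill compute
    HYBRID = "B"        # NPU + CPU/GPU cooperation
    CPU_BASELINE = "C"  # required in every session for the acceleration ratio

def build_session(requested: list[TestMode]) -> list[TestMode]:
    """Return the run plan, forcing a CPU baseline into every session."""
    plan = list(dict.fromkeys(requested))      # de-duplicate, keep order
    if TestMode.CPU_BASELINE not in plan:
        plan.append(TestMode.CPU_BASELINE)
    return plan

# NPU primary run followed by the forced CPU baseline
print(build_session([TestMode.NPU_PRIMARY]))
```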

Performance Rating System

NPUBench Standard v1.0 · Section 7 · All thresholds derived from real hardware measurements

TTFT · Lower is Better
★★★ < 0.1 s · Qualcomm Snapdragon X NPU: 0.029 s
★★ 0.1 ~ 2.0 s · AMD Ryzen AI 300 NPU: 1.29 s
2.0 ~ 5.0 s · Intel Core Ultra S2 NPU: 2.91 s
> 5.0 s · CPU baseline range
Score = average of warm runs 2–5. Cold start excluded.
TPS · Higher is Better
★★★ > 40 t/s · Far exceeds reading speed
★★ 20 ~ 40 t/s · Exceeds reading speed
10 ~ 20 t/s · Comparable to reading speed
< 10 t/s · Below reading speed (~4–5 t/s)
Human reading speed reference: ~4–5 tokens/sec.
NPU Accel Ratio (TTFT) · Higher is Better
★★★ > 50 × · Qualcomm Snapdragon X: 314 ×
★★ 10 ~ 50 × · Intel Core Ultra S2: 26 ×
5 ~ 10 × · AMD Ryzen AI 300: 7 ×
< 5 × · NPU acceleration ineffective
NPU_Accel_TTFT = TTFT_CPU / TTFT_NPU. Avg of 5 runs.
TPS/W · Higher is Better (tiers binned by average power draw)
★★★ < 6 W · Qualcomm / Intel NPU: 3–5 W
★★ 6 ~ 15 W · AMD Ryzen AI 300: 5–10 W
15 ~ 30 W · CPU lower bound
> 30 W · CPU upper bound
TPS_per_Watt = TPS / avg_power_W. Avg of 5 runs.
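Mapping measured values onto the star tiers above is straightforward. In the sketch below the top two tiers follow the starred rows; assigning the lower two tiers to 1 and 0 stars is an assumption, since those rows are not explicitly starred here.

```python
def rate_ttft(ttft_s: float) -> int:
    """Stars for TTFT (lower is better), Section 7 tiers."""
    if ttft_s < 0.1:
        return 3
    if ttft_s < 2.0:
        return 2
    return 1 if ttft_s < 5.0 else 0   # lower-tier star counts assumed

def rate_tps(tps: float) -> int:
    """Stars for decode TPS (higher is better), Section 7 tiers."""
    if tps > 40:
        return 3
    if tps > 20:
        return 2
    return 1 if tps > 10 else 0       # lower-tier star counts assumed

def rate_accel_ttft(ratio: float) -> int:
    """Stars for NPU_Accel_TTFT (higher is better), Section 7 tiers."""
    if ratio > 50:
        return 3
    if ratio > 10:
        return 2
    return 1 if ratio > 5 else 0      # lower-tier star counts assumed

# Illustrative: 0.029 s TTFT, 33.3 t/s, 26x acceleration -> 3, 2, 2 stars
print(rate_ttft(0.029), rate_tps(33.3), rate_accel_ttft(26.0))
```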

Source: NPUBench Standard v1.0, Section 7. Reference data measured on Llama 3.2 3B Instruct (int4), translation task, average of 5 runs. Not affiliated with MLCommons.

Results & Leaderboard

Will be published after official release

📊
Coming Soon

The benchmark tool is currently under development. Official test results and community leaderboard will be published following the public release. Stay tuned.

Leaderboard columns: TTFT (lower is better) · TPS (higher is better) · NPU Accel (higher is better) · TPS/W (higher is better)

Score = average of warm runs 2–5 per NPUBench Standard v1.0. Cold start recorded separately as reference.

Get NPU Benchmark

The tool is currently in development. We are building and validating the benchmark methodology across all three supported platforms before public release.

Open methodology. Source code licensed under Apache 2.0. Built for the NPU developer and research community.

Status In Development
Release Date TBD
Platform Windows 11
License Apache 2.0
Platforms Qualcomm · AMD · Intel
Standard NPUBench v1.0