NPU Benchmark
LLM Inference Performance Testing for PC NPUs
A benchmark standard for measuring Large Language Model inference performance on PC NPUs — covering Qualcomm Snapdragon X, AMD Ryzen AI 300, and Intel Core Ultra Series 2. Each platform runs independently with its native execution path.
Methodology references MLPerf Client v1.5 (MLCommons). This tool is not affiliated with or endorsed by MLCommons. MLPerf® is a registered trademark of MLCommons Association.
Supported Platforms
Three independent benchmark applications, one shared standard for comparison
· Qualcomm Snapdragon X, with Hexagon NPU. Min Driver: ≥ 30.0.140.1000
· AMD Ryzen AI 300, XDNA2 architecture and later. Min Driver: ≥ 32.0.203.280
· Intel Core Ultra Series 2, with integrated NPU. Min Driver: ≥ 32.0.100.4297
Each platform is a separate application using its vendor-native execution path. Results are measured independently and made cross-comparable by the standardized methodology.
Benchmark Metrics
Six core metrics defined in NPUBench Standard v1.0
Time To First Token
Measures Prefill phase latency — time from complete input submission to first output token. Score is the average of 4 warm runs (runs 2–5). Cold start is recorded separately as reference only. Lower is better.
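A minimal sketch of this scoring rule, assuming five timed runs have already been collected (the function and result names here are illustrative, not the tool's actual API):

```python
from statistics import mean

def score_ttft(ttft_seconds: list[float]) -> dict:
    """Scoring rule per NPUBench Standard v1.0: run 1 is the cold start,
    recorded for reference only; the TTFT score is the mean of warm runs 2-5."""
    assert len(ttft_seconds) == 5, "expects 5 runs: 1 cold + 4 warm"
    cold, warm = ttft_seconds[0], ttft_seconds[1:]
    return {"cold_start_s": cold, "ttft_score_s": mean(warm)}

# Illustrative values (seconds); the first run includes model load/compile.
print(score_ttft([1.92, 0.41, 0.39, 0.40, 0.42]))
# -> {'cold_start_s': 1.92, 'ttft_score_s': 0.405}
```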
Tokens Per Second
Average token generation rate in the Decode phase after the first token. Human reading speed reference: ~4–5 tokens/sec. Score is the average of 4 warm runs (runs 2–5). Higher is better.
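One way to capture both TTFT and decode-phase TPS around a single streaming generation call. Here `stream_tokens` is a hypothetical token iterator standing in for whatever runtime the platform uses, not a vendor SDK API:

```python
import time

def measure_run(stream_tokens, prompt: str) -> tuple[float, float]:
    """Returns (ttft_seconds, decode_tokens_per_second) for one run."""
    t0 = time.perf_counter()
    first_token_at = None
    n_tokens = 0
    for _ in stream_tokens(prompt):          # yields one output token at a time
        n_tokens += 1
        if first_token_at is None:
            first_token_at = time.perf_counter()
    t_end = time.perf_counter()
    if first_token_at is None:
        raise RuntimeError("model produced no tokens")
    ttft = first_token_at - t0               # prefill latency
    decode_time = t_end - first_token_at     # time spent after the first token
    tps = (n_tokens - 1) / decode_time if decode_time > 0 else 0.0
    return ttft, tps
```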
NPU Acceleration Ratio
The speedup of NPU-primary execution over the CPU baseline, quantified for both TTFT and TPS. Every test session requires a CPU baseline run (Mode C) to compute this metric. Higher is better.
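Because TTFT is lower-is-better and TPS is higher-is-better, the two speedups invert the ratio so that a value above 1.0 always means the NPU is faster. A sketch with illustrative numbers:

```python
# Sketch of the acceleration-ratio computation; field names are assumptions.
def acceleration_ratios(cpu: dict, npu: dict) -> dict:
    return {
        "ttft_speedup": cpu["ttft_s"] / npu["ttft_s"],  # latency: CPU / NPU
        "tps_speedup": npu["tps"] / cpu["tps"],          # throughput: NPU / CPU
    }

cpu_baseline = {"ttft_s": 2.10, "tps": 6.3}   # Mode C results (illustrative)
npu_primary  = {"ttft_s": 0.40, "tps": 24.8}  # Mode A results (illustrative)
print(acceleration_ratios(cpu_baseline, npu_primary))
# -> {'ttft_speedup': 5.25, 'tps_speedup': ~3.94}
```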
Power Efficiency (TPS/W)
Tokens per second per watt of power consumption. A core AI PC evaluation dimension for understanding the energy cost of on-device inference. NPU mode typically uses significantly less power than CPU mode. Higher is better.
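A sketch of the computation, assuming power is sampled over the same decode window as the TPS measurement. How power is actually sampled (vendor telemetry, OS counters, or an external meter) is left open here:

```python
from statistics import mean

def tps_per_watt(tps: float, power_samples_w: list[float]) -> float:
    """Decode throughput divided by average power draw over the decode window."""
    return tps / mean(power_samples_w)

# Illustrative numbers only: 24.8 tok/s at roughly 8 W average draw.
print(tps_per_watt(24.8, [7.8, 8.1, 8.3, 7.9]))  # ~3.09 tokens/s per watt
```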
Multilingual Translation Testing
Translation quality evaluation across Japanese ↔ Chinese ↔ English. Results include BLEU score validation. Quality thresholds: JA↔ZH ≥ 12, EN↔JA ≥ 15. Tests real multilingual inference capability, not just speed.
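A minimal sketch of the quality gate, using the sacrebleu package as one possible scorer. The standard's actual scorer and tokenizer settings are an assumption here; CJK targets in particular need an appropriate tokenizer (e.g. sacrebleu's "zh" or "ja-mecab" options):

```python
import sacrebleu

# Thresholds from the standard, keyed by (source, target) language pair.
THRESHOLDS = {("ja", "zh"): 12.0, ("zh", "ja"): 12.0,
              ("en", "ja"): 15.0, ("ja", "en"): 15.0}

def passes_quality_gate(pair, hypotheses, references) -> bool:
    """hypotheses: model translations; references: one reference per segment."""
    bleu = sacrebleu.corpus_bleu(hypotheses, [references])
    return bleu.score >= THRESHOLDS[pair]
```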
3 Test Modes
Mode A — NPU Primary (NPU handles ≥80% of Prefill compute). Mode B — Hybrid (NPU + CPU/GPU cooperation). Mode C — CPU Baseline (required in every session to compute acceleration ratio). Aligned with MLPerf Client v1.5 execution paths.
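One hypothetical way a test harness could encode these modes; the names and the 80% prefill threshold mirror the description above and are not a published API:

```python
from enum import Enum

class TestMode(Enum):
    NPU_PRIMARY = "A"   # NPU handles >= 80% of prefill compute
    HYBRID = "B"        # NPU + CPU/GPU cooperation
    CPU_BASELINE = "C"  # required every session; anchors the acceleration ratio

REQUIRED_MODES = {TestMode.CPU_BASELINE}  # a session is invalid without Mode C
```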
Performance Rating System
NPUBench Standard v1.0, Section 7 · All thresholds derived from real hardware measurements
Reference data measured on Llama 3.2 3B Instruct (int4), translation task, average of 5 runs. Not affiliated with MLCommons.
Results & Leaderboard
Will be published after official release
The benchmark tool is currently under development. Official test results and community leaderboard will be published following the public release. Stay tuned.
Score = average of warm runs 2–5 per NPUBench Standard v1.0. Cold start recorded separately as reference.
Get NPU Benchmark
The tool is currently in development. We are building and validating the benchmark methodology across all three supported platforms before public release.
Open methodology. Source code licensed under Apache 2.0. Built for the NPU developer and research community.
· NPU Benchmark (NPUBench) is an independently developed benchmark standard. It has no affiliation with or endorsement by MLCommons Association.
· MLPerf® is a registered trademark of MLCommons Association. This tool's methodology references MLPerf Client v1.5 for metric definitions and quality thresholds, but does not constitute or claim official MLCommons certification.
· Use of Llama 3.x models (Meta Platforms) requires compliance with the Meta Llama 3 Community License Agreement. Test results generated using Llama models must not be used to improve non-Llama AI models. Commercial use with >700M monthly active users requires a separate license from Meta.
· Phi-3.5 Mini Instruct (Microsoft) is used under the MIT License.
· NPUBench source code is released under the Apache 2.0 License.
· Test results generated by this tool do not represent the official position or endorsement of any hardware vendor (Qualcomm, AMD, Intel).
· All benchmark scores are measured under specific hardware and software conditions. Results may vary depending on system configuration, driver version, and operating environment.