NPU Benchmark
LLM Inference Performance Testing for PC NPUs
A benchmark standard for measuring Large Language Model inference performance on PC NPUs — covering Qualcomm Snapdragon X, AMD Ryzen AI 300, and Intel Core Ultra Series 2. Each platform runs independently with its native execution path.
Methodology references MLPerf Client v1.5 (MLCommons). This tool is not affiliated with or endorsed by MLCommons. MLPerf® is a registered trademark of MLCommons Association.
Supported Platforms
Three independent benchmark applications, one shared standard for comparison
· Qualcomm Snapdragon X, with Hexagon NPU. Min Driver: ≥ 30.0.140.1000
· AMD Ryzen AI 300, XDNA2 architecture and later. Min Driver: ≥ 32.0.203.280
· Intel Core Ultra Series 2, with integrated NPU. Min Driver: ≥ 32.0.100.4297
Each platform is a separate application using its vendor-native execution path. Results are measured independently and made cross-comparable by the standardized methodology.
Benchmark Metrics
Six core metrics defined in NPUBench Standard v1.0
Time To First Token
Measures Prefill phase latency — time from complete input submission to first output token. Score is the average of 4 warm runs (runs 2–5). Cold start is recorded separately as reference only. Lower is better.
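A minimal sketch of this scoring rule, assuming five timed runs have already been collected (the function and result names here are illustrative, not the tool's actual API):

```python
from statistics import mean

def score_ttft(ttft_seconds: list[float]) -> dict:
    """Scoring rule per NPUBench Standard v1.0: run 1 is the cold start,
    recorded for reference only; the TTFT score is the mean of warm runs 2-5."""
    assert len(ttft_seconds) == 5, "expects 5 runs: 1 cold + 4 warm"
    cold, warm = ttft_seconds[0], ttft_seconds[1:]
    return {"cold_start_s": cold, "ttft_score_s": mean(warm)}

# Illustrative values (seconds); the first run includes model load/compile.
print(score_ttft([1.92, 0.41, 0.39, 0.40, 0.42]))
# -> {'cold_start_s': 1.92, 'ttft_score_s': 0.405}
```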
Tokens Per Second
Average token generation rate in the Decode phase after the first token. Human reading speed reference: ~4–5 tokens/sec. Score is the average of 4 warm runs (runs 2–5). Higher is better.
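One way to capture both TTFT and decode-phase TPS around a single streaming generation call. Here `stream_tokens` is a hypothetical token iterator standing in for whatever runtime the platform uses, not a vendor SDK API:

```python
import time

def measure_run(stream_tokens, prompt: str) -> tuple[float, float]:
    """Returns (ttft_seconds, decode_tokens_per_second) for one run."""
    t0 = time.perf_counter()
    first_token_at = None
    n_tokens = 0
    for _ in stream_tokens(prompt):          # yields one output token at a time
        n_tokens += 1
        if first_token_at is None:
            first_token_at = time.perf_counter()
    t_end = time.perf_counter()
    if first_token_at is None:
        raise RuntimeError("model produced no tokens")
    ttft = first_token_at - t0               # prefill latency
    decode_time = t_end - first_token_at     # time spent after the first token
    tps = (n_tokens - 1) / decode_time if decode_time > 0 else 0.0
    return ttft, tps
```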
NPU Acceleration Ratio
The speedup of NPU-primary execution over the CPU baseline, quantified for both TTFT and TPS. Every test session requires a CPU baseline run (Mode C) to compute this metric. Higher is better.
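Because TTFT is lower-is-better and TPS is higher-is-better, the two speedups invert the ratio so that a value above 1.0 always means the NPU is faster. A sketch with illustrative numbers:

```python
# Sketch of the acceleration-ratio computation; field names are assumptions.
def acceleration_ratios(cpu: dict, npu: dict) -> dict:
    return {
        "ttft_speedup": cpu["ttft_s"] / npu["ttft_s"],  # latency: CPU / NPU
        "tps_speedup": npu["tps"] / cpu["tps"],          # throughput: NPU / CPU
    }

cpu_baseline = {"ttft_s": 2.10, "tps": 6.3}   # Mode C results (illustrative)
npu_primary  = {"ttft_s": 0.40, "tps": 24.8}  # Mode A results (illustrative)
print(acceleration_ratios(cpu_baseline, npu_primary))
# -> {'ttft_speedup': 5.25, 'tps_speedup': ~3.94}
```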
Power Efficiency (TPS/W)
Tokens per second per watt of power consumption. A core AI PC evaluation dimension for understanding the energy cost of on-device inference. NPU mode typically uses significantly less power than CPU mode. Higher is better.
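A sketch of the computation, assuming power is sampled over the same decode window as the TPS measurement. How power is actually sampled (vendor telemetry, OS counters, or an external meter) is left open here:

```python
from statistics import mean

def tps_per_watt(tps: float, power_samples_w: list[float]) -> float:
    """Decode throughput divided by average power draw over the decode window."""
    return tps / mean(power_samples_w)

# Illustrative numbers only: 24.8 tok/s at roughly 8 W average draw.
print(tps_per_watt(24.8, [7.8, 8.1, 8.3, 7.9]))  # ~3.09 tokens/s per watt
```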
Multilingual Translation Testing
Translation quality evaluation across Japanese ↔ Chinese ↔ English. Results include BLEU score validation. Quality thresholds: JA↔ZH ≥ 12, EN↔JA ≥ 15. Tests real multilingual inference capability, not just speed.
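A minimal sketch of the quality gate, using the sacrebleu package as one possible scorer. The standard's actual scorer and tokenizer settings are an assumption here; CJK targets in particular need an appropriate tokenizer (e.g. sacrebleu's "zh" or "ja-mecab" options):

```python
import sacrebleu

# Thresholds from the standard, keyed by (source, target) language pair.
THRESHOLDS = {("ja", "zh"): 12.0, ("zh", "ja"): 12.0,
              ("en", "ja"): 15.0, ("ja", "en"): 15.0}

def passes_quality_gate(pair, hypotheses, references) -> bool:
    """hypotheses: model translations; references: one reference per segment."""
    bleu = sacrebleu.corpus_bleu(hypotheses, [references])
    return bleu.score >= THRESHOLDS[pair]
```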
3 Test Modes
Mode A — NPU Primary (NPU handles ≥80% of Prefill compute). Mode B — Hybrid (NPU + CPU/GPU cooperation). Mode C — CPU Baseline (required in every session to compute acceleration ratio). Aligned with MLPerf Client v1.5 execution paths.
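One hypothetical way a test harness could encode these modes; the names and the 80% prefill threshold mirror the description above and are not a published API:

```python
from enum import Enum

class TestMode(Enum):
    NPU_PRIMARY = "A"   # NPU handles >= 80% of prefill compute
    HYBRID = "B"        # NPU + CPU/GPU cooperation
    CPU_BASELINE = "C"  # required every session; anchors the acceleration ratio

REQUIRED_MODES = {TestMode.CPU_BASELINE}  # a session is invalid without Mode C
```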
Performance Rating System
NPUBench Standard v1.0, Section 7 · All thresholds derived from real hardware measurements
Reference data measured on Llama 3.2 3B Instruct (int4), translation task, average of 5 runs. Not affiliated with MLCommons.
Results & Leaderboard
Will be published after official release
The benchmark tool is currently under development. Official test results and community leaderboard will be published following the public release. Stay tuned.
Score = average of warm runs 2–5 per NPUBench Standard v1.0. Cold start recorded separately as reference.
Get NPU Benchmark
The tool is currently in development. We are building and validating the benchmark methodology across all three supported platforms before public release.
Open methodology. Source code licensed under Apache 2.0. Built for the NPU developer and research community.
· NPU Benchmark (NPUBench) is an independently developed benchmark standard. It has no affiliation with or endorsement by MLCommons Association.
· MLPerf® is a registered trademark of MLCommons Association. This tool's methodology references MLPerf Client v1.5 for metric definitions and quality thresholds, but does not constitute or claim official MLCommons certification.
· Use of Llama 3.x models (Meta Platforms) requires compliance with the Meta Llama 3 Community License Agreement. Test results generated using Llama models must not be used to improve non-Llama AI models. Commercial use with >700M monthly active users requires a separate license from Meta.
· Phi-3.5 Mini Instruct (Microsoft) is used under the MIT License.
· NPUBench source code is released under the Apache 2.0 License.
· Test results generated by this tool do not represent the official position or endorsement of any hardware vendor (Qualcomm, AMD, Intel).
· All benchmark scores are measured under specific hardware and software conditions. Results may vary depending on system configuration, driver version, and operating environment.