# Benchmarking Qualcomm's NPU on the Microsoft Surface Tablet

**TL;DR** - We see 1.3% of Qualcomm's NPU 45 Teraops/s claim when benchmarking Windows AI PCs

* Introduction
* Installation
  * Python
  * Cmake
  * Visual Studio
  * Pip Packages
* Benchmark
  * Running
  * Understanding the Output
  * What the Benchmark Measures
  * Possible Confounding Factors
    * Compute Bound
    * Power Settings
    * Model Topology
    * Configuration Errors
    * Onnx Framework
* Interpreting the Results

## Introduction

Microsoft now offers Surface tablets that run Windows on a Qualcomm Arm-based SoC. These are marketed as AI PCs, due to their ability to run machine learning models faster and more efficiently than other systems. We are fans of Qualcomm's hardware, and its NPU in particular, so we've invested a lot of time and resources into porting our third-party app to this platform.

Unfortunately there aren't many code examples or benchmarks available to demonstrate how to achieve fast results as an external developer, so we've put together a small standalone project to show the performance we're seeing. It's significantly below what we'd hoped for, so we're publishing this benchmark to see if we can get ideas on how to achieve lower latency. I'm hopeful there will be software changes, either at the application, framework, or driver level, that will improve these results in the future, since I've seen the underlying hardware perform very effectively on other platforms like Android.

## Installation

### Python

We're using Python to run our test scripts, and on Windows there are several ways to install the language. As of October 2nd, 2024, the Python available on the Microsoft Store doesn't support the Arm architecture, so it's not suitable for running the packages we need to access Qualcomm's NPU. Instead, you should use the official python.org installer. For the results reported here I used version 3.11.9.

### Cmake

We'll also need the cmake build tool to compile Onnx (since prebuilt packages aren't yet available for Windows on Arm). To install it, I ran the following command from a PowerShell:

```
winget install cmake
```

### Visual Studio

The build process also requires Visual Studio for the compiler. Download Visual Studio Community Edition (not Code!) from visualstudio.microsoft.com/downloads/. During the installation you will be prompted to select a Workload from several options: check the "Desktop development with C++" box, then press Install.

### Pip Packages

You can install all the required Python packages by running the following from within this folder:

```
py -m pip install -r requirements.txt
```

This includes a couple of custom packages. The first is my branch of Onnx, which has a fix for compiling using the official py launcher, backported to Onnx version 1.16 since the Qualcomm Onnx Runtime doesn't work with newer Onnx versions (giving an `Unsupported model IR version` error). I also grab a nightly build of Qualcomm's Onnx Runtime package. If you want to install a more recent version, there's a list here.
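As a quick sanity check after installation (this snippet is not part of the benchmark itself, and assumes the nightly Qualcomm Onnx Runtime wheel from `requirements.txt` installed correctly), you can confirm that the Qualcomm execution provider is visible to Onnx Runtime:

```python
# Confirm the QNN execution provider is registered with Onnx Runtime.
import onnxruntime as ort

print(ort.__version__)
print(ort.get_available_providers())
# "QNNExecutionProvider" should appear in the list on a correctly configured
# Windows on Arm machine; if it's missing, models will fall back to the CPU provider.
```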
## Benchmark

### Running

To execute the benchmark, run:

```
py benchmark_matmul.py
```

### Understanding the Output

The Onnx runtime initially generates a lot of log spam, including:

```
Error in cpuinfo: Unknown chip model name 'Snapdragon(R) X 12-core X1E80100 @ 3.40 GHz'. Please add new Windows on Arm SoC/chip support to arm/windows/init.c!
unknown Qualcomm CPU part 0x1 ignored
```

and

```
Starting stage: Finalizing Graph Sequence
Completed stage: Finalizing Graph Sequence (115919 us)
Starting stage: Completion
Completed stage: Completion (1025 us)
```

After all those messages, you should see the actual benchmark results at the end, something like this:

```
************ Benchmark Results ************
NPU quantized compute, float I/O accuracy difference is 0.0100
NPU quantized compute and I/O accuracy difference is 0.0060
CPU took 8.42ms, 821,141,860,688 ops per second
NPU (quantized compute, float I/O) took 30.63ms, 225,667,671,183 ops per second
NPU (quantized compute and I/O) took 12.05ms, 573,475,650,364 ops per second
```

The first two lines confirm that the numerical results of the operations match between the CPU and the NPU. The final three show the latency of the three approaches to running a simple model. The latency is the wall time it took to execute the model from start to finish, and the ops per second figure is calculated from that latency to indicate the equivalent computational throughput. In this example, the CPU is capable of running 821 billion ops per second (821 Gigaops), the first NPU approach gives us 225 Gigaops, and the second 573 Gigaops.
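Those throughput figures follow directly from the wall-clock latencies and the matrix shapes described in the next section. Here is a rough reconstruction of the arithmetic (not the benchmark's exact code):

```python
# Each output element of the (6, 1500, 256) x (6, 256, 1500) matmul needs
# 256 multiplies and 256 adds, so the total op count is 2 * batch * m * k * n.
batch, m, k, n = 6, 1500, 256, 1500
total_ops = 2 * batch * m * k * n  # 6,912,000,000 ops per model run

for name, latency_s in [("CPU", 8.42e-3),
                        ("NPU, float I/O", 30.63e-3),
                        ("NPU, quantized I/O", 12.05e-3)]:
    print(f"{name}: {total_ops / latency_s:,.0f} ops per second")
# Prints roughly 821, 226, and 574 billion ops per second, matching the
# reported results to within rounding of the displayed latencies.
```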
### What the Benchmark Measures

This benchmark is designed to resemble some real-world models we depend on, running 6 large matrix multiplications that are similar to the most time-consuming layers in transformer models like OpenAI's Whisper. The shapes are (6, 1500, 256) X (6, 256, 1500), producing a (6, 1500, 1500) result. The model we run consists of a single MatMul node with two inputs and one output. The models are created on the fly using the Onnx model framework and then fed into the Onnx runtime (sketched at the end of this section).

The control model is a pure float version that runs entirely on the CPU. The NPU mostly requires quantized models to run effectively (though it has limited support for float16). The first approach we took to quantization used the official ORT quantize_static() method. For convenience this leaves the input and output tensors in 32-bit float and performs runtime conversions at the start and end of the graph, so that the rest of the computation happens in eight-bit. Unfortunately we discovered that the conversion operations as implemented on the NPU were extremely slow, much slower than the main matrix multiplication in fact. You can see the results in the npu_quant_profile.csv file in this repository, with conversions taking over 75% of the time.

To work around this, we constructed an equivalent model graph programmatically with eight-bit inputs and outputs. This is the second "quantized compute and I/O" approach mentioned in the results. It is usually around three times faster than the float I/O version, and profiling shows most of the time is spent on the matrix multiplication, as we'd hope.
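To make the setup concrete, here is a minimal sketch of how a single-MatMul model with these shapes can be built on the fly and handed to the Qualcomm execution provider. It is not the benchmark's actual code: benchmark_matmul.py also quantizes the graph (the NPU needs eight-bit compute to run efficiently, as described above), and the provider option values shown are simply the ones mentioned elsewhere in this README.

```python
import numpy as np
import onnx
from onnx import TensorProto, helper
import onnxruntime as ort

batch, m, k, n = 6, 1500, 256, 1500

# Build a graph containing a single batched MatMul with two dynamic inputs.
a = helper.make_tensor_value_info("A", TensorProto.FLOAT, [batch, m, k])
b = helper.make_tensor_value_info("B", TensorProto.FLOAT, [batch, k, n])
c = helper.make_tensor_value_info("C", TensorProto.FLOAT, [batch, m, n])
matmul = helper.make_node("MatMul", ["A", "B"], ["C"])
graph = helper.make_graph([matmul], "matmul_benchmark", [a, b], [c])
model = helper.make_model(graph)
onnx.checker.check_model(model)

# Hand the model to Onnx Runtime's QNN execution provider, targeting the
# HTP (NPU) backend. In the real benchmark the graph is quantized first;
# a float graph like this one may simply fall back to the CPU provider.
session = ort.InferenceSession(
    model.SerializeToString(),
    providers=["QNNExecutionProvider"],
    provider_options=[{
        "backend_path": "QnnHtp.dll",
        "htp_performance_mode": "sustained_high_performance",
    }],
)

inputs = {
    "A": np.random.rand(batch, m, k).astype(np.float32),
    "B": np.random.rand(batch, k, n).astype(np.float32),
}
outputs = session.run(None, inputs)
print(outputs[0].shape)  # (6, 1500, 1500)
```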
### Possible Confounding Factors

There are a lot of variables involved in measuring performance. Here are some of the assumptions we've made:

#### Compute Bound

Modern transformer models are based around large matrix multiplications, unlike older convolutional models. One potential issue is that accelerators can become memory bound if the layers start to resemble matrix-times-vector operations, since that doesn't allow reuse of many of the weights, and performance becomes bottlenecked on fetching values from DRAM. We've tried to avoid that by making both input matrices comparatively square, so that tiling and reuse should be possible. The original matrices from the tiny Whisper model had a k dimension of only 64, so in case that was too small we bumped it up to 256 in this benchmark, to give as much room for SIMD optimizations as possible.

#### Power Settings

Windows has a lot of different configuration options around energy usage, so we tried to ensure that all of the settings were on "Best Performance" and that we ran the benchmark with the tablet connected to mains power. There's also a session option on the Qualcomm Onnx Runtime, htp_performance_mode, which we set to sustained_high_performance since that seemed to give the lowest overall latency in our experiments.

#### Model Topology

We wanted to create a graph of operations that reflected modern AI models but was simple enough to interpret easily. We could have added multiple layers, used convolutions, or used static weights, but we settled on a single matrix multiplication operation with dynamic inputs, since that reflects the transformer architectures that are widely used for LLMs and other modern models.

#### Configuration Errors

It's possible that the way we build and run our models causes them to fall off the fast path of the drivers or the accelerator implementation. For example, we're using unsigned eight-bit quantization, with QDQ elements in the graph. We've attempted to follow best practice from the documentation, but we'd welcome ways to improve performance, especially since they would also improve the performance of our actual applications.

#### Onnx Framework

There are multiple ways to access AI acceleration on Windows. We looked at DirectML, but it only seems to support GPU access. OpenVINO doesn't run on our Arm hardware, as far as we can tell. We've seen similar performance results to those shown here using the Qualcomm QNN SDK directly. TensorFlow Lite isn't supported on Windows for Arm. From this research and our experiments, Onnx is supported by both Microsoft and Qualcomm and seems to be the best framework to use to get accelerated performance from the NPU, but we're interested in learning whether other APIs would be more appropriate.

## Interpreting the Results

The results shown here are current as of October 2nd, 2024, running on a Microsoft Surface Pro 11th Edition with a Snapdragon(R) X 12-core X1E80100 clocked at 3.40 GHz.

The first obvious thing is that the NPU results, even without float conversion, are slower than the CPU. This is not ideal for an accelerator, even though it could still potentially offer energy or sustained-performance advantages that make it worth using.

The second conclusion is that the measured performance of 573 billion operations per second is only 1.3% of the 45 trillion ops/s that the marketing material promises. By contrast, running the same model on an Nvidia GeForce RTX 4080 Laptop GPU takes 3.2ms, an equivalent of 2,160 billion operations per second, almost four times the NPU's throughput.
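Those headline comparisons follow from the same op count used earlier; a quick arithmetic check:

```python
# Sanity-check the headline figures quoted above.
total_ops = 2 * 6 * 1500 * 256 * 1500   # ops per model run, as derived earlier

npu_ops_per_s = total_ops / 12.05e-3    # best NPU result: ~573 billion ops/s
gpu_ops_per_s = total_ops / 3.2e-3      # RTX 4080 Laptop GPU: ~2,160 billion ops/s

print(f"NPU versus the 45 Teraops/s claim: {npu_ops_per_s / 45e12:.1%}")             # ~1.3%
print(f"GPU versus measured NPU throughput: {gpu_ops_per_s / npu_ops_per_s:.1f}x")   # ~3.8x
```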