# Benchmarking Qualcomm's NPU on the Microsoft Surface Tablet

**TL;DR** - We see 1.3% of Qualcomm's NPU 45 Teraops/s claim when benchmarking Windows AI PCs

* Introduction
* Installation
  * Python
  * Cmake
  * Visual Studio
  * Pip Packages
* Benchmark
  * Running
  * Understanding the Output
  * What the Benchmark Measures
  * Possible Confounding Factors
    * Compute Bound
    * Power Settings
    * Model Topology
    * Configuration Errors
    * Onnx Framework
* Interpreting the Results

## Introduction

Microsoft now offers Surface tablets that run Windows on a Qualcomm Arm-based SoC. These are marketed as AI PCs, due to their ability to run machine learning models faster and more efficiently than other systems. We are fans of Qualcomm's hardware, and its NPU in particular, so we've invested a lot of time and resources into porting our third-party app to this platform.

Unfortunately there aren't many code examples or benchmarks available to demonstrate how to achieve fast results as an external developer, so we've put together a small standalone project to show the performance we're seeing. It's significantly below what we'd hoped for, so we're publishing this benchmark to see if we can get ideas on how to achieve lower latency. I'm hopeful there will be software changes, either at the application, framework, or driver level, that will improve these results in the future, since I've seen the underlying hardware perform very effectively on other platforms like Android.

## Installation

### Python

We're using Python to run our test scripts, and on Windows there are several ways to install the language. As of October 2nd, 2024, the Python available on the Microsoft Store doesn't support the Arm architecture, so it's not suitable for running the packages we need to access Qualcomm's NPU. Instead, you should use the official python.org installer. For the results reported here I used version 3.11.9.

### Cmake

We'll also need the cmake build tool to compile Onnx (since prebuilt packages aren't yet available for Windows on Arm). To install it, I ran the following command from a PowerShell:

```
winget install cmake
```

### Visual Studio

The build process also requires Visual Studio for the compiler. Download Visual Studio Community Edition (not Code!) from visualstudio.microsoft.com/downloads/. During the installation you will be prompted to select a Workload from several options: check the "Desktop development with C++" box, then press Install.

### Pip Packages

You can install all the required Python packages by running the following from within this folder:

```
py -m pip install -r requirements.txt
```

This includes a couple of custom packages. The first is my branch of Onnx, which has a fix for compiling using the official py launcher, backported to Onnx version 1.16 since the Qualcomm Onnx Runtime doesn't work with newer Onnx versions (giving an `Unsupported model IR version` error). I also grab a nightly build of Qualcomm's Onnx Runtime package. If you want to install a more recent version, there's a list here.
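As a quick sanity check after installation (this snippet is not part of the benchmark itself, and assumes the nightly Qualcomm Onnx Runtime wheel from `requirements.txt` installed correctly), you can confirm that the Qualcomm execution provider is visible to Onnx Runtime:

```python
# Confirm the QNN execution provider is registered with Onnx Runtime.
import onnxruntime as ort

print(ort.__version__)
print(ort.get_available_providers())
# "QNNExecutionProvider" should appear in the list on a correctly configured
# Windows on Arm machine; if it's missing, models will fall back to the CPU provider.
```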
## Benchmark

### Running

To execute the benchmark, run:

```
py benchmark_matmul.py
```

### Understanding the Output

The Onnx runtime initially generates a lot of log spam, including:

```
Error in cpuinfo: Unknown chip model name 'Snapdragon(R) X 12-core X1E80100 @ 3.40 GHz'. Please add new Windows on Arm SoC/chip support to arm/windows/init.c!
unknown Qualcomm CPU part 0x1 ignored
```

and

```
Starting stage: Finalizing Graph Sequence
Completed stage: Finalizing Graph Sequence (115919 us)
Starting stage: Completion
Completed stage: Completion (1025 us)
```

After all those messages, you should see the actual benchmark results at the end, something like this:

```
************ Benchmark Results ************
NPU quantized compute, float I/O accuracy difference is 0.0100
NPU quantized compute and I/O accuracy difference is 0.0060
CPU took 8.42ms, 821,141,860,688 ops per second
NPU (quantized compute, float I/O) took 30.63ms, 225,667,671,183 ops per second
NPU (quantized compute and I/O) took 12.05ms, 573,475,650,364 ops per second
```

The first two lines confirm that the numerical results of the operations match between the CPU and the NPU. The final three show the latency of the three approaches to running a simple model. The latency is the wall time it took to execute the model from start to finish, and the ops per second figure is calculated from that latency to indicate the equivalent computational throughput. In this example, the CPU is capable of running 821 billion ops per second (821 Gigaops), the first NPU approach gives us 225 Gigaops, and the second 573 Gigaops.
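Those throughput figures follow directly from the wall-clock latencies and the matrix shapes described in the next section. Here is a rough reconstruction of the arithmetic (not the benchmark's exact code):

```python
# Each output element of the (6, 1500, 256) x (6, 256, 1500) matmul needs
# 256 multiplies and 256 adds, so the total op count is 2 * batch * m * k * n.
batch, m, k, n = 6, 1500, 256, 1500
total_ops = 2 * batch * m * k * n  # 6,912,000,000 ops per model run

for name, latency_s in [("CPU", 8.42e-3),
                        ("NPU, float I/O", 30.63e-3),
                        ("NPU, quantized I/O", 12.05e-3)]:
    print(f"{name}: {total_ops / latency_s:,.0f} ops per second")
# Prints roughly 821, 226, and 574 billion ops per second, matching the
# reported results to within rounding of the displayed latencies.
```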
### What the Benchmark Measures

This benchmark is designed to resemble some real-world models we depend on, running 6 large matrix multiplications that are similar to the most time-consuming layers in transformer models like OpenAI's Whisper. The shapes are (6, 1500, 256) X (6, 256, 1500), producing a (6, 1500, 1500) result. The model we run consists of a single MatMul node with two inputs and one output. The models are created on the fly using the Onnx model framework and then fed into the Onnx runtime (sketched at the end of this section).

The control model is a pure float version that runs entirely on the CPU. The NPU mostly requires quantized models to run effectively (though it has limited support for float16). The first approach we took to quantization used the official ORT quantize_static() method. For convenience this leaves the input and output tensors in 32-bit float and performs runtime conversions at the start and end of the graph, so that the rest of the computation happens in eight-bit. Unfortunately we discovered that the conversion operations as implemented on the NPU were extremely slow, much slower than the main matrix multiplication in fact. You can see the results in the npu_quant_profile.csv file in this repository, with conversions taking over 75% of the time.

To work around this, we constructed an equivalent model graph programmatically with eight-bit inputs and outputs. This is the second "quantized compute and I/O" approach mentioned in the results. It is usually around three times faster than the float I/O version, and profiling shows most of the time is spent on the matrix multiplication, as we'd hope.
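To make the setup concrete, here is a minimal sketch of how a single-MatMul model with these shapes can be built on the fly and handed to the Qualcomm execution provider. It is not the benchmark's actual code: benchmark_matmul.py also quantizes the graph (the NPU needs eight-bit compute to run efficiently, as described above), and the provider option values shown are simply the ones mentioned elsewhere in this README.

```python
import numpy as np
import onnx
from onnx import TensorProto, helper
import onnxruntime as ort

batch, m, k, n = 6, 1500, 256, 1500

# Build a graph containing a single batched MatMul with two dynamic inputs.
a = helper.make_tensor_value_info("A", TensorProto.FLOAT, [batch, m, k])
b = helper.make_tensor_value_info("B", TensorProto.FLOAT, [batch, k, n])
c = helper.make_tensor_value_info("C", TensorProto.FLOAT, [batch, m, n])
matmul = helper.make_node("MatMul", ["A", "B"], ["C"])
graph = helper.make_graph([matmul], "matmul_benchmark", [a, b], [c])
model = helper.make_model(graph)
onnx.checker.check_model(model)

# Hand the model to Onnx Runtime's QNN execution provider, targeting the
# HTP (NPU) backend. In the real benchmark the graph is quantized first;
# a float graph like this one may simply fall back to the CPU provider.
session = ort.InferenceSession(
    model.SerializeToString(),
    providers=["QNNExecutionProvider"],
    provider_options=[{
        "backend_path": "QnnHtp.dll",
        "htp_performance_mode": "sustained_high_performance",
    }],
)

inputs = {
    "A": np.random.rand(batch, m, k).astype(np.float32),
    "B": np.random.rand(batch, k, n).astype(np.float32),
}
outputs = session.run(None, inputs)
print(outputs[0].shape)  # (6, 1500, 1500)
```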
### Possible Confounding Factors

There are a lot of variables involved in measuring performance. Here are some of the assumptions we've made:

#### Compute Bound

Modern transformer models are based around large matrix multiplications, unlike older convolutional models. One potential issue is that accelerators can become memory bound if the layers start to resemble matrix-times-vector operations, since that doesn't allow reuse of many of the weights, and performance becomes bottlenecked on fetching values from DRAM. We've tried to avoid that by making both input matrices comparatively square, so that tiling and reuse should be possible. The original matrices from the tiny Whisper model had a k dimension of only 64, so in case that was too small we bumped it up to 256 in this benchmark, to give as much room for SIMD optimizations as possible.

#### Power Settings

Windows has a lot of different configuration options around energy usage, so we tried to ensure that all of the settings were on "Best Performance" and that we ran the benchmark with the tablet connected to mains power. There's also a session option on the Qualcomm Onnx Runtime, htp_performance_mode, which we set to sustained_high_performance since that seemed to give the lowest overall latency in our experiments.

#### Model Topology

We wanted to create a graph of operations that reflected modern AI models but was simple enough to interpret easily. We could have added multiple layers, used convolutions, or used static weights, but we settled on a single matrix multiplication operation with dynamic inputs, since that reflects the transformer architectures that are widely used for LLMs and other modern models.

#### Configuration Errors

It's possible that the way we build and run our models causes them to fall off the fast path of the drivers or the accelerator implementation. For example, we're using unsigned eight-bit quantization, with QDQ elements in the graph. We've attempted to follow best practice from the documentation, but we'd welcome ways to improve performance, especially since they would also improve the performance of our actual applications.

#### Onnx Framework

There are multiple ways to access AI acceleration on Windows. We looked at DirectML, but it only seems to support GPU access. OpenVINO doesn't run on our Arm hardware, as far as we can tell. We've seen similar performance results to those shown here using the Qualcomm QNN SDK directly. TensorFlow Lite isn't supported on Windows for Arm. From this research and our experiments, Onnx is supported by both Microsoft and Qualcomm and seems to be the best framework to use to get accelerated performance from the NPU, but we're interested in learning whether other APIs would be more appropriate.

## Interpreting the Results

The results shown here are current as of October 2nd, 2024, running on a Microsoft Surface Pro 11th Edition with a Snapdragon(R) X 12-core X1E80100 clocked at 3.40 GHz.

The first obvious thing is that the NPU results, even without float conversion, are slower than the CPU. This is not ideal for an accelerator, even though it could still potentially offer energy or sustained-performance advantages that make it worth using.

The second conclusion is that the measured performance of 573 billion operations per second is only 1.3% of the 45 trillion ops/s that the marketing material promises. By contrast, running the same model on an Nvidia GeForce RTX 4080 Laptop GPU takes 3.2ms, an equivalent of 2,160 billion operations per second, almost four times the NPU's throughput.
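Those headline comparisons follow from the same op count used earlier; a quick arithmetic check:

```python
# Sanity-check the headline figures quoted above.
total_ops = 2 * 6 * 1500 * 256 * 1500   # ops per model run, as derived earlier

npu_ops_per_s = total_ops / 12.05e-3    # best NPU result: ~573 billion ops/s
gpu_ops_per_s = total_ops / 3.2e-3      # RTX 4080 Laptop GPU: ~2,160 billion ops/s

print(f"NPU versus the 45 Teraops/s claim: {npu_ops_per_s / 45e12:.1%}")             # ~1.3%
print(f"GPU versus measured NPU throughput: {gpu_ops_per_s / npu_ops_per_s:.1f}x")   # ~3.8x
```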