Home
        _______               __                   _______
       |   |   |.---.-..----.|  |--..-----..----. |    |  |.-----..--.--.--..-----.
       |       ||  _  ||  __||    < |  -__||   _| |       ||  -__||  |  |  ||__ --|
       |___|___||___._||____||__|__||_____||__|   |__|____||_____||________||_____|
                                                             on Gopher (inofficial)
  HTML Visit Hacker News on the Web
       
       
       COMMENT PAGE FOR:
  HTML   AI PCs Aren't Good at AI: The CPU Beats the NPU
       
       
        ein0p wrote 16 min ago:
        Memory bound workload is memory bound. Doesn’t matter how many TOPS
        you have if you’re sitting idle waiting on DRAM during generation.
         You will, however, notice a difference in prefill for long prompts.
       
        _davide_ wrote 33 min ago:
         The RTX 4080 should be capable of ~40 TFLOPS, yet they only report
         2,160 billion operations per second. Shouldn't this be enough to
         reconsider the benchmark? They probably made some serious error in
         measuring FLOPS.
         As for the CPU beating the NPU: that is possible, but they should
         benchmark many matrix multiplications without any application
         synchronization in order to have a decent comparison.
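         
         As a rough back-of-the-envelope check (a sketch assuming the ~40
         TFLOPS peak figure above; the real peak depends on clocks and the
         data type used):
         
             measured = 2_160e9      # 2,160 billion ops/s reported for the GPU
             assumed_peak = 40e12    # ~40 TFLOPS assumed for a laptop RTX 4080
             print(f"{measured / assumed_peak:.1%} of assumed peak")  # ~5.4%
         
         Reaching only a few percent of peak on plain matrix multiplications
         is why a measurement or setup error seems plausible.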
       
        guelermus wrote 1 hour 33 min ago:
         One should also pay attention to power efficiency; a direct
         comparison could be misleading here.
       
        freehorse wrote 2 hours 17 min ago:
         I always thought that the main point of NPUs is energy efficiency
         (and being able to run ML models without taking over all computer
         resources, making it practical to integrate ML applications into the
         OS itself in ways that do not disturb the user or the workflow)
         rather than being exceptionally fast. At least this has been my
         experience with running stable diffusion on Macs. It is similar to
         using other specialised hardware like media encoders: they are not
         necessarily faster than a CPU if you throw a dozen+ CPU cores at the
         task, but they draw a minuscule fraction of the power.
       
        piskov wrote 2 hours 18 min ago:
        Snapdragon touts 45 TOPS but it’s only int8.
        
         For example, Apple's M3 neural engine is a mere 18 TOPS, but it’s
         FP16.
         
         So Windows has the bigger number, but it’s not an apples-to-apples
         comparison.
        
         Did the author test int8 performance?
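         
         A rough way to put the two marketing numbers on the same footing (a
         sketch that assumes int8 throughput is about double FP16 throughput,
         which is common but varies by NPU):
         
             snapdragon_int8_tops = 45   # Qualcomm's headline figure, int8
             apple_m3_fp16_tops = 18     # Apple's headline figure, FP16
             int8_per_fp16 = 2           # assumed ratio, not a vendor number
             fp16_equiv = snapdragon_int8_tops / int8_per_fp16
             print(fp16_equiv, apple_m3_fp16_tops)  # ~22.5 vs 18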
       
        stanleykm wrote 2 hours 22 min ago:
        the ARM SME could be an interesting alternative to NPUs in the future.
         Unlike NPUs, which at best expose some fixed-function API, it will be
         possible to program SME more directly.
       
        cjbgkagh wrote 2 hours 28 min ago:
        > We've tried to avoid that by making both the input matrices more
        square, so that tiling and reuse should be possible.
        
         While it might be possible, it would not surprise me if a number of
         possible optimizations had not made it into ONNX. It appears that
         Qualcomm does not give direct access to the NPU and users are
         expected to use frameworks to convert models over to it, and in my
         experience conversion tools generally suck and leave a lot of
         optimizations on the table. It could be less that NPUs suck and more
         that the conversion tools suck. I'll wait until I get direct access -
         I don't trust conversion tools.
        
         My view of NPUs is that they're great for tiny ML models and very
         fast function approximations, which is my intended use case. While
         LLMs are the new hotness, there is a huge number of specialized tasks
         that small models are really useful for.
       
          jaygreco wrote 1 hour 41 min ago:
           I came here to say this. I haven’t worked with the Elite X, but on
           the past-gen stuff I’ve used (865 mostly), the accelerators - the
           compute DSP and a much smaller NPU - required _very_ specific
           setup, compilation with a bespoke toolchain, and communication via
           RPC, to name a few things.
          
          I would hope the NPU on Elite X is easier to get to considering the
          whole copilot+ thing, but I bring this up mainly to make the point
          that I doubt it’s just as easy as “run general purpose model,
          expect it to magically teleport onto the NPU”.
       
        NoPicklez wrote 3 hours 48 min ago:
        Fairly misleading title, boiling down AI PCs to just the Microsoft
        Surface running Qualcomm
       
        teilo wrote 4 hours 12 min ago:
        Actual article title: Benchmarking Qualcomm's NPU on the Microsoft
        Surface Tablet
        
        Because this isn't about NPUs. It's about a specific NPU, on a specific
        benchmark, with a specific set of libraries and frameworks. So
        basically, this proves nothing.
       
          gnabgib wrote 2 hours 23 min ago:
          The title is from the original article ( [1] ), the URL was changed
          by dang:
          
  HTML    [1]: https://petewarden.com/2024/10/16/ai-pcs-arent-very-good-at-...
  HTML    [2]: https://news.ycombinator.com/item?id=41863591
       
          iml7 wrote 3 hours 50 min ago:
           But then you can’t get more clicks. You have to attack enough
           people to get clicks. I feel like this place is becoming more and
           more filled with posts and titles like this.
       
            gerdesj wrote 2 hours 29 min ago:
             Internet points are a bit crap, but HN generally discusses things
             properly, and off-topic and downright weird stuff generally gets
             downvoted to doom.
       
        p1necone wrote 4 hours 23 min ago:
        I might be overly cynical but I just assumed that the entire purpose of
        "AI PCs" was marketing - of course they don't actually achieve much.
        Any real hardware that's supposedly for the "AI" features will actually
        be just special purpose hardware for literally anything the sales
        department can lump under that category.
       
        Havoc wrote 4 hours 50 min ago:
        >We see 1.3% of Qualcomm's NPU 45 Teraops/s claim
        
        To me that suggests that the test is wrong.
        
         I could see Intel massaging results, but being that far off seems
         incredibly improbable.
       
        m00x wrote 4 hours 59 min ago:
        NPUs are efficient, not especially fast. The CPU is much bigger than
        the NPU and has better cache access. Of course it'll perform better.
       
          acdha wrote 3 hours 52 min ago:
          It’s more complicated than that (you’re assuming that the bigger
          CPU is optimized for the same workload) but it’s also irrelevant to
          the topic at hand: they’re seeing this NPU within a factor of 2-4
          of the CPU, but if it performed half as well as Qualcomm claims it
          would be an order of magnitude faster. The story here isn’t another
          round of the specialized versus general debate but that they fell so
          far short of their marketing claims.
       
        Mistletoe wrote 5 hours 14 min ago:
        >The second conclusion is that the measured performance of 573 billion
        operations per second is only 1.3% of the 45 trillion ops/s that the
        marketing material promises.
        
        It just gets so hard to take this industry seriously.
       
        downrightmike wrote 5 hours 18 min ago:
        They should have just made a pci card and not tried to push whole new
        machines on us. We are all good with the machines we already have. If
        you want to sell a new feature, then it needs to be an add-on
       
        hkgjjgjfjfjfjf wrote 5 hours 49 min ago:
        Sutherland's wheel of reincarnation turns.
       
        lostmsu wrote 6 hours 10 min ago:
         The author's benchmark sucks if he could only get 2 TOPS from a
         laptop 4080. The thing should be doing somewhere around 80 TOPS.
         
         Given that, you should take his NPU results with a truckload of salt.
       
        protastus wrote 6 hours 55 min ago:
        Deploying a model on an NPU requires significant profile based
        optimization. Picking up a model that works fine on the CPU but hasn't
        been optimized for an NPU usually leads to disappointing results.
       
          CAP_NET_ADMIN wrote 5 hours 18 min ago:
          Beauty of CPUs - they'll chew through whatever bs code you throw at
          them at a reasonable speed.
       
          catgary wrote 5 hours 22 min ago:
          Yeah whenever I’ve spoken to people who work on stuff like IREE or
          OpenXLA they gave me the impression that understanding how to use
          those compilers/runtimes is an entire job.
       
        wmf wrote 7 hours 3 min ago:
        This headline is seriously misleading because the author did not test
        AMD or Intel NPUs. If Qualcomm is slow don't say all AI PCs are not
        good.
       
        eightysixfour wrote 7 hours 33 min ago:
        I thought the purpose of these things was not to be fast, but to be
        able to run small models with very little power usage? I have a newer
        AMD laptop with an NPU, and my power usage doesn't change using the
        video effects that supposedly run on it, but goes up when using the
        nvidia studio effects.
        
        It seems like the NPUs are for very optimized models that do small
        tasks, like eye contact, background blur, autocorrect models,
        transcription, and OCR. In particular, on Windows, I assumed they were
        running the full screen OCR (and maybe embeddings for search) for the
        rewind feature.
       
          godelski wrote 4 hours 55 min ago:
          > but to be able to run small models with very little power usage
          
          yes
          
           But first, I should also say you probably don't want to be
           programming these things with Python. I doubt you'll get good
           performance there, especially as the newness means optimizations
           haven't been ported well (even using things like TensorRT is not
           going to be as fast as writing it from scratch, and Nvidia is
           throwing a lot of manpower at that -- for good reason! -- but it
           sure as hell will get close and save you a lot of time writing).
          
          They are, like you say, generally optimized for doing repeated
          similar tasks. That's also where I suspect some of the info gathered
          here is inaccurate.
          
             (I have not used these NPU chips, so what follows is educated
           guessing, but I'll explain. Please correct me if I've made an
           error.)
          
           Second, I don't trust the timing here. I'm certain the CUDA timing
           (at the end) is incorrect, as the code as written wouldn't time
           properly. Timing is surprisingly not easy. I suspect the advertised
           operations only count operations directly on the NPU, while OP
           would have included CPU operations in their NPU and GPU timings[0].
           But the docs have benchmarking tools, so I suspect they're doing
           something similar. I'd be interested to know the variance and how
           this holds up after doing warmups. They do identify IO as an
           issue, which I think is evidence of this.
          
          Third, their data is improperly formatted.
          
            MATRIX_COUNT, MATRIX_A, MATRIX_B, MATRIX_K = (6, 1500, 1500, 256)
            INPUT0_SHAPE = [1, MATRIX_COUNT, MATRIX_A, MATRIX_K]
            INPUT1_SHAPE = [1, MATRIX_COUNT, MATRIX_K, MATRIX_B]
            OUTPUT_SHAPE = [1, MATRIX_COUNT, MATRIX_A, MATRIX_B]
          
          You want "channels last" here. I suspected this (do this in pytorch
          too!) and the docs they link confirm.
          
           1500 is also an odd choice and could cause extra cache misses. I
           wonder how things would change with 1536, 2048, or even 256. You
           might (probably) even want to look smaller, since this might be a
           common preprocessing step. Your models are not processing full-res
           images, and if you're going to optimize an architecture for these
           models, you're going to use that shape information. Shape
           optimization is actually pretty important in ML[1]. I suspect this
           will be quite a large miss.
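           
           As an illustration of the layout point, here is a minimal PyTorch
           sketch (the tensor is stand-in data, not the article's actual
           inputs):
           
               import torch
           
               # explicit channels-last reordering of a channels-first tensor
               x = torch.randn(1, 6, 1500, 256)
               x_cl = x.permute(0, 2, 3, 1).contiguous()
               # or keep the logical shape and only change the memory layout:
               x_cl2 = x.to(memory_format=torch.channels_last)
               print(x_cl.shape, x_cl2.stride())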
          
           Fourth, from a quick look at the docs I think the setup is
           improper. Under "Model Workflow" they mention that they want data
           in 8- or 16-bit *float*. I'm not going to look too deep, but note
           that there are different types of floats (e.g. pytorch's bfloat16
           is not the same as torch.half or torch.float16). Mixed precision is
           still a confusing subject, and if you're hitting issues like these
           it is worth looking at. I very much suggest not just running a
           standard quantization procedure and calling it a day (start there!
           But don't end there unless it's "good enough", which doesn't seem
           too meaningful here).
          
           FWIW, I still do think these results are useful, but I think they
           need to be improved upon. This type of stuff is surprisingly
           complex, but a large amount of that is due to things being new and
           many of the details still being worked out. Remember that when
           you're comparing to things like the CPU or GPU (especially CUDA),
           those have had hundreds of thousands of man hours put into them,
           and at least tens of thousands into high-level language libraries
           (e.g. Python) to handle all this. I don't think these devices are
           ready for the average user who just wants to work with them from
           their favorite language's abstraction level, but they're pretty
           useful if you're willing to work close to the metal.
          
          [0] I don't know what the timing is for this, but I do this in
          pytorch a lot so here's the boilerplate
          
               times = torch.empty(rounds)
               # Don't need to use dummy data, but here
               input_data = torch.randn((batch_size, *data_shape),
                                        device="cuda")
               # Do some warmups first. There's background actions dealing
               # with IO we don't want to measure. You can remove that loop
               # and do a dist of times if you want to see this. Make sure you
               # generate data and save it to a variable (write) or else this
               # won't do anything.
               for _ in range(warmup):
                   data = model(input_data)
               for i in range(rounds):
                   starter = torch.cuda.Event(enable_timing=True)
                   ender = torch.cuda.Event(enable_timing=True)
                   starter.record()
                   data = model(input_data)
                   ender.record()
                   torch.cuda.synchronize()
                   times[i] = starter.elapsed_time(ender)/1000
               total_time = times.sum()
          
           The reason we do it this way: if we just wrap the model call with a
           timer, we're measuring CPU time, but the GPU operations are
           asynchronous, so you could get deceptively fast (or slow) times.
          
  HTML    [1]: https://www.thonking.ai/p/what-shapes-do-matrix-multiplicati...
       
          refulgentis wrote 7 hours 4 min ago:
          You're absolutely right IMO, given what I heard when launching
          on-device speech recognition on Pixel, and after leaving Google, what
          I see from ex. Apple Neural Engine vs. CPU when running ONNX stuff.
          
           I'm a bit suspicious of the article's specific conclusion, because
           it is Qualcomm's ONNX runtime, and it may be out of date. Also,
           Android loved talking shit about Qualcomm software engineering.
           
           That being said, it's directionally correct, insomuch as consumer
           hardware AI acceleration claims are near-universally BS unless A)
           you're writing 1P software or B) someone at the 1P really wants you
           to take advantage of it.
       
            kristianp wrote 5 hours 28 min ago:
            1P?
       
              refulgentis wrote 5 hours 27 min ago:
              First party, i.e. Google/Apple/Microsoft
       
          boomskats wrote 7 hours 9 min ago:
          That's especially true because yours is a Xilinx FPGA. The one that
          they just attached to the latest gen mobile ryzens is 5x more capable
          too.
          
           AMD are doing some fantastic work at the moment, they just don't
           seem to be shouting about it. This one is particularly interesting
           [1]. Edit: not an FPGA. TIL. :'(
          
  HTML    [1]: https://lore.kernel.org/lkml/DM6PR12MB3993D5ECA50B27682AEBE1...
       
            davemp wrote 2 hours 7 min ago:
            Unfortunately FPGA fabric is ~2x less power efficient than
            equivalent ASIC logic at the same clock speeds last time I checked.
            So implementing general purpose logic on an FPGA is not usually the
            right option even if you don’t care about FMAX or transistor
            counts.
       
            numpad0 wrote 5 hours 42 min ago:
            Sorry for an OT comment but what is going on with that ascii art!?
            The content fits within 80 columns just fine[1], is it GPT
            generated?
            
            1:
            
  HTML      [1]: https://pastebin.com/raw/R9BrqETR
       
            pclmulqdq wrote 6 hours 33 min ago:
            It's not an FPGA. It's a VLIW DSP that Xilinx built to go into an
            FPGA-SoC to help run ML models.
       
              almostgotcaught wrote 6 hours 9 min ago:
              this is the correct answer. one of the compilers for this DSP is
              [1] .
              
  HTML        [1]: https://github.com/Xilinx/llvm-aie
       
            beeflet wrote 6 hours 57 min ago:
            It would be cool if most PCs had a general purpose FPGA that could
            be repurposed by the operating system. For example you could use it
            as a security processor like a TPM or as a bootrom, or you could
            repurpose it for DSP or something.
            
             It just seems like this would be better in terms of
             firmware/security/bootloading, because you would be better able
             to fix it if an exploit gets discovered, and it would be leaner
             because different operating systems could implement their own
             stuff (for example, Linux might not want Pluton in-chip security,
             Windows might not want coreboot or Linux-based boot, and
             bare-metal applications can have a much simpler boot).
       
              walterbell wrote 5 hours 21 min ago:
              Xilinx Artix 7-series PicoEVB fits in M.2 wifi slot and has an
              OSS toolchain,
              
  HTML        [1]: http://www.enjoy-digital.fr/
       
            errantspark wrote 7 hours 1 min ago:
            Wait sorry back up a bit here. I can buy a laptop that has a
            daughter FPGA in it? Does it have GPIO??? Are we seriously building
            hardware worth buying again in 2024? Do you have a link?
       
              dekhn wrote 6 hours 3 min ago:
              If you want GPIOs, you don't need (or want) an FPGA.
              
              I don't know the details of your use case, but I work with low
              level hardware driven by GPIOs and after a bit of investigation,
               concluded that having direct GPIO access in a modern PC was not
              necessary or desirable compared to the alternatives.
       
              eightysixfour wrote 6 hours 44 min ago:
               It isn't as fun as you think - they are set up for specific use
              cases and quite small. Here's a link to the software page: [1]
              The teeny-tiny "NPU," which is actually an FPGA, is 10 TOPS.
              
              Edit: I've been corrected, not an FPGA, just an IP block from
              Xilinx.
              
  HTML        [1]: https://ryzenai.docs.amd.com/en/latest/index.html
       
                boomskats wrote 6 hours 32 min ago:
                Yes, the one on the ryzen 7000 chips like the 7840u isn't
                massive, but that's the last gen model. The one they've just
                released with the HX370 chip is estimated at 50 TOPS, which is
                better than Qualcomm's ARM flagship that this post is about.
                It's a fivefold improvement in a single generation, it's pretty
                exciting.
                
                A̵n̵d̵ ̵i̵t̵'̵s̵ ̵a̵n̵ ̵F̵P̵G̵A̵ It's not an
                FPGA
       
                  almostgotcaught wrote 6 hours 9 min ago:
                  > And it's an FPGA.
                  
                  nope it's not.
       
                    boomskats wrote 4 hours 39 min ago:
                    I've just ordered myself a jump to conclusions mat.
       
                      almostgotcaught wrote 4 hours 14 min ago:
                       Lol, during grad school my advisor would frequently cut
                       me off and try to jump to a conclusion while I was
                       explaining something technical, and often enough he was
                       wrong. So I really did buy him one (off eBay or
                       something). He wasn't pleased.
       
                wtallis wrote 6 hours 35 min ago:
                It's not a FPGA. It's an NPU IP block from the Xilinx side of
                the company. It was presumably originally developed to be run
                on a Xilinx FPGA, but that doesn't mean AMD did the stupid
                thing and actually fabbed a FPGA fabric instead of properly
                synthesizing the design for their laptop ASIC. Xilinx
                involvement does not automatically mean it's an FPGA.
       
                  boomskats wrote 6 hours 25 min ago:
                  Do you have any more reading on this? How come the XDNA
                  drivers depend on Xilinx' XRT runtime?
       
                    wtallis wrote 6 hours 2 min ago:
                    It would be surprising and strange if AMD didn't reuse the
                    software framework they've already built for doing AI when
                    that IP block is instantiated on an FPGA fabric rather than
                    hardened in an ASIC.
       
                      boomskats wrote 5 hours 19 min ago:
                      Well, I'm irrationally disappointed, but thanks.
                      Appreciate the correction.
       
                    almostgotcaught wrote 6 hours 10 min ago:
                    because XRT has a plugin architecture: XRT<-shim
                    plugin<-kernel driver. The shims register themselves with
                    XRT. The XDNA driver repo houses both the shim and the
                    kernel driver.
       
                      boomskats wrote 5 hours 22 min ago:
                      Thanks, that makes sense.
       
                  eightysixfour wrote 6 hours 32 min ago:
                  Thanks for the correction, edited.
       
          conradev wrote 7 hours 25 min ago:
          That is my understanding as well: low power and low latency.
          
          You can see this in action when evaluating a CoreML model on a macOS
          machine. The ANE takes half as long as the GPU which takes half as
          long as the CPU (actual factors being model dependent)
       
            nickpsecurity wrote 7 hours 21 min ago:
            To take half as long, doesn’t it have to perform twice as fast?
            Or am I misreading your comment?
       
              conradev wrote 4 hours 6 min ago:
              The GPU is stateful and requires loading shaders and initializing
              pipelines before doing any work. That is where its latency comes
              from. It is also extremely power hungry.
              
               The CPU has zero latency to get started, but it isn't
               specialized for any one task and isn't massively parallel,
               which is why it takes even longer overall.
              
              The NPU often has a simpler bytecode to do more complex things
              like matrix multiplication implemented in hardware, rather than
              having to instantiate a generic compute kernel on the GPU.
       
              eightysixfour wrote 7 hours 16 min ago:
              No, you can have latency that is independent of compute
              performance. The CPU/GPU may have other tasks and the work has to
              wait for the existing threads to finish, or for them to clock up,
              or have slower memory paths, etc.
              
              If you and I have the same calculator but I'm working on a set of
              problems and you're not, and we're both asked to do some math, it
              may take me longer to return it, even though the instantaneous
              performance of the math is the same.
       
                refulgentis wrote 7 hours 8 min ago:
                In isolation, makes sense.
                
                Wouldn't it be odd for OP to present examples that are the
                opposite of their claim, just to get us thinking about "well
                the CPU is busy?"
                
                Curious for their input.
       
        jamesy0ung wrote 7 hours 35 min ago:
        What exactly does Windows do with a NPU? I don't own an 'AI PC' but it
        seems like the NPUs are slow and can't run much.
        
        I know Apple's Neural Engine is used to power Face ID and the facial
        recognition stuff in Photos, among other things.
       
          downrightmike wrote 5 hours 7 min ago:
          AI PC is just a marketing term, doesn't have any real substance
       
            acdha wrote 3 hours 47 min ago:
             Yea, we know that. I believe that’s why the person you’re
             replying to was asking for examples of real usage.
       
          dagaci wrote 5 hours 47 min ago:
           It's used for improving video calls, special effects, image
           editing/effects and noise cancelling - Teams stuff.
       
          DrillShopper wrote 6 hours 43 min ago:
          It supports Microsoft's Recall (now required) spyware
       
            Janicc wrote 6 hours 6 min ago:
            Please remind me again how Recall sends data to Microsoft. I
            must've missed that part. Or are you against the print screen
            button too? I heard that takes images too. Very scary.
       
              bloated5048 wrote 5 hours 28 min ago:
               It's always safe to assume it does if it's closed source. I'd
               rather be suspicious of big corporations seeking to profit at
               every step than naive.
               
               Also, it's a security risk which has already been exploited.
               Sure, MS fixed it, but can you be certain it won't be exploited
               again some time in the future?
       
              Terr_ wrote 5 hours 35 min ago:
              > Please remind me again how Recall sends data to Microsoft. I
              must've missed that part.
              
              Sure, just post the source code and I'll point out where it does
              so, I somehow misplaced my copy. /s
              
              The core problem here is trust, and over the last several years
              Microsoft has burned a hell of a lot of theirs with power-users
              of Windows. Even their most strident public promises of Recall
              being "opt-in" and "on-device only" will--paradoxically--only be
              kept as long as enough people remain suspicious.
              
               Glance away and MS goes back to its old games, pushing a
               mandatory "security update" which resets or entirely removes
               your privacy settings and adds new "telemetry" streams which
               you cannot inspect.
       
              cmeacham98 wrote 5 hours 46 min ago:
               While calling it spyware like the GP did is an exaggeration to
               a ridiculous degree, comparing Recall to Print Screen is also
               inaccurate:
              
              Print Screen takes images on demand, Recall does so effectively
              at random. This means Recall could inadvertently screenshot and
              store information you didn't intend to keep a record of (To give
              an extreme example: Imagine an abuser uses Recall to discover
              their spouse browsing online domestic violence resources).
       
        isusmelj wrote 7 hours 42 min ago:
        I think the results show that just in general the compute is not used
        well. That the CPU took 8.4ms and GPU took 3.2ms shows a very small
        gap. I'd expect more like 10x - 20x difference here.
        I'd assume that the onnxruntime might be the issue. I think some
        hardware vendors just release the compute units without shipping proper
        support yet. Let's see how fast that will change.
        
         Also, people often assume the reason for an NPU is "speed". That's
         not correct. The whole point of the NPU is rather to focus on low
         power consumption. To focus on speed you'd need to get rid of the
         memory bottleneck; then you end up designing your own ASIC with its
         own memory. The NPUs we see in most devices are part of the SoC
         around the CPU to offload AI computations.
         It would be interesting to run this benchmark in an infinite loop for
         the three devices (CPU, NPU, GPU) and measure power consumption.
         I'd expect the NPU to be lowest and also best in terms of "ops/watt".
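         
         A minimal sketch of such a loop with ONNX Runtime (the model file and
         execution provider here are placeholders; power itself would have to
         be read from a wall meter or OS telemetry while the loop runs):
         
             import time
             import numpy as np
             import onnxruntime as ort
         
             # Placeholder model; swap the provider for CUDAExecutionProvider
             # or QNNExecutionProvider to compare devices under the same loop.
             sess = ort.InferenceSession("matmul.onnx",
                                         providers=["CPUExecutionProvider"])
         
             def rand_input(node):
                 # fill symbolic/dynamic dims with 1 just to get a valid feed
                 dims = [d if isinstance(d, int) else 1 for d in node.shape]
                 return np.random.rand(*dims).astype(np.float32)
         
             feeds = {i.name: rand_input(i) for i in sess.get_inputs()}
         
             ops_per_run = 6 * 2 * 1500 * 256 * 1500  # the article's 6 matmuls
             runs, start = 0, time.perf_counter()
             while time.perf_counter() - start < 60:  # sample watts meanwhile
                 sess.run(None, feeds)
                 runs += 1
             elapsed = time.perf_counter() - start
             print(f"{ops_per_run * runs / elapsed / 1e9:.1f} GOPS sustained")
             # divide by the average watts observed to get ops per joule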
       
          spookie wrote 4 hours 37 min ago:
           I've been building an app in pure C using onnxruntime, and it
           outperforms a comparable one done with Python by a substantial
           amount. There are many other gains to be made.
           
           (In the end Python just calls C, but it's pretty interesting how
           much performance is lost.)
       
          theresistor wrote 4 hours 39 min ago:
          > Also, people often mistake the reason for an NPU is "speed". That's
          not correct. The whole point of the NPU is rather to focus on low
          power consumption.
          
          It's also often about offload. Depending on the use case, the CPU and
          GPU may be busy with other tasks, so the NPU is free bandwidth that
          can be used without stealing from the others. Consider AI-powered
          photo filters: the GPU is probably busy rendering the preview, and
          the CPU is busy drawing UI and handling user inputs.
       
            cakoose wrote 4 hours 4 min ago:
            Offload only makes sense if there are other advantages, e.g. speed,
            power.
            
            Without those, wouldn't it be better to use the NPUs silicon budget
            on more CPU?
       
              avianlyric wrote 2 hours 7 min ago:
               Not really. Getting extra CPU performance likely means more
               cores, or some other general-compute silicon. That stuff tends
               to be quite big, simply because it’s so flexible.
               
               NPUs focus on one specific type of computation, matrix
               multiplication, usually with low-precision integers, because
               that’s all a neural net needs. That vast reduction in
               flexibility means you can take lots of shortcuts in your
               design, allowing you to cram more compute into a smaller
               footprint.
               
               If you look at the M1 chip[1], you can see the entire 16-core
               Neural Engine has a footprint about the size of 4 performance
               cores (excluding their caches). It’s not a perfect comparison
               without numbers on what the performance cores can achieve in
               terms of ops/second vs the Neural Engine, but it seems
               reasonable to bet that the Neural Engine handily outperforms
               the performance-core complex when doing matmul operations.
              
  HTML        [1]: https://www.anandtech.com/show/16226/apple-silicon-m1-a1...
       
              theresistor wrote 2 hours 13 min ago:
              If you know that you need to offload matmuls, then building
              matmul hardware is more area efficient than adding an entire
              extra CPU. Various intermediate points exist along that spectrum,
              e.g. Cell's SPUs.
       
              heavyset_go wrote 3 hours 56 min ago:
              More CPU means siphoning off more of the power budget on mobile
              devices. The theoretical value of NPUs is power efficiency on a
              limited budget.
       
          godelski wrote 4 hours 44 min ago:
           They definitely aren't doing the timing properly, but also, what
           you might think of as timing is not what is generally marketed.
           That said, those marketed versions are often easier to compare. One
           example: if you're using a GPU, have you actually considered that
           there's an asynchronous operation happening as part of your timing?
          
          If you're naively doing `time.time()` then what happens is this
          
             start = time.time()  # cpu records time
             # push data and model (if not already there) to GPU memory and
             # start computation. This is asynchronous
             pred = model(input.cuda()).cuda()
             # cpu records time, regardless of if pred stores data
             end = time.time()
          
           You probably aren't expecting that if you don't know systems and
           hardware. But Python (and really any language) is designed to be
           smart and compile into something more optimized than what you
           actually wrote. There's no lock, so we're not going to block
           operations for CPU tasks. You might ask: why do this? Well, no one
           knows what you actually want to do. And do you want the timer
           library checking for accelerators (i.e. GPUs) every time it records
           a time? That's going to mess up your timer! (At best you'd have to
           use a constructor to say "enable locking for this accelerator".) So
           you've got to do something a bit more nuanced.
          
           If you want to actually time GPU tasks, you should look at CUDA
           event timers (in pytorch this is
           `torch.cuda.Event(enable_timing=True)`; I have another comment with
           boilerplate).
          
          Edit:
          
           There are also complicated issues like memory size and shape. They
           definitely are not being nice to the NPU here on either of those.
           They (and GPUs!!!) want channels last. They did [1,6,1500,1500] but
           you'd want [1,1500,1500,6]. There's also the issue of how memory is
           allocated (and they noted IO being an issue). 1500 is a weird
           number (as is 6), so they aren't doing the NPU any favors, and I
           wouldn't be surprised if this is a surprisingly big hit considering
           how new these things are.
          
          And here's my longer comment with more details:
          
  HTML    [1]: https://news.ycombinator.com/item?id=41864828
       
            artemisart wrote 3 hours 16 min ago:
             Important clarification: the async part is absolutely not Python
             specific; it comes from CUDA, and indeed it's there for
             performance. You will have to use CUDA events in C++ too to time
             it properly.
             
             For ONNX, the runtimes I know of are synchronous, as we don't run
             each operation individually but whole models at once, so there is
             no need for async and the timings should be correct.
       
              godelski wrote 2 hours 52 min ago:
              Yes, it isn't python, it is... hardware. Not even CUDA specific.
              It is about memory moving around and optimization (remember, even
              the CPUs do speculative execution). I say a little more in the
              larger comment.
              
              I'm less concerned about the CPU baseline and more concerned
              about the NPU timing. Especially given the other issues
       
          kmeisthax wrote 7 hours 24 min ago:
          > I think some hardware vendors just release the compute units
          without shipping proper support yet
          
          This is Nvidia's moat. Everything has optimized kernels for CUDA, and
          maybe Apple Accelerate (which is the only way to touch the CPU matrix
          unit before M4, and the NPU at all). If you want to use anything
          else, either prepare to upstream patches in your ML framework of
          choice or prepare to write your own training and inference code.
       
          AlexandrB wrote 7 hours 32 min ago:
          > Also, people often mistake the reason for an NPU is "speed". That's
          not correct. The whole point of the NPU is rather to focus on low
          power consumption.
          
          I have a sneaking suspicion that the real real reason for an NPU is
          marketing. "Oh look, NVDA is worth $3.3T - let's make sure we stick
          some AI stuff in our products too."
       
            conradev wrote 39 min ago:
            The real consumers of the NPUs are the operating systems
            themselves. Google’s TPU and Apple’s ANE are used to power OS
            features like Apple’s Face ID and Google’s image enhancements.
            
            We’re seeing these things in traditional PCs now because
            Microsoft has demanded it so that Microsoft can use it in Windows
            11.
            
            Any use by third party software is a lower priority
       
            Spooky23 wrote 1 hour 19 min ago:
            Microsoft needs to throw something in the gap to slow down MacBook
            attrition.
            
            The M processors changed the game. My teams support 250k users. I
            went from 50 MacBooks in 2020 to over 10,000 today. I added zero
            staff - we manage them like iPhones.
       
              cj wrote 45 min ago:
              Rightly so.
              
              The M processor really did completely eliminate all sense of
              “lag” for basic computing (web browsing, restarting your
              computer, etc). Everything happens nearly instantly, even on the
              first generation M1 processor. The experience of “waiting for
              something to load” went away.
              
              Not to mention these machines easily last 5-10 years.
       
                nxobject wrote 36 min ago:
                As a very happy M1 Max user (should've shelled out for 64GB of
                RAM, though, for local LLMs!), I don't look forward to seeing
                how the Google Workspace/Notions/etc. of the world somehow
                reintroduce lag back in.
       
                  bugbuddy wrote 3 min ago:
                   The problem for Intel and AMD is they are stuck with an OS
                   that ships with a lag-inducing anti-malware suite. I just
                   did a simple git log and it took 2000% longer than usual
                   because the antivirus was triggered to scan and run a
                   simulation on each machine instruction and byte of data
                   accessed. The commit log window stayed blank, waiting to
                   load, long enough for me to complete another tiny project.
                   It always ruins my day.
       
            Dalewyn wrote 2 hours 47 min ago:
            There are no nerves in a neural processing unit, so yes: It's 300%
            bullshit marketing.
       
              brookst wrote 1 hour 54 min ago:
              Neural is an adjective. Adjectives do not require their
              associated nouns to be present. See also: digital computers have
                 no fingers at all.
       
                -mlv wrote 1 hour 27 min ago:
                I always thought 'digital' referred to numbers, not fingers.
       
                  bdd8f1df777b wrote 1 hour 19 min ago:
                   The derivative meaning has been used so widely that it has
                   surpassed the original in usage. But that doesn’t change
                   the fact that it originally referred to fingers.
       
              jcgrillo wrote 2 hours 39 min ago:
              Maybe the N secretly stands for NFT.. Like the tesla self driving
              hardware only smaller and made of silicon.
       
            itishappy wrote 7 hours 23 min ago:
            I assume you're both right. I'm sure NPUs exist to fill a very real
            niche, but I'm also sure they're being shoehorned in everywhere
            regardless of product fit because "AI big right now."
       
              brookst wrote 1 hour 59 min ago:
              The shoehorning only works if there is buyer demand.
              
              As a company, if customers are willing to pay a premium for a
              NPU, or if they are unwilling to buy a product without one, it is
              not your place to say “hey we don’t really believe in the AI
              hype so we’re going to sell products people don’t want to
              prove a point”
       
                Spooky23 wrote 1 hour 9 min ago:
                Apple will have a completely AI capable product line in 18
                months, with the major platforms basically done.
                
                Microsoft is built around the broken Intel tick/tick model of
                incremental improvement — they are stuck with OEM shitware
                that will take years to flush out of the channel. That means
                for AI, they are stuck with cloud based OpenAI, where NVIDIA
                has them by the balls and the hyperscalers are all fighting for
                GPU.
                
                Apple will deliver local AI features as software (the hardware
                is “free”) at a much higher margin - while Office 365 AI is
                like $400+ a year per user.
                
                You’ll have people getting iPhones to get AI assisted emails
                or whatever Apple does that is useful.
       
                bdd8f1df777b wrote 1 hour 22 min ago:
                 There are two kinds of buyer demand: from product buyers and
                 from stock buyers. The AI hype can certainly convince some of
                 the stock buyers.
       
                MBCook wrote 1 hour 29 min ago:
                Is there demand? Or do they just assume there is?
                
                If they shove it in every single product and that’s all
                anyone advertises, whether consumers know it will help them or
                not, you don’t get a lot of choice.
                
                If you want the latest chip, you’re getting AI stuff.
                That’s all there is to it.
       
                  Terr_ wrote 34 min ago:
                  "The math is clear: 100% of our our car sales come from
                  models with our company logo somewhere on the front, which
                  shows incredible customer desire for logos. We should
                  consider offering a new luxury trim level with more of them."
                  
                  "How many models to we have without logos?"
                  
                  "Huh? Why would we do that?"
       
                    MBCook wrote 9 min ago:
                    Heh. Yeah more or less.
                    
                    To some degree I understand it, because as we’ve all
                    noticed computers have pretty much plateaued for the
                    average person. They last much longer. You don’t need to
                    replace them every two years anymore because the software
                     isn’t outstripping them so fast.
                    
                    AI is the first thing to come along in quite a while that
                    not only needs significant power but it’s just something
                    different. It’s something they can say your old computer
                    doesn’t have that the new one does. Other than being 5%
                    faster or whatever.
                    
                    So even if people don’t need it, and even if they notice
                    they don’t need it, it’s something to market on.
                    
                    The stuff up thread about it being the hotness that Wall
                    Street loves is absolutely a thing too.
       
              wtallis wrote 5 hours 40 min ago:
              Looking at it slightly differently: putting low-power NPUs into
              laptop and phone SoCs is how to get on the AI bandwagon in a way
              that NVIDIA cannot easily disrupt. There are plenty of systems
              where a NVIDIA discrete GPU cannot fit into the budget (of $ or
              Watts). So even if NPUs are still somewhat of a solution in
              search of a problem (aka a killer app or two), they're not
              necessarily a sign that these manufacturers are acting entirely
              without strategy.
       
            kmeisthax wrote 7 hours 24 min ago:
            You forget "Because Apple is doing it", too.
       
              rjsw wrote 5 hours 16 min ago:
              I think other ARM SoC vendors like Rockchip added NPUs before
              Apple, or at least around the same time.
       
                bdd8f1df777b wrote 1 hour 22 min ago:
                Even if it were true, they wouldn’t have the same influence
                as Apple has.
       
                acchow wrote 4 hours 48 min ago:
                I was curious so looked it up. Apple's first chip with an NPU
                was the A11 bionic in Sept 2017. Rockchip's was the RK1808 in
                Sept 2019.
       
                  j16sdiz wrote 2 hours 54 min ago:
                   Google's TPU was introduced around the same time as
                   Apple's. Basically everybody knew it could be something
                   around that time; they just didn't know exactly how.
       
                  GeekyBear wrote 3 hours 17 min ago:
                  Face ID was the first tent pole feature that ran on the NPU.
       
        tromp wrote 7 hours 44 min ago:
        >  the 45 trillion operations per second that’s listed in the specs
        
         Such a spec should ideally be accompanied by code demonstrating or
        approximating the claimed performance. I can't imagine a sports car
        advertising a 0-100km/h spec of 2.0 seconds where a user is unable to
        get below 5 seconds.
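         
         Even a vendor-neutral reference point would help; a few lines of
         NumPy give a reproducible "achieved ops/s" figure to hold a claim
         against (matrix size and repetition count here are arbitrary):
         
             import time
             import numpy as np
         
             n, reps = 2048, 20
             a = np.random.rand(n, n).astype(np.float32)
             b = np.random.rand(n, n).astype(np.float32)
         
             t0 = time.perf_counter()
             for _ in range(reps):
                 a @ b
             dt = time.perf_counter() - t0
             # a matmul is 2*n**3 ops (multiplies and adds counted separately)
             print(f"{2 * n**3 * reps / dt / 1e9:.0f} GFLOPS achieved")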
       
          tedunangst wrote 5 hours 33 min ago:
          I have some bad news for you regarding how car acceleration is
          measured.
       
            otterley wrote 3 hours 20 min ago:
            Well, what is it?
       
              ukuina wrote 1 hour 17 min ago:
              Everything from rolling starts to perfect road conditions and
              specific tires, I suppose.
       
          dmitrygr wrote 7 hours 42 min ago:
          most likely multiplying the same 128x128 matrix from cache to cache.
          That gets you perfect MAC utilization with no need to hit memory.
          Gets you a big number that is not directly a lie - that perf IS
          attainable, on a useless synthetic benchmark
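           
           To put rough numbers on how small and cache-friendly that synthetic
           workload is (illustrative arithmetic only, not Qualcomm's actual
           design):
           
               n = 128
               ops = 2 * n**3          # ~4.2M ops per 128x128x128 matmul
               resident_kib = 3 * n * n / 1024  # A, B, C at int8: ~48 KiB
               ns_per_matmul = ops / 45e12 * 1e9  # ~93 ns at the claimed rate
               print(resident_kib, ns_per_matmul)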
       
            kmeisthax wrote 7 hours 23 min ago:
            Sounds great for RNNs! /s
       
        jsheard wrote 7 hours 49 min ago:
        These NPUs are tying up a substantial amount of silicon area so it
        would be a real shame if they end up not being used for much. I can't
        find a die analysis of the Snapdragon X which isolates the NPU
         specifically, but AMD's equivalent with the same ~50 TOPS performance
        target can be seen here, and takes up about as much area as three high
        performance CPU cores:
        
  HTML  [1]: https://www.techpowerup.com/325035/amd-strix-point-silicon-pic...
       
          kllrnohj wrote 2 hours 58 min ago:
           Snapdragon X still has a full 12 cores (all the same core, it's
           homogeneous) and Strix Point is also 12 cores, in a 4+8
           configuration, but with the "little" cores not sacrificing that
           much (nothing like the little cores in ARM's designs, which might
           as well not even exist; they are a complete waste of silicon).
           Consumer software doesn't scale to that, so what are you going to
           do with more transistors allocated to the CPU?
          
          It's not unlike why Apple puts so many video engines in their SoCs -
          they don't actually have much else to do with the transistor budget
          they can afford. Making single thread performance better isn't
          limited by transistor count anymore and software is bad at
          multithreading.
       
            wmf wrote 1 hour 30 min ago:
            GPU "infinity" cache would increase 3D performance and there's a
            rumor that AMD removed it to make room for the NPU. They're not out
            of ideas for features to put on the chip.
       
          JohnFen wrote 5 hours 46 min ago:
          > These NPUs are tying up a substantial amount of silicon area so it
          would be a real shame if they end up not being used for much.
          
          This has been my thinking. Today you have to go out of your way to
          buy a system with an NPU, so I don't have any. But tomorrow, will
          they just be included by default? That seems like a waste for those
          of us who aren't going to be running models. I wonder what other uses
          they could be put to?
       
            idunnoman1222 wrote 3 hours 14 min ago:
            Voice to text
       
            heavyset_go wrote 3 hours 49 min ago:
            The idea is that your OS and apps will integrate ML models, so you
            will be running models whether you know it or not.
       
            crazygringo wrote 4 hours 48 min ago:
            Aren't they used for speech recognition -- for dictation? Also for
            FaceID.
            
            They're useful for more things than just LLM's.
       
            jonas21 wrote 5 hours 37 min ago:
            NPUs are already included by default in the Apple ecosystem. Nobody
            seems to mind.
       
              shepherdjerred wrote 3 hours 52 min ago:
              I actually love that Apple includes this — especially now that
              they’re actually doing something with it via Apple Intelligence
       
              acchow wrote 4 hours 44 min ago:
              It enables many features on the phone that people like, all
              without sending your personal data to the cloud. Like searching
              your photos for "dog" or "receipt".
       
              JohnFen wrote 5 hours 29 min ago:
              It's not really a question of minding if it's there, unless its
              presence increases cost, anyway. It just seems a waste to let it
              go idle, so my mind wanders to what other use I could put that
              circuitry to.
       
            jsheard wrote 5 hours 44 min ago:
            > But tomorrow, will they just be included by default?
            
            That's already the way things are going due to Microsoft decreeing
            that Copilot+ is the future of Windows, so AMD and Intel are both
            putting NPUs which meet the Copilot+ performance standard into
            every consumer part they make going forwards to secure OEM sales.
       
              AlexAndScripts wrote 5 hours 18 min ago:
              It almost makes me want to find some use for them on my Linux box
               (not that it has an NPU), but I truly can't think of anything.
              Too small to run a meaningful LLM, and I'd want that in bursts
              anyway, I hate voice controls (at least with the current tech),
              and Recall sounds thoroughly useless. Could you do mediocre
              machine translation on it, perhaps? Local github copilot? An LLM
              that is purely used to build an abstract index of my notes in the
              background?
              
              Actually, could they be used to make better AI in games? That'd
              be neat. A shooter character with some kind of organic tactics,
              or a Civilisation/Stellaris AI that doesn't suck.
       
          ezst wrote 6 hours 52 min ago:
          I can't wait for the LLM fad to be over so we get some sanity (and
          efficiency) back. I personally have no use for this extra hardware
          ("GenAI" doesn't help me in any way nor supports any work-related
          tasks). Worse, most people have no use for that (and recent surveys
          even show predominant hostility towards AI creep). We shouldn't be
          paying extra for that, it should be opt-in, and then it would become
          clear (by looking at the sales and how few are willing to pay a
          premium for "AI") how overblown and unnecessary this is.
       
            jcgrillo wrote 2 hours 22 min ago:
            I just got an iphone and the whole photos thing is absolutely
            garbage. All I wanted to do was look through my damn photos and
            find one I took recently but it started playing some random music
            and organized them in no discernible order.. like it wasn't the
            reverse time sorted.. Idk what kind of fucked up "creative process"
            came up with that bullshit but I sure wish they'd unfuck it stat.
            
            The camera is real good though.
       
              james_marks wrote 22 min ago:
              There’s an album called “Recents” that’s chronological
              and scrolled to the end.
              
              “Recent” seems to mean everything; I’ve got 6k+ photos, I
              think since the last fresh install, which is many devices ago.
              
              Sounds like the view you’re looking for and will stick as the
              default once you find it, but you do have to bat away some BS at
              first.
       
            kalleboo wrote 2 hours 35 min ago:
            > most people have no use for that
            
            Apple originally added their NPUs before the current LLM wave to
            support things like indexing your photo library so that objects and
            people are searchable. These features are still very popular. I
            don't think these NPUs are fast enough for GenAI anyway.
       
              grugagag wrote 1 hour 31 min ago:
              I wish I could turn that off on my phone.
       
              wmf wrote 1 hour 34 min ago:
              MS Copilot and "Apple Intelligence" are running a small language
              model and image generation on the NPU so that should count as
              "GenAI".
       
            mardifoufs wrote 3 hours 57 min ago:
            NPUs were a thing (and a very common one in mobile CPUs too) way
            before the LLM craze.
       
            renewiltord wrote 6 hours 23 min ago:
            I was telling someone this and they gave me link to a laptop with
            higher battery life and better performance than my own, but I kept
            explaining to them that the feature I cared most about was die
            size. They couldn't understand it so I just had to leave them
            alone. Non-technical people don't get it. Die size is what I care
            about. It's a critical feature and so many mainstream companies are
            missing out on my money because they won't optimize die size.
            Disgusting.
       
              fijiaarone wrote 15 min ago:
              Yeah, I know what you mean.  I hate lugging around a big CPU
              core.
       
              waveBidder wrote 4 hours 40 min ago:
              Your satire is far enough off base that people don't understand
              it's satire.
       
                heavyset_go wrote 3 hours 53 min ago:
                Poe's Law means it's working.
       
              _zoltan_ wrote 5 hours 13 min ago:
              News flash: you're in the niche of the niche. People don't care
              about die size.
              
              I'd be willing to bet that the amount of money they're missing
              out on is minuscule, and far outweighed by the money from people
              who care about other stuff. Like, you know, performance and
              battery life, just to stick to your examples.
       
                mattnewton wrote 2 hours 32 min ago:
                That’s exactly what the poster is arguing; they are being
                sarcastic.
       
              nl wrote 6 hours 1 min ago:
                Is this a parody?
                
                Why would anyone care about die size? And if you do, why not
                get one of the many low-power laptops with Atoms etc. that do
                have a small die size?
       
                tedunangst wrote 5 hours 36 min ago:
                No, no, no, you just don't get it. The only thing Dell will
                sell me is a laptop 324mm wide, which is totally appalling, but
                if they offered me a laptop that's 320mm wide, I'd immediately
                buy it. In my line of work, which is totally serious business,
                every millimeter counts.
       
                thfuran wrote 5 hours 40 min ago:
                Yes, they're making fun of the comment they replied to.
       
                  singlepaynews wrote 18 min ago:
                  Would you do me the favor of explaining the joke? I get the
                  premise (nobody cares about die size), but the comment being
                  mocked seems perfectly innocuous to me. They want a laptop
                  without an NPU because, according to the link, we get more
                  out of the CPU anyway. What am I missing here?
       
                throwaway48476 wrote 5 hours 46 min ago:
                Maybe through a game of telephone they confused die size and
                node size?
       
            DrillShopper wrote 6 hours 44 min ago:
            Corporatized gains in the market from hype; socialized losses in
            increased carbon emissions, upheaval from job loss, and higher
            prices on hardware.
            
            The more they say the future will be better, the more it looks
            like the status quo.
       
          Kon-Peki wrote 6 hours 55 min ago:
          Modern chips have to dedicate a certain percentage of the die to dark
          silicon [1] (or else they melt/throttle to uselessness), and these
          kinds of components count towards that amount. So the point of these
          components is to be used, but not to be used too much.
          
          Instead of an NPU, they could have used those transistors and die
          space for any number of things. But they wouldn't have put
          additional high-performance CPU cores there - that would increase the
          power density too much and cause thermal issues that can only be
          solved with permanent throttling.
          
  HTML    [1]: https://en.wikipedia.org/wiki/Dark_silicon
       
            jcgrillo wrote 1 hour 49 min ago:
            Question: what do you lose by making your features sparse enough
            that they can cool themselves while running at full tilt?
       
              AlotOfReading wrote 1 hour 29 min ago:
              Messes with timing, among other things. A lot of those structures
              are relatively fixed blocks that are designed for specific sizes.
              Signals take more time to propagate longer distances, and longer
              conductors have worse properties. Dense and hot is faster and
              more broadly useful.
       
                jcgrillo wrote 1 hour 13 min ago:
                Interesting, so does that mean we're basically out of runway
                without aggressive cooling?
       
            IshKebab wrote 6 hours 18 min ago:
            If they aren't being used, it would be better to dedicate the
            space to more SRAM.
       
              a2l3aQ wrote 6 hours 0 min ago:
              The point is that parts of the CPU have to be powered off or
              throttled down while other components are under load in order to
              stay within the TDP; adding cache that would almost certainly be
              in constant use defeats the point of that.
       
                jsheard wrote 5 hours 46 min ago:
                Doesn't SRAM have much lower power density than logic with the
                same area though? Hence why AMD can get away with physically
                stacking cache on top of more cache in their X3D parts, without
                the bottom layer melting.
       
                  wtallis wrote 4 hours 37 min ago:
                  The SRAM that AMD is stacking also has the benefit of being
                  last-level cache, so it doesn't need to run at anywhere near
                  the frequency and voltage that eg. L1 cache operates at.
       
                  Kon-Peki wrote 5 hours 2 min ago:
                  Yes, cache has a much lower power density and could have been
                  a candidate for that space.
                  
                  But I wasn’t on the design team and have no basis for
                  second-guessing them. I’m just saying that cramming more
                  high-performance CPU cores onto this die isn’t a realistic
                  option.
       
        pram wrote 7 hours 51 min ago:
        I laughed when I saw that the Qualcomm “AI PC” is described as this
        in the ComfyUI docs:
        
        "Avoid", "Nothing works", "Worthless for any AI use"
       
        dmitrygr wrote 7 hours 54 min ago:
        In general MAC unit utilization tends to be low for transformers, but
        1.3% seems pretty bad. I wonder if they fucked up the memory interface
        for the NPU. All the MACs in the world are useless if you cannot feed
        them.
       
          Hizonner wrote 7 hours 28 min ago:
          It's a tablet. It probably has like one DDR channel. It's not so much
          that they "fucked it up" as that they knowingly built a grossly
          unbalanced system so they could report a pointless number.
       
            dmitrygr wrote 7 hours 25 min ago:
            Well, no. If the CPU can hit better numbers on the same model,
            then the bandwidth from the DDR IS there. Probably the NPU does
            not attach to the proper cache level, or just has a very thin pipe
            to it.
       
              Hizonner wrote 7 hours 18 min ago:
              The CPU is only about twice as good as the NPU, though (four
              times as good on one test). The NPU is being advertised as
              capable of 45 trillion operations per second, and he's getting
              1.3 percent of that.
              
              So, OK, yeah, I concede that the NPU may have even worse access
              to memory than the CPU, but the bottom line is that neither one
              of them has anything close to what it needs to actually deliver
              anything like the headline marketing performance number on any
              realistic workload.
              
              I bet a lot of people have bought those things after seeing "45
              TOPS", thinking that they'd be able to usefully run transformers
              the size of main memory, and that's not happening on CPU or NPU.
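              
              A rough sanity check on that utilization figure, assuming the
              ~0.6 trillion int8 ops/s measurement reported in the benchmark
              repo (treat the exact value as approximate):
              
                  advertised_ops = 45e12   # marketing figure: 45 TOPS (int8)
                  measured_ops = 0.6e12    # ~0.6 trillion ops/s per the repo
                  utilization = measured_ops / advertised_ops
                  print(f"NPU utilization: {utilization:.1%}")  # ~1.3%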
       
                dmitrygr wrote 7 hours 15 min ago:
                Yup, sad all round. We are in agreement.
       
          moffkalast wrote 7 hours 44 min ago:
          I recall looking over the Ryzen AI architecture and the NPU is just
          plugged into PCIe and thus gets completely crap memory bandwidth. I
          would expect it might be similar here.
       
            wtallis wrote 6 hours 19 min ago:
            It's unlikely to be literally connected over PCIe when it's on the
            same chip. It just looks like it's connected over PCIe because
            that's how you make peripherals discoverable to the OS. The
            integrated GPU also appears to be connected over PCIe, but
            obviously has access to far more memory bandwidth.
       
            PaulHoule wrote 7 hours 10 min ago:
            I spent a lot of time with a business partner and an expert looking
            at the design space for accelerators, and it was made very clear to
            me that the memory interface puts a hard limit on what you can do,
            and that it is difficult to make the most of. Particularly when a
            half-baked product is being rushed out because of FOMO, you’d
            practically expect them to ship something that delivers only a few
            percent of the headline performance because the memory interface
            doesn’t really work. It happens to the best of them:
            
  HTML      [1]: https://en.wikipedia.org/wiki/Cell_(processor)
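            
            A roofline-style sketch of how a memory interface caps throughput;
            the peak and bandwidth numbers below are illustrative assumptions,
            not measurements from the benchmark:
            
                # Attainable rate = min(peak compute, bandwidth * intensity).
                peak_ops = 45e12    # assumed NPU peak, int8 ops/s
                bandwidth = 60e9    # assumed LPDDR bandwidth, bytes/s
                
                def ceiling(ops_per_byte):
                    # Ceiling for a kernel with this arithmetic intensity.
                    return min(peak_ops, bandwidth * ops_per_byte)
                
                # Matrix-vector work (e.g. LLM token generation over int8
                # weights): each weight byte is read once for ~2 ops.
                print(f"{ceiling(2) / 1e12:.2f} trillion ops/s ceiling")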
       
        fancyfredbot wrote 7 hours 55 min ago:
        The write-up on the GitHub repo is much more informative than the blog
        post.
        
        When running an int8 matmul through ONNX, performance is ~0.6 trillion
        ops/s.
        
  HTML  [1]: https://github.com/usefulsensors/qc_npu_benchmark
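        
        A minimal sketch of how such a measurement might be reproduced with
        onnxruntime: build a single-node int8 MatMulInteger graph and time it.
        The matrix size, iteration count, and provider choice are illustrative
        assumptions; on a Snapdragon onnxruntime-qnn build,
        "QNNExecutionProvider" would target the NPU, and int8 MatMulInteger
        support can vary by execution provider.
        
            import time
            import numpy as np
            from onnx import TensorProto, helper
            import onnxruntime as ort
            
            N = 1024  # square int8 matrices (illustrative size)
            
            # Single-node graph: MatMulInteger, int8 x int8 -> int32.
            a = helper.make_tensor_value_info("A", TensorProto.INT8, [N, N])
            b = helper.make_tensor_value_info("B", TensorProto.INT8, [N, N])
            y = helper.make_tensor_value_info("Y", TensorProto.INT32, [N, N])
            node = helper.make_node("MatMulInteger", ["A", "B"], ["Y"])
            graph = helper.make_graph([node], "int8_matmul", [a, b], [y])
            model = helper.make_model(
                graph, opset_imports=[helper.make_opsetid("", 13)])
            
            # Swap in "QNNExecutionProvider" to target the NPU.
            sess = ort.InferenceSession(model.SerializeToString(),
                                        providers=["CPUExecutionProvider"])
            
            A = np.random.randint(-128, 128, (N, N), dtype=np.int8)
            B = np.random.randint(-128, 128, (N, N), dtype=np.int8)
            feeds = {"A": A, "B": B}
            
            sess.run(None, feeds)  # warm-up
            iters = 50
            t0 = time.perf_counter()
            for _ in range(iters):
                sess.run(None, feeds)
            dt = time.perf_counter() - t0
            
            ops = 2 * N ** 3 * iters  # mul + add per output element
            print(f"{ops / dt / 1e12:.3f} trillion ops/s")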
       
          dang wrote 7 hours 29 min ago:
          Thanks - we changed the URL to that from [1]. Readers may want to
          look at both, of course!
          
  HTML    [1]: https://petewarden.com/2024/10/16/ai-pcs-arent-very-good-at-...
       
            dhruvdh wrote 3 hours 37 min ago:
            Oh, maybe also change the title? I flagged it because of the
            title/url not matching.
       
       
   DIR <- back to front page