Home
       [HN Gopher] AI PCs Aren't Good at AI: The CPU Beats the NPU
       ___________________________________________________________________
        
       AI PCs Aren't Good at AI: The CPU Beats the NPU
        
       Author : dbreunig
       Score  : 316 points
       Date   : 2024-10-16 19:44 UTC (10 hours ago)
        
  HTML web link (github.com)
  TEXT w3m dump (github.com)
        
       | fancyfredbot wrote:
       | The write up on the GitHub repo is much more informative than the
       | blog.
       | 
        | When running an int8 matmul through ONNX, performance is
        | ~0.6 TOPS.
       | 
       | https://github.com/usefulsensors/qc_npu_benchmark
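        | 
        | For reference, a minimal sketch of that kind of throughput
        | measurement with onnxruntime (the model file, input name and
        | provider list are illustrative placeholders, not the repo's
        | actual harness):
        | 
        |     import time
        |     import numpy as np
        |     import onnxruntime as ort
        | 
        |     N = 1500  # matrix dimension used in the benchmark
        |     # QNN is the Qualcomm NPU execution provider, with CPU fallback
        |     sess = ort.InferenceSession(
        |         "matmul_int8.onnx",
        |         providers=["QNNExecutionProvider",
        |                    "CPUExecutionProvider"])
        |     a = np.random.randint(-128, 127, (N, N), dtype=np.int8)
        | 
        |     runs = 100
        |     t0 = time.perf_counter()
        |     for _ in range(runs):
        |         sess.run(None, {"A": a})
        |     elapsed = time.perf_counter() - t0
        | 
        |     ops = 2 * N**3 * runs  # one multiply + one add per MAC
        |     print(f"{ops / elapsed / 1e12:.2f} TOPS")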
        
         | dang wrote:
         | Thanks--we changed the URL to that from
         | https://petewarden.com/2024/10/16/ai-pcs-arent-very-good-
          | at-.... Readers may want to look at both, of course!
        
           | dhruvdh wrote:
           | Oh, maybe also change the title? I flagged it because of the
           | title/url not matching.
        
       | dmitrygr wrote:
       | In general MAC unit utilization tends to be low for transformers,
       | but 1.3% seems pretty bad. I wonder if they fucked up the memory
       | interface for the NPU. All the MACs in the world are useless if
       | you cannot feed them.
        
         | moffkalast wrote:
         | I recall looking over the Ryzen AI architecture and the NPU is
         | just plugged into PCIe and thus gets completely crap memory
         | bandwidth. I would expect it might be similar here.
        
           | PaulHoule wrote:
            | I spent a lot of time with a business partner and an expert
            | looking at the design space for accelerators, and it was made
            | very clear to me that the memory interface puts a hard limit
            | on what you can do, and that it is difficult to make the most
            | of. Particularly when a half-baked product is rushed out
            | because of FOMO, you'd practically expect them to ship
            | something that delivers a few percent of the claimed
            | performance because the memory interface doesn't really work.
            | It happens to the best of them:
           | 
           | https://en.wikipedia.org/wiki/Cell_(processor)
        
           | wtallis wrote:
           | It's unlikely to be literally connected over PCIe when it's
            | on the same chip. It just _looks_ like it's connected over
           | PCIe because that's how you make peripherals discoverable to
           | the OS. The integrated GPU also appears to be connected over
           | PCIe, but obviously has access to far more memory bandwidth.
        
         | Hizonner wrote:
         | It's a tablet. It probably has like one DDR channel. It's not
         | so much that they "fucked it up" as that they knowingly built a
         | grossly unbalanced system so they could report a pointless
         | number.
        
           | dmitrygr wrote:
           | Well, no. If the CPU can hit better numbers on the same model
           | then the bandwidth from the DDR _IS_ there. Probably the NPU
           | does not attach to the proper cache level, or just has a very
           | thin pipe to it
        
             | Hizonner wrote:
             | The CPU is only about twice as good as the NPU, though
             | (four times as good on one test). The NPU is being
             | advertised as capable of 45 trillion operations per second,
             | and he's getting 1.3 percent of that.
             | 
             | So, OK, yeah, I concede that the NPU may have even worse
             | access to memory than the CPU, but the bottom line is that
              | neither one of them has anything close to what it needs to
              | actually deliver anything like the marketing headline
              | performance number on any realistic workload.
             | 
             | I bet a lot of people have bought those things after seeing
             | "45 TOPS", thinking that they'd be able to usefully run
             | transformers the size of main memory, and that's not
             | happening on CPU _or_ NPU.
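              | 
              | Rough arithmetic behind that 1.3% figure (numbers as
              | cited above, purely illustrative):
              | 
              |     measured = 0.6e12  # ~ops/s from the benchmark
              |     marketed = 45e12   # the headline "45 TOPS"
              |     print(measured / marketed)  # ~0.013, i.e. ~1.3%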
        
               | dmitrygr wrote:
               | Yup, sad all round. We are in agreement.
        
       | pram wrote:
       | I laughed when I saw that the Qualcomm "AI PC" is described as
       | this in the ComfyUI docs:
       | 
       | "Avoid", "Nothing works", "Worthless for any AI use"
        
       | jsheard wrote:
       | These NPUs are tying up a substantial amount of silicon area so
       | it would be a real shame if they end up not being used for much.
       | I can't find a die analysis of the Snapdragon X which isolates
        | the NPU specifically, but AMD's equivalent with the same ~50 TOPS
       | performance target can be seen here, and takes up about as much
       | area as three high performance CPU cores:
       | 
       | https://www.techpowerup.com/325035/amd-strix-point-silicon-p...
        
         | Kon-Peki wrote:
         | Modern chips have to dedicate a certain percentage of the die
         | to dark silicon [1] (or else they melt/throttle to
         | uselessness), and these kinds of components count towards that
         | amount. So the point of these components is to be used, but not
         | to be used too much.
         | 
         | Instead of an NPU, they could have used those transistors and
         | die space for any number of things. But they wouldn't have put
         | additional high performance CPU cores there - that would
         | increase the power density too much and cause thermal issues
         | that can only be solved with permanent throttling.
         | 
         | [1] https://en.wikipedia.org/wiki/Dark_silicon
        
           | IshKebab wrote:
           | If they aren't being used it would be better to dedicate the
           | space to more SRAM.
        
             | a2l3aQ wrote:
              | The point is that parts of the CPU have to be powered off or
              | throttled down when other components are under load to stay
              | within the TDP; adding cache that would almost certainly be
              | in constant use defeats the point of that.
        
               | jsheard wrote:
               | Doesn't SRAM have much lower power density than logic
               | with the same area though? Hence why AMD can get away
               | with physically stacking cache on top of more cache in
               | their X3D parts, without the bottom layer melting.
        
               | Kon-Peki wrote:
               | Yes, cache has a much lower power density and could have
               | been a candidate for that space.
               | 
               | But I wasn't on the design team and have no basis for
               | second-guessing them. I'm just saying that cramming more
               | performance CPU cores onto this die isn't a realistic
               | option.
        
               | wtallis wrote:
               | The SRAM that AMD is stacking also has the benefit of
               | being last-level cache, so it doesn't need to run at
               | anywhere near the frequency and voltage that eg. L1 cache
               | operates at.
        
           | jcgrillo wrote:
            | Question--what's lost by making your features sparse enough
            | that they can cool while running at full tilt?
        
             | AlotOfReading wrote:
             | Messes with timing, among other things. A lot of those
             | structures are relatively fixed blocks that are designed
             | for specific sizes. Signals take more time to propagate
             | longer distances, and longer conductors have worse
             | properties. Dense and hot is faster and more broadly
             | useful.
        
               | jcgrillo wrote:
               | Interesting, so does that mean we're basically out of
               | runway without aggressive cooling?
        
             | positr0n wrote:
              | Good discussion on how, at multi-GHz clock speeds, the speed
              | of light actually limits some circuit design choices:
              | https://news.ycombinator.com/item?id=12384596
        
         | ezst wrote:
         | I can't wait for the LLM fad to be over so we get some sanity
         | (and efficiency) back. I personally have no use for this extra
         | hardware ("GenAI" doesn't help me in any way nor supports any
         | work-related tasks). Worse, most people have no use for that
         | (and recent surveys even show predominant hostility towards AI
         | creep). We shouldn't be paying extra for that, it should be
         | opt-in, and then it would become clear (by looking at the sales
         | and how few are willing to pay a premium for "AI") how
         | overblown and unnecessary this is.
        
           | DrillShopper wrote:
            | Corporatized gains in the market from hype. Socialized losses
            | in increased carbon emissions, upheaval from job loss, and
            | higher prices on hardware.
           | 
           | The more they say the future will be better the more that it
           | looks like the status quo.
        
           | renewiltord wrote:
           | I was telling someone this and they gave me link to a laptop
           | with higher battery life and better performance than my own,
           | but I kept explaining to them that the feature I cared most
           | about was die size. They couldn't understand it so I just had
           | to leave them alone. Non-technical people don't get it. Die
           | size is what I care about. It's a critical feature and so
           | many mainstream companies are missing out on _my money_
              | because they won't optimize die size. Disgusting.
        
             | nl wrote:
             | Is this a parody?
             | 
             | Why would anyone care about die size? And if you do why not
             | get one of the many low power laptops with Atoms etc that
             | do have small die size?
        
               | throwaway48476 wrote:
               | Maybe through a game of telephone they confused die size
               | and node size?
        
               | thfuran wrote:
               | Yes, they're making fun of the comment they replied to.
        
               | singlepaynews wrote:
               | Would you do me the favor of explaining the joke? I get
               | the premise--nobody cares about die size, but the comment
               | being mocked seems perfectly innocuous to me? They want a
                | laptop without an NPU b/c, according to the link, we get
                | more out of the CPU anyway? What am I missing here?
        
               | tedunangst wrote:
               | No, no, no, you just don't get it. The only thing Dell
               | will sell me is a laptop 324mm wide, which is totally
               | appalling, but if they offered me a laptop that's 320mm
               | wide, I'd immediately buy it. In my line of work, which
               | is totally serious business, every millimeter counts.
        
             | _zoltan_ wrote:
             | News flash: you're in the niche of the niche. People don't
             | care about die size.
             | 
             | I'd be willing to bet that the amount of money they are
              | missing out on is minuscule and is by far offset by
             | people's money who care about other stuff. Like you know,
             | performance and battery life, just to stick to your
             | examples.
        
               | mattnewton wrote:
               | That's exactly what the poster is arguing- they are being
               | sarcastic.
        
             | waveBidder wrote:
             | your satire is off base enough that people don't understand
             | it's satire.
        
               | heavyset_go wrote:
                | Poe's Law means it's working.
        
               | 0xDEAFBEAD wrote:
               | Says a lot about HN that so many believed he was genuine.
        
             | fijiaarone wrote:
             | Yeah, I know what you mean. I hate lugging around a big CPU
             | core.
        
           | mardifoufs wrote:
           | NPUs were a thing (and a very common one in mobile CPUs too)
           | way before the LLM craze.
        
           | kalleboo wrote:
           | > _most people have no use for that_
           | 
           | Apple originally added their NPUs before the current LLM wave
           | to support things like indexing your photo library so that
           | objects and people are searchable. These features are still
           | very popular. I don't think these NPUs are fast enough for
           | GenAI anyway.
        
             | wmf wrote:
             | MS Copilot and "Apple Intelligence" are running a small
             | language model and image generation on the NPU so that
             | should count as "GenAI".
        
               | kalleboo wrote:
               | It's still in beta so we'll see how things go but I saw
               | someone testing what Apple Intelligence ran on-device vs
               | sent off to the "private secure cloud" and even stuff
               | like text summaries were being sent to the cloud.
        
             | grugagag wrote:
             | I wish I could turn that off on my phone.
        
           | jcgrillo wrote:
           | I just got an iphone and the whole photos thing is absolutely
           | garbage. All I wanted to do was look through my damn photos
           | and find one I took recently but it started playing some
           | random music and organized them in no discernible order..
           | like it wasn't the reverse time sorted.. Idk what kind of
           | fucked up "creative process" came up with that bullshit but I
           | sure wish they'd unfuck it stat.
           | 
           | The camera is real good though.
        
             | james_marks wrote:
             | There's an album called "Recents" that's chronological and
             | scrolled to the end.
             | 
             | "Recent" seems to mean everything; I've got 6k+ photos, I
             | think since the last fresh install, which is many devices
             | ago.
             | 
             | Sounds like the view you're looking for and will stick as
             | the default once you find it, but you do have to bat away
             | some BS at first.
        
         | JohnFen wrote:
         | > These NPUs are tying up a substantial amount of silicon area
         | so it would be a real shame if they end up not being used for
         | much.
         | 
         | This has been my thinking. Today you have to go out of your way
         | to buy a system with an NPU, so I don't have any. But tomorrow,
         | will they just be included by default? That seems like a waste
         | for those of us who aren't going to be running models. I wonder
         | what other uses they could be put to?
        
           | jsheard wrote:
           | > But tomorrow, will they just be included by default?
           | 
           | That's already the way things are going due to Microsoft
           | decreeing that Copilot+ is the future of Windows, so AMD and
           | Intel are both putting NPUs which meet the Copilot+
           | performance standard into every consumer part they make going
           | forwards to secure OEM sales.
        
             | AlexAndScripts wrote:
             | It almost makes me want to find some use for them on my
              | Linux box (not that it has an NPU), but I truly can't think
             | of anything. Too small to run a meaningful LLM, and I'd
             | want that in bursts anyway, I hate voice controls (at least
             | with the current tech), and Recall sounds thoroughly
             | useless. Could you do mediocre machine translation on it,
             | perhaps? Local github copilot? An LLM that is purely used
             | to build an abstract index of my notes in the background?
             | 
             | Actually, could they be used to make better AI in games?
             | That'd be neat. A shooter character with some kind of
             | organic tactics, or a Civilisation/Stellaris AI that
             | doesn't suck.
        
           | jonas21 wrote:
           | NPUs are already included by default in the Apple ecosystem.
           | Nobody seems to mind.
        
             | JohnFen wrote:
             | It's not really a question of minding if it's there, unless
             | its presence increases cost, anyway. It just seems a waste
             | to let it go idle, so my mind wanders to what other use I
             | could put that circuitry to.
        
             | acchow wrote:
             | It enables many features on the phone that people like, all
             | without sending your personal data to the cloud. Like
             | searching your photos for "dog" or "receipt".
        
             | shepherdjerred wrote:
             | I actually love that Apple includes this -- especially now
             | that they're actually doing something with it via Apple
             | Intelligence
        
           | crazygringo wrote:
           | Aren't they used for speech recognition -- for dictation?
           | Also for FaceID.
           | 
           | They're useful for more things than just LLM's.
        
             | JohnFen wrote:
             | Yes, but I'm not interested in those sorts of uses. I'm
             | wondering what else an NPU could be used for. I don't know
             | what an NPU actually is at a technical level, so I'm
             | ignorant of the possibilities.
        
           | heavyset_go wrote:
           | The idea is that your OS and apps will integrate ML models,
           | so you will be running models whether you know it or not.
        
             | JohnFen wrote:
             | I'm confident that I'll be able to know and control whether
             | or not my Linux and BSD machines will be using ML models.
        
               | hollerith wrote:
               | --and whether anyone is using your interactions with your
               | computer to train a model.
        
           | idunnoman1222 wrote:
           | Voice to text
        
         | kllrnohj wrote:
          | Snapdragon X still has a full 12 cores (all the same core,
          | i.e. homogeneous), and Strix Point is also 12 cores, in a 4+8
          | configuration where the "little" cores don't sacrifice that
          | much (nothing like the little cores in ARM's designs, which
          | might as well not exist; they are a complete waste of
          | silicon). Consumer software doesn't scale to that, so what are
          | you going to do with more transistors allocated to the CPU?
         | 
         | It's not unlike why Apple puts so many video engines in their
         | SoCs - they don't actually have much else to do with the
         | transistor budget they can afford. Making single thread
         | performance better isn't limited by transistor count anymore
         | and software is bad at multithreading.
        
           | wmf wrote:
           | GPU "infinity" cache would increase 3D performance and
           | there's a rumor that AMD removed it to make room for the NPU.
           | They're not out of ideas for features to put on the chip.
        
       | tromp wrote:
       | > the 45 trillion operations per second that's listed in the
       | specs
       | 
        | Such a spec should ideally be accompanied by code
       | demonstrating or approximating the claimed performance. I can't
       | imagine a sports car advertising a 0-100km/h spec of 2.0 seconds
       | where a user is unable to get below 5 seconds.
        
         | dmitrygr wrote:
          | Most likely multiplying the same 128x128 matrix from cache to
          | cache. That gets you perfect MAC utilization with no need to
          | hit memory. It gets you a big number that is not directly a
          | lie - that perf _IS_ attainable, on a useless synthetic
          | benchmark.
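          | 
          | Back-of-the-envelope for why that makes a good marketing
          | benchmark (int8 operands assumed, purely illustrative):
          | 
          |     n = 128
          |     ops = 2 * n**3           # ~4.2M ops per matmul
          |     print(45e12 / ops)       # ~10.7M matmuls/s for 45 TOPS
          |     print(3 * n * n / 1024)  # ~48 KiB of operands, small
          |                              # enough that DRAM never matters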
        
           | kmeisthax wrote:
           | Sounds great for RNNs! /s
        
         | tedunangst wrote:
         | I have some bad news for you regarding how car acceleration is
         | measured.
        
           | otterley wrote:
           | Well, what is it?
        
             | ukuina wrote:
             | Everything from rolling starts to perfect road conditions
             | and specific tires, I suppose.
        
       | isusmelj wrote:
        | I think the results show that, in general, the compute is not
        | being used well. That the CPU took 8.4ms and the GPU took 3.2ms
        | shows a very small gap; I'd expect more like a 10x - 20x
        | difference here. I'd assume that the onnxruntime might be the
        | issue. I think some hardware vendors just release the compute
        | units without shipping proper support yet. Let's see how fast
        | that will change.
        | 
        | Also, people often assume the reason for an NPU is "speed".
        | That's not correct. The whole point of the NPU is rather to focus
        | on low power consumption. To focus on speed you'd need to get rid
        | of the memory bottleneck. Then you end up designing your own ASIC
        | with its own memory. The NPUs we see in most devices are part of
        | the SoC around the CPU to offload AI computations. It would be
        | interesting to run this benchmark in an infinite loop for the
        | three devices (CPU, NPU, GPU) and measure power consumption. I'd
        | expect the NPU to be lowest and also best in terms of "ops/watt".
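        | 
        | A rough way to set up that loop with onnxruntime (provider
        | names, model file and input name are placeholders, and power
        | itself would have to be read externally, e.g. from a wall meter
        | or the OS's energy counters, while each loop runs):
        | 
        |     import time
        |     import numpy as np
        |     import onnxruntime as ort
        | 
        |     x = np.random.rand(1, 6, 1500, 1500).astype(np.float32)
        |     targets = [("cpu", ["CPUExecutionProvider"]),
        |                ("gpu", ["DmlExecutionProvider"]),
        |                ("npu", ["QNNExecutionProvider"])]
        |     for name, providers in targets:
        |         sess = ort.InferenceSession("model.onnx",
        |                                     providers=providers)
        |         t0 = time.perf_counter()
        |         for _ in range(1000):  # long enough to sample power
        |             sess.run(None, {"input": x})
        |         dt = (time.perf_counter() - t0) / 1000
        |         print(name, dt * 1000, "ms/iter")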
        
         | AlexandrB wrote:
         | > Also, people often mistake the reason for an NPU is "speed".
         | That's not correct. The whole point of the NPU is rather to
         | focus on low power consumption.
         | 
         | I have a sneaking suspicion that the real real reason for an
         | NPU is marketing. "Oh look, NVDA is worth $3.3T - let's make
         | sure we stick some AI stuff in our products too."
        
           | kmeisthax wrote:
           | You forget "Because Apple is doing it", too.
        
             | rjsw wrote:
             | I think other ARM SoC vendors like Rockchip added NPUs
             | before Apple, or at least around the same time.
        
               | acchow wrote:
               | I was curious so looked it up. Apple's first chip with an
               | NPU was the A11 bionic in Sept 2017. Rockchip's was the
               | RK1808 in Sept 2019.
        
               | GeekyBear wrote:
               | Face ID was the first tent pole feature that ran on the
               | NPU.
        
               | j16sdiz wrote:
               | Google TPU was introduced around same time as apple.
               | Basically everybody knew it can be something around that
               | time, just don't know exactly how
        
               | Someone wrote:
               | https://en.wikipedia.org/wiki/Tensor_Processing_Unit#Prod
               | uct... shows the first one is from 2015 (publicly
               | announced in 2016). It also shows they have a TDP of
               | 75+W.
               | 
               | I can't find TDP for Apple's Neural Engine
               | (https://en.wikipedia.org/wiki/Neural_Engine), but the
               | first version shipped in the iPhone 8, which has a 7 Wh
               | battery, so these are targeting different markets.
        
               | bdd8f1df777b wrote:
               | Even if it were true, they wouldn't have the same
               | influence as Apple has.
        
           | itishappy wrote:
           | I assume you're both right. I'm sure NPUs exist to fill a
           | very real niche, but I'm also sure they're being shoehorned
           | in everywhere regardless of product fit because "AI big right
           | now."
        
             | wtallis wrote:
             | Looking at it slightly differently: putting low-power NPUs
             | into laptop and phone SoCs is how to get on the AI
             | bandwagon in a way that NVIDIA cannot easily disrupt. There
             | are plenty of systems where a NVIDIA discrete GPU cannot
             | fit into the budget (of $ or Watts). So even if NPUs are
             | still somewhat of a solution in search of a problem (aka a
             | killer app or two), they're not necessarily a sign that
             | these manufacturers are acting entirely without strategy.
        
             | brookst wrote:
             | The shoehorning only works if there is buyer demand.
             | 
             | As a company, if customers are willing to pay a premium for
             | a NPU, or if they are unwilling to buy a product without
             | one, it is not your place to say "hey we don't really
             | believe in the AI hype so we're going to sell products
             | people don't want to prove a point"
        
               | MBCook wrote:
               | Is there demand? Or do they just assume there is?
               | 
               | If they shove it in every single product and that's all
               | anyone advertises, whether consumers know it will help
               | them or not, you don't get a lot of choice.
               | 
               | If you want the latest chip, you're getting AI stuff.
               | That's all there is to it.
        
               | Terr_ wrote:
               | "The math is clear: 100% of our our car sales come from
               | models with our company logo somewhere on the front,
               | which shows incredible customer desire for logos. We
               | should consider offering a new luxury trim level with
               | more of them."
               | 
               | "How many models to we have without logos?"
               | 
               | "Huh? Why would we do that?"
        
               | MBCook wrote:
               | Heh. Yeah more or less.
               | 
               | To some degree I understand it, because as we've all
               | noticed computers have pretty much plateaued for the
               | average person. They last much longer. You don't need to
                | replace them every two years anymore because the software
                | isn't outstripping them so fast.
               | 
               | AI is the first thing to come along in quite a while that
               | not only needs significant power but it's just something
               | different. It's something they can say your old computer
               | doesn't have that the new one does. Other than being 5%
               | faster or whatever.
               | 
               | So even if people don't need it, and even if they notice
               | they don't need it, it's something to market on.
               | 
               | The stuff up thread about it being the hotness that Wall
               | Street loves is absolutely a thing too.
        
               | ddingus wrote:
               | That was all true nearly 10 years ago. And it has only
               | improved. Almost any computer one finds these days is
               | capable of the basics.
        
               | bdd8f1df777b wrote:
                | There are two kinds of buyers: product buyers and stock
                | buyers. The AI hype can certainly convince some of the
                | stock buyers.
        
               | Spooky23 wrote:
               | Apple will have a completely AI capable product line in
               | 18 months, with the major platforms basically done.
               | 
                | Microsoft is built around the broken Intel tick/tock
               | model of incremental improvement -- they are stuck with
               | OEM shitware that will take years to flush out of the
               | channel. That means for AI, they are stuck with cloud
               | based OpenAI, where NVIDIA has them by the balls and the
               | hyperscalers are all fighting for GPU.
               | 
               | Apple will deliver local AI features as software (the
               | hardware is "free") at a much higher margin - while
               | Office 365 AI is like $400+ a year per user.
               | 
               | You'll have people getting iPhones to get AI assisted
               | emails or whatever Apple does that is useful.
        
               | justahuman74 wrote:
               | Who is getting $400/y of value from that?
        
               | nxobject wrote:
               | I hope that once they get a baseline level of AI
               | functionality in, they start working with larger LLMs to
               | enable some form of RAG... that might be their next
               | generational shift.
        
               | hakfoo wrote:
               | We're still looking for "that is useful".
               | 
                | The stuff they've been using to sell AI to the public is
                | increasingly looking as absurd as every 1978 "you'll
                | store your recipes on the home computer" argument.
               | 
               | AI text became a Human Centipede story: Start with a
               | coherent 10-word sentence, let AI balloon it into five
               | pages of flowery nonsense, send it to someone else, who
               | has their AI smash it back down to 10 meaningful words.
               | 
               | Coding assistance, even as spicy autocorrect, is often a
               | net negative as you have to plow through hallucinations
               | and weird guesses as to what you want but lack the tools
               | to explain to it.
               | 
               | Image generation is already heading rapidly into cringe
               | territory, in part due to some very public social media
               | operations. I can imagine your kids' kids in 2040 finding
               | out they generated AI images in the 2020s and looking at
               | them with the same embarrassment you'd see if they dug
               | out your high-school emo fursona.
               | 
               | There might well be some more "closed-loop" AI
               | applications that make sense. But are they going to be
               | running on every desktop in the world? Or are they going
               | to be mostly used in datacentres and purpose-built
               | embedded devices?
               | 
               | I also wonder how well some of the models and techniques
               | scale down. I know Microsoft pushed a minimum spec to
               | promote a machine as Copilot-ready, but that seems like
               | it's going to be "Vista Basic Ready" redux as people try
               | to run tools designed for datacentres full of Quadro
               | cards, or at least high-end GPUs, on their $299 HP
               | laptop.
        
               | jjmarr wrote:
               | Cringe emo girls are trendy now because the nostalgia
               | cycle is hitting the early 2000s. Your kid would be
               | impressed if you told them you were a goth gf. It's not
               | hard to imagine the same will happen with primitive AIs
               | in the 40s.
        
               | defrost wrote:
               | Early 2000's ??
               | 
               | Bela Lugosi Died in 1979, and Peter Murphy was onto his
               | next band by 1984.
               | 
                | By 2000 Goth was fully a distant dot in the rear view
                | mirror for the OG's.
                | 
                |     In 2002, Murphy released *Dust* with Turkish-Canadian
                |     composer and producer Mercan Dede, which utilizes
                |     traditional Turkish instrumentation and songwriting,
                |     abandoning Murphy's previous pop and rock
                |     incarnations, and juxtaposing elements from
                |     progressive rock, trance, classical music, and Middle
                |     Eastern music, coupled with Dede's trademark
                |     atmospheric electronics.
               | 
               | https://www.youtube.com/watch?v=Yy9h2q_dr9k
               | 
               | https://en.wikipedia.org/wiki/Bauhaus_(band)
        
               | djur wrote:
               | I'm not sure what "gothic music existed in the 1980s" is
               | meant to indicate as a response to "goths existed in the
               | early 2000s as a cultural archetype".
        
               | defrost wrote:
                | That Goths in the 2000s were at best a second-wave
                | nostalgia cycle of Goths from the 1980s.
               | 
               | That people recalling Goths in that period should beware
               | of thinking that was a source and not an echo.
               | 
               | In 2006 Noel Fielding's Richmond Felicity Avenal was a
               | basement dwelling leftover from many years past.
        
               | im3w1l wrote:
               | Until AI chips become abundant, and we are not there yet,
               | cloud AI just makes too much sense. Using a chip
               | constantly vs using it 0.1% of the time is just so many
               | orders of magnitude better.
               | 
               | Local inference does have privacy benefits. I think at
                | the moment it might make sense to send most queries to
               | a beefy cloud model, and send sensitive queries to a
               | smaller local one.
        
           | Dalewyn wrote:
           | There are no _nerves_ in a _neural_ processing unit, so yes:
            | It's 300% bullshit marketing.
        
             | jcgrillo wrote:
             | Maybe the N secretly stands for NFT.. Like the tesla self
             | driving hardware only smaller and made of silicon.
        
             | brookst wrote:
             | _Neural_ is an adjective. Adjectives do not require their
             | associated nouns to be present. See also: digital computers
              | have no fingers at all.
        
               | -mlv wrote:
               | I always thought 'digital' referred to numbers, not
               | fingers.
        
               | bdd8f1df777b wrote:
                | The derivative meaning has been used so widely that it
                | has surpassed the original one in usage. But that doesn't
                | change the fact that it originally referred to fingers.
        
           | Spooky23 wrote:
           | Microsoft needs to throw something in the gap to slow down
           | MacBook attrition.
           | 
           | The M processors changed the game. My teams support 250k
           | users. I went from 50 MacBooks in 2020 to over 10,000 today.
           | I added zero staff - we manage them like iPhones.
        
             | cj wrote:
             | Rightly so.
             | 
             | The M processor really did completely eliminate all sense
             | of "lag" for basic computing (web browsing, restarting your
             | computer, etc). Everything happens nearly instantly, even
             | on the first generation M1 processor. The experience of
             | "waiting for something to load" went away.
             | 
             | Not to mention these machines easily last 5-10 years.
        
               | nxobject wrote:
               | As a very happy M1 Max user (should've shelled out for
               | 64GB of RAM, though, for local LLMs!), I don't look
               | forward to seeing how the Google Workspace/Notions/etc.
               | of the world somehow reintroduce lag back in.
        
               | bugbuddy wrote:
               | The problem for Intel and AMD is they are stuck with an
               | OS that ships with a lag-inducing Anti-malware suite. I
               | just did a simple git log and it took 2000% longer than
               | usual because the Antivirus was triggered to scan and run
               | a simulation on each machine instruction and byte of data
               | accessed. The commit log window stayed blank waiting to
               | load long enough for me to complete another tiny project.
                | It always ruins my day.
        
               | zdw wrote:
               | This is most likely due to corporate malware.
               | 
               | Even modern macs can be brought to their knees by
               | something that rhymes with FrowdStrike Calcon and
               | interrupts all IO.
        
               | alisonatwork wrote:
               | Pro tip: turn off malware scanning in your git repos[0].
               | There is also the new Dev Drive feature in Windows 11
               | that makes it even easier for developers (and IT admins)
               | to set this kind of thing up via policies[1].
               | 
               | In companies where I worked where the IT team rolled out
               | "security" software to the Mac-based developers, their
               | computers were not noticeably faster than Windows PCs at
               | all, especially given the majority of containers are
               | still linux/amd64, reflecting the actual deployment
               | environment. Meanwhile Windows also runs on ARM anyway,
               | so it's not really something useful to generalize about.
               | 
               | [0] https://support.microsoft.com/en-us/topic/how-to-add-
               | a-file-...
               | 
               | [1] https://learn.microsoft.com/en-us/windows/dev-drive/
        
               | bugbuddy wrote:
               | Unfortunately, the IT department people think they are
               | literal GODs for knowing how to configure Domain Policies
               | and lock down everything. They even refuse to help or
               | even answer requests for help when there are false
               | positives on our own software builds that we cannot
               | unmark as false positives. These people are proactively
                | antagonistic to productivity. Management could not care
                | less...
        
               | djur wrote:
               | Oh, just work for a company that uses Crowdstrike or
               | similar. You'll get back all the lag you want.
        
               | n8cpdx wrote:
               | Chrome managed it. Not sure how since Edge still works
               | reasonably well and Safari is instant to start (even
               | faster than system settings, which is really an
               | indictment of SwiftUI).
        
               | ddingus wrote:
               | I have a first gen M1 and it holds up very nicely even
               | today. I/O is crazy fast and high compute loads get done
               | efficiently.
               | 
               | One can bury the machine and lose very little basic
               | interactivity. That part users really like.
               | 
               | Frankly the only downside of the MacBook Air is the tiny
               | storage. The 8GB RAM is actually enough most of the time.
               | But general system storage with only 1/4 TB is cramped
               | consistently.
               | 
               | Been thinking about sending the machine out to one of
               | those upgrade shops...
        
               | lynguist wrote:
               | Why did you buy a 256GB device for personal use in the
               | first place? Too good of a deal? Or saving these $400 for
               | upgrades for something else?
        
               | bzzzt wrote:
               | Depends on the application as well. Just try to start up
               | Microsoft Teams.
        
             | pjmlp wrote:
              | Microsoft does indeed have a problem, but only in countries
              | where people can afford Apple-level prices, and not
              | everyone is a G7 citizen.
        
           | conradev wrote:
           | The real consumers of the NPUs are the operating systems
           | themselves. Google's TPU and Apple's ANE are used to power OS
           | features like Apple's Face ID and Google's image
           | enhancements.
           | 
           | We're seeing these things in traditional PCs now because
           | Microsoft has demanded it so that Microsoft can use it in
           | Windows 11.
           | 
           | Any use by third party software is a lower priority
        
           | pclmulqdq wrote:
           | The correct way to make a true "NPU" is to 10x your memory
           | bandwidth and feed a regular old multicore CPU with
           | SIMD/vector instructions (and maybe a matrix multiply unit).
           | 
           | Most of these small NPUs are actually made for CNNs and other
           | models where "stream data through weights" applies. They have
           | a huge speedup there. When you stream weights across data
           | (any LLM or other large model), you are almost certain to be
           | bound by memory bandwidth.
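            | 
            | The weight-streaming case is easy to put in numbers: each
            | generated token has to read roughly every weight once, so
            | bandwidth alone caps decode speed no matter how many TOPS
            | the MAC array has. A hedged back-of-the-envelope:
            | 
            |     weights_gb = 7.0    # e.g. a 7B-parameter model at int8
            |     bw_gb_s = 100.0     # bandwidth the accelerator can see
            |     print(bw_gb_s / weights_gb, "tokens/s upper bound")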
        
         | kmeisthax wrote:
         | > I think some hardware vendors just release the compute units
         | without shipping proper support yet
         | 
         | This is Nvidia's moat. Everything has optimized kernels for
         | CUDA, and _maybe_ Apple Accelerate (which is the only way to
         | touch the CPU matrix unit before M4, and the NPU at all). If
         | you want to use anything else, either prepare to upstream
         | patches in your ML framework of choice or prepare to write your
         | own training and inference code.
        
         | godelski wrote:
          | They definitely aren't doing the timing properly, but also
          | what you might think of as timing is not what is generally
          | marketed. That said, those marketed versions are often easier
          | to compare. One example: if you're using a GPU, have you
          | actually considered that there's an asynchronous operation as
          | part of your timing?
         | 
          | If you're naively doing `time.time()` then what happens is
          | this:
          | 
          |     start = time.time()  # cpu records time
          |     # push data and model (if not already there) to GPU
          |     # memory and start computation. This is asynchronous:
          |     pred = model(input.cuda()).cuda()
          |     # cpu records time, regardless of if pred stores data:
          |     end = time.time()
         | 
         | You probably aren't expecting that if you don't know systems
         | and hardware. But python (and really any language) is designed
         | to be smart and compile into more optimized things than what
         | you actually wrote. There's no lock, and so we're not going to
         | block operations for cpu tasks. You might ask why do this? Well
         | no one knows what you actually want to do. And do you want the
         | timer library now checking for accelerators (i.e. GPU) every
         | time it records a time? That's going to mess up your timer! (at
         | best you'd have to do a constructor to say "enable locking for
         | this accelerator") So you gotta do something a bit more
         | nuanced.
         | 
         | If you want to actually time GPU tasks, you should look at cuda
         | event timers (in pytorch this is
         | `torch.cuda.Event(enable_timing=True)`. I have another comment
         | with boilerplate)
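          | 
          | Roughly, that boilerplate looks like this (the model and the
          | input shape are placeholders):
          | 
          |     import torch
          | 
          |     model = model.cuda().eval()  # 'model' defined elsewhere
          |     x = torch.randn(1, 6, 1500, 1500, device="cuda")
          | 
          |     start = torch.cuda.Event(enable_timing=True)
          |     end = torch.cuda.Event(enable_timing=True)
          |     torch.cuda.synchronize()   # don't count queued work
          |     start.record()
          |     with torch.no_grad():
          |         pred = model(x)
          |     end.record()
          |     torch.cuda.synchronize()   # wait for kernels to finish
          |     print(start.elapsed_time(end), "ms")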
         | 
         | Edit:
         | 
          | There are also complicated issues like memory size and shape.
          | They definitely are not being nice to the NPU here on either
          | of those. They (and GPUs!!!) want channels last: they did
          | [1,6,1500,1500] but you'd want [1,1500,1500,6]. There's also
          | the issue of how memory is allocated (and they noted IO being
          | an issue). 1500 is a weird number (as is 6), so they aren't
          | doing the NPU any favors, and I wouldn't be surprised if this
          | is a big hit considering how new these things are.
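          | 
          | In PyTorch terms, the two layouts being contrasted look like
          | this (whether a given NPU runtime wants the strided form or
          | the physically permuted one depends on the toolchain, so this
          | is just illustrative):
          | 
          |     import torch
          | 
          |     x = torch.randn(1, 6, 1500, 1500)  # NCHW, as benchmarked
          |     # same logical shape, NHWC strides:
          |     a = x.to(memory_format=torch.channels_last)
          |     # physically [1, 1500, 1500, 6]:
          |     b = x.permute(0, 2, 3, 1).contiguous()
          |     print(a.shape, a.stride())
          |     print(b.shape)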
         | 
         | And here's my longer comment with more details:
         | https://news.ycombinator.com/item?id=41864828
        
           | artemisart wrote:
            | Important clarification: the async part is absolutely not
            | Python-specific; it comes from CUDA (indeed, for
            | performance), and you have to use CUDA events in C++ too to
            | time it properly.
            | 
            | For ONNX, the runtimes I know of are synchronous, since we
            | don't run each operation individually but whole models at
            | once; there is no need for async, so the timings should be
            | correct.
        
             | godelski wrote:
             | Yes, it isn't python, it is... hardware. Not even CUDA
             | specific. It is about memory moving around and optimization
             | (remember, even the CPUs do speculative execution). I say a
             | little more in the larger comment.
             | 
             | I'm less concerned about the CPU baseline and more
             | concerned about the NPU timing. Especially given the other
             | issues
        
         | theresistor wrote:
         | > Also, people often mistake the reason for an NPU is "speed".
         | That's not correct. The whole point of the NPU is rather to
         | focus on low power consumption.
         | 
         | It's also often about offload. Depending on the use case, the
         | CPU and GPU may be busy with other tasks, so the NPU is free
         | bandwidth that can be used without stealing from the others.
         | Consider AI-powered photo filters: the GPU is probably busy
         | rendering the preview, and the CPU is busy drawing UI and
         | handling user inputs.
        
           | cakoose wrote:
           | Offload only makes sense if there are other advantages, e.g.
           | speed, power.
           | 
              | Without those, wouldn't it be better to spend the NPU's
              | silicon budget on more CPU?
        
             | heavyset_go wrote:
             | More CPU means siphoning off more of the power budget on
             | mobile devices. The theoretical value of NPUs is power
             | efficiency on a limited budget.
        
             | theresistor wrote:
             | If you know that you need to offload matmuls, then building
             | matmul hardware is more area efficient than adding an
             | entire extra CPU. Various intermediate points exist along
             | that spectrum, e.g. Cell's SPUs.
        
             | avianlyric wrote:
             | Not really. To get extra CPU performance that likely means
             | more cores, or some other general compute silicon. That
             | stuff tends to be quite big, simply because it's so
             | flexible.
             | 
             | NPUs focus on one specific type of computation, matrix
             | multiplication, and usually with low precision integers,
             | because that's all a neural net needs. That vast reduction
             | in flexibility means you can take lots of shortcuts in your
              | design, allowing you to cram more compute into a smaller
             | footprint.
             | 
              | If you look at the M1 chip[1], you can see the entire
              | 16-core Neural Engine has a footprint about the size of 4
              | performance cores (excluding their caches). It's not a
              | perfect comparison without numbers on what the performance
              | cores can achieve in terms of ops/second vs the Neural
              | Engine, but it seems reasonable to bet that the Neural
              | Engine handily outperforms the performance core complex
              | when doing matmul operations.
             | 
             | [1] https://www.anandtech.com/show/16226/apple-
             | silicon-m1-a14-de...
        
         | spookie wrote:
         | I've been building an app in pure C using onnxruntime, and it
        | outperforms a comparable one done with Python by a substantial
        | amount. There are many other gains to be made.
         | 
         | (In the end python just calls C, but it's pretty interesting
         | how much performance is lost)
        
       | jamesy0ung wrote:
        | What exactly does Windows do with an NPU? I don't own an 'AI
        | PC' but it seems like the NPUs are slow and can't run much.
       | 
       | I know Apple's Neural Engine is used to power Face ID and the
       | facial recognition stuff in Photos, among other things.
        
         | DrillShopper wrote:
         | It supports Microsoft's Recall (now required) spyware
        
           | Janicc wrote:
           | Please remind me again how Recall sends data to Microsoft. I
           | must've missed that part. Or are you against the print screen
           | button too? I heard that takes images too. Very scary.
        
             | cmeacham98 wrote:
             | While calling it spyware like GP is over-exaggeration to a
             | ridiculous level, comparing Recall to Print Screen is also
             | inaccurate:
             | 
             | Print Screen takes images on demand, Recall does so
             | effectively at random. This means Recall could
             | inadvertently screenshot and store information you didn't
             | intend to keep a record of (To give an extreme example:
             | Imagine an abuser uses Recall to discover their spouse
             | browsing online domestic violence resources).
        
             | Terr_ wrote:
             | > Please remind me again how Recall sends data to
             | Microsoft. I must've missed that part.
             | 
             | Sure, just post the source code and I'll point out where it
             | does so, I somehow misplaced my copy. /s
             | 
             | The core problem here is trust, and over the last several
             | years Microsoft has burned a hell of a lot of theirs with
             | power-users of Windows. Even their most strident public
             | promises of Recall being "opt-in" and "on-device only" will
             | --paradoxically--only be _kept_ as long as enough people
             | remain suspicious.
             | 
             | Glance away and MS go back to their old games, pushing a
             | mandatory "security update" which reset or entirely-removes
             | your privacy settings and adding new "telemetry" streams
             | which you cannot inspect.
        
             | bloated5048 wrote:
              | It's always safe to assume it does if it's closed source.
              | I'd rather be suspicious of big corporations seeking to
              | profit at every step than naive.
              | 
              | Also, it's a security risk which has already been
              | exploited. Sure, MS fixed it, but can you be certain it
              | won't be exploited again some time in the future?
        
         | dagaci wrote:
          | It's used for improving video calls, special effects, image
          | editing/effects and noise cancelling, Teams stuff.
        
         | downrightmike wrote:
         | AI PC is just a marketing term, doesn't have any real substance
        
           | acdha wrote:
           | Yea, we know that. I believe that's why the person you're
            | replying to was asking for examples of real usage.
        
       | eightysixfour wrote:
       | I thought the purpose of these things was not to be fast, but to
       | be able to run small models with very little power usage? I have
       | a newer AMD laptop with an NPU, and my power usage doesn't change
       | using the video effects that supposedly run on it, but goes up
       | when using the nvidia studio effects.
       | 
       | It seems like the NPUs are for very optimized models that do
       | small tasks, like eye contact, background blur, autocorrect
       | models, transcription, and OCR. In particular, on Windows, I
       | assumed they were running the full screen OCR (and maybe
       | embeddings for search) for the rewind feature.
        
         | conradev wrote:
         | That is my understanding as well: low power and low latency.
         | 
         | You can see this in action when evaluating a CoreML model on a
         | macOS machine. The ANE takes half as long as the GPU which
         | takes half as long as the CPU (actual factors being model
         | dependent)
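          | 
          | A minimal way to see that split with coremltools (the model
          | file and input name are placeholders; the GPU/ANE settings
          | can still fall back to the CPU for unsupported ops):
          | 
          |     import time
          |     import numpy as np
          |     import coremltools as ct
          | 
          |     arr = np.random.rand(1, 3, 224, 224).astype(np.float32)
          |     x = {"input": arr}
          |     units = [("cpu", ct.ComputeUnit.CPU_ONLY),
          |              ("gpu", ct.ComputeUnit.CPU_AND_GPU),
          |              ("ane", ct.ComputeUnit.CPU_AND_NE)]
          |     for name, cu in units:
          |         m = ct.models.MLModel("model.mlpackage",
          |                               compute_units=cu)
          |         m.predict(x)  # warm up / compile for this target
          |         t0 = time.perf_counter()
          |         for _ in range(100):
          |             m.predict(x)
          |         dt = (time.perf_counter() - t0) / 100
          |         print(name, dt * 1000, "ms")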
        
           | nickpsecurity wrote:
           | To take half as long, doesn't it have to perform twice as
           | fast? Or am I misreading your comment?
        
             | eightysixfour wrote:
             | No, you can have latency that is independent of compute
             | performance. The CPU/GPU may have other tasks and the work
             | has to wait for the existing threads to finish, or for them
             | to clock up, or have slower memory paths, etc.
             | 
             | If you and I have the same calculator but I'm working on a
             | set of problems and you're not, and we're both asked to do
             | some math, it may take me longer to return it, even though
             | the instantaneous performance of the math is the same.
        
               | refulgentis wrote:
               | In isolation, makes sense.
               | 
               | Wouldn't it be odd for OP to present examples that are
               | the _opposite_ of their claim, just to get us thinking
               | about  "well the CPU is busy?"
               | 
               | Curious for their input.
        
             | conradev wrote:
             | The GPU is stateful and requires loading shaders and
             | initializing pipelines before doing any work. That is where
             | its latency comes from. It is also extremely power hungry.
             | 
              | The CPU is zero-latency to get started, but takes longer
              | overall because it isn't specialized for any one task and
              | isn't massively parallel.
             | 
             | The NPU often has a simpler bytecode to do more complex
             | things like matrix multiplication implemented in hardware,
             | rather than having to instantiate a generic compute kernel
             | on the GPU.
        
         | boomskats wrote:
         | That's especially true because yours is a Xilinx FPGA. The one
         | that they just attached to the latest gen mobile ryzens is 5x
         | more capable too.
         | 
         | AMD are doing some fantastic work at the moment, they just
         | don't seem to be shouting about it. This one is particularly
         | interesting
         | https://lore.kernel.org/lkml/DM6PR12MB3993D5ECA50B27682AEBE1...
         | 
         | edit: not an FPGA. TIL. :'(
        
           | errantspark wrote:
           | Wait sorry back up a bit here. I can buy a laptop that has a
           | daughter FPGA in it? Does it have GPIO??? Are we seriously
           | building hardware worth buying again in 2024? Do you have a
           | link?
        
             | eightysixfour wrote:
             | It isn't as fun as you think - they are setup for specific
             | use cases and quite small. Here's a link to the software
             | page: https://ryzenai.docs.amd.com/en/latest/index.html
             | 
             | The teeny-tiny "NPU," which is actually an FPGA, is 10
             | TOPS.
             | 
             | Edit: I've been corrected, not an FPGA, just an IP block
             | from Xilinx.
        
               | wtallis wrote:
               | It's not a FPGA. It's an NPU IP block from the Xilinx
               | side of the company. It was presumably originally
               | developed to be run on a Xilinx FPGA, but that doesn't
               | mean AMD did the stupid thing and actually fabbed a FPGA
               | fabric instead of properly synthesizing the design for
               | their laptop ASIC. Xilinx involvement does not
               | automatically mean it's an FPGA.
        
               | eightysixfour wrote:
               | Thanks for the correction, edited.
        
               | boomskats wrote:
               | Do you have any more reading on this? How come the XDNA
               | drivers depend on Xilinx' XRT runtime?
        
               | almostgotcaught wrote:
               | because XRT has a plugin architecture: XRT<-shim
               | plugin<-kernel driver. The shims register themselves with
               | XRT. The XDNA driver repo houses both the shim and the
               | kernel driver.
        
               | boomskats wrote:
               | Thanks, that makes sense.
        
               | wtallis wrote:
                | It would be surprising and strange if AMD _didn't_ reuse
               | the software framework they've already built for doing AI
               | when that IP block is instantiated on an FPGA fabric
               | rather than hardened in an ASIC.
        
               | boomskats wrote:
               | Well, I'm irrationally disappointed, but thanks.
               | Appreciate the correction.
        
               | boomskats wrote:
                | Yes, the one on the Ryzen 7000 chips like the 7840U isn't
                | massive, but that's the last-gen model. The one they've
                | just released with the HX370 chip is estimated at 50
                | TOPS, which is better than Qualcomm's ARM flagship that
                | this post is about. It's a fivefold improvement in a
                | single generation, which is pretty exciting.
               | 
                | And it's an FPGA. Edit: it's not an FPGA.
        
               | almostgotcaught wrote:
               | > And it's an FPGA.
               | 
               | nope it's not.
        
               | boomskats wrote:
               | I've just ordered myself a jump to conclusions mat.
        
               | almostgotcaught wrote:
                | Lol, during grad school my advisor would frequently cut me
                | off and try to jump to a conclusion while I was explaining
                | something technical, and often enough he was wrong. So I
                | really did buy him one (off eBay or something). He wasn't
                | pleased.
        
             | dekhn wrote:
             | If you want GPIOs, you don't need (or want) an FPGA.
             | 
             | I don't know the details of your use case, but I work with
             | low level hardware driven by GPIOs and after a bit of
              | investigation, concluded that having direct GPIO access in
             | a modern PC was not necessary or desirable compared to the
             | alternatives.
        
               | errantspark wrote:
                | I get a lot of use out of the PRUs on the BeagleBone
                | Black; I would absolutely get use out of an FPGA in a
                | laptop.
        
               | dekhn wrote:
                | It makes more sense to me to just use the BeagleBone
                | Black in concert with the FPGA, unless you have highly
                | specific compute or data movement needs that can't be
                | satisfied over a USB serial link. If you have those
                | needs, and you need a laptop, I guess an FPGA makes
                | sense, but that's a teeny market.
        
           | beeflet wrote:
           | It would be cool if most PCs had a general purpose FPGA that
           | could be repurposed by the operating system. For example you
           | could use it as a security processor like a TPM or as a
           | bootrom, or you could repurpose it for DSP or something.
           | 
           | It just seems like this would be better in terms of
           | firmware/security/bootloading because you would be more able
           | to fix it if an exploit gets discovered, and it would be
           | leaner because different operating systems can implement
            | their own stuff (for example Linux might not want Pluton in-
            | chip security, Windows might not want coreboot or Linux-based
            | boot, and bare-metal applications can have a much simpler
            | boot).
        
             | walterbell wrote:
              | Xilinx Artix 7-series PicoEVB fits in an M.2 wifi slot and
              | has an OSS toolchain: http://www.enjoy-digital.fr/
        
           | pclmulqdq wrote:
           | It's not an FPGA. It's a VLIW DSP that Xilinx built to go
           | into an FPGA-SoC to help run ML models.
        
             | almostgotcaught wrote:
             | this is the correct answer. one of the compilers for this
             | DSP is https://github.com/Xilinx/llvm-aie.
        
           | numpad0 wrote:
            | Sorry for an OT comment, but what is going on with that ASCII
            | art!? The content fits within 80 columns just fine[1]; is it
            | GPT-generated?
           | 
           | 1: https://pastebin.com/raw/R9BrqETR
        
           | davemp wrote:
           | Unfortunately FPGA fabric is ~2x less power efficient than
           | equivalent ASIC logic at the same clock speeds last time I
           | checked. So implementing general purpose logic on an FPGA is
           | not usually the right option even if you don't care about
           | FMAX or transistor counts.
        
         | refulgentis wrote:
          | You're absolutely right IMO, given what I heard when launching
          | on-device speech recognition on Pixel and, after leaving
          | Google, what I see from e.g. Apple Neural Engine vs. CPU when
          | running ONNX stuff.
          | 
          | I'm a bit suspicious of the article's specific conclusion,
          | because it is Qualcomm's ONNX, and it may be out of date. Also,
          | Android loved talking shit about Qualcomm software engineering.
          | 
          | That being said, it's directionally correct, insofar as
          | consumer hardware AI acceleration claims are near-universally
          | BS unless you're A) writing 1P software or B) someone in the 1P
          | really wants you to take advantage of it.
        
           | kristianp wrote:
           | 1P?
        
             | refulgentis wrote:
             | First party, i.e. Google/Apple/Microsoft
        
         | godelski wrote:
         | > but to be able to run small models with very little power
         | usage
         | 
         | yes
         | 
          | But first, I should also say you probably don't want to be
          | programming these things with Python. I doubt you'll get good
          | performance there, especially as the newness means
          | optimizations haven't been ported well (even using something
          | like TensorRT is not going to be as fast as writing it from
          | scratch, and Nvidia is throwing a lot of manpower at that --
          | for good reason! But it sure as hell will get close and save
          | you a lot of time writing).
         | 
          | They are, like you say, generally optimized for doing repeated
          | similar tasks. That's also where I suspect some of the info
          | gathered here is inaccurate. (I have not used these NPU chips,
          | so what follows is educated guessing, but I'll explain. Please
          | correct me if I've made an error.)
         | 
          | Second, I don't trust the timing here. I'm certain the CUDA
          | timing (at the end) is incorrect, as the code as written
          | wouldn't time it properly. Timing is surprisingly hard to get
          | right. I suspect the advertised figures only count operations
          | performed directly on the NPU, while OP would have included
          | CPU operations in their NPU and GPU timings[0]. But the docs
          | have benchmarking tools, so I suspect they're doing something
          | similar. I'd be interested to know the variance and how this
          | holds up after doing warmups. _They do identify IO as an
          | issue, and I think these numbers are evidence of that._
         | 
          | Third, their data is improperly formatted.
          | 
          |     MATRIX_COUNT, MATRIX_A, MATRIX_B, MATRIX_K = (6, 1500, 1500, 256)
          |     INPUT0_SHAPE = [1, MATRIX_COUNT, MATRIX_A, MATRIX_K]
          |     INPUT1_SHAPE = [1, MATRIX_COUNT, MATRIX_K, MATRIX_B]
          |     OUTPUT_SHAPE = [1, MATRIX_COUNT, MATRIX_A, MATRIX_B]
         | 
         | You want "channels last" here. I suspected this (do this in
         | pytorch too!) and the docs they link confirm.
         | 
         | 1500 is also an odd choice and this could be cause for extra
         | misses. I wonder how things would change with 1536, 2048, or
         | even 256. Might (probably) even want to look smaller, since
         | this might be a common preprocessing step. Your models are not
         | processing full res images and if you're going to optimize
         | architecture for models, you're going to use that shape
         | information. Shape optimization is actually pretty important in
         | ML[1]. I suspect this will be quite a large miss.
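          | 
          | To make that concrete, here's a minimal PyTorch sketch of what
          | I mean by padding to a friendlier size and using a channels-
          | last layout (purely illustrative; the multiple of 64 is my
          | assumption, and this is not the repo's code):
          | 
          |     import torch
          |     import torch.nn.functional as F
          | 
          |     def pad_to_multiple(x, multiple=64):
          |         # Pad the last two (matrix) dims up to the next multiple,
          |         # e.g. 1500 -> 1536, which tiles much more cleanly.
          |         pad_rows = (-x.shape[-2]) % multiple
          |         pad_cols = (-x.shape[-1]) % multiple
          |         return F.pad(x, (0, pad_cols, 0, pad_rows))
          | 
          |     a = torch.randn(1, 6, 1500, 256)
          |     a = pad_to_multiple(a)  # -> [1, 6, 1536, 256]
          |     # Channels-last only changes the stride order (the memory
          |     # layout), not the values.
          |     a = a.contiguous(memory_format=torch.channels_last)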
         | 
          | Fourth, a quick look at the docs and I think the setup is
          | improper. Under "Model Workflow" they mention that they want
          | data in 8 or 16 bit _float_. I'm not going to look too deep,
          | but note that there are different types of floats (e.g.
          | PyTorch's bfloat16 is not the same as torch.half /
          | torch.float16). Mixed precision is still a confusing subject,
          | and if you're hitting issues like these it is worth looking at.
          | I very much suggest not just running a standard quantization
          | procedure and calling it a day (start there! But don't end
          | there unless it's "good enough", which doesn't seem too
          | meaningful here.)
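          | 
          | A tiny sketch of why the float type matters (nothing here is
          | specific to the NPU toolchain; it's just the standard dtype
          | trade-off):
          | 
          |     import torch
          | 
          |     # Same bit width, very different trade-offs:
          |     #   float16  -> 5 exponent bits, 10 mantissa bits (tight range)
          |     #   bfloat16 -> 8 exponent bits,  7 mantissa bits (fp32-like range)
          |     x = torch.tensor(70000.0)
          |     print(x.to(torch.float16))   # inf: overflows float16's ~65504 max
          |     print(x.to(torch.bfloat16))  # ~70144: fits, but coarsely rounded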
         | 
         | FWIW, I still do think these results are useful, but I think
         | they need to be improved upon. This type of stuff is
         | surprisingly complex, but a large amount of that is due to
          | things being new and many of the details still being worked
          | out. Remember that when you're comparing to things like the CPU
          | or GPU (especially CUDA), these have had hundreds of thousands
          | of man-hours put into them, and at least tens of thousands into
          | high-level language libraries (e.g. Python) to handle this. I
         | don't think these devices are ready for the average user where
         | you can just work with them from your favorite language's
         | abstraction level, but they're pretty useful if you're willing
         | to work close to the metal.
         | 
          | [0] I don't know what the timing is for this, but I do this in
          | PyTorch a lot, so here's the boilerplate:
          | 
          |     times = torch.empty(rounds)
          |     # Don't need to use dummy data, but here
          |     input_data = torch.randn((batch_size, *data_shape), device="cuda")
          |     # Do some warmups first. There's background actions dealing
          |     # with IO we don't want to measure. You can remove that line
          |     # and do a dist of times if you want to see this.
          |     # Make sure you generate data and save to a variable (write)
          |     # or else this won't do anything.
          |     for _ in range(warmup):
          |         data = model(input_data)
          |     for i in range(rounds):
          |         starter = torch.cuda.Event(enable_timing=True)
          |         ender = torch.cuda.Event(enable_timing=True)
          |         starter.record()
          |         data = model(input_data)
          |         ender.record()
          |         torch.cuda.synchronize()
          |         times[i] = starter.elapsed_time(ender) / 1000
          |     total_time = times.sum()
         | 
          | The reason we do it this way is that if we just wrap the model
          | call with a timer, we're measuring CPU time; the GPU operations
          | are asynchronous, so you could get deceptively fast (or slow)
          | times.
         | 
         | [1] https://www.thonking.ai/p/what-shapes-do-matrix-
         | multiplicati...
        
       | wmf wrote:
       | This headline is seriously misleading because the author did not
       | test AMD or Intel NPUs. If Qualcomm's NPU is slow, don't claim
       | that all AI PCs are no good.
        
       | protastus wrote:
       | Deploying a model on an NPU requires significant profile based
       | optimization. Picking up a model that works fine on the CPU but
       | hasn't been optimized for an NPU usually leads to disappointing
       | results.
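       | 
       | (A minimal sketch of where that profiling usually starts with ONNX
       | Runtime -- the provider name and model path here are assumptions;
       | check the vendor docs:)
       | 
       |     import onnxruntime as ort
       | 
       |     opts = ort.SessionOptions()
       |     opts.enable_profiling = True  # dump a JSON trace of per-node times
       |     sess = ort.InferenceSession(
       |         "model.onnx",
       |         sess_options=opts,
       |         providers=["QNNExecutionProvider", "CPUExecutionProvider"],
       |     )
       |     # ...run sess.run(...) a few times with representative inputs...
       |     print(sess.end_profiling())  # path to the JSON profile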
        
         | catgary wrote:
         | Yeah whenever I've spoken to people who work on stuff like IREE
         | or OpenXLA they gave me the impression that understanding how
         | to use those compilers/runtimes is an entire job.
        
         | CAP_NET_ADMIN wrote:
         | Beauty of CPUs - they'll chew through whatever bs code you
         | throw at them at a reasonable speed.
        
       | lostmsu wrote:
       | The author's benchmark sucks if he could only get 2 TOPS from a
       | laptop 4080. The thing should be doing somewhere around 80 TOPS.
       | 
       | Given that, you should take his NPU results with a truckload of
       | salt.
        
       | hkgjjgjfjfjfjf wrote:
       | Sutherland's wheel of reincarnation turns.
        
       | downrightmike wrote:
       | They should have just made a PCIe card and not tried to push
       | whole new machines on us. We are all good with the machines we
       | already have. If you want to sell a new feature, then it needs to
       | be an add-on.
        
       | Mistletoe wrote:
       | >The second conclusion is that the measured performance of 573
       | billion operations per second is only 1.3% of the 45 trillion
       | ops/s that the marketing material promises.
       | 
       | It just gets so hard to take this industry seriously.
        
       | m00x wrote:
       | NPUs are efficient, not especially fast. The CPU is much bigger
       | than the NPU and has better cache access. Of course it'll perform
       | better.
        
         | acdha wrote:
         | It's more complicated than that (you're assuming that the
         | bigger CPU is optimized for the same workload) but it's also
         | irrelevant to the topic at hand: they're seeing this NPU within
         | a factor of 2-4 of the CPU, but if it performed half as well as
         | Qualcomm claims it would be an order of magnitude faster. The
         | story here isn't another round of the specialized versus
         | general debate but that they fell so far short of their
         | marketing claims.
        
       | Havoc wrote:
       | >We see 1.3% of Qualcomm's NPU 45 Teraops/s claim
       | 
       | To me that suggests that the test is wrong.
       | 
       | I could see Intel massaging results, but being that far off seems
       | incredibly improbable.
        
       | p1necone wrote:
       | I might be overly cynical but I just assumed that the entire
       | purpose of "AI PCs" was marketing - of course they don't actually
       | achieve much. Any real hardware that's supposedly for the "AI"
       | features will actually be just special purpose hardware for
       | literally anything the sales department can lump under that
       | category.
        
       | teilo wrote:
       | Actual article title: Benchmarking Qualcomm's NPU on the
       | Microsoft Surface Tablet
       | 
       | Because this isn't about NPUs. It's about a specific NPU, on a
       | specific benchmark, with a specific set of libraries and
       | frameworks. So basically, this proves nothing.
        
         | iml7 wrote:
          | But then you can't get as many clicks. You have to attack
          | enough people to get clicks. I feel like this place is becoming
          | more and more filled with posts and titles like this.
        
           | gerdesj wrote:
            | Internet points are a bit crap, but HN generally discusses
            | things properly, and off-topic and downright weird stuff
            | generally gets downvoted to doom.
        
         | gnabgib wrote:
         | The title is from the original article
         | (https://petewarden.com/2024/10/16/ai-pcs-arent-very-good-
         | at-...), the URL was changed by dang:
         | https://news.ycombinator.com/item?id=41863591
        
       | NoPicklez wrote:
       | Fairly misleading title, boiling down AI PCs to just the
       | Microsoft Surface running Qualcomm.
        
       | cjbgkagh wrote:
       | > We've tried to avoid that by making both the input matrices
       | more square, so that tiling and reuse should be possible.
       | 
       | While it might be possible, it would not surprise me if a number
       | of possible optimizations had not made it into ONNX. It appears
       | that Qualcomm does not give direct access to the NPU, and users
       | are expected to use frameworks to convert models over to it; in
       | my experience, conversion tools generally suck and leave a lot of
       | optimizations on the table. It could be less "NPUs suck" and more
       | "the conversion tools suck". I'll wait until I get direct access
       | - I don't trust conversion tools.
       | 
       | My view of NPUs is that they're great for tiny ML models and very
       | fast function approximations, which is my intended use case.
       | While LLMs are the new hotness, there is a huge number of
       | specialized tasks that small models are really useful for.
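       | 
       | (As a concrete example of the "tiny model as a fast function
       | approximation" idea -- a minimal PyTorch sketch, not tied to any
       | particular NPU toolchain -- you can fit a small MLP to an
       | expensive function and export it to ONNX for whatever converter
       | the vendor provides:)
       | 
       |     import torch
       | 
       |     # Tiny MLP approximating an "expensive" scalar function f(x)
       |     net = torch.nn.Sequential(
       |         torch.nn.Linear(1, 32), torch.nn.Tanh(),
       |         torch.nn.Linear(32, 32), torch.nn.Tanh(),
       |         torch.nn.Linear(32, 1),
       |     )
       |     opt = torch.optim.Adam(net.parameters(), lr=1e-3)
       |     for _ in range(2000):
       |         x = torch.rand(256, 1) * 6 - 3          # samples in [-3, 3]
       |         loss = torch.nn.functional.mse_loss(net(x), torch.sin(x) * x)
       |         opt.zero_grad(); loss.backward(); opt.step()
       | 
       |     # Hand the frozen graph to the vendor's converter from here.
       |     torch.onnx.export(net, torch.zeros(1, 1), "approx.onnx")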
        
         | jaygreco wrote:
          | I came here to say this. I haven't worked with the Elite X, but
          | on the past-gen stuff I've used (865 mostly) the accelerators -
          | the compute DSP and a much smaller NPU - required _very_
          | specific setup, compilation with a bespoke toolchain, and
          | communication via RPC, to name a few things.
         | 
          | I would hope the NPU on the Elite X is easier to get to,
          | considering the whole Copilot+ thing, but I bring this up
          | mainly to make the point that I doubt it's as easy as "run a
          | general-purpose model and expect it to magically teleport onto
          | the NPU".
        
       | stanleykm wrote:
       | The ARM SME could be an interesting alternative to NPUs in the
       | future. Unlike NPUs, which have at best some fixed-function API,
       | it will be possible to program SME more directly.
        
       | piskov wrote:
       | Snapdragon touts 45 TOPS, but that's int8 only.
       | 
       | For example, Apple's M3 neural engine is a mere 18 TOPS, but
       | that's FP16.
       | 
       | So Windows has the bigger number, but it's not an apples-to-apples
       | comparison.
       | 
       | Did the author test int8 performance?
        
       | freehorse wrote:
       | I always thought that the main point of NPUs is energy efficiency
       | (and being able to run ML models without taking over all of the
       | computer's resources, making it practical to integrate ML
       | applications into the OS itself in ways that do not disturb the
       | user or the workflow) rather than being exceptionally fast. At
       | least this has been my experience with running stable diffusion
       | on Macs. It's similar to using other specialised hardware like
       | media encoders: they are not necessarily faster than a CPU if you
       | throw a dozen-plus CPU cores at the task, but they will draw a
       | minuscule fraction of the power.
        
       | guelermus wrote:
       | One should also pay attention to power efficiency; a direct
       | comparison could be misleading here.
        
       | _davide_ wrote:
       | The RTX 4080 should be capable of ~40 TFLOPS, yet they only
       | report 2,160 billion operations per second. Shouldn't this be
       | enough to reconsider the benchmark? They probably made some
       | serious error in measuring FLOPS. As for the CPU beating the NPU:
       | it's possible, but they should benchmark many matrix
       | multiplications without any application synchronization in
       | between to get a decent comparison.
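       | 
       | (For reference, a minimal sketch of that measurement style on the
       | CUDA side -- illustrative only, not the article's benchmark -- is
       | to queue many matmuls and synchronize once at the end:)
       | 
       |     import time, torch
       | 
       |     a = torch.randn(6, 1536, 256, device="cuda", dtype=torch.float16)
       |     b = torch.randn(6, 256, 1536, device="cuda", dtype=torch.float16)
       | 
       |     rounds = 100
       |     torch.cuda.synchronize()
       |     t0 = time.perf_counter()
       |     for _ in range(rounds):
       |         c = a @ b             # queued asynchronously on the GPU
       |     torch.cuda.synchronize()  # wait once, after all the work
       |     dt = time.perf_counter() - t0
       | 
       |     flops = 2 * 6 * 1536 * 1536 * 256 * rounds  # 2*M*N*K per matrix, x6 batch
       |     print(f"{flops / dt / 1e12:.2f} TFLOP/s")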
        
         | Grimblewald wrote:
          | That isn't the half of it. A quick skim of the documentation
          | shows that the CPU inference wasn't done in a comparable way
          | either.
        
       | ein0p wrote:
       | A memory-bound workload is memory bound. It doesn't matter how
       | many TOPS you have if you're sitting idle waiting on DRAM during
       | generation. You will, however, notice a difference in prefill for
       | long prompts.
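       | 
       | (A rough back-of-the-envelope sketch of why token generation ends
       | up memory bound -- the numbers below are illustrative assumptions,
       | not measurements:)
       | 
       |     # Each generated token streams essentially all weights from DRAM.
       |     model_bytes = 3e9          # e.g. a 3B-parameter model at int8
       |     bandwidth   = 60e9         # assume ~60 GB/s of LPDDR
       |     compute     = 45e12        # the advertised 45 TOPS
       | 
       |     tok_s_memory  = bandwidth / model_bytes   # ~20 tok/s ceiling from DRAM
       |     ops_per_token = 2 * 3e9                   # ~2 ops per weight per token
       |     tok_s_compute = compute / ops_per_token   # ~7500 tok/s ceiling from TOPS
       |     print(tok_s_memory, tok_s_compute)        # DRAM, not TOPS, is the limit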
        
       ___________________________________________________________________
       (page generated 2024-10-17 06:00 UTC)