[HN Gopher] AI PCs Aren't Good at AI: The CPU Beats the NPU ___________________________________________________________________ AI PCs Aren't Good at AI: The CPU Beats the NPU Author : dbreunig Score : 316 points Date : 2024-10-16 19:44 UTC (10 hours ago) HTML web link (github.com) TEXT w3m dump (github.com) | fancyfredbot wrote: | The write-up on the GitHub repo is much more informative than the | blog. | | When running int8 matmul using onnx performance is ~0.6TF. | | https://github.com/usefulsensors/qc_npu_benchmark | dang wrote: | Thanks--we changed the URL to that from | https://petewarden.com/2024/10/16/ai-pcs-arent-very-good- | at-.... Readers may want to look at both, of course! | dhruvdh wrote: | Oh, maybe also change the title? I flagged it because of the | title/url not matching. | dmitrygr wrote: | In general MAC unit utilization tends to be low for transformers, | but 1.3% seems pretty bad. I wonder if they fucked up the memory | interface for the NPU. All the MACs in the world are useless if | you cannot feed them. | moffkalast wrote: | I recall looking over the Ryzen AI architecture and the NPU is | just plugged into PCIe and thus gets completely crap memory | bandwidth. I would expect it might be similar here. | PaulHoule wrote: | I spent a lot of time with a business partner and an expert | looking at the design space for accelerators and it was made | very clear to me that the memory interface puts a hard limit | on what you can do and that it is difficult to make the most | of. Particularly if a half-baked product is being rushed out | because of FOMO you'd practically expect them to ship | something that gives a few percent of the performance because | the memory interface doesn't really work, it happens to the | best of them: | | https://en.wikipedia.org/wiki/Cell_(processor) | wtallis wrote: | It's unlikely to be literally connected over PCIe when it's | on the same chip. It just _looks_ like it's connected over | PCIe because that's how you make peripherals discoverable to | the OS. The integrated GPU also appears to be connected over | PCIe, but obviously has access to far more memory bandwidth. | Hizonner wrote: | It's a tablet. It probably has like one DDR channel. It's not | so much that they "fucked it up" as that they knowingly built a | grossly unbalanced system so they could report a pointless | number. | dmitrygr wrote: | Well, no. If the CPU can hit better numbers on the same model | then the bandwidth from the DDR _IS_ there. Probably the NPU | does not attach to the proper cache level, or just has a very | thin pipe to it | Hizonner wrote: | The CPU is only about twice as good as the NPU, though | (four times as good on one test). The NPU is being | advertised as capable of 45 trillion operations per second, | and he's getting 1.3 percent of that. | | So, OK, yeah, I concede that the NPU may have even worse | access to memory than the CPU, but the bottom line is that | neither one of them has anything close to what it needs | to actually deliver anything like the marketing headline | performance number on any realistic workload. | | I bet a lot of people have bought those things after seeing | "45 TOPS", thinking that they'd be able to usefully run | transformers the size of main memory, and that's not | happening on CPU _or_ NPU. | dmitrygr wrote: | Yup, sad all round. We are in agreement.
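For a sense of scale, here is a rough back-of-the-envelope calculation using the matrix shapes from the linked benchmark repo (6 matmuls of [1500 x 256] by [256 x 1500], quoted further down the thread) and the timings and claims discussed above; counting 2 ops per multiply-accumulate is an assumption about the convention used:

    count, m, k, n = 6, 1500, 256, 1500          # shapes from the benchmark repo
    ops_per_run = count * 2 * m * k * n          # multiply + add = 2 ops per MAC
    print(ops_per_run / 1e9)                     # ~6.9 billion ops per run

    for name, seconds in [("CPU", 8.4e-3), ("GPU", 3.2e-3)]:
        print(name, ops_per_run / seconds / 1e12, "TOP/s")   # ~0.8 and ~2.2 TOP/s

    npu_measured = 0.573e12                      # ops/s the write-up reports for the NPU
    print("NPU utilization:", npu_measured / 45e12)          # ~0.013, i.e. ~1.3%

At 3.2 ms the GPU works out to roughly 2.16 TOP/s, which matches the 2,160 billion ops/s figure quoted later in the thread, so these shapes do appear to be the ones actually measured.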
| pram wrote: | I laughed when I saw that the Qualcomm "AI PC" is described as | this in the ComfyUI docs: | | "Avoid", "Nothing works", "Worthless for any AI use" | jsheard wrote: | These NPUs are tying up a substantial amount of silicon area so | it would be a real shame if they end up not being used for much. | I can't find a die analysis of the Snapdragon X which isolates | the NPU specifically but AMDs equivalent with the same ~50 TOPS | performance target can be seen here, and takes up about as much | area as three high performance CPU cores: | | https://www.techpowerup.com/325035/amd-strix-point-silicon-p... | Kon-Peki wrote: | Modern chips have to dedicate a certain percentage of the die | to dark silicon [1] (or else they melt/throttle to | uselessness), and these kinds of components count towards that | amount. So the point of these components is to be used, but not | to be used too much. | | Instead of an NPU, they could have used those transistors and | die space for any number of things. But they wouldn't have put | additional high performance CPU cores there - that would | increase the power density too much and cause thermal issues | that can only be solved with permanent throttling. | | [1] https://en.wikipedia.org/wiki/Dark_silicon | IshKebab wrote: | If they aren't being used it would be better to dedicate the | space to more SRAM. | a2l3aQ wrote: | The point is parts of the CPU have to be off or throttled | down when other components are under load to maintain TDP, | adding cache that would almost certainly be being used | defeats the point of that. | jsheard wrote: | Doesn't SRAM have much lower power density than logic | with the same area though? Hence why AMD can get away | with physically stacking cache on top of more cache in | their X3D parts, without the bottom layer melting. | Kon-Peki wrote: | Yes, cache has a much lower power density and could have | been a candidate for that space. | | But I wasn't on the design team and have no basis for | second-guessing them. I'm just saying that cramming more | performance CPU cores onto this die isn't a realistic | option. | wtallis wrote: | The SRAM that AMD is stacking also has the benefit of | being last-level cache, so it doesn't need to run at | anywhere near the frequency and voltage that eg. L1 cache | operates at. | jcgrillo wrote: | Question--what's to be lost by making your features | sufficiently not dense to allow them to cool at full tilt? | AlotOfReading wrote: | Messes with timing, among other things. A lot of those | structures are relatively fixed blocks that are designed | for specific sizes. Signals take more time to propagate | longer distances, and longer conductors have worse | properties. Dense and hot is faster and more broadly | useful. | jcgrillo wrote: | Interesting, so does that mean we're basically out of | runway without aggressive cooling? | positr0n wrote: | Good discussion on how at multi GHz clock speeds, the speed | of light is actually limiting on some circuit design | choices: https://news.ycombinator.com/item?id=12384596 | ezst wrote: | I can't wait for the LLM fad to be over so we get some sanity | (and efficiency) back. I personally have no use for this extra | hardware ("GenAI" doesn't help me in any way nor supports any | work-related tasks). Worse, most people have no use for that | (and recent surveys even show predominant hostility towards AI | creep). 
We shouldn't be paying extra for that, it should be | opt-in, and then it would become clear (by looking at the sales | and how few are willing to pay a premium for "AI") how | overblown and unnecessary this is. | DrillShopper wrote: | Corporatized gains in the market from hype Socialized losses | in increased carbon emissions, upheaval from job loss, and | higher prices on hardware. | | The more they say the future will be better the more that it | looks like the status quo. | renewiltord wrote: | I was telling someone this and they gave me link to a laptop | with higher battery life and better performance than my own, | but I kept explaining to them that the feature I cared most | about was die size. They couldn't understand it so I just had | to leave them alone. Non-technical people don't get it. Die | size is what I care about. It's a critical feature and so | many mainstream companies are missing out on _my money_ | because they won 't optimize die size. Disgusting. | nl wrote: | Is this a parody? | | Why would anyone care about die size? And if you do why not | get one of the many low power laptops with Atoms etc that | do have small die size? | throwaway48476 wrote: | Maybe through a game of telephone they confused die size | and node size? | thfuran wrote: | Yes, they're making fun of the comment they replied to. | singlepaynews wrote: | Would you do me the favor of explaining the joke? I get | the premise--nobody cares about die size, but the comment | being mocked seems perfectly innocuous to me? They want a | laptop without an NPU b/c according to link we get more | out of CPU anyways? What am I missing here? | tedunangst wrote: | No, no, no, you just don't get it. The only thing Dell | will sell me is a laptop 324mm wide, which is totally | appalling, but if they offered me a laptop that's 320mm | wide, I'd immediately buy it. In my line of work, which | is totally serious business, every millimeter counts. | _zoltan_ wrote: | News flash: you're in the niche of the niche. People don't | care about die size. | | I'd be willing to bet that the amount of money they are | missing out on is miniscule and is by far offset by | people's money who care about other stuff. Like you know, | performance and battery life, just to stick to your | examples. | mattnewton wrote: | That's exactly what the poster is arguing- they are being | sarcastic. | waveBidder wrote: | your satire is off base enough that people don't understand | it's satire. | heavyset_go wrote: | The Poe's Law means it's working. | 0xDEAFBEAD wrote: | Says a lot about HN that so many believed he was genuine. | fijiaarone wrote: | Yeah, I know what you mean. I hate lugging around a big CPU | core. | mardifoufs wrote: | NPUs were a thing (and a very common one in mobile CPUs too) | way before the LLM craze. | kalleboo wrote: | > _most people have no use for that_ | | Apple originally added their NPUs before the current LLM wave | to support things like indexing your photo library so that | objects and people are searchable. These features are still | very popular. I don't think these NPUs are fast enough for | GenAI anyway. | wmf wrote: | MS Copilot and "Apple Intelligence" are running a small | language model and image generation on the NPU so that | should count as "GenAI". | kalleboo wrote: | It's still in beta so we'll see how things go but I saw | someone testing what Apple Intelligence ran on-device vs | sent off to the "private secure cloud" and even stuff | like text summaries were being sent to the cloud. 
| grugagag wrote: | I wish I could turn that off on my phone. | jcgrillo wrote: | I just got an iphone and the whole photos thing is absolutely | garbage. All I wanted to do was look through my damn photos | and find one I took recently but it started playing some | random music and organized them in no discernible order.. | like it wasn't the reverse time sorted.. Idk what kind of | fucked up "creative process" came up with that bullshit but I | sure wish they'd unfuck it stat. | | The camera is real good though. | james_marks wrote: | There's an album called "Recents" that's chronological and | scrolled to the end. | | "Recent" seems to mean everything; I've got 6k+ photos, I | think since the last fresh install, which is many devices | ago. | | Sounds like the view you're looking for and will stick as | the default once you find it, but you do have to bat away | some BS at first. | JohnFen wrote: | > These NPUs are tying up a substantial amount of silicon area | so it would be a real shame if they end up not being used for | much. | | This has been my thinking. Today you have to go out of your way | to buy a system with an NPU, so I don't have any. But tomorrow, | will they just be included by default? That seems like a waste | for those of us who aren't going to be running models. I wonder | what other uses they could be put to? | jsheard wrote: | > But tomorrow, will they just be included by default? | | That's already the way things are going due to Microsoft | decreeing that Copilot+ is the future of Windows, so AMD and | Intel are both putting NPUs which meet the Copilot+ | performance standard into every consumer part they make going | forwards to secure OEM sales. | AlexAndScripts wrote: | It almost makes me want to find some use for them on my | Linux box (not that is has an NPU), but I truly can't think | of anything. Too small to run a meaningful LLM, and I'd | want that in bursts anyway, I hate voice controls (at least | with the current tech), and Recall sounds thoroughly | useless. Could you do mediocre machine translation on it, | perhaps? Local github copilot? An LLM that is purely used | to build an abstract index of my notes in the background? | | Actually, could they be used to make better AI in games? | That'd be neat. A shooter character with some kind of | organic tactics, or a Civilisation/Stellaris AI that | doesn't suck. | jonas21 wrote: | NPUs are already included by default in the Apple ecosystem. | Nobody seems to mind. | JohnFen wrote: | It's not really a question of minding if it's there, unless | its presence increases cost, anyway. It just seems a waste | to let it go idle, so my mind wanders to what other use I | could put that circuitry to. | acchow wrote: | It enables many features on the phone that people like, all | without sending your personal data to the cloud. Like | searching your photos for "dog" or "receipt". | shepherdjerred wrote: | I actually love that Apple includes this -- especially now | that they're actually doing something with it via Apple | Intelligence | crazygringo wrote: | Aren't they used for speech recognition -- for dictation? | Also for FaceID. | | They're useful for more things than just LLM's. | JohnFen wrote: | Yes, but I'm not interested in those sorts of uses. I'm | wondering what else an NPU could be used for. I don't know | what an NPU actually is at a technical level, so I'm | ignorant of the possibilities. 
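For readers wondering what an NPU is at a technical level: at heart it is a fixed array of multiply-accumulate units for low-precision matrix math, and everything else (convolutions, attention, embeddings) gets lowered onto that primitive. A rough sketch of the core operation in plain NumPy, with made-up shapes and a hypothetical quantization scale:

    import numpy as np

    # int8 inputs accumulated in int32 -- the primitive most NPUs harden.
    activations = np.random.randint(-128, 128, size=(64, 256), dtype=np.int8)
    weights     = np.random.randint(-128, 128, size=(256, 128), dtype=np.int8)

    acc = activations.astype(np.int32) @ weights.astype(np.int32)

    # Results are rescaled back to a narrow type before feeding the next layer.
    scale = 0.02                                  # hypothetical quantization scale
    out = np.clip(np.round(acc * scale), -128, 127).astype(np.int8)
    print(out.shape)                              # (64, 128)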
| heavyset_go wrote: | The idea is that your OS and apps will integrate ML models, | so you will be running models whether you know it or not. | JohnFen wrote: | I'm confident that I'll be able to know and control whether | or not my Linux and BSD machines will be using ML models. | hollerith wrote: | --and whether anyone is using your interactions with your | computer to train a model. | idunnoman1222 wrote: | Voice to text | kllrnohj wrote: | Snapdragon X still has a full 12 cores (all same cores, it's | homogeneous) and the Strix Point is also 12 cores but in a 4+8 | configuration but with the "little" cores not sacrificing that | much (nothing like the little cores in ARM's designs which | might as well not even exist, they are a complete waste of | silicon). Consumer software doesn't scale to that, so what are | you going to do with more transistors allocated to the CPU? | | It's not unlike why Apple puts so many video engines in their | SoCs - they don't actually have much else to do with the | transistor budget they can afford. Making single thread | performance better isn't limited by transistor count anymore | and software is bad at multithreading. | wmf wrote: | GPU "infinity" cache would increase 3D performance and | there's a rumor that AMD removed it to make room for the NPU. | They're not out of ideas for features to put on the chip. | tromp wrote: | > the 45 trillion operations per second that's listed in the | specs | | Such a spec should be ideally be accompanied by code | demonstrating or approximating the claimed performance. I can't | imagine a sports car advertising a 0-100km/h spec of 2.0 seconds | where a user is unable to get below 5 seconds. | dmitrygr wrote: | most likely multiplying the same 128x128 matrix from cache to | cache. That gets you perfect MAC utilization with no need to | hit memory. Gets you a big number that is not directly a lie - | that perf _IS_ attainable, on a useless synthetic benchmark | kmeisthax wrote: | Sounds great for RNNs! /s | tedunangst wrote: | I have some bad news for you regarding how car acceleration is | measured. | otterley wrote: | Well, what is it? | ukuina wrote: | Everything from rolling starts to perfect road conditions | and specific tires, I suppose. | isusmelj wrote: | I think the results show that just in general the compute is not | used well. That the CPU took 8.4ms and GPU took 3.2ms shows a | very small gap. I'd expect more like 10x - 20x difference here. | I'd assume that the onnxruntime might be the issue. I think some | hardware vendors just release the compute units without shipping | proper support yet. Let's see how fast that will change. | | Also, people often mistake the reason for an NPU is "speed". | That's not correct. The whole point of the NPU is rather to focus | on low power consumption. To focus on speed you'd need to get rid | of the memory bottleneck. Then you end up designing your own ASIC | with it's own memory. The NPUs we see in most devices are part of | the SoC around the CPU to offload AI computations. It would be | interesting to run this benchmark in a infinite loop for the | three devices (CPU, NPU, GPU) and measure power consumption. I'd | expect the NPU to be lowest and also best in terms of "ops/watt" | AlexandrB wrote: | > Also, people often mistake the reason for an NPU is "speed". | That's not correct. The whole point of the NPU is rather to | focus on low power consumption. | | I have a sneaking suspicion that the real real reason for an | NPU is marketing. 
"Oh look, NVDA is worth $3.3T - let's make | sure we stick some AI stuff in our products too." | kmeisthax wrote: | You forget "Because Apple is doing it", too. | rjsw wrote: | I think other ARM SoC vendors like Rockchip added NPUs | before Apple, or at least around the same time. | acchow wrote: | I was curious so looked it up. Apple's first chip with an | NPU was the A11 bionic in Sept 2017. Rockchip's was the | RK1808 in Sept 2019. | GeekyBear wrote: | Face ID was the first tent pole feature that ran on the | NPU. | j16sdiz wrote: | Google TPU was introduced around same time as apple. | Basically everybody knew it can be something around that | time, just don't know exactly how | Someone wrote: | https://en.wikipedia.org/wiki/Tensor_Processing_Unit#Prod | uct... shows the first one is from 2015 (publicly | announced in 2016). It also shows they have a TDP of | 75+W. | | I can't find TDP for Apple's Neural Engine | (https://en.wikipedia.org/wiki/Neural_Engine), but the | first version shipped in the iPhone 8, which has a 7 Wh | battery, so these are targeting different markets. | bdd8f1df777b wrote: | Even if it were true, they wouldn't have the same | influence as Apple has. | itishappy wrote: | I assume you're both right. I'm sure NPUs exist to fill a | very real niche, but I'm also sure they're being shoehorned | in everywhere regardless of product fit because "AI big right | now." | wtallis wrote: | Looking at it slightly differently: putting low-power NPUs | into laptop and phone SoCs is how to get on the AI | bandwagon in a way that NVIDIA cannot easily disrupt. There | are plenty of systems where a NVIDIA discrete GPU cannot | fit into the budget (of $ or Watts). So even if NPUs are | still somewhat of a solution in search of a problem (aka a | killer app or two), they're not necessarily a sign that | these manufacturers are acting entirely without strategy. | brookst wrote: | The shoehorning only works if there is buyer demand. | | As a company, if customers are willing to pay a premium for | a NPU, or if they are unwilling to buy a product without | one, it is not your place to say "hey we don't really | believe in the AI hype so we're going to sell products | people don't want to prove a point" | MBCook wrote: | Is there demand? Or do they just assume there is? | | If they shove it in every single product and that's all | anyone advertises, whether consumers know it will help | them or not, you don't get a lot of choice. | | If you want the latest chip, you're getting AI stuff. | That's all there is to it. | Terr_ wrote: | "The math is clear: 100% of our our car sales come from | models with our company logo somewhere on the front, | which shows incredible customer desire for logos. We | should consider offering a new luxury trim level with | more of them." | | "How many models to we have without logos?" | | "Huh? Why would we do that?" | MBCook wrote: | Heh. Yeah more or less. | | To some degree I understand it, because as we've all | noticed computers have pretty much plateaued for the | average person. They last much longer. You don't need to | replace them every two years anymore because the software | isn't out stripping them so fast. | | AI is the first thing to come along in quite a while that | not only needs significant power but it's just something | different. It's something they can say your old computer | doesn't have that the new one does. Other than being 5% | faster or whatever. 
| | So even if people don't need it, and even if they notice | they don't need it, it's something to market on. | | The stuff up thread about it being the hotness that Wall | Street loves is absolutely a thing too. | ddingus wrote: | That was all true nearly 10 years ago. And it has only | improved. Almost any computer one finds these days is | capable of the basics. | bdd8f1df777b wrote: | There are two kinds of buyer demands: product, buyers, | and the stock buyers. The AI hype can certainly convince | some of the stock buyers. | Spooky23 wrote: | Apple will have a completely AI capable product line in | 18 months, with the major platforms basically done. | | Microsoft is built around the broken Intel tick/tick | model of incremental improvement -- they are stuck with | OEM shitware that will take years to flush out of the | channel. That means for AI, they are stuck with cloud | based OpenAI, where NVIDIA has them by the balls and the | hyperscalers are all fighting for GPU. | | Apple will deliver local AI features as software (the | hardware is "free") at a much higher margin - while | Office 365 AI is like $400+ a year per user. | | You'll have people getting iPhones to get AI assisted | emails or whatever Apple does that is useful. | justahuman74 wrote: | Who is getting $400/y of value from that? | nxobject wrote: | I hope that once they get a baseline level of AI | functionality in, they start working with larger LLMs to | enable some form of RAG... that might be their next | generational shift. | hakfoo wrote: | We're still looking for "that is useful". | | The stuff they've been trying to sell AI to the public | with is increasingly looking as absurd as every 1978 | "you'll store your recipes on the home computer" | argument. | | AI text became a Human Centipede story: Start with a | coherent 10-word sentence, let AI balloon it into five | pages of flowery nonsense, send it to someone else, who | has their AI smash it back down to 10 meaningful words. | | Coding assistance, even as spicy autocorrect, is often a | net negative as you have to plow through hallucinations | and weird guesses as to what you want but lack the tools | to explain to it. | | Image generation is already heading rapidly into cringe | territory, in part due to some very public social media | operations. I can imagine your kids' kids in 2040 finding | out they generated AI images in the 2020s and looking at | them with the same embarrassment you'd see if they dug | out your high-school emo fursona. | | There might well be some more "closed-loop" AI | applications that make sense. But are they going to be | running on every desktop in the world? Or are they going | to be mostly used in datacentres and purpose-built | embedded devices? | | I also wonder how well some of the models and techniques | scale down. I know Microsoft pushed a minimum spec to | promote a machine as Copilot-ready, but that seems like | it's going to be "Vista Basic Ready" redux as people try | to run tools designed for datacentres full of Quadro | cards, or at least high-end GPUs, on their $299 HP | laptop. | jjmarr wrote: | Cringe emo girls are trendy now because the nostalgia | cycle is hitting the early 2000s. Your kid would be | impressed if you told them you were a goth gf. It's not | hard to imagine the same will happen with primitive AIs | in the 40s. | defrost wrote: | Early 2000's ?? | | Bela Lugosi Died in 1979, and Peter Murphy was onto his | next band by 1984. 
| | By 2000 Goth was fully a distant dot in the rear view | mirror for the OG's In 2002, Murphy | released *Dust* with Turkish-Canadian composer and | producer Mercan Dede, which utilizes traditional Turkish | instrumentation and songwriting, abandoning Murphy's | previous pop and rock incarnations, and juxtaposing | elements from progressive rock, trance, classical music, | and Middle Eastern music, coupled with Dede's trademark | atmospheric electronics. | | https://www.youtube.com/watch?v=Yy9h2q_dr9k | | https://en.wikipedia.org/wiki/Bauhaus_(band) | djur wrote: | I'm not sure what "gothic music existed in the 1980s" is | meant to indicate as a response to "goths existed in the | early 2000s as a cultural archetype". | defrost wrote: | That Goths in 2000's were at best second wave nostalgia | cycle of Goths from the 1980s. | | That people recalling Goths in that period should beware | of thinking that was a source and not an echo. | | In 2006 Noel Fielding's Richmond Felicity Avenal was a | basement dwelling leftover from many years past. | im3w1l wrote: | Until AI chips become abundant, and we are not there yet, | cloud AI just makes too much sense. Using a chip | constantly vs using it 0.1% of the time is just so many | orders of magnitude better. | | Local inference does have privacy benefits. I think at | the moment it might make sense to send most of queries to | a beefy cloud model, and send sensitive queries to a | smaller local one. | Dalewyn wrote: | There are no _nerves_ in a _neural_ processing unit, so yes: | It 's 300% bullshit marketing. | jcgrillo wrote: | Maybe the N secretly stands for NFT.. Like the tesla self | driving hardware only smaller and made of silicon. | brookst wrote: | _Neural_ is an adjective. Adjectives do not require their | associated nouns to be present. See also: digital computers | have mo fingers at all. | -mlv wrote: | I always thought 'digital' referred to numbers, not | fingers. | bdd8f1df777b wrote: | The derivative meaning has been use so widely that it has | surpassed its original one in usage. But it doesn't | change the fact that it originally refers to the fingers. | Spooky23 wrote: | Microsoft needs to throw something in the gap to slow down | MacBook attrition. | | The M processors changed the game. My teams support 250k | users. I went from 50 MacBooks in 2020 to over 10,000 today. | I added zero staff - we manage them like iPhones. | cj wrote: | Rightly so. | | The M processor really did completely eliminate all sense | of "lag" for basic computing (web browsing, restarting your | computer, etc). Everything happens nearly instantly, even | on the first generation M1 processor. The experience of | "waiting for something to load" went away. | | Not to mention these machines easily last 5-10 years. | nxobject wrote: | As a very happy M1 Max user (should've shelled out for | 64GB of RAM, though, for local LLMs!), I don't look | forward to seeing how the Google Workspace/Notions/etc. | of the world somehow reintroduce lag back in. | bugbuddy wrote: | The problem for Intel and AMD is they are stuck with an | OS that ships with a lag-inducing Anti-malware suite. I | just did a simple git log and it took 2000% longer than | usual because the Antivirus was triggered to scan and run | a simulation on each machine instruction and byte of data | accessed. The commit log window stayed blank waiting to | load long enough for me to complete another tiny project. | It always ruin my day. | zdw wrote: | This is most likely due to corporate malware. 
| | Even modern macs can be brought to their knees by | something that rhymes with FrowdStrike Calcon and | interrupts all IO. | alisonatwork wrote: | Pro tip: turn off malware scanning in your git repos[0]. | There is also the new Dev Drive feature in Windows 11 | that makes it even easier for developers (and IT admins) | to set this kind of thing up via policies[1]. | | In companies where I worked where the IT team rolled out | "security" software to the Mac-based developers, their | computers were not noticeably faster than Windows PCs at | all, especially given the majority of containers are | still linux/amd64, reflecting the actual deployment | environment. Meanwhile Windows also runs on ARM anyway, | so it's not really something useful to generalize about. | | [0] https://support.microsoft.com/en-us/topic/how-to-add- | a-file-... | | [1] https://learn.microsoft.com/en-us/windows/dev-drive/ | bugbuddy wrote: | Unfortunately, the IT department people think they are | literal GODs for knowing how to configure Domain Policies | and lock down everything. They even refuse to help or | even answer requests for help when there are false | positives on our own software builds that we cannot | unmark as false positives. These people are proactively | antagonistic to productivity. Management could not | careless... | djur wrote: | Oh, just work for a company that uses Crowdstrike or | similar. You'll get back all the lag you want. | n8cpdx wrote: | Chrome managed it. Not sure how since Edge still works | reasonably well and Safari is instant to start (even | faster than system settings, which is really an | indictment of SwiftUI). | ddingus wrote: | I have a first gen M1 and it holds up very nicely even | today. I/O is crazy fast and high compute loads get done | efficiently. | | One can bury the machine and lose very little basic | interactivity. That part users really like. | | Frankly the only downside of the MacBook Air is the tiny | storage. The 8GB RAM is actually enough most of the time. | But general system storage with only 1/4 TB is cramped | consistently. | | Been thinking about sending the machine out to one of | those upgrade shops... | lynguist wrote: | Why did you buy a 256GB device for personal use in the | first place? Too good of a deal? Or saving these $400 for | upgrades for something else? | bzzzt wrote: | Depends on the application as well. Just try to start up | Microsoft Teams. | pjmlp wrote: | Microsoft has indeed a problem, however only in countries | whose people can afford Apple level prices, and not | everyone is a G7 citizen. | conradev wrote: | The real consumers of the NPUs are the operating systems | themselves. Google's TPU and Apple's ANE are used to power OS | features like Apple's Face ID and Google's image | enhancements. | | We're seeing these things in traditional PCs now because | Microsoft has demanded it so that Microsoft can use it in | Windows 11. | | Any use by third party software is a lower priority | pclmulqdq wrote: | The correct way to make a true "NPU" is to 10x your memory | bandwidth and feed a regular old multicore CPU with | SIMD/vector instructions (and maybe a matrix multiply unit). | | Most of these small NPUs are actually made for CNNs and other | models where "stream data through weights" applies. They have | a huge speedup there. When you stream weights across data | (any LLM or other large model), you are almost certain to be | bound by memory bandwidth. 
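A minimal sketch of the "stream weights across data" bandwidth argument made above. The model size and memory bandwidth below are assumed, illustrative numbers rather than measurements of any particular chip:

    # For matrix-vector work (e.g. generating one token at a time), every weight
    # byte is read roughly once per token, so memory bandwidth sets the ceiling.
    weights_gb   = 3.0        # e.g. a ~3B-parameter model quantized to int8
    bandwidth_gb = 60.0       # assumed LPDDR bandwidth in GB/s

    print("token rate ceiling: ~%.0f tokens/s" % (bandwidth_gb / weights_gb))

    # At ~2 ops per weight byte, feeding 60 GB/s needs only ~0.12 TOP/s of
    # compute -- a tiny fraction of a "45 TOPS" MAC array.
    print("compute actually needed: %.2f TOP/s" % (2 * bandwidth_gb * 1e9 / 1e12))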
| kmeisthax wrote: | > I think some hardware vendors just release the compute units | without shipping proper support yet | | This is Nvidia's moat. Everything has optimized kernels for | CUDA, and _maybe_ Apple Accelerate (which is the only way to | touch the CPU matrix unit before M4, and the NPU at all). If | you want to use anything else, either prepare to upstream | patches in your ML framework of choice or prepare to write your | own training and inference code. | godelski wrote: | They definitely aren't doing the timing properly, but also what | you might think is timing is not what is generally marketed. | But I will say, those marketed versions are often easier to | compare. One such example is that if you're using GPU then have | you actually considered that there's an asynchronous operation | as part of your timing? | | If you're naively doing `time.time()` then what happens is this | start = time.time() # cpu records time pred = | model(input.cuda()).cuda() # push data and model (if not | already there) to GPU memory and start computation. This is | asynchronous end = time.time() # cpu records time, | regardless of if pred stores data | | You probably aren't expecting that if you don't know systems | and hardware. But python (and really any language) is designed | to be smart and compile into more optimized things than what | you actually wrote. There's no lock, and so we're not going to | block operations for cpu tasks. You might ask why do this? Well | no one knows what you actually want to do. And do you want the | timer library now checking for accelerators (i.e. GPU) every | time it records a time? That's going to mess up your timer! (at | best you'd have to do a constructor to say "enable locking for | this accelerator") So you gotta do something a bit more | nuanced. | | If you want to actually time GPU tasks, you should look at cuda | event timers (in pytorch this is | `torch.cuda.Event(enable_timing=True)`. I have another comment | with boilerplate) | | Edit: | | There's also complicated issues like memory size and shape. | They definitely are not being nice to the NPU here on either of | those. They (and GPUs!!!) want channels last. They did | [1,6,1500,1500] but you'd want [1,1500,1500,6]. There's also | the issue of how memory is allocated (and they noted IO being | an issue). 1500 is a weird number (as is 6) so they aren't | doing any favors to the NPU, and I wouldn't be surprised that | this is a surprisingly big hit considering how new these things | are | | And here's my longer comment with more details: | https://news.ycombinator.com/item?id=41864828 | artemisart wrote: | Important precision: the async part is absolutely not python | specific, but comes from CUDA, indeed for performance, and | you will have to use cuda events too in C++ to properly time | it. | | For ONNX the runtimes I know of are synchronous as we don't | do each operation individually but whole models at once, | there is no need for async, the timings should be correct. | godelski wrote: | Yes, it isn't python, it is... hardware. Not even CUDA | specific. It is about memory moving around and optimization | (remember, even the CPUs do speculative execution). I say a | little more in the larger comment. | | I'm less concerned about the CPU baseline and more | concerned about the NPU timing. Especially given the other | issues | theresistor wrote: | > Also, people often mistake the reason for an NPU is "speed". | That's not correct. 
The whole point of the NPU is rather to | focus on low power consumption. | | It's also often about offload. Depending on the use case, the | CPU and GPU may be busy with other tasks, so the NPU is free | bandwidth that can be used without stealing from the others. | Consider AI-powered photo filters: the GPU is probably busy | rendering the preview, and the CPU is busy drawing UI and | handling user inputs. | cakoose wrote: | Offload only makes sense if there are other advantages, e.g. | speed, power. | | Without those, wouldn't it be better to use the NPUs silicon | budget on more CPU? | heavyset_go wrote: | More CPU means siphoning off more of the power budget on | mobile devices. The theoretical value of NPUs is power | efficiency on a limited budget. | theresistor wrote: | If you know that you need to offload matmuls, then building | matmul hardware is more area efficient than adding an | entire extra CPU. Various intermediate points exist along | that spectrum, e.g. Cell's SPUs. | avianlyric wrote: | Not really. To get extra CPU performance that likely means | more cores, or some other general compute silicon. That | stuff tends to be quite big, simply because it's so | flexible. | | NPUs focus on one specific type of computation, matrix | multiplication, and usually with low precision integers, | because that's all a neural net needs. That vast reduction | in flexibility means you can take lots of shortcuts in your | design, allowing you cram more compute into a smaller | footprint. | | If you look at the M1 chip[1], you can see the entire | 16-Neural engine has a foot print about the size of 4 | performance cores (excluding their caches). It's not | perfect comparison, without numbers on what the performance | core can achieve in terms of ops/second vs the Neural | Engine. But it seems reasonable to be that the Neural | Engine and handily outperform the performance core complex | when doing matmul operations. | | [1] https://www.anandtech.com/show/16226/apple- | silicon-m1-a14-de... | spookie wrote: | I've been building an app in pure C using onnxruntime, and it | outperforms a comparable one done with python by a substancial | amount. There are many other gains to be made. | | (In the end python just calls C, but it's pretty interesting | how much performance is lost) | jamesy0ung wrote: | What exactly does Windows do with a NPU? I don't own an 'AI PC' | but it seems like the NPUs are slow and can't run much. | | I know Apple's Neural Engine is used to power Face ID and the | facial recognition stuff in Photos, among other things. | DrillShopper wrote: | It supports Microsoft's Recall (now required) spyware | Janicc wrote: | Please remind me again how Recall sends data to Microsoft. I | must've missed that part. Or are you against the print screen | button too? I heard that takes images too. Very scary. | cmeacham98 wrote: | While calling it spyware like GP is over-exaggeration to a | ridiculous level, comparing Recall to Print Screen is also | inaccurate: | | Print Screen takes images on demand, Recall does so | effectively at random. This means Recall could | inadvertently screenshot and store information you didn't | intend to keep a record of (To give an extreme example: | Imagine an abuser uses Recall to discover their spouse | browsing online domestic violence resources). | Terr_ wrote: | > Please remind me again how Recall sends data to | Microsoft. I must've missed that part. 
| | Sure, just post the source code and I'll point out where it | does so, I somehow misplaced my copy. /s | | The core problem here is trust, and over the last several | years Microsoft has burned a hell of a lot of theirs with | power-users of Windows. Even their most strident public | promises of Recall being "opt-in" and "on-device only" will | --paradoxically--only be _kept_ as long as enough people | remain suspicious. | | Glance away and MS go back to their old games, pushing a | mandatory "security update" which reset or entirely-removes | your privacy settings and adding new "telemetry" streams | which you cannot inspect. | bloated5048 wrote: | It's always safe to assume it does if it's closed source. I | rather be suspicious of big corporations seeking to profit | at every step than naive. | | Also, it's security risk which already been exploited. | Sure, MS fixed it, but can you be certain it won't be | exploited some time in the future again? | dagaci wrote: | Its used for improving video calls, special effects, image | editing/ effects and noise cancelling, teams stuff | downrightmike wrote: | AI PC is just a marketing term, doesn't have any real substance | acdha wrote: | Yea, we know that. I believe that's why the person you're | replying too was asking for examples of real usage. | eightysixfour wrote: | I thought the purpose of these things was not to be fast, but to | be able to run small models with very little power usage? I have | a newer AMD laptop with an NPU, and my power usage doesn't change | using the video effects that supposedly run on it, but goes up | when using the nvidia studio effects. | | It seems like the NPUs are for very optimized models that do | small tasks, like eye contact, background blur, autocorrect | models, transcription, and OCR. In particular, on Windows, I | assumed they were running the full screen OCR (and maybe | embeddings for search) for the rewind feature. | conradev wrote: | That is my understanding as well: low power and low latency. | | You can see this in action when evaluating a CoreML model on a | macOS machine. The ANE takes half as long as the GPU which | takes half as long as the CPU (actual factors being model | dependent) | nickpsecurity wrote: | To take half as long, doesn't it have to perform twice as | fast? Or am I misreading your comment? | eightysixfour wrote: | No, you can have latency that is independent of compute | performance. The CPU/GPU may have other tasks and the work | has to wait for the existing threads to finish, or for them | to clock up, or have slower memory paths, etc. | | If you and I have the same calculator but I'm working on a | set of problems and you're not, and we're both asked to do | some math, it may take me longer to return it, even though | the instantaneous performance of the math is the same. | refulgentis wrote: | In isolation, makes sense. | | Wouldn't it be odd for OP to present examples that are | the _opposite_ of their claim, just to get us thinking | about "well the CPU is busy?" | | Curious for their input. | conradev wrote: | The GPU is stateful and requires loading shaders and | initializing pipelines before doing any work. That is where | its latency comes from. It is also extremely power hungry. | | The CPU is zero latency to get started, but takes longer | because it isn't specialized at any one task and isn't | massively parallel, so that is why the CPU takes even | longer. 
| | The NPU often has a simpler bytecode to do more complex | things like matrix multiplication implemented in hardware, | rather than having to instantiate a generic compute kernel | on the GPU. | boomskats wrote: | That's especially true because yours is a Xilinx FPGA. The one | that they just attached to the latest gen mobile ryzens is 5x | more capable too. | | AMD are doing some fantastic work at the moment, they just | don't seem to be shouting about it. This one is particularly | interesting | https://lore.kernel.org/lkml/DM6PR12MB3993D5ECA50B27682AEBE1... | | edit: not an FPGA. TIL. :'( | errantspark wrote: | Wait sorry back up a bit here. I can buy a laptop that has a | daughter FPGA in it? Does it have GPIO??? Are we seriously | building hardware worth buying again in 2024? Do you have a | link? | eightysixfour wrote: | It isn't as fun as you think - they are setup for specific | use cases and quite small. Here's a link to the software | page: https://ryzenai.docs.amd.com/en/latest/index.html | | The teeny-tiny "NPU," which is actually an FPGA, is 10 | TOPS. | | Edit: I've been corrected, not an FPGA, just an IP block | from Xilinx. | wtallis wrote: | It's not a FPGA. It's an NPU IP block from the Xilinx | side of the company. It was presumably originally | developed to be run on a Xilinx FPGA, but that doesn't | mean AMD did the stupid thing and actually fabbed a FPGA | fabric instead of properly synthesizing the design for | their laptop ASIC. Xilinx involvement does not | automatically mean it's an FPGA. | eightysixfour wrote: | Thanks for the correction, edited. | boomskats wrote: | Do you have any more reading on this? How come the XDNA | drivers depend on Xilinx' XRT runtime? | almostgotcaught wrote: | because XRT has a plugin architecture: XRT<-shim | plugin<-kernel driver. The shims register themselves with | XRT. The XDNA driver repo houses both the shim and the | kernel driver. | boomskats wrote: | Thanks, that makes sense. | wtallis wrote: | It would be surprising and strange if AMD _didn 't_ reuse | the software framework they've already built for doing AI | when that IP block is instantiated on an FPGA fabric | rather than hardened in an ASIC. | boomskats wrote: | Well, I'm irrationally disappointed, but thanks. | Appreciate the correction. | boomskats wrote: | Yes, the one on the ryzen 7000 chips like the 7840u isn't | massive, but that's the last gen model. The one they've | just released with the HX370 chip is estimated at 50 | TOPS, which is better than Qualcomm's ARM flagship that | this post is about. It's a fivefold improvement in a | single generation, it's pretty exciting. | | And it's an FPGA It's not an FPGA | almostgotcaught wrote: | > And it's an FPGA. | | nope it's not. | boomskats wrote: | I've just ordered myself a jump to conclusions mat. | almostgotcaught wrote: | Lol during grad school my advisor would frequently cut me | off and try to jump to a conclusion, while I was | explaining something technical often enough he was wrong. | So I did really buy him one (off eBay or something). He | wasn't pleased. | dekhn wrote: | If you want GPIOs, you don't need (or want) an FPGA. | | I don't know the details of your use case, but I work with | low level hardware driven by GPIOs and after a bit of | investigation, concluded that having direect GPIO access in | a modern PC was not necessary or desirable compared to the | alternatives. 
| errantspark wrote: | I get a lot of use out of the PRUs on the | BeagleboneBlack, I would absolutely get use out of an | FPGA in a laptop. | dekhn wrote: | It makes more sense to me to just use the BeagleboneBlack | in concert with the FPGA. Unless you have highly specific | compute or data movement needs that can't be satisfied | over a USB serial link. If you have those needs, and you | need a laptop, I guess an FPGA makes sense but that's a | teeny market. | beeflet wrote: | It would be cool if most PCs had a general purpose FPGA that | could be repurposed by the operating system. For example you | could use it as a security processor like a TPM or as a | bootrom, or you could repurpose it for DSP or something. | | It just seems like this would be better in terms of | firmware/security/bootloading because you would be more able | to fix it if an exploit gets discovered, and it would be | leaner because different operating systems can implement | their own stuff (for example linux might not want pluton in- | chip security, windows might not want coreboot or linux-based | boot, bare metal applications can have much simpler boot). | walterbell wrote: | Xilinx Artix 7-series PicoEVB fits in M.2 wifi slot and has | an OSS toolchain, http://www.enjoy-digital.fr/ | pclmulqdq wrote: | It's not an FPGA. It's a VLIW DSP that Xilinx built to go | into an FPGA-SoC to help run ML models. | almostgotcaught wrote: | this is the correct answer. one of the compilers for this | DSP is https://github.com/Xilinx/llvm-aie. | numpad0 wrote: | Sorry for an OT comment but what is going on with that ascii | art!? The content fits within 80 columns just fine[1], is it | GPT generated? | | 1: https://pastebin.com/raw/R9BrqETR | davemp wrote: | Unfortunately FPGA fabric is ~2x less power efficient than | equivalent ASIC logic at the same clock speeds last time I | checked. So implementing general purpose logic on an FPGA is | not usually the right option even if you don't care about | FMAX or transistor counts. | refulgentis wrote: | You're absolutely right IMO, given what I heard when launching | on-device speech recognition on Pixel, and after leaving | Google, what I see from ex. Apple Neural Engine vs. CPU when | running ONNX stuff. | | I'm a bit suspicious of the article's specific conclusion, | because it is Qualcomm's ONNX, and it be out of date. Also, | Android loved talking shit about Qualcomm software engineering. | | That being said, its directionally correct, insomuch as | consumer hardware AI acceleration claims are near-universally | BS unless you're A) writing 1P software B) someone in the 1P | really wants you to take advantage. | kristianp wrote: | 1P? | refulgentis wrote: | First party, i.e. Google/Apple/Microsoft | godelski wrote: | > but to be able to run small models with very little power | usage | | yes | | But first, I should also say you probably don't want to be | programming these things with python. I doubt you'll get good | performance there, especially as the newness means | optimizations haven't been ported well (even using things like | TensorRT is not going to be as fast as writing it from scratch, | and Nvidia is throwing a lot of man power at that -- for good | reason! But it sure as hell will get close and save you a lot | of time writing). | | They are, like you say, generally optimized for doing repeated | similar tasks. That's also where I suspect some of the info | gathered here is inaccurate. 
(I have not used | these NPU chips so what follows is more educated guesses, but | I'll explain. Please correct me if I've made an error) | | Second, I don't trust the timing here. I'm certain the CUDA | timing (at the end) is incorrect, as the code written wouldn't | properly time. Timing is surprisingly not easy. I suspect the | advertised operations are only counting operations directly on | the NPU while OP would have included CPU operations in their | NPU and GPU timings[0]. But the docs have benchmarking tools, | so I suspect they're doing something similar. I'd be interested | to know the variance and how this holds after doing warmups. | _They do identify the IO as an issue, and so I think this is | evidence of this being an issue._ | | Third, their data is improperly formatted. | MATRIX_COUNT, MATRIX_A, MATRIX_B, MATRIX_K = (6, 1500, 1500, | 256) INPUT0_SHAPE = [1, MATRIX_COUNT, MATRIX_A, MATRIX_K] | INPUT1_SHAPE = [1, MATRIX_COUNT, MATRIX_K, MATRIX_B] | OUTPUT_SHAPE = [1, MATRIX_COUNT, MATRIX_A, MATRIX_B] | | You want "channels last" here. I suspected this (do this in | pytorch too!) and the docs they link confirm. | | 1500 is also an odd choice and this could be cause for extra | misses. I wonder how things would change with 1536, 2048, or | even 256. Might (probably) even want to look smaller, since | this might be a common preprocessing step. Your models are not | processing full res images and if you're going to optimize | architecture for models, you're going to use that shape | information. Shape optimization is actually pretty important in | ML[1]. I suspect this will be quite a large miss. | | Fourth, a quick look at the docs and I think the setup is | improper. Under "Model Workflow" they mention that they want | data in 8 or 16 bit * _float*_. I 'm not going to look too | deep, but note that there are different types of floats (e.g. | pytorch's bfloat is not the same as torch.half or | torch.float16). Mixed precision is still a confusing subject | and if you're hitting issues like these it is worth looking at. | I very much suggest not just running a standard quantization | procedure and calling it a day (start there! But don't end | there unless it's "good enough", which doesn't seem too | meaningful here.) | | FWIW, I still do think these results are useful, but I think | they need to be improved upon. This type of stuff is | surprisingly complex, but a large amount of that is due to | things being new and much of the details still being worked | out. Remember that when you're comparing to things like CPU or | GPU (especially CUDA) that these have had hundreds of thousands | of man hours put into then and at least tens of thousands into | high level language libraries (i.e. python) to handle these. I | don't think these devices are ready for the average user where | you can just work with them from your favorite language's | abstraction level, but they're pretty useful if you're willing | to work close to the metal. | | [0] I don't know what the timing is for this, but I do this in | pytorch a lot so here's the boilerplate times | = torch.empty(rounds) # Don't need use dummy data, but | here input_data = torch.randn((batch_size, | *data_shape), device="cuda") # Do some warmups first. 
| There's background actions dealing with IO we don't want to | measure # You can remove that line and do a dist of | times if you want to see this # Make sure you generate | data and save to a variable (write) or else this won't do | anything for _ in range(warmup): data = | model(input_data) for i in range(rounds): | starter = torch.cuda.Event(enable_timing=True) | ender = torch.cuda.Event(enable_timing=True) | starter.record() data = model(input_data) | ender.record() torch.cuda.synchronize() | times[i] = starter.elapsed_time(ender)/1000 total_time | = times.sum() | | The reason we do it this way is if we just wrap the model | output with a timer then we're looking at CPU time but the GPU | operations are asynchronous so you could get deceptively fast | (or slow) times | | [1] https://www.thonking.ai/p/what-shapes-do-matrix- | multiplicati... | wmf wrote: | This headline is seriously misleading because the author did not | test AMD or Intel NPUs. If Qualcomm is slow don't say all AI PCs | are not good. | protastus wrote: | Deploying a model on an NPU requires significant profile based | optimization. Picking up a model that works fine on the CPU but | hasn't been optimized for an NPU usually leads to disappointing | results. | catgary wrote: | Yeah whenever I've spoken to people who work on stuff like IREE | or OpenXLA they gave me the impression that understanding how | to use those compilers/runtimes is an entire job. | CAP_NET_ADMIN wrote: | Beauty of CPUs - they'll chew through whatever bs code you | throw at them at a reasonable speed. | lostmsu wrote: | The author's benchmark sucks if he could only get 2 tops from a | laptop 4080. The thing should be doing somewhere around 80 tops. | | Given that you should take his NPU results with a truckload of | salt. | hkgjjgjfjfjfjf wrote: | Sutherland's wheel of reincarnation turns. | downrightmike wrote: | They should have just made a pci card and not tried to push whole | new machines on us. We are all good with the machines we already | have. If you want to sell a new feature, then it needs to be an | add-on | Mistletoe wrote: | >The second conclusion is that the measured performance of 573 | billion operations per second is only 1.3% of the 45 trillion | ops/s that the marketing material promises. | | It just gets so hard to take this industry seriously. | m00x wrote: | NPUs are efficient, not especially fast. The CPU is much bigger | than the NPU and has better cache access. Of course it'll perform | better. | acdha wrote: | It's more complicated than that (you're assuming that the | bigger CPU is optimized for the same workload) but it's also | irrelevant to the topic at hand: they're seeing this NPU within | a factor of 2-4 of the CPU, but if it performed half as well as | Qualcomm claims it would be an order of magnitude faster. The | story here isn't another round of the specialized versus | general debate but that they fell so far short of their | marketing claims. | Havoc wrote: | >We see 1.3% of Qualcomm's NPU 45 Teraops/s claim | | To me that suggests that the test is wrong. | | I could see intel massaging results, but that far off seems | incredibly improbable | p1necone wrote: | I might be overly cynical but I just assumed that the entire | purpose of "AI PCs" was marketing - of course they don't actually | achieve much. Any real hardware that's supposedly for the "AI" | features will actually be just special purpose hardware for | literally anything the sales department can lump under that | category. 
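For reference, a self-contained version of the torch.cuda.Event timing boilerplate quoted in the thread above (a sketch; the model and shapes in the usage comment are placeholders):

    import torch

    def time_cuda(model, input_data, warmup=10, rounds=50):
        times = torch.empty(rounds)
        # Warm-up runs absorb one-off costs (allocations, lazy init, IO).
        for _ in range(warmup):
            _ = model(input_data)
        for i in range(rounds):
            start = torch.cuda.Event(enable_timing=True)
            end = torch.cuda.Event(enable_timing=True)
            start.record()
            _ = model(input_data)
            end.record()
            # Wait for the GPU to finish before reading the timers; wrapping the
            # call with time.time() alone would mostly measure the async launch.
            torch.cuda.synchronize()
            times[i] = start.elapsed_time(end) / 1000.0   # milliseconds -> seconds
        return times

    # Usage (placeholder model and shape):
    #   model = torch.nn.Linear(1024, 1024).cuda()
    #   x = torch.randn(64, 1024, device="cuda")
    #   print(time_cuda(model, x).sum())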
| teilo wrote: | Actual article title: Benchmarking Qualcomm's NPU on the | Microsoft Surface Tablet | | Because this isn't about NPUs. It's about a specific NPU, on a | specific benchmark, with a specific set of libraries and | frameworks. So basically, this proves nothing. | iml7 wrote: | But you can't get more clicks. You have to attack enough people | to get clicks.I feel like this place is becoming more and more | filled with posts and titles like this. | gerdesj wrote: | Internet points are a bit crap but HN generally discusses | things properly and off topic and downright weird stuff | generally gets downvoted to doom. | gnabgib wrote: | The title is from the original article | (https://petewarden.com/2024/10/16/ai-pcs-arent-very-good- | at-...), the URL was changed by dang: | https://news.ycombinator.com/item?id=41863591 | NoPicklez wrote: | Fairly misleading title, boiling down AI PCs to just the | Microsoft Surface running Qualcomm | cjbgkagh wrote: | > We've tried to avoid that by making both the input matrices | more square, so that tiling and reuse should be possible. | | While it might be possible it would not surprise me if a number | of possible optimizations had not made it into Onnx. It appears | that Qualcomm does not give direct access to the NPU and users | are expected to use frameworks to convert models over to it, and | in my experience conversion tools generally suck and leave a lot | of optimizations on the table. It could be less of NPUs suck and | more of the conversions tools suck. I'll wait until I get direct | access - I don't trust conversion tools. | | My view of NPUs is that they're great for tiny ML models and very | fast function approximations which is my intended use case. While | LLMs are the new hotness there are huge number of specialized | tasks that small models are really useful for. | jaygreco wrote: | I came here to say this. I haven't worked with the Elite X but | the past gen stuff I've used (865 mostly) the accelerators - | compute DSP and much smaller NPU - required _very_ specific | setup, compilation with a bespoke toolchain, and communication | via RPC to name a few. | | I would hope the NPU on Elite X is easier to get to considering | the whole copilot+ thing, but I bring this up mainly to make | the point that I doubt it's just as easy as "run general | purpose model, expect it to magically teleport onto the NPU". | stanleykm wrote: | the ARM SME could be an interesting alternative to NPUs in the | future. Unlike the NPUs which have at best some fixed function | API it will be possible to program the SMEs more directly | piskov wrote: | Snapdragon touts 45 TOPS but it's only int8. | | For example Apple's m3 neural engine is mere 18 TOPS but it's | FP16. | | So windows has bigger number but it's not apple to apple | comparison. | | Did author test int8 performance? | freehorse wrote: | I always thought that the main point of NPUs is energy efficiency | (and being able to run ML models without taking over all computer | resources, making it practical to integrate ML applications in | the OS itself in ways that it does not disturb the user or the | workflow) rather than being exceptionally faster. At least this | has been my experience with running stable diffusion on macs. | Similar to using other specialised hardware like media encoders; | they are not necessarily faster than a CPU if you throw a dozen+ | cpu cores on the task, but it will draw a minuscule part of the | power. 
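A sketch of the conversion-and-runtime path several commenters above describe for reaching Qualcomm's NPU: export the model to ONNX, then run it through onnxruntime's QNN execution provider with a CPU fallback. The provider name and backend_path option are taken from onnxruntime's QNN documentation as best recalled, and the model path is a placeholder, so treat the details as assumptions:

    import onnxruntime as ort

    session = ort.InferenceSession(
        "model.onnx",                               # placeholder ONNX model
        providers=[
            # HTP backend = the Hexagon NPU; unsupported ops fall back to CPU
            ("QNNExecutionProvider", {"backend_path": "QnnHtp.dll"}),
            "CPUExecutionProvider",
        ],
    )
    print(session.get_providers())                  # which providers actually loaded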
| guelermus wrote: | One should also pay attention to power efficiency; a direct | comparison could be misleading here. | _davide_ wrote: | The RTX 4080 should be capable of ~40 TFLOPS, yet they only | report 2,160 billion operations per second. Shouldn't this be | enough to reconsider the benchmark? They probably made some | serious error in measuring FLOPS. That the CPU | beats the NPU is possible, but they should benchmark many matrix | multiplications without any application synchronization in order | to have a decent comparison. | Grimblewald wrote: | That isn't the half of it. A quick skim of the documentation | shows that the CPU inference wasn't done in a comparable way | either. | ein0p wrote: | A memory-bound workload is memory bound. Doesn't matter how many | TOPS you have if you're sitting idle waiting on DRAM during | generation. You will, however, notice a difference in prefill for | long prompts. ___________________________________________________________________ (page generated 2024-10-17 06:00 UTC)