It's not the compute, it's the memory bandwidth

If you've worked on computation in a latency-sensitive or high-performance environment, you know about CPU stalls, cache misses, memory locality, and JIT compilation, so it should come as no surprise that GPUs face similar constraints. Whether rendering intense games like Cyberpunk 2077 or running local inference with Ollama, the raw compute power of the underlying device is underutilized. More specifically, the processing units idle while memory is fetched from nearby DRAM into SRAM inside or physically near the processing unit. The bottleneck is how much data can be loaded, sent, and received over parallel lanes consistently, without data corruption and retransmission. While the hardware is actively bandwidth constrained, it is technically fully utilized: it simply cannot commit to more work because it is I/O bound rather than compute bound.
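One way to see the I/O-bound condition is the roofline model: compare a kernel's arithmetic intensity (FLOPs per byte moved) against the ratio of a chip's peak compute to its peak bandwidth. A minimal sketch, with made-up hardware numbers rather than any specific chip's specs:

```rust
// Sketch: a roofline-style check for whether a workload is compute-bound
// or memory-bound. The hardware numbers are illustrative placeholders.
fn main() {
    let peak_flops: f64 = 50.0e12; // hypothetical 50 TFLOP/s of compute
    let peak_bw: f64 = 600.0e9;    // hypothetical 600 GB/s of memory bandwidth

    // Ridge point: FLOPs per byte needed to keep the ALUs busy
    // instead of waiting on memory.
    let ridge = peak_flops / peak_bw; // ~83 FLOPs per byte

    // LLM decoding is dominated by matrix-vector products: each fp16
    // weight (2 bytes) is loaded once and used for ~2 FLOPs (multiply
    // plus add), an arithmetic intensity of ~1 FLOP per byte.
    let decode_intensity: f64 = 2.0 / 2.0;

    if decode_intensity < ridge {
        println!(
            "memory-bound: {:.1} FLOPs/byte vs the {:.0} needed to saturate compute",
            decode_intensity, ridge
        );
    } else {
        println!("compute-bound");
    }
}
```

With an intensity of ~1 against a ridge of ~83, the ALUs spend most of their time waiting, which is exactly the underutilization described above.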

Apple can brag about the AI performance of its new M5 lineup because it focused on the issue holding back local inference: memory bandwidth.

The increase in unified memory bandwidth enables complex workflows like intensive AI model training and massive video projects. M5 Pro supports up to 64GB of unified memory with up to 307GB/s of memory bandwidth, while M5 Max supports up to 128GB of unified memory with up to 614GB/s of memory bandwidth.

Apple

Their ability to pull off 614GB/s of memory bandwidth on a mobile laptop device is incredible. Yet it still comes up short of the 1,792GB/s that a consumer desktop NVIDIA RTX 5090 GPU advertises, and the data-center-specialized H100 cards have around 3,000GB/s. Unlike with NVIDIA, though, consumers have a chance to attempt local inference of large models like GLM 4.7, which needs 400GB of memory, on Apple's hardware using RDMA, as demonstrated by Jeff Geerling.

A stack of Mac Studios in a 10-inch rack. They appear networked from the back side.

Image shared with permission from Jeff Geerling's 1.5 TB of VRAM on Mac Studio - RDMA over Thunderbolt 5

Even with 512GB of unified memory, distributed inference won't solve the memory wall problem. More energy and time is spent fetching and buffering weights into compute-accessible memory than actually doing the math to produce the next token. Likewise, dedicated silicon like Neural Processing Units doesn't make the memory problem go away; it aligns efficient compute with the memory bandwidth that is available.
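Because each generated token has to stream the active weights through the compute units at least once, bandwidth puts a hard ceiling on decode speed no matter how many cores sit idle. A back-of-envelope sketch using the M5 Max figure quoted above (the model size here is a hypothetical, not a real model's footprint):

```rust
// Upper bound on decode speed: every generated token must stream the
// active weights from memory at least once, so
//   tokens/s <= bandwidth / bytes_of_active_weights.
fn max_tokens_per_sec(bandwidth_gb_s: f64, model_gb: f64) -> f64 {
    bandwidth_gb_s / model_gb
}

fn main() {
    let m5_max_bw = 614.0;  // GB/s, from Apple's spec above
    let model_size = 100.0; // GB, a hypothetical dense model held in memory

    // Roughly 6 tokens/s, regardless of how fast the ALUs are.
    println!(
        "ceiling: ~{:.1} tokens/s",
        max_tokens_per_sec(m5_max_bw, model_size)
    );
}
```

Mixture-of-experts models raise this ceiling by touching only a fraction of the weights per token, which is one reason labs lean on sparsity when bandwidth is fixed.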

If you were hoping to get your hands on Apple's 512GB units, they're no longer available!

Apple has also raised the price for the 256GB RAM upgrade option. It used to cost $1,600 to go from 96GB to 256GB on the high-end M3 Ultra machine, but now it costs $2,000. 512GB was $4,000 when it was available.

Apple has likely removed the option to purchase 512GB of memory because of global DRAM shortages that have dried up supply and caused prices to soar, and it's also probably why shipping times for a configuration with 256GB RAM range into May.

MacRumors

There are different approaches, like Google's Tensor Processing Units, where memory flows toward compute differently and bandwidth climbs to 7,380GB/s in their TPU7x (Ironwood) preview: twelve times Apple's figure and more than twice the H100's. And finally, Groq (recently acqui-hired by NVIDIA) boasts 80,000GB/s of memory bandwidth through sheer parallelization in its LPU™ AI Inference Technology.

High Bandwidth Memory is at the center of the current memory crisis and should be the focus of chip and memory innovation for years to come. Until then, labs must figure out how to condense the capabilities offered by frontier models into a fraction of the memory they require today. We were seeing this exact focus in Qwen until their team quit.

The alternative is to hard-wire the weights into the chip during fabrication, which Taalas achieved with a quantized Llama3.1-8B model. Their reported rates (17,000+ tokens per second) are twenty-eight times faster than what Groq could achieve. This appears to be the ceiling of what can be computed on TSMC's 2019-generation process, according to the EE Times.

If Taalas had achieved this by streaming weights from memory, it would imply an effective memory bandwidth above 2 petabytes per second. That's not something humanity has achieved on a single die yet. Their reported numbers are eight hundred and fifty times faster than what I can get on my MacBook. My ability to do local inference isn't limited by the number of cores I have; it is limited by how fast the weights can be loaded into the Neural Processing Unit (NPU).

We've had comparable logical compute for the last five years. Memory bandwidth has not caught up to the speed at which our processors can compute.

The memory OpenAI and others are clamoring for will still be held back by memory bandwidth for years to come. Each new generation of high-bandwidth memory will make existing infrastructure (and its oceans of GPUs) obsolete, year after year through the next decade, as the energy cost per weight loaded trends downward. The GPUs in their inference data centers won't be replaced because the newer ones calculate faster. They'll be replaced because the new generation uses less energy to send bits faster to the transistors that do the math.

Footnotes

  1. Thunderbolt and PCI Express use differential signaling to improve signal integrity at high speeds. Graphics memory is moving towards PAMx signaling (or Pulse-Amplitude Modulation signaling) to break above 100Gbps.

Why write about all this stuff that I can't actually get my hands on?

I ran into it too, on a much, much smaller scale. It is one thing to be aware of a glass wall. It is another to slam face-first into it in confusion.

Early into the Fortress of Darkness, the Time Bandits run into a glass wall.

I'm developing a Rust application with Claude on literal e-waste. These Lenovo M700 units are cheaper than Raspberry Pis and actually come with a case.

eBay listing for a Lenovo M700 with an i3-6100T, 8GB of RAM, and no hard drive for $48

I found out that rendering an 8K texture rotated 90 degrees would max out my memory bandwidth and leave the rendering engine underutilized. I could not saturate the 3D capabilities of this weak, tiny, pitiful thin client because I was memory bound. I was leaving literal logical compute on the table on a machine that gets tossed into a dumpster after 10 years behind a dentist's reception desk.
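The arithmetic behind that wall is simple. A sketch with an assumed texture format and refresh rate (RGBA8 at 60fps; the bus figure in the comment is the i3-6100T's theoretical dual-channel DDR4-2133 maximum):

```rust
// Why one big texture can saturate a small iGPU's memory bus:
// streaming it once per frame already costs serious bandwidth.
// RGBA8 format and 60 fps are illustrative assumptions.
fn main() {
    let (w, h) = (7680u64, 4320u64); // 8K UHD
    let bytes_per_pixel: u64 = 4;    // RGBA, 8 bits per channel
    let fps: u64 = 60;

    let bytes_per_frame = w * h * bytes_per_pixel;       // ~132.7 MB
    let gb_per_s = (bytes_per_frame * fps) as f64 / 1e9; // ~8 GB/s

    println!("~{:.1} GB/s just to read the texture once per frame", gb_per_s);
    // Dual-channel DDR4-2133 tops out around 34 GB/s in theory, and that
    // bus is shared with the CPU. Real rendering touches pixels several
    // times (sampling, blending, framebuffer writes), so the bus fills
    // up long before the 3D engine does.
}
```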

intel_gpu_top showing IMC reads and writes, render/3D busy at 60%, video at 8%

And it got me thinking: what about memory-heavy loads, like LLMs? To my surprise, I found that my hobby project for a smart home screen had more in common with the current memory crisis than I could have guessed.

I've come to conclude that even well-resourced consumers will be unable to match the effective speed and quality of inference that frontier labs like OpenAI and Anthropic deliver. The chips that do offer memory bandwidth for inference will be bought out by Google, OpenAI, Anthropic, AWS, Microsoft Azure, and NVIDIA. Manufacturers like Apple will get the scraps (see "silicon binning") for their consumer grade customers. We're not as profitable as data centers backed by insatiable investment.

@_Freakyclown_@twitter.com
TIL: that intel i3,i5, i7 and i9 processors are all the same, all made the same, on the same wafer. But depending on how many defects (therefore how many sections actually work) they just rename them. So your i3 processor is an i9 with many defects less for an i5 and less for i7

@Sohcahtoa82@twitter.com
@_Freakyclown_ This is called "binning" and it's been done for literally *decades*. It's not sinister, it's just a way to get the most out of manufacturing when it's nearly impossible to create a CPU 100% perfectly every time.

Until I can get a hardware refresh, I guess I'll use Claude to make neat tools that work with the hardware I have access to, even if it is 10 years old or has a tiny amount of memory. While I can't get Raspberry Pis at a fair price for my projects, thankfully Luckfox and Espressif boards are accessible, as long as I can wait a few weeks for shipping from AliExpress.

I'm rather excited at how Luckfox boards have built-in NPUs by Radxa. These use the ONNX model exchange format, which Intel, IBM, Groq, NVIDIA, and many others are centralizing around. Maybe I'll write about my adventures in using a Linux single-board computer with only 64MB of RAM.