Running AI models is turning into a memory game
What Happened
When we talk about the cost of AI infrastructure, the focus is usually on NVIDIA and GPUs -- but memory is an increasingly important part of the picture.
Our Take
Here's what nobody talks about: HBM (high-bandwidth memory) is the actual bottleneck now, not H100 compute. Serving Llama 3.1 at scale means your bill is driven more by memory capacity and bandwidth than by raw FLOPS, and that supply chain runs through just three makers (SK Hynix, Samsung, Micron), with NVIDIA having locked up most of their output. NVIDIA's laughing all the way to earnings.
This is infrastructure getting weirdly expensive in weird places. Cloud providers can't optimize it away; memory is physical, not algorithmic. So anyone running production LLMs is about to get a rude awakening on TCO. Open weights help (you own the hardware and its memory); closed APIs don't (cloud markups on HBM will kill you).
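A back-of-envelope sketch of why memory dominates. The shape figures below (layer count, GQA heads, head dimension, context length) and FP16 precision are commonly cited numbers for Llama 3.1 405B, but treat them as illustrative assumptions rather than vendor specs:

```python
# Rough serving-memory math for a Llama 3.1 405B-class model.
# All constants are assumptions for illustration only.

BYTES_PER_PARAM = 2        # FP16 weights
PARAMS = 405e9             # 405B parameters
LAYERS = 126               # transformer layers
KV_HEADS = 8               # grouped-query attention KV heads
HEAD_DIM = 128             # per-head dimension
CONTEXT = 128_000          # max context length in tokens
HBM_PER_GPU_GB = 80        # H100 SXM HBM capacity

# Weights: params * bytes per param
weights_gb = PARAMS * BYTES_PER_PARAM / 1e9

# KV cache per request: 2 (K and V) * layers * kv_heads * head_dim * tokens * bytes
kv_cache_gb = 2 * LAYERS * KV_HEADS * HEAD_DIM * CONTEXT * BYTES_PER_PARAM / 1e9

gpus_for_weights = weights_gb / HBM_PER_GPU_GB

print(f"weights:              ~{weights_gb:,.0f} GB")
print(f"KV cache per request: ~{kv_cache_gb:,.0f} GB at full context")
print(f"GPUs just to hold weights: ~{gpus_for_weights:.0f} H100s, before any batching")
```

Under those assumptions, the weights alone want roughly ten H100s' worth of HBM, and every long-context request stacks tens of GB of KV cache on top of that. The compute is along for the ride; the memory is the thing you're actually provisioning.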
What To Do
Ask your cloud vendor exactly how much they're charging per GB of model memory; that number will shock you.
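If you want a number in hand before that call, here's a quick sketch. The node price below is an assumption; plug in your actual on-demand rate:

```python
# Hypothetical numbers to sanity-check what you're really paying for memory:
# an 8x H100 node (assumed ~$25/hr on-demand) carries 8 x 80 GB = 640 GB of HBM.
NODE_PRICE_PER_HOUR = 25.0   # assumed rate; substitute your own bill
HBM_GB_PER_NODE = 8 * 80     # 8x H100 80 GB

dollars_per_gb_hour = NODE_PRICE_PER_HOUR / HBM_GB_PER_NODE
dollars_per_gb_month = dollars_per_gb_hour * 24 * 30

print(f"~${dollars_per_gb_hour:.3f} per GB of HBM per hour")
print(f"~${dollars_per_gb_month:.0f} per GB of HBM per month")
# Roughly $0.04/GB-hour and $28/GB-month at these assumed rates --
# orders of magnitude above commodity DRAM pricing.
```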
What Skeptics Say
Memory bandwidth as a bottleneck is not a new insight; HBM constraints have been publicly discussed since the GPT-3 scaling papers. Framing this as an emerging realization undersells how much capital is already flowing into SK Hynix, Micron, and CXL interconnects to address it. The editorial may be late to a trend already priced in.