Running AI models is turning into a memory game
What Happened
When we talk about the cost of AI infrastructure, the focus is usually on NVIDIA and GPUs -- but memory is an increasingly important part of the picture.
Our Take
Here's what nobody talks about: HBM (high-bandwidth memory) is the actual bottleneck now, not H100 compute. Serving Llama 3.1 at scale means your bill is driven more by memory capacity and bandwidth than by raw FLOPS, and that supply chain runs through just three makers (SK Hynix, Samsung, Micron), with NVIDIA having locked up most of their output. NVIDIA's laughing all the way to earnings.
This is infrastructure getting weirdly expensive in weird places. Cloud providers can't optimize it away; memory is physical, not algorithmic. So anyone running production LLMs is about to get a rude awakening on TCO. Open weights help (you own the hardware and its memory); closed APIs don't (cloud markups on HBM will kill you).
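A back-of-envelope sketch of why memory dominates. The shape figures below (layer count, GQA heads, head dimension, context length) and FP16 precision are commonly cited numbers for Llama 3.1 405B, but treat them as illustrative assumptions rather than vendor specs:

```python
# Rough serving-memory math for a Llama 3.1 405B-class model.
# All constants are assumptions for illustration only.

BYTES_PER_PARAM = 2        # FP16 weights
PARAMS = 405e9             # 405B parameters
LAYERS = 126               # transformer layers
KV_HEADS = 8               # grouped-query attention KV heads
HEAD_DIM = 128             # per-head dimension
CONTEXT = 128_000          # max context length in tokens
HBM_PER_GPU_GB = 80        # H100 SXM HBM capacity

# Weights: params * bytes per param
weights_gb = PARAMS * BYTES_PER_PARAM / 1e9

# KV cache per request: 2 (K and V) * layers * kv_heads * head_dim * tokens * bytes
kv_cache_gb = 2 * LAYERS * KV_HEADS * HEAD_DIM * CONTEXT * BYTES_PER_PARAM / 1e9

gpus_for_weights = weights_gb / HBM_PER_GPU_GB

print(f"weights:              ~{weights_gb:,.0f} GB")
print(f"KV cache per request: ~{kv_cache_gb:,.0f} GB at full context")
print(f"GPUs just to hold weights: ~{gpus_for_weights:.0f} H100s, before any batching")
```

Under those assumptions, the weights alone want roughly ten H100s' worth of HBM, and every long-context request stacks tens of GB of KV cache on top of that. The compute is along for the ride; the memory is the thing you're actually provisioning.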
What To Do
Ask your cloud vendor exactly how much they're charging per GB of model memory; that number will shock you.
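If you want a number in hand before that call, here's a quick sketch. The node price below is an assumption; plug in your actual on-demand rate:

```python
# Hypothetical numbers to sanity-check what you're really paying for memory:
# an 8x H100 node (assumed ~$25/hr on-demand) carries 8 x 80 GB = 640 GB of HBM.
NODE_PRICE_PER_HOUR = 25.0   # assumed rate; substitute your own bill
HBM_GB_PER_NODE = 8 * 80     # 8x H100 80 GB

dollars_per_gb_hour = NODE_PRICE_PER_HOUR / HBM_GB_PER_NODE
dollars_per_gb_month = dollars_per_gb_hour * 24 * 30

print(f"~${dollars_per_gb_hour:.3f} per GB of HBM per hour")
print(f"~${dollars_per_gb_month:.0f} per GB of HBM per month")
# Roughly $0.04/GB-hour and $28/GB-month at these assumed rates --
# orders of magnitude above commodity DRAM pricing.
```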
What Skeptics Say
Memory bandwidth as a bottleneck is not a new insight; HBM constraints have been publicly discussed since the GPT-3 scaling papers. Framing this as an emerging realization undersells how much capital is already flowing into SK Hynix, Micron, and CXL interconnects to address it. The editorial may be late to a trend already priced in.