AI Is Insatiable
What Happened
While browsing our website a few weeks ago, I stumbled upon “How and When the Memory Chip Shortage Will End” by Senior Editor Samuel K. Moore. His analysis focuses on the current DRAM shortage caused by AI hyperscalers’ ravenous appetite for memory, a major constraint on the speed at which large language models can scale.
Our Take
DRAM prices have climbed ~40% since late 2024. Hyperscalers bulk-buying HBM3 for LLM inference is draining standard server memory supply — the same memory your self-hosted inference stack competes for.
On vLLM or TGI, memory is your binding constraint — not compute. Most teams still size GPU fleets by FLOP capacity. That's the wrong unit. KV-cache overflow kills throughput at scale long before CUDA cores become the bottleneck.
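The arithmetic behind that claim is easy to check. As a sketch, assuming a Llama-70B-style configuration (80 layers, 8 grouped-query KV heads, head dimension 128, fp16 cache — all assumed numbers, not taken from the article):

```python
def kv_cache_bytes_per_token(n_layers: int, n_kv_heads: int,
                             head_dim: int, dtype_bytes: int) -> int:
    # 2x because each layer must retain both a K and a V tensor per token.
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes

# Assumed Llama-70B-style config: 80 layers, 8 KV heads (GQA),
# head_dim 128, fp16 cache (2 bytes per element).
per_token = kv_cache_bytes_per_token(80, 8, 128, 2)  # 327,680 B ~ 320 KiB/token
per_seq_8k = per_token * 8192                        # ~2.5 GiB per 8k-token sequence
print(per_token, per_seq_8k / 2**30)
```

At roughly 2.5 GiB of cache per full-context sequence, a few dozen concurrent requests consume more memory than the model weights themselves — which is why cache, not FLOPs, is the unit that runs out first.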
Self-hosted teams running Llama 70B+ should reprice hardware budgets now. Teams on managed APIs — OpenAI, Anthropic — can ignore this for another two quarters.
What To Do
Size your vLLM deployment by KV-cache memory budget first, not GPU count, because DRAM scarcity is now the real constraint on self-hosted inference throughput.
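A back-of-envelope sizing helper makes the budget concrete. This is a sketch under stated assumptions — the 140 GiB fp16 weight figure, 80 GiB card, and 2.5 GiB-per-sequence cache cost are illustrative, and a real deployment should be tuned against the KV block count vLLM reports at startup:

```python
def max_concurrent_seqs(gpu_mem_gib: float, weight_gib: float,
                        kv_gib_per_seq: float, util: float = 0.9) -> int:
    """Rough count of sequences one GPU can hold before KV-cache overflow.

    `util` mirrors vLLM's gpu_memory_utilization cap; activation memory
    and framework overhead are ignored for simplicity.
    """
    budget = gpu_mem_gib * util - weight_gib
    return max(0, int(budget // kv_gib_per_seq))

# Assumed: 8x 80 GiB GPUs, ~140 GiB of fp16 70B weights sharded evenly
# (17.5 GiB per GPU), and ~2.5 GiB of KV cache per 8k-token sequence
# sharded the same way (~0.31 GiB per GPU).
print(max_concurrent_seqs(80, 140 / 8, 2.5 / 8))
```

Note that the count goes to zero as soon as weights eat the utilization budget — which is the failure mode FLOP-based sizing never surfaces.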
Builder's Brief
What Skeptics Say
Chip shortage narratives historically precede overbuilding and gluts — current AI-driven DRAM demand may reflect a cyclical surge rather than a permanent new floor, and markets are pricing in the optimistic scenario.