Why Are Large Language Models so Terrible at Video Games?
What Happened
Large language models (LLMs) have improved so quickly that the benchmarks themselves have evolved, adding harder problems to challenge the latest models. Yet LLMs haven't improved across all domains, and one task remains far outside their grasp: they have no idea how to play video games.
Our Take
LLMs now score near-human on most reasoning benchmarks, yet they consistently fail at video games, including simple Atari titles, and that failure has persisted across model generations. The bottleneck is perception-action loops, not reasoning depth.
Computer-use agents such as Anthropic's computer use and OpenAI's Operator hit the same wall on dynamic UIs: they stall on real-time state changes and on multi-step sequences with sparse feedback. Assuming GPT-4o or Claude 3.5 can handle any sequential task because it handles chain-of-thought reasoning is the wrong mental model.
Teams building GUI automation or game-playing agents should stop routing real-time perception tasks to frontier LLMs. RAG pipelines and static-document agents are unaffected.
What To Do
Route real-time GUI interaction to RL-trained systems or specialized vision-action models instead of frontier LLMs: autoregressive token prediction cannot close a sub-100 ms perception-action loop.
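A back-of-the-envelope check makes the latency mismatch concrete. The numbers below are illustrative assumptions (not measured benchmarks): a 60 fps game gives roughly 16.7 ms per frame, while an LLM emitting an action as a short token sequence pays per-token decode latency for each token.

```python
# Sketch: why autoregressive decoding misses a real-time frame budget.
# All latency figures are illustrative assumptions, not measurements.

FRAME_BUDGET_MS = 16.7        # one frame at 60 fps
PER_TOKEN_LATENCY_MS = 30.0   # assumed decode latency per output token
TOKENS_PER_ACTION = 8         # assumed tokens to emit one action, e.g. '{"key":"left"}'

def frames_missed_per_action(frame_budget_ms: float,
                             per_token_ms: float,
                             tokens: int) -> int:
    """Whole frames that elapse while the model decodes a single action."""
    decode_ms = per_token_ms * tokens
    return int(decode_ms // frame_budget_ms)

missed = frames_missed_per_action(FRAME_BUDGET_MS, PER_TOKEN_LATENCY_MS, TOKENS_PER_ACTION)
print(missed)  # frames of game state the agent never observes per action
```

Under these assumptions the agent goes blind for about 14 frames every time it acts, which is why the same model can ace a static benchmark and still lose at Pong.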
What Skeptics Say
Video game performance is a narrow proxy for reasoning; poor scores on reflex-and-state tasks don't generalize to the planning and language tasks where LLMs are actually deployed, making this a compelling headline with limited engineering implications.