Source: IEEE Spectrum

Why Are Large Language Models so Terrible at Video Games?

Read the full article on IEEE Spectrum.

What Happened

Large language models (LLMs) have improved so quickly that the benchmarks themselves have evolved, adding more complex problems in an effort to challenge the latest models. Yet LLMs haven't improved across all domains, and one task remains far outside their grasp: they have no idea how to play video games.

Our Take

LLMs now score near-human on most reasoning benchmarks, but consistent failure at video games — including simple Atari titles — has persisted across model generations. The bottleneck is perception-action loops, not reasoning depth.
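The perception-action bottleneck is easy to see with back-of-envelope arithmetic: an Atari emulator typically runs at 60 frames per second, a budget of roughly 16.7 ms per decision, while an LLM API round trip takes hundreds of milliseconds or more. The latency figures below are illustrative assumptions, not measurements of any particular model:

```python
# Back-of-envelope: how many game frames elapse during a single LLM call.
# FRAME_RATE_HZ matches a typical Atari emulator; the latencies in the loop
# are assumed API round-trip times, not benchmarks of any specific model.
FRAME_RATE_HZ = 60
FRAME_BUDGET_MS = 1000 / FRAME_RATE_HZ  # ~16.7 ms to react to each frame

def frames_missed(llm_latency_ms: float) -> int:
    """Frames that pass while the model is still producing its action."""
    return int(llm_latency_ms * FRAME_RATE_HZ / 1000)

for latency_ms in (500, 2000, 8000):
    print(f"{latency_ms} ms call -> {frames_missed(latency_ms)} frames missed")
```

Even at an optimistic half-second round trip, the game state the model finally acts on is dozens of frames stale, which no amount of reasoning depth can compensate for.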

Computer-use agents, such as Anthropic's computer-use tool and OpenAI's Operator, hit the same wall on dynamic UIs: they stall on real-time state changes and on multi-step sequences with sparse feedback. Assuming GPT-4o or Claude 3.5 can handle any sequential task because they handle chain-of-thought is the wrong mental model.

Teams building GUI automation or game-playing agents should stop routing real-time perception tasks to frontier LLMs. RAG pipelines and static-document agents are unaffected.

What To Do

Route real-time GUI interaction to RL-trained systems or specialized vision-action models instead of frontier LLMs: token-by-token prediction is architecturally mismatched with sub-100 ms perception-action loops.
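The routing rule above can be sketched as a simple dispatcher that checks whether a task needs a tight perception-action loop before choosing a backend. All names here (`Task`, `route`, the backend labels) are hypothetical, and the 100 ms threshold is the rule of thumb from the recommendation, not a measured cutoff:

```python
# Sketch of the routing rule: real-time perception-action work goes to a
# specialized policy; planning and document tasks go to a frontier LLM.
# All names are illustrative, not a real library API.
from dataclasses import dataclass

@dataclass
class Task:
    name: str
    realtime: bool       # does it need a tight perception-action loop?
    deadline_ms: float   # latency budget per decision

REALTIME_BUDGET_MS = 100  # rule-of-thumb threshold from the text

def route(task: Task) -> str:
    if task.realtime or task.deadline_ms < REALTIME_BUDGET_MS:
        return "vision-action-policy"   # RL-trained / specialized model
    return "frontier-llm"               # planning, RAG, static documents

print(route(Task("atari-frame", True, 16.7)))     # vision-action-policy
print(route(Task("summarize-doc", False, 5000)))  # frontier-llm
```

The point of the sketch is the decision boundary, not the backends: the same agent can still use an LLM for high-level planning while keeping frame-rate decisions out of its loop.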

Builder's Brief

Who

teams building LLM-based autonomous agents or decision-making loops

What changes

spatial reasoning and real-time state-tracking remain hard limits; agent architectures relying on these will need explicit scaffolding

When

months

Watch for

game-based benchmarks appearing in frontier model eval release notes as a standardized capability signal
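One form the "explicit scaffolding" mentioned above can take: keep the authoritative state in deterministic code and hand the model only a compact summary, rather than asking it to track a changing screen itself. This is a minimal sketch under that assumption; every name in it is hypothetical:

```python
# Deterministic state tracker wrapped around a (hypothetical) LLM planner.
# Code, not the model, is the source of truth for UI state; the model only
# ever sees a short, already-tracked summary string.
from dataclasses import dataclass, field

@dataclass
class UIState:
    screen: str = "unknown"
    events: list = field(default_factory=list)

class Scaffold:
    def __init__(self) -> None:
        self.state = UIState()

    def observe(self, event: dict) -> None:
        # Update state deterministically from each UI event.
        self.state.screen = event.get("screen", self.state.screen)
        self.state.events.append(event)

    def prompt(self) -> str:
        # Compact summary to pass to the planner instead of raw frames.
        recent = self.state.events[-3:]
        return f"screen={self.state.screen}, last_events={recent}"

scaffold = Scaffold()
scaffold.observe({"screen": "login", "type": "load"})
scaffold.observe({"type": "click", "target": "submit"})
print(scaffold.prompt())
```

The design choice is that the model makes slow, high-level decisions against this summary while the scaffold absorbs the real-time state changes it cannot follow.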

What Skeptics Say

Video game performance is a narrow proxy for reasoning; poor scores on reflex-and-state tasks don't generalize to the planning and language tasks where LLMs are actually deployed, making this a compelling headline with limited engineering implications.
