Why Are Large Language Models so Terrible at Video Games?
What Happened
Large language models (LLMs) have improved so quickly that the benchmarks themselves have evolved, adding harder problems to challenge the latest models. Yet LLMs haven't improved across all domains, and one task remains far outside their grasp: they have no idea how to play video games.
Our Take
LLMs now score near-human on most reasoning benchmarks, yet they consistently fail at video games, including simple Atari titles, and that failure has persisted across model generations. The bottleneck is perception-action loops, not reasoning depth.
Computer-use agents such as Anthropic's computer use and OpenAI's Operator hit the same wall on dynamic UIs: they stall on real-time state changes and on multi-step sequences with sparse feedback. Assuming GPT-4o or Claude 3.5 can handle any sequential task because it handles chain-of-thought reasoning is the wrong mental model.
Teams building GUI automation or game-playing agents should stop routing real-time perception tasks to frontier LLMs. RAG pipelines and static-document agents are unaffected.
What To Do
Route real-time GUI interaction to RL-trained systems or specialized vision-action models instead of frontier LLMs: autoregressive token prediction cannot close a sub-100 ms perception-action loop.
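A back-of-the-envelope check makes the latency mismatch concrete. The numbers below are illustrative assumptions (not measured benchmarks): a 60 fps game gives roughly 16.7 ms per frame, while an LLM emitting an action as a short token sequence pays per-token decode latency for each token.

```python
# Sketch: why autoregressive decoding misses a real-time frame budget.
# All latency figures are illustrative assumptions, not measurements.

FRAME_BUDGET_MS = 16.7        # one frame at 60 fps
PER_TOKEN_LATENCY_MS = 30.0   # assumed decode latency per output token
TOKENS_PER_ACTION = 8         # assumed tokens to emit one action, e.g. '{"key":"left"}'

def frames_missed_per_action(frame_budget_ms: float,
                             per_token_ms: float,
                             tokens: int) -> int:
    """Whole frames that elapse while the model decodes a single action."""
    decode_ms = per_token_ms * tokens
    return int(decode_ms // frame_budget_ms)

missed = frames_missed_per_action(FRAME_BUDGET_MS, PER_TOKEN_LATENCY_MS, TOKENS_PER_ACTION)
print(missed)  # frames of game state the agent never observes per action
```

Under these assumptions the agent goes blind for about 14 frames every time it acts, which is why the same model can ace a static benchmark and still lose at Pong.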
What Skeptics Say
Video game performance is a narrow proxy for reasoning; poor scores on reflex-and-state tasks don't generalize to the planning and language tasks where LLMs are actually deployed, making this a compelling headline with limited engineering implications.