Meltdowns

🤖 When LLMs Go Off the Rails: Meltdowns from the Vending-Bench Benchmark

May 26, 2025

Recently I came across a benchmark paper called Vending-Bench, which simulates a vending machine business to test how well LLM-based agents handle long-running tasks. The setup is simple on paper: order stock, manage inventory, set prices, collect money. But a single run can span 20 million tokens, which makes it a great stress test for long-term coherence.

The researchers ran a bunch of popular models (Claude, GPT-4o, Gemini, etc.) through the simulation. The results? Some impressive profits, yes, but also some glorious, completely unhinged meltdowns. Here’s a look at what happens when your helpful AI agent loses the plot entirely.
