🤖 When LLMs Go Off the Rails: Meltdowns from the Vending-Bench Benchmark
May 26, 2025
Recently I came across a benchmark paper called Vending-Bench, which simulates a vending machine business to test how well LLM-based agents can handle long-term tasks. The setup is simple on paper: order stock, manage inventory, set prices, collect money. But runs can span 20 million tokens, making it a great stress test for long-term coherence.
The researchers ran a range of popular models (Claude, GPT-4o, Gemini, etc.) through the simulation. The results? Some impressive profits, yes, but also some glorious, completely unhinged meltdowns. Here’s a look at what happens when your helpful AI agent loses the plot entirely.