Paul Gavrikov
paulgavrikov.bsky.social
🤖 We tested 37 models. Results?
Even top VLMs break down on “easy” tasks in overloaded scenes.
Best model (o3):
• 19.8% accuracy (hardest split)
• 69.5% overall
2025-09-08T15:28:00.337Z