Nick Tomlin
nickatomlin.bsky.social
I'm particularly fond of this new benchmark paper we wrote, which aims to scalably evaluate whether language models can generalize to arbitrary new tasks. The core idea is to use LLMs to generate new games, and then evaluate whether LLMs can play those games.
📄: arxiv.org/abs/2505.07215
2025-05-13T21:30:20.348Z