Kyle Lo @ COLM 2025 🍁
kylelo.bsky.social
did:plc:xl4nejvjc52uhp25mceh3f7q
issues w preference LM benchmarks:
🐡 data contains cases where the "bad" response is just as good as the chosen one
🐟 model rankings can feel off (claude ranks lower than expected)
led by @cmalaviya.bsky.social, we study underspecified queries & detrimental effect on model evals; accepted to TACL 2025
[contains quote post or other embedded content]
2025-07-22T17:02:49.345Z