Ryan Moulton
moultano.bsky.social
did:plc:h25avmes6g7fgcddc3xj7qmg
I wonder if the alignment-faking behavior in Claude (https://www.anthropic.com/research/alignment-faking) can be attributed via influence functions (https://www.anthropic.com/research/influence-functions) to LessWrong posts about deceptive alignment.
We've given it the script for what we don't want it to do.
2024-12-18T18:26:11.156Z