@kyletokens.bsky.social - Kyle O’Brien

AGI Alignment & Control Pretraining @ Geodesic Research | https://kyobrien.io
https://bsky.app/profile/kyletokens.bsky.social

https://bsky.app/profile/kyletokens.bsky.social/post/3mduatzdkxs2p
Very excited to see pretraining safety efforts! We’re only now beginning to understand how promising pretraining safety and alignment interventions are. Just as curating the base model is important for capabilities like reasoning, it might also be important for safety. [contains quote post or other embedded content]
02 Feb 2026 06:47 +0000

https://bsky.app/profile/kyletokens.bsky.social/post/3mcj7hnhpjs2g
I've joined Geodesic Research to build the open-science field of AI safety pretraining research. Our first paper is wild. TL;DR: LLMs pretrained on data about misaligned AIs themselves become less aligned. Luckily, pretraining LLMs with data about good AIs helps them become more aligned. https://alignmentpretraining.ai/
16 Jan 2026 03:58 +0000

https://bsky.app/profile/kyletokens.bsky.social/post/3m7xplmh5sc26
You know you're AGI-pilled when your Spotify Wrapped looks like this.
14 Dec 2025 18:08 +0000

https://bsky.app/profile/kyletokens.bsky.social/post/3m4it2ff3d22t
Applications for the ERA:AI Fellowship close November 3rd! Participating in this summer's fellowship was my gateway into pursuing AGI safety research full-time. I will be a research manager for the upcoming Winter fellowships. Feel free to DM me with questions. :) https://erafellowship.org/
31 Oct 2025 15:45 +0000

https://bsky.app/profile/kyletokens.bsky.social/post/3lwad3w4lfs27
This article covers our work for a general audience. :) [contains quote post or other embedded content]
12 Aug 2025 22:07 +0000

https://bsky.app/profile/kyletokens.bsky.social/post/3lw7ebgvrac2z
Big and True :) [contains quote post or other embedded content]
12 Aug 2025 12:55 +0000

https://bsky.app/profile/kyletokens.bsky.social/post/3lw2g5fk3p223
I like that OpenAI published this. They were able to fine-tune away GPT-oss's refusals, decreasing refusal rates to ~0%. These results aren't surprising. Acknowledging that existing safeguards don't generalize to open models is the first step in developing solutions. https://arxiv.org/abs/2508.03153v1
10 Aug 2025 13:45 +0000

https://bsky.app/profile/kyletokens.bsky.social/post/3lvgpyyboik2c
I've learned a lot over the past two years of getting into research, mostly from mistakes. I’ve made many mistakes. Such is science. Good research often lies at the adjacent possible. Now that I'm beginning to mentor others, I've written up much of what I've learned. https://open.substack.com/pub/kyletokens/p/dont-think-just-think?r=3gtmk8&utm_campaign=post&utm_medium=web&showWelcomeOnShare=false
02 Aug 2025 17:49 +0000

https://bsky.app/profile/kyletokens.bsky.social/post/3ls2n5pzv5c2i
I led an effort at Microsoft last fall that studied whether SAE steering was an effective way to improve jailbreak robustness. Our paper on SAE steering has been accepted to the ICML Actionable Interpretability Workshop!
Venue: actionable-interpretability.github.io
Paper: https://arxiv.org/abs/2411.11296
20 Jun 2025 18:10 +0000

https://bsky.app/profile/kyletokens.bsky.social/post/3lqy6i4sius2e
I'll be in England this summer as an AI Safety Research Fellow with ERA! I will be studying data filtering and tamper-resistant unlearning for open-weight AI safety so that the community can continue to benefit from open models as capabilities improve. https://erafellowship.org/fellowship
07 Jun 2025 01:17 +0000