This is a heavily interactive web application, and JavaScript is required. Simple HTML interfaces are possible, but that is not what this is.
Post
Tim Kellogg
timkellogg.me
did:plc:ckaz32jwl6t2cno6fmuw2nhn
apparently RLCoT (chain of thought learned via RL) is in itself an emergent behavior that doesn’t happen until about 1.5B sized models
PPO, GPRO, PRIME — doesn’t matter what RL you use, the key is that it’s RL
experiment logs: https://wandb.ai/jiayipan/TinyZero?nw=nwuserjiayipan
x: https://x.com/jiayi_pirate/status/1882839504899420517?s=46&t=ftkDjGBpGPr2-yTN2CCUYg
2025-01-25T18:46:06.047Z