Simon Willison
simonwillison.net
did:plc:kft6lu4trxowqmter2b6vg6z
Microsoft's MIT-licensed VibeVoice speech-to-text model (think Whisper with speaker diarization) is really good - my notes on running the 5.71GB 4-bit MLX conversion on an M5 MacBook, using about 60GB of RAM at peak and transcribing 1hr of audio in ~9 mins https://simonwillison.net/2026/Apr/27/vibevoice/
2026-04-27T23:49:26.956Z