So I’m in my comfy chair with a new MacBook Air, doing what every normal person does on a Friday night: browsing Hugging Face. I got sucked into the audio-processing rabbit hole and realized that, despite years of tinkering with RNNs, regression, and the usual computer-vision suspects, I had never tried to make a model understand sound. That felt like a gap, and I hate gaps.
Whenever I pick up a new domain, I give myself a project that seems “reasonable”: something I could finish in a week if life didn’t intervene. Fresh off editing a YouTube video, I fixated on how much time I spend deleting filler speech. If you’ve never lived inside a video timeline, disfluencies are the ums, uhs, restarts, and other verbal trash we shovel out so the audio flows. For me, it’s mostly stripping the words “umm” and “so” until my waveform looks like a comb. Naturally, I thought, “Hey, maybe I can teach a neural net to spot the ums for me.” That sounded perfectly doable.
...