So I’m in my comfy chair with a new MacBook Air, doing what every normal person does on a Friday night: browsing Hugging Face. I got sucked into the audio processing rabbit hole and realized that, despite years tinkering with RNNs, regression, and the usual computer-vision suspects, I had never tried to make a model understand sound. That felt like a gap, and I hate gaps.

Whenever I pick up a new domain, I give myself a project that seems “reasonable”—something I could finish in a week if life doesn’t intervene. Fresh off editing a YouTube video, I fixated on how much time I spend deleting filler speech. If you’ve never lived inside a video timeline, disfluencies are the ums, uhs, restarts, and other verbal trash we shovel out so the audio flows. For me, it’s mostly stripping the words “umm” and “so” until my waveform looks like a comb. Naturally, I thought, “Hey, maybe I can teach a neural net to spot the ums for me.” That sounded perfectly doable.

Kinda.

The deeper I read, the more it became clear that “just find the ums” is a bigger ask than I imagined. Eventually I hit a paper titled Increase Apparent Public Speaking Fluency By Speech Augmentation that describes exactly what I want to build, and of course it’s full of extra steps I had conveniently glossed over.

They start where every machine-learning project starts: data. Theirs comes from the Switchboard corpus, over 240 hours of spontaneous telephone conversation across 500 speakers. It's exactly what you need, because curated audio like audiobooks is useless here; nobody says "uhhh" while reading a script they've practiced. Hunting down spontaneous speech datasets is surprisingly tricky, so when you find one, you hang onto it.

With data in hand, they break the audio into features using Mel-Frequency Cepstral Coefficients (MFCCs). If you remember your signal-processing classes, the cepstrum captures a signal's spectral envelope. MFCCs compute that envelope on the mel scale, which lines up with how human ears perceive pitch. It's a clever hack that maps nicely onto voice characteristics and, conveniently, there are mature libraries and walkthroughs that calculate them for you.
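
To get a feel for what those features look like, here's a minimal sketch of MFCC extraction with librosa. The file path, sample rate, and coefficient count are my own placeholder choices, not the settings from the paper.

```python
# A minimal MFCC extraction sketch with librosa. The path and parameters
# below are placeholders, not the paper's configuration.
import librosa

# Load a clip of spontaneous speech (the filename is hypothetical).
audio, sr = librosa.load("clip.wav", sr=16000)

# 13 coefficients per frame is a common starting point; hop_length controls
# how densely the analysis frames tile the waveform.
mfccs = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13, hop_length=160)

print(mfccs.shape)  # (13, num_frames): one column of coefficients per frame
```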

Next up: the model. The authors feed those MFCC features into what they call a CRNN—basically a convolutional front-end followed by a recurrent layer. That architecture made sense in 2019 when the paper landed. Today my first instinct is always “Could a transformer do this better?” The answer is almost always yes, but I get why they stuck with something more approachable. Transformers are powerful, but they drag in a truckload of complexity and compute. A newer take even pivots to BERT, as outlined in this paper, though I still miss the relative elegance of an LSTM when I’m skimming PyTorch code.
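
To make the original architecture concrete, here's a rough PyTorch sketch of the conv-then-recurrent idea: a small convolutional front-end over the MFCC matrix feeding an LSTM that emits per-frame predictions. The layer sizes and class count are my guesses for illustration, not the paper's configuration.

```python
# A rough CRNN sketch: convolutional front-end over MFCC frames, followed by
# a bidirectional LSTM and a per-frame classifier. Sizes are illustrative.
import torch
import torch.nn as nn

class CRNN(nn.Module):
    def __init__(self, n_mfcc=13, hidden=64, n_classes=2):
        super().__init__()
        # Treat the MFCC matrix as a 1-channel "image": (batch, 1, n_mfcc, frames).
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        # The LSTM consumes one flattened feature vector per time frame.
        self.rnn = nn.LSTM(input_size=32 * n_mfcc, hidden_size=hidden,
                           batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, n_classes)  # per-frame filler/not-filler logits

    def forward(self, x):            # x: (batch, 1, n_mfcc, frames)
        feats = self.conv(x)         # (batch, 32, n_mfcc, frames)
        b, c, f, t = feats.shape
        seq = feats.permute(0, 3, 1, 2).reshape(b, t, c * f)  # one vector per frame
        out, _ = self.rnn(seq)       # (batch, frames, 2 * hidden)
        return self.head(out)

# Quick shape check with a fake batch of four 100-frame MFCC clips.
logits = CRNN()(torch.randn(4, 1, 13, 100))
print(logits.shape)  # torch.Size([4, 100, 2])
```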

Once the CRNN flags potential filler segments, they punt to an SVM to classify each pause as disfluent or intentional. A long swig of coffee between sentences shouldn’t get chopped. The trick is giving the SVM context: the silence itself plus a configurable number of frames before and after. Over time it learns what a “bad pause” sounds like versus a natural breath.
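
Here's a minimal scikit-learn sketch of that second stage. Averaging the MFCC frames before, during, and after each candidate pause is my own loose reading of "context"; the window size, features, and labels are stand-ins rather than the paper's recipe.

```python
# A toy version of the pause classifier: an SVM that sees each pause plus a
# few frames of surrounding context. The feature construction is my own
# simplification, not the paper's exact method.
import numpy as np
from sklearn.svm import SVC

N_MFCC = 13
CONTEXT = 5  # frames kept on each side of the pause

def pause_features(mfccs, start, end):
    """Average the MFCC frames before, during, and after a candidate pause.
    Assumes the pause is not flush against the clip boundary."""
    before = mfccs[:, start - CONTEXT:start].mean(axis=1)
    during = mfccs[:, start:end].mean(axis=1)
    after = mfccs[:, end:end + CONTEXT].mean(axis=1)
    return np.concatenate([before, during, after])  # shape: (3 * N_MFCC,)

# Toy training data: one row per candidate pause, label 1 = disfluent,
# 0 = intentional. In practice each row would come from calling
# pause_features() on silences detected in annotated recordings.
X = np.random.randn(200, 3 * N_MFCC)
y = np.random.randint(0, 2, size=200)

clf = SVC(kernel="rbf")
clf.fit(X, y)
print(clf.predict(X[:5]))
```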

Did it work? Apparently yes. They report an accuracy of 0.9482 on filler-word detection, which is solid enough that I'd trust it to clean up my audio without turning me into a robot.

So where does that leave me? Mostly itching to rebuild their pipeline with transformers just to see how far the tooling has come. It’ll take more than a weekend, but it would make a fun YouTube series: walking through the data wrangling, fiddling with the feature extraction, and showing how to swap in a more modern architecture. Transformers first showed up in Attention Is All You Need, and the examples I’ve found since—like this GitHub project—are still pretty gnarly: hyperparameter soup and multi-GPU expectations. Someone has to drag this stuff into approachable territory.
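
For what it's worth, the version I'm imagining is embarrassingly short compared to the original pipeline. The sketch below leans on the Hugging Face audio-classification pipeline, and "my-filler-detector" is a placeholder for a fine-tuned checkpoint that doesn't exist yet.

```python
# What a transformer-based version might boil down to, assuming a fine-tuned
# audio-classification checkpoint. "my-filler-detector" is a placeholder, not
# a real model on the Hub.
from transformers import pipeline

classifier = pipeline("audio-classification", model="my-filler-detector")

# Score a clip; the labels depend entirely on how the checkpoint was trained,
# e.g. "filler" vs. "speech".
print(classifier("clip.wav"))
```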

If you want to run alongside me, there are a few great jumping-off points. Check out this MFCC walkthrough on Kaggle for a practical implementation, grab more spontaneous speech from Mozilla Common Voice, and read up on audio-to-audio tasks on Hugging Face. For a quick primer on the broader field, I still like this YouTube breakdown.
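
If you'd rather pull Common Voice programmatically, the Hugging Face datasets library can stream it. The version string below is just the release I'd reach for, and you'll need to accept the dataset's terms on the Hub (and be logged in) before the download works.

```python
# Streaming a slice of Mozilla Common Voice with the Hugging Face datasets
# library. The version string is an assumption on my part, and access
# requires accepting the dataset's terms on the Hub first.
from datasets import load_dataset

common_voice = load_dataset(
    "mozilla-foundation/common_voice_11_0", "en",
    split="validation", streaming=True,
)

# Peek at one record: each example carries the audio plus its transcript.
sample = next(iter(common_voice))
print(sample["sentence"])
```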

For edutainment, keep an eye on Frank Krueger’s Sunday Twitch stream. He’s one of those developers who makes .NET, iOS, and machine learning look effortless, and you’ll pick up a dozen little tricks just watching him debug.

That’s the plan for now. If you have war stories about trimming disfluencies or you’ve already tried transformer-based filler detectors, I want to hear them. The more feedback I get, the faster I can turn this into something useful. Thanks for sticking around.

Paper referenced: Increase Apparent Public Speaking Fluency By Speech Augmentation by Sagnik Das, Nisha Gandhi, Tejas Naik, and Roy Shilkrot.