
CleanStream OBS Plugin: Remove Filler Words with Whisper CPP

CleanStream OBS Plugin is a powerful tool that removes unwanted words, filler words, and profanity from live audio streams. Written in C++, the plugin can improve the quality of live streams while saving time and effort in post-processing. In this blog post, we will take a detailed walk through the code of my CleanStream OBS plugin, explaining how it is built and its core functionality.

To begin with, this plugin is an OBS audio filter: all incoming audio passes through a single callback function, which is registered in the main plugin file.

struct obs_source_info my_audio_filter_info = {
  .id = "cleanstream_audio_filter",
  .type = OBS_SOURCE_TYPE_FILTER,
  .output_flags = OBS_SOURCE_AUDIO,
  .get_name = cleanstream_name,
  .create = cleanstream_create,
  .destroy = cleanstream_destroy,
  .get_defaults = cleanstream_defaults,
  .get_properties = cleanstream_properties,
  .update = cleanstream_update,
  .filter_audio = cleanstream_filter_audio,
};
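
This struct is handed to OBS when the module loads. Here is a minimal sketch of the registration, assuming the standard libobs module entry point (the actual plugin file may do more here):

#include <obs-module.h>

// minimal sketch of module registration, assuming the standard libobs entry point
OBS_DECLARE_MODULE()

bool obs_module_load(void)
{
  // make the audio filter available in the OBS "Filters" menu
  obs_register_source(&my_audio_filter_info);
  return true;
}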

The cleanstream_filter_audio function is responsible for receiving audio frames, processing them, and returning the resulting audio data to OBS. All of the plugin's magic happens within this one function.
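
The callback follows the standard filter_audio signature from obs_source_info. Here is a simplified sketch of its overall shape – the state struct name is a placeholder of mine, and the real function does quite a bit more:

// simplified sketch of the filter callback; "cleanstream_data" is a placeholder
// name for the plugin's internal state struct
static struct obs_audio_data *cleanstream_filter_audio(void *data, struct obs_audio_data *audio)
{
  struct cleanstream_data *gf = (struct cleanstream_data *)data;

  // buffer the incoming frames for the background Whisper thread (see below)
  // ...

  // hand the (possibly modified) audio back to OBS
  return audio;
}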

The CleanStream OBS plugin is built on top of Whisper.cpp, a project by ggerganov that implements OpenAI's Whisper speech recognition model in plain C/C++ with no external dependencies. To use Whisper.cpp in CleanStream, we only need two source files and two header files. Running the Whisper neural network is quite slow, so it happens on a separate thread that continuously processes audio in the background. This threading requires some buffering, so we use circular buffers (a utility provided by OBS) to store the incoming audio data. Note that I am using two kinds of buffers – one per channel for the raw audio data, and one for “info” structs that record how many audio frames are in the data buffers and what their timestamp is.

// push back current audio data to input circlebuf
for (size_t c = 0; c < gf->channels; c++) {
  circlebuf_push_back(&gf->input_buffers[c], audio->data[c], audio->frames * sizeof(float));
}
// push audio packet info (timestamp/frame count) to info circlebuf
struct cleanstream_audio_info info = {0};
info.frames = audio->frames;       // number of frames in this packet
info.timestamp = audio->timestamp; // timestamp of this packet
circlebuf_push_back(&gf->info_buffer, &info, sizeof(info));
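
For reference, the info struct is tiny – it only records the frame count and timestamp of each packet. Its definition is roughly the following (the exact field types are my assumption, mirroring the obs_audio_data fields):

// per-packet bookkeeping pushed to the info circlebuf; a sketch matching the
// fields used above
struct cleanstream_audio_info {
  uint32_t frames;    // number of audio frames pushed to the data buffers
  uint64_t timestamp; // OBS timestamp of the packet
};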

The processing thread runs a loop I call the ‘Whisper Loop’, which continuously works to clean the audio of unwanted words and profanities. It grabs the plugin’s shared data and checks whether the Whisper context is initialized and whether there is any data to process. If there is, the thread performs some preprocessing, such as resampling and voice activity detection (VAD) using a simple algorithm that looks at the average energy of the audio window.

float energy_all = 0.0f;

// average the absolute amplitude over the whole window
for (uint64_t i = 0; i < n_samples; i++) {
  energy_all += fabsf(pcmf32[i]);
}

energy_all /= n_samples;

// below the threshold: treat the window as silence and skip inference
if (energy_all < vad_thold) {
  return false;
}
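
Putting those pieces together, the overall shape of the Whisper Loop looks roughly like the sketch below. The struct and field names here are illustrative placeholders rather than the plugin’s actual definitions:

#include <atomic>
#include <chrono>
#include <thread>

// illustrative state for this sketch; the real plugin keeps far more here
struct whisper_loop_state {
  std::atomic<bool> running{true};
  bool whisper_ready = false; // is the Whisper context initialized?
  size_t buffered_frames = 0; // audio waiting in the circular buffers
};

// simplified sketch of the background processing loop described above
void whisper_loop(whisper_loop_state *gf)
{
  while (gf->running) {
    if (!gf->whisper_ready || gf->buffered_frames == 0) {
      // nothing to do yet: sleep briefly and check again
      std::this_thread::sleep_for(std::chrono::milliseconds(10));
      continue;
    }
    // 1. pop a chunk (plus the overlap region) from the input circular buffers
    // 2. resample it and run the energy-based VAD shown above
    // 3. run Whisper inference and inspect the transcription
    // 4. if a filler or profanity was found, silence or beep those frames
  }
}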

Consecutive audio chunks are processed with an overlap between them, and the size of the overlap region is set dynamically by timing the Whisper inference function: if Whisper was fast enough we increase the overlap, otherwise we decrease it. It eventually settles on a suitable value.

do_log(gf->log_level, "audio processing of %u ms new data took %d ms", new_frames_from_infos_ms,
        (int)duration);

if (duration > new_frames_from_infos_ms) {
  // try to decrease overlap down to minimum of 100 ms
  gf->overlap_ms = std::max((uint64_t)gf->overlap_ms - 10, (uint64_t)100);
  gf->overlap_frames = gf->overlap_ms * gf->sample_rate / 1000;
} else if (!skipped_inference) {
  // try to increase overlap up to 75% of the segment
  gf->overlap_ms =
    std::min((uint64_t)gf->overlap_ms + 10, (uint64_t)(new_frames_from_infos_ms * 0.75f));
  gf->overlap_frames = gf->overlap_ms * gf->sample_rate / 1000;
}

Speaking of the Whisper inference function, it is the heart of this plugin: it transcribes the audio so unwanted sounds can be removed. We use a few functions from Whisper.cpp to decode the transcription into text, such as whisper_full_get_segment_text, which we limit to a single segment, and whisper_full_get_segment_t0/t1, which give the segment timings. We also average the probabilities of all the tokens returned by the inference to get an overall sentence probability.

const int n_segment = 0;
const char *text = whisper_full_get_segment_text(gf->whisper_context, n_segment);
const int64_t t0 = whisper_full_get_segment_t0(gf->whisper_context, n_segment);
const int64_t t1 = whisper_full_get_segment_t1(gf->whisper_context, n_segment);

float sentence_p = 0.0f;
const int n_tokens = whisper_full_n_tokens(gf->whisper_context, n_segment);
for (int j = 0; j < n_tokens; ++j) {
  sentence_p += whisper_full_get_token_p(gf->whisper_context, n_segment, j);
}
sentence_p /= (float)n_tokens;

Finally, we detect fillers and profanities by matching the transcription against two user-defined regular expressions: a ‘detect’ pattern for filler words and a ‘beep’ pattern for words that should be censored. The result of this detection determines how the buffered audio is modified before it is handed back to OBS.

// text_lower holds the Whisper transcription, converted to lowercase
std::regex filler_regex(gf->detect_regex);
if (std::regex_search(text_lower, filler_regex, std::regex_constants::match_any)) {
  return DETECTION_RESULT_FILLER;
}
std::regex beep_regex(gf->beep_regex);
if (std::regex_search(text_lower, beep_regex, std::regex_constants::match_any)) {
  return DETECTION_RESULT_BEEP;
}

Modifying the audio to insert a beep or silence:

info("beep segment, adding a beep %lu -> %u", first_boundary, num_new_frames_from_infos);
if (gf->do_silence) { // User can enable/disable modification
  for (size_t c = 0; c < gf->channels; c++) {
    for (size_t i = first_boundary; i < num_new_frames_from_infos; i++) {
      // add a beep at A4 (440Hz)
      gf->copy_buffers[c][i] = 0.5f * sinf(2.0f * M_PI * 440.0f * (float)i / gf->sample_rate);
    }
  }
}
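
When silence is wanted instead of a beep (for example for plain filler words), the same loop would simply overwrite the detected region with zeros. A minimal sketch of what that branch could look like:

// silence variant: zero out the detected region instead of writing a tone
for (size_t c = 0; c < gf->channels; c++) {
  for (size_t i = first_boundary; i < num_new_frames_from_infos; i++) {
    gf->copy_buffers[c][i] = 0.0f;
  }
}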

In conclusion, the CleanStream OBS plugin filters unwanted sounds out of live audio streams in real time. I hope this code walkthrough has given you a better understanding of how it works and what makes it tick. The plugin is too big to cover everything in one post, so I highly recommend checking out the CleanStream OBS plugin on GitHub and trying it out yourself. It’s an excellent tool that can save time and provide high-quality audio for live streams.