Transcribing Videos with Google Cloud Speech-to-Text

Got an hour-long video and not keen on creating subtitles by hand? No plans to put it on YouTube for its automated transcription? Then try Google Cloud Speech-to-Text! In this post I’ll share some scripts for automating the process and creating an .srt file to go along with your video for displaying the subtitles.

Google’s Speech-to-Text (https://cloud.google.com/speech-to-text) is available through their gcloud command line tool, which makes it very easy to script and automate. There are just a few steps and considerations for using it, for example extracting FLAC audio from your video and splitting it into small chunks.

Prerequisites

Google Cloud enabled for your Google account (you need a credit card, I believe, but you get $300 in credits for using it the first time)
A live Google Cloud project
gcloud installed (https://cloud.google.com/sdk/docs/install) and configured for your account
A Google Cloud Storage bucket (https://console.cloud.google.com/storage)
FFMPEG installed (https://ffmpeg.org/download.html)
A long video to transcribe (in whatever format, but FFMPEG should be able to read it)

Extract 5-minute FLAC Audio Chunks

Google Speech worked best for me when limiting the length of each transcription to 5 minutes. More than that and I got timeouts and errors. So I recommend splitting the audio into 5-minute chunks like so:

$ ffmpeg -i video.mp4 -vn -ar 44100 -c:a flac -sample_fmt s16 -ac 1 -y -map 0 -segment_time 00:05:00 -f segment audio%02d.flac

This creates (depending on the length of your video) one or more audioNN.flac files. The encoding is 16-bit, 44.1 kHz, mono, to keep the files as small as possible (~11 MB each). I had a pretty slow internet connection while doing this, so uploading 24-bit 48 kHz stereo files was a pain. Google recommends a sample rate of 16 kHz or higher for best results anyway. You can tweak the parameters, but I doubt it will give a better transcription. Upload all the files to your Google Cloud Storage bucket.
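To see why the smaller format helps on a slow connection, here is a rough back-of-the-envelope sketch in Python. It only computes uncompressed PCM sizes for a 5-minute chunk; FLAC shrinks both further by a content-dependent ratio, so these are upper bounds rather than actual file sizes. The function name pcm_bytes is made up for illustration.

```python
def pcm_bytes(sample_rate_hz, bit_depth, channels, seconds):
    """Size in bytes of uncompressed PCM audio with the given parameters."""
    return sample_rate_hz * (bit_depth // 8) * channels * seconds


CHUNK_SECONDS = 5 * 60  # one 5-minute chunk

# 16-bit, 44.1 kHz, mono (the settings used in the ffmpeg command)
small = pcm_bytes(44_100, 16, 1, CHUNK_SECONDS)
# 24-bit, 48 kHz, stereo (the heavier alternative)
large = pcm_bytes(48_000, 24, 2, CHUNK_SECONDS)

print(f"mono 16-bit 44.1 kHz:  {small / 1e6:.1f} MB before FLAC compression")
print(f"stereo 24-bit 48 kHz:  {large / 1e6:.1f} MB before FLAC compression")
print(f"ratio: {large / small:.2f}x")
```

Before compression the stereo 24-bit variant is a bit over three times larger per chunk, which adds up quickly over an hour of audio.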

Run the Google Speech Command

Queue up all the files for transcription:

$ for i in {0..[[NN]]}; do gcloud ml speech recognize-long-running "gs://[[mybucket]]/audio$(printf '%02d' $i).flac" --encoding='flac' --sample-rate=44100 --language-code='en-US' --include-word-time-offsets --async; done

Change [[NN]] and [[mybucket]] to match your setup. Since the loop starts at 0, [[NN]] is the number of the last audioNN.flac file, i.e. the number of files minus one.
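If you would rather generate the commands than edit the loop by hand, the following Python sketch builds the same gcloud invocation for each chunk. The helper names (chunk_count, recognize_command), the bucket name my-bucket and the one-hour duration are placeholders for illustration. The chunk count is only an estimate from the video duration; counting the audioNN.flac files that ffmpeg actually wrote is more reliable.

```python
import math
import shlex


def chunk_count(duration_seconds, chunk_seconds=300):
    """Estimate how many chunks a recording of this length splits into."""
    return math.ceil(duration_seconds / chunk_seconds)


def recognize_command(bucket, index, sample_rate=44100, language="en-US"):
    """Build the gcloud command line (as an argument list) for one chunk."""
    uri = f"gs://{bucket}/audio{index:02d}.flac"
    return [
        "gcloud", "ml", "speech", "recognize-long-running", uri,
        "--encoding=flac",
        f"--sample-rate={sample_rate}",
        f"--language-code={language}",
        "--include-word-time-offsets",
        "--async",
    ]


# Placeholder values: a one-hour video and a bucket called "my-bucket".
for i in range(chunk_count(60 * 60)):
    print(shlex.join(recognize_command("my-bucket", i)))
```

This prints one command per chunk, numbered from audio00.flac onward, which you can review before running.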

While doing this I also experimented with providing a list of “boost words”: https://cloud.google.com/speech-to-text/docs/speech-adaptation . My “special” words were in a text file words.txt, one word per line, so I could run:

$ for i in {0..18}; do gcloud ml speech recognize-long-running "gs://speechrec/audio$(printf '%02d' $i).flac" --encoding='flac' --sample-rate=44100 --language-code='en-US' --include-word-time-offsets --async --hints "$(awk '{ printf "%s%s", (NR==1?"[":", "), $0 } END{ print "]" }' words.txt)"; done

This passes the words on the command line through the --hints argument.
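For readers who find the awk one-liner hard to parse, here is a small Python sketch of what it builds for a non-empty words.txt: the words joined with ", " and wrapped in square brackets. The second helper is an untested alternative, on the assumption that gcloud list flags such as --hints accept a plain comma-separated value; check gcloud ml speech recognize-long-running --help before relying on it. All function names here are made up.

```python
def read_words(path):
    """Read one word or phrase per line, skipping blank lines."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def hints_like_awk(words):
    """Reproduce the string the awk one-liner builds: '[w1, w2, ...]'."""
    return "[" + ", ".join(words) + "]"


def hints_comma_separated(words):
    """Assumed alternative: a plain comma-separated list, no brackets."""
    return ",".join(words)


# Example with made-up words instead of a real words.txt:
example = ["Kubernetes", "gRPC", "FFmpeg"]
print(hints_like_awk(example))
print(hints_comma_separated(example))
```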

Note that this runs the recognize-long-running command with --async, which returns immediately and lets you query the service for the results later.

This command will return a bunch of operations/NNNNNNNNNNNNNNNNNN keys that are used for retrieving the results, e.g.

gcloud ml speech operations describe operations/NNNNNNNNNNNNNNNNNN

This returns a JSON document with the results, so I redirect the output of describe into .json files:

gcloud ml speech operations describe operations/NNNNNNNNNNNNNNNNNN > audio00.json

Name each .json file after the .flac file it belongs to (audio00.json for audio00.flac, and so on) so the two stay aligned.
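With many chunks, fetching the results one by one gets tedious. Below is a minimal Python sketch that pairs each operation key with its audioNN.json file name and fetches it with the same describe command. It assumes you collected the operation keys in chunk order, and that the JSON carries the standard long-running operation field "done" once the job has finished. The function names are hypothetical.

```python
import json
import subprocess


def result_filename(index):
    """JSON file name aligned with audioNN.flac for the same index."""
    return f"audio{index:02d}.json"


def describe_command(operation):
    """Argument list for fetching one operation, e.g. 'operations/123'."""
    return ["gcloud", "ml", "speech", "operations", "describe", operation]


def is_finished(operation_json_text):
    """True if the operation JSON reports done (assumed standard field)."""
    return bool(json.loads(operation_json_text).get("done", False))


def fetch_all(operations):
    """Fetch each operation in order; save only the finished ones."""
    pending = []
    for index, operation in enumerate(operations):
        output = subprocess.run(
            describe_command(operation),
            capture_output=True, text=True, check=True,
        ).stdout
        if is_finished(output):
            with open(result_filename(index), "w", encoding="utf-8") as out:
                out.write(output)
        else:
            pending.append(operation)
    return pending  # operations that still need another try later
```

Calling fetch_all with your list of operation keys writes the finished results and returns the keys that are still running, so you can retry those a few minutes later.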

Create a .srt File from the JSON Output

This step uses a Python script I wrote that takes all the JSON files and combines them, converting the .json format into .srt format while keeping the timestamps valid and in sync with the video.

Using the script:

$ google_cloud_speech_json_to_srt.py --concat --fix-timestamps audio*.json

Note that the script looks for the matching .flac files to figure out the starting timestamp of each audio chunk, so it can keep the subtitle timestamps correctly synced with the video.

The script uses some heuristics to decide how many words to put in a single subtitle line, and makes sure the duration each line is shown makes sense.
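To give a feel for what such a conversion involves, here is a simplified Python sketch. It is not the script from this post, just an illustration under several assumptions: the word timings sit under response.results[].alternatives[0].words[] with startTime and endTime strings like "1.300s"; each chunk starts exactly 300 seconds after the previous one (instead of reading the .flac files); and every subtitle gets a fixed maximum of 8 words rather than a smarter heuristic.

```python
def parse_seconds(value):
    """Convert a duration string such as '1.300s' to float seconds."""
    return float(value.rstrip("s"))


def srt_timestamp(seconds):
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    millis = int(round(seconds * 1000))
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def chunk_offset(index, chunk_seconds=300):
    """Assumed start time of chunk `index` within the full video."""
    return index * chunk_seconds


def words_from_operation(operation, offset_seconds=0.0):
    """Return (start, end, word) tuples from one parsed operation JSON."""
    words = []
    for result in operation.get("response", {}).get("results", []):
        alternatives = result.get("alternatives", [])
        if not alternatives:
            continue
        for w in alternatives[0].get("words", []):
            start = parse_seconds(w.get("startTime", "0s")) + offset_seconds
            end = parse_seconds(w.get("endTime", "0s")) + offset_seconds
            words.append((start, end, w["word"]))
    return words


def to_srt(words, max_words=8):
    """Group words into numbered SRT blocks of at most `max_words` each."""
    blocks = []
    starts = range(0, len(words), max_words)
    for number, i in enumerate(starts, start=1):
        group = words[i:i + max_words]
        start, end = group[0][0], group[-1][1]
        text = " ".join(word for _, _, word in group)
        blocks.append(
            f"{number}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n"
        )
    return "\n".join(blocks)
```

To use it, load each audioNN.json with json.load, call words_from_operation with chunk_offset(NN), concatenate the word lists in order, and write the result of to_srt to video.srt.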

Bask in Your Glory

The .srt file should be ready now. Rename it to match the video file (e.g. video.srt for video.mp4), then play the video in VLC; the subtitles should come up automatically, in sync with the video.

Have fun subtitling!

Roy.
