Auto-Generating Subtitles for Video at Home with Whisper.cpp

mwillems:
So I recently discovered that you can now auto-generate pretty good subtitles for films offline at home using open source software.  This was really exciting for me, as I watch a lot of old or obscure films for which subtitles aren't available on the DVD or online, and I often watch while on a noisy exercise machine, so I really need subtitles for intelligibility.  OpenAI (of ChatGPT fame) has released its Whisper speech-to-text engine as open source, and there's now open source tooling that lets you use it at home offline.  I'm using whisper.cpp:  https://github.com/ggerganov/whisper.cpp.  It's much easier to install on Linux than Windows, but there are installation instructions for Windows too.  Once you've got it set up, you just feed it appropriately formatted audio and it spits out subtitles.  You can even tell it to output in .srt format!
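
One note for anyone trying it: the models are downloaded separately, and whisper.cpp ships a helper script for that.  A minimal example (the model size here is just an example; bigger models are more accurate but slower and need more memory):

--- Code: ---# fetch a ggml-format Whisper model with whisper.cpp's bundled helper
# "base.en" is just an example size; see the repo for the full list
cd /path/to/whisper.cpp
bash ./models/download-ggml-model.sh base.en

--- End code ---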

I've watched about six or seven films with these auto-generated subtitles at this point.  The quality is surprisingly good, certainly not perfect, but I'd say better than 95% correct.  When it has errors, they're mostly of the "misheard words" variety, although it will occasionally get "stuck" on a line of dialog or a musical cue and keep repeating it for a few seconds; you can mitigate that with settings.  The only real problem I haven't solved yet is that the subtitles sometimes precede their dialog by a few seconds, but they stay on the screen until the dialog actually happens, so there's no confusion.

Once I got everything set up, it's basically just a little three-line script to generate subtitles for a film and drop them in the film directory where JRiver can pick them up. It's like magic!

Here's my (quite crude) script; I pass it the path to the film as a parameter.  The ffmpeg line is there because whisper.cpp needs a specifically formatted wav file as input, and the parameters passed to whisper.cpp tell it to output in srt format and also significantly reduce the repetition issue (by reducing the context window, which slightly hurts accuracy but dramatically cuts down on the repeats):


--- Code: ---#!/bin/bash
# usage: pass the path to the video file as the only argument

# whisper.cpp expects 16 kHz, 16-bit PCM wav input
ffmpeg -i "$1" -ar 16000 -ac 2 -c:a pcm_s16le /tmp/audio.wav
# -osrt writes srt output; --max-context and -et (entropy threshold)
# tame the repetition issue described above
/path/to/whisper.cpp/main -m /path/to/downloaded/whisper/model.bin -t 1 --max-context 8 -et 2.8 -osrt -f /tmp/audio.wav
# whisper.cpp names its output after the input file, so copy the
# .srt next to the film where JRiver will find it
cp /tmp/audio.wav.srt "$1".srt

--- End code ---
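
For what it's worth, an example run (the script name and paths here are made up; adjust for your own library layout):

--- Code: ---chmod +x gensubs.sh
./gensubs.sh "/media/films/Some Film (1958)/some-film.mkv"
# produces: /media/films/Some Film (1958)/some-film.mkv.srt

--- End code ---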

I hope someone else might find this as exciting as I did.  Note that it will go much faster if you have a beefy video card and build whisper.cpp with appropriate acceleration options.
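
For example, building with NVIDIA cuBLAS support looked roughly like this around the time of writing (the flag names change between whisper.cpp releases, so check the README for your version):

--- Code: ---# build whisper.cpp with NVIDIA cuBLAS acceleration
# (WHISPER_CUBLAS was the flag as of the version I used)
cd /path/to/whisper.cpp
WHISPER_CUBLAS=1 make -j

--- End code ---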

Also, I'm not sure I posted this in the right forum, but I figured the script was Linux-centric so I dropped it here.  Feel free to move the post somewhere else if it's in the wrong spot!

zybex:
Thanks, this is interesting!

I guess this is coming fast:
https://github.com/mldljyh/whisper_real_time_translation

Any plans to eventually integrate this into a DirectShow/VobSub filter @hendrik ? It's a really cool usage for AI, perhaps a good entry point for MC in that space.

mwillems:

--- Quote from: zybex on January 25, 2024, 09:37:26 am ---Thanks, this is interesting!

I guess this is coming fast:
https://github.com/mldljyh/whisper_real_time_translation

Any plans to eventually integrate this into a DirectShow/VobSub filter @hendrik ? It's a really cool usage for AI, perhaps a good entry point for MC in that space.

--- End quote ---

Whisper.cpp also allows for streaming audio in for real-time transcription, but I haven't tried that feature yet, so I don't know how well it works.  My current video card does transcription much faster than real-time with the largest Whisper models, so it should be doable.  I can imagine that integrated graphics or CPU-based transcription might need to use a smaller model (the core Whisper model comes in various sizes with different memory requirements), which seems to be what the project you linked is doing (picking one of the smaller Whisper models and bundling it).
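
If anyone wants to experiment, whisper.cpp ships a "stream" example for exactly this; a sketch of the invocation (the model choice and timing values are just examples, see their docs):

--- Code: ---# live transcription from the default microphone using whisper.cpp's
# stream example; --step/--length control the audio chunking in ms
./stream -m models/ggml-base.en.bin -t 4 --step 500 --length 5000

--- End code ---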

It would be super cool to have automagic real-time subtitles in JRiver!

zybex:
Faster-Whisper has some bold claims:
"This implementation is up to 4 times faster than openai/whisper for the same accuracy while using less memory. The efficiency can be further improved with 8-bit quantization on both CPU and GPU."

mwillems:

--- Quote from: zybex on January 25, 2024, 09:54:23 am ---Faster-Whisper has some bold claims:
"This implementation is up to 4 times faster than openai/whisper for the same accuracy while using less memory. The efficiency can be further improved with 8-bit quantization on both CPU and GPU."

--- End quote ---

Those are some impressive claims, but I notice there are no actual accuracy numbers in their benchmark tables; they just say it's the "same"?  It would be more reassuring to see real benchmarks for that, as I'm much more concerned about accuracy than speed for my use case; in other AI contexts, different quantization or implementation options sometimes make a big difference even with the same base model.  It looks like there's no AMD/ROCm support either, which is a bummer.

Pretty cool though, regardless!
