Topic: Auto-Generating Subtitles for Video at Home with Whisper.cpp (Read 896 times)

mwillems · « **on:** January 25, 2024, 08:32:07 am »

So I recently discovered that you can now auto-generate pretty good subtitles for films offline at home using open source software. This was really exciting for me as I watch a lot of old or obscure films for which subtitles aren't available either on the DVD or online, and I often watch while on a noisy exercise machine, so I really need subtitles for intelligibility. OpenAI (of ChatGPT fame) makes their Whisper speech-to-text engine available as open source, and there's now open source tooling to let you use it at home offline. I'm using whisper.cpp: https://github.com/ggerganov/whisper.cpp. It's much easier to install on Linux than on Windows, but there are installation instructions for Windows too. Once you've got it set up, you just feed it appropriately formatted audio and it spits out subtitles. You can even tell it to output in .srt format!

I've watched about six or seven films with these auto subtitles at this point. The quality is surprisingly good: certainly not perfect, but I'd say better than 95% correct. When it makes errors they're mostly of the "misheard words" variety, although it will occasionally get "stuck" on a line of dialog or a musical cue and keep repeating it for a few seconds (you can mitigate that with settings). The only real problem I haven't solved yet is that subtitles sometimes precede their dialog by a few seconds, but they stay on screen until the dialog actually happens, so there's no confusion.

Once I got everything set up, it's basically just a little three-line script to generate subtitles for a film and drop them in the film directory where JRiver can pick them up. It's like magic!

Here's my (quite crude) script; I pass it the path to the film as a parameter. The ffmpeg line is there because whisper.cpp needs a specifically formatted WAV file as input. The parameters passed to whisper.cpp tell it to output in SRT format and also significantly reduce the repetition issue (by reducing the context window, which slightly hurts accuracy but dramatically cuts down on the occasional repetition):

Code:

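#!/bin/bash
# Usage: pass the path to the film as the first argument.

# Extract the audio as a 16 kHz, 16-bit PCM WAV file for whisper.cpp (-y overwrites a stale /tmp/audio.wav).
ffmpeg -y -i "$1" -ar 16000 -ac 2 -c:a pcm_s16le /tmp/audio.wav
# -osrt writes SRT output; --max-context 8 and -et 2.8 are there to reduce repetition.
/path/to/whisper.cpp/main -m /path/to/downloaded/whisper/model.bin -t 1 --max-context 8 -et 2.8 -osrt -f /tmp/audio.wav
# Drop the subtitles next to the film.
cp /tmp/audio.wav.srt "$1".srt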
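
For anyone who wants something a little less crude, here is a rough sketch of the same three steps wrapped in a function. It uses a private temp directory instead of a fixed /tmp/audio.wav, so two runs can't clobber each other, and it stops if a step fails. The names `make_subtitles`, `srt_path_for`, `WHISPER_BIN` and `WHISPER_MODEL` are made up for illustration, and the paths are placeholders just like in the script above.

```shell
#!/bin/bash
# Sketch only: WHISPER_BIN and WHISPER_MODEL are placeholder paths; set them to your own.
WHISPER_BIN="${WHISPER_BIN:-/path/to/whisper.cpp/main}"
WHISPER_MODEL="${WHISPER_MODEL:-/path/to/downloaded/whisper/model.bin}"

# Subtitle path next to the film, matching the "$1".srt naming used above.
srt_path_for() {
    printf '%s.srt\n' "$1"
}

make_subtitles() {
    local film="$1" workdir
    workdir="$(mktemp -d)" || return 1
    # Extract 16 kHz, 16-bit PCM audio into the private temp directory,
    # transcribe it with the same whisper.cpp flags as the original script,
    # then copy the SRT next to the film. Each step runs only if the previous one succeeded.
    ffmpeg -nostdin -y -i "$film" -ar 16000 -ac 2 -c:a pcm_s16le "$workdir/audio.wav" &&
    "$WHISPER_BIN" -m "$WHISPER_MODEL" -t 1 --max-context 8 -et 2.8 -osrt -f "$workdir/audio.wav" &&
    cp "$workdir/audio.wav.srt" "$(srt_path_for "$film")"
    local status=$?
    rm -rf "$workdir"
    return "$status"
}

# Usage: make_subtitles "/path/to/film.mkv"
```

Keeping the flags identical means the output should match the simple script; only the temp-file handling and error checking differ.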

I hope someone else might find this as exciting as I did. Note that it will go much faster if you have a beefy video card and build whisper.cpp with appropriate acceleration options.

Also, I'm not sure I posted this in the right forum, but I figured the script was Linux-centric so I dropped it here. Feel free to move the post somewhere else if it's in the wrong spot!

zybex · « **Reply #1 on:** January 25, 2024, 09:37:26 am »

Thanks, this is interesting!

I guess this is coming fast:
https://github.com/mldljyh/whisper_real_time_translation

Any plans to eventually integrate this into a DirectShow/VobSub filter, @hendrik? It's a really cool use of AI, and perhaps a good entry point for MC in that space.

mwillems · « **Reply #2 on:** January 25, 2024, 09:44:04 am »

Quote from: zybex on January 25, 2024, 09:37:26 am

Thanks, this is interesting!

I guess this is coming fast:
https://github.com/mldljyh/whisper_real_time_translation

Any plans to eventually integrate this into a DirectShow/VobSub filter, @hendrik? It's a really cool use of AI, and perhaps a good entry point for MC in that space.

Whisper.cpp also allows for streaming in audio for real-time transcription, but I haven't tried that feature yet, so I don't know how well it works. My current video card does transcription much faster than real-time with the largest Whisper models, so it should be doable. I can imagine that integrated graphics or CPU-based transcription might need a smaller model (the core Whisper model comes in various sizes with different memory requirements), which seems to be what the project you linked is doing (picking one of the smaller Whisper models and bundling it).
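
As a rough illustration of matching model size to hardware, a script could pick the largest model that fits a memory budget. The helper name `pick_model` is hypothetical, and the thresholds are approximate figures of the kind the whisper.cpp README lists for each model size; treat them as assumptions and check the current README before relying on them.

```shell
# Pick the largest Whisper model size that fits a memory budget given in MB.
# The thresholds are rough assumptions, not authoritative requirements.
pick_model() {
    local budget_mb="$1"
    if   [ "$budget_mb" -ge 3900 ]; then echo "large"
    elif [ "$budget_mb" -ge 2100 ]; then echo "medium"
    elif [ "$budget_mb" -ge 852 ];  then echo "small"
    elif [ "$budget_mb" -ge 388 ];  then echo "base"
    else echo "tiny"
    fi
}

# Usage: pick_model 1000
```

The returned name would still have to be mapped to whichever model file you actually downloaded.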

It would be super cool to have automagic real-time subtitles in JRiver!

zybex · « **Reply #3 on:** January 25, 2024, 09:54:23 am »

Faster-Whisper has some bold claims:
This implementation is up to 4 times faster than openai/whisper for the same accuracy while using less memory. The efficiency can be further improved with 8-bit quantization on both CPU and GPU.

mwillems · « **Reply #4 on:** January 25, 2024, 10:06:47 am »

Quote from: zybex on January 25, 2024, 09:54:23 am

Faster-Whisper has some bold claims:
This implementation is up to 4 times faster than openai/whisper for the same accuracy while using less memory. The efficiency can be further improved with 8-bit quantization on both CPU and GPU.

Those are some impressive claims, but I notice there are no actual accuracy benchmarks in their benchmark tables; they just say it's the "same"? It would be more reassuring to see actual benchmarks for that, as I'm much more concerned about accuracy than speed for my use case. In other AI contexts, different quantization or implementation options sometimes make a big difference even with the same base models. It looks like there's no AMD/ROCm support either, which is a bummer.

Pretty cool though, regardless!

zybex · « **Reply #5 on:** January 25, 2024, 10:17:35 am »

It may be exactly the "same", as apparently it's just using optimized code and accelerated libs, while the underlying algorithm and models are unchanged:
https://github.com/ggerganov/whisper.cpp/issues/1127#issuecomment-1653107972

mwillems · « **Reply #6 on:** January 25, 2024, 12:47:39 pm »

Quote from: zybex on January 25, 2024, 10:17:35 am

It may be exactly the "same", as apparently it's just using optimized code and accelerated libs, while the underlying algorithm and models are unchanged:
https://github.com/ggerganov/whisper.cpp/issues/1127#issuecomment-1653107972

That's some interesting discussion, and reading through the links led me to some Emacs integration for Whisper, which I didn't know existed yet. I'll be trying that out tonight.

I feel like everything has been moving pretty fast in computing over the past few years, and lots of very interesting stuff is happening at an incredible pace. Computing is starting to feel really exciting and fun to me again, like in the early days of computing.

Hendrik · « **Reply #7 on:** January 30, 2024, 04:02:33 am »

I looked at that real-time version, and let's just say I have questions. It's probably fine for live recordings without huge SFX or music, and for immediate transcription and translation; however, if you already have a pre-recorded file, the quality might be so much better if you can process entire sentences, which is tricky in real time.

Offline processing into subtitle files might be a better first goal, as the tech still evolves.

zybex · « **Reply #8 on:** January 30, 2024, 04:15:49 am »

I think the audio stream would need to be processed a few seconds ahead of video. Perhaps even fire up a separate thread to preprocess the stream and use the output as it's needed for display - the thread would ideally stay at least 30s ahead or so. The first-play result could then be kept as an SRT, no need to keep doing it.

This may be tricky, but perhaps doable with two splitter chains, as if playing the file twice simultaneously?

Quote

Offline processing into subtitle files might be a better first goal, as the tech still evolves.

Right.

mwillems · « **Reply #9 on:** January 30, 2024, 08:50:29 am »

Quote from: Hendrik on January 30, 2024, 04:02:33 am

I looked at that real-time version, and let's just say I have questions. It's probably fine for live recordings without huge SFX or music, and for immediate transcription and translation; however, if you already have a pre-recorded file, the quality might be so much better if you can process entire sentences, which is tricky in real time.

Offline processing into subtitle files might be a better first goal, as the tech still evolves.

You can simulate this with pre-recorded files by changing the context window (i.e. the --max-context flag). It's definitely true that reducing the context window affects how well it transcribes, although it doesn't seem to need a ton of context. I get pretty good results with just 8 tokens of context, which is roughly 8 words (give or take), and going up to 16 doesn't improve things very much in my anecdotal testing. Going down to 0 or 1 tokens really, really hurts accuracy, though. You can experiment to see how much context you need for acceptable accuracy, although I will note that some of the Whisper models can get caught in a loop with higher contexts for some reason (the v3 models are particularly susceptible to this), so you probably want the lowest context that delivers acceptable accuracy.
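
One way to do that tuning is to transcribe the same WAV file several times with different context sizes and compare the resulting SRT files side by side. This is only a sketch: `sweep_context`, `WHISPER_BIN` and `WHISPER_MODEL` are made-up names with placeholder paths, and it relies on whisper.cpp's -of option to give each run its own output file.

```shell
# Sketch: transcribe one WAV with several --max-context values for comparison.
# WHISPER_BIN and WHISPER_MODEL are placeholders; set them to your own paths.
WHISPER_BIN="${WHISPER_BIN:-/path/to/whisper.cpp/main}"
WHISPER_MODEL="${WHISPER_MODEL:-/path/to/downloaded/whisper/model.bin}"

sweep_context() {
    local wav="$1"; shift
    local mc
    for mc in "$@"; do
        # -of takes the output path without extension, so each run writes <wav>.mc<N>.srt
        "$WHISPER_BIN" -m "$WHISPER_MODEL" --max-context "$mc" -osrt \
            -of "${wav}.mc${mc}" -f "$wav" || return 1
    done
}

# Usage: sweep_context /tmp/audio.wav 0 8 16
```

Diffing the outputs (or just skimming a scene you know well) should show where accuracy drops off and where repetition loops start.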

Quote from: zybex on January 30, 2024, 04:15:49 am

I think the audio stream would need to be processed a few seconds ahead of video. Perhaps even fire up a separate thread to preprocess the stream and use the output as it's needed for display - the thread would ideally stay at least 30s ahead or so. The first-play result could then be kept as an SRT, no need to keep doing it.

This may be tricky, but perhaps doable with two splitter chains, as if playing the file twice simultaneously?

Right.

Just stripping out the audio for separate processing (not doing anything to it) runs at about 25x real-time for me on fairly beefy hardware with nothing else running. Add in the transcription and it's still much faster than real-time on my setup. So doing it live in separate threads should be doable, at least on good hardware (or with fast disks/network). I suspect that lower-end hardware (or slower disks) might struggle.
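
If you want to check the same thing on your own hardware, the "Nx real-time" figure is just the film's duration divided by the wall-clock time the step took. A minimal sketch, assuming ffmpeg and ffprobe are on your PATH; the function names are made up for illustration, and bash's SECONDS counter only gives whole-second resolution.

```shell
# Speed relative to real time: media duration divided by processing time (both in seconds).
realtime_factor() {
    awk -v d="$1" -v e="$2" 'BEGIN { printf "%.1f\n", d / e }'
}

# Time the audio extraction for one film and print the real-time factor.
measure_extraction() {
    local film="$1" workdir duration start elapsed
    workdir="$(mktemp -d)" || return 1
    # Ask ffprobe for the container duration in seconds.
    duration="$(ffprobe -v error -show_entries format=duration \
        -of default=noprint_wrappers=1:nokey=1 "$film")" || { rm -rf "$workdir"; return 1; }
    start=$SECONDS
    ffmpeg -nostdin -y -loglevel error -i "$film" -ar 16000 -ac 2 -c:a pcm_s16le "$workdir/audio.wav"
    elapsed=$(( SECONDS - start ))
    rm -rf "$workdir"
    # Avoid dividing by zero when extraction takes under a second.
    [ "$elapsed" -gt 0 ] || elapsed=1
    realtime_factor "$duration" "$elapsed"
}

# Usage: measure_extraction "/path/to/film.mkv"
```

For example, `realtime_factor 5400 216` (a 90-minute film extracted in 216 seconds) prints 25.0.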

Probably the easiest near-term way to integrate would be to add it to audio analysis and do it on import if no subtitle track is detected (or add it as a checkbox, etc.).
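
Outside of JRiver, the "only if no subtitles are detected" check can be approximated in a script. This is a sketch under assumptions: the function names are hypothetical, ffprobe needs to be on your PATH, and the sidecar check only looks for film.mkv.srt (what the script earlier in this thread writes) and film.srt.

```shell
# True if a sidecar .srt already sits next to the film,
# named either film.mkv.srt or film.srt.
has_sidecar_srt() {
    local film="$1" dir base
    dir="$(dirname -- "$film")"
    base="$(basename -- "$film")"
    [ -f "${film}.srt" ] || [ -f "${dir}/${base%.*}.srt" ]
}

# True if the container holds at least one embedded subtitle stream.
has_embedded_subs() {
    [ -n "$(ffprobe -v error -select_streams s \
        -show_entries stream=index -of csv=p=0 "$1" 2>/dev/null)" ]
}

# True only when neither kind of subtitle was found.
needs_subtitles() {
    ! has_sidecar_srt "$1" && ! has_embedded_subs "$1"
}

# Usage: needs_subtitles "/path/to/film.mkv" && echo "generate subtitles"
```

A batch job over a film directory could call `needs_subtitles` on each file and only run the transcription where it returns true.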

bob · « **Reply #10 on:** February 01, 2024, 08:22:37 pm »

Quote from: mwillems on January 30, 2024, 08:50:29 am

...
Probably the easiest near term way to integrate would be to add it to audio analysis and do it on import if no subtitle track is detected (or add it as a checkbox, etc.).

This sounds like a cool idea.
