Topic: Auto-Generating Subtitles for Video at Home with Whisper.cpp (Read 8741 times)

mwillems · « **on:** January 25, 2024, 08:32:07 am »

So I recently discovered that you can now auto-generate pretty good subtitles for films offline at home using open source software. This was really exciting for me as I watch a lot of old or obscure films for which subtitles aren't available either on the DVD or online, and I often watch while on a noisy exercise machine, so I really need subtitles for intelligibility. OpenAI (of ChatGPT fame) makes their Whisper speech-to-text engine available as open source, and there's now open source tooling to let you use it at home offline. I'm using whisper.cpp: https://github.com/ggerganov/whisper.cpp. It's much easier to install on Linux than Windows, but there are installation instructions for Windows too. Once you've got it set up, you just feed it appropriately formatted audio and it spits out subtitles. You can even tell it to output in .srt format!

I've watched about six or seven films with these auto subtitles at this point. The quality is surprisingly good: certainly not perfect, but I'd say better than 95% correct. When it has errors they're mostly of the "misheard words" variety. It will also occasionally get "stuck" on a line of dialog or musical cue and keep repeating it for a few seconds, but you can mitigate that with settings. The only real problem I haven't solved yet is that sometimes subtitles precede their dialog by a few seconds, but they stay on the screen until the dialog actually happens so there's no confusion.

Once I got everything set up, it's basically just a little three-line script to generate subtitles for a film and drop them in the film directory where JRiver can pick them up. It's like magic!

Here's my (quite crude) script; I pass it the path to the film as a parameter. The ffmpeg line is there because whisper.cpp needs a specifically formatted WAV file as input. The parameters passed to whisper.cpp tell it to output in srt format and also significantly reduce the issues with it repeating (by reducing the context window, which slightly hurts accuracy but dramatically improves the occasional repetition issue):

Code:

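#!/bin/bash
# Usage: pass the path to the film as the first argument.

# whisper.cpp needs 16 kHz, 16-bit PCM WAV input; -y overwrites a leftover /tmp/audio.wav
ffmpeg -y -i "$1" -ar 16000 -ac 2 -c:a pcm_s16le /tmp/audio.wav
# -osrt writes /tmp/audio.wav.srt; --max-context limits the context window, -et is the entropy threshold
/path/to/whisper.cpp/main -m /path/to/downloaded/whisper/model.bin -t 1 --max-context 8 -et 2.8 -osrt -f /tmp/audio.wav
# Drop the subtitles next to the film
cp /tmp/audio.wav.srt "$1.srt"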
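
For anyone adapting the script, here is a slightly more defensive sketch of the same three steps. It uses a private temp directory instead of a fixed /tmp/audio.wav and stops if a step fails. The function names (run, srt_path_for, make_subs), the DRY_RUN switch and the WHISPER_BIN / WHISPER_MODEL variables are made up for illustration, and naming the output Movie.srt instead of Movie.mkv.srt is just one possible choice. The ffmpeg and whisper.cpp options are unchanged from the script above.

```shell
#!/bin/bash
# Sketch only: same ffmpeg and whisper.cpp options as the script above.
# WHISPER_BIN, WHISPER_MODEL, DRY_RUN and the function names are illustrative assumptions.
WHISPER_BIN="${WHISPER_BIN:-/path/to/whisper.cpp/main}"
WHISPER_MODEL="${WHISPER_MODEL:-/path/to/downloaded/whisper/model.bin}"

# Print the command instead of running it when DRY_RUN is set.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# /films/Movie.mkv -> /films/Movie.srt
# (swaps the last extension for .srt; assumes the file name has an extension)
srt_path_for() {
    printf '%s.srt\n' "${1%.*}"
}

make_subs() {
    local film work status
    film="$1"
    work="$(mktemp -d)" || return 1
    # Each step only runs if the previous one succeeded.
    run ffmpeg -y -i "$film" -ar 16000 -ac 2 -c:a pcm_s16le "$work/audio.wav" &&
    run "$WHISPER_BIN" -m "$WHISPER_MODEL" -t 1 --max-context 8 -et 2.8 \
        -osrt -f "$work/audio.wav" &&
    run cp "$work/audio.wav.srt" "$(srt_path_for "$film")"
    status=$?
    rm -rf "$work"
    return "$status"
}

# Only run when a film path is given, so the file can also be sourced.
if [ "$#" -ge 1 ]; then
    make_subs "$1"
fi
```

Running it with DRY_RUN=1 set in the environment prints the three commands without executing them, which is handy for checking paths before committing to a long transcription.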

I hope someone else might find this as exciting as I did. Note that it will go much faster if you have a beefy video card and build whisper.cpp with appropriate acceleration options.

Also, I'm not sure I posted this in the right forum, but I figured the script was Linux-centric so I dropped it here. Feel free to move the post somewhere else if it's in the wrong spot!

zybex · « **Reply #1 on:** January 25, 2024, 09:37:26 am »

Thanks, this is interesting!

I guess this is coming fast:
https://github.com/mldljyh/whisper_real_time_translation

Any plans to eventually integrate this into a DirectShow/VobSub filter @hendrik ? It's a really cool usage for AI, perhaps a good entry point for MC in that space.

mwillems · « **Reply #2 on:** January 25, 2024, 09:44:04 am »

Quote from: zybex on January 25, 2024, 09:37:26 am

Thanks, this is interesting!

I guess this is coming fast:
https://github.com/mldljyh/whisper_real_time_translation

Any plans to eventually integrate this into a DirectShow/VobSub filter @hendrik ? It's a really cool usage for AI, perhaps a good entry point for MC in that space.

Whisper.cpp also allows for streaming in audio for real-time transcription, but I haven't tried to use that feature yet so I don't know how well it works. My current video card does transcription much faster than real-time with the largest Whisper models, so it should be doable. I can imagine that integrated graphics or CPU-based transcription might need to use a smaller model (the core Whisper model comes in various sizes with different memory requirements), which seems to be what the project you linked is doing (picking one of the smaller Whisper models and bundling it).
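
If you want to try a smaller model, the whisper.cpp repo ships a download script, models/download-ggml-model.sh, that fetches a model by name into its models directory as ggml-<name>.bin. Below is a small sketch of switching sizes. WHISPER_DIR and the model_path / ensure_model helpers are names I made up, and it's worth checking the repo README for the current list of model names.

```shell
#!/bin/bash
# Hypothetical location of the whisper.cpp checkout.
WHISPER_DIR="${WHISPER_DIR:-/path/to/whisper.cpp}"

# Map a model name (e.g. tiny, base, small, medium) to the file the
# download script writes: models/ggml-<name>.bin
model_path() {
    printf '%s/models/ggml-%s.bin\n' "$WHISPER_DIR" "$1"
}

# Download the model once if it is not there yet, then print its path.
ensure_model() {
    local path
    path="$(model_path "$1")"
    if [ ! -f "$path" ]; then
        # Send download chatter to stderr so stdout carries only the path.
        bash "$WHISPER_DIR/models/download-ggml-model.sh" "$1" >&2 || return 1
    fi
    printf '%s\n' "$path"
}
```

With that in place, the -m argument in the script can become -m "$(ensure_model small)", and swapping sizes to compare speed and accuracy is a one-word change.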

It would be super cool to have automagic real-time subtitles in JRiver!

zybex · « **Reply #3 on:** January 25, 2024, 09:54:23 am »

Faster-Whisper has some bold claims:
This implementation is up to 4 times faster than openai/whisper for the same accuracy while using less memory. The efficiency can be further improved with 8-bit quantization on both CPU and GPU.

mwillems · « **Reply #4 on:** January 25, 2024, 10:06:47 am »

Quote from: zybex on January 25, 2024, 09:54:23 am

Faster-Whisper has some bold claims:
This implementation is up to 4 times faster than openai/whisper for the same accuracy while using less memory. The efficiency can be further improved with 8-bit quantization on both CPU and GPU.

Those are some impressive claims, but I notice there are no actual benchmarks for accuracy in their benchmark tables; they just say it's the "same"? It would be more reassuring to see actual benchmarks for that, as I'm much more concerned about accuracy than speed for my use case. In other AI contexts, different quantization or implementation options sometimes make a big difference even with the same base models. It looks like there's no AMD/ROCm support either, which is a bummer.

Pretty cool though, regardless!

zybex · « **Reply #5 on:** January 25, 2024, 10:17:35 am »

It may be exactly "same", as apparently it's just using optimized code and accelerated libs, but the underlying algorithm and models are the same:
https://github.com/ggerganov/whisper.cpp/issues/1127#issuecomment-1653107972

mwillems · « **Reply #6 on:** January 25, 2024, 12:47:39 pm »

Quote from: zybex on January 25, 2024, 10:17:35 am

It may be exactly "same", as apparently it's just using optimized code and accelerated libs, but the underlying algorithm and models are the same:
https://github.com/ggerganov/whisper.cpp/issues/1127#issuecomment-1653107972

That's some interesting discussion, and reading through the links led me to some Emacs integration for Whisper, which I didn't know existed yet. I'll be trying that out tonight.

I feel like everything has been moving pretty fast in computing the past few years, and lots of very interesting stuff is happening at an incredible pace. Computing is starting to feel really exciting and fun to me again, like in the early days of computing.

Hendrik · « **Reply #7 on:** January 30, 2024, 04:02:33 am »

I looked at that real-time version, and let's just say I have questions. It's probably fine for live recordings without huge SFX or music that need immediate transcription and translation. However, if you already have a pre-recorded file, the quality might be so much better if you can process entire sentences, which is tricky in real time.

Offline processing into subtitle files might be a better first goal, as the tech still evolves.

zybex · « **Reply #8 on:** January 30, 2024, 04:15:49 am »

I think the audio stream would need to be processed a few seconds ahead of video. Perhaps even fire up a separate thread to preprocess the stream and use the output as it's needed for display - the thread would ideally stay at least 30s ahead or so. The first-play result could then be kept as an SRT, no need to keep doing it.

This may be tricky, but perhaps doable with two splitter chains, as if playing the file twice simultaneously?

Quote

Offline processing into subtitle files might be a better first goal, as the tech still evolves.

Right.

mwillems · « **Reply #9 on:** January 30, 2024, 08:50:29 am »

Quote from: Hendrik on January 30, 2024, 04:02:33 am

I looked at that real-time version, and let's just say I have questions. It's probably fine for live recordings without huge SFX or music that need immediate transcription and translation. However, if you already have a pre-recorded file, the quality might be so much better if you can process entire sentences, which is tricky in real time.

Offline processing into subtitle files might be a better first goal, as the tech still evolves.

You can simulate this with pre-recorded files by changing the context window (i.e. the --max-context flag). It's definitely true that reducing the context window affects how well it transcribes, although it doesn't seem to need a ton of context. I get pretty good results with just 8 tokens of context, which is roughly 8 words (give or take), and going up to 16 doesn't improve things very much in my anecdotal testing. Going down to 0 or 1 tokens really, really hurts accuracy, though. You can experiment to see how much context you need for acceptable accuracy, although I will note that some of the Whisper models can get caught in a loop with higher contexts for some reason (the v3 models are particularly susceptible to this), so you probably want the lowest context that delivers acceptable accuracy.
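
If you want to run that comparison yourself, something like the following sketch transcribes the same WAV file at several context sizes and writes one .srt per setting. It relies on whisper.cpp's -of option (output path without the extension). The context_sweep function, the DRY_RUN switch and the WHISPER_BIN / WHISPER_MODEL variables are names I made up for the example, and the other flags are the ones from my script.

```shell
#!/bin/bash
# Sketch: transcribe the same WAV with several --max-context values.
# WHISPER_BIN, WHISPER_MODEL, DRY_RUN and context_sweep are illustrative names.
WHISPER_BIN="${WHISPER_BIN:-/path/to/whisper.cpp/main}"
WHISPER_MODEL="${WHISPER_MODEL:-/path/to/downloaded/whisper/model.bin}"

# Usage: context_sweep /tmp/audio.wav 0 8 16
# Writes /tmp/audio.ctx0.srt, /tmp/audio.ctx8.srt, /tmp/audio.ctx16.srt
context_sweep() {
    local wav="$1" ctx
    shift
    for ctx in "$@"; do
        if [ -n "${DRY_RUN:-}" ]; then
            # Only show what would be run.
            echo "$WHISPER_BIN -m $WHISPER_MODEL -t 1 --max-context $ctx -et 2.8 -osrt -of ${wav%.wav}.ctx$ctx -f $wav"
        else
            "$WHISPER_BIN" -m "$WHISPER_MODEL" -t 1 --max-context "$ctx" -et 2.8 \
                -osrt -of "${wav%.wav}.ctx$ctx" -f "$wav" || return 1
        fi
    done
}
```

Afterwards a plain diff between, say, /tmp/audio.ctx8.srt and /tmp/audio.ctx16.srt shows where the extra context actually changed the transcription, and a quick scan for repeated lines shows which settings got stuck in a loop.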

Quote from: zybex on January 30, 2024, 04:15:49 am

I think the audio stream would need to be processed a few seconds ahead of video. Perhaps even fire up a separate thread to preprocess the stream and use the output as it's needed for display - the thread would ideally stay at least 30s ahead or so. The first-play result could then be kept as an SRT, no need to keep doing it.

This may be tricky, but perhaps doable with two splitter chains, as if playing the file twice simultaneously?

Right.

Just stripping out the audio for separate processing (not doing anything to it) runs at about 25x realtime for me on fairly beefy hardware with nothing else running. Add in the processing and it's still much faster than realtime on my setup. So doing separate threads live should be doable, at least on good hardware (or fast disks/network). I suspect that lower end hardware (or slower disks) might struggle.

Probably the easiest near term way to integrate would be to add it to audio analysis and do it on import if no subtitle track is detected (or add it as a checkbox, etc.).
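
Until something like that exists in MC, the "only if no subtitle track is detected" part can be approximated from a script. This is a rough sketch, not anything MC does: needs_subtitles is a made-up helper that looks for a sidecar .srt next to the film and then asks ffprobe (which ships with ffmpeg) whether the file has any embedded subtitle streams.

```shell
#!/bin/bash
# Sketch: decide whether a film still needs generated subtitles.
# needs_subtitles is a made-up helper name.

# Returns 0 (true) when there is no sidecar .srt and no embedded subtitle stream.
needs_subtitles() {
    local film="$1" streams
    # Sidecar next to the film: Movie.srt or Movie.mkv.srt
    if [ -f "${film%.*}.srt" ] || [ -f "$film.srt" ]; then
        return 1
    fi
    # One line per embedded subtitle stream; empty if there are none.
    # Note: also empty if ffprobe is missing or cannot read the file.
    streams="$(ffprobe -v error -select_streams s \
        -show_entries stream=index -of csv=p=0 "$film" 2>/dev/null)"
    [ -z "$streams" ]
}
```

A loop over the film directory that calls needs_subtitles on each file and only runs the transcription script when it returns true would then skip anything that already has subtitles, so repeat runs are cheap.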

bob · « **Reply #10 on:** February 01, 2024, 08:22:37 pm »

Quote from: mwillems on January 30, 2024, 08:50:29 am

...
Probably the easiest near term way to integrate would be to add it to audio analysis and do it on import if no subtitle track is detected (or add it as a checkbox, etc.).

This sounds like a cool idea.

mwillems · « **Reply #11 on:** January 12, 2025, 07:20:31 pm »

It looks like someone else had the same idea as me: VLC media player is going to be adding offline real-time generated subtitles, most likely using some kind of a local whisper implementation.

https://www.pcmag.com/news/vlc-media-player-to-use-ai-to-generate-subtitles-for-videos

Doc4 · « **Reply #12 on:** January 13, 2025, 01:20:00 pm »

Yeah, coming off that news, it'd be really cool to have a feature like this in MC. Going off what Hendrik said, I can confirm that Whisper is much better with full sentences, especially in certain languages, and often fails on single words in languages that don't use the Latin script. JP is especially bad since many different words sound the same and how they are written is derived from sentence context.

All the machines I happen to run MC on should in theory be performant enough to pre-process the audio track for subtitle generation, but I think having this feature in any form would be great.

lepa · « **Reply #13 on:** January 13, 2025, 01:33:29 pm »

How about real time or timed lyrics?

zybex · « **Reply #14 on:** January 13, 2025, 03:18:59 pm »

Real time lyrics, good luck with that!
https://www.youtube.com/watch?v=iYtBMgLfqKQ

maybe AI can tell what he's saying there

lepa · « **Reply #15 on:** January 13, 2025, 03:41:20 pm »

Looks like what I heard also

mwillems · « **Reply #16 on:** January 13, 2025, 06:07:18 pm »

So on a lark I ran Whisper over Yellow Ledbetter, and.... well.... Whisper may be pretty good at spoken words, but it's definitely not ready for Eddie Vedder:

Quote

[00:00:29.000 --> 00:00:34.000] ♪ I wanna see that, I wanna push that letter ♪
[00:00:34.000 --> 00:00:42.000] ♪ I said, then you said, I wanna leave it again ♪
[00:00:42.000 --> 00:00:48.000] ♪ Once I saw him, on a piece of weather ♪
[00:00:48.000 --> 00:00:56.000] ♪ I said, I wanna say, I wanna leave it again ♪
[00:00:56.000 --> 00:01:02.000] ♪ I wanna be there, I wanna wish it, I wanna wait ♪
[00:01:02.000 --> 00:01:06.000] ♪ In the cold, I wanna sit, I wanna go, I wanna sit ♪
[00:01:06.000 --> 00:01:10.000] ♪ In the cold, I wanna be there ♪
[00:01:10.000 --> 00:01:17.000] ♪ And I wish that, I wanna leave her 'cause I know ♪
[00:01:17.000 --> 00:01:26.000] ♪ I said I know what I wear, I got a box on my back, oh yeah ♪
[00:01:26.000 --> 00:01:31.000] ♪ Can you see them out on the porch? ♪
[00:01:31.000 --> 00:01:36.000] ♪ Yeah, but they don't wave ♪
[00:01:36.000 --> 00:01:43.000] ♪ I see them round the front way, yeah ♪
[00:01:43.000 --> 00:01:48.000] ♪ And I know, and I know, I don't want to stay ♪
[00:01:48.000 --> 00:01:50.240] Make me cry.

Some of those words are correct, especially towards the end.

The real lyrics for reference: https://www.azlyrics.com/lyrics/pearljam/yellowledbetter.html
