Auto-Generating Subtitles for Video at Home with Whisper.cpp
zybex:
It may be exactly the "same": apparently it's just using optimized code and accelerated libs, but the underlying algorithm and models are the same:
https://github.com/ggerganov/whisper.cpp/issues/1127#issuecomment-1653107972
mwillems:
--- Quote from: zybex on January 25, 2024, 10:17:35 am ---It may be exactly the "same": apparently it's just using optimized code and accelerated libs, but the underlying algorithm and models are the same:
https://github.com/ggerganov/whisper.cpp/issues/1127#issuecomment-1653107972
--- End quote ---
That's some interesting discussion and reading through the links led me to some emacs integration for whisper, which I didn't know existed yet. I'll be trying that out tonight.
I feel like everything in computing has been moving pretty fast the past few years, and lots of very interesting stuff is happening at an incredible pace. Computing is starting to feel really exciting and fun to me again, like in the early days of computing ;D
Hendrik:
I looked at that real-time version, and let's just say I have questions. It's probably fine for live recordings without huge SFX or music, where you need immediate transcription and translation; however, if you already have a pre-recorded file, the quality might be so much better if you can process entire sentences, which is tricky in real time.
Offline processing into subtitle files might be a better first goal, as the tech still evolves.
zybex:
I think the audio stream would need to be processed a few seconds ahead of video. Perhaps even fire up a separate thread to preprocess the stream and use the output as it's needed for display - the thread would ideally stay at least 30s ahead or so. The first-play result could then be kept as an SRT, no need to keep doing it.
This may be tricky, but perhaps doable with two splitter chains, as if playing the file twice simultaneously?
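Something like the following toy Python sketch, assuming simple chunked processing rather than a real whisper.cpp binding or MC's splitter chains - transcribe() is just a placeholder, and the chunk size, lead target, and playback clock are made up for illustration:

--- Code: ---
#!/usr/bin/env python3
"""Toy sketch of the "stay ~30s ahead of playback" idea: a worker thread
transcribes fixed-size audio chunks into a queue, and the player side only
shows a chunk's text once the playback clock reaches it."""

import queue
import threading
import time

CHUNK_SECONDS = 10      # transcribe the stream in 10-second slices
LEAD_TARGET = 30        # try to stay at least ~30 s ahead of playback
TOTAL_CHUNKS = 6        # pretend the file is 60 s long

results = queue.Queue() # (start_sec, text) pairs, in playback order

start_time = time.time()

def playback_position():
    # Stand-in for the real playback clock reported by the player.
    return time.time() - start_time

def transcribe(chunk_index):
    # Placeholder: feed this slice's PCM to whisper.cpp and return its text.
    time.sleep(0.5)
    return f"[recognized text for chunk {chunk_index}]"

def worker():
    for i in range(TOTAL_CHUNKS):
        # Don't race too far ahead; wait while we are > LEAD_TARGET in front.
        while i * CHUNK_SECONDS - playback_position() > LEAD_TARGET:
            time.sleep(0.25)
        results.put((i * CHUNK_SECONDS, transcribe(i)))

threading.Thread(target=worker, daemon=True).start()

# "Player" side: display each chunk's text when its start time comes up.
for _ in range(TOTAL_CHUNKS):
    start_sec, text = results.get()
    while playback_position() < start_sec:
        time.sleep(0.1)
    print(f"{start_sec:>3d}s  {text}")
--- End code ---

The same first pass could also write each (start, end, text) entry to an SRT next to the file, so later plays skip the transcription entirely.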
--- Quote ---Offline processing into subtitle files might be a better first goal, as the tech still evolves.
--- End quote ---
Right.
mwillems:
--- Quote from: Hendrik on January 30, 2024, 04:02:33 am ---I looked at that real-time version, and let's just say I have questions. It's probably fine for live recordings without huge SFX or music, where you need immediate transcription and translation; however, if you already have a pre-recorded file, the quality might be so much better if you can process entire sentences, which is tricky in real time.
Offline processing into subtitle files might be a better first goal, as the tech still evolves.
--- End quote ---
You can simulate this with pre-recorded files by changing the context window (i.e. the --max-context flag). It's definitely true that reducing the context window affects how well it transcribes, although it doesn't seem to need a ton of context. I get pretty good results with just 8 tokens of context, which is roughly 8 words, and going up to 16 doesn't improve things very much in my anecdotal testing. Going down to 0 or 1 tokens really hurts accuracy, though. You can experiment to see how much context you need for acceptable accuracy, although I will note that some of the Whisper models can get caught in a loop with higher contexts for some reason (the v3 models are particularly susceptible to this), so you probably want the lowest context that delivers acceptable accuracy.
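For anyone who wants to try that tuning, here's a rough Python sketch that re-runs the same clip at a few --max-context values and writes an SRT for each so they can be compared. The binary name, model path, and output flags are assumptions about a typical whisper.cpp build (check --help on yours); --max-context is the flag discussed above:

--- Code: ---
#!/usr/bin/env python3
"""Re-run the same clip at several --max-context values to see where
accuracy stops improving."""

import subprocess

WHISPER_BIN = "./main"                 # assumption: your whisper.cpp CLI binary
MODEL = "models/ggml-medium.en.bin"    # assumption: whichever ggml model you use
AUDIO = "clip.wav"                     # a 16 kHz mono WAV test clip

for ctx in (0, 1, 8, 16):
    out_stem = f"clip-ctx{ctx}"
    subprocess.run(
        [WHISPER_BIN, "-m", MODEL, "-f", AUDIO,
         "--max-context", str(ctx),    # tokens of prior text carried forward
         "--output-srt",               # emit an SRT for side-by-side comparison
         "--output-file", out_stem],
        check=True,
    )
    print(f"max-context={ctx}: wrote {out_stem}.srt")
--- End code ---

Putting the resulting SRTs side by side on a dialogue-heavy clip should make the effect of context fairly obvious.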
--- Quote from: zybex on January 30, 2024, 04:15:49 am ---I think the audio stream would need to be processed a few seconds ahead of video. Perhaps even fire up a separate thread to preprocess the stream and use the output as it's needed for display - the thread would ideally stay at least 30s ahead or so. The first-play result could then be kept as an SRT, no need to keep doing it.
This may be tricky, but perhaps doable with two splitter chains, as if playing the file twice simultaneously?
Right.
--- End quote ---
Just stripping out the audio for separate processing (not doing anything else to it) runs at about 25x realtime for me on fairly beefy hardware with nothing else going on. Add in the processing and it's still much faster than realtime on my setup. So doing separate threads live should be doable, at least on good hardware (or with fast disks/network). I suspect that lower-end hardware (or slower disks) might struggle.
Probably the easiest near-term way to integrate would be to add it to audio analysis and do it on import if no subtitle track is detected (or add it as a checkbox, etc.).
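As a rough standalone illustration of that import hook (nothing MC-specific), something like this could run as a post-import step. It leans on ffprobe/ffmpeg to detect subtitle streams and strip the audio to the 16 kHz mono WAV whisper.cpp expects, and the binary/model paths and whisper flags are again assumptions about a typical build:

--- Code: ---
#!/usr/bin/env python3
"""If a video has no embedded subtitle stream, strip its audio and write
an SRT next to the file with whisper.cpp. Usage: python auto_sub.py movie.mkv"""

import subprocess
import sys
from pathlib import Path

WHISPER_BIN = "./main"                 # assumption: your whisper.cpp CLI binary
MODEL = "models/ggml-medium.en.bin"    # assumption: whichever ggml model you use

def has_subtitle_stream(video: Path) -> bool:
    # ffprobe prints the index of every subtitle stream; empty output = none.
    out = subprocess.run(
        ["ffprobe", "-v", "error", "-select_streams", "s",
         "-show_entries", "stream=index", "-of", "csv=p=0", str(video)],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    return bool(out)

def generate_srt(video: Path) -> Path:
    wav = video.with_suffix(".whisper.wav")
    srt_stem = video.with_suffix("")   # whisper.cpp appends .srt itself
    # Strip the audio track to 16 kHz mono PCM, which whisper.cpp wants.
    subprocess.run(
        ["ffmpeg", "-y", "-i", str(video), "-vn", "-ac", "1",
         "-ar", "16000", "-c:a", "pcm_s16le", str(wav)],
        check=True,
    )
    subprocess.run(
        [WHISPER_BIN, "-m", MODEL, "-f", str(wav),
         "--output-srt", "--output-file", str(srt_stem)],
        check=True,
    )
    wav.unlink()                       # done with the temporary audio
    return srt_stem.with_suffix(".srt")

if __name__ == "__main__":
    video = Path(sys.argv[1])
    if has_subtitle_stream(video):
        print(f"{video.name}: subtitle stream found, nothing to do")
    else:
        print(f"{video.name}: no subtitles, wrote {generate_srt(video)}")
--- End code ---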