Looking for a good whisper AI model for JP-JP transcription

Hope this is the right place to ask. I want to watch and listen to some YouTube videos for listening practice and sentence mining, but all the videos I find only have auto-generated subtitles or on-screen subtitles, and the on-screen ones are pretty much limited to VOICEROID or ゆっくり TTS videos. The auto-generated ones can be iffy.

I found out about Whisper AI but can't decide which implementation to use. I can only use it directly on Google Colab or Hugging Face, as I have no chance of running it locally, except maybe the smallest models on short audio files.

So far I found:

Fast subtitle maker on HF: allows large V3 but has no VAD filter.
Faster Whisper on GC: only goes up to large V2 but allows YouTube URLs, a VAD filter, etc.
WhisperX on GC: one implementation of WhisperX, which is apparently meant to be one of the best, but it also does word-level timestamps, which is probably more granular than I need.
WhisperX with diarization on GC: another implementation of WhisperX, but it uses Nvidia NeMo MSDD for speaker diarization, which I don't know if I'll need, as none of the subtitles I've seen do that.
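
For context, as far as I can tell the Faster Whisper option boils down to something like the sketch below. This assumes the `faster-whisper` Python package and a GPU runtime such as Colab's. The helper names `srt_timestamp`, `segments_to_srt`, and `transcribe_to_srt` are made up for illustration, and `audio.mp3` is a placeholder path. It writes segment-level timestamps (one per subtitle line) rather than word-level ones.

```python
def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    ms = int(round(seconds * 1000))
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments) -> str:
    """Build SRT text from objects that have .start, .end and .text."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        time_line = f"{srt_timestamp(seg.start)} --> {srt_timestamp(seg.end)}"
        blocks.append(f"{index}\n{time_line}\n{seg.text.strip()}\n")
    # A blank line separates consecutive subtitle entries.
    return "\n".join(blocks)


def transcribe_to_srt(audio_path: str, model_size: str = "large-v2") -> str:
    """Hypothetical helper: transcribe Japanese audio with a VAD filter."""
    # Imported here so the formatting helpers work without the package.
    from faster_whisper import WhisperModel

    # Assumption: a CUDA GPU is available, as on a Colab GPU runtime.
    model = WhisperModel(model_size, device="cuda", compute_type="float16")
    # language="ja" skips language detection; vad_filter=True drops
    # non-speech stretches before transcription.
    segments, _info = model.transcribe(audio_path, language="ja", vad_filter=True)
    return segments_to_srt(segments)


# Example use (needs faster-whisper and a GPU, so left commented out):
# with open("audio.srt", "w", encoding="utf-8") as f:
#     f.write(transcribe_to_srt("audio.mp3"))
```

Segment-level output like this should be enough for sentence mining, so the word-level timestamps WhisperX produces would only matter if I wanted karaoke-style highlighting.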

Not sure if there are any other implementations worth using, or which of these is the best, if anyone uses Whisper in this way. I'm planning on using it for videos ranging from a few minutes to 1-2 hours or more, mostly gaming videos but also some podcast stuff, plus an anime film I cannot find subs for anywhere.

by ajbjc
