Whether, why, and how to use Whisper to transcribe speech
Fri, Dec. 23rd, 2022 04:32 pm
Whisper, from OpenAI, is an open source speech recognition tool that also does translation. You can try it right now at https://replicate.com/openai/whisper or install it on your own computer to run privately. You provide an audio file, and it emits a text transcript as well as .srt and .vtt subtitle files.
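(If you install it locally, transcribing a file takes only a few lines with the Python API -- a minimal sketch, where the model choice and file name are placeholders:)

    import whisper

    # Load one of the pretrained models; "base" is small and fast,
    # while "small", "medium", and "large" trade speed for accuracy.
    model = whisper.load_model("base")

    # Transcribe an audio file (the path here is a placeholder).
    result = model.transcribe("my_recording.mp3")

    # The plain-text transcript, without timestamps:
    print(result["text"])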
This is a really useful (and free!) tool. I have started using it regularly to make transcripts and captions/subtitles, and I just wrote a blog post to share how, and why -- plus my reflections on the ethics of using it and similar tools trained using machine learning.
Note that it works on existing files, but does not work for live-transcribing an event as it's happening.
(no subject)
Date: 2022-12-23 09:46 pm (UTC)
I saw you posting about this on Mastodon, and I'm really of two minds about it. I wonder whether anyone has measured how many of the people who post craptions would actually have bothered to create real captions or transcripts without an automated tool -- and craptions as they exist today are generally considered worse than useless. You're a conscientious person and mention in your post that you clean them up, but so many people are willing to post the completely unedited auto-generated captions and call it a day.
On the other hand, if the people who are using auto-generated captions wouldn't have created captions or transcripts any other way, at least it's not a net negative. If they're bad they shouldn't count toward any kind of required accessibility, but it doesn't hurt if it's not replacing something better.
(Are they really that much better than the current state of speech recognition, though? Even the built-in speech recognition currently packaged for free in most operating systems does a really good job with random voices. Not good enough for true transcripts, but good.
I mean, I get the point of what you are doing in your blog post if you are also putting on timestamps that are suitable for SRT files, but you could do that with built-in speech recognition and not worry about the ethics of machine learning and large language models at all.)
(no subject)
Date: 2022-12-23 10:15 pm (UTC)
I'm on Debian Linux, and as far as I'm aware no built-in speech recognition software shipped with my operating system, so that's not available to me for comparison. I can say that the quality of Whisper's speech recognition is substantially superior to every other automated caption or transcript product I've ever witnessed. I speak with a conventional US accent, but commenters on MetaFilter found (as I recall) that Whisper was far better than a lot of other services at handling varied accents. I welcome quality comparisons from folks with access to the built-in speech recognition tools in their OSes!
I'm not sure I understand you. Maybe when you said "if you are also putting on" you meant "if Whisper is also putting on"?
By default, when I run Whisper, it outputs three files. One is a plain text file that doesn't have timestamps. The other two are the .srt and the .vtt subtitle files. Here's a chunk of one of the .srts:
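(A generic illustration of the format -- the numbering, timestamps, and wording below are invented, not my actual output:)

    1
    00:00:00,000 --> 00:00:05,000
    Hello and welcome. Today I want to talk about

    2
    00:00:05,000 --> 00:00:09,500
    how I make transcripts and captions for my audio.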
If there are built-in tools on Windows and macOS that create pretty good subtitle files when fed an audio file, that's awesome and I want to know about them!
Displacing good captions/transcripts with bad? Yeah, I don't know either. The individuals I see talking about Whisper are delightedly saying that, for the first time, they can have searchable transcripts for big masses of audio relevant to them (podcasts they listen to, audio notes-to-self, etc.), but of course I am not privy to institutional conversations where people might decide to reduce product quality by switching from a human captioner/transcriber to Whisper.
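(For what it's worth, making a pile of audio searchable takes only a few lines with the Python API -- a rough sketch, where the folder name and model choice are placeholders:)

    import pathlib
    import whisper

    model = whisper.load_model("base")

    # Transcribe every .mp3 in a folder and write a plain-text
    # transcript next to each file, so the whole collection can be
    # searched with grep or a desktop indexer.
    for audio in pathlib.Path("podcasts").glob("*.mp3"):
        result = model.transcribe(str(audio))
        audio.with_suffix(".txt").write_text(result["text"])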
I have very little knowledge of the built-in speech recognition tools on commercial OSes; are you saying they are not also trained via machine learning and large language models?