SoFunction
Updated on 2024-11-21

Speech recognition based on Python

1. Introduction

/openai/whisper


1.1 Introduction to whisper

Whisper is a general-purpose speech recognition model. It is trained on a large dataset of diverse audio and is also a multi-task model that performs multilingual speech recognition, speech translation, and language identification.


OpenAI open-sourced the Whisper neural network on September 21, 2022. It claims human-level speech recognition accuracy in English, and it also supports automatic speech recognition in 98 other languages. The automatic speech recognition (ASR) models provided by the Whisper system are trained to perform speech recognition and translation tasks, turning speech into text in a variety of languages and translating that text into English.

1.2 whisper model

Below are the names of the available models with their approximate memory requirements and inference speeds relative to the large model; actual speed may vary depending on many factors, including the available hardware.

Size    Parameters  English-only model  Multilingual model  Required VRAM  Relative speed
tiny    39 M        tiny.en             tiny                ~1 GB          ~32x
base    74 M        base.en             base                ~1 GB          ~16x
small   244 M       small.en            small               ~2 GB          ~6x
medium  769 M       medium.en           medium              ~5 GB          ~2x
large   1550 M      N/A                 large               ~10 GB         1x
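As a rough illustration, the table above can be encoded to pick the largest model that fits in a given amount of VRAM. The helper and its behavior are my own sketch, not part of Whisper:

```python
# Approximate VRAM requirements (GB), taken from the table above.
VRAM_GB = {"tiny": 1, "base": 1, "small": 2, "medium": 5, "large": 10}

def largest_model_for(vram_gb: float) -> str:
    """Return the biggest Whisper model whose approximate VRAM need fits."""
    candidates = [name for name, need in VRAM_GB.items() if need <= vram_gb]
    # Dict insertion order runs from smallest to largest model, so take the last fit.
    return candidates[-1] if candidates else "tiny"

print(largest_model_for(4))   # small
print(largest_model_for(12))  # large
```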

The selected model is downloaded automatically on first use and stored in a local cache.


2. Installation

2.1 whisper

pip install -U openai-whisper
# pip install git+/openai/ 
pip install --upgrade --no-deps --force-reinstall git+/openai/
pip install zhconv
pip3 install wheel

pip3 install torch torchvision torchaudio
# Note: without a proxy the download may be very slow; you can switch to a domestic mirror to speed it up
pip3 install torch torchvision torchaudio -i /simple


2.2 pytorch

/

Select the stable build, the Windows system, the pip installation method, the Python language, and the CPU version of the software.


pip3 install torch torchvision torchaudio

2.3 ffmpeg

/BtbN/FFmpeg-Builds/releases


After unzipping, find the executable (ffmpeg.exe) under the bin folder and copy it to a folder, say "D:\software\ffmpeg"; then add "D:\software\ffmpeg" to the system environment variable PATH.
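Since Whisper shells out to ffmpeg, it is worth verifying the PATH entry before running anything. A small stdlib-only check (the helper name is mine):

```python
import shutil

def on_path(program: str) -> bool:
    """Return True if `program` resolves to an executable on PATH."""
    return shutil.which(program) is not None

if on_path("ffmpeg"):
    print("ffmpeg found:", shutil.which("ffmpeg"))
else:
    print("ffmpeg not found - check the PATH environment variable")
```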

3. Testing

3.1 Command testing

whisper audio.mp3


The command whisper audio.mp3 above is the simplest form; it uses the small model for transcription by default. We can also use a higher-level model to improve accuracy. For example:

whisper audio.mp3 --model medium
whisper  --language Japanese
whisper chinese.mp4 --language Chinese --task translate
whisper  --model medium  --language Chinese

At the same time, five files are generated by default, with the same base name as your source file and the extensions .json, .srt, .tsv, .txt, and .vtt. Besides plain text, you can generate movie subtitles directly, or process the JSON output in your own applications.
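Because the five files share the source file's base name, their paths are easy to predict in a script. A minimal sketch (the helper name is mine):

```python
from pathlib import Path

FORMATS = ("json", "srt", "tsv", "txt", "vtt")

def expected_outputs(source: str, output_dir: str = ".") -> list:
    """Predict the transcript/subtitle files Whisper writes for `source`."""
    stem = Path(source).stem  # "audio.mp3" -> "audio"
    return [str(Path(output_dir) / ("%s.%s" % (stem, ext))) for ext in FORMATS]

print(expected_outputs("audio.mp3"))
```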


Commonly used parameters are listed below:

--task: specifies the mode. The default --task transcribe transcribes, while --task translate translates; currently only translation into English is supported.
--model: specifies the model to use; the default is --model small. Whisper also has English-only models (add .en to the name), which are faster.
--language: specifies the language of the transcription. By default the first 30 seconds are sampled to detect the language, but it is better to specify it explicitly, e.g. --language Chinese for Chinese.
--device: specifies hardware acceleration; auto is used by default. --device cuda uses the graphics card, cpu the CPU, and mps the Apple M1 chip.
--output_format: specifies the format of the generated subtitle files: txt, vtt, srt, tsv, json, or all; to specify more than one, wrap them in curly brackets {}. The default is all.
--output_dir: specifies the output directory for the subtitle files; the default is the current directory.
--fp16: defaults to True and uses 16-bit floating point for computation, which reduces compute and storage overhead at some cost in precision. The author's CPU does not support it, which produces an FP16 warning; set it to False (i.e., use 32-bit floating point) and the warning disappears.
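When driving Whisper from a script rather than a shell, these flags can be assembled into an argument list for subprocess. This helper is illustrative, not part of Whisper; pass its result to subprocess.run once ffmpeg and a model are available:

```python
def build_whisper_cmd(audio: str, model: str = "small", language: str = "",
                      task: str = "transcribe", output_dir: str = ".",
                      fp16: bool = False) -> list:
    """Assemble a `whisper` CLI invocation from the options described above."""
    cmd = ["whisper", audio, "--model", model, "--task", task,
           "--output_dir", output_dir, "--fp16", str(fp16)]
    if language:  # omit the flag to let Whisper auto-detect the language
        cmd += ["--language", language]
    return cmd

print(build_whisper_cmd("audio.mp3", model="medium", language="Chinese"))
```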


3.2 Code Test: Recognizing Sound Files

import whisper

if __name__ == '__main__':
    model = whisper.load_model("tiny")
    result = model.transcribe("audio.mp3", fp16=False, language="Chinese")
    print(result["text"])


3.3 Code Test: Real-time Recording Recognition

import whisper
import zhconv   # Converts between Traditional and Simplified Chinese
import wave     # Use the wave library to read and write .wav audio files
import pyaudio  # Use the pyaudio library to record, play, and generate .wav files


def record(time):  # Recording routine
    # Define the data stream parameters
    CHUNK = 1024  # Frames read per buffer (default 1024)
    FORMAT = pyaudio.paInt16  # 16-bit samples, the usual format for .wav files
    CHANNELS = 1  # Number of channels (mono; can be more, it is not fixed)
    RATE = 16000  # Sampling rate (samples per second)
    RECORD_SECONDS = time  # Recording duration
    WAVE_OUTPUT_FILENAME = "./output.wav"  # Path to save the audio (filename is illustrative)
    p = pyaudio.PyAudio()  # Create a PyAudio object
    stream = p.open(format=FORMAT,  # Sample format
                    channels=CHANNELS,  # Number of channels
                    rate=RATE,  # Sampling rate
                    input=True,  # True means this is an input (recording) stream
                    frames_per_buffer=CHUNK)  # Frames per buffer
    print("* recording")  # Start-of-recording marker
    frames = []  # Define frames as an empty list
    for i in range(0, int(RATE / CHUNK * RECORD_SECONDS)):  # Number of reads = samples per second / chunk size * recording time
        data = stream.read(CHUNK)  # Read one chunk of data
        frames.append(data)  # Save the chunk to the list
    print("* done recording")  # End-of-recording marker

    stream.stop_stream()  # Stop the input stream
    stream.close()  # Close the input stream
    p.terminate()  # Terminate PyAudio

    wf = wave.open(WAVE_OUTPUT_FILENAME, 'wb')  # Open the file as a binary write stream ('wb')
    wf.setnchannels(CHANNELS)  # Set the number of channels
    wf.setsampwidth(p.get_sample_size(FORMAT))  # Set the sample width to match FORMAT
    wf.setframerate(RATE)  # Set the frame rate to match RATE
    wf.writeframes(b''.join(frames))  # Write the sound data to the file
    wf.close()  # Data saved, file closed


if __name__ == '__main__':
    model = whisper.load_model("tiny")
    record(3)  # Recording time in seconds
    result = model.transcribe("./output.wav", language='Chinese', fp16=False)
    s = result["text"]
    s1 = zhconv.convert(s, 'zh-cn')  # Convert the result to Simplified Chinese
    print(s1)
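The loop count in record() follows from simple sampling arithmetic: at 16 000 samples per second, read in 1 024-sample chunks, a 3-second recording needs int(16000 / 1024 * 3) reads. Note the truncation means slightly less than the requested duration is captured:

```python
CHUNK = 1024   # samples per read
RATE = 16000   # samples per second
SECONDS = 3    # requested recording length

reads = int(RATE / CHUNK * SECONDS)  # number of stream.read() calls
samples = reads * CHUNK              # samples actually captured
print(reads, samples / RATE)         # 46 reads, about 2.94 s of audio
```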

4. Tools

4.1 WhisperDesktop

/Const-me/Whisper

High-performance GPGPU inference of OpenAI's Whisper automatic speech recognition (ASR) model.
This project is a Windows port of the whisper.cpp implementation,
which in turn is a C++ port of OpenAI's Whisper automatic speech recognition (ASR) model.

After downloading WhisperDesktop, run it, load the model file, and finally select the file to transcribe. Transcription is very fast thanks to GPU hardware acceleration.


4.2 Buzz

/chidiwilliams/buzz

Buzz transcribes and translates audio offline on your PC. Powered by OpenAI's Whisper.

Another Whisper-based graphical tool is Buzz which, unlike WhisperDesktop, supports Windows, macOS, and Linux.


Installation is as follows:

(1)PyPI:

pip install buzz-captions
python -m buzz

(2)Windows:

Download and run the .exe file from the releases page.


The installation package for Buzz is fairly large. Buzz uses model files with a .pt extension, which the software downloads automatically after it runs.

However, it is best to download the model file in advance and place it in the designated location:

Mac: ~/.cache/whisper
Windows: C:\Users\<Your username>\.cache\whisper
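Both paths follow the same pattern, ~/.cache/whisper, which can be computed portably. This is a sketch; the assumption that Whisper honours the XDG_CACHE_HOME variable when choosing its download root is mine, so verify it against the version you install:

```python
import os
from pathlib import Path

def whisper_cache_dir() -> Path:
    """Default directory where Whisper stores downloaded .pt model files
    (assumes the ~/.cache fallback and XDG_CACHE_HOME override)."""
    base = os.getenv("XDG_CACHE_HOME", str(Path.home() / ".cache"))
    return Path(base) / "whisper"

print(whisper_cache_dir())
```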

However, Buzz decodes on the CPU in software and does not support GPU hardware acceleration yet.

4.3 Whisper-WebUI

/jhj0517/Whisper-WebUI

Gradio-based Whisper browser interface. You can use it as a simple subtitle generator!


The above covers the details of implementing speech recognition with Python. For more information about speech recognition in Python, please check my other related articles!