
The Ultimate Python Speech Recognition Guide (This One Is All You Need)

[Introduction] The huge success of Amazon's Alexa has shown that adding some level of voice support will soon become a basic requirement for everyday technology. Python programs that integrate speech recognition offer a level of interactivity and accessibility that few other technologies can match. Best of all, adding speech recognition to a Python program is simple. In this guide you will learn:

- How speech recognition works;

- Which speech recognition packages are available on PyPI;

- How to install and use SpeechRecognition, a full-featured and easy-to-use Python speech recognition library.

An Overview of How Speech Recognition Works

Speech recognition has its roots in research done at Bell Labs in the early 1950s. Early systems could only recognize a single speaker and had vocabularies of about a dozen words. Modern speech recognition systems have come a long way: they can recognize multiple speakers and have large vocabularies covering many languages.

The first part of speech recognition is, of course, the voice. Through the microphone, speech is converted from a physical sound to an electrical signal, which is then converted to data by an analog-to-digital converter. Once digitized, several models can be applied to transcribe the audio into text.

Most modern speech recognition systems rely on Hidden Markov Models (HMMs). These work on the principle that a speech signal, viewed on a very short time scale (say, 10 milliseconds), can be reasonably approximated as a stationary process, that is, a process whose statistical properties do not change over time.

Many modern systems also apply neural networks before HMM recognition to simplify the speech signal through feature transformation and dimensionality reduction. Voice activity detectors (VADs) can be used as well, to reduce the audio signal to only those portions likely to contain speech.
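
As a back-of-the-envelope illustration of those short time scales (the 16 kHz sampling rate below is an assumed, common value, not something from this article):

sample_rate = 16000                      # samples per second (assumed, common rate)
frame_ms = 10                            # frame length from the text above
print(sample_rate * frame_ms // 1000)    # -> 160 samples per 10 ms frame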

Fortunately for Python users, a number of speech-recognition services are available online via APIs, and most of them offer Python SDKs as well.

Choosing a Python Speech Recognition Package

There are a number of speech recognition packages readily available in PyPI. These include:

•apiai

•google-cloud-speech

•pocketsphinx

•SpeechRecognition

•watson-developer-cloud

•wit

Some packages (such as wit and apiai) provide built-in features that go beyond basic speech recognition, such as natural language processing to recognize speaker intent. Other packages, such as Google Cloud Speech, focus on speech-to-text conversion.

SpeechRecognition stands out for its ease of use.

Recognizing speech requires audio input, and SpeechRecognition makes retrieving that input simple: instead of writing scripts to access microphones and process audio files from scratch, you can have it up and running in just a few minutes.

The SpeechRecognition library acts as a wrapper for several popular speech APIs, making it extremely flexible. One of these, the Google Web Speech API, supports a default API key that is hard-coded into the SpeechRecognition library, so you can use it without registering for an account. This flexibility and ease of use make SpeechRecognition an excellent choice for Python programs.

Installing SpeechRecognition

SpeechRecognition is compatible with Python 2.6, 2.7, and 3.3+, but some additional installation steps are required for Python 2. All examples in this tutorial assume Python 3.3+.

You can install SpeechRecognition from a terminal with pip:

$ pip install SpeechRecognition

Once the installation is complete, verify it by opening an interpreter session and typing the following:

>>> import speech_recognition as sr
>>> sr.__version__
'3.8.1'

Note: Do not close this session, you will be using it in later steps.

To process an existing audio file, simply call SpeechRecognition directly; note that some use cases require additional dependencies. In particular, the PyAudio package is required to capture microphone input.

The Recognizer Class

The core of SpeechRecognition is the Recognizer class.

The main purpose of a Recognizer instance is, of course, to recognize speech. Each instance comes with a variety of settings and functionality for recognizing speech from an audio source, via these seven methods:

  • recognize_bing(): Microsoft Bing Speech
  • recognize_google(): Google Web Speech API
  • recognize_google_cloud(): Google Cloud Speech - requires installation of the google-cloud-speech package
  • recognize_houndify(): Houndify by SoundHound
  • recognize_ibm(): IBM Speech to Text
  • recognize_sphinx(): CMU Sphinx - requires installing PocketSphinx
  • recognize_wit(): Wit.ai

Of these seven, only recognize_sphinx() works offline with the CMU Sphinx engine; the other six require an Internet connection.

SpeechRecognition comes with a default API key for the Google Web Speech API, so you can get started with it right away. The other six APIs all require authentication with an API key or a username/password combination, which is why this article uses the Web Speech API.
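
As an aside, here is what a fully offline transcription with CMU Sphinx might look like (a minimal sketch, assuming the pocketsphinx package is installed and using the "harvard.wav" sample file introduced later in this tutorial):

import speech_recognition as sr

r = sr.Recognizer()
with sr.AudioFile('harvard.wav') as source:
    audio = r.record(source)          # read the entire file into an AudioData instance

print(r.recognize_sphinx(audio))      # no Internet connection required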

Now get down to business: create a Recognizer instance and call recognize_google() in your interpreter session.

>>> r = sr.Recognizer()
>>> r.recognize_google()

The following will appear on screen:

Traceback (most recent call last):
File "", line 1, in <module>
TypeError: recognize_google() missing 1 required positional argument: 'audio_data'

You have probably guessed the outcome: how could anything be recognized when no audio data was passed in?

All seven recognize_*() methods require an audio_data argument, which must be an instance of SpeechRecognition's AudioData class.

An AudioData instance can be created in two ways: from an audio file or from audio recorded by a microphone. We'll start with audio files, which are easier to get going with.

Working With Audio Files

First you need to download the audio files (/realpython/python-speech-recognition/tree/master/audio_files) and save them to the directory in which your Python interpreter session is running.

The AudioFile class can be initialized with the path of an audio file and provides a context manager interface for reading and manipulating the contents of the file.

Supported File Types

SpeechRecognition currently supports the following file types:

  • WAV: must be in PCM/LPCM format
  • AIFF
  • AIFF-C
  • FLAC: must be native FLAC format; OGG-FLAC is not supported.

If you are on x86-based Linux, macOS, or Windows, FLAC files are supported out of the box. On other platforms, you need to install a FLAC encoder and ensure you have access to the flac command-line tool.
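
If your audio is in an unsupported format such as MP3, you can convert it to WAV first. Here is a sketch using the pydub package (an extra dependency, not part of SpeechRecognition; it also needs ffmpeg installed on the system, and the file names are hypothetical):

from pydub import AudioSegment

sound = AudioSegment.from_mp3('speech.mp3')   # hypothetical input file
sound.export('speech.wav', format='wav')      # PCM WAV, readable by AudioFile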

Using record() to Get Data From a File

Type the following into your interpreter session to process the contents of the "harvard.wav" file:

>>> harvard = sr.AudioFile('harvard.wav')
>>> with harvard as source:
...     audio = r.record(source)
...

The context manager opens the file and reads its contents, storing the data in an AudioFile instance called source. The record() method then records the data from the entire file into an AudioData instance. Confirm this by checking the type of audio:

>>> type(audio)
<class 'speech_recognition.AudioData'>

You can now call recognize_google() to attempt to recognize the speech in the audio.

>>> r.recognize_google(audio)
'the stale smell of old beer lingers it takes heat
to bring out the odor a cold dip restores health and
zest a salt pickle taste fine with ham tacos al
Pastore are my favorite a zestful food is the hot
cross bun'

That completes the transcription of the first audio file.

Getting audio clips using offsets and durations

What if you only want to capture part of the speech in a file? record() has a duration keyword parameter that causes the command to stop recording after a specified number of seconds.

For example, the following captures only the speech within the first four seconds of the file:

>>> with harvard as source:
...     audio = r.record(source, duration=4)
...
>>> r.recognize_google(audio)
'the stale smell of old beer lingers'

When record() is called inside a with block, the file stream moves forward. This means that if you record four seconds of audio and then call record() again for four seconds, the second call returns the four seconds of audio that follow the first four.

>>> with harvard as source:
...     audio1 = r.record(source, duration=4)
...     audio2 = r.record(source, duration=4)
...
>>> r.recognize_google(audio1)
'the stale smell of old beer lingers'
>>> r.recognize_google(audio2)
'it takes heat to bring out the odor a cold dip'

In addition to specifying a recording duration, you can give record() a starting point with the offset keyword argument; its value is the number of seconds to skip from the start of the file before recording begins. For example, to capture only the second phrase in the file, set an offset of 4 seconds and record for 3 seconds.

>>> with harvard as source:
...     audio = r.record(source, offset=4, duration=3)
...
>>> r.recognize_google(audio)
'it takes heat to bring out the odor'

The offset and duration keyword parameters are useful for splitting audio files when the structure of the speech in the file is known in advance. However, inaccurate use can lead to poor transcription.

>>> with harvard as source:
...     audio = r.record(source, offset=4.7, duration=2.8)
...
>>> r.recognize_google(audio)
'Mesquite to bring out the odor Aiko'

Recording starts 4.7 seconds in, so the "it t" of the phrase "it takes heat to bring out the odor" is missed. The API receives only "akes heat", which it matches to "Mesquite".

Similarly, at the end of the recording only "a co" of the final phrase "a cold dip restores health and zest" is captured, which the API mismatches as "Aiko".

Noise is another major culprit behind poor transcription accuracy. The examples above work well because the audio file is clean; in the real world, noise-free audio is impossible to get unless the file is processed beforehand.

The effect of noise on speech recognition

Noise does exist in the real world: all recordings contain some degree of it, and unhandled noise can ruin the accuracy of a speech recognition application.

To see how noise affects speech recognition, download the "jackhammer.wav" file (/realpython/python-speech-recognition/tree/master/audio_files) and be sure to save it to your interpreter session's working directory. In this file, the phrase "the stale smell of old beer lingers" is spoken over the loud drone of a jackhammer in the background.

What happens when you try to transcribe this file?

>>> jackhammer = sr.AudioFile('jackhammer.wav')
>>> with jackhammer as source:
...     audio = r.record(source)
...
>>> r.recognize_google(audio)
'the snail smell of old gear vendors'

So how do you deal with this? Try calling the adjust_for_ambient_noise() method of the Recognizer class.

>>> with jackhammer as source:
...     r.adjust_for_ambient_noise(source)
...     audio = r.record(source)
...
>>> r.recognize_google(audio)
'still smell of old beer vendors'

That's much closer to the actual sentence, but accuracy is still a problem, and the "the" at the beginning of the phrase is lost. Why is that?

By default, adjust_for_ambient_noise() reads the first second of the file stream and calibrates the recognizer to the noise level of the audio, so that portion of the stream is consumed before record() ever sees the data.

You can adjust the time range that adjust_for_ambient_noise() analyzes with the duration keyword argument, which is given in seconds and defaults to 1. Let's reduce this value to 0.5.

>>> with jackhammer as source:
...     r.adjust_for_ambient_noise(source, duration=0.5)
...     audio = r.record(source)
...
>>> r.recognize_google(audio)
'the snail smell like old Beer Mongers'

Now we get the "the" at the start of the phrase, but new problems appear; sometimes the effects of noise cannot be eliminated because the signal is simply too noisy.

If you run into these issues frequently, some pre-processing of the audio is required. This can be done with audio editing software, or with a Python package that can apply filters to the files, such as SciPy.
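
If you want to experiment, here is one rough way to pre-filter a noisy WAV file with SciPy before recognition (a sketch only; the 3 kHz low-pass cutoff is an arbitrary assumption, and serious noise reduction usually requires more care):

import numpy as np
from scipy.io import wavfile
from scipy.signal import butter, filtfilt

rate, data = wavfile.read('jackhammer.wav')          # noisy file from this tutorial
b, a = butter(5, 3000 / (rate / 2), btype='low')     # 5th-order low-pass at 3 kHz
filtered = filtfilt(b, a, data.astype(np.float64), axis=0)
wavfile.write('jackhammer_filtered.wav', rate, filtered.astype(np.int16))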

When working with noisy files, it can also help to look at the actual API response. Most APIs return a JSON string containing several possible transcriptions, but recognize_google() always returns only the most likely one unless you request the full response. You can get the full response by setting the show_all keyword argument of recognize_google() to True.

>>> r.recognize_google(audio, show_all=True)
{'alternative': [
{'transcript': 'the snail smell like old Beer Mongers'},
{'transcript': 'the still smell of old beer vendors'},
{'transcript': 'the snail smell like old beer vendors'},
{'transcript': 'the stale smell of old beer vendors'},
{'transcript': 'the snail smell like old beermongers'},
{'transcript': 'destihl smell of old beer vendors'},
{'transcript': 'the still smell like old beer vendors'},
{'transcript': 'bastille smell of old beer vendors'},
{'transcript': 'the still smell like old beermongers'},
{'transcript': 'the still smell of old beer venders'},
{'transcript': 'the still smelling old beer vendors'},
{'transcript': 'musty smell of old beer vendors'},
{'transcript': 'the still smell of old beer vendor'}
], 'final': True}

As you can see, recognize_google() returns a dictionary whose 'alternative' key refers to a list of all possible responses. The structure of this response varies from API to API, and it is mainly useful for debugging.
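
For example, to pull just the transcript strings out of the full response (a small sketch; audio here is the noisy recording from above):

>>> response = r.recognize_google(audio, show_all=True)
>>> for alternative in response['alternative']:
...     print(alternative['transcript'])
...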

Working With Microphones

To access the microphone with SpeechRecognition, you must install the PyAudio package. Close your current interpreter session and do the following.

Installing PyAudio

The process of installing PyAudio varies by operating system.

Debian Linux

If you are using Debian-based Linux (e.g. Ubuntu), you can install PyAudio using apt:

$ sudo apt-get install python-pyaudio python3-pyaudio

You may still need to run pip install pyaudio after the installation completes, especially if you are working inside a virtual environment.

macOS

For macOS users, you first need to install PortAudio using Homebrew, and then call the pip command to install PyAudio.

$ brew install portaudio
$ pip install pyaudio

Windows

Windows users can install PyAudio directly by calling pip.

$ pip install pyaudio

Testing the Installation

After installing PyAudio, you can test the installation from the console.

$ python -m speech_recognition

Make sure your default microphone is on and unmuted. If everything installed properly, you should see something like this:

A moment of silence, please...
Set minimum energy threshold to 600.4452854381937
Say something!

Speak into the microphone and watch how SpeechRecognition transcribes your speech.

The Microphone Class

Open another interpreter session and create an instance of the Recognizer class.

>>> import speech_recognition as sr
>>> r = sr.Recognizer()

This time, instead of an audio file, you will use the default system microphone as the source. Access it by creating an instance of the Microphone class.

>>> mic = sr.Microphone()

If your system has no default microphone (as on a Raspberry Pi), or if you want to use a non-default one, you need to specify which microphone to use by supplying a device index. You can get a list of microphone names by calling the static list_microphone_names() method of the Microphone class.

>>> sr.Microphone.list_microphone_names()
['HDA Intel PCH: ALC272 Analog (hw:0,0)',
'HDA Intel PCH: HDMI 0 (hw:0,3)',
'sysdefault',
'front',
'surround40',
'surround51',
'surround71',
'hdmi',
'pulse',
'dmix',
'default']

Note: Your output may differ from the above example.

Each microphone's device index is its position in the list returned by list_microphone_names(). In the output above, the microphone named "front" is at index 3 of the list, so you would create a Microphone instance like this:

>>> # This is just an example; do not run
>>> mic = sr.Microphone(device_index=3)

In most cases, however, you will want to use the system's default microphone.

Using listen() to Capture Microphone Input

With a Microphone instance ready, you can capture some input.

Like the AudioFile class, Microphone is a context manager. You can capture input from the microphone using the listen() method of the Recognizer class inside the with block. This method takes an audio source as its first argument and records input from that source automatically until silence is detected.

>>> with mic as source:
...     audio = r.listen(source)
...

Once the with block executes, try saying "hello" into the microphone. Wait for the interpreter prompt to reappear; once the ">>>" prompt returns, the speech is ready to be recognized.

>>> r.recognize_google(audio)
'hello'

If the prompt never returns, the microphone is most likely picking up too much ambient noise. Use Ctrl+C to interrupt the process and get the prompt back.

To deal with ambient noise, call the adjust_for_ambient_noise() method of the Recognizer class, just as you did with the noisy audio file. Since microphone input is far less predictable than an audio file, it's a good idea to do this any time you listen for microphone input.

>>> with mic as source:
...     r.adjust_for_ambient_noise(source)
...     audio = r.listen(source)
...

After running the above code, wait a moment for the calibration, then try saying "hello" into the microphone. Again, wait for the interpreter prompt to return before trying to recognize the speech.

Remember that adjust_for_ambient_noise() analyzes one second of audio from the source by default. If you find that too long, adjust it with the duration parameter.

The SpeechRecognition documentation recommends a duration of no less than 0.5 seconds. In some cases you may find that durations longer than the default of one second give better results. The minimum value you need depends on the microphone's surroundings, which is often hard to know during development. In my experience, the default of one second is adequate for most applications.

Handling difficult-to-recognize speech

Try typing the previous code example into the interpreter and making some unintelligible noises into the microphone. You should get a result like this:

Traceback (most recent call last):
File "", line 1, in <module>
File "/home/david/real_python/speech_recognition_primer/venv/lib/python3.5/site-packages/speech_recognition/__init__.py", line 858, in recognize_google
if not isinstance(actual_result, dict) or len(actual_result.get("alternative", [])) == 0: raise UnknownValueError()
speech_recognition.UnknownValueError

Audio that the API cannot match to text raises an UnknownValueError exception, so use try and except blocks liberally to handle it. The API does its best to turn any sound into text: a short grunt might be transcribed as "how", and coughs, claps, and tongue clicks may be converted to text or raise this exception.
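
A minimal pattern for guarding recognition calls might look like this (a sketch using the names from this tutorial; sr.RequestError additionally covers API connection failures):

import speech_recognition as sr

r = sr.Recognizer()
mic = sr.Microphone()

with mic as source:
    r.adjust_for_ambient_noise(source)   # calibrate to ambient noise first
    audio = r.listen(source)             # record until silence is detected

try:
    print(r.recognize_google(audio))
except sr.UnknownValueError:
    print("Sorry, I could not understand the audio.")
except sr.RequestError as e:
    print("Could not request results from the API; {0}".format(e))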

Conclusion

In this tutorial, we have been recognizing English speech, which is the default language for every recognize_*() method in the SpeechRecognition package. However, it is entirely possible, and easy, to recognize speech in other languages: simply set the language keyword argument of the recognize_*() method to a string corresponding to the desired language.
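
For example, transcribing French audio might look like this (a quick sketch; 'fr-FR' is a standard language tag accepted by the Google Web Speech API):

>>> r.recognize_google(audio, language='fr-FR')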

This concludes the ultimate Python speech recognition guide. For more on speech recognition with Python, search my earlier articles or browse the related articles below. I hope you'll keep supporting me!