Raspberry: eSpeak



Understanding and Building an Application with STT (Speech-to-Text) and TTS (Text-to-Speech)



Most modern speech recognition systems rely on what is known as a Hidden Markov Model (HMM).
In a typical HMM, the speech signal is divided into 10-millisecond fragments.
In each fragment, the power spectrum of the signal is mapped to a vector of real numbers known as cepstral coefficients; the final output of the HMM is a sequence of these vectors.
To decode the speech into text, groups of vectors are matched to one or more phonemes—a fundamental unit of speech. This calculation requires training, since the sound of a phoneme varies from speaker to speaker, and even varies from one utterance to another by the same speaker. A special algorithm is then applied to determine the most likely word (or words) that produce the given sequence of phonemes.

A number of speech recognition services are available for use online through an API, and many of these services offer Python SDKs:
  • apiai
  • assemblyai
  • google-cloud-speech
  • pocketsphinx
  • SpeechRecognition
  • watson-developer-cloud
  • wit

The SpeechRecognition library acts as a wrapper for several popular speech APIs and is thus extremely flexible. One of these, the Google Web Speech API, supports a default API key that is hard-coded into the SpeechRecognition library; the other six APIs all require authentication with either an API key or a username/password combination.

SpeechRecognition is compatible with Python 2.6, 2.7 and 3.3+, but requires some additional installation steps for Python 2.

Install SpeechRecognition:

sudo apt-get install flac
sudo pip install SpeechRecognition

Check the installation:

import speech_recognition as sr

print(sr.__version__)

All of the magic in SpeechRecognition happens with the Recognizer class.
Creating a Recognizer instance:

r = sr.Recognizer()

Each Recognizer instance has seven methods for recognizing speech from an audio source using various APIs:
  • recognize_bing(): Microsoft Bing Speech
  • recognize_google(): Google Web Speech API
  • recognize_google_cloud(): Google Cloud Speech (requires installation of the google-cloud-speech package)
  • recognize_houndify(): Houndify by SoundHound
  • recognize_ibm(): IBM Speech to Text
  • recognize_sphinx(): CMU Sphinx (requires installing PocketSphinx)
  • recognize_wit(): Wit.ai
Of the seven, only recognize_sphinx() works offline with the CMU Sphinx engine. The other six all require an internet connection.
Since SpeechRecognition ships with a default API key for the Google Web Speech API, you can get started with it right away.
The other six APIs all require authentication with either an API key or a username/password combination.

All seven recognize_*() methods of the Recognizer class require an audio_data argument. In each case, audio_data must be an instance of SpeechRecognition’s AudioData class.
There are two ways to create an AudioData instance: from an audio file or audio recorded by a microphone.

Working With Audio Files


Currently, SpeechRecognition supports the following file formats:
  • WAV (must be in PCM/LPCM format)
  • AIFF
  • AIFF-C
  • FLAC (must be native FLAC format; OGG-FLAC is not supported)


Working With Microphones


To access your microphone with SpeechRecognition, you’ll have to install the PyAudio package.
Once you’ve got PyAudio installed, you can test the installation from the console.

If your system has no default microphone (such as on a Raspberry Pi), or you want to use a microphone other than the default, you will need to specify which one to use by supplying a device index. You can get a list of microphone names by calling the list_microphone_names() static method of the Microphone class.

import speech_recognition as sr

sr.Microphone.list_microphone_names()

The device index of the microphone is the index of its name in the list returned by list_microphone_names().
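Rather than hard-coding an index, you can search the returned names for a keyword. The helper and the device names below are illustrative examples; on a real system you would pass the result of sr.Microphone.list_microphone_names() instead.

```python
def find_mic_index(names, keyword):
    """Return the index of the first microphone whose name contains `keyword`
    (case-insensitive), or None if nothing matches."""
    for index, name in enumerate(names):
        if keyword.lower() in name.lower():
            return index
    return None

# made-up device names, as list_microphone_names() might return them
names = ["HDA Intel PCH: ALC892 Analog", "USB PnP Sound Device", "pulse"]
print(find_mic_index(names, "usb"))  # 1
```

The returned index can then be passed straight to sr.Microphone(device_index=...).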

For example, if you want to use a device name whose index is 3:

mic = sr.Microphone(device_index=3)
Since Microphone is a context manager, you can capture input from the microphone by calling the listen() method of the Recognizer class inside the with block:

with mic as source:
   audio = r.listen(source)

If listen() never returns, there is probably too much ambient noise; calling r.adjust_for_ambient_noise(source) before listening lets the recognizer calibrate its energy threshold to the surroundings.

Then, to recognize the speech:

r.recognize_google(audio)


A simple “Guess the Word” game (guessing_game.py):


import random
import time

import speech_recognition as sr


def recognize_speech_from_mic(recognizer, microphone):
    """Transcribe speech from recorded from `microphone`.
    Returns a dictionary with three keys:
    "success": a boolean indicating whether or not the API request was
               successful
    "error":   `None` if no error occured, otherwise a string containing
               an error message if the API could not be reached or
               speech was unrecognizable
    "transcription": `None` if speech could not be transcribed,
               otherwise a string containing the transcribed text
    """
    # check that recognizer and microphone arguments are appropriate type
    if not isinstance(recognizer, sr.Recognizer):
        raise TypeError("`recognizer` must be `Recognizer` instance")

    if not isinstance(microphone, sr.Microphone):
        raise TypeError("`microphone` must be `Microphone` instance")

    # adjust the recognizer sensitivity to ambient noise and record audio
    # from the microphone
    with microphone as source:
        recognizer.adjust_for_ambient_noise(source)
        audio = recognizer.listen(source)

    # set up the response object
    response = {
        "success": True,
        "error": None,
        "transcription": None
    }

    # try recognizing the speech in the recording
    # if a RequestError or UnknownValueError exception is caught,
    #     update the response object accordingly
    try:
        response["transcription"] = recognizer.recognize_google(audio)
    except sr.RequestError:
        # API was unreachable or unresponsive
        response["success"] = False
        response["error"] = "API unavailable"
    except sr.UnknownValueError:
        # speech was unintelligible
        response["error"] = "Unable to recognize speech"

    return response


if __name__ == "__main__":
    # set the list of words, max number of guesses, and prompt limit
    WORDS = ["apple", "banana", "grape", "orange", "mango", "lemon"]
    NUM_GUESSES = 3
    PROMPT_LIMIT = 5

    # create recognizer and mic instances
    recognizer = sr.Recognizer()
    microphone = sr.Microphone()

    # get a random word from the list
    word = random.choice(WORDS)

    # format the instructions string
    instructions = (
        "I'm thinking of one of these words:\n"
        "{words}\n"
        "You have {n} tries to guess which one.\n"
    ).format(words=', '.join(WORDS), n=NUM_GUESSES)

    # show instructions and wait 3 seconds before starting the game
    print(instructions)
    time.sleep(3)

    for i in range(NUM_GUESSES):
        # get the guess from the user
        # if a transcription is returned, break out of the loop and
        #     continue
        # if no transcription returned and API request failed, break
        #     loop and continue
        # if API request succeeded but no transcription was returned,
        #     re-prompt the user to say their guess again. Do this up
        #     to PROMPT_LIMIT times
        for j in range(PROMPT_LIMIT):
            print('Guess {}. Speak!'.format(i+1))
            guess = recognize_speech_from_mic(recognizer, microphone)
            if guess["transcription"]:
                break
            if not guess["success"]:
                break
            print("I didn't catch that. What did you say?\n")

        # if there was an error, stop the game
        if guess["error"]:
            print("ERROR: {}".format(guess["error"]))
            break

        # show the user the transcription
        print("You said: {}".format(guess["transcription"]))

        # determine if guess is correct and if any attempts remain
        guess_is_correct = guess["transcription"].lower() == word.lower()
        user_has_more_attempts = i < NUM_GUESSES - 1

        # determine if the user has won the game
        # if not, repeat the loop if user has more attempts
        # if no attempts left, the user loses the game
        if guess_is_correct:
            print("Correct! You win!".format(word))
            break
        elif user_has_more_attempts:
            print("Incorrect. Try again.\n")
        else:
            print("Sorry, you lose!\nI was thinking of '{}'.".format(word))
            break







Making the Raspberry Pi Speak with Synthesized Speech


Simply install speech-synthesis software and connect a speaker to the Raspberry Pi, and you can easily build a system that reads text aloud or announces the time automatically.
Many text-to-speech packages, such as eSpeak and Festival, support multiple languages. Chinese support is not as good as English, but it is still worth trying.
To read Japanese aloud, you can use Open JTalk, which handles even text containing kanji without problems.

Speech Synthesis with eSpeak or Festival


eSpeak is a multilingual text-to-speech program; it can be installed with the following command.

 $ sudo apt-get install espeak
Using eSpeak is very simple: just put the phrase to speak after the command.

 $ espeak "good morning"
You can also use the -v option to switch between voices, such as the female voice f3 or the male voice m7.

 $ espeak -v f3 "good morning"
 $ espeak -v m7 "good morning"
These voices can be found in the espeak-data installation directory: f1 through f5 are female voices, m1 through m7 are male voices, and there are also variants such as croak, klatt, and whisper.
Use the command below to see the available choices.

 $ ls '/usr/lib/arm-linux-gnueabihf/espeak-data/voices/!v/'
Switching languages also uses -v; for example, pass -v zh for Chinese.
To combine a language with a voice, use the form -v zh+f3.

 $ espeak -v zh+f3 "早安"

Adding Chinese Dictionary Entries


Although Chinese support is not perfect, it does not sound too bad.
We recommend installing the additional Chinese dictionary file.
Attentive readers may notice that the following message often appears when Chinese is specified:

 Full dictionary is not installed for 'zh'
This is because eSpeak ships with only a small subset of Chinese dictionary entries, to keep the package size down. For example, try the phrase below; it should not be pronounced correctly.

 $ espeak -v zh "魑魅魍魎"
To make eSpeak pronounce it correctly, first download the complete Chinese dictionary file from the following URL (the commands below also assume the eSpeak source archive, espeak-1.48.04-source.zip, has been downloaded).
URL http://espeak.sourceforge.net/data/zh_listx.zip


 $ unzip espeak-1.48.04-source.zip
 $ cd espeak-1.48.04-source/dictsource
 $ unzip ../../zh_listx.zip
 $ sudo espeak --compile=zh
When messages like the following appear, the installation is complete, and the Raspberry Pi can now pronounce 「魑魅魍魎」 correctly.

 Full dictionary is not installed for 'zh'
 Using phonemetable: 'zh'
 Compiling: 'zh_list'
 3873 entries
 Compiling: 'zh_listx'
 57665 entries
 Compiling: 'zh_rules'
 181 rules, 28 groups (0)
eSpeak can also read its input from a file with the -f option. For the other tunable options, see the manual page shown by man espeak.
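From Python, the same commands can be driven through the subprocess module. This is a minimal sketch: the function names are hypothetical, and the default voice (zh+f3) and speed (160 words per minute) are arbitrary example values.

```python
import shutil
import subprocess

def espeak_cmd(text, voice="zh+f3", speed=160):
    # build the eSpeak argument list: -v selects language/voice, -s sets words per minute
    return ["espeak", "-v", voice, "-s", str(speed), text]

def say(text, **kwargs):
    """Speak `text` aloud; raises if eSpeak is not installed."""
    if shutil.which("espeak") is None:
        raise RuntimeError("espeak not found; install it with: sudo apt-get install espeak")
    subprocess.run(espeak_cmd(text, **kwargs), check=True)

# say("早安")  # would speak through the Pi's audio output
```

Keeping the argument-list builder separate from the runner makes the command easy to inspect or log before anything is executed.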
