I have a table that contains HTML content to be shown on a web page.
I am trying to convert that content to speech after extracting the plain text from it. But when I listen to the converted audio, there are no pauses; the voice is completely continuous.
How can I preserve pauses and similar cues when doing text-to-speech?
Example
Python Code for TTS
from flask import request
from bs4 import BeautifulSoup
from gtts import gTTS
import uuid

@app.route('/convert', methods=['POST'])
def convert():
    if request.method == 'POST' or request.method == 'GET':
        area = request.form['text']
        parsed = BeautifulSoup(area, 'html.parser').get_text()
        print("******************", area)
        # Language in which you want to convert
        language = 'en'
        myobj = gTTS(text=parsed, lang=language, slow=False)
        # Save the converted audio in an mp3 file with a unique name
        filename = str(uuid.uuid4())
        myobj.save(filename + '.mp3')
        return parsed
Converting that to plain text removes the semantic information in the HTML - you'll get "1. Sustained Attention 2. Selective Attention...". You may try doing additional parsing/formatting using gTTS functionality, or consider converting to SSML and using a different API (such as Amazon Polly or espeak) that supports SSML.
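For instance, here is a rough sketch of the SSML direction using only the standard library: pull the list items out of the HTML and wrap each in a sentence followed by a break. The break length and the `html_list_to_ssml` helper are illustrative, not a fixed API; the resulting SSML would then be sent to an SSML-capable engine such as Polly.

```python
import html
from html.parser import HTMLParser

class ListToSSML(HTMLParser):
    """Collects the text of each <li> element as it parses."""
    def __init__(self):
        super().__init__()
        self.in_li = False
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.in_li = True
            self.items.append("")

    def handle_endtag(self, tag):
        if tag == "li":
            self.in_li = False

    def handle_data(self, data):
        if self.in_li:
            self.items[-1] += data

def html_list_to_ssml(html_text, pause="1s"):
    """Wrap each list item in an SSML sentence with a break after it."""
    parser = ListToSSML()
    parser.feed(html_text)
    body = "".join(
        '<s>{}</s><break time="{}"/>'.format(html.escape(item.strip()), pause)
        for item in parser.items
    )
    return "<speak>{}</speak>".format(body)

print(html_list_to_ssml("<ol><li>Sustained Attention</li><li>Selective Attention</li></ol>"))
```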
I am looking for a small pause, wait, or break - anything that will allow a short break (about 2 seconds, ideally configurable) when speaking the desired text.
People online have said that adding three full stops followed by a space creates a break, but I don't seem to be getting that. The code below is my test, which sadly has no pauses. Any ideas or suggestions?
Edit: It would be ideal if there were some command from gTTS that would allow me to do this, or maybe some trick like the three full stops, if that actually worked.
from gtts import gTTS
import os
tts = gTTS(text=" Testing ... if there is a pause ... ... ... ... ... longer pause? ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... insane pause " , lang='en', slow=False)
tts.save("temp.mp3")
os.system("temp.mp3")
OK, you need Speech Synthesis Markup Language (SSML) to achieve this.
Be aware that you need to set up Google Cloud Platform credentials first. Then install the client library from your shell:
pip install --upgrade google-cloud-texttospeech
Then here is the code:
import html
from google.cloud import texttospeech

def ssml_to_audio(ssml_text, outfile):
    # Instantiates a client
    client = texttospeech.TextToSpeechClient()
    # Sets the text input to be synthesized
    synthesis_input = texttospeech.SynthesisInput(ssml=ssml_text)
    # Builds the voice request, selects the language code ("en-US") and
    # the SSML voice gender ("MALE")
    voice = texttospeech.VoiceSelectionParams(
        language_code="en-US", ssml_gender=texttospeech.SsmlVoiceGender.MALE
    )
    # Selects the type of audio file to return
    audio_config = texttospeech.AudioConfig(
        audio_encoding=texttospeech.AudioEncoding.MP3
    )
    # Performs the text-to-speech request on the text input with the selected
    # voice parameters and audio file type
    response = client.synthesize_speech(
        input=synthesis_input, voice=voice, audio_config=audio_config
    )
    # Writes the synthetic audio to the output file.
    with open(outfile, "wb") as out:
        out.write(response.audio_content)
    print("Audio content written to file " + outfile)

def text_to_ssml(inputfile):
    raw_lines = inputfile
    # Replace special characters with HTML Ampersand Character Codes.
    # These codes prevent the API from confusing text with SSML commands.
    # For example, '<' --> '&lt;' and '&' --> '&amp;'
    escaped_lines = html.escape(raw_lines)
    # Convert plaintext to SSML.
    # Wait two seconds between each line.
    ssml = "<speak>{}</speak>".format(
        escaped_lines.replace("\n", '\n<break time="2s"/>')
    )
    # Return the concatenated SSML string
    return ssml

text = """Here are <say-as interpret-as="characters">SSML</say-as> samples.
I can pause <break time="3s"/>.
I can play a sound"""

ssml = text_to_ssml(text)
ssml_to_audio(ssml, "test.mp3")
More documentation:
Speaking addresses with SSML
But if you don't have Google Cloud Platform credentials, the cheaper and easier way is to use the time.sleep() method.
If any background waits are required, you can use the time module as below.
import time
# SLEEP FOR 5 SECONDS AND START THE PROCESS
time.sleep(5)
Or you can retry up to 3 times with a wait between attempts:
import time
for tries in range(3):
    if someprocess() is False:
        time.sleep(3)
You can save multiple mp3 files, then use time.sleep() between playing each one for your desired amount of pause:
from gtts import gTTS
import os
from time import sleep
tts1 = gTTS(text="Testing", lang='en', slow=False)
tts2 = gTTS(text="if there is a pause", lang='en', slow=False)
tts3 = gTTS(text="insane pause", lang='en', slow=False)
tts1.save("temp1.mp3")
tts2.save("temp2.mp3")
tts3.save("temp3.mp3")
os.system("temp1.mp3")
sleep(2)
os.system("temp2.mp3")
sleep(3)
os.system("temp3.mp3")
Sadly the answer is no: the gTTS package has no function for adding a pause (an issue was already opened in 2018 requesting one), but it is smart enough to add natural pauses via its tokenizer.
What is a tokenizer?
A function that takes text and returns it split into a list of tokens (strings). In the gTTS context, its goal is
to cut the text into smaller segments that do not exceed the maximum character size allowed (100) per TTS API
request, while making the speech sound natural and continuous. It does so by splitting text where speech would
naturally pause (for example on ".") while handling where it should not (for example on "10.5" or "U.S.A.").
Such rules are called tokenizer cases, which it takes a list of.
Here is an example:
text = "regular text speed no pause regular text speed comma pause, regular text speed period pause. regular text speed exclamation pause! regular text speed ellipses pause... regular text speed new line pause \n regular text speed "
So in this case, adding a sleep() seems like the only answer. But tricking the tokenizer is worth mentioning.
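One way to lean on the tokenizer is to insert pause-inducing punctuation between your text segments yourself before handing the string to gTTS. A small sketch (the `add_pauses` helper is illustrative, and whether, or how long, the engine actually pauses on the punctuation is not configurable):

```python
def add_pauses(segments, pause_token=". . . "):
    """Join text segments with spaced-out full stops, which the gTTS
    tokenizer is likely to treat as sentence boundaries."""
    return pause_token.join(segments)

text = add_pauses(["first part", "second part", "third part"])
print(text)
# the result would then be passed to gTTS(text=text, lang='en')
```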
You can add an arbitrary pause with pydub by saving and concatenating temporary mp3 files, using a silent audio file for the pause.
You can use any break-point symbol of your choice where you want to add a pause (here $):
from pydub import AudioSegment
from gtts import gTTS

contents = "Hello with $$ 2 seconds pause"
parts = contents.split("$")  # I have chosen this symbol for the pause.
pause2s = AudioSegment.from_mp3("silent.mp3")  # silent.mp3 contains 2 s of blank audio
combined = AudioSegment.empty()
cnt = 0
for p in parts:
    # The pause will happen for the empty element of the list
    if not p:
        combined += pause2s
    else:
        tts = gTTS(text=p, lang='en', slow=False)
        tmpFileName = "tmp" + str(cnt) + ".mp3"
        tts.save(tmpFileName)
        combined += AudioSegment.from_mp3(tmpFileName)
        cnt += 1
combined.export("out.mp3", format="mp3")
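If you'd rather not hunt down a silent.mp3, you can generate the silence yourself. With pydub, `AudioSegment.silent(duration=2000)` creates a 2-second gap in memory; or, using only the standard library, you can write a silent WAV file that pydub (or anything else) can load. A sketch, where the file name and frame rate are arbitrary choices:

```python
import wave

def write_silent_wav(path, seconds, frame_rate=24000):
    """Write a mono 16-bit WAV file containing only silence."""
    n_frames = int(seconds * frame_rate)
    with wave.open(path, "wb") as w:
        w.setnchannels(1)       # mono
        w.setsampwidth(2)       # 16-bit samples
        w.setframerate(frame_rate)
        w.writeframes(b"\x00\x00" * n_frames)  # zero amplitude = silence

write_silent_wav("silence2s.wav", 2)
```

The file could then be loaded with pydub's `AudioSegment.from_wav("silence2s.wav")` in place of the silent.mp3 above.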
Late to the party here, but you might consider trying out the audio_program_generator package. You provide a text file comprised of individual phrases, each of which has a configurable pause at the end. In return, it gives you an mp3 file that 'stitches together' all the phrases and their pauses into one continuous audio file. You can optionally mix in a background sound-file, as well. And it implements several of the other bells and whistles that Google TTS provides, like accents, slow-play-speech, etc.
Disclaimer: I am the author of the package.
I had the same problem, and didn't want to use lots of temporary files on disk. This code parses an SSML file, and creates silence whenever a <break> tag is found:
import io
from gtts import gTTS
import lxml.etree as etree
import pydub

ssml_filename = 'Section12.35-edited.ssml'
wav_filename = 'Section12.35-edited.mp3'
events = ('end',)
DEFAULT_BREAK_TIME = 250  # milliseconds

all_audio = pydub.AudioSegment.silent(100)
for event, element in etree.iterparse(
    ssml_filename,
    events=events,
    remove_comments=True,
    remove_pis=True,
    attribute_defaults=True,
):
    tag = etree.QName(element).localname
    if tag in ['p', 's'] and element.text:
        tts = gTTS(element.text, lang='en', tld='com.au')
        with io.BytesIO() as temp_bytes:
            tts.write_to_fp(temp_bytes)
            temp_bytes.seek(0)
            audio = pydub.AudioSegment.from_mp3(temp_bytes)
            all_audio = all_audio.append(audio)
    elif tag == 'break':
        # Insert silence for the <break> tag.
        time = element.attrib.get('time', None)  # Shouldn't be possible to have no time value.
        if time:
            if time.endswith('ms'):
                time_value = int(time.removesuffix('ms'))
            elif time.endswith('s'):
                time_value = int(time.removesuffix('s')) * 1000
            else:
                time_value = DEFAULT_BREAK_TIME
        else:
            time_value = DEFAULT_BREAK_TIME
        silence = pydub.AudioSegment.silent(time_value)
        all_audio = all_audio.append(silence)

with open(wav_filename, 'wb') as output_file:
    all_audio.export(output_file, format='mp3')
I know 4Rom1 used this method above, but to put it more simply, I found this worked really well for me. Get a 1-second silent mp3 (I found one by googling "1 sec silent mp3"), then use pydub to append it however many times you need. For example, to add 3 seconds of silence:
from pydub import AudioSegment
seconds = 3
output = AudioSegment.from_file("yourfile.mp3")
output += AudioSegment.from_file("1sec_silence.mp3") * seconds
output.export("newaudio.mp3", format="mp3")
I used Azure speech-to-text in Python:
import azure.cognitiveservices.speech as speechsdk
var = lambda evt: print('ss: {}'.format(evt))
speech_recognizer.recognizing.connect(var)
Then, when trying to print the actual recognized text, I end up with this:
ss: SpeechRecognitionEventArgs(session_id=0aea5e8b80e544b48414f2d27585b6c4, result=SpeechRecognitionResult(result_id=86c7de30436f4db1b064121bd617f24b, text="Hello.", reason=ResultReason.RecognizedSpeech))
How can I print just "Hello"?
To get the text from the event:
import azure.cognitiveservices.speech as speechsdk
var = lambda evt: print('ss: {}'.format(evt.result.text))
speech_recognizer.recognizing.connect(var)
If you are using a simple mic to recognize the text, here is something you can use to get the text:
def speech_recognize_once_from_mic():
    """Performs one-shot speech recognition from the default microphone."""
    # <SpeechRecognitionWithMicrophone>
    speech_config = speechsdk.SpeechConfig(subscription=speech_key, region=service_region)
    # Creates a speech recognizer using the microphone as audio input.
    # The default language is "en-us".
    speech_recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config)
    # Starts speech recognition, and returns after a single utterance is recognized. The end of a
    # single utterance is determined by listening for silence at the end or until a maximum of 15
    # seconds of audio is processed. It returns the recognized text as the result.
    # Note: Since recognize_once() returns only a single utterance, it is suitable only for
    # single-shot recognition like a command or query.
    # For long-running multi-utterance recognition, use start_continuous_recognition() instead.
    result = speech_recognizer.recognize_once()
    # Check the result
    if result.reason == speechsdk.ResultReason.RecognizedSpeech:
        print("Recognized: {}".format(result.text))
    elif result.reason == speechsdk.ResultReason.NoMatch:
        print("No speech could be recognized")
    elif result.reason == speechsdk.ResultReason.Canceled:
        cancellation_details = result.cancellation_details
        print("Speech Recognition canceled: {}".format(cancellation_details.reason))
        if cancellation_details.reason == speechsdk.CancellationReason.Error:
            print("Error details: {}".format(cancellation_details.error_details))
    # </SpeechRecognitionWithMicrophone>
Check this repo for further reference. Hope it helps.
I have searched for and tried to implement the solutions suggested here:
Errno 13 Permission denied: 'file.mp3' Python
Error while re-opening sound file in python
But there doesn't seem to be a good solution to this. Here is my code; can anyone tell me what I am doing wrong?
#!/usr/bin/env python3
# Requires PyAudio and PySpeech.
import time, os
import speech_recognition as sr
from gtts import gTTS
import pygame as pg
import mutagen.mp3

# Find out what input sound device is default (use if you have issues with microphone)
# import pyaudio
# sdev = pyaudio.pa.get_default_input_device()

def play_music(sound_file, volume=0.8):
    '''
    stream music with mixer.music module in a blocking manner
    this will stream the sound from disk while playing
    '''
    # set up the mixer, this will set it up according to your sound file
    mp3 = mutagen.mp3.MP3(sound_file)
    pg.mixer.init(frequency=mp3.info.sample_rate)
    pg.mixer.music.set_volume(volume)
    try:
        pg.mixer.music.load(sound_file)
        print("HoBo Sound file {} loaded!".format(sound_file))
    except pg.error:
        print("HoBo Sound file {} not found! ({})".format(sound_file, pg.get_error()))
        return
    pg.mixer.music.play()
    while pg.mixer.music.get_busy() == True:
        continue
    pg.mixer.quit()
    sound_file.close()

def speak(audioString):
    print(audioString)
    tts = gTTS(text=audioString, lang='en')
    tts.save("audio.mp3")
    # pick a mp3 file in folder or give full path
    sound_file = "audio.mp3"
    # optional volume 0 to 1.0
    volume = 0.6
    play_music(sound_file, volume)

def audioIn():
    # Record Audio from Microphone
    r = sr.Recognizer()
    with sr.Microphone() as source:
        print("Say something!")
        audio = r.listen(source)
    # Google Speech Recognition
    try:
        # for testing purposes, we're just using the default API key
        # to use another API key, use `r.recognize_google(audio, key="GOOGLE_SPEECH_RECOGNITION_API_KEY")`
        # instead of `r.recognize_google(audio)`
        data = r.recognize_google(audio)
        print("You said: ", data)
    except sr.UnknownValueError:
        print("Google Speech Recognition could not understand audio")
    except sr.RequestError as e:
        print("Could not request results from Google Speech Recognition service; {0}".format(e))
    return data

def hobo(data):
    if "how are you" in data:
        speak("I am fine")
    if "what time is it" in data:
        speak(time.ctime())
    if "where is" in data:
        data = data.split(" ")
        location = data[2]
        speak("Hold on Sir, I will show you where " + location + " is.")
        os.system("chromium-browser https://www.google.nl/maps/place/" + location + "/&")

# Starts the program
# time.sleep(2)
speak("Testing")
while(data != "stop"):
    data = audioIn()
    hobo(data)
else:
    quit
So I found the fix in one of the original threads I had already gone over. The fix was to implement a delete() function like so:
def delete():
    time.sleep(2)
    pg.mixer.init()
    pg.mixer.music.load("somefilehere.mp3")
    os.remove("audio.mp3")
and changing the play_music() function so it calls delete() at the end (I removed the sound_file.close() statement, of course).
Follow the method below:
import time
from gtts import gTTS
import pygame

def Text_to_speech():
    Message = "hey there"
    speech = gTTS(text=Message)
    speech.save('textToSpeech.mp3')
    pygame.mixer.init()
    pygame.mixer.music.load("textToSpeech.mp3")
    pygame.mixer.music.play()
    time.sleep(3)
    pygame.mixer.music.unload()
I am creating a chatbot in Python. It works well, but I want to add pyttsx to it so that it can speak its output.
My code is:
import aiml
import sys
import pyttsx

engine = pyttsx.init()

# Create a Kernel object.
kern = aiml.Kernel()

brainLoaded = False
forceReload = False
while not brainLoaded:
    if forceReload or (len(sys.argv) >= 2 and sys.argv[1] == "reload"):
        kern.bootstrap(learnFiles="std-startup.xml", commands="load aiml b")
        brainLoaded = True
        kern.saveBrain("standard.brn")
    else:
        try:
            kern.bootstrap(brainFile="standard.brn")
            brainLoaded = True
        except:
            forceReload = True

print "\nINTERACTIVE MODE (ctrl-c to exit)"
while(True):
    hea = kern.respond(raw_input("> "))
    print hea
    engine.say(hea)
engine.runAndWait()
When I run this code I don't hear any voice, but I can see the chat on the terminal. I want it to speak the response, too. What am I doing wrong?
engine.runAndWait() is outside the while(True): loop, so it's unlikely to be reached until the loop is interrupted.
If you move it into the loop and the sound is choppy, test the code below:
import pyttsx
engine = pyttsx.init()
engine.say("Oh, hello!")
My experience with pyttsx is that it needs to be fed short amounts of text, otherwise the text is interrupted. I'm not sure exactly why that is, but truncating the sentences yourself and saying several phrases should suit your purpose:
engine.say("It's nice to meet you.")
engine.say("I hope you are doing well.")
engine.say("Would you like to join us ")
engine.say("tomorrow at eight for dinner?")
But you'd need to parse the text and truncate it in a way that would keep the message intact.
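A simple way to do that truncation is to split the text on sentence boundaries and queue each piece separately. A sketch using the standard-library `re` module; the splitting rule is deliberately naive and `split_sentences` is just an illustrative helper:

```python
import re

def split_sentences(text):
    """Naively split text after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]

for sentence in split_sentences("It's nice to meet you. I hope you are doing well. "
                                "Would you like to join us tomorrow at eight for dinner?"):
    # each short piece keeps pyttsx from being interrupted mid-speech
    print(sentence)  # stand-in for engine.say(sentence)
```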