Why does google speech_v1p1beta1 output only show the last word? - python

I am using the code below to transcribe an audio file. When the process completes, I only get the last word.
I have tried both FLAC and WAV files and made sure the files are in my bucket.
I also verified that the Google service account is working fine. But I can't figure out why I am only getting the last word.
#!/usr/bin/env python
"""Google Cloud Speech API sample that demonstrates enhanced models
and recognition metadata.
Example usage:
    python diarization.py
"""
import argparse
import io


def transcribe_file_with_diarization():
    """Transcribe the given audio file synchronously with diarization."""
    # [START speech_transcribe_diarization_beta]
    from google.cloud import speech_v1p1beta1 as speech
    client = speech.SpeechClient()

    audio = speech.types.RecognitionAudio(uri="gs://MYBUCKET/MYAudiofile")

    config = speech.types.RecognitionConfig(
        encoding=speech.enums.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=8000,
        language_code='en-US',
        enable_speaker_diarization=True,
        diarization_speaker_count=2)

    print('Waiting for operation to complete...')
    response = client.recognize(config, audio)

    # The transcript within each result is separate and sequential per result.
    # However, the words list within an alternative includes all the words
    # from all the results thus far. Thus, to get all the words with speaker
    # tags, you only have to take the words list from the last result:
    result = response.results[-1]

    words_info = result.alternatives[0].words

    # Printing out the output:
    for word_info in words_info:
        print("word: '{}', speaker_tag: {}".format(word_info.word,
                                                   word_info.speaker_tag))
    # [END speech_transcribe_diarization_beta]


if __name__ == '__main__':
    transcribe_file_with_diarization()
The results shown after running the code:
python diarazation.py
Waiting for operation to complete...
word: 'bye', speaker_tag: 0

Related

RecognitionConfig error with .flac files in google transcribe

I am trying to transcribe an audio file with Google Cloud. Here is my code:
from google.cloud.speech_v1 import enums
from google.cloud import speech_v1p1beta1
import os
import io


def sample_long_running_recognize(local_file_path):
    client = speech_v1p1beta1.SpeechClient()

    # local_file_path = 'resources/commercial_mono.wav'

    # If enabled, each word in the first alternative of each result will be
    # tagged with a speaker tag to identify the speaker.
    enable_speaker_diarization = True

    # Optional. Specifies the estimated number of speakers in the conversation.
    diarization_speaker_count = 2

    # The language of the supplied audio
    language_code = "en-US"

    config = {
        "enable_speaker_diarization": enable_speaker_diarization,
        "diarization_speaker_count": diarization_speaker_count,
        "language_code": language_code,
        "encoding": enums.RecognitionConfig.AudioEncoding.FLAC
    }
    with io.open(local_file_path, "rb") as f:
        content = f.read()
    audio = {"content": content}
    # audio = {"uri": storage_uri}

    operation = client.long_running_recognize(config, audio)

    print(u"Waiting for operation to complete...")
    response = operation.result()

    for result in response.results:
        # First alternative has words tagged with speakers
        alternative = result.alternatives[0]
        print(u"Transcript: {}".format(alternative.transcript))
        # Print the speaker_tag of each word
        for word in alternative.words:
            print(u"Word: {}".format(word.word))
            print(u"Speaker tag: {}".format(word.speaker_tag))


sample_long_running_recognize('/Users/asi/Downloads/trimmed_3.flac')
I keep getting this error:
google.api_core.exceptions.InvalidArgument: 400 audio_channel_count `1` in RecognitionConfig must either be unspecified or match the value in the FLAC header `2`.
I cannot figure out what I am doing wrong. I have pretty much copied and pasted this from the Google Cloud Speech API docs. Any advice?
This attribute (audio_channel_count) is the number of channels in the input audio data, and you only need to set it for MULTI-CHANNEL recognition. That appears to be your case, so as the message suggests, you need to set 'audio_channel_count': 2 in your config to exactly match your audio file.
Please take a look at the source code for more information about the attributes of the RecognitionConfig object.
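For example, here is a minimal sketch of the config from the question with the channel count added; everything else is unchanged:
config = {
    "enable_speaker_diarization": enable_speaker_diarization,
    "diarization_speaker_count": diarization_speaker_count,
    "language_code": language_code,
    "encoding": enums.RecognitionConfig.AudioEncoding.FLAC,
    "audio_channel_count": 2,  # must match the channel count in the FLAC header
}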

Adding a pause in Google-text-to-speech

I am looking for a small pause, wait, or break that will allow for a short gap (about 2 seconds, ideally configurable) when speaking out the desired text.
People online have said that adding three full stops followed by a space creates a break, but I don't seem to be getting that. The code below is my test, which sadly has no pauses. Any ideas or suggestions?
Edit: It would be ideal if there were some command in gTTS that would allow me to do this, or maybe some trick like the three full stops if that actually worked.
from gtts import gTTS
import os
tts = gTTS(text=" Testing ... if there is a pause ... ... ... ... ... longer pause? ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... insane pause " , lang='en', slow=False)
tts.save("temp.mp3")
os.system("temp.mp3")
OK, you need Speech Synthesis Markup Language (SSML) to achieve this.
Be aware that you need to set up Google Cloud Platform credentials first. Then install the client library from bash:
pip install --upgrade google-cloud-texttospeech
Then here is the code:
import html

from google.cloud import texttospeech


def ssml_to_audio(ssml_text, outfile):
    # Instantiates a client
    client = texttospeech.TextToSpeechClient()

    # Sets the text input to be synthesized
    synthesis_input = texttospeech.SynthesisInput(ssml=ssml_text)

    # Builds the voice request, selects the language code ("en-US") and
    # the SSML voice gender ("MALE")
    voice = texttospeech.VoiceSelectionParams(
        language_code="en-US", ssml_gender=texttospeech.SsmlVoiceGender.MALE
    )

    # Selects the type of audio file to return
    audio_config = texttospeech.AudioConfig(
        audio_encoding=texttospeech.AudioEncoding.MP3
    )

    # Performs the text-to-speech request on the text input with the selected
    # voice parameters and audio file type
    response = client.synthesize_speech(
        input=synthesis_input, voice=voice, audio_config=audio_config
    )

    # Writes the synthetic audio to the output file.
    with open(outfile, "wb") as out:
        out.write(response.audio_content)
    print("Audio content written to file " + outfile)


def text_to_ssml(inputfile):
    raw_lines = inputfile

    # Replace special characters with HTML Ampersand Character Codes.
    # These codes prevent the API from confusing text with SSML commands.
    # For example, '<' --> '&lt;' and '&' --> '&amp;'
    escaped_lines = html.escape(raw_lines)

    # Convert plaintext to SSML.
    # Wait two seconds between each address.
    ssml = "<speak>{}</speak>".format(
        escaped_lines.replace("\n", '\n<break time="2s"/>')
    )

    # Return the concatenated string of ssml script
    return ssml


text = """Here are <say-as interpret-as="characters">SSML</say-as> samples.
I can pause <break time="3s"/>.
I can play a sound"""

ssml = text_to_ssml(text)
ssml_to_audio(ssml, "test.mp3")
More documentation:
Speaking addresses with SSML
But if you don't have Google Cloud Platform credentials, a cheaper and easier way is the time.sleep(1) method.
If any background waits are required, you can use the time module to wait, as below.
import time
# SLEEP FOR 5 SECONDS AND START THE PROCESS
time.sleep(5)
Or you can retry something up to three times with a wait in between:
import time

for tries in range(3):
    if someprocess() is False:
        time.sleep(3)
You can save multiple mp3 files, then use time.sleep() to call each with your desired amount of pause:
from gtts import gTTS
import os
from time import sleep
tts1 = gTTS(text="Testing", lang='en', slow=False)
tts2 = gTTS(text="if there is a pause", lang='en', slow=False)
tts3 = gTTS(text="insane pause", lang='en', slow=False)
tts1.save("temp1.mp3")
tts2.save("temp2.mp3")
tts3.save("temp3.mp3")
os.system("temp1.mp3")
sleep(2)
os.system("temp2.mp3")
sleep(3)
os.system("temp3.mp3")
Sadly, the answer is no: the gTTS package has no function for adding a pause (an issue was opened back in 2018 asking for one), but it is smart enough to add natural pauses via its tokenizer.
What is a tokenizer?
A function that takes text and returns it split into a list of tokens (strings). In the gTTS context, its goal is to cut the text into smaller segments that do not exceed the maximum character size allowed (100) for each TTS API request, while making the speech sound natural and continuous. It does so by splitting text where speech would naturally pause (for example on ".") while handling places where it should not (for example on "10.5" or "U.S.A."). Such rules are called tokenizer cases, and it takes a list of them.
Here is an example:
text = "regular text speed no pause regular text speed comma pause, regular text speed period pause. regular text speed exclamation pause! regular text speed ellipses pause... regular text speed new line pause \n regular text speed "
So in this case, adding a sleep() seems like the only answer. But tricking the tokenizer is worth mentioning.
You can add an arbitrary pause with pydub by saving and concatenating temporary mp3 files, using a silent audio file for the pause.
You can use any break-point symbol of your choice where you want to add a pause (here $):
from pydub import AudioSegment
from gtts import gTTS

contents = "Hello with $$ 2 seconds pause"
parts = contents.split("$")  # I have chosen this symbol for the pause.

pause2s = AudioSegment.from_mp3("silent.mp3")  # silent.mp3 contains a 2 s blank mp3

combined = AudioSegment.empty()
cnt = 0
for p in parts:
    # The pause will happen for the empty elements of the list
    if not p:
        combined += pause2s
    else:
        tts = gTTS(text=p, lang='en', slow=False)
        tmpFileName = "tmp" + str(cnt) + ".mp3"
        tts.save(tmpFileName)
        combined += AudioSegment.from_mp3(tmpFileName)
        cnt += 1

combined.export("out.mp3", format="mp3")
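If you don't have a silent mp3 on hand, you can generate one with pydub itself; a small sketch (the filename matches the snippet above):
from pydub import AudioSegment

# Generate 2 seconds of silence and save it as the silent.mp3 used above.
AudioSegment.silent(duration=2000).export("silent.mp3", format="mp3")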
Late to the party here, but you might consider trying out the audio_program_generator package. You provide a text file of individual phrases, each of which gets a configurable pause at the end. In return, it gives you an mp3 file that stitches together all the phrases and their pauses into one continuous audio file. You can optionally mix in a background sound file as well. And it implements several of the other bells and whistles that Google TTS provides, like accents, slow-play speech, etc.
Disclaimer: I am the author of the package.
I had the same problem and didn't want to use lots of temporary files on disk. This code parses an SSML file and creates silence whenever a <break> tag is found:
import io

from gtts import gTTS
import lxml.etree as etree
import pydub

ssml_filename = 'Section12.35-edited.ssml'
wav_filename = 'Section12.35-edited.mp3'
events = ('end',)
DEFAULT_BREAK_TIME = 250

all_audio = pydub.AudioSegment.silent(100)

for event, element in etree.iterparse(
    ssml_filename,
    events=events,
    remove_comments=True,
    remove_pis=True,
    attribute_defaults=True,
):
    tag = etree.QName(element).localname
    if tag in ['p', 's'] and element.text:
        tts = gTTS(element.text, lang='en', tld='com.au')
        with io.BytesIO() as temp_bytes:
            tts.write_to_fp(temp_bytes)
            temp_bytes.seek(0)
            audio = pydub.AudioSegment.from_mp3(temp_bytes)
        all_audio = all_audio.append(audio)
    elif tag == 'break':
        # Write silence to the file.
        time = element.attrib.get('time', None)  # Shouldn't be possible to have no time value.
        if time:
            if time.endswith('ms'):
                time_value = int(time.removesuffix('ms'))
            elif time.endswith('s'):
                time_value = int(time.removesuffix('s')) * 1000
            else:
                time_value = DEFAULT_BREAK_TIME
        else:
            time_value = DEFAULT_BREAK_TIME
        silence = pydub.AudioSegment.silent(time_value)
        all_audio = all_audio.append(silence)

with open(wav_filename, 'wb') as output_file:
    all_audio.export(output_file, format='mp3')
I know 4Rom1 used this method above, but to put it more simply, this worked really well for me: get a 1-second silent mp3 (I found one by googling "1 sec silent mp3"), then use pydub to repeat the silent segment however many times you need. For example, to add 3 seconds of silence:
from pydub import AudioSegment
seconds = 3
output = AudioSegment.from_file("yourfile.mp3")
output += AudioSegment.from_file("1sec_silence.mp3") * seconds
output.export("newaudio.mp3", format="mp3")

Post beanstalk queue data to Mssql using python script

I'm currently developing an offline ALPR solution.
So far I've used the OpenALPR software running on Ubuntu. Using a Python script I found on Stack Overflow, I'm able to read the beanstalk queue data (plate number & metadata) of the ALPR, but I need to send this data from the beanstalk queue to an MSSQL database. Does anyone know how to export beanstalk queue data or JSON data to the database? The code below works against localhost; how do I modify it to automatically post data to the MSSQL database? The data in the beanstalk queue is in JSON format [key=value].
The read & write CSV was my addition, to see if it could save the JSON data as CSV on the local disk.
import beanstalkc
import json
from pprint import pprint

beanstalk = beanstalkc.Connection(host='localhost', port=11300)
TUBE_NAME = 'alprd'
text_file = open('output.csv', 'w')

# For diagnostics, print out a list of all the tubes available in Beanstalk.
print beanstalk.tubes()

# For diagnostics, print the number of items on the current alprd queue.
try:
    pprint(beanstalk.stats_tube(TUBE_NAME))
except beanstalkc.CommandFailed:
    print "Tube doesn't exist"

# Watch the "alprd" tube; this is where the plate data is.
beanstalk.watch(TUBE_NAME)

while True:
    # Wait for a second to get a job. If there is a job, process it and delete it from the queue.
    # If not, return to sleep.
    job = beanstalk.reserve(timeout=5000)
    if job is None:
        print "No plates yet"
    else:
        plates_info = json.loads(job.body)

        # Do something with this data (e.g., match a list, open a gate, etc.).
        # if 'data_type' not in plates_info:
        #     print "This shouldn't be here... all OpenALPR data should have a data_type"
        # if plates_info['data_type'] == 'alpr_results':
        #     print "Found an individual plate result!"
        if plates_info['data_type'] == 'alpr_group':
            print "Found a group result!"
            print '\tBest plate: {} ({:.2f}% confidence)'.format(
                plates_info['candidates'][0]['plate'],
                plates_info['candidates'][0]['confidence'])
            make_model = plates_info['vehicle']['make_model'][0]['name']
            print '\tVehicle information: {} {} {}'.format(
                plates_info['vehicle']['year'][0]['name'],
                plates_info['vehicle']['color'][0]['name'],
                ' '.join([word.capitalize() for word in make_model.split('_')]))
        elif plates_info['data_type'] == 'heartbeat':
            print "Received a heartbeat"

        text_file.write('Best plate')
        # Delete the job from the queue when it is processed.
        job.delete()

text_file.close()
AFAIK there is no way to directly export data from beanstalkd.
What you have makes sense: stream data out of a tube into a flat file, or perform an insert into the DB directly.
Given the IOPS beanstalkd can produce, it might still be a reasonable solution (it depends on what performance you are expecting).
Try asking https://groups.google.com/forum/#!forum/beanstalk-talk as well
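For the direct-insert route, here is a minimal sketch using pyodbc; the connection string, table name, and columns are assumptions, not from the question:
import json

import pyodbc

# Placeholder connection string; adjust server, database, and credentials.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=myserver;DATABASE=alpr;UID=user;PWD=password"
)
cursor = conn.cursor()

def insert_plate(job_body):
    """Insert the best plate of an alpr_group result into a hypothetical plates table."""
    plates_info = json.loads(job_body)
    if plates_info.get('data_type') == 'alpr_group':
        best = plates_info['candidates'][0]
        # Assumes a table like: CREATE TABLE plates (plate VARCHAR(16), confidence FLOAT)
        cursor.execute(
            "INSERT INTO plates (plate, confidence) VALUES (?, ?)",
            best['plate'], best['confidence'])
        conn.commit()

You would call insert_plate(job.body) in place of (or alongside) the CSV write in the loop above.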

how to read mp3 data from google cloud using python

I am trying to read mp3/wav data from Google Cloud and implement an audio diarization technique. The issue is that I am not able to read the result that the Google API returns in the variable response.
Below is my Python code:
speech_file = r'gs://pp003231/a4a.wav'

config = speech.types.RecognitionConfig(
    encoding=speech.enums.RecognitionConfig.AudioEncoding.LINEAR16,
    language_code='en-US',
    enable_speaker_diarization=True,
    diarization_speaker_count=2)

audio = speech.types.RecognitionAudio(uri=speech_file)
response = client.long_running_recognize(config, audio)
print response
result = response.results[-1]
print result
Output displayed on console is
Traceback (most recent call last):
File "a1.py", line 131, in
print response.results
AttributeError: 'Operation' object has no attribute 'results'
Can you please share your expert advice about what I am doing wrong?
Thanks for your help.
It's too late for the author of this thread, but I'm posting the solution for anyone in the future, as I had a similar issue.
Change
result = response.results[-1]
to
result = response.result().results[-1]
and it will work fine.
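The reason is that long_running_recognize returns an Operation, not the response itself; calling .result() on it blocks until the operation finishes and returns the actual response. A minimal sketch of the corrected tail of the script (the timeout value is an assumption):
operation = client.long_running_recognize(config, audio)
print('Waiting for operation to complete...')
response = operation.result(timeout=300)  # blocks until the operation is done
result = response.results[-1]
print(result)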
Do you have access to the wav file in your bucket? Also, is this the entire code? It seems that sample_rate_hertz and the imports are missing. Below is the code copied from the Google docs samples, edited to have just the diarization function.
#!/usr/bin/env python
"""Google Cloud Speech API sample that demonstrates enhanced models
and recognition metadata.
Example usage:
    python diarization.py
"""
import argparse
import io


def transcribe_file_with_diarization():
    """Transcribe the given audio file synchronously with diarization."""
    # [START speech_transcribe_diarization_beta]
    from google.cloud import speech_v1p1beta1 as speech
    client = speech.SpeechClient()

    audio = speech.types.RecognitionAudio(uri="gs://<YOUR_BUCKET>/<YOUR_WAV_FILE>")

    config = speech.types.RecognitionConfig(
        encoding=speech.enums.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=8000,
        language_code='en-US',
        enable_speaker_diarization=True,
        diarization_speaker_count=2)

    print('Waiting for operation to complete...')
    response = client.recognize(config, audio)

    # The transcript within each result is separate and sequential per result.
    # However, the words list within an alternative includes all the words
    # from all the results thus far. Thus, to get all the words with speaker
    # tags, you only have to take the words list from the last result:
    result = response.results[-1]

    words_info = result.alternatives[0].words

    # Printing out the output:
    for word_info in words_info:
        print("word: '{}', speaker_tag: {}".format(word_info.word,
                                                   word_info.speaker_tag))
    # [END speech_transcribe_diarization_beta]


if __name__ == '__main__':
    transcribe_file_with_diarization()
To run the code, just name it diarization.py and use the command:
python diarization.py
Also, you have to install the latest google-cloud-speech library:
pip install --upgrade google-cloud-speech
And you need to have the credentials of your service account in a JSON file; you can check more info here.
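For example, the client library picks up the service-account key file through an environment variable; the path below is a placeholder:
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/service-account.json"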

Spark streaming not reading from local directory

I am trying to write a Spark Streaming application using the Spark Python API.
The application should read text files from a local directory and send them to a Kafka cluster.
When I submit the Python script to the Spark engine, nothing is sent to Kafka at all.
I tried to print the events instead of sending them to Kafka and found that nothing is read.
Here is the code of the script.
#!/usr/lib/python
# -*- coding: utf-8 -*-

from pyspark import SparkContext, SparkConf
from pyspark.streaming import StreamingContext
from kafka import KafkaProducer
import sys
import time

reload(sys)
sys.setdefaultencoding('utf8')

producer = KafkaProducer(bootstrap_servers="kafka-b01.css.org:9092,kafka-b02.css.org:9092,kafka-b03.css.org:9092,kafka-b04.css.org:9092,kafka-b05.css.org:9092")

def send_to_kafka(rdd):
    tweets = rdd.collect()
    print ("--------------------------")
    print (tweets)
    print "--------------------------"
    #for tweet in tweets:
    #    producer.send('test_historical_job', value=bytes(tweet))

if __name__ == "__main__":
    conf = SparkConf().setAppName("TestSparkFromPython")
    sc = SparkContext(conf=conf)
    ssc = StreamingContext(sc, 1)

    tweetsDstream = ssc.textFileStream("/tmp/historical/")
    tweetsDstream.foreachRDD(lambda rdd: send_to_kafka(rdd))

    ssc.start()
    ssc.awaitTermination()
I am submitting the script using this command
./spark-submit --master spark://spark-master:7077 /apps/historical_streamer.py
The output of the print statement is an empty list.
--------------------------
[]
--------------------------
EDIT
Based on this question, I changed the path of the data directory from "/tmp/historical/" to "file:///tmp/historical/".
I tried to run the job first and then move files into the directory, but unfortunately that did not work either.
File-stream-based sources like fileStream or textFileStream expect data files to be created in the dataDirectory by atomically moving or renaming them into the data directory.
If there are no new files in a given window, there is nothing to process, so pre-existing files (which seems to be the case here) won't be read and won't show in the output.
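To illustrate the atomic-move requirement, a minimal sketch (the file paths are placeholders): write the file somewhere outside the watched directory first, then rename it in after the streaming context has started, so textFileStream sees it as a new file in the current batch window:
import os

staging_path = "/tmp/staging/tweets-0001.txt"
watched_path = "/tmp/historical/tweets-0001.txt"

# Write the file outside the watched directory first.
with open(staging_path, "w") as f:
    f.write("some tweet text\n")

# os.rename is atomic when source and destination are on the same filesystem.
os.rename(staging_path, watched_path)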
Your function:
def send_to_kafka(rdd):
    tweets = rdd.collect()
    print ("--------------------------")
    print (tweets)
    print "--------------------------"
    #for tweet in tweets:
    #    producer.send('test_historical_job', value=bytes(tweet))
collects the whole RDD onto the driver and prints the resulting list in one go. To print each element on its own, you can instead run (in PySpark, where the printing happens on the executors):
rdd.foreach(print)
which, for every element in the RDD, outputs that element, as explained in the Spark documentation.
Hope this helps.
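As a side note (not from the original answer), for quickly debugging a DStream, PySpark also provides pprint(), which prints the first elements of each batch on the driver:
# Print the first 10 elements of every micro-batch to stdout.
tweetsDstream.pprint()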
