Custom phrases/words are ignored by Google Speech-To-Text - python

I am using python3 to transcribe an audio file with Google speech-to-text via the provided python packages (google-speech).
There is an option to define custom phrases which should be used for transcription as stated in the docs: https://cloud.google.com/speech-to-text/docs/speech-adaptation
For testing purposes I am using a small audio file with the contained text:
[..] in this lecture we'll talk about the Burrows wheeler transform and the FM index [..]
I am providing the following phrases to see the effect, for example when I want a specific name to be recognized with a particular spelling. In this example I want to change burrows to barrows:
config = speech.RecognitionConfig(dict(
    encoding=speech.RecognitionConfig.AudioEncoding.ENCODING_UNSPECIFIED,
    sample_rate_hertz=24000,
    language_code="en-US",
    enable_word_time_offsets=True,
    speech_contexts=[
        speech.SpeechContext(dict(
            phrases=["barrows", "barrows wheeler", "barrows wheeler transform"]
        ))
    ]
))
Unfortunately this does not seem to have any effect as the output is still the same as without the context phrases.
Am I using the phrases incorrectly, or is the model so confident that the word it hears is indeed burrows that it ignores my phrases?
PS: I also tried using the speech_v1p1beta1.AdaptationClient and speech_v1p1beta1.SpeechAdaptation instead of putting the phrases into the config but this only gives me an internal server error with no additional information on what is going wrong. https://cloud.google.com/speech-to-text/docs/adaptation

I created an audio file to recreate your scenario, and I was able to improve the recognition using model adaptation. To achieve this with that feature, I would suggest taking a look at this example and this post to better understand model adaptation.
Now, to improve the recognition of your phrase, I performed the following:
I created a new audio file using the following page with the mentioned phrase.
in this lecture we'll talk about the Burrows wheeler transform and the FM index
My tests were based on this code sample. This code creates a PhraseSet and CustomClass that includes the word you would like to improve, in this case the word "barrows". You can also create/update/delete the phrase set and custom class using the Speech-To-Text GUI. Below is the code I used for the improvement.
from google.cloud import speech_v1p1beta1 as speech
import argparse
import io


def transcribe_with_model_adaptation(
    project_id="[PROJECT-ID]", location="global", speech_file=None,
    custom_class_id="[CUSTOM-CLASS-ID]", phrase_set_id="[PHRASE-SET-ID]"
):
    """
    Create `PhraseSet` and `CustomClass` resources to build custom lists of
    similar items that are likely to occur in your input data.
    """

    # Create the adaptation client
    adaptation_client = speech.AdaptationClient()

    # The parent resource where the custom class and phrase set will be created.
    parent = f"projects/{project_id}/locations/{location}"

    # Create the custom class resource
    adaptation_client.create_custom_class(
        {
            "parent": parent,
            "custom_class_id": custom_class_id,
            "custom_class": {
                "items": [
                    {"value": "barrows"}
                ]
            },
        }
    )
    custom_class_name = (
        f"projects/{project_id}/locations/{location}/customClasses/{custom_class_id}"
    )

    # Create the phrase set resource
    phrase_set_response = adaptation_client.create_phrase_set(
        {
            "parent": parent,
            "phrase_set_id": phrase_set_id,
            "phrase_set": {
                "boost": 0,
                "phrases": [
                    {"value": f"${{{custom_class_name}}}", "boost": 10},
                    {"value": f"talk about the ${{{custom_class_name}}} wheeler transform", "boost": 15}
                ],
            },
        }
    )
    phrase_set_name = phrase_set_response.name
    # print(u"Phrase set name: {}".format(phrase_set_name))

    # The next section shows how to use the newly created custom
    # class and phrase set to send a transcription request with speech adaptation

    # Speech adaptation configuration
    speech_adaptation = speech.SpeechAdaptation(
        phrase_set_references=[phrase_set_name])

    # Speech configuration object
    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.FLAC,
        sample_rate_hertz=24000,
        language_code="en-US",
        adaptation=speech_adaptation,
        enable_word_time_offsets=True,
        model="phone_call",
        use_enhanced=True
    )

    # The name of the audio file to transcribe
    # storage_uri URI for audio file in Cloud Storage, e.g. gs://[BUCKET]/[FILE]
    with io.open(speech_file, "rb") as audio_file:
        content = audio_file.read()
    audio = speech.RecognitionAudio(content=content)
    # audio = speech.RecognitionAudio(uri="gs://biasing-resources-test-audio/call_me_fionity_and_ionity.wav")

    # Create the speech client
    speech_client = speech.SpeechClient()
    response = speech_client.recognize(config=config, audio=audio)

    for result in response.results:
        # The first alternative is the most likely one for this portion.
        print(u"Transcript: {}".format(result.alternatives[0].transcript))
    # [END speech_transcribe_with_model_adaptation]


if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description=__doc__, formatter_class=argparse.RawDescriptionHelpFormatter
    )
    parser.add_argument("path", help="Path for audio file to be recognized")
    args = parser.parse_args()
    transcribe_with_model_adaptation(speech_file=args.path)
Once it runs, you will receive an improved recognition like the one below. However, consider that the code tries to create a new custom class and a new phrase set each time it runs, and it might throw an "element already exists" error if you try to re-create the custom class and the phrase set (see the sketch after the transcripts below).
Using the recognition without the adaptation
(python_speech2text) user@penguin:~/replication/python_speech2text$ python speech_model_adaptation_beta.py audio.flac
Transcript: in this lecture will talk about the Burrows wheeler transform and the FM index
Using the recognition with the adaptation
(python_speech2text) user@penguin:~/replication/python_speech2text$ python speech_model_adaptation_beta.py audio.flac
Transcript: in this lecture will talk about the barrows wheeler transform and the FM index
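As noted above, re-running the script fails because the adaptation resources already exist. One simple workaround (a sketch, not part of the original sample; it assumes google-api-core's AlreadyExists exception and reuses the variables from the script above) is to catch the error and reuse the existing resources:

from google.api_core.exceptions import AlreadyExists

try:
    adaptation_client.create_custom_class(
        {"parent": parent, "custom_class_id": custom_class_id,
         "custom_class": {"items": [{"value": "barrows"}]}}
    )
except AlreadyExists:
    # Reuse the custom class created on a previous run
    pass

try:
    phrase_set_response = adaptation_client.create_phrase_set(
        {"parent": parent, "phrase_set_id": phrase_set_id,
         "phrase_set": {"phrases": [{"value": f"${{{custom_class_name}}}", "boost": 10}]}}
    )
    phrase_set_name = phrase_set_response.name
except AlreadyExists:
    # Fetch the phrase set created on a previous run instead
    phrase_set_name = adaptation_client.get_phrase_set(
        name=f"{parent}/phraseSets/{phrase_set_id}"
    ).name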
Finally, I would like to add some notes about the improvement and the code:
I used a FLAC audio file, as it is recommended for optimal results.
I used model="phone_call" and use_enhanced=True, as this was the model recognized by Cloud Speech-To-Text for my own audio file. The enhanced model can also provide better results; see the documentation for more details. Note that this configuration might vary for your audio file.
Consider enabling data logging so that Google can collect data from your audio transcription requests; Google then uses this data to improve its machine learning models for speech recognition.
Once you have created the custom class and the phrase set, you can use the Speech-to-Text UI to update them and perform your tests quickly.
In the phrase set I used the boost parameter. When you use boost, you assign a weighted value to phrase items in a PhraseSet resource. Speech-to-Text refers to this weighted value when selecting a possible transcription for words in your audio data. The higher the value, the higher the likelihood that Speech-to-Text chooses that word or phrase from the possible alternatives.
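If you prefer not to manage adaptation resources at all, boost can also be set directly on a SpeechContext in the recognition config. Below is a sketch based on the v1p1beta1 client; the values are illustrative, and this is an alternative to the adaptation resources above, not what I used in my tests:

from google.cloud import speech_v1p1beta1 as speech

config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.FLAC,
    sample_rate_hertz=24000,
    language_code="en-US",
    speech_contexts=[
        speech.SpeechContext(
            phrases=["barrows", "barrows wheeler transform"],
            boost=15.0,  # weight applied to these phrases
        )
    ],
)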
I hope this information helps you to improve your recognitions.

Related

Is there a way to translate documents in a batch without passing the source language code in Google Cloud

I want to build a solution that processes batches of documents in multiple languages. As per the Google documentation, it accepts a list of documents, but the source language code is mandatory and it accepts only one language code.
I want the application to be global and to translate the batch documents while auto-detecting the language. Please suggest if there is any alternative way to perform this with the Cloud Translation API.
Below is the code from Google Documentation
from google.cloud import translate_v3beta1 as translate


def batch_translate_document(
    input_uri: str,
    output_uri: str,
    project_id: str,
    timeout=180,
):
    client = translate.TranslationServiceClient()

    # The ``global`` location is not supported for batch translation
    location = "us-central1"

    # Google Cloud Storage location for the source input. This can be a single file
    # (for example, ``gs://translation-test/input.docx``) or a wildcard
    # (for example, ``gs://translation-test/*``).
    # Supported file types: https://cloud.google.com/translate/docs/supported-formats
    gcs_source = {"input_uri": input_uri}

    batch_document_input_configs = {
        "gcs_source": gcs_source,
    }
    gcs_destination = {"output_uri_prefix": output_uri}
    batch_document_output_config = {"gcs_destination": gcs_destination}
    parent = f"projects/{project_id}/locations/{location}"

    # Supported language codes: https://cloud.google.com/translate/docs/language
    operation = client.batch_translate_document(
        request={
            "parent": parent,
            "source_language_code": "en-US",
            "target_language_codes": ["fr-FR"],
            "input_configs": [batch_document_input_configs],
            "output_config": batch_document_output_config,
        }
    )

    print("Waiting for operation to complete...")
    response = operation.result(timeout)

    print("Total Pages: {}".format(response.total_pages))
First, you need to detect the language and then use that detected language for your translation. You can use the detectLanguage API operation for this (see the sketch below).
However, you can't perform a batch translation with multiple source languages. Only one source language is supported per request, although you can batch several documents that share it.
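As a rough sketch of the detection step (using the same translate_v3beta1 client as above; the project ID and sample text are placeholders):

from google.cloud import translate_v3beta1 as translate


def detect_source_language(project_id: str, sample_text: str) -> str:
    """Return the most likely language code for sample_text."""
    client = translate.TranslationServiceClient()
    parent = f"projects/{project_id}/locations/global"

    response = client.detect_language(
        request={
            "parent": parent,
            "content": sample_text,
            "mime_type": "text/plain",  # or "text/html"
        }
    )
    # Detected languages are ordered by confidence; take the top one.
    return response.languages[0].language_code


# Example idea: group documents by detected language, then call
# batch_translate_document() once per language group.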

Audio signal split at word level boundary

I am working with an audio file using webrtcvad and pydub. At the moment, fragments are split on the silence between sentences.
Is there any way the split can be done at word-level boundaries (after each spoken word)?
If librosa/ffmpeg/pydub has a feature like this, is a split possible at each word? After the split I need the start and end time of each word, exactly as that word is positioned in the original file.
One simple way to split with ffmpeg is described here:
https://gist.github.com/vadimkantorov/00bf4fbe4323360722e3d2220cc2915e
but this also splits on silence, and with each padding number or frame size the split is different. I am trying to split on words.
As an example, I have split the original file manually; the words and their time positions are in a JSON file in a folder provided here under this link:
www.mediafire.com/file/u4ojdjezmw4vocb/attached_problem.tar.gz
Simple audio segmentation problems can be handled by using a Hidden Markov Model, after preprocessing the audio into suitable features. Typical features for speech would be sound level and voice activity / voicedness. To get word-level segmentation (as opposed to sentence-level), this needs to have a rather high time resolution. Unfortunately, pyWebRTCVAD does not have adjustable time smoothing, so it might not be suited for the task.
In your audio sample there is a radio host speaking rather quickly in German.
Looking at the sound levels with respect to the word boundaries you have marked, it is clear that between some words the sound level doesn't really drop. That rules out a simple sound-level segmentation model.
All in all, getting good results for general speech signals can be quite hard. But fortunately this is very well researched, and off-the-shelf solutions are available.
These typically use an acoustic model (how words and phonemes sound), as well as a language model (likely orderings of words), learned over many hours of audio.
Word segmentation using a speech recognition library
All these pieces are included in a speech recognition framework, and many allow you to get word-level outputs with timing. Below is some working code for this using Vosk.
Alternatives to Vosk would be PocketSphinx, or an online speech recognition service from Google Cloud, Amazon Web Services, Azure, etc.
import sys
import os
import subprocess
import json
import math

# tested with VOSK 0.3.15
import vosk
import librosa
import numpy
import pandas


def extract_words(res):
    jres = json.loads(res)
    if not 'result' in jres:
        return []
    words = jres['result']
    return words


def transcribe_words(recognizer, bytes):
    results = []

    chunk_size = 4000
    for chunk_no in range(math.ceil(len(bytes)/chunk_size)):
        start = chunk_no*chunk_size
        end = min(len(bytes), (chunk_no+1)*chunk_size)
        data = bytes[start:end]

        if recognizer.AcceptWaveform(data):
            words = extract_words(recognizer.Result())
            results += words
    results += extract_words(recognizer.FinalResult())

    return results


def main():
    vosk.SetLogLevel(-1)

    audio_path = sys.argv[1]
    out_path = sys.argv[2]

    model_path = 'vosk-model-small-de-0.15'
    sample_rate = 16000

    audio, sr = librosa.load(audio_path, sr=16000)

    # convert to 16bit signed PCM, as expected by VOSK
    int16 = numpy.int16(audio * 32768).tobytes()

    # XXX: Model must be downloaded from https://alphacephei.com/vosk/models
    # https://alphacephei.com/vosk/models/vosk-model-small-de-0.15.zip
    if not os.path.exists(model_path):
        raise ValueError(f"Could not find VOSK model at {model_path}")

    model = vosk.Model(model_path)
    recognizer = vosk.KaldiRecognizer(model, sample_rate)

    res = transcribe_words(recognizer, int16)
    df = pandas.DataFrame.from_records(res)
    df = df.sort_values('start')

    df.to_csv(out_path, index=False)
    print('Word segments saved to', out_path)


if __name__ == '__main__':
    main()
Run the program with the .WAV file and the path to an output file.
python vosk_words.py attached_problem/main.wav out.csv
The script outputs words and their times in the CSV. These timings can then be used to split the audio file. Here is example output:
conf,end,start,word
0.618949,1.11,0.84,also
1.0,1.32,1.116314,eine
1.0,1.59,1.32,woche
0.411941,1.77,1.59,des
Comparing the output (bottom) with the example file you provided (top), it looks pretty good.
It actually picked up a word that your annotations did not include, "und" at 42.25 seconds.
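As noted above, these word timings can then be used to cut the original file. Here is a small, illustrative sketch (assuming pydub is installed; the file names match the example invocation above and may need adjusting):

import pandas
from pydub import AudioSegment

audio = AudioSegment.from_wav('attached_problem/main.wav')
words = pandas.read_csv('out.csv')

for i, row in words.iterrows():
    # VOSK reports times in seconds; pydub slices in milliseconds.
    start_ms = int(row['start'] * 1000)
    end_ms = int(row['end'] * 1000)
    segment = audio[start_ms:end_ms]
    segment.export(f"word_{i:04d}_{row['word']}.wav", format='wav')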
Delimiting words is beyond the audio domain and requires a kind of intelligence. Doing it manually is easy because we are intelligent and know exactly what we are looking for, but automating the process is hard because, as you have already noticed, silence is not (not only, not always) a word delimiter.
At the audio level, we can only approximate a solution, and this requires both analyzing the amplitude of the signal and adding some time mechanisms. As an example, Pro Tools provides a nice tool named Strip Silence that cuts audio regions automatically based on the amplitude of the signal. It always keeps the material at its original position in the timeline, and naturally each region knows its own duration. In addition to the threshold in dB, and to prevent creating too many regions, it provides several useful parameters in the time domain: a minimum length for the created regions, a delay before the cut (the delay is computed from the point where the amplitude passes below the threshold), and an inverted delay before reopening the gate (computed backward from the point where the amplitude passes above the threshold).
This could be a good starting point for you. Implementing such a system probably won't be 100% successful, but you could obtain a quite good ratio if the settings are well adjusted to the speaker. Even if it's not perfect, it will significantly reduce the need for manual work.
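As a rough illustration of this kind of amplitude gating (not Pro Tools itself), here is a sketch using pydub's silence detection; the threshold and minimum-length values are placeholders that have to be tuned to the speaker and recording:

from pydub import AudioSegment
from pydub.silence import detect_nonsilent

audio = AudioSegment.from_wav('attached_problem/main.wav')

# Regions whose level stays above the threshold are kept together;
# both values need tuning for the speaker and recording.
regions = detect_nonsilent(
    audio,
    min_silence_len=100,             # ms of low level required before a cut
    silence_thresh=audio.dBFS - 16,  # threshold in dBFS, relative to the average level
    seek_step=10,                    # ms resolution of the scan
)

for start_ms, end_ms in regions:
    print(f"region: {start_ms/1000:.3f}s - {end_ms/1000:.3f}s")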

Access my own image files from S3, or load them into a path, in SageMaker

I'm fairly new to working with AWS, and I want to use SageMaker to train a certain image data set using fast.ai. But I have no clue how to link all the image data from S3 to SageMaker.
I tried almost everything I could think of. Using s3fs I can read the images individually and get the list of images, but how do I feed that into my databunch or learning algorithm?
My code:
import boto3
import pandas as pd
from sagemaker import get_execution_role
role = get_execution_role()
bucket='sagemaker-sst-images'
data_key = 'SST_Data/sst-images'
data_location = 's3://{}/{}'.format(bucket, data_key)
This code, I think, gives me a URL to the data.
But what comes next? How do I either get it into a path or load the data properly?
Since you have boto3 imported, you can use that to start a training job in Sagemaker. It can read your training data straight from S3 to train your model on.
If you are using a custom model, this would require having your training/inference code in a container image in ECR.
The way you could do it is like this (I use some of the vars you put in your question):
# Get a SageMaker client object
sagemaker_client = boto3.client('sagemaker')

# Create an input channel definition (for the training job call)
input_data = {
    'ChannelName': 'train',
    'DataSource': {
        'S3DataSource': {
            'S3DataType': 'S3Prefix',
            'S3Uri': data_location
        }
    },
    'ContentType': 'image/png'  # This will depend on the type of your images
}

# Start the training job on SageMaker
# (the bracketed values are placeholders you need to fill in)
sagemaker_client.create_training_job(
    TrainingJobName='my-training-job',
    HyperParameters={
        # Any hyperparameters your algorithm needs, as string values
    },
    AlgorithmSpecification={
        'TrainingImage': '<ECR-URI-of-your-algorithm-image>',
        'TrainingInputMode': 'File'
    },
    RoleArn=role,
    InputDataConfig=[input_data],   # the API expects a list of channels
    OutputDataConfig={
        'S3OutputPath': '<S3-output-path-for-your-trained-model-artifacts>'
    },
    ResourceConfig={
        'InstanceType': '<instance-type>',  # whichever type/size of instance you want
        'InstanceCount': 1,                 # however many instances you want to train with
        'VolumeSizeInGB': 30                # however much storage space you need on the instance
    },
    StoppingCondition={
        'MaxRuntimeInSeconds': 86400        # required by the API: maximum training time
    }
)
Here is some information regarding the instances sizes and prices:
https://aws.amazon.com/sagemaker/pricing/instance-types/
https://aws.amazon.com/sagemaker/pricing/
For ContentType in the input config, here is a resource with more info on the MIME type you need: https://www.sitepoint.com/mime-types-complete-list/
After the training completes, you can use the model artifacts it creates to make a SageMaker model, and use it to perform inference.
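As a rough sketch of that last step (not part of the training example above; the model name, image URI, artifact path and instance type are placeholders):

# Create a model from the training artifacts, then host it on an endpoint.
sagemaker_client.create_model(
    ModelName='my-model',
    PrimaryContainer={
        'Image': '<ECR-URI-of-your-inference-image>',
        'ModelDataUrl': '<S3-path-to-model.tar.gz-produced-by-the-training-job>'
    },
    ExecutionRoleArn=role
)

sagemaker_client.create_endpoint_config(
    EndpointConfigName='my-endpoint-config',
    ProductionVariants=[{
        'VariantName': 'AllTraffic',
        'ModelName': 'my-model',
        'InstanceType': 'ml.m5.large',
        'InitialInstanceCount': 1
    }]
)

sagemaker_client.create_endpoint(
    EndpointName='my-endpoint',
    EndpointConfigName='my-endpoint-config'
)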

Why GCP Vision API returns worse results in python than at its online demo

I wrote a basic python script to call and use the GCP Vision API. My aim is to send it an image of a product and retrieve (with OCR) the words written on its box. I have a predefined list of brands, so I can search the text returned by the API for the brand and detect which product it is.
My python script is the following:
import io
from google.cloud import vision
from google.cloud.vision import types
import os
import cv2
import numpy as np

os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "**************************"


def detect_text(file):
    """Detects text in the file."""
    client = vision.ImageAnnotatorClient()

    with io.open(file, 'rb') as image_file:
        content = image_file.read()

    image = types.Image(content=content)

    response = client.text_detection(image=image)
    texts = response.text_annotations

    print('Texts:')
    for text in texts:
        print('\n"{}"'.format(text.description))
        vertices = (['({},{})'.format(vertex.x, vertex.y)
                     for vertex in text.bounding_poly.vertices])
        print('bounds: {}'.format(','.join(vertices)))


file_name = "Image.jpg"
img = cv2.imread(file_name)
detect_text(file_name)
For now, I am experimenting with the following product image: (951 × 335 resolution)
Its brand is Acuvue.
The problem is the following. When I test the online demo of the GCP Cloud Vision API, I get the following text result for this image:
FOR ASTIGMATISM 1-DAY ACUVUE MOIST WITH LACREON™ 30 Lenses BRAND CONTACT LENSES UV BLOCKING
(The JSON result returns all the above words, including the word Acuvue which matters to me, but the JSON is too long to post here.)
Therefore, the online demo detects the text on the product pretty well, and at least it accurately detects the word Acuvue (which is the brand). However, when I call the same API in my python script with the same image, I get the following result:
Texts:
"1.DAY
FOR ASTIGMATISM
WITH
LACREONTM
MOIS
30 Lenses
BRAND CONTACT LENSES
UV BLOCKING
"
bounds: (221,101),(887,101),(887,284),(221,284)
"1.DAY"
bounds: (221,101),(312,101),(312,125),(221,125)
"FOR"
bounds: (622,107),(657,107),(657,119),(622,119)
"ASTIGMATISM"
bounds: (664,107),(788,107),(788,119),(664,119)
"WITH"
bounds: (614,136),(647,136),(647,145),(614,145)
"LACREONTM"
bounds: (600,151),(711,146),(712,161),(601,166)
"MOIS"
bounds: (378,162),(525,153),(528,200),(381,209)
"30"
bounds: (614,177),(629,178),(629,188),(614,187)
"Lenses"
bounds: (634,178),(677,180),(677,189),(634,187)
"BRAND"
bounds: (361,210),(418,210),(418,218),(361,218)
"CONTACT"
bounds: (427,209),(505,209),(505,218),(427,218)
"LENSES"
bounds: (514,209),(576,209),(576,218),(514,218)
"UV"
bounds: (805,274),(823,274),(823,284),(805,284)
"BLOCKING"
bounds: (827,276),(887,276),(887,284),(827,284)
But this does not detect the word "Acuvue" at all, while the demo does!
Why is this happening?
Can I fix something in my python script to make it work properly?
From the docs:
The Vision API can detect and extract text from images. There are two annotation features that support OCR:
TEXT_DETECTION detects and extracts text from any image. For example, a photograph might contain a street sign or traffic sign. The JSON includes the entire extracted string, as well as individual words, and their bounding boxes.
DOCUMENT_TEXT_DETECTION also extracts text from an image, but the response is optimized for dense text and documents. The JSON includes page, block, paragraph, word, and break information.
My hope was that the web API was actually using the latter, and then filtering the results based on the confidence.
A DOCUMENT_TEXT_DETECTION response includes additional layout information, such as page, block, paragraph, word, and break information, along with confidence scores for each.
At any rate, I was hoping (and my experience has been) that the latter method would "try harder" to find all the strings.
I don't think you were doing anything "wrong". There are just two parallel detection methods. One (DOCUMENT_TEXT_DETECTION) is more intense, optimized for documents (likely for straightened, aligned and evenly spaced lines), and gives more information that might be unnecessary for some applications.
So I suggest you modify your code to follow the DOCUMENT_TEXT_DETECTION Python example here.
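For instance, a minimal variant of your detect_text function switched to DOCUMENT_TEXT_DETECTION could look like this (a sketch reusing the imports from your script, not the official sample):

def detect_document_text(file):
    """Detects dense text in the file using DOCUMENT_TEXT_DETECTION."""
    client = vision.ImageAnnotatorClient()

    with io.open(file, 'rb') as image_file:
        content = image_file.read()

    image = types.Image(content=content)

    # document_text_detection corresponds to the DOCUMENT_TEXT_DETECTION feature
    response = client.document_text_detection(image=image)

    # full_text_annotation contains the whole extracted string plus
    # page/block/paragraph/word structure with confidence scores
    print(response.full_text_annotation.text)
    for page in response.full_text_annotation.pages:
        for block in page.blocks:
            print('block confidence: {}'.format(block.confidence))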
Lastly, my guess is that the \342\204\242 you ask about are escaped octal values corresponding to utf-8 characters it thinks it found when trying to identify the ™ symbol.
If you use the following snippet:
b = b"\342\204\242"
s = b.decode('utf8')
print(s)
You'll be happy to see that it prints ™.

GATT Characteristics with read-property not found by application

I am trying to develop an application that is communicating with an external device using BLE. I have decided to use pygatt (Python) with BGAPI (using a BlueGiga dongle).
The device I am communicating with has a custom primary service with a set of characteristics. According to their specs they have 2 READ characteristics, 8 NOTIFY chars and 1 WRITE char. Initially, I want to read one of the two READ chars, but I am unable to do so. Their UUIDs are not recognized as characteristics. How can this be? I am 100% certain that they are entered correctly.
import pygatt
import bleconnect
import blelib
import logging

logging.basicConfig()
logging.getLogger('pygatt').setLevel(logging.DEBUG)

adapter = pygatt.BGAPIBackend(serial_port='/dev/tty.usbmodem1')
adapter.start()

# Find the device
result = adapter.scan(timeout=5)
for item in result:
    scan_name = item['name']
    scan_rssi = item['rssi']
    scan_address = item['address']
    if scan_name == bleconnect.TARGET_NAME:
        break

# Connect
device = adapter.connect(address=scan_address)

device.char_read(blelib.CHARACTERISTIC_DEVICE_FEATURES)
I can see in the debug messages that all the NOTIFY and WRITE characteristics are found, but not the two READ characteristics.
What am I missing?
This appears to be some kind of shortcoming in the pygatt API. I managed to find the actual value using bgapi only.
