I'm using VoxLingua107 to detect the language of an audio file.
At first it works fine, but then I tried it on a downloaded Spanish audio file and it detects English!
class AudioMixin:
    def detect_language(self, path) -> str:
        language_id = EncoderClassifier.from_hparams(source="TalTechNLP/voxlingua107-epaca-ecapa", savedir='media/tmp')
        # Load the audio file and convert it to a suitable form
        signal = language_id.load_audio(path)
        prediction = language_id.classify_batch(signal)
        return prediction[3]
that's the code and I've no idea what I'm doing wrong
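I can't tell from the snippet alone why Spanish comes back as English, but a first step is usually to look at how confident the model is. Here is a minimal debugging sketch, assuming the SpeechBrain EncoderClassifier interface (classify_batch returns out_prob, score, index and text_lab); the file name is just a placeholder:
from speechbrain.pretrained import EncoderClassifier

language_id = EncoderClassifier.from_hparams(
    source="TalTechNLP/voxlingua107-epaca-ecapa",  # same source as in the question
    savedir="media/tmp",
)
signal = language_id.load_audio("sample_es.wav")   # placeholder Spanish sample
out_prob, score, index, text_lab = language_id.classify_batch(signal)
# print the predicted label together with its score to see how sure the model is
print(text_lab, float(score))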
When I use the OpenAI Whisper model on Hindi audio, it returns the transcription in English instead of Hindi.
How do I get the output in Hindi itself? Is there a setting that can be changed?
mel = whisper.log_mel_spectrogram(audio).to(model.device)
options = whisper.DecodingOptions(language = 'hi')
result = whisper.decode(model, mel, options)
print(result.text)
Result: the output comes back in English rather than in Hindi.
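Not an authoritative answer, but two things are worth checking, sketched below under the assumption that the standard openai-whisper API is being used: an English-only checkpoint such as "small.en" can only produce English, so a multilingual model is needed, and task="transcribe" keeps the source language while task="translate" always returns English.
import whisper

model = whisper.load_model("small")              # multilingual checkpoint (assumption)
audio = whisper.load_audio("hindi_sample.wav")   # placeholder file name
audio = whisper.pad_or_trim(audio)

mel = whisper.log_mel_spectrogram(audio).to(model.device)
# language pins the decoder to Hindi; task="transcribe" keeps the output in Hindi
options = whisper.DecodingOptions(language="hi", task="transcribe")
result = whisper.decode(model, mel, options)
print(result.text)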
The title is probably not clear enough to describe my issue, sorry for that.
I'm working on playing chess with speech recognition.
The issue:
Example: the user says "Rook A1 to A4", but the speech recognizer hears "rogue" or "Brooklyn A1", etc.
How do I pick specific words like rook, pawn, queen, etc. and make the speech recognition only understand those words?
Current code I started with:
import pyttsx3
import speech_recognition as sr

recognizer = sr.Recognizer()

while True:
    try:
        with sr.Microphone() as mic:
            recognizer.adjust_for_ambient_noise(mic, duration=0.2)
            audio = recognizer.listen(mic)
            text = recognizer.recognize_google(audio)
            text = text.lower()
            print(f"{text}")
    except sr.UnknownValueError:
        recognizer = sr.Recognizer()
        continue
You should use keyword_entries. Provide them as a list of (keyword, sensitivity) tuples, like so: keyword_only_text = r.recognize_sphinx(audio, keyword_entries=[("rook", 0.1), ("knight", 0.1), ...])
Unfortunately, this feature is not available for recognize_google.
The sensitivity values decide how strongly recognition is biased toward each word. If you set them to 1, any recorded words will be mapped only onto your keywords; if you set them to 0.001, you will only get a slight bias towards them.
See how the function works in speech_recognition's recognize_sphinx implementation.
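As a minimal sketch of how that could look for the chess vocabulary (assuming PocketSphinx is installed, since recognize_sphinx runs locally; the word list and sensitivities are only illustrative):
import speech_recognition as sr

recognizer = sr.Recognizer()
chess_keywords = [("rook", 0.1), ("knight", 0.1), ("bishop", 0.1),
                  ("queen", 0.1), ("king", 0.1), ("pawn", 0.1)]

with sr.Microphone() as mic:
    recognizer.adjust_for_ambient_noise(mic, duration=0.2)
    audio = recognizer.listen(mic)

try:
    # only the listed keywords are considered; higher sensitivity = stronger bias
    text = recognizer.recognize_sphinx(audio, keyword_entries=chess_keywords)
    print(text.lower())
except sr.UnknownValueError:
    print("could not understand audio")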
I am using Python 3 to transcribe an audio file with Google Speech-to-Text via the provided Python package (google-speech).
There is an option to define custom phrases which should be used for transcription as stated in the docs: https://cloud.google.com/speech-to-text/docs/speech-adaptation
For testing purposes I am using a small audio file with the contained text:
[..] in this lecture we'll talk about the Burrows wheeler transform and the FM index [..]
I am providing the following phrases to see the effect, for example when I want a specific name to be recognized with a particular spelling. In this example I want to change burrows to barrows:
config = speech.RecognitionConfig(dict(
    encoding=speech.RecognitionConfig.AudioEncoding.ENCODING_UNSPECIFIED,
    sample_rate_hertz=24000,
    language_code="en-US",
    enable_word_time_offsets=True,
    speech_contexts=[
        speech.SpeechContext(dict(
            phrases=["barrows", "barrows wheeler", "barrows wheeler transform"]
        ))
    ]
))
Unfortunately this does not seem to have any effect, as the output is still the same as without the context phrases.
Am I using the phrases wrong, or does the recognizer have such high confidence that the word it hears is indeed burrows that it ignores my phrases?
PS: I also tried using the speech_v1p1beta1.AdaptationClient and speech_v1p1beta1.SpeechAdaptation instead of putting the phrases into the config but this only gives me an internal server error with no additional information on what is going wrong. https://cloud.google.com/speech-to-text/docs/adaptation
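While debugging the speech_contexts approach, it may also be worth trying the boost field that SpeechContext exposes in the v1p1beta1 client, which weights the listed phrases against the default vocabulary; a minimal sketch (the boost value is only an example):
from google.cloud import speech_v1p1beta1 as speech

config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.ENCODING_UNSPECIFIED,
    sample_rate_hertz=24000,
    language_code="en-US",
    enable_word_time_offsets=True,
    speech_contexts=[
        speech.SpeechContext(
            phrases=["barrows", "barrows wheeler", "barrows wheeler transform"],
            boost=20.0,  # illustrative weighting toward these phrases
        )
    ],
)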
I have created an audio file to recreate your scenario and I was able to improve the recognition using model adaptation. To achieve this, I would suggest taking a look at this example and this post to better understand the adaptation model.
Now, to improve the recognition of your phrase, I performed the following:
I created a new audio file using the following page with the mentioned phrase.
in this lecture we'll talk about the Burrows wheeler transform and the FM index
My tests were based on this code sample. This code creates a PhraseSet and CustomClass that includes the word you would like to improve, in this case the word "barrows". You can also create/update/delete the phrase set and custom class using the Speech-To-Text GUI. Below is the code I used for the improvement.
from google.cloud import speech_v1p1beta1 as speech
import argparse
import io


def transcribe_with_model_adaptation(
    project_id="[PROJECT-ID]", location="global", speech_file=None,
    custom_class_id="[CUSTOM-CLASS-ID]", phrase_set_id="[PHRASE-SET-ID]"
):
    """
    Create `PhraseSet` and `CustomClasses` to create custom lists of similar
    items that are likely to occur in your input data.
    """

    # Create the adaptation client
    adaptation_client = speech.AdaptationClient()

    # The parent resource where the custom class and phrase set will be created.
    parent = f"projects/{project_id}/locations/{location}"

    # Create the custom class resource
    adaptation_client.create_custom_class(
        {
            "parent": parent,
            "custom_class_id": custom_class_id,
            "custom_class": {
                "items": [
                    {"value": "barrows"}
                ]
            },
        }
    )
    custom_class_name = (
        f"projects/{project_id}/locations/{location}/customClasses/{custom_class_id}"
    )

    # Create the phrase set resource
    phrase_set_response = adaptation_client.create_phrase_set(
        {
            "parent": parent,
            "phrase_set_id": phrase_set_id,
            "phrase_set": {
                "boost": 0,
                "phrases": [
                    {"value": f"${{{custom_class_name}}}", "boost": 10},
                    {"value": f"talk about the ${{{custom_class_name}}} wheeler transform", "boost": 15}
                ],
            },
        }
    )
    phrase_set_name = phrase_set_response.name
    # print(u"Phrase set name: {}".format(phrase_set_name))

    # The next section shows how to use the newly created custom
    # class and phrase set to send a transcription request with speech adaptation

    # Speech adaptation configuration
    speech_adaptation = speech.SpeechAdaptation(
        phrase_set_references=[phrase_set_name])

    # Speech configuration object
    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.FLAC,
        sample_rate_hertz=24000,
        language_code="en-US",
        adaptation=speech_adaptation,
        enable_word_time_offsets=True,
        model="phone_call",
        use_enhanced=True
    )

    # The name of the audio file to transcribe
    # storage_uri URI for audio file in Cloud Storage, e.g. gs://[BUCKET]/[FILE]
    with io.open(speech_file, "rb") as audio_file:
        content = audio_file.read()

    audio = speech.RecognitionAudio(content=content)
    # audio = speech.RecognitionAudio(uri="gs://biasing-resources-test-audio/call_me_fionity_and_ionity.wav")

    # Create the speech client
    speech_client = speech.SpeechClient()

    response = speech_client.recognize(config=config, audio=audio)

    for result in response.results:
        # The first alternative is the most likely one for this portion.
        print(u"Transcript: {}".format(result.alternatives[0].transcript))
# [END speech_transcribe_with_model_adaptation]


if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description=__doc__, formatter_class=argparse.RawDescriptionHelpFormatter
    )
    parser.add_argument("path", help="Path for audio file to be recognized")
    args = parser.parse_args()

    transcribe_with_model_adaptation(speech_file=args.path)
Once it runs, you will receive an improved recognition like the one below; however, consider that the code tries to create a new custom class and a new phrase set each time it runs, so it might throw an "element already exists" error if you try to re-create the custom class and the phrase set.
Using the recognition without the adaptation
(python_speech2text) user@penguin:~/replication/python_speech2text$ python speech_model_adaptation_beta.py audio.flac
Transcript: in this lecture will talk about the Burrows wheeler transform and the FM index
Using the recognition with the adaptation
(python_speech2text) user@penguin:~/replication/python_speech2text$ python speech_model_adaptation_beta.py audio.flac
Transcript: in this lecture will talk about the barrows wheeler transform and the FM index
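Regarding the "element already exists" note above: one option (a hedged sketch, not part of the original sample; it only mirrors the placeholder IDs used in the function) is to delete the existing resources with the same AdaptationClient before re-running the script:
from google.cloud import speech_v1p1beta1 as speech

adaptation_client = speech.AdaptationClient()
parent = "projects/[PROJECT-ID]/locations/global"

# The phrase set references the custom class, so remove the phrase set first.
adaptation_client.delete_phrase_set(
    {"name": f"{parent}/phraseSets/[PHRASE-SET-ID]"}
)
adaptation_client.delete_custom_class(
    {"name": f"{parent}/customClasses/[CUSTOM-CLASS-ID]"}
)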
Finally, I would like to add some notes about the improvement and the code:
I used a FLAC audio file, as it is recommended for optimal results.
I used model="phone_call" and use_enhanced=True, as this was the model recognized by Cloud Speech-to-Text for my own audio file. The enhanced model can also provide better results; see the documentation for more details. Note that this configuration might vary for your audio file.
Consider enabling data logging so Google can collect data from your audio transcription requests. Google then uses this data to improve its machine learning models used for recognizing speech audio.
Once you have created the custom class and the phrase set, you can use the Speech-to-Text UI to update them and run your tests quickly.
I used the boost parameter in the phrase set. When you use boost, you assign a weighted value to phrase items in a PhraseSet resource. Speech-to-Text refers to this weighted value when selecting a possible transcription for words in your audio data. The higher the value, the higher the likelihood that Speech-to-Text chooses that word or phrase from the possible alternatives.
I hope this information helps you to improve your recognitions.
I was working with Python speech recognition and it works well, but I can't get my head around this.
Issue:
when working with an audio file
when using recog.listen(source)
for some reason, it only listens for a couple of seconds and then stops
so it only listens to part of the audio file and then ignores the rest
This is the simplified code I'm running with the exact same issue:
import os
import speech_recognition as sr

recog = sr.Recognizer()
audioFile = sr.AudioFile(r'C:\Users\ilieg\OneDrive\Documents\Sound recordings\male.wav')
transcript = ""

with audioFile as source:
    audio = recog.listen(source)

transcript = transcript + " " + recog.recognize_google(audio)
print(transcript)
If you need a sample of the audio file, I got it from here, just for testing purposes (I used the first audio file):
http://www.signalogic.com/index.pl?page=codec_samples
Example:
The output of the following audio file (click for audio file) is:
what if somebody decides to break it be careful that you keep adequate coverage but look for places to save money baby it's taking longer to get things squared away than the bankers expected during the wife or ones company may win her tax hated retirement income to boost is helpful to saving Rags are hers Lee tossed on the two naked bones
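Not a definitive answer, but one thing stands out: Recognizer.listen() stops as soon as it thinks a phrase has ended (a stretch of silence), while Recognizer.record() reads the whole file or a given duration. A minimal sketch against the same file:
import speech_recognition as sr

recog = sr.Recognizer()
audioFile = sr.AudioFile(r'C:\Users\ilieg\OneDrive\Documents\Sound recordings\male.wav')

with audioFile as source:
    # record() consumes the entire file instead of stopping at the first pause
    audio = recog.record(source)

print(recog.recognize_google(audio))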
I am writing a program that recognizes speech. It records audio from the microphone and converts it to text using Sphinx. My problem is that I want to start recording audio only when something is spoken by the user.
I experimented by reading the audio levels from the microphone and recording only when the level is above a particular value, but it ain't that effective. The program starts recording whenever it detects anything loud. This is the code I used:
import audioop
import pyaudio as pa
import wave


class speech():
    def __init__(self):
        # soundtrack properties
        self.format = pa.paInt16
        self.rate = 16000
        self.channel = 1
        self.chunk = 1024
        self.threshold = 150
        self.file = 'audio.wav'

        # initialise microphone stream
        self.audio = pa.PyAudio()
        self.stream = self.audio.open(format=self.format,
                                      channels=self.channel,
                                      rate=self.rate,
                                      input=True,
                                      frames_per_buffer=self.chunk)

    def record(self):
        # wait until the input volume rises above the threshold
        while True:
            data = self.stream.read(self.chunk)
            rms = audioop.rms(data, 2)   # get input volume
            if rms > self.threshold:     # input volume greater than threshold
                break

        # array to store frames
        frames = []

        # record up to silence only
        while rms > self.threshold:
            data = self.stream.read(self.chunk)
            rms = audioop.rms(data, 2)
            frames.append(data)

        print('finished recording.... writing file....')
        write_frames = wave.open(self.file, 'wb')
        write_frames.setnchannels(self.channel)
        write_frames.setsampwidth(self.audio.get_sample_size(self.format))
        write_frames.setframerate(self.rate)
        write_frames.writeframes(b''.join(frames))
        write_frames.close()
Is there a way I can differentiate between human voice and other noise in Python ? Hope somebody can find me a solution.
I think your issue is that at the moment you are trying to record without recognizing the speech, so nothing is discriminating between voice and noise; recognisable speech is anything that gives meaningful results after recognition, so it's a catch-22. You could simplify matters by looking for an opening keyword. You can also filter on the human voice frequency range, as the human ear and the telephone companies both do, and you can look at the mark-space ratio; I believe there were some publications a while back on that, but be careful, it varies from language to language. A quick Google search can be very informative. You may also find this article interesting.
I think what you are looking for is VAD (voice activity detection). VAD can be used for preprocessing speech for ASR. Here is an open-source project implementing VAD: link. I hope it helps.
This is an example script using a VAD library.
https://github.com/wiseman/py-webrtcvad/blob/master/example.py
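As a minimal sketch of that approach (assuming py-webrtcvad is installed and the input is 16-bit mono PCM at 16 kHz; webrtcvad only accepts 10, 20 or 30 ms frames):
import wave
import webrtcvad

vad = webrtcvad.Vad(2)                      # aggressiveness: 0 (least) to 3 (most)

with wave.open('audio.wav', 'rb') as wf:    # e.g. the file written by the class above
    rate = wf.getframerate()
    frame_len = int(rate * 0.03)            # samples per 30 ms frame
    while True:
        frame = wf.readframes(frame_len)
        if len(frame) < frame_len * 2:      # 2 bytes per 16-bit sample
            break
        print('speech' if vad.is_speech(frame, rate) else 'silence')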