I'm trying to execute the following code using TensorFlow, Hugging Face's transformers library, and the openai/whisper-base model:
import tensorflow as tf
import transformers

# Load the model and tokenizer
model = transformers.TFWhisperModel.from_pretrained('openai/whisper-base')
tokenizer = transformers.WhisperTokenizer.from_pretrained('openai/whisper-base')

# Read the audio file and convert it to a tensor
audio_file = "data/preamble.wav"
with open(audio_file, 'rb') as f:
    audio = f.read()
input_ids = tf.constant(tokenizer.encode(audio, return_tensors='tf'))

# Transcribe the audio
output = model(input_ids)[0]
transcription = tokenizer.decode(output, skip_special_tokens=True)
with open("something.txt", "w") as f:
    f.write(transcription)
I'm getting a huge error message, too big to copy and paste here; below is a snippet. The entire message consists of the same kind of byte output except for the last line, which I've pasted below. I also screenshotted the top of the error message before it scrolled away.

[Screenshot: top of the error message — the first output to the terminal after running the script]

Bottom of the error snippet:
c\xff\x0c\x00\xeb\xff\xb3\xff\xc5\xff\x0f\x00\xde\xff\x16\x00B\x00\x0e\x00\xfd\xff$\x000\x00\xff\x
ff\xe7\xff<\x00\xfb\xff\n\x00/\x008\x00\x06\x00\x17\x00\x1d\x00\xde\xff\xf2\xff\xec\xff\xff\xff\x0
f\x00\x1b\x008\x00\x1d\x003\x00%\x00#\x00\r\x00\x16\x00\x1d\x00\x19\x00\xf7\xff\x14\x00\xff\xff\xc
c\xff\x06\x00\xf1\xff\x11\x00\xf0\xff*\x00P\x00\xe7\xffH\x00\t\x00\xd0\xff\xd0\xff\xee\xff\xf6\xff
\xc6\xff\xe4\xff\xce\xff' is not valid. Should be a string, a list/tuple of strings or a list/tuple
of integers.
That last line, "is not valid. Should be a string, a list/tuple of strings or a list/tuple of integers.", is my only clue as to my next step.
I cannot scroll to the top to find where in my code the error is being thrown. I'm new to machine learning and I don't know what I'm seeing. Any help is appreciated.
Thank you in advance!!!
I tried a try/except block around the output and transcription lines, with no change: same output message.
I've tried:
input_ids = str(tf.constant(tokenizer.encode(audio, return_tensors='tf')))
input_ids = []
input_ids = input_ids.append(int(tf.constant(tokenizer.encode(audio, return_tensors='tf'))))
output = model(str(input_ids))[0]
None of these changed the output.
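For anyone hitting the same wall: the root cause is that WhisperTokenizer.encode expects text, not raw WAV bytes — the tokenizer handles the text side of the model, while audio has to go through a feature extractor that turns waveform samples into log-mel spectrograms. A minimal sketch of the usual approach, assuming a 16 kHz mono WAV and the soundfile package for reading audio (both assumptions, not from the original post):

import soundfile as sf
from transformers import WhisperProcessor, TFWhisperForConditionalGeneration

# The processor bundles the feature extractor (audio -> log-mel features)
# and the tokenizer (token ids -> text).
processor = WhisperProcessor.from_pretrained('openai/whisper-base')
model = TFWhisperForConditionalGeneration.from_pretrained('openai/whisper-base')

# Whisper expects 16 kHz audio; resample first if your file differs.
speech, sampling_rate = sf.read("data/preamble.wav")
inputs = processor(speech, sampling_rate=sampling_rate, return_tensors="tf")

# generate() runs the decoder autoregressively and returns token ids.
predicted_ids = model.generate(input_features=inputs.input_features)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]

with open("something.txt", "w") as f:
    f.write(transcription)

Note that this uses TFWhisperForConditionalGeneration rather than TFWhisperModel, since the latter is the bare encoder-decoder without the language-model head needed for transcription.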
I am currently working on a project using audio data. The first step is to use another model to produce features for each audio example, about [400 x 10_000] per wav file; each wav file has a label that I'm trying to predict. I will then build another model on top of these features to produce my final result.
I don't want to run preprocessing every time I run the model, so my plan was to have a preprocessing pipeline that runs the feature-extraction model and saves the features into a new folder, so that the second model can use them directly. I was looking at using TFRecords (tf.io.serialize_tensor, tfrecord), but the documentation is quite unhelpful.
This is what I've come up with to test it so far:
serialized_features = tf.io.serialize_tensor(features)
feature_of_bytes = tf.train.Feature(
    bytes_list=tf.train.BytesList(value=[serialized_features.numpy()]))
features_for_example = {
    'feature0': feature_of_bytes
}
example_proto = tf.train.Example(
    features=tf.train.Features(feature=features_for_example))

filename = 'test.tfrecord'
writer = tf.io.TFRecordWriter(filename)
writer.write(example_proto.SerializeToString())

filenames = [filename]
raw_dataset = tf.data.TFRecordDataset(filenames)
for raw_record in raw_dataset.take(1):
    example = tf.train.Example()
    example.ParseFromString(raw_record.numpy())
    print(example)
But I'm getting this error:
tensorflow.python.framework.errors_impl.DataLossError: truncated record at 0' failed with Read less bytes than requested
tl;dr:
Getting the above error with TFRecords. Any recommendations for getting this example working, or for another solution that doesn't use TFRecords?
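A likely culprit, assuming the snippet runs exactly as shown: the TFRecordWriter is never closed, so the file hasn't been flushed to disk when TFRecordDataset starts reading it, which produces exactly this truncated-record DataLossError. A minimal sketch of the write/read round trip with the writer used as a context manager:

# Write: the context manager closes (and flushes) the writer on exit.
with tf.io.TFRecordWriter(filename) as writer:
    writer.write(example_proto.SerializeToString())

# Read back and deserialize the tensor.
raw_dataset = tf.data.TFRecordDataset([filename])
for raw_record in raw_dataset.take(1):
    example = tf.train.Example()
    example.ParseFromString(raw_record.numpy())
    restored = tf.io.parse_tensor(
        example.features.feature['feature0'].bytes_list.value[0],
        out_type=tf.float32)  # assumes the features tensor was float32
    print(restored.shape)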
As a programming noob, I am trying to find similar sentences in several hundred newspaper articles. I tried my code with a smaller text sample, which worked brilliantly. Now, with a larger text file (using the same code), I get the error "[E1002] Span index out of range".
This is my code so far:
!pip install spacy
import spacy
nlp = spacy.load('en_core_web_sm')
nlp.max_length = 2000000
with open('/content/BSE.txt', 'r', encoding="utf-8", errors="ignore") as f:
    sentences_articles = f.read()
about_doc = nlp(sentences_articles)
sentences = list(about_doc.sents)
len(sentences)
sentences[:10]
!pip install -U sentence-transformers
from sentence_transformers import SentenceTransformer, util
import torch
embedder = SentenceTransformer('all-mpnet-base-v2')
corpus = sentences
corpus_embeddings = embedder.encode(corpus, show_progress_bar=True, batch_size = 128)
The progress bar stops at 94% with the error "[E1002] Span index out of range". I have used the .readlines() function, which worked, yet because of the nature of my text data it produced unusable results (but no error!). I limited the number of words in each sentence, but that didn't help either. I tried several text files (different lengths, different content), but without success.
Any suggestions on how to fix this?
I had a similar problem with the same error, and for me it was solved by changing sentences from a list[Span] to a list[str], as that is what .encode() requires. Instead of sentences = list(about_doc.sents), write sentences = list(sent.text for sent in about_doc.sents).
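Applied to the code above, that one change would look like this (a sketch; the rest of the pipeline stays the same):

# Convert spaCy Span objects to plain strings before encoding.
sentences = [sent.text for sent in about_doc.sents]
corpus_embeddings = embedder.encode(sentences, show_progress_bar=True, batch_size=128)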
I am following this tutorial to build a custom object detection model with Detecto.
https://www.analyticsvidhya.com/blog/2021/06/simplest-way-to-do-object-detection-on-custom-datasets/
I have collected and labelled my images, put them on my Drive and I am running the following code snippet to train the model which is part of a Python Notebook on Google Colab:
import matplotlib.pyplot as plt
from detecto import core

Train_dataset = core.Dataset('/content/drive/My Drive/training model/Training', transform=custom_transforms)
Test_dataset = core.Dataset('/content/drive/My Drive/training model/Test')
loader = core.DataLoader(Train_dataset, batch_size=2, shuffle=True)
model = core.Model(['black car', 'grey car', 'white truck'])
losses = model.fit(loader, Test_dataset, epochs=25, lr_step_size=5, learning_rate=0.001, verbose=True)
plt.plot(losses)
plt.show()
However, I keep getting the following error shortly after the first epoch starts:
ValueError: Could not read image /content/drive/My Drive/training model/Training/frame22.jpg
It gives this error randomly, not only with frame22 but also with other frames that are not present in this directory. I tried remounting my Drive with force_remount enabled at the beginning of the script, but the error persists.
I checked the code of the core.Dataset implementation from Detecto, and I can confirm what I said in my comments.
The index is created by collecting all the .xml annotation files and mapping each one to its image. It does not check that the image is actually there.
For the image filename, it uses the one recorded inside the XML file, not the name of the XML file itself: Pascal VOC-style annotation files carry a filename element. If you rename an image, you need to change the name inside its XML file as well.
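A quick way to find the offending annotations (a sketch, assuming Pascal VOC-style .xml files sitting next to the images):

import os
import xml.etree.ElementTree as ET

folder = '/content/drive/My Drive/training model/Training'
for name in os.listdir(folder):
    if not name.endswith('.xml'):
        continue
    # Read the filename recorded inside the annotation, which is what
    # Detecto uses to locate the image.
    recorded = ET.parse(os.path.join(folder, name)).find('filename').text
    if not os.path.exists(os.path.join(folder, recorded)):
        print(f'{name} points to a missing image: {recorded}')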
My code began giving this error after I opened and saved NEweights.npy:
OSError: Failed to interpret file 'D:\\NeuralNetwork\\NEweights.npy' as a pickle
It was working initially before I saved it. Why am I receiving this error only now, and is there any way I can still access the data in NEweights.npy? (Just for context, NEweights.npy is an array of neural network weights trained via Nesterov Accelerated Gradient. I was testing different NN optimizers.)
I have this code to save the numpy arrays in a npy file:
np.save(f'{path}GDweights.npy', np.array(weights, dtype=object))
I have this to access the numpy arrays:
import numpy as np

def getWeights(path):
    return np.load(path, allow_pickle=True)
path = 'D:\\NeuralNetwork\\'
inputs, outputs = grab(f'{path}test.csv')
weightsGD = getWeights(f'{path}GDweights.npy')
weightsM = getWeights(f'{path}Mweights.npy')
weightsNE = getWeights(f'{path}NEweights.npy')
weightsNA = getWeights(f'{path}NAweights.npy')
weightsD = getWeights(f'{path}Dweights.npy')
This error is raised as an OSError, and according to the numpy documentation it is raised if the input file does not exist or cannot be read. Since the error appeared right after you opened and saved NEweights.npy, the most likely explanation is that whatever program you opened it with re-encoded the binary .npy data when saving, corrupting the file.
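Every .npy file starts with a fixed magic string, so you can check quickly whether the file is still intact (a small sketch):

# np.save writes the magic string b'\x93NUMPY' at the start of every .npy
# file; if it's missing, the file was altered and is no longer loadable.
with open('D:\\NeuralNetwork\\NEweights.npy', 'rb') as f:
    print(f.read(6))  # b'\x93NUMPY' for an intact file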
I'm trying to build some debugging code into my tensorflow dataset pipeline. Basically, if tfrecord parsing fails on a certain file, I'd like to be able to figure out which file that is. My dream would be to run a number of asserts in my parsing function that provide the filename if they fail.
My pipeline looks something like this:
dataset = (tf.data.Dataset.from_tensor_slices(file_list)
           .apply(tf.contrib.data.parallel_interleave(
               lambda f: tf.data.TFRecordDataset(f), cycle_length=4))
           .map(parse_func, num_parallel_calls=params.num_cores)
           .map(_func_for_other_stuff))
Ideally I'd pass the filename through in the parallel_interleave step, but if I have the anonymous function return a (filename, TFRecordDataset) tuple, I get:
TypeError: `map_func` must return a `Dataset` object.
I've also tried to include the filename in the file itself, as in this question, but am having issues here because filenames are of variable length.
The return value of the function passed to tf.contrib.data.parallel_interleave() must be a tf.data.Dataset. Therefore you can solve this by attaching the filename tensor to each element of the TFRecordDataset, using tf.data.Dataset.zip() as follows:
def read_records_func(filename):
    records = tf.data.TFRecordDataset(filename)
    # Create a dataset from the filename tensor and repeat it indefinitely.
    filename_as_dataset = tf.data.Dataset.from_tensors(filename).repeat(None)
    return tf.data.Dataset.zip((filename_as_dataset, records))

dataset = (tf.data.Dataset.from_tensor_slices(file_list)
           .apply(tf.contrib.data.parallel_interleave(read_records_func, cycle_length=4))
           .map(parse_func, num_parallel_calls=params.num_cores)
           .map(_func_for_other_stuff))
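Note that parse_func now receives (filename, serialized_record) pairs instead of a single serialized record, so it needs a second argument. A sketch, with a hypothetical feature spec ('data' is not from the original question) standing in for whatever your records actually contain:

def parse_func(filename, serialized):
    # 'data' is a hypothetical feature name; replace it with your own spec.
    parsed = tf.parse_single_example(
        serialized, features={'data': tf.VarLenFeature(tf.string)})
    # Keep the filename alongside the parsed features so later stages
    # (or error messages) can report which file a record came from.
    return filename, parsed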