I'm trying to create a tensorflow dataset from 6500 .npy files of shape [256,256].
My previous approach (with fewer files) was to load and stack them into an np.array, and then use tf.data.Dataset.from_tensor_slices(stacked_data).
With the current number of files I get ValueError: Cannot create a tensor proto whose content is larger than 2GB.
I'm now trying the following:
def data_generator():
    processed = []
    for i in range(len(onlyfiles)):
        processed.append(tf.convert_to_tensor(np.load(onlyfiles[i], mmap_mode='r')))
    yield iter(tf.concat(processed, 0))

_dataset = tf.data.Dataset.from_generator(generator=data_generator, output_types=tf.float32)
onlyfiles is the list of file names.
I get multiple errors, one of which is the following:
2022-10-01 11:25:44.602505: W tensorflow/core/framework/op_kernel.cc:1639] Invalid argument: TypeError: `generator` yielded an element that could not be converted to the expected type. The expected type was float32, but the yielded element was <generator object Tensor.__iter__ at 0x7fe6d7d506d0>.
Traceback (most recent call last):
File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/data/ops/dataset_ops.py", line 653, in generator_py_func
ret_arrays.append(script_ops.FuncRegistry._convert( # pylint: disable=protected-access
File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/ops/script_ops.py", line 195, in _convert
result = np.asarray(value, dtype=dtype, order="C")
TypeError: float() argument must be a string or a number, not 'generator'
What should I change? Is there another method to do it?
Since I created the dataset myself, is there a better way to prepare it for use with TensorFlow?
After a few days, I found the following solution. I don't know how good it is, but I'll post it in case someone finds it useful:
#tf.function
def input_fn():
    tf.compat.v1.enable_eager_execution()
    mypath = 'tensorflow_datasets/Dataset_1/'
    list_of_file_names = [join(mypath, f) for f in listdir(mypath) if isfile(join(mypath, f))]

    def gen():
        for i in itertools.count(1):
            data1 = np.load(list_of_file_names[i % len(list_of_file_names)])
            data2 = np.where(data1 > 1, data1, 1)
            yield tf.convert_to_tensor(np.where(data2 > 0, 20 * np.log10(data2), 0))

    dataset = tf.data.Dataset.from_generator(gen, (tf.float32))
    return dataset.make_one_shot_iterator().get_next()
I usually do such things as follows:
dataset = tf.data.Dataset.from_tensor_slices(list_of_file_names)

# Optional
dataset = dataset.repeat().shuffle(...)

def read_file(file_name):
    full_path_to_image_file = ...  # build full path
    buffer = tf.io.read_file(full_path_to_image_file)
    tensor = ...  # convert from buffer to tensor
    return tensor

dataset = dataset.map(read_file, num_parallel_calls=...)
Alternatively, you can read the file with np.load inside a py_function (use decode("utf-8") to convert the byte string to an ordinary Python string), like this:
def read_file(file_path):
    tensor = tf.py_function(
        func=lambda path: np.load(path.numpy().decode("utf-8")),
        inp=[file_path],
        Tout=tf.float32
    )
    tensor.set_shape(img_shape)
    return tensor
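Putting these pieces together for the .npy files in the question, a minimal end-to-end sketch might look like the following. The [256, 256] shape and the onlyfiles list come from your question; the batch size and shuffle buffer size are just placeholders to tune:
import numpy as np
import tensorflow as tf

list_of_file_names = onlyfiles  # the list of .npy paths from the question

def load_npy(path):
    # path arrives as a scalar string tensor; decode it before np.load
    return np.load(path.numpy().decode("utf-8")).astype(np.float32)

def read_file(file_path):
    tensor = tf.py_function(func=load_npy, inp=[file_path], Tout=tf.float32)
    tensor.set_shape([256, 256])  # shape taken from the question
    return tensor

dataset = (tf.data.Dataset.from_tensor_slices(list_of_file_names)
           .shuffle(buffer_size=1000)
           .map(read_file, num_parallel_calls=tf.data.experimental.AUTOTUNE)
           .batch(32)
           .prefetch(tf.data.experimental.AUTOTUNE))
This way only the file paths live in the graph, so the 2GB tensor proto limit never comes into play.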
Related
I have a model that is served using TorchServe, and I'm communicating with the TorchServe server using gRPC. The final postprocess method of the custom handler returns a list, which is converted into bytes for transfer over the network.
The postprocess method:
def postprocess(self, data):
    # data type - torch.Tensor
    # data shape - [1, 17, 80, 64] and data dtype - torch.float32
    return data.tolist()
The main issue is on the client side, where converting the received bytes from TorchServe to a torch.Tensor is done rather inefficiently via ast.literal_eval:
# This takes 0.3 seconds
response = self.inference_stub.Predictions(
    inference_pb2.PredictionsRequest(model_name=model_name, input=input_data))
# This takes 0.84 seconds
predictions = torch.as_tensor(literal_eval(
    response.prediction.decode('utf-8')))
Using numpy.frombuffer or torch.frombuffer returns the following errors.
import numpy as np
np.frombuffer(response.prediction)
Traceback (most recent call last):
File "<string>", line 1, in <module>
ValueError: buffer size must be a multiple of element size
np.frombuffer(response.prediction, dtype=np.float32)
Traceback (most recent call last):
File "<string>", line 1, in <module>
ValueError: buffer size must be a multiple of element size
Using torch
import torch
torch.frombuffer(response.prediction, dtype = torch.float32)
Traceback (most recent call last):
File "<string>", line 1, in <module>
ValueError: buffer length (2601542 bytes) after offset (0 bytes) must be a multiple of element size (4)
Is there an alternative, more efficient way of converting the received bytes into a torch.Tensor?
One hack I've found that has significantly increased performance when sending large tensors is to return the data as JSON.
In your handler's postprocess function:
def postprocess(self, data):
    output_data = {}
    output_data['data'] = data.tolist()
    return [output_data]
At the client side, when you receive the gRPC response, decode it using json.loads:
response = self.inference_stub.Predictions(
    inference_pb2.PredictionsRequest(model_name=model_name, input=input_data))
decoded_output = response.prediction.decode('utf-8')
preds = torch.as_tensor(json.loads(decoded_output))
preds should now hold the output tensor.
Update:
There's an even faster method that should completely remove the bottleneck: use tf.io.serialize_tensor from TensorFlow to serialize your tensor inside postprocess.
def postprocess(self, data):
    return [tf.io.serialize_tensor(data.cpu()).numpy()]
Decode it using tf.io.parse_tensor
response = self.inference_stub.Predictions(
    inference_pb2.PredictionsRequest(model_name=model_name, input=input_data))
prediction = response.prediction
torch.as_tensor(tf.io.parse_tensor(prediction, out_type=tf.float32).numpy())
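As a side note, the frombuffer errors above happen because response.prediction holds the UTF-8 text of a Python list rather than raw float bytes. If you would rather avoid the TensorFlow dependency, one possible sketch (not tested against TorchServe, and assuming the [1, 17, 80, 64] shape from the question is known on the client) is to send the raw buffer and rebuild it:
# handler side (postprocess), sketch only
def postprocess(self, data):
    return [data.cpu().numpy().tobytes()]

# client side
import numpy as np
import torch

arr = np.frombuffer(response.prediction, dtype=np.float32).reshape(1, 17, 80, 64)
predictions = torch.from_numpy(arr.copy())  # copy() because frombuffer returns a read-only view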
I am working on migrating some working Pytorch code I found online (which is a 2D image classification example using the MNIST data; apologies that I lost track of the original source and am unable to find it) to what I need, which is converting a 1D collection of values into a numerical score. I created my own Dataset class. When I call model(), I get an error: RuntimeError: Expected object of scalar type Float but got scalar type Double for argument #2 'mat1' in call to _th_addmm. My first level of confusion is that I can't find any reference to Python even having a Double datatype. And my second is why I get the error--when I put in debug code to show the datatype of mat1 and its elements, I am told that it is a Tensor which claims to be float64. I also wonder why it is expecting a scalar for mat1, which the documentation describes as a matrix/tensor.
The full error dump is
Traceback (most recent call last):
File "mlalan.py", line 174, in <module>
outputs = model(images)
File "/usr/home/adf/.local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "mlalan.py", line 80, in forward
x = activate(self.fc1(x))
File "/usr/home/adf/.local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/usr/home/adf/.local/lib/python3.7/site-packages/torch/nn/modules/linear.py", line 87, in forward
return F.linear(input, self.weight, self.bias)
File "/usr/home/adf/.local/lib/python3.7/site-packages/torch/nn/functional.py", line 1610, in linear
ret = torch.addmm(bias, input, weight.t())
RuntimeError: Expected object of scalar type Float but got scalar type Double for argument #2 'mat1' in call to _th_addmm
Some of the key code from my Dataset class is
class RandomDataset(Dataset):
    def __init__(self, csv_file, transform=None):
        self.data_frame = pd.read_csv(csv_file, dtype=float)

    def __getitem__(self, idx):
        raw = self.data_frame.values[idx]
        sample = raw[0:6], raw[6:8]
        return sample
The full source code is at http://8wheels.org/mlalan.py.
By default, PyTorch tensors use float32 (what the error calls Float), whereas Python's built-in float and the default float dtype in Pandas and NumPy are 64-bit (float64, which PyTorch calls Double). I was able to resolve the problem by adding a call to astype as below. The "32" is required for it to work.
raw = self.data_frame.values[idx].astype(np.float32)
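For context, here is a short sketch of where that cast lands in the Dataset class from the question (the __len__ shown here is just assumed boilerplate); calling .float() on the tensors after loading would be an equivalent fix:
import numpy as np
import pandas as pd
from torch.utils.data import Dataset

class RandomDataset(Dataset):
    def __init__(self, csv_file, transform=None):
        self.data_frame = pd.read_csv(csv_file, dtype=float)

    def __len__(self):
        return len(self.data_frame)

    def __getitem__(self, idx):
        # cast to float32 so the linear layer sees Float instead of Double
        raw = self.data_frame.values[idx].astype(np.float32)
        sample = raw[0:6], raw[6:8]
        return sample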
Thanks!
Now I can move on to the next crash :-)
I have a code as below:
dataset = MNIST(path=data_path, download=True, shuffle=True)

if train:
    images, labels = dataset.get_train()
else:
    images, labels = dataset.get_test()

images, labels = images[:n_examples], labels[:n_examples]
images, labels = iter(images.view(-1, 784) / 255), iter(labels)
But when I run it, it gives me this error:
Traceback (most recent call last):
File "C:\Users\Ati\Downloads\Compressed\bindsnet_experiments-
master\experiments\mnist\two_layer_backprop.py", line 135, in <module>
images, labels = dataset.get_train()
AttributeError: 'TorchvisionDatasetWrapper' object has no attribute 'get_train'
I think get_train() is out of date and is no longer supported by torchvision.
I have tried different ways of converting the MNIST data into the images and labels variables, but haven't found one that works.
Does anyone know how I should change this code now that get_train() doesn't work?
I would appreciate any help.
Yes, it looks like that class does not exist in the package anymore.
I was able to find the source code for the package you are looking for:
import os
import functools
import operator
import gzip
import struct
import array
import tempfile

try:
    from urllib.request import urlretrieve
except ImportError:
    from urllib import urlretrieve  # py2

try:
    from urllib.parse import urljoin
except ImportError:
    from urlparse import urljoin

import numpy

__version__ = '0.2.2'


# `datasets_url` and `temporary_dir` can be set by the user using:
# >>> mnist.datasets_url = 'http://my.mnist.url'
# >>> mnist.temporary_dir = lambda: '/tmp/mnist'
datasets_url = 'http://yann.lecun.com/exdb/mnist/'
temporary_dir = tempfile.gettempdir


class IdxDecodeError(ValueError):
    """Raised when an invalid idx file is parsed."""
    pass


def download_file(fname, target_dir=None, force=False):
    """Download fname from the datasets_url, and save it to target_dir,
    unless the file already exists, and force is False.

    Parameters
    ----------
    fname : str
        Name of the file to download
    target_dir : str
        Directory where to store the file
    force : bool
        Force downloading the file, if it already exists

    Returns
    -------
    fname : str
        Full path of the downloaded file
    """
    target_dir = target_dir or temporary_dir()
    target_fname = os.path.join(target_dir, fname)

    if force or not os.path.isfile(target_fname):
        url = urljoin(datasets_url, fname)
        urlretrieve(url, target_fname)

    return target_fname


def parse_idx(fd):
    """Parse an IDX file, and return it as a numpy array.

    Parameters
    ----------
    fd : file
        File descriptor of the IDX file to parse
    endian : str
        Byte order of the IDX file. See [1] for available options

    Returns
    -------
    data : numpy.ndarray
        Numpy array with the dimensions and the data in the IDX file

    1. https://docs.python.org/3/library/struct.html
       #byte-order-size-and-alignment
    """
    DATA_TYPES = {0x08: 'B',  # unsigned byte
                  0x09: 'b',  # signed byte
                  0x0b: 'h',  # short (2 bytes)
                  0x0c: 'i',  # int (4 bytes)
                  0x0d: 'f',  # float (4 bytes)
                  0x0e: 'd'}  # double (8 bytes)

    header = fd.read(4)
    if len(header) != 4:
        raise IdxDecodeError('Invalid IDX file, '
                             'file empty or does not contain a full header.')

    zeros, data_type, num_dimensions = struct.unpack('>HBB', header)

    if zeros != 0:
        raise IdxDecodeError('Invalid IDX file, '
                             'file must start with two zero bytes. '
                             'Found 0x%02x' % zeros)

    try:
        data_type = DATA_TYPES[data_type]
    except KeyError:
        raise IdxDecodeError('Unknown data type '
                             '0x%02x in IDX file' % data_type)

    dimension_sizes = struct.unpack('>' + 'I' * num_dimensions,
                                    fd.read(4 * num_dimensions))

    data = array.array(data_type, fd.read())
    data.byteswap()  # looks like array.array reads data as little endian

    expected_items = functools.reduce(operator.mul, dimension_sizes)
    if len(data) != expected_items:
        raise IdxDecodeError('IDX file has wrong number of items. '
                             'Expected: %d. Found: %d' % (expected_items,
                                                          len(data)))

    return numpy.array(data).reshape(dimension_sizes)


def download_and_parse_mnist_file(fname, target_dir=None, force=False):
    """Download the IDX file named fname from the URL specified in dataset_url
    and return it as a numpy array.

    Parameters
    ----------
    fname : str
        File name to download and parse
    target_dir : str
        Directory where to store the file
    force : bool
        Force downloading the file, if it already exists

    Returns
    -------
    data : numpy.ndarray
        Numpy array with the dimensions and the data in the IDX file
    """
    fname = download_file(fname, target_dir=target_dir, force=force)
    fopen = gzip.open if os.path.splitext(fname)[1] == '.gz' else open
    with fopen(fname, 'rb') as fd:
        return parse_idx(fd)


def train_images():
    """Return train images from Yann LeCun MNIST database as a numpy array.
    Download the file, if not already found in the temporary directory of
    the system.

    Returns
    -------
    train_images : numpy.ndarray
        Numpy array with the images in the train MNIST database. The first
        dimension indexes each sample, while the other two index rows and
        columns of the image
    """
    return download_and_parse_mnist_file('train-images-idx3-ubyte.gz')


def test_images():
    """Return test images from Yann LeCun MNIST database as a numpy array.
    Download the file, if not already found in the temporary directory of
    the system.

    Returns
    -------
    test_images : numpy.ndarray
        Numpy array with the images in the train MNIST database. The first
        dimension indexes each sample, while the other two index rows and
        columns of the image
    """
    return download_and_parse_mnist_file('t10k-images-idx3-ubyte.gz')


def train_labels():
    """Return train labels from Yann LeCun MNIST database as a numpy array.
    Download the file, if not already found in the temporary directory of
    the system.

    Returns
    -------
    train_labels : numpy.ndarray
        Numpy array with the labels 0 to 9 in the train MNIST database.
    """
    return download_and_parse_mnist_file('train-labels-idx1-ubyte.gz')


def test_labels():
    """Return test labels from Yann LeCun MNIST database as a numpy array.
    Download the file, if not already found in the temporary directory of
    the system.

    Returns
    -------
    test_labels : numpy.ndarray
        Numpy array with the labels 0 to 9 in the train MNIST database.
    """
    return download_and_parse_mnist_file('t10k-labels-idx1-ubyte.gz')
You can store this in any file, import that file and use the functions as you need (without creating the MNIST object).
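For example, assuming the code above is saved as mnist.py next to your script, the get_train()/get_test() calls from the question could be replaced with a sketch like this (the torch conversion mirrors the .view(-1, 784) / 255 line already in your code):
import torch
import mnist  # the module saved from the source code above

if train:
    images, labels = mnist.train_images(), mnist.train_labels()
else:
    images, labels = mnist.test_images(), mnist.test_labels()

# convert the numpy arrays to torch tensors and keep the original slicing
images = torch.from_numpy(images[:n_examples]).float()
labels = torch.from_numpy(labels[:n_examples])
images, labels = iter(images.view(-1, 784) / 255), iter(labels)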
Hope this helps. Good luck.
I am trying to run a bunch of (power system) simulations and save all the results into dictionaries.
Since I have a not-so-complicated object structure, I decided to use dill to store a dictionary which contains a bunch of dictionaries (each of whose values is a class instance). Here is the data organization:
import os
import dill as pickle

class Results():
    def __init__(self):
        self.volt = []
        self.angle = []
        self.freq = []

def save_obj(obj, name):
    # save as pickle object
    currentdir = os.getcwd()
    objDir = currentdir + '/obj'
    if not os.path.isdir(objDir):
        os.mkdir(objDir)
    with open(objDir + '/' + name + '.pkl', 'wb') as f:
        pickle.dump(obj, f, pickle.HIGHEST_PROTOCOL, recurse=True)
EventDict = {}

########### conceptual code to get all the data
# simList is a list of approximately 7200 events
for event in simList:
    ResultsDict = {}
    for element in network:  # 24 elements in network (23 buses, or nodes, and time)
        # code to get voltage, angle and frequency (each of which is a list of 1200 elements)
        if element == 'time':
            ResultsDict['time'] = element
        else:
            ResultsDict[element] = Results()
            ResultsDict[element].volt = element.volt
            ResultsDict[element].angle = element.angle
            ResultsDict[element].freq = element.freq
    EventDict[event] = ResultsDict

save_obj(EventDict, 'EventData')
The resulting pickle file is about 5 GB, and when I try to load it, I get the following error saying it ran out of memory:
Traceback (most recent call last):
File "combineEventPkl.py", line 39, in <module>
EventDict = load_obj(objStr)
File "combineEventPkl.py", line 8, in load_obj
return pickle.load(f)
File "C:\Python27\lib\site-packages\dill\_dill.py", line 304, in load
obj = pik.load()
File "C:\Python27\lib\pickle.py", line 864, in load
dispatch[key](self)
File "C:\Python27\lib\pickle.py", line 964, in load_binfloat
self.append(unpack('>d', self.read(8))[0])
MemoryError
no mem for new parser
MemoryError
Also, unpickling takes a long time before I get this traceback.
I realize this problem occurs because EventDict is huge.
So I guess I am asking whether there is a better way to store such time series data, with some way of labelling each piece of data with a key so that I know what it represents. I am open to suggestions other than pickle, as long as loading is fast and does not involve too much effort.
Check out "Fast Data Store for Pandas Time-Series Data using PyStore" https://medium.com/#aroussi/fast-data-store-for-pandas-time-series-data-using-pystore-89d9caeef4e2
You may have to chunk the data while reading it in: https://cmdlinetips.com/2018/01/how-to-load-a-massive-file-as-small-chunks-in-pandas/
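If you want to keep everything in one file but still avoid unpickling 5 GB at once, one possible sketch (only an illustration; the key names and 'EventData.h5' path are made up, and it requires the tables/PyTables package) is to write one pandas DataFrame per event and element into an HDF5 store, so each piece can be loaded by key on demand:
import pandas as pd

# write: one key per (event, element) instead of one giant pickled dict
with pd.HDFStore('EventData.h5', mode='w', complevel=9, complib='zlib') as store:
    for evt_idx, event in enumerate(simList):
        for elem_idx, element in enumerate(network):
            if element == 'time':
                continue
            df = pd.DataFrame({'volt': element.volt,
                               'angle': element.angle,
                               'freq': element.freq})
            store.put('event_{}/element_{}'.format(evt_idx, elem_idx), df)

# read back only what you need, e.g.:
# df = pd.read_hdf('EventData.h5', 'event_0/element_3')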
I am using Anaconda3 and SciPy to try to write a wav file using an array:
wavfile.write("/Users/Me/Desktop/C.wav", 1000, array)
(I don't know how many samples per second to use; I'm planning on playing around with that, but I'm betting on 1000.)
array contains 3000 integers, so the file would last 3 seconds.
However it gives me this error when trying to run:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-21-ce3a8d3e4b4b> in <module>()
----> 1 wavfile.write("/Users/Me/Desktop/C.wav", 1000, fin)
/Users/Me/anaconda/lib/python3.4/site-packages/scipy/io/wavfile.py in write(filename, rate, data)
213
214 try:
--> 215 dkind = data.dtype.kind
216 if not (dkind == 'i' or dkind == 'f' or (dkind == 'u' and data.dtype.itemsize == 1)):
217 raise ValueError("Unsupported data type '%s'" % data.dtype)
AttributeError: 'list' object has no attribute 'dtype'
You are passing write an ordinary Python list, which does not have an attribute called dtype (you can get that information by studying the error message). The documentation of scipy.io.wavfile clearly states that you should pass it a numpy array:
Definition: wavfile.write(filename, rate, data)
Docstring:
Write a numpy array as a WAV file
You can convert your ordinary python list to a numpy array like so:
import numpy as np
arr = np.array(array)
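Putting it together for the example in the question, a minimal sketch might be the following (the int16 dtype is an assumption; pick whichever supported integer dtype matches the range of your data):
import numpy as np
from scipy.io import wavfile

arr = np.array(array, dtype=np.int16)  # convert the plain list to a numpy array
wavfile.write("/Users/Me/Desktop/C.wav", 1000, arr)  # 1000 samples per second, as in the question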
I would like to add a bit of information in reply to user3151828's comment. I opened a file comprised of 32-bit signed float values (audio data not formatted as a proper WAV file), created a normal Python list, converted it to a numpy array as Oliver W. describes, and printed the results.
import numpy as np
import struct

file = open('audio.npa', 'rb')
datalist = []
for i in range(4):
    data = file.read(4)
    s = struct.unpack('f', data)
    datalist.append(s)
numpyarray = np.array(datalist)

print('datalist, normal python list is: ', datalist, '\n')
print('numpyarray is: ', numpyarray)
The output is:
datalist, normal python list is: [(-0.000152587890625,), (-0.005126953125,), (-0.010284423828125,), (-0.009796142578125,)]
numpyarray is:
[[-0.00015259]
[-0.00512695]
[-0.01028442]
[-0.00979614]]
So, there is the difference between the two.