Problem with Loading and Preprocessing Data Using tf.Dataset - python

I'm trying to load .npy files using the map method of tf.data.Dataset. The reasoning behind this is that, in the future, I will be loading data not only from .npy files but also from .dat files encoded in a medical imaging format. I can do this with the Keras Sequence class, but I want to be able to prefetch my data, since loading huge numbers of files from remote hard drives will take a lot of time. Even loading them from the same hard drive takes ages (seconds) to generate masks, load the data, and feed it to the network. Right now I'm working with the simplest version of the problem I can handle. I have been working on this for a couple of weeks (on and off for a year) to no avail. The code is given below and the error is:
TypeError: expected str, bytes or os.PathLike object, not Tensor
I do know the reason behind the error, but I could not convert the tensor into something numpy can read. I could do it if this weren't an online operation, e.g. by pulling the output from the dataset with list(file) and then going through a for loop. I have researched it and every single book goes with the same irrelevant CSV example. I just need to load and preprocess data so I can use the prefetch method; that way, while my network is doing reconstruction, I can fetch the next batch.
The file argument that is passed to the np_read function is a tensor. I expect it to be a string; the tensor holds the string, but I cannot extract it.
Tensor("file:0", shape=(None,), dtype=string)
When I convert file to a list I get this error:
OperatorNotAllowedInGraphError: iterating over tf.Tensor is not allowed: AutoGraph did not convert this function. Try decorating it directly with @tf.function.
I expect to have a string that points to the location of the files, and I do get these locations if I fetch the batch without mapping it and then print it using list.
array([b'C:\\Datasets\\MRI_Data\\Recon_v4\\Train\\e14080s3_P18944.7_144.npy', b'C:\\Datasets\\MRI_Data\\Recon_v4\\Train\\e13993s4_P16896.7_77.npy', b'C:\\Datasets\\MRI_Data\\Recon_v4\\Train\\e13992s4_P08704.7_65.npy'], dtype=object)
Thank you kindly.
import numpy as np
import tensorflow as tf

pattern = "C:/Datasets/MRI_Data/Recon_v4/Train/*.npy"
filepath_dataset = tf.data.Dataset.list_files(file_pattern=pattern)

@tf.function
def np_read(file):
    # fails here: file is a tf.Tensor, not a path string that np.load can open
    loadedFile = np.load(file)
    return loadedFile

dataset = filepath_dataset.repeat(2).batch(3)
dataset = dataset.map(np_read)
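One way this can be made to work (a minimal sketch, not from the original post; the output dtype and the AUTOTUNE settings are assumptions) is to map each path before batching and wrap np.load in tf.py_function, which hands the Python function an eager tensor whose string can be extracted:

import numpy as np
import tensorflow as tf

def np_read(path):
    # inside tf.py_function, path is an eager scalar string tensor,
    # so .numpy() returns the raw bytes of the file path
    return np.load(path.numpy().decode("utf-8"))

pattern = "C:/Datasets/MRI_Data/Recon_v4/Train/*.npy"
filepath_dataset = tf.data.Dataset.list_files(file_pattern=pattern)

dataset = (filepath_dataset
           .map(lambda p: tf.py_function(np_read, [p], tf.float32),  # assumed dtype
                num_parallel_calls=tf.data.AUTOTUNE)
           .batch(3)
           .prefetch(tf.data.AUTOTUNE))

With the map applied per file path (before batching), prefetch can load the next batch while the network is busy with the current one.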

Related

Is converting numpy array to .npy format an example of serialization of data?

I understand that serialization of data means converting a data structure or object state to a form which can be stored in a file or buffer, can be transmitted, and can be reconstructed later (https://www.tutorialspoint.com/object_oriented_python/object_oriented_python_serialization.htm). Based on this definition, converting a numpy array to .npy format should be serialization of the numpy array data object. However, I could not find this assertion anywhere when I looked it up on the internet. Most of the related links mention how the pickle format does serialization of data in python. My question is: is converting a numpy array to .npy format an example of serialization of a python data object? If not, what are the reasons?
Well, according to Wikipedia:
In computing, serialization (or serialisation) is the process of translating data structures or object state into a format that can be stored (for example, in a file or memory buffer) or transmitted (for example, across a network connection link) and reconstructed later (possibly in a different computer environment).
And according to the NumPy docs:
Binary serialization
NPY format
A simple format for saving numpy arrays to disk with the full information about them.
The .npy format is the standard binary file format in NumPy for persisting a single arbitrary NumPy array on disk. The format stores all of the shape and dtype information necessary to reconstruct the array correctly even on another machine with a different architecture.
The .npz format is the standard format for persisting multiple NumPy arrays on disk. A .npz file is a zip file containing multiple .npy files, one for each array.
So, putting these definitions together, you can answer your question: yes, it is a form of serialization. The process of storing and reading it back is also fast.
np.save(filename, arr) writes the array to a file. Since a file is a linear structure, this is a form of serialization. But often 'serialization' refers to creating a string that can be sent to a database or over some 'pipeline'. I think you can save to a string buffer, but it takes a bit of trickery.
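For illustration (a minimal sketch, not part of the original answer), the "trickery" can be as simple as saving into an in-memory io.BytesIO buffer:

import io
import numpy as np

arr = np.arange(12).reshape(3, 4)
buf = io.BytesIO()
np.save(buf, arr)                      # serialize the array into the in-memory buffer
raw = buf.getvalue()                   # raw bytes that could be stored or transmitted
restored = np.load(io.BytesIO(raw))    # reconstruct the array from those bytes
assert np.array_equal(arr, restored)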
But in Python most objects can be pickled, which creates a byte string that can be written to a file. In that sense pickle is a 2-step process: serialize, then write to a file. The pickle of a numpy array is actually a save-compatible form (conversely, np.save of a non-array object uses that object's pickle).
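A quick illustration of that two-step view (again a sketch, not part of the original answer):

import pickle
import numpy as np

arr = np.arange(6).reshape(2, 3)
blob = pickle.dumps(arr)           # step 1: serialize the array to a byte string
with open("arr.pkl", "wb") as f:   # step 2: write those bytes to a file
    f.write(blob)
restored = pickle.loads(blob)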
savez writes a zip archive, containing one npy file for each array. It may in addition be compressed. There are OS tools for transferring zip archives to other computers.
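A small sketch of that (illustrative, with made-up array names):

import numpy as np

a = np.arange(10)
b = np.ones((3, 3))
np.savez("arrays.npz", a=a, b=b)   # zip archive containing a.npy and b.npy

loaded = np.load("arrays.npz")
print(loaded["a"], loaded["b"])    # arrays are read back by key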

Keras custom fit_generator for numeric dataframe

I have several CSV files placed in a directory. What I want to do is create a flow from this directory where each file is taken, preprocessed (null value filling, outlier treatment, etc.), and then each data point is passed to a keras model, and this process should repeat itself for every file placed in the directory. Any suggestions on how to create a data flow like the one available for image data in keras? Also, this should happen in python :)
Thanks in advance!
I don't think that Keras natively supplies such functionality.
You should make your own converter, using something like glob to go over each file, send it to preprocessing functions, and finally save it as a format readily usable by Keras, such as a numpy array.
You might want to have a look here for an example of inputting multiple files (although in this case they are already numpy arrays, not csv files) to use in the training of a model.
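A rough sketch of such a converter (illustrative only; the preprocessing steps, file layout, and the assumption that the last CSV column is the label are all hypothetical):

import glob
import numpy as np
import pandas as pd

def csv_batch_generator(directory, batch_size=32):
    # Yield (features, labels) batches from every CSV file in the directory
    for path in glob.glob(f"{directory}/*.csv"):
        df = pd.read_csv(path)
        df = df.fillna(df.mean(numeric_only=True))   # simple null-value fill
        data = df.to_numpy(dtype="float32")
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            yield batch[:, :-1], batch[:, -1]        # features, label

# model.fit(csv_batch_generator("path/to/csvs"), steps_per_epoch=...)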

Numpy memory mapping issue

I've been recently working with large matrices. My inputs are stored in form of 15GB .npz files, which I am trying to read incrementally, in small batches.
I am familiar with memory mapping, and since numpy also supports these kinds of operations, it seemed like a perfect solution. However, the problem I am facing is as follows:
I first load the matrix with:
foo = np.load('matrix.npz', mmap_mode="r+")
foo has a single key: data.
When I try to, for example do:
foo['data'][1][1]
numpy seems to endlessly consume the available RAM, almost as if there were no memory mapping at all. Am I doing anything wrong?
My goal would be, for example, to read 30 lines at a time:
for x in np.arange(0, matrix.shape[0], 30):
    batch = matrix[x:(x + 30), :]
    do_something_with(batch)
Thank you!
My guess would be that mmap_mode="r+" is ignored when the file in question is a zipped numpy file. I haven't used numpy in this way, so some of what follows is my best guess. The documentation for load states
If the file is a .npz file, then a dictionary-like object is returned, containing {filename: array} key-value pairs, one for each file in the archive.
It makes no mention of what it does with mmap_mode. However, in the code for loading .npz files, no use is made of the mmap_mode keyword:
if magic.startswith(_ZIP_PREFIX):
    # zip-file (assume .npz)
    # Transfer file ownership to NpzFile
    tmp = own_fid
    own_fid = False
    return NpzFile(fid, own_fid=tmp, allow_pickle=allow_pickle,
                   pickle_kwargs=pickle_kwargs)
So, your initial guess is indeed correct: numpy uses all of the RAM because no memory mapping is occurring. This is a limitation of the implementation of load; since the npz format is an uncompressed zip archive, it should be possible to memmap the variables (unless, of course, your files were created with savez_compressed).
Implementing a load function that memmaps npz would be quite a bit of work though, so you may want to take a look at structured arrays. They provide similar usage (access of fields by keys) and are already compatible with memmapping.
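One pragmatic workaround in the meantime (a sketch, and a different route than structured arrays: convert the archive once to a plain uncompressed .npy, for which mmap_mode does work):

import numpy as np

# One-off conversion; note that this step does read the full array into memory once
with np.load("matrix.npz") as archive:
    np.save("matrix.npy", archive["data"])

# From then on, memory-map the .npy: only the rows you slice are paged into RAM
matrix = np.load("matrix.npy", mmap_mode="r")
for x in range(0, matrix.shape[0], 30):
    batch = matrix[x:x + 30, :]
    # do_something_with(batch)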

Processing TensorFlow Records that are XML (text)

I would like to use TensorFlow to process XML strings that are proper TFRecords. I'm curious to understand how to structure code that parses each TFRecord. There is a set of input rules and data type mappings that are applied to each TFRecord record to produce an output TFRecord.
Example input TFRecord:
<PLANT><COMMON>Shooting Star</COMMON><BOTANICAL>Dodecatheon</BOTANICAL><ZONE>Annual</ZONE><LIGHT>Mostly Shady</LIGHT><PRICE>$8.60</PRICE><EXTREF><REF1><ID>608</ID><TYPE>LOOKUP</TYPE></REF1><REF2><ID>703</ID><TYPE>STD</TYPE></REF2></EXTREF><AVAILABILITY>051399</AVAILABILITY></PLANT>
The rules show what needs to be parsed and how it needs to be formatted. E.g. find the COMMON, PRICE, EXTREF>REF2>ID and AVAILABILITY elements and export their values as a TFRecord.
Example output TFRecord:
Shooting Star,8.60,703,51399
How do I add this logic to a graph so when it executes it produces the output TFRecord? My initial thoughts are that I need to translate the mapping logic into a series of tf.ops...
I believe this link will be very helpful to you. It specifies the exact format that the TFRecord needs, and it provides the code to turn your own dataset into a TFRecord file.
However, that link does not mention XML files. It only talks about how to create a tf_example and turn it into a TFRecord. This link goes a step back and shows you how to turn an XML file into a tf_example. Note that it will need some modification to fit your needs because it uses the Oxford Pet Dataset.
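To make the shape of that code concrete, here is a rough sketch (hypothetical feature names, assuming well-formed XML) of turning one PLANT record into a tf.train.Example and writing it to a TFRecord file:

import xml.etree.ElementTree as ET
import tensorflow as tf

def plant_to_example(xml_string):
    root = ET.fromstring(xml_string)
    feature = {
        "common": tf.train.Feature(bytes_list=tf.train.BytesList(
            value=[root.findtext("COMMON").encode("utf-8")])),
        "price": tf.train.Feature(float_list=tf.train.FloatList(
            value=[float(root.findtext("PRICE").lstrip("$"))])),
        "ref2_id": tf.train.Feature(int64_list=tf.train.Int64List(
            value=[int(root.find("EXTREF/REF2/ID").text)])),
        "availability": tf.train.Feature(int64_list=tf.train.Int64List(
            value=[int(root.findtext("AVAILABILITY"))])),
    }
    return tf.train.Example(features=tf.train.Features(feature=feature))

# xml_record is assumed to hold one PLANT XML string like the example above
with tf.io.TFRecordWriter("plants.tfrecord") as writer:
    writer.write(plant_to_example(xml_record).SerializeToString())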

Representing time sequence input/output in tensorflow

I've been working through the TensorFlow documentation (still learning), and I can't figure out how to represent input/output sequence data. My inputs are sequences of 20 8-entry vectors, making an 8x20xN matrix, where N is the number of instances. I'd like to eventually pass these through an LSTM for sequence to sequence learning. I know I need a 3D vector, but I'm unsure which dimensions are which.
RTFMs with pointers to the correct documentation greatly appreciated. I feel like this is obvious and I'm just missing it.
As described in the excellent blog post by WildML, the proper way is to save your examples in a TFRecord using the format tf.SequenceExample(). Using TFRecords for this provides the following advantages:
You can split your data into many files and load each of them on a different GPU.
You can use Tensorflow utilities for loading the data (for example, using Queues to load your data on demand).
Your model code will be separate from your dataset processing (this is a good habit to have).
You can bring new data to your model just by putting it into this format.
TFRecords use protocol buffers (protobuf) as a way to format your data; the protobuf documentation can be found here. The basic idea is that you define a format for your data (in this case tf.SequenceExample), save it to a TFRecord, and load it using the same data definition. Code for this pattern can be found in this ipython notebook.
As my answer is mostly summarizing the WildML blog post on this topic, I suggest you check that out, again found here.
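For a feel of what that looks like, here is a minimal sketch (a made-up 20x8 sequence and a hypothetical "inputs" feature name) of writing one tf.SequenceExample to a TFRecord file:

import numpy as np
import tensorflow as tf

def make_sequence_example(sequence):
    # sequence: array of shape (20, 8) -- 20 timesteps, each an 8-entry vector
    feature_list = tf.train.FeatureList(feature=[
        tf.train.Feature(float_list=tf.train.FloatList(value=step))
        for step in sequence
    ])
    feature_lists = tf.train.FeatureLists(feature_list={"inputs": feature_list})
    return tf.train.SequenceExample(feature_lists=feature_lists)

seq = np.random.rand(20, 8).astype(np.float32)
with tf.io.TFRecordWriter("sequences.tfrecord") as writer:
    writer.write(make_sequence_example(seq).SerializeToString())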
