I'm trying to build some debugging code into my TensorFlow dataset pipeline. Basically, if tfrecord parsing fails on a certain file, I'd like to be able to figure out which file that is. My dream would be to run a number of asserts in my parsing function that provide the filename if they fail.
My pipeline looks something like this:
tf.data.Dataset.from_tensor_slices(file_list)
.apply(tf.contrib.data.parallel_interleave(lambda f: tf.data.TFRecordDataset(f), cycle_length=4))
.map(parse_func, num_parallel_calls=params.num_cores)
.map(_func_for_other_stuff)
Ideally I'd pass the filename through in the parallel_interleave step, but if I have the anonymous function return a (filename, TFRecordDataset) tuple, I get:
TypeError: `map_func` must return a `Dataset` object.
I've also tried to include the filename in the file itself, as in this question, but I'm having issues there because filenames are of variable length.
The return value of the function passed to tf.contrib.data.parallel_interleave() must be a tf.data.Dataset. Therefore you can solve this by attaching the filename tensor to each element of the TFRecordDataset, using tf.data.Dataset.zip() as follows:
def read_records_func(filename):
    records = tf.data.TFRecordDataset(filename)
    # Create a dataset from the filename tensor and repeat it indefinitely.
    filename_as_dataset = tf.data.Dataset.from_tensors(filename).repeat(None)
    return tf.data.Dataset.zip((filename_as_dataset, records))
dataset = (tf.data.Dataset.from_tensor_slices(file_list)
           .apply(tf.contrib.data.parallel_interleave(read_records_func, cycle_length=4))
           .map(parse_func, num_parallel_calls=params.num_cores)
           .map(_func_for_other_stuff))
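With this zipped dataset, parse_func now receives a (filename, serialized_record) pair, so a failing assertion can report the offending file. Below is a minimal sketch of such a parse function; the feature spec and the label check are placeholders, not part of the original question:

def parse_func(filename, serialized):
    # Placeholder schema; replace with your real feature spec.
    features = tf.parse_single_example(
        serialized, {"label": tf.FixedLenFeature([], tf.int64)})
    # If the condition fails, the filename is printed alongside the bad value.
    check = tf.Assert(tf.greater_equal(features["label"], 0),
                      [filename, features["label"]])
    with tf.control_dependencies([check]):
        label = tf.identity(features["label"])
    return filename, label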
I have an h5 file, which is basically model weights output by keras. For some storage requirements, I'd like to split up the large h5 file into smaller pieces, and combine them back into a single file when needed. However, the way I do it seems to miss some "metadata" (not sure, maybe it's missing a lot more, but judging by the size of the combined file and the original file, it seems that I'm not missing much).
Here's my splitting script:
prefix = "model_weights"
fname_src = "DiffusiveSizeFactorAI/model_weights.h5"
size_max = 90 * 1024**2  # maximum size allowed in bytes
is_file_open = False
dest_fnames = []
idx = 0

with h5py.File(fname_src, "r") as src:
    for group in src:
        fname = f"{prefix}_{idx}.h5"
        if not is_file_open:
            dest = h5py.File(fname, "w")
            dest_fnames.append(fname)
            is_file_open = True
        group_id = dest.require_group(group)
        src.copy(f"/{group}", group_id)
        size = os.path.getsize(fname)
        if size > size_max:
            dest.close()
            idx += 1
            is_file_open = False
    dest.close()
and here's the script that I use for combining back the pieces:
fname_combined = f"{prefix}_combined.h5"

with h5py.File(fname_combined, "w") as combined:
    for fname in dest_fnames:
        with h5py.File(fname, "r") as src:
            for group in src:
                group_id = combined.require_group(group)
                src.copy(f"/{group}", group_id)
Just to add a little bit of context if it helps debugging my case, when I load the "combined" model weights, here's the error I'm getting:
ValueError: Layer count mismatch when loading weights from file. Model expected 108 layers, found 0 saved layers.
Note: the size of the original file and the combined one are about the same (they differ by less than 0.5%), which is why I think that I might be missing some metadata.
I am wondering if there is an alternative solution to your problem. I am assuming you want to deploy the model on an embedded system, which leads to memory restrictions. If that is the case, here are some alternatives:
Use TensorFlow Lite: it is claimed to significantly reduce the size of the model (I haven't really tested this), and it also improves other important aspects of deploying ML on the edge. In short, the model can end up about 5x smaller.
Apply pruning: pruning gradually zeroes out model weights during the training process to achieve model sparsity. Sparse models are easier to compress, and the zeroed weights can be skipped during inference for latency improvements. A short sketch of both ideas follows below.
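A minimal sketch of both ideas, assuming TF 2.x and a Keras model object named model; the pruning part needs the separate tensorflow_model_optimization package:

import tensorflow as tf
import tensorflow_model_optimization as tfmot

# TensorFlow Lite: serialize the Keras model into a compact flatbuffer.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
tflite_model = converter.convert()
with open("model.tflite", "wb") as f:
    f.write(tflite_model)

# Pruning: wrap the model so weights are gradually zeroed out during training;
# the resulting sparse weights compress much better.
pruned_model = tfmot.sparsity.keras.prune_low_magnitude(model)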
Based on an answer from h5py developers, there are two issues:
Every time an h5 file is copied this way, a duplicate extra folder level will be added to the destination file. Let me explain:
Suppose in src.h5, I have the following structure: /A/B/C. In these two lines:
group_id = dest.require_group(group)
src.copy(f"/{group}", group_id)
group is /A, and so, after copying, an extra /A is added to dest.h5, which results in the following erroneous structure: /A/A/B/C. To fix that, one needs to explicitly pass name="A" as an argument to copy.
Metadata of the root level "/" is not copied in either the splitting or the combining script. To fix that, given that the h5 data structure is very similar to Python's dict, you just need to add:
dest.attrs.update(src.attrs)
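A minimal sketch of both fixes applied to the copy loop; the variable names follow the splitting script above, and this is meant as an illustration rather than a drop-in replacement:

# Copy each top-level group directly into the destination file with an
# explicit name, so no extra /A/A nesting is introduced.
for group in src:
    src.copy(f"/{group}", dest, name=group)

# Also carry over the root-level metadata, which copy() does not touch.
dest.attrs.update(src.attrs)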
For personal use, I've written two helper functions: one that splits up a large h5 file into smaller parts, each not exceeding a specified size (passed as an argument by the user), and another that combines them back into a single h5 file. In case you find them useful, they can be found on GitHub here.
I am trying to read a large number (54K) of 512x512x3 .png images into an array to create a dataset afterwards and feed it to a Keras model. I am using the code below, but I am getting the cv2.OutofMemory error (at around image 50K...) pointing to the fourth line of my code. I have been reading a bit about it: I am using the 64-bit version, and the images cannot be resized since they are a fixed input representation. Is there anything that can be done from a memory-management side of things to make it work?
# Images (512x512x3)
X_data = []
files = glob.glob('C:\\Users\\77901677\\Projects\\images1\\*.png')
for myFile in files:
    image = cv2.imread(myFile)
    X_data.append(image)
dataset_image = np.array(X_data)

# Annotations (multilabel) 512x512x2
Y_data = []
files = glob.glob('C:\\Users\\77901677\\Projects\\annotations1\\*.png')
for myFile in files:
    mask = cv2.imread(myFile)
    # Get rid of the first channel, which is empty
    mask = mask[:, :, 1:]
    Y_data.append(mask)
dataset_mask = np.array(Y_data)
Any ideas or advice are welcome.
You can reduce the memory usage by cutting one of your variables, because at the moment you hold the data twice (once in the Python list and once in the array built from it).
You could use yield for this, thus creating a generator, which will load your files one at a time instead of storing them all in an auxiliary variable.
def myGenerator():
    files = glob.glob('C:\\Users\\77901677\\Projects\\annotations1\\*.png')
    for myFile in files:
        mask = cv2.imread(myFile)
        # Get rid of the first channel, which is empty
        yield mask[:, :, 1:]

# Initialise your numpy array here (N: number of files; H, W, C: mask dimensions)
yData = np.zeros((N, H, W, C))

# Initialise the generator
mygenerator = myGenerator()
for i, data in enumerate(mygenerator):
    yData[i, ...] = data  # load the data one file at a time
But this is not optimal for you. If you plan to train a model in the next step, you will run into memory issues for sure. In Keras, you can additionally implement a keras.utils.Sequence generator, which will load your files in batches (similarly to this yield generator) and feed them to your model during the training stage. I recommend this article, which demonstrates an easy implementation of it; it's what I use for my Keras/TF model pipelines. A minimal sketch follows below.
It's good practice to use generators when feeding your models large amounts of data.
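A minimal sketch of such a Sequence, using tf.keras and the same .png loading as above; the class name, file pattern, and batch size are placeholders:

import glob
import cv2
import numpy as np
from tensorflow import keras

class PngSequence(keras.utils.Sequence):
    def __init__(self, pattern, batch_size):
        self.files = glob.glob(pattern)
        self.batch_size = batch_size

    def __len__(self):
        # Number of batches per epoch
        return int(np.ceil(len(self.files) / self.batch_size))

    def __getitem__(self, idx):
        batch_files = self.files[idx * self.batch_size:(idx + 1) * self.batch_size]
        # For model.fit you would normally return an (x_batch, y_batch) pair here.
        return np.array([cv2.imread(f) for f in batch_files])

An instance of this class can be passed directly to model.fit, and Keras will then load one batch at a time instead of the whole dataset.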
I am trying to write a custom input pipeline in TensorFlow for my dataset containing .fits files. I have a list of strings with the locations of the files, like so:
pathlist = ['/path/to/file1', 'path/to/file2', ...]
Although the path naming convention has very specific subdirectories, this is a general example. I have written a short function which, when applied to each path element of this list, spits out a numpy.ndarray with the appropriate data:
import numpy as np
from astropy.io import fits
import tensorflow as tf

def path2im(path):
    print(path)
    hdulist = fits.open(path)
    data = hdulist[1].data
    data[np.isnan(data)] = 0
    return tf.convert_to_tensor(data.astype(np.float32))
It basically opens the fits file from the path, extracts the data, drops the NaNs, and converts the array to a tensor. I am following the guidelines set down here (Loading Images in a Directory As Tensorflow Data set) for generating a TensorFlow input pipeline. I start by defining a filename dataset from the list of paths, and then mapping the function over it.
filenames = tf.data.Dataset.list_files(pathlist)
ims = filenames.map(path2im)
When this is run, it prints the path not as a string, but as
Tensor("arg0:0", shape=(), dtype=string)
This makes sense, considering the filenames dataset contains tensors. It also produces a huge error block from the map function, which fails at this line:
hdulist = fits.open(path)
because fits.open(path) expects a string as the path argument. Is there any way to rectify this issue? I cannot find a way to convert the string tensor into a string without starting a session and using .eval(), which I do not want to do in this initialization phase.
The main idea of the Dataset API is to make your data preprocessing part of the TensorFlow graph, so that, for example, you can just specify a filename as a placeholder when you run the TensorFlow graph.
It is therefore completely expected that the filenames object is a Tensor, and if you want to convert it to a string, you'll have to evaluate it using a Session.
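For example, a minimal sketch of evaluating one element with a Session (TF 1.x style; the iterator names are illustrative, not from the original post):

iterator = filenames.make_one_shot_iterator()
next_path = iterator.get_next()

with tf.Session() as sess:
    print(sess.run(next_path))  # e.g. b'/path/to/file1' as a bytes object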
You might want to have a look at this introductory guide to datasets.
So this is a minimal piece of code which illustrates the issue:
This is the Dataset:
class IceShipDataset(Dataset):
    BAND1 = 'band_1'
    BAND2 = 'band_2'
    IMAGE = 'image'

    @staticmethod
    def get_band_img(sample, band):
        pic_size = 75
        img = np.array(sample[band])
        img.resize(pic_size, pic_size)
        return img

    def __init__(self, data, transform=None):
        self.data = data
        self.transform = transform

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        sample = self.data[idx]
        band1_img = IceShipDataset.get_band_img(sample, self.BAND1)
        band2_img = IceShipDataset.get_band_img(sample, self.BAND2)
        img = np.stack([band1_img, band2_img], 2)
        sample[self.IMAGE] = img
        if self.transform is not None:
            sample = self.transform(sample)
        return sample
And this is the code which fails:
PLAY_BATCH_SIZE = 4

# Load data. There are 1604 examples.
with open('train.json', 'r') as f:
    data = f.read()
data = json.loads(data)

ds = IceShipDataset(data)
playloader = torch.utils.data.DataLoader(ds, batch_size=PLAY_BATCH_SIZE,
                                         shuffle=False, num_workers=4)

for i, data in enumerate(playloader):
    print(i)
It gives that weird "too many open files" error in the for loop…
My torch version is 0.3.0.post4
If you want the json file, it is available at Kaggle (https://www.kaggle.com/c/statoil-iceberg-classifier-challenge)
I should mention that the error has nothing to do with the state of my laptop:
yoni@yoni-Lenovo-Z710:~$ lsof | wc -l
89114
yoni@yoni-Lenovo-Z710:~$ cat /proc/sys/fs/file-max
791958
What am I doing wrong here?
I know how to fix the error, but I don't have a complete explanation for why it happens.
First, the solution: you need to make sure that the image data is stored as numpy.arrays; when you call json.loads, it loads them as Python lists of floats. This causes torch.utils.data.DataLoader to individually transform each float in the list into a torch.DoubleTensor.
Have a look at default_collate in torch.utils.data.DataLoader: your __getitem__ returns a dict, which is a mapping, so default_collate gets called again on each element of the dict. The first couple are ints, but then you get to the image data, which is a list, i.e. a collections.Sequence; this is where things get funky, as default_collate is called on each element of the list. This is clearly not what you intended. I don't know what torch assumes about the contents of a list versus a numpy.array, but given the error it would appear that that assumption is being violated.
The fix is pretty trivial: just make sure the two image bands are numpy.arrays, for instance in __init__:
def __init__(self, data, transform=None):
    self.data = []
    for d in data:
        d[self.BAND1] = np.asarray(d[self.BAND1])
        d[self.BAND2] = np.asarray(d[self.BAND2])
        self.data.append(d)
    self.transform = transform
or after you load the JSON, whatever; it doesn't really matter where you do it, as long as you do it.
Why does the above result in too many open files?
I don't know, but as the comments pointed out, it likely has to do with interprocess communication and the lock files on the two queues that data is taken from and added to.
Footnote: the train.json was not available for download from Kaggle due to the competition still being open (??). I made a dummy json file that should have the same structure and tested the fix on that dummy file.
I am using TensorFlow's ImageNet-trained model to extract the last pooling layer's features as representation vectors for a new dataset of images.
The model as is predicts on a new image as follows:
python classify_image.py --image_file new_image.jpeg
I edited the main function so that it takes a folder of images, returns the predictions for all images at once, and writes the feature vectors to a csv file. Here is how I did that:
def main(_):
    maybe_download_and_extract()
    # image = (FLAGS.image_file if FLAGS.image_file else
    #          os.path.join(FLAGS.model_dir, 'cropped_panda.jpg'))
    # Edit to take a directory of image files instead of a single file
    if FLAGS.data_folder:
        images_folder = FLAGS.data_folder
        list_of_images = os.listdir(images_folder)
    else:
        raise ValueError("Please specify image folder")

    with open("feature_data.csv", "wb") as f:
        feature_writer = csv.writer(f, delimiter='|')
        for image in list_of_images:
            print(image)
            current_features = run_inference_on_image(images_folder + "/" + image)
            feature_writer.writerow([image] + current_features)
It worked just fine for around 21 images but then crashed with the following error:
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1912, in as_graph_def
raise ValueError("GraphDef cannot be larger than 2GB.")
ValueError: GraphDef cannot be larger than 2GB.
I thought that by calling run_inference_on_image(images_folder + "/" + image), the previous image data would be overwritten and only the new image data considered, which doesn't seem to be the case. How can I resolve this issue?
The problem here is that each call to run_inference_on_image() adds nodes to the same graph, which eventually exceeds the maximum size. There are at least two ways to fix this:
The easy but slow way is to use a different default graph for each call to run_inference_on_image():
for image in list_of_images:
    # ...
    with tf.Graph().as_default():
        current_features = run_inference_on_image(images_folder + "/" + image)
    # ...
The more involved but more efficient way is to modify run_inference_on_image() to run on multiple images. Relocate your for loop to surround this sess.run() call, and you will no longer have to reconstruct the entire model on each call, which should make processing each image much faster.
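A minimal sketch of that restructuring; the tensor names ('pool_3:0', 'DecodeJpeg/contents:0') and create_graph() come from the classify_image.py script the question is based on, but treat this as an illustration rather than a drop-in replacement:

import numpy as np
import tensorflow as tf

def run_inference_on_images(image_paths):
    create_graph()                                   # build the graph once
    all_features = []
    with tf.Session() as sess:
        pool3 = sess.graph.get_tensor_by_name('pool_3:0')
        for path in image_paths:
            image_data = tf.gfile.FastGFile(path, 'rb').read()
            features = sess.run(pool3, {'DecodeJpeg/contents:0': image_data})
            all_features.append(np.squeeze(features))
    return all_features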
You can move create_graph() to somewhere before the loop for image in list_of_images: (which loops over the files). That way, inference is performed multiple times on the same graph. The simplest way is to put create_graph() at the start of the main function; then the graph is created only once.
A good explanation of why such errors occur is given here. I encountered the same error while using the tf dataset API and came to the understanding that when data is iterated over in the session, it gets appended to the existing graph. So what I did was use tf.reset_default_graph() before building the dataset iterator, to make sure that the previous graph is cleared away.
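A minimal sketch of that workaround; the pipeline below is illustrative, not taken from the original post:

filenames = ["data.tfrecord"]                 # placeholder file list
tf.reset_default_graph()                      # clear whatever was built before
dataset = tf.data.TFRecordDataset(filenames)
iterator = dataset.make_one_shot_iterator()
next_element = iterator.get_next()

with tf.Session() as sess:
    print(sess.run(next_element))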
Hope this helps for such a scenario.