Memory issue with cv.imread - python

I am trying to read a large number (54K) of 512x512x3 .png images into an array to create a dataset afterwards and feed it to a Keras model. I am using the code below, but I get the cv2.OutofMemory error (at around image 50K) pointing to the fourth line of my code. I have been reading a bit about it: I am using the 64-bit version, and the images cannot be resized because the input representation is fixed. Is there anything that can be done on the memory-management side to make it work?
'''
# Images (512x512x3)
X_data = []
files = glob.glob('C:\\Users\\77901677\\Projects\\images1\\*.png')
for myFile in files:
    image = cv2.imread(myFile)
    X_data.append(image)
dataset_image = np.array(X_data)
# Annotations (multilabel) 512x512x2
Y_data = []
files = glob.glob('C:\\Users\\77901677\\Projects\\annotations1\\*.png')
for myFile in files:
    mask = cv2.imread(myFile)
    # Gets rid of first channel which is empty
    mask = mask[:, :, 1:]
    Y_data.append(mask)
dataset_mask = np.array(Y_data)
'''
Any ideas or advice are welcome.

You can reduce memory use by dropping one of your variables: at the moment you hold the data twice, once in the Python list (X_data) and once in the NumPy array that np.array() copies it into.
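To put rough numbers on this, here is a back-of-the-envelope estimate. It assumes the images load as uint8 (one byte per value), which is the usual case for 8-bit PNGs read with cv2.imread; treat the figures as an estimate, not a measurement.

```python
# Rough memory estimate for the image dataset described in the question.
# Assumption: each pixel value is one byte (uint8), as for 8-bit PNGs.
n_images = 54_000
height, width, channels = 512, 512, 3
bytes_per_value = 1

bytes_per_image = height * width * channels * bytes_per_value
total_bytes = n_images * bytes_per_image
gib = 1024 ** 3

print(f"one image:      {bytes_per_image / 1024:.0f} KiB")
print(f"list of images: {total_bytes / gib:.1f} GiB")
# np.array(X_data) copies the list into a new array, so both exist at once.
print(f"list + array:   {2 * total_bytes / gib:.1f} GiB")
```

Under that assumption the images alone need roughly 40 GiB, and about twice that while the list and the array coexist, so removing the duplicate helps but will not be enough on most machines.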
You could use yield to create a generator, which loads only one file at a time instead of storing them all in an auxiliary variable:
def myGenerator(files):
    for myFile in files:
        mask = cv2.imread(myFile)
        # Gets rid of first channel which is empty
        yield mask[:, :, 1:]

files = glob.glob('C:\\Users\\77901677\\Projects\\annotations1\\*.png')
# initialise your numpy array here: N x H x W x C (the masks are 512x512x2)
yData = np.zeros((len(files), 512, 512, 2), dtype=np.uint8)
# initialise the generator
mygenerator = myGenerator(files)  # create a generator
for i, data in enumerate(mygenerator):
    yData[i] = data  # load the data
But this is not optimal for you. If you plan to train a model in the next step, you will almost certainly run into memory issues again. In Keras, you can instead implement a Keras Sequence generator, which loads your files in batches (similarly to this yield generator) and feeds them to your model during training. I recommend this article here, which demonstrates an easy implementation of it; it's what I use for my Keras/TF model pipelines.
It's good practice to use generators when feeding models large amounts of data.
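A minimal sketch of the batch-loading idea follows; it is not the implementation from the linked article. It exposes the two methods that keras.utils.Sequence requires (__len__ and __getitem__) but is written as a plain class so that it runs without TensorFlow. The names ImageMaskBatches, load_fn and fake_load are made up for illustration, and the demo uses blank arrays instead of real files.

```python
import math
import numpy as np


class ImageMaskBatches:
    """Loads (image, mask) pairs from disk one batch at a time.

    In real use this would subclass keras.utils.Sequence, which expects
    exactly these two methods: __len__ and __getitem__.
    """

    def __init__(self, image_paths, mask_paths, batch_size, load_fn):
        if len(image_paths) != len(mask_paths):
            raise ValueError("need one mask per image")
        self.image_paths = image_paths
        self.mask_paths = mask_paths
        self.batch_size = batch_size
        self.load_fn = load_fn  # assumption: e.g. cv2.imread

    def __len__(self):
        # Number of batches per epoch; the last batch may be smaller.
        return math.ceil(len(self.image_paths) / self.batch_size)

    def __getitem__(self, idx):
        lo = idx * self.batch_size
        hi = lo + self.batch_size
        images = [self.load_fn(p) for p in self.image_paths[lo:hi]]
        # Drop the empty first channel of each mask, as in the question.
        masks = [self.load_fn(p)[:, :, 1:] for p in self.mask_paths[lo:hi]]
        return np.stack(images), np.stack(masks)


# Demo with a stand-in loader that fabricates blank 512x512x3 arrays.
def fake_load(path):
    return np.zeros((512, 512, 3), dtype=np.uint8)


paths = [f"img_{i}.png" for i in range(10)]
batches = ImageMaskBatches(paths, paths, batch_size=4, load_fn=fake_load)
x, y = batches[2]  # the last, shorter batch
print(len(batches), x.shape, y.shape)
```

Only batch_size images and masks are in memory at any time, which is the point of the approach: peak memory depends on the batch size rather than on the 54K files.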

Related

Using ImageDataGenerator with images in .npy format

I am pretty new to Keras. I am trying to train a model using ImageDataGenerator. I have a very large amount of images for training saved in .npy format. I wanted to use flow_from_directory() so I stored the images as recommended in the documentation (one folder per class). The problem is this only works for png, jpeg, tiff, etc. but won't work with my .npy files.
Is there any way I could use this function or something similar that gives me all the augmentation possibilities that ImageDataGenerator gives?
Thank you very much, any help is appreciated
Yes, it's possible if you are willing to adapt the source code of the ImageDataGenerator (which is actually quite straightforward to read and understand). Looking at the keras-preprocessing github, I think it would suffice to replace the load_img method in the DirectoryIterator class with your own load_array method that reads .npy files from disk instead of images:
...
# build batch of image data
for i, j in enumerate(index_array):
    fname = self.filenames[j]
    ## Replace the code below with your own function
    img = load_img(os.path.join(self.directory, fname),
                   color_mode=self.color_mode,
                   target_size=self.target_size,
                   interpolation=self.interpolation)
    x = img_to_array(img, data_format=self.data_format)
...
So minimally, you would make the following change to that line:
...
# build batch of image data
for i, j in enumerate(index_array):
    fname = self.filenames[j]
    img = np.load(os.path.join(self.directory, fname))
...
But probably you will want to implement some of the additional logic that Keras' load_img utility function also has like color mode, target size etc. and wrap everything in your own load_array function.
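A minimal sketch of what such a load_array function might look like is below. The name comes from the answer above, but the signature and behaviour are assumptions for illustration and not part of Keras: it assumes channels-last arrays and checks the size instead of resizing, since load_img's resizing works on PIL images rather than arrays.

```python
import os
import tempfile
import numpy as np


def load_array(path, target_size=None, dtype="float32"):
    """Load an image stored as .npy and check it against target_size.

    Hypothetical helper: target_size is (height, width) as in load_img,
    but here a mismatch raises an error instead of triggering a resize.
    """
    arr = np.load(path)
    if arr.ndim == 2:
        # Grayscale without a channel axis: add one so the result is HxWxC.
        arr = arr[:, :, np.newaxis]
    if target_size is not None and tuple(arr.shape[:2]) != tuple(target_size):
        raise ValueError(
            f"{path}: expected {tuple(target_size)}, got {arr.shape[:2]}")
    return arr.astype(dtype)


# Usage: round-trip a small array through a temporary .npy file.
with tempfile.TemporaryDirectory() as tmp:
    fname = os.path.join(tmp, "sample.npy")
    np.save(fname, np.ones((8, 8), dtype=np.uint8))
    img = load_array(fname, target_size=(8, 8))
    print(img.shape, img.dtype)
```

Raising on a size mismatch keeps the sketch short; if your arrays vary in size you would need to add a resize step of your own at that point.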

Loading a custom dataset from json annotations files for Keras classification task

I am new to deep learning and would like to implement a simple classification task using Keras. My dataset contains over 2000 images & for each image I have a respective json file which contains the label for that image. Following is the code to load the json files & create the X (image) & Y (labels) arrays:
X = []
Y = []
with concurrent.futures.ProcessPoolExecutor() as executor:
    # Get a list of files to process
    str = jsonpath + '/*.json'
    #print(str)
    json_files = glob.glob(str)
    for jsonfile,y in zip(json_files, executor.map(create_array, json_files)):
        X.append(y[0])
        Y.append(y[1])
where the function create_array is defined as follows:
def create_array(jsonfile):
    array_list = []
    y_list = []
    with open(jsonfile) as f:
        data = json.load(f)
        name = data['annotation']['data_filename']
        img = cv2.imread(imgDIR + '/' + name)
        array_list.append(img)
        l = data['annotation']['data_annotation']['classification'][0]['classification_label']
        y_list.append(l)
    return array_list, y_list
It works for a small number of images, say 15, but for the entire set of 2000 images the program gets killed automatically, or sometimes it gives the error "MemoryError: out of memory".
Is there an efficient way to do this? How can I speed up this data pre-processing part to give it as an input to the keras classification model?
It seems like your images are pretty much ready for training and your preprocessing is simply about loading the files. The JSON format might not be the fastest approach when it comes to loading data. If you use something like pickle to save and load your images, you might see a speed boost.
The other question is how to pass the data to Keras efficiently. Normally you would use model.fit, but since not all of your data fits into memory, you can use model.fit_generator (in recent Keras versions model.fit accepts generators directly and fit_generator is deprecated).
The Keras docs give us the following hint:
The generator is run in parallel to the model, for efficiency. For
instance, this allows you to do real-time data augmentation on images
on CPU in parallel to training your model on GPU.
The use of keras.utils.Sequence guarantees the ordering and guarantees
the single use of every input per epoch when using
use_multiprocessing=True.
Here is an example of how to implement such a generator.
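For the JSON layout shown in the question, a plain Python generator could look roughly like the sketch below. It is an illustration rather than the linked example: annotation_batches and load_image are hypothetical names, and the demo passes a dummy loader where a cv2.imread wrapper would go. Unlike keras.utils.Sequence, a plain generator like this gives no ordering guarantees under multiprocessing.

```python
import json
import os
import tempfile


def annotation_batches(json_files, batch_size, load_image):
    """Yield (images, labels) lists of at most batch_size items each.

    Only one batch is held in memory at a time. load_image is any callable
    that turns a file name into pixel data (e.g. a cv2.imread wrapper).
    """
    images, labels = [], []
    for path in json_files:
        with open(path) as f:
            ann = json.load(f)["annotation"]
        images.append(load_image(ann["data_filename"]))
        labels.append(
            ann["data_annotation"]["classification"][0]["classification_label"])
        if len(images) == batch_size:
            yield images, labels
            images, labels = [], []
    if images:  # final, possibly smaller batch
        yield images, labels


# Demo on five tiny annotation files; the "loader" just echoes the name.
with tempfile.TemporaryDirectory() as tmp:
    files = []
    for i in range(5):
        record = {"annotation": {
            "data_filename": f"img_{i}.png",
            "data_annotation": {
                "classification": [{"classification_label": i % 2}]}}}
        path = os.path.join(tmp, f"{i}.json")
        with open(path, "w") as f:
            json.dump(record, f)
        files.append(path)
    batch_sizes = [len(labels) for _, labels in
                   annotation_batches(files, 2, load_image=lambda name: name)]
print(batch_sizes)
```

For training you would wrap this in an endless loop over epochs (or port it to a Sequence) and convert each batch to NumPy arrays before yielding it.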

TensorFlow: How to perform image categorisation on multiple images

Hoping that somebody can help out with a TensorFlow query. It's not a difficult one, I'm sure. I am just somewhat lacking in knowledge relating to TensorFlow and NumPy.
Without any prior experience of TensorFlow I have implemented Python code from a tutorial for doing image classification. This works. Once trained, it can tell the difference between a cat and a dog.
This is currently hard-wired for a single image. I would like to be able to classify multiple images (i.e. the contents of a folder), and do this efficiently. What I have done so far in an effort to achieve this is to simply add a loop around everything, so it runs all the code for each image. However, timing the operation shows that classification of each successive image takes longer than the previous one. Therefore there is some kind of incremental overhead. Some operation is taking more time with every loop. I cannot immediately see what this is.
There are two options to improve this. Either:
(1) Leave the loop largely as it is and prevent this slowdown, or
(2) (Preferable IMHO, if it is possible) Pass a list of images to TensorFlow for classification, and get back a list of results. This seems more efficient.
This is the code:
import tensorflow as tf
import numpy as np
import os,glob,cv2
import sys,argparse
import time
try:
    inputdir = [redacted - insert input dir here]
    for f in os.listdir(inputdir):
        start_time = time.time()
        filename = os.path.join(inputdir,f)
        image_size=128
        num_channels=3
        images = []
        image = cv2.imread(filename) # read image using OpenCV
        # Resize image to desired size and preprocess exactly as done during training...
        image = cv2.resize(image, (image_size, image_size),0,0, cv2.INTER_LINEAR)
        images.append(image)
        images = np.array(images, dtype=np.uint8)
        images = images.astype('float32')
        images = np.multiply(images, 1.0/255.0)
        # The input to the network is of shape [None image_size image_size num_channels]. Hence we reshape.
        x_batch = images.reshape(1, image_size,image_size,num_channels)
        sess = tf.Session() # restore the saved model
        saver = tf.train.import_meta_graph('dogs-cats-model.meta') # Step 1: Recreate the network graph. At this step only graph is created
        saver.restore(sess, tf.train.latest_checkpoint('./')) # Step 2: Load the weights saved using the restore method
        graph = tf.get_default_graph() # access the default graph which we have restored
        # Now get hold of the op that we can be processed to get the output.
        # In the original network y_pred is the tensor that is the prediction of the network
        y_pred = graph.get_tensor_by_name("y_pred:0")
        ## Feed the images to the input placeholders...
        x= graph.get_tensor_by_name("x:0")
        y_true = graph.get_tensor_by_name("y_true:0")
        y_test_images = np.zeros((1, 2))
        # Create the feed_dict that is required to be fed to calculate y_pred...
        feed_dict_testing = {x: x_batch, y_true: y_test_images}
        result=sess.run(y_pred, feed_dict=feed_dict_testing)
        # Note: result is a numpy.ndarray
        print(f + '\t' + str(result) + ' ' + '%.2f' % (time.time()-start_time) + ' seconds')
        # next image
except:
    import traceback
    tb = traceback.format_exc()
    print(tb)
finally:
    input() # keep window open until key is pressed
What I tried to do to modify the above was to create a list of filenames using...
images.append(image)
...and then taking the rest of the code out of the loop. However, this didn't work. It resulted in the following error:
ValueError: cannot reshape array of size 294912 into shape (1,128,128,3)
At this line:
x_batch = images.reshape(1, image_size,image_size,num_channels)
Apparently this Reshape method doesn't work (as implemented, at least) on a list of images.
So my questions are:
What could be causing the steady increase in image classification time that I have seen as images are iterated?
Can I perform classification on multiple images in one go, rather than one-by-one in a loop?
Thanks in advance.
Your issues:
1 a) The main reason why it is so slow is: You are re-creating the graph for every image.
1 b) The incremental overhead is coming from creating a new session every time without destroying the old session. The with syntax helps with that. e.g.:
with tf.Session(graph=tf.Graph()) as session:
    # do something with the session
But that won't be a noticeable issue after addressing a).
When thinking about the problem, one might realise which parts of your code depend on the image and which don't. The TensorFlow related part that is different per image is the call to session.run, feeding in the image. Everything else can be moved out of the loop.
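2) You can also classify multiple images in one go. The first dimension of x_batch is the batch size, and you are hard-coding it to one. That is why the reshape fails once images holds more than one image (294912 = 6 x 128 x 128 x 3, i.e. six images); use the number of images, or -1, as the first dimension instead. But you may exhaust your memory resources if you try to do that for a very large number of images.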
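As a rough sketch of point 2, the snippet below builds a batch from a list of images and lets NumPy infer the batch dimension. The images are synthetic stand-ins for the output of cv2.resize, and chunks is a hypothetical helper for limiting how many images go into one sess.run call.

```python
import numpy as np

image_size, num_channels = 128, 3

# Stand-ins for six images already resized with cv2.resize (assumption).
images = [np.full((image_size, image_size, num_channels), i, dtype=np.uint8)
          for i in range(6)]

# Same preprocessing as in the question, applied to the whole list at once.
batch = np.array(images, dtype=np.uint8).astype("float32") / 255.0

# -1 lets NumPy infer the batch dimension instead of hard-coding 1.
x_batch = batch.reshape(-1, image_size, image_size, num_channels)
print(x_batch.shape)


def chunks(items, size):
    """Split a list into consecutive pieces of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Feed the model a few images at a time to keep memory bounded.
for part in chunks(images, 4):
    print(len(part))
```

The array fed to y_true in the question would presumably need the same leading dimension, e.g. np.zeros((len(part), 2)) for each chunk.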

How can I pick specific records in TensorFlow from a .tfrecords file?

My goal is to train a neural net for a fixed number of epochs or steps. I would like each step to use a batch of data of a specific size from a .tfrecords file.
Currently I am reading from the file using this loop:
i = 0
data = np.empty(shape=[x,y])
for serialized_example in tf.python_io.tf_record_iterator(filename):
    example = tf.train.Example()
    example.ParseFromString(serialized_example)
    Labels = example.features.feature['Labels'].byte_list.value
    # Some more features here
    data[i-1] = [Labels[0], # more features here]
    if i == 3:
        break
    i = i + 1
print data # do some stuff etc.
I am a bit of a Python noob, and I suspect that creating "i" outside the loop and breaking out when it reaches a certain value is just a hacky workaround.
Is there a way that I can read data from the file but specify "I would like the first 100 values in the byte_list that is contained within the Labels feature" and then subsequently "I would like the next 100 values"?
To clarify, the thing that I am unfamiliar with is looping over a file in this manner, I am not really certain how to manipulate the loop.
Thanks.
Impossible. TFRecords is a streaming reader and has no random access.
A TFRecords file represents a sequence of (binary) strings. The format is not random access, so it is suitable for streaming large amounts of data but not suitable if fast sharding or other non-sequential access is desired.
Expanding on the comment by Shan Carter (although it's not an ideal solution for your question) for archival purposes.
If you'd like to use enumerate() to break out from a loop at a certain iteration, you could do the following:
n = 5 # Iteration you would like to stop at
data = np.empty(shape=[x,y])
for i, serialized_example in enumerate(tf.python_io.tf_record_iterator(filename)):
    example = tf.train.Example()
    example.ParseFromString(serialized_example)
    Labels = example.features.feature['Labels'].byte_list.value
    # Some more features here
    data[i] = [Labels[0], Labels[1]] # more features here; enumerate() starts at 0
    if i == n:
        break
print(data)
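Even without random access, a sequential reader can be consumed in fixed-size chunks ("the first 100, then the next 100"). The sketch below does this with itertools.islice; take_chunks is a made-up helper name, and a plain generator stands in for tf.python_io.tf_record_iterator(filename) so that the example runs without TensorFlow.

```python
from itertools import islice


def take_chunks(iterable, size):
    """Yield successive lists of up to `size` items from any iterator."""
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


# Stand-in for tf.python_io.tf_record_iterator(filename): 250 fake records.
records = (f"record-{i}" for i in range(250))

chunks = take_chunks(records, 100)
first = next(chunks)    # records 0..99
second = next(chunks)   # records 100..199
print(first[0], second[0], len(next(chunks)))
```

Each call to next() continues where the previous chunk stopped, so no counter has to be maintained by hand; parsing each record with tf.train.Example would go inside the loop that consumes a chunk.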
Addressing your use case for .tfrecords
I would like each step to use a batch of data of a specific size from a .tfrecords file.
As mentioned by TimZaman, .tfrecords files are not meant for arbitrary access of data. But seeing as you just need to continuously pull batches from the .tfrecords file, you might be better off using the tf.data API to feed your model.
Adapted from the tf.data guide:
Constructing a Dataset from .tfrecord files
filepath1 = '/path/to/file.tfrecord'
filepath2 = '/path/to/another_file.tfrecord'
dataset = tf.data.TFRecordDataset(filenames = [filepath1, filepath2])
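From here, if you're using the tf.keras API, you could pass dataset as an argument into model.fit like so. Note that TFRecordDataset yields raw serialized records, so the dataset first has to be parsed into (features, labels) pairs and batched, e.g. with dataset.map() and dataset.batch(); batch_size stays None because the dataset does its own batching:
model.fit(x = dataset,
          batch_size = None,
          validation_data = some_other_dataset)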
Extra Stuff
Here's a blog which helps to explain .tfrecord files a little better than the tensorflow documentation.

overcome Graphdef cannot be larger than 2GB in tensorflow

I am using tensorflow's imageNet trained model to extract the last pooling layer's features as representation vectors for a new dataset of images.
The model as is predicts on a new image as follows:
python classify_image.py --image_file new_image.jpeg
I edited the main function so that I can take a folder of images and return the prediction on all images at once and write the feature vectors in a csv file. Here is how I did that:
def main(_):
    maybe_download_and_extract()
    #image = (FLAGS.image_file if FLAGS.image_file else
    #         os.path.join(FLAGS.model_dir, 'cropped_panda.jpg'))
    #edit to take a directory of image files instead of a one file
    if FLAGS.data_folder:
        images_folder=FLAGS.data_folder
        list_of_images = os.listdir(images_folder)
    else:
        raise ValueError("Please specify image folder")
    with open("feature_data.csv", "wb") as f:
        feature_writer = csv.writer(f, delimiter='|')
        for image in list_of_images:
            print(image)
            current_features = run_inference_on_image(images_folder+"/"+image)
            feature_writer.writerow([image]+current_features)
It worked just fine for around 21 images but then crashed with the following error:
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1912, in as_graph_def
raise ValueError("GraphDef cannot be larger than 2GB.")
ValueError: GraphDef cannot be larger than 2GB.
I thought by calling the method run_inference_on_image(images_folder+"/"+image) the previous image data would be overwritten to only consider the new image data, which doesn't seem to be the case. How to resolve this issue?
The problem here is that each call to run_inference_on_image() adds nodes to the same graph, which eventually exceeds the maximum size. There are at least two ways to fix this:
The easy but slow way is to use a different default graph for each call to run_inference_on_image():
for image in list_of_images:
    # ...
    with tf.Graph().as_default():
        current_features = run_inference_on_image(images_folder+"/"+image)
    # ...
The more involved but more efficient way is to modify run_inference_on_image() to run on multiple images. Relocate your for loop to surround this sess.run() call, and you will no longer have to reconstruct the entire model on each call, which should make processing each image much faster.
You can move the create_graph() call to somewhere before the loop for image in list_of_images: (which loops over the files).
That way, inference is performed multiple times on the same graph.
The simplest way is to put create_graph() at the start of the main function.
Then the graph is created only once.
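The control flow described here, build the graph once and then run inference per image, can be sketched independently of TensorFlow. In the sketch below, extract_features is a made-up name, and create_graph and run_inference are hypothetical stand-ins passed in as parameters (a counter and a dummy feature function in the demo), not the functions from classify_image.py. The CSV writer is used in Python 3 text mode.

```python
import csv
import io


def extract_features(image_names, create_graph, run_inference, out_file):
    """Write one CSV row of features per image, building the model once."""
    model = create_graph()  # expensive setup: done a single time
    writer = csv.writer(out_file, delimiter="|")
    for name in image_names:
        features = run_inference(model, name)  # cheap per-image step
        writer.writerow([name] + list(features))


# Demo with stand-ins that record how often the "graph" gets built.
build_calls = []


def fake_create_graph():
    build_calls.append(1)
    return "model"


def fake_run_inference(model, name):
    return [len(name), 0.5]


buffer = io.StringIO()
extract_features(["a.jpg", "bb.jpg"], fake_create_graph,
                 fake_run_inference, buffer)
print(len(build_calls))
print(buffer.getvalue())
```

However many images are processed, the setup function runs once, which is what keeps the graph from growing with each image.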
A good explanation of why such errors occur is given here. I encountered the same error while using the tf dataset API and came to understand that when the data is iterated over in the session, it gets appended to the existing graph. So what I did was call tf.reset_default_graph() before the dataset iterator to make sure that the previous graph is cleared away.
Hope this helps in such a scenario.
