CRITICAL: tensorflow:Category has no images - validation - python

I'm trying to retrain the Inception v3 model in TensorFlow for my own custom categories. I have downloaded some data and formatted it into directories. When I run the Python script, it creates bottlenecks for the images, and then, on the first training step (step 0), it hits a critical error where it tries to modulo by 0. The error appears in the get_image_path function when computing mod_index, which is index % len(category_list), so category_list must have length 0, right?
Why is this happening and how can I prevent it?
EDIT: Here's the exact output I'm seeing inside Docker:
2016-07-04 01:27:52.005912: Step 0: Train accuracy = 40.0%
2016-07-04 01:27:52.006025: Step 0: Cross entropy = 1.109777
CRITICAL:tensorflow:Category has no images - validation.
Traceback (most recent call last):
File "tensorflow/examples/image_retraining/retrain.py", line 824, in <module>
tf.app.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 30, in run
sys.exit(main(sys.argv))
File "tensorflow/examples/image_retraining/retrain.py", line 794, in main
bottleneck_tensor))
File "tensorflow/examples/image_retraining/retrain.py", line 484, in get_random_cached_bottlenecks
bottleneck_tensor)
File "tensorflow/examples/image_retraining/retrain.py", line 392, in get_or_create_bottleneck
bottleneck_dir, category)
File "tensorflow/examples/image_retraining/retrain.py", line 281, in get_bottleneck_path
category) + '.txt'
File "tensorflow/examples/image_retraining/retrain.py", line 257, in get_image_path
mod_index = index % len(category_list)
ZeroDivisionError: integer division or modulo by zero

Fix:
The issue happens when you have too few images in any of your subfolders.
I faced the same issue when the total number of images under a particular category was less than 30; try increasing the image count to resolve it.
Reason:
For each label (subfolder), TensorFlow splits the images into 3 categories (training, testing and validation) and assigns each image based on a probability value (calculated from a hash of the image file name).
An image is placed in a category only if its probability value is less than that category's percentage (training, testing or validation).
Now, if the number of images under a label is small (say 25) and the validation percentage is 10 (the default), there is a good chance that no image's hash value falls below 10, and hence no image is placed in the validation set.
Later, when all bottlenecks have been created and tf tries to calculate the validation accuracy, it first emits a critical log message:
CRITICAL:tensorflow:Category has no images - validation.
and then continues to execute and crashes as it tries to divide by the validation list size (which is 0).
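For reference, this is roughly the splitting logic inside retrain.py's create_image_lists, paraphrased from memory as a self-contained sketch (not verbatim; the MAX_NUM_IMAGES_PER_CLASS value follows the script, exact details may differ between releases):

import hashlib
import re

MAX_NUM_IMAGES_PER_CLASS = 2 ** 27 - 1  # constant used by retrain.py

def split_category(file_name, validation_percentage, testing_percentage):
    # deterministically map a file name to a percentage in [0, 100)
    hash_name = re.sub(r'_nohash_.*$', '', file_name)
    hashed = hashlib.sha1(hash_name.encode('utf-8')).hexdigest()
    percentage_hash = ((int(hashed, 16) % (MAX_NUM_IMAGES_PER_CLASS + 1)) *
                       (100.0 / MAX_NUM_IMAGES_PER_CLASS))
    if percentage_hash < validation_percentage:
        return 'validation'
    elif percentage_hash < (testing_percentage + validation_percentage):
        return 'testing'
    return 'training'

With few images, it is entirely possible that no file name hashes below the validation threshold, leaving that list empty.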

I've modified retrain.py to ensure that there is at least one image in the validation set (line 201*):
if len(validation_images) == 0:
    validation_images.append(base_name)
elif percentage_hash < validation_percentage:
    validation_images.append(base_name)
(*) Line number may change in future releases. Look at the comments.

I had the same problem when running retrain.py: I had set the --model_dir argument incorrectly, so the inception directory got created inside the flower_photos directory.
Check whether there are any directories inside the flower_photos directory that contain no images.

This happens if you have too few images. Like Ashwin suggested, have at least 30 images.
The names of your folders also matter. Somehow your folder names can't contain an underscore (_).
E.g. these names didn't work: dettol_bottle, dettol_soap, dove_soap, lifebuoy_bottle
These names worked: dettolbottle, dettolsoap, dovesoap, lifebuoybottle

For me, this error was caused by having folders in the training directory that did not have images in them. I was following the same "Poets" tutorial and ended up putting directories with subdirectories in the image dir. Once I removed those and placed only directories with images directly in them (no sub dirs) the error no longer occurred and I was able to successfully train my model.

I was trying to train using my own set of images (pictures of dogs instead of flowers), and ran into this same problem.
I identified that the problem, for me, was that my folder names (category names) weren't present in the imagenet_synset_to_human_label_map.txt file that gets loaded with the Inception data we are modifying.
After changing the name of my image folder from bichon to poodle, it started working, since poodle is in the Inception map and bichon is not.

For me it was a "-" in my folder names. The moment I corrected it, the error vanished.

As Ashwin Patti answered, there is a possibility that the validation split has no images due to a lack of images in the original label directory.
This explanation is supported by the warning you get when you try to retrain with labels that have fewer than 20 images:
WARNING: Folder has less than 20 images, which may cause issues.

This error went away for me after adding >50 images to each category

I would also like to add my own experience:
Don't have spaces.
For me, it worked when all a folder name contained was a-z characters: no spaces, no symbols, no nothin'.
E.g. 'I'm a folder' is wrong, but 'imAFolder' would work.

As Matthieu said in the comments, the proposed solution:
# make sure none of the lists is empty, otherwise it will raise an error
# when validating / testing
if validation_percentage > 0 and not validation_images:
    validation_images.append(training_images.pop())
if testing_percentage > 0 and not testing_images:
    testing_images.append(training_images.pop())
works for me.
I'm wondering what the message "CRITICAL:tensorflow:Category has no images - validation" really means. Is it only related to the crash that was fixed, or could it also mean a loss of accuracy? I mean, if only a few images were used, would the results not be as expected?

I had this exact same problem. My folders were named correctly, however my files were named name_1.jpg, name_2.jpg. Removing the underscore fixed the issue.

Related

Splitting up `h5` file and combining the pieces back

I have an h5 file, which is basically model weights output by keras. For some storage requirements, I'd like to split up the large h5 file into smaller pieces, and combine them back into a single file when needed. However, the way I do it seems to miss some "metadata" (not sure, maybe it's missing a lot more, but judging by the size of the combined file and the original file, it seems that I'm not missing much).
Here's my splitting script:
prefix = "model_weights"
fname_src = "DiffusiveSizeFactorAI/model_weights.h5"
size_max = 90 * 1024**2 # maximum size allowed in bytes
is_file_open = False
dest_fnames = []
idx = 0
with h5py.File(fname_src, "r") as src:
for group in src:
fname = f"{prefix}_{idx}.h5"
if not is_file_open:
dest = h5py.File(fname, "w")
dest_fnames.append(fname)
is_file_open = True
group_id = dest.require_group(group)
src.copy(f"/{group}", group_id)
size = os.path.getsize(fname)
if size > size_max:
dest.close()
idx += 1
is_file_open = False
dest.close()
and here's the script that I use for combining back the pieces:
fname_combined = f"{prefix}_combined.h5"
with h5py.File(fname_combined, "w") as combined:
for fname in dest_fnames:
with h5py.File(fname, "r") as src:
for group in src:
group_id = combined.require_group(group)
src.copy(f"/{group}", group_id)
Just to add a little bit of context if it helps debugging my case, when I load the "combined" model weights, here's the error I'm getting:
ValueError: Layer count mismatch when loading weights from file. Model expected 108 layers, found 0 saved layers.
Note: the size of the original file and the combined one are about the same (they differ by less than 0.5%), which is why I think that I might be missing some metadata.
I am wondering if there is an alternative solution to your problem. I am assuming you want to deploy the model on an embedded system, which leads to memory restrictions. If that is the case, here are some alternatives:
Use TensorFlow Lite: it claims to significantly reduce the size of the model (I haven't really tested this). It also improves other important aspects of ML deployment on the edge. In summary, it can make the model up to 5x smaller.
Apply pruning: pruning gradually zeroes out model weights during the training process to achieve model sparsity. Sparse models are easier to compress, and the zeroes can be skipped during inference for latency improvements.
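If you go the TensorFlow Lite route, a minimal conversion sketch looks roughly like this (it assumes a TF 2.x Keras model object named model; the optimization flag enables post-training quantization):

import tensorflow as tf

# 'model' is an in-memory tf.keras.Model (assumption)
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # post-training quantization
tflite_model = converter.convert()

with open("model.tflite", "wb") as f:
    f.write(tflite_model)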
Based on an answer from the h5py developers, there are two issues:
Every time an h5 file is copied this way, a duplicate extra group level is added in the destination file. Let me explain:
Suppose in src.h5 I have the following structure: /A/B/C. In these two lines:
group_id = dest.require_group(group)
src.copy(f"/{group}", group_id)
group is /A, and so, after copying, an extra /A is added to dest.h5, which results in the following erroneous structure: /A/A/B/C. To fix that, one needs to explicitly pass name="A" as an argument to copy.
Metadata of the root level "/" is not copied by either the splitting or the combining script. To fix that, given that the h5 data structure is very similar to Python's dict, you just need to add:
dest.attrs.update(src.attrs)
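Putting both fixes together, a corrected combining loop might look like this sketch (it reuses prefix and dest_fnames from the splitting script above; this is based on the two fixes, not the exact code from the linked repo):

import h5py

fname_combined = f"{prefix}_combined.h5"
with h5py.File(fname_combined, "w") as combined:
    for fname in dest_fnames:
        with h5py.File(fname, "r") as src:
            for group in src:
                # copy into the destination root with an explicit name,
                # so no duplicate /A/A level is created
                src.copy(src[group], combined, name=group)
            # copy the root-level metadata as well
            combined.attrs.update(src.attrs)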
For personal use, I've written two helper functions: one that splits up a large h5 file into smaller parts, each not exceeding a size specified by the user, and another that combines them back into a single h5 file. In case you find it useful, it can be found on GitHub here.

Python: Keras - executing 'next(trainBatch)', divide by zero error appear while i've 36 pics in train folder

UPDATE 2
Finally, the problem is solved.
ImageDataGenerator().flow_from_directory() expects your data in one subdirectory per class. For example, in my case my data should be in "data/train/cotton/" (a cotton directory) and my path variable should look like trainPath = "data/train". It expects you to put each class's data in its own respective directory.
For more detail, visit the first answer to this question.
UPDATE
Haven't got a solution yet. Previously I was providing the data path as data/train, but actually it should be data/train/, so I'm changing it as mentioned here in the question below.
Question
I'm training a Keras image-processing model on custom data. I'm getting help from YouTube; the first tutorial just loads images, makes batches using ImageDataGenerator().flow_from_directory() and plots images with the label specified.
My code is:
trainPath = "data/train/"
trainBatch = ImageDataGenerator().flow_from_directory(directory=trainPath, target_size=(224,224), classes=["cotton"], batch_size=10)
imgs, lables = next(trainBatch)
When I execute the last line, it gives this error:
imgs, lables = next(trainBatch)
Traceback (most recent call last):
  File "C:\Users\Haier\AppData\Local\Programs\Python\Python36\lib\site-packages\IPython\core\interactiveshell.py", line 3296, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-20-4b8b727474cb>", line 1, in <module>
    imgs, lables = next(trainBatch)
  File "C:\Users\Haier\AppData\Local\Programs\Python\Python36\lib\site-packages\keras_preprocessing\image\iterator.py", line 100, in __next__
    return self.next(*args, **kwargs)
  File "C:\Users\Haier\AppData\Local\Programs\Python\Python36\lib\site-packages\keras_preprocessing\image\iterator.py", line 109, in next
    index_array = next(self.index_generator)
  File "C:\Users\Haier\AppData\Local\Programs\Python\Python36\lib\site-packages\keras_preprocessing\image\iterator.py", line 85, in _flow_index
    current_index = (self.batch_index * self.batch_size) % self.n
ZeroDivisionError: integer division or modulo by zero
I got some understanding of the error from a Stack Overflow question, where the asker stated that this error arises when you have an empty data folder. But I have 36 images in my data folder.
What I think is that it is not reaching or finding my data/train folder. What else should I have done?
Your assistance would be a great help for me.
Try using an absolute path, try printing the contents of the folder using the os module, and try printing the shape of the batch from trainBatch.
Following the above steps, you can work out whether the code is able to read all the files from the folder.
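A minimal sketch of those checks, assuming the directory layout from the question (data/train/cotton/ holding the 36 images):

import os
from keras.preprocessing.image import ImageDataGenerator

trainPath = os.path.abspath("data/train")
print(os.listdir(trainPath))                               # expect one folder per class, e.g. ['cotton']
print(len(os.listdir(os.path.join(trainPath, "cotton"))))  # expect 36

trainBatch = ImageDataGenerator().flow_from_directory(
    directory=trainPath, target_size=(224, 224),
    classes=["cotton"], batch_size=10)
print(trainBatch.n)  # number of images found; 0 here explains the ZeroDivisionError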

Who is Jenkins? Error /Users/jenkins/miniconda...

This might be a noob question...
I'm following this tutorial on Emotion Recognition With Python, OpenCV and a Face Dataset
When I run the training code I get the following error:
OpenCV Error: Bad argument (Wrong input image size. Reason: Training and Test images must be of equal size! Expected an image with 122500 elements, but got 4.) in predict, file /Users/jenkins/miniconda/1/x64/conda-bld/conda_1486587097465/work/opencv-3.1.0/build/opencv_contrib/modules/face/src/fisher_faces.cpp, line 132
Traceback (most recent call last):
  File "trainModel.py", line 64, in <module>
    correct = run_recognizer()
  File "trainModel.py", line 52, in run_recognizer
    pred, conf = fishface.predict(image)
cv2.error: /Users/jenkins/miniconda/1/x64/conda-bld/conda_1486587097465/work/opencv-3.1.0/build/opencv_contrib/modules/face/src/fisher_faces.cpp:132: error: (-5) Wrong input image size. Reason: Training and Test images must be of equal size! Expected an image with 122500 elements, but got 4. in function predict
It is complaining about the image size not being 350×350=122500 although all the images in my dataset folder are the correct size 350x350px.
And my user name is not ‘jenkins’ as it says in /Users/jenkins/miniconda… not sure where it comes from or how to replace it with my correct path to fisher_faces.cpp
Thanks for your help!
Don't worry about that path. The OpenCV library you are using was built on someone else's machine, and the error messages got paths from that machine baked in. It's just trying to tell you which OpenCV source file the error is occurring in, namely this one.
(In this case, Jenkins is a popular build bot.)
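As for the size complaint itself, a quick sanity check can confirm that every image OpenCV actually loads is 350x350 before calling predict. This is a hedged sketch; the dataset path and the grayscale assumption are mine, not from the question:

import os
import cv2

dataset_dir = "dataset"  # hypothetical path; use your own image folder
for fname in os.listdir(dataset_dir):
    img = cv2.imread(os.path.join(dataset_dir, fname), cv2.IMREAD_GRAYSCALE)
    if img is None:
        print(fname, "failed to load")        # non-image file or bad path
    elif img.shape != (350, 350):
        print(fname, "has shape", img.shape)  # would break fishface.predict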

Load / restore models into tensorflow at specific iteration or checkpoint

I have a model which I am saving every 10 iterations, so I have the following files in my save directory:
checkpoint model-50.data-00000-of-00001 model-50.index model-50.meta
model-60.data-00000-of-00001 model-60.index model-60.meta
and so on up to 100. I need to load only model-50, because I got NaN values after 70 iterations. By default, when restoring, the saver will look for the final checkpoint. So how can I specifically load model-50? Please help; otherwise I'll have to run the model again from scratch, which is time-consuming.
Since you are using tf.train.Saver's restore() function, you can make use of the last_checkpoints attribute to get a list of all available checkpoints. You will see both model-50 and model-60 in this list.
Pick the correct model and pass its path directly to restore(), like this:
saver.restore(sess, ckpt_path)
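In full, restoring the iteration-50 checkpoint might look like this sketch (it assumes the graph that produced the checkpoint has already been rebuilt in the current process, and that the files live in a saved_dir folder as in the question):

import tensorflow as tf

# ... rebuild the model graph here, exactly as it was when saved ...

saver = tf.train.Saver()
with tf.Session() as sess:
    # pass the checkpoint prefix, without the .meta/.index/.data suffixes
    saver.restore(sess, "saved_dir/model-50")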
I'm not sure if things were different in the past, but at least as of now, you can use tf.train.get_checkpoint_state() to get a CheckpointState proto, which contains all_model_checkpoint_paths.
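For example (a sketch; the directory name is an assumption, and all_model_checkpoint_paths only lists checkpoints still recorded in the checkpoint file):

import tensorflow as tf

ckpt = tf.train.get_checkpoint_state("saved_dir")
print(ckpt.model_checkpoint_path)         # the latest checkpoint
for path in ckpt.all_model_checkpoint_paths:
    print(path)                           # every recorded checkpoint, e.g. .../model-50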
When you execute the command shown in most tutorials about saving/restoring a model, saver.restore(sess, tf.train.latest_checkpoint(_dir_models)), the second parameter you are passing is just a string with the model path. This is defined in the saver.restore documentation:
save_path: Path where parameters were previously saved.
So you can pass any string there; latest_checkpoint is just a convenience function that extracts this path from the checkpoint file. Open this file in a text editor and you will see all the model paths available and which is the latest.
You can substitute that path with any path you want. You can get it from that file (either by opening it manually or by using get_checkpoint_state, which will do it for you programmatically).

overcome Graphdef cannot be larger than 2GB in tensorflow

I am using TensorFlow's ImageNet-trained model to extract the last pooling layer's features as representation vectors for a new dataset of images.
The model as is predicts on a new image as follows:
python classify_image.py --image_file new_image.jpeg
I edited the main function so that I can take a folder of images and return the prediction on all images at once and write the feature vectors in a csv file. Here is how I did that:
def main(_):
    maybe_download_and_extract()
    # image = (FLAGS.image_file if FLAGS.image_file else
    #          os.path.join(FLAGS.model_dir, 'cropped_panda.jpg'))
    # edit to take a directory of image files instead of a single file
    if FLAGS.data_folder:
        images_folder = FLAGS.data_folder
        list_of_images = os.listdir(images_folder)
    else:
        raise ValueError("Please specify image folder")

    with open("feature_data.csv", "wb") as f:
        feature_writer = csv.writer(f, delimiter='|')
        for image in list_of_images:
            print(image)
            current_features = run_inference_on_image(images_folder + "/" + image)
            feature_writer.writerow([image] + current_features)
It worked just fine for around 21 images but then crashed with the following error:
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1912, in as_graph_def
raise ValueError("GraphDef cannot be larger than 2GB.")
ValueError: GraphDef cannot be larger than 2GB.
I thought that by calling run_inference_on_image(images_folder + "/" + image) the previous image data would be overwritten and only the new image data considered, which doesn't seem to be the case. How can I resolve this issue?
The problem here is that each call to run_inference_on_image() adds nodes to the same graph, which eventually exceeds the maximum size. There are at least two ways to fix this:
The easy but slow way is to use a different default graph for each call to run_inference_on_image():
for image in list_of_images:
    # ...
    with tf.Graph().as_default():
        current_features = run_inference_on_image(images_folder + "/" + image)
    # ...
The more involved but more efficient way is to modify run_inference_on_image() to run on multiple images. Relocate your for loop to surround this sess.run() call, and you will no longer have to reconstruct the entire model on each call, which should make processing each image much faster.
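A rough sketch of that approach, building the graph once and feeding each image through the same session (the tensor names follow classify_image.py, but treat the details as assumptions that may vary between releases):

create_graph()  # build the Inception graph once, outside the loop
with tf.Session() as sess:
    # 'pool_3:0' is the last pooling layer in classify_image.py's graph
    pool3_tensor = sess.graph.get_tensor_by_name('pool_3:0')
    for image in list_of_images:
        image_data = tf.gfile.FastGFile(images_folder + "/" + image, 'rb').read()
        features = sess.run(pool3_tensor,
                            {'DecodeJpeg/contents:0': image_data})
        feature_writer.writerow([image] + list(features.flatten()))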
You can move create_graph() to somewhere before the loop for image in list_of_images: (which loops over files).
This performs inference multiple times on the same graph.
The simplest way is to put create_graph() at the start of the main function.
Then it creates the graph only once.
A good explanation of why such errors occur is mentioned here. I encountered the same error while using the tf dataset API, and came to the understanding that as data is iterated over in the session, nodes get appended to the existing graph. So what I did was use tf.reset_default_graph() before creating the dataset iterator, to make sure the previous graph is cleared away.
Hope this helps in such a scenario.
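As a sketch of that last suggestion, assuming a TF 1.x input pipeline (the filenames variable is hypothetical):

import tensorflow as tf

tf.reset_default_graph()  # drop nodes accumulated on the default graph

filenames = ["a.jpg", "b.jpg"]  # placeholder input list
dataset = tf.data.Dataset.from_tensor_slices(filenames)
iterator = dataset.make_one_shot_iterator()
next_element = iterator.get_next()  # created once, reused across sess.run() calls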
