I am trying to use the flow_from_dataframe method of Keras to read training and testing images.
Both my training and testing images are in same directory, and I read the paths from two different csv files.
My code for reading test images looks like,
# Read test file
testdf = pd.read_csv("test.csv")
# load images
test_datagen = ImageDataGenerator(rescale=1./255)
test_generator = test_datagen.flow_from_dataframe(
dataframe=testdf, directory=IMAGE_PATH,
x_col='image_name', y_col=None,
has_ext=True, target_size=(10,10)
,batch_size=32,color_mode='rgb',shuffle=False, class_mode=None)
I get output like this
Found 0 images.
While the similar code for reading training data works properly. I checked if the images exist at the given path, which they do. What are some possible reasons for this error? How can I try to debug the issue?
EDIT: This is a regression task, so all images are in a single directory, and not in subdirectories, as would be expected for a classification task.
EDIT 2: I added usecols=[0] to read_csv, and now test_datagen finds all the images in the directory, and not just the one's that are mentioned in the test.csv file
The issue happens due to NaN's in the dataframe. Ignoring those columns doesn't work. The solution is to replace the NaN's with something else. For example,
testdf = pd.read_csv("test.csv")
testdf.fillna(0, inplace=True)
This replaces the NaN's with 0. Then using ImageDataGenerator as usual works.
I was also facing the same error and found a solution for this.
I was using the absolute path, was using correct DataFrame and everything was fine still the code was throwing an error - "image not found".
I inspected and found that my dataframe was containing image names without extension and the images in the folder was having extension also.
E.g. The image name in DataFrame was 'abc' but the image in the folder was having a name 'abc.png'.
Just add .png in the image names in DataFrame and it will solve your problem.
I just tried below code and it worked out..!!!!
def append_ext(fn):
return fn+".png"
train_valid_data["id_code"]=train_valid_data["id_code"].apply(append_ext)
test_data["id_code"]=test_data["id_code"].apply(append_ext)
Let me know if it solves your problem or if you need any further explanation.
I have the same problem. First, make sure you got the absolute path correctly for the parameter directory.
The filename in my df has value image.pgm.png and the actual image file in the folder has the format image.pgm.
I tried to change the filename in df to image.pgm => Still not working
I renamed the image file from image.pgm to image.pgm.png which matches exactly the format in the df => Worked!
I had the same error,
What I found is that I missed the directory path, and the image extension that was not in the data frame,
So make sure that your directory path is correct and an extension to your image, as you can do the following:
def extention_train_data(x):
return x+".jpg"
change the jpg extension if you have an other one.
then you apply this to you data frame:
train_data['image'] = train_data['image_id'].apply(extention_train_data)
once you have the image column containing your image with its extension then
train_generator = datagen.flow_from_dataframe(
train_data,
directory="/kaggle/input/plant-pathology-2020-fgvc7/images/",
x_col = "image",
y_col = "label",
target_size = size,
class_mode = "binary",
batch_size = batch_size,
subset="training",
shuffle = True,
seed = 42,
)
Okay, so I have been having the same issues. Where my data labels were in a csv file , and the image data in a separate folder.I thought, the issue was being caused by the labels and the images in the folder not aligning properly.Did a whole bunch of stuff to rectify and process the data. It was not the problem.
So, anyone who's having issues.
I tried #Oussama Ouardini's answer and it worked. Thank you!
I am also going to add - that if you are doing a train and validation split to make sure the initial ImageDataGenerator object you create has the validation split specified.
def extension_train_data(x):
return "xc"+str(x)+".png"
train_df['file_id'] = train_df['file_id'].apply(extension_train_data)
Here is my code -
datagen=ImageDataGenerator(rescale=1./255,validation_split=0.2)
#rescale all pixel values from 0-255, so after this step all our
#pixel values are in range (0,1)
train_generator=datagen.flow_from_dataframe(dataframe=train_df,directory='./img_data/', x_col="file_id", y_col="english_cname",
class_mode="categorical",save_to_dir='./new folder/',
target_size=(64,64),subset="training",
seed=42,batch_size=32,shuffle=False)
val_generator=datagen.flow_from_dataframe(dataframe=train_df,directory='./img_d
ata/', x_col="file_id", y_col="english_cname",
class_mode="categorical",
target_size=(64,64),subset="validation",
seed=42,batch_size=32,shuffle=False)
print("\n Sanity check Line.--------")
My output was a succesfully validated image files. :)
Found 212 validated image filenames belonging to 88 classes.
Found 52 validated image filenames belonging to 88 classes.
Sanity check Line.----------
I hope someone will find this useful. Cheers!
Related
The following code is part of my code for a tf graph to read images. When I use this code to iterate through the data, the program gets stuck in tf.io.read_file(path) after a few hundred images forever and doesn't do anything. More specifically, the code even can't be paused and I had to restart the session every time.
#tf.function()
def read_image(path):
image = tf.io.read_file(path)
image = tf.image.decode_jpeg(image)
return image
...
div8k_list=[os.path.join(div8k_save_path, x) for x in os.listdir(div8k_save_path)]
train_path = tf.data.Dataset.from_tensor_slices(div8k_list)
train_images = train_path.map(read_image, num_parallel_calls=tf.data.AUTOTUNE)
I first suspected that there were a few corrupted images or wrong paths in the data that were causing this problem and tested the following code.
for path in train_path:
print(path)
image = tf.io.read_file(path)
image = tf.image.decode_jpeg(image)
Surprisingly, there was no common characteristic of the image path the loop was stuck. And it was not a problem of the image because the loop was once stuck at 1056.png but when I explicitly loaded 1056.png, there was no problem.
What could be the cause of this problem?
edit: to summarize, the program is stuck at read_image forever, while I couldn't find a problem in the dataset.
My dataset is the DIV8K dataset and I am running in COLAB.
EDIT The function that is slowing my code is decode_jpeg, because the following definition of read_image worked multiple times.
#tf.function()
def read_image(path):
image = tf.io.read_file(path)
image = tf.image.decode_jpeg(image)
return image
As mentioned in the comment, try the following function to decode the image file as it can handle mixed extension file format (jpg, png etc), ref.
tf.io.decode_image(image, expand_animations = False)
However, decode_jpeg should also able to handle the .png file format now. Without the image file, it's hard to break down what's causing to prevent this. Most probably the file is somehow corrupted or not the valid extension for the decode_jpeg, though it's named that way, check this solution.
I want to train a GAN and generate images of pokemon. I scraped around 10000 images from the internet which are locally saved. My folder is structured like so:
all_data:
- train:
-bulbasaur.png
-45.png
-....png
- test:
-bulbasaur.png
-45.png
-....png
- validation:
-bulbasaur.png
-45.png
-....png
I tried to load it via:
builder = tfds.ImageFolder(os.path.join(os.getcwd(), "all_data"))
print(builder.info) # num examples, labels... are automatically calculated
ds = builder.as_dataset(split='train', shuffle_files=True)
tfds.show_examples(ds, builder.info)
but I get the error of:
ValueError: Unrecognized split test. Subsplit API not yet supported for ImageFolder. Split name should be one of []. Is there a Problem with how I structured the dataset? As you can tell from the code snippet the different files all have completely varying names (either their English name or their Pokedex number) is that a problem? Since I do not want to classify anything I thought the labeling is not really important.
Also if it helps the splits from the output I get for the builder Info is empty.
tfds.core.DatasetInfo(
....
supervised_keys=('image', 'label'),
splits={
},...
)
Thanks a lot in advance!
Your folder structure should be like;
/content/image_dir/
train/
cat/
cat_1.png
cat_2.png
cat_3.png
dog/
dog_1.png
dog_2.png
dog_3.png
test/
cat.png
dog.png
Below code works with this structured directory
import tensorflow as tf
import tensorflow_datasets as tfds
builder = tfds.ImageFolder('/content/image_dir/')
print(builder.info) # num examples, labels... are automatically calculated
ds = builder.as_dataset(split='train', shuffle_files=True)
tfds.show_examples(ds, builder.info)
my studies project is to develop a neural network to recognize text on license plates. Therefore, I found the ReId-dataset at https://medusa.fit.vutbr.cz/traffic/research-topics/general-traffic-analysis/holistic-recognition-of-low-quality-license-plates-by-cnn-using-track-annotated-data-iwt4s-avss-2017/. This dataset contains a bunch of images of number plates as well as the text of the license plates and was used by Spanhel et al. for a similar approach as the one I have in mind.
Example of a license plate there:
In the project I want to recognize only the license plate text, i.e. only "9B5 2145" and not the country acronym "CZ" and no advertisement text.
I downloaded the dataset and the labels csv-file to my local memory. So, I have the following folder structure: One mother directory for my whole project. This mother directory includes my data directory, where I stored the ReId dataset. This dataset includes several subdirectories, 4 directories with training data and 4 with test data, all of this subdirectories contain a number of images of license plates. The ReId dataset also contains the trainVal csv-file which is structured as follows (snippet of the actual sheet):
track_id is equal to the subdirectory of the ReID dataset.
image_path is equal to the path to the image, in this case the image's name is 1_1.
lp is the label of the license plate, so the actual license plate.
train is a dummy variable, equal to one, if the image is used for training purposes and 0 for validation purposes.
Regarding this dataset, I got three main questions:
How do I read in this images properly? I tried to use something like this
from keras.preprocessing.image import ImageDataGenerator
# create generator
datagen = ImageDataGenerator()
# prepare an iterators for each dataset
train_it = datagen.flow_from_directory('data/train/', class_mode='binary')
val_it = datagen.flow_from_directory('data/validation/', class_mode='binary')
test_it = datagen.flow_from_directory('data/test/', class_mode='binary')
# confirm the iterator works
batchX, batchy = train_it.next()
print('Batch shape=%s, min=%.3f, max=%.3f' % (batchX.shape, batchX.min(), batchX.max()))
But obviously Python did not find images belonging to any classes (side note: I used the correct paths). That is clear to me, because I did not assign any class to my data yet. So, my first question is: Do I have to do that? I don't think so.
How do I then read this images properly? I think, I have to get numpy arrays to work properly with this data.
How do I bring my images and the labels together? In my opinion, I think I have to merge the two datasets, don't I?
Thank you very much!
Question 1 and 2:
For reading the images, imread from matplotlib.pyplot can be used as
shown in the example, this does not require any classes to be set.
Question 3:
The labels and images can be brought together by storing the corresponding license plate number in an output array (y in the example) for each image (stored in the xs array in the example) in the data array. You don't necessarily need to merge them.
Hope I helped!
import os
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
xs, y = [], []
main_dir = './sample/dataset' # the main directory
label_data = pd.read_csv('labels.csv')
for folder in os.listdir(main_dir):
for img in os.listdir(os.path.join(main, folder)):
arr = plt.imread(os.path.join(main, folder) + img)
xs.append(arr)
y.append(label_data[label_data['image_path'] == os.path.join(folder, img)]['lp'])
#^ this part can be changed depending on the exact format of your label data file.
# then you can convert them into numpy arrays and reshape them as you need.
xs = np.array(xs)
y = np.array(y)
I'm based on Window 10, Jupyter Notebook, Pytorch 1.0, Python 3.6.x currently.
At first I confirm to the correct path of files using this code : print(os.listdir('./Dataset/images/')).
and I could check that this path is correct.
but I met Error :
RuntimeError: Found 0 files in subfolders of: ./Dataset/images/
Supported extensions are: .jpg,.jpeg,.png,.ppm,.bmp,.pgm,.tif"
What is the matter?
Could you suggest a solution?
I tried to ./dataset/1/images like this method. but the result was same....
img_dir = './Dataset/images/'
img_data = torchvision.datasets.ImageFolder(os.path.join(img_dir), transforms.Compose([
transforms.Scale(256),
transforms.RandomResizedCrop(224),
transforms.RandomHorizontalFlip(),
transforms.ToTensor(),
]))
img_batch = data.DataLoader(img_data, batch_size=batch_size,
shuffle = True, drop_last=True)
I met the same problem when using celebA, including 200,000 images. As we can see there are many images. But in a small sample situation (I tried 20 images), I checked, the error will not be raised, which means we can read images successfully.
But when the number grows, we should use other methods.
I solved the problem according to this website. Thanks to QimingChen
Github solution
Simply, adding another folder named 1 (/train/--->train/1/) in the original folder will enable our program to work, without changing the path. That's because when facing large datasets, images should be sorted in subfolders of different classes.
The Original answer on Github:
Let's say I am going to use ImageFolder("/train/") to read jpg files in folder train.
The file structure is
/train/
-- 1.jpg
-- 2.jpg
-- 3.jpg
I failed to load them, leading to errors:
RuntimeError: Found 0 images in subfolders of: ./data
Supported image extensions are: .jpg,.JPG,.jpeg,.JPEG,.png,.PNG,.ppm,.PPM,.bmp,.BMP
I read the solution above and tried tens of times. When I changed the structure to
/train/1/
-- 1.jpg
-- 2.jpg
-- 3.jpg
But the read in code is still -- ImageFolder("/train/"), IT WORKS.
It seems like the program tends to recursively read in files, that is convenient in some cases.
Hope this would help!!
Can you post the structure of your files? In your case, it is supposed to be:
img_dir
|_class1
|_a.jpg
|_b.jpg
|_class2
|_a.jpg
|_b.jpg
...
According to the rules of the DataLoader in pytorch you should choose the the superior path of the image path. That means if your images locate in './Dataset/images/', the path of the data loader should be './Dataset' instead. I hope it can fix your bug.:)
You can modify the ImageFolder class to get to the root folder directly (without subfolders):
class ImageFolder(Dataset):
def __init__(self, root, transform=None):
#Call make_dataset to collect files.
self.samples = make_dataset(opt.dataroot)
self.imgs = self.samples
self.transformA = transformA
...
We call the make_dataset method to collect our files:
def make_dataset(dir):
import os
images = []
d = os.path.expanduser(dir)
if not os.path.exists(dir):
print('path does not exist')
for root, _, fnames in sorted(os.walk(d)):
for fname in sorted(fnames):
path = os.path.join(root, fname)
images.append(path)
return images
All the action takes place in the loop containing os.walk. Here, the files are collected from the 'root' directory, which we specify as the directory containing our files.
See the documentation of ImageFolder dataset to see how this dataset class expects the images to be organized into subfolders under `./Dataset/images' according to image classes. Make sure your images adhere to this order.
Apparently, the solution is just making the picture name alpha-numeric. They may be another solution but this work.
I am using image generator for keras like this:
val_generator = datagen.flow_from_directory(
path+'/valid',
target_size=(224, 224),
batch_size=batch_size,)
x,y = val_generator.next()
for i in range(0,1):
image = x[i]
plt.imshow(image.transpose(2,1,0))
plt.show()
This shows wrong colors:
I have two questions.
How to fix the problem
How to get file names of the files (so that I can read it myself from something like matplotlib)
Edit : this is what my datagen looks like
datagen = ImageDataGenerator(
rotation_range=3,
# featurewise_std_normalization=True,
fill_mode='nearest',
width_shift_range=0.2,
height_shift_range=0.2,
horizontal_flip=True
)
Edit 2 :
After following Marcin's answer :
image = 255 - image
I get normal colors , but there are still some weird colors:
The dtype of your image array is 'float32', just convert it into 'uint8':
plt.imshow(image.astype('uint8'))
I had the same problem as OP and solved it by rescaling the pixels from 0-255 to 0-1.
Keras' ImageDataGenerator takes a 'rescale' parameter, which I set to (1/255). This produced images with expected colors
image_gen = ImageDataGenerator(rescale=(1/255))
There are at least three ways to have this twisted colors. So:
one option is that you need to switch a color ordering like in this question.
second is that you might have your pictures made to be a negative (every channels gets transformed by 255 - x transformation) this sometimes happens when it comes to using some GIS libraries.
you could also use a score/255 transformation.
You need to check which options happens in your case.
In order to get the images on your own I usually use (when your folder has a format suitable for a Keras flow_from_directory) I usually use the mix of os.listdir and os.path.join by :
list_of_labels = os.listdir(path_to_dir_with_label_dirs)
for label in list_of_labels:
current_label_dir_path = os.path.join(path_to_dir_with_label_dirs, label
list_of_images = os.listdir(current_label_dir_path)
for image in list_of_images:
current_image_path = os.path.join(current_label_dir_path, image)
image = open(current_image_path) # use the function which you want.
The color problem is rather strange.
I'll try to reproduce it once I have access to my linux machine.
For the filename part of the question, I would like to propose a small change to the Keras sourcecode:
You might want to take a look at this file:
https://github.com/fchollet/keras/blob/master/keras/preprocessing/image.py
It contains the image preprocessing routines.
Look at line 820, the next() function of the DirectoryIterator: this is called to get new images from the directory.
Inside of that function, look at line 838, if save_to_dir has been set to a path, the generator will output the augmented images to this path, for debugging purposes.
The name of the augmented image is a mixture of an index and a hash. Not useful for you.
But you can change the code quite easily:
filenames=[] #<-------------------------------------------- new code
for i, j in enumerate(index_array):
fname = self.filenames[j]
img = load_img(os.path.join(self.directory, fname),
grayscale=grayscale,
target_size=self.target_size)
x = img_to_array(img, dim_ordering=self.dim_ordering)
x = self.image_data_generator.random_transform(x)
x = self.image_data_generator.standardize(x)
filenames.append(fname) # <-----------------------------store the used image's name
batch_x[i] = x
# optionally save augmented images to disk for debugging purposes
if self.save_to_dir:
for i in range(current_batch_size):
img = array_to_img(batch_x[i], self.dim_ordering, scale=True)
#fname = '{prefix}_{index}_{hash}.{format}'.format(prefix=self.save_prefix,
# index=current_index + i,
# hash=np.random.randint(1e4),
# format=self.save_format)
fname=filenames[i] # <------------------------------ use the stored code instead
img.save(os.path.join(self.save_to_dir, fname))
Now the augmented image is saved with the original filename.
This should allow you to save the images under their original filenames.
Ok, how do you actually inject this into the Keras souce ?
Do it like this:
clone Keras: git clone https://github.com/fchollet/keras
go to the sourcefile I linked above. Make the change.
Trick your python code to import the changed code instead of the version installed by pip.
.
# this is the path to the cloned repository
# if you cloned it next to your script
# then just use keras/
# if it's one folder above
# then use ../keras/
sys.path.insert(0, os.getcwd() + "/path/to/keras/")
import keras
Now the DirectoryIterator is your patched version.
I hope that this works, I'm currently on windows. My python stack is only on the linux machine. There might be a small syntax error.
from skimage import io
def imshow(image_RGB):
io.imshow(image_RGB)
io.show()
x,y = train_generator.next()
for i in range(0,11):
image = x[i]
imshow(image)
It works for me.
Just a bit of advice if you are using test_batches=Imagedatagenerator().flow from directory. If you use this to feed a predict generator make sure you set shuffle=false to maintain a correlation between the file and the associated prediction. If you have files numerically labelled in the directory for example as 1.jpg, 2.jpg etc. The images are not fetched as you might think. They are fetched in the order:
1.jpg, 10.jpg, 2.jpg, 20.jpg etc. This makes it hard to match a prediction to a specific file. You can get around this by using 0's padding for example 01.jpg, 02.jpg etc. On the second part of the question "how can I get the files the generator produces you can get these files as follows:
for file in datagen.filenames:
file_names.append(file)