Train/validation data split - labels available but no classes - Python

My study project is to develop a neural network that recognizes text on license plates. For this, I found the ReId dataset at https://medusa.fit.vutbr.cz/traffic/research-topics/general-traffic-analysis/holistic-recognition-of-low-quality-license-plates-by-cnn-using-track-annotated-data-iwt4s-avss-2017/. This dataset contains a large number of images of license plates as well as the text of each plate, and was used by Spanhel et al. for an approach similar to the one I have in mind.
Example of a license plate there:
In the project I want to recognize only the license plate text, i.e. only "9B5 2145", and not the country acronym "CZ" or any advertisement text.
I downloaded the dataset and the labels CSV file to my local machine. My folder structure is as follows: one parent directory for the whole project, which contains my data directory, where I stored the ReId dataset. The dataset includes several subdirectories, 4 with training data and 4 with test data; all of these subdirectories contain a number of license plate images. The ReId dataset also contains the trainVal CSV file, which is structured as follows (snippet of the actual sheet):
track_id corresponds to the subdirectory of the ReId dataset.
image_path is the path to the image; in this case the image's name is 1_1.
lp is the label of the license plate, i.e. the actual license plate text.
train is a dummy variable, equal to 1 if the image is used for training and 0 if it is used for validation.
Regarding this dataset, I have three main questions:
How do I read in these images properly? I tried something like this
from keras.preprocessing.image import ImageDataGenerator
# create generator
datagen = ImageDataGenerator()
# prepare an iterators for each dataset
train_it = datagen.flow_from_directory('data/train/', class_mode='binary')
val_it = datagen.flow_from_directory('data/validation/', class_mode='binary')
test_it = datagen.flow_from_directory('data/test/', class_mode='binary')
# confirm the iterator works
batchX, batchy = train_it.next()
print('Batch shape=%s, min=%.3f, max=%.3f' % (batchX.shape, batchX.min(), batchX.max()))
But obviously the generator did not find images belonging to any classes (side note: I used the correct paths). That is clear to me, because I have not assigned any class to my data yet. So, my first question is: do I have to do that? I don't think so.
How do I then read these images in properly? I think I need to end up with NumPy arrays to work with this data.
How do I bring my images and the labels together? I assume I have to merge the two datasets, don't I?
Thank you very much!

Questions 1 and 2:
For reading the images, imread from matplotlib.pyplot can be used as shown in the example; this does not require any classes to be set.
Question 3:
The labels and images can be brought together by storing, for each image (appended to the xs list in the example), the corresponding license plate number in an output list (y in the example). You don't necessarily need to merge the two datasets.
Hope I helped!
import os
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

xs, y = [], []
main_dir = './sample/dataset'  # the main directory
label_data = pd.read_csv('labels.csv')

for folder in os.listdir(main_dir):
    for img in os.listdir(os.path.join(main_dir, folder)):
        arr = plt.imread(os.path.join(main_dir, folder, img))
        xs.append(arr)
        y.append(label_data[label_data['image_path'] == os.path.join(folder, img)]['lp'])
        # ^ this part can be changed depending on the exact format of your label data file.

# then you can convert them into numpy arrays and reshape them as you need.
xs = np.array(xs)
y = np.array(y)
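As an alternative (not part of the answer above), since the labels already live in the trainVal CSV, Keras' flow_from_dataframe can pair images and labels without building the arrays by hand. A minimal sketch, assuming the CSV sits at data/trainVal.csv, the images live under data/, and the columns are named image_path, lp and train as described in the question; adjust paths and names to your layout:

import pandas as pd
from keras.preprocessing.image import ImageDataGenerator

df = pd.read_csv('data/trainVal.csv')   # assumed CSV location
train_df = df[df['train'] == 1]         # train == 1 -> training images
val_df = df[df['train'] == 0]           # train == 0 -> validation images

datagen = ImageDataGenerator(rescale=1./255)

# class_mode=None yields only the images; the plate strings in 'lp' still
# have to be encoded separately (e.g. one-hot per character) before training
train_it = datagen.flow_from_dataframe(
    dataframe=train_df, directory='data/',
    x_col='image_path', y_col=None,
    class_mode=None, target_size=(64, 128), batch_size=32)
val_it = datagen.flow_from_dataframe(
    dataframe=val_df, directory='data/',
    x_col='image_path', y_col=None,
    class_mode=None, target_size=(64, 128), batch_size=32)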

Related

How to save/extract dataset from hdf5 and convert into TIFF?

I am trying to import CT scan data into ImageJ/Fiji (there is an HDF5 plugin for ImageJ/Fiji, but the synchrotron CT datasets are so large that it fails to open them). The scan data (image dataset) is saved as a dataset inside the HDF5 file. So I have to extract the image dataset from the HDF5 file and then convert it into TIFF files.
The HDF5 file path is "F:/New_ESRF/SNT_BTO4/SNT_BTO4_S1/SNT_BTO4_S1_1_1pag_db0005_vol.hdf5"
Herein, 'SNT_BTO4_S1_1_1pag_db0005_vol.hdf5' is divided into several datasets, and the image dataset is at /entry0000/reconstruction/results/data
At the moment, I have accessed the image dataset using h5py. However, I am now stuck on extracting/saving that dataset separately from the HDF5 file.
Which code is required to extract the image dataset from the hdf5 file?
After that, I am thinking of using PIL's Image module to convert the images into TIFF files. Can I get any advice on the code for this?
import numpy as np
import h5py

filename = "F:/New_ESRF/SNT_BTO4/SNT_BTO4_S1/SNT_BTO4_S1_1_1pag_db0005_vol.hdf5"

with h5py.File(filename, 'r') as hdf:
    base_items = list(hdf.items())
    print('#Items in the base directory:', base_items)
    # entry0000
    G1 = hdf.get('entry0000')
    G1_items = list(G1.items())
    print('#Items in entry0000', G1_items)
    # reconstruction
    G11 = G1.get('/entry0000/reconstruction')
    G11_items = list(G11.items())
    print('#Items in reconstruction', G11_items)
    # results_data
    G12 = G11.get('/entry0000/reconstruction/results')
    G12_items = list(G12.items())
    print('#Items in results', G12_items)
Extracting image data from an HDF5 file and converting it to an image is a relatively straightforward two-step process:
Access the data in the HDF5 file
Convert to an image with cv2 (or PIL)
A simple example is available here: How to extract individual JPEG images from a HDF5 file.
You can apply the same process to your file. Here is some pseudo-code. It's not complete because you don't show the shape of the image dataset (and the shape affects how the data is read). Also, you didn't say how many images are in dataset /entry0000/reconstruction/results/data: does it have a single image or multiple images? If multiple, which axis is the image counter?
import h5py
import cv2  # for image conversion

filename = "F:/New_ESRF/SNT_BTO4/SNT_BTO4_S1/SNT_BTO4_S1_1_1pag_db0005_vol.hdf5"

with h5py.File(filename, 'r') as hdf:
    # get image dataset
    img_ds = hdf['/entry0000/reconstruction/results/data']
    print(f'Image Dataset info: Shape={img_ds.shape}, Dtype={img_ds.dtype}')
    ## following depends on dataset shape/schema
    ## code below assumes images are along axis=0
    for i in range(img_ds.shape[0]):
        cv2.imwrite(f'test_img_{i:03}.tiff', img_ds[i, :])  # uses slice notation
        # alternately, load to a numpy array first
        img_arr = img_ds[i, :]  # slice notation gets [i,:,:,:]
        cv2.imwrite(f'test_img_{i:03}.tiff', img_arr)
Note: you don't need to use .get() to access a dataset; you can simply reference the dataset path. Also, when you index from a group object, use the path relative to that group, not the absolute path. (You should modify your code to reflect these changes.) For example, the following are equivalent:
G1 = hdf['entry0000']
## is the same as G1 = hdf.get('entry0000')
G11 = hdf['entry0000/reconstruction']
## is the same as G11 = hdf.get('entry0000/reconstruction')
## OR referencing G1 group object:
G11 = G1['reconstruction']
## is the same as G11 = G1.get('reconstruction')
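Since the question also mentioned PIL: a minimal sketch using Pillow instead of cv2, assuming the dataset is a stack of 2D slices along axis 0 (the dtype/range handling may need adjusting for your data):

import h5py
import numpy as np
from PIL import Image

filename = "F:/New_ESRF/SNT_BTO4/SNT_BTO4_S1/SNT_BTO4_S1_1_1pag_db0005_vol.hdf5"

with h5py.File(filename, 'r') as hdf:
    img_ds = hdf['/entry0000/reconstruction/results/data']
    for i in range(img_ds.shape[0]):
        slice_arr = np.asarray(img_ds[i, :, :], dtype=np.float32)
        # Pillow writes float32 arrays as 32-bit TIFFs; convert to uint8/uint16
        # here instead if your viewer cannot read float TIFFs
        Image.fromarray(slice_arr).save(f'slice_{i:04}.tiff')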

Loading in your own image data with TensorFlow and tfds.ImageFolder

I want to train a GAN and generate images of Pokémon. I scraped around 10000 images from the internet, which are saved locally. My folder is structured like so:
all_data:
- train:
-bulbasaur.png
-45.png
-....png
- test:
-bulbasaur.png
-45.png
-....png
- validation:
-bulbasaur.png
-45.png
-....png
I tried to load it via:
builder = tfds.ImageFolder(os.path.join(os.getcwd(), "all_data"))
print(builder.info) # num examples, labels... are automatically calculated
ds = builder.as_dataset(split='train', shuffle_files=True)
tfds.show_examples(ds, builder.info)
but I get the error of:
ValueError: Unrecognized split test. Subsplit API not yet supported for ImageFolder. Split name should be one of []. Is there a problem with how I structured the dataset? As you can tell from the listing, the files all have completely varying names (either their English name or their Pokédex number); is that a problem? Since I do not want to classify anything, I thought the labeling was not really important.
Also, if it helps, the splits in the builder info output are empty:
tfds.core.DatasetInfo(
....
supervised_keys=('image', 'label'),
splits={
},...
)
Thanks a lot in advance!
Your folder structure should look like this:
/content/image_dir/
train/
cat/
cat_1.png
cat_2.png
cat_3.png
dog/
dog_1.png
dog_2.png
dog_3.png
test/
cat.png
dog.png
The code below works with this directory structure:
import tensorflow as tf
import tensorflow_datasets as tfds
builder = tfds.ImageFolder('/content/image_dir/')
print(builder.info) # num examples, labels... are automatically calculated
ds = builder.as_dataset(split='train', shuffle_files=True)
tfds.show_examples(ds, builder.info)
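Since tfds.ImageFolder infers splits and labels from the directory layout, the flat train/test/validation folders from the question need at least one class subdirectory each. Here is a rough sketch (not part of the answer above) that moves every image into a single dummy class, which is fine for a GAN where the label is unused; the folder names are taken from the question:

import os
import shutil

root = "all_data"  # layout from the question: all_data/{train,test,validation}/*.png

for split in ("train", "test", "validation"):
    split_dir = os.path.join(root, split)
    class_dir = os.path.join(split_dir, "pokemon")  # single dummy class
    os.makedirs(class_dir, exist_ok=True)
    for fname in os.listdir(split_dir):
        src = os.path.join(split_dir, fname)
        if os.path.isfile(src):  # skip the class directory itself
            shutil.move(src, os.path.join(class_dir, fname))

After this, tfds.ImageFolder("all_data") should report the three splits, each with a single label.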

Splitting image based dataset for YOLOv3

I have a question about splitting a dataset of 20k images along with their labels. The dataset is in YOLOv3 format, which has an image file and a .txt file with the same name as the image; the text file contains the labels for that image.
I want to split the dataset into train/test splits. Is there a way to randomly select an image together with its labels .txt file and store them in a separate folder using Python?
I want to be able to split the dataset randomly. For instance, select 16k files along with their label files and store them in a train folder, and store the remaining 4k in a test folder.
This could be done manually in the file explorer by selecting the first 16k files and moving them to a different folder, but then the split wouldn't be random, and I plan to do this over and over again for the same dataset.
Here is what the data looks like
images and labels screenshot
I suggest you take a look at the following Python standard-library modules
glob
random
os
shutil
for manipulating files and paths in Python. Here is my code with comments that might solve your problem. It's very simple:
import glob
import random
import os
import shutil

# Get all paths to your image files and text files
PATH = 'path/to/dataset/'
img_paths = sorted(glob.glob(PATH + '*.jpg'))
txt_paths = sorted(glob.glob(PATH + '*.txt'))

# Calculate number of files for training, validation
data_size = len(img_paths)
r = 0.8
train_size = int(data_size * r)

# Shuffle the two lists together so each image stays with its label file
img_txt = list(zip(img_paths, txt_paths))
random.seed(43)
random.shuffle(img_txt)
img_paths, txt_paths = zip(*img_txt)

# Now split them
train_img_paths = img_paths[:train_size]
train_txt_paths = txt_paths[:train_size]
valid_img_paths = img_paths[train_size:]
valid_txt_paths = txt_paths[train_size:]

# Move them to train, valid folders
train_folder = PATH + 'train/'
valid_folder = PATH + 'valid/'
os.mkdir(train_folder)
os.mkdir(valid_folder)

def move(paths, folder):
    for p in paths:
        shutil.move(p, folder)

move(train_img_paths, train_folder)
move(train_txt_paths, train_folder)
move(valid_img_paths, valid_folder)
move(valid_txt_paths, valid_folder)
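One caveat about the snippet above: zipping two separately globbed lists only pairs files correctly if every image has exactly one matching .txt file. A more defensive sketch (my own variation, not the answer's code) derives each label path from its image path instead; it assumes .jpg images, so adjust the extension if yours differ:

import glob
import os
import random
import shutil

PATH = 'path/to/dataset/'
img_paths = sorted(glob.glob(PATH + '*.jpg'))

random.seed(43)
random.shuffle(img_paths)
train_size = int(len(img_paths) * 0.8)

for folder, paths in (('train/', img_paths[:train_size]),
                      ('test/', img_paths[train_size:])):
    os.makedirs(PATH + folder, exist_ok=True)
    for img in paths:
        txt = os.path.splitext(img)[0] + '.txt'  # label file shares the image's base name
        shutil.move(img, PATH + folder)
        if os.path.exists(txt):
            shutil.move(txt, PATH + folder)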

Keras flow_from_dataframe gives 0 images

I am trying to use the flow_from_dataframe method of Keras to read training and testing images.
Both my training and testing images are in the same directory, and I read the paths from two different CSV files.
My code for reading the test images looks like this:
# Read test file
testdf = pd.read_csv("test.csv")
# load images
test_datagen = ImageDataGenerator(rescale=1./255)
test_generator = test_datagen.flow_from_dataframe(
dataframe=testdf, directory=IMAGE_PATH,
x_col='image_name', y_col=None,
has_ext=True, target_size=(10,10)
,batch_size=32,color_mode='rgb',shuffle=False, class_mode=None)
I get output like this
Found 0 images.
Similar code for reading the training data works properly. I checked whether the images exist at the given path, and they do. What are some possible reasons for this error? How can I debug the issue?
EDIT: This is a regression task, so all images are in a single directory, and not in subdirectories, as would be expected for a classification task.
EDIT 2: I added usecols=[0] to read_csv, and now test_datagen finds all the images in the directory, not just the ones mentioned in the test.csv file.
The issue happens due to NaNs in the dataframe. Ignoring those columns doesn't work; the solution is to replace the NaNs with something else. For example,
testdf = pd.read_csv("test.csv")
testdf.fillna(0, inplace=True)
This replaces the NaN's with 0. Then using ImageDataGenerator as usual works.
I was also facing the same error and found a solution for this.
I was using the absolute path and the correct DataFrame, and everything looked fine, yet the code was still throwing the "image not found" error.
I inspected and found that my DataFrame contained image names without extensions, while the images in the folder had extensions.
E.g. the image name in the DataFrame was 'abc' but the image in the folder was named 'abc.png'.
Just add .png to the image names in the DataFrame and it will solve your problem.
I just tried the code below and it worked!
def append_ext(fn):
    return fn + ".png"

train_valid_data["id_code"] = train_valid_data["id_code"].apply(append_ext)
test_data["id_code"] = test_data["id_code"].apply(append_ext)
Let me know if it solves your problem or if you need any further explanation.
I had the same problem. First, make sure you got the absolute path correct for the directory parameter.
The filename in my df had the value image.pgm.png while the actual image file in the folder was named image.pgm.
I tried to change the filename in the df to image.pgm => still not working.
I renamed the image file from image.pgm to image.pgm.png, which matches exactly the format in the df => worked!
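If you have many files to rename, here is a small sketch of doing the rename in bulk; the directory path and extensions are placeholders, assuming .pgm files that the dataframe lists as .pgm.png:

import os

image_dir = 'path/to/images'  # placeholder path
for fname in os.listdir(image_dir):
    if fname.endswith('.pgm'):
        os.rename(os.path.join(image_dir, fname),
                  os.path.join(image_dir, fname + '.png'))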
I had the same error.
What I found is that my directory path was wrong and the image extension was missing from the data frame.
So make sure that your directory path is correct and that your image names include an extension; you can do the following:
def extention_train_data(x):
    return x + ".jpg"
Change the .jpg extension if you have a different one.
Then you apply this to your data frame:
train_data['image'] = train_data['image_id'].apply(extention_train_data)
Once you have the image column containing your image names with their extensions, then:
train_generator = datagen.flow_from_dataframe(
train_data,
directory="/kaggle/input/plant-pathology-2020-fgvc7/images/",
x_col = "image",
y_col = "label",
target_size = size,
class_mode = "binary",
batch_size = batch_size,
subset="training",
shuffle = True,
seed = 42,
)
Okay, so I have been having the same issue: my data labels were in a CSV file and the image data in a separate folder. I thought the issue was caused by the labels and the images in the folder not aligning properly, and did a whole bunch of things to rectify and process the data. That was not the problem.
So, for anyone who's having issues:
I tried @Oussama Ouardini's answer and it worked. Thank you!
I am also going to add that if you are doing a train and validation split, make sure the initial ImageDataGenerator object you create has the validation split specified.
def extension_train_data(x):
    return "xc" + str(x) + ".png"

train_df['file_id'] = train_df['file_id'].apply(extension_train_data)
Here is my code -
datagen = ImageDataGenerator(rescale=1./255, validation_split=0.2)
# rescale all pixel values from 0-255, so after this step all our
# pixel values are in range (0,1)
train_generator = datagen.flow_from_dataframe(dataframe=train_df, directory='./img_data/',
                                              x_col="file_id", y_col="english_cname",
                                              class_mode="categorical", save_to_dir='./new folder/',
                                              target_size=(64,64), subset="training",
                                              seed=42, batch_size=32, shuffle=False)
val_generator = datagen.flow_from_dataframe(dataframe=train_df, directory='./img_data/',
                                            x_col="file_id", y_col="english_cname",
                                            class_mode="categorical",
                                            target_size=(64,64), subset="validation",
                                            seed=42, batch_size=32, shuffle=False)
print("\n Sanity check Line.--------")
My output shows successfully validated image files. :)
Found 212 validated image filenames belonging to 88 classes.
Found 52 validated image filenames belonging to 88 classes.
Sanity check Line.----------
I hope someone will find this useful. Cheers!

Keras showing images from data generator

I am using the image generator for Keras like this:
val_generator = datagen.flow_from_directory(
path+'/valid',
target_size=(224, 224),
batch_size=batch_size,)
x,y = val_generator.next()
for i in range(0,1):
    image = x[i]
    plt.imshow(image.transpose(2,1,0))
    plt.show()
This shows wrong colors:
I have two questions.
How do I fix the problem?
How do I get the file names of the files (so that I can read them myself with something like matplotlib)?
Edit : this is what my datagen looks like
datagen = ImageDataGenerator(
    rotation_range=3,
    # featurewise_std_normalization=True,
    fill_mode='nearest',
    width_shift_range=0.2,
    height_shift_range=0.2,
    horizontal_flip=True
)
Edit 2 :
After following Marcin's answer:
image = 255 - image
I get normal colors, but there are still some weird colors:
The dtype of your image array is 'float32', just convert it into 'uint8':
plt.imshow(image.astype('uint8'))
I had the same problem as OP and solved it by rescaling the pixels from 0-255 to 0-1.
Keras' ImageDataGenerator takes a rescale parameter, which I set to 1/255. This produced images with the expected colors:
image_gen = ImageDataGenerator(rescale=(1/255))
There are at least three ways to end up with these twisted colors. So:
one option is that you need to switch the color ordering, as in this question.
a second is that your pictures might have been turned into a negative (every channel gets transformed by a 255 - x transformation); this sometimes happens when certain GIS libraries are involved.
you could also need a score/255 rescaling.
You need to check which of these happens in your case; a short sketch of all three fixes follows below.
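A minimal sketch of the three candidate fixes, assuming image is one NumPy array taken from the generator batch (image = x[i], as in the question); only one of these should be needed for your data:

# image is assumed to be one NumPy array taken from the generator batch (image = x[i])

# 1. channel ordering (e.g. BGR -> RGB): reverse the last axis
rgb = image[..., ::-1]

# 2. negative image: invert every channel
positive = 255 - image

# 3. values in 0-255 where imshow expects floats in 0-1: rescale
scaled = image / 255.0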
In order to get the images on your own (when your folder has a format suitable for Keras' flow_from_directory), I usually use a mix of os.listdir and os.path.join:
list_of_labels = os.listdir(path_to_dir_with_label_dirs)
for label in list_of_labels:
    current_label_dir_path = os.path.join(path_to_dir_with_label_dirs, label)
    list_of_images = os.listdir(current_label_dir_path)
    for image in list_of_images:
        current_image_path = os.path.join(current_label_dir_path, image)
        image = open(current_image_path)  # use whichever image-reading function you want here
The color problem is rather strange.
I'll try to reproduce it once I have access to my linux machine.
For the filename part of the question, I would like to propose a small change to the Keras source code:
You might want to take a look at this file:
https://github.com/fchollet/keras/blob/master/keras/preprocessing/image.py
It contains the image preprocessing routines.
Look at line 820, the next() function of the DirectoryIterator: this is called to get new images from the directory.
Inside of that function, look at line 838, if save_to_dir has been set to a path, the generator will output the augmented images to this path, for debugging purposes.
The name of the augmented image is a mixture of an index and a hash. Not useful for you.
But you can change the code quite easily:
filenames = []  # <-------------------------------------------- new code
for i, j in enumerate(index_array):
    fname = self.filenames[j]
    img = load_img(os.path.join(self.directory, fname),
                   grayscale=grayscale,
                   target_size=self.target_size)
    x = img_to_array(img, dim_ordering=self.dim_ordering)
    x = self.image_data_generator.random_transform(x)
    x = self.image_data_generator.standardize(x)
    filenames.append(fname)  # <----------------------------- store the used image's name
    batch_x[i] = x
# optionally save augmented images to disk for debugging purposes
if self.save_to_dir:
    for i in range(current_batch_size):
        img = array_to_img(batch_x[i], self.dim_ordering, scale=True)
        #fname = '{prefix}_{index}_{hash}.{format}'.format(prefix=self.save_prefix,
        #                                                   index=current_index + i,
        #                                                   hash=np.random.randint(1e4),
        #                                                   format=self.save_format)
        fname = filenames[i]  # <------------------------------ use the stored name instead
        img.save(os.path.join(self.save_to_dir, fname))
Now the augmented image is saved with the original filename.
This should allow you to save the images under their original filenames.
Ok, how do you actually inject this into the Keras source?
Do it like this:
clone Keras: git clone https://github.com/fchollet/keras
go to the sourcefile I linked above. Make the change.
Trick your Python code into importing the changed code instead of the version installed by pip.
# this is the path to the cloned repository
# if you cloned it next to your script
# then just use keras/
# if it's one folder above
# then use ../keras/
sys.path.insert(0, os.getcwd() + "/path/to/keras/")
import keras
Now the DirectoryIterator is your patched version.
I hope that this works; I'm currently on Windows and my Python stack is only on the Linux machine, so there might be a small syntax error.
from skimage import io

def imshow(image_RGB):
    io.imshow(image_RGB)
    io.show()

x, y = train_generator.next()
for i in range(0, 11):
    image = x[i]
    imshow(image)
It works for me.
Just a bit of advice if you are using test_batches = ImageDataGenerator().flow_from_directory(...). If you use this to feed a predict generator, make sure you set shuffle=False to maintain the correlation between each file and its associated prediction. If you have files numerically labelled in the directory, for example 1.jpg, 2.jpg etc., the images are not fetched in the order you might think. They are fetched in the order:
1.jpg, 10.jpg, 2.jpg, 20.jpg etc. This makes it hard to match a prediction to a specific file. You can get around this by zero-padding the names, for example 01.jpg, 02.jpg etc. On the second part of the question, "how can I get the files the generator produces", you can get them as follows:
file_names = []
for file in test_batches.filenames:
    file_names.append(file)
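A hedged sketch of putting that together, assuming the generator was created with shuffle=False and that a trained model object exists (the names model and test_batches below are assumptions taken from the advice above):

import numpy as np

predictions = model.predict(test_batches)            # one row of scores per image
predicted_classes = np.argmax(predictions, axis=1)   # most likely class index per image

# with shuffle=False, the i-th prediction corresponds to the i-th filename
for fname, pred in zip(test_batches.filenames, predicted_classes):
    print(fname, pred)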
