Splitting image based dataset for YOLOv3


I have a question about splitting a dataset of 20k images along with their labels. The dataset is in YOLOv3 format: each image file has a .txt file with the same name, and that text file holds the labels.
I want to split the dataset into train/test sets. Is there a way, using Python, to randomly select images together with their label .txt files and store them in a separate folder?
For instance, I want to select 16k images along with their label files and store them in a train folder, with the remaining 4k stored in a test folder.
This could be done manually in the file explorer by selecting the first 16k files and moving them to a different folder, but then the split would not be random, and I plan to repeat the split many times on the same dataset.
Here is what the data looks like:
[screenshot of the images and their label files]

I suggest you take a look at the following Python built-in modules for manipulating files and paths:
glob
random
os
shutil
Here is my code, with comments, that might solve your problem. It's very simple.
import glob
import os
import random
import shutil

# Get the paths of all image files
PATH = 'path/to/dataset/'
img_paths = sorted(glob.glob(PATH + '*.jpg'))
# Derive each label path from its image path so the pairs always line up
# (two independent glob calls are not guaranteed to return matching orders)
txt_paths = [os.path.splitext(p)[0] + '.txt' for p in img_paths]

# Calculate the number of files for training
data_size = len(img_paths)
r = 0.8
train_size = int(data_size * r)

# Shuffle the two lists together
img_txt = list(zip(img_paths, txt_paths))
random.seed(43)  # change or remove the seed to get a different split
random.shuffle(img_txt)
img_paths, txt_paths = zip(*img_txt)

# Now split them
train_img_paths = img_paths[:train_size]
train_txt_paths = txt_paths[:train_size]
valid_img_paths = img_paths[train_size:]
valid_txt_paths = txt_paths[train_size:]

# Move them to the train and valid folders
train_folder = PATH + 'train/'
valid_folder = PATH + 'valid/'
os.makedirs(train_folder, exist_ok=True)
os.makedirs(valid_folder, exist_ok=True)

def move(paths, folder):
    for p in paths:
        shutil.move(p, folder)

move(train_img_paths, train_folder)
move(train_txt_paths, train_folder)
move(valid_img_paths, valid_folder)
move(valid_txt_paths, valid_folder)
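
Since the question mentions repeating the split many times on the same dataset, here is a minimal sketch of a reusable variant that copies the files instead of moving them, so the original folder stays intact between runs. The function name split_dataset, its parameters, and the train/test folder names are my own choices for illustration, and it assumes .jpg images with same-named .txt labels as described in the question.

```python
import random
import shutil
import tempfile
from pathlib import Path


def split_dataset(src, dst, train_ratio=0.8, seed=None):
    """Copy image/label pairs from src into dst/train and dst/test."""
    src, dst = Path(src), Path(dst)
    # Keep only images that have a label file with the same base name
    images = sorted(p for p in src.glob('*.jpg') if p.with_suffix('.txt').exists())
    rng = random.Random(seed)  # seed=None gives a different split on each run
    rng.shuffle(images)
    n_train = int(len(images) * train_ratio)
    subsets = {'train': images[:n_train], 'test': images[n_train:]}
    for name, subset in subsets.items():
        out = dst / name
        out.mkdir(parents=True, exist_ok=True)
        for img in subset:
            label = img.with_suffix('.txt')
            shutil.copy2(img, out / img.name)
            shutil.copy2(label, out / label.name)
    return {name: len(subset) for name, subset in subsets.items()}


# Demo on ten dummy image/label pairs in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / 'dataset'
    src.mkdir()
    for i in range(10):
        (src / f'img{i}.jpg').write_bytes(b'')
        (src / f'img{i}.txt').write_text('0 0.5 0.5 0.1 0.1\n')
    counts = split_dataset(src, Path(tmp) / 'split', train_ratio=0.8, seed=43)
    train_files = sorted(p.name for p in (Path(tmp) / 'split' / 'train').iterdir())

print(counts)
```

Passing a different seed (or None) on each run gives a fresh random split, while a fixed seed makes a split reproducible.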

Related

How to load an image dataset in scikit-learn?

I have collected a group of images that I want to train a model on.
How do I load the image dataset? I have a folder of training data with two folders in it denoting the two different kinds of objects. How would I go about loading this data set and then training a model?

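This might help you load your dataset from a single folder of images into the data variable:
import os
import cv2
import numpy as np

path = 'path to your dataset'
list_of_files = os.listdir(path)
data = []  # a Python list; NumPy arrays have no append() method
for i in list_of_files:
    # Join the folder and the file name with the proper separator
    x = cv2.imread(os.path.join(path, i))
    data.append(x)
# Stack into one array; this assumes all images have the same shape
data = np.array(data)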
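
The question describes a training folder with two subfolders, one per kind of object. A minimal sketch of how the subfolder names could be turned into labels is below; the helper name list_labeled_files and the demo folder names cats and dogs are made up for illustration. It only collects paths and labels, so each path would still need to be read with something like cv2.imread afterwards.

```python
import os
import tempfile


def list_labeled_files(root):
    """Return (file_path, label) pairs, using each subfolder name as the label."""
    pairs = []
    for label in sorted(os.listdir(root)):
        folder = os.path.join(root, label)
        if not os.path.isdir(folder):
            continue  # skip stray files next to the class folders
        for name in sorted(os.listdir(folder)):
            pairs.append((os.path.join(folder, name), label))
    return pairs


# Demo with two hypothetical class folders holding three empty files each
with tempfile.TemporaryDirectory() as root:
    for label in ('cats', 'dogs'):
        os.makedirs(os.path.join(root, label))
        for i in range(3):
            open(os.path.join(root, label, f'{i}.jpg'), 'wb').close()
    pairs = list_labeled_files(root)

labels = [label for _, label in pairs]
print(labels)
```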

Train Validation data split - labels available but no classes

My study project is to develop a neural network that recognizes text on license plates. For this, I found the ReId dataset at https://medusa.fit.vutbr.cz/traffic/research-topics/general-traffic-analysis/holistic-recognition-of-low-quality-license-plates-by-cnn-using-track-annotated-data-iwt4s-avss-2017/. This dataset contains images of number plates together with the text of each license plate, and was used by Spanhel et al. for an approach similar to the one I have in mind.
Example of a license plate there:
In the project I want to recognize only the license plate text, i.e. only "9B5 2145" and not the country acronym "CZ" and no advertisement text.
I downloaded the dataset and the labels CSV file to my local machine, so I have the following folder structure: one parent directory for the whole project, which includes my data directory, where I stored the ReId dataset. The dataset includes several subdirectories, 4 with training data and 4 with test data, and each of these subdirectories contains a number of license plate images. The ReId dataset also contains the trainVal CSV file, which is structured as follows (snippet of the actual sheet):
track_id is equal to the subdirectory of the ReID dataset.
image_path is equal to the path to the image, in this case the image's name is 1_1.
lp is the label of the license plate, so the actual license plate.
train is a dummy variable, equal to 1 if the image is used for training and 0 if it is used for validation.
Regarding this dataset, I have three main questions:
How do I read in these images properly? I tried something like this:
from keras.preprocessing.image import ImageDataGenerator
# create generator
datagen = ImageDataGenerator()
# prepare an iterators for each dataset
train_it = datagen.flow_from_directory('data/train/', class_mode='binary')
val_it = datagen.flow_from_directory('data/validation/', class_mode='binary')
test_it = datagen.flow_from_directory('data/test/', class_mode='binary')
# confirm the iterator works
batchX, batchy = train_it.next()
print('Batch shape=%s, min=%.3f, max=%.3f' % (batchX.shape, batchX.min(), batchX.max()))
But Keras did not find images belonging to any classes (side note: I used the correct paths). That makes sense to me, because I have not assigned any class to my data yet. So, my first question is: do I have to do that? I don't think so.
How do I then read these images properly? I think I need NumPy arrays to work with this data.
How do I bring my images and the labels together? I think I have to merge the two datasets, don't I?
Thank you very much!

Question 1 and 2:
For reading the images, imread from matplotlib.pyplot can be used as shown in the example; this does not require any classes to be set.
Question 3:
The labels and images can be brought together by storing the corresponding license plate number in an output array (y in the example) for each image (stored in the xs array in the example). You don't necessarily need to merge them.
Hope I helped!
import os
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

xs, y = [], []
main_dir = './sample/dataset'  # the main directory
label_data = pd.read_csv('labels.csv')
for folder in os.listdir(main_dir):
    for img in os.listdir(os.path.join(main_dir, folder)):
        # Join all path parts so the separator before the file name is not lost
        arr = plt.imread(os.path.join(main_dir, folder, img))
        xs.append(arr)
        # Look up the plate text for this image (assumes one matching row)
        match = label_data[label_data['image_path'] == os.path.join(folder, img)]['lp']
        y.append(match.values[0])
        # ^ this part can be changed depending on the exact format of your label data file.
# Then you can convert them into numpy arrays and reshape them as you need
# (np.array(xs) gives one regular array only if all images share a shape).
xs = np.array(xs)
y = np.array(y)
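
Since the CSV file already has a train column, the same lookup can also produce the train/validation split directly. The sketch below uses only the standard csv module and a few made-up rows; the column names come from the question, but the image_path values, the file extension, and the second plate text are assumptions, so adjust them to the real file.

```python
import csv
import io

# Hypothetical sample rows using the column names described in the question
sample_csv = io.StringIO(
    "track_id,image_path,lp,train\n"
    "1,1/1_1.png,9B52145,1\n"
    "1,1/1_2.png,9B52145,1\n"
    "2,2/2_1.png,3A71234,0\n"
)

train_pairs, val_pairs = [], []
for row in csv.DictReader(sample_csv):
    pair = (row['image_path'], row['lp'])
    # train == 1 marks a training image, 0 a validation image
    if row['train'] == '1':
        train_pairs.append(pair)
    else:
        val_pairs.append(pair)

print(len(train_pairs), len(val_pairs))
```

With the real file, io.StringIO would be replaced by an opened file handle, and each image_path would then be read into an array as in the answer above.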

How to load many images efficiently from folder using openCV

I am trying to create my own image dataset for machine learning.
The workflow I have in mind is the following:
①Load all image files as an array in the folder.
②Label the loaded images
③Split loaded image files to image_data and label_data.
④Finally, split image_data to image_train_data and image_test_data and split label_data to label_train_data and label_test_data.
However, the first step (①) is not going well.
How can I load all the image data efficiently?
And if you were to build an image dataset for machine learning according to this workflow, how would you handle it?
I wrote the following code:
cat_im = cv2.imread("C:\\Users\\path\\cat1.jpg")
But am I forced to write \cat1.jpg, \cat2.jpg, \cat3.jpg, ... one by one?

## You can find all images by their extension
import glob
import os

import cv2

## glob returns the paths of the matching images as a list
all_images_path = glob.glob(os.path.join('some_folder', 'images', '*.png'))

## Then you can loop over all files
loaded_images = []
for image_path in all_images_path:
    image = cv2.imread(image_path)
    loaded_images.append(image)

## Let's assume your labels are just the file names, like cat1.png, cat2.png, etc.
labels = []
for image_path in all_images_path:
    labels.append(os.path.basename(image_path))
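
For steps ② to ④ of the workflow in the question, a rough sketch is below. It assumes the class name is the file name with its trailing digits and extension removed (cat1.png becomes cat), which is a guess based on the example names. Both helper functions are hypothetical, and a plain shuffle-and-cut stands in for a library splitter.

```python
import os
import random
import re


def label_from_path(path):
    """Drop the folder, the extension, and trailing digits from a file name."""
    stem = os.path.splitext(os.path.basename(path))[0]
    return re.sub(r'\d+$', '', stem)


def train_test_split(items, test_ratio=0.25, seed=0):
    """Shuffle a copy of items and cut it into (train, test)."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_test = int(len(items) * test_ratio)
    return items[n_test:], items[:n_test]


# Made-up file names standing in for the result of glob.glob
paths = [f'images/cat{i}.png' for i in range(1, 5)]
paths += [f'images/dog{i}.png' for i in range(1, 5)]

# Pair every path with its label, then split the pairs so they stay aligned
labeled = [(p, label_from_path(p)) for p in paths]
train, test = train_test_split(labeled)
print(len(train), len(test))
```

Splitting the (path, label) pairs rather than two separate lists keeps each image attached to its label.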

save multi directory images in a single file after preprocessing

I am working on DICOM images. I have 5 scans (folders), and each scan contains multiple images. After doing some preprocessing on the images, I want to save the processed images in a single file using "np.save". The code below saves each folder in a separate file:
data_path = 'E:/jupyter/test/LIDC-IDRI/'
patients_data = os.listdir(data_path)
for pd in range(len(patients_data)):
    full_path = load_scan(data_path + patients_data[pd])
    after_pixel_hu = get_pixels_hu(full_path)
    after_resample, spacing = resample(after_pixel_hu, full_path, [1,1,1])
    np.save(output_path + "images_of_%s_patient.npy" % (patients_data[pd]), after_resample)
load_scan is a function for loading (reading) DICOM files. What I want to do with this code is save all processed images in a single file, not in five files. Can anyone tell me how to do that, please?

The first thing to notice is that you are using %s with patients_data[pd]. I assume patients_data is a list of the names of the patients, which means you are constructing a different output path for each patient - you are asking numpy to save each of your processed images to a new location.
Secondly, .npy is probably not the file type you want for this, as it does not support appending data: calling np.save() again with the same file path overwrites what was there. You probably want to pick a different file type, or collect the processed images first and write them to a single file path once.
Edit: Regarding file type, a pdf may be your best option, where you can make each of your images a separate page.
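
If staying within NumPy is acceptable, one option is np.savez, which writes several named arrays into a single .npz archive. This is a minimal sketch and not the asker's pipeline: the small arrays and the names patient_a and patient_b are stand-ins for the resampled scans, and the output file name is made up.

```python
import os
import tempfile

import numpy as np

# Stand-ins for the processed scans; shapes may differ between patients
processed = {
    'patient_a': np.zeros((4, 8, 8), dtype=np.float32),
    'patient_b': np.ones((6, 8, 8), dtype=np.float32),
}

# Write every array into one archive, keyed by patient name
out_file = os.path.join(tempfile.mkdtemp(), 'all_patients.npz')
np.savez(out_file, **processed)

# Read the archive back and inspect what it holds
with np.load(out_file) as archive:
    names = sorted(archive.files)
    shapes = {name: archive[name].shape for name in names}

print(names, shapes)
```

In the loop from the question, each after_resample would be stored in the dictionary under its patient name, and np.savez would be called once after the loop.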

h5py: How to index over multiple large HDF5 files without loading all their content into memory

This is a question about working with multiple HDF5 datasets simultaneously while treating them as one dataset as far as possible.
I have multiple .h5 files, each of which contains tens of thousands of images. Let's call the files
file01.h5
file02.h5
file03.h5
I now want to create a list or array that contains "pointers" to all images of all three files, without actually loading the images.
Here is what I have so far:
I first open all files:
file01 = h5py.File('file01.h5', 'r')
file02 = h5py.File('file02.h5', 'r')
file03 = h5py.File('file03.h5', 'r')
and add their image datasets to a list:
images = []
images.append(file01['images'])
images.append(file02['images'])
images.append(file03['images'])
where file01['images'] is an HDF5 dataset of shape e.g. (52722, 3, 160, 320), i.e. 52722 images. All good so far, none of the content has been loaded into memory yet. Now I want to make these three separate image lists into one so that I can work with it as if it were one large dataset. I tried to do this:
images = np.concatenate(images)
This is where it breaks. As soon as I concatenate the three HDF5 datasets, it actually loads the images as Numpy arrays and I run out of memory.
What would be the best way to solve this?
I need a solution that allows me to Numpy-slice and index into the three datasets as if it were one.
For example, assume each dataset contained 50,000 images and I wanted to load the third image of each dataset, I need a list images that allows me to index those images as
batch = images[[2, 50002, 100002]]

HDF5 has introduced the concept of a "Virtual Dataset" (VDS).
However, this does not work for HDF5 versions before 1.10.
I have no experience with the VDS feature, but the h5py docs go into more detail, and the h5py git repository has an example file here:
'''A simple example of building a virtual dataset.
This makes four 'source' HDF5 files, each with a 1D dataset of 100 numbers.
Then it makes a single 4x100 virtual dataset in a separate file, exposing
the four sources as one dataset.
'''
import h5py
import numpy as np

# Create source files (1.h5 to 4.h5)
for n in range(1, 5):
    with h5py.File('{}.h5'.format(n), 'w') as f:
        d = f.create_dataset('data', (100,), 'i4')
        d[:] = np.arange(100) + n

# Assemble virtual dataset
layout = h5py.VirtualLayout(shape=(4, 100), dtype='i4')
for n in range(1, 5):
    filename = "{}.h5".format(n)
    vsource = h5py.VirtualSource(filename, 'data', shape=(100,))
    layout[n - 1] = vsource

# Add virtual dataset to output file
with h5py.File("VDS.h5", 'w', libver='latest') as f:
    f.create_virtual_dataset('data', layout, fillvalue=-5)
    print("Virtual dataset:")
    print(f['data'][:, :10])
More details can be found on the HDF Group website, which links to a PDF. Figure 1 there illustrates the idea nicely.
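
If a virtual dataset is not an option (for example with an older HDF5 version), a small wrapper that maps a global index to the right dataset is another possibility. The sketch below is not part of h5py; the class name ConcatView is made up, and plain lists stand in for the file01['images'] style datasets. It supports only integers and lists of integers, not slices.

```python
from bisect import bisect_right
from itertools import accumulate


class ConcatView:
    """Index several sequence-like datasets as if they were one, lazily."""

    def __init__(self, datasets):
        self.datasets = list(datasets)
        # offsets[i] is the global index one past the end of dataset i
        self.offsets = list(accumulate(len(d) for d in self.datasets))

    def __len__(self):
        return self.offsets[-1] if self.offsets else 0

    def _get_one(self, index):
        if index < 0:
            index += len(self)
        if not 0 <= index < len(self):
            raise IndexError(index)
        # Find which dataset holds this global index, then its local index
        which = bisect_right(self.offsets, index)
        start = self.offsets[which - 1] if which else 0
        return self.datasets[which][index - start]

    def __getitem__(self, index):
        if isinstance(index, int):
            return self._get_one(index)
        return [self._get_one(i) for i in index]


# Plain lists stand in for the three HDF5 datasets (5 items each)
parts = [list(range(0, 5)), list(range(100, 105)), list(range(200, 205))]
images = ConcatView(parts)
batch = images[[2, 7, 12]]  # third item of each part
print(batch)
```

Only the requested items are read, one at a time, so nothing is loaded up front. Fetching a large batch this way would likely be slower than a single sliced read from one dataset.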
