Loading in your own image data with TensorFlow and tfds.ImageFolder - python

I want to train a GAN and generate images of pokemon. I scraped around 10000 images from the internet which are locally saved. My folder is structured like so:
all_data:
- train:
    - bulbasaur.png
    - 45.png
    - ....png
- test:
    - bulbasaur.png
    - 45.png
    - ....png
- validation:
    - bulbasaur.png
    - 45.png
    - ....png
I tried to load it via:
builder = tfds.ImageFolder(os.path.join(os.getcwd(), "all_data"))
print(builder.info) # num examples, labels... are automatically calculated
ds = builder.as_dataset(split='train', shuffle_files=True)
tfds.show_examples(ds, builder.info)
but I get the error:
ValueError: Unrecognized split test. Subsplit API not yet supported for ImageFolder. Split name should be one of [].
Is there a problem with how I structured the dataset? As you can see from the folder structure, the files all have completely varying names (either their English name or their Pokedex number); is that a problem? Since I do not want to classify anything, I thought the labeling was not really important.
Also, if it helps: the splits entry in the builder info output is empty.
tfds.core.DatasetInfo(
    ....
    supervised_keys=('image', 'label'),
    splits={
    },
    ...
)
Thanks a lot in advance!

Your folder structure should look like this:
/content/image_dir/
    train/
        cat/
            cat_1.png
            cat_2.png
            cat_3.png
        dog/
            dog_1.png
            dog_2.png
            dog_3.png
    test/
        cat/
            cat.png
        dog/
            dog.png
The code below works with this directory structure (note that every split directory, including test, needs one subdirectory per label):
import tensorflow as tf
import tensorflow_datasets as tfds
builder = tfds.ImageFolder('/content/image_dir/')
print(builder.info) # num examples, labels... are automatically calculated
ds = builder.as_dataset(split='train', shuffle_files=True)
tfds.show_examples(ds, builder.info)
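Since the asker is training a GAN and does not actually need real labels, one workable pattern is to give every split a single dummy class folder so that tfds.ImageFolder can still infer one label per image. The following is a minimal sketch, assuming the flat all_data/{train,test,validation} layout from the question; the "pokemon" class name is arbitrary:

import os
import shutil

root = "all_data"
for split in ("train", "test", "validation"):
    split_dir = os.path.join(root, split)
    class_dir = os.path.join(split_dir, "pokemon")  # single dummy label
    os.makedirs(class_dir, exist_ok=True)
    for fname in os.listdir(split_dir):
        src = os.path.join(split_dir, fname)
        if os.path.isfile(src):  # skip the class directory itself
            shutil.move(src, os.path.join(class_dir, fname))

After this restructuring, builder.info should report one label and non-empty splits, and as_dataset(split='train') should work as in the snippet above.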

Related

Is there a good way to write 2d arrays or tensors to TFRecords in Tensorflow?

I am currently working on a project using audio data. The first step of the project is to use another model to produce features of shape roughly [400 x 10_000] for each wav file, and each wav file has a label that I'm trying to predict. I will then build another model on top of this to produce my final result.
I don't want to run preprocessing every time I run the model, so my plan was to have a preprocessing pipeline that runs the feature extraction model and saves the features into a new folder, and then the second model can use the saved features directly. I was looking at using TFRecords (tf.io.serialize_tensor and the tfrecord guide), but the documentation is quite unhelpful.
This is what I've come up with to test it so far:
serialized_features = tf.io.serialize_tensor(features)
feature_of_bytes = tf.train.Feature(
    bytes_list=tf.train.BytesList(value=[serialized_features.numpy()]))

features_for_example = {
    'feature0': feature_of_bytes
}
example_proto = tf.train.Example(
    features=tf.train.Features(feature=features_for_example))

filename = 'test.tfrecord'
writer = tf.io.TFRecordWriter(filename)
writer.write(example_proto.SerializeToString())

filenames = [filename]
raw_dataset = tf.data.TFRecordDataset(filenames)
for raw_record in raw_dataset.take(1):
    example = tf.train.Example()
    example.ParseFromString(raw_record.numpy())
    print(example)
But I'm getting this error:
tensorflow.python.framework.errors_impl.DataLossError: truncated record at 0' failed with Read less bytes than requested
tl;dr:
Getting the above error with TFRecords. Any recommendations to get this example working or another solution not using TFRecords?
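A detail worth checking (an assumption based on the snippet, not something confirmed in the thread): the TFRecordWriter is never closed, so its buffer may not be flushed to disk before the same script reads the file back, which can produce exactly this truncated-record DataLossError. A sketch of the write step using a context manager:

import tensorflow as tf

# Closing the writer (here implicitly, via `with`) flushes its buffer to disk
# before the TFRecordDataset below tries to read the file.
filename = 'test.tfrecord'
with tf.io.TFRecordWriter(filename) as writer:
    writer.write(example_proto.SerializeToString())

raw_dataset = tf.data.TFRecordDataset([filename])
for raw_record in raw_dataset.take(1):
    example = tf.train.Example()
    example.ParseFromString(raw_record.numpy())
    print(example)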

Sentence Transformers can not get a lot of images' embeddings

When I try to get embeddings from images I get an error like 'too many open files'. I have 50000 images. I do not want to split the images into different folders and then concatenate the embeddings (that would be a possible workaround); I want to get the embeddings using one folder that holds all 50000 images. How can I solve this problem?
from sentence_transformers import SentenceTransformer, util
from PIL import Image

# Load CLIP model
model = SentenceTransformer('clip-ViT-B-32')

# Encode the images:
img_emb = model.encode([Image.open(filepath) for filepath in image_names],
                       batch_size=128, convert_to_tensor=True, show_progress_bar=True)
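No answer is attached here, but the list comprehension above opens all 50000 files at once and keeps every file handle alive until encoding finishes, which is a plausible cause of the 'too many open files' error. A minimal sketch that loads and encodes the images in chunks instead, closing each file as soon as its pixels are read (the chunk_size value is an arbitrary choice):

import torch
from PIL import Image
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('clip-ViT-B-32')

def encode_in_chunks(paths, chunk_size=1000):
    parts = []
    for i in range(0, len(paths), chunk_size):
        imgs = []
        for p in paths[i:i + chunk_size]:
            with Image.open(p) as im:
                # convert() forces the pixel data to load, so the
                # file handle can be closed when the `with` block exits
                imgs.append(im.convert('RGB'))
        parts.append(model.encode(imgs, batch_size=128, convert_to_tensor=True))
    return torch.cat(parts)

img_emb = encode_in_chunks(image_names)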

Train Validation data split - labels available but no classes

my study project is to develop a neural network that recognizes the text on license plates. For this, I found the ReId dataset at https://medusa.fit.vutbr.cz/traffic/research-topics/general-traffic-analysis/holistic-recognition-of-low-quality-license-plates-by-cnn-using-track-annotated-data-iwt4s-avss-2017/. This dataset contains a bunch of images of number plates as well as the text of the license plates, and was used by Spanhel et al. for an approach similar to the one I have in mind.
Example of a license plate there (image not shown): a Czech plate reading "9B5 2145".
In the project I want to recognize only the license plate text, i.e. only "9B5 2145", and not the country acronym "CZ" or any advertisement text.
I downloaded the dataset and the labels csv file to local storage. So I have the following folder structure: one mother directory for my whole project. This mother directory includes my data directory, where I stored the ReId dataset. The dataset includes several subdirectories: 4 with training data and 4 with test data, all containing a number of images of license plates. The ReId dataset also contains the trainVal csv file, which is structured as follows (snippet of the actual sheet):
- track_id is equal to the subdirectory of the ReId dataset.
- image_path is the path to the image; in this case the image's name is 1_1.
- lp is the label of the license plate, i.e. the actual license plate text.
- train is a dummy variable: 1 if the image is used for training and 0 if it is used for validation.
Regarding this dataset, I have three main questions:
How do I read in these images properly? I tried to use something like this:
from keras.preprocessing.image import ImageDataGenerator

# create generator
datagen = ImageDataGenerator()

# prepare an iterator for each dataset
train_it = datagen.flow_from_directory('data/train/', class_mode='binary')
val_it = datagen.flow_from_directory('data/validation/', class_mode='binary')
test_it = datagen.flow_from_directory('data/test/', class_mode='binary')

# confirm the iterator works
batchX, batchy = train_it.next()
print('Batch shape=%s, min=%.3f, max=%.3f' % (batchX.shape, batchX.min(), batchX.max()))
But obviously Python did not find images belonging to any classes (side note: I used the correct paths). That is clear to me, because I have not assigned any class to my data yet. So, my first question is: do I have to do that? I don't think so.
How do I then read these images properly? I think I have to get numpy arrays to work properly with this data.
How do I bring my images and the labels together? I think I have to merge the two datasets, don't I?
Thank you very much!
Question 1 and 2:
For reading the images, imread from matplotlib.pyplot can be used as shown in the example below; this does not require any classes to be set.
Question 3:
The labels and images can be brought together by storing, for each image (appended to the xs array in the example), the corresponding license plate number in an output array (y in the example). You don't necessarily need to merge the two datasets.
Hope I helped!
import os
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

xs, y = [], []
main_dir = './sample/dataset'  # the main directory
label_data = pd.read_csv('labels.csv')

for folder in os.listdir(main_dir):
    for img in os.listdir(os.path.join(main_dir, folder)):
        arr = plt.imread(os.path.join(main_dir, folder, img))
        xs.append(arr)
        y.append(label_data[label_data['image_path'] == os.path.join(folder, img)]['lp'])
        # ^ this part can be changed depending on the exact format of your label data file.

# then you can convert them into numpy arrays and reshape them as you need.
xs = np.array(xs)
y = np.array(y)
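The trainVal csv also carries the train dummy column described in the question, so the train/validation split can be made while reading. A hedged variant of the loop above (a sketch, assuming the column names image_path, lp, and train from the question, and that every image appears in the csv):

x_train, y_train, x_val, y_val = [], [], [], []
for folder in os.listdir(main_dir):
    for img in os.listdir(os.path.join(main_dir, folder)):
        row = label_data[label_data['image_path'] == os.path.join(folder, img)]
        arr = plt.imread(os.path.join(main_dir, folder, img))
        if row['train'].values[0] == 1:  # dummy variable: 1 = training, 0 = validation
            x_train.append(arr)
            y_train.append(row['lp'].values[0])
        else:
            x_val.append(arr)
            y_val.append(row['lp'].values[0])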

Keras flow_from_dataframe gives 0 images

I am trying to use the flow_from_dataframe method of Keras to read training and testing images.
Both my training and testing images are in same directory, and I read the paths from two different csv files.
My code for reading the test images looks like this:
# Read test file
testdf = pd.read_csv("test.csv")
# load images
test_datagen = ImageDataGenerator(rescale=1./255)
test_generator = test_datagen.flow_from_dataframe(
    dataframe=testdf, directory=IMAGE_PATH,
    x_col='image_name', y_col=None,
    has_ext=True, target_size=(10,10),
    batch_size=32, color_mode='rgb', shuffle=False, class_mode=None)
I get output like this
Found 0 images.
Meanwhile, the similar code for reading the training data works properly. I checked that the images exist at the given path, and they do. What are some possible reasons for this error? How can I try to debug the issue?
EDIT: This is a regression task, so all images are in a single directory, and not in subdirectories as would be expected for a classification task.
EDIT 2: I added usecols=[0] to read_csv, and now test_datagen finds all the images in the directory, and not just the ones mentioned in the test.csv file.
The issue happens due to NaNs in the dataframe. Ignoring those columns doesn't work; the solution is to replace the NaNs with something else. For example:
testdf = pd.read_csv("test.csv")
testdf.fillna(0, inplace=True)
This replaces the NaNs with 0. Then using ImageDataGenerator as usual works.
I was also facing the same error and found a solution.
I was using the absolute path and the correct DataFrame, and everything seemed fine, yet the code was still throwing the error "image not found".
I inspected it and found that my dataframe contained image names without extensions, while the images in the folder had extensions.
E.g. the image name in the DataFrame was 'abc' but the image in the folder was named 'abc.png'.
Just add .png to the image names in the DataFrame and it will solve your problem.
I tried the code below and it worked:
def append_ext(fn):
    return fn + ".png"

train_valid_data["id_code"] = train_valid_data["id_code"].apply(append_ext)
test_data["id_code"] = test_data["id_code"].apply(append_ext)
Let me know if it solves your problem or if you need any further explanation.
I had the same problem. First, make sure you got the absolute path correct for the directory parameter.
The filename in my df had the value image.pgm.png, while the actual image file in the folder had the name image.pgm.
I tried changing the filename in the df to image.pgm => still not working.
I renamed the image file from image.pgm to image.pgm.png, which matches exactly the format in the df => worked!
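A bulk version of the rename described above could be scripted; a minimal sketch, assuming all images sit directly in one folder (the './img_data/' path is hypothetical):

import glob
import os

# append '.png' to every .pgm file so the names match the dataframe
for path in glob.glob('./img_data/*.pgm'):
    os.rename(path, path + '.png')

PIL detects the image format from the file contents rather than the extension, so the renamed files still load correctly.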
I had the same error. What I found is that I had missed the directory path and the image extension, which was not in the data frame.
So make sure that your directory path is correct and that your image names include their extension. You can do the following:
def extention_train_data(x):
    return x + ".jpg"
Change the jpg extension if you have a different one.
Then apply this to your data frame:
train_data['image'] = train_data['image_id'].apply(extention_train_data)
Once the image column contains your image names with their extension, then:
train_generator = datagen.flow_from_dataframe(
    train_data,
    directory="/kaggle/input/plant-pathology-2020-fgvc7/images/",
    x_col="image",
    y_col="label",
    target_size=size,
    class_mode="binary",
    batch_size=batch_size,
    subset="training",
    shuffle=True,
    seed=42,
)
Okay, so I have been having the same issues, where my data labels were in a csv file and the image data in a separate folder. I thought the issue was being caused by the labels and the images in the folder not aligning properly, and did a whole bunch of stuff to rectify and process the data. That was not the problem.
So, for anyone who's having issues:
I tried @Oussama Ouardini's answer and it worked. Thank you!
I am also going to add that if you are doing a train and validation split, make sure the initial ImageDataGenerator object you create has the validation split specified.
def extension_train_data(x):
    return "xc" + str(x) + ".png"

train_df['file_id'] = train_df['file_id'].apply(extension_train_data)
Here is my code:
# rescale all pixel values from 0-255, so after this step all our
# pixel values are in range (0,1)
datagen = ImageDataGenerator(rescale=1./255, validation_split=0.2)

train_generator = datagen.flow_from_dataframe(
    dataframe=train_df, directory='./img_data/',
    x_col="file_id", y_col="english_cname",
    class_mode="categorical", save_to_dir='./new folder/',
    target_size=(64,64), subset="training",
    seed=42, batch_size=32, shuffle=False)

val_generator = datagen.flow_from_dataframe(
    dataframe=train_df, directory='./img_data/',
    x_col="file_id", y_col="english_cname",
    class_mode="categorical",
    target_size=(64,64), subset="validation",
    seed=42, batch_size=32, shuffle=False)

print("\n Sanity check Line.--------")
My output shows successfully validated image filenames. :)
Found 212 validated image filenames belonging to 88 classes.
Found 52 validated image filenames belonging to 88 classes.
Sanity check Line.----------
I hope someone will find this useful. Cheers!

Python: using astropy.io.fits.open in combination with Tensorflow tf.data.Dataset

I am trying to write a custom input pipeline in TensorFlow for my dataset containing .fits files. I have a list of strings with the locations of the files, like so:
pathlist = ['/path/to/file1', 'path/to/file2', ...]
Although the path naming convention has very specific subdirectories, this is a general example. I have written a short function which, when applied to each path element of this list, will spit out a numpy.ndarray with the appropriate data:
import numpy as np
from astropy.io import fits
import tensorflow as tf
def path2im(path):
    print(path)
    hdulist = fits.open(path)
    data = hdulist[1].data
    data[np.isnan(data)] = 0
    return tf.convert_to_tensor(data.astype(np.float32))
It basically opens the fits file at the given path, extracts the data, drops the NaNs, and converts the array to a tensor. I am following the guidelines set down here (Loading Images in a Directory As Tensorflow Data set) for generating a TensorFlow input pipeline. I start by defining a filename dataset from the list of paths and then mapping the function over it:
filenames = tf.data.Dataset.list_files(pathlist)
ims = filenames.map(path2im)
When this is run, it prints the path not as a string, but as
Tensor("arg0:0", shape=(), dtype=string)
which makes sense considering the filenames dataset contains tensors. It also produces a huge error block in the map function, which fails at this line:
->hdulist = fits.open(path)
because fits.open(path) expects a string as the path argument. Is there any way to rectify this issue? I cannot find a way to convert the string tensor into a string without starting a session and using .eval(), which I do not want to do in this initialization phase.
The main idea of the Dataset API is to make your data preprocessing part of the TensorFlow graph, so that, for example, you can just specify a filename as a placeholder when you run the TensorFlow graph.
It is therefore completely expected that the filenames object is made of Tensors, and if you want to convert one to a string, you'll have to evaluate it using a Session.
You might want to have a look at this introductory guide to datasets.
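The answer above explains why path arrives as a tensor but stops short of a workaround. One common pattern (a sketch, not part of the original answer; tf.py_function requires TF 1.13+ or TF 2.x) is to wrap the Python-only logic in tf.py_function, so the filename tensor is resolved to a concrete value when the dataset runs and fits.open receives a plain string:

import numpy as np
import tensorflow as tf
from astropy.io import fits

def path2im(path):
    # inside tf.py_function, `path` is an eager tensor; .numpy() yields the raw bytes
    hdulist = fits.open(path.numpy().decode('utf-8'))
    data = hdulist[1].data
    data[np.isnan(data)] = 0
    return data.astype(np.float32)

filenames = tf.data.Dataset.list_files(pathlist)
ims = filenames.map(lambda p: tf.py_function(path2im, [p], tf.float32))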
