pytorch deep learning loading data sequentially and efficiently - python

I have been doing neural network analysis on 20 thousand "images", each image represented in the form of the intensity of 100 * 100 * 100 neurons.
import numpy as np

x = np.loadtxt('imgfile')
x = x.reshape(-1, img_channels, 100, 100, 100)
# similarly for the target variable 'y'
Above, the first dimension of x will be the number of images. I am using DataLoader to get the appropriate number of images for training during each iteration, as shown below.
batch_size = 16
traindataset = TensorDataset(Tensor(x[:-testdatasize]), Tensor(y[:-testdatasize]) )
train_loader = DataLoader(dataset=traindataset, batch_size=batch_size, shuffle=True)
for epoch in range(num_epochs):
    for i, (data, targets) in enumerate(train_loader):
        ...
I hope to increase the number of images to 50k, but I am restricted by computer memory (imgfile is ~50 GB).
I was wondering if there is an efficient way to handle all the data. Rather than loading the whole imgfile, can we first divide it into sets, each with batch_size images, and load the sets periodically during training? I am not completely sure how to implement this.
I found some similar ideas using Keras here: https://machinelearningmastery.com/how-to-load-large-datasets-from-directories-for-deep-learning-with-keras/
Please point me towards any similar ideas implemented with PyTorch, or share any ideas you have.

Digging around for a while after posting the question, I found out that there is, of course, a way using torch.utils.data.Dataset. Each image can be saved in a separate file, and all the filenames are listed in 'filelistdata'. Only batch_size images are loaded into memory when requested by the DataLoader (in the background, the __getitem__ method fetches the images). The following worked for me:
traindataset = CustDataset(filename='filelistdata', root_dir=root_dir)
train_loader = DataLoader(dataset=traindataset, batch_size=batch_size, num_workers = 16)
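For reference, a minimal sketch of what such a CustDataset can look like (the file-list format and the np.load call here are assumptions; adapt __getitem__ to however each image file is actually stored):

import os
import numpy as np
import torch
from torch.utils.data import Dataset

class CustDataset(Dataset):
    """Loads one image file per __getitem__ call, so only the current
    batch is ever held in memory."""
    def __init__(self, filename, root_dir):
        # assumption: 'filename' lists one image file name and its target per line
        with open(filename) as f:
            self.entries = [line.split() for line in f if line.strip()]
        self.root_dir = root_dir

    def __len__(self):
        return len(self.entries)

    def __getitem__(self, idx):
        img_file, target = self.entries[idx]
        # assumption: each image was saved with np.save as a 100x100x100 array
        img = np.load(os.path.join(self.root_dir, img_file))
        x = torch.from_numpy(img).float().unsqueeze(0)  # add the channel dimension
        y = torch.tensor(float(target))
        return x, y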
num_workers is really important for performance and should be higher than the number of CPUs you are using (I am using 4 CPUs above). I found the following resources useful for answering this question:
How to split and load huge dataset that doesn't fit into memory into pytorch Dataloader?
https://stanford.edu/~shervine/blog/pytorch-how-to-generate-data-parallel
https://www.youtube.com/watch?v=ZoZHd0Zm3RY

Related

Having problems while doing multiclass classification with tensorflow

https://colab.research.google.com/drive/1EdCL6YXCAvKqpEzgX8zCqWv51Yum2PLO?usp=sharing
Hello,
Above, I'm trying to identify 5 different types of restorations on dental x-rays with TensorFlow. I'm using the official documentation to follow the steps, but now I'm kind of stuck and I need help. Here are my questions:
1 - I have my data on my local disk. The TF example in the link above downloads the data from a different repository. When I want to test my images, do I have any other way than to use the code below?
import numpy as np
from keras.preprocessing import image
from google.colab import files

uploaded = files.upload()

# predicting images
for fn in uploaded.keys():
    path = fn
    img = image.load_img(path, target_size=(180, 180))
    x = image.img_to_array(img)
    x = np.expand_dims(x, axis=0)
    images = np.vstack([x])
    classes = model.predict(images)
    print(fn)
    print(classes)
I'm asking this because the official documentation only shows how to test images one by one, like this:
img = keras.preprocessing.image.load_img(
    sunflower_path, target_size=(img_height, img_width)
)
img_array = keras.preprocessing.image.img_to_array(img)
img_array = tf.expand_dims(img_array, 0)  # Create a batch

predictions = model.predict(img_array)
score = tf.nn.softmax(predictions[0])

print(
    "This image most likely belongs to {} with a {:.2f} percent confidence."
    .format(class_names[np.argmax(score)], 100 * np.max(score))
)
2 - I'm using the "image_dataset_from_directory" method, so I don't have a separate validation directory. Is that OK, or should I use ImageDataGenerator? For testing, I picked some data randomly by hand from all 5 categories and put them in my test folder, which has 5 subfolders since I have that many categories. Is this what I am supposed to do for prediction, i.e. also separating the test data into different folders? If yes, how can I load all these 5 folders simultaneously at test time?
3 - I'm also supposed to create the confusion matrix, but I couldn't understand how to apply this to my code. Some suggest using scikit-learn's confusion matrix, but then I have to define y_true and y_pred values, which I cannot fit into this code. Am I supposed to evaluate 5 different confusion matrices for 5 different predictions, and if so, how?
4 - Sometimes I observe that the validation accuracy starts much higher than the training accuracy. Is this unusual? After 3-4 epochs, the training accuracy catches up with the validation accuracy and continues in a more balanced way. I thought this should not be happening. Is everything alright?
5 - Final question: why does the first epoch take much, much longer than the other epochs? In my setup, it takes about 30-40 minutes to complete the first epoch, and then only about a minute or so to complete every other epoch. Is there a way to fix it, or does it always happen this way?
Thanks.
I am no expert in image processing with tf, but let me try to answer as much as possible:
1
I don't really understand this question, because you are using image_dataset_from_directory, which should handle the file loading process for you. So far, what you are doing there looks good to me.
2
Let me cite tf.keras.preprocessing.image_dataset_from_directory:
Then calling image_dataset_from_directory(main_directory, labels='inferred') will return a tf.data.Dataset that yields batches of images from the subdirectories class_a and class_b, together with labels 0 and 1 (0 corresponding to class_a and 1 corresponding to class_b).
And ImageDataGenerator:
Generate batches of tensor image data with real-time data augmentation. The data will be looped over (in batches).
As your data is handpicked, there is no need for ImageDataGenerator, as image_dataset_from_directory returns what you want. If you have test and validation data (which you should), you can use the tf.data.Dataset functions for splitting the data into test, train and validation sets. This can be a bit clunky, but the time spent learning tf.data.Dataset is well spent.
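As a rough sketch for your question 2 (the "test" directory name and the 180x180 image size are assumptions about your setup), the handpicked test folder with its 5 class subfolders can be loaded in a single call and predicted on as a whole:

import numpy as np
import tensorflow as tf

# assumption: 'test/' contains one subfolder per class, just like the training folder
test_ds = tf.keras.preprocessing.image_dataset_from_directory(
    "test",
    labels="inferred",
    image_size=(180, 180),
    shuffle=False,          # keep file order so predictions line up with labels
    batch_size=32)

predictions = model.predict(test_ds)           # shape: (num_test_images, 5)
pred_classes = np.argmax(predictions, axis=1)  # predicted class index per image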
3
The confusion matrix gives you the F1-score, precision and recall values. But as the confusion matrix is normally used for binary classification (which is not your case), it only returns those values for one class (and for not-this-class). Better to use the metrics TensorFlow relies on: TensorFlow can calculate recall, precision and the F1 score for you as metrics, so if you ask me, use them.
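That said, if you do want a single 5x5 confusion matrix, a sketch along these lines should work with the test_ds from the previous snippet (an assumption on my part; shuffle must be off so labels and predictions stay aligned):

import numpy as np
from sklearn.metrics import confusion_matrix

# collect the true labels from the (unshuffled) test dataset
y_true = np.concatenate([labels.numpy() for _, labels in test_ds])

# predicted class per image
y_pred = np.argmax(model.predict(test_ds), axis=1)

print(confusion_matrix(y_true, y_pred))  # 5x5 matrix, rows = true class, cols = predicted class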
4
Depending on how the data is shuffled and structured, this can be normal. When there are more difficult cases in the training data, the model will have more trouble predicting them correctly. When there are more easy cases in the validation data, the model will do better there, which gives you a higher accuracy at that point. It is indeed an indicator that the classes in your training and validation data might not be equally distributed.
5
tf.data.Dataset loads the data when needed. This means the files are not loaded into memory until the training process has started, which results in a very long first epoch (loading all images first) and very short later epochs (all images are already there). You can confirm this by checking the GPU usage of your machine; during the first epoch it should often be idle or very low.
To fix this, you can use .prefetch(z) on your dataset variable. prefetch() makes the dataset prefetch the next z values while the GPU is already doing some calculations. This might speed up the first epoch.
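A minimal sketch of what that could look like (train_ds stands in for your training dataset; the .cache() call is an addition on my part and is what keeps decoded images around after the first pass, which only works if the data fits in memory):

import tensorflow as tf

AUTOTUNE = tf.data.experimental.AUTOTUNE

train_ds = train_ds.cache()             # keep decoded images after the first pass
train_ds = train_ds.shuffle(1000)       # reshuffle each epoch
train_ds = train_ds.prefetch(AUTOTUNE)  # overlap preprocessing with GPU compute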

TensorFlow keras model fit() parameters steps_per_epoch and epochs behavior on train set

I'm using a tf.data dataset containing my training data consisting of (lets say) 100k images.
I'm also using a tf.data dataset containing my validation set.
Since an epoch of all 100k images takes quite long (in my case approximately one hour) before I get any feedback on performance on the validation set, I set the steps_per_epoch parameter in tf.keras.Model fit() to 10000.
Using a batch size of 1, this results in having 10 validation scores by the time I reach 100k images.
In order to complete one epoch over the 100k images of my entire training dataset, I set the epochs parameter to 10.
However, I'm not sure if using steps_per_epoch and epochs this way has any other consequences. Is it correct to use these parameters in order to get more frequent feedback on performance?
And also a more specific question, does it use all 100k images or does it use the same first 10k images of my training set at every 'epoch'?
I already dug into the TensorFlow docs and read several different stack overflow questions, but I couldn't find anything conclusive to answer my own question. Hope you can help!
Tensorflow version I'm using is 2.2.0.
Is it correct to use these parameters in order to get more frequent feedback on performance?
Yes, it is correct to use these parameters. Here is the code that I used to fit the model:
model.fit(
    train_data,
    steps_per_epoch = train_samples//batch_size,
    epochs = epochs,
    validation_data = test_data,
    verbose = 1,
    validation_steps = test_samples//batch_size)
Does it use all 100k images or does it use the same first 10k images of my training set at every 'epoch'?
It uses all the images in your training data.
For better understanding: an epoch is the number of times the learning algorithm works through the entire training data set, whereas steps_per_epoch is the total number of samples in your training data set divided by the batch size.
For example, if you have 100,000 training samples and use a batch size of 100, one epoch will be equivalent to 1,000 steps_per_epoch.
Note: we generally choose the batch size to be a power of 2, because optimized matrix operation libraries work most effectively with such sizes.
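In your specific setup (numbers taken from your question; the .repeat() call and the val_ds/val_steps names are assumptions about how your datasets are built), this could look roughly like the sketch below. With a repeated tf.data dataset and steps_per_epoch set, each 'epoch' should continue from where the previous one stopped rather than restarting, so the full 100k images are seen once after the 10 epochs:

train_ds = train_ds.batch(1).repeat()  # infinite stream; batch size 1 as in the question

model.fit(
    train_ds,
    steps_per_epoch=10000,   # 10k samples per 'epoch' -> validation feedback 10x more often
    epochs=10,               # 10 * 10k = 100k samples = one real pass over the data
    validation_data=val_ds,
    validation_steps=val_steps)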

What does batch_size mean in Keras' TimeSeriesGenerator?

I'm working on a video classification task with 5 classes, using a TimeDistributed CNN model. The training dataset contains 8 videos of 75 frames each. I have used Keras' TimeseriesGenerator with length equal to 75, as each video contains 75 frames. But it is unclear to me what batch_size should be in this case.
from keras.preprocessing.sequence import TimeseriesGenerator
train_sequences = TimeseriesGenerator(train_data, train_labels, length=75, batch_size=1)
Can anyone tell me what batch size should be considered for this task?
The batch size defines the number of video samples that will be fed to your model in each iteration. The difference between different values of batch size lies in how the model's weights are optimized: if the batch size is 3, the model will take in 3 sample videos and only after those 3 inputs will it update the weights.
There isn't an optimal value for batch size; it's like the No Free Lunch Theorem. I suggest you try different values and look for the best results.
There are trade-offs in choosing the batch size:
If the value is small, it will require less memory and it could be faster, since your model is processing fewer samples at a time. However, the gradient estimation will be less accurate.
If the value is big, the gradient estimation will be more accurate, but it will need more memory and could be slower.
So you have to find an optimal value balancing gradient-estimation accuracy against computational resource usage.
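To make the effect of batch_size concrete, here is a small sketch under assumed shapes (128 features per frame is a made-up number; your frames will have their own shape):

import numpy as np
from keras.preprocessing.sequence import TimeseriesGenerator

# hypothetical data: 8 videos * 75 frames = 600 frames, 128 features per frame
data = np.random.rand(600, 128)
labels = np.random.randint(0, 5, size=(600,))

gen = TimeseriesGenerator(data, labels, length=75, batch_size=4)
x_batch, y_batch = gen[0]
print(x_batch.shape)  # (4, 75, 128): 4 windows of 75 consecutive frames each
print(y_batch.shape)  # (4,): one label per window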

How to test neural network on entire testset using mxnet's dataset object?

I have trained a 3D convnet using mxnet. I saved the network architecture and parameters with the intention of testing more data with it to check its performance. Since I am not training, I do not want to iterate over the dataset in batches. How do I get the network to read in the entire dataset as input? Passing the dataset object directly to the network only gives a 4D tensor, whereas the network wants 5D. Right now I am using the DataLoader but setting the batch size to the size of the entire dataset, and I feel like there is a more efficient way to do this.
DataLoader requires either a batch_size or a BatchSampler. In theory, you could write a BatchSampler that fetches the entire dataset as one batch, though I don't think you'll see a significant performance gain if your batch size is already that large. Additionally, using batches is beneficial if you have more than one worker - have you considered using num_workers > 0 to take advantage of parallel processing?
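A minimal sketch of both options with Gluon (dataset and net stand in for your existing Dataset and loaded network, and the dataset is assumed to return (data, label) pairs):

from mxnet.gluon.data import DataLoader, BatchSampler, SequentialSampler

# option 1: a single batch the size of the whole dataset
full_loader = DataLoader(dataset, batch_size=len(dataset), num_workers=2)

# option 2: an explicit BatchSampler that yields all indices as one batch
sampler = BatchSampler(SequentialSampler(len(dataset)), batch_size=len(dataset), last_batch='keep')
batch_loader = DataLoader(dataset, batch_sampler=sampler, num_workers=2)

for data, label in full_loader:
    out = net(data)  # data now has the batch dimension the network expects (5D for a 3D convnet)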

Tensorflow: very large estimator's logs with Dataset.from_tensor_slices()

I've been studying mnist estimator code (https://github.com/tensorflow/tensorflow/blob/master/tensorflow/examples/tutorials/layers/cnn_mnist.py)
and after training for 150,000 steps with this code, the logs produced by the estimator are about 31 MB in size (13 MB for each weight checkpoint and 5 MB for the graph definition).
While tinkering with the code I wrote my own train_input_fn using tf.data.Dataset.from_tensor_slices().
My code:
def my_train_input_fn():
    mnist = tf.contrib.learn.datasets.load_dataset("mnist")
    images = mnist.train.images  # Returns np.array
    labels = np.asarray(mnist.train.labels, dtype=np.int32)
    dataset = tf.data.Dataset.from_tensor_slices(
        ({"x": images}, labels))
    dataset = dataset.shuffle(50000).repeat().batch(100)
    return dataset
And my logs, even before a single training step, right after graph initialization, were over 1.5 GB in size (165 MB for the ckpt meta file, and around 600 MB each for the events.out.tfevents and graph.pbtxt files)!
After a little research I found out that from_tensor_slices() is not appropriate for larger datasets, because it creates constants in the execution graph:
Note that the above code snippet will embed the features and labels arrays in your TensorFlow graph as tf.constant() operations. This works well for a small dataset, but wastes memory---because the contents of the array will be copied multiple times---and can run into the 2GB limit for the tf.GraphDef protocol buffer.
source:
https://www.tensorflow.org/programmers_guide/datasets
But the MNIST dataset is only around 13 MB in size. So why is my graph definition 600 MB, and not just those additional 13 MB embedded as constants? And why is the events file so big?
The original dataset-producing code (https://github.com/tensorflow/tensorflow/blob/r1.8/tensorflow/python/estimator/inputs/numpy_io.py) doesn't produce such large log files. I guess that is because of its use of queues. But queues are now deprecated and we should use tf.data.Dataset instead, right? What is the correct way to create such a dataset from a file containing images (not from TFRecord)? Should I use tf.data.FixedLengthRecordDataset?
I had a similar issue. I solved it using tf.data.Dataset.from_generator, or alternatively tf.data.Dataset.range followed by dataset.map to fetch the particular value.
E.g. with a generator:
def generator():
    for sample in zip(*datasets_tuple):
        yield sample

dataset = tf.data.Dataset.from_generator(generator,
    output_types=output_types, output_shapes=output_shapes)
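Applied to the input_fn from the question, a sketch could look like this (the output_types/output_shapes values are my assumptions based on flattened 28x28 MNIST images); since the arrays are only yielded at runtime, they are not embedded in the graph as constants:

def my_train_input_fn():
    mnist = tf.contrib.learn.datasets.load_dataset("mnist")
    images = mnist.train.images
    labels = np.asarray(mnist.train.labels, dtype=np.int32)

    def generator():
        # yield one (features, label) pair at a time instead of embedding the arrays
        for img, lbl in zip(images, labels):
            yield {"x": img}, lbl

    dataset = tf.data.Dataset.from_generator(
        generator,
        output_types=({"x": tf.float32}, tf.int32),
        output_shapes=({"x": tf.TensorShape([784])}, tf.TensorShape([])))
    dataset = dataset.shuffle(50000).repeat().batch(100)
    return dataset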
