I load my data with the open_memmap function and it takes about 5 GB of RAM. Then I compile the model, which has 89,268,608 parameters, and that does not take any noticeable RAM. My batch size is currently 200 and the input images have shape (300, 54, 3).
My problem is that when I call model.fit in Keras, my RAM usage increases from 5 GB to 24 GB. My question is: why?
When I try different batch sizes nothing changes, and around 23 GB of RAM stays occupied.
If somebody can explain to me what is happening, I would highly appreciate it.
Thanks!
Keras' fit method loads all of the data into memory at once, which means changing your batch size will have no effect on the RAM it takes up. Have a look at fit_generator, which is designed for use with a large dataset.
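For example, here is a minimal sketch of the generator approach, assuming the data sits in .npy files; the file names, batch size, and n_samples below are placeholders rather than anything from the question:

import numpy as np

def batch_generator(x_path, y_path, batch_size):
    # mmap_mode keeps the arrays on disk; only the slices requested below are read into RAM
    x = np.load(x_path, mmap_mode="r")
    y = np.load(y_path, mmap_mode="r")
    n = x.shape[0]
    while True:  # Keras expects the generator to loop indefinitely
        for start in range(0, n, batch_size):
            # np.asarray copies just this one batch into memory
            yield np.asarray(x[start:start + batch_size]), np.asarray(y[start:start + batch_size])

# model.fit_generator(batch_generator("x_train.npy", "y_train.npy", 200),
#                     steps_per_epoch=int(np.ceil(n_samples / 200.0)), epochs=10)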
Related
I am switching to training on GPUs and found that training crashes at a fairly arbitrary, and not very big, batch size. With 256x256 RGB images in a UNET, a batch of 32 causes an out-of-memory crash, while 16 works successfully. The amount of memory consumed was surprising, as I never ran into an out-of-memory error on a 16 GB RAM system. Is TensorFlow free to use swap?
How can I check the amount of total memory available on a GPU? Many guides online only look at memory used.
How does one estimate the memory needs? Image size (pixels * channels * dtype size) * batch size + parameter count * float size?
Many thanks,
Bogdan
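For what it's worth, a rough sketch of both checks; the parameter count below is a placeholder, and an estimate like this ignores the per-layer activations, which usually dominate GPU memory in a UNET:

import subprocess

# Total and used GPU memory as reported by the driver (nvidia-smi ships with the NVIDIA driver)
print(subprocess.check_output(
    ["nvidia-smi", "--query-gpu=memory.total,memory.used", "--format=csv"]).decode())

# Back-of-the-envelope estimate in the spirit of the formula above (float32 = 4 bytes)
batch, h, w, c = 32, 256, 256, 3
params = 31000000  # placeholder parameter count
input_mb = batch * h * w * c * 4 / 1e6
param_mb = params * 4 / 1e6
print("inputs per batch: %.1f MB, parameters: %.1f MB" % (input_mb, param_mb))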
Suppose I have a machine with very little RAM but a beefy GPU. Is it possible to bypass RAM and load a .npy file directly onto the GPU?
For example,
model.fit(x=None, y=None)
What if my X is too big to fit in memory? Is there a mechanism where I can either load it segment by segment from the file system as training iterates through batches, or simply pass the whole tensor to the GPU? I think my first option is answered by Training a Keras model from batches of .npy files using generator?
Technically speaking, it could be done. CreateFileMapping would allow data to be transferred directly from disk to the GPU. However, I don't believe TensorFlow supports anything like that.
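As a sketch of the first option (loading segment by segment), NumPy can memory-map the .npy file so the full array never sits in RAM, and a tf.data pipeline then feeds it to the GPU batch by batch. The file names and shapes below are placeholders:

import numpy as np
import tensorflow as tf

x = np.load("X_big.npy", mmap_mode="r")  # stays on disk until sliced
y = np.load("y_big.npy", mmap_mode="r")

def gen():
    batch_size = 32
    for start in range(0, x.shape[0], batch_size):
        # only this slice is read from disk into host memory
        yield (np.asarray(x[start:start + batch_size], dtype=np.float32),
               np.asarray(y[start:start + batch_size], dtype=np.float32))

dataset = tf.data.Dataset.from_generator(
    gen,
    output_signature=(
        tf.TensorSpec(shape=(None, 224, 224, 3), dtype=tf.float32),  # placeholder shape
        tf.TensorSpec(shape=(None,), dtype=tf.float32),
    ),
).prefetch(tf.data.AUTOTUNE)

# model.fit(dataset, epochs=10)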
For a project I'm working on, I am using an altered version of Mask RCNN to train a model that will find objects in an image. These images are relatively small, about 300 x 200 pixels, and I train them for a relatively long time, around 100 epochs.
However, my main question relates to the batch size and how TensorFlow allocates memory on the GPU for the validation stage per epoch. I want to increase my batch size to help smooth out the validation curve and increase the accuracy of the overall model. However, if I increase my batch size too drastically, I get an OOM: GPU out-of-memory and keras_scratch_graph error. I'm currently working with two NVIDIA Quadro P5000s that have 16GB of VRAM each. With about 3 images per GPU, I can have a maximum batch size of 6 before it errors out. I've looked around and most people either say to just decrease the batch size, which I would prefer not to do, or to enable GPU memory growth, which I couldn't get to work either. I could decrease the complexity of my model to reduce the size of the tensors being evaluated, but I don't want to risk it, as it could cause my accuracy to decrease or my loss to increase.
Is there a way that I can offload some images onto my physical system's memory, or am I purely limited to the amount of RAM I have available on my GPU? Are there any more compact or robust methods out there that could solve this issue?
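For reference, a minimal sketch of how GPU memory growth is usually enabled in TF2; it only changes when memory is reserved, not how much a given batch actually needs, so it is not a substitute for a smaller batch or model:

import tensorflow as tf

# Must run before any GPU operation creates a device context
for gpu in tf.config.list_physical_devices("GPU"):
    tf.config.experimental.set_memory_growth(gpu, True)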
I have data in large files with a custom format. At the moment I am creating a tf.data.Dataset with the name of those files, and then calling a tf.py_function to access them as needed for my training. The py_function has to load the complete file (nearly 8GB) into memory in order to build an array of just a few MB (1024x1024x4). The py_function only returns that array and the corresponding label. The problem is that each sample that I load increases my CPU RAM usage by nearly 8GB. Very quickly, my computer runs out of RAM and the program crashes. When I run my program outside VS Code, I get through twice as many batches as when I use the debugger, but it's still 13 batches max. (I have 32GB of CPU RAM, and 13*8 > 32, so it looks like the memory gets freed sometimes, but maybe not fast enough?)
I keep the batch_size and prefetch both small so that only a few of these large arrays need to be in memory at the same time. I expected that tensorflow would free up that memory once the py_function exits and it is out of scope. I tried to encourage memory to be freed earlier by explicitly deleting the variable and calling the garbage collector, but that didn't help.
I don't think I can create a minimum working example because the data format and the methods to load the data are custom, but here are the relevant parts of my code:
import pickle
import gc
import tensorflow as tf
def tf_load_raw_data_and_labels(raw_data_files, label_files):
    # wrap the Python loader so it can be used inside a tf.data pipeline
    [raw_data, labels] = tf.py_function(load_raw_data_and_labels, [raw_data_files, label_files], [tf.float32, tf.float32])
    raw_data.set_shape((1024, 1024, 4))
    labels.set_shape((1024,))
    return raw_data, labels

def load_raw_data_and_labels(raw_data_file, label_file):
    #load 8GB big_datacube, extract what I need into raw_data
    del big_datacube
    gc.collect() #no noticeable difference
    return raw_data, labels
with open("train_tiles_list.pickle", "rb") as fid:
    raw_data_files, label_files = pickle.load(fid)
train_dataset = tf.data.Dataset.from_tensor_slices((raw_data_files, label_files))
train_dataset = train_dataset.shuffle(n_train)#.repeat()
train_dataset = train_dataset.map(tf_load_raw_data_and_labels)
train_dataset = train_dataset.batch(batch_size)
train_dataset = train_dataset.prefetch(1)
I usually train a ResNet50 model using the model.fit() function from a tf.keras.Model, but I also tried the setup from the tf2 quickstart tutorial, which lets me set a debug point after each batch has trained. At the debug point I checked the memory usage of active variables. The list is very short, and there are no variables over 600KiB. At this point gc.collect() returns a number between 10 and 20, even after running it a few times, but I'm not too sure what that means.
It might end up being easiest to crunch through all the large files and save the smaller arrays to their own files, before I start any training. But for now, I'd like to understand if there is something fundamental causing the memory to not be freed. Is it a memory leak? Perhaps related to tf.data.Datasets, py_functions, or something else specific to my setup?
Edit: I have read that python's garbage collection was updated with python3.4. Because of a dependency related to my custom data, I am using python2.7. Could that be part of the problem?
Edit 2: I found some github issues about memory leaks when using tensorflow. The proposed workaround (tf.random.set_seed(1)) doesn't work for me:
https://github.com/tensorflow/tensorflow/issues/31253
https://github.com/tensorflow/tensorflow/issues/19671
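A minimal sketch of the "crunch through all the large files once" idea, with hypothetical load_big_datacube and extract_tile helpers standing in for the custom loading code (kept Python 2.7-compatible):

import os
import numpy as np

def preprocess(raw_data_files, out_dir="tiles"):
    for i, path in enumerate(raw_data_files):
        big_datacube = load_big_datacube(path)  # hypothetical: reads one ~8GB custom file
        raw_data = extract_tile(big_datacube)   # hypothetical: pulls out the (1024, 1024, 4) array
        np.save(os.path.join(out_dir, "tile_%05d.npy" % i), raw_data.astype(np.float32))
        del big_datacube                        # the large array never outlives one loop iteration

The training pipeline can then map over the small .npy files (or a single memory-mapped stack of them) instead of touching the 8GB files through the py_function.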
I have installed the TensorFlow CPU version. I have only a few images as a dataset and I am training on a machine with 4GB of RAM and a Core i5-3340M at 2.70GHz with batch size 1, and it is still extremely slow. All the images are the same size (200x185, I think). Will it train like this? Kindly tell me how I can speed up this process.
Training process
If your network is deep, it can take a long time to train on a CPU, since a CPU is not optimized for these calculations the way a GPU is.
I would suggest getting a graphics card; even an old one can significantly improve performance (it could be something like 100x faster).
Let's put some numbers here. You are dealing with images of size 200x185. Do you realize we are talking about 37,000 features, if we deal with gray levels? If we deal with RGB, multiply that by 3. How many images are you using for training? Keep in mind that SGD (Stochastic Gradient Descent with mini-batch size = 1) tends to be very slow for big datasets... Give us some numbers: how many training images, and what does "slow" mean? How much time does one epoch take? Something else: the programming language, library (TensorFlow, etc.), optimizer, etc. would help us judge whether your code is "slow" and whether it can be made faster.
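Just to make that arithmetic concrete (float32 inputs assumed):

h, w = 200, 185
gray_features = h * w             # 37,000 input features per grayscale image
rgb_features = gray_features * 3  # 111,000 input features per RGB image
print(gray_features, rgb_features, rgb_features * 4, "bytes per RGB image as float32")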
Batch size is another parameter that affects training time: a larger batch size reduces the time per epoch, but requires more epochs to reach the same performance as batch size = 1.
And if your network is deep (a CNN, etc.), you should run it on a GPU.