I have 4 numpy arrays x_train, x_test, y_train, y_test which consume about 5GB of memory. I have loaded these into tensorflow with the following code.
train_dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train))
test_dataset = tf.data.Dataset.from_tensor_slices((x_test, y_test))
train_dataset and test_dataset together use about 8GB of memory. The problem is that I am running out of memory and I no longer have any use of the numpy arrays. How can I free those variables from memory?
I tried del <variable_name> in python, but it seems it deletes only the pointer and does not free the memory.
Setting the variables to 0 also doesn't work.
Here is the code if that could help.
https://colab.research.google.com/drive/1-nv_JRQnC3YBfyoacdufCnB6LRJacPCt?usp=sharing
The dataset is
https://www.kaggle.com/datasets/theoviel/rsna-breast-cancer-256-pngs
and, here is the train.csv
https://www.kaggle.com/competitions/rsna-breast-cancer-detection/data?select=train.csv
I suggest you:
1-> It is possible that tf.data.Dataset.from_tensor_slices creates a view over the original array, in which case the memory cannot be freed. In any case, try putting this part inside a function like this:
def load_data():
    # load your numpy arrays
    # x_train, x_test, y_train, y_test
    return tf.data.Dataset.from_tensor_slices((x_train, y_train)), tf.data.Dataset.from_tensor_slices((x_test, y_test))
I expect that when the function returns, any temporary variable inside the function scope will be released (including your numpy arrays). But since you mentioned that del didn't work, this may not work either. But hey, Python sometimes acts in mysterious ways.
2-> If option 1 doesn't work, try memory mapping (https://numpy.org/doc/stable/reference/generated/numpy.memmap.html).
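A minimal sketch of option 2, assuming the arrays can be written to disk once and read back in slices. The file name, shape, and dtype below are placeholders, not values from the question. Whether this helps in combination with from_tensor_slices is an open question, since the memory map is more naturally paired with a loader that reads one batch at a time.

```python
import os
import tempfile

import numpy as np

# Placeholder shape and dtype; substitute the real dimensions of x_train.
shape = (100, 64, 64)
path = os.path.join(tempfile.mkdtemp(), "x_train.dat")

# Write the data to disk once, through a writable memory map.
writer = np.memmap(path, dtype=np.float32, mode="w+", shape=shape)
writer[:] = 1.0          # stand-in for the real image data
writer.flush()
del writer               # drop the writable map; the data stays on disk

# Reopen read-only. Pages are read from disk only when they are indexed.
x_train = np.memmap(path, dtype=np.float32, mode="r", shape=shape)
batch = np.array(x_train[0:32])     # copies just these 32 samples into RAM
print(batch.shape, batch.dtype)
```

Only the slice that is explicitly copied occupies process memory; the rest of the file stays on disk until it is touched.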
According to the official Python documentation, you can call the garbage collector after deleting a variable. This frees the memory of unreferenced objects:
import gc
gc.collect()
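To make the difference between del and gc.collect() concrete, here is a small sketch using only the standard library. It assumes CPython's reference counting. The Payload class is a hypothetical stand-in for a large array, and weak references are used only to observe when an object is actually freed.

```python
import gc
import weakref


class Payload:
    """Stand-in for a large array; weak references let us observe when it is freed."""


# Case 1: `del` removes one name; the object survives while another name refers to it.
data = Payload()
alias = data
probe = weakref.ref(data)
del data
still_alive_after_first_del = probe() is not None   # `alias` still refers to it
del alias
freed_after_last_del = probe() is None              # last reference gone, object freed

# Case 2: objects in a reference cycle are reclaimed only by the cycle collector.
gc.disable()            # keep the automatic collector from running mid-example
a, b = Payload(), Payload()
a.other, b.other = b, a
cycle_probe = weakref.ref(a)
del a, b
alive_before_collect = cycle_probe() is not None    # the cycle keeps both alive
gc.collect()
freed_after_collect = cycle_probe() is None         # the collector reclaimed the cycle
gc.enable()
```

If del appears to have no effect, it is worth checking whether something else (a list, a closure, or another library) still holds a reference to the same array.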
By my reading of your notebook,
train_dataset, test_dataset = load_data()
gets 2 tensors, with tensorflow reporting on memory use.
Within load_data, all the PNGs are loaded into a list, dataset (as arrays). That is then split into train/test with
x_train, x_test, y_train, y_test = train_test_split(dataset, labels, test_size=0.2, random_state=69)
I'm not familiar with train_test_split and don't know whether it returns lists of arrays or arrays. That is, I don't know if the following np.asarray line is needed or not. Either way you end up with copies of all the PNG arrays in dataset (the selection is random, not by slice).
Then you make tensors from those arrays, and return those.
Since dataset and the train/test arrays are local to load_data, they will be deleted and their space freed when the function returns. For smallish arrays, the freed memory may be kept in a free pool for later reuse, but these are probably large enough that the space is returned to the OS. Before exiting the function, memory use could be quite high, roughly 3x the size of dataset (dataset, plus the split copies, plus the tensors).
While I see an "Allocation of 2868117504 exceeds 10% of free system memory." warning, I don't see a memory error. The error displayed in the notebook is not a memory error.
I have data in large files with a custom format. At the moment I am creating a tf.data.Dataset with the name of those files, and then calling a tf.py_function to access them as needed for my training. The py_function has to load the complete file (nearly 8GB) into memory in order to build an array of just a few MB (1024x1024x4). The py_function only returns that array and the corresponding label. The problem is that each sample that I load increases my CPU RAM usage by nearly 8GB. Very quickly, my computer runs out of RAM and the program crashes. When I run my program outside VS Code, I get through twice as many batches as when I use the debugger, but it's still 13 batches max. (I have 32GB of CPU RAM, and 13*8 > 32, so it looks like the memory gets freed sometimes, but maybe not fast enough?)
I keep the batch_size and prefetch both small so that only a few of these large arrays need to be in memory at the same time. I expected that tensorflow would free up that memory once the py_function exits and it is out of scope. I tried to encourage memory to be freed earlier by explicitly deleting the variable and calling the garbage collector, but that didn't help.
I don't think I can create a minimum working example because the data format and the methods to load the data are custom, but here are the relevant parts of my code:
import pickle
import gc
import tensorflow as tf
def tf_load_raw_data_and_labels(raw_data_files, label_files):
    [raw_data, labels] = tf.py_function(load_raw_data_and_labels, [raw_data_files, label_files], [tf.float32, tf.float32])
    raw_data.set_shape((1024, 1024, 4))
    labels.set_shape((1024))
    return raw_data, labels
def load_raw_data_and_labels(raw_data_file, label_file):
    #load 8GB big_datacube, extract what I need into raw_data
    del big_datacube
    gc.collect() #no noticeable difference
    return raw_data, labels
with open("train_tiles_list.pickle", "rb") as fid:
    raw_data_files, label_files = pickle.load(fid)
train_dataset = tf.data.Dataset.from_tensor_slices((raw_data_files, label_files))
train_dataset = train_dataset.shuffle(n_train)#.repeat()
train_dataset = train_dataset.map(tf_load_raw_data_and_labels)
train_dataset = train_dataset.batch(batch_size)
train_dataset = train_dataset.prefetch(1)
I usually train a ResNet50 model using the model.fit() function from a tf.keras.Model, but I also tried the setup from the tf2 quickstart tutorial, which lets me set a debug point after each batch has trained. At the debug point I checked the memory usage of active variables. The list is very short, and there are no variables over 600KiB. At this point gc.collect() returns a number between 10 and 20, even after running it a few times, but I'm not too sure what that means.
It might end up being easiest to crunch through all the large files and save the smaller arrays to their own files, before I start any training. But for now, I'd like to understand if there is something fundamental causing the memory to not be freed. Is it a memory leak? Perhaps related to tf.data.Datasets, py_functions, or something else specific to my setup?
Edit: I have read that python's garbage collection was updated with python3.4. Because of a dependency related to my custom data, I am using python2.7. Could that be part of the problem?
Edit 2: I found some github issues about memory leaks when using tensorflow. The proposed workaround (tf.random.set_seed(1)) doesn't work for me:
https://github.com/tensorflow/tensorflow/issues/31253
https://github.com/tensorflow/tensorflow/issues/19671
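The fallback mentioned in the question (crunching through the large files once and saving the small arrays to their own files) could look roughly like the sketch below. The extract_tile function and the use of np.load are hypothetical stand-ins for the custom format and loader, which the question does not show.

```python
import os
import tempfile

import numpy as np


def extract_tile(big_datacube):
    """Hypothetical stand-in for the custom extraction step."""
    return np.ascontiguousarray(big_datacube[:4, :4, :2], dtype=np.float32)


def preprocess(big_files, out_dir):
    """Save one small .npy per large input file and return the new paths."""
    small_paths = []
    for big_path in big_files:
        big_datacube = np.load(big_path)     # stand-in for the custom loader
        tile = extract_tile(big_datacube)
        out_path = os.path.join(out_dir, os.path.basename(big_path) + ".tile.npy")
        np.save(out_path, tile)
        small_paths.append(out_path)
    return small_paths


# Tiny demonstration with synthetic "large" files.
work_dir = tempfile.mkdtemp()
big_files = []
for i in range(3):
    p = os.path.join(work_dir, "cube_%d.npy" % i)
    np.save(p, np.full((16, 16, 8), i, dtype=np.float64))
    big_files.append(p)

small_paths = preprocess(big_files, work_dir)
```

The training pipeline would then map over small_paths, so the py_function never has to touch the 8GB files.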
So I'm trying to train my CNN with multiple datasets, and it seems that when I add enough data (such as when I add multiple sets as one, or when I try to add the one that has over a million samples) it throws a ResourceExhaustedError.
As for the instructions here, I tried adding
from keras.backend.tensorflow_backend import set_session
import tensorflow as tf
config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.3
set_session(tf.Session(config=config))
to my code, but this doesn't seem to make a difference.
I see 0.3 after printing out config.gpu_options.per_process_gpu_memory_fraction, so that part seems to be OK.
I even threw in a config.gpu_options.allow_growth = True for good measure, but it doesn't seem to do anything other than attempt to use all the memory at once, only to find that it isn't enough.
The computer I'm trying to use to train this CNN has 4 GTX1080 Ti's with 12gb of dedicated memory each.
EDIT: I'm sorry for not specifying how I was loading the data; I honestly didn't realise there was more than one way. When I was learning, the examples always loaded datasets that were already built in, and it took me a while to work out how to load a self-supplied dataset.
The way I'm doing it is that I'm creating two numpy arrays. One has the path of each image and the other has the corresponding label. Here's the most basic example of this:
import glob
import os
import numpy as np
data_dir = "folder_name"
files = []
authors = []
# There is a folder for every form and in that folder is every line of that form
for filename in glob.glob(os.path.join(data_dir, '*', '*')):
    # the format for file names is: "{author id}-{form id}-{line number}.png"
    # filename is a path to the file, so .split('\\')[-1] gets the raw file name without the path (on Windows) and .split('-')[0] gets the author id
    authors.append(filename.split('\\')[-1].split('-')[0])
    files.append(filename)
# keras requires numpy arrays
img_files = np.asarray(files)
img_targets = np.asarray(authors)
Are you sure you're not using a giant batch_size?
"Adding data": honestly I don't know what that means and if you could please describe exactly, with code, what you're doing here, it would be of help.
The number of samples should not cause any problems with GPU memory at all. What does cause a problem is a big batch_size.
Loading a huge dataset could cause a CPU RAM problem, not related with keras/tensorflow. A problem with a numpy array that is too big. (You can test this by simply loading your data "without creating any models")
If that is your problem, you should use a generator to load batches gradually. Again, since there is absolutely no code in your question, we can't help much.
But these are two forms of simply creating a generator for images:
Use the existing ImageDataGenerator and its flow_from_directory() method, explained here
Create your own coded generator, which can be:
A loop with yield
A class derived from keras.utils.Sequence
A quick example of a loop generator:
def imageBatchGenerator(imageFiles, imageLabels, batch_size):
    while True:
        batches = len(imageFiles) // batch_size
        if len(imageFiles) % batch_size > 0:
            batches += 1
        for b in range(batches):
            start = b * batch_size
            end = (b+1) * batch_size
            images = loadTheseImagesIntoNumpy(imageFiles[start:end])
            labels = imageLabels[start:end]
            yield images,labels
Warning: even with generators, you must make sure your batch size is not too big!
Using it:
model.fit_generator(imageBatchGenerator(files,labels,batchSize), steps_per_epoch = theNumberOfBatches, epochs= ....)
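The value passed as steps_per_epoch has to match the number of batches the generator produces in one pass over the data. A small sketch of that arithmetic follows, using only the standard library. The helper names and the placeholder file list are illustrative and not part of Keras.

```python
import math


def batch_slices(n_samples, batch_size):
    """Yield (start, end) index pairs covering n_samples once; the last batch may be short."""
    for start in range(0, n_samples, batch_size):
        yield start, min(start + batch_size, n_samples)


def steps_per_epoch(n_samples, batch_size):
    """Number of batches the generator produces per pass over the data."""
    return int(math.ceil(n_samples / float(batch_size)))


files = ["img_%d.png" % i for i in range(10)]   # placeholder file names
batch_size = 4

print(steps_per_epoch(len(files), batch_size))          # 3
print(list(batch_slices(len(files), batch_size)))       # [(0, 4), (4, 8), (8, 10)]
```

With 10 files and a batch size of 4, the final batch holds only 2 samples, which is why the count is rounded up rather than down.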
Dividing your model among GPUs
You should be able to decide which layers are processed by which GPU, this "could" probably optimize your RAM usage.
Example, when creating the model:
with tf.device('/gpu:0'):
    createLayersThatGoIntoGPU0
with tf.device('/gpu:1'):
    createLayersThatGoIntoGPU1
#you will probably need to go back to a previous GPU, as you must define your layers in a proper sequence
with tf.device('/gpu:0'):
    createMoreLayersForGPU0
#and so on
I'm not sure this would be better or not, but it's worth trying too.
See more details here: https://keras.io/getting-started/faq/#how-can-i-run-a-keras-model-on-multiple-gpus
The ResourceExhaustedError is raised because you're trying to allocate more memory than is available in your GPUs or main memory. The memory allocation is approximately equal to your network footprint (to estimate this, save a checkpoint and look at the file size) plus your batch size multiplied by the size of a single element of your dataset.
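As a rough illustration of that estimate, the sketch below adds a model footprint to the size of one batch of inputs. All numbers are made up for the example, and the function name is hypothetical. It also ignores activations and gradients, so it should be read as a lower bound rather than a prediction.

```python
def estimate_bytes(model_bytes, batch_size, element_shape, bytes_per_value=4):
    """Rough lower bound: model footprint plus one batch of inputs."""
    values_per_element = 1
    for dim in element_shape:
        values_per_element *= dim
    return model_bytes + batch_size * values_per_element * bytes_per_value


# Illustrative numbers only: a 100 MiB checkpoint and 256x256 RGB float32 images.
model_bytes = 100 * 1024 ** 2
for batch_size in (32, 256, 2048):
    total = estimate_bytes(model_bytes, batch_size, (256, 256, 3))
    print(batch_size, round(total / 1024.0 ** 3, 2), "GiB")
```

The only inputs are the model size, the batch size, and the size of one element. The number of samples in the dataset does not appear anywhere.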
It's difficult to answer your question directly without some more information about your setup, but there is one element of this question that caught my attention: you said that you get the error when you "add enough data" or "use a big enough dataset." That's odd. Notice that the size of your dataset is not included in the calculation of the memory allocation footprint. Thus, the size of the dataset shouldn't matter. Since it does, that seems to imply that you are somehow attempting to load your entire dataset into GPU memory or main memory. If you're doing this, that's the origin of your problem. To fix it, use the TensorFlow Dataset API. Using a Dataset sidesteps the limited memory by hiding the data behind an Iterator that only yields batches when called. Alternatively, you could use the older feed_dict and QueueRunner data feeding structure, but I don't recommend it. You can find some examples of this here.
If you are already using the Dataset API, you'll need to post more of your code as an edit to your question for us to help you.
There is no setting that magically allows you more memory than your GPU has. It looks to me like your inputs are just too big to fit in the GPU RAM (along with all the required state and gradients).
You should use config.gpu_options.allow_growth = True, not in order to get more memory, but to get an idea of how much memory you need per input length. Start with a small length, check with nvidia-smi how much RAM your GPU uses, and then increase the length. Repeat until you know the maximal length of inputs (batch size) that your GPU can hold.
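The trial-and-error procedure above can be written as a simple doubling search. This is only a sketch: fits is a hypothetical callable you would supply yourself, for example one that runs a single training step at the given batch size and returns False when a ResourceExhaustedError is caught.

```python
def largest_fitting_batch(fits, start=1, limit=4096):
    """Double the batch size until `fits` reports failure, then return the last success.

    `fits` is a hypothetical callable supplied by the caller: it should run one
    training step at the given batch size and return False on an out-of-memory error.
    """
    best = 0
    size = start
    while size <= limit and fits(size):
        best = size
        size *= 2
    return best


# Stand-in predicate for demonstration: pretend anything above 200 samples runs out of memory.
print(largest_fitting_batch(lambda batch_size: batch_size <= 200))   # 128
```

Doubling finds the right order of magnitude quickly; watching nvidia-smi at the last successful size then shows how much headroom is left.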
I've been studying mnist estimator code (https://github.com/tensorflow/tensorflow/blob/master/tensorflow/examples/tutorials/layers/cnn_mnist.py)
and after training for 150 000 steps with this code, the logs produced by the estimator are 31M in size (13M for each weight checkpoint and 5M for the graph definition).
While tinkering with code I wrote my own train_input_fn using tf.data.Dataset.from_tensor_slices().
My code here:
def my_train_input_fn():
    mnist = tf.contrib.learn.datasets.load_dataset("mnist")
    images = mnist.train.images # Returns np.array
    labels = np.asarray(mnist.train.labels, dtype=np.int32)
    dataset = tf.data.Dataset.from_tensor_slices(
        ({"x": images}, labels))
    dataset = dataset.shuffle(50000).repeat().batch(100)
    return dataset
And my logs, even before a single training step and right after graph initialization, were over 1.5G in size! (165M for ckpt-meta, and around 600M each for the events.out.tfevents and graph.pbtxt files.)
After a little research I found out that the function from_tensor_slices() is not appropriate for larger datasets, because it creates constants in the execution graph.
Note that the above code snippet will embed the features and labels
arrays in your TensorFlow graph as tf.constant() operations. This
works well for a small dataset, but wastes memory---because the
contents of the array will be copied multiple times---and can run into
the 2GB limit for the tf.GraphDef protocol buffer.
source:
https://www.tensorflow.org/programmers_guide/datasets
But the mnist dataset is only around 13M in size. So why is my graph definition 600M, rather than just the additional 13M embedded as constants? And why is the events file so big?
The original dataset-producing code (https://github.com/tensorflow/tensorflow/blob/r1.8/tensorflow/python/estimator/inputs/numpy_io.py) doesn't produce such large log files. I guess that is because it uses queues. But queues are now deprecated and we should use tf.data.Dataset instead, right? What is the correct way to create such a dataset from a file containing images (not from TFRecord)? Should I use tf.data.FixedLengthRecordDataset?
I had a similar issue, which I solved using tf.data.Dataset.from_generator, or tf.data.Dataset.range followed by dataset.map to get the particular value.
E.g. with generator
def generator():
    for sample in zip(*datasets_tuple):
        yield sample
dataset = tf.data.Dataset.from_generator(generator,
    output_types=output_types, output_shapes=output_shapes)
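For reference, here is a self-contained variant of the generator approach. It assumes a TensorFlow 2.x release recent enough to accept the output_signature argument (the snippet above uses the older output_types/output_shapes arguments), and the random arrays are placeholders for the real images and labels.

```python
import numpy as np
import tensorflow as tf

# Stand-in data; in practice these would be the real arrays.
images = np.random.rand(1000, 28, 28).astype(np.float32)
labels = np.random.randint(0, 10, size=1000).astype(np.int32)


def sample_generator():
    # Yields one (image, label) pair at a time instead of handing the
    # whole array to TensorFlow at once.
    for image, label in zip(images, labels):
        yield image, label


dataset = tf.data.Dataset.from_generator(
    sample_generator,
    output_signature=(
        tf.TensorSpec(shape=(28, 28), dtype=tf.float32),
        tf.TensorSpec(shape=(), dtype=tf.int32),
    ),
)
dataset = dataset.shuffle(1000).batch(100)
```

Because the samples are pulled from Python one at a time, the arrays are not turned into constants when the dataset is defined.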
I think the title is self explanatory but to ask it in details, there's sklearn's method train_test_split() which works like: X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.3, stratify = Y) It means: the method will split data with 0.3 : 0.7 ratio and will try to make percentage of labels in both data equal. Is there a keras equivalent of this?
Now there is, using the Dataset class (tf.data.Dataset). I'm running keras-2.2.4-tf along with the new tensorflow release.
Basically, load all the data into a Dataset using something like tf.data.Dataset.from_tensor_slices. Then split the data into new datasets for training and validation. For example, shuffle all the records in the dataset. Then use all but the first 400 as training and the first 400 as validation.
ds = ds_in.shuffle(buffer_size=rec_count)
ds_train = ds.skip(400)
ds_validate = ds.take(400)
An instance of the Dataset class is a natural container to pass around for the Keras models. I copied the concept from a tensorflow or keras training example but can't seem to find it again.
The canned datasets using the load_data method create numpy.ndarray classes so they are a little different but can be easily converted to a keras Dataset. I suspect this hasn't been done because so much existing code would break.
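One caveat with the shuffle/skip/take split: by default, shuffle reshuffles every time the dataset is iterated, so the records selected by take and skip can differ from one pass to the next, and the two subsets may then overlap. The sketch below assumes TensorFlow 2.x and uses placeholder counts. It fixes the order with reshuffle_each_iteration=False and checks that the two subsets are disjoint.

```python
import tensorflow as tf

rec_count = 1000          # placeholder record count
val_count = 400

ds_in = tf.data.Dataset.range(rec_count)

# reshuffle_each_iteration=False keeps the shuffled order fixed, so
# take() and skip() see the same permutation every time they are iterated.
ds = ds_in.shuffle(buffer_size=rec_count, seed=0, reshuffle_each_iteration=False)
ds_validate = ds.take(val_count)
ds_train = ds.skip(val_count)

val_ids = set(int(x) for x in ds_validate.as_numpy_iterator())
train_ids = set(int(x) for x in ds_train.as_numpy_iterator())
print(len(val_ids), len(train_ids), len(val_ids & train_ids))
```

If the training set still needs a fresh order every epoch, a second shuffle can be applied to ds_train after the split.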
Unfortunately, the answer (despite our wish) is No! There are some existing datasets like MNIST etc. which can be directly loaded:
(X_train, y_train), (X_test, y_test) = mnist.load_data()
Loading these datasets already split can give false hope that a general method exists, but unfortunately there isn't one here, though you may be interested in using the scikit-learn wrappers for Keras.
There is a similar question on Data Science SE.
So I'm reading a bit more about moving data from the CPU -> GPU in Tensorflow, and I see that feed_dict is still slow:
https://github.com/tensorflow/tensorflow/issues/2919
The immediate options I see for "moving" Python variables over to the GPU are:
#1. Tensorflow constant
a = tf.constant(data, name='a')
#2. Tensorflow Variable
b = tf.Variable(data, name='b')
#3. Tensorflow placeholder
c = tf.placeholder(dtype=dtype, shape=[x,y,z ...], name='c')
Options #1 and #2 aren't practical for very large dataset variables (since you're actually preloading the data into memory), as we'll quickly exceed the 2GB graph limit. That currently makes #3 the better choice for getting large Python vars into Tensorflow, but then you're forced into using feed_dict.
Are there other options for moving Python variables to the GPU besides #1, #2, and #3? I'm referring to using...
with tf.device('/gpu:0'):
# create tensorflow object(s), whether it's tf.Variable, tf.constant, etc
If I'm understanding correctly, we can use the input pipeline features to work around this issue? I'm referring to these two here:
https://datascience.stackexchange.com/questions/17559/input-pipeline-for-tensorflow-on-gpu
https://stackoverflow.com/a/38956678/7093236
Is there anything I can do to further enhance the speed of putting everything on the Tensorflow side?
The best way is to use a TensorFlow queue to speed up data transfer.
You can follow these steps even if you don't have label files:
# data_files and label_files are lists; they may hold data file paths and label values.
filename_queue = tf.train.slice_input_producer([data_files, label_files], shuffle=True)
# filename_queue = tf.train.slice_input_producer(data_files, shuffle=True)
# Some steps to decode the files and process
......
data, label = some_function(filename_queue)
# Define batch size and get batch for processing
image_batch, label_batch = tf.train.shuffle_batch([data, label], batch_size=batch_size, num_threads=num_threads)
The Dataset API is the future-proof way of moving data to the GPU. All reasonable optimizations, like those explained in the tensorflow performance guide, will eventually be available there.