I am trying to run some regression models on a GPU, but I get very low GPU utilization, at most about 20%. Going through the code, I found this:
for i in range(epochs):
    rand_index = np.random.choice(args.train_pr,
                                  size=args.batch_size)
    rand_x = X_train[rand_index]
    rand_y = Y_train[rand_index]
I use these three lines to select a random batch for each iteration. So I wanted to ask: while one training step is running, can I prepare the batch for the next iteration in the background?
I am working on a regression problem, not a classification problem. I have already looked at threading in TensorFlow, but the examples I found only cover images; there is no example for a big matrix of size 100000 x 1000 used for training.
You have a large numpy array that lies on the host memory. You want to be able to process it in parallel on the CPU and send batches to the device.
Since TF 1.4, the best way to do it is to use tf.data.Dataset, and in particular tf.data.Dataset.from_tensor_slices. However, as the documentation points out, you should probably not pass your numpy arrays directly to this function: they would be embedded in the graph as tf.constant operations, which wastes memory because the contents are copied several times, and can run into the 2GB limit for the tf.GraphDef protocol buffer. What you should do instead is use placeholders. The example given in the doc is pretty self-explanatory:
features_placeholder = tf.placeholder(features.dtype, features.shape)
labels_placeholder = tf.placeholder(labels.dtype, labels.shape)
dataset = tf.data.Dataset.from_tensor_slices((features_placeholder, labels_placeholder))
# [Other transformations on `dataset`...]
iterator = dataset.make_initializable_iterator()
sess.run(iterator.initializer, feed_dict={features_placeholder: features,
                                          labels_placeholder: labels})
Further preprocessing or data augmentation can be applied to the slices using the .map method. For those operations to run concurrently, use TensorFlow operations only and avoid wrapping Python operations with tf.py_func.
This is a good use case for generators. You can set up a generator function to yield slices of your numpy matrices one chunk at a time. If you use a package like Keras, you can supply the generator directly to the fit_generator function (train_on_batch, by contrast, takes one batch of arrays at a time). If you prefer to use TensorFlow directly, you can use:
sess = tf.Session()
sess.run(init)
batch_gen = generator(data)
batch = next(batch_gen)  # batch_gen.next() only works on Python 2
sess.run([optimizer, loss, ...], feed_dict={X: batch[0], y: batch[1]})
Note: the optimizer and loss above are stand-ins; you have to replace them with your own definitions. Note also that your generator should yield an (x, y) tuple. If you are unfamiliar with generators, there are many examples online, but here is a simple example from the Keras documentation that shows how you can read in numpy matrices from a file in batches:
def generate_arrays_from_file(path):
    while 1:
        f = open(path)
        for line in f:
            x, y = process_line(line)
            yield (x, y)
        f.close()
More fundamentally, though, low GPU usage does not necessarily indicate a problem with loading batches; it may simply mean that your batch size is too small.
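To make the generator idea concrete for in-memory arrays, here is a minimal sketch that samples random batches as in the question and keeps one batch ready in a background thread. The names random_batches and prefetch, the array sizes, and the buffer size are all made up for illustration, and how much overlap you actually get depends on how much of the work releases Python's GIL.

```python
import queue
import threading

import numpy as np


def random_batches(x, y, batch_size, num_batches, seed=0):
    """Yield (x_batch, y_batch) tuples, each sampled at random with replacement."""
    rng = np.random.RandomState(seed)
    for _ in range(num_batches):
        idx = rng.choice(len(x), size=batch_size)
        yield x[idx], y[idx]


def prefetch(generator, buffer_size=1):
    """Run `generator` in a background thread, keeping up to `buffer_size` items ready."""
    buffer = queue.Queue(maxsize=buffer_size)
    done = object()  # sentinel marking the end of the stream

    def worker():
        for item in generator:
            buffer.put(item)  # blocks while the buffer is full
        buffer.put(done)

    threading.Thread(target=worker, daemon=True).start()
    while True:
        item = buffer.get()
        if item is done:
            return
        yield item


# Hypothetical stand-ins for X_train / Y_train from the question.
X_train = np.random.rand(1000, 20).astype(np.float32)
Y_train = np.random.rand(1000, 1).astype(np.float32)

batches = []
for rand_x, rand_y in prefetch(random_batches(X_train, Y_train, 32, num_batches=5)):
    # In a real training loop, the sess.run(...) call would go here.
    batches.append((rand_x, rand_y))
```

While the loop body is busy with one batch, the worker thread is already building the next one, which is the behaviour the question asks about.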
I use this code to test CatBoostClassifier.
import numpy as np
from catboost import CatBoostClassifier, Pool
# initialize data
train_data = np.random.randint(0, 100, size=(100, 10))
train_labels = np.random.randint(0, 2, size=(100))
test_data = Pool(train_data, train_labels) #What is Pool?When to use Pool?
# test_data = np.random.randint(0,100, size=(20, 10)) #Usually we will use numpy array,will not use Pool
model = CatBoostClassifier(iterations=2,
                           depth=2,
                           learning_rate=1,
                           loss_function='Logloss',
                           verbose=True)
# train the model
model.fit(train_data, train_labels)
# make the prediction using the resulting model
preds_class = model.predict(test_data)
preds_proba = model.predict_proba(test_data)
print("class = ", preds_class)
print("proba = ", preds_proba)
The description about Pool is like this:
Pool used in CatBoost as a data structure to train model from.
I think we usually use a numpy array rather than a Pool.
For example we use:
test_data = np.random.randint(0,100, size=(20, 10))
I did not find any more usage of Pool, so I want to know when we will use Pool instead of numpy array?
CatBoost works only with Pools, which are its internal data format. If you pass a numpy array to it, it will implicitly convert the array to a Pool first, without telling you.
If you need to apply many formulas (models) to one dataset, using a Pool drastically increases performance (around 10x), because you skip the conversion step each time.
My understanding of a Pool is that it is just a convenience wrapper combining features, labels and further metadata like categorical features or a baseline.
While it does not make a big difference whether you construct your pool first and then fit your model using the pool, it does make a difference when it comes to saving your training data. If you save all the information separately, it might get out of sync or you might forget something, and when loading you need a couple of lines to load everything. The pool comes in very handy here.
Note that when fitting you can also specify an evaluation dataset as a pool. If you want to try multiple evaluation datasets, it is quite handy to have each of them wrapped up in a single object. That's what pools are for.
The most important thing about CatBoost is that we do not need to encode the categorical features in our dataset ourselves. CatBoost has a built-in one-hot encoding hyperparameter, which can be used only when the cat_features hyperparameter is specified. The cat_features hyperparameter can be hard to define, as an error pops up as soon as we specify an array. Defining it is simpler with a Pool.
What is the new approach (under eager execution) to feeding data through a dataset pipeline in a dynamic fashion, when we need to feed it sample by sample?
I have a tf.data.Dataset which performs some preprocessing steps and reads data from a generator, drawing from a large dataset during training.
Let's say that dataset is represented as:
ds = tf.data.Dataset.from_tensor_slices([1, 2, 3, 4, 5, 6])
ds = ds.map(tf.square).shuffle(2).batch(2)
iterator = tf.data.make_one_shot_iterator(ds)
After training I want to produce various visualizations which require that I feed one sample at a time through the network for inference. I've now got this dataset preprocessing pipeline that I need to feed my raw sample through to be sized and shaped appropriately for the network input.
This seems like a use case for the initializable iterator:
placeholder = tf.placeholder(tf.float32, shape=None)
ds = tf.data.Dataset.from_tensor_slices(placeholder)
ds = ds.map(tf.square).shuffle(2).batch(2)
iterator = tf.data.make_initializable_iterator(ds)
# now re-initialize for each sample
Keep in mind that the map operation in this example represents a long sequence of preprocessing operations that can't be duplicated for each new data sample being fed in.
This doesn't work with eager execution, because you can't use the placeholder. The documentation examples all seem to assume a static input such as in the first example here.
The only way I can think of doing this is with a queue and tf.data.Dataset.from_generator(...) which reads from the queue that I push to before predicting on the data. But this feels both hacky, and appears prone to deadlocks that I've yet to solve.
TF 1.14.0
I just realized that the answer to this question is trivial:
Just create a new dataset!
In non-eager mode, the code below would have degraded in performance, because each dataset operation would have been added to the graph and never released; that is the issue the initializable iterator resolves in non-eager mode.
However, in eager execution mode, TensorFlow operations like this are ephemeral: iterators aren't added to a global graph, they just get created and die when no longer referenced. Win one for TF2.0!
The code below (copy/paste runnable) demonstrates:
import tensorflow as tf
import numpy as np
import time
tf.enable_eager_execution()
inp = np.ones(shape=5000, dtype=np.float32)
t = time.time()
while True:
    ds = tf.data.Dataset.from_tensors(inp).batch(1)
    val = next(iter(ds))
    assert np.all(np.squeeze(val, axis=0) == inp)
    print('Processing time {:.2f}'.format(time.time() - t))
    t = time.time()
The motivation for the question came on the heels of this issue in 1.14, where creating multiple dataset operations in graph mode under Keras causes a memory leak.
https://github.com/tensorflow/tensorflow/issues/30448
I have two tf.datasets, one for validation and one for training.
Every now and again I want to switch the data source so that I can run the validation and check some accuracy measure on it.
This blog suggests using placeholders and feeding normal numpy arrays to them. But this would defeat the entire purpose of efficiency.
As the tf.data API guide says:
Warning: "Feeding" is the least efficient way to feed data into a TensorFlow program and should only be used for small experiments and debugging.
So, here is a conceptual example of what I want to achieve:
# Load the datasets from tfrecord files:
val_dataset = tf.data.TFRecordDataset([val_recordfile])
train_dataset = tf.data.TFRecordDataset([train_recordfile])
## Batch size and shuffling etc. here ##
iterator_tr = train_dataset.make_initializable_iterator()
iterator_val = val_dataset.make_initializable_iterator()
###############################################
## This is the magic: ##
it_op=tf.iterator_placeholder()
## tf.iterator_placeholder does not exist! ##
## and demonstrates my needs ##
###############################################
X, Y = it_op.get_next()
predictions=model(X)
train_op=train_and_minimize(X,Y)
acc_op=get_accuracy(Y,predictions)
with tf.Session() as sess:
    # Initialize iterator here
    accuracy_tr, _ = sess.run([acc_op, train_op], feed_dict={it_op: iterator_tr})
    accuracy_val = sess.run(acc_op, feed_dict={it_op: iterator_val})
Of course, it does not have to be done in exactly this way!
I'd prefer a Pythonic/idiomatic TensorFlow way, but any way that does not require feeding raw data is great for me!
As it turns out, my suggestion wasn't far off from what worked. It is actually presented in TensorFlow's guide on datasets:
# Load the datasets in some form of tf.Dataset
tr_dataset=get_dataset(TRAINING)
val_dataset=get_dataset(VALIDATION)
# batching etc..
train_iterator = tr_dataset.make_initializable_iterator()
val_iterator = val_dataset.make_initializable_iterator()
# Make iterator handle that takes a string identifier
iterator_handle = tf.placeholder(tf.string, shape=[])
iterator = tf.data.Iterator.from_string_handle(
    iterator_handle,
    train_iterator.output_types,
    output_shapes=train_iterator.output_shapes)
with tf.Session() as sess:
    # Create string handles for the iterators
    train_iterator_handle = sess.run(train_iterator.string_handle())
    val_iterator_handle = sess.run(val_iterator.string_handle())
    # Now initialize iterators
    sess.run(train_iterator.initializer, feed_dict={iterator_handle: train_iterator_handle})
    sess.run(val_iterator.initializer, feed_dict={iterator_handle: val_iterator_handle})
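The snippet above stops right after initialization. A minimal sketch of how the handle might then be fed to switch between the two iterators is shown below. It assumes TF 1.14+ or 2.x accessed through tensorflow.compat.v1, and the two toy datasets are made up to stand in for get_dataset(TRAINING) and get_dataset(VALIDATION).

```python
import numpy as np
import tensorflow.compat.v1 as tf

tf.disable_eager_execution()

# Toy datasets (assumption): integers 0..5 for training, 100..103 for validation.
tr_dataset = tf.data.Dataset.from_tensor_slices(
    np.arange(6, dtype=np.int64)).batch(2)
val_dataset = tf.data.Dataset.from_tensor_slices(
    np.arange(100, 104, dtype=np.int64)).batch(2)

train_iterator = tf.data.make_initializable_iterator(tr_dataset)
val_iterator = tf.data.make_initializable_iterator(val_dataset)

# A single "feedable" iterator whose source is chosen per sess.run call.
iterator_handle = tf.placeholder(tf.string, shape=[])
iterator = tf.data.Iterator.from_string_handle(
    iterator_handle,
    tf.data.get_output_types(tr_dataset),
    tf.data.get_output_shapes(tr_dataset))
next_batch = iterator.get_next()

with tf.Session() as sess:
    train_handle = sess.run(train_iterator.string_handle())
    val_handle = sess.run(val_iterator.string_handle())
    sess.run([train_iterator.initializer, val_iterator.initializer])

    # Only the small string handle is fed; the data stays in the pipeline.
    train_batch = sess.run(next_batch, feed_dict={iterator_handle: train_handle})
    val_batch = sess.run(next_batch, feed_dict={iterator_handle: val_handle})
    # Switching back resumes the training iterator where it left off.
    train_batch_2 = sess.run(next_batch, feed_dict={iterator_handle: train_handle})
```

In a real model, next_batch would be the (X, Y) pair that the training and accuracy ops are built on, so the same ops serve both datasets.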
I have the following lines as part of a program:
tensor_gradients = optimizer.compute_gradients(cross_entropy)
with tf.Session() as session:
    for step in range(20000):
        batch = mnist.train.next_batch(train_batch_size)
        feed = {input_x: batch[0], input_y: batch[1]}
        gradients = session.run([tensor_gradients], feed)[0]
        for i in range(len(gradients)):
            gradients[i] = (gradients[i][0], tensor_gradients[i][1])
        ... computation on gradients ...
        training_step = optimizer.apply_gradients(gradients)
        training = session.run([training_step], feed)
The reason I'm doing this is that I want to modify the gradients using numpy. The above code runs out of memory around step 800. However, if you pass tensor_gradients to optimizer.apply_gradients instead of the modified gradients, as in the line below, the code does not run out of memory.
training_step = optimizer.apply_gradients(tensor_gradients)
Any ideas about what might be happening? The rest of the code remains the same except for the line above. Is it possible that the numpy arrays in gradients are not being garbage collected because they are being passed into the apply_gradients step? I have no idea where the memory leak could be, or whether I'm inadvertently adding to the TensorFlow graph by passing modified gradients (in numpy array form) back into apply_gradients.
Any ideas about what might be happening?
OOM happens because you're constructing the graph inside the loop. Every iteration adds new apply_gradients nodes, so over 20,000 steps the graph grows to roughly 20,000 times as many nodes, and running it may need more memory than you have.
Move all TF operations that build the graph outside the loop, i.e. everything except feed_dict construction and sess.run calls.
Reply to comments
Apply gradients builds the graph?
Yes, if you look in the docs:
Returns:
An `Operation` that applies the specified gradients. If `global_step`
was not None, that operation also increments `global_step`.
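One way to follow the advice above while still editing the gradients in numpy is to create one placeholder per gradient and call apply_gradients exactly once, before the loop. The sketch below is only an illustration under assumptions: a toy linear-regression graph replaces the MNIST model, np.clip stands in for the unspecified computation on the gradients, and TF 1.14+ or 2.x is accessed through tensorflow.compat.v1.

```python
import numpy as np
import tensorflow.compat.v1 as tf

tf.disable_eager_execution()

# Toy linear-regression graph standing in for the question's model.
input_x = tf.placeholder(tf.float32, [None, 3])
input_y = tf.placeholder(tf.float32, [None, 1])
weights = tf.Variable(tf.zeros([3, 1]))
loss = tf.reduce_mean(tf.square(tf.matmul(input_x, weights) - input_y))

optimizer = tf.train.GradientDescentOptimizer(0.1)
grads_and_vars = optimizer.compute_gradients(loss)
grad_tensors = [g for g, _ in grads_and_vars]

# One placeholder per gradient; apply_gradients is built once, outside the loop.
grad_inputs = [tf.placeholder(g.dtype, g.shape) for g in grad_tensors]
training_step = optimizer.apply_gradients(
    [(p, v) for p, (_, v) in zip(grad_inputs, grads_and_vars)])

rng = np.random.RandomState(0)
data_x = rng.rand(64, 3).astype(np.float32)
data_y = data_x.sum(axis=1, keepdims=True)

with tf.Session() as session:
    session.run(tf.global_variables_initializer())
    session.graph.finalize()  # any later attempt to add ops raises an error
    feed = {input_x: data_x, input_y: data_y}
    first_loss = session.run(loss, feed)
    for step in range(200):
        gradients = session.run(grad_tensors, feed)
        # NumPy-side modification (clipping is just a placeholder choice).
        gradients = [np.clip(g, -1.0, 1.0) for g in gradients]
        session.run(training_step, dict(zip(grad_inputs, gradients)))
    last_loss = session.run(loss, feed)
```

Calling graph.finalize() is a cheap way to catch this class of bug: if anything inside the loop tried to add a node, it would fail immediately instead of slowly exhausting memory.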
I am using the high-level Estimator on TF:
estim = tf.contrib.learn.Estimator(...)
estim.fit ( some_input )
If some_input has x, y, and batch_size, the code runs but with a warning. So I tried to use input_fn, and managed to send x and y through this input_fn, but not batch_size. I didn't find any example of it.
Could anyone share a simple example that uses input_fn as input to estim.fit / estim.evaluate, and uses batch_size as well?
Do I have to use tf.train.batch? If so, how does it fit into the higher-level implementation (tf.layers)? I don't know the graph's tf.Graph() or the session.
Below is the warning I got:
WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/monitors.py:657: calling evaluate
(from tensorflow.contrib.learn.python.learn.estimators.estimator) with y is deprecated and will be removed after 2016-12-01.
Instructions for updating:
Estimator is decoupled from Scikit Learn interface by moving into
separate class SKCompat. Arguments x, y and batch_size are only
available in the SKCompat class, Estimator will only accept input_fn.
Example conversion:
est = Estimator(...) -> est = SKCompat(Estimator(...))
The link provided in Roi's own comment was indeed really helpful. Since I was struggling with the same question as well for a while, I would like to summarize the answer provided by the link above as a reference:
def batched_input_fn(dataset_x, dataset_y, batch_size):
    def _input_fn():
        all_x = tf.constant(dataset_x, shape=dataset_x.shape, dtype=tf.float32)
        all_y = tf.constant(dataset_y, shape=dataset_y.shape, dtype=tf.float32)
        sliced_input = tf.train.slice_input_producer([all_x, all_y])
        return tf.train.batch(sliced_input, batch_size=batch_size)
    return _input_fn
This can then be used like this example (using TensorFlow v1.1):
model = CustomModel(FLAGS.learning_rate)
estimator = tf.estimator.Estimator(model_fn=model.build(), params=model.params())
estimator.train(input_fn=batched_input_fn(
                    train.features,
                    train.labels,
                    FLAGS.batch_size),
                steps=FLAGS.train_steps)
Unfortunately, this approach is about 10x slower compared to manual feeding (using TensorFlow's low-level API) or compared to using the whole dataset with train.shape[0] == batch_size and not using train.slice_input_producer() and train.batch() at all. At least that is the case on my machine (CPU only). I'm really wondering why this approach is so slow. Any ideas?
Edited:
I could speed it up a bit by using num_threads > 1 as a parameter for train.batch(). On a VM with 2 CPUs, I'm able to double the performance using this batching mechanism compared to the default num_threads=1. But still, it is 5x slower than manual feeding.
But results might be different on a native system, or on a system that uses all CPU cores for the input pipeline and the GPU for the model computation. It would be great if somebody could post their results in the comments.