Change tf.data.Dataset source at runtime in TensorFlow - Python

I have two tf.data.Datasets, one for validation and one for training.
Every now and again I want to switch the data source so that I can run validation and check some accuracy measure on it.
This blog suggests using placeholders and feeding normal numpy arrays to them, but that would defeat the whole efficiency purpose; as the tf.data API guide says:
Warning: "Feeding" is the least efficient way to feed data into a TensorFlow program and should only be used for small experiments and debugging.
So, here is a conceptual example of what I want to achieve:
# Load the datasets from tfrecord files:
val_dataset = tf.data.TFRecordDataset([val_recordfile])
train_dataset = tf.data.TFRecordDataset([train_recordfile])
## Batch size and shuffling etc. here ##
iterator_tr = train_dataset.make_initializable_iterator()
iterator_val = val_dataset.make_initializable_iterator()
###############################################
## This is the magic: ##
it_op = tf.iterator_placeholder()
## tf.iterator_placeholder does not exist! ##
## It is only here to demonstrate what I need ##
###############################################
X, Y = it_op.get_next()
predictions = model(X)
train_op = train_and_minimize(X, Y)
acc_op = get_accuracy(Y, predictions)
with tf.Session() as sess:
    # Initialize the iterators here
    accuracy_tr, _ = sess.run([acc_op, train_op], feed_dict={it_op: iterator_tr})
    accuracy_val = sess.run(acc_op, feed_dict={it_op: iterator_val})
It does not, of course, have to be done in this exact way!
I'd prefer a Pythonic/idiomatic TensorFlow way, but any approach that does not require feeding raw data is great for me!

As it turns out, my suggestion wasn't that far off from what works. It is actually presented in TensorFlow's guide on datasets as the "feedable iterator" pattern:
# Load the datasets in some form of tf.data.Dataset
tr_dataset = get_dataset(TRAINING)
val_dataset = get_dataset(VALIDATION)
# batching etc...
train_iterator = tr_dataset.make_initializable_iterator()
val_iterator = val_dataset.make_initializable_iterator()
# Make an iterator handle that takes a string identifier
iterator_handle = tf.placeholder(tf.string, shape=[])
iterator = tf.data.Iterator.from_string_handle(iterator_handle,
                                               train_iterator.output_types,
                                               output_shapes=train_iterator.output_shapes)
with tf.Session() as sess:
    # Create string handles for the iterators
    train_iterator_handle = sess.run(train_iterator.string_handle())
    val_iterator_handle = sess.run(val_iterator.string_handle())
    # Now initialize the iterators (the handle is not needed for this)
    sess.run(train_iterator.initializer)
    sess.run(val_iterator.initializer)
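The handle is what actually selects the data source at run time: you feed it whenever you run ops that consume the iterator. A minimal sketch of that last step, reusing the hypothetical model/train_and_minimize/get_accuracy names from the question (the ops are built before the session, and the sess.run calls happen inside it):
X, Y = iterator.get_next()
predictions = model(X)                  # hypothetical model fn from the question
train_op = train_and_minimize(X, Y)     # hypothetical
acc_op = get_accuracy(Y, predictions)   # hypothetical
# ... then, inside the session above:
# training step - data is pulled from the training iterator
accuracy_tr, _ = sess.run([acc_op, train_op],
                          feed_dict={iterator_handle: train_iterator_handle})
# validation step - same ops, just feed the other handle
accuracy_val = sess.run(acc_op,
                        feed_dict={iterator_handle: val_iterator_handle})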

Related

What is Pool in CatBoost? When should I use Pool instead of a numpy array?

I use this code to test CatBoostClassifier.
import numpy as np
from catboost import CatBoostClassifier, Pool
# initialize data
train_data = np.random.randint(0, 100, size=(100, 10))
train_labels = np.random.randint(0, 2, size=(100))
test_data = Pool(train_data, train_labels)  # What is Pool? When should we use Pool?
# test_data = np.random.randint(0, 100, size=(20, 10))  # Usually we would use a numpy array, not a Pool
model = CatBoostClassifier(iterations=2,
                           depth=2,
                           learning_rate=1,
                           loss_function='Logloss',
                           verbose=True)
# train the model
model.fit(train_data, train_labels)
# make the prediction using the resulting model
preds_class = model.predict(test_data)
preds_proba = model.predict_proba(test_data)
print("class = ", preds_class)
print("proba = ", preds_proba)
The description of Pool is like this:
Pool used in CatBoost as a data structure to train model from.
I think we will usually use a numpy array and not a Pool.
For example we use:
test_data = np.random.randint(0, 100, size=(20, 10))
I did not find any further usage of Pool, so I want to know: when should we use Pool instead of a numpy array?
CatBoost only works with Pools, which is its internal data format. If you pass a numpy array to it, it will implicitly convert it to a Pool first, without telling you.
If you need to apply many formulas (trained models) to one dataset, using a Pool drastically increases performance (something like 10x), because you skip the conversion step each time.
My understanding of a Pool is that it is just a convenience wrapper combining features, labels and further metadata like categorical features or a baseline.
While it does not make a big difference whether you first construct your Pool and then fit your model using the Pool, it does make a difference when it comes to saving your training data. If you save all the information separately, it might get out of sync, or you might forget something, and when loading you need a couple of lines to load everything. The Pool comes in very handy here.
Note that when fitting you can also specify an evaluation dataset as a Pool. If you want to try multiple evaluation datasets, it is quite handy to have each wrapped up in a single object - that's what Pools are for.
The most important thing about CatBoost is that we do not need to encode the categorical features in our dataset ourselves. CatBoost has a built-in one-hot-encoding hyperparameter, which can be used only when cat_features is specified. The cat_features parameter can be awkward to define directly, since an error pops up as soon as we pass an array; the definition is made simpler by using a Pool.
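To illustrate the points above, here is a minimal sketch (with made-up data) of constructing Pools that bundle features, labels and cat_features, and reusing them for fitting, evaluation and prediction; the column indices, shapes and hyperparameters are arbitrary assumptions:
import numpy as np
from catboost import CatBoostClassifier, Pool

# Made-up data: 10 columns, where columns 8 and 9 hold categorical codes
train_data = np.random.randint(0, 100, size=(100, 10))
train_labels = np.random.randint(0, 2, size=(100,))
eval_data = np.random.randint(0, 100, size=(20, 10))
eval_labels = np.random.randint(0, 2, size=(20,))

# Bundle features, labels and categorical-feature indices in one object
train_pool = Pool(train_data, train_labels, cat_features=[8, 9])
eval_pool = Pool(eval_data, eval_labels, cat_features=[8, 9])

model = CatBoostClassifier(iterations=2, depth=2, verbose=True)
model.fit(train_pool, eval_set=eval_pool)  # both arguments accept a Pool

# The eval Pool is converted once; repeated predictions reuse it directly
preds = model.predict(eval_pool)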

How do you feed a tf.data.Dataset dynamically in eager execution mode where initializable_iterator isn't available?

What is the new approach (under eager execution) to feeding data through a dataset pipeline in a dynamic fashion, when we need to feed it sample by sample?
I have a tf.data.Dataset which performs some preprocessing steps and reads data from a generator, drawing from a large dataset during training.
Let's say that dataset is represented as:
ds = tf.data.Dataset.from_tensor_slices([1, 2, 3, 4, 5, 6])
ds = ds.map(tf.square).shuffle(2).batch(2)
iterator = tf.data.make_one_shot_iterator(ds)
After training I want to produce various visualizations which require that I feed one sample at a time through the network for inference. I've now got this dataset preprocessing pipeline that I need to feed my raw sample through to be sized and shaped appropriately for the network input.
This seems like a use case for the initializable iterator:
placeholder = tf.placeholder(tf.float32, shape=None)
ds = tf.data.Dataset.from_tensor_slices(placeholder)
ds = ds.map(tf.square).shuffle(2).batch(2)
iterator = tf.data.make_initializable_iterator(ds)
# now re-initialize for each sample
Keep in mind that the map operation in this example represents a long sequence of preprocessing operations that can't be duplicated for each new data sample being fed in.
This doesn't work with eager execution; you can't use a placeholder. The documentation examples all seem to assume a static input such as in the first example here.
The only way I can think of doing this is with a queue and tf.data.Dataset.from_generator(...), which reads from a queue that I push to before predicting on the data. But this feels hacky and appears prone to deadlocks that I've yet to solve.
TF 1.14.0
I just realized that the answer to this question is trivial:
Just create a new dataset!
In non-eager (graph) mode, the code below would have degraded in performance over time, because each dataset operation would have been added to the graph and never released; that is why graph mode has the initializable iterator to resolve the issue.
However, in eager execution mode, TensorFlow operations like this are ephemeral: new datasets and iterators aren't added to a global graph, they just get created and die when no longer referenced. Win one for TF 2.0!
The code below (copy/paste runnable) demonstrates:
import tensorflow as tf
import numpy as np
import time

tf.enable_eager_execution()

inp = np.ones(shape=5000, dtype=np.float32)
t = time.time()
while True:
    ds = tf.data.Dataset.from_tensors(inp).batch(1)
    val = next(iter(ds))
    assert np.all(np.squeeze(val, axis=0) == inp)
    print('Processing time {:.2f}'.format(time.time() - t))
    t = time.time()
The motivation for the question came on the heels of this issue in 1.14 where creating multiple dataset operations in graph mode under Keras constitutes a memory leak.
https://github.com/tensorflow/tensorflow/issues/30448

How can I make inferences using the Tensorflow Cifar10 tutorial code?

I am an absolute beginner to TensorFlow.
If I have a picture (or set of pictures) that I would like to attempt to classify using the code from the Cifar10 TensorFlow tutorial, how would I do so?
I have absolutely no idea where to start.
1. Train the model using the base CIFAR10 dataset exactly as per the tutorial.
2. Create a new graph with your own inputs - probably easiest to just use a tf.placeholder and feed the data as per below, but there are lots of other ways.
3. Start a session and load the previously saved weights.
4. Run the session (with a feed_dict if you're using a placeholder as above).
import numpy as np
import tensorflow as tf
import cifar10  # the model-definition module from the TensorFlow CIFAR10 tutorial

train_dir = '/tmp/cifar10_train'  # or use FLAGS as in the train example
batch_size = 8
height = 32
width = 32

image = tf.placeholder(shape=(batch_size, height, width, 3), dtype=tf.uint8)
std_img = tf.image.per_image_standardization(image)
logits = cifar10.inference(std_img)
predictions = tf.argmax(logits, axis=-1)

def get_image_data_batches():
    n_batches = 100
    for i in range(n_batches):
        yield (np.random.uniform(size=(batch_size, height, width, 3)) * 255).astype(np.uint8)

def do_stuff_with(logit_vals, prediction_vals):
    pass

with tf.Session() as sess:
    # restore variables
    saver = tf.train.Saver()
    saver.restore(sess, tf.train.latest_checkpoint(train_dir))
    # run inference
    for batch_data in get_image_data_batches():
        logit_vals, prediction_vals = sess.run([logits, predictions], feed_dict={image: batch_data})
        do_stuff_with(logit_vals, prediction_vals)
There are better ways of getting data into the graph (see tf.data.Dataset), but I believe tf.placeholders are the easiest way for learning and getting something up and running initially.
Also check out tf.estimator.Estimator for a cleaner way of managing sessions. It's very different from the way it's done in this tutorial and slightly less flexible, but for standard networks it saves you writing a lot of boilerplate code.
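For comparison, here is a minimal sketch of the tf.data route mentioned above, replacing the placeholder feed with a dataset built from a hypothetical numpy array of images (image_array is an assumption, not part of the tutorial):
import numpy as np
import tensorflow as tf

# Hypothetical images already loaded into memory; in practice these would come from disk
image_array = (np.random.uniform(size=(80, 32, 32, 3)) * 255).astype(np.float32)

dataset = tf.data.Dataset.from_tensor_slices(image_array)
dataset = dataset.map(tf.image.per_image_standardization)  # same per-image preprocessing
dataset = dataset.batch(8)
images = dataset.make_one_shot_iterator().get_next()
# logits = cifar10.inference(images)  # then restore the checkpoint and run as before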

Reading numpy matrix in batches in Tensorflow

I am trying to run some regression models on a GPU, but I get very low GPU utilization, at most 20%. After going through the code,
for i in range(epochs):
    rand_index = np.random.choice(args.train_pr, size=args.batch_size)
    rand_x = X_train[rand_index]
    rand_y = Y_train[rand_index]
I use these lines to select a random batch for each iteration. So I wanted to ask: while training is going on, can I prepare the next batch for the next iteration?
I am working on a regression problem, not a classification problem. I have already seen threading in TensorFlow, but I found examples only for images, and there is no example for a big matrix of size 100000x1000 that is used for training.
You have a large numpy array that lies in host memory. You want to be able to process it in parallel on the CPU and send batches to the device.
Since TF 1.4, the best way to do this is to use tf.data.Dataset, and particularly tf.data.Dataset.from_tensor_slices. However, as the documentation points out, you should probably not provide your numpy arrays directly as arguments to this function, because they would end up being copied to device memory. What you should do instead is use placeholders. The example given in the docs is pretty self-explanatory:
features_placeholder = tf.placeholder(features.dtype, features.shape)
labels_placeholder = tf.placeholder(labels.dtype, labels.shape)

dataset = tf.data.Dataset.from_tensor_slices((features_placeholder, labels_placeholder))
# [Other transformations on `dataset`...]
iterator = dataset.make_initializable_iterator()

sess.run(iterator.initializer, feed_dict={features_placeholder: features,
                                          labels_placeholder: labels})
Further preprocessing or data augmentation can be applied to the slices using the .map method. To make sure that those operations happen concurrently, use TensorFlow operations only and avoid wrapping Python operations with tf.py_func.
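To make the concurrency explicit, a rough sketch of such a pipeline might look like the following; the augment function, num_parallel_calls, and the batch size are assumptions for illustration, and the placeholders are the ones defined above:
def augment(x, y):
    # example augmentation built only from TensorFlow ops (no tf.py_func)
    return x + tf.random_normal(tf.shape(x), stddev=0.01), y

dataset = tf.data.Dataset.from_tensor_slices((features_placeholder, labels_placeholder))
dataset = dataset.shuffle(buffer_size=10000)
dataset = dataset.map(augment, num_parallel_calls=4)  # preprocess on several CPU threads
dataset = dataset.batch(128)
dataset = dataset.prefetch(1)  # prepare the next batch while the device works on the current one
iterator = dataset.make_initializable_iterator()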
This is a good use case for generators. You can set up a generator function that yields slices of your numpy matrices one chunk at a time. If you use a package like Keras, you can supply the generator directly to fit_generator. If you prefer to use TensorFlow directly, you can use:
sess = tf.Session()
sess.run(init)
batch_gen = generator(data)
batch = next(batch_gen)
sess.run([optimizer, loss, ...], feed_dict={X: batch[0], y: batch[1]})
Note: I am using placeholder names for the optimizer and loss; you have to replace them with your own definitions. Note that your generator should yield an (x, y) tuple. If you are unfamiliar with generator expressions, there are many examples online, but here is a simple example from the Keras documentation that shows how you can read numpy matrices from a file in batches:
def generate_arrays_from_file(path):
    while 1:
        f = open(path)
        for line in f:
            x, y = process_line(line)
            yield (x, y)
        f.close()
But also, more fundamentally, low GPU usage is not really indicative of any problem loading batches; rather, it suggests that your batch size might be too small.

How can I run 2 sessions in a nested way in TensorFlow?

I am using TensorFlow to train a network in a main session A. To preprocess the input data, I use a spatial transformer network (TensorFlow version). That basically means that during training, session A starts first, and after each epoch, session B, which runs the spatial transformer preprocessing, is used. In session B I have one line of code like this:
para = tf.Variable(initial_value=initial, name='para', trainable = False)
When I start to run, an error occurs indicating a RuntimeError:
RuntimeError: Graph is finalized and cannot be modified.
I am wondering what the right way to solve this is. To the best of my knowledge, there are two possible ways:
1) Merge the preprocessing part into the main session A and use feed_dict to pass the preprocessing parameters to the session.
2) Find a way to handle another session running for preprocessing during training of the main session A.
Does anyone have experience with this issue? Please help me.
If you are moving a large amount of data and it is different for each induction, then you'll want to save the data into a variable once and then continue with the induction:
# in the setup of your tensorflow graph
para_feed = tf.placeholder(dtype=tf.float32, name='para_feed', shape=[None, 2])
# variable holding the preprocessing parameters; validate_shape=False lets its shape change
para = tf.Variable(initial_value=tf.zeros([0, 2]), trainable=False,
                   validate_shape=False, name='para')
update_para = tf.assign(para, para_feed, validate_shape=False)

# to assign
sess.run(update_para, feed_dict={para_feed: initial})

# run your induction multiple times
# ...
If you do not have a large amount of data, then simplify it like so:
# in the setup of your tensorflow graph
para = tf.placeholder(dtype=tf.float32, name='para', shape=[None, 2])

# run your induction multiple times
sess.run(your_induction_target, feed_dict={para: initial})
