I have a problem with very slow batch loading in TensorFlow.
Each step in the training is reasonably fast, but my function to load data is extremely slow.
I was wondering whether there is any way to make it faster, or to run it in the background while the train op is running, so that the next batch is ready by the time one step is done.
My features are stored in numpy arrays.
Any ideas?
This is my code.
import os
import random

import numpy as np

# shuffle_list, get_class_number and create_one_hot are helpers defined elsewhere in my project
def test_loadbatch(no_timesteps, list_of_file_paths, batch_size):
    nof = no_timesteps  # number of combined timesteps per sample
    files = shuffle_list(list_of_file_paths)
    classes = get_class_number(files)
    temp_batch = np.zeros(shape=(batch_size, no_timesteps, 4096), dtype=np.float32)
    temp_classes = np.zeros(shape=(batch_size, 101), dtype=np.float32)
    bat_num = 0
    fileno = 0
    while bat_num != batch_size:
        if os.path.isfile(str(files[fileno])):
            val = np.load(str(files[fileno]))
            try:
                if val.shape[0] > no_timesteps + 2:
                    # pick a random window of nof consecutive timesteps
                    num = random.randint(0, val.shape[0] - (no_timesteps + 2))
                    temp_batch[bat_num, :, :] = val[num:num + nof, :]
                    temp_classes[bat_num, :] = create_one_hot(classes[fileno])
                    bat_num = bat_num + 1
            except Exception:
                pass
        fileno = fileno + 1
    return np.maximum(np.tanh(temp_batch), 0), temp_classes  # normalize into the range 0->1
Input data preparation and training a model on the prepared data can be decoupled in TensorFlow using queues. You can create a queue with tf.FIFOQueue or tf.RandomShuffleQueue and push your mini-batches into it with the queue's enqueue() op; the training part of the graph gets a mini-batch by running the queue's dequeue() op. Note that you should run the data preparation and the training in different Python threads to get concurrency. Have a look at the how-to on threading and queues for more explanation and examples.
Also note that the throughput of the data-preparation + training pipeline is limited by its slowest stage. In your case, if preparing one mini-batch is slower than one training step, you may have to run multiple threads that create mini-batches in parallel to keep up with the training thread.
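For reference, here is a minimal sketch of that pattern with the TF 1.x queue API; the shapes and the call to test_loadbatch come from the question above, while the capacity and thread handling are purely illustrative:

import threading
import tensorflow as tf

batch_size, no_timesteps = 32, 20  # illustrative values
queue = tf.FIFOQueue(capacity=10,
                     dtypes=[tf.float32, tf.float32],
                     shapes=[[batch_size, no_timesteps, 4096], [batch_size, 101]])
batch_ph = tf.placeholder(tf.float32, [batch_size, no_timesteps, 4096])
labels_ph = tf.placeholder(tf.float32, [batch_size, 101])
enqueue_op = queue.enqueue([batch_ph, labels_ph])
batch, labels = queue.dequeue()  # build the training graph on these tensors

def load_loop(sess, files):
    # runs in a background thread: prepare a mini-batch, push it into the queue
    while True:
        x, y = test_loadbatch(no_timesteps, files, batch_size)
        sess.run(enqueue_op, feed_dict={batch_ph: x, labels_ph: y})

sess = tf.Session()
threading.Thread(target=load_loop, args=(sess, list_of_file_paths), daemon=True).start()
# sess.run(train_op)  # each training step now dequeues an already-prepared batch

With a capacity of 10 the loader thread can stay several batches ahead, so a training step only waits if loading really cannot keep up.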
I'm implementing an RL algorithm and using tf.data.Dataset (with prefetch) to feed data to the neural network. However, in order to interact with the environment, I have to explicitly feed data through feed_dict to take an action. I'm wondering whether using feed_dict together with Dataset would impair the speed.
Here's a simplified version of my code:
# code related to Dataset
ds = tf.data.Dataset.from_generator(buffer, sample_types, sample_shapes)
ds = ds.prefetch(5)
iterator = ds.make_one_shot_iterator()
samples = iterator.get_next(name='samples')
# pass samples to network
# network training, no feed_dict is needed because of Dataset
sess.run([self.opt_op])
# run the actor network to choose an action at the current state.
# manually feed the current state to samples
# will this impair the performance?
action = sess.run(self.action, feed_dict={samples['state']: state})
There is nothing wrong with mixing Dataset and feed_dict. If the state that you provide via feed_dict is large, it might lead to an underutilized GPU depending on the size of the data, but that would in no way be related to Dataset being used.
One of the reasons the Dataset API exists is to avoid model starvation and improve GPU utilization during training. Starvation can happen because data has to be copied from one location to another: disk to memory, memory to GPU memory, you name it. Dataset tries to start executing bulky I/O operations early enough to avoid starving the model when the time comes to process the next batch. So, basically, Dataset tries to reduce the time between batches.
In your case you probably don't lose any performance from using feed_dict. It seems that you break up execution with environment interaction anyway (thereby possibly underutilizing the GPU).
If you would like to be sure, time your performance when feeding the actual state with feed_dict, then replace the state with a constant tensor and compare the speed.
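If it helps, a rough way to do that comparison (self.action, samples and state are from your snippet; the constant-state variant action_from_constant_state is hypothetical and you would have to build it yourself):

import time

def time_op(sess, op, feed=None, n=1000):
    sess.run(op, feed_dict=feed)  # warm-up run
    start = time.time()
    for _ in range(n):
        sess.run(op, feed_dict=feed)
    return (time.time() - start) / n

# t_feed  = time_op(sess, self.action, feed={samples['state']: state})
# t_const = time_op(sess, action_from_constant_state)  # same graph, state baked in as tf.constant
# print(t_feed, t_const)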
I have trained a 3D convnet using MXNet. I saved the network architecture and parameters, intending to test more data with it to check its performance. Since I am not training, I do not want to obtain batches of the dataset. How do I get the network to read in the entire dataset as input? Passing the dataset object to the network directly only gives a 4D tensor, whereas the network wants 5D. Right now I am using the DataLoader but setting the batch size to the entire dataset, and I feel like there is a more efficient way to do this.
DataLoader requires either a batch_size or a BatchSampler. In theory, you could write a BatchSampler that fetches the entire dataset as one batch, though I don't think you'd see a significant performance gain over simply using a very large batch_size. Additionally, using batches is beneficial if you have more than one worker - have you considered using num_workers > 0 to take advantage of parallel processing?
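If you do stay with DataLoader, a minimal sketch (assuming a Gluon-style dataset and the trained net from your question) looks like this:

from mxnet.gluon.data import DataLoader

# one batch containing the whole dataset; num_workers > 0 loads samples in parallel
loader = DataLoader(dataset, batch_size=len(dataset), num_workers=4)
for data, label in loader:
    out = net(data)  # data now carries the leading batch dimension, i.e. the 5D input the convnet expects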
I want to train an RNN on sentences X of different input sizes, without padding. The logic I use is that I keep global variables and, for every step, I take one example, write the forward propagation (i.e. build the graph), run the optimizer, and then repeat with another example. The program is extremely slow compared to a numpy implementation of the same thing, where I wrote the forward and backward propagation myself using the same logic as above. The numpy implementation takes a few seconds, while TensorFlow is extremely slow. Would running the same thing on a GPU be useful, or am I making a logical mistake?
As a general guideline, a GPU boosts performance only if you have calculation-intensive code and little data transfer. In other words, if you train your model one instance at a time (or on small batch sizes), the overhead of data transfer to/from the GPU can even make your code run slower! But if you feed in a good chunk of samples, then the GPU will definitely boost your code.
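As a quick illustration (a toy TF 1.x benchmark, not your RNN), timing the same matmul fed one sample at a time versus as a single batch makes the per-call transfer and launch overhead visible:

import time
import numpy as np
import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 1024])
w = tf.Variable(tf.random_normal([1024, 1024]))
y = tf.matmul(x, w)

data = np.random.rand(512, 1024).astype(np.float32)
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    start = time.time()
    for row in data:  # 512 runs with batch size 1
        sess.run(y, feed_dict={x: row[None, :]})
    print('one at a time:', time.time() - start)
    start = time.time()
    sess.run(y, feed_dict={x: data})  # one run with batch size 512
    print('single batch :', time.time() - start)

On a GPU the single large batch is typically far faster per sample, which is the effect described above.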
I'm trying to train a (pretty big) neural network using a GPU. The network is written in PyTorch. I use Python 3.6.3 running on Ubuntu 16.04. Currently the code is running, but it's taking about twice as long as it should, because my data-grabbing process using the CPU runs in series with the training process using the GPU. Essentially, I grab a mini-batch from file using a mini-batch generator, send that mini-batch to the GPU, and then train the network on that mini-batch. I've timed the two processes (grabbing a mini-batch and training on it), and they are similar in how long they take (both take around 200 ms). I'd like to do something similar to Keras' fit_generator method, which runs the data-grabbing in parallel with the training (it creates a queue of mini-batches that can be sent to the GPU when the GPU wants to train on the next mini-batch). What is the best way to do that? For concreteness, my data generator code and training code run something like this (pseudocode):
# This generator opens a file, then grabs and yields one mini-batch at a time
def data_gen(PATH, batch_size=32):
    with h5py.File(PATH, 'r') as f:
        for mini_batch in mini_batches:  # mini_batches: the index slices for each batch
            X = f['X'][mini_batch]
            Y = f['Y'][mini_batch]
            yield (X, Y)

for epoch in range(epochs):
    for data in data_gen(PATH):
        mini_X, mini_Y = data
        mini_X = autograd.Variable(torch.Tensor(mini_X))
        mini_Y = autograd.Variable(torch.Tensor(mini_Y))
        optimizer.zero_grad()
        out = net(mini_X)
        loss = F.binary_cross_entropy(out, mini_Y)
        loss.backward()
        optimizer.step()
Something like that. As you can see, I use data_gen as an actual generator for the for-loop, so it's being run sequentially with the training. I would like to run it in parallel and have it generate a queue of mini-batches which I can then feed to my network. Currently it takes more than 5 hours to run one epoch; I think with a parallelized version of this I could get that down to 3 hours or less. I looked into multiprocessing in Python, but the explanation in the official documentation was a bit dense for me, since I have only limited prior experience with parallel computing. If there are some resources I could take a look at, pointing me towards those resources would be very helpful too! Thanks.
You will need to use threads for the data generation. The idea is to let the CPU handle the data generation (usually loading) while your GPU does the training. That being said, it is not the CPU that will slow things down; it is the constant reading and writing of files. If you are using a dataset, make sure the files are copied or extracted into contiguous blocks on your file system. If your files are fragmented across your hard drive, loading them will be a bottleneck regardless of the multi-threading mechanism you are using. With SSDs this is much less noticeable.
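As a concrete starting point, here is a minimal sketch of that idea using only the standard library; it wraps the data_gen from your question in a background thread that keeps a small queue of ready mini-batches (the prefetch helper is mine, not a library function):

import threading
import queue

def prefetch(generator, max_prefetch=4):
    q = queue.Queue(maxsize=max_prefetch)
    sentinel = object()

    def producer():
        for item in generator:
            q.put(item)  # blocks while the queue is full
        q.put(sentinel)  # signal that the generator is exhausted

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = q.get()
        if item is sentinel:
            break
        yield item

# for mini_X, mini_Y in prefetch(data_gen(PATH)):
#     ...train exactly as before; the next batch is being read while the GPU works

PyTorch's own torch.utils.data.DataLoader with num_workers > 0 achieves the same thing with worker processes instead of threads, if you can wrap your HDF5 file in a Dataset.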
I want to use Keras in a real-time training and prediction setting. In my scenario I get real-time data via MQTT that should be used to train an (LSTM) neural network and/or be fed to the network to get a prediction.
I am using the TensorFlow backend with GPU support and fairly potent GPU capacity, but in my scenario Keras does not really profit from GPU acceleration. (I did some performance tests using the examples in the Keras repository to make sure that GPU acceleration works in general.) In my first approach, I used the model.train_on_batch(...) method to train the network on each item coming in via MQTT:
model = load_model()

def on_message(msg):
    """
    Method called by the MQTT client each time new data comes in
    """
    if msg.topic == 'my/topic':
        X, Y = prepare_data(msg.payload)
        prediction = model.predict(X)
        loss = model.train_on_batch(X, Y)
        send_to_visualization_tool(prediction, loss)
One training step in this setting takes about 200 ms. However, when I introduce a buffer, e.g. buffering 100 data points, the training time for the whole batch increases only slightly. This suggests that the setup time of a batch training step has a huge overhead. I also noticed that with size-1 batches the CPU consumption is quite high, while the GPU is hardly used at all.
As an alternative, I introduced a synchronized Queue to which the MQTT client pushes data whenever it comes in; the neural network then consumes, as one batch, all the data that arrived while the previous batch was being processed:
train_data_queue = Queue.Queue()

# MQTT client running in a separate thread
def on_message(msg):
    train_data_queue.put(msg.payload)

model = load_model()
while True:
    # dequeue all items from the queue, or block until at least one item is present
    train_data_batch = dequeue_all(train_data_queue)
    X, Y = prepare_data(train_data_batch)
    predictions = model.predict_on_batch(X)
    losses = model.train_on_batch(X, Y)
    send_to_visualization_tool(predictions, losses)
This approach works okay, but it would be nice if I could get rid of the additional complexity of synchronized queues and multi-threading, i.e. get the first approach to work.
My question therefore is: is there a way to reduce the overhead of a single-batch training step? E.g. by reimplementing the model in pure TensorFlow?
Or can you think of a better way to do real-time training with Keras?
The performance of Keras should be broadly similar to the performance of raw TensorFlow, so I do not recommend rewriting your model.
Indeed, modern hardware usually takes about the same time to train on a single example as it does on a batch of examples, which is why we spend so much effort batching things up. You can get rid of the complexity of synchronized queues by using tf.contrib.batching.batch_function, but you'll still need to feed it from many threads if you want the extra throughput.
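For completeness, a rough sketch of how batch_function is used (TF 1.x contrib; the argument order num_batch_threads, max_batch_size, batch_timeout_micros and the toy layer are my reading of the contrib docs, so treat the details as an assumption to verify):

import tensorflow as tf
from tensorflow.contrib.batching import batch_function

w = tf.Variable(tf.random_normal([128, 64]))

@batch_function(1, 32, 5000)  # 1 batching thread, batches of up to 32, 5 ms timeout
def layer(x):
    # stand-in for your model's per-example computation
    return tf.nn.relu(tf.matmul(x, w))

out = layer(tf.placeholder(tf.float32, [1, 128]))
# Concurrent sess.run(out, ...) calls from several client threads are grouped
# into a single batch before the computation executes.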