I am trying to train neural networks using TensorFlow 1.12.0 and Keras API.
The setup is as follows. I have a large number of data points: each point consists of a context (call it 24 floats) and a label (1 float). The total amount of data is O(10^7) points. Test networks vary, but a fairly simple one might look like so
import tensorflow as tf
from tensorflow import keras

input_shape = (24,)  # 24 context floats per data point

v = [keras.layers.Input(shape=input_shape)]
v.append(keras.layers.Dense(4, use_bias=False, activation=tf.nn.tanh)(v[-1]))
v.append(keras.layers.Dense(1, use_bias=False)(v[-1]))
model = keras.models.Model(inputs=v[0], outputs=v[-1])
optimizer = tf.train.RMSPropOptimizer(0.001)
model.compile(loss='mse', optimizer=optimizer, metrics=['mae'])
(...)
history=model.fit(x=train_data,y=train_labels,epochs=EPOCHS,verbose=1,batch_size=10000, shuffle=True)
I've been getting decent results keeping the data in numpy arrays and passing them to model.fit() as-is. However, I am not happy with the performance. It appears to be bottlenecked in Python code: when I profiled it, the Python method slice_arrays in python/keras/utils/generic_utils.py came up as the primary bottleneck, with up to half of all time spent there. The GPU (GeForce 1080 Ti) is in use, but its utilization (as reported by nvidia-smi) is rarely above 10-15%.
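For reference, the profiling was nothing elaborate; roughly this, on a single epoch (a sketch, not my exact script):

import cProfile

cProfile.run(
    "model.fit(x=train_data, y=train_labels, epochs=1, verbose=1, "
    "batch_size=10000, shuffle=True)",
    sort="cumtime")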
Looking for a way to speed things up, I tried to convert the data into tensors:
features=tf.convert_to_tensor(train_data)
labels=tf.convert_to_tensor(train_labels)
history=model.fit(x=features,y=labels,epochs=EPOCHS,verbose=1,steps_per_epoch=20, shuffle=True)
model.fit requires a batch_size argument when passed numpy arrays, but steps_per_epoch when passed tensors. The documentation isn't clear, but it seems that I should set steps_per_epoch to the number of data points divided by the batch size. That way I get comparable convergence rates. Whatever the value, GPU utilization is 100%, which is good.
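Concretely, what I do is something like this (my interpretation; the documentation doesn't spell it out):

import math

num_points = train_data.shape[0]      # ~1e7 points
batch_size = 100000                   # the batch size I actually want
steps_per_epoch = math.ceil(num_points / batch_size)

history = model.fit(x=features, y=labels, epochs=EPOCHS, verbose=1,
                    steps_per_epoch=steps_per_epoch, shuffle=True)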
But now there's a problem. With numpy arrays, run time per epoch is relatively independent of the batch size: I see 8 seconds per epoch at batch size 10k, 5 seconds at batch size 100k, and 7 seconds per epoch at batch size 1M. Convergence is generally much better when the batch size is small, so I'd typically start with batch size 100k or less. With tensor input, on the other hand, run time per epoch rises steeply with steps_per_epoch. With batch size 1M (10-20 steps per epoch), I'm at 2 s/epoch, but the convergence rate is abysmal. With batch size 10k, the convergence rate is good, but the time is up to 30 s/epoch (actually slower than with numpy, despite 100% GPU utilization).
Basically, the only scenario where tensor input actually ends up faster is the one where I don't care to use it in the first place.
What is going on and is there any way to get around this?
Related
I'm working on a video classification task with 5 classes, using a TimeDistributed CNN model. The training dataset contains 8 videos of 75 frames each. I have used the TimeseriesGenerator from Keras with length equal to 75, since each video is a sequence of 75 frames. But it is unclear to me what batch_size should be in this case.
from keras.preprocessing.sequence import TimeseriesGenerator
train_sequences = TimeseriesGenerator(train_data, train_labels, length=75, batch_size=1)
Can anyone tell me what batch size should be considered for this task?
The batch size defines the number of video samples that will be introduced in each iteration of your model. Different batch sizes lead to different weight updates: if the batch size is 3, the model will take in 3 sample videos and only after those 3 inputs will it update the weights.
There isn't an optimal value for the batch size. It's like the No Free Lunch theorem. I suggest you try different values and look for the best results.
There are constraints when choosing the batch size:
If the value is small, it will require less memory and each step can be faster, since your model is processing fewer samples per update. On the other hand, the gradient estimate will be less accurate.
If the value is big, the gradient estimate will be more accurate, but it will need more memory and each step could be slower.
So you have to find a good trade-off between gradient-estimation accuracy and computational-resource usage.
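As a concrete illustration of what batch_size changes here, you can inspect what the generator yields for a few values (the shapes below are made-up stand-ins, not your actual frames):

import numpy as np
from keras.preprocessing.sequence import TimeseriesGenerator

# Tiny stand-in data: 8 videos x 75 frames, flattened to 600 frames of
# small 8x8x3 images, with one label per frame.
train_data = np.zeros((600, 8, 8, 3))
train_labels = np.zeros((600, 1))

for batch_size in (1, 3, 8):
    gen = TimeseriesGenerator(train_data, train_labels,
                              length=75, batch_size=batch_size)
    x, y = gen[0]
    # x has shape (batch_size, 75, 8, 8, 3): batch_size sequences of 75 frames;
    # the weights are updated once per such batch.
    print(batch_size, len(gen), x.shape, y.shape)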
I have started to dig a little deeper in Tensorflow and Neural Net training with a focus on optimizing run time and keep bumping up against a data ingestion problem.
Let's say I have 30,000 256x256 images for which I have created an efficient Tensorflow pipeline (including prefetching and parallel data calls) that randomly chips each image to 64x64 pixels and randomly flips it in both directions. So the model accepts tensors of shape (batchsize, 64, 64), and via augmentation there are 30000*((256/64)**2)*4 = 1,920,000 examples at minimum. Minimum because there are at least 16 unique chips per image, but many more ways to randomly chip the whole image. The 4 comes from the four possible flip states (Same, Same), (Flipped, Same), (Same, Flipped), (Flipped, Flipped).
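For reference, the augmentation part of that pipeline is essentially the following (a simplified sketch with a channel dimension added and made-up variable names, not my exact code):

import numpy as np
import tensorflow as tf

def augment(image):
    # image: a single (256, 256, 1) tensor. Randomly chip a 64x64 patch and
    # randomly flip it in both directions (the 4 flip states mentioned above).
    chip = tf.image.random_crop(image, size=[64, 64, 1])
    chip = tf.image.random_flip_left_right(chip)
    chip = tf.image.random_flip_up_down(chip)
    return chip

# Stand-in for the 30,000 real images.
images = np.zeros((16, 256, 256, 1), dtype=np.float32)

dataset = (tf.data.Dataset.from_tensor_slices(images)
           .shuffle(buffer_size=1000)
           .map(augment, num_parallel_calls=tf.data.experimental.AUTOTUNE)
           .batch(128)
           .prefetch(tf.data.experimental.AUTOTUNE))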
I have this model distributed across several GPUs, so the batch size isn't limited by available memory, just by the trade-off with generalization and how accurate the model will be. In this scenario, I was able to get through a single epoch in ~35 seconds with a batch size of 128 (32 examples per GPU).
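The multi-GPU part is plain data parallelism, roughly along these lines (a simplified sketch, not my actual setup; the model here is just a stand-in):

import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()           # picks up all visible GPUs
global_batch_size = 128
per_gpu_batch = global_batch_size // strategy.num_replicas_in_sync   # 32 with 4 GPUs

with strategy.scope():
    # Stand-in model; the real one takes (batch, 64, 64) chips.
    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(64, 64)),
        tf.keras.layers.Dense(10)
    ])
    model.compile(optimizer="adam",
                  loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))

# model.fit(train_dataset, ...) where train_dataset is batched at the global
# batch size (128), so each replica sees 32 examples per step.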
Now let's imagine another scenario in which I have pre-chipped the data (deterministically, i.e. every image has 16 non-overlapping chips extracted) and saved the chips locally as 64x64 pixel images. In this case, I don't do any random sampling, but otherwise the input pipeline is the same. The number of individual files has gone up considerably, from 30,000 to 480,000, with the maximum number of unique examples equal to the minimum number of unique examples from the previous setup. Now, because the number of files has gone up, the number of training steps per epoch has gone up considerably for the same batch size. Even if I double the batch size, training now takes 2-3 minutes per epoch.
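The step counts make the difference obvious (rough arithmetic, taking one epoch to mean one pass over the files):

steps_scenario_1 = 30000 // 128    # ~234 steps per epoch over the original images
steps_scenario_2 = 480000 // 256   # 1875 steps per epoch over the pre-chipped files,
                                   # even with the batch size doubled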
I am curious if there is a consensus between these two scenarios. In scenario 1, I would imagine that I would need to train longer than in scenario 2, but could also potentially get away with an even bigger batch size, since the training data changes slightly every single epoch (so there is less worry about the model not generalizing).
In scenario 2, I imagine that I could get away with training for fewer epochs, at the cost of an individual epoch taking longer. Since the model sees every single example each epoch, there is a real limit to how big I can make the batch size without making the model worse at generalizing.
Is there a consensus on which scenario is best? It seems like scenario 1 is better in nearly every way, but something keeps nagging at me that I am missing something.
I am quite new and could not find a direct answer to this question. I am wondering what default strategy TensorFlow 2.0 uses to deal with an incomplete last batch in training (e.g. the 23 samples left over in a training set of 1023 samples with batch size 100).
I am curious because, intuitively, if the same 23 samples always end up in the last batch of each epoch, then those 23 samples would have a disproportionate influence (i.e. 1/23 each) on the gradient descent compared to the other 1000 samples (i.e. 1/100 each). I am wondering if the internal workings of TensorFlow automatically shuffle the samples every epoch.
Many thanks for your help!
Two points regarding your question:
tf.keras.model.fit() has a keyword argument (kwarg) shuffle. It defaults to True. You can see the documentation at https://www.tensorflow.org/api_docs/python/tf/keras/Model?version=stable#fit. The shuffling of the examples happens at the beginning of every epoch. Therefore, in each training epoch, which examples end up in the last batch is randomized. In this regard, no example gets special treatment or undue influence.
The loss- and metric-calculation mechanism of the fit() method takes into account the batch sizes internally. The final loss and metric values output by the method are weighted averages across batches, with the batch size being the weight.
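As a small illustration of that weighting, using the numbers from your question (the per-batch loss values are made up; this is just the arithmetic, not Keras internals):

# 1023 samples with batch size 100: ten full batches of 100 plus one batch of 23.
batch_sizes  = [100] * 10 + [23]
batch_losses = [0.50] * 10 + [0.60]

epoch_loss = (sum(s * l for s, l in zip(batch_sizes, batch_losses))
              / sum(batch_sizes))
print(epoch_loss)  # the last 23 samples carry total weight 23/1023, not 1/11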
I am new to TensorFlow and I am trying to understand what the batch size should be.
The shape of my data is (119396, 12955). How can I choose the best batch_size for my data?
And how does batch_size depend on the data shape or the algorithm being used?
The batch size is the number of input data values that you are introducing at once in the model. It is very important while training, and secondary when testing. For a standard Machine Learning/Deep Learning algorithm, choosing a batch size will have an impact on several aspects:
The bigger the batch size, the more data you feed at once into the model. Thus, RAM consumption is almost linear in the batch size, and there is always a limit, based on your system specs and the size of your model, above which you will run out of memory.
The bigger the batch size, the fewer steps it takes to loop over your dataset N times during training.
A bigger batch size slows down each individual update, since a single update depends on more data.
A bigger batch size averages over more data for each update of the model, hence training should be smoother: smoother training/test accuracy curves.
Note that the size of the data is only related to the batch size in the sense that the bigger the data, the smaller the maximum batch size becomes (limit set by RAM). The size of the model also has a similar relation.
In practice, you should follow "in powers of 2 and the larger the better, provided that the batch fits into your (GPU) memory". For more in-depth details, check https://stackoverflow.com/a/46655895/9670056.
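For your specific data shape, here is a quick back-of-the-envelope check of the input memory per batch (assuming float32, i.e. 4 bytes per value; the model's weights and activations come on top of this):

features = 12955   # second dimension of your (119396, 12955) data
for batch_size in (32, 64, 128, 256, 512, 1024):
    mib = batch_size * features * 4 / 1024**2
    print("batch_size=%4d -> ~%6.1f MiB of input per batch" % (batch_size, mib))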
I have installed the CPU version of Tensorflow. I have only a few images as a dataset, and I am training on a machine with 4GB of RAM and a Core i5 3340M @ 2.70GHz with batch size 1, and it is still extremely slow. All the images are the same size (200x185, I think). Will it train like this? Kindly tell me how I can speed up this process.
Training process
If your network is deep, it could take a long time to train on a CPU, as it is not optimized for these calculations the way a GPU is.
I would suggest you get a graphics card; even an old graphics card can significantly improve performance (it could be something like 100x faster).
Let's put some numbers here. You are dealing with images of size 200x185. Do you realize we are talking about 37,000 features, if we deal with gray levels? If we deal with RGB, multiply that by 3. How many images are you using for training? Keep also in mind that SGD (Stochastic Gradient Descent, mini-batch size = 1) tends to be very slow for big datasets... Give us some numbers: how many training images, and what does "slow" mean? How much time for one epoch? Something else: the programming language, library (tensorflow, etc.), optimizer, etc. would help us judge whether your code is "slow" and whether it can be made faster.
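Just to spell out that feature count:

height, width = 200, 185
grayscale_features = height * width      # 37,000 input values per image
rgb_features = grayscale_features * 3    # 111,000 if the images are RGB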
Batch size is another parameter that affects training time: a higher batch size helps reduce the time per epoch, but it will require more epochs to reach the same efficiency as batch size = 1.
And if your network is deep (using CNNs, etc.), you should run it on a GPU.