Fluctuations in neural network inference time in PyTorch - python

I have a neural network and I would like to test its inference time by running it on a CPU. It's important that the time be measured as accurately as possible.
I use the following code to get the inference time of a network:
import time
import torch

times = []
with torch.no_grad():
    for image, _ in testloader:
        start = time.time()
        net(image)
        times.append(time.time() - start)  # record the inference time of each pass
The problem is that even if I pass the same image through the network (although in this case "image" is really a batch of images) and do that many times over, the inference time still fluctuates, with the difference between the minimum and maximum times reaching 30 ms or even more.
Also, before I ran the tests, I made sure that the OS would interfere with the process as little as possible by closing all other applications.
I want to understand where these fluctuations come from exactly. My understanding is that if the same image is passed through the network, it should take the same time to classify it no matter how many tests I run (at least in theory).
So what causes these fluctuations? Is there a way to eliminate them?

If you run a standard torch model, torch builds an execution graph and does some caching, which makes the first run longer than the subsequent ones. In several cases I have also noticed small differences in garbage-collector behaviour between iterations.
You can (partially) avoid these problems by converting the model to a TorchScript model.
You should also check your dataloader and eliminate any randomness in it.
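For illustration, here is a minimal sketch of what I mean, assuming the net and testloader from the question: trace the model to TorchScript, run a few warm-up passes so the one-off graph building and caching are excluded, and then time repeated passes over the same fixed batch.

import time
import torch

net.eval()
image, _ = next(iter(testloader))       # one fixed batch, reused for every pass
scripted = torch.jit.trace(net, image)  # TorchScript via tracing

with torch.no_grad():
    for _ in range(10):                 # warm-up passes, not timed
        scripted(image)

    times = []
    for _ in range(100):
        start = time.perf_counter()
        scripted(image)
        times.append(time.perf_counter() - start)

print(f"min {min(times) * 1e3:.2f} ms, max {max(times) * 1e3:.2f} ms")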

Related

How to enable Keras fit() multiprocessing properly?

When I run fit() with multiprocessing=True I always get a deadlock and the following warning:
WARNING:tensorflow:multiprocessing can interact badly with TensorFlow, causing nondeterministic deadlocks. For high performance data pipelines tf.data is recommended.
How do I run it properly?
Since it says "tf.data", I wonder whether transforming my data into that format will make multiprocessing work. What specifically is meant, and how do I convert my data to it?
My dataset (reproducible):

import numpy as np

Input_shape, labels = (20, 4), 6
LEN_X, LEN_Y = 20000, 3000
train_X, train_Y = np.asarray([np.random.random(Input_shape) for _ in range(LEN_X)]), np.random.random((LEN_X, labels))
validation_X, validation_Y = np.asarray([np.random.random(Input_shape) for _ in range(LEN_Y)]), np.random.random((LEN_Y, labels))
sampleW = np.random.random((LEN_X, 1))
Multiprocessing doesn't accelerate the model itself; it only accelerates the data loading, and data-loading delay is not a problem when all of your data is already in memory.
You could still use multiprocessing, but you would have to make sure the underlying dataset is thread-safe and carefully craft the data pipeline, which is quite time-consuming. So instead I suggest you speed up the model itself.
For that, you should look into the following (a small sketch putting several of these points together is shown after the list):
changing all activations except the last layer's to ReLU.
tweaking batch size. (optimal number depends on your hardware, and is almost always less than or equal to 32)
using Batch normalization to speed up convergence.
using a higher learning rate (be careful not to overdo this step).
if you need faster convolutions, consider using Kaggle notebooks or vast.ai for GPU-enabled computations.
last but not least, try using a simpler, smaller model.
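As a rough sketch of several of these suggestions combined, using the random arrays from the question (the layer widths and the learning rate below are placeholder values chosen for illustration, not tuned recommendations):

from tensorflow import keras

model = keras.Sequential([
    keras.Input(shape=Input_shape),                    # (20, 4) from the question
    keras.layers.Flatten(),
    keras.layers.Dense(64, activation="relu"),         # ReLU everywhere except the output
    keras.layers.BatchNormalization(),                 # speeds up convergence
    keras.layers.Dense(64, activation="relu"),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(labels, activation="sigmoid"),  # 6 outputs in [0, 1]
])

model.compile(optimizer=keras.optimizers.Adam(learning_rate=3e-3),  # a bit above the 1e-3 default
              loss="mse")

model.fit(train_X, train_Y,
          sample_weight=sampleW.ravel(),               # flattened to shape (LEN_X,)
          validation_data=(validation_X, validation_Y),
          batch_size=32,                               # <= 32, as suggested above
          epochs=10)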
Comment down here if you have any additional questions.
Cheers.

Validation and training don't converge at the same time, but validation still converges

https://github.com/wenxinxu/resnet-in-tensorflow#overall-structure
The link above is a ResNet model for CIFAR-10.
I am modifying the above code to do object detection using ResNet, with CIFAR-10 as the training/validation dataset. (I know the dataset is meant for object classification.) I know that it sounds strange, but hear me out: I use CIFAR-10 for training and validation, and during testing I use a sliding-window approach and classify each window into one of the 10 classes plus a "background" class.
For the background class I used images from ImageNet. I searched ImageNet with the following keywords: construction, landscape, byway, mountain, sky, ocean, furniture, forest, room, store, carpet, and floor. I then cleaned out bad images as much as I could, including images that contain CIFAR-10 classes; for example, I deleted a few "floor" images that had dogs in them.
I am currently running this on FloydHub. The total number of steps I am running is 60,000, which is where the section under "training curve" in the link above suggests that the result consolidates and does not converge any further (I have run this code myself and can back up that claim).
My questions are:
What is the cause of the sudden step down in the training and validation curves, which occurs at about the same step?
Is it possible for training and validation not to converge in a step-like fashion at about the same step? What I mean is, for example, could training step down at around 40,000 steps while validation just converges smoothly, with no step down?
The sudden step down is caused by the learning rate decay happening at 40k steps (you can find this parameter in hyper_parameters.py). The learning rate suddenly gets divided by 10, which allows you to tune the parameters more precisely, and in this case improves your performance a lot. You still need the first part, with a pretty big learning rate, to get into a "good" area for your parameters; the part with a 10x smaller learning rate then refines it and finds a very good spot in that area.
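A minimal illustration of that kind of step decay (the 40k boundary and the factor of 10 come from the explanation above; the initial rate is a placeholder, the real values live in hyper_parameters.py in the linked repo):

def learning_rate(step, initial_lr=0.1, boundaries=(40000,), decay_factor=10.0):
    """Step decay: divide the learning rate by decay_factor at each boundary."""
    lr = initial_lr
    for boundary in boundaries:
        if step >= boundary:
            lr /= decay_factor
    return lr

print(learning_rate(39999))  # 0.1
print(learning_rate(40000))  # 0.01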
As for training and validation not stepping down together: that would be surprising here, since the change at 40k clearly affects training and validation in the same way. You could still see different behaviours from that point on: for instance, you might start overtraining because of a too-small learning rate, and see your training error drop while your validation error goes up, because the refinements you are making are too specific to the training data.

Training an RNN with changing input sizes in TensorFlow

I want to train an RNN on sentences X of different input sizes, without padding. The logic I use is: I use global variables and, for every step, I take one example, write the forward propagation (i.e. build the graph), run the optimizer, and then repeat with another example. The program is extremely slow compared to a NumPy implementation of the same thing, in which I implemented forward and backward propagation myself using the same logic. The NumPy implementation takes a few seconds, while TensorFlow is extremely slow. Would running the same thing on a GPU be useful, or am I making some logical mistake?
As a general guideline, the GPU boosts performance only if you have computation-intensive code and little data transfer. In other words, if you train your model one instance at a time (or on small batch sizes), the overhead of data transfer to/from the GPU can even make your code run slower! But if you feed in a good chunk of samples at once, then the GPU will definitely boost your code.
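A small sketch of that trade-off, written in PyTorch for brevity (the same effect holds in TensorFlow); it assumes a CUDA device is available, and the layer and tensor sizes are made up purely for illustration:

import time
import torch

model = torch.nn.Linear(512, 512).cuda()
data = torch.randn(1024, 512)

# One sample at a time: 1024 host-to-device transfers plus 1024 tiny kernels.
start = time.perf_counter()
for i in range(data.shape[0]):
    model(data[i:i + 1].cuda())
torch.cuda.synchronize()
print("per-sample:", time.perf_counter() - start)

# One big batch: a single transfer followed by one large matrix multiply.
start = time.perf_counter()
model(data.cuda())
torch.cuda.synchronize()
print("batched:", time.perf_counter() - start)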

Can I train a model in steps in Keras?

I've got a model in Keras that I need to train, but this model invariably blows up my little 8GB memory and freezes my computer.
I've gone as far as training on just one single sample (batch size = 1) and it still blows up.
Please assume my model has no mistakes or bugs and that this question is not about "what is wrong with my model". (Yes, smaller models work fine with the same data, but they aren't good enough for the task.)
How can I split my model in two and train each part separately, while propagating the gradients between them?
Is this possible? (There is no restriction on using Theano or TensorFlow.)
Using CPU only, no GPU.
You can do this, but it will cause your training time to approach lengths that will only make the results useful for future generations.
Let's consider everything we have in memory when we train with a batch size of 1 (assuming you've only read that one sample into memory):
1) that sample
2) the weights of your model
3) the activations of each layer  # your model stores these for backpropagation
None of this can be dropped during training. However, you could, theoretically, do a forward pass on the first half of the model, dump its weights and activations to disk, load the second half of the model, do a forward pass and then the backward pass on that half, dump its weights and activations to disk, load back the weights and activations of the first half, and complete the backward pass on it. This process could be split up even further, to the point of doing one layer at a time.
OTOH, this is akin to what swap space does, without you having to think about it. If you want a slightly less optimized version of this (and optimization is clearly moot at this point), you can just increase your swap space to 500GB and call it a day.
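For what it's worth, here is a minimal sketch of the "two halves with gradients propagated between them" idea, written in PyTorch for brevity (the question mentions Theano/TensorFlow, but the mechanics are the same in any framework that exposes gradients); the layer sizes are made up, and the dumping to disk described above is omitted:

import torch
import torch.nn as nn

half_a = nn.Sequential(nn.Linear(100, 50), nn.ReLU())  # first half of the model
half_b = nn.Sequential(nn.Linear(50, 10))              # second half of the model
opt_a = torch.optim.SGD(half_a.parameters(), lr=0.01)
opt_b = torch.optim.SGD(half_b.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(1, 100)          # batch size 1, as in the question
y = torch.randint(0, 10, (1,))

# Forward through the first half, then cut the graph at the boundary.
mid = half_a(x)
mid_detached = mid.detach().requires_grad_(True)

# Forward and backward through the second half only.
loss = loss_fn(half_b(mid_detached), y)
loss.backward()
opt_b.step()

# Push the gradient of the boundary activation back through the first half.
mid.backward(mid_detached.grad)
opt_a.step()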

LSTM time sequence generation using PyTorch

For several days now, I have been trying to build a simple sine-wave sequence generator using an LSTM, without any glimpse of success so far.
I started from the time sequence prediction example
All I wanted to do differently is:
Use a different optimizer (e.g. RMSprop) instead of LBFGS
Try different signals (more sine-wave components)
This is the link to my code. "experiment.py" is the main file
What I do is:
I generate artificial time-series data (sine waves)
I cut those time-series data into small sequences
The input to my model is a sequence of time 0...T, and the output is a sequence of time 1...T+1
What happens is:
The training and validation losses go down smoothly
The test loss is very low
However, when I try to generate arbitrary-length sequences, starting from a seed (a random sequence from the test data), everything goes wrong. The output always flattens out
I simply don't see what the problem is. I have been playing with this for a week now, with no progress in sight.
I would be very grateful for any help.
Thank you
This is normal behaviour and happens because your network is too confident in the quality of its input and doesn't learn to rely on the past (on its internal state) enough, relying solely on the input. When you apply the network to its own output in the generation setting, the input to the network is not as reliable as it was in the training or validation case, where it got the true input.
I have two possible solutions for you:
The first is the simplest but least intuitive one: add a little bit of Gaussian noise to your input. This will force the network to rely more on its hidden state.
The second is the most obvious solution: during training, feed it not the true input but its own generated output with a certain probability p. Start training with p=0 and gradually increase it, so that it learns to generate longer and longer sequences independently. This is called scheduled sampling, and you can read more about it here: https://arxiv.org/abs/1506.03099 . A small sketch of this follows below.
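Here is a minimal sketch of scheduled sampling for a one-step-at-a-time sequence model (the LSTM cell, hidden size, and training loop are placeholders of mine, not the code from the question; seq holds the values at times 0...T and target the values at times 1...T+1, as described above):

import random
import torch
import torch.nn as nn

class StepLSTM(nn.Module):
    """Predicts the next value of a 1-D signal from the current value."""
    def __init__(self, hidden=51):
        super().__init__()
        self.hidden = hidden
        self.cell = nn.LSTMCell(1, hidden)
        self.out = nn.Linear(hidden, 1)

    def forward(self, x, state):
        h, c = self.cell(x, state)
        return self.out(h), (h, c)

model = StepLSTM()
optimizer = torch.optim.RMSprop(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

def train_step(seq, target, p):
    """seq, target: tensors of shape (T, 1); p: probability of feeding the model its own output."""
    state = (torch.zeros(1, model.hidden), torch.zeros(1, model.hidden))
    inp = seq[0].unsqueeze(0)  # shape (1, 1)
    loss = 0.0
    for t in range(seq.shape[0]):
        pred, state = model(inp, state)
        loss = loss + loss_fn(pred, target[t].unsqueeze(0))
        # Scheduled sampling: with probability p, feed the model its own
        # prediction as the next input instead of the ground truth.
        inp = pred.detach() if random.random() < p else target[t].unsqueeze(0)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Ramp p up from 0 towards 1 over the course of training, e.g. p = min(1.0, epoch / 50).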
