I have a custom speech recognition model which, converted to TFLite, performs excellently in Python on a PC. When I run inference with the same TFLite model on Android, accuracy drops. All the processing (feature extraction, etc.) happens inside the TFLite model's layers, so there is no Android code that could make a difference. The model's input is a waveform and its output is logits, both in Python and on Android. I have double-checked that the microphone input on Android is good quality, but the model still performs significantly worse than on the PC.
My model contains batch normalization layers and I suspect they might be the problem. I'm not sure why there would be an inconsistency between Android and Python. Has anyone else come across this problem?
Things that I ruled out:
Microphone - I'm using the same BT headset on Android and PC
Model architecture - I tried two different model architectures (DeepSpeech and Jasper), both resulting in the same accuracy degradation on Android, but they work perfectly on PC
Quantization - my model is quantized, but it works well on PC, without accuracy loss
I can't give any hard and fast answers without further investigation, but a few possible causes come to mind:
Quantisation:
If you are comparing the network's accuracy before and after conversion to TFLite, then depending on your quantisation settings some drop is expected. However, if you are running the TFLite Interpreter from Python as well, that shouldn't be the cause. (reference for Quantisation)
Domain Shift:
Just because both microphones are good quality doesn't mean that their output is the same. The compression, noise profile etc could all be different. That means your input data may be from a slightly different domain, which would then lower performance.
To test this, try saving a recording (from either device) and feeding that same recording into the model on both platforms. The output should be the same; if it is not, you have some other bug.
To fix this, use recordings from the phone for your validation data when training the network (and train on some recordings from the phone too), and use recordings from the phone as your point of comparison from python.
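The recording test could be set up roughly like this. It is only a sketch using the Python standard library: it assumes a 16-bit mono PCM WAV file, and `load_waveform` and `max_abs_diff` are helper names I made up. Running the model itself is left out because it depends on your setup; the idea is to feed the same samples to the interpreter on both platforms and compare the logits you get back.

```python
import struct
import wave


def load_waveform(path):
    """Read a 16-bit mono PCM WAV file; return (sample_rate, floats in [-1, 1))."""
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2 or wav.getnchannels() != 1:
            raise ValueError("this sketch assumes 16-bit mono PCM")
        rate = wav.getframerate()
        raw = wav.readframes(wav.getnframes())
    count = len(raw) // 2
    samples = struct.unpack("<%dh" % count, raw)
    # Scale int16 to floats; whether your model expects this range is an assumption.
    return rate, [s / 32768.0 for s in samples]


def max_abs_diff(logits_a, logits_b):
    """Largest element-wise difference between two flat lists of logits."""
    if len(logits_a) != len(logits_b):
        raise ValueError("outputs have different lengths")
    return max(abs(a - b) for a, b in zip(logits_a, logits_b))
```

If the difference on an identical recording is close to zero, the model is behaving the same on both platforms and the problem is more likely in the live audio path.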
Performance issues:
If running the network on the device is causing something to slow down, you may change the effective sample rate, even if the mic was identical. If you've fixed the domain shift error, then this is my last guess. Again it effectively causes a domain shift, but it's harder to compensate for. With this it's best to optimise your code so that it isn't causing a slowdown, or if your implementation permits it, record the sample and run the inference "offline" rather than in parallel to data collection.
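One rough way to check for this is to count how many samples actually arrive over a measured stretch of wall-clock time and compare that with the nominal rate. The function names below are made up for illustration, and how you obtain the sample count and elapsed time on the device is up to your recording code.

```python
def effective_sample_rate(num_samples, elapsed_seconds):
    """Samples actually delivered per second of wall-clock time."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return num_samples / elapsed_seconds


def rate_drift(num_samples, elapsed_seconds, nominal_rate):
    """Relative deviation from the nominal rate (0.0 means no drift)."""
    return effective_sample_rate(num_samples, elapsed_seconds) / nominal_rate - 1.0
```

A clearly negative drift while inference is running, but not while it is idle, would point to samples being dropped under load.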
Data handling bugs:
If your pipeline from the microphone to the network is different (which seems likely given it's on Android rather than Python on the PC), then it's possible that there is a bug in how the data gets to the network, whether that's lost packets or something that causes a problem with normalization.
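A cheap way to look for this kind of bug is to log a few summary statistics of the buffer right before it goes into the model on each platform and compare them. The sketch below is plain Python with made-up helper names; the `[-1, 1]` range is an assumption about what the model expects, so adjust `limit` to whatever your model was trained on.

```python
import math


def waveform_stats(samples):
    """Summary statistics for comparing the same audio on two platforms."""
    n = len(samples)
    if n == 0:
        raise ValueError("empty buffer")
    mean = sum(samples) / n
    rms = math.sqrt(sum(s * s for s in samples) / n)
    return {"count": n, "min": min(samples), "max": max(samples),
            "mean": mean, "rms": rms}


def looks_unnormalized(samples, limit=1.0):
    """True if any sample lies outside [-limit, limit],
    e.g. raw int16 values fed to a model that expects floats."""
    return any(abs(s) > limit for s in samples)
```

A differing `count` hints at lost packets, while a very different `rms` or `max` hints at a scaling or normalization mismatch.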
Related
I'm trying to train my neural network. Its aim is to predict video quality. I am using VGG16+LSTM networks, but the VGG16 is not trainable. The total number of trainable parameters is 700,000.
I have three questions:
Are 700,000 trainable parameters enough for training on 68,000 input frames?
Is it essential to train VGG16?
How many epochs are needed to get the best results?
I haven't been into machine learning in a while, but my understanding is that:
Depends, but the only way to find out is to train it and look for over/underfitting.
Depends on the network layout. It might also be useful to bypass some information around the VGG16, in case the VGG16 hides some of the information you actually need about 'video quality'.
Depends. You would have to split your data into a training and a test set in order to find that out.
As with most things in machine learning, and especially deep learning, the answers aren't obvious and depend heavily on the problem and the exact network layout. There will be much trial and error involved.
The most important takeaway, I think, is to have two (or even three) different datasets for the training/validation/test step, so you can answer those questions yourself.
For more information, read the Wikipedia entry about splitting your datasets.
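A three-way split can be sketched like this. The function name and the default fractions are my own choices, not anything standard. For frames taken from videos it is probably safer to split a list of video IDs rather than individual frames, so that near-identical frames from one clip do not end up in both the training and the test set.

```python
import random


def split_dataset(items, val_fraction=0.15, test_fraction=0.15, seed=0):
    """Shuffle items reproducibly and split them into (train, val, test) lists."""
    items = list(items)
    random.Random(seed).shuffle(items)  # fixed seed keeps the split stable
    n_test = int(len(items) * test_fraction)
    n_val = int(len(items) * val_fraction)
    test = items[:n_test]
    val = items[n_test:n_test + n_val]
    train = items[n_test + n_val:]
    return train, val, test
```

You would tune on the validation part and touch the test part only once at the end.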
Start with one epoch and see what impact it has.
Even one epoch will take a long time, and computing the error also takes a while.
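If you record the validation loss after every epoch, picking the stopping point can be automated with a simple patience rule. This is a generic early-stopping sketch, not something specific to VGG16+LSTM; `best_epoch` and the `patience` default are made up for illustration, and epochs are counted from 0.

```python
def best_epoch(val_losses, patience=3):
    """Return (epoch_index, loss) of the best validation loss seen, giving up
    once `patience` epochs in a row fail to improve on it."""
    if not val_losses:
        raise ValueError("need at least one epoch")
    best_idx, best_loss = 0, val_losses[0]
    for epoch, loss in enumerate(val_losses[1:], start=1):
        if loss < best_loss:
            best_idx, best_loss = epoch, loss
        elif epoch - best_idx >= patience:
            break  # no improvement for `patience` epochs
    return best_idx, best_loss
```

In a real training loop you would apply the same rule online and keep the weights saved at the best epoch.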
I am currently using Tensorflow Object Detection API for my human detection app.
I tried filtering in the API itself, which worked, but I am still not content with it because it's slow. So I'm wondering if I could remove the other categories from the model itself to also make it faster.
If that is not possible, can you please give me other suggestions to make the API faster, since I will be using two cameras? Thanks in advance, and pardon my English :)
Your question addresses several topics related to using pretrained neural network models.
Theoretical methods
In general, you can always neutralize categories by removing the corresponding neurons in the softmax layer and computing a new softmax only with the relevant rows of the matrix.
This method will surely work (maybe that is what you meant by filtering) but will not accelerate the network computation time by much, since most of the flops (multiplications and additions) will remain.
Similar to decision trees, pruning is possible but may reduce performance. I will explain what pruning means, but note that the accuracy over your categories may hold up, since you are not just trimming: you are predicting fewer categories as well.
Transfer the learning to your problem. See Stanford's course in computer vision here. In most cases I've seen, what works well is keeping the convolution layers as-is and preparing a medium-size dataset of the objects you'd like to detect.
I will add more theoretical methods if you request, but the above are the most common and accurate I know.
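To make the first method concrete, here is a small sketch of a softmax restricted to the categories you care about. It works on a single vector of class logits, which is a simplification (a detector produces such scores per box), and `softmax_subset` is a made-up name.

```python
import math


def softmax_subset(logits, keep):
    """Softmax computed only over the class indices listed in `keep`."""
    kept = [logits[i] for i in keep]
    peak = max(kept)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in kept]
    total = sum(exps)
    return {i: e / total for i, e in zip(keep, exps)}
```

Note that this only touches the last step, so the expensive convolutional part of the network still runs in full.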
Practical methods
Make sure you are serving your TensorFlow model, and not just running inference from a Python script. This could significantly accelerate performance.
You can export the parameters of the network and load them in a faster framework such as CNTK or Caffe. These frameworks work in C++/C# and can run inference much faster. Make sure you load the weights correctly: some frameworks use a different order of tensor dimensions when saving/loading (similar to little/big-endian issues).
If your application performs inference on several images, you can distribute the computation across several GPUs. This can also be done in TensorFlow, see Using GPUs.
Pruning a neural network
Maybe this is the most interesting method of adapting big networks for simple tasks. You can see a beginner's guide here.
Pruning means that you remove parameters from your network, specifically whole nodes or neurons in a decision tree or neural network, respectively. To do that in object detection, you can do as follows (simplest way):
Randomly prune neurons from the fully connected layers.
Train one more epoch (or more) with low learning rate, only on objects you'd like to detect.
(optional) Perform the above several times for validation and choose best network.
The above procedure is the most basic one, but you can find plenty of papers that suggest algorithms to do so, for example Automated Pruning for Deep Neural Network Compression and An iterative pruning algorithm for feedforward neural networks.
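The random pruning step can be sketched in plain Python as below. The layout of the weight lists is an assumption I picked for the example (real frameworks store these as tensors), `prune_neurons` is a made-up name, and the fine-tuning epoch that should follow is not shown.

```python
import random


def prune_neurons(weights, biases, next_weights, fraction, seed=0):
    """Randomly drop a fraction of the neurons in one fully connected layer.

    weights[i][j]      : weight from input i to neuron j
    biases[j]          : bias of neuron j
    next_weights[j][k] : weight from neuron j to output k of the next layer
    """
    n = len(biases)
    n_drop = int(n * fraction)
    dropped = set(random.Random(seed).sample(range(n), n_drop))
    keep = [j for j in range(n) if j not in dropped]
    new_weights = [[row[j] for j in keep] for row in weights]  # drop columns
    new_biases = [biases[j] for j in keep]
    new_next = [next_weights[j] for j in keep]  # drop matching rows downstream
    return new_weights, new_biases, new_next, keep
```

Removing a neuron means removing both its incoming column and its outgoing row, otherwise the shapes of the two layers no longer match.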
Processed data is real-time video (a bunch of sequential frames) and it all needs to end up in a DX12 buffer.
I don't care too much if data gets copied to system memory during training, but during evaluation, it must stay on GPU.
I would train the network separately in Python, where high latency is acceptable, but after it is trained I would use it entirely on the GPU (because my frames are already there). From my standpoint (experienced with GPGPU programming but not so much with TensorFlow) there are two ways of doing this:
Extracting the parameters (weights and biases) from the trained model in Python and uploading them to a C++ program that implements the same network topology on the GPU and runs it there. It should behave like the TensorFlow network it was trained as.
Using TensorFlow in the C++ program as well and just passing the buffer handles for input and output (the way you would do with GPGPU) and then interop-ing with DX12 (because I need the evaluations to end up there).
I would like to know whether either of those options is possible and, if so, which one is better and why.
If I left anything unclear, let me know in the comments.
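To illustrate what I mean by the first option, the Python side could dump each weight tensor in a trivial binary layout that the C++ program reads back and uploads to a GPU buffer. The layout below (a little-endian uint32 element count followed by float32 values) is something I made up for this example, as are the function names; getting the flat list of values out of the trained model is not shown.

```python
import struct


def pack_tensor(values):
    """Serialize a flat list of floats as: uint32 count, then float32 values
    (all little-endian)."""
    return struct.pack("<I%df" % len(values), len(values), *values)


def unpack_tensor(blob):
    """Inverse of pack_tensor, useful for checking the file on the Python side."""
    (count,) = struct.unpack_from("<I", blob, 0)
    return list(struct.unpack_from("<%df" % count, blob, 4))
```

The dimension order would also have to be agreed on between the two sides, which this flat layout does not record.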
For learning purposes, I am trying to implement a CNN from scratch, but the results do not seem to improve from random guessing. I know this is not the best approach on home hardware, and following course.fast.ai I have obtained much better results via transfer learning, but for a deeper understanding I would like to see, at least in theory, how one could do it otherwise.
Testing on CIFAR-10 posed no issues - a small CNN trained from scratch in a matter of minutes with an error of less than 0.5%.
However, when trying to test against the Cats vs. Dogs Kaggle dataset, the results did not budge from 50% accuracy. The architecture is basically a copy of AlexNet, including the non-state-of-the-art choices (large filters, histogram equalization, Nesterov-SGD optimizer). For more details, I put the code in a notebook on GitHub:
https://github.com/mspinaci/deep-learning-examples/blob/master/dogs_vs_cats_with_AlexNet.ipynb
(I also tried different architectures, more VGG-like and using Adam optimizer, but the result was the same; the reason why I followed the structure above was to match as closely as possible the Caffe procedure described here:
https://github.com/adilmoujahid/deeplearning-cats-dogs-tutorial
and that seems to converge quickly enough, according to the author's description here: http://adilmoujahid.com/posts/2016/06/introduction-deep-learning-python-caffe/).
I was expecting some fitting to happen quickly, possibly flattening out due to the many suboptimal choices made (e.g. small dataset, no data augmentation). Instead, I saw no improvement at all, as the notebook shows.
So I thought that maybe I was simply overestimating my GPU and patience, and that the model was too complicated even to overfit my data in a few hours (I ran 70 epochs, each time roughly 360 batches of 64 images). Therefore I tried to overfit as hard as I could, running these other models:
https://github.com/mspinaci/deep-learning-examples/blob/master/Ridiculously%20overfitting%20models...%20or%20maybe%20not.ipynb
The purely linear model started showing some overfit - around 53.5% training accuracy vs 52% validation accuracy (which I guess is thus my best result). That followed my expectations. However, to try and overfit as hard as I could, the second model is a simple two-layer feedforward neural network, without any regularization, that I trained on just 2000 images with batch size up to 500. I was expecting the NN to overfit wildly, quickly getting to 100% train accuracy (after all it has 77M parameters for 2k pictures!). Instead, nothing happened, and the accuracy flattened to 50% quickly enough.
Any tip about why none of the "multi-layer" models seems able to pick up any feature (be it "true" or out of overfitting) would be very much appreciated!
Note on versions etc: the notebooks were run on Python 2.7, Keras 2.0.8, Theano 0.9.0. The OS is Windows 10, and the GPU is a GeForce GTX 960M, which is not very powerful but should be sufficient for basic tasks.
I'm trying to make a simple gesture recognition system to use with my Raspberry Pi equipped with a camera. I would like to train a neural network with TensorFlow on my more powerful laptop and then transfer it to the RPi for prediction (as part of a Magic Mirror). Is there a way to export the trained network and weights and use a lightweight version of TensorFlow for the linear algebra and prediction, without the overhead of all the symbolic graph machinery that is necessary for training? I have seen the tutorials on TensorFlow server, but I'd rather not set up a server and just have it run the prediction on the RPi.
Yes, this is possible, and the code is available in the source repository. It allows you to deploy and run a model trained on your laptop. Note that this is the same model, which can be big.
To deal with size and efficiency, TF is currently moving towards a quantization approach. After your model is trained, a few extra steps allow you to "translate" it into a lighter model with similar accuracy. Currently, the implementation is quite slow, though. There is a recent post that shows the whole process for iOS, which is pretty similar to the Raspberry Pi overall.
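To give an idea of what quantization does to the weights, here is a toy version of linear 8-bit quantization in plain Python. It is a generic illustration of the idea and not TF's actual implementation; the function names are made up.

```python
def quantize(values, levels=256):
    """Map floats to integer codes in [0, levels - 1] using the range of `values`."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / (levels - 1) or 1.0  # fall back to 1.0 for constant input
    codes = [round((v - lo) / scale) for v in values]
    return codes, lo, scale


def dequantize(codes, lo, scale):
    """Approximate reconstruction of the original floats."""
    return [lo + c * scale for c in codes]
```

Each weight is then stored in one byte instead of four, at the cost of a rounding error of at most half a step (`scale / 2`) per value.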
The Makefile contribution is also quite relevant for tuning and extra configuration.
Beware that this code changes often and breaks. It is sometimes useful to check out an old "release" tag to get something that works end to end.