Tensorflow Multiple Graph & Patch Concern - python

I have a situation where I am using two different networks, one network to tell me if there is important information in a given patch, and a second network to tell me where the important information in the patch is using segmentation.
If I operate them in the same TF Graph / Session, I end up having to use tf.where or tf.cond to tell me which patches I actually want to use, but my optimizer is creating gradients for each condition for the whole net, or at least that is my working theory.
This uses segmentation_logit = tf.where(is_useful_patch, coarse_log, negative_log), where negative_log is a tensor of zeros with the same shape as the coarse logit.
If I am using 192 (128x128) patches, the optimizer attempts to create a gradient tensor with over 100 million elements (e.g. shape [192, 222, 129, 128]), which exhausts my GPU RAM and causes a crash.
So, short of actually defining two different sessions, graphs, savers, restorers and tensorboard writers, is there a better way to go about this, a better way to calculate gradients, or a way to combine multiple graphs in the same session?
Thanks in advance!

I'm supposing you get a 192 long is_useful_patch vector with values between 0 and 1 (probabilities) as a result of the first network.
First of all, forget tf.cond or tf.where. I suggest taking a smaller number, like 16 or so (or whatever is best based on your experience and how many useful patches there normally are), and taking the indices of the best 16 patches with tf.nn.top_k like this:
values, idx_best_patches = tf.nn.top_k(is_useful_patch, k=16,
                                       sorted=False, name='idx_best_patches')
Then use tf.gather_nd to collect the best patches like this:
best_patches = tf.gather_nd(patches, idx_best_patches, name='best_patches')
This will collect your best 16 patches and then you continue only with those 16 into the segmenter, instead of 192, having just cut the memory requirement for the segmenter to 1/12. This is the core of the idea.
If there are fewer than 16 patches with useful information, you can mask some of the output. Also, I have no idea how your patches are structured, so make sure you review the tf.gather_nd parameters for correctness, it can be tricky; with 1-D indices like these, plain tf.gather over the patch axis may be the simpler choice.
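As a shape sanity check, the same top-k selection can be sketched in plain numpy (the array names, the [192, 128, 128] patch layout, and first-axis gathering are assumptions, not the asker's actual tensors):

```python
import numpy as np

rng = np.random.default_rng(0)
is_useful_patch = rng.random(192)          # stand-in scores from the first network
patches = rng.random((192, 128, 128))      # stand-in image patches

# Equivalent of tf.nn.top_k(is_useful_patch, k=16, sorted=False):
k = 16
idx_best_patches = np.argpartition(is_useful_patch, -k)[-k:]

# Gathering along the first axis (tf.gather in TF terms):
best_patches = patches[idx_best_patches]
print(best_patches.shape)  # (16, 128, 128)
```

Only the 16 selected patches then flow into the segmenter, so gradients are only built for those.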


Should you shuffle the input for a word2vec model before or after assigning negative context pairs for each target word?

I'm working on a word2vec with negative sampling implementation using python and tensorflow+keras. The initial input for the script is a list of positive target-word pairs, which is processed via looping through them and assigning a number of negative examples to each, then the positive + k negative samples are appended in the corresponding order to a new list. That list is later (after a few adjustments) passed to a keras model.fit():
model.fit([data[:, 0], data[:, 1]], data[:, 2],
          batch_size=numof_positives * (numof_negatives + 1))
I looked through some examples, and from what I understand, the batches passed to the neural network should contain the negative context words of those positives that are present in the batch, meaning that shuffling of the data should take place before assigning the negatives. On the other hand, I did not realize that keras' model.fit() has its shuffle argument on True by default, so first it was run with the data being shuffled after the assignment as well. Now that I've added shuffle=False, it seems like it affected the quality of the resulting embedding vectors negatively. Can that be the case? Where should the input be shuffled? What are the implications of passing completely randomly ordered data vs ordered batches?
I may have a few trust issues with the shuffle argument of keras' model.fit(), after experiencing this bug regarding shuffle='batch' first hand.
The value of shuffling with respect to word2vec training (that I'm familiar with) is to avoid cases where all examples of a word, or all similar senses, are clumped together in one range of the training data. (You don't want the model to get really good at those examples where a word/sense is overrepresented, only to have that overly-specialized performance lost during a long run of samples where those same words/senses are completely unrepresented.)
That's something you can achieve with a corpus-shuffle before any batch-shuffle – which might separately bring its own benefits, for similar reasons, by achieving better interleaving of contrasting microexamples.
The word2vec implementations I know tend to do the backprop-updates of a positive-example closely-alongside the corresponding N synthetic negative-examples from the same context. That is, no extra shuffling will move them further from each other.
But it's not impossible that further shuffling could help! So I'd not let any opinion about what theoretically "should" happen override empirical observations of tradeoffs seen in real runs. (Do what works best!)
(Alternatively, if the actual goal is perfectly-reproducing some other implementation's choices, then you'd mainly want to mimic its actual code, and verify by comparing quantitative results on same-data/same-evaluation tasks.)
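For concreteness, here is a minimal sketch of the "shuffle the positives before assigning negatives" ordering discussed above; the pair list, numof_negatives and the sample_negative helper are hypothetical stand-ins, not the asker's code:

```python
import random

random.seed(0)

# Hypothetical positive (target, context) pairs and a stand-in negative sampler.
positive_pairs = [(t, t + 100) for t in range(10)]
vocabulary = list(range(1000))
numof_negatives = 4

def sample_negative():
    return random.choice(vocabulary)

# Corpus-level shuffle BEFORE assigning negatives.
random.shuffle(positive_pairs)

# Each positive is followed immediately by its k synthetic negatives,
# so they stay adjacent no matter how the pairs were shuffled.
data = []
for target, context in positive_pairs:
    data.append((target, context, 1))                 # positive example
    for _ in range(numof_negatives):
        data.append((target, sample_negative(), 0))   # negative example
```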

Pattern prediction in time series

Has anyone tried to predict a specific pattern in time series data?
Example: In a specific time, there is a huge upward spike in certain variables in a time series...
How would I build a model to predict that spike when next time it occurs?
Please do respond if anyone is working in this area.
I tried converting that particular series of data into a NumPy array and feeding it to the model, but it's not working.
Here is what the data looks like:
This data was generated in a controlled manner so that the spikes occur close together. In the actual case they could be random, and our main objective is to catch this pattern and count the occurrences.
Das, you could try implementing LSTM-based neural network models.
See:
https://machinelearningmastery.com/time-series-prediction-lstm-recurrent-neural-networks-python-keras/
It is still preferred that the data contains a trend. If the upward spike happens around the same point of a recurring time interval, you are more likely to get a good prediction result.
In the image you shared, there seems to be a trend in the data. Hence LSTM models can extract the pattern and output a prediction pretty efficiently.
Statistical modelling of the data can also provide better results.
See: https://orangematter.solarwinds.com/2019/12/15/holt-winters-forecasting-simplified/
Das, if outputting the total number of peaks is the sole requirement, then I think heavy neural network models are a bit of an overkill. Neural network models can also do the job pretty well, but they require a lot of input data for training and fine-tuning the weights and biases to give a really good result.
How about you try implementing a thresholding based technique, where you increment a counter every time the data value crosses the preset threshold? In such an approach you should ensure to group very nearby peaks together so that the count is just one for that case. Here you could set a threshold on the x axis too.
For instance, with respect to the given plot, let the y-threshold be 4. Then you will get a count of 5 if you consider the y-axis threshold (y value 4) alone. This is because for the x value at 15:48.2, there are two peaks that cross y value 4. So suppose you set a threshold on the x axis too; then these nearby peaks will be grouped together within the preset limit and the final count will be 4 (which is the requirement).
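A minimal sketch of that threshold-plus-grouping count; the signal values, the y-threshold of 4 and the grouping window are made-up stand-ins for the plot's data:

```python
import numpy as np

# Hypothetical signal: spikes at indices 2, 6/8 (close together) and 12.
y = np.array([0, 1, 5, 1, 0, 0, 6, 0, 5, 0, 1, 0, 7, 1, 0])
y_threshold = 4      # count only values above this
x_group_window = 2   # crossings at most this far apart count as one peak

above = np.where(y > y_threshold)[0]   # x positions crossing the threshold
# Start a new peak only when the gap to the previous crossing exceeds the window.
count = 0 if above.size == 0 else 1 + int(np.sum(np.diff(above) > x_group_window))
print(count)  # 3: the two crossings at indices 6 and 8 are grouped into one peak
```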

How to determine most impactful input variables in a dataset?

I have a neural network program that takes in input variables and output variables, and uses forecasted data to predict what the output variables should be. After running this program, I will have an output vector. Let's say, for example, my input matrix is 100 rows and 10 columns and my output is a vector with 100 values. How do I determine which of my 10 variables (columns) had the most impact on my output?
I've done a correlation analysis between each of my variables (columns) and my output and created a list of the highest correlation between each variable and output, but I'm wondering if there is a better way to go about this.
What you want to know is model selection, and it's not as simple as studying the correlation of your features with your target. For an in-depth, well-explained look at model selection, I'd recommend you read chapter 7 of The Elements of Statistical Learning. If what you're looking for is how to explain your network, then you're in for a treat as well and I'd recommend reading this article for starters, though I won't go into the matter myself.
Naive approaches to model selection:
There are a number of ways to do this.
The naïve way is to estimate all possible models, i.e. every combination of features. With your 10 features that is already 2^10 = 1024 models to train, so it quickly becomes computationally unfeasible.
Another way is to take a variable you think is a good predictor and train the model only on that variable. Compute the error on the training data. Take another variable at random, retrain the model and recompute the error on the training data. If the error drops, keep the variable. Otherwise discard it. Keep going for all features.
A third approach is the opposite. Start with training the model on all features and sequentially drop variables (a less naïve approach would be to drop variables you intuitively think have little explanatory power), compute the error on training data and compare to know if you keep the feature or not.
There are a million ways of going about this. I've exposed three of the simplest, but again, you can go really deep into this subject and find all kinds of different information (which is why I highly recommend you read that chapter :) ).
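The second approach above (greedily adding variables that drop the training error) can be sketched like this; a least-squares fit stands in for retraining the asker's neural network, and the data is synthetic:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 100 rows, 10 features; only columns 2 and 7 truly matter.
X = rng.normal(size=(100, 10))
y = 3.0 * X[:, 2] - 2.0 * X[:, 7] + 0.1 * rng.normal(size=100)

def training_error(cols):
    """Least-squares fit on the chosen columns (stand-in for retraining the net)."""
    A = X[:, cols]
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return np.mean((A @ coef - y) ** 2)

# Start from one variable, add the others one at a time,
# and keep each only if it meaningfully drops the training error.
kept = [0]
error = training_error(kept)
for col in range(1, 10):
    new_error = training_error(kept + [col])
    if new_error < 0.9 * error:
        kept.append(col)
        error = new_error
print(kept)  # the two informative columns, 2 and 7, end up kept
```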

Multidimensional Prediction

I have a list of temporal series of values measured in different places. These measurements may or may not be correlated (mostly depending on their relative positions, but it is plausible that some very close detectors would actually measure decorrelated series). I would like to predict the values of the whole set, taking into account the series of all of them and their correlations through time. If it is of any help, the values should also have relative periodicity.
EDIT: I have access to the generated power of several solar panels. These solar panels are spread spatially, and I would like to use them as 'irradiance detectors'. Knowing the sun illumination in several places in the past, I wish to identify correlations in between signals, which could then be used to make predictions of illumination.
Regardless of the usual patterns of production through a day (as seen in the image), what I am interested in is the information I can extract from one panel's past to predict another one's future.
I think I would need a neural network to solve this problem, but I am not sure how to feed it: I thought of using a temporal window and feeding my NN with a few past values from A, B and C, but I am afraid it's a little weak.
The image shows an example of what my data looks like.
How can I predict the next values of curve A knowing past values of A, B and C?
How to handle this prediction?
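The temporal-window feed mentioned in the question can be sketched as a plain data-preparation step; the series, the window length and the choice of A as the target are made-up assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical series A, B, C (e.g. per-panel power), 500 time steps, 3 columns.
series = rng.random((500, 3))
window = 24   # number of past values of A, B and C fed to the network

# Each input row is the flattened last `window` steps of all three series;
# each target is the next value of A.
X = np.stack([series[t - window:t].ravel() for t in range(window, len(series))])
y = series[window:, 0]
print(X.shape, y.shape)  # (476, 72) (476,)
```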
I think the easiest way is to train 3 models with the same input, where each predicts one value (A, B or C).
If you are sure about the correlation between the input variables and their impact on the predicted output, you may create one neural network with a common branch (probably an RNN over the 3 stacked inputs) and then 3 different prediction heads, each producing one prediction (A, B or C). The Fast R-CNN architecture is a great example of this.
The best way to achieve this task is to use an RNN.
A good tutorial for learning how to develop such a neural network is here:
https://www.tensorflow.org/tutorials/recurrent
I also found this link, where they achieved training an RNN for a rather close problem:
http://blog.datatonic.com/2016/11/traffic-in-london-episode-ii-predicting.html
An even better inspiration:
http://machinelearningmastery.com/time-series-prediction-lstm-recurrent-neural-networks-python-keras/

Limit neural network output to subset of trained classes

Is it possible to pass a vector to a trained neural network so it only chooses from a subset of the classes it was trained to recognize? For example, I have a network trained to recognize numbers and letters, but I know that the images I'm running it on next will not contain lowercase letters (such as images of serial numbers). Then I pass it a vector telling it not to guess any lowercase letters. Since the classes are exclusive, the network ends in a softmax function. The following are just examples of what I'd thought of trying, but none really work.
import numpy as np

def softmax(arr):
    return np.exp(arr) / np.exp(arr).sum()

# Stand-ins for previous layer/NN output and vector of allowed answers.
output = np.array([0.15885351, 0.94527385, 0.33977026, -0.27237907, 0.32012873,
                   0.44839673, -0.52375875, -0.99423903, -0.06391236, 0.82529586])
restrictions = np.array([1, 1, 0, 0, 1, 1, 1, 0, 1, 1])

# Ideas -----
'''First: Multiply by restrictions before sending it through softmax.
I stupidly tried this one.'''
results = softmax(output * restrictions)

'''Second: Multiply the results of the softmax by the restrictions.'''
results = softmax(output)
results = results * restrictions

'''Third: Remove invalid entries before calculating the softmax.'''
result = output * restrictions
result[result != 0] = softmax(result[result != 0])
All of these have issues. The first one causes invalid choices to default to:
1/np.exp(arr).sum()
since inputs to softmax can be negative this can raise the probability given to an invalid choice and make the answer worse. (Should've looked into it before I tried it.)
The second and third both have similar issues in that they wait until right before an answer is given to apply the restriction. For example, if the network is looking at the letter l but starts to determine that it's the number 1, this won't be corrected until the very end with these methods. So if it was on its way to giving the output 1 with .80 probability, but then this option is removed, it seems the remaining options redistribute and the highest valid answer won't be as confident as 80%. The remaining options end up a lot more homogeneous.
An example of what I'm trying to say:
output
Out[75]: array([ 5.39413513, 3.81445419, 3.75369546, 1.02716988, 0.39189373])
softmax(output)
Out[76]: array([ 0.70454877, 0.14516581, 0.13660832, 0.00894051, 0.00473658])
softmax(output[1:])
Out[77]: array([ 0.49133596, 0.46237183, 0.03026052, 0.01603169])
(Arrays were ordered to make it easier.)
In the original output the softmax gives .70 that the answer is [1,0,0,0,0], but if that's an invalid answer and is thus removed, the redistribution now assigns the 4 remaining options under 50% probability each, which could easily be ignored as too low to use.
I've considered passing a vector into the network earlier as another input but I'm not sure how to do this without requiring it to learn what the vector is telling it to do, which I think would increase time required to train.
EDIT: I was writing way too much in the comments so I'll just post updates here. I did eventually try giving the restrictions as an input to the network. I took the one-hot-encoded answer and randomly added extra enabled classes to simulate an answer key, ensuring the correct answer was always in the key. When the key had very few enabled categories the network relied heavily on it and it interfered with learning features from the image. When the key had a lot of enabled categories it seemingly ignored the key completely. This could have been a problem that needed optimizing, an issue with my network architecture, or something that just needed a tweak to training, but I never got around to the solution.
I did find out that removing answers and zeroing were almost the same when I eventually subtracted np.inf instead of multiplying by 0. I was aware of ensembles but as mentioned in a comment to the first response my network was dealing with CJK characters (alphabet was just to make example easier) and had 3000+ classes. The network was already overly bulky which is why I wanted to look into this method. Using binary networks for each individual category was something I hadn't thought of but 3000+ networks seems problematic too (if I understood what you were saying correctly) though I may look into it later.
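The np.inf subtraction mentioned in the edit can be sketched directly in numpy; it is equivalent to removing the masked classes before the softmax, because exp(-inf) is 0 (the logits and the mask here are made up):

```python
import numpy as np

def softmax(arr):
    e = np.exp(arr - arr.max())   # shift by the max for numerical stability
    return e / e.sum()

output = np.array([5.39413513, 3.81445419, 3.75369546, 1.02716988, 0.39189373])
allowed = np.array([0, 1, 1, 1, 1], dtype=bool)   # class 0 is ruled out

# exp(-inf) = 0, so forbidden classes get probability 0 and the remaining
# entries renormalize exactly as if they had been dropped before the softmax.
masked = np.where(allowed, output, -np.inf)
probs = softmax(masked)
print(probs)  # probs[1:] matches softmax(output[1:]); probs[0] is 0
```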
First of all, I will loosely go through available options you have listed and add some viable alternatives with the pros and cons. It's kinda hard to structure this answer but I hope you'll get what I'm trying to put out:
1. Multiply restricted before sending it through softmax.
This obviously may give a higher chance to the zeroed-out entries, as you have written, and seems like a false approach from the beginning.
Alternative: replace impossible values with smallest logit value. This one is similar to softmax(output[1:]), though the network will be even more uncertain about the results. Example pytorch implementation:
import torch

logits = torch.Tensor([5.39413513, 3.81445419, 3.75369546, 1.02716988, 0.39189373])
minimum, _ = torch.min(logits, dim=0)
logits[0] = minimum
print(torch.nn.functional.softmax(logits, dim=0))
which yields:
tensor([0.0158, 0.4836, 0.4551, 0.0298, 0.0158])
Discussion
Citing you: "In the original output the softmax gives .70 that the answer is [1,0,0,0,0] but if that's an invalid answer and thus removed the redistribution how assigns the 4 remaining options with under 50% probability which could easily be ignored as too low to use."
Yes, and you would be in the right when doing that. Even more so, the actual probability for this class is far lower, around 14% (tensor([0.7045, 0.1452, 0.1366, 0.0089, 0.0047])). By manually changing the output you are essentially destroying the properties this NN has learned (and its output distribution), rendering some part of your computations pointless. This points to another problem, stated in the bounty this time:
2. NN are known to be overconfident for classification problems
I can imagine this being solved in multiple ways:
2.1 Ensemble
Create multiple neural networks and ensemble them by summing logits and taking argmax at the end (or applying softmax and then argmax). Hypothetical situation with 3 different models with different predictions:
import torch
predicted_logits_1 = torch.Tensor([5.39413513, 3.81419, 3.7546, 1.02716988, 0.39189373])
predicted_logits_2 = torch.Tensor([3.357895, 4.0165, 4.569546, 0.02716988, -0.189373])
predicted_logits_3 = torch.Tensor([2.989513, 5.814459, 3.55369546, 3.06988, -5.89473])
combined_logits = predicted_logits_1 + predicted_logits_2 + predicted_logits_3
print(combined_logits)
print(torch.nn.functional.softmax(combined_logits))
This would give us the following probabilities after softmax:
[0.11291057 0.7576356 0.1293983 0.00005554 0.]
(notice the first class is now the most probable)
You can use bootstrap aggregating and other ensembling techniques to improve predictions. This approach makes the classifying decision surface smoother and fixes mutual errors between classifiers (given their predictions vary quite a lot). It would take many posts to describe in any greater detail (or separate question with specific problem would be needed), here or here are some which might get you started.
Still I would not mix this approach with manual selection of outputs.
2.2 Transform the problem into binary
This approach might yield better inference time and maybe even better training time if you can distribute it over multiple GPUs.
Basically, each of your classes can either be present (1) or absent (0). In principle you could train N neural networks for N classes, each outputting a single unbounded number (logit). This single number tells whether the network thinks this example should be classified as its class or not.
If you are sure a certain class won't be the outcome, you do not run the network responsible for detecting that class.
After obtaining predictions from all the networks (or a subset of them), you choose the highest value (or the highest probability if you use a sigmoid activation, though it would be computationally wasteful).
Additional benefit would be simplicity of said networks (easier training and fine-tuning) and easy switch-like behavior if needed.
Conclusions
If I were you I would go with the approach outlined in 2.2, as it could easily save you some inference time and would allow you to "choose outputs" in a sensible manner.
If this approach is not enough, you may consider N ensembles of networks, so a mix of 2.2 and 2.1, some bootstrap or other ensembling techniques. This should improve your accuracy as well.
First ask yourself: what is the benefit of excluding certain outputs based on external data. In your post, I don't see why exactly you want to exclude them.
Skipping them won't save computation, as one connection (or one neuron) affects multiple outputs: you can't disable individual connections/neurons.
Is it really necessary to exclude certain classes? If your network is trained well enough, it will know if it's a capital or not.
So my answer: I don't think you should fiddle with any operation before the softmax. This will give you false conclusions. So you have the following options:
Multiply the results of the softmax by the restrictions.
Don't multiply; if the highest class is 'a', convert it to 'A' as output (map lowercase outputs to uppercase)
Train a network that sees no difference between capital and non-capital letters
