Stable Softmax function returns wrong output - python

I implemented the Softmax function and later discovered that it has to be stabilized in order to be numerically stable (duh). And now it is again not stable, because even after subtracting max(x) from my vector, the given vector values are still too big to be used as powers of e. Here is a picture of the code I used to pinpoint the bug; vector here is a sample output vector from forward propagation:
We can clearly see that the values are too big, and instead of probabilities I get these really small numbers, which leads to a small error, which leads to vanishing gradients, finally making the network unable to learn.

You are completely right: just translating the mathematical definition of softmax can make it numerically unstable, which is why you have to subtract the maximum of x before doing any computation.
Your implementation is correct. After the shift, every exponent is at most zero, so exp can never overflow; very negative entries merely underflow to zero, which is harmless. Vanishing/exploding gradients are an independent problem that you might encounter depending on what kind of neural network you intend to use.
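For reference, a minimal sketch of the standard stabilization in plain NumPy (variable names are hypothetical):

import numpy as np

def softmax(x):
    # Shift by the max so the largest exponent is exp(0) = 1; the shift
    # cancels in the ratio, so the result is mathematically identical.
    z = x - np.max(x)
    e = np.exp(z)
    return e / e.sum()

# Even for very large logits the stabilized version stays finite:
print(softmax(np.array([1000.0, 1001.0, 1002.0])))
# -> [0.09003057 0.24472847 0.66524096]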

Related

Trouble with implementing gradient descent methods for optimizing structural geometries

A problem I'm currently working on requires me to optimize some dimension parameters for a structure in order to prevent buckling while still not being over-engineered. I've been able to solve it using iterative (semi-brute-force) methods; however, I am wondering if there is a way to implement a gradient descent method to optimize the parameters. More background is given below:
Let's say we are trying to optimize three length/thickness parameters, (t1,t2,t3) .
We initialize these parameters with some random guess (t1,t2,t3)_g. Through some transformation of each of these parameters (weights and biases), the aim is to obtain (t1,t2,t3)_ideal such that three main criteria (R1,R2,R3)_ideal are met. The criteria are calculated by using (t1,t2,t3)_i as inputs to some structural equations, where i denotes the values at iteration i. Following this, some kind of loss function could be implemented to calculate the error, (R1,R2,R3)_i - (R1,R2,R3)_ideal.
My confusion lies in the fact that traditionally (t1,t2,t3)_ideal would be known, the cost would be a function of the error between (t1,t2,t3)_ideal and (t1,t2,t3)_i, and subsequent iterations would follow. However, in a case where (t1,t2,t3)_ideal is unknown and the known targets (R1,R2,R3)_ideal are an indirect function of the inputs, how would gradient descent be implemented? How would minimizing the cost relate to the step change in (t1,t2,t3)_i?
P.S.: Sorry about the formatting; I cannot embed LaTeX images until my reputation is higher.
I'm having some difficulty understanding how the constraints you're describing are calculated. I'd imagine the quantity you're trying to minimize is the total material used or the cost of construction, not the "error" you describe?
I don't know the details of your specific problem, but it's probably a safe bet that the cost function isn't convex. Any gradient-based optimization algorithm carries the risk of getting stuck in a local minimum. If the cost function isn't computationally intensive to evaluate then I'd recommend you use an algorithm like differential evolution that starts with a population of initial guesses scattered throughout the parameter space. SciPy has a nice implementation of it that allows for constraints (and includes a final gradient-based "polishing" step).
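In case it helps, a minimal sketch of that suggestion; the structural equations, targets, and bounds below are placeholders to substitute with your own:

import numpy as np
from scipy.optimize import differential_evolution

R_ideal = np.array([1.0, 2.0, 3.0])  # hypothetical known targets (R1,R2,R3)_ideal

def structural_equations(t):
    # Placeholder for the real mapping (t1, t2, t3) -> (R1, R2, R3).
    return np.array([t[0] + t[1], t[1] * t[2], t[0] * t[2]])

def cost(t):
    # Sum of squared errors between achieved and target criteria.
    return np.sum((structural_equations(t) - R_ideal) ** 2)

bounds = [(0.1, 10.0)] * 3  # made-up bounds on (t1, t2, t3)
result = differential_evolution(cost, bounds, polish=True, seed=0)
print(result.x, result.fun)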

NEAT algorithm result precision

I am a PhD student trying to use the NEAT algorithm as a controller for a robot, and I am having some accuracy issues with it. I am working with Python 2.7 and am using two NEAT Python implementations:
The NEAT implementation in this GitHub repository: https://github.com/CodeReclaimers/neat-python
Searching on Google, it looks like it has been used in several projects with success.
The MultiNEAT library developed by Peter Chervenski and Shane Ryan: http://www.multineat.com/index.html,
which appears on the "official" NEAT software catalog web page.
While testing the first one, I found that my program converges quickly to a solution, but the solution is not precise enough. By lack of precision I mean a deviation of at least 3-5% in the median and average relative to the "perfect" solution at the end of the evolution (depending on the complexity of the problem, an error around 10% is normal for my solutions; furthermore, I have "never" seen an error value under 1% between the solution given by NEAT and the correct one). I must say that I've tried a lot of different parameter combinations and configurations (this is an old problem for me).
Because of that, I tested the second library. The MultiNEAT library converges more quickly and easily than the previous one (I assume that is due to the C++ implementation instead of pure Python), and I get similar results, but I still have the same problem: lack of accuracy. This second library has different configuration parameters too, and I haven't found a combination of them that improves performance on the problem.
My question is:
Is it normal to have this lack of accuracy in NEAT results? It achieves good solutions, but not good enough for controlling a robot arm, which is what I want to use it for.
I'll describe what I am doing in case someone sees a conceptual or technical mistake in the way I set out my problem:
To simplify the problem, I'll show a very simple example: I want a NN that can compute the function y = x^2 (similar results are found with y = x^3, y = x^2 + x^3, or similar functions).
The steps that I follow to develop the program are:
"Y" are the inputs to the network and "X" the outputs. The
activation functions of the neural net are sigmoid functions.
I create a data set of "n" samples given values to "X" between the
xmin = 0.0 and the xmax = 10.0
As I am using sigmoid functions, I make a normalization of the "Y"
and "X" values:
"Y" is normalized linearly between (Ymin, Ymax) and (-2.0, 2.0) (input range of sigmoid).
"X" is normalized linearly between (Xmin, Xmax) and (0.0, 1.0) (the output range of sigmoid).
After creating the data set, I subdivide in in a train sample (70%
percent of the total amount), a validation sample and a test sample
(15% each one).
At this point, I create a population of individuals for doing
evolution. Each individual of the population is evaluated in all the
train samples. Each position is evaluated as:
eval_pos = xmax - abs(xtarget - xobtained)
And the fitness of the individual is the average value of all the train positions (I've selected the minimum too but it gives me worse performance).
After the whole evaluation, I test the best obtained individual
against the test sample. And here is where I obtained those
"un-precise values". Moreover, during the evaluation process, the
maximum value where "abs(xtarget - xobtained) = 0" is never
obtained.
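A rough sketch of that data preparation and scoring in plain NumPy (names are hypothetical; this is not the actual NEAT loop):

import numpy as np

xmin, xmax = 0.0, 10.0
x = np.random.uniform(xmin, xmax, size=1000)
y = x ** 2  # the toy target function

# Linear normalizations: "Y" (network input) to [-2, 2], "X" (network output) to [0, 1].
y_norm = -2.0 + 4.0 * (y - y.min()) / (y.max() - y.min())
x_norm = (x - xmin) / (xmax - xmin)

# 70 / 15 / 15 split into train / validation / test.
idx = np.random.permutation(len(x))
n_train, n_val = int(0.70 * len(x)), int(0.15 * len(x))
train, val, test = idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

def eval_pos(x_target, x_obtained):
    # Mirrors the scoring above: maximal (= xmax) only at zero error.
    return xmax - abs(x_target - x_obtained)

# The fitness of an individual is the mean of eval_pos over the train split.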
Furthermore, I assume that the way I manipulate the data is right, because when I use the same data set to train a neural network in Keras I get much better results than with NEAT (an error of less than 1% is achievable after 1000 epochs with a layer of 5 neurons).
At this point, I would like to know whether what is happening is normal, because I shouldn't really be using a pre-built data set for developing the controller; it must be learned "online", and NEAT looks like a suitable solution for my problem.
Thanks in advance.
EDITED POST:
Firstly, thanks for your comment, nick.
I'll answer your questions below:
I am using the NEAT algorithm.
Yes, I've carried out experiments increasing the number of individuals in the population and the number of generations. A typical graph that I get looks like this:
Although the population size in this example is not that big, I've obtained similar results in experiments increasing the number of individuals or the number of generations; populations of 500 individuals and 500 generations, for example. In these experiments the algorithm converges quickly to a solution, but once there, the best solution gets stuck and does not improve any more.
As I mentioned in my previous post, I've tried several experiments with many different parameter configurations, and the graphs are more or less similar to the one shown above.
Furthermore, two other experiments that I've tried were: once the evolution reaches the point where the maximum value and the median converge, I generate another population based on that genome with new configuration parameters, where:
The mutation parameters are changed to a high probability of mutation (weight and neuron probability), in order to find new solutions with the aim of "jumping" from the current genome to a better one.
The neuron mutation probability is reduced to 0, while the weight mutation probability is increased but with a smaller mutation range, in order to make slight modifications with the aim of fine-tuning the weights (trying to get functionality "similar" to backprop by making slight changes to the weights).
These two experiments didn't work as I expected, and the best genome of the new population was the same as that of the previous population.
I am sorry, but I do not understand very well what you mean by "applying your own weighted penalties and rewards in your fitness function". What do you mean by including weight penalties in the fitness function?
Regards!
Disclaimer: I have contributed to these libraries.
Have you tried increasing the population size to speed up the search, and increasing the number of generations? I use it for a trading task, and by increasing the population size my champions were found much sooner.
Another thing to think about is applying your own weighted penalties and rewards in your fitness function, so that anything that doesn't get very close right away is "killed off" sooner and the correct genome is found faster; a sketch follows below. It should be noted that NEAT learns through a fitness function, as opposed to gradient descent, so it won't converge in the same way, and it's possible you may have to train a bit longer.
Last question: are you using the NEAT or HyperNEAT algorithm from MultiNEAT?
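A hedged sketch of that penalty idea using the neat-python evaluation callback (the data, threshold, and penalty weight are made up):

import neat

# Hypothetical normalized training pairs (y_in, x_target); replace with your data.
train_samples = [(0.5, 0.25), (1.0, 1.0), (1.5, 2.25)]

def eval_genomes(genomes, config):
    for genome_id, genome in genomes:
        net = neat.nn.FeedForwardNetwork.create(genome, config)
        errors = [abs(x_target - net.activate([y_in])[0])
                  for y_in, x_target in train_samples]
        mean_err = sum(errors) / len(errors)
        fitness = -mean_err
        # Penalty: push genomes that are far off down hard, so they are
        # selected against early and the search concentrates on near-misses.
        if mean_err > 0.2:        # made-up threshold
            fitness -= 10.0       # made-up penalty weight
        genome.fitness = fitness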

Limit neural network output to subset of trained classes

Is it possible to pass a vector to a trained neural network so that it only chooses from a subset of the classes it was trained to recognize? For example, I have a network trained to recognize numbers and letters, but I know that the images I'm running it on next will not contain lowercase letters (such as images of serial numbers), so I pass it a vector telling it not to guess any lowercase letters. Since the classes are mutually exclusive, the network ends in a softmax function. The following are just examples of what I'd thought of trying, but none really work.
import numpy as np

def softmax(arr):
    return np.exp(arr) / np.exp(arr).sum()

# Stand-ins for the previous layer/NN output and the vector of allowed answers.
output = np.array([ 0.15885351,  0.94527385,  0.33977026, -0.27237907,  0.32012873,
                    0.44839673, -0.52375875, -0.99423903, -0.06391236,  0.82529586])
restrictions = np.array([1, 1, 0, 0, 1, 1, 1, 0, 1, 1])

# Ideas -----
'''First: Multiply by restrictions before sending it through softmax.
I stupidly tried this one.'''
results = softmax(output * restrictions)

'''Second: Multiply the results of the softmax by the restrictions.'''
results = softmax(output)
results = results * restrictions

'''Third: Remove invalid entries before calculating the softmax.'''
result = output * restrictions
result[result != 0] = softmax(result[result != 0])
All of these have issues. The first one causes invalid choices to default to exp(0) = 1 in the numerator, i.e. a probability of:
1/np.exp(arr).sum()
Since inputs to the softmax can be negative (so their exponentials fall below 1), zeroing a logit can actually raise the probability given to an invalid choice and make the answer worse. (Should've looked into it before I tried it.)
The second and third both have a similar issue in that they wait until right before an answer is given to apply the restriction. For example, if the network is looking at the letter l but starts to determine that it's the number 1, this won't be corrected until the very end with these methods. So if it was on its way to giving the output 1 with 0.80 probability, but that option is then removed, the remaining options seem to redistribute, and the highest valid answer won't be as confident as 80%. The remaining options end up a lot more homogeneous.
An example of what I'm trying to say:
output
Out[75]: array([ 5.39413513, 3.81445419, 3.75369546, 1.02716988, 0.39189373])
softmax(output)
Out[76]: array([ 0.70454877, 0.14516581, 0.13660832, 0.00894051, 0.00473658])
softmax(output[1:])
Out[77]: array([ 0.49133596, 0.46237183, 0.03026052, 0.01603169])
(Arrays were ordered to make it easier.)
In the original output the softmax gives 0.70 that the answer is [1,0,0,0,0], but if that's an invalid answer and is thus removed, the redistribution now assigns the 4 remaining options under 50% probability each, which could easily be ignored as too low to use.
I've considered passing a vector into the network earlier as another input but I'm not sure how to do this without requiring it to learn what the vector is telling it to do, which I think would increase time required to train.
EDIT: I was writing way too much in the comments, so I'll just post updates here. I did eventually try giving the restrictions as an input to the network. I took the one-hot-encoded answer and randomly added extra enabled classes to simulate an answer key, ensuring the correct answer was always in the key. When the key had very few enabled categories, the network relied heavily on it, and it interfered with learning features from the image. When the key had a lot of enabled categories, it seemingly ignored the key completely. This could have been a problem that needed optimizing, an issue with my network architecture, or it may just have needed a tweak to training, but I never got around to solving it.
I did find out that removing answers and zeroing out were almost the same once I set the disallowed logits to -np.inf instead of multiplying by 0. I was aware of ensembles, but as mentioned in a comment on the first response, my network was dealing with CJK characters (the alphabet was just to make the example easier) and had 3000+ classes. The network was already overly bulky, which is why I wanted to look into this method. Using binary networks for each individual category was something I hadn't thought of, but 3000+ networks seems problematic too (if I understood what you were saying correctly), though I may look into it later.
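For reference, a small sketch of that -inf masking in plain NumPy (values are hypothetical): disallowed logits become -inf, and exp(-inf) = 0, so they receive exactly zero probability while the valid entries renormalize among themselves.

import numpy as np

def softmax(arr):
    arr = arr - arr.max()  # stabilize before exponentiating
    e = np.exp(arr)
    return e / e.sum()

output = np.array([5.39, 3.81, 3.75, 1.03, 0.39])
allowed = np.array([0, 1, 1, 1, 1], dtype=bool)  # first class disallowed

masked = np.where(allowed, output, -np.inf)
print(softmax(masked))  # same as softmax(output[1:]) on the valid slots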
First of all, I will loosely go through the options you have listed and add some viable alternatives, with their pros and cons. It's kinda hard to structure this answer, but I hope you'll get what I'm trying to put out:
1. Multiply restricted before sending it through softmax.
Obviously, this may give a higher chance to the zeroed-out entries, as you have written, and it looks like a flawed approach from the start.
Alternative: replace impossible values with the smallest logit value. This one is similar to softmax(output[1:]), though the network will be even more uncertain about the results. An example PyTorch implementation:
import torch

logits = torch.Tensor([5.39413513, 3.81445419, 3.75369546, 1.02716988, 0.39189373])
# Replace the disallowed first logit with the smallest logit in the vector.
minimum, _ = torch.min(logits, dim=0)
logits[0] = minimum
print(torch.nn.functional.softmax(logits, dim=0))
which yields:
tensor([0.0158, 0.4836, 0.4551, 0.0298, 0.0158])
Discussion
Citing you: "In the original output the softmax gives 0.70 that the answer is [1,0,0,0,0], but if that's an invalid answer and is thus removed, the redistribution now assigns the 4 remaining options under 50% probability each, which could easily be ignored as too low to use."
Yes, and you would be right to do so. Even more so, the actual probability of the runner-up class is far lower, around 14% (tensor([0.7045, 0.1452, 0.1366, 0.0089, 0.0047])). By manually changing the output you are essentially destroying the properties this NN has learned (and its output distribution), rendering part of your computations pointless. This points to another problem, stated in the bounty this time:
2. NNs are known to be overconfident in classification problems
I can imagine this being solved in multiple ways:
2.1 Ensemble
Create multiple neural networks and ensemble them by summing their logits and taking the argmax at the end (or applying softmax and then argmax). A hypothetical situation with 3 different models making different predictions:
import torch

predicted_logits_1 = torch.Tensor([5.39413513, 3.81419, 3.7546, 1.02716988, 0.39189373])
predicted_logits_2 = torch.Tensor([3.357895, 4.0165, 4.569546, 0.02716988, -0.189373])
predicted_logits_3 = torch.Tensor([2.989513, 5.814459, 3.55369546, 3.06988, -5.89473])

combined_logits = predicted_logits_1 + predicted_logits_2 + predicted_logits_3
print(combined_logits)
print(torch.nn.functional.softmax(combined_logits, dim=0))
This would give us the following probabilities after softmax:
[0.11291057 0.7576356 0.1293983 0.00005554 0.]
(notice the second class is now the most probable)
You can use bootstrap aggregating and other ensembling techniques to improve predictions. This approach makes the classification decision surface smoother and corrects mutual errors between classifiers (given that their predictions vary quite a lot). It would take many posts to describe in any greater detail (or a separate question with a specific problem would be needed); here or here are some resources which might get you started.
Still, I would not mix this approach with manual selection of outputs.
2.2 Transform the problem into binary
This approach might yield better inference time and maybe even better training time if you can distribute it over multiple GPUs.
Basically, each class of yours can either be present (1) or absent (0). In principle you could train N neural networks for N classes, each outputting a single unbounded number (logit). This single number tells whether the network thinks this example should be classified as its class or not.
If you are sure a certain class won't be the outcome, you simply do not run the network responsible for detecting that class.
After obtaining predictions from all the networks (or a subset of them), you choose the highest value (or the highest probability if you use a sigmoid activation, though that would be computationally wasteful).
An additional benefit is the simplicity of said networks (easier training and fine-tuning) and easy switch-like behavior when needed.
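A sketch of that switch-like inference (the per-class detectors here are untrained stand-ins; in practice they would be your N trained binary networks):

import torch
import torch.nn as nn

# Hypothetical one-logit-per-class detectors.
networks = {c: nn.Linear(16, 1) for c in ['A', 'B', '1', '2']}

def predict(x, allowed_classes):
    scores = {}
    with torch.no_grad():
        for cls in allowed_classes:
            # Run only the detectors for classes that are allowed.
            scores[cls] = networks[cls](x).item()
    # The class whose detector produced the highest logit wins.
    return max(scores, key=scores.get)

x = torch.randn(16)
print(predict(x, allowed_classes=['A', '1']))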
Conclusions
If I were you, I would go with the approach outlined in 2.2, as you could easily save yourself some inference time and it would allow you to "choose outputs" in a sensible manner.
If this approach is not enough, you may consider N ensembles of networks (a mix of 2.2 and 2.1) with bootstrap or other ensembling techniques. This should improve your accuracy as well.
First ask yourself: what is the benefit of excluding certain outputs based on external data? In your post, I don't see why exactly you want to exclude them.
Skipping them won't save computation, as one connection (or one neuron) affects multiple outputs: you can't simply disable connections/neurons.
Is it really necessary to exclude certain classes? If your network is trained well enough, it will know if it's a capital or not.
So my answer: I don't think you should fiddle with any operation before the softmax; that will give you false conclusions. You have the following options:
Multiply the results of the softmax by the restrictions and renormalize (see the sketch below).
Don't multiply; if the highest class is 'a', convert it to 'A' as the output (map lowercase outputs to their uppercase equivalents).
Train a network that sees no difference between capital and non-capital letters.
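A minimal sketch of the first option (the numbers are hypothetical):

import numpy as np

probs = np.array([0.70, 0.15, 0.14, 0.009, 0.001])  # softmax output
restrictions = np.array([0, 1, 1, 1, 1])             # first class disallowed

masked = probs * restrictions
masked /= masked.sum()  # renormalize so the kept probabilities sum to 1
print(masked)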

Scipy.optimize.minimize only iterates some variables

I have written python (2.7.3) code wherein I aim to create a weighted sum of 16 data sets, and compare the result to some expected value. My problem is to find the weighting coefficients which will produce the best fit to the model. To do this, I have been experimenting with scipy's optimize.minimize routines, but have had mixed results.
Each of my individual data sets is stored as a 15x15 ndarray, so their weighted sum is also a 15x15 array. I define my own 'model' of what the sum should look like (also a 15x15 array), and quantify the goodness of fit between my result and the model using a basic least squares calculation.
R=np.sum(np.abs(model/np.max(model)-myresult)**2)
'myresult' is produced as a function of some set of parameters 'wts'. I want to find the set of parameters 'wts' which will minimise R.
To do so, I have been trying this:
res = minimize(get_best_weightings,wts,bounds=bnds,method='SLSQP',options={'disp':True,'eps':100})
Where my objective function is:
def get_best_weightings(wts):
    # Split the 32-element parameter vector into real and imaginary weights.
    wts_tr = wts[0:16]
    wts_ti = wts[16:32]
    for i, j in enumerate(portlist):
        originalwtsr[j] = wts_tr[i]
        originalwtsi[j] = wts_ti[i]
    realwts = originalwtsr
    imagwts = originalwtsi
    myresult = make_weighted_beam(realwts, imagwts, 1)
    R = np.sum((np.abs(modelbeam/np.max(modelbeam) - myresult))**2)
    return R
The input (wts) is an ndarray of shape (32,), and the output, R, is just some scalar, which should get smaller as my fit gets better. By my understanding, this is exactly the sort of problem ("Minimization of scalar function of one or more variables.") which scipy.optimize.minimize is designed to optimize (http://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.optimize.minimize.html ).
However, when I run the code, although the optimization routine seems to iterate over different values of all the elements of wts, only a few of them seem to 'stick'. That is, all but four of the values are returned with the same values as my initial guess. To illustrate, I plot my initial guess for wts (in blue) and the optimized values (in red). You can see that for most elements the two lines overlap.
Image:
http://imgur.com/p1hQuz7
Changing just these few parameters is not enough to get a good answer, and I can't understand why the other parameters aren't also being optimized. I suspect that maybe I'm not understanding the nature of my minimization problem, so I'm hoping someone here can point out where I'm going wrong.
I have experimented with a variety of minimize's built-in methods (I am by no means committed to SLSQP, or certain that it's the most appropriate choice), and with a variety of 'step sizes' eps. The bounds I am using for my parameters are all (-4000, 4000). I only have scipy version 0.11, so I haven't tested a basin-hopping routine to get the global minimum (this needs 0.12). I have looked at scipy.optimize.brute, but haven't tried implementing it yet; I thought I'd check whether anyone can steer me in a better direction first.
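For comparison, here is a self-contained toy with the same call pattern (32 parameters, scalar least-squares objective, SLSQP with bounds; the target values are made up). With the default finite-difference step, all 32 parameters move; an 'eps' that is large relative to the scale on which the objective varies distorts the approximate gradient, so that is one thing worth ruling out:

import numpy as np
from scipy.optimize import minimize

target = np.linspace(-1.0, 1.0, 32)  # hypothetical stand-in for the best-fit weights

def objective(wts):
    # Same shape of problem: 32 parameters in, one scalar residual out.
    return np.sum((wts - target) ** 2)

wts0 = np.zeros(32)
bnds = [(-4000, 4000)] * 32
res = minimize(objective, wts0, bounds=bnds, method='SLSQP',
               options={'disp': True})  # default eps, not eps=100
print(res.x[:4])  # expect values close to target[:4]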
Any advice appreciated! Sorry for the wall of text and the possibly (probably?) idiotic question. I can post more of my code if necessary, but it's pretty long and unpolished.

Modeling a bounded function

I'm trying to model a monotonic function that is bounded by y_min and y_max values and satisfies two value pairs (x0, y0), (x1, y1) within that range.
Is there some kind of package in Python that might help in solving for the parameters of such a function?
Alternatively, if someone knows of a good source or paper on modeling such a function, I'd be much obliged...
(For anyone interested, I'm actually trying to model a market response curve of the number of buyers vs. the price of a product; this is bounded by zero and the maximal demand.)
Well, it's still not clear to me exactly what you want, but the classic function of that form is the sigmoid (it's used in neural nets, for example). In a more flexible form that becomes the generalised logistic function, which will fit your (x, y) constraints with suitable parameters. There's also the Gompertz curve, which is similar.
However, those are all defined over an open domain, and I doubt you have negative numbers of buyers. If that's an issue (you may not care, as the values get very close to zero), you could try transforming the number of buyers (taking the log kind of works, but only if you can have a fraction of a buyer...).
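A sketch of fitting a bounded logistic with scipy.optimize.curve_fit (the bounds, data points, and initial guess below are hypothetical):

import numpy as np
from scipy.optimize import curve_fit

y_min, y_max = 0.0, 100.0  # zero demand and maximal demand

def logistic(x, k, x_mid):
    # Monotonic and bounded between y_min and y_max.
    return y_min + (y_max - y_min) / (1.0 + np.exp(-k * (x - x_mid)))

x_data = np.array([10.0, 20.0])  # the two known (x, y) pairs
y_data = np.array([80.0, 30.0])  # demand falls as price rises
(k, x_mid), _ = curve_fit(logistic, x_data, y_data, p0=[-0.1, 15.0])
print(k, x_mid)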
