I have started in the noble endeavor of getting the (somewhat) higher ups of the company I work at to at least consider ML. I work in automotive so the bulk of our "problems" have to do with things like finding the perfect distance between the car's center of gravity and its back and front wheels, the ideal weight etc.
So the entire deal can be formulated as a fitness function ripe for optimization.
The actual function works like this: we get a number of test cases, let's say 10-15, representing cornering at 100 kph, lane switching at 150 kph etc., and we have a universal test function that, given a set of parameters (like those mentioned above), returns an error percentage. For example:
get_error(test_case=lane_switch_100kph,
          weight=2000,           # kg
          front_axle_dist=1.3,   # meters
          back_axle_dist=1.4)    # meters
This function returns a float in the interval [0, 100] that specifies the "error": how much the car deviates from the ideal path when these particular parameters are used. This error needs to be brought below 10%.
I threw together a genetic algorithm (if it matters, I did it in Python, using the DEAP library) and tested it with the Rastrigin function and all was great. For the fitness function of the actual simulation I used the mean of the squared errors for all test cases.
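To make the fitness definition concrete, here is a minimal sketch of a mean-of-squared-errors fitness. The stand-in `fake_get_error` and its two test-case names are hypothetical; the real `get_error` is the simulator call described above.

```python
def mse_fitness(params, test_cases, get_error):
    """Mean of the squared per-test-case errors (lower is better)."""
    errors = [get_error(test_case=case, **params) for case in test_cases]
    return sum(e * e for e in errors) / len(errors)


# Hypothetical stand-in for the real simulator: returns a fixed error
# percentage per test case so that the example can run on its own.
def fake_get_error(test_case, weight, front_axle_dist, back_axle_dist):
    return {"lane_switch_100kph": 10.0, "cornering_100kph": 30.0}[test_case]


params = {"weight": 2000, "front_axle_dist": 1.3, "back_axle_dist": 1.4}
cases = ["lane_switch_100kph", "cornering_100kph"]
example_mse = mse_fitness(params, cases, fake_get_error)  # (10^2 + 30^2) / 2
```

With this definition, every test case being below 10% error implies an MSE below 100, and an MSE of about 800 corresponds to a root-mean-square error of roughly 28%.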
On paper this looked simple enough. As a start I need to optimize a more general simulation that only has 4 parameters, so a 4-variable function, which should be simple enough for a genetic algorithm to solve. It fared great with Rastrigin in 100 dimensions, so a 4-dimensional function should be a piece of cake.
This particular model was already done by hand a few weeks prior and it is possible to get the error below 10% on all test cases.
The problem is that the algorithm, while it DOES improve the mean squared error, does so extremely slowly. For 11 test cases, with the initial errors hanging around 10-30% and the initial MSE around 800, the algorithm seems to be in the final phase of converging right away: it goes from ~800 slowly (1-3 MSE decrease per generation) to about 725-750 and converges there.
I did not do any optimizations to the algorithm as I figured I would do that after I get a proof of concept done.
My current implementation details are:
4 dimensions as I said, each with its own lower and upper bound.
Mutation is selecting one of the 4 dimensions randomly and randomizing its value.
Mutation probability is 0.3.
Crossover is simulated binary bounded.
Crossover probability is 0.6.
I do the crossover and then the mutation on random samples of the population.
Population size is about 50.
I'm using tournament selection with a tournament size of 1/4 of the population.
Every generation 3/4 of that generation get selected, 1/4 are replaced with new individuals.
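The mutation and selection steps listed above can be sketched in plain Python as follows. This is not DEAP code and not the original implementation; the bounds and the placeholder error values are made up for illustration.

```python
import random

# Hypothetical bounds for the 4 parameters; the real ones are not given.
BOUNDS = [(1000.0, 3000.0), (1.0, 2.0), (1.0, 2.0), (0.0, 1.0)]


def mutate_reset_one_gene(individual, bounds, rng):
    """Pick one dimension at random and redraw it uniformly within its bounds."""
    mutant = list(individual)
    i = rng.randrange(len(mutant))
    low, high = bounds[i]
    mutant[i] = rng.uniform(low, high)
    return mutant


def tournament_select(population, errors, k, tournsize, rng):
    """Select k individuals; each is the lowest-error member of a random tournament."""
    chosen = []
    for _ in range(k):
        contenders = rng.sample(range(len(population)), tournsize)
        best = min(contenders, key=lambda idx: errors[idx])
        chosen.append(population[best])
    return chosen


rng = random.Random(0)
population = [[rng.uniform(lo, hi) for lo, hi in BOUNDS] for _ in range(50)]
errors = [rng.uniform(0.0, 100.0) for _ in population]  # placeholder fitness values
parents = tournament_select(population, errors,
                            k=len(population) * 3 // 4,    # 3/4 of the generation
                            tournsize=len(population) // 4,  # 1/4 of the population
                            rng=rng)
child = mutate_reset_one_gene(parents[0], BOUNDS, rng)
```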
Optimizations I planned on using but have not implemented yet.
Elitism.
Adaptive mutation parameters using a clustering based method as described here: https://matlab1.com/wp-content/uploads/2015/09/Clustering-Based-Adaptive-Crossover-and-Mutation-Probabilities-for-Genetic-Algorithms-matlab1.com_.pdf
I am kind of stumped on how exactly I should proceed, as the algorithm getting trapped in a particular local optimum right from the start on multiple runs is something I did not even think could happen, and I have no idea how to fix it. Any suggestion would be appreciated a lot. Thanks in advance!
This question is more theoretical; I am not trying to solve a specific problem.
I was recently introduced to the K-Means clustering algorithm, an unsupervised machine learning algorithm, and I was intrigued by the thought that on some sets of data, even if completely random, the centroids drawn could keep changing on every iteration.
Example:
What I am trying to show here is this: imagine if the program flipped between iteration 6 and iteration 9, and kept doing this forever.
I have had my code randomly hang before using K-Means, so I don't believe this is impossible, but please let me know if this is a known occurrence, or if it is impossible due to the nature of the algorithm.
If you need more information just ask me in a comment. Using Python 3.7
tl;dr No, a K-means algorithm always has an end point if the algorithm is coded correctly.
Explanation:
The ideal way to think about this is not in terms of which data points would cause issues, but rather in terms of how k-means works in the broader sense. The k-means algorithm always works in a finite space. For N data points and k clusters, there are only k ^ N distinct ways to assign the points to clusters. (This number can be pretty large, but it is still finite.)
Secondly, a k-means algorithm is always optimizing a loss function, based on the sum of squared distances between each data point and its assigned cluster center. This means two very important things: the k ^ N distinct arrangements can be ordered from minimum loss to maximum loss, and the k-means algorithm will never go from a state of lower net loss to one of higher net loss.
These two conditions guarantee that the algorithm will always tend towards the minimum loss arrangement in a finite space, thus ensuring that it has an end.
The last edge case: what if more than one minimum state has equal loss? This is a highly unlikely scenario, but it can cause issues if and only if the algorithm is coded poorly for tie-breakers. Essentially, the only way this can cause a cycle is if a data point is equally distant from two clusters and is allowed to move away from its current cluster even on equal distance. Suffice it to say, the algorithms are generally coded so that data points never swap on a tie, or so that ties are resolved in some other deterministic manner, thus avoiding this scenario entirely.
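The argument above can be checked with a small sketch: a plain k-means loop that records the loss after every iteration and keeps a point in its current cluster on a tie. The data and the seed are arbitrary choices for illustration.

```python
import random


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, rng, max_iter=100):
    """Plain k-means; returns assignments, centroids and the loss per iteration."""
    centroids = rng.sample(points, k)
    assignment = [None] * len(points)
    losses = []
    for _ in range(max_iter):
        changed = False
        # Assignment step: move each point to its nearest centroid.
        for i, p in enumerate(points):
            dists = [squared_distance(p, c) for c in centroids]
            best = min(range(k), key=lambda j: dists[j])
            current = assignment[i]
            # Tie-break: never leave the current cluster on equal distance.
            if current is not None and dists[current] <= dists[best]:
                best = current
            if best != current:
                assignment[i] = best
                changed = True
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
        losses.append(sum(squared_distance(p, centroids[a])
                          for p, a in zip(points, assignment)))
        if not changed:  # no point moved, so the algorithm has converged
            break
    return assignment, centroids, losses


rng = random.Random(0)
points = [(rng.random(), rng.random()) for _ in range(60)]
assignment, centroids, losses = kmeans(points, k=3, rng=rng)
```

The recorded `losses` never increase from one iteration to the next, which is exactly why the loop cannot revisit an earlier arrangement with a different loss.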
I am a PhD student who is trying to use the NEAT algorithm as a controller for a robot, and I am having some accuracy issues with it. I am working with Python 2.7 and am using two NEAT Python implementations:
The NEAT which is in this GitHub repository: https://github.com/CodeReclaimers/neat-python
Searching on Google, it looks like it has been used successfully in some projects.
The multiNEAT library developed by Peter Chervenski and Shane Ryan: http://www.multineat.com/index.html.
It is listed in the software catalog of the "official" NEAT web page.
While testing the first one, I found that my program converges quickly to a solution, but this solution is not precise enough. By lack of precision I mean a deviation of at least 3-5% in the median and average relative to the "perfect" solution at the end of the evolution. (Depending on the complexity of the problem, an error of around 10% is normal for my solutions. Furthermore, I could say that I have "never" seen an error below 1% between the solution given by NEAT and the correct one.) I must say that I have tried a lot of different parameter combinations and configurations (this is an old problem for me).
Because of that, I tested the second library. The MultiNEAT library converges more quickly and easily than the previous one (I assume that is due to the C++ implementation instead of pure Python). I get similar results, but I still have the same problem: lack of accuracy. This second library has different configuration parameters too, and I haven't found a proper combination of them to improve the performance on the problem.
My question is:
Is it normal to have this lack of accuracy in the NEAT results? It achieves good solutions, but not good enough for controlling a robot arm, which is what I want to use it for.
I'll write what I am doing in case someone sees some conceptual or technical mistake in the way I set out my problem:
To simplify the problem, I'll show a very simple example: I want a NN that can calculate the following function: y = x^2 (similar results are found with y = x^3, y = x^2 + x^3, or similar functions).
The steps that I follow to develop the program are:
"Y" are the inputs to the network and "X" the outputs. The activation functions of the neural net are sigmoid functions.
I create a data set of "n" samples, giving "X" values between xmin = 0.0 and xmax = 10.0.
As I am using sigmoid functions, I normalize the "Y" and "X" values:
"Y" is normalized linearly from (Ymin, Ymax) to (-2.0, 2.0) (the input range of the sigmoid).
"X" is normalized linearly from (Xmin, Xmax) to (0.0, 1.0) (the output range of the sigmoid).
After creating the data set, I subdivide it into a training sample (70% of the total), a validation sample, and a test sample (15% each).
At this point, I create a population of individuals for the evolution. Each individual of the population is evaluated on all the training samples. Each position is evaluated as:
eval_pos = xmax - abs(xtarget - xobtained)
The fitness of the individual is the average value over all the training positions (I've tried the minimum too, but it gives me worse performance).
After the whole evaluation, I test the best individual obtained against the test sample. This is where I obtain those imprecise values. Moreover, during the evaluation process, the maximum value, where "abs(xtarget - xobtained) = 0", is never reached.
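The normalization and fitness steps above can be sketched as follows. This is a reading of the description, not the original code: it assumes that `eval_pos` is computed in un-normalized units (so `xmax = 10.0`), and `predict` is a hypothetical callable standing in for the activated network.

```python
import math


def linear_map(value, src_min, src_max, dst_min, dst_max):
    """Map value linearly from [src_min, src_max] to [dst_min, dst_max]."""
    return dst_min + (value - src_min) * (dst_max - dst_min) / (src_max - src_min)


X_MIN, X_MAX = 0.0, 10.0
Y_MIN, Y_MAX = X_MIN ** 2, X_MAX ** 2  # range of y = x^2 on [X_MIN, X_MAX]


def fitness(predict, samples):
    """Average of eval_pos = xmax - abs(xtarget - xobtained) over (y, x) samples."""
    total = 0.0
    for y, x_target in samples:
        net_in = linear_map(y, Y_MIN, Y_MAX, -2.0, 2.0)   # network input
        net_out = predict(net_in)                          # sigmoid output in (0, 1)
        x_obtained = linear_map(net_out, 0.0, 1.0, X_MIN, X_MAX)
        total += X_MAX - abs(x_target - x_obtained)
    return total / len(samples)


def perfect(net_in):
    """Ideal predictor used only to show the maximum reachable fitness."""
    y = max(linear_map(net_in, -2.0, 2.0, Y_MIN, Y_MAX), 0.0)
    return linear_map(math.sqrt(y), X_MIN, X_MAX, 0.0, 1.0)


samples = [(x ** 2, x) for x in (0.0, 2.5, 5.0, 10.0)]
best_possible = fitness(perfect, samples)          # equals X_MAX
constant_guess = fitness(lambda _: 0.5, samples)   # always answers x = 5
```

Under these assumptions the maximum fitness is `xmax`, and a mean absolute error of 1% of the range (0.1) only lowers it from 10.0 to 9.9.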
Furthermore, I assume that the way I manipulate the data is right, because I use the same data set for training a neural network in Keras and I get much better results than with NEAT (an error of less than 1% is achievable after 1000 epochs with a layer of 5 neurons).
At this point, I would like to know if what is happening is normal, because I shouldn't use a data set for developing the controller; it must be learned "online", and NEAT looks like a suitable solution for my problem.
Thanks in advance.
EDITED POST:
Firstly, thanks for the comment, nick.
I'll answer your questions below:
I am using the NEAT algorithm.
Yes, I've carried out experiments increasing the number of individuals in the population and the number of generations. A typical graph that I get is like this:
Although the population size in this example is not that big, I've obtained similar results in experiments increasing the number of individuals or the number of generations: populations of 500 individuals and 500 generations, for example. In these experiments, the algorithm converges fast to a solution, but once there, the best solution gets stuck and does not improve any more.
As I mentioned in my previous post, I've tried several experiments with many different parameter configurations... and the graphs are more or less similar to the one shown above.
Furthermore, I tried two other experiments: once the evolution reaches the point where the maximum value and the median converge, I generate another population based on that genome with new configuration parameters, where:
The mutation parameters are changed to a high probability of mutation (weight and neuron probability) in order to find new solutions, with the aim of "jumping" from the current genome to a better one.
The neuron mutation is reduced to 0, while the weight "mutation probability" is increased, with "mutate weight" in a lower range, in order to get slight modifications and a better adjustment of the weights (trying to get a functionality "similar" to backprop by making slight changes in the weights).
These two experiments didn't work as I expected, and the best genome of the population was the same as that of the previous population.
I am sorry, but I do not understand very well what you mean by "applying your own weighted penalties and rewards in your fitness function". What do you mean by including weighted penalties in the fitness function?
Regards!
Disclaimer: I have contributed to these libraries.
Have you tried increasing the population size to speed up the search and increasing the number of generations? I use it for a trading task, and by increasing the population size my champions were found much sooner.
Another thing to think about is applying your own weighted penalties and rewards in your fitness function, so that anything that doesn't get very close right away is "killed off" sooner and the correct genome is found faster. It should be noted that NEAT uses a fitness function to learn, as opposed to gradient descent, so it won't converge in the same way, and it's possible you may have to train a bit longer.
Last question: are you using the NEAT or the HyperNEAT algorithm from MultiNEAT?
First of all: apologies for the lack of code and rather vague descriptions; the code I'm using is 1000+ lines long and I'm not sure what parts of it would be helpful to post.
I'm using emcee to do some Bayesian parameter estimation. My code uses 50 walkers each taking 600 iterations (with no thinning), but for whatever reason, the walker chains seem to converge rather quickly. While I initialize the 50 walkers with a random distribution between -1 and 1, they don't explore the entire parameter space, but seem to converge quickly (usually around the true parameter values). Pictures are below:
The real parameter values are .6 and .4
The real parameter values are -1. and 1.
Any suggestions are greatly appreciated!
That's what they are supposed to do: converge quickly to the regions of high posterior density. Another matter is that for bimodal densities emcee would generate suboptimal proposals, and that would slow down the convergence. This is probably what happens in your case and is seen in the second graph in both examples.
Authors of emcee suggested (last time I read) working around this with parallel tempering (see the docs) which they have implemented. But their implementation (last time I checked) would not work when densities between modes differ by several orders of magnitude.
Anyway, multimodal posteriors are a bane of all MCMC and there are plenty of attempts to solve this, none being universally accepted (welcome to the cutting edge). You will have to explore several options, maybe beyond emcee, to find what works for you.
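The quick-convergence behaviour can be reproduced with a small, self-contained sketch. It is not emcee's code: it is a hand-written, serial version of the Goodman and Weare stretch move on which emcee is based, applied to an assumed narrow Gaussian posterior centred on (0.6, 0.4). The posterior width `SIGMA` and the seed are arbitrary choices.

```python
import math
import random

TRUE_PARAMS = (0.6, 0.4)
SIGMA = 0.05  # assumed posterior width, chosen only for illustration


def log_prob(theta):
    """Log-density of an assumed unimodal Gaussian posterior (up to a constant)."""
    return -0.5 * sum(((t - m) / SIGMA) ** 2 for t, m in zip(theta, TRUE_PARAMS))


def stretch_move_sweep(walkers, log_probs, rng, a=2.0):
    """Update every walker once with the stretch move (in place)."""
    n, ndim = len(walkers), len(walkers[0])
    for k in range(n):
        j = rng.randrange(n - 1)  # pick a different walker j != k
        if j >= k:
            j += 1
        z = ((a - 1.0) * rng.random() + 1.0) ** 2 / a  # z ~ g(z) on [1/a, a)
        proposal = [xj + z * (xk - xj) for xk, xj in zip(walkers[k], walkers[j])]
        lp = log_prob(proposal)
        log_accept = (ndim - 1) * math.log(z) + lp - log_probs[k]
        if math.log(1.0 - rng.random()) < log_accept:
            walkers[k], log_probs[k] = proposal, lp


rng = random.Random(1)
walkers = [[rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)] for _ in range(50)]
log_probs = [log_prob(w) for w in walkers]
for _ in range(600):
    stretch_move_sweep(walkers, log_probs, rng)
ensemble_mean = [sum(w[d] for w in walkers) / len(walkers) for d in range(2)]
```

Even though the walkers start spread over [-1, 1], the ensemble collapses onto the high-density region around the true values, which is the expected behaviour for a unimodal posterior.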
The Problem
I've been doing a bit of research on Particle Swarm Optimization, so I said I'd put it to the test.
The problem I'm trying to solve is the Balanced Partition Problem - or reduced simply to the Subset Sum Problem (where the sum is half of all the numbers).
It seems the generic formula for updating velocities for particles is
but I won't go into too much detail for this question.
Since there's no PSO attempt online for the Subset Sum Problem, I looked at the Travelling Salesman Problem instead.
Their approach for updating velocities involved taking sets of visited towns, subtracting one from another and doing some manipulation on that.
I saw no relation between that and the formula above.
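For reference, the velocity update most often quoted for continuous PSO can be sketched as below. It is assumed here that this is the formula meant above; the inertia weight `w` and the acceleration coefficients `c1` and `c2` are illustrative defaults, not values from the post.

```python
import random


def update_particle(position, velocity, pbest, gbest, rng, w=0.7, c1=1.5, c2=1.5):
    """One canonical PSO step: v <- w*v + c1*r1*(pbest - x) + c2*r2*(gbest - x)."""
    new_velocity = []
    for x, v, p, g in zip(position, velocity, pbest, gbest):
        r1, r2 = rng.random(), rng.random()  # fresh random factors per dimension
        new_velocity.append(w * v + c1 * r1 * (p - x) + c2 * r2 * (g - x))
    new_position = [x + v for x, v in zip(position, new_velocity)]
    return new_position, new_velocity


rng = random.Random(0)
pos, vel = update_particle([1.0, 2.0], [1.0, 1.0],
                           pbest=[1.0, 2.0], gbest=[1.0, 2.0], rng=rng, w=0.5)
```

The update relies on subtracting positions and scaling the result, which is why it does not carry over directly to subsets or tours.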
My Approach
So I scrapped the formula and tried my own approach to the Subset Sum Problem.
I basically used gbest and pbest to determine the probability of removing or adding a particular element to the subset.
For example, if my problem space is [1,2,3,4,5] (target is 7 or 8), my current particle (subset) is [1,None,3,None,None], and the gbest is [None,2,3,None,None], then there is a higher probability of keeping 3, adding 2 and removing 1, based on gbest.
I can post code but don't think it's necessary, you get the idea (I'm using python btw - hence None).
So basically, this worked to an extent, I got decent solutions out but it was very slow on larger data sets and values.
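One possible reading of this update rule, sketched in Python: each slot of the particle is copied from gbest, from pbest, or left unchanged, with fixed probabilities. This is not the original code; the function name and the probabilities `p_gbest` and `p_pbest` are hypothetical.

```python
import random


def update_subset(particle, pbest, gbest, rng, p_gbest=0.5, p_pbest=0.3):
    """Per slot: follow gbest, follow pbest, or keep the current membership."""
    updated = []
    for current, personal, best in zip(particle, pbest, gbest):
        r = rng.random()
        if r < p_gbest:
            updated.append(best)       # add/remove the element as in gbest
        elif r < p_gbest + p_pbest:
            updated.append(personal)   # add/remove the element as in pbest
        else:
            updated.append(current)    # leave the slot unchanged
    return updated


rng = random.Random(0)
particle = [1, None, 3, None, None]
pbest = [1, None, 3, None, None]
gbest = [None, 2, 3, None, None]
moved = update_subset(particle, pbest, gbest, rng)
```

Slots on which the particle, pbest and gbest all agree (such as the 3 here) never change, while slots where gbest differs are pulled towards it.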
My Question
Am I encoding the problem and updating the particle "velocities" in a smart way?
Is there a way to determine if this will converge correctly?
Is there a resource I can use to learn how to create convergent "update" formulas for specific problem spaces?
Thanks a lot in advance!
Encoding
Yes, you're encoding this correctly: each of your bit-maps (that's effectively what your 5-element lists are) is a particle.
Concept
Your conceptual problem with the equation arises because your problem space is a discrete lattice graph, which doesn't lend itself immediately to the update step. For instance, if you want to get a finer granularity by adjusting your learning rate, you'd generally reduce it by some small factor (say, 3). In this space, what does it mean to take steps only 1/3 as large? That's why you have problems.
The main possibility I see is to create 3x as many particles, but then have the transition probabilities all divided by 3. This still doesn't satisfy very well, but it does simulate the process somewhat decently.
Discrete Steps
If you have a very large graph, where a high velocity could give you dozens of transitions in one step, you can utilize a smoother distance (loss or error) function to guide your model. With something this small, where you have no more than 5 steps between any two positions, it's hard to work with such a concept.
Instead, you utilize an error function based on the estimated distance to the solution. The easy one is to subtract the particle's total from the nearer of 7 or 8. A harder one is to estimate distance based on that difference and the particle elements "in play".
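The "easy" error function described above can be sketched as follows, using the None-based encoding from the question. The function names are made up for illustration.

```python
def subset_total(particle):
    """Sum of the elements currently in the subset (None marks an absent element)."""
    return sum(value for value in particle if value is not None)


def partition_error(particle, targets=(7, 8)):
    """Distance from the subset total to the nearer of the two targets."""
    total = subset_total(particle)
    return min(abs(target - total) for target in targets)


error_far = partition_error([1, None, 3, None, None])      # total 4, nearer target 7
error_solved = partition_error([None, 2, None, None, 5])   # total 7
```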
Proof of Convergence
Yes, there is a way to do it, but it requires some functional analysis. In general, you want to demonstrate that the error function is convex over the particle space. In other words, you'd have to prove that your error function is a reliable distance metric, at least as far as relative placement goes (i.e. prove that a lower error does imply you're closer to a solution).
Creating update formulae
No, this is a heuristic field, based on the shape of the problem space as defined by the particle coordinates, the error function, and the movement characteristics.
Extra recommendation
Your current allowable transitions are "add" and "delete" element.
Add "swap elements" to these: trade one present member for an absent one. This will allow the trivial error function to define a convex space for you, and you'll converge in very little time.
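A minimal sketch of such a swap transition, again with the None-based encoding. The name `swap_move` and the `universe` argument (the full list of numbers) are assumptions made for this example.

```python
import random


def swap_move(particle, universe, rng):
    """Trade one present member for one absent member; subset size is unchanged."""
    present = [i for i, value in enumerate(particle) if value is not None]
    absent = [i for i, value in enumerate(particle) if value is None]
    if not present or not absent:
        return list(particle)  # nothing to swap
    leaving, entering = rng.choice(present), rng.choice(absent)
    swapped = list(particle)
    swapped[leaving] = None
    swapped[entering] = universe[entering]
    return swapped


rng = random.Random(0)
universe = [1, 2, 3, 4, 5]
before = [1, None, 3, None, None]
after = swap_move(before, universe, rng)
```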
I am studying physics and ran into a really interesting problem. I'm not an expert on programming so please take this into account while reading this.
I really hope that someone can help me with this problem, because I have been struggling with it for about 2 months now without any success.
So here is my problem:
I have a bunch of data sets (more than 2, fewer than 20) from numerical calculations. Each set gives measurement values against x. I have a set of sensors and want to find the best positions x for my sensors such that the integral of the interpolation comes as close as possible to the integral of the numerical data set.
As this sounds like a typical mathematical problem I started to look for some theorems but I did not find anything.
So I started to write a python program based on the SLSQP minimizer. I chose this because it can handle bounds and constraints. (Note there is always a sensor at 0 and one at 1)
Constraints: the sensor array must stay sorted at all times, such that x_i < x_(i+1), and the interval of x is normalized to [0,1].
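The objective and the ordering constraint can be written down as a short sketch. It assumes linear interpolation between sensors and trapezoidal integration, which the description does not specify, and all function names are made up. The constraint values are returned in the "must be non-negative" form that SLSQP-style inequality constraints typically use.

```python
from bisect import bisect_right


def interpolate(x, xs, ys):
    """Linear interpolation of the data (xs strictly increasing) at position x."""
    i = min(max(bisect_right(xs, x) - 1, 0), len(xs) - 2)
    t = (x - xs[i]) / (xs[i + 1] - xs[i])
    return ys[i] + t * (ys[i + 1] - ys[i])


def trapezoid(xs, ys):
    """Trapezoidal-rule integral of the piecewise-linear curve through (xs, ys)."""
    return sum(0.5 * (ys[i] + ys[i + 1]) * (xs[i + 1] - xs[i])
               for i in range(len(xs) - 1))


def integral_error(inner, data_x, data_y):
    """|integral seen by the sensors - integral of the data|; sensors fixed at 0 and 1."""
    sensors = [0.0] + list(inner) + [1.0]
    readings = [interpolate(s, data_x, data_y) for s in sensors]
    return abs(trapezoid(sensors, readings) - trapezoid(data_x, data_y))


def ordering_constraints(inner):
    """Gaps between neighbouring sensors; all must stay >= 0 to keep them sorted."""
    sensors = [0.0] + list(inner) + [1.0]
    return [b - a for a, b in zip(sensors, sensors[1:])]


data_x = [i / 100 for i in range(101)]
linear_y = list(data_x)                # y = x: any sorted sensor layout is exact
curved_y = [x ** 2 for x in data_x]    # y = x^2: the layout matters
error_linear = integral_error([0.3, 0.7], data_x, linear_y)
error_curved = integral_error([0.3, 0.7], data_x, curved_y)
```

For the linear data every sorted layout gives zero error, while for the curved data the error depends on where the inner sensors sit, so data sets with large flat or linear regions leave the optimizer with little gradient to follow there.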
Before doing an overall optimization I started to look for good starting points and searched for maximums, minimums and linear areas of my given data sets. But an optimization over 40 values turned out to deliver bad results.
In my second try I searched for these points and defined certain areas. I then optimized each area with 1 to 40 sensors, compared the results, and decided which area is worth putting more sensors in. In the last step I wanted to do an overall optimization again. But this idea didn't seem to be the proper solution either, because the optimization had convergence problems as well.
The big problem was that my optimizer broke the boundaries. I handled this by interrupting the optimization, because once the boundaries were broken the result was not correct in the end. If this happens, I reset my initial setup to a homogeneous distribution. After this there are normally no boundary violations, but the result seems to be a homogeneous distribution too, and often this is obviously not the ideal distribution.
As my algorithm works for simple examples and dies for more complex data, I think there is a general problem and not just some error in my coding. Does anyone have an idea how to move on, or know some theory about this matter?
The attached plot shows the areas in different colors. The function is shown at the bottom and the sensor positions are represented as dots. Dots at y=1 are from the optimization with one sensor, dots at y=2 from the optimization with 2 sensors, and so on. As the program reaches higher sensor numbers, the whole thing gets more and more homogeneous.
It is easy to see that if n is the number of sensors and n goes to infinity, you get a totally homogeneous distribution. But as far as I can see, this should not happen for just 10 sensors.