My code itself doesn't have an issue, this is more of a general question about code that already works. I solve a maze using breadth first search and I'm looking to study the node expansion, space complexity, and O(n) - which, for BFS, is O(b^d).
I'm not used to studying the program once finished and I was wondering if there are any specific methods that are best. I know the code itself will give me a time but I was wondering if there are any library functions that could help, or if there is maybe a function I could implement that would better show me quantitative results.
I have the ability to run the test on multiple different mazes (I even have a maze creator) but I'm asking for something (anything) more quantitative than just running this code on three or four different mazes and using the auto-output. I'm also using pycharm, which I'm unfamiliar with - are there ways the IDE can formalize this information?
For maze problem (without any assumption and knowledge of the network learnt prior), A-STAR algorithm is the state of the earth. Its complexities are:
Worst complexity: O(|E|) = O(b^d)
Space complexity: O(|V|) = O(b^d)
For fairly large network (such as road networks), practical algorithms exists because we can pre-compute some paths apriori. They will have less time complexity (trade-off with larger space complexity) after the pre-processing (or learning) has been done.
Related
Cross posted from csexchange:
Most versions of simulated annealing I've seen are implemented similar to what is outlined in the wikipedia pseudocode below:
Let s = s0
For k = 0 through kmax (exclusive):
T ← temperature( 1 - (k+1)/kmax )
Pick a random neighbour, snew ← neighbour(s)
If P(E(s), E(snew), T) ≥ random(0, 1):
s ← snew
Output: the final state s
I am having trouble understanding how this algorithm does not get stuck in a local optima as the temperature cools. If we jump around at the start while the temp is high, and eventually only take uphill moves as the temp cools, then isn't the solution found highly dependent on where we just so happened to end up in the search space as the temperature started to cool? We may have found a better solution early on, jumped off of it while the temp was high, and then be in a worse-off position as the temp cools and we transition to hill climbing.
An often listed modification to this approach is to keep track of the best solution found so far. I see how this change mitigates the risk of "throwing away" a better solution found in the exploratory stage when the temp is high, but I don't see how this is any better than simply performing repeated random hill-climbing to sample the space, without the temperature theatrics.
Another approach that comes to mind is to combine the ideas of keeping track of the "best so far" with repeated hill climbing and beam search. For each temperature, we could perform simulated annealing and track the best 'n' solutions. Then for the next temperature, start from each of those local peaks.
UPDATE:
It was rightly pointed out I didn't really ask a specific question, I've updated in the comments but want to reflect it here:
I don't need a transcription of the pseudo-code into essay form, I understand how it works - how the temperature balances exploration vs exploitation, and how this (theoretically) helps avoid local optima.
My specific questions are:
Without modification to keep track of the global best solution,
wouldn't this algorithm be extremely prone to local optima, despite
the temperature component? Yes, I understand the probability of taking a worse move declines as the temperature cools (e.g. it transitions to pure hill-climbing), but it would be possible that you had found a better solution early on in the exploitation portion (temp hot), jumped off of it, and then as the temperature cools you have no way back because you're on a path that leads to a new local peak.
With the addition of tracking the global optima, I can absolutely
see how this mitigates getting stuck at local peaks avoiding the problem described above. But how does
this improve upon simple random search? One specific example being
repeated random hill climbing. If you're tracking the global optima and you happen to hit it while in the high-temp portion, then that essentially is a random-search.
In what circumstances would this algorithm be preferable to something like repeated random hill-climbing and why? What properties does a problem have that makes it particularly suited to SA vs an alternative.
By my understanding, simulated annealing is not guaranteed to not get stuck in a local maxima (for maximization problems), especially as it "cools" where later in the cycle as k -> kmax.
So as you say, we "jump all over the place" at the start, but we are still selective in whether or not we accept that jump, as the P() function that determines the probability of acceptance is a function of the goal, E(). Later in the same wikipedia article, they describe the P() function a bit, and suggest that if e_new > e, then perhaps P=1, always taking the move if it is an improvement, but perhaps not always if it isn't an improvement. Later in the cycle, we aren't as willing to take random leaps to lesser results, so the algorithm tends to settle into a max (or min), which may or may not be the global.
The challenge in global function evaluation is to obtain good efficiency, i.e. find the optimal solution or one that is close to the optimum with as little function evaluations as possible. So many "we could perform..." reasonings are correct in terms of convergence to a good solution, but may fail to be efficient.
Now the main difference between pure climbing and annealing is that annealing temporarily admits a worsening of the objective to avoid... getting trapped in a local optimum. Hill climbing only searches the subspace where the function takes larger values than the best so far, which may fail to be connected to the global optimum. Annealing adds some "tunneling" effect.
Temperature decrease is not the key ingredient here and by the way, hill climbing methods such as the Simplex or Hooke-Jeeves patterns also work with diminishing perturbations. Temperature schemes are more related to multiscale decomposition of the target function.
The question as asked doesn't seem very coherent, it seems to ask about guarantees offered by the vanilla simulated annealing algorithm, then proceeds to propose multiple modifications with little to no justification.
In any case, I will try to be coherent in this answer.
The only known rigorous proof of simulated annealing that I know of is described as
The concept of using an adaptive step-size has been essential to the development of
better annealing algorithms. The Classical Simulated Annealing (CSA) was the first annealing algorithm with a rigorous mathematical proof for its global convergence (Geman and
Geman, 1984). It was proven to converge if a Gaussian distribution is used for GXY (TK),
coupled with an annealing schedule S(Tk) that decreases no faster than T = To/(log k).
The integer k is a counter for the external annealing loop, as will be detailed in the next
section. However, a logarithmic decreasing schedule is considered to be too slow, and for
many problems the number of iterations required is considered as “overkill” (Ingber, 1993).
Where the proof of this can be found in
Geman, Stuart, and Donald Geman. "Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images." IEEE Transactions on pattern analysis and machine intelligence 6 (1984): 721-741.
In terms of successful modifications of SA, if you peruse literature you will find many empirical claims of superiority, but few proofs. Having studied this problem in the past, I can say from the various modified versions of SA I have tried empirically for the problems I was interested in at the time, the only SA modification that consistently gave better results than classical simulated annealing is known as parallel tempering.
Parallel tempering is the modification of SA whereby instead of mixing one Markov-Chain (i.e. the annealing step) multiple chains are mixed concurrently at very specific temperatures, with exchanges between chains occurring on a dynamic or static schedule.
Original paper on parallel tempering:
Swendsen, Robert H., and Jian-Sheng Wang. "Replica Monte Carlo simulation of spin-glasses." Physical review letters 57.21 (1986): 2607.
There are various free and proprietary implementations of parallel tempering available due to its effectiveness; I would highly recommend starting with a a high quality implementation instead of rolling your own.
Note also, parallel tempering is extensively used throughout various physical sciences for modelling purposes due to its efficacy, but for some reason has a very small representation in other areas of literature (such as computer science,) which actually baffles me somewhat. Perhaps computer scientists general disinterest with SA modifications may stem from their belief that SA is unlikely to be improved upon.
This area is still very new to me, so forgive me if I am asking dumb questions. I'm utilizing MCTS to run a model-based reinforcement learning task. Basically I have an agent foraging in a discrete environment, where the agent can see out some number of spaces around it (I'm assuming perfect knowledge of its observation space for simplicity, so the observation is the same as the state). The agent has an internal transition model of the world represented by an MLP (I'm using tf.keras). Basically, for each step in the tree, I use the model to predict the next state given the action, and I let the agent calculate how much reward it would receive based on the predicted change in state. From there it's the familiar MCTS algorithm, with selection, expansion, rollout, and backprop.
Essentially, the problem is that this all runs prohibitively slowly. From profiling my code, I notice that a lot of time is spent doing the rollout, likely I imagine because the NN needs to be consulted many times and takes some nontrivial amount of time for each prediction. Of course, I can probably stand to clean up my code to make it run faster (e.g. better vectorization), but I was wondering:
Are there ways to speed up/work around the traditional random walk done for rollout in MCTS?
Are there generally other ways to speed up MCTS? Does it just not mix well with using an NN in terms of runtime?
Thanks!
I am working on a similar problem and so far the following have helped me:
Make sure you are running tensorflow on you GPU (You will have to install CUDA)
Estimate how many steps into the future your agent needs to calculate to still get good results
(The one I am currently working on) parallelize
I'm very interested in the field of machine learning and recently I got the idea for a project for the next few weeks.
Basically I want to create an AI that can beat every human at Tic Tac Toe. The algorithm must be scalable for every n*n board size, and maybe even for other dimensions (for a 3D analogue of the game, for example).
Also I don't want the algorithm to know anything of the game in advance: it must learn on its own. So no hardcoded ifs, and no supervisioned learning.
My idea is to use an Artificial Neural Network for the main algorithm itself, and to train it through the use of a genetic algorithm. So I have to code only the rules of the game, and then each population, battling with itself, should learn from scratch.
It's a big project, and I'm not an expert on this field, but I hope, with such an objective, to learn lots of things.
First of all, is that possible? I mean, is it possible to reach a good result within a reasonable amount of time?
Are there good libraries in Python that I can use for this project? And is Python a suitable language for this kind of project?
Yes, this is possible. But you have to tell your AI the rules of the game, beforehand (well, that's debatable, but it's ostensibly better if you do so - it'll define your search space a little better).
Now, the vanilla tic-tac-toe game is far too simple - a minmax search will more than suffice. Scaling up the dimensionality or the size of the board does make the case for more advanced algorithms, but even so, the search space is quite simple (the algebraic nature of the dimensionality increase leads to a slight transformation of the search space, which should still be tractable by simpler methods).
If you really want to throw a heavy machine learning technique at a problem, take a second look at chess (Deep Blue really just brute forced the sucker). Arimaa is interesting for this application as well. You might also consider looking at Go (perhaps start with some of the work done on AlphaGo)
That's my two cents' worth
I would like to compare different methods of finding roots of functions in python (like Newton's methods or other simple calc based methods). I don't think I will have too much trouble writing the algorithms
What would be a good way to make the actual comparison? I read up a little bit about Big-O. Would this be the way to go?
The answer from #sarnold is right -- it doesn't make sense to do a Big-Oh analysis.
The principal differences between root finding algorithms are:
rate of convergence (number of iterations)
computational effort per iteration
what is required as input (i.e. do you need to know the first derivative, do you need to set lo/hi limits for bisection, etc.)
what functions it works well on (i.e. works fine on polynomials but fails on functions with poles)
what assumptions does it make about the function (i.e. a continuous first derivative or being analytic, etc)
how simple the method is to implement
I think you will find that each of the methods has some good qualities, some bad qualities, and a set of situations where it is the most appropriate choice.
Big O notation is ideal for expressing the asymptotic behavior of algorithms as the inputs to the algorithms "increase". This is probably not a great measure for root finding algorithms.
Instead, I would think the number of iterations required to bring the actual error below some epsilon ε would be a better measure. Another measure would be the number of iterations required to bring the difference between successive iterations below some epsilon ε. (The difference between successive iterations is probably a better choice if you don't have exact root values at hand for your inputs. You would use a criteria such as successive differences to know when to terminate your root finders in practice, so you could or should use them here, too.)
While you can characterize the number of iterations required for different algorithms by the ratios between them (one algorithm may take roughly ten times more iterations to reach the same precision as another), there often isn't "growth" in the iterations as inputs change.
Of course, if your algorithms take more iterations with "larger" inputs, then Big O notation makes sense.
Big-O notation is designed to describe how an alogorithm behaves in the limit, as n goes to infinity. This is a much easier thing to work with in a theoretical study than in a practical experiment. I would pick things to study that you can easily measure that and that people care about, such as accuracy and computer resources (time/memory) consumed.
When you write and run a computer program to compare two algorithms, you are performing a scientific experiment, just like somebody who measures the speed of light, or somebody who compares the death rates of smokers and non-smokers, and many of the same factors apply.
Try and choose an example problem or problems to solve that is representative, or at least interesting to you, because your results may not generalise to sitations you have not actually tested. You may be able to increase the range of situations to which your results reply if you sample at random from a large set of possible problems and find that all your random samples behave in much the same way, or at least follow much the same trend. You can have unexpected results even when the theoretical studies show that there should be a nice n log n trend, because theoretical studies rarely account for suddenly running out of cache, or out of memory, or usually even for things like integer overflow.
Be alert for sources of error, and try to minimise them, or have them apply to the same extent to all the things you are comparing. Of course you want to use exactly the same input data for all of the algorithms you are testing. Make multiple runs of each algorithm, and check to see how variable things are - perhaps a few runs are slower because the computer was doing something else at a time. Be aware that caching may make later runs of an algorithm faster, especially if you run them immediately after each other. Which time you want depends on what you decide you are measuring. If you have a lot of I/O to do remember that modern operating systems and computer cache huge amounts of disk I/O in memory. I once ended up powering the computer off and on again after every run, as the only way I could find to be sure that the device I/O cache was flushed.
You can get wildly different answers for the same problem just by changing starting points. Pick an initial guess that's close to the root and Newton's method will give you a result that converges quadratically. Choose another in a different part of the problem space and the root finder will diverge wildly.
What does this say about the algorithm? Good or bad?
I would suggest you to have a look at the following Python root finding demo.
It is a simple code, with some different methods and comparisons between them (in terms of the rate of convergence).
http://www.math-cs.gordon.edu/courses/mat342/python/findroot.py
I just finish a project where comparing bisection, Newton, and secant root finding methods. Since this is a practical case, I don't think you need to use Big-O notation. Big-O notation is more suitable for asymptotic view. What you can do is compare them in term of:
Speed - for example here newton is the fastest if good condition are gathered
Number of iterations - for example here bisection take the most iteration
Accuracy - How often it converge to the right root if there is more than one root, or maybe it doesn't even converge at all.
Input - What information does it need to get started. for example newton need an X0 near the root in order to converge, it also need the first derivative which is not always easy to find.
Other - rounding errors
For the sake of visualization you can store the value of each iteration in arrays and plot them. Use a function you already know the roots.
Although this is a very old post, my 2 cents :)
Once you've decided which algorithmic method to use to compare them (your "evaluation protocol", so to say), then you might be interested in ways to run your challengers on actual datasets.
This tutorial explains how to do it, based on an example (comparing polynomial fitting algorithms on several datasets).
(I'm the author, feel free to provide feedback on the github page!)
for a datastructures & algorithms class in college we have to implement an algorithm presented in a paper. The paper can be found here.
So i fullly implemented the algorithm, with still some errors left (but that's not really why I'm asking this question, if you want to see how I implemented it thus far, you can find it here)
The real reason why I'm asking a question on Stackoverflow is the second part of the assignment: we have to try to make the algorithm better. I had a few ways in mind, but all of them sound good in theory but they won't really do good in practice:
Draw a line between the source and end node, search the node closest to the middle of that line and divide the "path" in 2 recursively. The base case would be a smaller graph were a single Dijkstra would do the computation. This isn't really an adjustment to the current algorithm but with some thinking it is clear this wouldn't give an optimal solution.
Try to give the algorithm some sense of direction by giving a higher priority to edges that point to the end node. This also won't be optimal..
So now I'm all out of ideas and hoping that someone here could give me a little hint for a possible adjustment. It doesn't really have to improve the algorithm, I think the first reason why they asked us to do this is so we don't just implement the algorithm from the paper without knowing what's behind it.
(If Stackoverflow isn't the right place to ask this question, my apologies :) )
A short description of the algorithm:
The algorithm tries to select which nodes look promising. By promising I mean that they have a good chance on lying on a shortest path. How promising a node is is represented by it's 'reach'. The reach of a vertex on a path is the minimum of it's distances to the start and to the end. The reach of a vertex in a graph is the maximum of the reaches of the vertex on all shortest paths.
To eventually determine whether a node is added to the priority queue in Dijkstra's algorithm, a test() function is added. Test returns true (if the reach of a vertex in the graph is larger or equal then the weight of the path from the origin to v at the time v is to be inserted in the priority queue) or (the reach of the vertex in the graph is larger or equal then the euclidean distance from v to the end vertex).
Harm De Weirdt
Your best bet in cases like this is to think like a researcher: Research in general and Computer Science research specifically is about incremental improvement, one person shows that they can compute something faster using Dijkstra's Algorithm and then later they, or someone else, show that they can compute the same thing a little faster using A*. It's a series of small steps.
That said, the best place to look for ways to improve an algorithm presented in a paper is in the future directions section. This paper gives you a little bit to work on in that direction, but your gold mine in this case lies in sections 5 and 6. There are multiple places where the authors admit to different possible approaches. Try researching some of these approaches, this should lead you either to a possible improvement in the algorithm or at least an arguable one.
Best of luck!