I would like to compare different methods of finding roots of functions in Python (like Newton's method or other simple calculus-based methods). I don't think I will have too much trouble writing the algorithms.
What would be a good way to make the actual comparison? I read up a little bit about Big-O. Would this be the way to go?
The answer from #sarnold is right -- it doesn't make sense to do a Big-Oh analysis.
The principal differences between root finding algorithms are:
rate of convergence (number of iterations)
computational effort per iteration
what is required as input (e.g. do you need to know the first derivative, do you need to set lo/hi limits for bisection, etc.)
what functions it works well on (e.g. works fine on polynomials but fails on functions with poles)
what assumptions it makes about the function (e.g. a continuous first derivative, or being analytic, etc.)
how simple the method is to implement
I think you will find that each of the methods has some good qualities, some bad qualities, and a set of situations where it is the most appropriate choice.
Big O notation is ideal for expressing the asymptotic behavior of algorithms as the inputs to the algorithms "increase". This is probably not a great measure for root finding algorithms.
Instead, I would think the number of iterations required to bring the actual error below some epsilon ε would be a better measure. Another measure would be the number of iterations required to bring the difference between successive iterates below some epsilon ε. (The difference between successive iterates is probably a better choice if you don't have exact root values at hand for your inputs. You would use a criterion such as successive differences to know when to terminate your root finders in practice, so you could or should use it here, too.)
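A minimal sketch of that successive-difference stopping criterion, using Newton's method on the arbitrary example f(x) = x**2 - 2 (the tolerance and starting point are made-up values):

import math

def newton_iteration_count(f, fprime, x0, eps=1e-12, max_iter=100):
    # Count the iterations needed until successive iterates differ by less than eps.
    x = x0
    for n in range(1, max_iter + 1):
        x_new = x - f(x) / fprime(x)
        if abs(x_new - x) < eps:
            return n, x_new
        x = x_new
    return max_iter, x

iters, root = newton_iteration_count(lambda x: x*x - 2, lambda x: 2*x, x0=1.0)
print(iters, root, math.sqrt(2))  # compare against the known root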
While you can characterize the number of iterations required for different algorithms by the ratios between them (one algorithm may take roughly ten times more iterations to reach the same precision as another), there often isn't "growth" in the iterations as inputs change.
Of course, if your algorithms take more iterations with "larger" inputs, then Big O notation makes sense.
Big-O notation is designed to describe how an algorithm behaves in the limit, as n goes to infinity. This is a much easier thing to work with in a theoretical study than in a practical experiment. I would pick things to study that you can easily measure and that people care about, such as accuracy and computer resources (time/memory) consumed.
When you write and run a computer program to compare two algorithms, you are performing a scientific experiment, just like somebody who measures the speed of light, or somebody who compares the death rates of smokers and non-smokers, and many of the same factors apply.
Try to choose an example problem, or problems, to solve that is representative, or at least interesting to you, because your results may not generalise to situations you have not actually tested. You may be able to increase the range of situations to which your results apply if you sample at random from a large set of possible problems and find that all your random samples behave in much the same way, or at least follow much the same trend. You can have unexpected results even when theoretical studies show that there should be a nice n log n trend, because theoretical studies rarely account for suddenly running out of cache, or out of memory, or usually even for things like integer overflow.
Be alert for sources of error, and try to minimise them, or have them apply to the same extent to all the things you are comparing. Of course you want to use exactly the same input data for all of the algorithms you are testing. Make multiple runs of each algorithm, and check how variable the results are - perhaps a few runs are slower because the computer was doing something else at the time. Be aware that caching may make later runs of an algorithm faster, especially if you run them immediately after each other. Which timing you want depends on what you decide you are measuring. If you have a lot of I/O to do, remember that modern operating systems and computers cache huge amounts of disk I/O in memory. I once ended up powering the computer off and on again after every run, as the only way I could find to be sure that the device I/O cache was flushed.
You can get wildly different answers for the same problem just by changing starting points. Pick an initial guess that's close to the root and Newton's method will give you a result that converges quadratically. Choose another in a different part of the problem space and the root finder will diverge wildly.
What does this say about the algorithm? Good or bad?
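A small demonstration of that starting-point sensitivity, using the arbitrary example f(x) = arctan(x), whose only root is x = 0:

import math

def newton_arctan(x0, steps=8):
    # Plain Newton iteration: x <- x - f(x)/f'(x), with f'(x) = 1/(1 + x^2).
    x = x0
    for _ in range(steps):
        x = x - math.atan(x) * (1 + x * x)
    return x

print(newton_arctan(0.5))  # converges rapidly toward the root at 0
print(newton_arctan(2.0))  # diverges: the iterates grow without bound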
I would suggest you have a look at the following Python root-finding demo.
It is simple code, with several different methods and comparisons between them (in terms of rate of convergence).
http://www.math-cs.gordon.edu/courses/mat342/python/findroot.py
I just finished a project comparing the bisection, Newton, and secant root-finding methods. Since this is a practical case, I don't think you need to use Big-O notation. Big-O notation is more suitable for an asymptotic view. What you can do is compare them in terms of:
Speed - for example, here Newton is the fastest if good conditions are met
Number of iterations - for example, here bisection takes the most iterations
Accuracy - how often it converges to the right root if there is more than one root, or whether it converges at all
Input - what information it needs to get started; for example, Newton needs an x0 near the root in order to converge, and it also needs the first derivative, which is not always easy to find
Other - rounding errors
For the sake of visualization you can store the value of each iteration in arrays and plot them. Use a function whose roots you already know.
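A minimal sketch of that suggestion, assuming the example function f(x) = x**2 - 2 with known root sqrt(2); the bracketing interval, starting point, and iteration counts are arbitrary choices:

import numpy as np
import matplotlib.pyplot as plt

true_root = np.sqrt(2)
f = lambda x: x*x - 2

def bisection_errors(f, lo, hi, n_iter=30):
    # Record the absolute error of the midpoint at every bisection step.
    errors = []
    for _ in range(n_iter):
        mid = 0.5 * (lo + hi)
        errors.append(abs(mid - true_root) + 1e-17)  # small floor for the log plot
        if f(lo) * f(mid) <= 0:
            hi = mid
        else:
            lo = mid
    return errors

def newton_errors(f, fprime, x0, n_iter=8):
    # Record the absolute error after every Newton step.
    x, errors = x0, []
    for _ in range(n_iter):
        x = x - f(x) / fprime(x)
        errors.append(abs(x - true_root) + 1e-17)
    return errors

plt.semilogy(bisection_errors(f, 0.0, 2.0), label='bisection')
plt.semilogy(newton_errors(f, lambda x: 2*x, 1.0), label='newton')
plt.xlabel('iteration'); plt.ylabel('absolute error'); plt.legend(); plt.show()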
Although this is a very old post, my 2 cents :)
Once you've decided which algorithmic method to use to compare them (your "evaluation protocol", so to speak), you might be interested in ways to run your challengers on actual datasets.
This tutorial explains how to do it, based on an example (comparing polynomial fitting algorithms on several datasets).
(I'm the author, feel free to provide feedback on the github page!)
EDIT: Original post too vague. I am looking for an algorithm to solve a large-system, solvable, linear IVP that can handle very small floating point values. Solving for the eigenvectors and eigenvalues is impossible with numpy.linalg.eig() as the returned values are complex and should not be, it does not support numpy.float128 either, and the matrix is not symmetric so numpy.linalg.eigh() won't work. Sympy could do it given an infinite amount of time, but after running it for 5 hours I gave up. scipy.integrate.solve_ivp() works with implicit methods (have tried Radau and BDF), but the output is wildly wrong. Are there any libraries, methods, algorithms, or solutions for working with this many, very small numbers?
Feel free to ignore the rest of this.
I have a 150x150 sparse (~500 nonzero entries of 22500) matrix representing a system of first order, linear differential equations. I'm attempting to find the eigenvalues and eigenvectors of this matrix to construct a function that serves as the analytical solution to the system so that I can just give it a time and it will give me values for each variable. I've used this method in the past for similar 40x40 matrices, and it's much (tens, in some cases hundreds of times) faster than scipy.integrate.solve_ivp() and also makes post model analysis much easier as I can find maximum values and maximum rates of change using scipy.optimize.fmin() or evaluate my function at inf to see where things settle if left long enough.
This time around, however, numpy.linalg.eig() doesn't seem to like my matrix and is giving me complex values, which I know are wrong because I'm modeling a physical system that can't have complex rates of growth or decay (or sinusoidal solutions), much less complex values for its variables. I believe this to be a stiffness or floating point rounding problem where the underlying LAPACK algorithm is unable to handle either the very small values (smallest is ~3e-14, and most nonzero values are of similar scale) or disparity between some values (largest is ~4000, but values greater than 1 only show up a handful of times).
I have seen suggestions for similar users' problems to use sympy to solve for the eigenvalues, but when it hadn't solved my matrix after 5 hours I figured it wasn't a viable solution for my large system. I've also seen suggestions to use numpy.real_if_close() to remove the imaginary portions of the complex values, but I'm not sure this is a good solution either; several eigenvalues from numpy.linalg.eig() are 0, which is a sign of error to me, but additionally almost all the real portions are of the same scale as the imaginary portions (exceedingly small), which makes me question their validity as well. My matrix is real, but unfortunately not symmetric, so numpy.linalg.eigh() is not viable either.
I'm at a point where I may just run scipy.integrate.solve_ivp() for an arbitrarily long time (a few thousand hours) which will probably take a long time to compute, and then use scipy.optimize.curve_fit() to approximate the analytical solutions I want, since I have a good idea of their forms. This isn't ideal as it makes my program much slower, and I'm also not even sure it will work with the stiffness and rounding problems I've encountered with numpy.linalg.eig(); I suspect Radau or BDF would be able to navigate the stiffness, but not the rounding.
Anybody have any ideas? Any other algorithms for finding eigenvalues that could handle this? Can numpy.linalg.eig() work with numpy.float128 instead of numpy.float64 or would even that extra precision not help?
I'm happy to provide additional details upon request. I'm open to changing languages if needed.
As mentioned in the comment chain above, the best solution for this is to use a matrix exponential, which is a lot simpler (and apparently less error-prone) than diagonalizing your system with eigenvectors and eigenvalues.
For my case I used scipy.sparse.linalg.expm() since my system is sparse. It's fast, accurate, and simple. My only complaint is the loss of evaluation at infinity, but it's easy enough to work around.
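For anyone looking for a starting point, here is a hedged sketch of the matrix-exponential approach for x'(t) = A x(t), x(0) = x0, i.e. x(t) = expm(A*t) @ x0; the 3x3 matrix and initial condition below are placeholders, not the 150x150 system from the question:

import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import expm

# Placeholder system with very small rate constants, just for illustration.
A = sp.csc_matrix(np.array([[-1e-13,  0.0,   0.0],
                            [ 1e-13, -2e-14, 0.0],
                            [ 0.0,    2e-14, 0.0]]))
x0 = np.array([1.0, 0.0, 0.0])

def solution_at(t):
    # Evaluate the analytical solution x(t) = exp(A t) x0 at a single time t.
    return expm(A * t) @ x0

print(solution_at(1e12))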
I've been playing around with GEKKO for solving flow optimizations and I have come across behavior that is confusing me.
Context:
Sources --> [mixing and delivery] --> Sinks
I have multiple sources (where my flow is coming from), and multiple sinks (where my flow goes to). For a given source (e.g., SOURCE_1), the total flow to the resulting sinks must equal to the volume from SOURCE_1. This is my idea of conservation of mass where the 'mixing' plant blends all the source volumes together.
Constraint Example (DOES NOT WORK AS INTENDED):
When I try to create a constraint for the two SINK volumes, and the one SOURCE volume:
m.Equation(volume_sink_1[i] + volume_sink_2[i] == max_volumes_for_source_1)
I end up with weird results. By that I mean it's not actually optimal; it ends up assigning values very poorly. I am off from the optimum by at least 10% (I tried with different max volumes).
Constraint Example (WORKS BUT I DON'T GET WHY):
When I try to create a constraint for the two SINK volumes, and the one SOURCE volume like this:
m.Equation(volume_sink_1[i] + volume_sink_2[i] <= max_volumes_for_source_1 * 0.999999)
With this, I get MUCH closer to the actual optimum, to the point where I can just treat it as the optimum. Note that I had to change the constraint to a less-than-or-equal and also multiply by 0.999999; I arrived at that after a lot of trial and error.
Also, please note that this uses practically all of the source (up to 99.9999% of it) as I would expect. So both formulations make sense to me but the first approach doesn't work.
The only thing I can think of for this behavior is that it's stricter to solve for == than <=. That doesn't explain to me why I have to multiply by 0.999999 though.
Why is this the case? Also, is there a way for me to debug occurrences like this easier?
This same improvement occurs with complementarity constraints for conditional statements when using s1*s2<=0 (easier to solve) versus s1*s2==0 (harder to solve).
From the research papers I've seen, the justification is that the solver has more room to search to find the optimal solution, even though it still ends up at s1*s2==0. It also sounds like your problem may have multiple local minima: it converges to a solution, but not the global optimum.
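As an illustration only, here is a hypothetical, minimal GEKKO sketch of the relaxed (inequality) balance constraint discussed above; the volumes, bounds, and objective are made-up placeholders rather than your actual model:

from gekko import GEKKO

m = GEKKO(remote=False)
max_volume_source_1 = 100.0  # placeholder source volume

volume_sink_1 = m.Var(value=0, lb=0)
volume_sink_2 = m.Var(value=0, lb=0)

# Relaxed mass balance: the inequality gives the solver interior room to search,
# even though the optimum ends up on the boundary.
m.Equation(volume_sink_1 + volume_sink_2 <= max_volume_source_1)

# Placeholder objective: prefer sink 1 slightly so the split is well defined.
m.Maximize(volume_sink_1 + 0.9 * volume_sink_2)

m.solve(disp=False)
print(volume_sink_1.value[0], volume_sink_2.value[0])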
If you can post a complete and minimal problem that demonstrates the issue, we can give more specific suggestions.
I'm looking for an algorithm with the fastest time per query for a problem similar to nearest-neighbor search, but with two differences:
I need to only approximately confirm (tolerating Type I and Type II error) the existence of a neighbor within some distance k or return the approximate distance of the nearest neighbor.
I can query many at once
I'd like better throughput than the approximate nearest neighbor libraries out there (https://github.com/erikbern/ann-benchmarks), which seem better designed for single queries. In particular, the algorithmic relaxation of the first criterion seems like it should leave room for an algorithmic shortcut, but I can't find any solutions in the literature, nor can I figure out how to design one.
Here's my current best solution, which operates at about 10k queries/sec per CPU. I'm looking for something close to an order-of-magnitude speedup if possible.
import numpy as np
import annoy

vector_size = 128  # example value; the dimension was left unspecified here

sample_vectors = np.random.randint(low=0, high=2, size=(10000, vector_size))
new_vectors = np.random.randint(low=0, high=2, size=(100000, vector_size))

# Build an Annoy index over the sample vectors using Hamming distance.
ann = annoy.AnnoyIndex(vector_size, metric='hamming')
for i, v in enumerate(sample_vectors):
    ann.add_item(i, v)
ann.build(20)

# Query one vector at a time (the throughput bottleneck in question).
for v in new_vectors:
    print(ann.get_nns_by_vector(v, n=1, include_distances=True))
I'm a bit skeptical of benchmarks such as the one you have linked, as in my experience I have found that the definition of the problem at hand far outweighs in importance the merits of any one algorithm across a set of other (possibly similar looking) problems.
More simply put, an algorithm being a high performer on a given benchmark does not imply it will be a higher performer on the problem you care about. Even small or apparently trivial changes to the formulation of your problem can significantly change the performance of any fixed set of algorithms.
That said, given the specifics of the problem you care about I would recommend the following:
use the cascading approach described in the paper [1]
use SIMD operations (either SSE on Intel chips or GPUs) to accelerate; the nearest-neighbour problem is one where operations closer to the metal and parallelism can really shine
tune the parameters of the algorithm to maximize your objective; in particular, the algorithm of [1] has a few easy-to-tune parameters which will dramatically trade performance for accuracy, so make sure you perform a grid search over them to find the sweet spot for your problem
Note: I have recommended the paper [1] because I have tried many of the algorithms listed in the benchmark you linked and found them all inferior (for the task of image reconstruction) to the approach listed in [1] while at the same time being much more complicated than [1], both undesirable properties. YMMV depending on your problem definition.
I appreciate the solutions, they gave me some ideas, but I'll answer my own question as I found a solution that mostly resolved my question, and maybe it will help someone else in the future.
I used one of the libraries linked in the benchmarks, hnswlib as it not only has slightly improved performance over annoy, but it also has a bulk-query option. Hnswlib's algorithm also allows a highly flexible performance/accuracy tradeoff in favor of performance, which is well-suited for the highly-error tolerant approximate checking I want to do. Further, even though the parallelization improvements are far from linear per-core, it's still something. The above factors combined for a ~5x speedup in my particular situation.
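For reference, a hedged sketch of the bulk-query setup described above; hnswlib does not expose a Hamming space, so this uses the 'l2' space on 0/1 float vectors as a stand-in (squared L2 equals Hamming distance for binary vectors), and the dimensions, index parameters, and thread count are arbitrary example values:

import numpy as np
import hnswlib

dim, n_index, n_query = 128, 10_000, 100_000
index_vectors = np.random.randint(0, 2, size=(n_index, dim)).astype(np.float32)
query_vectors = np.random.randint(0, 2, size=(n_query, dim)).astype(np.float32)

index = hnswlib.Index(space='l2', dim=dim)
index.init_index(max_elements=n_index, ef_construction=100, M=16)
index.add_items(index_vectors, np.arange(n_index))

# Lower ef trades accuracy for speed, which suits error-tolerant approximate checks.
index.set_ef(10)

# Bulk query: one call for all query vectors, parallelized across threads.
labels, distances = index.knn_query(query_vectors, k=1, num_threads=4)
print(labels.shape, distances.shape)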
As ldog said, your mileage may vary depending on your problem statement.
This is a very general question -- is there any way to vectorize a sequential simulation (where the next step depends on the previous one), or any such iterative algorithm in general?
Obviously, if you need to run M simulations (each of N steps) you can use for i in range(N) and calculate M values on each step to get a significant speed-up. But say you only need one or two simulations with a lot of steps, or your simulations don't have a fixed number of steps (like radiation detection), or you are solving a differential system (again, for a lot of steps). Is there any way to shove the outer for-loop under the numpy hood (with a speed gain; I am not talking about passing a Python function object to numpy.vectorize), or are cython-ish approaches the only option? Or maybe this is possible in R or some similar language, but not (currently?) in Python?
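A minimal sketch of that "M values per step" idea, using an arbitrary random-walk example (all M walkers advance together inside each step of the sequential loop):

import numpy as np

M, N = 100_000, 1_000
rng = np.random.default_rng(0)

x = np.zeros(M)
for _ in range(N):                        # the sequential loop stays in Python...
    x += rng.choice([-1.0, 1.0], size=M)  # ...but each step is vectorized over M walkers

print(x.mean(), x.std())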
Perhaps multigrid-in-time methods can give some improvement.
I came across the function randint() in Python, which gives you a random integer within a given range. The thing that I am not able to digest is how we can really simulate randomness. How can I really tell that a random function, in any programming language, doesn't give a biased result? Bell curve?
How can we simulate something as natural as randomness? We can calculate the probability of a result appearing, but we can never tell how this works.
For simulating something we need complete insight of the topic, don't we?
Quick Answer:
Entropy is introduced into computer algorithms to kick off a deterministic sequence of numbers, which are generated by clever computer algorithms and then fitted to the distribution curve the user asked for; e.g. rand(1,10) could internally produce numbers from 0.0 to 0.9999 but needs to map them to 1 through 10. Because it is difficult to know what that entropy was, determining the numbers is more difficult, hence the "pseudo-random generator" description. In statistics we learn that flipping a coin 100 times may not result in 50 heads and 50 tails, nor should it, as coin flips don't work that way. Still, it's a good example: the probability works assuming an infinite number of iterations over an even distribution. It is possible for heads to show 100 times, 1,000 times, even 10,000 times in a row. It's possible, not likely, but possible. An algorithm which simulates randomness is under no obligation to ensure that a 0 is returned if a 0 is among the list of possible answers. It only needs to ensure it's possible.
General Answer
Most computer generated random numbers are pseudo-random.
As you've alluded to in your question, computers cannot simulate true randomness; all random algorithms generate random numbers deterministically, meaning that if one knows the initial seed of the algorithm, the entropy used by the algorithm, and which iteration the algorithm was on, the 'random' number can be determined. True randomness can only be achieved by observing the outcome of a random event, which may be the physical nature of computer components or other phenomena.
One could argue that natural randomness is not in reality random but just an unknown sequence of events. Not unknowable (i.e. entropy), but just currently unknown. It is only unknown (random) because we are unable to explain or predict it, at this time, due to insufficient advancement in technology or knowledge. There is true chaotic entropy, but unless we're talking about quantum computers, it doesn't matter. For most practical software applications a very good pseudo RNG is all that's required.
A thousand years ago we might have said the ocean was prone to random tsunamis. Now we have advanced enough technology and understanding to create prediction models. These prediction models become more accurate as we feed in more information about the events that lead to a tsunami.
The part that is difficult for a computer to simulate is entropy. Entropy is, at its simplest, randomness. When generating a tuple of prime numbers, the algorithm used to create a series of random numbers will often collect entropy from outside sources: moving your mouse, electrical noise, 'noise' collected from an antenna such as the built-in WiFi or Bluetooth. Entropy is the key to creating a good set of simulated random numbers.
Even with all the advancements we have in collecting entropy, a machine can still be induced to generate a specific set of entropy, which would then allow an attacker to accurately predict the numbers being generated. If the algorithm collects noise from a microphone, an attacker can create a loud and predictable noise at the right time in order to influence the sequence of numbers generated later. The same can be said of all the other forms of gathering entropy.
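For what it's worth, the entropy the operating system gathers (from device timings, hardware noise, and so on, as described above) is exposed to Python programs through os.urandom and the secrets module, which are the right tools when unpredictability actually matters:

import os
import secrets

print(os.urandom(16).hex())   # 16 bytes from the OS entropy-backed CSPRNG
print(secrets.token_hex(16))  # convenience wrapper over the same source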
A simple way to get true randomness is to use Random.org.
The randomness comes from atmospheric noise, which for many purposes
is better than the pseudo-random number algorithms typically used in
computer programs.
Since you're going to simulate randomness, you are going to end up using a pseudorandom number generator (PRNG). The topic is widely covered.
Python's random module already uses the Mersenne Twister, and my guess is that you do not want anything better than that, unless you're working on some cryptography tool.
Now, if you want a truly random signal, it has to have a physical source (e.g. a Geiger counter). But in most cases you do not need to go that far.
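As a small illustration of the distinction, a minimal sketch contrasting Python's default Mersenne Twister PRNG with random.SystemRandom, which draws from the OS entropy source instead (the seed and ranges are arbitrary):

import random

prng = random.Random(42)            # deterministic Mersenne Twister, seeded
print([prng.randint(1, 10) for _ in range(5)])

sys_rng = random.SystemRandom()     # OS entropy source; appropriate for crypto
print([sys_rng.randint(1, 10) for _ in range(5)])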
The answer to your question depends hugely on the purpose of the randomness in your application.
How can we simulate something that is so natural as randomness?
TL;DR:
By knowing what makes that something act "randomly",
By being smart and making just the right simplifying assumptions so as to not make the problem too hard,
By having good previously-collected statistics so as to know that a statistical model is correct,
By having a PRNG that is good enough from which one can simulate that random process, and
By having an algorithm that maps outputs from that PRNG to the underlying statistical distribution.
By knowing what makes that something act "randomly"
Radioactive decay almost perfectly acts as a Poisson process. Somewhat less perfectly, goals scored in a World Cup game can also be modeled as a Poisson process. (But it's close enough for Las Vegas to make money.) On the other hand, the outcome of a coin toss is an example of a Bernoulli process. There are lots of different kinds of random processes, and these different random processes lead to different kinds of random distributions. Knowing what is going on underneath the hood is important.
By being smart and making just the right simplifying assumptions
One of the most helpful tools in a modeler's bag of tricks is the central limit theorem. Add lots and lots and lots of random influences together and the end result very often looks gaussian (the "Bell Curve" alluded to in the question). Assuming a gaussian distribution is a nice simplifying assumption, but it can get one in trouble. One has to be smart enough to avoid oversimplifying assumptions.
By having good previously-collected statistics
It took a while before people determined that radioactive decay was indeed a Poisson process. They determined this by having a nice history of previously taken measurements. Without previously collected statistics, all one has is a guess. Guesses are exceptionally good at biting the person who made the guess in the rear.
By having a PRNG that is good enough
There are lots of reasons for using a deterministic pseudo random number generator. That a PRNG isn't quite "random" in the sense that run #12345 of a Monte Carlo simulation can be exactly repeated can be a good thing. If a simulated vehicle blew up or if a simulated patient died in that run of a Monte Carlo simulation, any sane person would want to investigate that case in detail.
Fortunately, there are a number of very good PRNGs out there. Python uses Mersenne twister. While not the best, it is very, very good.
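A tiny sketch of that repeatability property: re-seeding the PRNG with the same value replays the exact same run (the seed here is arbitrary):

import random

random.seed(12345)
run_a = [random.random() for _ in range(3)]

random.seed(12345)
run_b = [random.random() for _ in range(3)]

assert run_a == run_b   # the same "run" can be reproduced exactly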
By having an algorithm that maps outputs from the PRNG to the underlying statistical distribution
You're toast if there's no way to translate the output of the Mersenne Twister (or whatever PRNG you are using) to the distribution at hand. Fortunately, people before us have spent a good deal of time developing algorithms that approximate a wide range of random distributions.
The question was tagged with python, so I'd be remiss not to mention Python's random package and numpy's random package. The latter is even better than the built-in capabilities one gets for free in a standard Python installation. It provides a good number of algorithms that convert from the integer output of the Mersenne Twister (for example) to a wide range of frequently encountered probability distributions. (And in some cases, probability distributions that are only encountered infrequently.)
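A brief illustration of that mapping step using numpy's random package (the parameters are arbitrary examples; RandomState is the legacy Mersenne Twister-based generator mentioned above, while newer code would typically use numpy.random.default_rng(), which defaults to PCG64):

import numpy as np

rng = np.random.RandomState(12345)

decays_per_interval = rng.poisson(lam=4.2, size=5)            # Poisson process counts
coin_flips = rng.binomial(n=1, p=0.5, size=5)                 # Bernoulli process
measurement_noise = rng.normal(loc=0.0, scale=1.0, size=5)    # gaussian / bell curve

print(decays_per_interval, coin_flips, measurement_noise)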
I'll start by asking: what does it mean to be random? The word is shorthand for outcomes that are a priori unpredictable. You can attempt to quantify the degree of unpredictability using measures such as entropy, but randomness itself is a binary state: either an event can be predicted with certainty (entropy = 0), or it is random. Different probability distributions such as the bell curve (normal) or uniform have different amounts of entropy, but they all qualify as random because their entropy is non-zero—you can't predict the outcomes with certainty.
Most programming languages implement some type of Pseudo-Random Number Generator (PRNG). These are deterministic algorithms which use chaotic behavior to mimic the unpredictability of randomness. If you know the algorithm being applied and the initial state, you can predict the outcomes of a PRNG with absolute certainty. However, we can take inspiration from Alan Turing's "Imitation Game." Imagine you have two black box sources of numbers, one of which contains a PRNG (but you don't know anything about its initial state) while the other contains a source of "real" randomness (whatever that means). If you're permitted to apply any test you can think of, and you can't tell which is which within the scope of the sample you plan to use in your computer program, does it matter which one you use?
How can you tell that the PRNG is okay to use? Basically it boils down to trusting that the people who designed the algorithm knew what they were doing, and that the implementation holds up well against a battery of tests specifically intended to catch PRNGs in identifiably non-random behavior, such as Marsaglia's Diehard tests or the more recent Dieharder suite, or those available from NIST.