I came across the function randint() in Python, which gives you a random integer from a range of integers. The thing that I am not able to digest is how we can really simulate randomness. How can I really tell that a random function, in any programming language, doesn't give a biased result? Bell curve?
How can we simulate something as natural as randomness? We can calculate the probability of a result appearing, but we can never tell how this actually works.
To simulate something we need complete insight into the topic, don't we?
Quick Answer:
Entropy is introduced into computer algorithms to kick off a deterministic sequence of numbers, generated by clever algorithms and then fitted to the distribution the user asked for; e.g. rand(1,10) could internally produce numbers from 0.0 to 0.9999 and then map them to 1 through 10. Because it is difficult to know what that entropy was, it is difficult to determine the numbers, hence the description "pseudo-random generator". In statistics we learn that flipping a coin 100 times may not result in 50 heads and 50 tails, nor should it, because coin flips don't work that way. It's a good example nonetheless: the probability works out assuming an infinite number of iterations over an even distribution. It is possible for heads to show up 100 times, 1,000 times, even 10,000 times in a row. Possible, not likely, but possible. An algorithm which simulates randomness is under no obligation to ensure that a 0 is ever returned if 0 is among the list of possible answers; it only needs to ensure that it is possible.
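As a rough sketch of that internal mapping (rand(1,10) above is a hypothetical function, not a real Python call), a uniform float in [0.0, 1.0) can be scaled to an integer from 1 to 10 like this:

import random

u = random.random()        # uniform float in [0.0, 1.0)
value = 1 + int(u * 10)    # scaled and shifted to an integer from 1 to 10
print(value)

This is, in spirit, what random.randint(1, 10) already does for you.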
General Answer
Most computer generated random numbers are pseudo-random.
As you've alluded to in your question, computers cannot simulate true randomness; all random algorithms generate random numbers deterministically, meaning that if one knows the initial seed of the algorithm, the entropy used by the algorithm, and which iteration the algorithm is on, the 'random' number can be determined. True randomness can only be achieved by observing the outcome of a random event, which may be the physical nature of computer components or other phenomena.
One could argue that natural randomness is not in reality random but just an unknown sequence of events. Not unknowable (i.e. Entropy), but just currently unknown. It is only unknown (random) because we are unable to explain or predict it, at this time, due to insufficient advancement in technology or knowledge. There is true Chaotic Entropy, but unless we're talking about Quantum Computers, it doesn't matter. For the purposes of most applicable software applications a very good Pseudo RNG is all that's required.
A thousand years ago we might have said the ocean was prone to random tsunamis. Now we have more advanced technology and enough understanding to create prediction models. These prediction models become more accurate as we enter more information about all the events which lead to a tsunami.
The part that is difficult for a computer to simulate is entropy. Entropy is, at its simplest, randomness. When generating a tuple of prime numbers, the algorithm used to create a series of random numbers will often collect entropy from outside sources: moving your mouse, electrical noise, 'noise' collected from an antenna such as the built-in WiFi or Bluetooth. Entropy is the key to creating a good set of simulated random numbers.
Even with all the advancements we have in collecting entropy, a machine can still be induced to generate a specific set of entropy, which would then allow an attacker to accurately predict the numbers being generated. If the algorithm collects noise from a microphone, an attacker can create a loud and predictable noise at the right time in order to influence the sequence of numbers which will be generated later. The same can be said of all the other forms of gathering entropy.
A simple way to get true randomness is to use Random.org.
The randomness comes from atmospheric noise, which for many purposes
is better than the pseudo-random number algorithms typically used in
computer programs.
Since you're going to simulate randomness, you are going to end up using a pseudorandom number generator (PRNG). The topic is widely covered.
Python's random module already uses the Mersenne Twister, and my guess is that you do not want anything better than that, unless you're working on some cryptography tool.
Now, if you want a truly random signal, it has to have a physical source (e.g. a Geiger counter). But in most cases you do not need to go that far.
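A minimal illustration of that split, assuming you only need everyday (non-cryptographic) randomness versus security-grade randomness:

import random
import secrets

random.seed(42)                    # Mersenne Twister: deterministic, reproducible
print(random.randint(1, 10))       # fine for simulations, games, sampling

print(secrets.randbelow(10) + 1)   # OS entropy via the secrets module: use for tokens and keys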
The answer to your question depends hugely on the purpose of the randomness in your application.
How can we simulate something that is so natural as randomness?
TL;DR:
By knowing what makes that something act "randomly",
By being smart and making just the right simplifying assumptions so as to not make the problem too hard,
By having good previously-collected statistics so as to know that a statistical model is correct,
By having a PRNG that is good enough from which one can simulate that random process, and
By having an algorithm that maps outputs from that PRNG to the underlying statistical distribution.
By knowing what makes that something act "randomly"
Radioactive decay almost perfectly acts as a Poisson process. Not quite so perfectly, goals scored in a World Cup game can also be modeled as a Poisson process. (But it's close enough for Las Vegas to make money.) On the other hand, the outcome of a coin toss is an example of a Bernoulli process. There are lots of different kinds of random processes, and these different random processes lead to different kinds of random distributions. Knowing what is going on underneath the hood is important.
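A small sketch of those two kinds of processes using Python's standard library (the rate and probability below are made-up example values):

import random

random.seed(1)

# Bernoulli process: a sequence of independent fair coin tosses
flips = ["H" if random.random() < 0.5 else "T" for _ in range(10)]

# Poisson process: event times with exponentially distributed gaps (rate = 2 events per unit time)
rate, t, event_times = 2.0, 0.0, []
for _ in range(5):
    t += random.expovariate(rate)   # waiting time until the next event
    event_times.append(round(t, 3))

print(flips)
print(event_times)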
By being smart and making just the right simplifying assumptions
One of the most helpful tools in a modeler's bag of tricks is the central limit theorem. Add lots and lots and lots of random influences together and the end result very often looks gaussian (the "Bell Curve" alluded to in the question). Assuming a gaussian distribution is a nice simplifying assumption, but it can get one in trouble. One has to be smart enough to avoid oversimplifying assumptions.
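You can see the central limit theorem at work in a few lines of Python: sums of many independent uniform draws pile up around their mean in a bell shape (the sample sizes here are arbitrary):

import random

random.seed(0)
totals = [sum(random.random() for _ in range(50)) for _ in range(10_000)]

mean = sum(totals) / len(totals)                             # close to 50 * 0.5 = 25
var = sum((x - mean) ** 2 for x in totals) / len(totals)     # close to 50 * (1/12)
print(mean, var)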
By having good previously-collected statistics
It took a while before people determined that radioactive decay was indeed a Poisson process. They determined this by having a nice history of previously taken measurements. Without previously collected statistics, all one has is a guess. Guesses are exceptionally good at biting the person who made the guess in the rear.
By having a PRNG that is good enough
There are lots of reasons for using a deterministic pseudo random number generator. That a PRNG isn't quite "random" in the sense that run #12345 of a Monte Carlo simulation can be exactly repeated can be a good thing. If a simulated vehicle blew up or if a simulated patient died in that run of a Monte Carlo simulation, any sane person would want to investigate that case in detail.
Fortunately, there are a number of very good PRNGs out there. Python uses Mersenne twister. While not the best, it is very, very good.
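That repeatability is easy to get in Python by giving each run its own seeded generator; run_simulation below is only a stand-in for whatever your Monte Carlo step actually is:

import random

def run_simulation(run_number):
    rng = random.Random(run_number)        # a private, seeded Mersenne Twister instance
    return [rng.gauss(0.0, 1.0) for _ in range(5)]

# Run #12345 can be replayed exactly whenever an interesting case needs investigating
assert run_simulation(12345) == run_simulation(12345)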
By having an algorithm that maps outputs from the PRNG to the underlying statistical distribution
You're toast if there's no way to translate the outcome of Mersenne twister (or whatever PRNG you are using) to the distribution at hand. Fortunately, people before us have spent a good deal of time developing algorithms that approximate a wide number of random distributions.
The question was tagged with python, so I'd be remiss not to mention Python's random package and NumPy's random package. The latter is even better than the built-in capabilities one gets for free as a standard Python package. It provides a good number of algorithms that convert the integer output of the Mersenne Twister (for example) into a wide number of frequently encountered probability distributions. (And in some cases, probability distributions that are only encountered infrequently.)
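For example, with NumPy the mapping from raw PRNG output to a named distribution is a one-liner (note that the newer NumPy API defaults to a PCG64 bit generator rather than the Mersenne Twister, but the idea is the same):

import numpy as np

rng = np.random.default_rng(42)                 # seeded for reproducibility
print(rng.normal(loc=0.0, scale=1.0, size=5))   # gaussian / "bell curve"
print(rng.poisson(lam=3.0, size=5))             # Poisson counts
print(rng.integers(1, 11, size=5))              # uniform integers 1..10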
I'll start by asking: what does it mean to be random? The word is a shorthand for outcomes that are a priori unpredictable. You can attempt to quantify the degree of unpredictability using measures such as entropy, but randomness itself is a binary state: either an event can be predicted with certainty (entropy = 0), or it is random. Different probability distributions, such as the bell curve (normal) or the uniform, have different amounts of entropy, but they all qualify as random because their entropy is non-zero: you can't predict the outcomes with certainty.
Most programming languages implement some type of Pseudo-Random Number Generator (PRNG). These are deterministic algorithms which use chaotic behavior to mimic the unpredictability of randomness. If you know the algorithm being applied and the initial state, you can predict the outcomes of a PRNG with absolute certainty. However, we can take inspiration from Alan Turing's "Imitation Game." Imagine you have two black box sources of numbers, one of which contains a PRNG (but you don't know anything about its initial state) while the other contains a source of "real" randomness (whatever that means). If you're permitted to apply any test you can think of, and you can't tell which is which within the scope of the sample you plan to use in your computer program, does it matter which one you use?
How can you tell that the PRNG is okay to use? Basically it boils down to trusting that the people who designed the algorithm knew what they were doing, and that the implementation holds up well against a battery of tests specifically intended to catch PRNGs in identifiably non-random behavior, such as Marsaglia's Diehard tests or the more recent Dieharder suite, or those available from NIST.
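Those batteries are far more thorough than anything you would write yourself, but the spirit is simple to demonstrate; here is a toy chi-squared frequency test of randint (a sanity check only, not a substitute for Diehard/Dieharder/NIST):

import random
from collections import Counter

random.seed(7)
n, k = 100_000, 10
counts = Counter(random.randint(1, k) for _ in range(n))

expected = n / k
chi_sq = sum((counts[i] - expected) ** 2 / expected for i in range(1, k + 1))

# With k - 1 = 9 degrees of freedom, the 95% critical value is about 16.9;
# a value far above that would hint at biased output.
print(chi_sq)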
Related
Cross posted from csexchange:
Most versions of simulated annealing I've seen are implemented similarly to what is outlined in the Wikipedia pseudocode below:
Let s = s0
For k = 0 through kmax (exclusive):
T ← temperature( 1 - (k+1)/kmax )
Pick a random neighbour, snew ← neighbour(s)
If P(E(s), E(snew), T) ≥ random(0, 1):
s ← snew
Output: the final state s
I am having trouble understanding how this algorithm does not get stuck in a local optimum as the temperature cools. If we jump around at the start while the temp is high, and eventually only take uphill moves as the temp cools, then isn't the solution found highly dependent on where we just so happened to end up in the search space as the temperature started to cool? We may have found a better solution early on, jumped off of it while the temp was high, and then be in a worse-off position as the temp cools and we transition to hill climbing.
An often listed modification to this approach is to keep track of the best solution found so far. I see how this change mitigates the risk of "throwing away" a better solution found in the exploratory stage when the temp is high, but I don't see how this is any better than simply performing repeated random hill-climbing to sample the space, without the temperature theatrics.
Another approach that comes to mind is to combine the ideas of keeping track of the "best so far" with repeated hill climbing and beam search. For each temperature, we could perform simulated annealing and track the best 'n' solutions. Then for the next temperature, start from each of those local peaks.
UPDATE:
It was rightly pointed out I didn't really ask a specific question, I've updated in the comments but want to reflect it here:
I don't need a transcription of the pseudo-code into essay form, I understand how it works - how the temperature balances exploration vs exploitation, and how this (theoretically) helps avoid local optima.
My specific questions are:
Without modification to keep track of the global best solution, wouldn't this algorithm be extremely prone to local optima, despite the temperature component? Yes, I understand the probability of taking a worse move declines as the temperature cools (i.e. it transitions to pure hill-climbing), but it would be possible that you had found a better solution early on in the exploration portion (temp hot), jumped off of it, and then as the temperature cools you have no way back because you're on a path that leads to a new local peak.
With the addition of tracking the global optimum, I can absolutely see how this mitigates getting stuck at local peaks, avoiding the problem described above. But how does this improve upon simple random search? One specific example being repeated random hill climbing. If you're tracking the global optimum and you happen to hit it while in the high-temp portion, then that essentially is a random search.
In what circumstances would this algorithm be preferable to something like repeated random hill-climbing, and why? What properties does a problem have that make it particularly suited to SA vs an alternative?
By my understanding, simulated annealing is not guaranteed to avoid getting stuck in a local maximum (for maximization problems), especially as it "cools" later in the cycle, as k -> kmax.
So as you say, we "jump all over the place" at the start, but we are still selective in whether or not we accept that jump, as the P() function that determines the probability of acceptance is a function of the goal, E(). Later in the same wikipedia article, they describe the P() function a bit, and suggest that if e_new > e, then perhaps P=1, always taking the move if it is an improvement, but perhaps not always if it isn't an improvement. Later in the cycle, we aren't as willing to take random leaps to lesser results, so the algorithm tends to settle into a max (or min), which may or may not be the global.
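A minimal Python sketch of that acceptance rule, following the Wikipedia pseudocode quoted in the question (energy and neighbour are problem-specific placeholders, and the linear cooling schedule is just one common choice):

import math
import random

def simulated_annealing(s0, energy, neighbour, kmax, t0=1.0):
    s = s0
    for k in range(kmax):
        t = t0 * (1 - (k + 1) / kmax)          # temperature cools towards 0
        s_new = neighbour(s)
        e, e_new = energy(s), energy(s_new)
        # Metropolis-style P(): always accept improvements; accept worse moves
        # with probability exp(-(e_new - e) / t), which shrinks as t cools.
        if e_new < e or (t > 0 and random.random() < math.exp(-(e_new - e) / t)):
            s = s_new
    return s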
The challenge in global optimization is to obtain good efficiency, i.e. to find the optimal solution, or one that is close to the optimum, with as few function evaluations as possible. So many "we could perform..." reasonings are correct in terms of convergence to a good solution, but may fail to be efficient.
Now the main difference between pure climbing and annealing is that annealing temporarily admits a worsening of the objective to avoid... getting trapped in a local optimum. Hill climbing only searches the subspace where the function takes larger values than the best so far, which may fail to be connected to the global optimum. Annealing adds some "tunneling" effect.
Temperature decrease is not the key ingredient here; by the way, hill climbing methods such as the simplex (Nelder–Mead) method or Hooke–Jeeves pattern search also work with diminishing perturbations. Temperature schemes are more related to a multiscale decomposition of the target function.
The question as asked doesn't seem very coherent; it seems to ask about guarantees offered by the vanilla simulated annealing algorithm, then proceeds to propose multiple modifications with little to no justification.
In any case, I will try to be coherent in this answer.
The only rigorous convergence proof for simulated annealing that I know of is described as follows:
The concept of using an adaptive step-size has been essential to the development of better annealing algorithms. The Classical Simulated Annealing (CSA) was the first annealing algorithm with a rigorous mathematical proof for its global convergence (Geman and Geman, 1984). It was proven to converge if a Gaussian distribution is used for G_XY(T_k), coupled with an annealing schedule S(T_k) that decreases no faster than T = T_0 / log(k). The integer k is a counter for the external annealing loop, as will be detailed in the next section. However, a logarithmic decreasing schedule is considered to be too slow, and for many problems the number of iterations required is considered as "overkill" (Ingber, 1993).
Where the proof of this can be found in
Geman, Stuart, and Donald Geman. "Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images." IEEE Transactions on pattern analysis and machine intelligence 6 (1984): 721-741.
In terms of successful modifications of SA, if you peruse the literature you will find many empirical claims of superiority, but few proofs. Having studied this problem in the past, I can say that, of the various modified versions of SA I have tried empirically for the problems I was interested in at the time, the only SA modification that consistently gave better results than classical simulated annealing is known as parallel tempering.
Parallel tempering is the modification of SA whereby, instead of mixing one Markov chain (i.e. the annealing step), multiple chains are mixed concurrently at very specific temperatures, with exchanges between chains occurring on a dynamic or static schedule.
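A bare-bones sketch of the idea (not a production implementation; energy and neighbour are problem-specific placeholders, and real codes use carefully tuned temperature ladders and swap schedules):

import math
import random

def parallel_tempering(s0, energy, neighbour, temps, sweeps):
    states = [s0 for _ in temps]               # one replica per temperature
    for _ in range(sweeps):
        # Ordinary Metropolis update within each chain
        for i, t in enumerate(temps):
            cand = neighbour(states[i])
            d_e = energy(cand) - energy(states[i])
            if d_e < 0 or random.random() < math.exp(-d_e / t):
                states[i] = cand
        # Replica-exchange step between adjacent temperatures
        for i in range(len(temps) - 1):
            d_beta = 1.0 / temps[i] - 1.0 / temps[i + 1]
            d_e = energy(states[i]) - energy(states[i + 1])
            if d_beta * d_e >= 0 or random.random() < math.exp(d_beta * d_e):
                states[i], states[i + 1] = states[i + 1], states[i]
    return min(states, key=energy)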
Original paper on parallel tempering:
Swendsen, Robert H., and Jian-Sheng Wang. "Replica Monte Carlo simulation of spin-glasses." Physical review letters 57.21 (1986): 2607.
There are various free and proprietary implementations of parallel tempering available due to its effectiveness; I would highly recommend starting with a high-quality implementation instead of rolling your own.
Note also that parallel tempering is extensively used throughout the physical sciences for modelling purposes due to its efficacy, but for some reason has a very small representation in other areas of the literature (such as computer science), which actually baffles me somewhat. Perhaps computer scientists' general disinterest in SA modifications stems from a belief that SA is unlikely to be improved upon.
I know that when you use numpy.random.seed(0) you get the same result on your own computer every time. I am wondering if it is also true for different computers and different installations of numpy.
It all depends on the type of algorithm implemented internally by the NumPy random functions. NumPy uses pseudo-random number generator (PRNG) algorithms, which means that if you provide the same seed (as the starting input), you will get the same output, and if you change the seed, you will get a different output. So this kind of algorithm is not system dependent.
But a true random number generator (TRNG) often relies on some kind of specialized hardware that makes a physical measurement of something unpredictable in the environment, such as light, temperature, electrical noise, or radioactive material. If a module implements this kind of algorithm, then it will be system dependent.
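A quick check of that determinism with NumPy's legacy global-state API (the sequence should match across machines as long as the NumPy version is the same):

import numpy as np

np.random.seed(0)
a = np.random.rand(3)
np.random.seed(0)
b = np.random.rand(3)

assert np.array_equal(a, b)   # same seed, same sequence
print(a)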
I am doing a scheduling simulation in Python which is fully deterministic. So, when I have the same input and parameters I always get the same output.
Now I want to randomize the initial starting state of the simulation and compare the output of two (or more) different simulation parameters. To compare the "same randomized initial starting state" I want to set random.seed() with an initial value, which should stay the same for all comparisons of different schedulers. Furthermore, I want to see the behaviour of one scheduler on different initial states, so I have to change random.seed(). This I have to do, of course, for all schedulers.
Now my question is: what impact does the seed have on the "randomness" of the random generator? For example, does it matter if I choose 1 or 100 as a seed? And because I want to use different seeds for the same scheduler and compare it to the other ones, can I simply use e.g. the seeds 1 to 10, or must my seeds be "more random"?
For clarification, I use the random generator to distribute tasks initially onto different cores and compare the output to "my optimal (deterministic) initial distribution". I want to get a wide spread of different distributions with my chosen seeds.
Although your choice of seed doesn't matter in theory, it may matter in practice.
There are many PRNGs for which a given seeding strategy will produce correlated pseudorandom number sequences. For example, in most versions of PCG, two sequences generated from seeds that differ only in their high bits will be highly correlated ("Subsequences from the same generator"). Another example, this time involving Unity's PRNG, is found in "A Primer on Repeatable Random Numbers".
If you choose sequential seeds, or seeds that differ only very slightly, it may matter how the seed is used to initialize the PRNG's state. I don't know how well base Python's random.seed(integer_seed) avoids this problem, but if two Mersenne Twister states differ in only one bit, the two sequences they produce will be correlated to each other, and it will take millions of numbers to eliminate this correlation. A similar issue occurs with many other PRNGs, especially linear PRNGs.
Mersenne Twister has a huge state (about 20,000 bits) compared to other PRNGs in common use (which are typically 32 to 128 bits). Unless you seed a PRNG with a seed as big as its state, there will be some PRNG states that will never be produced (and therefore some pseudorandom number sequences that will never be generated). See also this question, which discusses seeding Mersenne Twister, which is the algorithm used in base Python.
To reduce the risk of correlated pseudorandom numbers, you can use PRNG algorithms, such as SFC and other so-called "counter-based" PRNGs (Salmon et al., "Parallel Random Numbers: As Easy as 1, 2, 3", 2011), that support independent "streams" of pseudorandom numbers. (Note, however, that PCG has a flawed notion of "streams".) There are other strategies as well, and I explain more about this in "Seeding Multiple Processes". See also this question.
Also, if you're using NumPy (which is a popular Python library for scientific work), note that NumPy 1.17 introduces a new pseudorandom number generation system; it uses so-called bit generators, such as PCG, and random generators, such as the new numpy.random.Generator. It was the result of a proposal to change the PRNG policy. The NumPy documentation has detailed information on seeding PRNGs in parallel.
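If you can use the newer NumPy API, here is a sketch of how seeding and independent parallel streams look there (the seed 12345 is just an example value):

import numpy as np

# New-style API (NumPy >= 1.17): explicit Generator objects, PCG64 by default
rng = np.random.default_rng(12345)
print(rng.normal(size=3))

# Spawn statistically independent child streams, e.g. one per worker process
parent = np.random.SeedSequence(12345)
child_rngs = [np.random.default_rng(s) for s in parent.spawn(4)]
print(child_rngs[0].integers(0, 10, size=5))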
Your choice of seed shouldn't matter
If the pseudo-random number generator is made correctly, then any seed should create a random distribution of numbers.
From Wikipedia:
For a seed to be used in a pseudorandom number generator, it does not need to be random. Because of the nature of number generating algorithms, so long as the original seed is ignored, the rest of the values that the algorithm generates will follow a probability distribution in a pseudorandom manner.
For a scientific experiment I need to generate 10 random, fixed-size subsets of a list. For the experiment to be repeatable I want to initialize 10 different instances of random.Random() with a known seed.
"How different do random seeds need to be?" seems to suggest that using seeds 1 to 10 might be a bad idea, as the results could be linearly dependent.
If it is a bad practice to pick seeds 1 to 10 for this case, what would be a good strategy for selecting seeds in a repeatable manner?
Clarification: it is important that always the same seeds are used when the program is run (with a specific dataset)! In the end my program must be deterministic.
Using random.org, I generated 10 random numbers from 2**0 to 2**28, giving as seeds:
187372311
204110176
129995678
6155814
22612812
61168821
21228945
146764631
94412880
117623077
Using a linear sequence of seeds can be problematic as noted in the comments. The numbers from random.org:
[...] come from atmospheric noise, which for many purposes is better than the pseudo-random number algorithms typically used in computer programs.
In order to compete for reputation, I'll put my comments here: ;)
How about just using some random numbers from an unknown seed to generate your seed values?
Write down your numbers, and keep them for reproducibility.
If you need true random data, try http://www.random.org, they also have sequence and distribution generators.
EDIT
It depends on what kind of random numbers you want.
You want to seed a PRNG, so you probably (pun not intended) want uniformly distributed numbers from the complete range that your PRNG is able to accept as a seed. Then just generate them, and write them down somewhere. If you want to read something about PRNGs, C++11 has some good ones in its library: http://en.cppreference.com/w/cpp/header/random.
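In Python, for example, you could draw the seeds once from the operating system's entropy pool and then hard-code them for repeatability (the 32-bit width here is an arbitrary choice; use whatever range your PRNG accepts):

import secrets

# Generate once, print, then paste the resulting list into your experiment code
seeds = [secrets.randbits(32) for _ in range(10)]
print(seeds)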
HTH
I would like to compare different methods of finding roots of functions in Python (like Newton's method or other simple calculus-based methods). I don't think I will have too much trouble writing the algorithms.
What would be a good way to make the actual comparison? I read up a little bit about Big-O. Would this be the way to go?
The answer from @sarnold is right -- it doesn't make sense to do a Big-O analysis.
The principal differences between root finding algorithms are:
rate of convergence (number of iterations)
computational effort per iteration
what is required as input (i.e. do you need to know the first derivative, do you need to set lo/hi limits for bisection, etc.)
what functions it works well on (i.e. works fine on polynomials but fails on functions with poles)
what assumptions does it make about the function (i.e. a continuous first derivative or being analytic, etc)
how simple the method is to implement
I think you will find that each of the methods has some good qualities, some bad qualities, and a set of situations where it is the most appropriate choice.
Big O notation is ideal for expressing the asymptotic behavior of algorithms as the inputs to the algorithms "increase". This is probably not a great measure for root finding algorithms.
Instead, I would think the number of iterations required to bring the actual error below some epsilon ε would be a better measure. Another measure would be the number of iterations required to bring the difference between successive iterations below some epsilon ε. (The difference between successive iterations is probably a better choice if you don't have exact root values at hand for your inputs. You would use a criterion such as successive differences to know when to terminate your root finders in practice, so you could or should use it here, too.)
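A small sketch of that successive-difference measure; iterations_to_tolerance and newton_step are just illustrative names, and f(x) = x**2 - 2 is used because its root (sqrt(2)) is known:

def iterations_to_tolerance(step, x0, eps=1e-12, max_iter=100):
    """Count iterations until successive iterates differ by less than eps."""
    x = x0
    for i in range(1, max_iter + 1):
        x_new = step(x)
        if abs(x_new - x) < eps:
            return i, x_new
        x = x_new
    return max_iter, x

# One Newton step for f(x) = x**2 - 2, whose root is sqrt(2)
newton_step = lambda x: x - (x * x - 2) / (2 * x)
print(iterations_to_tolerance(newton_step, x0=1.0))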
While you can characterize the number of iterations required for different algorithms by the ratios between them (one algorithm may take roughly ten times more iterations to reach the same precision as another), there often isn't "growth" in the iterations as inputs change.
Of course, if your algorithms take more iterations with "larger" inputs, then Big O notation makes sense.
Big-O notation is designed to describe how an algorithm behaves in the limit, as n goes to infinity. This is a much easier thing to work with in a theoretical study than in a practical experiment. I would pick things to study that you can easily measure and that people care about, such as accuracy and the computer resources (time/memory) consumed.
When you write and run a computer program to compare two algorithms, you are performing a scientific experiment, just like somebody who measures the speed of light, or somebody who compares the death rates of smokers and non-smokers, and many of the same factors apply.
Try to choose an example problem or problems to solve that are representative, or at least interesting to you, because your results may not generalise to situations you have not actually tested. You may be able to increase the range of situations to which your results apply if you sample at random from a large set of possible problems and find that all your random samples behave in much the same way, or at least follow much the same trend. You can get unexpected results even when theoretical studies show that there should be a nice n log n trend, because theoretical studies rarely account for suddenly running out of cache, or out of memory, or usually even for things like integer overflow.
Be alert for sources of error, and try to minimise them, or have them apply to the same extent to all the things you are comparing. Of course you want to use exactly the same input data for all of the algorithms you are testing. Make multiple runs of each algorithm, and check how variable things are - perhaps a few runs are slower because the computer was doing something else at the time. Be aware that caching may make later runs of an algorithm faster, especially if you run them immediately after each other. Which time you want depends on what you decide you are measuring. If you have a lot of I/O to do, remember that modern operating systems and computers cache huge amounts of disk I/O in memory. I once ended up powering the computer off and on again after every run, as the only way I could find to be sure that the device I/O cache was flushed.
You can get wildly different answers for the same problem just by changing starting points. Pick an initial guess that's close to the root and Newton's method will give you a result that converges quadratically. Choose another in a different part of the problem space and the root finder will diverge wildly.
What does this say about the algorithm? Good or bad?
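A concrete (and classic) illustration of that sensitivity, using a hand-rolled Newton iteration on f(x) = x**3 - 2*x + 2; the function and starting points are chosen purely for demonstration:

def newton(f, fprime, x0, steps=20):
    x = x0
    for _ in range(steps):
        x = x - f(x) / fprime(x)
    return x

f = lambda x: x**3 - 2*x + 2
fp = lambda x: 3*x**2 - 2

print(newton(f, fp, -2.0))   # converges to the real root near -1.7693
print(newton(f, fp, 0.0))    # oscillates forever between 0 and 1, never converging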
I would suggest you have a look at the following Python root-finding demo.
It is a simple code, with some different methods and comparisons between them (in terms of the rate of convergence).
http://www.math-cs.gordon.edu/courses/mat342/python/findroot.py
I just finished a project comparing the bisection, Newton, and secant root-finding methods. Since this is a practical case, I don't think you need to use Big-O notation; Big-O notation is more suitable for an asymptotic view. What you can do is compare them in terms of:
Speed - for example, here Newton is the fastest if the right conditions are met
Number of iterations - for example, here bisection takes the most iterations
Accuracy - how often it converges to the right root if there is more than one root, or whether it converges at all
Input - what information does it need to get started? For example, Newton needs an x0 near the root in order to converge; it also needs the first derivative, which is not always easy to find
Other - rounding errors
For the sake of visualization you can store the value of each iteration in arrays and plot them. Use a function whose roots you already know; a small sketch of this is below.
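A minimal version of that idea, recording the error of each iterate for bisection and Newton on f(x) = x**2 - 2 (whose root, sqrt(2), is known exactly), ready to be handed to any plotting library:

import math

f = lambda x: x * x - 2
root = math.sqrt(2)

# Bisection on [0, 2]: record the error of each midpoint
lo, hi, bisection_errors = 0.0, 2.0, []
for _ in range(20):
    mid = (lo + hi) / 2
    bisection_errors.append(abs(mid - root))
    if f(lo) * f(mid) <= 0:
        hi = mid
    else:
        lo = mid

# Newton from x0 = 1: record the error of each iterate
x, newton_errors = 1.0, []
for _ in range(6):
    x = x - f(x) / (2 * x)
    newton_errors.append(abs(x - root))

print(bisection_errors)   # error roughly halves each step (linear convergence)
print(newton_errors)      # error roughly squares each step (quadratic convergence)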
Although this is a very old post, my 2 cents :)
Once you've decided which algorithmic method to use to compare them (your "evaluation protocol", so to say), then you might be interested in ways to run your challengers on actual datasets.
This tutorial explains how to do it, based on an example (comparing polynomial fitting algorithms on several datasets).
(I'm the author, feel free to provide feedback on the github page!)