How does the random module work in Python (or any comparable language)? Is there any way we can write a random number generator ourselves? What do the inner workings of the random module look like?
Do you think the time complexities vary between functions like randint(), randrange(), and others? Or do they only vary between those functions and methods like random.choice()?
There are many good articles on random number generators. They're not hard to create, but it's very easy to create one that isn't really random. The previous generation used "linear congruential" random number generators, where you just did a multiply, an addition, and a modulo operation. Python today uses the "Mersenne Twister" algorithm, which requires a bit more arithmetic but has a very, very long sequence before repeating.
Almost all of the functions in random just grab the next batch of random bits and then do a simple conversion to produce a random integer (randint) or float (random). choice, it should be clear, just does one array lookup after picking a random index.
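That older linear-congruential style can be sketched in a few lines (the constants below are the Numerical Recipes choices; this is an illustration of the technique, not how CPython's Mersenne Twister works):

```python
class LCG:
    """Minimal linear congruential generator sketch."""

    def __init__(self, seed=12345):
        self.state = seed
        self.a = 1664525        # multiplier
        self.c = 1013904223     # increment
        self.m = 2 ** 32        # modulus

    def next_int(self):
        # The entire generator: a multiply, an addition, and a modulo.
        self.state = (self.a * self.state + self.c) % self.m
        return self.state

    def next_float(self):
        # Convert the raw integer to a float in [0, 1) -- the same kind
        # of final conversion step random.random() performs internally.
        return self.next_int() / self.m

rng = LCG(seed=42)
print([rng.next_int() % 10 for _ in range(5)])  # five pseudo-random digits
```

The same seed always reproduces the same sequence, which is exactly what makes such generators pseudo-random rather than random.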
https://en.wikipedia.org/wiki/Random_number_generation
https://www.howtogeek.com/183051/htg-explains-how-computers-generate-random-numbers/
https://en.wikipedia.org/wiki/Mersenne_Twister
I got curious about the relative speeds of some random integer generating code. I wrote the following to check it out:
from random import random
from random import choice
from random import randint
from math import floor
import time

def main():
    times = 1000000

    startTime = time.time()
    for i in range(times):
        randint(0, 9)
    print(time.time() - startTime)

    startTime = time.time()
    for i in range(times):
        choice([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
    print(time.time() - startTime)

    startTime = time.time()
    for i in range(times):
        floor(10 * random())  # generates random integers in the same range as randint(0, 9)
    print(time.time() - startTime)

main()
The results of one trial of this code were
0.9340872764587402
0.6552846431732178
0.23188304901123047
Even after executing a multiplication and math.floor, the final way of generating integers was by far the fastest. Changing the size of the range from which numbers were generated didn't change anything.
So, why is random so much faster than randint? And is there any reason (besides ease of use, readability, and avoiding mistakes) why one would prefer randint over random (e.g., does randint produce better pseudo-random integers)? If floor(x * random()) doesn't feel readable enough but you want faster code, should you go for a specialized routine?
def myrandint(low, high):
    # Still about 1.6 times slower than floor(10 * random()) above, but
    # almost 2.5 times faster than random.randint.
    # Returns a random integer between low and high, inclusive. Results may
    # not be what you expect if int(low) != low, etc. But the numpty who
    # writes 'randint(1.9, 3.2)' gets what they deserve.
    return floor((high - low + 1) * random()) + low
Before I answer your question (and don't worry, I do get there), take note of the common programmer's idiom:
Premature optimization is the root of all evil.
While this isn't always the case, don't worry about micro-optimizations unless you need them.
This goes double for Python: if you're writing something where speed is critical, you'll usually want to write it in a language that runs faster, like C. You can then write Python bindings for that C code if you want to use Python for the non-critical parts of your application (as is the case with, for example, NumPy).
Instead of focusing on making individual expressions or functions in your code run as fast as possible, focus on the algorithms you use and the overall structure of your code (and on making it readable, but you are already aware of that). Then, when your application starts running slowly, you can profile it to figure out what parts take the most time, and improve only those parts.
The changes will be easier to make to well-structured, readable code and optimizing the actual bottlenecks will generally give a much better speedup-to-time-coding ratio than most micro-optimizations. The time spent wondering which of two expressions runs faster is time you could have spent getting other things done.
As an exception, I'd say learning why one option is faster than the other is sometimes worth the time, because then you can incorporate that more general knowledge into your future programming, letting you make quicker calls without worrying about the details.
But enough about why we shouldn't waste time worrying about speed. Let's talk about speed.
Taking a look at the source of the random module (for CPython 3.7.4), this line from the end of the opening comment provides a short answer:
* The random() method is implemented in C, executes in a single Python step,
and is, therefore, threadsafe.
The first statement is the one that matters most to us: random is a Python binding for a C function, so it runs at the blinding speed of machine code rather than the relatively slow speed of interpreted Python.
randint, on the other hand, is implemented in Python, and suffers a significant speed penalty for it. randint calls randrange, which ensures that the range's bounds (and step size) are integers, that the range isn't empty, and that the step size isn't zero, before calling getrandbits, which is implemented in C.
This alone produces the majority of randint's slowness. However, there is one more variable in play.
Going a little deeper, into the internal function _randbelow, it turns out that the algorithm for getting a random number between 0 and n is very straightforward: it gets the number of bits in n, then generates that many bits at random repeatedly until the resulting number is no greater than n.
On average (across all possible values of n), this has little effect, but comparing the extremes, it is noticeable.
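That rejection loop can be sketched as follows (a simplification of CPython's Random._randbelow, with random.getrandbits standing in for the C-level call):

```python
import random

def randbelow_sketch(n):
    # Draw just enough bits to cover [0, n), then retry until the
    # result actually falls below n (rejection sampling).
    k = n.bit_length()
    r = random.getrandbits(k)
    while r >= n:
        r = random.getrandbits(k)
    return r

# Worst case: n just above a power of two, e.g. n = 2**63 + 1, draws
# 64 bits and rejects roughly half the attempts.
# Best case: n just below a power of two, e.g. n = 2**64 - 1, rejects
# only the single value 2**64 - 1, which is vanishingly unlikely.
```

This is why the timings below differ so much between the smallest and largest numbers of a given bit length.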
I wrote a function that tests the impact of that loop. Here are the results:
bits 2 ** (n - 1) (2 ** n) - 1 ratio
64 1.358526759 1.084741422 1.2523968675
128 1.43073282 1.02119227 1.4010415688
256 1.600253063 1.271662798 1.2583941793
512 1.845024581 1.363168823 1.3534820852
1024 2.371779281 1.620392686 1.4637064839
2048 2.98949864 2.01788896 1.48149809
The first column is the number of bits; the second and third are the average time (in microseconds) to find a random integer with that many bits, over 1,000,000 runs. The last column is the ratio of the second and third columns.
You'll notice that the average runtimes for the smallest number with a given bit length are larger than for the largest number with that bit length. This is because of that loop:
When looking for an n-bit number less than the largest n-bit number, a second attempt is needed only if that largest number itself is generated, which is unlikely except for very small n. But to find a number smaller than the smallest n-bit number (2^(n-1), a single 1-bit followed by n-1 zero bits), half of the attempts fail.
Addendum: I removed the tests for bit lengths from 1 to 32 because, upon inspection of the C source for getrandbits, I discovered that it uses a separate, faster, function for those numbers.
We have a very simple single-threaded program in which we do a bunch of random sample generation. For this we use several calls to the numpy random functions (like normal or random_sample). Sometimes the result of one random call determines how many times another random function is called.
Now I want to set a seed at the beginning such that multiple runs of my program yield the same result. For this I'm using an instance of the numpy class RandomState. While the results match in the beginning, at some point they diverge, and this is why I'm wondering.
If I am doing everything correctly, have no concurrency (and thereby a linear sequence of function calls), AND no other random number generator is involved, why does it not work?
Okay, David was right. The PRNGs in numpy work correctly. In every minimal example I created, they behaved as they are supposed to.
My problem was a different one, but I finally solved it: never loop over a dictionary within a deterministic algorithm. It seems that Python may order the items arbitrarily when you call .items() to get an iterator.
So I am not too disappointed that it was this kind of error, because it is a useful reminder of what to think about when trying to do reproducible simulations.
If reproducibility is very important to you, I'm not sure I'd fully trust any PRNG to always produce the same output given the same seed. You might consider capturing the random numbers in one phase, saving them for reuse; then in a second phase, replay the random numbers you've captured. That's the only way to eliminate the possibility of non-reproducibility -- and it solves your current problem too.
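The capture-and-replay idea can be sketched like this (the two-phase split and the next_random helper are illustrative, not a fixed API):

```python
import random

# Phase 1: run the simulation once, recording every draw.
random.seed(123)
captured = [random.random() for _ in range(5)]
# ... persist `captured` (e.g. with json) so later runs can reuse it ...

# Phase 2: replay the recorded draws instead of calling the PRNG,
# removing the PRNG from the reproducibility equation entirely.
replay = iter(captured)

def next_random():
    # Drop-in stand-in for random.random() during reproduction runs.
    return next(replay)

values = [next_random() for _ in range(5)]
print(values == captured)  # True
```

Replay consumes exactly the recorded sequence, so the second phase is reproducible by construction.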
I have a rather big program, where I use functions from the random module in different files. I would like to be able to set the random seed once, at one place, to make the program always return the same results. Can that even be achieved in python?
The main python module that is run should import random and call random.seed(n) - this is shared between all other imports of random as long as somewhere else doesn't reset the seed.
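A minimal sketch of that sharing: random.seed() reseeds the single module-level Random instance, which every `import random` in the program sees.

```python
import random

random.seed(42)
first_run = [random.random() for _ in range(3)]

# Any other module calling random.random() draws from this same shared
# instance; reseeding with the same value restores the whole sequence.
random.seed(42)
second_run = [random.random() for _ in range(3)]

print(first_run == second_run)  # True
```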
zss's comment should be highlighted as an actual answer:
Another thing for people to be careful of: if you're using
numpy.random, then you need to use numpy.random.seed() to set the
seed. Using random.seed() will not set the seed for random numbers
generated from numpy.random. This confused me for a while. -zss
In the beginning of your application call random.seed(x) making sure x is always the same. This will ensure the sequence of pseudo random numbers will be the same during each run of the application.
Jon Clements pretty much answers my question. However it wasn't the real problem:
It turns out that the reason for my code's randomness was the numpy.linalg SVD, because it does not always produce the same results for badly conditioned matrices!
So be sure to check for that in your code, if you have the same problems!
Building on previous answers: be aware that many constructs can diverge execution paths, even when all seeds are controlled.
I was thinking "well I set my seeds so they're always the same, and I have no changing/external dependencies, therefore the execution path of my code should always be the same", but that's wrong.
The example that bit me was list(set(...)), where the resulting order may differ.
One important caveat is that for Python versions earlier than 3.7, dictionary key order is not deterministic. This can lead to randomness in the program, or even a different order in which the random numbers are consumed, and therefore non-deterministic results. Conclusion: update Python.
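Both caveats come down to iteration order: drawing random numbers while iterating a set (or an old-style dict) ties your results to hash ordering, which can differ between runs and interpreter versions. A minimal way to pin the order down, assuming your keys are sortable, is to sort before iterating:

```python
import random

random.seed(0)
data = {"banana", "apple", "cherry"}

# Iterating the set directly would tie the draw order to hash ordering.
# Sorting first makes the consumption order, and thus every draw,
# reproducible across runs.
weights = {key: random.random() for key in sorted(data)}

print(list(weights))  # ['apple', 'banana', 'cherry']
```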
I was also puzzled by this question when reproducing a deep learning project, so I did a toy experiment and will share the results.
I created two files in a project, named test1.py and test2.py. In test1, I set random.seed(10) for the random module and print 10 random numbers several times. As you can verify, the results are always the same.
What about test2? I did the same, except without setting the seed for the random module. The results are different every time. However, as long as I import test1 (even without using it), the results are the same as in test1.
So the experiment leads to the conclusion that if you want to set the seed for all files in a project, you need to import the file/module that defines and sets the seed.
According to Jon's answer, setting random.seed(n) at the beginning of the main program sets the seed globally. Afterwards, to seed the imported libraries, one can derive values from random.random(). For example,
rng = np.random.default_rng(int(abs(math.log(random.random()))))
tf.random.set_seed(int(abs(math.log(random.random()))))
You can guarantee this pretty easily by using your own random number generator.
Just pick three largish primes (assuming this isn't a cryptography application), and plug them into a, b and c:
a = ((a * b) % c)
This gives a feedback system that produces pretty random data. Note that not all primes work equally well, but if you're just doing a simulation, it shouldn't matter - all you really need for most simulations is a jumble of numbers with a pattern (pseudo-random, remember) complex enough that it doesn't match up in some way with your application.
Knuth talks about this.
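A sketch of that feedback generator (the primes below are arbitrary "largish" picks, purely for illustration):

```python
def prime_feedback(seed=104729, mult=1299709, mod=15485863):
    # Three largish primes plugged into a = (a * b) % c, as described.
    # These particular primes are arbitrary illustrative choices, not
    # constants anyone has vetted for statistical quality.
    a = seed
    while True:
        a = (a * mult) % mod
        yield a

gen = prime_feedback()
print([next(gen) % 10 for _ in range(5)])  # a jumble of digits
```

Because the modulus is prime and the state starts nonzero, the state never collapses to zero; the same seed always reproduces the same jumble, as a simulation PRNG should.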
I have a big script in Python. I took inspiration from other people's code, so I ended up using the numpy.random module for some things (for example, creating an array of random numbers drawn from a binomial distribution), while in other places I use the standard random module.
Can someone please tell me the major differences between the two?
Looking at the doc webpage for each of the two it seems to me that numpy.random just has more methods, but I am unclear about how the generation of the random numbers is different.
The reason why I am asking is because I need to seed my main program for debugging purposes. But it doesn't work unless I use the same random number generator in all the modules that I am importing, is this correct?
Also, I read here, in another post, a discussion about NOT using numpy.random.seed(), but I didn't really understand why this was such a bad idea. I would really appreciate it if someone could explain why this is the case.
You have made many correct observations already!
Unless you'd like to seed both of the random generators, it's probably simpler in the long run to choose one generator or the other. But if you do need to use both, then yes, you'll also need to seed them both, because they generate random numbers independently of each other.
For numpy.random.seed(), the main difficulty is that it is not thread-safe: it's not guaranteed to work if two different threads call it at the same time. If you're not using threads, and if you can reasonably expect that you won't need to rewrite your program that way in the future, numpy.random.seed() should be fine. If there's any reason to suspect that you may need threads later, it's much safer in the long run to do as suggested and make a local instance of the numpy.random.RandomState class. As far as I can tell, random.seed() is thread-safe (or at least, I haven't found any evidence to the contrary).
The numpy.random library contains a few extra probability distributions commonly used in scientific research, as well as a couple of convenience functions for generating arrays of random data. The standard random module is a little more lightweight and should be fine if you're not doing scientific research or other kinds of statistical work.
Otherwise, they both use the Mersenne twister sequence to generate their random numbers, and they're both completely deterministic - that is, if you know a few key bits of information, it's possible to predict with absolute certainty what number will come next. For this reason, neither numpy.random nor random.random is suitable for any serious cryptographic uses. But because the sequence is so very very long, both are fine for generating random numbers in cases where you aren't worried about people trying to reverse-engineer your data. This is also the reason for the necessity to seed the random value - if you start in the same place each time, you'll always get the same sequence of random numbers!
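A minimal sketch of seeding both generators independently (using the legacy global-state APIs, random.seed and numpy.random.seed):

```python
import random

import numpy as np

SEED = 12345

# The two generators share no state, so each needs its own seed call.
random.seed(SEED)
np.random.seed(SEED)

stdlib_draws = [random.random() for _ in range(3)]
numpy_draws = np.random.random(3)

# Reseeding both reproduces both sequences.
random.seed(SEED)
np.random.seed(SEED)
assert stdlib_draws == [random.random() for _ in range(3)]
assert (numpy_draws == np.random.random(3)).all()
```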
As a side note, if you do need cryptographic level randomness, you should use the secrets module, or something like Crypto.Random if you're using a Python version earlier than Python 3.6.
From Python for Data Analysis, the module numpy.random supplements the Python random with functions for efficiently generating whole arrays of sample values from many kinds of probability distributions.
By contrast, Python's built-in random module only samples one value at a time, while numpy.random can generate very large samples much faster. Using the IPython magic function %timeit, one can see which module performs faster:
In [1]: from random import normalvariate
In [2]: N = 1000000
In [3]: %timeit samples = [normalvariate(0, 1) for _ in xrange(N)]
1 loop, best of 3: 963 ms per loop
In [4]: %timeit np.random.normal(size=N)
10 loops, best of 3: 38.5 ms per loop
The source of the seed and the distribution profile used are going to affect the outputs. If you are looking for cryptographic randomness, seeding from os.urandom() will get nearly true random bytes from device chatter (i.e. ethernet or disk activity; /dev/random on BSD).
This avoids giving a fixed seed and thereby generating deterministic random numbers. However, the random calls then allow you to fit the numbers to a distribution (what I call scientific randomness): eventually all you want is a bell-curve distribution of random numbers, and numpy is best at delivering this.
So yes, stick with one generator, but decide what kind of random you want: random, but definitely from a distribution curve, or as random as you can get without a quantum device.
It surprised me that the randint(a, b) method exists in both numpy.random and random, but the two have different behaviors for the upper bound.
random.randint(a, b) returns a random integer N such that a <= N <= b. Alias for randrange(a, b+1). It has b inclusive. random documentation
However if you call numpy.random.randint(a, b), it will return low(inclusive) to high(exclusive). Numpy documentation
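The bound difference is easy to demonstrate (seeded here so the draws are repeatable; with 1000 draws, each value set is covered for all practical purposes):

```python
import random

import numpy as np

random.seed(0)
np.random.seed(0)

# random.randint(a, b) includes b; numpy.random.randint(a, b) excludes it.
stdlib_values = {random.randint(1, 3) for _ in range(1000)}
numpy_values = {int(x) for x in np.random.randint(1, 3, size=1000)}

print(sorted(stdlib_values))  # [1, 2, 3] -- upper bound inclusive
print(sorted(numpy_values))   # [1, 2]    -- upper bound exclusive
```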
I have an application that runs a certain experiment 1000 times (multi-threaded, so that multiple experiments run at the same time). Every experiment needs approximately 50,000 random.random() calls.
What is the best approach to make this genuinely random? I could copy a Random object into every experiment and then do a jumpahead of 50,000 * expid. The documentation suggests that jumpahead(1) already scrambles the state, but is that really true?
Or is there another, better way to do this?
(No, the random numbers are not used for security, but for a Metropolis-Hastings algorithm. The only requirement is that the experiments are independent, not that the random sequence is unpredictable.)
I could copy a Random object into every experiment and then do a jumpahead of 50,000 * expid.
Approximately correct. Each thread gets their own Random instance.
Seed all of them to the same seed value. Use a constant to test, use /dev/random when you "run for the record".
Edit. Outside Python and in older implementations, use jumpahead( 50000 * expid ) to avoid the situation where two generators wind up with a parallel sequences of values. In any reasonably current (post 2.3) Python, jumpahead is no longer linear and using expid is sufficient to scramble the state.
You can't simply do jumpahead(1) in each thread, since that will assure they are synchronized. Use jumpahead( expid ) to assure each thread is distinctly scrambled.
The documentation suggests that jumpahead(1) already scrambles the state, but is that really true?
Yes, jumpahead does indeed "scramble" the state. Recall that for a given seed you get one -- long -- but fixed sequence of pseudo-random numbers. You're jumping ahead in this sequence. To pass randomness tests, you must get all your values from this one sequence.
Edit. Once upon a time, jumpahead(1) was limited. Now jumpahead(1) really does a larger scrambling. The scrambling, however, is deterministic. You can't simply do jumpahead(1) in each thread.
If you have multiple generators with different seeds, you violate the "one sequence from one seed" assumption and your numbers aren't going to be as random as if you get them from a single sequence.
If you only jumpahead 1, you may get parallel sequences which might be similar. [This similarity might not be detectable; theoretically, there's a similarity.]
When you jumpahead 50,000, you assure that you follow the 1-sequence-1-seed premise. You also assure that you won't have adjacent sequences of numbers in two experiments.
Finally, you also have repeatability. For a given seed, you get consistent results.
Same jumpahead: Not Good.
>>> y=random.Random( 1 )
>>> z=random.Random( 1 )
>>> y.jumpahead(1)
>>> z.jumpahead(1)
>>> [ y.random() for i in range(5) ]
[0.99510321786951772, 0.92436920169905545, 0.21932404923057958, 0.20867489035315723, 0.91525579001682567]
>>> [ z.random() for i in range(5) ]
[0.99510321786951772, 0.92436920169905545, 0.21932404923057958, 0.20867489035315723, 0.91525579001682567]
You shouldn't use that function. There is no proof that it works correctly with the Mersenne Twister generator. Indeed, it was removed from Python 3 for that reason.
For more information about generation of pseudo-random numbers on parallel environments, see this article from David Hill.
jumpahead(1) is indeed sufficient (and identical to jumpahead(50000) or any other such call, in the current implementation of random -- I believe that came in at the same time as the Mersenne Twister based implementation). So use whatever argument fits in well with your programs' logic. (Do use a separate random.Random instance per thread for thread-safety purposes of course, as your question already hints).
(random module generated numbers are not meant to be cryptographically strong, so it's a good thing that you're not using them for security purposes;-).
Per the random module docs at python.org:
"You can instantiate your own instances of Random to get generators that don’t share state."
And there's also a relevant-looking note on jumpahead, as you mention. But the guarantees there are kind of vague. If the calls to OS-provided randomness aren't so expensive as to dominate your running time, I'd skip all the subtlety and do something like:
randoms = [random.Random(os.urandom(4)) for _ in range(num_expts)]
If num_expts is ~1000, then you're unlikely to have any collisions in your seed (birthday paradox says you need about 65000 experiments before there's a >50% probability that you have a collision). If this isn't good enough for you or if the number of experiments is more like 100k instead of 1k, then I think it's reasonable to follow this up with
for idx, r in enumerate(randoms):
    r.jumpahead(idx)
Note that I don't think it will work to just make your seed longer (os.urandom(8), for example), since the random docs state that the seed must be hashable, and so on a 32-bit platform you're only going to get at most 32 bits (4 bytes) of useful entropy in your seed.
This question piqued my curiosity, so I went and looked at the code implementing the random module. I am definitely not a PRNG expert, but it does seem like slightly differing values of n in jumpahead(n) will lead to markedly different Random instance states. (Always scary to contradict Alex Martelli, but the code does use the value of n when shuffling the random state).