I have a project with different modules. Then I have a file called Main.py which has some code that calls these modules during the run. In the file Main.py I set random seed using:
random.seed(2)
The output that I get from different runs is not identical even if I use the same random seed. Could you tell me why this might be happening? The various modules in my class use random.uniform, random.choice, random.sample functions. In one place, I also define rnduniform = random.uniform and use that.
Any help about how to solve this problem (i.e., be able to replicate the result by setting random seed) and help me understand this would be greatly appreciated.
Thank you.
EDIT: Solved. My error.
Sorry for wasting your time. I looked more carefully at the code and one of the functions that uses random number generation was called in an init method of one of the classes. The init method was accessed before the seed was set. I tried to delete the post but I could not. Hence, this edit.
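A minimal sketch of the mistake described in this edit, assuming a hypothetical Agent class whose constructor draws a random number. The names Agent and run are illustrative only and are not taken from the original project:

```python
import random

class Agent:
    """Hypothetical stand-in for a class that draws a random number on construction."""
    def __init__(self):
        # This draw consumes a number from the shared module-level generator.
        self.start = random.uniform(0, 1)

def run(seed_first):
    """Return the Agent's starting value, seeding either before or after construction."""
    if seed_first:
        random.seed(2)       # seed first: the draw in __init__ is reproducible
        agent = Agent()
    else:
        agent = Agent()      # drawn from whatever state the generator was in
        random.seed(2)       # too late to affect the draw above
    return agent.start
```

With seed_first=True the value is the same on every run; with seed_first=False it generally differs between runs, even though the same seed appears in the code.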
Thread safety deals with concurrent programming, in other words, with two different code paths executing at the same time by means of threading. Because something that looks like a single line of code to you as a programmer is usually a multitude of separate actions, a different thread might interfere with whatever variables you are using, or use intermediate calculations. This causes bugs that are very hard to understand, because your code will usually seem completely fine.
In this case, he is saying that your code using random() and other code in a thread that is somehow using the random number generator might conflict and not behave as expected. For example, the numbers might no longer be as mathematically random, or if you initialize with a certain base seed and then expect random() to return a set sequence of values over multiple calls, those numbers may not be the ones you expect. In the very worst case of using non-thread-safe functions, you might end up with harsh exceptions and/or crashes, since the function is not designed to be used in multiple threads at the same time.
Also see the Wikipedia topic on Thread safety.
Can you name situations in which one really needs to use os.fork() instead of multiprocessing?
FULL DISCLOSURE:
When one mentions os.fork(), one typically gets buried in "you don't need to fork, just use multiprocessing" or similar answers and comments. I created this question so it can be linked in response. In fact, I'll share my own answer. But if you know a better way to tackle my example scenario, feel free to drop a line - it'd be very useful!
Let's say that you have a class Simulator. Its purpose is to do a very complicated simulation that you can advance by "chunks" (step by step, or for a certain amount of simulation time) with method Simulator.advance(n). Furthermore, you can change parameters in the simulation between calls to Simulator.advance(n) with method Simulator.change(parameters): they might be, e.g., wind speed, available resources, etc.
The two following statements, as it may very well happen, are also true:
The simulation is computationally costly;
For whatever reason, you have no chance in heaven to serialize an instance of Simulator mid-flight.
Now, say that you need to simulate N different scenarios with identical initial conditions that share a common first call to Simulator.advance(n), before some parameter is changed in a different way in each scenario, and then the simulation is resumed for a brief, differentiated final part.
Since the computation is very costly, you would really like not to run the exact same first part N times. So what you do is run the first part, and then (leveraging the raw copy of the process memory by os.fork() in order not to have to serialize anything) you fork N times, pick a different parameter change in every child process, and then collect the final result from each.
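The fork-and-collect idea above can be sketched as follows. This is only an illustration under stated assumptions: it needs a Unix-like system (os.fork is not available on Windows), the names run_scenarios, shared_state and finish are hypothetical, and only the small final result of each child is pickled and sent through a pipe, never the simulator state itself:

```python
import os
import pickle

def run_scenarios(shared_state, parameter_choices, finish):
    """Fork one child per parameter choice and collect each child's result."""
    children = []
    for params in parameter_choices:
        read_fd, write_fd = os.pipe()
        pid = os.fork()
        if pid == 0:
            # Child: works on its own copy of shared_state; nothing was serialized.
            os.close(read_fd)
            status = 1
            try:
                payload = pickle.dumps(finish(shared_state, params))
                with os.fdopen(write_fd, "wb") as out:
                    out.write(payload)
                status = 0
            finally:
                os._exit(status)  # leave at once, skipping the parent's cleanup handlers
        # Parent: keep only the read end and move on to the next scenario.
        os.close(write_fd)
        children.append((pid, read_fd))

    results = []
    for pid, read_fd in children:
        with os.fdopen(read_fd, "rb") as inp:
            results.append(pickle.loads(inp.read()))
        os.waitpid(pid, 0)
    return results
```

Here finish(shared_state, params) stands for "apply the parameter change and run the final part". A child that fails writes nothing, so in this sketch the parent would see an EOFError when unpickling; real code would want better error reporting.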
We have a very simple program (single-threaded) where we do a bunch of random sample generation. For this we are using several calls of the numpy random functions (like normal or random_sample). Sometimes the result of one random call determines the number of times another random function is called.
Now I want to set a seed at the beginning so that multiple runs of my program yield the same result. For this I'm using an instance of the numpy class RandomState. While the runs agree at the beginning, at some point the results become different, and this is what I'm wondering about.
If I am doing everything correctly, with no concurrency (and therefore a linear sequence of function calls) AND no other random number generator involved, why does it not work?
Okay, David was right. The PRNGs in numpy work correctly. Throughout every minimal example I created, they worked as they are supposed to.
My problem was a different one, but I finally solved it. Never loop over a dictionary within a deterministic algorithm. It seems that Python orders the items arbitrarily when calling the .items() function to get an iterator.
So I am not that disappointed that this was this kind of error, because it is a useful reminder of what to think about when trying to do reproducible simulations.
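One way to keep dictionary iteration from affecting the random sequence is to fix the visiting order explicitly. A small sketch, where the function name assign_noise and its arguments are made up for illustration:

```python
import random

def assign_noise(weights, seed):
    """Add one random draw to each value, visiting the keys in sorted order."""
    rng = random.Random(seed)
    # sorted() fixes the visiting order, so each key always receives the same
    # draw regardless of how the dictionary happened to be built.
    return {key: value + rng.random() for key, value in sorted(weights.items())}
```

Two dictionaries with the same contents but a different construction order then produce identical results for the same seed.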
If reproducibility is very important to you, I'm not sure I'd fully trust any PRNG to always produce the same output given the same seed. You might consider capturing the random numbers in one phase, saving them for reuse; then in a second phase, replay the random numbers you've captured. That's the only way to eliminate the possibility of non-reproducibility -- and it solves your current problem too.
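A rough sketch of the capture-and-replay idea, assuming the program only needs random() style draws. RecordingRandom and its tape attribute are hypothetical names, not part of any library:

```python
import random

class RecordingRandom:
    """Record every draw so that a later run can replay exactly the same numbers."""

    def __init__(self, seed=None, tape=None):
        self._rng = random.Random(seed)
        # When a tape is supplied, values are replayed from it instead of generated.
        self._replay = iter(tape) if tape is not None else None
        self.tape = []

    def random(self):
        if self._replay is not None:
            value = next(self._replay)
        else:
            value = self._rng.random()
        self.tape.append(value)
        return value
```

In the first phase the tape would be written to disk; in the second phase it is loaded and passed as tape, so the replayed run no longer depends on the generator at all.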
I'm writing a game in Python in which the environment is generated randomly. Currently, the game's "save" function works by writing out all parts of the environment which the player has explored. The result is that save files are larger than they need to be—why write random data to disk when you can just generate it again?
What I could use is a random noise function: a function noise such that noise(x) returns a random number, and always the same number whenever it's called with the same value of x. Now, for each point (x,y) in the game's environment, instead of generating a random number using random() and storing the result in env[(x,y)], I can generate a random number using noise((x,y)), throw it away, and generate the same number later.
Not quite sure if I'm stating the obvious, but using some variation of a Perlin noise generator is a common way to do this. This post is a nice description of doing exactly this (as mentioned in the comments, it's not exactly Perlin noise)
For a given position, the Perlin function will return a random value (the position can be 2D, 3D or any dimensionality).
There is a noise module, and this page has an implementation of it
There's a similar thread on gamedev.SE
First, if you need it to be true that noise(x) would always return the same value for the same x, no matter what, even if it's never been called, then you can't really use randomness at all. A good hash function is the only possibility.
However, if you just need to be able to restore a previous state consisting of the values for all of the previously-explored points (never-explored points may turn out different after save and load than if you hadn't quit… but how can anyone tell without access to multiple universes?), and you don't want to store all of those points, then it might be reasonable to regenerate them.
But let's back up a step. You want something that acts like a hash function. Is there a hash function you can use?
I'd imagine the algorithms in hashlib are too slow (md5 is probably the fastest, but test them all), but I wouldn't reject them without actually testing.
It's possible that the "random period" of zlib.adler32 (or zlib.crc32) is too short, but I wouldn't reject it (except maybe hash) without thinking through whether it's good enough. For that matter, even hash plus a decent fixed-size blender function might be good enough (at least on a 64-bit system).
Python doesn't come with anything "between" md5 and adler32 out of the box. But you can find PyPI modules or source recipes for hundreds of other hash algorithms. For that matter, if you're familiar with any particular hash algorithm that sounds good, most of them are trivial: you could probably code up, e.g., an FNV hash with xor-folding in less time than it takes you to look through the alternatives.
Also, keep in mind that you can generate a bunch of random bytes at "new game" time, store that in the save file, and use it as salt to your hash function.
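As a sketch of the salted-hash approach, assuming integer coordinates and using hashlib.sha256 (any other hashlib algorithm, such as md5, could be swapped in if it tests faster). The signature noise(x, y, salt) is an assumption made for this example:

```python
import hashlib
import struct

def noise(x, y, salt=b""):
    """Map integer coordinates to a float in [0, 1), identical on every call."""
    # Pack the coordinates as fixed-width signed 64-bit integers so that
    # different points can never produce the same byte string.
    data = salt + struct.pack("<qq", x, y)
    digest = hashlib.sha256(data).digest()
    # Read the first 8 bytes as an unsigned integer and keep its top 53 bits,
    # which fit exactly into a float.
    (value,) = struct.unpack("<Q", digest[:8])
    return (value >> 11) / 2**53
```

The salt would be generated once at "new game" time (for example with os.urandom(16)) and stored in the save file, so every world gets its own noise field.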
If you've exhausted the possibilities and you really do need more randomness than a fast-enough hash function with arbitrary salt can give you alone, then:
It sounds like you'll already need to store a list of the points the user has explored (because how else do you know which points you need to restore?). And the order doesn't really matter. So, you can store them in the order of exploration. That means you can regenerate the values deterministically (just by iterating the list). Which means you can use the suggestion by @delnan on your own answer.
However, seed is not the way to do that. It isn't guaranteed to put the RNG into the same state each time across runs, Python versions, machines, etc. For that, you need setstate:
To save, call random.getstate(), and pickle and stash the result.
To load, read and unpickle the state, and call random.setstate(state).
See the docs for full details.
If you're using a random.Random instance, it's exactly the same, except of course that you have to construct a random.Random before you can call setstate on it.
This is guaranteed to work between runs of your program, across machines, etc. Even with a newer version of Python. However, it's not guaranteed to work with an older version of Python. (That is, if the user saves a game with Python 2.6, then tries to load it with 2.5, the state will not be compatible. I believe the only problems come with 2.6->older and 2.3->older, but of course there's no guarantee there won't be additional ones in the future.) I'd suggest stashing the Python version, and if they've downgraded, show a warning saying "This save file requires Python 2.6 or later. You have Python 2.5. The load may fail. Continue anyway?"
This is only guaranteed for random.Random and for the random module itself (since the top-level module functions just use a hidden random.Random). In particular, random.SystemRandom is explicitly documented not to work.
Practically speaking, you can also just pickle a random.Random directly, because the state gets pickled in. It seems like that ought to work, or what would be the sense of pickling a Random object? And it definitely does work. But it isn't actually documented to work, so I'd stick with pickling the getstate, for safety.
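The save and load steps described above might look like this. It is a sketch: save_rng and load_rng are hypothetical helper names, and the bytes they exchange would live somewhere in the save file:

```python
import pickle
import random

def save_rng(rng):
    """Return the generator's state as bytes suitable for a save file."""
    return pickle.dumps(rng.getstate())

def load_rng(blob):
    """Build a new generator that continues exactly where the saved one stopped."""
    rng = random.Random()
    rng.setstate(pickle.loads(blob))
    return rng
```

For the module-level generator, the same pattern applies with random.getstate() and random.setstate(state).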
One possible implementation of noise is this:
import random

def noise(point):
    # Seed with a string: since Python 3.11, seed() rejects tuples such as (x, y).
    gen = random.Random()
    gen.seed(repr(point))
    return gen.random()
I don't know how fast Random.seed() is, though. In addition, Random may change from one version of Python to the next, causing the players of my game to find that the environment changes when they upgrade.
I have a rather big program, where I use functions from the random module in different files. I would like to be able to set the random seed once, at one place, to make the program always return the same results. Can that even be achieved in python?
The main Python module that is run should import random and call random.seed(n). The seeded generator is shared between all other imports of random, as long as nothing else resets the seed.
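A small sketch of why one call is enough: the module-level functions are all methods of one hidden generator, so aliases and from-imports share its state. The names rnduniform and draws are illustrative:

```python
import random
from random import uniform      # bound to the same hidden generator as random.uniform

rnduniform = random.uniform      # an alias, not a new generator

def draws():
    """Seed once, then draw through three different names."""
    random.seed(2)
    return [random.uniform(0, 1), uniform(0, 1), rnduniform(0, 1)]
```

Calling draws() twice gives the same three numbers, and they match three consecutive draws from random.Random(2).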
zss's comment should be highlighted as an actual answer:
Another thing for people to be careful of: if you're using
numpy.random, then you need to use numpy.random.seed() to set the
seed. Using random.seed() will not set the seed for random numbers
generated from numpy.random. This confused me for a while. -zss
In the beginning of your application call random.seed(x) making sure x is always the same. This will ensure the sequence of pseudo random numbers will be the same during each run of the application.
Jon Clements pretty much answers my question. However it wasn't the real problem:
It turns out that the reason for my code's randomness was the numpy.linalg SVD, because it does not always produce the same results for badly conditioned matrices!
So be sure to check for that in your code, if you have the same problems!
Building on previous answers: be aware that many constructs can diverge execution paths, even when all seeds are controlled.
I was thinking "well I set my seeds so they're always the same, and I have no changing/external dependencies, therefore the execution path of my code should always be the same", but that's wrong.
The example that bit me was list(set(...)), where the resulting order may differ.
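To illustrate the list(set(...)) pitfall: string hashes are randomized per interpreter process by default, so the iteration order of a set of strings can change from one run to the next. A minimal sketch of the usual workaround, with pick as a made-up helper name:

```python
import random

def pick(names, seed):
    """Choose one element of a set reproducibly."""
    rng = random.Random(seed)
    # sorted() yields a fixed order; list(names) could differ between runs.
    return rng.choice(sorted(names))
```

The same seed then selects the same element regardless of how the set was built or which process built it.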
One important caveat is that in Python versions earlier than 3.7, the iteration order of dictionary keys is not guaranteed. This can inject randomness into the program, or change the order in which the random numbers are drawn, and therefore produce non-reproducible results. Conclusion: update Python.
I was also puzzled by this question when reproducing a deep learning project, so I ran a toy experiment and am sharing the results with you.
I created two files in a project, named test1.py and test2.py. In test1, I set random.seed(10) for the random module and print 10 random numbers, several times over. As you can verify, the results are always the same.
What about test2? I do the same thing, except that I don't set the seed for the random module. The results are different every time. However, as long as I import test1, even without using it, the results are the same as in test1.
So the experiment leads to the conclusion that if you want to set the seed for all files in a project, you need to import the file/module that defines and sets the seed.
According to Jon's answer, setting random.seed(n) at the beginning of the main program will set the seed globally. Afterward, seeds for the imported libraries can be drawn from the seeded generator, for example with random.getrandbits(32). (An expression such as int(abs(math.log(random.random()))) also runs, but it evaluates to 0 or 1 for most draws, so the derived seeds would rarely differ.) For example,
rng = np.random.default_rng(random.getrandbits(32))
tf.random.set_seed(random.getrandbits(32))
You can guarantee this pretty easily by using your own random number generator.
Just pick three largish primes (assuming this isn't a cryptography application), and plug them into a, b and c:
a = ((a * b) % c)
This gives a feedback system that produces pretty random data. Note that not all primes work equally well, but if you're just doing a simulation, it shouldn't matter - all you really need for most simulations is a jumble of numbers with a pattern (pseudo-random, remember) complex enough that it doesn't match up in some way with your application.
Knuth talks about this.
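A sketch of such a multiplicative feedback generator. Instead of arbitrary primes, it swaps in the well-known "minimal standard" Lehmer parameters (multiplier 16807, modulus 2**31 - 1), which is a choice made for this example and not something the answer above specifies; the class name Lehmer is likewise illustrative:

```python
class Lehmer:
    """Multiplicative congruential generator: state = (state * a) % m."""

    MULTIPLIER = 16807
    MODULUS = 2**31 - 1

    def __init__(self, seed=1):
        # The state must lie in 1..MODULUS-1; a state of zero would stay zero forever.
        self.state = seed % self.MODULUS or 1

    def next_int(self):
        """Advance the generator and return the new integer state."""
        self.state = (self.state * self.MULTIPLIER) % self.MODULUS
        return self.state

    def random(self):
        """Return a float strictly between 0 and 1."""
        return self.next_int() / self.MODULUS
```

Because the whole generator is a few lines of integer arithmetic, its output for a given seed does not depend on the Python version or on any library.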
I am trying to get reproducible results with the genetic programming code in chapter 11 of "Programming Collective Intelligence" by Toby Segaran. However, simply setting seed "random.seed(55)" does not appear to work, changing the original code "from random import ...." to "import random" doesn't help, nor does changing Random(). These all seem to do approximately the same thing, the trees start out building the same, then diverge.
In reading various entries about the behavior of random, I can find no reason, given his GP code, why this divergence should happen. There doesn't appear to be anything in the code except calls to random, that has any variability that would account for this behavior. My understanding is that calling random.seed() should set all the calls correctly and since the code isn't threaded at all, I'm not sure how or why the divergence is happening.
Has anyone modified this code to behave reproducibly? Is there some form of calling random.seed() that may work better?
I apologize for not posting an example, but the code is obviously not mine (I'm adding only the call to seed and changing how random is called in the code) and this doesn't appear to be a simple issue with random (I've read all the entries on Python random here and many on the web in general).
Thanks.
Mark L.
I had the same problem just now with some completely unrelated code. I believe my solution was similar to that in eryksun's answer, though I didn't have any trees. What I did have were some sets, and I was doing random.choice(list(set)) to pick values from them. Sometimes my results (the items picked) were diverging even with the same seed each time and I was close to pulling my hair out. After seeing eryksun's answer here I tried random.choice(sorted(set)) instead, and the problem appears to have disappeared. I don't know enough about the inner workings of Python to explain it.
This may help, to create a random object that won't be interfered with from elsewhere:
from random import Random
random = Random(55)
# random can be used like the plain module
If other libraries are calling random.seed for any reason, they won't affect the random object you've created for your program.
I added the following function to gp.py, changing nothing else:
def set_seed(n):
    import random
    random.seed(n)
I'm using the module based on the example on page 267 (Google books). I can confirm that I get divergent results for the following trial:
>>> import gp
>>> gp.set_seed(55)
>>> rf = gp.getrankfunction(gp.buildhiddenset())
>>> gp.evolve(2, 500, rf, mutationrate=0.2, breedingrate=0.1, pexp=0.7, pnew=0.1)
It starts to diverge as early as the 4th value printed. I'm restarting the interpreter between trials, so it's not any prior state that's causing the problem.
Edit:
I found the random element. It's the memory address of the trees. The rank function sorts the list of results, for which each item is a tuple of the score and tree. Between runs the addresses change, and so the relative sort order of trees with equal score isn't a constant. Fortunately Python has a stable sort, so the fix is simple enough. Just use a sort key to sort based on only the score:
def getrankfunction(dataset):
    def rankfunction(population):
        scores = [(scorefunction(t, dataset), t) for t in population]
        scores.sort(key=lambda x: x[0])
        return scores
    return rankfunction
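A self-contained sketch of the same fix outside of gp.py, using a hypothetical Tree class and rank function. Sorting on the score alone means tied items keep their population order, since the sort is stable; in Python 3, sorting the bare tuples would instead raise a TypeError on a tie, because plain objects cannot be ordered:

```python
class Tree:
    """Hypothetical stand-in for a program tree; it defines no ordering of its own."""
    def __init__(self, name):
        self.name = name

def rank(population, score):
    """Return (score, tree) pairs sorted by score only; ties keep population order."""
    scored = [(score(tree), tree) for tree in population]
    scored.sort(key=lambda pair: pair[0])
    return scored
```

For example, ranking trees a, b, c where only b has a lower score yields b first, followed by a and c in their original order on every run.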