My question is the exact opposite of this one.
This is an excerpt from my test file
f1 = open('seed1234','r')
f2 = open('seed7883','r')
s1 = eval(f1.read())
s2 = eval(f2.read())
f1.close()
f2.close()
####
test_sampler1.random_inst.setstate(s1)
out1 = test_sampler1.run()
self.assertEqual(out1,self.out1_regress) # this is fine and passes
test_sampler2.random_inst.setstate(s2)
out2 = test_sampler2.run()
self.assertEqual(out2,self.out2_regress) # this FAILS
Some info -
test_sampler1 and test_sampler2 are 2 object from a class that performs some stochastic sampling. The class has an attribute random_inst which is an object of type random.Random(). The file seed1234 contains a TestSampler's random_inst's state as returned by random.getstate() when it was given a seed of 1234 and you can guess what seed7883 is. What I did was I created a TestSampler in the terminal, gave it a random seed of 1234, acquired the state with rand_inst.getstate() and save it to a file. I then recreate the regression test and I always get the same output.
HOWEVER
The same procedure as above doesn't work for test_sampler2 - whatever I do not get the same random sequence of numbers. I am using python's random module and I am not importing it anywhere else, but I do use numpy in some places (but not numpy.random).
The only difference between test_sampler1 and test_sampler2 is that they are created from 2 different files. I know this is a big deal and it is totally dependent on the code I wrote but I also can't simply paste ~800 lines of code here, I am merely looking for some general idea of what I might be messing up...
What might be scrambling the state of test_sampler2's random number generator?
Solution
There were 2 separate issues with my code:
1
My script is a command line script and after I refactored it to use python's optparse library I found out that I was setting the seed for my sampler using something like seed = sys.argv[1] which meant that I was setting the seed to be a str, not an int - seed can take any hashable object and I found it the hard way. This explains why I would get 2 different sequences if I used the same seed - one if I run my script from the command line with sth like python sample 1234 #seed is 1234 and from my unit_tests.py file when I would create an object instance like test_sampler1 = TestSampler(seed=1234).
2
I have a function for discrete distribution sampling which I borrowed from here (look at the accepted answer). The code there was missing something fundamental: it was still non-deterministic in the sense that if you give it the same values and probabilities array, but transformed by a permutation (say values ['a','b'] and probs [0.1,0.9] and values ['b','a'] and probabilities [0.9,0.1]) and the seed is set and you will get the same random sample, say 0.3, by the PRNG, but since the intervals for your probabilities are different, in one case you'll get a b and in one an a. To fix it, I just zipped the values and probabilities together, sorted by probability and tadaa - I now always get the same probability intervals.
After fixing both issues the code worked as expected i.e. out2 started behaving deterministically.
The only thing (apart from an internal Python bug) that can change the state of a random.Random instance is calling methods on that instance. So the problem lies in something you haven't shown us. Here's a little test program:
from random import Random
r1 = Random()
r2 = Random()
for _ in range(100):
r1.random()
for _ in range(200):
r2.random()
r1state = r1.getstate()
r2state = r2.getstate()
with open("r1state", "w") as f:
print >> f, r1state
with open("r2state", "w") as f:
print >> f, r2state
for _ in range(100):
with open("r1state") as f:
r1.setstate(eval(f.read()))
with open("r2state") as f:
r2.setstate(eval(f.read()))
assert r1state == r1.getstate()
assert r2state == r2.getstate()
I haven't run that all day, but I bet I could and never see a failing assert ;-)
BTW, it's certainly more common to use pickle for this kind of thing, but it's not going to solve your real problem. The problem is not in getting or setting the state. The problem is that something you haven't yet found is calling methods on your random.Random instance(s).
While it's a major pain in the butt to do so, you could try adding print statements to random.py to find out what's doing it. There are cleverer ways to do that, but better to keep it dirt simple so that you don't end up actually debugging the debugging code.
Related
I'm running python code that's similar to:
import numpy
def get_user_group(user, groups):
if not user.group_id:
user.group_id = assign(groups)
return user.group_id
def assign(groups):
for group in groups:
ids.append(group.id)
percentages.append(group.percentage) # e.g. .33
assignment = numpy.random.choice(ids, p=percentages)
return assignment
We are running this in the wild against tens of thousands of users. I've noticed that the assignments do not respect the actual group percentages. E.G. if our percentages are [.9, .1] we've noticed a consistent hour over hour split of 80% and 20%. We've confirmed the inputs of the choice function are correct and mismatch from actual behavior.
Does anyone have a clue why this could be happening? Is it because we are using the global numpy? Some groups will be split between [.9, .1] while others are [.33,.34,.33] etc. Is it possible that different sets of groups are interfering with each other?
We are running this code in a python flask web application on a number of nodes.
Any recommendations on how to get reliable "random" weighted choice?
This comment exhausted the limitations of a comment, hence I post it here.
The fact that your team was not able to reproduce the problem but got proper results is a sign that most probably NumPy can suit your needs. You can benefit from NumPy later, when you need efficiency, and it can be seen that efficiency is not your concern now.
A more complete code and infrastructure setup on your nodes would be helpful though. How often do you restart your Flask server? Where do you initialize the NumPy random generator? Consider the following code that creates a page /random which can be customized with size, e.g: localhost:5000/random?size=20:
from flask import Flask, request
import numpy
import pandas
... # your webapp
numpy.random.seed(0)
#app.route('/random', methods=['GET'])
def random():
"""Gives the desired number of random numbers
with the state of the random number generator.
"""
# DON'T PUT numpy.random.seed(0) HERE
size = request.args.get('size')
if size is not None:
size = int(size)
else:
size = 1
state = numpy.random.get_state()
data = numpy.random.random(size=size)
table = pandas.DataFrame(data=data)
return table.to_html() + repr(state)
In this example, the state is initialized once after the Flask app is started. Whenever the /random page is requested, good random numbers are generated.
If you put the state initialization inside the function, it would surely cause unexpected distributions, bc you'll get the same random numbers (and same choices).
If you use multiple nodes and initialize with the same seed, your different nodes will produce the same choice again. In this case, use the unique node ids as seed values. If you restart the servers often, concatenate the restart ID or timestamp to the unique node ID. It is also a good idea to ensure that the timestamp is logged.
I have such code and use Jupyter-Notebook
for j in range(timesteps):
a_int = np.random.randint(largest_number/2) # int version
and i get random numbers, but when i try to move part of code to the functions, i start to receive same number in each iteration
def create_train_data():
np.random.seed(seed=int(time.time()))
a_int = np.random.randint(largest_number/2) # int version
return a
for j in range(timesteps):
c = create_train_data()
Why it's happend and how to fix it? i think maybe it because of processes in Jupyter-Notebook
The offending line of code is
np.random.seed(seed=int(time.time()))
Since you're executing in a loop that completes fairly quickly, calling int() on the time reduces your random seed to the same number for the entire loop. If you really want to manually set the seed, the following is a more robust approach.
def create_train_data():
a_int = np.random.randint(largest_number/2) # int version
return a
np.random.seed(seed=int(time.time()))
for j in range(timesteps):
c = create_train_data()
Note how the seed is being created once and then used for the entire loop, so that every time a random integer is called the seed changes without being reset.
Note that numpy already takes care of a pseudo-random seed. You're not gaining more random results by using it. A common reason for manually setting the seed is to ensure reproducibility. You set the seed at the start of your program (top of your notebook) to some fixed integer (I see 42 in a lot of tutorials), and then all the calculations follow from that seed. If somebody wants to verify your results, the stochasticity of the algorithms can't be a confounding factor.
The other answers are correct in saying that it is because of the seed. If you look at the Documentation From SciPy you will see that seeds are used to create a predictable random sequence. However, I think the following answer from another question regarding seeds gives a better overview of what it does and why/where to use it.
What does numpy.random.seed(0) do?
Hans Musgrave's answer is great if you are happy with pseudo-random numbers. Pseudo-random numbers are good for most applications but they are problematic if used for cryptography.
The standard approach for getting one truly random number is seeding the random number generator with the system time before pulling the number, like you tried. However, as Hans Musgrave pointed out, if you cast the time to int, you get the time in seconds which will most likely be the same throughout the loop. The correct solution to seed the RNG with a time is:
def create_train_data():
np.random.seed()
a_int = np.random.randint(largest_number/2) # int version
return a
This works because Numpy already uses the computer clock or another source of randomness for the seed if you pass no arguments (or None) to np.random.seed:
Parameters: seed : {None, int, array_like}, optional Random seed used
to initialize the pseudo-random number generator. Can be any integer
between 0 and 2**32 - 1 inclusive, an array (or other sequence) of
such integers, or None (the default). If seed is None, then
RandomState will try to read data from /dev/urandom (or the Windows
analogue) if available or seed from the clock otherwise.
It all depends on your application though. Do note the warning in the docs:
Warning The pseudo-random generators of this module should not be used
for security purposes. For security or cryptographic uses, see the
secrets module.
The use of Python symbolic computation module "Sympy" in a simulation is very difficult, I need to have reliable fixed inputs, for that I use the seed() in the random module.
However every time I call a simple sympy function, it seems to overwrites the seed with a new value, thus getting new output every time. I have searched a little bit and found this. But neither of them has a solution.
Consider this code:
from sympy import *
import random
random.seed(1)
for _ in range(2):
x = symbols('x')
equ = (x** random.randint(1,5)) ** Rational(random.randint(1,5)/2)
print(equ)
This outputs
(x**2)**(5/2)
x**4
on the first run, and
(x**2)**(5/2)
(x**5)**(3/2)
On the second run, and every-time I run the script it returns new output. I need a way to fix this to enforce the use of seed().
Does this help? From the docs on random:
"You can instantiate your own instances of Random to get generators that don’t share state"
Usage:
import random
# Create a new pseudo random number generator
prng = random.Random()
prng.seed(1)
This number generator will be unaffected by sympy
I ran into this today and can't figure out why. I have several functions chained together that perform some time consuming operations as part of a larger pipeline. I've included these here, pared down to a test example, as best as I could. The issue is that when I call a function directly, I get the expected output (e.g., 5 different trees). However, when I call the same function in a multiprocessing pool with apply_async (or apply, doesn't matter), I get 5 trees, but they are all the same.
I've documented this in an IPython notebook, which can be viewed here: http://nbviewer.ipython.org/gist/cfriedline/0e275d528ff1a8d674c6
In cell 91, I create 5 trees (each with 10 tips), and return two lists. The first containing the non-multiprocessing trees, and the second from apply_async.
In cell 92, you can see the results of creating trees without multiprocessing, and in 93, with multiprocessing.
What I expect is that there would be a total of 10 different trees between the two tests, but instead all of the multiprocessing trees are identical. Makes little sense to me.
Relevant versions of things:
Linux 2.6.18-238.12.1.el5 x86_64 GNU/Linux
Python 2.7.6 :: Anaconda 1.9.2 (64-bit)
IPython 2.0.0
Rpy2 2.3.9
Thanks!
Chris
I solved this one, with a point in the right direction from #mgilson. In fact, it was a random number problem, just not in python - in R (sigh). The state of R is copied when the Pool is created, meaning so is its random seed. To fix, just a little rpy2 as below calling R's set.seed function (with some process specific stuff for good measure):
def create_tree(num_tips, type):
"""
creates the taxa tree in R
#param num_tips: number of taxa to create
#param type: type for naming (e.g., 'taxa')
#return: a dendropy Tree
#rtype: dendropy.Tree
"""
r = rpy2.robjects.r
set_seed = r('set.seed')
set_seed(int((time.time()+os.getpid()*1000)))
rpy2.robjects.globalenv['numtips'] = num_tips
rpy2.robjects.globalenv['treetype'] = type
name = _get_random_string(20)
if type == "T":
r("%s = rtree(numtips, rooted=T, tip.label=paste(treetype, seq(1:(numtips)), sep=''))" % name)
else:
r("%s = rtree(numtips, rooted=F, tip.label=paste(treetype, seq(1:(numtips)), sep=''))" % name)
tree = r[name]
return ape_to_dendropy(tree)
I'm not 100% familiar with these libraries, however, on Linux, (IIRC) multiprocessing uses os.fork. This means that the state of the random module (which you're using) will also be forked and that each of your processes will generate the same sequence of random numbers resulting in a not-so-random _get_random_string function.
If I'm right, and you make the pool smaller than the number of trees that you want, you should see that you get groups of N identical trees (where N is the number of pools).
I think that probably the ideal solution is to re-seed the random number generator inside of each of the processes. It's unlikely that they'll run at exactly the same time, so you should get differing results.
I am developing an application for color blind people to enable them smoothly surf the Internet. I have a set of colors, lets say A , which consists of all the colors seen by a color blind person. Set A is calculated using a big calculation involving millions of colors. Set A is independent of inputs taken in my application i.e set A is like a 'constant' to me (just like 'pi' in mathematics). Now I want to store set A so that whenever I run my application, it is available without any added computational cost i.e i don't have to calculate A every time I run my application.
My Try:
I think this can be done by building a class having one constant but can it be done without creating any special class for just a constant?
I am using Python!
No need for a class. You want to store the calculated values on disk and load them back again on startup: for that you will want to look into the shelve or pickle libraries.
Yes, you can certainly do this with Python
If your constant was just a number -- say, you had just discovered tau -- then you would just declare it in a module, and import that module in all of your other source files:
constants.py:
# Define my new super-useful number
TAU = 6.28318530718
everywhere else:
from constants import TAU # Look, no calculations!
Expanding a bit, if you had a more complicated structure, like a dictionary, that took you a long time to compute, then you could just declare that in your module instead:
constants.py:
# Verified results of the national survey
PEPSI_CHALLENGE = {
'Pepsi': 0.57,
'Coke': 0.43,
}
And you can do this for more and more complicated data. The problem, eventually, is that just writing your constants module gets harder and harder, the more complex your data is, and it can be especially hard to update if you occasionally recompute the value you want to cache. In that case, you want to look at pickling the data, possibly as the final step of a python script which calculates it, and then load that data in a module that you import.
To do that, import pickle, and dump a single object out to a disk file:
recalculate.py:
# Here is the script that computes a small value from the hugely complicated domain:
import random
from itertools import groupby
import pickle
# Collect all of the random numbers
random_numbers = [random.randint(0,10) for r in xrange(1000000)]
# TODO: Check this -- this should definitely be 7
most_popular = max(groupby(sorted(random_numbers)),
key=lambda(x, v):(len(list(v)),-L.index(x)))[0]
# Now save the most common random number to disk, using pickle
# Almost any object is picklable like this, but check the docs for the exact details
pickle.dump(most_popular, open('data_cache','w'))
Now, in your constants file, you can simply read the pickled data from the file on disk, and have it available without recalculating it:
constants.py:
import pickle
most_popular = pickle.load(open('data_cache'))
everywhere else:
from constants import most_popular