Problem with the tqdm function in a Doc2Vec model - python

I am using this article https://actsusanli.medium.com/ to implement the Doc2Vec model and I have a problem in the training step.
model_dbow.train(utils.shuffle([x for x in tqdm(train_tagged.values)]), total_examples=len(train_tagged.values), epochs = 40)
As you can see, I am using the tqdm function. When I run the code, tqdm reaches 100% after a few minutes, but the algorithm keeps running in the same cell for a long time afterwards.
Do you have any idea whether this is a problem with the tqdm function or something else?

By using the "list comprehension" ([..])...
[x for x in tqdm(train_tagged.values)]
...you are having tqdm iterate once over your train_tagged.values sequence, materializing it into an actual in-memory Python list. This shows the tqdm progress rather quickly – and then tqdm's involvement is completely finished.
Then, you're passing that plain result list (without any tqdm features) into Doc2Vec.train(), where Doc2Vec does its epochs=40 training passes. tqdm is no longer involved, so there'll be no incremental progress-bar output.
You might be tempted to try (or have already tried) something that skips the extra list creation, passing the tqdm-wrapped sequence directly in like:
corpus = utils.shuffle(train_tagged.values)
model_dbow.train(tqdm(corpus), total_examples=len(corpus), epochs = 40)
But this has a different problem: the tqdm-wrapper is only designed to allow (& report the progress of) one iteration over the wrapped sequence. So this will show that one iteration's incremental progress.
But when .train() tries its next necessary 39 re-iterations, to complete its epochs=40 training-runs, the single-pass tqdm object will be exhausted, preventing full & proper training.
Note that there is an option for progress-logging within Gensim, by setting the Python logging level (globally, or just for the class Doc2Vec) to INFO. Doc2Vec will then emit a log-line showing progress, within each epoch and between epochs, about every 1 second. But: you can also make such logging less-frequent by supplying a different seconds value to the optional report_delay argument of .train(), for example report_delay=60 (for a log line every minute instead of every second).
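For example, a minimal sketch of turning on that INFO-level logging (this is just standard Python logging configuration, nothing Gensim-specific) might look like:

import logging

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',
                    level=logging.INFO)

# ...build the model as before, then train with less-frequent log lines:
model_dbow.train(corpus, total_examples=len(corpus), epochs=40, report_delay=60)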
If you really want a progress-bar, it should be possible to use tqdm - but you will have to work around its assumption that the iterable you're wrapping with tqdm() will only be iterated over once.
I believe there'd be two possible approaches, each with different tradeoffs:
(1) Instead of letting .train() repeat the corpus N times, do it yourself - adjusting the other .train() parameters accordingly. Roughly, that'd mean changing a line like...
model.train(corpus, total_examples=len(corpus), epochs=40)
...into something that turns your desired 40 epochs into something that looks like just one iteration to both tqdm & Gensim's .train(), like...
import itertools
from tqdm import tqdm

repeated_corpus = itertools.chain(*[corpus] * 40)
repeated_len = 40 * len(corpus)
model.train(tqdm(repeated_corpus, total=repeated_len), total_examples=repeated_len, epochs=1)
(Note that you now have to give tqdm a hint as to the sequence's length, because the one-time chained-iterator from itertools.chain() doesn't report its own length.)
Then you'll get one progress-bar across the whole training corpus - which the model now sees as one pass over a larger corpus, but which ultimately involves the same 40 passes.
You'll want to reinterpret any remaining log lines with this change in mind, and you'll lose a chance to install your own per-epoch callbacks via the model's end-of-epoch callback mechanism. (But, that's a seldom-used feature, anyway.)
(2) Instead of wrapping the corpus with a single tqdm() (which can only show a progress-bar for one iteration), wrap the corpus as a new fully-re-iterable object that itself will start a new tqdm() each time. For example, something like:
class TqdmEveryIteration(object):
    def __init__(self, inner_iterable):
        self.inner_iterable = inner_iterable

    def __iter__(self):
        return iter(tqdm(self.inner_iterable))
Then, using this new extra tqdm-adding wrapper, you should be able to do:
corpus = utils.shuffle(train_tagged.values)
model_dbow.train(TqdmEveryIteration(corpus), total_examples=len(corpus), epochs = 40)
In this case, you should get one progress bar per epoch, because a new tqdm() wrapper will be started each training pass.
(If you try either of these approaches & they work well, please let me know! They should be roughly correct, but I haven't tested them yet.)
Separately: if the article from the author at actsusanli.medium.com that you're modeling your work on is...
https://towardsdatascience.com/multi-class-text-classification-with-doc2vec-logistic-regression-9da9947b43f4
...note that it's using an overly-complex & fragile anti-pattern, calling .train() multiple times in a loop with manual alpha management. That has problems as described in this other answer. But that approach would also have the side-effect of re-wrapping the corpus each time in a new tqdm (like the TqdmEveryIteration class above), so despite its other issues, it would achieve one actual progress-bar per call to .train().
(I sent the author a private note via Medium about a month ago about this problem.)

Related

TensorFlow: restoring model in a MonitoredSession

I have a model that contains multiple variables including a global step. I've been able to successfully use a MonitoredSession to save checkpoints and summaries every 100 steps. I was expecting the MonitoredSession to automatically restore all my variables when the session is run in multiple passes (based on this documentation); however, this does not happen. If I take a look at the global step after running the training session again, I find that it starts back from zero. Below is a simplified version of my code without the actual model. Let me know if more code is needed to solve this problem.
train_graph = tf.Graph()

with train_graph.as_default():
    # I create some datasets using the Dataset API
    # ...

    global_step = tf.train.create_global_step()

    # Create all the other variables and the model here
    # ...

    saver_hook = tf.train.CheckpointSaverHook(
        checkpoint_dir='checkpoint/',
        save_secs=None,
        save_steps=100,
        saver=tf.train.Saver(),
        checkpoint_basename='model.ckpt',
        scaffold=None)

    summary_hook = tf.train.SummarySaverHook(
        save_steps=100,
        save_secs=None,
        output_dir='summaries/',
        summary_writer=None,
        scaffold=None,
        summary_op=train_step_summary)

    num_steps_hook = tf.train.StopAtStepHook(num_steps=500)  # Just for testing

    with tf.train.MonitoredSession(
            hooks=[saver_hook, summary_hook, num_steps_hook]) as sess:
        while not sess.should_stop():
            step = sess.run(global_step)
            if (step % 100 == 0):
                print(step)
            sess.run(optimizer)
When I run this code the first time, I get this output
0
100
200
300
400
The checkpoint folder at this point has checkpoints for every hundredth step up to 500. If I run the program again I would expect to see the counter start at 500 and then increase up to 900, but instead I just get the same output and all of my checkpoints get overwritten. Any ideas?
Alright, I figured it out. It was actually really simple. First, it's easier to use a MonitoredTrainingSession() instead of a MonitoredSession(). This wrapper session takes 'checkpoint_dir' as an argument. I thought that the saver_hook would take care of restoring, but that's not the case. In order to fix my problem I just had to change the line where I define the session like so:
with tf.train.MonitoredTrainingSession(hooks=[saver_hook, summary_hook], checkpoint_dir='checkpoint'):
It can also be done with the MonitoredSession directly, but you need to set up a session_creator instead.
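For reference, a minimal (untested) sketch of the session_creator route, reusing the hooks defined above, could look like this; tf.train.ChiefSessionCreator is the usual helper that knows how to restore from a checkpoint directory:

session_creator = tf.train.ChiefSessionCreator(checkpoint_dir='checkpoint/')
with tf.train.MonitoredSession(
        session_creator=session_creator,
        hooks=[saver_hook, summary_hook, num_steps_hook]) as sess:
    while not sess.should_stop():
        sess.run(optimizer)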

How does one train multiple models in a single script in TensorFlow when there are GPUs present?

Say I have access to a number of GPUs in a single machine (for the sake of argument assume 8 GPUs, each with a maximum of 8GB of memory, in one single machine with some amount of RAM and disk). I want to run, in one single script on one single machine, a program that evaluates multiple models (say 50 or 200) in TensorFlow, each with a different hyperparameter setting (say, step size, decay rate, batch size, epochs/iterations, etc.). At the end of training, assume we just record its accuracy and get rid of the model (if you want, assume the model is being checkpointed every so often, so it's fine to just throw away the model and start training from scratch). You may also assume some other data may be recorded, like the specific hyperparameters and the train/validation/test errors, as we train.
Currently I have a (pseudo-)script that looks as follow:
def train_multiple_models_in_one_script_with_gpu(arg):
    '''
    Trains multiple NN models in one session using GPUs correctly.

    arg = some obj/struct with the params for training each of the models.
    '''
    #### try multiple models
    for mdl_id in range(100):
        #### define/create graph
        graph = tf.Graph()
        with graph.as_default():
            ### get mdl
            x = tf.placeholder(float_type, get_x_shape(arg), name='x-input')
            y_ = tf.placeholder(float_type, get_y_shape(arg))
            y = get_mdl(arg, x)
            ### get loss and accuracy
            loss, accuracy = get_accuracy_loss(arg, x, y, y_)
            ### get optimizer variables
            opt = get_optimizer(arg)
            train_step = opt.minimize(loss, global_step=global_step)
        #### run session
        with tf.Session(graph=graph) as sess:
            # train
            for i in range(nb_iterations):
                batch_xs, batch_ys = get_batch_feed(X_train, Y_train, batch_size)
                sess.run(fetches=train_step, feed_dict={x: batch_xs, y_: batch_ys})
                # check_point mdl
                if i % report_error_freq == 0:
                    sess.run(step.assign(i))
                    #
                    train_error = sess.run(fetches=loss, feed_dict={x: X_train, y_: Y_train})
                    test_error = sess.run(fetches=loss, feed_dict={x: X_test, y_: Y_test})
                    print('step %d, train error: %s test_error %s' % (i, train_error, test_error))
Essentially, it tries lots of models in one single run, but it builds each model in a separate graph and runs each one in a separate session.
I guess my main worry is that it's unclear to me how TensorFlow allocates GPU resources under the hood. For example, does it load (part of) the data set only when a session is run? When I create a graph and a model, is it brought into the GPU immediately, or when exactly is it placed on the GPU? Do I need to clear/free the GPU each time it tries a new model? I don't actually care too much whether the models run in parallel on multiple GPUs (which would be a nice addition), but I first want everything to run serially without crashing. Is there anything special I need to do for this to work?
Currently I am getting an error that starts as follow:
I tensorflow/core/common_runtime/bfc_allocator.cc:702] Stats:
Limit: 340000768
InUse: 336114944
MaxInUse: 339954944
NumAllocs: 78
MaxAllocSize: 335665152
W tensorflow/core/common_runtime/bfc_allocator.cc:274] ***************************************************xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
W tensorflow/core/common_runtime/bfc_allocator.cc:275] Ran out of memory trying to allocate 160.22MiB. See logs for memory state.
W tensorflow/core/framework/op_kernel.cc:975] Resource exhausted: OOM when allocating tensor with shape[60000,700]
and further down the line it says:
ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[60000,700]
[[Node: standardNN/NNLayer1/Z1/add = Add[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/gpu:0"](standardNN/NNLayer1/Z1/MatMul, b1/read)]]
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla P100-SXM2-16GB, pci bus id: 0000:06:00.0)
However, further down the output file it seems to print the errors/messages that should show as training proceeds just fine. Does this mean that it didn't run out of resources? Or was it actually able to use the GPU? If it was able to fall back to the CPU instead of the GPU, then why does this error only happen when the GPUs are about to be used?
The weird thing is that the data set is really not that big (all 60K points add up to 24.5M), and when I run a single model locally on my own computer the process seems to use less than 5GB. The GPUs have at least 8GB each and the computer with them has plenty of RAM and disk (at least 16GB). Thus, the errors that TensorFlow is throwing at me are quite puzzling. What is it trying to do and why are they occurring? Any ideas?
After reading the answer that suggests to use the multiprocessing library I came up with the following script:
from multiprocessing import Process

def train_mdl(args):
    train(mdl, args)

if __name__ == '__main__':
    for mdl_id in range(100):
        # train one model with some specific hyperparams (assume they are chosen randomly
        # inside the function below, read from a config file, or just passed in somehow)
        p = Process(target=train_mdl, args=(args,))
        p.start()
        p.join()
    print('Done training all models!')
Honestly, I am not sure why his answer suggests using Pool, or why there are weird tuple brackets, but this is what would make sense to me. Would the resources for TensorFlow be re-allocated every time a new process is created in the above loop?
I think that running all models in one single script can be bad practice in the long term (see my suggestion below for a better alternative). However, if you would like to do it, here is a solution: You can encapsulate your TF session into a process with the multiprocessing module, this will make sure TF releases the session memory once the process is done. Here is a code snippet:
from multiprocessing import Pool
import contextlib

def my_model((param1, param2, param3)):  # Note the extra (), required by the pool syntax (Python 2 tuple unpacking)
    < your code >

num_pool_workers = 1  # can be bigger than 1, to enable parallel execution

with contextlib.closing(Pool(num_pool_workers)) as po:  # This ensures that the processes get closed once they are done
    pool_results = po.map_async(my_model,
                                ((param1, param2, param3)
                                 for param1, param2, param3 in params_list))
    results_list = pool_results.get()
Note from OP: The random number generator seed does not reset automatically with the multi-processing library if you choose to use it. Details here: Using python multiprocessing with different random seed for each process
About TF resource allocation: Usually TF allocates much more resources than it needs. Many times you can restrict each process to use a fraction of the total GPU memory, and discover through trial and error the fraction your script requires.
You can do it with the following snippet
gpu_memory_fraction = 0.3 # Choose this number through trial and error
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=gpu_memory_fraction,)
session_config = tf.ConfigProto(gpu_options=gpu_options)
sess = tf.Session(config=session_config, graph=graph)
Note that sometimes TF increases the memory usage in order to accelerate the execution. Therefore, reducing the memory usage might make your model run slower.
Answers to the new questions in your edit/comments:
Yes, TensorFlow's resources will be re-allocated every time a new process is created, and cleared once the process ends.
The for-loop in your edit should also do the job. I suggest using Pool instead, because it will enable you to run several models concurrently on a single GPU. See my notes about setting gpu_memory_fraction and "choosing the maximal number of processes". Also note that: (1) The Pool map runs the loop for you, so you don't need an outer for-loop once you use it. (2) In your example, you should have something like mdl=get_model(args) before calling train().
Weird tuple parentheses: Pool's map only passes a single argument to the function, therefore we use a tuple to pass multiple arguments. See multiprocessing.pool.map and function with two arguments for more details. As suggested in one answer, you can make it more readable with
def train_mdl(params):
    (x, y) = params
    < your code >
As #Seven suggested, you can use the CUDA_VISIBLE_DEVICES environment variable to choose which GPU to use for your process. You can do it from within your python script using the following at the beginning of the process function (train_mdl):
import os # the import can be on the top of the python script
os.environ["CUDA_VISIBLE_DEVICES"] = "{}".format(gpu_id)
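One caveat worth hedging: CUDA_VISIBLE_DEVICES only takes effect if it is set before TensorFlow initializes the GPU in that process. A sketch of the worker under that assumption (gpu_id and the training body are placeholders of mine) might look like:

def train_mdl(params, gpu_id):
    import os
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)  # must happen before TF touches the GPU
    import tensorflow as tf  # import (or first GPU use) only after the variable is set
    # ... build the graph and train as before ...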
A better practice for executing your experiments would be to isolate your training/evaluation code from the hyperparameter / model-search code.
E.g. have a script named train.py, which accepts a specific combination of hyperparameters and references to your data as arguments, and executes training for a single model.
Then, to iterate through all the possible combinations of parameters, you can use a simple task (jobs) queue and submit all the possible combinations of hyperparameters as separate jobs. The task queue will feed your jobs one at a time to your machine. Usually, you can also set the queue to execute a number of processes concurrently (see details below).
Specifically, I use task spooler, which is super easy to install and handy (it doesn't require admin privileges, details below).
Basic usage is (see notes below about task spooler usage):
ts <your-command>
In practice, I have a separate python script that manages my experiments, sets all the arguments for each specific experiment and sends the jobs to the ts queue.
Here are some relevant snippets of python code from my experiments manager:
run_bash executes a bash command:
import subprocess

def run_bash(cmd):
    p = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE, executable='/bin/bash')
    out = p.stdout.read().strip()
    return out  # This is the stdout from the shell command
The next snippet sets the number of concurrent processes to be run (see note below about choosing the maximal number of processes):
max_job_num_per_gpu = 2
run_bash('ts -S %d'%max_job_num_per_gpu)
The next snippet iterates through a list of all combinations of hyperparams / model params. Each element of the list is a dictionary, where the keys are the command line arguments for the train.py script.
for combination_dict in combinations_list:
    job_cmd = 'python train.py ' + ' '.join(
        ['--{}={}'.format(flag, value) for flag, value in combination_dict.iteritems()])
    submit_cmd = "ts bash -c '%s'" % job_cmd
    run_bash(submit_cmd)
A note about choosing the maximal number of processes:
If you are short on GPUs, you can use gpu_memory_fraction you found, to set the number of processes as max_job_num_per_gpu=int(1/gpu_memory_fraction)
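Tying those two knobs together, a tiny illustrative snippet (the variable names are just the ones used above) would be:

gpu_memory_fraction = 0.3                           # found by trial and error
max_job_num_per_gpu = int(1 / gpu_memory_fraction)  # -> 3 concurrent jobs per GPU
run_bash('ts -S %d' % max_job_num_per_gpu)          # tell task spooler how many slots to run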
Notes about task spooler (ts):
You could set the number of concurrent processes to run ("slots") with:
ts -S <number-of-slots>
Installing ts doesn't require admin privileges. You can download and compile it from source with a simple make, add it to your path, and you're done.
You can set up multiple queues (I use it for multiple GPUs), with
TS_SOCKET=<path_to_queue_name> ts <your-command>
e.g.
TS_SOCKET=/tmp/socket-ts.gpu_queue_1 ts <your-command>
TS_SOCKET=/tmp/socket-ts.gpu_queue_2 ts <your-command>
See here for further usage examples.
A note about automatically setting the path names and file names:
Once you separate your main code from the experiment manager, you will need an efficient way to generate file names and directory names, given the hyper-params. I usually keep my important hyper params in a dictionary and use the following function to generate a single chained string from the dictionary key-value pairs.
Here are the functions I use for doing it:
import re

def build_string_from_dict(d, sep='%'):
    """
    Builds a string from a dictionary.
    Mainly used for formatting hyper-params to file names.
    Key-value pairs are sorted by the key name.

    :param d: input dictionary
    :param sep: key-value separator
    Returns: string
    """
    return sep.join(['{}={}'.format(k, _value2str(d[k])) for k in sorted(d.keys())])


def _value2str(val):
    if isinstance(val, float):
        # %g means: "Floating point format.
        # Uses lowercase exponential format if exponent is less than -4 or not less than precision,
        # decimal format otherwise."
        val = '%g' % val
    else:
        val = '{}'.format(val)
    val = re.sub('\.', '_', val)
    return val
As I understand it, TensorFlow first constructs a symbolic graph and infers the derivatives based on the chain rule. It then allocates memory for all (necessary) tensors, including some inputs and outputs of layers, for efficiency. When running a session, data will be loaded into the graph, but in general memory use will not change any more.
The error you hit, I guess, may be caused by constructing several models on one GPU.
Isolating your training/evaluation code from the hyperparameters is a good choice, as #user2476373 proposed. But I am using a bash script directly, not task spooler (maybe it's more convenient), e.g.
CUDA_VISIBLE_DEVICES=0 python train.py --lrn_rate 0.01 --weight_decay_rate 0.001 --momentum 0.9 --batch_size 8 --max_iter 60000 --snapshot 5000
CUDA_VISIBLE_DEVICES=0 python eval.py
Or you can write a 'for' loop in the bash script, not necessarily in a python script. Note that I used CUDA_VISIBLE_DEVICES=0 at the beginning of the command (the index could be up to 7 if you have 8 GPUs in one machine), because in my experience I've found that TensorFlow uses all GPUs in one machine if I don't specify which GPU each operation should use with code like this:
with tf.device('/gpu:0'):
If you want to try a multi-GPU implementation, there are some examples.
Hope this helps.
An easy solution: Give each model a unique session and graph.
It works for this platform: TensorFlow 1.12.0, Keras 2.1.6-tf, Python 3.6.7, Jupyter Notebook.
Key code:
with session.as_default():
    with session.graph.as_default():
        # do something about an ANN model
Full code:
import tensorflow as tf
from tensorflow import keras
import gc

def limit_memory():
    """ Release unused memory resources. Force garbage collection """
    keras.backend.clear_session()
    keras.backend.get_session().close()
    tf.reset_default_graph()
    gc.collect()
    #cfg = tf.ConfigProto()
    #cfg.gpu_options.allow_growth = True
    #keras.backend.set_session(tf.Session(config=cfg))
    keras.backend.set_session(tf.Session())
    gc.collect()

def create_and_train_ANN_model(hyper_parameter):
    print('create and train my ANN model')
    info = { 'result about this ANN model' }
    return info

for i in range(10):
    limit_memory()
    session = tf.Session()
    keras.backend.set_session(session)
    with session.as_default():
        with session.graph.as_default():
            hyper_parameter = { 'A set of hyper-parameters' }
            info = create_and_train_ANN_model(hyper_parameter)
    limit_memory()
Inspired by this link: Keras (Tensorflow backend) Error - Tensor input_1:0, specified in either feed_devices or fetch_devices was not found in the Graph
I have the same issue. My solution is to run the following from another script, as many times and with as many hyperparameter configurations as you want.
import os

cmd = "python3 ./model_train.py hyperparameters"
os.system(cmd)
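For instance, a minimal sketch of such a driver loop (the flag name and values here are placeholders of mine, not part of the original script):

import os

for lr in (0.1, 0.01, 0.001):
    os.system("python3 ./model_train.py --learning_rate {}".format(lr))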
You probably don't want to do this.
If you run thousands and thousands of models on your data, and pick the one that evaluates best, you are not doing machine learning; instead you are memorizing your data set, and there is no guarantee that the model you pick will perform at all outside that data set.
In other words, that approach is similar to having a single model with thousands of degrees of freedom. Having a model of such high complexity is problematic, since it will be able to fit your data better than is actually warranted; such a model is annoyingly able to memorize any noise (outliers, measurement errors, and such) in your training data, which causes the model to perform poorly when the noise is even slightly different.
(Apologies for posting this as an answer, the site wouldn't let me add a comment.)

TensorFlow - how to evaluate all test set with every example once and only once

I am running cifar10 example from TensorFlow. But there is a problem for evaluation.
I have a test set and I want to evaluate every example from it once and only once. But the code (line 121) currently only takes examples from a queue (line 126), which cannot guarantee that. I have also made a modification so that the input is a '.tfrecords' file. Is there any suggestion?
Thank you in advance.
The function tf.train.string_input_producer that creates the queue of filenames here can accept an argument num_epochs. You can specify that you want it to run only 1 epoch.
# Create a queue that produces the filenames to read.
filename_queue = tf.train.string_input_producer(filenames, num_epochs=1)
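One caveat worth hedging here (based on how TF 1.x queue inputs behave in general, not on anything specific to the cifar10 code): when num_epochs is set, string_input_producer keeps its epoch counter in a local variable, so the session has to initialize local variables before the queue runners start, roughly like:

sess.run(tf.local_variables_initializer())  # needed once num_epochs is not None
tf.train.start_queue_runners(sess=sess)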
I have figured out a solution, though it is rather imperfect. The key is to exclude limit_epochs from the variables to restore and then initialize it yourself. Here are the detailed steps:
Add the code
del variables_to_restore['input_producer/limit_epochs/epochs'] after variables_to_restore = variable_averages.variables_to_restore(). This stops loading input_producer/limit_epochs into the model.
Next, add sess.run(tf.initialize_variables([v for v in tf.all_variables() if v.name.startswith("input_producer")])) inside a session to initialize the variable.
Finally, create the queue with filename_queue = tf.train.string_input_producer(filenames, num_epochs=1).
And try to save the files before shutting down the threads.
The imperfection is that you have to make every thread read only one example if you want it to fit an arbitrary number of test examples.

Mulitprocessing and rpy2 (with ape)

I ran into this today and can't figure out why. I have several functions chained together that perform some time consuming operations as part of a larger pipeline. I've included these here, pared down to a test example, as best as I could. The issue is that when I call a function directly, I get the expected output (e.g., 5 different trees). However, when I call the same function in a multiprocessing pool with apply_async (or apply, doesn't matter), I get 5 trees, but they are all the same.
I've documented this in an IPython notebook, which can be viewed here: http://nbviewer.ipython.org/gist/cfriedline/0e275d528ff1a8d674c6
In cell 91, I create 5 trees (each with 10 tips) and return two lists. The first contains the non-multiprocessing trees, and the second the ones from apply_async.
In cell 92, you can see the results of creating trees without multiprocessing, and in 93, with multiprocessing.
What I expect is that there would be a total of 10 different trees between the two tests, but instead all of the multiprocessing trees are identical. Makes little sense to me.
Relevant versions of things:
Linux 2.6.18-238.12.1.el5 x86_64 GNU/Linux
Python 2.7.6 :: Anaconda 1.9.2 (64-bit)
IPython 2.0.0
Rpy2 2.3.9
Thanks!
Chris
I solved this one, with a point in the right direction from #mgilson. In fact, it was a random number problem, just not in Python - in R (sigh). The state of R is copied when the Pool is created, meaning so is its random seed. To fix it, just add a little rpy2 as below, calling R's set.seed function (with some process-specific stuff for good measure):
import os
import time
import rpy2.robjects

def create_tree(num_tips, type):
    """
    creates the taxa tree in R
    @param num_tips: number of taxa to create
    @param type: type for naming (e.g., 'taxa')
    @return: a dendropy Tree
    @rtype: dendropy.Tree
    """
    r = rpy2.robjects.r
    set_seed = r('set.seed')
    set_seed(int((time.time() + os.getpid() * 1000)))
    rpy2.robjects.globalenv['numtips'] = num_tips
    rpy2.robjects.globalenv['treetype'] = type
    name = _get_random_string(20)
    if type == "T":
        r("%s = rtree(numtips, rooted=T, tip.label=paste(treetype, seq(1:(numtips)), sep=''))" % name)
    else:
        r("%s = rtree(numtips, rooted=F, tip.label=paste(treetype, seq(1:(numtips)), sep=''))" % name)
    tree = r[name]
    return ape_to_dendropy(tree)
I'm not 100% familiar with these libraries; however, on Linux, (IIRC) multiprocessing uses os.fork. This means that the state of the random module (which you're using) will also be forked, and each of your processes will generate the same sequence of random numbers, resulting in a not-so-random _get_random_string function.
If I'm right, and you make the pool smaller than the number of trees that you want, you should see that you get groups of N identical trees (where N is the size of the pool).
I think that probably the ideal solution is to re-seed the random number generator inside of each of the processes. It's unlikely that they'll run at exactly the same time, so you should get differing results.
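As a hedged sketch of that re-seeding idea on the Python side (the R side is handled by set.seed above): Pool accepts an initializer that runs once in each worker process, so something like the following (the helper name is mine) should give each process its own random stream:

import os
import random
import time
from multiprocessing import Pool

def _reseed_worker():
    # runs once per worker process, right after the fork
    random.seed(os.getpid() ^ int(time.time() * 1000))

pool = Pool(processes=4, initializer=_reseed_worker)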

Python something resets my random seed

My question is the exact opposite of this one.
This is an excerpt from my test file
f1 = open('seed1234','r')
f2 = open('seed7883','r')
s1 = eval(f1.read())
s2 = eval(f2.read())
f1.close()
f2.close()
####
test_sampler1.random_inst.setstate(s1)
out1 = test_sampler1.run()
self.assertEqual(out1,self.out1_regress) # this is fine and passes
test_sampler2.random_inst.setstate(s2)
out2 = test_sampler2.run()
self.assertEqual(out2,self.out2_regress) # this FAILS
Some info -
test_sampler1 and test_sampler2 are 2 objects from a class that performs some stochastic sampling. The class has an attribute random_inst, which is an object of type random.Random(). The file seed1234 contains a TestSampler's random_inst state as returned by random.getstate() when it was given a seed of 1234, and you can guess what seed7883 is. What I did was: I created a TestSampler in the terminal, gave it a random seed of 1234, acquired the state with rand_inst.getstate() and saved it to a file. I then recreate the regression test and I always get the same output.
HOWEVER
The same procedure as above doesn't work for test_sampler2 - whatever I do, I do not get the same random sequence of numbers. I am using python's random module and I am not importing it anywhere else, but I do use numpy in some places (but not numpy.random).
The only difference between test_sampler1 and test_sampler2 is that they are created from 2 different files. I know this is a big deal and it is totally dependent on the code I wrote, but I also can't simply paste ~800 lines of code here; I am merely looking for some general idea of what I might be messing up...
What might be scrambling the state of test_sampler2's random number generator?
Solution
There were 2 separate issues with my code:
1
My script is a command-line script, and after I refactored it to use python's optparse library I found out that I was setting the seed for my sampler using something like seed = sys.argv[1], which meant that I was setting the seed to be a str, not an int - seed can take any hashable object, and I found that out the hard way. This explains why I would get 2 different sequences with the same seed: one if I run my script from the command line with something like python sample 1234 #seed is 1234, and another from my unit_tests.py file when I would create an object instance like test_sampler1 = TestSampler(seed=1234). (See the small demonstration after point 2 below.)
2
I have a function for discrete distribution sampling which I borrowed from here (look at the accepted answer). The code there was missing something fundamental: it was still non-deterministic in the sense that if you give it the same values and probabilities arrays, but transformed by a permutation (say values ['a','b'] with probs [0.1,0.9] versus values ['b','a'] with probs [0.9,0.1]), and the seed is set, the PRNG will give you the same random sample, say 0.3, but since the intervals for your probabilities are different, in one case you'll get a 'b' and in the other an 'a'. To fix it, I just zipped the values and probabilities together, sorted by probability, and tada - I now always get the same probability intervals.
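Here is a tiny demonstration of the first issue (not from my actual code, just a sketch): seeding random.Random with the string '1234' and with the integer 1234 puts the generator in different states, so the two runs diverge.

import random

r_str = random.Random('1234')  # what sys.argv[1] effectively gave me
r_int = random.Random(1234)    # what TestSampler(seed=1234) gave me
print(r_str.random() == r_int.random())  # almost certainly False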
After fixing both issues the code worked as expected i.e. out2 started behaving deterministically.
The only thing (apart from an internal Python bug) that can change the state of a random.Random instance is calling methods on that instance. So the problem lies in something you haven't shown us. Here's a little test program:
from random import Random

r1 = Random()
r2 = Random()

for _ in range(100):
    r1.random()
for _ in range(200):
    r2.random()

r1state = r1.getstate()
r2state = r2.getstate()

with open("r1state", "w") as f:
    print >> f, r1state
with open("r2state", "w") as f:
    print >> f, r2state

for _ in range(100):
    with open("r1state") as f:
        r1.setstate(eval(f.read()))
    with open("r2state") as f:
        r2.setstate(eval(f.read()))
    assert r1state == r1.getstate()
    assert r2state == r2.getstate()
I haven't run that all day, but I bet I could and never see a failing assert ;-)
BTW, it's certainly more common to use pickle for this kind of thing, but it's not going to solve your real problem. The problem is not in getting or setting the state. The problem is that something you haven't yet found is calling methods on your random.Random instance(s).
While it's a major pain in the butt to do so, you could try adding print statements to random.py to find out what's doing it. There are cleverer ways to do that, but better to keep it dirt simple so that you don't end up actually debugging the debugging code.
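(One sketch of a slightly cleverer way, offered with the same caveat about keeping things simple: since random.Random is a plain Python class, you can shadow a method on the suspect instance and print a stack trace whenever someone calls it. The attribute name random_inst is taken from the question; the wrapper is mine.)

import traceback

_orig = test_sampler2.random_inst.random
def _traced_random():
    traceback.print_stack()  # shows who is consuming random numbers
    return _orig()
test_sampler2.random_inst.random = _traced_random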
