Multiprocessing and rpy2 (with ape) - python

I ran into this today and can't figure out why. I have several functions chained together that perform some time-consuming operations as part of a larger pipeline. I've included these here, pared down to a test example, as best I could. The issue is that when I call a function directly, I get the expected output (e.g., 5 different trees). However, when I call the same function in a multiprocessing pool with apply_async (or apply, it doesn't matter), I get 5 trees, but they are all identical.
I've documented this in an IPython notebook, which can be viewed here: http://nbviewer.ipython.org/gist/cfriedline/0e275d528ff1a8d674c6
In cell 91, I create 5 trees (each with 10 tips) and return two lists: the first containing the non-multiprocessing trees, and the second containing the trees from apply_async.
In cell 92, you can see the results of creating trees without multiprocessing, and in 93, with multiprocessing.
What I expected is a total of 10 different trees across the two tests, but instead all of the multiprocessing trees are identical. This makes little sense to me.
Relevant versions of things:
Linux 2.6.18-238.12.1.el5 x86_64 GNU/Linux
Python 2.7.6 :: Anaconda 1.9.2 (64-bit)
IPython 2.0.0
Rpy2 2.3.9
Thanks!
Chris

I solved this one, with a point in the right direction from @mgilson. It was in fact a random number problem, just not in Python - in R (sigh). The state of R is copied when the Pool is created, meaning so is its random seed. The fix is a little rpy2, as below, calling R's set.seed function (with some process-specific stuff for good measure):
def create_tree(num_tips, type):
    """
    Creates the taxa tree in R.

    :param num_tips: number of taxa to create
    :param type: type for naming (e.g., 'taxa')
    :return: a dendropy Tree
    :rtype: dendropy.Tree
    """
    r = rpy2.robjects.r
    set_seed = r('set.seed')
    # Re-seed R's RNG per process so forked workers diverge.
    set_seed(int(time.time() + os.getpid() * 1000))
    rpy2.robjects.globalenv['numtips'] = num_tips
    rpy2.robjects.globalenv['treetype'] = type
    name = _get_random_string(20)
    if type == "T":
        r("%s = rtree(numtips, rooted=T, tip.label=paste(treetype, seq(1:(numtips)), sep=''))" % name)
    else:
        r("%s = rtree(numtips, rooted=F, tip.label=paste(treetype, seq(1:(numtips)), sep=''))" % name)
    tree = r[name]
    return ape_to_dendropy(tree)
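For context, a minimal sketch of the kind of driver this pairs with (the pool size and call shape are my assumptions, not code from the notebook):

import multiprocessing as mp

pool = mp.Pool()
# Each worker re-seeds R on entry to create_tree, so the trees now differ.
async_results = [pool.apply_async(create_tree, (10, "T")) for _ in range(5)]
mp_trees = [res.get() for res in async_results]
pool.close()
pool.join()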

I'm not 100% familiar with these libraries; however, on Linux, (IIRC) multiprocessing uses os.fork. This means that the state of the random module (which you're using) will also be forked, and that each of your processes will generate the same sequence of random numbers, resulting in a not-so-random _get_random_string function.
If I'm right, and you make the pool smaller than the number of trees that you want, you should see that you get groups of N identical trees (where N is the number of worker processes in the pool).
I think the ideal solution is probably to re-seed the random number generator inside each of the processes. It's unlikely that they'll run at exactly the same time, so you should get differing results.
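A minimal sketch of that re-seeding for Python's own random module, using a Pool initializer (the helper name is mine; the accepted fix above re-seeds R instead, since that's where the problem actually lived):

import os
import random
import time
from multiprocessing import Pool

def _reseed():
    # Runs once in each worker after the fork, giving every
    # process its own RNG state instead of the parent's copy.
    random.seed(os.getpid() + int(time.time() * 1000))

pool = Pool(initializer=_reseed)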

Related

numpy.random.choice with percentages not working in practice

I'm running python code that's similar to:
import numpy

def get_user_group(user, groups):
    if not user.group_id:
        user.group_id = assign(groups)
    return user.group_id

def assign(groups):
    ids = []          # these two lists must be initialized on each call
    percentages = []
    for group in groups:
        ids.append(group.id)
        percentages.append(group.percentage)  # e.g. .33
    assignment = numpy.random.choice(ids, p=percentages)
    return assignment
We are running this in the wild against tens of thousands of users. I've noticed that the assignments do not respect the actual group percentages; e.g., if our percentages are [.9, .1], we see a consistent hour-over-hour split of 80% and 20%. We've confirmed that the inputs to the choice function are correct; they just don't match the observed behavior.
Does anyone have a clue why this could be happening? Is it because we are using the global numpy random generator? Some groups will be split between [.9, .1] while others are [.33, .34, .33], etc. Is it possible that different sets of groups are interfering with each other?
We are running this code in a python flask web application on a number of nodes.
Any recommendations on how to get reliable "random" weighted choice?
This answer outgrew the limits of a comment, hence I post it here.
The fact that your team was not able to reproduce the problem, and got proper results instead, is a sign that NumPy itself most probably suits your needs. You can benefit from NumPy later, when you need efficiency; clearly, efficiency is not your concern right now.
A more complete picture of the code and the infrastructure setup on your nodes would be helpful, though. How often do you restart your Flask server? Where do you initialize the NumPy random generator? Consider the following code, which creates a page /random that can be customized with a size parameter, e.g., localhost:5000/random?size=20:
from flask import Flask, request
import numpy
import pandas

app = Flask(__name__)  # ... the rest of your webapp

numpy.random.seed(0)

@app.route('/random', methods=['GET'])
def random():
    """Gives the desired number of random numbers
    along with the state of the random number generator.
    """
    # DON'T PUT numpy.random.seed(0) HERE
    size = request.args.get('size')
    if size is not None:
        size = int(size)
    else:
        size = 1
    state = numpy.random.get_state()
    data = numpy.random.random(size=size)
    table = pandas.DataFrame(data=data)
    return table.to_html() + repr(state)
In this example, the state is initialized once after the Flask app is started. Whenever the /random page is requested, good random numbers are generated.
If you put the state initialization inside the function, it would surely cause unexpected distributions, because you'd get the same random numbers (and the same choices) on every request.
If you use multiple nodes and initialize them with the same seed, your different nodes will produce the same choices again. In that case, use unique node IDs as seed values. If you restart the servers often, combine a restart ID or timestamp with the unique node ID. It is also a good idea to make sure the timestamp is logged.
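A minimal sketch of that per-node seeding (NODE_ID is a hypothetical environment variable; adapt it to however your nodes are identified):

import os
import time
import numpy

# Hypothetical per-node identifier plus restart time, mixed into
# a 32-bit value, since numpy seeds must fit in [0, 2**32).
node_id = int(os.environ.get('NODE_ID', '0'))
restart_ts = int(time.time())
numpy.random.seed((node_id * 1000003 + restart_ts) % (2 ** 32))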

Optimizing a multithreaded numpy array function

Given 2 large arrays of 3D points (I'll call the first "source" and the second "destination"), I needed a function that returns, for each element of "source", the index of its closest element in "destination", with this limitation: I can only use numpy... so no scipy, pandas, numexpr, cython...
To do this, I wrote a function based on the "brute force" answer to this question: I iterate over the elements of source, find the closest element in destination, and return its index. Due to performance concerns, and again because I can only use numpy, I tried multithreading to speed it up. Here are both the threaded and unthreaded functions and how they compare in speed on an 8-core machine.
import timeit
import numpy as np
from numpy.core.umath_tests import inner1d
from multiprocessing.pool import ThreadPool

def threaded(sources, destinations):
    # Define worker function
    def worker(point):
        dlt = (destinations - point)  # delta between destinations and given point
        d = inner1d(dlt, dlt)         # get squared distances
        return np.argmin(d)           # return closest index
    # Multithread!
    p = ThreadPool()
    return p.map(worker, sources)

def unthreaded(sources, destinations):
    results = []
    for i in range(len(sources)):
        dlt = (destinations - sources[i])  # delta between destinations and given point
        d = inner1d(dlt, dlt)              # get squared distances
        results.append(np.argmin(d))       # append closest index
    return results

# Set up the data
n_destinations = 10000  # 10k random destinations
n_sources = 10000       # 10k random sources
destinations = np.random.rand(n_destinations, 3) * 100
sources = np.random.rand(n_sources, 3) * 100

# Compare!
print 'threaded: %s' % timeit.Timer(lambda: threaded(sources, destinations)).repeat(1, 1)[0]
print 'unthreaded: %s' % timeit.Timer(lambda: unthreaded(sources, destinations)).repeat(1, 1)[0]
Results:
threaded: 0.894030461056
unthreaded: 1.97295164054
Multithreading seems beneficial, but I was hoping for more than a 2x speedup, given that the real-life datasets I deal with are much larger.
All recommendations to improve performance (within the limitations described above) will be greatly appreciated!
OK, I've been reading the Maya documentation on Python, and I came to these conclusions/guesses:
They're probably using CPython inside (several references to that documentation and not any other).
They're not fond of threads (lots of non-thread-safe methods).
Given the above, I'd say it's better to avoid threads. Because of the GIL, this is a common problem, and there are several ways to work around it:
Try to build a C/C++ extension. Once that is done, use threads in C/C++. Personally, I'd try SIP first, and move on from there.
Use multiprocessing. Even if your custom Python distribution doesn't include it, you can get to a working version since it's all pure Python code. multiprocessing is not affected by the GIL since it spawns separate processes (see the sketch after this list).
The above should work out for you. If not, try another parallel tool (after some serious praying).
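As a sketch of the multiprocessing route applied to the posted nearest-neighbour search (the chunking and process count are my additions; the worker must live at module level so it can be pickled, and on Windows the driver needs an if __name__ == '__main__' guard):

import numpy as np
from numpy.core.umath_tests import inner1d
from multiprocessing import Pool

def closest_indices(args):
    # Worker: for each point in this chunk, index of nearest destination.
    chunk, destinations = args
    out = []
    for point in chunk:
        dlt = destinations - point
        out.append(np.argmin(inner1d(dlt, dlt)))
    return out

def multiprocessed(sources, destinations, n_procs=8):
    chunks = np.array_split(sources, n_procs)
    pool = Pool(n_procs)
    results = pool.map(closest_indices, [(c, destinations) for c in chunks])
    pool.close()
    pool.join()
    return [i for chunk in results for i in chunk]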
On a side note, if you're using outside modules, be mindful of matching Maya's Python version. This may be the reason you couldn't build scipy. Of course, scipy has a huge codebase, and Windows is not the most forgiving platform to build things on.

Python something resets my random seed

My question is the exact opposite of this one.
This is an excerpt from my test file
f1 = open('seed1234','r')
f2 = open('seed7883','r')
s1 = eval(f1.read())
s2 = eval(f2.read())
f1.close()
f2.close()
####
test_sampler1.random_inst.setstate(s1)
out1 = test_sampler1.run()
self.assertEqual(out1,self.out1_regress) # this is fine and passes
test_sampler2.random_inst.setstate(s2)
out2 = test_sampler2.run()
self.assertEqual(out2,self.out2_regress) # this FAILS
Some info -
test_sampler1 and test_sampler2 are two objects of a class that performs some stochastic sampling. The class has an attribute random_inst, which is an object of type random.Random. The file seed1234 contains a TestSampler's random_inst state as returned by random.getstate() when it was given a seed of 1234, and you can guess what seed7883 is. What I did was create a TestSampler in the terminal, give it a random seed of 1234, acquire the state with rand_inst.getstate(), and save it to a file. I then recreate the regression test, and I always get the same output.
HOWEVER
The same procedure as above doesn't work for test_sampler2 - whatever I do, I do not get the same random sequence of numbers. I am using Python's random module, and I am not importing it anywhere else, but I do use numpy in some places (though not numpy.random).
The only difference between test_sampler1 and test_sampler2 is that they are created from 2 different files. I know this is a big deal and it is totally dependent on the code I wrote, but I also can't simply paste ~800 lines of code here; I am merely looking for a general idea of what I might be messing up...
What might be scrambling the state of test_sampler2's random number generator?
Solution
There were 2 separate issues with my code:
1
My script is a command-line script, and after I refactored it to use Python's optparse library, I found out that I was setting the seed for my sampler with something like seed = sys.argv[1], which meant I was setting the seed to a str, not an int - seed can take any hashable object, and I found that out the hard way. This explains why I would get two different sequences for the same seed: one when running the script from the command line with something like python sample 1234 (the seed is the string '1234'), and another from my unit_tests.py file, where I would create an instance like test_sampler1 = TestSampler(seed=1234).
2
I have a function for discrete distribution sampling which I borrowed from here (look at the accepted answer). The code there was missing something fundamental: it was still non-deterministic in the sense that if you give it the same values and probabilities, but permuted (say values ['a','b'] with probs [0.1,0.9] versus values ['b','a'] with probs [0.9,0.1]), and the seed is set, the PRNG will produce the same random sample, say 0.3 - but since the cumulative probability intervals are laid out differently, in one case you'll get a 'b' and in the other an 'a'. To fix it, I just zipped the values and probabilities together, sorted by probability, and tada - I now always get the same probability intervals.
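A minimal sketch of that fix (the names are mine, not from the original code; if two probabilities can be equal, a secondary sort key is needed for full determinism):

import bisect

def make_sampler(values, probs, rng):
    # Sort (value, prob) pairs so permuted inputs build identical intervals.
    pairs = sorted(zip(values, probs), key=lambda vp: vp[1])
    cumulative = []
    total = 0.0
    for _, p in pairs:
        total += p
        cumulative.append(total)
    def sample():
        # The same rng draw now always lands in the same interval.
        i = bisect.bisect(cumulative, rng.random())
        return pairs[min(i, len(pairs) - 1)][0]  # clamp guards float round-off
    return sample

With this, make_sampler(['a','b'], [0.1,0.9], random.Random(1234)) and make_sampler(['b','a'], [0.9,0.1], random.Random(1234)) draw identical sequences.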
After fixing both issues, the code worked as expected, i.e., out2 started behaving deterministically.
The only thing (apart from an internal Python bug) that can change the state of a random.Random instance is calling methods on that instance. So the problem lies in something you haven't shown us. Here's a little test program:
from random import Random

r1 = Random()
r2 = Random()

for _ in range(100):
    r1.random()
for _ in range(200):
    r2.random()

r1state = r1.getstate()
r2state = r2.getstate()

with open("r1state", "w") as f:
    print >> f, r1state
with open("r2state", "w") as f:
    print >> f, r2state

for _ in range(100):
    with open("r1state") as f:
        r1.setstate(eval(f.read()))
    with open("r2state") as f:
        r2.setstate(eval(f.read()))
    assert r1state == r1.getstate()
    assert r2state == r2.getstate()
I haven't run that all day, but I bet I could and never see a failing assert ;-)
BTW, it's certainly more common to use pickle for this kind of thing, but it's not going to solve your real problem. The problem is not in getting or setting the state. The problem is that something you haven't yet found is calling methods on your random.Random instance(s).
While it's a major pain in the butt to do so, you could try adding print statements to random.py to find out what's doing it. There are cleverer ways to do that, but better to keep it dirt simple so that you don't end up actually debugging the debugging code.
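One of those cleverer ways, sketched here as a hypothetical example: wrap the method on the suspect instance instead of editing random.py, so every draw prints the stack that requested it:

import traceback

def trace_random(rng):
    # Shadow rng.random with a logging wrapper on this one instance.
    original = rng.random
    def logged():
        traceback.print_stack()  # shows who is consuming randomness
        return original()
    rng.random = logged

trace_random(test_sampler2.random_inst)  # hypothetical usage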

multiple cpu usage when accessing data attached to traited classes

I have an application that uses a number of classes inheriting from HasTraits. Some of these classes manage access to data and others provide functions for analyzing that data. This works wonderfully for a GUI - I can check that the data and analysis code are doing what they should. However, I've noticed that when I use these classes for GUI-less computations, all the CPUs on the system end up getting used.
Here is a small example that shows the cpu usage:
from traits.api import HasTraits, List, Int, Enum, Instance
import numpy as np
import psutil
from itertools import combinations

"""
Small example of high CPU usage by traited classes
"""

class DataStorage(HasTraits):
    nsamples = Int(2000)
    samples = List

    def _samples_default(self):
        return np.random.randn(self.nsamples, 2000).tolist()

    def sample_samples(self, indices):
        """ return a 2D array of data at indices """
        return np.array(
            [self.samples[i] for i in indices])

class DataAccessor(HasTraits):
    """ Class that grabs data and computes something """
    measure = Enum("correlation", "covariance")
    data_source = Instance(DataStorage, ())

    def compute_measure(self, indices):
        """ example of some computation """
        samples = self.data_source.sample_samples(indices)
        percentage = psutil.cpu_percent(interval=0, percpu=True)
        if self.measure == "correlation":
            result = np.corrcoef(samples)
        elif self.measure == "covariance":
            result = np.cov(samples)
        return percentage

# Run a simulation to see cpu usage
analyzer = DataAccessor()
usage = []
n_iterations = 0
max_iterations = 500
for combo in combinations(np.arange(2000), 500):
    # evaluate the measurement on a subset of the data
    usage.append(analyzer.compute_measure(combo))
    n_iterations += 1
    if n_iterations > max_iterations:
        break

print n_iterations
use_percents = np.array(usage).T
When I run this on an 8-cpu machine running CentOS, top reports the python process at roughly 600%.
>>> use_percents.mean(1)
shows
array([ 67.05548902, 67.06906188, 66.89041916, 67.28942116,
66.69421158, 67.61437126, 99.8007984 , 67.31996008])
Question:
My computation is embarrassingly parallel, so it would be great to have the other CPUs available to split up the job. Does anyone know what's happening here? A plain-Python version of this uses 100% of a single CPU.
Is there a way to keep everything local to a single CPU without rewriting all my classes to avoid Traits?
Traits is not causing the CPU usage. It's easy to rewrite this bit of code without Traits, and you will see that you get the same pattern of CPU usage (at least, I do).
Instead, what you are probably seeing is the CPU usage of the BLAS library that your build of numpy is linked against. numpy.corrcoef() calls numpy.cov(), and much of the computation in numpy.cov() is taken up by a numpy.dot() call, which does a matrix-matrix multiplication using BLAS. If it is an optimized BLAS library, it will usually use non-Python threads internally to split these computations among your CPUs. You will have to consult the documentation of your optimized BLAS library to find out how to change this.
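For instance, most optimized BLAS builds honor environment variables that cap their internal thread pool; which variable applies depends on your library (an assumption you should verify for your build), and it must be set before numpy is imported:

import os

# Set these before importing numpy so the BLAS reads them at load time.
os.environ['OMP_NUM_THREADS'] = '1'        # OpenMP-based BLAS (e.g., ATLAS)
os.environ['OPENBLAS_NUM_THREADS'] = '1'   # OpenBLAS
os.environ['MKL_NUM_THREADS'] = '1'        # Intel MKL

import numpy as np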

What is the easiest way to generate a Control Flow-Graph for a method in Python?

I am writing a program that tries to compare two methods. I would like to generate control-flow graphs (CFGs) for all matched methods and use, e.g., a topological sort to compare the two graphs.
RPython, the translation toolchain behind PyPy, offers a way of grabbing the flow graph (in the pypy/rpython/flowspace directory of the PyPy project) for type inference.
This works quite well in most cases but generators are not supported. The result will be in SSA form, which might be good or bad, depending on what you want.
There's a Python package called staticfg which does exactly this - generation of control-flow graphs from a piece of Python code.
For instance, putting the first quick sort Python snippet from Rosetta Code in qsort.py, the following code generates its control-flow graph:
from staticfg import CFGBuilder
cfg = CFGBuilder().build_from_file('quick sort', 'qsort.py')
cfg.build_visual('qsort', 'png')
Note that it doesn't seem to understand more advanced control flow like comprehensions.
I found that py2cfg has a better representation of control-flow graphs (CFGs) than the one from staticfg:
https://gitlab.com/classroomcode/py2cfg
https://pypi.org/project/py2cfg/
Let's take this function in Python:
def fib():
    a, b = 0, 1
    while True:
        yield a
        a, b = b, a + b

fib_gen = fib()
for _ in range(10):
    next(fib_gen)
Image from staticfg: (figure omitted)
Image from py2cfg: (figure omitted)
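For reference, a sketch of how the staticfg image could be regenerated, assuming the generator snippet above is saved as fib.py (same API as the qsort example earlier):

from staticfg import CFGBuilder

cfg = CFGBuilder().build_from_file('fib', 'fib.py')
cfg.build_visual('fib', 'png')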
http://pycallgraph.slowchop.com/ looks like what you need.
Python's trace module also has an option, --trackcalls, that can be an entry point for the call-tracing machinery in the stdlib.
