Optimizing a multithreaded numpy array function


Given two large arrays of 3D points (I'll call the first "source" and the second "destination"), I needed a function that returns, for each element of "source", the index of its closest element in "destination", with one limitation: I can only use numpy. So no scipy, pandas, numexpr, cython...
To do this I wrote a function based on the "brute force" answer to this question. I iterate over the elements of source, find the closest element in destination and return its index. Due to performance concerns, and again because I can only use numpy, I tried multithreading to speed it up. Here are the threaded and unthreaded functions, and how they compare in speed on an 8-core machine.
import timeit
import numpy as np
from numpy.core.umath_tests import inner1d
from multiprocessing.pool import ThreadPool
def threaded(sources, destinations):
    # Define worker function
    def worker(point):
        dlt = (destinations-point) # delta between destinations and given point
        d = inner1d(dlt,dlt) # get squared distances
        return np.argmin(d) # return closest index
    # Multithread!
    p = ThreadPool()
    return p.map(worker, sources)
def unthreaded(sources, destinations):
    results = []
    #for p in sources:
    for i in range(len(sources)):
        dlt = (destinations-sources[i]) # difference between destinations and given point
        d = inner1d(dlt,dlt) # get squared distances
        results.append(np.argmin(d)) # append closest index
    return results
# Setup the data
n_destinations = 10000 # 10k random destinations
n_sources = 10000 # 10k random sources
destinations= np.random.rand(n_destinations,3) * 100
sources = np.random.rand(n_sources,3) * 100
#Compare!
print 'threaded: %s'%timeit.Timer(lambda: threaded(sources,destinations)).repeat(1,1)[0]
print 'unthreaded: %s'%timeit.Timer(lambda: unthreaded(sources,destinations)).repeat(1,1)[0]
Results:
threaded: 0.894030461056
unthreaded: 1.97295164054
Multithreading seems beneficial, but I was hoping for more than a 2x speedup, given that the real-life datasets I deal with are much larger.
All recommendations to improve performance (within the limitations described above) will be greatly appreciated!

OK, I've been reading the Maya documentation on Python and I came to these conclusions/guesses:
They're probably using CPython inside (there are several references to its documentation and none to any other implementation).
They're not fond of threads (lots of non-thread-safe methods).
Given the above, I'd say it's better to avoid threads. The GIL makes this a common problem, and there are several ways around it:
Try to build a C/C++ extension. Once that is done, use threads in C/C++. Personally, I'd only try to get SIP to work, and then move on.
Use multiprocessing. Even if your custom Python distribution doesn't include it, you can get to a working version since it is largely pure Python code. multiprocessing is not affected by the GIL since it spawns separate processes.
One of the above should work for you. If not, try another parallel tool (after some serious praying).
On a side note, if you're using outside modules, be careful to match the Python version Maya uses. A mismatch may be why you couldn't build scipy. Of course, scipy has a huge codebase and Windows is not the most forgiving platform to build on.
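Whichever parallel route is taken, part of the cost in the question's loops is the Python-level iteration over every source point. The following is a minimal numpy-only sketch of a different technique (not the one in the question): it processes sources in chunks and uses the expansion |s - d|^2 = |s|^2 - 2 s.d + |d|^2, dropping |s|^2 because it does not affect the argmin. The name closest_indices and the chunk_size parameter are illustrative assumptions, and the sketch assumes a chunk_size x n_destinations temporary array fits in memory.

```python
import numpy as np

def closest_indices(sources, destinations, chunk_size=256):
    """For each source point, return the index of the nearest destination."""
    sources = np.asarray(sources, dtype=float)
    destinations = np.asarray(destinations, dtype=float)
    # |d|^2 for every destination, computed once up front.
    dest_sq = np.einsum('ij,ij->i', destinations, destinations)
    out = np.empty(len(sources), dtype=np.intp)
    for start in range(0, len(sources), chunk_size):
        block = sources[start:start + chunk_size]
        # Score = |d|^2 - 2 s.d; |s|^2 is constant per row, so it is omitted.
        scores = dest_sq - 2.0 * block.dot(destinations.T)
        out[start:start + chunk_size] = scores.argmin(axis=1)
    return out

# Small usage example with random data (sizes chosen only for illustration).
rng = np.random.RandomState(0)
example_sources = rng.rand(1000, 3) * 100
example_destinations = rng.rand(1000, 3) * 100
example_result = closest_indices(example_sources, example_destinations)
```

The matrix product is handed to BLAS, which may itself use several cores. Because the expansion is less numerically exact than subtracting first, near-ties can resolve differently from the brute-force loop. The chunks are independent, so they could also be handed to separate workers if that turns out to help.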

Related

How to parallelize calculations of celestial bodies motion?

I have a piece of code that calculates the positions of some satellites and planets using Skyfield. For clarity, I use a Pandas DataFrame as a container for the positions and the corresponding times. I want to run the calculation in parallel, but I always get the same error: TypeError: can't pickle Satrec objects. I tested different parallelizers: Dask, pandarallel, swifter and Pool.map().
Example of the code to be parallelized:
def get_sun_position(self, row):
    t = self.ts.utc(row["Date"]) # from skyfield
    pos = self.earth.at(t).observe(self.sun).apparent().position.m # from skyfield, error is here
    return pos
def get_sat_position(self, row):
    t = self.ts.utc(row["Date"]) # from skyfield
    pos = self.sat.at(t).position.m # from skyfield, error is here
    return pos
def get_positions(self):
    self.df["sat_pos"] = self.df.swifter.apply(self.get_sat_position, axis=1) # all the parallelization goes here
    self.df["sun_pos"] = self.df.swifter.apply(self.get_sun_position, axis=1) # and here
    # the same implementation but using dask
    # self.df["sat_pos"] = dd.from_pandas(self.df, npartitions=4*cpu_count())\
    #     .map_partitions(lambda df : df.apply(lambda row : self.get_sat_position(row),axis=1))\
    #     .compute(scheduler='processes')
    # self.df["sun_pos"] = dd.from_pandas(self.df, npartitions=4*cpu_count())\
    #     .map_partitions(lambda df : df.apply(lambda row : self.get_sun_position(row),axis=1))\
    #     .compute(scheduler='processes')
For Dask, to avoid pickle I tried setting the serialization manually with serializers=['dask', 'pickle'], but it didn't help.
As I understand it, Skyfield uses sgp4, which contains the Satrec class.
Is there some way to parallelize this .apply()? Or should I not try to use Skyfield functions in parallel processing at all?

Alas, all of the mechanisms you are using to make the computation parallel do so by creating another process and then sending copies of all of the objects involved in the computation over to the other process — and the Satrec object is written in C++, not Python, to make it faster, and C++ objects have no native way to "serialize" themselves into bytes for transmission to another process. (Python objects have that ability built-in.)
Have you profiled your code to see what the most expensive steps are? My guess is that most of your expense is in the Sun computation, because to achieve its high precision Skyfield needs to compute the Earth's orientation to very high accuracy to give the Sun's position in the sky to high enough precision for even radio astronomers.
But if you yourself don't need that high an accuracy, you could switch to lower-precision sky coordinates for the Sun. Before using t in get_sun_position(), try doing this to it:
from skyfield.nutationlib import iau2000b
t._nutation_angles = iau2000b(t.tt)
That will use a lower precision estimate of the Earth's nutation (print out the values before and after this change to see how big the difference is, and compare that to how much inaccuracy your application can stand), but also hopefully run faster.
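The profiling step suggested above can be sketched with the standard library's cProfile. This is only an illustration: cheap_step, expensive_step and compute_row are hypothetical stand-ins for the per-row Skyfield calls, not part of the original code.

```python
import cProfile
import io
import pstats

def cheap_step(n):
    # Hypothetical stand-in for an inexpensive per-row computation.
    return sum(range(n))

def expensive_step(n):
    # Hypothetical stand-in for the costly per-row computation.
    return sum(i * i for i in range(n * 50))

def compute_row(n):
    return cheap_step(n) + expensive_step(n)

def profile_rows(rows):
    """Run compute_row over rows under cProfile and return (results, report)."""
    profiler = cProfile.Profile()
    profiler.enable()
    results = [compute_row(n) for n in rows]
    profiler.disable()
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    # Sort by cumulative time and keep only the top entries.
    stats.sort_stats("cumulative").print_stats(5)
    return results, stream.getvalue()

results, report = profile_rows([1000] * 20)
print(report)
```

If the report shows one function dominating the cumulative time, that function is the place to reduce precision or cache work before attempting any parallelization.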

Python genetic optimisation multiprocessing with a global constant variable, how to speed up?

I am writing a genetic optimization algorithm based on the deap package in python 2.7 (goal is to migrate to python 3 soon). As it is a pretty heavy process, some parts of the optimisation are processed using the multiprocessing package. Here is a summary outline of my program:
1. Configurations are read in and saved in a config object
2. Some additional pre-computations are made and saved as well in the config object
3. The optimisation starts (population is initialized randomly and mutations, crossover is applied to find a better solution) and some parts of it (evaluation function) are executed in multiprocessing
4. The results are saved
For the evaluation function, we need to have access to some parts of the config object (which after phase 2 stays a constant). Therefore we make it accessible to the different cores using a global (constant) variable:
from deap import base
import multiprocessing
toolbox = base.Toolbox()
def evaluate(ind):
    # compute evaluation using config object
    return(obj1,obj2)
toolbox.register('evaluate',evaluate)
def init_pool_global_vars(self, _config):
    global config
    config = _config
...
# setting up multiprocessing
pool = multiprocessing.Pool(processes=72, initializer=self.init_pool_global_vars,
                            initargs=[config])
toolbox.register('map', pool.map_async)
...
while tic < max_time:
    # creating new individuals
    # computing in optimisation the objective function on the different individuals
    jobs = toolbox.map(toolbox.evaluate, ind)
    fits = jobs.get()
    # keeping best individuals
We basically run iterations (a big loop) until a maximum time is reached. I have noticed that if I make the config object bigger (i.e. add big attributes to it, like a big numpy array), the program runs much slower (fewer iterations in the same timespan) even though the code is otherwise unchanged. So I thought I would make a dedicated config_multiprocessing object that contains only the attributes needed in the multiprocessing part and pass that as the global variable. But when I run it on 3 cores it is slower than with the big config object, and on 72 cores it is only slightly faster.
What should I do in order to make sure my loops don't suffer in speed from the config object or from any other data manipulations I make before launching the multiprocessing loops?
Running in a Linux Docker image on a Linux VM in the cloud.

The joblib package is designed to handle cases where you have large numpy arrays to distribute to workers with shared memory. This is especially useful if you are treating the data in shared memory as "read-only" like what you describe in your scenario. You can also create writable shared memory as described in the docs.
Your code might look something like:
import os
import numpy as np
from joblib import Parallel, delayed
from joblib import dump, load
folder = './joblib_memmap'
try:
    os.mkdir(folder)
except FileExistsError:
    pass
def evaluate(ind, data):
    # compute evaluation using shared memory data
    return(obj1, obj2)
# just used to initialize memory mapped data
def init_memmap_data(original_data):
    data_filename_memmap = os.path.join(folder, 'data_memmap')
    dump(original_data, data_filename_memmap)
    shared_data = load(data_filename_memmap, mmap_mode='r')
    return shared_data
...
# however you set up indices needs to be changed here
indexes = range(10)
# however you load your numpy data needs to be done here
shared_data = init_memmap_data(numpy_array_to_share)
# change n_jobs as appropriate
results = Parallel(n_jobs=2)(delayed(evaluate)(ind, shared_data) for ind in indexes)
# get index of the maximum as the "best" individual
# (results is a plain list of tuples; ranking by the first objective is only an example)
best_fit_individual = indexes[np.argmax([fit[0] for fit in results])]
Additionally, joblib supports a threading backend that may be faster than the process based one. It will be easy to test both with joblib.
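A minimal sketch of the threading backend mentioned above, assuming joblib and numpy are installed. The evaluate function here is a hypothetical fitness (the sum of one row), chosen only so the example runs. With threads the workers see the same array object, so nothing is copied or pickled per task.

```python
import numpy as np
from joblib import Parallel, delayed

def evaluate(ind, data):
    # Hypothetical fitness: sum of the selected row of the shared array.
    return float(data[ind].sum())

# Stand-in for the large read-only array held by the config object.
data = np.arange(12, dtype=float).reshape(4, 3)
indexes = range(4)

# backend="threading" runs the tasks in threads of the current process.
results = Parallel(n_jobs=2, backend="threading")(
    delayed(evaluate)(ind, data) for ind in indexes
)
best_fit_individual = indexes[int(np.argmax(results))]
```

Threads only help if evaluate spends most of its time in code that releases the GIL (for example large numpy operations), so it is worth timing both backends on the real evaluation function.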

Multiprocessing and rpy2 (with ape)

I ran into this today and can't figure out why. I have several functions chained together that perform some time consuming operations as part of a larger pipeline. I've included these here, pared down to a test example, as best as I could. The issue is that when I call a function directly, I get the expected output (e.g., 5 different trees). However, when I call the same function in a multiprocessing pool with apply_async (or apply, doesn't matter), I get 5 trees, but they are all the same.
I've documented this in an IPython notebook, which can be viewed here: http://nbviewer.ipython.org/gist/cfriedline/0e275d528ff1a8d674c6
In cell 91, I create 5 trees (each with 10 tips), and return two lists. The first containing the non-multiprocessing trees, and the second from apply_async.
In cell 92, you can see the results of creating trees without multiprocessing, and in 93, with multiprocessing.
What I expect is that there would be a total of 10 different trees between the two tests, but instead all of the multiprocessing trees are identical. Makes little sense to me.
Relevant versions of things:
Linux 2.6.18-238.12.1.el5 x86_64 GNU/Linux
Python 2.7.6 :: Anaconda 1.9.2 (64-bit)
IPython 2.0.0
Rpy2 2.3.9
Thanks!
Chris

I solved this one, with a point in the right direction from @mgilson. It was in fact a random number problem, just not in Python but in R (sigh). The state of R is copied when the Pool is created, and so is its random seed. To fix it, a little rpy2 as below calls R's set.seed function (with some process-specific stuff for good measure):
import os
import time
import rpy2.robjects
# _get_random_string and ape_to_dendropy are helpers defined elsewhere in the notebook
def create_tree(num_tips, type):
    """
    creates the taxa tree in R
    @param num_tips: number of taxa to create
    @param type: type for naming (e.g., 'taxa')
    @return: a dendropy Tree
    @rtype: dendropy.Tree
    """
    r = rpy2.robjects.r
    set_seed = r('set.seed')
    set_seed(int((time.time()+os.getpid()*1000)))
    rpy2.robjects.globalenv['numtips'] = num_tips
    rpy2.robjects.globalenv['treetype'] = type
    name = _get_random_string(20)
    if type == "T":
        r("%s = rtree(numtips, rooted=T, tip.label=paste(treetype, seq(1:(numtips)), sep=''))" % name)
    else:
        r("%s = rtree(numtips, rooted=F, tip.label=paste(treetype, seq(1:(numtips)), sep=''))" % name)
    tree = r[name]
    return ape_to_dendropy(tree)

I'm not 100% familiar with these libraries, however, on Linux, (IIRC) multiprocessing uses os.fork. This means that the state of the random module (which you're using) will also be forked and that each of your processes will generate the same sequence of random numbers resulting in a not-so-random _get_random_string function.
If I'm right, and you make the pool smaller than the number of trees that you want, you should see that you get groups of N identical trees (where N is the number of processes in the pool).
I think that probably the ideal solution is to re-seed the random number generator inside of each of the processes. It's unlikely that they'll run at exactly the same time, so you should get differing results.
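The effect described above can be illustrated without forking. This is a simulation under stated assumptions: it copies a generator's state into two objects to mimic what workers inherit, rather than starting real processes. The helper names worker_seed and draws and the process ids are made up for the example. Whether a particular generator is affected after a real fork depends on the library and its version.

```python
import os
import random
import time

def worker_seed(pid, now):
    """Seed in the spirit of int(time.time() + os.getpid() * 1000)."""
    return int(now + pid * 1000)

def draws(rng, n=3):
    return [rng.random() for _ in range(n)]

# State the parent holds at the moment the pool is created.
parent_state = random.Random(12345).getstate()

# Simulated workers that inherit the parent's state unchanged.
inherited_a = random.Random()
inherited_a.setstate(parent_state)
inherited_b = random.Random()
inherited_b.setstate(parent_state)
identical = draws(inherited_a) == draws(inherited_b)

# Simulated workers that reseed on startup; the pids are invented.
now = time.time()
reseeded_a = random.Random(worker_seed(1001, now))
reseeded_b = random.Random(worker_seed(1002, now))
distinct = draws(reseeded_a) != draws(reseeded_b)

print(identical, distinct)
# In a real worker the call would be:
# random.seed(worker_seed(os.getpid(), time.time()))
```

Mixing the process id into the seed matters because workers started in the same second would otherwise get the same time-based seed.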

multiple cpu usage when accessing data attached to traited classes

I have an application that uses a number of classes inheriting from HasTraits. Some of these classes manage access to data and others provide functions for analyzing that data. This works wonderfully for a gui -- I can check that the data and analysis code is doing what it should. However, I've noticed that when I use these classes for gui-less computations, all the cpus on the system end up getting used.
Here is a small example that shows the cpu usage:
from traits.api import HasTraits, List, Int, Enum, Instance
import numpy as np
import psutil
from itertools import combinations
"""
Small example of high CPU usage by traited classes
"""
class DataStorage(HasTraits):
    nsamples = Int(2000)
    samples = List
    def _samples_default(self):
        return np.random.randn(self.nsamples,2000).tolist()
    def sample_samples(self,indices):
        """ return a 2D array of data at indices """
        return np.array(
            [self.samples[i] for i in indices])
class DataAccessor(HasTraits):
    """ Class that grabs data and computes something """
    measure = Enum("correlation","covariance")
    data_source = Instance(DataStorage,())
    def compute_measure(self,indices):
        """ example of some computation """
        samples = self.data_source.sample_samples(indices)
        percentage = psutil.cpu_percent(interval=0, percpu=True)
        if self.measure == "correlation":
            result = np.corrcoef(samples)
        elif self.measure == "covariance":
            result = np.cov(samples)
        return percentage
# Run a simulation to see cpu usage
analyzer = DataAccessor()
usage = []
n_iterations = 0
max_iterations = 500
for combo in combinations(np.arange(2000),500):
    # evaluate the measurement on a subset of the data
    usage.append(analyzer.compute_measure(combo))
    n_iterations += 1
    if n_iterations > max_iterations:
        break
print n_iterations
use_percents = np.array(usage).T
When I run this on an 8-cpu machine running CentOS, top reports the python process at roughly 600%.
>>> use_percents.mean(1)
shows
array([ 67.05548902, 67.06906188, 66.89041916, 67.28942116,
66.69421158, 67.61437126, 99.8007984 , 67.31996008])
Question:
My computation is embarrassingly parallel, so it would be great to have the other cpus available to split up the job. Does anyone know what's happening here? A plain python version of this uses 100% on a single cpu.
Is there a way to keep everything local to a single cpu without rewriting all my classes without traits?

Traits is not causing the CPU usage. It's easy to rewrite this bit of code without Traits, and you will see that you get the same pattern of CPU usage (at least, I do).
Instead, what you are probably seeing is the CPU usage of the BLAS library that your build of numpy is linked against. numpy.corrcoef() calls numpy.cov(), and much of the computation of numpy.cov() is taken up by a numpy.dot() call, which does a matrix-matrix multiplication using BLAS. If it is an optimized BLAS library, then it will usually use non-Python threads internally to split up these computations among your CPUs. You will have to consult the documentation of your optimized BLAS library to find out how to change this.
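One common way to limit BLAS threading is through environment variables, sketched below. Which variable takes effect is an assumption that depends on the BLAS your numpy is linked against: OMP_NUM_THREADS for OpenMP-based builds, OPENBLAS_NUM_THREADS for OpenBLAS, MKL_NUM_THREADS for Intel MKL. They generally need to be set before numpy is first imported in the process.

```python
import os

# Set these before the first "import numpy" in the process; variables
# that do not apply to the linked BLAS are simply ignored.
for var in ("OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS", "MKL_NUM_THREADS"):
    os.environ[var] = "1"

import numpy as np

# Same kind of call as in the question, on a smaller array.
samples = np.random.randn(50, 200)
result = np.corrcoef(samples)
```

The same variables can be exported in the shell before launching Python, which avoids any dependence on import order.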

more efficient Python scripting in Blender3D

I am basically building a 3D scatter plot using primitive UV spheres and am running into memory issues when attempting to create more than a couple of hundred points at one time. I am limited by my laptop's 2.1 GHz processor, but wanted to know if there is a better way to write this:
import bpy
import random
count = 0  # counter was not initialized in the original snippet
while count < 5:
    bpy.ops.mesh.primitive_uv_sphere_add(size=.3,
        location=(random.randint(-9,9), random.randint(-9,9),
                  random.randint(-9,9)), rotation=(0,0,0))
    count += 1
I realize that with such a simple script any performance increase is likely negligible but wanted to give it a shot anyway.

Some possible suggestions:
I would pre-calculate the x, y, z values, store each set in a mathutils Vector, and add the vectors to a dict to be iterated over.
Duplication should have a smaller memory footprint than instantiating new objects:
bpy.ops.object.duplicate_move(OBJECT_OT_duplicate={"linked": False}, TRANSFORM_OT_translate={"value": (x, y, z)})
Edit:
On further research, it appears that each time a bpy.ops.* operator is called, the redraw function is called as well. One user documented an exponential increase in the time taken to generate UV spheres.
CoDEmanX provided the following code snippet to another user:
import bpy
bpy.ops.object.select_all(action='DESELECT')
bpy.ops.mesh.primitive_uv_sphere_add()
sphere = bpy.context.object
for i in range(-1000, 1000, 2):
    ob = sphere.copy()
    ob.location.y = i
    #ob.data = sphere.data.copy() # uncomment this, if you want full copies and no linked duplicates
    bpy.context.scene.objects.link(ob)
bpy.context.scene.update()
Then it is just a case of adapting the code to set each object's location from the pre-calculated dict:
ob.location = location_dict[i]
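The pre-calculation step suggested in this answer can be sketched outside Blender. This is an illustration under assumptions: plain tuples stand in for mathutils Vectors (mathutils is only available inside Blender), and make_locations with its parameters is a hypothetical helper, using the same -9 to 9 range as the question.

```python
import random

def make_locations(n_points, low=-9, high=9, seed=None):
    """Return {index: (x, y, z)} with random integer coordinates."""
    rng = random.Random(seed)
    return {
        i: (rng.randint(low, high), rng.randint(low, high), rng.randint(low, high))
        for i in range(n_points)
    }

# Build every location up front, before any Blender object is created.
location_dict = make_locations(5, seed=0)
```

Inside Blender, each copied object would then be assigned its entry from location_dict within the duplication loop.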
