Multithreading optimization in Python

I am using lmfit to do some optimization and it is tediously slow. I have a large image and I am basically running a least-squares minimization at every pixel. It seems to me that this is an ideal candidate for multithreading or some other kind of optimization, as it is painfully slow at the moment.
So, my optimization code is as follows:
I define the objective function as:
import numpy as np

def objective(params, x, data):
    s0 = params['S0']
    t1 = params['T1']
    m = s0 * (1.0 - np.exp(-x / t1))   # exponential recovery model
    return m - data                    # residual array handed back to the minimizer
So, I am trying to minimise the difference between the model and the observation. I think lmfit takes care of minimizing the sum of squares of this residual array, but I am not sure and need to check.
The main loop is as follows:
The parameters I need to estimate are initialized with some starting values as follows:
from lmfit import Minimizer, Parameters

p = Parameters()
p.add('S0', value=1.0)
p.add('T1', value=3.0)
final_data = <numpy image>
tis = np.asarray([1.0, 2.0, 3.0])
s0_data = np.zeros(final_data.shape[0])   # output arrays for the fitted parameters
t1_data = np.zeros(final_data.shape[0])
for i in range(final_data.shape[0]):
    print("Processing pixel:", i)
    minner = Minimizer(objective, params=p,
                       fcn_args=(np.asarray(tis), final_data[i, :]),
                       nan_policy='propagate')
    result = minner.minimize(method='least_squares')
    s0_data[i] = result.params['S0'].value
    t1_data[i] = result.params['T1'].value
This works fine but it is tediously slow. I was trying to figure out how to do multithreading in Python and got thoroughly confused by posts about GIL locking and claims that multithreading in Python does not really exist.
My question is:
1: Can this be scaled easily with multithreading in Python?
2: Are there any other optimizations that I can try?

As the comments suggest, multithreading here won't be very fruitful. Basically, any single fit with lmfit or scipy ends up with a single-threaded fortran routine calling your python objective function repeatedly, and using those results to generate the next step. Trying to use multithreading means that the python objective function and parameters have to be managed among the threads -- the fortran code is not geared for this, and the calculation is not really I/O bound anyway.
Multiprocessing in order to use multiple cores is a better approach. But trying to use multiprocessing for a single fit is not as trivial as it sounds, as the objective function and parameters have to be pickle-able. For your simple example, this should work, but the approach can break as the problem uses more complex objects. The dill package can help with that.
But also: there is an even easier solution for your problem, as it is naturally parallel. Just do a separate fit per pixel, each in its own process. You can use multiprocessing for that (a minimal sketch follows below), or you could even break the problem into N chunks and run N separate scripts, each fitting 1/N of the pixels.
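For illustration, here is a minimal, self-contained sketch of that per-pixel multiprocessing approach. It reuses the objective, tis, S0 and T1 from the question, but the helper name fit_pixel, the random stand-in image and the use of lmfit's minimize() convenience function (instead of an explicit Minimizer) are choices made just for this example:

import numpy as np
from multiprocessing import Pool
from lmfit import Parameters, minimize

tis = np.asarray([1.0, 2.0, 3.0])

def objective(params, x, data):
    s0 = params['S0']
    t1 = params['T1']
    return s0 * (1.0 - np.exp(-x / t1)) - data

def fit_pixel(pixel_data):
    # one completely independent fit; fresh Parameters so nothing is shared between processes
    p = Parameters()
    p.add('S0', value=1.0)
    p.add('T1', value=3.0)
    result = minimize(objective, p, args=(tis, pixel_data),
                      method='least_squares', nan_policy='propagate')
    return result.params['S0'].value, result.params['T1'].value

if __name__ == '__main__':
    final_data = np.random.rand(1000, 3)              # stand-in for the real image
    with Pool() as pool:
        fits = pool.map(fit_pixel, list(final_data))  # one task per pixel/row
    s0_data, t1_data = (np.asarray(v) for v in zip(*fits))
    print(s0_data.shape, t1_data.shape)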

Related

Efficiently executing complex Python code in R

I'm looking for an efficient way to execute Python code in a fast and flexible way in R.
The problem is that I want to execute a particular method from topological data analysis, called persistent homology, which you don't really need to know in detail to understand the problem I'm dealing with.
The thing is that computing persistent homology is not the easiest thing to do when working with large data sets, as it requires both a lot of memory and a lot of computational time.
Executing the method implemented in the R-package TDA is however very inconvenient: it crashes starting at around 800 data points and doesn't really allow me to use approximation algorithms.
In contrast, the Python package Ripser allows me to do the computation on a few thousands of points with ease in Python. Furthermore, it also allows one to provide a parameter that approximates the result for even larger data sets, and can also store the output I want. In summary, it is much more convenient to compute persistent homology with this package.
However, since everything else I'm working on is in R, it would also be a lot more convenient for me to execute the code from the Ripser package in R as well. An illustrative example of this is as follows:
# Import R libraries
library("reticulate") # Conduct Python code in R
library("ggplot2") # plotting
# Import Python packages
ripser <- import("ripser") # Python package for computing persistent homology
persim <- import("persim") # Python package for many tools used in analyzing Persistence Diagrams
# Generate data on circle
set.seed(42)
npoints <- 200
theta <- 2 * pi * runif(npoints)
noise <- cbind(rnorm(npoints, sd=0.1), rnorm(npoints, sd=0.1))
X <- data.frame(x=cos(theta) + noise[,1], y=sin(theta) + noise[,2])
# Compute persistent homology on full data
PHfull <- ripser$ripser(X, do_cocycles=TRUE) # slow if X is large
# Plot diagrams
persim$plot_diagrams(PHfull$dgms) # crashes
Now I have two problems when using this code:
The persistent homology computation performed by the ripser function works perfectly fine in this example. However, when I increase the number of data points in X, say npoints ~ 2000, the computation takes ages, compared to around 30 seconds when I perform the computation straightforwardly in Python. I don't really know what's happening behind the scenes that causes this huge difference in computational time. Is it because R is perhaps less suited than Python for this example, and instead of converting my arguments to the corresponding Python types and executing the code in Python, the Python code is converted to R code? I am looking for an approach that combines this flexibility and efficient type conversion with the speed of Python.
In Python, the analogue of the last line would plot an image based on the result of my persistent homology computation. However, executing this line causes R to crash. Is there a way to display, in R, the image that would normally be produced in Python?

Python vectorised minimisation (i.e multiple data realisations)

Suppose I have a (multivariate) function
f(x,data)
that I want to minimise, where x is the parameter vector. Ordinarily in Python I could do something like this:
scipy.optimize.minimize(f, x0, args=my_data)
However, suppose I now want to do this repeatedly, for lots of different realisations of my_data (this is for some statistical simulations so I need lots of realisations to study the behaviour of test statistics and so on).
I could do it in a big loop, but this is really slow if I have tens of thousands, up to millions, of data realisations.
So, I am wondering if there is some clever way to vectorise this rather than using a loop. I have one vague idea; say I have N data realisations and p parameters. I could make an enormous combined function with N*p parameters, which accepts a data vector of size N, and finds the minimum of this giant function all at once, where the global minimum of this giant function minimises all the individual functions simultaneously.
However this sounds like a difficult problem for most multivariate minimisers to handle. Yet, the structure of the problem is fairly simple, since each block of p parameters can be minimised entirely independently. So, I wonder, is there an algorithm in scipy or somewhere that can make use of this known dependency structure between the parameters? Or, if not, is there some other smart way to achieve a speedy minimisation of the same function repeatedly?
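For what it's worth, here is a minimal sketch of the pragmatic fallback: since each block of p parameters is independent, run the fits in parallel processes rather than trying to vectorise the minimiser itself. The quadratic toy objective f and the helper fit_one are placeholders for illustration only:

import numpy as np
from multiprocessing import Pool
from scipy.optimize import minimize

def f(x, data):
    # toy objective: squared distance between the parameter vector and one data realisation
    return np.sum((x - data) ** 2)

def fit_one(data):
    x0 = np.zeros_like(data)
    return minimize(f, x0, args=(data,)).x

if __name__ == '__main__':
    realisations = np.random.randn(1000, 3)                   # N realisations, p = 3 parameters
    with Pool() as pool:
        estimates = np.asarray(pool.map(fit_one, realisations))  # independent fits, one per realisation
    print(estimates.shape)                                    # (1000, 3)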

Python Multiprocessing/EM

I did a machine learning Expectation Maximization algorithm in Python, basically an implementation of IBM Model 1 for doing machine translation (here is my GitHub if you want to look at the code) and it works, but reeeaaaaallly sloowwwlly. I'm taking a class now in parallel computing and I was wondering if I could use Python multiprocessing to reach convergence faster. Can anyone give me any pointers or tips? I don't even know where to start.
EDIT: I was reading around and found this paper on using EM with MapReduce to do parallelization -- maybe this is a better idea?
Most of your problem is that Python is really slow. Remember, your code is executing in an interpreter. When you write code (such as line 82 of your script) that performs a numerical computation one element at a time, you pay for that one computation plus all the overhead of the Python interpreter.
The first thing you will want to do is vectorize your code with numpy. Unlike your normal Python code, numpy calls out to precompiled, efficient binary code. The more work you can push into numpy, the less time you will waste in the interpreter.
Once you vectorize your code, you can then start profiling it if it's still too slow. You should be able to find a lot of simple examples on how to vectorize Python code, along with some of the alternative options.
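To make the vectorization point concrete, here is a tiny, self-contained comparison; the array sizes and the a * b + 1.0 operation are arbitrary examples, not anything from the original code:

import time
import numpy as np

a = np.random.rand(1_000_000)
b = np.random.rand(1_000_000)

t0 = time.perf_counter()
out = np.empty_like(a)
for i in range(a.size):              # interpreter overhead paid on every single element
    out[i] = a[i] * b[i] + 1.0
t1 = time.perf_counter()

vec = a * b + 1.0                    # one call into numpy's precompiled loops
t2 = time.perf_counter()

print(f"python loop: {t1 - t0:.3f}s   numpy: {t2 - t1:.3f}s")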
EDIT: Let me clarify that parallelizing inherently slow code is mostly pointless. First, there is the issue that parallelizing slow code gives the false impression that you have made an improvement. The "scaling up" of parallel code should always be measured against the fastest possible single-threaded version of the same code (within reason; no need to write everything in assembly before starting any parallel code). For example, consider a lock under contention. The more threads fighting for the lock, the slower the code will run, and you will get no (or negative) performance gains. One way to reduce contention for the lock is to simply slow down the code competing for it. This makes it appear as if there is no overhead from lock contention, when in actuality you have made no improvement, because the fastest single-threaded version of your code will still outperform your parallel code.
Also, Python really isn't a great language to learn how to write parallel code in. Python has the GIL, which essentially forces all multithreaded Python code to run as if there were only one CPU core. This means bizarre hacks (such as the one you linked) must be used, which have their own additional drawbacks and issues (there are times when such tricks are needed or used, but they shouldn't be the default for running code on a single machine). Don't expect what you learn writing parallel Python code to carry over to other languages or help you with your course.
I think you will have some good success depending on where your bottleneck is. One caveat: when I do code optimization I always like to profile the code, even informally, to get an idea of where the bottlenecks are. This will help identify where the time is being spent, e.g. file I/O, network latency, resource contention, not enough CPU cycles, etc.
For others who may not be familiar with the Expectation Maximization algorithm a very nice introduction is in Motion Segmentation using EM - a short tutorial, by Yair Weiss. Let us assume we have M data points and N classes/models.
In the EM algorithm there are two steps: Computing the distance between data points and models and Updating our model weights using weighted least squares.
Step 1 - Expectation stage
for data_point in M:
    for current_model in N:
        compute distance or residual between data_point and current_model
Step 2 - Maximization stage
for each model, compute weighted least squares solving for the model parameters
This requires solving N weighted least-squares problems, whose size depends on the number of parameters in the model being solved for.
Your bottleneck may be in stage 1, the E-step, where the residuals or distances between the data points and the models are computed. In this stage the computations are all independent, so I would consider the first stage embarrassingly parallel and quite amenable to parallel computation using a parallel map-reduce or some other tool in Python (a small sketch follows below). I have had good success using IPython for such tasks, but there are other good Python packages as well.
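Here is a hedged sketch of parallelising just that E-step with a plain multiprocessing pool; the model and data arrays and the residuals_for_point helper are invented placeholders, and in practice you would hand each worker a chunk of points rather than one point at a time:

import numpy as np
from multiprocessing import Pool

models = np.linspace(0.0, 1.0, 15).reshape(5, 3)             # N = 5 models, 3 parameters each
data_points = np.linspace(0.0, 1.0, 3000).reshape(1000, 3)   # M = 1000 data points

def residuals_for_point(point):
    # distance from one data point to every model; independent of all other points
    return np.linalg.norm(models - point, axis=1)

if __name__ == '__main__':
    with Pool() as pool:
        R = np.asarray(pool.map(residuals_for_point, list(data_points)))  # shape (M, N)
    print(R.shape)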

Memory sharing in parallelized python code

I am a college freshman and Python newbie so bear with me. I am trying to parallelize some matrix operations. Here is my attempt using the ParallelPython module:
def testfunc(connectionMatrix, qCount, iCount, Htry, tStepCount):
    # matrix-vector product for one block of the system
    test = connectionMatrix[0:qCount, 0:iCount].dot(Htry[tStepCount-1, 0:iCount])
    return test

# job_server is an existing pp.Server() instance
f1 = job_server.submit(testfunc, (self.connectionMatrix, self.qCount, self.iCount, self.iHtry, self.tStepCount), modules=("scipy.sparse",))
f2 = job_server.submit(testfunc, (self.connectionMatrix, self.qCount, self.iCount, self.didtHtry, self.tStepCount), modules=("scipy.sparse",))
r1 = f1()
r2 = f2()
self.qHtry[self.tStepCount, 0:self.qCount] = self.qHtry[self.tStepCount-1, 0:self.qCount] + self.delT * r1 + 0.5 * (self.delT**2) * r2
Plotting matrix size against percent speed-up gives something like a bell curve: it caps out at about a 30% speed increase for 100x100 matrices, smaller and larger matrices give less of an increase, and for small enough or large enough matrices the serial code is faster. My guess is that the problem lies in the passing of the arguments: the overhead of copying the large matrix is actually taking longer than the job itself. What can I do to get around this? Is there some way to incorporate memory sharing and pass the matrix by reference? As you can see, none of the arguments are modified, so it could be read-only access.
Thanks.
Well, the point of ParallelPython is that you can write code that doesn't care whether it's distributed across threads, processes, or even multiple computers, and using memory sharing would break that abstraction.
One option is to use something like a file on a shared filesystem, where you mmap that file in each worker. Of course that's more complicated, and whether it's better or worse will depend on a lot of details about the filesystem, sharing protocol, and network, but it's an option.
If you're willing to give up the option of distributed processing, you can use multiprocessing.Array (or multiprocessing.Value, or multiprocessing.sharedctypes) to access shared memory (a sketch follows at the end of this answer). But at that point you might want to consider just using multiprocessing instead of ParallelPython for the job distribution, since multiprocessing is part of the standard library and has a more powerful API, and you're explicitly giving up the one major advantage of ParallelPython.
Or you could combine the two options, for the worst of both worlds in many ways, but maybe the best in terms of how little you need to change your existing code: Just use a local file and mmap it.
However, before you do any of this, you may want to consider profiling to see if copying the matrix really is the bottleneck. And, if it is, you may want to consider whether there's an algorithmic fix, just copying the part each job needs instead of copying the entire matrix. (Whether that makes sense depends on whether the part each job needs is significantly less than the whole thing, of course.)
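Here is a minimal sketch of the multiprocessing.Array route mentioned above, assuming you do switch to multiprocessing for the job distribution; the init_worker/row_sum names and the row-sum task are placeholders, and the only point is that the matrix lives in shared memory once while each worker gets a read-only numpy view instead of a pickled copy:

import numpy as np
from multiprocessing import Array, Pool

ROWS, COLS = 2000, 2000

def init_worker(shared_buf):
    # runs once in each worker: wrap the shared buffer in a numpy view, no copying
    global shared
    shared = np.frombuffer(shared_buf, dtype=np.float64).reshape(ROWS, COLS)

def row_sum(i):
    return shared[i].sum()           # read-only access to the shared matrix

if __name__ == '__main__':
    shared_buf = Array('d', ROWS * COLS, lock=False)    # raw shared-memory buffer
    matrix = np.frombuffer(shared_buf, dtype=np.float64).reshape(ROWS, COLS)
    matrix[:] = np.random.rand(ROWS, COLS)              # fill it once in the parent
    with Pool(initializer=init_worker, initargs=(shared_buf,)) as pool:
        sums = pool.map(row_sum, range(ROWS))
    print(len(sums))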

Python multiprocessing design

I have written an algorithm that takes geospatial data and performs a number of steps. The input data are a shapefile of polygons and covariate rasters for a large raster study area (~150 million pixels). The steps are as follows:
Sample points from within polygons of the shapefile
For each sampling point, extract values from the covariate rasters
Build a predictive model on the sampling points
Extract covariates for target grid points
Apply predictive model to target grid
Write predictions to a set of output grids
The whole process needs to be iterated a number of times (say 100) but each iteration currently takes more than an hour when processed in series. For each iteration, the most time-consuming parts are step 4 and 5. Because the target grid is so large, I've been processing it a block (say 1000 rows) at a time.
I have a 6-core CPU with 32 GB of RAM, so within each iteration I had a go at using Python's multiprocessing module with a Pool object to process a number of blocks simultaneously (steps 4 and 5) and then write the output (the predictions) to the common set of output grids using a callback function that calls a global output-writing function. This seems to work, but is no faster (actually, it's probably slower) than processing each block in series.
So my question is, is there a more efficient way to do it? I'm interested in the multiprocessing module's Queue class, but I'm not really sure how it works. For example, I'm wondering if it's more efficient to have a queue that carries out steps 4 and 5 then passes the results to another queue that carries out step 6. Or is this even what Queue is for?
Any pointers would be appreciated.
The current state of Python's multiprocessing capabilities is not great for CPU-bound processing. I'm afraid to tell you that there is no way to make it run faster using the multiprocessing module, nor is your use of multiprocessing the problem.
The real problem is that Python is still bound by the rules of the Global Interpreter Lock (GIL) (I highly suggest the slides). There have been some exciting theoretical and experimental advances on working around the GIL. Python 3.2 even contains a new GIL which solves some of the issues, but introduces others.
For now, it is faster to execute many Python processes, each with a single thread, than to attempt to run many threads within one process. This lets you avoid the issue of acquiring the GIL between threads (by effectively having more GILs). It is, however, only beneficial if the IPC overhead between your Python processes doesn't eclipse the benefits of the processing.
Eli Bendersky wrote a decent overview article on his experiences with attempting to make a CPU bound process run faster with multiprocessing.
It is worth noting that PEP 371 had the desire to 'side-step' the GIL with the introduction of the multiprocessing module (previously a non-standard package named pyProcessing). However, the GIL still seems to play too large a role in the Python interpreter to make it work well with CPU-bound algorithms. Many different people have worked on removing or rewriting the GIL, but nothing has gained enough traction to make it into a Python release.
Some of the multiprocessing examples at python.org are not very clear IMO, and it's easy to start off with a flawed design. Here's a simplistic example I made to get me started on a project:
import os, time, random, multiprocessing

def busyfunc(runseconds):
    # spin the CPU for roughly `runseconds` seconds
    # (time.perf_counter replaces the old time.clock, removed in Python 3.8)
    starttime = int(time.perf_counter())
    while 1:
        for randcount in range(0, 100):
            testnum = random.randint(1, 10000000)
            newnum = testnum / 3.256
        newtime = int(time.perf_counter())
        if newtime - starttime > runseconds:
            return

def main(arg):
    print('arg from init:', arg)
    print("I am " + multiprocessing.current_process().name)
    busyfunc(15)

if __name__ == '__main__':
    p = multiprocessing.Process(name="One", target=main, args=('passed_arg1',))
    p.start()
    p = multiprocessing.Process(name="Two", target=main, args=('passed_arg2',))
    p.start()
    p = multiprocessing.Process(name="Three", target=main, args=('passed_arg3',))
    p.start()
    time.sleep(5)
This should exercise 3 processors for 15 seconds. It should be easy to modify it for more. Maybe this will help to debug your current code and ensure you are really generating multiple independent processes.
If you must share data due to RAM limitations, then I suggest this:
http://docs.python.org/library/multiprocessing.html#sharing-state-between-processes
As Python is not really meant for intensive number-crunching, I typically start converting time-critical parts of a Python program to C/C++ and speed things up a lot.
Also, Python's multithreading is not very good. Python keeps using a global semaphore for all kinds of things, so even when you use the threads that Python offers, things won't get faster. Threads are useful for applications where they typically wait for things like I/O.
When making a C module, you can manually release the global semaphore when processing your data (then, of course, do not access the python values anymore).
It takes some practice to use the C API, but it's clearly structured and much easier to use than, for example, the Java native API.
See 'extending and embedding' in the python documentation.
This way you can write the time-critical parts in C/C++ and the slower parts, with faster development, in Python...
I recommend you first check which aspects of your code are taking the most time, so you're going to have to profile it. I've used http://packages.python.org/line_profiler/#line-profiler with much success, though it does require Cython.
As for Queues, they're mostly used for sharing data and synchronizing between processes, though I've rarely used them. I do use multiprocessing all the time.
I mostly follow the map-reduce philosophy, which is simple and clean, but it has some major overhead, since values have to be packed into dictionaries and copied to each process when applying the map function...
You can try segmenting your file and applying your algorithm to different sets (a rough sketch of this follows below).
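A rough sketch of that block-by-block idea for steps 4-6, where predict_block() is a hypothetical stand-in for "extract covariates for one block of rows and apply the fitted model", and only the parent process writes to the output grid:

import numpy as np
from multiprocessing import Pool

N_ROWS, BLOCK = 150_000, 1_000

def predict_block(start):
    rows = min(BLOCK, N_ROWS - start)
    # placeholder for steps 4 and 5 on rows [start, start + rows)
    return start, np.zeros(rows)

if __name__ == '__main__':
    output = np.zeros(N_ROWS)                          # the output grid, owned by the parent
    starts = range(0, N_ROWS, BLOCK)
    with Pool() as pool:
        for start, preds in pool.imap_unordered(predict_block, starts):
            output[start:start + len(preds)] = preds   # step 6 happens only in the parent
    print(output.shape)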
