I am using the scipy minimize function. The function that it's calling was compiled with Cython and has an underlying C++ implementation that I wrote, but that shouldn't really matter. For some reason when I run my program, it creates as many threads as it can to fill all my cpus. For example if I run top I see that 800% of a cpu is being used or on htop I can see that 8 individual processors are being used, when I only created the program to be run on one. I didn't think that scipy even had parallel processing functionality and I can't find any documentation related to this. What could possible be going on and is there any way to control it?
If some BLAS-implementation (with threading-support) is available (default on Ubuntu for example), some expressions like np.dot() (only the dense case as far as i know) will automatically be run in parallel (reference). Another possible example is sparse-matrix factorization with SuperLU.
Of course different minimizers will behave different.
Newton-type methods (core: solve a system of sparse linear-equations) are probably based on SuperLU (if the code is not one of the common old Fortran/C ones, where the whole code is self-contained). CG-type methods are heavily based on matrix-vector products (np.dot; so the dense-case will be parallel).
For some control over this, start with this SO question.
Related
I am working on a project and I have a .so (shared object) which contains a bunch of functions inside which deal with matrix-matrix addition and matrix-matrix multiplication.I "LD_PRELOAD" this .so file while running a python script that uses TensorFlow.
In a nutshell, instead of using the matrix multiplication functions built into TensorFlow's I am using this .so file which has matrix multiply functions (TensorFlow-mkl-enabled uses libmklml_intel.so file for matrix multiply functions). This is the format in which mkl takes matrix multiplication.
Now, the problem at hand is that I need to monitor three things:
What are the dimensions of matrices that are multiplied?
How many times is this function called?
What is the % if time run time the script spends in this function?
I can think of two approaches
Trace where these function calls are being made from the TensorFlow framework and then do changes to get the required information. But I feel that this approach involves modification at multiple places as there's not just one single function which calls cblas. Additionally I'll have to scrape through the TensorFlow Source code which is a huge task. And it may also call for building TF from source again
But, if we are somehow able to moniter the interaction of the .so file with TF and maybe somehow register all the data sent while calling, maybe with a clever shell script or with a tool that I am unaware of? What are my best options? Would Perf be of any help here
Thanks in Advance
Newbie here
This is sort of a general question related to a specific implementation I have in mind, about whether it's safe to use python routines designed for use inside the GIL in a shared memory environment. Specifically what I'd like to do is use scipy.optimize.curve_fit on a large array inside a cython function.
The data can be expressed as a 2d numpy array (say, of floats) with the axis to be fit along and the other the serialized axis to be parallelized over. Then I'd just like to release the GIL and start looping through the data with a cython.parallel.prange (the idea being then that I can have all my cores working on fitting at once).
The main issue I can foresee is that curve_fit does not operate "in place"; it returns the fit values of the parameters (and optionally their covariance matrix) and so has to allocate that memory at some point. (Of course I also have no idea about any intermediate memory allocation the routine performs.) I'm worried about how this will operate outside the GIL with many threads working concurrently.
I realize that the answer could just be "it should work fine go try it," but I'm hoping to get some idea of what to look out for. I also realize that this question is similar to others about parallelizing scipy/numpy routines, but I think this one is worded differently in that falls within the cython scope of a C environment for python.
Thanks for any help/suggestions.
Not safe. If CPython could safely run that kind of code without the GIL, we wouldn't have the GIL in the first place.
You may find the following discussion to be of interest on Parallel Programming in SciPy.
[I would have posted this as merely a comment, but I lack the requisite reputation.]
I did a machine learning Expectation Maximization algorithm in Python, basically an implementation of IBM Model1 for doing machine translation ( here is my GitHub if you want to look at the code) and it works, but reeeaaaaallly sloowwwlly. I'm taking a class now in parallel computing and I was wondering if I could use Python Multiprocessing to reach convergence faster. Can anyone give me any pointers or tips? I don't even know where to start.
EDIT: I was reading around and found this paper on using EM with MapReduce to do parallelization -- maybe this is a better idea?
Most of your problem is that Python is really slow. Remember, your code is executing in an interpreter. When you do code (such as line 82) where you perform a numerical computation one element at a time, you have that one computation - and all the overhead of the Python interpreter.
The first thing you will want to do is vectorize you code with numpy. Unlike your normal python code, numpy is calling out to precompiled efficient binary code. The more work you can hide into numpy, the less time you will waist in the interpreter.
Once you vectorize your code, you can then start profiling it if its still too slow. You should be able to find a lot of simple examples on how to vectorize python, and some of the alternative options.
EDIT: Let me clarify, that parallelizing inherently slow code is mostly pointless. First, is the issue that parallelizing slow code gives the false impression that you have made an improvement. The "scaling up" of parallel code should always be done against the fastest possible single threaded version of the same code (within reason, no need to write everything in assembly before starting any parallel code). For example, consider a lock under contention. The more threads fighting for the lock, the slower the code will run, and you will get no (or negative) performance gains. One way to reduce contention for the lock is to simply slow down the code competing for the lock. This makes it appear as if there is no overhead from lock contention, when in actuality - you have no improvements because the fastest single threaded version of your code will outperform your parallel code.
Also, python really isn't a great language to learn how to write parallel code in. Python has the GIL , which essentially forces all multithreaded code in python to run as if there was but one CPU core. This means bizarre hacks (such as the one you linked) must be done, which have their own additional drawbacks and issues (there are times where such tricks are needed / used, but they shouldn't be the default for running code on a single machine). Don't expect what you learn writing any parallel python code to carry over to other languages or help you with your course.
I think you will have some good success depending on where your bottleneck is. One caveat - When I do code optimization I always like to profile the code, even informally to get an idea of where the bottlenecks are. This will help identify where the time is being spent i.e. file io, network latency, resource contention, not enough cpu cycles etc...
For others who may not be familiar with the Expectation Maximization algorithm a very nice introduction is in Motion Segmentation using EM - a short tutorial, by Yair Weiss. Let us assume we have M data points and N classes/models.
In the EM algorithm there are two steps: Computing the distance between data points and models and Updating our model weights using weighted least squares.
Step 1 - Expectation stage
for data_point in M:
for current_model in N:
compute distance or residual between data_point and current_model
Step 2 - Maximization stage
for each model, compute weighted least squares solving for the model parameters
This requires solving N weighted least square problems where the size is
dependent on the number of parameters in the model that will be solved for.
Your bottleneck may be in the stage of computing the residuals or distances between the data points and the models stage 1 - E Step. In this stage the computations are all independent. I would consider the first stage as embarassingly parallel and quite amenable to parallel computation using parallel map reduce or some other tools in python. I have good success using IPython for such tasks, but there are other good python packages as well.
(This is a follow-up to Statistical profiler for PyPy)
I'm running some Python code under PyPy and would like to optimize it.
In Python, I would use statprof or lineprofiler to know which exact lines are causing the slowdown and try to work around them. In PyPy though, both of the tools don't really report sensible results as PyPy might optimize away some lines. I would also prefer not to use cProfile as I find it very difficult to distil which part of the reported function is the bottleneck.
Does anyone have some tips on how to proceed? Perhaps another profiler which works nicely under PyPy? In general, how does one go about optimizing Python code for PyPy?
If you understand the way the PyPy architecture works, you'll realize that trying to pinpoint individual lines of code isn't really productive. You start with a Python interpreter written in RPython, which then gets run through a tracing JIT which generates flow graphs and then transforms those graphs to optimize the RPython interpreter. What this means is the layout of your Python code being run by the RPython interpreter being JIT'ed may have very different structure than the optimized assembler actually be run. Furthermore, keep in mind that the JIT always works on a loop or a function, so getting line-by-line stats is not as meaningful. Consequently, I think cProfile may really be a good option for you since it will give you an idea of where to concentrate your optimization. Once you know which functions are your bottlenecks, you can spend your optimization efforts targeting those slower functions, rather than trying to fix a single line of Python code.
Keep in mind as you do this that PyPy has very different performance characteristics than cPython. Always try to write code in as simple a way as possible (that doesn't mean as few lines as possible btw). There are a few other heuristics that help such as using specialized lists, using objects over dicts when you have a small number of mostly constant keys, avoiding C extensions using the C Python API, etc.
If you really, really insist on trying to optimize at the line level. There are a few options. One is called JitViewer (https://foss.heptapod.net/pypy/jitviewer), which will let you have a very low level view of what the JIT is doing to your code. For instance, you can even see the assembler instructions which correspond to a Python loop. Using that tool, you can really get a sense for just how fast PyPy will behave with certain parts of the code, as you can now do silly things like count the number of assembler instructions used for your loop or something.
I have a two dimensional table (Matrix)
I need to process each line in this matrix independently from the others.
The process of each line is time consuming.
I'd like to use parallel computing resources in our university (Canadian Grid something)
Can I have some advise on how to start ? I never used parallel computing before.
Thanks :)
Start here: http://docs.python.org/library/multiprocessing.html
Be sure to read this: http://docs.python.org/library/multiprocessing.html#examples
This may be helpful: http://www.slideshare.net/pvergain/multiprocessing-with-python-presentation.
While excellent, it includes threads and multiprocessing, even though multiprocessing is often far, far superior to attempting multi-threading.
For Grid computing, multi-threading is largely useless.
Also, you probably also want to read up on celery.
I am one of the developper of a new library called scoop.
It was built exactly for this purpose (grid or super-computing, scientific computing). I suggest you give it a try.
In your case, all you would have to do is a call like this:
futures.map(YourFunc, matrixLine)
It will then be distributed on your grid or whatever environment you choose.
Like the commentators have said, find someone to talk to in your university. The answer to your question will be specific to what software is installed on the grid. If you have access to a grid, it's highly likely you also have access to a person whose job it is to answer your questions (and they will be pleased to help) - find this person!
From what you describe, I would say: first have a look at numpy.
Numpy provides methods to compute the columns and rows in a vectorized manner with nearly C speed. Depending on your problem, this could be faster than parallel computation with pure CPython.
You can than use parallel computing with numpy-arrays to get a really big speed up.
Possible ways to do this is using multiprocessing or Ipython on a cluster.
It is recommended that you use C++/C for performing this computation. You can use the OpenMP API using the #include<omp.h> header. You can start your parallel region using the #pragma amp parallel directive. Since you are parallelising a for-loop for computing your matrix multiplication, you can use #pragma omp parallel for { } to start your for-loop inside the parallel region. OpenMP will automatically take care of the process synchronisation.
Check this out for a sample code: https://gist.github.com/metallurgix/0dfafc03215ce89fc595
Remember to use a big matrix to see actual improvements in speed. A smaller matrix will perform poorly in fact due to increased task overhead created due to forking and joining the multiple threads.
You can also check out MPI if you want to parallelise your code using multiple processors instead of multiple threads.