faster numpy array copy; multi-threaded memcpy? - python

Suppose we have two large numpy arrays of the same data type and shape, of size on the order of GB's. What is the fastest way to copy all the values from one into the other?
When I do this using normal notation, e.g. A[:] = B, I see exactly one core on the computer working at maximum effort on the copy for several seconds while the others sit idle. When I launch multiple workers with multiprocessing and have each one copy a distinct slice into the destination array, so that all the data is covered, the copy is faster. This is true regardless of whether the destination array is a shared memory array or one that becomes local to the worker. I can get a 5-10x speedup in some tests on a machine with many cores. As I add more workers, the speed eventually levels off and even slows down, so I think the copy becomes memory-bandwidth bound.
I'm not suggesting using multiprocessing for this problem; it was merely to demonstrate the possibility of better hardware utilization.
Does there exist a python interface to some multi-threaded C/C++ memcpy tool?
Update (03 May 2017)
When it is possible, using multiple Python processes to move data can give a major speedup. I have a scenario in which I already have several small shared memory buffers being written to by worker processes. Whenever one fills up, the master process collects this data and copies it into a master buffer. But it is much faster to have the master only select the location in the master buffer and assign a recording worker (from a large set of recording processes standing by) to actually do the copying. On my particular computer, several GB can be moved in a small fraction of a second by concurrent workers, as opposed to several seconds by a single process.
Still, this sort of setup is not always (or even usually?) possible, so it would be great to have a single python process able to drop into a multi-threaded memcpy routine...
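To illustrate the kind of experiment described above, here is a minimal sketch using the standard-library multiprocessing.shared_memory module (Python 3.8+, which postdates this question; the older sharedctypes route works along the same lines). The array shape and worker count are arbitrary illustration values:

    import numpy as np
    from multiprocessing import Process
    from multiprocessing.shared_memory import SharedMemory

    SHAPE, DTYPE = (128, 1024, 1024), np.float64   # ~1 GB, illustration only

    def copy_slice(src_name, dst_name, start, stop):
        # Attach to the existing blocks and copy one contiguous slice of rows.
        src_shm = SharedMemory(name=src_name)
        dst_shm = SharedMemory(name=dst_name)
        src = np.ndarray(SHAPE, dtype=DTYPE, buffer=src_shm.buf)
        dst = np.ndarray(SHAPE, dtype=DTYPE, buffer=dst_shm.buf)
        dst[start:stop] = src[start:stop]          # plain slice assignment
        src_shm.close()
        dst_shm.close()

    if __name__ == "__main__":
        nbytes = int(np.prod(SHAPE)) * np.dtype(DTYPE).itemsize
        src_shm = SharedMemory(create=True, size=nbytes)
        dst_shm = SharedMemory(create=True, size=nbytes)
        src = np.ndarray(SHAPE, dtype=DTYPE, buffer=src_shm.buf)
        src[:] = np.random.rand(*SHAPE)

        n_workers = 8
        bounds = np.linspace(0, SHAPE[0], n_workers + 1, dtype=int)
        workers = [Process(target=copy_slice,
                           args=(src_shm.name, dst_shm.name, int(b0), int(b1)))
                   for b0, b1 in zip(bounds[:-1], bounds[1:])]
        for w in workers:
            w.start()
        for w in workers:
            w.join()

        dst = np.ndarray(SHAPE, dtype=DTYPE, buffer=dst_shm.buf)
        assert np.array_equal(src, dst)
        for shm in (src_shm, dst_shm):
            shm.close()
            shm.unlink()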

If you are certain that the types/memory layout of both arrays are identical, this might give you a speedup: memoryview(A)[:] = memoryview(B). This should use memcpy directly and skip any checks for numpy broadcasting or type conversion rules.
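For reference, a small usage sketch of that idea. One caveat worth adding: CPython currently restricts memoryview slice assignment to one-dimensional views, so for multi-dimensional arrays the copy can go through flat byte views instead; both arrays are assumed C-contiguous with identical dtype and shape:

    import numpy as np

    A = np.empty((4096, 4096), dtype=np.float64)
    B = np.random.rand(4096, 4096)

    # Both arrays must be C-contiguous with identical dtype and shape.
    assert A.flags["C_CONTIGUOUS"] and B.flags["C_CONTIGUOUS"]
    assert A.dtype == B.dtype and A.shape == B.shape

    # CPython restricts memoryview slice assignment to 1-D views, so copy
    # through flat byte views of the underlying buffers instead.
    memoryview(A).cast("B")[:] = memoryview(B).cast("B")

    assert np.array_equal(A, B)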

Related

Argument bottleneck python multiprocessing

I am using pool.apply_async to parallelize my code. One of the arguments I am passing in is a set that I have stored in memory. Given that I am passing in a pointer to the set (rather than the set itself), does multiprocessing make copies of the set for each CPU, or is the same reference passed to each CPU? I assume that, given the set is stored in memory, each CPU would receive its own set. If that is not the case, would this be a bottleneck, since each CPU would request the same object?
So in principle, if you create tasks for new processes, the data has to be copied over to the new processes, as processes (unlike threads) don't share memory. The details vary a bit from operating system to operating system (i.e. fork vs. spawn), but in general the data has to be copied over.
Whether that is the bottleneck depends on how much computation you are actually doing in the processes vs. the amount of data that has to be transferred to the child processes. I suggest measuring the times: (1) when the parent triggers the process start, (2) the actual start of computation inside the child process, and (3) the end of computation. That should give you roughly the ramp-up (2-1) and the computation (3-2). With these numbers you can best judge whether IO or computation is the bottleneck.
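One way to collect those three timestamps, sketched here with a bare Process and a Queue (the payload and the work function are placeholders):

    import time
    from multiprocessing import Process, Queue

    def worker(payload, out):
        t_start = time.perf_counter()           # (2) actual start inside the child
        result = sum(x * x for x in payload)    # placeholder computation
        t_end = time.perf_counter()             # (3) end of computation
        out.put((result, t_start, t_end))

    if __name__ == "__main__":
        payload = list(range(1_000_000))        # data that gets pickled to the child
        out = Queue()
        t_launch = time.perf_counter()          # (1) process start triggered by parent
        p = Process(target=worker, args=(payload, out))
        p.start()
        result, t_start, t_end = out.get()
        p.join()
        # perf_counter is a system-wide monotonic clock on the common platforms,
        # so the parent and child timestamps are comparable.
        print(f"ramp-up (2-1): {t_start - t_launch:.3f} s")
        print(f"compute (3-2): {t_end - t_start:.3f} s")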

Threading vs Multiprocessing

Suppose I have a table with 100,000 rows and a Python script that performs some operations on each row of this table sequentially. Now, to speed up this process, should I create 10 separate scripts and run them simultaneously, each processing a distinct block of 10,000 rows of the table, or should I create 10 threads to process rows for better execution speed?
Threading
Due to the Global Interpreter Lock, Python threads are not truly parallel. In other words, only a single thread can be running at a time.
If you are performing CPU-bound tasks, then dividing the workload amongst threads will not speed up your computations. If anything, it will slow them down, because there are more threads for the interpreter to switch between.
Threading is much more useful for IO-bound tasks. For example, if you are communicating with a number of different clients/servers at the same time, you can switch between threads while you are waiting for different clients/servers to respond.
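For example, a typical IO-bound pattern where threads do help, sketched with concurrent.futures (the URLs are placeholders):

    from concurrent.futures import ThreadPoolExecutor
    from urllib.request import urlopen

    # Placeholder endpoints; replace with the clients/servers you actually talk to.
    URLS = [
        "https://example.com/a",
        "https://example.com/b",
        "https://example.com/c",
    ]

    def fetch(url):
        # The GIL is released while a thread waits on the network, so several
        # requests can be in flight at once.
        with urlopen(url, timeout=10) as resp:
            return url, len(resp.read())

    with ThreadPoolExecutor(max_workers=8) as pool:
        for url, size in pool.map(fetch, URLS):
            print(url, size, "bytes")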
Multiprocessing
As Eman Hamed has pointed out, it can be difficult to share objects while multiprocessing.
Vectorization
Libraries like pandas allow you to use vectorized methods on tables. These are highly optimized operations written in C that execute very fast on an entire table or column. Depending on the structure of your table and the operations that you want to perform, you should consider taking advantage of this.
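As an illustration, the same per-row computation written as a Python-level loop and as a vectorized pandas expression (the column names are made up):

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({
        "price": np.random.rand(100_000) * 100,
        "qty": np.random.randint(1, 10, size=100_000),
    })

    # Slow: one Python-level iteration per row.
    totals_loop = [row.price * row.qty for row in df.itertuples()]

    # Fast: a single vectorized operation executed in C over whole columns.
    df["total"] = df["price"] * df["qty"]

    assert np.allclose(totals_loop, df["total"])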
Threads within a process share a contiguous (virtual) memory block known as the heap; separate processes don't. Threads also consume fewer OS resources than whole processes (separate scripts), and switching between them is cheaper than switching between processes.
The single biggest performance factor in multithreaded execution, when there are no locks/barriers involved, is data access locality, e.g. in matrix multiplication kernels.
Suppose the data is stored in the heap in a linear fashion, i.e. the 0th row in bytes [0-4095], the 1st row in bytes [4096-8191], etc. Then thread 0 should operate on rows 0, 10, 20, ..., thread 1 on rows 1, 11, 21, ..., etc.
The main idea is to have a set of 4K pages kept in physical RAM and 64-byte blocks kept in L3 cache and to operate on them repeatedly. Computers usually assume that if you 'use' a particular memory location then you're also going to use adjacent ones, and you should do your best to do so in your program. The worst-case scenario is accessing memory locations that are ~10 MiB apart in a random fashion, so don't do that. E.g. if a single row is 1,310,720 doubles (64-bit each, i.e. about 10 MiB per row), then your threads should operate in an intra-row (within a single row) rather than an inter-row (as above) fashion.
Benchmark your code. Depending on your results, if your algorithm can process around 21.3 GiB/s of rows (DDR3-2666MHz), then you have a memory-bound task. If your code is at something like 1 GiB/s of processing speed, then you have a compute-bound task, meaning that executing instructions on the data takes more time than fetching the data from RAM, and you need to either optimize your code, reach higher IPC by utilizing the AVX instruction sets, or buy a newer processor with more cores or a higher frequency.
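A crude way to estimate that throughput figure for your own workload: time one pass over the data and divide the bytes touched by the elapsed time (the per-row processing here is just a placeholder sum):

    import time
    import numpy as np

    rows = np.random.rand(2048, 65536)     # ~1 GiB of float64, illustration only

    t0 = time.perf_counter()
    result = rows.sum(axis=1)              # placeholder per-row processing
    elapsed = time.perf_counter() - t0

    gib_per_s = rows.nbytes / elapsed / 2**30
    print(f"processed {gib_per_s:.1f} GiB/s")
    # Close to the theoretical memory bandwidth: memory-bound.
    # Far below it: compute-bound.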

Python: How to synchronize access to a writable array of large numpy arrays (multiprocessing)

I am implementing a specific cache data structure in a machine learning application. The core consists of a list of (large) numpy arrays. The numpy arrays need to be replaced as often and quickly as possible, given the IO limitations. Therefore, there are a few workers that constantly read and prepare data. My current solution is to push a result produced by a worker into a shared queue. A separate process then receives the result from the queue and replaces one of the numpy arrays in the list (which is owned by that central process).
I am now wondering whether that is the most elegant solution or whether there are faster solutions. In particular:
1.) As far as I understood the docs, going through a queue is equivalent to a serialization and de-serialization process, which could be slower than using shared memory.
2.) There is some memory overhead if the workers have several objects in the queue (which could have been replaced directly in the list).
I have thought about using a multiprocessing array or the numpy-sharedmem module but both did not really address my scenario. First, my list does not only contain ctypes. Second, each numpy array has a different size and all are independent. Third, I do not need write access to the numpy arrays but only to the 'wrapper' list organizing them.
Also, it should be noted that I am using multiprocessing and not threading, as the workers make heavy use of numpy, which would keep the global interpreter lock engaged essentially all the time.
Questions:
- Is there a way to have a list of numpy arrays in shared memory?
- Is there a 'better' solution compared to the one described above?
Many thanks...
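One possible arrangement for the first question above, sketched with the multiprocessing.shared_memory module (Python 3.8+, which postdates this question): each cache entry lives in its own shared-memory block, the central process keeps only a small 'wrapper' list of (name, shape, dtype) descriptors, and the queues carry those descriptors rather than array data. All names and sizes below are illustration values:

    import numpy as np
    from multiprocessing import Process, Queue
    from multiprocessing.shared_memory import SharedMemory

    def worker(slot_queue, ready_queue):
        while True:
            descriptor = slot_queue.get()
            if descriptor is None:                   # shutdown signal
                break
            name, shape, dtype = descriptor
            shm = SharedMemory(name=name)
            arr = np.ndarray(shape, dtype=dtype, buffer=shm.buf)
            arr[:] = np.random.rand(*shape)          # placeholder for real IO/prep
            shm.close()
            ready_queue.put(descriptor)              # only metadata crosses the queue

    if __name__ == "__main__":
        shapes = [(1000, 1000), (2000, 500), (750, 4000)]   # independent sizes
        blocks = [SharedMemory(create=True, size=int(np.prod(s)) * 8) for s in shapes]
        cache = [(b.name, s, "float64") for b, s in zip(blocks, shapes)]  # wrapper list

        slot_queue, ready_queue = Queue(), Queue()
        w = Process(target=worker, args=(slot_queue, ready_queue))
        w.start()
        for descriptor in cache:
            slot_queue.put(descriptor)               # ask the worker to refill a slot
        for _ in cache:
            print("refilled:", ready_queue.get()[0])
        slot_queue.put(None)
        w.join()
        for b in blocks:
            b.close()
            b.unlink()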

Python: Perform an operation on each pixel of a 2-d array simultaneously

I want to apply a 3x3 or larger image filter (gaussian or median) on a 2-d array.
Though there are several ways of doing that, such as scipy.ndimage.gaussian_filter or applying a loop, I want to know if there is a way to apply a 3x3 or larger filter to each pixel of an m x n array simultaneously, because it would save a lot of time by bypassing loops. Can functional programming be used for the purpose?
There is also scipy.ndimage.filters.convolve; please tell me whether it is able to perform simultaneous operations.
You may want to learn about parallel processing in Python:
http://wiki.python.org/moin/ParallelProcessing
or the multiprocessing package in particular:
http://docs.python.org/library/multiprocessing.html
Check out using the Python Imaging Library (PIL) on multiprocessors.
Using multiprocessing with the PIL
and similar questions.
You could create four workers, divide your image into four, and assign each quadrant to a worker. You will likely lose time to the overhead, however. If, on the other hand, you have several images to process, then this approach may work (letting each worker open its own image).
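A rough sketch of that several-images variant, assuming scipy is available and using random arrays as stand-ins for images that each worker would otherwise open itself:

    import numpy as np
    from multiprocessing import Pool
    from scipy.ndimage import gaussian_filter

    def process(image):
        # Each worker filters one whole image. For a single huge image you would
        # instead hand out overlapping tiles so the filter window sees its neighbours.
        return gaussian_filter(image, sigma=3)

    if __name__ == "__main__":
        images = [np.random.rand(2048, 2048) for _ in range(8)]   # stand-ins for files
        with Pool(processes=4) as pool:
            filtered = pool.map(process, images)
        print(len(filtered), filtered[0].shape)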
Even if Python provided functionality to apply an operation to an NxM array without looping over it, the operation would still not be executed simultaneously in the background, since the number of instructions a CPU can handle per cycle is limited and thus no time could be saved. For your use case this might even be counterproductive, since the fields in your arrays probably have dependencies, and if you don't know in what order they are accessed this will most likely end up in a mess.
Hugues provided some useful links about parallel processing in Python, but be careful when accessing the same data structure, such as an array, with multiple threads at the same time. If you don't synchronize the threads, they might access the same part of the array at the same time and mess things up.
And be aware that the number of threads that can effectively be run in parallel is limited by the number of processor cores.

Creating a thread for each operation or a some threads for various operations?

For a class project I am writing a simple matrix multiplier in Python. My professor has asked for it to be threaded. The way I handle this right now is to create a thread for every row and throw the result in another matrix.
What I wanted to know is whether it would be faster if, instead of creating a thread for each row, it created some number of threads that each handle several rows.
For example: given Matrix1 100x100 * Matrix2 100x100 (matrix sizes can vary widely):
4 threads each handling 25 rows
10 threads each handling 10 rows
Maybe this is a matter of fine-tuning, or maybe creating a thread per row, despite the overhead, is still faster than the distribution mechanism above.
You will probably get the best performance if you use one thread for each CPU core available to the machine running your application. You won't get any performance benefit by running more threads than you have processors.
If you are planning to spawn new threads each time you perform a matrix multiplication then there is very little hope of your multi-threaded app ever outperforming the single-threaded version unless you are multiplying really huge matrices. The overhead involved in thread creation is just too high relative to the time required to multiply matrices. However, you could get a significant performance boost if you spawn all the worker threads once when your process starts and then reuse them over and over again to perform many matrix multiplications.
For each pair of matrices you want to multiply you will want to load the multiplicand and multiplier matrices into memory once and then allow all of your worker threads to access the memory simultaneously. This should be safe because those matrices will not be changing during the multiplication.
You should also be able to allow all the worker threads to write their output simultaneously into the same output matrix because (due to the nature of matrix multiplication) each thread will end up writing its output to different elements of the matrix and there will not be any contention.
I think you should distribute the rows between threads by maintaining an integer NextRowToProcess that is shared by all of the threads. Whenever a thread is ready to process another row it calls InterlockedIncrement (or whatever atomic increment operation you have available on your platform) to safely get the next row to process.
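In Python, that shared NextRowToProcess counter can be a lock-protected index; a sketch with a fixed set of worker threads is below (note that, as the next answer points out, the GIL means this only pays off when the per-row work releases the GIL, e.g. inside numpy's BLAS calls):

    import threading
    import numpy as np

    def matmul_rows(A, B, out, next_row, lock):
        # Each thread repeatedly claims the next unprocessed row of A.
        while True:
            with lock:                       # lock-protected "atomic increment"
                row = next_row[0]
                next_row[0] += 1
            if row >= A.shape[0]:
                return
            out[row, :] = A[row, :] @ B      # each thread writes a distinct row

    if __name__ == "__main__":
        A, B = np.random.rand(100, 100), np.random.rand(100, 100)
        out = np.empty((A.shape[0], B.shape[1]))
        next_row, lock = [0], threading.Lock()

        threads = [threading.Thread(target=matmul_rows,
                                    args=(A, B, out, next_row, lock))
                   for _ in range(4)]        # roughly one thread per core
        for t in threads:
            t.start()
        for t in threads:
            t.join()

        assert np.allclose(out, A @ B)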
In no case will a CPU-bound task be faster in Python in multi-threaded mode. Due to the Global Interpreter Lock, only one thread can be executed at once (unless you write some C extension and release the lock explicitly).
This applies to the standard CPython implementation as well as PyPy. In Jython, try to use a thread per core; more does not make sense.
Please also check out the great GIL overview by David Beazley.
On the other hand, if your professor does not mind, you can use multiprocessing.
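If multiprocessing is allowed, a minimal sketch would be to split the rows of the first matrix across a process pool:

    import numpy as np
    from multiprocessing import Pool

    def multiply_block(args):
        rows, B = args                  # each task gets a block of rows of A plus B
        return rows @ B

    if __name__ == "__main__":
        A, B = np.random.rand(100, 100), np.random.rand(100, 100)
        blocks = np.array_split(A, 4)   # e.g. 4 blocks of 25 rows each
        with Pool(processes=4) as pool:
            result = np.vstack(pool.map(multiply_block, [(blk, B) for blk in blocks]))
        assert np.allclose(result, A @ B)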
