Parallelizing a Python loop - python

I'm a bit lost between joblib, multiprocessing, etc.
What's the most effective way to parallelize a for loop, based on your experience?
For example:
for i, p in enumerate(patches[ss_idx]):
    bar.update(i+1)
    h_features.append(calc_haralick(p))

def calc_haralick(roi):
    feature_vec = []
    texture_features = mt.features.haralick(roi)
    mean_ht = texture_features.mean(axis=0)
    [feature_vec.append(i) for i in mean_ht[0:9]]
    return np.array(feature_vec)
It iterates over the image patches and extracts features from each one via haralick.
And this is how I get the patches:
h_neigh = 11 # haralick neighbourhood
size = h_neigh
shape = (img.shape[0] - size + 1, img.shape[1] - size + 1, size, size)
strides = 2 * img.strides
patches = stride_tricks.as_strided(img, shape=shape, strides=strides)
patches = patches.reshape(-1, size, size)
Sorry if any information is superfluous

Your images appear to be simple two-dimensional NumPy arrays, and patches a list or array of those. I assume ss_idx is an index array (i.e., not an integer), so that patches[ss_idx] remains something that can be iterated over (as in your example).
In that case, simply use multiprocessing.Pool.map:
import multiprocessing as mp

nproc = 10

with mp.Pool(nproc) as pool:
    h_features = pool.map(calc_haralick, patches[ss_idx])
See the first basic example in the multiprocessing documentation.
If you leave out nproc or set it to None, all available cores will be used.
A potential problem with multiprocessing is that it will create nproc identical Python processes and copy all the relevant data to those processes. If your images are large, this will cause considerable overhead.
In such a case, it may be worth splitting your Python program into separate programs, where calculating the features of a single image is one independent program. That program would need to handle reading a single image and writing its features. You'd then wrap everything in e.g. a bash script that loops over all images, taking care to use only a certain number of cores at the same time (e.g., run background processes, but wait after every 10 images). The next step/program reads the independent feature files into a multi-dimensional array, and from there you can continue your old program.
While this is more work, it may save some copying overhead (though it introduces extra I/O overhead, in particular writing the separate feature files).
It also has the added advantage that this is fairly easy to run distributed, should the possibility ever arise.
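As a rough sketch of that idea (my own illustration, with assumed details: images stored as .npy files, mahotas imported as mt, and the patch code copied from the question), the per-image program could look something like this:

# compute_features.py -- one independent program per image (sketch)
import sys
import numpy as np
from numpy.lib import stride_tricks
import mahotas as mt

def calc_haralick(roi):
    texture_features = mt.features.haralick(roi)
    return texture_features.mean(axis=0)[0:9]

if __name__ == '__main__':
    in_path, out_path = sys.argv[1], sys.argv[2]
    img = np.load(in_path)                      # assumes the image was saved as .npy
    size = 11                                   # haralick neighbourhood, as in the question
    shape = (img.shape[0] - size + 1, img.shape[1] - size + 1, size, size)
    patches = stride_tricks.as_strided(img, shape=shape, strides=2 * img.strides)
    patches = patches.reshape(-1, size, size)
    features = np.array([calc_haralick(p) for p in patches])
    np.save(out_path, features)                 # one feature file per image

A shell loop (or any job scheduler) would then call this script once per image, limiting how many copies run at the same time; the next program only has to np.load the feature files and stack them.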
Try multiprocessing first, keeping an eye on memory usage and CPU usage (if nothing happens for a long time, it may be copying overhead). Then, if needed, try another method.


write multi dimensional numpy array to many files

I was wondering if there was a more efficient way of doing the following without using loops.
I have a numpy array with the shape (i, x, y, z). Essentially I have i elements of the shape (x, y, z).
I want to write each element to a separate file so that I have i files, each with the data from a single element.
In my case, each element is an image, but I'm sure a solution can be format agnostic.
I'm currently looping through each of the i elements and writing them out one at a time.
As i gets really large, this takes a progressively longer time. Is there a better way or a useful library which could make this more efficient?
Update
I tried the suggestion to use multiprocessing via concurrent.futures, with both the thread pool and the process pool. The code was simpler, but the time to complete was 4x slower.
i in this case is approximately 10000, while x and y are approximately 750.
This sounds very suitable for multiprocessing, as the different elements need to be processed separately and can be saved to disk independently.
Python has a useful package for this, called multiprocessing, with a variety of pooling, processing, and other options.
Here's a simple (and comment-documented) example of usage:
from multiprocessing import Process
import numpy as np

# This should be your existing function
def write_file(element):
    # write file
    pass

# You'll still be looping of course, but in parallel over batches.
# This is a helper function for looping over a "batch".
def write_list_of_files(elements_list):
    for element in elements_list:
        write_file(element)

# Your data goes here...
all_elements = np.ones((1000, 256, 256, 3))
num_procs = 10  # Depends on system limitations, number of CPU cores, etc.

# Each process in this list runs "write_list_of_files" on its own subset of
# the data, thanks to the indexing trick of using "k::num_procs"...
procs = [Process(target=write_list_of_files, args=[all_elements[k::num_procs, ...]])
         for k in range(num_procs)]

for p in procs:
    p.start()  # Each process starts running independently

for p in procs:
    p.join()   # Assures the code won't continue until all are "joined" and done. Optional, obviously...

print('All done!')  # This only runs once all procs are done, due to "p.join"
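For completeness, here is one way the write_file stub above could look if a format-agnostic .npy file per element is acceptable (a hypothetical example; the directory name and the uuid-based file names are my own choices, not from the question):

import os
import uuid
import numpy as np

OUT_DIR = 'out'                     # hypothetical output directory
os.makedirs(OUT_DIR, exist_ok=True)

def write_file(element):
    # One .npy file per element; a uuid keeps names unique across processes.
    # (If the element's original index matters, pass it along as well.)
    np.save(os.path.join(OUT_DIR, uuid.uuid4().hex + '.npy'), element)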

How to efficiently use multiprocessing to speed up a huge amount of tiny tasks?

I'm having a bit of trouble with Python's multiprocessing.Pool. I have two lists of numpy arrays, a and b, in which
a.shape=(10000,3)
and
b.shape=(1000000000,3)
Then I have a function which does some computation like
def role(array, point):
    sub = array - point
    return (1/(np.sqrt(np.min(np.sum(sub*sub, axis=-1)))+0.001)**2)
Next, I need to compute
[role(a, point) for point in b]
To speed it up, I try to use
cpu_num = 4
m = multiprocessing.Pool(cpu_num)
cost_list = m.starmap(role, [(a, point) for point in b])
m.close()
The whole process takes around 70 s, but if I set cpu_num = 1, the processing time decreases to 60 s... My laptop has 6 cores, for reference.
Here I have two questions:
Is there something I did wrong with multiprocessing.Pool? Why did the processing time increase when I set cpu_num = 4?
For tasks like this (where each loop iteration is a very tiny piece of work), should I use multiprocessing to speed things up? I feel like the time Python spends filling the Pool is longer than the time spent running role...
Any suggestions are really welcome.
Multiprocessing comes with some overhead (to create new processes), which is why it's not a very good choice when you have lots of tiny tasks, where the overhead of process creation might outweigh the benefit of parallelizing.
Have you considered vectorizing your problem?
In particular, if you broadcast the variable b, you get:
sub = a - b[:, np.newaxis]  # broadcast b
1. / (np.sqrt(np.min(np.sum(sub**2, axis=2), axis=-1)) + 0.001)**2
I believe you could then still reduce the complexity of the last expression a bit, as you're creating the square of a square root, which seems redundant (note that I'm assuming the 0.001 constant value is merely there to avoid some non-sensible operation like division by zero).
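To make that remark concrete, here is a small self-contained comparison (my own illustration with tiny stand-in arrays; the simplification only holds if the 0.001 really is just a guard against division by zero):

import numpy as np

a = np.random.rand(100, 3)                            # small stand-ins for the real arrays
b = np.random.rand(500, 3)

sub = a - b[:, np.newaxis]                            # shape (500, 100, 3)
min_sq = np.min(np.sum(sub**2, axis=2), axis=-1)      # squared distance to the nearest point of a
original = 1. / (np.sqrt(min_sq) + 0.001)**2          # expression from the answer
simplified = 1. / min_sq                              # the square cancels the square root...
# ...but 1/(sqrt(m) + eps)**2 != 1/(m + eps), so keep some guard if m can be exactly 0.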
If the tasks are too tiny, then the multiprocessing overhead will be your bottleneck and you will win nothing.
If the amount of data per task that you have to pass to a worker, or that the worker has to return, is large, then you will also not win a lot (or may even win nothing).
If you have 10,000 tiny tasks, then I recommend creating a list of meta tasks.
Each meta task would consist of executing, for example, 20 tiny tasks.
meta_tasks = []
for idx in range(0, len(tiny_tasks), 20):
    meta_tasks.append(tiny_tasks[idx:idx+20])
Then pass the meta tasks to your worker pool.
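A minimal sketch of that batching idea, applied to the role function from the question (my own illustration; the batch size, pool size and stand-in array sizes are arbitrary):

import multiprocessing
import numpy as np

def role(array, point):
    sub = array - point
    return 1 / (np.sqrt(np.min(np.sum(sub * sub, axis=-1))) + 0.001) ** 2

def role_batch(args):
    # One "meta task": run role over a whole batch of points.
    array, points = args
    return [role(array, point) for point in points]

if __name__ == '__main__':
    a = np.random.rand(10000, 3)
    b = np.random.rand(100000, 3)          # much smaller than the real b, for the demo
    batch_size = 20
    meta_tasks = [b[idx:idx + batch_size] for idx in range(0, len(b), batch_size)]
    with multiprocessing.Pool(4) as pool:
        batched = pool.map(role_batch, [(a, batch) for batch in meta_tasks])
    cost_list = [c for batch in batched for c in batch]

Each worker now runs 20 calls of role per message instead of one, so the per-task communication overhead is amortized.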

Python hogging memory when looping over list of numpy arrays

I am using Python 3 and Numpy for some scientific data analysis, and I am faced with a memory-related issue. When looping over a list of numpy arrays (a few thousand of them) and doing a few intermediate calculations, I have noticed Python taking up over 6 GB more memory than I would have expected it to. I have isolated the issue to a single function, shown below:
def overlap_correct(self):
    running_total = np.zeros((512, 512))
    shutter = 0
    for data_index in range(len(self.data)):
        if self.TOF[data_index] < self.shutter_times[shutter]:
            occupied_prob = running_total/self.N_TRIGS[data_index]
            running_total += self.data[data_index]
            self.data[data_index] = np.round(np.divide(self.data[data_index], (1 - occupied_prob)))
        else:
            running_total = np.zeros((512, 512))
            shutter += 1
The relevant data structures here are self.data, which is a list of a few thousand 512x512 numpy arrays; self.TOF and self.N_TRIGS, which are numpy arrays of a few thousand floats; and self.shutter_times, which is a numpy array with three floats.
During the processing of this loop, which takes a few minutes, I can observe the memory usage of Python gradually increasing, until the loop finishes with about 6GB more memory used up than when it started.
I have used memory_profiler and objgraph to analyse the memory usage, without any success. I am aware that before and after the loop, self.data, self.TOF, self.N_TRIGS, and self.shutter_times remain the same size and hold the same number of elements of the same type. If I understand this correctly, local variables such as occupied_prob should go out of scope after every iteration of the for loop, and if not, all redundant memory should be garbage collected after the function returns to the main loop. This does not happen, and the 6 GB remains locked up until the script terminates. I have also attempted to run manual garbage collection using gc.collect(), without any results.
If it helps, this function exists inside a thread and is part of a larger data analysis process. No other threads attempt to concurrently access the data, and after the thread exits, self.data is copied to a different class. The instance of the thread is then destroyed by going out of scope. I have also attempted to manually destroy the thread using del thread_instance as well as thread_instance = None, but the 6GB remains locked up. This is not a huge issue on the development machine, but the code will be part of a larger package which may run on machines with limited RAM.
I have managed to find a solution to the issue. TL;DR: During the execution of the function, the dtype of self.data was not enforced.
The first issue that prevented me from realising this is that by using sys.getsizeof() to see how much space self.data was taking up in memory, I was given the size of the list of pointers to the numpy.ndarray objects, which remained the same since the number of arrays did not change.
Secondly, as I was checking the dtype of self.data[0], which was the only unchanged data "slide", I wrongly assumed that the whole list of arrays also had the same dtype.
I suspect that the reason as to why the dtype of some of the arrays was changed is that np.round() returns a rounded float.
By changing the structure of self.data from a list of a few thousand 512x512 arrays into a single 3D array of shape [a few thousand] x 512 x 512, the function no longer guessed the dtype of the data, but silently cast the float64 returned by np.round back to uint16:
self.data = np.asarray(self.data, dtype='uint16')
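A tiny illustration of the dtype behaviour described above (my own example, not from the original code):

import numpy as np

# With a list, rebinding an element silently replaces it with a float64 array:
data_list = [np.ones((4, 4), dtype='uint16') for _ in range(3)]
data_list[0] = np.round(data_list[0] / 0.5)
print(data_list[0].dtype)   # float64 -- four times the memory per pixel

# With a single uint16 3D array, the same assignment is cast back to uint16:
data_arr = np.asarray([np.ones((4, 4), dtype='uint16') for _ in range(3)])
data_arr[0] = np.round(data_arr[0] / 0.5)
print(data_arr.dtype)       # uint16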

Printing whenever a python object is freed?

I would like to assure that several numpy arrays I'm allocating are properly freed.
I'm curious if there is any module that will let me track an object and print whenever its memory is deallocated.
For context, I'm asking the question because when I reach a certain point in my program, the system monitor shows that Python is using around 300 MB. Then I execute two commands: the first creates a list of numpy arrays which comes to about 1 GB in size. The next command performs vstack on this list, which further increases memory by about 1 GB. Then I take this big numpy array, do some math, and get an answer which is about 1 MB (an 8000 x 128 uint8 ndarray). The other arrays are no longer needed, so they should be deallocated at the end of the function. However, after I return from the function and collect garbage, I'm still left with Python using 1 GB of memory. Where did those 700 MB come from!?
I'll further illustrate the example with pseudo-code:
def myfunc(api):
    # Gets a list of 1000 1MB arrays
    list_ = api.get_arrays()
    # Stacks into a big 1GB array
    bigarray = np.vstack(list_)
    # Summarizes big array using about 1MB
    smallarray = api.cluster(bigarray)
    # I shouldn't really need del statements
    del list_
    del bigarray
    return smallarray

def main():
    # I do preprocessing stuff and have about 300MB in memory
    # It costs about 2GB to run this function, but that should all be freed at the end
    smallarray = myfunc(api)
    # There is 700MB of extra memory allocated! Where did it come from!?
To debug this I was thinking it would be useful to ensure that those numpy arrays are actually gone. Maybe someone has a better idea, but hopefully someone at least has an idea.
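One possible way to get that kind of notification (a hedged sketch on my part, not an answer from the original thread): numpy arrays support weak references, so weakref.finalize from the standard library can register a callback that prints when the array object is about to be reclaimed:

import weakref
import numpy as np

def track(obj, name):
    # The finalize object is kept alive by the weakref module until obj dies.
    weakref.finalize(obj, print, name + ' was freed')
    return obj

big = track(np.ones((1000, 1000)), 'big')
del big   # in CPython this drops the last reference, so "big was freed" prints here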

Shared memory in multiprocessing

I have three large lists. The first contains bitarrays (module bitarray 0.8.0) and the other two contain arrays of integers.
l1=[bitarray 1, bitarray 2, ... ,bitarray n]
l2=[array 1, array 2, ... , array n]
l3=[array 1, array 2, ... , array n]
These data structures take quite a bit of RAM (~16GB total).
If I start 12 sub-processes using:
multiprocessing.Process(target=someFunction, args=(l1,l2,l3))
Does this mean that l1, l2 and l3 will be copied for each sub-process or will the sub-processes share these lists? Or to be more direct, will I use 16GB or 192GB of RAM?
someFunction will read some values from these lists and then perform some calculations based on the values read. The results will be returned to the parent process. The lists l1, l2 and l3 will not be modified by someFunction.
Therefore I would assume that the sub-processes do not need, and would not, copy these huge lists, but would instead just share them with the parent. Meaning that the program would take 16GB of RAM (regardless of how many sub-processes I start) due to the copy-on-write approach under Linux?
Am I correct, or am I missing something that would cause the lists to be copied?
EDIT:
I am still confused, after reading a bit more on the subject. On the one hand, Linux uses copy-on-write, which should mean that no data is copied. On the other hand, accessing an object will change its ref-count (I am still unsure why, and what that means). Even so, will the entire object be copied?
For example, if I define someFunction as follows:
def someFunction(list1, list2, list3):
    i = random.randint(0, 99999)
    print list1[i], list2[i], list3[i]
Would using this function mean that l1, l2 and l3 will be copied entirely for each sub-process?
Is there a way to check for this?
EDIT2: After reading a bit more and monitoring the total memory usage of the system while the sub-processes are running, it seems that entire objects are indeed copied for each sub-process, and it seems to be because of reference counting.
The reference counting for l1, l2 and l3 is actually unneeded in my program. This is because l1, l2 and l3 will be kept in memory (unchanged) until the parent process exits. There is no need to free the memory used by these lists until then. In fact, I know for sure that the reference count will remain above 0 (for these lists and every object in these lists) until the program exits.
So now the question becomes, how can I make sure that the objects will not be copied to each sub-process? Can I perhaps disable reference counting for these lists and each object in these lists?
EDIT3: Just an additional note. Sub-processes do not need to modify l1, l2 and l3 or any objects in these lists. The sub-processes only need to be able to reference some of these objects without causing the memory to be copied for each sub-process.
Because this is still a very high result on Google and no one else has mentioned it yet, I thought I would mention the new possibility of 'true' shared memory, which was introduced in Python 3.8.0: https://docs.python.org/3/library/multiprocessing.shared_memory.html
I have included a small contrived example here (tested on Linux) where numpy arrays are used, which is likely a very common use case:
# one dimension of the 2d array which is shared
dim = 5000

import numpy as np
from multiprocessing import shared_memory, Process, Lock
from multiprocessing import cpu_count, current_process
import time

lock = Lock()

def add_one(shr_name):
    existing_shm = shared_memory.SharedMemory(name=shr_name)
    np_array = np.ndarray((dim, dim,), dtype=np.int64, buffer=existing_shm.buf)
    lock.acquire()
    np_array[:] = np_array[0] + 1
    lock.release()
    time.sleep(10)  # pause, to see the memory usage in top
    print('added one')
    existing_shm.close()

def create_shared_block():
    a = np.ones(shape=(dim, dim), dtype=np.int64)  # Start with an existing NumPy array
    shm = shared_memory.SharedMemory(create=True, size=a.nbytes)
    # Now create a NumPy array backed by shared memory
    np_array = np.ndarray(a.shape, dtype=np.int64, buffer=shm.buf)
    np_array[:] = a[:]  # Copy the original data into shared memory
    return shm, np_array

if current_process().name == "MainProcess":
    print("creating shared block")
    shr, np_array = create_shared_block()

    processes = []
    for i in range(cpu_count()):
        _process = Process(target=add_one, args=(shr.name,))
        processes.append(_process)
        _process.start()

    for _process in processes:
        _process.join()

    print("Final array")
    print(np_array[:10])
    print(np_array[10:])

    shr.close()
    shr.unlink()
Note that because of the 64-bit ints this code can take about 1 GB of RAM to run, so make sure that you won't freeze your system using it. ^_^
Generally speaking, there are two ways to share the same data:
Multithreading
Shared memory
Python's multithreading is not suitable for CPU-bound tasks (because of the GIL), so the usual solution in that case is to go with multiprocessing. However, with this solution you need to explicitly share the data, using multiprocessing.Value and multiprocessing.Array.
Note that usually sharing data between processes may not be the best choice, because of all the synchronization issues; an approach involving actors exchanging messages is usually seen as a better choice. See also Python documentation:
As mentioned above, when doing concurrent programming it is usually
best to avoid using shared state as far as possible. This is
particularly true when using multiple processes.
However, if you really do need to use some shared data then
multiprocessing provides a couple of ways of doing so.
In your case, you need to wrap l1, l2 and l3 in some way understandable by multiprocessing (e.g. by using a multiprocessing.Array), and then pass them as parameters.
Note also that, since you said you do not need write access, you should pass lock=False while creating the objects, or all access will still be serialized.
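As a minimal sketch of that suggestion for the two integer lists (my own illustration; the bitarray list would need a different representation, e.g. a shared byte buffer, and all names here are made up):

import multiprocessing as mp
import random

def someFunction(shared_l2, shared_l3):
    i = random.randrange(len(shared_l2))
    print(shared_l2[i], shared_l3[i])

if __name__ == '__main__':
    l2 = list(range(100000))                   # stand-ins for the real integer arrays
    l3 = list(range(100000, 200000))
    shared_l2 = mp.Array('i', l2, lock=False)  # read-only use, so no lock needed
    shared_l3 = mp.Array('i', l3, lock=False)
    procs = [mp.Process(target=someFunction, args=(shared_l2, shared_l3))
             for _ in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()

The shared arrays live in shared memory, so each child reads the same buffer instead of receiving its own copy of the data.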
For those interested in using Python 3.8's shared_memory module, it still has a bug (github issue link here) which hasn't been fixed and affects Python 3.8/3.9/3.10 as of 2021-01-15. The bug affects POSIX systems and is about the resource tracker destroying shared memory segments when other processes should still have valid access to them. So take care if you use it in your code.
If you want to make use of the copy-on-write feature and your data is static (unchanged in child processes), you should make Python not mess with the memory blocks where your data lies. You can easily do this by using C or C++ structures (STL containers, for instance) as containers and providing your own Python wrappers that use pointers to the data memory (or possibly copy the data) when a Python-level object is created, if one is created at all.
All this can be done very easily, with almost Python-level simplicity and syntax, using Cython.
# pseudo cython
from libc.stdlib cimport malloc
from libc.string cimport memcpy

cdef class FooContainer:
    cdef char * data

    def __cinit__(self, char * foo_value):
        self.data = <char *> malloc(1024 * sizeof(char))
        memcpy(self.data, foo_value, min(1024, len(foo_value)))

    def get(self):
        return self.data
# python part
from os import fork
from foo import FooContainer

f = FooContainer("hello world")
pid = fork()
if not pid:
    f.get()  # this call will read the same memory page to which the
             # parent process wrote the 1024 chars of self.data,
             # and cython will automatically create a new python string
             # object from it and return it to the caller
The above pseudo-code is badly written; don't use it as-is. In place of self.data there should be a C or C++ container in your case.
You can use memcached or Redis and set each list as a key-value pair:
{'l1'...
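A rough sketch of that approach with Redis (my own; it assumes a local Redis server and the redis-py client, and note that each worker fetching a key still unpickles its own copy):

import pickle
import redis

l1 = [list(range(100)) for _ in range(10)]     # small stand-in for the real data
r = redis.Redis(host='localhost', port=6379)
r.set('l1', pickle.dumps(l1))                  # store once, from the parent process

# In a worker process, fetch and unpickle on demand:
l1_local = pickle.loads(r.get('l1'))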
