I'm a bit lost between joblib, multiprocessing, etc.
What's the most effective way to parallelize a for loop, based on your experience?
For example:
for i, p in enumerate(patches[ss_idx]):
    bar.update(i+1)
    h_features.append(calc_haralick(p))

def calc_haralick(roi):
    feature_vec = []
    texture_features = mt.features.haralick(roi)
    mean_ht = texture_features.mean(axis=0)
    [feature_vec.append(i) for i in mean_ht[0:9]]
    return np.array(feature_vec)
The loop iterates over the image patches and extracts Haralick texture features from each one.
And this is how I get the patches:
h_neigh = 11 # haralick neighbourhood
size = h_neigh
shape = (img.shape[0] - size + 1, img.shape[1] - size + 1, size, size)
strides = 2 * img.strides
patches = stride_tricks.as_strided(img, shape=shape, strides=strides)
patches = patches.reshape(-1, size, size)
Sorry if any information is superfluous
Your images appear to be simple two-dimensional NumPy arrays, and patches a list or array of those. I assume ss_idx is an index array (i.e., not an integer), so that patches[ss_idx] remains something that can be iterated over (as in your example).
In that case, simply use multiprocessing.Pool.map:
import multiprocessing as mp

nproc = 10

with mp.Pool(nproc) as pool:
    h_features = pool.map(calc_haralick, patches[ss_idx])
See the first basic example in the multiprocessing documentation.
If you leave out nproc or set it to None, all available cores will be used.
The potential problem with multiprocessing is that it creates nproc identical Python processes and copies all the relevant data to those processes. If your images are large, this will cause considerable overhead.
In such a case, it may be worth splitting your Python program into separate programs, where computing the features of a single image is one independent program. That program would need to handle reading a single image and writing the features. You'd then wrap everything in e.g. a bash script that loops over all images, taking care to use only a certain number of cores at the same time (e.g., background processes, but wait after every 10 images). The next step/program requires reading the independent feature files into a multi-dimensional array, but from there, you can continue your old program. A sketch of such a per-image program is shown below.
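For illustration, here is a minimal sketch of such a standalone per-image program (not your code: the file names, the .npy format, and the mahotas import are assumptions; adapt them to however your images are stored):

# extract_features.py -- hypothetical standalone program: one image in, one feature file out.
# Run as:  python extract_features.py some_image.npy
import sys

import numpy as np
from numpy.lib import stride_tricks
import mahotas as mt  # assuming `mt` in your snippet refers to mahotas

def calc_haralick(roi):
    # mean Haralick texture features over the four directions, first nine of them
    texture_features = mt.features.haralick(roi)
    return texture_features.mean(axis=0)[0:9]

def extract_patches(img, size=11):
    # same sliding-window construction as in your snippet
    shape = (img.shape[0] - size + 1, img.shape[1] - size + 1, size, size)
    strides = 2 * img.strides
    patches = stride_tricks.as_strided(img, shape=shape, strides=strides)
    return patches.reshape(-1, size, size)

if __name__ == '__main__':
    image_path = sys.argv[1]
    img = np.load(image_path)                           # one image per invocation
    patches = extract_patches(img)
    h_features = np.array([calc_haralick(p) for p in patches])
    np.save(image_path + '.features.npy', h_features)   # one feature file per image

A shell script (or another small Python script) can then loop over all image files and keep, say, 10 of these processes running at a time.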
While this is more work, it may save some copying overhead (though it introduces extra I/O overhead, in particular writing the separate feature files).
It also has the additional advantage that this is fairly easy to run distributed, should the need ever arise.
Try multiprocessing first, keeping an eye on memory and CPU usage (if nothing happens for a long time, it may be copying overhead). If that does not work out, try the other method.
I am using Python 3 and NumPy for some scientific data analysis, and I am faced with a memory-related issue. When looping over a list of numpy arrays (a few thousand of them) and doing a few intermediate calculations, I have noticed python taking up over 6GB more memory than I would have expected it to. I have isolated the issue to a single function, shown below:
def overlap_correct(self):
    running_total = np.zeros((512, 512))
    shutter = 0
    for data_index in range(len(self.data)):
        if self.TOF[data_index] < self.shutter_times[shutter]:
            occupied_prob = running_total/self.N_TRIGS[data_index]
            running_total += self.data[data_index]
            self.data[data_index] = np.round(np.divide(self.data[data_index], (1 - occupied_prob)))
        else:
            running_total = np.zeros((512, 512))
            shutter += 1
The relevant data structures here are self.data, which is a list of a few thousand 512x512 numpy arrays; self.TOF and self.N_TRIGS, which are numpy arrays of a few thousand floats; and self.shutter_times, which is a numpy array with three floats.
During the processing of this loop, which takes a few minutes, I can observe the memory usage of Python gradually increasing, until the loop finishes with about 6GB more memory used up than when it started.
I have used memory_profiler and objgraph to analyse the memory usage without any success. I am aware that before and after the loop, self.data, self.TOF, self.N_TRIGS, and self.shutter_times remain the same size and hold the same number of elements of the same types. If I understand this correctly, local variables such as occupied_prob should go out of scope after every iteration of the for loop, and if not, all redundant memory should be garbage collected after the function returns to the main loop. This does not happen, and the 6GB remain locked up until the script terminates. I have also attempted to run manual garbage collection using gc.collect() without any results.
If it helps, this function exists inside a thread and is part of a larger data analysis process. No other threads attempt to concurrently access the data, and after the thread exits, self.data is copied to a different class. The instance of the thread is then destroyed by going out of scope. I have also attempted to manually destroy the thread using del thread_instance as well as thread_instance = None, but the 6GB remains locked up. This is not a huge issue on the development machine, but the code will be part of a larger package which may run on machines with limited RAM.
I have managed to find a solution to the issue. TL;DR: During the execution of the function, the dtype of self.data was not enforced.
The first issue that prevented me from realising this is that by using sys.getsizeof() to see how much space self.data was taking up in memory, I was given the size of the list of pointers to numpy.ndarray objects, which remained the same as the number of arrays did not change.
Secondly, as I was checking the dtype of self.data[0], which was the only unchanged data "slide", I wrongly assumed that the whole list of arrays also had the same dtype.
I suspect that the reason as to why the dtype of some of the arrays was changed is that np.round() returns a rounded float.
By changing the structure of self.data from a list of a few thousand 512x512 arrays into a 3D array of shape [a few thousand]x512x512, the function no longer guessed the dtype of the data, but silently cast the float64 returned by np.round to uint16:
self.data = np.asarray(self.data, dtype='uint16')
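A minimal sketch of the effect (my illustration, not part of the original code): with a list of arrays, the assignment inside the loop rebinds each element to the float64 result of np.round, roughly quadrupling its memory, whereas with a pre-typed uint16 3D array the same assignment writes into the existing buffer and casts the values back down:

import numpy as np

frame = np.ones((512, 512), dtype='uint16')

# list of arrays: the assignment rebinds the element to a new float64 array
data_list = [frame.copy() for _ in range(3)]
data_list[0] = np.round(data_list[0] / 0.5)
print(data_list[0].dtype, data_list[0].nbytes)   # float64, 4x the memory of a uint16 frame

# 3D uint16 array: the same assignment casts back to uint16 in place
data_arr = np.asarray([frame.copy() for _ in range(3)], dtype='uint16')
data_arr[0] = np.round(data_arr[0] / 0.5)
print(data_arr.dtype, data_arr[0].nbytes)        # uint16, unchanged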
I am running simulations using dask.distributed. My model is defined in a delayed function and I stack several realizations.
A simplified version of what I do is given in this code snippet:
import numpy as np
import xarray as xr
import dask.array as da
import dask
from dask.distributed import Client
from itertools import repeat

@dask.delayed
def run_model(n_time, a, b):
    result = np.array([a*np.random.randn(n_time)+b])
    return result

client = Client()

# Parameters
n_sims = 10000
n_time = 100
a_vals = np.random.randn(n_sims)
b_vals = np.random.randn(n_sims)
output_file = 'out.nc'

# Run simulations
out = da.stack([da.from_delayed(run_model(n_time, a, b), (1, n_time,), np.float64) for a, b in zip(a_vals, b_vals)])

# Store output in a dataset
ds = xr.Dataset({'var1': (['realization', 'time'], out[:, 0, :])},
                coords={'realization': np.arange(n_sims),
                        'time': np.arange(n_time)*.1})

# Save to a netcdf file -> at this point, computations will be carried out
ds.to_netcdf(output_file)
If I want to run a lot of simulations I get the following warning:
/home/user/miniconda3/lib/python3.6/site-packages/distributed/worker.py:840: UserWarning: Large object of size 2.73 MB detected in task graph:
("('getitem-32103d4a23823ad4f97dcb3faed7cf07', 0, ... cd39>]), False)
Consider scattering large objects ahead of time
with client.scatter to reduce scheduler burden and keep data on workers
future = client.submit(func, big_data) # bad
big_future = client.scatter(big_data) # good
future = client.submit(func, big_future) # good
% (format_bytes(len(b)), s))
As far as I understand (from this and this question), the method proposed by the warning helps in getting large data into the function. However, my inputs are all scalar values, so they should not take up nearly 3MB of memory. Even if the function run_model() does not take any argument at all (so no parameters are passed), I get the same warning.
I also had a look at the task graph to see whether there is some step that requires loading lots of data. For three realizations it looks like this:
So it seems to me that every realization is handled separately, which should keep the amount of data to deal with low.
I would like to understand what actually is the step that produces a large object and what I need to do to break it down into smaller parts.
The message is, in this case, slightly misleading. The issue is demonstrated by the following:
> len(out[:, 0, :].dask)
40000
> out[:, 0, :].npartitions
10000
and the pickled size of that graph (whose head is the getitem key in the message) is the ~3MB. By creating a dask array for each element of the computation, you end up with a stacked array with as many partitions as elements, and the model-run operation, the item selection, and the storage operation are applied to every single one and stored in the graph. Yes, they are independent, and likely this entire computation would complete, but this is all very wasteful, unless the model-making function runs for a considerable time on each input scalar.
In your real situation, it may be that the inner arrays are in fact bigger than the one-element version you present, but in the general case of doing numpy operations on arrays, it is normal to create the arrays on the workers (with a random or some other load function) and operate on partitions with sizes >100MB.
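As an illustration of that pattern (not the code from the question), the whole ensemble can be expressed as a single dask array with a handful of large chunks, so the random data is generated on the workers and the graph stays small; the chunk size of 1000 below is an arbitrary choice:

import numpy as np
import dask.array as da

n_sims, n_time = 10000, 100
a_vals = np.random.randn(n_sims)
b_vals = np.random.randn(n_sims)

# noise is generated on the workers, in 10 chunks of 1000 realizations each
noise = da.random.standard_normal((n_sims, n_time), chunks=(1000, n_time))

# broadcast the per-realization scalars against the noise; still only 10 partitions
a = da.from_array(a_vals[:, None], chunks=(1000, 1))
b = da.from_array(b_vals[:, None], chunks=(1000, 1))
out = a * noise + b

print(out.npartitions)   # 10 instead of 10000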
I was reading up on Python Memory Management and would like to reduce the memory footprint of my application. It was suggested that subprocesses would go a long way in mitigating the problem, but I'm having trouble conceptualizing what needs to be done. Could someone please provide a simple example of how to turn this...
import copy

def my_function():
    x = range(1000000)
    y = copy.deepcopy(x)
    del x
    return y

@subprocess_witchcraft
def my_function_dispatcher(*args):
    return my_function()
...into a real subprocessed function that doesn't store an extra "free-list"?
Bonus Question:
Does this "free-list" concept apply to python c-extensions as well?
The important thing about the optimization suggestion is to make sure that my_function() is only invoked in a subprocess. The deepcopy and del are irrelevant — once you create five million distinct integers in a process, holding onto all of them at the same time, it's game over. Even if you stop referring to those objects, Python will free them by keeping references to five million empty integer-object-sized fields in a limbo where they await reuse for the next function that wants to create five million integers. This is the free list mentioned in the other answer, and it buys blindingly fast allocation and deallocation of ints and floats. It is only fair to Python to note that this is not a memory leak since the memory is definitely made available for further allocations. However, that memory will not get returned to the system until the process ends, nor will it be reused for anything other than allocating numbers of the same type.
Most programs don't have this problem because most programs do not create pathologically huge lists of numbers, free them, and then expect to reuse that memory for other objects. Programs using numpy are also safe because numpy stores the numeric data of its arrays in tightly packed native format. For programs that do follow this usage pattern, the way to mitigate the problem is by not creating a large number of integers at the same time in the first place, at least not in the process which needs to return memory to the system. It is unclear what exact use case you have, but a real-world solution will likely require more than a "magic decorator".
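A rough sketch of that last point (Python 3 syntax; exact sizes vary by platform and interpreter version): a list of a million ints keeps a million separate integer objects alive, while a NumPy array of the same values is one contiguous block of packed 8-byte numbers:

import sys
import numpy as np

n = 1000000
as_list = list(range(n))
as_array = np.arange(n, dtype=np.int64)

# the list object itself holds only pointers; each of the million int objects
# it points to costs roughly another 28 bytes on 64-bit CPython 3
print(sys.getsizeof(as_list))          # ~8 MB of pointers
print(sys.getsizeof(as_list[-1]))      # ~28 bytes for a single int object

# the array stores the values themselves, contiguously
print(as_array.nbytes)                 # 8 MB of packed int64 data, no per-element objects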
This is where subprocesses come in: if the list of numbers is created in another process, then all the memory associated with the list, including but not limited to the storage of ints, is both freed and returned to the system by the mere act of terminating the subprocess. Of course, you must design your program so that the list can be both created and processed in the subprocess, without requiring the transfer of all these numbers. The subprocess can receive information needed to create the data set, and can send back the information obtained from processing the list.
To illustrate the principle, let's upgrade your example so that the whole list actually needs to exist - say we're benchmarking sorting algorithms. We want to create a huge list of integers, sort it, and reliably free the memory associated with the list, so that the next benchmark can allocate memory for its own needs without worrying about running out of RAM. To spawn the subprocess and communicate, this uses the multiprocessing module:
# To run this, save it to a file that looks like a valid Python module, e.g.
# "foo.py" - multiprocessing requires being able to import the main module.
# Then run it with "python foo.py".

import multiprocessing, random, sys, os, time

def create_list(size):
    # utility function for clarity - runs in subprocess
    maxint = sys.maxint
    randrange = random.randrange
    return [randrange(maxint) for i in xrange(size)]

def run_test(state):
    # this function is run in a separate process
    size = state['list_size']
    print 'creating a list with %d random elements - this can take a while... ' % size,
    sys.stdout.flush()
    lst = create_list(size)
    print 'done'
    t0 = time.time()
    lst.sort()
    t1 = time.time()
    state['time'] = t1 - t0

if __name__ == '__main__':
    manager = multiprocessing.Manager()
    state = manager.dict(list_size=5*1000*1000)  # shared state
    p = multiprocessing.Process(target=run_test, args=(state,))
    p.start()
    p.join()
    print 'time to sort: %.3f' % state['time']
    print 'my PID is %d, sleeping for a minute...' % os.getpid()
    time.sleep(60)
    # at this point you can inspect the running process to see that it
    # does not consume excess memory
Bonus Answer
It is hard to provide an answer to the bonus question, since the question is unclear. The "free list" concept is exactly that, a concept, an implementation strategy that needs to be explicitly coded on top of the regular Python allocator. Most Python types do not use that allocation strategy; for example, it is not used for instances of classes created with the class statement. Implementing a free list is not hard, but it is fairly advanced and rarely undertaken without good reason. If some extension author has chosen to use a free list for one of their types, it can be expected that they are aware of the tradeoff a free list offers — gaining extra-fast allocation/deallocation at the cost of some additional space (for the objects on the free list and the free list itself) and the inability to reuse the memory for something else.
I have three large lists. The first contains bitarrays (module bitarray 0.8.0), and the other two contain arrays of integers.
l1=[bitarray 1, bitarray 2, ... ,bitarray n]
l2=[array 1, array 2, ... , array n]
l3=[array 1, array 2, ... , array n]
These data structures take quite a bit of RAM (~16GB total).
If I start 12 sub-processes using:
multiprocessing.Process(target=someFunction, args=(l1,l2,l3))
Does this mean that l1, l2 and l3 will be copied for each sub-process or will the sub-processes share these lists? Or to be more direct, will I use 16GB or 192GB of RAM?
someFunction will read some values from these lists and then perform some calculations based on the values read. The results will be returned to the parent process. The lists l1, l2 and l3 will not be modified by someFunction.
Therefore I would assume that the sub-processes do not need to and would not copy these huge lists, but would instead just share them with the parent. Meaning that the program would take 16GB of RAM (regardless of how many sub-processes I start) due to the copy-on-write approach under Linux?
Am I correct, or am I missing something that would cause the lists to be copied?
EDIT:
I am still confused after reading a bit more on the subject. On the one hand, Linux uses copy-on-write, which should mean that no data is copied. On the other hand, accessing an object will change its ref-count (I am still unsure why, and what that means). Even so, will the entire object be copied?
For example if i define someFunction as follows:
def someFunction(list1, list2, list3):
    i = random.randint(0, 99999)
    print list1[i], list2[i], list3[i]
Would using this function mean that l1, l2 and l3 will be copied entirely for each sub-process?
Is there a way to check for this?
EDIT2 After reading a bit more and monitoring the total memory usage of the system while sub-processes are running, it seems that entire objects are indeed copied for each sub-process, and it seems to be because of reference counting.
The reference counting for l1, l2 and l3 is actually unneeded in my program. This is because l1, l2 and l3 will be kept in memory (unchanged) until the parent process exits. There is no need to free the memory used by these lists until then. In fact, I know for sure that the reference count will remain above 0 (for these lists and every object in these lists) until the program exits.
So now the question becomes, how can I make sure that the objects will not be copied to each sub-process? Can I perhaps disable reference counting for these lists and each object in these lists?
EDIT3 Just an additional note. Sub-processes do not need to modify l1, l2 and l3 or any objects in these lists. The sub-processes only need to be able to reference some of these objects without causing the memory to be copied for each sub-process.
Because this is still a very high result on Google and no one else has mentioned it yet, I thought I would mention the new possibility of 'true' shared memory which was introduced in Python 3.8.0: https://docs.python.org/3/library/multiprocessing.shared_memory.html
I have included a small contrived example here (tested on Linux) where NumPy arrays are used, which is likely a very common use case:
# one dimension of the 2d array which is shared
dim = 5000

import numpy as np
from multiprocessing import shared_memory, Process, Lock
from multiprocessing import cpu_count, current_process
import time

lock = Lock()

def add_one(shr_name):
    existing_shm = shared_memory.SharedMemory(name=shr_name)
    np_array = np.ndarray((dim, dim,), dtype=np.int64, buffer=existing_shm.buf)
    lock.acquire()
    np_array[:] = np_array[0] + 1
    lock.release()
    time.sleep(10)  # pause, to see the memory usage in top
    print('added one')
    existing_shm.close()

def create_shared_block():
    a = np.ones(shape=(dim, dim), dtype=np.int64)  # Start with an existing NumPy array
    shm = shared_memory.SharedMemory(create=True, size=a.nbytes)
    # Now create a NumPy array backed by shared memory
    np_array = np.ndarray(a.shape, dtype=np.int64, buffer=shm.buf)
    np_array[:] = a[:]  # Copy the original data into shared memory
    return shm, np_array

if current_process().name == "MainProcess":
    print("creating shared block")
    shr, np_array = create_shared_block()

    processes = []
    for i in range(cpu_count()):
        _process = Process(target=add_one, args=(shr.name,))
        processes.append(_process)
        _process.start()

    for _process in processes:
        _process.join()

    print("Final array")
    print(np_array[:10])
    print(np_array[10:])

    shr.close()
    shr.unlink()
Note that because of the 64-bit ints this code can take about 1 GB of RAM to run, so make sure that you won't freeze your system using it. ^_^
Generally speaking, there are two ways to share the same data:
Multithreading
Shared memory
Python's multithreading is not suitable for CPU-bound tasks (because of the GIL), so the usual solution in that case is to go with multiprocessing. However, with this solution you need to explicitly share the data, using multiprocessing.Value and multiprocessing.Array.
Note that usually sharing data between processes may not be the best choice, because of all the synchronization issues; an approach involving actors exchanging messages is usually seen as a better choice. See also the Python documentation:

As mentioned above, when doing concurrent programming it is usually best to avoid using shared state as far as possible. This is particularly true when using multiple processes.

However, if you really do need to use some shared data then multiprocessing provides a couple of ways of doing so.
In your case, you need to wrap l1, l2 and l3 in some way understandable by multiprocessing (e.g. by using a multiprocessing.Array), and then pass them as parameters.
Note also that, since you said you do not need write access, you should pass lock=False when creating the objects, or all access will still be serialized; an illustrative sketch follows.
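A minimal sketch of that idea (my example, not code from the question; the typecode, sizes and names are purely illustrative), sharing one integer array read-only with lock=False:

import multiprocessing as mp

def some_function(shared_l2, index):
    # read-only access: the workers never write to the shared array
    print(shared_l2[index])

if __name__ == '__main__':
    # wrap one integer array in shared memory; lock=False because access is read-only
    l2_shared = mp.Array('i', range(100), lock=False)

    workers = [mp.Process(target=some_function, args=(l2_shared, i)) for i in range(4)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()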
For those interested in using Python 3.8's shared_memory module, it still has a bug (github issue link here) which hasn't been fixed and affects Python 3.8/3.9/3.10 as of now (2021-01-15). The bug affects POSIX systems and concerns the resource tracker destroying shared memory segments while other processes still have valid access to them. So take care if you use it in your code.
If you want to make use of the copy-on-write feature and your data is static (unchanged in child processes), you should keep Python from touching the memory blocks where your data lies. You can easily do this by using C or C++ structures (STL, for instance) as containers and providing your own Python wrappers that use pointers to the data memory (or possibly copy it) when a Python-level object is created, if one is created at all.
All of this can be done very easily, with almost Python-like simplicity and syntax, using Cython.
# pseudo cython
cdef class FooContainer:
    cdef char * data

    def __cinit__(self, char * foo_value):
        self.data = malloc(1024 * sizeof(char))
        memcpy(self.data, foo_value, min(1024, len(foo_value)))

    def get(self):
        return self.data


# python part
from foo import FooContainer

f = FooContainer("hello world")
pid = fork()
if not pid:
    f.get()  # this call will read the same memory page to which the
             # parent process wrote 1024 chars of self.data,
             # and cython will automatically create a new python string
             # object from it and return it to the caller
The above pseudo-code is badly written; don't use it. In place of self.data there should be a C or C++ container in your case.
You can use memcached or Redis and set each as a key-value pair:
{'l1'...
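For example, a minimal sketch with redis-py, assuming a local Redis server and pickle for serialization (key names and placeholder data are illustrative only):

import pickle
import redis

# placeholder data standing in for one of the real lists
l2 = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

r = redis.Redis(host='localhost', port=6379)  # assumes a local Redis server

# parent process: store each element once under its own key
for i, arr in enumerate(l2):
    r.set('l2:%d' % i, pickle.dumps(arr))

# a sub-process then fetches only the elements it actually needs,
# instead of inheriting (and possibly copying) the whole list
item = pickle.loads(r.get('l2:1'))
print(item)   # [4, 5, 6]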