I was wondering if there was a more efficient way of doing the following without using loops.
I have a numpy array with the shape (i, x, y, z). Essentially I have i elements of the shape (x, y, z).
I want to write each element to a separate file so that I have i files, each with the data from a single element.
In my case, each element is an image, but I'm sure a solution can be format agnostic.
I'm currently looping through each of the i elements and writing them out one at a time.
As i gets really large, this takes a progressively longer time. Is there a better way or a useful library which could make this more efficient?
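For reference, the current approach is essentially a plain loop like the sketch below (np.save is just a hypothetical stand-in for whatever image writer is actually used):

import numpy as np

data = np.random.rand(100, 64, 64, 3)  # placeholder for the (i, x, y, z) array
for idx, element in enumerate(data):
    np.save(f'element_{idx:05d}.npy', element)  # one file per element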
Update
I tried the suggestion to use multiprocessing via concurrent.futures, first with the thread pool and then with the process pool. The code was simpler, but the time to complete was 4x slower.
i in this case is approximately 10,000, while x and y are approximately 750.
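For context, the attempt looked roughly like the sketch below (write_one and the toy array are placeholders, not the actual code):

from concurrent.futures import ProcessPoolExecutor
import numpy as np

def write_one(args):
    idx, element = args                         # element is one (x, y, z) slice
    np.save(f'element_{idx:05d}.npy', element)  # stand-in for the real image writer

if __name__ == '__main__':
    data = np.random.rand(100, 64, 64, 3)       # placeholder for the real array
    with ProcessPoolExecutor() as pool:
        list(pool.map(write_one, enumerate(data)))

If the real code passed each element to the workers like this, the pickling overhead alone could explain part of the slowdown.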
This sounds very suitable for multiprocessing, as the different elements need to be processed separately and can be saved to disk independently.
Python has a useful package for this, called multiprocessing, with a variety of pooling, processing, and other options.
Here's a simple (and comment-documented) example of usage:
from multiprocessing import Process
import numpy as np

# This should be your existing function
def write_file(element):
    # write file
    pass

# You'll still be looping of course, but in parallel over batches.
# This is a helper function for looping over a "batch".
def write_list_of_files(elements_list):
    for element in elements_list:
        write_file(element)

# Your data goes here...
all_elements = np.ones((1000, 256, 256, 3))
num_procs = 10  # Depends on system limitations, number of CPU cores, etc.

# Each process in this list runs "write_list_of_files", but on a separate
# slice of the data, thanks to the "k::num_procs" indexing trick.
procs = [Process(target=write_list_of_files, args=[all_elements[k::num_procs, ...]])
         for k in range(num_procs)]

for p in procs:
    p.start()  # Each process starts running independently

for p in procs:
    p.join()  # Ensures the code won't continue until all are "joined" and done. Optional, obviously...

print('All done!')  # This only runs once all procs are done, due to "p.join"
Related
I want to parallelize a piece of code that resembles the following:
Ngal=10
sampind=[7,16,22,31,45]
samples=0.3*np.ones((60,Ngal))
zt=[2.15,7.16,1.23,3.05,4.1,2.09,1.324,3.112,0.032,0.2356]
toavg=[]
for j in range(Ngal):
    gal=[]
    for m in sampind:
        gal.append(samples[m][j]-zt[j])
    toavg.append(np.mean(gal))
accuracy=np.mean(toavg)
so I followed the advice here and I rewrote it as follows:
toavg=[]
gal=[]
p = mp.Pool()
def deltaz(params):
    j=params[0] # index of the galaxy
    m=params[1] # indices for which we have sampled redshifts
    gal.append(samples[m][j]-zt[j])
    return np.mean(gal)
j=(np.linspace(0,Ngal-1,Ngal).astype(int))
m=sampind
grid=[j,m]
input=itertools.product(*grid)
results = p.map(deltaz,input)
accuracy=np.mean(results)
p.close()
p.join()
but the results are not the same. In fact, sometimes they are, sometimes they're not. It doesn't seem very deterministic. Is my approach correct? If not, what should I fix? Thank you! The modules that you will need to reproduce the above examples are:
import numpy as np
import multiprocess as mp
import itertools
Thank you!
The first issue I see is that you are creating a global variable gal which is accessed by the function deltaz. However, globals are not shared between the pool processes; each process gets its own instance. You will have to use shared memory if you want the processes to share this structure. This is probably why you see non-deterministic behavior.
The next issue is that the two versions are not actually computing the same thing. In the first one you take the mean of each set of differences (gal) and then average those means. The parallel one takes the mean of whichever elements happen to end up in the shared list. This is nondeterministic because items are assigned to processes as they become available, which is not necessarily predictable.
I would suggest parallelizing over the outer loop (one task per galaxy) and keeping the inner loop inside each task. To do this, you need zt and samples to both be in shared memory because they are accessed by all of the processes. This can get dangerous if you are modifying data, but since you appear to only be reading, it should be fine.
import numpy as np
import multiprocessing as mp
import itertools
import ctypes

# Non-parallel code
Ngal=10
sampind=[7,16,22,31,45]
samples=0.3*np.ones((60,Ngal))
zt=[2.15,7.16,1.23,3.05,4.1,2.09,1.324,3.112,0.032,0.2356]

# Nonparallel
toavg=[]
for j in range(Ngal):
    gal=[]
    for m in sampind:
        gal.append(samples[m][j]-zt[j])
    toavg.append(np.mean(gal))
accuracy=np.mean(toavg)
print(toavg)

# Parallel function
def deltaz(j):
    sampind=[7,16,22,31,45]
    gal = []
    for m in sampind:
        gal.append(sampArr[m][j]-ztArr[j])  # read from the shared arrays created below
    return np.mean(gal)

# Shared array for zt
zt_base = mp.Array(ctypes.c_double, int(len(zt)), lock=False)
ztArr = np.ctypeslib.as_array(zt_base)

# Shared array for samples
sample_base = mp.Array(ctypes.c_double, int(np.prod(samples.shape)), lock=False)
sampArr = np.ctypeslib.as_array(sample_base)
sampArr = sampArr.reshape(samples.shape)

# Copy arrays to shared
sampArr[:,:] = samples[:,:]
ztArr[:] = zt[:]

with mp.Pool() as p:
    result = p.map(deltaz, (np.linspace(0,Ngal-1,Ngal).astype(int)))

print(result)
Here is an example that produces the same results. You can add more complexity to this as you see fit but I would read about multiprocessing in general and memory types/scopes to get an idea of what will and won't work. You have to take more care when you get into the multiprocessing world. Let me know if this doesn't help and I will try to update it so that it does.
Is it possible in python (maybe using dask, maybe using multiprocessing) to 'emplace' generators on cores, and then, in parallel, step through the generators and process the results?
It needs to be generators in particular (or objects with __iter__); lists of all the elements the generators yield won't fit into memory.
In particular:
With pandas, I can call read_csv(...iterator=True), which gives me an iterator (TextFileReader) - I can for in it or explicitly call next multiple times. The entire csv never gets read into memory. Nice.
Every time I read a next chunk from the iterator, I also perform some expensive computation on it.
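Concretely, the single-file version looks roughly like this sketch (the file name and expensive_process are placeholders):

import pandas as pd

def expensive_process(df):
    # stand-in for the real per-chunk computation
    return df.sum(numeric_only=True)

# iterator=True gives a TextFileReader; chunks are only read from disk
# as they are requested, so the whole csv never sits in memory.
reader = pd.read_csv('big.csv', iterator=True, chunksize=100000)
for chunk in reader:
    partial = expensive_process(chunk)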
But now I have 2 such files. I would like to create 2 such generators, and 'emplace' 1 on one core and 1 on another, such that I can:
result = expensive_process(next(iterator))
on each core, in parallel, and then combine the results and return them. Repeat this step until one or both generators are exhausted.
It looks like the TextFileReader is not pickleable, nor is a generator. I can't figure out how to do this in dask or multiprocessing. Is there a pattern for this?
Dask's read_csv is designed to load data from multiple files in chunks, with a chunk-size that you can specify. When you operate on the resultant dataframe, you will be working chunk-wise, which is exactly the point of using Dask in the first place. There should be no need to use your iterator method.
The dask dataframe method you will want to use, most likely, is map_partitions().
If you really wanted to use the iterator idea, you should look into dask.delayed, which is able to parallelise arbitrary python functions, by sending each invocation of the function (with a different file-name for each) to your workers.
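As a minimal sketch of the map_partitions route (the file names and the per-chunk function are placeholders; depending on what the function returns you may need to pass meta= so Dask can infer the output type):

import dask.dataframe as dd

def expensive_process(df):
    # stand-in for the real per-chunk computation; receives a pandas DataFrame
    return df.sum(numeric_only=True)

# Both files are read lazily in ~64 MB chunks; map_partitions applies the
# function to each chunk and compute() runs the chunks in parallel.
ddf = dd.read_csv(['file1.csv', 'file2.csv'], blocksize='64MB')
result = ddf.map_partitions(expensive_process).compute()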
So luckily I think this problem maps nicely onto Python's multiprocessing.Process and multiprocessing.Queue.
from multiprocessing import Process, Queue

# "something", "whatever", NumCores, size and some_combination_of are
# placeholders for your own code and settings.

def data_generator(whatever):
    for v in something(whatever):
        yield v

def generator_constructor(whatever):
    def generator(outputQueue):
        for d in data_generator(whatever):
            outputQueue.put(d)
        outputQueue.put(None)  # sentinel
    return generator

def procSumGenerator():
    outputQs = [Queue(size) for _ in range(NumCores)]
    procs = [Process(target=generator_constructor(whatever),
                     args=(outputQs[i],))
             for i in range(NumCores)]
    for proc in procs:
        proc.start()

    # until any output queue returns a None, collect
    # from all and yield
    done = False
    while not done:
        results = [oq.get() for oq in outputQs]
        done = any(res is None for res in results)
        if not done:
            yield some_combination_of(results)

    for proc in procs:
        proc.terminate()

for v in procSumGenerator():
    print(v)
Maybe this can be done better with Dask? I find that my solution fairly quickly saturates the network for large sizes of generated data - I'm manipulating csvs with pandas and returning large numpy arrays.
https://github.com/colinator/doodle_generator/blob/master/data_generator_uniform_final.ipynb
I'm trying to speed up a section of my code using parallel processing in python, but I'm having trouble getting it to work right, or even find examples that are relevant to me.
The code produces a low-polygon version of an image using Delaunay triangulation, and the part that's slowing me down is finding the mean values of each triangle.
I've been able to get a good speed increase by vectorizing my code, but hope to get more using parallelization:
The code I'm having trouble with is an extremely simple for loop:
for tri in tris:
    lopo[tridex==tri,:] = np.mean(hipo[tridex==tri,:],axis=0)
The variables referenced are as follows.
tris - a unique python list of all the indices of the triangles
lopo - a Numpy array of the final low-polygon version of the image
hipo - a Numpy array of the original image
tridex - a Numpy array the same size as the image. Each element represents a pixel and stores the triangle that the pixel lies within
I can't seem to find a good example that uses multiple numpy arrays as input, with one of them shared.
I've tried multiprocessing (with the above snippet wrapped in a function called colorImage):
p = Process(target=colorImage, args=(hipo,lopo,tridex,ppTris))
p.start()
p.join()
But I get a broken pipe error immediately.
So the way that Python's multiprocessing works (for the most part) is that you have to designate the individual threads that you want to run. I made a brief introductory tutorial here: http://will-farmer.com/parallel-python.html
In your case, what I would recommend is splitting tris into a bunch of equally sized parts, each of which represents a "worker". You can split this list with numpy.split() (documentation here: http://docs.scipy.org/doc/numpy/reference/generated/numpy.split.html).
Then, for each list in tri_lists, we use the threading and queue modules to designate 8 workers.
import threading
import queue
import numpy as np

# split into 8 different lists
tri_lists = np.split(tris, 8)

# Queues are threadsafe
return_values = queue.Queue()
threads = []

def color_image(q, tris, hipo, tridex):
    """ This is the function we're parallelizing """
    for tri in tris:
        q.put((tri, np.mean(hipo[tridex==tri,:], axis=0)))

# Now we run the jobs
for i in range(8):
    threads.append(threading.Thread(
        target=color_image,
        args=(return_values, tri_lists[i], hipo, tridex)))
    threads[-1].start()

# Wait for all workers to finish
for t in threads:
    t.join()

# Now we have to cleanup our results
# First get items from queue
results = [item for item in return_values.queue]

# Now set values in lopo
for tri, mean_value in results:
    lopo[tridex==tri, :] = mean_value
This isn't the cleanest way to do it, and I'm not sure if it works since I can't test it, but this is a decent way to do it. The parallelized part is now np.mean(), while setting the values is not parallelized.
If you want to also parallelize the setting of the values, you'll have to have a shared variable, either using the Queue, or with a global variable.
See this post for a shared global variable: Python Global Variable with thread
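Since threads share the process's memory, another option is to have each worker write its triangles' means straight into lopo, as in the sketch below (np.array_split is used instead of np.split so the chunk count doesn't have to divide the list evenly; how much this helps depends on how much of the work NumPy does while holding the GIL):

import threading
import numpy as np

def color_image_inplace(tris_chunk, hipo, tridex, lopo):
    # each worker handles a disjoint set of triangles, so the writes
    # never touch the same pixels
    for tri in tris_chunk:
        mask = tridex == tri
        lopo[mask, :] = np.mean(hipo[mask, :], axis=0)

threads = [threading.Thread(target=color_image_inplace,
                            args=(chunk, hipo, tridex, lopo))
           for chunk in np.array_split(np.asarray(tris), 8)]
for t in threads:
    t.start()
for t in threads:
    t.join()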
I'm using such algorithm to make some calculations on array of Decimals:
fkn = Decimal('0')
for bits in itertools.combinations(decimals_array, elements_count):
    kxn = reduce(operator.mul, bits, Decimal('1'))
    fkn += kxn
I'm using Python 3.4 x64.
Decimals have precision>300 (it's a must).
len(decimals_array) is most of the time over 40.
elements_count is most of the time len(decimals_array)/2.
Calculations take very long time.
I wanted to parallelize this, so my first idea was to build an array of all combinations and send parts of it to many processes - but while building such an array I quickly get a MemoryError exception.
Now I'm looking for a nicer way to spread this code over multiple processes.
What is a good way to run this algorithm on multiple cores?
Or maybe there is a better (faster) way to make such calculations?
Thank you in advance for some ideas.
In order to really parallelize this you need to get around combinations() being sequential, so that each process can generate its own combinations. The rest of the problem is already parallelizable.
40 choose 20 is about 138 billion combinations so pre-generating that or generating it in each process is going to hurt. With a 20-element list taking around 224 bytes (says sys.getsizeof()) that's 30 something terabytes if you generate the whole thing in one go. No wonder you ran out of memory. You also can't really split a generator across processes; or rather, if you do, each process will get its own copy of the generator.
Solution 1 is to have a process whose sole job is to generate combinations and push them into a queue, possibly in batches to save on IPC overhead, and have the other processes consume combinations from that queue.
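A rough sketch of Solution 1, with batching to keep the IPC overhead down (the toy decimals_array, precision and batch size are placeholders):

import itertools
import operator
from decimal import Decimal, getcontext
from functools import reduce
from multiprocessing import Process, Queue

getcontext().prec = 310  # the question needs precision > 300

def producer(decimals_array, elements_count, q, n_workers, batch_size=1000):
    batch = []
    for combo in itertools.combinations(decimals_array, elements_count):
        batch.append(combo)
        if len(batch) == batch_size:
            q.put(batch)
            batch = []
    if batch:
        q.put(batch)
    for _ in range(n_workers):
        q.put(None)  # one sentinel per worker

def worker(q, out):
    total = Decimal(0)
    while True:
        batch = q.get()
        if batch is None:
            break
        for combo in batch:
            total += reduce(operator.mul, combo, Decimal(1))
    out.put(total)  # partial sum for this worker

if __name__ == '__main__':
    decimals_array = [Decimal(k + 1) / Decimal(7) for k in range(12)]  # toy data
    elements_count = 6
    n_workers = 4
    q, out = Queue(maxsize=8), Queue()
    workers = [Process(target=worker, args=(q, out)) for _ in range(n_workers)]
    for w in workers:
        w.start()
    producer(decimals_array, elements_count, q, n_workers)
    fkn = sum((out.get() for _ in range(n_workers)), Decimal(0))
    for w in workers:
        w.join()
    print(fkn)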
Solution 2 is to write a non-sequential version of combinations that returns the Nth combination without computing the rest. This is definitely possible because it's possible with permutations, and combinations are an internally sorted subset of permutations. Then each process in a Pool can generate its own based on a start and step of N - process one counts combination 0, 3, 6..., process two counts combination 1, 4, 7... and so on, for example. This would probably be even slower unless you use C/Cython though.
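And a sketch of the unranking idea behind Solution 2 (my own illustration; more_itertools also ships a ready-made nth_combination):

from math import factorial

def ncr(n, r):
    return factorial(n) // (factorial(r) * factorial(n - r))

def nth_combination(pool, r, index):
    # Return the index-th r-combination of pool in lexicographic order,
    # without generating the earlier ones.
    result = []
    i = 0
    n = len(pool)
    while r > 0:
        here = ncr(n - i - 1, r - 1)  # combinations that start with pool[i]
        if index < here:
            result.append(pool[i])
            r -= 1
        else:
            index -= here
        i += 1
    return tuple(result)

# nth_combination(list(range(5)), 3, 0) == (0, 1, 2)
# worker k would then handle indices k, k + num_procs, k + 2*num_procs, ...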
Solution 3 (or possibly solution 0?) is to go over to the math stackexchange and ask if there's a mathematical rather than computational solution to this problem.
Here is one solution although it isn't very neat.
The idea is to use multiple processes where one process is responsible for one interval. However, since itertools.combinations is sequential, each process has to loop over unnecessary combinations until it reaches the right interval. When the right interval is handled, the process stops. The code is from this book.
import itertools
from tqdm import tqdm
from math import factorial
from multiprocessing import Process

def total_combo(n, r):
    return factorial(n) // factorial(r) // factorial(n-r)

def cal_combo(var, noCombo, start, end):
    data = itertools.combinations(range(var), noCombo)
    for i in enumerate(tqdm(data)):
        if i[0] >= start:
            if i[0] < start+10: print(i)
            if i[0] > end: break

if __name__=='__main__':
    noCombo=3
    var=1000
    print(total_combo(var,noCombo),'combinations for',noCombo,'of',var,'variants')

    noProc=6
    interval=total_combo(var,noCombo)/noProc
    if interval%1==0:
        print(interval)

    procs=[]
    for pid in range(noProc):
        proc = Process(target=cal_combo, args=(var, noCombo, interval*pid, interval*(pid+1)))
        procs.append(proc)
        proc.start()

    for proc in procs:
        proc.join()
I am performing some large computations on 3 different numpy 2D arrays sequentially. The arrays are huge, 25000x25000 each. Each computation takes significant time, so I decided to run 3 of them in parallel on 3 CPU cores on the server. I am following the standard multiprocessing guidelines, creating 2 processes and a worker function. Two computations run through the 2 processes and the third one runs locally without a separate process. I am passing the huge arrays as arguments to the processes, like:
p1 = Process(target = Worker, args = (queue1, array1, ...)) # Some other params also going
p2 = Process(target = Worker, args = (queue2, array2, ...)) # Some other params also going
The Worker function sends back two numpy vectors (1D arrays) in a list appended to the queue, like:
queue.put([v1, v2])
I am not using multiprocessing.pool
But surprisingly I am not getting any speedup; it is actually running 3 times slower. Is passing the large arrays taking time? I am unable to figure out what is going on. Should I use shared memory objects instead of passing arrays?
I shall be thankful if anybody can help.
Thank you.
My problem appears to be resolved. I was using a Django module from inside which I was calling multiprocessing.pool.map_async. My worker function was a method of the class itself, and that was the problem. Multiprocessing cannot call a method of the same class inside another process, because subprocesses do not share memory, so inside the subprocess there is no live instance of the class. That is probably why it was not getting called, as far as I understood. I moved the function out of the class and placed it in the same file, just before the class definition. That worked, and I got a moderate speedup as well.
One more thing: if you are facing the same problem, please do not read large arrays and pass them between processes. Pickling and unpickling takes a lot of time, and you won't get a speedup; you'll get a slowdown. Try to read the arrays inside the subprocess itself.
And if possible please use numpy.memmap arrays, they are quite fast.
Here is an example using np.memmap and Pool. Note that you can define the number of tasks and workers. In this case you don't have direct control over the queue, which could be achieved using multiprocessing.Queue:
from multiprocessing import Pool
import numpy as np

def mysum(array_file_name, col1, col2, shape):
    a = np.memmap(array_file_name, shape=shape, mode='r+')
    a[:, col1:col2] = np.random.random((shape[0], col2-col1))
    ans = a[:, col1:col2].sum()
    del a
    return ans

if __name__ == '__main__':
    nop = 1000  # number of tasks submitted to the pool
    now = 3     # number of workers
    p = Pool(now)

    array_file_name = 'test.array'
    shape = (250000, 250000)
    a = np.memmap(array_file_name, shape=shape, mode='w+')
    del a

    cols = [[shape[1]*i//nop, shape[1]*(i+1)//nop] for i in range(nop)]

    results = []
    for c1, c2 in cols:
        r = p.apply_async(mysum, args=(array_file_name, c1, c2, shape))
        results.append(r)

    p.close()
    p.join()

    final_result = sum([r.get() for r in results])
    print(final_result)
You can achieve better performance using shared-memory parallel processing, when possible. See this related question:
Shared-memory objects in python multiprocessing
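For reference, on Python 3.8+ a minimal sketch of that shared-memory route with multiprocessing.shared_memory looks roughly like this (the sizes and the column-sum task are placeholders); the workers attach to the array by name instead of having it pickled to them:

import numpy as np
from multiprocessing import Pool, shared_memory

def col_sum(args):
    name, shape, dtype, c1, c2 = args
    shm = shared_memory.SharedMemory(name=name)  # attach to the existing block
    a = np.ndarray(shape, dtype=dtype, buffer=shm.buf)
    s = a[:, c1:c2].sum()
    shm.close()
    return s

if __name__ == '__main__':
    shape, dtype = (5000, 5000), np.float64
    shm = shared_memory.SharedMemory(create=True, size=int(np.prod(shape)) * 8)
    a = np.ndarray(shape, dtype=dtype, buffer=shm.buf)
    a[:] = np.random.random(shape)
    chunks = [(shm.name, shape, dtype, i * 1000, (i + 1) * 1000) for i in range(5)]
    with Pool(3) as p:
        total = sum(p.map(col_sum, chunks))
    print(total)
    shm.close()
    shm.unlink()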