I'm using the following algorithm to do some calculations on an array of Decimals:
import itertools
import operator
from decimal import Decimal
from functools import reduce
fkn = Decimal('0')
for bits in itertools.combinations(decimals_array, elements_count):
    kxn = reduce(operator.mul, bits, Decimal('1'))
    fkn += kxn
I'm using Python 3.4 x64.
The Decimals have a precision > 300 (that's a requirement).
len(decimals_array) is most of the time over 40.
elements_count is most of the time len(decimals_array)/2.
The calculations take a very long time.
I wanted to make them multiprocess, so my first idea was to build an array of all combinations and send parts of it to several processes - but while building such an array I quickly get a MemoryError exception.
Now I'm looking for a nicer way to make this code multiprocess.
What is a good way to run this algorithm on multiple cores?
Or maybe there is a better (faster) way to make such calculations?
Thank you in advance for some ideas.
In order to really parallelize this you need to get around combinations() being sequential, so that each process can generate its own combinations. The rest of the problem is already parallelizable.
40 choose 20 is about 138 billion combinations so pre-generating that or generating it in each process is going to hurt. With a 20-element list taking around 224 bytes (says sys.getsizeof()) that's 30 something terabytes if you generate the whole thing in one go. No wonder you ran out of memory. You also can't really split a generator across processes; or rather, if you do, each process will get its own copy of the generator.
Solution 1 is to have a process whose sole job is to generate combinations and push them into a queue, possibly in batches to save on IPC overhead, and have the other processes consume combinations from that queue.
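For concreteness, here's a minimal sketch of Solution 1, assuming decimals_array and elements_count exist as in the question. The batch size, the precision value and the helper names are mine, and in this sketch the main process plays the role of the generator:
import itertools
import operator
from decimal import Decimal, getcontext
from functools import reduce
from multiprocessing import Process, Queue, cpu_count

getcontext().prec = 310   # assumption: whatever precision the question actually needs
BATCH = 10000             # combinations per queue message, to limit IPC overhead
SENTINEL = None           # marks the end of the stream for each consumer

def consumer(work_queue, result_queue):
    # sums the products of all combinations it receives and reports one partial sum
    partial = Decimal('0')
    while True:
        batch = work_queue.get()
        if batch is SENTINEL:
            break
        for combo in batch:
            partial += reduce(operator.mul, combo, Decimal('1'))
    result_queue.put(partial)

def parallel_sum_of_products(decimals_array, elements_count, n_workers=cpu_count()):
    work_queue = Queue(maxsize=100)   # bounded, so the generator can't run too far ahead
    result_queue = Queue()
    workers = [Process(target=consumer, args=(work_queue, result_queue))
               for _ in range(n_workers)]
    for w in workers:
        w.start()
    # generate combinations and push them to the consumers in batches
    batch = []
    for combo in itertools.combinations(decimals_array, elements_count):
        batch.append(combo)
        if len(batch) == BATCH:
            work_queue.put(batch)
            batch = []
    if batch:
        work_queue.put(batch)
    for _ in workers:
        work_queue.put(SENTINEL)
    fkn = sum((result_queue.get() for _ in workers), Decimal('0'))
    for w in workers:
        w.join()
    return fkn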
Solution 2 is to write a non-sequential version of combinations that returns the Nth combination without computing the rest. This is definitely possible because it's possible with permutations, and combinations are an internally sorted subset of permutations. Then each process in a Pool can generate its own combinations based on a start and a step of N - process one handles combinations 0, 3, 6..., process two handles combinations 1, 4, 7... and so on, for example. This would probably be even slower unless you use C/Cython, though.
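And a rough sketch of Solution 2, built around the nth_combination recipe from the itertools documentation (again assuming decimals_array and elements_count are defined as in the question; partial_sum, the worker count and the striding scheme are just one possible layout, and as said above the pure-Python index decoding will likely be slow):
import operator
from decimal import Decimal, getcontext
from functools import reduce
from math import factorial
from multiprocessing import Pool

getcontext().prec = 310   # assumption: whatever precision the question actually needs

def n_choose_r(n, r):
    return factorial(n) // factorial(r) // factorial(n - r)

def nth_combination(iterable, r, index):
    # Recipe from the itertools docs: the index-th combination in
    # lexicographic order, without generating the earlier ones.
    pool = tuple(iterable)
    n = len(pool)
    c = 1
    k = min(r, n - r)
    for i in range(1, k + 1):
        c = c * (n - k + i) // i      # c = n choose r
    if index < 0:
        index += c
    if not 0 <= index < c:
        raise IndexError
    result = []
    while r:
        c, n, r = c * r // n, n - 1, r - 1
        while index >= c:
            index -= c
            c, n = c * (n - r) // n, n - 1
        result.append(pool[-1 - n])
    return tuple(result)

def partial_sum(args):
    # one worker handles combinations start, start+step, start+2*step, ...
    decimals_array, elements_count, start, step, total = args
    s = Decimal('0')
    for index in range(start, total, step):
        combo = nth_combination(decimals_array, elements_count, index)
        s += reduce(operator.mul, combo, Decimal('1'))
    return s

if __name__ == '__main__':
    # decimals_array and elements_count as in the question
    n_workers = 4
    total = n_choose_r(len(decimals_array), elements_count)
    jobs = [(decimals_array, elements_count, start, n_workers, total)
            for start in range(n_workers)]
    with Pool(n_workers) as p:
        fkn = sum(p.map(partial_sum, jobs), Decimal('0'))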
Solution 3 (or possibly solution 0?) is to go over to the math stackexchange and ask if there's a mathematical rather than computational solution to this problem.
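As a hint of what such an answer tends to look like: the quantity being computed is the elementary symmetric polynomial e_k of the inputs, which can be built up with a standard recurrence in about n*k multiply-adds instead of enumerating all n-choose-k combinations. A sketch of that recurrence (not the asker's code; the precision value is an assumption):
from decimal import Decimal, getcontext

getcontext().prec = 310   # assumption: whatever precision the question actually needs

def sum_of_k_products(values, k):
    # e[j] = sum of products over all j-element combinations of the values seen so far
    e = [Decimal('0')] * (k + 1)
    e[0] = Decimal('1')
    for x in values:
        # iterate downwards so each value contributes to a combination at most once
        for j in range(k, 0, -1):
            e[j] += e[j - 1] * x
    return e[k]

# fkn = sum_of_k_products(decimals_array, elements_count)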
Here is one solution, although it isn't very neat.
The idea is to use multiple processes, where each process is responsible for one interval. However, since itertools.combinations is sequential, each process has to loop over the unnecessary combinations until it reaches its own interval. Once its interval has been handled, the process stops. The code is from this book.
import itertools
from tqdm import tqdm
from math import factorial
from multiprocessing import Process

def total_combo(n, r):
    # number of combinations: n choose r
    return factorial(n) // factorial(r) // factorial(n-r)

def cal_combo(var, noCombo, start, end):
    # Each process iterates over the full combination stream, but only
    # handles (here: prints the first 10 of) its own [start, end) interval.
    data = itertools.combinations(range(var), noCombo)
    for i in enumerate(tqdm(data)):
        if i[0] >= start:
            if i[0] < start+10:
                print(i)
        if i[0] > end:
            break

if __name__=='__main__':
    noCombo = 3
    var = 1000
    print(total_combo(var,noCombo), 'combinations for', noCombo, 'of', var, 'variants')

    noProc = 6
    interval = total_combo(var,noCombo)/noProc
    if interval%1==0:
        print(interval)

    procs = []
    for pid in range(noProc):
        proc = Process(target=cal_combo, args=(var, noCombo, interval*pid, interval*(pid+1)))
        procs.append(proc)
        proc.start()
    for proc in procs:
        proc.join()
I was wondering if there was a more efficient way of doing the following without using loops.
I have a numpy array with the shape (i, x, y, z). Essentially I have i elements of the shape (x, y, z).
I want to write each element to a separate file so that I have i files, each with the data from a single element.
In my case, each element is an image, but I'm sure a solution can be format agnostic.
I'm currently looping through each of the i elements and writing them out one at a time.
As i gets really large, this takes a progressively longer time. Is there a better way or a useful library which could make this more efficient?
Update
I tried the suggestion to use multiprocessing via concurrent.futures, first with a thread pool and then with a process pool. It was simpler in the code, but the time to complete was 4x slower.
i in this case is approximately 10000, while x and y are approximately 750.
This sounds very suitable for multiprocessing, as the different elements need to be processed separately and can be saved to disk independently.
Python has a useful package for this, called multiprocessing, with a variety of pooling, processing, and other options.
Here's a simple (and comment-documented) example of usage:
from multiprocessing import Process
import numpy as np

# This should be your existing function
def write_file(element):
    # write file
    pass

# You'll still be looping of course, but in parallel over batches.
# This is a helper function for looping over one "batch".
def write_list_of_files(elements_list):
    for element in elements_list:
        write_file(element)

# Your data goes here...
all_elements = np.ones((1000, 256, 256, 3))
num_procs = 10  # depends on system limitations, number of CPU cores, etc.

# Each process runs "write_list_of_files" on its own subset of the data,
# thanks to the "k::num_procs" indexing trick.
procs = [Process(target=write_list_of_files, args=[all_elements[k::num_procs, ...]])
         for k in range(num_procs)]

for p in procs:
    p.start()  # each process starts running independently
for p in procs:
    p.join()   # ensures the code won't continue until all are "joined" and done. Optional, obviously...

print('All done!')  # this only runs once all procs are done, due to "p.join()"
I'm having a bit of trouble with Python's multiprocessing.Pool. I have two numpy arrays, a and b, in which
a.shape=(10000,3)
and
b.shape=(1000000000,3)
Then I have a function which does some computation like
def role(array, point):
    sub = array - point
    return 1 / (np.sqrt(np.min(np.sum(sub * sub, axis=-1))) + 0.001) ** 2
Next, I need to compute
[role(a, point) for point in b]
To speed it up, I try to use
cpu_num = 4
m = multiprocessing.Pool(cpu_num)
cost_list = m.starmap(role, [(a, point) for point in b])
m.close()
The whole process takes around 70s, but if I set cpu_num = 1, the processing time decreases to 60s... My laptop has 6 cores, for reference.
Here I have two questions:
Is there something I did wrong with multiprocessing.Pool? Why does the processing time increase when I set cpu_num = 4?
For a task like this (where each loop iteration is a very tiny piece of work), should I use multiprocessing to speed it up? I feel like the time Python spends filling the Pool takes longer than running the role function itself...
Any suggestions are really welcome.
Multiprocessing comes with some overhead (to create new processes), which is why it's not a very good choice when you have lots of tiny tasks, where the overhead of process creation might outweigh the benefit of parallelizing.
Have you considered vectorizing your problem?
In particular, if you broadcast the variable b you get there:
sub = a - b[::,np.newaxis] # broadcast b
1./(np.sqrt(np.min(np.sum(sub**2, axis=2), axis=-1))+0.001)**2
I believe you could then still reduce the complexity of the last expression a bit, as you're creating the square of a square root, which seems redundant (note that I'm assuming the 0.001 constant value is merely there to avoid some non-sensible operation like division by zero).
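One caveat with broadcasting against the full b: with b.shape = (1000000000, 3) the intermediate (len(b), len(a), 3) array would be enormous, so in practice you'd process b in chunks. A sketch of that, assuming a and b as in the question (the function name and chunk size are mine):
import numpy as np

def role_vectorized(a, b, chunk=1000):
    # Each chunk builds a (chunk, len(a), 3) intermediate, which stays small
    # enough to fit in memory while still avoiding the Python-level loop.
    out = np.empty(len(b))
    for start in range(0, len(b), chunk):
        sub = a - b[start:start + chunk, np.newaxis]         # (chunk, len(a), 3)
        d2 = np.min(np.sum(sub * sub, axis=-1), axis=-1)     # squared distance to the nearest row of a
        out[start:start + chunk] = 1 / (np.sqrt(d2) + 0.001) ** 2
    return out

# cost_list = role_vectorized(a, b)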
If the tasks are too tiny, then the multiprocessing overhead will be your bottleneck and you will win nothing.
If the amount of data per task that you have to pass to a worker, or that the worker has to return, is large, then you will also not win a lot (or even win nothing).
If you have 10,000 tiny tasks, then I recommend creating a list of meta tasks.
Each meta task would consist of executing, for example, 20 tiny tasks.
meta_tasks = []
for idx in range(0, len(tiny_tasks), 20):
    meta_tasks.append(tiny_tasks[idx:idx+20])
Then pass the meta tasks to your worker pool.
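A rough sketch of that, reusing role, a and b from the question (run_meta_task, the pool size and the block size of 20 are placeholders; in practice you would tune the block size):
from multiprocessing import Pool

def run_meta_task(points):
    # One meta task = a block of points; the pool overhead is paid once per
    # block instead of once per point. Assumes `a` and `role` are defined at
    # module level, as in the question, so the workers can see them.
    return [role(a, point) for point in points]

if __name__ == '__main__':
    block = 20   # for a billion points you'd want a much larger block
    meta_tasks = [b[idx:idx + block] for idx in range(0, len(b), block)]
    with Pool(4) as pool:
        cost_list = [cost for batch in pool.map(run_meta_task, meta_tasks)
                     for cost in batch]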
I have made a program that adds up a list by dividing it into subparts and using multiprocessing in Python. My code is the following:
from concurrent.futures import ProcessPoolExecutor, as_completed
import random
import time

def dummyFun(l):
    s = 0
    for i in range(0, len(l)):
        s = s + l[i]
    return s

def sumaSec(v):
    start = time.time()
    sT = 0
    for k in range(0, len(v), 10):
        vc = v[k:k+10]
        print("vector ", vc)
        for item in vc:
            sT = sT + item
        print("sequential sum result ", sT)
        sT = 0
    start1 = time.time()
    print("sequential version time ", start1-start)

def main():
    workers = 5
    vector = random.sample(range(1, 101), 100)
    print(vector)
    sumaSec(vector)

    dim = 10
    sT = 0
    for k in range(0, len(vector), dim):
        vc = vector[k:k+dim]
        print(vc)
        for item in vc:
            sT = sT + item
        print("sub list result ", sT)
        sT = 0

    chunks = (vector[k:k+dim] for k in range(0, len(vector), 10))
    start = time.time()
    with ProcessPoolExecutor(max_workers=workers) as executor:
        futures = [executor.submit(dummyFun, chunk) for chunk in chunks]
        for future in as_completed(futures):
            print(future.result())
    start1 = time.time()
    print(start1-start)

if __name__ == "__main__":
    main()
The problem is that for the sequential version I got a time of:
0.0009753704071044922
while for the concurrent version my time is:
0.10629010200500488
And when I reduce the number of workers to 2 my time is:
0.08622884750366211
Why is this happening?
The length of your vector is only 100. That is a very small amount of work, so the fixed cost of starting the process pool is the most significant part of the runtime. For this reason, parallelism is most beneficial when there is a lot of work to do. Try a larger vector, like a length of 1 million.
The second problem is that you have each worker do a tiny amount of work: a chunk of size 10. Again, that means the cost of starting a task cannot be amortized over so little work. Use larger chunks. For example, instead of 10 use int(len(vector)/(workers*10)).
Also note that you're creating 5 processes. For a CPU-bound task like this one you ideally want to use the same number of processes as you have physical CPU cores. Either use whatever number of cores your system has, or, if you use max_workers=None (the default value), ProcessPoolExecutor will default to that number for your system. If you use too few processes you're leaving performance on the table; if you use too many, the CPU will have to switch between them and your performance may suffer.
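Putting those three changes together, the pool part of the program might look roughly like this (the vector length of one million is only an example, and dummyFun is the same function as in the question):
import os
import random
import time
from concurrent.futures import ProcessPoolExecutor, as_completed

def dummyFun(l):
    # same worker function as in the question
    s = 0
    for i in range(0, len(l)):
        s = s + l[i]
    return s

def main():
    vector = [random.randint(1, 100) for _ in range(1000000)]  # much more work than 100 items
    workers = os.cpu_count()                                   # one process per core
    dim = int(len(vector) / (workers * 10))                    # larger chunks, as suggested above
    chunks = (vector[k:k + dim] for k in range(0, len(vector), dim))

    start = time.time()
    with ProcessPoolExecutor(max_workers=workers) as executor:
        futures = [executor.submit(dummyFun, chunk) for chunk in chunks]
        total = sum(future.result() for future in as_completed(futures))
    print("parallel sum", total, "in", time.time() - start, "seconds")

if __name__ == "__main__":
    main()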
Your chunking is pretty awful for creating multiple tasks.
Creating too many tasks still incurs a time penalty, even when your workers have already been created.
Maybe this post can help you in your search:
How to parallel sum a loop using multiprocessing in Python
I defined two correct ways of calculating averages in Python.
def avg_regular(values):
    total = 0
    for value in values:
        total += value
    return total/len(values)

def avg_concurrent(values):
    mean = 0
    num_of_values = len(values)
    for value in values:
        # calculate a small portion of the average for each num and add it to the total
        mean += value/num_of_values
    return mean
The first function is the regular way of calculating averages, but I wrote the second one because each run of the loop doesn't depend on previous runs. So theoretically the average can be computed in parallel.
However, the "parallel" one (without running in parallel) takes about 30% more time than the regular one.
Are my assumptions correct and worth the speed loss?
If yes, how can I make the second function run in parallel?
If not, where did I go wrong?
The code you implemented is basically the difference between (a1+a2+ ... + an) / n and (a1/n + a2/n + ... + an/n). The result is the same, but in the second version there are more operations (namely (n-1) more divisions), which slows the calculation down. You claimed that in the second version each loop run is independent of the others. In the first loop we need the following information to finish one run: the total before the run and the current value. In the second version we need the following information to finish one run: the mean before the run, the current value and num_of_values. As you can see, in the second version we depend on even more values!
But how could we divide the work between cores (which is the goal of multiprocessing)? We could just give one core the first half of the values and the other core the second half, i.e. compute ((a1 + a2 + ... + a(n//2)) + (a(n//2+1) + ... + an)) / n. Yes, the work of dividing by n is not split between the cores, but it's a single instruction so we don't really care. We also need to add the left total and the right total, which we can't split, but again it's only a single operation.
So the code we want to run:
def my_sum(values):
    total = 0
    for value in values:
        total += value
    return total
There's still a problem with Python - normally one could use threads to do the computations, because each thread would use one core. But in that case one has to take care that the program does not run into race conditions, and the Python interpreter itself also needs to take care of that. CPython decided it's not worth it and basically only runs one thread at a time. A basic solution is to use multiple processes via multiprocessing.
from multiprocessing import Pool

if __name__ == '__main__':
    with Pool(5) as p:
        results = p.map(my_sum, [long_list[0:len(long_list)//2], long_list[len(long_list)//2:]])
    print(sum(results) / len(long_list))  # add the subresults and divide by n
But of course multiple processes do not come for free: you need to fork, copy stuff, etc., so you will not gain a speedup of 2 as one might expect. Also, the biggest slowdown is actually using Python itself; it's not really optimized for fast numerical computations. There are various ways around that, but using numpy is probably the simplest. Just use:
import numpy
print(numpy.mean(long_list))
This is probably much faster than the pure Python version. I don't think numpy uses multiprocessing internally, so one could gain a boost by combining multiple processes with a fast implementation (numpy or something else written in C), but normally numpy is fast enough.
Let's say I have a multithreaded queue myQueue with 300 elements. Can I remove the 100 oldest without having to iteratively call myQueue.get() 100 times in a for loop?
note
Trying to avoid using a for loop might seem like a weird goal, but I am looking for ways to improve performance and increase the simplicity of the code. That's why I would like to be able to process the queue (remove the elements) in a non-iterative manner.
I found out that something like the following works, and it also seems that it doesn't violate any thread-safety rules.
Example
import Queue

myQueue = Queue.Queue()
for i in range(1, 100):
    myQueue.put(i)

amount = [i for i in range(1, 11)]
map(myQueue.get, amount)  # that will remove the 10 oldest elements
That is equivalent to using a for loop, but the loop code is implemented in C and can result in better performance. (Note that this relies on Python 2's eager map(); in Python 3, map() is lazy and would have to be consumed, e.g. with list().)
Another way would be to use the parallel equivalent of the map() built-in function. This would divide the chunks that have to be processed in the loop among different processes.
import multiprocessing
pool = multiprocessing.Pool()
pool.map(myQueue.get, range(1, 10))