I am trying to build a multiprocessing MongoDB utility. It works correctly, but I think I have a performance issue: even with 20 workers it isn't processing more than 2,800 docs per second, and I think I can get 5x faster. My code isn't doing anything exceptional; it just prints the remaining time until the end of the cursor.
Maybe there is a better way to perform multiprocessing on a MongoDB cursor, because I need to run some work on every doc in a 17.4M-record collection, so performance and shorter run time are a must.
START = time.time()

def remaining_time(a, b):
    if START:
        y = (time.time() - START)
        z = ((a * y) / b) - y
        d = time.strftime('%H:%M:%S', time.gmtime(z))
        e = round(b / y)
        progress("{0}/{1} | Time remaining {2} ({3}p/s)".format(b, a, d, e), b, a)

def progress(p, c, t):
    pc = (c * 100) / t
    sys.stdout.write("%s [%-20s] %d%%\r" % (p, '█' * (pc / 5), pc))
    sys.stdout.flush()
def dowork(queue):
    for p, i, pcount in iter(queue.get, 'STOP'):
        remaining_time(pcount, i)

def populate_jobs(queue):
    mongo_query = {}
    products = MONGO.mydb.items.find(mongo_query, no_cursor_timeout=True)
    if products:
        pcount = products.count()
        i = 1
        print "Processing %s products..." % pcount
        for p in products:
            try:
                queue.put((p, i, pcount))
                i += 1
            except Exception, e:
                utils.log(e)
                continue
        queue.put('STOP')
def main():
    queue = multiprocessing.Queue()
    procs = [multiprocessing.Process(target=dowork, args=(queue,)) for _ in range(CONFIG_POOL_SIZE)]
    for p in procs:
        p.start()
    populate_jobs(queue)
    for p in procs:
        p.join()
Also, I've noticed that about every 2,500 documents the script pauses for roughly 0.5-1 seconds, which is obviously a problem. This looks like a MongoDB issue, because if I run exactly the same loop over range(0, 1000000) the script doesn't pause at all and runs at 57,000 iterations per second, finishing in about 20 seconds total. That is a huge difference from 2,800 MongoDB documents per second.
This is the code for the 1,000,000-iteration loop instead of docs:
def populate_jobs(queue):
    mongo_query = {}
    products = MONGO.mydb.items.find(mongo_query, no_cursor_timeout=True)
    if products:
        pcount = 1000000
        i = 1
        print "Processing %s products..." % pcount
        for p in range(0, 1000000):
            queue.put((p, i, pcount))
            i += 1
        queue.put('STOP')
UPDATE
As I see it, the problem is not the multiprocessing itself; it is the cursor filling the Queue. That part (the populate_jobs method) runs in a single process, so the Queue is only filled at about 2,800 items per second while the dowork workers could consume far more. Maybe if I could read the cursor with multiple threads/processes and fill the Queue in parallel, the dowork workers would run faster, but I don't know how to parallelize a MongoDB cursor.
Maybe the problem is the latency between my computer and the MongoDB server: the round trip between asking for the next batch and MongoDB returning it would cut my throughput by a factor of roughly 20 (from 61,000 str/s to 2,800 doc/s).
NOPE. I've tried against a localhost MongoDB and performance is exactly the same... This is driving me nuts.
Here's how you can use a Pool to feed the children:
START = time.time()

def remaining_time(a, b):
    if START:
        y = (time.time() - START)
        z = ((a * y) / b) - y
        d = time.strftime('%H:%M:%S', time.gmtime(z))
        e = round(b / y)
        progress("{0}/{1} | Time remaining {2} ({3}p/s)".format(b, a, d, e), b, a)

def progress(p, c, t):
    pc = (c * 100) / t
    sys.stdout.write("%s [%-20s] %d%%\r" % (p, '█' * (pc / 5), pc))
    sys.stdout.flush()

def dowork(args):
    p, i, pcount = args
    remaining_time(pcount, i)

def main():
    pool = multiprocessing.Pool(CONFIG_POOL_SIZE)
    mongo_query = {}
    products = MONGO.mydb.items.find(mongo_query, no_cursor_timeout=True)
    pcount = products.count()
    pool.map(dowork, ((p, idx, pcount) for idx, p in enumerate(products)))
    pool.close()
    pool.join()
Note that using pool.map requires loading everything from the cursor into memory at once, though, which might be a problem because of how large it is. You can use imap to avoid consuming the whole thing at once, but you'll need to specify a chunksize to minimize IPC overhead:
# Calculate chunksize using same algorithm used internally by pool.map
chunksize, extra = divmod(pcount, CONFIG_POOL_SIZE * 4)
if extra:
    chunksize += 1

pool.imap(dowork, ((p, idx, pcount) for idx, p in enumerate(products)), chunksize=chunksize)
pool.close()
pool.join()
For 1,000,000 items, that gives a chunksize of 12,500. You can try sizes larger and smaller than that, and see how it affects performance.
I'm not sure this will help much though, if the bottleneck is actually just pulling the data out of MongoDB.
Why are you using multiprocessing? You don't seem to be doing any actual work in the workers that consume the queue. Python has a global interpreter lock which makes multithreaded code less performant than you'd expect, and all of this machinery is probably making the program slower, not faster.
A couple performance tips:
Try setting batch_size in your find() call to some big number (e.g. 20000). This is the maximum number of documents returned at a time, before the client fetches more, and the default is 101.
Try setting cursor_type to pymongo.cursor.CursorType.EXHAUST, which might reduce the latency you're seeing.
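For illustration, here is a sketch of what those two suggestions could look like in PyMongo, reusing the MONGO client and mydb.items collection from the question (the batch size is just an example number):

import pymongo

# Larger batches: fewer round trips to the server while iterating the cursor.
products = MONGO.mydb.items.find({}, no_cursor_timeout=True, batch_size=20000)

# Exhaust cursor: the server streams batches without waiting for repeated getMore requests.
products = MONGO.mydb.items.find({}, cursor_type=pymongo.cursor.CursorType.EXHAUST)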
Related
I have a program that I created using threads, but then I learned that threads don't run in parallel in Python while processes do. As a result, I am trying to rewrite the program using multiprocessing, but I am having a hard time doing so. I have tried following several examples that show how to create processes and pools, but I don't think it's exactly what I want.
Below is my code with the attempts I have tried. The program tries to estimate the value of pi by randomly placing points on a graph that contains a circle. The program takes two command-line arguments: one is the number of threads/processes I want to create, and the other is the total number of points to try placing on the graph (N).
import math
import sys
from time import time
import concurrent.futures
import random
import multiprocessing as mp

def myThread(arg):
    # Take care of input argument
    n = int(arg)
    print("Thread received. n = ", n)

    # main calculation loop
    count = 0
    for i in range(0, n):
        x = random.uniform(0, 1)
        y = random.uniform(0, 1)
        d = math.sqrt(x * x + y * y)
        if (d < 1):
            count = count + 1
    print("Thread found ", count, " points inside circle.")
    return count
# end myThread
# receive command line arguments
if (len(sys.argv) == 3):
    N = sys.argv[1]  # original ex: 0.01
    N = int(N)
    totalThreads = sys.argv[2]
    totalThreads = int(totalThreads)
    print("N = ", N)
    print("totalThreads = ", totalThreads)
else:
    print("Incorrect number of arguments!")
    sys.exit(1)

if ((totalThreads == 1) or (totalThreads == 2) or (totalThreads == 4) or (totalThreads == 8)):
    print()
else:
    print("Invalid number of threads. Please use 1, 2, 4, or 8 threads.")
    sys.exit(1)
# start experiment
t = int(time() * 1000)  # begin run time
total = 0

# ATTEMPT 1
# processes = []
# for i in range(totalThreads):
#     process = mp.Process(target=myThread, args=(N/totalThreads))
#     processes.append(process)
#     process.start()
# for process in processes:
#     process.join()

# ATTEMPT 2
# pool = mp.Pool(mp.cpu_count())
# total = pool.map(myThread, [N/totalThreads])

# ATTEMPT 3
# for i in range(totalThreads):
#     total = total + pool.map(myThread, [N/totalThreads])
#     p = mp.Process(target=myThread, args=(N/totalThreads))
#     p.start()

# ATTEMPT 4
# with concurrent.futures.ThreadPoolExecutor() as executor:
#     for i in range(totalThreads):
#         future = executor.submit(myThread, N/totalThreads)  # start thread
#         total = total + future.result()  # get result

# analyze results
pi = 4 * total / N
print("pi estimate =", pi)
delta_time = int(time() * 1000) - t  # calculate time required
print("Time =", delta_time, " milliseconds")
I thought that creating a loop from 0 to totalThreads that creates a process on each iteration would work. I also wanted to pass in N/totalThreads (to divide the work), but it seems that processes take in an iterable of arguments rather than a single argument to pass to the method.
What am I missing with multiprocessing? Is it even possible to do what I want with processes?
Thank you in advance for any help, it is greatly appreciated :)
I have simplified your code and used some hard-coded values which may or may not be reasonable.
import math
import concurrent.futures
import random
from datetime import datetime

def myThread(arg):
    count = 0
    for i in range(0, arg[0]):
        x = random.uniform(0, 1)
        y = random.uniform(0, 1)
        d = math.sqrt(x * x + y * y)
        if (d < 1):
            count += 1
    return count

N = 10_000
T = 8

_start = datetime.now()
with concurrent.futures.ThreadPoolExecutor() as executor:
    futures = {executor.submit(myThread, (int(N / T),)): _ for _ in range(T)}
    total = 0
    for future in concurrent.futures.as_completed(futures):
        total += future.result()
_end = datetime.now()

print(f'Estimate for PI = {4 * total / N}')
print(f'Run duration = {_end-_start}')
A typical output on my machine looks like this:
Estimate for PI = 3.1472
Run duration = 0:00:00.008895
Bear in mind that the number of threads you start is effectively managed by the ThreadPoolExecutor (TPE) when it is constructed with no parameters. It decides how many worker threads it can run based on your machine's processing capacity (number of cores etc.). Therefore you could, if you really wanted to, set T to a very high number, and the TPE will simply queue any excess tasks until it determines there is capacity to run them.
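As a small illustrative sketch of that queuing behaviour (the sizing rule in the comment is the documented default for Python 3.8+; this is not part of the code above):

import concurrent.futures
import os

# With no arguments, ThreadPoolExecutor picks min(32, os.cpu_count() + 4) workers
# (Python 3.8+); any extra submissions simply wait in its internal work queue.
with concurrent.futures.ThreadPoolExecutor() as executor:
    futures = [executor.submit(pow, 2, 10) for _ in range(1000)]  # far more tasks than workers
    print(sum(f.result() for f in futures), 'tasks done on', os.cpu_count(), 'cores')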
The multiprocessing module is quite confusing for Python beginners, especially for those who have just migrated from MATLAB and been made lazy by its parallel computing toolbox. I have the following function, which takes ~80 seconds to run, and I want to shorten this time by using the multiprocessing module of Python.
from time import time

xmax = 100000000

start = time()
for x in range(xmax):
    y = ((x+5)**2+x-40)
    if y <= 0xf+1:
        print('Condition met at: ', y, x)
end = time()

tt = end - start  # total time
print('Each iteration took: ', tt/xmax)
print('Total time: ', tt)
This outputs as expected:
Condition met at: -15 0
Condition met at: -3 1
Condition met at: 11 2
Each iteration took: 8.667453265190124e-07
Total time: 86.67453265190125
Since no iteration of the loop depends on the others, I tried to adapt this Server Process example from the official documentation to scan chunks of the range in separate processes. I eventually came up with the following code based on vartec's answer to this question, and I have also updated it based on Darkonaut's response to the current question.
from time import time
import multiprocessing as mp

def chunker(rng, t):  # this function makes t chunks out of rng
    L = rng[1] - rng[0]
    Lr = L % t
    Lm = L // t
    h = rng[0]-1
    chunks = []
    for i in range(0, t):
        c = [h+1, h + Lm]
        h += Lm
        chunks.append(c)
    chunks[t-1][1] += Lr + 1
    return chunks
def worker(lock, xrange, return_dict):
    '''worker function'''
    for x in range(xrange[0], xrange[1]):
        y = ((x+5)**2+x-40)
        if y <= 0xf+1:
            print('Condition met at: ', y, x)
            return_dict['x'].append(x)
            return_dict['y'].append(y)

            with lock:
                list_x = return_dict['x']
                list_y = return_dict['y']
                list_x.append(x)
                list_y.append(y)
                return_dict['x'] = list_x
                return_dict['y'] = list_y
if __name__ == '__main__':
    start = time()
    manager = mp.Manager()
    return_dict = manager.dict()
    lock = manager.Lock()
    return_dict['x'] = manager.list()
    return_dict['y'] = manager.list()
    xmax = 100000000
    nw = mp.cpu_count()
    workers = list(range(0, nw))
    chunks = chunker([0, xmax], nw)

    jobs = []
    for i in workers:
        p = mp.Process(target=worker, args=(lock, chunks[i], return_dict))
        jobs.append(p)
        p.start()

    for proc in jobs:
        proc.join()

    end = time()
    tt = end - start  # total time
    print('Each iteration took: ', tt/xmax)
    print('Total time: ', tt)
    print(return_dict['x'])
    print(return_dict['y'])
which considerably reduces the run time to ~17 seconds. But my shared variables don't end up holding any values. Please help me find out which part of the code is going wrong.
The output I get is:
Each iteration took: 1.7742713451385497e-07
Total time: 17.742713451385498
[]
[]
whereas I expect:
Each iteration took: 1.7742713451385497e-07
Total time: 17.742713451385498
[0, 1, 2]
[-15, -3, 11]
The issue in your example is that modifications to standard mutable structures within Manager.dict will not be propagated. I'm first showing you how to fix it with the manager, just to show you better options afterwards.
multiprocessing.Manager is a bit heavy, since it uses a separate process just for the manager, and working on a shared object requires locks for data consistency. If you run this on one machine, there are better options with multiprocessing.Pool, in case you don't have to run customized Process classes; and if you do, multiprocessing.Process together with multiprocessing.Queue would be the common way of doing it.
The quoted parts are from the multiprocessing docs.
Manager
If standard (non-proxy) list or dict objects are contained in a referent, modifications to those mutable values will not be propagated through the manager because the proxy has no way of knowing when the values contained within are modified. However, storing a value in a container proxy (which triggers a __setitem__ on the proxy object) does propagate through the manager and so to effectively modify such an item, one could re-assign the modified value to the container proxy...
In your case this would look like:
def worker(xrange, return_dict, lock):
    """worker function"""
    for x in range(xrange[0], xrange[1]):
        y = ((x+5)**2+x-40)
        if y <= 0xf+1:
            print('Condition met at: ', y, x)
            with lock:
                list_x = return_dict['x']
                list_y = return_dict['y']
                list_x.append(x)
                list_y.append(y)
                return_dict['x'] = list_x
                return_dict['y'] = list_y
The lock here would be a manager.Lock instance that you have to pass along as an argument, since the whole (now) locked operation is not by itself atomic. (Here is an easier example with Manager using Lock.)
This approach is perhaps less convenient than employing nested Proxy Objects for most use cases but also demonstrates a level of control over the synchronization.
Since Python 3.6 proxy objects are nestable:
Changed in version 3.6: Shared objects are capable of being nested. For example, a shared container object such as a shared list can contain other shared objects which will all be managed and synchronized by the SyncManager.
Since Python 3.6 you can fill your manager.dict before starting multiprocessing with manager.list as values and then append directly in the worker without having to reassign.
return_dict['x'] = manager.list()
return_dict['y'] = manager.list()
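For illustration, a minimal self-contained sketch of that nested-proxy pattern (Python 3.6+; the names here are made up for the example):

import multiprocessing as mp

def worker(return_dict):
    # 'x' is itself a manager.list proxy, so appending propagates without reassignment
    return_dict['x'].append(42)

if __name__ == '__main__':
    with mp.Manager() as manager:
        return_dict = manager.dict()
        return_dict['x'] = manager.list()
        p = mp.Process(target=worker, args=(return_dict,))
        p.start()
        p.join()
        print(list(return_dict['x']))  # [42]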
EDIT:
Here is the full example with Manager:
import time
import multiprocessing as mp
from multiprocessing import Manager, Process
from contextlib import contextmanager
# mp_utils.py from first link in code-snippet for "Pool"
# section below
from mp_utils import calc_batch_sizes, build_batch_ranges

# def context_timer ... see code snippet in "Pool" section below

def worker(batch_range, return_dict, lock):
    """worker function"""
    for x in batch_range:
        y = ((x+5)**2+x-40)
        if y <= 0xf+1:
            print('Condition met at: ', y, x)
            with lock:
                return_dict['x'].append(x)
                return_dict['y'].append(y)

if __name__ == '__main__':

    N_WORKERS = mp.cpu_count()
    X_MAX = 100000000

    batch_sizes = calc_batch_sizes(X_MAX, n_workers=N_WORKERS)
    batch_ranges = build_batch_ranges(batch_sizes)
    print(batch_ranges)

    with Manager() as manager:
        lock = manager.Lock()
        return_dict = manager.dict()
        return_dict['x'] = manager.list()
        return_dict['y'] = manager.list()

        tasks = [(batch_range, return_dict, lock)
                 for batch_range in batch_ranges]

        with context_timer():

            pool = [Process(target=worker, args=args)
                    for args in tasks]

            for p in pool:
                p.start()
            for p in pool:
                p.join()

        # Create standard container with data from manager before exiting
        # the manager.
        result = {k: list(v) for k, v in return_dict.items()}
        print(result)
Pool
Most often a multiprocessing.Pool will just do it. You have an additional challenge in your example since you want to distribute iteration over a range.
Your chunker function doesn't manage to divide the range evenly, so not every process has about the same amount of work to do:
chunker((0, 21), 4)
# Out: [[0, 4], [5, 9], [10, 14], [15, 21]] # 4, 4, 4, 6!
For the code below, please grab the code snippet for mp_utils.py from my answer here; it provides two functions to chunk ranges as evenly as possible.
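In case that link isn't handy, here is a rough sketch of what such helpers could look like (the linked mp_utils.py is the authoritative version; this is only an approximation that produces near-equal batches):

def calc_batch_sizes(n_items, n_workers):
    # split n_items into n_workers sizes that differ by at most one
    quotient, remainder = divmod(n_items, n_workers)
    return [quotient + (1 if i < remainder else 0) for i in range(n_workers)]

def build_batch_ranges(batch_sizes):
    # turn the sizes into consecutive range objects covering 0..n_items
    ranges, start = [], 0
    for size in batch_sizes:
        ranges.append(range(start, start + size))
        start += size
    return ranges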
With multiprocessing.Pool your worker function just has to return the result, and Pool will take care of transporting it back over internal queues to the parent process. The result will be a list, so you will have to rearrange it into the shape you want. Your example could then look like this:
import time
import multiprocessing as mp
from multiprocessing import Pool
from contextlib import contextmanager
from itertools import chain

from mp_utils import calc_batch_sizes, build_batch_ranges

@contextmanager
def context_timer():
    start_time = time.perf_counter()
    yield
    end_time = time.perf_counter()
    total_time = end_time - start_time
    print(f'\nEach iteration took: {total_time / X_MAX:.4f} s')
    print(f'Total time: {total_time:.4f} s\n')

def worker(batch_range):
    """worker function"""
    result = []
    for x in batch_range:
        y = ((x+5)**2+x-40)
        if y <= 0xf+1:
            print('Condition met at: ', y, x)
            result.append((x, y))
    return result

if __name__ == '__main__':

    N_WORKERS = mp.cpu_count()
    X_MAX = 100000000

    batch_sizes = calc_batch_sizes(X_MAX, n_workers=N_WORKERS)
    batch_ranges = build_batch_ranges(batch_sizes)
    print(batch_ranges)

    with context_timer():
        with Pool(N_WORKERS) as pool:
            results = pool.map(worker, iterable=batch_ranges)

    print(f'results: {results}')
    x, y = zip(*chain.from_iterable(results))  # filter and sort results
    print(f'results sorted: x: {x}, y: {y}')
Example Output:
[range(0, 12500000), range(12500000, 25000000), range(25000000, 37500000),
range(37500000, 50000000), range(50000000, 62500000), range(62500000, 75000000), range(75000000, 87500000), range(87500000, 100000000)]
Condition met at: -15 0
Condition met at: -3 1
Condition met at: 11 2
Each iteration took: 0.0000 s
Total time: 8.2408 s
results: [[(0, -15), (1, -3), (2, 11)], [], [], [], [], [], [], []]
results sorted: x: (0, 1, 2), y: (-15, -3, 11)
Process finished with exit code 0
If you had multiple arguments for your worker, you would build a "tasks" list of argument tuples and exchange pool.map(...) with pool.starmap(...iterable=tasks). See the docs for further details on that.
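If it helps, a tiny self-contained sketch of the starmap pattern (toy worker and arguments, not the code from above):

from multiprocessing import Pool

def worker(batch_range, offset):
    # toy two-argument worker; 'offset' is a made-up extra parameter
    return [x + offset for x in batch_range]

if __name__ == '__main__':
    tasks = [(range(0, 5), 100), (range(5, 10), 100)]  # list of argument tuples
    with Pool(2) as pool:
        results = pool.starmap(worker, iterable=tasks)
    print(results)  # [[100, 101, 102, 103, 104], [105, 106, 107, 108, 109]]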
Process & Queue
If you can't use multiprocessing.Pool for some reason, you have to take care of inter-process communication (IPC) yourself, by passing a multiprocessing.Queue as an argument to your worker functions in the child processes and letting them enqueue their results to be sent back to the parent.
You will also have to build your Pool-like structure so you can iterate over it to start and join the processes, and you have to get() the results back from the queue. I've written up more about Queue.get usage here.
A solution with this approach could look like this:
def worker(result_queue, batch_range):
    """worker function"""
    result = []
    for x in batch_range:
        y = ((x+5)**2+x-40)
        if y <= 0xf+1:
            print('Condition met at: ', y, x)
            result.append((x, y))
    result_queue.put(result)  # <--

if __name__ == '__main__':

    N_WORKERS = mp.cpu_count()
    X_MAX = 100000000

    result_queue = mp.Queue()  # <--
    batch_sizes = calc_batch_sizes(X_MAX, n_workers=N_WORKERS)
    batch_ranges = build_batch_ranges(batch_sizes)
    print(batch_ranges)

    with context_timer():

        pool = [Process(target=worker, args=(result_queue, batch_range))
                for batch_range in batch_ranges]

        for p in pool:
            p.start()

        results = [result_queue.get() for _ in batch_ranges]

        for p in pool:
            p.join()

    print(f'results: {results}')
    x, y = zip(*chain.from_iterable(results))  # filter and sort results
    print(f'results sorted: x: {x}, y: {y}')
I'm trying to learn how to do parallel programming in python. I wrote a simple int square function and then ran it in serial, multi-thread, and multi-process:
import time
import multiprocessing, threading
import random

def calc_square(numbers):
    sq = 0
    for n in numbers:
        sq = n*n

def splita(list, n):
    a = [[] for i in range(n)]
    counter = 0
    for i in range(0, len(list)):
        a[counter].append(list[i])
        if len(a[counter]) == len(list)/n:
            counter = counter + 1
            continue
    return a

if __name__ == "__main__":

    random.seed(1)
    arr = [random.randint(1, 11) for i in xrange(1000000)]
    print "init completed"

    start_time2 = time.time()
    calc_square(arr)
    end_time2 = time.time()
    print "serial: " + str(end_time2 - start_time2)

    newarr = splita(arr, 8)
    print 'split complete'

    start_time = time.time()
    for i in range(8):
        t1 = threading.Thread(target=calc_square, args=(newarr[i],))
        t1.start()
        t1.join()
    end_time = time.time()
    print "mt: " + str(end_time - start_time)

    start_time = time.time()
    for i in range(8):
        p1 = multiprocessing.Process(target=calc_square, args=(newarr[i],))
        p1.start()
        p1.join()
    end_time = time.time()
    print "mp: " + str(end_time - start_time)
Output:
init completed
serial: 0.0640001296997
split complete
mt: 0.0599999427795
mp: 2.97099995613
However, as you can see, something weird happened and mt is taking the same time as serial and mp is actually taking significantly longer (almost 50 times longer).
What am I doing wrong? Could someone push me in the right direction to learn parallel programming in python?
Edit 01
Looking at the comments, I see that perhaps the function not returning anything seems pointless. The reason I'm even trying this is because previously I tried the following add function:
def addi(numbers):
    sq = 0
    for n in numbers:
        sq = sq + n
    return sq
I tried returning the sum of each part to a serial adder, so that at least I could see some performance improvement over a pure serial implementation. However, I couldn't figure out how to store and use the returned value, and that's why I'm trying to figure out something even simpler than that, which is just dividing up the array and running a simple function on it.
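For context, here is a minimal sketch of the kind of result collection I had in mind (a hypothetical Pool-based use of addi with dummy data, not the code I actually ran):

import multiprocessing

def addi(numbers):
    total = 0
    for n in numbers:
        total += n
    return total

if __name__ == '__main__':
    newarr = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]  # stand-in for the split array
    pool = multiprocessing.Pool(processes=3)
    partial_sums = pool.map(addi, newarr)       # one returned value per chunk
    pool.close()
    pool.join()
    print(sum(partial_sums))                    # 45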
Thanks!
I think that multiprocessing takes quite a long time to create and start each process. I have changed the program to make arr 10 times the size and changed the way that the processes are started, and there is a slight speed-up:
(Also note python 3)
import time
import multiprocessing, threading
from multiprocessing import Queue
import random

def calc_square_q(numbers, q):
    while q.empty():
        pass
    return calc_square(numbers)

if __name__ == "__main__":
    random.seed(1)  # note how big arr is now vvvvvvv
    arr = [random.randint(1, 11) for i in range(10000000)]
    print("init completed")

    # ...
    # other stuff as before
    # ...

    processes = []
    q = Queue()
    for arrs in newarr:
        processes.append(multiprocessing.Process(target=calc_square_q, args=(arrs, q)))

    print('start processes')
    for p in processes:
        p.start()  # even tho' each process is started it waits...

    print('join processes')
    q.put(None)  # ... for q to become not empty.
    start_time = time.time()
    for p in processes:
        p.join()
    end_time = time.time()
    print("mp: " + str(end_time - start_time))
Also notice above how I create and start the processes in two different loops, and then finally join with the processes in a third loop.
Output:
init completed
serial: 0.53214430809021
split complete
start threads
mt: 0.5551605224609375
start processes
join processes
mp: 0.2800724506378174
Another factor of 10 increase in size of arr:
init completed
serial: 5.8455305099487305
split complete
start threads
mt: 5.411392450332642
start processes
join processes
mp: 1.9705185890197754
And yes, I've also tried this in Python 2.7, although threads seemed slower.
I'm exploring multitasking in Python. After reading this article, I created an example to compare the performance of multithreading and multiprocessing:
import time
from threading import Thread
from multiprocessing import Process

dummy_data = ''.join(['0' for i in range(1048576)])  # around 1MB of data

def do_something(num):
    l = []
    for i in range(num):
        l.append(dummy_data)

def test(use_thread):
    if use_thread: title = 'Thread'
    else: title = 'Process'
    num = 1000

    jobs = []
    for i in range(4):  # the test machine has 4 cores
        if use_thread:
            j = Thread(target=do_something, args=(num,))
        else:
            j = Process(target=do_something, args=(num,))
        jobs.append(j)

    start = time.time()
    for j in jobs: j.start()
    for j in jobs: j.join()
    end = time.time()

    print '{0}: {1}'.format(title, str(end - start))
The results are:
Process: 0.0416989326477
Thread: 0.149359941483
Which means using Process results in better performance since it utilises available cores.
However, if I change the implementation of function do_something to:
def do_something_1(num):
    l = ''.join([dummy_data for i in range(num)])
Using processes suddenly performs worse than threads (I reduced the num value to 1000 due to a MemoryError):
Process: 14.6903309822
Thread: 4.30753493309
Can anyone explain to me why the second implementation of do_something results in worse performance for Process compared to Thread?
I'm trying to perform numerical integration on a large array and the computation takes a very long time. I tried to speed up my code by using numba and the jit decorator, but numpy.trapz isn't supported.
My new idea is to create n threads to run the calculations in parallel, but I'm wondering how I could do this, or whether it's even feasible.
Referring to the code below: can I create sz[2] threads to run at the same time, each calling ZO_SteadyState to calculate values?
for i in range(sz[1]):
    phii = phi[i]
    for j in range(sz[2]):
        s = tau[0, i, j, :].reshape(1, n4)
        [R3, PHI3, S3] = meshgrid(rprime, phiprime, s)
        BCoeff = Bessel0(bm * R3)

        SS[0, i, j] = ZO_SteadyState(alpha, b, bm, BCoeff, Bessel_Denom, k2, maxt, phii, PHI2, PHI3, phiprime, R3, rprime, s, S3, T, v)
The calculation in question.
@jit()
def ZO_SteadyState(alpha, b, bm, BCoeff, Bessel_Denom, k2, maxt, phii, PHI2, PHI3, phiprime, R3, rprime, s, S3, T, v):
    g = 1000000 * exp(-(10 ** 5) * (R3 - (b / maxt) * S3) ** 2) * (
        exp(-(10 ** 5) * (PHI3 - 0) ** 2) + exp(-(10 ** 5) * (PHI3 - 2 * np.pi) ** 2) + exp(
            -(10 ** 5) * (PHI3 - 2 * np.pi / 3) ** 2) + exp(
            -(10 ** 5) * (PHI3 - 4 * np.pi / 3) ** 2))  # stationary point heat source.

    y = R3 * ((np.sqrt(2) / b) * (1 / (np.sqrt((H2 ** 2 / bm ** 2) + (1 - (v ** 2 / (bm ** 2 * b ** 2))))))
              * (BCoeff / Bessel_Denom)) * np.cos(v * (phii - PHI3)) * g

    x = (np.trapz(y, phiprime, axis=0)).reshape(1, 31, 300)
    # integral transform of heat source. integral over y-axis
    gbarbar = np.trapz(x, rprime, axis=1)

    PHI2 = np.meshgrid(phiprime, s)[0]
    sz2 = PHI2.shape
    f = h2 * 37 * Array_Ones((sz2[0], sz[1]))  # boundary condition.

    fbar = np.trapz(np.cos(v * (phii - PHI2)) * f, phiprime, 1).reshape(1, n4)  # integrate over y

    A = (alpha / k) * gbarbar + ((np.sqrt(2) * alpha) / k2) * (
        1 / (np.sqrt((H2 ** 2 / bm ** 2) + (1 - (v ** 2 / (bm ** 2 * b ** 2)))))) * fbar

    return np.trapz(exp(-alpha * bm ** 2 * (T[0, i, j] - s)) * A, s)
Another concept implementation, with processes spawning processes (EDIT: jit tested):
import numpy as np

# better pickling
import pathos
from contextlib import closing

from numba import jit

# https://stackoverflow.com/questions/47574860/python-pathos-process-pool-non-daemonic
import multiprocess.context as context

class NoDaemonProcess(context.Process):
    def _get_daemon(self):
        return False

    def _set_daemon(self, value):
        pass

    daemon = property(_get_daemon, _set_daemon)

class NoDaemonPool(pathos.multiprocessing.Pool):
    def Process(self, *args, **kwds):
        return NoDaemonProcess(*args, **kwds)


# matrix dimensions
x = 100  # i
y = 500  # j

NUM_PROCESSES = 10  # total NUM_PROCESSES*NUM_PROCESSES will be spawned

SS = np.zeros([x, y], dtype=float)

@jit
def foo(i):
    return (i*i + 1)

@jit
def bar(phii, j):
    return phii*(j+1)

# The code which is implemented down here:
'''
for i in range(x):
    phii = foo(i)
    for j in range(y):
        SS[i, j] = bar(phii, j)
'''

# Parallel version:

def outer_loop(i):
    phii = foo(i)

    # i is in process scope
    def inner_loop(j):
        result = bar(phii, j)
        # the data is coordinates and result
        return (i, j, result)

    with closing(NoDaemonPool(processes=NUM_PROCESSES)) as pool:
        res = list(pool.imap(inner_loop, range(y)))
    return res

with closing(NoDaemonPool(processes=NUM_PROCESSES)) as pool:
    results = list(pool.imap(outer_loop, range(x)))

# flatten the list of per-row results
result_list = []
for r in results:
    result_list += r

# write results into the matrix
for res in result_list:
    if res:
        i, j, val = res
        SS[i, j] = val

# check that all cells filled
print(np.count_nonzero(SS))  # 100*500
EDIT: Explanation.
The reason for all the complications in this code is that I wanted to do more parallelization than the OP asked for. If only the inner loop is parallelized, the outer loop remains, so for each iteration of the outer loop a new process pool is created and the computations for the inner loop are performed. Since, as far as I could tell, the formula does not depend on other iterations of the outer loop, I decided to parallelize everything: now the outer-loop computations are assigned to processes from the pool, after that each of the "outer-loop" processes creates its own new pool, and additional processes are spawned to perform the computations for the inner loop.
I might be wrong, though, and the outer loop may not need to be parallelized; in that case you can keep only the inner process pool.
Using process pools might not be the optimal solution, though, as time will be spent on creating and destroying the pools. A more efficient (but more hands-on) solution would be to instantiate N processes once and for all, then feed data into them and receive results using a multiprocessing Queue(). So you should first test whether this multiprocessing solution gives you enough speedup (it will if the time spent constructing and destroying pools is small compared to the Z0_SteadyState run).
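A rough, self-contained sketch of that "instantiate N workers once and feed them via queues" idea (illustrative names and a trivial stand-in computation, not your actual formula):

import multiprocessing as mp

def persistent_worker(task_q, result_q):
    # keep pulling (i, j) tasks until the None sentinel arrives
    for i, j in iter(task_q.get, None):
        result_q.put((i, j, i * j))  # stand-in for the real computation

if __name__ == '__main__':
    task_q, result_q = mp.Queue(), mp.Queue()
    workers = [mp.Process(target=persistent_worker, args=(task_q, result_q))
               for _ in range(4)]
    for w in workers:
        w.start()

    tasks = [(i, j) for i in range(10) for j in range(5)]
    for t in tasks:
        task_q.put(t)
    for _ in workers:
        task_q.put(None)  # one stop sentinel per worker

    results = [result_q.get() for _ in tasks]
    for w in workers:
        w.join()
    print(len(results))  # 50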
The next complication is the artificial non-daemonic pool. A daemon process is used to gracefully stop an application: when the main program exits, daemon processes are terminated silently. However, a daemon process cannot spawn child processes. Here in your example you need to wait until each process ends to retrieve data, so I made them non-daemonic to allow spawning the child processes that compute the inner loop.
Data exchange: I suppose that the amount of data needed to fill the matrix, and the time to do it, is small compared to the actual computations. So I use pools and the pool.imap function (which is a bit faster than .map(); you can also try .imap_unordered(), although in your case it should not make a significant difference). Thus the inner pool waits until all the results are computed and returns them as a list. The outer pool therefore returns a list of lists which must be concatenated. The matrix is then reconstructed from these results in a single fast loop.
Notice the with closing() construct: it closes the pool automatically after the statements under it finish, avoiding memory consumption by zombie processes.
Also, you might notice that I weirdly defined one function inside another, and that inside the processes I have access to some variables that were never passed there: i and phii. This works because processes have access to the global scope they were launched from, with a copy-on-write policy (the default fork start method). This does not involve pickling and is fast.
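A tiny sketch of that fork behaviour (assuming a POSIX platform where the 'fork' start method is available; the names are illustrative):

import multiprocessing as mp

BIG_TABLE = list(range(1_000_000))  # created in the parent before forking

def child():
    # no pickling of BIG_TABLE happened; the forked child sees the parent's
    # memory via copy-on-write pages
    print(len(BIG_TABLE))

if __name__ == '__main__':
    ctx = mp.get_context('fork')  # explicit, since 'spawn' would re-import instead
    p = ctx.Process(target=child)
    p.start()
    p.join()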
The last comment is about using the pathos library instead of the standard multiprocessing, concurrent.futures, subprocess, etc. The reason is that pathos uses a better pickling library, so it can serialize functions that the standard libraries can't (for example, lambda functions). I don't know about your function, so I used the more powerful tool to avoid further problems.
And the very last thing: multiprocessing vs threading. You can change the pathos processing pool to, say, a standard ThreadPoolExecutor from concurrent.futures, as I did at the beginning when I was just starting on this code. But during execution, on my system the CPU was loaded at only about 100% in total (i.e. one core's worth of work, with all 8 cores sitting at 15-20%). I am not skilled enough to explain all the differences between threads and processes, but it seems to me that processes allow utilizing all cores (100% each, 800% total).
This is the overall idea of what I'd probably do. There's not enough context to give you a more reliable example; you'll have to set all your variables on the class.
import multiprocessing

pool = multiprocessing.Pool(processes=12)
runner = mp_Z0(variable=variable, variable2=variable2)

for row in pool.imap(runner.run, range(sz[1])):
    for i, j, v in row:
        SS[0, i, j] = v


class mp_Z0:

    def __init__(self, **kwargs):
        for k, v in kwargs.items():
            setattr(self, k, v)

    def run(self, i):
        row = []
        phii = self.phi[i]
        for j in range(self.sz[2]):
            s = self.tau[0, i, j, :].reshape(1, self.n4)
            [R3, PHI3, S3] = meshgrid(self.rprime, self.phiprime, s)
            BCoeff = Bessel0(self.bm * R3)
            row.append((i, j, ZO_SteadyState(self.alpha, self.b, self.bm, BCoeff, Bessel_Denom,
                                             self.k2, self.maxt, phii, self.PHI2, PHI3, self.phiprime,
                                             R3, self.rprime, s, S3, self.T, self.v)))
        return row  # one (i, j, value) triple per j
This is an example (assuming everything is in the local namespace) of doing it without classes:
import multiprocessing

pool = multiprocessing.Pool(processes=12)

def runner_function(i):
    row = []
    phii = phi[i]
    for j in range(sz[2]):
        s = tau[0, i, j, :].reshape(1, n4)
        [R3, PHI3, S3] = meshgrid(rprime, phiprime, s)
        BCoeff = Bessel0(bm * R3)
        row.append((i, j, ZO_SteadyState(alpha, b, bm, BCoeff, Bessel_Denom, k2, maxt, phii, PHI2, PHI3,
                                         phiprime, R3, rprime, s, S3, T, v)))
    return row  # one (i, j, value) triple per j

for row in pool.imap(runner_function, range(sz[1])):
    for i, j, v in row:
        SS[0, i, j] = v