I hope this is not a duplicate question.
I've run the same function in Python 3.4.2 in a simple way and in a multi-processing way and I've found that the simple way is faster. Perhaps my design is not good, but I don't see where the problem lies.
Below is my code:
Common part
import os
import math
from multiprocessing import Process
import timeit
def exponential(number):
    """
    A function that returns exponential
    """
    result = math.exp(number)
    proc = os.getpid()
Simple solution
if __name__ == '__main__':
    start = timeit.default_timer()
    numbers = [5, 10, 20, 30, 40, 50, 60]
    for index, number in enumerate(numbers):
        exponential(number)
    stop = timeit.default_timer()
    duration = stop - start
    print(duration)
Multi-processing solution
if __name__ == '__main__':
    start = timeit.default_timer()
    numbers = [5, 10, 20, 30, 40, 50, 60]
    procs = []
    for index, number in enumerate(numbers):
        proc = Process(target=exponential, args=(number,))
        procs.append(proc)
        proc.start()
    for proc in procs:
        proc.join()
    stop = timeit.default_timer()
    duration = stop - start
    print(duration)
What I see is that the simple solution is faster than the multi-processing one:
Duration with Simple solution: 2.8359994757920504e-05
Duration with Multi processing solution: 0.012581961986143142
Computing math.exp(x), where x<100 (as it is in your case), is not especially difficult, so computing these values in parallel does not offer a clear advantage.
Remember that when you set up multiple processes, you also incur the overhead of creating a new process, and copying over the memory space, etc.
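To see that overhead directly, here is a rough sketch that times one math.exp call run inline against the same call run in a freshly started process. The helper names time_inline and time_in_process are made up for this illustration, and the absolute numbers will depend on the OS and on the multiprocessing start method.

```python
import math
import timeit
from multiprocessing import Process


def work(x):
    # Deliberately tiny task, comparable to the one in the question
    return math.exp(x)


def time_inline(x, repeats=1000):
    """Average seconds per call when work() runs in the current process."""
    start = timeit.default_timer()
    for _ in range(repeats):
        work(x)
    return (timeit.default_timer() - start) / repeats


def time_in_process(x):
    """Seconds to start a child process, run work() once in it, and join it."""
    start = timeit.default_timer()
    p = Process(target=work, args=(x,))
    p.start()
    p.join()
    return timeit.default_timer() - start


if __name__ == '__main__':
    print('inline      :', time_inline(50))
    print('new process :', time_in_process(50))
```

On a typical machine the second figure should be larger by several orders of magnitude, which matches the gap reported in the question.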
Finally, there's something to be said about you creating a new process for each number in that list. If that list had 100 numbers in it, you'd be creating 100 new processes, which will compete for time on your 4 or 8 cores (depending on your CPU), which will add to further delays (especially when the computation itself gets complex). You're better off creating a pool of processes and getting them to incrementally work on your dataset:
import math
import multiprocessing as mp


def slave(qIn, qOut):
    # consume (index, number) tasks until the None sentinel arrives
    for i, num in iter(qIn.get, None):
        qOut.put((i, math.exp(num)))
    qOut.put(None)  # tell the master that this worker is finished


def master():
    numbers = [5, 10, 20, 30, 40, 50, 60]
    qIn, qOut = [mp.Queue() for _ in range(2)]
    # keep at least one worker, even on a single-core machine
    procs = [mp.Process(target=slave, args=(qIn, qOut)) for _ in range(max(mp.cpu_count()-1, 1))]
    for p in procs: p.start()
    for t in enumerate(numbers): qIn.put(t)
    for p in procs: qIn.put(None)  # one sentinel per worker
    answer = [None] * len(numbers)
    done = 0
    while done < len(procs):
        t = qOut.get()
        if t is None:
            done += 1
            continue
        i, e = t
        answer[i] = e
    for p in procs: p.terminate()
    return answer
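A shorter variant of the same idea, offered as a sketch rather than a drop-in replacement: multiprocessing.Pool manages the worker processes and the queues itself. The names exp_indexed and pooled_master are illustrative; the (index, number) tuples mirror the ones used above.

```python
import math
from multiprocessing import Pool


def exp_indexed(task):
    """Worker: take an (index, number) pair and return (index, exp(number))."""
    i, num = task
    return i, math.exp(num)


def pooled_master(numbers):
    answer = [None] * len(numbers)
    with Pool() as pool:  # defaults to one worker per CPU
        # Results may arrive in any order; the index puts each one in its slot
        for i, e in pool.imap_unordered(exp_indexed, enumerate(numbers)):
            answer[i] = e
    return answer


if __name__ == '__main__':
    print(pooled_master([5, 10, 20, 30, 40, 50, 60]))
```

For work this small the pool will still lose to the plain loop; the pattern only pays off once each task is expensive enough to outweigh the inter-process communication.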
Related
I am trying to run a model multiple times, which is time consuming. As a solution I tried to make it parallel. However, it ends up being slower: the parallel version takes 40 seconds while the serial one takes 34 seconds.
# !pip install --target=$nb_path transformers
from transformers import pipeline

oracle = pipeline(model="deepset/roberta-base-squad2")
question = 'When did the first extension of the Athens Tram take place?'
print(data)
print("Data size is: ", len(data))
parallel = True
if parallel == False:
    counter = 0
    l = len(data)
    cr = []
    for words in data:
        counter+=1
        print(counter, " out of ", l)
        cr.append(oracle(question=question, context=words))
elif parallel == True:
    from multiprocessing import Process, Queue
    import multiprocessing
    no_CPU = multiprocessing.cpu_count()
    print("Number of cpu : ", no_CPU)
    l = len(data)
    def answer_question(data, no_CPU, sub_no):
        cr_process = []
        counter_process = 0
        for words in data:
            counter_process+=1
            l_data = len(data)
            # print("n is", no_CPU)
            # print("l is", l_data)
            print(counter_process, " out of ", l_data, "in subprocess number", sub_no)
            cr_process.append(oracle(question=question, context=words))
        # Q.put(cr_process)
        cr.append(cr_process)
    n = no_CPU # number of subprocesses
    m = l//n # number of data the n-1 first subprocesses will handle
    res = l % n # number of extra data samples the last subprocesses has
    # print(m)
    # print(res)
    procs = []
    # instantiating process with arguments
    for x in range(n-1):
        # print(x*m)
        # print((x+1)*m)
        proc = Process(target=answer_question, args=(data[x*m:(x+1)*m],n, x+1,))
        procs.append(proc)
        proc.start()
    proc = Process(target=answer_question, args=(data[(n-1)*m:n*m+res],n,n,))
    procs.append(proc)
    proc.start()
    # complete the processes
    for proc in procs:
        proc.join()
A sample of the data variable can be found here (to not flood the question). The parallel variable switches between the serial and the parallel version. So my question is: why does this happen, and how do I make the parallel version faster? I use Google Colab, so there are 2 CPU cores available; at least that is what multiprocessing.cpu_count() reports.
Your pipeline already runs on multiple CPUs even when run as one process. The transformers code is optimized to run on multiple CPUs.
When you create multiple processes on top of that, you lose some time building the processes and switching between them.
To verify this, look at your CPU utilization while running the so-called "single process" version. You should already see all cores at max, so creating extra parallel processes is not going to save you any time.
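One way to check this without a system monitor, sketched with only the standard library: compare CPU time with wall-clock time for a single call. The helper name cpu_to_wall_ratio is made up for this example. time.process_time() sums CPU time over all threads of the current process, so a ratio well above 1 suggests that the call is already keeping several cores busy.

```python
import time


def cpu_to_wall_ratio(fn, *args, **kwargs):
    """Run fn once and return (CPU seconds) / (wall-clock seconds)."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    fn(*args, **kwargs)
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return cpu / wall if wall > 0 else 0.0


def busy_loop(n=2_000_000):
    # Single-threaded pure-Python work: the ratio should not go much above 1
    total = 0
    for i in range(n):
        total += i
    return total


if __name__ == '__main__':
    print(f'ratio for a pure-Python loop: {cpu_to_wall_ratio(busy_loop):.2f}')
    # In the question's setting one would instead pass something like
    # lambda: oracle(question=question, context=data[0])
```

If the ratio for the pipeline call comes out close to the number of cores, splitting the data across processes has little left to gain.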
I'm having trouble using Python multiprocessing.
I'm trying with a minimal version of the code:
import os
os.environ["OMP_NUM_THREADS"] = "1" # just in case the system uses multithreading somehow
os.environ["OPENBLAS_NUM_THREADS"] = "1" # just in case the system uses multithreading somehow
os.environ["MKL_NUM_THREADS"] = "1" # just in case the system uses multithreading somehow
os.environ["VECLIB_MAXIMUM_THREADS"] = "1" # just in case the system uses multithreading somehow
os.environ["NUMEXPR_NUM_THREADS"] = "1" # just in case the system uses multithreading somehow
import numpy as np
from datetime import datetime as dt
from multiprocessing import Pool
from pandas import DataFrame as DF

def trytrytryshare(times):
    i = 0
    for j in range(times):
        i+=1
    return

def trymultishare(thread = 70 , times = 10):
    st = dt.now()
    args_l = [(times,) for i in range(thread)]
    print(st)
    p = Pool(thread)
    for i in range(len(args_l)):
        p.apply_async(func = trytrytryshare, args = (args_l[i]))
    p.close()
    p.join()
    timecost = (dt.now()-st).total_seconds()
    print('%d threads finished in %f secs' %(thread,timecost))
    return timecost

if __name__ == '__main__':
    res = DF(columns = ['thread','timecost'])
    n = 0
    for j in range(5):
        for i in range(1,8,3):
            timecost = trymultishare(thread = i,times = int(1e8))
            res.loc[n] = [i,timecost]
            n+=1
        timecost = trymultishare(thread = 70,times = int(1e8))
        res.loc[n] = [70,timecost]
        n+=1
    res_sum = res.groupby('thread').mean()
    res_sum['decay'] = res_sum.loc[1,'timecost'] / res_sum['timecost']
On my own computer (8 cores):
On my server (80 cores; I'm the only one using it):
I tried again, making each thread's job longer.
The decay is really bad...
Any idea how to "fix" this, or is this just what I can get when using multiprocessing?
Thanks
The way you're timing apply_async is flawed. You won't know when the subprocesses have completed unless you wait for their results.
It's a good idea to work out an optimum process pool size based on number of CPUs. The code that follows isn't necessarily the best for all cases but it's what I use.
You shouldn't set the pool size to the number of processes you intend to run. That's the whole point of using a pool.
So here's a simpler example of how you could test subprocess performance.
from multiprocessing import Pool
from time import perf_counter
from os import cpu_count

def process(n):
    r = 0
    for _ in range(n):
        r += 1
    return r

POOL = max(cpu_count()-2, 1)
N = 1_000_000

def main(procs):
    # no need for pool size to be bigger than the number of processes to be run
    poolsize = min(POOL, procs)
    with Pool(poolsize) as pool:
        _start = perf_counter()
        for result in [pool.apply_async(process, (N,)) for _ in range(procs)]:
            result.wait() # wait for async processes to terminate
        _end = perf_counter()
        print(f'Duration for {procs} processes with pool size of {poolsize} = {_end-_start:.2f}s')

if __name__ == '__main__':
    print(f'CPU count = {cpu_count()}')
    for procs in range(10, 101, 10):
        main(procs)
Output:
CPU count = 20
Duration for 10 processes with pool size of 10 = 0.12s
Duration for 20 processes with pool size of 18 = 0.19s
Duration for 30 processes with pool size of 18 = 0.18s
Duration for 40 processes with pool size of 18 = 0.28s
Duration for 50 processes with pool size of 18 = 0.30s
Duration for 60 processes with pool size of 18 = 0.39s
Duration for 70 processes with pool size of 18 = 0.42s
Duration for 80 processes with pool size of 18 = 0.45s
Duration for 90 processes with pool size of 18 = 0.54s
Duration for 100 processes with pool size of 18 = 0.59s
My guess is that you're observing the cost of spawning new processes, since apply_async returns immediately. It's much cheaper to spawn one process in the case of thread==1 instead of spawning 70 processes (your last case with the worst decay).
The fact that the server with 80 cores performs better than your laptop with 8 cores could be due to the server having better hardware in general (better heat removal, faster CPU, etc.), or it might run a different OS. Benchmarking across different machines is non-trivial.
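To separate the two costs, a rough sketch can time pool creation and the actual work independently. The names count_up and timed_run are invented for this example, and the "startup" figure is only approximate: Pool() returns once the workers have been launched, which on some start methods is before they have finished initializing.

```python
from multiprocessing import Pool
from time import perf_counter


def count_up(n):
    """CPU-bound stand-in task: count to n in pure Python."""
    r = 0
    for _ in range(n):
        r += 1
    return r


def timed_run(pool_size, n_tasks, n):
    """Return (startup_seconds, work_seconds) for one pool run."""
    t0 = perf_counter()
    with Pool(pool_size) as pool:
        t1 = perf_counter()                # workers have been launched
        pool.map(count_up, [n] * n_tasks)  # blocks until every task is done
        t2 = perf_counter()
    return t1 - t0, t2 - t1


if __name__ == '__main__':
    for size in (1, 2, 4):
        startup, work = timed_run(size, n_tasks=8, n=200_000)
        print(f'pool={size}: startup {startup:.3f}s, work {work:.3f}s')
```

With the same number of tasks for every pool size, any remaining slowdown in the work column is easier to attribute to the machine rather than to process creation.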
I have the following problem. The code below successfully fits a line to my data from sample 50 to 400 (I never have more than 400 samples, and the first 50 are of horrendous quality). The third dimension has size 7 and the fourth dimension can have up to 10000 values, so this loop "solution" would take a lot of time. How can I avoid the for loop and decrease my runtime? Thank you for your help (I am pretty new to Python).
from sklearn.linear_model import TheilSenRegressor
import numpy as np
#ransac = linear_model.RANSACRegressor()
skip_v=50#number of values to be skipped
N=400
test_n=np.reshape(range(skip_v, N),(-1,1))
f_n=7
d4=np.shape(data)
a6=np.ones((f_n,d4[3]))
b6=np.ones((f_n,d4[3]))
for j in np.arange(d4[3]):
    for i in np.arange(f_n):
        theil = TheilSenRegressor(random_state=0).fit(test_n,np.log(data[skip_v:,3,i,j]))
        a6[i,j]=theil.coef_
        b6[i,j]=theil.intercept_
You can use multiprocessing to run your loop in parallel. The following code is not a complete working example (data still has to be defined); it just demonstrates how to do it. It is only useful if your numbers are really big. Otherwise, running sequentially is faster.
from sklearn.linear_model import TheilSenRegressor
import numpy as np
import multiprocessing as mp
from itertools import product

def worker_function(input_queue, output_queue, skip_v, test_n, data):
    # take (i, j) tasks from the queue until the 'STOP' sentinel arrives
    for task in iter(input_queue.get, 'STOP'):
        i = task[0]
        j = task[1]
        theil = TheilSenRegressor(random_state=0).fit(test_n,np.log(data[skip_v:,3,i,j]))
        output_queue.put([i, j, theil])

if __name__ == "__main__":
    # define data here
    f_n = 7
    d4 = np.shape(data)
    skip_v = 50
    N=400
    test_n=np.reshape(range(skip_v, N),(-1,1))
    # result arrays, as in the question
    a6 = np.ones((f_n, d4[3]))
    b6 = np.ones((f_n, d4[3]))
    input_queue = mp.Queue()
    output_queue = mp.Queue()
    # here you create all combinations of j and i of your loop
    list1 = range(f_n)
    list2 = range(d4[3])
    list3 = [list1, list2]
    tasks = [p for p in product(*list3)]
    numProc = 4
    # start processes
    process = [mp.Process(target=worker_function,
                          args=(input_queue, output_queue,
                                skip_v, test_n, data)) for x in range(numProc)]
    for p in process:
        p.start()
    # queue tasks
    for i in tasks:
        input_queue.put(i)
    # signal workers to stop after tasks are all done
    for i in range(numProc):
        input_queue.put('STOP')
    # get the results
    for i in range(len(tasks)):
        res = output_queue.get(block=True) # wait for results
        a6[res[0], res[1]] = res[2].coef_
        b6[res[0], res[1]] = res[2].intercept_
    # all results are in, so the workers can be joined safely
    for p in process:
        p.join()
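The same task distribution can be sketched with multiprocessing.Pool, which handles the queues and sentinels internally. To keep the example self-contained, a plain-Python Theil-Sen style fit (median of pairwise slopes, then median intercept) stands in for sklearn's TheilSenRegressor. It is not sklearn's implementation, and theil_sen_fit, fit_task, and the toy data are assumptions made for this illustration.

```python
from itertools import combinations, product
from multiprocessing import Pool
from statistics import median


def theil_sen_fit(xs, ys):
    """Theil-Sen style line fit: median pairwise slope, then median intercept."""
    slopes = [(ys[b] - ys[a]) / (xs[b] - xs[a])
              for a, b in combinations(range(len(xs)), 2)
              if xs[b] != xs[a]]
    slope = median(slopes)
    intercept = median(y - slope * x for x, y in zip(xs, ys))
    return slope, intercept


def fit_task(task):
    """Worker: fit one (i, j) series and return its indices with the result."""
    i, j, xs, ys = task
    slope, intercept = theil_sen_fit(xs, ys)
    return i, j, slope, intercept


if __name__ == '__main__':
    xs = list(range(50, 60))
    # Toy data: series (i, j) follows y = (i + 1) * x + j exactly
    tasks = [(i, j, xs, [(i + 1) * x + j for x in xs])
             for i, j in product(range(3), range(4))]
    with Pool(2) as pool:
        for i, j, slope, intercept in pool.map(fit_task, tasks):
            print(i, j, slope, intercept)
```

Each task carries its own slice of the data, so the workers do not need the full array; with real data the cost of sending those slices is part of the overhead to weigh against the fit time.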
The multiprocessing module is quite confusing for Python beginners, especially for those who have just migrated from MATLAB and have been made lazy by its Parallel Computing Toolbox. I have the following code, which takes ~80 seconds to run, and I want to shorten this time by using Python's multiprocessing module.
from time import time

xmax = 100000000
start = time()
for x in range(xmax):
    y = ((x+5)**2+x-40)
    if y <= 0xf+1:
        print('Condition met at: ', y, x)
end = time()
tt = end-start #total time
print('Each iteration took: ', tt/xmax)
print('Total time: ', tt)
This outputs as expected:
Condition met at: -15 0
Condition met at: -3 1
Condition met at: 11 2
Each iteration took: 8.667453265190124e-07
Total time: 86.67453265190125
As any iteration of the loop is not dependent on others, I tried to adopt this Server Process from the official documentation to scan chunks of the range in separate processes. And finally I came up with vartec's answer to this question and could prepare the following code. I also updated the code based on Darkonaut's response to the current question.
from time import time
import multiprocessing as mp

def chunker (rng, t): # this functions makes t chunks out of rng
    L = rng[1] - rng[0]
    Lr = L % t
    Lm = L // t
    h = rng[0]-1
    chunks = []
    for i in range(0, t):
        c = [h+1, h + Lm]
        h += Lm
        chunks.append(c)
    chunks[t-1][1] += Lr + 1
    return chunks

def worker(lock, xrange, return_dict):
    '''worker function'''
    for x in range(xrange[0], xrange[1]):
        y = ((x+5)**2+x-40)
        if y <= 0xf+1:
            print('Condition met at: ', y, x)
            return_dict['x'].append(x)
            return_dict['y'].append(y)
            with lock:
                list_x = return_dict['x']
                list_y = return_dict['y']
                list_x.append(x)
                list_y.append(y)
                return_dict['x'] = list_x
                return_dict['y'] = list_y

if __name__ == '__main__':
    start = time()
    manager = mp.Manager()
    return_dict = manager.dict()
    lock = manager.Lock()
    return_dict['x']=manager.list()
    return_dict['y']=manager.list()
    xmax = 100000000
    nw = mp.cpu_count()
    workers = list(range(0, nw))
    chunks = chunker([0, xmax], nw)
    jobs = []
    for i in workers:
        p = mp.Process(target=worker, args=(lock, chunks[i],return_dict))
        jobs.append(p)
        p.start()
    for proc in jobs:
        proc.join()
    end = time()
    tt = end-start #total time
    print('Each iteration took: ', tt/xmax)
    print('Total time: ', tt)
    print(return_dict['x'])
    print(return_dict['y'])
which considerably reduces the run time to ~17 Secs. But, my shared variable cannot retrieve any values. Please help me find out which part of the code is going wrong.
the output I get is:
Each iteration took: 1.7742713451385497e-07
Total time: 17.742713451385498
[]
[]
from which I expect:
Each iteration took: 1.7742713451385497e-07
Total time: 17.742713451385498
[0, 1, 2]
[-15, -3, 11]
The issue in your example is that modifications to standard mutable structures within Manager.dict will not be propagated. I'll first show you how to fix it with Manager, and then show you better options.
multiprocessing.Manager is a bit heavy, since it uses a separate process just for the manager, and working on a shared object requires locks for data consistency. If you run this on one machine, there are better options: multiprocessing.Pool if you don't have to run customized Process classes, and if you do, multiprocessing.Process together with multiprocessing.Queue would be the common way of doing it.
The quoted parts are from the multiprocessing docs.
Manager
If standard (non-proxy) list or dict objects are contained in a referent, modifications to those mutable values will not be propagated through the manager because the proxy has no way of knowing when the values contained within are modified. However, storing a value in a container proxy (which triggers a setitem on the proxy object) does propagate through the manager and so to effectively modify such an item, one could re-assign the modified value to the container proxy...
In your case this would look like:
def worker(xrange, return_dict, lock):
    """worker function"""
    for x in range(xrange[0], xrange[1]):
        y = ((x+5)**2+x-40)
        if y <= 0xf+1:
            print('Condition met at: ', y, x)
            with lock:
                list_x = return_dict['x']
                list_y = return_dict['y']
                list_x.append(x)
                list_y.append(y)
                return_dict['x'] = list_x
                return_dict['y'] = list_y
The lock here would be a manager.Lock instance you have to pass along as argument since the whole (now) locked operation is not by itself atomic. (Here is an easier example with Manager using Lock.)
This approach is perhaps less convenient than employing nested Proxy Objects for most use cases but also demonstrates a level of control over the synchronization.
Since Python 3.6 proxy objects are nestable:
Changed in version 3.6: Shared objects are capable of being nested. For example, a shared container object such as a shared list can contain other shared objects which will all be managed and synchronized by the SyncManager.
Since Python 3.6 you can fill your manager.dict before starting multiprocessing with manager.list as values and then append directly in the worker without having to reassign.
return_dict['x'] = manager.list()
return_dict['y'] = manager.list()
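The three behaviours described in this section can be compared side by side in a small sketch. It runs in a single process for brevity, since the proxy semantics are the same; the function name demo and the dictionary keys are made up for the illustration.

```python
from multiprocessing import Manager


def demo():
    """Return what the manager sees after three different ways of appending."""
    with Manager() as manager:
        d = manager.dict()

        # 1) Plain list stored in the proxy: append() changes only a local copy
        d['plain'] = []
        d['plain'].append(1)

        # 2) Reassigning the modified copy triggers __setitem__ on the proxy
        d['reassigned'] = []
        tmp = d['reassigned']
        tmp.append(1)
        d['reassigned'] = tmp

        # 3) Nested proxy (Python 3.6+): append() goes through the manager
        d['nested'] = manager.list()
        d['nested'].append(1)

        # Copy into ordinary containers before the manager shuts down
        return {
            'plain': list(d['plain']),
            'reassigned': list(d['reassigned']),
            'nested': list(d['nested']),
        }


if __name__ == '__main__':
    print(demo())
    # {'plain': [], 'reassigned': [1], 'nested': [1]}
```

Case 2 is a read-modify-write sequence, which is why it needs the lock once several processes do it concurrently.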
EDIT:
Here is the full example with Manager:
import time
import multiprocessing as mp
from multiprocessing import Manager, Process
from contextlib import contextmanager
# mp_util.py from first link in code-snippet for "Pool"
# section below
from mp_utils import calc_batch_sizes, build_batch_ranges

# def context_timer ... see code snippet in "Pool" section below

def worker(batch_range, return_dict, lock):
    """worker function"""
    for x in batch_range:
        y = ((x+5)**2+x-40)
        if y <= 0xf+1:
            print('Condition met at: ', y, x)
            with lock:
                return_dict['x'].append(x)
                return_dict['y'].append(y)

if __name__ == '__main__':
    N_WORKERS = mp.cpu_count()
    X_MAX = 100000000
    batch_sizes = calc_batch_sizes(X_MAX, n_workers=N_WORKERS)
    batch_ranges = build_batch_ranges(batch_sizes)
    print(batch_ranges)
    with Manager() as manager:
        lock = manager.Lock()
        return_dict = manager.dict()
        return_dict['x'] = manager.list()
        return_dict['y'] = manager.list()
        tasks = [(batch_range, return_dict, lock)
                 for batch_range in batch_ranges]
        with context_timer():
            pool = [Process(target=worker, args=args)
                    for args in tasks]
            for p in pool:
                p.start()
            for p in pool:
                p.join()
        # Create standard container with data from manager before exiting
        # the manager.
        result = {k: list(v) for k, v in return_dict.items()}
    print(result)
Pool
Most often a multiprocessing.Pool will just do it. You have an additional challenge in your example since you want to distribute iteration over a range.
Your chunker function doesn't manage to divide the range evenly, so not every process gets about the same amount of work:
chunker((0, 21), 4)
# Out: [[0, 4], [5, 9], [10, 14], [15, 21]] # 4, 4, 4, 6!
For the code below please grab the code snippet for mp_utils.py from my answer here, it provides two functions to chunk ranges as even as possible.
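mp_utils.py itself is not reproduced in this thread. As a hypothetical stand-in for readers who only want the idea, the sketch below splits range(total) into consecutive ranges whose sizes differ by at most one. The name even_ranges is invented here; it is not one of the two functions from that file and makes no claim to match their signatures.

```python
def even_ranges(total, n_parts):
    """Split range(total) into n_parts consecutive ranges, sizes differing by at most 1."""
    base, extra = divmod(total, n_parts)
    ranges = []
    start = 0
    for k in range(n_parts):
        size = base + (1 if k < extra else 0)  # the first `extra` parts get one more
        ranges.append(range(start, start + size))
        start += size
    return ranges


if __name__ == '__main__':
    print(even_ranges(21, 4))
    # [range(0, 6), range(6, 11), range(11, 16), range(16, 21)]
```

Unlike the chunker output shown above, the parts here have sizes 6, 5, 5, 5 and cover every value exactly once.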
With multiprocessing.Pool your worker function just has to return the result, and Pool will take care of transporting it back to the parent process over internal queues. The result will be a list, so you will have to rearrange it into the form you want. Your example could then look like this:
import time
import multiprocessing as mp
from multiprocessing import Pool
from contextlib import contextmanager
from itertools import chain
from mp_utils import calc_batch_sizes, build_batch_ranges

@contextmanager
def context_timer():
    start_time = time.perf_counter()
    yield
    end_time = time.perf_counter()
    total_time = end_time-start_time
    print(f'\nEach iteration took: {total_time / X_MAX:.4f} s')
    print(f'Total time: {total_time:.4f} s\n')

def worker(batch_range):
    """worker function"""
    result = []
    for x in batch_range:
        y = ((x+5)**2+x-40)
        if y <= 0xf+1:
            print('Condition met at: ', y, x)
            result.append((x, y))
    return result

if __name__ == '__main__':
    N_WORKERS = mp.cpu_count()
    X_MAX = 100000000
    batch_sizes = calc_batch_sizes(X_MAX, n_workers=N_WORKERS)
    batch_ranges = build_batch_ranges(batch_sizes)
    print(batch_ranges)
    with context_timer():
        with Pool(N_WORKERS) as pool:
            results = pool.map(worker, iterable=batch_ranges)
    print(f'results: {results}')
    x, y = zip(*chain.from_iterable(results)) # filter and sort results
    print(f'results sorted: x: {x}, y: {y}')
Example Output:
[range(0, 12500000), range(12500000, 25000000), range(25000000, 37500000),
range(37500000, 50000000), range(50000000, 62500000), range(62500000, 75000000), range(75000000, 87500000), range(87500000, 100000000)]
Condition met at: -15 0
Condition met at: -3 1
Condition met at: 11 2
Each iteration took: 0.0000 s
Total time: 8.2408 s
results: [[(0, -15), (1, -3), (2, 11)], [], [], [], [], [], [], []]
results sorted: x: (0, 1, 2), y: (-15, -3, 11)
Process finished with exit code 0
If you had multiple arguments for your worker you would build a "tasks"-list with argument-tuples and exchange pool.map(...) with pool.starmap(...iterable=tasks). See docs for further details on that.
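As a minimal sketch of that starmap variant: the worker below takes a second argument, threshold, which is an assumption added purely to have more than one parameter, and the two small ranges are toy inputs.

```python
from multiprocessing import Pool


def worker(batch_range, threshold):
    """Return (x, y) pairs from batch_range whose y is at most threshold."""
    result = []
    for x in batch_range:
        y = ((x + 5) ** 2 + x - 40)
        if y <= threshold:
            result.append((x, y))
    return result


if __name__ == '__main__':
    batch_ranges = [range(0, 500), range(500, 1000)]
    # One argument tuple per call: worker(batch_range, threshold)
    tasks = [(batch_range, 0xf + 1) for batch_range in batch_ranges]
    with Pool(2) as pool:
        results = pool.starmap(worker, iterable=tasks)
    print(results)
    # [[(0, -15), (1, -3), (2, 11)], []]
```

starmap unpacks each tuple into positional arguments, so the worker keeps an ordinary signature.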
Process & Queue
If you can't use multiprocessing.Pool for some reason, you have to take care of inter-process communication (IPC) yourself, by passing a multiprocessing.Queue as argument to your worker functions in the child processes and letting them enqueue their results to be sent back to the parent.
You will also have to build your Pool-like structure so you can iterate over it to start and join the processes and you have to get() the results back from the queue. More about Queue.get usage I've written up here.
A solution with this approach could look like this:
def worker(result_queue, batch_range):
    """worker function"""
    result = []
    for x in batch_range:
        y = ((x+5)**2+x-40)
        if y <= 0xf+1:
            print('Condition met at: ', y, x)
            result.append((x, y))
    result_queue.put(result) # <--

if __name__ == '__main__':
    N_WORKERS = mp.cpu_count()
    X_MAX = 100000000
    result_queue = mp.Queue() # <--
    batch_sizes = calc_batch_sizes(X_MAX, n_workers=N_WORKERS)
    batch_ranges = build_batch_ranges(batch_sizes)
    print(batch_ranges)
    with context_timer():
        pool = [Process(target=worker, args=(result_queue, batch_range))
                for batch_range in batch_ranges]
        for p in pool:
            p.start()
        results = [result_queue.get() for _ in batch_ranges]
        for p in pool:
            p.join()
    print(f'results: {results}')
    x, y = zip(*chain.from_iterable(results)) # filter and sort results
    print(f'results sorted: x: {x}, y: {y}')
I'm trying to learn how to do parallel programming in python. I wrote a simple int square function and then ran it in serial, multi-thread, and multi-process:
import time
import multiprocessing, threading
import random

def calc_square(numbers):
    sq = 0
    for n in numbers:
        sq = n*n

def splita(list, n):
    a = [[] for i in range(n)]
    counter = 0
    for i in range(0,len(list)):
        a[counter].append(list[i])
        if len(a[counter]) == len(list)/n:
            counter = counter +1
            continue
    return a

if __name__ == "__main__":
    random.seed(1)
    arr = [random.randint(1, 11) for i in xrange(1000000)]
    print "init completed"
    start_time2 = time.time()
    calc_square(arr)
    end_time2 = time.time()
    print "serial: " + str(end_time2 - start_time2)
    newarr = splita(arr,8)
    print 'split complete'
    start_time = time.time()
    for i in range(8):
        t1 = threading.Thread(target=calc_square, args=(newarr[i],))
        t1.start()
        t1.join()
    end_time = time.time()
    print "mt: " + str(end_time - start_time)
    start_time = time.time()
    for i in range(8):
        p1 = multiprocessing.Process(target=calc_square, args=(newarr[i],))
        p1.start()
        p1.join()
    end_time = time.time()
    print "mp: " + str(end_time - start_time)
Output:
init completed
serial: 0.0640001296997
split complete
mt: 0.0599999427795
mp: 2.97099995613
However, as you can see, something weird happened and mt is taking the same time as serial and mp is actually taking significantly longer (almost 50 times longer).
What am I doing wrong? Could someone push me in the right direction to learn parallel programming in python?
Edit 01
Looking at the comments, I see that perhaps the function not returning anything seems pointless. The reason I'm even trying this is because previously I tried the following add function:
def addi(numbers):
sq = 0
for n in numbers:
sq = sq + n
return sq
I tried returning the addition of each part to a serial number adder, so at least I could see some performance improvement over a pure serial implementation. However, I couldn't figure out how to store and use the returned value, and that's the reason I'm trying to figure out something even simpler than that, which is just dividing up the array and running a simple function on it.
Thanks!
I think that multiprocessing takes quite a long time to create and start each process. I have changed the program to make 10 times the size of arr and changed the way that the processes are started and there is a slight speed-up:
(Also note python 3)
import time
import multiprocessing, threading
from multiprocessing import Queue
import random

def calc_square_q(numbers,q):
    while q.empty():
        pass
    return calc_square(numbers)

if __name__ == "__main__":
    random.seed(1) # note how big arr is now vvvvvvv
    arr = [random.randint(1, 11) for i in range(10000000)]
    print("init completed")
    # ...
    # other stuff as before
    # ...
    processes=[]
    q=Queue()
    for arrs in newarr:
        processes.append(multiprocessing.Process(target=calc_square_q, args=(arrs,q)))
    print('start processes')
    for p in processes:
        p.start() # even tho' each process is started it waits...
    print('join processes')
    q.put(None) # ... for q to become not empty.
    start_time = time.time()
    for p in processes:
        p.join()
    end_time = time.time()
    print("mp: " + str(end_time - start_time))
Also notice above how I create and start the processes in two different loops, and then finally join with the processes in a third loop.
Output:
init completed
serial: 0.53214430809021
split complete
start threads
mt: 0.5551605224609375
start processes
join processes
mp: 0.2800724506378174
Another factor of 10 increase in size of arr:
init completed
serial: 5.8455305099487305
split complete
start threads
mt: 5.411392450332642
start processes
join processes
mp: 1.9705185890197754
And yes, I've also tried this in python 2.7, although Threads seemed slower.
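On the follow-up in Edit 01 about storing and using the returned values: one possible approach, sketched here in Python 3, is to let multiprocessing.Pool.map collect one return value per chunk and combine them in the parent. addi is taken from the question; split_chunks is a helper invented for this sketch.

```python
from multiprocessing import Pool


def addi(numbers):
    """Sum one chunk (same idea as the addi function in the question)."""
    sq = 0
    for n in numbers:
        sq = sq + n
    return sq


def split_chunks(seq, n):
    """Split seq into n consecutive slices of nearly equal length."""
    size, extra = divmod(len(seq), n)
    chunks, start = [], 0
    for k in range(n):
        end = start + size + (1 if k < extra else 0)
        chunks.append(seq[start:end])
        start = end
    return chunks


if __name__ == '__main__':
    arr = list(range(1, 1001))
    with Pool(4) as pool:
        # map() returns the workers' return values, in chunk order
        partial_sums = pool.map(addi, split_chunks(arr, 4))
    print(partial_sums, sum(partial_sums))
```

Only the small partial sums travel back to the parent, but the chunks themselves still have to be pickled and sent to the workers, so the gain depends on how heavy the per-element work is.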