I am trying to speed up the dot product of two large matrices, so I tested a small multiprocessing example. The code is below, but judging from the results it seems to run sequentially.
Code
import multiprocessing as mp
import numpy as np
import time

def dot(i):
    print(f"Process {i} enters")
    np.random.seed(10)
    a = np.random.normal(0, 1, (5000, 5000))
    b = np.random.normal(0, 1, (5000, 5000))
    print(f"Process {i} starts calculating")
    res = np.dot(a, b)
    print(f"Process {i} finishes")
    return res

if __name__ == '__main__':
    start = time.perf_counter()
    dot(1)
    print(time.perf_counter() - start)
    print('=============================')
    print(mp.cpu_count())

    i = 8
    start = time.perf_counter()
    pool = mp.Pool(mp.cpu_count())
    res = []
    for j in range(i):
        res.append(pool.apply_async(dot, args=(j,)))
    pool.close()
    pool.join()
    end = time.perf_counter()
    # res = [r.get() for r in res]
    # print(res)
    print(end - start)
Results
Process 1 enters
Process 1 starts calculating
Process 1 finishes
2.582571708
=============================
8
Process 0 enters
Process 1 enters
Process 2 enters
Process 3 enters
Process 4 enters
Process 5 enters
Process 6 enters
Process 7 enters
Process 4 starts calculating
Process 7 starts calculating
Process 5 starts calculating
Process 3 starts calculating
Process 1 starts calculating
Process 6 starts calculating
Process 0 starts calculating
Process 2 starts calculating
Process 4 finishes
Process 7 finishes
Process 1 finishes
Process 0 finishes
Process 6 finishes
Process 2 finishes
Process 5 finishes
Process 3 finishes
27.05124225
The printed messages suggest the processes do indeed run in parallel, but the total running time looks as if everything ran sequentially. I don't know why, so I hope someone can give me some advice. Thanks in advance.
Of course there is always additional overhead involved in creating processes and in passing arguments and results between address spaces (and in this case your results are extremely large).
My best guess is that the memory required to run 8 of these processes in parallel (I assume you have at least 8 logical, preferably 8 physical, processors), each building very large arrays, is causing heavy paging; I get the same results as you. I have therefore modified the demo to be much less memory intensive while keeping the CPU requirement high, by performing the dot product many times in a loop on small matrices. I have also reduced the number of processes to 4, the number of physical processors on my desktop, which gives each process a better chance of running truly in parallel:
from multiprocessing.pool import Pool
import numpy as np
import time

def dot(i):
    print(f"Process {i} enters")
    np.random.seed(10)
    a = np.random.normal(0, 1, (50, 50))
    b = np.random.normal(0, 1, (50, 50))
    print(f"Process {i} starts calculating")
    for _ in range(500_000):
        res = np.dot(a, b)
    print(f"Process {i} finishes")
    return res

if __name__ == '__main__':
    start = time.perf_counter()
    dot(1)
    print(time.perf_counter() - start)
    print('=============================')

    i = 4
    start = time.perf_counter()
    pool = Pool(i)
    res = []
    for j in range(i):
        res.append(pool.apply_async(dot, args=(j,)))
    pool.close()
    pool.join()
    end = time.perf_counter()
    # res = [r.get() for r in res]
    # print(res)
    print(end - start)
Results:
Process 1 enters
Process 1 starts calculating
Process 1 finishes
6.0469717
=============================
Process 0 enters
Process 0 starts calculating
Process 1 enters
Process 1 starts calculating
Process 2 enters
Process 3 enters
Process 2 starts calculating
Process 3 starts calculating
Process 0 finishes
Process 1 finishes
Process 3 finishes
Process 2 finishes
8.8419177
This is much closer to what you would expect. When I changed i to 8, the number of logical processors, the running times were 6.1023760000000005 and 12.749368100000002 seconds respectively.
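One more thing worth checking (an assumption on my part, I have not verified it on your machine): if your NumPy is linked against a multithreaded BLAS such as OpenBLAS or MKL, a single np.dot already uses several threads, so 8 worker processes can oversubscribe the CPUs and fight each other. A common way to rule this out is to pin the BLAS libraries to one thread per process before NumPy is imported:

import os
# must be set before numpy is imported; affects the usual BLAS back ends
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["OPENBLAS_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"

import numpy as np  # np.dot now runs single-threaded in each worker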
Related
I am trying to run a model multiple times, which is time consuming, so I tried to make it parallel. However, the parallel version ends up being slower: 40 seconds versus 34 seconds for the serial one.
# !pip install --target=$nb_path transformers
from transformers import pipeline

oracle = pipeline(model="deepset/roberta-base-squad2")
question = 'When did the first extension of the Athens Tram take place?'

print(data)  # data is a list of context strings; a sample is linked after the code
print("Data size is: ", len(data))

parallel = True

if parallel == False:
    counter = 0
    l = len(data)
    cr = []
    for words in data:
        counter += 1
        print(counter, " out of ", l)
        cr.append(oracle(question=question, context=words))
elif parallel == True:
    from multiprocessing import Process, Queue
    import multiprocessing

    no_CPU = multiprocessing.cpu_count()
    print("Number of cpu : ", no_CPU)
    l = len(data)

    def answer_question(data, no_CPU, sub_no):
        cr_process = []
        counter_process = 0
        for words in data:
            counter_process += 1
            l_data = len(data)
            # print("n is", no_CPU)
            # print("l is", l_data)
            print(counter_process, " out of ", l_data, "in subprocess number", sub_no)
            cr_process.append(oracle(question=question, context=words))
        # Q.put(cr_process)
        cr.append(cr_process)

    n = no_CPU   # number of subprocesses
    m = l // n   # number of data samples the n-1 first subprocesses will handle
    res = l % n  # number of extra data samples the last subprocess has
    # print(m)
    # print(res)
    procs = []
    # instantiating processes with arguments
    for x in range(n - 1):
        # print(x*m)
        # print((x+1)*m)
        proc = Process(target=answer_question, args=(data[x*m:(x+1)*m], n, x+1,))
        procs.append(proc)
        proc.start()

    proc = Process(target=answer_question, args=(data[(n-1)*m:n*m+res], n, n,))
    procs.append(proc)
    proc.start()

    # complete the processes
    for proc in procs:
        proc.join()
A sample of the data variable can be found here (to not flood the question). The parallel flag switches between the serial and the parallel version. So my question is: why does this happen, and how do I make the parallel version faster? I use Google Colab, so it has 2 CPU cores available; at least that is what multiprocessing.cpu_count() reports.
Your pipeline is already running on multiple CPUs even when run as a single process; the transformers code is optimized to use several cores.
When, on top of that, you create multiple processes, you lose time building the processes and switching between them.
To verify this, watch your CPU utilization while the so-called "single process" version runs: you should already see all cores close to their maximum, so creating extra parallel processes is not going to save you any time.
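As a quick check (this assumes the pipeline runs on the PyTorch backend, which is the usual default for this model), you can print how many intra-op threads the backend is allowed to use; if it already equals the number of cores, extra processes can only compete with each other:

import torch
from transformers import pipeline

print("intra-op threads:", torch.get_num_threads())  # usually equals the number of cores

# hypothetical experiment: force single-threaded inference, then compare
# one process against several processes on an equal footing
torch.set_num_threads(1)
oracle = pipeline(model="deepset/roberta-base-squad2")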
I'm having trouble using Python multiprocessing.
I'm trying with a minimal version of the code:
import os
os.environ["OMP_NUM_THREADS"] = "1"         # just in case the system uses multithreading somehow
os.environ["OPENBLAS_NUM_THREADS"] = "1"    # just in case the system uses multithreading somehow
os.environ["MKL_NUM_THREADS"] = "1"         # just in case the system uses multithreading somehow
os.environ["VECLIB_MAXIMUM_THREADS"] = "1"  # just in case the system uses multithreading somehow
os.environ["NUMEXPR_NUM_THREADS"] = "1"     # just in case the system uses multithreading somehow

import numpy as np
from datetime import datetime as dt
from multiprocessing import Pool
from pandas import DataFrame as DF

def trytrytryshare(times):
    i = 0
    for j in range(times):
        i += 1
    return

def trymultishare(thread=70, times=10):
    st = dt.now()
    args_l = [(times,) for i in range(thread)]
    print(st)
    p = Pool(thread)
    for i in range(len(args_l)):
        p.apply_async(func=trytrytryshare, args=(args_l[i]))
    p.close()
    p.join()
    timecost = (dt.now() - st).total_seconds()
    print('%d threads finished in %f secs' % (thread, timecost))
    return timecost

if __name__ == '__main__':
    res = DF(columns=['thread', 'timecost'])
    n = 0
    for j in range(5):
        for i in range(1, 8, 3):
            timecost = trymultishare(thread=i, times=int(1e8))
            res.loc[n] = [i, timecost]
            n += 1
        timecost = trymultishare(thread=70, times=int(1e8))
        res.loc[n] = [70, timecost]
        n += 1
    res_sum = res.groupby('thread').mean()
    res_sum['decay'] = res_sum.loc[1, 'timecost'] / res_sum['timecost']
On my own computer (8 cores):
On my server (80 cores, I'm the only one using it):
I tried again, making each job run longer.
The decay is still really bad.
Any idea how to "fix" this, or is this just what I can get when using multiprocessing?
Thanks
The way you're timing apply_async is flawed. You won't know when the subprocesses have completed unless you wait for their results.
It's a good idea to work out an optimum process pool size based on the number of CPUs. The code that follows isn't necessarily the best for all cases, but it's what I use.
You shouldn't set the pool size to the number of tasks you intend to run; that's the whole point of using a pool.
So here's a simpler example of how you could test subprocess performance.
from multiprocessing import Pool
from time import perf_counter
from os import cpu_count

def process(n):
    r = 0
    for _ in range(n):
        r += 1
    return r

POOL = max(cpu_count() - 2, 1)
N = 1_000_000

def main(procs):
    # no need for pool size to be bigger than the number of processes to be run
    poolsize = min(POOL, procs)
    with Pool(poolsize) as pool:
        _start = perf_counter()
        for result in [pool.apply_async(process, (N,)) for _ in range(procs)]:
            result.wait()  # wait for async processes to terminate
        _end = perf_counter()
    print(f'Duration for {procs} processes with pool size of {poolsize} = {_end-_start:.2f}s')

if __name__ == '__main__':
    print(f'CPU count = {cpu_count()}')
    for procs in range(10, 101, 10):
        main(procs)
Output:
CPU count = 20
Duration for 10 processes with pool size of 10 = 0.12s
Duration for 20 processes with pool size of 18 = 0.19s
Duration for 30 processes with pool size of 18 = 0.18s
Duration for 40 processes with pool size of 18 = 0.28s
Duration for 50 processes with pool size of 18 = 0.30s
Duration for 60 processes with pool size of 18 = 0.39s
Duration for 70 processes with pool size of 18 = 0.42s
Duration for 80 processes with pool size of 18 = 0.45s
Duration for 90 processes with pool size of 18 = 0.54s
Duration for 100 processes with pool size of 18 = 0.59s
My guess is that you're observing the cost of spawning new processes, since apply_async returns immediately. It's much cheaper to spawn one process in the case of thread==1 instead of spawning 70 processes (your last case with the worst decay).
The fact that the server with 80 cores performs better than your laptop with 8 cores could be because the server has better hardware in general (better heat removal, faster CPUs, etc.) or because it runs a different OS. Benchmarking across different machines is non-trivial.
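To see how much of the total is just process start-up, one rough sketch (using the same kind of trivial counting worker, not your original code) is to time the pool construction and the actual work separately, and to collect the results so the measurement covers the whole computation:

from multiprocessing import Pool
from time import perf_counter

def busy(times):
    i = 0
    for _ in range(times):
        i += 1
    return i

if __name__ == '__main__':
    nproc = 70
    t0 = perf_counter()
    with Pool(nproc) as pool:                  # worker processes are spawned here
        t1 = perf_counter()
        results = pool.map(busy, [10_000_000] * nproc)
        t2 = perf_counter()                    # map() returns only when all results are back
    print(f'{len(results)} results, start-up: {t1 - t0:.2f}s, work: {t2 - t1:.2f}s')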
I have the following code and I want to spread the task across multiple processes. After experimenting, I realized that increasing the number of CPU cores negatively impacts the execution time.
I have 8 cores on my machine
Case 1: without using multiprocessing
Execution time: 106 minutes
Case 2: with multiprocessing using ncores = 4
Execution time: 37 minutes
Case 3: with multiprocessing using ncores = 7
Execution time: 40 minutes
Here is the code:
import time
import functools
import multiprocessing as mp

def _fun(i, args1=10):
    # Sort matrix W
    # For loop 1 on matrix M
    # For loop 2 on matrix Y
    return value

def run1(ncores=mp.cpu_count()):
    ncores = ncores - 4  # use 4 and 1 to have ncores = 4 and 7
    _f = functools.partial(_fun, args1=x)
    with mp.Pool(ncores) as pool:
        result = pool.map(_f, range(n))
    return [t for t in result]

start = time.time()
list1 = run1()
end = time.time()
print('time {0} minutes '.format((end - start) / 60))
My question is: what is the best practice for using multiprocessing? My understanding was that the more CPU cores we use, the faster it should be.
I'm trying to learn how to do parallel programming in Python. I wrote a simple integer-squaring function and then ran it serially, with multiple threads, and with multiple processes:
import time
import multiprocessing, threading
import random

def calc_square(numbers):
    sq = 0
    for n in numbers:
        sq = n*n

def splita(list, n):
    a = [[] for i in range(n)]
    counter = 0
    for i in range(0, len(list)):
        a[counter].append(list[i])
        if len(a[counter]) == len(list)/n:
            counter = counter + 1
            continue
    return a

if __name__ == "__main__":
    random.seed(1)
    arr = [random.randint(1, 11) for i in xrange(1000000)]
    print "init completed"

    start_time2 = time.time()
    calc_square(arr)
    end_time2 = time.time()
    print "serial: " + str(end_time2 - start_time2)

    newarr = splita(arr, 8)
    print 'split complete'

    start_time = time.time()
    for i in range(8):
        t1 = threading.Thread(target=calc_square, args=(newarr[i],))
        t1.start()
        t1.join()
    end_time = time.time()
    print "mt: " + str(end_time - start_time)

    start_time = time.time()
    for i in range(8):
        p1 = multiprocessing.Process(target=calc_square, args=(newarr[i],))
        p1.start()
        p1.join()
    end_time = time.time()
    print "mp: " + str(end_time - start_time)
Output:
init completed
serial: 0.0640001296997
split complete
mt: 0.0599999427795
mp: 2.97099995613
However, as you can see, something weird happened and mt is taking the same time as serial and mp is actually taking significantly longer (almost 50 times longer).
What am I doing wrong? Could someone push me in the right direction to learn parallel programming in python?
Edit 01
Looking at the comments, I see that a function that doesn't return anything may seem pointless. The reason I'm even trying this is that I previously tried the following add function:
def addi(numbers):
    sq = 0
    for n in numbers:
        sq = sq + n
    return sq
I tried returning the sum of each part and adding the partial results serially, so that at least I could see some performance improvement over a pure serial implementation. However, I couldn't figure out how to store and use the returned values, which is why I'm now trying something even simpler: just dividing up the array and running a simple function on each part.
Thanks!
I think that multiprocessing takes quite a long time to create and start each process. I have changed the program to make arr 10 times larger and changed the way the processes are started, and there is a slight speed-up:
(Also note this is Python 3.)
import time
import multiprocessing, threading
from multiprocessing import Queue
import random

def calc_square_q(numbers, q):
    while q.empty():
        pass
    return calc_square(numbers)

if __name__ == "__main__":
    random.seed(1)  # note how big arr is now vvvvvvv
    arr = [random.randint(1, 11) for i in range(10000000)]
    print("init completed")

    # ...
    # other stuff as before
    # ...

    processes = []
    q = Queue()
    for arrs in newarr:
        processes.append(multiprocessing.Process(target=calc_square_q, args=(arrs, q)))

    print('start processes')
    for p in processes:
        p.start()  # even tho' each process is started it waits...

    print('join processes')
    q.put(None)    # ... for q to become not empty.
    start_time = time.time()
    for p in processes:
        p.join()
    end_time = time.time()
    print("mp: " + str(end_time - start_time))
Also notice above how I create and start the processes in two different loops, and then finally join with the processes in a third loop.
Output:
init completed
serial: 0.53214430809021
split complete
start threads
mt: 0.5551605224609375
start processes
join processes
mp: 0.2800724506378174
Another factor of 10 increase in size of arr:
init completed
serial: 5.8455305099487305
split complete
start threads
mt: 5.411392450332642
start processes
join processes
mp: 1.9705185890197754
And yes, I've also tried this in Python 2.7, although threads seemed slower.
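To make explicit why starting the processes in one loop and joining them in another matters, here is a minimal sketch (a toy worker, not the code above): calling join() right after each start() serializes the work, while starting them all first and joining afterwards lets them overlap.

from multiprocessing import Process
import time

def work(i):
    time.sleep(1)  # stand-in for real computation
    print("chunk", i, "done")

if __name__ == "__main__":
    # serialized: each join() blocks before the next start(), total is about 8 s
    t = time.time()
    for i in range(8):
        p = Process(target=work, args=(i,))
        p.start()
        p.join()
    print("start+join in one loop:", time.time() - t)

    # parallel: start everything, then join, total is about 1 s plus spawn overhead
    t = time.time()
    procs = [Process(target=work, args=(i,)) for i in range(8)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    print("start first, join later:", time.time() - t)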
I am attempting to create a program in Python that runs multiple instances (15) of a function simultaneously on different processors. I have been researching this and have set up the program below using the Process class from multiprocessing.
Unfortunately, the program executes each instance of the function sequentially (it seems to wait for one to finish before moving on to the next part of the loop).
from __future__ import print_function
from multiprocessing import Process
import sys
import os
import re

for i in range(1, 16):
    exec("path%d = 0" % (i))
    exec("file%d = open('%d-path', 'a', 1)" % (i, i))

def stat(first, last):
    for j in range(1, 40000):
        input_string = "water" + str(j) + ".xyz.geocard"
        if os.path.exists('./%s' % input_string) == True:
            exec("out%d = open('output%d', 'a', 1)" % (first, first))
            exec('print("Processing file %s...", file=out%d)' % (input_string, first))
            with open('./%s' % input_string, 'r') as file:
                for line in file:
                    for i in range(first, last):
                        search_string = " " + str(i) + " path:"
                        for result in re.finditer(r'%s' % search_string, line):
                            exec("path%d += 1" % i)
    for i in range(first, last):
        exec("print(path%d, file=file%d)" % (i, i))

processes = []
for m in range(1, 16):
    n = m + 1
    p = Process(target=stat, args=(m, n))
    p.start()
    processes.append(p)

for p in processes:
    p.join()
I am reasonably new to programming, and have no experience with parallelization - any help would be greatly appreciated.
I have included the entire program above, replacing "Some Function" with the actual function, to demonstrate that this is not a timing issue. The program can take days to cycle through all 40,000 files (each of which is quite large).
I think what is happening is that you are not doing enough in some_function to observe work happening in parallel. It spawns a process, and it completes before the next one gets spawned. If you introduce a random sleep time into some_function, you'll see that they are in fact running in parallel.
from multiprocessing import Process
import random
import time

def some_function(first, last):
    time.sleep(random.randint(1, 3))
    print first, last

processes = []
for m in range(1, 16):
    n = m + 1
    p = Process(target=some_function, args=(m, n))
    p.start()
    processes.append(p)

for p in processes:
    p.join()
Output
2 3
3 4
5 6
12 13
13 14
14 15
15 16
1 2
4 5
6 7
9 10
8 9
7 8
11 12
10 11
Are you sure? I just tried it and it worked for me; the results are out of order on every execution, so they're being executed concurrently.
Have a look at your function. It takes "first" and "last", so is its execution time smaller for lower values? In that case, the smaller-numbered arguments would finish first and the output would come out in order, which could make it merely look as if it were running sequentially.
ps ux | grep python | grep -v grep | wc -l
> 16
If you execute the code repeatedly (e.g. from a bash script) you can see that every process is starting up. If you want to confirm this, import os and have the function print out os.getpid() so you can see that each one has a different process ID.
So yeah, double check your results because it seems to me like you've written it concurrently just fine!
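For example, a minimal sketch of that check (a hypothetical stand-in for the real function, not the poster's actual script) could look like this:

from multiprocessing import Process
import os

def some_function(first, last):
    # each worker reports its own process ID; different PIDs confirm separate processes
    print("args", first, last, "running in PID", os.getpid())

if __name__ == '__main__':
    processes = [Process(target=some_function, args=(m, m + 1)) for m in range(1, 16)]
    for p in processes:
        p.start()
    for p in processes:
        p.join()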
The code below runs 10 processes in parallel, each printing the numbers from 0 to 99. Note that if __name__ == "__main__": is needed to start processes on Windows:
from multiprocessing import Process

def test():
    for i in range(0, 100):
        print(i)

if __name__ == "__main__":  # Here
    process_list = []
    for _ in range(0, 10):
        process = Process(target=test)
        process_list.append(process)

    for process in process_list:
        process.start()

    for process in process_list:
        process.join()
And the code below is a list-comprehension shorthand of the code above, again running 10 processes in parallel, each printing the numbers from 0 to 99:
from multiprocessing import Process

def test():
    [print(i) for i in range(0, 100)]

if __name__ == "__main__":
    process_list = [Process(target=test) for _ in range(0, 10)]
    [process.start() for process in process_list]
    [process.join() for process in process_list]
This is the result:
...
99
79
67
71
67
89
81
99
80
68
...