Why this Python parallel loop is taking longer time than sequential loop? - python

I have this code that I tried to make parallel based on a previous question. Here is the code using 2 processes.
import multiprocessing
import timeit
start_time = timeit.default_timer()
d1 = dict( (i,tuple([i*0.1,i*0.2,i*0.3])) for i in range(500000) )
d2={}
def fun1(gn):
x,y,z = d1[gn]
d2.update({gn:((x+y+z)/3)})
#
if __name__ == '__main__':
gen1 = [x for x in d1.keys()]
#fun1(gen1)
p= multiprocessing.Pool(2)
p.map(fun1,gen1)
print('Script finished')
stop_time = timeit.default_timer()
print(stop_time - start_time)
Output is:
Script finished
1.8478448875989333
If I change the program to sequential,
fun1(gen1)
#p= multiprocessing.Pool(2)
#p.map(fun1,gen1)
output is:
Script finished
0.8345944193950299
So parallel loop is taking more time that sequential loop, more than double. (My computer has 2 cores, running on Windows.) I tried to find similar questions on the topic, this and this but could not figure out the reason. How can I get performance improvement using multiprocessing module in this example?

When you do p.map(fun1,gen1) you send gen1 over to the other process. This includes serializing the list which is 500000 elements big.
Comparing serialization to the small computation, it takes much longer.
You can measure or profile where the time is spent.

Related

Multithreading inside Multiprocessing in Python

I am using concurrent.futures module to do multiprocessing and multithreading. I am running it on a 8 core machine with 16GB RAM, intel i7 8th Gen processor. I tried this on Python 3.7.2 and even on Python 3.8.2
import concurrent.futures
import time
takes list and multiply each elem by 2
def double_value(x):
y = []
for elem in x:
y.append(2 *elem)
return y
multiply an elem by 2
def double_single_value(x):
return 2* x
define a
import numpy as np
a = np.arange(100000000).reshape(100, 1000000)
function to run multiple thread and multiple each elem by 2
def get_double_value(x):
with concurrent.futures.ThreadPoolExecutor() as executor:
results = executor.map(double_single_value, x)
return list(results)
code shown below ran in 115 seconds. This is using only multiprocessing. CPU utilization for this piece of code is 100%
t = time.time()
with concurrent.futures.ProcessPoolExecutor() as executor:
my_results = executor.map(double_value, a)
print(time.time()-t)
Below function took more than 9 min and consumed all the Ram of system and then system kill all the process. Also CPU utilization during this piece of code is not upto 100% (~85%)
t = time.time()
with concurrent.futures.ProcessPoolExecutor() as executor:
my_results = executor.map(get_double_value, a)
print(time.time()-t)
I really want to understand:
1) why the code that first split do multiple processing and then run tried multi-threading is not running faster than the code that runs only multiprocessing ?
(I have gone through many post that describe multiprocessing and multi-threading and one of the crux that I got is multi-threading is for I/O process and multiprocessing for CPU processes ? )
2) Is there any better way of doing multi-threading inside multiprocessing for max utilization of allotted core(or CPU) ?
3) Why that last piece of code consumed all the RAM ? Was it due to multi-threading ?
You can mix concurrency with parallelism.
Why? You can have your valid reasons. Imagine a bunch of requests you have to make while processing their responses (e.g., converting XML to JSON) as fast as possible.
I did some tests and here are the results.
In each test, I mix different workarounds to make a print 16000 times (I have 8 cores and 16 threads).
Parallelism with multiprocessing, concurrency with asyncio
The fastest, 1.1152372360229492 sec.
import asyncio
import multiprocessing
import os
import psutil
import threading
import time
async def print_info(value):
await asyncio.sleep(1)
print(
f"THREAD: {threading.get_ident()}",
f"PROCESS: {os.getpid()}",
f"CORE_ID: {psutil.Process().cpu_num()}",
f"VALUE: {value}",
)
async def await_async_logic(values):
await asyncio.gather(
*(
print_info(value)
for value in values
)
)
def run_async_logic(values):
asyncio.run(await_async_logic(values))
def multiprocessing_executor():
start = time.time()
with multiprocessing.Pool() as multiprocessing_pool:
multiprocessing_pool.map(
run_async_logic,
(range(1000 * x, 1000 * (x + 1)) for x in range(os.cpu_count())),
)
end = time.time()
print(end - start)
multiprocessing_executor()
Very important note: with asyncio I can spam tasks as much as I want. For example, I can change the value from 1000 to 10000 to generate 160000 prints and there is no problem (I tested it and it took me 2.0210490226745605 sec).
Parallelism with multiprocessing, concurrency with threading
An alternative option, 1.6983509063720703 sec.
import multiprocessing
import os
import psutil
import threading
import time
def print_info(value):
time.sleep(1)
print(
f"THREAD: {threading.get_ident()}",
f"PROCESS: {os.getpid()}",
f"CORE_ID: {psutil.Process().cpu_num()}",
f"VALUE: {value}",
)
def multithreading_logic(values):
threads = []
for value in values:
threads.append(threading.Thread(target=print_info, args=(value,)))
for thread in threads:
thread.start()
for thread in threads:
thread.join()
def multiprocessing_executor():
start = time.time()
with multiprocessing.Pool() as multiprocessing_pool:
multiprocessing_pool.map(
multithreading_logic,
(range(1000 * x, 1000 * (x + 1)) for x in range(os.cpu_count())),
)
end = time.time()
print(end - start)
multiprocessing_executor()
Very important note: with this method I can NOT spam as many tasks as I want. If I change the value from 1000 to 10000 I get RuntimeError: can't start new thread.
I also want to say that I am impressed because I thought that this method would be better in every aspect compared to asyncio, but quite the opposite.
Parallelism and concurrency with concurrent.futures
Extremely slow, 50.08251595497131 sec.
import os
import psutil
import threading
import time
from concurrent.futures import thread, process
def print_info(value):
time.sleep(1)
print(
f"THREAD: {threading.get_ident()}",
f"PROCESS: {os.getpid()}",
f"CORE_ID: {psutil.Process().cpu_num()}",
f"VALUE: {value}",
)
def multithreading_logic(values):
with thread.ThreadPoolExecutor() as multithreading_executor:
multithreading_executor.map(
print_info,
values,
)
def multiprocessing_executor():
start = time.time()
with process.ProcessPoolExecutor() as multiprocessing_executor:
multiprocessing_executor.map(
multithreading_logic,
(range(1000 * x, 1000 * (x + 1)) for x in range(os.cpu_count())),
)
end = time.time()
print(end - start)
multiprocessing_executor()
Very important note: with this method, as with asyncio, I can spam as many tasks as I want. For example, I can change the value from 1000 to 10000 to generate 160000 prints and there is no problem (except for the time).
Extra notes
To make this comment, I modified the test so that it only makes 1600 prints (modifying the 1000 value with 100 in each test).
When I remove the parallelism from asyncio, the execution takes me 16.090194702148438 sec.
In addition, if I replace the await asyncio.sleep(1) with time.sleep(1), it takes 160.1889989376068 sec.
Removing the parallelism from the multithreading option, the execution takes me 16.24941658973694 sec.
Right now I am impressed. Multithreading without multiprocessing gives me good performance, very similar to asyncio.
Removing parallelism from the third option, execution takes me 80.15227723121643 sec.
As you say: "I have gone through many post that describe multiprocessing and multi-threading and one of the crux that I got is multi-threading is for I/O process and multiprocessing for CPU processes".
You need to figure out, if your program is IO-bound or CPU-bound, then apply the correct method to solve your problem. Applying various methods at random or all together at the same time usually makes things only worse.
Use of threading in clean Python for CPU-bound problems is a bad approach regardless of using multiprocessing or not. Try to redesign your app to use only multiprocessing or use third-party libs such as Dask and so on
I believe you figured it out, but I wanted to answer. Obviously, your function double_single_value is CPU bound. It has nothing to do with Io. In CPU bound tasks using multi-thread will make it worse than using a single thread, because GIL does not allow you actually run on multi-thread and you will eventually run on single thread. Also, you may not finish a task and go to another and when you get back you should load it to the CPU again, which will make this even slower.
Based off your code, I see most of your code is dealing with computations(calculations) so it's most encouraged to use multiprocessing to solve your problem since it's CPU-bound and NOT I/O bound(things like sending requests to websites and then waiting for some response from the server in exchange, writing to disk or even reading from disk). This is true for Python programming as far as I know. The python GIL(Global Interpreter Lock) will make your code run slowly as it is a mutex (or a lock) that allows only one thread to take the control of the Python interpreter meaning it won't achieve parallelism but will give you concurrency instead. But it's very fine to use threading for I/O bound tasks because they'll outcompete multiprocessing in execution times but for your case i would encourage you to use multiprocessing because each Python process will get its own Python interpreter and memory space so the GIL won’t be a problem to you.
I am not so sure about integrating multithreading with multiprocessing but what i know it can cause inconsistency in the processed results since you will need more bolierplate code for data synchronization if you want the processes to communicate(IPC) and also threads are kinda unpredictable(thus inconsistent at times) since they're controlled by the OS so anytime they can be scooped out(pre-emptive scheduling) for kernel level threads(due to time sharing). i don't stop you from writing that code but be really sure of what you are doing. You never know you would propose a solution to it one day.

How to make real parallel programming in Python?

I want to do parallel processing to speed up the task in Python.
I used apply_async but the cpu only consumes 30%. How to fully utilize the cpu?
Below is my code.
import numpy as np
import pandas as pd
import multiprocessing
def calc_score(df, i, j, score):
score[i,j] = df.loc[i, 'data'] + df.loc[j, 'data']
if __name__ == '__main__':
df = pd.read_csv('data.csv')
score = np.zeros([100, 100])
pool = multiprocessing.Pool(multiprocessing.cpu_count())
for i in range(100):
for j in range(100):
pool.apply_async(calc_score, (df, i, j, score))
pool.close()
pool.join()
Thank you very much.
You can't utilize 100% CPU with pool = multiprocessing.Pool(multiprocessing.cpu_count()) . It starts your worker function on the number of core given by you but also looks for a free core. If you want to utilize maximum CPU with multiprocessing you should use multiprocessing Process class. It keeps spinning new thread. But be aware it will breakdown system if your CPU doesn't have memory to spin new thread.
"CPU utilization" should be about performance, i.e. you want to do the job in as little time as possible. There is no generic way to do that. If there was a generic way to optimize software, then there would be no slow software, right?
You seem to be looking for a different thing: spend as much CPU time as possible, so that it does not sit idly. That may seem like the same thing, but is absolutely not.
Anyway, if you want to spend 100% of CPU time, this script will do that for you:
import time
import multiprocessing
def loop_until_t(t):
while time.time() < t:
pass
def waste_cpu_for_n_seconds(num_seconds, num_processes=multiprocessing.cpu_count()):
t0 = time.time()
t = t0 + num_seconds
print("Begin spending CPU time (in {} processes)...".format(num_processes))
with multiprocessing.Pool(num_processes) as pool:
pool.map(loop_until_t, num_processes*[t])
print("Done.")
if __name__ == '__main__':
waste_cpu_for_n_seconds(15)
If, instead, you want your program to run faster, you will not do that with an "illustration for parallel processing", as you call it - you need an actual problem to be solved.

pool.apply_async takes so long to finish, how to speed up?

I call pool.apply_async() with 14 cores.
import multiprocessing
from time import time
import timeit
informative_patients = informative_patients_2500_end[20:]
pool = multiprocessing.Pool(14)
results = []
wLength = [20,30,50]
start = time()
for fn in informative_patients:
result = pool.apply_async(compute_features_test_set, args = (fn,
wLength), callback=results.append)
pool.close()
pool.join()
stop = timeit.default_timer()
print stop - start
The problem is it finishes calling compute_features_test_set() function for the first 13 data in less than one hour, but it takes more than one hour to finish the last one. The size of the data for all the 14 data-set is the same. I tried putting pool.terminate() after pool.close() but in this case it doesn't even start the pool and terminate the pool immediately without going inside the for loop. This always happen in the same way and if I use more cores and more data set, always the last one takes so long to finish. My compute_features_test_set() function is a simple feature extraction code and works correctly. I work on a server with Linux red hat 6, python 2.7 and jupyter. Computation time is important to me and my question is what is wrong here and how I can fix it to get the all the computation done in a reasonable time?
Question: ... what is wrong here and how I can fix it
Couldn't catch this as a multiprocessing issue.
But How do you get this: "always the last one takes so long to finish"?
You are using callback=results.append instead of a own function?
Edit your Question and show How you timeit one Process Time.
Also add your Python Version to your Question.
Do the following to verify it's not a Data issue:
start = time()
results.append(
compute_features_test_set(<First informative_patients>, wLength
)
stop = timeit.default_timer()
print stop - start
start = time()
results.append(
compute_features_test_set(<Last informative_patients>, wLength
)
stop = timeit.default_timer()
print stop - start
Compare the two times you get.

Why this multiprocessing code is slower than the serial one?

I tried the following python programs, both sequential and parallel versions on a cluster computing facility. I could clearly see(using top command) more processes initiating for the parallel program. But when I time it, it seems the parallel version is taking more time. What could be the reason? I am attaching the codes and the timing info herewith.
#parallel.py
from multiprocessing import Pool
import numpy
def sqrt(x):
return numpy.sqrt(x)
pool = Pool()
results = pool.map(sqrt, range(100000), chunksize=10)
#seq.py
import numpy
def sqrt(x):
return numpy.sqrt(x)
results = [sqrt(x) for x in range(100000)]
user#domain$ time python parallel.py > parallel.txt
real 0m1.323s
user 0m2.238s
sys 0m0.243s
user#domain$ time python seq.py > seq.txt
real 0m0.348s
user 0m0.324s
sys 0m0.024s
The amount of work per task is by far too little to compensate for the work-distribution-overhead. First you should increase the chunksize, but still a single square root operation is too short to compensate for the cost of sending around the data between processes. You can see an effective speedup from something like this:
def sqrt(x):
for _ in range(100):
x = numpy.sqrt(x)
return x
results = pool.map(sqrt, range(10000), chunksize=100)

Python multiprocessing run time per process increases with number of processes

I have a pool of workers which perform the same identical task, and I send each a distinct clone of the same data object. Then, I measure the run time separately for each process inside the worker function.
With one process, run time is 4 seconds. With 3 processes, the run time for each process goes up to 6 seconds.
With more complex tasks, this increase is even more nuanced.
There are no other cpu-hogging processes running on my system, and the workers don't use shared memory (as far as I can tell). The run times are measured inside the worker function, so I assume the forking overhead shouldn't matter.
Why does this happen?
def worker_fn(data):
t1 = time()
data.process()
print time() - t1
return data.results
def main( n, num_procs = 3):
from multiprocessing import Pool
from cPickle import dumps, loads
pool = Pool(processes = num_procs)
data = MyClass()
data_pickle = dumps(data)
list_data = [loads(data_pickle) for i in range(n)]
results = pool.map(worker_fn,list_data)
Edit: Although I can't post the entire code for MyClass(), I can tell you that it involves a lot of numpy matrix operations. It seems that numpy's use of OpenBlass may somehow be to blame.

Categories