I wrote this bit of code to test out Python's multiprocessing on my computer:
from multiprocessing import Pool

var = range(5000000)

def test_func(i):
    return i + 1

if __name__ == '__main__':
    p = Pool()
    var = p.map(test_func, var)
I timed this using Unix's time command and the results were:
real 0m2.914s
user 0m4.705s
sys 0m1.406s
Then, using the same var and test_func() I timed:
var = map(test_func, var)
and the results were
real 0m1.785s
user 0m1.548s
sys 0m0.214s
Shouldn't the multiprocessing code be much faster than plain old map?
Why should it?
With the plain map, you are just calling the function sequentially.
A multiprocessing Pool creates a set of worker processes to which your tasks are mapped.
It has to coordinate those worker processes to run the function.
Try doing some significant work inside your function, then time both versions again and see whether multiprocessing helps you compute faster.
You have to understand that there is overhead in using multiprocessing. Only when the computing effort is significantly greater than that overhead will you see its benefits.
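For instance, here is a minimal sketch (not from the original post) that times both approaches in the same script with time.perf_counter, so the overhead is visible directly:
# Minimal timing sketch: compare the built-in map with Pool.map on the
# same trivial function to make the multiprocessing overhead visible.
from multiprocessing import Pool
from time import perf_counter

def test_func(i):
    return i + 1

if __name__ == '__main__':
    data = range(5000000)

    t0 = perf_counter()
    serial = list(map(test_func, data))
    print('map:      %.2fs' % (perf_counter() - t0))

    t0 = perf_counter()
    with Pool() as p:
        parallel = p.map(test_func, data)
    print('Pool.map: %.2fs' % (perf_counter() - t0))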
See the last example in the excellent introduction by Hellmann: http://www.doughellmann.com/PyMOTW/multiprocessing/communication.html
pool_size = multiprocessing.cpu_count() * 2
pool = multiprocessing.Pool(processes=pool_size,
                            initializer=start_process,
                            maxtasksperchild=2,
                            )
pool_outputs = pool.map(do_calculation, inputs)
You size the pool according to the number of cores you have.
There is an overhead to using parallelization. There is only a benefit if each work unit takes long enough to compensate for that overhead.
Also, if you only have one CPU (or one CPU thread) on your machine, there's no point in parallelizing at all. You'll only see gains on a hyperthreaded machine or one with at least two CPU cores.
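(If you're unsure how many cores your interpreter sees, a quick check is shown below; note that Pool() with no argument already defaults to this count.)
# Quick check: how many CPUs are visible to Python.
# Pool() with no arguments starts this many worker processes by default.
import os
import multiprocessing

print(os.cpu_count())
print(multiprocessing.cpu_count())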
In your case, a simple addition operation doesn't compensate for that overhead.
Try something a bit more costly such as:
from multiprocessing import Pool
import math

def test_func(i):
    j = 0
    for x in xrange(1000000):
        j += math.atan2(i, i)
    return j

if __name__ == '__main__':
    var = range(500)
    p = Pool()
    var = p.map(test_func, var)
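A related knob when each item is still fairly cheap is map's chunksize argument, which controls how many items are bundled into each task sent to a worker; larger chunks mean fewer pickled messages crossing the process boundary. A minimal variation on the example above (the value 50 is only illustrative):
# Variation on the example above: an explicit chunksize batches work items,
# so fewer pickled messages cross the process boundary.
from multiprocessing import Pool
import math

def test_func(i):
    j = 0
    for x in range(1000000):
        j += math.atan2(i, i)
    return j

if __name__ == '__main__':
    p = Pool()
    var = p.map(test_func, range(500), chunksize=50)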
Related
I'm trying to run Floyd-Warshall concurrently. The concurrent implementation works, but it is significantly slower than the serial version. I've read about the GIL and the overhead it causes, so I tried to use ProcessPoolExecutor instead of ThreadPoolExecutor, but it doesn't seem to work even with a __main__ check:
if __name__ == '__main__':
    v = 10
    workers = 8
    graph = generate_grid(v)
    fwc = ConcurrentFloydWarshallUpdated(deepcopy(graph), v, workers)
    fwc_start = time.time()
    with ProcessPoolExecutor(max_workers=workers) as executor:
        results = executor.map(fwc.work, range(workers))
        # shortest_paths = floydWarshall_p1(deepcopy(graph))
    fwc_time = time.time() - fwc_start
    print(f"fwc {fwc_time}")
Using ThreadPoolExecutor works but ProcessPoolExecutor does not reach the end of execution.
Edit: Note that the class ConcurrentFloydWarshallUpdated makes use of threading primitives such as Lock(), Semaphore(), and Condition(). Do these not work with ProcessPoolExecutor?
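(For what it's worth, here is a minimal, hypothetical sketch, not the asker's class, of the issue the edit hints at: ProcessPoolExecutor has to pickle fwc.work, and therefore the fwc instance, to ship it to a worker process, and objects holding threading primitives cannot be pickled.)
# Hypothetical illustration: an object that holds a threading primitive
# cannot be pickled, which is what sending it to another process requires.
import pickle
import threading

class HoldsLock:
    def __init__(self):
        self.lock = threading.Lock()

try:
    pickle.dumps(HoldsLock())
except TypeError as exc:
    print("pickling failed:", exc)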
I am using the multiprocessing module in python 3.7 to call a function repeatedly in parallel. I would like to write the results out to a file every k iterations. (It can be a different file each time.)
Below is my first attempt, which basically loops over sets of function arguments, running each set in parallel and writing the results to a file before moving on to the next set. This is obviously very inefficient. In practice, the time my function takes to run is much longer and varies depending on the input values, so many processors sit idle between iterations of the loop.
Is there a more efficient way to achieve this?
import multiprocessing as mp
import numpy as np
import pandas as pd

def myfunction(x):  # toy example function
    return(x**2)

for start in np.arange(0, 500, 100):
    with mp.Pool(mp.cpu_count()) as pool:
        out = pool.map(myfunction, np.arange(start, start+100))
    pd.DataFrame(out).to_csv('filename_'+str(start//100+1)+'.csv', header=False, index=False)
My first comment is that if myfunction is as trivial as the one you have shown, your performance will be worse with multiprocessing, because there is overhead in creating the process pool (which, by the way, you are unnecessarily recreating on every loop iteration) and in passing arguments from one process to another.
Assuming that myfunction is pure CPU, once map has returned 100 values there is an opportunity to overlap the writing of the CSV files that you are not taking advantage of (it's not clear how much performance will be gained by concurrent disk writing; it depends on the type of drive you have, head movement, etc.). In that case, a combination of multithreading and multiprocessing could be the solution.
The number of processes in your processing pool should be limited to the number of CPU cores, on the assumption that myfunction is 100% CPU and does not release the Global Interpreter Lock, and therefore cannot take advantage of a pool size greater than the number of CPUs you have. Anyway, that is my assumption. If you are going to be using certain numpy functions, for example, then that assumption may be wrong. On the other hand, numpy uses its own parallelism (multithreaded native libraries) for some of its processing, in which case combining numpy with your own multiprocessing could result in worse performance.
Your current code only uses numpy for generating ranges, which seems like overkill since there are other ways of generating ranges. I have taken the liberty of generating the ranges in a slightly different fashion, by defining START and STOP values and N_SPLITS, the number of equal (or as equal as possible) divisions of that range, and generating tuples of start and stop values that can be converted into ranges. I hope this is not too confusing, but it seemed to be a more flexible approach.
In the following code both a thread pool and a processing pool are created. The tasks are submitted to the thread pool with one of the arguments being the processing pool, which each worker uses to do the CPU-intensive calculations; when the results have been assembled, the worker writes out the CSV file.
from multiprocessing.pool import Pool, ThreadPool
from multiprocessing import cpu_count
import pandas as pd

def worker(process_pool, index, split_range):
    out = process_pool.map(myfunction, range(*split_range))
    pd.DataFrame(out).to_csv(f'filename_{index}.csv', header=False, index=False)

def myfunction(x):  # toy example function
    return(x ** 2)

def split(start, stop, n):
    # divide [start, stop) into n pieces that are as equal as possible
    k, m = divmod(stop - start, n)
    return [(start + i * k + min(i, m), start + (i + 1) * k + min(i + 1, m)) for i in range(n)]

def main():
    RANGE_START = 0
    RANGE_STOP = 500
    N_SPLITS = 5
    n_processes = min(N_SPLITS, cpu_count())
    split_ranges = split(RANGE_START, RANGE_STOP, N_SPLITS)  # [(0, 100), (100, 200), ... (400, 500)]
    process_pool = Pool(n_processes)
    thread_pool = ThreadPool(N_SPLITS)
    for index, split_range in enumerate(split_ranges):
        thread_pool.apply_async(worker, args=(process_pool, index, split_range))
    # wait for all threading tasks to complete:
    thread_pool.close()
    thread_pool.join()

# required for Windows:
if __name__ == '__main__':
    main()
I want to do parallel processing to speed up the task in Python.
I used apply_async, but CPU usage is only around 30%. How can I fully utilize the CPU?
Below is my code.
import numpy as np
import pandas as pd
import multiprocessing

def calc_score(df, i, j, score):
    score[i, j] = df.loc[i, 'data'] + df.loc[j, 'data']

if __name__ == '__main__':
    df = pd.read_csv('data.csv')
    score = np.zeros([100, 100])
    pool = multiprocessing.Pool(multiprocessing.cpu_count())
    for i in range(100):
        for j in range(100):
            pool.apply_async(calc_score, (df, i, j, score))
    pool.close()
    pool.join()
Thank you very much.
You can't utilize 100% of the CPU with pool = multiprocessing.Pool(multiprocessing.cpu_count()). It starts your worker function on the number of cores you give it, but it also waits for a free core. If you want maximum CPU utilization with multiprocessing, you should use the multiprocessing Process class, which lets you keep spinning up new processes. But be aware that this can bring the system down if your machine doesn't have enough memory to spin up more processes.
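For reference, a bare-bones sketch of what using the Process class directly looks like (the function and arguments here are placeholders, not the asker's code):
# Bare-bones sketch of multiprocessing.Process; work() and its arguments
# are placeholders. Each Process runs in its own interpreter.
from multiprocessing import Process

def work(chunk):
    total = sum(x * x for x in chunk)
    print(total)

if __name__ == '__main__':
    procs = [Process(target=work, args=(range(i * 1000, (i + 1) * 1000),)) for i in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()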
"CPU utilization" should be about performance, i.e. you want to do the job in as little time as possible. There is no generic way to do that. If there was a generic way to optimize software, then there would be no slow software, right?
You seem to be looking for a different thing: spend as much CPU time as possible, so that it does not sit idly. That may seem like the same thing, but is absolutely not.
Anyway, if you want to spend 100% of CPU time, this script will do that for you:
import time
import multiprocessing

def loop_until_t(t):
    while time.time() < t:
        pass

def waste_cpu_for_n_seconds(num_seconds, num_processes=multiprocessing.cpu_count()):
    t0 = time.time()
    t = t0 + num_seconds
    print("Begin spending CPU time (in {} processes)...".format(num_processes))
    with multiprocessing.Pool(num_processes) as pool:
        pool.map(loop_until_t, num_processes*[t])
    print("Done.")

if __name__ == '__main__':
    waste_cpu_for_n_seconds(15)
If, instead, you want your program to run faster, you will not do that with an "illustration for parallel processing", as you call it - you need an actual problem to be solved.
I tried the following Python programs, both sequential and parallel versions, on a cluster computing facility. I could clearly see (using the top command) more processes starting up for the parallel program. But when I time it, the parallel version seems to take more time. What could be the reason? I am attaching the code and the timing info below.
# parallel.py
from multiprocessing import Pool
import numpy

def sqrt(x):
    return numpy.sqrt(x)

pool = Pool()
results = pool.map(sqrt, range(100000), chunksize=10)
# seq.py
import numpy

def sqrt(x):
    return numpy.sqrt(x)

results = [sqrt(x) for x in range(100000)]
user@domain$ time python parallel.py > parallel.txt
real 0m1.323s
user 0m2.238s
sys 0m0.243s
user@domain$ time python seq.py > seq.txt
real 0m0.348s
user 0m0.324s
sys 0m0.024s
The amount of work per task is by far too little to compensate for the work-distribution overhead. First, you should increase the chunksize, but even then a single square-root operation is too short to compensate for the cost of sending data back and forth between processes. You can see an effective speedup with something like this:
def sqrt(x):
    for _ in range(100):
        x = numpy.sqrt(x)
    return x

results = pool.map(sqrt, range(10000), chunksize=100)
I have a pool of workers which perform the same identical task, and I send each a distinct clone of the same data object. Then, I measure the run time separately for each process inside the worker function.
With one process, run time is 4 seconds. With 3 processes, the run time for each process goes up to 6 seconds.
With more complex tasks, this increase is even more pronounced.
There are no other cpu-hogging processes running on my system, and the workers don't use shared memory (as far as I can tell). The run times are measured inside the worker function, so I assume the forking overhead shouldn't matter.
Why does this happen?
from time import time

def worker_fn(data):
    t1 = time()
    data.process()
    print time() - t1
    return data.results

def main(n, num_procs=3):
    from multiprocessing import Pool
    from cPickle import dumps, loads
    pool = Pool(processes=num_procs)
    data = MyClass()  # MyClass is defined elsewhere; it does heavy numpy matrix work
    data_pickle = dumps(data)
    list_data = [loads(data_pickle) for i in range(n)]
    results = pool.map(worker_fn, list_data)
Edit: Although I can't post the entire code for MyClass(), I can tell you that it involves a lot of numpy matrix operations. It seems that numpy's use of OpenBLAS may somehow be to blame.
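One way to test that suspicion (a small sketch, assuming OpenBLAS is the backend) is to cap the BLAS thread count before numpy is imported in each process and see whether the per-worker run times stop growing; OPENBLAS_NUM_THREADS and OMP_NUM_THREADS are the usual knobs:
# Hedged check: pin the BLAS backend to one thread per process so the pool
# workers don't oversubscribe the CPU. Must be set before numpy is imported.
import os
os.environ['OPENBLAS_NUM_THREADS'] = '1'
os.environ['OMP_NUM_THREADS'] = '1'

import numpy  # imported only after the environment variables are set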