Python multiprocessing large range

I'm trying to understand how to use multiprocessing. My example code:
import multiprocessing as mp
import time

def my_func(x):
    print(mp.current_process().pid)
    time.sleep(2)
    return x**x

def main():
    pool = mp.Pool(mp.cpu_count())
    result = pool.map(my_func, range(1, 10))
    print(result)

if __name__ == "__main__":
    main()
But what if I have a large range (from 1 to 5 million)? Do I need to use range(1, 5000000), or is there a better solution? my_func will do some work with a database.

The Pool has no problem dealing with a large range. It won't blow up memory or CPU usage: it still creates only as many worker processes as you specify, and the numbers from the range are handed out to the workers in manageable chunks rather than the whole range being copied to each one. There's no need to split it up yourself.
The only caveat is that on Python 2 you should use xrange instead of range, so the full list of five million numbers is never built in memory.
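If each of five million tiny tasks costs its own inter-process round trip, the chunksize argument (or imap_unordered, for consuming results lazily) can cut that overhead. A minimal sketch, with a stand-in my_func in place of the real database work:
import multiprocessing as mp

def my_func(x):
    # stand-in for the real per-item database work
    return x ** 2

if __name__ == "__main__":
    with mp.Pool(mp.cpu_count()) as pool:
        # chunksize hands each worker a batch of numbers per round trip;
        # results stream back as they finish rather than as one big list
        for result in pool.imap_unordered(my_func, range(1, 5000000), chunksize=10000):
            pass  # consume or store each result here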

Related

Python Multiprocessing: Writing to file every k iterations

I am using the multiprocessing module in Python 3.7 to call a function repeatedly in parallel. I would like to write the results out to a file every k iterations. (It can be a different file each time.)
Below is my first attempt, which basically loops over sets of function arguments, running each set in parallel and writing the results to a file before moving on to the next set. This is obviously very inefficient. In practice, the time it takes for my function to run is much longer and varies depending on the input values, so many processors sit idle between iterations of the loop.
Is there a more efficient way to achieve this?
import multiprocessing as mp
import numpy as np
import pandas as pd

def myfunction(x): # toy example function
    return(x**2)

for start in np.arange(0, 500, 100):
    with mp.Pool(mp.cpu_count()) as pool:
        out = pool.map(myfunction, np.arange(start, start+100))
    pd.DataFrame(out).to_csv('filename_'+str(start//100+1)+'.csv', header=False, index=False)
My first comment is that if myfunction is as trivial as the one you have shown, your performance will be worse with multiprocessing, because there is overhead in creating the process pool (which, by the way, you are unnecessarily recreating on every loop iteration) and in passing arguments from one process to another.

Assuming that myfunction is pure CPU work, then once map has returned 100 values there is an opportunity to overlap the writing of the CSV files that you are not taking advantage of (it's not clear how much concurrent disk writing will improve performance; it depends on the type of drive you have, head movement, etc.). In that case a combination of multithreading and multiprocessing could be the solution.

The number of processes in your processing pool should be limited to the number of CPU cores, given the assumption that myfunction is 100% CPU-bound, does not release the Global Interpreter Lock, and therefore cannot take advantage of a pool size greater than the number of CPUs you have. That is my assumption, anyway; if you are going to be using certain numpy functions, for example, it may be wrong. On the other hand, numpy does some of its own processing in parallel, in which case combining numpy with your own multiprocessing could result in worse performance.

Your current code only uses numpy to generate ranges, which seems like overkill, since there are other ways to generate them. I have taken the liberty of generating the ranges in a slightly different fashion, by defining RANGE_START and RANGE_STOP values and N_SPLITS, the number of equal (or as nearly equal as possible) divisions of that range, and generating tuples of start and stop values that can be converted into ranges. I hope this is not too confusing, but it seemed to be a more flexible approach.
In the following code both a thread pool and a processing pool are created. The tasks are submitted to the thread pool with one of the arguments being the processing pool, which the worker uses to do the CPU-intensive calculations; when the results have been assembled, the worker writes out the CSV file.
from multiprocessing.pool import Pool, ThreadPool
from multiprocessing import cpu_count
import pandas as pd

def worker(process_pool, index, split_range):
    out = process_pool.map(myfunction, range(*split_range))
    pd.DataFrame(out).to_csv(f'filename_{index}.csv', header=False, index=False)

def myfunction(x): # toy example function
    return(x ** 2)

def split(start, stop, n):
    # divide [start, stop) into n pieces that are as equal as possible
    k, m = divmod(stop - start, n)
    return [(start + i * k + min(i, m), start + (i + 1) * k + min(i + 1, m)) for i in range(n)]

def main():
    RANGE_START = 0
    RANGE_STOP = 500
    N_SPLITS = 5
    n_processes = min(N_SPLITS, cpu_count())
    split_ranges = split(RANGE_START, RANGE_STOP, N_SPLITS)  # [(0, 100), (100, 200), ... (400, 500)]
    process_pool = Pool(n_processes)
    thread_pool = ThreadPool(N_SPLITS)
    for index, split_range in enumerate(split_ranges):
        thread_pool.apply_async(worker, args=(process_pool, index, split_range))
    # wait for all threading tasks to complete:
    thread_pool.close()
    thread_pool.join()

# required for Windows:
if __name__ == '__main__':
    main()
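For comparison, even without the thread pool, simply hoisting the process pool out of the loop removes the repeated start-up cost mentioned above. A minimal sketch of that simpler variant (it still blocks on each chunk before writing, unlike the version above):
from multiprocessing import Pool, cpu_count
import pandas as pd

def myfunction(x): # toy example function
    return x ** 2

if __name__ == '__main__':
    with Pool(cpu_count()) as pool:  # created once, reused for every chunk
        for index, start in enumerate(range(0, 500, 100), start=1):
            out = pool.map(myfunction, range(start, start + 100))
            pd.DataFrame(out).to_csv(f'filename_{index}.csv', header=False, index=False)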

Repeatedly run a function in parallel

How do you run a function repeatedly in parallel?
For example, I have a function that takes no parameters and has a stochastic element. I want to run it multiple times, which is illustrated below using a for loop. How do I accomplish the same in parallel please?
import numpy as np

def f():
    x = np.random.uniform()
    return x*x

np.random.seed(1)
a = []
for i in range(10):
    a.append(f())
This is a duplicate of parallel-python-just-run-function-n-times; however, the answer there doesn't quite fit, as it passes different inputs into the function, and How do I parallelize a simple Python loop? also gives examples of passing different parameters into the function rather than repeating the same call.
I am on Windows 10 and using Jupyter.
In regards to my real use case:
Does it produce a large volume of output per call?
Each iteration of the loop produces one number.
Do you need to keep the output? How long does each invocation take roughly?
Yes, I need to retain the numbers and it takes ~30 minutes per iteration.
How many times do you need to run it in total?
At least 100.
Do you want to parallelize across multiple machines or just multiple cores?
Currently just across multiple cores.
If you don't want to pass any input to your function, just use a throwaway variable _ as the argument and parallelise it as shown in the code below.
import numpy as np
from multiprocessing.pool import Pool

def f(_):
    x = np.random.uniform()
    return x*x

if __name__ == "__main__":
    processes = 5 # Specify number of processes here
    p = Pool(processes)
    p.map(f, range(10))
Update:
To answer your updated question: if your tasks aren't too heavyweight and are just I/O bound, then I recommend you use ThreadPool (multithreading) instead of Pool (multiprocessing).
Code to create a ThreadPool:
from multiprocessing.pool import ThreadPool
threads = 5
t = ThreadPool(threads)
t.map(f, range(10))
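One more point worth flagging, since f is stochastic: with a fork-based pool the workers can inherit the same NumPy random state and produce identical numbers, and even with spawn (as on Windows) the runs are not reproducible. A small sketch that sidesteps both issues by deriving each call's stream from the index that map already passes in; the seeding scheme here is just an illustration:
import numpy as np
from multiprocessing.pool import Pool

def f(seed):
    # each call gets its own reproducible stream, seeded by the task index
    rng = np.random.RandomState(seed)
    x = rng.uniform()
    return x * x

if __name__ == "__main__":
    with Pool(5) as p:
        a = p.map(f, range(100))  # at least 100 independent runs
    print(a[:5])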

Python multiprocessing progress record

I'm running a Python program that uses the multiprocessing module to take advantage of multiple cores on the CPU.
The program itself works fine, but when it comes to showing a kind of progress percentage it all gets messed up.
To try to simulate what happens to me, I've written this little scenario where I use some random sleep times to replicate tasks that take different amounts of time in the original program.
When you run it, you'll see how the percentages get mixed up.
Is there a proper way to achieve this?
from multiprocessing import Pool, Manager
import random
import time

def PrintPercent(percent):
    time.sleep(random.random())
    print(' (%s %%) Ready' %(percent))

def HeavyProcess(cont):
    total = 20
    cont[0] = cont[0] + 1
    percent = round((float(cont[0])/float(total))*100, 1)
    PrintPercent(percent)

def Main():
    cont = Manager().list(range(1))
    cont[0] = 0
    pool = Pool(processes=2)
    for i in range(20):
        pool.apply_async(HeavyProcess, [cont])
    pool.close()
    pool.join()

Main()
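One common way to keep the percentages in order is to count completions in the parent process instead of sharing a counter between the workers, for example by consuming imap_unordered. A minimal sketch of that idea, with the sleep standing in for the real work:
from multiprocessing import Pool
import random
import time

def heavy_process(i):
    time.sleep(random.random())  # stand-in for the real task
    return i

if __name__ == '__main__':
    total = 20
    with Pool(processes=2) as pool:
        # results arrive as workers finish; the parent does the counting,
        # so the printed percentages never interleave or repeat
        for done, _ in enumerate(pool.imap_unordered(heavy_process, range(total)), 1):
            print(' (%s %%) Ready' % round(float(done) / total * 100, 1))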

Multiprocessing in Python

I am writing a Python script (in Python 2.7) wherein I need to generate around 500,000 uniform random numbers within a range. I need to do this 4 times, perform some calculations on them and write out the 4 files.
At the moment I am doing: (this is just part of my for loop, not the entire code)
random_RA = []
for i in xrange(500000):
    random_RA.append(np.random.uniform(6.061,6.505)) # FINAL RANDOM RA

random_dec = []
for i in xrange(500000):
    random_dec.append(np.random.uniform(min(data_dec_1),max(data_dec_1))) # FINAL RANDOM 'dec'
to generate the random numbers within the range. I am running Ubuntu 14.04, and when I run the program I also open my system monitor to see how the 8 CPUs I have are being used. I notice that while the program is running, only 1 of the 8 CPUs works at 100%, so the entire program takes around 45 minutes to complete.
I noticed that it is possible to use all the CPUs to my advantage with the multiprocessing module.
I would like to know if this is enough in my example:
random_RA = []
for i in xrange(500000):
    multiprocessing.Process()
    random_RA.append(np.random.uniform(6.061,6.505)) # FINAL RANDOM RA
i.e. would adding just the line multiprocessing.Process() be enough?
If you use multiprocessing, you should avoid shared state (like your random_RA list) as much as possible.
Instead, try to use a Pool and its map method:
from multiprocessing import Pool, cpu_count
import numpy as np

def generate_random_ra(x):
    return np.random.uniform(6.061, 6.505)

def generate_random_dec(x):
    return np.random.uniform(min(data_dec_1), max(data_dec_1))

pool = Pool(cpu_count())
random_RA = pool.map(generate_random_ra, xrange(500000))
random_dec = pool.map(generate_random_dec, xrange(500000))
To get you started:
import multiprocessing
import random

def worker(i):
    random.uniform(1,100000)
    print i,'done'

if __name__ == "__main__":
    for i in range(4):
        t = multiprocessing.Process(target = worker, args=(i,))
        t.start()
    print 'All the processes have been started.'
You must gate the t = multiprocessing.Process(...) call with if __name__ == "__main__", because each worker imports this program (module) again to find out what it has to do. If the gating didn't happen, each import would spawn more processes ...
Just for completeness: generating 500000 random numbers is not going to take you 45 minutes, so I assume there are some intensive calculations going on here that you may want to look at closely.
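To put that in perspective, the generation step on its own is nearly instantaneous if you let NumPy do it in one vectorized call instead of a Python-level loop; a minimal sketch (the dec bounds would come from your own data_dec_1 array):
import numpy as np

# one vectorized call replaces the 500,000-iteration Python loop;
# the same pattern works for random_dec with the min/max of data_dec_1
random_RA = np.random.uniform(6.061, 6.505, size=500000)
print(random_RA.shape, random_RA.mean())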

Python multiprocessing speed

I wrote this bit of code to test out Python's multiprocessing on my computer:
from multiprocessing import Pool

var = range(5000000)

def test_func(i):
    return i+1

if __name__ == '__main__':
    p = Pool()
    var = p.map(test_func, var)
I timed this using Unix's time command and the results were:
real 0m2.914s
user 0m4.705s
sys 0m1.406s
Then, using the same var and test_func() I timed:
var = map(test_func, var)
and the results were
real 0m1.785s
user 0m1.548s
sys 0m0.214s
Shouldn't the multiprocessing code be much faster than plain old map?
Why should it?
With map, you are just calling the function sequentially.
A multiprocessing pool creates a set of workers to which your tasks are mapped.
It has to coordinate multiple worker processes to run these functions.
Try doing some significant work inside your function, then time both versions and see whether multiprocessing helps you compute faster.
You have to understand that there are overheads in using multiprocessing. Only when the computing effort is significantly greater than these overheads will you see its benefits.
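If you want to do that timing from inside Python rather than with the shell's time command, something along these lines works (the busy function is just a stand-in for "significant work"):
from multiprocessing import Pool
import math
import time

def busy(i):
    # stand-in for a genuinely expensive per-item computation
    total = 0.0
    for x in range(100000):
        total += math.sqrt(i + x + 1)
    return total

if __name__ == '__main__':
    data = range(200)

    start = time.time()
    serial = list(map(busy, data))
    middle = time.time()

    p = Pool()
    parallel = p.map(busy, data)
    p.close()
    p.join()
    end = time.time()

    print('serial:   %.2f s' % (middle - start))
    print('parallel: %.2f s' % (end - middle))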
See the last example in the excellent introduction by Hellmann: http://www.doughellmann.com/PyMOTW/multiprocessing/communication.html
pool_size = multiprocessing.cpu_count() * 2
pool = multiprocessing.Pool(processes=pool_size,
                            initializer=start_process,
                            maxtasksperchild=2,
                            )
pool_outputs = pool.map(do_calculation, inputs)
You create the pool based on the number of cores you have.
There is overhead in using parallelization; there is only a benefit if each work unit takes long enough to compensate for that overhead.
Also, if you only have one CPU (or CPU thread) on your machine, there's no point in using parallelization at all. You'll only see gains on at least a hyper-threaded machine or one with two or more CPU cores.
In your case, a simple addition operation doesn't compensate for that overhead.
Try something a bit more costly such as:
from multiprocessing import Pool
import math

def test_func(i):
    j = 0
    for x in xrange(1000000):
        j += math.atan2(i, i)
    return j

if __name__ == '__main__':
    var = range(500)
    p = Pool()
    var = p.map(test_func, var)
