Python sharing a lock between processes

I am attempting to use a partial function so that pool.map() can target a function that has more than one parameter (in this case a Lock() object).
Here is example code (taken from an answer to a previous question of mine):
import multiprocessing
from functools import partial

def target(lock, iterable_item):
    for item in items:
        # Do cool stuff
        if (... some condition here ...):
            lock.acquire()
            # Write to stdout or logfile, etc.
            lock.release()

def main():
    iterable = [1, 2, 3, 4, 5]
    pool = multiprocessing.Pool()
    l = multiprocessing.Lock()
    func = partial(target, l)
    pool.map(func, iterable)
    pool.close()
    pool.join()
However when I run this code, I get the error:
RuntimeError: Lock objects should only be shared between processes through inheritance.
What am I missing here? How can I share the lock between my subprocesses?

You can't pass normal multiprocessing.Lock objects to Pool methods, because they can't be pickled. There are two ways to get around this. One is to create a Manager() and pass a Manager.Lock():
def main():
    iterable = [1, 2, 3, 4, 5]
    pool = multiprocessing.Pool()
    m = multiprocessing.Manager()
    l = m.Lock()
    func = partial(target, l)
    pool.map(func, iterable)
    pool.close()
    pool.join()
This is a little bit heavyweight, though; using a Manager requires spawning another process to host the Manager server. And all calls to acquire/release the lock have to be sent to that server via IPC.
The other option is to pass the regular multiprocessing.Lock() at Pool creation time, using the initializer kwarg. This will make your lock instance global in all the child workers:
def target(iterable_item):
    for item in items:
        # Do cool stuff
        if (... some condition here ...):
            lock.acquire()
            # Write to stdout or logfile, etc.
            lock.release()

def init(l):
    global lock
    lock = l

def main():
    iterable = [1, 2, 3, 4, 5]
    l = multiprocessing.Lock()
    pool = multiprocessing.Pool(initializer=init, initargs=(l,))
    pool.map(target, iterable)
    pool.close()
    pool.join()
The second solution has the side-effect of no longer requiring partial.
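If the target does still need other fixed parameters besides the lock, the two approaches combine cleanly. Here is a small sketch of my own (the prefix argument is a hypothetical extra parameter, and it reuses the init() defined above):
def target(prefix, iterable_item):
    with lock:  # lock is the module-global set up by init() in each worker
        print(prefix, iterable_item)

def main():
    iterable = [1, 2, 3, 4, 5]
    l = multiprocessing.Lock()
    pool = multiprocessing.Pool(initializer=init, initargs=(l,))
    func = partial(target, 'item:')  # partial is still fine for ordinary, picklable args
    pool.map(func, iterable)
    pool.close()
    pool.join()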

Here's a version (using Barrier instead of Lock, but you get the idea) which also works on Windows (where the lack of fork causes additional trouble):
import multiprocessing as mp

def procs(uid_barrier):
    uid, barrier = uid_barrier
    print(uid, 'waiting')
    barrier.wait()
    print(uid, 'past barrier')

def main():
    N_PROCS = 10
    with mp.Manager() as man:
        barrier = man.Barrier(N_PROCS)
        with mp.Pool(N_PROCS) as p:
            p.map(procs, ((uid, barrier) for uid in range(N_PROCS)))

if __name__ == '__main__':
    mp.freeze_support()
    main()

Why is this multiprocessing.Pool stuck?

CODE:
from multiprocessing import Pool

print('parent')
max_processes = 4

def foo(result):
    print(result)

def main():
    pool = Pool(processes=max_processes)
    while True:
        pool.apply_async(foo, 5)

if __name__ == '__main__':
    main()
'parent' gets printed 5 times, so the initial pool processes were created, but nothing is ever printed by the print(result) statement.
You are passing the arguments incorrectly in your call to apply_async. The arguments need to be in a tuple (or other sequence, maybe), but you're passing 5 as a bare number.
Try:
def main():
    pool = Pool(processes=max_processes)
    while True:
        pool.apply_async(foo, (5,))  # make a 1-tuple for the args!
Also try managing the pool as a context manager with Pool(processes=max_processes) as pool:
with Pool(processes=max_processes) as pool:
    while True:
        pool.apply_async(foo, (5,))
        ...
Warning: multiprocessing.pool objects have internal resources that need to be properly managed (like any other resource) by using the pool as a context manager or by calling close() and terminate() manually. Failure to do this can lead to the process hanging on finalization.
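Putting the two fixes together, a complete sketch might look like this (my own combination of the answers above, with a finite loop in place of the original while True so the example terminates):
from multiprocessing import Pool

max_processes = 4

def foo(result):
    print(result)

def main():
    with Pool(processes=max_processes) as pool:
        results = [pool.apply_async(foo, (5,)) for _ in range(10)]
        for r in results:
            r.wait()  # make sure the tasks have run before the with block terminates the pool

if __name__ == '__main__':
    main()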

Block script from hanging when islice is used on a multiprocessing Queue

I have an iterable Python class that wraps around a multiprocessing generator. There are use cases where only a subset of what is generated is necessary, so it gets wrapped in islice.
However, the call hangs when islice is used, I guess due to the underlying multiprocessing Process not being aware that things are over.
A minimally functioning example is as follows:
from itertools import islice
import multiprocessing as mp

STOP_MSG = 'STOP!'

def generator(queue, max_val):
    for i in range(max_val):
        queue.put(i)
    queue.put(STOP_MSG)

class GeneratorMPProc:
    def __init__(self, max_val):
        self.max_val = max_val

    def __iter__(self):
        queue = mp.Queue()
        feeder_process = mp.Process(
            target=generator,
            args=(
                queue,
                self.max_val,
            ))
        feeder_process.start()
        msg = queue.get()
        while msg != STOP_MSG:
            yield msg
            msg = queue.get()
        feeder_process.join()

if __name__ == '__main__':
    max_val = 0xFFFFFFFFF
    end_val = 10
    psm = GeneratorMPProc(max_val)
    rsm = [i for i in islice(psm, end_val)]
How do I fix this so that it terminates correctly even when islice or any subset selector is used?
Your islice call is not going to return until GeneratorMPProc.__iter__ returns, and that will take quite a while with max_val set to 0xFFFFFFFFF (writing to a queue is not the fastest operation, and this will also use up a fair bit of resources). In other words, "things are not over" until your generator function ends and thus your multiprocessing.Process actually ends and can be joined.
Set max_val to a value such as 20 and your program will terminate readily enough.
if __name__ == '__main__':
    # max_val = 0xFFFFFFFFF
    max_val = 20
    end_val = 10
    psm = GeneratorMPProc(max_val)
    rsm = [i for i in islice(psm, end_val)]
    print(rsm)
Prints:
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
Update
You might want to consider starting the "generator" process with an extra argument daemon=True to make it a daemon process and then remove feeder_process.join() altogether from method __iter__. The code will then work even with your original max_val.
def __iter__(self):
    queue = mp.Queue()
    feeder_process = mp.Process(
        target=generator,
        args=(
            queue,
            self.max_val,
        ),
        daemon=True
    )
    feeder_process.start()
    msg = queue.get()
    while msg != STOP_MSG:
        yield msg
        msg = queue.get()
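One caveat worth adding (my own observation, not part of the original answer): with daemon=True the feeder keeps putting values into an unbounded queue even after the consumer has stopped reading, so you may also want to bound the queue. The maxsize below is an illustrative choice:
def __iter__(self):
    # Once the buffer is full the daemon feeder blocks on put() instead of
    # growing the queue without bound after the consumer stops reading.
    queue = mp.Queue(maxsize=16)
    feeder_process = mp.Process(
        target=generator,
        args=(queue, self.max_val),
        daemon=True,
    )
    feeder_process.start()
    msg = queue.get()
    while msg != STOP_MSG:
        yield msg
        msg = queue.get()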

How can I append to class variables using multiprocessing in python?

I have this program where everything is built in a class object. There is a function that does 50 computations of another function, each with a different input, so I decided to use multiprocessing to speed it up. However, the list that needs to be returned at the end always comes back empty. Any ideas? Here is a simplified version of my problem. The output of main_function() should be a list containing the numbers 0-9, but the list comes back empty.
import multiprocessing

class MyClass(object):
    def __init__(self):
        self.arr = list()

    def helper_function(self, n):
        self.arr.append(n)

    def main_function(self):
        jobs = []
        for i in range(0, 10):
            p = multiprocessing.Process(target=self.helper_function, args=(i,))
            jobs.append(p)
            p.start()
        for job in jobs:
            job.join()
        print(self.arr)
arr is a list that's not going to be shared across subprocess instances.
For that you have to use a Manager object to create a managed list that is aware of the fact that it's shared between processes.
The key is:
self.arr = multiprocessing.Manager().list()
full working example:
import multiprocessing

class MyClass(object):
    def __init__(self):
        self.arr = multiprocessing.Manager().list()

    def helper_function(self, n):
        self.arr.append(n)

    def main_function(self):
        jobs = []
        for i in range(0, 10):
            p = multiprocessing.Process(target=self.helper_function, args=(i,))
            jobs.append(p)
            p.start()
        for job in jobs:
            job.join()
        print(self.arr)

if __name__ == "__main__":
    a = MyClass()
    a.main_function()
this code now prints: [7, 9, 2, 8, 6, 0, 4, 3, 1, 5]
(well of course the order cannot be relied on between several executions, but all numbers are here which means that all processes contributed to the result)
multiprocessing is touchy.
For simple multiprocessing tasks, I would recommend multiprocessing.dummy, which exposes the same Pool interface but is backed by threads (so the workers share the same self.arr):
from multiprocessing.dummy import Pool as ThreadPool

class MyClass(object):
    def __init__(self):
        self.arr = list()

    def helper_function(self, n):
        self.arr.append(n)

    def main_function(self):
        pool = ThreadPool(4)
        pool.map(self.helper_function, range(10))
        print(self.arr)

if __name__ == '__main__':
    c = MyClass()
    c.main_function()
The idea of using map instead of complicated multithreading calls is from one of my favorite blog posts: https://chriskiehl.com/article/parallelism-in-one-line
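A related pattern worth noting (my own sketch, not from either answer above): with a real process-based Pool, the more idiomatic route is to return values from the workers and let map() collect them, rather than mutating shared state:
import multiprocessing

class MyClass(object):
    def __init__(self):
        self.arr = list()

    def helper_function(self, n):
        return n  # return the result instead of appending to self.arr

    def main_function(self):
        with multiprocessing.Pool(4) as pool:
            # map() gathers the workers' return values in input order
            self.arr = pool.map(self.helper_function, range(10))
        print(self.arr)

if __name__ == '__main__':
    c = MyClass()
    c.main_function()
This prints the numbers 0-9 in order, since map() preserves the order of its input.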

Run a script in Python in parallel [duplicate]

I researched first and couldn't find an answer to my question. I am trying to run multiple functions in parallel in Python.
I have something like this:
files.py
import time
import common  # common is a util class that handles all the IO stuff

dir1 = 'C:\\folder1'
dir2 = 'C:\\folder2'
filename = 'test.txt'
addFiles = [25, 5, 15, 35, 45, 25, 5, 15, 35, 45]

def func1():
    c = common.Common()
    for i in range(len(addFiles)):
        c.createFiles(addFiles[i], filename, dir1)
        c.getFiles(dir1)
        time.sleep(10)
        c.removeFiles(addFiles[i], dir1)
        c.getFiles(dir1)

def func2():
    c = common.Common()
    for i in range(len(addFiles)):
        c.createFiles(addFiles[i], filename, dir2)
        c.getFiles(dir2)
        time.sleep(10)
        c.removeFiles(addFiles[i], dir2)
        c.getFiles(dir2)
I want to call func1 and func2 and have them run at the same time. The functions do not interact with each other or operate on the same object. Right now I have to wait for func1 to finish before func2 starts. How do I do something like the following:
process.py
from files import func1, func2
runBothFunc(func1(), func2())
I want to be able to create both directories at close to the same time because every minute I am counting how many files are being created. If a directory isn't there yet, it will throw off my timing.
You could use threading or multiprocessing.
Due to peculiarities of CPython (namely the Global Interpreter Lock), threading is unlikely to achieve true parallelism. For this reason, multiprocessing is generally a better bet.
Here is a complete example:
from multiprocessing import Process

def func1():
    print('func1: starting')
    for i in range(10000000):
        pass
    print('func1: finishing')

def func2():
    print('func2: starting')
    for i in range(10000000):
        pass
    print('func2: finishing')

if __name__ == '__main__':
    p1 = Process(target=func1)
    p1.start()
    p2 = Process(target=func2)
    p2.start()
    p1.join()
    p2.join()
The mechanics of starting/joining child processes can easily be encapsulated into a function along the lines of your runBothFunc:
def runInParallel(*fns):
    proc = []
    for fn in fns:
        p = Process(target=fn)
        p.start()
        proc.append(p)
    for p in proc:
        p.join()

runInParallel(func1, func2)
If your functions are mainly doing I/O work (and less CPU work) and you have Python 3.2+, you can use a ThreadPoolExecutor:
from concurrent.futures import ThreadPoolExecutor

def run_io_tasks_in_parallel(tasks):
    with ThreadPoolExecutor() as executor:
        running_tasks = [executor.submit(task) for task in tasks]
        for running_task in running_tasks:
            running_task.result()

run_io_tasks_in_parallel([
    lambda: print('IO task 1 running!'),
    lambda: print('IO task 2 running!'),
])
If your functions are mainly doing CPU work (and less I/O work) and you have Python 2.6+, you can use the multiprocessing module:
from multiprocessing import Process

def run_cpu_tasks_in_parallel(tasks):
    running_tasks = [Process(target=task) for task in tasks]
    for running_task in running_tasks:
        running_task.start()
    for running_task in running_tasks:
        running_task.join()

run_cpu_tasks_in_parallel([
    lambda: print('CPU task 1 running!'),
    lambda: print('CPU task 2 running!'),
])
This can be done elegantly with Ray, a system that allows you to easily parallelize and distribute your Python code.
To parallelize your example, you'd need to define your functions with the @ray.remote decorator, and then invoke them with .remote.
import ray

ray.init()

dir1 = 'C:\\folder1'
dir2 = 'C:\\folder2'
filename = 'test.txt'
addFiles = [25, 5, 15, 35, 45, 25, 5, 15, 35, 45]

# Define the functions.
# You need to pass every global variable used by the function as an argument.
# This is needed because each remote function runs in a different process,
# and thus it does not have access to the global variables defined in
# the current process.

@ray.remote
def func1(filename, addFiles, dir):
    pass  # func1() code here...

@ray.remote
def func2(filename, addFiles, dir):
    pass  # func2() code here...

# Start two tasks in the background and wait for them to finish.
ray.get([func1.remote(filename, addFiles, dir1), func2.remote(filename, addFiles, dir2)])
If you pass the same argument to both functions and the argument is large, a more efficient way to do this is using ray.put(). This avoids serializing the large argument twice and creating two in-memory copies of it:
largeData_id = ray.put(largeData)
ray.get([func1.remote(largeData_id), func2.remote(largeData_id)])
Important - If func1() and func2() return results, you need to rewrite the code as follows:
ret_id1 = func1.remote(filename, addFiles, dir1)
ret_id2 = func2.remote(filename, addFiles, dir2)
ret1, ret2 = ray.get([ret_id1, ret_id2])
There are a number of advantages of using Ray over the multiprocessing module. In particular, the same code will run on a single machine as well as on a cluster of machines. For more advantages of Ray see this related post.
Seems like you have a single function that you need to call with two different parameters. This can be done elegantly using a combination of concurrent.futures and map with Python 3.2+.
import time
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def sleep_secs(seconds):
    time.sleep(seconds)
    print(f'{seconds} has been processed')

secs_list = [2, 4, 6, 8, 10, 12]
Now, if your operation is IO bound, then you can use the ThreadPoolExecutor as such:
with ThreadPoolExecutor() as executor:
    results = executor.map(sleep_secs, secs_list)
Note how map is used here to map your function to the list of arguments.
Now, if your function is CPU bound, then you can use ProcessPoolExecutor:
with ProcessPoolExecutor() as executor:
    results = executor.map(sleep_secs, secs_list)
If you are not sure, you can simply try both and see which one gives you better results.
Finally, if you are looking to print out your results, you can simply do this:
with ThreadPoolExecutor() as executor:
    results = executor.map(sleep_secs, secs_list)
    for result in results:
        print(result)
In 2021 the easiest way is to use asyncio:
import asyncio, time

async def say_after(delay, what):
    await asyncio.sleep(delay)
    print(what)

async def main():
    task1 = asyncio.create_task(
        say_after(4, 'hello'))

    task2 = asyncio.create_task(
        say_after(3, 'world'))

    print(f"started at {time.strftime('%X')}")

    # Wait until both tasks are completed (should take
    # around 4 seconds.)
    await task1
    await task2

    print(f"finished at {time.strftime('%X')}")

asyncio.run(main())
References:
[1] https://docs.python.org/3/library/asyncio-task.html
If you are a Windows user and using Python 3, this post will help you do parallel programming in Python. When you run the usual multiprocessing library's Pool code, you will get an error regarding the main function in your program. This is because Windows has no fork() functionality. The post below gives a solution to the mentioned problem.
http://python.6.x6.nabble.com/Multiprocessing-Pool-woes-td5047050.html
Since I was using Python 3, I changed the program a little, like this:
from types import FunctionType
import marshal

def _applicable(*args, **kwargs):
    name = kwargs['__pw_name']
    code = marshal.loads(kwargs['__pw_code'])
    gbls = globals()  # gbls = marshal.loads(kwargs['__pw_gbls'])
    defs = marshal.loads(kwargs['__pw_defs'])
    clsr = marshal.loads(kwargs['__pw_clsr'])
    fdct = marshal.loads(kwargs['__pw_fdct'])
    func = FunctionType(code, gbls, name, defs, clsr)
    func.fdct = fdct
    del kwargs['__pw_name']
    del kwargs['__pw_code']
    del kwargs['__pw_defs']
    del kwargs['__pw_clsr']
    del kwargs['__pw_fdct']
    return func(*args, **kwargs)

def make_applicable(f, *args, **kwargs):
    if not isinstance(f, FunctionType):
        raise ValueError('argument must be a function')
    kwargs['__pw_name'] = f.__name__  # edited
    kwargs['__pw_code'] = marshal.dumps(f.__code__)  # edited
    kwargs['__pw_defs'] = marshal.dumps(f.__defaults__)  # edited
    kwargs['__pw_clsr'] = marshal.dumps(f.__closure__)  # edited
    kwargs['__pw_fdct'] = marshal.dumps(f.__dict__)  # edited
    return _applicable, args, kwargs

def _mappable(x):
    x, name, code, defs, clsr, fdct = x
    code = marshal.loads(code)
    gbls = globals()  # gbls = marshal.loads(gbls)
    defs = marshal.loads(defs)
    clsr = marshal.loads(clsr)
    fdct = marshal.loads(fdct)
    func = FunctionType(code, gbls, name, defs, clsr)
    func.fdct = fdct
    return func(x)

def make_mappable(f, iterable):
    if not isinstance(f, FunctionType):
        raise ValueError('argument must be a function')
    name = f.__name__  # edited
    code = marshal.dumps(f.__code__)  # edited
    defs = marshal.dumps(f.__defaults__)  # edited
    clsr = marshal.dumps(f.__closure__)  # edited
    fdct = marshal.dumps(f.__dict__)  # edited
    return _mappable, ((i, name, code, defs, clsr, fdct) for i in iterable)
After saving these helpers (as poolable.py, which is what the import below assumes), the problem code above also changes a little, like this:
from multiprocessing import Pool
from poolable import make_applicable, make_mappable

def cube(x):
    return x**3

if __name__ == "__main__":
    pool = Pool(processes=2)
    results = [pool.apply_async(*make_applicable(cube, x)) for x in range(1, 7)]
    print([result.get(timeout=10) for result in results])
And I got the output as:
[1, 8, 27, 64, 125, 216]
I think this post may be useful for some Windows users.
There's no way to guarantee that two functions will execute in sync with each other, which seems to be what you want to do.
The best you can do is to split the function up into several steps, then wait for both to finish at critical synchronization points using Process.join, as @aix's answer mentions.
This is better than time.sleep(10) because you can't guarantee exact timings. With explicit waiting, you're saying that the functions must be done executing that step before moving on to the next, instead of assuming it will be done within 10 seconds, which isn't guaranteed based on whatever else is going on on the machine.
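As a rough illustration of that idea (my own sketch, not code from either answer; the step functions are placeholders), each step runs in its own batch of processes and join() acts as the synchronization point between steps:
from multiprocessing import Process

def step1(name):
    print(f'{name}: step 1 done')

def step2(name):
    print(f'{name}: step 2 done')

def run_step(fn):
    procs = [Process(target=fn, args=(label,)) for label in ('dir1', 'dir2')]
    for p in procs:
        p.start()
    for p in procs:
        p.join()  # synchronization point: both sides finish this step before anyone moves on

if __name__ == '__main__':
    run_step(step1)  # both sides finish step 1...
    run_step(step2)  # ...before either starts step 2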
(Regarding: How can I simultaneously run two (or more) functions in Python?)
With asyncio, sync and async tasks can be run concurrently:
import asyncio
import time

def function1():
    # performing blocking tasks
    while True:
        print("function 1: blocking task ...")
        time.sleep(1)

async def function2():
    # perform non-blocking tasks
    while True:
        print("function 2: non-blocking task ...")
        await asyncio.sleep(1)

async def main():
    loop = asyncio.get_running_loop()
    await asyncio.gather(
        # https://docs.python.org/3/library/asyncio-eventloop.html#asyncio.loop.run_in_executor
        loop.run_in_executor(None, function1),
        function2(),
    )

if __name__ == '__main__':
    asyncio.run(main())

Python3 Pool async processes | workers

I am trying to use 4 processes for 4 async methods.
Here is my code for 1 async method (x):
from multiprocessing import Pool
import time

def x(i):
    while(i < 100):
        print(i)
        i += 1
        time.sleep(1)

def finish(str):
    print("done!")

if __name__ == "__main__":
    pool = Pool(processes=5)
    result = pool.apply_async(x, [0], callback=finish)
    print("start")
According to https://docs.python.org/2/library/multiprocessing.html#multiprocessing.JoinableQueue, the processes parameter of Pool is the number of workers.
How can I use each of these workers?
EDIT: my ASYNC class
from multiprocessing import Pool
import time

class ASYNC(object):
    def __init__(self, THREADS=[]):
        print('do')
        pool = Pool(processes=len(THREADS))
        self.THREAD_POOL = {}
        thread_index = 0
        for thread_ in THREADS:
            self.THREAD_POOL[thread_index] = {
                'thread': thread_['thread'],
                'args': thread_['args'],
                'callback': thread_['callback']
            }
            pool.apply_async(self.run, [thread_index], callback=thread_['callback'])
            self.THREAD_POOL[thread_index]['running'] = True
            thread_index += 1

    def run(self, thread_index):
        print('enter')
        while(self.THREAD_POOL[thread_index]['running']):
            print("loop")
            self.THREAD_POOL[thread_index]['thread'](self.THREAD_POOL[thread_index])
            time.sleep(1)
        self.THREAD_POOL[thread_index]['running'] = False

    def wait_for_finish(self):
        for pool in self.THREAD_POOL:
            while(self.THREAD_POOL[pool]['running']):
                time.sleep(1)

def x(pool):
    print(str(pool))
    pool['args'][0] += 1

def y(str):
    print("done")

A = ASYNC([{'thread': x, 'args': [10], 'callback': y}])
print("start")
A.wait_for_finish()
multiprocessing.Pool is designed to be a convenient way of distributing work to a pool of workers, without worrying about which worker does which work. The reason that it has a size is to allow you to be lazy about how quickly you dispatch work to the queue and to limit the expensive (relatively) overhead of creating child processes.
So the answer to your question is that, in principle, you shouldn't be able to access individual workers in a Pool. If you want to be able to address workers individually, you will need to implement your own work distribution system using multiprocessing.Process directly, something like:
from multiprocessing import Process

def x(i):
    while i < 100:
        print(i)
        i += 1

if __name__ == '__main__':
    pools = [Process(target=x, args=(1,)) for _ in range(5)]
    # map() is lazy in Python 3, so start and join each process explicitly
    for pool in pools:
        pool.start()
    for pool in pools:
        pool.join()
    print('Done!')
And now you can access each worker directly. If you want to be able to send work dynamically to each worker while it's running (not just give it one thing to do like I did in my example) then you'll have to implement that yourself, potentially using multiprocessing.Queue. Have a look at the code for multiprocessing to see how that distributes work to its workers to get an idea of how to do this.
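To give a sense of what that could look like, here is a minimal sketch (my own illustration, not taken from the multiprocessing source) that gives each worker its own Queue so work can be sent to a specific worker; the STOP sentinel and the task strings are placeholder assumptions:
from multiprocessing import Process, Queue

STOP = None  # sentinel telling a worker to shut down

def worker(worker_id, task_queue):
    while True:
        task = task_queue.get()  # block until this particular worker gets a task
        if task is STOP:
            break
        print(f'worker {worker_id} handling {task}')

if __name__ == '__main__':
    queues = [Queue() for _ in range(3)]
    workers = [Process(target=worker, args=(i, q)) for i, q in enumerate(queues)]
    for w in workers:
        w.start()

    queues[0].put('task A')  # send work to worker 0 specifically
    queues[2].put('task B')  # and something else to worker 2

    for q in queues:
        q.put(STOP)  # tell every worker to finish
    for w in workers:
        w.join()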
Why do you want to do this anyway? If it's just concern about whether the workers get scheduled efficiently, then my advice would just be to trust multiprocessing to get that right for you, unless you have really good evidence that in your case it does not for some reason.
