I am using the multiprocessing.Pool class within an object and attempting the following:
from multiprocessing import Lock, Pool

class A:
    def __init__(self):
        self.lock = Lock()
        self.file = open('test.txt')

    def function(self, i):
        self.lock.acquire()
        line = self.file.readline()
        self.lock.release()
        return line

    def anotherfunction(self):
        pool = Pool()
        results = pool.map(self.function, range(10000))
        pool.close()
        pool.join()
        return results
However, I am getting a runtime error stating that lock objects should only be shared between processes through inheritance. I am fairly new to Python and multiprocessing. How can I get on the right track?
multiprocessing.Lock instances can be attributes of multiprocessing.Process instances. When a Process is created in the main process with a lock attribute, the lock exists in the main process's address space. When the process's start method is invoked and a child process runs its run method, the lock is serialized/deserialized into the child process's address space. This works as expected:
from multiprocessing import Lock, Process

class P(Process):
    def __init__(self, *args, **kwargs):
        Process.__init__(self, *args, **kwargs)
        self.lock = Lock()

    def run(self):
        print(self.lock)

if __name__ == '__main__':
    p = P()
    p.start()
    p.join()
Prints:
<Lock(owner=None)>
Unfortunately, this does not work when you are dealing with multiprocessing.Pool instances. In your example, self.lock is created in the main process by the __init__ method. But when Pool.map is called to invoke self.function, the lock cannot be serialized/deserialized to the already-running pool process that will be running this method.
The solution is to initialize each pool process with a global variable set to this lock (there is no point in having this lock be an attribute of the class now). The way to do this is to use the initializer and initargs parameters of the Pool constructor. See the documentation:
from multiprocessing import Lock, Pool

def init_pool_processes(the_lock):
    '''Initialize each process with a global variable lock.
    '''
    global lock
    lock = the_lock

class Test:
    def function(self, i):
        lock.acquire()
        with open('test.txt', 'a') as f:
            print(i, file=f)
        lock.release()

    def anotherfunction(self):
        lock = Lock()
        pool = Pool(initializer=init_pool_processes, initargs=(lock,))
        pool.map(self.function, range(10))
        pool.close()
        pool.join()

if __name__ == '__main__':
    t = Test()
    t.anotherfunction()
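The same initializer/initargs pattern also works with concurrent.futures.ProcessPoolExecutor (Python 3.7+), if you prefer that API. The following is a minimal sketch under that assumption, not part of the original answer; it uses a plain module-level function instead of a method:

from concurrent.futures import ProcessPoolExecutor
from multiprocessing import Lock

def init_pool_processes(the_lock):
    # Same idea as above: stash the lock in a module-level global
    # inside each worker process.
    global lock
    lock = the_lock

def write_line(i):
    # Assumes the same 'test.txt' target file as the example above.
    with lock:
        with open('test.txt', 'a') as f:
            print(i, file=f)

if __name__ == '__main__':
    the_lock = Lock()
    with ProcessPoolExecutor(initializer=init_pool_processes,
                             initargs=(the_lock,)) as executor:
        # map() submits the calls immediately; leaving the with-block
        # waits for them to finish.
        list(executor.map(write_line, range(10)))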
I have two Python scripts and I want them to communicate with each other. Specifically, I want script Communication.py to send an array to script Process.py whenever the latter requires it. I've used multiprocessing.Process and multiprocessing.Pipe to make it work. My code works, but I want to handle SIGINT and SIGTERM gracefully. I've tried the following, but it does not exit gracefully:
Process.py
from multiprocessing import Process, Pipe
from Communication import arraySender
import time
import signal

class GracefulKiller:
    kill_now = False

    def __init__(self):
        signal.signal(signal.SIGINT, self.exit_gracefully)
        signal.signal(signal.SIGTERM, self.exit_gracefully)

    def exit_gracefully(self, *args):
        self.kill_now = True

def main():
    parent_conn, child_conn = Pipe()
    p = Process(target=arraySender, args=(child_conn, True))
    p.start()
    print(parent_conn.recv())

if __name__ == '__main__':
    killer = GracefulKiller()
    while not killer.kill_now:
        main()
Communication.py
import numpy
from multiprocessing import Process, Pipe

def arraySender(child_conn, sendData):
    if sendData:
        child_conn.send(numpy.random.randint(0, high=10, size=15, dtype=int))
        child_conn.close()
What am I doing wrong?
I strongly suspect you are running this under Windows because I think the code you have should work under Linux. This is why it is important to always tag your questions concerning Python and multiprocessing with the actual platform you are on.
The problem appears to be due to the fact that in addition to your main process, the child process you create in function main is also receiving the signals. The solution would normally be to add calls like signal.signal(signal.SIGINT, signal.SIG_IGN) to your arraySender worker function. But there are two problems with this:
There is a race condition: the signal could be received by the child process before it has a chance to ignore signals.
Regardless, the calls to ignore signals do not seem to work when you are using multiprocessing.Process (perhaps that class does its own signal handling that overrides them).
The solution is to create a multiprocessing pool and initialize each pool process so that it ignores these signals before you submit any tasks. Another advantage of using a pool (even though a pool size of 1 is enough here, since you never have more than one task running at a time) is that the process is created only once and can then be reused.
As an aside, there is an inconsistency in your GracefulKiller class: it mixes the class attribute kill_now with an instance attribute kill_now that is created when self.kill_now = True executes. So when the main process tests killer.kill_now, it reads the class attribute until self.kill_now is set to True, after which it reads the instance attribute.
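To illustrate the shadowing with a standalone sketch (the class name here is just for demonstration):

class Killer:
    kill_now = False          # class attribute

    def trigger(self):
        self.kill_now = True  # creates an instance attribute that shadows it

k = Killer()
print(k.kill_now)             # False -> read from the class attribute
k.trigger()
print(k.kill_now)             # True  -> now reads the instance attribute
print(Killer.kill_now)        # False -> the class attribute is unchanged

The reworked code below therefore makes kill_now an instance attribute from the start: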
from multiprocessing import Pool, Pipe
import time
import signal
import numpy

class GracefulKiller:
    def __init__(self):
        self.kill_now = False  # Instance attribute
        signal.signal(signal.SIGINT, self.exit_gracefully)
        signal.signal(signal.SIGTERM, self.exit_gracefully)

    def exit_gracefully(self, *args):
        self.kill_now = True

def init_pool_processes():
    signal.signal(signal.SIGINT, signal.SIG_IGN)
    signal.signal(signal.SIGTERM, signal.SIG_IGN)

def arraySender(sendData):
    if sendData:
        return numpy.random.randint(0, high=10, size=15, dtype=int)

def main(pool):
    result = pool.apply(arraySender, args=(True,))
    print(result)

if __name__ == '__main__':
    # Create pool with only 1 process:
    pool = Pool(1, initializer=init_pool_processes)
    killer = GracefulKiller()
    while not killer.kill_now:
        main(pool)
    pool.close()
    pool.join()
Ideally, GracefulKiller should be a singleton class so that, regardless of how many times it is instantiated by a process, signal.signal is called only once for each type of signal you want to handle:
class Singleton(type):
    def __init__(self, *args, **kwargs):
        self.__instance = None
        super().__init__(*args, **kwargs)

    def __call__(self, *args, **kwargs):
        if self.__instance is None:
            self.__instance = super().__call__(*args, **kwargs)
        return self.__instance

class GracefulKiller(metaclass=Singleton):
    def __init__(self):
        self.kill_now = False  # Instance attribute
        signal.signal(signal.SIGINT, self.exit_gracefully)
        signal.signal(signal.SIGTERM, self.exit_gracefully)

    def exit_gracefully(self, *args):
        self.kill_now = True
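With the metaclass in place, repeated instantiation returns the same object, so signal.signal is only called once per signal type. A quick check you can run in the same script as the classes above:

a = GracefulKiller()
b = GracefulKiller()
print(a is b)  # True: the second call returns the cached instance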
I want to implement a file crawler that stores data in MongoDB. I would like to use multiprocessing as a way to 'hand off' blocking tasks such as unzipping files, crawling the file system and uploading to MongoDB. Certain tasks depend on other tasks (i.e., a file needs to be unzipped before files can be crawled), so I would like the ability to complete the necessary task and add new ones to the same task queue.
Below is what I currently have:
import multiprocessing

class Worker(multiprocessing.Process):
    def __init__(self, task_queue: multiprocessing.Queue):
        super(Worker, self).__init__()
        self.task_queue = task_queue

    def run(self):
        for (function, *args) in iter(self.task_queue.get, None):
            print(f'Running: {function.__name__}({*args,})')
            # Run the provided function with its parameters in child process
            function(*args)
            self.task_queue.task_done()

def foo(task_queue: multiprocessing.Queue) -> None:
    print('foo')
    # Add new task to queue from this child process
    task_queue.put((bar, 1))

def bar(x: int) -> None:
    print(f'bar: {x}')

def main():
    # Start workers on separate processes
    workers = []
    manager = multiprocessing.Manager()
    task_queue = manager.Queue()

    for i in range(multiprocessing.cpu_count()):
        worker = Worker(task_queue)
        workers.append(worker)
        worker.start()

    # Run foo on child process using the queue as parameter
    task_queue.put((foo, task_queue))

    for _ in workers:
        task_queue.put(None)

    # Block until workers complete and join main process
    for worker in workers:
        worker.join()

    print('Program completed.')

if __name__ == '__main__':
    main()
Expected Behaviour:
Running: foo((<AutoProxy[Queue] object, typeid 'Queue' at 0x1b963548908>,))
foo
Running: bar((1,))
bar: 1
Program completed.
Actual Behaviour:
Running: foo((<AutoProxy[Queue] object, typeid 'Queue' at 0x1b963548908>,))
foo
Program completed.
I am quite new to multiprocessing so any help would be greatly appreciated.
As @FrankYellin noted, this is due to the fact that None is being put into task_queue before bar can be added.
Assuming that the queue will either be non-empty or waiting for a task to complete
during the program (which is true in my case), the join method on the queue can be used. According to the docs:
Blocks until all items in the queue have been gotten and processed.
The count of unfinished tasks goes up whenever an item is added to the
queue. The count goes down whenever a consumer thread calls
task_done() to indicate that the item was retrieved and all work on it
is complete. When the count of unfinished tasks drops to zero, join()
unblocks.
Below is the updated code:
import multiprocessing

class Worker(multiprocessing.Process):
    def __init__(self, task_queue: multiprocessing.Queue):
        super(Worker, self).__init__()
        self.task_queue = task_queue

    def run(self):
        for (function, *args) in iter(self.task_queue.get, None):
            print(f'Running: {function.__name__}({*args,})')
            # Run the provided function with its parameters in child process
            function(*args)
            self.task_queue.task_done()  # <-- Notify queue that task is complete

def foo(task_queue: multiprocessing.Queue) -> None:
    print('foo')
    # Add new task to queue from this child process
    task_queue.put((bar, 1))

def bar(x: int) -> None:
    print(f'bar: {x}')

def main():
    # Start workers on separate processes
    workers = []
    manager = multiprocessing.Manager()
    task_queue = manager.Queue()

    for i in range(multiprocessing.cpu_count()):
        worker = Worker(task_queue)
        workers.append(worker)
        worker.start()

    # Run foo on child process using the queue as parameter
    task_queue.put((foo, task_queue))

    # Block until all items in queue are popped and completed
    task_queue.join()  # <---

    for _ in workers:
        task_queue.put(None)

    # Block until workers complete and join main process
    for worker in workers:
        worker.join()

    print('Program completed.')

if __name__ == '__main__':
    main()
This seems to work fine. I will update this if I discover anything new. Thank you all.
How can I pass a Lock object to a subclass of multiprocessing.Process? I've tried the following and I get a pickling error.
from multiprocessing import Process
from threading import Lock

class myProcess(Process):
    def setLock(self, lock):
        self.lock = lock

    def run(self):
        with self.lock:
            # do stuff
            pass

if __name__ == '__main__':
    lock = Lock()
    proc1 = myProcess()
    proc1.setLock(lock)
    proc2 = myProcess()
    proc2.setLock(lock)
    proc1.start()
    proc2.start()
There are many answered questions about passing a lock to multiprocessing.Pool, but none of them solved my problem with this OOP use of Process. If I want to make a global lock, where should I define it and how can I pass it to my myProcess objects?
You can't use a threading.Lock for multiprocessing, you need to use a multiprocessing.Lock.
You get the pickling error because a threading.Lock can't be pickled and you are on an OS which uses "spawn" as the default method for starting new processes (Windows, or macOS with Python 3.8+).
Note that on a forking OS (Linux, BSD...), using a threading.Lock wouldn't raise a pickling error, but the lock would be silently replicated into each child, not providing the synchronization between processes you intended.
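You can see the pickling half of this directly; a threading.Lock cannot be pickled at all, whereas a multiprocessing.Lock is meant to be handed to the child through the Process arguments so that multiprocessing's own machinery transfers it. A quick illustration (standard library only):

import pickle
import threading

try:
    pickle.dumps(threading.Lock())
except TypeError as exc:
    print(exc)  # e.g. "cannot pickle '_thread.lock' object"

# A multiprocessing.Lock should not be pickled by hand either; pass it to
# Process(...) (or your Process subclass) and let multiprocessing transfer it,
# which is exactly what the code below does.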
Using a separate function for setting the lock is possible, but I would prefer passing it as an argument to Process.__init__() along with any other arguments.
import time
from multiprocessing import Process, Lock, current_process

class MyProcess(Process):
    def __init__(self, lock, name=None, args=(), kwargs={}, daemon=None):
        super().__init__(
            group=None, name=name, args=args, kwargs=kwargs, daemon=daemon
        )
        # `args` and `kwargs` are stored as `self._args` and `self._kwargs`
        self.lock = lock

    def run(self):
        with self.lock:
            for i in range(3):
                print(current_process().name, *self._args)
                time.sleep(1)

if __name__ == '__main__':
    lock = Lock()
    p1 = MyProcess(lock=lock, args=("hello",))
    p2 = MyProcess(lock=lock, args=("world",))
    p1.start()
    p2.start()
    p1.join()  # don't forget joining to prevent parent from exiting too soon.
    p2.join()
Output:
MyProcess-1 hello
MyProcess-1 hello
MyProcess-1 hello
MyProcess-2 world
MyProcess-2 world
MyProcess-2 world
I want to call multiprocessing.Pool.map inside a Process.
When the pool is created inside the run() method, it works. When it is created at instantiation (in __init__), it does not.
I cannot figure out the reason for this behavior. What happens in the process?
I am on Python 3.6.
from multiprocessing import Pool, Process, Queue

def DummyPrinter(key):
    print(key)

class Consumer(Process):
    def __init__(self, task_queue):
        Process.__init__(self)
        self.task_queue = task_queue
        self.p = Pool(1)

    def run(self):
        p = Pool(8)
        while True:
            next_task = self.task_queue.get()
            if next_task is None:
                break
            p.map(DummyPrinter, next_task)  # Works
            #self.p.map(DummyPrinter, next_task)  # Does not Work
        return

if __name__ == '__main__':
    task_queue = Queue()
    Consumer(task_queue).start()
    task_queue.put(range(5))
    task_queue.put(None)
multiprocessing.Pool cannot be shared by multiple processes because it relies on pipes and threads for its functioning.
The __init__ method gets executed in the parent process whereas the run logic belongs to the child process.
I usually recommend against subclassing the Process object, as it's quite counter-intuitive.
Logic like the following better shows the actual division of responsibilities:
from multiprocessing import Pool, Process, Queue

def DummyPrinter(key):
    print(key)

def function(task_queue):
    """This runs in the child process."""
    p = Pool(8)
    while True:
        next_task = task_queue.get()  # plain function, so no `self` here
        if next_task is None:
            break
        p.map(DummyPrinter, next_task)  # Works

def main():
    """This runs in the parent process."""
    task_queue = Queue()
    process = Process(target=function, args=[task_queue])
    process.start()
    task_queue.put(range(5))
    task_queue.put(None)
    process.join()

if __name__ == '__main__':
    main()
I have threaded code where each thread needs to write to the same file. To prevent concurrency issues, I am using a Lock object.
My question is whether I am using the Lock correctly. If I create the lock from within each thread, is that lock shared globally or specific only to that thread?
Basically, should I create a Lock first and pass its reference to each thread, or is it ok to set it from within the thread like I do here:
import time
from threading import Thread, Lock

def main():
    for i in range(20):
        agent = Agent(i)
        agent.start()

class Agent(Thread):
    def __init__(self, thread_num):
        Thread.__init__(self)
        self.thread_num = thread_num

    def run(self):
        while True:
            print 'hello from thread %s' % self.thread_num
            self.write_result()

    def write_result(self):
        lock = Lock()
        lock.acquire()
        try:
            f = open('foo.txt', 'a')
            f.write('hello from thread %s\n' % self.thread_num)
            f.flush()
            f.close()
        finally:
            lock.release()

if __name__ == '__main__':
    main()
For your use case one approach could be to write a file subclass that locks:
class LockedWrite(file):
    """ Wrapper class to a file object that locks writes """
    def __init__(self, *args, **kwds):
        super(LockedWrite, self).__init__(*args, **kwds)
        self._lock = Lock()

    def write(self, *args, **kwds):
        self._lock.acquire()
        try:
            super(LockedWrite, self).write(*args, **kwds)
        finally:
            self._lock.release()
To use it in your code, just replace the following functions:
def main():
    f = LockedWrite('foo.txt', 'a')
    for i in range(20):
        agent = Agent(i, f)
        agent.start()

class Agent(Thread):
    def __init__(self, thread_num, fileobj):
        Thread.__init__(self)
        self.thread_num = thread_num
        self._file = fileobj

    # ...

    def write_result(self):
        self._file.write('hello from thread %s\n' % self.thread_num)
This approach puts the file locking in the file object itself, which seems cleaner IMHO.
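Note that the snippet above is Python 2: it subclasses the old file built-in, which no longer exists in Python 3. On Python 3 you could wrap an open file object instead; a rough equivalent sketch (the class name LockedWriter is mine, not from the original answer):

from threading import Lock

class LockedWriter:
    """Wraps an open file and serializes write() calls with a lock."""
    def __init__(self, path, mode='a'):
        self._file = open(path, mode)
        self._lock = Lock()

    def write(self, data):
        with self._lock:
            self._file.write(data)
            self._file.flush()

    def close(self):
        self._file.close()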
Create the lock outside the method.
class Agent(Thread):
    mylock = Lock()

    def write_result(self):
        self.mylock.acquire()
        try:
            ...
        finally:
            self.mylock.release()
or if using python >= 2.5:
class Agent(Thread):
    mylock = Lock()

    def write_result(self):
        with self.mylock:
            ...
To use that with Python 2.5, you must enable the with statement via a __future__ import:
from __future__ import with_statement
Lock() returns a new lock object on every call. So every thread (actually, every call to write_result) gets a different lock object, and there is no locking at all.
The lock that's used needs to be common to all threads, or at least ensure that two locks can't lock the same resource at the same time.
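You can verify this quickly: every call to Lock() produces an independent lock object, so holding one of them does not block a thread that acquires another:

from threading import Lock

a = Lock()
b = Lock()
print(a is b)            # False: two unrelated lock objects
a.acquire()
print(b.acquire(False))  # True: b is still free, so there is no mutual exclusion
a.release()
b.release()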
You can simplify things a bit (at the cost of slightly more overhead) by designating a single thread (probably created exclusively for this purpose) as the sole thread that writes to the file, and have all other threads delegate to the file-writer by placing the string that they want to add to the file into a queue.Queue object.
Queues have all of the locking built-in, so any thread can safely call Queue.put() at any time. The file-writer would be the only thread calling Queue.get(), and can presumably spend much of its time blocking on that call (with a reasonable timeout to allow the thread to cleanly respond to a shutdown request). All of the synchronization issues will be handled by the Queue, and you'll be spared having to worry about whether you've forgotten some lock acquire/release somewhere... :)
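A minimal sketch of that pattern (the names and the sentinel value are mine, not from the question):

import threading
import queue

log_queue = queue.Queue()
_SENTINEL = None  # put this on the queue to tell the writer thread to stop

def file_writer(path):
    """The only thread that ever touches the file."""
    with open(path, 'a') as f:
        while True:
            try:
                line = log_queue.get(timeout=1.0)  # timeout lets us react to shutdown
            except queue.Empty:
                continue
            if line is _SENTINEL:
                break
            f.write(line)
            f.flush()

writer = threading.Thread(target=file_writer, args=('foo.txt',))
writer.start()

# Any worker thread can now safely do:
log_queue.put('hello from some worker thread\n')

log_queue.put(_SENTINEL)  # shut the writer down
writer.join()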
The lock instance should be associated with the file instance.
In other words, you should create both the lock and file at the same time and pass both to each thread.
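In code, that amounts to something like this (a sketch; the structure mirrors the question, the details are illustrative):

from threading import Thread, Lock

class Agent(Thread):
    def __init__(self, thread_num, lock, fileobj):
        Thread.__init__(self)
        self.thread_num = thread_num
        self.lock = lock      # same Lock instance shared by every thread
        self.file = fileobj   # same file object shared by every thread

    def run(self):
        with self.lock:
            self.file.write('hello from thread %s\n' % self.thread_num)
            self.file.flush()

if __name__ == '__main__':
    lock = Lock()
    with open('foo.txt', 'a') as f:
        agents = [Agent(i, lock, f) for i in range(20)]
        for a in agents:
            a.start()
        for a in agents:
            a.join()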
I'm pretty sure that the lock needs to be the same object for each thread. Try this:
import time
from threading import Thread, Lock

def main():
    lock = Lock()
    for i in range(20):
        agent = Agent(i, lock)
        agent.start()

class Agent(Thread):
    def __init__(self, thread_num, lock):
        Thread.__init__(self)
        self.thread_num = thread_num
        self.lock = lock

    def run(self):
        while True:
            print 'hello from thread %s' % self.thread_num
            self.write_result()

    def write_result(self):
        self.lock.acquire()
        try:
            f = open('foo.txt', 'a')
            f.write('hello from thread %s\n' % self.thread_num)
            f.flush()
            f.close()
        finally:
            self.lock.release()

if __name__ == '__main__':
    main()
main()