multiprocessing -> pathos.multiprocessing and windows - python

I'm currently using the standard multiprocessing module in Python to generate a bunch of processes that will run indefinitely. I'm not particularly concerned with performance; each process simply watches for a different change on the filesystem and takes the appropriate action when a file is modified.
Currently, I have a solution that works, for my needs, in Linux. I have a dictionary of functions and arguments that looks like:
job_dict['func1'] = {'target': func1, 'args': (args,)}
For each, I create a process:
import multiprocessing

for k in job_dict.keys():
    jobs[k] = multiprocessing.Process(target=job_dict[k]['target'],
                                      args=job_dict[k]['args'])
With this, I can keep track of each one that is running, and, if necessary, restart a job that crashes for any reason.
This does not work in Windows. Many of the functions I'm using are wrappers, using various functools functions, and I get messages about not being able to serialize the functions (see What can multiprocessing and dill do together?). I have not figured out why I do not get this error in Linux, but do in Windows.
If I import dill before starting my processes in Windows, I do not get the serialization error. However, the processes do not actually do anything. I cannot figure out why.
I then switched to the multiprocessing implementation in pathos, but did not find an analog to the simple Process class within the standard multiprocessing module. I was able to generate threads for each job using pathos.pools.ThreadPool. This is not the intended use for map, I'm sure, but it started all the threads, and they ran in Windows:
import pathos

tp = pathos.pools.ThreadPool()
for k in job_dict.keys():
    tp.uimap(job_dict[k]['target'], job_dict[k]['args'])
However, now I'm not sure how to monitor whether a thread is still active, which I'm looking for so that I can restart threads that crash for some reason or another. Any suggestions?

I'm the pathos and dill author. The Process class is buried deep within pathos at pathos.helpers.mp.process.Process, where mp itself is the actual fork of the multiprocessing library. Everything in multiprocessing should be accessible from there.
Another thing to know about pathos is that it keeps the pool alive for you until you remove it from the held state. This helps reduce overhead in creating "new" pools. To remove a pool, you do:
>>> # create
>>> p = pathos.pools.ProcessPool()
>>> # remove
>>> p.clear()
There's no such mechanism for a Process however.
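For example, here is a minimal sketch (untested; it assumes the job_dict from your question and uses only standard Process methods such as start and is_alive) of starting and babysitting the jobs with that Process class, restarting any that die:

import time
import pathos.helpers

# the Process class mentioned above, from pathos' bundled multiprocessing fork
Process = pathos.helpers.mp.process.Process

def start_job(spec):
    p = Process(target=spec['target'], args=spec['args'])
    p.start()
    return p

if __name__ == '__main__':
    # job_dict as defined in the question
    jobs = {k: start_job(v) for k, v in job_dict.items()}

    # watchdog loop: restart any job whose process is no longer alive
    while True:
        for k, spec in job_dict.items():
            if not jobs[k].is_alive():
                jobs[k] = start_job(spec)
        time.sleep(5)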
For multiprocessing, Windows is different from Linux and Macintosh, because Windows doesn't have a proper fork like Linux does. Linux can share objects across processes, while on Windows there is no sharing: a fully independent new process is created, so the serialization has to be good enough for the object to pass across to the other process -- just as if you were sending the object to another computer. On Linux, you'd have to do this to get the same behavior:
import dill

def check(obj, *args, **kwds):
    """check pickling of an object across another process"""
    import subprocess
    fail = True
    try:
        _x = dill.dumps(obj, *args, **kwds)
        fail = False
    finally:
        if fail:
            print("DUMP FAILED")
    msg = "python -c import dill; print(dill.loads(%s))" % repr(_x)
    print("SUCCESS" if not subprocess.call(msg.split(None, 2)) else "LOAD FAILED")

Related

Update progressbar inside multiprocessor functions (Ray) simultaneously

I'm writing a program that uses the ray package for multiprocessing. In the program, there is a function that is called 5 times at the same time. During the execution, I want to show a progress bar using a PyQt5 QProgressBar to indicate how much work is done. My idea is to let every execution of the function update the progress bar by 20%. So I wrote code like the following:
running_tasks = [myFunction.remote(x,y,z,self.progressBar,QApplication) for x in myList]
Results = list(ray.get(running_tasks))
Inside myFunction, there is a line to update the progress bar that was passed in, as follows:
QApplication.processEvents()
progressBar.setValue(progressBar.value()+20)
But, when I run the code, I got the following error:
TypeError: Could not serialize the argument
<PyQt5.QtWidgets.QProgressBar object at 0x000001B787A36B80> for a task
or actor myFile.myFunction. Check
https://docs.ray.io/en/master/serialization.html#troubleshooting for
more information.
I searched through the internet (the troubleshooting URL from the error message returns a 404) and I understand that this error occurs because multiprocessing in ray doesn't share memory between processes, and sending a class attribute (like self.progressBar) means each process gets its own copy, which it then modifies locally only. I also tried using the multiprocessing package instead of ray, but it throws a pickling error, and I assume it is for the same reason. So, can anyone confirm whether I'm right, or provide a further explanation of the error?
Also, how can I achieve my requirement (i.e. updating the same progress bar from several workers simultaneously) if multiprocessing doesn't have shared memory between processes?
I am unfamiliar with ray, but you can do this with the standard multiprocessing library using multiprocessing.Queue().
The Queue is exactly what its name suggests: a queue where you can put data for other processes to read. In my case I usually put a dictionary in the Queue with a command (key) and what to do with that command (value).
In one process you call Queue.put() and in the other you call Queue.get(), if you only need to pass data in one direction. The example below emulates what you may be looking to do.
I usually use a QTimer to check if there is any data in the queue, but you can also check whenever you feel like by calling a method to do so.
from multiprocessing import Process, Queue

myQueue = Queue()

class FirstProcess():
    ...
    def update_progress_percentage(self, percentage):
        self.progress_percentage = percentage

    def send_data_to_other_process(self):
        myQueue.put({"UpdateProgress": self.progress_percentage})

class SecondProcess():
    ...
    def get_data_from_other_process(self):
        while not myQueue.empty():
            queue_dict = myQueue.get()
            for key in queue_dict:
                if key == "UpdateProgress":
                    percentage = queue_dict["UpdateProgress"]
                    progressBar.setValue(percentage)
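To tie this back to the original question, here is a minimal self-contained sketch of the same idea: a worker process reports progress through a Queue instead of receiving the QProgressBar itself, and the GUI process polls the queue with a QTimer. The worker function, the 20% step, and the 100 ms polling interval are assumptions for illustration, not part of the answer above.

from multiprocessing import Process, Queue
from PyQt5.QtCore import QTimer
from PyQt5.QtWidgets import QApplication, QProgressBar
import time

def worker(q, step):
    # runs in a separate process; reports progress through the queue
    for _ in range(100 // step):
        time.sleep(0.5)  # stand-in for the real work
        q.put({"UpdateProgress": step})

if __name__ == "__main__":
    app = QApplication([])
    bar = QProgressBar()
    bar.setValue(0)
    bar.show()

    q = Queue()
    Process(target=worker, args=(q, 20), daemon=True).start()

    def poll_queue():
        # drain whatever is currently in the queue and bump the bar
        while not q.empty():
            msg = q.get()
            if "UpdateProgress" in msg:
                bar.setValue(bar.value() + msg["UpdateProgress"])

    timer = QTimer()
    timer.timeout.connect(poll_queue)
    timer.start(100)  # poll the queue every 100 ms
    app.exec_()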

multiprocessing gives AssertionError: daemonic processes are not allowed to have children

I am trying to use multiprocessing for the first time. So I thought I would make a very simple test example which factors 100 different numbers.
from multiprocessing import Pool
from primefac import factorint
N = 10**30
L = range(N,N + 100)
pool = Pool()
pool.map(factorint, L)
This gives me the error:
Traceback (most recent call last):
File "test.py", line 8, in <module>
pool.map(factorint, L)
File "/usr/lib/python2.7/multiprocessing/pool.py", line 251, in map
return self.map_async(func, iterable, chunksize).get()
File "/usr/lib/python2.7/multiprocessing/pool.py", line 567, in get
raise self._value
AssertionError: daemonic processes are not allowed to have children
I see that Python Process Pool non-daemonic? discusses this problem but I don't understand why it is relevant to my simple toy example. What am I doing wrong?
The problem appears to be that primefac uses its own multiprocessing.Pool. Unfortunately, while PyPI is down, I can't find the source to the module—but I did find various forks on GitHub, like this one, and they all have multiprocessing code.
So, your apparently simple example isn't all that simple—because it's importing and running non-simple code.
By default, all Pool worker processes are daemonic, so you can't create more child processes from inside them. Usually, attempting to do so is a mistake.
If you really do want to multiprocess the factors even though some of them are going to multiprocess their own work (quite possibly adding more contention overhead without adding any parallelism), then you just have to subclass Pool and override that—as explained in the related question that you linked.
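A rough sketch of that override, adapted from the recipe in the linked question (this is the Python 2.7 style matching your traceback; newer Python 3 versions obtain the worker class from a context object, so the recipe needs adjusting there):

import multiprocessing
import multiprocessing.pool

class NoDaemonProcess(multiprocessing.Process):
    # report daemon=False and silently ignore attempts to set it,
    # so the pool's workers are allowed to have children of their own
    def _get_daemon(self):
        return False
    def _set_daemon(self, value):
        pass
    daemon = property(_get_daemon, _set_daemon)

class NonDaemonPool(multiprocessing.pool.Pool):
    Process = NoDaemonProcess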
But the simplest thing is to just not use multiprocessing here, if primefac is already using your cores efficiently. (If you need quasi-concurrency, getting answers as they come in instead of getting them in sequence, I suppose you could do that with a thread pool, but I don't think there's any advantage to that here—you're not using imap_unordered or explicit AsyncResult anywhere.)
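If you did want that quasi-concurrency, a sketch of what it might look like with the thread-backed pool from multiprocessing.dummy (not a recommendation, just an illustration):

from multiprocessing.dummy import Pool as ThreadPool  # threads, not processes
from primefac import factorint

N = 10**30
L = range(N, N + 100)

pool = ThreadPool()
# results are yielded as each factorization finishes, not in input order
for factors in pool.imap_unordered(factorint, L):
    print(factors)
pool.close()
pool.join()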
Alternatively, if it's not using all of your cores most of the time, only doing so for the "tricky remainders" at the end of factoring some numbers, while you've got 7 cores sitting idle 60% of the time… then you probably want to prevent primefac from using multiprocessing at all. I don't know whether the module has a public API for doing that. If so, of course, just use it. If not… well, you may have to subclass or monkeypatch some of its code, or, at worst, monkeypatch its import of multiprocessing, and that may not be worth doing.
The ideal solution would probably be to refactor primefac to push the "tricky remainder" jobs onto the same pool you're already using. But that's probably by far the most work, and not that much more benefit.
As a side note, this isn't your problem, but you should have a __main__ guard around your top-level code, like this:
from multiprocessing import Pool
from primefac import factorint

if __name__ == '__main__':
    N = 10**30
    L = range(N, N + 100)
    pool = Pool()
    pool.map(factorint, L)
Otherwise, when run with the spawn or forkserver start methods—and notice that spawn is the only one available on Windows—each pool process is going to try to create another pool of children. So, if you run your code on Windows, you would get this same assertion—as a way for multiprocessing to protect you from accidentally forkbombing your system.
This is explained under safe importing of main module in the "programming guidelines" section of the multiprocessing docs.
I came here because my unittest raises
AssertionError: daemonic processes are not allowed to have children
This is because I had used multiprocessing and did not close and join the pool properly; after adding close and join, everything is fine now.
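For anyone hitting the same thing, that cleanup looks roughly like this (a minimal sketch; some_function and some_iterable are placeholders for the real work):

from multiprocessing import Pool

pool = Pool()
results = pool.map(some_function, some_iterable)  # placeholders for the real work
pool.close()  # no more tasks will be submitted to the pool
pool.join()   # wait for the worker processes to exit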

Python multiprocessing linux windows difference

This code executes on Linux but throws AttributeError: type object 'T' has no attribute 'val' on Windows. Why?
from multiprocessing import Process
import sys

class T():
    @classmethod
    def init(cls, val):
        cls.val = val

def f():
    print(T.val)

if __name__ == '__main__':
    T.init(5)
    f()
    p = Process(target=f, args=())
    p.start()
Windows lacks a fork() system call, which duplicates the current process. This has many implications, including those listed on the Windows multiprocessing documentation page. More specifically:
Bear in mind that if code run in a child process tries to access a global variable, then the value it sees (if any) may not be the same as the value in the parent process at the time that Process.start was called.
Internally, Python creates a new process on Windows by starting a fresh process from scratch and telling it to load all modules again. So any change you have made in the current process will not be seen.
In your example, this means that in the child process, your module will be loaded, but the if __name__ == '__main__' section will not be run. So T.init will not be called, and T.val won't exist, thus the error you see.
On the other hand, on POSIX systems (that includes Linux), process creation uses fork, and all global state is left untouched. The child runs with a copy of everything, so it does not have to reload anything and will see its copy of T with its copy of val.
This also means that Process creation is much faster and much lighter on resources on POSIX systems, especially as the “duplication” uses copy-on-write to avoid the overhead of actually copying the data.
There are other quirks when using multiprocessing, all of which are detailed in the python multiprocessing guidelines.
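One portable way to make the example behave the same on both platforms is to pass the state to the child explicitly instead of relying on a global set under the __main__ guard. A sketch (re-running T.init inside the child is just one option among several):

from multiprocessing import Process

class T():
    @classmethod
    def init(cls, val):
        cls.val = val

def f(val):
    T.init(val)   # re-establish the class state inside the child process
    print(T.val)

if __name__ == '__main__':
    T.init(5)
    f(5)
    p = Process(target=f, args=(5,))
    p.start()
    p.join()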

Python create a subprocess and do not wait

I would like to run a series of commands (which take a long time). But I do not want to wait for the completion of each command. How can I go about this in Python?
I looked at
os.fork()
and
subprocess.Popen()
Don't think that is what I need.
Code
def command1():
    wait(10)

def command2():
    wait(10)

def command3():
    wait(10)
I would like to call
command1()
command2()
command3()
Without having to wait.
Use python's multiprocessing module.
from multiprocessing import Process

def func(arg1):
    ...  # do something

p = Process(target=func, args=(arg1,), name='func')
p.start()
Complete documentation is here: https://docs.python.org/2/library/multiprocessing.html
EDIT:
You can also use Python's threading module; note that only implementations such as Jython and IronPython are free of the GIL (Global Interpreter Lock), while CPython threads are still limited by it.
https://docs.python.org/2/library/threading.html
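For comparison, the thread-based equivalent of the snippet above (same hypothetical func and arg1) would be:

from threading import Thread

t = Thread(target=func, args=(arg1,), name='func')
t.start()  # returns immediately; func runs in a background thread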
Maybe this example is suitable for you:
#!/usr/bin/env python3
import os
import time

def forked(fork_func):
    def do_fork():
        pid = os.fork()
        if pid > 0:
            # the original process runs the function and then exits...
            fork_func()
            exit(0)
        else:
            # ...while the forked child continues the main program
            return pid
    return do_fork

@forked
def command1():
    time.sleep(2)

@forked
def command2():
    time.sleep(1)

command1()
command2()
print("Hello")
You just use the @forked decorator on your functions.
There is only one problem: when the main program is over, it still waits for the child processes to end.
The most straightforward way is to use Python's own multiprocessing:
from multiprocessing import Process

def command1():
    wait(10)
...

call1 = Process(target=command1, args=(...))
call1.start()
...
This module was introduced precisely to ease the burden of controlling the execution, in separate processes, of functions accessible in the same code base. Of course, that could already be done by using os.fork or subprocess, but multiprocessing emulates, as far as possible, the interface of Python's own threading module. The one immediate advantage of multiprocessing over threading is that the various worker processes can make use of different CPU cores and actually work in parallel, while threading, due to language design limitations (the GIL), is effectively limited to a single executing worker at a time, and thus uses a single core even when several are available.
Now, note that there are still peculiarities, especially if you are, for example, calling these from inside a web request. Check this question and its answers from a few days ago:
Stop a background process in flask without creating zombie processes
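For completeness, here is a runnable sketch with the three commands from the question, where time.sleep stands in for the long-running work:

from multiprocessing import Process
import time

def command1():
    time.sleep(10)

def command2():
    time.sleep(10)

def command3():
    time.sleep(10)

if __name__ == '__main__':
    procs = [Process(target=cmd) for cmd in (command1, command2, command3)]
    for p in procs:
        p.start()  # start() returns immediately; all three run in parallel
    print("all commands started; the main program keeps going")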

Limit function execution in Python

There are a lot of similar questions and answers, but I still can't find a reliable answer.
So, I have a function that can possibly run too long. The function is private, in the sense that I cannot change its code.
I want to restrict its execution time to 60 seconds.
I tried following approaches:
Python signals. They don't work on Windows or in a multithreaded environment (mod_wsgi).
Threads. A nice approach, but a thread cannot be stopped, so it lives on even after raising TimeoutException.
The multiprocessing module. I have problems with pickling and I don't know how to solve them. I want to make a time_limit decorator, and there are problems with importing the required function at the top level. The long-running function is an instance method, and wrapping it also doesn't help...
So, are there good solutions to the above problem?
How to kill thread, that I started?
How to use subprocesses and avoid problems with pickling?
Is subprocess module of any help?
Thank you.
I think the multiprocessing approach is your only real option. You're correct that threads can't be killed (nicely) and signals have cross-platform issues. Here is one multiprocessing implementation:
import multiprocessing
import Queue

def timed_function(return_queue):
    do_other_stuff()
    return_queue.put(True)
    return

def main():
    return_queue = multiprocessing.Manager().Queue()
    proc = multiprocessing.Process(target=timed_function, args=(return_queue,))
    proc.start()
    try:
        # wait for 60 seconds for the function to return a value
        return_queue.get(timeout=60)
    except Queue.Empty:
        # timeout expired
        proc.terminate()  # kill the subprocess
        # other cleanup
I know you said that you have pickling issues, but those can almost always be resolved with refactoring. For example, you said that your long function is an instance method. You can wrap those kinds of functions to use them with multiprocessing:
class TestClass(object):
    def timed_method(self, return_queue):
        do_other_stuff()
        return_queue.put(True)
        return
To use that method in a pool of workers, add this wrapper to the top-level of the module:
def _timed_method_wrapper(TestClass_object, return_queue):
    return TestClass_object.timed_method(return_queue)
Now you can, for example, use apply_async on this class method from a different method of the same class:
def run_timed_method(self):
    return_queue = multiprocessing.Manager().Queue()
    pool = multiprocessing.Pool()
    result = pool.apply_async(_timed_method_wrapper, args=(self, return_queue))
I'm pretty sure that these wrappers are only necessary if you're using a multiprocessing.Pool instead of launching the subprocess with a multiprocessing.Process object. Also, I bet a lot of people would frown on this construct because you're breaking the nice, clean abstraction that classes provide, and also creating a dependency between the class and this other random wrapper function hanging around. You'll have to be the one to decide if making your code more ugly is worth it or not.
An answer to Is it possible to kill a process on Windows from within Python? may help:
You need to kill that subprocess or thread: "Terminating a subprocess on windows"
Maybe TerminateThread also helps.
