I'm encountering some strange behavior with Pipe in Python multiprocessing on my Mac (Intel, Monterey). I've tried the following code in 3.7 and 3.11 and in both cases, not all the tasks are executed.
def _mp_job(nth, child):
    print("Nth is", nth)

if __name__ == "__main__":
    from multiprocessing import Pool, Pipe, set_start_method, log_to_stderr
    import logging, time

    set_start_method("spawn")
    logger = log_to_stderr()
    logger.setLevel(logging.DEBUG)

    with Pool(processes=10) as mp_pool:
        jobs = []
        for i in range(20):
            parent, child = Pipe()
            # child = None
            r = mp_pool.apply_async(_mp_job, args=(i, child))
            jobs.append(r)
        while jobs:
            new_jobs = []
            for job in jobs:
                if not job.ready():
                    new_jobs.append(job)
            jobs = new_jobs
            print("%d jobs remaining" % len(jobs))
            time.sleep(1)
I know exactly what's going on, but I don't know why.
[EDITED: my explanation for what was happening was quite unclear on my first pass, as reflected in the comments, so I've cleaned it up. Thanks for your patience.]
If I run this code on my macOS Monterey machine, it will loop forever, reporting that some number of jobs are remaining. The logging information reveals that the child processes are failing; you'll see a number of lines like this:
[DEBUG/SpawnPoolWorker-10] worker got EOFError or OSError -- exiting
What's happening is that when the child worker dequeues a job and tries to unpickle the argument list, it encounters ConnectionRefusedError when unpickling the child connection side of the Pipe in the arguments (I know these details not because of the output of the function above, but because I inserted a traceback printout at the point in the Python multiprocessing library where the worker reports encountering the OSError). At that point the worker fails, having removed the job from the work queue but not having completed it. That's why I have # child = None in there; if I uncomment that, everything works fine.
My first suspicion is that this is a bug in Python on macOS (I haven't tested this on other platforms, but it makes no sense to me that something this basic would have been missed unless it's a platform-specific error). I don't understand why the child process would get ConnectionRefusedError, since the Pipe establishes a socket pair and you shouldn't be able to get ConnectionRefusedError in that case, as far as I understand.
The failure seems more likely the more processes I have in the pool: with 2 it seems to work reliably, but 4 or more cause problems. I have a six-core machine, so I don't think core count is part of what's happening.
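For comparison, the pattern in the standard library's own Pipe example hands the connection end to a multiprocessing.Process at creation time, so it never travels through a pool's task queue. A minimal sketch of that arrangement (a point of reference, not a fix for the Pool case above):

from multiprocessing import Process, Pipe, set_start_method

def _mp_job(nth, child):
    print("Nth is", nth)
    child.close()

if __name__ == "__main__":
    set_start_method("spawn")
    procs = []
    for i in range(20):
        parent, child = Pipe()
        # The connection end is pickled as part of spawning this specific
        # child process, not as part of a task sent to a pool worker.
        p = Process(target=_mp_job, args=(i, child))
        p.start()
        child.close()  # the parent keeps only its own end
        procs.append(p)
    for p in procs:
        p.join()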
Does anyone have any insight into this? Am I doing something obviously wrong?
Related
In Python (3.5), I started running external executables (written in C++) via multiprocessing.Pool.map + subprocess from an Xshell connection. However, the Xshell connection was interrupted due to a bad internet connection.
After reconnecting, I see that the managing Python process is gone, but the C++ executables are still running (and apparently correctly; the Pool still seems to control them).
The question is whether this is a bug, and what I should do in this case. I cannot kill them, even with kill -9.
Added: after removing all of the sublst_file files by hand, all of the running executables (cmd) are gone. It seems that the except sub.SubprocessError as e: branch is still working.
The basic frame of my program is outlined below.
import subprocess as sub
import multiprocessing as mp
import itertools as it
import os
import time

def chunks(lst, chunksize=5):
    return it.zip_longest(*[iter(lst)]*chunksize)

class Work():
    def __init__(self, lst):
        self.lst = lst

    def _work(self, sublst):
        retry_times = 6
        for i in range(retry_times):
            try:
                cmd = 'my external c++ cmd'
                sublst_file = 'a config file generated from sublst'
                sub.check_call([cmd, sublst_file])
                os.remove(sublst_file)
                return sublst  # return the successful sublst
            except sub.SubprocessError as e:
                if i == (retry_times - 1):
                    print('\n[ERROR] %s %s failed after %d tries\n' % (cmd, sublst_file, retry_times))
                    return []
                else:
                    print('\n[WARNING] attempt %d failed, sleeping before retry\n' % (i + 1))
                    time.sleep(1 + i)

    def work(self):
        with mp.Pool(4) as pool:
            results = pool.map(self._work, chunks(self.lst, 5))
        for r in it.chain.from_iterable(results):  # flatten the per-chunk results
            # other work on successful items
            print(r)
The multiprocessing.Pool does terminate its workers upon terminate(), which is also called by __del__, which in turn is called on module unload (at exit).
The reason these processes are orphaned is that the children spawned by subprocess.check_call are not terminated when the parent exits.
This fact is not mentioned explicitly in the reference, but there is no indication anywhere that the spawned processes are terminated. A brief review of the source code also turned up nothing. The behavior is also easy to test.
To clean up on parent termination, use the Popen interface and the approach from this answer: Killing child process when parent crashes in python
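For reference, a minimal sketch of the technique from that linked answer, Linux-only since it relies on prctl(PR_SET_PDEATHSIG); 'my_cmd' and 'sublst_file' are placeholders for the real external command and config file:

import ctypes
import signal
import subprocess as sub

libc = ctypes.CDLL("libc.so.6", use_errno=True)
PR_SET_PDEATHSIG = 1  # constant from <sys/prctl.h>

def _die_with_parent():
    # Runs in the child just before exec: ask the kernel to deliver SIGTERM
    # to this child if its parent process dies.
    libc.prctl(PR_SET_PDEATHSIG, signal.SIGTERM)

proc = sub.Popen(['my_cmd', 'sublst_file'], preexec_fn=_die_with_parent)
proc.wait()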
I'm converting a program to multiprocessing and need to be able to log to a single rotating log from the main process as well as subprocesses. I'm trying to use the 2nd example in the python cookbook Logging to a single file from multiple processes, which starts a logger_thread running as part of the main process, picking up log messages off a queue that the subprocesses add to. The example works well as is, and also works if I switch to a RotatingFileHandler.
However, if I change it to start logger_thread before the subprocesses (so that I can log from the main process as well), then as soon as the log rotates, all subsequent logging generates a traceback with WindowsError: [Error 32] The process cannot access the file because it is being used by another process.
In other words, I change this code from the second example:
workers = []
for i in range(5):
    wp = Process(target=worker_process, name='worker %d' % (i + 1), args=(q,))
    workers.append(wp)
    wp.start()
logging.config.dictConfig(d)
lp = threading.Thread(target=logger_thread, args=(q,))
lp.start()
to this:
logging.config.dictConfig(d)
lp = threading.Thread(target=logger_thread, args=(q,))
lp.start()
workers = []
for i in range(5):
    wp = Process(target=worker_process, name='worker %d' % (i + 1), args=(q,))
    workers.append(wp)
    wp.start()
and swap out logging.FileHandler for logging.handlers.RotatingFileHandler (with a very small maxBytes for testing) and then I hit this error.
I'm using Windows and Python 2.7. QueueHandler is not part of the stdlib until Python 3.2, but I've copied the source code from a Gist, which it says is safe to do.
I don't understand why starting the listener first would make any difference, nor do I understand why any process other than main would be attempting to access the file.
You should never start any threads before starting your subprocesses. When Python forks, thread and IPC state will not always be copied properly into the child.
There are several resources on this, just google for fork and threads. Some people claim they can do it, but it's not clear to me that it can ever work properly.
Just start all your processes first.
Some additional reading:
Status of mixing multiprocessing and threading in Python
https://stackoverflow.com/a/6079669/4279
In your case, it might be that the copied open file handle is the problem, but you still should start your subprocesses before your threads (and before you open any files that you will later want to destroy).
Some rules of thumb, summarized by fantabolous from the comments:
Subprocesses must always be started before any threads created by the same process.
multiprocessing.Pool creates both subprocesses AND threads, so one mustn't create additional Processes or Pools after the first one.
Files should not already be open at the time a Process or Pool is created. (This is OK in some cases, but not, e.g. if a file will be deleted later.)
Subprocesses can create their own threads and processes, with the same rules above applying.
Starting all processes first is the easiest way to do this; a schematic sketch follows below.
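A schematic sketch of that ordering (the names worker, background, and output.txt are placeholders, not from the question):

import multiprocessing
import threading

def worker(n):
    return n * n          # placeholder work done in the pool processes

def background():
    pass                  # placeholder work done in a thread

if __name__ == "__main__":
    # 1. Create the Pool (the subprocesses) before any threads exist here.
    pool = multiprocessing.Pool(4)

    # 2. Only now start threads in the parent process.
    t = threading.Thread(target=background)
    t.start()

    # 3. Open files after the processes and threads are set up.
    with open("output.txt", "w") as fh:
        for result in pool.map(worker, range(10)):
            fh.write("%d\n" % result)

    pool.close()
    pool.join()
    t.join()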
So, you can simply make your own file log handler. I have yet to see logs getting garbled from multiprocessing, so it seems file log rotation is the big issue. Just do this in your main module, and you don't have to change the rest of your logging:
import logging
import logging.handlers
import os
from multiprocessing import RLock

class MultiprocessRotatingFileHandler(logging.handlers.RotatingFileHandler):
    def __init__(self, *kargs, **kwargs):
        super(MultiprocessRotatingFileHandler, self).__init__(*kargs, **kwargs)
        self.lock = RLock()

    def shouldRollover(self, record):
        with self.lock:
            return super(MultiprocessRotatingFileHandler, self).shouldRollover(record)

file_log_path = os.path.join('var', 'log', os.path.basename(__file__) + '.log')
file_log = MultiprocessRotatingFileHandler(file_log_path,
                                           maxBytes=8*1000*1024,
                                           backupCount=5,
                                           delay=True)
logging.basicConfig(level=logging.DEBUG)
logging.getLogger().addHandler(file_log)
I'm willing to guess that locking every time you try to rotate is probably slowing down logging, but then this is a case where we need to sacrifice performance for correctness.
My friend and I have been working on a large project, to learn and for fun, in Python and PyGame. Basically it is an AI simulation of a small village. We wanted a day/night cycle, so I found a neat way to change the color of an entire surface using numpy (specifically the cross-fade tutorial): http://www.pygame.org/docs/tut/surfarray/SurfarrayIntro.html
I implemented it into the code and it WORKS, but it is extremely slow, like < 1 fps slow. So I looked into threading (because I wanted to add it eventually) and found this page on Queues: Learning about Queue module in python (how to run it)
I spent about 15 minutes making a basic system, but as soon as I run it, the window closes and it says
Exception in thread Thread-1 (most likely raised during interpreter shutdown):
EDIT: This is literally all it says, no Traceback error
I don't know what I am doing wrong, but I assume I am missing something simple. I added the necessary parts of the code below.
# Imports needed by this snippet (Python 2)
import Queue
import threading
import numpy as N
import pygame
from pygame import surfarray

q_in = Queue.Queue(maxsize=0)
q_out = Queue.Queue(maxsize=0)

def run():  # Here is where the main stuff happens
    # There is more here; I am just showing the essential parts
    while True:
        a = abs(abs(world.degree - 180) - 180) / 400.
        # Process world
        world.process(time_passed_seconds)
        blank_surface = pygame.Surface(SCREEN_SIZE)
        world.render(blank_surface)  # The world class renders everything onto a blank surface
        q_in.put((blank_surface, a))
        screen.blit(q_out.get(), (0, 0))

def DayNight():
    while True:
        blank_surface, a = q_in.get()
        imgarray = surfarray.array3d(blank_surface)  # Here is where the new numpy stuff starts (AKA Day/Night cycle)
        src = N.array(imgarray)
        dest = N.zeros(imgarray.shape)
        dest[:] = 20, 30, 120
        diff = (dest - src) * a
        xfade = src + diff.astype(N.int)
        surfarray.blit_array(blank_surface, xfade)
        q_out.put(blank_surface)
        q_in.task_done()

def main():
    MainT = threading.Thread(target=run)
    MainT.daemon = True
    MainT.start()
    DN = threading.Thread(target=DayNight)
    DN.daemon = True
    DN.start()
    q_in.join()
    q_out.join()
If anyone could help it would be greatly appreciated. Thank you.
This is pretty common when using daemon threads. Why are you setting .daemon = True on your threads? Think about it. While there are legitimate uses for daemon threads, most times a programmer does it because they're confused, as in "I don't know how to shut my threads down cleanly, and the program will freeze on exit if I don't, so I know! I'll say they're daemon threads. Then the interpreter won't wait for them to terminate when it exits. Problem solved."
But it isn't solved - it usually just creates other problems. In particular, the daemon threads keep on running while the interpreter is - on exit - destroying itself. Modules are destroyed, stdin and stdout and stderr are destroyed, etc etc. All sorts of things can go wrong in daemon threads then, as the stuff they try to access is annihilated.
The specific message you're seeing is produced when an exception is raised in some thread, but interpreter destruction has gotten so far that even the sys module no longer contains anything usable. The threading implementation retains a reference to sys.stderr internally so that it can tell you something then (specifically, the exact message you're seeing), but too much of the interpreter has been destroyed to tell you anything else about what went wrong.
So find a way to shut down your threads cleanly instead (and remove .daemon = True). Don't know enough about your problem to suggest a specific way, but you'll think of something ;-)
BTW, I'd suggest removing the maxsize=0 arguments on your Queue() constructors. The default is "unbounded", and "everyone knows that", while few people know that maxsize=0 also means "unbounded". That's gotten worse as other datatypes have taken maxsize=0 to mean "maximum size really is 0" (the best example of that is collections.deque); but "no argument means unbounded" is still universally true.
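One common way to shut threads down cleanly, shown here only as a sketch loosely adapted to the queue setup above (the STOP sentinel is an assumption, not something from the original code): instead of marking the thread as a daemon, send it a sentinel so it can leave its loop, then join it before the interpreter starts tearing down.

import Queue
import threading

q_in = Queue.Queue()
STOP = object()   # sentinel; an assumption, not part of the original code

def day_night_worker():
    while True:
        item = q_in.get()
        if item is STOP:          # sentinel received: exit the loop cleanly
            q_in.task_done()
            break
        # ... do the real work on item here ...
        q_in.task_done()

dn = threading.Thread(target=day_night_worker)   # note: NOT a daemon thread
dn.start()

for item in range(5):
    q_in.put(item)

q_in.put(STOP)   # ask the worker to finish
dn.join()        # wait for it before the program exits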
(1) This works for me.
SOURCE: https://realpython.com/intro-to-python-threading/#starting-a-thread
(2) I use daemon threads.
import logging
import threading
import time

def thread_function(name):
    logging.info("Thread %s: starting", name)
    time.sleep(2)
    logging.info("Thread %s: finishing", name)

if __name__ == "__main__":
    format = "%(asctime)s: %(message)s"
    logging.basicConfig(format=format, level=logging.INFO,
                        datefmt="%H:%M:%S")

    threads = list()
    for index in range(3):
        logging.info("Main    : create and start thread %d.", index)
        x = threading.Thread(target=thread_function, args=(index,), daemon=True)
        threads.append(x)
        x.start()

    for index, thread in enumerate(threads):
        logging.info("Main    : before joining thread %d.", index)
        thread.join()
        logging.info("Main    : thread %d done", index)
I am trying to create a class than can run a separate process to go do some work that takes a long time, launch a bunch of these from a main module and then wait for them all to finish. I want to launch the processes once and then keep feeding them things to do rather than creating and destroying processes. For example, maybe I have 10 servers running the dd command, then I want them all to scp a file, etc.
My ultimate goal is to create a class for each system that keeps track of the information for the system in which it is tied to like IP address, logs, runtime, etc. But that class must be able to launch a system command and then return execution back to the caller while that system command runs, to followup with the result of the system command later.
My attempt is failing because I cannot send an instance method of a class over the pipe to the subprocess via pickle. Those are not pickleable. I therefore tried to fix it various ways but I can't figure it out. How can my code be patched to do this? What good is multiprocessing if you can't send over anything useful?
Is there any good documentation of multiprocessing being used with class instances? The only way I can get the multiprocessing module to work is on simple functions. Every attempt to use it within a class instance has failed. Maybe I should pass events instead? I don't understand how to do that yet.
import multiprocessing
import sys
import re


class ProcessWorker(multiprocessing.Process):
    """
    This class runs as a separate process to execute worker's commands in parallel
    Once launched, it remains running, monitoring the task queue, until "None" is sent
    """

    def __init__(self, task_q, result_q):
        multiprocessing.Process.__init__(self)
        self.task_q = task_q
        self.result_q = result_q
        return

    def run(self):
        """
        Overloaded function provided by multiprocessing.Process.  Called upon start() signal
        """
        proc_name = self.name
        print '%s: Launched' % (proc_name)
        while True:
            next_task_list = self.task_q.get()
            if next_task_list is None:
                # Poison pill means shutdown
                print '%s: Exiting' % (proc_name)
                self.task_q.task_done()
                break
            next_task = next_task_list[0]
            print '%s: %s' % (proc_name, next_task)
            args = next_task_list[1]
            kwargs = next_task_list[2]
            answer = next_task(*args, **kwargs)
            self.task_q.task_done()
            self.result_q.put(answer)
        return
# End of ProcessWorker class


class Worker(object):
    """
    Launches a child process to run commands from derived classes in separate processes,
    which sit and listen for something to do
    This base class is called by each derived worker
    """

    def __init__(self, config, index=None):
        self.config = config
        self.index = index

        # Launch the ProcessWorker for anything that has an index value
        if self.index is not None:
            self.task_q = multiprocessing.JoinableQueue()
            self.result_q = multiprocessing.Queue()

            self.process_worker = ProcessWorker(self.task_q, self.result_q)
            self.process_worker.start()
            print "Got here"
            # Process should be running and listening for functions to execute
        return

    def enqueue_process(target):  # No self, since it is a decorator
        """
        Used to place a command target from this class object into the task_q
        NOTE: Any function decorated with this must use fetch_results() to get the
        target task's result value
        """
        def wrapper(self, *args, **kwargs):
            self.task_q.put([target, args, kwargs])  # FAIL: target is a class instance method and can't be pickled!
        return wrapper

    def fetch_results(self):
        """
        After all processes have been spawned by multiple modules, this command
        is called on each one to retrieve the results of the call.
        This blocks until the execution of the item in the queue is complete
        """
        self.task_q.join()              # Wait for it to finish
        return self.result_q.get()      # Return the result

    @enqueue_process
    def run_long_command(self, command):
        print "I am running %s as process %s" % (command, self.name)
        # In here, I will launch a subprocess to run a long-running system command
        # p = Popen(command), etc
        # p.wait(), etc
        return

    def close(self):
        self.task_q.put(None)
        self.task_q.join()


if __name__ == '__main__':
    config = ["some value", "something else"]
    index = 7
    workers = []
    for i in range(5):
        worker = Worker(config, index)
        worker.run_long_command("ls /")
        workers.append(worker)
    for worker in workers:
        worker.fetch_results()
    # Do more work... (this would actually be done in a distributor in another class)
    for worker in workers:
        worker.close()
Edit: I tried to move the ProcessWorker class and the creation of the multiprocessing queues outside of the Worker class and then tried to manually pickle the worker instance. Even that doesn't work, and I get the error
RuntimeError: Queue objects should only be shared between processes through inheritance
But I am only passing references of those queues into the worker instance?? I am missing something fundamental. Here is the modified code from the main section:
if __name__ == '__main__':
    config = ["some value", "something else"]
    index = 7
    workers = []
    for i in range(1):
        task_q = multiprocessing.JoinableQueue()
        result_q = multiprocessing.Queue()
        process_worker = ProcessWorker(task_q, result_q)
        worker = Worker(config, index, process_worker, task_q, result_q)
        something_to_look_at = pickle.dumps(worker)  # FAIL: Doesn't like queues??
        process_worker.start()
        worker.run_long_command("ls /")
So, the problem was that I was assuming that Python was doing some sort of magic that is somehow different from the way that C++/fork() works. I somehow thought that Python only copied the class, not the whole program into a separate process. I seriously wasted days trying to get this to work because all of the talk about pickle serialization made me think that it actually sent everything over the pipe. I knew that certain things could not be sent over the pipe, but I thought my problem was that I was not packaging things up properly.
This all could have been avoided if the Python docs gave me a 10,000 ft view of what happens when this module is used. Sure, it tells me what the methods of multiprocess module does and gives me some basic examples, but what I want to know is what is the "Theory of Operation" behind the scenes! Here is the kind of information I could have used. Please chime in if my answer is off. It will help me learn.
When you start a process using this module, the whole program is copied into another process. But since it is not the "__main__" process and my code was checking for that, it doesn't fire off yet another process infinitely. It just stops and sits out there waiting for something to do, like a zombie. Everything that was initialized in the parent at the time of calling multiprocessing.Process() is all set up and ready to go. Once you put something in the multiprocessing.Queue or shared memory, or pipe, etc. (however you are communicating), then the separate process receives it and gets to work. It can draw upon all imported modules and setup just as if it was the parent. However, once some internal state variables change in the parent or the separate process, those changes are isolated. Once the process is spawned, it becomes your job to keep them in sync if necessary, either through a queue, pipe, shared memory, etc.
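A tiny sketch of that isolation (the names here are invented for illustration): a module-level variable changed in the parent after the child has started is not seen by the child, and the child's changes are never seen by the parent.

import multiprocessing
import time

counter = 0   # module-level state, set up before the child starts

def child():
    time.sleep(1)
    # Still sees the value it was started with, not the parent's later change.
    print("child sees counter =", counter)

if __name__ == "__main__":
    p = multiprocessing.Process(target=child)
    p.start()
    counter = 99          # only the parent's copy changes
    p.join()
    print("parent sees counter =", counter)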
I threw out the code and started over, but now I am only putting one extra function out in the ProcessWorker, an "execute" method that runs a command line. Pretty simple. I don't have to worry about launching and then closing a bunch of processes this way, which has caused me all kinds of instability and performance issues in the past in C++. When I switched to launching processes at the beginning and then passing messages to those waiting processes, my performance improved and it was very stable.
BTW, I looked at this link to get help, which threw me off because the example made me think that methods were being transported across the queues: http://www.doughellmann.com/PyMOTW/multiprocessing/communication.html
The second example of the first section used "next_task()" that appeared (to me) to be executing a task received via the queue.
Instead of attempting to send a method itself (which is impractical), try sending a name of a method to execute.
Provided that each worker runs the same code, it's a matter of a simple getattr(self, task_name).
I'd pass tuples (task_name, task_args), where task_args is a dict fed directly to the task method:
next_task_name, next_task_args = self.task_q.get()
if next_task_name:
    task = getattr(self, next_task_name)
    answer = task(**next_task_args)
    ...
else:
    # poison pill, shut down
    break
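A slightly fuller sketch of that idea, with invented names (this ProcessWorker is a stand-in, not the exact class from the question):

import multiprocessing

class ProcessWorker(multiprocessing.Process):
    def __init__(self, task_q, result_q):
        multiprocessing.Process.__init__(self)
        self.task_q = task_q
        self.result_q = result_q

    def run_long_command(self, command):
        # Placeholder for the real work (e.g. launching a subprocess).
        return "ran %r in %s" % (command, self.name)

    def run(self):
        while True:
            next_task_name, next_task_args = self.task_q.get()
            if next_task_name is None:              # poison pill
                break
            task = getattr(self, next_task_name)    # look the method up by name
            self.result_q.put(task(**next_task_args))

if __name__ == "__main__":
    task_q = multiprocessing.Queue()
    result_q = multiprocessing.Queue()
    worker = ProcessWorker(task_q, result_q)
    worker.start()
    task_q.put(("run_long_command", {"command": "ls /"}))  # only the name and args are pickled
    print(result_q.get())
    task_q.put((None, None))                               # shut the worker down
    worker.join()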
REF: https://stackoverflow.com/a/14179779
The answer from Jan 6 at 6:03 by David Lynch is not factually correct when it says that he was misled by http://www.doughellmann.com/PyMOTW/multiprocessing/communication.html.
The code and examples provided are correct and work as advertised. next_task() is executing a task received via the queue -- try and understand what the Task.__call__() method is doing.
In my case, what tripped me up was syntax errors in my implementation of run(). It seems that the sub-process will not report this and just fails silently, leaving things stuck in weird loops! Make sure you have some kind of syntax checker running, e.g. Flymake/Pyflakes in Emacs.
Debugging via multiprocessing.log_to_stderr() helped me narrow down the problem.
Let's assume I'm stuck using Python 2.6, and can't upgrade (even if that would help). I've written a program that uses the Queue class. My producer is a simple directory listing. My consumer threads pull a file from the queue, and do stuff with it. If the file has already been processed, I skip it. The processed list is generated before all of the threads are started, so it isn't empty.
Here's some pseudo-code.
import Queue, os, sys, threading

processed = []

def consumer():
    while True:
        file = dirlist.get(block=True)
        if file in processed:
            print "Ignoring %s" % file
        else:
            pass  # do stuff here
        dirlist.task_done()

dirlist = Queue.Queue()
for f in os.listdir("/some/dir"):
    dirlist.put(f)

max_threads = 8
for i in range(max_threads):
    thr = threading.Thread(target=consumer)
    thr.start()

dirlist.join()
The strange behavior I'm getting is that if a thread encounters a file that's already been processed, the thread stalls out and waits until the entire program ends. I've done a little bit of testing, and the first 7 threads (assuming 8 is the max) stop, while the 8th thread keeps processing, one file at a time. But, by doing that, I'm losing the entire reason for threading the application.
Am I doing something wrong, or is this the expected behavior of the Queue/threading classes in Python 2.6?
I tried running your code, and did not see the behavior you describe. However, the program never exits. I recommend changing the .get() call as follows:
try:
    file = dirlist.get(True, 1)
except Queue.Empty:
    return
If you want to know which thread is currently executing, you can import the thread module and print thread.get_ident().
I added the following line after the .get():
print file, thread.get_ident()
and got the following output:
bin 7116328
cygdrive 7116328
cygwin.bat 7149424
cygwin.ico 7116328
dev etc7598568
7149424
fix 7331000
home 7116328lib
7598568sbin
7149424Thumbs.db
7331000
tmp 7107008
usr 7116328
var 7598568proc
7441800
The output is messy because the threads are writing to stdout at the same time. The variety of thread identifiers further confirms that all of the threads are running.
Perhaps something is wrong in the real code or your test methodology, but not in the code you posted?
Since this problem only manifests itself when finding a file that's already been processed, it seems like this is something to do with the processed list itself. Have you tried implementing a simple lock? For example:
processed = []
processed_lock = threading.Lock()

def consumer():
    while True:
        with processed_lock:  # Lock supports the context manager protocol directly
            fileInList = file in processed
        if fileInList:
            # ... et cetera
Threading tends to cause the strangest bugs, even if they seem like they "shouldn't" happen. Using locks on shared variables is the first step to make sure you don't end up with some kind of race condition that could cause threads to deadlock.
Of course, if what you're doing under # do stuff here is CPU-intensive, then Python will only run code from one thread at a time anyway, due to the Global Interpreter Lock. In that case, you may want to switch to the multiprocessing module - it's very similar to threading, though you will need to replace shared variables with another solution (see here for details).
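A sketch of what that switch might look like for this directory-scanning setup, assuming the per-file work can live in a plain function (process_file and processed are invented placeholders for the question's "do stuff here" logic and its already-processed list):

import multiprocessing
import os

processed = set()   # placeholder for the already-processed list

def process_file(name):
    return name.upper()   # stand-in for the CPU-intensive per-file work

if __name__ == "__main__":
    todo = [f for f in os.listdir("/some/dir") if f not in processed]
    pool = multiprocessing.Pool(processes=8)
    # Each file is handled in a separate process, so the GIL no longer
    # serializes the CPU-bound work.
    results = pool.map(process_file, todo)
    pool.close()
    pool.join()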