I wrote a Python script to test a multiprocessing pool, using apply_async to call a class method. Why does the same process (same pid) show up multiple times in the output?
OS: centos-7.4
PYTHON: python-2.7
#!/usr/bin/env python
import time
import os
from multiprocessing import Pool


class New(object):
    def __init__(self):
        self.pid = os.getpid()

    def gen(self, num):
        pid = os.getpid()
        print 'NEW PROCESS PID IS {}'.format(pid)
        return (pid, num)

    def log(self, pid):
        print 'START WRITE {} INTO FILE'.format(pid[0])
        with open('log', 'a') as f:
            f.write('CURRENT PROCESS IS {} <--> NUM IS {}\n'.format(pid[0], pid[1]))

    def start(self):
        print 'CREATE MAIN PROCESS {}'.format(self.pid)
        self.pool = Pool()
        num = 0
        while True:
            narg = num
            self.pool.apply_async(self, args=(narg,), callback=self.log)
            num += 1
            time.sleep(2)
        self.pool.close()
        self.pool.join()

    def __call__(self, num):
        return self.gen(num)

    def __getstate__(self):
        self_dict = self.__dict__.copy()
        del self_dict['pool']
        return self_dict

    def __setstate__(self, state):
        self.__dict__.update(state)


if __name__ == '__main__':
    new = New()
    new.start()
The following is what the script prints; the same process id is output twice. For example:
NEW PROCESS PID IS 14459
START WRITE 14459 INTO FILE
NEW PROCESS PID IS 14459
START WRITE 14459 INTO FILE
The callback of apply_async writes some lines into the file. The file contents at the same time are as follows, for example:
CURRENT PROCESS IS 14459 <--> NUM IS 29
CURRENT PROCESS IS 14459 <--> NUM IS 30
I just want one print and one write per process.
The behavior you're observing is expected. The point of using a multiprocessing.Pool() is to distribute the work over a pool of workers (i.e., processes). See multiprocessing.Pool with maxtasksperchild produces equal PIDs for one way to achieve what you want. But honestly, it seems to me you should just be using multiprocessing.Process() if you want to spawn a new process for each iteration of the inner loop.
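For illustration, here is a minimal sketch of that Process-per-task approach (not the original script; gen_and_log is a hypothetical merge of the question's gen() and log(), and the loop is bounded only to keep the example finite):

# One fresh process per task, so every task really does run under its own PID.
import os
import time
from multiprocessing import Process

def gen_and_log(num):
    pid = os.getpid()
    print 'NEW PROCESS PID IS {}'.format(pid)
    with open('log', 'a') as f:
        f.write('CURRENT PROCESS IS {} <--> NUM IS {}\n'.format(pid, num))

if __name__ == '__main__':
    for num in range(5):
        p = Process(target=gen_and_log, args=(num,))
        p.start()
        p.join()
        time.sleep(2)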
I'm wondering if there can be a sort of deadlock in the following code. I have to read each element of a database (about 1 million items), process it, and then collect the results in a single file.
I've parallelized the execution with multiprocessing using two Queues and three types of processes:
Reader: Main process which reads the database and adds the read items in a task_queue
Worker: Pool of processes. Each worker gets an item from task_queue, processes the item, saves the results in an intermediate file stored in item_name/item_name.txt and puts the item_name in a completed_queue
Writer: Process which gets an item_name from completed_queue, gets the intermediate result from item_name/item_name.txt and writes it in results.txt
from multiprocessing import Pool, Process, Queue


class Computation():

    def __init__(self, K):
        self.task_queue = Queue()
        self.completed_queue = Queue()
        self.n_cpus = K

    def reader(self,):
        with open(db, "r") as db:
            ...  # Read an item
            self.task_queue.put(item)

    def worker(self,):
        while True:
            item = self.task_queue.get(True)
            if item == "STOP":
                break
            self.process_item(item)

    def writer_process(self,):
        while True:
            f = self.completed_queue.get(True)
            if f == "DONE":
                break
            self.write_f(f)

    def run(self,):
        pool = Pool(n_cpus, self.worker, args=())
        writer = Process(target=self.writer_process, args=())
        writer.start()
        self.reader()
        pool.close()
        pool.join()
        self.completed_queue.put("DONE")
        writer.join()
The code works, but it seems that sometimes the writer or the pool stops working (or they are very slow). Is a deadlock possible in this scenario?
There are a couple of issues with your code. First, by using the queues as you are, you are in effect creating your own process pool and have no need for using the multiprocessing.Pool class at all. You are using a pool initializer as an actual pool worker and it's a bit of a misuse of this class; you would be better off to just use regular Process instances (my opinion, anyway).
Second, although it is well and good that you are putting a DONE message on the queue to tell writer_process to terminate, you have not done the same for the self.n_cpus worker processes, which are looking for 'STOP' messages; therefore the reader function needs to put self.n_cpus STOP messages on the task queue:
from multiprocessing import Process, Queue


class Computation():

    def __init__(self, K):
        self.task_queue = Queue()
        self.completed_queue = Queue()
        self.n_cpus = K

    def reader(self,):
        with open(db, "r") as db:
            ...  # Read an item
            self.task_queue.put(item)
        # signal to the worker processes to terminate:
        for _ in range(self.n_cpus):
            self.task_queue.put('STOP')

    def worker(self,):
        while True:
            item = self.task_queue.get(True)
            if item == "STOP":
                break
            self.process_item(item)

    def writer_process(self,):
        while True:
            f = self.completed_queue.get(True)
            if f == "DONE":
                break
            self.write_f(f)

    def run(self):
        processes = [Process(target=self.worker) for _ in range(self.n_cpus)]
        for p in processes:
            p.start()
        writer = Process(target=self.writer_process, args=())
        writer.start()
        self.reader()
        for p in processes:
            p.join()
        self.completed_queue.put("DONE")
        writer.join()
Personally, instead of using 'STOP' and 'DONE' as the sentinel messages, I would use None, assuming that is not a valid actual message. I tested the above code with a reader that just processed strings in a list, a self.process_item(item) that simply appended ' done' to each of those strings and put the modified string on the completed_queue, and a print call in place of self.write_f in the writer_process. I did not see any problems with the code as is.
Update to use a Managed Queue
Disclaimer: I have had no experience using mpi4py and have no idea how the queue proxies would get distributed across different computers. The above code may not be sufficient, as suggested by the following article, How to share mutliprocessing queue object between multiple computers. However, that code is creating instances of Queue.Queue (that code is Python 2 code) and not the proxies that are returned by a multiprocessing.SyncManager. The documentation on this is very poor. Try the change below to see if it works better (it will be slower).
Because the queues are now proxies returned by manager.Queue(), I have had to rearrange the code a bit; the queues are passed explicitly as arguments to the process functions:
from multiprocessing import Process, Manager


class Computation():

    def __init__(self, K):
        self.n_cpus = K

    def reader(self, task_queue):
        with open(db, "r") as db:
            ...  # Read an item
            task_queue.put(item)
        # signal to the worker processes to terminate:
        for _ in range(self.n_cpus):
            task_queue.put('STOP')

    def worker(self, task_queue, completed_queue):
        while True:
            item = task_queue.get(True)
            if item == "STOP":
                break
            self.process_item(item)

    def writer_process(self, completed_queue):
        while True:
            f = completed_queue.get(True)
            if f == "DONE":
                break
            self.write_f(f)

    def run(self):
        with Manager() as manager:
            task_queue = manager.Queue()
            completed_queue = manager.Queue()
            processes = [Process(target=self.worker, args=(task_queue, completed_queue))
                         for _ in range(self.n_cpus)]
            for p in processes:
                p.start()
            writer = Process(target=self.writer_process, args=(completed_queue,))
            writer.start()
            self.reader(task_queue)
            for p in processes:
                p.join()
            completed_queue.put("DONE")
            writer.join()
I learned from this other question that AWS Lambda does not support multiprocessing.Pool and multiprocessing.Queue.
I'm also working on Python multiprocessing in AWS Lambda. But my question is: how do we terminate the main process when the first child process returns? (All child processes will return, each with a different execution time.)
What I have here:
import time
from multiprocessing import Process, Pipe


class run_func():
    number = 0

    def __init__(self, number):
        self.number = number

    def subrun(self, input, conn):
        # subprocess function with different execution time based on input.
        response = subprocess(input)
        conn.send([input, response])
        conn.close()

    def run(self):
        number = self.number
        processes = []
        parent_connections = []
        for i in range(0, number):
            parent_conn, child_conn = Pipe()
            parent_connections.append(parent_conn)
            process = Process(target=self.subrun, args=(i, child_conn,))
            processes.append(process)
        for process in processes:
            process.start()
        for process in processes:
            process.join()
        results = []
        for parent_connection in parent_connections:
            resp = parent_connection.recv()
            print(resp)
            results.append((resp[0], resp[1]))
        return results


def lambda_handler(event, context):
    starttime = time.time()
    results = []
    work = run_func(int(event['number']))
    results = work.run()
    print("Results : {}".format(results))
    print('Time: {} seconds'.format(time.time() - starttime))
    return output
The current program does not return until all child processes finish (because of the for parent_connection in parent_connections loop). But I wonder how to terminate as soon as the first child process finishes? (Terminating the main process is enough; it's OK to leave the other child processes running.)
Added:
To be clear, I mean the first child process to return (which may not be the first child that was created).
The join() loop is the one that waits for all child processes to complete.
If you break out of it after the first child completes and forcefully terminate the other processes, it will do what you want:
class run_func():
    number = 0

    def __init__(self, number):
        self.number = number

    def subrun(self, input, conn):
        # subprocess function with different execution time based on input.
        response = subprocess(input)
        conn.send([input, response])
        conn.close()

    def run(self):
        number = self.number
        processes = []
        parent_connections = []
        for i in range(0, number):
            parent_conn, child_conn = Pipe()
            parent_connections.append(parent_conn)
            process = Process(target=self.subrun, args=(i, child_conn,))
            processes.append(process)
        for process in processes:
            process.start()
        for process in processes:
            process.join()
            break
        results = []
        for parent_connection in parent_connections:
            resp = parent_connection.recv()
            print(resp)
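If what you need is the first child to finish (rather than the first one created), one option in Python 3 is multiprocessing.connection.wait on the parent ends of the pipes. This is my own sketch, not part of the answer above; run_func_first is a hypothetical variant of the question's class:

from multiprocessing import Process, Pipe
from multiprocessing.connection import wait

class run_func_first(run_func):
    # same setup as run(), but return as soon as ANY child reports back
    def run(self):
        processes = []
        parent_connections = []
        for i in range(0, self.number):
            parent_conn, child_conn = Pipe()
            parent_connections.append(parent_conn)
            processes.append(Process(target=self.subrun, args=(i, child_conn,)))
        for process in processes:
            process.start()
        ready = wait(parent_connections)    # blocks until at least one pipe has data
        resp = ready[0].recv()              # result from whichever child finished first
        for process in processes:           # forcefully stop the rest, as suggested above
            if process.is_alive():
                process.terminate()
        return [(resp[0], resp[1])]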
I have faced very strange behavior of Python. It looks like when I start a parallel program which uses multiprocessing, and the main process spawns 2 more (producer, consumer), I see 4 processes running. I think there should be only 3: the main, the Producer, and the Consumer. But after some time the 4th process appears.
I have made a minimal example that reproduces the problem. It creates two processes which calculate Fibonacci numbers using recursion:
from multiprocessing import Process, Queue
import os, sys
import time
import signal


def fib(n):
    if n == 1 or n == 2:
        return 1
    result = fib(n-1) + fib(n-2)
    return result


def worker(queue, amount):
    pid = os.getpid()

    def workerProcess(a, b):
        print a, b
        print 'This is Writer(', pid, ')'

    signal.signal(signal.SIGUSR1, workerProcess)
    print 'Worker', os.getpid()
    for i in range(0, amount):
        queue.put(fib(35 - i % 4))
    queue.put('end')
    print 'Worker finished'


def writer(queue):
    pid = os.getpid()

    def writerProcess(a, b):
        print a, b
        print 'This is Writer(', pid, ')'

    signal.signal(signal.SIGUSR1, writerProcess)
    print 'Writer', os.getpid()
    working = True
    while working:
        if not queue.empty():
            value = queue.get()
            if value != 'end':
                fib(32 + value % 4)
            else:
                working = False
        else:
            time.sleep(1)
    print 'Writer finished'


def daemon():
    print 'Daemon', os.getpid()
    while True:
        time.sleep(1)


def useProcesses(amount):
    q = Queue()
    writer_process = Process(target=writer, args=(q,))
    worker_process = Process(target=worker, args=(q, amount))
    writer_process.daemon = True
    worker_process.daemon = True
    worker_process.start()
    writer_process.start()


def run(amount):
    print 'Main', os.getpid()
    pid = os.getpid()

    def killThisProcess(a, b):
        print a, b
        print 'Main killed by signal(', pid, ')'
        sys.exit(0)

    signal.signal(signal.SIGTERM, killThisProcess)
    useProcesses(amount)
    print 'Ready to exit main'
    while True:
        time.sleep(1)


def main():
    run(1000)


if __name__ == '__main__':
    main()
What I see in the output is:
$ python python_daemon.py
Main 13257
Ready to exit main
Worker 13258
Writer 13259
but in htop I see an extra entry (the htop screenshot is not reproduced here), and it looks like the entry with PID 13322 is actually a thread. The question is: what is it? Who spawned it? Why?
If I send SIGUSR1 to this PID, the following appears in the output:
10 <frame object at 0x7f05c14ed5d8>
This is Writer( 13258 )
This question is slightly related to: Python multiprocessing: more processes than requested
The extra thread belongs to the Queue object.
Internally, the queue uses a thread to dispatch the data over a Pipe.
From the docs:
class multiprocessing.Queue([maxsize])
Returns a process shared queue implemented using a pipe and a few locks/semaphores. When a process first puts an item on the queue a feeder thread is started which transfers objects from a buffer into the pipe.
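If you want to see that feeder thread yourself, a small check along these lines (my own sketch, not from the question) shows it appear right after the first put():

import threading
from multiprocessing import Queue

q = Queue()
print [t.name for t in threading.enumerate()]   # only the MainThread so far
q.put('hello')                                  # the first put() starts the feeder thread
print [t.name for t in threading.enumerate()]   # the queue's feeder thread now shows up too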
I am using a process pool (with 3 processes). In every process, I have created some threads using a thread class to speed up the handling of something.
At first, everything was OK. But when I wanted to change a variable from a thread, I ran into an odd situation.
For testing, and to learn what happens, I set a global variable COUNT. Honestly, I don't know whether this is safe or not. I just want to see whether, using multiprocessing and threading, I can change COUNT or not.
#!/usr/bin/env python
# encoding: utf-8
import os
import threading
from Queue import Queue
from multiprocessing import Process, Pool

# global variable
max_threads = 11
Stock_queue = Queue()
COUNT = 0


class WorkManager:
    def __init__(self, work_queue_size=1, thread_pool_size=1):
        self.work_queue = Queue()
        self.thread_pool = []  # initiate, no have a thread
        self.work_queue_size = work_queue_size
        self.thread_pool_size = thread_pool_size
        self.__init_work_queue()
        self.__init_thread_pool()

    def __init_work_queue(self):
        for i in xrange(self.work_queue_size):
            self.work_queue.put((func_test, Stock_queue.get()))

    def __init_thread_pool(self):
        for i in xrange(self.thread_pool_size):
            self.thread_pool.append(WorkThread(self.work_queue))

    def finish_all_threads(self):
        for i in xrange(self.thread_pool_size):
            if self.thread_pool[i].is_alive():
                self.thread_pool[i].join()


class WorkThread(threading.Thread):
    def __init__(self, work_queue):
        threading.Thread.__init__(self)
        self.work_queue = work_queue
        self.start()

    def run(self):
        while self.work_queue.qsize() > 0:
            try:
                func, args = self.work_queue.get(block=False)
                func(args)
            except Queue.Empty:
                print 'queue is empty....'


def handle(process_name):
    print process_name, 'is running...'
    work_manager = WorkManager(Stock_queue.qsize()/3, max_threads)
    work_manager.finish_all_threads()


def func_test(num):
    # use a global variable to test what happens
    global COUNT
    COUNT += num


def prepare():
    # prepare test queue, store 50 numbers in Stock_queue
    for i in xrange(50):
        Stock_queue.put(i)


def main():
    prepare()
    pools = Pool()
    # set 3 process
    for i in xrange(3):
        pools.apply_async(handle, args=('process_'+str(i),))
    pools.close()
    pools.join()
    global COUNT
    print 'COUNT: ', COUNT


if __name__ == '__main__':
    os.system('printf "\033c"')
    main()
Now, finally, the result of COUNT is just 0. I am unable to understand what's happening here.
You print the COUNT variable in the parent process. Variables don't sync across processes because processes don't share memory; that means the variable stays 0 in the parent process and is increased only in the subprocesses.
In the case of threading, threads do share memory, which means they share the COUNT variable, so within one subprocess it should end up greater than 0. But those threads live in the subprocesses, and when they change the variable, the change is not reflected in the other processes (or in the parent).
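If the goal is a single counter that all the pool processes update, one way (a sketch under the assumption that summing the numbers is really what you want; it leaves out the question's thread pool) is a multiprocessing.Value handed to each worker through the pool initializer:

from multiprocessing import Pool, Value

counter = None

def init_worker(shared):
    # runs once in every pool process and stores the shared counter there
    global counter
    counter = shared

def add(num):
    with counter.get_lock():      # the lock makes += safe across processes
        counter.value += num

if __name__ == '__main__':
    shared = Value('i', 0)
    pool = Pool(3, initializer=init_worker, initargs=(shared,))
    pool.map(add, range(50))
    pool.close()
    pool.join()
    print 'COUNT:', shared.value  # 1225, i.e. sum(range(50))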
I have a little problem understanding Python multiprocessing. I wrote an application which analyzes downloaded web pages. I would like to fetch the raw HTML in a separate process with a specific timeout. I know I can set a timeout in urllib2, but it does not seem to work correctly in some cases when using a socks5 proxy.
So, I wrote a little class:
class SubprocessManager(Logger):
    def __init__(self, function):
        self.request_queue = Queue()
        self.return_queue = Queue()
        self.worker = function
        self.args = ()
        self.kwargs = {'request_queue': self.request_queue,
                       'return_queue': self.return_queue}
        self._run()

    def _run(self):
        self.subprocess = Process(target=self.worker, args=self.args, kwargs=self.kwargs)
        self.subprocess.start()

    def put_in_queue(self, data):
        self.request_queue.put(data)

    def get_from_queue(self):
        result = None
        try:
            result = self.request_queue.get(timeout=10)
        except Empty:
            self.reset_process()
        return result

    def reset_process(self):
        if self.subprocess.is_alive():
            self.subprocess.terminate()
        self._run()
Worker function:
def subprocess_fetch_www(*args, **kwargs):
    request_queue = kwargs['request_queue']
    return_queue = kwargs['return_queue']
    while True:
        request_data = request_queue.get()
        if request_data:
            return_data = fetch_request(*request_data)
            return_queue.put(return_data)
And the function that is called for each URL from the input list:
def fetch_html(url, max_retry=cfg.URLLIB_MAX_RETRY, to_xml=False, com_headers=False):
    subprocess = Logger.SUBPROCESS
    args = (url, max_retry, com_headers)
    subprocess.put_in_queue(args)
    result = subprocess.get_from_queue()
    if result and to_xml:
        return html2lxml(result)
    return result
I need help fixing my code. I want my subprocess to run all the time, waiting for jobs in request_queue. I want to recreate the subprocess only in case of a timeout. The worker should suspend execution once request_data is processed and return_data has been put on the return queue.
How can I achieve that?
EDIT:
Well, it seems that the above code works as intended if get_from_queue reads the result data from return_queue instead of request_queue... >_>
Ok, I think I have a better understanding of what you want to do.
Have a look at this code. It's not OO but illustrates the idea.
from multiprocessing import Process, Queue, Pipe
from time import sleep
import random

proc = None
inq = None
outq = None


def createWorker():
    global inq, outq, proc
    inq = Queue()
    outq = Queue()
    proc = Process(target=worker, args=(inq, outq))
    proc.start()


def worker(inq, outq):
    print "Worker started"
    while True:
        url = inq.get()
        secs = random.randint(1, 5)
        print "processing", url, " sleeping for", secs
        sleep(secs)
        outq.put(url + " done")


def callWithTimeout(arg):
    global proc, inq, outq
    inq.put(arg)
    result = None
    while result is None:
        try:
            result = outq.get(timeout=4)
        except:
            print "restarting worker process"
            proc.terminate()
            createWorker()
            inq.put(arg)
    return result


def main():
    global proc, inq, outq
    createWorker()
    for arg in ["foo", "bar", "baz", "quux"]:
        res = callWithTimeout(arg)
        print "res =", res
    proc.terminate()


main()
It uses two queues - one for sending messages to the worker process and one for receiving the results. You could also use pipes. Also, new queues are created when the worker process is restarted - this is to avoid a possible race condition.
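For completeness, here is a rough sketch of the pipe-based variant mentioned above (my own variation, not tested against the original); one Pipe end replaces the two queues:

from multiprocessing import Process, Pipe

def pipe_worker(conn):
    # same idea as worker() above, but reads requests and writes results on one pipe end
    while True:
        url = conn.recv()
        conn.send(url + " done")

def create_pipe_worker():
    parent_conn, child_conn = Pipe()
    proc = Process(target=pipe_worker, args=(child_conn,))
    proc.start()
    return parent_conn, proc

def call_with_timeout(parent_conn, proc, arg, timeout=4):
    parent_conn.send(arg)
    if parent_conn.poll(timeout):   # wait up to `timeout` seconds for a reply
        return parent_conn.recv()
    proc.terminate()                # no reply in time; the caller should recreate the worker
    return None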
Edit: Just saw your edit - looks like the same idea.