Running Python processes until first result - python

Edited
I'm trying to run few Python processes, and want to kill all of them as soon as I get
a result from one of them.
edit: How do I do that?
In the code below, we can see a loop that initiates 10 processes,
and prints "hello world (i)". How can I stop after the first print?
I'll put a small example(modified from https://docs.python.org/3/library/multiprocessing.html#module-multiprocessing)
# MAIN
from multiprocessing import Process, Lock
import globals
import globalsOperations
globals.init()
def f(l, i):
# l.acquire()
# try:
if not globalsOperations.get_my_bool_state():
print(globalsOperations.get_my_bool_state())
print('hello world', i)
globalsOperations.set_my_bool_state(True)
print(globalsOperations.get_my_bool_state())
# finally:
# l.release()
if __name__ == '__main__':
lock = Lock()
for num in range(10):
Process(target=f, args=(lock, num)).start()
# global.py
def init():
global my_bool
my_bool = False
#globalsOperations.py
import globals
def set_my_bool_state(bool_value):
globals.my_bool = bool_value
def get_my_bool_state():
return globals.my_bool
Lock is commented because I've tried to stop after the first success, with no luck.
So- to the question- how do I stop after the first result?
preferably with no memory leaks when releasing the processes..
(I'm not asking a lot of questions here so don't be too harsh on me :) )
thanks!

Your biggest problem is the failure to recognize that each process has its own copy of memory so when one process modifies a global variable the memory spaces of other processes have not been updated. In short, your program cannot possibly work. So globals either has to be located in shared memory or can be a managed object represented by a proxy. I have used the latter since how you would access your global data would require the fewer syntactical changes. This is a huge topic. See this.
Second, I would suggest using a multiprocessing pool, e.g. a multiprocessing.pool.Pool instance combined with the imap_unordered method rather than individual multiprocessingProcess instances. The imap_unordered method returns an iterator that you can use to iterate results from your worker function f as soon as they become available. You need to now modify f to return True or False based upon whether its invocation was the first to set globals.my_bool to True or not. As soon as the main process gets a True result, it can issue method terminate on the pool, killing any tasks that are running or scheduled to run.
There will be some lag before the main process detects that a task completed successfully and its termination of the remaining tasks. In that window of time, a few of the other submitted tasks can be running to completion.
Finally, globals is a built-in function name and should not be used for other purposes, such as the name of a module or variable. So I will be using the name gbls instead.
And you do need to use locking or multiple tasks can think that they are the first to succeed.
There is a lot here for you to be investigating:
from multiprocessing import Manager, Pool, Lock
def init_processes(g, l):
"""
Initialize the global variable(s) for each process
in the multiprocessing pool.
In this case we initialize variable gbls with a proxy to a
managed Namespace object.
"""
global gbls, lock
gbls, lock = g, l
def set_my_bool_state(bool_value):
gbls.my_bool = bool_value
def get_my_bool_state():
return gbls.my_bool
def f(i):
with lock:
if not get_my_bool_state():
print(get_my_bool_state())
print('hello world', i, flush=True)
set_my_bool_state(True)
print(get_my_bool_state())
return True # we were the first to succeed
else:
# A few of these might print before the pool is terminated:
print('Already set.', i, flush=True)
return False # we were not the first to succeed
if __name__ == '__main__':
with Manager() as manager:
gbls = manager.Namespace()
gbls.my_bool = False
lock = Lock()
pool = Pool(10, initializer=init_processes, initargs=(gbls, lock))
for result in pool.imap_unordered(f, range(10)):
if result: # first to succeed:
break
pool.terminate() # kill all remaining tasks
# Wait for all processes to end:
pool.join()
Prints:
False
hello world 0
True
Already set. 1
Already set. 2

Related

Creating a process that creates a thread which again updates a global variable

Currently, I am trying to spawn a process in a Python program which again creates threads that continuously update variables in the process address space. So far I came up with this code which runs, but the update of the variable seems not to be propagated to the process level. I would have expected that defining a variable in the process address space and using global in the thread (which shares the address space of the process) would allow the thread to manipulate the variable and propagate the changes to the process.
Below is a minimal example of the problem:
import multiprocessing
import threading
import time
import random
def process1():
lst = {}
url = "url"
thrd = threading.Thread(target = urlCaller, args = (url,))
print("process alive")
thrd.start()
while True:
# the process does some CPU intense calculation
print(lst)
time.sleep(2)
def urlCaller(url):
global lst
while True:
# the thread continuously pulls data from an API
# this is I/O heavy and therefore done by a thread
lst = {random.randint(1,9), random.randint(20,30)}
print(lst)
time.sleep(2)
prcss = multiprocessing.Process(target = process1)
prcss.start()
The process always prints an empty list while the thread prints, as expected, a list with two integers. I would expect that the process prints a list with two integers as well.
(Note: I am using Spyder as IDE and somehow there is only printed something to the console if I run this code on Linux/Ubuntu but nothing is printed to the console if I run the exact same code in Spyder on Windows.)
I am aware that the use of global variables is not always a good solution but I think it serves the purpose well in this case.
You might wonder why I want to create a thread within a process. Basically, I need to run the same complex calculation on different data sets that constantly change. Hence, I need multiple processes (one for each data set) to optimize the utilization of my CPU and use threads within the processes to make the I/O process most efficient. The data depreciates very fast, therefore, I cannot just store it in a database or file, which would of course simplify the communication process between data producer (thread) and data consumer (process).
You are defining a local variable lst inside the function process1, so what urlCaller does is irrelevant, it cannot change the local variable of a different function. urlCaller is defining a global variable but process1 can never see it because it's shadowed by the local variable you defined.
You need to remove lst = {} from that function and find an other way to return a value or declare the variable global there too:
def process1():
global lst
lst = {}
url = "url"
thrd = threading.Thread(target = urlCaller, args = (url,))
print("process alive")
thrd.start()
while True:
# the process does some CPU intense calculation
print(lst)
time.sleep(2)
I'd use something like concurrent.futures instead of the threading module directly.
Thanks to the previous answer, I figured out that it's best to implement a process class and define "thread-functions" within this class. Now, the threads can access a shared variable and manipulate this variable without the need of using "thread.join()" and terminating a thread.
Below is a minimal example in which 2 concurrent threads provide data for a parent process.
import multiprocessing
import threading
import time
import random
class process1(multiprocessing.Process):
lst = {}
url = "url"
def __init__(self, url):
super(process1, self).__init__()
self.url = url
def urlCallerInt(self, url):
while True:
self.lst = {random.randint(1,9), random.randint(20,30)}
time.sleep(2)
def urlCallerABC(self, url):
while True:
self.lst = {"Ab", "cD"}
time.sleep(5)
def run(self):
t1 = threading.Thread(target = self.urlCallerInt, args=(self.url,))
t2 = threading.Thread(target = self.urlCallerABC, args=(self.url,))
t1.start()
t2.start()
while True:
print(self.lst)
time.sleep(1)
p1 = process1("url")
p1.start()

A simple way to run a piece of python code in parallel?

I have this very simple python code:
Test = 1;
def para():
while(True):
if Test > 10:
print("Test is bigger than ten");
time.sleep(1);
para(); # I want this to start in parallel, so that the code below keeps executing without waiting for this function to finish
while(True):
Test = random.randint(1,42);
time.sleep(1);
if Test == 42:
break;
...#stop the parallel execution of the para() here (kill it)
..some other code here
Basically, I want to run the function para() in parallel to the other code, so that the code below it doesn't have to wait for the para() to end.
However, I want to be able to access the current value of the Test variable, inside of the para() while it is running in parallel (as seen in the code example above). Later, when I decide, that I am done with the para() running in parallel, I would like to know how to kill it both from the main thread, but also from within the parallel-ly running para() itself (self-terminate).
I have read some tutorials on threading, but almost every tutorial approaches it differently, plus I had a trouble understanding some of it, so I would like to know, what is the easiest way to run a piece of code in parallel.
Thank you.
Okay, first, here is an answer to your question, verbatim and in the simplest possible way. After that, we answer a little more fully with two examples that show two ways to do this and share access to data between the main and parallel code.
import random
from threading import Thread
import time
Test = 1;
stop = False
def para():
while not stop:
if Test > 10:
print("Test is bigger than ten");
time.sleep(1);
# I want this to start in parallel, so that the code below keeps executing without waiting for this function to finish
thread = Thread(target=para)
thread.start()
while(True):
Test = random.randint(1,42);
time.sleep(1);
if Test == 42:
break;
#stop the parallel execution of the para() here (kill it)
stop = True
thread.join()
#..some other code here
print( 'we have stopped' )
And now, the more complete answer:
In the following we show two code examples (listed below) that demonstrate (a) parallel execution using the threading interface, and (b) using the multiprocessing interface. Which of these you choose to use, depends on what you are trying to do. Threading can be a good choice when the purpose of the second thread is to wait for I/O, and multiprocessing can be a good choice when the second thread is for doing cpu intensive calculations.
In your example, the main code changed a variable and the parallel code only examined the variable. Things are different if you want to change a variable from both, for example to reset a shared counter. So, we will show you how to do that also.
In the following example codes:
The variables "counter" and "run" and "lock" are shared between the main program and the code executed in parallel.
The function myfunc(), is executed in parallel. It loops over updating counter and sleeping, until run is set to false, by the main program.
The main program loops over printing the value of counter until it reaches 5, at which point it resets the counter. Then, after it reaches 5 again, it sets run to false and finally, it waits for the thread or process to exit before exiting itself.
You might notice that counter is incremented inside of calls to lock.acquire() and lock.release() in the first example, or with lock in the second example.
Incrementing a counter comprises three steps, (1) reading the current value, (2) adding one to it, and then (3) storing the result back into the counter. The problem comes when one thread tries to set the counter at the same time that this is happening.
We solve this by having both the main program and the parallel code acquire a lock before they change the variable, and then release it when they are done. If the lock is already taken, the program or parallel code waits until it is released. This synchronizes their access to change the shared data, i.e. the counter. (Aside, see semaphore for another kind of synchronization).
With that introduction, here is the first example, which uses threads:
# Parallel code with shared variables, using threads
from threading import Lock, Thread
from time import sleep
# Variables to be shared across threads
counter = 0
run = True
lock = Lock()
# Function to be executed in parallel
def myfunc():
# Declare shared variables
global run
global counter
global lock
# Processing to be done until told to exit
while run:
sleep( 1 )
# Increment the counter
lock.acquire()
counter = counter + 1
lock.release()
# Set the counter to show that we exited
lock.acquire()
counter = -1
lock.release()
print( 'thread exit' )
# ----------------------------
# Launch the parallel function as a thread
thread = Thread(target=myfunc)
thread.start()
# Read and print the counter
while counter < 5:
print( counter )
sleep( 1 )
# Change the counter
lock.acquire()
counter = 0
lock.release()
# Read and print the counter
while counter < 5:
print( counter )
sleep( 1 )
# Tell the thread to exit and wait for it to exit
run = False
thread.join()
# Confirm that the thread set the counter on exit
print( counter )
And here is the second example, which uses multiprocessing. Notice that there are some extra steps involved to access the shared variables.
from time import sleep
from multiprocessing import Process, Value, Lock
def myfunc(counter, lock, run):
while run.value:
sleep(1)
with lock:
counter.value += 1
print( "thread %d"%counter.value )
with lock:
counter.value = -1
print( "thread exit %d"%counter.value )
# =======================
counter = Value('i', 0)
run = Value('b', True)
lock = Lock()
p = Process(target=myfunc, args=(counter, lock, run))
p.start()
while counter.value < 5:
print( "main %d"%counter.value )
sleep(1)
with lock:
counter.value = 0
while counter.value < 5:
print( "main %d"%counter.value )
sleep(1)
run.value = False
p.join()
print( "main exit %d"%counter.value)
Rather than manually starting threads, much better just use multiprocessing.pool. The multiprocessing part needs to be in a function that you call with map. Instead of map you can then use pool.imap.
import multiprocessing
import time
def func(x):
time.sleep(x)
return x + 2
if __name__ == "__main__":
p = multiprocessing.Pool()
start = time.time()
for x in p.imap(func, [1,5,3]):
print("{} (Time elapsed: {}s)".format(x, int(time.time() - start)))
Also check out:
multiprocessing.Pool: What's the difference between map_async and imap?
Also worth checking out is functools.partials which can be used to pass in multiple variables (in addition to the list).
Another trick: sometimes you don’t l really need multiprocessing (as in multiple cores of your processor), but just multiple threads to concurrently query a database with many connections at the same time. In that case just do from multiprocessing.dummy import Pool and can avoid python from spawning a separate process (which makes you lose access to all the namespaces you don’t pass into the function), but keep all the benefits of a pool, just in a single cpu core. That’s all you need to know about python multi processing (using multiple cores) and multithreading (using just one process and keeping the global interpreter lock intact).
Another little advice: always try to use map first without any pools. Then switch to pool.imap in the next step once you’re sure it all works.

Can't access global variable in python

I'm using multi processing library in python in code below:
from multiprocessing import Process
import os
from time import sleep as delay
test = "First"
def f():
global test
print('hello')
print("before: "+test)
test = "Second"
if __name__ == '__main__':
p = Process(target=f, args=())
p.start()
p.join()
delay(1)
print("after: "+test)
It's supposed to change the value of test so at last the value of test must be Second, but the value doesn't change and remains First.
here is the output:
hello
before: First
after: First
The behavior you're seeing is because p is a new process, not a new thread. When you spawn a new process, it copies your initial process's state completely and then starts executing in parallel. When you spawn a thread, it shares memory with your initial thread.
Since processes have memory isolation, they won't create race-condition errors caused by reading and writing to shared memory. However, to get data from your child process back into the parent, you'll need to use some form of inter-process communication like a pipe, and because they fork memory, they are more expensive to spawn. As always in computer science, you have to make a tradeoff.
For more information, see:
https://en.wikipedia.org/wiki/Process_(computing)
https://en.wikipedia.org/wiki/Thread_(computing)
https://en.wikipedia.org/wiki/Inter-process_communication
Based on what you're actually trying to accomplish, consider using threads instead.
Global state is not shared so the changes made by child processes has no effect.
Here is why:
Actually it does change the global variable but only for the spawned
process. If you would access it within your process you can see it. As
its a process your global variable environment will be initialized but
the modification you make will be limited to the process itself and
not the whole.
Try this It explains whats happening
from multiprocessing import Process
import os
from time import sleep as delay
test = "First"
def f2():
print ("f2:" + test)
def f():
global test
print ('hello')
print ("before: "+test)
test = "Second"
f2()
if __name__ == '__main__':
p = Process(target=f, args=())
p.start()
p.join()
delay(1)
print("after: "+test)
If you really need to use modify from the process their's another way of doing it, read this doc or post it might help you.

How to list Processes started by multiprocessing Pool?

While attempting to store multiprocessing's process instance in multiprocessing list-variable 'poolList` I am getting a following exception:
SimpleQueue objects should only be shared between processes through inheritance
The reason why I would like to store the PROCESS instances in a variable is to be able to terminate all or just some of them later (if for example a PROCESS freezes). If storing a PROCESS in variable is not an option I would like to know how to get or to list all the PROCESSES started by mutliprocessing POOL. That would be very similar to what .current_process() method does. Except .current_process gets only a single process while I need all the processes started or all the processes currently running.
Two questions:
Is it even possible to store an instance of the Process (as a result of mp.current_process()
Currently I am only able to get a single process from inside of the function that the process is running (from inside of myFunct() using .current_process() method).
Instead I would like to to list all the processes currently running by multiprocessing. How to achieve it?
import multiprocessing as mp
poolList=mp.Manager().list()
def myFunct(arg):
print 'myFunct(): current process:', mp.current_process()
try: poolList.append(mp.current_process())
except Exception, e: print e
for i in range(110):
for n in range(500000):
pass
poolDict[arg]=i
print 'myFunct(): completed', arg, poolDict
from multiprocessing import Pool
pool = Pool(processes=2)
myArgsList=['arg1','arg2','arg3']
pool=Pool(processes=2)
pool.map_async(myFunct, myArgsList)
pool.close()
pool.join()
To list the processes started by a Pool()-instance(which is what you mean if I understand you correctly), there is the pool._pool-list. And it contains the instances of the processes.
However, it is not part of the documented interface and hence, really should not be used.
BUT...it seems a little bit unlikely that it would change just like that anyway. I mean, should they stop having an internal list of processes in the pool? And not call that _pool?
And also, it annoys me that there at least isn't a get processes-method. Or something.
And handling it breaking due to some name change should not be that difficult.
But still, use at your own risk:
from multiprocessing import pool
# Have to run in main
if __name__ == '__main__':
# Create 3 worker processes
_my_pool = pool.Pool(3)
# Loop, terminate, and remove from the process list
# Use a copy [:] of the list to remove items correctly
for _curr_process in _my_pool._pool[:]:
print("Terminating process "+ str(_curr_process.pid))
_curr_process.terminate()
_my_pool._pool.remove(_curr_process)
# If you call _repopulate, the pool will again contain 3 worker processes.
_my_pool._repopulate_pool()
for _curr_process in _my_pool._pool[:]:
print("After repopulation "+ str(_curr_process.pid))
The example creates a pool and manually terminates all processes.
It is important that you remember to delete the process you terminate from the pool yourself i you want Pool() to continue working as usual.
_my_pool._repopulate increases the number of working processes to 3 again, not needed to answer the question, but gives a little bit of behind-the-scenes insight.
Yes you can get all active process and perform action based on name of process
e.g
multiprocessing.Process(target=foo, name="refresh-reports")
and then
for p in multiprocessing.active_children():
if p.name == "foo":
p.terminate()
You're creating a managed List object, but then letting the associated Manager object expire.
Process objects are shareable because they aren't pickle-able; that is, they aren't simple.
Oddly the multiprocessing module doesn't have the equivalent of threading.enumerate() -- that is, you can't list all outstanding processes. As a workaround, I just store procs in a list. I never terminate() a process, but do sys.exit(0) in the parent. It's rough, because the workers will leave things in an inconsistent state, but it's okay for smaller programs
To kill a frozen worker, I suggest: 1) worker receives "heartbeat" jobs in a queue every now and then, 2) if parent notices worker A hasn't responded to a heartbeat in a certain amount of time, then p.terminate(). Consider restating the problem in another SO question, as it's interesting.
To be honest the map stuff is much easier than using a Manager.
Here's a Manager example I've used. A worker adds stuff to a shared list. Another worker occasionally wakes up, processes everything on the list, then goes back to sleep. The code also has verbose logs, which are essential for ease in debugging.
source
# producer adds to fixed-sized list; scanner uses them
import logging, multiprocessing, sys, time
def producer(objlist):
'''
add an item to list every sec; ensure fixed size list
'''
logger = multiprocessing.get_logger()
logger.info('start')
while True:
try:
time.sleep(1)
except KeyboardInterrupt:
return
msg = 'ding: {:04d}'.format(int(time.time()) % 10000)
logger.info('put: %s', msg)
del objlist[0]
objlist.append( msg )
def scanner(objlist):
'''
every now and then, run calculation on objlist
'''
logger = multiprocessing.get_logger()
logger.info('start')
while True:
try:
time.sleep(5)
except KeyboardInterrupt:
return
logger.info('items: %s', list(objlist))
def main():
logger = multiprocessing.log_to_stderr(
level=logging.INFO
)
logger.info('setup')
# create fixed-length list, shared between producer & consumer
manager = multiprocessing.Manager()
my_objlist = manager.list( # pylint: disable=E1101
[None] * 10
)
multiprocessing.Process(
target=producer,
args=(my_objlist,),
name='producer',
).start()
multiprocessing.Process(
target=scanner,
args=(my_objlist,),
name='scanner',
).start()
logger.info('running forever')
try:
manager.join() # wait until both workers die
except KeyboardInterrupt:
pass
logger.info('done')
if __name__=='__main__':
main()

multiple output returned from python multiprocessing function

I am trying to use multiprocessing to return a list, but instead of waiting until all processes are done, I get several returns from one return statement in mp_factorizer, like this:
None
None
(returns list)
in this example I used 2 threads. If I used 5 threads, there would be 5 None returns before the list is being put out. Here is the code:
def mp_factorizer(nums, nprocs, objecttouse):
if __name__ == '__main__':
out_q = multiprocessing.Queue()
chunksize = int(math.ceil(len(nums) / float(nprocs)))
procs = []
for i in range(nprocs):
p = multiprocessing.Process(
target=worker,
args=(nums[chunksize * i:chunksize * (i + 1)],
out_q,
objecttouse))
procs.append(p)
p.start()
# Collect all results into a single result dict. We know how many dicts
# with results to expect.
resultlist = []
for i in range(nprocs):
temp=out_q.get()
index =0
for i in temp:
resultlist.append(temp[index][0][0:])
index +=1
# Wait for all worker processes to finish
for p in procs:
p.join()
resultlist2 = [x for x in resultlist if x != []]
return resultlist2
def worker(nums, out_q, objecttouse):
""" The worker function, invoked in a process. 'nums' is a
list of numbers to factor. The results are placed in
a dictionary that's pushed to a queue.
"""
outlist = []
for n in nums:
outputlist=objecttouse.getevents(n)
if outputlist:
outlist.append(outputlist)
out_q.put(outlist)
mp_factorizer gets a list of items, # of threads, and an object that the worker should use, it then splits up the list of items so all threads get an equal amount of the list, and starts the workers.
The workers then use the object to calculate something from the given list, add the result to the queue.
Mp_factorizer is supposed to collect all results from the queue, merge them to one large list and return that list. However - I get multiple returns.
What am I doing wrong? Or is this expected behavior due to the strange way windows handles multiprocessing?
(Python 2.7.3, Windows7 64bit)
EDIT:
The problem was the wrong placement of if __name__ == '__main__':. I found out while working on another problem, see using multiprocessing in a sub process for a complete explanation.
if __name__ == '__main__' is in the wrong place. A quick fix would be to protect only the call to mp_factorizer like Janne Karila suggested:
if __name__ == '__main__':
print mp_factorizer(list, 2, someobject)
However, on windows the main file will be executed once on execution + once for every worker thread, in this case 2. So this would be a total of 3 executions of the main thread, excluding the protected part of the code.
This can cause problems as soon as there are other computations being made in the same main thread, and at the very least unnecessarily slow down performance. Even though only the worker function should be executed several times, in windows everything will be executed thats not protected by if __name__ == '__main__'.
So the solution would be to protect the whole main process by executing all code only after
if __name__ == '__main__'.
If the worker function is in the same file, however, it needs to be excluded from this if statement because otherwise it can not be called several times for multiprocessing.
Pseudocode main thread:
# Import stuff
if __name__ == '__main__':
#execute whatever you want, it will only be executed
#as often as you intend it to
#execute the function that starts multiprocessing,
#in this case mp_factorizer()
#there is no worker function code here, it's in another file.
Even though the whole main process is protected, the worker function can still be started, as long as it is in another file.
Pseudocode main thread, with worker function:
# Import stuff
#If the worker code is in the main thread, exclude it from the if statement:
def worker():
#worker code
if __name__ == '__main__':
#execute whatever you want, it will only be executed
#as often as you intend it to
#execute the function that starts multiprocessing,
#in this case mp_factorizer()
#All code outside of the if statement will be executed multiple times
#depending on the # of assigned worker threads.
For a longer explanation with runnable code, see using multiprocessing in a sub process
Your if __name__ == '__main__' statement is in the wrong place. Put it around the print statement to prevent the subprocesses from executing that line:
if __name__ == '__main__':
print mp_factorizer(list, 2, someobject)
Now you have the if inside mp_factorizer, which makes the function return None when called inside a subprocess.

Categories