I'm using the multiprocessing library in Python in the code below:
from multiprocessing import Process
import os
from time import sleep as delay
test = "First"
def f():
    global test
    print('hello')
    print("before: "+test)
    test = "Second"

if __name__ == '__main__':
    p = Process(target=f, args=())
    p.start()
    p.join()
    delay(1)
    print("after: "+test)
It's supposed to change the value of test, so at the end the value of test should be Second, but the value doesn't change and remains First.
Here is the output:
hello
before: First
after: First
The behavior you're seeing is because p is a new process, not a new thread. When you spawn a new process, it copies your initial process's state completely and then starts executing in parallel. When you spawn a thread, it shares memory with your initial thread.
Since processes have memory isolation, they won't create race-condition errors caused by reading and writing shared memory. However, to get data from your child process back into the parent, you'll need to use some form of inter-process communication, such as a pipe. And because they copy the parent's memory, processes are more expensive to spawn than threads. As always in computer science, you have to make a tradeoff.
For more information, see:
https://en.wikipedia.org/wiki/Process_(computing)
https://en.wikipedia.org/wiki/Thread_(computing)
https://en.wikipedia.org/wiki/Inter-process_communication
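For example, a minimal sketch of sending the modified value back to the parent through a Pipe (this is not from the original post, just an illustration):

from multiprocessing import Process, Pipe

def f(conn):
    test = "Second"       # the child's own copy is modified...
    conn.send(test)       # ...so send it back to the parent explicitly
    conn.close()

if __name__ == '__main__':
    parent_conn, child_conn = Pipe()
    p = Process(target=f, args=(child_conn,))
    p.start()
    print("after: " + parent_conn.recv())   # prints "after: Second"
    p.join()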
Based on what you're actually trying to accomplish, consider using threads instead.
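A rough sketch of the same example using a thread, which does share memory with the main thread (again just an illustration):

from threading import Thread

test = "First"

def f():
    global test
    test = "Second"       # threads share the parent's memory

if __name__ == '__main__':
    t = Thread(target=f)
    t.start()
    t.join()
    print("after: " + test)   # prints "after: Second"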
Global state is not shared, so the changes made by the child process have no effect on the parent.
Here is why:
Actually, it does change the global variable, but only inside the spawned process. If you access the variable within that process, you can see the new value. Because it is a separate process, the global environment is copied at startup, and any modification you make stays local to that process rather than affecting the parent.
Try this; it explains what's happening:
from multiprocessing import Process
import os
from time import sleep as delay
test = "First"
def f2():
    print("f2:" + test)

def f():
    global test
    print('hello')
    print("before: "+test)
    test = "Second"
    f2()

if __name__ == '__main__':
    p = Process(target=f, args=())
    p.start()
    p.join()
    delay(1)
    print("after: "+test)
If you really do need to modify the value from the child process, there's another way of doing it; read this doc or post, it might help you.
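As a rough sketch of one such approach, assuming a Manager dict is acceptable (the names here are illustrative, not from the post):

from multiprocessing import Process, Manager

def f(shared):
    shared['test'] = "Second"   # writes go through the manager proxy

if __name__ == '__main__':
    with Manager() as manager:
        shared = manager.dict({'test': "First"})
        p = Process(target=f, args=(shared,))
        p.start()
        p.join()
        print("after: " + shared['test'])   # prints "after: Second"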
I'm trying to run a few Python processes, and I want to kill all of them as soon as I get a result from one of them. How do I do that?
In the code below, a loop starts 10 processes, each of which prints "hello world (i)". How can I stop after the first print?
I'll put a small example (modified from https://docs.python.org/3/library/multiprocessing.html#module-multiprocessing):
# MAIN
from multiprocessing import Process, Lock
import globals
import globalsOperations

globals.init()

def f(l, i):
    # l.acquire()
    # try:
    if not globalsOperations.get_my_bool_state():
        print(globalsOperations.get_my_bool_state())
        print('hello world', i)
        globalsOperations.set_my_bool_state(True)
        print(globalsOperations.get_my_bool_state())
    # finally:
    #     l.release()

if __name__ == '__main__':
    lock = Lock()
    for num in range(10):
        Process(target=f, args=(lock, num)).start()

# globals.py
def init():
    global my_bool
    my_bool = False

# globalsOperations.py
import globals

def set_my_bool_state(bool_value):
    globals.my_bool = bool_value

def get_my_bool_state():
    return globals.my_bool
The Lock calls are commented out because I tried using them to stop after the first success, with no luck.
So, to the question: how do I stop after the first result?
Preferably with no memory leaks when releasing the processes.
(I'm not asking a lot of questions here so don't be too harsh on me :) )
thanks!
Your biggest problem is the failure to recognize that each process has its own copy of memory, so when one process modifies a global variable the memory spaces of the other processes are not updated. In short, your program cannot possibly work as written. Your global state either has to be located in shared memory or has to be a managed object represented by a proxy. I have used the latter, since it requires the fewest syntactic changes to how you access your global data. This is a huge topic. See this.
Second, I would suggest using a multiprocessing pool, e.g. a multiprocessing.pool.Pool instance combined with the imap_unordered method, rather than individual multiprocessing.Process instances. The imap_unordered method returns an iterator that you can use to iterate over results from your worker function f as soon as they become available. You now need to modify f to return True or False based upon whether its invocation was the first to set globals.my_bool to True or not. As soon as the main process gets a True result, it can call terminate on the pool, killing any tasks that are running or scheduled to run.
There will be some lag before the main process detects that a task completed successfully and its termination of the remaining tasks. In that window of time, a few of the other submitted tasks can be running to completion.
Finally, globals is a built-in function name and should not be used for other purposes, such as the name of a module or variable. So I will be using the name gbls instead.
And you do need to use locking or multiple tasks can think that they are the first to succeed.
There is a lot here for you to be investigating:
from multiprocessing import Manager, Pool, Lock
def init_processes(g, l):
    """
    Initialize the global variable(s) for each process
    in the multiprocessing pool.

    In this case we initialize variable gbls with a proxy to a
    managed Namespace object.
    """
    global gbls, lock
    gbls, lock = g, l

def set_my_bool_state(bool_value):
    gbls.my_bool = bool_value

def get_my_bool_state():
    return gbls.my_bool

def f(i):
    with lock:
        if not get_my_bool_state():
            print(get_my_bool_state())
            print('hello world', i, flush=True)
            set_my_bool_state(True)
            print(get_my_bool_state())
            return True  # we were the first to succeed
        else:
            # A few of these might print before the pool is terminated:
            print('Already set.', i, flush=True)
            return False  # we were not the first to succeed

if __name__ == '__main__':
    with Manager() as manager:
        gbls = manager.Namespace()
        gbls.my_bool = False
        lock = Lock()

        pool = Pool(10, initializer=init_processes, initargs=(gbls, lock))
        for result in pool.imap_unordered(f, range(10)):
            if result:  # first to succeed:
                break
        pool.terminate()  # kill all remaining tasks
        # Wait for all processes to end:
        pool.join()
Prints:
False
hello world 0
True
Already set. 1
Already set. 2
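As an aside, the shared-memory alternative mentioned above (a multiprocessing.Value flag instead of a managed Namespace) might look roughly like this; treat it as a sketch rather than a drop-in replacement:

from multiprocessing import Pool, Lock, Value

def init_processes(b, l):
    global my_bool, lock
    my_bool, lock = b, l

def f(i):
    with lock:
        if not my_bool.value:
            my_bool.value = True            # flip the flag held in shared memory
            print('hello world', i, flush=True)
            return True
        return False

if __name__ == '__main__':
    my_bool = Value('b', False)             # one byte of shared memory
    lock = Lock()
    with Pool(10, initializer=init_processes, initargs=(my_bool, lock)) as pool:
        for result in pool.imap_unordered(f, range(10)):
            if result:
                break
        pool.terminate()                    # kill any remaining tasks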
Currently, I am trying to spawn a process in a Python program which again creates threads that continuously update variables in the process address space. So far I came up with this code which runs, but the update of the variable seems not to be propagated to the process level. I would have expected that defining a variable in the process address space and using global in the thread (which shares the address space of the process) would allow the thread to manipulate the variable and propagate the changes to the process.
Below is a minimal example of the problem:
import multiprocessing
import threading
import time
import random
def process1():
    lst = {}
    url = "url"
    thrd = threading.Thread(target = urlCaller, args = (url,))
    print("process alive")
    thrd.start()

    while True:
        # the process does some CPU intense calculation
        print(lst)
        time.sleep(2)

def urlCaller(url):
    global lst
    while True:
        # the thread continuously pulls data from an API
        # this is I/O heavy and therefore done by a thread
        lst = {random.randint(1,9), random.randint(20,30)}
        print(lst)
        time.sleep(2)

prcss = multiprocessing.Process(target = process1)
prcss.start()
The process always prints an empty list while the thread prints, as expected, a list with two integers. I would expect that the process prints a list with two integers as well.
(Note: I am using Spyder as IDE and somehow there is only printed something to the console if I run this code on Linux/Ubuntu but nothing is printed to the console if I run the exact same code in Spyder on Windows.)
I am aware that the use of global variables is not always a good solution but I think it serves the purpose well in this case.
You might wonder why I want to create a thread within a process. Basically, I need to run the same complex calculation on different data sets that constantly change. Hence, I need multiple processes (one for each data set) to optimize the utilization of my CPU and use threads within the processes to make the I/O process most efficient. The data depreciates very fast, therefore, I cannot just store it in a database or file, which would of course simplify the communication process between data producer (thread) and data consumer (process).
You are defining a local variable lst inside the function process1, so what urlCaller does is irrelevant; it cannot change the local variable of a different function. urlCaller defines a global variable, but process1 can never see it because it's shadowed by the local variable you defined.
You need to remove lst = {} from that function and find another way to return a value, or declare the variable global there too:
def process1():
    global lst
    lst = {}
    url = "url"
    thrd = threading.Thread(target = urlCaller, args = (url,))
    print("process alive")
    thrd.start()

    while True:
        # the process does some CPU intense calculation
        print(lst)
        time.sleep(2)
I'd use something like concurrent.futures instead of the threading module directly.
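A minimal sketch of that idea (the names are illustrative; the worker returns its data instead of writing to a global):

import concurrent.futures
import multiprocessing
import random
import time

def url_caller(url):
    # pretend to pull one batch of data from an API
    time.sleep(1)
    return {random.randint(1, 9), random.randint(20, 30)}

def process1():
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as executor:
        while True:
            future = executor.submit(url_caller, "url")
            lst = future.result()      # the data comes back as a return value
            print(lst)                 # ...instead of through a global
            time.sleep(2)

if __name__ == '__main__':
    multiprocessing.Process(target=process1).start()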
Thanks to the previous answer, I figured out that it's best to implement a process class and define the "thread functions" within this class. Now the threads can access and manipulate a shared variable without needing to call "thread.join()" and terminate a thread.
Below is a minimal example in which 2 concurrent threads provide data for a parent process.
import multiprocessing
import threading
import time
import random
class process1(multiprocessing.Process):
    lst = {}
    url = "url"

    def __init__(self, url):
        super(process1, self).__init__()
        self.url = url

    def urlCallerInt(self, url):
        while True:
            self.lst = {random.randint(1,9), random.randint(20,30)}
            time.sleep(2)

    def urlCallerABC(self, url):
        while True:
            self.lst = {"Ab", "cD"}
            time.sleep(5)

    def run(self):
        t1 = threading.Thread(target = self.urlCallerInt, args=(self.url,))
        t2 = threading.Thread(target = self.urlCallerABC, args=(self.url,))
        t1.start()
        t2.start()

        while True:
            print(self.lst)
            time.sleep(1)

p1 = process1("url")
p1.start()
Earlier I tried to use the threading module in Python to create multiple threads. Then I learned about the GIL and how it does not allow taking advantage of multiple CPU cores on a single machine. So now I'm trying multiprocessing (I don't strictly need separate threads).
Here is a sample code I wrote to see if distinct processes are being created. But as can be seen in the output below, I'm getting the same process ID every time. So multiple processes are not being created. What am I missing?
import multiprocessing as mp
import os
def pri():
    print(os.getpid())

if __name__=='__main__':
    # Checking number of CPU cores
    print(mp.cpu_count())

    processes=[mp.Process(target=pri()) for x in range(1,4)]
    for p in processes:
        p.start()
    for p in processes:
        p.join()
Output:
4
12554
12554
12554
The Process class requires a callable as its target.
Instead of running the function in the separate process, you are calling it and passing its result (None in this case) to the Process class.
Just change the following:
mp.Process(target=pri())
with:
mp.Process(target=pri)
Since the subprocesses run as separate processes, you won't see their print statements. They also don't share the same memory space. You pass pri() to target, where it needs to be pri. You need to pass a callable object, not execute it.
The prints you see are part of your main thread's execution. Because you pass pri(), the code is actually executed in the parent. You need to change your code so the pri function returns a value rather than printing it.
Then you need to implement a queue: all your workers write to it, and when they're done, your main thread reads from it.
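A rough sketch of that queue idea applied to your example (assuming you just want the worker PIDs back in the parent):

import multiprocessing as mp
import os

def pri(q):
    q.put(os.getpid())                 # return the result through the queue

if __name__ == '__main__':
    q = mp.Queue()
    processes = [mp.Process(target=pri, args=(q,)) for _ in range(3)]
    for p in processes:
        p.start()
    for _ in processes:
        print(q.get())                 # one PID per worker, each different
    for p in processes:
        p.join()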
A nice feature of the multiprocessing module is the Pool object. It allows you to create a pool of worker processes and then just use it. It's more convenient.
I have tried your code; the thing is the command executes too quickly, so the OS reuses the PIDs. If you add a time.sleep(1) in your pri function, it works as you would expect.
That is true only on Windows; the example below was made on a Windows platform. On Unix-like machines, you won't need the sleep.
The more convenient solution looks like this:
from multiprocessing import Pool
from time import sleep
import os
def pri(x):
    sleep(1)
    return os.getpid()

def use_procs():
    p_pool = Pool(4)
    p_results = p_pool.map(pri, [_ for _ in range(1,4)])
    p_pool.close()
    p_pool.join()
    return p_results

if __name__ == '__main__':
    res = use_procs()
    for r in res:
        print r
Without the sleep:
==================== RESTART: C:/Python27/tests/test2.py ====================
6576
6576
6576
>>>
with the sleep:
==================== RESTART: C:/Python27/tests/test2.py ====================
10396
10944
9000
I have a large codebase to parallelise. I can avoid rewriting the method signatures of hundreds of functions by using a single global queue. I know it's messy; please don't tell me that using globals means I'm doing something wrong, because in this case it really is the easiest choice. The code below works, but I don't understand why. I declare a global multiprocessing.Queue() but don't declare that it should be shared between processes (by passing it as a parameter to the worker). Does Python automatically place this queue in shared memory? Is it safe to do this on a larger scale?
Note: you can tell that the queue is shared between the processes: the worker processes start doing work on an empty queue and are idle for one second before the main process pushes some work onto the queue.
import multiprocessing
import time
outqueue = None
class WorkerProcess(multiprocessing.Process):
    def __init__(self):
        multiprocessing.Process.__init__(self)
        self.exit = multiprocessing.Event()

    def doWork(self):
        global outqueue
        ob = outqueue.get()
        ob = ob + "!"
        print ob
        time.sleep(1)  # simulate more hard work
        outqueue.put(ob)

    def run(self):
        while not self.exit.is_set():
            self.doWork()

    def shutdown(self):
        self.exit.set()

if __name__ == '__main__':
    global outqueue
    outqueue = multiprocessing.Queue()

    procs = []
    for x in range(10):
        procs.append(WorkerProcess())
        procs[x].start()

    time.sleep(1)
    for x in range(20):
        outqueue.put(str(x))

    time.sleep(10)
    for p in procs:
        p.shutdown()

    for p in procs:
        p.join()

    try:
        while True:
            x = outqueue.get(False)
            print x
    except:
        print "done"
Assuming you're using Linux, the answer is in the way the OS creates a new process.
When a process spawns a new one in Linux, it actually forks the parent one. The result is a child process with all the properties of the parent one. Basically a clone.
In your example you are instantiating the Queue and then creating the new processes. Therefore the child processes will have a copy of the same queue and will be able to use it.
To see things break, just try to create the processes first and then create the Queue object. You'll see the children having the global variable still set as None while the parent will have a Queue.
It is safe, yet not recommended, to share a Queue as a global variable on Linux. On Windows, due to the different process creation approach, sharing a queue through a global variable won't work.
As mentioned in the programming guidelines
Explicitly pass resources to child processes
On Unix using the fork start method, a child process can make use of a shared resource created in a parent process using a global resource. However, it is better to pass the object as an argument to the constructor for the child process.
Apart from making the code (potentially) compatible with Windows and the other start methods this also ensures that as long as the child process is still alive the object will not be garbage collected in the parent process. This might be important if some resource is freed when the object is garbage collected in the parent process.
For more info about Linux forking you can read its man page.
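A rough sketch of the explicitly-passed version of your example (only the parts that change; the shutdown and join logic stays the same as in your code):

import multiprocessing
import time

class WorkerProcess(multiprocessing.Process):
    def __init__(self, queue):
        multiprocessing.Process.__init__(self)
        self.queue = queue                      # passed in, not a global
        self.exit = multiprocessing.Event()

    def run(self):
        while not self.exit.is_set():
            ob = self.queue.get()
            time.sleep(1)                       # simulate more hard work
            self.queue.put(ob + "!")

    def shutdown(self):
        self.exit.set()

if __name__ == '__main__':
    outqueue = multiprocessing.Queue()
    procs = [WorkerProcess(outqueue) for _ in range(10)]
    for p in procs:
        p.start()
    for x in range(20):
        outqueue.put(str(x))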
While attempting to store a multiprocessing process instance in the multiprocessing list variable poolList, I am getting the following exception:
SimpleQueue objects should only be shared between processes through inheritance
The reason why I would like to store the PROCESS instances in a variable is to be able to terminate all or just some of them later (if, for example, a PROCESS freezes). If storing a PROCESS in a variable is not an option, I would like to know how to get or list all the PROCESSES started by a multiprocessing POOL. That would be very similar to what the .current_process() method does, except .current_process() gets only a single process, while I need all the processes started or all the processes currently running.
Two questions:
Is it even possible to store an instance of the Process (as the result of mp.current_process())?
Currently I am only able to get a single process from inside the function that the process is running (from inside myFunct(), using the .current_process() method).
Instead, I would like to list all the processes currently running by multiprocessing. How do I achieve that?
import multiprocessing as mp
poolList=mp.Manager().list()
def myFunct(arg):
    print 'myFunct(): current process:', mp.current_process()
    try: poolList.append(mp.current_process())
    except Exception, e: print e

    for i in range(110):
        for n in range(500000):
            pass
        poolDict[arg]=i
    print 'myFunct(): completed', arg, poolDict

from multiprocessing import Pool
pool = Pool(processes=2)
myArgsList=['arg1','arg2','arg3']

pool=Pool(processes=2)
pool.map_async(myFunct, myArgsList)
pool.close()
pool.join()
To list the processes started by a Pool() instance (which is what you mean, if I understand you correctly), there is the pool._pool list, and it contains the instances of the processes.
However, it is not part of the documented interface and hence, really should not be used.
BUT...it seems a little bit unlikely that it would change just like that anyway. I mean, should they stop having an internal list of processes in the pool? And not call that _pool?
Also, it annoys me that there isn't at least a get_processes method or something.
And handling it breaking due to some name change should not be that difficult.
But still, use at your own risk:
from multiprocessing import pool
# Have to run in main
if __name__ == '__main__':
    # Create 3 worker processes
    _my_pool = pool.Pool(3)

    # Loop, terminate, and remove from the process list
    # Use a copy [:] of the list to remove items correctly
    for _curr_process in _my_pool._pool[:]:
        print("Terminating process " + str(_curr_process.pid))
        _curr_process.terminate()
        _my_pool._pool.remove(_curr_process)

    # If you call _repopulate, the pool will again contain 3 worker processes.
    _my_pool._repopulate_pool()

    for _curr_process in _my_pool._pool[:]:
        print("After repopulation " + str(_curr_process.pid))
The example creates a pool and manually terminates all processes.
It is important that you remember to delete the processes you terminate from the pool yourself if you want Pool() to continue working as usual.
_my_pool._repopulate_pool() increases the number of worker processes to 3 again; it is not needed to answer the question, but it gives a little bit of behind-the-scenes insight.
Yes, you can get all active child processes and perform an action based on the name of the process,
e.g.:
multiprocessing.Process(target=foo, name="refresh-reports")
and then:
for p in multiprocessing.active_children():
    if p.name == "refresh-reports":
        p.terminate()
You're creating a managed List object, but then letting the associated Manager object expire.
Process objects aren't shareable because they aren't pickle-able; that is, they aren't simple.
Oddly, the multiprocessing module doesn't have the equivalent of threading.enumerate() -- that is, you can't list all outstanding processes. As a workaround, I just store procs in a list. I never terminate() a process, but do sys.exit(0) in the parent. It's rough, because the workers will leave things in an inconsistent state, but it's okay for smaller programs.
To kill a frozen worker, I suggest: 1) the worker receives "heartbeat" jobs in a queue every now and then, 2) if the parent notices worker A hasn't responded to a heartbeat in a certain amount of time, then p.terminate(). Consider restating the problem in another SO question, as it's interesting.
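A very rough sketch of that heartbeat idea (all the names here are hypothetical):

import multiprocessing

def worker(jobs, beats):
    while True:
        job = jobs.get()
        if job == 'heartbeat':
            beats.put(multiprocessing.current_process().name)
        # ...otherwise do the real work here...

if __name__ == '__main__':
    jobs, beats = multiprocessing.Queue(), multiprocessing.Queue()
    proc = multiprocessing.Process(target=worker, args=(jobs, beats), name='w0')
    proc.start()
    jobs.put('heartbeat')
    try:
        beats.get(timeout=5)       # the worker answered, so it is alive
    except Exception:              # queue.Empty: treat the worker as frozen
        print('worker w0 seems frozen')
    proc.terminate()               # end the demo worker either way
    proc.join()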
To be honest the map stuff is much easier than using a Manager.
Here's a Manager example I've used. A worker adds stuff to a shared list. Another worker occasionally wakes up, processes everything on the list, then goes back to sleep. The code also has verbose logs, which are essential for ease in debugging.
# producer adds to fixed-sized list; scanner uses them
import logging, multiprocessing, sys, time
def producer(objlist):
    '''
    add an item to list every sec; ensure fixed size list
    '''
    logger = multiprocessing.get_logger()
    logger.info('start')
    while True:
        try:
            time.sleep(1)
        except KeyboardInterrupt:
            return
        msg = 'ding: {:04d}'.format(int(time.time()) % 10000)
        logger.info('put: %s', msg)
        del objlist[0]
        objlist.append( msg )

def scanner(objlist):
    '''
    every now and then, run calculation on objlist
    '''
    logger = multiprocessing.get_logger()
    logger.info('start')
    while True:
        try:
            time.sleep(5)
        except KeyboardInterrupt:
            return
        logger.info('items: %s', list(objlist))

def main():
    logger = multiprocessing.log_to_stderr(
        level=logging.INFO
    )
    logger.info('setup')

    # create fixed-length list, shared between producer & consumer
    manager = multiprocessing.Manager()
    my_objlist = manager.list(  # pylint: disable=E1101
        [None] * 10
    )

    multiprocessing.Process(
        target=producer,
        args=(my_objlist,),
        name='producer',
    ).start()

    multiprocessing.Process(
        target=scanner,
        args=(my_objlist,),
        name='scanner',
    ).start()

    logger.info('running forever')
    try:
        manager.join()  # wait until both workers die
    except KeyboardInterrupt:
        pass
    logger.info('done')

if __name__=='__main__':
    main()