Could there be a race condition in the following code such that two async processes both try to update the global variable B at the same time in the callback function? If so, does Python handle this, or is it something we have to handle using locks? I have run this piece of code several times and have not had a race condition occur (that I know of), but I'm not sure whether that means it's impossible.
Code:
import multiprocessing as mp

ky = 0
B = {}

def update(u):
    global B
    global ky
    B[ky] = u
    ky += 1

def square(x, y):
    return x*y

def runMlt():
    pool = mp.Pool(2)
    for i in range(10):
        pool.apply_async(square, args=(i, i + 1), callback=update)
    pool.close()
    pool.join()

if __name__ == '__main__':
    runMlt()
No.
The callback is executed by the master process, and changes the master process makes to those variables aren't visible to the subprocesses. No subprocess touches B.
(B doesn't need to be marked global, by the way; you're not assigning to the name, but just mutating it internally.)
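To illustrate that last point, here is a tiny sketch (my own, not from the question) showing when global is actually needed: mutating an existing object does not rebind the name, while assignment does.

B = {}
ky = 0

def update_dict(u):
    # Mutating the dict in place: the name B is never rebound,
    # so no global declaration is required.
    B[len(B)] = u

def bump_counter():
    # Assigning to the name itself rebinds it, so without the global
    # declaration ky would be treated as a local variable here.
    global ky
    ky += 1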
Well, there aren't any race conditions because they aren't connected. Remember, these are separate processes with separate address spaces. The subprocesses will inherit the initial values of the globals, but they are not shared.
If you need to communicate, you need to use something like a Queue to send the results back to "central control".
The multiprocessing module does have shared memory objects.
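As a minimal sketch of the Queue approach (my own example, not part of the answer), each worker sends its result back to the parent process, which collects everything into B:

import multiprocessing as mp

def square(x, y, q):
    # Send the result back over the queue instead of writing to a global,
    # which would only change this worker's own copy.
    q.put((x, x * y))

if __name__ == '__main__':
    q = mp.Queue()
    procs = [mp.Process(target=square, args=(i, i + 1, q)) for i in range(10)]
    for p in procs:
        p.start()
    B = {}
    for _ in procs:
        key, value = q.get()  # one result per worker
        B[key] = value
    for p in procs:
        p.join()
    print(B)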
I'm trying to run a few Python processes, and I want to kill all of them as soon as I get a result from one of them. How do I do that?
In the code below, a loop starts 10 processes, each of which prints "hello world (i)". How can I stop after the first print?
I'll put a small example (modified from https://docs.python.org/3/library/multiprocessing.html#module-multiprocessing):
# MAIN
from multiprocessing import Process, Lock
import globals
import globalsOperations

globals.init()

def f(l, i):
    # l.acquire()
    # try:
    if not globalsOperations.get_my_bool_state():
        print(globalsOperations.get_my_bool_state())
        print('hello world', i)
        globalsOperations.set_my_bool_state(True)
        print(globalsOperations.get_my_bool_state())
    # finally:
    #     l.release()

if __name__ == '__main__':
    lock = Lock()
    for num in range(10):
        Process(target=f, args=(lock, num)).start()

# globals.py
def init():
    global my_bool
    my_bool = False

# globalsOperations.py
import globals

def set_my_bool_state(bool_value):
    globals.my_bool = bool_value

def get_my_bool_state():
    return globals.my_bool
The Lock calls are commented out because I already tried using them to stop after the first success, with no luck.
So, to the question: how do I stop after the first result?
Preferably with no memory leaks when releasing the processes...
(I'm not asking a lot of questions here so don't be too harsh on me :) )
thanks!
Your biggest problem is the failure to recognize that each process has its own copy of memory, so when one process modifies a global variable the memory spaces of the other processes are not updated. In short, your program cannot possibly work as written. So globals either has to be located in shared memory or has to be a managed object accessed through a proxy. I have used the latter since it requires the fewest syntactical changes to how you access your global data. This is a huge topic. See this.
Second, I would suggest using a multiprocessing pool, e.g. a multiprocessing.pool.Pool instance combined with the imap_unordered method, rather than individual multiprocessing.Process instances. The imap_unordered method returns an iterator that you can use to iterate over results from your worker function f as soon as they become available. You now need to modify f to return True or False based on whether its invocation was the first to set globals.my_bool to True. As soon as the main process gets a True result, it can call terminate on the pool, killing any tasks that are running or scheduled to run.
There will be some lag before the main process detects that a task completed successfully and its termination of the remaining tasks. In that window of time, a few of the other submitted tasks can be running to completion.
Finally, globals is a built-in function name and should not be used for other purposes, such as the name of a module or variable. So I will be using the name gbls instead.
And you do need to use locking or multiple tasks can think that they are the first to succeed.
There is a lot here for you to be investigating:
from multiprocessing import Manager, Pool, Lock

def init_processes(g, l):
    """
    Initialize the global variable(s) for each process
    in the multiprocessing pool.
    In this case we initialize variable gbls with a proxy to a
    managed Namespace object.
    """
    global gbls, lock
    gbls, lock = g, l

def set_my_bool_state(bool_value):
    gbls.my_bool = bool_value

def get_my_bool_state():
    return gbls.my_bool

def f(i):
    with lock:
        if not get_my_bool_state():
            print(get_my_bool_state())
            print('hello world', i, flush=True)
            set_my_bool_state(True)
            print(get_my_bool_state())
            return True  # we were the first to succeed
        else:
            # A few of these might print before the pool is terminated:
            print('Already set.', i, flush=True)
            return False  # we were not the first to succeed

if __name__ == '__main__':
    with Manager() as manager:
        gbls = manager.Namespace()
        gbls.my_bool = False
        lock = Lock()
        pool = Pool(10, initializer=init_processes, initargs=(gbls, lock))
        for result in pool.imap_unordered(f, range(10)):
            if result:  # first to succeed:
                break
        pool.terminate()  # kill all remaining tasks
        # Wait for all processes to end:
        pool.join()
Prints:
False
hello world 0
True
Already set. 1
Already set. 2
I have a method which calculates a final result using multiple other methods. It has a while loop inside which continuously checks for new data, and if new data is received, it runs the other methods and calculates the results. This main method is the only one called by the user, and it stays active until the program is closed. The basic structure is as follows:
class sample:
    def __init__(self):
        self.results = []

    def main_calculation(self):
        while True:
            # code to get data
            if newdata != olddata:
                # insert code to prepare data for analysis
                res1 = self.calc1(prepped_data)
                res2 = self.calc2(prepped_data)
                final = res1 + res2
                self.results.append(final)
I want to run calc1 and calc2 in parallel, so that I can get the final result faster. However, I am unsure of how to implement multiprocessing in this way, since I'm not using a __main__ guard. Is there any way to run these processes in parallel?
This is likely not the best organization for this code, but it is what is easiest for the actual calculations I am running, since it is necessary that this code be imported and run from a different file. However, I can restructure the code if this is not a salvageable structure.
According to the documentation, the reason you need to use a __main__ guard is that when your program creates a multiprocessing.Process object, it starts up a whole new copy of the Python interpreter which will import a new copy of your program's modules. If importing your module calls multiprocessing.Process() itself, that will create yet another copy of the Python interpreter which interprets yet another copy of your code, and so on until your system crashes (or actually, until Python hits a non-reentrant piece of the multiprocessing code).
In the main module of your program, which usually calls some code at the top level, checking __name__ == '__main__' is the way you can tell whether the program is being run for the first time or is being run as a subprocess. But in a different module, there might not be any code at the top level (other than definitions), and in that case there's no need to use a guard because the module can be safely imported without starting a new process.
In other words, this is dangerous:
import multiprocessing as mp

def f():
    ...

p = mp.Process(target=f)
p.start()
p.join()
but this is safe:
import multiprocessing as mp

def f():
    ...

def g():
    p = mp.Process(target=f)
    p.start()
    p.join()
and this is also safe:
import multiprocessing as mp

def f():
    ...

class H:
    def g(self):
        p = mp.Process(target=f)
        p.start()
        p.join()
So in your example, you should be able to directly create Process objects in your function.
However, I'd suggest making it clear in the documentation for the class that that method creates a Process, because whoever uses it (maybe you) needs to know that it's not safe to call that method at the top level of a module. It would be like doing this, which also falls in the "dangerous" category:
import multiprocessing as mp

def f():
    ...

class H:
    def g(self):
        p = mp.Process(target=f)
        p.start()
        p.join()

H().g()  # this creates a Process at the top level
You could also consider an alternative approach where you make the caller do all the process creation. In this approach, either your sample class constructor or the main_calculation() method could accept, say, a Pool object, and it can use the processes from that pool to do its calculations. For example:
class sample:
    def main_calculation(self, pool):
        while True:
            if newdata != olddata:
                res1_async = pool.apply_async(self.calc1, [prepped_data])
                res2_async = pool.apply_async(self.calc2, [prepped_data])
                res1 = res1_async.get()
                res2 = res2_async.get()
                # and so on
This pattern may also allow your program to be more efficient in its use of resources, if there are many different calculations happening, because they can all use the same pool of processes.
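For completeness, a minimal sketch of what the caller side might look like under this approach (the module name sample_module is a placeholder; only sample and main_calculation come from the snippet above):

import multiprocessing as mp
from sample_module import sample  # hypothetical module holding the class above

if __name__ == '__main__':
    # The guard and the pool creation live in the caller's main module,
    # so the class itself never spawns processes at import time.
    with mp.Pool(processes=2) as pool:
        s = sample()
        s.main_calculation(pool)  # borrows workers from the caller's pool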
I am in the following setting: I have a method that takes an objective function f as input. As a subroutine of that method I want to evaluate f on a small set of points. Since f has high complexity, I considered doing that in parallel.
All online examples hang even for trivial functions like squaring on sets with 5 points. They use the multiprocessing library, and I don't know what I am doing wrong. I am not sure how to encapsulate that __name__ == "__main__" statement in my method (since it is part of a module, I guess instead of "__main__" I should use the module name?).
The code I have been using looks like:
from multiprocessing.pool import Pool
from multiprocessing import cpu_count

x = [1,2,3,4,5]
num_cores = cpu_count()

def f(x):
    return x**2

if __name__ == "__main__":
    pool = Pool(num_cores)
    y = list(pool.map(f, x))
    pool.join()
    print(y)
When executing this code in my Spyder it takes a bloody long time to finish.
So my main questions are: What am I doing wrong in this code? How can I encapsulate the __name__ statement when this code is part of a bigger method?
Is it even worth parallelizing this? (One function evaluation can take multiple minutes, and in serial this adds up to a total runtime of hours...)
According to the documentation:
close()
Prevents any more tasks from being submitted to the pool. Once all the tasks have been completed the worker processes will exit.
terminate()
Stops the worker processes immediately without completing outstanding work. When the pool object is garbage collected
terminate() will be called immediately.
join()
Wait for the worker processes to exit. One must call close() or terminate() before using join().
So you should add:
from multiprocessing.pool import Pool
from multiprocessing import cpu_count

x = [1,2,3,4,5]

def f(x):
    return x**2

if __name__ == "__main__":
    pool = Pool()
    y = list(pool.map(f, x))
    pool.close()
    pool.join()
    print(y)
You can call Pool without any arguments and it will use cpu_count() by default:
If processes is None then the number returned by cpu_count() is used
About the if __name__ == "__main__" guard, read more information here.
So you need to think a bit about which code you want executed only in the main program. The most obvious example is that you want code that creates child processes to run only in the main program, so that should be protected by __name__ == '__main__'.
You might want to look into the chunksize argument of the map function that you are using.
On a large enough input list, a lot of your time is spent simply communicating the arguments to and from the separate parallel processes.
One symptom of this problem is that when you use something like htop all cores are firing but at < 100%.
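As a rough illustration (the worker function and input list here are placeholders, not from the question), chunksize is passed directly to map:

from multiprocessing import Pool

def f(x):
    return x ** 2  # stand-in for an expensive computation

if __name__ == '__main__':
    xs = range(100000)
    with Pool() as pool:
        # Sending work in chunks of 1000 items cuts down the per-item
        # inter-process communication overhead.
        results = pool.map(f, xs, chunksize=1000)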
I have code that makes unique combinations of elements. There are 6 types, and there are about 100 of each. So there are 100^6 combinations. Each combination has to be calculated, checked for relevance and then either be discarded or saved.
The relevant bit of the code looks like this:
def modconffactory():
    for transmitter in totaltransmitterdict.values():
        for reciever in totalrecieverdict.values():
            for processor in totalprocessordict.values():
                for holoarray in totalholoarraydict.values():
                    for databus in totaldatabusdict.values():
                        for multiplexer in totalmultiplexerdict.values():
                            newconfiguration = [transmitter, reciever, processor, holoarray, databus, multiplexer]
                            data_I_need = dosomethingwith(newconfiguration)
                            saveforlateruse_if_useful(data_I_need)
Now this takes a long time and that is fine, but now I realize this process (making the configurations and then calculations for later use) is only using 1 of my 8 processor cores at a time.
I've been reading up about multithreading and multiprocessing, but I only see examples of different processes, not how to multithread one process. In my code I call two functions: 'dosomethingwith()' and 'saveforlateruse_if_useful()'. I could make those into separate processes and have those run concurrently to the for-loops, right?
But what about the for-loops themselves? Can I speed up that one process? Because that is where the time consumption is. (<-- This is my main question)
Is there a cheat? For instance, compiling to C so that the OS multithreads it automatically?
I only see examples of different processes, not how to multithread one process
There is multithreading in Python, but it is very ineffective because of the GIL (Global Interpreter Lock). So if you want to use all of your processor cores, if you want concurrency, you have no choice other than to use multiple processes, which can be done with the multiprocessing module (well, you could also use another language that doesn't have such problems).
Approximate example of multiprocessing usage for your case:
import multiprocessing

WORKERS_NUMBER = 8

def modconffactoryProcess(generator, step, offset, conn):
    """
    Function to be invoked by every worker process.

    generator: iterable object, the very top one of all you are iterating over,
    in your case, totaltransmitterdict.values()

    We are passing a whole iterable object to every worker; they all will iterate
    over it. To ensure they will not waste time by doing the same things
    concurrently, we will assume this: each worker will process only each stepTH
    item, starting with the offsetTH one. step must be equal to WORKERS_NUMBER,
    and offset must be a unique number for each worker, varying from 0 to
    WORKERS_NUMBER - 1

    conn: a multiprocessing.Connection object, allowing the worker to communicate
    with the main process
    """
    for i, transmitter in enumerate(generator):
        if i % step == offset:
            for reciever in totalrecieverdict.values():
                for processor in totalprocessordict.values():
                    for holoarray in totalholoarraydict.values():
                        for databus in totaldatabusdict.values():
                            for multiplexer in totalmultiplexerdict.values():
                                newconfiguration = [transmitter, reciever, processor, holoarray, databus, multiplexer]
                                data_I_need = dosomethingwith(newconfiguration)
                                saveforlateruse_if_useful(data_I_need)
    conn.send('done')

def modconffactory():
    """
    Function to launch all the worker processes and wait until they all complete
    their tasks
    """
    processes = []
    generator = totaltransmitterdict.values()
    for i in range(WORKERS_NUMBER):
        conn, childConn = multiprocessing.Pipe()
        process = multiprocessing.Process(target=modconffactoryProcess, args=(generator, WORKERS_NUMBER, i, childConn))
        process.start()
        processes.append((process, conn))
    # Here we have created, started and saved to a list all the worker processes
    working = True
    finishedProcessesNumber = 0
    try:
        while working:
            for process, conn in processes:
                if conn.poll():  # Check if any messages have arrived from a worker
                    message = conn.recv()
                    if message == 'done':
                        finishedProcessesNumber += 1
                        if finishedProcessesNumber == WORKERS_NUMBER:
                            working = False
    except KeyboardInterrupt:
        print('Aborted')
You can adjust WORKERS_NUMBER to your needs.
Same with multiprocessing.Pool:
import multiprocessing

WORKERS_NUMBER = 8

def modconffactoryProcess(transmitter):
    for reciever in totalrecieverdict.values():
        for processor in totalprocessordict.values():
            for holoarray in totalholoarraydict.values():
                for databus in totaldatabusdict.values():
                    for multiplexer in totalmultiplexerdict.values():
                        newconfiguration = [transmitter, reciever, processor, holoarray, databus, multiplexer]
                        data_I_need = dosomethingwith(newconfiguration)
                        saveforlateruse_if_useful(data_I_need)

def modconffactory():
    pool = multiprocessing.Pool(WORKERS_NUMBER)
    pool.map(modconffactoryProcess, totaltransmitterdict.values())
You probably would like to use .map_async instead of .map (a small sketch of that follows below).
Both snippets do the same, but I would say in the first one you have more control over the program.
I suppose the second one is the easiest, though :)
But the first one should give you the idea of what is happening in the second one
multiprocessing docs: https://docs.python.org/3/library/multiprocessing.html
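For reference, here is a minimal sketch of the .map_async variant (the worker function and inputs are placeholders, not taken from the answer above):

import multiprocessing

def work(item):
    return item * item  # placeholder for the real per-item computation

if __name__ == '__main__':
    pool = multiprocessing.Pool(8)
    # map_async returns an AsyncResult immediately; the main process
    # is free to do other work while the pool processes the items.
    async_result = pool.map_async(work, range(100))
    # ... do other work here ...
    results = async_result.get()  # block until all results are ready
    pool.close()
    pool.join()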
You can run your function in this way:
from multiprocessing import Pool

def f(x):
    return x*x

if __name__ == '__main__':
    p = Pool(5)
    print(p.map(f, [1, 2, 3]))
https://docs.python.org/2/library/multiprocessing.html#using-a-pool-of-workers
I've been trying to read up on threading and multiprocessing, but all the examples are too intricate and advanced for my level of Python/programming knowledge. I want to run a function, which consists of a while loop, and while that loop runs I want to continue with the program and eventually change the condition for the while loop and end that process. This is the code:
class Example():
    def __init__(self):
        self.condition = False

    def func1(self):
        self.condition = True
        while self.condition:
            print "Still looping"
            time.sleep(1)
        print "Finished loop"

    def end_loop(self):
        self.condition = False
Then I make the following function calls:
ex = Example()
ex.func1()
time.sleep(5)
ex.end_loop()
What I want is for func1 to run for 5 s before end_loop() is called, changes the condition, and ends the loop and thus also the function. I.e. I want one process to start and "go" into func1, and at the same time I want time.sleep(5) to be called, so the processes "split" when arriving at func1: one process enters the function while the other continues down the program and starts with the time.sleep(5) execution.
This must be the most basic example of multiprocessing, yet I've had trouble finding a simple way to do it!
Thank you
EDIT1: regarding do_something. In my real problem, do_something is replaced by some code that communicates with another program via a socket, receives packages with coordinates every 0.02 s, and stores them in member variables of the class. I want this constant updating of the coordinates to start and then be able to read the coordinates via other functions at the same time.
However that is not so relevant. What if do_something is replaced by:
time.sleep(1)
print "Still looping"
How do I solve my problem then?
EDIT2: I have tried multiprocessing like this:
from multiprocessing import Process
ex = Example()
p1 = Process(target=ex.func1())
p2 = Process(target=ex.end_loop())
p1.start()
time.sleep(5)
p2.start()
When I ran this, I never got to p2.start(), so that did not help. Even if it had, this is not really what I'm looking for either. What I want would be just to start the process p1, and then continue with time.sleep and ex.end_loop().
The first problem with your code are the calls
p1 = Process(target=ex.func1())
p2 = Process(target=ex.end_loop())
With ex.func1() you're calling the function and passing its return value as the target parameter. Since the function doesn't return anything, you're effectively calling
p1 = Process(target=None)
p2 = Process(target=None)
which makes, of course, no sense.
After fixing that, the next problem will be shared data: when using the multiprocessing package, you implement concurrency using multiple processes which, by default, cannot simply share data afaik. Have a look at Sharing state between processes in the package's documentation to read about this. Especially take the first sentence into account: "when doing concurrent programming it is usually best to avoid using shared state as far as possible"!
So you might want to also have a look at Exchanging objects between processes to read about how to send/receive data between two different processes. So, instead of simply setting a flag to stop the loop, it might be better to send a message to signal the loop should be terminated.
Also note that processes are a heavyweight form of multiprocessing: they spawn multiple OS processes, which comes with a relatively big overhead. multiprocessing's main purpose is to avoid problems imposed by Python's Global Interpreter Lock (google this to read more...). If your problem isn't much more complex than what you've told us, you might want to use the threading package instead: threads come with less overhead than processes and also allow access to the same data (although you really should read about synchronization when doing this...).
I'm afraid multiprocessing is an inherently complex subject, so I think you will need to advance your programming/Python skills to use it successfully. But I'm sure you'll manage; the Python documentation on this is comprehensive and there are a lot of other resources about it.
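To make the threading suggestion concrete, here is a minimal sketch (my own, not part of the answer) of the questioner's Example class using a Thread and a threading.Event as the stop condition; because threads share memory, no inter-process communication is needed:

import threading
import time

class Example(object):
    def __init__(self):
        self._stop = threading.Event()

    def func1(self):
        while not self._stop.is_set():
            print("Still looping")
            time.sleep(1)
        print("Finished loop")

    def end_loop(self):
        self._stop.set()

ex = Example()
t = threading.Thread(target=ex.func1)  # note: no parentheses after func1
t.start()
time.sleep(5)   # the main thread continues while func1 keeps looping
ex.end_loop()
t.join()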
To tackle your EDIT2 problem, you could try using the shared memory object Value.
import time
from multiprocessing import Process, Value

class Example():
    def func1(self, cond):
        while (cond.value == 1):
            print('do something')
            time.sleep(1)
        return

if __name__ == '__main__':
    ex = Example()
    cond = Value('i', 1)
    proc = Process(target=ex.func1, args=(cond,))
    proc.start()
    time.sleep(5)
    cond.value = 0
    proc.join()
(Note the target=ex.func1 without the parentheses and the comma after cond in args=(cond,).)
But look at the answer provided by MartinStettner to find a good solution.