How do read and writes work with a manager in Python? - python

Sorry if this is a stupid question, but I'm having trouble understanding how managers work in python.
Let's say I have a manager that contains a dictionary to be shared across all processes. I want to have just one process writing to the dictionary at a time, while many others read from the dictionary.
Can this happen concurrently, with no synchronization primitives or will something break if read/writes happen at the same time?
What if I want to have multiple processes writing to the dictionary at once - is that allowed or will it break (I know it could cause race conditions, but could it error out)?
Additionally, does a manager process each read and write transaction in a queue like fashion, one at a time, or does it do them all at once?
https://docs.python.org/3/library/multiprocessing.html#sharing-state-between-processes

It depends on how you write to the dictionary, i.e. whether the operation is atomic or not:
my_dict[some_key] = 9 # this is atomic
my_dict[some_key] += 1 # this is not atomic
So creating a new key and updating a an existing key as in the first line of code above are atomic operations. But the second line of code are really multiple operations equivalent to:
temp = my_dict[some_key]
temp = temp + 1
my_dict[some_key] = temp
So if two processes were executing my_dict[some_key] += 1 in parallel, they could be reading the same value of temp = my_dict[some_key] and incrementing temp to the same new value and the net effect would be that the dictionary value only gets incremented once. This can be demonstrated as follows:
from multiprocessing import Pool, Manager, Lock
def init_pool(the_lock):
global lock
lock = the_lock
def worker1(d):
for _ in range(1000):
with lock:
d['x'] += 1
def worker2(d):
for _ in range(1000):
d['y'] += 1
if __name__ == '__main__':
lock = Lock()
with Manager() as manager, \
Pool(4, initializer=init_pool, initargs=(lock,)) as pool:
d = manager.dict()
d['x'] = 0
d['y'] = 0
# worker1 will serialize with a lock
pool.apply_async(worker1, args=(d,))
pool.apply_async(worker1, args=(d,))
# worker2 will not serialize with a lock:
pool.apply_async(worker2, args=(d,))
pool.apply_async(worker2, args=(d,))
# wait for the 4 tasks to complete:
pool.close()
pool.join()
print(d)
Prints:
{'x': 2000, 'y': 1162}
Update
As far as serialization, goes:
The BaseManager creates a server using by default a socket for Linux and a named pipe for Windows. So essentially every method you execute against a managed dictionary, for example, is pretty much like a remote method call implemented with message passing. This also means that the server could also be running on a different computer altogether. But, these method calls are not serialized; the object methods themselves must be thread-safe because each method call is run in a new thread.
The following is an example of creating our own managed type and having the server listening for requests possibly from a different computer (although in this example, the client is running on the same computer). The client is calling increment on the managed object 1000 times across two threads, but the method implementation is not done under a lock and so the resulting value of self.x when we are all done is not 1000. Also, when we retrieve the value of x twice concurrently by method get_x we see that both invocations start up more-or-less at the same time:
from multiprocessing.managers import BaseManager
from multiprocessing.pool import ThreadPool
from threading import Event, Thread, get_ident
import time
class MathManager(BaseManager):
pass
class MathClass:
def __init__(self, x=0):
self.x = x
def increment(self, y):
temp = self.x
time.sleep(.01)
self.x = temp + 1
def get_x(self):
print(f'get_x started by thread {get_ident()}', time.time())
time.sleep(2)
return self.x
def set_x(self, value):
self.x = value
def server(event1, event2):
MathManager.register('Math', MathClass)
manager = MathManager(address=('localhost', 5000), authkey=b'abracadabra')
manager.start()
event1.set() # show we are started
print('Math server running; waiting for shutdown...')
event2.wait() # wait for shutdown
print("Math server shutting down.")
manager.shutdown()
def client():
MathManager.register('Math')
manager = MathManager(address=('localhost', 5000), authkey=b'abracadabra')
manager.connect()
math = manager.Math()
pool = ThreadPool(2)
pool.map(math.increment, [1] * 1000)
results = [pool.apply_async(math.get_x) for _ in range(2)]
for result in results:
print(result.get())
def main():
event1 = Event()
event2 = Event()
t = Thread(target=server, args=(event1, event2))
t.start()
event1.wait() # server started
client() # now we can run client
event2.set()
t.join()
# Required for Windows:
if __name__ == '__main__':
main()
Prints:
Math server running; waiting for shutdown...
get_x started by thread 43052 1629375415.2502146
get_x started by thread 71260 1629375415.2502146
502
502
Math server shutting down.

Related

How to allow a class's variables to be modified concurrently by multiple threads

I have a class (MyClass) which contains a queue (self.msg_queue) of actions that need to be run and I have multiple sources of input that can add tasks to the queue.
Right now I have three functions that I want to run concurrently:
MyClass.get_input_from_user()
Creates a window in tkinter that has the user fill out information and when the user presses submit it pushes that message onto the queue.
MyClass.get_input_from_server()
Checks the server for a message, reads the message, and then puts it onto the queue. This method uses functions from MyClass's parent class.
MyClass.execute_next_item_on_the_queue()
Pops a message off of the queue and then acts upon it. It is dependent on what the message is, but each message corresponds to some method in MyClass or its parent which gets run according to a big decision tree.
Process description:
After the class has joined the network, I have it spawn three threads (one for each of the above functions). Each threaded function adds items from the queue with the syntax "self.msg_queue.put(message)" and removes items from the queue with "self.msg_queue.get_nowait()".
Problem description:
The issue I am having is that it seems that each thread is modifying its own queue object (they are not sharing the queue, msg_queue, of the class of which they, the functions, are all members).
I am not familiar enough with Multiprocessing to know what the important error messages are; however, it is stating that it cannot pickle a weakref object (it gives no indication of which object is the weakref object), and that within the queue.put() call the line "self._sem.acquire(block, timeout) yields a '[WinError 5] Access is denied'" error. Would it be safe to assume that this failure in the queue's reference not copying over properly?
[I am using Python 3.7.2 and the Multiprocessing package's Process and Queue]
[I have seen multiple Q/As about having threads shuttle information between classes--create a master harness that generates a queue and then pass that queue as an argument to each thread. If the functions didn't have to use other functions from MyClass I could see adapting this strategy by having those functions take in a queue and use a local variable rather than class variables.]
[I am fairly confident that this error is not the result of passing my queue to the tkinter object as my unit tests on how my GUI modifies its caller's queue work fine]
Below is a minimal reproducible example for the queue's error:
from multiprocessing import Queue
from multiprocessing import Process
import queue
import time
class MyTest:
def __init__(self):
self.my_q = Queue()
self.counter = 0
def input_function_A(self):
while True:
self.my_q.put(self.counter)
self.counter = self.counter + 1
time.sleep(0.2)
def input_function_B(self):
while True:
self.counter = 0
self.my_q.put(self.counter)
time.sleep(1)
def output_function(self):
while True:
try:
var = self.my_q.get_nowait()
except queue.Empty:
var = -1
except:
break
print(var)
time.sleep(1)
def run(self):
process_A = Process(target=self.input_function_A)
process_B = Process(target=self.input_function_B)
process_C = Process(target=self.output_function)
process_A.start()
process_B.start()
process_C.start()
# without this it generates the WinError:
# with this it still behaves as if the two input functions do not modify the queue
process_C.join()
if __name__ == '__main__':
test = MyTest()
test.run()
Indeed - these are not "threads" - these are "processes" - while if you were using multithreading, and not multiprocessing, the self.my_q instance would be the same object, placed at the same memory space on the computer,
multiprocessing does a fork of the process, and any data in the original process (the one in execution in the "run" call) will be duplicated when it is used - so, each subprocess will see its own "Queue" instance, unrelated to the others.
The correct way to have various process share a multiprocessing.Queue object is to pass it as a parameter to the target methods. The simpler way to reorganize your code so that it works is thus:
from multiprocessing import Queue
from multiprocessing import Process
import queue
import time
class MyTest:
def __init__(self):
self.my_q = Queue()
self.counter = 0
def input_function_A(self, queue):
while True:
queue.put(self.counter)
self.counter = self.counter + 1
time.sleep(0.2)
def input_function_B(self, queue):
while True:
self.counter = 0
queue.put(self.counter)
time.sleep(1)
def output_function(self, queue):
while True:
try:
var = queue.get_nowait()
except queue.Empty:
var = -1
except:
break
print(var)
time.sleep(1)
def run(self):
process_A = Process(target=self.input_function_A, args=(queue,))
process_B = Process(target=self.input_function_B, args=(queue,))
process_C = Process(target=self.output_function, args=(queue,))
process_A.start()
process_B.start()
process_C.start()
# without this it generates the WinError:
# with this it still behaves as if the two input functions do not modify the queue
process_C.join()
if __name__ == '__main__':
test = MyTest()
test.run()
As you can see, since your class is not actually sharing any data through the instance's attributes, this "class" design does not make much sense for your application - but for grouping the different workers in the same code block.
It would be possible to have a magic-multiprocess-class that would have some internal method to actually start the worker-methods and share the Queue instance - so if you have a lot of those in a project, there would be a lot less boilerplate.
Something along:
from multiprocessing import Queue
from multiprocessing import Process
import time
class MPWorkerBase:
def __init__(self, *args, **kw):
self.queue = None
self.is_parent_process = False
self.is_child_process = False
self.processes = []
# ensure this can be used as a colaborative mixin
super().__init__(*args, **kw)
def run(self):
if self.is_parent_process or self.is_child_process:
# workers already initialized
return
self.queue = Queue()
processes = []
cls = self.__class__
for name in dir(cls):
method = getattr(cls, name)
if callable(method) and getattr(method, "_MP_worker", False):
process = Process(target=self._start_worker, args=(self.queue, name))
self.processes.append(process)
process.start()
# Setting these attributes here ensure the child processes have the initial values for them.
self.is_parent_process = True
self.processes = processes
def _start_worker(self, queue, method_name):
# this method is called in a new spawned process - attribute
# changes here no longer reflect attributes on the
# object in the initial process
# overwrite queue in this process with the queue object sent over the wire:
self.queue = queue
self.is_child_process = True
# call the worker method
getattr(self, method_name)()
def __del__(self):
for process in self.processes:
process.join()
def worker(func):
"""decorator to mark a method as a worker that should
run in its own subprocess
"""
func._MP_worker = True
return func
class MyTest(MPWorkerBase):
def __init__(self):
super().__init__()
self.counter = 0
#worker
def input_function_A(self):
while True:
self.queue.put(self.counter)
self.counter = self.counter + 1
time.sleep(0.2)
#worker
def input_function_B(self):
while True:
self.counter = 0
self.queue.put(self.counter)
time.sleep(1)
#worker
def output_function(self):
while True:
try:
var = self.queue.get_nowait()
except queue.Empty:
var = -1
except:
break
print(var)
time.sleep(1)
if __name__ == '__main__':
test = MyTest()
test.run()

Is instance variable in Python 3.5 process-safe?

Testing Environment:
Python Version: 3.5.1
OS Platform: Ubuntu 16.04
IDE: PyCharm Community Edition 2016.3.2
I write a simple program to test process-safe. I find that subprocess2 won't run until subprocess1 finished. It seems that the instance variable self.count is process-safe.How the process share this variable? Does they share self directly?
Another question is when I use Queue, I have to use multiprocessing.Manager to guarantees process safety manually, or the program won't run as expected.(If you uncomment self.queue = multiprocessing.Queue(), this program won't run normally, but using self.queue = multiprocessing.Manager().Queue() is OK.)
The last question is why the final result is 900? I think it should be 102.
Sorry for asking so many questions, but I'm indeed curious about these things. Thanks a lot!
Code:
import multiprocessing
import time
class Test:
def __init__(self):
self.pool = multiprocessing.Pool(1)
self.count = 0
#self.queue = multiprocessing.Queue()
#self.queue = multiprocessing.Manager().Queue()
def subprocess1(self):
for i in range(3):
print("Subprocess 1, count = %d" %self.count)
self.count += 1
time.sleep(1)
print("Subprocess 1 Completed")
def subprocess2(self):
self.count = 100
for i in range(3):
print("Subprocess 2, count = %d" %self.count)
self.count += 1
time.sleep(1)
print("Subprocess 2 Completed")
def start(self):
self.pool.apply_async(func=self.subprocess1)
print("Subprocess 1 has been started")
self.count = 900
self.pool.apply_async(func=self.subprocess2)
print("Subprocess 2 has been started")
self.pool.close()
self.pool.join()
def __getstate__(self):
self_dict = self.__dict__.copy()
del self_dict['pool']
return self_dict
def __setstate__(self, state):
self.__dict__.update(state)
if __name__ == '__main__':
test = Test()
test.start()
print("Final Result, count = %d" %test.count)
Output:
Subprocess 1 has been started
Subprocess 2 has been started
Subprocess 1, count = 0
Subprocess 1, count = 1
Subprocess 1, count = 2
Subprocess 1 Completed
Subprocess 2, count = 100
Subprocess 2, count = 101
Subprocess 2, count = 102
Subprocess 2 Completed
Final Result, count = 900
The underlying details are rather tricky (see the Python3 documentation for more, and note that the details are slightly different for Python2), but essentially, when you pass self.subprocess1 or self.subprocess2 as an argument to self.pool.apply_async, Python ends up calling:
pickle.dumps(self)
in the main process—the initial one on Linux before forking, or the one invoked as __main__ on Windows—and then, eventually, pickle.loads() of the resulting byte-string in the pool process.1 The pickle.dumps code winds up calling your own __getstate__ function; that function's job is to return something that can be serialized to a byte-string.2 The subsequent pickle.loads creates a blank instance of the appropriate type, does not call its __init__, and then uses its __setstate__ function to fill in the object (instead of __init__ing it).
Your __getstate__ returns the dictionary holding the state of self, minus the pool object, for good reason:
>>> import multiprocessing
>>> x = multiprocessing.Pool(1)
>>> import pickle
>>> pickle.dumps(x)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python3.5/multiprocessing/pool.py", line 492, in __reduce__
'pool objects cannot be passed between processes or pickled'
NotImplementedError: pool objects cannot be passed between processes or pickled
Since pool objects refuse to be pickled (serialized), we must avoid even attempting to do that.
In any case, all of this means that the pool process has its own copy of self, which has its own copy of self.count (and is missing self.pool entirely). These items are not shared in any way so it is safe to modify self.count there.
I find the simplest mental model of this is to give each worker process a name: Alice, Bob, Carol, and so on, if you like. You can then think of the main process as "you": you copy something and give the copy to Alice, then copy it and give that one to Bob, and so on. Function calls, such as apply or apply_async, copy all of their arguments—including the implied self for bound methods.
When using a multiprocessing.Queue, you get something that knows how to work between the various processes, sharing data as needed, with appropriate synchronization. This lets you pass copies of data back and forth. However, like a pool instance, a multiprocessing.Queue instance cannot be copied. The multiprocessing routines do let you copy a multiprocessing.Manager().Queue() instance, which is good if you want a copied and otherwise private Queue() instance. (The internal details of this are complicated.3)
The final result you get is just 900 because you are looking only at the original self object.
Note that each applied functions (from apply or apply_async) returns a result. This result is copied back, from the worker process to the main process. With apply_async, you may choose to get called back as soon as the result is ready. If you want this result you should save it somewhere, or use the get function (as shown in that same answer) to wait for it when you need it.
1We can say "the" pool process here without worrying about which one, as you limited yourself to just one. In any case, though, there is a simple byte-oriented, two-way communications stream, managed by the multiprocessing code, connecting each worker process with the parent process that invoked it. If you create two such pool processes, each one has its own byte-stream connecting to the main process. This means it would not matter if there were two or more: the behavior would be the same.
2This "something" is often a dictionary, but see Simple example of use of __setstate__ and __getstate__ for details.
3The output of pickle.dumps on such an instance is:
>>> pickle.dumps(y)
(b'\x80\x03cmultiprocessing.managers\n'
b'RebuildProxy\n'
b'q\x00(cmultiprocessing.managers\n'
b'AutoProxy\n'
b'q\x01cmultiprocessing.managers\n'
b'Token\n'
b'q\x02)\x81q\x03X\x05\x00\x00\x00Queueq\x04X$\x00\x00\x00/tmp/pymp-pog4bhub/listener-0_uwd8c9q\x05X\t\x00\x00\x00801b92400q\x06\x87q\x07bX\x06\x00\x00\x00pickleq\x08}q\tX\x07\x00\x00\x00exposedq\n'
b'(X\x05\x00\x00\x00emptyq\x0bX\x04\x00\x00\x00fullq\x0cX\x03\x00\x00\x00getq\rX\n'
b'\x00\x00\x00get_nowaitq\x0eX\x04\x00\x00\x00joinq\x0fX\x03\x00\x00\x00putq\x10X\n'
b'\x00\x00\x00put_nowaitq\x11X\x05\x00\x00\x00qsizeq\x12X\t\x00\x00\x00task_doneq\x13tq\x14stq\x15Rq\x16.\n')
I did a little trickiness to split this at newlines and then manually added the parentheses, just to keep the long line from being super-long. The arguments will vary on different systems; this particular one uses a file system object that is a listener socket, that allows cooperating Python processes to establish a new byte stream between themselves.
Question: ... why the final result is 900? I think it should be 102.
The result should be 106, range are 0 based, you get 3 iterations.
You can get the expected output, for instance:
class PoolTasks(object):
def __init__(self):
self.count = None
def task(self, n, start):
import os
pid = os.getpid()
count = start
print("Task %s in Process %s has been started - start=%s" % (n, pid, count))
for i in range(3):
print("Task %s in Process %s, count = %d " % (n, pid, count))
count += 1
time.sleep(1)
print("Task %s in Process %s has been completed - count=%s" % (n, pid, count))
return count
def start(self):
with mp.Pool(processes=4) as pool:
# launching multiple tasks asynchronously using processes
multiple_results = [pool.apply_async(self.task, (p)) for p in [(1, 0), (2, 100)]]
# sum result from tasks
self.count = 0
for res in multiple_results:
self.count += res.get()
if __name__ == '__main__':
pool = PoolTasks()
pool.start()
print('sum(count) = %s' % pool.count)
Output:
Task 1 in Process 5601 has been started - start=0
Task 1 in Process 5601, count = 0
Task 2 in Process 5602 has been started - start=100
Task 2 in Process 5602, count = 100
Task 1 in Process 5601, count = 1
Task 2 in Process 5602, count = 101
Task 1 in Process 5601, count = 2
Task 2 in Process 5602, count = 102
Task 1 in Process 5601 has been completed - count=3
Task 2 in Process 5602 has been completed - count=103
sum(count) = 106
Tested with Python:3.4.2

Can I assume my threads are done when threading.active_count() returns 1?

Given the following class:
from abc import ABCMeta, abstractmethod
from time import sleep
import threading
from threading import active_count, Thread
class ScraperPool(metaclass=ABCMeta):
Queue = []
ResultList = []
def __init__(self, Queue, MaxNumWorkers=0, ItemsPerWorker=50):
# Initialize attributes
self.MaxNumWorkers = MaxNumWorkers
self.ItemsPerWorker = ItemsPerWorker
self.Queue = Queue # For testing purposes.
def initWorkerPool(self, PrintIDs=True):
for w in range(self.NumWorkers()):
Thread(target=self.worker, args=(w + 1, PrintIDs,)).start()
sleep(1) # Explicitly wait one second for this worker to start.
def run(self):
self.initWorkerPool()
# Wait until all workers (i.e. threads) are done.
while active_count() > 1:
print("Active threads: " + str(active_count()))
sleep(5)
self.HandleResults()
def worker(self, id, printID):
if printID:
print("Starting worker " + str(id) + ".")
while (len(self.Queue) > 0):
self.scraperMethod()
if printID:
print("Worker " + str(id) + " is quiting.")
# Todo Kill is this Thread.
return
def NumWorkers(self):
return 1 # Simplified for testing purposes.
#abstractmethod
def scraperMethod(self):
pass
class TestScraper(ScraperPool):
def scraperMethod(self):
# print("I am scraping.")
# print("Scraping. Threads#: " + str(active_count()))
temp_item = self.Queue[-1]
self.Queue.pop()
self.ResultList.append(temp_item)
def HandleResults(self):
print(self.ResultList)
ScraperPool.register(TestScraper)
scraper = TestScraper(Queue=["Jaap", "Piet"])
scraper.run()
print(threading.active_count())
# print(scraper.ResultList)
When all the threads are done, there's still one active thread - threading.active_count() on the last line gets me that number.
The active thread is <_MainThread(MainThread, started 12960)> - as printed with threading.enumerate().
Can I assume that all my threads are done when active_count() == 1?
Or can, for instance, imported modules start additional threads so that my threads are actually done when active_count() > 1 - also the condition for the loop I'm using in the run method.
You can assume that your threads are done when active_count() reaches 1. The problem is, if any other module creates a thread, you'll never get to 1. You should manage your threads explicitly.
Example: You can put the threads in a list and join them one at a time. The relevant changes to your code are:
def __init__(self, Queue, MaxNumWorkers=0, ItemsPerWorker=50):
# Initialize attributes
self.MaxNumWorkers = MaxNumWorkers
self.ItemsPerWorker = ItemsPerWorker
self.Queue = Queue # For testing purposes.
self.WorkerThreads = []
def initWorkerPool(self, PrintIDs=True):
for w in range(self.NumWorkers()):
thread = Thread(target=self.worker, args=(w + 1, PrintIDs,))
self.WorkerThreads.append(thread)
thread.start()
sleep(1) # Explicitly wait one second for this worker to start.
def run(self):
self.initWorkerPool()
# Wait until all workers (i.e. threads) are done. Waiting in order
# so some threads further in the list may finish first, but we
# will get to all of them eventually
while self.WorkerThreads:
self.WorkerThreads[0].join()
self.HandleResults()
according to docs active_count() includes the main thread, so if you're at 1 then you're most likely done, but if you have another source of new threads in your program then you may be done before active_count() hits 1.
I would recommend implementing explicit join method on your ScraperPool and keeping track of your workers and explicitly joining them to main thread when needed instead of checking that you're done with active_count() calls.
Also remember about GIL...

Python3 Pool async processes | workers

I am trying to use 4 processes for 4 async methods.
Here is my code for 1 async method (x):
from multiprocessing import Pool
import time
def x(i):
while(i < 100):
print(i)
i += 1
time.sleep(1)
def finish(str):
print("done!")
if __name__ == "__main__":
pool = Pool(processes=5)
result = pool.apply_async(x, [0], callback=finish)
print("start")
according to: https://docs.python.org/2/library/multiprocessing.html#multiprocessing.JoinableQueue
the parameter processes in Pool is the number of workers.
How can i use each of these workers?
EDIT: my ASYNC class
from multiprocessing import Pool
import time
class ASYNC(object):
def __init__(self, THREADS=[]):
print('do')
pool = Pool(processes=len(THREADS))
self.THREAD_POOL = {}
thread_index = 0
for thread_ in THREADS:
self.THREAD_POOL[thread_index] = {
'thread': thread_['thread'],
'args': thread_['args'],
'callback': thread_['callback']
}
pool.apply_async(self.run, [thread_index], callback=thread_['callback'])
self.THREAD_POOL[thread_index]['running'] = True
thread_index += 1
def run(self, thread_index):
print('enter')
while(self.THREAD_POOL[thread_index]['running']):
print("loop")
self.THREAD_POOL[thread_index]['thread'](self.THREAD_POOL[thread_index])
time.sleep(1)
self.THREAD_POOL[thread_index]['running'] = False
def wait_for_finish(self):
for pool in self.THREAD_POOL:
while(self.THREAD_POOL[pool]['running']):
time.sleep(1)
def x(pool):
print(str(pool))
pool['args'][0] += 1
def y(str):
print("done")
A = ASYNC([{'thread': x, 'args':[10], 'callback':y}])
print("start")
A.wait_for_finish()
multiprocessing.Pool is designed to be a convenient way of distributing work to a pool of workers, without worrying about which worker does which work. The reason that it has a size is to allow you to be lazy about how quickly you dispatch work to the queue and to limit the expensive (relatively) overhead of creating child processes.
So the answer to your question is in principle you shouldn't be able to access individual workers in a Pool. If you want to be able to address workers individually, you will need to implement your own work distribution system and using multiprocessing.Process, something like:
from multiprocessing import Process
def x(i):
while(i < 100):
print(i)
i += 1
pools = [Process(target=x, args=(1,)) for _ in range(5)]
map(lambda pool: pool.start(), pools)
map(lambda pool: pool.join(), pools)
print('Done!')
And now you can access each worker directly. If you want to be able to send work dynamically to each worker while it's running (not just give it one thing to do like I did in my example) then you'll have to implement that yourself, potentially using multiprocessing.Queue. Have a look at the code for multiprocessing to see how that distributes work to its workers to get an idea of how to do this.
Why do you want to do this anyway? If it's just concern about whether the workers get scheduled efficiently, then my advice would just be to trust multiprocessing to get that right for you, unless you have really good evidence that in your case it does not for some reason.

When I'm testing about multiprocessing and threading with python, and I meet a odd situation

I am using process pools(including 3 processes). In every process, I have set (created) some threads by using the thread classes to speed handle something.
At first, everything was OK. But when I wanted to change some variable in a thread, I met an odd situation.
For testing or to know what happens, I set a global variable COUNT to test. Honestly, I don't know this is safe or not. I just want to see, by using multiprocessing and threading can I change COUNT or not?
#!/usr/bin/env python
# encoding: utf-8
import os
import threading
from Queue import Queue
from multiprocessing import Process, Pool
# global variable
max_threads = 11
Stock_queue = Queue()
COUNT = 0
class WorkManager:
def __init__(self, work_queue_size=1, thread_pool_size=1):
self.work_queue = Queue()
self.thread_pool = [] # initiate, no have a thread
self.work_queue_size = work_queue_size
self.thread_pool_size = thread_pool_size
self.__init_work_queue()
self.__init_thread_pool()
def __init_work_queue(self):
for i in xrange(self.work_queue_size):
self.work_queue.put((func_test, Stock_queue.get()))
def __init_thread_pool(self):
for i in xrange(self.thread_pool_size):
self.thread_pool.append(WorkThread(self.work_queue))
def finish_all_threads(self):
for i in xrange(self.thread_pool_size):
if self.thread_pool[i].is_alive():
self.thread_pool[i].join()
class WorkThread(threading.Thread):
def __init__(self, work_queue):
threading.Thread.__init__(self)
self.work_queue = work_queue
self.start()
def run(self):
while self.work_queue.qsize() > 0:
try:
func, args = self.work_queue.get(block=False)
func(args)
except Queue.Empty:
print 'queue is empty....'
def handle(process_name):
print process_name, 'is running...'
work_manager = WorkManager(Stock_queue.qsize()/3, max_threads)
work_manager.finish_all_threads()
def func_test(num):
# use a global variable to test what happens
global COUNT
COUNT += num
def prepare():
# prepare test queue, store 50 numbers in Stock_queue
for i in xrange(50):
Stock_queue.put(i)
def main():
prepare()
pools = Pool()
# set 3 process
for i in xrange(3):
pools.apply_async(handle, args=('process_'+str(i),))
pools.close()
pools.join()
global COUNT
print 'COUNT: ', COUNT
if __name__ == '__main__':
os.system('printf "\033c"')
main()
Now, finally the result of COUNT is just 0.I am unable to understand whats happening here?
You print the COUNT var in the father process. Variables doesn't sync across processes because they doesn't share memory, that means that the variable stay 0 at the father process and is increased in the subprocesses
In the case of threading, threads share memory, that means that they share the variable count, so they should have COUNT as more than 0 but again they are at the subprocesses, and when they change the variable, it doesn't update it in other processes.

Categories