Python: sharing a deque between multiprocessing processes

I've been looking at the following questions for the past hour without any luck:
Python sharing a dictionary between parallel processes
multiprocessing: sharing a large read-only object between processes?
multiprocessing in python - sharing large object (e.g. pandas dataframe) between multiple processes
I've written a very basic test file to illustrate what I'm trying to do:
from collections import deque
from multiprocessing import Process
import numpy as np

class TestClass:
    def __init__(self):
        self.mem = deque(maxlen=4)
        self.process = Process(target=self.run)

    def run(self):
        while True:
            self.mem.append(np.array([0, 1, 2, 3, 4]))

def print_values(x):
    while True:
        print(x)

test = TestClass()
process = Process(target=print_values(test.mem))
test.process.start()
process.start()
Currently this outputs the following:
deque([], maxlen=4)
How can I access the mem values from the main code or from the process that runs "print_values"?

Unfortunately multiprocessing.Manager() doesn't support deque, but it does work with list, dict, Queue, Value and Array. A list is fairly close, so I've used it in the example below.
from multiprocessing import Process, Manager, Lock
import numpy as np

class TestClass:
    def __init__(self):
        self.maxlen = 4
        self.manager = Manager()
        self.mem = self.manager.list()
        self.lock = self.manager.Lock()
        self.process = Process(target=self.run, args=(self.mem, self.lock))

    def run(self, mem, lock):
        while True:
            array = np.random.randint(0, high=10, size=5)
            with lock:
                if len(mem) >= self.maxlen:
                    mem.pop(0)
                mem.append(array)

def print_values(mem, lock):
    while True:
        with lock:
            print(mem)

test = TestClass()
print_process = Process(target=print_values, args=(test.mem, test.lock))
test.process.start()
print_process.start()
test.process.join()
print_process.join()
You have to be a little careful using manager objects. You can use them a lot like the objects they reference, but you can't do something like mem = mem[-4:] to truncate the values, because that rebinds the name to a new local list instead of modifying the shared object.
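For instance, a minimal sketch of the difference, assuming mem is the manager list from the example above:

# mem is a manager list proxy (ListProxy)
mem = mem[-4:]        # WRONG: rebinds the name to a plain local copy;
                      # the shared list in the manager process is untouched
while len(mem) > 4:   # works: each pop(0) mutates the shared list in place
    mem.pop(0)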
As for coding style, I might move the Manager objects outside the class or move the print_values function inside it, but for an example this works. If you move things around, just note that you can't use self.mem directly in the run method. You need to pass it in when you start the process, or the fork that Python does in the background will create a new instance and it won't be shared.
Hopefully this works for your situation, if not, we can try to adapt it a bit.

So by combining the code provided by @bivouac0 and the comment @Marijn Pieters posted, I came up with the following solution:
from multiprocessing import Process, Manager, Queue

class testClass:
    def __init__(self, maxlen=4):
        self.mem = Queue(maxsize=maxlen)
        self.process = Process(target=self.run)

    def run(self):
        i = 0
        while True:
            self.mem.empty()  # note: Queue.empty() only returns a bool; this call has no effect
            while not self.mem.full():
                self.mem.put(i)
                i += 1

def print_values(queue):
    while True:
        values = queue.get()
        print(values)

if __name__ == "__main__":
    test = testClass()
    print_process = Process(target=print_values, args=(test.mem,))
    test.process.start()
    print_process.start()
    test.process.join()
    print_process.join()
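For completeness, a sketch of one more option, not shown above but consistent with the BaseManager approach used further down this page: register collections.deque itself with a custom manager, so a real deque lives in the manager process. The DequeManager name here is my own. Note that only public methods (append, pop, copy, ...) are proxied, so things like len(shared) or iterating over the proxy won't work directly.

from collections import deque
from multiprocessing import Process
from multiprocessing.managers import BaseManager

class DequeManager(BaseManager):
    pass

# expose deque through the manager; the proxy forwards public method calls
DequeManager.register('deque', deque)

def writer(shared):
    for i in range(10):
        shared.append(i)  # mutates the deque living in the manager process

if __name__ == '__main__':
    manager = DequeManager()
    manager.start()
    shared = manager.deque(maxlen=4)  # proxy to a shared deque
    p = Process(target=writer, args=(shared,))
    p.start()
    p.join()
    print(shared.copy())  # plain deque copy: deque([6, 7, 8, 9], maxlen=4)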

Related

How to set an instance attribute in parallel in a Python class?

I want to set an instance attribute by running an instance method in parallel. Let's say the attribute is initially an empty dictionary called d, and I want to update it in parallel by an instance method called update_d. I am currently using multiprocessing.Pool:
from multiprocessing import Pool
import random

class A():
    def __init__(self, n_jobs):
        self.d = dict()
        self.n_jobs = n_jobs
        pool = Pool(self.n_jobs)
        pool.map(self.update_d, range(100))
        pool.close()

    def update_d(self, key):
        self.d[key] = random.randint(0, 100)

if __name__ == '__main__':
    a = A(n_jobs=4)
    print(a.d)
However, the attribute is not updated after running update_d in parallel. I understand that it's because multiprocessing.Pool always forks the instance to individual processes. But I want to know: what is the recommended way to do this in Python? Note that I don't want to return anything from update_d, and we can assume that the code is written in a way that the individual processes won't conflict with each other.
Edit: I just use dictionary as an example. I need a solution that allows the attribute to be any type of variable, e.g. a Pandas dataframe.
You may need a Manager to create a dict for you. I still don't know how well the updates will work, or whether there will be any race conditions.
from multiprocessing import Pool, Manager
import random

class A():
    def __init__(self, n_jobs, manager):
        self.d = manager.dict()
        self.n_jobs = n_jobs
        pool = Pool(self.n_jobs)
        pool.map(self.update_d, range(100))
        pool.close()

    def update_d(self, key):
        self.d[key] = random.randint(0, 100)

if __name__ == '__main__':
    with Manager() as manager:
        a = A(n_jobs=4, manager=manager)
        print(a.d)
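A caveat worth adding for the asker's edit about arbitrary value types: a manager dict only notices assignments to its own keys, not in-place mutation of the values stored under them. So with something like a Pandas dataframe you have to read the value out, modify it, and write it back. A minimal sketch (the df and shared names are mine):

import pandas as pd
from multiprocessing import Manager

with Manager() as manager:
    shared = manager.dict()
    shared['df'] = pd.DataFrame({'x': [1, 2, 3]})

    df = shared['df']      # a local copy, unpickled from the manager
    df['y'] = df['x'] * 2  # mutating the copy does NOT update shared['df']
    shared['df'] = df      # reassigning the key pushes the change back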

Multiprocessing advice and stop processes

I am trying to implement a system where:
Actors generate data
Replay is a class that manages the data generated by the Actors (in theory it does much more than in the code below, but I kept it simple for posting here)
Learner uses the data of the Replay class (and sometimes updates some data of Replay)
To implement that, the Actors append their generated data to a multiprocessing.Queue, and I use a separate process to push that data into the Replay. I used a multiprocessing.BaseManager to share the Replay.
This is my implementation (the code is working):
import time
import random
from collections import deque
import torch.multiprocessing as mp
from multiprocessing.managers import BaseManager

T = 20
B = 5
REPLAY_MINIMUM_SIZE = 10
REPLAY_MAXIMUM_SIZE = 100

class Actor:
    def __init__(self, global_buffer, rank):
        self.rank = rank
        self.local_buffer = []
        self.global_buffer = global_buffer

    def run(self, num_steps):
        for step in range(num_steps):
            data = f'{self.rank}_{step}'
            self.local_buffer.append(data)
            if len(self.local_buffer) >= B:
                self.global_buffer.put(self.local_buffer)
                self.local_buffer = []

class Learner:
    def __init__(self, replay):
        self.replay = replay

    def run(self, num_steps):
        while self.replay.size() <= REPLAY_MINIMUM_SIZE:
            time.sleep(0.1)
        for step in range(num_steps):
            batch = self.replay.sample(B)
            print(batch)

class Replay:
    def __init__(self, capacity):
        self.memory = deque(maxlen=capacity)

    def push(self, experiences):
        self.memory.extend(experiences)

    def sample(self, n):
        return random.sample(self.memory, n)

    def size(self):
        return len(self.memory)

def send_data_to_replay(global_buffer, replay):
    while True:
        if not global_buffer.empty():
            batch = global_buffer.get()
            replay.push(batch)

if __name__ == '__main__':
    num_actors = 2
    global_buffer = mp.Queue()
    BaseManager.register("ReplayMemory", Replay)
    Manager = BaseManager()
    Manager.start()
    replay = Manager.ReplayMemory(REPLAY_MAXIMUM_SIZE)
    learner = Learner(replay)
    learner_process = mp.Process(target=learner.run, args=(T,))
    learner_process.start()
    actor_processes = []
    for rank in range(num_actors):
        p = mp.Process(target=Actor(global_buffer, rank).run, args=(T,))
        p.start()
        actor_processes.append(p)
    replay_process = mp.Process(target=send_data_to_replay, args=(global_buffer, replay))
    replay_process.start()
    learner_process.join()
    [actor_process.join() for actor_process in actor_processes]
    replay_process.join()
I followed several tutorials and read websites related to multiprocessing, but I am very new to distributed computing. I am not sure if what I am doing is right.
I wanted to know if there is anything in my code that does not follow good practice. Moreover, when I launch the program, the different processes do not terminate, and I am not sure why or how to handle it.
Any feedback would be appreciated!
I find that when working with multiprocessing it is best to have a Queue for each running process. When you are ready to close the application you can send an exit message (or poison pill) to each queue and close each process cleanly.
When you launch a child process, pass the parent queue and the child queue to the new process through inheritance.
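A minimal sketch of that pattern, with a SENTINEL marker of my own choosing (None works just as well, provided it can never appear as real data):

import multiprocessing as mp

SENTINEL = 'STOP'  # the poison pill; any unique marker works

def worker(in_queue):
    while True:
        item = in_queue.get()  # blocks until something arrives
        if item == SENTINEL:   # poison pill received: exit cleanly
            break
        print(f'processing {item}')

if __name__ == '__main__':
    q = mp.Queue()
    p = mp.Process(target=worker, args=(q,))
    p.start()
    for i in range(5):
        q.put(i)
    q.put(SENTINEL)  # tell the worker to shut down
    p.join()         # join() now returns instead of hanging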

How to allow a class's variables to be modified concurrently by multiple threads

I have a class (MyClass) which contains a queue (self.msg_queue) of actions that need to be run, and I have multiple sources of input that can add tasks to the queue.
Right now I have three functions that I want to run concurrently:
MyClass.get_input_from_user()
Creates a window in tkinter that has the user fill out information; when the user presses submit, it pushes that message onto the queue.
MyClass.get_input_from_server()
Checks the server for a message, reads the message, and then puts it onto the queue. This method uses functions from MyClass's parent class.
MyClass.execute_next_item_on_the_queue()
Pops a message off of the queue and then acts upon it. It is dependent on what the message is, but each message corresponds to some method in MyClass or its parent which gets run according to a big decision tree.
Process description:
After the class has joined the network, I have it spawn three threads (one for each of the above functions). Each threaded function adds items to the queue with the syntax "self.msg_queue.put(message)" and removes items from the queue with "self.msg_queue.get_nowait()".
Problem description:
The issue I am having is that it seems that each thread is modifying its own queue object (they are not sharing the queue, msg_queue, of the class of which they, the functions, are all members).
I am not familiar enough with multiprocessing to know which of the error messages are important; however, it states that it cannot pickle a weakref object (it gives no indication of which object is the weakref object), and that within the queue.put() call the line self._sem.acquire(block, timeout) yields a "[WinError 5] Access is denied" error. Would it be safe to assume that this failure is the queue's reference not copying over properly?
[I am using Python 3.7.2 and the Multiprocessing package's Process and Queue]
[I have seen multiple Q/As about having threads shuttle information between classes--create a master harness that generates a queue and then pass that queue as an argument to each thread. If the functions didn't have to use other functions from MyClass I could see adapting this strategy by having those functions take in a queue and use a local variable rather than class variables.]
[I am fairly confident that this error is not the result of passing my queue to the tkinter object as my unit tests on how my GUI modifies its caller's queue work fine]
Below is a minimal reproducible example for the queue's error:
from multiprocessing import Queue
from multiprocessing import Process
import queue
import time

class MyTest:
    def __init__(self):
        self.my_q = Queue()
        self.counter = 0

    def input_function_A(self):
        while True:
            self.my_q.put(self.counter)
            self.counter = self.counter + 1
            time.sleep(0.2)

    def input_function_B(self):
        while True:
            self.counter = 0
            self.my_q.put(self.counter)
            time.sleep(1)

    def output_function(self):
        while True:
            try:
                var = self.my_q.get_nowait()
            except queue.Empty:
                var = -1
            except:
                break
            print(var)
            time.sleep(1)

    def run(self):
        process_A = Process(target=self.input_function_A)
        process_B = Process(target=self.input_function_B)
        process_C = Process(target=self.output_function)
        process_A.start()
        process_B.start()
        process_C.start()
        # without this it generates the WinError;
        # with this it still behaves as if the two input functions do not modify the queue
        process_C.join()

if __name__ == '__main__':
    test = MyTest()
    test.run()
Indeed, these are not "threads", they are "processes". If you were using multithreading rather than multiprocessing, the self.my_q instance would be the same object, placed at the same memory location.
multiprocessing instead forks the process, and any data in the original process (the one executing the run call) is duplicated when it is used, so each subprocess sees its own "Queue" instance, unrelated to the others.
The correct way to have various processes share a multiprocessing.Queue object is to pass it as a parameter to the target methods. The simplest way to reorganize your code so that it works is thus:
from multiprocessing import Queue
from multiprocessing import Process
import queue
import time

class MyTest:
    def __init__(self):
        self.my_q = Queue()
        self.counter = 0

    def input_function_A(self, q):
        while True:
            q.put(self.counter)
            self.counter = self.counter + 1
            time.sleep(0.2)

    def input_function_B(self, q):
        while True:
            self.counter = 0
            q.put(self.counter)
            time.sleep(1)

    def output_function(self, q):
        while True:
            try:
                var = q.get_nowait()
            except queue.Empty:
                var = -1
            except:
                break
            print(var)
            time.sleep(1)

    def run(self):
        # pass the one shared Queue explicitly to each child process;
        # the parameter is named q so it doesn't shadow the queue module
        process_A = Process(target=self.input_function_A, args=(self.my_q,))
        process_B = Process(target=self.input_function_B, args=(self.my_q,))
        process_C = Process(target=self.output_function, args=(self.my_q,))
        process_A.start()
        process_B.start()
        process_C.start()
        process_C.join()

if __name__ == '__main__':
    test = MyTest()
    test.run()
As you can see, since your class is not actually sharing any data through the instance's attributes, this "class" design does not make much sense for your application, other than for grouping the different workers in the same code block.
It would be possible to have a magic-multiprocess-class that would have some internal method to actually start the worker methods and share the Queue instance, so that if you have a lot of those in a project there would be a lot less boilerplate.
Something along the lines of:
from multiprocessing import Queue
from multiprocessing import Process
import queue
import time

class MPWorkerBase:
    def __init__(self, *args, **kw):
        self.queue = None
        self.is_parent_process = False
        self.is_child_process = False
        self.processes = []
        # ensure this can be used as a collaborative mixin
        super().__init__(*args, **kw)

    def run(self):
        if self.is_parent_process or self.is_child_process:
            # workers already initialized
            return
        self.queue = Queue()
        processes = []
        cls = self.__class__
        for name in dir(cls):
            method = getattr(cls, name)
            if callable(method) and getattr(method, "_MP_worker", False):
                process = Process(target=self._start_worker, args=(self.queue, name))
                processes.append(process)
                process.start()
        # Setting these attributes only after spawning ensures the child
        # processes have the initial values for them.
        self.is_parent_process = True
        self.processes = processes

    def _start_worker(self, queue, method_name):
        # this method is called in a new spawned process - attribute
        # changes here no longer reflect attributes on the
        # object in the initial process
        # overwrite queue in this process with the queue object sent over the wire:
        self.queue = queue
        self.is_child_process = True
        # call the worker method
        getattr(self, method_name)()

    def __del__(self):
        for process in self.processes:
            process.join()

def worker(func):
    """decorator to mark a method as a worker that should
    run in its own subprocess
    """
    func._MP_worker = True
    return func

class MyTest(MPWorkerBase):
    def __init__(self):
        super().__init__()
        self.counter = 0

    @worker
    def input_function_A(self):
        while True:
            self.queue.put(self.counter)
            self.counter = self.counter + 1
            time.sleep(0.2)

    @worker
    def input_function_B(self):
        while True:
            self.counter = 0
            self.queue.put(self.counter)
            time.sleep(1)

    @worker
    def output_function(self):
        while True:
            try:
                var = self.queue.get_nowait()
            except queue.Empty:
                var = -1
            except:
                break
            print(var)
            time.sleep(1)

if __name__ == '__main__':
    test = MyTest()
    test.run()

Memory leak while retrieving data from a proxy class

I am multi-processing data from a series of files.
To achieve the purpose, I built a class to distribute the data.
I started 4 processes that will visit the same class and retrieve data.
The problem is, if I use the class method retrieve() to retrieve data, memory usage keeps going up. If I don't, memory usage is stable, even though the data keeps being refreshed by getData(). How can I keep memory usage stable while retrieving data? Or is there any other way to achieve the same goal?
import pandas as pd
from multiprocessing import Process, RLock
from multiprocessing.managers import BaseManager

class myclass():
    def __init__(self, path):
        self.path = path
        self.lock = RLock()
        self.getIter()

    def getIter(self):
        self.iter = pd.read_csv(self.path, chunksize=1000)

    def getData(self):
        with self.lock:
            try:
                self.data = next(self.iter)
            except:
                self.getIter()
                self.data = next(self.iter)

    def retrieve(self):
        return self.data

def worker(c):
    while True:
        c.getData()
        # Uncommenting the following line, memory usage goes up
        data = c.retrieve()

# Generate a testing file
with open('tmp.csv', 'w') as f:
    for i in range(1000000):
        f.write('%f\n' % (i * 1.))

BaseManager.register('myclass', myclass)
bm = BaseManager()
bm.start()
c = bm.myclass('tmp.csv')
for i in range(4):
    p = Process(target=worker, args=(c,))
    p.start()
I wasn't able to find the cause or solve it, but after changing the data type of the returned variable from pandas.DataFrame to a str (a JSON string), the problem goes away.
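A sketch of that workaround, building on the myclass definition from the question (the myclass_json name is mine):

import pandas as pd
from io import StringIO

class myclass_json(myclass):  # myclass as defined in the question
    def retrieve(self):
        # serialize on the manager side; a plain str crosses the
        # proxy boundary instead of a DataFrame
        return self.data.to_json()

def worker(c):
    while True:
        c.getData()
        # rebuild the DataFrame locally from the JSON string
        data = pd.read_json(StringIO(c.retrieve()))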

Process containing object method doesn't recognize edit to object

I have the following situation: process = Process(target=sample_object.run). I then would like to edit a property of the sample_object: sample_object.edit_property(some_other_object).
class sample_object:
    def __init__(self):
        self.storage = []

    def edit_property(self, some_other_object):
        self.storage.append(some_other_object)

    def run(self):
        while True:
            if len(self.storage) != 0:
                print("1")
                # I know it's an infinite loop. It's just an example.
_______________________________________________________
from multiprocessing import Process
from sample import sample_object
from sample2 import some_other_object

class driver:
    if __name__ == "__main__":
        samp = sample_object()
        proc = Process(target=samp.run)
        proc.start()
        while True:
            some = some_other_object()
            samp.edit_property(some)
            # I know it's an infinite loop
The previous code never prints "1". How would I connect the Process to the sample_object so that an edit made to the object whose method the Process is calling is recognized by the process? In other words, is there a way to get .run to recognize the change in sample_object?
Thank you.
You can use multiprocessing.Manager to share Python data structures between processes.
from multiprocessing import Process, Manager

class A(object):
    def __init__(self, storage):
        self.storage = storage

    def add(self, item):
        self.storage.append(item)

    def run(self):
        while True:
            if self.storage:
                print(1)

if __name__ == '__main__':
    manager = Manager()
    storage = manager.list()
    a = A(storage)
    p = Process(target=a.run)
    p.start()
    for i in range(10):
        a.add({'id': i})
    p.join()
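One caveat with this pattern, echoing the first answer on this page: only the manager-backed storage list is shared. Any plain attribute assigned on A after the process starts stays local to whichever process set it, so anything the run loop must observe has to live in a manager object or be passed through a queue.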
