I want to set an instance attribute by running an instance method in parallel. Let's say the attribute is initially an empty dictionary called d, and I want to update it in parallel with an instance method called update_d. I am currently using multiprocessing.Pool:
from multiprocessing import Pool
import random
class A():
    def __init__(self, n_jobs):
        self.d = dict()
        self.n_jobs = n_jobs
        pool = Pool(self.n_jobs)
        pool.map(self.update_d, range(100))
        pool.close()
    def update_d(self, key):
        self.d[key] = random.randint(0, 100)
if __name__ == '__main__':
    a = A(n_jobs=4)
    print(a.d)
However, the attribute is not updated after running update_d in parallel. I understand that this is because multiprocessing.Pool always forks the instance into the individual processes, so each worker updates its own copy. What is the recommended way to do this in Python? Note that I don't want to return anything from update_d, and we can assume that the code is written in a way that the individual processes won't conflict with each other.
Edit: I just use dictionary as an example. I need a solution that allows the attribute to be any type of variable, e.g. a Pandas dataframe.
You may need a Manager to create the dict for you. I don't know how well the updates will perform, or whether there will be any race conditions.
from multiprocessing import Pool, Manager
import random
class A():
    def __init__(self, n_jobs, manager):
        self.d = manager.dict()
        self.n_jobs = n_jobs
        pool = Pool(self.n_jobs)
        pool.map(self.update_d, range(100))
        pool.close()
    def update_d(self, key):
        self.d[key] = random.randint(0, 100)
if __name__ == '__main__':
    with Manager() as manager:
        a = A(n_jobs=4, manager=manager)
        print(a.d)
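For comparison, if returning values from the workers turns out to be acceptable after all, shared state can be avoided entirely: each task returns its result and the parent process merges it into the attribute. The following is only a minimal sketch of that idea; compute_entry and fill are hypothetical names, not part of the question's code.

```python
from multiprocessing import Pool
import random


def compute_entry(key):
    """Hypothetical worker: compute one (key, value) pair without shared state."""
    return key, random.randint(0, 100)


class A:
    def __init__(self, n_jobs):
        self.d = {}
        self.n_jobs = n_jobs

    def fill(self, keys):
        # Workers only compute; the parent process owns self.d and merges results.
        with Pool(self.n_jobs) as pool:
            self.d.update(pool.map(compute_entry, keys))


if __name__ == "__main__":
    a = A(n_jobs=4)
    a.fill(range(100))
    print(len(a.d))  # 100
```

The same pattern works for any picklable return type; the merge step in the parent is where a DataFrame or other structure would be assembled.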
I have two custom Python classes, the first one has a method to make some calculations (using Pool) and create a new instance attribute, and the second one is used to aggregate two objects of the first class and has a method with which I want to send said calculations (also in parallel) in the two first-class objects and correctly save their new instance attributes.
Dummy code:
from multiprocessing import Pool, Process
class State:
    def __init__(self, data):
        self.data = data
    def calculate(self):
        with Pool() as p:
            p.map(function, args)
        new_attribute = *some code that reads the files generated with the Pool*
        self.new_attribute = new_attribute
        return
class Pair:
    def __init__(self, state1:State, state2:State):
        self.state1 = state1
        self.state2 = state2
    def calculate_states(self):
        for state in [self.state1, self.state2]:
            p = Process(target=state.calculate)
            p.start()
        return
state1 = State(data1)
state2 = State(data2)
pair = Pair(state1, state2)
pair.calculate_states()
The problem is that, as I have found out during my extensive research, multiprocessing.Process creates copies of the namespace in which the processes work, and the values aren't returned to the main namespace. Setting process.daemon to True produces an error, because "daemonic processes aren't allowed to have children", which is the same thing that happens if I replace the Processes with an additional Pool. Using multiprocess (instead of multiprocessing) or concurrent.futures doesn't seem to work either. Additionally, I don't understand how multiprocessing.Queue works and I'm not sure whether it could be applied here (I have read somewhere that it could be used).
I would like to do what I am trying to do without having to pass a shared-memory object to the Processes (to write the new_attribute into it and then apply it to the States in the main namespace). I hope someone can point me towards the solution even if I have not provided a working code/reproducible example.
Your problem arises from invoking method calculate as a new subprocess. You can still compute the new attributes in parallel without doing that by using map_async with a callback argument.
I have taken your code and provided missing function implementations to demonstrate:
from multiprocessing import Pool, cpu_count
def some_code(data):
    if data == 1:
        return 1032
    if data == 2:
        return 9874
    raise ValueError('Invalid data value:', data)
def function(val):
    ...
    # return value is not of interest
class State:
    def __init__(self, data):
        self.data = data
    def calculate(self, pool, args):
        pool.map_async(function, args, callback=self.callback)
    def callback(self, result):
        """
        Called when map_async completes
        """
        new_attribute = some_code(self.data)
        self.new_attribute = new_attribute
class Pair:
    def __init__(self, state1:State, state2:State):
        self.state1 = state1
        self.state2 = state2
    def calculate_states(self):
        args = (6, 9, 18)
        # Assumption is computation is VERY CPU-intensive
        # If there is quite a bit of I/O involved then: pool_size = 2 * len(args)
        # If it's mostly I/O you should have been using multithreading to begin with
        pool_size = min(2*len(args), cpu_count())
        with Pool(pool_size) as pool:
            for state in [self.state1, self.state2]:
                state.calculate(pool, args)
            # wait for tasks to complete
            pool.close()
            pool.join()
# Required for Windows:
if __name__ == '__main__':
    data1 = 1
    data2 = 2
    state1 = State(data1)
    state2 = State(data2)
    pair = Pair(state1, state2)
    pair.calculate_states()
    print(state1.new_attribute, state2.new_attribute)
Prints:
1032 9874
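The reason the callback approach can set attributes on the original objects is that the callback is executed in the parent process (by the pool's result-handling thread), not in a worker. A small sketch that makes this visible by comparing process IDs; Recorder and work are hypothetical stand-ins for State and function:

```python
import os
from multiprocessing import Pool


def work(x):
    """Runs in a worker process; returns that process's PID with the result."""
    return os.getpid(), x * x


class Recorder:
    """Hypothetical stand-in for State: stores whatever the callback receives."""

    def __init__(self):
        self.results = None
        self.callback_pid = None

    def callback(self, results):
        # Invoked inside the parent process, so these attribute assignments
        # are visible to the code that created the Recorder.
        self.callback_pid = os.getpid()
        self.results = results


if __name__ == "__main__":
    rec = Recorder()
    with Pool(2) as pool:
        pool.map_async(work, range(4), callback=rec.callback)
        pool.close()
        pool.join()
    print(rec.callback_pid == os.getpid())         # True
    print([value for _pid, value in rec.results])  # [0, 1, 4, 9]
```

Because the callback runs on a pool-internal thread, it should stay short; long-running work there delays the handling of other results.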
I'm learning about multithreading and am trying to implement a few things to understand it.
After reading several (very technical) topics, I cannot find a solution or a way to understand my issue.
Basically, I have the following structure:
import datetime
from multiprocessing import Pool
class MyObject():
    def __init__(self):
        self.lastupdate = datetime.datetime.now()
    def DoThings(self):
        ...
def MyThreadFunction(OneOfMyObject):
    OneOfMyObject.DoThings()
    OneOfMyObject.lastupdate = datetime.datetime.now()
def main():
    MyObject1 = MyObject()
    MyObject2 = MyObject()
    MyObjects = [MyObject1, MyObject2]
    pool = Pool(2)
    while True:
        pool.map(MyThreadFunction, MyObjects)
if __name__ == '__main__':
    main()
I think the .map function makes a copy of my objects, because the time is not updated. Is that right? If yes, how could I pass in a global version of my objects? If not, do you have any idea why the time stays fixed in my objects?
When I check the new time with print(MyObject.lastupdate), the time is right, but not in the next loop.
Thank you very much for any ideas.
Yes, multiprocessing serializes (pickles) your objects and reconstructs them in the worker process, so the worker modifies a copy. However, whatever the worker function returns is pickled and sent back as well. To recover the updated objects, see the commented additions to the code below:
import datetime
from multiprocessing import Pool
class MyObject():
    def __init__(self):
        self.lastupdate = datetime.datetime.now()
    def DoThings(self):
        ...
def MyThreadFunction(OneOfMyObject):
    OneOfMyObject.DoThings()
    OneOfMyObject.lastupdate = datetime.datetime.now()
    # NOW, RETURN THE OBJECT
    return OneOfMyObject
def main():
    MyObject1 = MyObject()
    MyObject2 = MyObject()
    MyObjects = [MyObject1, MyObject2]
    with Pool(2) as pool: # <- the context manager shuts the pool down cleanly when the block exits. Check out context managers if interested.
        # Now we recover a list of the updated objects:
        processed_object_list = pool.map(MyThreadFunction, MyObjects)
    # Now inspect
    for my_object in processed_object_list:
        print(my_object.lastupdate)
if __name__ == '__main__':
    main()
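If the objects need to be processed repeatedly, as in the question's loop, the returned copies have to replace the originals on every round; otherwise each round starts again from the stale parent-side objects. A rough sketch of that, where Counter, step and run_rounds are hypothetical names introduced for illustration:

```python
import datetime
from multiprocessing import Pool


class Counter:
    """Hypothetical object whose state should survive across rounds."""

    def __init__(self):
        self.ticks = 0
        self.lastupdate = datetime.datetime.now()


def step(obj):
    # Runs on a pickled copy in a worker; return the copy so the parent keeps it.
    obj.ticks += 1
    obj.lastupdate = datetime.datetime.now()
    return obj


def run_rounds(objects, rounds, mapper):
    """Apply step to every object `rounds` times, keeping the returned copies."""
    for _ in range(rounds):
        objects = list(mapper(step, objects))
    return objects


if __name__ == "__main__":
    with Pool(2) as pool:
        objects = run_rounds([Counter(), Counter()], rounds=3, mapper=pool.map)
    print([obj.ticks for obj in objects])  # [3, 3]
```

Passing the mapper in as a parameter is just a convenience here: it lets the same loop run with pool.map or with the built-in map for debugging.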
I've been looking at the following questions for the past hour without any luck:
Python sharing a dictionary between parallel processes
multiprocessing: sharing a large read-only object between processes?
multiprocessing in python - sharing large object (e.g. pandas dataframe) between multiple processes
I've written a very basic test file to illustrate what I'm trying to do:
from collections import deque
from multiprocessing import Process
import numpy as np
class TestClass:
    def __init__(self):
        self.mem = deque(maxlen=4)
        self.process = Process(target=self.run)
    def run(self):
        while True:
            self.mem.append(np.array([0, 1, 2, 3, 4]))
def print_values(x):
    while True:
        print(x)
test = TestClass()
process = Process(target=print_values(test.mem))
test.process.start()
process.start()
Currently this outputs the following :
deque([], maxlen=4)
How can I access the values in mem from the main code or from the process that runs "print_values"?
Unfortunately multiprocessing.Manager() doesn't support deque, but it does work with list, dict, Queue, Value and Array. A list is fairly close, so I've used it in the example below.
from multiprocessing import Process, Manager
import numpy as np
class TestClass:
    def __init__(self):
        self.maxlen = 4
        self.manager = Manager()
        self.mem = self.manager.list()
        self.lock = self.manager.Lock()
        self.process = Process(target=self.run, args=(self.mem, self.lock))
    def run(self, mem, lock):
        while True:
            array = np.random.randint(0, high=10, size=5)
            with lock:
                if len(mem) >= self.maxlen:
                    mem.pop(0)
                mem.append(array)
def print_values(mem, lock):
    while True:
        with lock:
            print(mem)
test = TestClass()
print_process = Process(target=print_values, args=(test.mem, test.lock))
test.process.start()
print_process.start()
test.process.join()
print_process.join()
You have to be a little careful using manager objects. You can use them a lot like the objects they reference, but you can't do something like mem = mem[-4:] to truncate the values, because that rebinds the name to a new, ordinary list instead of modifying the shared object.
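One way to truncate without rebinding is to mutate the list through the proxy, for example with slice deletion. This is a minimal sketch, assuming a list proxy created by manager.list(); truncate_in_place is a hypothetical helper name:

```python
from multiprocessing import Manager


def truncate_in_place(mem, maxlen):
    """Keep only the last `maxlen` items, mutating `mem` itself.

    Slice deletion is forwarded to the shared list, whereas
    `mem = mem[-maxlen:]` would only bind a local name to a plain copy.
    """
    if len(mem) > maxlen:
        del mem[:-maxlen]


if __name__ == "__main__":
    with Manager() as manager:
        shared = manager.list(range(10))
        truncate_in_place(shared, 4)
        print(shared[:])  # [6, 7, 8, 9]
```

When several processes modify the list, the length check and the deletion should happen under the same lock as the append, as in the example above.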
As for coding style, I might move the Manager objects outside the class or move the print_values function inside it but for an example, this works. If you move things around, just note that you can't use self.mem directly in the run method. You need to pass it in when you start the process or the fork that python does in the background will create a new instance and it won't be shared.
Hopefully this works for your situation, if not, we can try to adapt it a bit.
So by combining the code provided by @bivouac0 and the comment @Marijn Pieters posted, I came up with the following solution:
from multiprocessing import Process, Manager, Queue
class testClass:
    def __init__(self, maxlen=4):
        self.mem = Queue(maxsize=maxlen)
        self.process = Process(target=self.run)
    def run(self):
        i = 0
        while True:
            self.mem.empty()
            while not self.mem.full():
                self.mem.put(i)
                i += 1
def print_values(queue):
    while True:
        values = queue.get()
        print(values)
if __name__ == "__main__":
    test = testClass()
    print_process = Process(target=print_values, args=(test.mem,))
    test.process.start()
    print_process.start()
    test.process.join()
    print_process.join()
I have created a class with a number of methods. One of the methods is very time consuming, my_process, and I'd like to do that method in parallel. I came across Python Multiprocessing - apply class method to a list of objects but I'm not sure how to apply it to my problem, and what effect it will have on the other methods of my class.
class MyClass():
    def __init__(self, input):
        self.input = input
        self.result = int
    def my_process(self, multiply_by, add_to):
        self.result = self.input * multiply_by
        self._my_sub_process(add_to)
        return self.result
    def _my_sub_process(self, add_to):
        self.result += add_to
list_of_numbers = range(0, 5)
list_of_objects = [MyClass(i) for i in list_of_numbers]
list_of_results = [obj.my_process(100, 1) for obj in list_of_objects] # multi-process this for-loop
print(list_of_numbers)
print(list_of_results)
[0, 1, 2, 3, 4]
[1, 101, 201, 301, 401]
I'm going to go against the grain here, and suggest sticking to the simplest thing that could possibly work ;-) That is, Pool.map()-like functions are ideal for this, but are restricted to passing a single argument. Rather than make heroic efforts to worm around that, simply write a helper function that only needs a single argument: a tuple. Then it's all easy and clear.
Here's a complete program taking that approach, which prints what you want under Python 2, and regardless of OS:
class MyClass():
    def __init__(self, input):
        self.input = input
        self.result = int
    def my_process(self, multiply_by, add_to):
        self.result = self.input * multiply_by
        self._my_sub_process(add_to)
        return self.result
    def _my_sub_process(self, add_to):
        self.result += add_to
import multiprocessing as mp
NUM_CORE = 4 # set to the number of cores you want to use
def worker(arg):
    obj, m, a = arg
    return obj.my_process(m, a)
if __name__ == "__main__":
    list_of_numbers = range(0, 5)
    list_of_objects = [MyClass(i) for i in list_of_numbers]
    pool = mp.Pool(NUM_CORE)
    list_of_results = pool.map(worker, ((obj, 100, 1) for obj in list_of_objects))
    pool.close()
    pool.join()
    print(list_of_numbers)
    print(list_of_results)
A bit of magic
I should note there are many advantages to taking the very simple approach I suggest. Beyond that it "just works" on Pythons 2 and 3, requires no changes to your classes, and is easy to understand, it also plays nice with all of the Pool methods.
However, if you have multiple methods you want to run in parallel, it can get a bit annoying to write a tiny worker function for each. So here's a tiny bit of "magic" to worm around that. Change worker() like so:
def worker(arg):
    obj, methname = arg[:2]
    return getattr(obj, methname)(*arg[2:])
Now a single worker function suffices for any number of methods, with any number of arguments. In your specific case, just change one line to match:
list_of_results = pool.map(worker, ((obj, "my_process", 100, 1) for obj in list_of_objects))
More-or-less obvious generalizations can also cater to methods with keyword arguments. But, in real life, I usually stick to the original suggestion. At some point catering to generalizations does more harm than good. Then again, I like obvious things ;-)
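One possible shape for the keyword-argument generalization mentioned above is to pack positional and keyword arguments separately into the task tuple. This is only a sketch; the four-element tuple layout is an assumption made for illustration, and MyClass is simplified here with a default for add_to.

```python
import multiprocessing as mp


def worker(arg):
    """Call obj.<methname>(*args, **kwargs) from a single picklable tuple."""
    obj, methname, args, kwargs = arg
    return getattr(obj, methname)(*args, **kwargs)


class MyClass:
    def __init__(self, input):
        self.input = input

    def my_process(self, multiply_by, add_to=0):
        return self.input * multiply_by + add_to


if __name__ == "__main__":
    objects = [MyClass(i) for i in range(5)]
    # Layout assumed here: (object, method name, positional args, keyword args)
    tasks = [(obj, "my_process", (100,), {"add_to": 1}) for obj in objects]
    with mp.Pool(4) as pool:
        print(pool.map(worker, tasks))  # [1, 101, 201, 301, 401]
```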
If your class is not "huge", I think a process-oriented approach is better.
A Pool from multiprocessing is the suggested tool.
This is the tutorial -> https://docs.python.org/2/library/multiprocessing.html#using-a-pool-of-workers
Then separate add_to from my_process, since it is quick, and wait until the last process has finished.
from multiprocessing import Pool
def my_process(input, multiby):
    return xxxx
def add_to(result,a_list):
    xxx
p = Pool(5)
res = []
for i in range(10):
    res.append(p.apply_async(my_process, (i,5)))
p.close() # no more tasks; join() raises an error if the pool is still open
p.join() # wait for the end of the last process
for i in range(10):
    print(res[i].get())
Generally the easiest way to run the same calculation in parallel is the map method of a multiprocessing.Pool (or the as_completed function from concurrent.futures in Python 3).
However, the map method applies a function that only takes one argument to an iterable of data using multiple processes.
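So this function cannot be a plain (unbound) method, because that requires at least two arguments: it must also be given self! A bound method such as obj.method already carries self, but it has to be picklable, which was not the case in Python 2. It could be a staticmethod, however. See also this answer for a more in-depth explanation.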
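When the extra arguments are the same for every item, one common way to get a one-argument callable for map is functools.partial applied to a module-level function. A short sketch under that assumption; scale_and_shift and build_task are hypothetical names:

```python
from functools import partial
from multiprocessing import Pool


def scale_and_shift(multiply_by, add_to, value):
    """Module-level function, so it can be pickled and sent to workers."""
    return value * multiply_by + add_to


def build_task(multiply_by, add_to):
    # partial() fixes the extra arguments, leaving a one-argument callable.
    return partial(scale_and_shift, multiply_by, add_to)


if __name__ == "__main__":
    with Pool(4) as pool:
        print(pool.map(build_task(100, 1), range(5)))  # [1, 101, 201, 301, 401]
```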
Based on the answer of Python Multiprocessing - apply class method to a list of objects and your code:
Add the MyClass object to the simulation object:
import multiprocessing
import os
import sys
class simulation(multiprocessing.Process):
    def __init__(self, id, worker, *args, **kwargs):
        # must call this before anything else
        multiprocessing.Process.__init__(self)
        self.id = id
        self.worker = worker
        self.args = args
        self.kwargs = kwargs
        sys.stdout.write('[%d] created\n' % (self.id))
Run what you want in the run function:
    def run(self):
        sys.stdout.write('[%d] running ... process id: %s\n' % (self.id, os.getpid()))
        self.worker.my_process(*self.args, **self.kwargs)
        sys.stdout.write('[%d] completed\n' % (self.id))
Try this:
list_of_numbers = range(0, 5)
list_of_objects = [MyClass(i) for i in list_of_numbers]
list_of_sim = [simulation(id=k, worker=obj, multiply_by=100*k, add_to=10*k) \
               for k, obj in enumerate(list_of_objects)]
for sim in list_of_sim:
    sim.start()
If you don't absolutely need to stick with the multiprocessing module,
this can easily be achieved using the concurrent.futures library.
Here's the example code:
from concurrent.futures import ThreadPoolExecutor, wait
MAX_WORKERS = 20
class MyClass():
    def __init__(self, input):
        self.input = input
        self.result = int
    def my_process(self, multiply_by, add_to):
        self.result = self.input * multiply_by
        self._my_sub_process(add_to)
        return self.result
    def _my_sub_process(self, add_to):
        self.result += add_to
# defined before use, since the executor block below runs at import time
def on_finish(future):
    result = future.result() # do stuff with your result
list_of_numbers = range(0, 5)
list_of_objects = [MyClass(i) for i in list_of_numbers]
with ThreadPoolExecutor(MAX_WORKERS) as executor:
    for obj in list_of_objects:
        executor.submit(obj.my_process, 100, 1).add_done_callback(on_finish)
Here the executor returns a future for every task it submits. Keep in mind that a callback added with add_done_callback() runs in a thread of the process that added it, so a slow callback holds up other work in that thread. If you would rather collect the results yourself, wait for the future objects explicitly. Here's the code snippet for that:
futures = []
with ThreadPoolExecutor(MAX_WORKERS) as executor:
    for obj in list_of_objects:
        futures.append(executor.submit(obj.my_process, 100, 1))
# wait() returns two sets: finished futures and unfinished ones
done, not_done = wait(futures)
for future in done:
    # work with your result here
    if future.exception() is None:
        print(future.result())
    else:
        print(future.exception())
hope this helps.
I have the following situation process=Process(target=sample_object.run) I then would like to edit a property of the sample_object: sample_object.edit_property(some_other_object).
class sample_object:
    def __init__(self):
        self.storage=[]
    def edit_property(self,some_other_object):
        self.storage.append(some_other_object)
    def run(self):
        while True:
            if len(self.storage) != 0:
                print("1")
            #I know it's an infinite loop. It's just an example.
_______________________________________________________
from multiprocessing import Process
from sample import sample_object
from sample2 import some_other_object
class driver:
    if __name__ == "__main__":
        samp = sample_object()
        proc = Process(target=samp.run)
        proc.start()
        while True:
            some = some_other_object()
            samp.edit_property(some)
            #I know it's an infinite loop
The previous code never prints "1". How would I connect the Process to the sample_object so that an edit made to the object whose method Process is calling is recognized by the process? In other words, is there a way to get .run to recognize the change in sample_object ?
Thank you.
You can use multiprocessing.Manager to share Python data structures between processes.
from multiprocessing import Process, Manager
class A(object):
    def __init__(self, storage):
        self.storage = storage
    def add(self, item):
        self.storage.append(item)
    def run(self):
        while True:
            if self.storage:
                print(1)
if __name__ == '__main__':
    manager = Manager()
    storage = manager.list()
    a = A(storage)
    p = Process(target=a.run)
    p.start()
    for i in range(10):
        a.add({'id': i})
    p.join()
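An alternative to a shared list, if message passing is acceptable, is to send each edit to the running process through a multiprocessing.Queue and let the process keep its own storage. The sketch below is an assumption-laden illustration rather than a drop-in replacement: consume, the STOP sentinel and the outbox queue are hypothetical names, and the loop terminates so that the example can finish.

```python
from multiprocessing import Process, Queue

STOP = None  # hypothetical sentinel telling the worker to exit


def consume(inbox, outbox):
    """Worker loop: receive items from the parent until STOP arrives."""
    storage = []
    while True:
        item = inbox.get()  # blocks until the parent sends something
        if item is STOP:
            break
        storage.append(item)
    outbox.put(len(storage))  # report back how many items were received


if __name__ == "__main__":
    inbox, outbox = Queue(), Queue()
    proc = Process(target=consume, args=(inbox, outbox))
    proc.start()
    for i in range(10):
        inbox.put({"id": i})
    inbox.put(STOP)
    print(outbox.get())  # 10
    proc.join()
```

Unlike the polling loop above, the worker here sleeps inside get() until there is something to do, so it does not spin on an empty container.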