Python concurrency with concurrent.futures.ThreadPoolExecutor

Consider the following snippet:
import concurrent.futures
import time
from random import random

class Test(object):
    def __init__(self):
        self.my_set = set()

    def worker(self, name):
        temp_set = set()
        temp_set.add(name)
        temp_set.add(name*10)
        time.sleep(random() * 5)
        temp_set.add(name*10 + 1)
        self.my_set = self.my_set.union(temp_set)  # question 1
        return name

    def start(self):
        result = []
        names = [1, 2, 3, 4, 5, 6, 7]
        with concurrent.futures.ThreadPoolExecutor(max_workers=len(names)) as executor:
            futures = [executor.submit(self.worker, x) for x in names]
            for future in concurrent.futures.as_completed(futures):
                result.append(future.result())  # question 2
Is there a chance self.my_set can become corrupted via the line marked "question 1"? I believe union is atomic, but couldn't the assignment be a problem?
Is there a problem on the line marked "question 2"? I believe the list append is atomic, so perhaps this is ok.
I've read these docs:
https://docs.python.org/3/library/stdtypes.html#set
https://web.archive.org/web/20201101025814id_/http://effbot.org/zone/thread-synchronization.htm
Is Python variable assignment atomic?
https://docs.python.org/3/glossary.html#term-global-interpreter-lock
and run the snippet provided in that question, but I can't find a definitive answer for how concurrency should work in this case.

Regarding question 1: Think about what's going on here:
self.my_set = self.my_set.union(temp_set)
There's a sequence of at least three distinct steps:
1. The worker grabs the current value of self.my_set (a reference to a set object).
2. The union method constructs a new set.
3. The worker rebinds self.my_set to refer to the newly constructed set.
So what happens if two or more workers concurrently try to do the same thing? (note: it's not guaranteed to happen this way, but it could happen this way.)
1. Each of them could grab a reference to the original my_set.
2. Each of them could compute a new set, consisting only of the original members of my_set plus its own contribution.
3. Each of them could assign its new set to the my_set variable.
The problem is in step three. If it happens this way, each of those new sets contains only the original members plus the contribution of the one worker that created it. There is no single set containing the contributions from all of the workers. When it's all over, my_set refers to just one of those new sets (whichever thread was the last to perform the assignment "wins"), and the other new sets are all thrown away.
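To make the hazard concrete, here is a minimal sketch (not part of the original code) in which many threads perform the same read/union/rebind sequence; the sleep just widens the window between steps one and three:
import threading
import time

shared = set()

def worker(n):
    global shared
    snapshot = shared            # step 1: grab the current reference
    time.sleep(0.01)             # widen the window between read and write
    new_set = snapshot | {n}     # step 2: build a new set from the snapshot
    shared = new_set             # step 3: rebind the shared name

threads = [threading.Thread(target=worker, args=(i,)) for i in range(100)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(shared))  # often far less than 100: concurrent updates overwrite each other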
One way to prevent that would be to use mutual exclusion to keep other threads from trying to compute their new sets and update the shared variable at the same time:
import threading

class Test(object):
    def __init__(self):
        self.my_set = set()
        self.my_set_mutex = threading.Lock()

    def worker(self, name):
        ...
        with self.my_set_mutex:
            self.my_set = self.my_set.union(temp_set)
        return name
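An equivalent variant, still assuming the same lock, mutates the existing set in place instead of rebinding the attribute:
with self.my_set_mutex:
    self.my_set.update(temp_set)  # in-place union; no new set, no rebinding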
Regarding question 2: It doesn't matter whether or not appending to a list is "atomic." The result variable is local to the start method. In the code that you've shown, the list to which result refers is inaccessible to any other thread than the one that created it. There can't be any interference between threads unless you share the list with other threads.
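If it helps to see that, here is a small standalone sketch (names are illustrative): each future's result is appended by the same thread that called the executor, never by the worker threads.
import concurrent.futures
import threading

def worker(n):
    # runs in a pool thread; it only returns a value
    return n, threading.current_thread().name

result = []
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
    futures = [executor.submit(worker, n) for n in range(4)]
    for future in concurrent.futures.as_completed(futures):
        # this append always runs in the thread that owns `result`
        result.append((future.result(), threading.current_thread().name))

for (value, worker_thread), appender_thread in result:
    print(value, worker_thread, appender_thread)  # appender_thread is always MainThread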

Related

python mpire: modifying internal state of object within multiprocessing

I have a class with a method which modifies its internal state, for instance:
class Example():
    def __init__(self, value):
        self.param = value

    def example_method(self, m):
        self.param = self.param * m
        # By convention, these methods in my implementation return the object itself
        return self
I want to run example_method in parallel (I am using the mpire library, but other options are welcome as well) for many instances of Example, and have the internal state changes persist in my instances. Something like:
import mpire

list_of_instances = [Example(i) for i in range(1, 6)]

def run_method(ex):
    ex.example_method(10)

print("Before parallel calls, this should print <1>")
print(f"<{list_of_instances[0].param}>")

with mpire.WorkerPool(n_jobs=3) as pool:
    pool.map_unordered(run_method, [(example,) for example in list_of_instances])

print("After parallel calls, this should print <10>")
print(f"<{list_of_instances[0].param}>")
However, the way mpire works, what gets modified are copies of each example, not the objects within list_of_instances, so any changes to internal state are not kept after the parallel processing. The second print will therefore print <1>, because that object's internal state was never changed; only a copy of it was.
I am wondering if there are any solutions to have the internal state changes be applied to the original objects in list_of_instances.
The only solution I can think of is to replace list_of_instances with the result of pool.map_unordered (switching to the order-preserving pool.map if order matters), since in any other case (even when using shared_objects) a copy of the original objects is made and the state changes are lost.
Is there any way to solve this with parallel processing? I also accept answers using other libs.
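For what it's worth, here is a rough sketch of the workaround mentioned above: rebuild the list from the pool's return values. It assumes mpire's WorkerPool.map, which preserves input order, so the originals can simply be replaced.
import mpire

class Example:
    def __init__(self, value):
        self.param = value

    def example_method(self, m):
        self.param = self.param * m
        return self  # return the modified copy so the parent can collect it

def run_method(ex):
    return ex.example_method(10)

if __name__ == '__main__':
    list_of_instances = [Example(i) for i in range(1, 6)]
    with mpire.WorkerPool(n_jobs=3) as pool:
        # replace the originals with the modified copies returned by the workers
        list_of_instances = pool.map(run_method, [(ex,) for ex in list_of_instances])
    print(list_of_instances[0].param)  # 10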

Multiprocessing across classes on objects within modules

I am trying to parallelize operations on objects which are attributes of another object by using a simple top-level script to access methods contained within a module.
I have four classes in two modules: Host_Population and Host, contained in Host_Within_Population; and Vector_Population and Vector, contained in Vector_Within_Population. Host_Population.hosts is a list of Host objects, and Vector_Population.vectors is a list of Vector objects.
The top-level script looks something like this:
import Host_Within_Population
import Vector_Within_Population

host_pop = Host_Within_Population.Host_Population()
vect_pop = Vector_Within_Population.Vector_Population()

for time in range(5):
    host_pop.host_cycle(time)
    vect_pop.vector_cycle(time)

host_pop.calculate_variance()
This is a representation of the module, Host_Within_Population
class Host_Population(object):
    def host_cycle(self, time):
        for host in self.hosts:
            host.lifecycle(time)
            host.mort()

class Host(object):
    def lifecycle(self, time):
        pass  # do stuff

    def mort(self):
        pass  # do stuff
This is a representation of the module, Vector_Within_Population
class Vector_Population(object):
    def vector_cycle(self, time):
        for vect in self.vects:
            vect.lifecycle(time)
            vect.mort()

class Vector(object):
    def lifecycle(self, time):
        pass  # do stuff

    def mort(self):
        pass  # do stuff
I want to parallelize the for loops in host_cycle() and vector_cycle() after calling the methods from the top-level script. The attributes of each Host object will be permanently changed by the methods acting on them in host_cycle(), and likewise for each Vector object in vector_cycle(). It doesn't matter what order the objects within each cycle are processed in (i.e. hosts are not affected by actions taken on other hosts), but host_cycle() must completely finish before vector_cycle() begins. Processes in vector_cycle() need to be able to access each Host in the Host_Population, and the outcome of those processes will depend on the attributes of the Host. I will also need to access methods in both modules at times other than host_cycle() and vector_cycle(). I have been trying to use multiprocessing.Pool and map in many different permutations, but no luck even in highly simplified forms. One example of something I've tried:
class Host_Population:
    def host_cycle(self):
        with Pool() as q:
            q.map(h.lifecycle, [h for h in self.hosts])
But of course, h is not defined.
I have been unable to adapt the response to similar questions, such as this one. Any help is appreciated.
So I got a tumbleweed badge for this incredibly unpopular question, but just in case anyone ever has the same issue, I found a solution.
Within the Host class, lifecycle() returns a Host:
def lifecycle(self, time):
    # do stuff
    return self
These are passed to the multiprocessing method in the Host_Population class, which adds them back to the population.
# (uses: from functools import partial, from multiprocessing import Pool)
def host_pop_cycle(self, time):
    p = Pool()
    results = p.map_async(partial(Host.lifecycle, time=time), self.hosts)
    p.close()
    p.join()
    self.hosts = []
    for a in results.get():
        self.hosts.append(a)
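A slightly tighter sketch of the same idea: the blocking Pool.map already returns the modified copies in input order, so the close/join/append bookkeeping isn't needed. (This assumes the same Host class as above.)
from functools import partial
from multiprocessing import Pool

def host_pop_cycle(self, time):
    with Pool() as p:
        # each worker returns a modified copy of its Host; keep those copies
        self.hosts = p.map(partial(Host.lifecycle, time=time), self.hosts)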

Removing 2nd item from a queue, using another queue as an ADT

class Queue:
    def __init__(self):
        self._contents = []

    def enqueue(self, obj):
        self._contents.append(obj)

    def dequeue(self):
        return self._contents.pop(0)

    def is_empty(self):
        return self._contents == []

class remove_2nd(Queue):
    def dequeue(self):
        first_item = Queue.dequeue(self)
        # Condition if the queue length isn't greater than two
        if self.is_empty():
            return first_item
        else:
            # Second item to return
            second_item = Queue.dequeue(self)
            # Add back the first item to the queue (stuck here)
The remove_2nd class is basically a queue, except that if the length of the queue is greater than two, every dequeue removes the 2nd item. If it isn't, it behaves like a normal queue. I am only allowed to use the methods of Queue to finish remove_2nd.
My algorithm:
If queue is bigger than two:
Lets say my queue is 1 2 3 4
I would first remove the first item so it becomes
2 3 4
I would then remove the 2nd item and that will be the returned value, so then it will be
3 4
I would then add back the first item as wanted
1 3 4
The problem is, I don't know how to add it back. Enqueue puts it at the end, so basically it would be 3 4 1. I was thinking of reversing the 3 4, but I don't know how to do that either. Any help?
Just want to point out, I'm not allowed to access _contents directly or to create my own private variables for the remove_2nd class. This should strictly be done using the queue ADT.
def insert(self, position, element):
    self._contents.insert(position, element)
To get the queue back in the right order after removing the first two elements, you'll need to remove all the other elements as well. Once the queue is empty, you can add back the first element and all the other elements one by one.
How exactly you keep track of the values you're removing until you can add them again is a somewhat tricky question that depends on the rules of your assignment. If you can use Python's normal types (as local variables, not as new attributes for your class), you can put them in a list or a deque from the collections module. But you can also just use another Queue instance (an instance of the base type, not your subclass).
Try something like this in your else clause:
second_item = Queue.dequeue(self)  # note, this could be written super().dequeue()
temp = Queue()
while not self.is_empty():
    temp.enqueue(Queue.dequeue(self))
self.enqueue(first_item)
while not temp.is_empty():
    self.enqueue(temp.dequeue())
return second_item
As I commented in the code, Queue.dequeue(self) can be written more "pythonically" using the super builtin. The exact details of the call depend on which version of Python you're using (Python 3's super is much fancier than Python 2's version).
In Python 2, you have to explicitly pass your current class and self, so the call would be super(remove_2nd, self).dequeue(). In Python 3, you simply use super().dequeue() and it "magically" takes care of everything (in reality, the compiler figures out the class at compile time and adds some extra code to let it find self at run time).
For your simple code with only basic inheritance, there's no difference between using super or explicitly looking up the base class by name. But in more complicated situations, using super is very important. If you ever use multiple inheritance, calling overridden methods with super is often the only way to get things to work sanely.
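Putting the pieces together with Python 3's super(), the whole override might look roughly like this (it assumes the Queue base class from the question):
class remove_2nd(Queue):
    def dequeue(self):
        first_item = super().dequeue()
        if self.is_empty():
            # only one item was in the queue; behave like a normal dequeue
            return first_item
        second_item = super().dequeue()
        # drain the rest into a temporary queue so the order can be rebuilt
        temp = Queue()
        while not self.is_empty():
            temp.enqueue(super().dequeue())
        self.enqueue(first_item)
        while not temp.is_empty():
            self.enqueue(temp.dequeue())
        return second_item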

How do I access class variables without changing them in python?

I'm new to programming so sorry for the basic question. I am trying to write a search algorithm for a class, and I thought creating a class for each search node would be helpful.
class Node(object):
    def __init__(self, path_to_node, search_depth, current_state):
        self.path_to_node = path_to_node
        self.search_depth = search_depth
        self.current_state = current_state
        ...
With some methods too. I am now trying to define a function outside of the class to create child nodes of a node and add them to a queue. node.current_state is a list.
def bfs_expand(node, queuey, test_states):
    # Node Queue List -> Queue List
    # If legal move and not already in test states create and put children nodes
    # into the queue and their state into test_states. Return queue and test states

    # Copy original path, depth, and state to separate variables
    original_path = node.path_to_node
    original_depth = node.search_depth
    original_state = node.current_state

    # Check if up is legal, if so add new node to queue and state to test state
    if node.is_legal_move('Up'):
        up_state = original_state
        a = up_state.index(0)
        b = a - 3
        up_state[a], up_state[b] = up_state[b], up_state[a]
        if up_state not in test_states:
            test_states.append(up_state)
            up_node = Node(original_path + ['Up'], original_depth + 1, up_state)
            queuey.put(up_node)
    print(test_states)
    print(original_state)
I then try to proceed through down, left, and right with similar if statements, but they are messed up because original_state has changed. When I print the original state after that up statement, it shows the up_state created in the if statement. I realize (well, I think) that this is because original_state, and therefore up_state, are really just references to node.current_state and do not store the list in a separate variable. How should I get the value from a node so I can manipulate it independently? Should I even be using a class for something like this, or maybe a dictionary? I don't need code written for me, but a conceptual nudge would be greatly appreciated!
You should use copy.deepcopy if you want to avoid modifying the original:
import copy

original_path = copy.deepcopy(node.path_to_node)
original_depth = copy.deepcopy(node.search_depth)
original_state = copy.deepcopy(node.current_state)
Or essentially whichever object you want to use as a "working copy" should be a deep copy of the original if you don't want to modify the original version of it.
Expanding a bit on @CoryKramer's answer: in Python, objects have reference semantics, which means that saying
a = b
where a and b are names, makes both a and b refer to the same object, so changing an attribute through a will be visible through b as well. To actually get a new object with the same contents as the old one, you should use copy.deepcopy as already stated. Be aware that deepcopy copies everything reachable from the object (keeping a memo of objects it has already copied, so reference cycles are handled rather than looping forever), which can be expensive for large object graphs.
For this reason, there is also copy.copy, which makes a shallow copy and does not follow the object references contained in the object being copied.
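A small illustration of the difference (the values are arbitrary):
import copy

original = {'state': [1, 2, 3], 'depth': 0}
alias = original                 # same object: two names, one dict
shallow = copy.copy(original)    # new dict, but 'state' still points at the same list
deep = copy.deepcopy(original)   # fully independent copy

alias['state'].append(4)
print(original['state'])   # [1, 2, 3, 4]  (changed through the alias)
print(shallow['state'])    # [1, 2, 3, 4]  (shallow copy shares the inner list)
print(deep['state'])       # [1, 2, 3]     (deep copy is unaffected)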

python multiprocessing : setting class attribute value

I have a class called Experiment and another called Case. One Experiment is made up of many individual cases. See the class definitions below:
from multiprocessing import Process

class Experiment(object):
    def __init__(self, name):
        self.name = name
        self.cases = []
        self.cases.append(Case('a'))
        self.cases.append(Case('b'))
        self.cases.append(Case('c'))

    def sr_execute(self):
        for c in self.cases:
            c.setVars(6)

class Case(object):
    def __init__(self, name):
        self.name = name

    def setVars(self, var):
        self.var = var
In my Experiment class, I have a method called sr_execute. This method shows the desired behavior: I am interested in iterating through all the cases and setting an attribute on each of them. When I run the following code,
if __name__ == '__main__':
    # multiprocessing.freeze_support()
    e = Experiment('exp')
    e.sr_execute()
    for c in e.cases: print c.name, c.var
I get,
a 6
b 6
c 6
This is the desired behavior.
However, I would like to do this in parallel using multiprocessing. To do this, I add an mp_execute() method to the Experiment class:
def mp_execute(self):
    processes = []
    for c in self.cases:
        processes.append(Process(target=c.setVars, args=(6,)))
    [p.start() for p in processes]
    [p.join() for p in processes]
However, this does not work. When I execute the following,
if __name__ == '__main__':
    # multiprocessing.freeze_support()
    e = Experiment('exp')
    e.mp_execute()
    for c in e.cases: print c.name, c.var
I get an error,
AttributeError: 'Case' object has no attribute 'var'
Apparently, I am unable to set an instance attribute using multiprocessing. Any clues as to what is going on?
When you call:
def mp_execute(self):
    processes = []
    for c in self.cases:
        processes.append(Process(target=c.setVars, args=(6,)))
    [p.start() for p in processes]
    [p.join() for p in processes]
when you create the Process, it will use a copy of your object, and the modifications to that object are not passed back to the main program, because different processes have different address spaces. It would work if you used threads, since in that case no copy is created.
Also note that your code will probably fail on Windows, because you are passing a method as target and Windows requires the target to be picklable (and instance methods are not picklable in Python 2).
The target should be a function defined at the top level of a module in order to work on all OSes.
If you want to communicate the changes back to the main process, you could:
Use a Queue to pass the results
Use a Manager to build a shared object
Either way, you must handle the communication explicitly, either by setting up a "channel" (like a Queue) or by setting up shared state.
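A rough sketch of the Queue option, keeping the Experiment and Case classes from the question (the worker function is defined at module level so it stays picklable on Windows); this version of mp_execute would replace the one above:
from multiprocessing import Process, Queue

def set_vars(name, var, q):
    # runs in the child process; send the result back instead of mutating a copy
    q.put((name, var))

def mp_execute(self):
    q = Queue()
    processes = [Process(target=set_vars, args=(c.name, 6, q)) for c in self.cases]
    for p in processes:
        p.start()
    # drain the queue before joining so the children are never blocked on a full pipe
    results = dict(q.get() for _ in processes)
    for p in processes:
        p.join()
    for c in self.cases:
        c.setVars(results[c.name])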
Style note: do not use list comprehensions in this way:
[p.join() for p in processes]
It's simply wrong: you are wasting space by creating a list of Nones, and it's also probably slower than the right way:
for p in processes:
    p.join()
(The comprehension is slower because it also has to append each element to the list it is building.)
Some say that list comprehensions are slightly faster than for loops; however:
The difference in performance is so small that it generally doesn't matter.
They are faster only for loops of this kind:
a = []
for element in something:
    a.append(element)
If the loop, like in this case, does not create a list, then the for loop will be faster.
By the way: some people use map in the same way to perform side effects. This is again wrong, because you won't gain much speed for the same reason as before, and it fails completely in Python 3, where map returns an iterator and therefore will not execute the function at all, making the code less portable.
@Bakuriu's answer offers good styling and efficiency suggestions. It is also true that each process gets a copy of the master process's address space, so changes made by forked processes will not be reflected in the address space of the master process unless you use some form of IPC (e.g. a Queue, Pipe, or Manager).
But the particular AttributeError: 'Case' object has no attribute 'var' error that you are getting has an additional reason, namely that your Case objects do not yet have the var attribute at the time you launch your processes. Instead, the var attribute is created in the setVars() method.
Your forked processes do indeed create the variable when they call setVars() (and actually even set it to 6), but alas, this change is only in the copies of Case objects, i.e. not reflected in the master process's memory space (where the variable still does not exist).
To see what I mean, change your Case class to this:
class Case(object):
    def __init__(self, name):
        self.name = name
        self.var = 7  # Create var in the constructor.

    def setVars(self, var):
        self.var = var
By adding the var member variable in the constructor, your master process will have access to it. Of course, the changes in the forked processes will still not be reflected in the master process, but at least you don't get the error:
a 7
b 7
c 7
Hope this sheds light on what's going on. =)
SOLUTION:
The least intrusive (to the original code) thing to do is to use a ctypes object backed by shared memory:
from multiprocessing import Value

class Case(object):
    def __init__(self, name):
        self.name = name
        self.var = Value('i', 7)  # Use a ctypes "int" backed by shared memory.

    def setVars(self, var):
        self.var.value = var  # Set the variable's "value" attribute.
and change your main() to print c.var.value:
for c in e.cases: print c.name, c.var.value # Print the "value" attribute.
Now you have the desired output:
a 6
b 6
c 6
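And if the per-case state were richer than a single int, the Manager route mentioned in the other answer is another option. A rough sketch (Python 3 syntax, assuming the Experiment and Case classes from the question): the workers publish their values into a manager-backed dict, and the parent copies them back onto its own objects.
from multiprocessing import Process, Manager

def set_vars(shared, name, var):
    # top-level function so it is picklable; publish the result in the managed dict
    shared[name] = var

if __name__ == '__main__':
    e = Experiment('exp')
    manager = Manager()
    shared = manager.dict()
    processes = [Process(target=set_vars, args=(shared, c.name, 6)) for c in e.cases]
    for p in processes:
        p.start()
    for p in processes:
        p.join()
    for c in e.cases:
        c.setVars(shared[c.name])
        print(c.name, c.var)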
