Blocking dict in Python?

Is there a data structure in Python that resembles a blocking dictionary? This data structure must fulfill these requirements:
it must be randomly accessible and allow any element to be modified/deleted (not just the first or last)
it must have a blocking get() and put()
it must be thread-safe
I would have used a queue but, although blocking and thread-safe, it's not randomly accessible. A dict is not blocking either (as far as my Python knowledge goes).
As an example, think of one producer thread adding key-value pairs to such a data-structure (updating values for existing keys if already present - this is where a queue won't cut it), and a worker blocking on get() and consuming these key-value pairs as they become available.
Many many thanks!
edit:
Let's assume the producer polls a CI server and gets project-status pairs. It generates the differences in project statuses and puts them in the aforementioned data structure. The worker picks up these project-status updates and displays them one by one as an animation on the screen.
class Producer:
    def generateProjectStatusChanges(self):
        ...
    def updateSuperAwesomeDataStructure(self, changes):
        for (proj, stat) in changes:
            # queue won't do cause the update could take place in the middle of the queue
            # hence the dict behavior
            superAwesomeDS.putOrUpdate(proj, stat)
    def watchForUpdates(self):
        changes = self.generateProjectStatusChanges()
        self.updateSuperAwesomeDataStructure(changes)
        time.sleep(self.interval)

class Worker:
    def blockingNotifyAnimation(self, proj, stat):
        ...
    def watchForUpdates(self):
        while True:
            proj, stat = superAwesomeDS.getFirstPair()  # or any pair really
            self.blockingNotifyAnimation(proj, stat)

Something along the following lines should do the trick (untested):
import threading

class UpdatableBlockingQueue(object):
    def __init__(self):
        self.queue = {}
        self.cv = threading.Condition()

    def put(self, key, value):
        # Insert or update under the lock, then wake a waiting consumer.
        with self.cv:
            self.queue[key] = value
            self.cv.notify()

    def pop(self):
        # Block until at least one pair is available, then remove and return one.
        with self.cv:
            while not self.queue:
                self.cv.wait()
            return self.queue.popitem()
It uses a dictionary for the queue and a condition variable for serialising access and signalling between threads.
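As a quick illustration of the behaviour (my example, not from the answer): updates to an existing key coalesce, so a consumer only ever sees the latest status per project.

import threading

q = UpdatableBlockingQueue()
q.put("proj-a", "building")
q.put("proj-a", "passed")    # updates the pending entry instead of queueing a second one
q.put("proj-b", "failed")

def worker():
    for _ in range(2):       # only two pairs exist, because proj-a was coalesced
        proj, stat = q.pop() # blocks while the dict is empty
        print(proj, stat)

t = threading.Thread(target=worker)
t.start()
t.join()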

Related

Python concurrency with concurrent.futures.ThreadPoolExecutor

Consider the following snippet:
import concurrent.futures
import time
from random import random

class Test(object):
    def __init__(self):
        self.my_set = set()

    def worker(self, name):
        temp_set = set()
        temp_set.add(name)
        temp_set.add(name*10)
        time.sleep(random() * 5)
        temp_set.add(name*10 + 1)
        self.my_set = self.my_set.union(temp_set)  # question 1
        return name

    def start(self):
        result = []
        names = [1,2,3,4,5,6,7]
        with concurrent.futures.ThreadPoolExecutor(max_workers=len(names)) as executor:
            futures = [executor.submit(self.worker, x) for x in names]
            for future in concurrent.futures.as_completed(futures):
                result.append(future.result())  # question 2
Is there a chance self.my_set can become corrupted via the line marked "question 1"? I believe union is atomic, but couldn't the assignment be a problem?
Is there a problem on the line marked "question 2"? I believe the list append is atomic, so perhaps this is ok.
I've read these docs:
https://docs.python.org/3/library/stdtypes.html#set
https://web.archive.org/web/20201101025814id_/http://effbot.org/zone/thread-synchronization.htm
Is Python variable assignment atomic?
https://docs.python.org/3/glossary.html#term-global-interpreter-lock
and executed the snippet code provided in this question, but I can't find a definitive answer to how concurrency should work in this case.
Regarding question 1: Think about what's going on here:
self.my_set = self.my_set.union(temp_set)
There's a sequence of at least three distinct steps:
The worker grabs the current value of self.my_set (a reference to the set object).
The union call constructs a new set.
The worker assigns self.my_set to refer to the newly constructed set.
So what happens if two or more workers concurrently try to do the same thing? (note: it's not guaranteed to happen this way, but it could happen this way.)
Each of them could grab a reference to the original my_set.
Each of them could compute a new set, consisting only of the original members of my_set plus its own contribution.
Each of them could assign its new set to the my_set variable.
The problem is in step three. If it happened this way, then each of those new sets would only contain the contribution from the one worker that created it. There would be no single set containing the new contributions from all of the workers. When it's all over, my_set would refer to just one of those new sets (whichever thread was the last to perform the assignment would "win"), and the other new sets would all be thrown away.
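To make the race concrete, here is a small illustration (mine, not from the original answer) that widens the window between the read and the assignment so the lost update is easy to reproduce:

import threading
import time

class Test(object):
    def __init__(self):
        self.my_set = set()

    def worker(self, name):
        snapshot = self.my_set           # step 1: grab the current reference
        time.sleep(0.01)                 # widen the race window for the demo
        self.my_set = snapshot | {name}  # steps 2-3: build a new set, reassign

t = Test()
threads = [threading.Thread(target=t.worker, args=(n,)) for n in range(10)]
for th in threads: th.start()
for th in threads: th.join()

print(t.my_set)  # usually far fewer than 10 elements: updates were lost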
One way to prevent that would be to use mutual exclusion to keep other threads from trying to compute their new sets and update the shared variable at the same time:
import threading

class Test(object):
    def __init__(self):
        self.my_set = set()
        self.my_set_mutex = threading.Lock()

    def worker(self, name):
        ...
        with self.my_set_mutex:
            self.my_set = self.my_set.union(temp_set)
        return name
Regarding question 2: It doesn't matter whether or not appending to a list is "atomic." The result variable is local to the start method. In the code that you've shown, the list to which result refers is inaccessible to any other thread than the one that created it. There can't be any interference between threads unless you share the list with other threads.

Multiprocessing across classes on objects within modules

I am trying to parallelize operations on objects which are attributes of another object by using a simple top-level script to access methods contained within a module.
I have four classes in two modules: Host_Population and Host, contained in Host_Within_Population; and Vector_Population and Vector, contained in Vector_Within_Population. Host_Population.hosts is a list of Host objects, and Vector_Population.vectors is a list of Vector objects.
The top-level script looks something like this:
import Host_Within_Population
import Vector_Within_Population

host_pop = Host_Within_Population.Host_Population()
vect_pop = Vector_Within_Population.Vector_Population()
for time in range(5):
    host_pop.host_cycle(time)
    vect_pop.vector_cycle(time)
host_pop.calculate_variance()
This is a representation of the module, Host_Within_Population
class Host_Population(object):
    def host_cycle(self, time):
        for host in self.hosts:
            host.lifecycle(time)
            host.mort()

class Host(object):
    def lifecycle(self, time):
        # do stuff
    def mort(self):
        # do stuff
This is a representation of the module, Vector_Within_Population
class Vector_Population(object):
    def vector_cycle(self, time):
        for vect in self.vects:
            vect.lifecycle(time)
            vect.mort()

class Vector(object):
    def lifecycle(self, time):
        # do stuff
    def mort(self):
        # do stuff
I want to parallelize the for loops in host_cycle() and vector_cycle() after calling the methods from the top-level script. The attributes of each Host object will be permanently changed by the methods acting on them in host_cycle(), and likewise for each Vector object in vector_cycle(). It doesn't matter what order the objects within each cycle are processed in (i.e. hosts are not affected by actions taken on other hosts), but host_cycle() must completely finish before vector_cycle() begins. Processes in vector_cycle() need to be able to access each Host in the Host_Population, and the outcome of those processes will depend on the attributes of the Host. I will need to access methods in both modules at times other than host_cycle() and vector_cycle(). I have been trying to use multiprocessing.Pool and map in many different permutations, but no luck even in highly simplified forms. One example of something I've tried:
class Host_Population:
    def host_cycle(self):
        with Pool() as q:
            q.map(h.lifecycle, [h for h in self.hosts])
But of course, h is not defined.
I have been unable to adapt the response to similar questions, such as this one. Any help is appreciated.
So I got a tumbleweed badge for this incredibly unpopular question, but just in case anyone ever has the same issue, I found a solution.
Within the Host class, lifecycle() returns a Host:
def lifecycle(self, time):
    # do stuff
    return self
These are passed to the multiprocessing method in the Host_Population class, which adds them back to the population.
from functools import partial
from multiprocessing import Pool

def host_pop_cycle(self, time):
    p = Pool()
    results = p.map_async(partial(Host.lifecycle, time=time), self.hosts)
    p.close()
    p.join()
    self.hosts = []
    for a in results.get():
        self.hosts.append(a)
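The reason this pattern works, for anyone landing here: multiprocessing pickles each Host into the worker process, so mutations made there never reach the parent's copy; returning the mutated copy and rebuilding the list in the parent is the workaround. A minimal self-contained sketch of the same idea (toy attributes and names are mine, not from the original modules):

from functools import partial
from multiprocessing import Pool

class Host(object):
    def __init__(self, age):
        self.age = age

    def lifecycle(self, time):
        self.age += time  # mutates the pickled copy inside the worker process
        return self       # return the copy so the parent can keep it

def run_cycle(hosts, time):
    with Pool() as p:
        # Each Host is pickled to a worker, modified there, and the modified
        # copy is pickled back as the map result.
        return p.map(partial(Host.lifecycle, time=time), hosts)

if __name__ == "__main__":
    hosts = [Host(a) for a in range(4)]
    hosts = run_cycle(hosts, time=2)  # replace the list with the returned copies
    print([h.age for h in hosts])     # [2, 3, 4, 5]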

Removing 2nd item from a queue, using another queue as an ADT

class Queue:
    def __init__(self):
        self._contents = []
    def enqueue(self, obj):
        self._contents.append(obj)
    def dequeue(self):
        return self._contents.pop(0)
    def is_empty(self):
        return self._contents == []

class remove_2nd(Queue):
    def dequeue(self):
        first_item = Queue.dequeue(self)
        # Condition if the queue length isn't greater than two
        if self.is_empty():
            return first_item
        else:
            # Second item to return
            second_item = Queue.dequeue(self)
            # Add back the first item to the queue (stuck here)
The remove_2nd class is basically a queue, except that if the length of the queue is greater than two, every dequeue removes the 2nd item. If it isn't, it behaves like a normal queue. I am only allowed to use the methods of Queue to finish remove_2nd.
My algorithm:
If queue is bigger than two:
Lets say my queue is 1 2 3 4
I would first remove the first item so it becomes
2 3 4
I would then remove the 2nd item and that will be the returned value, so then it will be
3 4
I would then add back the first item as wanted
1 3 4
The problem is, I don't know how to add it back. Enqueue puts it at the end, so basically it would be 3 4 1. I was thinking of reversing the 3 4, but I don't know how to do that either. Any help?
Just want to point out, I'm not allowed to call on _contents or allowed to create my own private variable for the remove_2nd class. This should strictly be done using the queue adt
One suggested fix is to give Queue an insert method, though it touches _contents directly and so falls outside the stated restriction:
def insert(self, position, element):
    self._contents.insert(position, element)
To get the queue back in the right order after removing the first two elements, you'll need to remove all the other elements as well. Once the queue is empty, you can add back the first element and all the other elements one by one.
How exactly you keep track of the values you're removing until you can add them again is a somewhat tricky question that depends on the rules of your assignment. If you can use Python's normal types (as local variables, not as new attributes for your class), you can put them in a list or a deque from the collections module. But you can also just use another Queue instance (an instance of the base type, not your subclass).
Try something like this in your else clause:
second_item = Queue.dequeue(self)  # note, this could be written super().dequeue()
temp = Queue()
while not self.is_empty():
    temp.enqueue(Queue.dequeue(self))
self.enqueue(first_item)
while not temp.is_empty():
    self.enqueue(temp.dequeue())
return second_item
As I commented in the code, Queue.dequeue(self) can be written more "pythonically" using the super builtin. The exact details of the call depend on which version of Python you're using (Python 3's super is much fancier than Python 2's version).
In Python 2, you have to explicitly pass your current class and self, so the call would be super(remove_2nd, self).dequeue(). In Python 3, you simply use super().dequeue() and it "magically" takes care of everything (in reality, the compiler figures out the class at compile time and adds some extra code to let it find self at run time).
For your simple code with only basic inheritance, there's no difference between using super or explicitly looking up the base class by name. But in more complicated situations, using super is very important. If you ever use multiple inheritance, calling overridden methods with super is often the only way to get things to work sanely.
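Putting the pieces together, here is a quick sketch of how the finished remove_2nd behaves on the 1 2 3 4 example (the outputs in the comments follow from the algorithm above):

q = remove_2nd()
for x in [1, 2, 3, 4]:
    q.enqueue(x)       # queue is now 1 2 3 4

print(q.dequeue())     # 2  (queue becomes 1 3 4)
print(q.dequeue())     # 3  (queue becomes 1 4)
print(q.dequeue())     # 4  (queue becomes 1)
print(q.dequeue())     # 1  (queue is empty)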

Threadsafe way to copy-and-clear array

One array accumulates datagrams. On an interval, or once it reaches some length, it is flushed to a database.
Datagrams accumulate in the asynchronous datagram_received network event handler.
class Protocol:
    flows = []

    def datagram_received(self, data, addr):
        ...
        self.flows.append(flow)
And flushed by this method:
def store(self):
    flows = []
    while len(self.flows):
        flows.append(self.flows.pop(0))
    self.db.insert(flows)
    sleep(10)
    self.store()
How can I replace the while loop with a single thread-safe operation?
This module runs one instance of the class, but in two threads.
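A common idiom for this (a sketch under my own assumptions, not an answer from the thread) is to swap the whole list out under a lock, so the copy-and-clear becomes one short critical section rather than a pop-per-element loop:

import threading

class Protocol:
    def __init__(self):
        self.flows = []
        self._flows_lock = threading.Lock()  # assumption: lock added for the swap

    def datagram_received(self, data, addr):
        flow = ...  # whatever the real handler builds from data/addr
        with self._flows_lock:
            self.flows.append(flow)

    def store(self):
        # Swap the shared list for a fresh one inside the lock; the old list
        # is then private to this thread and can be inserted without locking.
        with self._flows_lock:
            flows, self.flows = self.flows, []
        if flows:
            self.db.insert(flows)  # assumption: db handle as in the question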

Twisted wait for event in loop

I want to read and process some data from an external service. I ask the service if there is any data, if something was returned I process it and ask again (so data can be processed immediately when it's available) and otherwise I wait for a notification that data is available. This can be written as an infinite loop:
def loop(self):
    while True:
        data = yield self.get_data_nonblocking()
        if data is not None:
            yield self.process_data(data)
        else:
            yield self.data_available

def on_data_available(self):
    self.data_available.fire()
How can data_available be implemented here? It could be a Deferred but a Deferred cannot be reset, only recreated. Are there better options?
Can this loop be integrated into the Twisted event loop? I can read and process data right in on_data_available and write some code instead of the loop checking get_data_nonblocking but I feel like then I'll need some locks to make sure data is processed in the same order it arrives (the code above enforces it because it's the only place where it's processed). Is this a good idea at all?
Consider the case of a TCP connection. The receiver buffer for a TCP connection can either have data in it or not. You can get that data, or get nothing, without blocking by using the non-blocking socket API:
data = socket.recv(1024)
if data:
    self.process_data(data)
You can wait for data to be available using select() (or any of the basically equivalent APIs):
socket.setblocking(False)
while True:
    data = socket.recv(1024)
    if data:
        self.process_data(data)
    else:
        select([socket], [], [])
Of these, only select() is particularly Twisted-unfriendly (though the Twisted idiom is certainly not to make your own socket.recv calls). You could replace the select call with a Twisted-friendly version though (implement a Protocol with a dataReceived method that fires a Deferred - sort of like your on_data_available method - toss in some yields and make this whole thing an inlineCallbacks generator).
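To illustrate that suggestion, here is a rough, untested sketch (my own; the helper names wait_for_data and process_data are assumptions, not Twisted APIs) of a Protocol whose dataReceived fires a Deferred that a waiting inlineCallbacks generator yields on:

from twisted.internet.defer import Deferred, inlineCallbacks, succeed
from twisted.internet.protocol import Protocol

class WaitableProtocol(Protocol):
    """Buffers incoming data and lets a generator wait for it."""

    def __init__(self):
        self._buffer = []
        self._waiter = None

    def dataReceived(self, data):
        self._buffer.append(data)
        if self._waiter is not None:
            waiter, self._waiter = self._waiter, None
            waiter.callback(None)  # wake the waiting generator

    def wait_for_data(self):
        # Returns a Deferred that fires once data is buffered.
        if self._buffer:
            return succeed(None)
        self._waiter = Deferred()
        return self._waiter

    @inlineCallbacks
    def loop(self):
        while True:
            yield self.wait_for_data()
            data, self._buffer = b"".join(self._buffer), []
            self.process_data(data)  # assumption: supplied elsewhere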
But though that's one way you can get data from a TCP connection, that's not the API that Twisted encourages you to use to do so. Instead, the API is:
class SomeProtocol(Protocol):
    def dataReceived(self, data):
        # Your logic here
I don't see how your case is substantially different. What if, instead of the loop you wrote, you did something like this:
class YourDataProcessor(object):
    def process_data(self, data):
        # Your logic here

class SomeDataGetter(object):
    def __init__(self, processor):
        self.processor = processor

    def on_available_data(self):
        data = self.get_data_nonblocking()
        if data is not None:
            self.processor.process_data(data)
Now there are no Deferreds at all (except perhaps in whatever implements on_available_data or get_data_nonblocking but I can't see that code).
If you leave this roughly as-is, you are guaranteed of in-ordered execution because Twisted is single-threaded (except in a couple places that are very clearly marked) and in a single-threaded program, an earlier call to process_data must complete before any later call to process_data could be made (excepting, of course, the case where process_data reentrantly invokes itself - but that's another story).
If you switch this back to using inlineCallbacks (or any equivalent "coroutine" flavored drink mix) then you are probably introducing the possibility of out-of-order execution.
For example, if get_data_nonblocking returns a Deferred and you write something like this:
@inlineCallbacks
def on_available_data(self):
    data = yield self.get_data_nonblocking()
    if data is not None:
        self.processor.process_data(data)
Then you have changed on_available_data to say that a context switch is allowed when calling get_data_nonblocking. In this case, depending on your implementation of get_data_nonblocking and on_available_data, it's entirely possible that:
on_available_data is called
get_data_nonblocking is called and returns a Deferred
on_available_data tells execution to switch to another context (via yield / inlineCallbacks)
on_available_data is called again
get_data_nonblocking is called again and returns a Deferred (perhaps the same one! perhaps a new one! depends on how it's implemented)
The second invocation of on_available_data tells execution to switch to another context (same reason)
The reactor spins around for a while and eventually an event arrives that causes the Deferred returned by the second invocation of get_data_nonblocking to fire.
Execution switches back to the second on_available_data frame
process_data is called with whatever data the second get_data_nonblocking call returned
Eventually the same things happen to the first set of objects and process_data is called again with whatever data the first get_data_nonblocking call returned
Now perhaps you've processed data out of order - again, this depends on more details of other parts of your system.
If so, you can always re-impose order. There are a lot of different possible approaches to this. Twisted itself doesn't come with any APIs that are explicitly in support of this operation so the solution involves writing some new code. Here's one idea (untested) for an approach - a queue-like class that knows about object sequence numbers:
from twisted.internet.defer import Deferred

class SequencedQueue(object):
    """
    A queue-like type which guarantees objects come out of the queue in the order
    defined by a sequence number associated with the objects when they are put into
    the queue.

    Application code manages sequence number assignment so that sequence numbers don't
    have to have the same order as `put` calls on this type.
    """
    def __init__(self):
        # The sequence number of the object that should be given out
        # by the next call to `get`
        self._next_sequence = 0
        # The sequence number of the next result that needs to be provided.
        self._next_result = 0
        # A holding area for objects past _next_sequence
        self._queue = {}
        # A holding area for Deferreds waiting on those objects
        self._waiting = {}

    def put(self, sequence, object):
        """
        Put an object into the queue at a particular point in the sequence.
        """
        if sequence < self._next_sequence:
            # Programming error. The sequence number
            # of the object being put has already been used.
            raise ...
        self._queue[sequence] = object
        self._check_waiters()

    def get(self):
        """
        Get an object from the queue which has the next sequence number
        following whatever was previously gotten.
        """
        result = self._waiting[self._next_sequence] = Deferred()
        self._next_sequence += 1
        self._check_waiters()
        return result

    def _check_waiters(self):
        """
        Find any Deferreds previously given out by get calls which can now be given
        their results and give them to them.
        """
        while True:
            seq = self._next_result
            if seq in self._queue and seq in self._waiting:
                self._next_result += 1
                # XXX Probably a re-entrancy bug here. If a callback calls back in to
                # put then this loop might run recursively
                self._waiting.pop(seq).callback(self._queue.pop(seq))
            else:
                break
The expected behavior (modulo any bugs I accidentally added) is something like:
q = SequencedQueue()
d1 = q.get()
d2 = q.get()
# Nothing in particular happens
q.put(1, "second result")
# d1 fires with "first result" and afterwards d2 fires with "second result"
q.put(0, "first result")
Using this, just make sure you assign sequence numbers in the order you want data dispatched rather than the order it actually shows up somewhere. For example:
@inlineCallbacks
def on_available_data(self):
    sequence = self._process_order
    data = yield self.get_data_nonblocking()
    if data is not None:
        self._process_order += 1
        self.sequenced_queue.put(sequence, data)
Elsewhere, some code can consume the queue sort of like:
@inlineCallbacks
def queue_consumer(self):
    while True:
        data = yield self.sequenced_queue.get()
        yield self.process_data(data)
