Is there an established module, or good practice, to work efficiently with large object pools in Python 3?
What I mean by "object pool" is some class capable of:
fetching new instances of a specified type, while dynamically extending the memory allocation under the hood when necessary;
maintaining a consistent indexing for previously fetched objects.
Here is a basic example:
class Value:
    __slots__ = ('a', 'b')
    def __init__(self, a=None, b=None):
        self.a = a
        self.b = b

class BasicPool:
    def __init__(self):
        self.data = []
    def __getitem__(self, k):
        return self.data[k]
    def fetch(self):
        v = Value()
        self.data.append(v)
        return v
class BlockPool:
    def __init__(self, bsize=100):
        self.bsize = bsize
        self.next = bsize
        self.data = []
    def __getitem__(self, k):
        b, k = divmod(k, self.bsize)
        return self.data[b][k]
    def fetch(self):
        self.next += 1
        if self.next >= self.bsize:
            self.data.append([Value() for _ in range(self.bsize)])
            self.next = 0
        return self.data[-1][self.next]
The BasicPool doesn't do anything smart: whenever a new instance is requested, it is instantiated and appended to an underlying list. The BlockPool, on the other hand, grows a list of pre-allocated blocks of instances. Surprisingly though, it seems that preallocation is not beneficial in practice:
from timeit import default_timer as timer

def benchmark(P):
    N = int(1e6)
    start = timer()
    for _ in range(N):
        P.fetch()
    print(timer() - start)
print( 'Basic pool:' )
for _ in range(5): benchmark(BasicPool())
# Basic pool:
# 1.2352294209995307
# 0.5003506309985823
# 0.48115064000012353
# 0.48508202800076106
# 1.1760561199989752
print( 'Block pool:' )
for _ in range(5): benchmark(BlockPool())
# Block pool:
# 0.7272855400005938
# 1.4875716509995982
# 0.726611527003115
# 0.7369502859983186
# 1.4867010340021807
As you can see, the BasicPool is always faster than the BlockPool (I also don't know the cause of these large variations). Pools of objects must be a fairly common need in Python; is the best approach really to use the built-in list.append? Are there smarter containers that can be used to further improve runtime performance, or is this dominated by the instantiation time anyway?
The whole point of the geometric growth of the array underlying a list is to reduce the reallocation overhead to a constant factor. That constant can easily be smaller than that for manually making blocks (principally because of the slow, interpreted manipulation of self.next and self.data in the latter). (Asymptotically, the cost of BlockPool.fetch is still the append, of course.) Moreover, your benchmark doesn’t include the additional cost of destroying the blocks, nor that of the two-step indexing on read.
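To see that over-allocation in action, you can watch a list's allocated size jump in increasingly large steps (a quick check; the exact sizes are a CPython implementation detail and vary by version):

import sys

xs, prev = [], None
for i in range(32):
    xs.append(i)
    size = sys.getsizeof(xs)
    if size != prev:  # a jump means the backing array was reallocated
        print(f'len={len(xs)} allocated={size} bytes')
        prev = size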
So list is surely as good as it gets (without writing your own C code). You can improve BasicPool a bit by inheriting from list rather than containing one, eliminating a dictionary lookup per fetch and the interpreted __getitem__ wrapper entirely.
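A minimal sketch of that last suggestion (the class name is mine): indexing now goes straight to list.__getitem__, and fetch touches one attribute fewer.

class ListPool(list):
    """BasicPool variant that *is* a list instead of wrapping one."""
    def fetch(self):
        v = Value()
        self.append(v)
        return v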
I have been learning about the queue data structure recently. How do we actually create a queue? Can we simply use a list and insert and remove items from it, or do I need to do something else? I have tried creating a queue class too. What is the correct method?
class Queue:
    def __init__(self, capacity):
        self.capacity = capacity
        self.queue = []

    def IsEmpty(self):
        return len(self.queue) == 0

    def IsFull(self):
        return len(self.queue) == self.capacity

    def Enqueue(self, x):
        if self.IsFull():
            return 'Queue overloaded'
        self.queue.append(x)
        return f'{x} enqueued into queue.'

    def Dequeue(self):
        front = self.queue.pop(0)
        return f'{front} dequeued from queue.'

    def GetFront(self):
        return self.queue[0]

    def GetBack(self):
        return self.queue[-1]
The common ways to implement a queue are basically:
Like an ArrayList, you use an array and reallocate a bigger one if it fills up. Unlike an ArrayList, you need to allow the elements in the queue to wrap around from the end to the start. This is what Java's ArrayDeque does, and it is probably the most common implementation. There are variants that waste more memory but don't wrap around; these are usually used when you sometimes have to pass parts of the queue to some other function as a contiguous region.
Use a singly-linked list, but also keep a pointer to the tail node so you can enqueue items quickly. This is typically an intrusive list (i.e., you just add next pointers to objects you already have), or...
There's a particularly simple lock-free implementation of a thread-safe queue using a singly-linked list with an extra head node (and again a pointer to the tail node). This is what Java's ConcurrentLinkedQueue uses.
Your implementation of a queue is none of these. The Python list is backed by an array like (1), but the first item is always at the start of the array. The pop(0) operation can therefore take a long time, because all the other items have to be moved toward the start.
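As a sketch of what the standard library already offers here (this is my addition, not part of the answer above): collections.deque gives O(1) appends and pops at both ends, so a bounded queue needs no element shifting at all. Method names mirror the question's code.

from collections import deque

class DequeQueue:
    def __init__(self, capacity):
        self.capacity = capacity
        self.queue = deque()

    def IsEmpty(self):
        return len(self.queue) == 0

    def IsFull(self):
        return len(self.queue) == self.capacity

    def Enqueue(self, x):
        if self.IsFull():
            return 'Queue overloaded'
        self.queue.append(x)  # O(1) at the back
        return f'{x} enqueued into queue.'

    def Dequeue(self):
        front = self.queue.popleft()  # O(1) at the front; raises IndexError if empty
        return f'{front} dequeued from queue.'

    def GetFront(self):
        return self.queue[0]

    def GetBack(self):
        return self.queue[-1]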
I am trying to reproduce the reactive extensions "shared" observable concept with Python generators.
Say I have an API that gives me an infinite stream that I can use like this:
def my_generator():
    for elem in the_infinite_stream():
        yield elem
I could use this generator multiple times like so:
stream1 = my_generator()
stream2 = my_generator()
And the_infinite_stream() will be called twice (once for each generator).
Now say that the_infinite_stream() is an expensive operation. Is there a way to "share" the generator between multiple clients? It seems like tee would do that, but I have to know in advance how many independent generators I want.
The idea is that in other languages (Java, Swift) using the reactive extensions (RxJava, RxSwift) "shared" streams, I can conveniently duplicate the stream on the client side. I am wondering how to do that in Python.
Note: I am using asyncio
I took the tee implementation and modified it so that you can create any number of generators from infinite_stream:
import collections

def generators_factory(iterable):
    it = iter(iterable)
    deques = []
    already_gone = []

    def new_generator():
        new_deque = collections.deque()
        new_deque.extend(already_gone)
        deques.append(new_deque)

        def gen(mydeque):
            while True:
                if not mydeque:               # when the local deque is empty
                    newval = next(it)         # fetch a new value and
                    already_gone.append(newval)
                    for d in deques:          # load it to all the deques
                        d.append(newval)
                yield mydeque.popleft()

        return gen(new_deque)

    return new_generator
# test it:
infinite_stream = [1, 2, 3, 4, 5]
factory = generators_factory(infinite_stream)
gen1 = factory()
gen2 = factory()
print(next(gen1)) # 1
print(next(gen2)) # 1 even after it was produced by gen1
print(list(gen1)) # [2, 3, 4, 5] # the rest after 1
To cache only a limited number of values, you can change already_gone = [] into already_gone = collections.deque(maxlen=size) and add a size=None parameter to generators_factory.
Consider simple class attributes.
Given
def infinite_stream():
    """Yield a number from a (semi-)infinite iterator."""
    # Alternatively, `yield from itertools.count()`
    yield from iter(range(100000000))

# Helper
def get_data(iterable):
    """Return the state of `data` per stream."""
    return ", ".join([f"{x.__name__}: {x.data}" for x in iterable])
Code
class SharedIterator:
    """Share the state of an iterator with subclasses."""
    _gen = infinite_stream()
    data = None

    @staticmethod
    def modify():
        """Advance the shared iterator + assign new data."""
        cls = SharedIterator
        cls.data = next(cls._gen)
Demo
Given a tuple of client streams (A, B and C),
# Streams
class A(SharedIterator): pass
class B(SharedIterator): pass
class C(SharedIterator): pass
streams = A, B, C
let us modify and print the state of one iterator shared between them:
# Observe changed state in subclasses
A.modify()
print("1st access:", get_data(streams))
B.modify()
print("2nd access:", get_data(streams))
C.modify()
print("3rd access:", get_data(streams))
Output
1st access: A: 0, B: 0, C: 0
2nd access: A: 1, B: 1, C: 1
3rd access: A: 2, B: 2, C: 2
Although any stream can modify the iterator, the class attribute is shared between sub-classes.
See Also
Docs on asyncio.Queue - an async alternative to a shared container
Post on the Observer Pattern + asyncio
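Since the question mentions asyncio, here is a minimal sketch of the asyncio.Queue alternative (all names here are illustrative, not from the answer above): one task drains the source and copies each item into a per-subscriber queue.

import asyncio
import itertools

async def the_infinite_stream():
    for i in itertools.count():
        yield i
        await asyncio.sleep(0)  # let subscriber tasks run

async def broadcast(source, queues):
    # Copy every produced item into each subscriber's queue.
    async for item in source:
        for q in queues:
            await q.put(item)

async def subscriber(name, q):
    for _ in range(3):
        print(name, await q.get())

async def main():
    queues = [asyncio.Queue(), asyncio.Queue()]
    producer = asyncio.create_task(broadcast(the_infinite_stream(), queues))
    await asyncio.gather(*(subscriber(f"client{i}", q) for i, q in enumerate(queues)))
    producer.cancel()

asyncio.run(main())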
You can call "tee" repeatedly to create multiple iterators as needed.
import itertools
import random

it = iter([random.random() for i in range(100)])
base, it_cp = itertools.tee(it)
_, it_cp2 = itertools.tee(base)
_, it_cp3 = itertools.tee(base)
Sample: http://tpcg.io/ZGc6l5.
You can use a single publishing generator and several "subscriber" generators:
subscribed_generators = []

def my_generator():
    while True:
        elem = yield
        do_something(elem)  # or yield do_something(elem), depending on your actual use

def publishing_generator():
    for elem in the_infinite_stream():
        for generator in subscribed_generators:
            generator.send(elem)
        yield elem

subscribed_generators.extend([my_generator(), my_generator()])
for g in subscribed_generators:
    next(g)  # prime each subscriber so it can receive send()

# Next is just an example that forces iteration over `the_infinite_stream`
for elem in publishing_generator():
    pass
Instead of a generator function, you may also create a class with __next__, __iter__, send and throw methods. That way you can make MyGenerator.__init__ automatically add new instances to subscribed_generators.
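A small sketch of that class-based variant, building on the names from the snippet above (the MySubscriber class and its body are mine):

class MySubscriber:
    """Push-only subscriber; only send() is needed by the publisher above."""
    def __init__(self):
        subscribed_generators.append(self)  # auto-subscribe, as suggested

    def send(self, elem):
        do_something(elem)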
This publish/subscribe setup is somewhat similar to an event-based approach with a "dumb" implementation:
for elem in the_infinite_stream is similar to emitting an event;
for generator ...: generator.send is similar to sending that event to each subscriber.
So one way to implement a "more complex but structured solution" would be to use event-based approach:
For example you can use asyncio.Event
Or some third-party solution like aiopubsub
For any of those approaches you should emit an event for each element of the_infinite_stream, and your instances of my_generator should be subscribed to those events.
Other approaches can also be used, and the best choice depends on the details of your task and on how you are using the event loop in asyncio. For example:
You can implement the_infinite_stream (or a wrapper for it) as a class with "cursors" (objects that track the current position in the stream for different subscribers); each my_generator then registers a new cursor and uses it to get the next item in the infinite stream. With this approach the event loop will not automatically revisit my_generator instances, which might be required if those instances "are not equal" (for example, if some "priority balancing" is needed). A small sketch of this cursor idea appears after this list.
An intermediate generator calling all the instances of my_generator (as described earlier). With this approach each instance of my_generator is automatically revisited by the event loop. Most likely this approach is thread-safe.
Event-based approaches:
using asyncio.Event: similar to the intermediate generator, but not thread-safe;
aiopubsub;
something that uses the Observer pattern.
Make the_infinite_generator (or a wrapper for it) a "singleton" that "caches" the latest event. Some approaches were described in the other answers; other "caching" solutions can be used as well:
Emit the same element once for each instance of the_infinite_generator (use a class with a custom __new__ method that tracks instances, or use the same instance of a class that has a method returning a "shifted" iterator over the_infinite_stream) until someone calls a special method on an instance of the_infinite_generator (or on the class): infinite_gen.next_cycle. In this case there should always be some "last finalizing generator/processor" that, at the end of each event-loop cycle, calls the_infinite_generator().next_cycle().
Similar to the previous, but the same event is allowed to fire multiple times in the same my_generator instance (so subscribers should watch for this case). With this approach the_infinite_generator().next_cycle() can be called "periodically" with loop.call_later or loop.call_at. This might be needed if "subscribers" should be able to handle/analyze delays, rate limits, timeouts between events, etc.
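Here is the promised sketch of the cursor idea (class and method names are mine, and itertools.count() stands in for the question's the_infinite_stream): one shared buffer, with a per-subscriber position into it.

import itertools

class CursorStream:
    """One shared source; each subscriber reads through its own cursor."""
    def __init__(self, source):
        self._source = iter(source)
        self._buffer = []      # everything produced so far (never trimmed here)
        self._positions = {}   # cursor id -> index of the next item to hand out
        self._ids = itertools.count()

    def register(self):
        cursor = next(self._ids)
        self._positions[cursor] = len(self._buffer)  # new cursors start at the tip
        return cursor

    def next_item(self, cursor):
        pos = self._positions[cursor]
        if pos == len(self._buffer):                  # this cursor is at the tip,
            self._buffer.append(next(self._source))  # so pull one more item
        self._positions[cursor] = pos + 1
        return self._buffer[pos]

# usage:
stream = CursorStream(itertools.count())
c1, c2 = stream.register(), stream.register()
print(stream.next_item(c1), stream.next_item(c1))  # 0 1
print(stream.next_item(c2))                        # 0, replayed from the buffer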
Many other solutions are possible. It's hard to propose something specific without looking at your current implementation and without knowing the desired behavior of the generators that use the_infinite_stream.
If I understand your description of "shared" streams correctly, then you really need one the_infinite_stream generator and a "handler" for it. An example that tries to do this:
class StreamHandler:
    def __init__(self):
        self.__real_stream = the_infinite_stream()
        self.__sub_streams = []

    def get_stream(self):
        sub_stream = []  # better to use some Queue/deque object; a list just shows the principle
        self.__sub_streams.append(sub_stream)
        while True:
            while sub_stream:
                yield sub_stream.pop(0)
            next(self)

    def __next__(self):
        next_item = next(self.__real_stream)
        for sub_stream in self.__sub_streams:
            sub_stream.append(next_item)

some_global_variable = StreamHandler()
# Or you can change StreamHandler.__new__ to make it a singleton,
# or create an instance at the point where the event loop is created.

def my_generator():
    for elem in some_global_variable.get_stream():
        yield elem
But if all your my_generator objects are initialized at the same point of the infinite stream and are "equally" iterated inside the loop, then this approach introduces unnecessary memory overhead for each sub_stream (used as a queue). Unnecessary because those queues will always hold the same items (though that can be optimized: if there is some existing "empty" sub_stream, it can be re-used for new sub_streams with some changes to the pop logic). And many, many other implementations and nuances could be discussed.
If you have a single generator, you can use one queue per "subscriber" and route events to each subscriber as the primary generator produces results.
This has the advantage of allowing the subscribers to move at their own pace, and it can be dropped into existing code with very few changes to the original source.
For example:
def my_gen():
    ...

m1 = Muxer(my_gen)
m2 = Muxer(my_gen)

consumer1(m1).start()
consumer2(m2).start()
As items are pulled from the primary generator they are inserted into queues for each listener. Listeners can subscribe any time by constructing a new Muxer():
import queue
from threading import Lock
from collections import namedtuple

class Muxer():
    Entry = namedtuple('Entry', 'genref listeners lock')

    already = {}
    top_lock = Lock()

    def __init__(self, func, restart=False):
        self.restart = restart
        self.func = func
        self.queue = queue.Queue()
        with self.top_lock:
            if func not in self.already:
                self.already[func] = self.Entry([func()], [], Lock())
            ent = self.already[func]
        self.genref = ent.genref
        self.lock = ent.lock
        self.listeners = ent.listeners
        self.listeners.append(self)

    def __iter__(self):
        return self

    def __next__(self):
        try:
            e = self.queue.get_nowait()
        except queue.Empty:
            with self.lock:
                try:
                    e = self.queue.get_nowait()
                except queue.Empty:
                    try:
                        e = next(self.genref[0])
                        for other in self.listeners:
                            if other is not self:
                                other.queue.put(e)
                    except StopIteration:
                        if self.restart:
                            self.genref[0] = self.func()
                        raise
        return e
Original source code, including test suite:
https://gist.github.com/earonesty/cafa4626a2def6766acf5098331157b3
The unit tests run many threads concurrently processing the same generated events in sequence. The code is order preserving, with a lock acquired during the single generator's access.
Caveats: the version here uses a singleton to gate access; otherwise it would be possible to accidentally evade its control over the contained generators. It also allows the contained generators to be "restartable", which was a useful feature for me at the time. There is no "close()" feature, simply because I didn't need it. This is an appropriate use case for __del__, however, since the last reference to a listener is the right time to clean up.
I'm attempting to speed up a multivariate fixed-point iteration algorithm using multiprocessing; however, I'm running into issues dealing with shared data. My solution vector is actually a named dictionary rather than a vector of numbers. Each element of the vector is computed using a different formula. At a high level, I have an algorithm like this:
current_estimate = previous_estimate
while True:
    for state in all_states:
        current_estimate[state] = state.getValue(previous_estimate)
    if norm(current_estimate, previous_estimate) < tolerance:
        break
    else:
        previous_estimate, current_estimate = current_estimate, previous_estimate
I'm trying to parallelize the for-loop part with multiprocessing. The previous_estimate variable is read-only and each process only needs to write to one element of current_estimate. My current attempt at rewriting the for-loop is as follows:
import itertools
from multiprocessing import Manager, Pool

# Class and function definitions
class A(object):
    def __init__(self, val):
        self.val = val

    # representative getValue function
    def getValue(self, est):
        return est[self] + self.val

def worker(state, in_est, out_est):
    out_est[state] = state.getValue(in_est)

def worker_star(a_b_c):
    """ Allow multiple arguments for a pool
    Taken from http://stackoverflow.com/a/5443941/3865495
    """
    return worker(*a_b_c)

# Initialize test environment
manager = Manager()
estimates = manager.dict()
all_states = []
for i in range(5):
    a = A(i)
    all_states.append(a)
    estimates[a] = 0

pool = Pool(processes=2)
prev_est = estimates
curr_est = estimates
pool.map(worker_star, itertools.izip(all_states, itertools.repeat(prev_est), itertools.repeat(curr_est)))
The issue I'm currently running into is that the elements added to the all_states array are not the same as those added to the manager.dict(). I keep getting KeyErrors when trying to access elements of the dictionary using elements of the array. While debugging, I found that none of the elements are the same:
print map(id, estimates.keys())
>>> [19558864, 19558928, 19558992, 19559056, 19559120]
print map(id, all_states)
>>> [19416144, 19416208, 19416272, 19416336, 19416400]
This is happening because the objects you're putting into the estimates DictProxy aren't actually the same objects as those that live in the regular dict. The manager.dict() call returns a DictProxy, which is proxying access to a dict that actually lives in a completely separate manager process. When you insert things into it, they're really being copied and sent to a remote process, which means they're going to have a different identity.
To work around this, you can define your own __eq__ and __hash__ functions on A, as described in this question:
class A(object):
    def __init__(self, val):
        self.val = val

    # representative getValue function
    def getValue(self, est):
        return est[self] + self.val

    def __hash__(self):
        return hash(self.__key())

    def __key(self):
        return (self.val,)

    def __eq__(self, other):
        return self.__key() == other.__key()
This means that key lookups for items in estimates will just use the value of the val attribute to establish identity and equality, rather than the id assigned by Python.
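A quick check of that value-based identity (using the A class above): a distinct-but-equal copy still finds the original entry.

a1, a2 = A(3), A(3)  # two distinct objects with equal state
assert a1 == a2 and hash(a1) == hash(a2)
d = {a1: 'x'}
assert d[a2] == 'x'  # lookup succeeds despite different id()s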
Most of my programming prior to Python was in C++ or MATLAB. I don't have a degree in CS (almost completed a PhD in physics), but have done some courses and a good amount of actual programming. Now I'm taking an algorithms course on Coursera (an excellent course, by the way, with a professor from Stanford). I decided to implement the homework assignments in Python. However, sometimes I find myself wanting things the language does not so easily support. I'm very used to creating classes and objects for things in C++ just to group together data (i.e., when there are no methods). In Python, however, where you can add fields on the fly, what I basically end up wanting all the time are MATLAB structs. I think this is possibly a sign I am not using good style and doing things the "Pythonic" way.
Underneath is my implementation of a union-find data structure (for Kruskal's algorithm). Although the implementation is relatively short and works well (there isn't much error checking), there are a few odd points. For instance, my code assumes that the data originally passed in to the union-find is a list of objects. However, if a list of explicit pieces of data is passed in instead (i.e., a list of ints), the code fails. Is there some much clearer, more Pythonic way to implement this? I have tried to Google this, but most examples are very simple and relate more to procedural code (i.e., the "proper" way to do a for loop in Python).
class UnionFind:
    def __init__(self, data):
        self.data = data
        for d in self.data:
            d.size = 1
            d.leader = d
            d.next = None
            d.last = d

    def find(self, element):
        return element.leader

    def union(self, leader1, leader2):
        if leader1.size >= leader2.size:
            newleader = leader1
            oldleader = leader2
        else:
            newleader = leader2
            oldleader = leader1
        newleader.size = leader1.size + leader2.size
        d = oldleader
        while d != None:
            d.leader = newleader
            d = d.next
        newleader.last.next = oldleader
        newleader.last = oldleader.last
        del(oldleader.size)
        del(oldleader.last)
Generally speaking, doing this sort of thing Pythonically means that you try to make your code not care what is given to it, at least not any more than it really needs to.
Let's take your particular example of the union-find algorithm. The only thing that the union-find algorithm actually does with the values you pass to it is compare them for equality. So to make a generally useful UnionFind class, your code shouldn't rely on the values it receives having any behavior other than equality testing. In particular, you shouldn't rely on being able to assign arbitrary attributes to the values.
The way I would suggest getting around this is to have UnionFind use wrapper objects which hold the given values and any attributes you need to make the algorithm work. You can use namedtuple as suggested by another answer, or make a small wrapper class. When an element is added to the UnionFind, you first wrap it in one of these objects, and use the wrapper object to store the attributes leader, size, etc. The only time you access the thing being wrapped is to check whether it is equal to another value.
In practice, at least in this case, it should be safe to assume that your values are hashable, so that you can use them as keys in a Python dictionary to find the wrapper object corresponding to a given value. Of course, not all objects in Python are necessarily hashable, but those that are not are relatively rare and it's going to be a lot more work to make a data structure that is able to handle those.
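A minimal sketch of that wrapper idea (the class names and the path-halving detail are mine, not from the answer): values only need to be hashable, and no attributes are ever set on them.

class _Node:
    """Wrapper holding a value plus the union-find bookkeeping."""
    def __init__(self, value):
        self.value = value
        self.parent = self
        self.size = 1

class WrappedUnionFind:
    def __init__(self, values):
        self._nodes = {v: _Node(v) for v in values}  # value -> wrapper

    def _root(self, node):
        while node.parent is not node:
            node.parent = node.parent.parent  # path halving
            node = node.parent
        return node

    def find(self, value):
        return self._root(self._nodes[value]).value

    def union(self, v1, v2):
        r1 = self._root(self._nodes[v1])
        r2 = self._root(self._nodes[v2])
        if r1 is r2:
            return
        if r1.size < r2.size:  # union by size
            r1, r2 = r2, r1
        r2.parent = r1
        r1.size += r2.size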
The more Pythonic way is to avoid tedious objects when you don't need them.
class UnionFind(object):
    def __init__(self, members=10, data=None):
        """union-find data structure for Kruskal's algorithm

        members are ignored if data is provided
        """
        if not data:
            self.data = [self.default_data() for i in range(members)]
            for d in self.data:
                d.size = 1
                d.leader = d
                d.next = None
                d.last = d
        else:
            self.data = data

    def default_data(self):
        """create a starting point for data"""
        return Data(**{'last': None, 'leader': None, 'next': None, 'size': 1})

    def find(self, element):
        return element.leader

    def union(self, leader1, leader2):
        if leader2.leader is leader1:
            return
        if leader1.size >= leader2.size:
            newleader = leader1
            oldleader = leader2
        else:
            newleader = leader2
            oldleader = leader1
        newleader.size = leader1.size + leader2.size
        d = oldleader
        while d is not None:
            d.leader = newleader
            d = d.next
        newleader.last.next = oldleader
        newleader.last = oldleader.last
        oldleader.size = 0
        oldleader.last = None

class Data(object):
    def __init__(self, **data_dict):
        """convert a data member dict into an object"""
        self.__dict__.update(**data_dict)
One option is to use dictionaries to store the information you need about a data item, rather than attributes on the item directly. For instance, rather than referring to d.size you could refer to size[d] (where size is a dict instance). This requires that your data items be hashable, but they don't need to allow attributes to be assigned on them.
Here's a straightforward translation of your current code to use this style:
class UnionFind:
    def __init__(self, data):
        self.data = data
        self.size = {d: 1 for d in data}
        self.leader = {d: d for d in data}
        self.next = {d: None for d in data}
        self.last = {d: d for d in data}

    def find(self, element):
        return self.leader[element]

    def union(self, leader1, leader2):
        if self.size[leader1] >= self.size[leader2]:
            newleader = leader1
            oldleader = leader2
        else:
            newleader = leader2
            oldleader = leader1
        self.size[newleader] = self.size[leader1] + self.size[leader2]
        d = oldleader
        while d is not None:
            self.leader[d] = newleader
            d = self.next[d]
        self.next[self.last[newleader]] = oldleader
        self.last[newleader] = self.last[oldleader]
A minimal test case:
>>> uf = UnionFind(list(range(100)))
>>> uf.find(10)
10
>>> uf.find(20)
20
>>> uf.union(10,20)
>>> uf.find(10)
10
>>> uf.find(20)
10
Beyond this, you could also consider changing your implementation a bit to require less initialization. Here's a version that doesn't do any initialization (it doesn't even need to know the set of data it's going to work on). It uses path compression and union-by-rank rather than always maintaining an up-to-date leader value for all members of a set. It should be asymptotically faster than your current code, especially if you're doing a lot of unions:
class UnionFind:
    def __init__(self):
        self.rank = {}
        self.parent = {}

    def find(self, element):
        if element not in self.parent:  # leader elements are not in `parent` dict
            return element
        leader = self.find(self.parent[element])  # search recursively
        self.parent[element] = leader  # compress path by saving leader as parent
        return leader

    def union(self, leader1, leader2):
        rank1 = self.rank.get(leader1, 1)
        rank2 = self.rank.get(leader2, 1)
        if rank1 > rank2:  # union by rank
            self.parent[leader2] = leader1
        elif rank2 > rank1:
            self.parent[leader1] = leader2
        else:  # ranks are equal
            self.parent[leader2] = leader1  # favor leader1 arbitrarily
            self.rank[leader1] = rank1 + 1  # increment rank
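A quick check in the same style as the earlier test case:
>>> uf = UnionFind()
>>> uf.union(10, 20)
>>> uf.find(20)
10
>>> uf.find(10)
10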
For checking if an argument is of the expected type, use the built-in isinstance() function:
if not isinstance(leader1, UnionFind):
    raise ValueError('leader1 must be a UnionFind instance')
Additionally, it is a good habit to add docstrings to functions, classes and member functions. Such a docstring for a function or method should describe what it does, what arguments are to be passed to it and if applicable what is returned and which exceptions can be raised.
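For instance, a docstring for the dictionary-based find shown earlier might look like this (a sketch):

def find(self, element):
    """Return the leader of the set containing `element`.

    Args:
        element: an item that was passed to this UnionFind at construction.

    Returns:
        The representative element of the set `element` belongs to.

    Raises:
        KeyError: if `element` is unknown to this UnionFind.
    """
    return self.leader[element]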
I'm guessing that the indentation issues here are just simple errors with inputting the code into SO. Could you possibly create a subclass of a simple, built-in data type? For instance, you can create a subclass of the list data type by putting the data type in parentheses:
class UnionFind(list):
    '''extends list object'''
I am working on an algorithm that uses the diagonal and first off-diagonal blocks of a large (will be 1e6 x 1e6) block-diagonal sparse matrix.
Right now I create a dict that stores the blocks in such a way that I can access them in a matrix-like fashion. For example, B[0,0] (5x5) gives the first block of matrix A (20x20), assuming 5x5 blocks and that matrix A is of type sparse.lil.
This works fine but takes horribly long to run. It is inefficient because it copies the data, as this reference revealed to my astonishment: GetItem Method
Is there a way to store only a view on a sparse matrix in a dict? I would like to change the content and still be able to use the same identifiers. It is fine if it takes a little longer, as it should only be done once. The blocks will have many different dimensions and shapes.
As far as I know, all of the various sparse matrices in scipy.sparse return copies rather than a view of some sort. (Some of the others may be significantly faster at doing so than lil_matrix, though!)
One way of doing what you want is to just work with slice objects. For example:
import scipy.sparse

class SparseBlocks(object):
    def __init__(self, data, chunksize=5):
        self.data = data
        self.chunksize = chunksize

    def _convert_slices(self, slices):
        newslices = []
        for axslice in slices:
            if isinstance(axslice, slice):
                start, stop = axslice.start, axslice.stop
                if axslice.start is not None:
                    start *= self.chunksize
                if axslice.stop is not None:
                    stop *= self.chunksize
                axslice = slice(start, stop, None)
            elif axslice is not None:
                # scale a plain block index to the matching element range
                axslice = slice(axslice * self.chunksize,
                                (axslice + 1) * self.chunksize)
            newslices.append(axslice)
        return tuple(newslices)

    def __getitem__(self, item):
        item = self._convert_slices(item)
        return self.data.__getitem__(item)

    def __setitem__(self, item, value):
        item = self._convert_slices(item)
        return self.data.__setitem__(item, value)

data = scipy.sparse.lil_matrix((20, 20))
s = SparseBlocks(data)
s[0, 0] = 1
print(s.data)
Now, whenever we modify s[whatever], it will modify s.data in the appropriate chunk. In other words, s[0,0] will return or set s.data[:5, :5], and so on.
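A short usage check of that behavior (a sketch, assuming the corrected _convert_slices above; the printed representation depends on the scipy version):

s[1, 1] = 2               # writes the whole 5x5 block s.data[5:10, 5:10]
block = s[1, 1]           # reads back a copy of that block
print(block.toarray())    # a 5x5 dense array filled with 2.0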