Consistent indexing for objects with variable attributes in ZODB - python

I have a ZODB installation where I have to organize several million objects of about a handful of different types. I have a generic container class Table, which contains BTrees to index objects by attributes or combinations of these attributes. Data consistency is quite essential, so I want to enforce that the indices are automatically updated whenever I write to any of the attributes covered by the indexing. A simple obj.a = x should be sufficient to calculate all new dependent index entries, check whether there are any collisions, and finally write the indices and the value.
In general, I'd be happy to use a library for that, so I was looking at repoze.catalog and IndexedCatalog, but was not really happy with either. IndexedCatalog seems to have been dead for quite a while and does not provide this kind of consistency for changes to the objects. repoze.catalog seems more widely used and more active, but as far as I understand it also does not provide this kind of consistency. If I missed something here, I'd love to hear about it; I prefer reusing over reinventing.
So, as I see it, besides trying to find a library for the problem, I'd have to intercept write access to the data object attributes with descriptors and let the Table class do the magic of changing the indices. For that, the descriptor instances have to know which Table instances they have to talk to. The current implementation goes something like this:
class DatabaseElement(Persistent):
    name = Property(constant_parameters)
    ...

class Property(object):
    ...
    def __set__(self, obj, val):  # descriptor protocol: __set__(self, obj, value)
        val = self.check_value(val)
        setattr(obj, '_' + self.name, val)
When these DatabaseElement classes are generated, the database and the objects within are not yet created. So, as mentioned in this nice answer, I'd probably have to create some singleton lookup mechanism to find the Table objects without handing them to Property as an instantiation argument. Is there a more elegant way? Persisting the descriptors themselves? Any suggestions and best-practice examples welcome!

So I finally figured it out myself. The solution comes in three parts; no ugly singleton required. Table provides the logic to check for collisions, DatabaseElement gets the ability to look up the responsible Table without ugly workarounds, and Property takes care that the indices are updated before any indexed values are written. Here are some snippets; the main clue is the table lookup in DatabaseElement, which I also didn't see documented anywhere. Nice extra: it not only verifies writes to single values, I can also check for changes to several indexed values in one go.
class Table(PersistentMapping):
    ....
    def update_indices(self, inst, updated_values_dict):
        changed_indices_keys = self._check_collision(inst, updated_values_dict)
        original_keys = [inst.key(index) for index, tmp_key in changed_indices_keys]
        for (index, tmp_key), key in zip(changed_indices_keys, original_keys):
            self[index][tmp_key] = inst
            try:
                del self[index][key]
            except KeyError:
                pass
class DatabaseElement(Persistent):
    ....
    @property
    def _table(self):
        return self._p_jar and self._p_jar.root()[self.__class__.__name__]

    def _update_indices(self, update_dict, verify=True):
        if verify:
            update_dict = dict((key, getattr(type(self), key).verify(val))
                               for key, val in update_dict.items()
                               if key in self._key_properties)
        if not update_dict:
            return
        table = self._table
        table and table.update_indices(self, update_dict)
class Property(object):
    ....
    def __set__(self, obj, val):
        validated_val = self.validator(obj, self.name, val)
        if self.indexed:
            obj._update_indices({self.name: val}, verify=False)
        setattr(obj, self.hidden_name, validated_val)
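To show how the pieces are meant to fit together, here is a minimal usage sketch. It assumes the snippets above are completed (Property carries a name, validator, and indexed flag; Table implements _check_collision), and the Guest class, the Property(indexed=True) signature, and the registration under the class name in the root are invented for illustration:

import transaction
from ZODB import DB
from ZODB.MappingStorage import MappingStorage

class Guest(DatabaseElement):
    name = Property(indexed=True)   # assumed constructor signature

db = DB(MappingStorage())           # in-memory storage, just for the sketch
root = db.open().root()
root['Guest'] = Table()             # _table looks the Table up under the class name

guest = Guest()
root['guest-1'] = guest
transaction.commit()                # guest now has a _p_jar, so _table resolves

guest.name = 'Alice'                # Property.__set__ updates the indices before writing
transaction.commit()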


"Updater" design pattern, as opposed to "Builder"

This is actually language agnostic, but I always prefer Python.
The builder design pattern is used to validate that a configuration is valid prior to creating an object, via delegation of the creation process.
Some code to clarify:
class A():
    def __init__(self, m1, m2):  # obviously more complex in real life
        self._m1 = m1
        self._m2 = m2

class ABuilder():
    def __init__(self):
        self._m1 = None
        self._m2 = None

    def set_m1(self, m1):
        self._m1 = m1
        return self

    def set_m2(self, m2):
        self._m2 = m2
        return self

    def _validate(self):
        # complicated validations
        assert self._m1 < 1000
        assert self._m1 < self._m2

    def build(self):
        self._validate()
        return A(self._m1, self._m2)
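For reference, the intended fluent usage (the literal values are arbitrary):

a = ABuilder().set_m1(10).set_m2(20).build()  # _validate() runs inside build()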
My problem is similar, with an extra constraint that I can't re-create the object each time due to performance limitations.
Instead, I want to only update an existing object.
Bad solutions I came up with:
I could do as suggested here and just use setters, like so:
class A():
    ...
    def set_m1(self, m1):
        self._m1 = m1
    # and so on
But this is bad because using setters
Beats the purpose of encapsulation
Beats the purpose of the builder (now updater), which is supposed to validate that some complex configuration is preserved after the creation, or update in this case.
As I mentioned earlier, I can't recreate the object every time, as this is expensive and I only want to update some fields, or sub-fields, and still validate or sub-validate.
I could add update and validation methods to A and call those, but this beats the purpose of delegating the responsibility of updates, and is intractable in the number of fields.
class A():
    ...
    def update1(self, m1):
        pass  # complex_logic1

    def update2(self, m2):
        pass  # complex_logic2

    def update12(self, m1, m2):
        pass  # complex_logic12
I could just force an update of every single field of A in one method with optional parameters
class A():
    ...
    def update(self, **kwargs):  # one optional parameter per field of A
        pass
Which again is not tractable, as this method will soon become a god method due to the many combinations possible.
Forcing the method to always accept changes in A, and validating in the Updater, also can't work, as the Updater will need to look at A's internal state to make a decision, causing a circular dependency.
How can I delegate updating fields in my object A in a way that:
Doesn't break encapsulation of A
Actually delegates the responsibility of updating to another object
Is tractable as A becomes more complicated
I feel like I am missing something trivial to extend building to updating.
I am not sure I understand all of your concerns, but I want to try and answer your post. From what you have written I assume:
Validation is complex and multiple properties of an object must be checked to decide if any change to the object is valid.
The object must always be in a valid state. Changes that make the object invalid are not permitted.
It is too expensive to copy the object, make the change, validate the object, and then reject the change if the validation fails.
First, move the validation logic out of the builder and into a separate class, say a ModelValidator with a validateModel(model) method.
The first option is to use a command pattern.
Create an abstract class or interface named Update (I don't think Python has abstract classes/interfaces as such, but that's fine). The Update interface declares two methods, execute() and undo().
A concrete class has a name like UpdateAddress, UpdatePortfolio, or UpdatePaymentInfo.
Each concrete Update object also holds a reference to your model object.
The concrete classes hold the state needed for a particular kind of update. Imagine these methods exist on the UpdateAddress class:
UpdateAddress
    setStreetNumber(...)
    setCity(...)
    setPostcode(...)
    setCountry(...)
The update object needs to hold both the current and new values of a property, like:
setStreetNumber(aString):
    self.oldStreetNumber = model.getStreetNumber()
    self.newStreetNumber = aString
When the execute method is called, the model is updated:
execute():
    model.setStreetNumber(newStreetNumber)
    model.setCity(newCity)
    # set postcode and country
    if not ModelValidator.isValid(model):
        self.undo()
        raise ValidationError
and the undo method looks like:
undo():
    model.setStreetNumber(oldStreetNumber)
    model.setCity(oldCity)
    # set postcode and country
That is a lot of typing, but it would work. Mutating your model object is nicely encapsulated by different kinds of updates. You can execute or undo the changes by calling those methods on the update object. You can even store a list of update objects for multi-level undos and re-tries.
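Condensed into one runnable sketch: the Address model, its two fields, and the validation rule are all invented for illustration, but the execute/undo mechanics are the ones described above.

class Address:
    def __init__(self, street_number, city):
        self.street_number = street_number
        self.city = city

class ModelValidator:
    @staticmethod
    def is_valid(model):
        # invented rule: both fields must be non-empty
        return bool(model.street_number) and bool(model.city)

class UpdateAddress:
    def __init__(self, model):
        self.model = model
        self.old = {}   # previous values, for undo
        self.new = {}   # pending values, applied by execute

    def set_street_number(self, value):
        self.old['street_number'] = self.model.street_number
        self.new['street_number'] = value
        return self

    def set_city(self, value):
        self.old['city'] = self.model.city
        self.new['city'] = value
        return self

    def execute(self):
        for name, value in self.new.items():
            setattr(self.model, name, value)
        if not ModelValidator.is_valid(self.model):
            self.undo()
            raise ValueError("update left the model in an invalid state")

    def undo(self):
        for name, value in self.old.items():
            setattr(self.model, name, value)

address = Address("12", "Berlin")
UpdateAddress(address).set_street_number("42").set_city("Hamburg").execute()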
However, it is a lot of typing for the programmer. Consider using persistent data structures instead. Persistent data structures can be used to copy objects very quickly, with approximately constant time complexity. The pyrsistent library provides persistent data structures for Python.
Let's assume your data is in a persistent data structure version of a dict; pyrsistent calls it a PMap.
The implementation of the update classes can be simpler. Starting with the constructor:
UpdateAddress(pmap):
    self.oldPmap = pmap
    self.newPmap = pmap
The setters are easier:
setStreetNumber(aString):
    self.newPmap = self.newPmap.set('streetNumber', aString)
execute() passes back a new instance of the model, with all the updates applied:
execute():
    if ModelValidator.isValid(self.newPmap):
        return self.newPmap
    else:
        raise ValidationError
The original object has not changed at all, thanks to the magic of persistent data structures.
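As a small sketch of the same idea with pyrsistent (the field names and the validation rule are again invented):

from pyrsistent import pmap

def is_valid(model):
    # invented rule
    return model['m1'] < model['m2']

class UpdateModel:
    def __init__(self, model):
        self.old = model
        self.new = model

    def set(self, key, value):
        self.new = self.new.set(key, value)  # returns a new PMap; the old one is untouched
        return self

    def execute(self):
        if not is_valid(self.new):
            raise ValueError("invalid update")
        return self.new

model = pmap({'m1': 1, 'm2': 10})
model2 = UpdateModel(model).set('m1', 5).execute()
# model is still pmap({'m1': 1, 'm2': 10}); model2 holds the validated update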
The best thing is to not do any of this. Instead, use an ORM or object database. That is the "enterprise grade" solution. These libraries give you sophisticated tools like transactions and object version history.

How to apply 'load_from' and 'dump_to' to every field in a marshmallow schema?

I've been trying to implement an 'abstract' schema class that will automatically convert values in CamelCase (serialized) to snake_case (deserialized).
class CamelCaseSchema(marshmallow.Schema):
    @marshmallow.pre_load
    def camel_to_snake(self, data):
        return {
            utils.CaseConverter.camel_to_snake(key): value
            for key, value in data.items()
        }

    @marshmallow.post_dump
    def snake_to_camel(self, data):
        return {
            utils.CaseConverter.snake_to_camel(key): value
            for key, value in data.items()
        }
While using something like this works nicely, it does not achieve everything that applying load_from and dump_to to a field does. Namely, it fails to provide the correct field names when there's an issue with deserialization. For instance, I get:
{'something_id': [u'Not a valid integer.']} instead of {'somethingId': [u'Not a valid integer.']}.
While I can post-process these emitted errors, this seems like an unnecessary coupling that I wish to avoid if I'm to make the use of schema fully transparent.
Any ideas? I tried tackling the metaclasses involved, but the complexity was a bit overwhelming and everything seemed exceptionally ugly.
You're using marshmallow 2. Marshmallow 3 is now out and I recommend using it. My answer will apply to marshmallow 3.
In marshmallow 3, load_from and dump_to have been replaced by a single attribute: data_key.
You'd need to alter data_key in each field when instantiating the schema. This will happen after field instantiation but I don't think it matters.
You want to do that ASAP when the schema is instantiated to avoid inconsistency issues. The right moment to do that would be in the middle of Schema._init_fields, before the data_key attributes are checked for consistency. But duplicating this method would be a pity. Besides, due to the nature of the camel/snake case conversion the consistency checks can be applied before the conversion anyway.
And since _init_fields is private API, I'd recommend doing the modification at the end of __init__.
class CamelCaseSchema(Schema):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        for field_name, field in self.fields.items():
            field.data_key = utils.CaseConverter.snake_to_camel(field_name)
I didn't try that but I think it should work.
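For illustration, a self-contained version with a stand-in for utils.CaseConverter.snake_to_camel (which is not shown in the question); the expected outputs assume the data_key approach behaves as described above:

from marshmallow import Schema, fields

def snake_to_camel(name):
    # stand-in for utils.CaseConverter.snake_to_camel
    first, *rest = name.split('_')
    return first + ''.join(word.capitalize() for word in rest)

class CamelCaseSchema(Schema):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        for field_name, field in self.fields.items():
            field.data_key = snake_to_camel(field_name)

class SomethingSchema(CamelCaseSchema):
    something_id = fields.Integer()

schema = SomethingSchema()
print(schema.dump({'something_id': 42}))      # expected: {'somethingId': 42}
print(schema.validate({'somethingId': 'x'}))  # expected: {'somethingId': ['Not a valid integer.']}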

Can an association class be implemented in Python?

I have just started learning software development and I am modelling my system in a UML Class diagram. I am unsure how I would implement this in code.
To keep things simple, let's assume the following example:
There is a Room class and a Guest class with a many-to-many association Room (0..*) to Guest (0..*), and an association class RoomBooking, which contains the booking details. How would I model this in Python if my system wants to see all room bookings made by a particular guest?
Most Python applications developed from a UML design are backed by a relational database, usually via an ORM. In which case your design is pretty trivial: your RoomBooking is a table in the database, and the way you look up all RoomBooking objects for a given Guest is just an ORM query. Keeping it vague rather than using a particular ORM syntax, something like this:
bookings = RoomBooking.select(Guest=guest)
With an RDBMS but no ORM, it's not much different. Something like this:
sql = 'SELECT Room, Guest, Charge, Paid FROM RoomBooking WHERE Guest = ?'
cur = db.execute(sql, (guest.id,))
bookings = [RoomBooking(*row) for row in cur]
And this points to what you'd do if you're not using a RDBMS: any relation that would be stored as a table with a foreign key is instead stored as some kind of dict in memory.
For example, you might have a dict mapping guests to sets of room bookings:
bookings = guest_booking[guest]
Or, alternatively, if you don't have a huge number of hotels, you might have this mapping implicit, with each hotel having a 1-to-1 mapping of guests to bookings:
bookings = [hotel.bookings[guest] for hotel in hotels]
Since you're starting off with UML, you're probably thinking in strict OO terms, so you'll want to encapsulate this dict in some class, behind some mutator and accessor methods, so you can ensure that you don't accidentally break any invariants.
There are a few obvious places to put it—a BookingManager object makes sense for the guest-to-set-of-bookings mapping, and the Hotel itself is such an obvious place for the per-hotel-guest-to-booking that I used it without thinking above.
But another place to put it, which is closer to the ORM design, is in a class attribute on the RoomBooking type, accessed by classmethods. This also allows you to extend things if you later need to, e.g., look things up by hotel—you'd then put two dicts as class attributes, and ensure that a single method always updates both of them, so you know they're always consistent.
So, let's look at that:
import collections

class RoomBooking:
    guest_mapping = collections.defaultdict(set)
    hotel_mapping = collections.defaultdict(set)

    def __init__(self, guest, room):
        self.guest, self.room = guest, room

    @classmethod
    def find_by_guest(cls, guest):
        return cls.guest_mapping[guest]

    @classmethod
    def find_by_hotel(cls, hotel):
        return cls.hotel_mapping[hotel]

    @classmethod
    def add_booking(cls, guest, room):
        booking = cls(guest, room)
        cls.guest_mapping[guest].add(booking)
        cls.hotel_mapping[room.hotel].add(booking)
Of course your Hotel instance probably needs to add the booking as well, so it can raise an exception if two different bookings cover the same room on overlapping dates, whether that happens in RoomBooking.add_booking, or in some higher-level function that calls both Hotel.add_booking and RoomBooking.add_booking.
And if this is multi-threaded (which seems like a good possibility, given that you're heading this far down the Java-inspired design path), you'll need a big lock, or a series of fine-grained locks, around the whole transaction.
For persistence, you probably want to store these mappings along with the public objects. But for a small enough data set, or for a server that rarely restarts, it might be simpler to just persist the public objects, and rebuild the mappings at load time by doing a bunch of add_booking calls as part of the load process.
If you want to make it even more ORM-style, you can have a single find method that takes keyword arguments and manually executes a "query plan" in a trivial way:
@classmethod
def find(cls, guest=None, hotel=None):
    if guest is None and hotel is None:
        return {booking for bookings in cls.guest_mapping.values()
                for booking in bookings}
    elif hotel is None:
        return cls.guest_mapping[guest]
    elif guest is None:
        return cls.hotel_mapping[hotel]
    else:
        return {booking for booking in cls.guest_mapping[guest]
                if booking.room.hotel == hotel}
But this is already pushing things to the point where you might want to go back and ask whether you were right not to use an ORM in the first place. If that sounds ridiculously heavy duty for your simple toy app, take a look at sqlite3 for the database (which comes with Python, and which takes less work to use than coming up with a way to pickle or JSON all your data for persistence) and SQLAlchemy for the ORM. There's not much of a learning curve, and not much runtime overhead or coding-time boilerplate.
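If you do end up with SQLAlchemy, the association class maps directly onto its association-object pattern. A rough sketch (SQLAlchemy 1.4+ style; table and column names invented):

from sqlalchemy import Column, Integer, String, ForeignKey, create_engine
from sqlalchemy.orm import declarative_base, relationship, Session

Base = declarative_base()

class Guest(Base):
    __tablename__ = 'guest'
    id = Column(Integer, primary_key=True)
    name = Column(String)
    bookings = relationship('RoomBooking', back_populates='guest')

class Room(Base):
    __tablename__ = 'room'
    id = Column(Integer, primary_key=True)
    number = Column(Integer)
    bookings = relationship('RoomBooking', back_populates='room')

class RoomBooking(Base):  # the association class
    __tablename__ = 'room_booking'
    guest_id = Column(ForeignKey('guest.id'), primary_key=True)
    room_id = Column(ForeignKey('room.id'), primary_key=True)
    date = Column(String)
    guest = relationship('Guest', back_populates='bookings')
    room = relationship('Room', back_populates='bookings')

engine = create_engine('sqlite://')
Base.metadata.create_all(engine)

with Session(engine) as session:
    guest = Guest(name='Joe')
    room = Room(number=1)
    session.add(RoomBooking(guest=guest, room=room, date='some date'))
    session.commit()
    print(guest.bookings)  # all bookings made by this guest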
Sure, you can implement it in Python, but there is not a single canonical way. Quite often you have a database layer where the association class is mapped to a table with two foreign keys (in your case to the primary keys of Room and Guest). So in order to search you would just send the appropriate SQL. If you want to cache this table in memory, you could code it like this (or similarly) with an associative array:
from collections import defaultdict

class Room():
    def __init__(self, num):
        self.room_number = num

    def key(self):
        return str(self.room_number)

class Guest():
    def __init__(self, name):
        self.name = name

    def key(self):
        return self.name

def nested_dict(n, type):
    if n == 1:
        return defaultdict(type)
    else:
        return defaultdict(lambda: nested_dict(n-1, type))

room_booking = nested_dict(2, str)

class Room_Booking():
    def __init__(self, date):
        self.date = date

room1 = Room(1)
guest1 = Guest("Joe")

room_booking[room1.key()][guest1.key()] = Room_Booking("some date")
print(room_booking[room1.key()][guest1.key()])

MongoEngine: Create Pickle field

I'm using MongoEngine and trying to create a field that works like SQLAlchemy's PickleType field. Basically, I just need to pickle objects before they're written to the database, and unpickle them when they're loaded.
However, it looks like MongoEngine's fields don't provide proper conversion methods that I could override; instead they have two coercion methods (to_python and to_mongo). If I understand correctly, these functions can be called at any time; that is, a call to to_python(v) does not guarantee that v comes from the database. I've thought of writing something like this:
class PickleField(fields.BinaryField):
    def to_python(self, value):
        value = super().to_python(value)
        if <<value was pickled by the field>>:
            return pickle.loads(value)
        else:
            return value
Unfortunately, if I want to be as general as possible, I don't see a way to check whether the value should be unpickled or not. For instance,
a = pickle.dumps(x)
PickleField().to_python(a) # should return a, will return x
I also don't think I can store any state in the PickleField, since that's shared by all instances.
Is there a way around this?

Dynamically building up types in python

Suppose I am building a composite set of types:
def subordinate_type(params):
    # dink with stuff
    a = type(myname, (), dict_of_fields)
    return a()

def toplevel(params):
    lots_of_types = dict(keys, values)
    myawesomedynamictype = type(toplevelname, (), lots_of_types)
    # Now I want to edit some of the values in myawesomedynamictype's
    # lots_of_types.
    return myawesomedynamictype()
In this particular case, I want a reference to the "typeclass" myawesomedynamictype inserted into lots_of_types.
I've tried to iterate through lots_of_types and set it, supposing that the references pointed to the same thing, but I found that myawesomedynamictype got corrupted and lost its fields.
The problem I'm trying to solve is that I get values related to the type subordinate_type, and I need to generate a toplevel instantiation based on subordinate_type.
This is an ancient question, and because it's not clear what the code is trying to do (being a code gist rather than working code), it's a little hard to answer.
But it sounds like you want a reference to the dynamically created class myawesomedynamictype on the class itself. A copy of the dictionary lots_of_types (I believe it is a copy) became the __dict__ of this new class when you called type() to construct it.
So, just set a new attribute on the class to have a value of the class you just constructed; Is that what you were after?
def toplevel(params):
    lots_of_types = dict(keys, values)
    myawesomedynamictype = type(toplevelname, (), lots_of_types)
    myawesomedynamictype.myawesomedynamictype = myawesomedynamictype
    return myawesomedynamictype()
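A tiny self-contained demonstration of that self-reference (names invented):

fields = {'greeting': 'hello'}
MyDynamicType = type('MyDynamicType', (), fields)
MyDynamicType.myawesomedynamictype = MyDynamicType  # the class now refers to itself

instance = MyDynamicType()
print(instance.greeting)                               # hello
print(instance.myawesomedynamictype is MyDynamicType)  # True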
