How can I test the validity of a ReferenceProperty in Appengine? - python

I am currently testing a small application I have written. I have not been sufficiently careful in ensuring the data in my datastore is consistent, and now I have some records referencing objects which no longer exist. More specifically, I have some objects whose ReferenceProperty fields have been assigned values; the objects they refer to have since been deleted, but the references remain.
I would like to add some checks to my code to ensure that referenced objects exist, and to act accordingly when they don't. (Of course, I also need to clean up my data.)
One approach is to just try to get() it; however, that retrieves the entire object. I'd like an approach which just tests for the existence of the object, ideally costing only key operations on the datastore rather than full entity reads.
So, my question is the following: is there a simple mechanism to test if a given ReferenceProperty is valid which only involves key access operations (rather than full entity get operations)?

This will test key existence without returning an entity:
db.Query(keys_only=True).filter('__key__ =', test_key).count(1) == 1
I'm not certain that it's computationally cheaper than fetching the entity.
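If you want to go from a ReferenceProperty straight to the key without dereferencing it, here is a sketch of how the two pieces fit together (the helper name is made up; get_value_for_datastore returns the raw key for a ReferenceProperty):
def reference_exists(model_class, prop_name, instance):
    # Resolve the ReferenceProperty to its raw key, without fetching the
    # referenced entity.
    prop = getattr(model_class, prop_name)
    ref_key = prop.get_value_for_datastore(instance)
    if ref_key is None:
        return False
    # Keys-only query: only key data is read, not the full entity.
    return db.Query(keys_only=True).filter('__key__ =', ref_key).count(1) == 1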

Related

SQLAlchemy add vs add_all

I'm trying to figure out when to use session.add and when to use session.add_all with SQLAlchemy.
Specifically, I don't understand the downsides of using add_all. It can do everything that add can do, so why not just always use it? There is no mention of this in the SQLAlchemy documentation.
If you only have one new record to add, use sqlalchemy.orm.session.Session.add(); if you have multiple records, use sqlalchemy.orm.session.Session.add_all(). There's not really a significant difference, except that the API of the first method is for a single instance whereas the second is for multiple instances. Is that a big difference? No. It's just convenience.
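A minimal sketch of both calls (the engine and the User model are assumed to exist already):
from sqlalchemy.orm import Session

with Session(engine) as session:
    session.add(User(name="alice"))                          # one instance
    session.add_all([User(name="bob"), User(name="carol")])  # several at once
    session.commit()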
I was wondering about the same thing and, as mentioned by others, there is no real difference. However, I would like to add that using add in a loop instead of add_all allows you to be more fine-grained about exception handling. Passing a list of mapped class instances to add_all will cause a rollback for all instances if, for example, one of those objects violates a constraint (e.g. a unique constraint). I prefer to decouple my data logic from my service logic, returning the instances that could not be stored from my data layer and deciding what to do with them in the service layer. It depends, though, on how you are handling exceptions.
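A sketch of that pattern, committing per instance so one bad row does not roll back the rest (names are illustrative):
from sqlalchemy.exc import IntegrityError

def add_individually(session, instances):
    # Returns the instances that could not be stored (e.g. unique-constraint
    # violations) instead of failing the whole batch.
    failed = []
    for instance in instances:
        try:
            session.add(instance)
            session.commit()
        except IntegrityError:
            session.rollback()
            failed.append(instance)
    return failed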

How to handle errors from ndb.put_multi

The documentation for GAE's ndb.put_multi is severely lacking. NDB Entities and Keys - Python — Google Cloud Platform shows that it returns a list of keys (list_of_keys = ndb.put_multi(list_of_entities)), but it says nothing about failures. NDB Functions doesn't provide much more information.
Spelunking through the code (below), shows me that, at least for now, put_multi just aggregates the Future.get_result()s returned from the async method, which itself delegates to the entities' put code. Now, the docs for the NDB Future Class indicate that a result will be returned or else an exception will be raised. I've been told, however, that the result will be None if a particular put failed (I can't find any authoritative documentation to that effect, but if it's anything like db.get then that would make sense).
So all of this boils down to some questions I can't find the answers to:
Clearly, the return value is a list - is it a list with some elements possibly None? Or are exceptions used instead?
When there is an error, what should be re-put? Can all entities be re-put (idempotent), or only those whose return value are None (if that's even how errors are communicated)?
How common are errors (One answer: 1/3000)? Do they show up in logs (because I haven't seen any)? Is there a way to reliably simulate an error for testing?
Usage of the function in an open source library implies that the operation is idempotent, but that's about it. (Other usages don't even bother checking the return value or catching exceptions.)
Handling Datastore Errors makes no mention of anything but exceptions.
I agree with your reading of the code: put_multi() reacts to an error the same way put_async().get_result() does. If put() would raise an exception, put_multi() will also, and will be unhelpful about which of the multiple calls failed. I'm not aware of a circumstance where put_multi() would return None for some entries in the key list.
You can re-put entities that have been put, assuming no other user has updated those entities since the last put attempt. Entities that are created with system-generated IDs have their in-memory keys updated, so re-putting these would overwrite the existing entities and not create new ones. I wouldn't call it idempotent exactly because the retry would overwrite any updates to those entities made by other processes.
Of course, if you need more control over the result, you can perform this update in a transaction, though all entities would need to be in the same entity group for this to work with a primitive transaction. (Cross-group transactions support up to five distinct entity groups.) A failure during a transaction would ensure that none of the entities are created or updated if any of the attempts fail.
I don't know a general error rate for update failures. Such failures are most likely to include contention errors or "hot tablets" (too many updates to nearby records in too short a time, resulting in a timeout), which would depend on app behavior. All such errors are reported by the API as exceptions.
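If you do want to retry, here is a rough sketch (put_multi_with_retry is a made-up helper; it is only safe if overwriting the whole batch on retry is acceptable, per the caveat above):
from google.appengine.api import datastore_errors
from google.appengine.ext import ndb

def put_multi_with_retry(entities, attempts=3):
    # Retry the whole batch on transient datastore errors; the entities keep
    # their (possibly system-generated) keys, so a retry overwrites rather
    # than duplicates.
    for attempt in range(attempts):
        try:
            return ndb.put_multi(entities)
        except (datastore_errors.Timeout,
                datastore_errors.TransactionFailedError):
            if attempt == attempts - 1:
                raise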
The easiest way to test error handling call paths would be to wrap the Model class and override the methods with a test mode behavior. You could get fancier and dig into the stub API used by testbed, which may have a way to hook into low-level calls and simulate errors. (I don't think this is a feature of testbed directly but you could use a similar technique.)
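For simulating a failure, one low-tech sketch (using the mock package, not a documented testbed feature) is to patch the call your code goes through and make it raise:
import mock  # or `from unittest import mock` on Python 3
from google.appengine.api import datastore_errors
from google.appengine.ext import ndb

def test_put_multi_timeout_is_handled():
    # Force ndb.put_multi to raise as if the datastore timed out, then
    # exercise whatever error handling your code does around it.
    with mock.patch.object(ndb, 'put_multi',
                           side_effect=datastore_errors.Timeout()):
        save_all_my_entities()  # hypothetical code under test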

Does an AppEngine Transaction need to perform a get and a put operation for it to be useful?

Two code examples (simplified):
.get outside the transaction (object from .get passed into the transactional function)
@db.transactional
def update_object_1_txn(obj, new_value):
    obj.prop1 = new_value
    return obj.put()
.get inside the transaction
@db.transactional
def update_object2_txn(obj_key, new_value):
    obj = db.get(obj_key)
    obj.prop1 = new_value
    return obj.put()
Is the first example logically sound? Is the transaction there useful at all; does it provide anything? I'm trying to better understand App Engine's transactions. Would choosing the second option protect against concurrent modifications of that object?
To answer your question in one word: yes, your second example is the way to do it. In the boundaries of a transaction, you get some data, change it, and commit the new value.
Your first one is not wrong, though, because you don't read from obj. So even though it might not have the same value that the earlier get returned, you wouldn't notice. Put another way: as written, your examples aren't good at illustrating the point of a transaction, which is usually called "test and set". See a good Wikipedia article on it here: http://en.wikipedia.org/wiki/Test-and-set
More specific to GAE, as defined in GAE docs, a transaction is:
a set of Datastore operations on one or more entities. Each transaction is guaranteed to be atomic, which means that transactions are never partially applied. Either all of the operations in the transaction are applied, or none of them are applied.
which tells you it doesn't have to be just for test and set, it could also be useful for ensuring the batch commit of several entities, etc.
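For the batch-commit case, a minimal sketch (balance is a made-up property; xg=True is only needed if the entities are in different entity groups):
@db.transactional(xg=True)
def transfer_txn(from_key, to_key, amount):
    # Read both entities, modify them, and put them back atomically:
    # either both updates are applied or neither is.
    src = db.get(from_key)
    dst = db.get(to_key)
    src.balance -= amount
    dst.balance += amount
    db.put([src, dst])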

Examples of use for PickledObjectField (django-picklefield)?

Surfing the web and reading about Django dev best practices, the advice that keeps coming up is to use pickled model fields with extreme caution.
But in a real life example, where would you use a PickledObjectField, to solve what specific problems?
We have a system of social-network "backends" which do some generic stuff like "post message", "get status", "get friends", etc. The link between each backend class and a user is a Django model, which keeps the user, backend name, and credentials. Now imagine how many auth systems there are: OAuth, plain passwords, Facebook's obscure JS stuff, etc. This is where JSONField shines: we keep all backend-specific auth data in a dictionary on this model, stored in the db as JSON, and we can put anything into it, no problem.
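A sketch of that kind of model (names are made up, and the JSONField import depends on which package you use):
from django.db import models
from jsonfield import JSONField  # e.g. django-jsonfield

class SocialBackendLink(models.Model):
    # One row per (user, backend) pair; credentials holds whatever
    # backend-specific auth data is needed (tokens, passwords, ...).
    user = models.ForeignKey('auth.User', on_delete=models.CASCADE)
    backend_name = models.CharField(max_length=50)
    credentials = JSONField(default=dict)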
You would use it to store... almost-arbitrary Python objects. In general there's little reason to use it; JSON is safer and more portable.
You can definitely substitute a PickledObjectField with JSON and some extra logic to create an object out of the JSON. At the end of the day, your use case, when considering PickledObjectField versus JSON+logic, is serializing a Python object into your database. If you can trust the data in the Python object and know that it will always be serializable, you can reasonably use the PickledObjectField.
In my case (I don't use Django's ORM, but this should still apply), I have a couple of different object types that can go into my PickledObjectField, and their definitions are constantly mutating. Rather than constantly updating my JSON parsing logic to create an object out of JSON values, I simply use a PickledObjectField to store the different objects, and then later retrieve them in perfectly usable form (calling their functions).
Caveat: if you store an object via PickledObjectField, then change the object definition, and then retrieve the object, the old object may have trouble fitting into the new object's definition (depending on what you changed).
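A minimal django-picklefield sketch of that pattern (model and field names are illustrative):
from django.db import models
from picklefield.fields import PickledObjectField

class JobResult(models.Model):
    name = models.CharField(max_length=100)
    # payload can hold (almost) any picklable Python object and comes back
    # as that object, methods and all, when the row is read.
    payload = PickledObjectField()
Retrieving JobResult.objects.get(name='nightly').payload then hands you back the original object, rather than JSON you have to reconstruct.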
The problems to be solved are the efficiency and the convenience of defining and handling a complex object consisting of many parts.
You can turn each part type into a Model and connect them via ForeignKeys.
Or you can turn each part type into a class, dictionary, list, tuple, enum or whathaveyou to your liking and use PickledObjectField to store and retrieve the whole beast in one step.
That approach makes sense if you will never manipulate parts individually, only the complex object as a whole.
Real life example
In my application there are RQdef objects that represent essentially a type with a certain basic structure (if you are curious what they mean, look here).
RQdefs consist of several Aspects and some fixed attributes.
Aspects consist of one or more Facets and some fixed attributes.
Facets consist of two or more Levels and some fixed attributes.
Levels consist of a few fixed attributes.
Overall, a typical RQdef will have about 20-40 parts.
An RQdef is always completely constructed in a single step before it is stored in the database and it is henceforth never modified, only read (but read frequently).
PickledObjectField is more convenient and much more efficient for this purpose than would be a set of four models and 20-40 objects for each RQdef.
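Roughly, that setup looks like the following (only RQdef/Aspect/Facet/Level come from the description above; the field names are guesses):
from django.db import models
from picklefield.fields import PickledObjectField

class Level(object):
    def __init__(self, label):
        self.label = label

class Facet(object):
    def __init__(self, name, levels):
        self.name = name
        self.levels = levels  # two or more Level objects

class Aspect(object):
    def __init__(self, name, facets):
        self.name = name
        self.facets = facets

class RQdef(models.Model):
    # The whole nested structure is built once and stored in a single
    # column; it is only ever read back as a whole, never queried by part.
    name = models.CharField(max_length=100)
    definition = PickledObjectField()  # holds a list of Aspect objects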

Getting a list of child entities in App Engine using get_by_key_name (Python)

My adventures with entity groups continue after a slightly embarrassing beginning (see Under some circumstances an App Engine get_by_key_name call using an existing key_name returns None).
I now see that I can't do a normal get_by_key_name call over a list of entities for child entities that have more than one parent entity. As the Model docs say,
Multiple entities requested by one (get_by_key_name) call must all have the same parent.
I've gotten into the habit of doing something like the following:
# Model just has the basic properties
entities = Model.get_by_key_name(key_names)
# ContentModel has all the text and blob properties for Model
content_entities = ContentModel.get_by_key_name(content_key_names)
for entity, content_entity in zip(entities, content_entities):
    # do some stuff
Now that ContentModel entities are children of Model entities, this won't work because of the single-parent requirement.
An easy way to enable the above scenario with entity groups is to be able to pass a list of parents to a get_by_key_name call, but I'm guessing that there's a good reason why this isn't currently possible. I'm wondering if this is a hard rule (as in there is absolutely no way such a call could ever work) or if perhaps the db module could be modified so that this type of call would work, even if it meant a greater CPU expense.
I'd also really like to see how others accomplish this sort of task. I can think of a bunch of ways of handling it, like using GQL queries, but none I can think of approach the performance of a get_by_key_name call.
Just create a key list and do a get on it.
entities = Model.get_by_key_name(key_names)
content_keys = [db.Key.from_path('Model', name, 'ContentModel', name)
                for name in key_names]
content_entities = ContentModel.get(content_keys)
Note that I assume the key_name for each ContentModel entity is the same as its parent Model. (For a 1:1 relationship, it makes sense to reuse the key_name.)
I'm embarrassed to say that the restriction ('must be in the same entity group') actually no longer applies in this case. Please do feel free to file a documentation bug!
In any case, get_by_key_name is only syntactic sugar for get, as Bill Katz illustrates. You can go a step further, even, and use db.get on a list of keys to get everything in one go - db.get doesn't care about model type.
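A sketch of that last point (reusing the parent's key_name for the child, as above):
model_keys = [db.Key.from_path('Model', name) for name in key_names]
content_keys = [db.Key.from_path('Model', name, 'ContentModel', name)
                for name in key_names]
# db.get accepts keys of different kinds and returns entities in the
# same order, so both lists can be fetched in a single call.
results = db.get(model_keys + content_keys)
entities = results[:len(key_names)]
content_entities = results[len(key_names):]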
