parse google app engine object in python

Is it possible to parse a google app engine object like so...
objects = db.GqlQuery("SELECT * FROM Database WHERE item = 'random'")
memcache.add('objects', objects, 3600)
if objects == 'some condition':
    pass  # here can I do a query on 'objects' without using GqlQuery
elif objects == 'something else':
    pass  # do a different query than the one above
The idea is to store an object into memcache and then manipulate that object in different ways. This is to lighten the datastore reads. Thanks in advance!

You can, and most people end up doing it. However, there are a bunch of things you need to consider.
At the moment you are storing a query object, not the results, in memcache: objects in your code is a query object. Use run(), fetch(), etc. to get some actual results.
Manipulating the objects and storing them in memcache without writing back means you will lose data. memcache is not a reliable storage mechanism on App Engine (it is just a cache) and entries can be evicted at any time.
If your query is intended to return a single result, get the object by its key; that is far more efficient than a query and not a lot slower than memcache. (ndb will cache gets for you - see the next point.)
It looks like you are starting out with App Engine. If you do not have an existing code base, start out with ndb rather than db; it is, in my opinion, a better library. ndb does a lot of caching for you (when using get()) in memcache and at the request/instance level.

Related

SQLAlchemy - cache table obj locally

I'm querying an existing read-only database with SQLAlchemy, and wonder if there is a way to cache the queried table object locally (in an automatic way) so that I can use it later. The main reason for this is to avoid locking the database while my script is running (e.g. I have to keep a session connected while waiting for a user's response), and the database is read-only, so I really don't need the modified data to be synchronized back.
Right now I'm working on a solution that converts the queried results into a pd.DataFrame, but it would be nice to keep the advantage of SQLAlchemy, where the queried result retains its structure rather than being converted to a flat table in a pd.DataFrame.
I'm new to SQLAlchemy and still learning. Any suggestions about the solution, or about major features already provided in the SQLAlchemy package that I may have missed, would really be appreciated!
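One lightweight direction (a stdlib sketch, not a SQLAlchemy feature) is to copy each row into an immutable namedtuple once and memoize the query by key; the cached rows keep per-field structure and never touch the database or a session again. The User type and field names here are hypothetical stand-ins for a mapped class:

```python
from collections import namedtuple

# Immutable row type standing in for an ORM-mapped class (hypothetical fields).
User = namedtuple('User', ['id', 'name', 'role'])

_query_cache = {}

def cached_query(key, run_query):
    """Memoize query results as plain tuples so they outlive any session."""
    if key not in _query_cache:
        rows = run_query()  # with SQLAlchemy: session.query(...).all()
        _query_cache[key] = [User(*row) for row in rows]
    return _query_cache[key]

users = cached_query('all-users', lambda: [(1, 'ann', 'admin'), (2, 'bob', 'user')])
users_again = cached_query('all-users', lambda: [])  # served from the cache
```

Since the database is read-only, staleness is not a concern, and the session can be closed as soon as the copy is made.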

Mongoengine - Can I get the same document object instance from two subsequent queries?

This is the use case: I have a server that receives instructions from many clients. Each client's instructions are handled by its own Session object, which holds all the information about the state of the session and queries mongoengine for the data it needs.
Now, suppose session1 queries mongoengine and gets document "A" as a document object.
Later, session2 also queries and gets document "A", as another separate document object.
Now we have 2 document objects representing document "A", and to get them consistent I need to call A.update() and A.reload() all the time, which seems unnecessary.
Is there any way I can get a reference to the same document object over the two queries? This way both sessions could make changes to the document object and those changes would be seen by the other sessions, since they would be made to the same python object.
I've thought about making a wrapper for mongoengine that caches the documents that we have as document objects at runtime and ensures there are no multiple objects for the same document at any given time. But my knowledge of mongoengine is too rudimentary to do it at the time.
Any thoughts on this? Is my entire design flawed? Is there any easy solution?
I don't think going in that direction is a good idea. From what I understand you are in a web application context, you might be able to get something working for threads within a single process but you won't be able to share instances across different processes (and it gets even worse if you have processes running on different machines).
One way to address this is to use optimistic concurrency validation: you maintain a field like "version-identifier" that gets bumped whenever the instance is updated, and whenever you save/update the object, you run a query like "update the object if version-identifier matches, else fail".
This means that if there are concurrent requests, one of them will succeed (the first one to be flushed), and the other will fail because the version-identifier it holds is outdated. MongoEngine has no built-in support for that, but more info can be found here https://github.com/MongoEngine/mongoengine/issues/1563
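The check-version-then-write idea can be sketched without any database at all; here a plain dict plays the role of the document store (this is an illustration of the pattern, not MongoEngine code):

```python
# Optimistic concurrency sketch over a plain dict "store" (not MongoEngine).
store = {'A': {'data': 'hello', 'version': 1}}

class StaleWriteError(Exception):
    pass

def save(doc_id, new_data, expected_version):
    """Update only if the stored version still matches; otherwise fail."""
    doc = store[doc_id]
    if doc['version'] != expected_version:
        raise StaleWriteError('document changed since it was read')
    doc['data'] = new_data
    doc['version'] += 1

# Two sessions both read version 1; the first write wins, the second fails.
save('A', 'from session1', expected_version=1)
conflict = False
try:
    save('A', 'from session2', expected_version=1)
except StaleWriteError:
    conflict = True  # session2 must reload and retry
```

In a real database the compare-and-update must be a single atomic operation (e.g. an update with a version filter), not a separate read followed by a write.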

SQLAlchemy: use related object when session is closed

I have many models with relational links to each other which I have to use. My code is very complicated so I cannot keep session alive after a query. Instead, I try to preload all the objects:
def db_get_structure():
    with Session(my_engine) as session:
        deps = {x.id: x for x in session.query(Department).all()}
        ...
    return (deps, ...)

def some_logic(id):
    struct = db_get_structure()
    return some_other_logic(struct.deps[id].owner)
However, I get the following error anyway regardless of the fact that all the objects are already loaded:
sqlalchemy.orm.exc.DetachedInstanceError: Parent instance <Department at 0x10476e780> is not bound to a Session; lazy load operation of attribute 'owner' cannot proceed
Is it possible to link preloaded objects with each other so that the relations will work after session get closed?
I know about joined queries (.options(joinedload(...))), but this approach leads to more lines of code and bigger DB requests, and I think this should be solvable more simply, because all the objects are already loaded into Python objects.
It's even possible now to look up related objects like struct.deps[struct.deps[id].owner_id], but I think the ORM should do this and provide the shorter notation struct.deps[id].owner using some kind of "cached load".
Whenever you access an attribute on a DB entity that has not yet been loaded from the DB, SQLAlchemy will issue an implicit SQL statement to the DB to fetch that data. My guess is that this is what happens when you issue struct.deps[struct.deps[id].owner_id].
If the object in question has been removed from the session it is in a "detached" state and SQLAlchemy protects you from accidentally running into inconsistent data. In order to work with that object again it needs to be "re-attached".
I've done this already fairly often with session.merge:
attached_object = new_session.merge(detached_object)
But this will reconcile the object instance with the DB and potentially issue updates to the DB if necessary. The detached_object is taken as the "truth".
I believe you can do the reverse (attaching it by reading from the DB instead of writing to it) by using session.refresh(detached_object), but I need to verify this. I'll update the post if I found something.
Both ways have to talk to the DB with at least a select to ensure the data is consistent.
In order to avoid loading, issue session.merge(..., load=False). But this has some very important caveats. Have a look at the docs of session.merge() for details.
I will need to read up on the link you added concerning your "complicated code". I would like to understand why you need to throw away your session the way you do. Maybe there is an easier way?
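As a workaround outside the ORM, the cross-references can also be wired up by hand after loading, so that attribute access never triggers a lazy load. A sketch with plain objects (not SQLAlchemy; Department and owner_id mirror the names in the question, and the owner attribute here is filled in manually rather than by a relationship):

```python
class Department:
    """Plain stand-in for the mapped Department class."""
    def __init__(self, id, owner_id):
        self.id = id
        self.owner_id = owner_id
        self.owner = None  # resolved after all rows are loaded

def link_structure(deps):
    """Resolve owner references within the preloaded dict of departments."""
    for dep in deps.values():
        dep.owner = deps.get(dep.owner_id)
    return deps

deps = link_structure({
    1: Department(id=1, owner_id=2),
    2: Department(id=2, owner_id=2),  # owns itself, for illustration
})
# Now deps[1].owner is deps[2], with no session involved.
```

This is essentially what joinedload does for you inside the ORM; doing it by hand trades a relationship definition for one explicit linking pass after the session closes.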

App engine -- when to use memcache vs search index (and search API)?

I am interested in adding a spell checker to my app -- I'm planning on using difflib with a custom word list that's ~147kB large (13,025 words).
When testing user queries against this list, would it make more sense to:
load the dictionary into memcache (I guess from the datastore?) and keep it in memcache or
build an index for the Search API and pass in the user query against it there
I guess what I'm asking is which is faster: memcache or a search index?
Thanks.
Memcache is definitely faster.
Another important consideration is cost. Memcache API calls are free, while Search API calls have their own quota and pricing.
By the way, you could store your word list as a static file, since it's small and it does not change. There is no need to store it in the Datastore.
Memcache is faster however you need to consider the following.
it is not reliable: at any moment entries can be purged, so your code needs a fallback for non-cached data
you can only fetch by key, so as you said you would need to store whole dictionaries in memcache objects
each memcache entry can only store 1MB. If your dictionary is larger you would have to span multiple entries. OK, not relevant in your case.
There are some other alternatives. How often will the dictionary be updated?
Here is one alternative strategy.
You could store it in the filesystem (requires app updates to change it), or in GCS if you want to update the dictionary outside of app updates. Then each instance can load the dictionary into memory at startup or on first request and cache it at the running-instance level, so you won't have any round trips to services adding latency. It will also be simpler code-wise (i.e. no fallbacks for when the data is not in memcache, etc.).
Here is an example. In this case the code lives in a module, which is imported as required. I am using a YAML file for additional configuration; it could just as easily JSON-load a dictionary, or you could define a Python dictionary in the module.
from yaml import safe_load

_modsettings = {}

def loadSettings(settings='settings.yaml'):
    # Parse the file only once; later calls return the module-level dict.
    if not _modsettings:
        try:
            with open(settings, 'r') as f:
                _modsettings.update(safe_load(f.read()))
        except IOError:
            pass
    return _modsettings

settings = loadSettings()
Then whenever I want the settings dictionary my code just refers to mymodule.settings.
By importing this module during a warmup request you won't get a race condition, or have to import/parse the dictionary during a user facing request. You can put in more error traps as appropriate ;-)

Attribute Cache in Django - What's the point?

I was just looking over EveryBlock's source code and I noticed this code in the alerts/models.py code:
def _get_user(self):
    if not hasattr(self, '_user_cache'):
        from ebpub.accounts.models import User
        try:
            self._user_cache = User.objects.get(id=self.user_id)
        except User.DoesNotExist:
            self._user_cache = None
    return self._user_cache
user = property(_get_user)
I've noticed this pattern around a bunch, but I don't quite understand the use. Is the whole idea to make sure that when accessing the FK on self (self = alert object), you only grab the user object once from the db? Why wouldn't you just rely upon the db's caching and Django's ForeignKey() field? I noticed that the model definition only holds the user id and not a foreign key field:
class EmailAlert(models.Model):
    user_id = models.IntegerField()
    ...
Any insights would be appreciated.
I don't know why this is an IntegerField; it looks like it definitely should be a ForeignKey(User) field--you lose things like select_related() here and other things because of that, too.
As to the caching, many databases don't cache results--they (or rather, the OS) will cache the data on disk needed to get the result, so looking it up a second time should be faster than the first, but it'll still take work.
It also still takes a database round-trip to look it up. In my experience with Django, an item lookup can take around 0.5 to 1 ms for an SQL command to a local PostgreSQL server, plus the sometimes nontrivial overhead of the QuerySet. 1 ms is a lot if you don't need it: do that a few times and you can turn a 30 ms request into a 35 ms request.
If your SQL server isn't local and you actually have network round-trips to deal with, the numbers get bigger.
Finally, people generally expect accessing a property to be fast; when they're complex enough to cause SQL queries, caching the result is generally a good idea.
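The same memoize-on-first-access pattern has since been folded into the standard library as functools.cached_property. A generic sketch (not EveryBlock's code; lookup() plays the role of the database query):

```python
from functools import cached_property

class EmailAlert:
    """Stand-in for the model; lookup() plays the role of User.objects.get()."""
    calls = 0  # counts how many times the "query" actually runs

    def lookup(self):
        EmailAlert.calls += 1
        return {'id': 42, 'name': 'alice'}  # imagine a DB row here

    @cached_property
    def user(self):
        # Runs once; the result is stored on the instance afterwards.
        return self.lookup()

alert = EmailAlert()
first = alert.user
second = alert.user  # served from the instance, no second lookup
```

Functionally it is the same as the hasattr/_user_cache dance above: the first access pays for the query, and every later access on the same instance is a plain attribute read.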
Although databases do cache things internally, there's still an overhead in going back to the db every time you want to check the value of a related field - setting up the query within Django, the network latency in connecting to the db and returning the data over the network, instantiating the object in Django, etc. If you know the data hasn't changed in the meantime - and within the context of a single web request you probably don't care if it has - it makes much more sense to get the data once and cache it, rather than querying it every single time.
One of the applications I work on has an extremely complex home page containing a huge amount of data. Previously it was carrying out over 400 db queries to render. I've refactored it now so it 'only' uses 80, using very similar techniques to the one you've posted, and you'd better believe that it gives a massive performance boost.
