Does app engine automatically cache frequent queries? - python

I seem to remember reading somewhere that google app engine automatically caches the results of very frequent queries into memory so that they are retrieved faster.
Is this correct?
If so, is there still a charge for datastore reads on these queries?

If you're using Python and the new ndb API, it DOES have automatic caching of entities, so if you fetch entities by key, it would be cached:
http://code.google.com/appengine/docs/python/ndb/cache.html
As the comments say, queries are not cached.
Cached requests don't hit the datastore, so you save on reads there.
If you're using Java, or the other APIs for accessing the datastore, then no, there's no caching.
edited Fixed my mistake about queries getting cached.

I think that app engine does not cache anything for you. While it could be that, internally, it caches some things for a split second, I don't think you should rely on that.
I think you will be charged the normal number of read operations for every entity you read from every query.

No, it doesn't. However depending on what framework you use for access to the datastore, memcache will be used. Are you developing in java or python? On the java side, Objectify will cache GETs automatically but not Queries. Keep in mind that there is a big difference in terms of performance and cachability between gets and queries in both python and java.
You are not charged for datastore reads for memcache hits.

Related

Is there anyway to view memcache data in google app engine?

Basically what I want to do is see the raw data of memcache so that I can see how my data are being stored.
No, for largely the same reasons that memcached does not support enumerating or dumping the cache. In order to support such a feature safely, all other cache operations would have to block, which would be unacceptable in a shared environment.
For your purpose of occasionally examining some portion of data in the cache, there is a reasonable alternative. Instrument your (and/or your colleagues) use of the memcache client in order to log which keys are frequently used, then periodically sample those keys' values.
What's wrong with the memcache viewer in the admin console?

Force immediate availability of item in Memcache

Using Google App Engine NDB, most aspects of memcache are handled automatically. However, an item does not become available in Memcache until it is read at least once. So first the item must be read using get, and then memcache stores it. Put() removes it from memcache.
However, I need something to be available in memcache immediately on put. I'm new to memcache, so I'm not entirely sure how everything works behind the scenes, but there are two ways I can do this:
Immediately after a put() of an entity, do a get(), just so that it becomes available in memcache.
Immediately after a put(), manually set the item in memcache. This would make sense, but I'm not sure if there are any gotachas with this approach. If I manually set something in memcache, will this interfere with the rest of NDB's automatic memcache handling?
Also, what key should I use when setting something in memcache manually so that upon a get, the automatic memcache handler knows what to look for?
I suspect you are referring to this:
Memcache does not support transactions. Thus, an update meant to be applied to both the Datastore and memcache might be made to only one of the two. To maintain consistency in such cases (possibly at the expense of performance), the updated entity is deleted from memcache and then written to the Datastore. A subsequent read operation will find the entity missing from memcache, retrieve it from the Datastore, and then update it in memcache as a side effect of the read. Also, NDB reads inside transactions ignore the Memcache.
So if you need something to be available on put then you'll have to cache it in memcache yourself.
Which brings us to 2)
If you manually set something in memcache AFAIK it won't interact with NDB's automatic caching in any way. Also AFAIK you can't set a manual memcache entry with a key that the automatic version will then be able to automatically work with.
You simply have to build a layer of memcache around your content that you explicitly control. Every time you to do a put you use a function that puts to the datastore then into memcache, invalidating existing entries if required. Likewise for get, you try memcache first then fall back to the datastore. Which sounds almost exactly like what NDB is doing already for you!
Perhaps look at the Policy functions options for finer control:
https://developers.google.com/appengine/docs/python/ndb/cache#policy_functions
Don't forget however that the in context cache might well be doing what you want already:
The in-context cache persists only for the duration of a single incoming HTTP request and is "visible" only to the code that handles that request. It's fast; this cache lives in memory. When an NDB function writes to the Datastore, it also writes to the in-context cache. When an NDB function reads an entity, it checks the in-context cache first. If the entity is found there, no Datastore interaction takes place.
Queries do not look up values in any cache. However, query results are
written back to the in-context cache if the cache policy says so (but
never to Memcache).
So if your put and subsequent get is happening in the same request it's coming out of the in-context cache in any case.

How to interface with another database effectively using python

I have an application that needs to interface with another app's database. I have read access but not write.
Currently I'm using sql statements via pyodbc to grab the rows and using python manipulate the data. Since I don't cache anything this can be quite costly.
I'm thinking of using an ORM to solve my problem. The question is if I use an ORM like "sql alchemy" would it be smart enough to pick up changes in the other database?
E.g. sql alchemy accesses a table and retrieves a row. If that row got modified outside of sql alchemy would it be smart enough to pick it up?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Edit: To be more clear
I have one application that is simply a reporting tool lets call App A.
I have another application that handles various financial transactions called App B.
A has access to B's database to retrieve the transactions and generates various reports. There's hundreds of thousands of transactions. We're currently caching this info manually in python, if we need an updated report we refresh the cache. If we get rid of the cache, the sql queries combined with the calculations becomes unscalable.
I don't think an ORM is the solution to your problem of performance. By default ORMs tend to be less efficient than row SQL because they might fetch data that you're not going to use (eg. doing a SELECT * when you need only one field), although SQLAlchemy allows fine-grained control over the SQL generated.
Now to implement a caching mechanism, depending on your application, you could use a simple dictionary in memory or a specialized system such as memcached or Redis.
To keep your cached data relatively fresh, you can poll the source at regular intervals, which might be OK if your application can tolerate a little delay. Otherwise you'll need the application that has write access to the db to notify your application or your cache system when an update occurs.
Edit: since you seem to have control over app B, and you've already got a cache system in app A, the simplest way to solve your problem is probably to create a callback in app A that app B can call to expire cached items. Both apps need to agree on a convention to identify cached items.

Is it more efficient to parse external XML or to hit the database?

I was wondering when dealing with a web service API that returns XML, whether it's better (faster) to just call the external service each time and parse the XML (using ElementTree) for display on your site or to save the records into the database (after parsing it once or however many times you need to each day) and make database calls instead for that same information.
First off -- measure. Don't just assume that one is better or worse than the other.
Second, if you really don't want to measure, I'd guess the database is a bit faster (assuming the database is relatively local compared to the web service). Network latency usually is more than parse time unless we're talking a really complex database or really complex XML.
Everyone is being very polite in answering this question: "it depends"... "you should test"... and so forth.
True, the question does not go into great detail about the application and network topographies involved, but if the question is even being asked, then it's likely a) the DB is "local" to the application (on the same subnet, or the same machine, or in memory), and b) the webservice is not. After all, the OP uses the phrases "external service" and "display on your own site." The phrase "parsing it once or however many times you need to each day" also suggests a set of data that doesn't exactly change every second.
The classic SOA myth is that the network is always available; going a step further, I'd say it's a myth that the network is always available with low latency. Unless your own internal systems are crap, sending an HTTP query across the Internet will always be slower than a query to a local DB or DB cluster. There are any number of reasons for this: number of hops to the remote server, outage or degradation issues that you can't control on the remote end, and the internal processing time for the remote web service application to analyze your request, hit its own persistence backend (aka DB), and return a result.
Fire up your app. Do some latency and response times to your DB. Now do the same to a remote web service. Unless your DB is also across the Internet, you'll notice a huge difference.
It's not at all hard for a competent technologist to scale a DB, or for you to completely remove the DB from caching using memcached and other paradigms; the latency between servers sitting near each other in the datacentre is monumentally less than between machines over the Internet (and more secure, to boot). Even if achieving this scale requires some thought, it's under your control, unlike a remote web service whose scaling and latency are totally opaque to you. I, for one, would not be too happy with the idea that the availability and responsiveness of my site are based on someone else entirely.
Finally, what happens if the remote web service is unavailable? Imagine a world where every request to your site involves a request over the Internet to some other site. What happens if that other site is unavailable? Do your users watch a spinning cursor of death for several hours? Do they enjoy an Error 500 while your site borks on this unexpected external dependency?
If you find yourself adopting an architecture whose fundamental features depend on a remote Internet call for every request, think very carefully about your application before deciding if you can live with the consequences.
Consuming the webservices is more efficient because there are a lot more things you can do to scale your webservices and webserver (via caching, etc.). By consuming the middle layer, you also have the options to change the returned data format (e.g. you can decide to use JSON rather than XML). Scaling database is much harder (involving replication, etc.) so in general, reduce hits on DB if you can.
There is not enough information to be able to say for sure in the general case. Why don't you do some tests and find out? Since it sounds like you are using python you will probably want to use the timeit module.
Some things that could effect the result:
Performance of the web service you are using
Reliability of the web service you are using
Distance between servers
Amount of data being returned
I would guess that if it is cacheable, that a cached version of the data will be faster, but that does not necessarily mean using a local RDBMS, it might mean something like memcached or an in memory cache in your application.
It depends - who is calling the web service? Is the web service called every time the user hits the page? If that's the case I'd recommend introducing a caching layer of some sort - many web service API's throttle the amount of hits you can make per hour.
Whether you choose to parse the cached XML on the fly or call the data from a database probably won't matter (unless we are talking enterprise scaling here). Personally, I'd much rather make a simple SQL call than write a DOM Parser (which is much more prone to exceptional scenarios).
It depends from case to case, you'll have to measure (or at least make an educated guess).
You'll have to consider several things.
Web service
it might hit database itself
it can be cached
it will introduce network latency and might be unreliable
or it could be in local network and faster than accessing even local disk
DB
might be slow since it needs to access disk (although databases have internal caches, but those are usually not targeted)
should be reliable
Technology itself doesn't mean much in terms of speed - in one case database parses SQL, in other XML parser parses XML, and database is usually acessed via socket as well, so you have both parsing and network in either case.
Caching data in your application if applicable is probably a good idea.
As a few people have said, it depends, and you should test it.
Often external services are slow, and caching them locally (in a database in memory, e.g., with memcached) is faster. But perhaps not.
Fortunately, it's cheap and easy to test.
Test definitely. As a rule of thumb, XML is good for communicating between apps, but once you have the data inside of your app, everything should go into a database table. This may not apply in all cases, but 95% of the time it has for me. Anytime I ever tried to store data any other way (ex. XML in a content management system) I ended up wishing I would have just used good old sprocs and sql server.
It sounds like you essentially want to cache results, and are wondering if it's worth it. But if so, I would NOT use a database (I assume you are thinking of a relational DB): RDBMSs are not good for caching; even though many use them. You don't need persistence nor ACID.
If choice was between Oracle/MySQL and external web service, I would start with just using service.
Instead, consider real caching systems; local or not (memcache, simple in-memory caches etc).
Or if you must use a DB, use key/value store, BDB works well. Store response message in its serialized form (XML), try to fetch from cache, if not, from service, parse. Or if there's a convenient and more compact serialization, store and fetch that.

Django Sessions

I'm looking at sessions in Django, and by default they are stored in the database. What are the benefits of filesystem and cache sessions and when should I use them?
The filesystem backend is only worth looking at if you're not going to use a database for any other part of your system. If you are using a database then the filesystem backend has nothing to recommend it.
The memcache backend is much quicker than the database backend, but you run the risk of a session being purged and some of your session data being lost.
If you're a really, really high traffic website and code carefully so you can cope with losing a session then use memcache. If you're not using a database use the file system cache, but the default database backend is the best, safest and simplest option in almost all cases.
I'm no Django expert, so this answer is about session stores generally. Downvote if I'm wrong.
Performance and Scalability
Choice of session store has an effect on performance and scalability. This should only be a big problem if you have a very popular application.
Both database and filesystem session stores are (usually) backed by disks so you can have a lot of sessions cheaply (because disks are cheap), but requests will often have to wait for the data to be read (because disks are slow). Memcached sessions use RAM, so will cost more to support the same number of concurrent sessions (because RAM is expensive), but may be faster (because RAM is fast).
Filesystem sessions are tied to the box where your application is running, so you can't load balance between multiple application servers if your site gets huge. Database and memcached sessions let you have multiple application servers talking to a shared session store.
Simplicity
Choice of session store will also impact how easy it is to deploy your site. Changing away from the default will cost some complexity. Memcached and RDBMSs both have their own complexities, but your application is probably going to be using an RDBMS anyway.
Unless you have a very popular application, simplicity should be the larger concern.
Bonus
Another approach is to store session data in cookies (all of it, not just an ID). This has the advantage that the session store automatically scales with the number of users, but it has disadvantages too. You (or your framework) need to be careful to stop users forging session data. You also need to keep each session small because the whole thing will be sent with every request.
As of Django 1.1 you can use the cached_db session back end.
This stores the session in the cache (only use with memcached), and writes it back to the DB. If it has fallen out of the cache, it will be read from the DB.
Although this is slower than just using memcached for storing the session, it adds persistence to the session.
For more information, see: Django Docs: Using Cached Sessions
One thing that has to be considered when choosing session backend is "how often session data is modified"? Even sites with moderate traffic will suffer if session data is modified on each request, making many database trips to store and retrieve data.
In my previous work we used memcache as session backend exclusively and it worked really well. Our administrative team put really great effort in making two special memcached instances stable as a rock, but after bit of twiddling with initial setup, we did not have any interrupts of session backends operations.
If the database have a DBA that isn't you, you may not be allowed to use a database-backed session (it being a front-end matter only). Until django supports easily merging data from several databases, so that you can have frontend-specific stuff like sessions and user-messages (the messages in django.contrib.auth are also stored in the db) in a separate db, you need to keep this in mind.

Categories