Using Google App Engine NDB, most aspects of memcache are handled automatically. However, an item does not become available in memcache until it has been read at least once: the item must first be read with get(), after which memcache stores it. A put() removes it from memcache.
However, I need something to be available in memcache immediately on put. I'm new to memcache, so I'm not entirely sure how everything works behind the scenes, but there are two ways I can do this:
Immediately after a put() of an entity, do a get(), just so that it becomes available in memcache.
Immediately after a put(), manually set the item in memcache. This would make sense, but I'm not sure if there are any gotchas with this approach. If I manually set something in memcache, will this interfere with the rest of NDB's automatic memcache handling?
Also, what key should I use when setting something in memcache manually so that upon a get, the automatic memcache handler knows what to look for?
I suspect you are referring to this:
Memcache does not support transactions. Thus, an update meant to be applied to both the Datastore and memcache might be made to only one of the two. To maintain consistency in such cases (possibly at the expense of performance), the updated entity is deleted from memcache and then written to the Datastore. A subsequent read operation will find the entity missing from memcache, retrieve it from the Datastore, and then update it in memcache as a side effect of the read. Also, NDB reads inside transactions ignore the Memcache.
So if you need something to be available on put then you'll have to cache it in memcache yourself.
Which brings us to 2)
If you manually set something in memcache, AFAIK it won't interact with NDB's automatic caching in any way. Also, AFAIK you can't set a manual memcache entry with a key that the automatic version will then be able to work with automatically.
You simply have to build a layer of memcache around your content that you explicitly control. Every time you do a put, you use a function that writes to the datastore and then to memcache, invalidating existing entries if required. Likewise for get: you try memcache first, then fall back to the datastore. Which sounds almost exactly like what NDB is already doing for you!
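A minimal sketch of such an explicit layer, assuming the App Engine Python memcache API; the helper names and the 60-second expiry are illustrative, not NDB APIs:

```python
# Explicit memcache layer around NDB; hypothetical helper names.
from google.appengine.api import memcache
from google.appengine.ext import ndb

def put_and_cache(entity, expiry=60):
    key = entity.put()                                # write to the Datastore first
    memcache.set(key.urlsafe(), entity, time=expiry)  # then cache it explicitly
    return key

def get_cached(key):
    entity = memcache.get(key.urlsafe())              # try our explicit cache first
    if entity is None:
        entity = key.get()                            # fall back to the Datastore
        if entity is not None:
            memcache.set(key.urlsafe(), entity)
    return entity
```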
Perhaps look at the Policy functions options for finer control:
https://developers.google.com/appengine/docs/python/ndb/cache#policy_functions
Don't forget, however, that the in-context cache might well be doing what you want already:
The in-context cache persists only for the duration of a single incoming HTTP request and is "visible" only to the code that handles that request. It's fast; this cache lives in memory. When an NDB function writes to the Datastore, it also writes to the in-context cache. When an NDB function reads an entity, it checks the in-context cache first. If the entity is found there, no Datastore interaction takes place.
Queries do not look up values in any cache. However, query results are written back to the in-context cache if the cache policy says so (but never to Memcache).
So if your put and subsequent get happen within the same request, the entity comes out of the in-context cache in any case.
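For instance, a minimal sketch of that same-request behaviour (Greeting is a made-up model):

```python
from google.appengine.ext import ndb

class Greeting(ndb.Model):
    text = ndb.StringProperty()

def handle_request():
    key = Greeting(text="hello").put()  # writes to the Datastore and the in-context cache
    greeting = key.get()                # served from the in-context cache;
                                        # no Datastore read within this request
```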
Maybe I'm overlooking something simple and obvious here, but here goes:
So one of the features of the Etag header in an HTTP request/response is to enforce concurrency, namely so that multiple clients cannot overwrite each other's edits of a resource (normally when doing a PUT request). I think that part is fairly well known.
The bit I'm not so sure about is how the backend/API implementation can actually implement this without having a race condition; for example:
Setup:
RESTful API sits on top of a standard relational database, using an ORM for all interactions (SQLAlchemy on top of Postgres, for example).
Etag is based on 'last updated time' of the resource
Web framework (Flask) sits behind a multi threaded/process webserver (nginx + gunicorn) so can process multiple requests concurrently.
The problem:
Client 1 and 2 both request a resource (get request), both now have the same Etag.
Both Client 1 and 2 send a PUT request to update the resource at the same time. The API receives the requests and proceeds to use the ORM to fetch the required information from the database, then compares the request Etag with the 'last updated time' from the database... they match, so each is a valid request. Each request continues on and commits the update to the database.
Each commit is a synchronous/blocking transaction, so one request will get in before the other, and thus one will overwrite the other's changes.
Doesn't this break the purpose of the Etag?
The only fool-proof solution I can think of is to also make the database perform the check, in the update query for example. Am I missing something?
P.S. Tagged as Python due to the frameworks used, but this should be a language/framework-agnostic problem.
This is really a question about how to use ORMs to do updates, not about ETags.
Imagine 2 processes transferring money into a bank account at the same time -- they both read the old balance, add some, then write the new balance. One of the transfers is lost.
When you're writing with a relational DB, the solution to these problems is to put the read + write in the same transaction, and then use SELECT FOR UPDATE to read the data and/or ensure you have an appropriate isolation level set.
The various ORM implementations all support transactions, so getting the read, check and write into the same transaction will be easy. If you set the SERIALIZABLE isolation level, then that will be enough to fix race conditions, but you may have to deal with deadlocks.
ORMs also generally support SELECT FOR UPDATE in some way. This will let you write safe code with the default READ COMMITTED isolation level. If you google SELECT FOR UPDATE and your ORM, it will probably tell you how to do it.
In both cases (serializable isolation level or select for update), the database will fix the problem by getting a lock on the row for the entity when you read it. If another request comes in and tries to read the entity before your transaction commits, it will be forced to wait.
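A rough sketch of that read-check-write pattern inside one transaction, using SQLAlchemy's with_for_update(); the Resource model, the timestamp-based etag scheme, and the exception are illustrative assumptions, not part of any specific framework:

```python
import datetime
from sqlalchemy import Column, DateTime, Integer, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class Resource(Base):                      # hypothetical model
    __tablename__ = "resources"
    id = Column(Integer, primary_key=True)
    data = Column(String)
    updated_at = Column(DateTime)

class PreconditionFailed(Exception):
    """Maps naturally onto HTTP 412 Precondition Failed."""

def update_resource(session, resource_id, client_etag, new_data):
    with session.begin():                  # read + check + write in ONE transaction
        row = (
            session.query(Resource)
            .filter(Resource.id == resource_id)
            .with_for_update()             # emits SELECT ... FOR UPDATE, locking the row
            .one()
        )
        if str(row.updated_at.timestamp()) != client_etag:
            raise PreconditionFailed()     # another writer committed first -> 412
        row.data = new_data
        row.updated_at = datetime.datetime.utcnow()
    # the transaction commits as the block exits, releasing the row lock
```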
Etag can be implemented in many ways, not just last updated time. If you choose to implement the Etag purely based on last updated time, then why not just use the Last-Modified header?
If you were to encode more information into the Etag about the underlying resource, you wouldn't be susceptible to the race condition that you've outlined above.
The only fool-proof solution I can think of is to also make the database perform the check, in the update query for example. Am I missing something?
That's your answer.
Another option would be to add a version to each of your resources which is incremented on each successful update. When updating a resource, specify both the ID and the version in the WHERE. Additionally, set version = version + 1. If the resource had been updated since the last request then the update would fail as no record would be found. This eliminates the need for locking.
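A sketch of that version-column approach with a plain DB-API connection (psycopg2-style %s placeholders); the resources table and column names are made up for illustration:

```python
class ConflictError(Exception):
    """Raised when the row changed since the client last read it."""

def update_with_version(conn, resource_id, expected_version, new_data):
    with conn.cursor() as cur:
        cur.execute(
            """
            UPDATE resources
               SET data = %s,
                   version = version + 1
             WHERE id = %s
               AND version = %s
            """,
            (new_data, resource_id, expected_version),
        )
        if cur.rowcount == 0:   # no matching row: someone updated (or deleted) it first
            conn.rollback()
            raise ConflictError()
    conn.commit()
```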
You are right that you can still get race conditions if the 'check last etag' and 'make the change' aren't in one atomic operation.
In essence, if your server itself has a race condition, sending etags to the client won't help with that.
You already mentioned a good way to achieve this atomicity:
The only fool-proof solution I can think of is to also make the database perform the check, in the update query for example.
You could do something else, like using a mutex lock. Or using an architecture where two threads cannot deal with the same data.
But the database check seems good to me. The ORM-level check you describe might be an addition for better error messages, but it is not by itself sufficient, as you found.
Basically what I want to do is see the raw data of memcache so that I can see how my data are being stored.
No, for largely the same reasons that memcached does not support enumerating or dumping the cache. In order to support such a feature safely, all other cache operations would have to block, which would be unacceptable in a shared environment.
For your purpose of occasionally examining some portion of the data in the cache, there is a reasonable alternative. Instrument your (and/or your colleagues') use of the memcache client to log which keys are frequently used, then periodically sample those keys' values.
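A sketch of such instrumentation, as a thin wrapper around whatever memcache client object you already have (the class name and logging scheme are just illustrative):

```python
import logging

class LoggingMemcache:
    """Delegates to a real memcache client, logging every key touched."""

    def __init__(self, base_client):
        self._client = base_client

    def get(self, key):
        logging.info("memcache get: %s", key)
        return self._client.get(key)

    def set(self, key, value, time=0):
        logging.info("memcache set: %s", key)
        return self._client.set(key, value, time)
```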
What's wrong with the memcache viewer in the admin console?
I seem to remember reading somewhere that Google App Engine automatically caches the results of very frequent queries in memory so that they are retrieved faster.
Is this correct?
If so, is there still a charge for datastore reads on these queries?
If you're using Python and the new ndb API, it DOES have automatic caching of entities, so if you fetch entities by key, they will be cached:
http://code.google.com/appengine/docs/python/ndb/cache.html
As the comments say, queries are not cached.
Cached requests don't hit the datastore, so you save on reads there.
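A tiny sketch of the distinction (Account is a made-up model):

```python
from google.appengine.ext import ndb

class Account(ndb.Model):
    email = ndb.StringProperty()

account = Account.get_by_id(42)   # fetch by key: cached by NDB automatically
accounts = Account.query(Account.email == "x@example.com").fetch()  # query: not cached
```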
If you're using Java, or the other APIs for accessing the datastore, then no, there's no caching.
I think that app engine does not cache anything for you. While it could be that, internally, it caches some things for a split second, I don't think you should rely on that.
I think you will be charged the normal number of read operations for every entity you read from every query.
No, it doesn't. However, depending on what framework you use to access the datastore, memcache may be used. Are you developing in Java or Python? On the Java side, Objectify will cache gets automatically but not queries. Keep in mind that there is a big difference in terms of performance and cacheability between gets and queries, in both Python and Java.
You are not charged for datastore reads for memcache hits.
I have an application that needs to interface with another app's database. I have read access but not write.
Currently I'm using SQL statements via pyodbc to grab the rows and using Python to manipulate the data. Since I don't cache anything, this can be quite costly.
I'm thinking of using an ORM to solve my problem. The question is: if I use an ORM like SQLAlchemy, would it be smart enough to pick up changes in the other database?
E.g., SQLAlchemy accesses a table and retrieves a row. If that row got modified outside of SQLAlchemy, would it be smart enough to pick it up?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Edit: To be more clear
I have one application that is simply a reporting tool; let's call it App A.
I have another application that handles various financial transactions, called App B.
A has access to B's database to retrieve the transactions and generates various reports. There are hundreds of thousands of transactions. We're currently caching this info manually in Python; if we need an updated report, we refresh the cache. If we get rid of the cache, the SQL queries combined with the calculations become unscalable.
I don't think an ORM is the solution to your performance problem. By default, ORMs tend to be less efficient than raw SQL because they might fetch data that you're not going to use (e.g. doing a SELECT * when you need only one field), although SQLAlchemy allows fine-grained control over the SQL generated.
Now to implement a caching mechanism, depending on your application, you could use a simple dictionary in memory or a specialized system such as memcached or Redis.
To keep your cached data relatively fresh, you can poll the source at regular intervals, which might be OK if your application can tolerate a little delay. Otherwise you'll need the application that has write access to the db to notify your application or your cache system when an update occurs.
Edit: since you seem to have control over app B, and you've already got a cache system in app A, the simplest way to solve your problem is probably to create a callback in app A that app B can call to expire cached items. Both apps need to agree on a convention to identify cached items.
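A minimal sketch of such a callback in app A, assuming a simple in-memory dict cache and a Flask endpoint that app B POSTs to; every name here is illustrative:

```python
from flask import Flask, request

app = Flask(__name__)
report_cache = {}   # cache key -> computed report data

@app.route("/cache/invalidate", methods=["POST"])
def invalidate():
    key = request.form["key"]      # both apps must agree on this key convention
    report_cache.pop(key, None)    # drop the stale entry if present
    return "", 204
```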
I'm looking at sessions in Django, and by default they are stored in the database. What are the benefits of filesystem and cache sessions and when should I use them?
The filesystem backend is only worth looking at if you're not going to use a database for any other part of your system. If you are using a database then the filesystem backend has nothing to recommend it.
The memcache backend is much quicker than the database backend, but you run the risk of a session being purged and some of your session data being lost.
If you're a really, really high-traffic website and you code carefully so you can cope with losing a session, then use memcache. If you're not using a database, use the file system cache, but the default database backend is the best, safest and simplest option in almost all cases.
I'm no Django expert, so this answer is about session stores generally. Downvote if I'm wrong.
Performance and Scalability
Choice of session store has an effect on performance and scalability. This should only be a big problem if you have a very popular application.
Both database and filesystem session stores are (usually) backed by disks so you can have a lot of sessions cheaply (because disks are cheap), but requests will often have to wait for the data to be read (because disks are slow). Memcached sessions use RAM, so will cost more to support the same number of concurrent sessions (because RAM is expensive), but may be faster (because RAM is fast).
Filesystem sessions are tied to the box where your application is running, so you can't load balance between multiple application servers if your site gets huge. Database and memcached sessions let you have multiple application servers talking to a shared session store.
Simplicity
Choice of session store will also impact how easy it is to deploy your site. Changing away from the default will cost some complexity. Memcached and RDBMSs both have their own complexities, but your application is probably going to be using an RDBMS anyway.
Unless you have a very popular application, simplicity should be the larger concern.
Bonus
Another approach is to store session data in cookies (all of it, not just an ID). This has the advantage that the session store automatically scales with the number of users, but it has disadvantages too. You (or your framework) need to be careful to stop users forging session data. You also need to keep each session small because the whole thing will be sent with every request.
As of Django 1.1 you can use the cached_db session back end.
This stores the session in the cache (only use with memcached), and writes it back to the DB. If it has fallen out of the cache, it will be read from the DB.
Although this is slower than just using memcached for storing the session, it adds persistence to the session.
For more information, see: Django Docs: Using Cached Sessions
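Enabling it is just a settings change; a sketch (the memcached address is a placeholder, and the exact cache settings vary by Django version):

```python
# settings.py
SESSION_ENGINE = "django.contrib.sessions.backends.cached_db"

CACHES = {
    "default": {
        "BACKEND": "django.core.cache.backends.memcached.MemcachedCache",
        "LOCATION": "127.0.0.1:11211",
    }
}
```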
One thing that has to be considered when choosing a session backend is "how often is session data modified?" Even sites with moderate traffic will suffer if session data is modified on each request, making many database trips to store and retrieve data.
At my previous job we used memcache as the session backend exclusively and it worked really well. Our administrative team put great effort into making two dedicated memcached instances rock solid, and after a bit of twiddling with the initial setup we did not have any interruptions of session backend operations.
If the database has a DBA who isn't you, you may not be allowed to use a database-backed session (it being a front-end matter only). Until Django easily supports merging data from several databases, so that you can keep frontend-specific stuff like sessions and user messages (the messages in django.contrib.auth are also stored in the db) in a separate db, you need to keep this in mind.