Basically what I want to do is see the raw data of memcache so that I can see how my data are being stored.
No, for largely the same reasons that memcached does not support enumerating or dumping the cache. In order to support such a feature safely, all other cache operations would have to block, which would be unacceptable in a shared environment.
For your purpose of occasionally examining some portion of the data in the cache, there is a reasonable alternative. Instrument your (and/or your colleagues') use of the memcache client to log which keys are frequently used, then periodically sample those keys' values.
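A minimal sketch of that instrumentation, assuming a client object that exposes `get` and `set`. The `KeyLoggingClient` wrapper and the `DictClient` stand-in are hypothetical names for illustration, not part of any memcache library:

```python
from collections import Counter


class KeyLoggingClient:
    """Hypothetical wrapper that counts how often each cache key is used."""

    def __init__(self, client):
        self._client = client        # any object with get(key) and set(key, value)
        self.key_counts = Counter()  # key -> number of get/set calls seen

    def get(self, key):
        self.key_counts[key] += 1
        return self._client.get(key)

    def set(self, key, value):
        self.key_counts[key] += 1
        return self._client.set(key, value)

    def sample_hot_keys(self, n=10):
        """Return {key: current value} for the n most frequently used keys."""
        # Reads go straight to the wrapped client so sampling does not
        # inflate the usage counts.
        return {key: self._client.get(key)
                for key, _ in self.key_counts.most_common(n)}


class DictClient:
    """In-memory stand-in for a real memcache client, for demonstration only."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def set(self, key, value):
        self._data[key] = value
        return True
```

Calling `sample_hot_keys()` from a periodic job would give a view of the values behind the busiest keys without enumerating the whole cache.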
What's wrong with the memcache viewer in the admin console?
I've been using PostgreSQL for the longest time. All of my data lives inside Postgres. I've recently looked into Redis, and it has a lot of powerful features that would otherwise take a couple of lines in Django (Python) to do. Redis data is persistent as long as the machine it's running on doesn't go down, and you can configure it to write the data it's storing out to disk every 1000 keys, or every 5 minutes or so, depending on your choice.
Redis would make a great cache and it would certainly replace a lot of functions I have written in Python (upvoting a user's post, viewing their friends list, etc.). But my concern is that all of this data would somehow need to be translated over to Postgres. I don't trust storing this data in Redis. I see Redis as a temporary storage solution for quick retrieval of information. It's extremely fast, and this far outweighs doing repetitive queries against Postgres.
I'm assuming the only way I could technically write the redis data to the database is to save() whatever I get from the 'get' query from redis to the postgres database through Django.
That's the only solution I could think of. Do you know of any other solutions to this problem?
Redis is increasingly used as a caching layer, much like a more sophisticated memcached, and is very useful in this role. You usually use Redis as a write-through cache for data you want to be durable, and write-back for data you might want to accumulate then batch write (where you can afford to lose recent data).
PostgreSQL's LISTEN and NOTIFY system is very useful for doing selective cache invalidation, letting you purge records from Redis when they're updated in PostgreSQL.
For combining it with PostgreSQL, you will find the Redis foreign data wrapper provider that Andrew Dunstan and Dave Page are working on very interesting.
I'm not aware of any tool that makes Redis into a transparent write-back cache for PostgreSQL. Their data models are probably too different for this to work well. Usually you write changes to PostgreSQL and invalidate their Redis cache entries using listen/notify to a cache manager worker, or you queue changes in Redis then have your app read them out and write them into Pg in chunks.
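The "queue changes, then write them out in chunks" approach can be sketched roughly as follows. This is an illustration under assumptions, not a definitive implementation: a `deque` stands in for a Redis list of pending changes, and `write_batch` is a hypothetical callable that would perform one bulk write to PostgreSQL.

```python
from collections import deque


def flush_in_chunks(queue, write_batch, chunk_size=100):
    """Drain queued changes and hand them to write_batch in chunks.

    queue       -- a deque standing in for a Redis list of pending changes
    write_batch -- hypothetical callable that writes one list of rows to PostgreSQL
    Returns the number of changes handed to write_batch.
    """
    written = 0
    while queue:
        # Take up to chunk_size items off the front of the queue.
        chunk = [queue.popleft() for _ in range(min(chunk_size, len(queue)))]
        write_batch(chunk)
        written += len(chunk)
    return written
```

Note that a chunk is removed from the queue before it is written, so a failed write loses that chunk. That matches the write-back trade-off described above, where losing recent data is acceptable.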
Redis is persistent if configured to be so, both through snapshots and a kind of WAL called AOF. Loads of people use it as a primary datastore.
https://redis.io/topics/persistence
If one is referring to the wider world of Redis-compatible (RESP protocol) datastores, many are not limited to in-memory storage:
https://keydb.dev/
http://ssdb.io/
and many more...
Using Google App Engine NDB, most aspects of memcache are handled automatically. However, an item does not become available in Memcache until it is read at least once. So first the item must be read using get, and then memcache stores it. Put() removes it from memcache.
However, I need something to be available in memcache immediately on put. I'm new to memcache, so I'm not entirely sure how everything works behind the scenes, but there are two ways I can do this:
1) Immediately after a put() of an entity, do a get(), just so that it becomes available in memcache.
2) Immediately after a put(), manually set the item in memcache. This would make sense, but I'm not sure if there are any gotchas with this approach. If I manually set something in memcache, will this interfere with the rest of NDB's automatic memcache handling?
Also, what key should I use when setting something in memcache manually so that upon a get, the automatic memcache handler knows what to look for?
I suspect you are referring to this:
Memcache does not support transactions. Thus, an update meant to be applied to both the Datastore and memcache might be made to only one of the two. To maintain consistency in such cases (possibly at the expense of performance), the updated entity is deleted from memcache and then written to the Datastore. A subsequent read operation will find the entity missing from memcache, retrieve it from the Datastore, and then update it in memcache as a side effect of the read. Also, NDB reads inside transactions ignore the Memcache.
So if you need something to be available on put then you'll have to cache it in memcache yourself.
Which brings us to 2)
If you manually set something in memcache AFAIK it won't interact with NDB's automatic caching in any way. Also AFAIK you can't set a manual memcache entry with a key that the automatic version will then be able to automatically work with.
You simply have to build a layer of memcache around your content that you explicitly control. Every time you do a put, you use a function that puts to the datastore and then into memcache, invalidating existing entries if required. Likewise for get, you try memcache first and then fall back to the datastore. Which sounds almost exactly like what NDB is already doing for you!
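A minimal sketch of such a layer, assuming dict-like stand-ins for memcache and the datastore. `CachedStore` is a hypothetical name; a real version would call the memcache and NDB APIs instead of indexing dictionaries:

```python
class CachedStore:
    """Hypothetical explicit cache layer.

    cache and store are dict-like stand-ins for memcache and the datastore.
    """

    def __init__(self, cache, store):
        self.cache = cache
        self.store = store

    def put(self, key, value):
        self.store[key] = value  # write the durable copy first
        self.cache[key] = value  # then make it available in the cache at once

    def get(self, key):
        if key in self.cache:    # cache hit
            return self.cache[key]
        value = self.store.get(key)  # miss: fall back to the store
        if value is not None:
            self.cache[key] = value  # repopulate the cache for later reads
        return value
```

Because the two writes in `put` are not transactional, a failure between them can leave the cache and the store out of step, which is the consistency problem the quoted NDB documentation describes.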
Perhaps look at the Policy functions options for finer control:
https://developers.google.com/appengine/docs/python/ndb/cache#policy_functions
Don't forget however that the in context cache might well be doing what you want already:
The in-context cache persists only for the duration of a single incoming HTTP request and is "visible" only to the code that handles that request. It's fast; this cache lives in memory. When an NDB function writes to the Datastore, it also writes to the in-context cache. When an NDB function reads an entity, it checks the in-context cache first. If the entity is found there, no Datastore interaction takes place.
Queries do not look up values in any cache. However, query results are written back to the in-context cache if the cache policy says so (but never to Memcache).
So if your put and subsequent get is happening in the same request it's coming out of the in-context cache in any case.
I seem to remember reading somewhere that google app engine automatically caches the results of very frequent queries into memory so that they are retrieved faster.
Is this correct?
If so, is there still a charge for datastore reads on these queries?
If you're using Python and the new ndb API, it DOES have automatic caching of entities, so if you fetch entities by key, it would be cached:
http://code.google.com/appengine/docs/python/ndb/cache.html
As the comments say, queries are not cached.
Cached requests don't hit the datastore, so you save on reads there.
If you're using Java, or the other APIs for accessing the datastore, then no, there's no caching.
Edit: fixed my mistake about queries getting cached.
I think that app engine does not cache anything for you. While it could be that, internally, it caches some things for a split second, I don't think you should rely on that.
I think you will be charged the normal number of read operations for every entity you read from every query.
No, it doesn't. However, depending on what framework you use for access to the datastore, memcache will be used. Are you developing in Java or Python? On the Java side, Objectify will cache GETs automatically but not queries. Keep in mind that there is a big difference in terms of performance and cacheability between gets and queries in both Python and Java.
You are not charged for datastore reads for memcache hits.
I'm just wondering if Django was designed to be a fully stateless framework?
It seems to encourage statelessness and external storage mechanisms (databases and caches), but I'm wondering if it is possible to store some things in the server's memory while my app is in development and runs via manage.py runserver.
Sure it's possible. But if you are writing a web application you probably won't want to do that because of threading issues.
That depends on what you mean by "store things in the server's memory." It also depends on the type of data. If you can, you're better off storing "global data" in a database or in the file system somewhere. Unless it is needed every request it doesn't really make sense to store it in the Django instance itself. You'll need to implement some form of locking to prevent race conditions, but you'd need to worry about race conditions if you stored everything on the server object anyway.
Of course, if you're talking about user-by-user data, Django does support sessions. Or, and this is another perfectly good option if you're willing to make the user save the data, cookies.
The best way to maintain state in a Django app on a per-user basis is request.session (see Django sessions), which is a dictionary you can use to remember things about the current user.
For application-wide state you should use a persistent datastore (database or key/value store).
Example view for sessions:
    def my_view(request):
        # default to 0 so that the first visit is counted as 1
        pages_viewed = request.session.get('pages_viewed', 0) + 1
        request.session['pages_viewed'] = pages_viewed
        ...
If you wanted to maintain local variables on a per-app-instance basis, you can just define module-level variables like so:
    # number of times my_view has been served by this server
    # instance since the last restart
    served_since_restart = 0

    def my_view(request):
        # without this declaration the += below raises UnboundLocalError
        global served_since_restart
        served_since_restart += 1
        ...
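Since a server may handle requests on several threads, a module-level counter is safer behind a lock. A minimal sketch, where `record_request` and `_served_lock` are hypothetical names introduced for illustration:

```python
import threading

_served_lock = threading.Lock()
served_since_restart = 0


def record_request():
    """Increment the per-process counter under a lock and return the new value."""
    global served_since_restart
    with _served_lock:
        served_since_restart += 1
        return served_since_restart
```

A view would call `record_request()` once per request. The count is still local to one process and is lost on restart.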
If you wanted to maintain some server state across ALL app servers (like total number of pages viewed EVER) you should probably use a persistent key/value store like redis, memcachedb, or riak. There is a decent comparison of all these options here: http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis
You can do it with redis (via redis-py) like so, assuming your redis server is at "127.0.0.1" (localhost) on port 6379 (the default):
    import redis

    def my_view(request):
        r = redis.Redis(host='127.0.0.1', port=6379)
        # INCR is atomic and treats a missing key as 0
        served = r.incr('pages_served_all_time')
        ...
There is a LocMemCache cache backend that stores data in-process. You can use it with sessions, but with great care: this cache is not cross-process, so you will have to deploy with a single process, because otherwise there is no guarantee that subsequent requests will be handled by the same process. Global variables may also work (use threadlocals if they shouldn't be shared between all of the process's threads; the warning about cross-process communication also applies here).
By the way, what's wrong with external storage? External storage provides easy cross-process data sharing and other features (like memory-limiting algorithms for caches, or persistence with databases).
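For reference, a settings fragment that selects the in-process cache and stores sessions in it might look like this. It is a sketch that assumes a Django version using the `CACHES` dictionary setting; the `LOCATION` value is an arbitrary name chosen for this example:

```python
# settings.py (fragment)
CACHES = {
    "default": {
        "BACKEND": "django.core.cache.backends.locmem.LocMemCache",
        "LOCATION": "dev-only-cache",  # arbitrary name, an assumption for this example
    }
}

# Store sessions in the cache configured above (single-process deployments only).
SESSION_ENGINE = "django.contrib.sessions.backends.cache"
```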
I'm looking at sessions in Django, and by default they are stored in the database. What are the benefits of filesystem and cache sessions and when should I use them?
The filesystem backend is only worth looking at if you're not going to use a database for any other part of your system. If you are using a database then the filesystem backend has nothing to recommend it.
The memcache backend is much quicker than the database backend, but you run the risk of a session being purged and some of your session data being lost.
If you're a really, really high traffic website and code carefully so you can cope with losing a session then use memcache. If you're not using a database use the file system cache, but the default database backend is the best, safest and simplest option in almost all cases.
I'm no Django expert, so this answer is about session stores generally. Downvote if I'm wrong.
Performance and Scalability
Choice of session store has an effect on performance and scalability. This should only be a big problem if you have a very popular application.
Both database and filesystem session stores are (usually) backed by disks so you can have a lot of sessions cheaply (because disks are cheap), but requests will often have to wait for the data to be read (because disks are slow). Memcached sessions use RAM, so will cost more to support the same number of concurrent sessions (because RAM is expensive), but may be faster (because RAM is fast).
Filesystem sessions are tied to the box where your application is running, so you can't load balance between multiple application servers if your site gets huge. Database and memcached sessions let you have multiple application servers talking to a shared session store.
Simplicity
Choice of session store will also impact how easy it is to deploy your site. Changing away from the default will cost some complexity. Memcached and RDBMSs both have their own complexities, but your application is probably going to be using an RDBMS anyway.
Unless you have a very popular application, simplicity should be the larger concern.
Bonus
Another approach is to store session data in cookies (all of it, not just an ID). This has the advantage that the session store automatically scales with the number of users, but it has disadvantages too. You (or your framework) need to be careful to stop users forging session data. You also need to keep each session small because the whole thing will be sent with every request.
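One common way to stop users forging cookie data is to attach a keyed signature and verify it on every request. The following is a minimal sketch using only the Python standard library, not any framework's actual implementation; `SECRET_KEY`, `sign`, and `unsign` are hypothetical names:

```python
import hashlib
import hmac

SECRET_KEY = b"change-me"  # hypothetical secret; keep the real one out of source control


def sign(value: str) -> str:
    """Return 'value:signature' so tampering can be detected later."""
    mac = hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()
    return f"{value}:{mac}"


def unsign(signed: str):
    """Return the original value, or None if the signature does not match."""
    value, _, mac = signed.rpartition(":")
    expected = hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()
    # compare_digest avoids leaking how many characters matched via timing
    if hmac.compare_digest(mac.encode(), expected.encode()):
        return value
    return None
```

Signing only proves the data was not altered. The user can still read it, so nothing secret should go into such a cookie.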
As of Django 1.1 you can use the cached_db session back end.
This stores the session in the cache (only use with memcached), and writes it back to the DB. If it has fallen out of the cache, it will be read from the DB.
Although this is slower than just using memcached for storing the session, it adds persistence to the session.
For more information, see: Django Docs: Using Cached Sessions
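Selecting this backend is a one-line settings change. The fragment below is a sketch and assumes a cache backend is already configured elsewhere in the settings:

```python
# settings.py (fragment)
# Sessions are read from the cache when possible and always written to the database.
SESSION_ENGINE = "django.contrib.sessions.backends.cached_db"
```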
One thing that has to be considered when choosing session backend is "how often session data is modified"? Even sites with moderate traffic will suffer if session data is modified on each request, making many database trips to store and retrieve data.
At my previous job we used memcache as the session backend exclusively, and it worked really well. Our administrative team put a great deal of effort into making two dedicated memcached instances rock solid, and after a bit of tweaking of the initial setup, we did not have any interruptions to session backend operations.
If the database has a DBA that isn't you, you may not be allowed to use a database-backed session (it being a front-end matter only). Until Django supports easily merging data from several databases, so that you can keep frontend-specific stuff like sessions and user messages (the messages in django.contrib.auth are also stored in the db) in a separate db, you need to keep this in mind.