I have been using the datastore with ndb for a multiplayer app. This appears to be using a lot of reads/writes and will undoubtedly go over quota and cost a substantial amount.
I was thinking of changing all the game data to be stored only in memcache. I understand that data stored here can be lost at any time, but as the data will only be needed for, at most, 10 minutes and as it's just a game, that wouldn't be too bad.
Am I right to move to solely use memcache, or is there a better method, and is memcache essentially 'free' short term data storage?
Yes, memcache is free and you can use it as free "data storage". Just keep in mind that it can be purged at any time (more likely when heavily used) and that it is not always available. To check for memcache availability, use the Capabilities API.
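A minimal availability check with the Capabilities API on the legacy Python runtime might look roughly like this (treat it as a sketch rather than the exact incantation for your runtime):

```python
from google.appengine.api.capabilities import CapabilitySet

def memcache_available():
    # True while the memcache service is enabled; skip caching (or fall
    # back to the datastore) when it is not.
    return CapabilitySet('memcache').is_enabled()
```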
As a commenter on another answer noted, there are now two memcache offerings: shared and dedicated. Shared is the original service, and is still free. Dedicated is in preview and presently costs $0.12 per GB-hour.
Dedicated memcache allows you to have a certain amount of space set aside. However, it's important to understand that you can still experience partial or complete flushes at any time with dedicated memcache, due to things like machine reboots. Because of this, it's not a suitable replacement for the datastore.
However, it is true that you can greatly reduce your datastore usage with judicious use of memcache. Using it as a write-through cache, for example, can substantially cut your datastore reads (albeit not the writes).
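As a sketch of the write-through pattern, assuming a hypothetical `Score` entity and a 10-minute lifetime (the names are illustrative, not from your app):

```python
from google.appengine.api import memcache
from google.appengine.ext import ndb

class Score(ndb.Model):              # hypothetical game entity
    points = ndb.IntegerProperty()

def save_score(player_id, points):
    # Write-through: update the datastore and memcache together,
    # so subsequent reads are served from the cache.
    Score(id=player_id, points=points).put()
    memcache.set('score:%s' % player_id, points, time=600)

def get_score(player_id):
    points = memcache.get('score:%s' % player_id)
    if points is None:                           # cache miss: fall back to the datastore
        entity = Score.get_by_id(player_id)
        if entity is not None:
            points = entity.points
            memcache.set('score:%s' % player_id, points, time=600)
    return points
```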
Hope this helps.
Related
I have a Flask application that allows users to query a ~small database (2.4M rows) using SQL. It's similar to a HackerRank but more limited in scope. It's deployed on Heroku.
I've noticed during testing that I can predictably hit an R14 error (memory quota exceeded) or R15 (memory quota greatly exceeded) by running large queries. The queries that typically cause this are outside what a normal user might do, such as SELECT * FROM some_huge_table. That said, I am concerned that these errors will become a regular occurrence for even small queries when 5, 10, 100 users are querying at the same time.
I'm looking for some advice on how to manage memory quotas for this type of interactive site. Here's what I've explored so far:
Changing the # of gunicorn workers. This has had some effect but I still hit R14 and R15 errors consistently.
Forced limits on user queries, based on either text or the EXPLAIN output. This does work to reduce memory usage, but I'm afraid it won't scale to even a very modest # of users.
Moving to a higher Heroku tier. The plan I use currently provides ~512MB RAM. The largest plan is around 14GB. Again, this would help, but it won't scale even moderately well, to say nothing of the associated costs.
Reducing the size of the database significantly. I would like to avoid this if possible. Doing the napkin math on a table going from 1.9M rows down to 10k or 50k, the application would have greatly reduced memory needs and would scale better, but it would still have some moderate maximum usage limit.
As you can see, I'm a novice at best when it comes to memory management. I'm looking for some strategies/ideas on how to solve this general problem, and if it's the case that I need to either drastically cut the data size or throw tons of $ at this, that's OK too.
Thanks
Coming from my personal experience, I see two approaches:
1. plan for it
Based on your example, this means you try to calculate the maximum memory a request would use, multiply it by the number of gunicorn workers, and use dynos big enough for that.
For some applications this could be a valid approach; I don't think it is for yours.
2. reduce memory usage, solution 1
The fact that too much application memory is used makes me think that your code is likely loading the whole result set into memory (probably even multiple times, in multiple formats) before returning it to the client.
In the end, your application is only getting the data from the database and converting it to some output format (JSON/CSV?).
What you are probably searching for is streaming responses.
Your Flask view will then work on a record-by-record basis: it reads a single record, converts it to your output format, and yields it to the client.
Both your database client library and Flask support this (on most databases it is exposed as server-side cursors / iterators).
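A rough sketch with Flask and psycopg2, streaming the result as CSV through a server-side (named) cursor; the connection string and route are placeholders, and it assumes the query has already been validated and limited:

```python
import csv
import io

import psycopg2
from flask import Flask, Response, request, stream_with_context

app = Flask(__name__)

@app.route('/query')
def run_query():
    sql = request.args['sql']                # assumed already validated/limited
    conn = psycopg2.connect('dbname=app')    # placeholder connection string

    def generate():
        try:
            # A named cursor is a server-side cursor: rows are fetched in
            # batches instead of loading the whole result set into the worker.
            with conn.cursor(name='stream') as cur:
                cur.itersize = 1000
                cur.execute(sql)
                for row in cur:
                    buf = io.StringIO()
                    csv.writer(buf).writerow(row)
                    yield buf.getvalue()
        finally:
            conn.close()

    return Response(stream_with_context(generate()), mimetype='text/csv')
```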
2. reduce memory usage, solution 2
Other services often go for simple pagination or limiting result sets to manage server-side memory.
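A naive illustration of server-side limiting (Postgres-flavoured, and very much a sketch rather than a safe query rewriter):

```python
def limit_query(sql, max_rows=1000):
    # Wrap the user's statement in a subquery with a hard LIMIT, so even
    # "SELECT * FROM some_huge_table" returns at most max_rows rows.
    return 'SELECT * FROM ({}) AS user_query LIMIT {}'.format(sql.rstrip('; \n'), max_rows)
```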
security sidenote
It sounds like the users can actually define the SQL statement in their API requests. This is a security and application risk: apart from issuing INSERT, UPDATE, or DELETE statements, a user could craft a SQL statement that will not only blow up your application memory but also break your database.
I am building a website with the expectation that hundreds (I wish thousands!) of 'get' queries per day will be cached for a couple of months in the filesystem.
Reading the cache documentation, however, I observe that the default values lean towards a small and fast cache cycle.
An old post describes how a strategy like the one I imagine wreaked havoc on their servers.
Of course, the current Django code seems to have evolved since 2012. However, the cache defaults still remain the same...
I wonder whether I am on the right track or not.
My familiarity with caching is limited to enjoying the results of W3 Total Cache after it saved thousands of files in the relevant directories, without understanding anything beyond its basic settings.
How would an experienced developer approach "stage 1" of this task:
Without the budget (yet) to support solutions based on Redis, for example (not a valid argument, I know).
How would you cache a steadily growing number of queries (which could become quite bulky) for a long period of time, on rather basic server resources?
Django's cache backend should be implementation agnostic. For example, whether you start with the filesystem cache, a Redis cache, or memcache shouldn't really matter to Django.
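For illustration, switching backends is mostly a settings change; the paths and addresses below are placeholders, and the exact backend class depends on your Django version:

```python
# settings.py -- application code using the cache API stays the same.
CACHES = {
    'default': {
        # Start with the filesystem backend...
        'BACKEND': 'django.core.cache.backends.filebased.FileBasedCache',
        'LOCATION': '/var/tmp/django_cache',   # placeholder directory
        'TIMEOUT': 60 * 60 * 24 * 60,          # roughly two months, in seconds
        # ...and later swap in a shared cache without touching view code:
        # 'BACKEND': 'django.core.cache.backends.memcached.PyMemcacheCache',
        # 'LOCATION': '127.0.0.1:11211',
    }
}
```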
I can think of a couple issues with your approach:
how fast is your dataset growing? If you have a pretty stable-sized dataset, it shouldn't matter if the cache entries are long-lived.
how will you invalidate your queries? If the queries are being cached for months, that suggests the data does not change; cache invalidation is a big thing to consider, since clients shouldn't see stale data.
are you using the filesystem cache? If data is being cached per server, are requests consistently routed to the same servers? If not, multiple servers can end up with duplicate caches; avoiding that is one of the benefits of a centralized cache (redis/memcache).
you should be able to make a pretty good estimate of how large a cache you'd need based on your current dataset size, how much data you'd like to cache, and the rate of growth of your data. I feel like a shared cache will go very far, and it can be run on "basic server" resources.
For stage 1, I would:
choose a shared cache, either redis or memcached; this will be a lot less painful when you start to scale to multi-server setups
estimate how much data you will need to cache, and what sort of data size growth you predict, to make sure your cache is of an appropriate size.
keep in mind that cache invalidation is usually not a set policy on how long the data should persist in the cache; it is governed by when your data changes, which should force-invalidate the cache so that clients don't see stale data (see the sketch below)
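As a sketch of that event-driven invalidation (the `Product` model and key names are hypothetical):

```python
from django.core.cache import cache
from django.db.models.signals import post_delete, post_save
from django.dispatch import receiver

from myapp.models import Product            # placeholder model


@receiver([post_save, post_delete], sender=Product)
def invalidate_product_cache(sender, instance, **kwargs):
    # Drop the keys that depend on this object; the next request rebuilds
    # them, so clients never read months-old data.
    cache.delete('product:%s' % instance.pk)
    cache.delete('product_list')
```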
We have a small amount of data that is almost never updated but is read frequently (site config and some selection items like states and counties). I think that if I move it into application memory instead of any database, our I/O performance would improve significantly.
But we have a lot of web servers, and I cannot figure out a good way to notify all the servers to reload this data.
You are likely looking for a cache pattern: Is there a Python caching library? You just need to ask how stale you can afford to be. If you are looking this up on every request, even a short-lived cache can massively improve performance. It's likely, though, that this information can live for minutes or hours without too much risk of being "stale".
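One way to express that trade-off is a small in-process TTL cache; this sketch uses the third-party cachetools package and a placeholder loader function:

```python
from cachetools import TTLCache, cached

# Each web server keeps its own copy for up to 10 minutes; that is the
# maximum staleness you are accepting.
_config_cache = TTLCache(maxsize=1, ttl=600)

@cached(_config_cache)
def get_site_config():
    return load_config_from_db()

def load_config_from_db():
    # Placeholder for the real database query.
    return {'site_name': 'example', 'items_per_page': 20}
```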
If you can't live with a stale cache, I've implemented solutions that have a single database call, which keeps track of the last updated date for any of the cached data. This at least reduces the cache lookups to a single database call.
Be aware though, as soon as you are sharing updateable information, you have to deal with multi-threaded updates of shared state. Make sure you understand the implications of this. Hopefully your caching library handles this gracefully.
Our situation is as follows:
We are working on a school project where the intention is that multiple teams walk around a city with smartphones and play a city game while walking.
As such, we can have 10 active smartphones walking around the city, all posting their location and requesting data from Google App Engine.
Someone is behind a web browser, watching all these teams walk around and sending them messages, etc.
We are using the datastore Google App Engine provides to store all the data these teams send and request, to store the messages and retrieve them, etc.
However, we soon found out we were at our max limit of reads and writes, so we searched for a way to retrieve periodic updates (which cost the most reads and writes) without using any of the limited resources Google provides. And obviously, because it's a school project, we don't want to pay for more reads and writes.
Storing this information in global variables seemed an easy and quick solution, which it was... but when we started to truly test, we noticed some of our data was missing and then reappearing. This turned out to be because there were so many requests going to the cloud that a new instance was started, and global variables are not kept consistent across instances.
So our question is:
Can we somehow make sure these global variables are always the same on every running instance of Google App Engine?
OR
Can we limit the number of instances ever running to 1, no matter how many requests are made?
OR
Is there perhaps a better way to store this data, without using the datastore and without using globals?
You should be using memcache. If you use the ndb (new database) library, you can automatically cache the results of queries. Obviously this won't improve your writes much, but it should significantly improve the numbers of reads you can do.
You need to back it with the datastore as data can be ejected from memcache at any time. If you're willing to take the (small) chance of losing updates you could just use memcache. You could do something like store just a message ID in the datastore and have the controller periodically verify that every message ID has a corresponding entry in memcache. If one is missing the controller would need to reenter it.
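A rough sketch of that verification pass, with hypothetical entity and helper names (how you rebuild a lost payload is up to your app):

```python
from google.appengine.api import memcache
from google.appengine.ext import ndb

class MessageRef(ndb.Model):
    """Hypothetical entity: only the message ID lives in the datastore."""
    pass

def verify_messages():
    # Run periodically (e.g. from a cron handler): re-enter any message
    # whose payload was evicted from memcache.
    for key in MessageRef.query().iter(keys_only=True):
        cache_key = 'msg:%s' % key.id()
        if memcache.get(cache_key) is None:
            memcache.set(cache_key, rebuild_message(key.id()), time=600)

def rebuild_message(msg_id):
    # Placeholder: reconstruct the message or ask the client to resend it.
    return {'id': msg_id}
```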
Interesting question. Some bad news first: I don't think there's a better way of storing data; no, you won't be able to stop new instances from spawning; and no, you cannot make separate instances always have the same data.
What you could do is have the instances periodically sync themselves with a master record in the datastore. By choosing the frequency of this intelligently and downloading/uploading the information in one lump, you could limit the number of reads/writes to a level that works for you. This is firmly in kludge territory, though.
Despite finding the quota for just about everything else, I can't find the limits for free read/write so it is possible that they're ludicrously small but the fact that you're hitting them with a mere 10 smartphones raises a red flag to me. Are you certain that the smartphones are being polled (or calling in) at a sensible frequency? It sounds like you might be hammering them unnecessarily.
Consider the Jabber (XMPP) protocol for communication between peers. The free quota for it is quite generous.
First, definitely use memcache as Tim Delaney said. That alone will probably solve your problem.
If not, you should consider a push model. The advantage is that your clients won't be asking you for new data all the time, only when something has actually changed. If the update is small enough that you can deliver it in the push message, you won't need to worry about datastore reads on memcache misses, or any other duplicate work, for all those clients: you read the data once when it changes and push it out to everyone.
The first option for push is C2DM (Android) or APN (iOS). These are limited in the amount of data they can send and the frequency of updates.
If you want to get fancier you could use XMPP instead. This would let you do more frequent updates with (I believe) bigger payloads but might require more engineering. For a starting point, see Stack Overflow questions about Android and iOS.
Have fun!
I have a rather small (ca. 4.5k pageviews a day) website running on Django, with PostgreSQL 8.3 as the db.
I am using the database as both the cache and the session backend. I've heard a lot of good things about using Memcached for this purpose, and I would definitely like to give it a try. However, I would like to know exactly what the benefits of such a change would be: I imagine that my site may just not be big enough for the better cache backend to make a difference. The point is: it wouldn't be me installing and configuring memcached, and I don't want to waste somebody's time for nothing or very little.
How can I measure the overhead introduced by using the db as the cache backend? I've looked at django-debug-toolbar, but if I understand correctly it isn't something you'd like to put on a production site (you have to set DEBUG=True for it to work). Unfortunately, I cannot quite reproduce the production setting on my laptop (I have a different OS, CPU and a lot more RAM).
Has anyone benchmarked different Django cache/session backends? Does anybody know what would be the performance difference if I was doing, for example, one session-write on every request?
At my previous job we tried to measure the impact of caching on a site we were developing. On the same machine, we load-tested the set of 10 pages most commonly used as start pages (object listings), plus some object detail pages taken randomly from a pool of ~200,000. The difference was something like going from 150 requests/second to 30,000 requests/second, and the database queries dropped to 1-2 per page.
What was cached:
sessions
lists of objects retrieved for each individual page in object listing
secondary objects and common content (found on each page)
lists of object categories and other categorising properties
object counters (calculated offline by cron job)
individual objects
In general, we used only low-level granular caching, not the high-level cache framework. It required very careful design (cache had to be properly invalidated upon each database state change, like adding or modifying any object).
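A condensed sketch of that low-level pattern (model and key names are made up): cache the list for a listing page, and delete the key whenever an object in it changes.

```python
from django.core.cache import cache

from myapp.models import Article              # placeholder model

def get_listing(category_id):
    key = 'listing:%s' % category_id
    objects = cache.get(key)
    if objects is None:
        objects = list(Article.objects.filter(category_id=category_id)[:50])
        cache.set(key, objects, 60 * 15)       # 15 minutes
    return objects

def on_article_changed(article):
    # Call from save()/delete() or a signal so readers never see stale lists.
    cache.delete('listing:%s' % article.category_id)
```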
The DiskCache project publishes Django cache benchmarks comparing local memory, Memcached, Redis, file based, and diskcache.DjangoCache. An added benefit of DiskCache is that no separate process is necessary (unlike Memcached and Redis). Instead cache keys and small values are memory-mapped into the Django process memory. Retrieving values from the cache is generally faster than Memcached on localhost. A number of settings control how much data is kept in memory; the rest being paged out to disk.
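The configuration is a drop-in Django cache backend, along these lines (the path and size are illustrative; check the DiskCache docs for the current option names):

```python
# settings.py -- no memcached/redis process to run; data lives in a local directory.
CACHES = {
    'default': {
        'BACKEND': 'diskcache.DjangoCache',
        'LOCATION': '/var/tmp/django-diskcache',   # placeholder directory
        'TIMEOUT': 300,
        'OPTIONS': {'size_limit': 2 ** 30},        # ~1 GB kept on disk
    }
}
```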
Short answer: if you have enough RAM, memcached will always be faster. You can't really benchmark memcached vs. the database cache; just keep in mind that the big bottleneck with servers is disk access, especially write access.
Anyway, a disk cache is better if you have many objects to cache with long expiration times. But in that situation, if you want top performance, it is better to generate your pages statically with a Python script and serve them with lighttpd or nginx.
For memcached, you can adjust the amount of RAM dedicated to the server.
Just try it out. Use Firebug or a similar tool, and run memcached with a small RAM allocation (e.g. 64 MB) on the test server.
Note your average load times seen in Firebug without memcached, then turn caching on and note the new results. It's as easy as that.
The results usually shock people, because the performance improves very nicely.
Use django-debug-toolbar to see how much time is saved on SQL queries.