I have a Flask application that allows users to query a smallish database (~2.4M rows) using SQL. It's similar to HackerRank but more limited in scope. It's deployed on Heroku.
I've noticed during testing that I can predictably hit an R14 error (memory quota exceeded) or R15 (memory quota greatly exceeded) by running large queries. The queries that typically cause this are outside what a normal user would do, such as SELECT * FROM some_huge_table. That said, I'm concerned that these errors will become a regular occurrence even for small queries when 5, 10, or 100 users are querying at the same time.
I'm looking for some advice on how to manage memory quotas for this type of interactive site. Here's what I've explored so far:
Changing the number of gunicorn workers. This has had some effect, but I still hit R14 and R15 errors consistently.
Forcing limits on user queries, based on either the query text or the EXPLAIN output (see the sketch after this list). This does reduce memory usage, but I'm afraid it won't scale to even a very modest number of users.
Moving to a higher Heroku tier. The plan I currently use provides ~512 MB of RAM; the largest plan is around 14 GB. Again, this would help but won't scale even moderately, to say nothing of the associated costs.
Reducing the size of the database significantly. I would like to avoid this if possible. Doing the napkin math, shrinking a 1.9M-row table to 10k or 50k rows would greatly reduce the application's memory needs and help it scale, but it would still hit some moderate maximum usage limit.
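For reference, here is a minimal sketch of what the EXPLAIN-based check from the second point could look like, assuming PostgreSQL with psycopg2; MAX_ESTIMATED_ROWS is an arbitrary example threshold:

```python
# Hypothetical sketch: ask PostgreSQL's planner for its row estimate and
# reject the query up front if the result set looks too large.
import json

MAX_ESTIMATED_ROWS = 100_000  # arbitrary example threshold

def estimated_rows(conn, user_sql):
    with conn.cursor() as cur:
        cur.execute("EXPLAIN (FORMAT JSON) " + user_sql)
        plan = cur.fetchone()[0]
        if isinstance(plan, str):  # some driver settings return the plan as a JSON string
            plan = json.loads(plan)
        return plan[0]["Plan"]["Plan Rows"]

def check_query(conn, user_sql):
    if estimated_rows(conn, user_sql) > MAX_ESTIMATED_ROWS:
        raise ValueError("Query rejected: estimated result set is too large")
```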
As you can see, I'm a novice at best when it comes to memory management. I'm looking for strategies/ideas on how to solve this general problem, and if the answer is that I need to either drastically cut the data size or throw a lot of money at it, that's OK too.
Thanks
From my personal experience, I see two approaches:
1. plan for it
Based on your example, this means calculating the maximum memory a single request can use, multiplying it by the number of gunicorn workers, and using dynos big enough to hold that.
For a different workload this could be a valid approach; I don't think it is for yours.
2. reduce memory usage, solution 1
The fact that so much application memory is used makes me think that your code likely loads the whole result set into memory (probably even multiple times, in multiple formats) before returning it to the client.
In the end, your application is only getting the data from the database and converting it to some output format (JSON/CSV?).
What you are probably searching for is streaming responses.
Your Flask view will then work on a record-by-record basis: it reads a single record, converts it to your output format, and yields it to the client.
Both your database client library and Flask support this (on the database side it is usually called a server-side cursor or iterator).
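A minimal sketch of a streaming response, assuming PostgreSQL with psycopg2; the DSN, query, and endpoint name are placeholders:

```python
# Minimal sketch of a streaming CSV response. A named (server-side) cursor
# fetches rows in batches instead of loading the whole result set into memory.
import csv
import io

import psycopg2
from flask import Flask, Response

app = Flask(__name__)

@app.route("/query")
def run_query():
    conn = psycopg2.connect("dbname=app user=app")  # placeholder DSN

    def generate():
        try:
            with conn.cursor(name="stream") as cur:  # named => server-side cursor
                cur.itersize = 1000  # rows fetched per round trip
                cur.execute("SELECT * FROM some_table")  # the user's SQL in practice
                for row in cur:
                    buf = io.StringIO()
                    csv.writer(buf).writerow(row)
                    yield buf.getvalue()
        finally:
            conn.close()

    return Response(generate(), mimetype="text/csv")
```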
2. reduce memory usage, solution 2
Other services often go for simple pagination or limiting result sets to manage server-side memory.
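One hedged way to do the limiting is to wrap the user's SQL in a subquery with a hard LIMIT; this assumes the statement is a single SELECT, and MAX_ROWS is an arbitrary example value:

```python
# Sketch of capping result size by nesting the user's SQL inside a
# subquery with a hard LIMIT. Assumes the statement is a single SELECT.
MAX_ROWS = 1000  # arbitrary example cap

def capped_query(user_sql: str) -> str:
    cleaned = user_sql.strip().rstrip(";")  # strip a trailing semicolon so it can be nested
    return f"SELECT * FROM ({cleaned}) AS user_query LIMIT {MAX_ROWS}"
```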
security sidenote
it sounds like the users can actually define the SQL statement in their API requests. This is a security and application risk. Apart from running INSERT, UPDATE, or DELETE statements, a user could craft a SQL statement that not only blows up your application memory but also brings down your database.
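Two common mitigations, sketched here under the assumption of PostgreSQL with psycopg2 (the role name, timeout, and DSN are placeholders): run the user's SQL as a read-only role and cap its runtime with a statement timeout.

```python
# Hedged sketch: execute user-supplied SQL with a SELECT-only role and a
# per-session statement timeout so runaway queries are aborted.
import psycopg2

user_sql = "SELECT 1"  # placeholder for the user's query

conn = psycopg2.connect("dbname=app user=readonly_user")  # role with SELECT-only grants
conn.set_session(readonly=True)  # reject writes at the session level

with conn.cursor() as cur:
    cur.execute("SET statement_timeout = '5s'")  # abort long-running queries
    cur.execute(user_sql)
    rows = cur.fetchmany(1000)
```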
Related
I have deployed a web application: the backend is Python with Django REST Framework, the frontend is Vue.js, and the database is MySQL.
The site is hosted on DigitalOcean on a CPU-Optimized Droplet with 8 GB RAM and 4 vCPUs. I am facing a performance issue: the site is very slow even though the hosting is CPU-optimized with 8 GB of RAM.
When I checked the error log I found [CRITICAL] WORKER TIMEOUT (pid:9116). I increased the gunicorn timeout, but I am still facing the same issue.
1) I want to increase the performance.
2) I want to fix the timeout issue. I have increased the value in gunicorn; in nginx the maximum value is 75s and I have not changed it. What would be the best solution?
Kindly help me: are there any alternative ways to test the performance of the site?
I think you need optimization
Here are some tips for optimizing your Django REST service on a production server:
Check your queries. The best way to do this is with the Django Debug Toolbar: it shows how many queries each request produces, and if you think there are too many, optimize them. For example, use select_related or prefetch_related to join related tables instead of querying them one by one (see the sketch after these tips).
Index your tables: index the columns you query on most often. Be careful, though; only index the columns you really need, because the downside is slower inserts, updates, and deletes.
Check your database health. If you delete or update rows frequently, the old row versions are still stored and remain part of your indexes (they are called dead rows); you can remove them with commands like VACUUM if you are using Postgres.
Scale up your server. This depends on the request volume/traffic; check for RAM or CPU spikes to determine how many units you need. In my experience, tables with at least 10M rows, some of which I cross-reference against ~30k rows, work fine on 2 GB of RAM and 1 CPU with a few simultaneous requests.
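As promised above, a minimal sketch of the select_related/prefetch_related point; the Book/Author/Tag models are hypothetical examples, not from the question:

```python
# Sketch of avoiding N+1 queries with select_related/prefetch_related.
from django.db import models

class Author(models.Model):
    name = models.CharField(max_length=100)

class Tag(models.Model):
    name = models.CharField(max_length=50)

class Book(models.Model):
    title = models.CharField(max_length=200)
    author = models.ForeignKey(Author, on_delete=models.CASCADE)
    tags = models.ManyToManyField(Tag)

# N+1: one query for the books, then one extra query per book for its author.
for book in Book.objects.all():
    print(book.author.name)

# One JOIN for the foreign key, one extra query total for the many-to-many.
for book in Book.objects.select_related("author").prefetch_related("tags"):
    print(book.author.name, [t.name for t in book.tags.all()])
```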
I am building a website with the expectation that hundreds (I wish thousands!) of GET queries per day will be cached for a couple of months in the filesystem.
Reading the cache documentation, however, I observe that the default values lean towards a small and fast cache cycle.
An old post describes how a strategy like the one I imagine wreaked havoc on their servers.
Of course, the Django code has evolved since 2012. However, the cache defaults still remain the same...
I wonder whether I am on the right track or not.
My familiarity with caching is limited to enjoying the results of W3 Total Cache after it saved thousands of files in the relevant directories, without understanding anything but its basic settings.
How would an experienced developer approach "stage 1" of this task:
Without the budget (yet) for solutions based on, for example, Redis (not a valid argument, I know):
How would you cache a steadily growing number of queries (enough to form a bulk) for a long period of time, on rather basic server resources?
Django's cache backend should be implementation-agnostic. For example, whether you start with the filesystem cache, Redis, or Memcached shouldn't really matter to Django.
I can think of a couple issues with your approach:
how fast is your dataset growing? If the dataset is pretty stable in size, it shouldn't matter that the cache entries are long-lived.
how will you invalidate your queries? If the queries are being cached for months, that suggests the data does not change; cache invalidation is a big thing to consider, since clients shouldn't see stale data (see the sketch after this list).
are you using the filesystem cache? If data is cached per server, are requests consistently routed to the same server? If not, multiple servers can end up with duplicate caches; this is one of the benefits of a centralized cache (Redis/Memcached).
you should be able to calculate a pretty good estimate of how large a cache you'd need based on your current dataset size, how much of it you'd like to cache, and its rate of growth. I feel like a shared cache will go very far, and it can be run on "basic server" resources.
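On the invalidation point above, a hedged sketch of evicting a cached query whenever the underlying data changes, using Django signals; the Article model and cache key are hypothetical:

```python
# Sketch: cache a query for a long time, but evict it whenever the
# underlying model is saved or deleted.
from django.core.cache import cache
from django.db.models.signals import post_delete, post_save
from django.dispatch import receiver

from myapp.models import Article  # hypothetical model

ARTICLE_LIST_KEY = "article_list"

def get_article_list():
    # Cache for up to ~2 months; the handlers below evict it sooner on changes.
    return cache.get_or_set(
        ARTICLE_LIST_KEY,
        lambda: list(Article.objects.values("id", "title")),
        timeout=60 * 60 * 24 * 60,
    )

@receiver(post_save, sender=Article)
@receiver(post_delete, sender=Article)
def invalidate_article_list(sender, **kwargs):
    cache.delete(ARTICLE_LIST_KEY)
```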
For stage 1, I would:
choose a shared cache, either Redis or Memcached; this will be a lot less painful when you start to scale to multi-server setups (see the config sketch after this list)
estimate how much data you will need to cache, and what sort of data size growth you predict, to make sure your cache is of an appropriate size.
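A minimal config sketch for the shared-cache option; the host/port values are placeholders, and the built-in RedisCache backend assumes Django 4.0+ (older versions typically use the django-redis package instead):

```python
# settings.py -- Redis as the shared cache backend
CACHES = {
    "default": {
        "BACKEND": "django.core.cache.backends.redis.RedisCache",
        "LOCATION": "redis://127.0.0.1:6379/1",
        "TIMEOUT": 60 * 60 * 24 * 60,  # default entry lifetime: ~2 months
    }
}

# Or Memcached via pymemcache:
# CACHES = {
#     "default": {
#         "BACKEND": "django.core.cache.backends.memcached.PyMemcacheCache",
#         "LOCATION": "127.0.0.1:11211",
#     }
# }
```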
I feel like cache invalidation is usually not a fixed policy about how long data should persist in the cache; it is governed by when your data changes, which should force-invalidate the cache so that clients don't see stale data.
We have a little data that is almost never updated but read frequently (site config and some selection items like state and county information). I think if I move it into application memory instead of any database, our I/O performance would improve a lot.
But we have a lot of web servers, and I cannot figure out a good way to notify all of them to reload this data.
You are likely looking for a cache pattern (see: Is there a Python caching library?). You just need to ask how stale you can afford to be. If you are looking this up on every request, even a short-lived cache can massively improve performance. It's likely, though, that this information can live for minutes or hours without much risk of being "stale".
If you can't live with a stale cache, I've implemented solutions that make a single database call to check the last-updated date of the cached data. That at least reduces the cache refresh check to a single database call.
Be aware, though, that as soon as you are sharing updateable information, you have to deal with multi-threaded updates of shared state. Make sure you understand the implications of this. Hopefully your caching library handles it gracefully.
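A hedged sketch combining both ideas: an in-process cache with a short TTL that re-checks a last-updated timestamp before reloading. The load_config()/get_last_updated() helpers are hypothetical placeholders for your own queries:

```python
# Sketch of an in-process cache that refreshes itself only when a
# last-updated timestamp in the database changes.
import threading
import time

class ConfigCache:
    def __init__(self, load_config, get_last_updated, ttl_seconds=60):
        self._load_config = load_config            # loads the full config from the DB
        self._get_last_updated = get_last_updated  # single cheap query, e.g. MAX(updated_at)
        self._ttl = ttl_seconds
        self._lock = threading.Lock()
        self._data = None
        self._version = None
        self._checked_at = 0.0

    def get(self):
        with self._lock:  # guard shared state against concurrent refreshes
            now = time.monotonic()
            if self._data is None or now - self._checked_at > self._ttl:
                version = self._get_last_updated()
                if version != self._version:
                    self._data = self._load_config()
                    self._version = version
                self._checked_at = now
            return self._data
```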
I have been using the datastore with ndb for a multiplayer app. This appears to be using a lot of reads/writes and will undoubtedly go over quota and cost a substantial amount.
I was thinking of changing all the game data to be stored only in memcache. I understand that data stored here can be lost at any time, but as the data will only be needed for, at most, 10 minutes and as it's just a game, that wouldn't be too bad.
Am I right to move to solely use memcache, or is there a better method, and is memcache essentially 'free' short term data storage?
Yes, memcache is free and you can use it as free "data storage". Just keep in mind that it can be purged at any time (more likely when it is heavily used) and that it is not always available. To check for memcache availability, use the Capabilities API.
As a commenter on another answer noted, there are now two memcache offerings: shared and dedicated. Shared is the original service, and is still free. Dedicated is in preview, and presently costs $.12/GB hour.
Dedicated memcache allows you to have a certain amount of space set aside. However, it's important to understand that you can still experience partial or complete flushes at any time with dedicated memcache, due to things like machine reboots. Because of this, it's not a suitable replacement for the datastore.
However, it is true that you can greatly reduce your datastore usage with judicious use of memcache. Using it as a write-through cache, for example, can greatly reduce your datastore reads (albeit not the writes).
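A hedged sketch of such a write-through cache on the legacy Python App Engine runtime; the GameState model and the 10-minute expiry are hypothetical examples:

```python
# Sketch of a write-through cache: reads try memcache first and fall back to
# the datastore; writes go to both so the cache entry stays fresh.
from google.appengine.api import memcache
from google.appengine.ext import ndb

class GameState(ndb.Model):
    data = ndb.JsonProperty()

def get_state(game_id):
    key = "game:%s" % game_id
    state = memcache.get(key)
    if state is None:
        entity = GameState.get_by_id(game_id)
        state = entity.data if entity else None
        if state is not None:
            memcache.set(key, state, time=600)  # keep for up to 10 minutes
    return state

def put_state(game_id, state):
    # Write-through: persist to the datastore and refresh the cache entry.
    GameState(id=game_id, data=state).put()
    memcache.set("game:%s" % game_id, state, time=600)
```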
Hope this helps.
I have a rather small (ca. 4.5k pageviews a day) website running on Django, with PostgreSQL 8.3 as the db.
I am using the database as both the cache and the session backend. I've heard a lot of good things about using Memcached for this purpose, and I would definitely like to give it a try. However, I would like to know exactly what the benefits of such a change would be: I imagine that my site may simply not be big enough for a better cache backend to make a difference. The point is: it wouldn't be me who would be installing and configuring memcached, and I don't want to waste somebody's time for nothing or very little.
How can I measure the overhead introduced by using the db as the cache backend? I've looked at django-debug-toolbar, but if I understand correctly it isn't something you'd like to put on a production site (you have to set DEBUG=True for it to work). Unfortunately, I cannot quite reproduce the production setting on my laptop (I have a different OS, CPU and a lot more RAM).
Has anyone benchmarked different Django cache/session backends? Does anybody know what would be the performance difference if I was doing, for example, one session-write on every request?
At my previous job we tried to measure the impact of caching on the site we were developing. On the same machine we load-tested a set of 10 pages most commonly used as start pages (object listings), plus some object detail pages taken randomly from a pool of ~200,000. The difference was roughly 150 requests/second versus 30,000 requests/second, and the database queries dropped to 1-2 per page.
What was cached:
sessions
lists of objects retrieved for each individual page in object listing
secondary objects and common content (found on each page)
lists of object categories and other categorising properties
object counters (calculated offline by cron job)
individual objects
In general, we used only low-level, granular caching, not the high-level cache framework. It required very careful design (the cache had to be properly invalidated upon each database state change, such as adding or modifying any object).
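A hedged sketch of what that low-level, granular caching can look like with Django's cache API; the Product model, key format, and timeout are hypothetical:

```python
# Sketch: cache individual objects under explicit keys and invalidate
# on every state change.
from django.core.cache import cache

from myapp.models import Product  # hypothetical model

def get_product(product_id):
    key = "product:%d" % product_id
    product = cache.get(key)
    if product is None:
        product = Product.objects.get(pk=product_id)
        cache.set(key, product, timeout=3600)
    return product

def save_product(product):
    # Invalidate on every change so readers never see stale data.
    product.save()
    cache.delete("product:%d" % product.pk)
```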
The DiskCache project publishes Django cache benchmarks comparing local memory, Memcached, Redis, file based, and diskcache.DjangoCache. An added benefit of DiskCache is that no separate process is necessary (unlike Memcached and Redis). Instead cache keys and small values are memory-mapped into the Django process memory. Retrieving values from the cache is generally faster than Memcached on localhost. A number of settings control how much data is kept in memory; the rest being paged out to disk.
Short answer: if you have enough RAM, memcached will always be faster. You can't really benchmark memcached vs. a database cache; just keep in mind that the big bottleneck on servers is disk access, especially write access.
Anyway, a disk cache is better if you have many objects to cache and long expiration times. But in that situation, if you want top performance, it is better to generate your pages statically with a Python script and serve them with lighttpd or nginx.
For memcached, you can adjust the amount of RAM dedicated to the server.
Just try it out. Use Firebug or a similar tool and run memcached with a modest RAM allocation (e.g. 64 MB) on the test server.
Note your average load times in Firebug without memcached, then turn caching on and note the new results. It's as easy as it sounds.
The results usually shock people, because the performance improves very nicely.
Use django-debug-toolbar to see how much time has been saved on SQL queries.
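If you want a rough number rather than a feel, here is a hedged sketch of timing a cache backend from the Django shell; swap the CACHES backend between runs (database cache vs. memcached) to compare. The key, payload, and iteration count are arbitrary example values:

```python
# Run inside `python manage.py shell`: time set+get round trips through
# Django's cache API for whichever backend is currently configured.
import timeit

from django.core.cache import cache

payload = {"user_id": 42, "items": list(range(100))}  # arbitrary example payload

def set_get():
    cache.set("bench_key", payload, timeout=300)
    cache.get("bench_key")

n = 1000
elapsed = timeit.timeit(set_get, number=n)
print("avg set+get: %.3f ms" % (elapsed / n * 1000))
```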