So I decided to benchmark the REST API I developed using Django REST Framework today. The request I send is a GET request that retrieves the latest 50 posts from the database and returns them in JSON format.
Using Apache Benchmark, the stats were:
Server Software: nginx/1.4.6
Concurrency Level: 100
Time taken for tests: 18.394 seconds
Complete requests: 1000
Failed requests: 0
Non-2xx responses: 1000
Total transferred: 5628000 bytes
HTML transferred: 5447000 bytes
Requests per second: 54.36 [#/sec] (mean)
Time per request: 1839.442 [ms] (mean)
Time per request: 18.394 [ms] (mean, across all concurrent requests)
Transfer rate: 298.79 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 17 1137 1899.3 31 12366
Processing: 25 189 314.2 31 1418
Waiting: 24 184 309.4 29 1415
Total: 44 1326 1846.3 888 12407
Percentage of the requests served within a certain time (ms)
50% 888
66% 1178
75% 1775
80% 2286
90% 3434
95% 4576
98% 7859
99% 7922
100% 12407 (longest request)
This is obviously incredibly slow... but I'm not sure how I can improve it.
PS: I am very new to server development and want to learn from this. In the above GET request I am not doing any sort of threading on the server side. All the view does is:
user_id = str(request.QUERY_PARAMS.get("user_id", None))
cur = connection.cursor()
cur.execute("SELECT * FROM get_posts(%s)", [user_id])  # get_posts() is a function in the SQL database
return Response(convertToDict(cur))                    # convert the rows to a dict and return them
I want to improve the speed of that GET request, so what can I possibly do to make it faster?
Well, I am a little surprised to see a raw SQL query (but that's another story); there are all sorts of things you can do.
TL;DR
Upfront
Doing performance testing is great, and regularly benchmarking and recording the outcomes over time is a good practice, but benchmarking can be tricky to do properly: you have to take both software and hardware into account, because the outcomes of your tests will depend heavily on how those two interact. Do your best to replicate your production environment and try out different configurations (you are 12factor, right?) to determine a good fit.
Side note: I'm not terribly familiar with AB, but according to the output every response was non-2xx and HTML was transferred, which doesn't seem to be the intended behavior.
Fixing the problem
The first thing to do is evaluate what you have done in a thoughtful way.
Inspect the queries
Use things like django-debug-toolbar to see if you have some query bottlenecks - many queries that are chained together, long running queries, etc. If you need to get more granular, your database probably has logging facilities to record the long queries.
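For reference, wiring up django-debug-toolbar is usually just a settings change; this is a minimal sketch for a development environment (check the docs for the version you install, since some versions also want an explicit url(r'^__debug__/', include(debug_toolbar.urls)) entry):

# settings.py (development only)
DEBUG = True
INSTALLED_APPS = [
    # ... existing apps ...
    "debug_toolbar",
]
MIDDLEWARE_CLASSES = [
    # ... existing middleware (MIDDLEWARE on newer Django) ...
    "debug_toolbar.middleware.DebugToolbarMiddleware",
]
INTERNAL_IPS = ["127.0.0.1"]  # the toolbar only renders for these client IPs

The SQL panel then lists every query a request ran, with timings, which makes chained or duplicated queries easy to spot.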
Assuming your data is pretty normalized (in the sense of normal forms), this may be a place to introduce denormalization so you don't have to traverse as many relationships.
You could also introduce raw SQL (but you seem to already be doing that).
Inspect your business logic
You should be diligent about making sure your business logic is placed in the correct parts of the request/response cycle. Many times you put things in places just to get them working; maybe your initial decision is reaching its limits.
It seems like you are doing something very simple: get the last 50 entries in a table. If you are computing whether or not a post is included, you should probably leave that to the database - it should be handling all of the logic when it comes to what data to retrieve.
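For example, a hedged sketch of what "let the database decide" might look like with the ORM (Post, user_id and created_at are hypothetical names standing in for your schema):

# Filtering, ordering and limiting all happen in one SQL query;
# the slice becomes a LIMIT 50 rather than Python-side trimming.
posts = (Post.objects
         .filter(user_id=user_id)
         .order_by("-created_at")[:50])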
Inspect supporting code
While you're at it, try doing some more performance testing and see what areas of your code are lagging behind. Maybe there are things you can do that will improve your code (while keeping it readable and understandable by others) and give you a performance bump. List comprehensions, generators, taking advantage of prefetch_related and select_related, taking care to lazily evaluate queries - all of these things are worth implementing because their functionality is well documented and understood. That said, be sure to document these decisions carefully for your future self and possibly others.
I'm not really familiar with your implementation of the view code as it relates to Django REST Framework, but I would probably stick to the serializers that come with it.
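As a rough illustration (not your actual code), a ModelSerializer plus a generic list view covers the same ground as the raw cursor and convertToDict(); Post and its field names are assumed here, and query_params is the newer spelling of QUERY_PARAMS:

from rest_framework import generics, serializers

class PostSerializer(serializers.ModelSerializer):
    class Meta:
        model = Post                              # hypothetical model
        fields = ("id", "title", "body", "created_at")

class LatestPostsView(generics.ListAPIView):
    serializer_class = PostSerializer

    def get_queryset(self):
        user_id = self.request.query_params.get("user_id")
        # Let the database filter, order and limit the rows.
        return (Post.objects
                .filter(user_id=user_id)
                .order_by("-created_at")[:50])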
Workarounds
Another useful trick is to implement a pagination strategy (most likely with the REST Framework's built-in support) so the data is only transferred to the client in small pieces. How far you can take this will depend heavily on the use case.
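With Django REST Framework that can be as small as a settings entry; this is one way to do it (the class name and page size are just an example, and older DRF versions use PAGINATE_BY instead):

REST_FRAMEWORK = {
    "DEFAULT_PAGINATION_CLASS": "rest_framework.pagination.PageNumberPagination",
    "PAGE_SIZE": 50,  # clients then page through results instead of one big payload
}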
This serves as a nice introduction to:
Throwing Software at the Problem
You can use a cache to save data in the RAM of your server so it is quickly accessed by Django.
Generally, what cache will work best will depend on the data itself. It may be the case that using a search engine to store documents that you query frequently will be most useful. But, a good start is Redis. You can read all about implementing the cache from a variety of sources but a good place to search around with Django is on Django Packages.
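As a hedged sketch, assuming the django-redis package is installed, the configuration plus a cached view helper might look like this (the key, timeout and expensive_query() helper are made up for illustration):

# settings.py
CACHES = {
    "default": {
        "BACKEND": "django_redis.cache.RedisCache",
        "LOCATION": "redis://127.0.0.1:6379/1",
    }
}

# somewhere in the view code
from django.core.cache import cache

def latest_posts_cached(user_id):
    key = "latest_posts:%s" % user_id
    posts = cache.get(key)
    if posts is None:
        posts = expensive_query(user_id)   # hypothetical: the real database work
        cache.set(key, posts, timeout=60)  # serve cached data for up to 60 seconds
    return posts

Even a short timeout like this can absorb most of the load when the same list is requested over and over.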
Throwing Hardware at the Problem
Speed can also be about hardware. You should think about the requirements of your software and its dependencies. Do some testing, search around and experiment with what's right for you. Throwing more hardware at a problem has severe diminishing marginal returns.
Can you post your get_posts(user_id) method?
Steps to improve your performance
Improve the get_posts() method. You need to make sure that a minimum number of queries hits the database. Try to get the results using one .filter() and make use of select_related and prefetch_related to reduce the database calls: https://docs.djangoproject.com/en/1.8/ref/models/querysets/#prefetch-related
Use .extra() on the .filter() if required; through it you can add additional attributes to the model instances that cannot be fetched through a single query (a sketch follows below): https://docs.djangoproject.com/en/1.8/ref/models/querysets/#extra
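A hedged sketch of both points together (Post, author, tags and the raw table names are hypothetical, for illustration only):

posts = (Post.objects
         .filter(user_id=user_id)
         .select_related("author")   # pull the FK row in the same query (a JOIN)
         .prefetch_related("tags")   # one extra query for the many-to-many, not one per post
         .extra(select={
             "comment_count": "SELECT COUNT(*) FROM app_comment "
                              "WHERE app_comment.post_id = app_post.id"})
         .order_by("-created_at")[:50])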
Make these changes to the get_posts() and see how your GET request responds. If it still lags you can opt for caching.
Most of the time will be consumed by the database calls. If you optimize get_posts(), you may well be satisfied with the performance.
Related
I have a Flask application that allows users to query a ~small database (2.4M rows) using SQL. It's similar to a HackerRank but more limited in scope. It's deployed on Heroku.
I've noticed during testing that I can predictably hit an R14 error (memory quota exceeded) or R15 (memory quota greatly exceeded) by running large queries. The queries that typically cause this are outside what a normal user might do, such as SELECT * FROM some_huge_table. That said, I am concerned that these errors will become a regular occurrence for even small queries when 5, 10, 100 users are querying at the same time.
I'm looking for some advice on how to manage memory quotas for this type of interactive site. Here's what I've explored so far:
Changing the # of gunicorn workers. This has had some effect but I still hit R14 and R15 errors consistently.
Forced limits on user queries, based on either text or the EXPLAIN output. This does work to reduce memory usage, but I'm afraid it won't scale to even a very modest # of users.
Moving to a higher Heroku tier. The plan I use currently provides ~512MB RAM. The largest plan is around 14GB. Again, this would help but won't even moderately scale, to say nothing of the associated costs.
Reducing the size of the database significantly. I would like to avoid this if possible. Doing the napkin math on a table going from 1.9M rows down to 10k or 50k, the application would have greatly reduced memory needs and would scale better, but would still have some moderate maximum usage limit.
As you can see, I'm a novice at best when it comes to memory management. I'm looking for some strategies/ideas on how to solve this general problem, and if it's the case that I need to either drastically cut the data size or throw tons of $ at this, that's OK too.
Thanks
Coming from my personal experience, I see two approaches:
1. plan for it
For your example, this means calculating the maximum memory a request could use, multiplying it by the number of gunicorn workers, and using dynos big enough to cover that.
With a different use case this could be a valid approach, but I don't think it is for you.
2. reduce memory usage, solution 1
The fact that so much application memory is used makes me think that your code is probably loading the whole result set into memory (possibly even multiple times, in multiple formats) before returning it to the client.
In the end, your application is only getting the data from the database and converting it to some output format (JSON/CSV?).
What you are probably searching for is streaming responses.
Your Flask view then works on a record-by-record basis: it reads a single record, converts it to your output format, and yields it, over and over.
Both your database client library and Flask support this (on the database side these are usually called server-side cursors or iterators).
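As a rough sketch with Flask and psycopg2 on Python 3 (assuming PostgreSQL; the DSN and CSV formatting are placeholders), a named server-side cursor plus a generator response keeps only a batch of rows in memory at a time:

import csv
import io

import psycopg2
from flask import Response

def stream_query_as_csv(sql):
    conn = psycopg2.connect("dbname=app")   # assumed DSN
    cur = conn.cursor(name="stream_cur")    # named cursor = server-side cursor
    cur.itersize = 1000                     # rows pulled per round trip
    cur.execute(sql)

    def generate():
        try:
            for row in cur:                 # iterates batch by batch, not all at once
                buf = io.StringIO()
                csv.writer(buf).writerow(row)
                yield buf.getvalue()
        finally:
            cur.close()
            conn.close()

    return Response(generate(), mimetype="text/csv")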
2. reduce memory usage, solution 2
Other services often go for simple pagination or for limiting result sets to manage server-side memory.
security sidenote
It sounds like users can define the SQL statement in their API requests themselves. This is a security and application risk. Apart from issuing INSERT, UPDATE, or DELETE statements, a user could craft a SQL statement that will not only blow up your application memory but also bring down your database.
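If removing raw SQL entirely is not an option, some basic guard rails help; this sketch (assuming PostgreSQL and psycopg2, with a placeholder DSN) runs the statement on a read-only session with a timeout and a hard cap on returned rows. It is damage limitation, not a substitute for a restricted database role:

import psycopg2

def run_user_sql(sql, max_rows=10000):
    conn = psycopg2.connect("dbname=app")    # assumed DSN; ideally a dedicated read-only user
    conn.set_session(readonly=True)          # INSERT/UPDATE/DELETE now fail at the database
    try:
        with conn.cursor() as cur:
            cur.execute("SET statement_timeout = '5s'")  # kill runaway queries
            cur.execute(sql)
            return cur.fetchmany(max_rows)               # cap what goes back to the client
    finally:
        conn.close()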
Is there a way to measure the amount of memory allocated by an arbitrary web request in a Flask/Werkzeug app? By arbitrary, I mean I'd prefer a technique that lets me instrument code at a high enough level that I don't have to change it to test memory usage of different routes. If that's not possible but it's still possible to do this by wrapping individual requests with a little code, so be it.
In a PHP app I wrote a while ago, I accomplished this by calling the memory_get_peak_usage() function both at the start and the end of the request and taking the difference.
Is there an analog in Python/Flask/Werkzeug? Using Python 2.7.9 if it matters.
First of all, one should understand the main difference between PHP and Python request processing. Roughly speaking, each PHP worker accepts only one request, handles it and then dies (or reinitializes the interpreter). PHP was designed directly for this; it is a request-processing language by nature. So it's pretty simple to measure per-request memory usage: the request's peak memory usage is equal to the worker's peak memory usage. It's a language feature.
Python, on the other hand, usually takes another approach to handling requests. There are two main models, synchronous and asynchronous request processing, but both have the same difficulty when it comes to measuring per-request memory usage: one Python worker handles many requests (concurrently or sequentially) during its lifetime, so it's hard to attribute memory usage to a single request.
However, one can adapt the underlying framework and application code to collect memory usage. One possible solution is to use some kind of events: for example, raise an abstract mem_usage event before the request, at the beginning of a view function, at the end of a view function, at important places within the business logic, and so on. Then there should be a subscriber for such events, doing something like this:
import resource
mem_usage = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
This subscriber has to accumulate the usage data and, on app_request_teardown/after_request, send it to the metrics collection system along with the current request.endpoint, route or whatever.
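A rough sketch of that idea using Flask's own hooks (record_metric() is a hypothetical stand-in for whatever metrics client you use, and note that ru_maxrss is a process-wide peak, so concurrent requests will blur the numbers):

import resource

from flask import Flask, g, request

app = Flask(__name__)

@app.before_request
def note_mem_before():
    g.mem_before = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

@app.teardown_request
def report_mem_usage(exc=None):
    mem_after = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    delta = mem_after - getattr(g, "mem_before", mem_after)
    record_metric("mem_usage", delta, endpoint=request.endpoint)  # hypothetical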
Also, using a memory profiler is a good idea, but usually not for production use.
Further reading about request processing models:
CGI
FastCGI
PHP specific
Another possible solution is to use sys.settrace. With this tool one can measure memory usage even per line of code; usage examples can be found in the memory_profiler project. Of course, it will slow down the code significantly.
We have a small amount of data that is almost never updated but read frequently (site config and some selection items, like states and counties information). I think that if I could move it into application memory instead of any database, our I/O performance would improve a lot.
But we have a lot of web servers, and I cannot figure out a good way to notify all of them to reload this data.
You are likely looking for a cache pattern (see: Is there a Python caching library?). You just need to ask how stale you can afford to be. If you are looking this up on every request, even a short-lived cache can massively improve performance. It's likely, though, that this information can live for minutes or hours without much risk of being "stale".
If you can't live with a stale cache, I've implemented solutions that have a single database call, which keeps track of the last updated date for any of the cached data. This at least reduces the cache lookups to a single database call.
Be aware though, as soon as you are sharing updateable information, you have to deal with multi-threaded updates of shared state. Make sure you understand the implications of this. Hopefully your caching library handles this gracefully.
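A minimal sketch of the time-based variant, without any external library (load_site_config() is a hypothetical loader that hits the database; under a multi-threaded server you would want a lock around the refresh, or a library such as cachetools):

import time

_cache = {}

def get_cached(key, loader, ttl=300):
    value, expires = _cache.get(key, (None, 0.0))
    if time.time() >= expires:
        value = loader()                          # refresh from the database
        _cache[key] = (value, time.time() + ttl)  # stale again after ttl seconds
    return value

config = get_cached("site_config", load_site_config, ttl=300)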
I am trying to optimize performance on GAE, but once I deploy I get very unstable results. It's really hard to tell whether each optimization actually works because datastore and memcache operations take a highly variable time (ranging from milliseconds to seconds for the same operations).
For these tests I am the only one making requests, one at a time, by refreshing the homepage. There is no other traffic (besides my own browser requesting the page's images/css/js files).
Edit: To make sure that the drops were not due to concurrent requests from the browser (images/css/js), I've redone the test by requesting ONLY the page with urllib2.urlopen(). Problem persists.
My questions are:
1) Is this something to expect due to the fact that machines/resources are shared?
2) What are the most common cases where this behavior can happen?
3) Where can I go from there?
Here is a very slow datastore get (memcache was just flushed): [screenshot]
Here is a very slow memcache get (things are cached because of the previous request): [screenshot]
Here is a slow but faster memcache get (same repro step as the previous one, different calls are slow): [screenshot]
To answer your questions,
1) yes, you can expect variance in remote calls because of the shared network;
2) the most common place you will see variance is in datastore requests -- the larger/further the request, the more variance you will see;
3) here are some options for you:
It looks like you are trying to fetch large amounts of data from the datastore/memcache. You may want to re-think the queries and caches so they retrieve smaller chunks of data. Does your app need all that data for a single request?
If the app really needs to process all that data on every request, another option is to preprocess it with a background task (cron, task queue, etc.) and put the results into memcache. The request that serves up the page should simply pick the right pieces out of the memcache and assemble the page.
#proppy's suggestion to use NDB is a good one. It takes some work to rewrite serial queries into parallel ones, but the savings from async calls can be huge. If you can benefit from parallel tasks (using map), all the better.
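As a hedged sketch of what the async rewrite looks like (Post and Profile are hypothetical NDB models), the point is simply to start every RPC before blocking on any of them:

from google.appengine.ext import ndb

def load_homepage_data(user_key):
    # Both RPCs are issued immediately and run in parallel.
    posts_future = Post.query(Post.author == user_key).fetch_async(20)
    profile_future = Profile.get_by_id_async(user_key.id())
    # Block only when the results are actually needed.
    return profile_future.get_result(), posts_future.get_result()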
I am trying to use the SimpleDB in following way.
I want to keep 48 hrs worth data at anytime into simpledb and query it for different purposes.
Each domain has 1 hr worth data, so at any time there are 48 domains present in the simpledb.
As the new data is constantly uploaded, I delete the oldest domain and create a new domain for each new hour.
Each domain is about 50MB in size, the total size of all the domains is around 2.2 GB.
The item in the domain has following type of attributes
identifier - around 50 characters long -- 1 per item
timestamp - timestamp value -- 1 per item
serial_n_data - 500-1000 bytes data -- 200 per item
I'm using python boto library to upload and query the data.
I send 1 item/sec with around 200 attributes in the domain.
For one of the application of this data, I need to get all the data from all the 48 domains.
The Query looks like, "SELECT * FROM domain", for all the domains.
I use 8 threads to query data with each thread taking responsibility of few domains.
e.g. thread 1 handles domains 1-6, thread 2 handles domains 7-12, and so on.
It takes close to 13 minutes to get the entire data set. I am using boto's select method for this. I need much faster performance than this. Any suggestions on speeding up the querying process? Is there another language I could use that would speed things up?
Use more threads
I would suggest inverting your threads/domain ratio from 1/6 to something closer to 30/1. Most of the time taken to pull down large chunks of data from SimpleDB is going to be spent waiting. In this situation upping the thread count will vastly improve your throughput.
One of the limits of SimpleDB is the query response size cap at 1MB. This means pulling down the 50MB in a single domain will take a minimum of 50 Selects (the original + 49 additional pages). These must occur sequentially because the NextToken from the current response is needed for the next request. If each Select takes 2+ seconds (not uncommon with large responses and high request volume) you spend 2 minutes on each domain. If every thread has to iterate thru each of 6 domains in turn, that's about 12 minutes right there. One thread per domain should cut that down to about 2 minutes easily.
But you should be able to do much better than that. SimpleDB is optimized for concurrency. I would try 30 threads per domain, giving each thread a portion of the hour to query on, since it is log data after all. For example:
SELECT * FROM domain WHERE timestamp between '12:00' and '12:02'
(Obviously, you'd use real timestamp values) All 30 queries can be kicked off without waiting for any responses. In this way you still need to make at least 50 queries per domain, but instead of making them all sequentially you can get a lot more concurrency. You will have to test for yourself how many threads gives you the best throughput. I would encourage you to try up to 60 per domain, breaking the Select conditions down to one minute increments. If it works for you then you will have fully parallel queries and most likely have eliminated all follow up pages. If you get 503 ServiceUnavailable errors, scale back the threads.
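A hedged sketch of that fan-out with boto and plain threads (the domain name, timestamp format and minute_windows() helper are made up for illustration; each thread gets its own connection since boto connections aren't guaranteed thread-safe):

import threading

import boto

def fetch_slice(domain_name, start, end, out):
    conn = boto.connect_sdb()                 # credentials from the environment
    dom = conn.get_domain(domain_name)
    query = ("select * from `%s` where timestamp >= '%s' and timestamp < '%s'"
             % (domain_name, start, end))
    # Domain.select() lazily follows NextToken pages as you iterate.
    out.extend(dom.select(query))

results, threads = [], []
for start, end in minute_windows():           # hypothetical (start, end) pairs for the hour
    t = threading.Thread(target=fetch_slice,
                         args=("logs-hour-12", start, end, results))
    t.start()
    threads.append(t)
for t in threads:
    t.join()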
The domain is the basic unit of scalability for SimpleDB, so it is good that you have a convenient way to partition your data. You just need to take advantage of the concurrency. Rather than 13 minutes, I wouldn't be surprised if you were able to get the data in 13 seconds for an app running on EC2 in the same region. But the actual time it takes will depend on a number of other factors.
Cost Concerns
As a side note, I should mention the costs of what you are doing, even though you haven't raised the issue. CreateDomain and DeleteDomain are heavyweight operations. Normally I wouldn't advise using them so often. You are charged about 25 seconds of box usage each time so creating and deleting one each hour adds up to about $70 per month just for domain management. You can store orders of magnitude more data in a domain than the 50MB you mention. So you might want to let the data accumulate more before you delete. If your queries include the timestamp (or could be made to include the timestamp) query performance may not be hurt at all by having an extra GB of old data in the domain. In any case, GetAttributes and PutAttributes will never suffer a performance hit with a large domain size, it is only queries that don't make good use of a selective index. You'd have to test your queries to see. That is just a suggestion, I realize that the create/delete is cleaner conceptually.
Also, writing 200 attributes at a time is expensive, due to a quirk in the box usage formula: the box usage for writes is proportional to the number of attributes raised to the power of 3! The formula, in hours, is:
0.0000219907 + 0.0000000002 N^3
For the base charge plus the per-attribute charge, where N is the number of attributes. In your situation, if you write all 200 attributes in a single request, the box usage charges will be about $250 per million items ($470 per million if you write 256 attributes). If you break each request into 4 requests with 50 attributes each, you will quadruple your PutAttributes volume, but reduce the box usage charges by an order of magnitude to about $28 per million items. If you are able to break the requests down, then it may be worth doing. If you cannot (due to request volume, or just the nature of your app), it means that SimpleDB can end up being extremely unappealing from a cost standpoint.
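Plugging numbers into that formula makes the trade-off concrete; this assumes a box-usage rate of roughly $0.14 per machine hour (check current pricing), which is the only figure here not taken from the formula above:

RATE_PER_HOUR = 0.14  # assumed USD per box-usage hour

def put_cost_per_million_items(attrs_per_request, requests_per_item=1):
    hours_per_item = requests_per_item * (
        0.0000219907 + 0.0000000002 * attrs_per_request ** 3)
    return hours_per_item * RATE_PER_HOUR * 1000000

print(put_cost_per_million_items(200))    # one 200-attribute put per item: roughly $230
print(put_cost_per_million_items(50, 4))  # four 50-attribute puts per item: roughly $26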
I have had the same issue as you, Charlie. After profiling the code, I narrowed the performance problem down to SSL. It seems like that is where it spends most of its time, and hence CPU cycles.
I have read of a problem in the httplib library (which boto uses for SSL) where the performance doesn't increase unless the packets are over a certain size, though that was for Python 2.5 and may have already been fixed.
SDB Explorer uses multithreaded BatchPutAttributes to achieve high write throughput while uploading bulk data to Amazon SimpleDB, and allows multiple parallel uploads. If you have the bandwidth, you can take full advantage of it by running a number of BatchPutAttributes processes at once in a parallel queue, which will reduce the time spent processing.
http://www.sdbexplorer.com/