I'm creating an app in several different Python web frameworks to see which offers the best balance between being comfortable for me to program in and performance. Is there a way of reporting the memory usage of a particular app that is being run in a virtualenv?
If not, how can I find the average, maximum and minimum memory usage of my web framework apps?
It depends on how you're going to run the application in your environment. There are many different ways to run Python web apps; Gunicorn and uWSGI seem to be popular choices at the moment. So you'd be best off running the application the same way you would in your real environment and simply using a process monitor to see how much memory and CPU is being used by the process running your application.
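As a rough illustration (not from my own setup; the PID, sample count and interval are placeholders), you could sample a worker process with psutil to get the min / average / max memory over a period:

```python
# Hedged sketch: sample a worker process's resident memory with psutil.
# The PID, sample count and interval below are placeholders.
import time
import psutil

def sample_memory(pid, samples=60, interval=1.0):
    proc = psutil.Process(pid)
    readings = []
    for _ in range(samples):
        readings.append(proc.memory_info().rss)  # resident set size, in bytes
        time.sleep(interval)
    return min(readings), sum(readings) / len(readings), max(readings)

low, avg, high = sample_memory(12345)  # PID of your gunicorn/uWSGI worker
print("min=%.1fMB avg=%.1fMB max=%.1fMB" % (low / 1e6, avg / 1e6, high / 1e6))
```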
I'll second Matt W's note about the application environment being a major factor (Gunicorn, uWSGI, nginx->paster/pserve, eventlet, apache+mod_wsgi, etc.).
I'll also add this: the year is 2012. In 1999, memory and CPU for stuff like this were huge concerns, but it's 2012 now. Computers are significantly more powerful, expanding them is much easier and cheaper, and frameworks are coded better.
You're essentially looking at benchmarking things that have no practical impact and will only be theoretically 'neat' and informative.
The performance bottlenecks on Python webapps are usually:
database communications bottleneck
database schema
concurrent connections / requests-per-second
In terms of the database communications bottleneck, the general approaches to solving it are:
communicate less
aggressive caching
optimize your SQL queries and result sets, so there's less data
upgrade your db infrastructure
dedicated machine(s)
cluster master/slave or shard
In terms of database schema, convenience comes at a price. It's faster to get certain things done in Django -- but you're going to be largely stuck with the schema it creates. Pyramid+SqlAlchemy is more flexible and you can build against a finely tuned database with it... but you're not going to get any of the automagic tools that Django gives.
For concurrent connections / requests per second, it's largely down to the environment: running the same app under paster, uWSGI and other deployment strategies will give different results.
Here's a link to a good, but old, benchmark: http://nichol.as/benchmark-of-python-web-servers
You'll note there's a slide for peak memory usage there, and although there are a few outliers and a decent amount of clustering going on, the worst performer used 122MB. That's nothing.
You could interpret gevent as awesome for having 3MB compared to uWSGI's 15 or cogen's 122, but these are all a small fraction of a modern system's memory.
The frameworks have such a small overhead that they will barely be a factor in operating performance. Even the database portions are nothing. See this posting about SqlAlchemy (Why is SQLAlchemy insert with sqlite 25 times slower than using sqlite3 directly?), where the maintainer gives some impressive performance numbers: straight-up SQL generation was ~.5s for 100k rows; when the full ORM with integrity checks etc. is involved, it becomes 16s for the same number of rows. That is nothing.
So, my point is simple. The two factors you should consider are:
how fast / comfortable can I program now
how fast / comfortable can I program a year from now (i.e. how likely is my project to grow 'technical debt' using this framework, and how much of a problem will that become)
Play with the frameworks to decide which one you like the most, but don't waste your time on performance testing, because all you're going to do is waste time.
The choice of hosting mechanism isn't the cause of memory usage, it is how you configure them, plus what fat Python web application you decide to run.
The benchmark being quoted:
http://nichol.as/benchmark-of-python-web-servers
is a good example of where benchmarks can get it quite wrong.
The configurations of the different hosting mechanisms in that benchmark were not comparable and so there is no way you can use the results to evaluate memory usage of each properly. I would not pay much attention to that benchmark if memory is your concern.
Ignoring memory, some of the other comments made about where the real bottlenecks are going to be are valid. For a lot more detail on this whole issue see my PyCon talk.
http://lanyrd.com/2012/pycon/spcdg/
I have a Flask application that allows users to query a ~small database (2.4M rows) using SQL. It's similar to a HackerRank but more limited in scope. It's deployed on Heroku.
I've noticed during testing that I can predictably hit an R14 error (memory quota exceeded) or R15 (memory quota greatly exceeded) by running large queries. The queries that typically cause this are outside what a normal user might do, such as SELECT * FROM some_huge_table. That said, I am concerned that these errors will become a regular occurrence for even small queries when 5, 10, 100 users are querying at the same time.
I'm looking for some advice on how to manage memory quotas for this type of interactive site. Here's what I've explored so far:
Changing the # of gunicorn workers. This has had some effect but I still hit R14 and R15 errors consistently.
Forced limits on user queries, based on either text or the EXPLAIN output. This does work to reduce memory usage, but I'm afraid it won't scale to even a very modest # of users.
Moving to a higher Heroku tier. The plan I use currently provides ~512MB RAM. The largest plan is around 14GB. Again, this would help but won't even moderately scale, to say nothing of the associated costs.
Reducing the size of the database significantly. I would like to avoid this if possible. Doing the napkin math, taking a table with 1.9M rows down to 10k or 50k would greatly reduce the application's memory needs and let it scale better, but it would still have some moderate maximum usage limit.
As you can see, I'm a novice at best when it comes to memory management. I'm looking for some strategies/ideas on how to solve this general problem, and if it's the case that I need to either drastically cut the data size or throw tons of $ at this, that's OK too.
Thanks
Coming from my personal experience, I see two approaches:
1. plan for it
Coming from your example, this means you try to calculate the maximum memory that a request would use, multiply it by the number of gunicorn workers, and use dynos that are big enough.
With a different application this approach could be valid; I don't think it is for you.
2. reduce memory usage, solution 1
The fact that so much application memory is used makes me think that your code is likely loading the whole result set into memory (probably even multiple times, in multiple formats) before returning it to the client.
In the end, your application is only getting the data from the database and converting it to some output format (JSON/CSV?).
What you are probably searching for is streaming responses.
Your Flask view will work on a record-by-record basis: it will read a single record, convert it to your output format, and return it before moving on to the next one.
Both your database client library and Flask will support this (on most databases it is called cursors / iterators).
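As a rough sketch of what that can look like with Flask and psycopg2 (a named cursor keeps the result set on the server; get_connection() and the query are placeholders, not your actual code):

```python
# Sketch of a streaming Flask view using a psycopg2 server-side (named)
# cursor; get_connection() and the table name are illustrative placeholders.
import csv
import io

from flask import Flask, Response

app = Flask(__name__)

@app.route("/query")
def run_query():
    conn = get_connection()                     # hypothetical helper returning a psycopg2 connection
    cur = conn.cursor(name="streaming_cursor")  # named cursor keeps rows on the server
    cur.execute("SELECT * FROM some_table")

    def generate():
        try:
            for row in cur:                     # psycopg2 fetches rows in batches
                buf = io.StringIO()
                csv.writer(buf).writerow(row)
                yield buf.getvalue()
        finally:
            cur.close()
            conn.close()

    return Response(generate(), mimetype="text/csv")
```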
2. reduce memory usage, solution 2
Other services often go for simple pagination or limiting result sets to manage server-side memory.
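As an illustration only (the 500-row cap and the subquery wrapping are assumptions, not something your application necessarily supports):

```python
# Illustrative pagination helper: wrap the user's SELECT in a subquery and
# cap how many rows a single request can pull into memory.
MAX_PER_PAGE = 500

def paginate(user_sql, page, per_page=MAX_PER_PAGE):
    per_page = min(per_page, MAX_PER_PAGE)
    offset = (page - 1) * per_page
    return "SELECT * FROM (%s) AS user_query LIMIT %d OFFSET %d" % (
        user_sql, per_page, offset)
```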
security sidenote
It sounds like the users can actually define the SQL statement in their API requests. This is a security and application risk. Apart from running INSERT, UPDATE, or DELETE statements, a user could craft a SQL statement that will not only blow up your application's memory, but also break your database.
I currently run a "small" Flask application on Google Cloud's App Engine that is used to integrate applications (it listens to webhooks and calls other APIs). The issue is that I consistently exceed the soft memory limit after 35-45 requests.
Memory footprint of the combined instances: [screenshot omitted]
Since I intend to increase the load on this system by orders of magnitude this worries me.
There seem to be three possible solutions to me, but I don't know where to start:
Switch to Dataflow: I already use Pub/Sub between two App Engine instances to add higher consistency, but maybe App Engine is the wrong platform for this kind of workload.
Fix the memory leak: The issue here could be a memory leak, but I can't find the right tools to analyse the memory usage on the App Engine platform (on my local machine the memory usage of the Python process hovers around 51MB).
Divide the system into multiple microservices to decrease the footprint per instance. (Maintaining the code base will probably be harder though).
Any advice or experience is very welcome.
If your case is indeed a memory leak, you need to review your code, as this will consistently lead to your application crashing. There are other posts, like this one, that discuss tools and strategies for addressing memory issues in Python code.
You could potentially use Dataflow or Cloud Functions in your project. If you provide more details about the nature of your use case in a separate question, one could evaluate if these options could be a better alternative to your current App Engine approach.
Finally, dividing your application into multiple services is likely the best long term solution to your issue, as it will make it easier to find any memory leak, to control costs and to generally maintain your application.
There are a few pages in App Engine's documentation that discuss best practices in microservice design using App Engine [1] [2] [3]. Proper microservice-based applications permit clear logging and monitoring as well as an increase in application reliability and scalability, among other benefits [1]. These benefits are important for your case. Following the layout discussed in [4], you can scale your services individually and independently of each other. If you believe that one of your services is more resource-demanding, you can adjust its scaling parameters in order to provide optimal performance for that service. For example, you can manage the number of instances that are started during operation.
You can use the app.yaml elements max_concurrent_requests and target_throughput_utilization, which you define in your App Engine configuration file (app.yaml). See [5]. To clarify, in your case you want to reduce max_concurrent_requests.
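A hedged example of what that could look like in app.yaml (values are illustrative only; check [5] for the exact fields your runtime version supports):

```yaml
# Illustrative app.yaml excerpt; instance class and scaling values are
# examples, not recommendations.
instance_class: F2                     # larger class = more memory and CPU (and cost), see [5][6]
automatic_scaling:
  max_concurrent_requests: 10          # lower this to reduce per-instance memory pressure
  target_throughput_utilization: 0.6
```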
Please note that, as discussed in previous comments, this road could lead to higher costs. If you are using the free tier then you will need to check [4] for available resources to you in this tier.
Regarding the issue of your instances running out of memory, if you find that it is not due to a memory leak, then another solution would be to use a different instance_class, which lets you run instances with higher compute resources (at a higher cost too). Please see [5] and [6].
[1] https://cloud.google.com/appengine/docs/standard/python/microservices-on-app-engine
[2] https://cloud.google.com/appengine/docs/standard/python/designing-microservice-api
[3] https://cloud.google.com/appengine/docs/standard/python/microservice-performance
[4] https://cloud.google.com/appengine/docs/standard/python/an-overview-of-app-engine
[5] https://cloud.google.com/appengine/docs/standard/python/config/appref
[6] https://cloud.google.com/appengine/docs/standard/#instance_classes
I am using Python programs for nearly everything:
deploy scripts
nagios routines
website backend (web2py)
The reason I am doing this is that I can reuse the code to provide different kinds of services.
Since a while ago I have noticed that those scripts are putting a high CPU load on my servers. I have taken several steps to mitigate this:
late initialization using cached_property, so that only those objects actually needed are initialized (including the import of the related modules); a sketch follows this list
turning some of my scripts into HTTP services (with a simple web.py implementation wrapping up my classes); the services are then triggered (by Nagios, for example) with simple curl calls
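Here is a minimal sketch of the lazy-initialization idea from the first point (the class, connection string and query are invented for illustration; functools.cached_property needs Python 3.8+, and older code typically used an equivalent recipe or werkzeug.utils.cached_property):

```python
# Sketch of lazy initialization with cached_property: the expensive import
# and connection setup only run on first access and are cached afterwards.
from functools import cached_property  # Python 3.8+; older code used a recipe or werkzeug's version

class DeployHelper:
    def __init__(self, dsn):
        self.dsn = dsn

    @cached_property
    def db(self):
        import psycopg2                 # deferred import, only paid if .db is used
        return psycopg2.connect(self.dsn)

    def pending_jobs(self):
        with self.db.cursor() as cur:
            cur.execute("SELECT count(*) FROM jobs WHERE state = 'pending'")
            return cur.fetchone()[0]
```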
This has reduced the load dramatically, going from a CPU load of over 20 to well under 1. It seems Python startup is very resource-intensive for complex programs with lots of inter-dependencies.
I would like to know what other strategies are people here implementing to improve the performance of python software.
An easy one-off improvement is to use PyPy instead of the standard CPython for long-lived scripts and daemons (for short-lived scripts it's unlikely to help and may actually have longer startup times). Other than that, it sounds like you've already hit upon one of the biggest improvements for short-lived system scripts, which is to avoid the overhead of starting the Python interpreter for frequently-invoked scripts.
For example, if you invoke one script from another and they're both in Python you should definitely consider importing the other script as a module and calling its functions directly, as opposed to using subprocess or similar.
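For instance (the script name and its main() function are hypothetical):

```python
# Comparison sketch: in-process call vs. spawning a new interpreter.
import subprocess

# Pays interpreter startup and import cost on every invocation:
subprocess.call(["python", "nightly_report.py", "--verbose"])

# Imports once per process and calls the function directly:
import nightly_report
nightly_report.main(verbose=True)
```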
I appreciate that it's not always possible to do this, since some use-cases rely on external scripts being invoked - Nagios checks, for example, are going to be tricky to keep resident at all times. Your approach of making the actual check script a simple HTTP request seems reasonable enough, but the approach I took was to use passive checks and run an external service to periodically update the status. This allows the service generating check results to be resident as a daemon rather than requiring Nagios to invoke a script for each check.
Also, watch your system to see whether the slowness really is CPU overload or an IO issue. You can use utilities like vmstat to watch your IO usage. If you're IO bound then optimising your code won't necessarily help a lot. In that case, if you're doing something like processing lots of text files (e.g. log files) then you can store them gzipped and access them directly using Python's gzip module. This increases CPU load but reduces IO load because you only need to transfer the compressed data from disk. You can also write output files directly in gzipped format using the same approach.
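For example, counting error lines straight from a compressed log (the path and the filter string are made up):

```python
# Read a gzipped log directly with the standard gzip module, trading a
# little CPU for less disk IO.
import gzip

error_count = 0
with gzip.open("/var/log/app/access.log.gz", "rt") as fh:
    for line in fh:
        if " 500 " in line:
            error_count += 1
print("HTTP 500 responses:", error_count)
```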
I'm afraid I'm not particularly familiar with web2py specifically, but you can investigate whether it's easy to put a caching layer in front if the freshness of the data isn't totally critical. Try and make sure both your server and clients use conditional requests correctly, which will reduce request processing time. If they're using a back-end database, you could investigate whether something like memcached will help. These measures are only likely to give you real benefit if you're experiencing a reasonably high volume of requests or if each request is expensive to handle.
I should also add that generally reducing system load in other ways can occasionally give surprising benefits. I used to have a relatively small server running Apache and I found moving to nginx helped a surprising amount - I believe it was partly more efficient request handling, but primarily it freed up some memory that the filesystem cache could then use to further boost IO-bound operations.
Finally, if overhead is still a problem then carefully profile your most expensive scripts and optimise the hotspots. This could be improving your Python code, or it could mean pushing code out to C extensions if that's an option for you. I've had some great performance by pushing data-path code out into C extensions for large-scale log processing and similar tasks (talking about hundreds of GB of logs at a time). However, this is a heavy-duty and time-consuming approach and should be reserved for the few places where you really need the speed boost. It also depends whether you have someone available who's familiar enough with C to do it.
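A quick way to find those hotspots with the standard library before committing to C (process_logs() here is hypothetical):

```python
# Profile a representative run with cProfile and inspect the hotspots.
import cProfile
import pstats

cProfile.run("process_logs('/var/log/app')", "profile.out")
stats = pstats.Stats("profile.out")
stats.sort_stats("cumulative").print_stats(20)  # top 20 entries by cumulative time
```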
Upon hearing that the scientific computing project (happens to be the stochastic tractography method described here) I'm currently running for an investigator would take 4 months on our 50 node cluster, the investigator has asked me to examine other options. The project is currently using parallel python to farm out chunks of a 4d array to different cluster nodes, and put the processed chunks back together.
The jobs I'm currently working with are probably much too coarsely grained (5 seconds to 10 minutes; I had to increase the default timeout in Parallel Python), and I estimate I could speed up the process by 2-4 times by rewriting it to make better use of resources (splitting up and putting back together the data is taking too long, and that should be parallelized as well). Most of the work is done with numpy arrays.
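For context, the split/process/reassemble step is roughly the following (the array shape, chunk count and per-chunk work here are invented, not the actual tractography code):

```python
# Chunk a 4D array along one axis, process the chunks, and reassemble.
import numpy as np

data = np.random.rand(64, 64, 64, 120)    # stand-in for the real 4D volume
chunks = np.array_split(data, 8, axis=3)  # e.g. one chunk per worker/node

def process(chunk):
    return chunk * 2.0                    # placeholder for the real computation

results = [process(c) for c in chunks]    # in practice farmed out via Parallel Python
combined = np.concatenate(results, axis=3)
assert combined.shape == data.shape
```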
Let's assume that 2-4 times isn't enough, and I decide to get the code off of our local hardware. For high throughput computing like this, what are my commercial options and how will I need to modify the code?
You might be interested in PiCloud. I have never used it, but their offer apparently includes the Enthought Python Distribution, which covers the standard scientific libraries.
It's tough to say if this will work for your specific case, but the Parallel Python interface is pretty generic. So hopefully not too many changes would be needed. Maybe you can even write a custom scheduler class (implementing the same interface as PP). Actually that might be useful for many people, so maybe you can drum up some support in the PP forum.
The most obvious commercial options which come to mind are Amazon EC2 and the Rackspace Cloud. I have played with both and found the Rackspace API a little easier to use.
The good news is that you can prototype and play with their compute instances (short- or long-lived virtual machines of the OS of your choice) for very little investment, typically $US 0.10 / hr or so. You create them on demand and then release them back to the cloud when you are done, and only pay for what you use. For example, I saw a demo on Django deployment using 6 Rackspace instances which took perhaps an hour and cost the speakers less than a dollar.
For your use case (not clear exactly what you meant by 'high throughput'), you will have to look at your budget and your computing needs, as well as your total network throughput (you pay for that, too). A few small-scale tests and a simple spreadsheet calculation should tell you if it's really practical or not.
There are Python APIs for both Rackspace Cloud and Amazon EC2. Whichever you use, I recommend python-based Fabric for automated deployment and configuration of your instances.
I have a rather small (ca. 4.5k pageviews a day) website running on Django, with PostgreSQL 8.3 as the db.
I am using the database as both the cache and the session backend. I've heard a lot of good things about using Memcached for this purpose, and I would definitely like to give it a try. However, I would like to know exactly what the benefits of such a change would be: I imagine that my site may just not be big enough for the better cache backend to make a difference. The point is: it wouldn't be me who would be installing and configuring memcached, and I don't want to waste somebody's time for nothing or very little.
How can I measure the overhead introduced by using the db as the cache backend? I've looked at django-debug-toolbar, but if I understand correctly it isn't something you'd like to put on a production site (you have to set DEBUG=True for it to work). Unfortunately, I cannot quite reproduce the production setting on my laptop (I have a different OS, CPU and a lot more RAM).
Has anyone benchmarked different Django cache/session backends? Does anybody know what would be the performance difference if I was doing, for example, one session-write on every request?
At my previous job we tried to measure the impact of caching on a site we were developing. On the same machine we load-tested the set of 10 pages most commonly used as start pages (object listings), plus some object detail pages taken randomly from a pool of ~200,000. The difference was something like 150 requests/second versus 30,000 requests/second, and the database queries dropped to 1-2 per page.
What was cached:
sessions
lists of objects retrieved for each individual page in object listing
secondary objects and common content (found on each page)
lists of object categories and other categorising properties
object counters (calculated offline by cron job)
individual objects
In general, we used only low-level granular caching, not the high-level cache framework. It required very careful design (cache had to be properly invalidated upon each database state change, like adding or modifying any object).
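A hedged sketch of that style of low-level caching (the Category model, cache key and timeout are illustrative, and the explicit invalidation is the part that needs the careful design mentioned above):

```python
# Low-level, granular caching with Django's cache API plus explicit
# invalidation whenever the underlying data changes.
from django.core.cache import cache

from myapp.models import Category   # hypothetical model

CATEGORY_KEY = "category_list"

def get_categories():
    categories = cache.get(CATEGORY_KEY)
    if categories is None:
        categories = list(Category.objects.order_by("name"))
        cache.set(CATEGORY_KEY, categories, 60 * 15)   # cache for 15 minutes
    return categories

def save_category(category):
    category.save()
    cache.delete(CATEGORY_KEY)   # invalidate on every state change
```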
The DiskCache project publishes Django cache benchmarks comparing local memory, Memcached, Redis, file based, and diskcache.DjangoCache. An added benefit of DiskCache is that no separate process is necessary (unlike Memcached and Redis). Instead cache keys and small values are memory-mapped into the Django process memory. Retrieving values from the cache is generally faster than Memcached on localhost. A number of settings control how much data is kept in memory; the rest being paged out to disk.
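For reference, wiring it up is a small settings change; this is only an illustration, and the cache directory is an assumption:

```python
# Illustrative settings.py excerpt using diskcache's Django backend.
CACHES = {
    "default": {
        "BACKEND": "diskcache.DjangoCache",
        "LOCATION": "/var/tmp/django-disk-cache",
    }
}
```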
Short answer: if you have enough RAM, memcached will always be faster. You can't really benchmark memcached vs. the database cache; just keep in mind that the big bottleneck with servers is disk access, especially write access.
Anyway, a disk cache is better if you have many objects to cache and long expiration times. But in that situation, if you want big performance, it is better to generate your pages statically with a Python script and serve them with lighttpd or nginx.
For memcached, you can adjust the amount of RAM dedicated to the server.
Just try it out. Use Firebug or a similar tool and run memcached with a modest RAM allocation (e.g. 64MB) on the test server.
Note your average load times seen in Firebug without memcached, then turn caching on and note the new results. It's as easy as that.
The results usually shock people, because the performance improves very noticeably.
Use django-debug-toolbar to see how much time has been saved on SQL queries.