Tricks to improve performance of a Python backend

I am using Python programs for nearly everything:
deploy scripts
nagios routines
website backend (web2py)
The reason I am doing this is that I can reuse the code to provide different kinds of services.
For a while now I have noticed that these scripts are putting a high CPU load on my servers. I have taken several steps to mitigate this:
late initialization, using cached_property (see here and here), so that only the objects that are actually needed get initialized, including imports of the related modules (see the sketch after this list)
turning some of my scripts into HTTP services (with a simple web.py implementation wrapping up my classes). The services are then triggered (by Nagios, for example) with simple curl calls.
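A minimal sketch of the lazy-initialization idea, assuming functools.cached_property (Python 3.8+) or an equivalent recipe like the ones linked above; the class and its dependency are invented for illustration:

from functools import cached_property

class ServiceChecks:
    @cached_property
    def db(self):
        # heavy import and connection deferred until the first access
        import pymysql            # hypothetical dependency
        return pymysql.connect(host="localhost")

    def check_disk_space(self):
        # never touches self.db, so the database module is never imported
        ...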
This has reduced the load dramatically, going from a CPU load of over 20 to well under 1. It seems Python startup is very resource-intensive for complex programs with lots of inter-dependencies.
I would like to know what other strategies are people here implementing to improve the performance of python software.

An easy one-off improvement is to use PyPy instead of the standard CPython for long-lived scripts and daemons (for short-lived scripts it's unlikely to help and may actually have longer startup times). Other than that, it sounds like you've already hit upon one of the biggest improvements for short-lived system scripts, which is to avoid the overhead of starting the Python interpreter for frequently-invoked scripts.
For example, if you invoke one script from another and they're both in Python, you should definitely consider importing the other script as a module and calling its functions directly, as opposed to using subprocess or similar.
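For illustration, the difference looks roughly like this (the script and function names are made up):

import subprocess

# spawns a fresh interpreter and re-pays all import costs on every call
subprocess.run(["python", "deploy_tool.py", "--target", "web1"])

# versus: pay the import cost once, then call into the module directly
import deploy_tool
deploy_tool.main(target="web1")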
I appreciate that it's not always possible to do this, since some use-cases rely on external scripts being invoked - Nagios checks, for example, are going to be tricky to keep resident at all times. Your approach of making the actual check script a simple HTTP request seems reasonable enough, but the approach I took was to use passive checks and run an external service to periodically update the status. This allows the service generating check results to be resident as a daemon rather than requiring Nagios to invoke a script for each check.
Also, watch your system to see whether the slowness really is CPU overload or IO issues. You can use utilities like vmstat to watch your IO usage. If you're IO bound then optimising your code won't necessarily help a lot. In this case, if you're doing something like processing lots of text files (e.g. log files) then you can store them gzipped and access them directly using Python's gzip module. This increases CPU load but reduces IO load because you only need to transfer the compressed data from disk. You can also write output files directly in gzipped format using the same approach.
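A sketch of that approach with the standard gzip module (the file names are made up); it trades a little CPU for noticeably less disk IO:

import gzip

# read a compressed log line by line without unpacking it on disk
with gzip.open("access.log.gz", "rt") as src:
    errors = [line for line in src if " 500 " in line]

# write the filtered output straight back out as gzip
with gzip.open("errors.log.gz", "wt") as dst:
    dst.writelines(errors)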
I'm afraid I'm not particularly familiar with web2py specifically, but you can investigate whether it's easy to put a caching layer in front if the freshness of the data isn't totally critical. Try to make sure both your server and clients use conditional requests correctly, which will reduce request processing time. If the app uses a back-end database, you could investigate whether something like memcached will help. These measures are only likely to give you real benefit if you're experiencing a reasonably high volume of requests or if each request is expensive to handle.
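As a rough in-process illustration of the idea (a production setup would more likely use memcached or an HTTP cache in front; all names here are invented), a tiny TTL cache around an expensive handler might look like this:

import time
import functools

def ttl_cache(seconds):
    def decorator(fn):
        cache = {}
        @functools.wraps(fn)
        def wrapper(*args):
            now = time.time()
            hit = cache.get(args)
            if hit and now - hit[0] < seconds:
                return hit[1]          # still fresh, skip the expensive work
            value = fn(*args)
            cache[args] = (now, value)
            return value
        return wrapper
    return decorator

@ttl_cache(seconds=30)
def dashboard_stats(customer_id):      # hypothetical expensive view helper
    ...                                # heavy DB queries would go here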
I should also add that generally reducing system load in other ways can occasionally give surprising benefits. I used to have a relatively small server running Apache and I found moving to nginx helped a surprising amount - I believe it was partly more efficient request handling, but primarily it freed up some memory that the filesystem cache could then use to further boost IO-bound operations.
Finally, if overhead is still a problem then carefully profile your most expensive scripts and optimise the hotspots. This could mean improving your Python code, or it could mean pushing code out to C extensions if that's an option for you. I've had some great performance gains by pushing data-path code out into C extensions for large-scale log processing and similar tasks (talking about hundreds of GB of logs at a time). However, this is a heavy-duty and time-consuming approach and should be reserved for the few places where you really need the speed boost. It also depends on whether you have someone available who's familiar enough with C to do it.
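For the profiling step itself, the standard library's cProfile is usually enough to find the hotspots before reaching for C; a quick sketch (the profiled function is just a stand-in for your real code):

import cProfile
import pstats

def expensive_task():                   # stand-in for the real hotspot
    return sorted(str(i)[::-1] for i in range(200000))

cProfile.run("expensive_task()", "profile.out")
stats = pstats.Stats("profile.out")
stats.sort_stats("cumulative").print_stats(15)   # the 15 biggest time sinks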

Related

Sharing information between Python code and C++ code (IPC)

I have two code bases, one in Python, one in C++. I want to share real-time data between them. I am trying to evaluate which option will work best for my specific use case:
many small data updates from the C++ program to the Python program
they both run on the same machine
reliability is important
low latency is nice to have
I can see a few options:
One process writes to a flat file, the other process reads it. This is non-scalable, slow, and prone to I/O errors.
One process writes to a database, the other process reads it. This makes it more scalable, slightly less error prone, but still very slow.
Embed my Python program into the C++ one or the other way round. I rejected that solution because both code bases are reasonably complex, and I preferred to keep them separate for maintainability reasons.
Use sockets in both programs and send messages directly. This seems to be a reasonable approach, but it does not leverage the fact that they are on the same machine (it can be optimized slightly by using localhost as the destination, but it still feels cumbersome).
Use shared memory. So far I think this is the most satisfying solution I have found, but it has the drawback of being slightly more complex to implement.
Are there other solutions I should consider?
First of all, this question is highly opinion-based!
The cleanest way would be to use them in the same process and have them communicate directly. The only complexity is implementing a proper API and the C++ -> Python calls. Drawbacks are maintainability, as you noted, and potentially lower robustness (both crash together, which is not a problem in most cases) and lower flexibility (are you sure you'll never need to run them on different machines?). Extensibility is the best here, as it's very simple to add more communication or to change what exists. You might also reconsider the maintainability point: can your Python app be used without its C++ counterpart? If not, I wouldn't worry about maintainability so much.
Then shared memory is the next choice, with better maintainability but the same other drawbacks. Extensibility is a little bit worse, but still not bad. It can be complicated: I don't know Python's support for shared-memory operations well, but for C++ you can have a look at Boost.Interprocess. The main question I'd check first is synchronisation between the processes.
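For what it's worth, recent Python versions do ship shared-memory support in the standard library (multiprocessing.shared_memory, Python 3.8+). A minimal sketch of the Python side attaching to a segment created by the C++ process; the segment name and layout are assumptions, name interop with Boost.Interprocess is platform-dependent, and synchronisation still has to be handled separately:

from multiprocessing import shared_memory
import struct

# attach to an existing segment created by the C++ side (e.g. via Boost.Interprocess)
shm = shared_memory.SharedMemory(name="ticks", create=False)

# assume the writer puts a little-endian uint64 sequence number and a double payload
seq, price = struct.unpack_from("<Qd", shm.buf, 0)

shm.close()   # detach without destroying; the owning process calls unlink()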
Then there is network communication. There are lots of choices here, from the simplest possible binary protocol implemented at the socket level to the higher-level options mentioned in the comments. It depends on how complex your C++ <-> Python communication is and may become in the future. This approach can be more complicated to implement and can require third-party libraries, but once done it's extensible and flexible. The usual third-party libraries are based on code generation (Thrift, Protobuf), which doesn't simplify your build process.
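For the simplest socket-level option between processes on one machine, a Unix domain socket with length-prefixed messages skips the TCP stack entirely; a rough sketch of the Python-side reader (the socket path and framing are assumptions, and the C++ writer would need to match them):

import socket
import struct

sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
sock.connect("/tmp/updates.sock")      # the C++ writer listens here

def read_message(s):
    # 4-byte little-endian length prefix, then the payload
    header = s.recv(4, socket.MSG_WAITALL)
    (length,) = struct.unpack("<I", header)
    return s.recv(length, socket.MSG_WAITALL)

while True:
    payload = read_message(sock)
    ...                                # decode and apply the update here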
I wouldn't seriously consider file system or database for this case.

How to measure memory usage of a web request when using Werkzeug/Flask?

Is there a way to measure the amount of memory allocated by an arbitrary web request in a Flask/Werkzeug app? By arbitrary, I mean I'd prefer a technique that lets me instrument code at a high enough level that I don't have to change it to test memory usage of different routes. If that's not possible but it's still possible to do this by wrapping individual requests with a little code, so be it.
In a PHP app I wrote a while ago, I accomplished this by calling the memory_get_peak_usage() function both at the start and the end of the request and taking the difference.
Is there an analog in Python/Flask/Werkzeug? Using Python 2.7.9 if it matters.
First of all, one should understand the main difference between how PHP and Python process requests. Roughly speaking, each PHP worker accepts only one request, handles it, and then dies (or reinitializes its interpreter). PHP was designed for exactly this; it is a request-processing language by nature. So it's pretty simple to measure per-request memory usage: a request's peak memory usage is equal to the worker's peak memory usage. It's a language feature.
At the same time, Python usually uses another approach to handle requests. There are two main models - synchronous and asynchronous request processing. However, both of them have the same difficulty when it comes to measuring per-request memory usage. The reason is that one Python worker handles plenty of requests (concurrently or sequentially) during its lifetime, so it's hard to attribute memory usage to a single request.
However, one can adapt the underlying framework and application code to collect this memory-usage data. One possible solution is to use some kind of events. For example, one can raise an abstract mem_usage event before the request, at the beginning of a view function, at the end of a view function, at important places within the business logic, and so on. Then there should be a subscriber for such events that does something like the following:
import resource
# peak resident set size of the process so far (kilobytes on Linux, bytes on macOS)
mem_usage = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
This subscriber has to accumulate the usage data and, on app_request_teardown/after_request, send it to the metrics-collection system along with the current request.endpoint, route, or whatever else identifies the request.
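A rough sketch of what such hooks could look like in Flask, keeping in mind that ru_maxrss is a process-wide high-water mark: the delta only moves when a request pushes the peak higher, and concurrent requests will blur the numbers.

import resource
from flask import Flask, g, request

app = Flask(__name__)

def peak_rss():
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

@app.before_request
def note_memory_before():
    g.rss_before = peak_rss()

@app.after_request
def note_memory_after(response):
    delta = peak_rss() - g.rss_before
    # hand (request.endpoint, delta) to your metrics system here
    app.logger.info("peak RSS grew by %s for %s", delta, request.endpoint)
    return response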
Also, using a memory profiler is a good idea, but usually not for production use.
Further reading about request processing models:
CGI
FastCGI
PHP specific
Another possible solution is to use sys.settrace. With this tool one can measure memory usage even per line of code. Usage examples can be found in the memory_profiler project. Of course, it will slow down the code significantly.
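For the line-by-line approach, memory_profiler's decorator (a third-party package, pip install memory_profiler) is the usual entry point; a small sketch with a made-up function, which prints a per-line memory report when the script is run:

from memory_profiler import profile

@profile
def handle_request(payload):
    numbers = [int(x) for x in payload.split(",")]   # made-up work to profile
    squares = [n * n for n in numbers]
    return sum(squares)

if __name__ == "__main__":
    handle_request(",".join(str(i) for i in range(100000)))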

Is the Multi interface in curl faster or more efficient than using multiple easy interfaces?

I am making something which involves pycurl. Since pycurl depends on libcurl, I was reading through its documentation and came across this Multi interface, where you can perform several transfers using a single multi object. I was wondering if this is faster or more memory-efficient than having multiple easy interfaces? I was also wondering what the advantage of this approach is, since the site barely says,
"Enable multiple simultaneous transfers in the same thread without making it complicated for the application."
You are trying to optimize something that doesn't matter at all.
If you want to download 200 URLs as fast as possible, you are going to spend 99.99% of your time waiting for those 200 requests, limited by your network and/or the server(s) you're downloading from. The key to optimizing that is to make the right number of concurrent requests. Anything you can do to cut down the last 0.01% will have no visible effect on your program. (See Amdahl's Law.)
Different sources give different guidelines, but typically it's somewhere between 6 and 12 concurrent requests, with no more than 2-4 to the same server. Since you're pulling them all from Google, I'd suggest starting with 4 concurrent requests and then, if that's not fast enough, tweaking that number until you get the best results.
As for space, the cost of storing 200 pages is going to far outstrip the cost of a few dozen bytes here and there for overhead. Again, what you want to optimize is those 200 pages—by storing them to disk instead of in memory, by parsing them as they come in instead of downloading everything and then parsing everything, etc.
Anyway, instead of looking at what command-line tools you have and trying to find a library that's similar to those, look for libraries directly. pycurl can be useful in some cases, e.g., when you're trying to do something complicated and you already know how to do it with libcurl, but in general, it's going to be a lot easier to use either stdlib modules like urllib or third-party modules designed to be as simple as possible like requests.
The main example for ThreadPoolExecutor in the docs shows how to do exactly what you want to do. (If you're using Python 2.x, you'll have to pip install futures to get the backport for ThreadPoolExecutor, and use urllib2 instead of urllib.request, but otherwise the code will be identical.)
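The shape of that example, trimmed down (the URLs are placeholders); on Python 2.x the imports change as noted above:

import concurrent.futures
import urllib.request

URLS = ["https://example.com/page/%d" % i for i in range(200)]   # placeholder URLs

def fetch(url, timeout=30):
    with urllib.request.urlopen(url, timeout=timeout) as conn:
        return conn.read()

with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
    futures = {executor.submit(fetch, url): url for url in URLS}
    for future in concurrent.futures.as_completed(futures):
        page = future.result()
        # parse or save the page here rather than keeping everything in memory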
Having multiple easy interfaces running concurrently in the same thread means building your own reactor and driving curl at a lower level. That's painful in C, and just as painful in Python, which is why libcurl offers, and recommends, multi.
But that "in the same thread" is key here. You can also create a pool of threads and throw the easy instances into that. In C, that can still be painful; in Python, it's dead simple. In fact, the first example in the docs for using a concurrent.futures.ThreadPoolExecutor does something similar, but actually more complicated than you need here, and it's still just a few lines of code.
If you're comparing multi vs. easy with a manual reactor, the simplicity is the main benefit. In C, you could easily implement a more efficient reactor than the one libcurl uses; in Python, that may or may not be true. But in either language, the performance cost of switching among a handful of network requests is going to be so tiny compared to everything else you're doing—especially waiting for those network requests—that it's unlikely to ever matter.
If you're comparing multi vs. easy with a thread pool, then a reactor can definitely outperform threads (except on platforms where you can tie a thread pool to a proactor, as with Windows I/O completion ports), especially for huge numbers of concurrent connections. Also, each thread needs its own stack, which typically means about 1MB of memory pages allocated (although not all of them used), which can be a serious problem in 32-bit land for huge numbers of connections. That's why very few serious servers use threads for connections. But in a client making a handful of connections, none of this matters; again, the costs incurred by wasting 8 threads vs. using a reactor will be so small compared to the real costs of your program that they won't matter.

python: bytecode-oriented profiler

I'm writing a web application (http://www.checkio.org/) which allows users to write python code. As one feedback metric among many, I'd like to enable profiling while running checks on this code. This is to allow users to get a very rough idea of the relative efficiency of various solutions.
I need the profile to be (reasonably) deterministic. I don't want other load on the web server to give a bad efficiency reading. Also, I'm worried that some profilers won't give a good measurement because these short scripts run so quickly. The timeit module runs a function thousands of times, but I'd like not to waste server resources on this small feature if possible.
It's not clear which (if any) of the standard profilers meet this need. Ideally the profiler would give units of "interpreter bytecode ticks" which would increment one per bytecode instruction. This would be a very rough measure, but meets the requirements of determinism and high-precision.
Which profiling system should I use?
Python's standard profiler module provides deterministic profiling.
I also suggest giving yappi a try (http://code.google.com/p/yappi/). As of v0.62, it supports CPU-time profiling, and you can stop the profiler at any time you want...
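A small sketch of the yappi flow, using API names from recent releases (v0.62 may differ slightly); solution() stands in for the user-submitted code being checked:

import yappi

def solution():                         # stand-in for the user-submitted code
    return sum(i * i for i in range(10**5))

yappi.set_clock_type("cpu")             # count CPU time rather than wall time
yappi.start()
solution()
yappi.stop()                            # the profiler can be stopped at any point
yappi.get_func_stats().print_all()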

having to run multiple instances of a web service for ruby/python seems like a hack to me

Is it just me or is having to run multiple instances of a web server to scale a hack?
Am I wrong in this?
Clarification
I am referring to how I read people run multiple instances of a web service on a single server. I am not talking about a cluster of servers.
Not really; people were running multiple frontends across a cluster of servers before multicore CPUs became widespread.
So all the infrastructure for supporting sessions properly across multiple frontends had been in place for quite some time before it became really advantageous to run a bunch of threads on one machine.
In fact, using asynchronous-style frontends gives better performance on the same hardware than a multithreaded approach, so I would say that avoiding multiple instances in favour of a multithreaded monster is the real hack.
Since we are now moving towards more cores rather than faster processors, in order to scale more and more you will need to be running more instances.
So yes, I reckon you are wrong.
This does not by any means condone brain-dead programming with the excuse that you can just scale it horizontally; that approach is just misguided.
With no details, it is very difficult to see what you are getting at. That being said, it is quite possible that you are simply not using the right approach for your problem.
Sometimes multiple separate instances are better. Sometimes your Python services are actually better deployed behind a single Apache instance (using mod_wsgi), which may elect to use more than a single process. I don't know enough about Ruby to have an opinion there.
In short, if you want to make your service scalable then the way to do so depends heavily on additional details. Is it scaling up or scaling out? What is the operating system and available or possibly installable server software? Is the service itself easily parallelized and how much is it database dependent? How is the database deployed?
Even if the Ruby/Python interpreters were perfect and could utilize all available CPU with a single process, you would still reach the maximum capacity of a single server sooner or later and have to scale across several machines, going back to running several instances of your app.
I would hesitate to say that the issue is a "hack". Or indeed that threaded solutions are necessarily superior.
The situation is a result of design decisions used in the interpreters of languages like Ruby and Python.
I work with Ruby, so the details may be different for other languages.
BUT ... essentially, Ruby uses a Global Interpreter Lock to prevent threading issues:
http://en.wikipedia.org/wiki/Global_Interpreter_Lock
The side-effect of this is that, to achieve concurrency with frameworks like Rails, rather than relying on multiple threads within the VM, we use multiple processes, each with its own interpreter and its own instance of your framework and application code.
Each instance of the app handles a single request at a time. To achieve concurrency we have to spin up multiple instances.
In the olden days (2-3 years ago) we would run multiple Mongrel (or similar) instances behind a proxy (generally Apache). Passenger changed some of this because it is smart enough to manage the processes itself, rather than requiring manual setup. You tell Passenger how many processes it can use and off it goes.
The whole structure is actually not as bad as the thread-orthodoxy would have you believe. For a start, it's pretty easy to make this type of architecture work in a multicore environment. Any modern database is designed to handle highly concurrent loads, so having multiple processes has very little if any effect at that level.
If you use a language like JRuby you can deploy into a threaded app server like Tomcat and have a deployment that looks much more "java-like". However, this is not as big a win as you might think, because now your application needs to be much more thread-aware and you can see side effects and strangeness from threading issues.
Your assumption that Tomcat's and IIS's single-process-per-server model is superior is flawed. The choice between a multi-threaded server and a multi-process server depends on a lot of variables.
One main factor is the underlying operating system. Unix systems have always had great support for multi-processing because of the copy-on-write nature of the fork system call. This makes multiple processes a really attractive option, because web serving is usually very shared-nothing and you don't have to worry about locking. Windows, on the other hand, has much heavier processes and lighter threads, so programs like IIS gravitate towards a multi-threading model.
As for the question of whether it's a hack to run multiple servers, it really depends on your perspective. If you look at Apache, it comes with a variety of pluggable engines to choose from. The prefork MPM is the default because it allows the programmer to easily use non-thread-safe C/Perl/database libraries without having to throw locks and semaphores all over the place. To some that might be a hack to work around poorly implemented libraries. To me it's a brilliant way of leaving it to the OS to handle the problems and letting me get back to work.
Also a multi-process model comes with a few features that would be very difficult to implement in a multi-threaded server. Because they are just processes, zero-downtime rolling-updates are trivial. You can do it with a bash script.
It also has its shortcomings. In a single-process model, setting up a singleton that holds some global state is trivial, while in a multi-process model you have to serialize that state to a database or a Redis server. (Of course, if your single-process server outgrows a single server you'll have to do that anyway.)
Is it a hack? Yes and no. Both original implementations (MRI and CPython) have Global Interpreter Locks that will prevent a multi-core server from operating at 100% of its potential. On the other hand, multi-process has its advantages (especially on the Unix side of the fence).
There's also nothing inherent in the languages themselves that makes them require a GIL, so you can run your application with Jython, JRuby, IronPython or IronRuby if you really want to share state inside a single process.
