Python: bytecode-oriented profiler

I'm writing a web application (http://www.checkio.org/) which allows users to write python code. As one feedback metric among many, I'd like to enable profiling while running checks on this code. This is to allow users to get a very rough idea of the relative efficiency of various solutions.
I need the profile to be (reasonably) deterministic. I don't want other load on the web server to skew the efficiency reading. I'm also worried that some profilers won't give a good measurement because these short scripts run so quickly. The timeit module runs a function thousands of times, but I'd rather not waste server resources on this small feature if possible.
It's not clear which (if any) of the standard profilers meets this need. Ideally the profiler would report in units of "interpreter bytecode ticks", incrementing by one per bytecode instruction. This would be a very rough measure, but it meets the requirements of determinism and high precision.
Which profiling system should I use?

Python's standard profiler module provides deterministic profiling.
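For reference, a minimal sketch of deterministic profiling with the standard library's cProfile (the faster C implementation of the same profiler); the check_solution function is just a hypothetical stand-in for the user code being checked:

    import cProfile
    import pstats

    def check_solution():
        # hypothetical stand-in for the user-submitted code under test
        return sum(i * i for i in range(1000))

    profiler = cProfile.Profile()
    profiler.enable()
    check_solution()
    profiler.disable()

    # Call counts are deterministic even when wall-clock timings are not.
    stats = pstats.Stats(profiler)
    stats.sort_stats('calls').print_stats(10)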

I also suggest giving yappi a try (http://code.google.com/p/yappi/). As of v0.62 it supports CPU-time profiling, and you can stop the profiler at any time you want.
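A minimal sketch of the yappi usage described above (CPU-clock profiling plus stopping the profiler on demand); the profiled function is again a hypothetical placeholder:

    import yappi

    def run_user_code():
        # hypothetical placeholder for the code being measured
        return [x ** 2 for x in range(10000)]

    yappi.set_clock_type("cpu")   # measure CPU time rather than wall time
    yappi.start()
    run_user_code()
    yappi.stop()                  # the profiler can be stopped at any point
    yappi.get_func_stats().print_all()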

Related

Performance monitoring tools vs the process status (ps) command

I am currently on the lookout for performance monitoring tools: tools which produce metrics for specific OS processes. I want something lightweight (so as not to affect the performance of the system being monitored), so I wrote a simple bash script which uses the ps command to retrieve CPU% and memory%, writes them to a file, sleeps for a specified number of seconds, and repeats until it is terminated.
My question is whether this is a correct approach, because the documentation of ps (here) says:
Since ps cannot run faster than the system and is run as any other
scheduled process, the information it displays can never be exact.
I know of other ways, such as using psutil in Python to retrieve the same information. However, is this (or any other tool) faster or more reliable? If so, can you recommend a tool?
Or is the ps command safe enough?
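For reference, the psutil approach mentioned above might look roughly like the sketch below; psutil's per-process cpu_percent and memory_percent are real APIs, but the target PID, sampling interval and output format are just placeholders:

    import time
    import psutil

    PID = 1234            # placeholder: the process you want to monitor
    INTERVAL = 5          # placeholder: seconds between samples

    proc = psutil.Process(PID)
    with open("metrics.log", "a") as out:
        while True:
            cpu = proc.cpu_percent(interval=1.0)   # samples over one second
            mem = proc.memory_percent()
            out.write(f"{time.time():.0f} {cpu:.1f} {mem:.1f}\n")
            out.flush()
            time.sleep(INTERVAL)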
ps is primarily a snapshot tool that gives you an instantaneous view of the system at a given moment. You cannot get an instantaneous CPU percentage from ps. The ps man page gives this information regarding CPU:
CPU usage is currently expressed as the percentage of time spent
running during the entire lifetime of a process. This is not ideal,
and it does not conform to the standards that ps otherwise conforms
to. CPU usage is unlikely to add up to exactly 100%.
You can use top, a monitoring tool that keeps polling for numbers over time. I'd recommend top for cases where you want repetitive updates.
For a more detailed explanation on differences between top and ps, you can read the answers on this question: https://unix.stackexchange.com/questions/58539/top-and-ps-not-showing-the-same-cpu-result

Externalising CPU computation from Python for multi-core concurrency

I have a PyQt5 application which runs perfectly on my development machine (Core i7 Windows 7), but has performance issues on my target platform (Linux Embedded ARM). I've been researching Python concurrency in further detail, prior to 'optimising' my current code (i.e. ensuring all UI code is in the MainThread, with all logic code in separate threads). I've learnt that the GIL largely prevents the CPython interpreter from realising true concurrency.
My question: would I be better off using IronPython or Cython as the interpreter, or sending all the logic to an external non-Python function which can make use of multiple cores, and leave the PyQt application to simply update the UI? If the latter, which language would be well suited to high-speed, concurrent calculation?
If the latter, which language would be well suited to high-speed, concurrent calculation?
You've written a lot about your system and yet not enough about what it actually does; what kind of "calculations" are you doing? If you're doing anything heavily computational, it's very likely someone has worked very hard to make a hardware-optimized library for those kinds of calculations, e.g. BLAS via scipy/numpy (see Arm's own website). You want to push as much work as possible out of your own Python code and into their hands. The language you use to call these libraries is much less important; Python is already great for this kind of "gluing" work. Note that even using built-in Python functions, such as sum(value for value in some_iter) instead of summing in a Python for loop, pushes computation out of slow interpretation and into highly-optimized C code.
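As a small illustration of pushing work out of interpreted loops (the array size is arbitrary):

    import numpy as np

    values = np.random.rand(1_000_000)

    # Slow: every iteration runs through the bytecode interpreter.
    total = 0.0
    for v in values:
        total += v

    # Faster: the built-in sum runs the loop in C.
    total = sum(values)

    # Fastest here: numpy delegates to an optimized compiled kernel.
    total = values.sum()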
Otherwise, without profiling your actual code, it's hard to say what would be best. After doing the above, i.e. formulating your calculations so that the optimized libraries can do their work efficiently (e.g. by properly vectorizing them), you can then use Python's multiprocessing to divide up whatever Python logic is causing a bottleneck from that which isn't (see this answer on why multiprocessing is often better than threading). I'd wager this would be much more beneficial than just swapping out CPython for another implementation.
Only once you've delegated as much computation as possible to external libraries and parallelized as well as possible using multiprocessing would I then start rewriting the computation-heavy processes in Cython, which can be considered a form of low-level optimization on top of the architectural improvements above.
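A minimal multiprocessing sketch along those lines, assuming the work can be split into independent chunks (the worker function and the chunking are hypothetical):

    from multiprocessing import Pool

    def heavy_calculation(chunk):
        # hypothetical CPU-bound worker; runs in a separate process,
        # so it is not limited by the GIL
        return sum(x * x for x in chunk)

    if __name__ == "__main__":
        chunks = [range(i, i + 100000) for i in range(0, 400000, 100000)]
        with Pool(processes=4) as pool:        # e.g. one process per core
            partial_results = pool.map(heavy_calculation, chunks)
        print(sum(partial_results))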
Echoing @errantlinguist: please be aware that parallel performance is highly application-dependent.
To maintain GUI responsiveness, yes, I would just use a separate "worker" thread to keep the main thread available to handle GUI events.
To do something "embarrassingly parallel", like a Monte Carlo computation with many completely independent tasks and minimal communication between them, I might try multiprocessing.
If I were doing something like very large matrix operations, I would do it multithreaded. Anaconda will automatically parallelize some numpy operations via MKL on Intel processors (but this will not help you on ARM). If you stay in Python, something like numba can help with this. If you are still unhappy with performance, you may want to try implementing in C++. If almost everything you do is vectorized numpy operations, you should not see a big difference from C++, but as Python loops etc. start to creep in, you will probably begin to see big differences in performance (beyond the at most 4x you can gain by parallelizing your Python code over 4 cores). If you switch to C++ for matrix operations, I highly recommend the Eigen library. It's very fast and easy to understand at a high level.
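A minimal numba sketch of the kind mentioned above (assuming numba is available on the target ARM platform; the function is a toy example):

    import numpy as np
    from numba import njit

    @njit                     # JIT-compiles the Python loops to machine code
    def row_norms(matrix):
        out = np.empty(matrix.shape[0])
        for i in range(matrix.shape[0]):
            s = 0.0
            for j in range(matrix.shape[1]):
                s += matrix[i, j] * matrix[i, j]
            out[i] = s ** 0.5
        return out

    m = np.random.rand(1000, 1000)
    print(row_norms(m)[:5])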
Please be aware that when you use multithreading you are usually in a shared-memory context, which eliminates a lot of the expensive IO you will encounter in multiprocessing, but it also introduces some classes of bugs you are not used to encountering in serial programs (e.g. when two threads access the same resource). In multiprocessing, memory is usually separate except for explicitly defined communication between the processes. In that sense, I find that multiprocessing code is typically easier to understand and debug.
Also, there are frameworks for handling complex computational graphs with many steps, which may include both multithreading and multiprocessing (try dask).
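A minimal dask sketch of such a computational graph, with hypothetical stage functions:

    from dask import delayed

    def load(i):        # hypothetical IO-bound stage
        return list(range(i * 1000, (i + 1) * 1000))

    def process(data):  # hypothetical CPU-bound stage
        return sum(x * x for x in data)

    # Build the graph lazily, then let dask schedule the pieces
    # across threads and/or processes.
    partials = [delayed(process)(delayed(load)(i)) for i in range(8)]
    total = delayed(sum)(partials)
    print(total.compute())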
Good luck!

Tricks to improve performance of python backend

I am using Python programs for nearly everything:
deploy scripts
nagios routines
website backend (web2py)
The reason I am doing this is that I can reuse the code to provide different kinds of services.
A while ago I noticed that these scripts were putting a high CPU load on my servers. I have taken several steps to mitigate this:
late initialization using cached_property (see here and here), so that only the objects that are actually needed get initialized, including the import of the related modules (a sketch follows this list)
turning some of my scripts into HTTP services (with a simple web.py implementation wrapping up my classes). The services are then triggered (by Nagios, for example) with simple curl calls.
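A minimal sketch of the lazy-initialization idea from the first point, using functools.cached_property (Python 3.8+); the class, attribute and resource names are just placeholders:

    from functools import cached_property

    class Deployment:
        @cached_property
        def db(self):
            # The (potentially expensive) import and connection happen only on
            # first access, and the result is cached on the instance afterwards.
            import sqlite3                      # placeholder dependency
            return sqlite3.connect("state.db")  # placeholder resource

    d = Deployment()
    # No database work has happened yet; it only happens when d.db is touched.
    cursor = d.db.execute("SELECT 1")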
This has reduced the load dramatically, from a CPU load of over 20 to well under 1. It seems Python startup is very resource-intensive for complex programs with lots of interdependencies.
I would like to know what other strategies people here are using to improve the performance of Python software.
An easy one-off improvement is to use PyPy instead of the standard CPython for long-lived scripts and daemons (for short-lived scripts it's unlikely to help and may actually have longer startup times). Other than that, it sounds like you've already hit upon one of the biggest improvements for short-lived system scripts, which is to avoid the overhead of starting the Python interpreter for frequently-invoked scripts.
For example, if you invoke one script from another and they're both in Python you should definitely consider importing the other script as a module and calling its functions directly, as opposed to using subprocess or similar.
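A tiny sketch of the contrast (the cleanup module and its main function are hypothetical names):

    # Expensive: every call pays the full interpreter and import startup cost.
    import subprocess
    subprocess.run(["python", "cleanup.py", "--verbose"])

    # Cheap: the interpreter is already running, so it is just a function call.
    import cleanup                  # hypothetical script refactored as a module
    cleanup.main(verbose=True)      # hypothetical entry point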
I appreciate that it's not always possible to do this, since some use-cases rely on external scripts being invoked - Nagios checks, for example, are going to be tricky to keep resident at all times. Your approach of making the actual check script a simple HTTP request seems reasonable enough, but the approach I took was to use passive checks and run an external service to periodically update the status. This allows the service generating check results to be resident as a daemon rather than requiring Nagios to invoke a script for each check.
Also, watch your system to see whether the slowness really is CPU overload or an IO issue. You can use utilities like vmstat to watch your IO usage. If you're IO-bound then optimising your code won't necessarily help a lot. In that case, if you're doing something like processing lots of text files (e.g. log files), you can store them gzipped and access them directly using Python's gzip module. This increases CPU load but reduces IO load because you only need to transfer the compressed data from disk. You can also write output files directly in gzipped format using the same approach.
I'm afraid I'm not particularly familiar with web2py specifically, but you can investigate whether it's easy to put a caching layer in front if the freshness of the data isn't totally critical. Try to make sure both your server and clients use conditional requests correctly, which will reduce request processing time. If they're using a back-end database, you could investigate whether something like memcached will help. These measures are only likely to give you real benefit if you're experiencing a reasonably high volume of requests or if each request is expensive to handle.
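A small sketch of the gzip approach (the file name and the filter are placeholders):

    import gzip

    # Reads the compressed file directly; only compressed bytes cross the disk,
    # at the cost of some extra CPU for decompression.
    error_count = 0
    with gzip.open("access.log.gz", "rt") as log:
        for line in log:
            if " 500 " in line:
                error_count += 1
    print(error_count)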
I should also add that generally reducing system load in other ways can occasionally give surprising benefits. I used to have a relatively small server running Apache and I found moving to nginx helped a surprising amount - I believe it was partly more efficient request handling, but primarily it freed up some memory that the filesystem cache could then use to further boost IO-bound operations.
Finally, if overhead is still a problem then carefully profile your most expensive scripts and optimise the hotspots. This could mean improving your Python code, or it could mean pushing code out to C extensions if that's an option for you. I've had some great performance gains by pushing data-path code out into C extensions for large-scale log processing and similar tasks (hundreds of GB of logs at a time). However, this is a heavy-duty, time-consuming approach and should be reserved for the few places where you really need the speed boost. It also depends on whether you have someone available who's familiar enough with C to do it.

Proper way to automatically test performance in Python (for all developers)?

Our Python application (a cool web service) has a full suite of tests (unit tests, integration tests etc.) that all developers must run before committing code.
I want to add some performance tests to the suite to make sure no one adds code that makes us run too slow (for some rather arbitrary definition of slow).
Obviously, I can collect some functionality into a test, time it and compare to some predefined threshold.
The tricky requirements:
I want every developer to be able to test the code on their own machine (these vary in CPU power, OS (Linux and some Windows!) and external configuration; the Python version, libraries and modules are the same). A test server, while generally a good idea, does not solve this.
I want the test to be DETERMINISTIC - regardless of what is happening on the machine running the tests, I want multiple runs of the test to return the same results.
My preliminary thoughts:
Use timeit and do a benchmark of the system every time I run the tests. Compare the performance test results to the benchmark.
Use cProfile to instrument the interpreter to ignore "outside noise". I'm not sure I know how to read the pstats structure yet, but I'm sure it is doable.
Other thoughts?
Thanks!
Tal.
Check out funkload - it's a way of running your unit tests as either functional or load tests to gauge how well your site is performing.
Another interesting project which can be used in conjunction with funkload is codespeed. This is an internal dashboard that measures the "speed" of your codebase for every commit you make to your code, presenting graphs with trends over time. This assumes you have a number of automatic benchmarks you can run - but it could be a useful way to have an authoritative account of performance over time. The best use of codespeed I've seen so far is the speed.pypy.org site.
As to your requirement for determinism - perhaps the best approach is to use statistics to your advantage. Automatically run the test N times and produce the min, max, average and standard deviation of all your runs. Check out this article on benchmarking for some pointers.
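A minimal sketch of collecting those statistics with timeit (the function under test and the repeat/number values are arbitrary):

    import statistics
    import timeit

    def function_under_test():
        return sorted(range(1000, 0, -1))

    # repeat=10 independent samples, each timing 1000 calls
    samples = timeit.repeat(function_under_test, repeat=10, number=1000)

    print("min   ", min(samples))
    print("max   ", max(samples))
    print("mean  ", statistics.mean(samples))
    print("stdev ", statistics.stdev(samples))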
I want the test to be DETERMINISTIC - regardless of what is happening on the machine running the tests, I want multiple runs of the test to return the same results.
Fail. More or less by definition this is utterly impossible in a multi-processing system with multiple users.
Either rethink this requirement or find a new environment in which to run tests that doesn't involve any of the modern multi-processing operating systems.
Further, your running web application is not deterministic, so imposing some kind of "deterministic" performance testing doesn't help much.
When we did time-critical processing (in radar, where "real time" actually meant real time) we did not attempt deterministic testing. We did code inspections and ran simple performance tests that involved simple averages and maximums.
Use cProfile to instrument the interpreter to ignore "outside noise". I'm not sure I know how to read the pstats structure yet, but I'm sure it is doable.
The Stats object created by the profiler is what you're looking for.
http://docs.python.org/library/profile.html#the-stats-class
Focus on 'pcalls', the primitive call count, in the profile statistics and you'll have something that's approximately deterministic.
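A sketch of pulling those primitive call counts out of a profile run; counting calls rather than time is what makes the result roughly machine-independent:

    import cProfile
    import pstats

    def code_under_test():
        return sorted(x % 7 for x in range(10000))

    profiler = cProfile.Profile()
    profiler.runcall(code_under_test)

    stats = pstats.Stats(profiler)
    # Sort by primitive call count; call counts are stable across machines
    # and runs, unlike the timing columns.
    stats.sort_stats("pcalls").print_stats(10)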

Alternatives to ApacheBench for profiling my code speed

I've done some experiments using Apache Bench to profile my code response times, and it doesn't quite generate the right kind of data for me. I hope the good people here have ideas.
Specifically, I need a tool that
Does HTTP requests over the network (it doesn't need to do anything very fancy)
Records response times as accurately as possible (at least to a few milliseconds)
Writes the response time data to a file without further processing (or provides it to my code, if a library)
I know about ab -e, which prints data to a file. The problem is that this prints only the quantile data, which is useful, but not what I need. The ab -g option would work, except that it doesn't print sub-second data, meaning I don't have the resolution I need.
I wrote a few lines of Python to do it, but httplib is horribly inefficient, so the results were useless. In general, I need better precision than pure Python is likely to provide. If anyone has suggestions for a library usable from Python, I'm all ears.
I need something that is high performance, repeatable, and reliable.
I know that half my responses are going to be along the lines of "internet latency makes that kind of detailed measurements meaningless." In my particular use case, this is not true. I need high resolution timing details. Something that actually used my HPET hardware would be awesome.
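For reference, the kind of pure-Python measurement loop described above might look roughly like this with the standard library (host, path, output file and request count are placeholders); whether its overhead is acceptable depends on the precision actually needed:

    import http.client
    import time

    HOST = "localhost"        # placeholder target
    PATH = "/"                # placeholder path

    with open("timings.csv", "w") as out:
        for _ in range(100):
            conn = http.client.HTTPConnection(HOST)
            start = time.perf_counter()
            conn.request("GET", PATH)
            response = conn.getresponse()
            response.read()
            elapsed = time.perf_counter() - start
            conn.close()
            out.write(f"{elapsed:.6f}\n")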
Throwing a bounty on here because of the low number of answers and views.
I have done this in two ways.
With "loadrunner" which is a wonderful but pretty expensive product (from I think HP these days).
With combination perl/php and the Curl package. I found the CURL api slightly easier to use from php. Its pretty easy to roll your own GET and PUT requests. I would also recommend manually running through some sample requests with Firefox and the LiveHttpHeaders add on to captute the exact format of the http requests you need.
JMeter is pretty handy. It has a GUI from which you can set up your requests and threadpools and it also can be run from the command line.
If you can code in Java, you can look at the combination of JUnitPerf + HttpUnit.
The downside is that you will have to do more things yourself. But in exchange you get unlimited flexibility and arguably more precision than with GUI tools, not to mention HTML parsing, JavaScript execution, etc.
There's also another project called The Grinder which seems to be aimed at a similar task, but I don't have any experience with it.
A good reference of open-source performance testing tools: http://www.opensourcetesting.org/performance.php
You will find descriptions and a "most popular" list
httperf is very powerful.
I've used a script to drive 10 boxes on the same switch to generate load by "replaying" requests to 1 server. I had my web app log response time (server side only) at the granularity I needed, but I didn't care about the response time to the client. I'm not sure you want to include the round trip to and from the client in your calculations, but if you do, it shouldn't be too difficult to code up. I then processed my log with a script which extracted the times per URL and produced scatter plots and trend graphs based on load.
This satisfied my requirements which were:
Real world distribution of calls to different urls.
Trending performance based on load.
Not influencing the web app by running other intensive ops on the same box.
I implemented the controller as a shell script that, for each server, started a background process looping over all the URLs in a file and calling curl on each one. I wrote the log processor in Perl since I was doing more Perl at that time.
