I would like to use statprof.py for profiling code in PyPy. Unfortunately, it does not seem to work: the line numbers it points to are off. Does anyone know how to make it work, or know of an alternative?
It's likely that "the line numbers are off" because PyPy, in JITted code, will inline many functions and will only deliver signals (here from the timer) at the end of the loops. Compare this with CPython, which delivers the signals between two random bytecodes -- occasionally at the end of the loops too, but generally anywhere. So what you get on PyPy is the same as what you'd get on CPython if you constrained the signal handlers to run only at the "end of loop" bytecode.
This is why this kind of profiling will seem to always miss a lot of functions, like most functions with no loop in them.
You can try to use the built-in cProfile module. It comes of course with a bigger performance hit than statistical profiling, but try it anyway --- it doesn't prevent JITting, for example, so the performance hit should still be reasonable.
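If it helps, here is a minimal sketch of driving the built-in cProfile module from code; slow_function is just a hypothetical stand-in for your own hot code:

    # Minimal sketch: profile a function with the standard-library cProfile module.
    import cProfile
    import pstats

    def slow_function():
        total = 0
        for i in range(10 ** 6):
            total += i * i
        return total

    cProfile.run("slow_function()", "profile.out")    # write raw stats to a file
    stats = pstats.Stats("profile.out")
    stats.sort_stats("cumulative").print_stats(10)    # show the 10 most expensive calls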
More generally, I don't see an easy way to implement the equivalent of statistical profiling in PyPy. It's quite hard to make it meaningful in the presence of functions that are inlined into each other and then optimized globally... I'd be interested if you can find a tool that actually does statistical profiling for some other high-level language on a VM with a tracing JIT.
We could record enough information to track each small group of assembler instructions back to the real Python function it comes from, and then use hacks to inspect the current Instruction Pointer (IP) at the machine level. Not impossible, but serious work :-)
Related
I have found that when I ask Python to do something more demanding, it doesn't use my machine's resources at 100% and it's not really fast. It is fast compared to many other interpreted languages, but compared to compiled languages I think the difference is really remarkable.
Is it possible to speedup things with a Just In Time (JIT) compiler in Python 3?
Usually a JIT compiler is the only thing that can improve performance in interpreted languages, so that's what I'm asking about; if other solutions are available, I would love to accept answers about those too.
First off, Python 3(.x) is a language, for which there can be any number of implementations. Okay, to this day no implementation except CPython actually implements those versions of the language. But that will change (PyPy is catching up).
To answer the question you meant to ask: CPython, 3.x or otherwise, does not, never did, and likely never will, contain a JIT compiler. Some other Python implementations (PyPy natively, Jython and IronPython by re-using JIT compilers for the virtual machines they build on) do have a JIT compiler. And there is no reason their JIT compilers would stop working when they add Python 3 support.
But while I'm here, also let me address a misconception:
Usually a JIT compiler is the only thing that can improve performances in interpreted languages
This is not correct. A JIT compiler, in its most basic form, merely removes interpreter overhead, which accounts for some of the slow down you see, but not for the majority. A good JIT compiler also performs a host of optimizations which remove the overhead needed to implement numerous Python features in general (by detecting special cases which permit a more efficient implementation), prominent examples being dynamic typing, polymorphism, and various introspective features.
Just implementing a compiler does not help with that. You need very clever optimizations, most of which are only valid in very specific circumstances and for a limited time window. JIT compilers have it easy here, because they can generate specialized code at run time (it's their whole point), can analyze the program easier (and more accurately) by observing it as it runs, and can undo optimizations when they become invalid. They can also interact with interpreters, unlike ahead of time compilers, and often do it because it's a sensible design decision. I guess this is why they are linked to interpreters in people's minds, although they can and do exist independently.
There are also other approaches to making a Python implementation faster, apart from optimizing the interpreter's code itself - for example, the HotPy (2) project. But those are currently in the research or experimentation stage, and have yet to show their effectiveness (and maturity) on real code.
And of course, a specific program's performance depends on the program itself much more than the language implementation. The language implementation only sets an upper bound for how fast you can make a sequence of operations. Generally, you can improve the program's performance much better simply by avoiding unnecessary work, i.e. by optimizing the program. This is true regardless of whether you run the program through an interpreter, a JIT compiler, or an ahead-of-time compiler. If you want something to be fast, don't go out of your way to get at a faster language implementation. There are applications which are infeasible with the overhead of interpretation and dynamicness, but they aren't as common as you'd think (and often, solved by calling into machine code-compiled code selectively).
The only Python implementation that has a JIT is PyPy. But note that PyPy is both a Python 2 implementation and a Python 3 implementation.
The Numba project should work on Python 3. Although it is not exactly what you asked, you may want to give it a try:
https://github.com/numba/numba/blob/master/docs/source/doc/userguide.rst.
It does not support all Python syntax at this time.
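For reference, here is a minimal sketch of the commonly documented @jit usage (Numba must be installed, and, as noted above, not all syntax is supported; the function shown is just an example):

    # Sketch of Numba's @jit decorator, assuming the usual documented API.
    from numba import jit

    @jit(nopython=True)        # compile to machine code, rejecting unsupported Python features
    def summation(n):
        total = 0.0
        for i in range(n):
            total += i * 0.5
        return total

    print(summation(10000000))  # the first call triggers compilation; later calls run at native speed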
You can try the PyPy py3 branch, which is more or less Python 3 compatible, but the official CPython implementation has no JIT.
This will best be answered by some of the remarkable Python developer folks on this site.
Still I want to comment: When discussing speed of interpreted languages, I just love to point to a project hosted at this location: Computer Language Benchmarks Game
It's a site dedicated to running benchmarks. There are specified tasks to do. Anybody can submit a solution in his/her preferred language and then the tests compare the runtime of each solution. Solutions can be peer reviewed, are often further improved by others, and results are checked against the spec. In the long run this is the most fair benchmarking system to compare different languages.
As you can see from indicative summaries like this one, compiled languages are quite fast compared to interpreted languages. However, the difference is probably not so much in the exact type of compilation, it's the fact that Python (and the others in the graph slower than python) are fully dynamic. Objects can be modified on the fly. Types can be modified on the fly. So some type checking has to be deferred to runtime, instead of compile time.
So while you can argue about compiler benefits, you have to take into account that there are different features in different languages. And those features may come at an intrinsic price.
Finally, when talking about speed: most often it's not the language and the perceived slowness of a language that's causing the issue, it's a bad algorithm. I have never had to switch languages because one was too slow: when there's a speed issue in my code, I fix the algorithm. However, if there are time-consuming, computationally intensive loops in your code, it is usually worth your while to recompile those. Prominent examples are libraries coded in C and used from scripting languages (Perl XS libs, or e.g. numpy/scipy for Python; lapack/blas are examples of libraries available with bindings for many scripting languages).
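As a small illustration of that last point, here is a hedged sketch of moving an inner loop from pure Python into numpy's compiled code (function names are just examples):

    # Illustration: hand a hot inner loop to a C-backed library instead of pure Python.
    import numpy as np

    def dot_python(xs, ys):
        # pure-Python inner loop: every iteration pays interpreter overhead
        return sum(x * y for x, y in zip(xs, ys))

    def dot_numpy(xs, ys):
        # the same loop runs inside numpy's compiled C code
        return float(np.dot(np.asarray(xs, dtype=float), np.asarray(ys, dtype=float)))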
If by JIT you mean a just-in-time compiler to a bytecode representation, then CPython has such a feature (since 2.2). If you mean JIT to machine code, then no. Still, the compilation to bytecode provides a lot of performance improvement. If you want compilation to machine code, then PyPy is the implementation you're looking for.
Note: pypy doesn't work with Python 3.x
If you are looking for speed improvements in a block of code, then you may want to have a look at rpythonic, which compiles down to C using PyPy. It uses a decorator that effectively turns the decorated code into a JIT for Python.
This question might seem vague, sorry. Does anybody have experience writing regexes in both Objective-C and Python? I am wondering about the performance of one vs. the other: which is faster in terms of (1) runtime speed and (2) memory consumption? I have a Mac OS application that is always running in the background, and I'd like my app to index some text files that are being saved, and then save the result... I could write a regex method in my app in Obj-C, or I could potentially write a separate app using Perl or Python (I'm just a beginner in Python).
(Thanks, I got some good info from some of you already. Boo to those who downvoted; I am here to learn, and I might have some stupid questions time to time - part of the deal.)
If you’re looking for raw speed, neither of those two would be a very good choice. For execution speed, you’d choose Perl. For how quickly you could code it up, either Python or Perl alike would easily beat the time to write it in Objective C, just as both would easily beat a Java solution. High-level languages that take less time to code up are always a win if all you’re measuring is time-to-solution compared with solutions that take many more lines of code.
As far as actual run-time performance goes, Perl’s regexes are written in very tightly coded C, and are known to be the fastest and most flexible regexes available. The regex optimizer does a lot of very clever things to the compiled regex program, such as applying an Aho–Corasick start-point optimization for finding the start of an alternation trie, running in O(1) time. Nobody else does that. Heck, I don’t think anybody else but Perl even bothers to optimize alternations into tries, which is the thing that takes you from O(n) to O(1), because the compiler spent more time doing something smart so that the interpreter runs much faster. Perl regexes also offer substantial improvements in debugging and profiling. They’re also more flexible than Python’s, but the debugging alone is enough to tip the balance.
The only exception on performance matters is with certain pathological patterns that degenerate when run under any recursive backtracker, whether Perl’s, Java’s, or Python’s. Those can be addressed by using the highly recommended RE2 library, written by Russ Cox, as a replacement plugin. I know it’s available as a transparent replacement regex engine for Perl, and I’m pretty sure I remember seeing that it was also available for Python, too.
On the other hand, if you really want to use Python but just want a more expressive and robust regex library, particularly one that is well-behaved on Unicode, then you want to use Matthew Barnett’s regex module, available for both Python2 and Python3. Besides conforming to tr18’s level-1 compliance requirements (that’s the standards doc on Unicode regexes), it also has all kinds of other clever features, some of which are completely sui generis. If you’re a regex connoisseur, it’s very much worth checking out.
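For the curious, here is a tiny sketch (Python 3 syntax) of the regex module's Unicode property support, which the standard re module lacks; install it with pip install regex:

    # Sketch of the third-party "regex" module's Unicode property classes.
    import regex

    text = "Crème brûlée costs 5€"
    print(regex.findall(r"\p{L}+", text))   # runs of Unicode letters -> ['Crème', 'brûlée', 'costs']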
In my Mac OS application I will be doing some text processing, and I was wondering if doing that in Python would be faster.
It will be faster in terms of development time, almost certainly. For nearly all software projects, development time dominates runtime as a measure of success.
If you mean runtime, then you're almost certainly doing premature optimization, unless you've shown that slow code will cause unbearable/noticeable user-interface slowdown.
Premature optimization is the root of all evil.
-- Donald Knuth
I have tried debugging Python 3 with Wing IDE (v4.1.3) and Komodo IDE (v7.0.0). As expected, the debugger adds a lot of run-time overhead. But what surprised me is how different the debuggers can be from each other.
Here are the run-times for the same program. No breakpoints or anything else, just a regular run without any actual debugging:
executed by python interpreter: 26 sec
executed by debugger #1: 137 sec
executed by debugger #2: 1143 sec
I refer to the debuggers as anonymous #1 and #2 lest this becomes an unintentional (and possibly misguided) advertising for one of them.
Is one of the debuggers really 8 times "faster"?
Or is there some design trade-off, where a faster debugger gives up some features, or precision, or robustness, or whatever, in return for greater speed? If so, I'd love to know those details, whether for Wing/Komodo specifically, or Python debuggers in general.
Writing an optimized Python debugger is like any other software: things can be really different performance-wise. I'm the PyDev author and I wrote the PyDev debugger, so I can comment on it, but not on the others; I'll just explain a bit about optimizing a Python debugger, as I've spent a lot of time optimizing the PyDev one. I won't really talk about other implementations, as I don't know how they were done -- except for pdb, which is not really a fast debugger implementation once it hits a breakpoint and you're stepping through the code, although it does work well by running things untraced until you actually reach the code that starts the tracing.
In particular, any 'naive' debugger can make your program much slower just by enabling the Python trace function in each frame and checking for a breakpoint match on every line executed. This is roughly how pdb works: whenever you enter a frame it traces it, and for each line executed it checks whether a breakpoint matches, so I believe any implementation that expects to be fast can't really rely on that approach.
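To make that concrete, here is a stripped-down, illustrative sketch of the naive tracing approach using sys.settrace (the breakpoint table and file name are hypothetical; a real debugger does much more):

    # Illustrative sketch of line tracing the way a naive debugger would do it.
    import sys

    breakpoints = {("example.py", 12)}   # hypothetical (filename, line) pairs

    def tracer(frame, event, arg):
        if event == "line":
            key = (frame.f_code.co_filename, frame.f_lineno)
            if key in breakpoints:
                print("hit breakpoint at", key)   # a real debugger would stop and open a prompt here
        return tracer   # keep receiving 'line' events for this frame and 'call' events for new ones

    sys.settrace(tracer)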
I know the PyDev debugger has several optimizations... the major one is the following: when the debugger enters a new frame (i.e.: function) it will check if there's any 'potential' breakpoint that may be hit there and if there's not, it won't even trace that function (on the other hand, when a breakpoint is added later on after the program is executing, it'll have to go and reevaluate all previous assumptions, as any current frame could end up skipping a breakpoint). And if it determines that some frame should be traced, it'll create a new instance for that frame which will be responsible for caching all that's related to that frame (this was only really possible from Python 2.5, so, when working on Python 2.4 and earlier, unless the threadframe extension is installed, the debugger will try to emulate that, which will make it considerably slower on Python 2.4).
Also, the PyDev debugger currently takes advantage of Cython, even though this is restricted to CPython (Jython, IronPython and PyPy don't take advantage of it), so many optimizations are still done with pure-Python mode in mind (thankfully Cython is close enough to Python that few changes are needed to make it work faster on CPython with Cython).
Some related posts regarding PyDev debugger optimization evolution:
http://pydev.blogspot.co.id/2017/03/pydev-560-released-faster-debugger.html
http://pydev.blogspot.co.id/2016/01/pydev-451-debug-faster.html
http://pydev.blogspot.com/2008/02/pydev-debugger-and-psyco-speedups.html
http://pydev.blogspot.com/2005/10/high-speed-debugger.html
Anyway, running with the debugger in place will always add some overhead (even one as heavily optimized as the PyDev debugger), so PyDev also provides the same approach that may be used with pdb: add a breakpoint in the code and tracing only starts at that point (this is the remote debugger feature of PyDev): http://pydev.org/manual_adv_remote_debugger.html
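For illustration, the "breakpoint in code" idea looks roughly like this; the pydevd call is an assumption about the setup described in the remote-debugger manual linked above:

    # Option 1: standard library -- execution pauses here in a pdb prompt.
    import pdb; pdb.set_trace()

    # Option 2 (assumption: pydevd is on the path and the PyDev debug server is
    # listening, as described in the linked manual):
    # import pydevd; pydevd.settrace('localhost', port=5678)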
And depending on the features you want the debugger to support, it can also be slower (e.g.: when you enable the break for caught exceptions in PyDev, the program will execute slower because it'll need to trace more things in order to properly break).
This is like asking why is program A faster than program B. The reason may not be specific to the domain of debugging.
I would imagine most graphical debuggers are built on top of, or act as a frontend to, the pdb module provided in the Python standard library (though not all are). The performance differences mostly boil down to implementation details and GUI-updating overhead. The difference could be as simple as doing unnecessary deep copies versus direct referencing in some layer of the code.
If you are concerned about debugger performance, then you should stick with the faster graphical debugger, unless it is missing some feature you need. Since you are asking about the performance difference rather than about a missing feature, presumably both are currently serving your needs feature-wise.
If one debugger is losing precision or sacrificing robustness, then I would discount it immediately from consideration. I would chance a guess that the slower one is either:
Doing extra work or unnecessary checks (if the developer doesn't trust other parts of the code, or there are too many layers or too much redundancy architecturally).
Using some cache-unfriendly or wasteful representation, or not having tuned or optimised the code algorithmically as much as the faster debugger. Correct data-structure choice and algorithmic optimisations can make orders of magnitude of difference, and low-level optimisations (like ctypes, psyco, pyrex) can add another order of magnitude. Remember that Python gives you flexible and powerful default containers that you always "pay" for, even if you don't need all their functionality.
I have found that WinPDB is fairly lightweight, and it is the one I usually use, since it doesn't tie me to an IDE and supports remote debugging quite effectively. You might also want to try Eclipse with PyDev. Another new one I have just started playing around with is Python Tools for Visual Studio, which looks very promising. There is also a list of Python debuggers on the Python wiki. It might be worth giving the others a try.
As to why one is faster than the other, there are a multitude of possible factors. If you have access to the source of the debugger, you might be able to profile the debugger itself to identify the performance bottlenecks. But if you just want a fast debugger that handles all the basic use cases, just try a few more out, and stick with the fastest that serves your needs.
Psyco is a specialising compiler for Python. The documentation states
Psyco can and will use large amounts of memory.
What are the main reasons for this memory usage? Is substantial memory overhead a feature of JIT compilers in general?
Edit: Thanks for the answers so far. There are three likely contenders.
Writing multiple specialised blocks, each of which require memory
Overhead due to compiling source on the fly
Overhead due to capturing enough data to do dynamic profiling
The question is, which one is the dominant factor in memory usage? I have my own opinion. But I'm adding a bounty, because I'd like to accept the answer that's actually correct! If anyone can demonstrate or prove where the majority of the memory is used, I'll accept it. Otherwise whoever the community votes for will be auto-accepted at the end of the bounty.
From the Psyco website: "The difference with the traditional approach to JIT compilers is that Psyco writes several versions of the same blocks (a block is a bit of a function), which are optimized by being specialized to some kinds of variables (a 'kind' can mean a type, but it is more general)".
"Psyco uses the actual run-time data that your program manipulates to write potentially several versions of the machine code, each differently specialized for different kinds of data." http://psyco.sourceforge.net/introduction.html
Many JIT compilers work with statically typed languages, so they know what the types are so can create machine code for just the known types. The better ones do dynamic profiling if the types are polymorphic and optimise the more commonly encountered paths; this is also commonly done with languages featuring dynamic types†. Psyco appears to hedge its bets in order to avoid doing a full program analysis to decide what the types could be, or profiling to find what the types in use are.
† I've never gone deep enough into Python to work out whether it has dynamic types in this sense (types whose structure can be changed at runtime after objects have been created with that type), or whether the common implementations merely check types at runtime; most articles just rave about dynamic typing without actually defining it in the context of Python.
The memory overhead of Psyco is currently large. It has been reduced a bit over time, but it is still an overhead. This overhead is proportional to the amount of Python code that Psyco rewrites; thus if your application has a few algorithmic "core" functions, these are the ones you will want Psyco to accelerate --- not the whole program.
So I would think the large memory requirements are due to the fact that it's loading source into memory and then compiling it as it goes. The more source you try to compile, the more memory it's going to need. I'd guess that if it's trying to optimise on top of that, it will look at multiple possible solutions to try to identify the best case.
Definitely, Psyco's memory usage comes from the compiled assembler blocks. Psyco sometimes suffers from overspecialization of functions, which means there are multiple versions of the assembler blocks. Also, and this is very important, Psyco never frees an assembler block once it has been allocated, even if the code associated with it is dead.
If you run your program under Linux you can look at /proc/xxx/smaps and watch a growing block of anonymous memory, in a different region than the heap. That's the anonymously mmap'ed part used for writing down assembler, which of course disappears when running without Psyco.
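If it's useful, here is a rough, illustrative helper for adding up those anonymous mappings from smaps (Linux only; the parsing is deliberately simplistic and assumes the standard smaps layout):

    # Rough sketch: total the "Size:" fields of anonymous mappings in /proc/<pid>/smaps.
    def total_anonymous_kb(pid):
        total = 0
        anonymous = False
        with open("/proc/%s/smaps" % pid) as f:
            for line in f:
                fields = line.split()
                if not fields:
                    continue
                if "-" in fields[0]:                # mapping header: "start-end perms offset dev inode [path]"
                    anonymous = len(fields) == 5    # no pathname column -> anonymous mapping
                elif fields[0] == "Size:" and anonymous:
                    total += int(fields[1])         # sizes are reported in kB
        return total

    print(total_anonymous_kb("self"))   # or pass the numeric PID of the process running psyco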
I like to use Python for almost everything, and I've always had it clear in my mind that if for some reason I found a bottleneck in my Python code (due to Python's limitations), I could always use a C script integrated into my code.
But then I started to read a guide on how to extend Python with C. In the article the author says:
There are several reasons why one might wish to extend Python in C or C++, such as:
Calling functions in an existing library.
Adding a new builtin type to Python
Optimising inner loops in code
Exposing a C++ class library to Python
Embedding Python inside a C/C++ application
Nothing about performance. So I ask again: is it reasonable to integrate Python with C for performance?
In my experience it is rarely necessary to optimize using C. I prefer to identify bottlenecks and improve algorithms in those areas completely in Python. Using hash tables, caching, and generally re-organizing your data structures to suit future needs has amazing potential for speeding up your program. As your program develops you'll get a better sense of what kind of material can be precalculated, so don't be afraid to go back and redo your storage and algorithms. Additionally, look for chances to kill "two birds with one stone", such as sorting objects as you render them instead of doing huge sorts.
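As a small illustration of the caching idea above (the names and the workload are hypothetical):

    # Memoize an expensive, pure function with a plain dict so repeated
    # calls become hash-table lookups instead of recomputation.
    _cache = {}

    def expensive(n):
        if n not in _cache:
            _cache[n] = sum(i * i for i in range(n))   # stand-in for the real work
        return _cache[n]

    expensive(1000000)   # computed once
    expensive(1000000)   # answered from the cache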
When everything is worked to the best of your knowledge, I'd consider using an optimizer like Psyco. I've experienced literally 10x performance improvements just by using Psyco and adding one line to my program.
If all else fails, use C in the proper places and you'll get what you want.
* Optimising inner loops in code
Isn't that about performance?
Performance is a broad topic, so you should be more specific. If the bottleneck in your program involves a lot of networking, then rewriting it in C/C++ probably won't make a difference, since it's the network calls taking up time, not your code. You would be better off rewriting the slow section of your program to use fewer network calls, thus reducing the time your program spends waiting on network I/O. If you're doing math-intensive stuff such as solving differential equations, and you know there are C libraries that can offer better performance than the way you are currently doing it in Python, you may want to rewrite that section of your program to use those libraries to increase its performance.
The C extensions API is notoriously hard to work with, but there are a number of other ways to integrate C code.
For some more usable alternatives see http://www.scipy.org/PerformancePython, in particular the section about using Weave for easy inlining of C code.
Also of interest is Cython, which provides a nice system for integrating with C code. Cython is used for optimization by some well-respected high-performance Python projects such as NumPy and Sage.
As mentioned above, Psyco is another attractive option for optimization, and one which requires nothing more than
import psyco                # specializing JIT-style compiler for CPython 2.x
psyco.bind(myfunction)      # compile and specialize myfunction using run-time information
Psyco will identify your inner loops and automatically substitute optimized versions of the routines.
C can definitely speed up processor-bound tasks. Integrating it is even easier now, with the ctypes library, or you could go with any of the other methods you mention.
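For example, here is a minimal ctypes sketch that calls a C library function directly; the library name shown is Linux-specific, so adjust it for your platform:

    # Call a C function from Python via the standard-library ctypes module.
    import ctypes

    libm = ctypes.CDLL("libm.so.6")            # platform-specific name (Linux shown)
    libm.sqrt.restype = ctypes.c_double
    libm.sqrt.argtypes = [ctypes.c_double]
    print(libm.sqrt(2.0))                      # 1.4142135623730951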
I feel Mercurial has done a good job with this kind of integration, if you want to look at its code as an example. The compute-intensive tasks are in C, and everything else is Python.
You will gain a large performance boost using C from Python (assuming your code is well written, etc) because Python is interpreted at run time, whereas C is compiled beforehand. This will speed up things quite a bit because with C, your code is simply running, whereas with Python, the Python interpreter must figure out what you are doing and interpret it into machine instructions.
I've been told: for the calculating portion use C, for the scripting use Python. So yes, you can integrate both. C is capable of faster calculations than Python.