PyPy (Python) optimization

I'm looking into replacing some C code with Python code, using PyPy as the interpreter. The code does a lot of list/dictionary operations, so to get a rough idea of PyPy's performance versus C I've been writing sorting algorithms. To test all my read functions I wrote a bubble sort, both in Python and C++. CPython of course did poorly at 6.468s, PyPy came in at 0.366s, and C++ at 0.229s. Then I remembered that I had forgotten -O3 on the C++ build, and the time went down to 0.042s. For a 32768-element dataset, C++ with -O3 takes only 2.588s while PyPy takes 19.65s. Is there anything I can do to speed up my Python code (besides using a better sorting algorithm, of course) or anything in how I use PyPy (some flag or something)?
Python code (read_nums module omitted since its time is trivial: 0.036s on the 32768 dataset):
import read_nums
import sys

nums = read_nums.read_nums(sys.argv[1])
done = False
while not done:
    done = True
    for i in range(len(nums) - 1):
        if nums[i] > nums[i + 1]:
            nums[i], nums[i + 1] = nums[i + 1], nums[i]
            done = False
$ time pypy-c2.0 bubble_sort.py test_32768_1.nums
real 0m20.199s
user 0m20.189s
sys 0m0.009s
C++ code (read_nums function again omitted since it takes little time: 0.017s):
#include <iostream>
#include <vector>
#include "read_nums.h"

int main(int argc, char** argv)
{
    std::vector<int> nums;
    int count, i, tmp;
    bool done;
    if(argc < 2)
    {
        std::cout << "Usage: " << argv[0] << " filename" << std::endl;
        return 1;
    }
    count = read_nums(argv[1], nums);
    done = false;
    while(!done)
    {
        done = true;
        for(i=0; i<count-1; ++i)
        {
            if(nums[i] > nums[i+1])
            {
                tmp = nums[i];
                nums[i] = nums[i+1];
                nums[i+1] = tmp;
                done = false;
            }
        }
    }
    for(i=0; i<count; ++i)
    {
        std::cout << nums[i] << ", ";
    }
    return 0;
}
$ time ./bubble_sort test_32768_1.nums > /dev/null
real 0m2.587s
user 0m2.586s
sys 0m0.001s
P.S. Some of the numbers given in the first paragraph differ a little from the numbers reported by time above because they're the numbers I got on the first run.
Further improvements:
Just tried xrange instead of range and the run time went down to 16.370s.
Moved the code (everything from the first done = False through the loop) into a function; the run time is now 8.771-8.834s.
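For reference, here is a sketch of that function-wrapped version (my own reconstruction, assuming the same read_nums module); the speedup is likely because the JIT handles locals inside a function much better than module-level globals:

import read_nums
import sys

def bubble_sort(nums):
    done = False
    while not done:
        done = True
        for i in range(len(nums) - 1):
            if nums[i] > nums[i + 1]:
                # swap out-of-order neighbours and keep scanning
                nums[i], nums[i + 1] = nums[i + 1], nums[i]
                done = False

nums = read_nums.read_nums(sys.argv[1])
bubble_sort(nums)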

The most relevant way to answer this question is to note that the speeds of C, CPython and PyPy do not differ by a constant factor: it depends most importantly on what the code does and the way it is written. For example, if your C code is doing naive things like walking arrays where the "equivalent" Python code would naturally use dictionaries, then any implementation of Python is faster than the C, provided the arrays are long enough. Of course, this is not the case in most real-life examples, but the same argument still applies to a smaller extent. There is no one-size-fits-all way to predict the relative speed of a program written in C, or rewritten in Python and running on CPython or PyPy.
Obviously there are guidelines about these relative speeds: on small algorithmic examples you can expect the speed of PyPy to approach that of "gcc -O0". In your example it is "only" 1.6x slower. We might help you optimize it, or even find optimizations missing in PyPy, in order to gain 10% or 30% more speed. But this is a tiny example that has nothing to do with your real program. For the reasons above, the speed we get here is only vaguely related to the speed you'll get in the end.
I can only say that rewriting code from C to Python for reasons of clarity, notably when the C has become too tangled up for further development, is clearly a win in the long run --- even if in the end you need to rewrite some parts of it in C again. PyPy's goal here is to reduce the need for that. While it would be nice to say that no-one ever needs C any more, it's just not true :-)

Related

Why is this memoized Euler 14 implementation so much slower in Raku than Python?

I was recently playing with Project Euler problem 14: which number in the range 1..1_000_000 produces the longest Collatz sequence?
I'm aware of the issue of having to memoize to get reasonable times, and the following piece of Python code returns an answer relatively quickly using that technique (memoize to a dict):
#!/usr/bin/env python
L = 1_000_000
cllens={1:1}
cltz = lambda n: 3*n + 1 if n%2 else n//2
def cllen(n):
    if n not in cllens: cllens[n] = cllen(cltz(n)) + 1
    return cllens[n]
maxn=1
for i in range(1,L+1):
    ln=cllen(i)
    if (ln > cllens[maxn]): maxn=i
print(maxn)
(adapted from here; I prefer this version that doesn't use max, because I might want to fiddle with it to return the longest 10 sequences, etc.).
I have tried to translate it to Raku staying as semantically close as I could:
#!/usr/bin/env perl6
use v6;
my $L=1_000_000;
my %cllens = (1 => 1);
sub cltz($n) { ($n %% 2) ?? ($n div 2) !! (3*$n+1) }
sub cllen($n) {
    (! %cllens{$n}) && (%cllens{$n} = 1+cllen($n.&cltz));
    %cllens{$n};
}
my $maxn=1;
for (1..$L) {
    my $ln = cllen($_);
    ($ln > %cllens{$maxn}) && ($maxn = $_)
}
say $maxn
Here are the sorts of times I am consistently getting running these:
$ time <python script>
837799
real 0m1.244s
user 0m1.179s
sys 0m0.064s
On the other hand, in Raku:
$ time <raku script>
837799
real 0m21.828s
user 0m21.677s
sys 0m0.228s
Question(s)
Am I mistranslating between the two, or is the difference an irreconcilable matter of starting up a VM, etc.?
Are there clever tweaks / idioms I can apply to the Raku code to speed it up considerably past this?
Aside
Naturally, it's not so much about this specific Euler project problem; I'm more broadly interested in whether there are any magical speedup arcana appropriate to Raku I am not aware of.
I think the majority of the extra time is because Raku has type checks, and they aren't getting removed by the runtime type specializer. Or, if they are getting removed, it is only after a significant amount of time.
Generally the way to optimize Raku code is first to run it with the profiler:
$ raku --profile test.raku
Of course that fails with a segfault on this code, so we can't use it.
My guess would be that much of the time is related to using the Hash.
If it were implemented, using native ints for the key and value might help:
my int %cllens{int} = (1 => 1);
Then declaring the functions as using native-sized ints could be a bigger win.
(Currently this is a minor improvement at best.)
sub cltz ( int $n --> int ) {…}
sub cllen( int $n --> int ) {…}
for (1..$L) -> int $_ {…}
Of course, as I said, native hashes aren't implemented, so that is pure speculation.
You could try to use the multi-process abilities of Raku, but there may be issues with the shared %cllens variable.
The problem could also be because of recursion. (Combined with the aforementioned type checks.)
If you rewrote cllen so that it used a loop instead of recursion that might help.
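Sketching that loop-based rewrite in Python terms (my own illustration of the suggestion, not a Raku translation):

cllens = {1: 1}  # memo table: n -> Collatz sequence length

def cllen(n):
    # walk down the sequence until we reach a value already in the memo
    chain = []
    while n not in cllens:
        chain.append(n)
        n = 3 * n + 1 if n % 2 else n // 2
    # fill in lengths for the values we passed, closest-to-known first
    length = cllens[n]
    for m in reversed(chain):
        length += 1
        cllens[m] = length
    return length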
Note:
The closest to n not in cllens is probably %cllens{$n}:!exists.
Though that might be slower than just checking that the value is not zero.
Also cllen is kinda terrible. I would have written it more like this:
sub cllen($n) {
    %cllens{$n} //= cllen(cltz($n)) + 1
}

Why does this supposedly infinite loop program terminate?

I was talking to my friend about these two pieces of code. He said the Python one terminates, but the C++ one doesn't.
Python:
arr = [1, 2, 3]
for i in range(len(arr)):
    arr.append(i)
print("done")
C++:
#include <iostream>
#include <vector>
using namespace std;

int main() {
    vector<int> arr{1, 2, 3};
    for (int i = 0; i < arr.size(); i++) {
        arr.push_back(i);
    }
    cout << "done" << endl;
    return 0;
}
I challenged that and ran it on two computers. The first one ran out of memory (bad_alloc) because it had 4 GB of RAM. My Mac has 12 GB of RAM and was able to run it and terminate just fine.
I thought it wouldn't run forever because the type of vector's size() is an unsigned int. Since my Mac is 64-bit, I thought it could store 2^(64-2) = 2^62 ints (which is true), but the unsigned int used for the size is 32 bits for some reason.
Is this some bug in the C++ compiler that fails to make max_size() relative to the system's hardware? The overflow causes the program to terminate. Or is it for some other reason?
There is not a bug in your C++ compiler manifesting itself here.
int is overflowing (due to the i++), the behaviour of which is undefined. (It's feasible that you'll run out of memory on some platforms before this overflow occurs.) Note that there is no defined behaviour that will make i negative, although that is a common occurrence on machines with 2's complement signed integral types once std::numeric_limits<int>::max() is attained, and if i were -1 say then i < arr.size() would be false due to the implicit conversion of i to an unsigned type.
The Python version pre-computes range(len(arr)); that is, the range is evaluated once up front, so subsequent appends do not change the number of iterations.
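A quick way to see that (my own illustration):

arr = [1, 2, 3]
for i in range(len(arr)):  # range(3) is computed once, right here
    arr.append(i)
print(arr)  # [1, 2, 3, 0, 1, 2] -- the loop ran exactly three times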

What is the maximum depth level of nested For loops in Python 3? [duplicate]

This question already has answers here:
Why does Python have a limit on the number of static blocks that can be nested?
I was wondering if such a "level" exists. I know that in C the limit is 127, but I couldn't find any information regarding Python.
For example:
for _ in range(1):      # level 0
    for _ in range(1):  # level 1
        ...
        for _ in range(1):  # level max
            print("something")
For CPython 3.7 (and previous CPython versions), the limit is 20. But that's a somewhat arbitrary feature of the implementation; it is not required by the Python language definition and other Python implementations (at least Jython) have different limits. See below.
I know that in C the limit is 127
That's not precise. The C standard suggests that a compiler needs to be able to handle a program with blocks nested at least to depth 127, but it doesn't provide any maximum nesting depth. It does say that:
Implementations should avoid imposing fixed translation limits whenever possible.
In fact, the standard doesn't actually say that an implementation must be able to handle any program with block nesting of 127; what it says, is that it must be able to handle "at least one" such program. (Being able to handle any program would be a bit much to ask, because the compiler might choke on some other issue with the program text.)
Python is even less clearly specified.
Of course, you can attempt to figure out the limit by just trying successively deeper nesting levels. It's easy to generate such programs with a bash script (see below) or even with python. But that doesn't tell you anything about the language in abstract. All it tells you is what the limits are in the specific implementation you happen to be using.
You will probably find different limits in a new version of the same language, a different implementation of the same language, or even the same language on a different host.
For example, I used the following bash-function to generate sample python programs:
genpython () {
    "${2:-python}" -c "$(
        for ((i=0; i<$1; ++i)); do
            printf "%*sfor i%d in range(1):\n" $i "" $i;
        done;
        printf '%*sprint("Nested %d worked!")\n' $1 "" $1;
    )"
}
which produces programs which look like (using genpython 10):
for i0 in range(1):
 for i1 in range(1):
  for i2 in range(1):
   for i3 in range(1):
    for i4 in range(1):
     for i5 in range(1):
      for i6 in range(1):
       for i7 in range(1):
        for i8 in range(1):
         for i9 in range(1):
          print("Nested 10 worked!")
Both Python2 (CPython 2.7.15rc1) and Python3 (CPython 3.6.7) topped out at 20:
$ genpython 20 python2
Nested 20 worked!
$ genpython 21 python2
SyntaxError: too many statically nested blocks
$ genpython 20 python3
Nested 20 worked!
$ genpython 21 python3
SyntaxError: too many statically nested blocks
However, jython (2.7.1) was able to go up to 99 (failing with a stack trace rather than a simple syntax error):
$ genpython 99 jython
Nested 99 worked!
$ genpython 100 jython
java.lang.ArrayIndexOutOfBoundsException: Index 100 out of bounds for length 100
at org.python.antlr.PythonTokenSource.push(PythonTokenSource.java:323)
at org.python.antlr.PythonTokenSource.handleIndents(PythonTokenSource.java:288)
at org.python.antlr.PythonTokenSource.insertImaginaryIndentDedentTokens(PythonTokenSource.java:222)
at org.python.antlr.PythonTokenSource.nextToken(PythonTokenSource.java:143)
at org.antlr.runtime.CommonTokenStream.fillBuffer(CommonTokenStream.java:119)
at org.antlr.runtime.CommonTokenStream.LT(CommonTokenStream.java:238)
at org.python.antlr.PythonParser.file_input(PythonParser.java:560)
at org.python.antlr.BaseParser.parseModule(BaseParser.java:77)
at org.python.core.CompileMode$3.dispatch(CompileMode.java:22)
at org.python.core.ParserFacade.parse(ParserFacade.java:158)
at org.python.core.ParserFacade.parse(ParserFacade.java:203)
at org.python.core.Py.compile_flags(Py.java:2205)
at org.python.util.PythonInterpreter.exec(PythonInterpreter.java:267)
at org.python.util.jython.run(jython.java:394)
at org.python.util.jython.main(jython.java:142)
java.lang.ArrayIndexOutOfBoundsException: java.lang.ArrayIndexOutOfBoundsException: Index 100 out of bounds for length 100
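As noted above, such programs can also be generated from Python itself; here is a minimal sketch of that probe (my own illustration, not part of the original experiment):

def max_nesting():
    # Generate a program with n nested "for" blocks and try to compile it,
    # increasing n until CPython refuses with a SyntaxError
    # ("too many statically nested blocks").
    n = 1
    while True:
        src = "".join("%sfor i%d in range(1):\n" % ("    " * i, i) for i in range(n))
        src += "    " * n + "pass\n"
        try:
            compile(src, "<generated>", "exec")
        except SyntaxError:
            return n - 1
        n += 1

print(max_nesting())  # prints 20 on the CPython versions tested above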
Out of curiosity, I tried the same thing with C, using a slightly modified script:
genc () {
    {
        echo '#include <stdio.h>'
        echo 'int main(void) {';
        for ((i=0; i<$1; ++i)); do
            printf "%*s for (int i%d=0; i%d==0; ++i%d) {\n" $i "" $i $i $i;
        done;
        printf '%*s puts("Nested %d worked!");\n' $1 "" $1;
        for ((i=$1; i-->0; 1)); do
            printf "%*s }\n" $i "";
        done;
        echo ' return 0;'
        echo '}'
    } | ${2:-gcc} "${@:3}" -std=c11 -x c - && ./a.out
}
This produces, for example:
#include <stdio.h>
int main(void) {
 for (int i0=0; i0==0; ++i0) {
  for (int i1=0; i1==0; ++i1) {
   for (int i2=0; i2==0; ++i2) {
    puts("Nested 3 worked!");
   }
  }
 }
 return 0;
}
(Note that the nesting depth in C is one greater than the number of nested for statements, because int main() {...} counts too. So although the above says its nesting depth is 3, it's really depth 4.)
With gcc (v8.3.0), I got up to a nesting depth of 15000. I didn't try any further because the compile times and memory usage were getting out of control. But clang (v7.0.0) produced an error much earlier, albeit with a suggested work-around:
$ genc 255 clang
Nested 255 worked!
$ genc 256 clang
<stdin>:258:291: fatal error: bracket nesting level exceeded maximum of 256
...for (int i255=0; i255==0; ++i255) {
^
<stdin>:258:291: note: use -fbracket-depth=N to increase maximum nesting level
1 error generated.
Using the suggested command-line option, I was able to reliably get up to a nesting depth of 3574 (i.e. 3573 nested for loops). Beyond that, things started failing inconsistently: depth 3575 fails about one in three tries, depth 3576 about half the time, and depth 3577 about 80% of the time. The failure is presumably a compiler bug (at least, that's what the compiler prints out after a long stack trace and an even longer list of error messages).

Why is my python/numpy example faster than pure C implementation?

I have pretty much the same code in Python and C. Python example:
import numpy

nbr_values = 8192
n_iter = 100000
a = numpy.ones(nbr_values).astype(numpy.float32)
for i in range(n_iter):
    a = numpy.sin(a)
C example:
#include <stdio.h>
#include <math.h>

int main(void)
{
    int i, j;
    int nbr_values = 8192;
    int n_iter = 100000;
    double x;
    for (j = 0; j < nbr_values; j++){
        x = 1;
        for (i=0; i<n_iter; i++)
            x = sin(x);
    }
    return 0;
}
Something strange happened when I ran both examples:
$ time python numpy_test.py
real 0m5.967s
user 0m5.932s
sys 0m0.012s
$ g++ sin.c
$ time ./a.out
real 0m13.371s
user 0m13.301s
sys 0m0.008s
It looks like Python/numpy is twice as fast as C. Is there any mistake in the experiment above? How can you explain it?
P.S. I have Ubuntu 12.04, 8G ram, core i5 btw
First, turn on optimization. Secondly, subtleties matter. Your C code is definitely not 'basically the same'.
Here is equivalent C code:
sinary2.c:
#include <math.h>
#include <stdlib.h>

float *sin_array(const float *input, size_t elements)
{
    int i = 0;
    float *output = malloc(sizeof(float) * elements);
    for (i = 0; i < elements; ++i) {
        output[i] = sin(input[i]);
    }
    return output;
}
sinary.c:
#include <math.h>
#include <stdlib.h>

extern float *sin_array(const float *input, size_t elements);

int main(void)
{
    int i;
    int nbr_values = 8192;
    int n_iter = 100000;
    float *x = malloc(sizeof(float) * nbr_values);
    for (i = 0; i < nbr_values; ++i) {
        x[i] = 1;
    }
    for (i=0; i<n_iter; i++) {
        float *newary = sin_array(x, nbr_values);
        free(x);
        x = newary;
    }
    return 0;
}
Results:
$ time python foo.py
real 0m5.986s
user 0m5.783s
sys 0m0.050s
$ gcc -O3 -ffast-math sinary.c sinary2.c -lm
$ time ./a.out
real 0m5.204s
user 0m4.995s
sys 0m0.208s
The reason the program has to be split in two is to fool the optimizer a bit. Otherwise it will realize that the whole loop has no effect at all and optimize it out. Putting things in two files doesn't give the compiler visibility into the possible side-effects of sin_array when it's compiling main and so it has to assume that it actually has some and repeatedly call it.
Your original program is not at all equivalent for several reasons. One is that you have nested loops in the C version and you don't in Python. Another is that you are working with arrays of values in the Python version and not in the C version. Another is that you are creating and discarding arrays in the Python version and not in the C version. And lastly you are using float in the Python version and double in the C version.
Simply calling the sin function the appropriate number of times does not make for an equivalent test.
Also, the optimizer is a really big deal for C. Comparing C code on which the optimizer hasn't been used to anything else when you're wondering about a speed comparison is the wrong thing to do. Of course, you also need to be mindful. The C optimizer is very sophisticated and if you're testing something that really doesn't do anything, the C optimizer might well notice this fact and simply not do anything at all, resulting in a program that's ridiculously fast.
Because "numpy" is a dedicated math library implemented for speed. C has standard functions for sin/cos, that are generally derived for accuracy.
You are also not comparing apples with apples, as you are using double in C and float32 (float) in Python. If we change the Python code to calculate float64 instead, the time increases by about 2.5 seconds on my machine, making it roughly match the correctly optimized C version.
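For instance, the only change needed in the Python code above is the dtype (my illustration):

a = numpy.ones(nbr_values).astype(numpy.float64)  # was numpy.float32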
If the whole test were made to do something more complicated requiring more control structures (if/else, do/while, etc.), then you would probably see even less difference between C and Python, because the C compiler can't really compute "sin" any faster - unless you implement a better "sin" function.
Never mind the fact that your code isn't quite the same on both sides... ;)
You seem to be doing the same operation in C 8192 x 100000 times, but only 100000 times in Python (I haven't used numpy before, so I may misunderstand the code). Why are you using an array in the Python case (again, I'm not used to numpy, so perhaps the dereferencing is implicit)? If you wish to use an array, be careful: doubles have a performance hit in terms of caching and optimised vectorisation. You're using different types between the two implementations (float vs double), but given the algorithm I don't think it matters.
The main reason for a lot of anomalous benchmark results in C-versus-Python comparisons is simply that the C implementation is often poor.
https://www.ibm.com/developerworks/community/blogs/jfp/entry/A_Comparison_Of_C_Julia_Python_Numba_Cython_Scipy_and_BLAS_on_LU_Factorization?lang=en
If you notice, the guy writes C to process an array of doubles (without using the restrict or const keywords where he could have), builds with optimisation, and then forces the compiler to use SIMD rather than AVX. In short, the compiler is using an inefficient instruction set for doubles, and the wrong type of registers too, if he wanted performance - you can be sure that numba and numpy will be using as many bells and whistles as possible, and will ship with very efficient C and C++ libraries to begin with. In short, if you want speed with C you have to think about it; you may even have to disassemble the code, and perhaps disable optimisation and use compiler intrinsics instead. C gives you the tools to do it, so don't expect the compiler to do all the work for you. If you want that degree of freedom, use Cython, Numba, Numpy, Scipy etc. They're very fast, but you won't be able to eke out every bit of performance from the machine - to do that, use C, C++ or new versions of FORTRAN.
Here is a very good article on these very points (I'd use SciPy):
https://www.scipy.org/scipylib/faq.html

How do I connect a Python and a C program?

I have a Python-based program that reads serial data off a port connected to an RS-232 cable. I want to pass the data I get there to a C program that will handle the computation-intensive side of things. I have been checking around the net and all I've found is Linux-based.
My suggestion would be the inline function from the instant module, though that only works if you can do everything you need in a single C function. You just pass it a C function and it compiles a C extension at runtime.
from instant import inline

sieve_code = """
PyObject* prime_list(int max) {
    PyObject *list = PyList_New(0);
    int *numbers, *end, *n;
    numbers = (int *) calloc(sizeof(int), max);
    end = numbers + max;
    numbers[2] = 2;
    for (int i = 3; i < max; i += 2) { numbers[i] = i; }
    for (int i = 3; i < sqrt(max); i++) {
        if (numbers[i] != 0) {
            for (int j = i + i; j < max; j += i) { numbers[j] = 0; }
        }
    }
    for (n = numbers; n < end; n++) {
        if (*n != 0) { PyList_Append(list, PyInt_FromLong(*n)); }
    }
    free(numbers);
    return list;
}
"""
sieve = inline(sieve_code)
There are a number of ways to do this.
The rawest, simplest way is to use the Python C API and write a wrapper for your C library which can be called from Python. This ties your module to CPython.
The second way is to use ctypes which is an FFI for Python that allows you to load and call functions in C libraries directly. In theory, this should work across Python implementations.
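A rough sketch of the ctypes route (my own illustration; libcrunch.so and process_samples are hypothetical names, not a real library):

import ctypes

# Load the compiled C library and describe the function's signature.
lib = ctypes.CDLL("./libcrunch.so")
lib.process_samples.argtypes = [ctypes.POINTER(ctypes.c_double), ctypes.c_size_t]
lib.process_samples.restype = ctypes.c_double

# Hand a C array of doubles to the C side and read back the result.
data = (ctypes.c_double * 4)(1.0, 2.0, 3.0, 4.0)
result = lib.process_samples(data, len(data))
print(result)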
A third way is to use Pyrex or its next-generation version, Cython, which allows you to annotate your Python code with type information that the compiler can convert into compiled code. It can be used to write wrappers too. AFAIK, it's tied to CPython.
Yet another way is to use SWIG which is a tool that generates glue code that helps you wrap C libraries for use from Python. It's basically the first approach with a helper tool.
Another way is to use Boost Python API which is an object oriented wrapper over the raw Python C API.
All of the above let you do your work in the same process.
If that's not a constraint, as Digital Ross suggested, you can simply spawn a subprocess and hand over arguments (either on the command line or via its standard input) and have the external process do the work for you.
Use a pipe and popen
The easiest way to deal with this is probably to just use popen(3). The popen function is available in both Python and C and will connect a program of either language with the other using a pipe.
>>> import subprocess
>>> print args
['/bin/vikings', '-input', 'eggs.txt', '-output', 'spam spam.txt', '-cmd', "echo '$MONEY'"]
>>> p = subprocess.Popen(args)
Once you have the pipe, you should probably send YAML or JSON through it, though I've never tried to read either in C. If it's really a simple stream, just parse it yourself. If you like XML, I suppose that's available as well.
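For example, a minimal sketch of that pipe idea (my own illustration; ./cruncher stands in for a hypothetical C program that reads JSON on stdin and writes a result to stdout):

import json
import subprocess

# Start the C worker and connect to its stdin/stdout through pipes.
p = subprocess.Popen(["./cruncher"], stdin=subprocess.PIPE,
                     stdout=subprocess.PIPE, universal_newlines=True)
out, _ = p.communicate(json.dumps({"samples": [1.2, 3.4]}) + "\n")
print(out)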
How many bits per second are you getting across this RS-232 cable? Do you have test results showing that Python won't do the crunchy bits fast enough? If the C program is yet to be written, consider the possibility of writing the computation-intensive side of things in Python, with an easy fallback to Cython in the event that Python isn't fast enough.
Indeed this question does not have much to do with C++.
Having said that, you can try SWIG - it's multi-platform and allows function calls from Python to C/C++.
I would use a standard form of IPC like a socket.
A good start would be Beej's Guide.
Also, don't tag the question with C++ if you are specifically using C. C and C++ are different languages.
I'd use ctypes: http://python.net/crew/theller/ctypes/tutorial.html
It allows you to call C (and C++) code from Python.
