I'm intending to create a web service which performs a large number of manually-specified calculations as fast as possible, and have been exploring the use of the DLR.
Sorry if this is long but feel free to skim over and get the general gist.
I've been using the IronPython library as it makes the calculations very easy to specify. My work laptop gives a performance of about 400,000 calculations per second doing the following:
ScriptEngine py = Python.CreateEngine();
ScriptScope pys = py.CreateScope();
ScriptSource src = py.CreateScriptSourceFromString(@"
def result():
    res = [None]*1000000
    for i in range(0, 1000000):
        res[i] = b.GetValue() + 1
    return res
result()
");
CompiledCode compiled = src.Compile();
pys.SetVariable("b", new DynamicValue());
long start = DateTime.Now.Ticks;
var res = compiled.Execute(pys);
long end = DateTime.Now.Ticks;
Console.WriteLine("...Finished. Sample data:");
for (int i = 0; i < 10; i++)
{
    Console.WriteLine(res[i]);
}
Console.WriteLine("Took " + (end - start) / 10000 + "ms to run 1000000 times.");
Where DynamicValue is a class that returns random numbers from a pre-built array (seeded and built at run time).
When I create a DLR class to do the same thing, I get much higher performance (~10,000,000 calculations per second). The class is as follows:
class DynamicCalc : IDynamicMetaObjectProvider
{
    DynamicMetaObject IDynamicMetaObjectProvider.GetMetaObject(Expression parameter)
    {
        return new DynamicCalcMetaObject(parameter, this);
    }

    private class DynamicCalcMetaObject : DynamicMetaObject
    {
        internal DynamicCalcMetaObject(Expression parameter, DynamicCalc value) : base(parameter, BindingRestrictions.Empty, value) { }

        public override DynamicMetaObject BindInvokeMember(InvokeMemberBinder binder, DynamicMetaObject[] args)
        {
            Expression Add = Expression.Convert(Expression.Add(args[0].Expression, args[1].Expression), typeof(System.Object));
            DynamicMetaObject methodInfo = new DynamicMetaObject(Expression.Block(Add), BindingRestrictions.GetTypeRestriction(Expression, LimitType));
            return methodInfo;
        }
    }
}
and is called/tested in the same way by doing the following:
dynamic obj = new DynamicCalc();
long t1 = DateTime.Now.Ticks;
for (int i = 0; i < 10000000; i++)
{
    results[i] = obj.Add(ar1[i], ar2[i]);
}
long t2 = DateTime.Now.Ticks;
Where ar1 and ar2 are pre-built, runtime seeded arrays of random numbers.
The speed is great this way, but it's not easy to specify the calculation. I'd basically be looking at creating my own lexer & parser, whereas IronPython has everything I need already there.
I'd have thought I could get much better performance from IronPython since it is implemented on top of the DLR, and I could do with better than what I'm getting.
Is my example making best use of the IronPython engine? Is it possible to get significantly better performance out of it?
(Edit) Same as first example but with the loop in C#, setting variables and calling the python function:
ScriptSource src = py.CreateScriptSourceFromString(@"b + 1");
CompiledCode compiled = src.Compile();
double[] res = new double[1000000];
for (int i = 0; i < 1000000; i++)
{
    pys.SetVariable("b", args1[i]);
    res[i] = compiled.Execute(pys);
}
where pys is a ScriptScope from py, and args1 is a pre-built array of random doubles. This example executes slower than running the loop in the Python code and passing in the entire arrays.
delnan's comment points at some of the problems here, but I'll get specific about what the differences are. In the C# version you've cut out a significant number of the dynamic calls that you have in the Python version. For starters, your loop is typed to int and it sounds like ar1 and ar2 are strongly typed arrays. So in the C# version the only dynamic operations you have are the call to obj.Add (which is 1 operation in C#) and potentially the assignment to results, if it's not typed to object, which seems unlikely. Also note that all of this code is lock free.
In the Python version you first have the allocation of the list - this also appears to be inside your timer, whereas in C# it doesn't look like it is. Then you have the dynamic call to range; luckily that only happens once, but it again creates a gigantic list in memory - delnan's suggestion of xrange is an improvement here. Then you have the loop counter i, which is getting boxed to an object for every iteration through the loop. Then you have the call to b.GetValue(), which is actually 2 dynamic invocations - first a get member to fetch the "GetValue" method and then an invoke on that bound method object. This again creates one new object for every iteration of the loop. Then you have the result of b.GetValue(), which may be yet another value that's boxed on every iteration. Then you add 1 to that result and you have another boxing operation on every iteration. Finally you store this into your list, which is yet another dynamic operation - I think this final operation needs to lock to ensure the list remains consistent (again, delnan's suggestion of using a list comprehension improves this).
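For reference, a sketch of the script body with both of delnan's suggestions applied (Python side only, IronPython 2.x / Python 2 syntax as in the original):

def result():
    # xrange avoids materializing the million-element index list, and the
    # comprehension avoids a dynamic store into a pre-allocated list on every iteration
    return [b.GetValue() + 1 for i in xrange(1000000)]
result()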
So in summary during the loop we have:
                     C#   IronPython
Dynamic Operations    1            4
Allocations           1            4
Locks Acquired        0            1
So basically Python's dynamic behavior does come at a cost vs C#. If you want the best of both worlds you can try and balance what you do in C# vs what you do in Python. For example you could write the loop in C# and have it call a delegate which is a Python function (you can do scope.GetVariable<Func<...>>("name") to get a function out of the scope as a delegate). You could also consider allocating a .NET array for the results if you really need to get every last bit of performance, as it may reduce working set and GC copying by not keeping around a bunch of boxed values.
To do the delegate you could have the user write:
def computeValue(value):
    return value + 1
Then in the C# code you'd do:
CompiledCode compiled = src.Compile();
compiled.Execute(pys);
var computer = pys.GetVariable<Func<object,object>>("computeValue");
Now you can do:
for (int i = 0; i < 10000000; i++)
{
    results[i] = computer(i);
}
If you are concerned about computation speed, it may be better to look at a lower-level computation specification. Python and C# are high-level languages, and their runtimes can spend a lot of time on work under the covers.
Look on this LLVM wrapper library: http://www.llvmpy.org
Install it using: pip install llvmpy ply
or on Debian Linux: apt install python-llvmpy python-ply
You still need to write a tiny compiler (you can use the PLY library) and bind it to the LLVM JIT (see the LLVM Execution Engine), but this approach can be more effective - the generated code is much closer to real CPU code - and multi-platform, compared to being locked into .NET.
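For illustration, a minimal PLY lexer for a tiny expression language could look like the following (the token set and the sample input are invented for this sketch, not part of any existing project):

import ply.lex as lex

# Token names and their regular expressions for a toy calculation language.
tokens = ('NUMBER', 'PLUS', 'TIMES', 'NAME')

t_PLUS = r'\+'
t_TIMES = r'\*'
t_NAME = r'[A-Za-z_][A-Za-z0-9_]*'
t_ignore = ' \t'

def t_NUMBER(t):
    r'\d+(\.\d+)?'
    t.value = float(t.value)
    return t

def t_error(t):
    raise SyntaxError("illegal character %r" % t.value[0])

lexer = lex.lex()
lexer.input("b * 2 + 1")
for tok in lexer:
    print(tok.type, tok.value)

A PLY parser (ply.yacc) would then build an AST from these tokens, which you would hand to the LLVM execution engine to JIT-compile to native code.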
LLVM has a ready-to-use optimizing compiler infrastructure, including a lot of optimizer stage modules, and a big user and developer community.
Also look here: http://gmarkall.github.io/tutorials/llvm-cauldron-2016
PS: If you are interested, I can help you with a compiler, contributing to my project's manual in parallel. But it will not be a jumpstart; this topic is new to me too.
Related
I was recently playing with problem 14 of the Euler project: which number in the range 1..1_000_000 produces the longest Collatz sequence?
I'm aware of the issue of having to memoize to get reasonable times, and the following piece of Python code returns an answer relatively quickly using that technique (memoize to a dict):
#!/usr/bin/env python
L = 1_000_000
cllens={1:1}
cltz = lambda n: 3*n + 1 if n%2 else n//2
def cllen(n):
    if n not in cllens: cllens[n] = cllen(cltz(n)) + 1
    return cllens[n]

maxn=1
for i in range(1,L+1):
    ln=cllen(i)
    if (ln > cllens[maxn]): maxn=i
print(maxn)
(adapted from here; I prefer this version that doesn't use max, because I might want to fiddle with it to return the longest 10 sequences, etc.).
I have tried to translate it to Raku staying as semantically close as I could:
#!/usr/bin/env perl6
use v6;
my $L=1_000_000;
my %cllens = (1 => 1);
sub cltz($n) { ($n %% 2) ?? ($n div 2) !! (3*$n+1) }
sub cllen($n) {
    (! %cllens{$n}) && (%cllens{$n} = 1+cllen($n.&cltz));
    %cllens{$n};
}
my $maxn=1;
for (1..$L) {
    my $ln = cllen($_);
    ($ln > %cllens{$maxn}) && ($maxn = $_)
}
say $maxn
Here are the sorts of times I am consistently getting running these:
$ time <python script>
837799
real 0m1.244s
user 0m1.179s
sys 0m0.064s
On the other hand, in Raku:
$ time <raku script>
837799
real 0m21.828s
user 0m21.677s
sys 0m0.228s
Question(s)
Am I mistranslating between the two, or is the difference an irreconcilable matter of starting up a VM, etc.?
Are there clever tweaks / idioms I can apply to the Raku code to speed it up considerably past this?
Aside
Naturally, it's not so much about this specific Euler project problem; I'm more broadly interested in whether there are any magical speedup arcana appropriate to Raku I am not aware of.
I think the majority of the extra time is because Raku has type checks, and they aren't getting removed by the runtime type specializer. Or if they are getting removed it is after a significant amount of time.
Generally the way to optimize Raku code is first to run it with the profiler:
$ raku --profile test.raku
Of course that fails with a Segfault with this code, so we can't use it.
My guess would be that much of the time is related to using the Hash.
If it was implemented, using native ints for the key and value might have helped:
my int %cllens{int} = (1 => 1);
Then declaring the functions as using native-sized ints could be a bigger win.
(Currently this is a minor improvement at best.)
sub cltz ( int $n --> int ) {…}
sub cllen( int $n --> int ) {…}
for (1..$L) -> int $_ {…}
Of course like I said native hashes aren't implemented, so that is pure speculation.
You could try to use the multi-process abilities of Raku, but there may be issues with the shared %cllens variable.
The problem could also be because of recursion. (Combined with the aforementioned type checks.)
If you rewrote cllen so that it used a loop instead of recursion that might help.
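An untested sketch of what a loop-based cllen could look like (it walks the chain until it reaches a value that is already cached, then fills the cache on the way back up):

sub cllen($n) {
    my @chain;
    my $m = $n;
    # walk down the Collatz chain until we hit a memoized length
    while %cllens{$m}:!exists {
        @chain.push($m);
        $m = cltz($m);
    }
    my $len = %cllens{$m};
    # record lengths for everything we passed on the way down
    for @chain.reverse -> $k {
        %cllens{$k} = ++$len;
    }
    %cllens{$n};
}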
Note:
The closest to n not in cllens is probably %cllens{$n}:!exists.
Though that might be slower than just checking that the value is not zero.
Also cllen is kinda terrible. I would have written it more like this:
sub cllen($n) {
    %cllens{$n} //= cllen(cltz($n)) + 1
}
I'm designing a Unity3D game where the user can adjust gameplay by modifying a .py file.
I'm using Python Interpreter asset, which wraps IronPython.
I need to pass control between C# and the running Python instance. Here is a pseudocode example:
// From Unity, C# script...
bool MakeNoise( float vol ) {
    //...
}

void Awake() {
    m_pythonEngine = Python.CreateEngine();
    m_scriptScope = m_pythonEngine.CreateScope();
}

bool NewRound( int n ) {
    myPythonRuntime.InvokeFunction( "NewRoundPy", param: roundNumber = n );
    return retVal;
}
And then, in the game while it is running, the user could enter in a textbox:
# Python
vol = 0.5
def NewRoundPy( roundNumber ):
    theCSharpObject.MakeNoise( vol )
    return True
Which would in turn invoke the MakeNoise function every time a new round starts.
Apologies for the clumsy hypothetical situation. The important thing is that I need to be able to transfer execution from one environment to the other, passing data as I do so.
I'm aware of the code needed to run a Python script. But I want to invoke individual functions from the script. And the script has to stay running between such calls, so that variables set in one function/call retain their values the next time some function is called.
Is this possible? If so, how?
PS going Python -> C# seems to be fairly simple; IronPython .NET Integration documentation
PPS I will tidy up the question once I have some understanding of what's going on
I recently tried PyPy and was intrigued by the approach. I have lots of C extensions for Python, which all use PyArray_DATA() to obtain a pointer to the data sections of numpy arrays. Unfortunately, PyPy doesn't appear to export an equivalent for their numpypy arrays in their cpyext module, so I tried following the recommendation on their website to use ctypes. This pushes the task of obtaining the pointer to the Python level.
There appear to be two ways:
import ctypes as C
p_t = C.POINTER(C.c_double)
def get_ptr_ctypes(x):
    return x.ctypes.data_as(p_t)

def get_ptr_array(x):
    return C.cast(x.__array_interface__['data'][0], p_t)
Only the second one works on PyPy, so for compatibility the choice is clear. For CPython, both are slow as hell and a complete bottleneck for my application! Is there a fast and portable way of obtaining this pointer? Or is there an equivalent of PyArray_DATA() for PyPy (possibly undocumented)?
I still haven't found an entirely satisfactory solution, but nevertheless there is something one can do to obtain the pointer with a lot less overhead in CPython. First off, the reason why both ways mentioned above are so slow is that both .ctypes and .__array_interface__ are on-demand attributes, which are set by array_ctypes_get() and array_interface_get() in numpy/numpy/core/src/multiarray/getset.c. The first imports ctypes and creates a numpy.core._internal._ctypes instance, while the second one creates a new dictionary and populates it with lots of unnecessary stuff in addition to the data pointer.
There is nothing one can do on the Python level about this overhead, but one can write a micro-module on the C-level that bypasses most of the overhead:
#include <Python.h>
#include <numpy/arrayobject.h>
/* Return the data pointer of a numpy array as a Python integer.
   The explicit cast keeps this compiling against newer NumPy headers,
   where PyArray_DATA expects a PyArrayObject*. */
PyObject *_get_ptr(PyObject *self, PyObject *obj) {
    return PyLong_FromVoidPtr(PyArray_DATA((PyArrayObject *)obj));
}

static PyMethodDef methods[] = {
    {"_get_ptr", _get_ptr, METH_O, "Wrapper to PyArray_DATA()"},
    {NULL, NULL, 0, NULL}
};

PyMODINIT_FUNC initaccel(void) {
    Py_InitModule("accel", methods);
}
Compile as usual as an Extension in setup.py, and import as
try:
    from accel import _get_ptr
    def get_ptr(x):
        return C.cast(_get_ptr(x), p_t)
except ImportError:
    get_ptr = get_ptr_array
On PyPy, from accel import _get_ptr will fail and get_ptr will fall back to get_ptr_array, which works with Numpypy.
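For completeness, the setup.py for the micro-module could be as small as this (a sketch; the source file name accel.c is assumed, and numpy.get_include() supplies the include path for <numpy/arrayobject.h>):

from distutils.core import setup, Extension
import numpy

setup(
    name="accel",
    ext_modules=[
        # "accel.c" is assumed to contain the C snippet shown above
        Extension("accel", sources=["accel.c"],
                  include_dirs=[numpy.get_include()]),
    ],
)

Build it in place with: python setup.py build_ext --inplace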
As far as performance goes, for light-weight C function calls, ctypes + accel._get_ptr() is still quite a bit slower than the native CPython extension, which has essentially no overhead. It is of course much faster than get_ptr_ctypes() and get_ptr_array() above, so that the overhead may become insignificant for medium-weight C function calls.
One has gained compatibility with PyPy, although I have to say that after spending quite a bit of time trying to evaluate PyPy for my scientific computation applications, I don't see a future for it as long as they (quite stubbornly) refuse to support the full CPython API.
Update
I found that ctypes.cast() was now becoming the bottleneck after introducing accel._get_ptr(). One can get rid of the casts by declaring all pointers in the interface as ctypes.c_void_p. This is what I ended up with:
def get_ptr_ctypes2(x):
    return x.ctypes._data

def get_ptr_array(x):
    return x.__array_interface__['data'][0]

try:
    from accel import _get_ptr as get_ptr
except ImportError:
    get_ptr = get_ptr_array
Here, get_ptr_ctypes2() avoids the cast by accessing the hidden ndarray.ctypes._data attribute directly. Here are some timing results for calling heavy-weight and light-weight C functions from Python:
                              heavy C (few calls)   light C (many calls)
ctypes + get_ptr_ctypes():    0.71 s                15.40 s
ctypes + get_ptr_ctypes2():   0.68 s                13.30 s
ctypes + get_ptr_array():     0.65 s                11.50 s
ctypes + accel._get_ptr():    0.63 s                 9.47 s
native CPython:               0.62 s                 8.54 s
Cython (no decorators):       0.64 s                 9.96 s
So, with accel._get_ptr() and no ctypes.cast()s, ctypes' speed is actually competitive with a native CPython extension. So I just have to wait until someone rewrites h5py, matplotlib and scipy with ctypes to be able to try PyPy for anything serious...
That might not be enough of an answer, but hopefully it's a good hint. I am using scipy.weave.inline() in some parts of my code. I do not know much about the speed of the interface itself, because the function I execute is quite heavy and relies on only a few pointers/arrays, but it seems fast to me. Maybe you can get some inspiration from the scipy.weave code, particularly from attempt_function_call:
https://github.com/scipy/scipy/blob/master/scipy/weave/inline_tools.py#L390
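As a quick illustration of the interface (a minimal sketch, assuming an old SciPy release where scipy.weave is still bundled): inline() takes a string of C/C++ code plus the names of the Python variables to expose to it, compiles the extension once, and caches it for later calls.

from scipy import weave

x, y = 3, 4
code = """
// x and y arrive as C ints; return_val is converted back to a Python object
return_val = x * y;
"""
print(weave.inline(code, ['x', 'y']))   # prints 12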
If you want to have a look at the C++ code that is generated by scipy.weave,
produce a simple example from here: http://docs.scipy.org/doc/scipy/reference/tutorial/weave.html ,
run the python script
get the scipy.weave cache folder:
import scipy.weave.catalog as ctl
ctl.default_dir()
Out[5]: '/home/user/.python27_compiled'
have a look at the generated C++ code in the folder
I have pretty much the same code in python and C. Python example:
import numpy
nbr_values = 8192
n_iter = 100000
a = numpy.ones(nbr_values).astype(numpy.float32)
for i in range(n_iter):
    a = numpy.sin(a)
C example:
#include <stdio.h>
#include <math.h>
int main(void)
{
    int i, j;
    int nbr_values = 8192;
    int n_iter = 100000;
    double x;
    for (j = 0; j < nbr_values; j++){
        x = 1;
        for (i = 0; i < n_iter; i++)
            x = sin(x);
    }
    return 0;
}
Something strange happened when I ran both examples:
$ time python numpy_test.py
real 0m5.967s
user 0m5.932s
sys 0m0.012s
$ g++ sin.c
$ time ./a.out
real 0m13.371s
user 0m13.301s
sys 0m0.008s
It looks like Python/numpy is twice as fast as C. Is there any mistake in the experiment above? How can you explain it?
P.S. I have Ubuntu 12.04, 8G ram, core i5 btw
First, turn on optimization. Secondly, subtleties matter. Your C code is definitely not 'basically the same'.
Here is equivalent C code:
sinary2.c:
#include <math.h>
#include <stdlib.h>
float *sin_array(const float *input, size_t elements)
{
    int i = 0;
    float *output = malloc(sizeof(float) * elements);
    for (i = 0; i < elements; ++i) {
        output[i] = sin(input[i]);
    }
    return output;
}
sinary.c:
#include <math.h>
#include <stdlib.h>

extern float *sin_array(const float *input, size_t elements);

int main(void)
{
    int i;
    int nbr_values = 8192;
    int n_iter = 100000;
    float *x = malloc(sizeof(float) * nbr_values);
    for (i = 0; i < nbr_values; ++i) {
        x[i] = 1;
    }
    for (i = 0; i < n_iter; i++) {
        float *newary = sin_array(x, nbr_values);
        free(x);
        x = newary;
    }
    return 0;
}
Results:
$ time python foo.py
real 0m5.986s
user 0m5.783s
sys 0m0.050s
$ gcc -O3 -ffast-math sinary.c sinary2.c -lm
$ time ./a.out
real 0m5.204s
user 0m4.995s
sys 0m0.208s
The reason the program has to be split in two is to fool the optimizer a bit. Otherwise it will realize that the whole loop has no effect at all and optimize it out. Putting things in two files doesn't give the compiler visibility into the possible side-effects of sin_array when it's compiling main and so it has to assume that it actually has some and repeatedly call it.
Your original program is not at all equivalent for several reasons. One is that you have nested loops in the C version and you don't in Python. Another is that you are working with arrays of values in the Python version and not in the C version. Another is that you are creating and discarding arrays in the Python version and not in the C version. And lastly you are using float in the Python version and double in the C version.
Simply calling the sin function the appropriate number of times does not make for an equivalent test.
Also, the optimizer is a really big deal for C. Comparing C code on which the optimizer hasn't been used to anything else when you're wondering about a speed comparison is the wrong thing to do. Of course, you also need to be mindful. The C optimizer is very sophisticated and if you're testing something that really doesn't do anything, the C optimizer might well notice this fact and simply not do anything at all, resulting in a program that's ridiculously fast.
Because "numpy" is a dedicated math library implemented for speed. C has standard functions for sin/cos, that are generally derived for accuracy.
You are also not comparing apples with apples, as you are using double in C and float32 (float) in Python. If we change the Python code to calculate float64 instead, the time increases by about 2.5 seconds on my machine, making it roughly match the correctly optimized C version.
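The change referred to is just the dtype in the array setup line, e.g.:

# double precision, to match the C code's use of double
a = numpy.ones(nbr_values).astype(numpy.float64)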
If the whole test were made to do something more complicated, requiring more control structures (if/else, do/while, etc.), then you would probably see even less difference between C and Python - because the C compiler can't really do "sin" any faster, unless you implement a better "sin" function.
Never mind the fact that your code isn't quite the same on both sides... ;)
You seem to be doing the same operation in C 8192 x 100,000 times but only 100,000 times in Python (I haven't used numpy before so I may misunderstand the code). Why are you using an array in the Python case (again, I'm not used to numpy, so perhaps the dereferencing is implicit)? If you wish to use an array, be careful: doubles have a performance hit in terms of caching and optimised vectorisation - you're using different types between the two implementations (float vs double), but given the algorithm I don't think it matters.
The main reason for a lot of anomalous performance benchmark issues surrounding C vs Pythis, Pythat... is simply that the C implementation is often poor.
https://www.ibm.com/developerworks/community/blogs/jfp/entry/A_Comparison_Of_C_Julia_Python_Numba_Cython_Scipy_and_BLAS_on_LU_Factorization?lang=en
If you notice, the guy writes C to process an array of doubles (without using the restrict or const keywords where he could have), he builds with optimisation and then forces the compiler to use SIMD rather than AVX. In short the compiler is using an inefficient instruction set for doubles and the wrong type of registers too if he wanted performance - you can be sure numba and numpy will be using as many bells and whistles as possible and will be shipped with very efficient C and C++ libraries to begin with. In short, if you want speed with C you have to think about it; you may even have to disassemble the code and perhaps disable optimisation and use compiler intrinsics instead. It gives you the tools to do it, so don't expect the compiler to do all the work for you. If you want that degree of freedom use Cython, Numba, Numpy, Scipy etc. They're very fast but you won't be able to eke every bit of performance out of the machine - to do that use C, C++ or new versions of FORTRAN.
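To illustrate the kind of hint being talked about (a sketch; the function is invented for illustration, not code from the linked article): marking the input as read-only and promising the compiler that the pointers don't alias gives the optimizer a much better chance of vectorising the loop.

#include <stddef.h>

/* 'const' marks the input read-only; 'restrict' promises no aliasing,
   so the compiler is free to vectorise this loop (e.g. with -O3 -march=native). */
void scale(double *restrict out, const double *restrict in, size_t n, double k)
{
    for (size_t i = 0; i < n; ++i) {
        out[i] = k * in[i];
    }
}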
Here is a very good article on these very points (I'd use SciPy):
https://www.scipy.org/scipylib/faq.html
I have a Python-based program that reads serial data off a port connected to an RS-232 cable. I want to pass the data I get here to a C program that will handle the computation-intensive side of things. I have been checking up on the net and all I've found is Linux-based.
My suggestion would be the inline function from the instant module, though that only works if you can do everything you need to in a single C function. You just pass it a C function and it compiles a C extension at runtime.
from instant import inline
sieve_code = """
PyObject* prime_list(int max) {
    PyObject *list = PyList_New(0);
    int *numbers, *end, *n;
    numbers = (int *) calloc(sizeof(int), max);
    end = numbers + max;
    numbers[2] = 2;
    for (int i = 3; i < max; i += 2) { numbers[i] = i; }
    /* i * i < max is equivalent to i < sqrt(max) but avoids needing math.h */
    for (int i = 3; i * i < max; i++) {
        if (numbers[i] != 0) {
            for (int j = i + i; j < max; j += i) { numbers[j] = 0; }
        }
    }
    for (n = numbers; n < end; n++) {
        if (*n != 0) { PyList_Append(list, PyInt_FromLong(*n)); }
    }
    free(numbers);
    return list;
}
"""
sieve = inline(sieve_code)
There are a number of ways to do this.
The rawest, simplest way is to use the Python C API and write a wrapper for your C library which can be called from Python. This ties your module to CPython.
The second way is to use ctypes which is an FFI for Python that allows you to load and call functions in C libraries directly. In theory, this should work across Python implementations.
A third way is to use Pyrex or its next-generation version, Cython, which allows you to annotate your Python code with type information that the compiler can convert into compiled code. It can be used to write wrappers too. AFAIK, it's tied to CPython.
Yet another way is to use SWIG which is a tool that generates glue code that helps you wrap C libraries for use from Python. It's basically the first approach with a helper tool.
Another way is to use Boost Python API which is an object oriented wrapper over the raw Python C API.
All of the above let you do your work in the same process.
If that's not a constraint, like Digital Ross suggested, you can simply spawn a subprocess and hand over arguments (either as command line ones or via its standard input) and have an external process do the work for you.
Use a pipe and popen
The easiest way to deal with this is probably to just use popen(3). The popen function is available in both Python and C and will connect a program of either language with the other using a pipe.
>>> import subprocess
>>> print args
['/bin/vikings', '-input', 'eggs.txt', '-output', 'spam spam.txt', '-cmd', "echo '$MONEY'"]
>>> p = subprocess.Popen(args)
Once you have the pipe, you should probably send yaml or json through it, though I've never tried to read either in C. If it's really a simple stream, just parse it yourself. If you like XML, I suppose that's available as well.
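A minimal sketch of the Python side of such a pipe (the executable name ./cruncher is made up; it stands in for your C program, which would read one JSON document per line from its standard input):

import json
import subprocess

# "./cruncher" is a placeholder for the compiled C program
p = subprocess.Popen(["./cruncher"], stdin=subprocess.PIPE)
for sample in ({"t": 0.1, "value": 42}, {"t": 0.2, "value": 17}):
    # one JSON document per line keeps the C-side parsing simple
    p.stdin.write((json.dumps(sample) + "\n").encode())
p.stdin.close()
p.wait()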
How many bits per second are you getting across this RS-232 cable? Do you have test results that show that Python won't do the crunchy bits fast enough? If the C program is yet to be written, consider the possibility of writing the computation-intensive side of things in Python, with an easy fallback to Cython in the event that Python isn't fast enough.
Indeed this question does not have much to do with C++.
Having said that, you can try SWIG - it's multi-platform and allows function calls from Python to C/C++.
I would use a standard form of IPC like a socket.
A good start would be Beej's Guide.
Also, don't tag the question with C++ if you are specifically using C. C and C++ are different languages.
I'd use ctypes: http://python.net/crew/theller/ctypes/tutorial.html
It allows you to call C (and C++) code from Python.
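A minimal sketch of what that looks like (the library name libcrunch.so and the function crunch are made up; the C side would be compiled as a shared library, e.g. gcc -shared -fPIC -o libcrunch.so crunch.c):

import ctypes

# Load the hypothetical shared library built from your C code
lib = ctypes.CDLL("./libcrunch.so")

# Declare the signature of a hypothetical: double crunch(const double *data, size_t n)
lib.crunch.argtypes = [ctypes.POINTER(ctypes.c_double), ctypes.c_size_t]
lib.crunch.restype = ctypes.c_double

data = (ctypes.c_double * 4)(1.0, 2.0, 3.0, 4.0)
print(lib.crunch(data, len(data)))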