Cython for loop conversion - python

Using cython -a, I found that a for i in range(0, a, b) statement was run as a Python loop (a very yellow line in the cython -a HTML output). i, a and b were cdef-ed as int64_t.
Then I tried the 'old' syntax, for i from 0 <= i < b by a. From the cython -a output it seemed to compile quite optimally, as expected.
Is it expected behaviour that range(0, a, b) is not optimized here, or is this specific to the implementation?

Automatic range conversion is only applied when Cython can determine the sign of the step at compile time. As the step here is a signed variable, it cannot, so the loop falls back to the Python implementation.
Note that currently, even when the step's type is unsigned, Cython still falls back to the Python loop; this is a (rather old) outstanding optimisation that the compiler could do but doesn't. Have a look at this ticket for more information:
http://trac.cython.org/ticket/546
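To make the distinction concrete, here is a minimal sketch (variable names invented for illustration) of the two cases as they might appear in a .pyx file; exact behaviour may vary between Cython versions:

from libc.stdint cimport int64_t

cdef int64_t i, a, b, total
a = 1000
b = 3
total = 0

for i in range(0, a, 2):    # literal step: sign known at compile time, compiles to a plain C loop
    total += i

for i in range(0, a, b):    # step is a signed variable: Cython cannot prove its sign,
    total += i              # so it falls back to calling Python's range()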

Calling C functions via Python ctypes: why does passing uints to a function expecting size_t work?

I have some simple code that adds two size_ts together:
#include <stdlib.h>
extern "C" __declspec(dllexport) size_t _cdecl add(size_t x, size_t y)
{
    return x + y;
}
(Note: this code is compiled and run on a 64-bit system.)
When calling that function via Python's ctypes and passing it arguments of type c_uint (32 bits in size instead of 64), the function works as expected:
import ctypes
lib = ctypes.cdll['./ctypetest.dll']
add = lib.add
add.restype = ctypes.c_uint
add.argtypes = [ctypes.c_uint, ctypes.c_uint]
add(1, 2) # = 3
As a sanity check, I verified that c_uint and c_size_t are indeed different sizes:
>>> ctypes.sizeof(ctypes.c_size_t)
8
>>> ctypes.sizeof(ctypes.c_uint)
4
How does ctypes successfully call this function given arguments of different sizes?
The answer depends on the calling conventions of the ABI of the C compiler used to compile your Python.
It sounds like you're on x86-64 Windows.1 If so, your system is built around the Microsoft x64 ABI. And if not, that still makes for a good example, so let's pretend you are. Slightly oversimplified,2 the calling conventions for that ABI work like this:
The first four arguments are stored in registers RCX, RDX, R8, and R9.
Any additional arguments are pushed onto the stack.
So, your c_uint arguments get stored in the low 32 bits of RCX and RDX, respectively, while the high 32 bits of each of those registers gets cleared to 0.
The add function goes to add RCX and RDX as unsigned 64-bit ints, and the result is exactly what you'd expect; everything works.3
But imagine you were on a different platform, with a different ABI. In fact, your imagination doesn't have to go very far; if you run a 32-bit program on the same Windows machine, you get the Microsoft IA-32 ABI instead of Microsoft x64. That ABI has three different calling conventions, and that _cdecl in your declaration now selects one of the three, which works like this:
Push everything on the stack.
OK, now c_uint and size_t both happen to be 32 bits, but let's do the same thing with c_ushort.
Your Python code pushes two 16-bit values onto the stack.
add tries to use both of your values—as in x | (y<<32)—as its x parameter, and then whatever happens to be next to it on the stack as its y parameter. So, what you get back is garbage.
And it can get even worse.
What if you'd used _stdcall? In the Microsoft x64 ABI that does nothing, but in the Windows IA-32 ABI, it specifies the same parameter passing order as _cdecl, but stack cleanup by the callee rather than the caller.
So, after generating your garbage for you, add goes to clean up the stack, and it's expecting a different size than what you gave it, and… well, actually, I think in this specific case you get away with it because the parameter area of the stack is aligned to 16-byte pages, so cleaning up 16 bytes instead of 8 doesn't matter. But that's just dumb luck.
There are also some platforms that pass values in partial registers. For example, IIRC, the Win32s version of _fastcall did something like this:
First argument in EAX if 32-bit, AX if 16-bit, AL if 8-bit.
Second argument in EDX if 32-bit, DX if 16-bit, DL if 8-bit.
Third argument in EBX if 32-bit, BX if 16-bit, BL if 8-bit.
Everything else on the stack.
AL is just the low half of AX, and loading a byte into AL does not clear the high half. So, what happens if you call a _fastcall function that wanted to add two 16-bit numbers, but you thought it wanted to add two 8-bit numbers? You get the sum of x, y, z*256, and w*256, where z and w are just whatever happened to be left around in AH and DH by some previous instruction.
There's a reason all of my weird examples came from 32-bit and smaller ABIs. Most 64-bit ABIs were designed more recently, less haphazardly, and specifically to make POSIX/C code and/or Win64/C code run nicely, so they tend to be pretty similar. For example, the System V AMD64 ABI (used by almost everything but Windows on x86_64), the AArch64 ABI (used by almost everything on ARM64), and the PowerPC64 ABI (used by everything on PowerPC64) all have basically the same calling convention as the Microsoft x64 ABI, except for a different set of integer-parameter registers and slightly different floats-and-stuff rules. But that doesn't mean you can rely on it being safe to get the parameters wrong; it just means you have a harder time finding test systems to detect and debug your bugs…
1. You didn't say, but __declspec and _cdecl usually only appear in Windows code. And you said "a 64-bit system", and I doubt you're on Itanium or some other 64-bit platform.
2. There's some extra complexity for floats, SSE vectors, structs larger than 64 bits, varargs…
3. You might be a bit surprised that 0xffffffff + 0xffffffff is 0x00000001fffffffe instead of 0xfffffffe… but since you got the restype wrong as well, you're going to truncate that to 32 bits (and you’re on a little-endian system and one that returns values in registers—if both of those were not true, you’d get 1 as the answer…), and, since these are unsigned ints, truncating and rolling over look identical, so two errors cancel out and you see the 0xfffffffe you expected.
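The practical fix for the original snippet, for what it's worth, is simply to make the declared ctypes types match the C signature, so that none of this ABI luck is needed; a minimal sketch:

import ctypes

lib = ctypes.CDLL('./ctypetest.dll')
add = lib.add
add.restype = ctypes.c_size_t                      # match the C return type
add.argtypes = [ctypes.c_size_t, ctypes.c_size_t]  # match the C parameter types

print(add(1, 2))  # 3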

Parallelize python loop numpy.searchsorted using cython

I've coded a function using cython containing the following loop. Each row of array A1 is binary searched for all values in array A2. So each loop iteration returns a 2D array of index values. Arrays A1 and A2 enter as function arguments, properly typed.
The array C is pre-allocated at the highest indentation level, as required in Cython.
I simplified things a little for this question.
...
cdef np.ndarray[DTYPEint_t, ndim=3] C = np.zeros([N,M,M], dtype=DTYPEint)
for j in range(0, N):
    C[j,:,:] = np.searchsorted(A1[j,:], A2, side='left')
All's fine so far, things compile and run as expected. However, to gain even more speed I want to parallelize the j-loop. First attempt was simply writing
for j in prange(0, N, nogil=True):
    C[j,:,:] = np.searchsorted(A1[j,:], A2, side='left')
I tried many coding variations, such as putting things in a separate nogil function, assigning the result to an intermediate array, and writing a nested loop to avoid the assignment to the sliced part of C.
Errors usually are of the form "Accessing Python attribute not allowed without gil"
I can't get it to work. Any suggestions on how I can do this?
EDIT:
This is my setup.py
try:
    from setuptools import setup
    from setuptools import Extension
except ImportError:
    from distutils.core import setup
    from distutils.extension import Extension

from Cython.Build import cythonize
import numpy

extensions = [Extension("matchOnDistanceVectors",
                        sources=["matchOnDistanceVectors.pyx"],
                        extra_compile_args=["/openmp", "/O2"],
                        extra_link_args=[])]

setup(
    ext_modules=cythonize(extensions),
    include_dirs=[numpy.get_include()]
)
I'm on Windows 7, compiling with MSVC. I did specify the /openmp flag, and my arrays are of size 200*200. So everything seems in order...
I believe that searchsorted releases the GIL itself (see https://github.com/numpy/numpy/blob/e2805398f9a63b825f4a2aab22e9f169ff65aae9/numpy/core/src/multiarray/item_selection.c, line 1664 "NPY_BEGIN_THREADS_DEF").
Therefore, you can do
for j in prange(0, N, nogil=True):
    with gil:
        C[j,:,:] = np.searchsorted(A1[j,:], A2, side='left')
That temporarily claims back the GIL to do the necessary work on Python objects (which is hopefully quick), and it should then be released again inside searchsorted, allowing that part to run largely in parallel.
To update, I did a quick test of this (A1.shape==(105,100), A2.shape==(302,302); the numbers were chosen pretty arbitrarily). For 10 repeats the serial version took 4.5 seconds and the parallel version took 1.4 seconds (the test was run on a 4-core CPU). You don't get the full 4x speed-up, but you get close.
This was compiled as described in the documentation. I suspect that if you aren't seeing a speed-up it could be any of: 1) your arrays are small enough that the function-call and numpy type/size-checking overhead is dominating; 2) you aren't compiling it with OpenMP enabled; or 3) your compiler doesn't support OpenMP.
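One way to rule out 2) and 3) is to ask the OpenMP runtime from inside the extension. A minimal sketch of a hypothetical helper added to the .pyx, relying on Cython's bundled openmp declarations and the same OpenMP build flags:

cimport openmp  # Cython ships declarations for the OpenMP runtime API

def report_threads():
    # Number of threads a parallel region would use; on a 4-core machine
    # you would typically expect 4 here when OpenMP is correctly enabled.
    return openmp.omp_get_max_threads()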
You have a bit of a catch-22. You need the GIL to call numpy.searchsorted, but the GIL prevents any kind of parallel processing. Your best bet is to write your own nogil version of searchsorted:
cdef int mySearchSorted(double[:] array, double target) nogil:
    # binary search implementation
    ...

for j in prange(0, N, nogil=True):
    for k in range(A2.shape[0]):
        for L in range(A2.shape[1]):
            C[j, k, L] = mySearchSorted(A1[j, :], A2[k, L])
numpy.searchsorted also has a non trivial amount of overhead, so if N is large it makes sense to use your own searchsorted just to reduce the overhead.
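For concreteness, a hedged sketch of what that nogil binary search might look like, matching searchsorted's side='left' convention and assuming each row of A1 is already sorted (as np.searchsorted requires):

cimport cython

@cython.boundscheck(False)
@cython.wraparound(False)
@cython.cdivision(True)
cdef int mySearchSorted(double[:] array, double target) nogil:
    # Returns the index of the first element >= target (i.e. side='left').
    cdef int lo = 0, hi = array.shape[0], mid
    while lo < hi:
        mid = (lo + hi) // 2
        if array[mid] < target:
            lo = mid + 1
        else:
            hi = mid
    return lo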

Segmentation fault after removing debug printing

I have a (for me) very weird segmentation error. At first I thought it was interference between my 4 cores due to OpenMP, but removing OpenMP from the equation is not what I want, and it turns out that when I do remove it, the segfault still occurs.
What's weird is that if I add a print or write statement anywhere within the inner do loop, it works.
subroutine histogrambins(rMatrix, N, L, dr, maxBins, bins)
    implicit none;
    double precision, dimension(N,3), intent(in):: rMatrix;
    integer, intent(in) :: maxBins, N;
    double precision, intent(in) :: L, dr;
    integer, dimension(maxBins, 1), intent(out) :: bins;
    integer :: i, j, b;
    double precision, dimension(N,3) :: cacheParticle, cacheOther;
    double precision :: r;
    do b = 1, maxBins
        bins(b,1) = 0;
    end do
    !$omp parallel do &
    !$omp default(none) &
    !$omp firstprivate(N, L, dr, rMatrix, maxBins) &
    !$omp private(cacheParticle, cacheOther, r, b) &
    !$omp shared(bins)
    do i = 1,N
        do j = 1,N
            !Check the pair distance between this one (i) and its (j) closest image
            if (i /= j) then
                !should be faster, because it doesn't have to look for matrix indices
                cacheParticle(1, :) = rMatrix(i,:);
                cacheOther(1, :) = rMatrix(j, :);
                call inbox(cacheParticle, L);
                call inbox(cacheOther, L);
                call closestImage(cacheParticle, cacheOther, L);
                r = sum( (cacheParticle - cacheOther) * (cacheParticle - cacheOther) ) ** .5;
                if (r /= r) then
                    ! r is NaN
                    bins(maxBins,1) = bins(maxBins,1) + 1;
                else
                    b = floor(r/dr);
                    if (b > maxBins) then
                        b = maxBins;
                    end if
                    bins(b,1) = bins(b,1) + 1;
                end if
            end if
        end do
    end do
    !$omp end parallel do
end subroutine histogramBins
I enabled --debug-capi in the f2py command:
f2py --fcompiler=gfortran --f90flags="-fopenmp -fcheck=all" -lgomp --debug-capi --debug -m -c modulename module.f90;
Which gives me this:
debug-capi:Fortran subroutine histogrambins(rmatrix,&n,&l,&dr,&maxbins,bins)'
At line 320 of file mol-dy.f90
Fortran runtime error: Aborted
It also does a load of other checking, listing arguments given and other subroutines called and so on.
Anyway, the two subroutines called in it are both non-parallel subroutines. I use them in several other subroutines, and I thought it best not to call a parallel subroutine from within the parallel code of another subroutine. So, at the time of processing this function, no other function should be active.
What's going on here? How can adding a print *, statement cause a segfault to go away?
Thank you for your time.
It's not unusual for print statements to have an impact like this and either create or remove a segfault. The reason is that they change the way memory is laid out, to make room for the string being printed or for temporary strings if you're doing some formatting. That change can be sufficient to cause a bug to either appear as a crash for the first time, or to disappear.
I see you're calling this from Python. If you're using Linux, you could try following a guide to using a debugger with Fortran called from Python to find the line and the data values that cause the crash. This method also works for OpenMP. You can also try using GDB as the debugger.
Without the source code to your problem, I don't think you're likely to get an "answer" to the question - but hopefully the above ideas will help you to solve this yourself.
Using a debugger is (in my experience) considerably less likely to have this now-you-see-it-now-you-don't behaviour than with print statements (almost certainly so if only using one thread).
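As a rough sketch of that workflow (the module and script names here are placeholders), build with debugging information and run Python under gdb so that a backtrace points at the offending Fortran line:

f2py --fcompiler=gfortran --f90flags="-fopenmp -fcheck=all -g" -lgomp -c -m modulename module.f90
gdb --args python run_simulation.py
(gdb) run
(gdb) backtrace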

Intermittent Memory Allocation Error for Fortran Matrix using F2Py

Background:
I have a Python script that uses Fortran code for its intensive calculations. I'm using F2Py to do this. One particular Fortran subroutine builds a matrix used in later calculations. This subroutine is iterated over in a loop, and solved at each step. A snippet of the code using the essential arrays and variables is given below:
for i in xrange(steps):
    x += dx
    F_Output = Matrix_Build_F2Py.hamiltonian_solve(array_1, array_2, array_3, array_4)
    # Do things with F_Output
SUBROUTINE Hamiltonian_Solve(array_1, array_2, array_3, array_4, output_array)
    !N_Long, N_Short are implied, Work, RWork, LWork, INFO
    INTEGER, INTENT(IN), DIMENSION(0:N_Long-1) :: array_1, array_2, array_3
    INTEGER, INTENT(IN), DIMENSION(0:N_Short-1) :: array_4
    COMPLEX*16, ALLOCATABLE :: Hamiltonian(:,:)
    COMPLEX*16, DIMENSION(0:N_Short-1) :: Complex_Var
    DOUBLE PRECISION, INTENT(OUT), DIMENSION(0:N_Short-1) :: E
    INTEGER :: LWork, INFO, j
    COMPLEX*16, ALLOCATABLE :: Work(:)
    ALLOCATE(Hamiltonian(0:N_Short-1, 0:N_Short-1))
    ALLOCATE(RWork(MAX(1,3*(N_Short-2))))
    ALLOCATE(Work(MAX(1,LWork)))
    ALLOCATE(E(0:N_Short-1))
    DO h=0, N_Long-1
        Hamiltonian(array_1(h),array_2(h)) = Hamiltonian(array_1(h),array_2(h)) - Complex_Var(h)
    END DO
    CALL ZHEEV('N','U',N_Short,Hamiltonian,N_Short,E,Work,LWork,RWork,INFO)
    DO j=0,N_Short-1
        Output_Array(j) = E(j)
    END DO
END SUBROUTINE
However, for some reason this subroutine crashes my Python program, and generates the following malloc error:
error for object 0x1015f9808: incorrect checksum for freed object - object was probably modified after being freed.
This error is unusual in that it does not occur every time, but only a significant percentage of the time. I have determined that the root of the error lies in the line:
Hamiltonian(array_1(h),array_2(h))=Hamiltonian(array_1(h),array_2(h))-Complex_Var(h)
As if I change it to:
Hamiltonian(array_1(h),array_2(h))=Hamiltonian(array_1(h),array_2(h))
The error stops. However, Complex_Var is essential to the output; otherwise the program simply produces zeroes. This thread bears some similarity to my issue, but that issue seemed to occur on every run, whereas mine does not. I have taken care to ensure the arrays are not mismatched; other arrangements (i.e. not accounting for numpy's different array formats) immediately create a segmentation fault, as expected.
Question
Why is Complex_Var breaking the code? Why is the problem intermittent rather than systematic? And are there any obvious (or not so obvious) tips to avoid this?
Any help would be much appreciated!
updated per first comment and revision of question:
I see that some arrays in the problem expression have upper dimension N_Long-1 (i.e., array_1 and array_2), while array Complex_Var has upper dimension N_Short-1. The loop iterates up to N_Long-1. Do you know that N_Long-1 <= N_Short-1? If not, you might be accessing an illegal subscript of Complex_Var. And do you know that the values in array_1 and array_2 are always legal subscripts for Hamiltonian? If you write outside the reserved size of that array, you could corrupt the information the memory allocator stored when it created some array, preventing it from freeing that array later.
If this is the problem, using your compiler's option for run-time subscript checking can help you find similar errors.
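With gfortran behind f2py, that run-time check can be switched on at build time; a sketch, with the module and file names as placeholders:

f2py -c -m matrix_build --f90flags="-fcheck=bounds -g" hamiltonian_solve.f90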
It could be because you don't have any deallocate commands. However it is hard to tell with this obviously incomplete code - could you post the actual code (i.e. something that will compile)?

DLR & Performance

I'm intending to create a web service which performs a large number of manually-specified calculations as fast as possible, and have been exploring the use of DLR.
Sorry if this is long but feel free to skim over and get the general gist.
I've been using the IronPython library, as it makes the calculations very easy to specify. My work laptop gives a performance of about 400,000 calculations per second doing the following:
ScriptEngine py = Python.CreateEngine();
ScriptScope pys = py.CreateScope();
ScriptSource src = py.CreateScriptSourceFromString(@"
def result():
    res = [None]*1000000
    for i in range(0, 1000000):
        res[i] = b.GetValue() + 1
    return res
result()
");
CompiledCode compiled = src.Compile();
pys.SetVariable("b", new DynamicValue());
long start = DateTime.Now.Ticks;
var res = compiled.Execute(pys);
long end = DateTime.Now.Ticks;
Console.WriteLine("...Finished. Sample data:");
for (int i = 0; i < 10; i++)
{
    Console.WriteLine(res[i]);
}
Console.WriteLine("Took " + (end - start) / 10000 + "ms to run 1000000 times.");
Where DynamicValue is a class that returns random numbers from a pre-built array (seeded and built at run time).
When I create a DLR class to do the same thing, I get much higher performance (~10,000,000 calculations per second). The class is as follows:
class DynamicCalc : IDynamicMetaObjectProvider
{
    DynamicMetaObject IDynamicMetaObjectProvider.GetMetaObject(Expression parameter)
    {
        return new DynamicCalcMetaObject(parameter, this);
    }

    private class DynamicCalcMetaObject : DynamicMetaObject
    {
        internal DynamicCalcMetaObject(Expression parameter, DynamicCalc value) : base(parameter, BindingRestrictions.Empty, value) { }

        public override DynamicMetaObject BindInvokeMember(InvokeMemberBinder binder, DynamicMetaObject[] args)
        {
            Expression Add = Expression.Convert(Expression.Add(args[0].Expression, args[1].Expression), typeof(System.Object));
            DynamicMetaObject methodInfo = new DynamicMetaObject(Expression.Block(Add), BindingRestrictions.GetTypeRestriction(Expression, LimitType));
            return methodInfo;
        }
    }
}
and is called/tested in the same way by doing the following:
dynamic obj = new DynamicCalc();
long t1 = DateTime.Now.Ticks;
for (int i = 0; i < 10000000; i++)
{
    results[i] = obj.Add(ar1[i], ar2[i]);
}
long t2 = DateTime.Now.Ticks;
Where ar1 and ar2 are pre-built, runtime seeded arrays of random numbers.
The speed is great this way, but it's not easy to specify the calculation. I'd basically be looking at creating my own lexer & parser, whereas IronPython has everything I need already there.
I'd have thought I could get much better performance from IronPython since it is implemented on top of the DLR, and I could do with better than what I'm getting.
Is my example making best use of the IronPython engine? Is it possible to get significantly better performance out of it?
(Edit) Same as first example but with the loop in C#, setting variables and calling the python function:
ScriptSource src = py.CreateScriptSourceFromString(@"b + 1");
CompiledCode compiled = src.Compile();
double[] res = new double[1000000];
for(int i=0; i<1000000; i++)
{
    pys.SetVariable("b", args1[i]);
    res[i] = compiled.Execute(pys);
}
where pys is a ScriptScope from py, and args1 is a pre-built array of random doubles. This example executes slower than running the loop in the Python code and passing in the entire arrays.
delnan's comment points to some of the problems here, but I'll get specific about what the differences are. In the C# version you've cut out a significant number of the dynamic calls that you have in the Python version. For starters, your loop is typed to int, and it sounds like ar1 and ar2 are strongly typed arrays. So in the C# version the only dynamic operations you have are the call to obj.Add (which is 1 operation in C#) and potentially the assignment to results, if it's not typed to object, which seems unlikely. Also note that all of this code is lock-free.
In the Python version you first have the allocation of the list - this also appears to be inside your timer, whereas in C# it doesn't look like it is. Then you have the dynamic call to range; luckily that only happens once, but it again creates a gigantic list in memory - delnan's suggestion of xrange is an improvement here. Then you have the loop counter i, which is getting boxed to an object on every iteration through the loop. Then you have the call to b.GetValue(), which is actually 2 dynamic invocations - first a get-member to fetch the "GetValue" method and then an invoke on that bound method object. This again creates one new object for every iteration of the loop. Then you have the result of b.GetValue(), which may be yet another value that's boxed on every iteration. Then you add 1 to that result, which is another boxing operation on every iteration. Finally you store this into your list, which is yet another dynamic operation - I think this final operation needs to lock to ensure the list remains consistent (again, delnan's suggestion of using a list comprehension improves this).
So in summary during the loop we have:
                     C#   IronPython
Dynamic Operations    1        4
Allocations           1        4
Locks Acquired        0        1
So basically Python's dynamic behavior does come at a cost vs C#. If you want the best of both worlds you can try and balance what you do in C# against what you do in Python. For example, you could write the loop in C# and have it call a delegate which is a Python function (you can do scope.GetVariable<Func<...>>(...) to get a function out of the scope as a delegate). You could also consider allocating a .NET array for the results if you really need to get every last bit of performance, as it may reduce working set and GC copying by not keeping around a bunch of boxed values.
To do the delegate you could have the user write:
def computeValue(value):
    return value + 1
Then in the C# code you'd do:
CompiledCode compiled = src.Compile();
compiled.Execute(pys);
var computer = pys.GetVariable<Func<object,object>>("computeValue");
Now you can do:
for (int i = 0; i < 10000000; i++)
{
    results[i] = computer(i);
}
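Separately, applying the earlier suggestions from delnan (xrange instead of range, and a list comprehension instead of assigning into a pre-built list) to the embedded Python source would look roughly like this:

def result():
    # xrange avoids materialising a million-element list for the loop counter,
    # and the comprehension avoids a separate store into res on every iteration.
    return [b.GetValue() + 1 for i in xrange(1000000)]

result()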
If you are concerned about computation speed, it may be better to look at specifying the computation at a lower level. Python and C# are high-level languages, and their runtimes can spend a lot of time on work under the covers.
Look on this LLVM wrapper library: http://www.llvmpy.org
Install it using: pip install llvmpy ply
or on Debian Linux: apt install python-llvmpy python-ply
You still need to write a tiny compiler (you can use the PLY library) and bind it to LLVM JIT calls (see the LLVM Execution Engine), but this approach can be more effective (the generated code is much closer to real CPU code) and more cross-platform than being tied to .NET.
LLVM has a ready-to-use optimizing compiler infrastructure, including a lot of optimizer stage modules, and a big user and developer community.
Also look here: http://gmarkall.github.io/tutorials/llvm-cauldron-2016
PS: If you are interested in this, I can help you with a compiler, contributing to my project's manual in parallel. But it will not be a jumpstart; this topic is new to me too.
