Segmentation fault after removing debug printing - python

I have a (for me) very weird segmentation error. At first, I thought it was interference between my 4 cores due to openmp, but removing openmp from the equation is not what I want. It turns out that when I do, the segfault still occurs.
What's weird is that if I add a print or write anywhere within the inner-do, it works.
subroutine histogrambins(rMatrix, N, L, dr, maxBins, bins)
implicit none;
double precision, dimension(N,3), intent(in):: rMatrix;
integer, intent(in) :: maxBins, N;
double precision, intent(in) :: L, dr;
integer, dimension(maxBins, 1), intent(out) :: bins;
integer :: i, j, b;
double precision, dimension(N,3) :: cacheParticle, cacheOther;
double precision :: r;
do b= 1, maxBins
bins(b,1) = 0;
end do
!$omp parallel do &
!$omp default(none) &
!$omp firstprivate(N, L, dr, rMatrix, maxBins) &
!$omp private(cacheParticle, cacheOther, r, b) &
!$omp shared(bins)
do i = 1,N
do j = 1,N
!Check the pair distance between this one (i) and its (j) closest image
if (i /= j) then
!should be faster, because it doesn't have to look for matrix indices
cacheParticle(1, :) = rMatrix(i,:);
cacheOther(1, :) = rMatrix(j, :);
call inbox(cacheParticle, L);
call inbox(cacheOther, L);
call closestImage(cacheParticle, cacheOther, L);
r = sum( (cacheParticle - cacheOther) * (cacheParticle - cacheOther) ) ** .5;
if (r /= r) then
! r is NaN
bins(maxBins,1) = bins(maxBins,1) + 1;
else
b = floor(r/dr);
if (b > maxBins) then
b = maxBins;
end if
bins(b,1) = bins(b,1) + 1;
end if
end if
end do
end do
!$omp end parallel do
end subroutine histogramBins
I enabled -debug-capi in the f2py command:
f2py --fcompiler=gfortran --f90flags="-fopenmp -fcheck=all" -lgomp --debug-capi --debug -m -c modulename module.f90;
Which gives me this:
debug-capi:Fortran subroutine histogrambins(rmatrix,&n,&l,&dr,&maxbins,bins)'
At line 320 of file mol-dy.f90
Fortran runtime error: Aborted
It also does a load of other checking, listing arguments given and other subroutines called and so on.
Anyway, the two subroutines called in are both non-parallel subroutines. I use them in several other subroutines and I thought it best not to call a parallel subroutine with the parallel code of another subroutine. So, at the time of processing this function, no other function should be active.
What's going on here? How can adding "print *, ;"" cause a segfault to go away?
Thank you for your time.

It's not unusual for print statements to impact - and either create or remove the segfault. The reason is that they change the way memory is laid out to make room for the string being printed, or you will be making room for temporary strings if you're doing some formatting. That change can be sufficient to cause a bug to either appear as a crash for the first time, or to disappear.
I see you're calling this from Python. If you're using Linux - you could try following a guide to using a debugger with Fortran called from Python and find the line and the data values that cause the crash. This method also works for OpenMP. You can also try using GDB as the debugger.
Without the source code to your problem, I don't think you're likely to get an "answer" to the question - but hopefully the above ideas will help you to solve this yourself.
Using a debugger is (in my experience) considerably less likely to have this now-you-see-it-now-you-don't behaviour than with print statements (almost certainly so if only using one thread).

Related

why is this memoized Euler14 implementation so much slower in Raku than Python?

I was recently playing with problem 14 of the Euler project: which number in the range 1..1_000_000 produces the longest Collatz sequence?
I'm aware of the issue of having to memoize to get reasonable times, and the following piece of Python code returns an answer relatively quickly using that technique (memoize to a dict):
#!/usr/bin/env python
L = 1_000_000
cllens={1:1}
cltz = lambda n: 3*n + 1 if n%2 else n//2
def cllen(n):
if n not in cllens: cllens[n] = cllen(cltz(n)) + 1
return cllens[n]
maxn=1
for i in range(1,L+1):
ln=cllen(i)
if (ln > cllens[maxn]): maxn=i
print(maxn)
(adapted from here; I prefer this version that doesn't use max, because I might want to fiddle with it to return the longest 10 sequences, etc.).
I have tried to translate it to Raku staying as semantically close as I could:
#!/usr/bin/env perl6
use v6;
my $L=1_000_000;
my %cllens = (1 => 1);
sub cltz($n) { ($n %% 2) ?? ($n div 2) !! (3*$n+1) }
sub cllen($n) {
(! %cllens{$n}) && (%cllens{$n} = 1+cllen($n.&cltz));
%cllens{$n};
}
my $maxn=1;
for (1..$L) {
my $ln = cllen($_);
($ln > %cllens{$maxn}) && ($maxn = $_)
}
say $maxn
Here are the sorts of times I am consistently getting running these:
$ time <python script>
837799
real 0m1.244s
user 0m1.179s
sys 0m0.064s
On the other hand, in Raku:
$ time <raku script>
837799
real 0m21.828s
user 0m21.677s
sys 0m0.228s
Question(s)
Am I mistranslating between the two, or is the difference an irreconcilable matter of starting up a VM, etc.?
Are there clever tweaks / idioms I can apply to the Raku code to speed it up considerably past this?
Aside
Naturally, it's not so much about this specific Euler project problem; I'm more broadly interested in whether there are any magical speedup arcana appropriate to Raku I am not aware of.
I think the majority of the extra time is because Raku has type checks, and they aren't getting removed by the runtime type specializer. Or if they are getting removed it is after a significant amount of time.
Generally the way to optimize Raku code is first to run it with the profiler:
$ raku --profile test.raku
Of course that fails with a Segfault with this code, so we can't use it.
My guess would be that much of the time is related to using the Hash.
If it was implemented, using native ints for the key and value might have helped:
my int %cllens{int} = (1 => 1);
Then declaring the functions as using native-sized ints could be a bigger win.
(Currently this is a minor improvement at best.)
sub cltz ( int $n --> int ) {…}
sub cllen( int $n --> int ) {…}
for (1..$L) -> int $_ {…}
Of course like I said native hashes aren't implemented, so that is pure speculation.
You could try to use the multi-process abilities of Raku, but there may be issues with the shared %cllens variable.
The problem could also be because of recursion. (Combined with the aforementioned type checks.)
If you rewrote cllen so that it used a loop instead of recursion that might help.
Note:
The closest to n not in cllens is probably %cllens{$n}:!exists.
Though that might be slower than just checking that the value is not zero.
Also cellen is kinda terrible. I would have written it more like this:
sub cllen($n) {
%cllens{$n} //= cllen(cltz($n)) + 1
}

Difficulty getting OpenMP to work with f2py

I am working on some simulation work for my research and have run into a snag importing fortran into my python scripts. As background, I have worked with Python for some years now, and have only toyed around inside of Fortran when the need arose.
I have done some work in the past with Fortran implementing some simple OpenMP functionality. I am no expert in this, but I have gotten the basics going before.
I am now using f2py to create a library I can call on from my python script. When I try to compile openmp, it compiles correctly and runs to completion, but there is zero improvement in speed and looking at top I see that the CPU usage indicates that only one thread is running.
I've scoured the documentation for f2py (which is not very well documented) as well as done the normal web sleuthing for answers. I've included the Fortran code I am compiling as well as a simple python script that calls on it. I also am throwing in the compile command I am using.
Currently I cut down the simulation to 10^4 as a nice benchmark. On my system it takes 3 seconds to run. Ultimately I need to run a number of 10^6 particle simulations though, so I need to bring down the time a bit.
If anyone can point me in the direction of how to get my code working, it would be super appreciated. I can also try to include any detailed information about the system as needed.
Cheers,
Rylkan
1) Compile
f2py -c --f90flags='-fopenmp' -lgomp -m calc_accel_jerk calc_accel_jerk.f90
2) Python script to call
import numpy as N
import calc_accel_jerk
# a is a (1e5,7) array with M,r,v information
a = N.load('../test.npy')
a = a[:1e4]
out = calc_accel_jerk.calc(a,a.shape[0])
print out[:10]
3) Fortran code
subroutine calc (input_array, nrow, output_array)
implicit none
!f2py threadsafe
include "omp_lib.h"
integer, intent(in) :: nrow
double precision, dimension(nrow,7), intent(in) :: input_array
double precision, dimension(nrow,2), intent(out) :: output_array
! Calculation parameters with set values
double precision,parameter :: psr_M=1.55*1.3267297e20
double precision,parameter :: G_Msun=1.3267297e20
double precision,parameter :: pc_to_m=3.08e16
! Vector declarations
integer :: irow
double precision :: vfac
double precision, dimension(nrow) :: drx,dry,drz,dvx,dvy,dvz,rmag,jfac,az,jz
! Break up the input array for faster access
double precision,dimension(nrow) :: input_M
double precision,dimension(nrow) :: input_rx
double precision,dimension(nrow) :: input_ry
double precision,dimension(nrow) :: input_rz
double precision,dimension(nrow) :: input_vx
double precision,dimension(nrow) :: input_vy
double precision,dimension(nrow) :: input_vz
input_M(:) = input_array(:,1)*G_Msun
input_rx(:) = input_array(:,2)*pc_to_m
input_ry(:) = input_array(:,3)*pc_to_m
input_rz(:) = input_array(:,4)*pc_to_m
input_vx(:) = input_array(:,5)*1000
input_vy(:) = input_array(:,6)*1000
input_vz(:) = input_array(:,7)*1000
!$OMP PARALLEL DO private(vfac,drx,dry,drz,dvx,dvy,dvz,rmag,jfac,az,jz) shared(output_array) NUM_THREADS(2)
DO irow = 1,nrow
! Get the i-th iteration
vfac = sqrt(input_M(irow)/psr_M)
drx = (input_rx-input_rx(irow))
dry = (input_ry-input_ry(irow))
drz = (input_rz-input_rz(irow))
dvx = (input_vx-input_vx(irow)*vfac)
dvy = (input_vy-input_vy(irow)*vfac)
dvz = (input_vz-input_vz(irow)*vfac)
rmag = sqrt(drx**2+dry**2+drz**2)
jfac = -3*drz/(drx**2+dry**2+drz**2)
! Calculate the acceleration and jerk
az = input_M*(drz/rmag**3)
jz = (input_M/rmag**3)*((dvx*drx*jfac)+(dvy*dry*jfac)+(dvz+dvz*drz*jfac))
! Remove bad index
az(irow) = 0
jz(irow) = 0
output_array(irow,1) = sum(az)
output_array(irow,2) = sum(jz)
END DO
!$OMP END PARALLEL DO
END subroutine calc
Here is a simple check to see, wether the OpenMP threads indeed are visible within the Fortran code:
module OTmod
!$ use omp_lib
implicit none
public :: get_threads
contains
function get_threads() result(nt)
integer :: nt
nt = 0
!$ nt = omp_get_max_threads()
end function get_threads
end module OTmod
Compilation:
> f2py -m OTfor --fcompiler=gfortran --f90flags='-fopenmp' -lgomp -c OTmod.f90
Execution:
> python
>>> from OTfor import otmod
>>> otmod.get_threads()
12

How to write a genfromtxt with f2py?

I've found that the function genfromtxt from numpy in python is very slow.
Therefore I decided to wrap a module with f2py to read my data. The data is a matrix.
subroutine genfromtxt(filename, nx, ny, a)
implicit none
character(100):: filename
real, dimension(ny,nx) :: a
integer :: row, col, ny, nx
!f2py character(100), intent(in) ::filename
!f2py integer, intent(in) :: nx
!f2py integer, intent(in) :: ny
!f2py real, intent(out), dimension(nx,ny) :: a
!Opening file
open(5, file=filename)
!read data again
do row = 1, ny
read(5,*) (a(row,col), col =1,nx) !reading line by line
end do
close (5)
end subroutine genfromtxt
The length of the filename is fixed to 100 because if f2py can't deal with dynamic sizes. The code works for sizes shorter than 100, otherwise the code in python crashes.
This is called in python as:
import Fmodules as modules
w_map=modules.genfromtxt(filename,100, 50)
How can I do this dynamically without passing nx, ny as parameters nor fixing the filename length to 100?
The filename length issue is easily dealt with: you make filename: character(len_filename):: filename, have len_filename as a parameter of the fortran function, and use the f2py intent(hide) to hide it from your Python interface (see the code below).
Your issue of not wanting to have to pass in nx and ny is more complicated, and is essentially is the problem of how to return an allocatable array from f2py. I can see two methods, both of which require a (small and very light) Python wrapper to give you a nice interface.
I've haven't actually addressed the problem of how to read the csv file - I've just generated and returned some dummy data. I'm assuming you have a good implementation for that.
Method 1: module level allocatable arrays
This seems to be a fairly standard method - I found at least one newsgroup post recommending this. You essentially have an allocatable global variable, initialise it to the right size, and write to it.
module mod
real, allocatable, dimension(:,:) :: genfromtxt_output
contains
subroutine genfromtxt_v1(filename, len_filename)
implicit none
character(len_filename), intent(in):: filename
integer, intent(in) :: len_filename
!f2py intent(hide) :: len_filename
integer :: row, col
! for the sake of a quick demo, assume 5*6
! and make it all 2
if (allocated(genfromtxt_output)) deallocate(genfromtxt_output)
allocate(genfromtxt_output(1:5,1:6))
do row = 1,5
do col = 1,6
genfromtxt_output(row,col) = 2
end do
end do
end subroutine genfromtxt_v1
end module mod
The short Python wrapper then looks like this:
import _genfromtxt
def genfromtxt_v1(filename):
_genfromtxt.mod.genfromtxt_v1(filename)
return _genfromtxt.mod.genfromtxt_output.copy()
# copy is needed, otherwise subsequent calls overwrite the data
The main issue with this is that it won't be thread safe: if two Python threads call genfromtxt_v1 at very similar times the data could get overwritten before you've had a change to copy it out.
Method 2: Python callback function that saves the data
This is slightly more involved and a horrendous hack of my own invention. You pass your array to a callback function which then saves it. The Fortran code looks like:
subroutine genfromtxt_v2(filename,len_filename,callable)
implicit none
character(len_filename), intent(in):: filename
integer, intent(in) :: len_filename
!f2py intent(hide) :: len_filename
external callable
real, allocatable, dimension(:,:) :: result
integer :: row, col
integer :: rows,cols
! for the sake of a quick demo, assume 5*6
! and make it all 2
rows = 5
cols = 6
allocate(result(1:rows,1:cols))
do row = 1,rows
do col = 1,cols
result(row,col) = 2
end do
end do
call callable(result,rows,cols)
deallocate(result)
end subroutine genfromtxt_v2
You then need to generate the "signature file" with f2py -m _genfromtxt -h _genfromtxt.pyf genfromtxt.f90 (assuming genfromtxt.f90 is the file with your Fortran code). You then modify the "user routines block" to clarify the signature of your callback:
python module genfromtxt_v2__user__routines
interface genfromtxt_v2_user_interface
subroutine callable(result,rows,cols)
real, dimension(rows,cols) :: result
integer :: rows
integer :: cols
end subroutine callable
end interface genfromtxt_v2_user_interface
end python module genfromtxt_v2__user__routines
(i.e. you specify the dimensions). The rest of the file is left unchanged.
Compile with f2py -c genfromtxt.f90 _genfromtxt.pyf
The small Python wrapper is
import _genfromtxt
def genfromtxt_v2(filename):
class SaveArrayCallable(object):
def __call__(self,array):
self.array = array.copy() # needed to avoid data corruption
f = SaveArrayCallable()
_genfromtxt.genfromtxt_v2(filename,f)
return f.array
I think this should be thread safe: although I think Python callback functions are implemented as global variables, but default f2py doesn't release the GIL so no Python code can be run between the global being set and the callback being run. I doubt you care about thread safety for this application though...
I think you can just use:
open(5, file=trim(filename))
to deal with filenames shorter than the length of filename. (i.e. you could make filename much longer than it needs to be and just trim it here)
I don't know of any nice clean way to remove the need to pass nx and ny to the fortran subroutine. Perhaps if you can determine the size and shape of the data file programatically (e.g. read the first line to find nx, call some function or have a first pass over the file to determine the number of lines in the file), then you could allocate your a array after finding those values. That would slow everything down though, so may be counter productive

Cython for loop conversion

Using cython -a, I found that a for i in range(0, a, b) statement was run as a python loop (very yellow line in cython -a html output). i, a and b were cdef-ed as int64_t.
Then I tried the 'old' syntax for i from 0 <= i < b by a. From the output of cython -a it seemed to compile quite optimal as expected.
Is it expected behaviour that range(0, a, b) is not optimized here or is this rather bound to the implementation?
Automatic range conversion is only applied when cython can determine the sign of the step at compile time. As the step in this case is a signed type it cannot and so falls back to the python loop.
Note that currently even when the type is unsigned cython still falls back onto the python loop, this is a (rather old) outstanding further optimisation that the compiler could do but doesn't. Have a look at this ticket for more information:
http://trac.cython.org/ticket/546

Intermittent Memory Allocation Error for Fortran Matrix using F2Py

Background:
I have a Python script that uses Fortran code for it's intensive calculations. I'm using F2Py to do this. One particular Fortran subroutine builds a matrix used in later calculations. This subroutine is iterated over in a loop, and solved at each step. A snippet of the code using essential arrays and variables is given below:
for i in xrange(steps):
x+=dx
F_Output=Matrix_Build_F2Py.hamiltonian_solve(array_1, array_2, array_3, array_4)
#Do things with F_Output
SUBROUTINE Hamiltonian_Solve(array_1, array_2, array_3, array_4, output_array)
!N_Long, N_Short are implied, Work, RWork, LWork, INFO
INTEGER, INTENT(IN), DIMENSION(0:N_Long-1) :: array_1, array_2, array_3
INTEGER, INTENT(IN), DIMENSION(0:N_Short-1) :: array_4
COMPLEX*16,ALLOCATABLE :: Hamiltonian(:,:)
COMPLEX*16, DIMENSION(0:N_Short :: Complex_Var
DOUBLE PRECISION, INTENT(OUT), DIMENSION(0:N_Short-1) :: E
INTEGER :: LWork, INFO, j
COMPLEX*16, ALLOCATABLE :: Work(:)
ALLOCATE(Hamiltonian(0:N_Short-1, 0:N_Short-1))
ALLOCATE(RWork(MAX(1,3*(N_Short-2))))
ALLOCATE(Work(MAX(1,LWork)))
ALLOCATE(E(0:N_Short-1))
DO h=0, N_Long-1
Hamiltonian(array_1(h),array_2(h))=Hamiltonian(array_1(h),array_2(h))-Complex_Var(h)
END DO
CALL ZHEEV('N','U',N_Short,Hamiltonian,N_Short,E,Work,LWork,RWork,INFO)
DO j=0,N_Short-1
Output_Array(j)=E(j)
END DO
END SUBROUTINE
However, for some reason this subroutine crashes my Python program, and generates the following malloc error:
error for object 0x1015f9808: incorrect checksum for freed object - object was probably modified after being freed.
This error is unusual in that it does not occur every time, but only a significant percentage of the time. I have determined that the root of the error lies in the line:
Hamiltonian(array_1(h),array_2(h))=Hamiltonian(array_1(h),array_2(h))-Complex_Var(h)
As if I change it to:
Hamiltonian(array_1(h),array_2(h))=Hamiltonian(array_1(h),array_2(h))
The error stops. However, Complex_Var is essential to the output, otherwise the program simply produces zeroes. This thread bears some similarity to my issue, but that issue seemed to occur after every run, mine does not. I have taken care to ensure the arrays are not mismatched, other arrangements (ie not accounting for numpy's different array formats) immediately creates a segmentation fault, as expected.
Question
Why is Complex_Var breaking the code? Why is the problem intermittent rather than systematic? And are there any obvious (or not so obvious) tips to avoid this?
Any help would be much appreciated!
updated per first comment and revision of question:
I see that some arrays in the problem expression have upper-dimension N_long-1 (i.e., array_1 and array_2) and array Complex_Var dimension N_short. The loop iterates up to N_Long-1. Do you know that N_Long-1 <= N_short ? If not, you might be accessing an illegal subscript o Complex_var. And do you know that the values in array_1 and array_2 are always legal subscripts for Hamilton? If you write outside the reserved size of that array, you could corrupt the information used by the memory allocator when it created some array, preventing it from freeing that array later.
If this is the problem, using your compiler's option for run-time subscript checking can help you find similar errors.
It could be because you don't have any deallocate commands. However it is hard to tell with this obviously incomplete code - could you post the actual code (i.e. something that will compile)?

Categories