I get a 512^3 array representing a temperature distribution from a simulation (written in Fortran). The array is stored in a binary file that's about 1/2G in size. I need to know the minimum, maximum, and mean of this array, and since I will soon need to understand Fortran code anyway, I decided to give it a go and came up with the following very simple routine.
integer gridsize,unit,j
real mini,maxi,tmp
double precision mean

gridsize=512
unit=40

open(unit=unit,file='T.out',status='old',access='stream',&
     form='unformatted',action='read')

! the first value initializes the running min/max/mean
read(unit=unit) tmp
mini=tmp
maxi=tmp
mean=tmp

! read the remaining gridsize**3 - 1 values one at a time
do j=2,gridsize**3
    read(unit=unit) tmp
    if(tmp>maxi)then
        maxi=tmp
    elseif(tmp<mini)then
        mini=tmp
    end if
    mean=mean+tmp
end do
mean=mean/gridsize**3

close(unit=unit)
This takes about 25 seconds per file on the machine I use. That struck me as rather long, so I went ahead and did the following in Python:
import numpy
mmap = numpy.memmap('T.out', dtype='float32', mode='r', offset=4,
                    shape=(512, 512, 512), order='F')
mini = numpy.amin(mmap)
maxi = numpy.amax(mmap)
mean = numpy.mean(mmap)
Now, I expected this to be faster, of course, but I was really blown away: it takes less than a second under identical conditions. The mean deviates from the one my Fortran routine finds (which I also ran with 128-bit floats, so I somehow trust it more), but only in the 7th significant digit or so.
How can numpy be so fast? I mean you have to look at every entry of an array to find these values, right? Am I doing something very stupid in my Fortran routine for it to take so much longer?
EDIT:
To answer the questions in the comments:
Yes, I also ran the Fortran routine with 32-bit and 64-bit floats, but it had no impact on performance.
I used iso_fortran_env, which provides 128-bit floats.
With 32-bit floats my mean is off quite a bit, though, so precision really is an issue.
I ran both routines on different files in different order, so the caching should have been fair in the comparison, I guess?
I actually tried OpenMP, but only to read from the file at different positions at the same time. Having read your comments and answers, this sounds really stupid now, and it made the routine take a lot longer as well. I might give it a try on the array operations, but maybe that won't even be necessary.
The files are actually 1/2G in size; that was a typo. Thanks.
I will try the array implementation now.
EDIT 2:
I implemented what @Alexander Vogt and @casey suggested in their answers, and it is as fast as numpy, but now I have the precision problem that @Luaan pointed out I might get. Using a 32-bit float array, the mean computed by sum is 20% off. Doing
...
real, allocatable :: tmp(:,:,:)
double precision, allocatable :: tmp2(:,:,:)
...
tmp2 = tmp
mean = sum(tmp2)/size(tmp)
...
solves the issue but increases computing time (not by much, but noticeably).
Is there a better way to get around this? I couldn't find a way to read singles from the file directly into doubles.
And how does numpy avoid this?
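(For reference: NumPy's sum and mean use pairwise summation for floating-point reductions, which keeps the rounding error far smaller than naive one-by-one accumulation even in float32; mean also accepts an explicit accumulator dtype. A minimal sketch against the memmap from above:)

import numpy
mmap = numpy.memmap('T.out', dtype='float32', mode='r', offset=4,
                    shape=(512, 512, 512), order='F')
# accumulate in float64 without materializing a float64 copy of the data
mean = mmap.mean(dtype=numpy.float64)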
Thanks for all the help so far.
Your Fortran implementation suffers from two major shortcomings:
You mix IO and computations (and read from the file entry by entry).
You don't use vector/matrix operations.
This implementation does perform the same operation as yours and is faster by a factor of 20 on my machine:
program test
  integer gridsize,unit
  real mini,maxi,mean
  real, allocatable :: tmp (:,:,:)

  gridsize=512
  unit=40

  allocate( tmp(gridsize, gridsize, gridsize))

  open(unit=unit,file='T.out',status='old',access='stream',&
       form='unformatted',action='read')
  read(unit=unit) tmp
  close(unit=unit)

  mini = minval(tmp)
  maxi = maxval(tmp)
  mean = sum(tmp)/gridsize**3

  print *, mini, maxi, mean
end program
The idea is to read the whole file into one array tmp in one go. Then I can apply the intrinsic functions MAXVAL, MINVAL, and SUM to the array directly.
For the accuracy issue: Simply using double precision values and doing the conversion on the fly as
mean = sum(real(tmp, kind=kind(1.d0)))/real(gridsize**3, kind=kind(1.d0))
only marginally increases the calculation time. I tried performing the operation element-wise and in slices, but that only increased the required time at the default optimization level.
At -O3, the element-wise addition performs ~3% better than the array operation. The difference between double and single precision operations is less than 2% on my machine, on average (individual runs deviate by far more).
Here is a very fast implementation using LAPACK:
program test
  integer gridsize,unit, i, j
  real mini,maxi
  integer :: t1, t2, rate
  real, allocatable :: tmp (:,:,:)
  real, allocatable :: work(:)
  ! double precision :: mean
  real :: mean
  real :: slange

  call system_clock(count_rate=rate)
  call system_clock(t1)

  gridsize=512
  unit=40

  allocate( tmp(gridsize, gridsize, gridsize), work(gridsize))

  open(unit=unit,file='T.out',status='old',access='stream',&
       form='unformatted',action='read')
  read(unit=unit) tmp
  close(unit=unit)

  mini = minval(tmp)
  maxi = maxval(tmp)

  ! mean = sum(tmp)/gridsize**3
  ! mean = sum(real(tmp, kind=kind(1.d0)))/real(gridsize**3, kind=kind(1.d0))
  mean = 0.d0
  do j=1,gridsize
    do i=1,gridsize
      mean = mean + slange('1', gridsize, 1, tmp(:,i,j), gridsize, work)
    enddo !i
  enddo !j
  mean = mean / gridsize**3

  print *, mini, maxi, mean
  call system_clock(t2)
  print *, real(t2-t1)/real(rate)
end program
This uses the single-precision matrix 1-norm SLANGE on matrix columns. The run-time is even faster than the approach using single-precision array functions, and it does not show the precision issue. (Note that SLANGE's 1-norm is the sum of absolute values, so this equals the plain sum only when all entries are non-negative, which holds for temperature data in Kelvin.)
numpy is faster because you wrote much more efficient code in Python (and much of the numpy backend is written in optimized Fortran and C) and terribly inefficient code in Fortran.
Look at your Python code: you load the entire array at once and then call functions that operate on the whole array.
Look at your Fortran code: you read one value at a time and do some branching logic with it.
The majority of the discrepancy comes from the fragmented IO you have written in Fortran.
You can write the Fortran just about the same way you wrote the Python, and you'll find it runs much faster that way.
program test
  implicit none
  integer :: gridsize, unit
  real :: mini, maxi, mean
  real, allocatable :: array(:,:,:)

  gridsize=512
  allocate(array(gridsize,gridsize,gridsize))
  unit=40

  open(unit=unit, file='T.out', status='old', access='stream',&
       form='unformatted', action='read')
  read(unit) array

  maxi = maxval(array)
  mini = minval(array)
  mean = sum(array)/size(array)

  close(unit)
end program test
Related
import numpy as np
v = np.zeros((3,10000), dtype=np.float32)
mat = np.zeros((10000,10000000), dtype=np.int8)
w = np.matmul(v, mat)
yields
Traceback (most recent call last):
File "int_mul_test.py", line 6, in <module>
w = np.matmul(v, mat)
numpy.core._exceptions.MemoryError: Unable to allocate 373. GiB
for an array with shape (10000, 10000000) and data type float32
Apparently, numpy is trying to convert my 10k x 10M int8 matrix to dtype float32. Why does it need to do this? It seems extremely wasteful, and if matrix multiplication must work with float numbers in memory, it could convert, say, 1M columns at a time (which shouldn't sacrifice speed too much) instead of converting all 10M columns at once.
My current solution is to use a loop to break the matrix into 10 pieces and reduce temporary memory allocation to 1/10 of the 373 GiB:
w = np.empty((v.shape[0], mat.shape[1]), dtype=np.float32)
start = 0
block = 1000000
for i in range(mat.shape[1]//block):
    end = start + block
    w[:,start:end] = np.matmul(v, mat[:,start:end])
    start = end
w[:,start:] = np.matmul(v, mat[:,start:])
# runs in 396 seconds
Is there a numpy-idiomatic way to multiply "piece by piece" without manually coding a loop?
The semantics of NumPy operations force the inputs of a binary operation to be cast when the left and right types differ. In fact, this is the case in almost all statically typed languages, including C, C++, Java, and Rust, but also in many dynamically typed languages (where the rules are applied at runtime). Python also (partially) applies such well-defined semantic rules. For example, when you evaluate the expression True * 1.7, the interpreter determines the types of both operands (bool and float here) and then applies promotion rules until both operands have the same type before performing the actual multiplication. In this case, True of type bool is cast to 1 of type int, which is then cast to 1.0 of type float. Such rules are generally defined so as to be both relatively unambiguous and safe; for example, you would not expect 2 * 1.7 to equal 3.

NumPy uses semantic rules similar to those of the C language because it is written in C and provides native types. The rules should be defined independently of any given implementation, although performance and ease of use matter a lot when designing them. Unfortunately, in your case, this means a huge array has to be allocated.
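You can see the promotion rule that bites here by querying NumPy directly (this merely reports the existing casting behavior):

import numpy as np

# int8 combined with float32 is promoted to float32, so matmul(v, mat)
# needs a float32 copy of the entire int8 matrix:
print(np.result_type(np.int8, np.float32))  # float32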
Note that NumPy could theoretically bypass casting and implement the N * N possible versions of each binary operation for the N different types in order to make them faster (like the "as-if" semantic rule of the C language). However, this would be insane for developers to implement, and it would result in more bug-prone code (i.e. less stable and slower development) and huge code bloat (bigger binaries). This is especially true since other parameters would have to be taken into account, like the shape of the array and the memory layout (e.g. alignment), or even the target architecture. The current main casting generative function of NumPy is already quite complex and already results in 4 x 18 x 18 = 1296 different C functions being compiled and stored in the NumPy binaries!
In your case, you can use Cython or Numba to generate a memory-efficient (and possibly faster) implementation dedicated to your specific needs. Be careful about possible overflows, though.
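For illustration, here is a minimal Numba sketch (assuming Numba is installed; the function name and loop structure are mine, not from any library) that casts one int8 scalar at a time instead of materializing a float32 copy of the whole matrix:

import numpy as np
from numba import njit, prange

@njit(parallel=True)
def matmul_int8_f32(v, mat):
    # v: float32 of shape (m, k); mat: int8 of shape (k, n).
    # Each int8 entry is cast to float32 on the fly, so no float32
    # copy of mat is ever allocated.
    m, k = v.shape
    n = mat.shape[1]
    out = np.zeros((m, n), dtype=np.float32)
    for j in prange(n):
        for p in range(k):
            x = np.float32(mat[p, j])
            for i in range(m):
                out[i, j] += v[i, p] * x
    return out

w = matmul_int8_f32(v, mat)  # v, mat as defined in the question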
I am facing a problem in my work with MATLAB: I need to compute sinh(a + b*i) (e.g., 1000+1i), where i is the imaginary unit and a, b are values too large for type double to handle. Surely I could compute the function via Wolfram or Fortran, but I need a common language that can do the GPIB communication together with this calculation.
After asking some people about this, I was told that Python and C have a type called "big float", but none of them could tell me its precision or maximum value, not to mention its efficiency. So can anyone suggest a solution? Or maybe there's another language that can handle this problem (computing with large complex numbers plus GPIB sessions)?
You could use Julia:
julia> a = BigFloat("1e10")
1.0e+10
julia> b = BigFloat("1e500")
1.000000000000000000000000000000000000000000000000000000000000000000000000000004e+500
julia> sinh(a + b * im)
1.43445592092543814302692567115470616209662482064997303227590320999972133381932e+4342944818 + 5.194323395284352151694584377260055302504830707661913916785283453158333278974902e+4342944818im
but as others have commented, and as you can see in this example, sinh grows quickly with the real part of your complex number, so you can't have too big an a: since sinh(a + b*im) = sinh(a)*cos(b) + cosh(a)*sin(b)*im, the magnitude grows like e^a/2. (With a = 1e10, the result already reaches 10^4342944818, which is just 1e10/ln(10) ≈ 4.34e9 as the decimal exponent.)
FWIW, in Julia, written down as an integer, the biggest BigFloat would have about 1,388,255,822,130,839,282 digits:
julia> prevfloat(typemax(BigFloat))
5.875653789111587590936911998878442589938516392745498308333779606469323584389875e+1388255822130839282
Also, I don't know how accurate this result is.
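On the Python side of the question, the mpmath library offers the same kind of arbitrary-precision complex arithmetic; a minimal sketch (working precision in decimal digits is set via mp.dps, and the exponent range is effectively unbounded):

from mpmath import mp, mpc, sinh

mp.dps = 50  # work with 50 significant decimal digits
print(sinh(mpc(1000, 1)))  # sinh(1000 + 1i); sinh(1000) ~ 1e434 already overflows a double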
I want to translate this MATLAB code into Python. I think I did everything right, yet I don't get the same results.
MATLAB script:
n = 2                              % Filter order
Wn = [0.4 0.6]                     % Normalized cutoff frequencies
[b,a] = butter(n, Wn, 'bandpass')  % Transfer function coefficients of the filter
Python script:
import numpy as np
from scipy import signal
n = 2                                      # Filter order
Wn = np.array([0.4, 0.6])                  # Normalized cutoff frequencies
b, a = signal.butter(n, Wn, btype='band')  # Transfer function coefficients of the filter
a coefficients in MATLAB: 1, -5.55e-16, 1.14, -1.66e-16, 0.41
a coefficients in Python: 1, -2.77e-16, 1.14, -1.94e-16, 0.41
Could it just be a question of precision, since the two differing values (the 2nd and 4th) are both on the order of 10^(-16)?
The b coefficients are the same on the other hand.
Your machine precision is about 1e-16 (in MATLAB this can be checked easily with eps(); I presume it's about the same in Python). The 'error' you are dealing with is thus on the order of machine precision, i.e. not meaningfully resolvable at that precision.
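For reference, the Python-side equivalent of MATLAB's eps() is a one-liner with NumPy:

import numpy as np
print(np.finfo(np.float64).eps)  # 2.220446049250313e-16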
Also of note is that MATLAB ~= Python (or != in Python): the implementations of butter() on one hand and signal.butter() on the other will be slightly different, even with the exact same inputs, due to the way each is translated to machine code.
It rarely matters to have coefficients differing by 16 orders of magnitude; the smaller ones are essentially negligible. In case you do need exact values, consider using either symbolic math or some kind of variable-precision arithmetic (vpa() in MATLAB), but I'd guess that in your case the difference is irrelevant.
I'm trying to use numpy to square an array element-wise, and I've noticed that some of the values appear as negative numbers even though the squared value isn't near the max int limit. Does anyone know why this is happening and how I can fix it? I'd rather avoid squaring element-wise with a for loop, since my data set is quite large.
Here's an example of what is happening:
import numpy as np
test = [1, 2, 47852]
sq = np.array(test)**2
print(sq)
print(47852*47852)
Output:
[          1           4 -2005153392]
2289813904
This is because NumPy doesn't check for integer overflow, likely because that would slow down every integer operation, and NumPy is designed with efficiency in mind. So when you have an array of 32-bit integers and your result does not fit in 32 bits, it is still interpreted as a 32-bit integer, giving you the strange negative result.
To avoid this, be mindful of the dtype you need to perform the operation safely; in this case 'int64' suffices.
>>> np.array(test, dtype='int64')**2
array([         1,          4, 2289813904])
You aren't seeing the same issue with Python ints because Python checks for overflow and promotes to a larger data type when necessary. If I recall correctly, there was a question about this on the mailing list, and the response was that doing the same in NumPy would have a large performance impact on atomic array operations.
As for why your default integer type may be 32-bit on a 64-bit system: as Goyo answered on a related question, the default integer type np.int_ is the same as C long, which is platform-dependent and can be 32 bits.
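A quick way to check what the default integer type resolves to on your platform (typically int64 on 64-bit Linux/macOS and int32 on Windows, where C long is 32-bit):

import numpy as np
print(np.dtype(np.int_))  # e.g. int64 on 64-bit Linux, int32 on Windows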
I wrote a C++ wrapper class for some functions in LAPACK. To test the class, I use the Python C extension API, where I call numpy, perform the same operations, and compare the results by taking the difference.
For example, for the inverse of a matrix, I generate a random matrix in C++, then pass it as a string (with many, many digits, like 30 digits per number) to Python's interpreter using PyRun_SimpleString, and assign the matrix with numpy.matrix(..., dtype=numpy.double) (or numpy.complex128). Then I use numpy.linalg.inv() to calculate the inverse of the same matrix. Finally, I take the difference between numpy's result and mine, and use numpy.isclose with a specific relative tolerance to see whether the results are close enough.
The problem: when I use C++ floats, the relative tolerance I need for the comparison to pass is about 1e-2!!! And even with that tolerance I get some statistical failures (with low probability).
Doubles are fine: I can use 1e-10 and it's statistically safe.
While I know that floats have an intrinsic precision of about 1e-6, I'm wondering why I have to go as low as 1e-2 to be able to compare the results, and it still fails sometimes!
Going down to 1e-2 got me wondering whether I'm thinking about this whole thing the wrong way. Is there something wrong with my approach?
Please ask for more details if you need it.
Update 1: Eric requested an example of the Python calls. Here is one:
//create my matrices
Matrix<T> mat_d = RandomMatrix<T>(...);
auto mat_d_i = mat_d.getInverse();
//I store everything in the dict 'data'
PyRun_SimpleString(std::string("data={}").c_str());
//original matrix
//mat_d.asString(...) will return in the format [[1,2],[3,4]], where 32 is 32 digits per number
PyRun_SimpleString(std::string("data['a']=np.matrix(" + mat_d.asString(32,'[',']',',') + ",dtype=np.complex128)").c_str());
//pass the inverted matrix to Python
PyRun_SimpleString(std::string("data['b_c']=np.matrix(" + mat_d_i.asString(32,'[',']',',') + ",dtype=np.complex128)").c_str());
//inverse in numpy
PyRun_SimpleString(std::string("data['b_p']=np.linalg.inv(data['a'])").c_str());
//flatten the matrices to make comparing them easier (make them 1-dimensional)
PyRun_SimpleString("data['fb_p']=((data['b_p']).flatten().tolist())[0]");
PyRun_SimpleString("data['fb_c']=((data['b_c']).flatten().tolist())[0]");
//make the comparison. The function compare_floats(f1,f2,t) calls numpy.isclose(f1,f2,rtol=t)
//prec is an integer that takes its value from a template function, where I choose the precision I want based on type
PyRun_SimpleString(std::string("res=list(set([compare_floats(data['fb_p'][i],data['fb_c'][i],1e-"+ std::to_string(prec) +") for i in range(len(data['fb_p']))]))[0]").c_str());
//the set above eliminates repeated True and False. If all results are True, we expect that res=[True], otherwise, the test failed somewhere
PyRun_SimpleString(std::string("res = ((len(res) == 1) and res[0])").c_str());
//Now if res is True, then success
Comments in the code describe the procedure step-by-step.