Issue with numba: "cannot unify array" unless I convert scalars to array - python

What I am trying to do
I am trying to create a very simple function which I want to optimise with numba (or at least verify if numba makes any difference).
I am running numpy 1.19.2 and numba 0.51.2 in an Anaconda installation on Windows.
The function takes 3 numeric inputs: a, b, c. The inputs can be scalars or numpy arrays, and the output should correspondingly be a scalar or a numpy array.
The function is fairly simple:
if a == 0 --> it returns np.nan
if b == 0 --> it returns a certain number
otherwise it performs some very simple algebra
The issue
I have come up with the toy example below (my actual formulas are more complex but I can show what I need to show with this easier example).
if the inputs are arrays, it works perfectly
if the inputs are scalar, numba doesn't work (Cannot unify array(int64, 0d, C) and float64 for '$phi12.0.2' )
if the inputs are arrays of size 1 (I make an array out of each scalar) numba works again
What I tried / similar questions
The closest question I found was this, but the mismatch there was between an int and a float.
Here it is between an array(int64, 0d, C) and a float64. I can convert my inputs to float but the mismatch remains.
Any ideas? I am not sure what the array and the float being compared are, to be honest.
The one solution I have found is to add a = np.array([a]) at the beginning of the function, but I don't understand why, plus this returns an array of size 1, whereas I'd like a scalar returned in these cases.
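A possible reading of the error message, sketched with plain NumPy only (no numba involved): with scalar arguments, np.where returns a zero-dimensional array rather than a scalar, so the inner call produces a 0-d integer array that the outer call then has to combine with the float np.nan. Numba's type system is stricter than NumPy's, so treat this as an illustration of where array(int64, 0d, C) might come from, not as a confirmed diagnosis.

```python
import numpy as np

# With scalar arguments np.where returns a 0-d array, not a Python scalar.
inner = np.where(0 == 0, 0, 1 ** 2)
print(type(inner), inner.ndim)

# Plain NumPy happily mixes the 0-d integer array with the float nan ...
out = np.where(0 == 0, np.nan, inner)
print(out.ndim, out.dtype)

# ... and indexing with an empty tuple (or calling .item()) gives a scalar back.
as_scalar = out[()]
print(type(as_scalar), np.isnan(as_scalar))
```

The out[()] step is one way to get a scalar back in the scalar case without wrapping the inputs in one-element arrays.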
Toy example
import numpy as np
import numba
@numba.jit
def my_fun(a,b,c):
    return np.where(a == 0, np.nan,
                    np.where(b ==0 , 0 , c**2) )
a = np.arange(0,11)
b = np.arange(3,14)
b[1] = 0
c = np.arange(10,21)
out_array = my_fun(a,b,c)
out_scalar = my_fun(0,0,1)
The exact warning:
NumbaWarning:
Compilation is falling back to object mode WITH looplifting enabled because Function my_fun failed at nopython mode lowering due to: Failed in nopython mode pipeline (step: nopython frontend)
Cannot unify array(int64, 0d, C) and float64 for '$phi12.0.2', defined at C:\Users\USERNAME\anaconda3\lib\site-packages\numba\np\arraymath.py (3276)
File "C:\Users\USERNAME\anaconda3\lib\site-packages\numba\np\arraymath.py", line 3276:
def scalar_where_impl(cond, x, y):
<source elided>
"""
scal = x if cond else y
^
During: typing of assignment at C:\Users\USERNAME\anaconda3\lib\site-packages\numba\np\arraymath.py (3276)
File "C:\Users\USERNAME\anaconda3\lib\site-packages\numba\np\arraymath.py", line 3276:
def scalar_where_impl(cond, x, y):
<source elided>
"""
scal = x if cond else y
^
During: lowering "$36call_method.17 = call $4load_method.1($10compare_op.4, $14load_attr.6, $34call_method.16, func=$4load_method.1, args=[Var($10compare_op.4, refactor numba.py:8), Var($14load_attr.6, refactor numba.py:8), Var($34call_method.16, refactor numba.py:9)], kws=(), vararg=None)" at D:\MY DATA\USERNAME\Python\scratch scripts\refactor numba.py (8)
@numba.jit
C:\Users\USERNAME\anaconda3\lib\site-packages\numba\core\object_mode_passes.py:177: NumbaWarning: Function "my_fun" was compiled in object mode without forceobj=True.
File "refactor numba.py", line 6:
@numba.jit
def my_fun(a,b,c):
^
warnings.warn(errors.NumbaWarning(warn_msg,
C:\Users\USERNAME\anaconda3\lib\site-packages\numba\core\object_mode_passes.py:187: NumbaDeprecationWarning:
Fall-back from the nopython compilation path to the object mode compilation path has been detected, this is deprecated behaviour.
For more information visit https://numba.pydata.org/numba-doc/latest/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit
File "refactor numba.py", line 6:
@numba.jit
def my_fun(a,b,c):
^
warnings.warn(errors.NumbaDeprecationWarning(msg,

I have found a solution, but it's far from elegant, and I am hoping there is a better one.
To recap, I needed a function which:
works with numba
works with both scalars and arrays
returns scalar (not a one-sized array) when the inputs are scalars, and arrays when the inputs are arrays
I have tried the following, and found option 2 to be the fastest.
my_fun_optimised_1: a function which, without numba, determines whether the inputs are scalar or not, and then calls, accordingly, a sub-function for the scalar case and one for the arrays. Both sub-functions run with numba, but take forever. I guess this is because numba must be re-initialised at each call of the main function.
my_fun_optimised_2: similar to the above, except the scalar and array functions, both running with numba, are main functions and not subfunctions. Much much faster.
my_fun_non_opt_no_numba : a function which runs without numba.
The results are:
+-------------------------+----------------------------+-----------------------------+
| Function | Array: time vs the fastest | Scalar: time vs the fastest |
+-------------------------+----------------------------+-----------------------------+
| optimised numba 1 | 54,403 | 42,961 |
| optimised numba 2 | 1 | 1 |
| non-optimised, no numba | 3.409 | 4.53892 |
+-------------------------+----------------------------+-----------------------------+
What this means is that, on my PC, the non-optimised, no-numba code takes 4.5 times longer than "optimised numba 2" to run on scalars and 3.4 times longer on arrays.
The "optimised numba 1" version is not optimised at all and takes an insane amount of time.
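One way to see why version 1 might be so slow is that a decorator on a nested function is applied again every time the outer function runs. The sketch below uses a hypothetical counting_decorator as a stand-in for numba.jit, so it only shows how often the decorator is applied. That numba then has to compile again on each of those applications is an assumption, not something this snippet measures.

```python
calls = {"nested": 0, "module_level": 0}

def counting_decorator(label):
    """Hypothetical stand-in for numba.jit: counts how often it is applied."""
    def decorate(func):
        calls[label] += 1
        return func
    return decorate

def outer_with_nested(a):
    # Re-defined and re-decorated on every call of outer_with_nested.
    @counting_decorator("nested")
    def helper(x):
        return x * 2
    return helper(a)

# Defined and decorated exactly once, when the module is loaded.
@counting_decorator("module_level")
def helper_top(x):
    return x * 2

def outer_with_top(a):
    return helper_top(a)

for value in range(100):
    outer_with_nested(value)
    outer_with_top(value)

print(calls)
```

This matches the layout of my_fun_optimised_2, where the decorated functions live at module level and the wrapper only dispatches to them.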
I hope all of this can be of use to other people.
PS: I am well aware of the pitfalls of premature optimisation. I am only doing this because I have a specific case where 60% of the time is spent on a calculation similar (but not identical) to the one shown here.
The code to time the functions is:
import numpy as np
import numba
import timeit
import pandas as pd
def my_fun_optimised_1(a,b,c):
    @numba.jit
    def my_fun_vectorised(a,b,c):
        return np.where(a == 0, np.nan,
                        np.where(b ==0 , 0 , b*a**3 + a*b**3 + a*b*c**3)
                        )
    @numba.jit
    def my_fun_scalar(a,b,c):
        if a ==0:
            return np.nan
        elif b == 0:
            # note: the vectorised version returns 0 here, not nan
            return np.nan
        else:
            return b*a**3 + a*b**3 + a*b*c**3
    if np.isscalar(a) and np.isscalar(b) and np.isscalar(c):
        return my_fun_scalar(a,b,c)
    else:
        return my_fun_vectorised(a,b,c)
def my_fun_optimised_2(a,b,c):
    if np.isscalar(a) and np.isscalar(b) and np.isscalar(c):
        return fun_2_scalar(a,b,c)
    else:
        return fun_2_vectorised(a,b,c)
@numba.jit
def fun_2_scalar(a,b,c):
    if a ==0:
        return np.nan
    elif b == 0:
        # note: the vectorised version returns 0 here, not nan
        return np.nan
    else:
        return b*a**3 + a*b**3 + a*b*c**3
@numba.jit
def fun_2_vectorised(a,b,c):
    return np.where(a == 0, np.nan,
                    np.where(b ==0 , 0 , b*a**3 + a*b**3 + a*b*c**3)
                    )
def my_fun_non_opt_no_numba(a,b,c):
    # multiplying by 1 converts a 0-d array to a scalar
    return 1 * np.where(a == 0, np.nan,
                        np.where(b ==0 , 0 , b*a**3 + a*b**3 + a*b*c**3)
                        )
# I couldn't get this to work with Numba
#@numba.jit
def my_fun_non_opt_numba(a,b,c):
    a = np.array([a])
    b = np.array([b])
    c = np.array([c])
    out = np.where(a == 0, np.nan,
                   np.where(b ==0 , 0 , b*a**3 + a*b**3 + a*b*c**3)
                   )
    return out
r = 4
n = int(100)
a = 3
b = 4
c = 5
x = my_fun_optimised_2(a,b,c)
t_scalar_opt_numba_1 = timeit.Timer("my_fun_optimised_1(a,b,c) " , globals = globals() ).repeat(repeat = r, number = n)
t_scalar_opt_numba_2 = timeit.Timer("my_fun_optimised_2(a,b,c) " , globals = globals() ).repeat(repeat = r, number = n)
t_scalar_non_opt_no_numba = timeit.Timer("my_fun_non_opt_no_numba(a,b,c) " , globals = globals() ).repeat(repeat = r, number = n)
resdf_scalar = pd.DataFrame(index = ['min time'])
resdf_scalar['optimised numba 1'] = [min(t_scalar_opt_numba_1)]
resdf_scalar['optimised numba 2'] = [min(t_scalar_opt_numba_2)]
resdf_scalar['non-optimised, no numba'] = [min(t_scalar_non_opt_no_numba)]
# the docs explain why we should take the min and not the avg
resdf_scalar = resdf_scalar.transpose()
resdf_scalar['diff vs fastest'] = (resdf_scalar / resdf_scalar.min() )
a = np.arange(3,13)
b = np.arange(0,10)
c = np.arange(20,30)
t_array_opt_numba_1 = timeit.Timer("my_fun_optimised_1(a,b,c) " , globals = globals() ).repeat(repeat = r, number = n)
t_array_opt_numba_2 = timeit.Timer("my_fun_optimised_2(a,b,c) " , globals = globals() ).repeat(repeat = r, number = n)
t_array_non_opt_no_numba = timeit.Timer("my_fun_non_opt_no_numba(a,b,c) " , globals = globals() ).repeat(repeat = r, number = n)
resdf_array = pd.DataFrame(index = ['min time'])
resdf_array['optimised numba 1'] = [min(t_array_opt_numba_1)]
resdf_array['optimised numba 2'] = [min(t_array_opt_numba_2)]
resdf_array['non-optimised, no numba'] = [min(t_array_non_opt_no_numba)]
# the docs explain why we should take the min and not the avg
resdf_array = resdf_array.transpose()
resdf_array['diff vs fastest'] = (resdf_array / resdf_array.min() )

Related

TypingError for Numba

I have this piece of code, which uses Numba to speed up processing. Basically, particle_dtype is defined so that the code can run under Numba. However, a TypingError is reported, saying "Cannot determine Numba type of <class 'function'>". I cannot figure out where the problem is.
import numpy
from numba import njit
particle_dtype = numpy.dtype({'names':['x','y','z','m','phi'],
                              'formats':[numpy.double,
                                         numpy.double,
                                         numpy.double,
                                         numpy.double,
                                         numpy.double]})
def create_n_random_particles(n, m, domain=1):
    parts = numpy.zeros((n), dtype=particle_dtype)
    parts['x'] = numpy.random.random(size=n) * domain
    parts['y'] = numpy.random.random(size=n) * domain
    parts['z'] = numpy.random.random(size=n) * domain
    parts['m'] = m
    parts['phi'] = 0.0
    return parts
def distance(se, other):
    return numpy.sqrt(numpy.square(se['x'] - other['x']) +
                      numpy.square(se['y'] - other['y']) +
                      numpy.square(se['z'] - other['z']))
parts = create_n_random_particles(10, .001, 1)
@njit
def direct_sum(particles):
    for i, target in enumerate(particles):
        for j in range(particles.shape[0]):
            if i == j:
                continue
            source = particles[j]
            r = distance(target, source)
            # target['phi'] += source['m'] / r
            target['phi'] = target['phi'] + source['m'] / r
    return(target['phi'])
print(direct_sum(parts) )
I guess it's because non-supported functions or operations are used somewhere, but I cannot find it. Thanks for your help.
direct_sum, which is a JITed function, cannot call distance because distance is not JITed (it is a pure-Python function). You need to use the @njit decorator on distance too.
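A minimal sketch of that fix, with a few simplifications that are assumptions of this example and not part of the question: it uses plain position and mass arrays instead of the structured particle_dtype, it returns the whole potential array, and it falls back to a no-op decorator when numba is not installed so that the snippet runs anywhere.

```python
import numpy as np

try:
    from numba import njit
except ImportError:
    # Assumption for this sketch only: without numba, run as plain Python.
    def njit(func):
        return func

@njit
def distance(p, q):
    # Euclidean distance between two 3-vectors; jitted so that other
    # jitted functions are able to call it.
    return np.sqrt((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2 + (p[2] - q[2]) ** 2)

@njit
def direct_sum(pos, mass):
    n = pos.shape[0]
    phi = np.zeros(n)
    for i in range(n):
        for j in range(n):
            if i != j:
                phi[i] += mass[j] / distance(pos[i], pos[j])
    return phi

# Two particles one unit apart, each with mass 0.5.
pos = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
mass = np.array([0.5, 0.5])
phi = direct_sum(pos, mass)
print(phi)
```

The only change that matters for the TypingError is the decorator on distance; the rest is there to keep the example short.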

Numba: how to parse arbitrary logic string into sequence of jitclassed instances in a loop

TL;DR, if I were to explain the problem in short:
I have signals:
np.random.seed(42)
x = np.random.randn(1000)
y = np.random.randn(1000)
z = np.random.randn(1000)
and human readable string tuple logic like :
entry_sig_ = ((x,y,'crossup',False),)
exit_sig_ = ((x,z,'crossup',False), 'or_',(x,y,'crossdown',False))
where:
'entry_sig_' means the output will be 1 when the time series unfolds from left to right and 'entry_sig_' is hit. (x,y,'crossup',False) means: x crossed y up at a particular time i, and False means signal doesn't have "memory". Otherwise number of hits accumulates.
'exit_sig_' means the output will again become '0' when the 'exit_sig_' is hit.
The output is generated through:
@njit
def run(x, entry_sig, exit_sig):
    '''
    x: np.array
    entry_sig, exit_sig: homogeneous tuples of tuple signals
    Returns: sequence of 0 and 1 satisfying entry and exit sigs
    '''
    L = x.shape[0]
    out = np.empty(L)
    out[0] = 0.0
    out[-1] = 0.0
    i = 1
    trade = True
    while i < L-1:
        out[i] = 0.0
        if reduce_sig(entry_sig,i) and i<L-1:
            out[i] = 1.0
            trade = True
            while trade and i<L-2:
                i += 1
                out[i] = 1.0
                if reduce_sig(exit_sig,i):
                    trade = False
        i+= 1
    return out
reduce_sig(sig,i) is a function (see definition below) that parses the tuple and returns resulting output for a given point in time.
Question:
As of now, an object of SingleSig class is instantiated in the for loop from scratch for any given point in time; thus, not having "memory", which totally cancels the merits of having a class, a bare function will do. Does there exist a workaround (a different class template, a different approach, etc) so that:
combined tuple signal can be queried for its value at a particular point in time i.
"memory" can be reset; i.e. e.g. MultiSig(sig_tuple).memory_field can be set to 0 at a constituent signals levels.
Following code adds a memory to the signals which can be wiped using MultiSig.reset() to reset the count of all signals to 0. The memory can be queried using MultiSig.query_memory(key) to return the number of hits for that signal at that time.
For the memory function to work, I had to add unique keys to the signals to identify them.
from numba import njit, int64, float64, types
from numba.types import Array, string, boolean
from numba import jitclass
import numpy as np
np.random.seed(42)
x = np.random.randn(1000000)
y = np.random.randn(1000000)
z = np.random.randn(1000000)
# Example of "human-readable" signals
entry_sig_ = ((x,y,'crossup',False),)
exit_sig_ = ((x,z,'crossup',False), 'or_',(x,y,'crossdown',False))
# Turn signals into homogeneous tuple
#entry_sig_
entry_sig = (((x,y,'crossup',False),'NOP','1'),)
#exit_sig_
exit_sig = (((x,z,'crossup',False),'or_','2'),((x,y,'crossdown',False),'NOP','3'))
@njit
def cross(x, y, i):
    '''
    x,y: np.array
    i: int - point in time
    Returns: 1 or 0 when condition is met
    '''
    if (x[i - 1] - y[i - 1])*(x[i] - y[i]) < 0:
        out = 1
    else:
        out = 0
    return out
kv_ty = (types.string,types.int64)
spec = [
    ('memory', types.DictType(*kv_ty)),
]
@njit
def single_signal(x, y, how, acc, i):
    '''
    i: int - point in time
    Returns either signal or accumulator
    '''
    if cross(x, y, i):
        if x[i] < y[i] and how == 'crossdown':
            out = 1
        elif x[i] > y[i] and how == "crossup":
            out = 1
        else:
            out = 0
    else:
        out = 0
    return out
@jitclass(spec)
class MultiSig:
    def __init__(self,entry,exit):
        '''
        initialize memory at single signal level
        '''
        memory_dict = {}
        for i in entry:
            memory_dict[str(i[2])] = 0
        for i in exit:
            memory_dict[str(i[2])] = 0
        self.memory = memory_dict
    def reduce_sig(self, sig, i):
        '''
        Parses multisignal
        sig: homogeneous tuple of tuples ("human-readable" signal definition)
        i: int - point in time
        Returns: resulting value of multisignal
        '''
        L = len(sig)
        out = single_signal(*sig[0][0],i)
        logic = sig[0][1]
        if out:
            self.update_memory(sig[0][2])
        for cnt in range(1, L):
            s = single_signal(*sig[cnt][0],i)
            if s:
                self.update_memory(sig[cnt][2])
            out = out | s if logic == 'or_' else out & s
            logic = sig[cnt][1]
        return out
    def update_memory(self, key):
        '''
        update memory
        '''
        self.memory[str(key)] += 1
    def reset(self):
        '''
        reset memory
        '''
        dicti = {}
        for i in self.memory:
            dicti[i] = 0
        self.memory = dicti
    def query_memory(self, key):
        '''
        return number of hits on signal
        '''
        return self.memory[str(key)]
@njit
def run(x, entry_sig, exit_sig):
    '''
    x: np.array
    entry_sig, exit_sig: homogeneous tuples of tuples
    Returns: sequence of 0 and 1 satisfying entry and exit sigs
    '''
    L = x.shape[0]
    out = np.empty(L)
    out[0] = 0.0
    out[-1] = 0.0
    i = 1
    multi = MultiSig(entry_sig,exit_sig)
    while i < L-1:
        out[i] = 0.0
        if multi.reduce_sig(entry_sig,i) and i<L-1:
            out[i] = 1.0
            trade = True
            while trade and i<L-2:
                i += 1
                out[i] = 1.0
                if multi.reduce_sig(exit_sig,i):
                    trade = False
        i+= 1
    return out
run(x, entry_sig, exit_sig)
To reiterate what I said in the comments, | and & are bitwise operators, not logical operators. 1 & 2 evaluates to 0 (falsy), which I don't believe is what you want, so I made sure that out and s can only be 0 or 1 so that this produces the expected output.
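A short illustration of that difference in plain Python. The helper as_flag is a hypothetical name introduced only for this example; it is not part of the code above.

```python
# Bitwise operators combine the bits of their operands.
print(1 & 2)    # 0b01 & 0b10 -> 0
print(1 | 2)    # 0b01 | 0b10 -> 3

# Logical operators work on truthiness and return one of the operands.
print(1 and 2)  # 2, which is truthy
print(1 or 2)   # 1

def as_flag(value):
    """Hypothetical helper: collapse any number to 0 or 1."""
    return 1 if value else 0

# Once both sides are restricted to 0/1, bitwise and logical results agree.
print(as_flag(1) & as_flag(2))  # 1
```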
Are you aware that, because of:
out = out | s if logic == 'or_' else out & s
the order of the time series inside entry_sig and exit_sig matters?
Let (output, logic) be tuples where output is 0 or 1, according to how crossup and crossdown would evaluate the information passed in the tuple, and logic is or_ or and_.
tuples = ((0,'or_'),(1,'or_'),(0,'and_'))
out = tuples[0][0]
logic = tuples[0][1]
for i in range(1,len(tuples)):
    s = tuples[i][0]
    out = out | s if logic == 'or_' else out & s
    logic = tuples[i][1]
print(out)
1
changing the order of the tuple yields the other signal:
tuples = ((0,'or_'),(0,'and_'),(1,'or_'))
out = tuples[0][0]
logic = tuples[0][1]
for i in range(1,len(tuples)):
    s = tuples[i][0]
    out = out | s if logic == 'or_' else out & s
    logic = tuples[i][1]
print(out)
0
The performance hinges on how many times the count needs to be updated. Using n=1,000,000 for all three time series, your code had a mean run-time of 0.6s on my machine, while mine took 0.63s.
I then changed the crossing logic a bit to reduce the number of if/else branches, so that the nested if/else is only triggered if the time series crossed, which can be checked with a single comparison. This halved the difference in run-time, so the code above now takes about 2.5% longer than your original code.

Convert numpy int to matlab int

I'm calling a MATLAB function from Python through the MATLAB engine, and I'm having problems passing the variables.
I have figured out how to pass some of them, but for this one, which should be a scalar int, I'm getting an error.
When I pass it, I get:
File
"C:\ProgramData\Anaconda3\lib\site-packages\matlab_internal\mlarray_utils.py",
line 90, in _normalize_size
if init_dims[0] == 0:
IndexError: tuple index out of range
The code works fine if I do not pass the modn variable, so I know that my problem is the conversion of this variable to a MATLAB type.
This is the Python code:
import numpy as np
import matlab
import matlab.engine
eng = matlab.engine.start_matlab()
eng.cd()
Nn = 30
x= 250*np.ones((1,Nn))
y= 100*np.ones((1,Nn))
z = 32.0
xx = matlab.double(x.tolist())
yy = matlab.double(y.tolist())
f=np.arange(start=0.1,stop=0.66,step=0.1)
modnv=np.concatenate((np.ones((Nn)),2*np.ones((Nn))))
count = 0
for fks in f:
    fks=np.float(0)
    modn = modnv[count]
    modn = modn.astype(int)
    modn = matlab.int8(modn)
    Output = eng.simple_test(xx,yy,z,fks,modn,nargout=4)
    A = np.array(Output[0]).astype(float)
    B = np.array(Output[1]).astype(float)
    C = np.array(Output[2]).astype(float)
    D = np.array(Output[3]).astype(float)
    count = count + 1
and this is the matlab function simple_test:
function [A,B,C,D] = simple_test(x,y,z,fks,modn)
if modn == 1
    A = 3*x+2*y;
    B = x*ones(length(x),length(x));
    C = ones(z);
    D = x*y';
else
    A = 3*fks;
    B = 3*x+2*y;
    C = A+B;
    D = x*y'
end
Does someone know how to overcome that?
An IndexError: tuple index out of range error mostly has one of two causes:
One of the indexes is wrong. I suspect you mean [0] where you wrote [1], and [1] where you wrote [2]; indexes are 0-based in Python.
You are passing an array to a function that expects a variadic sequence of arguments, e.g. '{}{}'.format([1,2]) vs '{}{}'.format(*[1,2]).
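Going by the traceback in the question, one plausible cause (an assumption, not verified against the MATLAB engine here) is that matlab.int8 is handed a NumPy scalar, whose shape is the empty tuple, where it expects a sequence. The sketch below shows only the conversion on the Python side and leaves the engine call as a comment.

```python
import numpy as np

Nn = 30
modnv = np.concatenate((np.ones(Nn), 2 * np.ones(Nn)))

modn_np = modnv[0].astype(int)   # NumPy scalar: its shape is the empty tuple ()
print(modn_np.shape)

modn = int(modn_np)              # plain Python int
print(type(modn), modn)

# Assumption: a plain int, or a one-element list wrapped in matlab.int8,
# is then what would be passed on, e.g.
#   eng.simple_test(xx, yy, z, fks, modn, nargout=4)
#   eng.simple_test(xx, yy, z, fks, matlab.int8([modn]), nargout=4)
```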

numpy.matmul in Theano

TL;DR
I want to replicate the functionality of numpy.matmul in theano. What's the best way to do this?
Too Short; Didn't Understand
Looking at theano.tensor.dot and theano.tensor.tensordot, I'm not seeing an easy way to do a straightforward batch matrix multiplication. i.e. treat the last two dimensions of N dimensional tensors as matrices, and multiply them. Do I need to resort to some goofy usage of theano.tensor.batched_dot? Or *shudder* loop them myself without broadcasting!?
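For reference, this is the numpy.matmul behaviour being asked for, shown in NumPy only (the shapes are arbitrary choices for this example): the last two dimensions are treated as matrices and the leading batch dimensions are broadcast against each other.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.random((4, 1, 2, 3))   # stack of 2x3 matrices, batch shape (4, 1)
b = rng.random((5, 3, 6))      # stack of 3x6 matrices, batch shape (5,)

c = np.matmul(a, b)
print(c.shape)                 # batch shapes broadcast to (4, 5)

# Same result spelled out with an explicit loop over the broadcast batch.
expected = np.empty((4, 5, 2, 6))
for i in range(4):
    for j in range(5):
        expected[i, j] = a[i, 0] @ b[j]
print(np.allclose(c, expected))
```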
The current pull requests don't support broadcasting, so I came up with this for now. I may clean it up, add a little more functionality, and submit my own PR as a temporary solution. Until then, I hope this helps someone!
I included the test to show it replicates numpy.matmul, given that the input complies with my more strict (temporary) assertions.
Also, .scan stops iterating the sequences after min(*sequencelengths) iterations, so I believe that mismatched array shapes won't raise any exceptions.
import theano as th
import theano.tensor as tt
import numpy as np
def matmul(a: tt.TensorType, b: tt.TensorType, _left=False):
    """Replicates the functionality of numpy.matmul, except that
    the two tensors must have the same number of dimensions, and their ndim must exceed 1."""
    # TODO ensure that broadcastability is maintained if both a and b are broadcastable on a dim.
    assert a.ndim == b.ndim  # TODO support broadcasting for differing ndims.
    ndim = a.ndim
    assert ndim >= 2
    # If we should left multiply, just swap references.
    if _left:
        tmp = a
        a = b
        b = tmp
    # If a and b are 2 dimensional, compute their matrix product.
    if ndim == 2:
        return tt.dot(a, b)
    # If they are larger...
    else:
        # If a is broadcastable but b is not.
        if a.broadcastable[0] and not b.broadcastable[0]:
            # Scan b, but hold a steady.
            # Because b will be passed in as a, we need to left multiply to maintain
            # matrix orientation.
            output, _ = th.scan(matmul, sequences=[b], non_sequences=[a[0], 1])
        # If b is broadcastable but a is not.
        elif b.broadcastable[0] and not a.broadcastable[0]:
            # Scan a, but hold b steady.
            output, _ = th.scan(matmul, sequences=[a], non_sequences=[b[0]])
        # If neither dimension is broadcastable or they both are.
        else:
            # Scan through the sequences, assuming the shape for this dimension is equal.
            output, _ = th.scan(matmul, sequences=[a, b])
        return output
def matmul_test() -> bool:
    vlist = []
    flist = []
    ndlist = []
    for i in range(2, 30):
        dims = int(np.random.random() * 4 + 2)
        # Create a tuple of tensors with potentially different broadcastability.
        vs = tuple(
            tt.TensorVariable(
                tt.TensorType('float64',
                              tuple((p < .3) for p in np.random.ranf(dims-2))
                              # Make full matrices
                              + (False, False)
                              )
            )
            for _ in range(2)
        )
        vs = tuple(tt.swapaxes(v, -2, -1) if j % 2 == 0 else v for j, v in enumerate(vs))
        f = th.function([*vs], [matmul(*vs)])
        # Create the default shape for the test ndarrays
        defshape = tuple(int(np.random.random() * 5 + 1) for _ in range(dims))
        # Create a test array matching the broadcastability of each v, for each v.
        nds = tuple(
            np.random.ranf(
                tuple(s if not v.broadcastable[j] else 1 for j, s in enumerate(defshape))
            )
            for v in vs
        )
        nds = tuple(np.swapaxes(nd, -2, -1) if j % 2 == 0 else nd for j, nd in enumerate(nds))
        ndlist.append(nds)
        vlist.append(vs)
        flist.append(f)
    for i in range(len(ndlist)):
        assert np.allclose(flist[i](*ndlist[i]), np.matmul(*ndlist[i]))
    return True
if __name__ == "__main__":
    print("matmul_test -> " + str(matmul_test()))

Summing with a for loop faster than with reduce?

I wanted to see how much faster reduce was than using a for loop for simple numerical operations. Here's what I found (using the standard timeit library):
In [54]: print(setup)
from operator import add, iadd
r = range(100)
In [55]: print(stmt1)
c = 0
for i in r:
    c+=i
In [56]: timeit(stmt1, setup)
Out[56]: 8.948904991149902
In [58]: print(stmt3)
reduce(add, r)
In [59]: timeit(stmt3, setup)
Out[59]: 13.316915035247803
Looking a little more:
In [68]: timeit("1+2", setup)
Out[68]: 0.04145693778991699
In [69]: timeit("add(1,2)", setup)
Out[69]: 0.22807812690734863
What's going on here? Obviously, reduce does loop faster than for, but the function call seems to dominate. Shouldn't the reduce version run almost entirely in C? Using iadd(c,i) in the for loop version makes it run in ~24 seconds. Why would using operator.add be so much slower than +? I was under the impression + and operator.add run the same C code (I checked to make sure operator.add wasn't just calling + in python or anything).
BTW, just using sum runs in ~2.3 seconds.
In [70]: print(sys.version)
2.7.1 (r271:86882M, Nov 30 2010, 09:39:13)
[GCC 4.0.1 (Apple Inc. build 5494)]
The reduce(add, r) must invoke the add() function 100 times, so the overhead of the function calls adds up -- reduce uses PyEval_CallObject to invoke add on each iteration:
for (;;) {
    ...
    if (result == NULL)
        result = op2;
    else {
        /* here it is creating a tuple to pass the previous result and the next
           value from range(100) into func add(): */
        PyTuple_SetItem(args, 0, result);
        PyTuple_SetItem(args, 1, op2);
        if ((result = PyEval_CallObject(func, args)) == NULL)
            goto Fail;
    }
Updated: Response to question in comments.
When you type 1 + 2 in Python source code, the bytecode compiler performs the addition in place and replaces that expression with 3:
f1 = lambda: 1 + 2
c1 = byteplay.Code.from_code(f1.func_code)
print c1.code
1 1 LOAD_CONST 3
2 RETURN_VALUE
If you add two variables a + b the compiler will generate bytecode which loads the two variables and performs a BINARY_ADD, which is far faster than calling a function to perform the addition:
f2 = lambda a, b: a + b
c2 = byteplay.Code.from_code(f2.func_code)
print c2.code
1 1 LOAD_FAST a
2 LOAD_FAST b
3 BINARY_ADD
4 RETURN_VALUE
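The same comparison can be sketched on Python 3 with the standard dis module in place of byteplay (an assumption of this example, since the listings above come from Python 2.7). Opcode names differ between Python versions, so the snippet only prints what it finds and does not rely on a specific name.

```python
import dis
from operator import add

def folded():
    return 1 + 2          # constant expression

def inline_add(a, b):
    return a + b          # addition handled by a bytecode instruction

def call_add(a, b):
    return add(a, b)      # addition hidden behind a function call

def opnames(func):
    """List the opcode names of a function's bytecode."""
    return [ins.opname for ins in dis.get_instructions(func)]

print(folded.__code__.co_consts)   # the folded constant 3 shows up here
print(opnames(inline_add))         # contains a BINARY_* instruction
print(opnames(call_add))           # contains a CALL* instruction
```

The extra call instruction, plus the argument handling behind it, is the per-iteration overhead that reduce(add, r) pays and the plain for loop does not.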
It could be the overhead of copying args and return values (i.e. "add(1, 2)"), as opposed to simply operating on numeric literals.
edit: Swapping in zeros for the list multiplication closes the gap big time.
from functools import reduce
from numpy import array, arange, zeros
from time import time
def add(x, y):
    return x + y
def sum_columns(x):
    if x.any():
        width = len(x[0])
        total = zeros(width)
    for row in x:
        total += array(row)
    return total
l = arange(3000000)
l = array([l, l, l])
start = time()
print(reduce(add, l))
print('Reduce took {}'.format(time() - start))
start = time()
print(sum_columns(l))
print('For loop took took {}'.format(time() - start))
That gets you down to almost no difference at all.
Reduce took 0.03230619430541992
For loop took took 0.058577775955200195
old: If reduce is used for adding together NumPy arrays by index, it can be faster than a for loop.
from functools import reduce
from numpy import array, arange
from time import time
def add(x, y):
    return x + y
def sum_columns(x):
    if x.any():
        width = len(x[0])
        total = array([0] * width)
    for row in x:
        total += array(row)
    return total
l = arange(3000000)
l = array([l, l, l])
start = time()
print(reduce(add, l))
print('Reduce took {}'.format(time() - start))
start = time()
print(sum_columns(l))
print('For loop took took {}'.format(time() - start))
With the result of
[ 0 3 6 ..., 8999991 8999994 8999997]
Reduce took 0.024930953979492188
[ 0 3 6 ..., 8999991 8999994 8999997]
For loop took took 0.3731539249420166
