Generalized cumulative functions in NumPy/SciPy?

Is there a function in numpy or scipy (or some other library) that generalizes the idea of cumsum and cumprod to an arbitrary function? For example, consider the (theoretical) function
cumf(func, array)
where func is a function that accepts two floats and returns a float. The particular cases
lambda x,y: x+y
and
lambda x,y: x*y
are cumsum and cumprod respectively. For example, if
func = lambda x, prev_x: x**2 * prev_x
and I apply it to:
cumf(func, np.array([1, 2, 3]))
I would like
np.array([1, 4, 9*4])
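For concreteness, here is a plain-Python sketch of the behaviour I'm after, built on itertools.accumulate, which folds a binary function over a sequence (the lambda flips the arguments to match the func(x, prev_x) convention above):
from itertools import accumulate
import numpy as np

def cumf(func, arr):
    # accumulate calls its function as f(running_value, next_element)
    return np.array(list(accumulate(arr, lambda prev, x: func(x, prev))))

cumf(lambda x, prev_x: x**2 * prev_x, np.array([1, 2, 3]))
# array([ 1,  4, 36])
I'm looking for something like this, ideally vectorized.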

The ValueError shown in the answer below is still a bug in NumPy 1.20.1 (with Python 3.9.1).
Luckily, a workaround was discovered that uses casting:
https://groups.google.com/forum/#!topic/numpy/JgUltPe2hqw
import numpy as np
uadd = np.frompyfunc(lambda x, y: x + y, 2, 1)  # wrap the Python function as a ufunc: 2 inputs, 1 output
uadd.accumulate([1, 2, 3], dtype=object).astype(int)
# array([1, 3, 6])
Note that since the custom operation works on an object dtype, it won't benefit from numpy's efficient memory management. For extremely large arrays, the operation may therefore be slower than one that does not need the cast to object.
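Applied to the question's own func, the same workaround gives the expected result (a sketch; note that accumulate passes the running value as the first argument, so the lambda takes prev first):
import numpy as np
cumf = np.frompyfunc(lambda prev, x: x**2 * prev, 2, 1)
cumf.accumulate(np.array([1, 2, 3]), dtype=object).astype(int)
# array([ 1,  4, 36])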

NumPy's ufuncs have accumulate():
In [22]: np.multiply.accumulate([[1, 2, 3], [4, 5, 6]], axis=1)
Out[22]:
array([[  1,   2,   6],
       [  4,  20, 120]])
Unfortunately, calling accumulate() on a frompyfunc()'ed Python function fails with a strange error:
In [32]: uadd = np.frompyfunc(lambda x, y: x + y, 2, 1)
In [33]: uadd.accumulate([1, 2, 3])
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
ValueError: could not find a matching type for <lambda> (vectorized).accumulate,
requested type has type code 'l'
This is using NumPy 1.6.1 with Python 2.7.3.

Numpy fastest way to apply array of functions to matrix columns

I have an array of functions shape (n,) and a numpy matrix of shape (m, n). Now I want to apply each function to its corresponding column in the matrix, i.e.
matrix[:, i] = funcs[i](matrix[:, i])
I could do this with a for loop (see the example below), but using for loops is generally discouraged in numpy. My question is: what is the quickest (and preferably most elegant) way to do this?
A working example
import numpy as np

# Example of functions to apply to each column
funcs = np.array([np.vectorize(lambda x: x + 1),
                  np.vectorize(lambda x: x - 2),
                  np.vectorize(lambda x: x + 3)])

# Initialise dummy matrix
matrix = np.random.rand(50, 3)

# Apply each function to its column
for i in range(funcs.shape[0]):
    matrix[:, i] = funcs[i](matrix[:, i])
For an array that has many rows and a few columns, a simple column iteration should be efficient:
In [783]: funcs = [lambda x: x+1, lambda x: x+2, lambda x: x+3]
In [784]: arr = np.arange(12).reshape(4,3)
In [785]: for i in range(3):
     ...:     arr[:,i] = funcs[i](arr[:,i])
     ...:
In [786]: arr
Out[786]:
array([[ 1,  3,  5],
       [ 4,  6,  8],
       [ 7,  9, 11],
       [10, 12, 14]])
If the functions work with 1d array inputs, there's no need for np.vectorize (np.vectorize is generally slower than plain iteration anyway). Also, for iteration like this there's no need to wrap the list of functions in an array; it's faster to iterate over lists.
A variation on the indexed iteration:
In [787]: for f, col in zip(funcs, arr.T):
     ...:     col[:] = f(col)
     ...:
In [788]: arr
Out[788]:
array([[ 2,  5,  8],
       [ 5,  8, 11],
       [ 8, 11, 14],
       [11, 14, 17]])
I use arr.T here so the iteration is on the columns of arr, not the rows.
A general observation: a few iterations over a complex task is perfectly good numpy style. Many iterations over simple tasks are slow, and should be performed in compiled code where possible.
A loop is efficient here since the job inside the loop is heavy.
A readable solution is just:
np.vectorize(apply)(funcs, matrix)
(Note that apply is a builtin in Python 2 only; it was removed in Python 3.)
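On Python 3, a minimal sketch of the same idea replaces apply with a small lambda (plain lambdas suffice for funcs here, since the outer np.vectorize already handles the element-wise calls):
import numpy as np

funcs = np.array([lambda x: x + 1, lambda x: x - 2, lambda x: x + 3])
matrix = np.random.rand(50, 3)

# Broadcast the (3,) array of functions against the (50, 3) matrix:
# each element-wise call applies a column's function to one value.
result = np.vectorize(lambda f, x: f(x))(funcs, matrix)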

Why is the map_blocks function run twice?

I have a question: why is the map_blocks function run twice? When I run the example below:
import dask.array as da
import numpy as np
def derivative(x):
    print(x.shape)
    return x - np.roll(x, 1)
x = np.array([1, 1, 2, 3, 3, 3, 2, 1, 1])
d = da.from_array(x, chunks = 5)
y = d.map_blocks(derivative)
res = y.compute()
I obtain this output:
(1L,)
(5L,)
(4L,)
Since my chunks are ((5, 4),), I assume that the derivative function somehow has to be run once before it is really executed on these chunks; am I right?
I have Python 2.7 and dask v0.13.0.
If you do not supply a dtype to the map_blocks call then it will try running your function on a tiny sample dataset (hence the singleton shape). You can avoid this by passing a dtype explicitly if you know it.
y = d.map_blocks(derivative, dtype=d.dtype)
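Putting it together, a sketch of the corrected example (the derivative function then runs exactly once per chunk):
import dask.array as da
import numpy as np

def derivative(x):
    print(x.shape)
    return x - np.roll(x, 1)

x = np.array([1, 1, 2, 3, 3, 3, 2, 1, 1])
d = da.from_array(x, chunks=5)
# dtype is known up front, so dask skips the trial call on sample data
y = d.map_blocks(derivative, dtype=d.dtype)
res = y.compute()   # prints (5,) and (4,) only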

Write to a masked array in numpy

Let's say I have an array x and a boolean mask mask. I want to use np.copyto to write to x using mask. Is there a way I can do this? Just trying to use copyto doesn't work, I suppose because the masked x is not writeable.
x = np.array([1,2,3,4])
mask = np.array([False,False,True,True])
np.copyto(x[mask],[30,40])
x
# array([1, 2, 3, 4])
# Should be array([1, 2, 30, 40])
As commented, indexed assignment works:
In [16]: x[mask]=[30,40]
In [17]: x
Out[17]: array([ 1, 2, 30, 40])
You have to be careful when using x[mask]. That is 'advanced indexing', so it creates a copy, not a view, of x. With direct assignment that isn't an issue, but with copyto, x[mask] is passed as an argument to the function.
In [19]: y=x[mask]
In [21]: np.copyto(y,[2,3])
changes y, but not x.
Checking its docs, I see that copyto does accept a where parameter, which can be used as:
In [24]: np.copyto(x,[0,0,31,41],where=mask)
In [25]: x
Out[25]: array([ 1, 2, 31, 41])

Trouble with Python's 1-element tuples and SciPy

I have been trying (with some success) to write vectorized integration calls with the numpy vectorize function, but every once in a while I get stuck with issues of how Python treats tuples.
I want to write variants of integrate.quad that can integrate vector-valued functions over a grid of points. Similarly, I want to create a version of integrate.nquad that integrates over n-dimensional domains and can compute these integrals over a grid of points (i.e. an integral with an n-dimensional domain and vector output, computed along a lattice of points in k-dimensional space).
For example:
import numpy as np
from scipy import integrate
def vecint(F, I, *args):
    componentintegrals = [integrate.nquad(f, I, args) for f in F]
    retint = [CI[0] for CI in componentintegrals]
    if len(retint) == 1:
        retint = retint[0]
    reterr = np.sqrt(sum(CI[1]**2 for CI in componentintegrals))
    return retint, reterr
vecint takes as input a list of many-variable functions and treats it as an integration problem where one integrates a vector-valued function. This code works just fine, e.g.:
print(vecint([lambda x,y: np.sin(x), lambda x,y: np.cos(y)], [[0,np.pi],[0,np.pi]] ) )
print(vecint([lambda x: np.sin(x)], [[0,np.pi]]))
print(vecint([lambda x: np.cos(x), lambda x: np.sin(x)], [[0,np.pi]] ))
## and we can pass additional arguments.
print(vecint( [lambda x,k: np.sin(x)+k], [[0,np.pi]], 1) )
All the above calls work as expected. The trouble for me starts when I try to vectorize these functions. Vectorizing integrate.quad goes fine...
def quad_1vz1(f, I, *args):
    return np.vectorize(lambda n: integrate.quad(f, I[0], I[1], (n,)+args)[0])
as expected. The above code allows for calls such as:
quad_1vz1(lambda x, k: np.sin(k*x), [0, np.pi])(K)
where K=np.mgrid[0:1:6j], etc. These are the integrals
$$\int_0^{\pi} \sin(kx) dx$$
for various values of $k$.
The problem occurs when I try to replace integrate.quad with the vecint function above. eg:
## let's vectorize a 1-dimensional integral of a vector-valued function with one parameter.
def vecint_2vz1(F, I, *args):
    #print(I, args)
    return np.vectorize(lambda n: vecint(F, I, (n,)+args)[0])
def f1(t, k):
    return np.cos(t) + k

def f2(t, k):
    return np.sin(t) + k
K = np.mgrid[0:1:6j]
print( vecint_2vz1( [f1,f2], [[0,np.pi]] )(K) )
The above results in a "ValueError: setting an array element with a sequence."
When vecint is vectorized here, the elements of K are sent as 1-element tuples wrapped in 1-element tuples, i.e. the extra argument might be something like ((0,),).
I suspect to avoid this I have to do some crafty unpacking/repacking of arguments.... but I'm a little confused as to what Python is thinking.
It appears as if Python sometimes auto-casts 1-tuples to the value contained inside... and sometimes it does not. This has me confused. I feel like I'm missing something elementary.
Python's output on execution:
ValueError Traceback (most recent call last)
<ipython-input-131-8cb6f52e7d30> in <module>()
17 ## integral of (cos(t)+k, sin(t)+k)dt for various k's.
18
---> 19 print( vecint_2vz1( [f1,f2], [[0,np.pi]] )(K) )
/usr/local/lib/python3.4/dist-packages/numpy/lib/function_base.py in __call__(self, *args, **kwargs)
1809 vargs.extend([kwargs[_n] for _n in names])
1810
-> 1811 return self._vectorize_call(func=func, args=vargs)
1812
1813 def _get_ufunc_and_otypes(self, func, args):
/usr/local/lib/python3.4/dist-packages/numpy/lib/function_base.py in _vectorize_call(self, func, args)
1882 if ufunc.nout == 1:
1883 _res = array(outputs,
-> 1884 copy=False, subok=True, dtype=otypes[0])
1885 else:
1886 _res = tuple([array(_x, copy=False, subok=True, dtype=_t)
ValueError: setting an array element with a sequence.
Further, if I put a little print(args) line in vecint, it prints out items such as ((a,),), where a is an element of the mgrid K.
It's hard to trace what is going on in your code. We're trying to keep track of what quad takes, what your vecint does, plus the lambdas and vectorize.
Why does quad_1vz1 run, but vecint_2vz1 doesn't? Is quad unpacking the tuples? I don't know.
I tried to simplify things with:
def foo2(*args):
    def foo(*args):
        print(args)
        return 1
    return np.vectorize(lambda n: foo((n,)+args))
which produces:
In [148]: foo2()(np.arange(3))
((0,),)
((0,),)
((1,),)
((2,),)
Out[148]: array([1, 1, 1])
In [153]: foo2(3)(np.arange(2))
((0, 3),)
((0, 3),)
((1, 3),)
Out[153]: array([1, 1])
Note that since I did not specify an output type in vectorize, it runs one step just to figure out what the output is like; hence I see ((0,),) twice. That has caused problems in other SO questions when the initial type differs from later ones (e.g. integer vs. float).
The two tuple levels are the product of my own (n,)+args and of the passage through *args and vectorize. I'd have to experiment more to sort out which layer is responsible for which.
vectorize is a poor tool if you just want to iterate over one variable. It doesn't add speed, and it is hard to apply correctly. It is more useful if you have several variables that you want to broadcast together.
vectorize with 2 arguments:
def foo2(*args):
    def foo(*args):
        print(args)
        return sum(*args)
    return np.vectorize(lambda x, y: foo((x,y)+args))
In [164]: foo2(10)(np.arange(2),np.arange(3,5)[:,None])
((0, 3, 10),)
((0, 3, 10),)
((1, 3, 10),)
((0, 4, 10),)
((1, 4, 10),)
Out[164]:
array([[13, 14],
       [14, 15]])
In [166]: 10+np.arange(2)+np.arange(3,5)[:,None]
Out[166]:
array([[13, 14],
       [14, 15]])
Note that when I change the inner print to print(*args), I see (0, 3, 10). You need to be careful when using *args and args - one's a tuple, the other isn't (or is it the other way around?).
If my function returns an array, I get your error:
def foo2(*args):
    def foo(*args):
        print(*args)
        return np.array(*args)
    return np.vectorize(lambda x, y: foo((x,y)+args))
In [197]: foo2(10)(np.arange(2),np.arange(3,5)[:,None])
...
ValueError: setting an array element with a sequence.
but it is ok with a single step:
In [203]: foo2(3)(1,2)
(1, 2, 3)
(1, 2, 3)
Out[203]: array([1, 2, 3])
I can specify object otypes:
def foo2(*args):
    def foo(*args):
        print(*args)
        return np.array(*args)
    return np.vectorize(lambda x, y: foo((x,y)+args), otypes=[object])
In [206]: foo2(3)(1,2)
(1, 2, 3)
Out[206]: array([1, 2, 3], dtype=object)
In [207]: foo2(3)([1,1],[2,3])
(1, 2, 3)
(1, 3, 3)
Out[207]: array([array([1, 2, 3]), array([1, 3, 3])], dtype=object)
In [208]: foo2(10)(np.arange(2),np.arange(3,5)[:,None])
(0, 3, 10)
(1, 3, 10)
(0, 4, 10)
(1, 4, 10)
Out[208]:
array([[array([ 0,  3, 10]), array([ 1,  3, 10])],
       [array([ 0,  4, 10]), array([ 1,  4, 10])]], dtype=object)
These can be stacked into an array, though for this 2d result I have to flatten as well (provided the inner arrays are all the same size).
In [225]: np.vstack(np.ravel(A))
Out[225]:
array([[ 0,  3, 10],
       [ 1,  3, 10],
       [ 0,  4, 10],
       [ 1,  4, 10]])
I vaguely recall SO questions about vectorize object otypes, but I don't recall whether there were problems with this (in some versions) or it was just a solution.
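For what it's worth, the two observations above suggest a plausible repair of the question's vecint_2vz1 (a sketch, not tested against the original code): star-unpack (n,)+args so that vecint's *args doesn't wrap the tuple a second time, and declare object otypes so that vectorize accepts the per-point list of component integrals:
def vecint_2vz1(F, I, *args):
    # *(...) passes n (and any extra args) as bare positional values,
    # so inside vecint args becomes (n,) rather than ((n,),),
    # and nquad ends up calling each f as f(t, n) with a scalar n.
    # otypes=[object] stops vectorize from trying to coerce the returned
    # list of component integrals into a single float.
    return np.vectorize(lambda n: vecint(F, I, *((n,)+args))[0],
                        otypes=[object])
The object-dtype result can then be stacked into a 2d float array with np.vstack, as shown earlier.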

Numpy passing input array as `out` argument to ufunc

Is it generally safe to provide the input array as the optional out argument to a ufunc in numpy, provided the type is correct? For example, I have verified that the following works:
>>> import numpy as np
>>> arr = np.array([1.2, 3.4, 4.5])
>>> np.floor(arr, arr)
array([ 1., 3., 4.])
The array type must be either compatible or identical with the output (which is a float for numpy.floor()), or this happens:
>>> arr2 = np.array([1, 3, 4], dtype = np.uint8)
>>> np.floor(arr2, arr2)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: ufunc 'floor' output (typecode 'e') could not be coerced to provided output parameter (typecode 'B') according to the casting rule ''same_kind''
So, given an array of the proper type, is it generally safe to apply ufuncs in place? Or is floor() an exceptional case? The documentation does not make it clear, and neither do the following two threads, which have only tangential bearing on the question:
Numpy modify array in place?
Numpy Ceil and Floor "out" Argument
EDIT:
As a first-order guess, I would assume it is often, but not always, safe, based on the tutorial at http://docs.scipy.org/doc/numpy/user/c-info.ufunc-tutorial.html. There does not appear to be any restriction on using the output array as a temporary holder for intermediate results during the computation. While something like floor() and ceil() may not require temporary storage, more complex functions might. That being said, the entire existing library may be written with that in mind.
The out parameter of a numpy function is the array where the result is written. The main advantage of using out is avoiding the allocation of new memory where it is not necessary.
Is it safe to write the output of a function to the same array that is passed as input? There is no general answer: it depends on what the function is doing.
Two examples
Here are two examples of ufunc-like functions:
In [1]: def plus_one(x, out=None):
   ...:     if out is None:
   ...:         out = np.zeros_like(x)
   ...:
   ...:     for i in range(x.size):
   ...:         out[i] = x[i] + 1
   ...:     return out
   ...:
In [2]: x = np.arange(5)
In [3]: x
Out[3]: array([0, 1, 2, 3, 4])
In [4]: y = plus_one(x)
In [5]: y
Out[5]: array([1, 2, 3, 4, 5])
In [6]: z = plus_one(x, x)
In [7]: z
Out[7]: array([1, 2, 3, 4, 5])
Function shift_one:
In [11]: def shift_one(x, out=None):
    ...:     if out is None:
    ...:         out = np.zeros_like(x)
    ...:
    ...:     n = x.size
    ...:     for i in range(n):
    ...:         out[(i+1) % n] = x[i]
    ...:     return out
    ...:
In [12]: x = np.arange(5)
In [13]: x
Out[13]: array([0, 1, 2, 3, 4])
In [14]: y = shift_one(x)
In [15]: y
Out[15]: array([4, 0, 1, 2, 3])
In [16]: z = shift_one(x, x)
In [17]: z
Out[17]: array([0, 0, 0, 0, 0])
For the function plus_one there is no problem: the expected result is obtained when the parameters x and out are the same array. But the function shift_one gives a surprising result when the parameters x and out are the same array, because the array is modified while it is still being read: out[(i+1) % n] = x[i] overwrites an element of x before a later iteration gets to read it.
Discussion
For functions of the form out[i] := some_operation(x[i]), such as plus_one above but also floor, ceil, sin, cos, tan, log, conj, etc., it is, as far as I know, safe to write the result into the input using the out parameter.
It is also safe for functions taking two input parameters of the form out[i] := some_operation(x[i], y[i]), such as the numpy functions add, multiply, subtract.
For other functions, it is case by case. As illustrated below, matrix multiplication is not safe:
In [18]: a = np.arange(4).reshape((2,2))
In [19]: a
Out[19]:
array([[0, 1],
       [2, 3]])
In [20]: b = (np.arange(4) % 2).reshape((2,2))
In [21]: b
Out[21]:
array([[0, 1],
       [0, 1]], dtype=int32)
In [22]: c = np.dot(a, b)
In [23]: c
Out[23]:
array([[0, 1],
       [0, 5]])
In [24]: d = np.dot(a, b, out=a)
In [25]: d
Out[25]:
array([[0, 1],
       [0, 3]])
Last remark: if the implementation is multithreaded, the result of an unsafe function may even be non-deterministic, because it depends on the order in which the array elements are processed.
This is an old question, but there is an updated answer:
Yes, it is safe. In the NumPy documentation, we see that as of v1.13:
Operations where ufunc input and output operands have memory overlap are defined to be the same as for equivalent operations where there is no memory overlap. Operations affected make temporary copies as needed to eliminate data dependency. As detecting these cases is computationally expensive, a heuristic is used, which may in rare cases result in needless temporary copies. For operations where the data dependency is simple enough for the heuristic to analyze, temporary copies will not be made even if the arrays overlap, if it can be deduced copies are not necessary. As an example, np.add(a, b, out=a) will not involve copies.
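A small demonstration of that guarantee (a sketch; whether a temporary copy is made is an internal heuristic): adding an array to its own reversed view overlaps input and output memory, yet the result matches the overlap-free computation:
import numpy as np

a = np.arange(5)
expected = a + a[::-1].copy()   # overlap-free reference: [4, 4, 4, 4, 4]
np.add(a, a[::-1], out=a)       # input and output memory overlap
# a is now array([4, 4, 4, 4, 4]), matching the reference result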
