weird behavior of numba guvectorize

weird behavior of numba guvectorize - python

I write a function to test numba.guvectorize. This function takes product of two numpy arrays and compute the sum after first axis, as following:
from numba import guvectorize, float64
import numpy as np
#guvectorize([(float64[:], float64[:], float64)], '(n),(n)->()')
def g(x, y, res):
res = np.sum(x * y)
However, the above guvectorize function returns wrong results as shown below:
>>> a = np.random.randn(3,4)
>>> b = np.random.randn(3,4)
>>> np.sum(a * b, axis=1)
array([-0.83053829, -0.15221319, -2.27825015])
>>> g(a, b)
array([4.67406747e-310, 0.00000000e+000, 1.58101007e-322])
What might be causing this problem?

Function g() receives an uninitialized array through the res parameter. Assigning a new value to it doesn't modify the original array passed to the function.
You need to replace the contents of res (and declare it as an array):
#guvectorize([(float64[:], float64[:], float64[:])], '(n),(n)->()')
def g(x, y, res):
res[:] = np.sum(x * y)
The function operates on 1D vectors and returns a scalar (thus the signature (n),(n)->()) and guvectorize does the job of dealing with 2D inputs and returning a 1D output.
>>> a = np.random.randn(3,4)
>>> b = np.random.randn(3,4)
>>> np.sum(a * b, axis=1)
array([-3.1756397 , 5.72632531, 0.45359806])
>>> g(a, b)
array([-3.1756397 , 5.72632531, 0.45359806])
But the original Numpy function np.sum is already vectorized and compiled, so there is little speed gain in using guvectorize in this specific case.

Your a and b arrays are 2-dimensional, while your guvectorized function has signature of accepting 1D arrays and returning 0D scalar. You have to modify it to accept 2D and return 1D.
In one case you do np.sum with axis = 1 in another case without it, you have to do same thing in both cases.
Also instead of res = ... use res[...] = .... Maybe it is not the problem in case of guvectorize but it can be a general problem in Numpy code because you have to assign values instead of variable reference.
In my case I added cache = True param to guvectorize decorator, it only speeds up running through caching/re-using Numba compiled code and not re-compiling it on every run. It just speeds up things.
Full modified corrected code see below:
Try it online!
from numba import guvectorize, float64
import numpy as np
#guvectorize([(float64[:, :], float64[:, :], float64[:])], '(n, m),(n, m)->(n)', cache = True)
def g(x, y, res):
res[...] = np.sum(x * y, axis = 1)
# Test
np.random.seed(0)
a = np.random.randn(3, 4)
b = np.random.randn(3, 4)
print(np.sum(a * b, axis = 1))
print(g(a, b))
Output:
[ 2.57335386 3.41749149 -0.42290296]
[ 2.57335386 3.41749149 -0.42290296]

Related

How do you use the user_data argument in scipy.LowLevelCallable in conjunction with scipy.ndimage.generic_filter?

I would like to use scipy's LowLevelCallable utility to improve the performance of my call to generic_filter with my own defined function, which takes in two arrays as input parameters. In this working example they show how a regular call to generic_filter could look.
LowLevelCallable already has fixed input arguments:
int callback(double *buffer, npy_intp filter_size,
double *return_value, void *user_data)
so the only way to pass this second array is via the user_data pointer. However, in order to create a carray object I need both the pointer to the array as well as it's length. How can I modify my function wrapper to pass two objects to my function?
from numba import cfunc,carray,jit
from numba.types import intc, CPointer, float64, intp, voidptr
from scipy import LowLevelCallable, ndimage
import numpy as np
image = np.random.random((128, 128))
footprint = np.array([[0, 1, 0],
[1, 1, 1],
[0, 1, 0]], dtype=bool)
def jit_filter_function(filter_function):
jitted_function = jit(filter_function, nopython=True)
#cfunc(intc(CPointer(float64), intp, CPointer(float64), voidptr))
def wrapped(values_ptr, len_values, result, data):
values = carray(values_ptr, (len_values,), dtype=float64)
more = carray(...) # what do I put here?
result[0] = jitted_function(values,more)
return 1
return LowLevelCallable(wrapped.ctypes)
#jit_filter_function
def fmin(values: np.ndarray, more:np.ndarray):
result = np.inf
for v in values:
if v < result:
result = v
return result
ndimage.generic_filter(image, fmin, footprint=footprint,extra_arguments = (np.array([1,2,3]),))

Device function throws nopython exception when its returning a list instead of an integer

a device function I have written always throws a no python exception and I do not understand why or where my error is.
Here a small example that represents my problem.
I have the following device function that I call from a kernel:
#cuda.jit (device=True)
def sub_stuff(vec_a, vec_b):
x0 = vec_a[0] - vec_b[0]
x1 = vec_a[1] - vec_b[1]
x2 = vec_a[2] - vec_b[2]
return [x0, x1, x2]
The kernel that calls this function looks like this:
#cuda.jit
def kernel_via_polygon(vectors_a, vectors_b, result_array):
pos = cuda.grid(1)
if pos < vectors_a.size and pos < result_array.size:
result_array[pos] = sub_stuff(vectors_a[pos], vectors_b[pos])
The three input arrays are the following:
vectors_a = np.arange(1, 10).reshape((3, 3))
vectors_b = np.arange(1, 10).reshape((3, 3))
result = np.zeros_like(vectors_a)
When I now call the function via trace_via_polygon(vectors_a, vectors_b, result) a no python error is thrown. When the device funtion would return only an integer value, this error is prevented.
Can someone explain to me where my mistake is?
Edit: FYI as answered by
talonmies list construction isn't supported in device code. An alternative that helped me is using tuples, which are supported.

The source of your error is that the device function sub_stuff is attempting to create a list in GPU code, and that isn't supported.
About the best you can do would be something like this:
from numba import jit, guvectorize, int32, int64, float64
from numba import cuda
import numpy as np
import math
#cuda.jit (device=True)
def sub_stuff(vec_a, vec_b, result):
for i in range(vec_a.shape[0]):
result[i] = vec_a[i] - vec_b[i]
#cuda.jit
def kernel_via_polygon(vectors_a, vectors_b, result_array):
pos = cuda.grid(1)
if pos < vectors_a.size and pos < result_array.size:
sub_stuff(vectors_a[pos], vectors_b[pos], result_array[pos])
vectors_a = 100 + np.arange(1, 10).reshape((3, 3))
vectors_b = np.arange(1, 10).reshape((3, 3))
result = np.zeros_like(vectors_a)
kernel_via_polygon[1,10](vectors_a, vectors_b, result)
print(result)
which uses a loop to iterate over the individual array slices and perform the subtraction between each element.

Numba vectorize for function with no input

I want to parallelize a function using numba.vectorize, but my function doesn't take any input. Currently, I use a dummy array and dummy input for my function that is never used.
Is there a more elegant/fast way (possibly without using numba.vectorize)?
Code example (not my actual code, only for demonstration how I discard input):
import numpy as np
from numba import vectorize
#vectorize(["int32(int32)"], nopython=True)
def particle_path(discard_me):
x = 0
for _ in range(10):
x += np.random.uniform(0, 1)
return np.int32(x)
arr = particle_path(np.empty(1024, dtype=np.int32))
print(arr)

If you'll simply be dealing with 1D arrays, then you can use the following, where the array must be instantiated outside the function. There doesn't seem to be any reason to use vectorize here, you can achieve the goal simply with jit although you do have to explicitly write the loop over the array elements using this. If your array will always be 1D, then you can use:
import numpy as np
from numba import jit
#jit(nopython=True)
def particle_path(out):
for i in range(len(out)):
x = 0
for _ in range(10):
x += np.random.uniform(0, 1)
out[i] = x
arr = np.empty(1024, dtype=np.int32)
particle_path(arr)
You can similarly deal with any-dimensional arrays using the flat attribute (and make sure to use .size to get total number of elements in the array):
#jit(nopython=True)
def particle_path(out):
for i in range(out.size):
x = 0
for _ in range(10):
x += np.random.uniform(0, 1)
out.flat[i] = x
arr = np.empty(1024, dtype=np.int32)
particle_path(arr)
and finally you can create your array inside your function if you need a new array each time you run the function (use the above instead if you'll be calling the function repeatedly and want to overwrite the same array, hence saving the time to re-allocate the same array over and over again).
#jit(nopython=True)
def particle_path(num):
out = np.empty(shape=num, dtype=np.int32)
for i in range(num):
x = 0
for _ in range(10):
x += np.random.uniform(0, 1)
out[i] = x
return out
arr = particle_path(1024)

Manipulation of numpy 2-D array

I have a 2-D numpy array like x = array([[ 1., 5.],[ 3., 4.]]), I have to compare each row with every other row in the matrix and create an new array of minimum values from both the rows and take the sum of minimum row and save it in a new matrix. Finally I will get a symmetric matrix.
Eg: I compare array [1,5] with itself. New 2-D array is array([[ 1., 5.],[ 1., 5.]]), I create a minimum array along axis=0 i.e [ 1., 5.] then take the sum of array which will be 6. Similarly I repeat the operation for all the rows and I end up with a 2*2 matrix array([[ 6, 5.],[ 5, 7.]]).
import numpy as np
x=np.array([[1,5],[3,4]])
y=np.zeros((len(x),len(x)))
for i in range(len(x)):
array_a=x[i]
for j in range(len(x)):
array_b=x[j]
array_c=np.array([array_a,array_b])
min_array=np.min(array_c,axis=0)
array_sum=np.sum(min_array)
y[i,j]=array_sum
My 2-D array is very big and performing above mentioned operations are taking lot of time. I am new to python so any suggestion to improve the performance will be really helpful.

The obvious improvement to save roughly half the time is to run only on i>=j indices. For elegance and some saving you can also use less variables.
import numpy as np
import time
x=np.random.randint(0, 10, (500, 500))
y=np.zeros((len(x),len(x)))
# OP version
t0 = time.time()
for i in range(len(x)):
array_a=x[i]
for j in range(len(x)):
array_b=x[j]
array_c=np.array([array_a,array_b])
min_array=np.min(array_c,axis=0)
array_sum=np.sum(min_array)
y[i,j]=array_sum
print(time.time() - t0)
z=np.zeros((len(x),len(x)))
# modified version
t0 = time.time()
for i in range(len(x)):
for j in range(i, len(x)):
z[i, j]=np.sum(np.min([x[i], x[j]], axis=0))
z[j, i] = z[i, j]
print(time.time() - t0)
# verify that the result are the same
print(np.all(z == y))
The results on my machine:
4.2974278926849365
2.746302604675293
True

The obvious way to speed up your code would be to do all the looping in numpy. I had a first solution (f2 in the code below), which would generate a matrix that contained all the combinations that need to be compared and then reduced that matrix into the final result performing the np.min and np.sum commands. Unfortunately that method is quite memory consuming and therefore becomes slow when the matrices are big, because the intermediate matrix is NxNx2xN for a NxN input matrix.
However, I found a different solution that uses one for loop (f3 below) and appears to be reasonably fast. The speed-up to the original posted by the OP is about 4 times for a 1000x1000 matrix. Here the codes with some tests:
import numpy as np
import timeit
def f(x):
y = np.zeros_like(x)
for i in range(x.shape[0]):
a = x[i]
for j in range(x.shape[1]):
b = x[j]
y[i,j] = np.sum(np.min([a,b], axis=0))
return y
def f2(x):
y = np.empty((x.shape[0],1,2,x.shape[0]))
y[:,0,0,:] = x[:,:]
y = np.repeat(y, x.shape[0],axis=1)
y[:,:,1,:] = x[:,:]
return np.sum(np.min(y,axis=2),axis=2)
def f3(x):
y = np.empty_like(x)
for i in range(x.shape[1]):
y[:,i] = np.sum(np.minimum(x[i,:],x[:,:]),axis=1)
return y
##some testing that the functions work
x = np.array([[1,5],[3,4]])
a=f(x)
b=f2(x)
c=f3(x)
print(np.all(a==b))
print(np.all(a==c))
x = np.array([[1,7,5],[2,3,8],[5,2,4]])
a=f(x)
b=f2(x)
c=f3(x)
print(np.all(a==b))
print(np.all(a==c))
x = np.random.randint(0,10,(100,100))
a=f(x)
b=f2(x)
c=f3(x)
print(np.all(a==b))
print(np.all(a==c))
##some speed testing:
print('-'*50)
print("speed test small")
x = np.random.randint(0,100,(100,100))
print("original")
print(min(timeit.Timer(
'f(x)',
setup = 'from __main__ import f,x',
).repeat(3,10)))
print("using np.repeat")
print(min(timeit.Timer(
'f2(x)',
setup = 'from __main__ import f2,x',
).repeat(3,10)))
print("one for loop")
print(min(timeit.Timer(
'f3(x)',
setup = 'from __main__ import f3,x',
).repeat(3,10)))
print('-'*50)
print("speed test big")
x = np.random.randint(0,100,(1000,1000))
print("original")
print(min(timeit.Timer(
'f(x)',
setup = 'from __main__ import f,x',
).repeat(3,1)))
print("one for loop")
print(min(timeit.Timer(
'f3(x)',
setup = 'from __main__ import f3,x',
).repeat(3,1)))
And here the output:
True
True
True
True
True
True
--------------------------------------------------
speed test small
original
1.3070102719939314
using np.repeat
0.15176948899170384
one for loop
0.029766165011096746
--------------------------------------------------
speed test big
original
17.505746565002482
one for loop
4.437685210024938
With other words, f2 is pretty fast for matrices that don't exhaust your memory, but especially for big matrices, f3 is the fastest that I could find.
EDIT:
Inspired by #Aguy's answer and this post, here still a modification that only computes the lower triangle of the matrix and then copies the results to the upper triangle:
def f4(x):
y = np.empty_like(x)
for i in range(x.shape[1]):
y[i:,i] = np.sum(np.minimum(x[i,:],x[i:,:]),axis=1)
i_upper = np.triu_indices(x.shape[1],1)
y[i_upper] = y.T[i_upper]
return y
The speed test for the 1000x1000 matrix now gives
speed test big
original
18.71281115297461
one for loop over lower triangle
2.0939957330119796
EDIT 2:
Here still a version that uses numba for speed up. According to this post it is better to write the loops explicitly in this case:
import numba as nb
#nb.jit(nopython=True)
def f_nb(x):
res = np.empty_like(x)
for j in range(res.shape[1]):
for i in range(j,res.shape[0]):
res[j,i] = res[i,j] = np.sum(np.minimum(x[i,:], x[j,:]))
return res
And the relevant speed tests give:
0.015975199989043176 for a 100x100 matrix
0.37946902704425156 for a 1000x1000 matrix
467.06363476096885 for a 10000x10000 matrix
The 10000x10000 speed test for f4 didn't seem to want to finish at all, so I left it out. If your matrices get much bigger than that, you might actually run into memory problems -- did you consider this?

how to apply a mask from one array to another array?

I've read the masked array documentation several times now, searched everywhere and feel thoroughly stupid. I can't figure out for the life in me how to apply a mask from one array to another.
Example:
import numpy as np
y = np.array([2,1,5,2]) # y axis
x = np.array([1,2,3,4]) # x axis
m = np.ma.masked_where(y>2, y) # filter out values larger than 5
print m
[2 1 -- 2]
print np.ma.compressed(m)
[2 1 2]
So this works fine.... but to plot this y axis, I need a matching x axis. How do I apply the mask from the y array to the x array? Something like this would make sense, but produces rubbish:
new_x = x[m.mask].copy()
new_x
array([5])
So, how on earth is that done (note the new x array needs to be a new array).
Edit:
Well, it seems one way to do this works like this:
>>> import numpy as np
>>> x = np.array([1,2,3,4])
>>> y = np.array([2,1,5,2])
>>> m = np.ma.masked_where(y>2, y)
>>> new_x = np.ma.masked_array(x, m.mask)
>>> print np.ma.compressed(new_x)
[1 2 4]
But that's incredibly messy! I'm trying to find a solution as elegant as IDL...

I had a similar issue, but involving loads more masking commands and more arrays to apply them. My solution is that I do all the masking on one array and then use the finally masked array as the condition in the mask_where command.
For example:
y = np.array([2,1,5,2]) # y axis
x = np.array([1,2,3,4]) # x axis
m = np.ma.masked_where(y>5, y) # filter out values larger than 5
new_x = np.ma.masked_where(np.ma.getmask(m), x) # applies the mask of m on x
The nice thing is you can now apply this mask to many more arrays without going through the masking process for each of them.

Why not simply
import numpy as np
y = np.array([2,1,5,2]) # y axis
x = np.array([1,2,3,4]) # x axis
m = np.ma.masked_where(y>2, y) # filter out values larger than 5
print list(m)
print np.ma.compressed(m)
# mask x the same way
m_ = np.ma.masked_where(y>2, x) # filter out values larger than 5
# print here the list
print list(m_)
print np.ma.compressed(m_)
code is for Python 2.x
Also, as proposed by joris, this do the work new_x = x[~m.mask].copy() giving an array
>>> new_x
array([1, 2, 4])

This may not bee 100% what OP wanted to know,
but it's a cute little piece of code I use all the time -
if you want to mask several arrays the same way, you can use this generalized function to mask a dynamic number of numpy arrays at once:
def apply_mask_to_all(mask, *arrays):
assert all([arr.shape == mask.shape for arr in arrays]), "All Arrays need to have the same shape as the mask"
return tuple([arr[mask] for arr in arrays])
See this example usage:
# init 4 equally shaped arrays
x1 = np.random.rand(3,4)
x2 = np.random.rand(3,4)
x3 = np.random.rand(3,4)
x4 = np.random.rand(3,4)
# create a mask
mask = x1 > 0.8
# apply the mask to all arrays at once
x1, x2, x3, x4 = apply_mask_to_all(m, x1, x2, x3, x4)

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

weird behavior of numba guvectorize - python

Related

How do you use the user_data argument in scipy.LowLevelCallable in conjunction with scipy.ndimage.generic_filter?

Device function throws nopython exception when its returning a list instead of an integer

Numba vectorize for function with no input

Manipulation of numpy 2-D array

how to apply a mask from one array to another array?

Categories

Resources