Optimizing the creation of a non-numeric matrix with NumPy - python

While trying to optimize the creation and printing of a huge 2D matrix, I decided to try out NumPy. Unfortunately, using this library actually made the situation worse.
My goal is to create a matrix filled with strings containing each cell's index, something like this (where n is the size of the matrix):
python_matrix = [[f"{y}, {x}" for x in range(n)] for y in range(n)]
And when I used the array() function of the NumPy library this way:
numpy_matrix = numpy.array([[f"{y}, {x}" for x in range(n)] for y in range(n)])
the time to create the matrix only increased. For example, for n = 1000, python_matrix is created in 0.032 s and numpy_matrix in 0.419 s, about 13 times longer than plain Python.
numpy_matrix also prints more slowly (when you output the full version rather than the shortened one) than python_matrix does with a for loop:
import numpy

n = 1000

def numpy_matrix(n):
    matrix = numpy.array([[f"{y}, {x}" for x in range(n)] for y in range(n)])
    with numpy.printoptions(threshold=numpy.inf):
        print(matrix)

def python_matrix(n):
    matrix = [[f"{y}, {x}" for x in range(n)] for y in range(n)]
    def print_matrix():
        for arr in matrix:
            print(arr)
    print_matrix()

# time of numpy_matrix > time of python_matrix
Is it better to use the standard Python features, or is NumPy actually more efficient and I just didn't use it correctly?
Also, if I do use NumPy, the question of how to speed up printing the full matrix remains.

Running an ipython session and using its timeit, I don't get such large differences:
Making the list:
In [13]: timeit [[f"{y}, {x}" for y in range(N)] for x in range(N)]
492 ms ± 3.28 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Making the array (while also making the list):
In [14]: timeit np.array([[f"{y}, {x}" for y in range(N)] for x in range(N)])
779 ms ± 12.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Removing the list creation step from the time:
In [15]: %%timeit alist = [[f"{y}, {x}" for y in range(N)] for x in range(N)]
...: np.array(alist)
...:
...:
313 ms ± 12.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
So creating the array from an existing list isn't that much longer.
Specifying the dtype helps a bit as well:
In [18]: %%timeit alist = [[f"{y}, {x}" for y in range(N)] for x in range(N)]
...: np.array(alist, dtype='U8')
...:
...:
224 ms ± 2.67 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Timing prints is more awkward, though we could just time the string formatting, str(x). I won't show the times, but yes, the array formatting is much slower. Essentially numpy has to go through Python's own string handling code; it has little of its own.
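NumPy does have a set of vectorized-looking string routines in np.char, so the construction itself can be kept entirely in NumPy. A minimal sketch (my own, not from the original answers; np.char still loops per element internally, so time it before assuming it beats the list comprehension):
import numpy as np

n = 1000
# Row and column index grids, converted to fixed-width unicode strings.
Y, X = np.meshgrid(np.arange(n), np.arange(n), indexing='ij')
sep = np.full((n, n), ', ', dtype='U2')
numpy_matrix = np.char.add(np.char.add(Y.astype('U4'), sep), X.astype('U4'))
# numpy_matrix[2, 5] == '2, 5'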
numeric list/array
For a numeric array, the relative difference is bigger:
In [29]: alist = [[(x,y) for y in range(N)] for x in range(N)]
In [30]: arr = np.array(alist)
In [31]: arr.shape
Out[31]: (1000, 1000, 2)
In [32]: timeit alist = [[(x,y) for y in range(N)] for x in range(N)]
171 ms ± 8.4 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [33]: timeit arr = np.array(alist)
832 ms ± 36.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
However, if we make the same array with array methods, i.e. not via a list of lists, the time is much better:
In [40]: timeit np.stack(np.broadcast_arrays(np.arange(N)[:,None], np.arange(N)),axis=2)
8.51 ms ± 89.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
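For reference, an equivalent construction (my sketch, not part of the original timing) can be written with np.meshgrid or np.indices; both build the same (N, N, 2) array of (row, column) index pairs:
import numpy as np

N = 1000
coords = np.stack(np.meshgrid(np.arange(N), np.arange(N), indexing='ij'), axis=2)
# or: coords = np.moveaxis(np.indices((N, N)), 0, -1)
# coords[i, j] == (i, j), same as the broadcast_arrays version above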
numpy is not an improvement over lists in all ways. It is best for math on existing arrays. Creating an array from lists is time consuming. And it doesn't add a whole lot to string handling.

numpy.where on 2D array is slower than list comprehension of numpy.where on 1D array

Why is Numpy slower than list comprehensions in this case?
What is the best way to vectorize this grid-construction?
In [1]: import numpy as np
In [2]: mesh = np.linspace(-1, 1, 3000)
In [3]: rowwise, colwise = np.meshgrid(mesh, mesh)
In [4]: f = lambda x, y: np.where(x > y, x**2, x**3)
# Using 2D arrays:
In [5]: %timeit f(colwise, rowwise)
285 ms ± 2.25 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# Using 1D array and list-comprehension:
In [6]: %timeit np.array([f(x, mesh) for x in mesh])
58 ms ± 2.69 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# Equivalent result
In [7]: np.allclose(f(colwise, rowwise), np.array([f(x, mesh) for x in mesh]))
True
In [1]: mesh = np.linspace(-1, 1, 3000)
   ...: rowwise, colwise = np.meshgrid(mesh, mesh)
   ...: f = lambda x, y: np.where(x > y, x**2, x**3)
In addition, let's make a sparse grid:
In [2]: r1,c1 = np.meshgrid(mesh,mesh,sparse=True)
In [3]: rowwise.shape
Out[3]: (3000, 3000)
In [4]: r1.shape
Out[4]: (1, 3000)
With the sparse grid, times are even better than your iteration:
In [5]: timeit f(colwise, rowwise)
645 ms ± 57.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [6]: timeit f(c1,r1)
108 ms ± 3.85 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [7]: timeit np.array([f(x, mesh) for x in mesh])
166 ms ± 13.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
The other answer stresses caching. Other posts have shown that a modest amount of iteration can be faster than working with very large arrays, such as when using matmul. I don't know if it's caching or some other memory management complication that slows this down.
But at 3000*3000*8 bytes I'm not sure that's the issue here. Instead I think it's the time the x**2 and x**3 expressions require.
The arguments of the where are evaluated before being passed in.
The condition expression takes a modest amount of time:
In [8]: timeit colwise>rowwise
24.2 ms ± 71.1 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
But the power expression for the (3000,3000) array takes a majority of the total time:
In [9]: timeit rowwise**3
467 ms ± 8.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Contrast that with the time required for the sparse equivalent:
In [10]: timeit r1**3
142 µs ± 150 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
That is about 3288x faster, which is a bit worse than O(n) scaling (the sparse operand has only 1/3000 of the elements).
Repeated multiplication is better:
In [11]: timeit rowwise*rowwise*rowwise
116 ms ± 12 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [f(x, mesh) for x in mesh], x**3 operates on a scalar, so it is fast, even though it's repeated 3000 times.
In fact, if we take the power calculations out of the timing, the whole-array where is relatively fast:
In [15]: %%timeit x2,x3 = rowwise**2, rowwise**3
...: np.where(rowwise>colwise, x2,x3)
89.8 ms ± 3.99 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Why is Numpy slower than list comprehensions in this case?
You are essentially suffering from two problems.
The first one is cache utilization.
The second version uses a smaller subset of the space, (3000, 1) and (1, 3000), for the calculation, which can fit nicely in your cache, so x > y, x**2 and x**3 can all sit inside your cache, which speeds things up.
The first version calculates each of those three for a 3000x3000 array (9 million entries), which can never sit inside your cache (usually ~2-5 MB); np.where is then called and has to fetch parts of the data from your RAM (not the cache) in order to do its copying, returning the result piece by piece to your RAM, which is very expensive.
Also, numpy's implementation of np.where is somewhat alignment-unaware and accesses your arrays column-wise rather than row-wise, so it is essentially grabbing each and every entry from your RAM and not utilizing the cache at all.
Your list comprehension actually solves this issue, as it only deals with a small subset of the data at any given time, so all of it can sit in your cache; it still uses np.where, but forces row-wise access and therefore utilizes your cache.
The second problem is the calculation of x**2 and x**3, which is floating-point exponentiation and very expensive; consider replacing them with x*x and x*x*x.
What is the best way to vectorize this grid-construction?
Apparently you have already written it in your second method.
An even faster, but unnecessary, cache-based optimization is to write your own code in C and call it from within Python, so that you only evaluate x*x or x*x*x where it is actually needed and never have to store x > y, x*x and x*x*x as full arrays; but the speedup won't be worth the trouble.
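Putting those two suggestions together, a minimal sketch (mine, not from the answer) using the sparse grids plus repeated multiplication instead of ** looks like this:
import numpy as np

mesh = np.linspace(-1, 1, 3000)
r1, c1 = np.meshgrid(mesh, mesh, sparse=True)   # shapes (1, 3000) and (3000, 1)

# np.where broadcasts the (3000, 1) and (1, 3000) operands to (3000, 3000);
# c1*c1 and c1*c1*c1 avoid the expensive floating-point pow.
result = np.where(c1 > r1, c1 * c1, c1 * c1 * c1)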

Create new numpy array by duplicating each item in an array

I would like to create a new numpy array by repeating each item in another array by a given number of times (n). I am currently doing this with a for loop and .extend() with lists but this is not really efficient, especially for very large arrays.
Is there a more efficient way to do this?
import numpy as np

def expandArray(array, n):
    new_array = []
    for item in array:
        new_array.extend([item] * n)
    new_array = np.array(new_array)
    return new_array

print(expandArray([1,2,3,4], 3))
# [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4]
I don't know exactly why, but this code runs faster than np.repeat for me:
def expandArray(array, n):
    return np.concatenate([array for i in range(0, n)])
(Note, though, that concatenating n full copies gives the tiled ordering [1, 2, 3, 4, 1, 2, 3, 4, ...], as np.tile does, rather than the element-wise repetition [1, 1, 1, 2, 2, 2, ...] that np.repeat produces.)
I ran this little benchmark:
arr1 = np.random.rand(100000)
%timeit expandArray(arr1, 5)
1.07 ms ± 25.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
And np.repeat gives this result:
%timeit np.repeat(arr1,5)
2.45 ms ± 148 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
You can create a duplicated array with the numpy.repeat function.
array = np.array([1,2,3,4])
new_array = np.repeat(array, repeats=3, axis=None)
print(new_array)
array([1,1,1,2,2,2,3,3,3,4,4,4])
new_array = np.repeat(array.reshape(-1,1).transpose(), repeats=3, axis=0).flatten()
print(new_array)
array([1,2,3,4,1,2,3,4,1,2,3,4])
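As a side note (my addition), the tiled ordering in that last example is what np.tile produces directly:
np.tile(array, 3)
# array([1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4])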

Fastest way to append nonzero numpy array elements to list

I want to add all nonzero elements from a numpy array arr to a list out_list. Previous research suggests that for numpy arrays, using np.nonzero is most efficient. (My own benchmark below actually suggests it can be slightly improved using np.delete).
However, in my case I want my output to be a list, because I am combining many arrays for which I don't know the number of nonzero elements (so I can't effectively preallocate a numpy array for them). Hence, I was wondering whether there are some synergies that can be exploited to speed up the process. While my naive list comprehension approach is much slower than the pure numpy approach, I got some promising results combining list comprehension with numba.
Here's what I found so far:
import numpy as np
from numba import njit

n = 60_000  # size of array
nz = 0.3    # fraction of zero elements
arr = (np.random.random_sample(n) - nz).clip(min=0)

# method 1
def add_to_list1(arr, out):
    out.extend(list(arr[np.nonzero(arr)]))

# method 2
def add_to_list2(arr, out):
    out.extend(list(np.delete(arr, arr == 0)))

# method 3
def add_to_list3(arr, out):
    out += [x for x in arr if x != 0]

# method 4 (not sure how to get numba to accept an empty list as argument)
@njit
def add_to_list4(arr):
    return [x for x in arr if x != 0]

out_list = []
%timeit add_to_list1(arr, out_list)

out_list = []
%timeit add_to_list2(arr, out_list)

out_list = []
%timeit add_to_list3(arr, out_list)

_ = add_to_list4(arr)  # call once to compile
out_list = []
%timeit out_list.extend(add_to_list4(arr))
Yielding the following results:
2.51 ms ± 137 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.19 ms ± 133 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
15.6 ms ± 183 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
1.63 ms ± 158 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Not surprisingly, numba outperforms all other methods. Among the rest, method 2 (using np.delete) is the best. Am I missing any obvious alternative that exploits the fact that I am converting to a list afterwards? Can you think of anything to further speed up the process?
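As an aside on the comment in method 4: one way to pass a list into the jitted function instead of returning a new one is numba's typed List (a sketch of my own, assuming a numba version that ships numba.typed; add_to_list4b is an illustrative name):
from numba import njit, float64
from numba.typed import List

@njit
def add_to_list4b(arr, out):
    # append nonzero elements directly into a numba typed list
    for x in arr:
        if x != 0:
            out.append(x)

typed_out = List.empty_list(float64)   # element type must match arr.dtype
add_to_list4b(arr, typed_out)
# list(typed_out) converts back to a plain Python list if needed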
Edit 1:
Performance of .tolist():
# method 5
def add_to_list5(arr, out):
    out += arr[arr != 0].tolist()

# method 6
def add_to_list6(arr, out):
    out += np.delete(arr, arr == 0).tolist()

# method 7
def add_to_list7(arr, out):
    out += arr[arr.astype(bool)].tolist()
Timings are on par with numba:
1.62 ms ± 118 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.65 ms ± 104 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.78 ms ± 119 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Edit 2:
Here's some benchmarking using Mad Physicist's suggestion to use np.concatenate to construct a numpy array instead.
import time

# construct numpy array using np.concatenate
out_list = []
t = time.perf_counter()
for i in range(100):
    out_list.append(arr[arr != 0])
result = np.concatenate(out_list)
print(f"Time elapsed: {time.perf_counter() - t:.4f}s")

# compare with best list-based method
out_list = []
t = time.perf_counter()
for i in range(100):
    out_list += arr[arr != 0].tolist()
print(f"Time elapsed: {time.perf_counter() - t:.4f}s")
Concatenating numpy arrays indeed yields another significant speed-up, although it is not directly comparable since the output is a numpy array instead of a list. So which is best will depend on the precise use case.
Time elapsed: 0.0400s
Time elapsed: 0.1430s
TLDR;
1/ using arr[arr != 0] is the fastest of all the indexing options
2/ using .tolist() instead of list(.) speeds things up by a factor of 1.3 - 1.5
3/ with the gains of 1/ and 2/ combined, the speed is on par with numba
4/ if having a numpy array instead of a list is acceptable, then using np.concatenate yields another gain in speed by a factor of ~3.5 compared to the best alternative
I submit that the method of choice, if you are indeed looking for a list output, is:
def f(arr, out_list):
    out_list += arr[arr != 0].tolist()
It seems to beat all the other methods mentioned so far in the OP's question or in other responses (at the time of this writing).
If, however, you are looking for a numpy array as the result, then following @MadPhysicist's version (slightly modified to use arr[arr != 0] instead of np.nonzero()) is almost 6x faster; see the end of this post.
Side note: I would avoid using %timeit out_list.extend(some_list): it keeps adding to out_list during the many loops of timeit. Example:
out_list = []
%timeit out_list.extend([1,2,3])
and now:
>>> len(out_list)
243333333 # yikes
Timings
On 60K items on my machine, I see:
out_list = []
a = %timeit -o out_list + arr[arr != 0].tolist()
b = %timeit -o out_list + arr[np.nonzero(arr)].tolist()
c = %timeit -o out_list + list(arr[np.nonzero(arr)])
Yields:
1.23 ms ± 10.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.53 ms ± 2.53 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
4.29 ms ± 3.02 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
And:
>>> c.average / a.average
3.476
>>> b.average / a.average
1.244
For a numpy array result instead
Following @MadPhysicist, you can get some extra boost by not turning the arrays into lists, but using np.concatenate() instead:
def all_nonzero(arr_iter):
    """return non zero elements of all arrays as a np.array"""
    return np.concatenate([a[a != 0] for a in arr_iter])

def all_nonzero_list(arr_iter):
    """return non zero elements of all arrays as a list"""
    out_list = []
    for a in arr_iter:
        out_list += a[a != 0].tolist()
    return out_list
from itertools import repeat
ta = %timeit -o all_nonzero(repeat(arr, 100))
tl = %timeit -o all_nonzero_list(repeat(arr, 100))
Yields:
39.7 ms ± 107 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
227 ms ± 680 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
and
>>> tl.average / ta.average
5.75
Instead of extending a list by all of the elements of a new array, append the array itself. This will make for much fewer and smaller reallocations. You can also pre-allocate a list of Nones up-front or even use an object array, if you have an upper bound on the number of arrays you will process.
When you're done, call np.concatenate on the list.
So instead of this:
L = []
for i in range(10):
    arr = (np.random.random_sample(n) - nz).clip(min=0)
    L.extend(arr[np.nonzero(arr)])
result = np.array(L)
Try this:
L = []
for i in range(10):
    arr = (np.random.random_sample(n) - nz).clip(min=0)
    L.append(arr[np.nonzero(arr)])
result = np.concatenate(L)
Since you're keeping arrays around, the final concatenation will be a series of buffer copies (which is fast), rather than a bunch of Python-to-NumPy type conversions (which won't be). The exact method you choose for deletion is of course still up to the results of your benchmark.
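A minimal sketch of the pre-allocation idea mentioned above (my own illustration; it reuses n and nz from the question's setup, and the upper bound of 10 chunks is assumed):
max_chunks = 10                     # assumed upper bound on the number of arrays
L = [None] * max_chunks             # pre-allocated list, no resizing while filling
for i in range(max_chunks):
    arr = (np.random.random_sample(n) - nz).clip(min=0)
    L[i] = arr[np.nonzero(arr)]
# drop any unused (still-None) slots before concatenating
result = np.concatenate([a for a in L if a is not None])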
Also, here's another method to add to your benchmark:
def add_to_list5(arr, out):
    out.extend(list(arr[arr.astype(bool)]))
I don't expect this to be overwhelmingly fast, but it's interesting to see how masking stacks up next to indexing.

numpy array fromfunction using each previous value as input, with non-zero initial value

I would like to fill a numpy array with values using a function. I want the array to start with one initial value and be filled to a given length, using each previous value in the array as the input to the function.
Each array value should be the previous value multiplied by x**(y/z) (here 2**(1/3)).
After a bit of work, I have got to:
import numpy as np
f = np.zeros([31,1])
f[0] = 20
fun = lambda i, j: i*2**(1/3)
f[1:] = np.fromfunction(np.vectorize(fun), (len(f)-1,1), dtype = int)
This fills the array with
[20, 0, 1*2**(1/3), 2*2**(1/3), ...]
i.e. after the initial 20, each slot gets its index times 2**(1/3), rather than the previous value times 2**(1/3).
I have arrived here having read
https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.fromfunction.html
Use of numpy fromfunction
Most efficient way to map function over numpy array
Fastest way to populate a matrix with a function on pairs of elements in two numpy vectors?
How do I create a numpy array using a function?
But I'm just not getting how to translate it to my function.
Except for the initial 20, this produces the same values
np.arange(31)*2**(1/3)
Your iterative version (slightly modified)
def foo0(n):
    f = np.zeros(n)
    f[0] = 20
    for i in range(1, n):
        f[i] = f[i-1] * 2**(1/3)
    return f
An alternative:
def foo1(n):
    g = [20]
    for i in range(n-1):
        g.append(g[-1] * 2**(1/3))
    return np.array(g)
They produce the same thing:
In [25]: np.allclose(foo0(31), foo1(31))
Out[25]: True
Mine is a bit faster:
In [26]: timeit foo0(100)
35 µs ± 75 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [27]: timeit foo1(100)
23.6 µs ± 83.6 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
But we don't need to evaluate 2**(1/3) every time
def foo2(n):
    g = [20]
    const = 2**(1/3)
    for i in range(n-1):
        g.append(g[-1] * const)
    return np.array(g)
That gives a minor time saving. But we're just multiplying each entry by the same constant, so we can use cumprod for a bigger saving:
def foo3(n):
    g = np.ones(n) * (2**(1/3))
    g[0] = 20
    return np.cumprod(g)
In [37]: timeit foo3(31)
14.9 µs ± 14.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [40]: np.allclose(foo0(31), foo3(31))
Out[40]: True
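Since every step multiplies by the same constant, there is also a closed form (my addition, not part of the original answer): the i-th value is just 20 * 2**(i/3).
def foo4(n):
    # f[i] = 20 * (2**(1/3))**i, computed directly without a loop or cumprod
    return 20 * 2 ** (np.arange(n) / 3)

# np.allclose(foo0(31), foo4(31))  -> True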

Count unique elements along an axis of a NumPy array

I have a three-dimensional array like
A = np.array([[[1, 1],
               [1, 0]],
              [[1, 2],
               [1, 0]],
              [[1, 0],
               [0, 0]]])
Now I would like to obtain an array that has a nonzero value in a given position if only a unique nonzero value (or zero) occurs in that position. It should have zero if only zeros or more than one nonzero value occur in that position. For the example above, I would like
[[1,0],
[1,0]]
since
in A[:,0,0] there are only 1s
in A[:,0,1] there are 0, 1 and 2, so more than one nonzero value
in A[:,1,0] there are 0 and 1, so 1 is retained
in A[:,1,1] there are only 0s
I can find how many nonzero elements there are with np.count_nonzero(A, axis=0), but I would like to keep 1s or 2s even if there are several of them. I looked at np.unique but it doesn't seem to support what I'd like to do.
Ideally, I'd like a function like np.count_unique(A, axis=0) which would return an array in the original shape, e.g. [[1, 3],[2, 1]], so I could check whether 3 or more occur and then ignore that position.
All I could come up with was a list comprehension iterating over the positions of the array I'd like to obtain:
[[len(np.unique(A[:, i, j])) for j in range(A.shape[2])] for i in range(A.shape[1])]
Any other ideas?
You can use np.diff to stay at numpy level for the second task.
def diffcount(A):
    B = A.copy()
    B.sort(axis=0)
    C = np.diff(B, axis=0) > 0
    D = C.sum(axis=0) + 1
    return D

# diffcount(A) gives:
# [[1 3]
#  [2 1]]
It seems to be a little faster on big arrays:
In [62]: A=np.random.randint(0,100,(100,100,100))
In [63]: %timeit diffcount(A)
46.8 ms ± 769 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [64]: timeit [[len(np.unique(A[:, i, j])) for j in range(A.shape[2])]\
for i in range(A.shape[1])]
149 ms ± 700 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Finally, counting unique values is simpler than sorting; a ln(A.shape[0]) factor could be won.
A way to win this factor is to use the set mechanism:
In [81]: %timeit np.apply_along_axis(lambda a: len(set(a)), 0, A)
183 ms ± 1.17 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Unfortunately, this is not faster.
Another way is to do it by hand:
def countunique(A, Amax):
    res = np.empty(A.shape[1:], A.dtype)
    c = np.empty(Amax+1, A.dtype)
    for i in range(A.shape[1]):
        for j in range(A.shape[2]):
            T = A[:, i, j]
            for k in range(c.size):
                c[k] = 0
            for x in T:
                c[x] = 1
            res[i, j] = c.sum()
    return res
At python level:
In [70]: %timeit countunique(A,100)
429 ms ± 18.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Which is not so bad for a pure Python approach. Then just shift this code to low level with numba:
import numba
countunique2 = numba.jit(countunique)
In [71]: %timeit countunique2(A,100)
3.63 ms ± 70.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
That will be hard to improve on by much.
One approach would be to use A as first-axis indices to set a boolean array of the same length along the other two axes, and then simply count the non-zeros along its first axis. Two variants are possible: one keeps the array 3D, the other reshapes it to 2D for some performance benefit, since indexing into 2D is faster. Thus, the two implementations would be -
def nunique_axis0_maskcount_app1(A):
    m, n = A.shape[1:]
    mask = np.zeros((A.max()+1, m, n), dtype=bool)
    mask[A, np.arange(m)[:,None], np.arange(n)] = 1
    return mask.sum(0)

def nunique_axis0_maskcount_app2(A):
    m, n = A.shape[1:]
    A.shape = (-1, m*n)
    maxn = A.max()+1
    N = A.shape[1]
    mask = np.zeros((maxn, N), dtype=bool)
    mask[A, np.arange(N)] = 1
    A.shape = (-1, m, n)
    return mask.sum(0).reshape(m, n)
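A quick sanity check of both variants against the question's np.unique-based list comprehension (my own addition, on a small random array):
B = np.random.randint(0, 10, (5, 4, 3))
ref = np.array([[len(np.unique(B[:, i, j])) for j in range(B.shape[2])]
                for i in range(B.shape[1])])
assert (nunique_axis0_maskcount_app1(B) == ref).all()
assert (nunique_axis0_maskcount_app2(B) == ref).all()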
Runtime test -
In [154]: A = np.random.randint(0,100,(100,100,100))
# @B. M.'s soln
In [155]: %timeit f(A)
10 loops, best of 3: 28.3 ms per loop
# @B. M.'s soln using slicing: (B[1:] != B[:-1]).sum(0)+1
In [156]: %timeit f2(A)
10 loops, best of 3: 26.2 ms per loop
In [157]: %timeit nunique_axis0_maskcount_app1(A)
100 loops, best of 3: 12 ms per loop
In [158]: %timeit nunique_axis0_maskcount_app2(A)
100 loops, best of 3: 9.14 ms per loop
Numba method
Using the same strategy as used for nunique_axis0_maskcount_app2 with directly getting the counts at C-level with numba, we would have -
from numba import njit
@njit
def nunique_loopy_func(mask, N, A, p, count):
    for j in range(N):
        mask[:] = True
        mask[A[0, j]] = False
        c = 1
        for i in range(1, p):
            if mask[A[i, j]]:
                c += 1
                mask[A[i, j]] = False
        count[j] = c
    return count

def nunique_axis0_numba(A):
    p, m, n = A.shape
    A.shape = (-1, m*n)
    maxn = A.max()+1
    N = A.shape[1]
    mask = np.empty(maxn, dtype=bool)
    count = np.empty(N, dtype=int)
    out = nunique_loopy_func(mask, N, A, p, count).reshape(m, n)
    A.shape = (-1, m, n)
    return out
Runtime test -
In [328]: np.random.seed(0)
In [329]: A = np.random.randint(0,100,(100,100,100))
In [330]: %timeit nunique_axis0_maskcount_app2(A)
100 loops, best of 3: 11.1 ms per loop
# @B.M.'s numba soln
In [331]: %timeit countunique2(A,A.max()+1)
100 loops, best of 3: 3.43 ms per loop
# Numba soln posted in this post
In [332]: %timeit nunique_axis0_numba(A)
100 loops, best of 3: 2.76 ms per loop
