I can certainly do
a[a == 0] = something
that sets every entry of a that equals zero to something. Equivalently, I could write
a[np.equal(a, 0)] = something
Now, imagine a is an array of dtype=object. I cannot write a[a is None] because, of course, a itself isn't None. The intention is clear: I want the is comparison to be broadcast like any other ufunc. The list of ufuncs in the docs contains nothing like an is ufunc.
Why is there none, and, more interestingly to me: what would be a performant replacement?
There are two things at play here.
The first (and more important) one is that is is implemented directly in the Python interpreter, with no option to redirect to a dunder method. Numpy arrays, like many other objects, have an __eq__ method that implements the == operation. a is None is treated approximately as id(a) == id(None), with no recourse for an elementwise implementation under any circumstance. That's just how Python works.
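A quick illustration of the difference, using a small throwaway array:

import numpy as np

a = np.array([1, None, 3], dtype=object)

print(a == None)   # dispatches to ndarray.__eq__, so it broadcasts: [False  True False]
print(a is None)   # identity test on the array object itself, no dunder involved: False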
The second aspect is that numpy is fundamentally designed for storing numbers. Object arrays are a special case that stores references to Python objects, essentially as pointers. That looks like how lists store their elements, but the similarity only holds for object arrays. The elements of a list are always references to objects, even when the list contains nothing but homogeneous integers, for example. A numpy array of dtype int does not contain Python objects: each consecutive element is a raw binary integer, not a reference to a Python object wrapper. Even if Python allowed you to override the is operator, applying it elementwise to such an array would be meaningless.
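A small sketch of the storage difference (illustrative values only):

import numpy as np

ints = np.array([1, 2, 3])                  # raw machine integers, no Python objects inside
objs = np.array([1, 2, 3], dtype=object)    # each slot holds a reference to a Python int

print(ints.dtype, ints.itemsize)   # e.g. int64 8 -- 8 bytes of raw numeric data per element
print(objs.dtype, objs.itemsize)   # object 8     -- pointer-sized reference per element (64-bit build)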
So if you want to compare objects, use python lists:
mylist = [...]
mylist = [something if x is None else x for x in mylist]
If you insist on using a numpy array, either (a) use a numerical array and mark None elements with something else, like np.nan, (b) treat the array as a list, applying id or is to each element; these are Python-level constructs, so there is no "performant" way to do it at that point, or (c) just use ==, which triggers Python-level equality comparison and is equivalent to is for the singleton None.
Except for operations like reshape and indexing that don't depend on dtype (other than the itemsize), operations on object dtype arrays are performed at list-comprehension speeds, iterating over the elements and applying an appropriate method to each. Sometimes that method doesn't exist; np.sin, for example, fails on an object array because Python numbers have no sin method.
To illustrate, consider the array from one of the comments:
In [132]: a = np.array([1, None, 0, np.nan, ''])
In [133]: a
Out[133]: array([1, None, 0, nan, ''], dtype=object)
The object array test:
In [134]: a==None
Out[134]: array([False, True, False, False, False])
In [135]: timeit a==None
5.16 µs ± 73.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
An equivalent comprehension:
In [136]: [x is None for x in a]
Out[136]: [False, True, False, False, False]
In [137]: timeit [x is None for x in a]
1.52 µs ± 18.6 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
It's faster, even if we convert the result back to an array (not a cheap step):
In [138]: timeit np.array([x is None for x in a])
4.67 µs ± 95.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Iteration on the list version of the array is even faster:
In [139]: timeit np.array([x is None for x in a.tolist()])
2.52 µs ± 48.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Let's look at the full assignment action:
In [141]: a[[x is None for x in a.tolist()]]
Out[141]: array([None], dtype=object)
In [142]: %%timeit a1=a.copy()
...: a1[[x is None for x in a1.tolist()]] = np.nan
...:
...:
4.03 µs ± 10 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [143]: %%timeit a1=a.copy()
...: a1[a1==None] = np.nan
...:
...:
6.18 µs ± 28.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
The usual caveat applies: these relative timings might scale differently for larger arrays.
I use the code below to create an empty matrix:
import numpy as np
x = np.array([[1,2,3], [4,5,6], [7,8,9], [10, 11, 12]])
print(x)
y =np.empty_like(x)
print(y)
# I get the data below:
[[2097184 2097184 2097184]
[2097184 2097184 2097184]
[2097184 2097184 2097184]
[2097184 2097184 2097184]]
Why does 2097184 stand for "empty"?
It doesn't stand for anything. From the documentation:
This function does not initialize the returned array; to do that use zeros_like or ones_like instead. It may be marginally faster than the functions that do set the array values.
So the contents of the array are whatever happens to be in the memory that it used for it. In this case, it was a bunch of 2097184 values. The next time you try it you'll probably get something different.
You use this when you don't care what's in the array, because you're going to overwrite it.
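A minimal sketch of that usage pattern, assuming you fill the array right after allocating it:

import numpy as np

x = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])

y = np.empty_like(x)   # uninitialized: contents are whatever was in that memory
y[:] = x * 2           # every element is overwritten, so the garbage never matters
print(y)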
The empty_like function does not initialize the array (that's why it's much faster than zeros_like and ones_like), so the shape of the result is exactly the same as x, but the values are uninitialized: they are essentially whatever arbitrary values happen to be in the memory allocated to the array.
In addition, it's just a more efficient alternative to zeros_like or ones_like:
%%timeit
np.zeros_like(x)
>>> 18.4 µs ± 2.39 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%%timeit
np.ones_like(x)
>>> 14.1 µs ± 205 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%%timeit
np.empty_like(x)
>>> 2.09 µs ± 62.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
I have a numpy array, X, of shape (200, 200, 1500). I also have a function, func, that essentially returns the mean of an array (it does a few other things, but they are all numpy operations; you can think of it as np.mean). Now if I want to apply this function along the last axis I could just do np.apply_along_axis(func, 2, X). But I also have a truth array of shape (200, 200, 1500). I want to apply func only to places where the truth array is True and ignore any places where it is False. So going back to the np.mean example, it would take the mean for each (i, j) index along the last axis but ignore some arbitrary set of indices.
So in practice, my solution would be to convert X into a new array Y with shape (200, 200) whose elements are lists, built using the truth array, and then apply func to each list in the array. The problem is that this seems very time consuming, and I feel like there is a numpy-oriented solution for this. Is there?
If what I said with the array list is the best way, how would I go about combining X and the truth array to get Y?
Any suggestions or comments appreciated.
In [268]: X = np.random.randint(0,100,(200,200,1500))
Let's check how apply works with just np.mean:
In [269]: res = np.apply_along_axis(np.mean, 2, X)
In [270]: res.shape
Out[270]: (200, 200)
In [271]: timeit res = np.apply_along_axis(np.mean, 2, X)
1.2 s ± 36.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
An equivalent using iteration on the first two dimensions. I'm using reshape to make it easier to write; speed should be about the same with a double loop.
In [272]: res1 = np.reshape([np.mean(row) for row in X.reshape(-1,1500)],(200,200))
In [273]: np.allclose(res, res1)
Out[273]: True
In [274]: timeit res1 = np.reshape([np.mean(row) for row in X.reshape(-1,1500)],(200,200))
906 ms ± 13 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
So apply may be convenient, but it is not a speed tool.
For speed in numpy you need to maximize the use of compiled code, and avoid unnecessary Python-level loops.
In [275]: res2 = np.mean(X,axis=2)
In [276]: np.allclose(res2,res)
Out[276]: True
In [277]: timeit res2 = np.mean(X,axis=2)
120 ms ± 619 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
If using apply in your new case is hard, you don't lose anything by using something you do understand.
masked
In [278]: mask = np.random.randint(0,2, X.shape).astype(bool)
The [272] iteration can be adapted to work with mask:
In [279]: resM1 = np.reshape([np.mean(row[m]) for row,m in zip(X.reshape(-1,1500),mask.reshape(-1,1500))],X.shape[:2])
In [280]: timeit resM1 = np.reshape([np.mean(row[m]) for row,m in zip(X.reshape(-1,1500),mask.reshape(-1,1500))],X.shape[:2])
1.43 s ± 18.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
This might have problems if row[m] is empty: np.mean([]) produces a warning and a nan value.
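One workaround is to guard the empty case explicitly, for example with np.nan as the fill value (a sketch, not benchmarked):

resM1 = np.reshape(
    [np.mean(row[m]) if m.any() else np.nan          # skip fully-masked cells
     for row, m in zip(X.reshape(-1, 1500), mask.reshape(-1, 1500))],
    X.shape[:2])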
Applying the mask to X before any further processing loses dimensional information.
In [282]: X[mask].shape
Out[282]: (30001416,)
apply only works with one array, so it will be awkward (though not impossible) to use it to iterate on both X and mask. A structured array with data and mask fields might do the job (a rough sketch follows). But as the previous timings show, there's no speed advantage.
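A rough sketch of that structured-array idea, using a hypothetical masked_mean helper; it still pays the Python-level call cost per cell:

# pack data and mask into one structured array so apply_along_axis sees both fields
packed = np.empty(X.shape, dtype=[('data', X.dtype), ('mask', bool)])
packed['data'] = X
packed['mask'] = mask

def masked_mean(rec):
    # rec is a 1-d structured slice taken along the last axis
    sel = rec['data'][rec['mask']]
    return sel.mean() if sel.size else np.nan

resS = np.apply_along_axis(masked_mean, 2, packed)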
masked array
I don't usually expect masked arrays to offer speed, but in this case it helps:
In [285]: xM = np.ma.masked_array(X, ~mask)
In [286]: resMM = np.ma.mean(xM, axis=2)
In [287]: np.allclose(resM1, resMM)
Out[287]: True
In [288]: timeit resMM = np.ma.mean(xM, axis=2)
849 ms ± 20.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
np.nanmean
There's a set of functions that use np.nan masking:
In [289]: Xfloat = X.astype(float)
In [290]: Xfloat[~mask] = np.nan
In [291]: resflt = np.nanmean(Xfloat, axis=2)
In [292]: np.allclose(resM1, resflt)
Out[292]: True
In [293]: %%timeit
...: Xfloat = X.astype(float)
...: Xfloat[~mask] = np.nan
...: resflt = np.nanmean(Xfloat, axis=2)
...:
...:
2.17 s ± 200 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
This doesn't help :(
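If func really does reduce to a mean, there is a fully compiled alternative: sum the unmasked entries and divide by the per-cell counts. A sketch, assuming no cell is completely masked:

counts = mask.sum(axis=2)                 # number of True entries per (i, j) cell
resSum = (X * mask).sum(axis=2) / counts  # masked mean, all in compiled numpy code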
I have a = np.array([np.array([[1,2,3,4],[2,3,4,5]]), np.array([6,7,8,9])], dtype=object). I want to take the dot product of both of the arrays with some vector v.
I tried to vectorize the np.dot function.
vfunc = np.vectorize(np.dot), and I applied vfunc to my array a as vfunc(a, v), where v is the vector I want to take the dot product with. However, I get this error: ValueError: setting an array element with a sequence. Is there any other way to do this?
Since you are passing an object dtype array as an argument, you need to specify the 'O' result type as well. Without otypes, vectorize tries to deduce the return dtype, and may do so wrongly. That's just one of the pitfalls of using np.vectorize:
In [196]: f = np.vectorize(np.dot, otypes=['O'])
In [197]: x = np.array([[1,2,3],[1,2,3,4]])
/usr/local/bin/ipython3:1: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
#!/usr/bin/python3
In [199]: f(x, x)
Out[199]: array([14, 30], dtype=object)
Another problem with np.vectorize is that it is slower than alternatives:
In [200]: f1 = np.frompyfunc(np.dot, 2,1)
In [201]: f1(x,x)
Out[201]: array([14, 30], dtype=object)
In [202]: np.array([np.dot(i,j) for i,j in zip(x,x)])
Out[202]: array([14, 30])
In [203]: timeit f(x, x)
27.1 µs ± 229 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [204]: timeit f1(x,x)
16.9 µs ± 135 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [205]: timeit np.array([np.dot(i,j) for i,j in zip(x,x)])
21.3 µs ± 201 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
np.vectorize has a clear speed disclaimer. Read the full docs; it isn't as simple a function as you might think. The name can be misleading.
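For the original goal, a dot product of each subarray with a single vector, a plain comprehension may be the simplest route. A sketch, assuming a hypothetical vector v whose length matches the subarrays' last dimension:

import numpy as np

a = np.array([np.array([[1, 2, 3, 4], [2, 3, 4, 5]]), np.array([6, 7, 8, 9])],
             dtype=object)
v = np.array([1, 0, 1, 0])                 # hypothetical length-4 vector

results = [np.dot(sub, v) for sub in a]    # one dot product per subarray
# results: [array([4, 6]), 14] -- shapes differ, so a plain list is the natural container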
I am interested in finding the fastest way of carrying out a simple operation in Python 3.6 using Numpy. I wish to apply a function to a given array and produce an array of function values. Here is simplified code that does that using map:
import numpy as np
def func(x):
    return x**2
xRange = np.arange(0,1,0.01)
arr_func = np.array(list(map(func, xRange)))
However, as I am running it with a complicated function and using large arrays, runtime speed is very important for me. Is there a known faster way?
EDIT My question is not the same as this one, because I am asking about assigning from a function, as opposed to a generator.
Check the related How do I build a numpy array from a generator?, where the most compelling option seems to be preallocating the numpy array and setting values, instead of creating a throwaway intermediate list.
arr_func = np.empty(len(xRange))
for i in range(len(xRange)):
    arr_func[i] = func(xRange[i])
With a complex function that can't be rewritten with compiled numpy functions, we can't make big improvements in speed.
Define a function with math methods that require scalars, for example:
import math

def func(x):
    return math.sin(x)**2 + math.cos(x)**2
In [868]: x = np.linspace(0,np.pi,10000)
For reference, do a straightforward list comprehension:
In [869]: np.array([func(i) for i in x])
Out[869]: array([ 1., 1., 1., ..., 1., 1., 1.])
In [870]: timeit np.array([func(i) for i in x])
13.4 ms ± 211 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Your list map is slightly faster:
In [871]: timeit np.array(list(map(func, x)))
12.6 ms ± 12.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
For a 1d array like this, np.array can be replaced with np.fromiter. It works with a generator as well, including the Py3 map object.
In [875]: timeit np.fromiter(map(func, x),float)
13.1 ms ± 176 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
So that could get around the possible time penalty of creating a whole list first. But in this case it doesn't help.
Another option is np.frompyfunc. It is used by np.vectorize, but is usually faster, with less overhead. It returns an object dtype array:
In [876]: f = np.frompyfunc(func, 1, 1)
In [877]: f(x)
Out[877]: array([1.0, 1.0, 1.0, ..., 1.0, 1.0, 1.0], dtype=object)
In [878]: timeit f(x)
11.1 ms ± 298 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [879]: timeit f(x).astype(float)
11.2 ms ± 85.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
A slight speed improvement. I noticed more of an improvement with a 1000-item x. This is even better if your problem requires several arrays that may be broadcast against each other.
Assigning to a preallocated out array may save memory, and is often recommended as an alternative to appending to a list in a loop. But here it doesn't give a speed improvement:
In [882]: %%timeit
...: out = np.empty_like(x)
...: for i,j in enumerate(x): out[i]=func(j)
16.1 ms ± 308 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
(The use of enumerate is slightly faster than iterating with range.)
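If func can be rewritten entirely with numpy ufuncs, it can be applied to the whole array at once, which is the real way to get a big speedup. A minimal sketch of the compiled version of the example function:

import numpy as np

x = np.linspace(0, np.pi, 10000)
res = np.sin(x)**2 + np.cos(x)**2   # whole-array ufuncs, no Python-level loop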
I used:
df['ids'] = df['ids'].values.astype(set)
to turn lists into sets, but the output was a list not a set:
>>> x = np.array([[1, 2, 2.5],[12,35,12]])
>>> x.astype(set)
array([[1.0, 2.0, 2.5],
[12.0, 35.0, 12.0]], dtype=object)
Is there an efficient way to turn list into set in Numpy?
EDIT 1:
My input is as big as below:
I have 3,000 records. Each has 30,000 ids: [[1,...,12,13,...,30000], [1,..,43,45,...,30000],...,[...]]
First flatten your ndarray to obtain a one-dimensional array, then apply set() to it:
set(x.flatten())
Edit: since it seems you just want an array of sets, not a set of the whole array, you can do value = [set(v) for v in x] to obtain a list of sets.
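To illustrate both forms on the example array (a small sketch):

import numpy as np

x = np.array([[1, 2, 2.5], [12, 35, 12]])

whole   = set(x.flatten())       # one set for everything: {1.0, 2.0, 2.5, 12.0, 35.0}
per_row = [set(v) for v in x]    # one set per row: [{1.0, 2.0, 2.5}, {12.0, 35.0}]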
The current state of your question (it can change any time): how can I efficiently remove duplicate elements from a large array of large arrays?
import numpy as np
rng = np.random.default_rng()
arr = rng.random((3000, 30000))
out1 = list(map(np.unique, arr))
#or
out2 = [np.unique(subarr) for subarr in arr]
Runtimes in an IPython shell:
>>> %timeit list(map(np.unique, arr))
5.39 s ± 37.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %timeit [np.unique(subarr) for subarr in arr]
5.42 s ± 58.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Update: as #hpaulj pointed out in his comment, my dummy example is biased, since floating-point random numbers will almost certainly all be unique. So here's a more realistic example with integers:
>>> arr = rng.integers(low=1, high=15000, size=(3000, 30000))
>>> %timeit list(map(np.unique, arr))
4.98 s ± 83.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %timeit [np.unique(subarr) for subarr in arr]
4.95 s ± 51.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In this case the elements of the output list have varying lengths, since there are actual duplicates to remove.
A couple of earlier 'row-wise' unique questions:
vectorize numpy unique for subarrays
Numpy: Row Wise Unique elements
Count unique elements row wise in an ndarray
In a couple of these the count is more interesting than the actual unique values.
If the number of unique values per row differs, then the result cannot be a (2d) array. That's a pretty good indication that the problem cannot be fully vectorized. You need some sort of iteration over the rows.
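If only the per-row count of unique values is needed, that part can be fully vectorized by sorting each row and counting value changes; a sketch under that assumption:

srt = np.sort(arr, axis=1)                              # sort each row
counts = 1 + (srt[:, 1:] != srt[:, :-1]).sum(axis=1)    # unique values per row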