Find multiple values in a Numpy array - python

a and b are two Numpy arrays of integers. They are sorted and without repetitions. b is a subset of a. I need to find the index in a of every element of b. Is there an efficient Numpy function that could help, so I can avoid the python loop?
(Actually, the arrays are of pandas.DatetimeIndex and Numpy datetime64, but I guess it doesn't change the answer.)

numpy.searchsorted() can be used to do this:
In [15]: a = np.array([1, 2, 3, 5, 10, 20, 25])
In [16]: b = np.array([1, 5, 20, 25])
In [17]: a.searchsorted(b)
Out[17]: array([0, 3, 5, 6])
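From what I understand, it doesn't require b to be sorted, and uses binary search on a. This means that it's O(m log n), where m = len(b) and n = len(a), rather than the O(n) that a single linear pass over two sorted arrays could achieve.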
If that's not good enough, there's always Cython. :-)
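As a rough sketch of the same idea, the snippet below adds a sanity check (searchsorted returns insertion points, so the result only equals the true index when every element of b is really in a) and repeats the lookup on datetime64 values. The date ranges are made up for illustration.

```python
import numpy as np

a = np.array([1, 2, 3, 5, 10, 20, 25])
b = np.array([1, 5, 20, 25])

# Binary search for each element of b inside the sorted array a.
idx = np.searchsorted(a, b)

# searchsorted returns insertion points, so confirm that b really is a subset of a.
assert np.array_equal(a[idx], b)

# The same call works for datetime64 data (illustrative dates, not from the question).
dates = np.arange("2020-01-01", "2020-01-08", dtype="datetime64[D]")
wanted = np.array(["2020-01-02", "2020-01-05"], dtype="datetime64[D]")
date_idx = np.searchsorted(dates, wanted)
```

If b might contain values missing from a, the equality check is what catches it; searchsorted alone will not raise.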

Related

What is the 'a' in numpy.arange?

What does the 'a' in numpy's numpy.arange method stand for, and how does it differ from a simple range produced by Python's builtin range method (definitionally, not in terms of performance and whatnot)?
I tried looking online for an answer to this, but all I find is tutorials for how to use numpy.arange by GeeksForGeeks and co.
You can inspect the return types and reason about what it could mean that way:
print(type(range(0,5)))
import numpy as np
print(type(np.arange(0,5)))
Which prints:
<class 'range'>
<class 'numpy.ndarray'>
Here's a related question: Why was the name "arange" chosen for the numpy function?
Some people do from numpy import * which would shadow range which causes problems.
Naming the function arrayrange was not chosen because it's too long to type.
From the previous SO we learn that the 'a' stands, in some sense, for 'array'. arange is a function that returns a numpy array that is similar, at least in simple cases, to the list produced by list(range(...)). From the official arange docs:
For integer arguments the function is roughly equivalent to the Python built-in range, but returns an ndarray rather than a range instance.
In [104]: list(range(-3,10,2))
Out[104]: [-3, -1, 1, 3, 5, 7, 9]
In [105]: np.arange(-3,10,2)
Out[105]: array([-3, -1, 1, 3, 5, 7, 9])
In py3, range by itself is "unevaluated": it is a lazy sequence object that produces its values on demand, much like a generator. It's the equivalent of the py2 xrange.
The best "definition" is the official documentation page:
https://numpy.org/doc/stable/reference/generated/numpy.arange.html
But maybe you are wondering when to use one or the other. The simple answer is - if you are doing python level iteration, range is usually better. If you need an array, use arange (or np.linspace as suggested by the docs).
In [106]: [x**2 for x in range(5)]
Out[106]: [0, 1, 4, 9, 16]
In [107]: np.arange(5)**2
Out[107]: array([ 0, 1, 4, 9, 16])
I often use arange to create an example array, as in:
In [108]: np.arange(12).reshape(3,4)
Out[108]:
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
While it is possible to make an array from a range, e.g. np.array(range(5)), that is relatively slow. np.fromiter(range(5),int) is faster, but still not as good as the direct np.arange.
The 'a' stands for 'array' in numpy.arange. numpy.arange is a function that produces an array of evenly spaced numbers within a given interval. It differs from Python's builtin range() function in that it accepts floating-point start, stop, and step values, whereas range() accepts only integers. Also, the output of numpy.arange is an ndarray holding all the elements instead of a lazy range object.
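A small sketch of those differences, using made-up values: for integer arguments the two agree element by element, a float step works only with arange, and range rejects it with a TypeError.

```python
import numpy as np

# Integer arguments: same values, different container types.
lazy = range(-3, 10, 2)          # lazy range object
eager = np.arange(-3, 10, 2)     # ndarray with all values materialised

# A float step is accepted by arange (0.25 is exactly representable in binary).
floats = np.arange(0.0, 1.0, 0.25)

# range() only accepts integers, so the same call fails.
try:
    range(0, 1, 0.25)
    range_error = None
except TypeError as exc:
    range_error = type(exc).__name__
```

For float steps that are not exactly representable, the arange docs suggest np.linspace, since the number of elements can otherwise be surprising.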

Index of Position of Values from B in A

I have a little bit of a tricky problem here...
Given two arrays A and B
A = np.array([8, 5, 3, 7])
B = np.array([5, 5, 7, 8, 3, 3, 3])
I would like to replace the values in B with the index of that value in A. In this example case, that would look like:
[1, 1, 3, 0, 2, 2, 2]
For the problem I'm working on, A and B contain the same set of values and all of the entries in A are unique.
The simple way to solve this is to use something like:
B_new = np.empty_like(B)
for idx in range(len(A)):
    ind = np.where(B == A[idx])[0]
    B_new[ind] = idx  # store the position in A, not the value
But the B array I'm working with contains almost a million elements and using a for loop gets super slow. There must be a way to vectorize this, but I can't figure it out. The closest I've come is to do something like
np.intersect1d(A, B, return_indices=True)
But this only gives me the first occurrence of each element of A in B. Any suggestions?
The solution of @mozway is good for small arrays but not for big ones, as it runs in O(n**2) time (i.e. quadratic time; see time complexity for more information). Here is a much better solution for big arrays, running in O(n log n) time (i.e. quasi-linear), based on a fast binary search:
unique_values, index = np.unique(A, return_index=True)
result = index[np.searchsorted(unique_values, B)]
Use numpy broadcasting:
np.where(B[:, None]==A)[1]
NB. the values in A must be unique
Output:
array([1, 1, 3, 0, 2, 2, 2])
Though I can't tell exactly what the complexity of this is, I believe it will perform quite well:
A.argsort()[np.unique(B, return_inverse = True)[1]]
array([1, 1, 3, 0, 2, 2, 2], dtype=int64)
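To compare the approaches above side by side, here is a minimal sketch that assumes, as the question states, that the values in A are unique and that every value in B occurs in A. The names order, pos and via_broadcast are only illustrative.

```python
import numpy as np

A = np.array([8, 5, 3, 7])
B = np.array([5, 5, 7, 8, 3, 3, 3])

# Sort A once, binary-search B in the sorted copy, then map back to A's positions.
order = np.argsort(A)                 # positions in A, in sorted-value order
pos = np.searchsorted(A[order], B)    # position of each B value in sorted A
result = order[pos]

# Broadcasting version: builds a len(B) x len(A) boolean matrix, so it is
# only practical for small inputs.
via_broadcast = np.where(B[:, None] == A)[1]

# Guard for the stated assumption: every value in B must exist in A.
assert np.array_equal(A[result], B)
```

The argsort/searchsorted pair does the same job as np.unique(A, return_index=True) when A has no duplicates, without computing the unique values.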

Rationale for numpy.split returning a list and not an array

I was surprised that numpy.split yields a list and not an array. I would have thought it would be better to return an array, since numpy has put a lot of work into making arrays more useful than lists. Can anyone justify numpy returning a list instead of an array? Why would that be a better programming decision for the numpy developers to have made?
A comment pointed out that if the split is uneven, the result can't be an array, at least not one that has the same dtype. At best it would be an object dtype.
But let's consider the case of equal length subarrays:
In [124]: x = np.arange(10)
In [125]: np.split(x,2)
Out[125]: [array([0, 1, 2, 3, 4]), array([5, 6, 7, 8, 9])]
In [126]: np.array(_) # make an array from that
Out[126]:
array([[0, 1, 2, 3, 4],
[5, 6, 7, 8, 9]])
But we can get the same array without split - just reshape:
In [127]: x.reshape(2,-1)
Out[127]:
array([[0, 1, 2, 3, 4],
[5, 6, 7, 8, 9]])
Now look at the code for split. It just passes the task to array_split. Ignoring the details about alternative axes, it just does
sub_arys = []
for i in range(Nsections):
    # st and end come from `div_points`
    sub_arys.append(sary[st:end])
return sub_arys
In other words, it just steps through array and returns successive slices. Those (often) are views of the original.
So split is not that sophisticated a function. You could generate such a list of subarrays yourself without a lot of numpy expertise.
Another point. Documentation notes that split can be reversed with an appropriate stack. concatenate (and family) takes a list of arrays. If given an array of arrays, or a higher dim array, it effectively iterates on the first dimension, e.g. concatenate(arr) => concatenate(list(arr)).
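The points above can be checked with a short sketch: np.split insists on equal sections, np.array_split allows uneven ones (which is why a list is the natural return type), the pieces share memory with the original, and concatenate reverses the split.

```python
import numpy as np

x = np.arange(10)

# Equal split: two sub-arrays of length 5.
parts = np.split(x, 2)

# np.split refuses an uneven division ...
try:
    np.split(x, 3)
    split_error = None
except ValueError as exc:
    split_error = type(exc).__name__

# ... while np.array_split returns pieces of different lengths,
# which could not be stacked into one regular 2D array.
uneven = np.array_split(x, 3)
uneven_lengths = [len(p) for p in uneven]

# The pieces are slices of x, so they share its memory rather than copying it.
all_views = all(np.shares_memory(p, x) for p in parts)

# concatenate takes the list back and reverses the split.
restored = np.concatenate(parts)
```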
Actually, you are right: it returns a list.
import numpy as np
a=np.random.randint(1,30,(2,2))
b=np.hsplit(a,2)
type(b)
This reports type(b) as list, so there is nothing wrong in the documentation. I also first thought that the documentation was wrong because it doesn't return an array, but when I checked
type(b[0])
type(b[1])
it returned ndarray for both.
This means it returns a list of ndarrays.

Element-wise addition of 1D and 2D numpy arrays

Situation
I have objects that have attributes which are represented by numpy arrays:
>>> obj = numpy.array([1, 2, 3])
where 1, 2, 3 are the attributes' values.
I'm about to write a few methods that should work equally on both a single object and a group of objects. A group of objects is represented by a 2D numpy array:
>>> group = numpy.array([[11, 21, 31],
... [12, 22, 32],
... [13, 23, 33]])
where the first digit indicates the object and the second digit indicates the attribute. That is 12 is attribute 2 of object 1 and 21 is attribute 1 of object 2.
Why this way and not transposed? Because I want the array indices to correspond to the attributes. That is object_or_group[0] should yield the first attribute either as a single number or as a numpy array, so it can be used for further computations.
Alright, so when I want to compute the dot product for example this works out of the box:
>>> obj = numpy.array([1, 2, 3])
>>> obj.dot(object_or_group)
What doesn't work is element-wise addition.
Input:
>>> group
array([[1, 2, 3],
[4, 5, 6]])
>>> obj
array([10, 20])
The resulting array should be the sum of the first element of group and obj and similar for the second element:
>>> result = numpy.array([group[0] + obj[0],
... group[1] + obj[1]])
>>> result
array([[11, 12, 13],
[24, 25, 26]])
However:
>>> group + obj
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: operands could not be broadcast together with shapes (2,3) (2,)
Which makes sense considering numpy's broadcasting rules.
It seems that there is no numpy function which performs an addition (or equivalently the broadcasting) along a specified axis. While I could use
>>> (group.T + obj).T
array([[11, 12, 13],
[24, 25, 26]])
this feels very cumbersome (and if, instead of a group, I consider a single object this feels wrong indeed). Especially because numpy covered each and every corner case for its usage, I have the feeling that I might have gotten something conceptually wrong here.
To sum it up
Similarly to
>>> obj1
array([1, 2])
>>> obj2
array([10, 20])
>>> obj1 + obj2
array([11, 22])
(which performs an element-wise - or attribute-wise - addition) I want to do the same for groups of objects:
>>> group
array([[1, 2, 3],
[4, 5, 6]])
while the layout of such a 2D group array must be such that the single objects are listed along the 2nd axis (axis=1) in order to be able to request a certain attribute (or many) via normal indexing: obj[0] and group[0] should both yield the first attribute(s).
What you want to do seems to work with this simple code:
>>> m
array([[1, 2, 3],
[4, 5, 6]])
>>> g = np.array([10,20])
>>> m + g[ : , None]
array([[11, 12, 13],
[24, 25, 26]])
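Why this works, as a small sketch: indexing with None adds a trailing axis, so the (2,) array becomes a (2, 1) column that broadcasts across the three columns of the (2, 3) array. The variable names here are only for illustration.

```python
import numpy as np

group = np.array([[1, 2, 3],
                  [4, 5, 6]])
obj = np.array([10, 20])

# obj has shape (2,); adding a trailing axis gives shape (2, 1),
# which broadcasts against group's shape (2, 3) row by row.
column = obj[:, None]
result = group + column

# Same result as the double transpose from the question.
same = (group.T + obj).T
```

obj[:, np.newaxis] and obj.reshape(-1, 1) are equivalent spellings of the same reshaping.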
You appear to be confused about which dimension of the matrix is an object and which is an attribute, as evidenced by the changing object size in your examples. In fact, it is the fact that you are swapping dimensions to match that changing size that is throwing you off. You are also using the unfortunate example of a 3x3 group for your dot product, which is further throwing off your explanation.
In the examples below, objects will be three-element vectors, i.e., they will have three attributes each. The example group will have consistently two rows, meaning two objects in it, and three columns, because objects have three attributes.
The first row of the group, group[0], a.k.a. group[0, :], will be the first object in the group. The first column, group[:, 0] will be the first attribute.
Here are a couple of sample objects and groups to illustrate the points that follow:
>>> obj1 = np.array([1, 2, 3])
>>> obj2 = np.array([4, 5, 6])
>>> group1 = np.array([[7, 8, 9],
...                    [0, 1, 2]])
>>> group2 = np.array([[3, 4, 5]])
Addition will work out of the box because of broadcasting now:
>>> obj1 + obj2
array([5, 7, 9])
>>> group1 + obj1
array([[ 8, 10, 12],
[ 1, 3, 5]])
As you can see, corresponding attributes are getting added just fine. You can even add together groups, but only if they are the same size or if one of them only contains a single object:
>>> group1 + group2
array([[10, 12, 14],
[ 3, 5, 7]])
>>> group1 + group1
array([[14, 16, 18],
[ 0, 2, 4]])
The same will be true for all the binary elementwise operators: *, -, /, np.bitwise_and, etc.
The only remaining question is how to make dot products not care if they are operating on a matrix or a vector. It just so happens that dot products don't care. Your common dimension is always the number of attributes, so the second operand (the multiplier) needs to be transposed so that the number of columns becomes the number of rows. np.dot(x1, x2.T), or equivalently x1.dot(x2.T) will work correctly whether x1 and x2 are groups or objects:
>>> obj1.dot(obj2.T)
32
>>> obj1.dot(group1.T)
array([50, 8])
>>> group1.dot(obj1.T)
array([50, 8])
You can use either np.atleast_1d or np.atleast_2d to always coerce the result into a particular shape so you don't end up with a scalar like the obj1.dot(obj2.T) case. I would recommend the latter, so you always have a consistent number of dimensions regardless of the inputs:
>>> np.atleast_2d(obj1.dot(obj2.T))
array([[32]])
>>> np.atleast_2d(obj1.dot(group1.T))
array([[50, 8]])
Just keep in mind that the dimensions of the dot product will be the number of objects in the first operand by the number of objects in the second operand (everything will be treated as a group). The attributes will get multiplied and summed together. Whether or not that has a valid interpretation for your purposes is entirely for you to decide.
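One way to get exactly that "objects by objects" shape is a slight variation on the above: coerce the inputs to 2D before multiplying, instead of coercing the result afterwards. The helper name pairwise_dot is hypothetical, and this is only a sketch under the row-per-object layout used in this answer.

```python
import numpy as np

def pairwise_dot(x1, x2):
    """Hypothetical helper: dot every object in x1 with every object in x2.

    Objects are rows; a 1D input is treated as a group with one object.
    """
    g1 = np.atleast_2d(x1)   # shape (n1, n_attributes)
    g2 = np.atleast_2d(x2)   # shape (n2, n_attributes)
    return g1 @ g2.T         # shape (n1, n2)

obj1 = np.array([1, 2, 3])
obj2 = np.array([4, 5, 6])
group1 = np.array([[7, 8, 9],
                   [0, 1, 2]])

single = pairwise_dot(obj1, obj2)           # one object against one object
obj_vs_group = pairwise_dot(obj1, group1)   # one object against two
group_vs_obj = pairwise_dot(group1, obj1)   # two objects against one
```

Coercing the inputs keeps the orientation of the result tied to the operand order, which np.atleast_2d applied to a 1D result cannot do.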
UPDATE
The only remaining problem at this point is attribute access. As stated above obj1[0] and group1[0] mean very different things. There are three ways to reconcile this difference, listed in the order that I personally prefer them, with 1 being the most preferable:
1. Use the Ellipsis indexing object to get the last index instead of the first
>>> obj1[..., 0]
array(1)
>>> group1[..., 0]
array([7, 0])
This is the most efficient way since it does not make any copies, just does a normal index on the original arrays. For a single object (1D array) the result is a 0-dimensional array rather than a plain scalar, so it can be used in further computations the same way as the result for a group.
2. Make all your objects 2D. As you pointed out yourself, this can be done with a decorator, and/or using np.atleast_2d. Personally, I would prefer having the convenience of using 1D arrays as single objects without having to wrap them in 2D.
3. Always access attributes via a transpose:
>>> obj1.T[0]
1
>>> group1.T[0]
array([7, 0])
While this is functionally equivalent to #1, it is clunky and unsightly by comparison, in addition to doing something very different under the hood. This approach at the very least creates a new view of the underlying array, and may run the risk of making unnecessary copies in certain cases if the group arrays are not laid out just right. I would not recommend this approach even if it does solve the problem of uniform access.
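The Ellipsis-based access can be wrapped in a tiny accessor so that calling code does not care whether it holds an object or a group. The function name attribute is hypothetical; this is a sketch assuming the row-per-object layout used above.

```python
import numpy as np

def attribute(obj_or_group, i):
    """Hypothetical accessor: attribute i of a single object or of every object in a group."""
    # Indexing the last axis works for 1D objects and 2D groups alike.
    return obj_or_group[..., i]

obj1 = np.array([1, 2, 3])
group1 = np.array([[7, 8, 9],
                   [0, 1, 2]])

first_of_obj = attribute(obj1, 0)      # 0-dimensional array holding 1
first_of_group = attribute(group1, 0)  # one value per object in the group
```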

How do I sum a numpy array of size (m*n,) in groups of m?

Suppose I have a where a.shape is (m*n,), how do I create a new array that comprises the m sums of each group of n elements efficiently?
The best I came up with is:
a.reshape((m, n)).sum(axis=1)
but this creates an extra new array.
I think there is nothing wrong with using reshape and then taking the sum of the rows, I cannot think of anything faster. According to the manual, reshape should (if possible) return a view on the original array, so no large amount of data is copied. When a view is created, numpy only creates a new header with different strides and shape, with a pointer into the data of the original array. This should cost constant time and memory, independent of the array size.
In [23]: x = np.arange(12)
In [24]: y = x.reshape((3, 4))
In [25]: y
Out[25]:
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
In [26]: y.base is x # check if it is a view
Out[26]: True
There is another trick, a variant on cumsum, reduceat. In this case
np.add.reduceat(a, np.arange(0,m*n,n))
For m,n=100,10, it is 2x as fast as a.reshape((m,n)).sum(axis=1).
I haven't used it much, so it took a bit of digging to find in the documentation.
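A quick sketch checking that the two approaches agree, using small made-up sizes. No timing is claimed here; the relative speed may well differ by array size and NumPy version.

```python
import numpy as np

m, n = 3, 4
a = np.arange(m * n)

# Reshape to m rows of n elements and sum each row.
by_reshape = a.reshape(m, n).sum(axis=1)

# reduceat sums each slice a[start:next_start]; the starts are 0, n, 2n, ...
starts = np.arange(0, m * n, n)
by_reduceat = np.add.reduceat(a, starts)

# For a contiguous array the reshape is a view, so no data is copied.
reshape_is_view = np.shares_memory(a.reshape(m, n), a)
```

reduceat also handles groups of unequal length, since only the start offsets are passed in.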
