Create 2D array from where clause on 1D array numpy - python

I have a 1D array containing integer values:
a = np.array([1,2,3,3,2,2,3,2,3])
a
array([1, 2, 3, 3, 2, 2, 3, 2, 3])
I would like to create a 2D array with the first dimension holding the index of the integer value in the 1D array:
idx = [np.where(a == (i+1)) for i in range(a.max())]
But this returns a list (duh):
type(idx)
list
And the first dimension is a series of tuples:
type(idx[0])
tuple
How can I return a 2D numpy array of indices of values from a 1D array using a where clause?
EDIT:
Expected output:
array([[0],[1,4,5,7],[2,3,6,8]])

The closest you can come to a 2D-array would be:
In [147]: np.array(tuple(np.where(a == e)[0] for e in np.unique(a)), dtype=object)
Out[147]:
array([array([ 0, 14, 15, 16]),
array([ 1, 4, 5, 7, 9, 10, 11, 13, 17, 19, 21]),
array([ 2, 3, 6, 8, 12, 18, 20])], dtype=object)
But it is a 1D array of arrays, not a true 2D array, because the rows have different lengths.
Part of your issue is that np.where returns a tuple of arrays, so that it has the same interface no matter how many dimensions your array has. Since yours has only one dimension, you can take element 0 of that tuple.
I would also suggest using np.unique since it is a bit nicer, but it skips values that are not present in a. If keeping an (empty) entry for those values is dead important, then just change back to a range, but use range(a.max() + 1):
In [149]: np.array(tuple(np.where(a == e)[0] for e in range(a.max() + 1)), dtype=object)
Out[149]:
array([array([], dtype=int64), array([ 0, 14, 15, 16]),
array([ 1, 4, 5, 7, 9, 10, 11, 13, 17, 19, 21]),
array([ 2, 3, 6, 8, 12, 18, 20])], dtype=object)
Because indices start at 0 not 1.
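If a loop over the values is undesirable, one possible alternative (not from the answer above) is to sort the positions by value and split them at the group boundaries. This is only a sketch on the array from the question; the names order, groups and index_lists are made up here. The result is still a list of arrays of different lengths, since NumPy has no ragged 2D array.

```python
import numpy as np

a = np.array([1, 2, 3, 3, 2, 2, 3, 2, 3])

# A stable sort keeps equal values in their original (ascending index) order.
order = np.argsort(a, kind="stable")

# One count per distinct value, in sorted value order.
values, counts = np.unique(a, return_counts=True)

# Cut the sorted positions at the end of each group (the last cut is not needed).
groups = np.split(order, np.cumsum(counts)[:-1])

index_lists = [g.tolist() for g in groups]
print(index_lists)
```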


What is the difference between (13027,) and (13027,1) in numpy expand_dim()

These are two outputs in a chunk of code, obtained by calling .shape on a variable b before and after applying np.expand_dims(b, axis=1).
I see that the _dims part may seem like a dead giveaway, but the outputs don't seem to be different, except for, perhaps, turning a row vector into a column vector (?):
b is [208. 193. 208. ... 46. 93. 200.], a row vector, but np.expand_dims(b, axis=1) gives:
[[208.]
[193.]
[208.]
...
[ 46.]
[ 93.]
[200.]]
Which could be interpreted as a column vector (?), as opposed to any increased number of dimensions.
What is the difference between (13027,) and (13027,1)
They are arrays of different dimensions and some operations apply to them differently. For example
>>> a = np.arange(5)
>>> b = np.arange(5, 10)
>>> a + b
array([ 5, 7, 9, 11, 13])
>>> np.expand_dims(a, axis=1) + b
array([[ 5, 6, 7, 8, 9],
[ 6, 7, 8, 9, 10],
[ 7, 8, 9, 10, 11],
[ 8, 9, 10, 11, 12],
[ 9, 10, 11, 12, 13]])
The last result is an example of broadcasting, which you can read about in the numpy docs, or in this SO question.
Basically np.expand_dims adds new axes at the specified positions, and all of the following achieve the same result:
>>> a.shape
(5,)
>>> np.expand_dims(a, axis=(0, 2)).shape
(1, 5, 1)
>>> a[None,:,None].shape
(1, 5, 1)
>>> a[np.newaxis,:,np.newaxis].shape
(1, 5, 1)
Note that in numpy the transpose of a 1D array is still a 1D array. It isn't like in MATLAB where a row vector turns to a column vector.
>>> a
array([0, 1, 2, 3, 4])
>>> a.T
array([0, 1, 2, 3, 4])
>>> a.T.shape
(5,)
So in order to turn it into a "column vector" you have to change the array from shape (N,) to (N, 1) by adding a new axis (or reshaping). But you're better off thinking of the result as a 2D array of N rows with 1 element per row.
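As a rough illustration of the shape difference discussed above, the sketch below builds the (N, 1) version of a short stand-in array in three ways. The values are just the few visible in the question, and the names col_a, col_b and col_c are made up for the example.

```python
import numpy as np

b = np.array([208.0, 193.0, 208.0, 46.0, 93.0, 200.0])  # shape (6,)

col_a = np.expand_dims(b, axis=1)  # shape (6, 1)
col_b = b[:, np.newaxis]           # same result via indexing
col_c = b.reshape(-1, 1)           # same result via reshape

print(b.shape, b.ndim)             # one axis
print(col_a.shape, col_a.ndim)     # two axes
print(b.T.shape)                   # transposing a 1D array changes nothing
print(col_a.T.shape)               # transposing the 2D array gives a single row
```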
(13027,) is a 1D array: it has a single axis (axis 0) of length 13027. (13027,1) is a 2D array: axis 0 has length 13027 and the new axis 1 has length 1.
https://numpy.org/doc/stable/reference/generated/numpy.expand_dims.html
It's like "i" where i = 0 by default so if you don't explicitly define it, it will start at 0.

How to ignore out of bounds values in an index array when using numpy?

I need to assign one array to another using an index array. But some values are out of bounds...
a = np.array([0, 1, 2, 3, 4])
b = np.array([10, 11, 12, 13, 14])
indexes = np.array([0, 2, 3, 5, 6])
a and b are the same size. If I use a[indexes] = b, it would throw an IndexError. I want it to ignore the out of bounds values, 5 and 6, so that a would become [10, 1, 11, 12, 4].
I tried indexes[indexes > b.size] = 0,
but this would overwrite the value at index 0.
How can this be solved?
Edit
The indexes may not necessarily be in order. For example:
indexes = np.array([2, 3, 0, 5, 6])
a should become np.array([12, 1, 10, 11, 4])
You can filter out those invalid indexes, keeping only the matching positions of b:
mask = indexes < len(a)
a[indexes[mask]] = b[mask]
Output:
array([10, 1, 11, 12, 4])
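A minimal sketch of the same filtering idea on the unordered example from the edit. It adds one assumption that is not in the question: negative indexes are ignored too, since NumPy would otherwise count them from the end. The name valid is made up here.

```python
import numpy as np

a = np.array([0, 1, 2, 3, 4])
b = np.array([10, 11, 12, 13, 14])
indexes = np.array([2, 3, 0, 5, 6])

# True where the index points inside a.
valid = (indexes >= 0) & (indexes < a.size)

# Use the same mask on both sides so each value of b stays paired with its index.
a[indexes[valid]] = b[valid]
print(a)
```

Masking both indexes and b with the same boolean array is what keeps b[0] paired with indexes[0], b[1] with indexes[1], and so on.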

Python: general rule for mapping a 2D array onto a larger 2D array

Say you have a 2D numpy array, which you have sliced in order to extract its core, just as if you were cutting out the inner frame from a larger frame.
The larger frame:
In[0]: import numpy
In[1]: a=numpy.array([[0,1,2,3,4],[5,6,7,8,9],[10,11,12,13,14],[15,16,17,18,19]])
In[2]: a
Out[2]:
array([[ 0, 1, 2, 3, 4],
[ 5, 6, 7, 8, 9],
[10, 11, 12, 13, 14],
[15, 16, 17, 18, 19]])
The inner frame:
In[3]: b=a[1:-1,1:-1]
Out[3]:
array([[ 6, 7, 8],
[11, 12, 13]])
My question: if I want to retrieve the position of each value in b in the original array a, is there an approach better than this?
c=numpy.ravel(a) #This will flatten my values in a, so to have a sequential order
d=numpy.ravel(b) #Each element in b will tell me what its corresponding position in a was
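You can build the flat indices directly from the shape of a (m rows, n columns), without flattening either array:
m, n = a.shape
y, x = numpy.ogrid[1:m-1, 1:n-1]
numpy.ravel_multi_index((y, x), (m, n))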
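To check the idea, the sketch below rebuilds the arrays from the question and confirms that the flat indices pick the values of b out of the flattened a. The names flat, rows and cols are made up here; np.unravel_index is used only to show the way back to (row, column) pairs.

```python
import numpy as np

a = np.arange(20).reshape(4, 5)   # the larger frame
b = a[1:-1, 1:-1]                 # the inner frame

m, n = a.shape
# Open grids of the row and column numbers covered by the slice.
y, x = np.ogrid[1:m - 1, 1:n - 1]

# Position of every element of b in the flattened a.
flat = np.ravel_multi_index((y, x), (m, n))
print(flat)

# Going back from flat positions to (row, column) pairs.
rows, cols = np.unravel_index(flat, (m, n))
```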

ValueError: all the input arrays must have same number of dimensions

I'm having a problem with np.append.
I'm trying to duplicate the last column of 20x361 matrix n_list_converted by using the code below:
n_last = []
n_last = n_list_converted[:, -1]
n_lists = np.append(n_list_converted, n_last, axis=1)
But I get error:
ValueError: all the input arrays must have same number of dimensions
However, I've checked the matrix dimensions by doing
print(n_last.shape, type(n_last), n_list_converted.shape, type(n_list_converted))
and I get
(20L,) (20L, 361L)
so the dimensions match? Where is the mistake?
If I start with a 3x4 array, and concatenate a 3x1 array, with axis 1, I get a 3x5 array:
In [911]: x = np.arange(12).reshape(3,4)
In [912]: np.concatenate([x,x[:,-1:]], axis=1)
Out[912]:
array([[ 0, 1, 2, 3, 3],
[ 4, 5, 6, 7, 7],
[ 8, 9, 10, 11, 11]])
In [913]: x.shape,x[:,-1:].shape
Out[913]: ((3, 4), (3, 1))
Note that both inputs to concatenate have 2 dimensions.
Omit the :, and x[:,-1] is (3,) shape - it is 1d, and hence the error:
In [914]: np.concatenate([x,x[:,-1]], axis=1)
...
ValueError: all the input arrays must have same number of dimensions
The code for np.append is (in this case where axis is specified)
return concatenate((arr, values), axis=axis)
So with a slight change of syntax append works. Instead of a list it takes 2 arguments. It imitates the syntax of the list append method, but should not be confused with that method.
In [916]: np.append(x, x[:,-1:], axis=1)
Out[916]:
array([[ 0, 1, 2, 3, 3],
[ 4, 5, 6, 7, 7],
[ 8, 9, 10, 11, 11]])
np.hstack first makes sure all inputs are atleast_1d, and then does concatenate:
return np.concatenate([np.atleast_1d(a) for a in arrs], 1)
So it requires the same x[:,-1:] input. Essentially the same action.
np.column_stack also does a concatenate on axis 1. But first it passes 1d inputs through
array(arr, copy=False, subok=True, ndmin=2).T
This is a general way of turning that (3,) array into a (3,1) array.
In [922]: np.array(x[:,-1], copy=False, subok=True, ndmin=2).T
Out[922]:
array([[ 3],
[ 7],
[11]])
In [923]: np.column_stack([x,x[:,-1]])
Out[923]:
array([[ 0, 1, 2, 3, 3],
[ 4, 5, 6, 7, 7],
[ 8, 9, 10, 11, 11]])
All these 'stacks' can be convenient, but in the long run, it's important to understand dimensions and the base np.concatenate. Also know how to look up the code for functions like this. I use the ipython ?? magic a lot.
And in time tests, np.concatenate is noticeably faster: with a small array like this, the extra layers of function calls make a big difference.
(n,) and (n,1) are not the same shape. Try turning the vector into a column by using the [:, None] notation:
n_lists = np.append(n_list_converted, n_last[:, None], axis=1)
Alternatively, when extracting n_last you can use
n_last = n_list_converted[:, -1:]
to get a (20, 1) array.
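A small sketch of the two slicing forms side by side, using a stand-in 20x361 array of consecutive integers in place of the real n_list_converted (its actual contents are not shown in the question). The names n_last_1d, n_last_2d and failed are made up here.

```python
import numpy as np

# Stand-in data with the same shape as in the question.
n_list_converted = np.arange(20 * 361).reshape(20, 361)

n_last_1d = n_list_converted[:, -1]    # shape (20,): the axis is dropped
n_last_2d = n_list_converted[:, -1:]   # shape (20, 1): the axis is kept

# The 2D column concatenates cleanly along axis 1.
n_lists = np.concatenate([n_list_converted, n_last_2d], axis=1)
print(n_lists.shape)

# The 1D version has a different number of dimensions and is rejected.
try:
    np.concatenate([n_list_converted, n_last_1d], axis=1)
    failed = False
except ValueError:
    failed = True
print(failed)
```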
The reason why you get your error is because a "1 by n" matrix is different from an array of length n.
I recommend using hstack() and vstack() instead.
Like this:
import numpy as np
a = np.arange(32).reshape(4,8) # 4 rows 8 columns matrix.
b = a[:,-1:] # last column of that matrix.
result = np.hstack((a,b)) # stack them horizontally like this:
#array([[ 0, 1, 2, 3, 4, 5, 6, 7, 7],
# [ 8, 9, 10, 11, 12, 13, 14, 15, 15],
# [16, 17, 18, 19, 20, 21, 22, 23, 23],
# [24, 25, 26, 27, 28, 29, 30, 31, 31]])
Notice the repeated "7, 15, 23, 31" column.
Also, notice that I used a[:,-1:] instead of a[:,-1]. My version generates a column:
array([[7],
[15],
[23],
[31]])
Instead of the 1D array array([7,15,23,31])
Edit: append() is much slower. Read this answer.
You can also turn (n,) into (1,n) by enclosing it in brackets [ ]. This adds a leading axis, so it suits appending a row (axis=0).
e.g. Instead of np.append(b,a,axis=0) use np.append(b,[a],axis=0)
a=[1,2]
b=[[5,6],[7,8]]
np.append(b,[a],axis=0)
returns
array([[5, 6],
[7, 8],
[1, 2]])
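To make the row versus column distinction concrete, here is a short sketch with the same small arrays. The names row, col, as_row and as_col are made up here.

```python
import numpy as np

a = np.array([1, 2])
b = np.array([[5, 6], [7, 8]])

row = a[None, :]   # shape (1, 2), same as np.array([a])
col = a[:, None]   # shape (2, 1)

as_row = np.append(b, row, axis=0)   # adds a as a new last row
as_col = np.append(b, col, axis=1)   # adds a as a new last column

print(as_row)
print(as_col)
```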
I normally use np.row_stack((ndarray_1, ndarray_2, ..., ndarray_nth)) for stacking rows. Here you are adding a column, so the column-wise counterpart np.column_stack is the one to use: it accepts the 1D (20,) array directly and treats it as a column.
n_last = n_list_converted[:, -1]
n_lists = np.column_stack((n_list_converted, n_last))

Find cumsum of subarrays split by indices for numpy array efficiently

Given an array 'array' and a set of indices 'indices', how do I find the cumulative sum of the sub-arrays formed by splitting the array along those indices in a vectorized manner?
To clarify, suppose I have:
>>> array = np.arange(20)
>>> array
array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19])
>>> indices = np.array([3, 8, 14])
The operation should output:
array([0, 1, 3, 3, 7, 12, 18, 25, 8, 17, 27, 38, 50, 63, 14, 29, 45, 62, 80, 99])
Please note that the array is very big (100000 elements) and so, I need a vectorized answer. Using any loops would slow it down considerably.
Also, if I had the same problem, but a 2D array and corresponding indices, and I need to do the same thing for each row in the array, how would I do it?
For the 2D version:
>>> array = np.arange(12).reshape((3,4))
>>> array
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
>>> indices = np.array([[2], [1, 3], [1, 2]], dtype=object)
The output would be:
array([[ 0, 1, 3, 3],
[ 4, 9, 6, 13],
[ 8, 17, 10, 11]])
To clarify: Every row will be split.
The idea is to adjust the input at the split positions so that a plain cumulative sum restarts there: at each index, subtract the total accumulated since the previous split, which is found by differencing the ordinary cumulative sum at those positions. Cumulatively summing the adjusted array then gives the indices-stopped output. This might feel a bit contrived at first look, but stick with it and try it on other samples, and hopefully it will make sense! The idea is very similar to the one applied in this other MATLAB solution. Following that philosophy, here's one approach using numpy.diff along with cumulative summation -
# Get linear indices
n = array.shape[1]
lidx = np.hstack([i*n + np.array(item) for i, item in enumerate(indices)])
# Get successive differentiations
diffs = array.cumsum(1).ravel()[lidx] - array.ravel()[lidx]
# Get previous group's offsetted summations for each row at all
# indices positions across the entire 2D array
_,idx = np.unique(lidx//n,return_index=True)  # integer division gives the row number
offsetted_diffs = np.diff(np.append(0,diffs))
offsetted_diffs[idx] = diffs[idx]
# Get a copy of input array and place previous group's offsetted summations
# at indices. Then, do cumulative sum which will create a boundary like
# effect with those offsets at indices positions.
arrayc = array.copy()
arrayc.ravel()[lidx] -= offsetted_diffs
out = arrayc.cumsum(1)
This is an almost vectorized solution: the linear indices are computed in a loop, but since that is not the computationally intensive part, its effect on the total runtime should be minimal. Also, you can work on array directly instead of the copy arrayc if you don't mind overwriting the input to save memory.
Sample input, output -
In [75]: array
Out[75]:
array([[ 0, 1, 2, 3, 4, 5, 6, 7],
[ 8, 9, 10, 11, 12, 13, 14, 15],
[16, 17, 18, 19, 20, 21, 22, 23]])
In [76]: indices
Out[76]: array([[3, 6], [4, 7], [5]], dtype=object)
In [77]: out
Out[77]:
array([[ 0, 1, 3, 3, 7, 12, 6, 13],
[ 8, 17, 27, 38, 12, 25, 39, 15],
[16, 33, 51, 70, 90, 21, 43, 66]])
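The same trick is perhaps easier to follow in 1D. The sketch below applies it to the 1D example from the question, assuming the split indices are sorted, unique and greater than 0. The names before, offsets and adjusted are made up here.

```python
import numpy as np

array = np.arange(20)
indices = np.array([3, 8, 14])

csum = array.cumsum()

# Running total of everything before each split point.
before = csum[indices - 1]

# Total accumulated since the previous split point (or since the start).
offsets = np.diff(np.concatenate(([0], before)))

# Subtract that total at each split point so the running sum restarts there.
adjusted = array.copy()
adjusted[indices] -= offsets

out = adjusted.cumsum()
print(out)
```

Working on a copy keeps the input intact, at the cost of one extra array of the same size.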
You can use np.split to split your array at the indices, then use the Python built-in function map to apply np.cumsum() to each sub-array. Finally, np.hstack joins the results into a single array (in Python 3, map returns an iterator, so wrap it in list first):
>>> np.hstack(list(map(np.cumsum,np.split(array,indices))))
array([ 0, 1, 3, 3, 7, 12, 18, 25, 8, 17, 27, 38, 50, 63, 14, 29, 45,
62, 80, 99])
Note that since map is a built-in function in Python, implemented in C inside the interpreter, it can perform slightly better than a regular loop.
Here is an alternative for 2D arrays:
>>> def func(array,indices):
... return np.hstack(list(map(np.cumsum,np.split(array,indices))))
...
>>>
>>> array
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
>>>
>>> indices
array([[2], [1, 3], [1, 2]], dtype=object)
>>> np.array([func(arr,ind) for arr,ind in zip(array,indices)])
array([[ 0, 1, 2, 5],
[ 4, 5, 11, 7],
[ 8, 9, 10, 21]])
Note that your expected output does not match the way np.split works: it splits before each given index, whereas your output places each split one position later.
If you want those results, you need to add 1 to your indices:
>>> indices = np.array([[3], [2, 4], [2, 3]], dtype=object)
>>>
>>> np.array([func(arr,ind) for arr,ind in zip(array,indices)])
array([[ 0, 1, 3, 3],
[ 4, 9, 6, 13],
[ 8, 17, 10, 11]])
A comment claimed there is no performance difference between a generator expression and the map function, so I ran a benchmark to compare them:
# Use map
~$ python -m timeit --setup "import numpy as np;array = np.arange(20);indices = np.array([3, 8, 14])" "np.hstack(map(np.cumsum,np.split(array,indices)))"
10000 loops, best of 3: 72.1 usec per loop
# Use generator expression
~$ python -m timeit --setup "import numpy as np;array = np.arange(20);indices = np.array([3, 8, 14])" "np.hstack(np.cumsum(a) for a in np.split(array,indices))"
10000 loops, best of 3: 81.2 usec per loop
Note that this doesn't mean that map, which itself runs at C speed, makes the whole code run at C speed. The function passed as the first argument still has to be called from Python for every item of the iterable, and that takes time.
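For reference, the split-and-cumsum idea can be sketched for Python 3 with a list comprehension and zip, using the 2D example from the question. The function name segmented_cumsum is made up here, and the indices are kept as a plain list of lists to avoid building a ragged array.

```python
import numpy as np

def segmented_cumsum(row, split_points):
    """Cumulative sum of a 1D row that restarts at each split point."""
    pieces = np.split(row, split_points)
    return np.concatenate([np.cumsum(p) for p in pieces])

array = np.arange(12).reshape(3, 4)
indices = [[2], [1, 3], [1, 2]]   # one list of split points per row

out = np.array([segmented_cumsum(r, i) for r, i in zip(array, indices)])
print(out)
```

This still loops over the rows in Python, so it is a readable baseline, not a fully vectorized solution.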
