Getting the same order in ragged arrays in Python

I have two ragged arrays, Ii0 and Iv0. I sort each block of Ii0 by its second column (increasing j), which gives Ii01, but I then want Iv01 to reflect the same ordering. In general, the code should be able to handle many different shapes of Ii0 and Iv0. The current and desired outputs are shown below.
import numpy as np
Ii0 = np.array([[[0, 1],
                 [0, 3],
                 [1, 2]],
                [[0, 3],
                 [2, 5],
                 [0, 1]]])
Iv0 = np.array([[[10],
                 [20],
                 [30]],
                [[100],
                 [200],
                 [300]]])
Ii01 = np.array([sorted(i, key = lambda x : x[1]) for i in Ii0])
print("Ii01 =",Ii01)
Iv01 = np.array([sorted(i, key = lambda x : x[0]) for i in Iv0])
print("Iv01 =",Iv01)
The current output is
Ii01 = [array([[[0, 1],
                [1, 2],
                [0, 3]],
               [[0, 1],
                [0, 3],
                [2, 5]]])]
Iv01 = [array([[[ 10],
                [ 20],
                [ 30]],
               [[100],
                [200],
                [300]]])]
The expected output is
Ii01 = [array([[[0, 1],
                [1, 2],
                [0, 3]],
               [[0, 1],
                [0, 3],
                [2, 5]]])]
Iv01 = [array([[[ 10],
                [ 30],
                [ 20]],
               [[300],
                [100],
                [200]]])]

import numpy as np
Ii0 = np.array([[[0, 1],
                 [0, 3],
                 [1, 2]],
                [[0, 3],
                 [2, 5],
                 [0, 1]]])
Iv0 = np.array([[[10],
                 [20],
                 [30]],
                [[100],
                 [200],
                 [300]]])
Ii01 = np.array([Ii[np.argsort(Ii[:, 1])] for Ii in Ii0])
print("Ii01 =", Ii01)
Iv01 = np.array([Iv[np.argsort(Ii[:, 1])] for Ii, Iv in zip(Ii0, Iv0)])
print("Iv01 =", Iv01)
We sort the first array with np.argsort, which returns the indices that would sort it, and then use those indices to rearrange the second array (the second array is not sorted by its own values; its rows are rearranged according to the sort order of the first array).
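If Ii0 and Iv0 are regular NumPy arrays (every block has the same number of rows, as in the example), the Python loop can be avoided entirely with np.take_along_axis. A minimal sketch, assuming the Ii0 and Iv0 defined above:

import numpy as np

# indices that sort each block of Ii0 by its second column
order = np.argsort(Ii0[:, :, 1], axis=1)

# apply the same row order to both arrays
Ii01 = np.take_along_axis(Ii0, order[:, :, None], axis=1)
Iv01 = np.take_along_axis(Iv0, order[:, :, None], axis=1)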

Related

Numpy Arrays: Most computationally efficient way of getting indices of sequences that follow a specific pattern?

Say I have a sequence like this
import numpy as np
array = np.array(
    [0, 1, 0, 1, 2, 2, 0, 0, 1, 1, 2, 0, 0, 1, 2]
)
I want to get the indices for each sequence that follows a pattern: the number 1 marks the beginning of a sequence, and the 2s that immediately follow it make up the rest of that sequence.
This is the most obvious code to get these indices
places = np.arange(array.shape[0])
seqs = []
seqIndices = []
inSequence = False  # not actually used below
for place, indc in zip(array, places):
    if place == 1:
        seqs.append([])
        seqIndices.append([])
        seqs[-1].append(place)
        seqIndices[-1].append(indc)
    if place == 2:
        seqs[-1].append(place)
        seqIndices[-1].append(indc)
seqs
[[1], [1, 2, 2], [1], [1, 2], [1, 2]]
seqIndices
[[1], [3, 4, 5], [8], [9, 10], [13, 14]]
This method has a complexity of O(n), where n is the length of the full array.
A more efficient method is
onesIndices = np.flatnonzero(array == 1)  # positions of the 1s (assumed; not defined in the original snippet)
seqs2 = []
seqIndices2 = []
for indc in onesIndices:
    seqs2.append([])
    seqIndices2.append([])
    seqs2[-1].append(array[indc])
    seqIndices2[-1].append(indc)
    for id2, pla in enumerate(array[indc+1:]):
        if pla != 2:
            break
        seqs2[-1].append(pla)
        seqIndices2[-1].append(places[indc+id2+1])
seqs2
[[1], [1, 2, 2], [1], [1, 2], [1, 2]]
seqIndices2
[[1], [3, 4, 5], [8], [9, 10], [13, 14]]
This method has a complexity proportional to the total length of the matched sequences.
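A more array-oriented variant is to let NumPy find the sequence starts and split the index range there. A sketch under the same assumptions (the names segments and seqIndices3 are illustrative, not from the original):

import numpy as np

array = np.array([0, 1, 0, 1, 2, 2, 0, 0, 1, 1, 2, 0, 0, 1, 2])

starts = np.flatnonzero(array == 1)                      # start of each sequence
segments = np.split(np.arange(array.size), starts)[1:]   # one index segment per start

seqIndices3 = []
for seg in segments:
    tail = array[seg[1:]]                                # values after the leading 1
    stop = np.argmax(tail != 2) if (tail != 2).any() else tail.size
    seqIndices3.append(seg[:stop + 1].tolist())

print(seqIndices3)  # [[1], [3, 4, 5], [8], [9, 10], [13, 14]]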

Find indices of each bin using numpy

I'm encountering a problem that I hope you can help me solve.
I have a 2D numpy array which I want to divide into bins by value. Then I need to know the exact initial indices of all the numbers in each bin.
For example, consider the matrix
[[1,2,3], [4,5,6], [7,8,9]]
and the bin array
[0,2,4,6,8,10].
Then the first element (at index [0,0]) should be stored in one bin, the next two elements ([0,1], [0,2]) in another bin, and so on. The desired output looks like this:
[[[0,0]],[[0,1],[0,2]],[[1,0],[1,1]],[[1,2],[2,0]],[[2,1],[2,2]]]
I've tried several numpy functions, but I'm not able to do this in an elegant way. My best attempt is
>>> import numpy as np
>>> a = [[1,2,3], [4,5,6], [7,8,9]]
>>> bins = [0,2,4,6,8,10]
>>> bin_in_mat = np.digitize(a, bins, right=False)
>>> bin_in_mat
array([[1, 2, 2],
       [3, 3, 4],
       [4, 5, 5]])
>>> indices = np.argwhere(bin_in_mat)
>>> indices
array([[0, 0],
       [0, 1],
       [0, 2],
       [1, 0],
       [1, 1],
       [1, 2],
       [2, 0],
       [2, 1],
       [2, 2]])
but this doesn't solve my problem. Any suggestions?
You need to leave pure NumPy and use a loop for this, since a single array can't represent your ragged result:
bin_in_mat = np.digitize(a, bins, right=False)
bin_contents = [np.argwhere(bin_in_mat == i) for i in range(len(bins))]
>>> for b in bin_contents:
...     print(repr(b))
array([], shape=(0, 2), dtype=int64)
array([[0, 0]], dtype=int64)
array([[0, 1],
       [0, 2]], dtype=int64)
array([[1, 0],
       [1, 1]], dtype=int64)
array([[1, 2],
       [2, 0]], dtype=int64)
array([[2, 1],
       [2, 2]], dtype=int64)
Note that digitize is a bad choice for large integer input (before NumPy 1.15), and it is faster and more robust to use bin_in_mat = np.searchsorted(bins, a, side='right'), where side='right' matches digitize's right=False behaviour.
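A quick sketch (reusing the a and bins from the question) to check that the searchsorted form reproduces digitize and feeds the same per-bin grouping:

import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
bins = np.array([0, 2, 4, 6, 8, 10])

bin_dig = np.digitize(a, bins, right=False)
bin_ss = np.searchsorted(bins, a, side='right')
assert (bin_dig == bin_ss).all()

bin_contents = [np.argwhere(bin_ss == i) for i in range(len(bins))]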

cumulative argmax of a numpy array

Consider the array a
np.random.seed([3,1415])
a = np.random.randint(0, 10, (10, 2))
a
array([[0, 2],
       [7, 3],
       [8, 7],
       [0, 6],
       [8, 6],
       [0, 2],
       [0, 4],
       [9, 7],
       [3, 2],
       [4, 3]])
What is a vectorized way to get the cumulative argmax?
array([[0, 0],  <-- both columns start off with argmax 0
       [1, 1],  <-- 7 > 0 so 1st col = 1, 3 > 2 so 2nd col = 1
       [2, 2],  <-- 8 > 7 so 1st col = 2, 7 > 3 so 2nd col = 2
       [2, 2],  <-- 0 < 8 so 1st col stays the same, 6 < 7 so 2nd col stays the same
       [2, 2],
       [2, 2],
       [2, 2],
       [7, 2],  <-- 9 is the new max of the 1st col, so its argmax is now 7
       [7, 2],
       [7, 2]])
Here is a non-vectorized way to do it.
Notice that as the window expands, argmax applies to the growing window.
pd.DataFrame(a).expanding().apply(np.argmax).astype(int).values
array([[0, 0],
       [1, 1],
       [2, 2],
       [2, 2],
       [2, 2],
       [2, 2],
       [2, 2],
       [7, 2],
       [7, 2],
       [7, 2]])
Here's a vectorized pure NumPy solution that performs pretty snappily:
def cumargmax(a):
    m = np.maximum.accumulate(a)             # running maximum of each column
    x = np.repeat(np.arange(a.shape[0])[:, None], a.shape[1], axis=1)  # row index in every column
    x[1:] *= m[:-1] < m[1:]                  # zero the row index wherever the running max did not increase
    np.maximum.accumulate(x, axis=0, out=x)  # carry forward the last row where a new max appeared
    return x
Then we have:
>>> cumargmax(a)
array([[0, 0],
       [1, 1],
       [2, 2],
       [2, 2],
       [2, 2],
       [2, 2],
       [2, 2],
       [7, 2],
       [7, 2],
       [7, 2]])
Some quick testing on arrays with thousands to millions of values suggests that this is anywhere between 10-50 times faster than looping at the Python level (either implicitly or explicitly).
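A rough timing sketch of that comparison, using the cumargmax defined above (the array size and repetition counts are arbitrary assumptions, and absolute numbers will vary by machine and pandas version):

import numpy as np
import pandas as pd
from timeit import timeit

np.random.seed(0)
a = np.random.randint(0, 10, (5_000, 2))

t_numpy = timeit(lambda: cumargmax(a), number=10) / 10
t_pandas = timeit(
    lambda: pd.DataFrame(a).expanding().apply(np.argmax).astype(int).values,
    number=1,
)
print(f"cumargmax: {t_numpy:.4f} s per run, pandas expanding: {t_pandas:.4f} s per run")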
I can't think of a way to vectorize this over both columns easily; but if the number of columns is small relative to the number of rows, that shouldn't be an issue and a for loop over that axis should suffice:
import numpy as np
import numpy_indexed as npi
a = np.random.randint(0, 10, (10))
max = np.maximum.accumulate(a)
idx = npi.indices(a, max)
print(idx)
I would like to make a function that computes the cumulative argmax for a 1d array and then apply it to all columns. This is the code:
import numpy as np
np.random.seed([3,1415])
a = np.random.randint(0, 10, (10, 2))

def cumargmax(v):
    uargmax = np.frompyfunc(lambda i, j: j if v[j] > v[i] else i, 2, 1)
    return uargmax.accumulate(np.arange(0, len(v)), 0, dtype=np.object).astype(v.dtype)

np.apply_along_axis(cumargmax, 0, a)
The conversion to np.object and back is a workaround for NumPy 1.9, as mentioned in "generalized cumulative functions in NumPy/SciPy?".
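On recent NumPy releases the np.object alias has been removed; assuming nothing else changes, the accumulate line can use the builtin object dtype instead:

    return uargmax.accumulate(np.arange(0, len(v)), 0, dtype=object).astype(v.dtype)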

Updating a NumPy array with another

Seemingly simple question: I have an array with two columns, the first represents an ID and the second a count. I'd like to update it with another, similar array such that
import numpy as np
a = np.array([[1, 2],
              [2, 2],
              [3, 1],
              [4, 5]])
b = np.array([[2, 2],
              [3, 1],
              [4, 0],
              [5, 3]])
a.update(b) # ????
>>> np.array([[1, 2],
              [2, 4],
              [3, 2],
              [4, 5],
              [5, 3]])
Is there a way to do this with indexing/slicing such that I don't simply have to iterate over each row?
Generic case
Approach #1: You can use np.add.at to do such an ID-based adding operation like so -
# First column of output array as the union of first columns of a,b
out_id = np.union1d(a[:,0],b[:,0])
# Initialize second column of output array
out_count = np.zeros_like(out_id)
# Find indices where the first columns of a,b are placed in out_id
_,a_idx = np.where(a[:,None,0]==out_id)
_,b_idx = np.where(b[:,None,0]==out_id)
# Place second column of a into out_id & add in second column of b
out_count[a_idx] = a[:,1]
np.add.at(out_count, b_idx,b[:,1])
# Stack the ID and count arrays into a 2-column format
out = np.column_stack((out_id,out_count))
To find a_idx and b_idx, as probably a faster alternative, np.searchsorted could be used like so -
a_idx = np.searchsorted(out_id, a[:,0], side='left')
b_idx = np.searchsorted(out_id, b[:,0], side='left')
Sample input-output:
In [538]: a
Out[538]:
array([[1, 2],
       [4, 2],
       [3, 1],
       [5, 5]])
In [539]: b
Out[539]:
array([[3, 7],
       [1, 1],
       [4, 0],
       [2, 3],
       [6, 2]])
In [540]: out
Out[540]:
array([[1, 3],
       [2, 3],
       [3, 8],
       [4, 2],
       [5, 5],
       [6, 2]])
Approach #2: You can use np.bincount to do the same ID based adding -
# First column of output array as the union of first columns of a,b
out_id = np.union1d(a[:,0],b[:,0])
# Get all IDs and counts in a single arrays
id_arr = np.concatenate((a[:,0],b[:,0]))
count_arr = np.concatenate((a[:,1],b[:,1]))
# Get binned summations
summed_vals = np.bincount(id_arr,count_arr)
# Get mask of valid bins
mask = np.in1d(np.arange(np.max(out_id)+1),out_id)
# Mask valid summed bins for final counts array output
out_count = summed_vals[mask]
# Stack the ID and count arrays into a 2-column format
out = np.column_stack((out_id,out_count))
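A minimal end-to-end sketch of Approach #2, run on the a and b from the question (it assumes the IDs are small non-negative integers, which np.bincount requires):

import numpy as np

a = np.array([[1, 2], [2, 2], [3, 1], [4, 5]])
b = np.array([[2, 2], [3, 1], [4, 0], [5, 3]])

out_id = np.union1d(a[:, 0], b[:, 0])
id_arr = np.concatenate((a[:, 0], b[:, 0]))
count_arr = np.concatenate((a[:, 1], b[:, 1]))
summed_vals = np.bincount(id_arr, count_arr)
mask = np.in1d(np.arange(np.max(out_id) + 1), out_id)
out = np.column_stack((out_id, summed_vals[mask].astype(int)))
print(out)
# [[1 2]
#  [2 4]
#  [3 2]
#  [4 5]
#  [5 3]]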
Specific case
If the ID columns in a and b are sorted, it becomes easier, as we can just use masks with np.in1d to index into the output ID array created with np.union1d, like so -
# First column of output array as the union of first columns of a,b
out_id = np.union1d(a[:,0],b[:,0])
# Masks of first columns of a and b matches in the output ID array
mask1 = np.in1d(out_id,a[:,0])
mask2 = np.in1d(out_id,b[:,0])
# Initialize second column of output array
out_count = np.zeros_like(out_id)
# Place second column of a into out_id & add in second column of b
out_count[mask1] = a[:,1]
np.add.at(out_count, np.where(mask2)[0],b[:,1])
# Stack the ID and count arrays into a 2-column format
out = np.column_stack((out_id,out_count))
Sample run -
In [552]: a
Out[552]:
array([[1, 2],
       [2, 2],
       [3, 1],
       [4, 5],
       [8, 5]])
In [553]: b
Out[553]:
array([[2, 2],
       [3, 1],
       [4, 0],
       [5, 3],
       [6, 2],
       [8, 2]])
In [554]: out
Out[554]:
array([[1, 2],
       [2, 4],
       [3, 2],
       [4, 5],
       [5, 3],
       [6, 2],
       [8, 7]])
>>> col = np.unique(np.hstack((b[:,0], a[:,0])))
>>> dif = np.setdiff1d(col, a[:,0])
>>> val = b[np.in1d(b[:,0], dif)]
>>> result = np.concatenate((a, val))
>>> result
array([[1, 2],
       [2, 2],
       [3, 1],
       [4, 5],
       [5, 3]])
Note that if you want the result to be sorted, you can use np.lexsort:
result[np.lexsort((result[:,0],result[:,0]))]
Explanation:
First, find the unique IDs with the following command:
>>> col=np.unique(np.hstack((b[:,0],a[:,0])))
>>> col
array([1, 2, 3, 4, 5])
Then find the difference between the IDs of a and all of the IDs:
>>> dif=np.setdiff1d(col,a[:,0])
>>> dif
array([5])
Then find the rows of b whose IDs are in dif:
>>> val=b[np.in1d(b[:,0],dif)]
>>> val
array([[5, 3]])
And finally, concatenate those rows with a:
>>> np.concatenate((a,val))
Consider another example, with sorting:
>>> a = np.array([[1, 2],
... [2, 2],
... [3, 1],
... [7, 5]])
>>>
>>> b = np.array([[2, 2],
... [3, 1],
... [4, 0],
... [5, 3]])
>>>
>>> col=np.unique(np.hstack((b[:,0],a[:,0])))
>>> dif=np.setdiff1d(col,a[:,0])
>>> val=b[np.in1d(b[:,0],dif)]
>>> result=np.concatenate((a,val))
>>> result[np.lexsort((result[:,0],result[:,0]))]
array([[1, 2],
       [2, 2],
       [3, 1],
       [4, 0],
       [5, 3],
       [7, 5]])
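Since only a single key is used here, a plain argsort on the ID column gives the same ordering; a small sketch using the result from above:

>>> result[np.argsort(result[:, 0])]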
That's an old question, but here is a solution with pandas (which could be generalized to aggregation functions other than sum). Sorting also happens automatically:
import pandas as pd
import numpy as np
a = np.array([[1, 2],
              [2, 2],
              [3, 1],
              [4, 5]])
b = np.array([[2, 2],
              [3, 1],
              [4, 0],
              [5, 3]])
print((pd.DataFrame(a[:, 1], index=a[:, 0])
       .add(pd.DataFrame(b[:, 1], index=b[:, 0]), fill_value=0)
       .astype(int))
      .reset_index()
      .to_numpy())
Output:
[[1 2]
 [2 4]
 [3 2]
 [4 5]
 [5 3]]
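A possible alternative sketch of the same merge via groupby, which makes it easy to swap sum for another aggregation such as max (the column names "id" and "count" are just illustrative):

import numpy as np
import pandas as pd

a = np.array([[1, 2], [2, 2], [3, 1], [4, 5]])
b = np.array([[2, 2], [3, 1], [4, 0], [5, 3]])

df = pd.DataFrame(np.vstack((a, b)), columns=["id", "count"])
out = df.groupby("id", as_index=False)["count"].sum().to_numpy()
print(out)  # [[1 2], [2 4], [3 2], [4 5], [5 3]]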

Getting positions of specific values of a 2D NumPy array with mask

I need some help detecting all the values (coordinates) of a 2D array that satisfy a specific condition.
I have already asked a similar question, but now I have masked specific values which don't interest me...
Last time, someone suggested using zip(*np.where(test2D < 5000.))
For example:
import numpy as np
test2D = np.array([[  3051.11,   2984.85,   3059.17],
                   [  3510.78,   3442.43,   3520.7 ],
                   [  4045.91,   3975.03,   4058.15],
                   [  4646.37,   4575.01,   4662.29],
                   [  5322.75,   5249.33,   5342.1 ],
                   [  6102.73,   6025.72,   6127.86],
                   [  6985.96,   6906.81,   7018.22],
                   [  7979.81,   7901.04,   8021.  ],
                   [  9107.18,   9021.98,   9156.44],
                   [ 10364.26,  10277.02,  10423.1 ],
                   [ 11776.65,  11682.76,  11843.18]])
So I can get all the positions whose values are < 5000:
positions=zip(*np.where(test2D < 5000.))
Now I want to reject some values which are useless to me (rejectedvalues is an array of coordinates):
rejectedvalues = np.array([[0, 0], [2, 2], [3, 1], [10, 2]])
i, j = rejectedvalues.T
mask = np.zeros(test2D.shape, bool)
mask[i,j] = True
m = np.ma.array(test2D, mask=mask)
positions2=zip(*np.where(m < 5000.))
But positions2 gives me the same as positions...
np.ma.where respects the mask -- it does not return indices where the condition (e.g. m < 5000.) holds but the entry is masked.
In [58]: np.asarray(np.column_stack(np.ma.where(m < 5000.)))
Out[58]:
array([[0, 1],
       [0, 2],
       [1, 0],
       [1, 1],
       [1, 2],
       [2, 0],
       [2, 1],
       [3, 0],
       [3, 2]])
Compare that with the analogous expression using np.where:
In [57]: np.asarray(np.column_stack(np.where(m < 5000.)))
Out[57]:
array([[0, 0],
       [0, 1],
       [0, 2],
       [1, 0],
       [1, 1],
       [1, 2],
       [2, 0],
       [2, 1],
       [2, 2],
       [3, 0],
       [3, 1],
       [3, 2]])
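A possible alternative sketch that avoids masked arrays altogether: combine the condition with the negated boolean mask built in the question and collect the coordinates with np.argwhere:

positions2 = np.argwhere((test2D < 5000.) & ~mask)

Each row of positions2 is a (row, column) pair for an unmasked entry below 5000, i.e. the same pairs as the np.ma.where result above.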
