Seemingly simple question: I have an array with two columns, the first represents an ID and the second a count. I'd like to update it with another, similar array such that
import numpy as np
a = np.array([[1, 2],
[2, 2],
[3, 1],
[4, 5]])
b = np.array([[2, 2],
[3, 1],
[4, 0],
[5, 3]])
a.update(b) # ????
>>> np.array([[1, 2],
[2, 4],
[3, 2],
[4, 5],
[5, 3]])
Is there a way to do this with indexing/slicing such that I don't simply have to iterate over each row?
Generic case
Approach #1: You can use np.add.at to do such an ID-based adding operation like so -
# First column of output array as the union of first columns of a,b
out_id = np.union1d(a[:,0],b[:,0])
# Initialize second column of output array
out_count = np.zeros_like(out_id)
# Find indices where the first columns of a,b are placed in out_id
_,a_idx = np.where(a[:,None,0]==out_id)
_,b_idx = np.where(b[:,None,0]==out_id)
# Place second column of a into out_id & add in second column of b
out_count[a_idx] = a[:,1]
np.add.at(out_count, b_idx,b[:,1])
# Stack the ID and count arrays into a 2-column format
out = np.column_stack((out_id,out_count))
To find a_idx and b_idx, as probably a faster alternative, np.searchsorted could be used like so -
a_idx = np.searchsorted(out_id, a[:,0], side='left')
b_idx = np.searchsorted(out_id, b[:,0], side='left')
Sample input-output :
In [538]: a
Out[538]:
array([[1, 2],
[4, 2],
[3, 1],
[5, 5]])
In [539]: b
Out[539]:
array([[3, 7],
[1, 1],
[4, 0],
[2, 3],
[6, 2]])
In [540]: out
Out[540]:
array([[1, 3],
[2, 3],
[3, 8],
[4, 2],
[5, 5],
[6, 2]])
Approach #2: You can use np.bincount to do the same ID based adding -
# First column of output array as the union of first columns of a,b
out_id = np.union1d(a[:,0],b[:,0])
# Get all IDs and counts in a single arrays
id_arr = np.concatenate((a[:,0],b[:,0]))
count_arr = np.concatenate((a[:,1],b[:,1]))
# Get binned summations
summed_vals = np.bincount(id_arr,count_arr)
# Get mask of valid bins
mask = np.in1d(np.arange(np.max(out_id)+1),out_id)
# Mask valid summed bins for final counts array output
out_count = summed_vals[mask]
# Stack the ID and count arrays into a 2-column format
out = np.column_stack((out_id,out_count))
Specific case
If the ID columns in a and b are sorted, it becomes easier, as we can just use masks with np.in1d to index into the output ID array created with np.union like so -
# First column of output array as the union of first columns of a,b
out_id = np.union1d(a[:,0],b[:,0])
# Masks of first columns of a and b matches in the output ID array
mask1 = np.in1d(out_id,a[:,0])
mask2 = np.in1d(out_id,b[:,0])
# Initialize second column of output array
out_count = np.zeros_like(out_id)
# Place second column of a into out_id & add in second column of b
out_count[mask1] = a[:,1]
np.add.at(out_count, np.where(mask2)[0],b[:,1])
# Stack the ID and count arrays into a 2-column format
out = np.column_stack((out_id,out_count))
Sample run -
In [552]: a
Out[552]:
array([[1, 2],
[2, 2],
[3, 1],
[4, 5],
[8, 5]])
In [553]: b
Out[553]:
array([[2, 2],
[3, 1],
[4, 0],
[5, 3],
[6, 2],
[8, 2]])
In [554]: out
Out[554]:
array([[1, 2],
[2, 4],
[3, 2],
[4, 5],
[5, 3],
[6, 2],
[8, 7]])
>>> col=np.unique(np.hstack((b[:,0],a[:,0])))
>>> dif=np.setdiff1d(col,a[:,0])
>>> val=b[np.in1d(b[:,0],dif)]
>>> result=np.concatenate((a,val))
array([[1, 2],
[2, 2],
[3, 1],
[4, 5],
[5, 3]])
Note that if you want the result become sorted you can use np.lexsort :
result[np.lexsort((result[:,0],result[:,0]))]
Explanation :
First you can find the unique ids with following command :
>>> col=np.unique(np.hstack((b[:,0],a[:,0])))
>>> col
array([1, 2, 3, 4, 5])
Then find the different between the ids if a and all of ids :
>>> dif=np.setdiff1d(col,a[:,0])
>>> dif
array([5])
Then find the items within b with the ids in diff :
>>> val=b[np.in1d(b[:,0],dif)]
>>> val
array([[5, 3]])
And at last concatenate the result with list a:
>>> np.concatenate((a,val))
consider another example with sorting :
>>> a = np.array([[1, 2],
... [2, 2],
... [3, 1],
... [7, 5]])
>>>
>>> b = np.array([[2, 2],
... [3, 1],
... [4, 0],
... [5, 3]])
>>>
>>> col=np.unique(np.hstack((b[:,0],a[:,0])))
>>> dif=np.setdiff1d(col,a[:,0])
>>> val=b[np.in1d(b[:,0],dif)]
>>> result=np.concatenate((a,val))
>>> result[np.lexsort((result[:,0],result[:,0]))]
array([[1, 2],
[2, 2],
[3, 1],
[4, 0],
[5, 3],
[7, 5]])
That's an old question but here is a solution with pandas (that could be generalized for other aggregation functions than sum). Also sorting will occur automatically:
import pandas as pd
import numpy as np
a = np.array([[1, 2],
[2, 2],
[3, 1],
[4, 5]])
b = np.array([[2, 2],
[3, 1],
[4, 0],
[5, 3]])
print((pd.DataFrame(a[:, 1], index=a[:, 0])
.add(pd.DataFrame(b[:, 1], index=b[:, 0]), fill_value=0)
.astype(int))
.reset_index()
.to_numpy())
Output:
[[1 2]
[2 4]
[3 2]
[4 5]
[5 3]]
Related
Is there anyway to add two numpy arrays of different length in a Descartian fashion without iterating over columns a? See example below.
a = np.array([[1, 2], [3, 4]])
b = np.array([[1, 1], [2, 2], [3, 3]])
c = dec_sum(a, b) # c = np.array([[[2, 3], [3, 4], [3, 5]], [[4, 4], [5, 6], [6, 7]]])
Given a 2x2 numpy array a and 3x2 numpy array b, c= dec_sum(a, b) and c is 2x3x2.
How can the rows in an array be sorted without that the values in each row will changed?
Furthermore: how to get the indicies of this sort-process?
input:
a = np.array([[4,3],[0,3],[3,0],[1,3],[1,2],[2,0]])
required sorting arrray:
b = np.array([1,4,3,5,2,0])
a = a[b]
output:
a = np.array([[0,3],[1,2],[1,3][2,0],[3,0],[4,3]])
How do I get the array b ?
You need lexsort here:
b = np.lexsort((a[:, 1], a[:, 0]))
# array([1, 4, 3, 5, 2, 0], dtype=int64)
And applied to your initial array:
>>> a[b]
array([[0, 3],
[1, 2],
[1, 3],
[2, 0],
[3, 0],
[4, 3]])
As #miradulo pointed out, you may also use:
b = np.lexsort(np.fliplr(a).T)
Which is less verbose than explicitly stating the columns to sort on.
I am learning Python and solving a machine learning problem.
class_ids=np.arange(self.x.shape[0])
np.random.shuffle(class_ids)
self.x=self.x[class_ids]
This is a shuffle function in NumPy but I can't understand what self.x=self.x[class_ids] means. because I think it gives the value of the array to a variable.
It's a very complicated way to shuffle the first dimension of your self.x. For example:
>>> x = np.array([[1, 1], [2, 2], [3, 3], [4, 4], [5, 5]])
>>> x
array([[1, 1],
[2, 2],
[3, 3],
[4, 4],
[5, 5]])
Then using the mentioned approach
>>> class_ids=np.arange(x.shape[0]) # create an array [0, 1, 2, 3, 4]
>>> np.random.shuffle(class_ids) # shuffle the array
>>> x[class_ids] # use integer array indexing to shuffle x
array([[5, 5],
[3, 3],
[1, 1],
[4, 4],
[2, 2]])
Note that the same could be achieved just by using np.random.shuffle because the docstring explicitly mentions:
This function only shuffles the array along the first axis of a multi-dimensional array. The order of sub-arrays is changed but their contents remains the same.
>>> np.random.shuffle(x)
>>> x
array([[5, 5],
[3, 3],
[1, 1],
[2, 2],
[4, 4]])
or by using np.random.permutation:
>>> class_ids = np.random.permutation(x.shape[0]) # shuffle the first dimensions indices
>>> x[class_ids]
array([[2, 2],
[4, 4],
[3, 3],
[5, 5],
[1, 1]])
Assuming self.x is a numpy array:
class_ids is a 1-d numpy array that is being used as an integer array index in the expression: x[class_ids]. Because the previous line shuffled class_ids, x[class_ids] evaluates to self.x shuffled by rows.
The assignment self.x=self.x[class_ids] assigns the shuffled array to self.x
Consider the array a
np.random.seed([3,1415])
a = np.random.randint(0, 10, (10, 2))
a
array([[0, 2],
[7, 3],
[8, 7],
[0, 6],
[8, 6],
[0, 2],
[0, 4],
[9, 7],
[3, 2],
[4, 3]])
What is a vectorized way to get the cumulative argmax?
array([[0, 0], <-- both start off as max position
[1, 1], <-- 7 > 0 so 1st col = 1, 3 > 2 2nd col = 1
[2, 2], <-- 8 > 7 1st col = 2, 7 > 3 2nd col = 2
[2, 2], <-- 0 < 8 1st col stays the same, 6 < 7 2nd col stays the same
[2, 2],
[2, 2],
[2, 2],
[7, 2], <-- 9 is new max of 2nd col, argmax is now 7
[7, 2],
[7, 2]])
Here is a non-vectorized way to do it.
Notice that as the window expands, argmax applies to the growing window.
pd.DataFrame(a).expanding().apply(np.argmax).astype(int).values
array([[0, 0],
[1, 1],
[2, 2],
[2, 2],
[2, 2],
[2, 2],
[2, 2],
[7, 2],
[7, 2],
[7, 2]])
Here's a vectorized pure NumPy solution that performs pretty snappily:
def cumargmax(a):
m = np.maximum.accumulate(a)
x = np.repeat(np.arange(a.shape[0])[:, None], a.shape[1], axis=1)
x[1:] *= m[:-1] < m[1:]
np.maximum.accumulate(x, axis=0, out=x)
return x
Then we have:
>>> cumargmax(a)
array([[0, 0],
[1, 1],
[2, 2],
[2, 2],
[2, 2],
[2, 2],
[2, 2],
[7, 2],
[7, 2],
[7, 2]])
Some quick testing on arrays with thousands to millions of values suggests that this is anywhere between 10-50 times faster than looping at the Python level (either implicitly or explicitly).
I cant think of a way to vectorize this over both columns easily; but if the number of columns is small relative to the number of rows, that shouldn't be an issue and a for loop should suffice for that axis:
import numpy as np
import numpy_indexed as npi
a = np.random.randint(0, 10, (10))
max = np.maximum.accumulate(a)
idx = npi.indices(a, max)
print(idx)
I would like to make a function that computes cumulative argmax for 1d array and then apply it to all columns. This is the code:
import numpy as np
np.random.seed([3,1415])
a = np.random.randint(0, 10, (10, 2))
def cumargmax(v):
uargmax = np.frompyfunc(lambda i, j: j if v[j] > v[i] else i, 2, 1)
return uargmax.accumulate(np.arange(0, len(v)), 0, dtype=np.object).astype(v.dtype)
np.apply_along_axis(cumargmax, 0, a)
The reason for converting to np.object and then converting back is a workaround for Numpy 1.9, as mentioned in generalized cumulative functions in NumPy/SciPy?
I have three rather large NumPy arrays with varying numbers of rows, whose first columns are all integers. My hope is to filter these arrays such that the only rows left are those for whom the value in the first column is shared by all three. This would leave three arrays of the same size. The entries in the other columns are not necessarily shared across arrays.
So, with input:
A =
[[1, 1],
[2, 2],
[3, 3],]
B =
[[2, 1],
[3, 2],
[4, 3],
[5, 4]]
C =
[[2, 2],
[3, 1]
[5, 2]]
I hope to get back as output:
A =
[[2, 2],
[3, 3]]
B =
[[2, 1],
[3, 2]]
C =
[[2, 2],
[3, 1]]
My current approach is to:
Find the intersection of the three first columns using numpy.intersect1d()
Use numpy.in1d() on this intersection and the first columns of each array to find the row indices that are not shared in each array (converting boolean to index using a modified version of the method found here: Python: intersection indices numpy array )
Finally using numpy.delete() with each of these indices and its respective array to remove rows with non-shared entries in the first column.
I'm wondering if there might be a faster or more elegantly Pythonic way to go about this however, something that is suited to very large arrays.
Your indices in your example are sorted and unique. Assuming this is no coincidence (and this situation often arises, or can easily be enforced), the following works:
import numpy as np
A = np.array(
[[1, 1],
[2, 2],
[3, 3],])
B = np.array(
[[2, 1],
[3, 2],
[4, 3],
[5, 4]])
C = np.array(
[[2, 2],
[3, 1],
[5, 2],])
I = reduce(
lambda l,r: np.intersect1d(l,r,True),
(i[:,0] for i in (A,B,C)))
print A[np.searchsorted(A[:,0], I)]
print B[np.searchsorted(B[:,0], I)]
print C[np.searchsorted(C[:,0], I)]
and in case the first column is not in sorted order (but is still unique):
C = np.array(
[[9, 2],
[1,6],
[5, 1],
[2, 5],
[3, 2],])
def index_by_first_column_entry(M, keys):
colkeys = M[:,0]
sorter = np.argsort(colkeys)
index = np.searchsorted(colkeys, keys, sorter = sorter)
return M[sorter[index]]
print index_by_first_column_entry(C, I)
and make sure to change the true to false in
I = reduce(
lambda l,r: np.intersect1d(l,r,False),
(i[:,0] for i in (A,B,C)))
generalization to duplicate values can be made using np.unique
One way to do this is to build an indicator array, or a hash table if you like, to indicate which integers are in all your input arrays. Then you can use boolean indexing based on this indicator array to get the subarrays. Something like this:
import numpy as np
# Setup
A = np.array(
[[1, 1],
[2, 2],
[3, 3],])
B = np.array(
[[2, 1],
[3, 2],
[4, 3],
[5, 4]])
C = np.array(
[[2, 2],
[3, 1],
[5, 2],])
def take_overlap(*input):
n = len(input)
maxIndex = max(array[:, 0].max() for array in input)
indicator = np.zeros(maxIndex + 1, dtype=int)
for array in input:
indicator[array[:, 0]] += 1
indicator = indicator == n
result = []
for array in input:
# Look up each integer in the indicator array
mask = indicator[array[:, 0]]
# Use boolean indexing to get the sub array
result.append(array[mask])
return result
subA, subB, subC = take_overlap(A, B, C)
This should be quite fast and this method does not assume the elements of the input arrays are unique or sorted. However this method could take a lot of memory, and might e a bit slower, if the indexing integers are sparse, ie [1, 10, 10000], but should be close to optimal if the integers are more or less dense.
This works but I'm not sure if it is faster than any of the other answers:
import numpy as np
A = np.array(
[[1, 1],
[2, 2],
[3, 3],])
B = np.array(
[[2, 1],
[3, 2],
[4, 3],
[5, 4]])
C = np.array(
[[2, 2],
[3, 1],
[5, 2],])
a = A[:,0]
b = B[:,0]
c = C[:,0]
ab = np.where(a[:, np.newaxis] == b[np.newaxis, :])
bc = np.where(b[:, np.newaxis] == c[np.newaxis, :])
ab_in_bc = np.in1d(ab[1], bc[0])
bc_in_ab = np.in1d(bc[0], ab[1])
arows = ab[0][ab_in_bc]
brows = ab[1][ab_in_bc]
crows = bc[1][bc_in_ab]
anew = A[arows, :]
bnew = B[brows, :]
cnew = C[crows, :]
print(anew)
print(bnew)
print(cnew)
gives:
[[2 2]
[3 3]]
[[2 1]
[3 2]]
[[2 2]
[3 1]]