How to replicate a row of an array with numpy? - python

I want to replicate the last row of an array in python and found the following lines of code in the numpy documentation
>>> x = np.array([[1,2],[3,4]])
>>> np.repeat(x, [1, 2], axis=0)
In the above code, what does the second parameter "[1,2]" in np.repeat do?
If I want to replicate a row in a 3*3 array, how will this second parameter change?

It's the repeats parameter
repeats : int or array of ints
The number of repetitions for each element. repeats is broadcasted to fit the shape of the given axis.
It's the number of times you want to repeat a row or column based on the parameter axis.
x = np.array([[1,2],[3,4],[4,5]])
np.repeat(x, repeats = [1, 2, 1 ], axis=0)
This would lead to repetition of row 1 once, row 2 twice and row 3 once.
array([[1, 2],
[3, 4],
[3, 4],
[4, 5]])
Similarly, if you specify axis=1, repeats takes one entry per column (2 here), and the code below repeats column 1 once and column 2 twice.
x = np.array([[1,2],[3,4],[4,5]])
np.repeat(x, repeats = [1, 2 ], axis=1)
array([[1, 2, 2],
[3, 4, 4],
[4, 5, 5]])
If you want to repeat only the last row, repeat just that row and stack it onto the array, i.e.
rep = 2
last = np.repeat([x[-1]],repeats= rep-1 ,axis=0)
np.vstack([x, last])
array([[1, 2],
[3, 4],
[4, 5],
[4, 5]])
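For the 3*3 case asked about in the question, a minimal sketch follows. The array y and the count extra are made-up names for illustration. The repeats list needs one entry per row, so replicating only the last row means a count of 1 everywhere except the final entry.

```python
import numpy as np

y = np.arange(1, 10).reshape(3, 3)   # hypothetical 3x3 example array
extra = 1                            # hypothetical: number of extra copies of the last row

# One repeat count per row: rows 0 and 1 appear once,
# the last row appears 1 + extra times.
repeats = np.ones(y.shape[0], dtype=int)
repeats[-1] += extra

replicated = np.repeat(y, repeats, axis=0)
print(replicated)
```

With extra = 1 the repeats array is [1, 1, 2], which is the 3-row analogue of the [1, 2] in the documentation example.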

I have tested it using the following code
>>> a
array([[1, 2],
[3, 4]])
>>> np.repeat(a, [2,3], axis = 0)
array([[1, 2],
[1, 2],
[3, 4],
[3, 4],
[3, 4]])
>>> np.repeat(a, [1,3], axis = 0)
array([[1, 2],
[3, 4],
[3, 4],
[3, 4]])
The second parameter seems to mean how many times the i-th element of a will be repeated. As the code above shows, [2,3] repeats a[0] 2 times and a[1] 3 times, while [1,3] repeats a[0] once and a[1] 3 times.
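A small sketch of the other two cases of the repeats parameter, as a sanity check rather than a full specification: a single int is applied to every row, and a list whose length does not match the axis is rejected. The variable names are illustrative only.

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])

# A single int applies the same count to every row.
twice = np.repeat(a, 2, axis=0)

# A list gives one count per row; the result has sum(counts) rows.
counts = [2, 3]
per_row = np.repeat(a, counts, axis=0)

# A list whose length does not match the axis cannot be broadcast.
try:
    np.repeat(a, [1, 2, 3], axis=0)
    mismatch_raised = False
except ValueError:
    mismatch_raised = True

print(twice.shape, per_row.shape, mismatch_raised)
```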

Related

Can't index multiple elements in list of lists in Python (using the : operator)

I found it strange that indexing with the slice operator (:) does not seem to be supported for a list of lists.
Sometimes this results in unexpected values:
a = [[1, 2], [3, 4], [5, 6], [7, 8]]
>>> a
[[1, 2], [3, 4], [5, 6], [7, 8]]
>>> a[0][1]
2
>>> a[1][1]
4
>>> a[2][1]
6
However,
>>> a[0:3][1]
[3, 4]
I was expecting [2,4,6]. What am I missing here ?
I tried this on NumPy arrays as well.
>>> a
[[1, 2], [3, 4], [5, 6], [7, 8]]
>>> a[0][1]
2
>>> a[1][1]
4
>>> a[2][1]
6
>>> a[0:3][1]
[3, 4]
I know I can use list comprehension, but my question is whether ":" is supported for list of lists?
numpy arrays do support slicing, but you're not considering the shape of the array. In numpy, this array has shape:
import numpy as np
a = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
print(a.shape)
>>>(4, 2)
meaning it's 4x2. If you slice [0:3] you're returning the first three elements of the 1st dimension. i.e.:
print(a[0:3])
>>>[[1 2]
[3 4]
[5 6]]
this output has shape:
print(a[0:3].shape)
>>>(3, 2)
if you do:
print(a[0:3][1])
>>>[3 4]
You are again indexing the first dimension, this time of the sliced array with shape (3, 2), and picking the element at index 1 (its second row).
Instead you want to call:
print(a[0:3][:,1])
>>>[2 4 6]
Which gives you all of the rows (i.e. all three elements of the first dimension) at column index 1 (0 and 1 being the valid indexes along the second dimension).
even cleaner (recommended):
print(a[0:3, 1])
>>>[2 4 6]
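To put the three spellings side by side, here is a short sketch. The names rows, arr, chained, combined and from_list are illustrative and not part of the question.

```python
import numpy as np

rows = [[1, 2], [3, 4], [5, 6], [7, 8]]
arr = np.array(rows)

# Chained indexing: slice the rows first, then pick index 1 OF THAT SLICE.
chained = arr[0:3][1]

# Tuple indexing: rows 0..2 and column 1 in a single indexing step.
combined = arr[0:3, 1]

# Plain lists have no tuple indexing, so a comprehension does the same job.
from_list = [r[1] for r in rows[0:3]]

print(chained, combined, from_list)
```

The chained form returns a whole row because each bracket only ever sees the result of the previous one.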
Using : is totally supported. Explained below...
So we start with:
a = [[1, 2], [3, 4], [5, 6], [7, 8]]
You asked about:
a[0:3][1]
We want the items from list a from position zero up to, but not including, position three [0:3]. The items returned are
[1, 2] --- position 0
[3, 4] --- position 1
[5, 6] --- position 2
Then we request from that list the item in position 1, which returns:
[3, 4]
If you want to access items inside that smaller list you need to add another index, like this:
a[0:3][1][1]
would return:
4
Diagram of basic string splitting (image not included):
Your first bracket is saying "give me the elements of list a from position 0 up to, but not including, position 3", which in this case is the first three sub-lists.
Your second bracket is saying "of the results of my first bracket, give me the element that is in position 1", which is the entire sub-list [3,4]
In this specific case
a[0:3][1]
could have simply been written as
a[1]
Let us assume a list of lists (named lst here, to avoid shadowing the built-in list):
lst=[[1,2],[3,4],[5,6],[7,8]]
then,
lst[0:3]
will return a list with the elements (which are also lists) from index 0 to 2
[[1, 2], [3, 4], [5, 6]]
so accordingly lst[0:3][1] will return the second element ([3,4]), whose index is 1.
a[0:3][1] will not return [2,4,6]; it takes the list of lists with 3 elements and chooses the second element.
When you call a[0:3] the result of that is a list with the first three elements of a. You then call a[0:3][1] which returns the 2nd element of that list which is the list [3,4].
Ordinary Python lists do not support this kind of slicing.
You can get [2, 4, 6] with Numpy:
>>> import numpy as np
>>> a = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
>>> a[0:3, 1]
array([2, 4, 6])
a = [[1, 2], [3, 4], [5, 6], [7, 8]]
a[0:3]
the output of this is a list:
>>> [[1, 2], [3, 4], [5, 6]]
Therefore:
a[0:3][1]
Accesses the element at index 1, which is [3, 4]
To get the desired output from your list, use list comprehension:
[x[1] for x in a[0:3]]
>>> [2, 4, 6]

numpy sort 2d: rearrange rows without changing values in row

How can the rows of an array be sorted without changing the values within each row?
Furthermore, how can I get the indices of this sorting process?
input:
a = np.array([[4,3],[0,3],[3,0],[1,3],[1,2],[2,0]])
required sorting array:
b = np.array([1,4,3,5,2,0])
a = a[b]
output:
a = np.array([[0,3],[1,2],[1,3],[2,0],[3,0],[4,3]])
How do I get the array b ?
You need lexsort here:
b = np.lexsort((a[:, 1], a[:, 0]))
# array([1, 4, 3, 5, 2, 0], dtype=int64)
And applied to your initial array:
>>> a[b]
array([[0, 3],
[1, 2],
[1, 3],
[2, 0],
[3, 0],
[4, 3]])
As @miradulo pointed out, you may also use:
b = np.lexsort(np.fliplr(a).T)
Which is less verbose than explicitly stating the columns to sort on.
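The detail that makes both spellings work is that np.lexsort treats the last key as the primary sort key. A quick check on the array from the question (the names explicit, compact and sorted_rows are illustrative):

```python
import numpy as np

a = np.array([[4, 3], [0, 3], [3, 0], [1, 3], [1, 2], [2, 0]])

# lexsort uses the LAST key as the primary one, so column 0 goes last
# and column 1 only breaks ties.
explicit = np.lexsort((a[:, 1], a[:, 0]))

# Reversing the column order and transposing yields the same key sequence
# (column 1 first, column 0 last) without naming the columns.
compact = np.lexsort(np.fliplr(a).T)

sorted_rows = a[explicit]
print(explicit)
print(sorted_rows)
```

Because the rows are moved as whole units by a[explicit], the values inside each row stay together.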

Is there any function in python which can perform the inverse of numpy.repeat function?

For example
x = np.repeat(np.array([[1,2],[3,4]]), 2, axis=1)
gives you
x = array([[1, 1, 2, 2],
[3, 3, 4, 4]])
but is there something which can perform
x = np.inverse_repeat(np.array([[1, 1, 2, 2],[3, 3, 4, 4]]), axis=1)
and gives you
x = array([[1,2],[3,4]])
Regular slicing should work. For the axis you want to inverse repeat, use ::number_of_repetitions
x = np.repeat(np.array([[1,2],[3,4]]), 4, axis=0)
x[::4, :] # axis=0
Out:
array([[1, 2],
[3, 4]])
x = np.repeat(np.array([[1,2],[3,4]]), 3, axis=1)
x[:,::3] # axis=1
Out:
array([[1, 2],
[3, 4]])
x = np.repeat(np.array([[[1],[2]],[[3],[4]]]), 5, axis=2)
x[:,:,::5] # axis=2
Out:
array([[[1],
[2]],
[[3],
[4]]])
This should work, and has the exact same signature as np.repeat:
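def inverse_repeat(a, repeats, axis):
    if isinstance(repeats, int):
        # Scalar repeats: take the first item of each block of `repeats` items
        indices = np.arange(a.shape[axis] // repeats) * repeats
    else:  # assume array_like of int
        # Per-item repeats: take the last item of each repeated block
        indices = np.cumsum(repeats) - 1
    return a.take(indices, axis)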
Edit: added support for per-item repeats as well, analogous to np.repeat
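A round-trip check is a cheap way to convince yourself that such a function really undoes np.repeat. The sketch below restates the function using integer division so that it is self-contained; the test arrays and the names scalar_ok and list_ok are made up for illustration.

```python
import numpy as np

def inverse_repeat(a, repeats, axis):
    """Undo np.repeat(a, repeats, axis) when repeats and axis are known."""
    if isinstance(repeats, int):
        # First index of every block of `repeats` items.
        indices = np.arange(a.shape[axis] // repeats) * repeats
    else:
        # Last index of every (possibly differently sized) block.
        indices = np.cumsum(repeats) - 1
    return np.take(a, indices, axis=axis)

original = np.array([[1, 2], [3, 4]])

# Scalar repeat count along the columns.
scalar_ok = np.array_equal(
    inverse_repeat(np.repeat(original, 3, axis=1), 3, 1), original)

# Per-item repeat counts along the rows.
list_ok = np.array_equal(
    inverse_repeat(np.repeat(original, [2, 3], axis=0), [2, 3], 0), original)

print(scalar_ok, list_ok)
```

Note that the isinstance(repeats, int) test does not match NumPy integer scalars such as np.int64(2); those fall through to the cumsum branch and give wrong indices, so pass a plain Python int.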
For the case where we know the axis and the repeat - and the repeat is a scalar (same value for all elements) we can construct a slicing index like this:
In [1117]: a=np.array([[1, 1, 2, 2],[3, 3, 4, 4]])
In [1118]: axis=1; repeats=2
In [1119]: ind=[slice(None)]*a.ndim
In [1120]: ind[axis]=slice(None,None,repeats)
In [1121]: ind
Out[1121]: [slice(None, None, None), slice(None, None, 2)]
In [1122]: a[tuple(ind)]
Out[1122]:
array([[1, 2],
[3, 4]])
@Eelco's use of take makes it easier to focus on one axis, but requires a list of indices, not a slice.
But repeat does allow for differing repeat counts.
In [1127]: np.repeat(a1,[2,3],axis=1)
Out[1127]:
array([[1, 1, 2, 2, 2],
[3, 3, 4, 4, 4]])
Knowing axis=1 and repeats=[2,3] we should be able construct the right take indexing (probably with cumsum). Slicing won't work.
But if we only know the axis, and the repeats are unknown then we probably need some sort of unique or set operation as in @redratear's answer.
In [1128]: a2=np.repeat(a1,[2,3],axis=1)
In [1129]: y=[list(set(c)) for c in a2]
In [1130]: y
Out[1130]: [[1, 2], [3, 4]]
A take solution with list repeats. This should select the last of each repeated block:
In [1132]: np.take(a2,np.cumsum([2,3])-1,axis=1)
Out[1132]:
array([[1, 2],
[3, 4]])
A deleted answer uses unique; here's my row by row use of unique
In [1136]: np.array([np.unique(row) for row in a2])
Out[1136]:
array([[1, 2],
[3, 4]])
unique is better than set for this use since it returns the values in a defined (sorted) order, whereas the order of a set is arbitrary. There's another problem with unique (or set): what if the original had repeated values, e.g. [[1,2,1,3],[3,3,4,1]]?
Here is a case where it would be difficult to deduce the repeat pattern from the result. I'd have to look at all the rows first.
In [1169]: a=np.array([[2,1,1,3],[3,3,2,1]])
In [1170]: a1=np.repeat(a,[2,1,3,4], axis=1)
In [1171]: a1
Out[1171]:
array([[2, 2, 1, 1, 1, 1, 3, 3, 3, 3],
[3, 3, 3, 2, 2, 2, 1, 1, 1, 1]])
But cumsum on a known repeat solves it nicely:
In [1172]: ind=np.cumsum([2,1,3,4])-1
In [1173]: ind
Out[1173]: array([1, 2, 5, 9], dtype=int32)
In [1174]: np.take(a1,ind,axis=1)
Out[1174]:
array([[2, 1, 1, 3],
[3, 3, 2, 1]])
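When the repeats are not known, one possible middle ground between cumsum and unique is to drop every slice that equals its predecessor along the axis. This is only a sketch under a strong assumption: no two adjacent slices of the original array were identical, otherwise they get merged too. The helper name collapse_runs is hypothetical, and the array is assumed to be non-empty along the axis.

```python
import numpy as np

def collapse_runs(arr, axis):
    """Keep the first slice of every run of identical consecutive slices.

    Hypothetical helper; assumes arr is non-empty along `axis` and that
    the un-repeated array had no identical neighbouring slices.
    """
    moved = np.moveaxis(arr, axis, 0)            # put the repeated axis first
    flat = moved.reshape(moved.shape[0], -1)     # one row per slice
    changed = np.any(flat[1:] != flat[:-1], axis=1)
    keep = np.concatenate(([True], changed))     # the first slice always stays
    return np.compress(keep, arr, axis=axis)

a = np.array([[2, 1, 1, 3], [3, 3, 2, 1]])
a1 = np.repeat(a, [2, 1, 3, 4], axis=1)

recovered = collapse_runs(a1, axis=1)
print(recovered)
```

Unlike a row-by-row unique, this compares whole columns, so the repeated values inside a single row of the example above do not confuse it.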
>>> import numpy as np
>>> x = np.repeat(np.array([[1,2],[3,4]]), 2, axis=1)
>>> y = [list(set(c)) for c in x]  # removes duplicates within each row
>>> print(y)
[[1, 2], [3, 4]]
You don't need to know the axis or the repeat amount. Note that this will not work when a row of the original contains repeated values: for x = np.repeat(np.array([[1,1],[3,3]]), 2, axis=1), i.e. [[1,1,1,1],[3,3,3,3]], the result will be [[1],[3]].

Updating a NumPy array with another

Seemingly simple question: I have an array with two columns, the first represents an ID and the second a count. I'd like to update it with another, similar array such that
import numpy as np
a = np.array([[1, 2],
[2, 2],
[3, 1],
[4, 5]])
b = np.array([[2, 2],
[3, 1],
[4, 0],
[5, 3]])
a.update(b) # ????
>>> np.array([[1, 2],
[2, 4],
[3, 2],
[4, 5],
[5, 3]])
Is there a way to do this with indexing/slicing such that I don't simply have to iterate over each row?
Generic case
Approach #1: You can use np.add.at to do such an ID-based adding operation like so -
# First column of output array as the union of first columns of a,b
out_id = np.union1d(a[:,0],b[:,0])
# Initialize second column of output array
out_count = np.zeros_like(out_id)
# Find indices where the first columns of a,b are placed in out_id
_,a_idx = np.where(a[:,None,0]==out_id)
_,b_idx = np.where(b[:,None,0]==out_id)
# Place second column of a into out_id & add in second column of b
out_count[a_idx] = a[:,1]
np.add.at(out_count, b_idx,b[:,1])
# Stack the ID and count arrays into a 2-column format
out = np.column_stack((out_id,out_count))
To find a_idx and b_idx, as probably a faster alternative, np.searchsorted could be used like so -
a_idx = np.searchsorted(out_id, a[:,0], side='left')
b_idx = np.searchsorted(out_id, b[:,0], side='left')
Sample input-output :
In [538]: a
Out[538]:
array([[1, 2],
[4, 2],
[3, 1],
[5, 5]])
In [539]: b
Out[539]:
array([[3, 7],
[1, 1],
[4, 0],
[2, 3],
[6, 2]])
In [540]: out
Out[540]:
array([[1, 3],
[2, 3],
[3, 8],
[4, 2],
[5, 5],
[6, 2]])
Approach #2: You can use np.bincount to do the same ID based adding -
# First column of output array as the union of first columns of a,b
out_id = np.union1d(a[:,0],b[:,0])
# Get all IDs and counts in a single arrays
id_arr = np.concatenate((a[:,0],b[:,0]))
count_arr = np.concatenate((a[:,1],b[:,1]))
# Get binned summations
summed_vals = np.bincount(id_arr,count_arr)
# Get mask of valid bins
mask = np.in1d(np.arange(np.max(out_id)+1),out_id)
# Mask valid summed bins for final counts array output
out_count = summed_vals[mask]
# Stack the ID and count arrays into a 2-column format
out = np.column_stack((out_id,out_count))
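A possible variation on the bincount idea, sketched here on the arrays from the question: np.unique with return_inverse=True maps every ID to its position in the sorted unique IDs, so bincount can sum directly into those positions and no mask over the full ID range is needed. The intermediate names ids, counts and inverse are illustrative, and the cast back to the input dtype assumes integer counts.

```python
import numpy as np

a = np.array([[1, 2], [2, 2], [3, 1], [4, 5]])
b = np.array([[2, 2], [3, 1], [4, 0], [5, 3]])

# All IDs and counts from both arrays in one place.
ids = np.concatenate((a[:, 0], b[:, 0]))
counts = np.concatenate((a[:, 1], b[:, 1]))

# out_id: sorted unique IDs; inverse: position of each entry of ids in out_id.
out_id, inverse = np.unique(ids, return_inverse=True)

# bincount with weights returns floats, so cast back to the input dtype.
out_count = np.bincount(inverse, weights=counts).astype(counts.dtype)

out = np.column_stack((out_id, out_count))
print(out)
```

Since the bins are positions rather than raw IDs, this also works when the IDs are large or sparse.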
Specific case
If the ID columns in a and b are sorted, it becomes easier, as we can just use masks with np.in1d to index into the output ID array created with np.union1d like so -
# First column of output array as the union of first columns of a,b
out_id = np.union1d(a[:,0],b[:,0])
# Masks of first columns of a and b matches in the output ID array
mask1 = np.in1d(out_id,a[:,0])
mask2 = np.in1d(out_id,b[:,0])
# Initialize second column of output array
out_count = np.zeros_like(out_id)
# Place second column of a into out_id & add in second column of b
out_count[mask1] = a[:,1]
np.add.at(out_count, np.where(mask2)[0],b[:,1])
# Stack the ID and count arrays into a 2-column format
out = np.column_stack((out_id,out_count))
Sample run -
In [552]: a
Out[552]:
array([[1, 2],
[2, 2],
[3, 1],
[4, 5],
[8, 5]])
In [553]: b
Out[553]:
array([[2, 2],
[3, 1],
[4, 0],
[5, 3],
[6, 2],
[8, 2]])
In [554]: out
Out[554]:
array([[1, 2],
[2, 4],
[3, 2],
[4, 5],
[5, 3],
[6, 2],
[8, 7]])
>>> col=np.unique(np.hstack((b[:,0],a[:,0])))
>>> dif=np.setdiff1d(col,a[:,0])
>>> val=b[np.in1d(b[:,0],dif)]
>>> result=np.concatenate((a,val))
array([[1, 2],
[2, 2],
[3, 1],
[4, 5],
[5, 3]])
Note that if you want the result to be sorted, you can use np.lexsort:
result[np.lexsort((result[:,0],result[:,0]))]
Explanation :
First you can find the unique ids with following command :
>>> col=np.unique(np.hstack((b[:,0],a[:,0])))
>>> col
array([1, 2, 3, 4, 5])
Then find the difference between the ids of a and all of the ids:
>>> dif=np.setdiff1d(col,a[:,0])
>>> dif
array([5])
Then find the items within b with the ids in dif:
>>> val=b[np.in1d(b[:,0],dif)]
>>> val
array([[5, 3]])
And at last concatenate the result with array a:
>>> np.concatenate((a,val))
consider another example with sorting :
>>> a = np.array([[1, 2],
... [2, 2],
... [3, 1],
... [7, 5]])
>>>
>>> b = np.array([[2, 2],
... [3, 1],
... [4, 0],
... [5, 3]])
>>>
>>> col=np.unique(np.hstack((b[:,0],a[:,0])))
>>> dif=np.setdiff1d(col,a[:,0])
>>> val=b[np.in1d(b[:,0],dif)]
>>> result=np.concatenate((a,val))
>>> result[np.lexsort((result[:,0],result[:,0]))]
array([[1, 2],
[2, 2],
[3, 1],
[4, 0],
[5, 3],
[7, 5]])
That's an old question but here is a solution with pandas (that could be generalized for other aggregation functions than sum). Also sorting will occur automatically:
import pandas as pd
import numpy as np
a = np.array([[1, 2],
[2, 2],
[3, 1],
[4, 5]])
b = np.array([[2, 2],
[3, 1],
[4, 0],
[5, 3]])
print((pd.DataFrame(a[:, 1], index=a[:, 0])
.add(pd.DataFrame(b[:, 1], index=b[:, 0]), fill_value=0)
.astype(int))
.reset_index()
.to_numpy())
Output:
[[1 2]
[2 4]
[3 2]
[4 5]
[5 3]]

Filtering multiple NumPy arrays based on the intersection of one column

I have three rather large NumPy arrays with varying numbers of rows, whose first columns are all integers. My hope is to filter these arrays such that the only rows left are those for whom the value in the first column is shared by all three. This would leave three arrays of the same size. The entries in the other columns are not necessarily shared across arrays.
So, with input:
A =
[[1, 1],
[2, 2],
[3, 3],]
B =
[[2, 1],
[3, 2],
[4, 3],
[5, 4]]
C =
[[2, 2],
[3, 1],
[5, 2]]
I hope to get back as output:
A =
[[2, 2],
[3, 3]]
B =
[[2, 1],
[3, 2]]
C =
[[2, 2],
[3, 1]]
My current approach is to:
Find the intersection of the three first columns using numpy.intersect1d()
Use numpy.in1d() on this intersection and the first columns of each array to find the row indices that are not shared in each array (converting boolean to index using a modified version of the method found here: Python: intersection indices numpy array )
Finally using numpy.delete() with each of these indices and its respective array to remove rows with non-shared entries in the first column.
I'm wondering if there might be a faster or more elegantly Pythonic way to go about this however, something that is suited to very large arrays.
Your indices in your example are sorted and unique. Assuming this is no coincidence (and this situation often arises, or can easily be enforced), the following works:
import numpy as np
A = np.array(
[[1, 1],
[2, 2],
[3, 3],])
B = np.array(
[[2, 1],
[3, 2],
[4, 3],
[5, 4]])
C = np.array(
[[2, 2],
[3, 1],
[5, 2],])
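from functools import reduce  # reduce is not a builtin in Python 3
# Intersect the first columns pairwise; assume_unique=True relies on unique IDs
I = reduce(
    lambda l, r: np.intersect1d(l, r, assume_unique=True),
    (i[:,0] for i in (A,B,C)))
print(A[np.searchsorted(A[:,0], I)])
print(B[np.searchsorted(B[:,0], I)])
print(C[np.searchsorted(C[:,0], I)])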
and in case the first column is not in sorted order (but is still unique):
C = np.array(
[[9, 2],
[1,6],
[5, 1],
[2, 5],
[3, 2],])
def index_by_first_column_entry(M, keys):
    colkeys = M[:,0]
    # sorter lets searchsorted treat the unsorted column as if it were sorted
    sorter = np.argsort(colkeys)
    index = np.searchsorted(colkeys, keys, sorter = sorter)
    return M[sorter[index]]
print(index_by_first_column_entry(C, I))
and make sure to change the true to false in
I = reduce(
lambda l,r: np.intersect1d(l,r,False),
(i[:,0] for i in (A,B,C)))
generalization to duplicate values can be made using np.unique
One way to do this is to build an indicator array, or a hash table if you like, to indicate which integers are in all your input arrays. Then you can use boolean indexing based on this indicator array to get the subarrays. Something like this:
import numpy as np
# Setup
A = np.array(
[[1, 1],
[2, 2],
[3, 3],])
B = np.array(
[[2, 1],
[3, 2],
[4, 3],
[5, 4]])
C = np.array(
[[2, 2],
[3, 1],
[5, 2],])
def take_overlap(*arrays):
    n = len(arrays)
    maxIndex = max(array[:, 0].max() for array in arrays)
    indicator = np.zeros(maxIndex + 1, dtype=int)
    for array in arrays:
        # Count in how many arrays each integer occurs
        indicator[array[:, 0]] += 1
    indicator = indicator == n
    result = []
    for array in arrays:
        # Look up each integer in the indicator array
        mask = indicator[array[:, 0]]
        # Use boolean indexing to get the sub array
        result.append(array[mask])
    return result
subA, subB, subC = take_overlap(A, B, C)
This should be quite fast, and this method does not assume the elements of the input arrays are unique or sorted. However, it could take a lot of memory, and might be a bit slower, if the indexing integers are sparse, i.e. [1, 10, 10000], but it should be close to optimal if the integers are more or less dense.
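For sparse IDs, where an indicator array indexed by value would get large, a mask-based variant is one option. The sketch below intersects the first columns and then keeps the rows whose ID is in that intersection, using np.isin. The function name filter_shared is hypothetical, and no claim is made about its speed relative to the other answers.

```python
from functools import reduce

import numpy as np

A = np.array([[1, 1], [2, 2], [3, 3]])
B = np.array([[2, 1], [3, 2], [4, 3], [5, 4]])
C = np.array([[2, 2], [3, 1], [5, 2]])

def filter_shared(*arrays):
    """Keep only rows whose first-column value occurs in every array."""
    # IDs common to all first columns (intersect1d returns them sorted and unique).
    common = reduce(np.intersect1d, (arr[:, 0] for arr in arrays))
    # Boolean mask per array: True where the row's ID is one of the common IDs.
    return [arr[np.isin(arr[:, 0], common)] for arr in arrays]

subA, subB, subC = filter_shared(A, B, C)
print(subA)
print(subB)
print(subC)
```

The rows keep their original order, and the first columns need not be sorted. If an ID is duplicated within one array, all of its rows are kept, so the outputs are only guaranteed to have equal length when the IDs are unique.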
This works but I'm not sure if it is faster than any of the other answers:
import numpy as np
A = np.array(
[[1, 1],
[2, 2],
[3, 3],])
B = np.array(
[[2, 1],
[3, 2],
[4, 3],
[5, 4]])
C = np.array(
[[2, 2],
[3, 1],
[5, 2],])
a = A[:,0]
b = B[:,0]
c = C[:,0]
ab = np.where(a[:, np.newaxis] == b[np.newaxis, :])
bc = np.where(b[:, np.newaxis] == c[np.newaxis, :])
ab_in_bc = np.in1d(ab[1], bc[0])
bc_in_ab = np.in1d(bc[0], ab[1])
arows = ab[0][ab_in_bc]
brows = ab[1][ab_in_bc]
crows = bc[1][bc_in_ab]
anew = A[arows, :]
bnew = B[brows, :]
cnew = C[crows, :]
print(anew)
print(bnew)
print(cnew)
gives:
[[2 2]
[3 3]]
[[2 1]
[3 2]]
[[2 2]
[3 1]]
