Index of identical rows in a NumPy array - python

I already asked a variation of this question, but I still have a problem regarding the runtime of my code.
I have a NumPy array with 15000 rows and 44 columns. My goal is to find out which rows are equal and collect their indices in lists, like this:
1 0 0 0 0
0 0 0 0 0
0 0 0 0 0
0 0 0 0 0
1 0 0 0 0
1 2 3 4 5
Result:
equal_rows1 = [1,2,3]
equal_rows2 = [0,4]
What I did up till now is using the following code:
import numpy as np
input_data = np.load('IN.npy')
equal_inputs1 = []
equal_inputs2 = []
for i in range(len(input_data)):
    for j in range(i+1,len(input_data)):
        if np.array_equal(input_data[i],input_data[j]):
            equal_inputs1.append(i)
            equal_inputs2.append(j)
The problem is that it takes a lot of time to return the desired arrays and that this allows only 2 different "similar row lists" although there can be more. Is there any better solution for this, especially regarding the runtime?

This is pretty simple with pandas groupby:
df
   A  B  C  D  E
0  1  0  0  0  0
1  0  0  0  0  0
2  0  0  0  0  0
3  0  0  0  0  0
4  1  0  0  0  0
5  1  2  3  4  5
[g.index.tolist() for _, g in df.groupby(df.columns.tolist()) if len(g.index) > 1]
# [[1, 2, 3], [0, 4]]
If you are dealing with many rows and many unique groups, this might get a bit slow. The performance depends on your data. Perhaps there is a faster NumPy alternative, but this is certainly the easiest to understand.
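A NumPy-only route is possible as well. The following is a minimal sketch rather than a benchmarked solution: it uses np.unique with axis=0 to label each row with the id of its unique row, then groups the row indices by that label. The names data and equal_rows are illustrative, and the small array stands in for the 15000 x 44 input.

```python
import numpy as np

# Small stand-in for the real input (assumption: a 2-D integer array).
data = np.array([[1, 0, 0, 0, 0],
                 [0, 0, 0, 0, 0],
                 [0, 0, 0, 0, 0],
                 [0, 0, 0, 0, 0],
                 [1, 0, 0, 0, 0],
                 [1, 2, 3, 4, 5]])

# inverse[i] is the id of the unique row that row i equals;
# counts[k] is how many rows share unique row k.
_, inverse, counts = np.unique(data, axis=0, return_inverse=True, return_counts=True)
inverse = inverse.ravel()  # guard against shape differences between NumPy versions

# A stable sort keeps the original row order inside each group.
order = np.argsort(inverse, kind="stable")
groups = np.split(order, np.cumsum(counts)[:-1])

# Keep only groups with more than one member.
equal_rows = [g.tolist() for g in groups if len(g) > 1]
print(equal_rows)  # [[1, 2, 3], [0, 4]]
```

Because np.unique sorts the rows, the groups come out in lexicographic order of the row values, not in order of first appearance.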

You can use collections.defaultdict, which retains the row values as keys:
from collections import defaultdict
dd = defaultdict(list)
for idx, row in enumerate(df.values):
    dd[tuple(row)].append(idx)
print(list(dd.values()))
# [[0, 4], [1, 2, 3], [5]]
print(dd)
# defaultdict(<class 'list'>, {(1, 0, 0, 0, 0): [0, 4],
# (0, 0, 0, 0, 0): [1, 2, 3],
# (1, 2, 3, 4, 5): [5]})
You can, if you wish, filter out unique rows via a dictionary comprehension.
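The filtering step mentioned above could look like the sketch below. It rebuilds the dictionary from a plain NumPy array so that it runs on its own; rows and duplicates are illustrative names, not part of the answer.

```python
from collections import defaultdict

import numpy as np

rows = np.array([[1, 0, 0, 0, 0],
                 [0, 0, 0, 0, 0],
                 [0, 0, 0, 0, 0],
                 [0, 0, 0, 0, 0],
                 [1, 0, 0, 0, 0],
                 [1, 2, 3, 4, 5]])

# Map each row (as a hashable tuple) to the list of indices where it occurs.
dd = defaultdict(list)
for idx, row in enumerate(rows):
    dd[tuple(row.tolist())].append(idx)

# Dictionary comprehension: keep only rows that occur more than once.
duplicates = {key: idxs for key, idxs in dd.items() if len(idxs) > 1}
print(duplicates)
# {(1, 0, 0, 0, 0): [0, 4], (0, 0, 0, 0, 0): [1, 2, 3]}
```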

Related

How to iterate over previous rows in a dataframe

I have three columns: id (non-unique id), X (categories) and Y (categories). (I don't have a dataset to share yet. I'll try to replicate what I have using a smaller dataset and edit as soon as possible)
I ran a for loop on a very small subset and based on those results it might take over 4 hours to run this code. I'm looking for a faster way to do this task using pandas (maybe using iterrows, like iterating over previous rows within apply)
For each row I check
whether the current X matches any of previous Xs (check_X = X[:row] == X[row])
whether the current Y matches any of previous Ys (check_Y = Y[:row] == Y[row])
whether the current id does not match any of previous ids (check_id = id[:row] != id[row])
if sum(check_X & check_Y & check_id)>0: then append 1 to the array
else: append 0
You are probably looking for duplicated:
import pandas as pd
df = pd.DataFrame({'id': [0, 0, 0, 1, 0],
                   'X': [1, 1, 2, 1, 1],
                   'Y': [2, 2, 2, 2, 2]})
df['dup'] = ~df[df.duplicated(['X', 'Y'])].duplicated('id', keep=False).loc[lambda x: ~x]
df['dup'] = df['dup'].fillna(False).astype(int)
print(df)
# Output
   id  X  Y  dup
0   0  1  2    0
1   0  1  2    0
2   0  2  2    0
3   1  1  2    1
4   0  1  2    0
Update
X and Y should be checked separately:
import numpy as np
df = pd.DataFrame({'id': [0, 1, 1, 2, 2, 3, 4],
                   'X': [0, 1, 1, 1, 1, 1, 1],
                   'Y': [0, 2, 2, 2, 2, 2, 2]})
df['dup'] = np.where(df['X'].duplicated() & df['Y'].duplicated() & ~df['id'].duplicated(), 1, 0)
print(df)
# Output
   id  X  Y  dup
0   0  0  0    0
1   1  1  2    0
2   1  1  2    0
3   2  1  2    1
4   2  1  2    0
5   3  1  2    1
6   4  1  2    1
EDIT: @Corralien's answer using duplicated() will likely be much faster and is the best answer for this specific problem. However, apply is more flexible if you have different things to check.
You could do it with iterrows() or apply(). As far as I know, apply() is faster:
check_id, check_x, check_y = set(), set(), set()
def apply_func(row):
    global check_id, check_x, check_y
    # flag the row if its id is new but its X and Y have both been seen before
    if row['id'] not in check_id and row['X'] in check_x and row['Y'] in check_y:
        row['duplicate'] = 1
    else:
        row['duplicate'] = 0
    check_id.add(row['id'])
    check_x.add(row['X'])
    check_y.add(row['Y'])
    return row
df = df.apply(apply_func, axis=1)  # apply returns a new DataFrame, so keep the result
With iterrows():
check_id, check_x, check_y = set(), set(), set()
for i, row in df.iterrows():
    if row['id'] not in check_id and row['X'] in check_x and row['Y'] in check_y:
        df.loc[i, 'duplicate'] = 1
    else:
        df.loc[i, 'duplicate'] = 0
    check_id.add(row['id'])
    check_x.add(row['X'])
    check_y.add(row['Y'])
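For comparison, here is a sketch of the check exactly as the question words it, where the three conditions are combined elementwise with &, so they must hold for the same earlier row: a row is flagged if some previous row has the same X and the same Y but a different id. It keeps, per (X, Y) pair, the set of ids seen so far, which avoids rescanning all previous rows. The names seen_ids and flags are illustrative, and the sample frame is the one from the first answer. On this literal reading the last row is also flagged, because row 3 has the same X and Y and a different id.

```python
from collections import defaultdict

import pandas as pd

df = pd.DataFrame({'id': [0, 0, 0, 1, 0],
                   'X': [1, 1, 2, 1, 1],
                   'Y': [2, 2, 2, 2, 2]})

seen_ids = defaultdict(set)  # (X, Y) -> ids found in earlier rows
flags = []
for row_id, x_val, y_val in zip(df['id'], df['X'], df['Y']):
    earlier = seen_ids[(x_val, y_val)]
    # 1 if an earlier row with the same X and Y carries a different id
    flags.append(int(bool(earlier - {row_id})))
    earlier.add(row_id)

df['flag'] = flags
print(flags)  # [0, 0, 0, 1, 1]
```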

Numpy / Pandas slicing based on intervals

Trying to figure out a way to slice non-contiguous and non-equal length rows of a pandas / numpy matrix so I can set the values to a common value. Has anyone come across an elegant solution for this?
import numpy as np
import pandas as pd
x = pd.DataFrame(np.arange(12).reshape(3,4))
#x is the matrix we want to index into
"""
x before:
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
"""
y = pd.DataFrame([[0,3],[2,2],[1,2],[0,0]])
#y is a matrix where each row contains a start idx and end idx per column of x
"""
0 1
0 0 3
1 2 3
2 1 3
3 0 1
"""
What I'm looking for is a way to effectively select different length slices of x based on the rows of y
x[y] = 0
"""
x afterwards:
array([[ 0,  1,  2,  0],
       [ 0,  5,  0,  7],
       [ 0,  0,  0, 11]])
"""
Masking can still be useful, because even if a loop cannot be entirely avoided, the main dataframe x would not need to be involved in the loop, so this should speed things up:
mask = np.zeros_like(x, dtype=bool)
for i in range(len(y)):
    mask[y.iloc[i, 0]:(y.iloc[i, 1] + 1), i] = True
x[mask] = 0
x
   0  1  2   3
0  0  1  2   0
1  0  5  0   7
2  0  0  0  11
As a further improvement, consider defining y as a NumPy array if possible.
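A minimal sketch of that variant, assuming y is a plain NumPy array whose rows hold an inclusive (start, end) pair per column of x, as in the y constructed in the question's code. It builds the mask with the same loop and applies it with DataFrame.mask; result is an illustrative name.

```python
import numpy as np
import pandas as pd

x = pd.DataFrame(np.arange(12).reshape(3, 4))

# One (start, end) pair per column of x; end is treated as inclusive here.
y = np.array([[0, 3], [2, 2], [1, 2], [0, 0]])

mask = np.zeros(x.shape, dtype=bool)
for col, (start, end) in enumerate(y):
    # Slices past the last row are clipped automatically.
    mask[start:end + 1, col] = True

result = x.mask(mask, 0)  # replace masked cells with 0, keep the rest
print(result.values.tolist())
# [[0, 1, 2, 0], [0, 5, 0, 7], [0, 0, 0, 11]]
```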
I customized this answer to your problem:
y_t = y.values.transpose()
y_t[1,:] = y_t[1,:] - 1 # treats the end index as exclusive, as in the printed y; or remove this line and change '>= r' below to '> r'
r = np.arange(x.shape[0])
mask = ((y_t[0,:,None] <= r) & (y_t[1,:,None] >= r)).transpose()
res = x.where(~mask, 0)
res
# 0 1 2 3
# 0 0 1 2 0
# 1 0 5 0 7
# 2 0 0 0 11

Add element after each element in numpy array python

I am just starting off with numpy and am trying to create a function that takes in an array (x), converts this into a np.array, and returns a numpy array with 0,0,0,0 added after each element.
It should look like so:
input array: [4,5,6]
output: [4,0,0,0,0,5,0,0,0,0,6,0,0,0,0]
I have tried the following:
import numpy as np
x = np.asarray([4,5,6])
y = np.array([])
for index, value in enumerate(x):
    y = np.insert(x, index+1, [0,0,0,0])
    print(y)
which returns:
[4 0 0 0 0 5 6]
[4 5 0 0 0 0 6]
[4 5 6 0 0 0 0]
So basically I need to combine the output into one single numpy array rather than three lists.
Would anybody know how to solve this?
Many thanks!
Use NumPy's zeros function:
import numpy as np
inputArray = [4,5,6]
newArray = np.zeros(5*len(inputArray),dtype=int)
newArray[::5] = inputArray
In effect, you set the values at indices 0, 5 and 10 to 4, 5 and 6,
so _____[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
becomes [4 0 0 0 0 5 0 0 0 0 6 0 0 0 0]
>>> newArray
array([4, 0, 0, 0, 0, 5, 0, 0, 0, 0, 6, 0, 0, 0, 0])
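Since the question asks for a function, the same idea can be wrapped up as in the sketch below. The name pad_with_zeros and its n_zeros parameter are made up for illustration; the stride is simply the number of zeros plus one.

```python
import numpy as np


def pad_with_zeros(values, n_zeros=4):
    """Return a 1-D array with n_zeros zeros inserted after each element."""
    arr = np.asarray(values)
    step = n_zeros + 1  # each input element occupies one slot plus its zeros
    out = np.zeros(step * arr.size, dtype=arr.dtype)
    out[::step] = arr  # every step-th slot receives an input value
    return out


print(pad_with_zeros([4, 5, 6]))
# [4 0 0 0 0 5 0 0 0 0 6 0 0 0 0]
```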
I haven't used NumPy to solve this problem, but this code seems to return your required output:
a = [4,5,6]
b = [0,0,0,0]
c = []
for x in a:
    c = c + [x] + b
print(c)
I hope this helps!

How to create matrix in python of repeating number?

I want to:
Create a vector list from 0 to 4, i.e. [0, 1, 2, 3, 4] and from that
Create a matrix containing a "tiered list" from 0 to 4, 3 times over, once for each dimension. The matrix has 4^3 = 64 rows, so for example
T = [0 0 0
0 0 1
0 0 2
0 0 3
0 0 4
0 1 0
0 1 1
0 1 2
0 1 3
0 1 4
0 2 0
...
1 0 0
...
1 1 0
....
4 4 4]
This is what I have so far:
n=5;
ind=list(range(0,n))
print(ind)
I am just getting started with Python so any help would be greatly appreciated!
The product() function from Python's itertools module can do this:
import itertools
for code in itertools.product( range(5), repeat=3 ):
    print(code)
Giving the result:
(0, 0, 0)
(0, 0, 1)
(0, 0, 2)
(0, 0, 3)
...
(4, 4, 2)
(4, 4, 3)
(4, 4, 4)
So to make this into a matrix:
import itertools
matrix = []
for code in itertools.product( range(5), repeat=3 ):
    matrix.append( list(code) )
list_ = []
for a in range(5):
    for b in range(5):
        for c in range(5):
            list_ += [[a, b, c]]  # add one row [a, b, c], not three loose numbers
print(list_)
Note, you really want the matrix to have 5^3 = 125 rows. The basic answer is to just iterate in nested for loops:
T = []
for a in range(5):
    for b in range(5):
        for c in range(5):
            T.append([a, b, c])
There are other, probably faster, ways of doing this, but for sheer get 'er done velocity, it's hard to beat this.
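One of those other ways, sketched with NumPy under the assumption that a 2-D integer array is an acceptable form of the matrix: np.indices produces the three index grids, and reshaping and transposing them gives one [a, b, c] row per combination, in the same order as the nested loops.

```python
import numpy as np

n = 5
# np.indices((n, n, n)) has shape (3, n, n, n); flatten each grid and
# transpose so that every row is one (a, b, c) combination.
T = np.indices((n, n, n)).reshape(3, -1).T

print(T.shape)  # (125, 3)
print(T[:6].tolist())
# [[0, 0, 0], [0, 0, 1], [0, 0, 2], [0, 0, 3], [0, 0, 4], [0, 1, 0]]
```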

Insert data from one sorted array into another sorted array

I apologize if this has been asked here - I've hunted around here and in the Tentative NumPy Tutorial for an answer.
I have 2 numpy arrays. The first array is similar to:
1 0 0 0 0
2 0 0 0 0
3 0 0 0 0
4 0 0 0 0
5 0 0 0 0
6 0 0 0 0
7 0 0 0 0
8 0 0 0 0
(etc... It's ~700x10 in actuality)
I then have a 2nd array similar to
3 1
4 18
5 2
(again, longer - maybe 400 or so rows)
The first column of the 2nd array is always completely contained within the first column of the first array
What I'd like to do is to insert the 2nd column of the 2nd array into that first array as part of an existing column, i.e:
array a:
1 0 0 0 0
2 0 0 0 0
3 1 0 0 0
4 18 0 0 0
5 2 0 0 0
6 0 0 0 0
7 0 0 0 0
8 0 0 0 0
(I'd be filling in each of those columns in turn, but each covers a different range within the original)
My first try was along the lines of a[b[:,0],1]=b[:,1] which puts them into the indices of b, not the values (ie, in my example above instead of filling in rows 3, 4, and 5, I filled in 2, 3, and 4). I should have realized that!
Since then, I've tried to make it work pretty inelegantly with where(), and I think I could make it work by finding the difference in the starting values of the first columns.
I'm new to python, so perhaps I'm overly optimistic - but it seems like there should be a more elegant way and I'm just missing it.
Thanks for any insights!
If the numbers in the first column of a are in sorted order, then you could use
a[a[:,0].searchsorted(b[:,0]),1] = b[:,1]
For example:
import numpy as np
a = np.array([(1,0,0,0,0),
              (2,0,0,0,0),
              (3,0,0,0,0),
              (4,0,0,0,0),
              (5,0,0,0,0),
              (6,0,0,0,0),
              (7,0,0,0,0),
              (8,0,0,0,0),
              ])
b = np.array([(3, 1),
              (5, 18),
              (7, 2)])
a[a[:,0].searchsorted(b[:,0]),1] = b[:,1]
print(a)
yields
[[ 1  0  0  0  0]
 [ 2  0  0  0  0]
 [ 3  1  0  0  0]
 [ 4  0  0  0  0]
 [ 5 18  0  0  0]
 [ 6  0  0  0  0]
 [ 7  2  0  0  0]
 [ 8  0  0  0  0]]
(I changed your example a bit to show that the values in b's first column do not have to be contiguous.)
If a[:,0] is not in sorted order, then you could use np.argsort to work around this:
a = np.array([(1,0,0,0,0),
              (2,0,0,0,0),
              (5,0,0,0,0),
              (3,0,0,0,0),
              (4,0,0,0,0),
              (6,0,0,0,0),
              (7,0,0,0,0),
              (8,0,0,0,0),
              ])
b = np.array([(3, 1),
              (5, 18),
              (7, 2)])
perm = np.argsort(a[:,0])
a[:,1][perm[a[:,0][perm].searchsorted(b[:,0])]] = b[:,1]
print(a)
yields
[[ 1  0  0  0  0]
 [ 2  0  0  0  0]
 [ 5 18  0  0  0]
 [ 3  1  0  0  0]
 [ 4  0  0  0  0]
 [ 6  0  0  0  0]
 [ 7  2  0  0  0]
 [ 8  0  0  0  0]]
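The unsorted case can also be written with the sorter argument of np.searchsorted, which takes the argsort permutation directly instead of a pre-sorted copy of the column. This is a sketch of the same technique, not a different algorithm; it still assumes every key in b occurs in the first column of a, and pos and rows are illustrative names.

```python
import numpy as np

a = np.array([(1, 0, 0, 0, 0),
              (2, 0, 0, 0, 0),
              (5, 0, 0, 0, 0),
              (3, 0, 0, 0, 0),
              (4, 0, 0, 0, 0),
              (6, 0, 0, 0, 0),
              (7, 0, 0, 0, 0),
              (8, 0, 0, 0, 0)])
b = np.array([(3, 1),
              (5, 18),
              (7, 2)])

perm = np.argsort(a[:, 0])                          # order that sorts a's keys
pos = np.searchsorted(a[:, 0], b[:, 0], sorter=perm)  # positions in sorted order
rows = perm[pos]                                    # map back to actual rows of a
a[rows, 1] = b[:, 1]

print(a[:, 1].tolist())  # [0, 0, 18, 1, 0, 0, 2, 0]
```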
The setup:
a = np.arange(20).reshape(2,10).T
b = np.array([[1, 100], [3, 300], [8, 800]])
This will work if you don't know anything about a[:, 0] except that it is sorted.
index = a[:, 0].searchsorted(b[:, 0])
a[index, 1] = b[:, 1]
print(a)
array([[ 0, 10],
[ 1, 100],
[ 2, 12],
[ 3, 300],
[ 4, 14],
[ 5, 15],
[ 6, 16],
[ 7, 17],
[ 8, 800],
[ 9, 19]])
But if you know that a[:, 0] is a sequence of contiguous integers like your example you can do:
index = b[:,0] - a[0, 0]  # row index = key minus the first key in a
a[index, 1] = b[:, 1]
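A small worked sketch of the contiguous case, assuming the first column of a counts up in steps of 1 from a[0, 0] as in the question's example, where it starts at 1. The row holding key k is then k minus that first key, so no search is needed.

```python
import numpy as np

# First column runs 1..8 without gaps; the other columns start at zero.
a = np.zeros((8, 5), dtype=int)
a[:, 0] = np.arange(1, 9)

b = np.array([[3, 1],
              [4, 18],
              [5, 2]])

index = b[:, 0] - a[0, 0]  # keys 3, 4, 5 live in rows 2, 3, 4
a[index, 1] = b[:, 1]

print(a[:, :2].tolist())
# [[1, 0], [2, 0], [3, 1], [4, 18], [5, 2], [6, 0], [7, 0], [8, 0]]
```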
