How can I compare two matrices row-wise in Python?

I have two matrices with the same number of columns but a different number of rows; one is a lot larger.
matA = [[1,0,1],[0,0,0],[1,1,0]], matB = [[0,0,0],[1,0,1],[0,0,0],[1,1,1],[1,1,0]]
Both of them are NumPy arrays.
I am trying to find how many times each row of matA appears in matB and put that in an array, so the array in this case will become arr = [1,2,1]: the first row of matA appears once in matB, the second row appears twice, and the last row once.

Here is a solution:
import numpy as np

A = np.array([[1, 0, 1], [0, 0, 0], [1, 1, 0]])
B = np.array([[0, 0, 0], [1, 0, 1], [0, 0, 0], [1, 1, 1], [1, 1, 0]])

# stack the rows, A has to come first
combined = np.concatenate((A, B), axis=0)  # or np.vstack
unique, unique_indices, unique_counts = np.unique(combined,
                                                  return_index=True,
                                                  return_counts=True,
                                                  axis=0)
print(unique)
print(unique_indices)
print(unique_counts)

# now we need to derive your desired result from the unique
# indices and counts
# we know the number of rows in A
n_rows_in_A = A.shape[0]
# so we know that entries of unique_indices smaller than n_rows_in_A
# belong to rows that appear first (or only) in A
indices_A = np.nonzero(unique_indices < n_rows_in_A)[0]
# indices_A1 = np.argwhere(unique_indices < n_rows_in_A)  # alternative
print(indices_A)
# print(indices_A1)

unique_indices_A = unique_indices[indices_A]
unique_counts_A = unique_counts[indices_A]
print(unique_indices_A)
print(unique_counts_A)

# now we need to subtract one from each count:
# that's the one occurrence in A itself that we are not interested in
unique_counts_A -= 1
print(unique_indices_A)
print(unique_counts_A)

# this is nearly the result we want
# now we need to sort it back into the row order of A and account
# for rows that appear in B but not in A
# will do that later...
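As an aside (not part of the answer above, just a sketch): for inputs of this size a direct broadcasting comparison gives the requested counts in one expression. It builds a (len(A), len(B), n_columns) boolean array, so it trades memory for simplicity and is only suitable when both matrices are reasonably small.

import numpy as np

A = np.array([[1, 0, 1], [0, 0, 0], [1, 1, 0]])
B = np.array([[0, 0, 0], [1, 0, 1], [0, 0, 0], [1, 1, 1], [1, 1, 0]])

# compare every row of A with every row of B; a row matches when
# all columns are equal, then count the matches per row of A
arr = (A[:, None, :] == B[None, :, :]).all(axis=2).sum(axis=1)
print(arr)  # [1 2 1]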

Related

Python numpy sampling a 2D array for n rows

I have a numpy array as follows, I want to take a random sample of n rows.
([[996.924, 265.879, 191.655],
[996.924, 265.874, 191.655],
[996.925, 265.884, 191.655],
[997.294, 265.621, 192.224],
[997.294, 265.643, 192.225],
[997.304, 265.652, 192.223]], dtype=float32)
I've tried:
rows_id = random.sample(range(0,arr.shape[1]-1), 1)
row = arr[rows_id, :]
But this index mask only returns a single row; I want to return n rows as a NumPy array (without duplication).
You have three key issues. First, arr.shape[1] returns the number of columns, while you want the number of rows, arr.shape[0]. Second, the second parameter to range is exclusive, so you don't really need the -1. Third, the last parameter to random.sample is the number of rows to draw, which you set to 1.
A better way to do what you're trying might be np.random.choice instead.
Try this, where x is your original array:
import numpy as np

n = 2  # number of rows to sample
idx = np.random.choice(len(x), n, replace=False)
result = x[idx]  # fancy indexing pulls out the selected rows in one go
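For completeness, here is a sketch of the original random.sample attempt with the three fixes applied (arr is assumed to be the float32 array from the question):

import random
import numpy as np

arr = np.array([[996.924, 265.879, 191.655],
                [996.924, 265.874, 191.655],
                [996.925, 265.884, 191.655]], dtype=np.float32)

n = 2
# sample n distinct row positions, then index the array with them
rows_id = random.sample(range(arr.shape[0]), n)
rows = arr[rows_id, :]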

Pandas random n samples of consecutive rows / pairs

I have a pandas DataFrame indexed by ID and sorted by value. I want to create a sample of n=20000 pairs (40000 rows in total), where the 2 rows of each pair are consecutive. I want to perform additional calculations on these 2 consecutive/paired rows.
E.g. if I set the sample size to n=2, I want to randomly pick pairs and find the difference in distance within each of the picks.
Additional condition: the value difference within a pair can't exceed 4000.
index value distance
cg13869341 15865 1.635450
cg14008030 18827 4.161332
Then the distance of the following pair, and so on:
cg20826792 29425 0.657369
cg33045430 29407 1.708055
Sample of the original DataFrame:
index value distance
cg13869341 15865 1.635450
cg14008030 18827 4.161332
cg12045430 29407 0.708055
cg20826792 29425 0.657369
cg33045430 69407 1.708055
cg40826792 59425 0.857369
cg47454306 88407 0.708055
cg60826792 96425 2.857369
I tried using df_sample = df.sample(n=20000). Then I got a bit lost trying to figure out how to get the next row for each value in df_sample.
The original shape is (480136, 14).
If it doesn't matter to always have (even, odd) pairs (which decreases randomness a bit), you can select N even-positioned rows and pair each with the row that follows it:
N = 20000
# get the indices of N random even-positioned rows
idx = df.loc[::2].sample(n=N).index
# create a boolean mask to identify those rows
m = df.index.to_series().isin(idx)
# select those rows OR the ones immediately after them
df_sample = df.loc[m | m.shift(fill_value=False)]
Example output on the toy DataFrame (N=3):
index value distance
2 cg12045430 29407 0.708055
3 cg20826792 29425 0.657369
4 cg33045430 69407 1.708055
5 cg40826792 59425 0.857369
6 cg47454306 88407 0.708055
7 cg60826792 96425 2.857369
Increasing randomness
The drawback of the above approach is the bias towards always having (even, odd) pairs. To overcome this, we can first remove a random fraction of the DataFrame: small enough to still leave plenty of rows to pick from, but large enough to shift the (even, odd) pairs to (odd, even) pairs in many locations. The fraction of rows to remove should be tuned depending on the initial size and the sample size; I used 20-30% here:
N = 20000
frac = 0.2
idx = (df
       .drop(df.sample(frac=frac).index)   # randomly remove a fraction of rows
       .loc[::2].sample(n=N)               # sample N of the remaining even-positioned rows
       .index
      )
m = df.index.to_series().isin(idx)
df_sample = df.loc[m | m.shift(fill_value=False)]
# check:
# len(df_sample)
# 40000
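Note that neither variant above enforces the "value difference can't exceed 4000" condition. One way to add it (a sketch, assuming df_sample consists of whole consecutive pairs as produced above) is to filter the pairs afterwards, which will shrink the sample slightly:

import numpy as np

# reshape the sampled values into pairs and keep only pairs whose
# value difference does not exceed 4000
vals = df_sample["value"].to_numpy()
keep_pair = np.abs(vals[0::2] - vals[1::2]) <= 4000
df_sample = df_sample[np.repeat(keep_pair, 2)]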
Here's my first attempt (I only just noticed your additional constraint, and I'm not sure whether you need the precise number of samples; if so, you'll have to do some fudging after the masking step below).
import random
import numpy as np

# Temporarily reset the index so we have integers we can add one to.
df = df.reset_index(level=0)
# Choose the first index of each pair.
# Use random.sample if you don't want repeats,
# or random.choices if you don't mind them.
# The code below does allow overlapping pairs such as (1, 2) and (2, 3).
first_indices = np.array(random.sample(sorted(df.index[:-1]), 4))  # 4 pairs here as an example
# Keep only those indices where the value difference with the next row
# does not exceed 4000 (your additional constraint).
mask = [abs(df.loc[i, "value"] - df.loc[i + 1, "value"]) <= 4000
        for i in first_indices]
first_indices = first_indices[np.array(mask)]
# Interleave this array with the same numbers, plus 1.
c = np.empty((first_indices.size * 2,), dtype=first_indices.dtype)
c[0::2] = first_indices
c[1::2] = first_indices + 1
# Filter
df_sample = df[df.index.isin(c)]
# Restore the original index if required.
df = df.set_index("index")
Hope that helps. Regarding the bit where I use a mask to filter the pair indices, this answer might be of help if you need faster alternatives: Filtering (reducing) a NumPy Array

Removing non-triple rows en masse in numpy array

I have an array A that has 1 million rows and 3 columns. The last column contains integers that identify the data in the other two columns. I would like to keep only the rows whose identifier occurs exactly three times, and delete all rows whose identifier occurs any other number of times (only once, twice, or four times, for example). Below is a function remove_loose_ends that I wrote to handle this. However, this function is called many times and is the bottleneck of the entire program. Are there any possible enhancements that could remove the loop from this operation or decrease its runtime in other ways?
import numpy as np
import time

def remove_loose_ends(A):
    # get unique counts
    unique_id, unique_counter = np.unique(A[:, 2], return_counts=True)
    # initialize outgoing index mask
    good_index = np.array([[True] * (A.shape[0])])
    # loop through all unique ids and flip matching rows to False
    # if the id does not occur exactly three times
    for i in range(0, len(unique_id)):
        if unique_counter[i] != 3:
            good_index = good_index ^ (A[:, 2] == unique_id[i])
    # return incoming array with mask applied
    return A[np.squeeze(good_index), :]

# example array A
A = np.random.rand(1000000, 3)
# making last column "unique" integers
A[:, 2] = (A[:, 2] * 1e6).astype(int)

# timing function call
start = time.time()
B = remove_loose_ends(A)
print(time.time() - start)
So, the main problem is that you basically loop over all the values twice, making it roughly an O(n²) operation.
What you could do instead is create an array of booleans directly from the output of numpy.unique and let it do the indexing for you.
For example, something like this:
import numpy as np
import time

def remove_loose_ends(A):
    # get unique counts and the inverse mapping back to the rows
    _, unique_inverse, unique_counter = np.unique(A[:, 2], return_inverse=True, return_counts=True)
    # Obtain boolean array of which integers occurred 3 times
    idx = unique_counter == 3
    # Obtain boolean array of which rows have integers that occurred 3 times
    row_idx = idx[unique_inverse]
    # return incoming array with mask applied
    return A[row_idx, :]

# example array A
A = np.random.rand(1000000, 3)
# making last column "unique" integers
A[:, 2] = (A[:, 2] * 1e6).astype(int)

# timing function call
start = time.time()
B = remove_loose_ends(A)
print(time.time() - start)
I tried timing both versions. I stopped the function you posted after 15 minutes, whereas the one above takes around 0.15 s on my PC.
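As a quick sanity check (a sketch, not part of the original answer), the vectorized remove_loose_ends above can be verified on a small array where the expected result is easy to see:

import numpy as np

# ids 0 and 3 occur exactly three times, 1 occurs twice, 2 occurs four times
ids = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2, 3, 3, 3], dtype=float)
A_small = np.column_stack([np.random.rand(len(ids)), np.random.rand(len(ids)), ids])

kept = remove_loose_ends(A_small)
# only the rows with ids 0 and 3 should survive -> 6 rows
assert kept.shape == (6, 3)
assert set(kept[:, 2]) == {0.0, 3.0}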

Numpy: Finding correspondencies in one array by uniques of other array, arbitrary length

I have a problem where I have two arrays. One contains identifiers which can occur multiple times, let's say
import numpy as np
ind = np.random.randint(0,10,(100,))
and another one of the same length which contains some info, in this case boolean, for each of the elements identified by ind. They are ordered correspondingly.
dat = np.random.randint(0,2,(100,)).astype(bool)
I'm looking for a (faster?) way to do the following: compute np.any() over dat for each distinct element of ind. The number of occurrences per element is, as in the example, random. What I'm doing now is
result = np.empty(len(np.unique(ind)), dtype=bool)
for i, uni in enumerate(np.unique(ind)):
    result[i] = np.any(dat[ind == uni])
Which is sort of slow. Any ideas?
Approach #1
Index ind with dat to select the ones required to be checked, get the binned counts with np.bincount and see which bins have at least one occurrence -
result = np.bincount(ind[dat])>0
If ind has negative numbers, offset it with the min value -
ar = ind[dat]
result = np.bincount(ar-ar.min())>0
Approach #2
One more with np.unique -
unq = np.unique(ind[dat])
n = len(np.unique(ind))
result = np.zeros(n,dtype=bool)
result[unq] = 1
We can use pandas to get n :
import pandas as pd
n = pd.Series(ind).nunique()
Approach #3
One more with indexing -
ar = ind[dat]
result = np.zeros(ar.max()+1,dtype=bool)
result[ar] = 1
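A quick way to sanity-check Approach #1 against the original loop (a sketch; the minlength argument is used so the result has one slot per possible id):

import numpy as np

ind = np.random.randint(0, 10, (100,))
dat = np.random.randint(0, 2, (100,)).astype(bool)

# vectorized: for each id, is there at least one True entry?
result = np.bincount(ind[dat], minlength=ind.max() + 1) > 0

# reference loop from the question
expected = np.array([np.any(dat[ind == uni]) for uni in np.unique(ind)])
assert np.array_equal(result[np.unique(ind)], expected)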

Efficient combined in-place adding/removing of rows of a huge 2D numpy array

I have a 2D NumPy array and it's huge. I have some computer memory, which is not so huge.
A single copy of the array fits snugly in the computer memory. A second copy of this array brings the computer to its knees crying.
Before I can cut up the matrix into smaller, more manageable chunks, I need to add a few rows to it and remove some. Luckily I need to remove more rows than I add, so in theory this could all be done in-place. I'm working on a function to accomplish this, but I'm curious what advice any of you can give me.
The plan so far:
1. Make a list of rows to remove
2. Make a matrix of rows to add
3. Replace rows to remove by the rows to add (one by one; fancy indexing cannot be used here?)
4. Move any rows that still need to be removed to the end of the matrix
5. Call .resize() on the matrix to resize it in memory
Especially step 4 is hard to implement efficiently.
Code so far:
import numpy as np

n_rows = 100
n_columns = 1000000
n_rows_to_drop = 20
n_rows_to_add = 10

# Init huge array
data = np.random.rand(n_rows, n_columns)

# Some rows we drop
to_drop = np.arange(n_rows)
np.random.shuffle(to_drop)
to_drop = to_drop[:n_rows_to_drop]

# Some rows we add
new_data = np.random.rand(n_rows_to_add, n_columns)

# Start replacing rows with new rows
for new_data_idx, to_drop_idx in enumerate(to_drop):
    if new_data_idx >= n_rows_to_add:
        break  # no more new data to add
    # Replace a row to drop with a new row
    data[to_drop_idx] = new_data[new_data_idx]

# These should still be dropped
to_drop = to_drop[n_rows_to_add:]
to_drop.sort()

# Make a list of row indices to keep, last rows first
to_keep = set(range(n_rows)) - set(to_drop)
to_keep = list(to_keep)
to_keep.sort()
to_keep = to_keep[::-1]

# Replace rows to drop with rows at the end of the matrix
for to_drop_idx, to_keep_idx in zip(to_drop, to_keep):
    if to_drop_idx > to_keep_idx:
        # All remaining rows to drop are at the end of the matrix
        break
    data[to_drop_idx] = data[to_keep_idx]

# Resize matrix in memory
data.resize(n_rows - n_rows_to_drop + n_rows_to_add, n_columns)
This seems to work, but is there any way to make this more elegant/efficient? Any way to check whether a copy of the huge array is made at some point?
This seems to do the same as your code but is a little more brief. I'm relatively sure no copies of the big array are made here: assignment through fancy indexing writes into the existing array instead of creating a copy.
import numpy as np

n_rows = 100
n_columns = 100000
n_rows_to_drop = 20
n_rows_to_add = 10

# Init huge array
data = np.random.rand(n_rows, n_columns)

# Some rows we drop (exactly n_rows_to_drop distinct row indices, sorted)
to_drop = np.sort(np.random.choice(n_rows, n_rows_to_drop, replace=False))

# Some rows we add
new_data = np.random.rand(n_rows_to_add, n_columns)

# Start replacing rows to drop with the new rows (writes in place)
data[to_drop[:n_rows_to_add]] = new_data

# These should still be dropped
to_drop = to_drop[n_rows_to_add:]

# The highest-numbered rows that are being kept, last rows first
to_keep = np.setdiff1d(np.arange(n_rows), to_drop, assume_unique=True)[-len(to_drop):][::-1]

# Replace rows to drop with rows at the end of the matrix
for to_drop_i, to_keep_i in zip(to_drop, to_keep):
    if to_drop_i > to_keep_i:
        # All remaining rows to drop are already at the end
        break
    data[to_drop_i] = data[to_keep_i]

# Resize matrix in memory
data.resize(n_rows - n_rows_to_drop + n_rows_to_add, n_columns)
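To address the last part of the question, how to check whether a full copy of the big array is ever made: one option (a sketch; NumPy has reported its allocations to tracemalloc since version 1.13) is to record the peak traced memory around the edits and compare it with the size of a full copy:

import tracemalloc
import numpy as np

data = np.random.rand(100, 100000)      # ~80 MB
new_data = np.random.rand(10, 100000)   # ~8 MB

tracemalloc.start()
data[:10] = new_data                     # should write in place
data[10] = data[-1]                      # moving a single row copies ~0.8 MB at most
_, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

# a full copy of `data` would add roughly 80 MB; the peak should stay far below that
print(f"peak traced memory during the edits: {peak / 1e6:.1f} MB")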
