Finding the indices of matching columns in two NumPy arrays - python

I have two 2D NumPy arrays of shape (12550, 200) and (12550, 10). I need to find the column indices of the first array whose columns match the columns of the second array.
Eg:
ar1 = [[1,2,3,4],[4,5,6,7],[1,3,4,5],[6,7,8,5]]
ar2 = [[1,3],[4,6],[1,4],[6,8]]
so the matching columns are 1,4,1,6 and 3,6,4,8
I need the indices of these columns in ar1 as output, i.e. [0,2]
Can anyone help me with Python code that is fast enough, given that the original arrays are large?

Check this out:
import numpy as np
ar1 = np.array([[1,2,3,4],[4,5,6,7],[1,3,4,5],[6,7,8,5]])
ar2 = np.array([[1,3],[4,6],[1,4],[6,8]])
np.where((ar1[:,None].T == ar2.T).all(axis=2))[0]
gives
array([0, 2], dtype=int64)
meaning column 0 of ar2 is found at column 0 of ar1, and column 1 of ar2 is found at column 2 of ar1.
The transpose is used because you care about columns rather than rows. The [:,None] adds an axis for broadcasting, so that every column of ar1 is tested against every column of ar2. The all() checks that entire columns match. Finally, the [0] element of the np.where result gives the ar1 column indices where this happens.
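The same comparison can be wrapped in a small helper that also reports which ar2 column each match belongs to. This is only a sketch: the name matching_columns is made up for illustration, and it assumes both arrays have the same number of rows.

```python
import numpy as np

def matching_columns(ar1, ar2):
    """Return (ar1_cols, ar2_cols): index pairs of equal columns.

    Hypothetical helper; assumes ar1 and ar2 have the same number of rows.
    """
    # ar1.T[:, None, :] has shape (n1, 1, rows) and ar2.T[None, :, :] has
    # shape (1, n2, rows), so broadcasting compares every ar1 column with
    # every ar2 column.
    equal = (ar1.T[:, None, :] == ar2.T[None, :, :]).all(axis=2)  # (n1, n2)
    ar1_cols, ar2_cols = np.nonzero(equal)
    return ar1_cols, ar2_cols

ar1 = np.array([[1, 2, 3, 4], [4, 5, 6, 7], [1, 3, 4, 5], [6, 7, 8, 5]])
ar2 = np.array([[1, 3], [4, 6], [1, 4], [6, 8]])
ar1_cols, ar2_cols = matching_columns(ar1, ar2)
print(ar1_cols)  # [0 2]  indices into ar1
print(ar2_cols)  # [0 1]  the ar2 column matched by each of them
```

The intermediate boolean array has shape (n1, n2, rows). For arrays of shape (12550, 200) and (12550, 10) that is 200 * 10 * 12550 booleans, roughly 25 MB, which should be manageable.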

Related

Is there a way to write a python function that will create 'N' arrays? (see body)

I have a NumPy array of shape (20, 3). (So 20 3-by-1 arrays. Correct me if I'm wrong, I am still pretty new to Python.)
I need to separate it into 3 arrays of shape 20,1 where the first array is 20 elements that are the 0th element of each 3 by 1 array. Second array is also 20 elements that are the 1st element of each 3 by 1 array, etc.
I am not sure if I need to write a function for this. Here is what I have tried:
Essentially I'm trying to create an array of 3 20 by 1 arrays that I can later index to get the separate 20 by 1 arrays.
a = np.load()  # loads file
num = 20  # the num is if I need to change array size
num_2 = 3
for j in range(0, num):
    for l in range(0, num_2):
        array_elements = np.zeros(3)
        array_elements[l] = a[j:][l]
This gives the following error:
'''
ValueError: setting an array element with a sequence
'''
I have also tried making it a dictionary and making the dictionary values lists that are appended, but it only gives the first or last value of the 20 that I need.
Your array has shape (20, 3), this means it's a 2-dimensional array with 20 rows and 3 columns in each row.
You can access data in this array by indexing using numbers or ':' to indicate ranges. You want to split this in to 3 arrays of shape (20, 1), so one array per column. To do this you can pick the column with numbers and use ':' to mean 'all of the rows'. So, to access the three different columns: a[:, 0], a[:, 1] and a[:, 2].
You can then assign these to separate variables if you wish, e.g. arr = a[:, 0], but this is a view of the original data in a, not a copy. This means any changes made through arr will also appear in the corresponding data in a.
If you want a new array so this doesn't happen, use the .copy() method. If you set arr = a[:, 0].copy(), arr is completely separate from a and changes made to one will not affect the other.
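A small demonstration of the view/copy difference, using a made-up (20, 3) array in place of the loaded file. It also shows one way to get a (20, 1) result, since a[:, 0] on its own has shape (20,):

```python
import numpy as np

a = np.arange(60).reshape(20, 3)  # stand-in for the loaded (20, 3) array

view = a[:, 0]                # basic slicing returns a view of a
independent = a[:, 0].copy()  # separate data

view[0] = -1         # writes through to a
independent[1] = -2  # does not touch a

print(a[0, 0])  # -1
print(a[1, 0])  # 3
print(np.shares_memory(a, view), np.shares_memory(a, independent))  # True False

# Slicing with a range keeps the column axis, giving shape (20, 1).
col = a[:, 0:1]
print(col.shape)  # (20, 1)
```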
Essentially you want to group your array's elements by their column index. There are plenty of ways of doing this. NumPy does not have a group-by method, but one option is to split the array horizontally into a new array and reshape it.
old_length = 3
new_length = 20
a = np.array(np.hsplit(a, old_length)).reshape(old_length, new_length)
Edit: A simpler way to get the same result is to transpose the array. Note that rotating with np.rot90(a, k=-1) (or k=3, since k is the number of 90-degree rotations) also turns columns into rows, but each row comes out in reversed order, so it is not equivalent.
a = a.T
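A quick check on a hypothetical (20, 3) array compares the split-and-reshape result with a plain transpose and with np.rot90. It is worth running before relying on the rotation, because np.rot90 with k=-1 reverses the order of the elements within each row:

```python
import numpy as np

a = np.arange(60).reshape(20, 3)  # hypothetical (20, 3) input

split = np.array(np.hsplit(a, 3)).reshape(3, 20)
rotated = np.rot90(a, k=-1)

print(np.array_equal(split, a.T))             # True
print(np.array_equal(rotated, a.T))           # False
print(np.array_equal(rotated, a.T[:, ::-1]))  # True: same rows, reversed
```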

Compare and store elements of multidimensional array to two new arrays

Assume I have the following simple array:
my_array = np.array([[1,2],[2,4],[3,6],[2,1]])
which corresponds to another parent array:
parent_array = np.array([0,1,2,3])
Of course, there is a function that maps parent_array to my_array, but it is not important what function this is.
Goal:
I want to use my_array to create two new arrays A and B by iterating over each row of my_array: for row i, if the value in the first column of my_array[i] is greater than the value in the second column, I will store parent_array[i] in A. Otherwise (if the value in the second column of my_array[i] is bigger) I will store parent_array[i] in B.
So for the case above the result would be:
A = [3]
because only in the fourth row of my_array does the first column have the greater value, and
B = [0,1,2]
because in the first three rows the second column has the greater value.
Now, although I know how to save the greater element in a row of columns to a new array, the fact that each row in my_array is associated with a row in parent_array is confusing me. I don't know how to correlate them.
Summary:
I therefore need to associate each row of parent_array with the corresponding row of my_array and then check the latter row by row: if the value of the first column is greater in my_array[i], I save parent_array[i] in A, while if the second column is greater in my_array[i], I save parent_array[i] in B.
Use boolean array indexing for this: create boolean condition array by comparing values from 1st and 2nd column of my_array and then use it to select values from parent_array:
cond = my_array[:,0] > my_array[:,1]
A, B = parent_array[cond], parent_array[~cond]
A
# [3]
B
# [0 1 2]
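With ~cond, rows where the two columns are equal end up in B. If ties should be kept apart, a sketch with strict comparisons could look like this (the extra tied row and the name ties are added purely for illustration):

```python
import numpy as np

# The last row is a tie, added for illustration.
my_array = np.array([[1, 2], [2, 4], [3, 6], [2, 1], [5, 5]])
parent_array = np.array([0, 1, 2, 3, 4])

first, second = my_array[:, 0], my_array[:, 1]
A = parent_array[first > second]      # first column strictly greater
B = parent_array[first < second]      # second column strictly greater
ties = parent_array[first == second]  # neither is greater

print(A)     # [3]
print(B)     # [0 1 2]
print(ties)  # [4]
```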

Numpy setting every column in a matrix to a certain value matching a condition

I have a matrix D and sort every row to get the indices (argsort). I'm trying to set the values of some_matrix at indices 1-5 in np.argsort(D) to 1. What I have below does what I need, but is there a way to do this in one line with NumPy arrays?
some_matrix = np.zeros((n,n))
for i in range(n):
    some_matrix[i,np.argsort(D)[i,1:5]] = 1
Firstly, note that you don't need a full sort, only a partition of elements 1-4 (I assume you need elements 1,2,3,4, because that's what your code does). So let's use that:
#assuming you want indices 1,2,3,4 of the sorted array, in any order
indices = np.argpartition(D, (1, 4), axis=1)[:, 1:5]
Now we have, for each row of D, the indices of the elements at sorted positions 1 through 4, i.e. the second through fifth smallest. This selects the same elements as indices = np.argsort(D, 1)[:, 1:5], though possibly in a different order, and will be faster for large arrays. All we need is to set these elements to 1:
np.put_along_axis(some_matrix, indices, 1, axis=1)
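To check that the vectorised version agrees with the original loop, the two can be compared on a small test matrix. The matrix below is made up and uses distinct values, so there are no ties that could make the two selections differ:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8
# Made-up test matrix with distinct values (no ties).
D = rng.permutation(n * n).reshape(n, n).astype(float)

# Original loop from the question.
loop_result = np.zeros((n, n))
for i in range(n):
    loop_result[i, np.argsort(D)[i, 1:5]] = 1

# Partition-based version.
indices = np.argpartition(D, (1, 4), axis=1)[:, 1:5]
vectorized = np.zeros((n, n))
np.put_along_axis(vectorized, indices, 1, axis=1)

print(np.array_equal(loop_result, vectorized))  # True
```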

How can I access an array with an array of indexes in python?

I would like to access a multidimensional python array with an array of indexes, using the whole array to index the target element.
Let me explain it better:
A = np.arange(4).reshape(2,2)
a = [1,1]
>>> A[a[0],a[1]]
3
My intention is to pass the array without hard-coding the indexes values and get the same result, that is the value A[1,1]. I tried but the only way I found is working differently:
>>> A[a]
array([[2, 3],
       [2, 3]])
What results is the construction of a new array where each value of the index array selects one row from the array being indexed and the resultant array has the resulting shape (number of index elements, size of row).
Thank you.
Pass a tuple (not a list) to __getitem__ (the [..] indexer).
A[tuple(a)]
3
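The difference between the two forms, side by side: a tuple is treated as one index per axis, while a list is treated as an array of row indices. The second half, with the made-up array idx, sketches how the same idea extends to several index pairs at once:

```python
import numpy as np

A = np.arange(4).reshape(2, 2)
a = [1, 1]

single = A[tuple(a)]  # same as A[1, 1]
rows = A[a]           # list: selects row 1 twice

print(single)  # 3
print(rows)

# Several (row, col) pairs: one index array per axis.
idx = np.array([[0, 1], [1, 1]])  # hypothetical pairs (0, 1) and (1, 1)
picked = A[tuple(idx.T)]
print(picked)  # [1 3]
```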

Python Numpy: Coalesce and return first nonzero observation

I am currently new to NumPy, but very proficient with SQL.
I used a function called coalesce in SQL, which I was disappointed not to find in NumPy. I need this function to create a third array want from 2 arrays i.e. array1 and array2, where zero/ missing observations in array1 are replaced by observations in array2 under the same address/Location. I can't figure out how to use np.where?
Once this task is accomplished, I would like to take the lower diagonal of this array want and then populate a final array want2 noting the first non-zero observation. If all observations i.e. coalesce(array1, array2) returns missing or 0 in want2, then assign by default zero.
I have written an example demonstrating the desired behavior.
import numpy as np
array1= np.array(([-10,0,20],[-1,0,0],[0,34,-50]))
array2= np.array(([10,10,50],[10,0,25],[50,45,0]))
# Coalesce array1,array2 i.e. get the first non-zero value from array1, then from array2.
# if array1 is empty or zero, then populate table want with values from array2 under same address
want=np.tril(np.array(([-10,10,20],[-1,0,25],[50,34,-50])))
print(array1)
print(array2)
print(want)
# print first instance of nonzero observation from each column of table want
want2=np.array([-10,34,-50])
print(want2)
"Coalesce": use putmask to replace the zeros in a copy of array1 with the values from array2 at the same positions:
want = array1.copy()
np.putmask(want, want == 0, array2)
First nonzero element of each column of np.tril(want):
where_nonzero = np.where(np.tril(want) != 0)
# For the where arrays, get the position of only
# the first entry for each column
first_indices = np.unique(where_nonzero[1], return_index=True)[1]
# Get the values from want for those indices
want2 = want[(where_nonzero[0][first_indices], where_nonzero[1][first_indices])]
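The question also asks for a default of zero when a column has no nonzero entry, which the np.unique approach does not cover (such a column is simply left out). One possible sketch, using np.where for the coalesce step as the question hints and argmax to find the first nonzero row; the intermediate names lower, nonzero, first_row and cols are illustrative:

```python
import numpy as np

array1 = np.array([[-10, 0, 20], [-1, 0, 0], [0, 34, -50]])
array2 = np.array([[10, 10, 50], [10, 0, 25], [50, 45, 0]])

# Coalesce: take array1 where it is nonzero, otherwise array2.
want = np.where(array1 != 0, array1, array2)
lower = np.tril(want)

nonzero = lower != 0
first_row = nonzero.argmax(axis=0)  # row of the first True per column (0 if none)
cols = np.arange(lower.shape[1])
# Columns with no nonzero entry fall back to 0.
want2 = np.where(nonzero.any(axis=0), lower[first_row, cols], 0)

print(want2)  # [-10  34 -50]
```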
