Finding combinatorial occurrences along combination of rows between 2 numpy array - python

I am trying to find a fast, at least partially vectorized, way to count combinatorial occurrences between two 2D numpy arrays in order to identify Single Point Polymorphism linkage.
The shape of each array is
(factors, samples)
an example for matrix 1 is as follows:
array([[0., 1., 1.],
[1., 0., 1.]])
and matrix 2
array([[1., 1., 0.],
[0., 0., 0.]])
I need the total number of occurrences along the samples axis for each ordered pair of values taken at the same sample position from a row of matrix 1 and a row of matrix 2 (order matters because the (1,0) count is different from the (0,1) count). The combinations are therefore [(0, 0), (0, 1), (1, 0), (1, 1)], and for each combination the output is a (factor, factor) matrix of counts.
For combination (0,0), for instance, we get the matrix
array([[0, 1],
[0, 1]])
because (0,0) occurs
0 times along row 0 of matrix 1 & row 0 of matrix 2,
1 time along row 0 of matrix 1 & row 1 of matrix 2,
0 times along row 1 of matrix 1 & row 0 of matrix 2,
1 time along row 1 of matrix 1 & row 1 of matrix 2.

With example data
import numpy as np
array1 = np.array([
[0., 1., 1.],
[1., 0., 1.]])
array2 = np.array([
[1., 1., 0.],
[0., 0., 0.]])
We can count the desired combinations with np.einsum and reshape to a suitable array
c1 = np.array([1-array1, array1]).astype('int')
c2 = np.array([1-array2, array2]).astype('int')
np.einsum('ijk,lmk->iljm', c1, c2).reshape(-1, len(array1), len(array2))
Output
array([[[0, 1], # counts for (0,0)
[0, 1]],
[[1, 0], # counts for (0,1)
[1, 0]],
[[1, 2], # counts for (1,0)
[1, 2]],
[[1, 0], # counts for (1,1)
[1, 0]]])
Checking that the previous results are equal to dot products
import itertools as it
np.array([x @ y.T for x, y in it.product(c1, c2)])
Output
array([[[0, 1],
[0, 1]],
[[1, 0],
[1, 0]],
[[1, 2],
[1, 2]],
[[1, 0],
[1, 0]]])

Since I found the solution while deriving a manual example for the question, I will just post it: the counts can be computed with dot products of indicator matrices:
matrix1_0 = (array1==0).astype('int')
matrix1_1 = (array1==1).astype('int')
matrix2_0 = (array2==0).astype('int')
matrix2_1 = (array2==1).astype('int')
count_00 = np.dot(matrix1_0 , matrix2_0.T)
count_01 = np.dot(matrix1_0 , matrix2_1.T)
count_10 = np.dot(matrix1_1 , matrix2_0.T)
count_11 = np.dot(matrix1_1 , matrix2_1.T)
Each product gives, for every pair of factors, the number of samples in which that combination occurs (summed along the sample axis, axis 1 here).
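As a sanity check, the four dot products can be wrapped in one helper and compared against a brute-force count. This is only a sketch: the name `pair_counts` is made up for illustration, and it assumes both arrays contain only 0s and 1s and share the same number of samples.

```python
import itertools as it
import numpy as np

def pair_counts(m1, m2):
    """Hypothetical helper: map (v1, v2) -> (factors1, factors2) count matrix."""
    m1 = np.asarray(m1).astype(int)
    m2 = np.asarray(m2).astype(int)
    # Indicator arrays: ind[v] is 1 wherever the input equals v (v is 0 or 1).
    ind1 = [1 - m1, m1]
    ind2 = [1 - m2, m2]
    return {(v1, v2): ind1[v1] @ ind2[v2].T
            for v1, v2 in it.product((0, 1), repeat=2)}

array1 = np.array([[0., 1., 1.], [1., 0., 1.]])
array2 = np.array([[1., 1., 0.], [0., 0., 0.]])
counts = pair_counts(array1, array2)

# Brute-force check: count matching samples one row pair at a time.
for (v1, v2), mat in counts.items():
    for i, j in it.product(range(len(array1)), range(len(array2))):
        expected = sum(1 for s in range(array1.shape[1])
                       if array1[i, s] == v1 and array2[j, s] == v2)
        assert mat[i, j] == expected

print(counts[(0, 0)])
```

For every row pair the four counts add up to the number of samples, which is a cheap extra consistency check on real data.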

Related

Vectorized creation of an array of diagonal square arrays from a linear array in Numpy or Tensorflow

I have an array of shape [batch_size, N], for example:
[[1 2]
[3 4]
[5 6]]
and I need to create a 3-index array with shape [batch_size, N, N] where, for every batch element, I have an N x N diagonal matrix whose diagonal is taken from the corresponding batch element. In this simple case, the result I am looking for is:
[
[[1,0],[0,2]],
[[3,0],[0,4]],
[[5,0],[0,6]],
]
How can I do this operation without for loops, exploiting vectorization? I guess it is an extension of dimension, but I cannot find the correct function to do this.
(I need it as I am working with tensorflow and prototyping with numpy).
Try it in tensorflow:
import tensorflow as tf
A = [[1,2],[3 ,4],[5,6]]
B = tf.matrix_diag(A)
print(B.eval(session=tf.Session()))
[[[1 0]
[0 2]]
[[3 0]
[0 4]]
[[5 0]
[0 6]]]
Approach #1
Here's a vectorized one with np.einsum for input array, a -
# Initialize o/p array
out = np.zeros(a.shape + (a.shape[1],),dtype=a.dtype)
# Get diagonal view and assign into it input array values
diag = np.einsum('ijj->ij',out)
diag[:] = a
Approach #2
Another based on slicing for assignment -
m,n = a.shape
out = np.zeros((m,n,n),dtype=a.dtype)
out.reshape(-1,n**2)[...,::n+1] = a
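Both approaches assume an existing 2D array `a`. A minimal self-contained sketch, using the example array from the question, that runs them side by side and compares each against a plain `np.diag` loop (the names `out1`, `out2` and `reference` are illustrative only):

```python
import numpy as np

a = np.array([[1, 2], [3, 4], [5, 6]])
m, n = a.shape

# Approach #1: einsum returns a writeable view of each diagonal.
out1 = np.zeros((m, n, n), dtype=a.dtype)
np.einsum('ijj->ij', out1)[...] = a

# Approach #2: in the flattened n*n block, diagonal entries sit n+1 apart.
out2 = np.zeros((m, n, n), dtype=a.dtype)
out2.reshape(m, n * n)[:, ::n + 1] = a

# Loop-based reference to compare against.
reference = np.stack([np.diag(row) for row in a])
assert np.array_equal(out1, reference)
assert np.array_equal(out2, reference)
print(out1)
```

Both variants rely on `reshape` and `einsum` returning views of the freshly allocated, contiguous `out` array, so the assignment writes into it in place.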
Using np.expand_dims with an element-wise product with np.eye
a = np.array([[1, 2],
[3, 4],
[5, 6]])
N = a.shape[1]
a = np.expand_dims(a, axis=1)
a*np.eye(N)
array([[[1., 0.],
[0., 2.]],
[[3., 0.],
[0., 4.]],
[[5., 0.],
[0., 6.]]])
Explanation
np.expand_dims(a, axis=1) adds a new axis to a, which will now be a (3, 1, 2) ndarray:
array([[[1, 2]],
[[3, 4]],
[[5, 6]]])
You can now multiply this array with a size N identity matrix, which you can generate with np.eye:
np.eye(N)
array([[1., 0.],
[0., 1.]])
Which will yield the desired output:
a*np.eye(N)
array([[[1., 0.],
[0., 2.]],
[[3., 0.],
[0., 4.]],
[[5., 0.],
[0., 6.]]])
You can use numpy.diag
m = [[1, 2],
[3, 4],
[5, 6]]
[np.diag(b) for b in m]
EDIT The following plot shows the average execution time for the solution above (solid line), compared against @Divakar's (dashed line), for different batch sizes and different matrix sizes
I don't believe you get much of an improvement, but this is just based on this simple metric
You basically want a function that does the opposite of/reverses np.block(..)
I needed the same thing, so I wrote this little function:
def split_blocks(x, m=2, n=2):
    """
    Reverse the action of np.block(..)
    >>> x = np.random.uniform(-1, 1, (2, 18, 20))
    >>> assert (np.block(split_blocks(x, 3, 4)) == x).all()
    :param x: (.., M, N) input matrix to split into blocks
    :param m: number of row splits
    :param n: number of column splits
    :return: nested list (m lists of n blocks), suitable for np.block
    """
    # np.asarray avoids a copy when x is already an ndarray
    x = np.asarray(x)
    nd = x.ndim
    *shape, nr, nc = x.shape
    # split rows and columns, then move the two block indices to the front
    return list(map(list, x.reshape((*shape, m, nr//m, n, nc//n)).transpose(nd-2, nd, *range(nd-2), nd-1, nd+1)))

Left pad a 1D column vector with 0s

How do I reshape a 1D numpy array into a 2D numpy array
and fill the columns on the left with zeros?
For example:
Input:
a = np.array([1,2,3])
Expected output:
np.array([[0, 0, 1],
[0, 0, 2],
[0, 0, 3]])
How do I do this?
a = np.array([1,2,3])
Option 1
np.pad (this should be fast)
np.pad(a[:, None], ((0, 0), (2, 0)), mode='constant')
array([[0, 0, 1],
[0, 0, 2],
[0, 0, 3]])
Option 2
Assign a slice to np.zeros (also very fast)
b = np.zeros((3, 3))
b[:, -1] = a
array([[0., 0., 1.],
[0., 0., 2.],
[0., 0., 3.]])
For your specific example:
a = np.array([1,2,3])
a.resize([3, 3])
a = np.rot90(a, k=3)
hope this helps
Create a function that creates a zero array of m x n x 3 dimensionality. Then go through your original matrix and assign its values to those parts of the new zero matrix that should be non-zero.
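For the 1D input in the question, the zero-array idea can be sketched as a small function. The name `left_pad_column` and the `width` parameter are made up for illustration, and the sketch assumes `width` is at least 1:

```python
import numpy as np

def left_pad_column(a, width):
    """Hypothetical helper: put the 1D array a in the last of `width` columns."""
    a = np.asarray(a)
    # Start from zeros, keeping the input dtype (Option 2 above yields floats).
    out = np.zeros((a.size, width), dtype=a.dtype)
    # Only the last column is non-zero.
    out[:, -1] = a
    return out

padded = left_pad_column([1, 2, 3], 3)
print(padded)
```

Passing the dtype through to `np.zeros` keeps integer input as integers, which is the only difference from assigning into a default float array.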

numpy broadcasting with 3d arrays

Is it possible to apply numpy broadcasting (with 1D arrays),
x=np.arange(3)[:,np.newaxis]
y=np.arange(3)
x+y=
array([[0, 1, 2],
[1, 2, 3],
[2, 3, 4]])
to 3D matrices similar to the one below, such that each element in a[i] is treated as a 1D vector like in the example above?
a=np.zeros((2,2,2))
a[0]=1
b=a
result=a+b
resulting in
result[0,0]=array([[2, 2],
[2, 2]])
result[0,1]=array([[1, 1],
[1, 1]])
result[1,0]=array([[1, 1],
[1, 1]])
result[1,1]=array([[0, 0],
[0, 0]])
You can do this in the same way as with 1D arrays, i.e. insert a new axis between axis 0 and axis 1 in either a or b:
a + b[:,None] # or a[:,None] + b
(a + b[:,None])[0,0]
#array([[ 2., 2.],
# [ 2., 2.]])
(a + b[:,None])[0,1]
#array([[ 1., 1.],
# [ 1., 1.]])
(a + b[:,None])[1,0]
#array([[ 1., 1.],
# [ 1., 1.]])
(a + b[:,None])[1,1]
#array([[ 0., 0.],
# [ 0., 0.]])
Since a and b are of same shape, say (2,2,2), a+b will indeed work.
The way broadcasting works is that it matches the dimensions of the operands in reverse order, starting from the last dimension going up (e.g. considering columns before rows in a two-dimensional case). If the dimensions match then the next dimension is considered.
If the dimensions don't match and one of them is 1, that operand is repeated along that dimension to match the other operand (e.g. if a.shape = (2,1,2) and b.shape = (2,2,2), the values of a are repeated along axis 1 to make the shape (2,2,2)). If they don't match and neither is 1, the operation raises an error.
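The matching rule can be checked directly on the arrays from the question. This is a small sketch; `c` and `d` are extra illustrative arrays that do not appear in the original post:

```python
import numpy as np

a = np.zeros((2, 2, 2))
a[0] = 1
b = a.copy()

# b[:, None] has shape (2, 1, 2, 2); a is treated as (1, 2, 2, 2).
# Aligning from the right, each size-1 axis is stretched to 2.
result = a + b[:, None]
print(result.shape)

# Each block of the result pairs one slice of b with one slice of a.
for i in range(2):
    for j in range(2):
        assert np.array_equal(result[i, j], b[i] + a[j])

# A size-1 axis behaves as if it were repeated explicitly.
c = np.arange(4.0).reshape(2, 1, 2)
d = np.ones((2, 2, 2))
assert (c + d).shape == (2, 2, 2)
assert np.array_equal(c + d, np.repeat(c, 2, axis=1) + d)
```

No data is actually copied when an axis is stretched; `np.repeat` is used here only to show what the implicit repetition is equivalent to.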

Is there a better way to produce a membership matrix (one-hot array) for an array of cluster assignments in Python? [duplicate]

This question already has answers here:
Convert array of indices to one-hot encoded array in NumPy
After running kmeans I can easily get an array with the assigned clusters for every data point. Now I want to get a membership matrix (one-hot array) which has the different clusters as columns and indicates the cluster assignment by either 1 or 0 in the matrix for each data point.
My code is shown below and it works but I am wondering if there is a more elegant way to do the same.
km = KMeans(n_clusters=3).fit(data)
membership_matrix = np.stack([np.where(km.labels_ == 0, 1, 0),
np.where(km.labels_ == 1, 1, 0),
np.where(km.labels_ == 2, 1, 0)],
axis=1)
According to this question, you can create a one-hot array, which is equivalent to your membership matrix, from the array of cluster assignments. Here is how you do it using np.eye
import numpy as np
clusters = np.array([2,1,2,2,0,1])
n_clusters = max(clusters) + 1
membership_matrix = np.eye(n_clusters)[clusters]
Output is as follows
array([[ 0., 0., 1.],
[ 0., 1., 0.],
[ 0., 0., 1.],
[ 0., 0., 1.],
[ 1., 0., 0.],
[ 0., 1., 0.]])
Here's a method that's agnostic to the number of clusters you have (with your method, you'll have to "stack" more things if you have more clusters).
This code sample assumes you have six data points and 3 clusters:
NUM_DATA_POINTS = 6
NUM_CLUSTERS = 3
clusters = np.array([2,1,2,2,0,1]) # hard-coded as an example, but this is your KMeans output
# create your empty membership matrix
membership = np.zeros((NUM_DATA_POINTS, NUM_CLUSTERS))
membership[np.arange(NUM_DATA_POINTS), clusters] = 1
The key feature being used here is 2D array indexing - in the last line of code above, we index into the rows of membership sequentially (np.arange creates an incrementing sequence from 0 to NUM_DATA_POINTS-1) and into the columns of membership using the cluster assignments. Here's the relevant numpy reference.
It would produce the following membership matrix:
>>> membership
array([[ 0., 0., 1.],
[ 0., 1., 0.],
[ 0., 0., 1.],
[ 0., 0., 1.],
[ 1., 0., 0.],
[ 0., 1., 0.]])
You are looking for LabelBinarizer. Give this code a try:
from sklearn import preprocessing
lb = preprocessing.LabelBinarizer()
membership_matrix = lb.fit_transform(km.labels_)
In contrast to other solutions proposed here, this approach:
Generates a compact membership matrix when the labels are not consecutive numbers.
Is able to deal with categorical labels.
Sample run:
In [9]: lb.fit_transform([0, 1, 2, 0, 2, 2])
Out[9]:
array([[1, 0, 0],
[0, 1, 0],
[0, 0, 1],
[1, 0, 0],
[0, 0, 1],
[0, 0, 1]])
In [10]: lb.fit_transform([0, 1, 9, 0, 9, 9])
Out[10]:
array([[1, 0, 0],
[0, 1, 0],
[0, 0, 1],
[1, 0, 0],
[0, 0, 1],
[0, 0, 1]])
In [11]: lb.fit_transform(['first', 'second', 'third', 'first', 'third', 'third'])
Out[11]:
array([[1, 0, 0],
[0, 1, 0],
[0, 0, 1],
[1, 0, 0],
[0, 0, 1],
[0, 0, 1]])
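If pulling in scikit-learn is not desirable, a similar compact result for non-consecutive labels can be sketched in plain NumPy with `np.unique(..., return_inverse=True)`. This is an alternative illustration rather than what `LabelBinarizer` does internally, and the variable names are made up:

```python
import numpy as np

labels = np.array([0, 1, 9, 0, 9, 9])

# classes holds the sorted distinct labels; inverse maps every data point
# to the position of its label in classes (0, 1, 2, ...).
classes, inverse = np.unique(labels, return_inverse=True)

# Index the identity matrix with the compact codes to get one-hot rows.
membership = np.eye(len(classes), dtype=int)[inverse]
print(classes)
print(membership)
```

Column k of `membership` corresponds to `classes[k]`, so the original labels can be recovered with `classes[membership.argmax(axis=1)]`.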

How to modify specific field in ndarray matrix

I have a matrix like this
x = [[a, b],
[c, d]]
But instead of a,b,c,d for each of those values there's a list of numbers, for example [x, xx, xxx].
I would like to create another matrix that would have ones only on positions where x==0 && xx==0 && xxx==0. How can I do that without loops? For example, I could do B = [x == 0], but how can I do that where there's a list instead of single matrix element?
If the list is of fixed length, you can create a 3d array and then use np.all() on its last axis:
In [1]: import numpy as np
In [2]: a = np.zeros((2, 2, 3)) # 2x2 matrix, 3 variants for each element
In [3]: a[0, 0] = [0, 1, 2] # filling one element of the "matrix"
In [4]: a[0, 1] = 1
In [5]: a[1, 1] = 0 # this
In [6]: a[1, 0] = 0 # and this are "all zeros"
In [7]: a
Out[7]:
array([[[ 0., 1., 2.],
[ 1., 1., 1.]],
[[ 0., 0., 0.],
[ 0., 0., 0.]]])
Now let's construct the matrix b:
In [8]: np.all(a == 0, axis=-1).astype(int)
Out[8]:
array([[0, 0],
[1, 1]])
If you want another condition, you can modify the expression in the following way:
In [9]: np.all(a - [0, 1, 2] == 0, axis=-1).astype(int)
Out[9]:
array([[1, 0],
[0, 0]])
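The two expressions above follow the same pattern, which can be sketched as one function. The name `match_mask` is hypothetical, and the comparison is exact, so for floating-point data that comes out of a computation `np.isclose` may be a better fit than `==`:

```python
import numpy as np

def match_mask(a, target):
    """Hypothetical helper: 1 where the last-axis vector equals target, else 0."""
    a = np.asarray(a)
    # target broadcasts against the last axis of a; np.all collapses that axis.
    return np.all(a == np.asarray(target), axis=-1).astype(int)

a = np.zeros((2, 2, 3))
a[0, 0] = [0, 1, 2]
a[0, 1] = 1

zeros_mask = match_mask(a, [0, 0, 0])
ramp_mask = match_mask(a, [0, 1, 2])
print(zeros_mask)
print(ramp_mask)
```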
