I am trying to find a fast, at least partially vectorized way to count combinatorial co-occurrences between two 2D NumPy arrays, in order to identify Single Point Polymorphism linkage.
The shape of each array is
(factors, samples)
An example for matrix 1 is as follows:
array([[0., 1., 1.],
[1., 0., 1.]])
and matrix 2
array([[1., 1., 0.],
[0., 0., 0.]])
I need the total number of occurrences along the samples axis for each ordered pair of values taken at the same sample position from one factor of matrix 1 and one factor of matrix 2 (order matters, because the (1,0) count is different from the (0,1) count). The combinations are therefore [(0, 0), (0, 1), (1, 0), (1, 1)], and for each combination the output is a (factor, factor) matrix of counts.
For the combination (0,0), for instance, we get the matrix
array([[0, 1],
       [0, 1]])
because the number of samples with (0,0) is
0 for row 0 of matrix 1 and row 0 of matrix 2,
1 for row 0 of matrix 1 and row 1 of matrix 2,
0 for row 1 of matrix 1 and row 0 of matrix 2,
1 for row 1 of matrix 1 and row 1 of matrix 2.
With example data
import numpy as np
array1 = np.array([
[0., 1., 1.],
[1., 0., 1.]])
array2 = np.array([
[1., 1., 0.],
[0., 0., 0.]])
We can count the desired combinations with np.einsum and reshape to a suitable array
c1 = np.array([1-array1, array1]).astype('int')
c2 = np.array([1-array2, array2]).astype('int')
np.einsum('ijk,lmk->iljm', c1, c2).reshape(-1, len(array1), len(array2))
Output
array([[[0, 1], # counts for (0,0)
[0, 1]],
[[1, 0], # counts for (0,1)
[1, 0]],
[[1, 2], # counts for (1,0)
[1, 2]],
[[1, 0], # counts for (1,1)
[1, 0]]])
Checking that the previous results are equal to dot products
import itertools as it
np.array([x @ y.T for x, y in it.product(c1, c2)])
Output
array([[[0, 1],
[0, 1]],
[[1, 0],
[1, 0]],
[[1, 2],
[1, 2]],
[[1, 0],
[1, 0]]])
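As a further sanity check, the einsum result can be compared against a plain-loop reference. The sketch below is only illustrative; the helper name brute_force_counts is made up for this example and is not part of the answer above.

```python
import itertools as it
import numpy as np

array1 = np.array([[0., 1., 1.],
                   [1., 0., 1.]])
array2 = np.array([[1., 1., 0.],
                   [0., 0., 0.]])

def brute_force_counts(m1, m2, values=(0, 1)):
    """Hypothetical helper: count (v1, v2) pairs along the sample axis with loops."""
    out = np.zeros((len(values) ** 2, len(m1), len(m2)), dtype=int)
    # it.product yields (0,0), (0,1), (1,0), (1,1), the same order as the einsum output
    for k, (v1, v2) in enumerate(it.product(values, repeat=2)):
        for i, row1 in enumerate(m1):
            for j, row2 in enumerate(m2):
                out[k, i, j] = np.sum((row1 == v1) & (row2 == v2))
    return out

# Vectorized version from the answer above
c1 = np.array([1 - array1, array1]).astype(int)
c2 = np.array([1 - array2, array2]).astype(int)
vectorized = np.einsum('ijk,lmk->iljm', c1, c2).reshape(-1, len(array1), len(array2))

reference = brute_force_counts(array1, array2)
print(np.array_equal(vectorized, reference))  # True
```

The loop version is far slower on real data, so it is only useful for validating the vectorized result on small inputs.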
Since I found the solution while working through a manual example for the question, I will just note that this can be solved with dot products of indicator matrices:
matrix1_0 = (array1 == 0).astype('int')
matrix1_1 = (array1 == 1).astype('int')
matrix2_0 = (array2 == 0).astype('int')
matrix2_1 = (array2 == 1).astype('int')
count_00 = np.dot(matrix1_0, matrix2_0.T)
count_01 = np.dot(matrix1_0, matrix2_1.T)
count_10 = np.dot(matrix1_1, matrix2_0.T)
count_11 = np.dot(matrix1_1, matrix2_1.T)
Each result has shape (factor, factor) and holds, for one combination, the number of occurrences summed along the sample axis (axis 1 here).
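The dot-product idea can be wrapped in a small function. This is a minimal sketch, assuming the indicator matrices are built from the whole arrays (shape (factors, samples)); the function name cooccurrence_counts and the values parameter are illustrative choices, not part of the original answer.

```python
import numpy as np

def cooccurrence_counts(m1, m2, values=(0, 1)):
    """Return {(v1, v2): (factors1, factors2) count matrix} via dot products."""
    counts = {}
    for v1 in values:
        ind1 = (m1 == v1).astype(int)      # indicator of v1 in matrix 1
        for v2 in values:
            ind2 = (m2 == v2).astype(int)  # indicator of v2 in matrix 2
            # The dot product over the sample axis counts joint occurrences
            counts[(v1, v2)] = ind1 @ ind2.T
    return counts

array1 = np.array([[0., 1., 1.],
                   [1., 0., 1.]])
array2 = np.array([[1., 1., 0.],
                   [0., 0., 0.]])

counts = cooccurrence_counts(array1, array2)
print(counts[(1, 0)])
# [[1 2]
#  [1 2]]
```

Because every sample falls into exactly one of the four combinations, the four matrices should add up to the number of samples in every cell, which makes for a cheap consistency check.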
I'm looking for an efficient solution that avoids a 'for loop' for an array-related problem I'm having. I want to use a huge 1D array (A, size 250,000) of values between 0 and 40 to index one dimension, and an array (B) of the same size with values between 0 and 9995 to index a second dimension.
The result should be an array of shape (41, 9996) that holds, for each index pair, the number of times a value from A occurs together with a value from B at the same position.
Example:
A = [0, 3, 2, 4, 3]
B = [1, 2, 2, 0, 2]
which should result in:
[[0, 1, 0],
 [0, 0, 0],
 [0, 0, 1],
 [0, 0, 2],
 [1, 0, 0]]
The naive way is too slow because the amount of data is huge. What you would be able to do is:
out = np.zeros((41, 9996))
for i, j in zip(A, B):
    out[i, j] += 1
which needs one Python-level iteration per element of A...
I've tried this, which works only partially:
out = np.zeros((41, 9996))
out[A, B] += 1
This produces a result with 1 at every indexed position, regardless of how many times each pair of values occurs.
Does anyone have a clue how to fix this? Thanks in advance!
You are looking for a sparse tensor:
import torch
A = [0, 3, 2, 4, 3]
B = [1, 2, 2, 0, 2]
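idx = torch.tensor([A, B])  # shape (2, n): row indices, then column indices
# torch.sparse_coo_tensor replaces the legacy torch.sparse.FloatTensor constructor;
# duplicate coordinates are summed when converting to dense
torch.sparse_coo_tensor(idx, torch.ones(idx.shape[1]), (5, 3)).to_dense()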
Output:
tensor([[0., 1., 0.],
[0., 0., 0.],
[0., 0., 1.],
[0., 0., 2.],
[1., 0., 0.]])
You can also do the same with scipy sparse matrix:
import numpy as np
from scipy.sparse import coo_matrix
coo_matrix((np.ones(len(A)), (np.array(A), np.array(B))), shape=(5,3)).toarray()
output:
array([[0., 1., 0.],
[0., 0., 0.],
[0., 0., 1.],
[0., 0., 2.],
[1., 0., 0.]])
Sometimes it is better to leave the matrix in its sparse representation, rather than forcing it to be "dense" again.
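As a rough illustration of staying sparse, the sketch below builds the full (41, 9996) count matrix from the question without ever allocating the dense array. The integer dtype and the conversion to CSR are choices made for this example, not something the answer prescribes.

```python
import numpy as np
from scipy.sparse import coo_matrix

A = np.array([0, 3, 2, 4, 3])
B = np.array([1, 2, 2, 0, 2])
shape = (41, 9996)  # full size from the question

# Converting COO to CSR sums entries that share the same (row, col) coordinate
counts = coo_matrix((np.ones(len(A), dtype=np.int64), (A, B)), shape=shape).tocsr()

print(counts.shape)   # (41, 9996)
print(counts.nnz)     # 4 stored cells instead of 41 * 9996 dense ones
print(counts[3, 2])   # 2
```

Individual counts can still be looked up by index, and the matrix can be converted with .toarray() later if a dense result turns out to be needed.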
Use numpy.add.at:
import numpy as np
A = [0, 3, 2, 4, 3]
B = [1, 2, 2, 0, 2]
arr = np.zeros((5, 3))
np.add.at(arr, (A, B), 1)
print(arr)
Output
[[0. 1. 0.]
[0. 0. 0.]
[0. 0. 1.]
[0. 0. 2.]
[1. 0. 0.]]
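To see why np.add.at is needed, here is a small side-by-side comparison with plain fancy-index assignment, using the same example data. With arr[A, B] += 1 the repeated pair (3, 2) is written only once, while np.add.at performs an unbuffered update and accumulates every occurrence.

```python
import numpy as np

A = [0, 3, 2, 4, 3]
B = [1, 2, 2, 0, 2]

# Buffered: the pair (3, 2) appears twice but is only incremented once
buffered = np.zeros((5, 3), dtype=int)
buffered[A, B] += 1

# Unbuffered: every occurrence of a repeated index pair is accumulated
unbuffered = np.zeros((5, 3), dtype=int)
np.add.at(unbuffered, (A, B), 1)

print(buffered[3, 2], unbuffered[3, 2])  # 1 2
```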
Given that the numbers fall in a small range, bincount would be a good choice for bin-based summing:
def accumulate_coords(A, B):
    nrows = A.max() + 1
    ncols = B.max() + 1
    return np.bincount(A*ncols + B, minlength=nrows*ncols).reshape(-1, ncols)
Sample run -
In [55]: A
Out[55]: array([0, 3, 2, 4, 3])
In [56]: B
Out[56]: array([1, 2, 2, 0, 2])
In [58]: accumulate_coords(A,B)
Out[58]:
array([[0, 1, 0],
[0, 0, 0],
[0, 0, 1],
[0, 0, 2],
[1, 0, 0]])
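Note that the function above sizes the output from the largest values actually present in A and B. If the output must always have the fixed shape (41, 9996) from the question, one possible variant is sketched below; the name accumulate_coords_fixed and the shape argument are assumptions made for this example.

```python
import numpy as np

def accumulate_coords_fixed(A, B, shape):
    """Count (A[k], B[k]) pairs into an array of the given fixed 2D shape."""
    A = np.asarray(A)
    B = np.asarray(B)
    # Map each (row, col) pair to a flat index; out-of-range indices raise ValueError
    flat = np.ravel_multi_index((A, B), shape)
    return np.bincount(flat, minlength=shape[0] * shape[1]).reshape(shape)

A = np.array([0, 3, 2, 4, 3])
B = np.array([1, 2, 2, 0, 2])

out = accumulate_coords_fixed(A, B, (41, 9996))
print(out.shape)   # (41, 9996)
print(out[3, 2])   # 2
```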
Is it possible to apply numpy broadcasting (with 1D arrays),
x=np.arange(3)[:,np.newaxis]
y=np.arange(3)
x+y=
array([[0, 1, 2],
[1, 2, 3],
[2, 3, 4]])
to 3D matrices like the one below, so that each a[i] is treated the way a single element of the 1D vectors is treated in the example above?
a=np.zeros((2,2,2))
a[0]=1
b=a
result=a+b
resulting in
result[0,0]=array([[2, 2],
[2, 2]])
result[0,1]=array([[1, 1],
[1, 1]])
result[1,0]=array([[1, 1],
[1, 1]])
result[1,1]=array([[0, 0],
[0, 0]])
You can do this the same way as for 1D arrays, i.e., insert a new axis between axis 0 and axis 1 in either a or b:
a + b[:,None] # or a[:,None] + b
(a + b[:,None])[0,0]
#array([[ 2., 2.],
# [ 2., 2.]])
(a + b[:,None])[0,1]
#array([[ 1., 1.],
# [ 1., 1.]])
(a + b[:,None])[1,0]
#array([[ 1., 1.],
# [ 1., 1.]])
(a + b[:,None])[1,1]
#array([[ 0., 0.],
# [ 0., 0.]])
Since a and b have the same shape, say (2,2,2), a+b will indeed work, but only elementwise.
Broadcasting matches the dimensions of the operands in reverse order, starting from the last dimension and moving up (e.g. considering columns before rows in the two-dimensional case). If the dimensions match, the next dimension is considered.
If the dimensions don't match and one of them is 1, that operand is repeated along that dimension to match the other operand (e.g. if a.shape = (2,1,2) and b.shape = (2,2,2), the values of a are repeated along axis 1 to give the shape (2,2,2)). If they don't match and neither is 1, broadcasting fails with an error.
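The rules can be checked by looking at shapes only. The following sketch uses made-up arrays; the names stretched, pairwise and failed exist only for this illustration.

```python
import numpy as np

# A size-1 axis is stretched to match the other operand
a = np.zeros((2, 1, 2))
b = np.ones((2, 2, 2))
stretched = a + b
print(stretched.shape)        # (2, 2, 2)

# Inserting a new axis turns elementwise addition into pairwise addition:
# (2, 2, 2) against (2, 1, 2, 2) broadcasts to (2, 2, 2, 2)
c = np.zeros((2, 2, 2))
c[0] = 1
pairwise = c + c[:, None]
print(c[:, None].shape)       # (2, 1, 2, 2)
print(pairwise.shape)         # (2, 2, 2, 2)

# Mismatched axes where neither size is 1 cannot be broadcast
failed = False
try:
    np.zeros((2, 3)) + np.zeros((2, 2))
except ValueError:
    failed = True
print(failed)                 # True
```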
After running kmeans I can easily get an array with the assigned cluster for every data point. Now I want to get a membership matrix (one-hot array) that has the different clusters as columns and indicates the cluster assignment of each data point with either 1 or 0.
My code is shown below. It works, but I am wondering if there is a more elegant way to do the same.
km = KMeans(n_clusters=3).fit(data)
membership_matrix = np.stack([np.where(km.labels_ == 0, 1, 0),
                              np.where(km.labels_ == 1, 1, 0),
                              np.where(km.labels_ == 2, 1, 0)],
                             axis=1)
You can create a one-hot array, which is equivalent to your membership matrix, from the array of cluster labels, as described in the linked question. Here is how to do it using np.eye:
import numpy as np
clusters = np.array([2,1,2,2,0,1])
n_clusters = max(clusters) + 1
membership_matrix = np.eye(n_clusters)[clusters]
Output is as follows
array([[ 0., 0., 1.],
[ 0., 1., 0.],
[ 0., 0., 1.],
[ 0., 0., 1.],
[ 1., 0., 0.],
[ 0., 1., 0.]])
Here's a method that's agnostic to the number of clusters you have (with your method, you'll have to "stack" more things if you have more clusters).
This code sample assumes you have six data points and 3 clusters:
NUM_DATA_POINTS = 6
NUM_CLUSTERS = 3
clusters = np.array([2,1,2,2,0,1]) # hard-coded as an example, but this is your KMeans output
# create your empty membership matrix
membership = np.zeros((NUM_DATA_POINTS, NUM_CLUSTERS))
membership[np.arange(NUM_DATA_POINTS), clusters] = 1
The key feature being used here is 2D array indexing - in the last line of code above, we index into the rows of membership sequentially (np.arange creates an incrementing sequence from 0 to NUM_DATA_POINTS-1) and into the columns of membership using the cluster assignments. Here's the relevant numpy reference.
It would produce the following membership matrix:
>>> membership
array([[ 0., 0., 1.],
[ 0., 1., 0.],
[ 0., 0., 1.],
[ 0., 0., 1.],
[ 1., 0., 0.],
[ 0., 1., 0.]])
You are looking for LabelBinarizer. Give this code a try:
from sklearn import preprocessing
lb = preprocessing.LabelBinarizer()
membership_matrix = lb.fit_transform(km.labels_)
In contrast to other solutions proposed here, this approach:
Generates a compact membership matrix when the labels are not consecutive numbers.
Is able to deal with categorical labels.
Sample run:
In [9]: lb.fit_transform([0, 1, 2, 0, 2, 2])
Out[9]:
array([[1, 0, 0],
[0, 1, 0],
[0, 0, 1],
[1, 0, 0],
[0, 0, 1],
[0, 0, 1]])
In [10]: lb.fit_transform([0, 1, 9, 0, 9, 9])
Out[10]:
array([[1, 0, 0],
[0, 1, 0],
[0, 0, 1],
[1, 0, 0],
[0, 0, 1],
[0, 0, 1]])
In [11]: lb.fit_transform(['first', 'second', 'third', 'first', 'third', 'third'])
Out[11]:
array([[1, 0, 0],
[0, 1, 0],
[0, 0, 1],
[1, 0, 0],
[0, 0, 1],
[0, 0, 1]])
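For comparison, a similar compact encoding can be sketched in plain NumPy with np.unique and return_inverse, which maps arbitrary labels to consecutive integers first. This is only an illustration, not how LabelBinarizer is implemented, and the helper name one_hot_from_labels is made up for this example.

```python
import numpy as np

def one_hot_from_labels(labels):
    """Return (sorted unique labels, one-hot matrix with one column per label)."""
    # inverse[k] is the position of labels[k] within the sorted unique labels
    classes, inverse = np.unique(labels, return_inverse=True)
    return classes, np.eye(len(classes), dtype=int)[inverse]

classes, membership = one_hot_from_labels([0, 1, 9, 0, 9, 9])
print(classes)      # [0 1 9]
print(membership)
# [[1 0 0]
#  [0 1 0]
#  [0 0 1]
#  [1 0 0]
#  [0 0 1]
#  [0 0 1]]
```

Unlike this sketch, LabelBinarizer also remembers the fitted classes, so the same encoding can be applied to new data with transform.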
I have a matrix like this
x = [[a, b],
[c, d]]
But instead of a,b,c,d for each of those values there's a list of numbers, for example [x, xx, xxx].
I would like to create another matrix that would have ones only on positions where x==0 && xx==0 && xxx==0. How can I do that without loops? For example, I could do B = [x == 0], but how can I do that where there's a list instead of single matrix element?
If the list is of fixed length, you can create a 3d array and then use np.all() on its last axis:
In [1]: import numpy as np
In [2]: a = np.zeros((2, 2, 3)) # 2x2 matrix, 3 variants for each element
In [3]: a[0, 0] = [0, 1, 2] # filling one element of the "matrix"
In [4]: a[0, 1] = 1
In [5]: a[1, 1] = 0 # this
In [6]: a[1, 0] = 0 # and this are "all zeros"
In [7]: a
Out[7]:
array([[[ 0., 1., 2.],
[ 1., 1., 1.]],
[[ 0., 0., 0.],
[ 0., 0., 0.]]])
Now let's construct the matrix b:
In [8]: np.all(a == 0, axis=-1).astype(int)
Out[8]:
array([[0, 0],
[1, 1]])
If you want another condition, you can modify the expression in the following way:
In [9]: np.all(a - [0, 1, 2] == 0, axis=-1).astype(int)
Out[9]:
array([[1, 0],
[0, 0]])
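The same condition can also be written as a direct comparison against a target vector, which broadcasts along the last axis. The sketch below rebuilds the example array; the names target, exact and close are illustrative, and np.isclose is shown only as an option for data that comes from floating-point computation.

```python
import numpy as np

a = np.zeros((2, 2, 3))   # 2x2 matrix, 3 values per element
a[0, 0] = [0, 1, 2]
a[0, 1] = 1

target = np.array([0, 1, 2])

# Exact match of all three values at each matrix position
exact = np.all(a == target, axis=-1).astype(int)

# Tolerant match, useful when values are the result of float arithmetic
close = np.all(np.isclose(a, target), axis=-1).astype(int)

print(exact)
# [[1 0]
#  [0 0]]
```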