I have a data set with 4 columns: x,y,z, and value, let's say:
x y z value
0 0 0 0
0 1 0 0
0 2 0 0
1 0 0 0
1 1 0 1
1 2 0 1
2 0 0 0
2 1 0 0
2 2 0 0
I would like to calculate the center of mass CM = (x_m,y_m,z_m) of all values. In the present example, I would like to see (1,1.5,0) as output.
I thought this must be a trivial problem, but I can't find a solution to it on the internet. scipy.ndimage.measurements.center_of_mass seems to be the right thing, but unfortunately the function always returns two values (instead of 3). In addition, I can't find any documentation on how to set up an ndimage from an array: would I use a numpy array N of shape (9, 4)? Would N[:, 0] then be the x-coordinate?
Any help is highly appreciated.
The simplest way I can think of is this: just find an average of the coordinates of mass components weighted by each component's contribution.
import numpy
masses = numpy.array([[0, 0, 0, 0],
                      [0, 1, 0, 0],
                      [0, 2, 0, 0],
                      [1, 0, 0, 0],
                      [1, 1, 0, 1],
                      [1, 2, 0, 1],
                      [2, 0, 0, 0],
                      [2, 1, 0, 0],
                      [2, 2, 0, 0]])
nonZeroMasses = masses[numpy.nonzero(masses[:, 3])]  # Not strictly necessary; you could use masses directly, since zero weights contribute nothing.
CM = numpy.average(nonZeroMasses[:,:3], axis=0, weights=nonZeroMasses[:,3])
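Running this on the sample data from the question gives the expected center of mass:

print(CM)  # [1.  1.5 0. ], i.e. (1, 1.5, 0)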
Another option is to use the scipy center of mass:
from scipy import ndimage
import numpy
masses = numpy.array([[0, 0, 0, 0],
                      [0, 1, 0, 0],
                      [0, 2, 0, 0],
                      [1, 0, 0, 0],
                      [1, 1, 0, 1],
                      [1, 2, 0, 1],
                      [2, 0, 0, 0],
                      [2, 1, 0, 0],
                      [2, 2, 0, 0]])
ndimage.measurements.center_of_mass(masses)
How about:
import numpy as np

# x y z value
table = np.array([[ 5. , 1.3, 8.3, 9. ],
                  [ 6. , 6.7, 1.6, 5.9],
                  [ 9.1, 0.2, 6.2, 3.7],
                  [ 2.2, 2. , 6.7, 4.6],
                  [ 3.4, 5.6, 8.4, 7.3],
                  [ 4.8, 5.9, 5.7, 5.8],
                  [ 3.7, 1.1, 8.2, 2.2],
                  [ 0.3, 0.7, 7.3, 4.6],
                  [ 8.1, 1.9, 7. , 5.3],
                  [ 9.1, 8.2, 3.3, 5.3]])

def com(xyz, mass):
    # center of mass = sum(m_i * r_i) / sum(m_i)
    mass = mass.reshape((-1, 1))
    return (xyz * mass).sum(0) / mass.sum()

print(com(table[:, :3], table[:, 3]))
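As a quick sanity check, a hypothetical two-row table holding just the nonzero masses from the question reproduces the expected (1, 1.5, 0):

check = np.array([[1., 1., 0., 1.],
                  [1., 2., 0., 1.]])
print(com(check[:, :3], check[:, 3]))  # [1.  1.5 0. ]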
Why did ndimage.measurements.center_of_mass not give the expected result?
The key is in how the input data masses was represented: as an array of 4-tuples (x, y, z, value)
# x y z value
[[0, 0, 0, 0],
[0, 1, 0, 0],
[0, 2, 0, 0],
[1, 0, 0, 0],
[1, 1, 0, 1],
[1, 2, 0, 1],
[2, 0, 0, 0],
[2, 1, 0, 0],
[2, 2, 0, 0]]
The array masses here represents the 3-D positions and weights of the masses. Note, however, that this is only a 2-D array; its shape is (9, 4).
The input you need to pass to ndimage to get the expected result is a 3-D array containing zeros everywhere and the weight of each mass at the appropriate coordinates within the array, like this:
from scipy import ndimage
import numpy
masses = numpy.zeros((3, 3, 1))
# x y z value
masses[1, 1, 0] = 1
masses[1, 2, 0] = 1
CM = ndimage.measurements.center_of_mass(masses)
# x y z
# (1.0, 1.5, 0.0)
Which is exactly the expected output.
Note that a limitation of this solution (and of the ndimage library) is that it requires non-negative integer coordinates. It will also not be efficient for large and/or sparse volumes, because each "pixel" of the ndimage needs to be instantiated in memory.
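For completeness, a minimal sketch, assuming the coordinates really are non-negative integers (per the limitation above), that builds the dense volume directly from the question's (x, y, z, value) table:

import numpy
from scipy import ndimage

table = numpy.array([[0, 0, 0, 0], [0, 1, 0, 0], [0, 2, 0, 0],
                     [1, 0, 0, 0], [1, 1, 0, 1], [1, 2, 0, 1],
                     [2, 0, 0, 0], [2, 1, 0, 0], [2, 2, 0, 0]])
coords = table[:, :3]
weights = table[:, 3]

# dense volume just large enough to hold every coordinate
volume = numpy.zeros(tuple(coords.max(axis=0) + 1))
volume[tuple(coords.T)] = weights

print(ndimage.center_of_mass(volume))  # (1.0, 1.5, 0.0)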
mat = [[ 1.  2.  3.  4.  5.]
       [ 6.  7.  8.  9. 10.]
       [11. 12. 13. 14. 15.]]
Suppose I have this NumPy array.
Say I need to extract the 2nd column of each row, convert the values to binary, and then create a vector out of them.
How can I do it using NumPy?
For instance, if I select the 2nd column of this NumPy array, my output should look as follows:
[[0 0 1 0],
[0 1 1 1],
[1 1 0 0]]
I tried as follows:
my_data = np.genfromtxt('data_input')
print(my_data)
my_data_2nd_column = my_data[:, 1]
my_data_2nd_column_binary = Utils.encode(my_data_2nd_column)
my_2nd_column_binary = np.apply_along_axis(Utils.encode, 1, my_data)
print(my_2nd_column_binary)
Numpy has a built-in function for this. First, you can get a particular column using indexing:
>>> arr
array([[ 1, 2, 3, 4, 5],
[ 6, 7, 8, 9, 10],
[11, 12, 13, 14, 15]])
>>> arr[:, [1]]
array([[ 2],
[ 7],
[12]])
Then you can use the built-in np.unpackbits function, but make sure you convert to unsigned 8-bit integers first:
>>> np.unpackbits(arr[:, [1]].astype(np.uint8), axis=1)
array([[0, 0, 0, 0, 0, 0, 1, 0],
[0, 0, 0, 0, 0, 1, 1, 1],
[0, 0, 0, 0, 1, 1, 0, 0]], dtype=uint8)
Of course, if you need the second dimension to have length 4, just slice again; it is probably worth copying the result if you are going to do lots of operations on it:
>>> np.unpackbits(arr[:, [1]].astype(np.uint8), axis=1)[:, -4:]
array([[0, 0, 1, 0],
[0, 1, 1, 1],
[1, 1, 0, 0]], dtype=uint8)
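An alternative sketch that produces the 4-bit rows directly, using right-shifts and a bit mask instead of unpackbits:

>>> (arr[:, [1]] >> np.arange(3, -1, -1)) & 1
array([[0, 0, 1, 0],
       [0, 1, 1, 1],
       [1, 1, 0, 0]])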
Here is the same example done without using the numpy library.
All functions are commented.
mat = [[ 1,  2,  3,  4,  1],
       [ 6,  7,  8,  9, 40],
       [11, 12, 13, 14, 15]]

# convert the binary into a vector of elements
def split(word):
    return [int(char) for char in word]

# returns the vector size of the largest binary
def binaryBig(lista):
    maior = max(lista, key=int)
    temp = "{0:b}".format(maior)
    return len(split(temp))

# convert the element to binary
def binary(x, big):
    temp = split(format(x, "b"))
    for n in range(len(temp), big):
        temp.insert(0, 0)
    return temp

# create the matrix with the binaries
def createBinaryMat(lista):
    big = binaryBig(lista)
    mat = []
    for i in lista:
        mat.append(binary(i, big))
    return mat

# select the column and return the created matrix
def binaryElementsOfColum(colum, mat):
    lista = []
    for i in mat:
        lista.append(i[colum])
    return createBinaryMat(lista)

for i in binaryElementsOfColum(4, mat):
    print(i)
Output:
[0, 0, 0, 0, 0, 1]
[1, 0, 1, 0, 0, 0]
[0, 0, 1, 1, 1, 1]
Let's say I now have a 1x1 matrix, like:
M = Matrix([[2]])
How can I create a new 2x2 matrix from this, filling all the blanks with 0s? That is:
N = Matrix([[2, 0], [0, 0]])
If it were numpy, I could use np.newaxis; however, it seems that there is no newaxis in sympy.
So, I tried:
N = M.reshape(2, 2)
I got the following error:
ValueError: Invalid reshape parameters 2 2
I found that the following expression works:
N = Matrix(2, 2, [M[0], 0, 0, 0])
However, this is a bit awkward.
Is there any better way?
Please note that a scalar multiplication N = M[0] * Matrix(2, 2, [1, 0, 0, 0]) is not acceptable, since next time I may want to convert a 2x2 to a 3x3.
Use sympy.diag.
>>> import sympy as sp
>>> m = sp.Matrix([[2]])
>>> sp.diag(m, 0)
Matrix([
[2, 0],
[0, 0]])
>>> sp.diag(m, 0, 0)
Matrix([
[2, 0, 0],
[0, 0, 0],
[0, 0, 0]])
>>> sp.diag(sp.Matrix([[1, 2], [3, 4]]), 0)
Matrix([
[1, 2, 0],
[3, 4, 0],
[0, 0, 0]])
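diag places its blocks along the diagonal; if you ever need the original matrix at an arbitrary position inside a larger zero matrix, one possible sketch is to build the zero matrix and assign into a slice of it:

>>> n = sp.zeros(3, 3)
>>> n[:m.rows, :m.cols] = m
>>> n
Matrix([
[2, 0, 0],
[0, 0, 0],
[0, 0, 0]])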
I'm trying to calculate the cosine distance between the rows of a matrix in Python and have a couple of questions. So I'm creating the matrix matr, populating it from lists, and then reshaping it for analysis purposes:
s = []
for i in range(len(a)):
    for j in range(len(b_list)):
        s.append(a[i].count(b_list[j]))

matr = np.array(s)
d = matr.reshape((22, 254))
The output of d looks something like:
array([[0, 0, 0, ..., 0, 0, 0],
[2, 0, 0, ..., 1, 0, 0],
[2, 0, 0, ..., 0, 0, 0],
...,
[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0],
[1, 0, 0, ..., 0, 0, 0]])
Then I want to use scipy.spatial.distance.cosine to calculate the cosine distance from the first row to every other row in the d matrix.
How can I do that? Should I use some kind of for loop? I don't have much experience with matrix and array operations.
How can I loop over the second argument (d[1], d[2], and so on) in the following construction, instead of launching it by hand every time:
from scipy.spatial.distance import cosine
x = cosine(d[0], d[6])
You said "calculate cosine from first row to every other else in the d matrix" [sic]. If I understand correctly, you can do that with scipy.spatial.distance.cdist, passing the first row as the first argument and the remaining rows as the second argument:
In [31]: from scipy.spatial.distance import cdist
In [32]: matr = np.random.randint(0, 3, size=(6, 8))
In [33]: matr
Out[33]:
array([[1, 2, 0, 1, 0, 0, 0, 1],
[0, 0, 2, 2, 1, 0, 1, 1],
[2, 0, 2, 1, 1, 2, 0, 2],
[2, 2, 2, 2, 0, 0, 1, 2],
[0, 2, 0, 2, 1, 0, 0, 0],
[0, 0, 0, 1, 2, 2, 2, 2]])
In [34]: cdist(matr[0:1], matr[1:], metric='cosine')
Out[34]: array([[ 0.65811827, 0.5545646 , 0.1752139 , 0.24407105, 0.72499045]])
If it turns out that you want to compute all the pairwise distances in matr, you can use scipy.spatial.distance.pdist.
For example,
In [35]: from scipy.spatial.distance import pdist
In [36]: pdist(matr, metric='cosine')
Out[36]:
array([ 0.65811827, 0.5545646 , 0.1752139 , 0.24407105, 0.72499045,
0.36039785, 0.27625314, 0.49748109, 0.41498206, 0.2799177 ,
0.76429774, 0.37117185, 0.41808563, 0.5765951 , 0.67661917])
Note that the first five values returned by pdist are the same values returned above using cdist.
For further explanation of the return value of pdist, see How does condensed distance matrix work? (pdist)
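If a full square distance matrix is more convenient than the condensed form, scipy.spatial.distance.squareform can expand it; for example, continuing the session above:

In [37]: from scipy.spatial.distance import squareform

In [38]: squareform(pdist(matr, metric='cosine'))[0, 1:]
Out[38]: array([ 0.65811827,  0.5545646 ,  0.1752139 ,  0.24407105,  0.72499045])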
You can just use a simple for loop with scipy.spatial.distance.cosine:
import scipy.spatial.distance
dists = []
for row in matr:
    dists.append(scipy.spatial.distance.cosine(matr[0, :], row))
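Note that the first entry in dists is the distance of row 0 to itself; if you want to skip it, a compact variant of the same loop would be:

dists = [scipy.spatial.distance.cosine(matr[0, :], row) for row in matr[1:]]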
Here's how you might calculate it easily by hand:
from numpy import array as a
from numpy import set_printoptions
from numpy.linalg import norm
from numpy.random import randint  # random_integers is deprecated in recent NumPy

M = randint(1, 11, size=(5, 5))  # create demo matrix of integers in [1, 10]

# dot products of rows against themselves
DotProducts = M.dot(M.T)

# outer product of the row norms
NormKronecker = a([norm(M, axis=1)]) * a([norm(M, axis=1)]).T

CosineSimilarity = DotProducts / NormKronecker
CosineDistance = 1 - CosineSimilarity

set_printoptions(precision=2, suppress=True)
print(CosineDistance)
Output:
[[-0. 0.15 0.1 0.11 0.22]
[ 0.15 0. 0.15 0.13 0.06]
[ 0.1 0.15 0. 0.15 0.14]
[ 0.11 0.13 0.15 0. 0.18]
[ 0.22 0.06 0.14 0.18 -0. ]]
For example, this matrix says that the cosine distance between row 3 and row 2 (or, equivalently, between row 2 and row 3) is 0.15.
I have two sparse* adjacency matrices A1 and A2 of type 'numpy.int64'.
The nodes of the corresponding graphs are labeled by integers and the indices of the matrices correspond to these nodes (the matrix value being the link weight between the nodes).
I'm trying to compute a similarity measure between the graphs. To do this I need to find the adjacency matrix for the subgraph of each graph, which contains the nodes common to both graphs.
Neither equal matrix sizes nor the existence of common nodes between the graphs is guaranteed.
The result should be the same adjacency matrices with values for nodes not in both graphs equal to zero.
Example:
A1:
array([[ 0, 1, 2, 1],
[ 1, 0, 0, 0],
[ 2, 0, 0, 0],
[ 1, 0, 0, 0]])
A2:
array([[ 0, 0, 1],
[ 0, 0, 0],
[ 1, 0, 0]])
Outcome:
A1':
array([[ 0, 0, 2, 0],
[ 0, 0, 0, 0],
[ 2, 0, 0, 0],
[ 0, 0, 0, 0]])
A2':
array([[ 0, 0, 1],
[ 0, 0, 0],
[ 1, 0, 0]])
The matrices I'm using are on the order of 10^5 x 10^5. The resulting size doesn't matter; I'll slice down to the size of the smallest afterwards.
I'll be repeating this operation many times and so speed is important.
Attempts so far:
I can get the list of common nodes by:
np.intersect1d(A1.nonzero()[0], A2.nonzero()[0])
But I can't find a way of using this as a filter to map the values for indices not in this list to 0.
*I don't think I necessarily need to use sparse matrices, though they are much preferable for scalability later.
If I understand your question correctly, based on the example you have provided, you can simply use the numpy.in1d method to build a boolean index array, for example:
A1 = np.array([[0, 1, 2, 1],
               [1, 0, 0, 0],
               [2, 0, 0, 0],
               [1, 0, 0, 0]])

A2 = np.array([[0, 0, 1],
               [0, 0, 0],
               [1, 0, 0]])
idx = np.in1d(A1,A2).reshape(A1.shape)
A1[idx] = 0
print(A1)
# prints
[[0 0 2 0]
[0 0 0 0]
[2 0 0 0]
[0 0 0 0]]
For sparse matrices, the right solution depends on which sparse format you are using. If you are using the csr or csc formats, you can apply the same technique to the stored coefficients of the matrix, A1.data, and then use the resulting array (idx) to modify the corresponding indices, i.e. A1.indices and A1.indptr.
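A rough sketch of that idea, assuming A1 has been converted with scipy.sparse.csr_matrix (only the stored entries are touched, and entries that become zero are dropped afterwards):

from scipy import sparse

A1s = sparse.csr_matrix(A1)
mask = np.in1d(A1s.data, A2)   # test only the stored (nonzero) entries of A1
A1s.data[mask] = 0
A1s.eliminate_zeros()          # remove the now-explicit zeros from the structure
print(A1s.toarray())
# [[0 0 2 0]
#  [0 0 0 0]
#  [2 0 0 0]
#  [0 0 0 0]]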
In the docs for the chi-squared univariate feature selection function of scikit-learn http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.chi2.html, it states
This score can be used to select the n_features features with the highest values for the χ² (chi-square) statistic from X, which must contain booleans or frequencies (e.g., term counts in document classification), relative to the classes.
I am struggling to understand what the corresponding contingency table would look like, especially in the case of frequency features.
For example, consider the below dataset with boolean features and targets:
import numpy as np
>>> X = np.random.randint(2, size=50).reshape(10, 5)
array([[1, 0, 0, 0, 1],
[1, 1, 0, 1, 1],
[1, 0, 0, 0, 0],
[0, 0, 0, 0, 0],
[0, 0, 0, 0, 1],
[1, 0, 0, 0, 1],
[1, 0, 1, 1, 1],
[0, 1, 1, 0, 0],
[1, 0, 1, 1, 1],
[1, 1, 1, 1, 0]])
>>> y = np.random.randint(2, size=10)
array([1, 0, 0, 0, 1, 1, 1, 1, 0, 1])
To construct the contingency table with respect to the first feature, we can do this (excuse my PEP8 violation)
import scipy as sp
import scipy.sparse   # make sp.sparse available
import scipy.stats    # make sp.stats available
>>> contingency_table = sp.sparse.coo_matrix(
... (np.ones_like(y), (X[:, 0], y)),
... shape=(np.unique(X[:, 0]).shape[0], np.unique(y).shape[0])).A
array([[1, 2],
[3, 4]])
So now I can calculate the chi-squared statistic and its p-values
>>> sp.stats.chi2_contingency(contingency_table)
(0.17857142857142855,
0.67260381744151676,
1,
array([[ 1.2, 1.8],
[ 2.8, 4.2]]))
And this ought to be consistent with scikit-learn's chi2
from sklearn.feature_selection import chi2
>>> chi2_, pval = chi2(X, y)
>>> chi2_[0], pval[0]
(0.023809523809523787, 0.87737055606414338)
...Nope. Have I misinterpreted something?
Also, what does the contingency table look like in the case of frequencies? I assumed it would be something like
contingency_table = sp.sparse.coo_matrix(
    (np.ones_like(y), (X[:, 0], y)),
    shape=(X[:, 0].max() + 1, np.unique(y).shape[0])).A
But the corresponding table of expected frequencies will most likely have several zero elements.
Edit:
To clarify further, consider the first feature X[:, 0] that is, say, gender and the targets y, say, handedness.
From this we get the cross tabulation
                 Right-handed   Left-handed (!right-handed)
Male                        1                              2
Female (!male)              3                              4
And we can assess the significance of the difference between the two proportions with a chi-squared test, by comparing the observed counts against the expected frequencies under independence.
sklearn.feature_selection.chi2 does this directly without resorting to explicitly computing the table and obtains the scores using a more efficient procedure that is equivalent to scipy.stats.chisquare.
After explicitly enumerating the table shown above, I wanted to verify it is consistent with chi2 when applying scipy.stats.chi2_contingency and to my dismay, it isn't. I'd like to ask why it isn't.
Consider a column x of X. sklearn.feature_selection.chi2 tests whether
the frequencies of the y values where x is 1 agree with the frequencies of y in
the full population. (@larsmans' answer shows how you can reproduce the calculation with numpy and scipy.) This is not the same as the standard 2x2 contingency table
analysis of x and y. In a 2x2 contingency table analysis, the frequencies of y
where x is 0 also contribute to the test.
Suppose we form the contingency table for x and y:
    |  y=0   y=1
----+-----------
x=0 |    a     b
x=1 |    c     d
Let n = a + b + c + d. This is the number of samples (i.e. same as len(x) and len(y)).
Let nx = c + d. This is the number of occurrences of 1 in x.
Let py1 = (b + d)/n. This is the fraction of the full population where y is 1.
sklearn.feature_selection.chi2 performs a chi2 test on [c, d] using the expected
values [(1-py1)*nx, py1*nx]. This is not the same as the standard contingency table
analysis of a 2x2 table.
Here's an extreme example. Suppose the 2x2 contingency table for x and y is
    |  y=0   y=1
----+-----------
x=0 |    8     8
x=1 |   20   188
The sklearn calculation produces a chi2 score of 1.58, with a p-value of 0.208.
The contingency table analysis of scipy.stats.chi2_contingency gives a chi2 score of 18.6, with a p-value of 1.60e-5.
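A small sketch that reproduces both numbers from this table:

from scipy.stats import chisquare, chi2_contingency

a, b, c, d = 8, 8, 20, 188
n = a + b + c + d
nx = c + d                      # number of occurrences of 1 in x
py1 = (b + d) / n               # fraction of the population where y is 1

# what sklearn.feature_selection.chi2 effectively does for this feature
score, pval = chisquare([c, d], f_exp=[(1 - py1) * nx, py1 * nx])
print(score, pval)              # ~1.58, ~0.208

# standard 2x2 contingency-table analysis (default continuity correction)
chi2_stat, p, dof, expected = chi2_contingency([[a, b], [c, d]])
print(chi2_stat, p)             # ~18.6, ~1.6e-5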
Given your data,
>>> X = array([[1, 0, 0, 0, 1],
... [1, 1, 0, 1, 1],
... [1, 0, 0, 0, 0],
... [0, 0, 0, 0, 0],
... [0, 0, 0, 0, 1],
... [1, 0, 0, 0, 1],
... [1, 0, 1, 1, 1],
... [0, 1, 1, 0, 0],
... [1, 0, 1, 1, 1],
... [1, 1, 1, 1, 0]])
>>> y = array([1, 0, 0, 0, 1, 1, 1, 1, 0, 1])
this is what feature_selection.chi2 computes:
>>> Y = np.vstack([1 - y, y])
>>> observed = np.dot(Y, X)
>>> observed
array([[3, 1, 1, 2, 2],
[4, 2, 3, 2, 4]])
These are the observed feature frequencies, per class, i.e. the contingency table. Then the expected values:
>>> feature_count = X.sum(axis=0)
>>> class_prob = Y.mean(axis=1)
>>> expected = np.dot(feature_count.reshape(-1, 1), class_prob.reshape(1, -1)).T
>>> expected
array([[ 2.8, 1.2, 1.6, 1.6, 2.4],
[ 4.2, 1.8, 2.4, 2.4, 3.6]])
Finally, it runs a χ² test:
>>> from scipy.stats import chisquare
>>> score, pval = chisquare(observed, expected)
>>> score
array([ 0.02380952, 0.05555556, 0.375 , 0.16666667, 0.11111111])
>>> pval
array([ 0.87737056, 0.81366372, 0.54029137, 0.6830914 , 0.73888268])
The scores are the relevant bit: they're used to sort the features by discriminative power. Note that you get one score and one p-value per feature.
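As a cross-check, calling sklearn directly on the X and y above should return exactly these arrays; the first pair matches the chi2_[0], pval[0] values computed earlier in the question:

>>> from sklearn.feature_selection import chi2
>>> chi2_, pval = chi2(X, y)
>>> chi2_
array([ 0.02380952,  0.05555556,  0.375     ,  0.16666667,  0.11111111])
>>> pval
array([ 0.87737056,  0.81366372,  0.54029137,  0.6830914 ,  0.73888268])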