I want to define a column vector v = (0, 0, …, 0, 1) in Python. The vector should have s components for an arbitrary s: the first s-1 components are 0 and the last component is 1. How do I do this when s is not fixed in advance? Thank you for your help!
As I said, I wanted to use np.array, but since s is arbitrary, I don't know in advance how many components to pass to it.
numpy.repeat is one option:
import numpy as np

s = 10
v = np.repeat([0, 1], [s-1, 1])
Output: array([0, 0, 0, 0, 0, 0, 0, 0, 0, 1])
You can generalize to any number of values:
# one "1", zero "2", three "3", two "4"
np.repeat([1, 2, 3, 4], [1, 0, 3, 2])
# array([1, 3, 3, 3, 4, 4])
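If you specifically need a two-dimensional column vector of shape (s, 1) rather than a flat array, a reshape at the end does it; this is a small extra sketch, not part of the original answer:
import numpy as np

s = 10
# s-1 zeros followed by a single one, reshaped into an (s, 1) column vector.
v = np.repeat([0, 1], [s - 1, 1]).reshape(-1, 1)
# v.shape == (10, 1)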
I want to write code that builds an adjacency matrix such that (for example):
If X[0] is [0, 1, 2, 0, 1, 0], then
A[0, 1] = 1
A[1, 2] = 1
A[2, 0] = 1
A[0, 1] = 1
A[1, 0] = 1
The following code works, but it is too slow. Please help me vectorize it, at least over the batch (first) dimension:
import torch

A = torch.zeros((3, 3, 3), dtype=torch.float)
X = torch.tensor([[0, 1, 2, 0, 1, 0], [1, 0, 0, 2, 1, 1], [0, 0, 2, 2, 1, 1]])
for a, x in zip(A, X):
    for i, j in zip(x, x[1:]):
        a[i, j] = 1
Thanks! :)
I am pretty sure that there is a much simpler way of doing this, but I tried to stay within the realm of torch function calls, to make sure that any gradient operation can be tracked properly.
If this is not required for backpropagation, I strongly suggest you look into solutions that use numpy functions instead, because I think you are more likely to find something suitable there. But, without further ado, here is the solution I came up with.
It essentially transforms your X vector into a series of tuple entries that correspond to positions in A. For this, we need to align some of the indices (specifically, the first dimension is only implicitly given in X, since the first list in X corresponds to A[0,:,:], the second list to A[1,:,:], and so on).
This is also probably where you can start optimizing the code, because I did not find a reasonable description of such a matrix, and therefore had to come up with my own way of creating it.
# Start by "aligning" your shifted view of X.
# Essentially, take all but the last element of each row,
# and stack it on top of all but the first element.
X_shift = torch.stack([X[:, :-1], X[:, 1:]], dim=2)
# X_shift.shape: (3, 5, 2) in your example

# To assign this properly, we need to turn it into a "concatenated" list,
# where each entry corresponds to a 2D tuple in the respective dimension of A.
temp_tuples = X_shift.view(-1, 2).transpose(0, 1)
# temp_tuples.shape: (2, 15) in your example. Below are the values:
# tensor([[0, 1, 2, 0, 1, 1, 0, 0, 2, 1, 0, 0, 2, 2, 1],
#         [1, 2, 0, 1, 0, 0, 0, 2, 1, 1, 0, 2, 2, 1, 1]])

# Now we have to create a matrix to indicate the proper "first dimension index".
fix_dims = torch.repeat_interleave(torch.arange(0, 3, 1), len(X[0])-1, 0).unsqueeze(dim=0)
# fix_dims.shape: (1, 15)
# Long story short, this creates the following vector:
# tensor([[0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2]])

# Note that the unsqueeze is necessary to properly concatenate the two matrices:
access_tuples = tuple(torch.cat([fix_dims, temp_tuples], dim=0))
A[access_tuples] = 1
This further assumes that every row in X produces the same number of tuples. If that is not the case, you have to build the fix_dims vector manually, repeating each index i exactly len(X[i]) - 1 times (see the sketch below). If the rows are all of equal length, as in your example, you can safely use the proposed solution.
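For the unequal-length case, one possible way to build fix_dims by hand, assuming X is then a list of 1-D tensors of varying lengths rather than a single rectangular tensor, is the following sketch (my addition, not part of the original answer):
import torch

# Assumed: X is a list of 1-D tensors with possibly different lengths.
repeats = torch.tensor([len(x) - 1 for x in X])   # number of (i, j) pairs per row
fix_dims = torch.repeat_interleave(torch.arange(len(X)), repeats).unsqueeze(dim=0)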
Make X a tuple instead of a tensor:
import torch

A = torch.zeros((3, 3, 3), dtype=torch.float)
X = ([0, 1, 2, 0, 1, 0], [1, 0, 0, 2, 1, 1], [0, 0, 2, 2, 1, 1])
A[X] = 1
For example, by casting it like this: A[tuple(X)]
I have looked at this question but it hasn't really given me any answers.
Essentially, how can I determine whether a strong correlation exists using np.correlate? I expect output like MATLAB's xcorr with the 'coeff' option, which I can interpret (1 is a strong correlation at lag l and 0 is no correlation at lag l), but np.correlate produces values greater than 1, even when the input vectors have been normalised to lie between 0 and 1.
Example input
import numpy as np
x = np.random.rand(10)
y = np.random.rand(10)
np.correlate(x, y, 'full')
This gives the following output:
array([ 0.15711279, 0.24562736, 0.48078652, 0.69477838, 1.07376669,
1.28020871, 1.39717118, 1.78545567, 1.85084435, 1.89776181,
1.92940874, 2.05102884, 1.35671247, 1.54329503, 0.8892999 ,
0.67574802, 0.90464743, 0.20475408, 0.33001517])
How can I tell what is a strong correlation and what is weak if I don't know what the maximum possible correlation value is?
Another example:
In [10]: x = [0,1,2,1,0,0]
In [11]: y = [0,0,1,2,1,0]
In [12]: np.correlate(x, y, 'full')
Out[12]: array([0, 0, 1, 4, 6, 4, 1, 0, 0, 0, 0])
Edit: This was a badly asked question, but the marked answer does answer what was asked. It is important to note something I found whilst digging around in this area: you cannot compare raw outputs from cross-correlation across signal pairs. In other words, it would not be valid to use the outputs of cross-correlation to say that signal x is better correlated with signal y than with signal z; cross-correlation does not provide that kind of information.
numpy.correlate is under-documented. I think that we can make sense of it, though. Let's start with your sample case:
>>> import numpy as np
>>> x = [0,1,2,1,0,0]
>>> y = [0,0,1,2,1,0]
>>> np.correlate(x, y, 'full')
array([0, 0, 1, 4, 6, 4, 1, 0, 0, 0, 0])
Those numbers are the cross-correlations for each of the possible lags. To make that more clear, let's put the lag numbers above the correlations:
>>> np.concatenate((np.arange(-5, 6)[None,...], np.correlate(x, y, 'full')[None,...]), axis=0)
array([[-5, -4, -3, -2, -1,  0,  1,  2,  3,  4,  5],
       [ 0,  0,  1,  4,  6,  4,  1,  0,  0,  0,  0]])
Here, we can see that the cross-correlation reaches its peak at a lag of -1. If you look at x and y above, that makes sense: if one shifts y to the left by one place, it matches x exactly.
To verify this, let's try again, this time shifting y further:
>>> y = [0, 0, 0, 0, 1, 2]
>>> np.concatenate((np.arange(-5, 6)[None,...], np.correlate(x, y, 'full')[None,...]), axis=0)
array([[-5, -4, -3, -2, -1,  0,  1,  2,  3,  4,  5],
       [ 0,  2,  5,  4,  1,  0,  0,  0,  0,  0,  0]])
Now, the correlation peaks at a lag of -3, meaning that the best match between x and y occurs when y is shifted to the left by 3 places.
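To get something on the same scale as MATLAB's xcorr with the 'coeff' option, which is what the question asks about, a common approach is to divide the 'full' cross-correlation by the product of the two vectors' Euclidean norms, so that a perfect match at some lag gives 1. A minimal sketch (my addition, assuming that scaling is what you want; subtract the means first if you want a correlation-coefficient-style measure):
import numpy as np

def xcorr_coeff(x, y):
    # Cross-correlation at all lags, scaled so that a perfect match gives 1
    # (intended to mimic MATLAB's xcorr(x, y, 'coeff'): divide by ||x|| * ||y||).
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return np.correlate(x, y, 'full') / (np.linalg.norm(x) * np.linalg.norm(y))

x = [0, 1, 2, 1, 0, 0]
y = [0, 0, 1, 2, 1, 0]
c = xcorr_coeff(x, y)
lags = np.arange(-(len(y) - 1), len(x))
print(lags[np.argmax(c)], c.max())   # lag -1, value ~1.0 (a perfect match up to a shift)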
Problem: I have an MxN matrix where M>=N. I want to identify the groups of linearly-interdependent column-vectors within this matrix.
I'm hoping there's a fast and easy way to do this in numpy.
>>> a = np.random.randn(7, 6)
>>> a[:, 3] = 2*a[:, 0]-a[:, 4]
>>> a[:, 5] = 3*a[:, 1]
I'm looking for a function get_column_groups, which will return
>>> get_column_groups(a)
array([0, 1, 2, 0, 0, 1])
I guess bonus points if it also returns the rank of each group, e.g.:
>>> groups, group_ranks = get_column_groups(a)
>>> groups
array([0, 1, 2, 0, 0, 1])
>>> group_ranks
[2, 1, 1]
As far as I understand the question, computing the spark of the matrix is a subproblem of what you are requesting, which renders it NP-complete in most cases.
There may be algorithms that do better than a combinatorial evaluation of column subsets up to a size of rank(a), but that would be a start (see the sketch below).
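A minimal brute-force sketch of that combinatorial evaluation (my own illustration; it reports the minimal linearly dependent column subsets rather than the final group labels and ranks, and it will be slow for anything but small matrices):
from itertools import combinations
import numpy as np

def dependent_subsets(a, max_size=None):
    # Report minimal sets of column indices that are linearly dependent,
    # checking subsets of increasing size up to rank(a) + 1.
    n_cols = a.shape[1]
    if max_size is None:
        max_size = np.linalg.matrix_rank(a) + 1
    found = []
    for size in range(2, max_size + 1):
        for cols in combinations(range(n_cols), size):
            # Skip supersets of dependent subsets we have already found.
            if any(set(f) <= set(cols) for f in found):
                continue
            if np.linalg.matrix_rank(a[:, list(cols)]) < size:
                found.append(cols)
    return found

a = np.random.randn(7, 6)
a[:, 3] = 2*a[:, 0] - a[:, 4]
a[:, 5] = 3*a[:, 1]
print(dependent_subsets(a))   # [(1, 5), (0, 3, 4)]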
In the docs for the chi-squared univariate feature selection function of scikit-learn http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.chi2.html, it states
This score can be used to select the n_features features with the highest values for the χ² (chi-square) statistic from X, which must contain booleans or frequencies (e.g., term counts in document classification), relative to the classes.
I am struggling to understand what the corresponding contingency table would look like, especially in the case of frequency features.
For example, consider the below dataset with boolean features and targets:
import numpy as np
>>> X = np.random.randint(2, size=50).reshape(10, 5)
array([[1, 0, 0, 0, 1],
[1, 1, 0, 1, 1],
[1, 0, 0, 0, 0],
[0, 0, 0, 0, 0],
[0, 0, 0, 0, 1],
[1, 0, 0, 0, 1],
[1, 0, 1, 1, 1],
[0, 1, 1, 0, 0],
[1, 0, 1, 1, 1],
[1, 1, 1, 1, 0]])
>>> y = np.random.randint(2, size=10)
array([1, 0, 0, 0, 1, 1, 1, 1, 0, 1])
To construct the contingency table with respect to the first feature, we can do this (excuse my PEP8 violation)
import scipy as sp
import scipy.sparse
import scipy.stats
>>> contingency_table = sp.sparse.coo_matrix(
...     (np.ones_like(y), (X[:, 0], y)),
...     shape=(np.unique(X[:, 0]).shape[0], np.unique(y).shape[0])).A
array([[1, 2],
[3, 4]])
So now I can calculate the chi-squared statistic and its p-values
>>> sp.stats.chi2_contingency(contingency_table)
(0.17857142857142855,
0.67260381744151676,
1,
array([[ 1.2, 1.8],
[ 2.8, 4.2]]))
And this ought to be consistent with scikit-learn's chi2
from sklearn.feature_selection import chi2
>>> chi2_, pval = chi2(X, y)
>>> chi2_[0], pval[0]
(0.023809523809523787, 0.87737055606414338)
...Nope. Have I misinterpreted something?
Also, what does the contingency table look like in the case of frequencies? I assumed it would be something like
contingency_table = sp.sparse.coo_matrix(
(np.ones_like(y), (X[:, 0], y)),
shape=(X[:, 0].max()+1, np.unique(y).shape[0])).A
But the corresponding table of expected frequencies will most likely have several zero elements.
Edit:
To clarify further, suppose the first feature X[:, 0] is, say, gender and the target y is, say, handedness.
From this we get the cross-tabulation:
                 Right-handed   Left-handed (!right-handed)
Male                  1                     2
Female (!male)        3                     4
And we can assess the significance of the difference between the two proportions using the chi-squared test, where the expected frequency of each cell is (row total × column total) / grand total.
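For concreteness, here is a small sketch (my addition) of how those expected frequencies are computed from the margins of the 2x2 table; the values match the expected array that scipy.stats.chi2_contingency returned above:
import numpy as np

observed = np.array([[1, 2], [3, 4]])
# expected[i, j] = row_total[i] * col_total[j] / grand_total
expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()
print(expected)   # [[1.2 1.8]
                  #  [2.8 4.2]]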
sklearn.feature_selection.chi2 does this directly, without explicitly computing the table, and obtains the scores using a more efficient procedure that is equivalent to scipy.stats.chisquare.
After explicitly enumerating the table shown above, I wanted to verify that applying scipy.stats.chi2_contingency to it is consistent with chi2, and to my dismay, it isn't. I'd like to ask why.
Consider a column x of X. sklearn.feature_selection.chi2 tests whether the frequencies of the y values where x is 1 agree with the frequencies of y in the full population. (@larsmans's answer shows how you can reproduce the calculation with numpy and scipy.) This is not the same as the standard 2x2 contingency table analysis of x and y. In a 2x2 contingency table analysis, the frequencies of y where x is 0 also contribute to the test.
Suppose we form the contingency table for x and y:
     |  y=0   y=1
-----+------------
 x=0 |   a     b
 x=1 |   c     d
Let n = a + b + c + d. This is the number of samples (i.e. same as len(x) and len(y)).
Let nx = c + d. This is the number of occurrences of 1 in x.
Let py1 = (b + d)/n. This is the fraction of the full population where y is 1.
sklearn.feature_selection.chi2 performs a chi2 test on [c, d] using the expected values [(1-py1)*nx, py1*nx]. This is not the same as the standard contingency table analysis of a 2x2 table.
Here's an extreme example. Suppose the 2x2 contingency table for x and y is
     |  y=0   y=1
-----+------------
 x=0 |    8     8
 x=1 |   20   188
The sklearn calculation produces a chi2 score of 1.58, with a p-value of 0.208.
The contingency table analysis of scipy.stats.chi2_contingency gives a chi2 score of 18.6, with a p-value of 1.60e-5.
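If you want to reproduce these numbers, one way (my own construction, not part of the original answer) is to compute the sklearn-style test directly from the cell counts, then compare with sklearn itself and with the contingency-table analysis:
import numpy as np
from scipy.stats import chisquare, chi2_contingency
from sklearn.feature_selection import chi2

# 2x2 cell counts from the extreme example above.
a, b, c, d = 8, 8, 20, 188
n = a + b + c + d
nx = c + d                     # number of samples where x == 1
py1 = (b + d) / n              # fraction of the population where y == 1

# The sklearn-style test: chisquare on [c, d] with expected [(1-py1)*nx, py1*nx].
stat, p = chisquare([c, d], f_exp=[(1 - py1) * nx, py1 * nx])
print(stat, p)                 # ~1.58, ~0.208

# The same numbers from sklearn itself, after expanding the table into raw samples.
x = np.repeat([0, 0, 1, 1], [a, b, c, d])
y = np.repeat([0, 1, 0, 1], [a, b, c, d])
score, pval = chi2(x.reshape(-1, 1), y)
print(score[0], pval[0])       # ~1.58, ~0.208

# The standard 2x2 contingency-table analysis.
stat2, p2, dof, expected = chi2_contingency([[a, b], [c, d]])
print(stat2, p2)               # ~18.6, ~1.6e-5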
Given your data,
>>> import numpy as np
>>> X = np.array([[1, 0, 0, 0, 1],
... [1, 1, 0, 1, 1],
... [1, 0, 0, 0, 0],
... [0, 0, 0, 0, 0],
... [0, 0, 0, 0, 1],
... [1, 0, 0, 0, 1],
... [1, 0, 1, 1, 1],
... [0, 1, 1, 0, 0],
... [1, 0, 1, 1, 1],
... [1, 1, 1, 1, 0]])
>>> y = np.array([1, 0, 0, 0, 1, 1, 1, 1, 0, 1])
this is what feature_selection.chi2 computes:
>>> Y = np.vstack([1 - y, y])
>>> observed = np.dot(Y, X)
>>> observed
array([[3, 1, 1, 2, 2],
[4, 2, 3, 2, 4]])
These are the observed feature frequencies, per class, i.e. the contingency table. Then the expected values:
>>> feature_count = X.sum(axis=0)
>>> class_prob = Y.mean(axis=1)
>>> expected = np.dot(feature_count.reshape(-1, 1), class_prob.reshape(1, -1)).T
>>> expected
array([[ 2.8, 1.2, 1.6, 1.6, 2.4],
[ 4.2, 1.8, 2.4, 2.4, 3.6]])
Finally, it runs a χ² test:
>>> from scipy.stats import chisquare
>>> score, pval = chisquare(observed, expected)
>>> score
array([ 0.02380952, 0.05555556, 0.375 , 0.16666667, 0.11111111])
>>> pval
array([ 0.87737056, 0.81366372, 0.54029137, 0.6830914 , 0.73888268])
The scores are the relevant bit: they're used to sort the features by discriminative power. Note that you get one score and one p-value per feature.
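As a sanity check (my addition), you can confirm that this manual computation matches what sklearn.feature_selection.chi2 returns for the same X and y:
>>> from sklearn.feature_selection import chi2
>>> chi2_score, chi2_pval = chi2(X, y)
>>> np.allclose(chi2_score, score), np.allclose(chi2_pval, pval)
(True, True)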