Interpreting (and comparing) output from numpy.correlate - python

I have looked at this question but it hasn't really given me any answers.
Essentially, how can I determine whether a strong correlation exists or not using np.correlate? I expect the same output as I get from MATLAB's xcorr with the coeff option, which I can understand (1 is a strong correlation at lag l and 0 is no correlation at lag l), but np.correlate produces values greater than 1, even when the input vectors have been normalised between 0 and 1.
Example input
import numpy as np
x = np.random.rand(10)
y = np.random.rand(10)
np.correlate(x, y, 'full')
This gives the following output:
array([ 0.15711279, 0.24562736, 0.48078652, 0.69477838, 1.07376669,
1.28020871, 1.39717118, 1.78545567, 1.85084435, 1.89776181,
1.92940874, 2.05102884, 1.35671247, 1.54329503, 0.8892999 ,
0.67574802, 0.90464743, 0.20475408, 0.33001517])
How can I tell what is a strong correlation and what is weak if I don't know what the maximum possible correlation value is?
Another example:
In [10]: x = [0,1,2,1,0,0]
In [11]: y = [0,0,1,2,1,0]
In [12]: np.correlate(x, y, 'full')
Out[12]: array([0, 0, 1, 4, 6, 4, 1, 0, 0, 0, 0])
Edit: This was a badly asked question, but the marked answer does answer what was asked. I think it is important to note what I have found whilst digging around in this area: you cannot compare the raw outputs of cross-correlation. In other words, it would not be valid to use the outputs from cross-correlation to say that signal x is better correlated to signal y than to signal z. Cross-correlation does not provide this kind of information.

numpy.correlate is under-documented. I think that we can make sense of it, though. Let's start with your sample case:
>>> import numpy as np
>>> x = [0,1,2,1,0,0]
>>> y = [0,0,1,2,1,0]
>>> np.correlate(x, y, 'full')
array([0, 0, 1, 4, 6, 4, 1, 0, 0, 0, 0])
Those numbers are the cross-correlations for each of the possible lags. To make that more clear, let's put the lag numbers above the correlations:
>>> np.concatenate((np.arange(-5, 6)[None,...], np.correlate(x, y, 'full')[None,...]), axis=0)
array([[-5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5],
[ 0, 0, 1, 4, 6, 4, 1, 0, 0, 0, 0]])
Here, we can see that the cross-correlation reaches its peak at a lag of -1. If you look at x and y above, that makes sense: if one shifts y to the left by one place, it matches x exactly.
To verify this, let's try again, this time shifting y further:
>>> y = [0, 0, 0, 0, 1, 2]
>>> np.concatenate((np.arange(-5, 6)[None,...], np.correlate(x, y, 'full')[None,...]), axis=0)
array([[-5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5],
[ 0, 2, 5, 4, 1, 0, 0, 0, 0, 0, 0]])
Now, the correlation peaks at a lag of -3, meaning that the best match between x and y occurs when y is shifted to the left by 3 places.
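To address the original question about comparing against MATLAB's xcorr(..., 'coeff'), here is a minimal sketch of one way to scale np.correlate output so that a perfect match scores 1 at its lag; the helper name xcorr_coeff and the normalisation by the zero-lag signal energies are my own choices, not part of numpy:
import numpy as np

def xcorr_coeff(x, y):
    # Scale the full cross-correlation by the zero-lag signal energies, so a
    # perfect match scores 1 at its lag (the usual normalisation behind
    # MATLAB's xcorr(..., 'coeff')).
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    c = np.correlate(x, y, 'full')
    return c / np.sqrt(np.dot(x, x) * np.dot(y, y))

x = [0, 1, 2, 1, 0, 0]
y = [0, 0, 1, 2, 1, 0]
lags = np.arange(-(len(y) - 1), len(x))   # same lag axis as above: -5 .. 5
c = xcorr_coeff(x, y)
print(lags[np.argmax(c)], c.max())        # lag -1, value 1.0 for this pair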

Related

Efficient way to compute pairwise differences among elements of a 1d numpy array

I have a specific problem.
I wrote this code to compute the differences of all pairs of elements in a 1d array.
np.array([j-i for m, i in enumerate(X[:]) for j in X[m+1:]])
For example, for an input X=np.array([0, 1, 2, 0, 1, 2, 0, 1, 2]), this code returns a 9*8/2=36 element array:
np.array([1,2,0,1,2,0,1,2,1,-1,0,1,-1,0,1,-2,-1,0,-2,-1,0,1,2,0,1,2,1,-1,0,1,-2,-1,0,1,2,1])
Although I understand that this approach is inherently O(n^2), my code takes a lot of time for larger arrays X (even n~400) and uses a lot of memory. So I think the double-loop indexing is the cause of this slowdown, and vectorising this method may make it faster. Do you have any ideas, or do you know of a standard module to compute this?
You can do this (time-)efficiently using broadcasting (which uses vectorization). The solution for X of length 400 is instantaneous on my machine:
import numpy as np
# X = np.random.rand(400)
X = np.array([0, 1, 2, 0, 1, 2, 0, 1, 2])
X = X.reshape(-1, 1)                  # column vector of shape (n, 1)
M = X.T - X                           # broadcasting: M[i, j] = X[j] - X[i]
idx = np.triu_indices(len(X), k=1)    # upper-triangle indices, i.e. all pairs with j > i
solution = M[idx]
array([ 1, 2, 0, 1, 2, 0, 1, 2, 1, -1, 0, 1, -1, 0, 1, -2, -1,
0, -2, -1, 0, 1, 2, 0, 1, 2, 1, -1, 0, 1, -2, -1, 0, 1,
2, 1])
You want to compute the difference between all possible pairs? That's inherently an O(n^2) operation.
You're always going to run into trouble at some point, but you can go a lot further by not keeping the entire square in memory, only lazily generating and using each value as you iterate.
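A minimal sketch of that lazy approach, computing one row of differences at a time instead of the full n x n matrix (the helper name pairwise_differences is illustrative):
import numpy as np
from itertools import islice

def pairwise_differences(X):
    # Yield X[j] - X[i] for all i < j, one row at a time, without ever
    # building the full n x n difference matrix in memory.
    X = np.asarray(X)
    for m in range(len(X) - 1):
        yield from (X[m + 1:] - X[m]).tolist()

X = np.array([0, 1, 2, 0, 1, 2, 0, 1, 2])
print(list(islice(pairwise_differences(X), 8)))  # first row: [1, 2, 0, 1, 2, 0, 1, 2]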

Determine the similarity between two arrays of counts [closed]

The Problem: I am trying to determine the similarity between two 1D arrays composed of counts. Both the positions and relative magnitudes of the counts inside the arrays are important.
X = [1, 5, 10, 0, 0, 0, 2]
Y = [1, 2, 0, 0, 10, 0, 5]
Z = [1, 3, 8, 0, 0, 0, 1]
In this case array X is more similar to array Z than array Y.
I have tried a few metrics, including cosine distance, earth mover's distance (EMD), and histogram intersection. While cosine distance and EMD work decently, only EMD really satisfies both of my conditions.
I am curious to know if there are other algorithms / distance metrics out there that exist to answer this sort of problem.
Thank you!
One popular and simple method is root-mean-square, where you sum the squares of the differences between the elements, take the square root, and divide by the number of elements. In your case, X vs Y produces 2.1, and X vs Z produces 0.4.
import math
X = [1, 5, 10, 0, 0, 0, 2]
Y = [1, 2, 0, 0, 10, 0, 5]
Z = [1, 3, 8, 0, 0, 0, 1]
def rms(a, b):
    return math.sqrt(sum((a1 - b1) * (a1 - b1) for a1, b1 in zip(a, b))) / len(a)
print(rms(X,Y))
print(rms(X,Z))
Perhaps Manhattan distance works for you. The Manhattan distance between X and Y is 26, between X and Z is 5, and between Y and Z is 23.
def manhattan(x, y):
    return sum(abs(val1 - val2) for val1, val2 in zip(x, y))
X = [1, 5, 10, 0, 0, 0, 2]
Y = [1, 2, 0, 0, 10, 0, 5]
Z = [1, 3, 8, 0, 0, 0, 1]
manhattan(X, Y) # returns 26
manhattan(X, Z) # returns 5
manhattan(Y,Z) # returns 23
from dtaidistance import dtw
import numpy as np
X = [1, 5, 10, 0, 0, 0, 2]
Y = [1, 2, 0, 0, 10, 0, 5]
Z = [1, 3, 8, 0, 0, 0, 1]
def phase_corr(sig1, sig2):
    # phase correlation: normalised cross-power spectrum, transformed back to the time domain
    fft_sig1 = np.fft.fft(sig1)
    fft_sig2 = np.fft.fft(sig2)
    fft_sig2_conj = np.conj(fft_sig2)
    R = (fft_sig1 * fft_sig2_conj) / abs(fft_sig1 * fft_sig2_conj)
    r = np.fft.ifft(R)
    return np.real(r)
print(np.correlate(X, Z), np.correlate(Y, Z)) #cross-correlation
print(max(phase_corr(X, Z)), max(phase_corr(Y, Z)))
print(dtw.distance(X, Z), dtw.distance(Y, Z)) #smaller distance means more similar
print(np.corrcoef(X, Z)[1,0], np.corrcoef(Y, Z)[1,0]) #Pearson correlation
Check out scipy.spatial.distance for various distance metrics.
For instance, with the Chebyshev distance, we get that X is more similar to Z than to Y.
from scipy.spatial import distance
X = [1, 5, 10, 0, 0, 0, 2]
Y = [1, 2, 0, 0, 10, 0, 5]
Z = [1, 3, 8, 0, 0, 0, 1]
print(distance.chebyshev(X, Y)) # returns 10
print(distance.chebyshev(X, Z)) # returns 2
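If you want to survey several of those metrics side by side, a small sketch along these lines may help (the particular selection of metrics is just illustrative):
from scipy.spatial import distance

X = [1, 5, 10, 0, 0, 0, 2]
Y = [1, 2, 0, 0, 10, 0, 5]
Z = [1, 3, 8, 0, 0, 0, 1]

# a few of the metrics available in scipy.spatial.distance
metrics = [('cityblock (Manhattan)', distance.cityblock),
           ('euclidean', distance.euclidean),
           ('chebyshev', distance.chebyshev),
           ('cosine', distance.cosine),
           ('braycurtis', distance.braycurtis)]

for name, fn in metrics:
    print(f"{name:22s} X-Y: {fn(X, Y):7.3f}  X-Z: {fn(X, Z):7.3f}")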

Iterating through matrices Python

If I have two lists and want to iterate through subtracting one from the other how would I go about this? I was thinking broadcasting. Right now I have:
array1 = [0,2,2,0]
array2 = [2,2,0,1]
I would like to subtract array1 from each value in array2 and make a new matrix of outputs:
output = [2, 0, 0, 2,
2, 0, 0, 2,
0, -2, -2, 0,
1, -1, -1, 1]
so in the end it's a 4x4 matrix.
Is this possible? Is the easiest way to use broadcasting? I was thinking of making each row value in array2 into its own array, subtracting that from array2 using broadcasting, then summing all the arrays at the end into one big array (using Numpy)... is there an easier way?
Broadcasting with numpy:
>>> a1 = np.array([0,2,2,0])
>>> a2 = np.array([2,2,0,1])
>>> a2[:, np.newaxis] - a1
array([[ 2, 0, 0, 2],
[ 2, 0, 0, 2],
[ 0, -2, -2, 0],
[ 1, -1, -1, 1]])
Something like this?
def all_differences(x, y):
    return (a - b for a in y for b in x)
print(list(all_differences([0, 2, 2, 0], [2, 2, 0,1])))
# -> [2, 0, 0, 2, 2, 0, 0, 2, 0, -2, -2, 0, 1, -1, -1, 1]
It just iterates over every item in the second list for every item in the first list, and gives their difference.
This can also be solved with itertools.product and can be generalised for multiple lists:
import itertools
import functools
import operator
difference = functools.partial(functools.reduce, operator.sub)
def all_differences(*lists):
    return map(difference, itertools.product(*reversed(lists)))
print(list(all_differences([0, 2, 2, 0], [2, 2, 0,1])))
Or just handling two lists:
import itertools
def all_differences(x, y):
    return (a - b for (a, b) in itertools.product(y, x))
print(list(all_differences([0, 2, 2, 0], [2, 2, 0,1])))

Scipy - find bases of column space of matrix

I'm trying to code up a simple Simplex algorithm, the first step of which is to find a basic feasible solution:
Choose a set B of linearly independent columns of A
Set all components of x corresponding to the columns not in B to zero.
Solve the m resulting equations to determine the components of x. These are the basic variables.
I know the solution will involve using scipy.linalg.svd (or scipy.linalg.lu) and some numpy.argwhere / numpy.where magic, but I'm not sure exactly how.
Does anyone have a pure-Numpy/Scipy implementation of finding a basis (step 1) or, even better, all of the above?
Example:
>>> A
array([[1, 1, 1, 1, 0, 0, 0],
[1, 0, 0, 0, 1, 0, 0],
[0, 0, 1, 0, 0, 1, 0],
[0, 3, 1, 0, 0, 0, 1]])
>>> u, s, v = scipy.linalg.svd(A)
>>> non_zero, = numpy.where(s > 1e-7)
>>> rank = len(non_zero)
>>> rank
4
>>> for basis in some_unknown_function(A):
... print(basis)
{3, 4, 5, 6}
{1, 4, 5, 6}
and so on.
A QR decomposition provides an orthogonal basis for the column space of A:
q,r = np.linalg.qr(A)
If the rank of A is n, then the first n columns of q form a basis for the column space of A.
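If what you actually need is a set B of column indices of A itself rather than orthogonal vectors, one option (a sketch, not the only way) is QR with column pivoting, which reorders the columns so that the leading ones are linearly independent; note this yields one such set, not an enumeration of all of them:
import numpy as np
import scipy.linalg

A = np.array([[1, 1, 1, 1, 0, 0, 0],
              [1, 0, 0, 0, 1, 0, 0],
              [0, 0, 1, 0, 0, 1, 0],
              [0, 3, 1, 0, 0, 0, 1]])

# column-pivoted QR: A[:, p] = q @ r, with the magnitudes of diag(r) non-increasing
q, r, p = scipy.linalg.qr(A, pivoting=True)
rank = int(np.sum(np.abs(np.diag(r)) > 1e-7))
basis_columns = set(p[:rank].tolist())   # indices of one linearly independent set of columns
print(rank, basis_columns)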
Try using this:
scipy.linalg.orth(A)
This produces an orthonormal basis for the column space of A.
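Note that scipy.linalg.orth returns orthonormal basis vectors for the column space (computed via SVD), not a subset of A's own columns, so it answers a slightly different question than the basis-of-columns one above; for the 4x7 example A it gives a 4x4 array:
import numpy as np
import scipy.linalg

A = np.array([[1, 1, 1, 1, 0, 0, 0],
              [1, 0, 0, 0, 1, 0, 0],
              [0, 0, 1, 0, 0, 1, 0],
              [0, 3, 1, 0, 0, 0, 1]])

B = scipy.linalg.orth(A)
print(B.shape)                                    # (4, 4): one column per basis vector
print(np.allclose(B.T @ B, np.eye(B.shape[1])))   # True: the columns are orthonormal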

Scikit-learn χ² (chi-squared) statistic and corresponding contingency table

In the docs for the chi-squared univariate feature selection function of scikit-learn http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.chi2.html, it states
This score can be used to select the n_features features with the highest values for the χ² (chi-square) statistic from X, which must contain booleans or frequencies (e.g., term counts in document classification), relative to the classes.
I am struggling to understand what the corresponding contingency table would look like, especially in the case of frequency features.
For example, consider the below dataset with boolean features and targets:
import numpy as np
>>> X = np.random.randint(2, size=50).reshape(10, 5)
array([[1, 0, 0, 0, 1],
[1, 1, 0, 1, 1],
[1, 0, 0, 0, 0],
[0, 0, 0, 0, 0],
[0, 0, 0, 0, 1],
[1, 0, 0, 0, 1],
[1, 0, 1, 1, 1],
[0, 1, 1, 0, 0],
[1, 0, 1, 1, 1],
[1, 1, 1, 1, 0]])
>>> y = np.random.randint(2, size=10)
array([1, 0, 0, 0, 1, 1, 1, 1, 0, 1])
To construct the contingency table with respect to the first feature, we can do this (excuse my PEP8 violation)
import scipy as sp
>>> contingency_table = sp.sparse.coo_matrix(
... (np.ones_like(y), (X[:, 0], y)),
... shape=(np.unique(X[:, 0]).shape[0], np.unique(y).shape[0])).A
array([[1, 2],
[3, 4]])
So now I can calculate the chi-squared statistic and its p-values
>>> sp.stats.chi2_contingency(contingency_table)
(0.17857142857142855,
0.67260381744151676,
1,
array([[ 1.2, 1.8],
[ 2.8, 4.2]]))
And this ought to be consistent with scikit-learn's chi2
from sklearn.feature_selection import chi2
>>> chi2_, pval = chi2(X, y)
>>> chi2_[0], pval[0]
(0.023809523809523787, 0.87737055606414338)
...Nope. Have I misinterpreted something?
Also, what does the contingency table look like in the case of frequencies? I assumed it would be something like
contingency_table = sp.sparse.coo_matrix(
(np.ones_like(y), (X[:, 0], y)),
shape=(X[:, 0].max()+1, np.unique(y).shape[0])).A
But the corresponding table of expected frequencies will most likely have several zero elements.
Edit:
To clarify further, consider the first feature X[:, 0] that is, say, gender and the targets y, say, handedness.
From this we get the cross tabulation
                 Right-handed   Left-handed (!right-handed)
Male                   1                     2
Female (!male)         3                     4
And we can assess the significance of the difference between the two proportions using the chi-squared test, setting the expected frequencies from the marginal totals (row total × column total / grand total), which is what scipy.stats.chi2_contingency computed above.
sklearn.feature_selection.chi2 does this directly without resorting to explicitly computing the table and obtains the scores using a more efficient procedure that is equivalent to scipy.stats.chisquare.
After explicitly enumerating the table shown above, I wanted to verify that it is consistent with chi2 by applying scipy.stats.chi2_contingency, and to my dismay, it isn't. I'd like to ask why.
Consider a column x of X. sklearn.feature_selection.chi2 tests whether
the frequencies of the y values where x is 1 agree with the frequencies of y in
the full population. (#larsman's answer shows how you can reproduce the calculation with numpy and scipy.) This is not the same as the standard 2x2 contingency table
analysis of x and y. In a 2x2 contingency table analysis, the frequencies of y
where x is 0 also contribute to the test.
Suppose we form the contingency table for x and y:
| y=0 y=1
----+---------
x=0 | a b
x=1 | c d
Let n = a + b + c + d. This is the number of samples (i.e. same as len(x) and len(y)).
Let nx = c + d. This is the number of occurrences of 1 in x.
Let py1 = (b + d)/n. This is the fraction of the full population where y is 1.
sklearn.feature_selection.chi2 performs a chi2 test on [c, d] using the expected
values [(1-py1)*nx, py1*nx]. This is not the same as the standard contingency table
analysis of a 2x2 table.
Here's an extreme example. Suppose the 2x2 contingency table for x and y is
| y=0 y=1
----+----------
x=0 | 8 8
x=1 | 20 188
The sklearn calculation produces a chi2 score of 1.58, with a p-value of 0.208.
The contingency table analysis of scipy.stats.chi2_contingency gives a chi2 score of 18.6, with a p-value of 1.60e-5.
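A short sketch that reproduces both results for this extreme table; the [c, d]-versus-expected calculation follows the description of sklearn's procedure given above, not sklearn's internal code:
import numpy as np
from scipy.stats import chisquare, chi2_contingency

table = np.array([[  8,   8],    # x=0: counts for y=0, y=1
                  [ 20, 188]])   # x=1: counts for y=0, y=1

# sklearn-style test: only the x=1 row, against the full-population y proportions
n = table.sum()
nx = table[1].sum()               # number of samples where x is 1
py1 = table[:, 1].sum() / n       # fraction of the full population where y is 1
score, p = chisquare(table[1], f_exp=[(1 - py1) * nx, py1 * nx])
print(score, p)                   # ~1.58, ~0.208

# standard 2x2 contingency-table analysis
chi2_stat, p2, dof, expected = chi2_contingency(table)
print(chi2_stat, p2)              # ~18.6, ~1.6e-5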
Given your data,
>>> X = array([[1, 0, 0, 0, 1],
... [1, 1, 0, 1, 1],
... [1, 0, 0, 0, 0],
... [0, 0, 0, 0, 0],
... [0, 0, 0, 0, 1],
... [1, 0, 0, 0, 1],
... [1, 0, 1, 1, 1],
... [0, 1, 1, 0, 0],
... [1, 0, 1, 1, 1],
... [1, 1, 1, 1, 0]])
>>> y = array([1, 0, 0, 0, 1, 1, 1, 1, 0, 1])
this is what feature_selection.chi2 computes:
>>> Y = np.vstack([1 - y, y])
>>> observed = np.dot(Y, X)
>>> observed
array([[3, 1, 1, 2, 2],
[4, 2, 3, 2, 4]])
These are the observed feature frequencies, per class, i.e. the contingency table. Then the expected values:
>>> feature_count = X.sum(axis=0)
>>> class_prob = Y.mean(axis=1)
>>> expected = np.dot(feature_count.reshape(-1, 1), class_prob.reshape(1, -1)).T
>>> expected
array([[ 2.8, 1.2, 1.6, 1.6, 2.4],
[ 4.2, 1.8, 2.4, 2.4, 3.6]])
Finally, it runs a χ² test:
>>> from scipy.stats import chisquare
>>> score, pval = chisquare(observed, expected)
>>> score
array([ 0.02380952, 0.05555556, 0.375 , 0.16666667, 0.11111111])
>>> pval
array([ 0.87737056, 0.81366372, 0.54029137, 0.6830914 , 0.73888268])
The scores are the relevant bit: they're used to sort the features by discriminative power. Note that you get one score and one p-value per feature.
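As a sanity check, calling sklearn's chi2 directly returns the same per-feature scores and p-values; the first pair matches the values quoted in the question:
import numpy as np
from sklearn.feature_selection import chi2

X = np.array([[1, 0, 0, 0, 1], [1, 1, 0, 1, 1], [1, 0, 0, 0, 0], [0, 0, 0, 0, 0],
              [0, 0, 0, 0, 1], [1, 0, 0, 0, 1], [1, 0, 1, 1, 1], [0, 1, 1, 0, 0],
              [1, 0, 1, 1, 1], [1, 1, 1, 1, 0]])
y = np.array([1, 0, 0, 0, 1, 1, 1, 1, 0, 1])

scores, pvals = chi2(X, y)
print(scores[0], pvals[0])   # 0.0238..., 0.877..., as in the question
print(scores)                # matches the chisquare scores computed above
print(pvals)                 # matches the p-values computed above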
