I am trying to determine the similarity between two 1D time-series using numpy.correlate.
I wrote a small example program to learn more about how cross correlation works, but I don't completely understand the trend in the correlation output.
Code:
import numpy as np
import matplotlib.pyplot as plt
#sample arrays to correlate
arr_1 = np.arange(1, 101) #[1, 2, 3, ..... 100]
arr_2 = np.concatenate([np.zeros(50), np.arange(50, 101)]) #[0, 0, ... 50, 51 ... 100]
cross_corr = np.correlate(arr_1, arr_2, "same")
plt.plot(list(cross_corr))
This graph raises a couple questions for me. It is my understanding that the cross correlation relies on the convolution operation (essentially the integral of the inner product of two signals - accounting for some lag).
Why does the correlation signal (above) steadily increase over (0, 50) if arr_2 is full of 0's from index 0 to 50?
How can I set the lag for the convolution operation? From the numpy docs I can't find a parameter that allows me to tweak the lag.
The peak at 50 is due to the fact that both signals line up at index 50, but why then does the correlation steadily decrease thereafter? If the two signals are lining up, shouldn't the correlation be increasing?
A correlation is significant only if its value is greater than 2/sqrt(n - abs(k)), where n is the number of samples and k is the lag. How would correlation significance come into play for the graph shown above?
It seems that you are confused about what exactly is being output. The documentation is honestly a little lacking. The output is the correlation between your two arrays at each lag. The midpoint is where the lag is 0 and where the correlation is highest.
FYI, your two arrays are not the same size: arr_1 has length 100 and arr_2 has length 101. Not sure if this was intentional.
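To make the lag axis explicit, here is a minimal sketch that uses mode "full" and builds the lag belonging to each output element. The toy signals x and y and all variable names are assumptions for illustration, not part of the question. Note that np.correlate returns raw sums of products, not normalized correlation coefficients.

```python
import numpy as np

# Toy signals (assumed for illustration): y is x delayed by 3 samples.
x = np.array([0, 0, 1, 2, 1, 0, 0, 0, 0, 0], dtype=float)
y = np.roll(x, 3)

# mode="full" evaluates every possible overlap of the two arrays.
corr = np.correlate(y, x, mode="full")

# Output element i corresponds to lag i - (len(x) - 1).
lags = np.arange(-(len(x) - 1), len(y))

# The lag with the largest raw correlation recovers the delay.
best_lag = int(lags[np.argmax(corr)])
print(best_lag)  # 3
```

There is no lag parameter to tune: every lag is evaluated at once, and the mode argument only controls which part of that range is returned.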
I'm trying to compute the cumulative distribution function of a multivariate normal using scipy.
I'm having trouble with the "input matrix must be symmetric positive definite" error.
To my knowledge, a diagonal matrix with positive diagonal entries is positive definite (see page 1 problem 2).
However, when some of these diagonal entries are relatively small, the error shows up for the smaller values.
For example, this code:
import numpy as np
from scipy.stats import multivariate_normal
std = np.array([0.001, 2])
mean = np.array([1.23, 3])
multivariate_normal(mean=mean, cov=np.diag(std**2)).cdf([2,1])
returns 0.15865525393145702
while changing the third line to:
std = np.array([0.00001, 2])
causes the error to show up.
I'm guessing that it has something to do with floating-point error.
The problem is that as the dimension of the cov matrix grows, the smallest positive values accepted on the diagonal get bigger and bigger.
I tried multiple values on the diagonal of the covariance matrix of dimension 9x9. It seems that when other diagonal values are very large, small values cause the error.
Examining the stack trace, you will see that it sets the condition number threshold to
1e6*np.finfo('d').eps ~ 2.2e-10 in _eigvalsh_to_eps.
In your example the smaller eigenvalue is (5e-6)**2 = 2.5e-11 times the largest eigenvalue, which is below that threshold, so it is treated as zero.
You can pass allow_singular=True to get it working:
import numpy as np
from scipy.stats import multivariate_normal
std = np.array([0.000001, 2])
mean = np.array([1.23, 3])
multivariate_normal(mean=mean, cov=np.diag(std**2), allow_singular=True).cdf([2,1])
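The check described above can be reproduced by hand. The following is only a sketch under the assumption that the cutoff is the tolerance quoted in the answer multiplied by the largest eigenvalue. The helper name smallest_eigenvalue_is_dropped is hypothetical and is not part of scipy.

```python
import numpy as np

def smallest_eigenvalue_is_dropped(std):
    """Hypothetical helper that mimics the relative cutoff quoted above."""
    # For a diagonal covariance np.diag(std**2) the eigenvalues are std**2.
    eigvals = np.asarray(std, dtype=float) ** 2
    # Assumed cutoff: tolerance times the largest eigenvalue.
    cutoff = 1e6 * np.finfo("d").eps * eigvals.max()
    return bool(eigvals.min() < cutoff)

print(smallest_eigenvalue_is_dropped([0.001, 2]))    # False
print(smallest_eigenvalue_is_dropped([0.00001, 2]))  # True
```

Under this assumption the cutoff scales with the largest eigenvalue, so a very large entry elsewhere on the diagonal raises the smallest value that is still accepted, which would be consistent with the 9x9 observation in the question.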
I want to know how different two numpy matrices are. Matrix1 and Matrix2 could be very similar, e.g. 80% of the values the same but just shifted... I attach images of two identical arrays that differ in a short sequence of values in the top right.
from skimage.util import compare_images
#matrix1 & matrix2 are numpy arrays
compare_images(matrix1, matrix2, method='diff')
Gives me a first comparison, but what about two numpy matrices, one of which is, for example, left-shifted by a couple of columns?
import matplotlib.pyplot as plt
from scipy.signal import correlate2d
corr = correlate2d(matrix1, matrix2)
plt.figure(figsize=(10,10))
plt.imshow(corr)
plt.grid(False)
plt.show()
This plots the correlation, and it seems like a nice method, but I do not understand how the results are displayed, since the differences are in the top right of the images.
Otherwise:
picture1_norm = picture1/np.sqrt(np.sum(picture1**2))
picture2_norm = picture2/np.sqrt(np.sum(picture2**2))
print(np.sum(picture2_norm*picture1_norm))
This returns a similarity value in the range 0-1, for example 0.9942.
What could be a good method?
Correlation between two matrices is a legitimate measure of how similar they are. If both contain the same values, the (normalized) correlation will be 1, and your (max?) value of 0.9942 is already very close to that.
Regarding the translational (in)variance of your result, have a closer look at the mode argument of scipy.signal.correlate2d, which defines how to handle differing sizes along both axes of your matrices and how far to slide one matrix over the other when calculating the correlation.
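As a rough illustration of how the correlation output relates to a shift, here is a sketch with mode="full" on toy data. The matrices, the 2x2 pattern and the variable names are assumptions made for this example; real images would usually need mean subtraction before the peak is meaningful.

```python
import numpy as np
from scipy.signal import correlate2d

# Toy data (assumed): the same 2x2 pattern placed at two different offsets.
pattern = np.array([[1.0, 2.0], [3.0, 4.0]])
matrix1 = np.zeros((6, 6))
matrix2 = np.zeros((6, 6))
matrix1[1:3, 1:3] = pattern
matrix2[2:4, 3:5] = pattern  # shifted 1 row down and 2 columns right

# mode="full" evaluates every relative offset of the two matrices.
corr = correlate2d(matrix2, matrix1, mode="full")

# In the full output, index (rows - 1, cols - 1) corresponds to zero shift.
peak = np.unravel_index(np.argmax(corr), corr.shape)
shift = (int(peak[0]) - (matrix1.shape[0] - 1),
         int(peak[1]) - (matrix1.shape[1] - 1))
print(shift)  # (1, 2)
```

The position of the peak relative to the zero-shift index is the estimated translation, so what the plot shows is offsets between the two matrices, not locations inside either image.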
I want to perform principal component analysis for dimension reduction and data integration.
I have 3 features (variables) and 5 samples, as shown below. I want to integrate them into a 1-dimensional (1 feature) output by transforming them (computing the 1st PC). I want to use the transformed data for further statistical analysis, because I believe that it displays the 'main' characteristics of the 3 input features.
I first wrote test code in Python using scikit-learn, as shown below. It is the simple case where the values of the 3 features are all identical. In other words, I applied PCA to three copies of the same vector, [0, 1, 2, 1, 0].
Code
import numpy as np
from sklearn.decomposition import PCA
pca = PCA(n_components=1)
samples = np.array([[0,0,0],[1,1,1],[2,2,2],[1,1,1],[0,0,0]])
pc1 = pca.fit_transform(samples)
print (pc1)
Output
[[-1.38564065]
[ 0.34641016]
[ 2.07846097]
[ 0.34641016]
[-1.38564065]]
Is taking the 1st PC after dimension reduction a proper approach for data integration?
1-2. For example, in a 2-feature case, suppose the features are [power rank, speed rank], and power is roughly negatively correlated with speed. I want to find the samples that have both 'high power' and 'high speed'. It is easy to decide that [power 1, speed 1] is better than [power 2, speed 2], but difficult for a case like [power 4, speed 2] vs [power 3, speed 3].
So I want to apply PCA to the 2-dimensional 'power and speed' dataset, take the 1st PC, and then use the rank of the '1st PC'. Is this kind of approach still proper?
In this case, I think the output should also be [0, 1, 2, 1, 0], the same as the input. But the output was [-1.38564065, 0.34641016, 2.07846097, 0.34641016, -1.38564065]. Is there a problem with the code, or is it the right answer?
Yes. It is also called data projection (to a lower dimension).
The resulting output is centered on the mean of the training data and projected onto the unit-length first component, which is why the values are shifted and rescaled relative to the input. The result is correct.
With only 5 samples I don't think it is wise to run any statistical methods. And if you believe that your features are the same, just check that the correlation between dimensions is close to 1, and then you can disregard the other dimensions.
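To see where the numbers in the question come from, here is a plain NumPy sketch of a one-component PCA, assuming the usual center-then-project definition. It is not scikit-learn's implementation, and the variable names are chosen for illustration.

```python
import numpy as np

samples = np.array([[0, 0, 0], [1, 1, 1], [2, 2, 2], [1, 1, 1], [0, 0, 0]],
                   dtype=float)

# Step 1: subtract the per-feature mean (0.8 for every column here).
centered = samples - samples.mean(axis=0)

# Step 2: the first right singular vector is the first principal axis,
# here +/- (1, 1, 1) / sqrt(3).
_, _, vt = np.linalg.svd(centered, full_matrices=False)
first_axis = vt[0]

# Step 3: project the centered data onto that unit-length axis.
pc1 = centered @ first_axis
print(pc1)  # the sign may be flipped relative to scikit-learn's output
```

Each value equals sqrt(3) * (x - 0.8) up to sign, e.g. sqrt(3) * (2 - 0.8) is about 2.0785, so the ordering of the samples is the same as in the input.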
There is no need to use PCA for this small dataset. And for PCA your array should be scaled.
In any case, you have only 3 dimensions: you can plot the points and inspect them by eye, or you can calculate distances (e.g. with some kind of nearest-neighbours algorithm).
I want to find groups in a one-dimensional array where order/position matters. I tried to use SciPy's kmeans2, but it works only when the numbers are in increasing order.
I have to maximize the average difference between neighbouring sub-arrays.
For example: if I have the array [1,2,2,8,9,0,0,0,1,1,1] and I want to get 4 groups, the result should be something like [1,2,2], [8,9], [0,0,0], [1,1,1].
Is there a way to do it in better than O(n^k)?
Answer: I ended up with a modified dendrogram, where I merge neighbours only.
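One possible reading of that neighbour-only merging is sketched below. This is not the asker's actual code: the function name merge_neighbours and the choice of comparing group means are assumptions made for illustration.

```python
import numpy as np

def merge_neighbours(values, k):
    """Hypothetical sketch: merge adjacent groups with the closest means until k remain."""
    groups = [[v] for v in values]
    while len(groups) > k:
        means = np.array([np.mean(g) for g in groups])
        gaps = np.abs(np.diff(means))  # distances between neighbouring groups only
        i = int(np.argmin(gaps))       # closest adjacent pair (first one on ties)
        groups[i:i + 2] = [groups[i] + groups[i + 1]]
    return groups

print(merge_neighbours([1, 2, 2, 8, 9, 0, 0, 0, 1, 1, 1], 4))
# [[1, 2, 2], [8, 9], [0, 0, 0], [1, 1, 1]]
```

As written this is roughly O(n^2), because the means are recomputed after every merge; it is a greedy heuristic and is not guaranteed to maximize the average difference between neighbouring groups.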
K-means is about minimizing least squares. Among its largest drawbacks (there are many) is that you need to know k. Why do you want to inherit this drawback?
Instead of hacking k-means into not ignoring the order, why don't you look at time series segmentation and change detection approaches, which are much more appropriate for this problem?
E.g. split your time series if abs(x[i] - x[i-1]) > stddev, where stddev is the standard deviation of your data set, or the standard deviation of the last 10 samples. (In the above series the standard deviation is about 3, so it would split as [1,2,2], [8,9], [0,0,0,1,1,1], because the change from 0 to 1 is not significant.)
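The thresholding rule above can be sketched in a few lines of NumPy. The helper name split_on_jumps is hypothetical, and using the standard deviation of the whole series as the default threshold is just the first of the two options mentioned.

```python
import numpy as np

def split_on_jumps(x, threshold=None):
    """Hypothetical helper: cut the series wherever neighbours differ by more than threshold."""
    x = np.asarray(x)
    if threshold is None:
        threshold = x.std()  # standard deviation of the whole series
    # True wherever the jump from x[i-1] to x[i] exceeds the threshold.
    jumps = np.abs(np.diff(x)) > threshold
    cut_points = np.flatnonzero(jumps) + 1
    return [segment.tolist() for segment in np.split(x, cut_points)]

print(split_on_jumps([1, 2, 2, 8, 9, 0, 0, 0, 1, 1, 1]))
# [[1, 2, 2], [8, 9], [0, 0, 0, 1, 1, 1]]
```

This runs in linear time and does not need k in advance; a smaller threshold (for example 0.5) would also separate the zeros from the trailing ones.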
I'm using a function in Python's OpenCV library to get the optical flow of my hand as I move it around. Specifically http://docs.opencv.org/modules/video/doc/motion_analysis_and_object_tracking.html#calcopticalflowfarneback
This function outputs a numpy array
flow = cv2.calcOpticalFlowFarneback(prevgray, gray, 0.5, 3, 15, 3, 5, 1.2, 0)
print(flow.shape)  # prints (480, 320, 2)
So flow is a matrix in which each entry is a vector. I want a way to quantify this matrix, so I thought of using the L1 matrix norm (numpy.linalg.norm(flow, 1)), which throws an improper dimensions to norm error.
I'm thinking about getting around this by calculating the Euclidean norm of every vector and then finding the L1 norm of the matrix of those vector lengths.
I'm having trouble iterating through the flow matrix efficiently. I have done it using two for loops by going first through columns and then rows, but it's way too slow.
import numpy
r, c, d = flow.shape
flowprime = numpy.zeros((r, c), flow.dtype)
for i in range(r):
    for j in range(c):
        # Euclidean norm of the flow vector at pixel (i, j)
        flowprime[i, j] = numpy.linalg.norm(flow[i, j], 2)
print(numpy.linalg.norm(flowprime, 1))
I had also tried using numpy.nditer but
for x in numpy.nditer(flow, op_flags=['readwrite']):
    print(x)
just prints a single value rather than a vector.
What would be the fastest way to iterate through a numpy matrix with vectors as entries, norm them and then take the L1 norm?
As of numpy version 1.9, norm takes an axis argument.
Aside from that, say what you want ideally, and almost surely you can ask numpy to do it. E.g., assuming no complex entries or missing values, the simplest case is np.sqrt((flow**2).sum()), and the case I think you describe is np.linalg.norm(np.sqrt((flow**2).sum(axis=-1)), 1).
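A minimal sketch of the vectorized version, using the axis argument mentioned above. The random flow array is only a stand-in for the real optical-flow output, and its small shape is an assumption to keep the comparison with the loop cheap.

```python
import numpy as np

# Stand-in for the optical-flow output (assumed shape: rows x cols x 2).
rng = np.random.default_rng(0)
flow = rng.standard_normal((4, 3, 2))

# Euclidean norm of every 2-vector in one call, no Python loops.
magnitudes = np.linalg.norm(flow, axis=-1)  # shape (4, 3)

# Matrix 1-norm of the magnitudes: the largest column sum.
l1_norm = np.linalg.norm(magnitudes, 1)

# Same quantity computed with the explicit double loop from the question.
slow = np.zeros(flow.shape[:2])
for i in range(flow.shape[0]):
    for j in range(flow.shape[1]):
        slow[i, j] = np.linalg.norm(flow[i, j], 2)
print(np.isclose(l1_norm, np.linalg.norm(slow, 1)))  # True
```

Note that the matrix 1-norm is the maximum column sum; if the entrywise sum of all vector lengths is what is wanted, magnitudes.sum() gives that instead.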