I have a SAS script that uses the "proc corr" procedure with weighting to create a weighted correlation matrix. I am now trying to reproduce this in Python, but I haven't found a good way to include the weighting in the output matrix.
While looking for a solution, I've found a few scripts and functions that calculate a weighted correlation coefficient for two columns/variables (examples here) using a weights array, but I am trying to create a weighted correlation matrix with many more variables. I've tried using these functions by looping through variable combinations, but it runs orders of magnitude slower than the SAS procedure.
I was wondering if there was an efficient way to create a weighted correlation matrix in python that works similarly to the SAS code, or at least returns equivalent results without looping through all variable combinations.
numpy's covariance function, numpy.cov, takes two different kinds of weights parameters (fweights and aweights) - I don't have SAS to check against, but it is likely a similar approach.
https://docs.scipy.org/doc/numpy/reference/generated/numpy.cov.html#numpy.cov
Once you have a covariance matrix, it can be converted to a correlation matrix using a formula like this
https://en.wikipedia.org/wiki/Covariance_matrix#Correlation_matrix
Complete example
import numpy as np
x = np.array([1., 1.1, 1.2, 0.9])
y = np.array([2., 2.05, 2.02, 2.8])
np.cov(x, y)
Out[49]:
array([[ 0.01666667, -0.03816667],
       [-0.03816667,  0.151225  ]])
cov = np.cov(x, y, fweights=[10, 1, 1, 1])
cov
Out[51]:
array([[ 0.00474359, -0.00703205],
       [-0.00703205,  0.04872308]])
def cov_to_corr(cov):
    """ based on https://en.wikipedia.org/wiki/Covariance_matrix#Correlation_matrix """
    D = np.sqrt(np.diag(np.diag(cov)))
    Dinv = np.linalg.inv(D)
    return Dinv @ cov @ Dinv  # the @ operator requires Python 3.5+; use np.dot otherwise
cov_to_corr(cov)
Out[53]:
array([[ 1.        , -0.46255259],
       [-0.46255259,  1.        ]])
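If you pass the whole data matrix to np.cov at once, you get the weighted covariance (and hence correlation) matrix for all variables in one call, with no looping over pairs. Here is a minimal sketch with made-up data; whether aweights (or fweights) matches the SAS weighting behaviour is an assumption you should verify against your SAS output:

import numpy as np

# Hypothetical data: 100 observations of 5 variables, one weight per observation.
rng = np.random.default_rng(0)
data = rng.normal(size=(100, 5))
weights = rng.uniform(0.5, 2.0, size=100)

# rowvar=False treats columns as variables; aweights accepts non-integer weights.
cov = np.cov(data, rowvar=False, aweights=weights)

# Covariance -> correlation, equivalent to cov_to_corr above.
d = np.sqrt(np.diag(cov))
corr = cov / np.outer(d, d)
print(corr.shape)   # (5, 5)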
I'm supposed to apply a "binomial low pass filter" to data given in a numpy.ndarray.
However, I wasn't able to find anything of the sort at https://docs.scipy.org/doc/scipy/reference/signal.html. What am I missing here? This should be a fairly basic operation, right?
A binomial filter is a FIR filter whose coefficients can be generated by taking a row from Pascal's triangle. A quick way ("quick" as in just one line of code--not necessarily the most efficient) is with numpy.poly1d:
In [15]: np.poly1d([1, 1])**2
Out[15]: poly1d([1, 2, 1])
In [16]: np.poly1d([1, 1])**3
Out[16]: poly1d([1, 3, 3, 1])
In [17]: np.poly1d([1, 1])**4
Out[17]: poly1d([1, 4, 6, 4, 1])
To use a set of these coefficients as a low pass filter, the values must be normalized so that their sum is one. The sum of the coefficients of np.poly1d([1, 1])**n is 2**n, so we could divide the above result by 2**n. Alternatively, we can generate coefficients that are already normalized by giving numpy.poly1d [1/2, 1/2] instead of [1, 1] (i.e. start with a normalized set of two coefficients). This function generates the filter coefficients for a given n:
def binomcoeffs(n):
    return (np.poly1d([0.5, 0.5])**n).coeffs
For example,
In [35]: binomcoeffs(3)
Out[35]: array([0.125, 0.375, 0.375, 0.125])
In [36]: binomcoeffs(5)
Out[36]: array([0.03125, 0.15625, 0.3125 , 0.3125 , 0.15625, 0.03125])
In [37]: binomcoeffs(15)
Out[37]:
array([3.05175781e-05, 4.57763672e-04, 3.20434570e-03, 1.38854980e-02,
4.16564941e-02, 9.16442871e-02, 1.52740479e-01, 1.96380615e-01,
1.96380615e-01, 1.52740479e-01, 9.16442871e-02, 4.16564941e-02,
1.38854980e-02, 3.20434570e-03, 4.57763672e-04, 3.05175781e-05])
To apply the filter to a signal, use convolution. There are several discrete convolution functions available, including numpy.convolve, scipy.signal.convolve, scipy.ndimage.convolve1d. You can also use scipy.signal.lfilter (give the coefficients as the b argument, and set a=1).
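For instance, here is a small sketch of applying an 8th-order binomial filter with scipy.signal.lfilter; the signal x is just made-up noisy data:

import numpy as np
from scipy.signal import lfilter

# Made-up noisy signal.
x = np.sin(np.linspace(0, 4 * np.pi, 200)) + 0.1 * np.random.randn(200)

b = (np.poly1d([0.5, 0.5])**8).coeffs   # same as binomcoeffs(8) above: 9 normalized taps
smoothed = lfilter(b, 1.0, x)           # FIR filter: b is the numerator, a = 1
# np.convolve(x, b, mode='same') gives a centered, same-length alternative.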
For concrete examples, check out "Applying a FIR filter", a short article that I wrote several years ago (and that has been edited by others since then). Note that the timings shown in that article might not be up-to-date. The code in both NumPy and SciPy is continually evolving. If you run those scripts now, you might get radically different results.
I was trying to figure out how to calculate the Frobenius norm of a matrix in numpy, so that I can get the 2-norm of each row of the matrix x below.
My question is about the ord parameter in numpy's linalg.norm function, and how the relevant part of the numpy documentation describes which norm of a matrix one can calculate. I was able to get the Frobenius norm by setting ord=2; however, the documentation says that only ord=None gives the Frobenius norm.
Here is my example:
x = np.array([[0, 3, 4],
[1, 6, 4]])
I found that I can get the Frobenius norm with the following line of code:
x_norm = np.linalg.norm(x, ord=2, axis=1, keepdims=True)
>>> x_norm
array([[ 5.        ],
       [ 7.28010989]])
My question is whether the documentation here could be considered less helpful than it should be, and whether that warrants a request to change the description of ord=2 in the aforementioned table.
You're not taking a matrix norm. Since you've passed axis=1, you're taking vector norms, and you should be looking at the vector norm column instead of the matrix norm column.
For vector norms, ord=None and ord=2 both produce a 2-norm.
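To make the two columns of that table concrete, here is the same x from the question with and without axis=1:

import numpy as np

x = np.array([[0., 3., 4.],
              [1., 6., 4.]])

np.linalg.norm(x, ord=2, axis=1)   # per-row vector 2-norms: array([5., 7.28010989])
np.linalg.norm(x, axis=1)          # same values: ord=None is also the 2-norm for vectors
np.linalg.norm(x)                  # ~8.8318, the Frobenius norm of the whole matrix
np.linalg.norm(x, ord=2)           # largest singular value of x, not the Frobenius norm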
I perform SVD with sklearn.decomposition.PCA
From the equation of the SVD
A = U * S * V_t
where V_t is the transpose of V.
If I want the matrices U, S, and V, how can I get them if I use sklearn.decomposition.PCA?
First of all, depending on the size of your matrix, the sklearn implementation of PCA will not always compute the full SVD decomposition. The following is taken from PCA's GitHub repository:
svd_solver : string {'auto', 'full', 'arpack', 'randomized'}
auto :
the solver is selected by a default policy based on `X.shape` and
`n_components`: if the input data is larger than 500x500 and the
number of components to extract is lower than 80% of the smallest
dimension of the data, then the more efficient 'randomized'
method is enabled. Otherwise the exact full SVD is computed and
optionally truncated afterwards.
full :
run exact full SVD calling the standard LAPACK solver via
`scipy.linalg.svd` and select the components by postprocessing
arpack :
run SVD truncated to n_components calling ARPACK solver via
`scipy.sparse.linalg.svds`. It requires strictly
0 < n_components < X.shape[1]
randomized :
run randomized SVD by the method of Halko et al.
In addition, it performs some manipulations on the data, for example centering it (see here).
Now, if you want to get the U, S, V used inside sklearn.decomposition.PCA, you can use pca._fit(X). Note that _fit is a private method, so it may change between versions.
For example:
import numpy as np
from sklearn.decomposition import PCA
X = np.array([[1, 2], [3, 5], [8, 10], [-1, 1], [5, 6]])
pca = PCA(n_components=2)
pca._fit(X)
which returns
(array([[ -3.55731195e-01,   5.05615563e-01],
        [  2.88830295e-04,  -3.68261259e-01],
        [  7.10884729e-01,  -2.74708608e-01],
        [ -5.68187889e-01,  -4.43103380e-01],
        [  2.12745524e-01,   5.80457684e-01]]),
 array([ 9.950385  ,  0.76800941]),
 array([[ 0.69988535,  0.71425521],
        [ 0.71425521, -0.69988535]]))
However, if you just want the SVD decomposition of the original data, I would suggest using scipy.linalg.svd.
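For example, a minimal sketch (note that PCA centers the data before its SVD, so center X yourself if you want factors comparable to what PCA uses internally):

import numpy as np
from scipy.linalg import svd

X = np.array([[1, 2], [3, 5], [8, 10], [-1, 1], [5, 6]], dtype=float)
Xc = X - X.mean(axis=0)                  # column-wise centering, as PCA does
U, S, Vt = svd(Xc, full_matrices=False)  # thin SVD: Xc = U @ np.diag(S) @ Vt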
Suppose the convolution of a general number of discrete probability density functions needs to be calculated. For the example below there are four distributions which take on values 0,1,2 with the specified probabilities:
import numpy as np
pdfs = np.array([[0.6,0.3,0.1],[0.5,0.4,0.1],[0.3,0.7,0.0],[1.0,0.0,0.0]])
The convolution can be found like this:
pdf = pdfs[0]
for i in range(1, pdfs.shape[0]):
    pdf = np.convolve(pdfs[i], pdf)
The probabilities of seeing 0,1,...,8 are then given by
array([ 0.09 , 0.327, 0.342, 0.182, 0.052, 0.007, 0. , 0. , 0. ])
This part is the bottleneck in my code and it seems there must be something available to vectorize this operation. Does anyone have a suggestion for making it faster?
Alternatively, a solution where you could use
pdf1 = np.array([[0.6,0.3,0.1],[0.5,0.4,0.1]])
pdf2 = np.array([[0.3,0.7,0.0],[1.0,0.0,0.0]])
convolve(pdf1, pdf2)
and get the pairwise convolutions
array([[ 0.18,  0.51,  0.24,  0.07,  0.  ],
       [ 0.5 ,  0.4 ,  0.1 ,  0.  ,  0.  ]])
would also help tremendously.
You can compute the convolution of all your PDFs efficiently using fast fourier transforms (FFTs): the key fact is that the FFT of the convolution is the product of the FFTs of the individual probability density functions. So transform each PDF, multiply the transformed PDFs together, and then perform the inverse transform. You'll need to pad each input PDF with zeros to the appropriate length to avoid effects from wraparound.
This should be reasonably efficient: if you have m PDFs, each containing n entries, then the time to compute the convolution using this method should grow as (m^2)n log(mn). The time is dominated by the FFTs, and we're effectively computing m + 1 independent FFTs (m forward transforms and one inverse transform), each of an array of length no greater than mn. But as always, if you want real timings you should profile.
Here's some code:
import numpy.fft
def convolve_many(arrays):
    """
    Convolve a list of 1d float arrays together, using FFTs.
    The arrays need not have the same length, but each array should
    have length at least 1.
    """
    result_length = 1 + sum((len(array) - 1) for array in arrays)
    # Copy each array into a 2d array of the appropriate shape.
    rows = numpy.zeros((len(arrays), result_length))
    for i, array in enumerate(arrays):
        rows[i, :len(array)] = array
    # Transform, take the product, and do the inverse transform
    # to get the convolution.
    fft_of_rows = numpy.fft.fft(rows)
    fft_of_convolution = fft_of_rows.prod(axis=0)
    convolution = numpy.fft.ifft(fft_of_convolution)
    # Assuming real inputs, the imaginary part of the output can
    # be ignored.
    return convolution.real
Applying this to your example, here's what I get:
>>> convolve_many([[0.6, 0.3, 0.1], [0.5, 0.4, 0.1], [0.3, 0.7], [1.0]])
array([ 0.09 , 0.327, 0.342, 0.182, 0.052, 0.007])
That's the basic idea. If you want to tweak this, you might also look at numpy.fft.rfft (and its inverse, numpy.fft.irfft), which take advantage of the fact that the input is real to produce more compact transformed arrays. You might also be able to gain some speed by padding the rows array with zeros so that the total number of columns is optimal for performing an FFT. The definition of "optimal" here would depend on the FFT implementation, but powers of two would be good targets, for example. Finally, there are some obvious simplifications that can be made when creating rows if all the input arrays have the same length. But I'll leave these potential enhancements to you.
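For what it's worth, here is a sketch of the rfft variant mentioned above (I haven't timed it, so treat any speedup as something to verify):

import numpy as np

def convolve_many_real(arrays):
    """Same as convolve_many above, but using the real-input FFT routines."""
    result_length = 1 + sum(len(a) - 1 for a in arrays)
    rows = np.zeros((len(arrays), result_length))
    for i, a in enumerate(arrays):
        rows[i, :len(a)] = a
    ffts = np.fft.rfft(rows, axis=1)
    # n=result_length makes the inverse transform return the full-length real result.
    return np.fft.irfft(ffts.prod(axis=0), n=result_length)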
I have the following matrices sigma and sigmad:
sigma:
1.9958 0.7250
0.7250 1.3167
sigmad:
4.8889 1.1944
1.1944 4.2361
If I try to solve the generalized eigenvalue problem in python I obtain:
d, V = sc.linalg.eig(matrix(sigmad), matrix(sigma))
V:
-1 -0.5614
-0.4352 1
If I try to solve the generalized eigenvalue problem in MATLAB I obtain:
[V,d]=eig(sigmad,sigma)
V:
-0.5897 -0.5278
-0.2564 0.9400
But the d's do coincide.
Any (nonzero) scalar multiple of an eigenvector will also be an eigenvector; only the direction is meaningful, not the overall normalization. Different routines use different conventions -- often you'll see the magnitude set to 1, or the maximum value set to 1 or -1 -- and some routines don't even bother being internally consistent for performance reasons. Your two different results are multiples of each other:
In [227]: sc = array([[-1., -0.5614], [-0.4352, 1. ]])
In [228]: ml = array([[-.5897, -0.5278], [-0.2564, 0.94]])
In [229]: sc/ml
Out[229]:
array([[ 1.69577751,  1.06366048],
       [ 1.69734789,  1.06382979]])
and so they're actually the same eigenvectors. Think of the matrix as an operator which changes a vector: the eigenvectors are the special directions where a vector pointing that way won't be twisted by the matrix, and the eigenvalues are the factors measuring how much the matrix expands or contracts the vector.
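To see this numerically, you can rescale each eigenvector (column) so that its largest-magnitude entry becomes 1; the two results then agree to the precision of the digits quoted above:

import numpy as np

sc = np.array([[-1.0, -0.5614], [-0.4352, 1.0]])
ml = np.array([[-0.5897, -0.5278], [-0.2564, 0.94]])

def rescale_columns(V):
    # Divide each column (eigenvector) by its entry of largest absolute value.
    idx = np.argmax(np.abs(V), axis=0)
    return V / V[idx, np.arange(V.shape[1])]

print(rescale_columns(sc))
print(rescale_columns(ml))   # matches rescale_columns(sc) up to the rounding of the inputs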