Calculating Correlation Coefficient with NumPy

I have a list of values and a 1-d numpy array, and I would like to calculate the correlation coefficient using numpy.corrcoef(x,y,rowvar=0). I get the following error:
Traceback (most recent call last):
File "testLearner.py", line 25, in <module>
corr = np.corrcoef(valuesToCompare,queryOutput,rowvar=0)
File "/usr/local/lib/python2.6/site-packages/numpy/lib/function_base.py", line 2003, in corrcoef
c = cov(x, y, rowvar, bias, ddof)
File "/usr/local/lib/python2.6/site-packages/numpy/lib/function_base.py", line 1935, in cov
X = concatenate((X,y), axis)
ValueError: array dimensions must agree except for d_0
I printed out the shape for my numpy array and got (400,1). When I convert my list to an array with numpy.asarray(y) I get (400,)!
I believe this is the problem. I did an array.reshape to (400,1) and printed out the shape, and I still get (400,). What am I missing?
Thanks in advance.

I think you might have assumed that reshape modifies the value of the original array. It doesn't:
>>> a = np.random.randn(5)
>>> a.shape
(5,)
>>> b = a.reshape(5,1)
>>> b.shape
(5, 1)
>>> a.shape
(5,)
np.asarray treats a regular list as a 1d array, but your original numpy array that you said was 1d is actually 2d (because its shape is (400,1)). If you want to use your list like a 2d array, there are two easy approaches:
np.asarray(lst).reshape((-1, 1)) – here -1 means "however many it needs" for that dimension.
np.asarray([lst]).T – .T means array transpose, which switches the shape from (1, 5) to (5, 1).
You could also reshape your original array to 1d via ary.reshape((-1,)).
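Putting the pieces together, a minimal runnable sketch of the fix (the data below is made up for illustration; the names stand in for the list and the (400, 1) array from the question):

```python
import numpy as np

# Shortened stand-ins for the original list and (400, 1) array.
values_to_compare = [1.0, 2.0, 3.0, 4.0]
query_output = np.array([[1.1], [2.2], [2.9], [4.1]])

# Keep the result of reshape -- it returns a new array, it does not
# modify the original in place.
x = np.asarray(values_to_compare).reshape((-1, 1))

corr = np.corrcoef(x, query_output, rowvar=0)
print(corr.shape)  # (2, 2); the off-diagonal entries are the coefficient
```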

Related

Vstack of two arrays with same number of rows gives an error

I have a numpy array of shape (29, 10) and a list of 29 elements and I want to end up with an array of shape (29,11)
I am basically converting the list to a numpy array and trying to vstack, but it complains about the dimensions not being the same.
Toy example
a = np.zeros((29,10))
a.shape
(29,10)
b = np.array(['A']*29)
b.shape
(29,)
np.vstack((a, b))
ValueError: all the input array dimensions except for the concatenation axis must match exactly
Dimensions do actually match, why am I getting this error and how can I solve it?
I think you are looking for np.hstack.
np.hstack((a, b.reshape(-1,1)))
Moreover, b must be 2-dimensional, which is why I used a reshape.
The problem is that you want to append a 1D array to a 2D array.
Also, for the dimension you've given for b, you are probably looking for hstack.
Try this:
a = np.zeros((29,10))
a.shape
(29,10)
b = np.array(['A']*29)[:,None] #to ensure 2D structure
b.shape
(29,1)
np.hstack((a, b))
If you do want to vertically stack, you'd need this:
a = np.zeros((29,10))
a.shape
(29,10)
b = np.array(['A']*10)[None,:] #to ensure 2D structure
b.shape
(1,10)
np.vstack((a, b))
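As a further sketch, np.column_stack handles the horizontal case without any manual reshaping, since it treats 1-D inputs as columns:

```python
import numpy as np

a = np.zeros((29, 10))
b = np.array(['A'] * 29)   # 1-D, shape (29,)

# column_stack treats 1-D inputs as columns, so no reshape is required.
# Note the result is upcast to a string dtype, since b holds strings.
out = np.column_stack((a, b))
print(out.shape)           # (29, 11)
```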

Is this behaviour of NDArray correct?

I feel the behaviour of ndarray object is incorrect. I created one using the line of code below
c = np.ones((5,0), dtype=np.int32)
Some of the commands and outputs are given below
print(c)
[]
c
array([], shape=(5, 0), dtype=int32)
c[0]
array([], dtype=int32)
print(c[0])
[]
It's as if the empty array contains empty arrays. I can assign values, but the value is lost; it doesn't show:
print(c)
[]
c.shape
(5, 0)
c[0]=10
print(c)
[]
print(c[0])
[]
What does (5,0) array mean? What is the difference between a and c?
a = np.ones((5,), dtype=np.int32)
c = np.ones((5,0), dtype=np.int32)
I am sorry I am new to Python so my knowledge is very basic.
Welcome to Python. There seems to be some misconception about the shape of an array, in particular about shapes of the form (n,). A shape of (n,) corresponds to a one-dimensional numpy array; if you are familiar with linear algebra, this 1D array is analogous to a row vector. It is NOT the same thing as (n, 0). A shape of (n, m) describes a 2D numpy array (analogous to a matrix in linear algebra), so a shape of (n, 0) means an array with n rows but 0 columns in each row, which is why you are returned an empty array. If you do in fact want a vector of ones, you can type np.ones((5,)). Hope it helps. Comment if you require any further help.
In [43]: c = np.ones((5,0), dtype=np.int32)
In [44]: c
Out[44]: array([], shape=(5, 0), dtype=int32)
In [45]: c.size
Out[45]: 0
In [46]: np.ones(5).size
Out[46]: 5
The size, or number of elements, of an array is the product of its shape. For c that's 5*0 = 0. c is a 2d array that contains nothing.
If I try to assign a value to a column of c I get an error:
In [49]: c[:,0]=10
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-49-7baeeb61be4e> in <module>
----> 1 c[:,0]=10
IndexError: index 0 is out of bounds for axis 1 with size 0
Your assignment:
In [51]: c[0] = 10
is actually:
In [52]: c[0,:] = np.array(10)
That works because the c[0,:].shape is (0,), and an array with shape () or (1,) can be 'broadcast' to that target. That's a tricky case of broadcasting.
A more instructive case of assignment to c is where we try to assign 2 values:
In [57]: c[[0,1],:] = np.array([1,2])
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-57-9ad1339293df> in <module>
----> 1 c[[0,1],:] = np.array([1,2])
ValueError: shape mismatch: value array of shape (2,) could not be broadcast to indexing result of shape (2,0)
The source array is 1d with shape (2,). The target is 2d with shape (2,0).
In general, arrays with a 0 in the shape are confusing and shouldn't be created. They sometimes arise when indexing other arrays, but don't make one with np.zeros((n,0)) except as an experiment.
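To illustrate how zero-sized arrays show up from ordinary indexing rather than being built deliberately, a short sketch:

```python
import numpy as np

a = np.arange(12).reshape(3, 4)

# A boolean mask that matches nothing gives a zero-length leading axis.
no_rows = a[a[:, 0] > 100]
print(no_rows.shape)    # (0, 4)

# An empty slice produces a zero-length trailing axis, like the (5, 0) case.
print(a[:, 2:2].shape)  # (3, 0)
```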

Using broadcasting with sparse scipy matrices

I have a numpy array Z with shape (k,N) and a second array X with shape (N,n).
Using numpy broadcasting, I can easily obtain a new array H with shape (n,k,N) whose slices are the array Z whose rows have been multiplied by the columns of X:
H = Z.reshape((1, k, N)) * X.T.reshape((n, 1, N))
This works fine and is surprisingly fast.
Now, X is extremely sparse, and I want to further speed up this operation using sparse matrix operations.
However if I perform the following operations:
import scipy.sparse as sprs
spX = sprs.csr_matrix(X)
H = (Z.reshape((1,k,N))*spX.T.reshape((n,1,N))).dot(Z.T)
I get the following error:
Traceback (most recent call last):
File "<input>", line 1, in <module>
File "C:\Python27\lib\site-packages\scipy\sparse\base.py", line 126, in reshape
self.__class__.__name__)
NotImplementedError: Reshaping not implemented for csc_matrix.
Is there a way to use broadcasting with sparse scipy matrices?
Scipy sparse matrices are limited to 2D shapes. But you can use Numpy in a "sparse" way:
H = np.zeros((n,k,N), np.result_type(Z, X))
I, J = np.nonzero(X)
Z_ = np.broadcast_to(Z, H.shape)
H[J,:,I] = Z_[J,:,I] * X[I,J,None]
Unfortunately the result H is still a dense array.
N.b. indexing with None is a handy way to add a unit-length dimension at the desired axis. The order of the result when combining advanced indexing with slicing is explained in the docs.
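As a sanity check, here is a small runnable sketch (with arbitrary toy sizes for k, n and N) comparing this fill-by-index approach against the dense broadcast from the question:

```python
import numpy as np

rng = np.random.default_rng(0)
k, n, N = 3, 4, 5
Z = rng.standard_normal((k, N))
X = rng.standard_normal((N, n))
X[rng.random(X.shape) < 0.7] = 0.0   # make X mostly zero

# Dense broadcast from the question.
H_dense = Z.reshape((1, k, N)) * X.T.reshape((n, 1, N))

# Fill only the nonzero entries, as in the answer above.
H = np.zeros((n, k, N), np.result_type(Z, X))
I, J = np.nonzero(X)
Z_ = np.broadcast_to(Z, H.shape)
H[J, :, I] = Z_[J, :, I] * X[I, J, None]

print(np.allclose(H, H_dense))   # True
```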

What is the difference between using matrix multiplication with np.matrix arrays, and dot()/tensor() with np.arrays?

At the moment, my code is written entirely using numpy arrays, np.array.
Define m as a np.array of 100 values, m.shape = (100,). There is also a multi-dimensional array, C.shape = (100,100).
The operation I would like to compute is
m^T * C * m
where m^T should be of shape (1,100), m of shape (100,1), and C of shape (100,100).
I'm conflicted about how to proceed. If I insist the data types must remain np.arrays, then I should probably use numpy.dot() or numpy.tensordot() and specify the axis. That would be something like
import numpy as np
result = np.dot(C, m)
final = np.dot(m.T, result)
though m.T is an array of the same shape as m. Also, that's doing two individual operations instead of one.
Otherwise, I should convert everything into np.matrix and proceed to use matrix multiplication there. The problem with this is I must convert all my np.arrays into np.matrix, do the operations, and then convert back to np.array.
What is the most efficient and intelligent thing to do?
EDIT:
Based on the answers so far, I think np.dot(m, np.dot(C, m)) is probably the best way forward (for a 1-D m, the transpose is a no-op).
The main advantage of working with matrices is that the * symbol performs matrix multiplication, whereas it performs element-wise multiplication with arrays. With arrays you need to use dot. See:
What are the differences between numpy arrays and matrices? Which one should I use?
If m is a one dimensional array, you don't need to transpose anything, because for 1D arrays, transpose doesn't change anything:
In [28]: m.T.shape, m.shape
Out[28]: ((3,), (3,))
In [29]: m.dot(C)
Out[29]: array([15, 18, 21])
In [30]: C.dot(m)
Out[30]: array([ 5, 14, 23])
This is different if you add another dimension to m:
In [31]: mm = m[:, np.newaxis]
In [32]: mm.dot(C)
--------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-32-28253c9b8898> in <module>()
----> 1 mm.dot(C)
ValueError: objects are not aligned
In [33]: (mm.T).dot(C)
Out[33]: array([[15, 18, 21]])
In [34]: C.dot(mm)
Out[34]:
array([[ 5],
[14],
[23]])
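Since m is 1-D, the whole quadratic form takes just two dot calls. A sketch, with m and C reconstructed to match the transcript's outputs (an assumption about the values used there):

```python
import numpy as np

# These reproduce m.dot(C) == [15, 18, 21] and C.dot(m) == [5, 14, 23]
# from the transcript above.
m = np.array([0, 1, 2])
C = np.arange(9).reshape(3, 3)

# For 1-D m no transpose is needed: dot lines the axes up on both sides.
result = m.dot(C).dot(m)
print(result)        # 60

# Equivalent with the @ operator (Python 3.5+ / NumPy 1.10+):
print(m @ C @ m)     # 60
```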

One dimensional Mahalanobis Distance in Python

I've been trying to validate my code to calculate Mahalanobis distance written in Python (and double check to compare the result in OpenCV)
My data points are of 1 dimension each (5 rows x 1 column).
In OpenCV (C++), I was successful in calculating the Mahalanobis distance when the dimension of a data point was with above dimensions.
The following code was unsuccessful in calculating the Mahalanobis distance when the dimension of the matrix was 5 rows x 1 column, but it works when the number of columns in the matrix is more than one:
import numpy
import scipy.spatial.distance
s = numpy.array([[20],[123],[113],[103],[123]])
covar = numpy.cov(s, rowvar=0)
invcovar = numpy.linalg.inv(covar)
print scipy.spatial.distance.mahalanobis(s[0], s[1], invcovar)
I get the following error:
Traceback (most recent call last):
File "/home/abc/Desktop/Return.py", line 6, in <module>
invcovar = numpy.linalg.inv(covar)
File "/usr/lib/python2.6/dist-packages/numpy/linalg/linalg.py", line 355, in inv
return wrap(solve(a, identity(a.shape[0], dtype=a.dtype)))
IndexError: tuple index out of range
One-dimensional Mahalanobis distance is really easy to calculate manually:
import numpy as np
s = np.array([[20], [123], [113], [103], [123]])
std = s.std(ddof=1)  # N-1 normalization, to match np.cov's default
print np.abs(s[0] - s[1]) / std
(reducing the formula to the one-dimensional case).
But the problem with scipy.spatial.distance is that for some reason np.cov returns a scalar, i.e. a zero-dimensional array, when given a set of 1d variables. You want to pass in a 2d array:
>>> covar = np.cov(s, rowvar=0)
>>> covar.shape
()
>>> invcovar = np.linalg.inv(covar.reshape((1,1)))
>>> invcovar.shape
(1, 1)
>>> mahalanobis(s[0], s[1], invcovar)
2.3674720531046645
Covariance needs 2 arrays to compare. In both np.cov() and OpenCV's calcCovarMatrix, it expects the two arrays to be stacked on top of each other (use vstack). You can also have the 2 arrays side by side if you set rowvar to False in numpy or use COVAR_COL in OpenCV. If your arrays are multidimensional, just flatten() them first.
So if I want to compare two 24x24 images, I flatten them both into 1x576 vectors, then stack the two to get a 2x576 array, and that is the first argument of np.cov() (with rowvar=0, so each of the 576 pixel positions is treated as a variable).
You should then get a large square matrix showing the results of comparing each element in array1 with each element in array2. In my example it will be 576x576. THAT is what you pass into your invert function.
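Tying the thread together, a runnable sketch of the 1-D case (Python 3 syntax here; the only change needed versus the question's code is giving the 0-d covariance a (1, 1) shape):

```python
import numpy as np
from scipy.spatial.distance import mahalanobis

s = np.array([[20.], [123.], [113.], [103.], [123.]])

# np.cov of a single 1-D variable returns a 0-d array; reshape to (1, 1).
covar = np.cov(s, rowvar=0).reshape((1, 1))
invcovar = np.linalg.inv(covar)

d = mahalanobis(s[0], s[1], invcovar)

# Manual 1-D equivalent: |x - y| / std, with ddof=1 to match np.cov.
manual = abs(s[0, 0] - s[1, 0]) / s.std(ddof=1)
print(d, manual)   # both ~2.3675
```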
