Different results from scipy.stats.spearmanr depending on how data is produced - python

I'm having a weird problem with spearmanr from scipy.stats. I'm using the values of a polynomial to get correlations that are a bit more interesting to work with, but if I enter the values manually (as a list, converted to a numpy array) I get a different correlation from the one I get if I calculate the values with a function. The code below should demonstrate what I mean:
import numpy as np
from scipy.stats import spearmanr
data = np.array([ 0.4, 1.2, 1. , 0.4, 0. , 0.4, 2.2, 6. , 12.4, 22. ])
axis = np.arange(0, 10, dtype=np.float64)
print(spearmanr(axis, data))  # gives a correlation of 0.693...
# Use this polynomial
poly = lambda x: 0.1*(x - 3.0)**3 + 0.1*(x - 1.0)**2 - x + 3.0
data2 = poly(axis)
print(data2) # It is the same as data
print(spearmanr(axis, data2))  # gives a correlation of 0.729...
I did notice that the arrays are subtly different (i.e. data - data2 is not exactly zero for all elements), but the difference is tiny - order of 1e-16.
Is such a tiny difference enough to throw off spearmanr by this much?

Is such a tiny difference enough to throw off spearmanr by this much?
Yes, because Spearman's r is based on the sample rank. Such tiny differences can change the rank of values that would otherwise be equal:
from scipy.stats import rankdata
rankdata(data)
# array([ 3., 6., 5., 3., 1., 3., 7., 8., 9., 10.])
# Note that all three values of 0.4 get the same rank 3.
rankdata(data2)
# array([ 2.5, 6. , 5. , 2.5, 1. , 4. , 7. , 8. , 9. , 10. ])
# Note that two of the 0.4 values get rank 2.5 and one gets rank 4.
If you add a small gradient (larger than the numerical difference you observe) to break such ties, you will get the same result:
print(spearmanr(axis, data + np.arange(10)*1e-12))
# SpearmanrResult(correlation=0.74545454545454537, pvalue=0.013330146315440047)
print(spearmanr(axis, data2 + np.arange(10)*1e-12))
# SpearmanrResult(correlation=0.74545454545454537, pvalue=0.013330146315440047)
This, however, will break any ties that may be intentional and can lead to over- or underestimating the correlation. numpy.round may be the preferable solution if the data is expected to have discrete values.
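For example, rounding data2 to a precision well above the 1e-16 noise restores the intended ties and reproduces the first result (a sketch; the number of decimals is a choice you should adapt to your data):
print(spearmanr(axis, np.round(data2, decimals=8)))
# same as spearmanr(axis, data): correlation of 0.693...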

Related

norm parameters in sklearn.preprocessing.normalize

The sklearn documentation says that "norm" can be either
norm : ‘l1’, ‘l2’, or ‘max’, optional (‘l2’ by default)
The norm to use to normalize each non zero sample (or each non-zero feature if axis is 0).
The documentation about normalization doesn't clearly state how ‘l1’, ‘l2’, or ‘max’ are calculated.
Can anyone clarify this?
Informally speaking, the norm is a generalization of the concept of (vector) length; from the Wikipedia entry:
In linear algebra, functional analysis, and related areas of mathematics, a norm is a function that assigns a strictly positive length or size to each vector in a vector space.
The L2-norm is the usual Euclidean length, i.e. the square root of the sum of the squared vector elements.
The L1-norm is the sum of the absolute values of the vector elements.
The max-norm (sometimes also called infinity norm) is simply the maximum absolute vector element.
As the docs say, normalization here means making our vectors (i.e. data samples) have unit length, so it is also necessary to specify which length (i.e. which norm) to use.
You can easily verify the above adapting the examples from the docs:
from sklearn import preprocessing
import numpy as np
X = [[ 1., -1.,  2.],
     [ 2.,  0.,  0.],
     [ 0.,  1., -1.]]
X_l1 = preprocessing.normalize(X, norm='l1')
X_l1
# array([[ 0.25, -0.25,  0.5 ],
#        [ 1.  ,  0.  ,  0.  ],
#        [ 0.  ,  0.5 , -0.5 ]])
You can verify by simple visual inspection that, in each row, the absolute values of the elements of X_l1 sum up to 1.
X_l2 = preprocessing.normalize(X, norm='l2')
X_l2
# array([[ 0.40824829, -0.40824829,  0.81649658],
#        [ 1.        ,  0.        ,  0.        ],
#        [ 0.        ,  0.70710678, -0.70710678]])
np.sqrt(np.sum(X_l2**2, axis=1)) # verify that L2-norm is indeed 1
# array([ 1., 1., 1.])
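For completeness, the same kind of check works for the max norm; assuming, as the definition above suggests, that each row is divided by its largest absolute element, you would get:
X_max = preprocessing.normalize(X, norm='max')
X_max
# array([[ 0.5, -0.5,  1. ],
#        [ 1. ,  0. ,  0. ],
#        [ 0. ,  1. , -1. ]])
np.max(np.abs(X_max), axis=1)  # verify that the max-norm is indeed 1
# array([ 1.,  1.,  1.])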

implementing euclidean distance based formula using numpy

I am trying to implement this formula in python using numpy (the formula is shown as an image in the original post; from the code below it is the sum over all x_i of the minimum euclidean distance from x_i to any C_j).
X is a numpy matrix and each x_i (a row of X) is a vector with n dimensions; C is also a numpy matrix and each C_i is a vector with n dimensions; dist(C_i, x_i) is the euclidean distance between these two vectors.
I implemented this in python:
import math
import numpy as np

value = 0
for i in range(X.shape[0]):
    min_value = math.inf
    # this inner loop iterates k times (once per row of C)
    for j in range(C.shape[0]):
        distance = (np.dot(X[i] - C[j], X[i] - C[j])) ** .5
        min_value = min(min_value, distance)
    value += min_value
fitnessValue = value
But the performance of my code is not good enough. I'm looking for a faster way to calculate that formula in python; any idea would be appreciated.
Generally, loops that run a large number of times should be avoided in python when possible.
Here, there is a scipy function, scipy.spatial.distance.cdist(C, X), which computes the pairwise distance matrix between C and X. That is to say, if you call distance_matrix = scipy.spatial.distance.cdist(C, X), you have distance_matrix[i, j] = dist(C_i, X_j).
Then, for each j, you want to compute the minimum of dist(C_i, X_j) over all i. You do not need a loop for this either: numpy.min does it for you if you pass an axis argument.
And finally, the summation of all these minima is done by calling numpy.sum.
This gives code that is much more readable and faster:
import scipy.spatial.distance
import numpy as np

def your_function(C, X):
    distance_matrix = scipy.spatial.distance.cdist(C, X)
    minimum = np.min(distance_matrix, axis=0)
    return np.sum(minimum)
Which returns the same results as your function :)
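As a quick sanity check, you can compare it with the loop version on some random data (a sketch; the shapes are arbitrary and your_function is the helper defined above):
import numpy as np
rng = np.random.default_rng(0)
C = rng.random((5, 3))    # k = 5 centers in 3 dimensions
X = rng.random((100, 3))  # 100 data points
print(your_function(C, X))  # should match fitnessValue from the loop version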
Hope this helps!
einsum can also be called into play. Here is a small example of a pairwise distance calculation for a small data set, useful if you don't have scipy installed and/or wish to use numpy only.
>>> a
array([[ 0.,  0.],
       [ 1.,  1.],
       [ 2.,  2.],
       [ 3.,  3.],
       [ 4.,  4.]])
>>> b = a.reshape(np.prod(a.shape[:-1]), 1, a.shape[-1])
>>> b
array([[[ 0.,  0.]],
       [[ 1.,  1.]],
       [[ 2.,  2.]],
       [[ 3.,  3.]],
       [[ 4.,  4.]]])
>>> diff = a - b; dist_arr = np.sqrt(np.einsum('ijk,ijk->ij', diff, diff)).squeeze()
>>> dist_arr
array([[ 0.     ,  1.41421,  2.82843,  4.24264,  5.65685],
       [ 1.41421,  0.     ,  1.41421,  2.82843,  4.24264],
       [ 2.82843,  1.41421,  0.     ,  1.41421,  2.82843],
       [ 4.24264,  2.82843,  1.41421,  0.     ,  1.41421],
       [ 5.65685,  4.24264,  2.82843,  1.41421,  0.     ]])
Array 'a' is a simple 2D array (shape (5, 2)); 'b' is just 'a' reshaped to (5, 1, 2) to facilitate the difference calculations for the cdist-style array. The terms are written verbosely since they are extracted from other code. The 'diff' variable is the difference array, and the dist_arr shown is the 'euclidean' distance. Should you need the squared euclidean distance for 'closest' determinations, simply remove the np.sqrt call; the final squeeze just removes any length-1 dimensions from the shape.
cdist is faster for much larger arrays (in the order of 1000s of origins and destinations) but einsum is a nice alternative and well documented by others on this site.
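If you prefer to stay numpy-only for the original problem too, the same einsum trick can be combined with a min and a sum (a sketch assuming C has shape (k, n) and X has shape (m, n)):
import numpy as np

def fitness_einsum(C, X):
    # pairwise differences via broadcasting: shape (k, m, n)
    diff = C[:, None, :] - X[None, :, :]
    # euclidean distance between every C_j and every x_i: shape (k, m)
    dist = np.sqrt(np.einsum('ijk,ijk->ij', diff, diff))
    # minimum over the centers, summed over the data points
    return np.min(dist, axis=0).sum()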

Is there A 1D interpolation (along one axis) of an image using two images (2D arrays) as inputs? [duplicate]

This question already has an answer here:
Interpolate in one direction
(1 answer)
Closed 7 years ago.
I have two images representing x and y values. The images are full of 'holes' (the 'holes' are the same in both images).
I want to interpolate (linear interpolation is fine though higher level interpolation is preferable) along ONE of the axis in order to 'fill' the holes.
Say the axis of choice is 0, that is, I want to interpolate across each column. All I have found so far is interpolation when the x values are the same for every row (e.g. scipy.interpolate.interp1d). In this case, however, each x is different (i.e. the holes or empty cells are different in each row).
Is there any numpy/scipy technique I can use? Could a 1D convolution work? (though kernels are fixed)
You can still use interp1d:
import numpy as np
from scipy import interpolate

A = np.array([[1, np.nan, np.nan, 2], [0, np.nan, 1, 2]])
# array([[ 1., nan, nan,  2.],
#        [ 0., nan,  1.,  2.]])
for row in A:
    mask = np.isnan(row)
    x, y = np.where(~mask)[0], row[~mask]
    f = interpolate.interp1d(x, y, kind='linear')
    row[mask] = f(np.where(mask)[0])
# array([[ 1.        ,  1.33333333,  1.66666667,  2.        ],
#        [ 0.        ,  0.5       ,  1.        ,  2.        ]])

How to create a trendline with gaps of missing data in python?

So I'm new to python AND data analysis, but have been tasked to create a scatter plot. The data set that I'm using has many elements containing None values. When I use the polyfit method to create a trendline (best-fit line) I get errors for the Nones. I've tried using lists and numpy arrays with dismal results. I've also tried masked_array, masked_invalid, etc. in MULTIPLE configurations, but it kept giving me an array filled with Nones. Is there a way of creating a trendline such that I don't need to remove the elements that have None values? I need them to keep my plot dimensions correct. I'm using Python 2.7. This is what I have so far:
import matplotlib.pyplot as plt
import numpy as np
import numpy.ma as ma
import pylab
#The InterpolatedUnivariateSpline method popped up during my endeavor
#to extrapolate the trendline through the gaps in data.
#To be honest, I don't think its doing anything for me...
from scipy.interpolate import InterpolatedUnivariateSpline
fig, ax = plt.subplots(1,1)
ax.scatter(y, dbm, color = 'purple', marker = 'o', s = 100)
plt.xlim(min(y), max(y))
plt.xlabel('Temp - C')
dbm_array = np.asarray(dbm) #dbm and y are lists earlier in the program
y_array = np.asarray(y)
x = np.linspace(min(y), max(y), len(y))
order = 1
s = InterpolatedUnivariateSpline(y, dbm, k=order)
blah = s(x)
plt.plot(y, blah, '--k')
This gives me the scatter plot without the trendline for some reason. No errors, so I guess I got that going for me....
Thank you so much in advance!
First of all, if you have arrays, there should be no Nones in them, just nans. This is because None is an object which cannot be expressed as a number. So, the first problem may be here. Let's have a look:
import numpy as np
a = np.array([None, 1, 2, 3, 4, None])
What do we get?
>>> a
array([None, 1, 2, 3, 4, None], dtype=object)
This is most certainly something we did not want. It is an array of objects, which most of the time is not very useful. You cannot perform any calculations on it:
>>> 2*a
unsupported operand type(s) for *: 'int' and 'NoneType'
This happens because the element-wise multiplication tries to multiply 2*None.
So, what you really want to have is:
>>> a = np.array([np.nan, 1, 2, 3, 4, np.nan])
>>> a
array([ nan, 1., 2., 3., 4., nan])
>>> a.dtype
dtype('float64')
>>> 2 * a
array([ nan, 2., 4., 6., 8., nan])
Now everything works as expected.
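If your data starts out as a list containing None (as in the question), you can also ask numpy for a float dtype up front; as far as I know the None entries are then converted to nan directly:
>>> np.array([None, 1, 2, 3, 4, None], dtype=np.float64)
array([ nan,   1.,   2.,   3.,   4.,  nan])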
So, the first thing is to check that your input arrays have the correct form. If you then have problems with curve fitting, you may create an array without the nasty nans in there:
import numpy as np
a = np.array([[0,np.nan], [1, 1], [2, 1.5], [3.2, np.nan], [4, 5]])
b = a[~np.isnan(a[:, 1])]
Let's see the contents of a and b:
>>> a
array([[ 0. ,  nan],
       [ 1. ,  1. ],
       [ 2. ,  1.5],
       [ 3.2,  nan],
       [ 4. ,  5. ]])
>>> b
array([[ 1. ,  1. ],
       [ 2. ,  1.5],
       [ 4. ,  5. ]])
And this is what you want: the curve is fitted using b, without any nans, which have the habit of propagating and turning the results of calculations into nans. (This is by design.)
How does this work, then? np.isnan(a[:,1]) returns a boolean array with True at each position where column 1 of a holds a nan and False for each valid number. As this is exactly the opposite of what we want, we negate it with the ~ operator. The indexing then picks only the rows which have numbers.
In case you have your X data and Y data in two different 1-D vectors, do this:
# original y data: Y
# original x data: X
# both have the same length
# calculate a mask to be used (a boolean vector)
msk = ~np.isnan(Y)
# use the mask to plot both X and Y only at the points where Y is not NaN
plt.plot(X[msk], Y[msk])
In some cases you may not have the X data at all, but you would like to number the points from, e.g. 0 onwards (as matplotlib does if you only give it one vector). There are a couple of possibilities, but this is one:
msk = ~np.isnan(Y)
X = np.arange(len(Y))
plt.plot(X[msk], Y[msk])
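Putting it together for the question's data, a minimal sketch (assuming dbm and y are the lists from the question and a first-order fit, as in the original code):
dbm_array = np.array(dbm, dtype=float)  # None entries become nan
y_array = np.array(y, dtype=float)
msk = ~np.isnan(dbm_array) & ~np.isnan(y_array)  # keep points where both values are valid
coeffs = np.polyfit(y_array[msk], dbm_array[msk], 1)  # fit only the valid points
x = np.linspace(np.nanmin(y_array), np.nanmax(y_array), len(y))
plt.plot(x, np.polyval(coeffs, x), '--k')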

DBSCAN in Python: Unexpected result

I'm trying to understand the DBSCAN implementation by scikit-learn, but I'm having trouble. Here is my data sample:
X = [[0,0],[0,1],[1,1],[1,2],[2,2],[5,0],[5,1],[5,2],[8,0],[10,0]]
Then I calculate D as in the example provided
D = distance.squareform(distance.pdist(X))
D returns a matrix with the distance between each point and all others. The diagonal is thus always 0.
Then I run DBSCAN as:
db = DBSCAN(eps=1.1, min_samples=2).fit(D)
eps = 1.1 means, if I understood the documentation correctly, that points at a distance of 1.1 or less will be considered part of a cluster (core).
D[1] returns the following:
>>> D[1]
array([  1.        ,   0.        ,   1.        ,   1.41421356,
         2.23606798,   5.09901951,   5.        ,   5.09901951,
         8.06225775,  10.04987562])
which means the second point has a distance of 1 to the first and the third. So I expect them to build a cluster, but ...
>>> db.core_sample_indices_
[]
which means no cores found, right? Here are the other 2 outputs.
>>> db.components_
array([], shape=(0, 10), dtype=float64)
>>> db.labels_
array([-1., -1., -1., -1., -1., -1., -1., -1., -1., -1.])
Why isn't there any cluster?
I figure the implementation might just be treating your distance matrix as if it were a feature array.
Note that you usually wouldn't compute the full distance matrix for DBSCAN anyway, but would use a data index for faster neighbor search.
Judging from a 1 minute Google, consider adding metric="precomputed", since:
fit(X)
X: Array of distances between samples, or a feature array. The array is treated as a feature array unless the metric is given as ‘precomputed’.
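In other words, something along these lines should make DBSCAN treat D as a distance matrix (a sketch based on the documentation quoted above):
db = DBSCAN(eps=1.1, min_samples=2, metric='precomputed').fit(D)
print(db.core_sample_indices_)  # points with a neighbor within eps should now be core samples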
