How to create a trendline with gaps of missing data in Python?

So I'm new to Python AND data analysis, but have been tasked to create a scatter plot. The data set that I'm using has many elements containing None values. When I use the polyfit method to create a trendline (best-fit line) I get errors for the Nones. I've tried using lists and numpy arrays with dismal results. I've also tried masked_array, masked_invalid, etc. in MULTIPLE configurations, but it kept giving me an array filled with Nones. Is there a way of creating a trendline such that I don't need to remove the elements that have None values? I need them to keep my plot dimensions correct. I'm using Python 2.7. This is what I got so far:
import matplotlib.pyplot as plt
import numpy as np
import numpy.ma as ma
import pylab
#The InterpolatedUnivariateSpline method popped up during my endeavor
#to extrapolate the trendline through the gaps in data.
#To be honest, I don't think it's doing anything for me...
from scipy.interpolate import InterpolatedUnivariateSpline
fig, ax = plt.subplots(1,1)
ax.scatter(y, dbm, color = 'purple', marker = 'o', s = 100)
plt.xlim(min(y), max(y))
plt.xlabel('Temp - C')
dbm_array = np.asarray(dbm) #dbm and y are lists earlier in the program
y_array = np.asarray(y)
x = np.linspace(min(y), max(y), len(y))
order = 1
s = InterpolatedUnivariateSpline(y, dbm, k=order)
blah = s(x)
plt.plot(y, blah, '--k')
This gives me the scatter plot without the trendline for some reason. No errors, so I guess I got that going for me....
Thank you so much in advance!

First of all, if you have arrays, there should be no Nones in them, just nans. This is because None is an object which cannot be expressed as a number. So, the first problem may be here. Let's have a look:
import numpy as np
a = np.array([None, 1, 2, 3, 4, None])
What do we get?
>>> a
array([None, 1, 2, 3, 4, None], dtype=object)
This is most certainly not what we wanted. It is an array of objects, which is rarely very useful. You cannot perform any calculations on it:
>>> 2*a
TypeError: unsupported operand type(s) for *: 'int' and 'NoneType'
This happens because the element-wise multiplication tries to multiply 2*None.
So, what you really want to have is:
>>> a = np.array([np.nan, 1, 2, 3, 4, np.nan])
>>> a
array([ nan, 1., 2., 3., 4., nan])
>>> a.dtype
dtype('float64')
>>> 2 * a
array([ nan, 2., 4., 6., 8., nan])
Now everything works as expected.
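If your data starts out as plain Python lists containing None (as dbm and y do in the question), one way to get float arrays with nan in place of None is a small conversion step. This is a minimal sketch, not part of the original code; the sample values are made up:
import numpy as np

dbm_list = [None, -42.0, -40.5, None, -38.2]   # hypothetical data with gaps
dbm = np.array([np.nan if v is None else v for v in dbm_list], dtype=float)
# array([  nan, -42. , -40.5,   nan, -38.2])
After this, dbm behaves like the nan-filled array shown above.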
So, the first thing is to check that your input arrays have the correct form. If you then have problems with curve fitting, you may create an array without the nasty nans in there:
import numpy as np
a = np.array([[0,np.nan], [1, 1], [2, 1.5], [3.2, np.nan], [4, 5]])
b = a[~np.isnan(a[:,1])]
Let's see the contents of a and b:
>>> a
array([[ 0. ,  nan],
       [ 1. ,  1. ],
       [ 2. ,  1.5],
       [ 3.2,  nan],
       [ 4. ,  5. ]])
>>> b
array([[ 1. ,  1. ],
       [ 2. ,  1.5],
       [ 4. ,  5. ]])
And this is what you want. The curve is then fitted with b, which contains no nans; nans have the habit of propagating through calculations and turning the results into nans as well. (This is by design.)
How does this work, then? The np.isnan(a[:,1]) returns a boolean array with True at each position with a nan in column 1 of a and False for each valid number. As this is exactly the opposite of what we want, we negate it with the ~ operator in front. The indexing then picks only the rows which contain numbers.
In case you have your X data and Y data in two different 1-D vectors, do this:
# original y data: Y
# original x data: X
# both have the same length
# calculate a mask to be used (a boolean vector)
msk = ~np.isnan(Y)
# use the mask to plot both X and Y only at the points where Y is not NaN
plt.plot(X[msk], Y[msk])
In some cases you may not have the X data at all, but you would like to number the points from, e.g. 0 onwards (as matplotlib does if you only give it one vector). There are a couple of possibilities, but this is one:
msk = ~np.isnan(Y)
X = np.arange(len(Y))
plt.plot(X[msk], Y[msk])
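Putting this together for the original question: once dbm_array and y_array are float arrays with nan marking the gaps, you can fit the trendline with np.polyfit on just the valid points and still draw it across the full x-range, so the plot dimensions stay the same. A minimal sketch (variable names taken from the question; the 100-point grid is an arbitrary choice):
import numpy as np
import matplotlib.pyplot as plt

msk = ~np.isnan(y_array) & ~np.isnan(dbm_array)                 # points where both values are valid
slope, intercept = np.polyfit(y_array[msk], dbm_array[msk], 1)  # first-order (linear) fit
x = np.linspace(np.nanmin(y_array), np.nanmax(y_array), 100)
plt.plot(x, slope*x + intercept, '--k')                         # trendline spans the whole axis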

Related

True inverse function for cosine in numpy? (NOT arccos)

Here is a weird one:
I have found myself needing a numpy function that is what I would call the true inverse of np.cos (or another trigonometric function; cosine is used here for definiteness). What I mean by "true inverse" is a function invcos, such that
np.cos(invcos(x)) = x
for any real float x. Two observations: invcos(x) exists (it is a complex float), and np.arccos(x) does not do the job, because it only works for -1 <= x <= 1.
My question is whether there is an efficient numpy function for this operation, or whether it can be built from existing ones easily?
My attempt was to use a combination of np.arccos and np.arccosh to build the function by hand. This is based on the observation that np.arccos can deal with x in [-1,1] and np.arccosh can deal with x outside [-1,1] if one multiplies by the complex unit. To see that this works:
cos_x = np.array([0.5, 1., 1.5])
x = np.arccos(cos_x)
cos_x_reconstructed = np.cos(x)
# [0.5 1. nan]
x2 = 1j*np.arccosh(cos_x)
cos_x_reconstructed2 = np.cos(x2)
# [nan+nanj 1.-0.j 1.5-0.j]
So we could combine this to
def invcos(array):
    x1 = np.arccos(array)
    x2 = 1j*np.arccosh(array)
    print(x1)
    print(x2)
    x = np.empty_like(x1, dtype=np.complex128)
    x[~np.isnan(x1)] = x1[~np.isnan(x1)]
    x[~np.isnan(x2)] = x2[~np.isnan(x2)]
    return x
cos_x = np.array([0.5, 1., 1.5])
x = invcos(cos_x)
cos_x_reconstructed = np.cos(x)
# [0.5-0.j 1.-0.j 1.5-0.j]
This gives the correct results, but naturally raises RuntimeWarnings:
RuntimeWarning: invalid value encountered in arccos.
I guess since numpy even tells me that my algorithm is not efficient, it is probably not efficient. Is there a better way to do this?
For readers who are interested in why this strange function may be useful: The motivation comes from a physics background. In certain theories, one can have vector components that are 'off-shell', which means that the components might even be longer than the vector. The above function can be useful to nevertheless parametrize things in terms of angles.
My question is whether there is an efficient numpy function for this operation, or whether it can be built from existing ones easily?
Yes; it is... np.arccos.
From the documentation:
For real-valued input data types, arccos always returns real output. For each value that cannot be expressed as a real number or infinity, it yields nan and sets the invalid floating point error flag.
For complex-valued input, arccos is a complex analytic function that has branch cuts [-inf, -1] and [1, inf] and is continuous from above on the former and from below on the latter.
So all we need to do is ensure that the input is a complex number (even if its imaginary part is zero):
>>> import numpy as np
>>> np.arccos(2.0)
__main__:1: RuntimeWarning: invalid value encountered in arccos
nan
>>> np.arccos(2 + 0j)
-1.3169578969248166j
For an array, we need the appropriate dtype:
>>> np.arccos(np.ones((3,3)) * 2)
array([[nan, nan, nan],
       [nan, nan, nan],
       [nan, nan, nan]])
>>> np.arccos(np.ones((3,3), dtype=complex) * 2)
array([[0.-1.3169579j, 0.-1.3169579j, 0.-1.3169579j],
       [0.-1.3169579j, 0.-1.3169579j, 0.-1.3169579j],
       [0.-1.3169579j, 0.-1.3169579j, 0.-1.3169579j]])
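Applied to the invcos the question asks for, a minimal sketch (the name invcos comes from the question; the cast to complex is the only real change) might look like this:
import numpy as np

def invcos(array):
    # casting to complex makes arccos return complex angles for |x| > 1
    return np.arccos(np.asarray(array, dtype=complex))

cos_x = np.array([0.5, 1., 1.5])
np.cos(invcos(cos_x))
# approximately [0.5, 1., 1.5], with (numerically) zero imaginary parts
This avoids the arccos/arccosh stitching and the RuntimeWarnings entirely.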

Masking a 2D array and operating on second array based off masked indices

I have a function that reads in and outputs a 2D array. I want the output to be constant (pi in this case) for every index in the input that equals 0; otherwise I perform some maths on it. E.g.:
import numpy as np
import numpy.ma as ma
def my_func(x):
    mask = ma.where(x==0,x)
    # make an array of pi's the same size and shape as the input
    y = np.pi * np.ones(x)
    # pseudo-code bit I can't figure out
    y.not_masked = y**2
    return y
my_array = [[0,1,2],[1,0,2],[1,2,0]]
result_array = my_func(my_array)
This should give me the following:
result_array = [[3.14, 1, 4],[1, 3.14, 4], [1, 4, 3.14]]
I.e. it has applied y**2 to each element in the 2D list that doesn't equal zero, and replaced all the zeros with pi.
I need this because my function will include division, and I don't know the indexes beforehand. I'm trying to convert a MATLAB tutorial from a textbook into Python, and this function is stumping me!
Thanks
Just use np.where() directly:
y = np.where(x, x**2, np.pi)
Example:
>>> x = np.asarray([[0,1,2],[1,0,2],[1,2,0]])
>>> y = np.where(x, x**2, np.pi)
>>> print(y)
[[ 3.14159265  1.          4.        ]
 [ 1.          3.14159265  4.        ]
 [ 1.          4.          3.14159265]]
Try this:
my_array = np.array([[0,1,2],[1,0,2],[1,2,0]]).astype(float)
def my_func(x):
    mask = x == 0
    x[mask] = np.pi
    x[~mask] = x[~mask]**2  # or some other operation on x...
    return x
Rather than using masked arrays, I would suggest using a plain boolean array to achieve what you want.
def my_func(x):
    # create a boolean matrix, a, that has True where x==0 and
    # False where x!=0
    a = x == 0
    x[a] = np.pi
    # use ~ (np.invert) to flip a so we can operate
    # on the non-zero values of the array
    x[~a] = x[~a]**2
    return x  # return the transformed array
my_array = np.array([[0.,1.,2.],[1.,0.,2.],[1.,2.,0.]])
result_array = my_func(my_array)
this gives the output:
array([[ 3.14159265,  1.        ,  4.        ],
       [ 1.        ,  3.14159265,  4.        ],
       [ 1.        ,  4.        ,  3.14159265]])
Notice that I passed the function a numpy array specifically; originally you passed a list, and that will give problems when you attempt to do mathematical operations on it. Also notice that I defined the array with 1. rather than just 1, to make sure it is an array of floats rather than integers, because if it is an array of integers then setting values equal to pi will truncate them to 3.
Perhaps it would be good to add a piece to the function that checks the input argument: verify that it is a numpy array rather than a list or other object, and that it contains floats, and adjust accordingly if not.
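A sketch of that suggestion (np.asarray and the float cast are standard numpy; the exact shape of the check is my own assumption, not part of the answer above):
import numpy as np

def my_func(x):
    x = np.asarray(x, dtype=float)  # accepts lists and integer arrays, casts to float
    a = x == 0
    out = x.copy()                  # work on a copy so the caller's array is untouched
    out[a] = np.pi
    out[~a] = out[~a]**2
    return out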
EDIT:
Change to using ~a rather than invert(a) as per Scotty1's suggestion.

Different results from scipy.stats.spearmanr depending on how data is produced

I'm having a weird problem using spearmanr from scipy.stats. I'm using the values of a polynomial to get some correlations that are a bit more interesting to work with, but if I manually enter the values (as a list, converted to a numpy array) I get a different correlation from the one I get if I calculate the values using a function. The code below should demonstrate what I mean:
import numpy as np
from scipy.stats import spearmanr
data = np.array([ 0.4, 1.2, 1. , 0.4, 0. , 0.4, 2.2, 6. , 12.4, 22. ])
axis = np.arange(0, 10, dtype=np.float64)
print(spearmanr(axis, data))  # gives a correlation of 0.693...
# Use this polynomial
poly = lambda x: 0.1*(x - 3.0)**3 + 0.1*(x - 1.0)**2 - x + 3.0
data2 = poly(axis)
print(data2) # It is the same as data
print(spearmanr(axis, data2))  # gives a correlation of 0.729...
I did notice that the arrays are subtly different (i.e. data - data2 is not exactly zero for all elements), but the difference is tiny - order of 1e-16.
Is such a tiny difference enough to throw off spearmanr by this much?
Is such a tiny difference enough to throw off spearmanr by this much?
Yes, because Spearman's r is based on the sample rank. Such tiny differences can change the rank of values that would otherwise be equal:
from scipy.stats import rankdata

rankdata(data)
# array([ 3., 6., 5., 3., 1., 3., 7., 8., 9., 10.])
# Note that all three values of 0.4 get the same rank 3.

rankdata(data2)
# array([ 2.5, 6. , 5. , 2.5, 1. , 4. , 7. , 8. , 9. , 10. ])
# Note that two of the 0.4 values get the rank 2.5 and one gets 4.
If you add a small gradient (larger than the numerical difference you observe) to break such ties, you will get the same result:
print(spearmanr(axis, data + np.arange(10)*1e-12))
# SpearmanrResult(correlation=0.74545454545454537, pvalue=0.013330146315440047)
print(spearmanr(axis, data2 + np.arange(10)*1e-12))
# SpearmanrResult(correlation=0.74545454545454537, pvalue=0.013330146315440047)
This, however, will break any ties that may be intentional and can lead to over- or underestimating the correlation. numpy.round may be the preferable solution if the data is expected to have discrete values.
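For example, rounding away the ~1e-16 noise restores the ties, so both series should then give the same correlation as the hand-entered data (8 decimals is just one reasonable choice here):
print(spearmanr(axis, np.round(data2, 8)))
# should match spearmanr(axis, data), since the ties at 0.4 are restored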

Is there A 1D interpolation (along one axis) of an image using two images (2D arrays) as inputs? [duplicate]

This question already has an answer here: Interpolate in one direction (1 answer). Closed 7 years ago.
I have two images representing x and y values. The images are full of 'holes' (the 'holes' are the same in both images).
I want to interpolate (linear interpolation is fine though higher level interpolation is preferable) along ONE of the axis in order to 'fill' the holes.
Say the axis of choice is 0, that is, I want to interpolate across each column. All I have found with numpy/scipy is interpolation for the case where x is the same for every row (e.g. scipy.interpolate.interp1d). In this case, however, each x is different (i.e. the holes or empty cells are different in each row).
Is there any numpy/scipy technique I can use? Could a 1D convolution work? (Though the kernels are fixed.)
You still can use interp1d:
import numpy as np
from scipy import interpolate
A = np.array([[1, np.nan, np.nan, 2], [0, np.nan, 1, 2]])
# array([[ 1., nan, nan,  2.],
#        [ 0., nan,  1.,  2.]])

for row in A:
    mask = np.isnan(row)
    x, y = np.where(~mask)[0], row[~mask]
    f = interpolate.interp1d(x, y, kind='linear')
    row[mask] = f(np.where(mask)[0])

# array([[ 1.        ,  1.33333333,  1.66666667,  2.        ],
#        [ 0.        ,  0.5       ,  1.        ,  2.        ]])
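The question asks for interpolation down each column (axis 0); assuming the same nan convention, a minimal sketch is to run the same loop over the transposed array. A.T is a view, so the assignments still update A in place:
for col in A.T:                      # iterate over the columns of A
    mask = np.isnan(col)
    if (~mask).sum() < 2:            # need at least two valid points in the column
        continue
    x, y = np.where(~mask)[0], col[~mask]
    col[mask] = interpolate.interp1d(x, y, kind='linear')(np.where(mask)[0])
Note that nans outside the range of valid points in a column would still need extrapolation handling (e.g. the fill_value argument of interp1d).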

scipy's splrep/splev for python interpolation returns nan

I have a data set where the first column is the x data (wavelength) and the second column is the y data (relative intensity).
I wish to interpolate it onto another grid of x values (x_new), but my problem is that splrep returns nan values:
>>> import numpy as np
>>> from scipy.interpolate import splrep, splev
>>> d = np.loadtxt("test.txt")
>>> x, y = d[:,0], d[:,1]
>>>
>>> f = splrep(x, y, k=5)
>>> print f
(array([ 4501.19,  4501.19,  4501.19, ...,  7091.74,  7091.74,  7091.74]), array([ nan,  nan,  nan, ...,  0.,  0.,  0.]), 5)
It also happens when I don't specify k. Any suggestions on how to overcome this problem?
Your x values probably contain duplicates. Use the s=... keyword argument of splrep to set a smoothing factor, because if it is not set the spline is required to pass through every point exactly, which is impossible with duplicate x values.
It might also be that they are not exact duplicates, just very close together.
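A minimal sketch of both fixes, assuming x and y are the columns loaded from test.txt and x_new is the target grid mentioned in the question (the np.unique step and the value of s are illustrative, not prescriptive):
import numpy as np
from scipy.interpolate import splrep, splev

x_unique, idx = np.unique(x, return_index=True)        # drop exact duplicates; also sorts x
tck = splrep(x_unique, y[idx], k=5, s=len(x_unique))   # s > 0 lets the spline smooth over near-duplicates
y_new = splev(x_new, tck)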
