What does nan_policy='omit' do for scipy.stats.spearmanr?

scipy.stats.spearmanr([1,2,3,4,1],[1,2,2,1,np.nan],nan_policy='omit')
This gives a Spearman correlation of 0.349999.
My understanding is that nan_policy='omit' discards all the pairs which contain a nan. If that's the case, the result should be the same as scipy.stats.spearmanr([1,2,3,4],[1,2,2,1]).
However, that gives a correlation of 0.235702.
Why are they different? Is my understanding of nan_policy='omit' correct?

I tried to run your code and it gives me zero correlation (R=0.0).
I use this function, and your understanding of nan_policy='omit' is correct.
If you don't need the p-value of the correlation, I would suggest using .corr(method='spearman') from the pandas library. By default it excludes NA/null values.
Official Documentation
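To illustrate the pandas route, here is a minimal sketch using the two lists from the question; Series.corr excludes missing pairs by default and returns only the coefficient:

import numpy as np
import pandas as pd

x = pd.Series([1, 2, 3, 4, 1])
y = pd.Series([1, 2, 2, 1, np.nan])

# Drops missing pairs by default; no p-value is returned.
print(x.corr(y, method='spearman'))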

nan_policy='omit' should completely omit those pairs for which one or both values are nan. When I run the two commands you pasted above, I get the same correlation value, not different ones.
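For reference, a minimal sketch of what omitting the NaN pairs means, assuming a reasonably recent SciPy; per this answer, both calls should report the same coefficient:

import numpy as np
from scipy.stats import spearmanr

x = np.array([1, 2, 3, 4, 1], dtype=float)
y = np.array([1, 2, 2, 1, np.nan])

# Let scipy drop the pairs containing NaN.
rho_omit, p_omit = spearmanr(x, y, nan_policy='omit')

# Drop the same pairs by hand and correlate what is left.
mask = ~(np.isnan(x) | np.isnan(y))
rho_manual, p_manual = spearmanr(x[mask], y[mask])

print(rho_omit, rho_manual)  # per this answer, these should match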


What do the interpolation methods ‘cubicspline’ and ‘from_derivatives’ do?

The pandas interpolation documentation already leaves helpful notes for all the other methods on whether they use the actual numerical indices or a time index for the interpolation:
method str, default ‘linear’
Interpolation technique to use. One of:
‘linear’: Ignore the index and treat the values as equally spaced. This is the only method supported on MultiIndexes.
‘time’: Works on daily and higher resolution data to interpolate given length of interval.
‘index’, ‘values’: use the actual numerical values of the index.
‘pad’: Fill in NaNs using existing values.
‘nearest’, ‘zero’, ‘slinear’, ‘quadratic’, ‘cubic’, ‘spline’, ‘barycentric’, ‘polynomial’: Passed to scipy.interpolate.interp1d. These methods use the numerical values of the index. Both ‘polynomial’ and ‘spline’ require that you also specify an order (int), e.g. df.interpolate(method='polynomial', order=5).
‘krogh’, ‘piecewise_polynomial’, ‘spline’, ‘pchip’, ‘akima’, ‘cubicspline’: Wrappers around the SciPy interpolation methods of similar names. See Notes.
‘from_derivatives’: Refers to scipy.interpolate.BPoly.from_derivatives which replaces ‘piecewise_polynomial’ interpolation method in scipy 0.18.
But unfortunately I couldn't find this information for the last two:
cubicspline
and
from_derivatives
scipy.interpolate.CubicSpline
Interpolate data with a piecewise cubic polynomial which is twice continuously differentiable. The result is represented as a PPoly instance with breakpoints matching the given data.
scipy.interpolate.BPoly.from_derivatives
Construct a piecewise polynomial in the Bernstein basis, compatible with the specified values and derivatives at breakpoints.
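A minimal sketch of how the two methods are called through pandas, assuming a pandas/SciPy combination recent enough to provide both wrappers (the sample series is made up for illustration):

import numpy as np
import pandas as pd

s = pd.Series([0.0, 2.0, np.nan, np.nan, 1.0, 3.0])

# Wrapper around scipy.interpolate.CubicSpline: a twice continuously
# differentiable piecewise cubic through the known points.
cubic = s.interpolate(method='cubicspline')

# Wrapper around scipy.interpolate.BPoly.from_derivatives: a piecewise
# polynomial in the Bernstein basis matching the known values.
bpoly = s.interpolate(method='from_derivatives')

print(pd.DataFrame({'cubicspline': cubic, 'from_derivatives': bpoly}))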

How does the adjust method work in Pandas ewm() function?

When calculating an exponentially weighted average in pandas, the parameter adjust defaults to True.
I know what the adjust parameter does, but not how it does it, which is what I want to know.
When adjust=True the EWA is calculated for every point in the sample, but when adjust=False, then for a window of size n you must wait for n observations to calculate the first EWA value.
I looked at the pandas documentation, but it only shows that adjust=True is equivalent to adjust=False for later values. It doesn't say how the earlier values are adjusted in the adjust=True case.
https://pandas.pydata.org/pandas-docs/stable/user_guide/computation.html#exponentially-weighted-windows
I even looked at the pandas code on github:
https://github.com/pandas-dev/pandas/blob/master/pandas/core/window/ewm.py
see L99 onwards: it just seems to be using the regular ewm formula for the earlier points?
This blog post demonstrates the difference between the two versions of ewm based on the following data points:
https://towardsdatascience.com/trading-toolbox-02-wma-ema-62c22205e2a9
I tried to replicate the results in the blog post, for the earlier data points, using the formula at L99 above, where every time I calculate the mean I use the current and all preceding ewm values.
Is this what the pandas ewm function does, i.e. use all previous values when calculating the mean?
i   Price    alpha^i    ewm                                                                              ewm.mean
0            1
1   22.273   0.181818   =22.273*1/1 = 22.273                                                             22.273
2   22.194   0.03306    =(22.194*1+22.273*0.03306)/(1+0.03306) = 22.20615                                 22.23958
3   22.085   0.00601    =(22.085*1+22.194*0.181818+22.273*0.03306)/(1+0.181818+0.3306) = 22.10643         22.19519
The results are different from those shown in the blog post, but if the method were correct they should be exactly the same.
Can someone please tell me where I'm going wrong?
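As a point of comparison, here is a minimal sketch of how adjust=True handles the early values, following the weighted-average formula in the pandas docs: each value is a weighted average of all points seen so far, with weight (1-alpha)**k for a point k steps in the past. It assumes span=10 (alpha = 2/11 = 0.181818, matching the table above); the first three prices echo the question's table and the remaining two are made up to extend the example:

import numpy as np
import pandas as pd

prices = np.array([22.273, 22.194, 22.085, 22.174, 22.184])
span = 10
alpha = 2.0 / (span + 1)

manual = []
for t in range(len(prices)):
    # Weights for x_0 ... x_t, oldest first: (1-alpha)**t, ..., (1-alpha), 1
    w = (1 - alpha) ** np.arange(t, -1, -1)
    manual.append(np.dot(w, prices[:t + 1]) / w.sum())

print(np.allclose(manual, pd.Series(prices).ewm(span=span, adjust=True).mean()))
# True: adjust=True never waits for a warm-up; it just renormalises the weights.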

Exponential Moving Average Pandas vs Ta-lib

I'm currently writing code involving some financial calculations, in particular an exponential moving average. To do the job I have tried pandas and TA-Lib:
talib_ex=pd.Series(talib.EMA(self.PriceAdjusted.values,timeperiod=200),self.PriceAdjusted.index)
pandas_ex=self.PriceAdjusted.ewm(span=200,adjust=True,min_periods=200-1).mean()
They both work fine, but they provide different results at the beginning of the array.
Is there some parameter to change in pandas's EWMA, or is it a bug I should worry about?
Thanks in advance
Luca
For the talib EMA, the formula is the standard recursive EMA (the same recursion quoted from the pandas documentation below).
So if you want the pandas EMA to match talib, you should use it as:
pandas_ex=self.PriceAdjusted.ewm(span=200,adjust=False,min_periods=200-1).mean()
Set adjust to False, according to the documentation (https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.ewm.html), if you want to use the same formula as talib:
When adjust is True (default), weighted averages are calculated using weights (1-alpha)^(n-1), (1-alpha)^(n-2), ..., 1-alpha, 1.
When adjust is False, weighted averages are calculated recursively as:
weighted_average[0] = arg[0]; weighted_average[i] = (1-alpha) * weighted_average[i-1] + alpha * arg[i].
You can also reference here:
https://en.wikipedia.org/wiki/Moving_average
PS: However, in my project I still find some small differences between talib and pandas.ewm, and I don't know why yet...
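For what it's worth, a minimal sketch checking that pandas with adjust=False really follows the recursive formula quoted above (the prices and span here are made up for illustration):

import numpy as np
import pandas as pd

prices = pd.Series([10.0, 10.5, 10.2, 10.8, 11.0, 10.9, 11.2])
span = 5
alpha = 2.0 / (span + 1)

# weighted_average[0] = arg[0]
# weighted_average[i] = (1 - alpha) * weighted_average[i-1] + alpha * arg[i]
ema = [prices.iloc[0]]
for price in prices.iloc[1:]:
    ema.append((1 - alpha) * ema[-1] + alpha * price)

print(np.allclose(ema, prices.ewm(span=span, adjust=False).mean()))  # True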

How to unpack the results from scipy ttest_1samp?

Scipy's ttest_1samp returns a tuple, with the t-statistic, and the two-tailed p-value.
Example:
ttest_1samp([0,1,2], 0) = (array(1.7320508075688774), 0.22540333075851657)
But I'm only interested in the float of the t-test (the t-statistic), which I have only been able to get by using [0].ravel()[0]
Example:
ttest_1samp([0,1,2], 0)[0].ravel()[0] = 1.732
However, I'm quite sure there must be a more pythonic way to do this. What is the best way to get the float from this output?
To expound on miradulo's answer, if you use a newer version of scipy (release 0.14.0 or later), you can reference the statistic field of the returned namedtuple. Referencing this way is Pythonic and simplifies the code, as there is no need to remember specific indices.
Code
from scipy.stats import ttest_1samp

res = ttest_1samp(range(3), 0)
print(res.statistic)
print(res.pvalue)
Output
1.73205080757
0.225403330759
From the source code, scipy.stats.ttest_1samp returns nothing more than a namedtuple Ttest_1sampResult with the statistic and p-value. Hence, you do not need to use .ravel - you can simply use
scipy.stats.ttest_1samp([0,1,2], 0)[0]
to access the statistic.
Note:
From a further look at the source code, it is clear that this namedtuple only began being returned in release 0.14.0. In release 0.13.0 and earlier, it appears that a zero dim array is returned (source code), which for all intents and purposes can act just like a plain number as mentioned by BrenBarn.
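Since the returned object is (or behaves like) a plain 2-tuple, another option is to unpack it directly; a small sketch using the numbers from the question:

from scipy.stats import ttest_1samp

t_stat, p_value = ttest_1samp([0, 1, 2], 0)
print(t_stat)   # 1.7320508075688774
print(p_value)  # 0.22540333075851657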
You can get the results in the desired format this way:
from scipy import stats

print("The t-statistic is %.3f and the p-value is %.3f." % stats.ttest_1samp([0, 1, 2], 0))
Output:
The t-statistic is 1.732 and the p-value is 0.225.

incorrect mean from PANDAS dataframe

So here's an interesting thing:
Using python 2.7:
I've got a dataframe of about 5,100 entries, each with a number (melting point) in a column titled 'Tm'. Using the code:
self.sort_df[['Tm']].mean(axis=0)
I get a mean of:
Tm 92.969204
dtype: float64
This doesn't make sense because no entry has a Tm of greater than 83.
Does .mean() not work for this many values? I've tried paring down the dataset and it seems to work for ~1,000 entries, but considering I have a full dataset of 150,000 to run at once, I'd like to know if I need to find a different way to calculate the mean.
A more readable syntax would be:
sort_df['Tm'].mean()
Try sort_df['Tm'].value_counts() or sort_df['Tm'].max() to see what values are present. Some unexpected values must have crept in.
The .mean() function gives accurate results irrespective of the size of the data.
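A minimal sketch of those sanity checks; the dataframe and the bad entry below are hypothetical stand-ins for the question's data:

import pandas as pd

# One deliberately bad entry to show how a single outlier shifts the mean.
df = pd.DataFrame({'Tm': [55.2, 61.7, 83.0, 9271.0, 48.9]})

print(df['Tm'].max())           # reveals any outlier dragging the mean up
print(df['Tm'].value_counts())  # shows which values actually occur
print(df['Tm'].mean())          # the mean itself is correct, outlier included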
