When calculating an exponentially weighted average in Pandas, the adjust parameter is set to a default value of True.
I know what the adjust parameter does, but not how it does it, which is what I want to know.
When adjust=True the EWMA is calculated for every point in the sample, but when adjust=False, then for a window of size n you must wait for n observations before the first EWMA value is calculated.
I looked at the pandas documentation, but it only shows that adjust=True is equivalent to adjust=False for later values. It doesn't say how the earlier values are computed in the adjust=True case.
https://pandas.pydata.org/pandas-docs/stable/user_guide/computation.html#exponentially-weighted-windows
I even looked at the pandas code on github:
https://github.com/pandas-dev/pandas/blob/master/pandas/core/window/ewm.py
see L99 onwards, but it just seems to be using the regular EWM formula for the earlier points?
This blog post demonstrates the difference between the two versions of EWM based on the following data points:
https://towardsdatascience.com/trading-toolbox-02-wma-ema-62c22205e2a9
I tried to replicate the results in the blog post for the earlier data points, using the formula at L99 above, where every time I calculate the mean I use the current and all preceding EWM values.
Is this what the pandas ewm function does – use all previous values when calculating the mean?
i   Price     alpha^i    ewm                                                                               ewm.mean
0             1
1   22.273    0.181818   22.273*1/1 = 22.273                                                               22.273
2   22.194    0.03306    (22.194*1 + 22.273*0.03306)/(1 + 0.03306) = 22.20615                              22.23958
3   22.085    0.00601    (22.085*1 + 22.194*0.181818 + 22.273*0.03306)/(1 + 0.181818 + 0.3306) = 22.10643  22.19519
The results are different from those shown in the blog post, but if the method were correct they should be exactly the same.
Can someone please tell me where I'm going wrong?
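For what it's worth, the pandas documentation defines the adjust=True average at step t as a weighted mean of all points seen so far, with weight (1-alpha)^i on the i-th most recent observation (note: (1-alpha)^i, not alpha^i). A minimal sketch on made-up prices that can be checked directly against pandas:

```python
import numpy as np
import pandas as pd

def ewm_adjust_true(x, alpha):
    # y_t = sum_{i=0..t} (1-alpha)^i * x_{t-i} / sum_{i=0..t} (1-alpha)^i
    out = np.empty(len(x))
    for t in range(len(x)):
        w = (1 - alpha) ** np.arange(t + 1)      # weights, newest first
        out[t] = np.dot(w, x[t::-1]) / w.sum()   # x[t::-1] = newest first
    return out

prices = np.array([22.27, 22.19, 22.08, 22.17, 22.18])  # made-up sample
span = 10
alpha = 2.0 / (span + 1)

manual = ewm_adjust_true(prices, alpha)
ref = pd.Series(prices).ewm(span=span, adjust=True).mean().to_numpy()
print(np.allclose(manual, ref))  # True
```

Each output value uses every preceding observation, so no warm-up period is needed: the t=0 average is just the first price.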
scipy.stats.spearmanr([1,2,3,4,1], [1,2,2,1,np.nan], nan_policy='omit')
gives a Spearman correlation of 0.349999.
My understanding is that nan_policy='omit' will discard all pairs containing a nan. If that's the case, the result should be the same as scipy.stats.spearmanr([1,2,3,4], [1,2,2,1]).
However, that gives a correlation of 0.235702.
Why are they different? Is my understanding of nan_policy='omit' correct?
I tried to run your code; it gives me zero correlation (R=0.0).
I use this function, and your understanding of nan_policy='omit' is right.
If you don't need the p-value of the correlation, I would suggest using .corr(method='spearman') from the pandas library. By default it excludes NA/null values.
Official Documentation
nan_policy='omit' should completely omit those pairs for which one or both values are nan. When I run the two commands you pasted above, I get the same correlation value, not different ones.
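As a sanity check on your own SciPy version, you can compare the omit call against manually dropping the nan pair; on recent SciPy releases the two agree (an older discrepancy between them was a bug). A sketch:

```python
import numpy as np
from scipy import stats

x = [1, 2, 3, 4, 1]
y = [1, 2, 2, 1, np.nan]

# nan_policy='omit' should behave like dropping the (1, nan) pair by hand
r_omit = stats.spearmanr(x, y, nan_policy='omit').correlation
r_manual = stats.spearmanr([1, 2, 3, 4], [1, 2, 2, 1]).correlation

print(r_omit, r_manual)
```

With the nan pair dropped, the x-ranks are [1, 2, 3, 4] and the tied y-ranks are [1.5, 3.5, 3.5, 1.5], whose Pearson correlation is exactly 0 — matching the R=0.0 reported in one of the answers above.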
I'm building indicator series based on market prices using ta-lib. I made a couple of implementations of the same concept, but I found the same issue in every implementation: to obtain a correct series of values I must reverse the input series and finally reverse the resulting series. The Python code that calls the ta-lib library through a convenience wrapper is:
rsi1 = np.asarray(run_example( function_name,
arguments,
30,
weeklyNoFlatOpen[0],
weeklyNoFlatHigh[0],
weeklyNoFlatLow[0],
weeklyNoFlatClose[0],
weeklyNoFlatVolume[0][::-1]))
rsi2 = np.asarray(run_example( function_name,
arguments,
30,
weeklyNoFlatOpen[0][::-1],
weeklyNoFlatHigh[0][::-1],
weeklyNoFlatLow[0][::-1],
weeklyNoFlatClose[0][::-1],
weeklyNoFlatVolume[0][::-1]))[::-1]
The graphs of both series can be observed here (the indicator is really SMA):
The green line is clearly computed in reverse order (from sample n to 0) and the red one in the expected order. To obtain the red line I must reverse the input series and the output series.
The code of this test is available on: python code
Has anybody observed the same behavior?
I found what's wrong with my approach to the problem. The simple answer is that the MA indicator puts the first valid value at position zero of the results array, so the result series starts from zero and has N fewer samples than the input series (where N is the period value in this case). The reversed-computation idea was completely wrong.
Here's the proof:

[plot: the indicator series overlaid on the input series]
Adding 30 zeros at the beginning and removing the last ones, the indicator fits nicely over the input series.

[plot: the padded indicator aligned with the input series]
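The alignment described above can be illustrated without ta-lib: a left-aligned moving average (first valid value at index 0) just needs period-1 leading pad values to line up with its input. A sketch using a plain numpy SMA (padding with NaN here rather than zeros, to avoid drawing a spurious segment):

```python
import numpy as np

def sma_left_aligned(x, period):
    # First valid average lands at index 0, like the wrapper described above:
    # output has len(x) - period + 1 samples.
    return np.convolve(x, np.ones(period) / period, mode='valid')

def realign(short, n, period):
    # Pad the front with NaN so each average sits under the last
    # input sample it was computed from.
    out = np.full(n, np.nan)
    out[period - 1:] = short
    return out

x = np.arange(10.0)              # toy price series
s = sma_left_aligned(x, 3)       # 8 values; s[0] averages x[0:3]
aligned = realign(s, len(x), 3)  # aligned[2] is the first valid average
```

Plotting `aligned` against `x` reproduces the fit described in the answer, with no reversal needed anywhere.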
I'm currently writing code involving some financial calculations, in particular some exponential moving averages. To do the job I have tried Pandas and Talib:
talib_ex=pd.Series(talib.EMA(self.PriceAdjusted.values,timeperiod=200),self.PriceAdjusted.index)
pandas_ex=self.PriceAdjusted.ewm(span=200,adjust=True,min_periods=200-1).mean()
They both work fine, but they provide different results at the beginning of the array:
So is there some parameter to change in pandas's EWMA, or is it a bug I should worry about?
Thanks in advance
Luca
For the talib EMA, the formula is:

[image: talib EMA formula]

So when using pandas, if you want the pandas EMA to match talib, you should use:
pandas_ex=self.PriceAdjusted.ewm(span=200,adjust=False,min_periods=200-1).mean()
Set adjust to False, per the documentation (https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.ewm.html), if you want to use the same formula as talib:

When adjust is True (default), weighted averages are calculated using weights (1-alpha)^(n-1), (1-alpha)^(n-2), ..., 1-alpha, 1.

When adjust is False, weighted averages are calculated recursively as:
weighted_average[0] = arg[0]; weighted_average[i] = (1-alpha)*weighted_average[i-1] + alpha*arg[i].
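The recursion quoted above is easy to verify directly against pandas with adjust=False; a minimal sketch on made-up data:

```python
import numpy as np
import pandas as pd

prices = pd.Series([22.27, 22.19, 22.08, 22.17, 22.18])  # made-up data
span = 3
alpha = 2.0 / (span + 1)

# weighted_average[0] = arg[0];
# weighted_average[i] = (1-alpha)*weighted_average[i-1] + alpha*arg[i]
ema = np.empty(len(prices))
ema[0] = prices.iloc[0]
for i in range(1, len(prices)):
    ema[i] = (1 - alpha) * ema[i - 1] + alpha * prices.iloc[i]

ref = prices.ewm(span=span, adjust=False).mean().to_numpy()
print(np.allclose(ema, ref))  # True
```

Note that talib.EMA seeds its recursion differently (it typically uses an initial SMA over the first `timeperiod` samples), which may still leave small differences at the start of the array.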
You can also reference here:
https://en.wikipedia.org/wiki/Moving_average
PS: however, in my project I still find some small differences between talib and pandas.ewm, and I don't know why yet...
Which method does Pandas use for computing the variance of a Series?
For example, using Pandas (v0.14.1):
pandas.Series(numpy.repeat(500111,2000000)).var()
12.579462289731145
Obviously due to some numeric instability. However, in R we get:
var(rep(500111,2000000))
0
I wasn't able to make enough sense of the Pandas source-code to figure out what algorithm it uses.
This link may be useful: http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance
Update, summarizing the comments below: if the Python bottleneck package (fast NumPy array functions) is installed, a stabler two-pass algorithm similar to np.sqrt(((arr - arr.mean())**2).mean()) is used and gives 0.0 (as indicated by @Jeff); if it is not installed, the naive implementation indicated by @BrenBarn is used.
The algorithm can be seen in nanops.py, in the function nanvar, the last line of which is:
return np.fabs((XX - X ** 2 / count) / d)
This is the "naive" implementation at the beginning of the Wikipedia article you mention. (d will be set to N-1 in the default case.)
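The difference between the naive one-pass formula and a two-pass computation is easy to demonstrate on the example from the question (a column of identical values, so the true variance is 0):

```python
import numpy as np

arr = np.repeat(500111.0, 2_000_000)
n = arr.size

# Naive one-pass: (sum(x^2) - sum(x)^2 / n) / (n - 1).
# Two huge, nearly equal numbers are subtracted, so almost all
# significant digits cancel and rounding error dominates.
naive = (np.sum(arr ** 2) - np.sum(arr) ** 2 / n) / (n - 1)

# Two-pass: subtract the mean first, then square. Here every
# deviation is exactly 0.0, so the result is exact.
two_pass = np.sum((arr - arr.mean()) ** 2) / (n - 1)

print(naive, two_pass)  # two_pass is exactly 0.0; naive may not be
```

This matches the behavior in the question: the naive path can return a nonzero value like 12.58 even though every element is identical.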
The behavior you're seeing appears to be due to the sum of squared values overflowing the numpy datatypes. It's not an issue of how the variance is calculated per se.
I don't know the answer, but it seems related to how Series are stored, not necessarily the var function.
np.var(pd.Series(repeat(100000000,100000)))
26848.788479999999
np.var(repeat(100000000,100000))
0.0
Using Pandas 0.11.0.
Besides the correct language ID, langid.py returns a certain value. "The value returned is a score for the language. It is not a probability estimate, as it is not normalized by the document probability since this is unnecessary for classification."
But what does the value mean?
I'm actually the author of langid.py. Unfortunately, I've only just spotted this question now, almost a year after it was asked. I've tidied up the handling of the normalization since this question was asked, so all the README examples have been updated to show actual probabilities.
The value that you see there (and that you can still get by turning normalization off) is the un-normalized log-probability of the document. Because log/exp are monotonic, we don't actually need to compute the probability to decide the most likely class. The actual value of this log-prob is not actually of any use to the user. I should probably have never included it, and I may remove its output in the future.
I think this is the important chunk of langid.py code:
def nb_classify(fv):
    # compute the log-factorial of each element of the vector
    logfv = logfac(fv).astype(float)
    # compute the probability of the document given each class
    pdc = np.dot(fv, nb_ptc) - logfv.sum()
    # compute the probability of the document in each class
    pd = pdc + nb_pc
    # select the most likely class
    cl = np.argmax(pd)
    # turn the pd into a probability distribution
    pd /= pd.sum()
    return cl, pd[cl]
It looks to me like the author is calculating something like the multinomial log-posterior of the data for each of the possible languages. logfv calculates the logarithm of the denominator of the PMF (x_1!...x_k!). np.dot(fv,nb_ptc) calculates the logarithm of the p_1^x_1...p_k^x_k term. So pdc looks like the list of language-conditional log-likelihoods (except that it's missing the n! term). nb_pc looks like the prior probabilities, so pd would be the log-posteriors. The normalization line, pd /= pd.sum(), confuses me, since one usually normalizes probability-like values (not log-probability values); also, the examples in the documentation (('en', -55.106250761034801)) don't look like they've been normalized; maybe they were generated before the normalization line was added?
Anyway, the short answer is that this value, pd[cl] is a confidence score. My understanding based on the current code is that they should be values between 0 and 1/97 (since there are 97 languages), with a smaller value indicating higher confidence.
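If you want an actual probability out of log-posteriors, the standard fix (instead of dividing the log-values by their sum, as the snippet above does) is a softmax. A sketch, not claiming this is what langid.py does internally:

```python
import numpy as np

def softmax(log_p):
    # Subtract the max before exponentiating for numerical stability,
    # then normalize so the result sums to 1.
    z = np.exp(log_p - log_p.max())
    return z / z.sum()

log_post = np.log(np.array([0.1, 0.2, 0.7]))  # toy log-posteriors
probs = softmax(log_post)
print(probs)  # recovers ~[0.1, 0.2, 0.7]
```

Because exp is monotonic, argmax over the log-posteriors and over the softmax output pick the same class; the softmax only matters if you want the confidence value itself to be interpretable.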
Looks like a value that tells you how certain the engine is that it guessed the correct language for the document. I think generally the closer to 0 the number, the more sure it is, but you should be able to test that by mixing languages together and passing them in to see what values you get out. It allows you to fine tune your program when using langid depending upon what you consider 'close enough' to count as a match.