I'm currently writing some code involving financial calculations, in particular an exponential moving average. To do the job I have tried Pandas and TA-Lib:
talib_ex = pd.Series(talib.EMA(self.PriceAdjusted.values, timeperiod=200), self.PriceAdjusted.index)
pandas_ex = self.PriceAdjusted.ewm(span=200, adjust=True, min_periods=200-1).mean()
They both work fine, but they provide different results at the beginning of the array. Is there some parameter I should change in pandas's EWMA, or is it a bug I should worry about?
Thanks in advance
Luca
For the talib EMA, the formula is the standard recursion
ema[i] = alpha*price[i] + (1-alpha)*ema[i-1], with alpha = 2/(timeperiod+1).
So when using pandas, if you want the pandas EMA to be the same as talib's, you should use it as:
pandas_ex = self.PriceAdjusted.ewm(span=200, adjust=False, min_periods=200-1).mean()
Set adjust to False, per the documentation (https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.ewm.html), if you want to use the same formula as talib:
When adjust is True (default), weighted averages are calculated using weights (1-alpha)^(n-1), (1-alpha)^(n-2), ..., 1-alpha, 1.
When adjust is False, weighted averages are calculated recursively as:
weighted_average[0] = arg[0]; weighted_average[i] = (1-alpha)*weighted_average[i-1] + alpha*arg[i].
You can also refer to:
https://en.wikipedia.org/wiki/Moving_average
PS: However, in my project I still find some small differences between the talib EMA and pandas.ewm, and I don't know why yet...
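A quick way to investigate is to compare the two on the same data. A minimal sketch (the prices series here is hypothetical; note that talib, by default, seeds its EMA with the simple average of the first timeperiod values, while pandas with adjust=False starts from the first observation, which may explain small differences near the start):

import numpy as np
import pandas as pd
import talib

# Hypothetical price series for illustration.
prices = pd.Series(np.random.default_rng(0).normal(0, 1, 500).cumsum() + 100)

talib_ema = pd.Series(talib.EMA(prices.values, timeperiod=200), index=prices.index)
pandas_ema = prices.ewm(span=200, adjust=False, min_periods=200).mean()

# Largest discrepancy where both series are defined.
both = pd.concat([talib_ema, pandas_ema], axis=1).dropna()
print((both.iloc[:, 0] - both.iloc[:, 1]).abs().max())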
I'm trying to learn some Python and am currently working through a few stock market examples. However, I ran across something called an Accumulation Distribution Line (a technical indicator) and tried to follow the mathematical expression for it until I reached the following line:
ADL[i] = ADL[i-1] + money flow volume[i]
Now, I have the money flow volume in column 8 and an empty column for the ADL at column 9 (column indices in the CSV file). How would I actually compute the mathematical expression above in Python? (Currently using Python with Pandas.)
I currently tried using the range function, such as:
for i in range(1, len(stock)):
    stock.iloc[i, 9] = stock.iloc[i-1, 9] + stock.iloc[i, 8]
But I think I'm doing something wrong.
That just looks like a cumulative sum with an unspecified base case, so I'd just use the built-in cumsum functionality:
import pandas as pd

df = pd.DataFrame(dict(mfv=range(10)))  # mfv = money flow volume
df['adl'] = df['mfv'].cumsum()
That should do what you want relatively efficiently.
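If the ADL needs a nonzero starting value (the unspecified base case above), it can simply be added on top of the cumulative sum. A small sketch, assuming a hypothetical prior ADL value carried in from earlier data:

import pandas as pd

df = pd.DataFrame(dict(mfv=range(10)))
adl_start = 100.0  # hypothetical previous ADL value
df['adl'] = adl_start + df['mfv'].cumsum()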
When calculating an exponentially weighted average in Pandas, the parameter adjust is set to a default value of True.
I know what the adjust parameter does (but not how it does it, which is what I want to know).
When adjust=True the EWA is calculated for every point in the sample, but when adjust=False, for a window of size n, you must wait for n observations to calculate the first EWA value.
I looked at the pandas documentation, but it only proves that adjust=True is equivalent to adjust=False for later values. It doesn't say how the earlier values are computed in the adjust=True case.
https://pandas.pydata.org/pandas-docs/stable/user_guide/computation.html#exponentially-weighted-windows
I even looked at the pandas code on github:
https://github.com/pandas-dev/pandas/blob/master/pandas/core/window/ewm.py
See L99 onwards: it just seems to be using the regular ewm formula for the earlier points?
This blog post demonstrates the difference between the two versions of ewm, based on the data points used in the table below:
https://towardsdatascience.com/trading-toolbox-02-wma-ema-62c22205e2a9
I tried to replicate the results in the blog post for the earlier data points, using the formula at L99 above, where every time I calculate the mean I use the current and all preceding ewm values. Is this what pandas does, i.e. does the ewm function also use all previous values when calculating the mean?
i   Price    alpha^i    ewm                                                                           ewm.mean
0            1
1   22.273   0.181818   = 22.273*1/1 = 22.273                                                         22.273
2   22.194   0.03306    = (22.194*1 + 22.273*0.03306)/(1 + 0.03306) = 22.20615                        22.23958
3   22.085   0.00601    = (22.085*1 + 22.194*0.181818 + 22.273*0.03306)/(1 + 0.181818 + 0.03306) = 22.10643   22.19519
The results are different from those shown in the blog post, but if the method were correct they should be exactly the same.
Can someone please tell me where I'm going wrong?
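For reference, a minimal sketch of the adjust=True formula as the pandas docs describe it: a weight of (1-alpha)^k on the observation k steps in the past, normalized by the sum of the weights. Comparing its output against pandas is a quick way to check a hand calculation (the price values are just the ones from the table above):

import pandas as pd

prices = pd.Series([22.273, 22.194, 22.085])
alpha = 2 / (10 + 1)  # alpha for span=10, i.e. 0.181818...

def ewm_adjust_true(s, alpha):
    out = []
    for i in range(len(s)):
        # weight (1-alpha)**(i-j) on observation j, for j = 0..i
        weights = [(1 - alpha) ** (i - j) for j in range(i + 1)]
        out.append(sum(w * x for w, x in zip(weights, s)) / sum(weights))
    return pd.Series(out, index=s.index)

print(ewm_adjust_true(prices, alpha))
print(prices.ewm(span=10, adjust=True).mean())  # the two should agree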
I have a big table of data that I read from Excel into Python, where I perform some calculations. My dataframe looks like this (my real table is bigger and more complex, but the logic stays the same):
with My_cal_spread = set1 + set2 and Errors = abs(My_cal_spread - Spread).
My goal is to use SciPy's minimize to find the single combination of (set1, set2), shared by every row, that makes My_cal_spread as close as possible to Spread, by minimizing the sum of the errors.
This is the solution I get when using Excel Solver; I'm looking to implement the same thing with SciPy. Thanks.
My code looks like this:
lnt = len(df['Spread'])
df['my_cal_Spread'] = 0.0
df['errors'] = 0.0
i = 0
while i < lnt:
    # .loc avoids the chained-assignment pitfall of df[col].iloc[i] = ...
    df.loc[df.index[i], 'my_cal_Spread'] = df['set1'].iloc[i] + df['set2'].iloc[i]
    df.loc[df.index[i], 'errors'] = abs(df['my_cal_Spread'].iloc[i] - df['Spread'].iloc[i])
    i = i + 1
errors_sum = df['errors'].sum()
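A minimal sketch of the SciPy equivalent, assuming set1 and set2 are two scalars shared by every row (the small df here is a hypothetical stand-in for the real table):

import numpy as np
import pandas as pd
from scipy.optimize import minimize

# Hypothetical stand-in for the table read from Excel.
df = pd.DataFrame({'Spread': [1.2, 1.5, 1.1, 1.4]})

def total_error(params):
    set1, set2 = params
    my_cal_spread = set1 + set2
    return np.abs(my_cal_spread - df['Spread']).sum()

# Nelder-Mead copes with the non-smooth absolute-value objective.
result = minimize(total_error, x0=[0.0, 0.0], method='Nelder-Mead')
print(result.x, result.fun)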
Is there an equivalent of R's summary() function in numpy?
numpy has std, mean, and average functions separately, but does it have a function that sums up everything, like summary() does in R?
I found this question, which relates to pandas, and this article with R-to-numpy equivalents, but neither has what I'm looking for.
1. Load Pandas in the console and read the CSV data file
import pandas as pd
data = pd.read_csv("data.csv", sep=",")
2. Examine the first few rows of data
data.head()
3. Calculate summary statistics
summary = data.describe()
4. Transpose the statistics to get a format similar to R's summary() function
summary = summary.transpose()
5. Visualize the summary statistics in the console
summary.head()
No. You'll need to use pandas.
R is a language for statistics, so much of the basic functionality you need, like summary() and lm(), is loaded when you boot it up. Python has many uses, so you need to install and import the appropriate statistical packages. numpy isn't a statistics package; it's for numerical computation more generally, so you need packages like pandas, scipy, and statsmodels to let Python do what R can do out of the box.
If you are looking for details like summary() in R, i.e.:
- a five-point summary for numeric variables
- the frequency of occurrence of each class for categorical variables
then to achieve the above in Python you can use df.describe(include='all'), as illustrated below.
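A tiny illustration with made-up data:

import pandas as pd

df = pd.DataFrame({'num': [1.0, 2.0, 2.0, 4.0], 'cat': ['a', 'b', 'a', 'a']})
# Numeric columns get count/mean/std/min/quartiles/max;
# categorical columns get count/unique/top/freq.
print(df.describe(include='all'))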
Which method does Pandas use for computing the variance of a Series?
For example, using Pandas (v0.14.1):
pandas.Series(numpy.repeat(500111, 2000000)).var()
12.579462289731145
This is obviously due to some numeric instability. However, in R we get:
var(rep(500111, 2000000))
0
I wasn't able to make enough sense of the Pandas source code to figure out which algorithm it uses.
This link may be useful: http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance
Update: To summarize the comments below: if the Python bottleneck package (fast NumPy array functions) is installed, a stabler two-pass algorithm similar to np.sqrt(((arr - arr.mean())**2).mean()) is used and gives 0.0 (as indicated by @Jeff); if it is not installed, the naive implementation indicated by @BrenBarn is used.
The algorithm can be seen in nanops.py, in the function nanvar, the last line of which is:
return np.fabs((XX - X ** 2 / count) / d)
This is the "naive" implementation at the beginning of the Wikipedia article you mention. (d will be set to N-1 in the default case.)
The behavior you're seeing appears to be due to the sum of squared values overflowing the numpy datatypes. It's not an issue of how the variance is calculated per se.
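A small sketch contrasting the two formulas on the data from the question (the exact nonzero value may vary, but it illustrates the cancellation in the naive one-pass formula):

import numpy as np

arr = np.repeat(500111, 2000000).astype(np.float64)
n = arr.size

# Naive one-pass formula (as in nanvar): (sum(x^2) - sum(x)^2 / n) / (n - 1).
# sum(x^2) is ~5e17, beyond float64's 2^53 integer precision, so subtracting
# two nearly equal huge numbers loses accuracy.
naive = (np.sum(arr**2) - np.sum(arr)**2 / n) / (n - 1)

# Two-pass formula subtracts the mean first; here every deviation is exactly 0.
two_pass = np.sum((arr - arr.mean())**2) / (n - 1)

print(naive)     # typically a small nonzero artifact
print(two_pass)  # 0.0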
I don't know the answer, but it seems related to how Series are stored, not necessarily to the var function.
np.var(pd.Series(np.repeat(100000000, 100000)))
26848.788479999999
np.var(np.repeat(100000000, 100000))
0.0
Using Pandas 0.11.0.