Multipletests p-values are different compared to GraphPad Prism - python

I have a dataframe with the columns: Time, ID, Drug, Value
Here is the code I use to perform the two-way ANOVA and multipletests:
#libraries
import pandas as pd
import statsmodels.formula.api as sm
from statsmodels.stats.anova import anova_lm
from statsmodels.stats.multitest import multipletests
import os
df= pd.read_excel(r"C:path.xlsm", sheet_name="test") #dataframe
order = [18,19,20,21,22,23,0] #sort 24 hour time starting at time 18hr
df['Drug']=pd.Categorical(df['Drug'])
df['Time'] = pd.Categorical(df['Time'], categories=order)
#TWO-WAY ANOVA
mod = sm.ols('Value~Drug+Time+Time*Drug', data = df).fit()
aov = anova_lm(mod, typ=2)
#Multi-test (mt)
mt = pd.concat([mod.params,mod.pvalues],axis=1)
mt.columns=['coefficient','pvalues']
mt = mt[mt.index.str.contains('Drug')]
mt['corrected_p'] = multipletests(mt['pvalues'],alpha=0.05,method="sidak",is_sorted=True)[1]
I get the following uncorrected p-values ('pvalues') and corrected p-values ('corrected_p') from the output of the multiple-test table (mt):
Index                   pvalues     corrected_p
Drug[T.B]               0.0159475   0.106432
Time[T.19]:Drug[T.B]    0.0738362   0.41546
Time[T.20]:Drug[T.B]    0.0778909   0.43314
Time[T.21]:Drug[T.B]    0.0699678   0.398153
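For reference, the one-step Sidak adjustment used by multipletests is p_adj = 1 - (1 - p)^m, where m is the number of p-values passed in. The corrected values above are consistent with m = 7 (the Drug main effect plus six Time:Drug interaction terms), which can be checked by hand; this is just an illustrative sketch, not part of the original code:
# Hand-check of the one-step Sidak correction for the first row (m = 7 is an assumption)
p = 0.0159475
m = 7
p_adj = 1 - (1 - p)**m
print(p_adj)  # ~0.10643, matching the corrected_p reported above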
When I use the same dataset in GraphPad Prism, I get these values instead (using two-way ANOVA and Sidak's multiple comparisons test):
Time    Drug A-B individual P value    Adjusted P value
18      0.0159                         0.1064
19      0.9689                         >0.999
20      >0.9999                        >0.999
21      0.9379                         >0.999
Especially for times 19, 20 and 21, the adjusted p-values are significantly different, and I'm not sure why. I'm concerned that I coded my statistics incorrectly and that this is causing the difference.
Happy to provide further info as needed.

Related

talib.EMA() returning nan values

So I have the following code:
import pandas as pd
import matplotlib.pyplot as plt
import bt
import numpy as np
import talib
btc_data = pd.read_csv('Binance_BTCUSDT_minute.csv', index_col= 'date', parse_dates = True)
one = btc_data['close'] #one minute candles
closes = np.array(one) #numpy array of one minute candles
five = one.resample('5min').mean() #five minute candles
type(one),type(five),type(one[0]),type(five[0]) #comparing types
(they are the exact same type)
period_short = 55
period_long = 144
closes = np.array(five) #I can comment this out if I want to use one minute candles instead
EMA_short = talib.EMA(closes, timeperiod= period_short)
EMA_long = talib.EMA(closes, timeperiod= period_long)
The weird part is that when I use the one-minute candles, the EMAs return numerical values, but when I use the five-minute candles, they return nan.
I compared the types of both, and they are the same (both arrays are numpy.ndarray and the contained values are numpy.float64). Why are the five-minute candles then unable to produce values?
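One thing worth checking (an assumption, since the CSV isn't shown) is whether resample('5min').mean() has introduced NaN rows for five-minute bins that contain no data; TA-Lib's EMA tends to return nan once a NaN sits inside its lookback window. A minimal diagnostic, reusing five, period_short and talib from the code above:
import numpy as np

closes = np.array(five)
print(np.isnan(closes).sum())  # number of NaN bars introduced by resampling, if any

# If there are NaNs, dropping (or filling) them before computing the EMA
# usually restores numeric output:
closes_clean = five.dropna().to_numpy()
EMA_short = talib.EMA(closes_clean, timeperiod=period_short)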

Compute standard deviation for each row and by group based on a specific variable

I am a new Python user; my issue is to compute the standard deviation of the residual column.
To do it:
I have to calculate the mean residual in each group
I need the size of each ID group
I did some calculations, and this is my code:
import pandas as pd
import statsmodels.formula.api as sm
from statistics import stdev
import statistics
from math import *
#Enumerate the data 1,2,3.. for each variable
A['Rec'] = A.groupby(['code ']).cumcount().add(1)
## Defining companies by their IDs
A['ID']=A.groupby('code ').ngroup().add(1)
### FINDING RESIDUALS
results = sm.ols(formula='Y ~ X', data=A).fit()
Y_pred = results.predict(A[["X"]])
Y_pred
A['residual'] = A["Y"].values-Y_pred
###SIZE
A['size']=A.groupby(['ID']).size()
###SD of residuals
for i in A['ID']:
    A['Std'] = sqrt(((A['residual'] - A['MEAN'])**2) / (A['size'] - 1))
This is my dataframe (screenshot omitted).
The groups are identified by ID (1, 2, 3, 4, 5), and each group contains several rows. For each row, grouped by ID, I would like the standard deviation of the residual column.
I apologize, as I don't have enough points to just leave a comment, so this has to be an answer. Anyway, could you maybe try something like this:
new_df = df.loc[:, 'residual'].groupby(df['ID']).std()
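If you also want that group standard deviation repeated on every row of the original frame (which is what the question seems to ask for), groupby.transform does the broadcasting; the names A, 'ID' and 'residual' are taken from the question:
# Broadcast each ID group's standard deviation of 'residual' back onto its rows
A['Std'] = A.groupby('ID')['residual'].transform('std')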

How to calculate Cook's Distance and DFFITS using Python statsmodels

I want to calculate Cooks_d and DFFITS in Python using statsmodels.
Here is my code in Python:
X = your_str_cleaned[param]
y = your_str_cleaned['Visitor']
X = sm.add_constant(X)
model = sm.OLS(y, X)
results = model.fit()
I tried using this for getting Cooks Distance and DFFITS:
import statsmodels.stats.outliers_influence as st_inf
st_inf.OLSInfluence.summary_frame(results)
But I am getting this error:
'OLSResults' object has no attribute 'results'.
Can someone help me find where I am going wrong?
I experienced the same problem, so I had to find a way around it. I don't have much experience, and this doesn't fix the root issue with OLSInfluence, but it does give you summary_frame.
I will use a pandas DataFrame as the source of the data. Even if you have the data in other objects (like arrays), you can transform them into a DataFrame with relative ease. To show how it works, I will import the Boston housing prices data set from sklearn.datasets:
import pandas as pd
from sklearn.datasets import load_boston
#imports dataset
boston = load_boston()
#generates DataFrame bos
bos = pd.DataFrame(boston.data)
#adds columns names to bos
bos.columns = boston.feature_names
#adds column 'PRICE' to bos
bos['PRICE'] = boston.target
Now let us consider the relation between the column 'RM' and the column 'PRICE', with 'RM' as the independent variable. For simplicity, let us consider simple OLS. Here comes the actual answer:
from statsmodels.formula.api import ols
m = ols('PRICE ~ RM',bos).fit()
infl = m.get_influence()
sm_fr = infl.summary_frame()
sm_fr has the columns cooks_d and dffits that you look for.
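For completeness, the same frame can also be obtained directly from the fitted results object in the question; this is a sketch assuming the results variable from the sm.OLS(y, X).fit() call above:
# Influence diagnostics straight from the fitted OLS results
infl = results.get_influence()
frame = infl.summary_frame()
cooks_d = frame['cooks_d']
dffits = frame['dffits']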

Statsmodels gives different ANOVA results to SPSS

I'm getting acquainted with Statsmodels so as to shift my more complicated stats completely over to Python. However, I'm being cautious, so I'm cross-checking my results with SPSS, just to make sure I'm not making any obvious blunders. Most of the time there's no difference, but I have one example of a two-way ANOVA that's throwing up very different test statistics in Statsmodels and SPSS. (Relevant point: the sample sizes in the ANOVA are mismatched, so ANOVA may not be the appropriate model here.)
I'm selecting my model as follows:
import pandas as pd
import scipy as sp
import numpy as np
import statsmodels.api as sm
import seaborn as sns
import statsmodels
from statsmodels.formula.api import ols
import matplotlib.pyplot as plt
Body = pd.read_csv(filepath)
Body = Body.dropna()
Body_lm = ols('Effect ~ C(Fiction) + C(Condition) + C(Fiction)*C(Condition)', data = Body).fit()
table = sm.stats.anova_lm(Body_lm, typ=2)
The Statsmodels output is as below:
sum_sq df F PR(>F)
C(Fiction) 278.176684 1.0 307.624463 1.682042e-55
C(Condition) 4.294764 1.0 4.749408 2.971278e-02
C(Fiction):C(Condition) 10.776312 1.0 11.917092 5.970123e-04
Residual 520.861599 576.0 NaN NaN
The corresponding SPSS results are these (screenshot omitted):
Can anyone help explain the difference? Is it perhaps the unequal sample sizes being treated differently under the hood? Or am I choosing the wrong model?
Any help appreciated!
You should use sum coding when comparing the means of the variables.
BTW, you don't need to specify each variable in the interaction term if the * multiply operator is used:
“:” adds a new column to the design matrix with the product of the other two columns.
“*” will also include the individual columns that were multiplied together.
Your model should be:
Body_lm = ols('Effect ~ C(Fiction, Sum)*C(Condition, Sum)', data = Body).fit()
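SPSS reports Type III sums of squares by default, so pairing the sum-coded model with typ=3 in anova_lm is usually what reproduces its table; a sketch under that assumption, reusing the names from the question:
# Sum-to-zero contrasts plus Type III sums of squares, matching SPSS defaults
Body_lm = ols('Effect ~ C(Fiction, Sum)*C(Condition, Sum)', data=Body).fit()
table = sm.stats.anova_lm(Body_lm, typ=3)
print(table)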

Momentum portfolio (trend following) quant simulation in pandas

I am trying to construct a trend-following momentum portfolio strategy based on the S&P 500 index (monthly data).
I used Kaufman's fractal efficiency ratio to filter out whipsaw signals
(http://etfhq.com/blog/2011/02/07/kaufmans-efficiency-ratio/)
I succeeded in coding it, but it's very clumsy, so I'm looking for advice on better code.
Strategy
Get data for the S&P 500 index from Yahoo Finance
Calculate Kaufman's efficiency ratio over a lookback period p, multiplied by a direction indicator (1 if close > close p periods ago, else 0)
Average the values from step 2 over lookback periods 1 to 12 ---> monthly asset allocation ratio; 1 - allocation ratio goes to cash (3% per year)
I am having difficulty averaging the efficiency ratios for lookbacks 1 to 12. Of course I know it can be implemented with a simple for loop and it's a very easy task, but I failed.
I need more concise and refined code; can anybody help me?
a['meanfractal'] bothers me in the code below.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import pandas_datareader.data as web

def price(stock, start):
    price = web.DataReader(name=stock, data_source='yahoo', start=start)['Adj Close']
    return price.div(price.iat[0]).resample('M').last().to_frame('price')

a = price('SPY','2000-01-01')

def fractal(a,p):
    a['direction'] = np.where(a['price'].diff(p)>0,1,0)
    a['abs'] = a['price'].diff(p).abs()
    a['volatility'] = a.price.diff().abs().rolling(p).sum()
    a['fractal'] = a['abs'].values/a['volatility'].values*a['direction'].values
    return a['fractal']

def meanfractal(a):
    a['meanfractal']= (fractal(a,1).values+fractal(a,2).values+fractal(a,3).values+fractal(a,4).values+fractal(a,5).values+fractal(a,6).values+fractal(a,7).values+fractal(a,8).values+fractal(a,9).values+fractal(a,10).values+fractal(a,11).values+fractal(a,12).values)/12
    a['portfolio1'] = (a.price/a.price.shift(1).values*a.meanfractal.shift(1).values+(1-a.meanfractal.shift(1).values)*1.03**(1/12)).cumprod()
    a['portfolio2'] = ((a.price/a.price.shift(1).values*a.meanfractal.shift(1).values+1.03**(1/12))/(1+a.meanfractal.shift(1))).cumprod()
    a = a.dropna()
    a = a.div(a.iloc[0])
    return a[['price','portfolio1','portfolio2']].plot()

print(a)
plt.show()
You could simplify further by storing the values corresponding to each p in a DataFrame, rather than computing each series separately, as shown:
def fractal(a, p):
    df = pd.DataFrame()
    for count in range(1, p+1):
        a['direction'] = np.where(a['price'].diff(count)>0,1,0)
        a['abs'] = a['price'].diff(count).abs()
        a['volatility'] = a.price.diff().abs().rolling(count).sum()
        a['fractal'] = a['abs']/a['volatility']*a['direction']
        df = pd.concat([df, a['fractal']], axis=1)
    return df
Then, you could assign the repeating operations to a variable which reduces the re-computation time.
def meanfractal(a, l=12):
    a['meanfractal'] = pd.DataFrame(fractal(a, l)).sum(1, skipna=False)/l
    mean_shift = a['meanfractal'].shift(1)
    price_shift = a['price'].shift(1)
    factor = 1.03**(1/l)
    a['portfolio1'] = (a['price']/price_shift*mean_shift+(1-mean_shift)*factor).cumprod()
    a['portfolio2'] = ((a['price']/price_shift*mean_shift+factor)/(1+mean_shift)).cumprod()
    a.dropna(inplace=True)
    a = a.div(a.iloc[0])
    return a[['price','portfolio1','portfolio2']].plot()
Resulting plot obtained:
meanfractal(a)
Note: If speed is not a major concern, you could perform the operations via the built-in methods in pandas instead of converting the series to their corresponding NumPy array values.
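For example, a pandas-only version of the averaging step might look like the sketch below; it assumes a DataFrame a with a 'price' column as built earlier, and simply recomputes the ratio for each lookback before averaging:
def mean_fractal_pandas(a, l=12):
    # Kaufman-style efficiency ratio for a single lookback p, using only pandas operations
    def ratio(p):
        direction = (a['price'].diff(p) > 0).astype(int)
        net_move = a['price'].diff(p).abs()
        volatility = a['price'].diff().abs().rolling(p).sum()
        return net_move / volatility * direction

    # Column-wise average of the ratios for lookbacks 1..l
    ratios = pd.concat({p: ratio(p) for p in range(1, l + 1)}, axis=1)
    return ratios.sum(axis=1, skipna=False) / l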
