How to get/isolate the p-value from 'AnovaResults' object in python?

I want to use a one-way repeated measures ANOVA on my dataset to test whether the values of 5 patients differ between the 3 measured days.
I use AnovaRM from statsmodels.stats.anova and the result is an 'AnovaResults' object.
I can see the p-value with the print() function, but I don't know how to isolate it from this object.
Do you have any idea? Also, is my code correct for what I want to test?
Thanks in advance
import numpy as np
import pandas as pd
from statsmodels.stats.anova import AnovaRM

day1 = [1, 2, 3, 4, 5]
day2 = [2, 4, 6, 8, 10]
day3 = [1.5, 2.5, 3.5, 4.5, 5.5]
days_list = [day1, day2, day3]

# Build a long-format frame: one row per (patient, day) measurement
df = pd.DataFrame({'patient': np.repeat(range(1, len(days_list[0]) + 1), len(days_list)),
                   'group': np.tile(range(1, len(days_list) + 1), len(days_list[0])),
                   'score': [x[y] for y in range(len(days_list[0])) for x in days_list]})

print(AnovaRM(data=df, depvar='score', subject='patient', within=['group']).fit())

I'm assuming the p-value you're looking for is the number displayed in the Pr > F column when you run the code in your question. If you instead assign the result of the test to a variable, the underlying dataframe can be accessed through the anova_table attribute:
results = AnovaRM(data=df, depvar='score', subject='patient', within=['group']).fit()
print(results.anova_table)
which gives:
       F Value  Num DF  Den DF   Pr > F
group     15.5     2.0     8.0  0.00177
Just access the first entry of the Pr > F column (using .iloc avoids relying on the row label), and you're all set:
print(results.anova_table["Pr > F"].iloc[0])
This yields the answer:
0.0017705227840260451
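If you prefer label-based access, the row label of anova_table is the name of the within factor, so (assuming the factor is called 'group' as above) this sketch is equivalent:
# Label-based lookup; 'group' is the within factor name used above
print(results.anova_table.loc["group", "Pr > F"])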

I think I found a way!
a = AnovaRM(data=df, depvar='score', subject='patient', within=['group']).fit().summary().as_html()
pd.read_html(a, header=0, index_col=0)[0]['Pr > F'].iloc[0]
Hope it will help someone!

Related

Applying custom functions to groupby objects pandas

I have the following pandas dataframe.
import pandas as pd
import numpy as np
df = pd.DataFrame(
    {
        "bird_type": ["falcon", "crane", "crane", "falcon"],
        "avg_speed": [np.random.randint(50, 200) for _ in range(4)],
        "no_of_birds_observed": [np.random.randint(3, 10) for _ in range(4)],
        "reliability_of_data": [np.random.rand() for _ in range(4)],
    }
)
# The dataframe looks like this.
  bird_type  avg_speed  no_of_birds_observed  reliability_of_data
0    falcon         66                     3             0.553841
1     crane        159                     8             0.472359
2     crane        158                     7             0.493193
3    falcon        161                     7             0.585865
Now, I would like the weighted average (weighted by no_of_birds_observed) of the avg_speed and reliability_of_data columns. For that I have the simple function below, which calculates the weighted average.
def func(data, numbers):
    ans = 0
    for a, b in zip(data, numbers):
        ans = ans + a * b
    ans = ans / sum(numbers)
    return ans
How can I apply func to both the average speed and reliability columns?
I expect the answer to be a dataframe like the following:
  bird_type  avg_speed  no_of_birds_observed  reliability_of_data
0    falcon      132.5                    10            0.5762578
# how: (66*3 + 161*7)/(3+7)             (3+7)  (0.553841*3 + 0.585865*7)/(3+7)
1     crane     158.53                    15            0.4820815
# how: (159*8 + 158*7)/(8+7)            (8+7)  (0.472359*8 + 0.493193*7)/(8+7)
I saw this question, but could not generalize the solution / understand it completely. I thought of not asking the question, but according to this blog post by SO and this meta question, with a different example, I think this question can be considered a "borderline duplicate". An answer will benefit me, and probably others will find this useful too. So I finally decided to ask.
Don't use a function with apply; rather, perform a classical aggregation:
cols = ['avg_speed', 'reliability_of_data']
# multiply relevant columns by no_of_birds_observed
# aggregate everything as sum
out = (df[cols].mul(df['no_of_birds_observed'], axis=0)
               .combine_first(df)
               .groupby('bird_type').sum()
      )
# divide the relevant columns by the sum of no_of_birds_observed
out[cols] = out[cols].div(out['no_of_birds_observed'], axis=0)
Output:
            avg_speed  no_of_birds_observed  reliability_of_data
bird_type
crane      158.533333                    15             0.482082
falcon     132.500000                    10             0.576258
If you want to aggregate with GroupBy.agg, the no_of_birds_observed column can be used as the weights parameter of numpy.average, looked up via DataFrame.loc:
# for correct output, a default (or unique-valued) index is needed
df = df.reset_index(drop=True)

f = lambda x: np.average(x, weights=df.loc[x.index, 'no_of_birds_observed'])
df1 = (df.groupby('bird_type', sort=False, as_index=False)
         .agg(avg=('avg_speed', f),
              no_of_birds=('no_of_birds_observed', 'sum'),
              reliability_of_data=('reliability_of_data', f)))
print(df1)
  bird_type         avg  no_of_birds  reliability_of_data
0    falcon  132.500000           10             0.576258
1     crane  158.533333           15             0.482082

How can I take values from a CSV file and print the values that are within +1 or -1 of a chosen value?

I am quite new to Python, so please bear with me.
I am trying to pick one of the values printed below, find it in the CSV file, and print the values within +1 or -1 of it.
Here is the code that picks the values.
import pandas as pd
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

df = pd.read_csv(r"/Users/aaronhuang/Desktop/ffp/exfileCLEAN2.csv", skiprows=[1])
magnitudes = df['Magnitude '].values
times = df['Time '].values
zscores = np.abs(stats.zscore(magnitudes, ddof=1))
outlier_indices = np.argwhere(zscores > 3).flatten()
# print() returns None, so assign the values first and then print them
numbers = times[outlier_indices]
print(numbers)
The values printed are below.
2455338.895 2455350.644 2455391.557 2455404.776 2455413.734 2455451.661
2455473.49 2455477.521 2455507.505 2455702.662 2455734.597 2455765.765
2455776.575 2455826.593 2455842.512 2455866.508 2455996.796 2456017.767
2456047.694 2456058.732 2456062.722 2456071.924 2456082.802 2456116.494
2456116.535 2456116.576 2456116.624 2456116.673 2456116.714 2456116.799
2456123.527 2456164.507 2456166.634 2456391.703 2456455.535 2456455.6
2456501.763 2456511.616 2456519.731 2456525.49 2456547.588 2456570.526
2456595.515 2456776.661 2456853.543 2456920.511 2456953.496 2457234.643
2457250.68 2457252.672 2457278.526 2457451.89 2457485.722 2457497.93
2457500.674 2457566.874 2457567.877 2457644.495 2457661.553 2457675.513
An example of the csv file is below.
Time Magnitude Magnitude error
2455260.853 19.472 0.150
2455260.900 19.445 0.126
2455261.792 19.484 0.168
2455262.830 19.157 0.261
2455264.814 19.376 0.150
... ... ... ...
2457686.478 19.063 0.176
2457689.480 19.178 0.128
2457690.475 19.386 0.171
2457690.480 19.092 0.112
2457691.476 19.191 0.122
For example, if I pick the first value, which is 2455338.895, I would like to print all the values within +1 or -1 of it (in the Time column), and later graph them.
Some help would be greatly appreciated.
Thank you in advance.
I think this is what you are looking for (assuming you want to query a single number, as described in the question):
numbers = times[outlier_indices]
print(df[(df['Time'] < numbers[0] + 1) & (df['Time'] > numbers[0] - 1)]['Time'])
Looping over numbers to get all the intervals is straightforward, if that is what you are interested in.
EDIT:
The for loop looks like this:
print(pd.concat([df[(df['Time']<i+1) & (df['Time']>i-1)]['Time'] for i in numbers]))
The non-loop version, assuming there are no overlapping intervals (numbers[i] - 1, numbers[i] + 1):
intervals = pd.DataFrame(data={'start': numbers - 1, 'finish': numbers + 1})
starts = pd.DataFrame(data={'start': 1}, index=intervals.start.tolist())
finishs = pd.DataFrame(data={'finish': -1}, index=intervals.finish.tolist())
transitions = pd.merge(starts, finishs, how='outer', left_index=True, right_index=True).fillna(0)
transitions['transition'] = (transitions.pop('finish') + transitions.pop('start')).cumsum()
B = pd.DataFrame(index=numbers)
pd.merge(transitions, B, how='outer', left_index=True, right_index=True).fillna(method='ffill').loc[B.index].astype(bool)
print(transitions[transitions.transition == 1].index)
In case of overlap, you can merge consecutive overlapping intervals in the intervals dataframe with the help of the following column, and then run the above code (it needs maybe a couple more lines to complete; a sketch follows):
intervals['overlapping'] = (intervals.finish - intervals.start.shift(-1)) > 0
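A minimal sketch of that merge step, assuming intervals is sorted by start (chained overlapping rows share a group id, and each group collapses to one merged interval):
# A new group starts wherever the previous interval does not overlap this one
group_id = (~intervals['overlapping'].shift(fill_value=False)).cumsum()
merged = (intervals.groupby(group_id)
                   .agg(start=('start', 'min'), finish=('finish', 'max')))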
You can simply iterate over the numbers (with numbers assigned as above, it is already an array of floats, so no string splitting is needed):
first = numbers[0]
threshold = 1
result = []
for num in numbers:
    # keep the values that lie within the threshold of the first value
    if abs(first - num) < threshold:
        result.append(num)
print(result)

Pandas to modify values in csv file based on function

I have a CSV file that looks like the one below. This is the same as in my last question, but this time I am using Pandas.
Group Sam Dan Bori Son John Mave
A 0.00258844 0.983322 1.61479 1.2785 1.96963 10.6945
B 0.0026034 0.983305 1.61198 1.26239 1.9742 10.6838
C 0.0026174 0.983294 1.60913 1.24543 1.97877 10.6729
D 0.00263062 0.983289 1.60624 1.22758 1.98334 10.6618
E 0.00264304 0.98329 1.60332 1.20885 1.98791 10.6505
I have a function like below
def getnewno(value):
    value = value + 30
    if value > 40:
        value = value - 20
    else:
        value = value
    return value
I want to send all these values to the getnewno function, get a new value back, and update the CSV file. How can this be accomplished in Pandas?
Expected output:
Group Sam Dan Bori Son John Mave
A 30.00258844 30.983322 31.61479 31.2785 31.96963 20.6945
B 30.0026034 30.983305 31.61198 31.26239 31.9742 20.6838
C 30.0026174 30.983294 31.60913 31.24543 31.97877 20.6729
D 30.00263062 30.983289 31.60624 31.22758 31.98334 20.6618
E 30.00264304 30.98329 31.60332 31.20885 31.98791 20.6505
The following should give you what you desire.
Applying a function
Your function can be simplified and here expressed as a lambda function.
It's then a matter of applying your function to all of the columns. There are a number of ways to do so. The first idea that comes to mind is to loop over df.columns. However, we can do better than this by using the applymap or transform methods:
import pandas as pd

# Read in the data from file
df = pd.read_csv('data.csv',
                 sep=r'\s+',
                 index_col=0)

# Simplified function with which to transform data
getnewno = lambda value: value + 10 if value > 10 else value + 30

# Looping over columns
#for col in df.columns:
#    df[col] = df[col].apply(getnewno)

# Apply to all columns without loop
df = df.applymap(getnewno)

# Write out updated data
df.to_csv('data_updated.csv')
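Note that in recent pandas versions (2.1 and later), DataFrame.applymap is deprecated in favour of DataFrame.map, so on a current install the elementwise apply would be:
# Elementwise apply on pandas >= 2.1, where applymap is deprecated
df = df.map(getnewno)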
Using broadcasting
You can achieve your result using broadcasting and a little boolean logic. This avoids looping over any columns, and should ultimately prove faster and less memory intensive (although if your dataset is small any speed-up would be negligible):
import pandas as pd

df = pd.read_csv('data.csv',
                 sep=r'\s+',
                 index_col=0)

df += 30
make_smaller = df > 40
df[make_smaller] -= 20
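Equivalently, the whole transformation collapses to a single numpy.where call; a sketch assuming, as above, that every remaining column is numeric:
import numpy as np

# value + 30 > 40 is the same as value > 10: add 10 in that case, else add 30
df[:] = np.where(df > 10, df + 10, df + 30)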
First of all, your getnewno function looks more complicated than it needs to be; it can be simplified (note that both branches must still include the +30 that is applied first) to e.g.:
def getnewno(value):
    if value + 30 > 40:
        return value + 10
    else:
        return value + 30
You can even change value + 30 > 40 to value > 10.
Or even a one-liner if you want:
getnewno = lambda value: value + 10 if value > 10 else value + 30
Having the function, you can apply it to specific values/columns. For example, if you want to create a column Mark_updated based on a Mark column, it would look like this (I assume your pandas DataFrame is called df):
df['Mark_updated'] = df['Mark'].apply(getnewno)
Use the mask function for an if-else solution before writing the data to csv:
res = (df
       .select_dtypes('number')
       .add(30)
       # the if-else comes in here:
       # if any entry in the dataframe is greater than 40, subtract 20 from it,
       # else leave it as is
       .mask(lambda x: x > 40, lambda x: x.sub(20))
      )

# insert the group column back
res.insert(0, 'Group', df.Group.array)

Write to csv:
res.to_csv(filename)
Group Sam Dan Bori Son John Mave
0 A 30.002588 30.983322 31.61479 31.27850 31.96963 20.6945
1 B 30.002603 30.983305 31.61198 31.26239 31.97420 20.6838
2 C 30.002617 30.983294 31.60913 31.24543 31.97877 20.6729
3 D 30.002631 30.983289 31.60624 31.22758 31.98334 20.6618
4 E 30.002643 30.983290 31.60332 31.20885 31.98791 20.6505

Create a column that is a conditional smoothed moving average of another column in python

My Problem
I am trying to create a column in Python which is a conditional smoothed 14-day moving average of another column. The condition is that I only want to include positive values from another column in the rolling average.
I am currently using the following code, which works exactly how I want it to, but it is really slow because of the loops. I want to try to re-do it without loops. The dataset is simply the last closing price of a stock.
Current Working Code
import numpy as np
import pandas as pd

csv1 = pd.read_csv('stock_price.csv', delimiter=',')
df = pd.DataFrame(csv1)

df['delta'] = df.PX_LAST.pct_change()
df.loc[df.index[0], 'avg_gain'] = 0
for x in range(1, len(df.index)):
    if df["delta"].iloc[x] > 0:
        df["avg_gain"].iloc[x] = ((df["avg_gain"].iloc[x - 1] * 13) + df["delta"].iloc[x]) / 14
    else:
        df["avg_gain"].iloc[x] = ((df["avg_gain"].iloc[x - 1] * 13) + 0) / 14
df
Correct Output Example
Dates PX_LAST delta avg_gain
03/09/2018 43.67800 NaN 0.000000
04/09/2018 43.14825 -0.012129 0.000000
05/09/2018 42.81725 -0.007671 0.000000
06/09/2018 43.07725 0.006072 0.000434
07/09/2018 43.37525 0.006918 0.000897
10/09/2018 43.47925 0.002398 0.001004
11/09/2018 43.59750 0.002720 0.001127
12/09/2018 43.68725 0.002059 0.001193
13/09/2018 44.08925 0.009202 0.001765
14/09/2018 43.89075 -0.004502 0.001639
17/09/2018 44.04200 0.003446 0.001768
Attempted Solutions
I tried to create a new column comprising only the positive values and then tried to create the smoothed moving average of that new column, but it doesn't give me the right answer:
df['new_col'] = df['delta'].apply(lambda x: x if x > 0 else 0)
df['avg_gain'] = df['new_col'].ewm(14,min_periods=1).mean()
The maths behind it is as follows:
Avg_Gain = ((Avg_Gain(t-1) * 13) + (New_Col * 1)) / 14
where New_Col only contains the positive values of Delta.
Does anyone know how I might be able to do it?
Cheers
This should speed up your code:
df['avg_gain'] = df[df['delta'] > 0]['delta'].rolling(14).mean()
Does your current code converge to zero? If you can provide the data, it would be easier for folks to do some analysis.
I would suggest you add a column which is 0 where the value is < 0 and otherwise keeps the value you want to consider. Then you take the running average of this new column.
df['new_col'] = df['delta'].apply(lambda x: x if x >= 0 else 0)
df['avg_gain'] = df['new_col'].rolling(14).mean()
This takes zeros into account instead of just discarding them.
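For what it's worth, the recursion in the question, Avg_Gain = ((Avg_Gain(t-1) * 13) + New_Col) / 14, is exactly an exponentially weighted mean with alpha = 1/14 and adjust=False, so a vectorized sketch (reusing the new_col idea above) would be:
# Replace negative (and the leading NaN) deltas with 0, then apply the
# recursion y_t = (13*y_{t-1} + x_t) / 14, i.e. ewm with alpha = 1/14
df['new_col'] = df['delta'].clip(lower=0).fillna(0)
df['avg_gain'] = df['new_col'].ewm(alpha=1/14, adjust=False).mean()
This reproduces the avg_gain column in the example output above (e.g. 0.006072 / 14 = 0.000434 on the first positive delta).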

Groupby a data set in Python

I have 30 years of daily data. I want to calculate the average for each calendar day over the 30 years. For example, I have data like this:
1/1/2036 0
1/2/2036 73.61180115
1/3/2036 73.77733612
1/4/2036 73.61183929
1/5/2036 73.75443268
1/6/2036 73.58483887
.........
12/22/2065 73.90600586
12/23/2065 74.38092804
12/24/2065 77.76309967
I want to calculate:
1/1/yyyy ?
1/2/yyyy ?
1/3/yyyy ?
......
12/30/yyyy ?
12/31/yyyy ?
I wrote code in Python, but it's only calculating the average for the first month. My dataset is 10950 x 1, which should be converted to 365 x 1. The following is my code:
import glob
import pandas as pd

files = glob.glob('*2036-2065*rcp26*.csv*')
RO_act = pd.read_csv('Reservoir storage zones_sohom.csv', index_col=0, parse_dates=True)
for i, fl in enumerate(files):
    df = pd.read_csv(fl, index_col=0, usecols=[0, 78], parse_dates=True)
    df1 = df.groupby(pd.TimeGrouper(freq='D')).mean()
Please help
You can pass a function to df.groupby, which will act on the indices to make the groups. So, in your case, use:
df.groupby(lambda x: (x.day, x.month)).mean()
Consider the following series s:
import numpy as np
import pandas as pd

days = pd.date_range('1986-01-01', '2015-12-31')
s = pd.Series(np.random.rand(len(days)), days)
Then what you're looking for is:
s.groupby([s.index.month, s.index.day]).mean()
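As a quick sanity check on the shape (note that this 1986-2015 range contains leap days, so there are 366 distinct (month, day) pairs; a 10950-row series like the one in the question, i.e. 30 x 365 with no Feb 29, would give 365):
daily_avg = s.groupby([s.index.month, s.index.day]).mean()
print(daily_avg.shape)  # (366,) here, because the range includes Feb 29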
Timing
#juanpa.arrivillaga's answer gives the same solution but is slower.
