I have 30 years of daily data, and I want to calculate the average for each calendar day across the 30 years. For example, I have data like this:
1/1/2036 0
1/2/2036 73.61180115
1/3/2036 73.77733612
1/4/2036 73.61183929
1/5/2036 73.75443268
1/6/2036 73.58483887
.........
12/22/2065 73.90600586
12/23/2065 74.38092804
12/24/2065 77.76309967
I want to calculate:
1/1/yyyy ?
1/2/yyyy ?
1/3/yyyy ?
......
12/30/yyyy ?
12/31/yyyy ?
I wrote code in Python, but it only calculates the first month's average. My dataset is 10950 x 1, which should be reduced to 365 x 1. Here is my code:
import glob
import pandas as pd

files = glob.glob('*2036-2065*rcp26*.csv*')
RO_act = pd.read_csv('Reservoir storage zones_sohom.csv', index_col=0, parse_dates=True)
for i, fl in enumerate(files):
    df = pd.read_csv(fl, index_col=0, usecols=[0, 78], parse_dates=True)
    df1 = df.groupby(pd.TimeGrouper(freq='D')).mean()
Please help
You can pass a function to df.groupby which will act on the indices to make the groups. So, for you, use:
df.groupby(lambda x: (x.day, x.month)).mean()
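For example, a minimal sketch with synthetic data standing in for your CSV (the frame and the 'storage' column name are just placeholders):

import numpy as np
import pandas as pd

idx = pd.date_range('2036-01-01', '2065-12-24', freq='D')
demo = pd.DataFrame({'storage': np.random.rand(len(idx))}, index=idx)
daily_avg = demo.groupby(lambda x: (x.day, x.month)).mean()  # one row per (day, month) pair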
Consider the following series s
import numpy as np
days = pd.date_range('1986-01-01', '2015-12-31')
s = pd.Series(np.random.rand(len(days)), days)
then what you're looking for is:
s.groupby([s.index.month, s.index.day]).mean()
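Note that on a 30-year range this gives 366 groups, not 365, because Feb 29 forms its own (2, 29) group in leap years:

daily_avg = s.groupby([s.index.month, s.index.day]).mean()
len(daily_avg)  # 366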
Timing
@juanpa.arrivillaga's answer gives the same result but is slower.
I want to use a one-way repeated measures ANOVA on my dataset to test whether the values of 5 patients differ across the 3 measured days.
I use AnovaRM from statsmodels.stats.anova, and the result is an 'AnovaResults' object.
I can see the p-value with the print() function, but I don't know how to isolate it from this object.
Do you have any idea? Also, is my code correct for what I want to test?
Thanks in advance
import numpy as np
import pandas as pd
from statsmodels.stats.anova import AnovaRM

day1 = [1, 2, 3, 4, 5]
day2 = [2, 4, 6, 8, 10]
day3 = [1.5, 2.5, 3.5, 4.5, 5.5]
days_list = [day1, day2, day3]
df = pd.DataFrame({'patient': np.repeat(range(1, len(days_list[0]) + 1), len(days_list)),
                   'group': np.tile(range(1, len(days_list) + 1), len(days_list[0])),
                   'score': [x[y] for y in range(len(days_list[0])) for x in days_list]})
print(AnovaRM(data=df, depvar='score', subject='patient', within=['group']).fit())
I'm assuming the p value you're looking for is the number displayed in the Pr > F column when you run the code in your question. If you instead assign the results of the test to a variable, the underlying dataframe can be accessed through the anova_table attribute:
results = AnovaRM(data=df, depvar='score', subject='patient', within=['group']).fit()
print(results.anova_table)
which gives:
F Value Num DF Den DF Pr > F
group 15.5 2.0 8.0 0.00177
Just access the 0th member of the Pr > F column, and you're all set:
print(results.anova_table["Pr > F"][0])
This yields the answer:
0.0017705227840260451
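One caveat, not from the original answer: on recent pandas versions, plain [0] positional access on a labelled Series is deprecated, so the future-proof spelling is:

print(results.anova_table["Pr > F"].iloc[0])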
I think I found a way!
a=AnovaRM(data=df, depvar='score', subject='patient', within=['group']).fit().summary().as_html()
pd.read_html(a, header=0, index_col=0)[0]['Pr > F'][0]
Hope it will help someone!
I am working with stock data coming from Yahoo Finance.
import pandas as pd
import yfinance as yf

def load_y_finance_data(y_finance_tickers: list):
    df = pd.DataFrame()
    print("Loading Y-Finance data ...")
    for ticker in y_finance_tickers:
        df[ticker.replace("^", "")] = yf.download(
            ticker,
            auto_adjust=True,  # only download adjusted data
            progress=False,
        )["Close"]
    print("Done loading Y-Finance data!")
    return df
x = load_y_finance_data(["^VIX", "^GSPC"])
x
VIX GSPC
Date
1990-01-02 17.240000 359.690002
1990-01-03 18.190001 358.760010
1990-01-04 19.219999 355.670013
1990-01-05 20.110001 352.200012
1990-01-08 20.260000 353.790009
DataSize=(8301, 2)
Here I want to perform a sliding-window operation over every 50-day period: compute the correlation (using the corr() function) for the 50-day slice (day_1 to day_50), then move the window forward by one day (day_2 to day_51), and so on.
I tried the naive way of using a for loop, and it works, but it takes too much time. Code below:
data_size = len(x)
period = 50
df = pd.DataFrame()
for i in range(data_size - period):
    df.loc[i, "GSPC_VIX_corr"] = x[["GSPC", "VIX"]][i:i + period].corr().loc["GSPC", "VIX"]
df
GSPC_VIX_corr
0 -0.703156
1 -0.651513
2 -0.602876
3 -0.583256
4 -0.589086
How can I do this more efficiently? Is there any built-in way I can use?
Thanks :)
You can use the rolling windows functionality of Pandas with many different aggregations, including corr(). Instead of your for loop, do this:
x["VIX"].rolling(window=period).corr(x["GSPC"])
I am seeking to populate a pandas dataframe row-by-row, whereby each new row is calculated on the basis of the contents of the previous row. I am using this for simple financial projections.
Let us take a dataframe 'df_basic_financials':
df_basic_financials = pd.DataFrame({'current_account': [18357.], 'savings_account': [14809.]})
Now I want to forecast what my current and savings accounts will look like in five years, assuming that I earn 24000 a year, that my savings account yields 2% yearly, that I spend no money, and that I do not transfer any money to my savings account.
How do I write the code so that I get this:
current_account savings_account
0 18357 14809
1 42357 15105.18
2 66357 15407.2836
etc... for any number of years I want, each time using the calculation 'value of the previous row in the same column + 24000' for current_account and 'value of the previous row in the same column*1.02' for savings_account.
You can get the number of years from the user and then run the code this way:
import pandas as pd
df = pd.DataFrame({'current_account': [18357], 'savings_account':[14809]})
years = int(input("Enter years: "))
for n in range(years):
    lastrow = df.iloc[-1]
    print(lastrow['current_account'], lastrow['savings_account'])
    # no int() here: casting would truncate the 2% interest to whole units
    df.loc[len(df.index)] = [lastrow['current_account'] + 24000,
                             lastrow['savings_account'] * 1.02]
df
The output will match the desired table shown in the question.
Just use math
df_basic_financials = pd.DataFrame({'current_account': [18357.], 'savings_account': [14809.]})
current_account_projection = [df_basic_financials['current_account'].iloc[-1] + (24000 * i) for i in range(10)]
savings_account_projection = [df_basic_financials['savings_account'].iloc[-1] * (1.02 ** i) for i in range(10)]
df_basic_financials = pd.DataFrame({'current_account': current_account_projection, 'savings_account': savings_account_projection})
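Note that range(10) produces ten rows: i = 0 reproduces the current balances (adding 24000 * 0 and multiplying by 1.02 ** 0), and the remaining nine rows are the projected years; widen the range for a longer horizon.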
If you really want an iterative solution, apply the functions to the last row:
current_account_next = df_basic_financials.iloc[-1]['current_account'] + 24000
savings_account_next = df_basic_financials.iloc[-1]['savings_account'] * 1.02
df_basic_financials = pd.concat(
    [df_basic_financials,
     pd.DataFrame({'current_account': [current_account_next],
                   'savings_account': [savings_account_next]})],
    ignore_index=True)
Hello, I have the following code:
# Import Libraries
import numpy as np
import pandas as pd
import datetime as dt
# Connect to Drive
from google.colab import drive
drive.mount('/content/drive')
# Read Data
ruta = '/content/drive/MyDrive/example.csv'
df = pd.read_csv(ruta)
df.head(15)
d = pd.date_range(start="2015-01-01",end="2022-01-01", freq='MS')
dates = pd.DataFrame({"DATE":d})
df["DATE"] = pd.to_datetime(df["DATE"])
df_merge = pd.merge(dates, df, how='outer', on='DATE')
You can download the data that I am using here: DATA
What I am trying to achieve is something known as Rolling Year.
First I create this metric, grouped by each category:
# ROLLING YEAR
##################################################################################################
# I want to make a Rolling Year for each category, i.e. how much each category sold from 12 months ago TO the current month
# RY_ACTUAL: one year has 12 months, so I pass 12 as the rolling parameter
f = lambda x: x.rolling(12).sum()
df_merge["RY_ACTUAL"] = df_merge.groupby(["CATEGORY"])['Sales'].apply(f)
# RY_24: I create a rolling with 24 as parameter to compare the actual RY vs the last RY
f_1 = lambda x: x.rolling(24).sum()
df_merge["RY_24"] = df_merge.groupby(["CATEGORY"])['Sales'].apply(f_1)
# RY_LAST: subtract RY_ACTUAL from RY_24 to get the amount of the previous RY (RY vs RY-1)
df_merge["RY_LAST"] = df_merge["RY_24"] - df_merge["RY_ACTUAL"]
##################################################################################################
df_merge.head(30)
And it works perfectly, because if you download the file and then filter, for example, for the "Blue" category, you will see something like this:
That means, if you stop at the 2015-November row, the RY_ACTUAL column holds the sum of the last 12 records, up to and including that row.
My next goal is to create a similar column using the rolling function, but with the following condition:
The column must sum the sales of ALL the categories, as long as the Colour/Animal column equals 'Colour'. For example, if I stop at 2016-December, it should give me the sum of ALL the colour sales from 2016-January to 2016-December.
This was my attempt:
df_merge.loc[(df_merge['Colour/Animal'] == 'Colour'),'Sales'].apply(f)
Could anyone help me code this example correctly?
Thanks in advance, community!!!
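A minimal sketch of one possible approach, assuming the DATE, Sales and Colour/Animal columns from the question and one row per month per item: filter to the 'Colour' rows, sum the sales of all categories per date, take a 12-month rolling sum, and map the result back onto df_merge by date:

colour_sales = (df_merge[df_merge['Colour/Animal'] == 'Colour']
                .groupby('DATE')['Sales'].sum())          # all categories together, per month
ry_colour = colour_sales.rolling(12).sum()                # 12-month rolling total
df_merge['RY_COLOUR'] = df_merge['DATE'].map(ry_colour)   # align back to the full frame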
I want to resample() my daily data into six-month chunks. However, I want the ends of the six-month chunks to be the ends of April and October. If I use df.resample('6M').sum() (or df.groupby(pd.Grouper(freq='6M')).sum()), the end of the first six-month chunk is the end of the first month in the data. I know about anchored offsets, but I do not know how to create a custom anchored offset (e.g., '6M-APR' does not work).
Here is some example code:
import pandas as pd
import numpy as np
np.random.seed(42)
df = pd.DataFrame(
    data={'logret': np.random.randn(1000)},
    index=pd.date_range(start='2001-05-25', periods=1000, freq='B')
)
df.resample('6M').sum()
Which yields the following output:
logret
2001-05-31 2.2950148716254297
2001-11-30 -12.536360930670858
2002-05-31 5.468848462868161
2002-11-30 13.027927629740189
2003-05-31 -10.37282118563155
2003-11-30 -0.156275418330286
2004-05-31 -3.0768727498370905
2004-11-30 28.328856464071546
2005-05-31 -3.6462613215100546
I have not achieved my goal (six-month resampling that ends in April and October) with the start, offset, and loffset arguments to .resample().
I have achieved my goal with the hack below. However, it loses the date index, and I would like a more robust/repeatable approach.
# map a date to a year + half-year label: .1 for the half ending
# in month b (April), .2 for the half ending in month b + 6 (October)
def sixmonth(d, b=4):
    y, m, h = d.year, d.month, 1
    if m > (b + 6): y += 1
    elif m > b: h += 1
    return y + h / 10

df.groupby(sixmonth).sum()
Which yields the following output without a date:
logret
2001.2 -10.300839024148
2002.1 9.321994034984547
2002.2 8.855517878860585
2003.1 -2.4576797445001493
2003.2 -7.002919570231796
2004.1 -9.36895555474087
2004.2 27.13038641177464
2005.1 3.154551390326532
Of course, I could improve this hack. But is there a better/robust/repeatable solution for n-period resampling that ends in arbitrary months?
Another workaround, keeping the datetime index:
def custom_6M(df, month=4):
    df = df.resample("M").sum()
    df = df.rolling(6).sum()
    return df[df.index.month.isin([month, month + 6])]
>>> custom_6M(df)
logret
2001-10-31 -10.300839
2002-04-30 9.321994
2002-10-31 8.855518
2003-04-30 -2.457680
2003-10-31 -7.002920
2004-04-30 -9.368956
2004-10-31 27.130386
It's a pain. When I needed something similar, I ended up with the following approach:
anchor_month = 4
non_months = (anchor_month + 3) % 12, (anchor_month + 9) % 12
df = df.resample('Q-APR').sum()
df = (df.reset_index()
        .groupby(df.index.month.isin(non_months).cumsum())
        .agg({'index': 'last', 'logret': 'sum'})
        .set_index('index'))
Result here:
logret
index
2001-10-31 -10.300839
2002-04-30 9.321994
2002-10-31 8.855518
2003-04-30 -2.457680
2003-10-31 -7.002920
2004-04-30 -9.368956
2004-10-31 27.130386
2005-04-30 3.154551
But the problem is that sometimes the last index doesn't fit (it is okay here). That can be fixed by another '6M' resample. Overall: not pretty.
Thanks for the answers.
I have two more options.
Append a time-stamped series to df to anchor the six-month resampling periods
I hoped that .resample()'s origin argument would let me manually anchor my six-month resampling periods. It doesn't, but the following code does.
df.append(pd.Series(name=pd.to_datetime('2001-04-30'), dtype='float')).resample('6M').sum()
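One caveat, not part of the original answer: DataFrame.append was removed in pandas 2.0, so on current versions the same anchoring trick can be written with pd.concat:

anchor = pd.DataFrame(index=[pd.to_datetime('2001-04-30')])  # empty anchor row
pd.concat([df, anchor]).resample('6M').sum()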
Improve my sixmonth() function to use timestamps
def sixmonth(d, m=6, n=4):
    # months to add so the result lands on the next month equal to n (mod m), i.e. April or October
    o = (m - (d.month - n)) % m
    return d + pd.offsets.MonthEnd(o)
I first .resample('M') to make sure that I have end-of-month dates.
I could modify sixmonth() to check for end-of-month dates, but I'm more afraid of finding some new edge case than a little inefficiency.
df.resample('M').sum().groupby(sixmonth).sum()
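For reference, on the example data this reproduces the same half-year totals as the answers above, now keyed by the anchored month-ends (2001-10-31 through 2005-04-30).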