Creating a rolling mean of an annual cycle in pandas, Python

I am trying to use pandas to create a rolling mean, but of an annual cycle (so that the rolling mean for 31st December would take into account values from January, and the rolling means for January would use values from December). Does anyone know if there is a built-in or other elegant way to do this?
The only way I've come up with so far is to create the annual cycle and then repeat it over three leap years (as the annual cycle includes the 29th Feb), take the rolling mean (or standard deviation, etc.) and then crop the middle year. There must be a better solution! Here's my attempt:
import pandas as pd
import numpy as np
import calendar
data = np.random.rand(366)
df_annual_cycle = pd.DataFrame(
    columns=['annual_cycle'],
    index=pd.date_range('2004-01-01', '2004-12-31').strftime('%m-%d'),
    data=data,
)
df_annual_cycle.head()
# annual_cycle
# 01-01 0.863838
# 01-02 0.234168
# 01-03 0.368678
# 01-04 0.066332
# 01-05 0.493080
df1 = df_annual_cycle.copy()
df1.index = ['04-' + x for x in df1.index]
df1.index = pd.to_datetime(df1.index, format='%y-%m-%d')
df2 = df_annual_cycle.copy()
df2.index = ['08-' + x for x in df2.index]
df2.index = pd.to_datetime(df2.index, format='%y-%m-%d')
df3 = df_annual_cycle.copy()
df3.index = ['12-' + x for x in df3.index]
df3.index = pd.to_datetime(df3.index, format='%y-%m-%d')
# DataFrame.append is deprecated; concatenate the three leap years instead
df_for_rolling = pd.concat([df1, df2, df3])
df_rolling = df_for_rolling.rolling(65).mean()
df_annual_cycle_rolling = df_rolling.loc['2008-01-01':'2008-12-31']
df_annual_cycle_rolling.index = df_annual_cycle.index
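A lighter-weight sketch of the same wrap-around idea (assuming the window, 65 rows here, is shorter than a year): pad the cycle with a window-sized slice from each end, take the rolling mean, and crop, instead of tiling three full years:
window = 65
cycle = df_annual_cycle['annual_cycle'].reset_index(drop=True)
# pad: last `window` values in front, first `window` values behind
padded = pd.concat([cycle.iloc[-window:], cycle, cycle.iloc[:window]], ignore_index=True)
rolled = padded.rolling(window, center=True).mean()
# crop back to the original 366 positions and restore the mm-dd index
wrapped = rolled.iloc[window:window + 366].reset_index(drop=True)
wrapped.index = df_annual_cycle.index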

We can use pandas.DataFrame.rolling().
Details and other rolling methods can be found here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rolling.html
Let's assume we have a dataframe like so:
data = np.concatenate([
    1*np.random.rand(366//6),
    2*np.random.rand(366//6),
    3*np.random.rand(366//6),
    4*np.random.rand(366//6),
    5*np.random.rand(366//6),
    6*np.random.rand(366//6)
])
df_annual_cycle = pd.DataFrame(
    columns=['annual_cycle'],
    index=pd.date_range('2004-01-01','2004-12-31').strftime('%m-%d'),
    data=data,
)
We can do:
# reset the index to integers:
df_annual_cycle = df_annual_cycle.reset_index()
# rename index column to date:
df_annual_cycle = df_annual_cycle.rename(columns={'index':'date'})
# calculate the rolling mean:
df_annual_cycle['rolling_mean'] = df_annual_cycle['annual_cycle'].rolling(32, win_type='triang').mean()
# plot results
df_annual_cycle.plot(x='date', y=['annual_cycle', 'rolling_mean'], style=['o', '-'])
The result is a plot of the raw 'annual_cycle' values as points with the smoothed 'rolling_mean' curve drawn through them.

Related

How to find Date of 52 Week High and date of 52 Week low using pandas dataframe (Python)?

Please refer to the table below for reference.
I was able to find the 52-week high and low using:
df = pd.read_csv(csv_file_name, engine='python')
df['52W H'] = df['HIGH'].rolling(window=252, center=False).max()
df['52W L'] = df['LOW'].rolling(window=252, center=False).min()
Can someone please guide me on how to find the date of the 52-week high and the date of the 52-week low? Thanks in advance.
My guess is that the date is another column in the dataframe; assuming its name is 'Date',
you can try something like:
df = pd.read_csv(csv_file_name, engine='python')
df['52W H'] = df['HIGH'].rolling(window=252, center=False).max()
df['52W L'] = df['LOW'].rolling(window=252, center=False).min()
df_low = df[df['LOW'] == df['52W L']]
low_date = df_low['Date']
Similarly, you can look for the high values, as in the sketch below.
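A minimal sketch of the symmetric lookup (assuming the same 'Date', 'HIGH', and '52W H' columns as above):
df_high = df[df['HIGH'] == df['52W H']]
high_date = df_high['Date']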
It would also have helped if you had shared your sample dataframe's structure.
This uses 'pandas_datareader' data. The index is reset first. Then the idxmax() and idxmin() functions find the positions of the rolling highs and lows, and arrays are built from these values. The 'Date' column is then set back as the index, and the positional arrays are used to look the dates up via df.index. Note that the NaN values produced before the first complete 252-day window are skipped when assigning the dates.
Replace 'High' and 'Low' with your own column names in df.
import pandas as pd
import pandas_datareader.data as web
import numpy as np
df = web.DataReader('GE', 'yahoo', start='2012-01-10', end='2019-10-09')
df = df.reset_index()
# positional index of the rolling max/min within each 252-day window
imax = df['High'].rolling(window=252, center=False).apply(lambda x: x.idxmax()).values
imin = df['Low'].rolling(window=252, center=False).apply(lambda x: x.idxmin()).values
# the first 251 windows are incomplete and yield NaN; count and skip them
count0_imax = np.count_nonzero(np.isnan(imax))
count0_imin = np.count_nonzero(np.isnan(imin))
imax = imax[count0_imax:].astype(int)
imin = imin[count0_imin:].astype(int)
# restore 'Date' as the index and translate positions into dates
df = df.set_index('Date')
df.loc[df.index[count0_imax]:, '52W H'] = df.index[imax]
df.loc[df.index[count0_imin]:, '52W L'] = df.index[imin]

How to switch years to appear in columns when using groupby in a time series

I have a time series that looks something like this:
fechas= pd.Series(pd.date_range(start='2015-01-01', end='2020-12-01', freq='H'))
data=pd.Series(range(len(fechas)))
df=pd.DataFrame({'Date':fechas, 'Data':data})
What I need to do is sum the data for every day and group by year. What I did, and it works, is:
df['year']=pd.DatetimeIndex(df['Date']).year
df['month']=pd.DatetimeIndex(df['Date']).month
df['day']=pd.DatetimeIndex(df['Date']).day
df.groupby(['year','month','day'])['Data'].sum().reset_index()
But what I need is to have the years in the columns, to look something like this:
res = pd.DataFrame(columns=['dd-mm','2015','2016','2017','2018','2019','2020'])
This might be what you need:
df = pd.DataFrame({'Date': fechas, 'Data': data})
# sum per calendar day (numeric_only avoids trying to sum the 'Date' column on newer pandas)
df = df.groupby(pd.DatetimeIndex(df["Date"]).date).sum(numeric_only=True)
df.index = pd.to_datetime(df.index)
df["dd-mm"] = df.index.strftime("%d-%m")
output = pd.DataFrame(index=df["dd-mm"].unique())
for yr in range(2015, 2021):
    temp = df[df.index.year == yr]
    temp = temp.set_index("dd-mm")
    output[yr] = temp["Data"]
output = output.reset_index()  # if you want to have dd-mm as a column instead of the index
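A more compact alternative is a pivot; here is a sketch under the same fechas/data setup that should produce the same shape:
df = pd.DataFrame({'Date': fechas, 'Data': data})
# sum per calendar day, then split dd-mm and year out of the date
daily = df.groupby(df['Date'].dt.date)['Data'].sum()
daily.index = pd.to_datetime(daily.index)
out = pd.pivot_table(
    pd.DataFrame({'dd-mm': daily.index.strftime('%d-%m'),
                  'year': daily.index.year,
                  'Data': daily.values}),
    index='dd-mm', columns='year', values='Data',
).reset_index()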

Calculate geometric mean return for specific rows

I have a dataframe like this.
Date price mid std top btm
..............
1999-07-21 8.6912 8.504580 0.084923 9.674425 8.334735
1999-07-22 8.6978 8.508515 0.092034 8.692583 8.324447
1999-07-23 8.8127 8.524605 0.118186 10.760976 8.288234
1999-07-24 8.8779 8.688810 0.091124 8.871057 8.506563
..............
I want to create a new column called 'diff'.
If, in a row, 'price' > 'top', then I want to fill 'diff' in that row with the geometric mean return of the price in that row and the price in the row 5 rows earlier (the 5-day geometric mean).
For example, in the row for 1999-07-22, the price is greater than top, so I want to fill 'diff' in this row with the geometric mean of 07-22 and 07-17 (note that the dates may not be consecutive, since holidays are excluded). Only a small portion of the rows meets this condition, so most of the values in 'diff' will be missing.
Could you please tell me how I can do this in Python?
Use Series.diff with Series.where to set the NaNs:
df['diff'] = df['price'].diff().where(df['price'] > df['top'])
print (df)
price mid std top btm diff
Date
1999-07-21 8.6912 8.504580 0.084923 9.674425 8.334735 NaN
1999-07-22 8.6978 8.508515 0.092034 8.692583 8.324447 0.0066
1999-07-23 8.8127 8.524605 0.118186 10.760976 8.288234 NaN
1999-07-24 8.8779 8.688810 0.091124 8.871057 8.506563 0.0652
EDIT:
I believe you need:
df['Date'] = pd.to_datetime(df['Date'])
df = df.set_index('Date')
from scipy.stats.mstats import gmean
df['gmean'] = (df['price'].rolling('5d')
                          .apply(gmean, raw=True)
                          .where(df['price'] > df['top']))
print (df)
price mid std top btm gmean
Date
1999-07-21 8.6912 8.504580 0.084923 9.674425 8.334735 NaN
1999-07-22 8.6978 8.508515 0.092034 8.692583 8.324447 8.694499
1999-07-23 8.8127 8.524605 0.118186 10.760976 8.288234 NaN
1999-07-24 8.8779 8.688810 0.091124 8.871057 8.506563 8.769546
You can achieve that by taking the difference of the price and top columns and then assigning a NaN or zero value to those differences that are <= 0:
import pandas as pd
import numpy as np
df = pd.DataFrame(...)
df['diff'] = df['price'] - df['top']
df.loc[df['diff'] <= 0, 'diff'] = np.nan  # or 0
Here's another solution:
import pandas as pd
from functools import reduce
__name__ = 'RunScript'
ddict = {
    'Date': ['1999-07-21', '1999-07-22', '1999-07-23', '1999-07-24'],
    'price': [8.6912, 8.6978, 8.8127, 8.8779],
    'mid': [8.504580, 8.508515, 8.524605, 8.688810],
    'std': [0.084923, 0.092034, 0.118186, 0.091124],
    'top': [9.674425, 8.692583, 10.760976, 8.871057],
    'btm': [8.334735, 8.324447, 8.288234, 8.506563],
}
data = pd.DataFrame(ddict)
def geo_mean(values):
    """
    Geometric mean function. Pass an iterable.
    """
    return reduce(lambda a, b: a * b, values) ** (1.0 / len(values))
def set_geo_mean(df):
    # Shift the price column down one period
    df['shifted price'] = df['price'].shift(periods=1)
    # Create a mask that evaluates price vs top
    masked_expression = df['price'] > df['top']
    # Select the rows from the dataframe where the mask is true
    masked_data = df[masked_expression]
    # Apply the geometric mean function to the relevant rows
    df.loc[masked_expression, 'geo_mean'] = geo_mean([masked_data['price'], masked_data['shifted price']])
    # Drop the shifted price column once complete
    df.drop('shifted price', axis=1, inplace=True)
if __name__ == 'RunScript':
    # Call the function, passing the dataframe as the argument
    set_geo_mean(data)
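For the literal 5-row geometric mean return the question describes, here is a hedged sketch along the same lines, assuming a dataframe df with the 'price' and 'top' columns above:
n = 5
# geometric mean return between the current price and the price n rows back
ret = (df['price'] / df['price'].shift(n)) ** (1.0 / n) - 1
df['diff'] = ret.where(df['price'] > df['top'])  # keep only rows where price > top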

Parsing CSV in pandas

I want to calculate the average number of successful Rattata catches per hour for this whole dataset. I am looking for an efficient way to do this using pandas; I'm new to Python and pandas.
You don't need any loops. Try this; I think the logic is fairly clear.
import pandas as pd
#read csv
df = pd.read_csv('pkmn.csv', header=0)
#parse the timestamp and extract the date and hour from it
df['time'] = pd.to_datetime(df['time'])
df['date'] = df['time'].dt.date
df['hour'] = df['time'].dt.hour  # needed for the groupby below
#main transformations
df = df.query("Pokemon == 'rattata' and caught == True").groupby('hour')
result = pd.DataFrame()
result['caught total'] = df['hour'].count()
result['days'] = df['date'].nunique()
result['caught average'] = result['caught total'] / result['days']
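A more compact variant of the same idea, sketched under the assumption that the CSV has 'Pokemon', 'caught', and a parseable 'time' column:
df = pd.read_csv('pkmn.csv', header=0)
df['time'] = pd.to_datetime(df['time'])
rats = df[(df['Pokemon'] == 'rattata') & df['caught']]
# group catches by hour of day, then divide counts by the number of distinct days seen in each hour
g = rats.assign(date=rats['time'].dt.date, hour=rats['time'].dt.hour).groupby('hour')
hourly_average = g.size() / g['date'].nunique()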
If you have your pandas dataframe saved as df this should work:
rats = df.loc[df.Pokemon == "rattata"] #Gives you subset of rows relating to Rattata
total = sum(rats.Caught) #Gives you the number caught total
diff = rats['time'].iloc[-1] - rats['time'].iloc[0] #Difference between last and first timestamps
average = total/diff #Should give you the number caught per unit time

Use Pandas GroupBy Columns in new DataFrame

I have a large temperature time series that I'm performing some functions on. I'm taking hourly observations and creating daily statistics. After I'm done with my calculations, I want to use the grouped year and Julian days that are objects in the GroupBy ('aa' below), together with the drangeT and drangeHI arrays that come out, to make an entirely new DataFrame with those variables. The code is below:
import numpy as np
import scipy.stats as st
import pandas as pd
city = ['BUF']#,'PIT','CIN','CHI','STL','MSP','DET']
mons = np.arange(5,11,1)
for a in city:
    data = 'H:/Classwork/GEOG612/Project/' + a + 'Data_cut.txt'
    df = pd.read_table(data, sep='\t')
    df['TempF'] = ((9./5.)*df['TempC']) + 32.
    df1 = df.loc[df['Month'].isin(mons)]
    aa = df1.groupby(['Year','Julian'], as_index=False)
    maxT = aa.aggregate({'TempF':np.max})
    minT = aa.aggregate({'TempF':np.min})
    maxHI = aa.aggregate({'HeatIndex':np.max})
    minHI = aa.aggregate({'HeatIndex':np.min})
    drangeT = maxT - minT
    drangeHI = maxHI - minHI
    df2 = pd.DataFrame(data = {'Year':aa.Year,'Day':aa.Julian,'TRange':drangeT,'HIRange':drangeHI})
All variables in the df2 command are of length 8250, but I get this error message when I run it:
ValueError: cannot copy sequence with size 3 to array axis with dimension 8250
Any suggestions are welcomed and appreciated. Thanks!
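One possible fix, sketched under the assumption that the column names above are accurate: aggregate with named output columns via agg, then build df2 from the resulting ordinary DataFrame rather than from the GroupBy object:
stats = df1.groupby(['Year', 'Julian'], as_index=False).agg(
    maxT=('TempF', 'max'), minT=('TempF', 'min'),
    maxHI=('HeatIndex', 'max'), minHI=('HeatIndex', 'min'))
df2 = pd.DataFrame({'Year': stats['Year'],
                    'Day': stats['Julian'],
                    'TRange': stats['maxT'] - stats['minT'],
                    'HIRange': stats['maxHI'] - stats['minHI']})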
