In Python 3.5, pandas 0.20, say I have a one-year periodic time series:
import pandas as pd
import numpy as np
start_date = pd.to_datetime("2015-01-01T01:00:00.000Z")
end_date = pd.to_datetime("2015-12-31T23:00:00.000Z")
index = pd.date_range(start=start_date, end=end_date, freq="60min")
time = np.array((index - start_date) / np.timedelta64(1, 'h'), dtype=int)
df = pd.DataFrame(index=index)
df["foo"] = np.sin(2 * np.pi * time / len(time))
df.plot()
I want to do some periodic extrapolation of the time series onto a new index, i.e. with:
new_start_date = pd.to_datetime("2017-01-01T01:00:00.000Z")
new_end_date = pd.to_datetime("2019-12-31T23:00:00.000Z")
new_index = pd.date_range(start=new_start_date, end=new_end_date, freq="60min")
I would like to use some kind of extrapolate_periodic method to get:
# DO NOT RUN
new_df = df.extrapolate_periodic(index=new_index)
# END DO NOT RUN
new_df.plot()
What is the best way do such a thing in pandas?
How can I define a periodicity and get data from a new index easily?
I think I have what you are looking for, though it is not a simple pandas method.
Carrying on directly from where you left off:
def extrapolate_periodic(df, new_index):
    df_right = df.groupby([df.index.dayofyear, df.index.hour]).mean()
    df_left = pd.DataFrame({'new_index': new_index}).set_index('new_index')
    df_left = df_left.assign(dayofyear=lambda x: x.index.dayofyear,
                             hour=lambda x: x.index.hour)
    df = (pd.merge(df_left, df_right, left_on=['dayofyear', 'hour'],
                   right_index=True, suffixes=('', '_y'))
          .drop(['dayofyear', 'hour'], axis=1))
    return df.sort_index()
new_df = extrapolate_periodic(df, new_index)
# or as a method style
# new_df = df.pipe(extrapolate_periodic, new_index)
new_df.plot()
If you have more than a year's worth of data, it will take the mean of each duplicated day-hour pair. Here mean could be changed to last if you wanted just the most recent reading.
This will not work if you do not have a full year's worth of data, but you could fix that by adding a reindex to complete the year and then using interpolate with a polynomial method to fill in the missing foo column.
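That reindex-then-interpolate fix can be sketched like this (a made-up hourly series with a gap in the middle; a polynomial fill as suggested above would additionally need SciPy, so a linear fill is used here):

```python
import numpy as np
import pandas as pd

# Made-up hourly series with an 8-hour gap in the middle.
full_index = pd.date_range("2015-01-01", periods=24, freq="h")
s = pd.Series(np.arange(24, dtype=float), index=full_index)
partial = s.drop(s.index[8:16])  # simulate missing hours

# Reindex back to the complete index, then interpolate the gap.
filled = partial.reindex(full_index).interpolate(method="linear")
```

After the reindex the missing hours appear as NaN, and interpolate fills them from the neighbouring values.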
Here is some code I've used to solve my problem. The assumption is that the initial series covers exactly one full period of data.
def extrapolate_periodic(df, new_index):
    index = df.index
    start_date = np.min(index)
    end_date = np.max(index)
    period = np.array((end_date - start_date) / np.timedelta64(1, 'h'), dtype=int)
    time = np.array((new_index - start_date) / np.timedelta64(1, 'h'), dtype=int)
    new_df = pd.DataFrame(index=new_index)
    for col in df.columns:
        new_df[col] = np.array(df[col].iloc[time % period])
    return new_df
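For example, on a toy series whose four hourly points stand in for one period (the function is repeated here so the snippet is self-contained; note the period is end minus start, so the final point is never re-emitted):

```python
import numpy as np
import pandas as pd

def extrapolate_periodic(df, new_index):
    # Span of the original index, in whole hours, is treated as one period.
    start_date = df.index.min()
    end_date = df.index.max()
    period = int((end_date - start_date) / np.timedelta64(1, "h"))
    time = np.array((new_index - start_date) / np.timedelta64(1, "h"), dtype=int)
    new_df = pd.DataFrame(index=new_index)
    for col in df.columns:
        new_df[col] = np.array(df[col].iloc[time % period])
    return new_df

# Toy 4-point hourly "period" extrapolated one year ahead.
idx = pd.date_range("2015-01-01", periods=4, freq="h")
df = pd.DataFrame({"foo": [0.0, 1.0, 2.0, 3.0]}, index=idx)
new_idx = pd.date_range("2016-01-01", periods=6, freq="h")
out = extrapolate_periodic(df, new_idx)
```

The modulo lookup wraps around, so the six new hours cycle through the first three original values.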
Thank you for taking a look! I am having issues with a four-level MultiIndex and am attempting to make sure every possible value of the 4th index level is represented.
Here is my dataframe:
np.random.seed(5)
size = 25
data = {'Customer': np.random.choice(['Bob'], size),
        'Grouping': np.random.choice(['Corn', 'Wheat', 'Soy'], size),
        'Date': np.random.choice(pd.date_range('1/1/2018', '12/12/2022', freq='D'), size),
        'Data': np.random.randint(20, 100, size=size)
        }
df = pd.DataFrame(data)
# create the Sub-Group column
df['Sub-Group'] = np.nan
df.loc[df['Grouping'] == 'Corn', 'Sub-Group'] = np.random.choice(['White', 'Dry'], size=len(df[df['Grouping'] == 'Corn']))
df.loc[df['Grouping'] == 'Wheat', 'Sub-Group'] = np.random.choice(['SRW', 'HRW', 'SWW'], size=len(df[df['Grouping'] == 'Wheat']))
df.loc[df['Grouping'] == 'Soy', 'Sub-Group'] = np.random.choice(['Beans', 'Meal'], size=len(df[df['Grouping'] == 'Soy']))
df['Year'] = df.Date.dt.year
With that, I'm looking to create a groupby like the following:
(df.groupby(['Customer','Grouping','Sub-Group',df['Date'].dt.month,'Year'])
.agg(Units = ('Data','sum'))
.unstack()
)
This works as expected. I want to reindex this dataframe so that every single month (index level 3) is represented and filled with 0s. The reason I want this is that later on I'll be doing a cumulative sum within a groupby.
I have tried the following reindex, and nothing happens: many months are still missing.
rere = pd.date_range('2018-01-01','2018-12-31', freq='M').month
(df.groupby(['Customer','Grouping','Sub-Group',df['Date'].dt.month,'Year'])
.agg(Units = ('Data','sum'))
.unstack()
.fillna(0)
.pipe(lambda x: x.reindex(rere, level=3, fill_value=0))
)
I've also tried the following:
(df.groupby(['Customer','Grouping','Sub-Group',df['Date'].dt.month,'Year'])
.agg(Units = ('Data','sum'))
.unstack()
.fillna(0)
.pipe(lambda x: x.reindex(pd.MultiIndex.from_product(x.index.levels)))
)
The issue with the last one is that the index is much too long: it's doing the Cartesian product of Grouping and Sub-Group, when really there are no combinations of 'Wheat' as a Grouping and 'Dry' as a Sub-Group.
I'm looking for a flexible way to reindex this dataframe to make sure a specific index level (3rd in this case) has every option.
Thanks so much for any help!
Try this:
def reindex_sub(g: pd.DataFrame):
    g = g.droplevel([0, 1, 2])
    result = g.reindex(range(1, 13))
    return result
tmp = (df.groupby(['Customer','Grouping','Sub-Group',df['Date'].dt.month,'Year'])
.agg(Units = ('Data','sum'))
.unstack()
)
grouped = tmp.groupby(level=[0,1,2], group_keys=True)
out = grouped.apply(reindex_sub)
print(out)
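The same group-wise reindex can be seen self-contained on a tiny made-up frame (names are illustrative); adding fill_value=0 to the reindex gives the zeros the question asked for instead of NaN:

```python
import pandas as pd

# Toy frame standing in for the unstacked groupby result:
# two months of data for one Customer/Grouping/Sub-Group combination.
idx = pd.MultiIndex.from_tuples(
    [("Bob", "Corn", "Dry", 3), ("Bob", "Corn", "Dry", 7)],
    names=["Customer", "Grouping", "Sub-Group", "Date"],
)
tmp = pd.DataFrame({2018: [10, 20]}, index=idx)

def reindex_sub(g):
    # Drop the three grouping levels, then force months 1..12 to exist.
    g = g.droplevel([0, 1, 2])
    return g.reindex(range(1, 13), fill_value=0)

out = tmp.groupby(level=[0, 1, 2], group_keys=True).apply(reindex_sub)
```

Because the reindex happens per group, only month values are completed; no spurious Grouping/Sub-Group combinations are created.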
Please refer to the table below for reference.
I was able to find the 52-week high and low using:
df = pd.read_csv(csv_file_name, engine='python')
df['52W H'] = df['HIGH'].rolling(window=252, center=False).max()
df['52W L'] = df['LOW'].rolling(window=252, center=False).min()
Can someone please guide me on how to find the date of the 52-week high and the date of the 52-week low? Thanks in advance.
My guess is that the date is another column in the dataframe; assuming its name is 'Date', you can try something like:
df = pd.read_csv(csv_file_name, engine='python')
df['52W H'] = df['HIGH'].rolling(window=252, center=False).max()
df['52W L'] = df['LOW'].rolling(window=252, center=False).min()
df_low = df[df['LOW'] == df['52W L']]
low_date = df_low['Date']
Similarly you can look for the high values.
Also, it would have helped if you had shared a sample of your dataframe.
This uses 'pandas_datareader' data. The index is reset first. Then, using the idxmax() and idxmin() functions, the positions of the rolling highs and lows are found and arrays are created from those values. The 'Date' column is then set as the index again, and the position arrays are fed into df.index to look up the dates. Note how, when setting the dates, the leading NaN values (before a full 252-row window is available) are skipped.
Replace High and Low with your own column names in df.
import pandas as pd
import pandas_datareader.data as web
import numpy as np
df = web.DataReader('GE', 'yahoo', start='2012-01-10', end='2019-10-09')
df = df.reset_index()
imax = df['High'].rolling(window=252, center=False).apply(lambda x: x.idxmax()).values
imin = df['Low'].rolling(window=252, center=False).apply(lambda x: x.idxmin()).values
count0_imax = np.count_nonzero(np.isnan(imax))
count0_imin = np.count_nonzero(np.isnan(imin))
imax = imax[count0_imax:].astype(int)
imin = imin[count0_imin:].astype(int)
df = df.set_index('Date')
df.loc[df.index[count0_imax]:, '52W H'] = df.index[imax]
df.loc[df.index[count0_imin]:, '52W L'] = df.index[imin]
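The same rolling-idxmax trick can be seen on a tiny made-up frame, with a 3-day window instead of 252 (column names are illustrative):

```python
import numpy as np
import pandas as pd

# Hypothetical frame: six days of highs, and a 3-day window instead of 252.
dates = pd.date_range("2020-01-01", periods=6, freq="D")
df = pd.DataFrame({"High": [3.0, 5.0, 4.0, 2.0, 6.0, 1.0]}, index=dates)
df = df.reset_index().rename(columns={"index": "Date"})

# Positional index (within the whole frame) of each rolling maximum.
imax = df["High"].rolling(window=3).apply(lambda x: x.idxmax()).values

n_nan = np.count_nonzero(np.isnan(imax))  # leading incomplete windows
imax = imax[n_nan:].astype(int)

# Restore the Date index and look the positions up in it.
df = df.set_index("Date")
df.loc[df.index[n_nan]:, "3D H date"] = df.index[imax]
```

Each row now carries the date on which its rolling high occurred, with the first two rows left empty because their windows are incomplete.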
I have a time series that looks something like this:
fechas= pd.Series(pd.date_range(start='2015-01-01', end='2020-12-01', freq='H'))
data=pd.Series(range(len(fechas)))
df=pd.DataFrame({'Date':fechas, 'Data':data})
What I need to do is take the sum for every day and group by year. What I did, and it works, is:
df['year']=pd.DatetimeIndex(df['Date']).year
df['month']=pd.DatetimeIndex(df['Date']).month
df['day']=pd.DatetimeIndex(df['Date']).day
df.groupby(['year','month','day'])['Data'].sum().reset_index()
But what I need is to have the years as columns, to look something like this:
res = pd.DataFrame(columns=['dd-mm', '2015', '2016', '2017', '2018', '2019', '2020'])
This might be what you need:
df = pd.DataFrame({'Date': fechas, 'Data': data})
df = df.groupby(pd.DatetimeIndex(df["Date"]).date)[["Data"]].sum()
df.index = pd.to_datetime(df.index)
df["dd-mm"] = df.index.strftime("%d-%m")
output = pd.DataFrame(index=df["dd-mm"].unique())
for yr in range(2015, 2021):
    temp = df[df.index.year == yr].set_index("dd-mm")
    output[yr] = temp["Data"]
output = output.reset_index()  # if you want to have dd-mm as a column instead of the index
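As an aside, the loop above can be collapsed into a single pivot_table call; a sketch of that alternative on the question's sample data (the grouper names dd-mm and year are my own):

```python
import pandas as pd

# The question's sample data.
fechas = pd.Series(pd.date_range(start="2015-01-01", end="2020-12-01", freq="h"))
data = pd.Series(range(len(fechas)))
df = pd.DataFrame({"Date": fechas, "Data": data})

# Day-of-year on the rows, calendar year on the columns, summing within each cell.
dd_mm = df["Date"].dt.strftime("%d-%m").rename("dd-mm")
year = df["Date"].dt.year.rename("year")
res = df.pivot_table(index=dd_mm, columns=year, values="Data", aggfunc="sum")
```

pivot_table does the daily sum and the year-wise spreading in one step; days absent in a given year (e.g. 29-02 in non-leap years) come out as NaN rather than 0.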
I want to write a conditional for concatenating strings.
I.e., if a cell contains a specific format of text, then concatenate; otherwise leave it as is.
Example:
If the bill number looks like CM2/0000/, then concatenate this string with the date column (month-year); otherwise leave the bill number as it is.
Sample Data
You can create a function which does what you need and use df.apply() to execute it on all rows.
I use the example data from @Boomer's answer.
EDIT: you didn't show what you really have in the dataframe, and it seems you have datetimes in bill_date, whereas I used strings, so I had to convert the strings to datetime to show how to work with this. It now needs .strftime('%m-%y') (or sometimes .dt.strftime('%m-%y') on a Series) instead of .str[3:].str.replace('/','-'), because pandas displays datetimes differently for different locales, so str(x) would give 2019-09-15 00:00:00 instead of your 15/09/19.
import pandas as pd
df = pd.DataFrame({
'bill_number': ['CM2/0000/', 'CM2/0000', 'CM3/0000/', 'CM3/0000'],
'bill_date': ['15/09/19', '15/09/19', '15/09/19', '15/09/19']
})
df['bill_date'] = pd.to_datetime(df['bill_date'], dayfirst=True)
def convert(row):
    if row['bill_number'].endswith('/'):
        #return row['bill_number'] + row['bill_date'].str[3:].replace('/','-')
        return row['bill_number'] + row['bill_date'].strftime('%m-%y')
    else:
        return row['bill_number']
df['bill_number'] = df.apply(convert, axis=1)
print(df)
Result:
      bill_number  bill_date
0  CM2/0000/09-19 2019-09-15
1        CM2/0000 2019-09-15
2  CM3/0000/09-19 2019-09-15
3        CM3/0000 2019-09-15
The second idea is to create a mask:
mask = df['bill_number'].str.endswith('/')
and later use it for all values
#df.loc[mask,'bill_number'] = df[mask]['bill_number'] + df[mask]['bill_date'].str[3:].str.replace('/','-')
df.loc[mask,'bill_number'] = df[mask]['bill_number'] + df[mask]['bill_date'].dt.strftime('%m-%y')
or
#df.loc[mask,'bill_number'] = df.loc[mask,'bill_number'] + df.loc[mask,'bill_date'].str[3:].str.replace('/','-')
df.loc[mask,'bill_number'] = df.loc[mask,'bill_number'] + df.loc[mask,'bill_date'].dt.strftime('%m-%y')
The left side needs .loc[mask, 'bill_number'] instead of [mask]['bill_number'] to correctly assign values, but the right side doesn't need it.
import pandas as pd
df = pd.DataFrame({
'bill_number': ['CM2/0000/', 'CM2/0000', 'CM3/0000/', 'CM3/0000'],
'bill_date': ['15/09/19', '15/09/19', '15/09/19', '15/09/19']
})
df['bill_date'] = pd.to_datetime(df['bill_date'], dayfirst=True)
mask = df['bill_number'].str.endswith('/')
#df.loc[mask,'bill_number'] = df[mask]['bill_number'] + df[mask]['bill_date'].str[3:].str.replace('/','-')
# or
#df.loc[mask,'bill_number'] = df.loc[mask,'bill_number'] + df.loc[mask,'bill_date'].str[3:].str.replace('/','-')
df.loc[mask,'bill_number'] = df[mask]['bill_number'] + df[mask]['bill_date'].dt.strftime('%m-%y')
#or
#df.loc[mask,'bill_number'] = df.loc[mask,'bill_number'] + df.loc[mask,'bill_date'].dt.strftime('%m-%y')
print(df)
The third idea is to use numpy.where():
import pandas as pd
import numpy as np
df = pd.DataFrame({
'bill_number': ['CM2/0000/', 'CM2/0000', 'CM3/0000/', 'CM3/0000'],
'bill_date': ['15/09/19', '15/09/19', '15/09/19', '15/09/19']
})
df['bill_date'] = pd.to_datetime(df['bill_date'], dayfirst=True)
df['bill_number'] = np.where(
    df['bill_number'].str.endswith('/'),
    #df['bill_number'] + df['bill_date'].str[3:].str.replace('/','-'),
    df['bill_number'] + df['bill_date'].dt.strftime('%m-%y'),
    df['bill_number'])
print(df)
Maybe this will work for you. It would be nice to have a data sample, as @Mike67 was suggesting. But based on your information this is what I came up with. Bulky, but it works. I'm sure someone else will have a fancier version.
import pandas as pd
dat = {'num': ['CM2/0000/', 'CM2/0000', 'CM3/0000/', 'CM3/0000'],
       'date': ['15/09/19', '15/09/19', '15/09/19', '15/09/19']}
df = pd.DataFrame(dat)
df['date'] = df['date'].map(lambda x: str(x)[3:])
df['date'] = df['date'].str.replace('/', '-')
df.loc[df['num'].str.endswith('/'), 'num'] = df['num'] + df['date']
print(df)
Results:
num date
0 CM2/0000/09-19 09-19
1 CM2/0000 09-19
2 CM3/0000/09-19 09-19
3 CM3/0000 09-19
I am trying to use pandas to create a rolling mean, but of an annual cycle (so that the rolling mean for 31st December would take into account values from January, and the rolling mean for January would use values from December). Does anyone know if there is a built-in or other elegant way to do this?
The only way I've come up with so far is to create the annual cycle and then repeat it across leap years (as the annual cycle includes the 29th of February), take the rolling mean (or standard deviation, etc.) and then crop out the middle year. There must be a better solution! Here's my attempt:
import pandas as pd
import numpy as np
import calendar
data = np.random.rand(366)
df_annual_cycle = pd.DataFrame(
    columns=['annual_cycle'],
    index=pd.date_range('2004-01-01', '2004-12-31').strftime('%m-%d'),
    data=data,
)
df_annual_cycle.head()
# annual_cycle
# 01-01 0.863838
# 01-02 0.234168
# 01-03 0.368678
# 01-04 0.066332
# 01-05 0.493080
df1 = df_annual_cycle.copy()
df1.index = pd.to_datetime(['04-' + x for x in df1.index], format='%y-%m-%d')
df2 = df_annual_cycle.copy()
df2.index = pd.to_datetime(['08-' + x for x in df2.index], format='%y-%m-%d')
df3 = df_annual_cycle.copy()
df3.index = pd.to_datetime(['12-' + x for x in df3.index], format='%y-%m-%d')
df_for_rolling = pd.concat([df1, df2, df3])
df_rolling = df_for_rolling.rolling(65).mean()
df_annual_cycle_rolling = df_rolling.loc['2008-01-01':'2008-12-31']
df_annual_cycle_rolling.index = df_annual_cycle.index
We can use pandas.DataFrame.rolling().
Details and other rolling methods can be found here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rolling.html
Let's assume we have a dataframe like so:
data = np.concatenate([
    1 * np.random.rand(366 // 6),
    2 * np.random.rand(366 // 6),
    3 * np.random.rand(366 // 6),
    4 * np.random.rand(366 // 6),
    5 * np.random.rand(366 // 6),
    6 * np.random.rand(366 // 6),
])
df_annual_cycle = pd.DataFrame(
    columns=['annual_cycle'],
    index=pd.date_range('2004-01-01', '2004-12-31').strftime('%m-%d'),
    data=data,
)
We can do:
# reset the index to integers:
df_annual_cycle = df_annual_cycle.reset_index()
# rename index column to date:
df_annual_cycle = df_annual_cycle.rename(columns={'index':'date'})
# calculate the rolling mean:
df_annual_cycle['rolling_mean'] = df_annual_cycle['annual_cycle'].rolling(32, win_type='triang').mean()
# plot results
df_annual_cycle.plot(x='date', y=['annual_cycle', 'rolling_mean'], style=['o', '-'])
The result looks like this:
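Note that a plain rolling call like the one above does not wrap across the year boundary. A minimal sketch of the circular behaviour the question asked for, padding the series with its own head and tail before rolling (the window length and the random cycle are made up):

```python
import numpy as np
import pandas as pd

# Hypothetical daily annual cycle (366 values).
rng = np.random.default_rng(0)
cycle = pd.Series(rng.random(366))

window = 31
pad = window  # pad both ends so the centred window can wrap across the year boundary
padded = pd.concat([cycle.iloc[-pad:], cycle, cycle.iloc[:pad]], ignore_index=True)
rolled = padded.rolling(window, center=True).mean()
# Drop the padding again, keeping one value per day of the cycle.
circular = rolled.iloc[pad:pad + len(cycle)].reset_index(drop=True)
```

Early-January windows now average in late-December values and vice versa, which avoids tiling three full years and cropping the middle one.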