Python multiple lines by year - python

I have a dataset consisting of two columns: date ds and volume y. I would like to see how the daily average volume is trending across different months and years, with month names on the x-axis and average volume on the y-axis. Each line should represent a different year. Here is a sample dataset and where I am stuck.
import pandas as pd
df = pd.DataFrame([
{"ds":"2017-01-01","y":3},
{"ds":"2017-01-18","y":4},
{"ds":"2017-02-04","y":6},
{"ds":"2018-01-06","y":2},
{"ds":"2018-01-12","y":8},
{"ds":"2018-02-08","y":2},
{"ds":"2018-03-02","y":8},
{"ds":"2018-03-15","y":2},
{"ds":"2018-03-22","y":8},
])
df["ds"] = pd.to_datetime(df["ds"])
df.set_index("ds",inplace=True)
df.resample("M").mean().plot()

Aggregate the mean by month name and year, reshape with Series.unstack, and then plot:
df["ds"] = pd.to_datetime(df["ds"])
# sort by date first if necessary (sort=False keeps months in order of appearance)
#df = df.sort_values('ds')
df1 = (df.groupby([df["ds"].dt.strftime('%b'), df["ds"].dt.year], sort=False)['y']
         .mean()
         .unstack(fill_value=0))
print(df1)
ds   2017  2018
ds
Jan   3.5   5.0
Feb   6.0   2.0
Mar   0.0   6.0
df1.plot()

You must group by years and by months:
import calendar # to use months' proper names
means = df.groupby([df.index.month, df.index.year]).mean()\
.unstack().reset_index(0, drop=True)\
.rename(dict(enumerate(calendar.month_abbr[1:])))
#ds 2017 2018
#ds
#Jan 3.5 5.0
#Feb 6.0 2.0
#Mar NaN 6.0
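A minimal self-contained sketch of the same idea, assuming the sample data from the question. It groups on the numeric month rather than the month name, so the rows come out in calendar order regardless of how the input is sorted, and only then maps the numbers to abbreviations. The names monthly, month and year are illustrative and do not come from the answers above.

```python
import calendar

import pandas as pd

df = pd.DataFrame({
    "ds": pd.to_datetime([
        "2017-01-01", "2017-01-18", "2017-02-04",
        "2018-01-06", "2018-01-12", "2018-02-08",
        "2018-03-02", "2018-03-15", "2018-03-22",
    ]),
    "y": [3, 4, 6, 2, 8, 2, 8, 2, 8],
})

# Group by (month number, year); the groupby sorts months numerically.
month = df["ds"].dt.month.rename("month")
year = df["ds"].dt.year.rename("year")
monthly = df.groupby([month, year])["y"].mean().unstack("year")

# Map each month number to its abbreviation, independent of row position.
monthly.index = [calendar.month_abbr[m] for m in monthly.index]

print(monthly)
# monthly.plot(marker="o")  # one line per year; requires matplotlib
```

Months with no data for a given year stay NaN here, so the plotted line simply has a gap instead of dropping to zero.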

Related

Finding the third Friday for an expiration date using pandas datetime

I have a simple definition which finds the third friday of the month. I use this function to populate the dataframe for the third fridays and that part works fine.
The trouble I'm having is finding the third friday for an expiration_date that doesn't fall on a third friday.
This is my code simplified:
import pandas as pd
def is_third_friday(d):
    return d.weekday() == 4 and 15 <= d.day <= 21
x = ['09/23/2022','09/26/2022','09/28/2022','09/30/2022','10/3/2022','10/5/2022',
'10/7/2022','10/10/2022','10/12/2022','10/14/2022','10/17/2022','10/19/2022','10/21/2022',
'10/24/2022','10/26/2022','10/28/2022','11/4/2022','11/18/2022','12/16/2022','12/30/2022',
'01/20/2023','03/17/2023','03/31/2023','06/16/2023','06/30/2023','09/15/2023','12/15/2023',
'01/19/2024','06/21/2024','12/20/2024','01/17/2025']
df = pd.DataFrame(x)
df.rename( columns={0 :'expiration_date'}, inplace=True )
df['expiration_date']= pd.to_datetime(df['expiration_date'])
expiration_date = df['expiration_date']
df["is_third_friday"] = [is_third_friday(x) for x in expiration_date]
third_fridays = df.loc[df['is_third_friday'] == True]
df["current_monthly_exp"] = third_fridays['expiration_date'].min()
df["monthly_exp"] = third_fridays[['expiration_date']]
df.to_csv(path_or_buf = f'C:/Data/Date Dataframe.csv',index=False)
What I'm looking for is any expiration_date that is prior to the monthly expire, I want to populate the dataframe with that monthly expire. If it's past the monthly expire date I want to populate the dataframe with the following monthly expire.
I thought I'd be able to use a new dataframe with only the monthly expires as a lookup table and do a timedelta, but when you look at 4/21/2023 and 7/21/2023 these dates don't exist in that dataframe.
This is my current output:
This is the output I'm seeking:
I was thinking I could handle this problem with something like:
date_df["monthly_exp"][0][::-1].expanding().min()[::-1]
But, it wouldn't solve for the 4/21/2023 and 7/21/2023 problem. Additionally, Pandas won't let you do this in a datetime dataframe.
>>> df = pd.DataFrame([1, nan,2,nan,nan,nan,4])
>>> df
0
0 1.0
1 NaN
2 2.0
3 NaN
4 NaN
5 NaN
6 4.0
>>> df["b"] = df[0][::-1].expanding().min()[::-1]
>>> df
0 b
0 1.0 1.0
1 NaN 2.0
2 2.0 2.0
3 NaN 4.0
4 NaN 4.0
5 NaN 4.0
6 4.0 4.0
I've also tried something like the following in many different forms with little luck:
if df['is_third_friday'].any() == True:
    df["monthly_exp"] = third_fridays[['expiration_date']]
else:
    df["monthly_exp"] = third_fridays[['expiration_date']].shift(third_fridays)
Any suggestions to get me in the right direction would be appreciated. I've been stuck on this problem for some time.
You could add these lines of code (replacing df["monthly_exp"] = third_fridays[['expiration_date']]):
# DataFrame of fridays from minimum expiration_date to 30 days after last
fri_3s = pd.DataFrame(pd.date_range(df["expiration_date"].min(),
df["expiration_date"].max()+pd.tseries.offsets.Day(30),
freq="W-FRI"),
columns=["monthly_exp"])
# only keep those that are between 15th and 21st (as your function did)
fri_3s = fri_3s[fri_3s.monthly_exp.dt.day.between(15, 21)]
# merge_asof to get next third friday
df = pd.merge_asof(df, fri_3s, left_on="expiration_date", right_on="monthly_exp", direction="forward")
This creates a second DataFrame containing every third Friday in the date range, and merge_asof with direction="forward" then matches each expiration_date to the first third Friday on or after it. Both frames must be sorted on their key columns, which they are here.
To simplify your date_df["monthly_exp"][0][::-1].expanding().min()[::-1] and make it work for datetimes, you could instead write df["monthly_exp"].bfill() (which backward fills). As you mentioned, this will only include Fridays that already exist in your DataFrame, so generating the list of all possible third Fridays is probably the easiest way.
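The lookup described above can be sketched end to end on a handful of the dates from the question. This is an illustration under assumptions: the frame names exp and third_fridays are made up for the example, and the dtype cast is only there to keep both merge keys on the same datetime type.

```python
import pandas as pd

exp = pd.DataFrame({"expiration_date": pd.to_datetime(
    ["2022-09-23", "2022-10-21", "2023-03-31", "2023-06-30"])})

# Every Friday from the first expiration to 30 days past the last one.
fridays = pd.date_range(exp["expiration_date"].min(),
                        exp["expiration_date"].max() + pd.Timedelta(days=30),
                        freq="W-FRI")

# A third Friday always falls on day 15..21 of its month.
third_fridays = pd.DataFrame(
    {"monthly_exp": fridays[(fridays.day >= 15) & (fridays.day <= 21)]})
third_fridays["monthly_exp"] = third_fridays["monthly_exp"].astype(
    exp["expiration_date"].dtype)

# direction="forward" picks the first third Friday on or after each date.
out = pd.merge_asof(exp, third_fridays,
                    left_on="expiration_date", right_on="monthly_exp",
                    direction="forward")
print(out)
```

Dates such as 2023-04-21 and 2023-07-21 appear in the result even though they are not in the input, because the lookup table is generated from the calendar rather than from the data.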

Spline interpolation on dataframes by row

I have the following data frame:
OBJECTID  2017  2018          2019          2020          2021
     1.0   NaN   NaN   7569.183179   7738.162829   7907.142480
     2.0   NaN   NaN    766.591146    783.861122    801.131099
     3.0   NaN   NaN   8492.215747   8686.747704   8881.279662
     4.0   NaN   NaN  40760.327825  41196.877473  41633.427120
     5.0   NaN   NaN   6741.819674   6788.981231   6836.142788
I am trying to apply a spline interpolation on each row to get the values for 2017 and 2018 using the following code:
years = list(range(2017,2022))
df[years] = df[years].interpolate(method="spline", order =1, limit_direction="both", axis=1)
However, I get the following error:
ValueError: Index column must be numeric or datetime type when using spline method other than linear. Try setting a numeric or datetime index column before interpolating.
The dataframe in this question is just a subset of a much larger dataset I am using. All of the examples I have seen do the spline interpolation down each column, but I can't seem to get it work across each row. I feel like it's a simple solution and I'm just missing it. Could someone please help?
This is probably because the dtype of the column index (with axis=1, interpolate works along the columns) is object: the columns also contain the string name OBJECTID. Even though you select only the integer year columns, the resulting column index keeps the object dtype, and interpolate refuses to run a spline when it sees an object-dtype index.
Example - even though the years are stored as integers the overall dtype is object:
df.columns
Index(['OBJECTID', 2017, 2018, 2019, 2020, 2021], dtype='object')
If we did this:
df.drop(columns=['OBJECTID'], inplace=True)
df.columns = df.columns.astype('uint64')
df.columns
UInt64Index([2017, 2018, 2019, 2020, 2021], dtype='uint64')
Then the axis=1 interpolation works:
years = list(range(2017,2022))
df[years] = df[years].interpolate(method="spline", order =1, limit_direction="both", axis=1)
           2017          2018          2019          2020          2021
0   7231.223878   7400.203528   7569.183179   7738.162829   7907.142480
1    732.051193    749.321169    766.591146    783.861122    801.131099
2   8103.151832   8297.683789   8492.215747   8686.747704   8881.279662
3  39887.228530  40323.778178  40760.327825  41196.877473  41633.427120
4   6647.496560   6694.658117   6741.819674   6788.981231   6836.142788
Dropping the OBJECTID was done to illustrate what is going on.
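If OBJECTID needs to be kept, one option is to move it into the row index instead of dropping it. The sketch below only demonstrates the dtype behaviour described above on two of the sample rows; the name wide is illustrative, and the interpolate call is left as a comment because the spline method additionally requires SciPy.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "OBJECTID": [1.0, 2.0],
    2017: [np.nan, np.nan],
    2018: [np.nan, np.nan],
    2019: [7569.183179, 766.591146],
    2020: [7738.162829, 783.861122],
    2021: [7907.142480, 801.131099],
})
years = list(range(2017, 2022))

# Selecting only the year columns does not change the column dtype.
before = df[years].columns.dtype

# Keep OBJECTID as the row index, then make the column labels numeric.
wide = df.set_index("OBJECTID")
wide.columns = wide.columns.astype("int64")
after = wide.columns.dtype

print(before, after)

# With numeric column labels the row-wise call from the question can be applied:
# wide = wide.interpolate(method="spline", order=1, limit_direction="both", axis=1)
```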

How to interpolate only over a specific window?

I have a dataset that follows a weekly indexation, and a list of dates that I need to get interpolated data for. For example, I have the following df with weekly aggregation:
data value
1/01/2021 10
7/01/2021 10
14/01/2021 10
28/01/2021 10
and a list of dates that do not coincide with the df indexed dates, for example:
list_dates = [12/01/2021, 13/01/2021 ...]
I need to get what the interpolated values would be for every date on the list_dates but within a given window (for ex: using only 4 values in the df to calculate to interpolation, split between before and after --> so the 2 first dates before the list date and the 2 first dates after the list date).
To get the interpolated value of the list date 12/01/2021 in the list, I would need to use:
1/1/2021
7/1/2021
14/1/2021
28/1/2021
The output would then be:
data value
1/01/2021 10
7/01/2021 10
12/01/2021 10
13/01/2021 10
14/01/2021 10
28/01/2021 10
I have successfully coded an example of this, but it fails when there are multiple consecutive NaNs (for example 12/01 and 13/01). I also can't concat the interpolated value before running the next one in the list, as that would use an interpolated date to calculate the next interpolated date (for example, using 12/01 to calculate 13/01).
Any advice on how to do this?
Use interpolate to get the expected outcome, but first prepare your dataframe as shown below.
I slightly modified your input data to make the interpolation with a DatetimeIndex (method='time') visible:
# Input data
df = pd.DataFrame({'data': ['1/01/2021', '7/01/2021', '14/01/2021', '28/01/2021'],
'value': [10, 10, 17, 10]})
list_dates = ['12/01/2021', '13/01/2021']
# Conversion of dates
df['data'] = pd.to_datetime(df['data'], format='%d/%m/%Y')
new_dates = pd.to_datetime(list_dates, format='%d/%m/%Y')
# Set datetime column as index and append new dates
df = df.set_index('data')
df = df.reindex(df.index.append(new_dates)).sort_index()
# Interpolate with method='time'
df['value'] = df['value'].interpolate(method='time')
Output:
>>> df
value
2021-01-01 10.0
2021-01-07 10.0
2021-01-12 15.0 # <- time interpolation
2021-01-13 16.0 # <- time interpolation
2021-01-14 17.0 # <- changed from 10 to 17
2021-01-28 10.0
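As a cross-check of the numbers above, time-based interpolation is linear in elapsed time, so the same estimates can be reproduced with np.interp on day offsets. This is a sketch with illustrative names (known, wanted, origin). Each requested date is computed only from the original observations, so an interpolated value for 12/01 is never used to compute 13/01.

```python
import numpy as np
import pandas as pd

known = pd.Series(
    [10.0, 10.0, 17.0, 10.0],
    index=pd.to_datetime(["2021-01-01", "2021-01-07", "2021-01-14", "2021-01-28"]),
)
wanted = pd.to_datetime(["2021-01-12", "2021-01-13"])

# Express every date as a number of days since the first observation.
origin = known.index[0]
x_known = (known.index - origin).days
x_wanted = (wanted - origin).days

# Linear interpolation between the two original observations around each date.
estimates = np.interp(x_wanted, x_known, known.to_numpy())
result = pd.Series(estimates, index=wanted)
print(result)
```

For 12/01 this is 10 + (17 - 10) * 5/7 = 15, since 12/01 lies 5 days into the 7-day gap between 7/01 and 14/01.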

python pandas:get rolling value of one Dataframe by rolling index of another Dataframe

I have two dataframes: one has multi levels of columns, and another has only single level column (which is the first level of the first dataframe, or say the second dataframe is calculated by grouping the first dataframe).
These two dataframes look like the following:
first dataframe-df1
second dataframe-df2
The relationship between df1 and df2 is:
df2 = df1.groupby(axis=1, level='sector').mean()
Then, I get the index of rolling_max of df1 by:
result1=pd.rolling_apply(df1,window=5,func=lambda x: pd.Series(x).idxmax(),min_periods=4)
Let me explain result1 a little bit. For example, during the five days (window length) 2016/2/23 - 2016/2/29, the max price of the stock sh600870 happened in 2016/2/24, the index of 2016/2/24 in the five-day range is 1. So, in result1, the value of stock sh600870 in 2016/2/29 is 1.
Now, I want to get the sector price for each stock by the index in result1.
Take the same stock as an example: sh600870 is in the sector '家用电器视听器材白色家电'. So for 2016/2/29, I want to get the sector price on 2016/2/24, which is 8.770.
How can I do that?
idxmax (or np.argmax) returns an index which is relative to the rolling
window. To make the index relative to df1, add the index of the left edge of
the rolling window:
index = pd.rolling_apply(df1, window=5, min_periods=4, func=np.argmax)
shift = pd.rolling_min(np.arange(len(df1)), window=5, min_periods=4)
index = index.add(shift, axis=0)
Once you have ordinal indices relative to df1, you can use them to index
into df1 or df2 using .iloc.
For example,
import numpy as np
import pandas as pd
np.random.seed(2016)
N = 15
columns = pd.MultiIndex.from_product([['foo','bar'], ['A','B']])
columns.names = ['sector', 'stock']
dates = pd.date_range('2016-02-01', periods=N, freq='D')
df1 = pd.DataFrame(np.random.randint(10, size=(N, 4)), columns=columns, index=dates)
df2 = df1.groupby(axis=1, level='sector').mean()
window_size, min_periods = 5, 4
index = pd.rolling_apply(df1, window=window_size, min_periods=min_periods, func=np.argmax)
shift = pd.rolling_min(np.arange(len(df1)), window=window_size, min_periods=min_periods)
# alternatively, you could use
# shift = np.pad(np.arange(len(df1)-window_size+1), (window_size-1, 0), mode='constant')
# but this is harder to read and understand, and therefore may be more prone to bugs.
index = index.add(shift, axis=0)
result = pd.DataFrame(index=df1.index, columns=df1.columns)
for col in index:
    sector, stock = col
    mask = pd.notnull(index[col])
    idx = index.loc[mask, col].astype(int)
    result.loc[mask, col] = df2[sector].iloc[idx].values
print(result)
yields
sector foo bar
stock A B A B
2016-02-01 NaN NaN NaN NaN
2016-02-02 NaN NaN NaN NaN
2016-02-03 NaN NaN NaN NaN
2016-02-04 5.5 5 5 7.5
2016-02-05 5.5 5 5 8.5
2016-02-06 5.5 6.5 5 8.5
2016-02-07 5.5 6.5 5 8.5
2016-02-08 6.5 6.5 5 8.5
2016-02-09 6.5 6.5 6.5 8.5
2016-02-10 6.5 6.5 6.5 6
2016-02-11 6 6.5 4.5 6
2016-02-12 6 6.5 4.5 4
2016-02-13 2 6.5 4.5 5
2016-02-14 4 6.5 4.5 5
2016-02-15 4 6.5 4 3.5
Note that the rolling_apply syntax changed in pandas 0.18: DataFrames and Series now have a rolling method, and the top-level pd.rolling_* functions have since been removed, so you would now use:
index = df1.rolling(window=window_size, min_periods=min_periods).apply(np.argmax)
shift = (pd.Series(np.arange(len(df1)))
.rolling(window=window_size, min_periods=min_periods).min())
index = index.add(shift.values, axis=0)
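For reference, here is a small deterministic sketch of the whole procedure using only the rolling method. The data, the window of 3, and the names rel, left_edge and pos are assumptions chosen so the result can be checked by hand. The sector mean is computed through a transpose, which avoids grouping along axis=1.

```python
import numpy as np
import pandas as pd

columns = pd.MultiIndex.from_tuples([("foo", "A"), ("foo", "B")],
                                    names=["sector", "stock"])
dates = pd.date_range("2016-02-01", periods=6, freq="D")
df1 = pd.DataFrame({("foo", "A"): [1.0, 5.0, 2.0, 3.0, 0.0, 4.0],
                    ("foo", "B"): [1.0, 1.0, 6.0, 7.0, 2.0, 8.0]},
                   index=dates)
df1.columns = columns

# Sector mean per day: here [1, 3, 4, 5, 1, 6] for sector "foo".
df2 = df1.T.groupby(level="sector").mean().T

window = 3
# Position of the maximum inside each window (relative to the window start).
rel = df1.rolling(window=window, min_periods=window).apply(np.argmax, raw=True)
# Row number of the left edge of the window ending at each row.
left_edge = pd.Series(np.maximum(np.arange(len(df1)) - window + 1, 0),
                      index=df1.index)
# Position of the maximum relative to the whole frame.
pos = rel.add(left_edge, axis=0)

result = pd.DataFrame(np.nan, index=df1.index, columns=df1.columns)
for sector, stock in df1.columns:
    p = pos[(sector, stock)]
    valid = p.notna()
    picks = p[valid].astype(int).to_numpy()
    result.loc[valid, (sector, stock)] = df2[sector].to_numpy()[picks]

print(result)
```

For stock A, the window ending on 2016-02-03 holds [1, 5, 2], so the maximum sits on 2016-02-02, where the sector mean is 3.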

Group by multiple time units in pandas data frame

I have a data frame that consists of a time series data with 15-second intervals:
date_time value
2012-12-28 11:11:00 103.2
2012-12-28 11:11:15 103.1
2012-12-28 11:11:30 103.4
2012-12-28 11:11:45 103.5
2012-12-28 11:12:00 103.3
The data spans many years. I would like to group by both year and time to look at the distribution of time-of-day effect over many years. For example, I may want to compute the mean and standard deviation of every 15-second interval across days, and look at how the means and standard deviations change from 2010, 2011, 2012, etc. I naively tried data.groupby(lambda x: [x.year, x.time]) but it didn't work. How can I do such grouping?
In case date_time is not your index, a date_time-indexed DataFrame could be created with:
dfts = df.set_index('date_time')
From there you can group by intervals using
dfts.groupby(lambda x : x.month).mean()
to see mean values for each month. Similarly, you can do
dfts.groupby(lambda x : x.year).std()
for standard deviations across the years.
If I understood the example task you would like to achieve, you could simply split the data into years using xs, group them and concatenate the results and store this in a new DataFrame.
years = range(2012, 2015)
yearly_month_stats = [dfts.xs(str(year)).groupby(lambda x : x.month).mean() for year in years]
df2 = pd.concat(yearly_month_stats, axis=1, keys = years)
From which you get something like
2012 2013 2014
value value value
1 NaN 5.324165 15.747767
2 NaN -23.193429 9.193217
3 NaN -14.144287 23.896030
4 NaN -21.877975 16.310195
5 NaN -3.079910 -6.093905
6 NaN -2.106847 -23.253183
7 NaN 10.644636 6.542562
8 NaN -9.763087 14.335956
9 NaN -3.529646 2.607973
10 NaN -18.633832 0.083575
11 NaN 10.297902 14.059286
12 33.95442 13.692435 22.293245
You were close:
data.groupby([lambda x: x.year, lambda x: x.time()])
Note that time is a method on each timestamp, so it has to be called. Also be sure to set date_time as the index, as in kermit666's answer.
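A small sketch of grouping by both year and time of day, assuming date_time is already the index. It passes the index attributes directly instead of lambdas, which gives the same groups; the sample values and the name stats are made up for illustration.

```python
from datetime import time

import pandas as pd

idx = pd.to_datetime([
    "2011-12-28 11:11:00", "2011-12-29 11:11:00",
    "2012-12-28 11:11:00", "2012-12-29 11:11:00",
    "2012-12-28 11:11:15",
])
data = pd.DataFrame({"value": [100.0, 102.0, 103.0, 105.0, 103.1]}, index=idx)

# One group per (year, time-of-day) pair, aggregated across days.
stats = (data.groupby([data.index.year, data.index.time])["value"]
             .agg(["mean", "std"])
             .rename_axis(["year", "time"]))
print(stats)

# Mean of the 11:11:00 slot over all days in 2011.
mean_2011 = stats.loc[(2011, time(11, 11)), "mean"]
```

A slot observed only once in a year has a NaN standard deviation, which is worth keeping in mind when comparing years with different coverage.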
