How do I change the name of a resampled column? - python

I have a dataframe with the minute-by-minute price fluctuations of the Nasdaq stock index.
In trading it is important to take into account data on different time units (to identify the short-, medium- and long-term trends...).
So I used the resample() method of pandas to build a dataframe with the price every 5 minutes, in addition to the original 1-minute data:
import pandas as pd

df1m = pd.DataFrame({
    'Time': ['2022-01-11 09:30:00', '2022-01-11 09:31:00', '2022-01-11 09:32:00',
             '2022-01-11 09:33:00', '2022-01-11 09:34:00', '2022-01-11 09:35:00',
             '2022-01-11 09:36:00', '2022-01-11 09:37:00', '2022-01-11 09:38:00',
             '2022-01-11 09:39:00', '2022-01-11 09:40:00'],
    'Price': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]})
df1m['Time'] = pd.to_datetime(df1m['Time'])
df1m.set_index(['Time'], inplace=True)
df5m = df1m.resample('5min').first()
Then I renamed the column to reflect the 5-minute data:
df5m.rename(columns={'Price' : 'Price5'})
Unfortunately, the renaming is no longer reflected once the two dataframes (1-minute and 5-minute) are put together:
df_1m_5m = pd.concat([df1m, df5m], axis=1)
How can I permanently rename the columns created for the 5-minute data, so that I don't end up with two identically named columns holding different data?

You can use:
df5m = df1m.resample('5min').first().add_suffix('5')
df_1m_5m = pd.concat([df1m, df5m], axis=1)
Output:
>>> df_1m_5m
Price Price5
Time
2022-01-11 09:30:00 1 1.0
2022-01-11 09:31:00 2 NaN
2022-01-11 09:32:00 3 NaN
2022-01-11 09:33:00 4 NaN
2022-01-11 09:34:00 5 NaN
2022-01-11 09:35:00 6 6.0
2022-01-11 09:36:00 7 NaN
2022-01-11 09:37:00 8 NaN
2022-01-11 09:38:00 9 NaN
2022-01-11 09:39:00 10 NaN
2022-01-11 09:40:00 11 11.0
You forgot to reassign the result to your dataframe:
df5m = df5m.rename(columns={'Price' : 'Price5'})
# OR
df5m.rename(columns={'Price' : 'Price5'}, inplace=True)
Output:
>>> df5m
Price5
Time
2022-01-11 09:30:00 1
2022-01-11 09:35:00 6
2022-01-11 09:40:00 11
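As a side note, a minimal sketch (assuming df5m still holds its original Price column): you can also rename at concat time, which leaves df5m itself untouched.
# Rename on the fly while concatenating; df5m keeps its original column name.
df_1m_5m = pd.concat([df1m, df5m.rename(columns={'Price': 'Price5'})], axis=1)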

I believe your issue is that you are missing the option inplace=True in your rename. By default it is False, so rename returns a modified copy of the DataFrame rather than editing your existing DataFrame. Setting it to True will edit your existing DataFrame df5m:
df5m.rename(columns={'Price' : 'Price5'},inplace=True)
Output of df_1m_5m:
Price Price5
Time
2022-01-11 09:30:00 1 1.0
2022-01-11 09:31:00 2 NaN
2022-01-11 09:32:00 3 NaN
2022-01-11 09:33:00 4 NaN
2022-01-11 09:34:00 5 NaN
2022-01-11 09:35:00 6 6.0
2022-01-11 09:36:00 7 NaN
2022-01-11 09:37:00 8 NaN
2022-01-11 09:38:00 9 NaN
2022-01-11 09:39:00 10 NaN
2022-01-11 09:40:00 11 11.0

I agree with Stephan and Corralien. You can also try this:
df1m['Price5'] = df1m.resample('5T').first()
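Depending on your pandas version, assigning a one-column DataFrame to a column may complain; a hedged variant is to resample the 'Price' column itself, so the right-hand side is a Series that aligns on the 1-minute index:
# Sketch: resample the Price column as a Series; on assignment it aligns on
# df1m's 1-minute index and the in-between minutes become NaN.
df1m['Price5'] = df1m['Price'].resample('5min').first()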

Related

dataframe data transfer with selected values to another dataframe

My goal is to take the column Sabah from dataframe prdt and write each of its values into the repeated rows labelled Sabah in dataframe prcal.
prcal
Vakit Start_Date End_Date Start_Time End_Time
0 Sabah 2022-01-01 2022-01-01 NaN NaN
1 Güneş 2022-01-01 2022-01-01 NaN NaN
2 Öğle 2022-01-01 2022-01-01 NaN NaN
3 İkindi 2022-01-01 2022-01-01 NaN NaN
4 Akşam 2022-01-01 2022-01-01 NaN NaN
..........................................................
2184 Sabah 2022-12-31 2022-12-31 NaN NaN
2185 Güneş 2022-12-31 2022-12-31 NaN NaN
2186 Öğle 2022-12-31 2022-12-31 NaN NaN
2187 İkindi 2022-12-31 2022-12-31 NaN NaN
2188 Akşam 2022-12-31 2022-12-31 NaN NaN
2189 rows × 5 columns
prdt
Day Sabah Güneş Öğle İkindi Akşam Yatsı
0 2022-01-01 06:51:00 08:29:00 13:08:00 15:29:00 17:47:00 19:20:00
1 2022-01-02 06:51:00 08:29:00 13:09:00 15:30:00 17:48:00 19:21:00
2 2022-01-03 06:51:00 08:29:00 13:09:00 15:30:00 17:48:00 19:22:00
3 2022-01-04 06:51:00 08:29:00 13:09:00 15:31:00 17:49:00 19:22:00
4 2022-01-05 06:51:00 08:29:00 13:10:00 15:32:00 17:50:00 19:23:00
...........................................................................
360 2022-12-27 06:49:00 08:27:00 13:06:00 15:25:00 17:43:00 19:16:00
361 2022-12-28 06:50:00 08:28:00 13:06:00 15:26:00 17:43:00 19:17:00
362 2022-12-29 06:50:00 08:28:00 13:07:00 15:26:00 17:44:00 19:18:00
363 2022-12-30 06:50:00 08:28:00 13:07:00 15:27:00 17:45:00 19:18:00
364 2022-12-31 06:50:00 08:28:00 13:07:00 15:28:00 17:46:00 19:19:00
365 rows × 7 columns
I selected every Sabah row with prcal.iloc[::6,:].
I made a list from prdt['Sabah'].
When I try the assignment prcal.iloc[::6,:] = prdt['Sabah'][0:365] I get a value error:
ValueError: Must have equal len keys and value when setting with an iterable
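The error points to a shape mismatch: prcal.iloc[::6,:] selects a 365-row by 5-column block, while prdt['Sabah'][0:365] is a 365-element Series. A minimal sketch of one way around it, assuming the times are meant to land in a single column such as Start_Time (that target column is an assumption):
# Hypothetical target column; pick whichever prcal column you actually need.
col = prcal.columns.get_loc('Start_Time')
# .values sidesteps index alignment between the two frames.
prcal.iloc[::6, col] = prdt['Sabah'][0:365].values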

Get the max value of dates in Pandas

Here is my code and my datetime column.
import pandas as pd

xcel_file = pd.read_excel('data.xlsx', usecols=['datetime'])
date = []
time = []
date.append(xcel_file['datetime'].dt.date)
time.append(xcel_file['datetime'].dt.time)
new_file = pd.DataFrame({'a': len(xcel_file['datetime'])}, index=xcel_file['datetime'])
day = new_file.between_time('9:00', '16:00')
day.reset_index(inplace=True)
day = day.drop(columns={'a'})
day['time'] = pd.to_datetime(day['datetime']).dt.date
model_list = day['time'].drop_duplicates()
data_set = []
i = 0
for n in day['datetime']:
    data_2 = max(day['datetime'][day['time'] == model_list[i]])
    i += 1
    data_set.append(data_2)
datetime column
0 2022-01-10 09:30:00
1 2022-01-10 10:30:00
2 2022-01-11 10:30:00
3 2022-01-11 15:30:00
4 2022-01-11 11:00:00
5 2022-01-11 12:00:00
6 2022-01-12 13:00:00
7 2022-01-12 15:30:00
8 2022-01-13 14:00:00
9 2022-01-14 15:00:00
10 2022-01-14 16:00:00
11 2022-01-14 16:30:00
expected result
1 2022-01-10 10:30:00
3 2022-01-11 15:30:00
7 2022-01-12 15:30:00
8 2022-01-13 14:00:00
9 2022-01-14 15:00:00
I'm trying to get the max value for each date in the datetime column, considering only times between 9am and 4pm.
Is there any way of doing this? I'd be truly thankful for any kind of help.
Use DataFrame.between_time, then aggregate by day with Grouper to get the maximal datetimes:
df = pd.read_excel('data.xlsx',usecols=['datetime'])
df = df.set_index('datetime', drop=False)
df = (df.between_time('9:00','16:00')
        .groupby(pd.Grouper(freq='d'))[['datetime']]
        .max()
        .reset_index(drop=True))
print (df)
datetime
0 2022-01-10 10:30:00
1 2022-01-11 15:30:00
2 2022-01-12 15:30:00
3 2022-01-13 14:00:00
4 2022-01-14 16:00:00
EDIT: If some days have no match within the time window, missing values are created, so DataFrame.dropna solves this problem.
print (df)
datetime
0 2022-01-10 17:40:00
1 2022-01-10 19:30:00
2 2022-01-11 19:30:00
3 2022-01-11 15:30:00
4 2022-01-12 19:30:00
5 2022-01-12 15:30:00
6 2022-01-14 18:30:00
7 2022-01-14 16:30:00
df = df.set_index('datetime', drop=False)
df = (df.between_time('17:00','19:30')
        .groupby(pd.Grouper(freq='d'))[['datetime']]
        .max()
        .dropna()
        .reset_index(drop=True))
print (df)
datetime
0 2022-01-10 19:30:00
1 2022-01-11 19:30:00
2 2022-01-12 19:30:00
3 2022-01-14 18:30:00
An alternative solution:
df = df.set_index('datetime', drop=False)
df = (df.between_time('17:00','19:30')
        .sort_index()
        .assign(d = lambda x: x['datetime'].dt.date)
        .drop_duplicates('d', keep='last')
        .drop('d', axis=1)
        .reset_index(drop=True)
      )
print (df)
datetime
0 2022-01-10 19:30:00
1 2022-01-11 19:30:00
2 2022-01-12 19:30:00
3 2022-01-14 18:30:00
EDIT: solution that filters first by the datetime column, then by datetime2, and finally drops duplicates by the dates from the datetime column:
print (df)
datetime datetime2
0 2022-01-10 09:30:00 2022-01-10 17:40:00
1 2022-01-10 10:30:00 2022-01-10 19:30:00
2 2022-01-11 10:30:00 2022-01-11 19:30:00
3 2022-01-11 15:30:00 2022-01-11 15:30:00
4 2022-01-11 11:00:00 2022-01-12 15:30:00
5 2022-01-11 12:00:00 2022-01-14 18:30:00
6 2022-01-12 13:00:00 2022-01-14 16:30:00
7 2022-01-12 15:30:00 2022-01-14 17:30:00
df = (df.set_index('datetime', drop=False)
        .between_time('9:00','16:00')
        .sort_index()
        .set_index('datetime2', drop=False)
        .between_time('17:00','19:30')
        .assign(d = lambda x: x['datetime'].dt.date)
        .drop_duplicates('d', keep='last')
        .drop('d', axis=1)
        .reset_index(drop=True)
      )
print (df)
datetime datetime2
0 2022-01-10 10:30:00 2022-01-10 19:30:00
1 2022-01-11 12:00:00 2022-01-14 18:30:00
2 2022-01-12 15:30:00 2022-01-14 17:30:00
If you instead drop duplicates by the dates from datetime2, the output is different:
df = (df.set_index('datetime', drop=False)
        .between_time('9:00','16:00')
        .sort_index()
        .set_index('datetime2', drop=False)
        .between_time('17:00','19:30')
        .assign(d = lambda x: x['datetime2'].dt.date)
        .drop_duplicates('d', keep='last')
        .drop('d', axis=1)
        .reset_index(drop=True)
      )
print (df)
datetime datetime2
0 2022-01-10 10:30:00 2022-01-10 19:30:00
1 2022-01-11 10:30:00 2022-01-11 19:30:00
2 2022-01-12 15:30:00 2022-01-14 17:30:00

How to measure the time elapsed since the beginning of an event, and record it in a new dataframe column?

I'm trying to measure the time elapsed since the beginning of an event. In this case, I want to know if the volume of bitcoin traded per minute has exceeded a certain threshold. Because what moves the price is the volume. So I want to measure how long there has been significant volume, and record this measurement in a new column.
Here is an example of a dataframe that contains the date in index, the bitcoin price and the volume. I added a column that indicates when the volume has exceeded a certain threshold:
df = pd.DataFrame({
    'Time': ['2022-01-11 09:30:00', '2022-01-11 09:31:00', '2022-01-11 09:32:00',
             '2022-01-11 09:33:00', '2022-01-11 09:34:00', '2022-01-11 09:35:00'],
    'Volume': ['132', '109', '74', '57', '123', '21'],
    'Volume_cat': ["big_volume", "big_volume", None, None, "big_volume", None],
})
df['Time'] = pd.to_datetime(df['Time'])
df.set_index(['Time'], inplace=True)
df
My goal is to have a new column that will display the elapsed time (in seconds) since the last detection of the 'big_volume' event and will reset itself at each new detection.
Here is a line that can be added to the example code:
df['delta_big_vol'] = ['60', '120', '180', '240', '60', '120',]
df
I suppose I have to use the apply() method, but I have not found any lambda that would work.
In pseudo code it would look like :
from datetime import timedelta
df['delta_xl_vol'] = df.apply(if df["Volume"] > 100 : return(timedelta.total_seconds))
Thanks for your help.
For this process, we can't have null values in our "Volume_cat" column:
>>> df["Volume_cat"] = df["Volume_cat"].fillna("-") # This could be any string except "big_volume"
This step will help us in the future. We'll remember if our data starts with a "big_volume" and also store the index of the first "big_volume" row.
>>> idx_of_first_big_volume = df.loc[df["Volume_cat"] == "big_volume"].head(1).index[0]
>>> starts_with_big_volume = idx_of_first_big_volume == df.index[0]
Now, let's assign a group to each set of consecutive values in the "Volume_cat" column (consecutive "big_volume" are grouped, and consecutive "-" too).
>>> df["Group"] = ((df.Volume_cat != df.Volume_cat.shift()).cumsum())
Then, we'll rank each group. It's important to pair up consecutive groups (a "big_volume" group followed by a "-" group), so that the rank runs from the earliest "big_volume" event up until the last row before the next "big_volume" event (I hope this makes sense). Also, notice how starts_with_big_volume helps us align the groups properly: if we start with a "big_volume" group, we need to shift the values by subtracting 1:
>>> df["rank"] = df.groupby((df["Group"] - 1 * starts_with_big_volume)// 2)["Volume_cat"].rank("first", ascending=False)
Finally, we can use our "rank" column and multiply it by 60 to get the number of seconds since the last row with a "big_volume" observation. Since this adds several helper columns to the output, you may prefer to do it on a copy of your dataframe and then copy only the "delta_big_vol" column back into your original dataframe.
>>> df["delta_big_vol"] = 60 * (df["rank"] - 1)
Also, we can now use our idx_of_first_big_volume to match your requirement of filling all observations before the first "big_volume" event with None:
>>> df.loc[:idx_of_first_big_volume, "delta_big_vol"].iloc[:-1] = None
This should be the output you get:
>>> df
Volume Volume_cat Group rank delta_big_vol
Time
2022-01-11 09:30:00 132 big_volume 1 1.0 0.0
2022-01-11 09:31:00 109 big_volume 1 2.0 60.0
2022-01-11 09:32:00 74 - 2 3.0 120.0
2022-01-11 09:33:00 57 - 2 4.0 180.0
2022-01-11 09:34:00 123 big_volume 3 1.0 0.0
2022-01-11 09:35:00 21 - 4 2.0 60.0
Under the assumption that the Volume column contains numerical data (yours contains str data), you could do
threshold = 100
df['Result'] = (
    df.assign(Result=60).Result
      .groupby((df.Volume > threshold).cumsum()).cumsum()
)
with the result
Volume Volume_cat Result
Time
2022-01-11 09:30:00 132 big_volume 60
2022-01-11 09:31:00 109 big_volume 60
2022-01-11 09:32:00 74 None 120
2022-01-11 09:33:00 57 None 180
2022-01-11 09:34:00 123 big_volume 60
2022-01-11 09:35:00 21 None 120
Or, if you prefer to start at 0, you could do
df['Result'] = (
    df.assign(Result=(df.Volume <= threshold) * 60).Result
      .groupby((df.Volume > threshold).cumsum()).cumsum()
)
with the result
Volume Volume_cat Result
Time
2022-01-11 09:30:00 132 big_volume 0
2022-01-11 09:31:00 109 big_volume 0
2022-01-11 09:32:00 74 None 60
2022-01-11 09:33:00 57 None 120
2022-01-11 09:34:00 123 big_volume 0
2022-01-11 09:35:00 21 None 60
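For intuition, a small sketch (assuming df with a numeric Volume column and threshold = 100, as above): the cumulative sum of the boolean mask is what defines the groups, because every row above the threshold increments it and thereby starts a new group.
# Each threshold crossing starts a new group key; the following rows keep it.
groups = (df.Volume > threshold).cumsum()
print(groups)
# With the sample data above, the keys are 1, 2, 2, 2, 3, 3.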
EDIT re comment: I'm not completely sure I've understood correctly.
You could try:
threshold = 100
mask = df.Volume > threshold
idx_min = df.index[mask][0]
mask &= ~mask.shift().fillna(False)
df['Result'] = (~mask) * 60
df['Result'] = df.Result.groupby(mask.cumsum()).cumsum().loc[idx_min:]
The result for the modified sample frame
Volume
Time
2022-01-11 09:30:00 99
2022-01-11 09:31:00 109
2022-01-11 09:32:00 101
2022-01-11 09:33:00 57
2022-01-11 09:34:00 123
2022-01-11 09:35:00 21
is
Volume Result
Time
2022-01-11 09:30:00 99 NaN
2022-01-11 09:31:00 109 0.0
2022-01-11 09:32:00 101 60.0
2022-01-11 09:33:00 57 120.0
2022-01-11 09:34:00 123 0.0
2022-01-11 09:35:00 21 60.0

pd.merge_asof with multiple matches per time period?

I'm trying to merge two dataframes by time with multiple matches. I'm looking for all the instances of df2 whose timestamp falls 7 days or less before endofweek in df1. There may be more than one record that fits the case, and I want all of the matches, not just the first or last (which pd.merge_asof does).
import pandas as pd
df1 = pd.DataFrame({'endofweek': ['2019-08-31', '2019-08-31', '2019-09-07', '2019-09-07', '2019-09-14', '2019-09-14'], 'GroupCol': [1234,8679,1234,8679,1234,8679]})
df2 = pd.DataFrame({'timestamp': ['2019-08-30 10:00', '2019-08-30 10:30', '2019-09-07 12:00', '2019-09-08 14:00'], 'GroupVal': [1234, 1234, 8679, 1234], 'TextVal': ['1234_1', '1234_2', '8679_1', '1234_3']})
df1['endofweek'] = pd.to_datetime(df1['endofweek'])
df2['timestamp'] = pd.to_datetime(df2['timestamp'])
I've tried
pd.merge_asof(df1, df2, tolerance=pd.Timedelta('7d'), direction='backward', left_on='endofweek', right_on='timestamp', left_by='GroupCol', right_by='GroupVal')
but that gets me
endofweek GroupCol timestamp GroupVal TextVal
0 2019-08-31 1234 2019-08-30 10:30:00 1234.0 1234_2
1 2019-08-31 8679 NaT NaN NaN
2 2019-09-07 1234 NaT NaN NaN
3 2019-09-07 8679 NaT NaN NaN
4 2019-09-14 1234 2019-09-08 14:00:00 1234.0 1234_3
5 2019-09-14 8679 2019-09-07 12:00:00 8679.0 8679_1
I'm losing the text 1234_1. Is there way to do a sort of outer join for pd.merge_asof, where I can keep all of the instances of df2 and not just the first or last?
My ideal result would look like this (assuming that the endofweek times are treated like 00:00:00 on that date):
endofweek GroupCol timestamp GroupVal TextVal
0 2019-08-31 1234 2019-08-30 10:00:00 1234.0 1234_1
1 2019-08-31 1234 2019-08-30 10:30:00 1234.0 1234_2
2 2019-08-31 8679 NaT NaN NaN
3 2019-09-07 1234 NaT NaN NaN
4 2019-09-07 8679 NaT NaN NaN
5 2019-09-14 1234 2019-09-08 14:00:00 1234.0 1234_3
6 2019-09-14 8679 2019-09-07 12:00:00 8679.0 8679_1
pd.merge_asof only does a left join. After a lot of frustration trying to speed up the groupby/merge_ordered example, it's more intuitive and faster to do pd.merge_asof on both data sources in different directions, and then do an outer join to combine them.
left_merge = pd.merge_asof(df1, df2,
                           tolerance=pd.Timedelta('7d'), direction='backward',
                           left_on='endofweek', right_on='timestamp',
                           left_by='GroupCol', right_by='GroupVal')
right_merge = pd.merge_asof(df2, df1,
                            tolerance=pd.Timedelta('7d'), direction='forward',
                            left_on='timestamp', right_on='endofweek',
                            left_by='GroupVal', right_by='GroupCol')
merged = (left_merge.merge(right_merge, how="outer")
                    .sort_values(['endofweek', 'GroupCol', 'timestamp'])
                    .reset_index(drop=True))
merged
endofweek GroupCol timestamp GroupVal TextVal
0 2019-08-31 1234 2019-08-30 10:00:00 1234.0 1234_1
1 2019-08-31 1234 2019-08-30 10:30:00 1234.0 1234_2
2 2019-08-31 8679 NaT NaN NaN
3 2019-09-07 1234 NaT NaN NaN
4 2019-09-07 8679 NaT NaN NaN
5 2019-09-14 1234 2019-09-08 14:00:00 1234.0 1234_3
6 2019-09-14 8679 2019-09-07 12:00:00 8679.0 8679_1
In addition, it is much faster than my other answer:
import time
n=1000
start=time.time()
for i in range(n):
    left_merge = pd.merge_asof(df1, df2,
                               tolerance=pd.Timedelta('7d'), direction='backward',
                               left_on='endofweek', right_on='timestamp',
                               left_by='GroupCol', right_by='GroupVal')
    right_merge = pd.merge_asof(df2, df1,
                                tolerance=pd.Timedelta('7d'), direction='forward',
                                left_on='timestamp', right_on='endofweek',
                                left_by='GroupVal', right_by='GroupCol')
    merged = (left_merge.merge(right_merge, how="outer")
                        .sort_values(['endofweek', 'GroupCol', 'timestamp'])
                        .reset_index(drop=True))
end = time.time()
end-start
15.040804386138916
One way I tried is using groupby on one data frame, and then subsetting the other one in a pd.merge_ordered:
merged = (df1.groupby(['GroupCol', 'endofweek'])
             .apply(lambda x: pd.merge_ordered(
                 x,
                 df2[(df2['GroupVal'] == x.name[0])
                     & (abs(df2['timestamp'] - x.name[1]) <= pd.Timedelta('7d'))],
                 left_on='endofweek', right_on='timestamp')))
merged
endofweek GroupCol timestamp GroupVal TextVal
GroupCol endofweek
1234 2019-08-31 0 NaT NaN 2019-08-30 10:00:00 1234.0 1234_1
1 NaT NaN 2019-08-30 10:30:00 1234.0 1234_2
2 2019-08-31 1234.0 NaT NaN NaN
2019-09-07 0 2019-09-07 1234.0 NaT NaN NaN
2019-09-14 0 NaT NaN 2019-09-08 14:00:00 1234.0 1234_3
1 2019-09-14 1234.0 NaT NaN NaN
8679 2019-08-31 0 2019-08-31 8679.0 NaT NaN NaN
2019-09-07 0 2019-09-07 8679.0 NaT NaN NaN
2019-09-14 0 NaT NaN 2019-09-07 12:00:00 8679.0 8679_1
1 2019-09-14 8679.0 NaT NaN NaN
merged[['endofweek', 'GroupCol']] = (merged[['endofweek', 'GroupCol']]
                                     .fillna(method="bfill"))
merged.reset_index(drop=True, inplace=True)
merged
endofweek GroupCol timestamp GroupVal TextVal
0 2019-08-31 1234.0 2019-08-30 10:00:00 1234.0 1234_1
1 2019-08-31 1234.0 2019-08-30 10:30:00 1234.0 1234_2
2 2019-08-31 1234.0 NaT NaN NaN
3 2019-09-07 1234.0 NaT NaN NaN
4 2019-09-14 1234.0 2019-09-08 14:00:00 1234.0 1234_3
5 2019-09-14 1234.0 NaT NaN NaN
6 2019-08-31 8679.0 NaT NaN NaN
7 2019-09-07 8679.0 NaT NaN NaN
8 2019-09-14 8679.0 2019-09-07 12:00:00 8679.0 8679_1
9 2019-09-14 8679.0 NaT NaN NaN
However, this approach seems very slow to me:
import time
n=1000
start=time.time()
for i in range(n):
    merged = (df1.groupby(['GroupCol', 'endofweek'])
                 .apply(lambda x: pd.merge_ordered(
                     x,
                     df2[(df2['GroupVal'] == x.name[0])
                         & (abs(df2['timestamp'] - x.name[1]) <= pd.Timedelta('7d'))],
                     left_on='endofweek', right_on='timestamp')))
end = time.time()
end-start
40.72932052612305
I would greatly appreciate any improvements!
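For reference, a hedged alternative sketch (not one of the approaches above): since the groups are small, a plain merge on the group keys followed by a time-window filter also returns every match, at the cost of a temporary group-wise cross join that may not scale to very large inputs.
# Sketch, assuming df1 and df2 as defined in the question.
# 1) pair every df1 week with every df2 row of the same group,
# 2) keep rows whose timestamp falls within the 7 days before endofweek,
# 3) left-join back to df1 so weeks without a match keep a NaT/NaN row.
cand = df1.merge(df2, left_on='GroupCol', right_on='GroupVal', how='left')
in_window = (cand['endofweek'] - cand['timestamp']).between(
    pd.Timedelta(0), pd.Timedelta('7d'))   # boundary handling is an assumption
df_all = df1.merge(cand[in_window], on=['endofweek', 'GroupCol'], how='left')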

Group by column and resampled date and get rolling sum of other column

I have the following data:
(Pdb) df1 = pd.DataFrame({'id': ['SE0000195570','SE0000195570','SE0000195570','SE0000195570','SE0000191827','SE0000191827','SE0000191827','SE0000191827', 'SE0000191827'],'val': ['1','2','3','4','5','6','7','8', '9'],'date': pd.to_datetime(['2014-10-23','2014-07-16','2014-04-29','2014-01-31','2018-10-19','2018-07-11','2018-04-20','2018-02-16','2018-12-29'])})
(Pdb) df1
id val date
0 SE0000195570 1 2014-10-23
1 SE0000195570 2 2014-07-16
2 SE0000195570 3 2014-04-29
3 SE0000195570 4 2014-01-31
4 SE0000191827 5 2018-10-19
5 SE0000191827 6 2018-07-11
6 SE0000191827 7 2018-04-20
7 SE0000191827 8 2018-02-16
8 SE0000191827 9 2018-12-29
UPDATE:
As per the suggestions of @user3483203 I have gotten a bit further, but I'm not quite there yet. I've amended the example data above with a new row to illustrate better.
(Pdb) df2.assign(calc=(df2.dropna()['val'].groupby(level=0).rolling(4).sum().shift(-3).reset_index(0, drop=True)))
id val date calc
id date
SE0000191827 2018-02-28 SE0000191827 8 2018-02-16 26.0
2018-03-31 NaN NaN NaT NaN
2018-04-30 SE0000191827 7 2018-04-20 27.0
2018-05-31 NaN NaN NaT NaN
2018-06-30 NaN NaN NaT NaN
2018-07-31 SE0000191827 6 2018-07-11 NaN
2018-08-31 NaN NaN NaT NaN
2018-09-30 NaN NaN NaT NaN
2018-10-31 SE0000191827 5 2018-10-19 NaN
2018-11-30 NaN NaN NaT NaN
2018-12-31 SE0000191827 9 2018-12-29 NaN
SE0000195570 2014-01-31 SE0000195570 4 2014-01-31 10.0
2014-02-28 NaN NaN NaT NaN
2014-03-31 NaN NaN NaT NaN
2014-04-30 SE0000195570 3 2014-04-29 NaN
2014-05-31 NaN NaN NaT NaN
2014-06-30 NaN NaN NaT NaN
2014-07-31 SE0000195570 2 2014-07-16 NaN
2014-08-31 NaN NaN NaT NaN
2014-09-30 NaN NaN NaT NaN
2014-10-31 SE0000195570 1 2014-10-23 NaN
For my requirements, the row (SE0000191827, 2018-03-31) should have a calc value since it has four consecutive rows with a value. Currently the row is being removed with the dropna call and I can't figure out how to solve that problem.
What I need
Calculations: The dates in my initial data are quarterly. However, I need to transform this data into monthly rows ranging between the first and last date of each id and, for each month, calculate the sum of the four closest consecutive rows of the input data within that id. That's a mouthful. This led me to resample. See expected output below. I need the data to be grouped by both id and the monthly dates.
Performance: The data I'm testing on now is just for benchmarking but I will need the solution to be performant. I'm expecting to run this on upwards of 100k unique ids which may result in around 10 million rows. (100k ids, dates range back up to 10 years, 10years * 12months = 120 months per id, 100k*120 = 12million rows).
What I've tried
(Pdb) res = df.groupby('id').resample('M',on='date')
(Pdb) res.first()
id val date
id date
SE0000191827 2018-02-28 SE0000191827 8 2018-02-16
2018-03-31 NaN NaN NaT
2018-04-30 SE0000191827 7 2018-04-20
2018-05-31 NaN NaN NaT
2018-06-30 NaN NaN NaT
2018-07-31 SE0000191827 6 2018-07-11
2018-08-31 NaN NaN NaT
2018-09-30 NaN NaN NaT
2018-10-31 SE0000191827 5 2018-10-19
SE0000195570 2014-01-31 SE0000195570 4 2014-01-31
2014-02-28 NaN NaN NaT
2014-03-31 NaN NaN NaT
2014-04-30 SE0000195570 3 2014-04-29
2014-05-31 NaN NaN NaT
2014-06-30 NaN NaN NaT
2014-07-31 SE0000195570 2 2014-07-16
2014-08-31 NaN NaN NaT
2014-09-30 NaN NaN NaT
2014-10-31 SE0000195570 1 2014-10-23
This data looks very nice for my case since it's nicely grouped by id and has the dates nicely lined up by month. Here it seems like I could use something like df['val'].rolling(4) and make sure it skips NaN values and put that result in a new column.
Expected output (new column calc):
id val date calc
id date
SE0000191827 2018-02-28 SE0000191827 8 2018-02-16 26
2018-03-31 NaN NaN NaT
2018-04-30 SE0000191827 7 2018-04-20 NaN
2018-05-31 NaN NaN NaT
2018-06-30 NaN NaN NaT
2018-07-31 SE0000191827 6 2018-07-11 NaN
2018-08-31 NaN NaN NaT
2018-09-30 NaN NaN NaT
2018-10-31 SE0000191827 5 2018-10-19 NaN
SE0000195570 2014-01-31 SE0000195570 4 2014-01-31 10
2014-02-28 NaN NaN NaT
2014-03-31 NaN NaN NaT
2014-04-30 SE0000195570 3 2014-04-29 NaN
2014-05-31 NaN NaN NaT
2014-06-30 NaN NaN NaT
2014-07-31 SE0000195570 2 2014-07-16 NaN
2014-08-31 NaN NaN NaT
2014-09-30 NaN NaN NaT
2014-10-31 SE0000195570 1 2014-10-23 NaN
2014-11-30 NaN NaN NaT
2014-12-31 SE0000195570 1 2014-10-23 NaN
Here the result in calc is 26 for the first date, since it sums that row's value and the next three quarterly values (8+7+6+5). The rest for that id is NaN since four values are not available.
The problems
While it may look like the data is grouped by id and date, it seems like it's actually grouped by date. I'm not sure how this works. I need the data to be grouped by id and date.
(Pdb) res['val'].get_group(datetime.date(2018,2,28))
7 6.730000e+08
Name: val, dtype: object
The result of the resample above returns a DatetimeIndexResamplerGroupby which doesn't have rolling...
(Pdb) res['val'].rolling(4)
*** AttributeError: 'DatetimeIndexResamplerGroupby' object has no attribute 'rolling'
What to do? My guess is that my approach is wrong but after scouring the documentation I'm not sure where to start.
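One possible direction, sketched under the assumption that what's wanted is the forward-looking sum of four consecutive quarterly values per id: compute the rolling sum on the raw rows (sorted by date within each id) before resampling, so the monthly NaT rows never get in the way, and then lay the result onto the monthly grid.
# A sketch, not a tested answer; assumes df1 as defined at the top.
df1['val'] = df1['val'].astype(int)          # 'val' is stored as strings
df1 = df1.sort_values(['id', 'date'])
# Reverse each id's series so rolling(4) looks forward in time, then reverse
# back so the result lines up with the original rows.
df1['calc'] = (df1.groupby('id')['val']
                  .transform(lambda s: s[::-1].rolling(4).sum()[::-1]))
# Resample to monthly rows per id as in the question; empty months become NaN.
monthly = df1.groupby('id').resample('M', on='date').first()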
