Python Pandas rolling sum operation for subset of dataframe - python

This is best explained through an example.
I have the following dataframe (each row can be thought of as a transaction):
DATE AMOUNT
2017-01-29 10
2017-01-30 20
2017-01-31 30
2017-02-01 40
2017-02-02 50
2017-02-03 60
I would like to compute a 2-day rolling sum but only for rows in February.
Code snippet I have currently:
df.set_index('DATE',inplace=True)
res=df.rolling('2d')['AMOUNT'].sum()
which gives:
AMOUNT
2017-01-29 10
2017-01-30 30
2017-01-31 50
2017-02-01 70
2017-02-02 90
2017-02-03 110
but I really only need the output for the last 3 rows; the work on the first 3 rows is unnecessary. When the dataframe is huge, that wasted computation becomes significant. How do I compute the rolling sum only for the last 3 rows (other than computing the rolling sum for all rows and then filtering afterwards)?
*I cannot simply pre-filter the dataframe either, because then the January 'lookback' rows needed to obtain the correct rolling sum at the start of February would be missing.

You can use timedelta to filter your df and keep the last day of January.
import datetime
dateStart = datetime.date(2017, 2, 1) - datetime.timedelta(days=1)
dateEnd = datetime.date(2017, 2, 3)
df.loc[dateStart:dateEnd]
Then you can do your rolling operation and drop the first line (which is 2017-01-31)
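A minimal sketch of that approach, assuming df is already indexed by DATE as in the question and that one day of lookback is enough for the 2-day window:
import datetime

# keep one day of January as the lookback buffer for the 2-day window
dateStart = datetime.date(2017, 2, 1) - datetime.timedelta(days=1)
dateEnd = datetime.date(2017, 2, 3)
subset = df.loc[str(dateStart):str(dateEnd)]

# rolling sum over the small slice, then drop the buffer row (2017-01-31)
res = subset.rolling('2d')['AMOUNT'].sum().iloc[1:]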

You can compute the rolling sum only for the last rows by using tail(4):
res = df.tail(4).rolling('2d')['AMOUNT'].sum()
Output:
DATE
2017-01-31 NaN
2017-02-01 70.0
2017-02-02 90.0
2017-02-03 110.0
Name: AMOUNT, dtype: float64
If you want to merge those values back, excluding 2017-01-31, you can do the following:
df.loc[res.index[1:], 'AMOUNT'] = res.tail(3)
Output:
AMOUNT
DATE
2017-01-29 10.0
2017-01-30 20.0
2017-01-31 30.0
2017-02-01 70.0
2017-02-02 90.0
2017-02-03 110.0

Related

Get average monthly value by dividing by its monthly row count

I have the following dataframe:
created_time shares_count
2021-07-01 250.0
2021-07-31 501.0
2021-08-02 48.0
2021-08-05 300.0
2021-08-07 200.0
2021-09-06 28.0
2021-09-08 100.0
2021-09-25 100.0
2021-09-30 200.0
I did the monthly grouping like this:
df_groupby_monthly = df.groupby(pd.Grouper(key='created_time',freq='M')).sum()
df_groupby_monthly
Now, how do I get the average shares_count by dividing each monthly sum by that month's row count?
For example: the 7th month has 2 rows, so the average should be 751.0/2 = 375.5; the 8th month has 3 rows, so 548.0/3 = 182.666; and the 9th month has 4 rows, so 428.0/4 = 107.0.
How do I get a final output like this?
created_time shares_count
2021-07-31 375.5
2021-08-31 182.666
2021-09-30 107.00
I have tried the following:
df.groupby(pd.Grouper(key='created_time',freq='M')).apply(lambda x: x['shares_count'].sum()/len(x))
This works when there is only one column, but it is hard to extend to multiple columns.
df['created_time'] = pd.to_datetime(df['created_time'])
output = df.groupby(df['created_time'].dt.to_period('M')).mean().round(2).reset_index()
output
Output:
created_time shares_count
0 2021-07 375.50
1 2021-08 182.67
2 2021-09 107.00
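If you need the month-end dates shown in the question's expected output rather than period labels, the period column can be converted back (a sketch, working on the output frame above):
# convert the monthly period back to a month-end timestamp (e.g. 2021-07 -> 2021-07-31)
output['created_time'] = output['created_time'].dt.to_timestamp(how='end').dt.normalize()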
Use this code:
df=df.groupby(pd.Grouper(key='created_time',freq='M')).agg({'shares_count':['sum', 'count']}).reset_index()
df['ss']=df[('shares_count','sum')]/df[('shares_count','count')]
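Putting that together on the question's data (a sketch; the flattened column names are only for readability):
import pandas as pd

df = pd.DataFrame({
    'created_time': pd.to_datetime([
        '2021-07-01', '2021-07-31', '2021-08-02', '2021-08-05', '2021-08-07',
        '2021-09-06', '2021-09-08', '2021-09-25', '2021-09-30']),
    'shares_count': [250.0, 501.0, 48.0, 300.0, 200.0, 28.0, 100.0, 100.0, 200.0],
})

monthly = (df.groupby(pd.Grouper(key='created_time', freq='M'))
             .agg({'shares_count': ['sum', 'count']})
             .reset_index())
monthly.columns = ['created_time', 'sum', 'count']   # flatten the MultiIndex columns
monthly['shares_count'] = monthly['sum'] / monthly['count']
print(monthly[['created_time', 'shares_count']])
# expected: 375.5 (July), ~182.67 (August), 107.0 (September)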

How to group by column and a fixed time window/frequency

EDIT: My main goal is not to use a for loop and find a way of grouping the data efficiently/fast.
I am trying to solve a problem, which is about grouping together different rows of data based on an ID and a time window of 30 Days.
I have the following example data:
ID Time
12345 2021-01-01 14:00:00
12345 2021-01-15 14:00:00
12345 2021-01-29 14:00:00
12345 2021-02-15 14:00:00
12345 2021-02-16 14:00:00
12345 2021-03-15 14:00:00
12345 2021-04-24 14:00:00
12344 2021-01-24 14:00:00
12344 2021-01-25 14:00:00
12344 2021-04-24 14:00:00
And I would like to have the following data:
ID Time Group
12345 2021-01-01 14:00:00 1
12345 2021-01-15 14:00:00 1
12345 2021-01-29 14:00:00 1
12345 2021-02-15 14:00:00 2
12345 2021-02-16 14:00:00 2
12345 2021-03-15 14:00:00 3
12345 2021-04-24 14:00:00 4
12344 2021-01-24 14:00:00 5
12344 2021-01-25 14:00:00 5
12344 2021-04-24 14:00:00 6
(4 can also be 1 as it is in a new group based on the ID 12344; 5 can also be 2)
I can then differentiate based on the ID column, so the Group does not need to be unique (but it can be).
The most important thing is to separate the rows by ID and then, within each ID, assign a group to each 30-day time window. By a 30-day time window I mean, for example, that the first time frame for ID 12345 starts at 2021-01-01 and goes up to 2021-01-31 (this should be group 1), and the second time frame for ID 12345 starts at 2021-02-01 and goes to 2021-03-02 (30 days).
The problem I have faced with using the following code is that it uses the first date it finds in the dataframe:
grouped_data = df.groupby(["ID",pd.Grouper(key = "Time", freq = "30D")]).count()
In the above code I have just tried to count the rows (which wouldn't give me the Group, but I have tried to group it with my logic).
I hope someone can help me with this, because I have tried so many different things and nothing did work. I have already used the following (but maybe wrong):
pd.rolling()
pd.Grouper()
for loop
etc.
I really don't want to use a for loop as I have 1.5 million rows.
I have tried to vectorize the for loop, but I am not really familiar with vectorization and struggled to translate it.
Please let me know if I can use pd.Grouper differently so I get the results. Thanks in advance.
For arbitrary windows you can use pandas.cut
e.g., for 30-day bins starting at 2021-01-01 00:00:00 and covering all of 2021 you can use:
bins = pd.date_range("2021", "2022", freq="30D")
group = pd.cut(df["Time"], bins)
group will label each row with an interval, which you can then group on. If you want the groups to have integer labels 0, 1, 2, etc. instead, you can map the intervals with:
group.map(dict(zip(group.unique(), range(group.nunique()))))
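Putting that together on the question's data (a sketch; it assumes df["Time"] is already datetime64 and pairs the label with ID so windows from different IDs stay separate):
import pandas as pd

# 30-day bins covering all of 2021, as above
bins = pd.date_range("2021", "2022", freq="30D")
intervals = pd.cut(df["Time"], bins)

# re-number the intervals in order of appearance to get integer labels
df["Group"] = intervals.map(dict(zip(intervals.unique(), range(intervals.nunique()))))

# group per ID and per 30-day bin, e.g. to count rows
counts = df.groupby(["ID", "Group"]).size()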
EDIT: an approach where the windows are 30-day intervals, disjoint, and start at a time in the Time column:
times = df["Time"].sort_values()
ii = pd.IntervalIndex.from_arrays(times, times+pd.Timedelta("30 days"))
disjoint_intervals = []
prev_interval = None
for i, interval in enumerate(ii):
    if prev_interval is None or interval.left >= prev_interval.right:  # no overlap
        prev_interval = interval
        disjoint_intervals.append(i)
bins = ii[disjoint_intervals]
group = pd.cut(df["Time"], bins)
Apologies, this is not a vectorised approach. Struggling to think if one could exist.
SOLUTION:
The solution which worked for me is the following:
I have imported the sampleData from excel into a dataframe. The data looks like this:
ID Time
12345 2021-01-01 14:00:00
12345 2021-01-15 14:00:00
12345 2021-01-29 14:00:00
12345 2021-02-15 14:00:00
12345 2021-02-16 14:00:00
12345 2021-03-15 14:00:00
12345 2021-04-24 14:00:00
12344 2021-01-24 14:00:00
12344 2021-01-25 14:00:00
12344 2021-04-24 14:00:00
Then I have used the following steps:
Import the data:
df_test = pd.read_excel(r"sampleData.xlsx")
Order the dataframe so we have the correct order of ID and Time:
df_test_ordered = df_test.sort_values(["ID","Time"])
df_test_ordered = df_test_ordered.reset_index(drop=True)
I also reset the index and dropped the old one, as it interfered with my calculations later on.
Create a column with the time difference to the previous row:
df_test_ordered.loc[df_test_ordered["ID"] == df_test_ordered["ID"].shift(1),"time_diff"] = df_test_ordered["Time"] - df_test_ordered["Time"].shift(1)
Transform timedelta64[ns] to timedelta64[D]:
df_test_ordered["time_diff"] = df_test_ordered["time_diff"].astype("timedelta64[D]")
Calculate the cumsum per ID:
df_test_ordered["cumsum"] = df_test_ordered.groupby("ID")["time_diff"].transform(pd.Series.cumsum)
Fill the NaN values (forward fill, then backfill so a leading NaN takes the next value):
df_final = df_test_ordered.ffill().bfill()
Create the window by dividing by 30 (30 days time period):
df_final["Window"] = df_final["cumsum"] / 30
df_final["Window_int"] = df_final["Window"].astype(int)
The "Window_int" column is now a kind of ID (not unique; but unique within the groups of column "ID").
Furthermore, I needed to backfill the dataframe because the time difference is only calculated when the previous row has the same ID; otherwise the time difference is NaN. Backfilling simply replaces such a NaN with the next time difference, which makes no difference mathematically and assigns the correct value.
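For reference, the steps above combined into one snippet (a sketch using the same column names; it assumes Time is already parsed as datetime, groupby().diff() replaces the shift-and-compare step, and dt.total_seconds() is used for the day count to avoid the version-dependent timedelta64[D] cast):
import pandas as pd

df = df_test.sort_values(["ID", "Time"]).reset_index(drop=True)

# days elapsed since the previous row of the same ID (NaN on each ID's first row)
df["time_diff"] = df.groupby("ID")["Time"].diff().dt.total_seconds() / 86400

# cumulative days per ID, then backfill so each ID's first row takes the next value
df["cumsum"] = df.groupby("ID")["time_diff"].cumsum()
df[["time_diff", "cumsum"]] = df[["time_diff", "cumsum"]].bfill()

# 30-day buckets within each ID
df["Window_int"] = (df["cumsum"] / 30).astype(int)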
Solution dataframe:
ID Time time_diff cumsum Window Window_int
0 12344 2021-01-24 14:00:00 1.0 1.0 0.032258 0
1 12344 2021-01-25 14:00:00 1.0 1.0 0.032258 0
2 12344 2021-04-24 14:00:00 89.0 90.0 2.903226 2
3 12345 2021-01-01 14:00:00 14.0 14.0 0.451613 0
4 12345 2021-01-15 14:00:00 14.0 14.0 0.451613 0
5 12345 2021-01-29 14:00:00 14.0 28.0 0.903226 0
6 12345 2021-02-15 14:00:00 17.0 45.0 1.451613 1
7 12345 2021-02-16 14:00:00 1.0 46.0 1.483871 1
8 12345 2021-03-15 14:00:00 27.0 73.0 2.354839 2
9 12345 2021-04-24 14:00:00 40.0 113.0 3.645161 3

Resampled Pandas Dataframes Datetime Alignment

I have 3 resampled pandas dataframes using the same data indexed by datetime.
Each dataframe is resampled using a different timeframe (e.g 30min / 60 min / 240 min).
2 of the dataframes have resampled correctly, with the datetimes aligned, because they have an equal number of rows (20), but the 3rd dataframe only has 12 rows because there isn't enough data to create 20 rows resampled to 240 mins.
How can I adjust the 240min dataframe so the datetimes are aligned with the other 2 dataframes?
For example, every 2nd row in the 30min dataframe equals the corresponding row in the 60min dataframe, and every 4th row in the 60min dataframe should equal the corresponding row in the 240min dataframe, but this is not the case because the 240min dataframe has resampled the datetimes differently due to there not being enough data to create 20 rows.
If you're just trying to align the different datasets to one index you can use pd.concat.
import pandas as pd
periods = int(12.5 * 240)
index = pd.date_range(start='1/1/2018', periods=periods, freq="min")
data = pd.DataFrame(list(range(periods)), index=index)
df1 = data.resample('30min').asfreq()
df2 = data.resample('60min').asfreq()
df3 = data.resample('240min').asfreq()
df4 = pd.concat([df1, df2, df3], axis=1)
print(df4)
Output:
2018-01-01 00:00:00 0 0.0 0.0
2018-01-01 00:30:00 30 NaN NaN
2018-01-01 01:00:00 60 60.0 NaN
2018-01-01 01:30:00 90 NaN NaN
2018-01-01 02:00:00 120 120.0 NaN
... ... ... ...
2018-01-02 23:30:00 2850 NaN NaN
2018-01-03 00:00:00 2880 2880.0 2880.0
2018-01-03 00:30:00 2910 NaN NaN
2018-01-03 01:00:00 2940 2940.0 NaN
2018-01-03 01:30:00 2970 NaN NaN
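As a follow-up, if each resampled frame should keep a distinguishable column name and the coarser values should be carried down to every finer row, something like this works on the example above (a sketch; the _30/_60/_240 suffixes are only illustrative):
# label the columns per timeframe, align on one index, then forward-fill
df4 = pd.concat(
    [df1.add_suffix('_30'), df2.add_suffix('_60'), df3.add_suffix('_240')],
    axis=1)
df4_filled = df4.ffill()   # every 30-minute row now carries the latest 60/240-minute value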

Convert timestamps from temporal series to the same index

I have a data frame containing a timestamp every 5 minutes with a value for each ID. Now, I need to perform some analysis and I would like to plot all the time series on the same temporal time window.
My data frame is similar to this one:
ID timestamp value
12345 2017-02-09 14:35:00 60.0
12345 2017-02-09 14:40:00 62.0
12345 2017-02-09 14:45:00 58.0
12345 2017-02-09 14:50:00 60.0
54321 2017-03-09 13:35:00 50.0
54321 2017-03-09 13:40:00 58.0
54321 2017-03-09 13:45:00 59.0
54321 2017-03-09 13:50:00 61.0
For instance, on the x axis I need x=0 to be the first timestamp for each ID, x=1 the next one 5 minutes later, and so on.
Until now, I correctly resampled every 5 minutes with this code:
df = df.set_index('Date').resample('5T').mean().reset_index()
But, given that every ID starts at a different timestamp, I don't know how to modify the timestamps so that the first measured date of each ID becomes timestamp 0 and each subsequent 5-minute timestamp becomes timestamp 1, timestamp 2, timestamp 3, etc., in order to plot the series of each ID and compare them graphically. A sample final df might be:
ID timestamp value
12345 0 60.0
12345 1 62.0
12345 2 58.0
12345 3 60.0
54321 0 50.0
54321 1 58.0
54321 2 59.0
54321 3 61.0
Using this data frame, is it possible to plot all the series starting and finishing at the same point? Start at 0 and finish after 3 days.
How do I create such timestamps and plot the series for each ID on the same figure?
Thank you very much.
First create a new column with the timestamp number, counting in 5-minute intervals.
df['ts_number'] = df.groupby(['ID']).timestamp.apply(lambda x: (x - x.min())/pd.Timedelta(minutes=5))
If you know in advance that all your timestamps are in 5-minute intervals and they are sorted, then you can also use
df['ts_number'] = df.groupby(['ID']).cumcount()
Then plot the pivoted data:
df.pivot(index='ts_number', columns='ID', values='value').plot()
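End to end on data shaped like the question's, this might look as follows (a sketch; transform is used instead of apply so the result aligns with the original rows, and pivot is called with keyword arguments, which newer pandas requires):
import pandas as pd

df = pd.DataFrame({
    'ID': [12345] * 4 + [54321] * 4,
    'timestamp': pd.to_datetime([
        '2017-02-09 14:35', '2017-02-09 14:40', '2017-02-09 14:45', '2017-02-09 14:50',
        '2017-03-09 13:35', '2017-03-09 13:40', '2017-03-09 13:45', '2017-03-09 13:50']),
    'value': [60.0, 62.0, 58.0, 60.0, 50.0, 58.0, 59.0, 61.0],
})

# number the timestamps per ID in 5-minute steps, starting at 0
df['ts_number'] = df.groupby('ID')['timestamp'].transform(
    lambda x: (x - x.min()) / pd.Timedelta(minutes=5))

# one line per ID, all starting at x=0
df.pivot(index='ts_number', columns='ID', values='value').plot()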

best way to fill up gaps by yearly dates in Python dataframe

All, I'm a newbie to Python and am stuck on the problem below. I have a DF as:
ipdb> DF
asofdate port_id
1 2010-01-01 76
2 2010-04-01 43
3 2011-02-01 76
4 2013-01-02 93
5 2017-02-01 43
For the yearly gaps, namely 2012, 2014, 2015, and 2016, I'd like to fill each gap with the New Year date of the missing year and the port_id from the previous row. Ideally, I'd like:
ipdb> DF
asofdate port_id
1 2010-01-01 76
2 2010-04-01 43
3 2011-02-01 76
4 2012-01-01 76
5 2013-01-02 93
6 2014-01-01 93
7 2015-01-01 93
8 2016-01-01 93
9 2017-02-01 43
I tried multiple approaches, but to no avail. Could some expert shed some light on how to make it work? Thanks much in advance!
You can use set.difference with range to find missing dates and then append a dataframe:
# convert to datetime if not already converted
df['asofdate'] = pd.to_datetime(df['asofdate'])
# calculate missing years
years = df['asofdate'].dt.year
missing = set(range(years.min(), years.max())) - set(years)
# append dataframe, sort and front-fill
df = df.append(pd.DataFrame({'asofdate': pd.to_datetime(list(missing), format='%Y')}))\
.sort_values('asofdate')\
.ffill()
print(df)
asofdate port_id
1 2010-01-01 76.0
2 2010-04-01 43.0
3 2011-02-01 76.0
1 2012-01-01 76.0
4 2013-01-02 93.0
2 2014-01-01 93.0
3 2015-01-01 93.0
0 2016-01-01 93.0
5 2017-02-01 43.0
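Note that DataFrame.append was removed in pandas 2.0; with a recent pandas version the same idea can be written with pd.concat (a sketch):
# same approach, using pd.concat instead of the removed DataFrame.append
missing_df = pd.DataFrame({'asofdate': pd.to_datetime([f'{y}-01-01' for y in missing])})
df = pd.concat([df, missing_df]).sort_values('asofdate').ffill()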
I would create a helper dataframe, containing all the year start dates, then filter out the ones where the years match what is in df, and finally merge them together:
# First make sure it is proper datetime
df['asofdate'] = pd.to_datetime(df.asofdate)
# Create your temporary dataframe of year start dates
helper = pd.DataFrame({'asofdate':pd.date_range(df.asofdate.min(), df.asofdate.max(), freq='YS')})
# Filter out the rows where the year is already in df
helper = helper[~helper.asofdate.dt.year.isin(df.asofdate.dt.year)]
# Merge back in to df, sort, and forward fill
new_df = df.merge(helper, how='outer').sort_values('asofdate').ffill()
>>> new_df
asofdate port_id
0 2010-01-01 76.0
1 2010-04-01 43.0
2 2011-02-01 76.0
5 2012-01-01 76.0
3 2013-01-02 93.0
6 2014-01-01 93.0
7 2015-01-01 93.0
8 2016-01-01 93.0
4 2017-02-01 43.0
