EDIT: My main goal is to avoid a for loop and to find a fast, efficient way of grouping the data.
I am trying to solve a problem that involves grouping rows of data based on an ID and a 30-day time window.
I have the following example data:
ID     Time
12345  2021-01-01 14:00:00
12345  2021-01-15 14:00:00
12345  2021-01-29 14:00:00
12345  2021-02-15 14:00:00
12345  2021-02-16 14:00:00
12345  2021-03-15 14:00:00
12345  2021-04-24 14:00:00
12344  2021-01-24 14:00:00
12344  2021-01-25 14:00:00
12344  2021-04-24 14:00:00
And I would like to have the following data:
ID     Time                 Group
12345  2021-01-01 14:00:00  1
12345  2021-01-15 14:00:00  1
12345  2021-01-29 14:00:00  1
12345  2021-02-15 14:00:00  2
12345  2021-02-16 14:00:00  2
12345  2021-03-15 14:00:00  3
12345  2021-04-24 14:00:00  4
12344  2021-01-24 14:00:00  5
12344  2021-01-25 14:00:00  5
12344  2021-04-24 14:00:00  6
(Group 5 could also be 1, since it starts a new group for the new ID 12344; likewise 6 could be 2.)
I could then differentiate based on the ID column, so the Group values do not need to be globally unique, but they can be.
The most important thing is to separate the data by ID and then, for each ID, go through its rows and assign a group to each 30-day time window. By a 30-day time window I mean that, e.g., the first time frame for ID 12345 starts at 2021-01-01 and goes up to 2021-01-31 (this should be group 1), and the second time frame for ID 12345 starts at 2021-02-01 and goes to 2021-03-02 (30 days).
The problem I have faced with using the following code is that it uses the first date it finds in the dataframe:
grouped_data = df.groupby(["ID",pd.Grouper(key = "Time", freq = "30D")]).count()
In the above code I just tried to count the rows (which wouldn't give me the Group column, but I was trying to apply my grouping logic).
I hope someone can help me with this, because I have tried so many different things and nothing has worked. I have already tried the following (but maybe incorrectly):
pd.rolling()
pd.Grouper()
for loop
etc.
I really don't want to use a for loop, as I have 1.5 million rows.
I have also tried to vectorize the for loop, but I am not really familiar with vectorization and struggled to convert it.
Please let me know if I can use pd.Grouper differently to get these results. Thanks in advance.
For arbitrary windows you can use pandas.cut.
E.g., for 30-day bins starting at 2021-01-01 00:00:00 and covering the entirety of 2021 you can use:
bins = pd.date_range("2021", "2022", freq="30D")
group = pd.cut(df["Time"], bins)
group will label each row with an interval, which you can then group on, etc. If you want the groups to have labels 0, 1, 2, ... you can map the values with:
dict(zip(group.unique(), range(group.nunique())))
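For example, a minimal end-to-end sketch (assuming a dataframe df with the ID and Time columns from the question; the Group column name is just illustrative):
import pandas as pd

df = pd.DataFrame({
    "ID": [12345, 12345, 12345, 12344],
    "Time": pd.to_datetime(["2021-01-01 14:00", "2021-01-15 14:00",
                            "2021-02-15 14:00", "2021-01-24 14:00"]),
})

bins = pd.date_range("2021", "2022", freq="30D")
intervals = pd.cut(df["Time"], bins)

# map each observed interval to a small integer label
labels = dict(zip(intervals.unique(), range(intervals.nunique())))
df["Group"] = intervals.map(labels)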
EDIT: an approach where the windows are disjoint 30-day intervals, each starting at a time in the Time column:
times = df["Time"].sort_values()
ii = pd.IntervalIndex.from_arrays(times, times+pd.Timedelta("30 days"))
disjoint_intervals = []
prev_interval = None
for i, interval in enumerate(ii):
    if prev_interval is None or interval.left >= prev_interval.right:  # no overlap
        prev_interval = interval
        disjoint_intervals.append(i)
bins = ii[disjoint_intervals]
group = pd.cut(df["Time"], bins)
Apologies, this is not a vectorised approach. Struggling to think if one could exist.
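As a small usage sketch on top of that (assuming the group series built above), the interval labels can be combined with the ID directly in a groupby, for example:
# count rows per (ID, 30-day window); times that fall outside the bins are dropped
counts = df.groupby(["ID", group]).size()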
SOLUTION:
The solution which worked for me is the following:
I have imported the sampleData from excel into a dataframe. The data looks like this:
ID     Time
12345  2021-01-01 14:00:00
12345  2021-01-15 14:00:00
12345  2021-01-29 14:00:00
12345  2021-02-15 14:00:00
12345  2021-02-16 14:00:00
12345  2021-03-15 14:00:00
12345  2021-04-24 14:00:00
12344  2021-01-24 14:00:00
12344  2021-01-25 14:00:00
12344  2021-04-24 14:00:00
Then I have used the following steps:
Import the data:
df_test = pd.read_excel(r"sampleData.xlsx")
Order the dataframe so we have the correct order of ID and Time:
df_test_ordered = df_test.sort_values(["ID","Time"])
df_test_ordered = df_test_ordered.reset_index(drop=True)
I have also reset the index and dropped the old one, as it interfered with my calculations later on.
Create a column with the time difference from the previous row:
df_test_ordered.loc[df_test_ordered["ID"] == df_test_ordered["ID"].shift(1),"time_diff"] = df_test_ordered["Time"] - df_test_ordered["Time"].shift(1)
Transform timedelta64[ns] to timedelta64[D]:
df_test_ordered["time_diff"] = df_test_ordered["time_diff"].astype("timedelta64[D]")
Calculate the cumsum per ID:
df_test_ordered["cumsum"] = df_test_ordered.groupby("ID")["time_diff"].transform(pd.Series.cumsum)
Backfill the dataframe (replace the NaN values with the next value):
df_final = df_test_ordered.ffill().bfill()
Create the window by dividing by 30 (30 days time period):
df_final["Window"] = df_final["cumsum"] / 30
df_final["Window_int"] = df_final["Window"].astype(int)
The "Window_int" column is now a kind of ID (not unique; but unique within the groups of column "ID").
Furthermore, I needed to backfill the dataframe as there were NaN values due to the calculation of time difference only if the previous ID equals the ID. If not then NaN is set as time difference. Backfilling will just set the NaN value to the next time difference which makes no difference mathematically and assign the correct value.
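For reference, here are the steps above combined into one sketch (the same logic as described, nothing new; note that on newer pandas versions astype("timedelta64[D]") may need to be replaced by .dt.days):
df_test_ordered = df_test.sort_values(["ID", "Time"]).reset_index(drop=True)

# time difference to the previous row, only where the previous row has the same ID
df_test_ordered.loc[df_test_ordered["ID"] == df_test_ordered["ID"].shift(1), "time_diff"] = (
    df_test_ordered["Time"] - df_test_ordered["Time"].shift(1)
)

# express the difference in days and accumulate it per ID
df_test_ordered["time_diff"] = df_test_ordered["time_diff"].astype("timedelta64[D]")
df_test_ordered["cumsum"] = df_test_ordered.groupby("ID")["time_diff"].transform(pd.Series.cumsum)

# fill the NaN at the start of each ID, then bin into 30-day windows
df_final = df_test_ordered.ffill().bfill()
df_final["Window_int"] = (df_final["cumsum"] / 30).astype(int)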
Solution dataframe:
ID Time time_diff cumsum Window Window_int
0 12344 2021-01-24 14:00:00 1.0 1.0 0.032258 0
1 12344 2021-01-25 14:00:00 1.0 1.0 0.032258 0
2 12344 2021-04-24 14:00:00 89.0 90.0 2.903226 2
3 12345 2021-01-01 14:00:00 14.0 14.0 0.451613 0
4 12345 2021-01-15 14:00:00 14.0 14.0 0.451613 0
5 12345 2021-01-29 14:00:00 14.0 28.0 0.903226 0
6 12345 2021-02-15 14:00:00 17.0 45.0 1.451613 1
7 12345 2021-02-16 14:00:00 1.0 46.0 1.483871 1
8 12345 2021-03-15 14:00:00 27.0 73.0 2.354839 2
9 12345 2021-04-24 14:00:00 40.0 113.0 3.645161 3
Related
In my dataframe I have a timestamp column formatted like 2021-11-18 00:58:22.705.
I wish to create a column that displays the time elapsed from the initial timestamp (the first one) to each row.
There are two ways I can think of doing this, but I don't seem to know how to make either happen.
Method 1:
Subtract the timestamp in the row above from each timestamp:
df["difference"]= df["timestamp"].diff()
Now that this time difference has been calculated, I would like to create another column that sums up the time differences, keeping the running total of the deltas above (i.e. the elapsed time from the start of the process).
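Something like the following is what I imagine (a sketch with a hypothetical elapsed column), but I am not sure it is the right way:
# running total of the per-row differences = elapsed time from the first row
df["elapsed"] = df["difference"].cumsum()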
Method 2:
I guess another way would be to calculate the difference between each row's timestamp and the initial timestamp (the first one), but I do not know how I would do that.
Thanks in advance.
I have not completely understood which type of difference is needed, so I am adding both of the ones I think are reasonable:
import pandas as pd
times = pd.date_range('2022-05-23', periods=20, freq='0D30min')
df = pd.DataFrame({'Timestamp': times})
df['difference_in_min'] = (df.Timestamp - df.Timestamp.min()).astype('timedelta64[m]')
df['cumulative_dif_in_min'] = df.difference_in_min.cumsum()
print(df)
Timestamp difference_in_min cumulative_dif_in_min
0 2022-05-23 00:00:00 0.0 0.0
1 2022-05-23 00:30:00 30.0 30.0
2 2022-05-23 01:00:00 60.0 90.0
3 2022-05-23 01:30:00 90.0 180.0
4 2022-05-23 02:00:00 120.0 300.0
5 2022-05-23 02:30:00 150.0 450.0
6 2022-05-23 03:00:00 180.0 630.0
7 2022-05-23 03:30:00 210.0 840.0
8 2022-05-23 04:00:00 240.0 1080.0
Suppose we have two dataframes, one with a timestamp and the other with start and end timestamps. df1 and df2 as:
df1:
DateTime             Value1
2020-01-11 12:30:00  1
2020-01-11 13:00:00  2
2020-02-11 13:30:00  3
2020-02-11 14:00:00  4
2020-02-11 14:30:00  5
2020-02-11 15:00:00  6
2020-02-11 15:30:00  7
2020-02-11 16:00:00  8

df2:
StartDateTime        EnddDateTime         Value2
2020-01-11 12:23:12  2020-01-11 13:10:00  a
2020-01-11 14:12:20  2020-01-11 14:20:34  b
2020-01-11 15:20:00  2020-01-11 15:28:10  c
2020-01-11 15:45:20  2020-01-11 16:26:23  d
Each timestamp in df1 represents a half-hour period starting at the time in the DateTime column. I want to match the df2 start and end times against these 30-minute periods. A value from df2 may fall into two rows of df1 if its period (the time between start and end) overlaps two of df1's periods, even by only one second. The outcome should be a dataframe as below.
DateTime             Value1  Value2
2020-01-11 12:30:00  1       a
2020-01-11 13:00:00  2       a
2020-02-11 13:30:00  3       NaN
2020-02-11 14:00:00  4       b
2020-02-11 14:30:00  5       NaN
2020-02-11 15:00:00  6       c
2020-02-11 15:30:00  7       d
2020-02-11 16:00:00  8       d
Any suggestions to efficiently merge large data?
There may be shorter, better answers out there because I am going longhand.
melt the second data frame
df3=pd.melt(df2, id_vars=['Value2'], value_vars=['StartDateTime', 'EnddDateTime'],value_name='DateTime').sort_values(by='DateTime')
Create temporary columns on both dfs. The reason is that you want to extract the time of day from the datetime and attach it to a uniform date, to be used in the merge:
df1['DateTime1']=pd.Timestamp('today').strftime('%Y-%m-%d') + ' ' +pd.to_datetime(df1['DateTime']).dt.time.astype(str)
df3['DateTime1']=pd.Timestamp('today').strftime('%Y-%m-%d') + ' ' +pd.to_datetime(df3['DateTime']).dt.time.astype(str)
Convert the new column date times computed above to datetime
df3["DateTime1"]=pd.to_datetime(df3["DateTime1"])
df1["DateTime1"]=pd.to_datetime(df1["DateTime1"])
Finally, merge_asof with a time tolerance:
final = pd.merge_asof(df1, df3, on="DateTime1",tolerance=pd.Timedelta("39M"),suffixes=('_', '_df2')).drop(columns=['DateTime1','variable','DateTime_df2'])
DateTime_ Value1 Value2
0 2020-01-11 13:00:00 2 a
1 2020-02-11 13:30:00 3 a
2 2020-02-11 14:00:00 4 NaN
3 2020-02-11 14:30:00 5 b
4 2020-02-11 15:00:00 6 NaN
5 2020-02-11 15:30:00 7 c
6 2020-02-11 16:00:00 8 d
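A rough alternative sketch of the matching logic (not part of the answer above; it still loops over df2, so it only illustrates the overlap test rather than a vectorised merge, and it assumes the columns are already datetime dtype and that each 30-minute slot overlaps at most one df2 period, as in the example data):
import pandas as pd

# one 30-minute interval per df1 row
slots = pd.IntervalIndex.from_arrays(
    df1["DateTime"], df1["DateTime"] + pd.Timedelta("30min"), closed="left"
)

out = df1.copy()
out["Value2"] = None
for _, row in df2.iterrows():
    period = pd.Interval(row["StartDateTime"], row["EnddDateTime"], closed="both")
    out.loc[slots.overlaps(period), "Value2"] = row["Value2"]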
I found this behavior of resample to be confusing after working on a related question. Here are some time series data at 5 minute intervals but with missing rows (code to construct at end):
user value total
2020-01-01 09:00:00 fred 1 1
2020-01-01 09:05:00 fred 13 1
2020-01-01 09:15:00 fred 27 3
2020-01-01 09:30:00 fred 40 12
2020-01-01 09:35:00 fred 15 12
2020-01-01 10:00:00 fred 19 16
I want to fill in the missing times using a different method for each column. For user and total I want to do a forward fill, while for value I want to fill in with zeroes.
One approach I found was to resample, and then fill in the missing data after the fact:
resampled = df.resample('5T').asfreq()
resampled['user'].ffill(inplace=True)
resampled['total'].ffill(inplace=True)
resampled['value'].fillna(0, inplace=True)
Which gives correct expected output:
user value total
2020-01-01 09:00:00 fred 1.0 1.0
2020-01-01 09:05:00 fred 13.0 1.0
2020-01-01 09:10:00 fred 0.0 1.0
2020-01-01 09:15:00 fred 27.0 3.0
2020-01-01 09:20:00 fred 0.0 3.0
2020-01-01 09:25:00 fred 0.0 3.0
2020-01-01 09:30:00 fred 40.0 12.0
2020-01-01 09:35:00 fred 15.0 12.0
2020-01-01 09:40:00 fred 0.0 12.0
2020-01-01 09:45:00 fred 0.0 12.0
2020-01-01 09:50:00 fred 0.0 12.0
2020-01-01 09:55:00 fred 0.0 12.0
2020-01-01 10:00:00 fred 19.0 16.0
I thought one would be able to use agg to specify what to do by column. I tried the following:
resampled = df.resample('5T').agg({'user':'ffill',
                                   'value':'sum',
                                   'total':'ffill'})
I find this clearer and simpler, but it doesn't give the expected output. The sum works, but the forward fill does not:
user value total
2020-01-01 09:00:00 fred 1 1.0
2020-01-01 09:05:00 fred 13 1.0
2020-01-01 09:10:00 NaN 0 NaN
2020-01-01 09:15:00 fred 27 3.0
2020-01-01 09:20:00 NaN 0 NaN
2020-01-01 09:25:00 NaN 0 NaN
2020-01-01 09:30:00 fred 40 12.0
2020-01-01 09:35:00 fred 15 12.0
2020-01-01 09:40:00 NaN 0 NaN
2020-01-01 09:45:00 NaN 0 NaN
2020-01-01 09:50:00 NaN 0 NaN
2020-01-01 09:55:00 NaN 0 NaN
2020-01-01 10:00:00 fred 19 16.0
Can someone explain this output, and is there a way to achieve the expected output using agg? It seems odd that the forward fill doesn't work here, given that resampled = df.resample('5T').ffill() would work for every column (but is undesired here, as it would also fill the value column). The closest I have come is to resample each column individually and apply the function I want:
resampled = pd.DataFrame()
d = {'user':'ffill',
     'value':'sum',
     'total':'ffill'}
for k, v in d.items():
    resampled[k] = df[k].resample('5T').apply(v)
This works, but feels silly given that it adds extra iteration and uses the very dictionary I am trying to pass to agg! I have looked at a few posts on agg and apply but can't explain what is happening here:
Losing String column when using resample and aggregation with pandas
resample multiple columns with pandas
pandas groupby with agg not working on multiple columns
Pandas named aggregation not working with resample agg
I have also tried using groupby with a pd.Grouper and using the pd.NamedAgg class, with no luck.
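A more compact variant of that per-column loop (same idea, just written with pd.concat and the same dictionary d) would be something like:
resampled = pd.concat(
    {col: df[col].resample('5T').apply(how) for col, how in d.items()},
    axis=1,
)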
Example data:
import pandas as pd
dates = ['01-01-2020 9:00', '01-01-2020 9:05', '01-01-2020 9:15',
'01-01-2020 9:30', '01-01-2020 9:35', '01-01-2020 10:00']
dates = pd.to_datetime(dates)
df = pd.DataFrame({'user':['fred']*len(dates),
'value':[1,13,27,40,15,19],
'total':[1,1,3,12,12,16]},
index=dates)
I have a dataframe df as below:
Datetime Value
2020-03-01 08:00:00 10
2020-03-01 10:00:00 12
2020-03-01 12:00:00 15
2020-03-02 09:00:00 1
2020-03-02 10:00:00 3
2020-03-02 13:00:00 8
2020-03-03 10:00:00 20
2020-03-03 12:00:00 25
2020-03-03 14:00:00 15
I would like to calculate the difference between the value at the first time of each date and the value at the last time of each date (ignoring the values at other times within a date), so the result will be:
Datetime Value_Difference
2020-03-01 5
2020-03-02 7
2020-03-03 -5
I have been doing this using a for loop, but it is slow (as expected) when I have larger data. Any help will be appreciated.
One solution would be to make sure the data is sorted by time, group it by date, and then take the first and last value in each day. This works since pandas preserves the order within groups during groupby, see e.g. here.
df = df.sort_values(by='Datetime').groupby(df['Datetime'].dt.date).agg({'Value': ['first', 'last']})
df['Value_Difference'] = df['Value']['last'] - df['Value']['first']
df = df.drop('Value', axis=1).reset_index()
Result:
Datetime Value_Difference
2020-03-01 5
2020-03-02 7
2020-03-03 -5
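A slightly more compact sketch of the same groupby idea (not the answer's exact code) that avoids the MultiIndex columns:
df_sorted = df.sort_values('Datetime')
result = (df_sorted.groupby(df_sorted['Datetime'].dt.date)['Value']
                   .agg(lambda s: s.iloc[-1] - s.iloc[0])
                   .rename('Value_Difference')
                   .reset_index())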
Shaido's method works, but might be slow due to the groupby on very large sets.
Another possible way is to take the difference of the dates converted to integers and grab only the necessary values, without a loop.
import numpy as np

# assumes df is sorted by its DatetimeIndex and has a single 'Value' column
idx = df.index
# positions where the date changes, i.e. the last row of each day (except the final one)
loc = np.diff(idx.strftime('%Y%m%d').astype(int).values).nonzero()[0]
loc1 = np.append(0, loc + 1)          # first row of each day
loc2 = np.append(loc, len(idx) - 1)   # last row of each day
res = df.values[loc2] - df.values[loc1]
df = pd.DataFrame(data=res, index=idx.date[loc1], columns=['values'])
I have a data frame containing a timestamp every 5 minutes with a value for each ID. Now, I need to perform some analysis and I would like to plot all the time series over the same temporal window.
My data frame is similar to this one:
ID timestamp value
12345 2017-02-09 14:35:00 60.0
12345 2017-02-09 14:40:00 62.0
12345 2017-02-09 14:45:00 58.0
12345 2017-02-09 14:50:00 60.0
54321 2017-03-09 13:35:00 50.0
54321 2017-03-09 13:40:00 58.0
54321 2017-03-09 13:45:00 59.0
54321 2017-03-09 13:50:00 61.0
For instance, on the x axis I need x=0 to be the first timestamp for each ID, x=1 the second one (5 minutes later), and so on.
Until now, I correctly resampled every 5 minutes with this code:
df = df.set_index('Date').resample('5T').mean().reset_index()
But, given that every ID starts at a different timestamp, I don't know how to modify the timestamps so that the first measured date of each ID becomes timestamp 0, and each following timestamp (every 5 minutes) becomes timestamp 1, timestamp 2, timestamp 3, etc., in order to plot the series of each ID and compare them graphically. A sample final df may be:
ID timestamp value
12345 0 60.0
12345 1 62.0
12345 2 58.0
12345 3 60.0
54321 0 50.0
54321 1 58.0
54321 2 59.0
54321 3 61.0
Using this data frame, is it possible to plot all the series starting and finishing at the same point? Start at 0 and finish after 3 days.
How do I create such different timestamps and plot every series for each ID on the same figure?
Thank you very much.
First create a new column with the timestamp number in 5 minutes intervals.
df['ts_number'] = df.groupby(['ID']).timestamp.apply(lambda x: (x - x.min())/pd.Timedelta(minutes=5))
If you know in advance that all your timestamps are at 5-minute intervals and they are sorted, then you can also use
df['ts_number'] = df.groupby(['ID']).cumcount()
Then plot the pivoted data:
df.pivot(index='ts_number', columns='ID', values='value').plot()
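For example (a sketch, assuming the column names from the question and regularly spaced, sorted data), the whole pipeline to compare the IDs on one figure might look like:
import matplotlib.pyplot as plt

df['ts_number'] = df.groupby('ID').cumcount()
df.pivot(index='ts_number', columns='ID', values='value').plot()
plt.xlabel('5-minute steps from the first observation of each ID')
plt.show()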