Suppose we have two dataframes: df1 with a single timestamp column, and df2 with start and end timestamps. df1 and df2 are:
df1:

DateTime             Value1
2020-01-11 12:30:00  1
2020-01-11 13:00:00  2
2020-02-11 13:30:00  3
2020-02-11 14:00:00  4
2020-02-11 14:30:00  5
2020-02-11 15:00:00  6
2020-02-11 15:30:00  7
2020-02-11 16:00:00  8

df2:

StartDateTime        EnddDateTime         Value2
2020-01-11 12:23:12  2020-01-11 13:10:00  a
2020-01-11 14:12:20  2020-01-11 14:20:34  b
2020-01-11 15:20:00  2020-01-11 15:28:10  c
2020-01-11 15:45:20  2020-01-11 16:26:23  d
Each timestamp of df1 represents a half-hour window starting at the time in the DateTime column. I want to match the df2 start and end times with these 30-minute periods. A value of df2 may fall into two rows of df1 if its period (the time between start and end) overlaps two DateTime windows of df1, even by only one second. The outcome should be a dataframe as below.
DateTime             Value1  Value2
2020-01-11 12:30:00  1       a
2020-01-11 13:00:00  2       a
2020-02-11 13:30:00  3       NaN
2020-02-11 14:00:00  4       b
2020-02-11 14:30:00  5       NaN
2020-02-11 15:00:00  6       c
2020-02-11 15:30:00  7       d
2020-02-11 16:00:00  8       d
Any suggestions to efficiently merge large data?
There may be shorter, better answers out there because I am going longhand.
Melt the second data frame:
df3 = pd.melt(df2, id_vars=['Value2'], value_vars=['StartDateTime', 'EnddDateTime'],
              value_name='DateTime').sort_values(by='DateTime')
Create temporary columns on both dataframes. The reason is that you want to extract the time from each datetime and append it to a uniform date, to be used in the merge:
df1['DateTime1']=pd.Timestamp('today').strftime('%Y-%m-%d') + ' ' +pd.to_datetime(df1['DateTime']).dt.time.astype(str)
df3['DateTime1']=pd.Timestamp('today').strftime('%Y-%m-%d') + ' ' +pd.to_datetime(df3['DateTime']).dt.time.astype(str)
Convert the new columns computed above to datetime:
df3["DateTime1"]=pd.to_datetime(df3["DateTime1"])
df1["DateTime1"]=pd.to_datetime(df1["DateTime1"])
Finally, merge_asof with a time tolerance:
final = pd.merge_asof(df1, df3, on="DateTime1", tolerance=pd.Timedelta("39M"),
                      suffixes=('_', '_df2')).drop(columns=['DateTime1', 'variable', 'DateTime_df2'])
DateTime_ Value1 Value2
0 2020-01-11 13:00:00 2 a
1 2020-02-11 13:30:00 3 a
2 2020-02-11 14:00:00 4 NaN
3 2020-02-11 14:30:00 5 b
4 2020-02-11 15:00:00 6 NaN
5 2020-02-11 15:30:00 7 c
6 2020-02-11 16:00:00 8 d
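As a follow-up note: merge_asof with a tolerance only approximates the overlap rule. If the strict "overlaps even by one second" requirement matters, a sketch along these lines could be used (assuming the column names from the question and datetime64 dtypes; written for readability rather than raw speed):

import pandas as pd

# Each df1 row defines the window [DateTime, DateTime + 30 minutes); a df2 period
# [StartDateTime, EnddDateTime] is assigned to every window it intersects.
win_start = df1["DateTime"]
win_end = df1["DateTime"] + pd.Timedelta(minutes=30)

def overlapping_values(lo, hi):
    hits = df2[(df2["StartDateTime"] < hi) & (df2["EnddDateTime"] > lo)]
    # join the labels if more than one period overlaps the same window
    return ", ".join(hits["Value2"]) if len(hits) else float("nan")

df1["Value2"] = [overlapping_values(lo, hi) for lo, hi in zip(win_start, win_end)]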
EDIT: My main goal is to avoid a for loop and to find a way of grouping the data efficiently/fast.
I am trying to solve a problem that is about grouping together different rows of data based on an ID and a time window of 30 days.
I have the following example data:
ID     Time
12345  2021-01-01 14:00:00
12345  2021-01-15 14:00:00
12345  2021-01-29 14:00:00
12345  2021-02-15 14:00:00
12345  2021-02-16 14:00:00
12345  2021-03-15 14:00:00
12345  2021-04-24 14:00:00
12344  2021-01-24 14:00:00
12344  2021-01-25 14:00:00
12344  2021-04-24 14:00:00
And I would like to have the following data:
ID     Time                 Group
12345  2021-01-01 14:00:00  1
12345  2021-01-15 14:00:00  1
12345  2021-01-29 14:00:00  1
12345  2021-02-15 14:00:00  2
12345  2021-02-16 14:00:00  2
12345  2021-03-15 14:00:00  3
12345  2021-04-24 14:00:00  4
12344  2021-01-24 14:00:00  5
12344  2021-01-25 14:00:00  5
12344  2021-04-24 14:00:00  6
(4 can also be 1 as it is in a new group based on the ID 12344; 5 can also be 2)
I could then differentiate based on the ID column, so the Group does not need to be unique, but it can be.
The most important part is to separate the data based on the ID, then check all the rows for each ID and assign a group to each 30-day time window. By 30-day time window I mean that, e.g., the first time frame for ID 12345 starts at 2021-01-01 and goes up to 2021-01-31 (this should be group 1), and the second time frame for ID 12345 starts at 2021-02-01 and goes to 2021-03-02 (30 days).
The problem I faced with the following code is that it uses the first date it finds in the dataframe:
grouped_data = df.groupby(["ID",pd.Grouper(key = "Time", freq = "30D")]).count()
In the above code I have just tried to count the rows (which wouldn't give me the Group, but I tried to group it with my logic).
I hope someone can help me with this, because I have tried so many different things and nothing worked. I have already used the following (but maybe wrongly):
pd.rolling()
pd.Grouper()
for loop
etc.
I really don't want to use a for loop as I have 1.5 million rows.
I have tried to vectorize the for loop, but I am not really familiar with vectorization and was struggling to convert my for loop into a vectorized form.
Please let me know if I can use pd.Grouper differently to get these results. Thanks in advance.
For arbitrary windows you can use pandas.cut
E.g., for 30-day bins starting at 2021-01-01 00:00:00 for the entirety of 2021 you can use:
bins = pd.date_range("2021", "2022", freq="30D")
group = pd.cut(df["Time"], bins)
group will label each row with an interval, which you can then group on, etc. If you want the groups to have labels 0, 1, 2, etc., you can map values with:
dict(zip(group.unique(), range(group.nunique())))
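Applied, for example, like this (a small usage sketch):

labels = group.map(dict(zip(group.unique(), range(group.nunique()))))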
EDIT: approach where the windows are 30 day intervals, disjoint, and starting at a time in the Time column:
times = df["Time"].sort_values()
ii = pd.IntervalIndex.from_arrays(times, times + pd.Timedelta("30 days"))
disjoint_intervals = []
prev_interval = None
for i, interval in enumerate(ii):
    if prev_interval is None or interval.left >= prev_interval.right:  # no overlap
        prev_interval = interval
        disjoint_intervals.append(i)
bins = ii[disjoint_intervals]
group = pd.cut(df["Time"], bins)
Apologies, this is not a vectorised approach; I'm struggling to think whether one could exist.
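If instead the windows are fixed, back-to-back 30-day bins anchored at each ID's first timestamp (the reading described in the question), a vectorised sketch might look like this (assuming a dataframe df with ID and Time columns; the labels will differ from the example but identify the same groupings):

import pandas as pd

# Days elapsed since the first timestamp of each ID, integer-divided into 30-day bins
start = df.groupby("ID")["Time"].transform("min")
bin_idx = (df["Time"] - start).dt.days // 30
# One distinct label per (ID, 30-day window) pair
df["Group"] = df.groupby(["ID", bin_idx]).ngroup() + 1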
SOLUTION:
The solution which worked for me is the following:
I have imported the sampleData from Excel into a dataframe. The data looks like this:
ID     Time
12345  2021-01-01 14:00:00
12345  2021-01-15 14:00:00
12345  2021-01-29 14:00:00
12345  2021-02-15 14:00:00
12345  2021-02-16 14:00:00
12345  2021-03-15 14:00:00
12345  2021-04-24 14:00:00
12344  2021-01-24 14:00:00
12344  2021-01-25 14:00:00
12344  2021-04-24 14:00:00
Then I have used the following steps:
Import the data:
df_test = pd.read_excel(r"sampleData.xlsx")
Order the dataframe so we have the correct order of ID and Time:
df_test_ordered = df_test.sort_values(["ID","Time"])
df_test_ordered = df_test_ordered.reset_index(drop=True)
I have also reset the index and dropped the old one, as it interfered with my calculations later on.
Create a column with the time difference to the previous row:
df_test_ordered.loc[df_test_ordered["ID"] == df_test_ordered["ID"].shift(1), "time_diff"] = \
    df_test_ordered["Time"] - df_test_ordered["Time"].shift(1)
Transform timedelta64[ns] to timedelta64[D]:
df_test_ordered["time_diff"] = df_test_ordered["time_diff"].astype("timedelta64[D]")
Calculate the cumsum per ID:
df_test_ordered["cumsum"] = df_test_ordered.groupby("ID")["time_diff"].transform(pd.Series.cumsum)
Fill the dataframe (forward-fill, then backfill, so that NaN values take a neighbouring value):
df_final = df_test_ordered.ffill().bfill()
Create the window by dividing by 30 (30 days time period):
df_final["Window"] = df_final["cumsum"] / 30
df_final["Window_int"] = df_final["Window"].astype(int)
The "Window_int" column is now a kind of ID (not unique; but unique within the groups of column "ID").
Furthermore, I needed to backfill the dataframe as there were NaN values due to the calculation of time difference only if the previous ID equals the ID. If not then NaN is set as time difference. Backfilling will just set the NaN value to the next time difference which makes no difference mathematically and assign the correct value.
Solution dataframe:
ID Time time_diff cumsum Window Window_int
0 12344 2021-01-24 14:00:00 1.0 1.0 0.032258 0
1 12344 2021-01-25 14:00:00 1.0 1.0 0.032258 0
2 12344 2021-04-24 14:00:00 89.0 90.0 2.903226 2
3 12345 2021-01-01 14:00:00 14.0 14.0 0.451613 0
4 12345 2021-01-15 14:00:00 14.0 14.0 0.451613 0
5 12345 2021-01-29 14:00:00 14.0 28.0 0.903226 0
6 12345 2021-02-15 14:00:00 17.0 45.0 1.451613 1
7 12345 2021-02-16 14:00:00 1.0 46.0 1.483871 1
8 12345 2021-03-15 14:00:00 27.0 73.0 2.354839 2
9 12345 2021-04-24 14:00:00 40.0 113.0 3.645161 3
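For reference, a compact sketch that mirrors the same steps (assuming df_test with "ID" and "Time" columns and "Time" already parsed as datetime; not checked beyond this sample):

df = df_test.sort_values(["ID", "Time"]).reset_index(drop=True)
df["time_diff"] = df.groupby("ID")["Time"].diff().dt.days   # NaN at each ID's first row
df["cumsum"] = df.groupby("ID")["time_diff"].cumsum()       # cumulative days within the ID
df = df.ffill().bfill()                                      # fill the NaN edges
df["Window_int"] = (df["cumsum"] / 30).astype(int)           # 30-day window index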
I have read miscellaneous posts with a similar question but couldn't find exactly this question.
I have two pandas DataFrames that I want to merge.
They have timestamps as indexes.
The 2nd DataFrame basically overlaps the 1st, so they share rows with the same timestamps and values.
I would like to remove these rows because they share everything: index and values in columns.
If they don't share both the index and the values in the columns, I want to keep them.
So far, I could point out:
Index.drop_duplicates: this is not what I am looking for. It doesn't check that the values in the columns are the same, and I want to keep rows with the same timestamp but different values in the columns.
DataFrame.drop_duplicates: same as above, it doesn't check the index value, and if rows are found with the same values in the columns but different indexes, I want to keep them.
To give an example, I am re-using the data given in the answer below.
df1
Value
2012-02-01 12:00:00 10
2012-02-01 12:30:00 10
2012-02-01 13:00:00 20
2012-02-01 13:30:00 30
df2
Value
2012-02-01 12:30:00 20
2012-02-01 13:00:00 20
2012-02-01 13:30:00 30
2012-02-02 14:00:00 10
Result I would like to obtain is the following one:
Value
2012-02-01 12:00:00 10 #(from df1)
2012-02-01 12:30:00 10 #(from df1)
2012-02-01 12:30:00 20 #(from df2 - same index than in df1, but different value)
2012-02-01 13:00:00 20 #(in df1 & df2, only one kept)
2012-02-01 13:30:00 30 #(in df1 & df2, only one kept)
2012-02-02 14:00:00 10 #(from df2)
Please, any idea?
Thanks for your help!
Bests
Assume that you have the 2 following DataFrames:
df:
Date Value
0 2012-02-01 12:00:00 10
1 2012-02-01 12:30:00 10
2 2012-02-01 13:00:00 20
3 2012-02-01 13:30:00 30
4 2012-02-02 14:00:00 10
5 2012-02-02 14:30:00 10
6 2012-02-02 15:00:00 20
7 2012-02-02 15:30:00 30
df2:
Date Value
0 2012-02-01 12:00:00 10
1 2012-02-01 12:30:00 21
2 2012-02-01 12:40:00 22
3 2012-02-01 13:00:00 20
4 2012-02-01 13:30:00 30
To generate the result, run:
pd.concat([df, df2]).sort_values('Date')\
.drop_duplicates().reset_index(drop=True)
The result, for the above data, is:
Date Value
0 2012-02-01 12:00:00 10
1 2012-02-01 12:30:00 10
2 2012-02-01 12:30:00 21
3 2012-02-01 12:40:00 22
4 2012-02-01 13:00:00 20
5 2012-02-01 13:30:00 30
6 2012-02-02 14:00:00 10
7 2012-02-02 14:30:00 10
8 2012-02-02 15:00:00 20
9 2012-02-02 15:30:00 30
drop_duplicates drops duplicated rows, keeping the first.
Since no subset parameter has been passed, the criterion to treat
2 rows as duplicates is identity of all columns.
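If the timestamps are the index rather than a Date column, as described in the question, a variation of the same idea could be (a sketch, assuming df1 and df2 from the question; the resulting index keeps the name 'index'):

import pandas as pd

# reset_index() turns the timestamp index into a regular column, so a row is only
# dropped when both the timestamp and the Value agree.
result = (pd.concat([df1, df2])
            .reset_index()
            .drop_duplicates()
            .sort_values('index')
            .set_index('index'))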
Just improving on the first answer: insert Date inside drop_duplicates:
pd.concat([df, df2]).sort_values('Date')\
.drop_duplicates('Date').reset_index(drop=True)
I have a data frame consisting of two columns: column 1 is the event and column 2 is a datetime:
Sample data
Event Time
0 2020-02-12 11:00:00
0 2020-02-12 11:30:00
2 2020-02-12 12:00:00
1 2020-02-12 12:30:00
0 2020-02-12 13:00:00
0 2020-02-12 13:30:00
0 2020-02-12 14:00:00
1 2020-02-12 14:30:00
0 2020-02-12 15:00:00
0 2020-02-12 15:30:00
And I want to find the start time and end time of each event:
Desired Data
Event EventStartTime EventEndTime
0 2020-02-12 11:00:00 2020-02-12 12:00:00
2 2020-02-12 12:00:00 2020-02-12 12:30:00
1 2020-02-12 12:30:00 2020-02-12 13:00:00
0 2020-02-12 13:00:00 2020-02-12 14:30:00
1 2020-02-12 14:30:00 2020-02-12 15:00:00
Note: EventEndTime is the time when the event value changes, say from 1 to 0 or to any other value, or vice versa.
Here is a method that can get the results without a for loop. I assume that the input data is read into a dataframe called df:
# Initialize the output df
dfout = pd.DataFrame()
dfout['Event'] = df['Event']
dfout['EventStartTime'] = df['Time']
Now, I create a column called 'change' that tells you whether the event changed.
dfout['change'] = df['Event'].diff()
This is how dfout looks now:
Event EventStartTime change
0 0 2020-02-12 11:00:00 NaN
1 0 2020-02-12 11:30:00 0.0
2 2 2020-02-12 12:00:00 2.0
3 1 2020-02-12 12:30:00 -1.0
4 0 2020-02-12 13:00:00 -1.0
5 0 2020-02-12 13:30:00 0.0
6 0 2020-02-12 14:00:00 0.0
7 1 2020-02-12 14:30:00 1.0
8 0 2020-02-12 15:00:00 -1.0
9 0 2020-02-12 15:30:00 0.0
Now, I go on to remove the rows where the event did not change:
dfout = dfout.loc[dfout['change'] != 0, :]
This will now leave me with rows where the event has changed.
Next, the event end time of the current event is the start time of the next event.
dfout['EventEndTime'] = dfout['EventStartTime'].shift(-1)
The dataframe looks like this:
Event EventStartTime change EventEndTime
0 0 2020-02-12 11:00:00 NaN 2020-02-12 12:00:00
2 2 2020-02-12 12:00:00 2.0 2020-02-12 12:30:00
3 1 2020-02-12 12:30:00 -1.0 2020-02-12 13:00:00
4 0 2020-02-12 13:00:00 -1.0 2020-02-12 14:30:00
7 1 2020-02-12 14:30:00 1.0 2020-02-12 15:00:00
8 0 2020-02-12 15:00:00 -1.0 NaN
You may choose to remove the 'change' column and also the last row if they are not needed.
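For instance, that cleanup could look like this (a small sketch, continuing from dfout above; the last row's EventEndTime is empty because of the shift):

# drop the helper column and the trailing, open-ended event
dfout = dfout.drop(columns='change').dropna(subset=['EventEndTime'])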
Assuming the dataframe is data:
current_event = None
result = []
for event, time in zip(data['Event'], data['Time']):
    if event != current_event:
        if current_event is not None:
            result.append([current_event, start_time, time])
        current_event, start_time = event, time
data = pandas.DataFrame(result, columns=['Event', 'EventStartTime', 'EventEndTime'])
The trick is to save your event number; if the next event number is not the same as the saved one, the saved one has to be ended and a new one started.
Use groupby and agg to get the output in the desired format.
df = pd.DataFrame([['0', 11], ['1', 12], ['1', 13], ['0', 15], ['1', 16], ['3', 11]], columns=['Event', 'Time'])
df.groupby(['Event']).agg(['first', 'last']).rename(columns={'first': 'start-event', 'last': 'end-event'})
Output:
Event start-event end-event
0 11 15
1 12 16
3 11 11
I have two data frames like the following: data frame A has datetimes including minutes, while data frame B only has the hour.
df:A
dataDate original
2018-09-30 11:20:00 3
2018-10-01 12:40:00 10
2018-10-02 07:00:00 5
2018-10-27 12:50:00 5
2018-11-28 19:45:00 7
df:B
dataDate count
2018-09-30 10:00:00 300
2018-10-01 12:00:00 50
2018-10-02 07:00:00 120
2018-10-27 12:00:00 234
2018-11-28 19:05:00 714
I would like to merge the two on the basis of date and hour, so that dataframe A has all its rows filled based on a merge on date and hour.
I can try to do it via
A['date'] = A.dataDate.dt.date
B['date'] = B.dataDate.dt.date
A['hour'] = A.dataDate.dt.hour
B['hour'] = B.dataDate.dt.hour
and then merge
merge_df = pd.merge(A, B, how='left', left_on=['date', 'hour'],
                    right_on=['date', 'hour'])
but it's a very long process. Is there an efficient way to perform the same operation with the help of the pandas time series or date functionality?
Use map if you only need to append one column from B to A, with floor to set the minutes and seconds (if any) to 0:
d = dict(zip(B.dataDate.dt.floor('H'), B['count']))
A['count'] = A.dataDate.dt.floor('H').map(d)
print (A)
dataDate original count
0 2018-09-30 11:20:00 3 NaN
1 2018-10-01 12:40:00 10 50.0
2 2018-10-02 07:00:00 5 120.0
3 2018-10-27 12:50:00 5 234.0
4 2018-11-28 19:45:00 7 714.0
For a general solution use DataFrame.join:
A.index = A.dataDate.dt.floor('H')
B.index = B.dataDate.dt.floor('H')
A = A.join(B, lsuffix='_left')
print (A)
dataDate_left original dataDate count
dataDate
2018-09-30 11:00:00 2018-09-30 11:20:00 3 NaT NaN
2018-10-01 12:00:00 2018-10-01 12:40:00 10 2018-10-01 12:00:00 50.0
2018-10-02 07:00:00 2018-10-02 07:00:00 5 2018-10-02 07:00:00 120.0
2018-10-27 12:00:00 2018-10-27 12:50:00 5 2018-10-27 12:00:00 234.0
2018-11-28 19:00:00 2018-11-28 19:45:00 7 2018-11-28 19:05:00 714.0
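A related sketch, if a plain merge is preferred over join (assuming the same A and B frames as above): floor both datetime columns to the hour and merge on that single key.

import pandas as pd

# one helper key instead of separate date and hour columns
merged = pd.merge(A.assign(key=A['dataDate'].dt.floor('H')),
                  B.assign(key=B['dataDate'].dt.floor('H')),
                  on='key', how='left', suffixes=('', '_B')).drop(columns='key')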
I have a dataframe named DateUnique made of all the unique dates (in datetime or string format) that are present in my other dataframe named A.
>>> print(A)
'dateLivraisonDemande' 'abscisse' 'BaseASDébut' 'BaseATDébut'
0 2015-05-27 2004-01-10 05:00:00 05:00:00
1 2015-05-27 2004-02-10 18:30:00 22:30:00
2 2015-05-27 2004-01-20 23:40:00 19:30:00
3 2015-05-27 2004-03-10 12:05:00 06:00:00
4 2015-05-27 2004-01-10 23:15:00 13:10:00
5 2015-05-27 2004-02-10 18:00:00 13:45:00
6 2015-05-27 2004-01-20 02:05:00 19:15:00
7 2015-05-27 2004-03-20 08:00:00 07:45:00
8 2015-05-29 2004-01-01 18:45:00 21:00:00
9 2015-05-27 2004-02-15 04:20:00 07:30:00
10 2015-04-10 2004-01-20 13:50:00 15:30:00
And:
>>> print(DateUnique)
1 1899-12-30
2 1900-01-01
3 2004-03-10
4 2004-03-20
5 2004-01-20
6 2015-05-29
7 2015-04-10
8 2015-05-27
9 2004-02-15
10 2004-02-10
How can I get the name of the columns that contain each date?
Maybe with something similar to this:
# input:
if row == '2015-04-10':
    print(df.name_Of_Column([0]))
# output:
'dateLivraisonDemande'
You can make a function that returns the appropriate column. Use the vectorized isin function, and then check if any value is True.
df = pd.DataFrame({'dateLivraisonDemande': ['2015-05-27']*7 + ['2015-05-27', '2015-05-29', '2015-04-10'],
                   'abscisse': ['2004-02-10', '2004-01-20', '2004-03-10', '2004-01-10',
                                '2004-02-10', '2004-01-20', '2004-03-10', '2004-01-10',
                                '2004-02-15', '2004-01-20']})

DateUnique = pd.Series(['1899-12-30', '1900-01-01', '2004-03-10', '2004-03-20',
                        '2004-01-20', '2015-05-29', '2015-04-10', '2015-05-27',
                        '2004-02-15', '2004-02-10'])
def return_date_columns(date_input):
    if df["dateLivraisonDemande"].isin([date_input]).any():
        return "dateLivraisonDemande"
    if df["abscisse"].isin([date_input]).any():
        return "abscisse"
>>> DateUnique.apply(return_date_columns)
0 None
1 None
2 abscisse
3 None
4 abscisse
5 dateLivraisonDemande
6 dateLivraisonDemande
7 dateLivraisonDemande
8 abscisse
9 abscisse
dtype: object
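If the frame has more date-like columns than the two hard-coded ones above, a generalized version of the same isin idea could loop over all columns (a sketch; find_date_column is a hypothetical helper, not from the original answer):

def find_date_column(date_input, frame):
    # return the first column of `frame` that contains `date_input`, else None
    for col in frame.columns:
        if frame[col].isin([date_input]).any():
            return col
    return None

DateUnique.apply(find_date_column, frame=df)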