Split DataFrame rows by DateTime in pandas - python

I have a DataFrame containing events like this:
location  start_time  end_time  some_value1  some_value2
LECP      00:00       01:30     25           nice info
LECP      02:00       04:00     10           other info
LECS      02:00       03:00     5            lorem
LIPM      02:55       03:15     9            ipsum
and I want to split the rows so that I get maximum intervals of 1 hour, e.g. if an event has a duration of 01:30, I want to get a row of length 01:00 and another of 00:30. If an event has a duration of 02:30, I want to get three rows. And if an event has a duration of an hour or less, it should simply remain one row. Like so:
location  start_time  end_time  some_value1  some_value2
LECP      00:00       01:00     25           nice info
LECP      01:00       01:30     25           nice info
LECP      02:00       03:00     10           other info
LECP      03:00       04:00     10           other info
LECS      02:00       03:00     5            lorem
LIPM      02:55       03:15     9            ipsum
It does not matter whether the remainder is at the beginning or the end. It would not even matter if the duration were distributed equally across the rows, as long as no row has a duration of more than 1 hour.
What I tried:
- reading through Time Series / Date functionality and not understanding anything
- searching StackOverflow.

I adapted this answer to implement hourly rather than daily splits. The code works in a while loop, so it re-iterates as long as there are rows with durations still > 1 hour.
mytimedelta = pd.Timedelta('1 hour')

# create boolean mask
split_rows = (dfob['duration'] > mytimedelta)

while split_rows.any():
    # get new rows to append and adjust their start time to 1 hour later
    new_rows = dfob[split_rows].copy()
    new_rows['start'] = new_rows['start'] + mytimedelta
    # update the end time of the old rows
    dfob.loc[split_rows, 'end'] = dfob.loc[split_rows, 'start'] + \
        pd.DateOffset(hours=1, seconds=-1)
    # df.append was removed in pandas 2.0; use pd.concat instead
    dfob = pd.concat([dfob, new_rows])
    # update the duration of all rows
    dfob['duration'] = dfob['end'] - dfob['start']
    # create an updated boolean mask
    split_rows = (dfob['duration'] > mytimedelta)

# when the job is done (sort_index returns a copy, so assign it back)
dfob = dfob.sort_index().reset_index(drop=True)
dfob['duration'] = dfob['end'] - dfob['start']
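
For comparison, here is a loop-free sketch of the hourly split: build the hour boundaries per row, pair them up, and explode. It assumes start and end are already Timestamp columns as above and that every event has a positive duration; split_hourly is a hypothetical helper name, not part of the original code.

import pandas as pd

def split_hourly(df):
    # boundary points at start, start+1h, start+2h, ..., plus the real end,
    # so the < 1 hour remainder ends up in the last chunk
    def chunks(row):
        edges = list(pd.date_range(row['start'], row['end'],
                                   freq=pd.Timedelta('1 hour')))
        if edges[-1] != row['end']:
            edges.append(row['end'])
        return list(zip(edges[:-1], edges[1:]))

    out = df.copy()
    out['interval'] = out.apply(chunks, axis=1)
    out = out.explode('interval')  # one row per (start, end) pair
    out[['start', 'end']] = pd.DataFrame(out['interval'].tolist(),
                                         index=out.index)
    return out.drop(columns='interval')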

Related

Pandas: Get first and last row of the same date and calculate time difference

I have a dataframe where I have a date and a time column.
Each row describes some event. I want to calculate the timespan for each different day and add it as a new column. The actual calculation is not that important (which units etc.), I just want to know how I can get the first and last row for each date, to access the time value.
The dataframe is already sorted by date and all rows of the same date are also ordered by the time.
Minimal example of what I have
import pandas as pd
df = pd.DataFrame({"Date": ["01.01.2020", "01.01.2020", "01.01.2020", "02.02.2022", "02.02.2022"],
                   "Time": ["12:00", "13:00", "14:45", "02:00", "08:00"]})
df
and what I want
EDIT: The duration column should be calculated by
14:45 - 12:00 = 2:45 for the first date and
08:00 - 02:00 = 6:00 for the second date.
I suspect this is possible with the groupby function but I am not sure how exactly to do it.
I hope you will find this helpful.
import pandas as pd

df = pd.DataFrame({"Date": ["01.01.2020", "01.01.2020", "01.01.2020", "02.02.2022", "02.02.2022"],
                   "Time": ["12:00", "13:00", "14:45", "02:00", "08:00"]})
df["Datetime"] = pd.to_datetime(df["Date"] + " " + df["Time"])

def date_diff(df):
    df["Duration"] = df["Datetime"].max() - df["Datetime"].min()
    return df

df = df.groupby("Date").apply(date_diff)
df = df.drop("Datetime", axis=1)
Output:
         Date   Time        Duration
0  01.01.2020  12:00 0 days 02:45:00
1  01.01.2020  13:00 0 days 02:45:00
2  01.01.2020  14:45 0 days 02:45:00
3  02.02.2022  02:00 0 days 06:00:00
4  02.02.2022  08:00 0 days 06:00:00
You can then do some string styling:
df['Duration'] = df['Duration'].astype(str).map(lambda x: x[7:12])
Output:
         Date   Time Duration
0  01.01.2020  12:00    02:45
1  01.01.2020  13:00    02:45
2  01.01.2020  14:45    02:45
3  02.02.2022  02:00    06:00
4  02.02.2022  08:00    06:00
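
As a side note on the design: groupby(...).transform broadcasts a per-group scalar back to every row of the group, so the same Duration column can also be built without the date_diff helper. A minimal sketch, assuming the Datetime column created above:

# one scalar per date, broadcast back to each row of that date
df["Duration"] = df.groupby("Date")["Datetime"].transform(lambda s: s.max() - s.min())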
Here is one way to do it:
# groupby on Date and find the difference of max and min time in each group,
# format it as HH:MM by extracting hours and minutes,
# and create a dictionary
d = dict((df.groupby('Date')['Time'].apply(lambda x:
             (pd.to_timedelta(x.max() + ':00') -
              pd.to_timedelta(x.min() + ':00'))
         ).astype(str).str.extract(r'days (..:..)')
        ).reset_index().values)

# map the dictionary and update the duration in the DF
df['duration'] = df['Date'].map(d)
df
         Date   Time duration
0  01.01.2020  12:00    02:45
1  01.01.2020  13:00    02:45
2  01.01.2020  14:45    02:45
3  02.02.2022  02:00    06:00
4  02.02.2022  08:00    06:00
By the example shown below, you can achieve what you want.
df['Start time'] = df.apply(lambda row: df[df['Date'] == row['Date']]['Time'].max(), axis=1)
df
Update:
import datetime

df['Duration'] = df.apply(
    lambda row: str(datetime.timedelta(
        seconds=(datetime.datetime.strptime(df[df['Date'] == row['Date']]['Time'].max(), '%H:%M')
                 - datetime.datetime.strptime(df[df['Date'] == row['Date']]['Time'].min(), '%H:%M')
                 ).total_seconds())),
    axis=1)
df
You can use:
from datetime import timedelta
import numpy as np

df['xdate'] = pd.to_datetime(df['Date'] + ' ' + df['Time'], format='%d.%m.%Y %H:%M')
df['max'] = df.groupby(df['xdate'].dt.date)['xdate'].transform(np.max)  # max datetime per date
df['min'] = df.groupby(df['xdate'].dt.date)['xdate'].transform(np.min)  # min datetime per date

# get the difference of the max and min datetimes
df['Duration'] = df[['min', 'max']].apply(
    lambda x: x['max'] - timedelta(hours=x['min'].hour,
                                   minutes=x['min'].minute,
                                   seconds=x['min'].second),
    axis=1).dt.strftime('%H:%M')
df = df.drop(['xdate', 'min', 'max'], axis=1)
print(df)
'''
         Date   Time Duration
0  01.01.2020  12:00    02:45
1  01.01.2020  13:00    02:45
2  01.01.2020  14:45    02:45
3  02.02.2022  02:00    06:00
4  02.02.2022  08:00    06:00
'''

Calculate the time difference between two hh:mm columns in a pandas dataframe

I am reading some data from a csv file where two columns are in hh:mm format. Here is an example:
Start    End
11:15  15:00
22:30   2:00
22:30 2:00
In the above example, the End in the 2nd row happens on the next day. I am trying to get the time difference between these two columns in the most efficient way, as the dataset is huge. Is there any good pythonic way of doing this? Also, since there is no date and some Ends happen on the next day, I get wrong results when I calculate the diff:
>>> import pandas as pd
>>> df = pd.read_csv(file_path)
>>> pd.to_datetime(df['End'])-pd.to_datetime(df['Start'])
0    0 days 03:45:00
1    0 days 03:00:00
2   -1 days +03:30:00
You can use the technique (a+x) % x with a timedelta of 24h (or 1 day, same thing):
- the + timedelta(hours=24) makes all values positive
- the % timedelta(hours=24) wraps the ones that are now at or above 24h back down by 24h
from datetime import timedelta

df['duration'] = (pd.to_datetime(df['End']) - pd.to_datetime(df['Start'])
                  + timedelta(hours=24)) % timedelta(hours=24)
Gives
   Start    End        duration
0  11:15  15:00 0 days 03:45:00
1  22:30   2:00 0 days 03:30:00

What is the most efficient way to count how many rows in a dataframe were "active" for every minute of a day?

I have a dataframe of the format:
object_id  start_time  end_time
      123       13:23     13:28
      234       13:25     13:26
And I want to transform it into a format like this:
time   number_of_objects_active
13:22  0
13:23  1
13:24  1
13:25  2
13:26  1
13:27  1
13:28  1
13:29  0
Where each row has the minute of the day and the count of how many objects were active at that point (where active means time is greater than or equal to start time and less than end time).
I have tried to come up with some way of doing a groupby, but have failed miserably. A not-very-nice solution is to loop through every minute of the day and sum the number of rows which were active in that minute:
results_dictionary = {}
for minute in minutes:
    results_dictionary[minute] = df.loc[(df.start_time <= minute) & (df.end_time > minute)].shape[0]
but I suspect there's a nicer more pandas/pythonic way of doing this.
If you are on pandas v0.25 or later, use explode:
# Convert `start_time` and `end_time` to Timestamp, if they
# are not already. This also allows you to adjust cases where
# the times cross the day boundary, e.g.: 23:00 - 02:00
df['start_time'] = pd.to_datetime(df['start_time'])
df['end_time'] = pd.to_datetime(df['end_time'])
# Make a `time` column that holds a minutely range. We will
# later explode it into individual minutes
f = lambda row: pd.date_range(row['start_time'], row['end_time'], freq='T')
df['time'] = df.apply(f, axis=1)
# The reporting range, adjust as needed
t = pd.date_range('13:23', '13:30', freq='T')
result = df.explode('time') \
           .groupby('time').size() \
           .reindex(t).fillna(0) \
           .to_frame('active')
result.index = result.index.time
Result:
          active
13:23:00     1.0
13:24:00     1.0
13:25:00     2.0
13:26:00     2.0
13:27:00     1.0
13:28:00     1.0
13:29:00     0.0
13:30:00     0.0
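
If the frame is very large, materializing one row per active minute can be avoided with a counting trick: +1 at every start, -1 at every end, then a running sum. This is a sketch under the question's stated definition (active while start_time <= t < end_time), assuming all events fall inside the reporting range t:

# count events beginning and ending at each minute
starts = pd.to_datetime(df['start_time']).value_counts()
ends = pd.to_datetime(df['end_time']).value_counts()

t = pd.date_range('13:22', '13:29', freq='min')
# net change in active objects per minute, then a cumulative sum
delta = starts.reindex(t, fill_value=0) - ends.reindex(t, fill_value=0)
result = delta.cumsum().to_frame('number_of_objects_active')
result.index = result.index.time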

pandas get difference of 2 times assuming if end_time is lower than start_time it is the next day

Assuming this is my dataframe:
    date      start_time  end_time
    1/1/2018       20:00     21:00
    1/1/2018       23:00      1:00
I want to add another column, named duration, which is obviously end_time - start_time.
My problem is that if I write something like:
pd.to_datetime(train_2.end_time, format='%H:%M:%S') - pd.to_timedelta(train_2.start_time)
It thinks that the second line is negative (as 23:00 > 1:00), while it is really positive, as 1:00 refers to the next day (1/2/2018), so I want the duration to be 2 hours.
How can I achieve such a result?
Any help will be appreciated!
You can try subtracting after converting to datetime, and for all the exceptional cases with negative values, add an extra day to the duration:
df['duration'] = pd.to_datetime(df.end_time) - pd.to_datetime(df.start_time)
df.loc[df.duration.dt.total_seconds() < 0, 'duration'] += pd.Timedelta(1, 'D')
Out:
       date start_time end_time duration
0  1/1/2018      20:00    21:00 01:00:00
1  1/1/2018      23:00     1:00 02:00:00

Randomly sample rows from a file based on times in columns

This is a bit complex, and I greatly appreciate any help! I am trying to randomly sample rows from a .csv file. Essentially, I want a resulting file of unique locations (locations are specified by the Easting and Northing columns of the data file, below).

I want to randomly pull 1 location per 12-hour period per SessionDate in this file (12-hour periods divided into: between 0631 and 1829 hours, and between 1830 and 0630 hours; given as Start: and End: in the data file, below); BUT if any 2 locations are within 6 hours of each other (based on their Start: time), I want that location to be tossed and a new location to be randomly drawn, and for this sampling to continue until no new locations are drawn (i.e., sampling WITHOUT replacement).

I have been trying to do this with python, but my experience is very limited. I tried first putting each row into a dictionary, and recently each row into a list, as follows:
import random
import csv

f = open('file.csv', "U")
rows = []
for line in f:
    rows.append(line.split(','))
I'm unsure where to go from here - how to sample from these lists the way I need to, then write them to an output file with my 'unique' locations.
Here is the top few lines of my data file:
SessionDate  Start:  End:   Easting  Northing
27-Apr-07    18:00   21:45  174739   9785206
28-Apr-07    18:00   21:30  171984   9784738
28-Apr-07    18:00   21:30  171984   9784738
28-Apr-07    18:00   21:30  171984   9784738
28-Apr-07    18:00   21:30  171984   9784738
It gets a bit complicated as some of the observations span midnight, so they may be on different dates, but can be within 6 hours of each other (which is why I have this criterion), for example:
SessionDate  Start:  End:   Easting  Northing
27-Apr-07    22:30   23:25  171984   9784738
28-Apr-07    0:25    1:30   174739   9785206
Here's my solution - I made a few changes to your data (the locations, to make it easier to eyeball the results). I basically create a dict of dates pointing to another dict of locations, which points to a list of selected rows.
data = """SessionDate Start: End: Easting Northing
27-Apr-07 18:00 21:45 A 1
27-Apr-07 18:00 21:30 G 2
28-Apr-07 18:00 21:30 B 2
28-Apr-07 18:00 21:30 B 2
28-Apr-07 18:00 21:30 B 2
29-Apr-07 8:00 11:30 C 3
29-Apr-07 20:00 21:30 C 3
29-Apr-07 20:00 21:30 C 3
30-Apr-07 8:00 10:30 D 4
30-Apr-07 16:00 17:30 E 5
30-Apr-07 14:00 21:30 F 6
30-Apr-07 18:00 21:30 F 6
"""
from datetime import datetime, timedelta

selected = {}
for line in data.split("\n"):
    if "Session" in line:
        continue
    if not line:
        continue
    tmp = [x for x in line.split() if x]
    raw_dt = " ".join([tmp[0], tmp[1]]).strip()
    curr_dt = datetime.strptime(raw_dt, "%d-%b-%y %H:%M")
    loc = (tmp[-2], tmp[-1])
    found = False
    for dt in selected:
        diff = dt - curr_dt
        if dt < curr_dt:
            diff = curr_dt - dt
        # print(dt, curr_dt, diff, diff <= timedelta(hours=12), loc, loc in selected[dt])
        if diff <= timedelta(hours=12):
            if loc not in selected[dt]:
                selected[dt].setdefault(loc, []).append(tmp)
            found = True
    if not found:
        if curr_dt not in selected:
            selected[curr_dt] = {}
        if loc not in selected[curr_dt]:
            selected[curr_dt][loc] = [tmp, ]

# if output needs to be sorted
rows = sorted(x for k in selected for l in selected[k] for x in selected[k][l])
for row in rows:
    print(" ".join(row))
This is not a complete answer, but something to point you in the right direction.
As I said in a comment, handling datetime objects in python is done with the datetime module. Here's a little example related to your problem:
from datetime import datetime

d1 = datetime.strptime("27-Apr-07 18:00", "%d-%b-%y %H:%M")
d2 = datetime.strptime("28-Apr-07 01:00", "%d-%b-%y %H:%M")
difference = d2 - d1

# difference in hours
dH = difference.days * 24 + difference.seconds / 3600
Other than that, simply loop through the sorted file; after reading a whole 12-hour block, sample randomly, make sure your uniqueness condition is met (if not, repeat) and move on. A rough sketch of that loop is below.
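
A minimal sketch of that loop, under some simplifying assumptions: rows are (start_datetime, easting, northing) tuples already sorted by time, blocks are measured 12 hours from the first row of each block rather than from the fixed 06:31/18:30 boundaries, and sample_blocks is a hypothetical helper name:

import random
from datetime import timedelta

def sample_blocks(rows, hours=12, min_gap=timedelta(hours=6)):
    # pick one random row per block, rejecting picks whose start time
    # is within min_gap of any already-selected pick
    picks = []
    block, block_end = [], None
    for row in rows + [None]:  # the trailing None flushes the last block
        if row is not None and block_end is not None and row[0] < block_end:
            block.append(row)
            continue
        # the current block is complete: sample from it without replacement
        candidates = block[:]
        random.shuffle(candidates)
        for cand in candidates:
            if all(abs(cand[0] - p[0]) > min_gap for p in picks):
                picks.append(cand)
                break
        if row is not None:
            block = [row]
            block_end = row[0] + timedelta(hours=hours)
    return picks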
