Collapsing rows with overlapping dates - python

I'm trying to collapse the below dataframe into rows containing continuous time periods by id. Continuous means that, within the same id, the start_date is either less than, equal to, or at most one day greater than the end_date of the previous row (the data is already sorted by id, start_date and end_date). All rows that are continuous should be output as a single row, with start_date being the minimum start_date (i.e. the start_date of the first row in the continuous set) and end_date being the maximum end_date from the continuous set of rows.
Please see the desired output at the bottom.
The only way I can think of approaching this is by parsing the dataframe row by row, but I was wondering if there is a more pythonic and elegant way to do it.
id start_date end_date
1 2017-01-01 2017-01-15
1 2017-01-12 2017-01-24
1 2017-01-25 2017-02-03
1 2017-02-05 2017-02-14
1 2017-02-16 2017-02-28
2 2017-01-01 2017-01-19
2 2017-01-24 2017-02-07
2 2017-02-07 2017-02-20
Here is the code to generate the input dataframe:
import numpy as np
import pandas as pd

start_date = ['2017-01-01', '2017-01-12', '2017-01-25', '2017-02-05', '2017-02-16',
              '2017-01-01', '2017-01-24', '2017-02-07']
end_date = ['2017-01-15', '2017-01-24', '2017-02-03', '2017-02-14', '2017-02-28',
            '2017-01-19', '2017-02-07', '2017-02-20']
df = pd.DataFrame({'id': [1, 1, 1, 1, 1, 2, 2, 2],
                   'start_date': pd.to_datetime(start_date, format='%Y-%m-%d'),
                   'end_date': pd.to_datetime(end_date, format='%Y-%m-%d')})
Desired output:
id start_date end_date
1 2017-01-01 2017-02-03
1 2017-02-05 2017-02-14
1 2017-02-16 2017-02-28
2 2017-01-01 2017-01-19
2 2017-01-24 2017-02-20

def f(grp):
    # define a list to collect valid start and end ranges
    d = []
    (
        # append a new row if the start date is at least 2 days greater than the end date of the
        # previous row, otherwise update the last row's end date with the current row's end date
        grp.reset_index(drop=True)
           .apply(lambda x: d.append({x.start_date: x.end_date})
                  if x.name == 0 or (x.start_date - pd.DateOffset(1)) > grp.iloc[x.name - 1].end_date
                  else d[-1].update({list(d[-1].keys())[0]: x.end_date}),
                  axis=1)
    )
    # reconstruct a df using only the valid start and end date pairs
    return pd.DataFrame([[list(e.keys())[0], list(e.values())[0]] for e in d],
                        columns=['start_date', 'end_date'])

df.groupby('id').apply(f).reset_index().drop('level_1', axis=1)
Out[467]:
id start_date end_date
0 1 2017-01-01 2017-02-03
1 1 2017-02-05 2017-02-14
2 1 2017-02-16 2017-02-28
3 2 2017-01-01 2017-01-19
4 2 2017-01-24 2017-02-20
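For reference, the usual vectorized idiom for this kind of collapsing avoids the row-wise apply: flag the rows that start a new continuous block, cumulative-sum the flags into a block id, then aggregate. A minimal sketch of that alternative (not the answer above), assuming the frame is sorted by id, start_date and end_date as stated, and that end_date does not step backwards within a run (true for the sample data):

prev_end = df.groupby('id')['end_date'].shift()
# a row starts a new block when its start is more than one day after the previous end
new_block = (df['start_date'] - prev_end) > pd.Timedelta(days=1)
out = (df.assign(block=new_block.cumsum())
         .groupby(['id', 'block'], as_index=False)
         .agg(start_date=('start_date', 'min'), end_date=('end_date', 'max'))
         .drop(columns='block'))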

Related

Comparing date column values in one dataframe with two date columns in another dataframe by row in Pandas

I have a dataframe like this with two date columns and a quantity column:
start_date end_date qty
1 2018-01-01 2018-01-08 23
2 2018-01-08 2018-01-15 21
3 2018-01-15 2018-01-22 5
4 2018-01-22 2018-01-29 12
I have a second dataframe with just one column containing yearly holidays for a couple of years, like this:
holiday
1 2018-01-01
2 2018-01-27
3 2018-12-25
4 2018-12-26
I would like to go through the first dataframe row by row and assign a boolean value to a new column holidays if a date in the second dataframe falls between the date values of the first dataframe. The result would look like this:
start_date end_date qty holidays
1 2018-01-01 2018-01-08 23 True
2 2018-01-08 2018-01-15 21 False
3 2018-01-15 2018-01-22 5 False
4 2018-01-22 2018-01-29 12 True
When I try to do that with a for loop I get the following error:
ValueError: Can only compare identically-labeled Series objects
An answer would be appreciated.
If you want a fully-vectorized solution, consider using the underlying numpy arrays:
import numpy as np

def holiday_arr(start, end, holidays):
    # broadcast (n_rows, 1) date bounds against (1, n_holidays) holidays
    start = start.reshape((-1, 1))
    end = end.reshape((-1, 1))
    holidays = holidays.reshape((1, -1))
    result = np.any(
        (start <= holidays) & (holidays <= end),
        axis=1
    )
    return result
If you have your dataframes as above (calling them df1 and df2), you can obtain your desired result by running:
df1["contains_holiday"] = holiday_arr(
df1["start_date"].to_numpy(),
df1["end_date"].to_numpy(),
df2["holiday"].to_numpy()
)
df1 then looks like:
start_date end_date qty contains_holiday
1 2018-01-01 2018-01-08 23 True
2 2018-01-08 2018-01-15 21 False
3 2018-01-15 2018-01-22 5 False
4 2018-01-22 2018-01-29 12 True
try:
def _is_holiday(row, df2):
    return ((df2['holiday'] >= row['start_date']) & (df2['holiday'] <= row['end_date'])).any()

df1.apply(lambda x: _is_holiday(x, df2), axis=1)
I'm not sure why you would want to go row-by-row. But boolean comparisons would be way faster.
df['holiday'] = ((df2.holiday >= df.start_date) & (df2.holiday <= df.end_date))
Time
>>> 1000 loops, best of 3: 1.05 ms per loop
Quoting @hchw's solution (row-by-row):
def _is_holiday(row, df2):
    return ((df2['holiday'] >= row['start_date']) & (df2['holiday'] <= row['end_date'])).any()

df.apply(lambda x: _is_holiday(x, df2), axis=1)
>>> The slowest run took 4.89 times longer than the fastest. This could mean that an intermediate result is being cached.
100 loops, best of 3: 4.46 ms per loop
Try IntervalIndex.contains with a list comprehension and np.sum:
iix = pd.IntervalIndex.from_arrays(df1.start_date, df1.end_date, closed='both')
df1['holidays'] = np.sum([iix.contains(x) for x in df2.holiday], axis=0) >= 1
Out[812]:
start_date end_date qty holidays
1 2018-01-01 2018-01-08 23 True
2 2018-01-08 2018-01-15 21 False
3 2018-01-15 2018-01-22 5 False
4 2018-01-22 2018-01-29 12 True
Note: I assume the start_date, end_date and holiday columns are in datetime format. If they are not, convert them before running the commands above, as follows:
df1.start_date = pd.to_datetime(df1.start_date)
df1.end_date = pd.to_datetime(df1.end_date)
df2.holiday = pd.to_datetime(df2.holiday)

Pandas - Add at least one row for every day (datetimes include a time)

Edit: You can use the alleged duplicate solution with reindex() if your dates don't include times, otherwise you need a solution like the one by @kosnik. In addition, their solution doesn't need your dates to be the index!
I have data formatted like this
df = pd.DataFrame(data=[['2017-02-12 20:25:00', 'Sam', '8'],
                        ['2017-02-15 16:33:00', 'Scott', '10'],
                        ['2017-02-15 16:45:00', 'Steve', '5']],
                  columns=['Datetime', 'Sender', 'Count'])
df['Datetime'] = pd.to_datetime(df['Datetime'], format='%Y-%m-%d %H:%M:%S')
Datetime Sender Count
0 2017-02-12 20:25:00 Sam 8
1 2017-02-15 16:33:00 Scott 10
2 2017-02-15 16:45:00 Steve 5
I need there to be at least one row for every date, so the expected result would be
Datetime Sender Count
0 2017-02-12 20:25:00 Sam 8
1 2017-02-13 00:00:00 None 0
2 2017-02-14 00:00:00 None 0
3 2017-02-15 16:33:00 Scott 10
4 2017-02-15 16:45:00 Steve 5
I have tried to make datetime the index, add the dates and use reindex() like so
from datetime import timedelta

df.index = df['Datetime']
values = df['Datetime'].tolist()
for i in range(len(values) - 1):
    if values[i].date() + timedelta(days=1) < values[i + 1].date():
        values.insert(i + 1, pd.Timestamp(values[i].date() + timedelta(days=1)))
print(df.reindex(values, fill_value=0))
This makes every row lose its values in the other columns, and the same thing happens with asfreq('D') or resample():
ID Sender Count
Datetime
2017-02-12 16:25:00 0 Sam 8
2017-02-13 00:00:00 0 0 0
2017-02-14 00:00:00 0 0 0
2017-02-15 20:25:00 0 0 0
2017-02-15 20:25:00 0 0 0
What would be the appropriate way of going about this?
I would create a new DataFrame that contains all the required datetimes and then left join it with your dataframe.
A working code example is the following:
import datetime
import numpy as np

df['Datetime'] = pd.to_datetime(df['Datetime'])  # first convert to datetimes
datetimes = df['Datetime'].tolist()              # existing datetimes - the missing ones get added here
dates = [x.date() for x in datetimes]            # converted to dates
min_date = min(dates)
max_date = max(dates)
for d in range((max_date - min_date).days):
    forward_date = min_date + datetime.timedelta(d)
    if forward_date not in dates:
        datetimes.append(np.datetime64(forward_date))
# create a new dataframe, merge and fill the 'Count' column with zeroes
df = pd.DataFrame({'Datetime': datetimes}).merge(df, on='Datetime', how='left')
df['Count'].fillna(0, inplace=True)
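For the situation the question's edit mentions (dates with no time component), a minimal sketch of the reindex() route from the alleged duplicate, assuming 'Datetime' holds plain midnight dates:

daily = df.set_index('Datetime')
full_index = pd.date_range(daily.index.min(), daily.index.max(), freq='D')  # every calendar day
daily = daily.reindex(full_index)            # missing days appear as NaN rows
daily['Count'] = daily['Count'].fillna(0)    # only Count is zero-filled; Sender stays NaN

With timestamps like 2017-02-12 20:25:00, reindexing against a midnight-only index drops the original rows (their exact timestamps are not in the daily index), which is why the merge-based answer above is needed here.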

python compare date list to start and end date columns in dataframe

Problem: I have a dataframe with two columns: Start date and End date. I also have a list of dates. So lets say the data looks something like this:
data = [['1/1/2018', '3/1/2018'], ['2/1/2018', '3/1/2018'], ['4/1/2018', '6/1/2018']]
df = pd.DataFrame(data, columns=['startdate', 'enddate'])
df['startdate'] = pd.to_datetime(df['startdate'])
df['enddate'] = pd.to_datetime(df['enddate'])
dates = ['1/1/2018', '2/1/2018']
What I need to do is:
1) Create a new column for each date in the dates list.
2) For each row in the df, if the date for the newly created column is between the start and end dates, assign 1; if not, assign 0.
I have tried to use zip, but then I realized that the df will have thousands of rows, whereas the dates list will contain about 24 items (spanning 2 years), so it stops when the dates list is exhausted, i.e., at 24.
So below is what the original df looks like and how it should look like afterwards:
Before:
startdate enddate
0 2018-01-01 2018-03-01
1 2018-02-01 2018-03-01
2 2018-04-01 2018-06-01
After:
startdate enddate 1/1/2018 2/1/2018
0 1/1/2018 3/1/2018 1 1
1 2/1/2018 3/1/2018 0 1
2 4/1/2018 6/1/2018 0 0
Any help on this would be much appreciated, thanks!
Using numpy broadcast
s1 = df.startdate.values
s2 = df.enddate.values
v = pd.to_datetime(pd.Series(dates)).values[:, None]
newdf = pd.DataFrame(((s1 <= v) & (s2 >= v)).T.astype(int), columns=dates, index=df.index)
pd.concat([df, newdf], axis=1)
startdate enddate 1/1/2018 2/1/2018
0 2018-01-01 2018-03-01 1 1
1 2018-02-01 2018-03-01 0 1
2 2018-04-01 2018-06-01 0 0
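A plainer alternative (a hedged sketch, not taken from the answer above): since the dates list is short (about 24 entries, as described), a simple loop over it builds one 0/1 column per date:

# assumes startdate/enddate are already datetime, as in the 'Before' output
for d in dates:
    ts = pd.to_datetime(d)
    df[d] = ((df['startdate'] <= ts) & (ts <= df['enddate'])).astype(int)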

how to merge group rows in dataframe based on differences between datetime?

I have a dataframe which contains an event on each row, with a Start and End datetime.
import pandas as pd
import datetime

df = pd.DataFrame({'Value': [1., 2., 3.],
                   'Start': [datetime.datetime(2017, 1, 1, 0, 0, 0),
                             datetime.datetime(2017, 1, 1, 0, 1, 0),
                             datetime.datetime(2017, 1, 1, 0, 4, 0)],
                   'End': [datetime.datetime(2017, 1, 1, 0, 0, 59),
                           datetime.datetime(2017, 1, 1, 0, 5, 0),
                           datetime.datetime(2017, 1, 1, 0, 6, 0)]},
                  index=[0, 1, 2])
df
Out[7]:
End Start Value
0 2017-01-01 00:00:59 2017-01-01 00:00:00 1.0
1 2017-01-01 00:05:00 2017-01-01 00:01:00 2.0
2 2017-01-01 00:07:00 2017-01-01 00:06:00 3.0
I would like to group consecutive rows where the difference between the End and Start of consecutive rows is smaller than a given timedelta.
e.g. here, for a timedelta of 5 seconds I would like to group the rows with index 0, 1, and with a timedelta of 2 minutes it should yield rows 0, 1, 2.
A solution would be to compare consecutive rows with their shifted version using .shift(); however, I would need to iterate the comparison multiple times if groups of more than 2 rows need to be merged.
As my df is very large, this is not an option.
threshold = datetime.timedelta(minutes=5)
df['delta'] = df['End'] - df['Start']
df['group'] = (df['delta'] - df['delta'].shift(-1) <= threshold).cumsum()
groups = df.groupby('group')
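A gap-based variant of the same shift/cumsum idea (a hedged sketch, not taken from the answers here) flags a new group whenever a row's Start is more than the threshold after the previous row's End, assuming the frame is sorted by Start:

threshold = datetime.timedelta(minutes=2)
gap = df['Start'] - df['End'].shift()     # time between consecutive events
df['group'] = (gap > threshold).cumsum()  # new group id whenever the gap exceeds the threshold
merged = df.groupby('group').agg(Start=('Start', 'min'), End=('End', 'max'))
# how Value should be combined per group is left open, as the question does not say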
I assume you are trying to aggregate based on the time difference.
marker = 60
df = df.assign(diff=df.apply(lambda row: (row.End - row.Start).total_seconds() <= marker, axis=1))
for g in df.groupby('diff'):
    print(g[1])
End Start Value diff
1 2017-01-01 00:05:00 2017-01-01 00:01:00 2.0 False
2 2017-01-01 00:06:00 2017-01-01 00:04:00 3.0 False
End Start Value diff
0 2017-01-01 00:00:59 2017-01-01 00:00:00 1.0 True

Merging two dataframes based on a date between two other dates without a common column

I have two dataframes that I need to merge based on whether or not a date value fits in between two other dates. Basically, I need to perform an outer join where B.event_date is between A.start_date and A.end_date. It seems that merge and join always assume a common column which in this case, I do not have.
A
  start_date   end_date
0 2017-03-27 2017-04-20
1 2017-01-10 2017-02-01

B
  event_date  price
0 2017-01-20    100
1 2017-01-27    200
Result
start_date end_date event_date price
0 2017-03-27 2017-04-20
1 2017-01-10 2017-02-01 2017-01-20 100
2 2017-01-10 2017-02-01 2017-01-27 200
Create data and format to datetimes:
df_A = pd.DataFrame({'start_date':['2017-03-27','2017-01-10'],'end_date':['2017-04-20','2017-02-01']})
df_B = pd.DataFrame({'event_date':['2017-01-20','2017-01-27'],'price':[100,200]})
df_A['end_date'] = pd.to_datetime(df_A.end_date)
df_A['start_date'] = pd.to_datetime(df_A.start_date)
df_B['event_date'] = pd.to_datetime(df_B.event_date)
Create keys to do a cross join:
New in pandas 1.2.0+: use how='cross' instead of assigning pseudo keys:
df_merge = df_A.merge(df_B, how='cross')
Else, with pandas < 1.2.0, use a pseudo key to merge on 'key':
df_A = df_A.assign(key=1)
df_B = df_B.assign(key=1)
df_merge = pd.merge(df_A, df_B, on='key').drop('key',axis=1)
Filter out records that do not meet criteria of event dates between start and end dates:
df_merge = df_merge.query('event_date >= start_date and event_date <= end_date')
Join back to the original date range table and drop the key column (only present on the pseudo-key path):
df_out = (df_A.merge(df_merge, on=['start_date', 'end_date'], how='left')
              .fillna('')
              .drop(columns='key', errors='ignore'))
print(df_out)
Output:
end_date start_date event_date price
0 2017-04-20 00:00:00 2017-03-27 00:00:00
1 2017-02-01 00:00:00 2017-01-10 00:00:00 2017-01-20 00:00:00 100
2 2017-02-01 00:00:00 2017-01-10 00:00:00 2017-01-27 00:00:00 200
conditional_join from pyjanitor may be helpful for its abstraction/convenience; the function is currently in dev:
# pip install git+https://github.com/pyjanitor-devs/pyjanitor.git
import pandas as pd
import janitor
Reusing @scottboston's data:
df_B.conditional_join(
    df_A,
    ('event_date', 'start_date', '>='),
    ('event_date', 'end_date', '<='),
    how='right'
)
left right
event_date price start_date end_date
0 NaT NaN 2017-03-27 2017-04-20
1 2017-01-20 100.0 2017-01-10 2017-02-01
2 2017-01-27 200.0 2017-01-10 2017-02-01
Under the hood, it uses np.searchsorted (binary search) to avoid the cartesian join.
Note that if the intervals do not overlap, pd.IntervalIndex is a more efficient solution.
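For that non-overlapping case, a minimal sketch of the IntervalIndex route (my illustration of the note above, not pyjanitor's internals):

# one interval per row of df_A; get_indexer returns the matching row position
# (or -1) for each event date and requires the intervals not to overlap
iix = pd.IntervalIndex.from_arrays(df_A['start_date'], df_A['end_date'], closed='both')
pos = iix.get_indexer(df_B['event_date'])
hit = pos >= 0
matches = df_B.loc[hit].assign(
    start_date=df_A['start_date'].to_numpy()[pos[hit]],
    end_date=df_A['end_date'].to_numpy()[pos[hit]],
)

Rows of df_A with no matching event (the first row here) would still have to be appended to reproduce the right-join output above.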
