Suppose I have two Date inputs : 2020-01-20 11:35:00 and 2020-01-25 08:00:00 .
I want a output DataFrame as :
time1 time2
-------------------------------------------
2020-01-20 11:35:00 | 2020-01-21 00:00:00
2020-01-21 00:00:00 | 2020-01-22 00:00:00
2020-01-22 00:00:00 | 2020-01-23 00:00:00
2020-01-23 00:00:00 | 2020-01-24 00:00:00
2020-01-24 00:00:00 | 2020-01-25 00:00:00
2020-01-25 00:00:00 | 2020-01-25 08:00:00
no built in way to do this, we can use iloc and pd.date_range to assign the first and last dates and generate your date range.
t1 = pd.Timestamp('2020-01-20 11:35:00')
t2 = pd.Timestamp('2020-01-25 08:00:00')
df = pd.DataFrame({'Time1' : pd.date_range(t1.date(),t2.date())})
df = df.assign(Time2 = df['Time1'] + pd.DateOffset(days=1))
df.iloc[0,0] = t1
df.iloc[-1,1] = t2
print(df)
Time1 Time2
0 2020-01-20 11:35:00 2020-01-21 00:00:00
1 2020-01-21 00:00:00 2020-01-22 00:00:00
2 2020-01-22 00:00:00 2020-01-23 00:00:00
3 2020-01-23 00:00:00 2020-01-24 00:00:00
4 2020-01-24 00:00:00 2020-01-25 00:00:00
5 2020-01-25 00:00:00 2020-01-25 08:00:00
You can use date_range with both dates and then create the dataframe.
d1 = pd.to_datetime('2020-01-20 11:35:00')
d2 = pd.to_datetime('2020-01-25 08:00:00')
l = pd.date_range(d1.date(), d2.date(), freq='d').tolist()[1:] #remove the first date
df = pd.DataFrame({'time1':[d1] + l, 'time2':l + [d2]})
print (df)
time1 time2
0 2020-01-20 11:35:00 2020-01-21 00:00:00
1 2020-01-21 00:00:00 2020-01-22 00:00:00
2 2020-01-22 00:00:00 2020-01-23 00:00:00
3 2020-01-23 00:00:00 2020-01-24 00:00:00
4 2020-01-24 00:00:00 2020-01-25 00:00:00
5 2020-01-25 00:00:00 2020-01-25 08:00:00
Related
I'm trying to measure the difference between timestamps using certain conditions. Using below, for each unique ID, I'm hoping to subtract the End Time where Item == A and the Start Time where Item == D.
So the timestamps are actually located on separate rows.
At the moment my process is returning an error. I'm also hoping to drop the .shift() for something more robust as each unique ID will have different combinations. For ex, A,B,C,D - A,B,D - A,D etc.
df = pd.DataFrame({'ID': [10,10,10,20,20,30],
'Start Time': ['2019-08-02 09:00:00','2019-08-03 10:50:00','2019-08-05 16:00:00','2019-08-04 08:00:00','2019-08-04 15:30:00','2019-08-06 11:00:00'],
'End Time': ['2019-08-04 15:00:00','2019-08-04 16:00:00','2019-08-05 16:00:00','2019-08-04 14:00:00','2019-08-05 20:30:00','2019-08-07 10:00:00'],
'Item': ['A','B','D','A','D','A'],
})
df['Start Time'] = pd.to_datetime(df['Start Time'])
df['End Time'] = pd.to_datetime(df['End Time'])
df['diff'] = (df.groupby('ID')
.apply(lambda x: x['End Time'].shift(1) - x['Start Time'].shift(1))
.reset_index(drop=True))
Intended Output:
ID Start Time End Time Item diff
0 10 2019-08-02 09:00:00 2019-08-04 15:00:00 A NaT
1 10 2019-08-03 10:50:00 2019-08-04 16:00:00 B NaT
2 10 2019-08-05 16:00:00 2019-08-05 16:00:00 D 1 days 01:00:00
3 20 2019-08-04 08:00:00 2019-08-04 14:00:00 A NaT
4 20 2019-08-04 15:30:00 2019-08-05 20:30:00 D 0 days 01:30:00
5 30 2019-08-06 11:00:00 2019-08-07 10:00:00 A NaT
df2 = df.set_index('ID')
df2.query('Item == "D"')['Start Time']-df2.query('Item == "A"')['End Time']
output:
ID
10 2 days 05:30:00
20 0 days 20:30:00
30 NaT
dtype: timedelta64[ns]
older answer
The issue is your fillna, you can't have strings in a timedelta column:
df['diff'] = (df.groupby('ID')
.apply(lambda x: x['End Time'].shift(1) - x['Start Time'].shift(1))
#.fillna('-') # the issue is here
.reset_index(drop=True))
output:
ID Start Time End Time Item diff
0 10 2019-08-02 09:00:00 2019-08-02 09:30:00 A NaT
1 10 2019-08-03 10:50:00 2019-08-03 11:00:00 B 0 days 00:30:00
2 10 2019-08-04 15:00:00 2019-08-05 16:00:00 C 0 days 00:10:00
3 20 2019-08-04 08:00:00 2019-08-04 14:00:00 B NaT
4 20 2019-08-05 10:30:00 2019-08-05 20:30:00 C 0 days 06:00:00
5 30 2019-08-06 11:00:00 2019-08-07 10:00:00 A NaT
IIUC use:
df1 = df.pivot('ID','Item')
print (df1)
Start Time \
Item A B D
ID
10 2019-08-02 09:00:00 2019-08-03 10:50:00 2019-08-04 15:00:00
20 2019-08-04 08:00:00 NaT 2019-08-05 10:30:00
30 2019-08-06 11:00:00 NaT NaT
End Time
Item A B D
ID
10 2019-08-02 09:30:00 2019-08-03 11:00:00 2019-08-05 16:00:00
20 2019-08-04 14:00:00 NaT 2019-08-05 20:30:00
30 2019-08-07 10:00:00 NaT NaT
a = df1[('Start Time','D')].sub(df1[('End Time','A')])
print (a)
ID
10 2 days 05:30:00
20 0 days 20:30:00
30 NaT
dtype: timedelta64[ns]
I have a df with a date index as follow:
ind = pd.date_range(start="2015-12-31", end = "2022-04-26", freq="D")
df = pd.DataFrame(
{
"col1": range(len(ind))
},
index=ind
)
What I need is slice the df in windows from the end of each month from 2017-08-31 to 3 years plus 1 month, so I have the next chunk of code
n = timedelta(365 * 3) + relativedelta(months=1)
fechas_ = pd.date_range("2017-08-31", ind.max() - n, freq="M")
# create a for loop to check the beginning and the end of each window
for i in fechas_:
print(f"start: {i}")
print(f"end: {i + n}")
print("\n")
My problem is that I need the last day of the month as the end of each window e.g.:
# first window
start: 2017-08-31 00:00:00
end: 2020-09-30 00:00:00
# second window
start: 2017-09-30 00:00:00
end: 2020-10-31 00:00:00
# so on
But I'm getting:
# first window
start: 2017-08-31 00:00:00
end: 2020-09-29 00:00:00
# second window
start: 2017-09-30 00:00:00
end: 2020-10-29 00:00:00
# 3
2017-10-31 00:00:00
2020-11-29 00:00:00
# 4
2017-11-30 00:00:00
2020-12-29 00:00:00
# 5
2017-12-31 00:00:00
2021-01-30 00:00:00
# 6
2018-01-31 00:00:00
2021-02-27 00:00:00
# 7
2018-02-28 00:00:00
2021-03-27 00:00:00
# 8
2018-03-31 00:00:00
2021-04-29 00:00:00
# 9
2018-04-30 00:00:00
2021-05-29 00:00:00
# 10
2018-05-31 00:00:00
2021-06-29 00:00:00
# 11
2018-06-30 00:00:00
2021-07-29 00:00:00
# 12
2018-07-31 00:00:00
2021-08-30 00:00:00
# 13
2018-08-31 00:00:00
2021-09-29 00:00:00
# 14
2018-09-30 00:00:00
2021-10-29 00:00:00
# 15
2018-10-31 00:00:00
2021-11-29 00:00:00
# 16
2018-11-30 00:00:00
2021-12-29 00:00:00
# 17
2018-12-31 00:00:00
2022-01-30 00:00:00
# 18
2019-01-31 00:00:00
2022-02-27 00:00:00
# 19
2019-02-28 00:00:00
2022-03-27 00:00:00
Does someone know how can I solve this?
Thanks a lot
In your code
n = timedelta(365 * 3) + relativedelta(months=1)
try replacing it with
n = relativedelta(years=3, months=1, day=31)
I have a dataframe like as shown below
df = pd.DataFrame({'person_id': [101,101,101,101,202,202,202],
'start_date':['5/7/2013 09:27:00 AM','09/08/2013 11:21:00 AM','06/06/2014 08:00:00 AM','06/06/2014 05:00:00 AM','12/11/2011 10:00:00 AM','13/10/2012 12:00:00 AM','13/12/2012 11:45:00 AM'],
'end_date':['5/12/2013 09:27:00 AM',np.nan,'06/11/2014 08:00:00 AM',np.nan,'12/16/2011 10:00:00','10/18/2012 00:00:00',np.nan],
'type':['O','I','O','O','I','O','I']})
df.start_date = pd.to_datetime(df.start_date)
df['end_date'] = pd.to_datetime(df.end_date)
I would like to fillna() under the end_date column based on two approaches below
a) If NA is found in any row except last row of that person, fillna by copying the value from next row
b) If NA is found in the last row of that person fillna by adding 10 days to his start_date (because there is no next row for that person to copy from. So, we give random value of 10 days)
The rules a and b only for persons with type=I.
For persons with type=O, just fillna by copying the value from start_date.
This is what I tried. You can see am writing the same code line twice.
df['end_date'] = np.where(df['type'].str.contains('I'),pd.DatetimeIndex(df['end_date'].bfill()),pd.DatetimeIndex(df.start_date.dt.date))
df['end_date'] = np.where(df['type'].str.contains('I'),pd.DatetimeIndex(df['start_date'] + pd.DateOffset(10)),pd.DatetimeIndex(df.start_date.dt.date))
Any elegant and efficient way to write this as I have to apply this on a big data with 15 million rows?
I expect my output to be like as shown below
Solution
s1 = df.groupby('person_id')['start_date'].shift(-1)
s1 = s1.fillna(df['start_date'] + pd.DateOffset(days=10))
s1 = df['end_date'].fillna(s1)
s2 = df['end_date'].fillna(df['start_date'])
df['end_date'] = np.where(df['type'].eq('I'), s1, s2)
Explanations
Group the dataframe on person_id and shift the column start_date one units upwards.
>>> df.groupby('person_id')['start_date'].shift(-1)
0 2013-09-08 11:21:00
1 2014-06-06 08:00:00
2 2014-06-06 05:00:00
3 NaT
4 2012-10-13 00:00:00
5 2012-12-13 11:45:00
6 NaT
Name: start_date, dtype: datetime64[ns]
Fill the NaN values in the shifted column with the values from start_date column after adding an offset of 10 days
>>> s1.fillna(df['start_date'] + pd.DateOffset(days=10))
0 2013-09-08 11:21:00
1 2014-06-06 08:00:00
2 2014-06-06 05:00:00
3 2014-06-16 05:00:00
4 2012-10-13 00:00:00
5 2012-12-13 11:45:00
6 2012-12-23 11:45:00
Name: start_date, dtype: datetime64[ns]
Now fill the NaN values in end_date column with the above series s1
>>> df['end_date'].fillna(s1)
0 2013-05-12 09:27:00
1 2014-06-06 08:00:00
2 2014-06-11 08:00:00
3 2014-06-16 05:00:00
4 2011-12-16 10:00:00
5 2012-10-18 00:00:00
6 2012-12-23 11:45:00
Name: end_date, dtype: datetime64[ns]
Similarly fill the NaN values in end_date column with the values from start_date column to create a series s2
>>> df['end_date'].fillna(df['start_date'])
0 2013-05-12 09:27:00
1 2013-09-08 11:21:00
2 2014-06-11 08:00:00
3 2014-06-06 05:00:00
4 2011-12-16 10:00:00
5 2012-10-18 00:00:00
6 2012-12-13 11:45:00
Name: end_date, dtype: datetime64[ns]
Then use np.where to select the values from s1 / s2 based on the condition where the type is I or O
>>> df
person_id start_date end_date type
0 101 2013-05-07 09:27:00 2013-05-12 09:27:00 O
1 101 2013-09-08 11:21:00 2014-06-06 08:00:00 I
2 101 2014-06-06 08:00:00 2014-06-11 08:00:00 O
3 101 2014-06-06 05:00:00 2014-06-06 05:00:00 O
4 202 2011-12-11 10:00:00 2011-12-16 10:00:00 I
5 202 2012-10-13 00:00:00 2012-10-18 00:00:00 O
6 202 2012-12-13 11:45:00 2012-12-23 11:45:00 I
I'm working on a timeseries dataframe which looks like this and has data from January to August 2020.
Timestamp Value
2020-01-01 00:00:00 -68.95370
2020-01-01 00:05:00 -67.90175
2020-01-01 00:10:00 -67.45966
2020-01-01 00:15:00 -67.07624
2020-01-01 00:20:00 -67.30549
.....
2020-07-01 00:00:00 -65.34212
I'm trying to apply a filter on the previous dataframe using the columns start_time and end_time in the dataframe below:
start_time end_time
2020-01-12 16:15:00 2020-01-13 16:00:00
2020-01-26 16:00:00 2020-01-26 16:10:00
2020-04-12 16:00:00 2020-04-13 16:00:00
2020-04-20 16:00:00 2020-04-21 16:00:00
2020-05-02 16:00:00 2020-05-03 16:00:00
The output should assign all values which are not within the start and end time as zero and retain values for the start and end times specified in the filter. I tried applying two simultaneous filters for start and end time but didn't work.
Any help would be appreciated.
Idea is create all masks by Series.between in list comprehension, then join with logical_or by np.logical_or.reduce and last pass to Series.where:
print (df1)
Timestamp Value
0 2020-01-13 00:00:00 -68.95370 <- changed data for match
1 2020-01-01 00:05:00 -67.90175
2 2020-01-01 00:10:00 -67.45966
3 2020-01-01 00:15:00 -67.07624
4 2020-01-01 00:20:00 -67.30549
5 2020-07-01 00:00:00 -65.34212
L = [df1['Timestamp'].between(s, e) for s, e in df2[['start_time','end_time']].values]
m = np.logical_or.reduce(L)
df1['Value'] = df1['Value'].where(m, 0)
print (df1)
Timestamp Value
0 2020-01-13 00:00:00 -68.9537
1 2020-01-01 00:05:00 0.0000
2 2020-01-01 00:10:00 0.0000
3 2020-01-01 00:15:00 0.0000
4 2020-01-01 00:20:00 0.0000
5 2020-07-01 00:00:00 0.0000
Solution using outer join of merge method and query:
print(df1)
timestamp Value <- changed Timestamp to timestamp to avoid name conflict in query
0 2020-01-13 00:00:00 -68.95370 <- also changed data for match
1 2020-01-01 00:05:00 -67.90175
2 2020-01-01 00:10:00 -67.45966
3 2020-01-01 00:15:00 -67.07624
4 2020-01-01 00:20:00 -67.30549
5 2020-07-01 00:00:00 -65.34212
df1.loc[df1.index.difference(df1.assign(key=0).merge(df2.assign(key=0), how = 'outer')\
.query("timestamp >= start_time and timestamp < end_time").index),"Value"] = 0
result:
timestamp Value
0 2020-01-13 00:00:00 -68.9537
1 2020-01-01 00:05:00 0.0000
2 2020-01-01 00:10:00 0.0000
3 2020-01-01 00:15:00 0.0000
4 2020-01-01 00:20:00 0.0000
5 2020-07-01 00:00:00 0.0000
Key assign(key=0) is added to both dataframes to produce cartesian product.
For clarity here is MRE:
df = pd.DataFrame(
{"id":[1,2,3,4],
"start_time":["2020-06-01 01:00:00", "2020-06-01 01:00:00", "2020-06-01 19:00:00", "2020-06-02 04:00:00"],
"end_time":["2020-06-01 14:00:00", "2020-06-01 18:00:00", "2020-06-02 10:00:00", "2020-06-02 16:00:00"]
})
df["start_time"] = pd.to_datetime(df["start_time"])
df["end_time"] = pd.to_datetime(df["end_time"])
df["sub_time"] = df["end_time"] - df["start_time"]
this outputs:
id start_time end_time sub_time
0 1 2020-06-01 01:00:00 2020-06-01 14:00:00 13:00:00
1 2 2020-06-01 01:00:00 2020-06-01 18:00:00 17:00:00
2 3 2020-06-01 19:00:00 2020-06-02 10:00:00 15:00:00
3 4 2020-06-02 04:00:00 2020-06-02 16:00:00 12:00:00
but when start_time ~ end_time consists of times range 00:00:00~03:59:59am I want to ignore it(not calculated in sub_time)
So instead of output above I would get:
id start_time end_time sub_time
0 1 2020-06-01 01:00:00 2020-06-01 14:00:00 10:00:00
1 2 2020-06-01 01:00:00 2020-06-01 18:00:00 14:00:00
2 3 2020-06-01 19:00:00 2020-06-02 10:00:00 11:00:00
3 4 2020-06-02 04:00:00 2020-06-02 16:00:00 12:00:00
row 0: starting at 01:00:00 do not count until 04:00:00. then 04:00:00 ~ 14:00:00 is 10 hour period
row 2: consider duration from 19:00:00 ~ 24:00:00 and 04:00:00 ~ 10:00:00 thus we get 11:00:00 in sub_time column.
Any suggestions?