I have a pandas dataframe where each row corresponds to a period of time for a given record. If a record has more than one period of time there is a gap between them. I would like to fill in all the missing time periods that are between the end of the first time period and the start of the final time period.
My data looks like this:
record = [1, 1, 2, 2, 2]
start_time = pd.to_datetime(['2001-01-01', '2001-02-01', '2000-01-01', '2001-05-31', '2001-09-01'])
stop_time = pd.to_datetime(['2001-01-15', '2001-02-28', '2001-01-31', '2001-08-16', '2001-09-30'])
df = pd.DataFrame({'record': record, 'start_time': start_time, 'stop_time': stop_time})
record start_time stop_time
0 1 2001-01-01 2001-01-15
1 1 2001-02-01 2001-02-28
2 2 2000-01-01 2001-01-31
3 2 2001-05-31 2001-08-16
4 2 2001-09-01 2001-09-30
The gaps in time are between lines 0 and 1 (the stop time is 2001-01-15 and the next start time is 2001-02-01, which is a 16 day gap), as well as 2 and 3, and 3 and 4. Gaps can only happen between the first and last row for a given record.
What I'm trying to achieve is this:
record start_time stop_time
0 1 2001-01-01 2001-01-15
1 1 2001-01-16 2001-01-31
2 1 2001-02-01 2001-02-28
3 2 2000-01-01 2001-01-31
4 2 2001-02-01 2001-05-30
5 2 2001-05-31 2001-08-16
6 2 2001-08-17 2001-08-31
7 2 2001-09-01 2001-09-30
That is, I want to add in rows that have start and stop times that fit those gaps. So in the previous example there would be a new row for record 1 with a start date of 2001-01-16 and an end date of 2001-01-31.
The full dataset has over 2M rows across 1.5M records, so I'm looking for a vectorized solution in pandas that doesn't use apply and is relatively efficient.
Maybe something like this?
import pandas as pd
record = [1, 1, 2, 2, 2]
start_time = pd.to_datetime(['2001-01-01', '2001-02-01', '2000-01-01', '2001-05-31', '2001-09-01'])
stop_time = pd.to_datetime(['2001-01-15', '2001-02-28', '2001-01-31', '2001-08-16', '2001-09-30'])
df = pd.DataFrame({'record': record, 'start_time': start_time, 'stop_time': stop_time})
one_day = pd.Timedelta('1d')
missing_dates = []
for record_id, df_per_record in df.groupby('record'):
    start_time = pd.to_datetime(df_per_record.start_time)
    stop_time = pd.to_datetime(df_per_record.stop_time)
    reference_date = pd.Timestamp(df_per_record.start_time.iloc[0])
    start_time_in_days = (start_time - reference_date) // one_day
    stop_time_in_days = (stop_time - reference_date) // one_day
    dates_diff = start_time_in_days.iloc[1:].values - stop_time_in_days.iloc[:-1].values
    gap_mask = dates_diff > 1
    missing_start_dates = stop_time[:-1][gap_mask] + one_day
    # note: dates_diff must be filtered by the same mask, or the lengths won't match
    missing_stop_dates = missing_start_dates + ((dates_diff[gap_mask] - 2) * one_day)
    missing_dates.append(pd.DataFrame({"record": record_id, "start_time": missing_start_dates, "stop_time": missing_stop_dates}))
print(pd.concat([df]+missing_dates).sort_values(["record", "start_time"]))
Edit:
version #2 this time without the for loop:
import pandas as pd
record = [1, 1, 2, 2, 2]
start_time = pd.to_datetime(['2001-01-01', '2001-02-01', '2000-01-01', '2001-05-31', '2001-09-01'])
stop_time = pd.to_datetime(['2001-01-15', '2001-02-28', '2001-01-31', '2001-08-16', '2001-09-30'])
df = pd.DataFrame({'record': record, 'start_time': start_time, 'stop_time': stop_time})
one_day = pd.Timedelta('1d')
start_time = pd.to_datetime(df.start_time)
stop_time = pd.to_datetime(df.stop_time)
reference_date = pd.Timestamp(df.start_time.iloc[0])
start_time_in_days = (start_time - reference_date) // one_day
stop_time_in_days = (stop_time - reference_date) // one_day
is_same_record = df.record.iloc[1:].values == df.record.iloc[:-1].values
dates_diff = start_time_in_days.iloc[1:].values - stop_time_in_days.iloc[:-1].values
mask = (dates_diff > 1) & is_same_record
missing_start_dates = stop_time[:-1][mask] + one_day
missing_stop_dates = missing_start_dates + ((dates_diff[mask] - 2) * one_day)
missing_dates = pd.DataFrame({"record": df.record.iloc[:-1][mask], "start_time": missing_start_dates, "stop_time": missing_stop_dates})
print(pd.concat([df, missing_dates]).sort_values(["record", "start_time"]).reset_index(drop=True))
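For comparison, here is a sketch of the same gap-filling idea using groupby().shift(), which also avoids an explicit Python loop (same sample frame as above; assumes rows are already sorted by record and start_time):

```python
import pandas as pd

record = [1, 1, 2, 2, 2]
start_time = pd.to_datetime(['2001-01-01', '2001-02-01', '2000-01-01', '2001-05-31', '2001-09-01'])
stop_time = pd.to_datetime(['2001-01-15', '2001-02-28', '2001-01-31', '2001-08-16', '2001-09-30'])
df = pd.DataFrame({'record': record, 'start_time': start_time, 'stop_time': stop_time})

one_day = pd.Timedelta('1d')
# stop time of the previous row within the same record (NaT for the first row of each record)
prev_stop = df.groupby('record')['stop_time'].shift()
# a gap exists where the current start is more than one day after the previous stop
gap_mask = df['start_time'] - prev_stop > one_day
gaps = pd.DataFrame({
    'record': df.loc[gap_mask, 'record'],
    'start_time': prev_stop[gap_mask] + one_day,
    'stop_time': df.loc[gap_mask, 'start_time'] - one_day,
})
result = (pd.concat([df, gaps])
            .sort_values(['record', 'start_time'])
            .reset_index(drop=True))
print(result)
```

The NaT produced by shift() for each record's first row makes the comparison False automatically, so no explicit same-record mask is needed.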
Related
I did some Googling and figured out how to generate all Friday dates in a year.
# get all Fridays in a year
from datetime import date, timedelta
def allfridays(year):
    d = date(year, 1, 1)          # January 1st
    d += timedelta(days = 8 - 2)  # advance to Friday (this offset happens to work for 2022, where Jan 1 is a Saturday)
    while d.year == year:
        yield d
        d += timedelta(days = 7)

for d in allfridays(2022):
    print(d)
Result:
2022-01-07
2022-01-14
2022-01-21
etc.
2022-12-16
2022-12-23
2022-12-30
Now, I'm trying to figure out how to loop through a range of rolling dates, so like 2022-01-07 + 60 days, then 2022-01-14 + 60 days, then 2022-01-21 + 60 days.
step #1:
start = '2022-01-07'
end = '2022-03-08'
step #2:
start = '2022-01-14'
end = '2022-03-15'
Ideally, I want to pass in the start and end date loop, into another loop, which looks like this...
price_data = []
for ticker in tickers:
    try:
        prices = wb.DataReader(ticker, start = start.strftime('%m/%d/%Y'), end = end.strftime('%m/%d/%Y'), data_source='yahoo')[['Adj Close']]
        price_data.append(prices.assign(ticker=ticker)[['ticker', 'Adj Close']])
    except Exception:
        print(ticker)
df = pd.concat(price_data)
Since you're already using pandas, you can do it this way:
import pandas as pd
year = 2022
dates = pd.date_range(start=f'{year}-01-01',end=f'{year}-12-31',freq='W-FRI')
df = pd.DataFrame({'my_dates':dates, 'sixty_ahead':dates + pd.Timedelta(days=60)})
print(df.head())
'''
my_dates sixty_ahead
0 2022-01-07 2022-03-08
1 2022-01-14 2022-03-15
2 2022-01-21 2022-03-22
3 2022-01-28 2022-03-29
4 2022-02-04  2022-04-05
'''
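To feed these (start, end) pairs into the download loop from the question, one option (a sketch) is to iterate over the frame with itertuples and format each pair the way that loop expects:

```python
import pandas as pd

year = 2022
dates = pd.date_range(start=f'{year}-01-01', end=f'{year}-12-31', freq='W-FRI')
df = pd.DataFrame({'my_dates': dates, 'sixty_ahead': dates + pd.Timedelta(days=60)})

# iterate over the (start, end) pairs, formatted as in the download loop
for row in df.head(2).itertuples(index=False):
    start_s = row.my_dates.strftime('%m/%d/%Y')
    end_s = row.sixty_ahead.strftime('%m/%d/%Y')
    print(start_s, end_s)
# 01/07/2022 03/08/2022
# 01/14/2022 03/15/2022
```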
First, we have to figure out how to get the first Friday of a given year. Next, we calculate the start and end dates.
import datetime

FRIDAY = 4  # Based on Monday=0
WEEK = datetime.timedelta(days=7)

def first_friday(year):
    """Return the first Friday of the year."""
    the_date = datetime.date(year, 1, 1)
    while the_date.weekday() != FRIDAY:
        the_date = the_date + datetime.timedelta(days=1)
    return the_date

def friday_ranges(year, days_count):
    """
    Generate date ranges that start on the first Friday of `year` and
    last for `days_count` days.
    """
    DURATION = datetime.timedelta(days=days_count)
    start_date = first_friday(year)
    end_date = start_date + DURATION
    while end_date.year == year:
        yield start_date, end_date
        start_date += WEEK
        end_date = start_date + DURATION

for start_date, end_date in friday_ranges(year=2022, days_count=60):
    # Do what you want with start_date and end_date
    print((start_date, end_date))
Sample output:
(datetime.date(2022, 1, 7), datetime.date(2022, 3, 8))
(datetime.date(2022, 1, 14), datetime.date(2022, 3, 15))
(datetime.date(2022, 1, 21), datetime.date(2022, 3, 22))
...
(datetime.date(2022, 10, 21), datetime.date(2022, 12, 20))
(datetime.date(2022, 10, 28), datetime.date(2022, 12, 27))
Notes
The algorithm for the first Friday is simple: start with Jan 1, then keep advancing the day until it is a Friday.
I made the assumption that the end date must fall within the specified year. If that is not the case, you can adjust the condition in the while loop.
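The first-Friday search can also be done without a loop; here is a sketch using modular arithmetic on the weekday:

```python
import datetime

FRIDAY = 4  # Monday == 0

def first_friday(year):
    """Return the first Friday of the year, computed arithmetically."""
    jan1 = datetime.date(year, 1, 1)
    # number of days to advance so the weekday lands on Friday
    offset = (FRIDAY - jan1.weekday()) % 7
    return jan1 + datetime.timedelta(days=offset)

print(first_friday(2022))  # 2022-01-07
```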
This could work, maybe. You can add the loop's end condition within the lambda function.
from datetime import date, timedelta
def allfridays(year):
    d = date(year, 1, 1)          # January 1st
    d += timedelta(days = 8 - 2)  # Friday
    while d.year == year:
        yield d
        d += timedelta(days = 7)
list_dates = []
for d in allfridays(2022):
list_dates.append(d)
add_days = map(lambda x: x+timedelta(days = 60),list_dates)
print(list(add_days))
Oh my, I totally missed this before. The solution below works just fine.
import pandas as pd
# get all Fridays in a year
from datetime import date, timedelta
def allfridays(year):
    d = date(year, 1, 1)          # January 1st
    d += timedelta(days = 8 - 2)  # Friday
    while d.year == year:
        yield d
        d += timedelta(days = 7)

lst = []
for d in allfridays(2022):
    lst.append(d)
df = pd.DataFrame(lst)
print(type(df))
df.columns = ['my_dates']
df['sixty_ahead'] = df['my_dates'] + timedelta(days=60)
df
Result:
my_dates sixty_ahead
0 2022-01-07 2022-03-08
1 2022-01-14 2022-03-15
2 2022-01-21 2022-03-22
etc.
49 2022-12-16 2023-02-14
50 2022-12-23 2023-02-21
51 2022-12-30 2023-02-28
I want to exclude a period in my time series: from 2 a.m. till 6 a.m.
How can I do that?
Thank you for your help!
import numpy as np
import pandas as pd

start = pd.Timestamp("2022-10-03")
end = pd.Timestamp("2022-11-13")
N = 25
t = np.random.randint(start.value, end.value, N)
t -= t % 1000000000  # truncate to whole seconds
start = pd.to_datetime(np.array(t, dtype="datetime64[ns]"))
duration = pd.to_timedelta(np.random.randint(100, 10000, N), unit="s")
df = pd.DataFrame({"start": start, "duration": duration})
df["end"] = df.start + df.duration
start duration end
0 2022-10-06 21:17:16 0 days 00:25:55 2022-10-06 21:43:11
1 2022-10-27 08:20:47 0 days 00:04:32 2022-10-27 08:25:19
2 2022-10-09 16:34:08 0 days 01:53:24 2022-10-09 18:27:32
3 2022-10-08 16:16:26 0 days 00:16:35 2022-10-08 16:33:01
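No answer is attached to this question here; one possible sketch, assuming the goal is to drop intervals whose start falls between 2 a.m. and 6 a.m., is a boolean mask on the hour component:

```python
import numpy as np
import pandas as pd

# same random construction as in the question
start = pd.Timestamp("2022-10-03")
end = pd.Timestamp("2022-11-13")
N = 25
t = np.random.randint(start.value, end.value, N)
t -= t % 1000000000  # truncate to whole seconds
start = pd.to_datetime(np.array(t, dtype="datetime64[ns]"))
duration = pd.to_timedelta(np.random.randint(100, 10000, N), unit="s")
df = pd.DataFrame({"start": start, "duration": duration})
df["end"] = df.start + df.duration

# keep only rows whose start hour is outside 02:00-05:59
filtered = df[~df["start"].dt.hour.between(2, 5)]
print(len(df), len(filtered))
```

If intervals merely overlapping the window should also go, the same `.dt.hour` test would need to be applied to `end` (and to spans crossing the window) as well.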
I have the timeseries dataframe as:
timestamp            signal_value
2017-08-28 00:00:00  10
2017-08-28 00:05:00  3
2017-08-28 00:10:00  5
2017-08-28 00:15:00  5
I am trying to get the average Monthly percentage of the time where "signal_value" is greater than 5. Something like:
Month     metric
January   16%
February  2%
March     8%
April     10%
I tried the following code which gives the result for the whole dataset but how can I summarize it per each month?
total, count = 0, 0
for index, row in df.iterrows():
    total += 1
    if row["signal_value"] >= 5:
        count += 1
print((count/total)*100)
Thank you in advance.
Let us first generate some random data (generate random dates taken from here):
import pandas as pd
import numpy as np
import datetime
def randomtimes(start, end, n):
    frmt = '%d-%m-%Y %H:%M:%S'
    stime = datetime.datetime.strptime(start, frmt)
    etime = datetime.datetime.strptime(end, frmt)
    td = etime - stime
    dtimes = [np.random.random() * td + stime for _ in range(n)]
    return [d.strftime(frmt) for d in dtimes]

# Recreate some fake data
timestamp = randomtimes("01-01-2021 00:00:00", "01-01-2023 00:00:00", 10000)
signal_value = np.random.random(len(timestamp)) * 10
df = pd.DataFrame({"timestamp": timestamp, "signal_value": signal_value})
Now we can transform the timestamp column to pandas timestamps to extract month and year per timestamp:
df.timestamp = pd.to_datetime(df.timestamp)
df["month"] = df.timestamp.dt.month
df["year"] = df.timestamp.dt.year
We generate a boolean column whether signal_value is larger than some threshold (here 5):
df["is_larger5"] = df.signal_value > 5
Finally, we can get the average for every month by using pandas.groupby:
>>> df.groupby(["year", "month"])['is_larger5'].mean()
year month
2021 1 0.509615
2 0.488189
3 0.506024
4 0.519362
5 0.498778
6 0.483709
7 0.498824
8 0.460396
9 0.542918
10 0.463043
11 0.492500
12 0.519789
2022 1 0.481663
2 0.527778
3 0.501139
4 0.527322
5 0.486936
6 0.510638
7 0.483370
8 0.521253
9 0.493639
10 0.495349
11 0.474886
12 0.488372
Name: is_larger5, dtype: float64
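To match the percentage format in the desired output, the fractions from mean() can be scaled by 100. A small sketch with hand-made stand-in data (hypothetical values, not the random data above), so the expected fractions are known:

```python
import pandas as pd

# hand-made stand-in data with known fractions
df = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["2021-01-01", "2021-01-02", "2021-01-03", "2021-01-04",
         "2021-02-01", "2021-02-02"]),
    "signal_value": [6, 7, 1, 2, 9, 1],
})
df["year"] = df.timestamp.dt.year
df["month"] = df.timestamp.dt.month
df["is_larger5"] = df.signal_value > 5

# the mean of a boolean column is a fraction; scale it to percent
pct = (df.groupby(["year", "month"])["is_larger5"].mean() * 100).round(1)
print(pct)  # both months come out to 50.0
```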
This is my dataframe.
Start_hour End_date
23:58:00 00:26:00
23:56:00 00:01:00
23:18:00 23:36:00
How can I get in a new column the difference (in minutes) between these two columns?
>>> from datetime import datetime
>>>
>>> before = datetime.now()
>>> print('wait for more than 1 minute')
wait for more than 1 minute
>>> after = datetime.now()
>>> td = after - before
>>>
>>> td
datetime.timedelta(seconds=98, microseconds=389121)
>>> td.total_seconds()
98.389121
>>> td.total_seconds() / 60
1.6398186833333335
Then you can round it or use it as-is.
You can do something like this:
import pandas as pd
df = pd.DataFrame({
'Start_hour': ['23:58:00', '23:56:00', '23:18:00'],
'End_date': ['00:26:00', '00:01:00', '23:36:00']}
)
df['Start_hour'] = pd.to_datetime(df['Start_hour'])
df['End_date'] = pd.to_datetime(df['End_date'])
df['diff'] = df.apply(
    lambda row: (row['End_date'] - row['Start_hour']).seconds / 60,
    axis=1
)
print(df)
Start_hour End_date diff
0 2021-03-29 23:58:00 2021-03-29 00:26:00 28.0
1 2021-03-29 23:56:00 2021-03-29 00:01:00 5.0
2 2021-03-29 23:18:00 2021-03-29 23:36:00 18.0
You can also rearrange your dates as string again if you like:
df['Start_hour'] = df['Start_hour'].apply(lambda x: x.strftime('%H:%M:%S'))
df['End_date'] = df['End_date'].apply(lambda x: x.strftime('%H:%M:%S'))
print(df)
Output:
Start_hour End_date diff
0 23:58:00 00:26:00 28.0
1 23:56:00 00:01:00 5.0
2 23:18:00 23:36:00 18.0
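For larger frames, the apply can be avoided. A vectorized sketch using the same sample data: subtracting the columns directly gives a Timedelta column, and .dt.seconds wraps negative differences modulo 24 hours, which handles the midnight crossings here:

```python
import pandas as pd

df = pd.DataFrame({
    'Start_hour': ['23:58:00', '23:56:00', '23:18:00'],
    'End_date': ['00:26:00', '00:01:00', '23:36:00']})
df['Start_hour'] = pd.to_datetime(df['Start_hour'])
df['End_date'] = pd.to_datetime(df['End_date'])

# .dt.seconds is always non-negative (the day component absorbs the sign),
# so a negative delta like -23:32 becomes 28 minutes
df['diff'] = (df['End_date'] - df['Start_hour']).dt.seconds / 60
print(df['diff'].tolist())  # [28.0, 5.0, 18.0]
```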
Short answer:
from datetime import timedelta

df['interval'] = df['End_date'] - df['Start_hour']
df.loc[df['End_date'] < df['Start_hour'], 'interval'] += timedelta(hours=24)
Why so:
You're probably trying to solve the problem that your Start_hour and End_date values sometimes belong to different days, which is why you can't simply subtract one from the other.
If your time window never exceeds a 24-hour interval, you can use some modular arithmetic to deal with the 23:59:59 - 00:00:00 boundary:
if End_date < Start_hour, this always means End_date belongs to the next day
this implies that if End_date - Start_hour < 0, we should add 24 hours to find the actual difference
The final formula is:
if rec['Start_hour'] < rec['End_date']:
    offset = timedelta(0)
else:
    offset = timedelta(hours=24)
rec['delta'] = offset + rec['End_date'] - rec['Start_hour']
To do the same with pandas.DataFrame we need to change code accordingly. And
that's how we get the snippet from the beginning of the answer.
import pandas as pd
from datetime import datetime, timedelta

df = pd.DataFrame([
    {'Start_hour': datetime(1, 1, 1, 23, 58, 0), 'End_date': datetime(1, 1, 1, 0, 26, 0)},
    {'Start_hour': datetime(1, 1, 1, 23, 58, 0), 'End_date': datetime(1, 1, 1, 23, 59, 0)},
])
# ...
df['interval'] = df['End_date'] - df['Start_hour']
df.loc[df['End_date'] < df['Start_hour'], 'interval'] += timedelta(hours=24)
> df
Start_hour End_date interval
0 0001-01-01 23:58:00 0001-01-01 00:26:00 0 days 00:28:00
1 0001-01-01 23:58:00 0001-01-01 23:59:00 0 days 00:01:00
I have a JSON array where each element has an id, a start time, and an end time. I want to calculate the average time a user is active. Some elements may have only a start time but no end time.
Example data -
data = [{"id":1, "stime":"2020-09-21T06:25:36Z","etime": "2020-09-22T09:25:36Z"},{"id":2, "stime":"2020-09-22T02:24:36Z","etime": "2020-09-23T07:25:36Z"},{"id":3, "stime":"2020-09-20T06:25:36Z","etime": "2020-09-24T09:25:36Z"},{"id":4, "stime":"2020-09-23T06:25:36Z","etime": "2020-09-29T09:25:36Z"}]
My approach: take the difference between start time and end time for each record, then total all the differences and divide by the number of IDs.
sample code:
import datetime
from datetime import timedelta
import dateutil.parser
datetimeFormat = '%Y-%m-%d %H:%M:%S.%f'
date_s_time = '2020-09-21T06:25:36Z'
date_e_time = '2020-09-22T09:25:36Z'
d1 = dateutil.parser.parse(date_s_time)
d2 = dateutil.parser.parse(date_e_time)
diff1 = datetime.datetime.strptime(d2.strftime('%Y-%m-%d %H:%M:%S.%f'), datetimeFormat)\
- datetime.datetime.strptime(d1.strftime('%Y-%m-%d %H:%M:%S.%f'), datetimeFormat)
print("Difference 1:", diff1)
date_s_time2 = '2020-09-20T06:25:36Z'
date_e_time2 = '2020-09-28T02:25:36Z'
d3 = dateutil.parser.parse(date_s_time2)
d4 = dateutil.parser.parse(date_e_time2)
diff2 = datetime.datetime.strptime(d4.strftime('%Y-%m-%d %H:%M:%S.%f'), datetimeFormat)\
- datetime.datetime.strptime(d3.strftime('%Y-%m-%d %H:%M:%S.%f'), datetimeFormat)
print("Difference 2:", diff2)
print("total", diff1+diff2)
print((diff1 + diff2) / 2)  # parentheses needed: without them only diff2 is halved
please suggest me is there a better approach which will be efficient.
You could use the pandas library.
import pandas as pd
data = [{"id":1, "stime":"2020-09-21T06:25:36Z","etime": "2020-09-22T09:25:36Z"},{"id":1, "stime":"2020-09-22T02:24:36Z","etime": "2020-09-23T07:25:36Z"},{"id":1, "stime":"2020-09-20T06:25:36Z","etime": "2020-09-24T09:25:36Z"},{"id":1, "stime":"2020-09-23T06:25:36Z"}]
(Let's say your last row has no end time)
Now, you can create a Pandas DataFrame using your data
df = pd.DataFrame(data)
df looks like so:
id stime etime
0 1 2020-09-21T06:25:36Z 2020-09-22T09:25:36Z
1 1 2020-09-22T02:24:36Z 2020-09-23T07:25:36Z
2 1 2020-09-20T06:25:36Z 2020-09-24T09:25:36Z
3 1 2020-09-23T06:25:36Z NaN
Now, we want to map the columns stime and etime so that the strings are converted to datetime objects, and fill NaNs with something that makes sense: if no end time exists, could we use the current time?
from datetime import datetime
import dateutil.parser

df = df.fillna(datetime.utcnow().strftime('%Y-%m-%dT%H:%M:%SZ'))
df['etime'] = df['etime'].map(dateutil.parser.parse)
df['stime'] = df['stime'].map(dateutil.parser.parse)
Or, if you want to drop the rows that don't have an etime, just do
df = df.dropna()
Now df becomes:
id stime etime
0 1 2020-09-21 06:25:36+00:00 2020-09-22 09:25:36+00:00
1 1 2020-09-22 02:24:36+00:00 2020-09-23 07:25:36+00:00
2 1 2020-09-20 06:25:36+00:00 2020-09-24 09:25:36+00:00
3 1 2020-09-23 06:25:36+00:00 2020-09-24 20:05:42+00:00
Finally, subtract the two:
df['tdiff'] = df['etime'] - df['stime']
and we get:
id stime etime tdiff
0 1 2020-09-21 06:25:36+00:00 2020-09-22 09:25:36+00:00 1 days 03:00:00
1 1 2020-09-22 02:24:36+00:00 2020-09-23 07:25:36+00:00 1 days 05:01:00
2 1 2020-09-20 06:25:36+00:00 2020-09-24 09:25:36+00:00 4 days 03:00:00
3 1 2020-09-23 06:25:36+00:00 2020-09-24 20:05:42+00:00 1 days 13:40:06
The mean of this column is:
df['tdiff'].mean()
Output: Timedelta('2 days 00:10:16.500000')
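If the resulting mean needs to be reported as a single number rather than a Timedelta, total_seconds() converts it; a small sketch using the value from above:

```python
import pandas as pd

mean_td = pd.Timedelta('2 days 00:10:16.500000')
hours = mean_td.total_seconds() / 3600
print(round(hours, 2))  # 48.17
```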