How to create rows that fill the time between events in Python

I am building a data frame for survival analysis, starting from 2018-01-01 00:00:00 and ending TODAY. I have two columns with start and end times, but only for the events that occurred, associated with an ID.
However, I need to add rows for the times during which no event was observed.
Here I show what I have:
+--------+-----+-----+---------------------+---------------------+
| State | ID1 | ID2 | Start_Time | End_Time |
+--------+-----+-----+---------------------+---------------------+
| State1 | 111 | AA1 | 2019-12-04 04:00:00 | 2019-12-04 19:30:00 |
+--------+-----+-----+---------------------+---------------------+
| State1 | 111 | AA1 | 2019-12-08 06:30:00 | 2019-12-20 10:00:00 |
+--------+-----+-----+---------------------+---------------------+
| State1 | 111 | AA1 | 2019-12-22 11:00:00 | 2019-12-22 23:00:00 |
+--------+-----+-----+---------------------+---------------------+
| State1 | 111 | AA1 | 2019-12-26 08:00:00 | 2019-12-29 16:30:00 |
+--------+-----+-----+---------------------+---------------------+
| State2 | 112 | AA2 | 2018-09-19 08:00:00 | 2018-09-20 04:30:00 |
+--------+-----+-----+---------------------+---------------------+
| State2 | 112 | AA2 | 2018-09-25 16:30:00 | 2018-09-26 23:00:00 |
+--------+-----+-----+---------------------+---------------------+
| State2 | 112 | AA2 | 2018-09-27 01:30:00 | 2018-09-27 10:30:00 |
+--------+-----+-----+---------------------+---------------------+
And what I need is:
+--------+-----+-----+---------------------+---------------------+
| State | ID1 | ID2 | Start_Time | End_Time |
+--------+-----+-----+---------------------+---------------------+
| State1 | 111 | AA1 | 2018-01-01 00:00:00 | 2019-12-04 04:00:00 |
+--------+-----+-----+---------------------+---------------------+
| State1 | 111 | AA1 | 2019-12-04 04:00:00 | 2019-12-04 19:30:00 |
+--------+-----+-----+---------------------+---------------------+
| State1 | 111 | AA1 | 2019-12-04 19:30:00 | 2019-12-08 06:30:00 |
+--------+-----+-----+---------------------+---------------------+
| State1 | 111 | AA1 | 2019-12-08 06:30:00 | 2019-12-20 10:00:00 |
+--------+-----+-----+---------------------+---------------------+
| State1 | 111 | AA1 | 2019-12-20 10:00:00 | 2019-12-22 11:00:00 |
+--------+-----+-----+---------------------+---------------------+
| State1 | 111 | AA1 | 2019-12-22 11:00:00 | 2019-12-22 23:00:00 |
+--------+-----+-----+---------------------+---------------------+
| State1 | 111 | AA1 | 2019-12-22 23:00:00 | 2019-12-26 08:00:00 |
+--------+-----+-----+---------------------+---------------------+
| State1 | 111 | AA1 | 2019-12-26 08:00:00 | 2019-12-29 16:30:00 |
+--------+-----+-----+---------------------+---------------------+
| State1 | 111 | AA1 | 2019-12-29 16:30:00 | TODAY |
+--------+-----+-----+---------------------+---------------------+
| State2 | 112 | AA2 | 2018-01-01 00:00:00 | 2018-09-19 08:00:00 |
+--------+-----+-----+---------------------+---------------------+
| State2 | 112 | AA2 | 2018-09-19 08:00:00 | 2018-09-20 04:30:00 |
+--------+-----+-----+---------------------+---------------------+
| State2 | 112 | AA2 | 2018-09-20 04:30:00 | 2018-09-25 16:30:00 |
+--------+-----+-----+---------------------+---------------------+
| State2 | 112 | AA2 | 2018-09-25 16:30:00 | 2018-09-26 23:00:00 |
+--------+-----+-----+---------------------+---------------------+
| State2 | 112 | AA2 | 2018-09-26 23:00:00 | 2018-09-27 01:30:00 |
+--------+-----+-----+---------------------+---------------------+
| State2 | 112 | AA2 | 2018-09-27 01:30:00 | 2018-09-27 10:30:00 |
+--------+-----+-----+---------------------+---------------------+
| State2 | 112 | AA2 | 2018-09-27 10:30:00 | TODAY |
+--------+-----+-----+---------------------+---------------------+
I have tried this code (borrowed from: How to find the start time and end time of an event in python?) together with the answer provided by Fredy Montaño (below), but it gives me only the sequence of events, not the desired rows:
fill_date = []
for item in range(1, df.shape[0], 1):
    # note: a Timedelta never equals the integer 0, so compare against pd.Timedelta(0)
    if (df['End_Time'][item-1] - df['Start_Time'][item]) == pd.Timedelta(0):
        pass
    else:
        fill_date.append([df["State"][item-1], df["ID1"][item-1], df["ID2"][item-1],
                          df['End_Time'][item-1], df['Start_Time'][item]])

df_add = pd.DataFrame(fill_date)
df_add.columns = ["State", "ID1", "ID2", 'Start_Time', 'End_Time']
df_output = pd.concat([df[["State", "ID1", "ID2", "Start_Time", "End_Time"]], df_add], axis=0)
df_output = df_output.sort_values(["State", "ID2", "Start_Time"], ascending=True)
I think I have to put a condition on the State, ID1 and ID2 variables so that it does not take times from the previous groups.
Any suggestions?

Maybe this solution works for you.
I slice the dataframe to take only the dates, but if it works for you, you can repeat it taking the states and IDs into account:
df = df[['Start_Time', 'End_Time']]

fill_date = []
for item in range(1, df.shape[0], 1):
    if df['Start_Time'][item] - df['End_Time'][item-1] == pd.Timedelta(0):
        pass
    else:
        fill_date.append([df['End_Time'][item-1], df['Start_Time'][item]])

df_add = pd.DataFrame(fill_date)
df_add.columns = ['Start_Time', 'End_Time']
Finally, I do a concat to join your original dataframe with the new dataframe of dates of the unobserved events:
df_final = pd.concat([df, df_add], axis=0)
df_final = df_final.sort_index()
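To handle the grouping the question asks about, here is a minimal sketch, assuming df has datetime columns, events within a group don't overlap, and the study window runs from 2018-01-01 to today (TODAY is kept as a variable standing in for the question's placeholder):

import pandas as pd

START = pd.Timestamp('2018-01-01 00:00:00')
TODAY = pd.Timestamp.now()  # stand-in for the question's TODAY

rows = []
for (state, id1, id2), g in df.groupby(['State', 'ID1', 'ID2']):
    g = g.sort_values('Start_Time')
    prev_end = START
    for _, r in g.iterrows():
        if r['Start_Time'] > prev_end:  # gap before this event
            rows.append([state, id1, id2, prev_end, r['Start_Time']])
        rows.append([state, id1, id2, r['Start_Time'], r['End_Time']])
        prev_end = r['End_Time']
    if prev_end < TODAY:  # trailing gap up to today
        rows.append([state, id1, id2, prev_end, TODAY])

df_output = pd.DataFrame(rows, columns=['State', 'ID1', 'ID2', 'Start_Time', 'End_Time'])

Because the loop runs once per group, gap rows never span times from a previous State/ID1/ID2 group.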

Related

How to reindex a datetime-based multiindex in pandas

I have a dataframe that counts the number of times an event has occurred per user per day. Users may have 0 events per day and (since the table is an aggregate from a raw event log) rows with 0 events are missing from the dataframe. I would like to add these missing rows and group the data by week so that each user has one entry per week (including 0 if applicable).
Here is an example of my input:
import numpy as np
import pandas as pd

np.random.seed(42)
df = pd.DataFrame({
    "person_id": np.arange(3).repeat(5),
    "date": pd.date_range("2022-01-01", "2022-01-15", freq="d"),
    "event_count": np.random.randint(1, 7, 15),
})
# end of each week
# Note: week 2022-01-23 is not in df, but should be part of the result
desired_index = pd.to_datetime(["2022-01-02", "2022-01-09", "2022-01-16", "2022-01-23"])
df
| | person_id | date | event_count |
|---:|------------:|:--------------------|--------------:|
| 0 | 0 | 2022-01-01 00:00:00 | 4 |
| 1 | 0 | 2022-01-02 00:00:00 | 5 |
| 2 | 0 | 2022-01-03 00:00:00 | 3 |
| 3 | 0 | 2022-01-04 00:00:00 | 5 |
| 4 | 0 | 2022-01-05 00:00:00 | 5 |
| 5 | 1 | 2022-01-06 00:00:00 | 2 |
| 6 | 1 | 2022-01-07 00:00:00 | 3 |
| 7 | 1 | 2022-01-08 00:00:00 | 3 |
| 8 | 1 | 2022-01-09 00:00:00 | 3 |
| 9 | 1 | 2022-01-10 00:00:00 | 5 |
| 10 | 2 | 2022-01-11 00:00:00 | 4 |
| 11 | 2 | 2022-01-12 00:00:00 | 3 |
| 12 | 2 | 2022-01-13 00:00:00 | 6 |
| 13 | 2 | 2022-01-14 00:00:00 | 5 |
| 14 | 2 | 2022-01-15 00:00:00 | 2 |
This is what my desired result looks like:
| | person_id | level_1 | event_count |
|---:|------------:|:--------------------|--------------:|
| 0 | 0 | 2022-01-02 00:00:00 | 9 |
| 1 | 0 | 2022-01-09 00:00:00 | 13 |
| 2 | 0 | 2022-01-16 00:00:00 | 0 |
| 3 | 0 | 2022-01-23 00:00:00 | 0 |
| 4 | 1 | 2022-01-02 00:00:00 | 0 |
| 5 | 1 | 2022-01-09 00:00:00 | 11 |
| 6 | 1 | 2022-01-16 00:00:00 | 5 |
| 7 | 1 | 2022-01-23 00:00:00 | 0 |
| 8 | 2 | 2022-01-02 00:00:00 | 0 |
| 9 | 2 | 2022-01-09 00:00:00 | 0 |
| 10 | 2 | 2022-01-16 00:00:00 | 20 |
| 11 | 2 | 2022-01-23 00:00:00 | 0 |
I can produce it using:
(
    df
    .groupby(["person_id", pd.Grouper(key="date", freq="w")]).sum()
    .groupby("person_id").apply(
        lambda df: (
            df
            .reset_index(drop=True, level=0)
            .reindex(desired_index, fill_value=0))
    )
    .reset_index()
)
However, according to the docs of reindex, I should be able to use it with level=1 as a kwarg directly, without having to do another groupby. But when I do this, I get an "inner join" of the two indices instead of an "outer join":
result = (
    df
    .groupby(["person_id", pd.Grouper(key="date", freq="w")]).sum()
    .reindex(desired_index, level=1)
    .reset_index()
)
| | person_id | date | event_count |
|---:|------------:|:--------------------|--------------:|
| 0 | 0 | 2022-01-02 00:00:00 | 9 |
| 1 | 0 | 2022-01-09 00:00:00 | 13 |
| 2 | 1 | 2022-01-09 00:00:00 | 11 |
| 3 | 1 | 2022-01-16 00:00:00 | 5 |
| 4 | 2 | 2022-01-16 00:00:00 | 20 |
Why is that, and how am I supposed to use df.reindex correctly?
I have found a similar SO question on reindexing a multi-index level, but the accepted answer there uses df.unstack, which doesn't work for me, because not every level of my desired index occurs in my current index (and vice versa).
You need to reindex by both levels of the MultiIndex:
mux = pd.MultiIndex.from_product([df['person_id'].unique(), desired_index],
                                 names=['person_id', 'date'])

result = (
    df
    .groupby(["person_id", pd.Grouper(key="date", freq="w")]).sum()
    .reindex(mux, fill_value=0)
    .reset_index()
)
print(result)
person_id date event_count
0 0 2022-01-02 9
1 0 2022-01-09 13
2 0 2022-01-16 0
3 0 2022-01-23 0
4 1 2022-01-02 0
5 1 2022-01-09 11
6 1 2022-01-16 5
7 1 2022-01-23 0
8 2 2022-01-02 0
9 2 2022-01-09 0
10 2 2022-01-16 20
11 2 2022-01-23 0

Iterate over each day, calculate average price of first and last 3 hours and take difference of those averages in Python

I have a data frame that looks like this,
+---------------------+------------+----------+-------+
| Timestamp | Date | Time | Price |
+---------------------+------------+----------+-------+
| 2017-01-01 00:00:00 | 2017-01-01 | 00:00:00 | 20 |
| 2017-01-01 00:01:00 | 2017-01-01 | 00:01:00 | 25 |
| 2017-01-01 00:02:00 | 2017-01-01 | 00:02:00 | 15 |
| 2017-01-01 00:03:00 | 2017-01-01 | 00:03:00 | 20 |
| ... | | | |
| 2017-01-01 00:20:00 | 2017-01-01 | 00:20:00 | 25 |
| 2017-01-01 00:21:00 | 2017-01-01 | 00:21:00 | 15 |
| 2017-01-01 00:22:00 | 2017-01-01 | 00:22:00 | 10 |
| 2017-01-01 00:23:00 | 2017-01-01 | 00:23:00 | 25 |
| 2017-02-01 00:00:00 | 2017-02-01 | 00:00:00 | 10 |
| 2017-02-01 00:01:00 | 2017-02-01 | 00:01:00 | 25 |
| 2017-02-01 00:02:00 | 2017-02-01 | 00:02:00 | 10 |
| 2017-02-01 00:03:00 | 2017-02-01 | 00:03:00 | 25 |
| ... | | | |
| 2017-02-01 00:20:00 | 2017-02-01 | 00:20:00 | 15 |
| 2017-02-01 00:21:00 | 2017-02-01 | 00:21:00 | 10 |
| 2017-02-01 00:22:00 | 2017-02-01 | 00:22:00 | 25 |
| 2017-02-01 00:23:00 | 2017-02-01 | 00:23:00 | 10 |
+---------------------+------------+----------+-------+
Timestamp datetime64[ns]
Date datetime64[ns]
Time object
Price float64
and I'm trying to calculate the difference between the average price of the first 3 hours and the last 3 hours of each day.
Design in my mind is to do something like this:

for every unique date in Date:
    a = avg(price.first(3))
    b = avg(price.last(3))
    diff = a - b
    append diff to another dataset
---------EDIT----------
and the expected result is;
+------------+---------+
| Date | Diff |
+------------+---------+
| 2017-01-01 | 3.33334 |
| 2017-02-01 | 0 |
+------------+---------+
My real query will be in seconds rather than hours (I didn't want to put 120 rows in here just to show 2 minutes of the data), so the hours stand in for seconds.
Also, there can be missing rows in the dataset, so if I just do price.first(3600), it can overshoot for some days, right? If I can solve this using df.Timestamp.dt.hour, that would be more precise, I think.
I really can't get my head around how to get the first and last 3 Price values for every day. Any help will be much appreciated! Thank you so much in advance!
As you showed, the hours are ordered, so you can group by day, get the list of prices for the hours of each day, and then apply a function that takes the difference. You could try something like this:
import pandas as pd
from statistics import mean

def getavg(ls):
    mean3first = mean(ls[:3])
    mean3last = mean(ls[len(ls)-3:])
    return mean3first - mean3last

diff_means = df.groupby(['Date']).agg(list)['Price'].apply(getavg).reset_index()
diff_means.columns = ['Date', 'Diff']
print(diff_means)
I'm not entirely sure what format you want the result in, but I found a solution that I find pretty elegant:
import numpy as np
import pandas as pd

unique_dates = df.Date.unique()
rows = []
for u_date in unique_dates:
    first_3 = np.mean(df[df.Date == u_date].reset_index().head(3).Price)
    last_3 = np.mean(df[df.Date == u_date].reset_index().tail(3).Price)
    rows.append(pd.DataFrame([[u_date, last_3 - first_3]], columns=['Date', 'PriceDiff']))

# DataFrame.append was removed in pandas 2.0, so collect the pieces and concat once
new_df = pd.concat(rows, ignore_index=True)
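Since the question mentions filtering on the timestamp itself (so that missing rows can't make a fixed row count overshoot), here is a minimal sketch of that idea, assuming Timestamp is (or is converted to) datetime64 and that "first/last 3 hours" means hours 0-2 and 21-23; for the real per-second case, the window boundaries would change accordingly:

import pandas as pd

df['Timestamp'] = pd.to_datetime(df['Timestamp'])
hour = df['Timestamp'].dt.hour

# mean price of the first 3 hours (00:00-02:59) and last 3 hours (21:00-23:59) per day
first3 = df[hour < 3].groupby('Date')['Price'].mean()
last3 = df[hour >= 21].groupby('Date')['Price'].mean()

diff = (first3 - last3).rename('Diff').reset_index()
print(diff)

Days with no rows in one of the two windows simply produce NaN rather than silently borrowing prices from neighbouring hours.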

Replace selected cell values in a dataframe with values from another separate dataframe

I have a dataframe (df1) that looks like this:
+------------+--------+-------+
| Date | Length | Width |
+------------+--------+-------+
| 2020-01-01 | 10 | 12 |
+------------+--------+-------+
| 2020-01-02 | 39 | 34 |
+------------+--------+-------+
| 2020-01-03 | 50 | 23 |
+------------+--------+-------+
| 2020-01-04 | 1 | 24 |
+------------+--------+-------+
| 2020-01-05 | 2 | 10 |
+------------+--------+-------+
| 2020-01-06 | 1 | 16 |
+------------+--------+-------+
| 2020-01-07 | 79 | 20 |
+------------+--------+-------+
| 2020-01-08 | 86 | 34 |
+------------+--------+-------+
| 2020-01-09 | 92 | 23 |
+------------+--------+-------+
| 2020-01-10 | 101 | 25 |
+------------+--------+-------+
| 2020-01-11 | 113 | 24 |
+------------+--------+-------+
| 2020-01-12 | 125 | 50 |
+------------+--------+-------+
| ... | ... | |
+------------+--------+-------+
The values for dates "2020-01-04" to "2020-01-06" under the "Length" column are not what I want.
I found the correct values for those 3 dates and arranged them in a separate small table like this (df2):
+------------+--------+
| Date | Length |
+------------+--------+
| 2020-01-04 | 20 |
+------------+--------+
| 2020-01-05 | 30 |
+------------+--------+
| 2020-01-06 | 50 |
+------------+--------+
What is the most efficient way for me to replace the 3 values back to df1?
This is just a pseudo dataset I created to illustrate. The real data I have is much larger than this (both df1 and df2 are much larger), so I can't possibly replace those values manually cell by cell.
I expect the end result to look like this:
+------------+--------+-------+
| Date | Length | Width |
+------------+--------+-------+
| 2020-01-01 | 10 | 12 |
+------------+--------+-------+
| 2020-01-02 | 39 | 34 |
+------------+--------+-------+
| 2020-01-03 | 50 | 23 |
+------------+--------+-------+
| 2020-01-04 | 20 | 24 |
+------------+--------+-------+
| 2020-01-05 | 30 | 10 |
+------------+--------+-------+
| 2020-01-06 | 50 | 16 |
+------------+--------+-------+
| 2020-01-07 | 79 | 20 |
+------------+--------+-------+
| 2020-01-08 | 86 | 34 |
+------------+--------+-------+
| 2020-01-09 | 92 | 23 |
+------------+--------+-------+
| 2020-01-10 | 101 | 25 |
+------------+--------+-------+
| 2020-01-11 | 113 | 24 |
+------------+--------+-------+
| 2020-01-12 | 125 | 50 |
+------------+--------+-------+
| ... | ... | |
+------------+--------+-------+
Thanks so much for your help!
Have a look at DataFrame.update():
# update() aligns rows on the index, so make Date the index in both frames first
df1.set_index('Date', inplace=True)
df1.update(df2.set_index('Date'))
df1.reset_index(inplace=True)
If your indexes are indeed aligned, we can use combine_first:
#df1 = df1.set_index('Date')
#df2 = df2.set_index('Date')
df3 = df2.combine_first(df1)
print(df3)
Length Width
Date
2020-01-01 10 12
2020-01-02 39 34
2020-01-03 50 23
2020-01-04 20 24
2020-01-05 30 10
2020-01-06 50 16
2020-01-07 79 20
2020-01-08 86 34
2020-01-09 92 23
2020-01-10 101 25
2020-01-11 113 24
2020-01-12 125 50
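Another option, in case you prefer not to touch df1's index at all, is a small sketch using Series.map (column names as in the question; rows whose Date is absent from df2 keep their original Length):

# build a Date -> corrected Length lookup from df2
lookup = df2.set_index('Date')['Length']

# overwrite only the dates present in df2, keep everything else
df1['Length'] = df1['Date'].map(lookup).fillna(df1['Length'])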

Query/Replace element in DataFrame with element directly beneath it

I have a dataframe in which I need to query and replace 0.00s with a value directly below it if certain conditions are met. I have looked for documentation on such a behavior but have been unable to find an efficient Pythonic solution.
The logic is as follows:
IF [Symbol] = 'VIX' AND [QuoteDateTime] CONTAINS '09:31:00' AND [Close] = '0.00'
THEN I would like to replace the [Close] value with the [Close] value right below it.
+----+--------+---------------------+---------+
| | Symbol | QuoteDateTime | Close |
+----+--------+---------------------+---------+
| 0 | VIX | 2019-04-11 09:31:00 | 0.00 |
| 1 | VIX | 2019-04-11 09:32:00 | 14.24 |
| 2 | VIX | 2019-04-11 09:33:00 | 14.40 |
| 3 | SPX | 2019-04-11 09:31:00 | 2911.09 |
| 4 | SPX | 2019-04-11 09:32:00 | 2911.55 |
| 5 | SPX | 2019-04-11 09:33:00 | 2915.22 |
| 6 | VIX | 2019-04-12 09:31:00 | 0.00 |
| 7 | VIX | 2019-04-12 09:32:00 | 15.64 |
| 8 | VIX | 2019-04-12 09:33:00 | 15.80 |
| 9 | SPX | 2019-04-12 09:31:00 | 2901.09 |
| 10 | SPX | 2019-04-12 09:32:00 | 2901.55 |
| 11 | SPX | 2019-04-12 09:33:00 | 2905.22 |
+----+--------+---------------------+---------+
Expected output would be that Index 0 [Close] is 14.24 and Index 6 [Close] is 15.64. Everything else remains the same.
+----+--------+---------------------+---------+
| | Symbol | QuoteDateTime | Close |
+----+--------+---------------------+---------+
| 0 | VIX | 2019-04-11 09:31:00 | 14.24 |
| 1 | VIX | 2019-04-11 09:32:00 | 14.24 |
| 2 | VIX | 2019-04-11 09:33:00 | 14.40 |
| 3 | SPX | 2019-04-11 09:31:00 | 2911.09 |
| 4 | SPX | 2019-04-11 09:32:00 | 2911.55 |
| 5 | SPX | 2019-04-11 09:33:00 | 2915.22 |
| 6 | VIX | 2019-04-12 09:31:00 | 15.64 |
| 7 | VIX | 2019-04-12 09:32:00 | 15.64 |
| 8 | VIX | 2019-04-12 09:33:00 | 15.80 |
| 9 | SPX | 2019-04-12 09:31:00 | 2901.09 |
| 10 | SPX | 2019-04-12 09:32:00 | 2901.55 |
| 11 | SPX | 2019-04-12 09:33:00 | 2905.22 |
+----+--------+---------------------+---------+
Create a boolean mask with Series.eq (the == comparison) and Series.dt.strftime (strings from datetimes), then set the new values with Series.mask and Series.shift:
# convert to datetimes if necessary
df['QuoteDateTime'] = pd.to_datetime(df['QuoteDateTime'])

mask = (df['Symbol'].eq('VIX') &
        df['QuoteDateTime'].dt.strftime('%H:%M:%S').eq('09:31:00') &
        df['Close'].eq(0))

df['Close'] = df['Close'].mask(mask, df['Close'].shift(-1))

# alternative 1
# df.loc[mask, 'Close'] = df['Close'].shift(-1)
# alternative 2
# df['Close'] = np.where(mask, df['Close'].shift(-1), df['Close'])

print(df)
Symbol QuoteDateTime Close
0 VIX 2019-04-11 09:31:00 14.24
1 VIX 2019-04-11 09:32:00 14.24
2 VIX 2019-04-11 09:33:00 14.40
3 SPX 2019-04-11 09:31:00 2911.09
4 SPX 2019-04-11 09:32:00 2911.55
5 SPX 2019-04-11 09:33:00 2915.22
6 VIX 2019-04-12 09:31:00 15.64
7 VIX 2019-04-12 09:32:00 15.64
8 VIX 2019-04-12 09:33:00 15.80
9 SPX 2019-04-12 09:31:00 2901.09
10 SPX 2019-04-12 09:32:00 2901.55
11 SPX 2019-04-12 09:33:00 2905.22
Not an expert, but you could try using the index.
First get the index with this short line (note that Close is numeric, so compare against 0.0 rather than a string, and cast QuoteDateTime to string if it has already been parsed to datetime):
idx = df.index[(df['Symbol'] == 'VIX')
               & (df['QuoteDateTime'].astype(str).str.contains('09:31:00'))
               & (df['Close'] == 0.0)]
Then use the index to set the values to the values in the rows below (this relies on the frame having a default RangeIndex, so that idx + 1 addresses the next row):
df.loc[idx, 'Close'] = df.loc[idx + 1, 'Close'].values

How to upsample a pandas data frame

I have a comma separated data file as follows:
ID | StartTimeStamp | EndTimeStamp | Duration (in seconds) | AssetName
1233 | 2017-01-01 00:00:02 | 2017-01-01 00:10:01 | 601 | Car1
1233 | 2017-01-01 00:10:01 | 2017-01-01 00:10:12 | 11 | Car1
...
1235 | 2017-01-01 00:00:02 | 2017-01-01 00:10:01 | 601 | CarN
etc.
Now I would like to create the following using the starttime and duration to upsample the data.
ID | StartTimeStamp | AssetName
1233 | 2017-01-01 00:00:02 | Car1
1233 | 2017-01-01 00:00:03 | Car1
1233 | 2017-01-01 00:00:04 | Car1
...
1233 | 2017-01-01 00:10:01 | Car1
...
1235 | 2017-01-01 00:00:02 | CarN
1235 | 2017-01-01 00:00:03 | CarN
1235 | 2017-01-01 00:00:04 | CarN
... (i.e. 601 rows of data one per second)
1235 | 2017-01-01 00:10:01 | CarN
but I am at odds on how to do this, as upsampling seems to only work with time series. I was thinking of a for loop using the StartTimeStamp and the number of seconds in each row, but I am at a loss on how to go about it.
You can resample each ID group and then fill the gaps in the character columns:
import pandas as pd

df_resampled = df.set_index(pd.to_datetime(df.StartTimeStamp)).groupby('ID')
# expand each group out to one row per second
df_resampled = df_resampled.resample('1s').asfreq()
# fill AssetName forward (and backward, just in case) across the new rows
df_resampled['AssetName'] = df_resampled['AssetName'].ffill().bfill()
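If you want the rows to cover exactly each record's start-to-end span (rather than every second between a group's first and last event), here is a hedged alternative sketch, assuming the timestamp columns parse cleanly:

import pandas as pd

df['StartTimeStamp'] = pd.to_datetime(df['StartTimeStamp'])
df['EndTimeStamp'] = pd.to_datetime(df['EndTimeStamp'])

# one datetime range per row (inclusive of both endpoints), then explode to rows
out = (
    df.assign(StartTimeStamp=[pd.date_range(s, e, freq='s')
                              for s, e in zip(df['StartTimeStamp'], df['EndTimeStamp'])])
      .explode('StartTimeStamp')
      .loc[:, ['ID', 'StartTimeStamp', 'AssetName']]
      .reset_index(drop=True)
)

Each record then yields one row per second from its start through its end timestamp, inclusive.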
