Yesterday I asked this question (with some good answers), which is very similar to, but slightly different from, the problem I'm presented with now. Say I have the following pd.DataFrame (dict):
eff_timestamp val id begin_timestamp end_timestamp
0 2021-01-01 00:00:00 -0.710230 1 2021-01-01 02:00:00 2021-01-01 05:30:00
1 2021-01-01 01:00:00 0.121464 1 2021-01-01 02:00:00 2021-01-01 05:30:00
2 2021-01-01 02:00:00 -0.156328 1 2021-01-01 02:00:00 2021-01-01 05:30:00
3 2021-01-01 03:00:00 0.788685 1 2021-01-01 02:00:00 2021-01-01 05:30:00
4 2021-01-01 04:00:00 0.505210 1 2021-01-01 02:00:00 2021-01-01 05:30:00
5 2021-01-01 05:00:00 -0.738344 1 2021-01-01 02:00:00 2021-01-01 05:30:00
6 2021-01-01 06:00:00 0.266910 1 2021-01-01 02:00:00 2021-01-01 05:30:00
7 2021-01-01 07:00:00 -0.587401 1 2021-01-01 02:00:00 2021-01-01 05:30:00
8 2021-01-02 00:00:00 -0.160692 2 2021-01-02 12:00:00 2021-01-02 15:30:00
9 2021-01-02 01:00:00 0.306354 2 2021-01-02 12:00:00 2021-01-02 15:30:00
10 2021-01-02 02:00:00 NaN 2 2021-01-02 12:00:00 2021-01-02 15:30:00
11 2021-01-02 03:00:00 NaN 2 2021-01-02 12:00:00 2021-01-02 15:30:00
12 2021-01-02 04:00:00 NaN 2 2021-01-02 12:00:00 2021-01-02 15:30:00
13 2021-01-02 05:00:00 NaN 2 2021-01-02 12:00:00 2021-01-02 15:30:00
14 2021-01-02 06:00:00 NaN 2 2021-01-02 12:00:00 2021-01-02 15:30:00
15 2021-01-02 07:00:00 -0.349705 2 2021-01-02 12:00:00 2021-01-02 15:30:00
I would like to get the mean of val for each unique id, over the rows whose eff_timestamp lies between begin_timestamp and end_timestamp. If no rows satisfy that criterion, I'd like to get the last value for that id before that period instead. Note that in this example, id=2 has no rows that satisfy the criterion. Previously I could slice the data so I only keep the rows between begin_timestamp and end_timestamp, and then use a groupby. The solution from my previous post then replaces the NaN value in the groupby object. However, in the example above, id=2 has no rows at all that satisfy the criterion, and therefore no NaN value is created that could be replaced. So if I slice the data based on the criterion above:
sliced = df[(df.eff_timestamp > df.begin_timestamp) & (df.eff_timestamp < df.end_timestamp)]
sliced
>>>
eff_timestamp val id begin_timestamp end_timestamp
3 2021-01-01 03:00:00 0.788685 1 2021-01-01 02:00:00 2021-01-01 05:30:00
4 2021-01-01 04:00:00 0.505210 1 2021-01-01 02:00:00 2021-01-01 05:30:00
5 2021-01-01 05:00:00 -0.738344 1 2021-01-01 02:00:00 2021-01-01 05:30:00
sliced.groupby('id').val.mean()
>>>
id
1 0.185184
Name: val, dtype: float64
This result only includes id=1 with the mean value, but there is no value for id=2. How would I, instead of the mean, include the last available value for id=2, which is -0.349705?
Create a temporary column between_time, group by the id column, and then, inside apply, add the condition: if a particular id has any value that lies within the range, take the mean; otherwise take the value present at last_valid_index.
result = (
    df.assign(
        # flag rows whose eff_timestamp falls inside the (begin, end) window
        between_time=(df.eff_timestamp > df.begin_timestamp)
        & (df.eff_timestamp < df.end_timestamp))
    .groupby('id')
    .apply(
        # mean of the in-window values if any exist,
        # otherwise the last valid value for that id
        lambda x: x.loc[x['between_time'], 'val'].mean()
        if x['between_time'].any()
        else x.loc[x['val'].last_valid_index(), 'val']
    )
)
OUTPUT:
id
1 0.185184
2 -0.349705
dtype: float64
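Note that the fallback above takes the id's last valid value anywhere in its group, while the question asks for the last value before the period. A hedged variant (the helper name mean_or_last_before is my own, assuming eff_timestamp is sorted within each id):
def mean_or_last_before(x):
    # mean of the values inside the (begin, end) window, if any exist
    in_window = x.loc[x['between_time'], 'val']
    if in_window.notna().any():
        return in_window.mean()
    # otherwise: the last non-NaN value observed before the window begins
    before = x.loc[x['eff_timestamp'] < x['begin_timestamp'], 'val'].dropna()
    return before.iloc[-1] if not before.empty else float('nan')

result = (
    df.assign(between_time=(df.eff_timestamp > df.begin_timestamp)
              & (df.eff_timestamp < df.end_timestamp))
      .groupby('id')
      .apply(mean_or_last_before)
)
For this data both versions agree, since id=2's last valid value (-0.349705) happens to fall before its begin_timestamp.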
I have a dataframe with a column 'queue_ist_dt'. This column contains pandas._libs.tslibs.timestamps.Timestamp values. My requirement is:
if time = 10:13:00 then round_off_time = 10:00:00
if time = 23:29:00 then round_off_time = 23:00:00
and so on.
if time = 10:31:00 then round_off_time = 10:30:00
if time = 23:53:00 then round_off_time = 23:30:00
and so on.
if time = 10:30:00 then round_off_time = 10:30:00
These are the 3 conditions.
I tried to write the following logic :
for r in range(df.shape[0]):
    try:
        if df.loc[r, 'queue_ist_dt'].minute < 30:
            timedelta = pd.Timedelta(minutes=df.loc[r, 'queue_ist_dt'].minute)
            df.loc[r, 'queue_placed_interval'] = df.loc[r, 'queue_ist_dt'] - timedelta
        elif df.loc[r, 'queue_ist_dt'].minute > 30:
            ******NEED HELP TO BUILD THIS LOGIC******
    except:
        pass
I need help building the logic for times where the minutes are greater than 30 and have to be rounded down to 30 minutes.
Use Series.dt.floor:
#if necessary convert to datetimes
df['queue_ist_dt'] = pd.to_datetime(df['queue_ist_dt'].astype(str))
df['queue_ist_dt1'] = df['queue_ist_dt'].dt.floor('30Min').dt.time
print (df)
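For the sample times in the question, dt.floor('30Min') matches all three stated conditions; a quick check on a throwaway Series (my own sample values, not the asker's data):
s = pd.Series(pd.to_datetime(['2021-01-01 10:13:00', '2021-01-01 23:29:00',
                              '2021-01-01 10:31:00', '2021-01-01 23:53:00',
                              '2021-01-01 10:30:00']))
print (s.dt.floor('30Min').dt.time)
0    10:00:00
1    23:00:00
2    10:30:00
3    23:30:00
4    10:30:00
dtype: object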
The logic is to subtract 30 minutes from the timedelta; the code is below:
for r in range(df.shape[0]):
    try:
        if df.loc[r, 'queue_ist_dt'].minute < 30:
            timedelta = pd.Timedelta(minutes=df.loc[r, 'queue_ist_dt'].minute)
            df.loc[r, 'queue_placed_interval'] = df.loc[r, 'queue_ist_dt'] - timedelta
        elif df.loc[r, 'queue_ist_dt'].minute >= 30:
            # ******THIS LOGIC****** subtract only the minutes past the half hour
            # (>= 30 also leaves times exactly on the half hour unchanged)
            timedelta = pd.Timedelta(minutes=df.loc[r, 'queue_ist_dt'].minute)
            df.loc[r, 'queue_placed_interval'] = df.loc[r, 'queue_ist_dt'] - (timedelta - pd.Timedelta(minutes=30))
    except:
        pass
Let me know if this helps you😊
Considering this dataframe df as an example:
df = pd.DataFrame({'queue_ist_dt': [pd.Timestamp('2021-01-01 10:01:00'),
                                    pd.Timestamp('2021-01-01 10:35:00'),
                                    pd.Timestamp('2021-01-01 11:19:00'),
                                    pd.Timestamp('2021-01-01 11:33:00'),
                                    pd.Timestamp('2021-01-01 23:23:00'),
                                    pd.Timestamp('2021-01-01 23:22:00'),
                                    pd.Timestamp('2021-01-01 23:55:00')]
                   })
[Out]:
queue_ist_dt
0 2021-01-01 10:01:00
1 2021-01-01 10:35:00
2 2021-01-01 11:19:00
3 2021-01-01 11:33:00
4 2021-01-01 23:23:00
5 2021-01-01 23:22:00
6 2021-01-01 23:55:00
One way would be to use pandas.Series.dt.round as follows
df['round_off_time'] = df['queue_ist_dt'].dt.round('30min')
[Out]:
queue_ist_dt round_off_time
0 2021-01-01 10:01:00 2021-01-01 10:00:00
1 2021-01-01 10:35:00 2021-01-01 10:30:00
2 2021-01-01 11:19:00 2021-01-01 11:30:00
3 2021-01-01 11:33:00 2021-01-01 11:30:00
4 2021-01-01 23:23:00 2021-01-01 23:30:00
5 2021-01-01 23:22:00 2021-01-01 23:30:00
6 2021-01-01 23:55:00 2021-01-02 00:00:00
If the goal is to change the values in the column queue_ist_dt, do the following
df['queue_ist_dt'] = df['queue_ist_dt'].dt.round('30min')
[Out]:
queue_ist_dt
0 2021-01-01 10:00:00
1 2021-01-01 10:30:00
2 2021-01-01 11:30:00
3 2021-01-01 11:30:00
4 2021-01-01 23:30:00
5 2021-01-01 23:30:00
6 2021-01-02 00:00:00
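Note that round snaps to the nearest half hour, so it satisfies the 10:31:00 -> 10:30:00 case but not the 23:29:00 -> 23:00:00 case in the question above; a quick check of the difference:
pd.Timestamp('2021-01-01 23:29:00').round('30min')   # Timestamp('2021-01-01 23:30:00')
pd.Timestamp('2021-01-01 23:29:00').floor('30min')   # Timestamp('2021-01-01 23:00:00')
If the requirement is strictly rounding down, dt.floor from the first answer is the fit.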
I have a DataFrame with irregular sampling frequency, therefore I would like to resample it and interpolate.
Lets say I have following data:
import pandas as pd
idx = pd.DatetimeIndex(["2021-01-01 00:01:35", "2021-01-01 00:05:01", "2021-01-01 00:08:42"])
df = pd.DataFrame({"a": [1, 2, 3]}, index=idx)
# a
# 2021-01-01 00:01:35 1
# 2021-01-01 00:05:01 2
# 2021-01-01 00:08:42 3
And I would like to get result similar to this one (interpolation using "index" method):
a
2021-01-01 00:02:00 1.121359
2021-01-01 00:03:00 1.412621
2021-01-01 00:04:00 1.703883
2021-01-01 00:05:00 1.995146
2021-01-01 00:06:00 2.266968
2021-01-01 00:07:00 2.538462
2021-01-01 00:08:00 2.809955
For that, I thought something like df.resample("T").interpolate(method="index") could work, but it does not. I would need to put some aggregation function there, e.g. df.resample("T").mean().interpolate(method="index"), but that does not produce the wanted result either.
I could do some workaround like this:
df_res = pd.concat([df, df.resample("T").asfreq()]).sort_index()
df_res = df_res[~df_res.index.duplicated()]
df_res = df_res.interpolate(method="index").dropna()
df_res
# a
# 2021-01-01 00:01:35 1.000000
# 2021-01-01 00:02:00 1.121359
# 2021-01-01 00:03:00 1.412621
# 2021-01-01 00:04:00 1.703883
# 2021-01-01 00:05:00 1.995146
# 2021-01-01 00:05:01 2.000000
# 2021-01-01 00:06:00 2.266968
# 2021-01-01 00:07:00 2.538462
# 2021-01-01 00:08:00 2.809955
# 2021-01-01 00:08:42 3.000000
And then remove the original 3 indexes, or keep everything, depending on my preference. But I'm wondering whether there is a better solution that works directly by combining the resample and interpolate methods?
There may be other ways to do this, but since the original data has second resolution, upsampling to seconds is the way to go. The resampler has an interpolate method, so we use that. This yields an interpolated frame at one-second intervals, which we then filter down to whole minutes.
df.resample('S').interpolate().head()
a
2021-01-01 00:01:35 1.000000
2021-01-01 00:01:36 1.004854
2021-01-01 00:01:37 1.009709
2021-01-01 00:01:38 1.014563
2021-01-01 00:01:39 1.019417
Then filter to whole minutes with query:
df.resample('S').interpolate().query('index.second == 0')
a
2021-01-01 00:02:00 1.121359
2021-01-01 00:03:00 1.412621
2021-01-01 00:04:00 1.703883
2021-01-01 00:05:00 1.995146
2021-01-01 00:06:00 2.266968
2021-01-01 00:07:00 2.538462
2021-01-01 00:08:00 2.809955
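If query feels indirect, the same filter can be written with a boolean mask on the index; an equivalent variation:
res = df.resample('S').interpolate()
res[res.index.second == 0]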
I'm trying to use pd.cut to divide 24 hours into the following interval:
[6,11),[11,14),[14,17),[17,22),[22,6)
How could I achieve the last bin [22,6)?
Assuming some form of datetime column, try offsetting the datetime by 6 hours so that the lower bound becomes midnight, then cutting on those shifted hours with custom labels:
import pandas as pd
# sample data
df = pd.DataFrame({
    'datetime': pd.date_range('2021-01-01', periods=24, freq='H')
})
df['bins'] = pd.cut((df['datetime'] - pd.Timedelta(hours=6)).dt.hour,
                    bins=[0, 5, 8, 11, 16, 24],
                    labels=['[6,11)', '[11,14)', '[14,17)',
                            '[17,22)', '[22,6)'],
                    right=False)
df:
datetime bins
0 2021-01-01 00:00:00 [22,6)
1 2021-01-01 01:00:00 [22,6)
2 2021-01-01 02:00:00 [22,6)
3 2021-01-01 03:00:00 [22,6)
4 2021-01-01 04:00:00 [22,6)
5 2021-01-01 05:00:00 [22,6)
6 2021-01-01 06:00:00 [6,11)
7 2021-01-01 07:00:00 [6,11)
8 2021-01-01 08:00:00 [6,11)
9 2021-01-01 09:00:00 [6,11)
10 2021-01-01 10:00:00 [6,11)
11 2021-01-01 11:00:00 [11,14)
12 2021-01-01 12:00:00 [11,14)
13 2021-01-01 13:00:00 [11,14)
14 2021-01-01 14:00:00 [14,17)
15 2021-01-01 15:00:00 [14,17)
16 2021-01-01 16:00:00 [14,17)
17 2021-01-01 17:00:00 [17,22)
18 2021-01-01 18:00:00 [17,22)
19 2021-01-01 19:00:00 [17,22)
20 2021-01-01 20:00:00 [17,22)
21 2021-01-01 21:00:00 [17,22)
22 2021-01-01 22:00:00 [22,6)
23 2021-01-01 23:00:00 [22,6)
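Since subtracting exactly 6 hours changes only the hour component, the same trick can be written with integer modulo arithmetic on the hours, which some may find more readable; a sketch producing the same bins:
shifted_hour = (df['datetime'].dt.hour - 6) % 24
df['bins'] = pd.cut(shifted_hour,
                    bins=[0, 5, 8, 11, 16, 24],
                    labels=['[6,11)', '[11,14)', '[14,17)', '[17,22)', '[22,6)'],
                    right=False)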
How do I replace duplicates for each group with NaNs while keeping the rows?
I need to keep the rows rather than remove them, keeping the original value only where it first shows up.
import pandas as pd
from datetime import timedelta
df = pd.DataFrame({
    'date': ['2019-01-01 00:00:00', '2019-01-01 01:00:00', '2019-01-01 02:00:00', '2019-01-01 03:00:00',
             '2019-09-01 02:00:00', '2019-09-01 03:00:00', '2019-09-01 04:00:00', '2019-09-01 05:00:00'],
    'value': [10, 10, 10, 10, 12, 12, 12, 12],
    'ID': ['Jackie', 'Jackie', 'Jackie', 'Jackie', 'Zoop', 'Zoop', 'Zoop', 'Zoop']
})
df['date'] = pd.to_datetime(df['date'], infer_datetime_format=True)
date value ID
0 2019-01-01 00:00:00 10 Jackie
1 2019-01-01 01:00:00 10 Jackie
2 2019-01-01 02:00:00 10 Jackie
3 2019-01-01 03:00:00 10 Jackie
4 2019-09-01 02:00:00 12 Zoop
5 2019-09-01 03:00:00 12 Zoop
6 2019-09-01 04:00:00 12 Zoop
7 2019-09-01 05:00:00 12 Zoop
Desired Dataframe:
date value ID
0 2019-01-01 00:00:00 10 Jackie
1 2019-01-01 01:00:00 NaN Jackie
2 2019-01-01 02:00:00 NaN Jackie
3 2019-01-01 03:00:00 NaN Jackie
4 2019-09-01 02:00:00 12 Zoop
5 2019-09-01 03:00:00 NaN Zoop
6 2019-09-01 04:00:00 NaN Zoop
7 2019-09-01 05:00:00 NaN Zoop
Edit:
Duplicated values should only be dropped within the same date, regardless of the frequency. So if value 10 shows up twice on Jan-1 and three times on Jan-2, it should show up once on Jan-1 and once on Jan-2.
I assume you check duplicates on the value and ID columns, and additionally on the date part of the date column:
import numpy as np

df.loc[df.assign(d=df.date.dt.date).duplicated(['value', 'ID', 'd']), 'value'] = np.nan
Out[269]:
date value ID
0 2019-01-01 00:00:00 10.0 Jackie
1 2019-01-01 01:00:00 NaN Jackie
2 2019-01-01 02:00:00 NaN Jackie
3 2019-01-01 03:00:00 NaN Jackie
4 2019-09-01 02:00:00 12.0 Zoop
5 2019-09-01 03:00:00 NaN Zoop
6 2019-09-01 04:00:00 NaN Zoop
7 2019-09-01 05:00:00 NaN Zoop
As @Trenton suggests, you may use pd.NA to avoid importing numpy.
(Note: as @rafaelc suggests, here is a link explaining the differences between pd.NA and np.nan in detail: https://pandas.pydata.org/pandas-docs/stable/whatsnew/v1.0.0.html#experimental-na-scalar-to-denote-missing-values)
df.loc[df.assign(d=df.date.dt.date).duplicated(['value','ID', 'd']), 'value'] = pd.NA
Out[273]:
date value ID
0 2019-01-01 00:00:00 10 Jackie
1 2019-01-01 01:00:00 <NA> Jackie
2 2019-01-01 02:00:00 <NA> Jackie
3 2019-01-01 03:00:00 <NA> Jackie
4 2019-09-01 02:00:00 12 Zoop
5 2019-09-01 03:00:00 <NA> Zoop
6 2019-09-01 04:00:00 <NA> Zoop
7 2019-09-01 05:00:00 <NA> Zoop
This works if the dataframe is sorted, as in your example:
import numpy as np  # to be used for np.nan

df['duplicate'] = df['value'].shift(1)  # helper column holding the previous row's value
df['value'] = df.apply(lambda x: np.nan if x['value'] == x['duplicate']
                       else x['value'], axis=1)  # conditional replace
df = df.drop('duplicate', axis=1)  # drop helper column
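A variant that keeps the shift idea but compares within each (ID, day) group, so a value repeated by a different ID or on a different date is not blanked; a sketch under the same sorted assumption:
import numpy as np

prev = df.groupby([df['ID'], df['date'].dt.date])['value'].shift(1)
df['value'] = df['value'].mask(df['value'].eq(prev), np.nan)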
Group on the dates and take the first observed value (not necessarily the first when sorted by time), then merge the result back to the original dataframe.
df2 = df.groupby([df['date'].dt.date, 'ID'], as_index=False).first()
>>> df.drop(columns='value').merge(df2, on=['date', 'ID'], how='left')[df.columns]
date value ID
0 2019-01-01 00:00:00 10.0 Jackie
1 2019-01-01 01:00:00 NaN Jackie
2 2019-01-01 02:00:00 NaN Jackie
3 2019-01-01 03:00:00 NaN Jackie
4 2019-09-01 02:00:00 12.0 Zoop
5 2019-09-01 03:00:00 NaN Zoop
6 2019-09-01 04:00:00 NaN Zoop
7 2019-09-01 05:00:00 NaN Zoop
For clarity, here is an MRE:
df = pd.DataFrame(
    {"id": [1, 2, 3, 4],
     "start_time": ["2020-06-01 01:00:00", "2020-06-01 01:00:00", "2020-06-01 19:00:00", "2020-06-02 04:00:00"],
     "end_time": ["2020-06-01 14:00:00", "2020-06-01 18:00:00", "2020-06-02 10:00:00", "2020-06-02 16:00:00"]
    })
df["start_time"] = pd.to_datetime(df["start_time"])
df["end_time"] = pd.to_datetime(df["end_time"])
df["sub_time"] = df["end_time"] - df["start_time"]
this outputs:
id start_time end_time sub_time
0 1 2020-06-01 01:00:00 2020-06-01 14:00:00 13:00:00
1 2 2020-06-01 01:00:00 2020-06-01 18:00:00 17:00:00
2 3 2020-06-01 19:00:00 2020-06-02 10:00:00 15:00:00
3 4 2020-06-02 04:00:00 2020-06-02 16:00:00 12:00:00
but when the start_time ~ end_time span includes the range 00:00:00 ~ 03:59:59, I want to ignore that part (not count it in sub_time).
So instead of the output above, I would get:
id start_time end_time sub_time
0 1 2020-06-01 01:00:00 2020-06-01 14:00:00 10:00:00
1 2 2020-06-01 01:00:00 2020-06-01 18:00:00 14:00:00
2 3 2020-06-01 19:00:00 2020-06-02 10:00:00 11:00:00
3 4 2020-06-02 04:00:00 2020-06-02 16:00:00 12:00:00
row 0: starting at 01:00:00, do not count until 04:00:00; then 04:00:00 ~ 14:00:00 is a 10-hour period.
row 2: consider the duration from 19:00:00 ~ 24:00:00 plus 04:00:00 ~ 10:00:00, which gives 11:00:00 in the sub_time column.
Any suggestions?
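A minimal sketch of one possible approach (the helper skip_window_overlap is my own, hypothetical): compute the raw duration, then subtract the interval's overlap with each daily 00:00 ~ 04:00 window it spans.
def skip_window_overlap(start, end):
    # total overlap of [start, end] with the daily 00:00-04:00 windows
    overlap = pd.Timedelta(0)
    day = start.normalize()  # midnight of the start day
    while day <= end:
        lo = max(start, day)
        hi = min(end, day + pd.Timedelta(hours=4))
        if lo < hi:
            overlap += hi - lo
        day += pd.Timedelta(days=1)
    return overlap

df['sub_time'] = [(e - s) - skip_window_overlap(s, e)
                  for s, e in zip(df['start_time'], df['end_time'])]
This reproduces the desired output above: row 0 loses the 01:00:00 ~ 04:00:00 stretch (3 hours), row 2 loses 00:00:00 ~ 04:00:00 on the second day (4 hours), and row 3 loses nothing because it starts exactly at 04:00:00.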