For example, if I have the following data:
df = pd.DataFrame({'Start': ['2022-01-01 08:30:00', '2022-01-01 13:00:00', '2022-01-02 22:00:00'],
                   'Stop': ['2022-01-01 12:00:00', '2022-01-02 10:30:00', '2022-01-04 8:00:00']})
df = df.apply(pd.to_datetime)
Start Stop
0 2022-01-01 08:30:00 2022-01-01 12:00:00
1 2022-01-01 13:00:00 2022-01-02 10:30:00
2 2022-01-02 22:00:00 2022-01-04 08:00:00
How can I split each record across midnight and upsample my data, so it looks like this:
Start Stop
0 2022-01-01 08:30:00 2022-01-01 12:00:00
1 2022-01-01 13:00:00 2022-01-02 00:00:00
2 2022-01-02 00:00:00 2022-01-02 10:30:00
3 2022-01-02 22:00:00 2022-01-03 00:00:00
4 2022-01-03 00:00:00 2022-01-04 00:00:00
5 2022-01-04 00:00:00 2022-01-04 08:00:00
I want to calculate the duration per day for each time record using df['Stop'] - df['Start']. Maybe there is another way to do it. Thank you!
You could start by implementing a function that computes all the date splits for a given row:
from datetime import timedelta

def split_date(start, stop):
    # Same day case
    if start.date() == stop.date():
        return [(start, stop)]
    # Several days case: cut at the next midnight and recurse on the remainder
    stop_split = start.replace(hour=0, minute=0, second=0) + timedelta(days=1)
    return [(start, stop_split)] + split_date(stop_split, stop)
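For example, the third record of the question splits into three segments (a quick check, assuming pandas Timestamps as inputs):
import pandas as pd

split_date(pd.Timestamp("2022-01-02 22:00:00"), pd.Timestamp("2022-01-04 08:00:00"))
# [(Timestamp('2022-01-02 22:00:00'), Timestamp('2022-01-03 00:00:00')),
#  (Timestamp('2022-01-03 00:00:00'), Timestamp('2022-01-04 00:00:00')),
#  (Timestamp('2022-01-04 00:00:00'), Timestamp('2022-01-04 08:00:00'))]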
Then you can use your existing dataframe to create a new one with all the records, by computing the split of each row:
new_dates = [
    elt for _, row in df.iterrows() for elt in split_date(row["Start"], row["Stop"])
]
new_df = pd.DataFrame(new_dates, columns=["Start", "Stop"])
The output should then be the one you expected:
Start Stop
0 2022-01-01 08:30:00 2022-01-01 12:00:00
1 2022-01-01 13:00:00 2022-01-02 00:00:00
2 2022-01-02 00:00:00 2022-01-02 10:30:00
3 2022-01-02 22:00:00 2022-01-03 00:00:00
4 2022-01-03 00:00:00 2022-01-04 00:00:00
5 2022-01-04 00:00:00 2022-01-04 08:00:00
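From here, the per-day durations mentioned in the question follow directly (a minimal sketch built on the new_df above; Duration and per_day are just illustrative names):
new_df["Duration"] = new_df["Stop"] - new_df["Start"]
# every split row now lies within a single day, so group by the start date
per_day = new_df.groupby(new_df["Start"].dt.date)["Duration"].sum()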
I have a dataframe with a column 'queue_ist_dt'. This column contains pandas._libs.tslibs.timestamps.Timestamp values. My requirement is:
if time = 10:13:00 then round_off_time = 10:00:00
if time = 23:29:00 then round_off_time = 23:00:00
and so on.
if time = 10:31:00 then round_off_time = 10:30:00
if time = 23:53:00 then round_off_time = 23:30:00
and so on.
if time = 10:30:00 then round_off_time = 10:30:00
These are the 3 conditions.
I tried to write the following logic:
for r in range(df.shape[0]):
    try:
        if df.loc[r, 'queue_ist_dt'].minute < 30:
            timedelta = pd.Timedelta(minutes=df.loc[r, 'queue_ist_dt'].minute)
            df.loc[r, 'queue_placed_interval'] = df.loc[r, 'queue_ist_dt'] - timedelta
        elif df.loc[r, 'queue_ist_dt'].minute > 30:
            ******NEED HELP TO BUILD THIS LOGIC******
    except:
        pass
I need help building the logic for times where the minutes are greater than 30 and have to be rounded down to 30 minutes.
Use Series.dt.floor:
#if necessary convert to datetimes
df['queue_ist_dt'] = pd.to_datetime(df['queue_ist_dt'].astype(str))
df['queue_ist_dt1'] = df['queue_ist_dt'].dt.floor('30Min').dt.time
print (df)
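As a quick check against the three conditions in the question (a throwaway snippet with the question's sample times):
import pandas as pd

s = pd.Series(pd.to_datetime(['2022-01-01 10:13:00',
                              '2022-01-01 23:53:00',
                              '2022-01-01 10:30:00']))
print(s.dt.floor('30Min').dt.time)
# 0    10:00:00
# 1    23:30:00
# 2    10:30:00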
The logic is: for minutes past the half hour, subtract (minute - 30) minutes from the timestamp. The code is as below:
for r in range(df.shape[0]):
    try:
        if df.loc[r, 'queue_ist_dt'].minute < 30:
            timedelta = pd.Timedelta(minutes=df.loc[r, 'queue_ist_dt'].minute)
            df.loc[r, 'queue_placed_interval'] = df.loc[r, 'queue_ist_dt'] - timedelta
        elif df.loc[r, 'queue_ist_dt'].minute >= 30:
            # ******THIS LOGIC******
            # subtract only the minutes past the half hour, i.e. (minute - 30)
            timedelta = pd.Timedelta(minutes=df.loc[r, 'queue_ist_dt'].minute - 30)
            df.loc[r, 'queue_placed_interval'] = df.loc[r, 'queue_ist_dt'] - timedelta
    except:
        pass
Let me know if this helps you😊
Considering this dataframe df as an example
df = pd.DataFrame({'queue_ist_dt': [pd.Timestamp('2021-01-01 10:01:00'),
                                    pd.Timestamp('2021-01-01 10:35:00'),
                                    pd.Timestamp('2021-01-01 11:19:00'),
                                    pd.Timestamp('2021-01-01 11:33:00'),
                                    pd.Timestamp('2021-01-01 23:23:00'),
                                    pd.Timestamp('2021-01-01 23:22:00'),
                                    pd.Timestamp('2021-01-01 23:55:00')]
                   })
[Out]:
queue_ist_dt
0 2021-01-01 10:01:00
1 2021-01-01 10:35:00
2 2021-01-01 11:19:00
3 2021-01-01 11:33:00
4 2021-01-01 23:23:00
5 2021-01-01 23:22:00
6 2021-01-01 23:55:00
One way would be to use pandas.Series.dt.round as follows
df['round_off_time'] = df['queue_ist_dt'].dt.round('30min')
[Out]:
queue_ist_dt round_off_time
0 2021-01-01 10:01:00 2021-01-01 10:00:00
1 2021-01-01 10:35:00 2021-01-01 10:30:00
2 2021-01-01 11:19:00 2021-01-01 11:30:00
3 2021-01-01 11:33:00 2021-01-01 11:30:00
4 2021-01-01 23:23:00 2021-01-01 23:30:00
5 2021-01-01 23:22:00 2021-01-01 23:30:00
6 2021-01-01 23:55:00 2021-01-02 00:00:00
If the goal is to change the values in the column queue_ist_dt, do the following
df['queue_ist_dt'] = df['queue_ist_dt'].dt.round('30min')
[Out]:
queue_ist_dt
0 2021-01-01 10:00:00
1 2021-01-01 10:30:00
2 2021-01-01 11:30:00
3 2021-01-01 11:30:00
4 2021-01-01 23:30:00
5 2021-01-01 23:30:00
6 2021-01-02 00:00:00
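One caveat: dt.round goes to the nearest 30-minute mark, so 23:55 above lands on midnight of the next day, while the question's examples (23:53 -> 23:30) always round down. If rounding down is the actual requirement, dt.floor is the drop-in replacement:
df['round_off_time'] = df['queue_ist_dt'].dt.floor('30min')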
I have two dataframes (simple examples shown below):
df1
time column
2022-01-01 00:00:00
2022-01-01 00:15:00
2022-01-01 00:30:00
2022-01-01 00:45:00
2022-01-02 00:00:00
2022-01-02 00:15:00
2022-01-02 00:30:00
2022-01-02 00:45:00

df2
time column          ID column  Value
2022-01-01 00:00:00  1          10
2022-01-01 00:30:00  1          9
2022-01-02 00:30:00  1          5
2022-01-02 00:45:00  1          15
2022-01-01 00:00:00  2          6
2022-01-01 00:15:00  2          2
2022-01-02 00:45:00  2          7
df1 shows every timestamp I am interested in. df2 shows data sorted by timestamp and ID. What I need to do is add every single timestamp from df1 that is not in df2 for each unique ID and add zero to the value column.
This is the outcome I'm interested in:
df3
time column ID column Value
2022-01-01 00:00:00 1 10
2022-01-01 00:15:00 1 0
2022-01-01 00:30:00 1 9
2022-01-01 00:45:00 1 0
2022-01-02 00:00:00 1 0
2022-01-02 00:15:00 1 0
2022-01-02 00:30:00 1 5
2022-01-02 00:45:00 1 15
2022-01-01 00:00:00 2 6
2022-01-01 00:15:00 2 2
2022-01-01 00:30:00 2 0
2022-01-01 00:45:00 2 0
2022-01-02 00:00:00 2 0
2022-01-02 00:15:00 2 0
2022-01-02 00:30:00 2 0
2022-01-02 00:45:00 2 7
My df2 is much larger (hundreds of thousands of rows, and more than 500 unique IDs), so doing this manually isn't feasible. I've searched for hours for something that could help, but everything has fallen flat. This data will ultimately be fed into a NN.
I am open to other libraries and can work in python or R.
Any help is greatly appreciated.
Try:
x = (
    df2.groupby("ID column")
    .apply(lambda x: x.merge(df1, how="outer").fillna(0))
    .drop(columns="ID column")
    .droplevel(1)
    .reset_index()
    .sort_values(by=["ID column", "time column"])
)
print(x)
Prints:
ID column time column Value
0 1 2022-01-01 00:00:00 10.0
4 1 2022-01-01 00:15:00 0.0
1 1 2022-01-01 00:30:00 9.0
5 1 2022-01-01 00:45:00 0.0
6 1 2022-01-02 00:00:00 0.0
7 1 2022-01-02 00:15:00 0.0
2 1 2022-01-02 00:30:00 5.0
3 1 2022-01-02 00:45:00 15.0
8 2 2022-01-01 00:00:00 6.0
9 2 2022-01-01 00:15:00 2.0
11 2 2022-01-01 00:30:00 0.0
12 2 2022-01-01 00:45:00 0.0
13 2 2022-01-02 00:00:00 0.0
14 2 2022-01-02 00:15:00 0.0
15 2 2022-01-02 00:30:00 0.0
10 2 2022-01-02 00:45:00 7.0
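Note that the merge introduces NaNs, so after fillna(0) the Value column comes out as float (10.0, 9.0, ...); if integers are needed downstream, one extra optional cast restores them:
x["Value"] = x["Value"].astype(int)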
I have a pandas dataframe as follows
df_sample = pd.DataFrame({
    'machine': [1, 1, 1, 2],
    'ts_start': ["2022-01-01 20:00:00", "2022-01-01 20:30:00", "2022-01-02 20:30:00", "2022-01-01 19:00:00"],
    'ts_end': ["2022-01-01 21:00:00", "2022-01-01 21:30:00", "2022-01-02 20:35:00", "2022-01-01 23:00:00"]
})
I want to check which of these [ts_start, ts_end] intervals overlap, for the same machine. I have seen some questions about finding overlaps, but none that group by another column; in my case I need to consider the overlaps for each machine separately. I tried using piso, which seems very interesting.
import piso

df_sample['ts_start'] = pd.to_datetime(df_sample['ts_start'])
df_sample['ts_end'] = pd.to_datetime(df_sample['ts_end'])
ii = pd.IntervalIndex.from_arrays(df_sample["ts_start"], df_sample["ts_end"])
df_sample["isOverlap"] = piso.adjacency_matrix(ii).any(axis=1).astype(int).values
I obtain something like this:
machine ts_start ts_end isOverlap
0 1 2022-01-01 20:00:00 2022-01-01 21:00:00 1
1 1 2022-01-01 20:30:00 2022-01-01 21:30:00 1
2 1 2022-01-02 20:30:00 2022-01-02 20:35:00 0
3 2 2022-01-01 19:00:00 2022-01-01 23:00:00 1
However, it is considering all machines at the same time. Is there a way (using piso or not) to get the overlapping moments, for each machine, in a single dataframe?
piso can indeed be used. It will run fast on large datasets and is not limited by assumptions about the sampling rate of the times. Modify your piso example to wrap the last two lines in a function:
def make_overlaps(df):
    ii = pd.IntervalIndex.from_arrays(df["ts_start"], df["ts_end"])
    df["isOverlap"] = piso.adjacency_matrix(ii).any(axis=1).astype(int).values
    return df
Then group df_sample on the machine column, and apply:
df_sample.groupby("machine").apply(make_overlaps)
This will give you:
machine ts_start ts_end isOverlap
0 1 2022-01-01 20:00:00 2022-01-01 21:00:00 1
1 1 2022-01-01 20:30:00 2022-01-01 21:30:00 1
2 1 2022-01-02 20:30:00 2022-01-02 20:35:00 0
3 2 2022-01-01 19:00:00 2022-01-01 23:00:00 0
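Depending on your pandas version, groupby(...).apply may prepend the machine key as an extra index level; if that happens, group_keys=False (a standard groupby option) keeps the original flat index:
df_sample.groupby("machine", group_keys=False).apply(make_overlaps)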
Here's a way to do what your question asks:
import pandas as pd
df_sample = pd.DataFrame({
    'machine': [1, 1, 1, 2],
    'ts_start': ["2022-01-01 20:00:00", "2022-01-01 20:30:00", "2022-01-02 20:30:00", "2022-01-01 19:00:00"],
    'ts_end': ["2022-01-01 21:00:00", "2022-01-01 21:30:00", "2022-01-02 20:35:00", "2022-01-01 23:00:00"]
})
df_sample = df_sample.sort_values(['machine', 'ts_start', 'ts_end'])
print(df_sample)
def foo(x):
    if len(x.index) > 1:
        # track the furthest end seen so far ("reach") and the row that produced it;
        # this sweep assumes rows are sorted by ts_start within each machine (done above)
        iPrev, reachOfPrev = x.index[0], x.loc[x.index[0], 'ts_end']
        x.loc[iPrev, 'isOverlap'] = 0
        for i in x.index[1:]:
            if x.loc[i, 'ts_start'] < reachOfPrev:
                # current interval starts before the previous reach ends: both overlap
                x.loc[iPrev, 'isOverlap'] = 1
                x.loc[i, 'isOverlap'] = 1
            else:
                x.loc[i, 'isOverlap'] = 0
            if x.loc[i, 'ts_end'] > reachOfPrev:
                iPrev, reachOfPrev = i, x.loc[i, 'ts_end']
    else:
        x['isOverlap'] = 0
    x.isOverlap = x.isOverlap.astype(int)
    return x
df_sample = df_sample.groupby('machine').apply(foo)
print(df_sample)
Input:
machine ts_start ts_end
0 1 2022-01-01 20:00:00 2022-01-01 21:00:00
1 1 2022-01-01 20:30:00 2022-01-01 21:30:00
2 1 2022-01-02 20:30:00 2022-01-02 20:35:00
3 2 2022-01-01 19:00:00 2022-01-01 23:00:00
Output:
machine ts_start ts_end isOverlap
0 1 2022-01-01 20:00:00 2022-01-01 21:00:00 1
1 1 2022-01-01 20:30:00 2022-01-01 21:30:00 1
2 1 2022-01-02 20:30:00 2022-01-02 20:35:00 0
3 2 2022-01-01 19:00:00 2022-01-01 23:00:00 0
Assuming overlaps only need to be checked at minute resolution, you could try:
# create date ranges by minute frequency
df_sample["times"] = df_sample.apply(lambda row: pd.date_range(row["ts_start"], row["ts_end"], freq="1min"), axis=1)
# explode to get one row per minute
df_sample = df_sample.explode("times")
# check if times overlap by looking for duplicates
df_sample["isOverlap"] = df_sample[["machine", "times"]].duplicated(keep=False)
# groupby to get back the original data structure
output = df_sample.drop("times", axis=1).groupby(["machine", "ts_start", "ts_end"]).any().astype(int).reset_index()
>>> output
machine ts_start ts_end isOverlap
0 1 2022-01-01 20:00:00 2022-01-01 21:00:00 1
1 1 2022-01-01 20:30:00 2022-01-01 21:30:00 1
2 1 2022-01-02 20:30:00 2022-01-02 20:35:00 0
3 2 2022-01-01 19:00:00 2022-01-01 23:00:00 0
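One caveat with this approach: both endpoints are materialized, so two intervals that merely touch (one ending exactly when the next starts) share a minute and get flagged as overlapping. If touching should not count, the ranges can be made end-exclusive (assuming pandas >= 1.4 for the inclusive argument):
df_sample["times"] = df_sample.apply(
    lambda row: pd.date_range(row["ts_start"], row["ts_end"], freq="1min", inclusive="left"),
    axis=1,
)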
I have a Pandas dataframe that looks like this:
# date
--- -------------------
0 2022-01-01 08:00:00
1 2022-01-01 08:01:00
2 2022-01-01 08:52:00
My goal is to add a new column that contains a datetime object with the value of the next hour. I looked at the documentation of the ceil function, and it works pretty well in most cases.
Issue
The problem concerns hours that are perfectly round (like the one at #0):
df["next"] = df["date"].dt.ceil("H")
# date next
--- ------------------- -------------------
0 2022-01-01 08:00:00 2022-01-01 08:00:00 <--- wrong, expected 09:00:00
1 2022-01-01 08:01:00 2022-01-01 09:00:00 <--- correct
2 2022-01-01 08:52:00 2022-01-01 09:00:00 <--- correct
Sub-optimal solution
I have come up with the following workaround, but I find it really clumsy:
def nextHour(current):
    return pd.date_range(start=current, periods=2, freq="H")[1]

df["next"] = df["date"].apply(nextHour)
I have around 1-2 million rows in my dataset and I find this solution extremely slow compared to the native dt.ceil(). Is there a better way of doing it?
This is the way ceil works: it won't jump to the next hour when the timestamp is already on the boundary.
What you want seems more like a floor plus 1 hour, using pandas.Timedelta:
df['next'] = df['date'].dt.floor('H') + pd.Timedelta('1h')
output:
date next
0 2022-01-01 08:00:00 2022-01-01 09:00:00
1 2022-01-01 08:01:00 2022-01-01 09:00:00
2 2022-01-01 08:52:00 2022-01-01 09:00:00
Difference in boundary behavior between floor and ceil:
date ceil floor
0 2022-01-01 08:00:00 2022-01-01 08:00:00 2022-01-01 08:00:00
1 2022-01-01 08:01:00 2022-01-01 09:00:00 2022-01-01 08:00:00
2 2022-01-01 08:52:00 2022-01-01 09:00:00 2022-01-01 08:00:00
3 2022-01-01 09:00:00 2022-01-01 09:00:00 2022-01-01 09:00:00
4 2022-01-01 09:01:00 2022-01-01 10:00:00 2022-01-01 09:00:00
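A nanosecond-nudge variant is also sometimes seen: shifting every timestamp just past the boundary before ceiling gives the same "next hour" result (a matter of taste; floor + 1h reads more clearly):
# exact hours are no longer on the boundary after the 1 ns shift
df['next'] = (df['date'] + pd.Timedelta(1, 'ns')).dt.ceil('H')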
I have a pandas column which I have initialized with ones; this column represents the health of a solar panel.
I need to decay this value linearly, except at the time when the panel is replaced, where the value resets to 1 (hence the initialization to ones). What I am doing is looping through the column and updating the current value with the previous value minus a constant.
This operation is extremely expensive (and I have over 200,000 samples). I was hoping someone might be able to help me with a vectorized solution where I can avoid this for loop. Here is my code:
def set_degredation_factors_pv(df):
    for i in df.index:
        if i != replacement_duration_PV_year * hour_per_year and i != 0:
            df.loc[i, 'degradation_factor_PV_power_frac'] = (
                df.loc[i-1, 'degradation_factor_PV_power_frac']
                - degradation_rate_PV_power_perc_per_hour / 100
            )
    return df
Variables:
replacement_duration_PV_year = 25
hour_per_year = 8760
degradation_rate_PV_power_perc_per_hour = 5.479e-5
Input data:
time_datetime degradation_factor_PV_power_frac
0 2022-01-01 00:00:00 1
1 2022-01-01 01:00:00 1
2 2022-01-01 02:00:00 1
3 2022-01-01 03:00:00 1
4 2022-01-01 04:00:00 1
... ... ...
8732 2022-12-30 20:00:00 1
8733 2022-12-30 21:00:00 1
8734 2022-12-30 22:00:00 1
8735 2022-12-30 23:00:00 1
8736 2022-12-31 00:00:00 1
Output data (only showing one year of time):
time_datetime degradation_factor_PV_power_frac
0 2022-01-01 00:00:00 1.000000
1 2022-01-01 01:00:00 0.999999
2 2022-01-01 02:00:00 0.999999
3 2022-01-01 03:00:00 0.999998
4 2022-01-01 04:00:00 0.999998
... ... ...
8732 2022-12-30 20:00:00 0.995216
8733 2022-12-30 21:00:00 0.995215
8734 2022-12-30 22:00:00 0.995215
8735 2022-12-30 23:00:00 0.995214
8736 2022-12-31 00:00:00 0.995214
Try:
rate = degradation_rate_PV_power_perc_per_hour / 100

# True at row 0 and at each replacement row, i.e. wherever the factor resets to 1
mask = (df.index == 0) | (df.index == replacement_duration_PV_year * hour_per_year)

# within each reset block, subtract `rate` once per elapsed hour (linear decay,
# matching the loop; a cumprod would give compound decay instead)
df['degradation_factor_PV_power_frac'] = (
    df['degradation_factor_PV_power_frac']
    - df.groupby(mask.cumsum()).cumcount() * rate
)
Output:
>>> df
time_datetime degradation_factor_PV_power_frac
0 2022-01-01 00:00:00 1.000000
1 2022-01-01 01:00:00 0.999999
2 2022-01-01 02:00:00 0.999999
3 2022-01-01 03:00:00 0.999998
4 2022-01-01 04:00:00 0.999998
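As a quick sanity check (a throwaway snippet reusing rate from above; the one-year sample contains no replacement row, so there is a single reset block):
import numpy as np

# linear decay from 1 at `rate` per hour across the whole sample
expected = 1 - np.arange(len(df)) * rate
assert np.allclose(df['degradation_factor_PV_power_frac'], expected)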