I have a pandas dataframe as follows
import pandas as pd

df_sample = pd.DataFrame({
'machine': [1, 1, 1, 2],
'ts_start': ["2022-01-01 20:00:00", "2022-01-01 20:30:00", "2022-01-02 20:30:00", "2022-01-01 19:00:00"],
'ts_end': ["2022-01-01 21:00:00", "2022-01-01 21:30:00", "2022-01-02 20:35:00", "2022-01-01 23:00:00"]
})
I want to check which of these [ts_start, ts_end] intervals overlap, for the same machine. I have seen some questions about finding overlaps, but I couldn't find one where the check is grouped by another column; in my case the overlaps must be considered for each machine separately.
I tried using piso, which seems very interesting.
import piso

df_sample['ts_start'] = pd.to_datetime(df_sample['ts_start'])
df_sample['ts_end'] = pd.to_datetime(df_sample['ts_end'])
ii = pd.IntervalIndex.from_arrays(df_sample["ts_start"], df_sample["ts_end"])
df_sample["isOverlap"] = piso.adjacency_matrix(ii).any(axis=1).astype(int).values
I obtain something like this:
machine ts_start ts_end isOverlap
0 1 2022-01-01 20:00:00 2022-01-01 21:00:00 1
1 1 2022-01-01 20:30:00 2022-01-01 21:30:00 1
2 1 2022-01-02 20:30:00 2022-01-02 20:35:00 0
3 2 2022-01-01 19:00:00 2022-01-01 23:00:00 1
However, this considers all machines at the same time. Is there a way (using piso or not) to get the overlaps for each machine, in a single dataframe?
piso can indeed be used. It runs fast on large datasets and makes no assumptions about the sampling rate of the timestamps. Modify your piso example to wrap the last two lines in a function:
def make_overlaps(df):
    ii = pd.IntervalIndex.from_arrays(df["ts_start"], df["ts_end"])
    # a row overlaps if its interval intersects any other interval in the group
    df["isOverlap"] = piso.adjacency_matrix(ii).any(axis=1).astype(int).values
    return df
Then group df_sample on the machine column, and apply:
df_sample.groupby("machine").apply(make_overlaps)
This will give you:
machine ts_start ts_end isOverlap
0 1 2022-01-01 20:00:00 2022-01-01 21:00:00 1
1 1 2022-01-01 20:30:00 2022-01-01 21:30:00 1
2 1 2022-01-02 20:30:00 2022-01-02 20:35:00 0
3 2 2022-01-01 19:00:00 2022-01-01 23:00:00 0
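Note that depending on your pandas version, groupby(...).apply may prepend the machine key as an extra index level when the function returns a DataFrame; passing group_keys=False keeps the flat index shown above:
df_sample.groupby("machine", group_keys=False).apply(make_overlaps)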
Here's a way to do what your question asks:
import pandas as pd
df_sample = pd.DataFrame({
'machine': [1, 1, 1, 2],
'ts_start': ["2022-01-01 20:00:00", "2022-01-01 20:30:00", "2022-01-02 20:30:00", "2022-01-01 19:00:00"],
'ts_end': ["2022-01-01 21:00:00", "2022-01-01 21:30:00", "2022-01-02 20:35:00", "2022-01-01 23:00:00"]
})
df_sample = df_sample.sort_values(['machine', 'ts_start', 'ts_end'])
print(df_sample)
def foo(x):
    # sweep the sorted intervals, tracking the furthest end seen so far
    if len(x.index) > 1:
        iPrev, reachOfPrev = x.index[0], x.loc[x.index[0], 'ts_end']
        x.loc[iPrev, 'isOverlap'] = 0
        for i in x.index[1:]:
            if x.loc[i, 'ts_start'] < reachOfPrev:
                # this interval starts before the furthest previous end:
                # flag both it and the interval holding that end
                x.loc[iPrev, 'isOverlap'] = 1
                x.loc[i, 'isOverlap'] = 1
            else:
                x.loc[i, 'isOverlap'] = 0
            if x.loc[i, 'ts_end'] > reachOfPrev:
                iPrev, reachOfPrev = i, x.loc[i, 'ts_end']
    else:
        x['isOverlap'] = 0
    x['isOverlap'] = x['isOverlap'].astype(int)
    return x
df_sample = df_sample.groupby('machine').apply(foo)
print(df_sample)
Input:
machine ts_start ts_end
0 1 2022-01-01 20:00:00 2022-01-01 21:00:00
1 1 2022-01-01 20:30:00 2022-01-01 21:30:00
2 1 2022-01-02 20:30:00 2022-01-02 20:35:00
3 2 2022-01-01 19:00:00 2022-01-01 23:00:00
Output:
machine ts_start ts_end isOverlap
0 1 2022-01-01 20:00:00 2022-01-01 21:00:00 1
1 1 2022-01-01 20:30:00 2022-01-01 21:30:00 1
2 1 2022-01-02 20:30:00 2022-01-02 20:35:00 0
3 2 2022-01-01 19:00:00 2022-01-01 23:00:00 0
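As an aside (my own sketch, not part of the answer above), the same sweep can be vectorized with groupby, assuming the frame is sorted by machine and ts_start as above: a row overlaps a neighbour when it starts before the furthest end seen so far in its group, or ends after the next row's start.
df_sorted = df_sample.sort_values(['machine', 'ts_start', 'ts_end'])
g = df_sorted.groupby('machine')
# furthest end among earlier rows of the same machine
prev_reach = g['ts_end'].transform(lambda s: s.cummax().shift())
# start of the next interval of the same machine
next_start = g['ts_start'].shift(-1)
df_sorted['isOverlap'] = (
    df_sorted['ts_start'].lt(prev_reach) | df_sorted['ts_end'].gt(next_start)
).astype(int)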
Assuming overlaps only need to be checked at minute resolution (all timestamps fall on whole minutes), you could try the following (note that intervals merely sharing an endpoint minute are also flagged):
# create one timestamp per minute for each interval
df_sample["times"] = df_sample.apply(lambda row: pd.date_range(row["ts_start"], row["ts_end"], freq="1min"), axis=1)
# explode to get one row per minute
df_sample = df_sample.explode("times")
# check whether times overlap by looking for duplicated (machine, minute) pairs
df_sample["isOverlap"] = df_sample[["machine", "times"]].duplicated(keep=False)
# group back to the original data structure
output = df_sample.drop("times", axis=1).groupby(["machine", "ts_start", "ts_end"]).any().astype(int).reset_index()
>>> output
machine ts_start ts_end isOverlap
0 1 2022-01-01 20:00:00 2022-01-01 21:00:00 1
1 1 2022-01-01 20:30:00 2022-01-01 21:30:00 1
2 1 2022-01-02 20:30:00 2022-01-02 20:35:00 0
3 2 2022-01-01 19:00:00 2022-01-01 23:00:00 0
I have two dataframes (simple examples shown below):
df1:

time column
2022-01-01 00:00:00
2022-01-01 00:15:00
2022-01-01 00:30:00
2022-01-01 00:45:00
2022-01-02 00:00:00
2022-01-02 00:15:00
2022-01-02 00:30:00
2022-01-02 00:45:00

df2:

time column          ID column  Value
2022-01-01 00:00:00  1          10
2022-01-01 00:30:00  1          9
2022-01-02 00:30:00  1          5
2022-01-02 00:45:00  1          15
2022-01-01 00:00:00  2          6
2022-01-01 00:15:00  2          2
2022-01-02 00:45:00  2          7
df1 contains every timestamp I am interested in. df2 contains data sorted by timestamp and ID. What I need to do is, for each unique ID, add every timestamp from df1 that is missing in df2, with zero in the Value column.
This is the outcome I'm interested in:
df3
time column ID column Value
2022-01-01 00:00:00 1 10
2022-01-01 00:15:00 1 0
2022-01-01 00:30:00 1 9
2022-01-01 00:45:00 1 0
2022-01-02 00:00:00 1 0
2022-01-02 00:15:00 1 0
2022-01-02 00:30:00 1 5
2022-01-02 00:45:00 1 15
2022-01-01 00:00:00 2 6
2022-01-01 00:15:00 2 2
2022-01-01 00:30:00 2 0
2022-01-01 00:45:00 2 0
2022-01-02 00:00:00 2 0
2022-01-02 00:15:00 2 0
2022-01-02 00:30:00 2 0
2022-01-02 00:45:00 2 7
My df2 is much larger (hundreds of thousands of rows, and more than 500 unique IDs), so doing this manually isn't feasible. I've searched for hours for something that could help, but everything has fallen flat. This data will ultimately be fed into a NN.
I am open to other libraries and can work in python or R.
Any help is greatly appreciated.
Try:
x = (
    df2.groupby("ID column")
    # outer-merge each ID's rows with the full timestamp list, zero-filling gaps
    .apply(lambda x: x.merge(df1, how="outer").fillna(0))
    # the ID column is now 0 on the added rows; drop it and recover the true
    # ID from the group level of the index instead
    .drop(columns="ID column")
    .droplevel(1)
    .reset_index()
    .sort_values(by=["ID column", "time column"])
)
print(x)
Prints:
ID column time column Value
0 1 2022-01-01 00:00:00 10.0
4 1 2022-01-01 00:15:00 0.0
1 1 2022-01-01 00:30:00 9.0
5 1 2022-01-01 00:45:00 0.0
6 1 2022-01-02 00:00:00 0.0
7 1 2022-01-02 00:15:00 0.0
2 1 2022-01-02 00:30:00 5.0
3 1 2022-01-02 00:45:00 15.0
8 2 2022-01-01 00:00:00 6.0
9 2 2022-01-01 00:15:00 2.0
11 2 2022-01-01 00:30:00 0.0
12 2 2022-01-01 00:45:00 0.0
13 2 2022-01-02 00:00:00 0.0
14 2 2022-01-02 00:15:00 0.0
15 2 2022-01-02 00:30:00 0.0
10 2 2022-01-02 00:45:00 7.0
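An alternative sketch (my code, not part of the answer above): build the full (ID, time) grid as a MultiIndex and reindex onto it, filling the gaps with zero, which can also keep Value as an integer instead of the floats produced by fillna:
full_index = pd.MultiIndex.from_product(
    [df2['ID column'].unique(), df1['time column']],
    names=['ID column', 'time column'],
)
df3 = (
    df2.set_index(['ID column', 'time column'])
       .reindex(full_index, fill_value=0)  # missing (ID, time) pairs get Value 0
       .reset_index()
)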
I have records of time intervals that can span midnight. For example, with the following data:
df = pd.DataFrame({'Start': ['2022-01-01 08:30:00', '2022-01-01 13:00:00', '2022-01-02 22:00:00'],
'Stop': ['2022-01-01 12:00:00', '2022-01-02 10:30:00', '2022-01-04 8:00:00']})
df = df.apply(pd.to_datetime)
Start Stop
0 2022-01-01 08:30:00 2022-01-01 12:00:00
1 2022-01-01 13:00:00 2022-01-02 10:30:00
2 2022-01-02 22:00:00 2022-01-04 08:00:00
How can I split each record across midnight and upsample my data, so it looks like this:
Start Stop
0 2022-01-01 08:30:00 2022-01-01 12:00:00
1 2022-01-01 13:00:00 2022-01-02 00:00:00
2 2022-01-02 00:00:00 2022-01-02 10:30:00
3 2022-01-02 22:00:00 2022-01-03 00:00:00
4 2022-01-03 00:00:00 2022-01-04 00:00:00
5 2022-01-04 00:00:00 2022-01-04 08:00:00
I want to calculate the duration per day for each time record using df['Stop'] - df['Start']. Maybe there is another way to do it. Thank you!
You could start by implementing a function that computes all the date splits for each row:
from datetime import timedelta
def split_date(start, stop):
    # Same-day case: the interval does not cross midnight
    if start.date() == stop.date():
        return [(start, stop)]
    # Several-days case: cut at the next midnight and recurse on the rest
    stop_split = start.replace(hour=0, minute=0, second=0, microsecond=0) + timedelta(days=1)
    if stop <= stop_split:
        # stop falls exactly on the next midnight: avoid a zero-length tail
        return [(start, stop)]
    return [(start, stop_split)] + split_date(stop_split, stop)
Then you can use your existing dataframe to create a new one with all the records, by computing the split of each row:
new_dates = [
elt for _, row in df.iterrows() for elt in split_date(row["Start"], row["Stop"])
]
new_df = pd.DataFrame(new_dates, columns=["Start", "Stop"])
The output should then be the one you expected:
Start Stop
0 2022-01-01 08:30:00 2022-01-01 12:00:00
1 2022-01-01 13:00:00 2022-01-02 00:00:00
2 2022-01-02 00:00:00 2022-01-02 10:30:00
3 2022-01-02 22:00:00 2022-01-03 00:00:00
4 2022-01-03 00:00:00 2022-01-04 00:00:00
5 2022-01-04 00:00:00 2022-01-04 08:00:00
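From there, the per-day duration mentioned in the question is a plain subtraction (a short sketch, assuming new_df from above):
new_df['duration'] = new_df['Stop'] - new_df['Start']
# total duration per calendar day (each split row now lies within a single day)
per_day = new_df.groupby(new_df['Start'].dt.date)['duration'].sum()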
I have a dataset that does not have a timestamp for each record. However, I know the start and end time of the data, and I assume that all data points are recorded at equal intervals. I would therefore like to generate a new column 'time' of equally spaced times between the start time and the end time.
import numpy as np
import pandas as pd

start_time = '2022:01:01:07:30'
end_time = '2022:01:01:08:30'

data = {'rec': ['rec1' for i in range(11)],
        'readvalue': [0.5 + 0.5*np.sin(2*np.pi/10*i)
                      for i in range(11)]}
df = pd.DataFrame(data, columns=['rec', 'readvalue'])
df
You could use date_range:
df['time'] = pd.date_range(start='2022/01/01 07:30', end='2022/01/01 08:30', periods=len(df))
Output:
rec readvalue time
0 rec1 0.500000 2022-01-01 07:30:00
1 rec1 0.793893 2022-01-01 07:36:00
2 rec1 0.975528 2022-01-01 07:42:00
3 rec1 0.975528 2022-01-01 07:48:00
4 rec1 0.793893 2022-01-01 07:54:00
5 rec1 0.500000 2022-01-01 08:00:00
6 rec1 0.206107 2022-01-01 08:06:00
7 rec1 0.024472 2022-01-01 08:12:00
8 rec1 0.024472 2022-01-01 08:18:00
9 rec1 0.206107 2022-01-01 08:24:00
10 rec1 0.500000 2022-01-01 08:30:00
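If you would rather not hard-code the dates, the question's colon-separated strings can be parsed with an explicit format string (a sketch, assuming the '%Y:%m:%d:%H:%M' layout shown in the question) and fed to date_range:
start = pd.to_datetime(start_time, format='%Y:%m:%d:%H:%M')
end = pd.to_datetime(end_time, format='%Y:%m:%d:%H:%M')
df['time'] = pd.date_range(start=start, end=end, periods=len(df))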
I am trying to perform operations on the time column for each unique merchant (calculate the time between transactions). How do I access the individual merchants in an iteration? Is there a way to do that in Python?
Thank you.
Assuming time is already datetime64, use groupby + diff:
df['delta'] = df.groupby('merchant')['time'].diff()
print(df)
# Output
merchant time delta
0 A 2022-01-01 16:00:00 NaT
1 A 2022-01-01 16:30:00 0 days 00:30:00
2 A 2022-01-01 17:00:00 0 days 00:30:00
3 B 2022-01-01 10:00:00 NaT
4 B 2022-01-01 11:00:00 0 days 01:00:00
5 B 2022-01-01 12:00:00 0 days 01:00:00
If you want to compute the mean time between transactions per merchant, use:
out = df.groupby('merchant', as_index=False)['time'].apply(lambda x: x.diff().mean())
print(out)
# Output
merchant time
0 A 0 days 00:30:00
1 B 0 days 01:00:00
Setup:
data = {'merchant': ['A', 'A', 'A', 'B', 'B', 'B'],
'time': [pd.Timestamp('2022-01-01 16:00:00'),
pd.Timestamp('2022-01-01 16:30:00'),
pd.Timestamp('2022-01-01 17:00:00'),
pd.Timestamp('2022-01-01 10:00:00'),
pd.Timestamp('2022-01-01 11:00:00'),
pd.Timestamp('2022-01-01 12:00:00')]}
df = pd.DataFrame(data)
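If you do want to iterate over the individual merchants, as the question asks, groupby supports that directly; each iteration yields the merchant key and its sub-frame:
for merchant, group in df.groupby('merchant'):
    # group is the sub-DataFrame holding only this merchant's rows
    print(merchant, group['time'].diff().mean())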
I have a pandas column which I have initialized with ones; this column represents the health of a solar panel.
I need to decay this value linearly, except when the time of a panel replacement is reached, at which point the value resets to 1 (hence the initialization to ones). What I am doing is looping through the column and updating the current value with the previous value minus a constant.
This operation is extremely expensive (and I have over 200,000 samples). I was hoping someone might be able to help me with a vectorized solution where I can avoid this for loop. Here is my code:
def set_degredation_factors_pv(df):
    for i in df.index:
        if i != replacement_duration_PV_year * hour_per_year and i != 0:
            df.loc[i, 'degradation_factor_PV_power_frac'] = (
                df.loc[i - 1, 'degradation_factor_PV_power_frac']
                - degradation_rate_PV_power_perc_per_hour / 100
            )
    return df
Variables:
replacement_duration_PV_year = 25
hour_per_year = 8760
degradation_rate_PV_power_perc_per_hour = 5.479e-5
Input data:
time_datetime degradation_factor_PV_power_frac
0 2022-01-01 00:00:00 1
1 2022-01-01 01:00:00 1
2 2022-01-01 02:00:00 1
3 2022-01-01 03:00:00 1
4 2022-01-01 04:00:00 1
... ... ...
8732 2022-12-30 20:00:00 1
8733 2022-12-30 21:00:00 1
8734 2022-12-30 22:00:00 1
8735 2022-12-30 23:00:00 1
8736 2022-12-31 00:00:00 1
Output data (only taking one year for time):
time_datetime degradation_factor_PV_power_frac
0 2022-01-01 00:00:00 1.000000
1 2022-01-01 01:00:00 0.999999
2 2022-01-01 02:00:00 0.999999
3 2022-01-01 03:00:00 0.999998
4 2022-01-01 04:00:00 0.999998
... ... ...
8732 2022-12-30 20:00:00 0.995216
8733 2022-12-30 21:00:00 0.995215
8734 2022-12-30 22:00:00 0.995215
8735 2022-12-30 23:00:00 0.995214
8736 2022-12-31 00:00:00 0.995214
Try:
rate = degradation_rate_PV_power_perc_per_hour / 100
mask = ~((df.index != replacement_duration_PV_year * hour_per_year)
& (df.index != 0))
df['degradation_factor_PV_power_frac'] = (
df.groupby(mask.cumsum())['degradation_factor_PV_power_frac']
.apply(lambda x: x.shift().sub(rate).cumprod())
.fillna(df['degradation_factor_PV_power_frac'])
)
Output:
>>> df
time_datetime degradation_factor_PV_power_frac
0 2022-01-01 00:00:00 1.000000
1 2022-01-01 01:00:00 0.999999
2 2022-01-01 02:00:00 0.999999
3 2022-01-01 03:00:00 0.999998
4 2022-01-01 04:00:00 0.999998
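Note that cumprod compounds the rate geometrically, whereas the original loop subtracts a constant every hour (linear decay); for a rate this small the difference is tiny. If you want an exact match to the loop, a cumcount-based sketch (my code, reusing the question's variable names) keeps it linear:
rate = degradation_rate_PV_power_perc_per_hour / 100
# True at the start of the series and at the replacement hour
reset = (df.index == 0) | (df.index == replacement_duration_PV_year * hour_per_year)
# hours elapsed since the most recent reset, per segment
hours_since_reset = df.groupby(reset.cumsum()).cumcount()
df['degradation_factor_PV_power_frac'] = 1 - rate * hours_since_reset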