Delete rows with overlapping intervals efficiently - python

Consider the following DataFrame
>>> df
   Start  End  Tiebreak
0      1    6  0.376600
1      5    7  0.050042
2     15   20  0.628266
3     10   15  0.984022
4     11   12  0.909033
5      4    8  0.531054
Whenever the [Start, End] intervals of two rows overlap, I want the row with the lower tiebreaking value to be removed. The result for the example would be:
>>> df
   Start  End  Tiebreak
2     15   20  0.628266
3     10   15  0.984022
5      4    8  0.531054
I have a double loop which does the job inefficiently, and I was wondering whether there is an approach that exploits built-ins and works column-wise.
import pandas as pd
import numpy as np

# initial data
df = pd.DataFrame({
    'Start': [1, 5, 15, 10, 11, 4],
    'End': [6, 7, 20, 15, 12, 8],
    'Tiebreak': np.random.uniform(0, 1, 6)
})

# checking for overlaps
list_idx_drop = []
for i in range(len(df) - 1):
    for j in range(i + 1, len(df)):
        idx_1 = df.index[i]
        idx_2 = df.index[j]
        # skip rows already marked for removal
        if idx_1 in list_idx_drop or idx_2 in list_idx_drop:
            continue
        cond_1 = (df.loc[idx_1, 'Start'] < df.loc[idx_2, 'End'])
        cond_2 = (df.loc[idx_2, 'Start'] < df.loc[idx_1, 'End'])
        # if the rows overlap
        if cond_1 and cond_2:
            tie_1 = df.loc[idx_1, 'Tiebreak']
            tie_2 = df.loc[idx_2, 'Tiebreak']
            # mark the row with the lower tiebreaking value
            if tie_1 < tie_2:
                list_idx_drop.append(idx_1)
            else:
                list_idx_drop.append(idx_2)
df = df.drop(list_idx_drop)

You could sort by End in descending order and flag rows whose End is greater than the previous row's Start; each False flag marks the start of a new non-overlapping group. A cumulative sum of those flags gives a group label on which to drop duplicates: sort again by Tiebreak (descending) and drop duplicates on the group column, keeping the highest tiebreak per group.
import pandas as pd

df = pd.DataFrame({'Start': {0: 1, 1: 5, 2: 15, 3: 10, 4: 11, 5: 4},
                   'End': {0: 6, 1: 7, 2: 20, 3: 15, 4: 12, 5: 8},
                   'Tiebreak': {0: 0.3766, 1: 0.050042, 2: 0.628266,
                                3: 0.984022, 4: 0.909033, 5: 0.531054}})

df = df.sort_values(by='End', ascending=False)
# True where the interval overlaps the previous (larger-End) interval
df['overlap'] = df['End'].gt(df['Start'].shift(fill_value=0))
# each False starts a new group of overlapping intervals
df['group'] = df['overlap'].eq(False).cumsum()
# keep the highest Tiebreak per group
df = df.sort_values(by='Tiebreak', ascending=False)
df = df.drop_duplicates(subset='group').drop(columns=['overlap', 'group'])
print(df)
Output:

   Start  End  Tiebreak
2     15   20  0.628266
3     10   15  0.984022
5      4    8  0.531054

You can sort the values by Start and compute a cumulative max of End, then form groups of non-overlapping intervals and keep the row with the max Tiebreak per group using groupby.idxmax:
keep = (df
        .sort_values(by=['Start', 'End'])
        .assign(max_End=lambda d: d['End'].cummax(),
                group=lambda d: d['Start'].ge(d['max_End'].shift()).cumsum())
        .groupby('group', sort=False)['Tiebreak'].idxmax()
        )
out = df[df.index.isin(keep)]
Output:

   Start  End  Tiebreak
2     15   20  0.628266
3     10   15  0.984022
5      4    8  0.531054
The logic is to move left to right and start a new group whenever there is a "jump" (no overlap). A figure in the original post illustrates this: the intervals are drawn as solid lines (the greatest Tiebreak per group in bold), and the running cummax of End as dotted lines.
Intermediates:

   Start  End  Tiebreak  max_End  group
0      1    6  0.376600        6      0
5      4    8  0.531054        8      0
1      5    7  0.050042        8      0
3     10   15  0.984022       15      1  # 10 ≥ 8
4     11   12  0.909033       15      1
2     15   20  0.628266       20      2  # 15 ≥ 15
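Both answers replace the O(n²) pairwise comparison of the double loop with a sort followed by vectorized column operations, so the overall cost drops to O(n log n) and no Python-level row iteration remains.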

Related

Find local maxima or peaks (index) in a numeric series using numpy and pandas

Write a Python program to find all the local maxima or peaks (index) in a numeric series using numpy and pandas. A peak refers to a value surrounded by smaller values on both sides.
Note:
Create a Pandas series from the given input.
Input format:
The first line of the input consists of a list of integers separated by spaces, to form the pandas series.
Output format:
Display the array of indices where the peak values are present.
Sample testcase
input1
12 1 2 1 9 10 2 5 7 8 9 -9 10 5 15
output1
[2 5 10 12]
How to solve this problem?
import pandas as pd

a = "12 1 2 1 9 10 2 5 7 8 9 -9 10 5 15"
a = [int(x) for x in a.split(" ")]

# classify each step as rising or falling
angles = []
for i in range(len(a)):
    if i != 0:
        if a[i] > a[i - 1]:
            angles.append('rise')
        else:
            angles.append('fall')
    else:
        angles.append('ignore')

# mark positions where a fall directly follows a rise
prev_val = "none"
counts = []
for s in angles:
    if s == "fall" and prev_val == "rise":
        prev_val = s
        counts.append(1)
    else:
        prev_val = s
        counts.append(0)

# shift back by one so the mark lands on the peak itself
peaks_pd = pd.Series(counts).shift(-1).fillna(0).astype(int)
df = pd.DataFrame({
    'a': a,
    'peaks': peaks_pd
})
peak_vals = list(df[df['peaks'] == 1]['a'].index)
This could be improved further. The steps I have followed:
First, find the angle at each step, i.e. whether the series is rising or falling.
Then treat the indices where it starts falling after rising as the peaks (a vectorized sketch of the same idea follows below).
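For reference, the same rise/fall idea can be vectorized with np.sign on the consecutive differences. A minimal sketch (direction is just a name introduced here):
import numpy as np
import pandas as pd

s = pd.Series([12, 1, 2, 1, 9, 10, 2, 5, 7, 8, 9, -9, 10, 5, 15])

# sign of each step: +1 rise, -1 fall, 0 flat (NaN for the first element)
direction = np.sign(s.diff())

# a peak is a rise immediately followed by a fall
peaks = s.index[(direction == 1) & (direction.shift(-1) == -1)].tolist()
print(peaks)  # [2, 5, 10, 12]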
Use:
import numpy as np
import pandas as pd
from scipy.signal import argrelextrema

data = [12, 1, 2, 1.1, 9, 10, 2.1, 5, 7, 8, 9.1, -9, 10.1, 5.1, 15]
s = pd.Series(data)
n = 3  # number of points to be checked before and after

local_max_index = argrelextrema(s.to_frame().to_numpy(), np.greater_equal, order=n)[0].tolist()
print (local_max_index)
[0, 5, 14]
local_max_index = s.index[(s.shift() <= s) & (s.shift(-1) <= s)].tolist()
print (local_max_index)
[2, 5, 10, 12]
local_max_index = s.index[s == s.rolling(n, center=True).max()].tolist()
print (local_max_index)
[2, 5, 10, 12]
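Note that the shift-based and rolling-based variants compare with <=, so every point of a flat plateau is reported as a peak; make one side a strict < if you want a single index per plateau.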
EDIT: Solution for processing the value inside a DataFrame:
df = pd.DataFrame({'Input': ["12 1 2 1 9 10 2 5 7 8 9 -9 10 5 15"]})
print (df)

                                Input
0  12 1 2 1 9 10 2 5 7 8 9 -9 10 5 15
s = df['Input'].iloc[[0]].str.split().explode().astype(int).reset_index(drop=True)
print (s)

0     12
1      1
2      2
3      1
4      9
5     10
6      2
7      5
8      7
9      8
10     9
11    -9
12    10
13     5
14    15
Name: Input, dtype: int32
local_max_index = s.index[(s.shift() <= s) & (s.shift(-1) <= s)].tolist()
print (local_max_index)
[2, 5, 10, 12]
df['output'] = [local_max_index]
print (df)

                                Input          output
0  12 1 2 1 9 10 2 5 7 8 9 -9 10 5 15  [2, 5, 10, 12]

Collapse multiple timestamp rows into a single one

I have a DataFrame like this:
s = pd.DataFrame({'ts': [1, 2, 3, 6, 7, 11, 12, 13]})
s
   ts
0   1
1   2
2   3
3   6
4   7
5  11
6  12
7  13
I would like to collapse rows whose difference is less than MAX_DIFF (2). That means the desired output must be:
[{'ts_from': 1, 'ts_to': 3},
{'ts_from': 6, 'ts_to': 7},
{'ts_from': 11, 'ts_to': 13}]
I did some coding:
MAX_DIFF = 2

s['close'] = s.diff().shift(-1)
s['close'] = s[s['close'] > MAX_DIFF].astype('bool')
s['close'].iloc[-1] = True

parts = []
ts_from = None
for _, row in s.iterrows():
    if row['close'] is True:
        part = {'ts_from': ts_from, 'ts_to': row['ts']}
        parts.append(part)
        ts_from = None
        continue
    if not ts_from:
        ts_from = row['ts']
This works but does not seem optimal because of iterrows(). I thought about ranks but couldn't figure out how to implement them so as to group by rank afterwards.
Is there a way to optimise the algorithm?
You can create groups by checking where the difference is at least your threshold and taking a cumsum. Then agg however you'd like, perhaps 'first' and 'last' in this case.
gp = s['ts'].diff().abs().ge(2).cumsum().rename(None)
res = s.groupby(gp).agg(ts_from=('ts', 'first'),
                        ts_to=('ts', 'last'))

#    ts_from  ts_to
# 0        1      3
# 1        6      7
# 2       11     13
And if you want the list of dicts then:
res.to_dict('records')
#[{'ts_from': 1, 'ts_to': 3},
# {'ts_from': 6, 'ts_to': 7},
# {'ts_from': 11, 'ts_to': 13}]
For completeness, here is how the grouper aligns with the DataFrame:
s['gp'] = gp
print(s)

   ts  gp
0   1   0  # `1` becomes ts_from for group 0
1   2   0
2   3   0  # `3` becomes ts_to for group 0
3   6   1  # `6` becomes ts_from for group 1
4   7   1  # `7` becomes ts_to for group 1
5  11   2  # `11` becomes ts_from for group 2
6  12   2
7  13   2  # `13` becomes ts_to for group 2
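If this is needed repeatedly, the whole pipeline fits in a small helper. A minimal sketch (the name collapse and the max_diff parameter are introduced here, not part of the answer):
import pandas as pd

def collapse(s, max_diff=2):
    # new group whenever the gap to the previous timestamp is >= max_diff
    gp = s['ts'].diff().abs().ge(max_diff).cumsum()
    res = s.groupby(gp).agg(ts_from=('ts', 'first'), ts_to=('ts', 'last'))
    return res.to_dict('records')

print(collapse(pd.DataFrame({'ts': [1, 2, 3, 6, 7, 11, 12, 13]})))
# [{'ts_from': 1, 'ts_to': 3}, {'ts_from': 6, 'ts_to': 7}, {'ts_from': 11, 'ts_to': 13}]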

Pandas DataFrame group-by indexes matching list - indexes respectively smaller than list[i+1] and greater than list[i]

I have a DataFrame Times_df with times in a single column and a second DataFrame End_df with specific end times for each group indexed by group name.
import random
import numpy as np
import pandas as pd

Times_df = pd.DataFrame({'time': np.unique(np.cumsum(np.random.randint(5, size=(100,))), axis=0)})
End_df = pd.DataFrame({'end time': np.unique(random.sample(range(Times_df.index.values[0], Times_df.index.values[-1]), 10))})
End_df.index.name = 'group'
I want to add a group index for all times in Times_df that are smaller than or equal to each consecutive end time in End_df but greater than the previous one.
I can only do it for now with a loop, which takes forever ;(
lis = []
i = 1
for row in Times_df['time'].values:
    while i <= row:
        lis.append((End_df['end time'] == row).index)
        i += 1
Then I add the list lis as a new column to Times_df
Times_df['group']=lis
Another solution that sadly still uses a loop is this:
test_df = pd.DataFrame()
for group, index in End_df.iterrows():
    test = Times_df.loc[Times_df.index <= index['end time']][:]
    test['group'] = group
    test_df = pd.concat([test_df, test], axis=0, ignore_index=True)
I think what you are looking for is pd.cut to bin your values into the groups.
bins = [0, 3, 10, 20, 53, 59, 63, 65, 68, 74, np.inf]
groups = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
Times_df["group"] = pd.cut(Times_df["time"], bins, labels=groups)
print(Times_df)
   time group
0     2     0
1     3     0
2     7     1
3    11     2
4    15     2
5    16     2
6    18     2
7    22     3
8    25     3
9    28     3
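As an alternative sketch (not from the answer): since End_df is built with np.unique, its end times are already sorted, and np.searchsorted then yields the group index directly:
import numpy as np

# first position whose end time is >= the given time, i.e. the group index;
# times beyond the last end time would get group == len(ends)
ends = End_df['end time'].to_numpy()
Times_df['group'] = np.searchsorted(ends, Times_df['time'].to_numpy(), side='left')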

Summing rows based on cumsum values

I have a data frame like this:
index  A  B  C
0      4  7  9
1      2  6  2
2      6  9  1
3      7  2  4
4      8  5  6
I want to create another data frame out of this based on the sum of the C column. The catch is that once the running sum of C reaches 10 or higher, it should start another row. Something like this:
index   A   B   C
0       6  13  11
1      21  16  11
Any help will be highly appreciated. Is there a robust way to do this, or is iterating my last resort?
There is a non-iterative approach. You'll need a groupby key built from the cumulative sum of C modulo 10.
# Groupby logic - https://stackoverflow.com/a/45959831/4909087
out = df.groupby((df.C.cumsum() % 10).diff().shift().lt(0).cumsum(), as_index=0).agg('sum')
print(out)

    A   B   C
0   6  13  11
1  21  16  11
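To see why this grouper works, here are the intermediate values for the example's C column:
key = df.C.cumsum() % 10  # [9, 1, 2, 6, 2]  -- wraps when the running sum passes 10
key = key.diff()          # [NaN, -8.0, 1.0, 4.0, -4.0]  -- negative right at a wrap
key = key.shift()         # [NaN, NaN, -8.0, 1.0, 4.0]  -- move the break one row down
key = key.lt(0)           # [False, False, True, False, False]
key = key.cumsum()        # [0, 0, 1, 1, 1]  -- the final group labels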
The code would look something like this:
import pandas as pd

lista = [4, 7, 10, 11, 7]
listb = [7, 8, 2, 5, 9]
listc = [9, 2, 1, 4, 6]
df = pd.DataFrame({'A': lista, 'B': listb, 'C': listc})

def sumsc(df):
    suma = 0
    sumb = 0
    sumc = 0
    list_of_sums = []
    for i in range(len(df)):
        suma += df.iloc[i, 0]
        sumb += df.iloc[i, 1]
        sumc += df.iloc[i, 2]
        # once the running C total passes the threshold, emit a row and reset
        if sumc > 10:
            list_of_sums.append([suma, sumb, sumc])
            suma = 0
            sumb = 0
            sumc = 0
    return pd.DataFrame(list_of_sums)

sumsc(df)
    0   1   2
0  11  15  11
1  28  16  11
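Note this helper tests sumc > 10 while the question says "reached 10 or higher"; change the comparison to sumc >= 10 if the boundary case matters. Also, trailing rows whose running total never reaches the threshold are silently dropped.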

Python Pandas: Subsetting data frame both by rows and columns?

Data frame has w (week) and y (year) columns.
d = {
    'y': [11, 11, 13, 15, 15],
    'w': [5, 4, 7, 7, 8],
    'z': [1, 2, 3, 4, 5]
}
df = pd.DataFrame(d)
In [61]: df
Out[61]:
   w   y  z
0  5  11  1
1  4  11  2
2  7  13  3
3  7  15  4
4  8  15  5
Two questions:
1) How do I get from this data frame the min/max date as two numbers, w and y, in a list [w, y]?
2) How do I subset both columns and rows, so that all w and y in the resulting data frame satisfy the conditions:
11 <= y <= 15
4 <= w <= 7
To get min/max pairs I need functions:
min_pair() --> [11,4]
max_pair() --> [15,8]
and these to get a data frame subset:
from_to(y1,w1,y2,w2)
from_to(11,4,15,7) -->
should return rf data frame like this:
r = {
    'y': [11, 13, 15],
    'w': [4, 7, 7],
    'z': [2, 3, 4]
}
rf = pd.DataFrame(r)

In [62]: rf
Out[62]:
   w   y  z
0  4  11  2
1  7  13  3
2  7  15  4
Are there any standard functions for this?
Update
For subsetting, the following worked for me:
df[(df.y <= 15) & (df.y >= 11) & (df.w >= 4) & (df.w <= 7)]
a lot of typing though ...
Here are a couple of methods:
In [176]: df.min().tolist()
Out[176]: [4, 11]

In [177]: df.max().tolist()
Out[177]: [8, 15]

In [178]: df.query('11 <= y <= 15 and 4 <= w <= 7')
Out[178]:
   w   y
0  5  11
1  4  11
2  7  13
3  7  15
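If you prefer boolean masks over query strings, Series.between (inclusive on both ends by default) shortens the four chained comparisons from the update:
# boolean-mask equivalent of the query above
subset = df[df['y'].between(11, 15) & df['w'].between(4, 7)]
print(subset)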
