Quickly Find Non-Zero Intervals - python

I am writing an algorithm to determine the intervals of the "mountains" on a density plot. The plot is taken from the depth data of a Kinect, if anyone is interested. Here is a quick visual example of what this algorithm finds (with the small mountains removed):
My current algorithm:
import numpy as np

def find_peak_intervals(data):
    previous = 0
    peak = False
    ranges = []
    begin_range = 0
    end_range = 0
    for current in xrange(len(data)):
        if (not peak) and ((data[current] - data[previous]) > 0):
            peak = True
            begin_range = current
        if peak and (data[current] == 0):
            peak = False
            end_range = current
            ranges.append((begin_range, end_range))
        previous = current
    return np.array(ranges)
The function works but it takes nearly 3 milliseconds on my laptop, and I need to be able to run my entire program at at least 30 frames per second. This function is rather ugly and I have to run it 3 times per frame for my program, so I would like any hints as to how to simplify and optimize this function (maybe something from numpy or scipy that I missed).

Assuming a pandas dataframe like so:
Value
0 0
1 3
2 2
3 2
4 1
5 2
6 3
7 0
8 1
9 3
10 0
11 0
12 0
13 1
14 0
15 3
16 2
17 3
18 1
19 0
You can get the contiguous non-zero ranges by using df["Value"].shift(x) where x could either be 1 or -1 so you can check if it's bounded by zeroes. Once you get the boundaries, you can just store their index pairs and use them later on when filtering the data.
The following code is based on the excellent answer here by @behzad.nouri.
import pandas as pd
df = pd.read_csv("data.csv")
# Or you can use df = pd.DataFrame.from_dict({'Value': {0: 0, 1: 3, 2: 2, 3: 2, 4: 1, 5: 2, 6: 3, 7: 0, 8: 1, 9: 3, 10: 0, 11: 0, 12: 0, 13: 1, 14: 0, 15: 3, 16: 2, 17: 3, 18: 1, 19: 0}})
# --
# from https://stackoverflow.com/questions/24281936
# credits to @behzad.nouri
df['tag'] = df['Value'] > 0
fst = df.index[df['tag'] & ~ df['tag'].shift(1).fillna(False)]
lst = df.index[df['tag'] & ~ df['tag'].shift(-1).fillna(False)]
pr = [(i, j) for i, j in zip(fst, lst)]
# --
for i, j in pr:
    print df.loc[i:j, "Value"]
This gives the result:
1 3
2 2
3 2
4 1
5 2
6 3
Name: Value, dtype: int64
8 1
9 3
Name: Value, dtype: int64
13 1
Name: Value, dtype: int64
15 3
16 2
17 3
18 1
Name: Value, dtype: int64
Timing it in IPython gives the following:
%timeit find_peak_intervals(df)
1000 loops, best of 3: 1.49 ms per loop
This is not too far from your attempt speed-wise. An alternative is to convert the pandas Series to a NumPy array and operate from there. Let's take another excellent answer, this one by @Warren Weckesser, and modify it to suit your needs. Let's time it as well.
In [22]: np_arr = np.array(df["Value"])
In [23]: def greater_than_zero(a):
    ...:     isntzero = np.concatenate(([0], np.greater(a, 0).view(np.int8), [0]))
    ...:     absdiff = np.abs(np.diff(isntzero))
    ...:     ranges = np.where(absdiff == 1)[0].reshape(-1, 2)
    ...:     return ranges
In [24]: %timeit greater_than_zero(np_arr)
100000 loops, best of 3: 17.1 µs per loop
Not so bad at 17.1 microseconds, and it gives the same ranges as well.
[1 7] # Basically same as indices 1-6 in pandas.
[ 8 10] # 8, 9
[13 14] # 13, 13
[15 19] # 15, 18
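Since the question also mentions dropping the small mountains, the ranges returned above can be filtered on width afterwards. A minimal sketch, assuming the greater_than_zero function and np_arr from the snippet above and an arbitrary minimum width of 3 (the threshold is not from the original post):
ranges = greater_than_zero(np_arr)
min_width = 3  # hypothetical threshold; pick whatever counts as a "small mountain"
big_ranges = ranges[(ranges[:, 1] - ranges[:, 0]) >= min_width]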

Related

Delete rows with overlapping intervals efficiently

Consider the following DataFrame
>>> df
Start End Tiebreak
0 1 6 0.376600
1 5 7 0.050042
2 15 20 0.628266
3 10 15 0.984022
4 11 12 0.909033
5 4 8 0.531054
Whenever the [Start, End] intervals of two rows overlap I want the row with lower tiebreaking value to be removed. The result of the example would be
>>> df
Start End Tiebreak
2 15 20 0.628266
3 10 15 0.984022
5 4 8 0.531054
I have a double-loop which does the job inefficiently and was wondering whether there exists an approach which exploits built-ins and works columnwise.
import pandas as pd
import numpy as np
# initial data
df = pd.DataFrame({
    'Start': [1, 5, 15, 10, 11, 4],
    'End': [6, 7, 20, 15, 12, 8],
    'Tiebreak': np.random.uniform(0, 1, 6)
})
# checking for overlaps
list_idx_drop = []
for i in range(len(df) - 1):
    for j in range(i + 1, len(df)):
        idx_1 = df.index[i]
        idx_2 = df.index[j]
        cond_1 = (df.loc[idx_1, 'Start'] < df.loc[idx_2, 'End'])
        cond_2 = (df.loc[idx_2, 'Start'] < df.loc[idx_1, 'End'])
        # if rows overlap
        if cond_1 & cond_2:
            tie_1 = df.loc[idx_1, 'Tiebreak']
            tie_2 = df.loc[idx_2, 'Tiebreak']
            # delete row with lower tiebreaking value
            if tie_1 < tie_2:
                df.drop(idx_1, inplace=True)
            else:
                df.drop(idx_2, inplace=True)
You could sort by End and check cases where the end is greater than the previous Start. Using that True/False value, you can create groupings on which to drop duplicates. Sort again by Tiebreak and drop duplicates on the group column.
import pandas as pd
df = pd.DataFrame({'Start': {0: 1, 1: 5, 2: 15, 3: 10, 4: 11, 5: 4}, 'End': {0: 6, 1: 7, 2: 20, 3: 15, 4: 12, 5: 8}, 'Tiebreak': {0: 0.3766, 1: 0.050042, 2: 0.628266, 3: 0.984022, 4: 0.909033, 5: 0.531054}})
df = df.sort_values(by='End', ascending=False)
df['overlap'] = df['End'].gt(df['Start'].shift(fill_value=0))
df['group'] = df['overlap'].eq(False).cumsum()
df = df.sort_values(by='Tiebreak', ascending=False)
df = df.drop_duplicates(subset='group').drop(columns=['overlap','group'])
print(df)
Output
Start End Tiebreak
2 15 20 0.628266
3 10 15 0.984022
5 4 8 0.531054
You can sort the values by Start and compute a cummax of the End, then form groups of non-overlapping intervals and get the row with the highest Tiebreak in each group with groupby.idxmax:
keep = (df
        .sort_values(by=['Start', 'End'])
        .assign(max_End=lambda d: d['End'].cummax(),
                group=lambda d: d['Start'].ge(d['max_End'].shift()).cumsum())
        .groupby('group', sort=False)['Tiebreak'].idxmax()
        )
out = df[df.index.isin(keep)]
Output:
Start End Tiebreak
2 15 20 0.628266
3 10 15 0.984022
5 4 8 0.531054
[figure: the grouping logic shown as an image]
The logic is to move left to right and start a new group whenever there is a "jump" (no overlap). In the figure, the intervals are drawn as solid lines (with the greatest Tiebreak of each group in bold) and the cummax of End as dotted lines.
Intermediates:
Start End Tiebreak max_End group
0 1 6 0.376600 6 0
5 4 8 0.531054 8 0
1 5 7 0.050042 8 0
3 10 15 0.984022 15 1 # 10 ≥ 8
4 11 12 0.909033 15 1
2 15 20 0.628266 20 2 # 15 ≥ 15
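For reuse, the same cummax-grouping idea can be wrapped in a small helper. This is a minimal sketch, assuming the Start/End/Tiebreak columns from the question; the function name drop_overlapping_rows is my own:
import pandas as pd

def drop_overlapping_rows(df):
    # keep, for each chain of overlapping intervals, the row with the highest Tiebreak
    keep = (df
            .sort_values(by=['Start', 'End'])
            .assign(max_End=lambda d: d['End'].cummax(),
                    group=lambda d: d['Start'].ge(d['max_End'].shift()).cumsum())
            .groupby('group', sort=False)['Tiebreak'].idxmax())
    return df[df.index.isin(keep)]

df = pd.DataFrame({'Start': [1, 5, 15, 10, 11, 4],
                   'End': [6, 7, 20, 15, 12, 8],
                   'Tiebreak': [0.376600, 0.050042, 0.628266, 0.984022, 0.909033, 0.531054]})
print(drop_overlapping_rows(df))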

Creating a function in Python for creating buckets from pandas dataframe values based on multiple conditions

I asked this question and it helped me, but now my task is more complex.
My dataframe has ~100 columns and values with 14 scales.
{'Diseasetype': {0: 'Oncology',
1: 'Oncology',
2: 'Oncology',
3: 'Nononcology',
4: 'Nononcology',
5: 'Nononcology'},
'Procedures1': {0: 100, 1: 300, 2: 500, 3: 200, 4: 400, 5: 1000},
'Procedures2': {0: 1, 1: 3, 2: 5, 3: 2, 4: 4, 5: 10},
'Procedures100': {0: 1000, 1: 3000, 2: 5000, 3: 2000, 4: 4000, 5: 10000}}
I want to convert each value in each column of the dataframe into a bucket value.
My current solution is:
def encoding(col, labels):
    return np.select([col<200, col.between(200,500), col.between(500,1000), col>1000], labels, 0)

onc_labels = [1,2,3,4]
nonc_labels = [11,22,33,44]
msk = df['Therapy_area'] == 'Oncology'
# cols is the list of procedure columns to encode
df[cols] = pd.concat((df.loc[msk, cols].apply(encoding, args=(onc_labels,)), df.loc[~msk, cols].apply(encoding, args=(nonc_labels,)))).reset_index(drop=True)
It works well if all columns of the dataframe have the same scale, but they do not. Remember, I have 14 different scales.
I would like to update the code above (or get another solution) so that it buckets the data correctly; I cannot use the same range of values for bucketing everything.
My logic is the following:
If Disease == Oncology and Procedures1 on this scale, convert values to these buckets (1, 2, 3)
If Disease == Oncology and Procedures2 on this scale, convert values to these buckets (1, 2, 3)
If Disease != Oncology and Procedures77 on this scale, convert values to these buckets (4, 5, 6)
Example of a scale and buckets:
Procedures1 for Oncology: < 200 = 1, 200-400 = 2, >400 = 3
Procedures2 for Oncology: < 2 = 1, 2-4 = 2, >4 = 3
Procedures3 for Oncology: < 2000 = 1, 2000-4000 = 2, >4000 = 3
Procedures1 for nonOncology: < 200 = 4, 200-400 = 5, >400 = 6
Procedures2 for nonOncology: < 2 = 4, 2-4 = 5, >4 = 6
Procedures3 for nonOncology: < 2000 = 4, 2000-4000 = 5, >4000 = 6
Expected output (happy to provide more info!)
Diseasetype Procedures1 Procedures2 Procedures100
Oncology 1 1 1
Oncology 2 2 2
Oncology 3 3 3
Nononcology 4 4 4
Nononcology 5 5 5
Nononcology 6 6 6
I used a helper file with all the scales (source at the end of the answer).
Use melt to flatten your dataframe, then keep the matching rows with query, and finally use pivot_table to reshape your dataframe. You can execute each line independently to see the transformations:
scales = pd.read_csv('scales.csv').fillna({'Start': -np.inf, 'End': np.inf})
out = (
    df.melt('Diseasetype', var_name='Procedure', ignore_index=False).reset_index()
      .merge(scales, on=['Diseasetype', 'Procedure'], how='left')
      .query("value.between(Start, End)")
      .pivot_table('Label', ['index', 'Diseasetype'], 'Procedure').astype(int)
      .droplevel(0).rename_axis(columns=None).reset_index()
)
Output:
>>> out
Diseasetype Procedures1 Procedures100 Procedures2
0 Oncology 1 1 1
1 Oncology 2 2 2
2 Oncology 3 3 3
3 Nononcology 4 4 4
4 Nononcology 5 5 5
5 Nononcology 6 6 6
Content of scales.csv:
Diseasetype,Procedure,Start,End,Label
Oncology,Procedures1,,200,1
Oncology,Procedures1,200,400,2
Oncology,Procedures1,400,,3
Oncology,Procedures2,,2,1
Oncology,Procedures2,2,4,2
Oncology,Procedures2,4,,3
Oncology,Procedures100,,2000,1
Oncology,Procedures100,2000,4000,2
Oncology,Procedures100,4000,,3
Nononcology,Procedures1,,200,4
Nononcology,Procedures1,200,400,5
Nononcology,Procedures1,400,,6
Nononcology,Procedures2,,2,4
Nononcology,Procedures2,2,4,5
Nononcology,Procedures2,4,,6
Nononcology,Procedures100,,2000,4
Nononcology,Procedures100,2000,4000,5
Nononcology,Procedures100,4000,,6
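If you would rather keep the scales in code than in a helper CSV, a dict of bin edges plus pd.cut per column gives the same bucketing. A minimal sketch, assuming the bin edges below (taken from the scales in the question) and right-inclusive boundaries; how boundary values such as 200 are handled is an assumption, since the "<" / "-" / ">" notation leaves it ambiguous:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Diseasetype': ['Oncology', 'Oncology', 'Oncology',
                                   'Nononcology', 'Nononcology', 'Nononcology'],
                   'Procedures1': [100, 300, 500, 200, 400, 1000],
                   'Procedures2': [1, 3, 5, 2, 4, 10],
                   'Procedures100': [1000, 3000, 5000, 2000, 4000, 10000]})

# hypothetical scale definitions: bin edges per procedure column
edges = {'Procedures1': [-np.inf, 200, 400, np.inf],
         'Procedures2': [-np.inf, 2, 4, np.inf],
         'Procedures100': [-np.inf, 2000, 4000, np.inf]}

onc = df['Diseasetype'] == 'Oncology'
out = df[['Diseasetype']].copy()
for col, bins in edges.items():
    onc_lab = pd.cut(df[col], bins, labels=[1, 2, 3]).astype(int)
    non_lab = pd.cut(df[col], bins, labels=[4, 5, 6]).astype(int)
    out[col] = np.where(onc, onc_lab, non_lab)

print(out)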

Calculate sum of distances travelled for each unique ID

I have a dataframe with three columns: one column contains x-coordinates, another contains y-coordinates, and, as you can see, there is a 'trackid' column -- this column associates all of the x and y coordinates with a specific, unique trackid.
trackiD X_COORDINATES Y_COORDINATES
2 542.299805 23.388090
2 544.108215 23.575758
2 545.300598 23.962421
2 546.417053 25.049328
2 546.198669 24.830357
2 546.724915 24.916084
2 547.037048 24.918982
2 547.011963 24.785202
2 547.649231 24.845772
3 547.600525 24.613401
3 547.891479 24.268734
3 548.580505 24.459103
3 548.144409 23.915531
3 548.626770 23.922005
4 548.527222 24.134670
4 548.504211 23.642254
4 548.936584 24.028818
4 548.627869 23.295454
What I am trying to do is the following:
Take each consecutive pair of x and y coordinates and calculate the increments of distance traveled between them using the Pythagorean distance formula, sqrt((x2-x1)^2 + (y2-y1)^2), adding each distance increment to a list and then taking the sum of all increments in the list to get the total distance traveled. It is also important to note that I am doing this calculation only for the coordinates within a unique trackid, i.e. calculate the sum of the distance increments for trackid 2, then do the same process separately for trackids 3 and 4 and so forth, ultimately storing the total distance traveled per unique track ID in a new list.
Here is my current code. It runs, but the issue is that it outputs a list with just one single, large, likely incorrect value (displayed below). Also, the 'value' variable seems to have been cut off and displayed across multiple lines here on Stack Overflow, but this is not the case when I run it in a Jupyter notebook.
import math

def pythag_dis(U_id):
    c = data.Unique_id == U_id
    df = data[c]
    df.reset_index(inplace=True)
    k = sorted(df.trackId.unique())
    i = 0
    j = 1
    length = len(k)
    while i < length:
        condition = df.trackId == k[i]
        df2 = df[condition]
        df2.reset_index(inplace=True)
        value = math.sqrt((df.Object_Center_0.iloc[j] -
                           df.Object_Center_0.iloc[i])**2 +
                          (df.Object_Center_1.iloc[j] -
                           df.Object_Center_1.iloc[i])**2)
        mylist = []
        mylist.append(value)
        fulldistance = sum(mylist)
        mylist2 = []
        mylist2.append(fulldistance)
        i += 1
    return mylist2
pythag_dis('1CCM0701')
OUTPUT: [1976.075585650214]
A possible solution using Pandas: use groupby and shift to pair each point with the next one in its track, calculate the distance between each pair, and then sum the distances within each group:
import math
import numpy as np
import pandas as pd
def distance(row):
    x1, y1, x2, y2 = row["X_COORDINATES"], row["Y_COORDINATES"], row["X2"], row["Y2"]
    if np.isnan(x2) or np.isnan(y2):
        return 0
    return math.sqrt((x2 - x1) ** 2 + (y2 - y1) ** 2)
df["X2"] = df.groupby("trackiD")["X_COORDINATES"].shift(-1)
df["Y2"] = df.groupby("trackiD")["Y_COORDINATES"].shift(-1)
df["distance"] = df.apply(distance, axis=1)
df.groupby("trackiD")["distance"].sum()
Output:
trackiD
2 6.560621
3 2.345185
4 1.868628
Name: distance, dtype: float64
Test dataframe:
df = pd.DataFrame(
{
"trackiD": {
0: 2,
1: 2,
2: 2,
3: 2,
4: 2,
5: 2,
6: 2,
7: 2,
8: 2,
9: 3,
10: 3,
11: 3,
12: 3,
13: 3,
14: 4,
15: 4,
16: 4,
17: 4,
},
"X_COORDINATES": {
0: 542.299805,
1: 544.108215,
2: 545.300598,
3: 546.417053,
4: 546.198669,
5: 546.724915,
6: 547.037048,
7: 547.011963,
8: 547.649231,
9: 547.600525,
10: 547.891479,
11: 548.580505,
12: 548.144409,
13: 548.62677,
14: 548.527222,
15: 548.504211,
16: 548.936584,
17: 548.627869,
},
"Y_COORDINATES": {
0: 23.38809,
1: 23.575758,
2: 23.962421,
3: 25.049328,
4: 24.830357,
5: 24.916084,
6: 24.918982,
7: 24.785202,
8: 24.845772,
9: 24.613401,
10: 24.268734,
11: 24.459103,
12: 23.915531,
13: 23.922005,
14: 24.13467,
15: 23.642254,
16: 24.028818,
17: 23.295454,
},
}
)
First create two new columns, X_SHIFTED and Y_SHIFTED, that hold the previous point's coordinates within each track ID. We do this by combining df.groupby and df.shift:
df[['X_SHIFTED', 'Y_SHIFTED']] = df.groupby('trackiD').shift()
Then, simply use the Euclidean distance formula between the points (X_COORDINATES, Y_COORDINATES) and (X_SHIFTED, Y_SHIFTED). We can do this using df.apply row-wise (axis=1), along with math.dist:
import math
df['DIST'] = df.apply(
    lambda row: math.dist(
        (row['X_COORDINATES'], row['Y_COORDINATES']),
        (row['X_SHIFTED'], row['Y_SHIFTED'])
    ), axis=1)
output:
trackiD X_COORDINATES Y_COORDINATES X_SHIFTED Y_SHIFTED DIST
0 2 542.299805 23.388090 NaN NaN NaN
1 2 544.108215 23.575758 542.299805 23.388090 1.818122
2 2 545.300598 23.962421 544.108215 23.575758 1.253509
3 2 546.417053 25.049328 545.300598 23.962421 1.558152
4 2 546.198669 24.830357 546.417053 25.049328 0.309257
5 2 546.724915 24.916084 546.198669 24.830357 0.533183
6 2 547.037048 24.918982 546.724915 24.916084 0.312146
7 2 547.011963 24.785202 547.037048 24.918982 0.136112
8 2 547.649231 24.845772 547.011963 24.785202 0.640140
9 3 547.600525 24.613401 NaN NaN NaN
10 3 547.891479 24.268734 547.600525 24.613401 0.451054
11 3 548.580505 24.459103 547.891479 24.268734 0.714841
12 3 548.144409 23.915531 548.580505 24.459103 0.696886
13 3 548.626770 23.922005 548.144409 23.915531 0.482404
14 4 548.527222 24.134670 NaN NaN NaN
15 4 548.504211 23.642254 548.527222 24.134670 0.492953
16 4 548.936584 24.028818 548.504211 23.642254 0.579981
17 4 548.627869 23.295454 548.936584 24.028818 0.795693
To get each track's sum of distances, you can then use:
df.groupby('trackiD')['DIST'].sum()
output:
trackiD
2 6.560621
3 2.345185
4 1.868628
Name: DIST, dtype: float64
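As a side note, the row-wise apply can be avoided entirely: diff() within each group gives the per-step displacements directly. A hedged, fully vectorized sketch on the same test dataframe (the step_len column name is my own):
import numpy as np

# per-step displacements within each track; the first row of every track is NaN
steps = df.groupby("trackiD")[["X_COORDINATES", "Y_COORDINATES"]].diff()
df["step_len"] = np.hypot(steps["X_COORDINATES"], steps["Y_COORDINATES"])
print(df.groupby("trackiD")["step_len"].sum())  # NaN steps are ignored by sum()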

How to get location of nearest number in DataFrame

I have following DataFrame
0
0 5
1 10
2 15
3 20
I want to get the location of the value which is nearest to a given value n. For example, if n=7 then the nearest number is 5,
and I then want to return the location of 5, i.e. [0][0].
Use Series.abs and Series.idxmin:
# Setup
df = pd.DataFrame({0: {0: 5, 1: 10, 2: 15, 3: 20}})
n = 7
(n - df[0]).abs().idxmin()
[out]
0
Use numpy.argmin to get the closest number index:
df[1] = 7
df[2] = df[1] - df[0]
df[2] = df[2].abs()
print(np.argmin(df[2]))
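If the DataFrame has several columns, stacking it first makes idxmin return a (row, column) pair, which matches the [0][0] style location asked for. A minimal sketch:
import pandas as pd

df = pd.DataFrame({0: {0: 5, 1: 10, 2: 15, 3: 20}})
n = 7
row, col = (df - n).abs().stack().idxmin()
print(row, col)  # 0 0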

Filter data with groupby in pandas

I have a DataFrame with the following data. Each row represents a word appearing in an episode of a TV series; if a word appears 3 times in an episode, the pandas dataframe has 3 rows for it. Now I need to filter the words so that I only get words which appear 2 or more times. I can do this by groupby, but if a word appears 2 (or say 3, 4 or 5) times, I need two (3, 4 or 5) rows for it.
By groupby, I will only get the unique entry and count, but I need the entry to repeat as many times as it appears in the dialogue. Is there a one-liner to do this?
dialogue episode
0 music 1
1 corrections 1
2 somnath 1
3 yadav 5
4 join 2
5 instagram 1
6 wind 2
7 music 1
8 whimpering 2
9 music 1
10 wind 3
So here I should ideally get:
dialogue episode
0 music 1
6 wind 2
7 music 1
9 music 1
10 wind 3
These are the only two words that appear two or more times.
You can use groupby's filter:
In [11]: df.groupby("dialogue").filter(lambda x: len(x) > 1)
Out[11]:
dialogue episode
0 music 1
6 wind 2
7 music 1
9 music 1
10 wind 3
Answer for the updated question:
In [208]: df.groupby('dialogue')['episode'].transform('size') >= 3
Out[208]:
0 True
1 False
2 False
3 False
4 False
5 False
6 False
7 True
8 False
9 True
10 False
dtype: bool
In [209]: df[df.groupby('dialogue')['episode'].transform('size') >= 3]
Out[209]:
dialogue episode
0 music 1
7 music 1
9 music 1
Answer for the original question:
you can use duplicated() method:
In [202]: df[df.duplicated(subset=['dialogue'], keep=False)]
Out[202]:
dialogue episode
0 music 1
6 wind 2
7 music 1
9 music 1
10 wind 3
if you want to sort the result:
In [203]: df[df.duplicated(subset=['dialogue'], keep=False)].sort_values('dialogue')
Out[203]:
dialogue episode
0 music 1
7 music 1
9 music 1
6 wind 2
10 wind 3
I'd use value_counts
vc = df.dialogue.value_counts() >= 2
vc = vc[vc]
df[df.dialogue.isin(vc.index)]
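A hedged one-line variant of the same idea, mapping the counts back onto the column with Series.map (operating on the question's df):
df[df.dialogue.map(df.dialogue.value_counts()) >= 2]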
Timing
Keep in mind, this is completely over the top. However, I'm sharpening up my timing skills.
code
from timeit import timeit
from string import ascii_lowercase as lowercase  # alias so the snippet below can keep using `lowercase`
import numpy as np
import pandas as pd

def pirsquared(df):
    vc = df.dialogue.value_counts() > 1
    vc = vc[vc]
    return df[df.dialogue.isin(vc.index)]

def maxu(df):
    return df[df.groupby('dialogue')['episode'].transform('size') > 1]

def andyhayden(df):
    return df.groupby("dialogue").filter(lambda x: len(x) > 1)

rows = ['pirsquared', 'maxu', 'andyhayden']
cols = ['OP_Given', '10000_3_letters']
summary = pd.DataFrame([], rows, cols)
iterations = 10

df = pd.DataFrame({'dialogue': {0: 'music', 1: 'corrections', 2: 'somnath', 3: 'yadav', 4: 'join', 5: 'instagram', 6: 'wind', 7: 'music', 8: 'whimpering', 9: 'music', 10: 'wind'}, 'episode': {0: 1, 1: 1, 2: 1, 3: 5, 4: 2, 5: 1, 6: 2, 7: 1, 8: 2, 9: 1, 10: 3}})
summary.loc['pirsquared', 'OP_Given'] = timeit(lambda: pirsquared(df), number=iterations)
summary.loc['maxu', 'OP_Given'] = timeit(lambda: maxu(df), number=iterations)
summary.loc['andyhayden', 'OP_Given'] = timeit(lambda: andyhayden(df), number=iterations)

df = pd.DataFrame(
    pd.DataFrame(np.random.choice(list(lowercase), (10000, 3))).sum(1),
    columns=['dialogue'])
df['episode'] = 1
summary.loc['pirsquared', '10000_3_letters'] = timeit(lambda: pirsquared(df), number=iterations)
summary.loc['maxu', '10000_3_letters'] = timeit(lambda: maxu(df), number=iterations)
summary.loc['andyhayden', '10000_3_letters'] = timeit(lambda: andyhayden(df), number=iterations)
summary
