Consider the following DataFrame:
>>> df
   Start  End  Tiebreak
0      1    6  0.376600
1      5    7  0.050042
2     15   20  0.628266
3     10   15  0.984022
4     11   12  0.909033
5      4    8  0.531054
Whenever the [Start, End] intervals of two rows overlap, I want the row with the lower tiebreaking value to be removed. The result for the example would be:
>>> df
   Start  End  Tiebreak
2     15   20  0.628266
3     10   15  0.984022
5      4    8  0.531054
I have a double loop that does the job, but inefficiently, and I was wondering whether there is an approach that exploits built-ins and works column-wise.
import pandas as pd
import numpy as np

# initial data
df = pd.DataFrame({
    'Start': [1, 5, 15, 10, 11, 4],
    'End': [6, 7, 20, 15, 12, 8],
    'Tiebreak': np.random.uniform(0, 1, 6)
})

# checking for overlaps
list_idx_drop = []
for i in range(len(df) - 1):
    for j in range(i + 1, len(df)):
        idx_1 = df.index[i]
        idx_2 = df.index[j]
        # skip rows already marked for removal
        if idx_1 in list_idx_drop or idx_2 in list_idx_drop:
            continue
        cond_1 = df.loc[idx_1, 'Start'] < df.loc[idx_2, 'End']
        cond_2 = df.loc[idx_2, 'Start'] < df.loc[idx_1, 'End']
        # if the rows overlap
        if cond_1 and cond_2:
            tie_1 = df.loc[idx_1, 'Tiebreak']
            tie_2 = df.loc[idx_2, 'Tiebreak']
            # mark the row with the lower tiebreaking value
            if tie_1 < tie_2:
                list_idx_drop.append(idx_1)
            else:
                list_idx_drop.append(idx_2)

df = df.drop(list_idx_drop)
You could sort by End in descending order and check where the End is greater than the previous row's Start. From that True/False flag you can create groupings on which to drop duplicates: sort again by Tiebreak and drop duplicates on the group column.
import pandas as pd

df = pd.DataFrame({'Start': {0: 1, 1: 5, 2: 15, 3: 10, 4: 11, 5: 4},
                   'End': {0: 6, 1: 7, 2: 20, 3: 15, 4: 12, 5: 8},
                   'Tiebreak': {0: 0.3766, 1: 0.050042, 2: 0.628266,
                                3: 0.984022, 4: 0.909033, 5: 0.531054}})

df = df.sort_values(by='End', ascending=False)
# True where a row overlaps the previous (higher-End) row
df['overlap'] = df['End'].gt(df['Start'].shift(fill_value=0))
# each non-overlap starts a new group
df['group'] = df['overlap'].eq(False).cumsum()
# keep only the highest Tiebreak per group
df = df.sort_values(by='Tiebreak', ascending=False)
df = df.drop_duplicates(subset='group').drop(columns=['overlap', 'group'])
print(df)
Output
   Start  End  Tiebreak
2     15   20  0.628266
3     10   15  0.984022
5      4    8  0.531054
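For reference, printing df right after the group assignment (just before the Tiebreak sort) makes the grouping visible; note that shift(fill_value=0) assumes the Start values are non-negative:
   Start  End  Tiebreak  overlap  group
2     15   20  0.628266     True      0
3     10   15  0.984022    False      1
4     11   12  0.909033     True      1
5      4    8  0.531054    False      2
1      5    7  0.050042     True      2
0      1    6  0.376600     True      2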
You can sort the values by Start and compute a cummax of the End, then form groups of non-overlapping intervals and get the max Tiebreak with groupby.idxmax:
keep = (df
        .sort_values(by=['Start', 'End'])
        .assign(max_End=lambda d: d['End'].cummax(),
                group=lambda d: d['Start'].ge(d['max_End'].shift()).cumsum())
        .groupby('group', sort=False)['Tiebreak'].idxmax()
       )
out = df[df.index.isin(keep)]
Output:
   Start  End  Tiebreak
2     15   20  0.628266
3     10   15  0.984022
5      4    8  0.531054
[Figure: the intervals drawn as solid lines (the greatest Tiebreak per group in bold), with the running cummax of End as dotted lines.]
The logic is to move from left to right and start a new group whenever there is a "jump" (no overlap).
Intermediates:
   Start  End  Tiebreak  max_End  group
0      1    6  0.376600        6      0
5      4    8  0.531054        8      0
1      5    7  0.050042        8      0
3     10   15  0.984022       15      1   # 10 ≥ 8
4     11   12  0.909033       15      1
2     15   20  0.628266       20      2   # 15 ≥ 15
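With the Tiebreak values above, keep then holds the index of the winning row per group (an illustrative session):
>>> keep
group
0    5
1    3
2    2
Name: Tiebreak, dtype: int64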
I have a 4×5 matrix and I need to sort it by several columns.
Given these inputs:
sort_columns = [3, 1, 2, 4, 5, 2]
matrix = [[3, 1, 8, 1, 9],
          [3, 7, 8, 2, 9],
          [2, 7, 7, 1, 2],
          [2, 1, 7, 1, 9]]
the matrix should first be sorted by the 3rd column (so the values 8, 8, 7, 7), then the sorted result should again be sorted by column 1 (values 3, 3, 2, 2), and so on.
So, after first sorting by column 3, the matrix would be:
2 7 7 1 2
2 1 7 1 9
3 1 8 1 9
3 7 8 2 9
and sorting on column 1 then has no effect as the values are already in the right order. The next column, 2, then makes the order:
2 1 7 1 9
3 1 8 1 9
2 7 7 1 2
3 7 8 2 9
etc.
After sorting on all the sort_columns numbers, I expect to get the result:
2 7 7 1 2
3 1 8 1 9
2 1 7 1 9
3 7 8 2 9
This is my code to sort the matrix:
def sort_matrix_columns(matrix, n, sort_columns):
    for col in sort_columns:
        column = col - 1
        for i in range(n):
            for j in range(i + 1, n):
                if matrix[i][column] > matrix[j][column]:
                    temp = matrix[i]
                    matrix[i] = matrix[j]
                    matrix[j] = temp
which is called like this:
sort_matrix_columns(matrix, len(matrix), sort_columns)
But when I do I get the following wrong result:
3 1 8 1 9
2 1 7 1 9
2 7 7 1 2
3 7 8 2 9
Why am I getting the wrong order here? Where is my sort implementation failing?
The short answer is that your sort implementation is not stable.
A sort algorithm is stable when two entries in the sorted sequence keep the same (relative) order when their sort key is the same. For example, when sorting only by the first letter, a stable algorithm will always sort the sequence ['foo', 'flub', 'bar'] to be ['bar', 'foo', 'flub'], keeping the 'foo' and 'flub' values in the same relative order. Your algorithm would swap 'foo' and 'bar' (as 'f' > 'b' is true) without touching 'flub', and so you'd end up with ['bar', 'flub', 'foo'].
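You can see the stability of Python's built-in sort directly (a small illustrative session):
>>> sorted(['foo', 'flub', 'bar'], key=lambda w: w[0])
['bar', 'foo', 'flub']
'foo' still comes before 'flub' because the two compare equal on 'f' and sorted() is stable.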
You need a stable sort algorithm when applying sort multiple times, as you do when using multiple columns, because each subsequent sort should preserve the order produced by the preceding sorts whenever two rows have the same value in the current column.
You can see this when your implementation sorts by column 5, after first sorting on columns 3, 1, 2, 4. After those first 4 sort operations the matrix looks like this:
2 1 7 1 9
3 1 8 1 9
2 7 7 1 2
3 7 8 2 9
Your implementation then sorts by column 5, so by the values 9, 9, 2, 9. The first row is then swapped with the third row (2 1 7 1 9 and 2 7 7 1 2), leaving the other rows untouched. This changed the relative order of the rows with a 9:
2 7 7 1 2 <- was third
3 1 8 1 9 <- so this row is now re-ordered!
2 1 7 1 9 <- was first
3 7 8 2 9
Sorting the above output by the 2nd column (7, 1, 1, 7) then leads to the wrong output you see.
A stable sort algorithm would have moved the 2 7 7 1 2 row to be the first row without reordering the other rows:
2 7 7 1 2 <- was third
2 1 7 1 9 <- was first
3 1 8 1 9 <- was second, stays *after* the first row
3 7 8 2 9 <- was fourth, stays *after* the second row
and sorting by the second column produces the correct output.
The default Python sort implementation, Timsort (named after its inventor, Tim Peters), is a stable sort. You could just use it (via the list.sort() method and a sort key function):
def sort_matrix_columns(matrix, sort_columns):
    for col in sort_columns:
        matrix.sort(key=lambda row: row[col - 1])
Heads-up: I removed the n parameter from the function, for simplicity's sake.
Demo:
>>> def pm(m): print(*(' '.join(map(str, r)) for r in m), sep="\n")
...
>>> def sort_matrix_columns(matrix, sort_columns):
... for col in sort_columns:
... matrix.sort(key=lambda row: row[col - 1])
...
>>> sort_columns = [3, 1, 2, 4, 5, 2]
>>> matrix = [[3, 1, 8, 1, 9],
... [3, 7, 8, 2, 9],
... [2, 7, 7, 1, 2],
... [2, 1, 7, 1, 9]]
>>> sort_matrix_columns(matrix, sort_columns)
>>> pm(matrix)
2 1 7 1 9
3 1 8 1 9
2 7 7 1 2
3 7 8 2 9
You don't even need a loop: if you reverse the sort_columns list and use it to create a single sort key, you can do this with a single call:
def sort_matrix_columns(matrix, sort_columns):
    matrix.sort(key=lambda r: [r[c - 1] for c in sort_columns[::-1]])
This works the same way: the most significant sort key is the last column, and only when two rows have the same value there (a tie) does the one-but-last column matter, and so on.
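To make this concrete, here is the composite key that lambda builds for the first row of the example matrix (an illustrative session):
>>> sort_columns = [3, 1, 2, 4, 5, 2]
>>> row = [3, 1, 8, 1, 9]
>>> [row[c - 1] for c in sort_columns[::-1]]
[1, 9, 1, 1, 3, 8]
Python compares lists lexicographically, so the first key element (column 2, the last sort column) dominates, with the remaining elements only breaking ties.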
There are other stable sort algorithms; insertion sort or bubble sort, for example, would work just as well here. Wikipedia has a handy table of comparison sort algorithms that includes a 'stable' column, if you still wanted to implement the sorting yourself.
E.g. here is a version using insertion sort:
def insertionsort_matrix_columns(matrix, sort_columns):
    for col in sort_columns:
        column = col - 1
        for i in range(1, len(matrix)):
            for j in range(i, 0, -1):
                if matrix[j - 1][column] <= matrix[j][column]:
                    break
                matrix[j - 1], matrix[j] = matrix[j], matrix[j - 1]
Note that I didn't use a temp variable to swap two rows; in Python you can swap two values with a single tuple assignment.
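A quick demonstration of such a swap:
>>> x, y = 'first', 'second'
>>> x, y = y, x
>>> x, y
('second', 'first')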
Because insertion sort is stable, this produces the expected outcome:
>>> matrix = [[3, 1, 8, 1, 9],
... [3, 7, 8, 2, 9],
... [2, 7, 7, 1, 2],
... [2, 1, 7, 1, 9]]
>>> insertionsort_matrix_columns(matrix, sort_columns)
>>> pm(matrix)
2 1 7 1 9
3 1 8 1 9
2 7 7 1 2
3 7 8 2 9
I want to sort an array within the group boundaries defined in another array. The groups are not presorted in any way and need to remain unchanged after the sorting. In numpy terms it would look like this:
import numpy as np
def groupwise_sort(group_idx, a, reverse=False):
    sortidx = np.lexsort((-a if reverse else a, group_idx))
    # reverse the sort back into the original grouped order,
    # while preserving the within-group sorting
    revidx = np.argsort(np.argsort(group_idx, kind='mergesort'), kind='mergesort')
    return a[sortidx][revidx]
group_idx = np.array([3, 2, 3, 2, 2, 1, 2, 1, 1])
a = np.array([3, 2, 1, 7, 4, 5, 5, 9, 1])
groupwise_sort(group_idx, a)
# >>> array([1, 2, 3, 4, 5, 1, 7, 5, 9])
groupwise_sort(group_idx, a, reverse=True)
# >>> array([3, 7, 1, 5, 4, 9, 2, 5, 1])
How can I do the same with pandas? I looked at df.groupby() and df.sort_values(), but I couldn't find a straightforward way to achieve the same sorting. And a fast one, if possible.
Let us first set the stage:
import pandas as pd
import numpy as np
group_idx = np.array([3, 2, 3, 2, 2, 1, 2, 1, 1])
a = np.array([3, 2, 1, 7, 4, 5, 5, 9, 1])
df = pd.DataFrame({'group': group_idx, 'values': a})
df
# group values
#0 3 3
#1 2 2
#2 3 1
#3 2 7
#4 2 4
#5 1 5
#6 2 5
#7 1 9
#8 1 1
To get a dataframe sorted by group and values (within groups):
df.sort_values(["group", "values"])
# group values
#8 1 1
#5 1 5
#7 1 9
#1 2 2
#4 2 4
#6 2 5
#3 2 7
#2 3 1
#0 3 3
To sort the values in descending order, use ascending = False. To apply different orders to different columns, you can supply a list:
df.sort_values(["group", "values"], ascending = [True, False])
# group values
#7 1 9
#5 1 5
#8 1 1
#3 2 7
#6 2 5
#4 2 4
#1 2 2
#0 3 3
#2 3 1
Here, groups are sorted in ascending order, and the values within each group are sorted in descending order.
To only sort values for contiguous rows belonging to the same group, create a new group indicator:
(I keep this in here for reference since it might be helpful for others. I wrote this in an earlier version before the OP clarified his question in the comments.)
df['new_grp'] = (df.group.diff(1) != 0).astype('int').cumsum()
df
# group values new_grp
#0 3 3 1
#1 2 2 2
#2 3 1 3
#3 2 7 4
#4 2 4 4
#5 1 5 5
#6 2 5 6
#7 1 9 7
#8 1 1 7
We can then easily sort with new_grp instead of group, leaving the original order of groups untouched.
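For example (a sketch of that step, sorting values ascending within each contiguous run):
df.sort_values(["new_grp", "values"])
# group values new_grp
#0 3 3 1
#1 2 2 2
#2 3 1 3
#4 2 4 4
#3 2 7 4
#5 1 5 5
#6 2 5 6
#8 1 1 7
#7 1 9 7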
Ordering within groups but keeping the group-specific row-positions:
To sort the elements of each group but keep the group-specific positions in the dataframe, we need to keep track of the original row numbers. For instance, the following will do the trick:
# First, create an indicator for the original row-number:
df["ind"] = range(len(df))
# Now, sort the dataframe as before
df_sorted = df.sort_values(["group", "values"])
# sort the original row-numbers within each group
newindex = df.groupby("group").apply(lambda x: x.sort_values(["ind"]))["ind"].values
# assign the sorted row-numbers to the sorted dataframe
df_sorted["ind"] = newindex
# Sort based on the row-numbers:
sorted_asc = df_sorted.sort_values("ind")
# compare the resulting order of values with your desired output:
np.array(sorted_asc["values"])
# array([1, 2, 3, 4, 5, 1, 7, 5, 9])
This is easier to test and profile when written up in a function, so let's do that:
def sort_my_frame(frame, groupcol = "group", valcol = "values", asc = True):
    frame["ind"] = range(len(frame))
    frame_sorted = frame.sort_values([groupcol, valcol], ascending = [True, asc])
    ind_sorted = frame.groupby(groupcol).apply(lambda x: x.sort_values(["ind"]))["ind"].values
    frame_sorted["ind"] = ind_sorted
    frame_sorted = frame_sorted.sort_values(["ind"])
    return frame_sorted.drop(columns = "ind")
np.array(sort_my_frame(df, "group", "values", asc = True)["values"])
# array([1, 2, 3, 4, 5, 1, 7, 5, 9])
np.array(sort_my_frame(df, "group", "values", asc = False)["values"])
# array([3, 7, 1, 5, 4, 9, 2, 5, 1])
Note that the latter results match your desired outcome.
I am sure this can be written up in a more succinct way. For instance, if the index of your dataframe is already ordered, you can use that instead of the indicator ind I create (i.e., following @DJK's comment, we can use sort_index instead of sort_values and avoid assigning an additional column). In any case, the above highlights one possible solution and how to approach it. An alternative would be to use your numpy functions and wrap the output in a pd.DataFrame.
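For instance, a minimal sketch of that last alternative, reusing the groupwise_sort function from the question:
df_sorted = pd.DataFrame({'group': group_idx,
                          'values': groupwise_sort(group_idx, a)})
np.array(df_sorted['values'])
# array([1, 2, 3, 4, 5, 1, 7, 5, 9])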
Pandas is built on top of numpy. Assuming a dataframe like so:
df
Out[21]:
group values
0 3 3
1 2 2
2 3 1
3 2 7
4 2 4
5 1 5
6 2 5
7 1 9
8 1 1
Call your function.
groupwise_sort(df.group.values, df['values'].values)
Out[22]: array([1, 2, 3, 4, 5, 1, 7, 5, 9])
groupwise_sort(df.group.values, df['values'].values, reverse=True)
Out[23]: array([3, 7, 1, 5, 4, 9, 2, 5, 1])
I have a data frame like
index  A  B  C
0      4  7  9
1      2  6  2
2      6  9  1
3      7  2  4
4      8  5  6
I want to create another data frame out of this, based on the sum of the C column. The catch is that whenever the sum of C reaches 10 or higher, it should start another row. Something like this:
index A B C
0 6 13 11
1 21 16 11
Any help will be highly appreciated. Is there a robust way to do this, or is iterating my last resort?
There is a non-iterative approach. You'll need a groupby key built from the cumulative sum of C, taken modulo 10.
# Groupby logic - https://stackoverflow.com/a/45959831/4909087
out = df.groupby((df.C.cumsum() % 10).diff().shift().lt(0).cumsum(), as_index=0).agg('sum')
print(out)
A B C
0 6 13 11
1 21 16 11
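To see how the grouping key is built, here are the intermediates for C = [9, 2, 1, 4, 6] (a step-by-step sketch of the one-liner above):
key = df.C.cumsum() % 10             # 9, 1, 2, 6, 2   (wraps whenever the running sum crosses a multiple of 10)
step = key.diff()                    # NaN, -8.0, 1.0, 4.0, -4.0   (negative exactly at a wrap)
group = step.shift().lt(0).cumsum()  # 0, 0, 1, 1, 1   (the new group starts on the row *after* the wrap)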
The code would look something like this:
import pandas as pd

lista = [4, 7, 10, 11, 7]
listb = [7, 8, 2, 5, 9]
listc = [9, 2, 1, 4, 6]
df = pd.DataFrame({'A': lista, 'B': listb, 'C': listc})

def sumsc(df):
    suma = 0
    sumb = 0
    sumc = 0
    list_of_sums = []
    for i in range(len(df)):
        suma += df.iloc[i, 0]
        sumb += df.iloc[i, 1]
        sumc += df.iloc[i, 2]
        # the row that pushes the running sum of C to 10 or more closes the group
        if sumc >= 10:
            list_of_sums.append([suma, sumb, sumc])
            suma = 0
            sumb = 0
            sumc = 0
    # note: trailing rows whose running sum never reaches 10 are discarded
    return pd.DataFrame(list_of_sums)
sumsc(df)
0 1 2
0 11 15 11
1 28 16 11
I have a dictionary 'wordfreq' like this:
{'techsmart': 30, 'paradies': 57, 'jobvark': 5000, 'midgley': 100, 'weisman': 2, 'tucuman': 1, 'amdahl': 2, 'frogfeet': 1, 'd8848': 1, 'jiaoyuwang': 1, 'walter': 19}
and I want to collect the keys whose value is more than 5 and which do not appear in another DataFrame 'df', into a list called 'stopword'. Here is the df DataFrame:
word freq
1 paradies 1
5 tucuman 1
and here is the code I am using:
stopword = []
for k, v in wordfreq.items():
    if v >= 5:
        if k not in list_c:
            stopword.append(k)
Does anybody know how I can do the same thing with the isin() method, or at least more efficiently?
I'd load your dict into a df:
In [177]:
wordfreq = {'techsmart': 30, 'paradies': 57, 'jobvark': 5000, 'midgley': 100, 'weisman': 2, 'tucuman': 1, 'amdahl': 2, 'frogfeet': 1, 'd8848': 1, 'jiaoyuwang': 1, 'walter': 19}
df = pd.DataFrame({'word':list(wordfreq.keys()), 'freq':list(wordfreq.values())})
df
Out[177]:
freq word
0 1 frogfeet
1 1 tucuman
2 57 paradies
3 1 d8848
4 5000 jobvark
5 100 midgley
6 1 jiaoyuwang
7 30 techsmart
8 2 weisman
9 19 walter
10 2 amdahl
And then filter using isin against the other df (df1 in my case) like this:
In [181]:
df[(df['freq'] > 5) & (~df['word'].isin(df1['word']))]
Out[181]:
freq word
4 5000 jobvark
5 100 midgley
7 30 techsmart
9 19 walter
So the boolean condition looks for freq values greater than 5, and for words that are not in the other df, using isin with the boolean mask inverted by ~.
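Spelled out in two steps (same frames as above; df1 is the OP's two-row word/freq frame):
m1 = df['freq'] > 5                  # frequency filter
m2 = ~df['word'].isin(df1['word'])   # word not present in df1
df[m1 & m2]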
You can then get a list easily:
In [182]:
list(df[(df['freq'] > 5) & (~df['word'].isin(df1['word']))]['word'])
Out[182]:
['jobvark', 'midgley', 'techsmart', 'walter']
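Equivalently, Series.tolist() gives the same list a little more directly:
df.loc[(df['freq'] > 5) & (~df['word'].isin(df1['word'])), 'word'].tolist()
# ['jobvark', 'midgley', 'techsmart', 'walter']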