Remove rows in dataframe by overlapping groups based on coordinates - python

I have a dataframe such as
Seq Chrm start end length score
A C1 1 50 49 12
B C1 3 55 52 12
C C1 6 60 54 12
Cbis C1 6 60 54 11
D C1 70 120 50 12
E C1 78 111 33 12
F C2 350 400 50 12
A C2 349 400 51 12
B C2 450 500 50 12
Within each Chrm, and within each group of overlapping start-end coordinates, I would like to keep the row with the longest length value AND the highest score value.
For example in C1:
Seq Chrm start end length score
A C1 1 50 49 12
B C1 3 55 52 12
C C1 6 60 54 12
Cbis C1 6 60 54 11
D C1 70 120 50 12
E C1 78 111 33 12
Coordinates from start to end of A, B, C and Cbis overlap with each other, and D and E overlap with each other.
In the A, B, C, Cbis group the longest are C and Cbis with 54, so I keep the one with the highest score, which is **C** (12). In the **D, E** group, the longest is **D** with 50.
So I keep only rows C and D here.
If I do the same for other Chrm I should then get the following output:
Seq Chrm start end length score
C C1 6 60 54 12
D C1 70 120 50 12
A C2 349 400 51 12
B C2 450 500 50 12
Here is the dataframe in dict format in case it helps:
{'Seq': {0: 'A', 1: 'B', 2: 'C', 3: 'Cbis', 4: 'D', 5: 'E', 6: 'F', 7: 'A', 8: 'B'}, 'Chrm': {0: 'C1', 1: 'C1', 2: 'C1', 3: 'C1', 4: 'C1', 5: 'C1', 6: 'C2', 7: 'C2', 8: 'C2'}, 'start': {0: 1, 1: 3, 2: 6, 3: 6, 4: 70, 5: 78, 6: 350, 7: 349, 8: 450}, 'end': {0: 50, 1: 55, 2: 60, 3: 60, 4: 120, 5: 111, 6: 400, 7: 400, 8: 500}, 'length': {0: 49, 1: 52, 2: 54, 3: 54, 4: 50, 5: 33, 6: 50, 7: 51, 8: 50}, 'score': {0: 12, 1: 12, 2: 12, 3: 11, 4: 12, 5: 12, 6: 12, 7: 12, 8: 12}}
Edit for Corralien:
If I use this table:
Seq Chrm start end length score
A C1 12414 14672 49 12
B C1 12414 14741 52 12
C C1 12414 14744 54 12
It does not place A, B and C in the same overlapping group...
{'Seq': {0: 'A', 1: 'B', 2: 'C'}, 'Chrm': {0: 'C1', 1: 'C1', 2: 'C1'}, 'start': {0: 12414, 1: 12414, 2: 12414}, 'end': {0: 14672, 1: 14741, 2: 14744}, 'length': {0: 49, 1: 52, 2: 54}, 'score': {0: 12, 1: 12, 2: 12}}

Create virtual groups and keep the best row (length, score) for each group:
Suppose this dataframe:
>>> df
Seq Chrm start end length score
0 A C1 1 50 49 12
1 B C1 3 55 52 12
2 C C1 6 60 54 12
3 Cbis C1 6 60 54 11
4 D C1 70 120 50 12
5 E C1 78 111 33 12
6 F C2 350 400 50 12
7 A C2 349 400 51 12
8 B C2 450 500 50 12
9 A C1 12414 14672 49 12
10 B C1 12414 14741 52 12
11 C C1 12414 14744 54 12
Create groups:
# True marks the first row of a new group: its start is at or past the previous end in the same Chrm
is_overlapped = lambda x: x['start'] >= x['end'].shift(fill_value=-1)
df['group'] = (df.sort_values(['Chrm', 'start', 'end'])
                 .groupby('Chrm').apply(is_overlapped)
                 .droplevel(0).cumsum())
# within each group, keep the row with the longest length, then the highest score
out = (df.sort_values(['group', 'length', 'score'], ascending=[True, False, False])
         .groupby(df['group']).head(1))
Output:
>>> out
Seq Chrm start end length score group
2 C C1 6 60 54 12 1
4 D C1 70 120 50 12 2
11 C C1 12414 14744 54 12 3
7 A C2 349 400 51 12 4
8 B C2 450 500 50 12 5
# Groups
>>> df
Seq Chrm start end length score group
0 A C1 1 50 49 12 1
1 B C1 3 55 52 12 1
2 C C1 6 60 54 12 1
3 Cbis C1 6 60 54 11 1
4 D C1 70 120 50 12 2
5 E C1 78 111 33 12 2
6 F C2 350 400 50 12 4
7 A C2 349 400 51 12 4
8 B C2 450 500 50 12 5
9 A C1 12414 14672 49 12 3
10 B C1 12414 14741 52 12 3
11 C C1 12414 14744 54 12 3
You can drop the group column with out.drop(columns='group'), but I left it in to illustrate the virtual groups.
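Note that the lambda only compares each row with the row immediately before it after sorting, so if an earlier interval stretches past its neighbour's end, a later overlap with that earlier interval can be missed. A variant sketch (my addition, reusing the same data and column names) uses the running maximum of the previous ends instead:
import pandas as pd

df = df.sort_values(['Chrm', 'start', 'end'])
# within each Chrm: maximum end seen before the current row (-1 before the first row)
prev_max_end = df.groupby('Chrm')['end'].transform(lambda s: s.cummax().shift(fill_value=-1))
# a new group starts whenever the current start is at or past that running maximum
df['group'] = (df['start'] >= prev_max_end).cumsum()
out = (df.sort_values(['group', 'length', 'score'], ascending=[True, False, False])
         .groupby('group').head(1))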

Related

Count how many times a pair of values in one pandas dataframe appears in another

I have a pandas dataframe df1 that looks like this:
import pandas as pd
d = {'node1': [47, 24, 19, 77, 24, 19, 77, 24, 56, 92, 32, 77], 'node2': [24, 19, 77, 24, 19, 77, 24, 19, 92, 32, 77, 24], 'user': ['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C', 'C']}
df1 = pd.DataFrame(data=d)
df1
node1 node2 user
47 24 A
24 19 A
19 77 A
77 24 A
24 19 A
19 77 B
77 24 B
24 19 B
56 92 C
92 32 C
32 77 C
77 24 C
And a second pandas dataframe df2 that looks like this:
d2 = {'way_id': [4, 3, 1, 8, 5, 2, 7, 9, 6, 10], 'source': [24, 19, 84, 47, 19, 16, 77, 56, 32, 92], 'target': [19, 43, 67, 24, 77, 29, 24, 92, 77, 32]}
df2 = pd.DataFrame(data=d2)
df2
way_id source target
4 24 19
3 19 43
1 84 67
8 47 24
5 19 77
2 16 29
7 77 24
9 56 92
6 32 77
10 92 32
In a new dataframe, I would like to count how often each (node1, node2) pair from df1 occurs in the (source, target) columns of df2. The order of the pair is relevant, and the corresponding user should be added as a new column. That's why the desired output should be like this:
way_id source target count user
4 24 19 2 A
3 19 43 0 A
1 84 67 0 A
8 47 24 1 A
5 19 77 1 A
2 16 29 0 A
7 77 24 1 A
9 56 92 0 A
6 32 77 0 A
10 92 32 0 A
4 24 19 1 B
3 19 43 0 B
1 84 67 0 B
8 47 24 0 B
5 19 77 1 B
2 16 29 0 B
7 77 24 1 B
9 56 92 0 B
6 32 77 0 B
10 92 32 0 B
4 24 19 0 C
3 19 43 0 C
1 84 67 0 C
8 47 24 0 C
5 19 77 0 C
2 16 29 0 C
7 77 24 1 C
9 56 92 1 C
6 32 77 1 C
10 92 32 1 C
Since you don't care about the source/target match, you need to duplicate the data and then merge:
(pd.concat([df1.rename(columns={'node1': 'source', 'node2': 'target'}),
            df1.rename(columns={'node2': 'source', 'node1': 'target'})])
   .merge(df2, on=['source', 'target'], how='outer')
   .groupby(['source', 'target', 'user'], as_index=False)['way_id'].count()
)
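If the order of the pair does matter and you want the exact layout from the question (one row per way_id and per user, zero counts included), here is a sketch (my addition; how='cross' requires pandas >= 1.2):
import pandas as pd

# count ordered (node1, node2) matches against (source, target), per user
matches = (df1.rename(columns={'node1': 'source', 'node2': 'target'})
              .merge(df2, on=['source', 'target'])
              .groupby(['user', 'way_id']).size())
# cross join every user with every way so zero counts are kept, then map the counts in
out = (df1[['user']].drop_duplicates()
          .merge(df2, how='cross')
          .set_index(['user', 'way_id']))
out['count'] = matches.reindex(out.index, fill_value=0)
out = out.reset_index()[['way_id', 'source', 'target', 'count', 'user']]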

How to assign values on multiple columns of a pandas data frame based on condition

I have a dataframe df as below
df = pd.DataFrame({
    'A': [20, 30, 40, -50, 60, -70],
    'B': [21, -19, 20, 18, 17, -21],
    'C': [1, 12, -13, 14, 15, 16],
    'D': [-88, 92, 9, 70, -6, 78]})
I want every value in columns ['C','D'] to be zero where the value is between -10 and 10; the rest of the values should remain the same.
Is there something similar to pandas.Series.between that can be applied to a DataFrame?
df[df[['C','D']].between(-10, 10, inclusive=True)] = 0
output should be :
A B C D
0 20 21 0 -88
1 30 -19 12 92
2 40 20 -13 0
3 -50 18 14 70
4 60 17 15 0
5 -70 -21 16 78
You can use df.mask() here after comparing by df.ge and df.le:
df[['C','D']] = df[['C','D']].mask(df[['C','D']].ge(-10) & df[['C','D']].le(10), 0)
Or np.where():
df[['C','D']] = np.where(df[['C','D']].ge(-10) & df[['C','D']].le(10), 0, df[['C','D']])
A B C D
0 20 21 0 -88
1 30 -19 12 92
2 40 20 -13 0
3 -50 18 14 70
4 60 17 15 0
5 -70 -21 16 78
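An equivalent, slightly shorter variant (my sketch, same df as in the setup above): being between -10 and 10 inclusive is the same as having an absolute value of at most 10.
cols = ['C', 'D']
df[cols] = df[cols].mask(df[cols].abs().le(10), 0)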

I cannot make my ideal DataFrame

There is CSV data like
No,User,A,B,C,D
1 Tom 100 120 110 90
1 Juddy 89 90 100 110
1 Bob 99 80 90 100
2 Tom 80 100 100 70
2 Juddy 79 90 80 70
2 Bob 88 90 95 90
...
I want to transform this csv data into this DataFrame like
     Tom_A  Tom_B  Tom_C  Tom_D  Juddy_A  Juddy_B  Juddy_C  Juddy_D  Bob_A  Bob_B  Bob_C  Bob_D
No
1      100    120    110     90       89       90      100      110     99     80     90    100
2       80    100    100     70       79       90       80       70     88     90     95     90
I run the codes,
import pandas as pd
csv = pd.read_csv("user.csv", header=0, index_col='No', sep='\s|,', engine='python')
but the output is not my ideal one. I cannot understand how to make combined columns like Tom_A, Tom_B or Juddy_A, which are not designated in the csv.
How should I fix my code?
Setup
df = pd.DataFrame({'No': {0: 1, 1: 1, 2: 1, 3: 2, 4: 2, 5: 2}, 'User': {0: 'Tom', 1: 'Juddy', 2: 'Bob', 3: 'Tom', 4: 'Juddy', 5: 'Bob'}, 'A': {0: 100, 1: 89, 2: 99, 3: 80, 4: 79, 5: 88}, 'B': {0: 120, 1: 90, 2: 80, 3: 100, 4: 90, 5: 90}, 'C': {0: 110, 1: 100, 2: 90, 3: 100, 4: 80, 5: 95}, 'D': {0: 90, 1: 110, 2: 100, 3: 70, 4: 70, 5: 90}})
You want pivot_table:
out = df.pivot_table(index='No', columns='User')
A B C D
User Bob Juddy Tom Bob Juddy Tom Bob Juddy Tom Bob Juddy Tom
No
1 99 89 100 80 90 120 90 100 110 100 110 90
2 88 79 80 90 90 100 95 80 100 90 70 70
To get the prefix:
out.columns = out.columns.swaplevel(0,1).to_series().str.join('_')
Bob_A Juddy_A Tom_A Bob_B Juddy_B Tom_B Bob_C Juddy_C Tom_C Bob_D Juddy_D Tom_D
No
1 99 89 100 80 90 120 90 100 110 100 110 90
2 88 79 80 90 90 100 95 80 100 90 70 70
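If you also want the Tom/Juddy/Bob, A-D column order shown in the question, an optional extra step (my addition) is to reorder the flattened columns:
order = [f'{user}_{col}' for user in ['Tom', 'Juddy', 'Bob'] for col in ['A', 'B', 'C', 'D']]
out = out[order]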

Multiple binary columns to one column

I have a CSV file dataset that contains 21 columns. The first 10 columns are numbers and I don't want to change them. The next 10 columns are binary data containing only 1s and 0s, with exactly one "1" per row and the rest "0", and the last column is the given label.
the example data looks like below
2596,51,3,258,0,510,221,232,148,6279,24(10th column),0,0,0,0,0,1(16th column),0,0,0,0,2(the last column)
Suppose I load the data into a matrix. Can I keep the first 10 columns and the last column unchanged, and convert the middle 10 columns into one column? After the transformation, I want the column value to be based on the index of the "1" in the row. For the row above, the wanted result is
2596,51,3,258,0,510,221,232,148,6279,24,6(it's 6 because the "1" is in the 6th column of the binary data),2 #12 columns in total
Can I achieve this using NumPy, scikit-learn or something else?
This should do it if the data is loaded into a NumPy array (note that in is a Python keyword, so call the array something else, e.g. arr):
out = np.c_[arr[:, :11], np.where(arr[:, 11:-1])[1] + 1, arr[:, -1]]
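For context, a minimal usage sketch of that one-liner (my addition; the file name data.csv and loading with np.loadtxt are assumptions):
import numpy as np

arr = np.loadtxt('data.csv', delimiter=',')  # hypothetical file, purely numeric CSV
# keep the 11 leading columns and the label; collapse the binary block into the
# 1-based position of its single "1"
out = np.c_[arr[:, :11], np.where(arr[:, 11:-1])[1] + 1, arr[:, -1]]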
from io import StringIO
import pandas as pd

csv = StringIO("2596,51,3,258,0,510,221,232,148,6279,24,0,0,0,0,0,1,0,0,0,0,2"
               "\n1,2,3,4,5,6,7,8,9,10,11,0,0,0,0,1,0,0,0,0,0,1")
df = pd.read_csv(csv, header=None)
df = pd.concat(objs=[df[df.columns[:11]],
                     df[df.columns[11:-1]].idxmax(axis=1) - 10,
                     df[df.columns[-1]]], axis=1)
print(df)
Output:
0 1 2 3 4 5 6 7 8 9 10 0 21
0 2596 51 3 258 0 510 221 232 148 6279 24 6 2
1 1 2 3 4 5 6 7 8 9 10 11 5 1
Data:
In [135]: df
Out[135]:
0 1 2 3 4 5 6 7 8 9 ... 12 13 14 15 16 17 18 19 20 21
0 2596 51 3 258 0 510 221 232 148 6279 ... 0 0 0 0 1 0 0 0 0 2
1 2596 51 3 258 0 510 221 232 148 6279 ... 0 0 0 0 0 0 0 0 1 2
[2 rows x 22 columns]
Solution:
df = pd.read_csv('/path/to/file.csv', header=None)
In [137]: df.iloc[:, :11] \
              .join(df.iloc[:, 11:21].dot(range(1,11)).to_frame(11)) \
              .join(df.iloc[:, -1])
Out[137]:
0 1 2 3 4 5 6 7 8 9 10 11 21
0 2596 51 3 258 0 510 221 232 148 6279 24 6 2
1 2596 51 3 258 0 510 221 232 148 6279 24 10 2
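As a side note (my addition): the .dot(range(1,11)) step works because each binary row contains exactly one 1, so the dot product with the weights 1..10 returns that 1's 1-based position. A toy check:
import pandas as pd

row = pd.Series([0, 0, 0, 0, 0, 1, 0, 0, 0, 0])
print(row.dot(range(1, 11)))  # 6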
Setup
df = pd.DataFrame({0: {2596: 51},
1: {2596: 3},
2: {2596: 258},
3: {2596: 0},
4: {2596: 510},
5: {2596: 221},
6: {2596: 232},
7: {2596: 148},
8: {2596: 6279},
9: {2596: 24},
10: {2596: 0},
11: {2596: 0},
12: {2596: 0},
13: {2596: 0},
14: {2596: 0},
15: {2596: 1},
16: {2596: 0},
17: {2596: 0},
18: {2596: 0},
19: {2596: 0},
20: {2596: 2}})
Solution
import numpy as np

# find the (1-based) position of the column with value 1 within the 10 binary columns
df.iloc[:, 10] = np.argmax(df.iloc[:, 10:20].values, axis=1) + 1
# select the first 10 columns, the position column and the label column
df.iloc[:, list(range(11)) + [20]]
Out[2167]:
0 1 2 3 4 5 6 7 8 9 10 20
2596 51 3 258 0 510 221 232 148 6279 24 6 2

Pandas Very Simple Percent of total size from Group by

I'm having trouble with a seemingly incredibly easy operation. What is the most succinct way to get a percent of total from a groupby operation such as df.groupby('col1').size()? My DF after grouping looks like this and I just want a percent of total. I remember using a variation of this statement in the past but cannot get it to work now: percent = totals.div(totals.sum(1), axis=0)
Original DF:
A B C
0 77 3 98
1 77 52 99
2 77 58 61
3 77 3 93
4 77 31 99
5 77 53 51
6 77 2 9
7 72 25 78
8 34 41 34
9 44 95 27
Result:
df1.groupby('A').size() / df1.groupby('A').size().sum()
A
34 0.1
44 0.1
72 0.1
77 0.7
Here is what I came up with so far, which seems like a pretty reasonable way to do this:
df.groupby('col1').size().apply(lambda x: float(x) / df.groupby('col1').size().sum()*100)
I don't know if I'm missing something, but it looks like you could do something like this:
df.groupby('A').size() * 100 / len(df)
or
df.groupby('A').size() * 100 / df.shape[0]
Getting good performance (3.73s) on DF with shape (3e6,59) by using:
df.groupby('col1').size().apply(lambda x: float(x) / df.groupby('col1').size().sum()*100)
How about:
df = pd.DataFrame({'A': {0: 77, 1: 77, 2: 77, 3: 77, 4: 77, 5: 77, 6: 77, 7: 72, 8: 34, 9: None},
'B': {0: 3, 1: 52, 2: 58, 3: 3, 4: 31, 5: 53, 6: 2, 7: 25, 8: 41, 9: 95},
'C': {0: 98, 1: 99, 2: 61, 3: 93, 4: 99, 5: 51, 6: 9, 7: 78, 8: 34, 9: 27}})
>>> df.groupby('A').size().divide(sum(df['A'].notnull()))
A
34 0.111111
72 0.111111
77 0.777778
dtype: float64
>>> df
A B C
0 77 3 98
1 77 52 99
2 77 58 61
3 77 3 93
4 77 31 99
5 77 53 51
6 77 2 9
7 72 25 78
8 34 41 34
9 NaN 95 27
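A related one-liner (my addition) that yields the same fractions over non-null values is value_counts with normalize=True (multiply by 100 for percentages):
df['A'].value_counts(normalize=True).sort_index()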
