Suppose I have a dataframe that looks like this:
group level
0 1 10
1 1 10
2 1 11
3 2 5
4 2 5
5 3 9
6 3 9
7 3 9
8 3 8
The desired output is this:
group level
0 1 10
5 3 9
The logic is this: look inside each group; if more than one distinct value is present in the level column, return the first row of that group. For example, no row from group 2 is selected, because the only value present in its level column is 5.
In addition, how does the situation change if I want the last, instead of the first row of such groups?
What I have attempted is combining groupby statements with building sets from the entries in the level column, but I failed to produce anything even nearly sensible.
This can be done with groupby, using apply to run a simple function on each group:
def get_first_val(group):
    has_multiple_vals = len(group['level'].unique()) >= 2
    if has_multiple_vals:
        return group['level'].loc[group['level'].first_valid_index()]
    else:
        return None

df.groupby('group').apply(get_first_val).dropna()
Out[8]:
group
1 10
3 9
dtype: float64
There's also a last_valid_index() method, so you wouldn't have to
make any huge changes to get the last row instead.
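For instance, a minimal sketch of that variant (my addition; it mirrors get_first_val above with last_valid_index() swapped in):
def get_last_val(group):
    has_multiple_vals = len(group['level'].unique()) >= 2
    if has_multiple_vals:
        # the last non-NaN index of 'level' is the group's last row here
        return group['level'].loc[group['level'].last_valid_index()]
    else:
        return None

df.groupby('group').apply(get_last_val).dropna()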
If you have other columns that you want to keep, you just need a slight tweak:
import numpy as np
df['col1'] = np.random.randint(10, 20, 9)
df['col2'] = np.random.randint(20, 30, 9)
df
Out[17]:
group level col1 col2
0 1 10 19 21
1 1 10 18 24
2 1 11 14 23
3 2 5 14 26
4 2 5 10 22
5 3 9 13 27
6 3 9 16 20
7 3 9 18 26
8 3 8 11 21
def get_first_val_keep_cols(group):
    has_multiple_vals = len(group['level'].unique()) >= 2
    if has_multiple_vals:
        return group.loc[group['level'].first_valid_index(), :]
    else:
        return None

df.groupby('group').apply(get_first_val_keep_cols).dropna()
Out[20]:
group level col1 col2
group
1 1 10 19 21
3 3 9 13 27
This would be simpler:
In [121]:
print(df.groupby('group')
        .agg(lambda x: x.values[0] if (x.values != x.values[0]).any() else np.nan)
        .dropna())
level
group
1 10
3 9
For each group, if any of the values differ from the first value, aggregate that group to its first value; otherwise, aggregate it to nan. Finally, dropna() removes the all-nan groups.
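As an aside (my addition, not from either answer), a possible alternative is to filter the rows with transform('nunique') and then take the first row per group, which also keeps any extra columns:
mask = df.groupby('group')['level'].transform('nunique') > 1
df[mask].groupby('group').first()   # or .last() for the last row of each group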
I have a dataset with several columns of cumulative sums. For every row, I want to return the first column number that satisfies a condition.
Toy example:
df = pd.DataFrame(np.array(range(20)).reshape(4,5).T).cumsum(axis=1)
>>> df
0 1 2 3
0 0 5 15 30
1 1 7 18 34
2 2 9 21 38
3 3 11 24 42
4 4 13 27 46
I want to return the first column whose value is greater than 20, for instance.
Desired output:
3
3
2
2
2
Many thanks as always!
Try with idxmax
df.gt(20).idxmax(1)
Out[66]:
0 3
1 3
2 2
3 2
4 2
dtype: object
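One caveat (my note, not part of the original answer): for a row where no value exceeds 20, idxmax still returns the first column, because the argmax of an all-False row is 0. If such rows can occur, one way to mask them out is:
gt = df.gt(20)
gt.idxmax(1).where(gt.any(1))   # NaN for rows with no value above 20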
Not as short as @YOBEN_S's answer, but chaining index.get_loc and first_valid_index also works:
df[df>20].apply(lambda x: x.index.get_loc(x.first_valid_index()), axis=1)
0 3
1 3
2 2
3 2
4 2
dtype: int64
Suppose that I have two dataframes A and B indexed from 0 to 10. I remove a couple of duplicate rows from A so that the indexes 7 and 9 are removed. So now A.index will be [0,1,2,3,4,5,6,8,10].
Now I want to retain exactly the rows having these same indexes in B. As of now its index set is from 0 to 10. In other words, given the exact same index initially, and having dropped a few indexes from A, how do I retain the subset of B rows that exactly correspond to retained rows of A in terms of their index?
I believe you can select by loc:
A = pd.DataFrame({'col':[5,8,4,0,6,2,1,8,3,4,9]})
B = pd.DataFrame({'col':np.arange(10, 21)})
#print (A)
#print (B)
A1 = A.drop_duplicates('col')
print (A1)
col
0 5
1 8
2 4
3 0
4 6
5 2
6 1
8 3
10 9
B1 = B.loc[A1.index]
print (B1)
col
0 10
1 11
2 12
3 13
4 14
5 15
6 16
8 18
10 20
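One caveat worth adding (my note): B.loc raises a KeyError if any of A1's labels are missing from B. If that can happen, intersecting the indexes first keeps only the shared labels:
B1 = B.loc[B.index.intersection(A1.index)]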
I have a dataframe where the left column is the left-most location of an object and the right column is its right-most location. I need to group the objects if they overlap, or if they overlap objects that themselves overlap (transitively).
So, for example, if this is my dataframe:
left right
0 0 4
1 5 8
2 10 13
3 3 7
4 12 19
5 18 23
6 31 35
So rows 0 and 3 overlap, and thus should be in the same group; row 1 also overlaps row 3, so it joins that group.
So, for this example the output should be something like that:
left right group
0 0 4 0
1 5 8 0
2 10 13 1
3 3 7 0
4 12 19 1
5 18 23 1
6 31 35 2
I have thought of various directions, but couldn't figure it out (without an ugly for loop).
Any help will be appreciated!
I found the accepted solution (update: now deleted) to be misleading, because it fails to generalize to similar cases, e.g. the following:
df = pd.DataFrame({'left': [0, 5, 10, 3, 12, 13, 18, 31],
                   'right': [4, 8, 13, 7, 19, 16, 23, 35]})
df
The suggested aggregate function puts 18-23 in its own group, when it should be in group 1, along with 12-19.
One solution uses the following approach, based on a method for combining intervals posted by @CentAu:
# Union intervals by @CentAu
from sympy import Interval, Union

def union(data):
    """Union of a list of intervals, e.g. [(1, 2), (3, 4)]"""
    intervals = [Interval(begin, end) for (begin, end) in data]
    u = Union(*intervals)
    return [u] if isinstance(u, Interval) else list(u.args)

# Create a list of intervals
df['left_right'] = df[['left', 'right']].apply(list, axis=1)
intervals = union(df.left_right)

# Add a group column: the index of the merged interval containing 'left'
df['group'] = df['left'].apply(lambda x: [g for g, i in enumerate(intervals) if i.contains(x)][0])
...which outputs the desired group assignments.
You can also try this: use a rolling max and a rolling min to find where consecutive ranges intersect:
df = df.sort_values(['left', 'right'])
df['Group'] = ((df.right.rolling(window=2, min_periods=1).min()
                - df.left.rolling(window=2, min_periods=1).max()) < 0).cumsum()
df.sort_index()
Out[331]:
left right Group
0 0 4 0
1 5 8 0
2 10 13 1
3 3 7 0
4 12 19 1
5 18 23 1
6 31 35 2
For example, take (1, 3) and (2, 4). To find the intersection: min(3, 4) - max(1, 2) = 1; since 1 is greater than 0, the two intervals intersect.
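The same check written out in plain Python, just to illustrate the arithmetic:
a, b = (1, 3), (2, 4)
overlap = min(a[1], b[1]) - max(a[0], b[0])   # min(3, 4) - max(1, 2) = 1
print(overlap > 0)                            # True: the intervals intersect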
You can sort the samples and utilize the cumulative functions cummax and cumsum. Let's take a slightly extended version of your example:
left right
0 0 4
3 3 7
1 5 8
2 10 13
4 12 19
5 13 16
6 18 23
7 31 35
First you need to sort values so that longer ranges come first:
df = df.sort_values(['left', 'right'], ascending=[True, False])
Result:
left right
0 0 4
3 3 7
1 5 8
2 10 13
4 12 19
5 13 16
6 18 23
7 31 35
Then you can find the overlapping groups by comparing each 'left' with the running maximum of the previous 'right' values:
df['group'] = (df['right'].cummax().shift() <= df['left']).cumsum()
df.sort_index(inplace=True)
Result:
left right group
0 0 4 0
1 5 8 0
2 10 13 1
3 3 7 0
4 12 19 1
5 13 16 1
6 18 23 1
7 31 35 2
In one line (a sketch of how the two steps might be chained with .pipe; the assignment re-aligns the result on the original index):
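df['group'] = (df.sort_values(['left', 'right'], ascending=[True, False])
                 .pipe(lambda s: (s['right'].cummax().shift() <= s['left']).cumsum()))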
If I had a simple pandas DataFrame like this:
frame = pd.DataFrame(np.arange(1, 13).reshape((3, 4)), columns=list('abcd'), index=list('123'))
I want to find the max value of each row, use it to look up the value in the same column but the next row down, and add that value to a new column.
So the above DataFrame looks like this (with d2 changed to 3):
a b c d
1 1 2 3 4
2 5 6 7 3
3 9 10 11 12
So, conceptually, the first row is scanned, 4 is identified as the largest number, and then 3 is found as the number in the same column but at the next index. Similarly for row 2: 7 is the largest number, and 11 is the next number down in that column. So 3 and 11 should get added to a new column like this:
a b c d Next
1 1 2 3 4 NaN
2 5 6 7 3 3
3 9 10 11 12 11
I started by making a function like this, but it only finds the max values.
f = lambda x: x.max()
max = frame.apply(f, axis='columns')
frame['Next'] = max
Based on your edit, you can use np.argmax:
i = np.arange(len(df))                    # row positions
j = np.argmax(df.values, axis=1)          # column position of each row's max
df['next'] = df.shift(-1).values[i, j]    # value one row below each row's max
a b c d next
1 1 2 3 4 3.0
2 5 6 7 3 11.0
3 9 10 11 12 NaN
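For comparison, a rough label-based equivalent (my sketch, not from the answer; it assumes the 'next' column has not been added yet):
max_col = df[['a', 'b', 'c', 'd']].idxmax(axis=1)               # label of each row's max column
df['next'] = [df[c].shift(-1)[r] for r, c in max_col.items()]   # value one row below each max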
I suppose this is something rather simple, but I can't find out how to do it. I've been searching tutorials and Stack Overflow.
Suppose I have a dataframe df looking like this:
Group Id_In_Group SomeQuantity
1 1 10
1 2 20
2 1 7
3 1 16
3 2 22
3 3 5
3 4 12
3 5 28
4 1 1
4 2 18
4 3 14
4 4 7
5 1 36
I would like to select only the rows of groups having at least 4 objects (so there are at least 4 rows with the same "group" number) and for which the 4th-smallest SomeQuantity in the group (i.e. the 4th value when sorted by ascending SomeQuantity) is at least 20, for example.
In the given dataframe, for example, this would return only the 3rd group, since it has 5 (>= 4) members and its 4th-smallest SomeQuantity is 22 (>= 20), so it should construct the dataframe:
Group Id_In_Group SomeQuantity
3 1 16
3 2 22
3 3 5
3 4 12
3 5 28
(sorted by SomeQuantity or not, either is fine).
Could somebody be kind enough to help me? :)
I would use .groupby() + .filter() methods:
In [66]: df.groupby('Group').filter(lambda x: len(x) >= 4 and x['SomeQuantity'].sort_values().iloc[3] >= 20)
Out[66]:
Group Id_In_Group SomeQuantity
3 3 1 16
4 3 2 22
5 3 3 5
6 3 4 12
7 3 5 28
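One detail worth noting (my addition): the len(x) >= 4 check must come first, because Python's and operator short-circuits, so iloc[3] is never evaluated for groups with fewer than four rows (where it would raise an IndexError). The same filter written as a named function, as a sketch:
def qualifies(grp, n=4, threshold=20):
    if len(grp) < n:   # guard first: iloc[n - 1] would fail on smaller groups
        return False
    return grp['SomeQuantity'].sort_values().iloc[n - 1] >= threshold

df.groupby('Group').filter(qualifies)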
A slightly different approach, using map, value_counts, groupby, and filter:
(df[df.Group.map(df.Group.value_counts().ge(4))]
.groupby('Group')
.filter(lambda x: np.any(x['SomeQuantity'].sort_values().iloc[3] >= 20)))
Breakdown of steps:
Perform value_counts to compute the counts of the distinct elements present in the Group column.
>>> df.Group.value_counts()
3 5
4 4
1 2
5 1
2 1
Name: Group, dtype: int64
Use map which functions like a dictionary (wherein the index becomes the keys and the series elements become the values) to map these results back to the original DF
>>> df.Group.map(df.Group.value_counts())
0 2
1 2
2 1
3 5
4 5
5 5
6 5
7 5
8 4
9 4
10 4
11 4
12 1
Name: Group, dtype: int64
Then we check for the elements having a value of 4 or more, which is our threshold, and take only that subset of the entire DF.
>>> df[df.Group.map(df.Group.value_counts().ge(4))]
Group Id_In_Group SomeQuantity
3 3 1 16
4 3 2 22
5 3 3 5
6 3 4 12
7 3 5 28
8 4 1 1
9 4 2 18
10 4 3 14
11 4 4 7
In order to use the groupby.filter operation on this, we must make sure that we return a single boolean value for each grouped key when we perform the sorting and compare the fourth element against the threshold of 20.
np.any collapses the comparison into such a single boolean for our filter.
>>> df[df.Group.map(df.Group.value_counts().ge(4))] \
.groupby('Group').apply(lambda x: x['SomeQuantity'].sort_values().iloc[3])
Group
3 22
4 18
dtype: int64
From these, we take the fourth element with .iloc[3] (0-based indexing), compare it against the threshold, and keep the favourable matches.
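For illustration, the comparison that produces that boolean per group looks like this:
>>> df[df.Group.map(df.Group.value_counts().ge(4))] \
       .groupby('Group').apply(lambda x: x['SomeQuantity'].sort_values().iloc[3] >= 20)
Group
3     True
4    False
dtype: bool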
This is how I have worked through your question, warts and all. I'm sure there are much nicer ways to do this.
Find groups with "4 objects in the group"
import collections
groups = [k for k, v in collections.Counter(df.Group).items() if v > 3]; groups
Out: [3, 4]
Use these groups to filter to a new df containing these groups:
df2 = df[df.Group.isin(groups)]
"4th SomeQuantity (after sorting) is 22 (>=20)"
df3 = df2.sort_values(by='SomeQuantity', ascending=False)
(Updated as per comment below...)
df3.groupby('Group').filter(lambda grp: grp['SomeQuantity'].sort_values().iloc[3] >= 20).sort_index()
Group Id_In_Group SomeQuantity
3 3 1 16
4 3 2 22
5 3 3 5
6 3 4 12
7 3 5 28