Finding groups that meet a condition in pandas groupby - Python

This is my dataframe:
df = pd.DataFrame({'a':list('xxxyyzz'), 'b':[10,20,30,5,3,1,2]})
I group them:
groups = df.groupby('a')
I want to print the groups that have at least one b value above 20. In this case, I want to print x.
This is my desired outcome:
x
a b
0 x 10
1 x 20
2 x 30

Compare values with Series.gt, group by column a passed as a Series (df['a']), and use GroupBy.transform with 'any' to test for at least one True per group:
df1 = df[df['b'].gt(20).groupby(df['a']).transform('any')]
print (df1)
a b
0 x 10
1 x 20
2 x 30
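To see why this works, here is a hedged, step-by-step breakdown of the mask (the intermediate names are illustrative):
import pandas as pd

df = pd.DataFrame({'a':list('xxxyyzz'), 'b':[10,20,30,5,3,1,2]})
above_20 = df['b'].gt(20)                          # per-row test: b > 20
mask = above_20.groupby(df['a']).transform('any')  # broadcast each group's result back to its rows
print (mask.tolist())
[True, True, True, False, False, False, False]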

You could check which values are above 20, group by column a, and transform with any to select only those groups with at least one row satisfying the condition:
df[df.b.gt(20).groupby(df.a).transform('any')]
a b
0 x 10
1 x 20
2 x 30

No need for groupby here; just use isin:
df[df.a.isin(df.loc[df.b>20,'a'])]
Out[996]:
a b
0 x 10
1 x 20
2 x 30
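For readability, the inner selection can be unpacked into two steps (the intermediate name is illustrative): it first picks the group labels that contain at least one qualifying row, then keeps every row of those groups:
hits = df.loc[df['b'] > 20, 'a']  # group labels with a b above 20 -> ['x']
df1 = df[df['a'].isin(hits)]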

Related

Drop rows grouped by string value - pandas

I'm aiming to drop rows in a pandas df where a row is equal to a specific value. However, I want to extend this so it also drops associated rows grouped by a separate column. For instance, I want to drop all rows where Label == A or D, but I also want to drop the associated rows in Num from the same group.
import pandas as pd
df = pd.DataFrame({
'Num' : [1,1,1,2,2,3,3,4,4,4],
'Label' : ['X','X','A','Y','Y','Y','Y','Y','Y','D'],
})
df = df.groupby('Num').filter(lambda x: (x['Label'].isin(['A','D'])).any())
Intended output:
Num Label
3 2 Y
4 2 Y
5 3 Y
6 3 Y
You are close, just add negation:
df.groupby('Num').filter(lambda x: ~x['Label'].isin(['A','D']).any())
Output:
Num Label
3 2 Y
4 2 Y
5 3 Y
6 3 Y
Let us try isin without groupby:
out = df.loc[~df.Num.isin(df.loc[df.Label.isin(['A','D']),'Num'])]
Out[108]:
Num Label
3 2 Y
4 2 Y
5 3 Y
6 3 Y
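The same one-liner, unpacked into two steps for readability (the intermediate name is illustrative):
bad_nums = df.loc[df['Label'].isin(['A','D']), 'Num']  # groups containing A or D -> [1, 4]
out = df[~df['Num'].isin(bad_nums)]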

Pandas Dataframe Advanced Split

I have a big DataFrame that I need to split into two (A and B), with the same number of rows for each value of a certain column in both A and B. That column has over 700 unique values, all of them strings. Here is an example:
DataFrame
Price Type
1 X
2 Y
3 Y
4 X
5 X
6 X
7 Y
8 Y
When splitting it (randomly), I should get two X rows and two Y rows in each of DataFrame A and DataFrame B, like:
A
Price Type
1 X
5 X
2 Y
3 Y
B
Price Type
4 X
6 X
7 Y
8 Y
Thanks in advance!
You can use groupby().cumcount() to enumerate the rows within Type, then %2 to divide rows into two groups:
df['groups'] = df.groupby('Type').cumcount()%2
A,B = df[df['groups']==0], df[df['groups']==1]
Output:
A
Price Type groups
0 1 X 0
1 2 Y 0
4 5 X 0
6 7 Y 0
B
Price Type groups
2 3 Y 1
3 4 X 1
5 6 X 1
7 8 Y 1
Could you group by the value of Type and assign A/B to half of each group as a new column, then copy only the rows with the label A or B assigned? If you need an exact split, you could base it on the size of the group; a sketch follows below.
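A minimal sketch of that idea, assuming an exact half split per Type after a random shuffle (the half column and variable names are illustrative, not from the original post):
import numpy as np
import pandas as pd

df = pd.DataFrame({'Price':[1,2,3,4,5,6,7,8], 'Type':list('XYYXXXYY')})
shuffled = df.sample(frac=1, random_state=0)             # randomize row order
rank = shuffled.groupby('Type').cumcount()               # position within each Type
size = shuffled.groupby('Type')['Type'].transform('size')
shuffled['half'] = np.where(rank < size // 2, 'A', 'B')  # first half -> A, rest -> B
A = shuffled[shuffled['half'] == 'A'].drop(columns='half')
B = shuffled[shuffled['half'] == 'B'].drop(columns='half')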
You can use the array_split feature of the numpy library, like below:
import numpy as np

df_split = np.array_split(df, 2)  # note: splits by row position, not by Type
df1 = df_split[0]
df2 = df_split[1]

Merge And overwrite common columns in two DataFrames

I have a DataFrame A with 80 columns, and I grouped A and summed 20 of the columns.
E.g.
New_df = A.groupby(['X','Y','Z'])[['a','b','c', ...]].sum().reset_index()
Then I want to overwrite the values of the columns that A and New_df have in common with the values from New_df.
You can do:
cols1=set(A.columns.tolist())
cols2=set(New_df.columns.tolist())
common_cols = list(cols1.intersection(cols2))
A[common_cols]=New_df[common_cols]
This finds the columns that the two DataFrames have in common, then replaces those in the first with the columns from the second.
For example, given an initial A:
x y
0 1 a
1 2 b
2 3 c
and New_df:
z y
0 4 d
1 5 e
2 6 f
And we wind up with the final A, with the y column taken from New_df:
x y
0 1 d
1 2 e
2 3 f
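A minimal runnable version of the example above. Note that the assignment aligns on index, so it assumes A and New_df share the same index (here both are 0..2):
import pandas as pd

A = pd.DataFrame({'x':[1,2,3], 'y':['a','b','c']})
New_df = pd.DataFrame({'z':[4,5,6], 'y':['d','e','f']})
common_cols = list(set(A.columns) & set(New_df.columns))  # -> ['y']
A[common_cols] = New_df[common_cols]  # overwrite common columns, aligned by index
print (A)
x y
0 1 d
1 2 e
2 3 f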

How to create pairs of column names based on a condition?

I have the following DataFrame df:
df =
min(arc) max(arc) min(gbm)_p1 max(gbm)_p1
1 10 2 5
0 11 1 6
How can I calculate the difference between pairs of max and min columns?
Expected result:
diff(arc) diff(gbm)_p1
9 3
11 5
I assume that apply(lambda x: ...) should be used to calculate the differences row-wise, but how can I create pairs of columns? In my case, I should only calculate the difference between columns that share the same suffix, e.g. ...(arc) or ...(gbm)_p1. Note that the min and max prefixes always appear at the beginning of the column names.
The idea is to filter the columns twice with DataFrame.filter using a regex where ^ matches the start of the string, then rename the columns so both filtered frames share the same column names and can be subtracted:
df1 = df.filter(regex='^min').rename(columns= lambda x: x.replace('min','diff'))
df2 = df.filter(regex='^max').rename(columns= lambda x: x.replace('max','diff'))
df = df2.sub(df1)
print (df)
diff(arc) diff(gbm)_p1
0 9 3
1 11 5
EDIT:
print (df)
id min(arc) max(arc) min(gbm)_p1 max(gbm)_p1
0 123 1 10 2 5
1 546 0 11 1 6
df1 = df.filter(regex='^min').rename(columns= lambda x: x.replace('min','diff'))
df2 = df.filter(regex='^max').rename(columns= lambda x: x.replace('max','diff'))
df = df[['id']].join(df2.sub(df1))
print (df)
id diff(arc) diff(gbm)_p1
0 123 9 3
1 546 11 5

Is there a way to create multiple dataframes from single dataframe using pandas [duplicate]

I have a DataFrame with a column Sales.
How can I split it into two based on the Sales value?
The first DataFrame will have data with 'Sales' < s and the second with 'Sales' >= s.
You can use boolean indexing:
df = pd.DataFrame({'Sales':[10,20,30,40,50], 'A':[3,4,7,6,1]})
print (df)
A Sales
0 3 10
1 4 20
2 7 30
3 6 40
4 1 50
s = 30
df1 = df[df['Sales'] >= s]
print (df1)
A Sales
2 7 30
3 6 40
4 1 50
df2 = df[df['Sales'] < s]
print (df2)
A Sales
0 3 10
1 4 20
It's also possible to invert the mask with ~:
mask = df['Sales'] >= s
df1 = df[mask]
df2 = df[~mask]
print (df1)
A Sales
2 7 30
3 6 40
4 1 50
print (df2)
A Sales
0 3 10
1 4 20
print (mask)
0 False
1 False
2 True
3 True
4 True
Name: Sales, dtype: bool
print (~mask)
0 True
1 True
2 False
3 False
4 False
Name: Sales, dtype: bool
Using groupby, you could split into two dataframes like this:
In [1047]: df1, df2 = [x for _, x in df.groupby(df['Sales'] < 30)]
In [1048]: df1
Out[1048]:
A Sales
2 7 30
3 6 40
4 1 50
In [1049]: df2
Out[1049]:
A Sales
0 3 10
1 4 20
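One caveat with the unpacking: if every row falls on the same side of the condition, groupby yields a single group and the two-name unpacking raises a ValueError. A defensive variant (illustrative, not part of the original answer):
groups = dict(list(df.groupby(df['Sales'] < 30)))
df1 = groups.get(False, df.iloc[0:0])  # rows with Sales >= 30, or an empty frame
df2 = groups.get(True, df.iloc[0:0])   # rows with Sales < 30, or an empty frame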
Using groupby and a list comprehension:
Store all the split DataFrames in a list variable and access each separated DataFrame by its index.
DF = pd.DataFrame({'chr':["chr3","chr3","chr7","chr6","chr1"],'pos':[10,20,30,40,50],})
ans = [y for x, y in DF.groupby('chr')]
Access the separated DFs like this:
ans[0]
ans[1]
ans[-1]  # this is the last separated DF
Access a column of a separated DF like this:
ans_i_chr = ans[i].chr
One-liner using the walrus operator (Python 3.8+):
df1, df2 = df[(mask:=df['Sales'] >= 30)], df[~mask]
Consider using copy to avoid SettingWithCopyWarning:
df1, df2 = df[(mask:=df['Sales'] >= 30)].copy(), df[~mask].copy()
Alternatively, you can use the method query:
df1, df2 = df.query('Sales >= 30').copy(), df.query('Sales < 30').copy()
I like this for speeding up searches or rolling-average .apply(lambda x: ...) type functions, so I split big files into dictionaries of DataFrames:
df_dict = {sale_v: df[df['Sales'] == sale_v] for sale_v in df.Sales.unique()}
This should do it if you want to split based on categorical groups.
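Usage, assuming the sample frame from earlier in this thread:
df = pd.DataFrame({'Sales':[10,20,30,40,50], 'A':[3,4,7,6,1]})
df_dict = {sale_v: df[df['Sales'] == sale_v] for sale_v in df['Sales'].unique()}
print (df_dict[30])  # only the rows where Sales == 30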
