Pandas: Groupby.transform -> assign specific values to column - python

In general terms, is there a way to assign specific values to a column via groupby.transform(), where the groupby size is known in advance?
For example:
import pandas as pd

df = pd.DataFrame(data = {'A':[10,10,20,20],'B':['abc','def','ghi','jkl'],'GroupID':[1,1,2,2]})
funcDict = {'A':'sum','B':['specific_val_1', 'specific_val_2']}
df = df.groupby('GroupID').transform(funcDict)
where the result would be:
index  A   B
1      20  specific_val_1
2      20  specific_val_2
3      40  specific_val_1
4      40  specific_val_2

transform cannot accept a dict, so we can do agg with a cross merge instead:
out = df.groupby('GroupID',as_index=False)[['A']].sum()
out = out.merge(pd.DataFrame({'B':['specific_val_1', 'specific_val_2']}),how='cross')
Out[90]:
GroupID A B
0 1 20 specific_val_1
1 1 20 specific_val_2
2 2 40 specific_val_1
3 2 40 specific_val_2
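If you would rather stay close to the transform idea, here is a minimal sketch that broadcasts the per-group sum with transform('sum') and fills B by tiling the fixed labels with np.tile. It assumes each group has exactly two rows and the rows are ordered by GroupID:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [10, 10, 20, 20],
                   'B': ['abc', 'def', 'ghi', 'jkl'],
                   'GroupID': [1, 1, 2, 2]})

out = df.assign(
    # broadcast each group's sum back to that group's rows
    A=df.groupby('GroupID')['A'].transform('sum'),
    # repeat the fixed labels once per group; assumes group size 2
    # and rows sorted by GroupID
    B=np.tile(['specific_val_1', 'specific_val_2'], df['GroupID'].nunique()),
)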

Get last value from pandas grouped dataframe, summed by another column

I have the following dataframe
import pandas as pd

x = pd.DataFrame(
    {
        'FirstGroupCriterium': [1, 1, 2, 2, 3],
        'SortingCriteria': [1, 1, 1, 2, 1],
        'Value': [10, 20, 30, 40, 50]
    }
)
x.sort_values('SortingCriteria').groupby('FirstGroupCriterium').agg(last_value=('Value', 'last'))
The latter outputs:
                     last_value
FirstGroupCriterium
1                            20
2                            40
3                            50
What I would like to have, is to sum up the last value, based on the last SortingCriteria. So in this case:
                     last_value
FirstGroupCriterium
1                    10+20 = 30
2                            40
3                            50
My initial idea was to call a custom aggregator function that groups the data yet again, but that fails.
def last_value(group):
    return group.groupby('SortingCriteria')['Value'].sum().tail(1)
Do you have any idea how to get this to work? Thank you!
Sort by both columns first, then keep the rows whose SortingCriteria equals the last one per FirstGroupCriterium via GroupBy.transform, and aggregate sum:
df = x.sort_values(['FirstGroupCriterium','SortingCriteria'])
df1 = df[df['SortingCriteria'].eq(df.groupby('FirstGroupCriterium')['SortingCriteria'].transform('last'))]
print (df1)
FirstGroupCriterium SortingCriteria Value
0 1 1 10
1 1 1 20
3 2 2 40
4 3 1 50
df2 = df1.groupby(['FirstGroupCriterium'],as_index=False)['Value'].sum()
print (df2)
FirstGroupCriterium Value
0 1 30
1 2 40
2 3 50
Another idea is to aggregate sum by both columns and then remove duplicates, keeping the last row per FirstGroupCriterium, with DataFrame.drop_duplicates:
df2 = (df.groupby(['FirstGroupCriterium','SortingCriteria'],as_index=False)['Value'].sum()
.drop_duplicates(['FirstGroupCriterium'], keep='last'))
print (df2)
FirstGroupCriterium SortingCriteria Value
0 1 1 30
2 2 2 40
3 3 1 50
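For completeness, the original apply idea can also be made to work; a sketch assuming the sum for the highest SortingCriteria is wanted (this re-groups inside every group, so it is slower than the vectorized approaches above):
def last_value(group):
    # sum per SortingCriteria (index sorted ascending), then take the
    # sum belonging to the highest SortingCriteria
    return group.groupby('SortingCriteria')['Value'].sum().iloc[-1]

out = x.groupby('FirstGroupCriterium').apply(last_value)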

Pandas: return the occurrences of the most frequent value for each group (possibly without apply)

Let's assume the input dataset:
import pandas as pd

test1 = [[0,7,50], [0,3,51], [0,3,45], [1,5,50], [1,0,50], [2,6,50]]
df_test = pd.DataFrame(test1, columns=['A','B','C'])
that corresponds to:
A B C
0 0 7 50
1 0 3 51
2 0 3 45
3 1 5 50
4 1 0 50
5 2 6 50
I would like to obtain a dataset grouped by 'A', together with the most common value of 'B' in each group and the number of occurrences of that value:
A most_freq freq
0 3 2
1 5 1
2 6 1
I can obtain the first 2 columns with:
grouped = df_test.groupby("A")
out_df = pd.DataFrame(index=grouped.groups.keys())
out_df['most_freq'] = df_test.groupby('A')['B'].apply(lambda x: x.value_counts().idxmax())
but I am having problems with the last column.
Also: is there a faster way that doesn't involve apply? This solution doesn't scale well with larger inputs (I also tried dask).
Thanks a lot!
Use SeriesGroupBy.value_counts, which sorts by count in descending order by default, then keep the top value per group with DataFrame.drop_duplicates after Series.reset_index:
df = (df_test.groupby('A')['B']
.value_counts()
.rename_axis(['A','most_freq'])
.reset_index(name='freq')
.drop_duplicates('A'))
print (df)
A most_freq freq
0 0 3 2
2 1 0 1
4 2 6 1
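An alternative sketch without value_counts: count the (A, B) pairs with GroupBy.size and keep the largest count per group. Ties between equally frequent values may resolve differently here, and the column renames are mine:
df = (df_test.groupby(['A', 'B'], as_index=False).size()  # count each (A, B) pair
      .sort_values('size', ascending=False)               # most frequent pairs first
      .drop_duplicates('A')                               # keep the top pair per group
      .sort_values('A', ignore_index=True)
      .rename(columns={'B': 'most_freq', 'size': 'freq'}))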

How to create pairs of column names based on a condition?

I have the following DataFrame df:
df =
   min(arc)  max(arc)  min(gbm)_p1  max(gbm)_p1
0         1        10            2            5
1         0        11            1            6
How can I calculate the difference between pairs of max and min columns?
Expected result:
   diff(arc)  diff(gbm)_p1
0          9             3
1         11             5
I assume that apply(lambda x: ...) should be used to calculate the differences row-wise, but how can I create pairs of columns? In my case, I should only calculate the difference between columns that have the same name, e.g. ...(arc) or ...(gbm)_p1. Please notice that min and max prefixes always appear at the beginning of the column names.
The idea is to select each set of columns with DataFrame.filter using a regex (^ matches the start of the string) and rename them, so both frames share the same column names and can be subtracted directly:
df1 = df.filter(regex='^min').rename(columns= lambda x: x.replace('min','diff'))
df2 = df.filter(regex='^max').rename(columns= lambda x: x.replace('max','diff'))
df = df2.sub(df1)
print (df)
diff(arc) diff(gbm)_p1
0 9 3
1 11 5
EDIT:
print (df)
id min(arc) max(arc) min(gbm)_p1 max(gbm)_p1
0 123 1 10 2 5
1 546 0 11 1 6
df1 = df.filter(regex='^min').rename(columns= lambda x: x.replace('min','diff'))
df2 = df.filter(regex='^max').rename(columns= lambda x: x.replace('max','diff'))
df = df[['id']].join(df2.sub(df1))
print (df)
id diff(arc) diff(gbm)_p1
0 123 9 3
1 546 11 5
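If the min/max prefixes are guaranteed, the column pairing can also be derived programmatically instead of hard-coding the two filter calls; a small sketch (suffixes and out are my names):
# collect each column suffix that appears behind a 'min' prefix
suffixes = [c[len('min'):] for c in df.columns if c.startswith('min')]
# subtract the matching min column from each max column
out = pd.DataFrame({f'diff{s}': df[f'max{s}'] - df[f'min{s}'] for s in suffixes})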

Is there a way to create multiple dataframes from single dataframe using pandas [duplicate]

I have DataFrame with column Sales.
How can I split it into 2 based on Sales value?
First DataFrame will have data with 'Sales' < s and second with 'Sales' >= s
You can use boolean indexing:
import pandas as pd

df = pd.DataFrame({'A':[3,4,7,6,1], 'Sales':[10,20,30,40,50]})
print (df)
A Sales
0 3 10
1 4 20
2 7 30
3 6 40
4 1 50
s = 30
df1 = df[df['Sales'] >= s]
print (df1)
A Sales
2 7 30
3 6 40
4 1 50
df2 = df[df['Sales'] < s]
print (df2)
A Sales
0 3 10
1 4 20
It's also possible to invert the mask with ~:
mask = df['Sales'] >= s
df1 = df[mask]
df2 = df[~mask]
print (df1)
A Sales
2 7 30
3 6 40
4 1 50
print (df2)
A Sales
0 3 10
1 4 20
print (mask)
0 False
1 False
2 True
3 True
4 True
Name: Sales, dtype: bool
print (~mask)
0 True
1 True
2 False
3 False
4 False
Name: Sales, dtype: bool
Using groupby you could split into two dataframes like
In [1047]: df1, df2 = [x for _, x in df.groupby(df['Sales'] < 30)]
In [1048]: df1
Out[1048]:
A Sales
2 7 30
3 6 40
4 1 50
In [1049]: df2
Out[1049]:
A Sales
0 3 10
1 4 20
Using groupby and a list comprehension:
Store all the split DataFrames in a list, then access each separated DataFrame by its index.
DF = pd.DataFrame({'chr': ["chr3","chr3","chr7","chr6","chr1"], 'pos': [10,20,30,40,50]})
ans = [y for x, y in DF.groupby('chr')]
Access the separated DataFrames like this:
ans[0]
ans[1]
ans[len(ans)-1]  # the last separated DataFrame (equivalently ans[-1])
Access a column of a separated DataFrame like this:
ans_i_chr = ans[i].chr
One-liner using the walrus operator (Python 3.8):
df1, df2 = df[(mask:=df['Sales'] >= 30)], df[~mask]
Consider using copy to avoid SettingWithCopyWarning:
df1, df2 = df[(mask:=df['Sales'] >= 30)].copy(), df[~mask].copy()
Alternatively, you can use the method query:
df1, df2 = df.query('Sales >= 30').copy(), df.query('Sales < 30').copy()
I like to use this approach to speed up searches or rolling-average calculations with .apply(lambda x: ...)-style functions, so I split big files into dictionaries of DataFrames:
df_dict = {sale_v: df[df['Sales'] == sale_v] for sale_v in df.Sales.unique()}
This should do it if you want to split based on categorical groups.
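The same dictionary can be built in a single pass with groupby, which avoids re-filtering the frame once per unique value; a sketch:
# iterating a GroupBy yields (key, sub-DataFrame) pairs, so a dict
# comprehension splits the frame in one scan
df_dict = {sale_v: grp for sale_v, grp in df.groupby('Sales')}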

How to get dataframe of unique ids

I'm trying to group the following dataframe by unique binId and then, within each group, pick the row with the highest value of 'z'. Here is my dataframe:
import pandas as pd
df = pd.DataFrame({'ID':['1','2','3','4','5','6'], 'binId': ['1','2','2','1','1','3'], 'x':[1,4,5,6,3,4], 'y':[11,24,35,16,23,34],'z':[1,4,5,2,3,4]})
I tried the following code, which gives the required answer:
def f(x):
    tp = df[df['binId'] == x][['binId','ID','x','y','z']].sort_values(by='z', ascending=False).iloc[0]
    return tp
and then,
binids= pd.Series(df.binId.unique())
print(binids.apply(f))
The output is,
binId ID x y z
0 1 5 3 23 3
1 2 3 5 35 5
2 3 6 4 34 4
But the execution is too slow. What is the faster way of doing this?
Use idxmax to get the index of the max 'z' per group and select the rows with loc:
df1 = df.loc[df.groupby('binId')['z'].idxmax()]
Or, faster, use sort_values with drop_duplicates:
df1 = df.sort_values(['binId', 'z']).drop_duplicates('binId', keep='last')
print (df1)
ID binId x y z
4 5 1 3 23 3
2 3 2 5 35 5
5 6 3 4 34 4
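A quick sanity check (my addition, not part of the original answer) confirming that both approaches select the same rows, up to row order:
a = df.loc[df.groupby('binId')['z'].idxmax()].sort_index()
b = (df.sort_values(['binId', 'z'])
       .drop_duplicates('binId', keep='last')
       .sort_index())
assert a.equals(b)  # both keep the max-z row per binId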
