Defining an aggregation function with groupby in pandas - python

I would like to collapse my dataset using groupby and agg; however, after collapsing, I want the new column to show a string value only for the grouped rows.
For example, the initial data is:
df = pd.DataFrame([["a",1],["a",3],["b",2]], columns=['category','value'])
  category  value
0        a      1
1        a      3
2        b      2
Desired output:
  category    value
0        a  grouped
1        b        2
How should I modify my code (to show "grouped" instead of 3):
df=df.groupby(['category'], as_index=False).agg({'value':'max'})

You can use a lambda with a ternary (wrap the chain in parentheses so it can span lines; for single-row groups, x.iloc[0] returns the lone value):
(df.groupby("category", as_index=False)
   .agg({"value": lambda x: "grouped" if len(x) > 1 else x.iloc[0]}))
This outputs:
  category    value
0        a  grouped
1        b        2

Another possible solution, using numpy (import numpy as np):
(df.assign(value=np.where(
        df.duplicated(subset=['category'], keep=False), 'grouped', df['value']))
   .drop_duplicates())
Output:
  category    value
0        a  grouped
2        b        2
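
For larger frames, a vectorized variant that avoids a Python-level lambda is also possible; a minimal sketch, assuming the same column names as above:

import pandas as pd

df = pd.DataFrame([["a",1],["a",3],["b",2]], columns=['category','value'])

# Aggregate once per group, then overwrite multi-row groups with 'grouped'
g = df.groupby('category')['value']
out = g.max().mask(g.size() > 1, 'grouped').reset_index()
print(out)
#   category    value
# 0        a  grouped
# 1        b        2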

Related

Creating a new map from existing maps in python

This question might be common but I am new to python and would like to learn more from the community. I have 2 map files which have data mapping like this:
map1 : A --> B
map2 : B --> C,D,E
I want to create a new map file which will be A --> C
What is the most efficient way to achieve this in python? A generic approach would be very helpful, as I need to apply the same logic to different files and different columns.
Example:
Map1:
1,100
2,453
3,200
Map2:
100,25,30,
200,300,,
250,190,20,1
My map3 should be:
1,25
2,0
3,300
As 453 is not present in map2, our map3 contains value 0 for key 2.
First create DataFrames:
df1 = pd.read_csv(Map1, header=None)
df2 = pd.read_csv(Map2, header=None)
Then map the second column using Series.map with a Series created from df2 by setting its first column as the index; finally replace missing values with 0 for unmatched keys:
df1[1] = df1[1].map(df2.set_index(0)[1]).fillna(0, downcast='int')
print (df1)
0 1
0 1 25
1 2 0
2 3 300
EDIT: To map multiple columns, use a left join, remove the all-missing columns with DataFrame.dropna, drop the join keys b and c, and finally replace the remaining missing values:
df1.columns=['a','b']
df2.columns=['c','d','e','f']
df = (df1.merge(df2, how='left', left_on='b', right_on='c')
         .dropna(how='all', axis=1)
         .drop(['b','c'], axis=1)
         .fillna(0)
         .convert_dtypes())
print (df)
   a    d   e
0  1   25  30
1  2    0   0
2  3  300   0
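
Since the question also asks for a generic plain-Python approach, here is a minimal sketch using only the csv module; the file names map1.csv, map2.csv and map3.csv are hypothetical stand-ins for your actual files:

import csv

# Build a B -> C lookup from Map2 (key = first column, C = second column)
with open('map2.csv') as f:
    b_to_c = {row[0]: row[1] for row in csv.reader(f) if row}

# Compose A -> B -> C, writing 0 for keys missing from Map2
with open('map1.csv') as f, open('map3.csv', 'w', newline='') as out:
    writer = csv.writer(out)
    for a, b in csv.reader(f):
        writer.writerow([a, b_to_c.get(b, 0)])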

Drop a group of rows if one column has missing data in a pandas dataframe

I have the following dataframe:
df
  Group  Dist
0     A     5
1     B     2
2     A     3
3     B     1
4     B     0
5     A     5
I am trying to drop all rows that match Group if the Dist column equals zero. This works to delete row 4:
df = df[df.Dist != 0]
however I also want to delete rows 1 and 3 so I am left with:
df
  Group  Dist
0     A     5
2     A     3
5     A     5
Any ideas on how to drop the group based off this condition?
Thanks!
First get all Group values where Dist == 0, and then filter them out by checking the Group column against the inverted mask (~):
df1 = df[~df['Group'].isin(df.loc[df.Dist == 0, 'Group'])]
print (df1)
  Group  Dist
0     A     5
2     A     3
5     A     5
Or you can use GroupBy.transform with all to test whether each group has no 0 values:
df1 = df[(df.Dist != 0).groupby(df['Group']).transform('all')]
EDIT: To remove all groups that contain missing values:
df2 = df[df['Dist'].notna().groupby(df['Group']).transform('all')]
To test for missing values:
print (df[df['Dist'].isna()])
If this returns an empty frame, there are no missing values (neither NaN nor None). It is also possible to check a scalar, e.g. the value in the row with index 10:
print (df.loc[10, 'Dist'])
print (type(df.loc[10, 'Dist']))
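
A small sketch of the idiomatic scalar check with pd.isna, which covers NaN and None alike (assuming the frame really has a row labeled 10):

import pandas as pd

val = df.loc[10, 'Dist']
print(pd.isna(val))  # True if the scalar is NaN or None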
You can use groupby and the method filter:
df.groupby('Group').filter(lambda x: x['Dist'].ne(0).all())
Output:
  Group  Dist
0     A     5
2     A     3
5     A     5
If you want to filter out groups with missing values:
df.groupby('Group').filter(lambda x: x['Dist'].notna().all())
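
For reference, the example frame can be rebuilt and the mask and filter approaches checked against each other; a small sketch:

import pandas as pd

df = pd.DataFrame({'Group': ['A', 'B', 'A', 'B', 'B', 'A'],
                   'Dist': [5, 2, 3, 1, 0, 5]})

# The mask version is vectorized; filter calls a Python function per group,
# so the mask/transform approach usually scales better with many groups.
by_mask = df[(df.Dist != 0).groupby(df['Group']).transform('all')]
by_filter = df.groupby('Group').filter(lambda x: x['Dist'].ne(0).all())
assert by_mask.equals(by_filter)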

how to use a column of df in a pivot table more than once

I have a dataframe like the first table below, but I need to convert it to the pivot layout shown under "Transformed Table". In short, I need to use the item column more than once in the pivot table. I have tried aggfunc, but how can I define it for the items themselves? Could anyone please share a trick for doing that?
index  item  interval  transaction
0      a     x1        1
1      a     x2        2
2      b     x1        2
Transformed Table
      x1           x2
item  count  item  count
a     1      a     2
b     2      b     0
The first step is to obtain the information you want in a natural way ("natural": easy to express in pandas, e.g. using pivot_table() or groupby()). In order to make the full product of interval x item (with 0 for missing pairs), you may use:
df.pivot_table(index='interval', columns='item', values='transaction',
               aggfunc=sum, fill_value=0)
# out:
item      a  b
interval
x1        1  2
x2        2  0
The trick however is how to reshape this into the specific format you asked for. This will involve duplicating the 'item' column or level (something that pandas, understandably, is not particularly fond of). The following is the full operation in one chained sequence:
df2 = (df
       .pivot_table(index='interval', columns='item', values='transaction',
                    aggfunc=sum, fill_value=0)
       .stack().to_frame('count')
       .reset_index('item').set_index('item', append=True, drop=False)
       .unstack('interval').swaplevel(axis=1)
       .sort_index(axis=1, ascending=[True, False])
       .reset_index(drop=True)
)
# df2:
interval   x1          x2
         item count  item count
0           a     1     a     2
1           b     2     b     0
You can comment out from the end to see the various stages. Let's break this down line by line after the pivot_table:
Move item to level-1 multiindex and rename the sum as 'count'
... .stack().to_frame('count')
               count
interval item
x1       a         1
         b         2
x2       a         2
         b         0
Duplicate the item column (in order to unstack later):
... .reset_index('item').set_index('item', append=True, drop=False)
              item  count
interval item
x1       a       a      1
         b       b      2
x2       a       a      2
         b       b      0
Unstack the interval, and swap the levels of the new multiindex columns (note: that's why we needed to duplicate item: otherwise unstack() would operate on a regular index (not MultiIndex), and as such would convert to a Series):
... .unstack('interval').swaplevel(axis=1)
interval   x1   x2    x1    x2
         item item count count
item
a           a    a     1     2
b           b    b     2     0
Finally, sort the columns MultiIndex and drop the (now useless) index:
... .sort_index(axis=1, ascending=[True, False])
... .reset_index(drop=True)
interval   x1          x2
         item count  item count
0           a     1     a     2
1           b     2     b     0
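
For completeness, the sample frame assumed throughout this walkthrough can be rebuilt from the question's table as:

import pandas as pd

df = pd.DataFrame({'item': ['a', 'a', 'b'],
                   'interval': ['x1', 'x2', 'x1'],
                   'transaction': [1, 2, 2]})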

Pandas groupby with new column for each value

I hope the title speaks for itself; I'd just like to add that it can be assumed that each key has the same number of values.
Online searching the title yielded the following solution:
Split pandas dataframe based on groupby
which was supposed to solve my problem, but it does not.
I'll give an example:
Input:
pd.DataFrame(data={'a':['foo','foo','foo','bar','bar','bar'],'b':[1,2,3,4,5,6]})
Output:
pd.DataFrame(data={'a':['foo','bar'],'b':[1,4],'c':[2,5],'d':[3,6]})
Intuitively, it would be a groupby function without an aggregation function, or an aggregation function that makes a list out of the keys.
Obviously, it can be done 'manually' using for loops etc., but using for loops with large data sets is very expensive computationally.
Use GroupBy.cumcount to build a counter Series g, then reshape with DataFrame.set_index plus Series.unstack (or DataFrame.pivot), and finally clean up with DataFrame.add_prefix, DataFrame.rename_axis and DataFrame.reset_index:
g = df1.groupby('a').cumcount()
df = (df1.set_index(['a', g])['b']
         .unstack()
         .add_prefix('new_')
         .reset_index()
         .rename_axis(None, axis=1))
print (df)
     a  new_0  new_1  new_2
0  bar      4      5      6
1  foo      1      2      3
Or:
df1['g'] = df1.groupby('a').cumcount()
df = (df1.pivot(index='a', columns='g', values='b')
         .add_prefix('new_')
         .reset_index()
         .rename_axis(None, axis=1))
print (df)
     a  new_0  new_1  new_2
0  bar      4      5      6
1  foo      1      2      3
Here is an alternative approach, using groupby.apply and string.ascii_lowercase if column names are important:
from string import ascii_lowercase
df = pd.DataFrame(data={'a':['foo','foo','foo','bar','bar','bar'],'b':[1,2,3,4,5,6]})
# Groupby 'a'
g = df.groupby('a')['b'].apply(list)
# Construct new DataFrame from g
new_df = pd.DataFrame(g.values.tolist(), index=g.index).reset_index()
# Fix column names
new_df.columns = list(ascii_lowercase[:new_df.shape[1]])
print(new_df)
     a  b  c  d
0  bar  4  5  6
1  foo  1  2  3

How to count data in a column based on another column separately?

I have two dataframe like this:
df1 = pd.DataFrame({'a':[1,2]})
df2 = pd.DataFrame({'a':[1,1,1,2,2,3,4,5,6,7,8]})
I want to count the two numbers from df1 separately in df2. The correct answer looks like:
No Amount
1 3
2 2
Instead of:
No Amount
1 5
2 5
How can I solve this problem?
First filter df2 for values that are contained in df1['a'], then apply value_counts. The rest of the code just presents the data in your desired format.
result = (
df2[df2['a'].isin(df1['a'].unique())]['a']
.value_counts()
.reset_index()
)
result.columns = ['No', 'Amount']
>>> result
   No  Amount
0   1       3
1   2       2
Since pandas 0.21.0 you can use set_axis to rename columns as a chained method (in pandas 2.0+ the inplace argument was removed, so simply drop inplace=False). Here's a one-line solution:
df2[df2.a.isin(df1.a)]\
.squeeze()\
.value_counts()\
.reset_index()\
.set_axis(['No','Amount'], axis=1, inplace=False)
Output:
   No  Amount
0   1       3
1   2       2
You can simply take the value_counts of the second df and map them onto the first df, i.e.:
df1['Amount'] = df1['a'].map(df2['a'].value_counts())
df1 = df1.rename(columns={'a':'No'})
Output:
   No  Amount
0   1       3
1   2       2
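
One caveat with the map approach: if df1 contained a value that never occurs in df2, map would leave NaN there. A sketch of a guarded variant:

df1['Amount'] = df1['a'].map(df2['a'].value_counts()).fillna(0).astype(int)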
