Pandas groupby with new column for each value - python

I hope the title speaks for itself; I'd just like to add that it can be assumed that each key has the same amount of values.
Online searching the title yielded the following solution:
Split pandas dataframe based on groupby
Which supposed to be solving my problem, although it does not.
I'll give an example:
Input:
pd.DataFrame(data={'a':['foo','foo','foo','bar','bar','bar'],'b':[1,2,3,4,5,6]})
Output:
pd.DataFrame(data={'a':['foo','bar'],'b':[1,4],'c':[2,5],'d':[3,6]})
Intuitively, it would be a groupby function without an aggregation function, or an aggregation function that makes a list out of the keys.
Obviously, it can be done 'manually' using for loops etc., but using for loops with large data sets is very expensive computationally.

Use GroupBy.cumcount for Series or column g, then reshape by DataFrame.set_index + Series.unstack or DataFrame.pivot, last data cleaning by DataFrame.add_prefix, DataFrame.rename_axis with
DataFrame.reset_index:
g = df1.groupby('a').cumcount()
df = (df1.set_index(['a', g])['b']
.unstack()
.add_prefix('new_')
.reset_index()
.rename_axis(None, axis=1))
print (df)
a new_0 new_1 new_2
0 bar 4 5 6
1 foo 1 2 3
Or:
df1['g'] = df1.groupby('a').cumcount()
df = df1.pivot('a','g','b').add_prefix('new_').reset_index().rename_axis(None, axis=1)
print (df)
a new_0 new_1 new_2
0 bar 4 5 6
1 foo 1 2 3

Here is an alternative approach, using groupby.apply and string.ascii_lowercase if column names are important:
from string import ascii_lowercase
df = pd.DataFrame(data={'a':['foo','foo','foo','bar','bar','bar'],'b':[1,2,3,4,5,6]})
# Groupby 'a'
g = df.groupby('a')['b'].apply(list)
# Construct new DataFrame from g
new_df = pd.DataFrame(g.values.tolist(), index=g.index).reset_index()
# Fix column names
new_df.columns = [x for x in ascii_lowercase[:new_df.shape[1]]]
print(new_df)
a b c d
0 bar 4 5 6
1 foo 1 2 3

Related

Defining an aggregation function with groupby in pandas

I would like to collapse my dataset using groupby and agg, however after collapsing, I want the new column to show a string value only for the grouped rows.
For example, the initial data is:
df = pd.DataFrame([["a",1],["a",2],["b",2]], columns=['category','value'])
category value
0 a 1
1 a 3
2 b 2
Desired output:
category value
0 a grouped
1 b 2
How should I modify my code (to show "grouped" instead of 3):
df=df.groupby(['category'], as_index=False).agg({'value':'max'})
You can use a lambda with a ternary:
df.groupby("category", as_index=False)
.agg({"value": lambda x: "grouped" if len(x) > 1 else x})
This outputs:
category value
0 a grouped
1 b 2
Another possible solution:
(df.assign(value = np.where(
df.duplicated(subset=['category'], keep=False), 'grouped', df['value']))
.drop_duplicates())
Output:
category value
0 a grouped
2 b 2

Creating a new map from existing maps in python

This question might be common but I am new to python and would like to learn more from the community. I have 2 map files which have data mapping like this:
map1 : A --> B
map2 : B --> C,D,E
I want to create a new map file which will be A --> C
What is the most efficient way to achieve this in python? A generic approach would be very helpful as I need to apply the same logic on different files and different columns
Example:
Map1:
1,100
2,453
3,200
Map2:
100,25,30,
200,300,,
250,190,20,1
My map3 should be:
1,25
2,0
3,300
As 453 is not present in map2, our map3 contains value 0 for key 2.
First create DataFrames:
df1 = pd.read_csv(Map1, header=None)
df2 = pd.read_csv(Map2, header=None)
And then use Series.map by second column with by Series created by df2 with set index by first column, last replace missing values to 0 for not matched values:
df1[1] = df1[1].map(df2.set_index(0)[1]).fillna(0, downcast='int')
print (df1)
0 1
0 1 25
1 2 0
2 3 300
EDIT: for mapping multiple columns use left join with remove only missing columns by DataFrame.dropna and columns b,c used for join, last replace missing values:
df1.columns=['a','b']
df2.columns=['c','d','e','f']
df = (df1.merge(df2, how='left', left_on='b', right_on='c')
.dropna(how='all', axis=1)
.drop(['b','c'], axis=1)
.fillna(0)
.convert_dtypes())
print (df)
a d e
0 1 25 30
1 2 0 0
2 3 300 0

Groupby on a column and apply function on another column but keep first element of all other columns of dataframe

This is my code:
frst_df = df.drop(columns=["Comment"]).groupby(['source'], as_index=False).agg('first')
cmnt_df = df.groupby(['source'], as_index=False)['Comment'].apply(', '.join)
merge_df = pd.merge(frst_df, cmnt_df , on='source')
I hope it is understandable what I'm trying to do here.
I have a large dataframe where I have a column 'source'. This is the primary column of the dataframe. Now for the column 'Comment', I want to join all comments corresponding to the value of the 'source'. There are approx 50 other columns in the dataframe. I want to pick only the first element from all the values corresponding to the 'source'.
The code I wrote works fine, but the dataframe is huge and it takes lots of time to create two separate dataframes and then merge them. Is there any better way to do this?
You can use GroupBy.agg by dictionary - all columns are aggregate by first only Comment by join:
df = pd.DataFrame({
'Comment':list('abcdef'),
'B':[4,5,4,5,5,4],
'C':[7,8,9,4,2,3],
'D':[1,3,5,7,1,0],
'E':[5,3,6,9,2,4],
'source':list('aaabbc')
})
d = dict.fromkeys(df.columns.difference(['source']), 'first')
d['Comment'] = ', '.join
merge_df = df.groupby('source', as_index=False).agg(d)
print (merge_df)
source B C Comment D E
0 a 4 7 a, b, c 1 5
1 b 5 4 d, e 7 9
2 c 4 3 f 0 4
This is another possible solution.
df['Comment'] = df.groupby('source')['Comment'].transform(lambda x: ','.join(x))
df = df.groupby('source').first()

How to count data in a column based on another column separately?

I have two dataframe like this:
df1 = pd.DataFrame({'a':[1,2]})
df2 = pd.DataFrame({'a':[1,1,1,2,2,3,4,5,6,7,8]})
I want to count the two numbers of df1 separately in df2, the correct answer like:
No Amount
1 3
2 2
Instead of:
No Amount
1 5
2 5
How can I solve this problem?
First filter df2 for values that are contained in df1['a'], then apply value_counts. The rest of the code just presents the data in your desired format.
result = (
df2[df2['a'].isin(df1['a'].unique())]['a']
.value_counts()
.reset_index()
)
result.columns = ['No', 'Amount']
>>> result
No Amount
0 1 3
1 2 2
In pandas 0.21.0 you can use set_axis to rename columns as chained method. Here's a one line solution:
df2[df2.a.isin(df1.a)]\
.squeeze()\
.value_counts()\
.reset_index()\
.set_axis(['No','Amount'], axis=1, inplace=False)
Output:
No Amount
0 1 3
1 2 2
You can simply find value_counts of second df and map that with first df i.e
df1['Amount'] = df1['a'].map(df2['a'].value_counts())
df1 = df1.rename(columns={'a':'No'})
Output :
No Amount
0 1 3
1 2 2

Pandas distinct count as a DataFrame

Suppose I have a Pandas DataFrame called df with columns a and b and what I want is the number of distinct values of b per each a. I would do:
distcounts = df.groupby('a')['b'].nunique()
which gives the desidered result, but it is as Series object rather than another DataFrame. I'd like a DataFrame instead. In regular SQL, I'd do:
SELECT a, COUNT(DISTINCT(b)) FROM df
and haven't been able to emulate this query in Pandas exactly. How to?
I think you need reset_index:
distcounts = df.groupby('a')['b'].nunique().reset_index()
Sample:
df = pd.DataFrame({'a':[7,8,8],
'b':[4,5,6]})
print (df)
a b
0 7 4
1 8 5
2 8 6
distcounts = df.groupby('a')['b'].nunique().reset_index()
print (distcounts)
a b
0 7 1
1 8 2
Another alternative using Groupby.agg instead:
df.groupby('a', as_index=False).agg({'b': 'nunique'})

Categories