Pandas: Sort before aggregate within a group - python

I have the following Pandas dataframe:
A B C
A A Test1
A A Test2
A A XYZ
A B BA
A B AB
B A AA
I want to group this dataset twice: first by A and B to concatenate the values of C within each group, and afterwards only by A to get the groups defined solely by column A. The intermediate result looks like this:
A A Test1,Test2,XYZ
A B AB, BA
B A AA
And the final result should be:
A A,A:(Test1,Test2,XYZ), A,B:(AB, BA)
B B,A:(AA)
Concatenating itself works; however, the sorting does not seem to work.
Can anyone help me with this problem?
Kind regards.

Using groupby + join
s1=df.groupby(['A','B']).C.apply(','.join)
s1
Out[421]:
A B
A A Test1,Test2,XYZ
B BA,AB
B A AA
Name: C, dtype: object
s1.reset_index().groupby('A').apply(lambda x : x.set_index(['A','B'])['C'].to_dict())
Out[420]:
A
A {('A', 'A'): 'Test1,Test2,XYZ', ('A', 'B'): 'B...
B {('B', 'A'): 'AA'}
dtype: object
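Note that join keeps the original row order within each group (hence "BA,AB" above). If you want C sorted within each group, as the question asks, a small tweak is to sort before grouping:
s1 = df.sort_values(['A', 'B', 'C']).groupby(['A', 'B']).C.apply(','.join)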

First sort_values by all 3 columns, then groupby with join, then join the A and B columns, and finally groupby again to build a dictionary per group:
df1 = df.sort_values(['A','B','C']).groupby(['A','B'])['C'].apply(','.join).reset_index()
#if the DataFrame has only these 3 columns, you can sort by all of them
#df1 = df.sort_values(list(df.columns)).groupby(['A','B'])['C'].apply(','.join).reset_index()
df1['D'] = df1['A'] + ',' + df1['B']
print (df1)
A B C D
0 A A Test1,Test2,XYZ A,A
1 A B AB,BA A,B
2 B A AA B,A
s = df1.groupby('A').apply(lambda x: dict(zip(x['D'], x['C']))).reset_index(name='val')
print (s)
A val
0 A {'A,A': 'Test1,Test2,XYZ', 'A,B': 'AB,BA'}
1 B {'B,A': 'AA'}
If you need tuples, only change the first part of the code:
df1 = df.sort_values(['A','B','C']).groupby(['A','B'])['C'].apply(tuple).reset_index()
df1['D'] = df1['A'] + ',' + df1['B']
print (df1)
A B C D
0 A A (Test1, Test2, XYZ) A,A
1 A B (AB, BA) A,B
2 B A (AA,) B,A
s = df1.groupby('A').apply(lambda x: dict(zip(x['D'], x['C']))).reset_index(name='val')
print (s)
A val
0 A {'A,A': ('Test1', 'Test2', 'XYZ'), 'A,B': ('AB...
1 B {'B,A': ('AA',)}
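If you'd rather finish with one plain dict keyed by A, a possible last step (a sketch, reusing s as built above):
final = s.set_index('A')['val'].to_dict()
#final -> {'A': {'A,A': (...), 'A,B': (...)}, 'B': {'B,A': ('AA',)}}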

Related

Subtract one text column from the other using pandas

I want to remove the text contained in one column from the other column, vectorially; meaning, without using a loop or apply.
I found a solution that no longer works: old solution link.
Input:
pd.DataFrame({'A': ['ABC', 'ABC'], 'B': ['A', 'B']})
A B
0 ABC A
1 ABC B
Desired output:
0 BC
1 AC
Use a list comprehension:
df['C'] = [a.replace(b, '') for a,b in zip(df['A'], df['B'])]
Output:
A B C
0 ABC A BC
1 ABC B AC
If you want a Series:
out = pd.Series([a.replace(b, '') for a,b in zip(df['A'], df['B'])], index=df.index)
Output:
0 BC
1 AC
dtype: object
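If the real data can contain missing values, a.replace would fail on a non-string. A hedged variant of the same comprehension (assuming NaN should pass through unchanged):
df['C'] = [a.replace(b, '') if isinstance(a, str) and isinstance(b, str) else a
           for a, b in zip(df['A'], df['B'])]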

how to groupby and join multiple rows from multiple columns at a time?

I want to know how to group by a single column and join the strings of multiple columns within each group.
Here's an example dataframe:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([['a', 'a', 'b', 'b'], [1, 1, 2, 2],
                            ['k', 'l', 'm', 'n']]).T,
                  columns=['a', 'b', 'c'])
print(df)
a b c
0 a 1 k
1 a 1 l
2 b 2 m
3 b 2 n
I've tried something like,
df.groupby(['b', 'a'])['c'].apply(','.join).reset_index()
b a c
0 1 a k,l
1 2 b m,n
But that is not my required output,
Desired output:
a b c
0 1 a,a k,l
1 2 b,b m,n
How can I achieve this? I need a scalable solution because I'm dealing with millions of rows.
I think you need to group by column b only, and then if necessary pass a list of columns to GroupBy.agg:
df1 = df.groupby('b')[['a','c']].agg(','.join).reset_index()
#alternative if want join all columns without b
#df1 = df.groupby('b').agg(','.join).reset_index()
print (df1)
b a c
0 1 a,a k,l
1 2 b,b m,n
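If some of the joined columns are numeric rather than strings, ','.join raises a TypeError. A sketch of casting first (an assumption about your real data; the np.array construction above already makes everything a string):
df1 = (df.assign(a=df['a'].astype(str), c=df['c'].astype(str))
         .groupby('b')[['a', 'c']].agg(','.join)
         .reset_index())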

enumerate equal elements within dataframe column

I would like to enumerate elements in a column which appear more than once. Elements that appear only once should not be modified.
I have come up with two solutions, but they seem to be very inelegant, and I am hoping that there is a better solution.
Input:
X
0 A
1 B
2 C
3 A
4 C
5 C
6 D
Output:
new_name
X
A A1
A A2
B B
C C1
C C2
C C3
D D
Here are two possible ways of achieving this, one using .expanding().count(), the other using .cumcount(), but both are pretty ugly:
import pandas as pd

def solution_1(df):
    pvt = (df.groupby(by='X')
             .expanding()
             .count()
             .rename(columns={'X': 'Counter'})
             .reset_index()
             .drop('level_1', axis=1)
             .assign(name=lambda s: s['X'] + s['Counter'].astype(int).astype(str))
             .set_index('X'))
    pvt2 = (df.reset_index()
              .groupby(by='X')
              .count()
              .rename(columns={'index': 'C'}))
    df2 = pd.merge(left=pvt, right=pvt2, left_index=True, right_index=True)
    ind = df2['C'] > 1
    df2.loc[ind, 'new_name'] = df2.loc[ind, 'name']
    df2.loc[~ind, 'new_name'] = df2.loc[~ind].index
    df2 = df2.drop(['Counter', 'C', 'name'], axis=1)
    return df2

def solution_2(df):
    pvt = pd.DataFrame(df.groupby(by='X')
                         .agg({'X': 'cumcount'})).rename(columns={'X': 'Counter'})
    pvt2 = pd.DataFrame(df.groupby(by='X')
                          .agg({'X': 'count'})).rename(columns={'X': 'Total Count'})
    # print(pvt2)
    df2 = df.merge(pvt, left_index=True, right_index=True)
    df3 = df2.merge(pvt2, left_on='X', right_index=True)
    ind = df3['Total Count'] > 1
    df3['Counter'] = df3['Counter'] + 1
    df3.loc[ind, 'new_name'] = df3.loc[ind, 'X'] + df3.loc[ind, 'Counter'].astype(int).astype(str)
    df3.loc[~ind, 'new_name'] = df3.loc[~ind, 'X']
    df3 = df3.drop(['Counter', 'Total Count'], axis=1).set_index('X')
    return df3

if __name__ == '__main__':
    s = ['A', 'B', 'C', 'A', 'C', 'C', 'D']
    df = pd.DataFrame(s, columns=['X'])
    print(df)
    sol_1 = solution_1(df)
    print(sol_1)
    sol_2 = solution_2(df)
    print(sol_2)
Any suggestions? Thanks a lot.
First we use GroupBy.cumcount to get a cumulative count for each unique value in X.
Then we add 1 and convert the numeric values to strings with Series.astype.
Finally we concatenate the values to our original column with Series.str.cat:
df['new_name'] = df['X'].str.cat(df.groupby('X').cumcount().add(1).astype(str))
X new_name
0 A A1
1 A A2
2 B B1
3 C C1
4 C C2
5 C C3
6 D D1
If you actually don't want a number on the values which appear only once, we can use (regex=True is needed on newer pandas, where str.replace defaults to literal matching):
import numpy as np

df['new_name'] = np.where(df.groupby('X')['X'].transform('size').eq(1),
                          df['new_name'].str.replace(r'\d+', '', regex=True),
                          df['new_name'])
X new_name
0 A A1
1 A A2
2 B B
3 C C1
4 C C2
5 C C3
6 D D
All in one line:
df['new_name'] = np.where(df.groupby('X')['X'].transform('size').ne(1),
                          df['X'].str.cat(df.groupby('X').cumcount().add(1).astype(str)),
                          df['X'])
IIUC
df.X + (df.groupby('X').cumcount() + 1).mask(df.groupby('X').X.transform('count').eq(1), '').astype(str)
Out[18]:
0 A1
1 B
2 C1
3 A2
4 C2
5 C3
6 D
dtype: object
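The same logic reads a little more clearly split over two steps, and it keeps the original row order just like the one-liner (a sketch):
counts = df.groupby('X').X.transform('count')
suffix = (df.groupby('X').cumcount() + 1).astype(str).where(counts.gt(1), '')
df['new_name'] = df.X + suffix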

Renaming columns on slice of dataframe not performing as expected

I was trying to clean up column names in a dataframe, but only for part of the columns.
Somehow it doesn't work when trying to replace column names on a slice of the dataframe. Why is that?
Let's say we have the following dataframe (note: at the bottom there is copy-pasteable code to reproduce the data):
Value ColAfjkj ColBhuqwa ColCouiqw
0 1 a e i
1 2 b f j
2 3 c g k
3 4 d h l
I want to clean up the column names (expected output):
Value ColA ColB ColC
0 1 a e i
1 2 b f j
2 3 c g k
3 4 d h l
Approach 1:
I can get the clean column names like this:
df.iloc[:, 1:].columns.str[:4]
Index(['ColA', 'ColB', 'ColC'], dtype='object')
Or
Approach 2:
s = df.iloc[:, 1:].columns
[col[:4] for col in s]
['ColA', 'ColB', 'ColC']
But when I try to overwrite the column names, nothing happens:
df.iloc[:, 1:].columns = df.iloc[:, 1:].columns.str[:4]
Value ColAfjkj ColBhuqwa ColCouiqw
0 1 a e i
1 2 b f j
2 3 c g k
3 4 d h l
Same for the second approach:
s = df.iloc[:, 1:].columns
cols = [col[:4] for col in s]
df.iloc[:, 1:].columns = cols
Value ColAfjkj ColBhuqwa ColCouiqw
0 1 a e i
1 2 b f j
2 3 c g k
3 4 d h l
This does work, but you have to manually concatenate the name of the first column, which is not ideal:
df.columns = ['Value'] + df.iloc[:, 1:].columns.str[:4].tolist()
Value ColA ColB ColC
0 1 a e i
1 2 b f j
2 3 c g k
3 4 d h l
Is there an easier way to achieve this? Am I missing something?
Dataframe for reproduction:
df = pd.DataFrame({'Value': [1, 2, 3, 4],
                   'ColAfjkj': ['a', 'b', 'c', 'd'],
                   'ColBhuqwa': ['e', 'f', 'g', 'h'],
                   'ColCouiqw': ['i', 'j', 'k', 'l']})
This is because pandas' Index is immutable. If you check the documentation for the class pandas.Index, you'll see that it is defined as:
Immutable ndarray implementing an ordered, sliceable set
On top of that, df.iloc[:, 1:] returns a new object, so even a successful assignment would only touch that temporary copy, never df itself. So in order to modify the column names you'll have to build a new list of names and assign it to df.columns, for instance with:
df.columns = [df.columns[0]] + list(df.iloc[:, 1:].columns.str[:4])
Another option is to use rename with a dictionary containing the columns to replace:
df.rename(columns=dict(zip(df.columns[1:], df.columns[1:].str[:4])))
To overwrite column names you can use the .rename() method.
So, it will look like:
df.rename(columns={'ColAfjkj': 'ColA',
                   'ColBhuqwa': 'ColB',
                   'ColCouiqw': 'ColC'},
          inplace=True)
More info regarding rename is in the docs: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rename.html
I had this problem as well and came up with this solution:
First, create a mask of the columns you want to rename
mask = df.iloc[:,1:4].columns
Then, use list comprehension and a conditional to rename just the columns you want
df.columns = [x if x not in mask else x[:4] for x in df.columns]
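Another compact option, since DataFrame.rename also accepts a callable as the columns mapper (a sketch; here 'Value' stands in for whatever columns you want left untouched):
df = df.rename(columns=lambda c: c if c == 'Value' else c[:4])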

Preserve the non-numerical columns when doing pandas.DataFrame.groupby().sum()

Can I preserve the non-numerical columns (the 1st appeared value) when doing pandas.DataFrame.groupby().sum() ?
For example, I have a DataFrame like this:
df = pd.DataFrame({'A': ['aa1', 'aa2', 'aa1', 'aa2'],
                   'B': ['bb1', 'bbb1', 'bb2', 'bbb2'],
                   'C': ['cc1', 'ccc2', 'ccc3', 'ccc4'],
                   'D': [1, 2, 3, 4],
                   'E': [1, 2, 3, 4]})
>>> df
A B C D E
0 aa1 bb1 cc1 1 1
1 aa2 bbb1 ccc2 2 2
2 aa1 bb2 ccc3 3 3
3 aa2 bbb2 ccc4 4 4
>>> df.groupby(["A"]).sum()
D E
A
aa1 4 4
aa2 6 6
Following is the result I want to obtain:
B C D E
A
aa1 bb1 cc1 4 4
aa2 bbb1 ccc2 6 6
Notice that the values of columns B and C are the first associated B and C values of each group key.
Just use 'first':
df.groupby(["A"]).agg({'B': 'first',
'C': 'first',
'D': sum,
'E': sum})
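Equivalently, with named aggregation (available since pandas 0.25), a sketch:
out = df.groupby('A').agg(B=('B', 'first'), C=('C', 'first'),
                          D=('D', 'sum'), E=('E', 'sum'))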
For each key in the groupby-sum dataframe, look up the key in the original dataframe and put the associated value of column B into a new column.
#groupby and sum over columns D and E
df_1 = df.groupby(['A']).sum()
Find the first value in column B associated with each groupby key:
col_b = []
#iterate through the keys and find the first value in df['B'] with that key in column A
for i in df_1.index:
    col_b.append(df['B'][df['A'] == i].iloc[0])
#insert the list of values into the new dataframe
df_1.insert(0, 'B', col_b)
>>>df_1
B D E
A
aa1 bb1 4 4
aa2 bbb1 6 6
Note that this answer uses a different example dataframe, with numeric columns C and D and a B column holding 'one'/'two'/'three'. Grouping only on column 'A' gives:
df.groupby(['A']).sum()
C D
A
bar 1.26 0.88
foo 0.92 -4.19
Grouping on column 'A' and 'B' gives:
df.groupby(['A','B']).sum()
C D
A B
bar one 1.38 -0.73
three 0.26 0.80
two -0.38 0.81
foo one 1.96 -2.72
three -0.42 -0.18
two -0.62 -1.29
If you want only the column 'B' that has 'one' you can do:
d = df.groupby(['A','B'], as_index=False).sum()
d[d.B=='one'].set_index('A')
B C D
A
bar one 1.38 -0.73
foo one 1.96 -2.72
I'm not sure I understand but is this what you want to do?
Note: I increased the output precision just to get the same numbers shown in the post.
d = df.groupby('A').sum()
d['B'] = 'one'
d.sort_index(axis=1)
B C D
A
bar one 1.259069 0.876959
foo one 0.921510 -4.193397
If you want to put the first sorted value of column 'B' there instead, you can use (sort_values replaces the long-removed Series.sort):
d['B'] = df.B.sort_values().iloc[0]
So here I replaced 'one', 'two', 'three' with 'a', 'b', 'c' to see if this is what you are trying to do, and used the insert() method as suggested in the other post:
df
A B C D
0 foo a 0.638362 -0.931817
1 bar a 1.380706 -0.733307
2 foo b -0.324514 0.203515
3 bar c 0.258534 0.803298
4 foo b -0.299485 -1.495979
5 bar b -0.380171 0.806968
6 foo a 1.324810 -1.792996
7 foo c -0.417663 -0.176120
d = df.groupby('A').sum()
d.insert(0, 'B', df.B.sort_values().iloc[0])
d
B C D
A
bar a 1.259069 0.876959
foo a 0.921510 -4.193397
All these answers seem pretty wordy; the Pandas doc [1] also doesn't seem clear on this point despite giving an SQL example on the first page:
SELECT Column1, Column2, mean(Column3), sum(Column4)
FROM SomeTable
GROUP BY Column1, Column2
which will choose the "first" of Column1, Column2, while Pandas does not. As the OP points out, the non-numeric columns are simply dropped. My solution is not that pretty either, but perhaps more control is Pandas' design goal:
agg_d = {c: 'sum' if c in ('D', 'E') else 'first' for c in df.columns if c != 'A'}
df = df.groupby('A').agg(agg_d)
This maintains all non-numeric columns, like SQL does. It is basically the same as Phillip's answer above, but the dict comprehension saves spelling out an aggregation for every column by hand.
NOTES
https://pandas.pydata.org/pandas-docs/stable/groupby.html
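Following the same idea, the aggregation dict can also be built from dtypes instead of hard-coded names (a sketch; it assumes every numeric column should be summed and everything else kept via 'first'):
import numpy as np

num_cols = df.select_dtypes(include=np.number).columns
agg_d = {c: 'sum' if c in num_cols else 'first' for c in df.columns if c != 'A'}
out = df.groupby('A').agg(agg_d)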
