Using the DataFrame below as an example:
import pandas as pd
df = pd.DataFrame({'col1':[1, 2, 3, 2, 1] , 'col2':['A', 'A', 'B', 'B','C']})
col1 col2
0 1 A
1 2 A
2 3 B
3 2 B
4 1 C
how can I get
col1 col2
0 1 A,C
1 2 A,B
2 3 B
You can groupby on 'col1' and then apply a lambda that joins the values:
In [88]:
df = pd.DataFrame({'col1':[1, 2, 3, 2, 1] , 'col2':['A', 'A', 'B', 'B','C']})
df.groupby('col1')['col2'].apply(lambda x: ','.join(x)).reset_index()
Out[88]:
col1 col2
0 1 A,C
1 2 A,B
2 3 B
Related
is there a way to convert the multiindex columns to normal value columns? I have a multiindexed table like that:
level_0
level_1
Value
0
0
0
A
1
0
1
B
2
1
0
C
3
1
1
D
I want to convert level_0 and level_1 to normal columns:
ID
col0
col1
Value
0
0
0
A
1
0
1
B
2
1
0
C
3
1
1
D
Any suggestion?
Thank you!
You can use reset_index followed by rename.
# Setup
my_index = pd.MultiIndex.from_arrays([(0, 1, 2, 3),
(0, 0, 1, 1),
(0, 1, 0, 1)],
names=[None, 'level_0', 'level_1'])
df = pd.DataFrame({'Value': ['A', 'B', 'C', 'D']}, index=my_index)
>>> # level=['level_0', 'level_1'] works, too
>>> df = df.reset_index(level=[1, 2])
>>> df
level_0 level_1 Value
0 0 0 A
1 0 1 B
2 1 0 C
3 1 1 D
To rename the columns, you can do
>>> df.rename(columns={'level_0': 'col0', 'level_1': 'col1'})
col0 col1 Value
0 0 0 A
1 0 1 B
2 1 0 C
3 1 1 D
I have the following sample input data:
import pandas as pd
df = pd.DataFrame({'col1': ['x', 'y', 'z'], 'col2': [1, 2, 3], 'col3': ['a', 'a', 'b']})
I would like to sort and group by col3 while interleaving the summaries on top of the corresponding group in col1 and get the following output:
col1 col2
0 a 3
1 x 1
2 y 2
3 b 3
4 z 3
I can of course do the part:
df.sort_values(by=['col3']).groupby(by=['col3']).sum()
col2
col3
a 3
b 3
but I am not sure how to interleave the group labels on top of col1.
Use custom function for top1 row for each group:
def f(x):
return pd.DataFrame({'col1': x.name, 'col2': x['col2'].sum()}, index=[0]).append(x)
df = (df.sort_values(by=['col3'])
.groupby(by=['col3'], group_keys=False)
.apply(f)
.drop('col3', 1)
.reset_index(drop=True))
print (df)
col1 col2
0 a 3
1 x 1
2 y 2
3 b 3
4 z 3
More performant solution is use GroupBy.ngroup for indices, aggregate sum amd last join values by concat with only stable sorting by mergesort:
df = df.sort_values(by=['col3'])
df1 = df.groupby(by=['col3'])['col2'].sum().rename_axis('col1').reset_index()
df2 = df.set_index(df.groupby(by=['col3']).ngroup())
df = pd.concat([df1, df2]).sort_index(kind='mergesort', ignore_index=True).drop('col3', 1)
print (df)
col1 col2
0 a 3
1 x 1
2 y 2
3 b 3
4 z 3
What about:
(df.melt(id_vars='col2')
.rename(columns={'value': 'col1'})
.groupby('col1').sum()
.reset_index()
)
output:
col1 col2
0 a 3
1 b 3
2 x 1
3 y 2
4 z 3
def function1(dd:pd.DataFrame):
df.loc[dd.index.min()-0.5,['col1','col2']]=[dd.name,dd.col2.sum()]
df.groupby('col3').apply(function1).pipe(lambda dd:df.sort_index(ignore_index=True)).drop('col3',axis=1)
output
col1 col2
0 a 3
1 x 1
2 y 2
3 b 3
4 z 3
or use pandasql library
def function1(dd:pd.DataFrame):
return dd.sql("select '{}' as col1,{} as col2 union select col1,col2 from self".format(dd.name,dd.col2.sum()))
df.groupby('col3').apply(function1).reset_index(drop=False)
col1 col2
0 a 3
1 x 1
2 y 2
3 b 3
4 z 3
I try to get count of each category in DataFrame as:
data = {'col_1': ['a', 'b', 'c', 'd','c'],'col_2': [3, 2, 1, 0, 4],'col3':[99,88,77,66,55]}
df = pd.DataFrame.from_dict(data)
print(df.groupby(['col_1']).count())
Output:
col_2 col3
col_1
a 1 1
b 1 1
c 2 2
d 1 1
Why there are two columns "col_2" and "col_3" and hot to get only one with name "count" ?
Wished output is :
col_1 count
a 1
b 1
c 2
d 1
You can do:
print(df.groupby(['col_1'],as_index=False).agg(count=('col_2','count')))
OR
print(df.groupby(['col_1'],as_index=False).size().rename(columns={'size':'count'}))
Output:
col_1 count
a 1
b 1
c 2
d 1
Given the following dataframe,
is it possible to calculate the sum of col2 and the sum of col2 + col3,
in a single aggregating function?
import pandas as pd
df = pd.DataFrame({'col1': ['a', 'a', 'b', 'b'], 'col2': [1, 2, 3, 4], 'col3': [10, 20, 30, 40]})
.
col1
col2
col3
0
a
1
10
1
a
2
20
2
b
3
30
3
b
4
40
In R's dplyr I would do it with a single line of summarize,
and I was wondering what might be the equivalent in pandas:
df %>% group_by(col1) %>% summarize(col2_sum = sum(col2), col23_sum = sum(col2 + col3))
Desired result:
.
col1
col2_sum
col23_sum
0
a
3
33
1
b
7
77
Let us try assign the new column first
out = df.assign(col23 = df.col2+df.col3).groupby('col1',as_index=False).sum()
Out[81]:
col1 col2 col3 col23
0 a 3 30 33
1 b 7 70 77
From my understanding the apply is more like the summarize in R
out = df.groupby('col1').\
apply(lambda x : pd.Series({'col2_sum':x['col2'].sum(),
'col23_sum':(x['col2'] + x['col3']).sum()})).\
reset_index()
Out[83]:
col1 col2_sum col23_sum
0 a 3 33
1 b 7 77
You can do it easily with datar:
>>> from datar.all import f, tibble, group_by, summarize, sum
>>> df = tibble(
... col1=['a', 'a', 'b', 'b'],
... col2=[1, 2, 3, 4],
... col3=[10, 20, 30, 40]
... )
>>> df >> group_by(f.col1) >> summarize(
... col2_sum = sum(f.col2),
... col23_sum = sum(f.col2 + f.col3)
... )
col1 col2_sum col23_sum
<object> <int64> <int64>
0 a 3 33
1 b 7 77
I am the author of the datar package.
I have a Pandas dataframe that looks something like:
df = pd.DataFrame({'col1': [1, 2, 3, 4], 'col2': [5, 6, 7, 8]}, index=['A', 'B', 'C', 'D'])
col1 col2
A 1 50
B 2 60
C 3 70
D 4 80
However, I want to automatically rearrange it so that it looks like:
col1 A col1 B col1 C col1 D col2 A col2 B col2 C col2 D
0 1 2 3 4 50 60 70 80
I want to combine the row name with the column name
I want to end up with only one row
df2 = df.unstack()
df2.index = [' '.join(x) for x in df2.index.values]
df2 = pd.DataFrame(df2).T
df2
col1 A col1 B col1 C col1 D col2 A col2 B col2 C col2 D
0 1 2 3 4 5 6 7 8
If you want to have the orignal x axis labels in front of the column names ("A col1"...) just change .join(x) by .join(x[::-1]):
df2 = df.unstack()
df2.index = [' '.join(x[::-1]) for x in df2.index.values]
df2 = pd.DataFrame(df2).T
df2
A col1 B col1 C col1 D col1 A col2 B col2 C col2 D col2
0 1 2 3 4 5 6 7 8
Here's one way to do it, there could be a simpler way
In [562]: df = pd.DataFrame({'col1': [1, 2, 3, 4], 'col2': [50, 60, 70, 80]},
index=['A', 'B', 'C', 'D'])
In [563]: pd.DataFrame([df.values.T.ravel()],
columns=[y+x for y in df.columns for x in df.index])
Out[563]:
col1A col1B col1C col1D col2A col2B col2C col2D
0 1 2 3 4 50 60 70 80