Can pandas perform an aggregating operation involving two columns?

Given the following dataframe,
is it possible to calculate the sum of col2 and the sum of col2 + col3,
in a single aggregating function?
import pandas as pd
df = pd.DataFrame({'col1': ['a', 'a', 'b', 'b'], 'col2': [1, 2, 3, 4], 'col3': [10, 20, 30, 40]})
  col1  col2  col3
0    a     1    10
1    a     2    20
2    b     3    30
3    b     4    40
In R's dplyr I would do it with a single line of summarize,
and I was wondering what might be the equivalent in pandas:
df %>% group_by(col1) %>% summarize(col2_sum = sum(col2), col23_sum = sum(col2 + col3))
Desired result:
  col1  col2_sum  col23_sum
0    a         3         33
1    b         7         77

Let us try assigning the new column first
out = df.assign(col23 = df.col2+df.col3).groupby('col1',as_index=False).sum()
Out[81]:
col1 col2 col3 col23
0 a 3 30 33
1 b 7 70 77
From my understanding, apply is more like summarize in R:
out = df.groupby('col1').\
apply(lambda x : pd.Series({'col2_sum':x['col2'].sum(),
'col23_sum':(x['col2'] + x['col3']).sum()})).\
reset_index()
Out[83]:
col1 col2_sum col23_sum
0 a 3 33
1 b 7 77
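If you prefer to stay close to agg, named aggregation (available since pandas 0.25) produces the same two summary columns without a per-group apply; a sketch that precomputes the combined column with assign first:

```python
import pandas as pd

df = pd.DataFrame({'col1': ['a', 'a', 'b', 'b'],
                   'col2': [1, 2, 3, 4],
                   'col3': [10, 20, 30, 40]})

# Precompute col2 + col3, then name each output column in a single agg call
out = (df.assign(col23=df.col2 + df.col3)
         .groupby('col1', as_index=False)
         .agg(col2_sum=('col2', 'sum'),
              col23_sum=('col23', 'sum')))
```

Unlike the plain sum over all columns, this keeps only the two requested aggregates and names them in one step.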

You can do it easily with datar:
>>> from datar.all import f, tibble, group_by, summarize, sum
>>> df = tibble(
... col1=['a', 'a', 'b', 'b'],
... col2=[1, 2, 3, 4],
... col3=[10, 20, 30, 40]
... )
>>> df >> group_by(f.col1) >> summarize(
... col2_sum = sum(f.col2),
... col23_sum = sum(f.col2 + f.col3)
... )
col1 col2_sum col23_sum
<object> <int64> <int64>
0 a 3 33
1 b 7 77
I am the author of the datar package.


How do I apply a function to the groupby sub-groups that depends on multiple columns?

Take the following data frame and groupby object.
df = pd.DataFrame([[1, 2, 3],[1, 4, 5],[2, 5, 6]], columns=['a', 'b', 'c'])
print(df)
a b c
0 1 2 3
1 1 4 5
2 2 5 6
dfGrouped = df.groupby(['a'])
How would I apply a function to the groupby object dfGrouped that multiplies each element of b and c together and then takes the sum? So for this example, 2*3 + 4*5 = 26 for the 1 group and 5*6 = 30 for the 2 group.
So my desired output for the groupby object is:
a f
0 1 26
2 2 30
Do:
df = pd.DataFrame([[1, 2, 3],[1, 4, 5],[2, 5, 6]], columns=['a', 'b', 'c'])
df['f'] = df['c'] * df['b']
res = df.groupby('a', as_index=False)['f'].sum()
print(res)
Output
a f
0 1 26
1 2 30
If you need to multiply all columns except a, use DataFrame.prod with an aggregating sum:
df = df.drop('a', axis=1).prod(axis=1).groupby(df['a']).sum().reset_index(name='f')
print (df)
a f
0 1 26
1 2 30
Alternative with helper column:
df = df.assign(f = df.drop('a', axis=1).prod(axis=1)).groupby("a", as_index=False).f.sum()
If you need to multiply only some columns, one idea is to use @sammywemmy's solution from the comments:
df = df.assign(f = df.b.mul(df.c)).groupby("a", as_index=False).f.sum()
print (df)
a f
0 1 26
1 2 30
Code:
df=(df.b * df.c).groupby(df['a']).sum().reset_index(name="f")
print(df)
Output:
a f
0 1 26
1 2 30
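For completeness, the question's phrasing maps most directly onto GroupBy.apply, where each group arrives as a sub-DataFrame and arbitrary multi-column logic is allowed; a sketch (slower than the vectorized answers above on large frames):

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [1, 4, 5], [2, 5, 6]], columns=['a', 'b', 'c'])

# Each lambda call receives one group as a DataFrame, so b and c
# can be combined freely before reducing to a scalar
res = (df.groupby('a')
         .apply(lambda g: (g['b'] * g['c']).sum())
         .reset_index(name='f'))
```

Note that recent pandas versions warn about apply operating on the grouping columns, so the vectorized forms above are preferable in new code.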

Python - Pandas - Edit duplicate items keeping last

Let's say my df is:
import pandas as pd
df = pd.DataFrame({'col1':['a', 'a', 'a', 'b', 'b', 'c', 'd', 'd', 'd'],
'col2':[10,20, 30, 10, 20, 10, 10, 20, 30]})
How can I make all numbers zero keeping the last one only? In this case the result should be:
col1 col2
a 0
a 0
a 30
b 0
b 20
c 10
d 0
d 0
d 30
Thanks!
Use loc and duplicated with the argument keep='last':
df.loc[df.duplicated(subset='col1',keep='last'), 'col2'] = 0
>>> df
col1 col2
0 a 0
1 a 0
2 a 30
3 b 0
4 b 20
5 c 10
6 d 0
7 d 0
8 d 30
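The same masking idea can be written without in-place loc assignment via Series.where, which keeps a value only where its condition holds; a sketch:

```python
import pandas as pd

df = pd.DataFrame({'col1': ['a', 'a', 'a', 'b', 'b', 'c', 'd', 'd', 'd'],
                   'col2': [10, 20, 30, 10, 20, 10, 10, 20, 30]})

# Keep col2 only on the last occurrence of each col1 value; zero out the rest
df['col2'] = df['col2'].where(~df.duplicated('col1', keep='last'), 0)
```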

Converting a long dataframe to wide dataframe

What is a systematic way to go from this:
x = {'col0': [1, 1, 2, 2], 'col1': ['a', 'b', 'a', 'b'],
'col2': ['x', 'x', 'x', 'x'], 'col3': [12, 13, 14, 15]}
y = pd.DataFrame(data=x)
y
col0 col1 col2 col3
0 1 a x 12
1 1 b x 13
2 2 a x 14
3 2 b x 15
To this:
y2
col0 col3__a_x col3__b_x
0 1 12 13
1 2 14 15
I was initially thinking something like cast from the reshape2 package from R. However, I'm much less familiar with Pandas/Python than I am with R.
In the dataset I'm working with col1 has 3 different values, col2 is all the same value, ~200,000 rows, and ~80 other columns that would get the suffix added.
You will need pivot_table and column flattening:
s=pd.pivot_table(y,index='col0',columns=['col1','col2'],values='col3')
s.columns=s.columns.map('_'.join)
s.add_prefix('col3_').reset_index()
Out[1383]:
col0 col3_a_x col3_b_x
0 1 12 13
1 2 14 15
You can do it using set_index and unstack if you don't have multiple values for the resulting rows and columns; otherwise you'll have to use an aggregation method such as pivot_table or groupby:
df_out = y.set_index(['col0','col1','col2']).unstack([1,2])
df_out.columns = df_out.columns.map('_'.join)
df_out.reset_index()
Output:
col0 col3_a_x col3_b_x
0 1 12 13
1 2 14 15
Or with multiple values using groupby:
df_out = y.groupby(['col0','col1','col2']).mean().unstack([1,2])
df_out.columns = df_out.columns.map('_'.join)
df_out.reset_index()
Using pd.factorize and NumPy slice assignment we can construct the data frame we need.
import numpy as np

i, r = pd.factorize(y.col0)
j, c = pd.factorize(y.col1.str.cat(y.col2, '_'))
b = np.zeros((r.size, c.size), np.int64)
b[i, j] = y.col3.values
d = pd.DataFrame(
    np.column_stack([r, b]),
    columns=['col0'] + ['col3__' + col for col in c]
)
d
col0 col3__a_x col3__b_x
0 1 12 13
1 2 14 15
I think that @Wen's solution is probably better, as it is pure pandas, but here is another solution if you want to use numpy:
import numpy as np
d = y.groupby('col0').apply(lambda x: x['col3']).unstack().values
d = d[~np.isnan(d)].reshape(len(d),-1)
new_df = pd.DataFrame(d).reset_index().rename(columns={'index': 'col0', 0: 'col3_a_x', 1:'col3_b_x'})
>>> new_df
col0 col3_a_x col3_b_x
0 0 12.0 13.0
1 1 14.0 15.0
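A small variant of the pivot_table answer uses an f-string flatten instead of str.join, which also reproduces the double-underscore names from the desired output; a sketch:

```python
import pandas as pd

x = {'col0': [1, 1, 2, 2], 'col1': ['a', 'b', 'a', 'b'],
     'col2': ['x', 'x', 'x', 'x'], 'col3': [12, 13, 14, 15]}
y = pd.DataFrame(data=x)

s = pd.pivot_table(y, index='col0', columns=['col1', 'col2'], values='col3')
# Flatten the (col1, col2) MultiIndex into col3__<col1>_<col2> names
s.columns = [f'col3__{a}_{b}' for a, b in s.columns]
y2 = s.reset_index()
```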

Rearrange Python Pandas DataFrame Rows into a Single Row

I have a Pandas dataframe that looks something like:
df = pd.DataFrame({'col1': [1, 2, 3, 4], 'col2': [5, 6, 7, 8]}, index=['A', 'B', 'C', 'D'])
   col1  col2
A     1     5
B     2     6
C     3     7
D     4     8
However, I want to automatically rearrange it so that it looks like:
   col1 A  col1 B  col1 C  col1 D  col2 A  col2 B  col2 C  col2 D
0       1       2       3       4       5       6       7       8
I want to combine the row name with the column name
I want to end up with only one row
df2 = df.unstack()
df2.index = [' '.join(x) for x in df2.index.values]
df2 = pd.DataFrame(df2).T
df2
col1 A col1 B col1 C col1 D col2 A col2 B col2 C col2 D
0 1 2 3 4 5 6 7 8
If you want to have the orignal x axis labels in front of the column names ("A col1"...) just change .join(x) by .join(x[::-1]):
df2 = df.unstack()
df2.index = [' '.join(x[::-1]) for x in df2.index.values]
df2 = pd.DataFrame(df2).T
df2
A col1 B col1 C col1 D col1 A col2 B col2 C col2 D col2
0 1 2 3 4 5 6 7 8
Here's one way to do it; there could be a simpler way
In [562]: df = pd.DataFrame({'col1': [1, 2, 3, 4], 'col2': [50, 60, 70, 80]},
index=['A', 'B', 'C', 'D'])
In [563]: pd.DataFrame([df.values.T.ravel()],
columns=[y+x for y in df.columns for x in df.index])
Out[563]:
col1A col1B col1C col1D col2A col2B col2C col2D
0 1 2 3 4 50 60 70 80

Pandas: consolidating columns in DataFrame

Using the DataFrame below as an example:
import pandas as pd
df = pd.DataFrame({'col1':[1, 2, 3, 2, 1] , 'col2':['A', 'A', 'B', 'B','C']})
col1 col2
0 1 A
1 2 A
2 3 B
3 2 B
4 1 C
how can I get
col1 col2
0 1 A,C
1 2 A,B
2 3 B
You can groupby on 'col1' and then apply a lambda that joins the values:
In [88]:
df = pd.DataFrame({'col1':[1, 2, 3, 2, 1] , 'col2':['A', 'A', 'B', 'B','C']})
df.groupby('col1')['col2'].apply(lambda x: ','.join(x)).reset_index()
Out[88]:
col1 col2
0 1 A,C
1 2 A,B
2 3 B
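Since ','.join is itself a callable, you can also pass it straight to agg and skip the lambda; a sketch:

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3, 2, 1], 'col2': ['A', 'A', 'B', 'B', 'C']})

# ','.join is applied directly to each group's values
out = df.groupby('col1', as_index=False)['col2'].agg(','.join)
```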
