groupby in a user-defined Python function doesn't work

I have written my own user-defined function in Python. The inputs are some parameters and a dataframe. First, some new variables are added to the input dataframe. Then I try to do a groupby on the dataframe and left join the result onto the dataframe.
But the dataframe doesn't get the groupby variables added.
def test(df, params):
    df['b']=df['a']*params['some_parameter']
    df['c']=df['b']*df['total']
    aaa=df.groupby(['aa', 'bb']).agg({'c':'sum'})
    df=pd.merge(df,aaa,how='left',on=['aa', 'bb'])
    return
Next try:
def test(df, params):
    df['b']=df['a']*params['some_parameter']
    df['d']=df['c']*df['b']
    aaa=df.groupby(['y','x']).agg({'d':'sum','g':'sum'}).add_suffix('_sum')
    df=df.join(aaa, on=['y','x'])
    return
I then call the function by:
test(df2,params)
I would expect df2 to have four new columns: b, d, d_sum and g_sum. But it only has two new columns, b and d.

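The function rebinds the local name df to the joined result but returns nothing, so the caller's df2 only receives the columns that were assigned in place. Return the new DataFrame and assign the result at the call site.
You can also use GroupBy.transform instead of groupby followed by a left join with merge. Change:
aaa=df.groupby(['aa', 'bb']).agg({'c':'sum'})
df=pd.merge(df,aaa,how='left',on=['aa', 'bb'])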
to:
df['c1'] = df.groupby(['aa', 'bb'])['c'].transform('sum')
All together:
def test(df, params):
    df['b']=df['a']*params['some_parameter']
    df['c']=df['b']*df['total']
    df['new'] = df.groupby(['aa', 'bb'])['c'].transform('sum')
    return df
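A minimal sketch of why the original call appeared to lose the aggregated columns. The function name `add_cols` and the frame `toy` are hypothetical and exist only for illustration: column assignment mutates the object the caller passed in, while `df = ...` merely rebinds the local name.

```python
import pandas as pd

def add_cols(df):
    # Column assignment mutates the object the caller passed in.
    df['b'] = df['a'] * 2
    # Rebinding the local name points it at a new object; the caller
    # never sees that object unless it is returned.
    df = df.assign(total=df['b'].sum())
    return df

toy = pd.DataFrame({'a': [1, 2, 3]})
result = add_cols(toy)

print(list(toy.columns))     # columns visible on the original object
print(list(result.columns))  # columns on the returned object
```

Only the returned object carries `total`, which is why the call site needs `df2 = test(df2, params)` rather than a bare `test(df2, params)`.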
If you need to aggregate multiple columns, you can use DataFrame.join, which performs a left join by default:
df = pd.DataFrame({
    'x':list('dddddd'),
    'y':list('aaabbb'),
    'a':[4,5,4,5,5,4],
    'b':[7,8,9,4,2,3],
    'c':[1,3,5,7,1,0],
    'd':[5,3,6,9,2,4],
    'g':[1,3,6,4,4,3],
})
print(df)
   x  y  a  b  c  d  g
0  d  a  4  7  1  5  1
1  d  a  5  8  3  3  3
2  d  a  4  9  5  6  6
3  d  b  5  4  7  9  4
4  d  b  5  2  1  2  4
5  d  b  4  3  0  4  3
params = {'some_parameter':100}
def test(df, params):
    df['b']=df['a']*params['some_parameter']
    df['d']=df['c']*df['b']
    aaa=df.groupby(['y','x']).agg({'d':'sum','g':'sum'}).add_suffix('_sum')
    df=df.join(aaa, on=['y','x'])
    return df
df1 = test(df, params)
print(df1)
   x  y  a    b  c     d  g  d_sum  g_sum
0  d  a  4  400  1   400  1   3900     10
1  d  a  5  500  3  1500  3   3900     10
2  d  a  4  400  5  2000  6   3900     10
3  d  b  5  500  7  3500  4   4000     11
4  d  b  5  500  1   500  4   4000     11
5  d  b  4  400  0     0  3   4000     11
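The two approaches above can also be combined. As a sketch, assuming the same column names as the example (the frame below is a trimmed, hypothetical copy that already contains the computed `d`), `transform` can aggregate several columns at once and the result can be joined on the row index:

```python
import pandas as pd

df = pd.DataFrame({
    'x': list('dddddd'),
    'y': list('aaabbb'),
    'd': [400, 1500, 2000, 3500, 500, 0],
    'g': [1, 3, 6, 4, 4, 3],
})

# transform returns a frame aligned to df's own index, so the join
# needs no key columns.
sums = df.groupby(['y', 'x'])[['d', 'g']].transform('sum').add_suffix('_sum')
out = df.join(sums)
print(out)
```

This avoids building the intermediate aggregated frame, at the cost of supporting only aggregations that `transform` accepts.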

Related

Count level 1 size per level 0 in multi index and add new column

What is a Pythonic way of counting the level 1 size per level 0 in a MultiIndex and creating a new column (named counts)? I can achieve this in the following way, but I would like to understand any simpler approaches:
Code
df = pd.DataFrame({'STNAME':['AL'] * 3 + ['MI'] * 4,
                   'CTYNAME':list('abcdefg'),
                   'COL': range(7) }).set_index(['STNAME','CTYNAME'])
print(df)
                COL
STNAME CTYNAME
AL     a          0
       b          1
       c          2
MI     d          3
       e          4
       f          5
       g          6
df1 = df.groupby(level=0).size().reset_index(name='count')
counts = df.merge(df1,left_on="STNAME",right_on="STNAME")["count"].values
df["counts"] = counts
This is the desired output:
                COL  counts
STNAME CTYNAME
AL     a          0       3
       b          1       3
       c          2       3
MI     d          3       4
       e          4       4
       f          5       4
       g          6       4
You can use groupby.transform with size here instead of merging:
output = df.assign(Counts=df.groupby(level=0)['COL'].transform('size'))
print(output)
                COL  Counts
STNAME CTYNAME
AL     a          0       3
       b          1       3
       c          2       3
MI     d          3       4
       e          4       4
       f          5       4
       g          6       4
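Another possible sketch, not taken from the answer above, computes the group sizes once and looks them up through the index level. It assumes the same frame as the question and does not depend on any particular data column such as COL:

```python
import pandas as pd

df = pd.DataFrame({'STNAME': ['AL'] * 3 + ['MI'] * 4,
                   'CTYNAME': list('abcdefg'),
                   'COL': range(7)}).set_index(['STNAME', 'CTYNAME'])

# Number of rows per level-0 value, indexed by STNAME.
sizes = df.groupby(level='STNAME').size()

# Look up the size for every row through its STNAME index value.
df['counts'] = df.index.get_level_values('STNAME').map(sizes)
print(df)
```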

Apply a user-defined function to multiple columns per group in pandas and assign the result to a new column

I have the following data set:
> dt
   a b group
1: 1 5     a
2: 2 6     a
3: 3 7     b
4: 4 8     b
I have the following function:
def bigSum(a,b):
    return a.min() + b.max()
I want to apply this function to the a and b columns per group (by group) and assign the result to a new column c of the data frame. My desired result is:
> dt
   a b group  c
1: 1 5     a  7
2: 2 6     a  7
3: 3 7     b 11
4: 4 8     b 11
For instance, if I were using R's data.table, I would do the following:
dt[, c := bigSum(a,b), by = group]
and it would work exactly as I expect. I would like to know whether there is something similar in pandas.
In pandas we have transform:
g = df.groupby('group')
df['out'] = g.a.transform('min') + g.b.transform('max')
df
Out[282]:
   a  b group  out
1  1  5     a    7
2  2  6     a    7
3  3  7     b   11
4  4  8     b   11
Update
df['new'] = df.groupby('group').apply(lambda x : bigSum(x['a'],x['b'])).reindex(df.group).values
df
Out[287]:
   a  b group  out  new
1  1  5     a    7    7
2  2  6     a    7    7
3  3  7     b   11   11
4  4  8     b   11   11
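For a function that cannot be split into separate `transform` calls, one possible sketch is to compute one value per group and broadcast it back with `Series.map`. The name `big_sum` mirrors the question's `bigSum`; `per_group` is a hypothetical helper name:

```python
import pandas as pd

def big_sum(a, b):
    return a.min() + b.max()

df = pd.DataFrame({'a': [1, 2, 3, 4],
                   'b': [5, 6, 7, 8],
                   'group': ['a', 'a', 'b', 'b']})

# One value per group, computed by iterating over the groups explicitly.
per_group = pd.Series({key: big_sum(g['a'], g['b'])
                       for key, g in df.groupby('group')})

# Broadcast the per-group value back to every row of that group.
df['c'] = df['group'].map(per_group)
print(df)
```

Mapping by group key keeps the result aligned even when the rows of a group are not adjacent.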

Adding and multiplying values of a dataframe in Python

I have a dataset with multiple columns and rows. The rows are supposed to be summed up based on the unique value in a column. I tried .groupby, but I want to retain the whole dataset and not just the summed columns for one unique column. I further need to multiply these individual columns (values) with another column.
For example:
id  A    B  C  D  E
11  2    1  2  4  100
11  2    2  1  1  100
12  1    3  2  2  200
13  3    1  1  4  190
14  NaN  1  2  2  300
I would like to sum up columns B, C and D for each unique id and then multiply the result by columns A and E in a new column F. I do not want to sum up the values of columns A and E.
I would like the resulting dataframe to look something like this, where a NaN value is skipped and the calculation continues with the remaining values:
id  A    B  C  D  E    F
11  2    3  3  5  100  9000
12  1    3  2  2  200  2400
13  3    1  1  4  190  2280
14  NaN  1  2  2  300  1200
If the above is unachievable, then I would like something like the following, where the rows stay the same but the calculation is as stated above, based on the same id:
id  A    B  C  D  E    F
11  2    3  3  5  100  9000
11  2    2  1  1  100  9000
12  1    3  2  2  200  2400
13  3    1  1  4  190  2280
14  NaN  1  2  2  300  1200
My earlier logic was to apply groupby on columns B, C and D and then multiply, but that is not working out for me. If the above dataframes are unachievable, then please let me know how I can perform this calculation and then merge/join the results with the original file with just the E column.
You must first sum columns B, C and D vertically for each common id, then take the horizontal product:
result = df.groupby('id').agg({'A': 'first', 'B':'sum', 'C': 'sum', 'D': 'sum',
                               'E': 'first'})
result['F'] = result.fillna(1).astype('int64').agg('prod', axis=1)
It gives:
      A  B  C  D    E     F
id
11  2.0  3  3  5  100  9000
12  1.0  3  2  2  200  2400
13  3.0  1  1  4  190  2280
14  NaN  1  2  2  300  1200
Beware: id is the index here - use reset_index if you want it to be a normal column.
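For the second layout in the question, where the original rows are kept, a possible sketch is to compute F per id as above and then map it back onto every row. The input frame below is a reconstruction of the question's example data, and the row-wise product is written with `prod(axis=1)`:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'id': [11, 11, 12, 13, 14],
                   'A': [2, 2, 1, 3, np.nan],
                   'B': [1, 2, 3, 1, 1],
                   'C': [2, 1, 2, 1, 2],
                   'D': [4, 1, 2, 4, 2],
                   'E': [100, 100, 200, 190, 300]})

result = df.groupby('id').agg({'A': 'first', 'B': 'sum', 'C': 'sum',
                               'D': 'sum', 'E': 'first'})
# Treat a missing factor as 1 so it drops out of the product.
result['F'] = result.fillna(1).prod(axis=1).astype('int64')

# Attach the per-id F to every original row.
df['F'] = df['id'].map(result['F'])
print(df)
```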

How to change value in columns 4,5,6 of a dataframe to percentage format?

I used the following code to try to change the values in columns 4, 5 and 6 of a dataframe to percentage format, but it raised errors.
df.iloc[:,4:7].apply('{:.2%}'.format)
You can use DataFrame.applymap (deprecated since pandas 2.1 in favor of the equivalent DataFrame.map):
df = pd.DataFrame({
    'a':list('abcdef'),
    'b':list('aaabbb'),
    'c':[4,5,4,5,5,4],
    'd':[7,8,9,4,2,3],
    'e':[5,3,6,9,2,4],
    'f':[7,8,9,4,2,3],
    'g':[1,3,5,7,1,0],
    'h':[7,8,9,4,2,3],
    'i':[1,3,5,7,1,0]
})
df.iloc[:,4:7] = df.iloc[:,4:7].applymap('{:.2%}'.format)
print (df)
   a  b  c  d        e        f        g  h  i
0  a  a  4  7  500.00%  700.00%  100.00%  7  1
1  b  a  5  8  300.00%  800.00%  300.00%  8  3
2  c  a  4  9  600.00%  900.00%  500.00%  9  5
3  d  b  5  4  900.00%  400.00%  700.00%  4  7
4  e  b  5  2  200.00%  200.00%  100.00%  2  1
5  f  b  4  3  400.00%  300.00%    0.00%  3  0
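A sketch of a variant that should behave the same regardless of whether the installed pandas offers `applymap` or `map` on DataFrames: apply `Series.map` column by column. The frame and the column names `ratio` and `share` are hypothetical. Note that the `%` format multiplies by 100, so the inputs here are fractions, unlike the integers in the output above.

```python
import pandas as pd

df = pd.DataFrame({'name': list('abc'),
                   'ratio': [0.5, 0.125, 1.0],
                   'share': [0.05, 0.3333, 0.0]})

cols = ['ratio', 'share']
# Format each selected column element-wise; Series.map accepts any callable.
formatted = df.copy()
formatted[cols] = df[cols].apply(lambda s: s.map('{:.2%}'.format))
print(formatted)
```

The formatted columns hold strings, so it is usually best to keep the numeric frame for further calculations and format only for display.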

How to "multiply" python pandas dataframes (as if they were vectors)?

I'm learning pandas. I have two dataframes:
df1 =
quality1  value
A         1
B         2
C         3
df2 =
quality2  value
D         1
E         10
F         100
I want to multiply them (as I might do with vectors to get a matrix). The answer should be:
df3 =
quality1  quality2  value
A         D         1
          E         10
          F         100
B         D         2
          E         20
          F         200
C         D         3
          E         30
          F         300
How can I achieve this?
It's not the prettiest, but it would work:
>>> df1["dummy"] = 1
>>> df2["dummy"] = 1
>>> dfm = df1.merge(df2, on="dummy")
>>> dfm["value"] = dfm.pop("value_x") * dfm.pop("value_y")
>>> del dfm["dummy"]
>>> dfm
  quality1 quality2  value
0        A        D      1
1        A        E     10
2        A        F    100
3        B        D      2
4        B        E     20
5        B        F    200
6        C        D      3
7        C        E     30
8        C        F    300
Until we get native support for a Cartesian join (whistles and looks away..), merging on a dummy column is an easy way to get the same effect. The intermediate frame looks like
>>> dfm
  quality1  value_x  dummy quality2  value_y
0        A        1      1        D        1
1        A        1      1        E       10
2        A        1      1        F      100
3        B        2      1        D        1
4        B        2      1        E       10
5        B        2      1        F      100
6        C        3      1        D        1
7        C        3      1        E       10
8        C        3      1        F      100
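Since that answer was written, pandas has gained a built-in cross join: from version 1.2 onward, `merge` accepts `how='cross'`. A minimal sketch on the same data, assuming pandas 1.2 or later, needs no dummy column:

```python
import pandas as pd

df1 = pd.DataFrame({'quality1': list('ABC'), 'value': [1, 2, 3]})
df2 = pd.DataFrame({'quality2': list('DEF'), 'value': [1, 10, 100]})

# how='cross' pairs every row of df1 with every row of df2 (pandas 1.2+).
dfm = df1.merge(df2, how='cross')

# The shared column name is suffixed with _x and _y, as in the dummy-key merge.
dfm['value'] = dfm.pop('value_x') * dfm.pop('value_y')
print(dfm)
```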
You could also use the cartesian function from scikit-learn:
import numpy as np
import pandas as pd
from sklearn.utils.extmath import cartesian
# Your data:
df1 = pd.DataFrame({'quality1':list('ABC'), 'value':[1,2,3]})
df2 = pd.DataFrame({'quality2':list('DEF'), 'value':[1,10,100]})
# Make the matrix of labels:
dfm = pd.DataFrame(cartesian((df1.quality1.values, df2.quality2.values)),
                   columns=['quality1', 'quality2'])
# Multiply values (pd.np is no longer available in current pandas, so use numpy directly):
dfm['value'] = df1.value.values.repeat(df2.value.size) * np.tile(df2.value.values, df1.value.size)
print(dfm.set_index(['quality1', 'quality2']))
Which yields:
                   value
quality1 quality2
A        D             1
         E            10
         F           100
B        D             2
         E            20
         F           200
C        D             3
         E            30
         F           300
