I have written my own user-defined function in Python. The inputs are some parameters and a DataFrame. First some new variables are added to the input DataFrame. Then I try to do a groupby on the DataFrame and left join the result back onto it.
But the DataFrame doesn't get the groupby variables added.
def test(df, params):
    df['b'] = df['a'] * params['some_parameter']
    df['c'] = df['b'] * df['total']
    aaa = df.groupby(['aa', 'bb']).agg({'c': 'sum'})
    df = pd.merge(df, aaa, how='left', on=['aa', 'bb'])
    return
Next try:
def test(df, params):
    df['b'] = df['a'] * params['some_parameter']
    df['d'] = df['c'] * df['b']
    aaa = df.groupby(['y', 'x']).agg({'d': 'sum', 'g': 'sum'}).add_suffix('_sum')
    df = df.join(aaa, on=['y', 'x'])
    return
I then call the function with:
test(df2, params)
I would expect df2 to have 4 new columns: b, d, d_sum and g_sum. But it only has 2 new columns: b and d.
You can use GroupBy.transform instead of a groupby followed by a left join with merge. The underlying problem is that assignments like df['b'] = ... mutate the object the caller passed in, while df = pd.merge(...) only rebinds the local name df to a new DataFrame that is discarded when the function returns, so df2 never sees the joined columns. transform avoids the join entirely because its result is already aligned to the original rows. Change:
aaa = df.groupby(['aa', 'bb']).agg({'c': 'sum'})
df = pd.merge(df, aaa, how='left', on=['aa', 'bb'])
to:
df['c1'] = df.groupby(['aa', 'bb'])['c'].transform('sum')
All together:
def test(df, params):
    df['b'] = df['a'] * params['some_parameter']
    df['c'] = df['b'] * df['total']
    df['new'] = df.groupby(['aa', 'bb'])['c'].transform('sum')
    return df
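A minimal sketch of the mutation-versus-rebinding difference (df2 here is hypothetical sample data matching the question's column names):

import pandas as pd

df2 = pd.DataFrame({'aa': [1, 1, 2],
                    'bb': ['x', 'x', 'y'],
                    'a': [1.0, 2.0, 3.0],
                    'total': [10, 20, 30]})
params = {'some_parameter': 100}

test(df2, params)            # column assignments mutate df2 in place
print(df2.columns.tolist())  # ['aa', 'bb', 'a', 'total', 'b', 'c', 'new']

def broken(df):
    df = df.assign(z=1)      # rebinding: the local name now points at a new frame
    return                   # nothing returned; the new frame is discarded

broken(df2)
print('z' in df2.columns)    # False - the caller's frame is untouched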
If you need to aggregate multiple columns, you can use DataFrame.join, which performs a left join by default:
df = pd.DataFrame({
    'x': list('dddddd'),
    'y': list('aaabbb'),
    'a': [4, 5, 4, 5, 5, 4],
    'b': [7, 8, 9, 4, 2, 3],
    'c': [1, 3, 5, 7, 1, 0],
    'd': [5, 3, 6, 9, 2, 4],
    'g': [1, 3, 6, 4, 4, 3],
})
print (df)
x y a b c d g
0 d a 4 7 1 5 1
1 d a 5 8 3 3 3
2 d a 4 9 5 6 6
3 d b 5 4 7 9 4
4 d b 5 2 1 2 4
5 d b 4 3 0 4 3
params = {'some_parameter':100}
def test(df, params):
    df['b'] = df['a'] * params['some_parameter']
    df['d'] = df['c'] * df['b']
    aaa = df.groupby(['y', 'x']).agg({'d': 'sum', 'g': 'sum'}).add_suffix('_sum')
    df = df.join(aaa, on=['y', 'x'])
    return df
df1 = test(df, params)
print (df1)
x y a b c d g d_sum g_sum
0 d a 4 400 1 400 1 3900 10
1 d a 5 500 3 1500 3 3900 10
2 d a 4 400 5 2000 6 3900 10
3 d b 5 500 7 3500 4 4000 11
4 d b 5 500 1 500 4 4000 11
5 d b 4 400 0 0 3 4000 11
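The same two sums can also be broadcast with transform, side-stepping the join entirely (a sketch using the same column names as above):

g = df.groupby(['y', 'x'])
df['d_sum'] = g['d'].transform('sum')  # per-row group total of d
df['g_sum'] = g['g'].transform('sum')  # per-row group total of g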
What is a pythonic way of counting the level-1 size per level-0 group in a MultiIndex and creating a new column (named counts)? I can achieve this in the following way, but would like to gain an understanding of any simpler approaches:
Code
df = pd.DataFrame({'STNAME': ['AL'] * 3 + ['MI'] * 4,
                   'CTYNAME': list('abcdefg'),
                   'COL': range(7)}).set_index(['STNAME', 'CTYNAME'])
print(df)
COL
STNAME CTYNAME
AL a 0
b 1
c 2
MI d 3
e 4
f 5
g 6
df1 = df.groupby(level=0).size().reset_index(name='count')
counts = df.merge(df1,left_on="STNAME",right_on="STNAME")["count"].values
df["counts"] = counts
This is the desired output:
COL counts
STNAME CTYNAME
AL a 0 3
b 1 3
c 2 3
MI d 3 4
e 4 4
f 5 4
g 6 4
You can use groupby.transform with size here instead of merging:
output = df.assign(Counts=df.groupby(level=0)['COL'].transform('size'))
print(output)
COL Counts
STNAME CTYNAME
AL a 0 3
b 1 3
c 2 3
MI d 3 4
e 4 4
f 5 4
g 6 4
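A mapping-based alternative (a sketch, not part of the original answer) that avoids picking a value column at all:

counts = df.index.get_level_values('STNAME').value_counts()       # size per level-0 label
df['counts'] = df.index.get_level_values('STNAME').map(counts)    # broadcast back to rows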
I have the following data set:
> dt
a b group
1: 1 5 a
2: 2 6 a
3: 3 7 b
4: 4 8 b
I have the following function:
def bigSum(a, b):
    return a.min() + b.max()
I want to apply this function to columns a and b, grouped by group, and assign the result to a new column c of the data frame. My desired result is
> dt
a b group c
1: 1 5 a 7
2: 2 6 a 7
3: 3 7 b 11
4: 4 8 b 11
For instance, if I were using R data.table, I would do the following:
dt[, c := bigSum(a,b), by = group]
and it would work exactly as I expect. I am interested in whether there is something similar in pandas.
In pandas we have transform:
g = df.groupby('group')
df['out'] = g.a.transform('min') + g.b.transform('max')
df
Out[282]:
a b group out
1 1 5 a 7
2 2 6 a 7
3 3 7 b 11
4 4 8 b 11
Update
df['new'] = df.groupby('group').apply(lambda x : bigSum(x['a'],x['b'])).reindex(df.group).values
df
Out[287]:
a b group out new
1 1 5 a 7 7
2 2 6 a 7 7
3 3 7 b 11 11
4 4 8 b 11 11
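An equivalent way to broadcast the per-group scalars, using Series.map instead of reindex (a sketch along the same lines as the update above):

per_group = df.groupby('group').apply(lambda x: bigSum(x['a'], x['b']))  # one scalar per group
df['new'] = df['group'].map(per_group)                                   # broadcast back to rows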
I have a dataset with multiple columns and rows. The rows are supposed to be summed up based on the unique value in a column. I tried .groupby, but I want to retain the whole dataset, not just the summed-up columns keyed by one unique column. I further need to multiply these individual column values by another column.
For example:
id A B C D E
11 2 1 2 4 100
11 2 2 1 1 100
12 1 3 2 2 200
13 3 1 1 4 190
14 NaN 1 2 2 300
I would like to sum up columns B, C & D based on the unique id and then multiply the result by columns A and E in a new column F. I do not want to sum up the values of columns A & E.
I would like the resultant dataframe to be something like this, which also deals with NaN by skipping it and moving on with the calculation:
id A B C D E F
11 2 3 3 5 100 9000
12 1 3 2 2 200 2400
13 3 1 1 4 190 2280
14 NaN 1 2 2 300 1200
If the above is unachievable, then I would like something like the following, where the rows stay the same but the calculation is as stated above, based on the same id:
id A B C D E F
11 2 3 3 5 100 9000
11 2 2 1 1 100 9000
12 1 3 2 2 200 2400
13 3 1 1 4 190 2280
14 NaN 1 2 2 300 1200
My logic earlier was to apply groupby on the columns B, C, D and then multiply, but that is not working out for me. If the above dataframes are unachievable, then please let me know how I can perform this calculation and then merge/join the results with the original file with just the E column.
You must first sum columns B, C and D vertically for each common id, then take the horizontal product:
result = df.groupby('id').agg({'A': 'first', 'B': 'sum', 'C': 'sum', 'D': 'sum',
                               'E': 'first'})
result['F'] = result.fillna(1).astype('int64').agg('prod', axis=1)
It gives:
A B C D E F
id
11 2.0 3 3 5 100 9000
12 1.0 3 2 2 200 2400
13 3.0 1 1 4 190 2280
14 NaN 1 2 2 300 1200
Beware: id is the index here - use reset_index if you want it to be a normal column.
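If you prefer the second, row-preserving layout from the question, the same numbers can be broadcast back with transform; a sketch (the result is float because of the NaN in A):

group_sums = df.groupby('id')[['B', 'C', 'D']].transform('sum')  # per-row group totals
df['F'] = group_sums.prod(axis=1) * df['A'].fillna(1) * df['E']  # NaN in A treated as 1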
I use the following code to try to change the values in columns 4, 5 and 6 of a dataframe to percentage format, but it returns an error.
df.iloc[:,4:7].apply('{:.2%}'.format)
You can use DataFrame.applymap:
df = pd.DataFrame({
    'a': list('abcdef'),
    'b': list('aaabbb'),
    'c': [4, 5, 4, 5, 5, 4],
    'd': [7, 8, 9, 4, 2, 3],
    'e': [5, 3, 6, 9, 2, 4],
    'f': [7, 8, 9, 4, 2, 3],
    'g': [1, 3, 5, 7, 1, 0],
    'h': [7, 8, 9, 4, 2, 3],
    'i': [1, 3, 5, 7, 1, 0]
})
df.iloc[:,4:7] = df.iloc[:,4:7].applymap('{:.2%}'.format)
print (df)
a b c d e f g h i
0 a a 4 7 500.00% 700.00% 100.00% 7 1
1 b a 5 8 300.00% 800.00% 300.00% 8 3
2 c a 4 9 600.00% 900.00% 500.00% 9 5
3 d b 5 4 900.00% 400.00% 700.00% 4 7
4 e b 5 2 200.00% 200.00% 100.00% 2 1
5 f b 4 3 400.00% 300.00% 0.00% 3 0
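Two caveats worth noting: DataFrame.applymap was deprecated in pandas 2.1 in favour of DataFrame.map, and formatting to strings destroys the numeric dtype. If the percentages are only needed for display, Styler.format keeps the underlying data numeric. A sketch of both options, assuming the original numeric df:

# pandas >= 2.1: same element-wise call under the new name
df.iloc[:, 4:7] = df.iloc[:, 4:7].map('{:.2%}'.format)

# display-only alternative (e.g. in a notebook): data stays numeric
df.style.format('{:.2%}', subset=df.columns[4:7])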
I'm learning pandas. I have two dataframes:
df1 =
quality1 value
A 1
B 2
C 3
df2 =
quality2 value
D 1
E 10
F 100
I want to multiply them (as I might do with vectors to get a matrix). The answer should be:
df3 =
quality1 quality2 value
A D 1
E 10
F 100
B D 2
E 20
F 200
C D 3
E 30
F 300
How can I achieve this?
It's not the prettiest, but it would work:
>>> df1["dummy"] = 1
>>> df2["dummy"] = 1
>>> dfm = df1.merge(df2, on="dummy")
>>> dfm["value"] = dfm.pop("value_x") * dfm.pop("value_y")
>>> del dfm["dummy"]
>>> dfm
quality1 quality2 value
0 A D 1
1 A E 10
2 A F 100
3 B D 2
4 B E 20
5 B F 200
6 C D 3
7 C E 30
8 C F 300
Until we get native support for a Cartesian join (whistles and looks away..), merging on a dummy column is an easy way to get the same effect. The intermediate frame looks like
>>> dfm
quality1 value_x dummy quality2 value_y
0 A 1 1 D 1
1 A 1 1 E 10
2 A 1 1 F 100
3 B 2 1 D 1
4 B 2 1 E 10
5 B 2 1 F 100
6 C 3 1 D 1
7 C 3 1 E 10
8 C 3 1 F 100
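Since pandas 1.2 there is also native support for this via how='cross', which makes the dummy column unnecessary (an equivalent sketch):

dfm = df1.merge(df2, how='cross')                       # Cartesian product of the two frames
dfm['value'] = dfm.pop('value_x') * dfm.pop('value_y')  # combine the suffixed value columns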
You could also use the cartesian function from scikit-learn:
from sklearn.utils.extmath import cartesian
import numpy as np
import pandas as pd

# Your data:
df1 = pd.DataFrame({'quality1': list('ABC'), 'value': [1, 2, 3]})
df2 = pd.DataFrame({'quality2': list('DEF'), 'value': [1, 10, 100]})

# Make the matrix of labels:
dfm = pd.DataFrame(cartesian((df1.quality1.values, df2.quality2.values)),
                   columns=['quality1', 'quality2'])

# Multiply values (repeat df1's values, tile df2's):
dfm['value'] = df1.value.values.repeat(df2.value.size) * np.tile(df2.value.values, df1.value.size)

print(dfm.set_index(['quality1', 'quality2']))
Which yields:
value
quality1 quality2
A D 1
E 10
F 100
B D 2
E 20
F 200
C D 3
E 30
F 300
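The repeat/tile pattern is just a flattened outer product, so a NumPy-only sketch of the same table (assuming the df1 and df2 defined above):

idx = pd.MultiIndex.from_product([df1.quality1, df2.quality2],
                                 names=['quality1', 'quality2'])
out = pd.DataFrame({'value': np.outer(df1.value, df2.value).ravel()}, index=idx)
print(out)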