Suppose I have the following DataFrame:
import pandas as pd
df = pd.DataFrame({'id': [1, 2, 2, 3, 3, 3], 'A': [2, 2, 3, 3, 5, 2], 'B': [1, 2, 1, 3, 2, 4]})
df
Out[253]:
id A B
0 1 2 1
1 2 2 2
2 2 3 1
3 3 3 3
4 3 5 2
5 3 2 4
I'd like to group by 'id' and aggregate with a sum over 'A' and 'B'. But I'd also like to scale A and B by the sum of A+B (per each 'id'), so the output will look as follows:
id A B
0 1 0.666667 0.333333
1 2 0.625000 0.375000
2 3 0.526316 0.473684
Now, I can do
res = df.groupby('id').agg('sum').reset_index()
scaler = res['A'] + res['B']
res['A'] /= scaler
res['B'] /= scaler
res
Out[275]:
id A B
0 1 0.666667 0.333333
1 2 0.625000 0.375000
2 3 0.526316 0.473684
Which is quite inelegant. Is there a way to put all this scaling logic in the aggregation function? Or any other pythonic and elegant way to do it? Solutions involving numpy are also welcome!
No, you cannot use the agg function for scaling, because agg works with each column separately.
The solution is to remove reset_index, so that the division (div) aligns against the Series created by sum:
res = df.groupby('id').sum()
res = res.div(res.sum(axis=1), axis=0).reset_index()
print (res)
id A B
0 1 0.666667 0.333333
1 2 0.625000 0.375000
2 3 0.526316 0.473684
Details:
print (res.sum(axis=1))
id
1 3
2 8
3 19
dtype: int64
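If you call reset_index first, 'id' becomes an ordinary column and gets swept into the row sum (and then divided along with A and B). A quick hedged illustration of what the alignment trick avoids (res_bad is just an illustrative name):
res_bad = df.groupby('id').sum().reset_index()
print (res_bad.sum(axis=1))
0     4
1    10
2    22
dtype: int64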
You can make use of the sum along axis 1 (row-wise):
res = df.groupby('id').agg('sum')
res.div(res.sum(axis=1), axis=0)
A B
id
1 0.666667 0.333333
2 0.625000 0.375000
3 0.526316 0.473684
You can do
In [584]: res = df.groupby('id').sum()
In [585]: res.div(res.sum(axis=1), axis=0).reset_index()
Out[585]:
id A B
0 1 0.666667 0.333333
1 2 0.625000 0.375000
2 3 0.526316 0.473684
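If you want the whole pipeline in one chained expression (agg itself still can't combine columns), a hedged alternative uses pipe on the grouped sum:
res = df.groupby('id').sum().pipe(lambda d: d.div(d.sum(axis=1), axis=0)).reset_index()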
I was doing some coding and realized something: I think there is an easier way of doing this.
So I have a DataFrame like this:
>>> df = pd.DataFrame({'a': [1, 'A', 2, 'A'], 'b': ['A', 3, 'A', 4]})
a b
0 1 A
1 A 3
2 2 A
3 A 4
And I want to remove all of the 'A's from the data, but I also want to squeeze the DataFrame; what I mean by squeezing is to end up with this result:
a b
0 1 3
1 2 4
I have a solution as follows:
a = df['a'][df['a'] != 'A']
b = df['b'][df['b'] != 'A']
df2 = pd.DataFrame({'a': a.tolist(), 'b': b.tolist()})
print(df2)
Which works, but I can't help thinking there is an easier way; I've stopped coding for a while, so I'm not so bright anymore...
Note:
All columns have the same number of 'A's, so there is no ragged-length problem there.
You can try boolean indexing with loc to remove the A values:
pd.DataFrame({c: df.loc[df[c] != 'A', c].tolist() for c in df})
Result:
a b
0 1 3
1 2 4
This would do:
In [1513]: df.replace('A', np.nan).apply(lambda x: pd.Series(x.dropna().to_numpy()))
Out[1513]:
a b
0 1.0 3.0
1 2.0 4.0
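Note the result comes out as floats because of the intermediate NaNs. If the original integers matter, a hedged follow-up cast (assuming every surviving value really is an integer):
In [1514]: df.replace('A', np.nan).apply(lambda x: pd.Series(x.dropna().to_numpy())).astype(int)
Out[1514]:
a b
0 1 3
1 2 4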
We can use df.melt, then filter out the 'A' values, then df.pivot:
out = df.melt().query("value!='A'")
out.index = out.groupby('variable')['variable'].cumcount()
out.pivot(columns='variable', values='value').rename_axis(columns=None)
a b
0 1 3
1 2 4
Details
out = df.melt().query("value!='A'")
variable value
0 a 1
2 a 2
5 b 3
7 b 4
# We set this as index so it helps in `df.pivot`
out.groupby('variable')['variable'].cumcount()
0 0
2 1
5 0
7 1
dtype: int64
out.pivot(columns='variable', values='value').rename_axis(columns=None)
a b
0 1 3
1 2 4
Another alternative
df = df.mask(df.eq('A'))
out = df.stack()
pd.DataFrame(out.groupby(level=1).agg(list).to_dict())
a b
0 1 3
1 2 4
Details
df = df.mask(df.eq('A'))
a b
0 1 NaN
1 NaN 3
2 2 NaN
3 NaN 4
out = df.stack()
0 a 1
1 b 3
2 a 2
3 b 4
dtype: object
pd.DataFrame(out.groupby(level=1).agg(list).to_dict())
a b
0 1 3
1 2 4
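Yet another hedged one-liner in the same spirit, leaning on the note that every column contains the same number of 'A's: filter each column and re-index it by position.
df.apply(lambda s: s[s != 'A'].reset_index(drop=True))
a b
0 1 3
1 2 4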
I am interested in whether we can use the pandas.core.groupby.DataFrameGroupBy.agg function to perform arithmetic operations across multiple columns. For example:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.arange(15).reshape(5, 3))
df['C'] = [0, 0, 2, 2, 5]
print(df.groupby('C').mean()[0] - df.groupby('C').mean()[1])
print(df.groupby('C').agg({0: 'mean', 1: 'sum', 2: 'nunique', 'C': 'mean0-mean1'}))  # 'mean0-mean1' is hypothetical syntax and raises an error
Is it somehow possible to receive a result like in this example: the difference between the means of column 0 and column 1, grouped by column 'C'?
df
0 1 2 C
0 0 1 2 0
1 3 4 5 0
2 6 7 8 2
3 9 10 11 2
4 12 13 14 5
Grouped difference
C
0 -1.0
2 -1.0
5 -1.0
dtype: float64
I am not interested in solutions that do not use the agg method. I am curious only whether the agg method can take multiple columns as an argument and then do some operation on them to return one column after the job is done.
IIUC:
In [12]: df.groupby('C').mean().diff(axis=1)
Out[12]:
0 1 2
C
0 NaN 1.0 1.0
2 NaN 1.0 1.0
5 NaN 1.0 1.0
or
In [13]: df.groupby('C').mean().diff(-1, axis=1)
Out[13]:
0 1 2
C
0 -1.0 -1.0 NaN
2 -1.0 -1.0 NaN
5 -1.0 -1.0 NaN
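And to recover exactly the Series from the question, take column 0 of the second form:
In [14]: df.groupby('C').mean().diff(-1, axis=1)[0]
Out[14]:
C
0 -1.0
2 -1.0
5 -1.0
Name: 0, dtype: float64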
Say I have two columns in a data frame, one of which is incomplete.
df = pd.DataFrame({'a': [1, 2, 3, 4], 'b':[5, '', 6, '']})
df
Out:
a b
0 1 5
1 2
2 3 6
3 4
Is there a way to fill the empty values in column b with the corresponding values in column a, whilst leaving the rest of column b intact, such that you obtain the following without iterating over the column?
df
Out:
a b
0 1 5
1 2 2
2 3 6
3 4 4
I think you can use the apply method, but I am not sure. For reference, the dataset I'm dealing with is quite large (approx. 1GB), which is why iteration (my first attempt) was not a good idea.
If blanks are empty strings, you could
In [165]: df.loc[df['b'] == '', 'b'] = df['a']
In [166]: df
Out[166]:
a b
0 1 5
1 2 2
2 3 6
3 4 4
However, if your blanks are NaNs, you could use fillna
In [176]: df
Out[176]:
a b
0 1 5.0
1 2 NaN
2 3 6.0
3 4 NaN
In [177]: df['b'] = df['b'].fillna(df['a'])
In [178]: df
Out[178]:
a b
0 1 5.0
1 2 2.0
2 3 6.0
3 4 4.0
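For the NaN case, combine_first is an equivalent spelling worth knowing; it fills missing values in the caller from the passed Series:
df['b'] = df['b'].combine_first(df['a'])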
You can use np.where to evaluate df.b: where a value is truthy (non-empty), keep it, otherwise take the value from df.a instead.
df.b = np.where(df.b, df.b, df.a)
df
Out[33]:
a b
0 1 5
1 2 2
2 3 6
3 4 4
You can use pd.Series.where with a boolean version of df.b, because '' resolves to False:
df.assign(b=df.b.where(df.b.astype(bool), df.a))
a b
0 1 5
1 2 2
2 3 6
3 4 4
You can use replace and ffill with axis=1:
df.replace('', np.nan).ffill(axis=1).astype(df.a.dtype)
Output:
a b
0 1 5
1 2 2
2 3 6
3 4 4
I would like to take my Pandas dataframe with hierarchically indexed columns and normalize the values such that the values with the same outer index sum to one. For example:
cols = pd.MultiIndex.from_tuples([('A', 1), ('A', 2), ('B', 1), ('B', 2)])
X = pd.DataFrame(np.arange(20).reshape(5,4), columns=cols)
gives a dataframe X:
A B
1 2 1 2
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
3 12 13 14 15
4 16 17 18 19
I would like to normalize the rows so that the A columns sum to 1 and the B columns sum to 1. I.e. to generate:
A B
1 2 1 2
0 0.000000 1.000000 0.400000 0.600000
1 0.444444 0.555556 0.461538 0.538462
2 0.470588 0.529412 0.476190 0.523810
3 0.480000 0.520000 0.482759 0.517241
4 0.484848 0.515152 0.486486 0.513514
The following for loop works:
res = []
for (k, g) in X.groupby(axis=1, level=0):
    g = g.div(g.sum(axis=1), axis=0)
    res.append(g)
res = pd.concat(res, axis=1)
But the one-liner fails:
X.groupby(axis=1, level=0).transform(lambda x: x.div(x.sum(axis=1), axis=0))
With the error message:
ValueError: transform must return a scalar value for each group
Any idea what the issue might be?
Is that what you want?
In [33]: X.groupby(level=0, axis=1).apply(lambda x: x.div(x.sum(axis=1), axis=0))
Out[33]:
A B
1 2 1 2
0 0.000000 1.000000 0.400000 0.600000
1 0.444444 0.555556 0.461538 0.538462
2 0.470588 0.529412 0.476190 0.523810
3 0.480000 0.520000 0.482759 0.517241
4 0.484848 0.515152 0.486486 0.513514
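The transform call fails because groupby.transform expects each invocation to return a scalar (or values shaped like a single column) per group, whereas the lambda returns a whole DataFrame; apply has no such restriction. If you'd rather avoid groupby(..., axis=1) entirely (it is deprecated in recent pandas), a hedged equivalent works on the transpose, where transform('sum') broadcasts each group's sum back to the original shape:
res = (X.T / X.T.groupby(level=0).transform('sum')).T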
So I've some time-series data on which I want to compute the daily return/increment, where daily increment = value_at_time(T) / value_at_time(T-1).
import pandas as pd
df=pd.DataFrame([1,2,3,7]) #Sample data frame
df[1:]
out:
0
1 2
2 3
3 7
df[:-1]
out:
0
0 1
1 2
2 3
######### Method 1
df[1:]/df[:-1]
out:
0
0 NaN
1 1.0
2 1.0
3 NaN
######### Method 2
df[1:]/df[:-1].values
out:
0
1 2.000000
2 1.500000
3 2.333333
######### Method 3
df[1:].values/df[:-1]
out:
0
0 2
1 1
2 2
My questions are:
If df[:-1] and df[1:] each have only three values (row slices of the DataFrame), why doesn't Method 1 work?
Why are Methods 2 and 3, which are almost identical, giving different results?
Why does using .values in Method 2 make it work?
Let's look at each method.
Method 1: if you look at what the slices return, you can see that the indices don't align:
In [87]:
print(df[1:])
print(df[:-1])
0
1 2
2 3
3 7
0
0 1
1 2
2 3
so when you do the division, only the two overlapping index labels (1 and 2) produce values:
In [88]:
df[1:]/df[:-1]
Out[88]:
0
0 NaN
1 1.0
2 1.0
3 NaN
Method 2: .values produces a NumPy array, which has no index, so the division is performed element-wise in order, as expected:
In [89]:
df[:-1].values
Out[89]:
array([[1],
[2],
[3]], dtype=int64)
Giving:
In [90]:
df[1:]/df[:-1].values
Out[90]:
0
1 2.000000
2 1.500000
3 2.333333
Method 3 works for the same reason as Method 2: the array on the left has no index, so the division is positional, and the result takes its index from df[:-1] (the truncated integer results shown above suggest Python 2 integer division).
So the question is how to do this in pure pandas. We use shift to align the indices as desired:
In [92]:
df.shift(-1)/df
Out[92]:
0
0 2.000000
1 1.500000
2 2.333333
3 NaN
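Note that df.shift(-1)/df reports value(T+1)/value(T) at index T. For the alignment the question asked for (value(T)/value(T-1), reported at T), a hedged variant shifts the other way; pct_change is the built-in for the closely related return:
In [93]:
df/df.shift(1)          # same numbers, aligned at T; df.pct_change() gives this minus 1
Out[93]:
0
0 NaN
1 2.000000
2 1.500000
3 2.333333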