GroupBy Transformation on hierarchically indexed dataframe - python

I would like to take my Pandas dataframe with hierarchically indexed columns and normalize the values such that the values with the same outer index sum to one. For example:
import numpy as np
import pandas as pd

cols = pd.MultiIndex.from_tuples([('A', 1), ('A', 2), ('B', 1), ('B', 2)])
X = pd.DataFrame(np.arange(20).reshape(5, 4), columns=cols)
gives a dataframe X:
    A       B
    1   2   1   2
0   0   1   2   3
1   4   5   6   7
2   8   9  10  11
3  12  13  14  15
4  16  17  18  19
I would like to normalize the rows so that the A columns sum to 1 and the B columns sum to 1. I.e. to generate:
          A                   B
          1         2         1         2
0  0.000000  1.000000  0.400000  0.600000
1  0.444444  0.555556  0.461538  0.538462
2  0.470588  0.529412  0.476190  0.523810
3  0.480000  0.520000  0.482759  0.517241
4  0.484848  0.515152  0.486486  0.513514
The following for loop works:
res = []
for (k, g) in X.groupby(axis=1, level=0):
    g = g.div(g.sum(axis=1), axis=0)
    res.append(g)
res = pd.concat(res, axis=1)
But the one liner fails:
X.groupby(axis=1, level=0).transform(lambda x: x.div(x.sum(axis=1), axis=0))
With the error message:
ValueError: transform must return a scalar value for each group
Any idea what the issue might be?

Is this what you want? The transform call fails because, for a function like this, transform insists on getting back a scalar per group (that is exactly what the ValueError says), whereas apply happily receives the whole sub-frame and returns a like-shaped result:
In [33]: X.groupby(level=0, axis=1).apply(lambda x: x.div(x.sum(axis=1), axis=0))
Out[33]:
          A                   B
          1         2         1         2
0  0.000000  1.000000  0.400000  0.600000
1  0.444444  0.555556  0.461538  0.538462
2  0.470588  0.529412  0.476190  0.523810
3  0.480000  0.520000  0.482759  0.517241
4  0.484848  0.515152  0.486486  0.513514

Related

Modify data frame based on the row index - Python

Given a pandas dataframe of shape (20, 40), I would like to modify the first 10 rows of the first 20 columns using each value's row index.
For example, if:
df.iloc[5,6] = 0.98,
I would like to modify the value in the following way:
new df.iloc[5,6] = 0.98 ** -(1/5)
where 5 is the row index.
The same should be done for every value within the first 10 rows and the first 20 columns.
Can anyone help me?
Thank you very much in advance.
Can you explain what you want to do in a more general way?
I don't understand why you chose 5 here.
The way to make new columns from other columns is
df["new column"] = df["column1"] ** (-1/df["column2"])
and you do the same thing with the index:
df["new column"] = df["column1"] ** (-1/df.index)
You can do this operation in-place with the following snippet.
from numpy.random import default_rng
from pandas import DataFrame
from string import ascii_lowercase
rng = default_rng(0)
df = DataFrame(
    (data := rng.integers(1, 10, size=(4, 5))),
    columns=[*ascii_lowercase[:data.shape[1]]],
)
print(df)
a b c d e
0 8 6 5 3 3
1 1 1 1 2 8
2 6 9 5 6 9
3 7 6 5 6 9
# you would replace :3, :4 with :10, :20 for your data
df.iloc[:3, :4] **= (-1 / df.index)
print(df)
a b c d e
0 0 0.166667 0.447214 0.693361 3
1 1 1.000000 1.000000 0.793701 8
2 0 0.111111 0.447214 0.550321 9
3 7 6.000000 5.000000 6.000000 9
In the event your index is not a simple RangeIndex you can use numpy.arange to mimic this:
from numpy import arange
df.iloc[:3, :4] **= (-1 / arange(df.shape[0]))
print(df)
a b c d e
0 0 0.166667 0.447214 0.693361 3
1 1 1.000000 1.000000 0.793701 8
2 0 0.111111 0.447214 0.550321 9
3 7 6.000000 5.000000 6.000000 9
Note: if 0 is in your index, as it is in this example, you'll encounter a RuntimeWarning about division by zero.

How to find rate of change across successive rows using time and data columns after grouping by a different column using pandas?

I have a pandas DataFrame of the form:
df
ID_col time_in_hours data_col
1 62.5 4
1 40 3
1 20 3
2 30 1
2 20 5
3 50 6
What I want to be able to do is find the rate of change of data_col using the time_in_hours column. Specifically,
rate_of_change = (data_col[i+1] - data_col[i]) / abs(time_in_hours[i+1] - time_in_hours[i])
where i is a given row and the rate_of_change is calculated separately for each ID.
Effectively, I want a new DataFrame of the form:
new_df
ID_col time_in_hours data_col rate_of_change
1 62.5 4 NaN
1 40 3 -0.044
1 20 3 0
2 30 1 NaN
2 20 5 0.4
3 50 6 NaN
How do I go about this?
You can use groupby:
s = df.groupby('ID_col').apply(lambda dft: dft['data_col'].diff() / dft['time_in_hours'].diff().abs())
s.index = s.index.droplevel()
s
returns
0 NaN
1 -0.044444
2 0.000000
3 NaN
4 0.400000
5 NaN
dtype: float64
You can actually get around the groupby + apply given how your DataFrame is sorted. In this case, you can just check if the ID_col is the same as the shifted row.
So calculate the rate of change for everything, and then only assign the values back if they are within a group.
import numpy as np
mask = df.ID_col == df.ID_col.shift(1)
roc = (df.data_col - df.data_col.shift(1))/np.abs(df.time_in_hours - df.time_in_hours.shift(1))
df.loc[mask, 'rate_of_change'] = roc[mask]
Output:
ID_col time_in_hours data_col rate_of_change
0 1 62.5 4 NaN
1 1 40.0 3 -0.044444
2 1 20.0 3 0.000000
3 2 30.0 1 NaN
4 2 20.0 5 0.400000
5 3 50.0 6 NaN
You can use Series.diff inside groupby.apply:
df.groupby('ID_col').apply(
    lambda x: x['data_col'].diff() / x['time_in_hours'].diff().abs())
ID_col
1 0 NaN
1 -0.044444
2 0.000000
2 3 NaN
4 0.400000
3 5 NaN
dtype: float64

Pandas - groupby, aggregate and scale on the sum of multiple columns

Suppose I have the following DataFrame:
import pandas as pd
df = pd.DataFrame({'id': [1, 2, 2, 3, 3, 3], 'A': [2, 2, 3, 3, 5, 2], 'B': [1, 2, 1, 3, 2, 4]})
df
Out[253]:
id A B
0 1 2 1
1 2 2 2
2 2 3 1
3 3 3 3
4 3 5 2
5 3 2 4
I'd like to group by 'id' and aggregate with a sum over 'A' and 'B'. But I'd also like to scale A and B by the sum of A + B (per each 'id'), so the output would look as follows:
id A B
0 1 0.666667 0.333333
1 2 0.625000 0.375000
2 3 0.526316 0.473684
Now, I can do
res = df.groupby('id').agg('sum').reset_index()
scaler = res['A'] + res['B']
res['A'] /= scaler
res['B'] /= scaler
res
Out[275]:
id A B
0 1 0.666667 0.333333
1 2 0.625000 0.375000
2 3 0.526316 0.473684
Which is quite inelegant. Is there a way to put all this scaling logic in the aggregation function? Or any other pythonic and elegant way to do it? Solutions involving numpy are also welcome!
No, you cannot use the agg function for the scaling, because it works with each column separately.
The solution is to remove reset_index, so the division (div) aligns with the Series created by sum:
res = df.groupby('id').sum()
res = res.div(res.sum(axis=1), axis=0).reset_index()
print (res)
id A B
0 1 0.666667 0.333333
1 2 0.625000 0.375000
2 3 0.526316 0.473684
Details:
print (res.sum(axis=1))
id
1 3
2 8
3 19
dtype: int64
You can make use of the row-wise sum (axis=1):
res = df.groupby('id').agg('sum')
res.div(res.sum(1), 0)
A B
id
1 0.666667 0.333333
2 0.625000 0.375000
3 0.526316 0.473684
You can do
In [584]: res = df.groupby('id').sum()
In [585]: res.div(res.sum(1), 0).reset_index()
Out[585]:
id A B
0 1 0.666667 0.333333
1 2 0.625000 0.375000
2 3 0.526316 0.473684

Divide several columns in a python dataframe where the both the numerator and denominator columns will vary based on a picklist

I'm creating a dataframe by paring down a very large dataframe (approximately 400 columns) based on choices an end user makes on a picklist. One of the picklist choices is the type of denominator that the end user would like. Here is one example table with all the information before the final calculation is made.
county _tcount _tvote _f_npb_18_count _f_npb_18_vote
countycode
35 San Benito 28194 22335 2677 1741
36 San Bernardino 912653 661838 108724 61832
countycode _f_npb_30_count _f_npb_30_vote
35 384 288
36 76749 53013
However, I am having trouble creating code that will automatically divide every column starting with the 5th (not including the index) by the column before it (skipping every other column). I've seen examples (Divide multiple columns by another column in pandas), but they all use fixed column names, which is not achievable for this aspect. I've been able to divide variable columns (based on position) by fixed columns, but not variable columns by other variable columns based on position. I've tried modifying the code in the above link based on the column positions:
calculated_frame = [county_select_frame[county_select_frame.columns[5::2]].div(county_select_frame[4::2], axis=0)]
output:
[ county _tcount _tvote _f_npb_18_count _f_npb_18_vote \
countycode
35 NaN NaN NaN NaN NaN
36 NaN NaN NaN NaN NaN]
RuntimeWarning: invalid value encountered in greater
(abs_vals > 0)).any()
The use of [5::2] does work when the dividend is a fixed field. If I can't get this to work, it's not a big deal (but it would be great to have all the options I wanted).
My preference would be to set the index and use filter to split out separate counts and votes dataframes, then use join:
d1 = df.set_index('county', append=True)
counts = d1.filter(regex='.*_\d+_count$').rename(columns=lambda x: x.replace('_count', ''))
votes = d1.filter(regex='.*_\d+_vote$').rename(columns=lambda x: x.replace('_vote', ''))
d1[['_tcount', '_tvote']].join(votes / counts)
_tcount _tvote _f_npb_18 _f_npb_30
countycode county
35 San Benito 28194 22335 0.650355 0.750000
36 San Bernardino 912653 661838 0.568706 0.690732
I think you can divide the numpy arrays created by values, because then the column names are not aligned. Last, create a new DataFrame with the constructor:
arr = county_select_frame.values
df1 = pd.DataFrame(arr[:,5::2] / arr[:,4::2], columns = county_select_frame.columns[5::2])
Sample:
np.random.seed(10)
county_select_frame = pd.DataFrame(np.random.randint(10, size=(10, 10)),
                                   columns=list('abcdefghij'))
print (county_select_frame)
a b c d e f g h i j
0 9 4 0 1 9 0 1 8 9 0
1 8 6 4 3 0 4 6 8 1 8
2 4 1 3 6 5 3 9 6 9 1
3 9 4 2 6 7 8 8 9 2 0
4 6 7 8 1 7 1 4 0 8 5
5 4 7 8 8 2 6 2 8 8 6
6 6 5 6 0 0 6 9 1 8 9
7 1 2 8 9 9 5 0 2 7 3
8 0 4 2 0 3 3 1 2 5 9
9 0 1 0 1 9 0 9 2 1 1
arr = county_select_frame.values
df1 = pd.DataFrame(arr[:,5::2] / arr[:,4::2], columns = county_select_frame.columns[5::2])
print (df1)
f h j
0 0.000000 8.000000 0.000000
1 inf 1.333333 8.000000
2 0.600000 0.666667 0.111111
3 1.142857 1.125000 0.000000
4 0.142857 0.000000 0.625000
5 3.000000 4.000000 0.750000
6 inf 0.111111 1.125000
7 0.555556 inf 0.428571
8 1.000000 2.000000 1.800000
9 0.000000 0.222222 1.000000
How about something like
cols = my_df.columns
for i in range(2, 6):
    print('Creating new col %s' % cols[i])
    my_df['new_{0}'.format(cols[i])] = my_df[cols[i]] / my_df[cols[i-1]]

Python Pandas: select column with the number of unique values greater than 10

In R, we can use sapply to extract columns with the number of unique values greater than 10 by:
X[, sapply(X, function(x) length(unique(x))) >=10]
How can we do the same thing in Python Pandas?
Also, how can we choose columns with missing proportion less than 10% like what we can do in R:
X[, sapply(X, function(x) sum(is.na(x))/length(x) ) < 0.1]
Thanks.
You can use nunique with apply, because nunique works only on a Series:
print (df.loc[:, df.apply(lambda x: x.nunique()) >= 10])
and for the second question, isnull with mean:
print (df.loc[:, df.isnull().mean() < 0.1])
Sample:
df = pd.DataFrame({'A': [1, np.nan, 3],
                   'B': [4, 4, np.nan],
                   'C': [7, 8, 9],
                   'D': [3, 3, 5]})
print (df)
A B C D
0 1.0 4.0 7 3
1 NaN 4.0 8 3
2 3.0 NaN 9 5
print (df.loc[:, df.apply(lambda x: x.nunique()) >= 2])
A C D
0 1.0 7 3
1 NaN 8 3
2 3.0 9 5
print (df.isnull().sum())
A 1
B 1
C 0
D 0
dtype: int64
print (df.isnull().sum() / len(df.index))
A 0.333333
B 0.333333
C 0.000000
D 0.000000
dtype: float64
print (df.isnull().mean())
A 0.333333
B 0.333333
C 0.000000
D 0.000000
dtype: float64
print (df.loc[:, df.isnull().sum() / len(df.index) < 0.1])
C D
0 7 3
1 8 3
2 9 5
Or:
print (df.loc[:, df.isnull().mean() < 0.1])
C D
0 7 3
1 8 3
2 9 5
