Equivalent of 'mutate_at' dplyr function in Python pandas

Thank you in advance for the help.
I am looking to create multiple new columns in a pandas dataframe by dividing a subset of existing columns by another existing column, with the new columns named dynamically with a suffix. Below is dummy code illustrating the general gist of what I want to do, except that the real case involves 25+ columns with various transformations.
R code
library(dplyr)
player = c('John','Peter','Michael')
min = c(20, 23, 35)
points = c(10,12,14)
rebounds = c(5,7,9)
assists = c(4,6,7)
df = data.frame(player,min,points,rebounds,assists)
df = df %>%
  mutate_at(vars(points:assists), .funs = funs(per_min = ./min))
Expected output
player min points rebounds assists points_per_min rebounds_per_min assists_per_min
1 John 20 10 5 4 0.5000000 0.2500000 0.2000000
2 Peter 23 12 7 6 0.5217391 0.3043478 0.2608696
3 Michael 35 14 9 7 0.4000000 0.2571429 0.2000000
I know that I can reproduce the above in pandas as follows:
import pandas as pd
data = pd.DataFrame({'player': ['John', 'Peter', 'Michael'],
                     'min': [20, 23, 35],
                     'points': [10, 12, 14],
                     'rebounds': [5, 7, 9],
                     'assists': [4, 6, 7]})
df = pd.DataFrame(data)
df['points_per_minute'] = df['points']/df['min']
df['rebounds_per_minute'] = df['rebounds']/df['min']
df['assists_per_minute'] = df['assists']/df['min']
df.head()
player min points rebounds assists points_per_minute rebounds_per_minute assists_per_minute
0 John 20 10 5 4 0.500000 0.250000 0.20000
1 Peter 23 12 7 6 0.521739 0.304348 0.26087
2 Michael 35 14 9 7 0.400000 0.257143 0.20000
However, I have to do this for 25+ columns, with different transformations, and explicitly naming every column and operation will become rather cumbersome. Is there any pandas replication of this?

Similar to base R, assign a whole block of columns with basic arithmetic. Base R often translates more directly to NumPy/pandas than dplyr does.
R
cols <- c("points", "rebounds", "assists")
df[paste0(cols, "_per_min")] <- df[cols] / df$min
Python pandas
cols = ["points", "rebounds", "assists"]
df[[col+'_per_min' for col in cols]] = df[cols].div(df['min'], axis='index')

Method 1:
Take the list of columns (if you don't have a list of columns and want to get all columns after the min column, use cols=df.iloc[:,df.columns.get_loc('min')+1:].columns):
cols=['points','rebounds','assists']
Create a copy of the subset of those columns with df.loc[] and add_suffix('_per_minute'), then divide them by the min column:
m=df.loc[:,cols].add_suffix('_per_minute')
df[m.columns]=m.div(df['min'],axis=0)
print(df)
Method 2: concat:
cols=['points','rebounds','assists']
df=pd.concat([df,df.loc[:,cols].add_suffix('_per_minute').div(df['min'],axis=0)],axis=1)
Method 3:
Directly assign them with string formatting using the same logic:
cols=['points','rebounds','assists']
df[[f"{i}_per_minute" for i in cols]]=df.loc[:,cols].div(df['min'],axis=0)
print(df)
player min points rebounds assists points_per_minute \
0 John 20 10 5 4 0.500000
1 Peter 23 12 7 6 0.521739
2 Michael 35 14 9 7 0.400000
rebounds_per_minute assists_per_minute
0 0.250000 0.20000
1 0.304348 0.26087
2 0.257143 0.20000

mutate_at is superseded by mutate and across.
Here is how you can do it in a dplyr way in Python:
>>> from datar.all import c, f, tibble, mutate, across
>>>
>>> player = c('John','Peter','Michael')
>>> min = c(20, 23, 35)
>>> points = c(10,12,14)
>>> rebounds = c(5,7,9)
>>> assists = c(4,6,7)
>>>
>>> df = tibble(player,min,points,rebounds,assists)
>>>
>>> df = df >> mutate(
... # f.min passed to lambda as y
... across(f[f.points:f.assists], {'per_min': lambda x, y: x / y}, f.min)
... )
>>> df
player min points rebounds assists points_per_min rebounds_per_min assists_per_min
<object> <int64> <int64> <int64> <int64> <float64> <float64> <float64>
0 John 20 10 5 4 0.500000 0.250000 0.20000
1 Peter 23 12 7 6 0.521739 0.304348 0.26087
2 Michael 35 14 9 7 0.400000 0.257143 0.20000
I am the author of the datar package. Feel free to submit issues if you have any questions.

With the specific goal of making this feel more like dplyr, I really prefer method-chaining solutions because of their syntactic similarity to piped dplyr code.
This solution uses pandas.DataFrame.assign and dictionary unpacking.
# col=col binds the current column name as a default argument; without it, every
# lambda would capture the same (last) value of col by the time .assign evaluates them.
updated_data = data.assign(**{f"{col}_per_minute": lambda x, col=col: x[col] / x["min"]
                              for col in ["points", "rebounds", "assists"]})
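Since the original question involves 25+ columns with different transformations, one way to generalize any of the approaches above is to drive them from a mapping of suffixes to column lists and functions. This is only a sketch; the column groups and transformations below are made up for illustration:
transforms = {
    # suffix: (columns to transform, function producing the new block)
    '_per_min': (['points', 'rebounds', 'assists'],
                 lambda d, cols: d[cols].div(d['min'], axis='index')),
    '_per_point': (['rebounds', 'assists'],
                   lambda d, cols: d[cols].div(d['points'], axis='index')),
}

for suffix, (cols, func) in transforms.items():
    out = func(df, cols).add_suffix(suffix)   # compute and rename in one pass
    df[out.columns] = out                     # assign the whole block at once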

Related

Preserving id columns in dataframe after applying assign and groupby

I have a data file containing different foetal ultrasound measurements. The measurements are collected at different points during pregnancy, like so:
PregnancyID MotherID gestationalAgeInWeeks abdomCirc
0 0 14 150
0 0 21 200
1 1 20 294
1 1 25 315
1 1 30 350
2 2 8 170
2 2 9 180
2 2 18 NaN
Following this answer to a previous question I had asked, I used this code to summarise the ultrasound measurements using the maximum measurement recorded in a single trimester (13 weeks):
(df.assign(tm = (df['gestationalAgeInWeeks'] + 13 - 1) // 13)
   .drop(columns = 'gestationalAgeInWeeks')
   .groupby(['MotherID', 'PregnancyID', 'tm'])
   .agg('max')
   .unstack()
)
This results in the following output:
tm 1 2 3
MotherID PregnancyID
0 0 NaN 200.0 NaN
1 1 NaN 294.0 350.0
2 2 180.0 NaN NaN
However, MotherID and PregnancyID no longer appear as columns in the output of df.info(). Similarly, when I output the dataframe to a csv file, I only get columns 1,2 and 3. The id columns only appear when running df.head() as can be seen in the dataframe above.
I need to preserve the id columns as I want to use them to merge this dataframe with another one using the ids. Therefore, my question is, how do I preserve these id columns as part of my dataframe after running the code above?
Chain that with reset_index:
(df.assign(tm = (df['gestationalAgeInWeeks'] + 13 - 1) // 13)
   # .drop(columns = 'gestationalAgeInWeeks')  # don't need this
   .groupby(['MotherID', 'PregnancyID', 'tm'])['abdomCirc']  # change here
   .max().add_prefix('abdomCirc_')  # here
   .unstack()
   .reset_index()  # and here
)
Or a more friendly version with pivot_table:
(df.assign(tm = (df['gestationalAgeInWeeks'] + 13 - 1) // 13)
   .pivot_table(index=['MotherID', 'PregnancyID'], columns='tm',
                values='abdomCirc', aggfunc='max')
   .add_prefix('abdomCirc_')  # remove this if you don't want the prefix
   .reset_index()
)
Output:
tm MotherID PregnancyID abdomCirc_1 abdomCirc_2 abdomCirc_3
0 abdomCirc_0 abdomCirc_0 NaN 200.0 NaN
1 abdomCirc_1 abdomCirc_1 NaN 315.0 350.0
2 abdomCirc_2 abdomCirc_2 180.0 NaN NaN

Correlation between two dataframes column with matched headers

I have two dataframes read from Excel files which look like the below. The first dataframe has a multi-index header.
I am trying to find the correlation between each column in the dataframe with the corresponding dataframe based on the currency (i.e KRW, THB, USD, INR). At the moment, I am doing a loop to iterate through each column, matching by index and corresponding header before finding the correlation.
for stock_name in index_data.columns.get_level_values(0):
    stock_prices = index_data.xs(stock_name, level=0, axis=1)
    stock_prices = stock_prices.dropna()
    fx = currency_data[stock_prices.columns.get_level_values(1).values[0]]
    fx = fx[fx.index.isin(stock_prices.index)]
    merged_df = pd.merge(stock_prices, fx, left_index=True, right_index=True)
    merged_df[0].corr(merged_df[1])
Is there a more panda-ish way of doing this?
So you wish to find the correlation between the stock price and its related currency. (Or stock price correlation to all currencies?)
import numpy as np
import pandas as pd

# dummy data
date_range = pd.date_range('2019-02-01', '2019-03-01', freq='D')
stock_prices = pd.DataFrame(
    np.random.randint(1, 20, (date_range.shape[0], 4)),
    index=date_range,
    columns=[['BYZ6DH', 'BLZGSL', 'MBT', 'BAP'],
             ['KRW', 'THB', 'USD', 'USD']])
fx = pd.DataFrame(np.random.randint(1, 20, (date_range.shape[0], 3)),
                  index=date_range, columns=['KRW', 'THB', 'USD'])
This is what it looks like; calculating correlations on this data won't mean much since it is random.
>>> print(stock_prices.head())
BYZ6DH BLZGSL MBT BAP
KRW THB USD USD
2019-02-01 15 10 19 19
2019-02-02 5 9 19 5
2019-02-03 19 7 18 10
2019-02-04 1 6 7 18
2019-02-05 11 17 6 7
>>> print(fx.head())
KRW THB USD
2019-02-01 15 11 10
2019-02-02 6 5 3
2019-02-03 13 1 3
2019-02-04 19 8 14
2019-02-05 6 13 2
Use apply to calculate the correlation between columns with the same currency.
def f(x, fx):
    correlation = x.corr(fx[x.name[1]])
    return correlation

correlation = stock_prices.apply(f, args=(fx,), axis=0)
>>> print(correlation)
BYZ6DH KRW -0.247529
BLZGSL THB 0.043084
MBT USD -0.471750
BAP USD 0.314969
dtype: float64
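As an alternative sketch (not from the original answer), the same pairing can be written as a plain comprehension over the column MultiIndex, assuming the same stock_prices and fx frames as above:
correlation = pd.Series({(name, ccy): stock_prices[(name, ccy)].corr(fx[ccy])
                         for name, ccy in stock_prices.columns})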

Pandas: how to identify the values in a column of a dataframe and do some math operations

I want to do operations, such that I produce something like this:
In other words, if the values in Name are in the 'first_list', I want to multiply the 'Values' by two. If they are in the 'second_list', I want to multiply them by 0.5. If they are not in either (for Nick and Nicky), do not do anything.
This is what I have:
first_list = ['John', 'James', 'Julius', 'Alex']
second_list = ['Lilly', 'Alexis', 'Becly']

if df['Name'].isin(first_list).any():
    df['New Values'] = df['Values'] * 2
elif df['Name'].isin(second_list).any():
    df['New Values'] = df['Values'] * 0.5
But it's not doing the multiplication as I want. Instead, it gives me:
Let's use np.where and isin:
import numpy as np

df['New Value'] = np.where(df.Name.isin(first_list),
                           df.Values * 2,
                           np.where(df.Name.isin(second_list),
                                    df.Values * 0.5,
                                    df.Values))
Setup:
df = pd.DataFrame({'Name':['John','Lily','Alexis','Becky','James','Julian','Alex','Nick','Nicky'],'Values':[50,100,30,60,40,20,80,25,46]})
first_list = ['John','James','Julius','Alex']
second_list = ['Lily','Alexis','Becky']
Output:
Name Values New Value
0 John 50 100.0
1 Lily 100 50.0
2 Alexis 30 15.0
3 Becky 60 30.0
4 James 40 80.0
5 Julian 20 20.0
6 Alex 80 160.0
7 Nick 25 25.0
8 Nicky 46 46.0
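If there are more than two lists, np.select tends to scale more cleanly than nested np.where calls. A minimal sketch against the same setup:
conditions = [df.Name.isin(first_list), df.Name.isin(second_list)]
choices = [df.Values * 2, df.Values * 0.5]
df['New Value'] = np.select(conditions, choices, default=df.Values)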

How to calculate rolling mean on a GroupBy object using Pandas?

How to calculate rolling mean on a GroupBy object using Pandas?
My Code:
df = pd.read_csv("example.csv", parse_dates=['ds'])
df = df.set_index('ds')
grouped_df = df.groupby('city')
What grouped_df looks like:
I want to calculate the rolling mean on each of the groups in my GroupBy object using Pandas.
I tried pd.rolling_mean(grouped_df, 3).
Here is the error I get:
AttributeError: 'DataFrameGroupBy' object has no attribute 'dtype'
Edit: Do I maybe iterate over the groups and calculate the rolling mean on each group as I go?
You could try iterating over the groups
In [39]: df = pd.DataFrame({'a': list('aaaaabbbbbaaaccccbbbccc'), "bookings": range(1, 24)})

In [40]: grouped = df.groupby('a')

In [41]: for group_name, group_df in grouped:
    ...:     print(group_name)
    ...:     print(group_df['bookings'].rolling(3).mean())
    ...:
a
0 NaN
1 NaN
2 2.000000
3 3.000000
4 4.000000
10 6.666667
11 9.333333
12 12.000000
dtype: float64
b
5 NaN
6 NaN
7 7.000000
8 8.000000
9 9.000000
17 12.333333
18 15.666667
19 19.000000
dtype: float64
c
13 NaN
14 NaN
15 15
16 16
20 18
21 20
22 22
dtype: float64
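On current pandas versions the same per-group rolling mean can also be computed without an explicit loop; a minimal sketch against the same dummy frame:
rolled = df.groupby('a')['bookings'].rolling(3).mean()
# rolled is indexed by (group key, original row index); drop the group level
# to align the result back to df if needed:
df['bookings_rolling_mean'] = rolled.reset_index(level=0, drop=True)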
You want the dates in the left column and each city as a separate column. One way to do this is to set the index on date and city and then unstack; this is equivalent to a pivot table. You can then perform your rolling mean in the usual fashion.
df = pd.read_csv("example.csv", parse_dates=['ds'])
df = df.set_index(['ds', 'city']).unstack('city')
rm = df.rolling(3).mean()
I wouldn't recommend using a function, as the data for a given city can simply be selected as follows (: selects all rows):
df.loc[:, city]

groupby weighted average and sum in pandas dataframe

I have a dataframe:
Out[78]:
contract month year buys adjusted_lots price
0 W Z 5 Sell -5 554.85
1 C Z 5 Sell -3 424.50
2 C Z 5 Sell -2 424.00
3 C Z 5 Sell -2 423.75
4 C Z 5 Sell -3 423.50
5 C Z 5 Sell -2 425.50
6 C Z 5 Sell -3 425.25
7 C Z 5 Sell -2 426.00
8 C Z 5 Sell -2 426.75
9 CC U 5 Buy 5 3328.00
10 SB V 5 Buy 5 11.65
11 SB V 5 Buy 5 11.64
12 SB V 5 Buy 2 11.60
I need the sum of adjusted_lots and the weighted average of price (weighted by adjusted_lots), grouped by all the other columns, i.e. grouped by (contract, month, year and buys).
A similar solution in R was achieved with the following dplyr code; however, I am unable to do the same in pandas.
> newdf = df %>%
select ( contract , month , year , buys , adjusted_lots , price ) %>%
group_by( contract , month , year , buys) %>%
summarise(qty = sum( adjusted_lots) , avgpx = weighted.mean(x = price , w = adjusted_lots) , comdty = "Comdty" )
> newdf
Source: local data frame [4 x 6]
contract month year comdty qty avgpx
1 C Z 5 Comdty -19 424.8289
2 CC U 5 Comdty 5 3328.0000
3 SB V 5 Comdty 12 11.6375
4 W Z 5 Comdty -5 554.8500
Is the same possible with groupby or any other solution?
EDIT: updated the aggregation so it works with recent versions of pandas
To pass multiple functions to a groupby object, you pass keyword arguments whose values are tuples pairing the source column with the aggregation function to apply:
import numpy as np

# Define a lambda function to compute the weighted mean:
wm = lambda x: np.average(x, weights=df.loc[x.index, "adjusted_lots"])

# Passing a dict of dicts to rename columns is deprecated since pandas 0.20:
# f = {'adjusted_lots': ['sum'], 'price': {'weighted_mean': wm}}
# df.groupby(["contract", "month", "year", "buys"]).agg(f)

# Groupby and aggregate with named aggregation [1]:
df.groupby(["contract", "month", "year", "buys"]).agg(adjusted_lots=("adjusted_lots", "sum"),
                                                      price_weighted_mean=("price", wm))
adjusted_lots price_weighted_mean
contract month year buys
C Z 5 Sell -19 424.828947
CC U 5 Buy 5 3328.000000
SB V 5 Buy 12 11.637500
W Z 5 Sell -5 554.850000
You can see more here:
http://pandas.pydata.org/pandas-docs/stable/groupby.html#applying-multiple-functions-at-once
and in a similar question here:
Apply multiple functions to multiple groupby columns
[1] : https://pandas.pydata.org/pandas-docs/stable/whatsnew/v0.25.0.html#groupby-aggregation-with-relabeling
Computing a weighted average via groupby(...).apply(...) can be very slow (roughly 100x slower than the following).
See my answer (and others) on this thread.
def weighted_average(df, data_col, weight_col, by_col):
    df['_data_times_weight'] = df[data_col] * df[weight_col]
    df['_weight_where_notnull'] = df[weight_col] * pd.notnull(df[data_col])
    g = df.groupby(by_col)
    result = g['_data_times_weight'].sum() / g['_weight_where_notnull'].sum()
    del df['_data_times_weight'], df['_weight_where_notnull']
    return result
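A usage sketch against the question's dataframe (column names taken from the question; the result is a Series indexed by the group keys):
avgpx = weighted_average(df, 'price', 'adjusted_lots',
                         ['contract', 'month', 'year', 'buys'])
qty = df.groupby(['contract', 'month', 'year', 'buys'])['adjusted_lots'].sum()
newdf = pd.concat({'qty': qty, 'avgpx': avgpx}, axis=1).reset_index()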
Wouldn't it be a lot simpler to do this (see the sketch after this list):
1. Multiply adjusted_lots * price into a new column "X".
2. Use groupby().sum() on columns "X" and "adjusted_lots" to get the grouped dataframe df_grouped.
3. Compute the weighted average on df_grouped as df_grouped['X'] / df_grouped['adjusted_lots'].
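A minimal sketch of those three steps, assuming the question's column names:
df['X'] = df['adjusted_lots'] * df['price']
df_grouped = df.groupby(['contract', 'month', 'year', 'buys'])[['X', 'adjusted_lots']].sum()
df_grouped['avgpx'] = df_grouped['X'] / df_grouped['adjusted_lots']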
The solution that uses a dict of aggregation functions is deprecated in recent versions of pandas (0.22 at the time of writing) and raises:
FutureWarning: using a dict with renaming is deprecated and will be removed in a future version
  return super(DataFrameGroupBy, self).aggregate(arg, *args, **kwargs)
Use a groupby apply and return a Series to rename columns as discussed in:
Rename result columns from Pandas aggregation ("FutureWarning: using a dict with renaming is deprecated")
def my_agg(x):
    names = {'weighted_ave_price': (x['adjusted_lots'] * x['price']).sum() / x['adjusted_lots'].sum()}
    return pd.Series(names, index=['weighted_ave_price'])
produces the same result:
>df.groupby(["contract", "month", "year", "buys"]).apply(my_agg)
weighted_ave_price
contract month year buys
C Z 5 Sell 424.828947
CC U 5 Buy 3328.000000
SB V 5 Buy 11.637500
W Z 5 Sell 554.850000
With datar, you don't have to learn pandas APIs to transition your R code:
>>> from datar.all import f, tibble, c, rep, select, summarise, sum, weighted_mean, group_by
>>> df = tibble(
... contract=c('W', rep('C', 8), 'CC', rep('SB', 3)),
... month=c(rep('Z', 9), 'U', rep('V', 3)),
... year=5,
... buys=c(rep('Sell', 9), rep('Buy', 4)),
... adjusted_lots=[-5, -3, -2, -2, -3, -2, -3, -2, -2, 5, 5, 5, 2],
...     price=[554.85, 424.50, 424.00, 423.75, 423.50, 425.50, 425.25, 426.00, 426.75, 3328.00, 11.65, 11.64, 11.60]
... )
>>> df
contract month year buys adjusted_lots price
0 W Z 5 Sell -5 554.85
1 C Z 5 Sell -3 424.50
2 C Z 5 Sell -2 424.00
3 C Z 5 Sell -2 423.75
4 C Z 5 Sell -3 423.50
5 C Z 5 Sell -2 425.50
6 C Z 5 Sell -3 425.25
7 C Z 5 Sell -2 426.00
8 C Z 5 Sell -2 426.75
9 CC U 5 Buy 5 3328.00
10 SB V 5 Buy 5 11.65
11 SB V 5 Buy 5 11.64
12 SB V 5 Buy 2 11.60
>>> newdf = df >> \
... select(f.contract, f.month, f.year, f.buys, f.adjusted_lots, f.price) >> \
... group_by(f.contract, f.month, f.year, f.buys) >> \
... summarise(
... qty = sum(f.adjusted_lots),
... avgpx = weighted_mean(x = f.price , w = f.adjusted_lots),
... comdty = "Comdty"
... )
[2021-05-24 13:11:03][datar][ INFO] `summarise()` has grouped output by ['contract', 'month', 'year'] (override with `_groups` argument)
>>>
>>> newdf
contract month year buys qty avgpx comdty
0 C Z 5 Sell -19 424.828947 Comdty
1 CC U 5 Buy 5 3328.000000 Comdty
2 SB V 5 Buy 12 11.637500 Comdty
3 W Z 5 Sell -5 554.850000 Comdty
[Groups: ['contract', 'month', 'year'] (n=4)]
I am the author of the package. Feel free to submit issues if you have any questions.
ErnestScribbler's answer is much faster than the accepted solution. Here is a multivariate analogue:
def weighted_average(df, data_col, weight_col, by_col):
    '''Now data_col can be a list of variables.'''
    df_data = df[data_col].multiply(df[weight_col], axis='index')
    df_weight = pd.notnull(df[data_col]).multiply(df[weight_col], axis='index')
    df_data[by_col] = df[by_col]
    df_weight[by_col] = df[by_col]
    result = df_data.groupby(by_col).sum() / df_weight.groupby(by_col).sum()
    return result
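A usage sketch with the question's data, passing a list of value columns (here only ['price'], but several columns could be averaged at once):
result = weighted_average(df, ['price'], 'adjusted_lots',
                          ['contract', 'month', 'year', 'buys'])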
I came across this thread when confronted with a similar problem. In my case, I wanted to generate a weighted quarterback rating in case more than one quarterback attempted a pass in a given NFL game.
I may change the code if I start running into significant performance issues as I scale. For now, I preferred squeezing my solution into the .agg function alongside other transforms. Happy to see if someone has a simpler solution to achieve the same end. Ultimately, I employed a closure pattern.
The magic of the closure approach, if this is an unfamiliar pattern to a future reader, is that I can still return a simple function to pandas' .agg() method, but I get to do so with some additional information preconfigured from the top-level factory function.
def weighted_mean_factory(*args, **kwargs):
    weights = kwargs.get('w').copy()

    def weighted_mean(x):
        x_mask = ~np.isnan(x)
        w = weights.loc[x.index]
        if all(v is False for v in x_mask):
            raise ValueError('there are no non-missing x variable values')
        return np.average(x[x_mask], weights=w[x_mask])

    return weighted_mean

res_df = df.groupby(['game_id', 'team'])\
           .agg(pass_player_cnt=('attempts', count_is_not_zero),
                completions=('completions', 'sum'),
                attempts=('attempts', 'sum'),
                pass_yds=('pass_yards', 'sum'),
                pass_tds=('pass_tds', 'sum'),
                pass_int=('pass_int', 'sum'),
                sack_taken=('sacks_taken', 'sum'),
                sack_yds_loss=('sack_yds_loss', 'sum'),
                longest_completion=('longest_completion', 'max'),
                qbr_w_avg=('qb_rating', weighted_mean_factory(x='qb_rating', w=df['attempts'])))
Some basic benchmarking stats on a DataFrame with the shape (5436, 31) are below and are not cause for concern on my end in terms of performance at this stage:
149 ms ± 4.75 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
This combines the original approach by jrjc with the closure approach by MB. It has the advantage of being able to reuse the closure function.
import numpy as np
import pandas as pd

def group_weighted_mean_factory(df: pd.DataFrame, weight_col_name: str):
    # Ref: https://stackoverflow.com/a/69787938/
    def group_weighted_mean(x):
        try:
            return np.average(x, weights=df.loc[x.index, weight_col_name])
        except ZeroDivisionError:
            return np.average(x)
    return group_weighted_mean

df = ...  # Define
group_weighted_mean = group_weighted_mean_factory(df, "adjusted_lots")
g = df.groupby(...)  # Define
agg_df = g.agg({'price': group_weighted_mean})
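For the question's dataframe, a plausible way to fill in the two placeholders (assuming the same column names as above) would be:
group_weighted_mean = group_weighted_mean_factory(df, "adjusted_lots")
g = df.groupby(["contract", "month", "year", "buys"])
agg_df = g.agg(qty=("adjusted_lots", "sum"),
               avgpx=("price", group_weighted_mean))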
