Pandas groupby then assign - python

I have a dataframe in long format with columns: date, ticker, mcap, rank_mcap. The mcap column is the market cap and measures how large a certain stock is, and rank_mcap is simply the ranked version of it (where 1 is the largest market cap).
I want to create a top-10 market-cap-weighted index (e.g. an S&P 10). In R I do this:
df %>%
  filter(day(date) == 1, rank_mcap < 11) %>%
  group_by(date) %>%
  mutate(weight = mcap / sum(mcap)) %>%
  ungroup()
What do I do in pandas? I get the following error
AttributeError: Cannot access callable attribute 'assign' of 'DataFrameGroupBy' objects, try using the 'apply' method
when I try a similar approach to the R method, namely this in Python:
df.\
query('included == True & date.dt.day == 1'). \
groupby('date').\
assign(w=df.mcap / df.mcap.sum())
I studied http://pandas.pydata.org/pandas-docs/stable/comparison_with_r.html and did not come to a conclusion.

How to achieve R's mutate in pandas:
df.query('included == True & date.dt.day == 1')\
  .assign(weight=lambda x: x.groupby('date', group_keys=False)
          .apply(lambda y: y.mcap / y.mcap.sum()))
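A more direct pandas route is groupby().transform('sum'), which broadcasts each group's total back onto the original rows so the division aligns by index. A minimal sketch, assuming the columns from the question (date as a datetime, mcap, rank_mcap):
top10 = df[(df['date'].dt.day == 1) & (df['rank_mcap'] < 11)].copy()
# transform('sum') returns a Series aligned with top10's index,
# holding each row's per-date total market cap
top10['weight'] = top10['mcap'] / top10.groupby('date')['mcap'].transform('sum')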

You can do it in the same way as you did in R using datar:
from datar.all import f, filter, group_by, ungroup, mutate, sum

df >> \
  filter(f.date.day == 1, f.rank_mcap < 11) >> \
  group_by(f.date) >> \
  mutate(weight=f.mcap / sum(f.mcap)) >> \
  ungroup()
Disclaimer: I am the author of the datar package.

Related

I want to get grouped counts and percentages using pandas

The table should show the count and percentage of cycle rides, grouped by membership type (casual, member).
What I have done in R:
big_frame %>%
  group_by(member_casual) %>%
  summarise(count = length(ride_id),
            '%' = round((length(ride_id) / nrow(big_frame)) * 100, digits = 2))
The best I've come up with in pandas, but I feel like there should be a better way:
member_casual_count = (
    big_frame
    .filter(['member_casual'])
    .value_counts(normalize=True).mul(100).round(2)
    .reset_index(name='percentage')
)
member_casual_count['count'] = (
    big_frame
    .filter(['member_casual'])
    .value_counts()
    .tolist()
)
member_casual_count
Thank you in advance
In R, you should be doing something like this:
big_frame %>%
count(member_casual) %>%
mutate(perc = n/sum(n))
In python, you can achieve the same like this:
(
    big_frame
    .groupby("member_casual")
    .size()
    .to_frame('n')
    .assign(perc=lambda df: df.n / df.n.sum())
)
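If you want the count and a rounded percentage in one chain, value_counts can carry both; a small sketch, assuming the same big_frame:
member_casual_count = (
    big_frame['member_casual']
    .value_counts()
    .to_frame('count')
    .assign(percentage=lambda d: (d['count'] / d['count'].sum() * 100).round(2))
)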

Pipe or sequence of functions in python pandas, or filter then summarize (as in dplyr)

To contextualize: I'm a heavy R user, currently switching to Python (with pandas). Let's say I have this data frame:
import pandas as pd

data = {'participant': ['p1', 'p1', 'p2', 'p3'],
        'metadata': ['congruent_1', 'congruent_2', 'incongruent_1', 'incongruent_2'],
        'reaction': [22000, 25000, 27000, 35000]}
df_s1 = pd.DataFrame(data, columns=['participant', 'metadata', 'reaction'])
# DataFrame.append was removed in recent pandas; pd.concat repeats the frame the same way
df_s1 = pd.concat([df_s1] * 16, ignore_index=True)
df_s1
and I want to reproduce what I can easily do in R (pipe functions), by:
df_s1[(df_s1.metadata == "congruent_1") | (df_s1.metadata == "incongruent_1")].df_s1["reaction"].mean()
This is not possible. I can only succeed when I split this code into parts/variables:
x = df_s1[(df_s1.metadata == "congruent_1") | (df_s1.metadata == "incongruent_1")]
x = x["reaction"].mean()
x
In the dplyr way, I'd go with:
df_s1 %>%
  filter(metadata == "congruent_1" | metadata == "incongruent_1") %>%
  summarise(mean(reaction))
Note: I would highly appreciate concise references to a site that could help me translate my R code to Python. Plenty of literature is available, but in mixed formats and flexible styles.
Thanks
We have .loc here:
df_s1.loc[(df_s1.metadata == "congruent_1") | (df_s1.metadata == "incongruent_1"), 'reaction'].mean()
Out[117]: 24500.0
Change to isin, as Quang mentioned, to reduce the amount of code.
In base R:
mean(df_s1$reaction[df_s1$metadata %in% c('congruent_1', 'incongruent_1')])
Do you mean:
df_s1.loc[(df_s1.metadata == "congruent_1") | (df_s1.metadata == "incongruent_1"), "reaction"].mean()
Or simpler with isin:
df_s1.loc[df_s1.metadata.isin(["congruent_1", "incongruent_1"]), "reaction"].mean()
Out:
24500.0
In addition to the other suggested solutions:
df_s1.query('metadata==["congruent_1","incongruent_1"]').agg({"reaction": "mean"})
reaction 24500.0
dtype: float64
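More generally, pandas method chaining is the closest analogue to dplyr pipes: wrap the chain in parentheses and each method feeds the next. A sketch using the df_s1 above:
result = (
    df_s1
    .loc[lambda d: d['metadata'].isin(['congruent_1', 'incongruent_1']), 'reaction']
    .mean()
)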
With datar (I am the author) in Python, it is easy for you to port your code from R:
from datar.all import *

data = tibble(
    participant=['p1', 'p1', 'p2', 'p3'],
    metadata=['congruent_1', 'congruent_2', 'incongruent_1', 'incongruent_2'],
    reaction=[22000, 25000, 27000, 35000]
)
df_s1 = data >> uncount(15)
df_s1 = df_s1 >> \
    filter((f.metadata == "congruent_1") | (f.metadata == "incongruent_1")) >> \
    group_by(f.metadata) >> \
    summarise(reaction_mean=mean(f.reaction))
print(df_s1)
Output:
metadata reaction_mean
0 congruent_1 22000.0
1 incongruent_1 27000.0

How to write the Python/Pandas equivalent of the following R code?

For a project, I am attempting to convert the following R code to Python, but I am struggling to write equivalent code for the summarize and mutate commands used in R. The code:
users <- users %>%
  mutate(coup_start = ifelse(first_coup > DAY, "no", "yes")) %>%
  group_by(household_key, WEEK_NO, coup_start) %>%
  summarize(weekly_spend = sum(SALES_VALUE),
            dummy = 1)  # adding new column dummy
users_before <- filter(users, coup_start == "no")
users_after <- filter(users, coup_start == "yes")
users_before <- users_before %>%
  group_by(household_key) %>%
  mutate(cum_dummy = cumsum(dummy),
         trip = cum_dummy - max(cum_dummy)) %>%
  select(-dummy, -cum_dummy)
users_after <- users_after %>%
  group_by(household_key) %>%
  mutate(trip = cumsum(dummy) - 1) %>%
  select(-dummy)
I tried the following:
users = transaction_data.merge(coupon_users,on='household_key')
users['coup_start']= np.where((users['first_coup'] > users['DAY_x']), 1, 0)
users['dummy'] = 1
users_before = users[users['coup_start']==0]
users_after = users[users['coup_start']==1]
users_before['cum_dummy'] = users_before.groupby(['household_key'])['dummy'].cumsum()
users_before['trip'] = users_before.groupby(['household_key'])['cum_dummy'].transform(lambda x: x - x.max())
users_after['trip'] = users_after.groupby(['household_key'])['dummy'].transform(lambda x: cumsum(x) - 1)
But I'm encountering multiple issues. The transform(lambda x: cumsum(x) - 1) call throws an error, and the two groupby/transform attempts before it show the following warnings:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
"""Entry point for launching an IPython kernel.
I also feel that I did not insert the dummy = 1 correctly in the first place. How can I convert the mutate/summarize functions from R to Python?
Edit
I have attempted to use the apply function to perform the cumsum operation:
def thisop(x): return cumsum(x) - 1
users_after['trip'] = users_after.groupby(['household_key'])['dummy'].apply(thisop)
The error NameError: name 'cumsum' is not defined still persists.
You've changed some variable and value names going from R to Python (e.g. DAY to DAY_x).
The following code should work, taking the variables/values from your R code:
users = (
    users.assign(coup_start=np.where(users.first_coup > users.DAY, 'no', 'yes'))
    .groupby(['household_key', 'WEEK_NO', 'coup_start'])
    .agg(weekly_spend=('SALES_VALUE', 'sum'))
    .assign(dummy=1)
    .reset_index()
)
users_before = users.query('coup_start == "no"')
users_after = users.query('coup_start == "yes"')
users_before = (
    users_before.assign(
        trip=users_before.groupby('household_key').dummy
        .transform(lambda x: x.cumsum() - x.cumsum().max()))
    .drop(columns='dummy')
)
users_after = (
    users_after.assign(
        trip=users_after.groupby('household_key').dummy
        .transform(lambda x: x.cumsum() - 1))
    .drop(columns='dummy')
)
How about using the same syntax in Python:
from datar.all import f, mutate, if_else, summarize, filter, group_by, select, sum, cumsum, max

users = users >> \
    mutate(coup_start=if_else(f.first_coup > f.DAY, "no", "yes")) >> \
    group_by(f.household_key, f.WEEK_NO, f.coup_start) >> \
    summarize(weekly_spend=sum(f.SALES_VALUE),
              dummy=1)  # adding new column dummy
users_before = filter(users, f.coup_start == "no")
users_after = filter(users, f.coup_start == "yes")
users_before = users_before >> \
    group_by(f.household_key) >> \
    mutate(cum_dummy=cumsum(f.dummy),
           trip=f.cum_dummy - max(f.cum_dummy)) >> \
    select(~f.dummy, ~f.cum_dummy)
users_after = users_after >> \
    group_by(f.household_key) >> \
    mutate(trip=cumsum(f.dummy) - 1) >> \
    select(~f.dummy)
I am the author of the datar package. Feel free to submit issues if you have any questions.

numpy under a groupby not working

I have the following code:
import pandas as pd
import numpy as np
df = pd.read_csv('C:/test.csv')
df.drop(columns=['SecurityID'], inplace=True)
Time = 1
trade_filter_size = 9
groupbytime = (str(Time) + "min")
df['dateTime_s'] = df['dateTime'].astype('datetime64[s]')
df['dateTime'] = pd.to_datetime(df['dateTime'])
df[str(Time)+"min"] = df['dateTime'].dt.floor(str(Time)+"min")
df['tradeBid'] = np.where(((df['tradePrice'] <= df['bid1']) & (df['isTrade']==1)), df['tradeVolume'], 0)
groups = df[df['isTrade'] == 1].groupby(groupbytime)
print("groups",groups.dtypes)
#THIS IS WORKING
df_grouped = (groups.agg({
'tradeBid': [('sum', np.sum),('downticks_number', lambda x: (x > 0).sum())],
}))
# creating a new, filtered data frame
df2 = pd.DataFrame( df.loc[(df['isTrade'] == 1) & (df['tradeVolume']>=trade_filter_size)])
# recalculating all the bid/ask volume based on the filter size
df2['tradeBid'] = np.where(((df2['tradePrice'] <= df2['bid1']) & (df2['isTrade']==1)), df2['tradeVolume'], 0)
df2grouped = (df2.agg({
# here is the problem!!! NOT WORKING
'tradeBid': [('sum', np.sum), lambda x: (x > 0).sum()],
}))
The same aggregation is used both times: 'tradeBid': [('sum', np.sum), ('downticks_number', lambda x: (x > 0).sum())]. The first time it works fine, but running it on the filtered data in the new df causes an error:
ValueError: downticks_number is an unknown string function
When I use this code instead to work around it:
'tradeBid': [('sum', np.sum), lambda x: (x > 0).sum()],
I get this error:
ValueError: cannot combine transform and aggregation operations
Any idea why I get different results for the same usage of code?
Since there were two conditions to match for the second groupby, I solved this by moving the filter into the df: I created a new column that combines both filters and used it to subset the frame before grouping. After that, the groupby worked without problems; the order of operations was the problem.
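A sketch of that fix, assuming the columns from the question (isTrade, tradeVolume, tradeBid) and the time bucket in groupbytime; the root cause is that df2.agg was called on the plain DataFrame, so the tuple-style aggregations only make sense after a groupby:
# combine both row filters into a single boolean column
df['keep'] = (df['isTrade'] == 1) & (df['tradeVolume'] >= trade_filter_size)
df2grouped = (
    df[df['keep']]
    .groupby(groupbytime)
    .agg(bid_sum=('tradeBid', 'sum'),
         downticks_number=('tradeBid', lambda x: (x > 0).sum()))
)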

python pandas groupby complex calculations

I have a dataframe df with columns a, b, c, d, and I want to group the data by a and make some calculations. I will provide the code for these calculations in R. My main question is: how do I do the same in pandas?
library(dplyr)
df %>%
  group_by(a) %>%
  summarise(mean_b = mean(b),
            qt95 = quantile(b, .95),
            diff_b_c = max(b - c),
            std_b_d = sd(b) - sd(d)) %>%
  ungroup()
This example is synthetic; I just want to understand pandas syntax.
I believe you need a custom function with GroupBy.apply:
def f(x):
    mean_b = x.b.mean()
    qt95 = x.b.quantile(.95)
    diff_b_c = (x.b - x.c).max()
    std_b_d = x.b.std() - x.d.std()
    cols = ['mean_b', 'qt95', 'diff_b_c', 'std_b_d']
    return pd.Series([mean_b, qt95, diff_b_c, std_b_d], index=cols)

df1 = df.groupby('a').apply(f)
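If you'd rather avoid apply, named aggregation covers the per-column pieces, and the cross-column max(b - c) only needs the difference precomputed; a sketch of that alternative:
df1 = (
    df.assign(b_minus_c=df['b'] - df['c'])
      .groupby('a')
      .agg(mean_b=('b', 'mean'),
           qt95=('b', lambda x: x.quantile(.95)),
           diff_b_c=('b_minus_c', 'max'))
)
# sd(b) - sd(d) mixes two columns, so build it from two grouped stds
df1['std_b_d'] = df.groupby('a')['b'].std() - df.groupby('a')['d'].std()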
