How to write the Python/Pandas equivalent of the following R code? - python

For a project, I am attempting to convert the following R code to Python, but I am struggling to write equivalent code for the summarize and mutate commands used in R. The code:
users <- users %>%
  mutate(coup_start = ifelse(first_coup > DAY, "no", "yes")) %>%
  group_by(household_key, WEEK_NO, coup_start) %>%
  summarize(weekly_spend = sum(SALES_VALUE),
            dummy = 1) # adding new column dummy
users_before <- filter(users, coup_start == "no")
users_after <- filter(users, coup_start == "yes")
users_before <- users_before %>%
  group_by(household_key) %>%
  mutate(cum_dummy = cumsum(dummy),
         trip = cum_dummy - max(cum_dummy)) %>%
  select(-dummy, -cum_dummy)
users_after <- users_after %>%
  group_by(household_key) %>%
  mutate(trip = cumsum(dummy) - 1) %>%
  select(-dummy)
I tried the following:
users = transaction_data.merge(coupon_users,on='household_key')
users['coup_start']= np.where((users['first_coup'] > users['DAY_x']), 1, 0)
users['dummy'] = 1
users_before = users[users['coup_start']==0]
users_after = users[users['coup_start']==1]
users_before['cum_dummy'] = users_before.groupby(['household_key'])['dummy'].cumsum()
users_before['trip'] = users_before.groupby(['household_key'])['cum_dummy'].transform(lambda x: x - x.max())
users_after['trip'] = users_after.groupby(['household_key'])['dummy'].transform(lambda x: cumsum(x) - 1)
But I'm encountering multiple issues: the transform(lambda x: cumsum(x) - 1) call throws an error, and the two groupby/transform assignments before it raise the following warning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
"""Entry point for launching an IPython kernel.
I also feel that I did not add the dummy = 1 column correctly in the first place. How can I convert R's mutate/summarize functions to Python?
Edit
I have attempted to use the apply function to perform the cumsum operation:
def thisop(x): return(cumsum(x)-1 )
users_after['trip']=users_after.groupby(['household_key'])['dummy'].apply(thisop)
The error NameError: name 'cumsum' is not defined still persists.

You've changed some variable and value names going from R to Python (e.g. DAY to DAY_x).
The following code should work, taking the variables/values from your R code:
users = (
    users.assign(coup_start=np.where(users.first_coup > users.DAY, 'no', 'yes'))
         .groupby(['household_key', 'WEEK_NO', 'coup_start'])
         .agg(weekly_spend=('SALES_VALUE', 'sum'))
         .assign(dummy=1)
         .reset_index()
)
users_before = users.query('coup_start=="no"')
users_after = users.query('coup_start=="yes"')
users_before = (
    users_before.assign(
        trip=users_before.groupby('household_key').dummy
                         .transform(lambda x: x.cumsum() - x.cumsum().max()))
                .drop(columns='dummy')
)
users_after = (
    users_after.assign(
        trip=users_after.groupby('household_key').dummy
                        .transform(lambda x: x.cumsum() - 1))
               .drop(columns='dummy')
)
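As a sanity check, the approach above can be run end-to-end on a tiny made-up frame. The column names come from the question; the household/day/sales values below are invented for illustration:

```python
import numpy as np
import pandas as pd

# one household over four weeks; the coupon starts on day 15 (invented values)
users = pd.DataFrame({
    'household_key': [1, 1, 1, 1],
    'WEEK_NO': [1, 2, 3, 4],
    'DAY': [5, 12, 19, 26],
    'first_coup': [15, 15, 15, 15],
    'SALES_VALUE': [10.0, 20.0, 30.0, 40.0],
})

users = (
    users.assign(coup_start=np.where(users.first_coup > users.DAY, 'no', 'yes'))
         .groupby(['household_key', 'WEEK_NO', 'coup_start'])
         .agg(weekly_spend=('SALES_VALUE', 'sum'))
         .assign(dummy=1)
         .reset_index()
)

users_before = users.query('coup_start == "no"')
users_after = users.query('coup_start == "yes"')

users_before = (
    users_before.assign(
        trip=users_before.groupby('household_key').dummy
                         .transform(lambda x: x.cumsum() - x.cumsum().max()))
                .drop(columns='dummy')
)
users_after = (
    users_after.assign(
        trip=users_after.groupby('household_key').dummy
                        .transform(lambda x: x.cumsum() - 1))
               .drop(columns='dummy')
)
print(users_before.trip.tolist(), users_after.trip.tolist())  # [-1, 0] [0, 1]
```

The pre-coupon trips count up to 0 and the post-coupon trips count up from 0, matching the R cum_dummy logic.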

How about using the same syntax in Python:
from datar.all import f, mutate, if_else, summarize, filter, group_by, select, sum, cumsum, max
users = users >> \
    mutate(coup_start=if_else(f.first_coup > f.DAY, "no", "yes")) >> \
    group_by(f.household_key, f.WEEK_NO, f.coup_start) >> \
    summarize(weekly_spend=sum(f.SALES_VALUE),
              dummy=1)  # adding new column dummy
users_before = filter(users, f.coup_start == "no")
users_after = filter(users, f.coup_start == "yes")
users_before = users_before >> \
    group_by(f.household_key) >> \
    mutate(cum_dummy=cumsum(f.dummy),
           trip=f.cum_dummy - max(f.cum_dummy)) >> \
    select(~f.dummy, ~f.cum_dummy)
users_after = users_after >> \
    group_by(f.household_key) >> \
    mutate(trip=cumsum(f.dummy) - 1) >> \
    select(~f.dummy)
I am the author of the datar package. Feel free to submit issues if you have any questions.


I want to get grouped counts and percentages using pandas

The table shows the count and the percentage of cycle rides, grouped by membership type (casual, member).
What I have done in R:
big_frame %>%
group_by(member_casual) %>%
summarise(count = length(ride_id),
'%' = round((length(ride_id) / nrow(big_frame)) * 100, digit=2))
The best I've come up with in pandas is below, but I feel like there should be a better way:
member_casual_count = (
    big_frame
    .filter(['member_casual'])
    .value_counts(normalize=True).mul(100).round(2)
    .reset_index(name='percentage')
)
member_casual_count['count'] = (
    big_frame
    .filter(['member_casual'])
    .value_counts()
    .tolist()
)
member_casual_count
Thank you in advance
In R, you should be doing something like this:
big_frame %>%
count(member_casual) %>%
mutate(perc = n/sum(n))
In Python, you can achieve the same thing like this:
(
big_frame
.groupby("member_casual")
.size()
.to_frame('ct')
.assign(n = lambda df: df.ct/df.ct.sum())
)
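To make the snippet above concrete, here it is run on a tiny invented big_frame (the ride values are made up; only the two columns from the question are used):

```python
import pandas as pd

# hypothetical stand-in for big_frame: three member rides, one casual ride
big_frame = pd.DataFrame({'member_casual': ['member', 'member', 'member', 'casual'],
                          'ride_id': ['r1', 'r2', 'r3', 'r4']})

out = (
    big_frame
    .groupby("member_casual")
    .size()                                  # count per membership type
    .to_frame('ct')
    .assign(n=lambda df: df.ct / df.ct.sum())  # share of total, like n/sum(n) in R
)
print(out)  # member: ct=3, n=0.75; casual: ct=1, n=0.25
```

Multiply n by 100 and round if you want the percentage formatting from the original attempt.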

Is there a Python equivalent for this R code

I am working in Python and I want to group my data by two columns and, at the same time, add the missing dates from date1 (the date of the event) up to date2 (a date I choose), filling the missing values in the columns I pick with a forward fill.
I tried the code below in R and it works; I want to do the same in Python.
library(data.table)
library(padr)
library(dplyr)
data = fread("path", header = T)
data$ORDERDATE <- as.Date(data$ORDERDATE)
datemax = max(data$ORDERDATE)
data2 = data %>%
group_by(Column1, Column2) %>%
pad(.,group = c('Column1', 'Column2'), end_val = as.Date(datemax), interval = "day",break_above = 100000000000) %>%
tidyr::fill("Column3")
I searched for a Python equivalent of the padr package but couldn't find one.
Thank you for your help.
As requested, here is an example:
dates=['2014-01-14','2014-01-14','2014-01-15','2014-01-19','2014-01-18','2014-01-25','2014-01-28','2014-02-05','2014-02-14']
users=['User1','User1','User2','User1','User2','User1','User2','User1','User2']
products=['product1','product1','product1','product1','product1','product2','product2','product2','product2']
quantities=[5,6,8,10,4,5,2,9,7]
prices=[2,2,5,5,6,6,6,7,7]
data = pd.DataFrame({'date':dates,'user':users,'product':products,'quantity':quantities,'price':prices})
data['date'] = pd.to_datetime(data.date, format='%Y-%m-%d')
data2 = data.groupby(['user','product','date'],as_index=False).mean()
For User1 and product1, for example, I want to insert the missing dates, fill the quantity column with 0, and fill the price column by carrying earlier values forward, over a date range that I choose.
And do the same for the remaining user/product pairs in my data.
The result should look like this:
[1]: https://i.stack.imgur.com/qOOda.png
The R code I used to generate the image is as follows:
library(padr)
library(dplyr)
dates=c('2014-01-14','2014-01-14','2014-01-15','2014-01-19','2014-01-18','2014-01-25','2014-01-28','2014-02-05','2014-02-14')
users=c('User1','User1','User2','User1','User2','User1','User2','User1','User2')
products=c('product1','product1','product1','product1','product1','product2','product2','product2','product2')
quantities=c(5,6,8,10,4,5,2,9,7)
prices=c(2,2,5,5,6,6,6,7,7)
data=data.frame(date=c('2014-01-14','2014-01-14','2014-01-15','2014-01-19','2014-01-18','2014-01-25','2014-01-28','2014-02-05','2014-02-14'),user=c('User1','User1','User2','User1','User2','User1','User2','User1','User2'),product=c('product1','product1','product1','product1','product1','product2','product2','product2','product2'),quantity=c(5,6,8,10,4,5,2,9,7),price=c(2,2,5,5,6,6,6,7,7))
data$date <- as.Date(data$date)
datemax = max(data$date)
data2 = data %>% group_by(user, product) %>% pad(.,group = c('user', 'product'), end_val = as.Date(datemax), interval = "day",break_above = 100000000000)
data3=data2 %>% group_by(user,product,date) %>%
summarize(quantity=sum(quantity),price=mean(price))
data4 = data3 %>% tidyr::fill("price") %>% fill_by_value(quantity, value = 0)
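This question has no pandas answer in the thread, but padr::pad can be approximated with pd.date_range plus reindex per group. Below is a sketch under that assumption, using the example data from the R code above; the ffill/fillna steps stand in for tidyr::fill and fill_by_value:

```python
import pandas as pd

dates = ['2014-01-14','2014-01-14','2014-01-15','2014-01-19','2014-01-18',
         '2014-01-25','2014-01-28','2014-02-05','2014-02-14']
users = ['User1','User1','User2','User1','User2','User1','User2','User1','User2']
products = ['product1','product1','product1','product1','product1',
            'product2','product2','product2','product2']
quantities = [5,6,8,10,4,5,2,9,7]
prices = [2,2,5,5,6,6,6,7,7]
data = pd.DataFrame({'date': pd.to_datetime(dates), 'user': users,
                     'product': products, 'quantity': quantities, 'price': prices})

# summarize duplicate dates per group, like the group_by/summarize step in R
data2 = (data.groupby(['user', 'product', 'date'], as_index=False)
             .agg(quantity=('quantity', 'sum'), price=('price', 'mean')))

datemax = data2['date'].max()

# pad each user/product group out to datemax (like padr::pad), then carry
# price forward (tidyr::fill) and set missing quantity to 0 (fill_by_value)
pieces = []
for (user, product), g in data2.groupby(['user', 'product']):
    full = pd.date_range(g['date'].min(), datemax, freq='D')
    out = (g.set_index('date').reindex(full)
            .rename_axis('date').reset_index())
    out['user'] = user
    out['product'] = product
    out['price'] = out['price'].ffill()
    out['quantity'] = out['quantity'].fillna(0)
    pieces.append(out)
data3 = pd.concat(pieces, ignore_index=True)
```

The explicit loop over groups keeps the padding logic easy to follow; a groupby().apply() version is possible but behaves differently across pandas versions with respect to the grouping columns.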

Translating a Python Pandas line to R:

I am following a blog post here and I am getting a little stuck on one part regarding the translation from Python pandas to R…
In the part of the blog:
Tick Bars
The author has the line:
data_tick_grp = data.reset_index().assign(grpId=lambda row: row.index // num_ticks_per_bar)
I understand that data is the "data frame".
reset_index - not sure what this is.
assign(grpId = …) - creating a new variable grpId.
lambda row: - not sure what this does.
row.index - is this the same as row_number?
// - is this the same as floor() in R?
num_ticks_per_bar is calculated as:
total_ticks = len(data)
num_ticks_per_bar = total_ticks / num_time_bars
num_ticks_per_bar = round(num_ticks_per_bar, -3) # round to the nearest thousand
Which I understand as:
ticks <- data %>%
filter(symbol == "XBTUSD") %>%
nrow()
ticks_per_bar <- ticks / 288
ticks_per_bar <- plyr::round_any(ticks_per_bar, 1000)
floor(1:nrow(data) / ticks_per_bar)
Can somebody help me translate the Python pandas line into R language?
Usually, Pandas best translates to base R:
reset_index is the same as resetting row.names for sequential numbering: data.frame(..., row.names = NULL)
assign(grpId = …) is the same as assigning a column in place, such as with transform, within, or dplyr's mutate
lambda row is required inside assign to reference the data frame, here aliased as row
row.index is the same as the row number (remember Python is 0-indexed, unlike R)
// is integer division, which in R can be achieved with %/%, or with as.integer or floor after ordinary division
Altogether, consider below adjustment to translate Pandas line:
data_tick_grp = (data.reset_index()
.assign(grpId=lambda row: row.index // num_ticks_per_bar)
)
To R:
data_tick_grp <- transform(data.frame(data, row.names = NULL),
grpId = floor(0:(nrow(data)-1) / num_ticks_per_bar))
Or in tidy format:
data_tick_grp <- data %>%
data.frame(row.names = NULL) %>%
mutate(grpId = floor(0:(nrow(data)-1) / num_ticks_per_bar))
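To double-check the translation, the original Pandas line can be run on a toy frame. The tick data below is invented; only the row count matters for the grouping logic:

```python
import pandas as pd

data = pd.DataFrame({'price': range(10)})  # stand-in for the blog's tick data
num_ticks_per_bar = 4

# row.index is the sequential 0-based row number after reset_index,
# and // num_ticks_per_bar buckets every 4 consecutive rows into one bar
data_tick_grp = (data.reset_index()
                     .assign(grpId=lambda row: row.index // num_ticks_per_bar))
print(data_tick_grp['grpId'].tolist())  # [0, 0, 0, 0, 1, 1, 1, 1, 2, 2]
```

The R translations above should produce the same grpId column for the same input.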

Pandas groupby then assign

I have a dataframe in long format with columns: date, ticker, mcap, rank_mcap. The mcap column is the market cap and measures how large a certain stock is, and rank_mcap is simply the ranked version of it (where 1 is the largest market cap).
I want to create a top-10 market-cap-weighted asset (e.g. an S&P10). In R I do this:
df %>%
filter(day(date) == 1, rank_mcap < 11) %>%
group_by(date) %>%
mutate(weight = mcap / sum(mcap)) %>%
ungroup() %>%
What do I do in pandas? I get the following error
AttributeError: Cannot access callable attribute 'assign' of 'DataFrameGroupBy' objects, try using the 'apply' method
when I try a similar approach to the R method, namely:
df.\
query('included == True & date.dt.day == 1'). \
groupby('date').\
assign(w=df.mcap / df.mcap.sum())
I studied http://pandas.pydata.org/pandas-docs/stable/comparison_with_r.html and did not come to a conclusion.
Here is how pandas can achieve R's mutate:
df.query('included == True & date.dt.day == 1').\
    assign(weight=lambda x: x.groupby('date', group_keys=False).
           apply(lambda y: y.mcap / y.mcap.sum()))
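An arguably more idiomatic alternative is groupby(...)['mcap'].transform('sum'), which avoids the nested apply. Here it is on a small invented frame using the question's column names (the market-cap values and the filter on day/rank mirror the R snippet; the question's included column is omitted):

```python
import pandas as pd

# hypothetical sample: two dates, two tickers each
df = pd.DataFrame({
    'date': pd.to_datetime(['2020-01-01', '2020-01-01', '2020-02-01', '2020-02-01']),
    'ticker': ['A', 'B', 'A', 'B'],
    'mcap': [300.0, 100.0, 150.0, 50.0],
    'rank_mcap': [1, 2, 1, 2],
})

# filter like the R filter(), then divide each mcap by its date-group total
top10 = (df[(df['date'].dt.day == 1) & (df['rank_mcap'] < 11)]
           .assign(weight=lambda d: d['mcap'] / d.groupby('date')['mcap'].transform('sum')))
print(top10['weight'].tolist())  # [0.75, 0.25, 0.75, 0.25]
```

transform('sum') broadcasts each group's total back to the original rows, so no index realignment is needed.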
You can do it in the same way as you did in R using datar:
from datar.all import f, filter, group_by, ungroup, mutate, sum
df >> \
filter(f.date.day == 1, f.rank_mcap < 11) >> \
group_by(f.date) >> \
mutate(weight = f.mcap / sum(f.mcap)) >> \
ungroup()
Disclaimer: I am the author of the datar package.

python pandas groupby complex calculations

I have a dataframe df with columns a, b, c, d, and I want to group the data by a and make some calculations. I will provide the code for these calculations in R. My main question is: how do I do the same in pandas?
library(dplyr)
df %>%
group_by(a) %>%
summarise(mean_b = mean(b),
qt95 = quantile(b, .95),
diff_b_c = max(b-c),
std_b_d = sd(b)-sd(d)) %>%
ungroup()
This example is synthetic; I just want to understand the pandas syntax.
I believe you need a custom function with GroupBy.apply:
def f(x):
    mean_b = x.b.mean()
    qt95 = x.b.quantile(.95)
    diff_b_c = (x.b - x.c).max()
    std_b_d = x.b.std() - x.d.std()
    cols = ['mean_b','qt95','diff_b_c','std_b_d']
    return pd.Series([mean_b, qt95, diff_b_c, std_b_d], index=cols)

df1 = df.groupby('a').apply(f)
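If you prefer named aggregation over apply, the same summaries can be sketched with agg, using a helper column for the cross-column pieces (the sample values below are invented):

```python
import pandas as pd

# hypothetical sample with the question's column names
df = pd.DataFrame({'a': ['x', 'x', 'y', 'y'],
                   'b': [1.0, 3.0, 2.0, 6.0],
                   'c': [0.0, 1.0, 1.0, 2.0],
                   'd': [1.0, 2.0, 3.0, 5.0]})

# named aggregation covers the per-column summaries; max(b - c) needs a
# precomputed helper column, and sd(b) - sd(d) is combined afterwards
df1 = (df.assign(b_minus_c=df.b - df.c)
         .groupby('a')
         .agg(mean_b=('b', 'mean'),
              qt95=('b', lambda s: s.quantile(.95)),
              diff_b_c=('b_minus_c', 'max'),
              std_b=('b', 'std'),
              std_d=('d', 'std'))
         .assign(std_b_d=lambda d: d.std_b - d.std_d)
         .drop(columns=['std_b', 'std_d']))
```

This trades the single custom function for a flatter pipeline; which reads better is a matter of taste.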
